Going #PBHardCore: An Inside Look at Revamping a Metadata Standard (Part 1)

This post was re-blogged from NDSR Boston’s blog “SIPS, DIPS, & Bytes: NDSR Boston’s Digital Preservation Test Kitchen.”

During the year I spent working at the Dance Heritage Coalition, I probably spent about as much time using PBCore to catalog the video material we were digitizing as I spent on the subway during my daily commute. And – as with the good old 4/5 train from Brooklyn – the more time I spent with the standard, the more I started to love … to complain about it. On the other hand (again much like the 4/5 train) if anyone had tried to take it away from me, I would have started kicking and screaming for all I was worth; I knew perfectly well that without it, getting my metadata to where I wanted it to be would have been significantly harder.

PBCore’s certainly not perfect, but it’s the only metadata standard designed around all the particular weird quirks and challenges that surround the cataloging of time-based material. It was initially helmed by the Corporation for Public Broadcasting and targeted at the needs of public television and radio stations, but has since been adopted by a variety of institutions that need a metadata standard that can describe their audiovisual content. One of the key factors in PBCore is the way it embraces the idea of instantiations of intellectual content – different formats and versions of one conceptual entity as it manifests in the real world as a trackable item. That trackable item can be a physical object, a digital video file, or even a thumbnail image, but it always retains its relationship to the core idea of the work. While many metadata standards are built to describe assets at an item level, PBCore grounds each record in a piece of intellectual content, and then pulls all the real-world manifestations of that content together as part of that same record.
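For readers who think better in code, here is a rough sketch of that record shape: one piece of intellectual content, with every real-world manifestation hanging off it. This is only an illustration in Python, not PBCore's actual schema; the class names, field names, and identifiers below are all made up for the example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Instantiation:
    """One trackable manifestation of a work (hypothetical field names)."""
    identifier: str
    form: str  # e.g. "VHS tape", "digital video file", "thumbnail image"


@dataclass
class Asset:
    """The intellectual content that every instantiation points back to."""
    title: str
    instantiations: List[Instantiation] = field(default_factory=list)

    def add(self, identifier: str, form: str) -> Instantiation:
        # Each new item is attached to the work, never recorded on its own.
        inst = Instantiation(identifier, form)
        self.instantiations.append(inst)
        return inst


# One record for the work; three manifestations gathered under it.
solo = Asset(title="Untitled dance solo")
solo.add("tape-001", "VHS tape")
solo.add("file-001", "digital video file")
solo.add("thumb-001", "thumbnail image")

forms = [inst.form for inst in solo.instantiations]
print(forms)
```

An item-level standard would produce three unrelated records here; the point of the sketch is that the record is centered on the work, and the items are reached through it.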

This relationship tracking is an absolute necessity in a broadcast archive where a single program will almost always generate an array of distinct derivatives (broadcast master, edit master, rough cut, etc.) that need to be clearly linked together and centered on the idea of the program, rather than on the item itself. However, in the digital era, as more and more pieces of intellectual content become unmoored from their physical identity, this concept of instantiations is becoming ever-more widely relevant in the creation of all different varieties of archival metadata. As a result, PBCore is becoming more things to more people, which means it has to face some changes in order to catch up to what its user base is demanding. Like, for example, one lone digitizing archivist at the Dance Heritage Coalition, with an esoteric collection of dance solos to digitize, grumbling to herself about the fact that ‘choreographer’ isn’t listed as an option in the controlled vocabulary for the ‘creator’ field.

PBCore saw its last major revision in 2011, but with the launch of the American Archive project, WGBH became the command center for a new round of revisions and updates to PBCore, with the goal of massively improving documentation, creating an RDF ontology, and making the necessary schema and vocabulary changes for the release of the new and improved PBCore 2.1 in March 2015, with the more sweeping revisions for PBCore 3.0 coming down the line. (The 2.1 revisions will be backwards compatible with PBCore as it currently stands, while the 3.0 revisions will go further and require current PBCore users to make changes to their practice.) This meant that when I left the Dance Heritage Coalition to start my residency at WGBH, I also suddenly got the opportunity to leap straight from ‘user’ to ‘PBCore revision committee member’ – or, in other words, to stop grumbling and join the process of making actual changes to the standard. For me, this felt about as exciting and intimidating as someone telling me that I was going to have a chance to sit on a committee to fix some glaring bugs in the New York subway system. (And hey, MTA, if you’re reading this, feel free to call me!)

At the time I jumped on-board, the PBCore committees were just starting to look through the first round of suggestions for bug collection and improvement of the standard. These suggestions were solicited and logged through GitHub’s Issue tracker, with all users of the standard encouraged to participate –

– and OK, I’ll take a sidenote here to talk about how PBCore has taken a couple of unusual and exciting community-focused steps in treating itself like an open source application hosted on GitHub. This allows PBCore to provide a downloadable XML schema blueprint that serves as a guide for developers to make PBCore-based metadata applications, makes the discussions around edits and adaptations open and visible, and provides the wider user community with the opportunity to participate closely in the revisions process. Personally, I find that pretty amazing and very much in keeping with the spirit of public broadcasting, which after all is media created for the public, and for the benefit of a community. (For more info about how/why/when/where to use PBCore, this webinar from a few weeks ago is a very good introduction.)

Anyway, as a user and grumbler myself, I obviously have a built-in investment in putting in complaints and then getting them fixed. Of course, as is always the way of things, all the problems that you think would be so easy to fix as a user start looking a lot more labyrinthine once you have to be the one coming up with a solution for them. A lot of the time, something that looks like a bug might actually have a very good reason for being the way it is, or at least a good reason why it can’t be changed just yet. If I’ve learned one thing so far while on call with the Vocab and Schema teams, it’s that issues are like icebergs – if you think you’ve got a small one to deal with that could just be melted away, you’re probably just seeing the surface.

I’ll give you one example to start off with. One of the issues raised during the suggestion period dealt with the fact that in the current instance of the schema, an individual instantiation of an asset can be described as analog and digital simultaneously. This leads to a lot of potential for confusion – for example, ‘container’ means very different things when you’re talking about a cassette tape versus an h.264 video file – and at first seems like an obvious bug; why not just make the description an either/or and avoid the problem altogether?

But of course it’s not as simple as that. A digital item can also be physical – just consider a CD, DVD, or even Digital Betacam, which is a digital file recorded onto a reel of analog tape. Moreover, even a purely digital file has a physical existence somewhere, and that physical existence does have to be described in some way. (Not to disillusion anyone, but the cloud is not literally on a cloud.)
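To make the either/or problem concrete, here is a small Python sketch of a description that allows a physical format, a digital format, or both at once. It is my own illustration rather than the schema's real structure, and every name and format value in it is an assumption chosen for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CarrierDescription:
    """Hypothetical stand-in for how one instantiation describes its format."""
    label: str
    physical_format: Optional[str] = None
    digital_format: Optional[str] = None

    def kinds(self) -> List[str]:
        # Deliberately not an either/or: both answers can be true at once.
        found = []
        if self.physical_format is not None:
            found.append("physical")
        if self.digital_format is not None:
            found.append("digital")
        return found


cassette = CarrierDescription("audio cassette", physical_format="cassette tape")
video_file = CarrierDescription("access copy", digital_format="H.264 video file")
dvd = CarrierDescription("DVD copy", physical_format="DVD", digital_format="MPEG-2")

for item in (cassette, video_file, dvd):
    print(item.label, item.kinds())
```

Forcing `kinds()` to return exactly one value would make the cassette and the file easy to describe, but it would leave no honest way to describe the DVD.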

A successful standard needs a way to capture all of these complexities without becoming so complicated itself that users are scared away from it. Slowly but surely, PBCore is getting there, and for me, at least, the process of getting there is pretty fascinating. As progress continues, I’ll keep reporting from the front lines — this overview is just the introduction to a very complex process.  Next time, spreadsheets!

 This post was written by NDSR Boston resident Rebecca Fraimow, who is a member of the PBCore Schema Team.
