Webinar Recap: Getting Started with PBCore

This is the third post in a series about the PBCore webinar that the Education Team presented in October. A recording of the webinar can be found here, and we’ll be recapping the event over the next few weeks.  Part one of the series is located here, and part two is located here.

After going over a brief history of PBCore, Sadie moved on to a step-by-step guide to using PBCore to describe AV collections.

The first step to creating a PBCore record is to inventory your AV content,  using at least Identifiers, Titles, and Descriptions.  These are the only requirements to create a valid PBCore XML document.

There are many ways to go about collecting and storing PBCore data. If you use a database, like FileMaker, you can either revise an existing template or create a new template in line with the PBCore data model. This is pretty simple, as a lot of PBCore fields correspond to fields you might already have, like title, format, duration, etc.  You can also use content management systems like Omeka, CollectiveAccess, and Drupal, which already have plug-ins for PBCore that provide easy PBCore outputs, such as PBCore XML.

Although this is not ideal, a very simple way that you can start entering PBCore data is in a spreadsheet. PBCore can express complex relationships, so it isn’t naturally flat the way that a spreadsheet is; however, there are ways to record most of the data in a spreadsheet. Storing it in a spreadsheet is a good starting place, because later you can have someone write a script that will take that data and turn it into PBCore XML, where you can take advantage of the complex relationships that PBCore can handle. And considering how often most of us use spreadsheets in our work, it’s a very approachable first step.
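To give a sense of what such a script might look like, here is a minimal sketch in Python. It assumes a spreadsheet exported as CSV with one row per asset; the column names (identifier, identifier_source, title, description) and the sample rows are made up for this illustration, and the output has not been validated against the PBCore schema.

```python
import csv
import io
import xml.etree.ElementTree as ET

# PBCore 2.x namespace; check it against the schema version you actually use.
PBCORE_NS = "http://www.pbcore.org/PBCore/PBCoreNamespace.html"
ET.register_namespace("", PBCORE_NS)

# Hypothetical spreadsheet export: one row per asset, column names invented for this sketch.
SAMPLE_CSV = """identifier,identifier_source,title,description
cowboys-001,Local ID,Cowboys of the West,Documentary about working cowboys.
cowboys-002,Local ID,Cowboys of the West: Outtakes,Raw footage from the shoot.
"""


def pb(tag):
    """Return a tag name qualified with the PBCore namespace."""
    return f"{{{PBCORE_NS}}}{tag}"


def rows_to_collection(csv_text):
    """Turn each CSV row into a description document inside one collection."""
    collection = ET.Element(pb("pbcoreCollection"))
    for row in csv.DictReader(io.StringIO(csv_text)):
        doc = ET.SubElement(collection, pb("pbcoreDescriptionDocument"))
        # Identifier, title, and description are the three required asset-level fields.
        identifier = ET.SubElement(doc, pb("pbcoreIdentifier"), source=row["identifier_source"])
        identifier.text = row["identifier"]
        ET.SubElement(doc, pb("pbcoreTitle")).text = row["title"]
        ET.SubElement(doc, pb("pbcoreDescription")).text = row["description"]
    return collection


collection = rows_to_collection(SAMPLE_CSV)
xml_text = ET.tostring(collection, encoding="unicode")
print(xml_text)
```

A real script would read your own column names and would probably need to handle repeated fields and instantiations, which is where the flat spreadsheet starts to strain.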

Now we’re going to review some important concepts about how PBCore is structured, which inform what you will be able to use your PBCore for.

Instantiations and Assets

As was mentioned in earlier posts in this series, PBCore has these things called instantiations.  An instantiation is an occurrence of an asset. If you have a master tape of a program about cowboys, that tape is an instantiation. If someone dubs that master tape onto a DVD to make a viewing copy, then that viewing copy is another instantiation. If you digitize that program about cowboys, the preservation quality file you create during digitization is another instantiation, and so on.  Each instantiation is an occurrence of the same content: the cowboy program, which we refer to as the asset. All of the information at the asset level is related to that content, rather than to the tape or file, which is the instantiation.
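If it helps to think of this relationship as a data structure, the cowboy example could be modeled roughly as follows. This is only an illustration of the one-asset, many-instantiations idea; the class and field names are invented for the sketch and are not PBCore element names.

```python
from dataclasses import dataclass, field


@dataclass
class Instantiation:
    """One physical or digital occurrence of the content."""
    identifier: str
    kind: str


@dataclass
class Asset:
    """The content itself, independent of any particular tape or file."""
    title: str
    instantiations: list = field(default_factory=list)


# Hypothetical identifiers, used only to make the example concrete.
cowboys = Asset(title="Program about cowboys")
cowboys.instantiations.append(Instantiation("tape-001", "master tape"))
cowboys.instantiations.append(Instantiation("dvd-001", "DVD viewing copy"))
cowboys.instantiations.append(Instantiation("cowboys_pres.mov", "preservation file"))

for inst in cowboys.instantiations:
    print(f"{cowboys.title}: {inst.kind} ({inst.identifier})")
```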

In addition to having data at both the asset and instantiation level, PBCore also allows you to structure your data with one of three root elements. These will be explained more fully in a later post; however, it’s useful to keep these structures in mind as you’re considering how you’ll store your data.

Root Elements

The simplest root element is the instantiation document. This describes a single occurrence, and can be used for things like capturing technical data in PBCore about a digital instantiation.

The most commonly used root element is the description document. These contain information on the asset level, and can contain one or more instantiations as well; however, to make valid PBCore description documents, you don’t actually have to have an instantiation at all. Most people I know do use instantiations in their description documents since they see no point in creating asset level data if it doesn’t relate to an object or file. Just know, though, that a perfectly valid PBCore description document can consist only of an identifier, a title, and a description, which are all at the asset level, not the instantiation level.

PBCore also provides a root element, called PBCore collection, which allows you to group your description documents into one XML file, together with some data at the collection level. Again, this was just a brief overview, and future posts will go into detail about the ways you can take advantage of these structures.

Now, let’s delve into the nitty-gritty of gathering all of this data and figuring out which PBCore elements and attributes it fits into.

Required Fields

Identifier

The first required field is an Identifier.  Identifiers exist at the Asset and Instantiation levels, and are unique to the asset or instantiation.  PBCore also requires source information for all of the identifiers you use. Typical sources on the asset level are things like randomly generated ids for each piece of content, or sometimes codes used for specific programs or films. Typical sources for instantiation identifiers are things like barcodes, tape numbers, filenames, etc.

The Identifier field is repeatable, which makes it important for every identifier to have a source.  In the PBCore schema, you can include both the tape number that a production used on their tape and also the barcode that the archive added to the tape when they processed it into their collection.  Including source data makes these identifiers easier to differentiate in the future.

Title

Another PBCore-required field is title. Title is also repeatable, and the schema allows for noting the type of title (although type is not required like source is for identifiers). By doing so, you can add any and all relevant titles, such as the series and episode titles. You can create a title for raw footage that you might have in your collections. And if that raw footage was recorded for a specific film or program, then you can also add the title of that in a separate title field within the same record. And just to reinforce the idea of assets and instantiations: since title refers to the content, not the specific occurrence, title is contained in the asset-level record.
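As a rough illustration of repeated identifiers and titles, the sketch below builds an asset-level record in Python with two identifiers (each with its source) and two titles (each with a type). The identifier values, source labels, and titles are all invented for the example, and the result has not been validated against the PBCore schema.

```python
import xml.etree.ElementTree as ET

# PBCore 2.x namespace; check it against the schema version you actually use.
PBCORE_NS = "http://www.pbcore.org/PBCore/PBCoreNamespace.html"
ET.register_namespace("", PBCORE_NS)


def pb(tag):
    """Return a tag name qualified with the PBCore namespace."""
    return f"{{{PBCORE_NS}}}{tag}"


doc = ET.Element(pb("pbcoreDescriptionDocument"))

# Hypothetical asset-level identifiers, each paired with the source that issued it.
identifiers = [("cowboys-001", "Local ID"), ("CW-101", "Program code")]
for value, source in identifiers:
    ET.SubElement(doc, pb("pbcoreIdentifier"), source=source).text = value

# Hypothetical titles; titleType is optional but makes the titles easier to tell apart.
titles = [("Cowboys of the West", "Series"), ("The Roundup", "Episode")]
for value, title_type in titles:
    ET.SubElement(doc, pb("pbcoreTitle"), titleType=title_type).text = value

xml_text = ET.tostring(doc, encoding="unicode")
print(xml_text)
```

A complete record would also need a description, which is covered next.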

Description

The final required field on the asset level is the description. This is another repeatable field, although repetition of this field is used somewhat less frequently than repetition of the identifier and title fields.  Descriptions can include as much or as little information about the content of the assets as you wish, from summaries and shot logs to whatever descriptive data you can gather from the labels and other documentation available. Some users have employed various work-arounds, including just putting a space or some non-description like “description unavailable” into the description fields, so that there is a value in the field and the XML will validate. While I would encourage every effort to put actual data into this field, sometimes it is just impossible to do so.

Location

The final required field, location, is only at the instantiation level, and only required if you have an instantiation. This field records the physical (or virtual) location of the occurrence, so that after you describe it in PBCore, you can go back and find it when you need it again. Location data can be of various types. Sometimes you can make it as simple as the name of the holding institution, for example “WGBH Archives.” More specific location data can also be used. For physical items people often include which room, shelf, box, etc. that the item is stored in. It is just as important to provide locations for digital files.  Filepaths or hard drive names allow future curators to locate the digital file for future uses.
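Putting the instantiation-level requirements together, a bare-bones instantiation document might be generated like this. It is a sketch only: the tape number and barcode values are invented, the location reuses the “WGBH Archives” example above, and the output has not been validated against the PBCore schema.

```python
import xml.etree.ElementTree as ET

# PBCore 2.x namespace; check it against the schema version you actually use.
PBCORE_NS = "http://www.pbcore.org/PBCore/PBCoreNamespace.html"
ET.register_namespace("", PBCORE_NS)


def pb(tag):
    """Return a tag name qualified with the PBCore namespace."""
    return f"{{{PBCORE_NS}}}{tag}"


inst = ET.Element(pb("pbcoreInstantiationDocument"))

# Two identifiers for the same tape: the production's number and the archive's barcode.
# Both values are hypothetical.
tape_identifiers = [("T-0457", "Production tape number"), ("0000123456", "Archive barcode")]
for value, source in tape_identifiers:
    ET.SubElement(inst, pb("instantiationIdentifier"), source=source).text = value

# Location can be as simple as the holding institution, or as specific as room/shelf/box.
ET.SubElement(inst, pb("instantiationLocation")).text = "WGBH Archives"

xml_text = ET.tostring(inst, encoding="unicode")
print(xml_text)
```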

Now that we have the basics of how PBCore is structured and what is required in PBCore, let’s dive into how to get the data and put it in all of these fields. Let’s start things out with a physical piece of media. You’ve got this tape in your collection. What do you do?

A good place to start is with the physical format. If you have the tape (or other piece of physical media) in front of you, it should be easy to tell what format it is.  The following information can often be found on the object itself:

  • Format

  • Date.  Sometimes there won’t be a date right on the label, in which case you might be able to get the date information from related documentation, or by watching the content of the tape.

  • Generation

  • Duration

  • Descriptive information, such as:

    • Title

    • Content description

Best practices for spelling, capitalization, and other style choices can be found in the recommended controlled vocabulary on the PBCore website.

But, I’m sure you’re asking, what if it’s an instantiation that you can’t hold in your hand?

Now say instead of a tape, you have a file. How do you get data about this file into PBCore? The process of putting data for digital objects into PBCore is pretty similar, although the methods for finding that data are somewhat different.

Two of the first (and easiest) things to capture are the identifier and the format. These can come straight from the file name: use it as the instantiation identifier.  The extension leads you to the data you will put in the Digital Format field. PBCore does not encourage adding “dot mp4” as a digital format; rather, use the Internet Media or Content Type, such as “video/mp4.” Just by looking at the data that your computer will give you on the file, you can also gather other information on the digital media instantiation, such as file size, duration, frame size, etc.
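One possible way to go from a file name to an Internet Media Type is Python’s standard mimetypes module, which guesses the type from the extension. The sketch below is only that, a guess based on the extension; the file name is made up, and the dictionary keys are illustrative labels rather than a complete PBCore record.

```python
import mimetypes
from pathlib import Path


def describe_file_name(file_name):
    """Derive an identifier and a media type from a file name alone."""
    # guess_type looks only at the extension; it does not inspect the file contents.
    media_type, _encoding = mimetypes.guess_type(file_name)
    return {
        "identifier": Path(file_name).name,   # candidate instantiation identifier
        "identifier_source": "File name",
        "digital_format": media_type,         # candidate value for the Digital Format field
    }


# Hypothetical file name for the digitized cowboy program.
info = describe_file_name("cowboys_program.mp4")
print(info)
```

If the extension is unknown, guess_type returns None for the type, so a real workflow would want a fallback such as checking the file with one of the tools discussed next.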

Another good way to get this data about digital instantiations is by using tools that report information about your files, such as ffprobe, MediaInfo, and ExifTool (all of which are free). These tools can not only generate this information; you can also take the data they output and put it straight into PBCore. One of the benefits of using tools like these is the automation: it avoids human transcription mistakes and inconsistencies based on human judgment, such as different people using gigabyte or GB, and it also saves on staff time.
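As one example of this kind of automation, ffprobe can write its report as JSON, which a short script can then pick apart. The sketch below assumes ffprobe is installed and on your PATH for the run_ffprobe function; the summarize function is demonstrated on a small hand-written sample report with made-up values, so it can be tried without any media file. The keys in the summary are illustrative labels, not PBCore element names.

```python
import json
import subprocess


def run_ffprobe(path):
    """Run ffprobe on a file and return its JSON report as a dict (needs ffprobe installed)."""
    result = subprocess.run(
        ["ffprobe", "-v", "error", "-print_format", "json",
         "-show_format", "-show_streams", path],
        capture_output=True, text=True, check=True,
    )
    return json.loads(result.stdout)


def summarize(report):
    """Pull out file size, duration, and frame size from an ffprobe JSON report."""
    fmt = report.get("format", {})
    streams = report.get("streams", [])
    video = next((s for s in streams if s.get("codec_type") == "video"), {})
    frame_size = None
    if "width" in video and "height" in video:
        frame_size = f'{video["width"]}x{video["height"]}'
    return {
        "file_size_bytes": fmt.get("size"),
        "duration_seconds": fmt.get("duration"),
        "frame_size": frame_size,
    }


# Hand-written sample in the shape of an ffprobe report; the values are made up.
SAMPLE_REPORT = {
    "format": {"filename": "cowboys_program.mp4", "size": "1048576", "duration": "60.000000"},
    "streams": [
        {"codec_type": "audio", "codec_name": "aac"},
        {"codec_type": "video", "codec_name": "h264", "width": 720, "height": 480},
    ],
}

summary = summarize(SAMPLE_REPORT)
print(summary)
```

With a real file, you would call summarize(run_ffprobe("path/to/file")) and then map the results to the matching instantiation fields.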

I’ve gone through some of the easiest data to gather for both physical and digital instantiations. However, as you can see from this list, once you’re comfortable with PBCore, you can take advantage of the wide range of elements (and these are just on the instantiation level!). Using PBCore gives you a superb structure for describing your media assets as fully as you want: write rich descriptions, add subject headings and track the origin of those headings using a link to the authority’s URI, add genre information, and fill in the coverage field with information about the content’s place and time using place names, geospatial coordinates, date ranges, and other types and formats of data.  Don’t forget to use an attribute to note which type of data it is!  Finally, use the intellectual property fields to detail any rights information available, including the content’s creators, contributors, and publishers. Within each of these elements there is one subelement for the name of the person and another subelement for their role, so that the name and role are always associated.
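To make the name-and-role pairing a little more concrete, here is a rough sketch of how a subject, a coverage entry, and a creator might be built in Python. The element and attribute names reflect my understanding of the PBCore 2.x schema and should be checked against it; the person, place, and authority URI are placeholders invented for the example, and the fragment leaves out the required identifier, title, and description.

```python
import xml.etree.ElementTree as ET

# PBCore 2.x namespace; check it against the schema version you actually use.
PBCORE_NS = "http://www.pbcore.org/PBCore/PBCoreNamespace.html"
ET.register_namespace("", PBCORE_NS)


def pb(tag):
    """Return a tag name qualified with the PBCore namespace."""
    return f"{{{PBCORE_NS}}}{tag}"


doc = ET.Element(pb("pbcoreDescriptionDocument"))

# Subject heading with its authority noted; the ref URI is a placeholder, not a real record.
subject = ET.SubElement(
    doc, pb("pbcoreSubject"),
    source="Example Subject Authority",
    ref="https://example.org/authority/cowboys",
)
subject.text = "Cowboys"

# Coverage: the value plus a note of whether it describes place or time.
coverage = ET.SubElement(doc, pb("pbcoreCoverage"))
ET.SubElement(coverage, pb("coverage")).text = "Wyoming"
ET.SubElement(coverage, pb("coverageType")).text = "Spatial"

# Creator: name and role live side by side inside one wrapper element.
creator = ET.SubElement(doc, pb("pbcoreCreator"))
ET.SubElement(creator, pb("creator")).text = "Jane Example"
ET.SubElement(creator, pb("creatorRole")).text = "Producer"

xml_text = ET.tostring(doc, encoding="unicode")
print(xml_text)
```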

Finally, the American Archive of Public Broadcasting is developing its own set of cataloging guidelines based on the PBCore elements we’re using.  These can be found here.
