Featured Dataset: Prelinger Archives

Posted on 08/09/2011 by

Following on from the Lego dataset featured last week, I wanted to take a look at some media data. So, today I’d like to feature a very interesting dataset called the Prelinger Archives.

Back in 1982, Rick Prelinger began collecting a rather diverse set of films under the broad category of “ephemeral”. These were films which weren’t intended to last forever, but were fit for the moment. They include films made by organisations, home movies and educational films. They usually served an immediate purpose, so they weren’t made or preserved with longevity in mind.

The Internet Archive holds many of these films, and this dataset comprises metadata for around 2000 films in the public domain. The Developer Docs describe the set as follows:

While the archives currently provide a page for every film, with links to its metadata, and a full-text index to support searching, there is no real formal API over the data. However the Internet Archives do provide some XML files and a simple JSON interface for downloading the metadata.

To provide more flexible ways to access the data the dataset has been crawled and loaded into Kasabi. This includes the reviews of the films posted to the Internet Archive itself. The converted dataset exposes the data as Linked Data while retaining links to the original films, allowing developers to access the media files for the films themselves.

This is interesting data covering a topic I’d never encountered before. It has led me to read up a bit more about ephemeral films, and to dive in and discover films about coffee (for example). The Developer Docs begin in my favourite way, with a list of potential uses of this data:

Potential Uses

  • Development of an improved interface to the archive data, to support exploring an important historical archive
  • Exploration of video and film annotation using real-world public domain data and media
  • Cross-linking of the archive content with other company, location and historical archives

These uses all sound like a good match for potential apps for the Cultural Data Hackday, so I’m hoping these are interesting and usable.

How the Data Works

Data Model

We can have a look at a diagram describing the relationships between the data types, and see how different pieces of data are represented. Straight away, we can see that the majority of relationships are based around the hub concept of “moving image”. That way, each moving image can have certain attributes and properties associated with it. The set uses several different schemas to describe different elements of the data, including its own custom vocabulary of terms.

This descriptive resource gives us a picture of what’s in the set, how the data is described, and how it all relates. Take films, for example:

Films (p:MovingImage)

Each of the films in the Prelinger Archives is modelled with a type of prel:MovingImage.

The URIs for the films have been constructed using the following URI pattern:

http://data.kasabi.com/dataset/prelinger-archives/film/{id}

The identifier for the film is taken from the Internet Archives and is also included in the dataset as the value of a dct:identifier property associated with each film.
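To make the URI pattern above a bit more concrete, here’s a rough Python sketch that builds a film URI from an Internet Archive identifier and recovers the identifier again. Only the URI pattern comes from the docs; the function names, the percent-encoding step and the sample identifier are my own assumptions for illustration.

```python
from urllib.parse import quote, unquote

# URI pattern quoted from the Developer Docs.
FILM_URI_PREFIX = "http://data.kasabi.com/dataset/prelinger-archives/film/"


def film_uri(identifier: str) -> str:
    """Build the film URI for an Internet Archive identifier."""
    if not identifier:
        raise ValueError("identifier must be non-empty")
    # Assumption: percent-encode anything unsafe in a path segment.
    # The docs do not say how unusual characters are handled.
    return FILM_URI_PREFIX + quote(identifier, safe="")


def film_identifier(uri: str) -> str:
    """Recover the identifier (the dct:identifier value) from a film URI."""
    if not uri.startswith(FILM_URI_PREFIX):
        raise ValueError(f"not a Prelinger film URI: {uri}")
    return unquote(uri[len(FILM_URI_PREFIX):])


if __name__ == "__main__":
    uri = film_uri("example-film-id")  # hypothetical identifier
    print(uri)
    print(film_identifier(uri))
```

Because the identifier is also stored as dct:identifier, a round trip like this should give you the same value you would find in the data, assuming the identifier contains no characters that need encoding.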

For the SPARQLers out there, there is a published example query under the SPARQL API which lists the titles of films along with some important information (film maker and sponsor). Sample queries usually demonstrate one way of looking at the data using SPARQL, and you can view the full query here.
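If you’d like a feel for what such a query might look like, here’s a minimal Python sketch that assembles a SPARQL query string listing films with their identifiers and titles. This is not the published sample query. The prel namespace URI has to be passed in because I don’t want to guess it, and using dct:title for the title is an assumption; only prel:MovingImage and dct:identifier are taken from the docs quoted above.

```python
# Dublin Core terms namespace (standard).
DCT_NS = "http://purl.org/dc/terms/"


def build_film_query(prel_ns: str, limit: int = 10) -> str:
    """Return a SPARQL SELECT query listing films, identifiers and titles.

    prel_ns is the namespace URI of the dataset's custom vocabulary;
    look it up in the Developer Docs rather than trusting a guess.
    """
    if limit < 1:
        raise ValueError("limit must be at least 1")
    return (
        f"PREFIX dct: <{DCT_NS}>\n"
        f"PREFIX prel: <{prel_ns}>\n"
        "SELECT ?film ?id ?title WHERE {\n"
        "  ?film a prel:MovingImage ;\n"
        "        dct:identifier ?id .\n"
        # Assumption: titles are modelled with dct:title.
        "  OPTIONAL { ?film dct:title ?title }\n"
        "}\n"
        f"LIMIT {limit}"
    )


if __name__ == "__main__":
    # Placeholder namespace, purely for illustration.
    print(build_film_query("http://example.org/prelinger/schema#", limit=5))
```

The resulting string can be pasted into the SPARQL API explorer or sent with whichever HTTP client you prefer.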

The set publisher, Leigh, has also described how the data was accumulated and modelled, and made his crawler open for collaboration:

The data was compiled by using a custom Ruby crawler to lookup the unique identifiers for each of the films in the Prelinger Archive collection, then traversing the Internet Archive site to fetch the XML metadata for the film, its associated media files, and reviews (if any).

The crawler builds a local cache of the base metadata which is then converted into N-Triples for loading into Kasabi. This also allows for rebuilding of the dataset without having to unnecessarily load the Internet Archive servers.

The code to support the crawling and conversion of this dataset has been open sourced. Developers interested in collaborating to improve the modelling or fix up data conversion errors, can contribute to the project via github.

http://github.com/ldodds/prelinger
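The real crawler is written in Ruby, so the following is not Leigh’s code. It’s just a small Python sketch of the conversion step described above: taking one cached metadata record and turning it into N-Triples lines. The record layout (a dict with `identifier` and `title` keys) and the use of dct:title are assumptions on my part; the film URI pattern and dct:identifier come from the docs.

```python
from typing import Dict, List

DCT = "http://purl.org/dc/terms/"
FILM_URI_PREFIX = "http://data.kasabi.com/dataset/prelinger-archives/film/"


def escape_literal(value: str) -> str:
    """Escape the characters that N-Triples string literals require."""
    return (
        value.replace("\\", "\\\\")  # backslash first, so later escapes survive
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def film_to_ntriples(record: Dict[str, str]) -> List[str]:
    """Convert one cached metadata record (hypothetical layout) to N-Triples."""
    subject = f"<{FILM_URI_PREFIX}{record['identifier']}>"
    lines = [
        f'{subject} <{DCT}identifier> "{escape_literal(record["identifier"])}" .'
    ]
    title = record.get("title")
    if title:
        # Assumption: the title is stored as dct:title.
        lines.append(f'{subject} <{DCT}title> "{escape_literal(title)}" .')
    return lines


if __name__ == "__main__":
    sample = {"identifier": "example-film-id", "title": "An Example Film"}
    print("\n".join(film_to_ntriples(sample)))
```

Working from a local cache like this is what makes rebuilding cheap: the conversion can be rerun as often as the modelling changes without touching the Internet Archive servers.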

So, what’s in it?

I may suffer from a slight obsession with coffee (I’m sure the doctors aren’t worried about it), so the first thing I tried was a simple search for coffee-related data. I was excited to find out it’s an actual topic in the set, so there must be items in there which feature coffee:

There are 695 results matching coffee and, interestingly, many of them come from films advertising coffee, such as a commercial for Sanka. I was interested to note that this commercial has several reviews, one of which reads:

This commercial for the long time only decaf instant coffee opens with a proto-Juan Valdez picking coffee beans. Then it moves to how rich this coffee is and smoothly enthuses about the new jar shape. Fairly typical 1960s era coffee commercial, with nothing over the top.

I went to the Prelinger Archives home on the Internet Archive to watch the video, too!

 
