Beyond the Triple Count

Posted on 09/28/2011 by

On Monday I gave a talk at the SemTechBiz conference: “The RDF Report Card: Beyond the Triple Count”. I’ve published the slides on Slideshare which I’ve embedded below, but I thought I’d also post some of my notes here.

I’ve felt for a while now that the Linked Data community has an unhealthy fascination with triple counts, i.e. with the size of individual datasets.

This was quite natural in the boot-strapping phase of Linked Data in which we were primarily focused on communicating how much data was being gathered. But we’re now beyond that phase and need to start considering a more nuanced discussion around published data.

If you’re a triple store vendor then you definitely want to talk about the volume of data your store can hold. After all, potential users or customers are going to be very interested in how much data could be indexed in your product. Even so, no-one seriously takes a headline figure at face value. As users we’re much more interested in a variety of other factors. For example, how long does it take to load my data? Or, how well does a store perform with my usage profile, taking into account my hardware investment? This is why we have benchmarks, so we can take into account additional factors and more easily compare stores across different environments.

But there’s not nearly enough attention paid to other factors when evaluating a dataset. A triple count alone tells us nothing. It’s not even a good indicator of the number of useful “facts” in a dataset.

During my talk I illustrated this point by showing how, in DBpedia, there are often several redundant ways of capturing the same information. These inflate the size of the dataset without adding useful extra information. By my estimate there are over 4.6m redundant triples for capturing location information alone. In my view, having multiple copies or variations of the same data point reduces the utility of a dataset, because it adds confusion over which values are reliable.
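As a rough illustration of how such an estimate might be made, here is a minimal Python sketch. It is not the method used for the DBpedia figure above. It assumes triples are held as plain (subject, predicate, object) tuples, and the groups of "equivalent" properties, including the ex: names, are hypothetical and would need to be chosen by hand for a real dataset.

```python
from collections import defaultdict

# Hypothetical groups of properties assumed to carry the same data point.
# The real overlapping properties would need to be identified per dataset.
EQUIVALENT = {
    "latitude": {"geo:lat", "ex:latitude"},
    "longitude": {"geo:long", "ex:longitude"},
}


def count_redundant(triples, groups=EQUIVALENT):
    """Count triples that restate a data point already given for a subject."""
    prop_to_group = {p: g for g, props in groups.items() for p in props}
    seen = defaultdict(int)  # (subject, group) -> number of triples found
    for s, p, o in triples:
        group = prop_to_group.get(p)
        if group is not None:
            seen[(s, group)] += 1
    # Everything beyond the first statement per (subject, group) is redundant.
    return sum(n - 1 for n in seen.values())


triples = [
    ("ex:London", "geo:lat", "51.5"),
    ("ex:London", "ex:latitude", "51.5"),
    ("ex:London", "geo:long", "-0.12"),
    ("ex:London", "ex:longitude", "-0.12"),
    ("ex:Paris", "geo:lat", "48.85"),
    ("ex:London", "rdfs:label", "London"),
]
redundant = count_redundant(triples)
print(f"{redundant} of {len(triples)} triples are redundant")
```

In this toy data two of the six triples only repeat a coordinate that is already stated, so a third of the headline triple count adds nothing for a consumer.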

There can be good reasons for including the same information in slightly different ways, e.g. to support consuming applications that rely on slightly different properties, or which cannot infer additional data. Vocabularies also evolve and become more popular and this too can lead to variants if a publisher is keen to adapt to changing best practices.

But I think too often the default position is to simply use every applicable property to publish some data. From a publishing perspective it’s easier: you don’t have to make a decision about which approach might be best. And because of the general fixation on dataset size, there’s an incentive to just publish more data.

I think it’s better for data publishers to make more considered curation decisions, and instead use one preferred way to publish each piece of information. It’s much easier for clients to use normalized data.

I also challenged the view that we need huge amounts of data to build useful applications. In some scenarios more data is always better; that’s especially true if you’re doing some kind of statistical analysis. Semantic web technology potentially allows us to draw on data from hundreds of different sources by reducing integration costs. But that doesn’t mean that we have to in order to drive useful applications. In many cases we need much more modest collections of data.

I used BBC Programmes as an example here. It’s a great example of publishing high quality Linked Data, especially because the BBC were amongst the first (if not the first) primary publishers of data on the Linked Data cloud. BBC Programmes is a very popular site with over 2.5 million unique users a week, triggering over 60 reqs/second on their back-end. Now, while the data isn’t managed in a triple store, if you crawl it then you’ll discover that there are only about 50 million triples in the whole dataset. So you clearly don’t need billions of triples to drive useful applications.

It’s really easy to generate large amounts of data. Curating a good quality dataset is harder. Much harder.

I think it’s time to move beyond boasting about triple counts and instead provide ways for people to assess dataset quality and utility. There are lots of useful factors to take into account when deciding whether a dataset is fit for purpose. In other words, how can we help users understand whether a dataset can help them solve a particular problem, implement a particular feature, or build an application?

Typically the only information we get about a dataset is some brief notes on its size, a few example resources, perhaps a pointer to a SPARQL endpoint and maybe an RDFS or OWL schema. This is not enough. I’d consider myself to be an experienced semantic web developer and this isn’t nearly enough to get started. I always find myself doing a lot of exploration around and within a dataset before deciding whether it’s useful.

In the talk I presented a simple conceptual model, an “information spectrum”, that tried to tease out the different aspects of a dataset that are useful to communicate to potential users. Some of that information is more oriented towards “business” decisions: is the dataset from a reliable source, correctly licensed, etc. Other information is more technical: how has the dataset been constructed, or modelled?

I identified several broad classes of information on that spectrum:

Metadata. This is the kind of information that people are busily pouring into various data catalogs, primarily from government sources. Dataset metadata, including its title, a description, publication dates, license, etc all help solve the discovery problem, i.e. identifying what datasets are available and whether you might be able to use them.

While the situation is improving, it’s still too hard to find out when some particular source was updated, who maintains or publishes the data, and (of biggest concern) how the data is licensed.

Scope. Scoping information for a dataset tells us what it contains. E.g. is it about people, places, or creative works? How many of each type of thing does a dataset contain? If the dataset contains points of interest, then what is the geographic coverage? If the dataset contains events, then over what time period(s)?

Then we get to the Structure of a dataset. I don’t mean a list of the specific vocabularies that are used, but more how those vocabularies have been meshed together to describe a particular type of entity. E.g. how is a person described in this dataset? Do all people have a common set of properties?
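One way to answer the "common set of properties" question is to measure, for each type, what fraction of its instances use each property. The following is only a sketch under assumptions: triples are plain (subject, predicate, object) tuples, types are asserted with a predicate written as "rdf:type", and the sample data is invented.

```python
from collections import defaultdict


def property_coverage(triples, type_predicate="rdf:type"):
    """For each type, return the fraction of its instances using each property."""
    types = defaultdict(set)  # subject -> types asserted for it
    props = defaultdict(set)  # subject -> predicates used to describe it
    for s, p, o in triples:
        if p == type_predicate:
            types[s].add(o)
        else:
            props[s].add(p)

    members = defaultdict(int)                     # type -> instance count
    usage = defaultdict(lambda: defaultdict(int))  # type -> predicate -> count
    for s, subject_types in types.items():
        for t in subject_types:
            members[t] += 1
            for p in props[s]:
                usage[t][p] += 1

    return {
        t: {p: n / members[t] for p, n in usage[t].items()}
        for t in members
    }


triples = [
    ("ex:alice", "rdf:type", "foaf:Person"),
    ("ex:alice", "foaf:name", "Alice"),
    ("ex:alice", "foaf:mbox", "mailto:alice@example.org"),
    ("ex:bob", "rdf:type", "foaf:Person"),
    ("ex:bob", "foaf:name", "Bob"),
]
coverage = property_coverage(triples)
for prop, fraction in sorted(coverage["foaf:Person"].items()):
    print(f"foaf:Person {prop}: {fraction:.0%}")
```

Properties with coverage close to 100% are the ones a client can rely on for every entity of that type; the rest need defensive handling.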

At the lowest level we then have the dataset Internals. This includes things like lists of RDF terms and their frequencies, use of named graphs, pointers to source files, etc. Triple counts may be useful at this point, but only to identify whether you could reasonably mirror a dataset locally. Knowledge of the underlying infrastructure, etc. might also be of use to developers.
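Term frequency lists of this kind are cheap to compute. A minimal sketch, assuming the triples fit in memory as a list of (subject, predicate, object) tuples and that types are asserted with a predicate written as "rdf:type" (the sample data is invented):

```python
from collections import Counter


def term_frequencies(triples, type_predicate="rdf:type"):
    """Return (predicate frequencies, class frequencies) for a list of triples."""
    predicates = Counter(p for _, p, _ in triples)
    classes = Counter(o for _, p, o in triples if p == type_predicate)
    return predicates, classes


triples = [
    ("ex:alice", "rdf:type", "foaf:Person"),
    ("ex:alice", "foaf:name", "Alice"),
    ("ex:bob", "rdf:type", "foaf:Person"),
    ("ex:bob", "foaf:name", "Bob"),
    ("ex:acme", "rdf:type", "foaf:Organization"),
]
predicates, classes = term_frequencies(triples)
print("triples:", len(triples))
for term, count in predicates.most_common():
    print(term, count)
```

For a dataset too large to load locally, the same counts could instead be requested from a SPARQL endpoint with aggregate queries, where the endpoint supports them.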

Taken together, I see presenting this information to users as an exercise in progressive disclosure: providing the right detail, to the right audience, at the right time. Currently we don’t routinely provide nearly enough information at any point on the spectrum. The irony here is that when we’re using RDF, the data is so well-structured that much of that detail could be automatically generated. Data publishing platforms need to do more to make this information readily accessible, as well as providing data publishers with the tools to manage it.

We’ve been applying this conceptual model as we build out the features of Kasabi. Currently we’re ticking all of these boxes. It’s clear from every dataset homepage where some data has come from, how it is licensed and when it was updated. A user can easily then drill down into a dataset to get more information on its scope and internal structure. There’s lots more that we’re planning to do at all stages.

To round out the talk I previewed a feature that we’ll be releasing shortly called the “Report Card”. This is intended to provide an at-a-glance overview of what types of entity a dataset contains. There are examples included in the slides, but we’re still playing with the visuals. The idea is to allow a user to quickly determine the scope of a dataset, and whether it contains information that is useful to them. In the BBC Music example you can quickly see that it contains data on Creative Works (reviews, albums), Organizations (bands) and People (artists) but it doesn’t contain any location information. You’re going to need to draw on a related linked dataset if you want to build a location based music app using BBC Music.
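To make the idea concrete, here is a minimal sketch of how a summary like this could be derived. It is not how the Kasabi feature is implemented: the mapping from RDF types to broad categories, the type names and the sample data are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical mapping from RDF types to broad report card categories.
CATEGORIES = {
    "mo:Record": "Creative Works",
    "rev:Review": "Creative Works",
    "mo:MusicGroup": "Organizations",
    "mo:SoloMusicArtist": "People",
    "geo:SpatialThing": "Places",
}


def report_card(triples, categories=CATEGORIES):
    """Count distinct entities per broad category, including empty categories."""
    entities = defaultdict(set)  # category -> subjects typed into it
    for s, p, o in triples:
        if p == "rdf:type" and o in categories:
            entities[categories[o]].add(s)
    return {
        category: len(entities.get(category, ()))
        for category in sorted(set(categories.values()))
    }


triples = [
    ("ex:album1", "rdf:type", "mo:Record"),
    ("ex:album1", "dc:title", "An Album"),
    ("ex:review1", "rdf:type", "rev:Review"),
    ("ex:band1", "rdf:type", "mo:MusicGroup"),
    ("ex:artist1", "rdf:type", "mo:SoloMusicArtist"),
]
card = report_card(triples)
for category, count in card.items():
    print(f"{category}: {count}")
```

Reporting the empty categories matters as much as the populated ones: a zero against Places is what tells a developer to look elsewhere for location data.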

As well as summarizing a dataset, the report card will also be used to drive better discovery tools. This will allow users to quickly find datasets that include the same kinds of information, or relevant complementary data.

Ultimately my talk argued that it’s time to start focusing more on data curation. We need to give users a clearer view of the quality and utility of the data we’re publishing, and also think more carefully about the data we’re publishing.

This isn’t a uniquely semantic web problem. The same issues are rearing their heads with other approaches. Where I think we are well placed is in the ability to apply semantic web technology to help analyze and present data in a more useful and accessible way.