Publishing Data Checklist

Posted on 08/05/2011 by



Some publishing tips

Lately, I’ve been writing about datasets Kasabi hosts and trying to draw out some interesting facts and uses of each of them.

What I’d like to do today is very quickly point out some things that can make a dataset more useful in Kasabi. As this is only a quick guide, it might raise some questions as you work your way through creating and publishing a dataset, so I’d like to make sure you have some links for getting help before I begin:

I won’t take you through all the steps of publishing a set; for a step-by-step guide to creating one, there’s a video embedded below. What I’d like to do is discuss a few things that should help make a dataset usable, and leave you with a checklist for successful publishing.

Branding

It’s part of Kasabi’s vision that a dataset’s publisher is its owner. So the ability to add a logo to the dataset is important for helping to identify the information being published. I think it’s also useful for developers to see, as it’s a visual part of the relationship between them and the publisher. The logo is not required for creating a new dataset, but it’s important to remember that it’s possible to brand your data, and add some context to it.

Description

Kasabi surfaces some information about a dataset automatically, and it’s something we’re constantly working to improve. But the summary description of a dataset is your chance to explain what it contains and how it got to be the set it is. To me, a good example can be seen on the NASA dataset:

This dataset consists of a conversion of the NASA NSSDC Master Catalog and extracts of the Apollo By Numbers statistics.

Currently the data consists of all of the Spacecraft from the NSSDC database which is a comprehensive list of orbital, suborbital, and interplanetary spacecraft launches dating from the 1950s to the present day. Entries are not limited to NASA missions, but include spacecraft launched by various agencies from around the globe.

You can also select a category for your data to live in. Choose the ones that will help folk find what they’re looking for, and select multiple categories if your set covers more than one area.

Licensing

One of the things required to create a dataset is for you to select a license. This is your chance to define specifically how your data may be used, reused, and attributed. For the beta, Kasabi supports a series of licenses and waivers which we think of as “open”. This means anyone can access and use the data we’re currently hosting. Eventually, this may evolve to include datasets with different kinds of licenses (e.g. for commercial datasets). It is important that developers understand what they’re allowed to do with the set, and we’ve tried to make this easy from both sides. I cannot count the number of times I’ve heard someone giving a demonstration or a talk about projects built on datasets include a statement like:

“I’m afraid we couldn’t build that, because the data we found wasn’t licensed, and we couldn’t work out whether it was OK for us to use it or not.”

Every set in Kasabi can be attributed, whether or not the license requires it. We encourage anyone using a dataset to give credit where it’s due. There’s a simple embeddable script pointing back at the dataset, and a more flexible Attribution API, so datasets can be cited by any application or hack.

Developer Documentation

This is your chance to give developers as much information as you can to help them understand your data. One of the best things to include is how you see a set being used, either as examples of applications that have been—or could be—built with it, or a set of bullet points about how you see it being useful. I personally like it when the documentation leads off with this, as it immediately has me thinking about what I could do with this data. We can take a closer look at the dev docs for the Bricklink set.

We get a clear picture of how the data is modelled, and what topics it covers. There is a list of vocabularies (like schemas for Linked Data), and how they’re implemented in this set. One thing to note here is that each heading ends with an example. The majority of feedback I receive from the developer community includes requests for an example. Here’s your chance to show a developer what you mean!

I would say that a dataset without at least some documentation will probably see limited use, as developers will struggle to understand the data and how it’s described.

If you’ve got some examples of queries to get certain bits of data out of your set, you can include these in the documentation. You can also (and I’d recommend it :) ) add your examples to the SPARQL API once your set has been created. Simply navigate to the dataset’s SPARQL API, and follow the link to “add a sample query”. Here’s a dataset with some example queries: NHS Organisation.
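To give a rough idea of what a sample query might look like, here is a minimal Python sketch that holds a small, generic SPARQL query and URL-encodes it for a GET request. The endpoint URL is a placeholder, the query only assumes the data uses `rdfs:label`, and any API key or extra parameters the service may require are left out. Treat it as an illustration under those assumptions, not as Kasabi’s documented API.

```python
from urllib.parse import urlencode

# Hypothetical endpoint: replace with the SPARQL API URL shown on your dataset's page.
ENDPOINT = "https://example.org/dataset/sparql"

# A small, generic sample query: list ten resources together with their labels.
# It assumes the dataset uses rdfs:label, which may not be true of your data.
SAMPLE_QUERY = """\
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?thing ?label
WHERE { ?thing rdfs:label ?label }
LIMIT 10
"""


def build_query_url(endpoint: str, query: str) -> str:
    """Return a GET URL carrying the query in the standard SPARQL `query` parameter."""
    return endpoint + "?" + urlencode({"query": query})


if __name__ == "__main__":
    print(build_query_url(ENDPOINT, SAMPLE_QUERY))
```

A short query like this, with a sentence saying what it returns, is often enough to get a developer started with your set.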

Dataset Creation Checklist

So, here is a bullet-pointed summary of this post, which might work as a kind of checklist for successfully publishing a dataset in Kasabi:

  • Logo (add branding to your set)
  • Description (give developers an elevator pitch of what’s in the set)
  • Category (what kind of data is this?)
  • License (let us know how we can use the data)
  • Developer Documentation (tell us what can be done, and how the set’s composed)
  • Sample Queries (if you’re into SPARQL, give an example of querying)
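If you keep notes about a dataset before publishing it, the checklist above can be turned into a quick self-check. The sketch below is only an illustration: the field names and the draft record are made up for this example and do not reflect how Kasabi stores a dataset.

```python
# Hypothetical field names mirroring the checklist above (not Kasabi's actual schema).
CHECKLIST = [
    "logo",
    "description",
    "category",
    "license",
    "documentation",
    "sample_queries",
]


def missing_items(dataset: dict) -> list:
    """Return the checklist items that are absent or empty in the dataset record."""
    return [item for item in CHECKLIST if not dataset.get(item)]


# A made-up draft record: no logo, no documentation, and an empty list of queries.
draft = {
    "description": "Spacecraft from the NSSDC Master Catalog.",
    "category": ["Science"],
    "license": "an open license",
    "sample_queries": [],
}

print(missing_items(draft))  # ['logo', 'documentation', 'sample_queries']
```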

I’m very interested in your feedback on creating a dataset, especially throughout the beta. If you have ideas for making publishing more successful that aren’t covered here, or you’ve found problems, please drop me a line, or send a message to the Kasabi developer mailing list for discussion.

Publishing Data Video

Posted in: Ideas