The following is a version of the talk on Creating APIs over RDF I gave at SemTech 2011. I’ve pruned some of the technical details in favour of linking out to other sources and concentrated here on the core message I was trying to get across. Comments welcome!
The Trouble with SPARQL
I’m a big fan of SPARQL, I constantly use it in my own development tasks, have built a number of production systems in which the query language is a core component, and wrote (I think!) one of the first SPARQL tutorials back in 2005 when it was still in Last Call. I’ve also worked with a number of engineering teams and developer communities over the last few years, introducing them to RDF and SPARQL.
My experience so far is two-fold: given some training and guidance SPARQL isn’t hard for any developer to learn. It’s just another query language and syntax. There are often some existing mental models that need to be overcome, but that’s always the case with any new technology. So at small scales SPARQL is easy to adopt, and a very useful tool when you’re working with graph shaped data.
But I’ve found, repeatedly, that when SPARQL is presented to a larger community, then the reaction and experience is very different. Developers quickly reject it as the learning curves are too great and, instead of seeing it as an enabler, they often see it as a barrier that’s placed between them and the data.
It’s easy to dismiss this kind of feedback and criticism by exhorting developers to just try harder or read more documentation. Surely any good developer is keen to learn a new technology? This overlooks the need of many people to just get stuff done quickly. Time and commercial pressures are a reality.
It’s also easy to dismiss this reaction as being down to the quality of the tools and documentation. Now, undoubtedly, there’s still much more that can be done there. And I was pleased to hear about Bob DuCharme’s forthcoming book on SPARQL.
But I think there are some technical reasons why, when we move from small groups to distributed adoption by a wider community, SPARQL causes frustrations. And I think this boils down to its affordance.
Consider the interface that most developers are presented with when given access to a SPARQL endpoint. It’s an empty text field and a button. If you’re really lucky, there may be an example query filled in.
In order to do anything you have to not only know how to write a valid SPARQL query, but you also really need to know how the underlying dataset is structured. Two immediate hurdles to get over. Yes, there are queries you can write to return arbitrary triples, or list classes and properties, but that’s still not something a new user would necessarily know. And you typically need a lot of exploration before you can start to understand how to best query the data.
Trial and error experiments aren’t easy either, its not always obvious how a query can be tweaked to customize the results. And when we typically share SPARQL queries, its typically by passing around direct links to an endpoint. Have fun unpicking the query from the URL, reformatting it, so you can understand how it works and how it can be tweaked!
Better tools can definitely help in both of these cases. In Kasabi we’ve added a feature that allows anyone to share useful queries for a SPARQL endpoint along with a description of how it works. It’s a simple click to drop the query into the API explorer to run it, or tweak it.
But in my opinion it’s about more than just the tooling. Affordance flows not just from the tools, but also the syntax and the data. If you point someone at a SPARQL endpoint, it’s not immediately useful, not without a lot of additional background. These are issues that can hamper widespread adoption of a technology but which don’t often arise with smaller groups with direct access to mentors.
Contrast this situation with typical web APIs which have, in my opinion, much more affordance. If I give someone a link to an API call then it’s more immediately useful. I think working with good, RESTful APIs is like pulling a thread: the URLs unravel into useful data. And that data contains more links that I can just follow to find more data.
Trial and error experiments are also much easier. If I want to tweak an API call then I can do that by simply editing the URL. As a developer this is syntax that I already know. URL templates can also give me hints of how to construct a useful request.
Importantly, my understanding of the structure of the dataset can grow as I work with it. My understanding grows through use, rather than before I start using. And that’s a great way to learn. There are no real barriers to progression. I need to know much less in order to start feeling empowered.
So, I’ve come to the conclusion that SPARQL is really for power users. It’s of most use to developers that are willing to take the time and trouble to learn the syntax and the underlying data model in order to get its benefits. This is not a critique of the technology itself, but a reflection on how technology is (or isn’t) being adopted and the challenges people are facing.
The obvious question that springs to mind is: how we can give RDF data more affordance?
Linked Data is all about giving affordance to data. Linking, and “follow your nose” access to data is a core part of the approach. By binding data to the web, making it accessible via a single click, we make it incredibly more useful.
Surely then Linked Data solves all of our problems: “your website is your API”, after all. I think there’s a lot of truth to that and rich Linked Data does remove many of the needs for an separate API.
But I don’t think it addresses all of the requirements, or at least: the current approaches and patterns for publishing Linked Data don’t address all of the requirements. Right now the main guidance is to focus on your domain modelling and the key entities and relationships that it contains. That’s good advice and a useful starting point.
But when you’re developing an application against a dataset, there are many more useful ways to partition the data, e.g.: by date, location, name, etc.
It’s entirely possible to materialize many of these partitions directly in the dataset — as yet more resources and links — but this quickly becomes unfeasible: there are too many useful data partions to realistically do this for any reasonably large or complex dataset. This is the exactly the gap that query languages, and specifically SPARQL, are designed to fill. But if we concede that SPARQL may be too complex for many cases, what other options can we explore?
SPARQL Stored Procedures and the Linked Data API
One option is to just build custom APIs. But this can be expensive to maintain, and can detract from the overall message of the core usefulness of publishing Linked Data. So, are there ways to surface useful views in declarative way, that both takes advantage of, and embraces the utility of a the underlying “web native” graph model?
Currently there are two approaches that we’ve explored. The first is what we’ve calling SPARQL Stored Procedures in Kasabi. This allows developers to:
- Bind a SPARQL query to a URL, causing that query to be automatically executed when a GET request is made to the URL
- Indicate that specific URL parameters be injected into the SPARQL query before it is executed, allowing the queries to be parameterized on a per-request basis
- Generate custom output formats (e.g. XML or JSON) from a query using XSLT stylesheets that can be applied to the query results based on the requested mimetype
Kasabi provides tools for creating this type of API, including the ability to create one based on a SPARQL query shared by another user. This greatly lowers the barrier to entry for sharing useful ways to work with data.
The ability to either access the results of the query directly, e.g. as SPARQL XML results or various RDF serializations, means the underlying graph is still accessible. You can just treat the feature as a convenience layer that hides some of the complexity. But providing custom output formats we can also help developers use the data using existing skills and tools.
The second approach has grown out of work on data.gov.uk. Jeni Tennison (TSO), Dave Reynolds (Epimorphics) and I explored various options for creating APIs over Linked Data, resulting in the publication of the Linked Data API which is in use at data.gov.uk, e.g. to support the excellent organogram visualizations.
As with SPARQL Stored Procedures, the Linked Data API provides a declarative way to create a RESTful API over RDF data sources. However rather than writing SPARQL queries directly, an API developer creates a configuration file that describes how various views of the data should be bound to web requests.
The Linked Data API is much more powerful (at the cost of some complexity), providing many more options for filtering and sorting through data, as well as simple XML and JSON result formats out of the box. In my opinion, the specification does a good job at weaving API interactions together with the underlying Linked Data, creating a very rich way to interact with a dataset. And one that has a lot more affordance than the equivalent SPARQL queries.
Again, Kasabi provides support for hosting this type of API. Right now the tooling is admittedly quite basic but we’re exploring way to make it more interactive. We’ve incorporated the “view source” principle into the custom API hosting feature as a whole, so its possible to view the configuration of an API to see how it was constructed.
I think both of these approaches can usefully provide ways for a wider developer community to get to grips with RDF and Linked Data, removing some of the hurdles to adoption. The tooling we’ve created in Kasabi is designed to allow skilled members of the community to directly drive this adoption by sharing queries and creating different kinds of APIs.
By separating the publication of datasets, from the creation of APIs — useful access paths into the dataset — we hope to let communities find and share useful ways to work with the available data, whatever their skills or preferred technologies.