Featured Dataset: ChEMBL-RDF, with Egon Willighagen

Posted on 08/23/2011 by


Published in Kasabi by Egon Willighagen, ChEMBL-RDF is a conversion into Linked Data of ChEMBL, an important dataset of chemicals from the European Bioinformatics Institute. I had a conversation with Egon, who is a postdoctoral research associate at the Karolinska Institutet in Stockholm, about this set, and asked him what it contains:

The database consists basically of the biological effects of drugs (for medication; small molecules, like aspirin). The data is extracted from literature and contains very many biological properties, including toxicological properties.

There currently are six main classes, as shown on: http://beta.kasabi.com/dataset/chembl-rdf/schema

Egon pointed to the new feature on dataset pages. If you have a look in the Explore tab on any set’s page, you can see the classes and explore some samples of data. The screenshot on the right shows an example resource from ChEMBL, in this case pointing to: http://data.kasabi.com/dataset/chembl-rdf/09/resource/r1109.html.

For some context around the dataset, Egon explained a bit about where ChEMBL has come from.

It was originally owned by a private company, but was bought with special funding from the Wellcome Trust and relicensed CC-BY-SA. Dr John Overington was at that company and moved along with the database to the EBI, where it is currently being developed further. Dr Anna Gaulton is the SQL schema expert, and the data contains a lot of information not yet available from the RDF version.

I asked Egon how he sees this dataset being used, and was delighted to learn that this is an incredibly active project, which has been put to use already:

The data is being used to understand the properties of those drugs: why some are more toxic than others, why some show side effects, etc. The data is used extensively in drug discovery.

This post features one example of how we used it.

Here we extracted data sets from ChEMBL for a given protein target, and data mined the chemical structures. The latter was done with the MoSS substructure mining algorithm. The goal of this work by my former student Annsofie Andersson was to extend our Bioclipse software to make this kind of integration possible. We needed to be able to make random queries against the database in a flexible way. SPARQL was best suited to this kind of dynamic interaction: the user searches for protein targets in the database and for properties of interest, and then makes the proper query to download all known structures for which that property has been measured against the protein target. After that, the substructure mining algorithm could do the further data analysis.
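To give a feel for the kind of query Egon describes, here is a minimal Python sketch that builds a SPARQL query for all structures with a given property measured against a given protein target. The `ex:` namespace, every predicate name, and the `build_query` helper are made up for illustration; the real ChEMBL-RDF vocabulary is the one shown on the dataset's schema page, and this is not how Bioclipse does it.

```python
from string import Template

# Hypothetical namespace and predicates, for illustration only.
# The actual ChEMBL-RDF terms are listed on the dataset's schema page.
PREFIXES = "PREFIX ex: <http://example.org/chembl-sketch#>\n"

QUERY = Template("""SELECT DISTINCT ?molecule ?smiles ?value WHERE {
  ?activity ex:onTarget ?target ;
            ex:activityType "$activity_type" ;
            ex:forMolecule ?molecule ;
            ex:standardValue ?value .
  ?target ex:title "$target_title" .
  ?molecule ex:smiles ?smiles .
}""")


def escape_literal(text):
    """Escape backslashes and double quotes for a SPARQL string literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def build_query(target_title, activity_type, limit=None):
    """Return a SPARQL query string for one target and one measured property."""
    body = QUERY.substitute(
        target_title=escape_literal(target_title),
        activity_type=escape_literal(activity_type),
    )
    query = PREFIXES + body
    if limit is not None:
        query += "\nLIMIT %d" % int(limit)
    return query


if __name__ == "__main__":
    # The target name and property type here are example inputs.
    print(build_query("Dihydrofolate reductase", "IC50", limit=10))
```

The resulting string could be sent to any SPARQL endpoint that hosts the data. The structures returned would then be the input for the substructure mining step.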

The ChEMBL dataset in Kasabi is a work in progress, and Egon’s got some plans for its next steps. There are resources in the original dataset which aren’t small molecules (including oligonucleotides, naturally :) ), which are yet to be available as RDF. He is also working on migrating remaining classes onto a mix of schemas (such as CHEMINF, Protein Ontology etc…), so ChEMBL’s model will be tweaked and improved over time.

We have a dataset which contains important research data, and has been put to use already (before its publication in Kasabi), so there are plenty of things to do with ChEMBL. So, I asked Egon if he had anything he’d like to say to sum up the discussion:

More data will come in with each ChEMBL release. I am two releases behind at this moment (ChEMBL 11 was released last week), and I have started a project to formalize the RDF further: http://groups.google.com/group/chembl-rdf.

I have a SPARQL end point running at rdf.farmbio.uu.se … but do not have the resources (time, development) to develop a Linked Data solution. This is where Kasabi got my interest, as it adds such functionality too, on top of various other services. I have those APIs enabled, but have not used any other than the Linked Resource and SPARQL APIs for ChEMBL-RDF. The fact that the ChEMBL-RDF data is now finally available as Linked Resources means that it now fulfils the requirements to be added to the Linked Open Data cloud: http://richard.cyganiak.de/2007/10/lod/. It has not been on that cloud diagram before.

I’d like to stress that the Open Data nature of this database has made these applications possible, and I hope that the hosting on Kasabi will trigger even more use cases of the ChEMBL data. In particular, I am looking forward to seeing the Augmentation API in action, but I need to explore first how to use it :)

There we have it. And I hope, alongside Egon, that we can find some more use for ChEMBL in its Linked form here at Kasabi too.
