March 16, 2016
Summary Overview of using MySQL or PostgreSQL as a triple store.¶
Introduction¶
The RDFLib 3 plugin interface supports using either a MySQL or PostgreSQL database to store and query your RDF graphs. This document describes how to use these backends, from loading large datasets into them to taking advantage of their query capabilities.
Bulk loading¶
If you need to load a large number of RDF statements into an empty database, RDFLib provides a module that can be run as a script to help you with this task. You can run this module with the command
$ python -m rdfextras.store.FOPLRelationalModel.MySQLMassLoader [options] <DB Type>;
note that several of the options are very important. Let’s start with an example.
If you wanted to load the RDF/XML file profiles.rdf
and the N-Triples file
targets.nt
into an empty MySQL database named plan
located at host
bubastis
accessible to user ozymandias
with password ramsesIII
, you
could use the following command:
$ python -m rdflib.store.FOPLRelationalModel.MySQLMassLoader \
-c db=plan,host=bubastis,user=ozymandias,password=ramsesIII \
-i plan \
-x profiles.rdf --nt=targets.nt \
MySQL
Here, we’re connecting to a MySQL database, but this script can also utilize a
PostgreSQL database with the PostgreSQL
keyword. The -c
option allows you
to specify the connection details for the target database; it is a
comma-separated string of variable assignments, as in the example above. As in
that example, it can specify the database with db
, the name of the target
machine with host
, the username with user
, and the password for that user
with password
. Also, you can specify the port on the target machine with
port
.
A single database can support multiple RDF stores; each such store has
an additional store “identifier”, which you must provide with the -i
option.
Once we have connected, we can load data from files that can be in various formats. This script supports identifying RDF/XML files to load with the -x option, TriX files with the -t option, N3 files with the -n option, N-Triples files with the –nt option, and RDFa files with the -a option. In addition, you can load all the files in a directory, assuming that they all have the same format. To do this, use the –directory option to identify the directory containing the files, and the –format option to specify the format of the files in that directory.
There are a few advanced options available for this script; you can use the -h option to get a summary of all the available options. You may also want to see the “Benchmarking” section, below, for specific examples that you can generalize.
Query¶
The RDFLib SPARQL implementation allows you to use the SPARQL language to query your RDF stores. The default implementation works entirely in memory; with a SQL backend, two different RDFLib components offer separate approaches to utilizing that backend to optimize the query. This section will eventually provide generic instructions for how to use the different query options, but until I get around to writing it see the “Benchmarking” section, below, for specific examples that you can generalize.
Benchmarking¶
When working on the various SQL backends, I found it useful to compare the results of the RDFLib store with the results obtained in Christian Becker’s RDF Store Benchmarks with DBpedia.
Walking through this process serves both as a good example to how to load and query large RDF datasets with an SQL backend, but also helps to judge the RDFLib backend against other options. Indeed, the DBpedia data set is interesting in its own right; loading and querying DBpedia may be a reasonably common use case on its own. For our benchmarking, we will compare both the MySQL and the PostgreSQL backends.
I obtained a set of results for this benchmark dataset on a dual core 1.86 GHz machine with 3.5 GB of RAM, running Ubuntu GNU/Linux 8.10. These specs do not completely align with Becker’s configuration, so the results are only roughly comparable. Also, note that I used MySQL version 5.0.67, and, importantly, PostgreSQL 8.4beta1.
Version 8.4 of PostgreSQL contains a large performance enhancement over previous versions, so if you want the best performance (and if you want to reproduce the results in this report), you will need to install your own PostgreSQL server until the next stable version makes it out into the wild.
Loading¶
To begin, we first need to load our data. To do this, we need to first create both a MySQL and a PostgreSQL database which will receive the data; these examples assume that this database is named ‘Becker_dbpedia’. This load process also assumes that we have downloaded and extracted the benchmark datasets to a data directory relative to the current directory. Once we have created a database, we can load that database (and time the load) with the following command:
$ time python -m rdfextras.store.FOPLRelationalModel.MySQLMassLoader \
-c db=Becker_dbpedia,host=localhost,user=username,password=password \
-i Becker_dbpedia \
--nt=data/infoboxes-fixed.nt --nt=data/geocoordinates-fixed.nt \
--nt=data/homepages-fixed.nt \
MySQL
Note that the name MySQLMassLoader
is a misnomer; it started life
targeting MySQL, but now supports both MySQL and PostgreSQL through its first
positional parameter. As such, we can load the data into PostgreSQL by
changing the argument from MySQL
to PostgreSQL
(in addition to changing
any relevant connection details in the connection string).
The results for the bulk load times are listed below. Note that in addition to the hardware differences listed above, we are also doing a bulk load of all the pieces at once, instead of loading the three pieces in stages.
Backend | Load time (seconds) |
---|---|
MySQL | 28,612 |
PostgreSQL (8.4beta1) | 7,812 |
Note
the PostgreSQL and MySQL load strategies are very different, which may account for the dramatic difference. Interestingly, it was a missing feature (the IGNORE keyword on the delimited load statement) that led to the construction of a different load mechanism in PostgreSQL, but it may turn out that the alternate load mechanism may work better on MySQL as well. I will continue to experiment with that.
Queries¶
Becker’s benchmark set includes five amusing queries; we can currently run the
first three of these queries, but the last two use SPARQL features that are
not currently supported by the RDFLib SPARQL processor. To run these queries,
we will use the rdfextras.tools.sparqler
script.
For both backends, we will run each query in up to four different ways. The RDFLib SPARQL processor has a new component that can completely translate SPARQL queries to equivalent SQL queries directly against the backend, so we will run each query using that component, and again without it. Also, for each component run, we may also provide range metadata to the processor as an optimization.
All available information about a specific subject¶
We run this query using the SPARQL to SQL translator using the sparqler.py command line below.
$ time python /home/john/development/rdfextras/tools/sparqler.py -s MySQL \
db=Becker_dbpedia,host=localhost,user=username,password=password Becker_dbpedia \
'SELECT ?p ?o WHERE {
<http://dbpedia.org/resource/Metropolitan_Museum_of_Art> ?p ?o
}' > results
We run this query using the original SPARQL implementation using the command line below.
$ time python /home/john/development/rdfextras/tools/sparqler.py \
--originalSPARQL -s MySQL \
db=Becker_dbpedia,host=localhost,user=username,password=password Becker_dbpedia \
'SELECT ?p ?o WHERE {
<http://dbpedia.org/resource/Metropolitan_Museum_of_Art> ?p ?o
}' > results
We must simply change ‘MySQL’ to ‘PostgreSQL’ in the above commands (and change connection parameters as necessary) to run the same queries against the PostgreSQL backend.
The results for this query are listed below. All times are in seconds. For this query, we do not add any range information, because we don’t know anything about the properties that may be involved.
Backend | SPARQL to SQL translator | Original implementation |
---|---|---|
MySQL | 2.063 | 2.013 |
PostgreSQL (8.4beta1) | 1.993 | 2.002 |
Two degrees of separation from Kevin Bacon¶
To run this query, we can replace the query in the above commands with the new query:
PREFIX p: <http://dbpedia.org/property/>
SELECT ?film1 ?actor1 ?film2 ?actor2
WHERE {
?film1 p:starring <http://dbpedia.org/resource/Kevin_Bacon> .
?film1 p:starring ?actor1 .
?film2 p:starring ?actor1 .
?film2 p:starring ?actor2 .
}
The results for this query are listed below. All times are in seconds. This time, we will also run the query with the range optimization; we know the http://dbpedia.org/property/starring property only ranges over resources, so we can add -r http://dbpedia.org/property/starring to the query command line to provide this hint to the query processor.
Backend | Translator | Original | Translator with hint | Original with hint |
---|---|---|---|---|
MySQL | 843 | 645 | 23.58 | 25.216 |
PostgreSQL (8.4beta1) | 68.36 | 82.64 | 23.38 | 80.45 |
Unconstrained query for artworks, artists, museums and their directors¶
To run this query, we can replace the query in the above commands with the new query:
PREFIX p: <http://dbpedia.org/property/>
SELECT ?artist ?artwork ?museum ?director
WHERE {
?artwork p:artist ?artist .
?artwork p:museum ?museum .
?museum p:director ?director
}
The results for this query are listed below. All times are in seconds. We will not use any range optimizations for this query.
Backend | SPARQL to SQL translator | Original implementation |
---|---|---|
MySQL | 1026 | 336 |
PostgreSQL (8.4beta1) | 98 | 5.074 |
API¶
This section describes how to use the RDFLib API to use either a MySQL or
PostgreSQL backend as a ConjunctiveGraph. This section assumes that you have
MySQL or PostgreSQL installed and configured correctly (particularly
permissions), as well as either the MySQLdb
, the pgdb
, the postgresql
or the psycopg
Python modules installed.
Setting up the database server is outside the scope of this document and so is installing the modules.
Here’s an example:
import rdflib
from rdflib import plugin, term, graph, namespace
db_type = 'PostgreSQL' # Use 'MySQL' instead, if that's what you have
store = plugin.get(db_type, rdflib.store.Store)(
identifier = 'some_ident',
configuration = 'user=u,password=p,host=h,db=d')
store.open(create=True) # only True when opening a store for the first time
g = graph.ConjunctiveGraph(store)
sg = graph.Graph(store, identifier=term.URIRef(
'tag:jlc6@po.cwru.edu,2009-08-20:bookmarks'))
sg.add((term.URIRef('http://www.google.com/'),
namespace.RDFS.label,
term.Literal('Google home page')))
sg.add((term.URIRef('http://wikipedia.org/'),
namespace.RDFS.label,
term.Literal('Wikipedia home page')))
Other general Graph/ConjunctiveGraph API uses here