SPARQL query with multiple aggregates exceeds memory limit - optimization

I am trying to generate some user statistics from a triple store using SPARQL. Please see the query below. How can this be improved? Am I doing something evil here? Why is this consuming so much memory? (see the background story at the end of this post)
I prefer to do the aggregation and the joins all inside the triple store. Splitting up the query would mean that I would have to join the results "manually", outside the database, losing the efficiency and optimizations of the triple store. No need to reinvent the wheel for no good reason.
The query
SELECT
?person
(COUNT(DISTINCT ?sent_email) AS ?sent_emails)
(COUNT(DISTINCT ?received_email) AS ?received_emails)
(COUNT(DISTINCT ?receivedInCC_email) AS ?receivedInCC_emails)
(COUNT(DISTINCT ?revision) AS ?commits)
WHERE {
?person rdf:type foaf:Person.
OPTIONAL {
?sent_email rdf:type email:Email.
?sent_email email:sender ?person.
}
OPTIONAL {
?received_email rdf:type email:Email.
?received_email email:recipient ?person.
}
OPTIONAL {
?receivedInCC_email rdf:type email:Email.
?receivedInCC_email email:ccRecipient ?person.
}
OPTIONAL {
?revision rdf:type vcs:VcsRevision.
?revision vcs:committedBy ?person.
}
}
GROUP BY ?person
ORDER BY DESC(?commits)
Background
The problem is that I get the error "QUERY MEMORY LIMIT REACHED" in AllegroGraph (please also see my related SO question). As the repository only contains around 200k triples, which easily fit into an (ntriples) input file of ca. 60 MB, I wonder how executing the query can require more than 4 GB of RAM, which is roughly two orders of magnitude more.

Try splitting the computation into subqueries, for example:
SELECT
?person
(MAX(?sent_emails_) AS ?sent_emails)
(MAX(?received_emails_) AS ?received_emails)
(MAX(?receivedInCC_emails_) AS ?receivedInCC_emails)
(MAX(?commits_) AS ?commits)
WHERE {
{
SELECT
?person
(COUNT(DISTINCT ?sent_email) AS ?sent_emails_)
(0 AS ?received_emails_)
(0 AS ?receivedInCC_emails_)
(0 AS ?commits_)
WHERE {
?sent_email rdf:type email:Email.
?sent_email email:sender ?person.
?person rdf:type foaf:Person.
} GROUP BY ?person
} union {
(similar pattern for the others)
....
}
}
GROUP BY ?person
ORDER BY DESC(?commits)
The objective is to:
avoid generating a huge number of intermediate rows that then have to be processed for aggregation
avoid OPTIONAL {} patterns, which can also hurt performance
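To see where the memory goes in the original query: the four OPTIONAL blocks are independent of each other, so for each person every sent email is paired with every received email, every CC'd email and every commit. Someone with 100 of each produces 100^4 = 100,000,000 intermediate rows before COUNT(DISTINCT ...) collapses them. A variant of the same idea, sketched here and not tested on AllegroGraph, computes each count in its own subquery and attaches it to the person with OPTIONAL, so each OPTIONAL contributes at most one row per person. The prefixes are assumed to be declared as in the question; the short variable names (?e, ?r, ?sent, ?received, ?cc, ?revs) are made up for this sketch.

```sparql
# Sketch only: rdf:, foaf:, email: and vcs: are assumed to be declared as in the question.
SELECT
  ?person
  (COALESCE(?sent, 0) AS ?sent_emails)
  (COALESCE(?received, 0) AS ?received_emails)
  (COALESCE(?cc, 0) AS ?receivedInCC_emails)
  (COALESCE(?revs, 0) AS ?commits)
WHERE {
  ?person rdf:type foaf:Person .
  # Each subquery is grouped by ?person, so it yields at most one row per person.
  OPTIONAL {
    SELECT ?person (COUNT(DISTINCT ?e) AS ?sent)
    WHERE { ?e rdf:type email:Email ; email:sender ?person . }
    GROUP BY ?person
  }
  OPTIONAL {
    SELECT ?person (COUNT(DISTINCT ?e) AS ?received)
    WHERE { ?e rdf:type email:Email ; email:recipient ?person . }
    GROUP BY ?person
  }
  OPTIONAL {
    SELECT ?person (COUNT(DISTINCT ?e) AS ?cc)
    WHERE { ?e rdf:type email:Email ; email:ccRecipient ?person . }
    GROUP BY ?person
  }
  OPTIONAL {
    SELECT ?person (COUNT(DISTINCT ?r) AS ?revs)
    WHERE { ?r rdf:type vcs:VcsRevision ; vcs:committedBy ?person . }
    GROUP BY ?person
  }
}
ORDER BY DESC(?commits)
```

COALESCE turns a missing count into 0 for people who have no matching emails or commits. Unlike the UNION form, this keeps OPTIONAL, so it is worth measuring both on your store.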

Related

What is the best way to discover metadata about an RDF schema from a SPARQL endpoint? [duplicate]

Whenever I start using SQL I tend to throw a couple of exploratory statements at the database in order to understand what is available, and what form the data takes.
e.g.
show tables
describe table
select * from table
Could anyone help me understand the way to complete a similar exploration of an RDF datastore using a SPARQL endpoint?
Well, the obvious first start is to look at the classes and properties present in the data.
Here is how to see what classes are being used:
SELECT DISTINCT ?class
WHERE {
?s a ?class .
}
LIMIT 25
OFFSET 0
(LIMIT and OFFSET are there for paging. It is worth getting used to these especially if you are sending your query over the Internet. I'll omit them in the other examples.)
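A small sketch of fetching the next page: OFFSET skips the first 25 results and LIMIT caps the page size. The ORDER BY is an addition here; without a defined order, the store is free to return rows in a different order on each request, so pages could overlap or skip results.

```sparql
# Second page of 25 classes; ORDER BY keeps the pages consistent between requests.
SELECT DISTINCT ?class
WHERE {
  ?s a ?class .
}
ORDER BY ?class
LIMIT 25
OFFSET 25
```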
a is a special SPARQL (and Notation3/Turtle) syntax to represent the rdf:type predicate - this links individual instances to owl:Class/rdfs:Class types (roughly equivalent to tables in SQL RDBMSes).
Secondly, you want to look at the properties. You can do this either by using the classes you've searched for or just looking for properties. Let's just get all the properties out of the store:
SELECT DISTINCT ?property
WHERE {
?s ?property ?o .
}
This will get every property in the store, which is probably more than you are interested in. It is roughly equivalent to a list of all the columns in a SQL database, without any grouping by table.
More useful is to see what properties are being used by instances that declare a particular class:
SELECT DISTINCT ?property
WHERE {
?s a <http://xmlns.com/foaf/0.1/Person>;
?property ?o .
}
This will get you back the properties used on any instances that satisfy the first triple - namely, that have the rdf:type of http://xmlns.com/foaf/0.1/Person.
Remember, because an rdfs:Resource can have multiple rdf:type properties - classes if you will - and because RDF's data model is additive, you don't have a diamond problem. The type is just another property - it's just a useful social agreement to say that some things are persons or dogs or genes or football teams. It doesn't mean that the data store is going to contain properties usually associated with that type. The type doesn't guarantee anything in terms of what properties a resource might have.
You need to familiarise yourself with the data model and the use of SPARQL's UNION and OPTIONAL syntax. The rough mapping of rdf:type to SQL tables is just that - rough.
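As a small illustration of those two constructs: UNION matches resources typed as either class, and OPTIONAL keeps a resource in the results even when it lacks the property. foaf:Organization, foaf:name and foaf:mbox are real FOAF terms, but whether your dataset uses them is an assumption.

```sparql
PREFIX foaf: <http://xmlns.com/foaf/0.1/>

SELECT ?s ?name ?mbox
WHERE {
  # UNION: the resource may be declared as either class
  { ?s a foaf:Person } UNION { ?s a foaf:Organization }
  # OPTIONAL: rows are kept even if no name or mailbox is present
  OPTIONAL { ?s foaf:name ?name }
  OPTIONAL { ?s foaf:mbox ?mbox }
}
LIMIT 25
```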
You might want to know what kind of entity the property is pointing to. Firstly, you probably want to know about datatype properties, whose values are literals (roughly equivalent to primitives): strings, integers, etc. In RDF every literal is written as a string (its lexical form), optionally tagged with a datatype or a language. We can keep just those properties whose values are literals using the SPARQL filter function isLiteral:
SELECT DISTINCT ?property
WHERE {
?s a <http://xmlns.com/foaf/0.1/Person>;
?property ?o .
FILTER isLiteral(?o)
}
We are here only going to get properties that have as their object a literal - a string, date-time, boolean, or one of the other XSD datatypes.
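If you also want to know which datatypes those literals carry, one possible extension (a sketch, not part of the original answer) is to project the built-in DATATYPE function alongside the property. The variable name ?datatype is made up for this example.

```sparql
# Lists each literal-valued property together with the datatypes of its values.
SELECT DISTINCT ?property (DATATYPE(?o) AS ?datatype)
WHERE {
  ?s a <http://xmlns.com/foaf/0.1/Person> ;
     ?property ?o .
  FILTER isLiteral(?o)
}
```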
But what about the non-literal objects? Consider this very simple pseudo-Java class definition as an analogy:
public class Person {
int age;
Person marriedTo;
}
Using the above query, we would get back the age property (if it is set), because its value is a literal. But marriedTo isn't a primitive (i.e. a literal in RDF terms) - it's a reference to another object - in RDF/OWL terminology, that's an object property. But we don't know what sort of objects those properties (predicates) refer to. This query will get you back the properties along with the accompanying types (the classes of which the ?o values are members).
SELECT DISTINCT ?property ?class
WHERE {
?s a <http://xmlns.com/foaf/0.1/Person>;
?property ?o .
?o a ?class .
FILTER(!isLiteral(?o))
}
That should be enough to orient yourself in a particular dataset. Of course, I'd also recommend that you just pull out some individual resources and inspect them. You can do that using the DESCRIBE query:
DESCRIBE <http://example.org/resource>
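What DESCRIBE returns is left up to the store, so results differ between endpoints. A plain SELECT gives a more predictable view of the same resource; this is a sketch, and ?direction and ?other are made-up variable names.

```sparql
# Triples where the resource is the subject ("out") or the object ("in").
SELECT ?direction ?p ?other
WHERE {
  { <http://example.org/resource> ?p ?other . BIND("out" AS ?direction) }
  UNION
  { ?other ?p <http://example.org/resource> . BIND("in" AS ?direction) }
}
LIMIT 100
```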
There are some SPARQL tools - SNORQL, for instance - that let you do this in a browser. The SNORQL instance I've linked to has a sample query for exploring the possible named graphs, which I haven't covered here.
If you are unfamiliar with SPARQL, honestly, the best resource if you get stuck is the specification. It's a W3C spec but a pretty good one (they built a decent test suite so you can actually see whether implementations have done it properly or not) and if you can get over the complicated language, it is pretty helpful.
I find the following set of exploratory queries useful:
Seeing the classes:
select distinct ?type ?label
where {
?s a ?type .
OPTIONAL { ?type rdfs:label ?label }
}
Seeing the properties:
select distinct ?objprop ?label
where {
?objprop a owl:ObjectProperty .
OPTIONAL { ?objprop rdfs:label ?label }
}
Seeing the data properties:
select distinct ?dataprop ?label
where {
?dataprop a owl:DatatypeProperty .
OPTIONAL { ?dataprop rdfs:label ?label }
}
Seeing which properties are actually used:
select distinct ?p ?label
where {
?s ?p ?o .
OPTIONAL { ?p rdfs:label ?label }
}
Seeing what entities are asserted:
select distinct ?entity ?elabel ?type ?tlabel
where {
?entity a ?type .
OPTIONAL { ?entity rdfs:label ?elabel } .
OPTIONAL { ?type rdfs:label ?tlabel }
}
Seeing the distinct graphs in use:
select distinct ?g where {
graph ?g {
?s ?p ?o
}
}
SELECT DISTINCT * WHERE {
?s ?p ?o
}
LIMIT 10
I often refer to this list of queries from the voiD project. They are mainly of a statistical nature, but not only. It shouldn't be hard to remove the COUNTs from some statements to get the actual values.
Especially with large datasets, it is important to distinguish the pattern from the noise and to understand which structures are used a lot and which are rare. Instead of SELECT DISTINCT, I use aggregation queries to count the major classes, predicates etc. For example, here's how to see the most important predicates in your dataset:
SELECT ?pred (COUNT(*) as ?triples)
WHERE {
?s ?pred ?o .
}
GROUP BY ?pred
ORDER BY DESC(?triples)
LIMIT 100
I usually start by listing the graphs in a repository and their sizes, then look at classes (again with counts) in the graph(s) of interest, then the predicates of the class(es) I am interested in, etc.
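The first two steps of that routine could look roughly like this. It is a sketch: <http://example.org/graph> is a placeholder for whichever graph the first query shows to be interesting, and the two queries are meant to be run separately.

```sparql
# Query 1: named graphs and their sizes, largest first.
SELECT ?g (COUNT(*) AS ?triples)
WHERE {
  GRAPH ?g { ?s ?p ?o }
}
GROUP BY ?g
ORDER BY DESC(?triples)

# Query 2 (run separately): classes in one graph, with instance counts.
SELECT ?class (COUNT(DISTINCT ?s) AS ?instances)
WHERE {
  GRAPH <http://example.org/graph> { ?s a ?class }
}
GROUP BY ?class
ORDER BY DESC(?instances)
LIMIT 100
```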
Of course these selectors can be combined and restricted if appropriate. To see what predicates are defined for instances of type foaf:Person, and break this down by graph, you could use this:
SELECT ?g ?pred (COUNT(*) as ?triples)
WHERE {
GRAPH ?g {
?s a foaf:Person .
?s ?pred ?o .
}
}
GROUP BY ?g ?pred
ORDER BY ?g DESC(?triples)
This will list each graph with the predicates in it, in descending order of frequency.

Can I use SPARQL to query DBPedia for information about Wiki pages such as page length or number of times an article was accessed?

SELECT *
WHERE {
?Person a <http://dbpedia.org/ontology/Comedian> .
?Person <http://dbpedia.org/ontology/influenced> ?influenced.
?Person <http://dbpedia.org/ontology/birthDate> ?birthDate.
?Person <http://dbpedia.org/ontology/wikiPageLength> ?weight.
}
I had thought that the above code would, when plugged into snorql, generate a dataset that included the length of the Comedian page being queried. Instead, it's generating an empty dataset. By contrast, the code below generates a non-empty dataset, but one which does not include the page length.
SELECT *
WHERE {
?Person a <http://dbpedia.org/ontology/Comedian> .
?Person <http://dbpedia.org/ontology/influenced> ?influenced.
?Person <http://dbpedia.org/ontology/birthDate> ?birthDate.
}
Is there a way to query DBpedia for information about a Wikipedia page that isn't included in the page itself, such as the length of the page or the number of times it is accessed?
While there is such a property declared (e.g., see http://dbpedia.org/ontology/wikiPageLength), it doesn't appear to actually be used in describing any resources. E.g., the following query returns 0:
select (count(*) as ?n) { ?s dbo:wikiPageLength ?l }

SPARQL query for all people for an institution on dbpedia

I'm trying to extract alumni lists for universities using SPARQL.
I've identified the ontologies I need:
http://mappings.dbpedia.org/server/ontology/classes/University
http://mappings.dbpedia.org/server/ontology/classes/Person
I tried this query, which you can examine here:
SELECT * WHERE {
?University dbpedia2:alumni ?Person .
}
This seemed to make sense, except that it returns counts instead of the people that the ontology says the property should contain.
I found this query somewhere which seemed to do a better job finding universities, but was very slow.
SELECT * WHERE {
{ <http://dbpedia.org/ontology/University> ?property ?hasValue }
UNION
{ ?isValueOf ?property <http://dbpedia.org/ontology/University> }
}
I also tried going the other way, starting with all people and looking for their almae matres, in this form:
SELECT * WHERE {
?person dbpedia2:almaMater ?University
}
But this is much slower, possibly because searching through the people space is too laborious. This does actually work, but it returns a different set of results in practice: all people with a listed alma mater, rather than all people listed by universities as alumni. I'd prefer a syntax that gets me the alumni.
How can I phrase this to return all alumni listed for universities?
The performance of DBpedia's SPARQL endpoint can be a bit unreliable at times. After all, it's a public service, and isn't intended for huge queries. Nonetheless, I think you can get what you're looking for here without too much trouble. First, you can check how many results there are with a query like this at the public SPARQL endpoint:
select (count(*) as ?nResults) where {
?person dbpedia-owl:almaMater ?almaMater
}
SPARQL results (64928)
Now, if you just want the big list, you'd get it like this. The order by helps organize the results for easy consumption, but isn't technically necessary:
select ?almaMater ?person where {
?person dbpedia-owl:almaMater ?almaMater
}
order by ?almaMater ?person
SPARQL results
If you need to place some additional restrictions on ?almaMater, e.g., to ensure that it's a university, then you can add them to the query. For instance:
select ?almaMater ?person where {
?person dbpedia-owl:almaMater ?almaMater .
?almaMater a dbpedia-owl:University .
}
order by ?almaMater ?person
SPARQL results
In your last query, you are almost there. However, you are currently asking for any resource that can take the place of the ?University variable. As you only want universities to take that place, you can use another triple to further restrict that variable:
SELECT * WHERE {
?University a dbpedia-owl:University.
?person dbpedia2:almaMater ?University.
}
This means that ?University can only be an individual of class dbpedia-owl:University (where dbpedia-owl is mapped to http://dbpedia.org/ontology/).
Your first query:
SELECT * WHERE {
?University dbpedia2:alumni ?Person .
}
isn't just returning counts; it's returning both counts and individual alumni. Apparently DBpedia's data here is of poor quality, and a number of triples misuse the dbpedia2:alumni relation.
You can filter out the counts by adding a second condition requiring that whatever is bound to ?person be a member of the appropriate class:
SELECT * WHERE {
?university dbpedia2:alumni ?person .
?person rdf:type <http://dbpedia.org/ontology/Person>
}
What you see running this is that there are very few individuals tagged as alumni; the data is surprisingly scant, unfortunately.
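If requiring the Person type turns out to be too strict, a looser filter (a sketch, not from the answers above) is to keep any value that is a resource rather than a literal count, using the built-in isIRI function. The prefix declaration assumes that dbpedia2: stands for http://dbpedia.org/property/, as it is commonly predefined on the DBpedia endpoint.

```sparql
# Assumption: dbpedia2: is the DBpedia "property" namespace.
PREFIX dbpedia2: <http://dbpedia.org/property/>

SELECT ?university ?person WHERE {
  ?university dbpedia2:alumni ?person .
  # Drops literal values such as counts, keeps links to resources.
  FILTER isIRI(?person)
}
```

This still returns anything a university links to as alumni, whether or not it is typed as a person.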

Limit a SPARQL query to one dataset

I'm working with the following SPARQL query, which is an example on the web-based end of my institution's SPARQL endpoint:
SELECT ?building_number ?name ?occupants WHERE {
?site a org:Site ;
rdfs:label "Highfield Campus" .
?building spacerel:within ?site ;
skos:notation ?building_number ;
rdfs:label ?name .
OPTIONAL {
?building soton:buildingOccupants ?occ .
?occ rdfs:label ?occupants .
} .
} ORDER BY ?name
The problem is that, as well as getting data from 'Buildings and Places' (the dataset I'm interested in, and the one I would expect the example to use), it also gets data from the 'Facilities and Equipment' dataset, which isn't relevant. You should see this if you follow the link.
I suspect the example may pre-date the addition of the Facilities and Equipment dataset, but even with the research I've done into SPARQL, I can't see a clear way to define which datasets to include.
Can anyone recommend a starting point to limit it to just show 'Buildings', or, more specifically, results from the 'Buildings and Places' dataset?
Thanks
First things first, you really need to use SELECT DISTINCT, as otherwise you'll get repeated results.
To answer your question, you can use GRAPH { ... } to restrict parts of a SPARQL query to match only data from a specific dataset. This only works if the SPARQL endpoint is divided up into GRAPHs (this one is). The solution you asked for isn't the best choice, as it assumes that things within sites in the 'places' dataset will always be restricted to buildings. That's risky, as the dataset might end up containing trees and signposts at some time in the future.
Step one is to just find out what graphs are in play:
SELECT DISTINCT ?g1 ?building_number ?name ?occupants WHERE {
?site a org:Site ;
rdfs:label "Highfield Campus" .
GRAPH ?g1 { ?building spacerel:within ?site ;
skos:notation ?building_number ;
rdfs:label ?name .
}
OPTIONAL {
?building soton:buildingOccupants ?occ .
?occ rdfs:label ?occupants .
} .
} ORDER BY ?name
Try it here: http://is.gd/WdRAGX
From this you can see that http://id.southampton.ac.uk/dataset/places/latest and http://id.southampton.ac.uk/dataset/places/facilities are the two relevant ones.
To only look for things 'within' a site according to the "places" graph, use:
SELECT DISTINCT ?building_number ?name ?occupants WHERE {
?site a org:Site ;
rdfs:label "Highfield Campus" .
GRAPH <http://id.southampton.ac.uk/dataset/places/latest> {
?building spacerel:within ?site ;
skos:notation ?building_number ;
rdfs:label ?name .
}
OPTIONAL {
?building soton:buildingOccupants ?occ .
?occ rdfs:label ?occupants .
} .
} ORDER BY ?name
Alternate solutions:
Using rdf:type
Above I've answered your question, but it's not the answer to your problem. The following solution is more semantic, as it actually says 'only give me buildings within the campus', which is what you really mean.
Instead of filtering by graph, which is not very 'semantic', you could restrict ?building to be of class 'building', which research facilities are not. Facilities are still sometimes listed as 'within' a site, usually when the uni has only published which campus they are on but not which building.
?building a rooms:Building
Using FILTER
In extreme cases you may not have data in different GRAPHS and there may not be an elegant relationship to use to filter your results. In this case you can use a FILTER and turn the building URI into a string and use a regular expression to match acceptable ones:
FILTER regex(str(?building), "^http://id.southampton.ac.uk/building/")
This is by far the worst option; don't use it unless you have to.
Belt and Braces
You can use any of these restrictions together, and a combination of restricting the GRAPH plus ensuring that all ?buildings really are buildings would be my recommended solution.
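Put together, the combined query might look like the sketch below. It assumes that the rooms:Building type triple lives in the same 'places' graph as the other building data (if it doesn't, move that triple outside the GRAPH block), and that the prefixes are predefined by the endpoint as in the queries above.

```sparql
# Sketch: org:, rdfs:, spacerel:, skos:, soton: and rooms: are assumed to be predefined by the endpoint.
SELECT DISTINCT ?building_number ?name ?occupants WHERE {
  ?site a org:Site ;
        rdfs:label "Highfield Campus" .
  # Belt: only trust 'within' statements from the places dataset.
  GRAPH <http://id.southampton.ac.uk/dataset/places/latest> {
    # Braces: and only accept things that are declared to be buildings.
    ?building a rooms:Building ;
              spacerel:within ?site ;
              skos:notation ?building_number ;
              rdfs:label ?name .
  }
  OPTIONAL {
    ?building soton:buildingOccupants ?occ .
    ?occ rdfs:label ?occupants .
  }
} ORDER BY ?name
```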