SPARQL performance in property path query (Sesame / rdf4j)

Let's say I want to find all subjects that are connected to given objects via a property path. The connection can be represented as:
Subject - prop 1 -> A - prop 2 -> B - prop 3 -> Object
This can be achieved with a quite simple SPARQL query:
SELECT ?s WHERE {
?s prop1/prop2/prop3 ?o .
VALUES ?o { <uri1> ... <urin> }
}
But I also want to include paths which use subclasses of A and/or B:
Subject - prop 1 -> subclassOfA - prop 2 -> subclassOfB - prop 3 -> Object
To achieve this I've added an intermediate "subclassOf" property into the path:
SELECT ?s WHERE {
?s prop1/<subclassOf>*/prop2/<subclassOf>*/prop3 ?o .
VALUES ?o { <uri1> ... <urin> }
}
This ran really fast on my dataset in Sesame 2.7.2, but after migrating to rdf4j 2.5.2 the query just hangs. Is this the correct way to write such a query, or is there something much more efficient? And what could have caused such a significant performance drop in the newer version?
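To make the semantics of the path concrete, here is a minimal Python sketch that evaluates prop1/subclassOf*/prop2/subclassOf*/prop3 over an in-memory edge list, walking backwards from the known objects. The edge data and the function names are made up for illustration. It only shows what the path matches and says nothing about how Sesame or rdf4j actually plan the query.

```python
def step_back(edges, prop, targets):
    """Nodes x that have an edge x -prop-> t for some t in targets."""
    return {s for s, p, o in edges if p == prop and o in targets}


def star_back(edges, prop, targets):
    """Nodes that reach some target via zero or more prop edges."""
    seen = set(targets)
    frontier = set(targets)
    while frontier:
        # Only expand nodes we have not visited yet, so cycles terminate.
        frontier = step_back(edges, prop, frontier) - seen
        seen |= frontier
    return seen


def subjects(edges, objects):
    """Evaluate ?s prop1/subclassOf*/prop2/subclassOf*/prop3 ?o from the object side."""
    nodes = step_back(edges, "prop3", set(objects))
    nodes = star_back(edges, "subclassOf", nodes)
    nodes = step_back(edges, "prop2", nodes)
    nodes = star_back(edges, "subclassOf", nodes)
    return step_back(edges, "prop1", nodes)


# Hypothetical data: s1 goes through a subclass of A, s2 through a subclass of B.
edges = [
    ("s1", "prop1", "subA"), ("subA", "subclassOf", "A"), ("A", "prop2", "B"),
    ("s2", "prop1", "A2"), ("A2", "prop2", "subB"), ("subB", "subclassOf", "B"),
    ("B", "prop3", "o1"),
    ("s3", "prop1", "C"),  # not connected to o1
]

print(sorted(subjects(edges, ["o1"])))
```

Starting from the bound objects keeps every intermediate set small in this toy example. Whether a given engine evaluates the path in that direction is something its query plan would have to confirm.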

Related

What is the best way to discover metadata about an RDF schema from a SPARQL endpoint? [duplicate]

Whenever I start using SQL, I tend to throw a couple of exploratory statements at the database in order to understand what is available and what form the data takes.
e.g.
show tables
describe table
select * from table
Could anyone help me understand the way to complete a similar exploration of an RDF datastore using a SPARQL endpoint?
Well, the obvious first start is to look at the classes and properties present in the data.
Here is how to see what classes are being used:
SELECT DISTINCT ?class
WHERE {
?s a ?class .
}
LIMIT 25
OFFSET 0
(LIMIT and OFFSET are there for paging. It is worth getting used to these especially if you are sending your query over the Internet. I'll omit them in the other examples.)
a is a special SPARQL (and Notation3/Turtle) syntax to represent the rdf:type predicate - this links individual instances to owl:Class/rdfs:Class types (roughly equivalent to tables in SQL RDBMSes).
Secondly, you want to look at the properties. You can do this either by using the classes you've searched for or just looking for properties. Let's just get all the properties out of the store:
SELECT DISTINCT ?property
WHERE {
?s ?property ?o .
}
This will get all the properties, which you probably aren't interested in. This is equivalent to a list of all the row columns in SQL, but without any grouping by the table.
More useful is to see what properties are being used by instances that declare a particular class:
SELECT DISTINCT ?property
WHERE {
?s a <http://xmlns.com/foaf/0.1/Person>;
?property ?o .
}
This will get you back the properties used on any instances that satisfy the first triple - namely, that have the rdf:type of http://xmlns.com/foaf/0.1/Person.
Remember, because an rdfs:Resource can have multiple rdf:type properties - classes if you will - and because RDF's data model is additive, you don't have a diamond problem. The type is just another property - it's just a useful social agreement to say that some things are persons or dogs or genes or football teams. It doesn't mean that the data store is going to contain properties usually associated with that type. The type doesn't guarantee anything in terms of what properties a resource might have.
You need to familiarise yourself with the data model and the use of SPARQL's UNION and OPTIONAL syntax. The rough mapping of rdf:type to SQL tables is just that - rough.
You might want to know what kind of entity the property is pointing to. Firstly, you probably want to know about datatype properties, whose values are literals - the equivalent of primitives. You know, strings, integers, etc. In RDF, every literal has a string lexical form, optionally tagged with a datatype or a language. We can keep just those properties whose values are literals using the SPARQL filter function isLiteral:
SELECT DISTINCT ?property
WHERE {
?s a <http://xmlns.com/foaf/0.1/Person>;
?property ?o .
FILTER isLiteral(?o)
}
We are here only going to get properties that have as their object a literal - a string, date-time, boolean, or one of the other XSD datatypes.
But what about the non-literal objects? Consider this very simple pseudo-Java class definition as an analogy:
public class Person {
int age;
Person marriedTo;
}
Using the above query, we would get back the literal that would represent age if the age property is bound. But marriedTo isn't a primitive (i.e. a literal in RDF terms) - it's a reference to another object - in RDF/OWL terminology, that's an object property. But we don't know what sort of objects are being referred to by those properties (predicates). This query will get you back properties with the accompanying types (the classes that the ?o values are members of):
SELECT DISTINCT ?property ?class
WHERE {
?s a <http://xmlns.com/foaf/0.1/Person>;
?property ?o .
?o a ?class .
FILTER(!isLiteral(?o))
}
That should be enough to orient yourself in a particular dataset. Of course, I'd also recommend that you just pull out some individual resources and inspect them. You can do that using the DESCRIBE query:
DESCRIBE <http://example.org/resource>
There are some SPARQL tools - SNORQL, for instance - that let you do this in a browser. The SNORQL instance I've linked to has a sample query for exploring the possible named graphs, which I haven't covered here.
If you are unfamiliar with SPARQL, honestly, the best resource if you get stuck is the specification. It's a W3C spec but a pretty good one (they built a decent test suite so you can actually see whether implementations have done it properly or not) and if you can get over the complicated language, it is pretty helpful.
I find the following set of exploratory queries useful:
Seeing the classes:
select distinct ?type ?label
where {
?s a ?type .
OPTIONAL { ?type rdfs:label ?label }
}
Seeing the object properties:
select distinct ?objprop ?label
where {
?objprop a owl:ObjectProperty .
OPTIONAL { ?objprop rdfs:label ?label }
}
Seeing the data properties:
select distinct ?dataprop ?label
where {
?dataprop a owl:DatatypeProperty .
OPTIONAL { ?dataprop rdfs:label ?label }
}
Seeing which properties are actually used:
select distinct ?p ?label
where {
?s ?p ?o .
OPTIONAL { ?p rdfs:label ?label }
}
Seeing what entities are asserted:
select distinct ?entity ?elabel ?type ?tlabel
where {
?entity a ?type .
OPTIONAL { ?entity rdfs:label ?elabel } .
OPTIONAL { ?type rdfs:label ?tlabel }
}
Seeing the distinct graphs in use:
select distinct ?g where {
graph ?g {
?s ?p ?o
}
}
SELECT DISTINCT * WHERE {
?s ?p ?o
}
LIMIT 10
I often refer to this list of queries from the voiD project. They are mainly of a statistical nature, but not only. It shouldn't be hard to remove the COUNTs from some statements to get the actual values.
Especially with large datasets, it is important to distinguish the pattern from the noise and to understand which structures are used a lot and which are rare. Instead of SELECT DISTINCT, I use aggregation queries to count the major classes, predicates etc. For example, here's how to see the most important predicates in your dataset:
SELECT ?pred (COUNT(*) as ?triples)
WHERE {
?s ?pred ?o .
}
GROUP BY ?pred
ORDER BY DESC(?triples)
LIMIT 100
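For readers who think more easily in code than in SPARQL, the aggregation above amounts to counting triples per predicate and sorting by that count. A minimal Python sketch over a handful of made-up triples (the data and the function name predicate_counts are illustrative, not part of any RDF library):

```python
from collections import Counter

# Hypothetical in-memory triples standing in for a real store.
triples = [
    ("ex:alice", "rdf:type", "foaf:Person"),
    ("ex:alice", "foaf:name", "Alice"),
    ("ex:bob", "rdf:type", "foaf:Person"),
    ("ex:bob", "foaf:name", "Bob"),
    ("ex:bob", "foaf:knows", "ex:alice"),
]


def predicate_counts(triples):
    """Return (predicate, count) pairs, most frequent first."""
    counts = Counter(p for _, p, _ in triples)
    # Sort by descending count, then by name so that ties are deterministic.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


for pred, n in predicate_counts(triples):
    print(pred, n)
```

The SPARQL query does the same work inside the store, which is what you want for anything larger than a toy dataset.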
I usually start by listing the graphs in a repository and their sizes, then look at classes (again with counts) in the graph(s) of interest, then the predicates of the class(es) I am interested in, etc.
Of course these selectors can be combined and restricted if appropriate. To see what predicates are defined for instances of type foaf:Person, and break this down by graph, you could use this:
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?g ?pred (COUNT(*) as ?triples)
WHERE {
GRAPH ?g {
?s a foaf:Person .
?s ?pred ?o .
}
}
GROUP BY ?g ?pred
ORDER BY ?g DESC(?triples)
This will list each graph with the predicates in it, in descending order of frequency.

How to completely delete an RDF node/instance from a graph database?

Good day, I am using GraphDB to store some triples as seen in the image below. This particular RDF node uses a regular URI, http://example/regular/uri. What I wish to do is not only delete all properties attached to this node, but also delete the node itself, so that http://example/regular/uri no longer appears in the graph database.
So far I am only able to delete all properties, but I am not able to delete the actual RDF node itself. It seemed rather simple, but the more I research online, the more this seems impossible unless clearing the complete graph.
I have tried simple "delete where" queries as shown in example 11 of the SPARQL documentation. I have also tried simple "delete where" queries using the wildcard operator, as shown in the query below:
Is there a way to delete such RDF nodes?
Thanks in advance!
A node exists in a graph as long as there is one or more triples with that node in subject or object position. So the easiest way would be to issue two delete statements, one deleting all statements with the node in subject position and one deleting all statements with the node in object position. But if you need/want to do it with a single operation you can do that as well with filters.
Here is a sample that deletes uri://node/to/delete from uri://my/graph:
DELETE { GRAPH <uri://my/graph> {
?s ?p ?o .
}}
USING <uri://my/graph>
WHERE {
{
?s ?p ?o . VALUES ?s { <uri://node/to/delete>}
} UNION {
?s ?p ?o . VALUES ?o { <uri://node/to/delete>}
}
}
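The point that a node only "exists" through the triples that mention it can be illustrated outside of any triple store. The following Python sketch treats a graph as a plain set of triples, with made-up data and helper names. Once every triple with the node in subject or object position is gone, the node no longer appears anywhere.

```python
def delete_node(triples, node):
    """Return the triples that mention `node` neither as subject nor as object."""
    return {(s, p, o) for (s, p, o) in triples if s != node and o != node}


def nodes(triples):
    """All resources that occur in subject or object position."""
    return {s for s, _, _ in triples} | {o for _, _, o in triples}


# Hypothetical graph: the node to delete is used both as subject and as object.
graph = {
    ("uri://node/to/delete", "ex:label", "some label"),
    ("uri://node/to/delete", "ex:linksTo", "uri://other/node"),
    ("uri://other/node", "ex:linksTo", "uri://node/to/delete"),
    ("uri://other/node", "ex:label", "other label"),
}

remaining = delete_node(graph, "uri://node/to/delete")
print("uri://node/to/delete" in nodes(remaining))
```

This mirrors the UNION in the SPARQL update above: one branch covers the subject position, the other the object position.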

SparQL: Replace subject URIs

We have graphs containing oa:Annotations with certain properties. Since I am working on a local copy of the server, it would be useful to me to change these URIs to point to localhost. According to the book I read, I thought this should work, but it does not. It does not seem to change anything. Still, the server returns a 204, I am using the priting port and url (/update). So I definitely should be able to change things. There is no error message.
PREFIX oa: <http://www.w3.org/ns/oa#>
DELETE
{ GRAPH ?g {?oldIRI ?p ?o} }
INSERT
{ GRAPH ?g {?newIRI ?p ?o} }
WHERE
{
GRAPH ?g {
?oldIRI a oa:Annotation .
?oldIRI ?p ?o .
}
BIND(
CONCAT("http://localhost:80",
SUBSTR( STR(?oldIRI),
34,
STRLEN(STR(?oldIRI)) )
) AS ?newIRI
)
FILTER(CONTAINS(?oldIRI, "part_of_old_url"))
}
Any idea why this does not have the effect I hoped for? The book I use as a reference does have "recipes" to change properties and their values, but there is no example of changing the subjects, so I assume there is a more general problem?
Update: using STR()
As suggested in the comments, I used CONTAINS(STR(?oldIRI), "part_of_old_url") to convert oldIRI to a string. I am not fully aware of all changes, but this is what I can say: (I have backups, no worry :D)
PREFIX oa: <http://www.w3.org/ns/oa#>
SELECT *
WHERE {
GRAPH ?g {
?iri a oa:Annotation .
}
} LIMIT 100
This query has zero results. It's a default query I often used to get some annotation uris for looking into things.
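A possible explanation, offered as an assumption since nothing in this thread confirms it: CONCAT returns a string literal rather than an IRI, and SPARQL Update skips template triples that would have a literal as subject. In that case the DELETE part would still remove the old triples while the INSERT part adds nothing, which would match the empty result above. Wrapping the expression in IRI(...) would be the usual way to get an IRI back. Separately, SUBSTR in SPARQL is 1-based, which is easy to get wrong by one. The Python sketch below only models the string arithmetic. The old prefix, the sample IRI and the helper names are hypothetical.

```python
def sparql_substr(s, start, length=None):
    """Emulate SPARQL SUBSTR for start >= 1: positions are 1-based."""
    begin = start - 1
    return s[begin:] if length is None else s[begin:begin + length]


# Hypothetical old server prefix; the real one is not shown in the question.
OLD_PREFIX = "http://old-server.example.org:8080"
NEW_PREFIX = "http://localhost:80"


def rewrite(old_iri):
    """Swap OLD_PREFIX for NEW_PREFIX, mirroring CONCAT(..., SUBSTR(STR(?oldIRI), n))."""
    # In 1-based terms, the first character after the prefix is at len(prefix) + 1.
    start = len(OLD_PREFIX) + 1
    return NEW_PREFIX + sparql_substr(old_iri, start)


print(rewrite(OLD_PREFIX + "/annotations/42"))
```

Using len(OLD_PREFIX) instead of len(OLD_PREFIX) + 1 as the start would keep the last character of the old prefix, so the hard-coded 34 in the query is worth double-checking against the actual IRIs.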

Default RDFS inference in Virtuoso 7.x

This is a question about simple RDFS inference in Virtuoso 7.1 and DBpedia. I have a Virtuoso instance which was installed using this link as a reference. Now if I query the endpoint with the following query :
Select ?s
where { ?s a <http://dbpedia.org/ontology/Cricketer> . }
I get a list of cricketers that are present in DBpedia. Suppose I want all the athletes (all sports, cricketers included, where Cricketer is rdfs:subClassOf Athlete). I just try the query
Select ?s
where { ?s a <http://dbpedia.org/ontology/Athlete> . }
For this I get all the correct answers. However I have an issue with rdfs:subPropertyOf. For example the property <http://dbpedia.org/ontology/capital> is the sub-property of <http://dbpedia.org/ontology/administrativeHeadCity>. So suppose I want all the capitals and the administrative head cities and I issue the query
Select ?s ?o
where { ?s <http://dbpedia.org/ontology/administrativeHeadCity> ?o . }
I get zero results. Why is it that subproperty inference isn't working in DBpedia? Is there something else that I have missed?
You've missed a couple of things.
First, Virtuoso is at 7.2.4 as of April 2016, and this version is strongly recommended over the old version from 2014, for many reasons.
@AKSW's advice about property paths will work much better with this later version, too.
Then, you can use inference on the DBpedia endpoint (including your local mirror), through the input:inference pragma, as shown on the live results of the query shown below --
DEFINE input:inference "http://dbpedia.org/resource/inference/rules/dbpedia#"
SELECT ?place ?HeadCity
WHERE
{
?place <http://dbpedia.org/ontology/administrativeHeadCity> ?HeadCity
}
ORDER BY ?place ?HeadCity
You can also see a list of predefined inference rule sets.
And... more of the relevant documentation.
(ObDisclaimer: I work for OpenLink Software, producer of Virtuoso.)
There is no automatic inference enabled in DBpedia. DBpedia itself is a dataset loaded into Virtuoso.
The reason that you get all instances with a superclass like dbo:Athlete is that subclass-inheritance is fully materialized in the current DBpedia dataset:
(s rdf:type c1), (c1 rdfs:subClassOf c2) -> (s rdf:type c2)
That means that for each individual x, the DBpedia dataset contains all the classes C it belongs to - in fact also the superclasses.
That procedure was not done for subproperty-inheritance, i.e.,
(s p1 o), (p1 rdfs:subPropertyOf p2) -> (s p2 o)
You can solve that problem with SPARQL 1.1 property paths:
SELECT ?s ?o WHERE {
?p rdfs:subPropertyOf* <http://dbpedia.org/ontology/administrativeHeadCity> .
?s ?p ?o .
}
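What the property-path query computes can be sketched in a few lines of Python: first collect the target property together with all of its direct and indirect sub-properties, then match triples against that set. Only the dbo:capital / dbo:administrativeHeadCity relation comes from the question. The sample triple and the function names are made up for illustration.

```python
def sub_properties_or_self(sub_property_of, target):
    """All p such that p rdfs:subPropertyOf* target (zero or more steps)."""
    result = {target}
    frontier = [target]
    while frontier:
        current = frontier.pop()
        for sub, sup in sub_property_of:
            if sup == current and sub not in result:
                result.add(sub)
                frontier.append(sub)
    return result


def query(triples, sub_property_of, target):
    """Subject/object pairs connected by target or any of its sub-properties."""
    props = sub_properties_or_self(sub_property_of, target)
    return sorted((s, o) for s, p, o in triples if p in props)


# Schema fact taken from the question, instance data hypothetical.
sub_property_of = [("dbo:capital", "dbo:administrativeHeadCity")]
triples = [("dbr:France", "dbo:capital", "dbr:Paris")]

print(query(triples, sub_property_of, "dbo:administrativeHeadCity"))
```

Matching on dbo:administrativeHeadCity alone finds nothing in this data, which is the situation described in the question. The closure over rdfs:subPropertyOf is what brings the dbo:capital triple in.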

Querying named RDF graphs in TDB using tdbquery

I am trying to query my newly created TDB database using the tdbquery program. However, I am having a hard time writing a query that targets the correct named graph. I am doing the following:
First I create a new dataset and add a named graph called "facts":
Dataset dataset = TDBFactory.createDataset("/tdb/");
dataset.begin(ReadWrite.WRITE) ;
try {
Model facts = RDFDataMgr.loadModel("lineitem.ttl") ;
dataset.addNamedModel("facts", facts);
dataset.commit();
TDB.sync(dataset);
dataset.end();
} finally {
dataset.close();
}
When I query all graphs in my TDB database it looks fine.
./tdbquery --loc /tdb/ "SELECT * { GRAPH ?g { ?s ?p ?o } }"
--------------------------------------------------
| s | p | o | g |
==================================================
| <fact1> | <predicate> | <nation> | <facts> |
| <fact2> | <predicate> | <region> | <facts> |
--------------------------------------------------
If I try to query the named graph, I do not find any triples.
./tdbquery -v --loc /tdb/ "SELECT * { GRAPH <facts> { ?s ?p ?o } }"
OR
./tdbquery -v --loc /tdb/ "SELECT * FROM NAMED <facts> WHERE { ?s ?p ?o }"
-------------
| s | p | o |
=============
-------------
When I look at the algebra version of the query I see that the context (the graph) in my quad is wrong.
INFO exec :: ALGEBRA
(quadpattern (quad <file:///usr/local/apache-jena-2.12.1/bin/facts> ?s ?p ?o))
I know that the quad pattern should be:
(quad ?s ?p ?o)
How do I query a named graph in a TDB database?
Regards
When I look at the algebra version of the query I see that the context
(the graph) in my quad is wrong.
INFO exec :: ALGEBRA
(quadpattern (quad <file:///usr/local/apache-jena-2.12.1/bin/facts> ?s ?p ?o))
I know that the quad pattern should be: (quad ?s ?p ?o)
No, it's correct (even if it's not what you expect).
A quadpattern searches against quads and so includes four fields, the first of which is the name of the graph to be searched.
And this is where your problem lies: graph names are URIs, but you only provided facts as the name. That is treated as a relative URI and as such is subject to resolution, which may differ in different parts of the system.
In your example the query parser uses the working directory as the base URI, leading to the strange graph name you see in the algebra plan.
You can see exactly what graph names are in the TDB store by issuing the following query:
SELECT ?g WHERE { GRAPH ?g { } }
If you get an absolute URI back, you can specify that directly in your original query. If you do not, there is no way to query for it from the command line.
Fixing your Issue
Avoid relative URIs wherever possible. If you do want to use them, don't use them without specifying a base URI explicitly.
So in your code where you load the data, make sure you give an absolute URI to the graph, e.g.
dataset.addNamedModel("http://example.org/facts", facts);
And if you do want to be able to use relative URIs to refer to your graph in your queries use an appropriate BASE declaration so the URI is resolved as you want it e.g.
./tdbquery -v --loc /tdb/ "BASE <http://example.org/> SELECT * { GRAPH <facts> { ?s ?p ?o } }"
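The effect of the base URI on a relative graph name can be reproduced with ordinary URI resolution, independent of Jena. The Python sketch below uses the standard library's urljoin. It assumes the working directory is used as a base with a trailing slash, which is consistent with the graph name in the algebra output but is not taken from Jena's source. The function name resolve is illustrative.

```python
from urllib.parse import urljoin


def resolve(base, relative):
    """Resolve a relative URI reference against a base URI."""
    return urljoin(base, relative)


# Base derived from the working directory, as seen in the algebra output.
print(resolve("file:///usr/local/apache-jena-2.12.1/bin/", "facts"))

# Base declared explicitly, as with BASE <http://example.org/> in the query.
print(resolve("http://example.org/", "facts"))
```

The same relative name ends up as two different absolute URIs depending on the base, which is why the query and the stored graph name may fail to line up.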
The problem here is that you have put a relative URI graph name into the data. RDF is defined to work with absolute URIs (i.e. ones that start with "http:" or some other URI scheme name).
Try
RDFDataMgr.write(System.out, dataset, Lang.NQUADS)
to see more clearly what's in the dataset. The output of tdbquery may invoke URI shorteners, so some of your data may have absolute URIs and some relative ones, yet they look the same in the text format.
When "SELECT * { GRAPH <facts> { ?s ?p ?o } }" is parsed, just like if you read data from a file, relative URIs are resolved - the base URI is where the code is running, so you get file:///usr/local/apache-jena-2.12.1/bin/facts
Try dataset.addNamedModel("http://example/facts", facts);
PS 1
Model m = dataset.getNamedModel("http://example/facts") ;
m.read("lineitem.ttl") ;
PS 2
sync() isn't necessary if you use transactions.