SPARQL vs XQuery (MarkLogic)

After playing with MarkLogic, I realized that results from triples can be obtained in several ways, for example by using either XQuery or SPARQL. So the question is: are there any advantages to using SPARQL over XQuery? Is there some indexing going on that makes SPARQL much faster than searching for a certain semantic query?
For instance if we are retrieving all semantic documents with the predicate "/like".
SPARQL
SELECT *
WHERE {
?s </like> ?o
}
XQuery
cts:search(fn:doc(), cts:element-query(xs:QName("sem:predicate"), "/like"))
Therefore, is there any difference in efficiency between these two?

Yes, there are definitely differences. Whether XQuery or SPARQL is more efficient, however, fully depends on the problem you are trying to solve. XQuery is best at querying and processing document data, while SPARQL really allows you to reason easily over RDF data.
It is true that RDF data is serialized as XML in MarkLogic, and you can full-text search it, and even put range indexes on it if you like. But RDF data is also indexed in the triple index, which gives you more accurate results than the full-text search above.
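As a rough illustration of the accuracy difference, a triple index keys each triple on its components, so a predicate lookup is an exact match rather than a text match. A minimal Python sketch (the data and index layout are invented for illustration; this is not MarkLogic's actual structure):

```python
from collections import defaultdict

# A toy triple store: each triple is a (subject, predicate, object) tuple.
triples = [
    ("person/1", "/like", "topic/chocolate"),
    ("person/2", "/like", "topic/tea"),
    ("person/2", "/dislikes", "topic/chocolate"),
]

# Predicate index: exact-match lookup, analogous to a triple index.
by_predicate = defaultdict(list)
for s, p, o in triples:
    by_predicate[p].append((s, p, o))

def triples_with_predicate(pred):
    return by_predicate[pred]  # exact match only

def fulltext_predicate_search(term):
    # A word search over the serialized predicate values can overmatch:
    return [t for t in triples if term in t[1]]

print(len(triples_with_predicate("/like")))    # 2: exact matches
print(len(fulltext_predicate_search("like")))  # 3: "/dislikes" also contains "like"
```

The full-text search matches "/dislikes" as well, which is the kind of false positive the exact triple index avoids.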
Also note that SPARQL allows you to follow predicate paths, which involves a lot of joining. That will be much more efficient if done via SPARQL than via XQuery, because it is mostly resolved via the triple index. Imagine a SPARQL query like this one:
PREFIX pers: <http://my.persons/>
PREFIX topic: <http://my.topics/>
PREFIX pred: <http://my.predicates/>
SELECT DISTINCT *
WHERE {
  ?person pred:likes topic:Chocolate ;
          pred:friendOf+ ?friend .
  FILTER( ?friend = pers:WhiteSolstice )
  FILTER( ?friend != ?person )
}
It tries to find all direct and indirect friends that like chocolate. I wouldn't write something like that in XQuery.
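Resolving a transitive path like pred:friendOf+ amounts to a repeated self-join, which an engine can answer with successive index lookups. A hedged Python sketch of that closure (data and names invented for illustration):

```python
# Direct friendships as (person, friend) edges.
friend_of = {
    ("alice", "bob"),
    ("bob", "carol"),
    ("carol", "dave"),
}

def friends_transitive(start):
    """All people reachable via one or more friendOf edges (friendOf+)."""
    seen, frontier = set(), {start}
    while frontier:
        # One 'join' round: look up the friends of everyone found so far.
        frontier = {f for (p, f) in friend_of if p in frontier} - seen
        seen |= frontier
    return seen

print(sorted(friends_transitive("alice")))  # ['bob', 'carol', 'dave']
```

Each loop iteration corresponds to one join against the edge set, which is exactly the work a triple index makes cheap.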
Then again, there are other things that are easy in XQuery and practically impossible in SPARQL. And sometimes the most efficient approach is to combine the two: do a sem:sparql from inside XQuery, and use the results to direct further processing in XQuery. It also sometimes comes down to what shape your data is in.
HTH!

A little nuance here: search is about searching for documents. Unless you have one triple per document, fetching just the triples that match out of a bunch in a document involves pulling the whole document from disk (although it may be in cache). SPARQL is about selecting triple data from the triple indexes, which may involve less disk IO. Certainly if you are doing anything other than a simple fetch of a simple triple pattern, you're going to need the understanding of relationships that SPARQL gives you.
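To make the IO point concrete, here is an entirely illustrative toy model where each document embeds several triples: a document-oriented fetch has to pull every triple in a matching document, while a triple-level index returns only the matching rows.

```python
# Toy model: documents, each embedding several triples.
documents = {
    "doc1.xml": [("s1", "/like", "o1"), ("s1", "/knows", "o2"), ("s1", "/age", "42")],
    "doc2.xml": [("s2", "/knows", "o3")],
}

# Document-level search: everything in a matching document comes off disk.
def search_documents(pred):
    return [t for doc in documents.values()
            if any(p == pred for _, p, _ in doc)
            for t in doc]

# Triple-level index: only the matching rows.
triple_rows = [t for doc in documents.values() for t in doc]
def search_triples(pred):
    return [t for t in triple_rows if t[1] == pred]

print(len(search_documents("/like")))  # 3 triples pulled (the whole of doc1)
print(len(search_triples("/like")))    # 1 triple
```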

Related

Graph URIs on GraphDB

The SPARQL graph concept is still abstract for me.
Are graphs similar to SQL tables? Can one triple belong to more than one graph?
Is there a way to get the URI of the graph to which a triple belongs?
Is there a way to get the URI of every graph stored in a GraphDB repository? And the URI of the default graph?
Thanks,
In addition to UninformedUser's answers, the graphs can also be listed with the following cURL command:
curl 'http://<host>:<port>/repositories/<your_repo>/contexts'
You can also easily explore and export the contents of the graphs in a GraphDB repository using the Workbench.
The value of the default graph can be found there as well.
To understand the differences, we could refer first to the data model itself:
Relational model
SQL operates on relational databases. The relational model, in general terms, is a way to model your data and its dependencies as structured tables. If we want to query information that spans more than one table, we perform what is called a join operation, which retrieves the relationships between elements in different tables.
The join operator often uses keys that are common among the tables.
One drawback of this model shows when you need to perform many such join operations to retrieve the desired information. In that case, responses become very slow and performance drops significantly.
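For example, even the simple question "which topic does each person like?" already needs one join on the shared key. A toy Python sketch of that join (tables and column names invented for illustration):

```python
# Two toy relational tables sharing the key person_id.
persons = [{"person_id": 1, "name": "Alice"}, {"person_id": 2, "name": "Bob"}]
likes   = [{"person_id": 1, "topic": "Tea"},  {"person_id": 2, "topic": "Chocolate"}]

# A nested-loop join on the shared key:
joined = [(p["name"], l["topic"])
          for p in persons for l in likes
          if p["person_id"] == l["person_id"]]

print(joined)  # [('Alice', 'Tea'), ('Bob', 'Chocolate')]
```

Every extra hop in the question (friends of friends, likes of friends, ...) adds another join like this one, which is where relational query performance tends to degrade.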
Graph model
In contrast, SPARQL is a query language with a SQL-like syntax that operates on graph data. The graph data model does not store the data as tables. Instead, it uses a triple store (another option is a property graph, but your question refers to triple stores).
RDF triples are the W3C-standard way of describing statements in the graph data model. A triple consists of a Subject, a Predicate, and an Object.
The Subject is the entity to which the triple refers, the Predicate is the type of relationship, and the Object is the entity or property to which the subject connects.
The subject always has a unique identifier (e.g. a URI), whereas the object can be a literal or another entity with a URI.
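The same kind of facts expressed as triples can be sketched in Python as plain (subject, predicate, object) tuples (URIs shortened and invented for readability):

```python
# The statements "Alice likes Tea" and "Alice's name is 'Alice'" as triples.
triples = [
    ("http://ex.org/Alice", "http://ex.org/likes", "http://ex.org/Tea"),  # object is an entity
    ("http://ex.org/Alice", "http://ex.org/name", "Alice"),               # object is a literal
]

for s, p, o in triples:
    kind = "entity" if o.startswith("http://") else "literal"
    print(s, p, o, f"({kind} object)")
```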
Further comparison
You can have a look at a more detailed comparison of the models in chapter 2 (Options for Storing Connected Data) of the book Graph Databases, 2nd Edition, by Ian Robinson, Jim Webber, and Emil Eifrem.
Coming back to your questions
Are graphs similar to SQL tables? Can one triple belong to more than one graph?
No, graphs are not SQL tables.
And yes, one triple can appear in different graphs (as long as the triple has the same Subject, Predicate, and Object). However, if you use different ontologies, the triple will appear in a different context.
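One way to picture this: a quad store keys statements as (graph, subject, predicate, object), so the very same triple can sit in several graphs. An illustrative Python sketch (graph URIs invented):

```python
# Quads: the same triple ("s", "p", "o") appears in two graphs.
quads = {
    ("http://ex.org/graphA", "s", "p", "o"),
    ("http://ex.org/graphB", "s", "p", "o"),  # same triple, second graph
    ("http://ex.org/graphB", "s2", "p", "o2"),
}

def graphs_of(triple):
    """All graphs that contain the given (s, p, o) triple."""
    return sorted(g for (g, *t) in quads if tuple(t) == triple)

print(graphs_of(("s", "p", "o")))  # ['http://ex.org/graphA', 'http://ex.org/graphB']
```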
Is there a way to get the URI of the graph to which a triple belongs? Is there a way to get the URI of every graph stored in a GraphDB repository? And the URI of the default graph?
SPARQL queries are graph patterns: you specify the type of structure you are searching for.
As UninformedUser commented:
All triples
select distinct ?g {graph ?g {?s ?p ?o}}
All graphs that contain a specific triple
select distinct ?g {graph ?g {:s :p :o}}

MarkLogic: Constrain SPARQL query scope by triple-range-query constraint

I would like to evaluate a SPARQL query against a limited document scope, which is based on a triple range query. Only embedded triples contained by documents which match a specific triple pattern should be part of the SPARQL evaluation scope. I'm using the Java SDK (via marklogic-rdf4j) to evaluate the SPARQL query. We're only using embedded/unmanaged triples.
I'm aware of the possibility to attach a structured query definition to a SPARQL query (by calling MarkLogicQuery::setConstrainingQueryDefinition), but the structured query syntax does not support triple-range-query constraints.
Is there any way to apply one or more triple-range-query constraints in a structured query definition? Or are there better alternatives?
Support for triple-range-query in structured queries has been requested before. I added your case to the ticket.
In the meantime you might get away with using a custom constraint. A colleague and I put this together:
https://github.com/patrickmcelwee/triple-range-constraint/blob/master/triple-range-constraint.xqy
HTH!

How to extract a representative subset of triples from a triple store

We have an OpenLink Virtuoso-based triple (or rather quad-) store with about ~6bn triples in it. Our collaborators are asking us to give them a small subset of the data so they can test some of their queries and algorithms. Naturally, if we extract a random subset of graph-subject-predicate-object quads from the entire set, most of their SPARQL queries against the subset will find no solutions, because a small random subset of quads will represent an almost entirely disconnected graph.
Is there a technique (possibly Virtuoso-specific) that would allow the extraction of a subset of quads s from the entire set S such that, for a given "select" or "construct" SPARQL query Q, Q executed against s would return the same solution as Q executed against the entire set S?
If this could be done, it would be possible to run all sample queries that the collaborators want to be able to run against our dataset, extract the smallest possible subset, and send it to them (as an n-quads file) so they can load it into their triple store.
You must have a known count of entity types in your database, right? Assuming that to be true, why don't you simply apply a SPARQL DESCRIBE to a sampling of each entity per entity type?
Example:
DESCRIBE ?EntitySample
WHERE {
  {
    SELECT (SAMPLE(?Entity) AS ?EntitySample)
           (COUNT(?Entity) AS ?EntityCount)
           ?EntityType
    WHERE { ?Entity a ?EntityType }
    GROUP BY ?EntityType
    HAVING (COUNT(?Entity) > 10)
    LIMIT 50
  }
}
This doesn't have a pre-built Virtuoso-specific solution. (ObDisclaimer: OpenLink Software produces Virtuoso, and employs me.)
As noted in the comments above, it's actually a rather complex question. From many perspectives, the simple answer to "what is the minimal set to deliver the same result to the same query" is "all of it," and this might well work out to be the final answer, due to the effort needed to arrive at any satisfactory smaller subset.
The following are more along the lines of experimental exploration, than concrete advice.
It sounds like your collaborators want to run some known query, so I'd start by running that query against your full dataset, then do a DESCRIBE over each ?s, ?p, and ?o that appears in the result, load all that output as your subset, and test the original query against that.
If known, explicitly including all the ontological data from the large set in the small may help.
If this sequence doesn't deliver the expected result set, you might try a second, third, or further rounds of the DESCRIBE, each time targeting every new ?s, ?p, and ?o that appeared in the previous round.
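The round-by-round expansion can be sketched as a fixpoint loop. In this Python sketch, the describe function stands in for a SPARQL DESCRIBE, and the dataset and names are invented for illustration:

```python
# Toy dataset: every triple in the full store.
full_store = {
    ("a", "knows", "b"),
    ("b", "knows", "c"),
    ("c", "label", "leaf"),
}

def describe(node):
    """Stand-in for SPARQL DESCRIBE: all triples mentioning the node."""
    return {t for t in full_store if node in t}

def expand_subset(seed_nodes):
    """Repeat DESCRIBE rounds until no new triples appear."""
    subset, frontier = set(), set(seed_nodes)
    while frontier:
        new = set().union(*(describe(n) for n in frontier)) - subset
        subset |= new
        # Next round: every newly seen subject, predicate, and object.
        frontier = {x for t in new for x in t}
    return subset

# Seed with the nodes from one query result; expansion reaches the whole chain.
print(len(expand_subset({"a"})))  # 3
```

Note that on densely connected data this fixpoint can grow toward the full dataset, which echoes the "the minimal set may be all of it" point above.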
The idea of exposing your existing endpoint, with the full data set, to your collaborators is worth considering. You could grant them only READ permissions and/or adjust server configuration to limit the processing time, result set size, and other aspects of their activity.
A sideways approach that may help in thinking about this: in the SQL relational-table world, a useful subset of a single table is easy -- it can be just a few rows that include at least the one(s) you want your query to return (and often at least a few that you want your query to not return).
With a SQL Relational Schema involving multiple Tables, the useful subset extends to include the rows of each table which are relationally connected to the (un)desired row(s) in any of the others.
Now, in the RDF Relational Graph world, each "row" of those SQL Relational Tables might be thought of as having been decomposed, with the primary key of each table becoming a ?s, and each column/field becoming a ?p, and each value becoming a ?o.
The reality is (obviously?) more complex (you might look at the W3C RDF2RDF work, and the Direct Mapping and R2RML results of that work, for more detail), but this gives a conceptual starting point for considering how to find the quads (or triples) from an RDF dataset that comprise the minimum sub-dataset that will satisfy a given query against both dataset and subdataset.
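A loose sketch of that decomposition, in the spirit of a Direct Mapping (the URI scheme and helper name are invented for illustration; this is not the actual R2RML algorithm):

```python
def row_to_triples(table, pk_column, row):
    """Decompose one relational row into triples: primary key -> subject,
    column -> predicate, value -> object."""
    subject = f"http://ex.org/{table}/{row[pk_column]}"
    return [(subject, f"http://ex.org/{table}#{col}", val)
            for col, val in row.items() if col != pk_column]

row = {"person_id": 7, "name": "Alice", "city": "Oslo"}
for t in row_to_triples("persons", "person_id", row):
    print(t)
# e.g. ('http://ex.org/persons/7', 'http://ex.org/persons#name', 'Alice')
```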
Querying a subset of quads cannot in general give the same results as querying the entire dataset: by considering only a small percentage of the quads, you lose quads from the answer too.

Is the speed of retrieving a query result affected by using named graphs?

I am using Sesame server for storing sets of triples.
First question
I would like to know: if the repository grows huge over time and I run queries over it, will query performance be affected?
Second question (if the answer for the first question is positive)
If I use named graphs for different sets of triples, and run queries on them, will I retrieve the result much faster than if I would normally run them on the entire repository?
What I want to ask is —
Is this slower:
PREFIX csm: <http://example.org/some_ontology.owl#>
SELECT ?b ?c
WHERE {
?a a csm:SomeClass.
?a ?b ?c.
}
than this:
PREFIX csm: <http://example.org/some_ontology.owl#>
SELECT ?b ?c
WHERE {
GRAPH <http://example.org/some_graph> {
?a a csm:SomeClass.
?a ?b ?c.
}
}
when the data set that is stored is enormously huge?
I think this depends a bit on the triplestore you are using. What I mainly use named graphs for is filtering (I don't know if you mean the same when you mention grouping). We have massive amounts of data and very long queries. Every dataset is stored in a separate named graph, in the same repository. The triples without a named graph (depending on the backward-chaining or forward-chaining reasoner) are normally the inferred triples. So, to speed up the query, you can filter out some of the triples based on the named graph:
select *
where {
  graph ?g {
    ?s a ?o .
  }
  filter (?g = <specific_graph>)
  ... the rest of the massive query
}
I have found that this approach speeds up the query (although as I mentioned before it is triplestore dependent, as I have only played with a number of triplestores).
The other advantage of having a named-graph is when you want to write a query to pull out information from only a specific source. Sometimes we use it to track the provenance of the data. If you have an API sitting on top of the data, you can easily filter based on the graphs that you have the full rights, some rights, ...
Something I have found frustrating is that some triplestores do not fully honour named graphs. For example, if you have a triple in one graph and you rewrite the same triple in another graph, the context or graph might be overwritten, which makes filtering based on the named graph inaccurate. I haven't really played around with a quad store, but I hope they don't have this problem: I would expect to find the triple in two different contexts rather than only the latest one.
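The two behaviours described can be contrasted in a small Python sketch (store layouts and names are invented for illustration): a store that keys statements on the bare triple loses the earlier context, while one that keys on the full quad keeps both.

```python
# Two toy behaviours when the same triple is added under a second graph.

def add_overwriting(store, g, s, p, o):
    store[(s, p, o)] = g      # keyed on the bare triple: context overwritten

def add_quad(store, g, s, p, o):
    store.add((g, s, p, o))   # keyed on the full quad: both contexts kept

triple_keyed, quad_keyed = {}, set()
for g in ("graphA", "graphB"):
    add_overwriting(triple_keyed, g, "s", "p", "o")
    add_quad(quad_keyed, g, "s", "p", "o")

print(triple_keyed[("s", "p", "o")])  # 'graphB' -- graphA's context was lost
print(len(quad_keyed))                # 2 -- triple visible in both graphs
```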
First question: I would like to know if the repository grows huge over time and I want to run queries over it, will speed performance be affected?
Yes. The extent to which size influences query performance depends on a number of factors, most importantly the actual database implementation you use, how you've configured that database, but also on the shape of your actual data (e.g. the number of type-statements, etc), and of course the types of query you do. Sesame is a quadstore framework which comes with a few built-in database types (in-memory and native), but of course numerous third-party Sesame-compatible RDF databases exist that each have their own performance characteristics.
Second question (if the answer for the first question is positive): If I use named graphs for different sets of triples, and run queries on them, will I retrieve the result much faster than if I would normally run them on the entire repository?
Again, it depends on the database and its configuration you use, and the kinds of queries you use.
Let's assume you are using a Sesame native store, and have enabled at least one index in which the named graph (or "context" as it is called in Sesame) is the primary key (e.g. cspo) - and in addition you have the usual default indices (i.e. spoc and posc). In this scenario, using named graphs can make a marked difference in performance if you can use it as a filter (that is, the named graph itself pre-selects a specific subset of the total potential result): the query planner can use the cspo index to quickly zoom in on a much smaller subset of the total repository.
Note however, that in your specific example queries, it will not matter much: you are assuming, in your example, that all resources of type csm:SomeClass occur in exactly one particular named graph (if that were not the case, the two queries would of course not return the same result), so actually selecting that named graph does not further reduce the potential answer set (when compared to just selecting all resources of type csm:SomeClass).
To explain in more detail: the query engine will do lookups in the indices for each of the graph patterns in your query. The first pattern (?a a csm:SomeClass) is the cheapest to look up, since it has only one free variable. The engine will use the posc index for this purpose, since it knows the first two keys for this index. The second pattern of the query will be primed by the result of the first (so ?a will be instantiated by the outcome of the first lookup). In the query with the named graph, the engine will select the cspo index, because we know both c and s. In the query without the named graph, it will select the spoc index, since we know s (but not c). However, because all values with that particular s always occur in the same named graph, both lookups will actually range over almost exactly the same number of values: all possible value-combinations of o and p. The spoc index will of course also range over c, but there will only ever be one single value for it, so it's a very quick lookup. So both indices will return their results in very comparable time, and knowing c in advance does not give a performance boost (as an aside, I am somewhat oversimplifying the workings of the query engine here to illustrate the point).
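A rough sketch of the index-selection idea: a lookup with the predicate and object bound goes to a posc-style index, while one with the subject bound goes to an spoc-style index. These are toy permutation indexes with invented data, not Sesame's actual structures:

```python
from collections import defaultdict

# Toy quads as (s, p, o, c).
quads = [
    ("s1", "rdf:type", "csm:SomeClass", "g1"),
    ("s1", "csm:name", "Widget", "g1"),
    ("s2", "rdf:type", "csm:Other", "g1"),
]

# Build two permutation indexes.
posc = defaultdict(list)  # keyed by (p, o): good when predicate+object are bound
spoc = defaultdict(list)  # keyed by s: good when the subject is bound
for s, p, o, c in quads:
    posc[(p, o)].append((s, c))
    spoc[s].append((p, o, c))

# Pattern "?a a csm:SomeClass" -> posc lookup, both keys known:
subjects = [s for s, c in posc[("rdf:type", "csm:SomeClass")]]
print(subjects)                # ['s1']

# Pattern "?a ?b ?c" with ?a now bound -> spoc lookup:
print(len(spoc[subjects[0]]))  # 2 statements about s1
```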
Named graphs are a great tool for data organization purposes, and if you have them, using them in your queries is a good idea as it can help in performance (and will certainly not hurt). But I would not organize my data in named graphs purely for query performance purposes.

SPARQL CONSTRUCT expressivity

Are there any metrics or analysis on how expressive SPARQL CONSTRUCT queries are? Are there graphs or transformations that can't be expressed via CONSTRUCT? What are the limitations?
SPARQL is PSPACE-complete, like SQL. It doesn't matter which query form you're using.
I'd say the primary limitation of CONSTRUCT queries is that they cannot construct quads.
An arbitrary variable-length list is not possible in a single CONSTRUCT: the template can't be written, because a CONSTRUCT template is a fixed pattern.
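To see why: an RDF collection of n items takes 2n triples of rdf:first/rdf:rest cons cells, so the number of constructed triples depends on the data, while a CONSTRUCT template is fixed. A Python sketch of that count (blank-node names invented for illustration):

```python
def rdf_list_triples(head, items):
    """Triples needed to encode a Python list as an RDF collection
    (rdf:first / rdf:rest cons cells)."""
    triples, node = [], head
    for i, item in enumerate(items):
        nxt = f"_:b{i + 1}" if i + 1 < len(items) else "rdf:nil"
        triples.append((node, "rdf:first", item))
        triples.append((node, "rdf:rest", nxt))
        node = nxt
    return triples

# The triple count grows with the data -- a fixed CONSTRUCT template cannot do that.
print(len(rdf_list_triples("_:b0", ["a", "b"])))       # 4
print(len(rdf_list_triples("_:b0", ["a", "b", "c"])))  # 6
```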