How to generate a hash of an RDF graph using SPARQL?

I have an RDF ontology in a triplestore that users can make changes to. I want to increment/alter the ontology version with each change. One way to do it would be to use a hash of the ontology graph as the version. For that I need to calculate a hash of the RDF graph using SPARQL, as that is the only available API.

I realized this can be addressed in two steps:
1. Generate a stable serialization of the graph (e.g. N-Triples with a stable ordering) using SPARQL.
2. Apply one of the SPARQL hash functions to get the actual hash value.
Following suggestions on Twitter, I came up with this query that seems to generate valid N-Triples:
SELECT (GROUP_CONCAT(?tripleStr ; separator=' \n') AS ?nTriples)
WHERE
  { ?s ?p ?o
    BIND(if(isURI(?s), concat("<", str(?s), ">"), concat("_:", str(?s))) AS ?sStr)
    BIND(concat("<", str(?p), ">") AS ?pStr)
    # N-Triples writes language-tagged literals as "text"@lang
    BIND(if(isURI(?o), concat("<", str(?o), ">"), if(isBlank(?o), concat("_:", str(?o)), concat("\"", str(?o), "\"", if(( lang(?o) != "" ), concat("@", str(lang(?o))), concat("^^<", str(datatype(?o)), ">"))))) AS ?oStr)
    BIND(concat(?sStr, " ", ?pStr, " ", ?oStr, " .") AS ?tripleStr)
  }
ORDER BY ?s ?p ?o datatype(?o) lcase(lang(?o))
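Please do give it a try and report bugs, if any. Note that the query does not escape special characters (quotes, backslashes, newlines) inside literals, and that the SPARQL specification does not define the order in which GROUP_CONCAT concatenates its values, so verify that your engine produces a stable result. The approach could be easily extended to N-Quads. As was pointed out on Twitter, it will not scale to large datasets.
Now, getting the actual hash of the graph is as easy as changing the SELECT clause to: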
SELECT (SHA1(GROUP_CONCAT(?tripleStr; separator=" \n")) AS ?graphHash)
Here we used the SHA1 hash function.
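The same two-step idea (stable serialization, then hash) can be sketched client-side in Python. This is only an illustration under stated assumptions: the function name graph_hash and the sample triples are hypothetical, the terms are assumed to be already serialized as N-Triples, and the lines are sorted lexicographically, which is not the same ordering as SPARQL's ORDER BY, so the digest will not necessarily match the one produced by the query above. Blank node labels are not stable across stores, so graphs with blank nodes would need extra care.

```python
import hashlib


def graph_hash(triples):
    """Return a SHA-1 hex digest over sorted N-Triples lines."""
    # Each triple is a tuple of three terms that are already serialized
    # as N-Triples, e.g. ("<http://ex/s>", "<http://ex/p>", '"v"@en').
    lines = sorted(f"{s} {p} {o} ." for s, p, o in triples)
    # Same separator as the query above: a space followed by a newline.
    serialization = " \n".join(lines)
    return hashlib.sha1(serialization.encode("utf-8")).hexdigest()


# Hypothetical sample data, for illustration only.
triples = [
    ("<http://example.org/b>", "<http://example.org/p>", '"two"@en'),
    ("<http://example.org/a>", "<http://example.org/p>", "<http://example.org/b>"),
]

# The digest does not depend on the order in which triples are supplied,
# because the lines are sorted before hashing.
digest = graph_hash(triples)
digest_reversed = graph_hash(list(reversed(triples)))
```

Because the lines are sorted before hashing, any change to a triple changes the digest, while reordering the input does not.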

Related

How to completely delete an RDF node/instance from a graph database?

Good day, I am using GraphDB to store some triples as seen in the image below. This particular RDF node uses a regular URI http://example/regular/uri. What I wish to do is not only delete all properties attached to this node, but also delete the node itself (with the result that http://example/regular/uri no longer appears in the graph database).
So far I am only able to delete all properties, but I am not able to delete the actual RDF node itself. It seemed rather simple, but the more I research online, the more this seems impossible unless clearing the complete graph.
I have tried simple "delete where" queries as shown in example 11 of the SPARQL documentation. I have also tried simple "delete where" queries using the wildcard operator, as shown in the query below:
Is there a way to delete such RDF nodes?
Thanks in advance!
A node exists in a graph as long as at least one triple has that node in subject or object position. So the easiest way would be to issue two delete statements, one deleting all statements with the node in subject position and one deleting all statements with the node in object position. But if you need or want to do it in a single operation, you can do that as well, with a UNION.
Here is a sample that deletes uri://node/to/delete from uri://my/graph:
DELETE { GRAPH <uri://my/graph> {
  ?s ?p ?o .
}}
USING <uri://my/graph>
WHERE {
  {
    ?s ?p ?o . VALUES ?s { <uri://node/to/delete> }
  } UNION {
    ?s ?p ?o . VALUES ?o { <uri://node/to/delete> }
  }
}
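To make the "a node exists only while some triple mentions it" point concrete, here is a toy Python model of a graph as a set of (subject, predicate, object) tuples. It is a sketch, not how any triplestore is implemented: the function name delete_node and the sample data are hypothetical, and terms are plain strings, so it does not distinguish a URI from a literal with the same text.

```python
def delete_node(triples, node):
    """Return the triples that do not mention `node` as subject or object."""
    return {(s, p, o) for (s, p, o) in triples if s != node and o != node}


# Hypothetical sample graph, for illustration only.
graph = {
    ("uri://node/to/delete", "uri://p/name", "some literal"),
    ("uri://other/node", "uri://p/knows", "uri://node/to/delete"),
    ("uri://other/node", "uri://p/name", "another literal"),
}

remaining = delete_node(graph, "uri://node/to/delete")

# Collect every term still used in subject or object position.
nodes_left = {s for s, _, _ in remaining} | {o for _, _, o in remaining}
```

Once both the subject-position and object-position triples are gone, the node no longer appears anywhere in the model, which mirrors what the two branches of the UNION achieve.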

BIND values from Documents to a SPARQL Variable (MarkLogic)

I am currently trying to see if it's possible to extract a certain value from a document and bind it to a variable in SPARQL.
For example, suppose I have a document like this in MarkLogic:
/person/John
<person_data>
<name>John</name>
<age>25</age>
</person_data>
Using this data, I attempted various ways to bind it, such as using XPath in sem:sparql as shown below:
xquery version "1.0-ml";
import module namespace sem = "http://marklogic.com/semantics" at "/MarkLogic/semantics.xqy";
sem:sparql('
PREFIX fn : <http://www.w3.org/2005/xpath-functions>
SELECT *
WHERE {
?s ?p ?o .
BIND (fn:doc("/person/John")//name/text() AS ?name)
}
',
(),
(),
()
)
However, this resulted in an error. I would greatly appreciate any advice on accomplishing this.
The SPARQL engine has no access to documents, but there is a better solution anyhow. You can use Template Driven Extraction for this. It can expose an SQL view on documents, but also 'Identify Triples in Documents'. It effectively means that particular values can be projected into the triple index, and will become accessible as RDF data like any other RDF data in your database.
HTH!

Setting to query only Default Graph and exclude Named Graphs

In the GraphDB documentation, I see that "the dataset’s default graph contains the merge of the database’s default graph AND all the database named graphs." This means that "if a statement ex:x ex:y ex:z exists in the database in the graph ex:g" then a query such as SELECT * { ?s ?p ?o } will return the triple ex:x ex:y ex:z
I am wondering if there is a setting which can be triggered either via the web interface or via the RDF4J/OpenRDF API which will disable this behavior in a specified GraphDB repository. That is, for the purposes of my project I would prefer to have triples which are stored in named graphs to only appear in results which specifically query that named graph.
I have not seen anything like this searching through the documentation or on the settings available on the web interface, but maybe somebody here knows something I don't.
EDIT: I am not looking for a SPARQL solution to this problem. I know that I can query just the default graph using SPARQL, but I want to be able to use the query SELECT * { ?s ?p ?o } and only see results which are in the default graph by default.
GraphDB/RDF4J interpret the default graph differently from Jena. The only easy way to query only the explicit statements in the default graph is to use the special graph sesame:nil. The SPARQL-based solution is to write:
PREFIX sesame: <http://www.openrdf.org/schema/sesame#>
SELECT ?s ?p ?o
FROM sesame:nil
WHERE {
?s ?p ?o .
} LIMIT 100
I don't think there is any easy non-SPARQL solution, such as changing a configuration option or using this special graph over the SPARQL Graph Store protocol.

Complex SPARQL query - Virtuoso performance hints?

I have a rather complex SPARQL query, which is executed thousands of times in parallel threads (400 threads). The query is here somewhat simplified (namespaces, properties, and variables have been reduced) for readability, but the complexity is left untouched (unions, number of graphs, etc.). The query is run against 4 graphs, the biggest of which contains 5,561,181 triples.
PREFIX graphA: <GraphABaseURI:>
ASK
FROM NAMED <GraphBURI>
FROM NAMED <GraphCURI>
FROM NAMED <GraphABaseURI>
FROM NAMED <GraphDBaseURI>
WHERE{
{
GRAPH <GraphABaseURI>{
?variableA a graphA:ClassA .
?variableA graphA:propertyA ?variableB .
?variableB dcterms:title ?variableC .
?variableA graphA:propertyB ?variableD .
?variableL<GraphABaseURI:propertyB> ?variableD .
?variableD <propertyBURI> ?variableE
}
.
GRAPH <GraphBURI>{
?variableF <propertyCURI>/<propertyDURI> ?variableG .
?variableF <propertyEURI> ?variableH
}
.
GRAPH <GraphCURI>{
?variableI <http://www.w3.org/2004/02/skos/core#notation> ?variableJ .
?variableI <http://www.w3.org/2004/02/skos/core#prefLabel> ?variableK .
FILTER (isLiteral(?variableK) && REGEX(?variableK, "literalA", "i"))
}
.
FILTER (isLiteral(?variableJ) && ?variableG = ?variableJ) .
FILTER (?variableE = ?variableH)
}
UNION
{
GRAPH <GraphABaseURI>{
?variableA a graphA:ClassA .
?variableA graphA:propertyA ?variableB .
?variableB dcterms:title ?variableC .
?variableA graphA:propertyB ?variableD .
?variableL<propertyBURI> ?variableE .
?variableL <propertyFURI> ?variableD .
}
.
GRAPH <GraphDBaseURI>{
?variableM <propertyGURI> ?variableN .
?variableM <propertyHURI> ?variableO .
FILTER (isLiteral(?variableO) && REGEX(?variableO, "literalA", "i"))
}
.
FILTER (?variableE = ?variableN) .
}
UNION
{
GRAPH <GraphABaseURI>{
?variableA a graphA:ClassA .
?variableA graphA:propertyA ?variableB .
?variableB dcterms:title ?variableC .
?variableA graphA:propertyB ?variableD .
?variableL<propertyBURI> ?variableE .
?variableL <propertyIURI> ?variableD .
}
.
GRAPH <GraphDBaseURI>{
?variableM <propertyGURI> ?variableN .
?variableM <propertyHURI> ?variableO .
FILTER (isLiteral(?variableO) && REGEX(?variableO, "literalA", "i"))
}
.
FILTER (?variableE = ?variableN) .
}
. FILTER (isLiteral(?variableC) && REGEX(?variableC, "literalB", "i")) .
}
I would not expect someone to transform the above query (of course...). I am only posting the query to demonstrate the complexity and all the SPARQL structures used.
My questions:
Would performance improve if I had all my triples in one graph? This would let me avoid unions and simplify my query, but would it also help performance?
Are there any indexes that I could build that would help with the above query? I am not very confident about data indexing, but having read the RDF Index Scheme section of RDF Performance Tuning, I wonder if Virtuoso 7's default indexing scheme is suitable for queries like the above. While the predicates are defined in the above query's SPARQL triple patterns, there are many triple patterns in which the subject or predicate is not defined. Could this be a major problem regarding performance?
Perhaps there is a SPARQL syntax structure that I am not aware of and could be of great help in the above query. Could you suggest something? For example, I have already improved performance by removing STR() casts and using the isLiteral() function. Could you suggest anything else?
Or am I perhaps overusing some complex SPARQL syntax structure that I should avoid?
Please note that I use Virtuoso Open source edition, built on Ubuntu, Version: 07.20.3214, Build: Oct 14 2015.
Regards,
Pantelis Natsiavas
First thing -- your Virtuoso build is long outdated; updating to 7.20.3217 as of April 2016 (or later) is strongly recommended.
Optimization suggestions are naturally limited when looking at a simplified query, but here are several thoughts, in no particular order...
Index Scheme Selection, the RDF Performance Tuning doc section following RDF Index Scheme, offers a couple of alternative and/or additional indexes which may make sense for your queries and data. As you say that some of your patterns will have defined graph and object, and undefined subject and predicate, some other indexes may also make sense (e.g., GOPS, GOSP), depending on some other factors.
Depending on how much your data has changed since original load, it may be worth rebuilding the free-text indexes, with this SQL command (which may be issued through any SQL interface, such as iSQL, ODBC, or JDBC):
VT_INC_INDEX_DB_DBA_RDF_OBJ ()
Using the bif:contains predicate can result in substantially better performance than regex() filters, for instance replacing:
FILTER (isLiteral(?variableO) && REGEX(?variableO, "literalA", "i")) .
with:
?variableO bif:contains "'literalA'" .
FILTER ( isLiteral(?variableO) ) .
Explain() and profile() can be helpful in query optimization efforts. Much of this output is meant for analysis by Development, so it may not mean much to you, but providing it to other Virtuoso users can still yield helpful suggestions.
For a number of reasons, the rdf:type predicate (often expressed as a, thanks to SPARQL/Turtle syntactic sugar) can be a performance killer. Removing those predicates from your graph pattern is likely to boost performance substantially. If needed, there are other ways to limit the solution set (such as by testing for attributes only possessed by entities of your desired rdf:type) which do not have such negative performance impacts.
(ObDisclaimer: OpenLink Software produces Virtuoso, and employs me.)

dbpedia sparql movie query sql

Hi, I'm trying to learn how to query DBpedia using SPARQL. I can't find any website or source that shows me how to do this, and I'm finding it difficult to learn how to use all the properties (like the ones available at http://mappings.dbpedia.org/index.php?title=Special%3AAllPages&from=&to=&namespace=202 ). Is there any good source I can learn from?
So for example if I want to check if the wikipedia page http://en.wikipedia.org/wiki/Inception is a movie (property film) or not, how do I do that?
The wikipedia URL http://en.wikipedia.org/wiki/Inception maps to the dbpedia URI http://dbpedia.org/resource/Inception. Dbpedia has a SPARQL endpoint at: http://dbpedia.org/sparql, which you may use to run queries either programmatically or via the html interface.
To check if http://dbpedia.org/page/Inception is a "movie", you have many options. To give you an idea:
If you know the URI of "movie" in dbpedia (it is http://schema.org/Movie), then run an ASK query to check against that type. ASK returns true or false depending on whether the pattern in the where clause matches the data:
ASK where {
<http://dbpedia.org/resource/Inception> a <http://schema.org/Movie>
}
If you don't know the URI of "movie" then you have a number of options. For example:
Execute an ASK query with a filter on whether the resource has a type that contains the word "movie" somewhere in its uri (or its associated rdfs:label, or both). You would use a regular expression for this:
ASK where {
<http://dbpedia.org/resource/Inception> a ?type .
FILTER regex(str(?type), "^.*movie", "i")
}
Same idea, but return all matches and post-process the results (programmatically, I presume) to see if they match your request:
select distinct ?type where {
<http://dbpedia.org/resource/Inception> a ?type .
FILTER regex(str(?type), "^.*movie", "i")
}
Return all the types of the resource without applying a filter and post-process to see if they match your request:
select distinct ?type where {
<http://dbpedia.org/resource/Inception> a ?type
}
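The "post-process the results" step could look roughly like the Python sketch below. It assumes the endpoint returned the standard SPARQL JSON results format for the select query above; the function name movie_types and the sample payload are hypothetical and only stand in for a real response.

```python
import json


def movie_types(results_json):
    """Return the type URIs whose text contains 'movie' (case-insensitive)."""
    data = json.loads(results_json)
    return [
        binding["type"]["value"]
        for binding in data["results"]["bindings"]
        if "movie" in binding["type"]["value"].lower()
    ]


# Hypothetical payload in the SPARQL JSON results format,
# not an actual DBpedia response.
sample = json.dumps({
    "head": {"vars": ["type"]},
    "results": {"bindings": [
        {"type": {"type": "uri", "value": "http://schema.org/Movie"}},
        {"type": {"type": "uri", "value": "http://dbpedia.org/ontology/Work"}},
    ]},
})

matches = movie_types(sample)
is_movie = bool(matches)
```

Filtering on the client keeps the query simple, at the cost of transferring every type of the resource.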
Many options. The SPARQL spec is your number one resource.
First I suggest you start reading up on what exactly SPARQL is. There are tons of really good tutorials such as: this.
If you want to write SPARQL queries on dbpedia, there are various endpoints that you can use. They don't always accept all features that are supported by SPARQL, but if you don't want to go through the trouble of installing one locally, they can be a relatively reliable test environment. The queries that I am going to write below, have been tested on Virtuoso endpoint.
Let's say you want to find all the movies in dbpedia. You first need to know what is the URI for a movie type in dbpedia. If you open Inception in dbpedia, you can see that the type dbpedia-owl:Film is associated to it. So if you want to get the first 100 movies, you just need to call:
select distinct *
where {
  ?s ?p dbpedia-owl:Film
} LIMIT 100
If you want to know more about each of these movies, you just need to extend your query with additional triple patterns.
select distinct *
where {
?s ?p dbpedia-owl:Film.
?s ?x ?y.
} LIMIT 100