Bizarre results with bif:contains - corrupted full text index?

I recently built a Virtuoso database (version 07.10.3207) from DBpedia data. I'm trying to write some queries against it and am encountering very strange results. For example:
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
select ?s, ?p, ?o where {
?s ?p ?o .
?s rdfs:label "Almond"@en .
?o bif:contains "mythical"
}
This yields a hit. One might expect it to mean that the comment field (the field that matches "mythical") for Almond contains the word "mythical". However, it does not. It is, in fact:
"The almond (/??m?nd/) (Prunus dulcis, syn. Prunus amygdalus, Amygdalus communis, Amygdalus dulcis) (or badam in Indian English, from Persian: ??????) is a species of tree native to the Middle East and South Asia. "Almond" is also the name of the edible and widely cultivated seed of this tree."@en
Many other queries yield similarly strange results.
Trying the same queries on the public DBpedia endpoint does not yield these bizarre results, so I know the problem is somehow specific to my database. I suspected it might be caused by corruption of the full-text indices.
I tried the following, without a super-clear understanding of what exactly they might do, based on other notes I was able to find:
DB.DBA.RDF_OBJ_FT_RULE_ADD(null, null, 'All');
DB.DBA.VT_INC_INDEX_DB_DBA_RDF_OBJ();
DB.DBA.RDF_OBJ_FT_RECOVER();
DB.DBA.VT_INDEX_DB_DBA_RDF_OBJ();
Thus far, no dice. I'm wondering whether it might have to do with the mangled characters in the comment field: the online DBpedia endpoint renders them properly, while my Virtuoso installation just gives question marks, as seen above. I have no idea how to even begin approaching this, though.
I did include SQL_UTF8_EXECS = 1 in virtuoso.ini (and subsequently restarted the server), which still left me with question marks in the results.
Actually, it does not appear to have anything to do with those question marks; I ran the following query:
select ?s, ?p, ?o where {
?s ?p ?o .
?o bif:contains "mythical" .
FILTER (!regex(?o, "mythical", "i"))
}
A pseudorandom selection of hits, none of which contain "mythical" or "?":
"Asgrrr"
"403 BC"
"Potential infinity"
"Beauty and the Beast (talk show)"
"Alberta highway highway 22"
The same query, run at http://dbpedia.org/sparql, returns nothing (as it should).
Any ideas?

Rebuilding the database did not correct the problem. I was, however, able to get a working version by taking the following steps. Some of them may have been unnecessary, but given how long it takes, I haven't done controlled experiments to narrow things down to the minimum.
First, delete the database and associated files, to start with a blank slate.
Edit virtuoso.ini to uncomment/include:
SQL_UTF8_EXECS = 1
Start up Virtuoso, and then from isql issue the following commands:
DB.DBA.RDF_OBJ_FT_RULE_ADD (null, null, 'All');
DB.DBA.VT_BATCH_UPDATE ('DB.DBA.RDF_OBJ', 'OFF', null);
DB.DBA.VT_INC_INDEX_DB_DBA_RDF_OBJ ();
DB.DBA.RDF_OBJ_FT_RECOVER ();
COMMIT WORK;
CHECKPOINT;
CHECKPOINT_INTERVAL(60000);
Then, load the data.
Then, call:
COMMIT WORK;
CHECKPOINT;
DB.DBA.VT_INC_INDEX_DB_DBA_RDF_OBJ();
CHECKPOINT;
COMMIT WORK;
CHECKPOINT;
CHECKPOINT_INTERVAL(60);
COMMIT WORK;
Enjoy your fully-text-searchable database!
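One way to check that the rebuilt index behaves is to rerun the false-positive query from the question as a count. This is only a sketch: it reuses the question's search word "mythical" and relies on the Virtuoso-specific bif:contains predicate. On a healthy index the count should be zero, matching what the public endpoint returned for the original query.

```sparql
# Count literals that the full-text index reports as containing "mythical"
# but whose text does not actually contain it (case-insensitive).
SELECT (COUNT(*) AS ?falsePositives) WHERE {
  ?s ?p ?o .
  ?o bif:contains "mythical" .
  FILTER (!regex(str(?o), "mythical", "i"))
}
```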

Related

How to completely delete an RDF node/instance from a graph database?

Good day, I am using GraphDB to store some triples, as seen in the image below. This particular RDF node uses a regular URI, http://example/regular/uri. What I want to do is not only delete all properties attached to this node, but also delete the node itself, so that http://example/regular/uri no longer appears in the graph database.
So far I am only able to delete all properties, but I am not able to delete the actual RDF node itself. It seemed rather simple, but the more I research online, the more this seems impossible unless clearing the complete graph.
I have tried simple DELETE WHERE queries, as shown in example 11 of the SPARQL documentation. I have also tried DELETE WHERE queries using the wildcard operator, as shown in the query below:
Is there a way to delete such RDF nodes?
Thanks in advance!
A node exists in a graph as long as there are one or more triples with that node in subject or object position. So the easiest way is to issue two delete statements: one deleting all statements with the node in subject position, and one deleting all statements with the node in object position. If you need or want to do it in a single operation, you can do that as well by restricting the bindings with VALUES.
Here is a sample that deletes uri://node/to/delete from uri://my/graph:
DELETE { GRAPH <uri://my/graph> {
?s ?p ?o .
}}
USING <uri://my/graph>
WHERE {
{
?s ?p ?o . VALUES ?s { <uri://node/to/delete>}
} UNION {
?s ?p ?o . VALUES ?o { <uri://node/to/delete>}
}
}
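The two-statement variant mentioned above could look like the following. It is a minimal sketch that reuses the same placeholder graph and node IRIs and sends both operations in one update request, separated by a semicolon:

```sparql
# 1) Remove every triple that has the node as its subject.
DELETE WHERE {
  GRAPH <uri://my/graph> { <uri://node/to/delete> ?p ?o }
} ;
# 2) Remove every triple that has the node as its object.
DELETE WHERE {
  GRAPH <uri://my/graph> { ?s ?p <uri://node/to/delete> }
}
```

Once both have run, no triple mentions the node in subject or object position, so it no longer exists in that graph.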

SPARQL Query returning strange results

I have been working with SPARQL for nearly two years, but I have never seen such a strange situation before. (Note: I am using a native triplestore.)
Query1:
prefix leaks: <http://data.ontotext.com/resource/leaks/>
prefix leak: <http://data.ontotext.com/resource/leak/>
SELECT * WHERE
{
leaks:entity-10000001 leak:jurisdiction_description ?object.
}
Query2:
prefix leaks: <http://data.ontotext.com/resource/leaks/>
prefix leak: <http://data.ontotext.com/resource/leak/>
SELECT * WHERE
{
leaks:entity-10000001 ?p ?object.
}
Here Query1 returns some results, whereas Query2 returns none.
Put another way, the query below (Query3), which merges the two queries above, returns a few records.
Query3:
prefix leaks: <http://data.ontotext.com/resource/leaks/>
prefix leak: <http://data.ontotext.com/resource/leak/>
SELECT distinct ?s WHERE
{
?s leak:jurisdiction_description ?object.
FILTER NOT EXISTS { ?s ?p ?o}.
}
Ideally this should not be the case. Query3 should always return no results, since the second pattern, ?s ?p ?o, is a superset of the first one, ?s leak:jurisdiction_description ?object.
I have no clue why this is happening.
You have an issue with your triplestore, I guess.
Try the same queries at http://data.ontotext.com/sparql, which is the "home" of this dataset, backed by GraphDB. Query2 provides 23 results, as expected.
The fact that your store returns results for Query3 is a serious indication that there is something wrong with your setup.
I found out why this was happening. For some reason the index files (pos, sop, etc.) were not synchronized properly and had ended up in an inconsistent state. After I deleted the primary-triples folder under MARMOTTA_HOME and re-ingested the data, it started working, since that forced the triplestore to re-index the data. Thanks @Jeen Broekstra for the heads up :)
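A compact way to re-test for this kind of inconsistency after re-indexing is to turn Query3 into an ASK. This is a sketch using the same predicate as the question. On a consistent store it should answer false, because any subject that has a jurisdiction_description also matches ?s ?p ?o.

```sparql
PREFIX leak: <http://data.ontotext.com/resource/leak/>

# true  -> some subject is visible through the predicate-specific pattern
#          but invisible to the wildcard pattern (the indexes disagree)
# false -> the two patterns agree
ASK {
  ?s leak:jurisdiction_description ?object .
  FILTER NOT EXISTS { ?s ?p ?o }
}
```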

Apache Marmotta Delete All Graphs using SPARQL Update

I am trying to clear all graphs loaded into my Apache Marmotta instance. I have tried several SPARQL queries to do so, but I am not able to remove the RDF/XML graph that I imported. What is the appropriate syntax to do so?
Try this query:
DELETE WHERE { ?x ?y ?z }
Be careful, as it deletes every triple in the database, including the built-in ones of Marmotta.
A couple of things I did for experimenting:
I downloaded the source code of Marmotta and used the Silver Searcher tool for searching for DELETE queries with the following command:
ag '"DELETE '
This did not help much.
I navigated to the Marmotta installation directory and watched the debug log:
tail -f marmotta-home/log/marmotta-main.log
This showed that the parser is not able to process the query DELETE DATA { ?s ?p ?o }. The exception behind the "error while executing update" was:
org.openrdf.sail.SailException: org.openrdf.rio.RDFParseException: Expected an RDF value here, found '?' [line 8]
[followed by a long stacktrace]
This shows that the parser does not allow variables in the query after DELETE DATA.
Based on a related StackOverflow answer, I tried CLEAR / CLEAR GRAPH / DROP / DROP GRAPH, but they did not work.
I tried many combinations of DELETE, *, and ?s ?p ?o, and accidentally managed to get it working with the DELETE WHERE construct. According to the W3C documentation:
The DELETE WHERE operation is a shortcut form for the DELETE/INSERT operation where bindings matched by the WHERE clause are used to define the triples in a graph that will be deleted.
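Spelled out, the shortcut corresponds to the long form below, followed by a variant restricted to a single named graph. Both are sketches that I have not run against Marmotta, and the graph IRI is a made-up placeholder:

```sparql
# Long form of DELETE WHERE { ?x ?y ?z }: the template repeats the pattern.
DELETE { ?x ?y ?z }
WHERE  { ?x ?y ?z } ;

# Narrower variant: only touch triples in one named graph.
# <http://example.org/my-graph> is a hypothetical IRI; substitute your own.
DELETE WHERE {
  GRAPH <http://example.org/my-graph> { ?x ?y ?z }
}
```

The second form is the safer choice when the built-in triples mentioned above need to survive.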

Find number of some entity type

What is the SPARQL query that finds the count of some entity type? For example, on the Linked Movie Database, if I want to find the number of actors or films, how can I get it?
I tried this
SELECT (count ( ?Film)){?entity rdf:type ?Film}
but got wrong number.
There's a whole lot missing from this question (e.g., where you ran the query, what you expected as a result, etc.) but I think we can pinpoint the problem even without those details. First, let's rewrite your query using proper syntax (the formatting is optional; the important thing is count(?Film) as ?count):
select (count(?Film) as ?count) {
?entity rdf:type ?Film
}
?Film here is a variable, so you're asking "find me things and their types, and then count how many types were found." If you were trying to count the number of things of some particular film type, though, you probably wanted a query like:
select (count(?entity) as ?numberOfFilms) {
?entity rdf:type :Film .
}
Where :Film is some particular IRI, not a variable. Also note that you can abbreviate rdf:type with a, so you can make this even shorter and fit it nicely on one line again, if you want:
select (count(?entity) as ?numberOfFilms) { ?entity a :Film }
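If the goal is to see counts for several types at once (say, both actors and films), grouping by type is one option. This is a sketch rather than something checked against the Linked Movie Database; it assumes only that entities are typed with rdf:type:

```sparql
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>

# One row per type, with the number of distinct entities of that type.
SELECT ?type (COUNT(DISTINCT ?entity) AS ?count)
WHERE {
  ?entity rdf:type ?type .
}
GROUP BY ?type
ORDER BY DESC(?count)
```

The IRIs in the ?type column are then the candidates to plug in for :Film above.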

How to form dbPedia iSPARQL query (for wikipedia content)

Say I need to fetch content from Wikipedia about all mountains. My goal is to show the initial paragraph and an image from the respective article (e.g. Monte Rosa and Vincent Pyramid).
I learned about DBpedia and, after some research, found that it provides live queries against the Wikipedia data directly.
I have 2 questions:
1 - I am finding it difficult to formulate my queries, and I can't get anywhere with iSPARQL. I tried the following query, but it throws an error saying invalid XML.
SELECT DISTINCT ?Mountain FROM <http://dbpedia.org> WHERE {
[] rdf:type ?Mountain
}
2 - My requirement is to show only mountains that have at least one image (I need to show this image too). The ones I listed above have images, but how can I be sure in general? Also, looking at both examples, I see that many fields differ between Wikipedia articles, so fetching them may become quite difficult in future extensions.
I just want to reject those which do not have sufficient data or description.
How can I filter out mountains based on pictures present?
UPDATE:
My corrected query, which solves my first problem:
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT DISTINCT ?name ?description
WHERE {
?name rdf:type <http://dbpedia.org/ontology/Mountain>;
dbpedia-owl:abstract ?description .
}
You can also query DBpedia using its SPARQL endpoint (less fancy than iSPARQL). To find out more about what queries to write, take a look at DBpedia's datasets page. The examples there show how one can select pages based on Wikipedia categories. To select resources of the DBpedia ontology class Mountain, you can use the following query:
select ?mountain where {
?mountain a dbpedia-owl:Mountain .
}
Once you have some of these links in hand, you can look at them in a web browser and see the data associated with them. For instance, the page for Mount Everest shows lots of properties. To restrict results to pages that have an image, you might be interested in the dbpedia-owl:thumbnail property, or perhaps better yet foaf:depiction. For the introductory paragraph, you probably want something like dbpedia-owl:abstract. Using those, we can enhance the query from before. The following query finds mountains that have an abstract and a depiction. Since StackOverflow is an English-language site, I've restricted the abstracts to those in English.
select * where {
?mountain a dbpedia-owl:Mountain ;
dbpedia-owl:abstract ?abstract ;
foaf:depiction ?depiction .
FILTER(langMatches(lang(?abstract),"EN"))
}
LIMIT 10
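The same idea works with dbpedia-owl:thumbnail instead of foaf:depiction. Here the prefix is spelled out so the query does not depend on the endpoint's predefined ones. Treat it as a sketch: not every mountain is guaranteed to have a thumbnail, and those without one simply drop out of the results.

```sparql
PREFIX dbpedia-owl: <http://dbpedia.org/ontology/>

# Mountains that have both an English abstract and a thumbnail image.
SELECT ?mountain ?abstract ?thumbnail WHERE {
  ?mountain a dbpedia-owl:Mountain ;
            dbpedia-owl:abstract ?abstract ;
            dbpedia-owl:thumbnail ?thumbnail .
  FILTER(langMatches(lang(?abstract), "en"))
}
LIMIT 10
```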