Is there a better way to do case insensitive queries? [duplicate] - sparql

I am using Jena ARQ to write a SPARQL query against a large ontology being read from Jena TDB in order to find the types associated with concepts based on rdfs label:
SELECT DISTINCT ?type WHERE {
?x <http://www.w3.org/2000/01/rdf-schema#label> "aspirin" .
?x <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> ?type .
}
This works pretty well and is actually quite speedy (<1 second). Unfortunately, for some terms, I need to perform this query in a case-insensitive way. For instance, because the label "Tylenol" is in the ontology, but not "tylenol", the following query comes up empty:
SELECT DISTINCT ?type WHERE {
?x <http://www.w3.org/2000/01/rdf-schema#label> "tylenol" .
?x <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> ?type .
}
I can write a case-insensitive version of this query using FILTER syntax like so:
SELECT DISTINCT ?type WHERE {
?x <http://www.w3.org/2000/01/rdf-schema#label> ?term .
?x <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> ?type .
FILTER ( regex (str(?term), "tylenol", "i") )
}
But now the query takes over a minute to complete! Is there any way to write the case-insensitive query in a more efficient manner?

Of all the string operators that you can use in SPARQL, regex is probably the most expensive one. Your query might run faster if you avoid regex and use UCASE or LCASE on both sides of the test instead. Something like:
SELECT DISTINCT ?type WHERE {
?x <http://www.w3.org/2000/01/rdf-schema#label> ?term .
?x <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> ?type .
FILTER (lcase(str(?term)) = "tylenol")
}
This might be faster but in general do not expect great performance for text search with any triple store. Triple stores are very good at graph matching and not so good at string matching.

The query with the FILTER runs slower because ?term is unbound: the engine has to scan the PSO or POS index to find all statements with the rdfs:label predicate and then test each one against the regex. When the object was bound to a concrete value (as in your first example), it could use an OPS or POS index to scan over only the statements with the rdfs:label predicate and the specified object, which have a much lower cardinality.
The common solution to this type of text searching problem is to use an external text index. In this case, Jena provides a free text index called LARQ, which uses Lucene to perform the search and joins the results with the rest of the query.
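The difference between scanning every label and consulting a text index can be illustrated with a small toy model. This is only a sketch in plain Python, not how TDB or LARQ are implemented; the `ex:` identifiers, the sample labels, and all function names are made up for illustration.

```python
from collections import defaultdict

# Toy data: (subject, label) pairs standing in for rdfs:label triples.
# The subjects and labels are hypothetical.
LABELS = [
    ("ex:drug1", "Tylenol"),
    ("ex:drug2", "aspirin"),
    ("ex:drug3", "TYLENOL"),
]


def scan_case_insensitive(labels, term):
    """Mimics the FILTER approach: touch every label and compare it."""
    wanted = term.lower()
    return sorted(s for s, label in labels if label.lower() == wanted)


def build_folded_index(labels):
    """Mimics an external text index: fold case once, at load time."""
    index = defaultdict(set)
    for s, label in labels:
        index[label.lower()].add(s)
    return index


def lookup_case_insensitive(index, term):
    """A single keyed lookup instead of a scan over all labels."""
    return sorted(index.get(term.lower(), ()))


FOLDED_INDEX = build_folded_index(LABELS)
print(scan_case_insensitive(LABELS, "tylenol"))
print(lookup_case_insensitive(FOLDED_INDEX, "tylenol"))
```

Both calls find the same subjects, but the scan does work proportional to the number of labels on every query, while the index pays that cost once when the data is loaded.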

Related

MarkLogic seems not to follow the RDF specification

Write the following RDF data into MarkLogic:
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
<http://John> <http://have> 10 .
<http://Peter> <http://have> "10"^^xsd:double .
Then execute the following query:
SELECT *
WHERE { ?s ?p 10. }
The query result is:
?s              ?p
<http://John>   <http://have>
<http://Peter>  <http://have>
According to the RDF specification, two RDF terms are equal if and only if both the value and the datatype are the same:
Literal term equality: Two literals are term-equal (the same RDF literal) if and only if the two lexical forms, the two datatype IRIs, and the two language tags (if any) compare equal, character by character. Thus, two literals can have the same value without being the same RDF term.
https://www.w3.org/TR/rdf11-concepts/#section-Graph-Literal
It seems that MarkLogic does not follow the RDF specification?
The expected query result is:
?s             ?p
<http://John>  <http://have>
Alternatives
Executing the SPARQL query in Query Console and through the Java Client API both return the unexpected result.
This query can return the expected query result in Apache Jena and RDF4j.
Can someone give me an answer or a hint about it?
A triple store that does that is not automatically wrong.
Try:
{ ?s ?p ?x . FILTER (sameTerm(?x, 10)) }
{ ?s ?p ?x . FILTER (?x = 10) }
Also try the same queries with 010, +10 and "010"^^xsd:integer, and explore the lexical forms:
FILTER(str(?x) = "10")
The use cases for both value-based matching and term-based matching exist.
A SPARQL query does not say whether the data loaded is exactly the data queried (by design). Several stores do variations of value processing as their default behaviour. Jena does this for TDB2 but keeps datatypes apart (TDB1's conflation of datatypes was less popular).
It depends on what use case they are targeting. If you want a particular effect, use a FILTER form of the query, assuming the data was not pre-processed on loading.
See the answer on Unexpected data writing result in MarkLogic
See https://www.w3.org/TR/sparql11-entailment/#CanonicalLit for the discussion that D-entailment gives many, many answers even with a restriction to canonical lexical forms.
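The distinction between term-based and value-based matching can be sketched with a toy model in Python. This is an illustration only, not how MarkLogic or Jena represent literals: a literal is modelled as a (lexical form, datatype IRI) pair, and the helper names are invented for this example. The value comparison assumes plain numeric lexical forms.

```python
from decimal import Decimal

XSD = "http://www.w3.org/2001/XMLSchema#"


def literal(lexical, datatype):
    """Model an RDF literal as a (lexical form, datatype IRI) pair."""
    return (lexical, XSD + datatype)


def same_term(a, b):
    """Term equality: lexical form and datatype must match exactly."""
    return a == b


def value_equal(a, b):
    """Value equality for numeric literals: compare the numbers denoted.

    Simplification: assumes both lexical forms are plain numbers.
    """
    return Decimal(a[0]) == Decimal(b[0])


ten_integer = literal("10", "integer")
ten_double = literal("10", "double")
padded_integer = literal("010", "integer")

print(same_term(ten_integer, ten_double), value_equal(ten_integer, ten_double))
print(same_term(ten_integer, padded_integer), value_equal(ten_integer, padded_integer))
```

In this model a pattern such as `?s ?p 10` behaves like `same_term` in a store that matches terms, and like `value_equal` in a store that normalises values on load or at query time.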

SPARQL: Count how often an object property is used

I have an RDF graph and I want to count how many times the skos:broader object property is actually used. Is there a simple (and ideally efficient) SPARQL query I can use for that?
Trivially simple answer:
SELECT (COUNT(*) as ?cnt)
WHERE { ?s skos:broader ?o }

SPARQL query. order of triple patterns

Consider a SPARQL aggregation query with a where clause composed of only triple patterns and filter statements.
Does the order of the triple patterns or the order of the filters affect the results (not the performance)? I am quite convinced it doesn't affect the validity of the results. But maybe someone can clarify why? Also, does the order of the variables in the group by affect the result's validity? I don't think so either but maybe someone can clarify why?
The following example is just for clarification; I am asking generally.
The query counts the number of people grouped by their eye color and country and applies a couple of filters. The query contains only these four triple patterns (each line ending by a dot is a triple pattern, i.e., a triple with variables)
?p wdt:P26 ?human.
?p wdt:P27 ?country.
?human wdt:P31 wd:Q5.
?human wdt:P1340 ?eyeColor.
and these two filters:
FILTER (?country != wd:Q142)
FILTER (?eyeColor != wd:Q17122705)
The WHERE clause doesn't contain any OPTIONAL, UNION, MINUS,... or any other constructor; only triple patterns and filters.
SELECT ?eyeColor (COUNT(?human) AS ?count) ?country
WHERE
{
?p wdt:P26 ?human.
?p wdt:P27 ?country.
?human wdt:P31 wd:Q5.
?human wdt:P1340 ?eyeColor.
FILTER (?country != wd:Q142)
FILTER (?eyeColor != wd:Q17122705)
SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}
GROUP BY ?eyeColor ?country
My question again: if we change the order of the six elements in the WHERE clause in any way, does that affect the result (not only in this query but also for similar ones in terms of structure)? The same question about whether the order of the variables in the GROUP BY might affect the results, for example GROUP BY ?country ?eyeColor instead of GROUP BY ?eyeColor ?country. Again, I am asking in general, not only for this case query.
No, the order does not change the results when the group contains only triple patterns and filters.
All filters happen at the end of the {} block they are in (an optimizer may move them but must not change the result).
A set of triple patterns "subject predicate object" will produce the same results regardless of their order.
Note that SERVICE is not a triple pattern, and wikibase:label is specific to Wikidata rather than part of the SPARQL spec anyway.
Order can matter once the group includes any of the other graph pattern operations, such as SERVICE or OPTIONAL.
See the SPARQL algebra in the spec, or use sparql.org to view the algebra for a query.
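Why the order of triple patterns is irrelevant can be seen in a toy evaluator. The sketch below is plain Python and a simplification of the SPARQL algebra, not a real engine; the data, the predicate names, and the function names are all hypothetical. It joins patterns one after another, applies the filters to the whole group at the end, and checks that every ordering yields the same multiset of solutions.

```python
from collections import Counter
from itertools import permutations

# Hypothetical data, loosely shaped like the example query.
TRIPLES = [
    ("p1", "spouse", "h1"),
    ("p1", "country", "c1"),
    ("p2", "spouse", "h2"),
    ("p2", "country", "c2"),
    ("h1", "eyeColor", "blue"),
    ("h2", "eyeColor", "brown"),
]

PATTERNS = [
    ("?p", "spouse", "?human"),
    ("?p", "country", "?country"),
    ("?human", "eyeColor", "?eyeColor"),
]

FILTERS = [lambda s: s["?country"] != "c2"]


def match(pattern, triple, binding):
    """Extend binding so that pattern matches triple, or return None."""
    result = dict(binding)
    for p, t in zip(pattern, triple):
        if p.startswith("?"):
            # A variable must agree with any value it already has.
            if result.setdefault(p, t) != t:
                return None
        elif p != t:
            return None
    return result


def evaluate(patterns, filters, triples):
    """Join the patterns in the given order, then filter the group."""
    solutions = [{}]
    for pattern in patterns:
        solutions = [
            extended
            for binding in solutions
            for triple in triples
            if (extended := match(pattern, triple, binding)) is not None
        ]
    kept = [s for s in solutions if all(f(s) for f in filters)]
    # A Counter of frozen bindings is a multiset, so row order is ignored.
    return Counter(tuple(sorted(s.items())) for s in kept)


baseline = evaluate(PATTERNS, FILTERS, TRIPLES)
for order in permutations(PATTERNS):
    assert evaluate(list(order), FILTERS, TRIPLES) == baseline
```

The join of two sets of solution mappings is commutative and associative, which is what the loop over permutations exercises. The same reasoning applies to GROUP BY: the grouping key is the combination of values, so listing the variables in a different order produces the same groups.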

SPARQL query for a DBpedia instance's properties and values in a domain

I want to write a SPARQL query to retrieve all properties and values for an instance in a particular domain. For example, I want all properties and values of the singer "Sting" only in "MusicalArtist" domain. How can I do this?
If I understand you correctly, this is a pretty basic SPARQL question, and you'd be better off starting with a good tutorial on SPARQL, or sitting down with the SPARQL 1.1 Query Language document. Just skim through it, you don't need to memorize everything right away, but just get familiar with it. After all, Stack Overflow isn't for reproducing standard documentation; you're expected to already be familiar with that.
As you've shown in a previous question, you can ask for all properties of a given subject with a query like:
select ?p ?o where {
dbpedia:Barry_White ?p ?o
}
You can select all instances of a particular class, e.g., dbpedia-owl:MusicalWork with a query like
select ?o where {
?o a dbpedia-owl:MusicalWork
}
It sounds like you're simply asking how to combine the two. It's just
select ?p ?o where {
dbpedia:Barry_White ?p ?o .
?o a dbpedia-owl:MusicalWork
}
SPARQL results (0 results)
That query doesn't actually have any results, because Barry White isn't related to anything that's a MusicalWork, evidently. However, if you just ask for Works that he's associated with, you'll get a few results:
select ?p ?o where {
dbpedia:Barry_White ?p ?o .
?o a dbpedia-owl:Work
}
SPARQL results (3 results)

DBpedia query returns some musicals more than once despite filters

I'm trying to use a SPARQL query against DBpedia to retrieve a list of musicals and some associated properties. However, despite using the appropriate filters (as far as I can tell), the results include many of the musicals more than once. Here is my query:
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbpprop: <http://dbpedia.org/property/>
SELECT ?label ?abstract ?book ?music ?lyrics
WHERE {
?play <http://purl.org/dc/terms/subject> <http://dbpedia.org/resource/Category:Broadway_musicals> ;
rdfs:label ?label ;
dbo:abstract ?abstract ;
dbpprop:book ?book ;
dbpprop:lyrics ?lyrics ;
dbpprop:music ?music .
FILTER (LANG(?label) = 'en')
FILTER (LANG(?abstract) = 'en')
FILTER (LANG(?book) = 'en')
FILTER (LANG(?lyrics) = 'en')
FILTER (LANG(?music) = 'en')
}
The resulting list has many duplicate entries. If you paste the query into the DBpedia SPARQL Explorer, you'll see that starting with 'Mamma Mia!' there are a lot of duplicates in the list.
Any idea what I'm missing to get unique results with no duplicates? Thanks!
[Edited by glenn mcdonald to clarify that it's musicals which are "duplicated" here, not triples.]
SPARQL returns variable-bindings. Your "duplicates" are cartesian products of multiples in your projected properties. Mamma Mia has multiple music writers and multiple lyricists, so you get every possible combination of them that could produce a row in your table.
What a pain, huh? The "solution" is to use CONSTRUCT instead of SELECT, and deal with getting back a graph instead of a table. Maybe like this:
http://dbpedia.org/snorql/?query=PREFIX+rdfs%3A+%3Chttp%3A%2F%2Fwww.w3.org%2F2000%2F01%2Frdf-schema%23%3E%0D%0A++++PREFIX+dbo%3A+%3Chttp%3A%2F%2Fdbpedia.org%2Fontology%2F%3E%0D%0A++++PREFIX+dbpprop%3A+%3Chttp%3A%2F%2Fdbpedia.org%2Fproperty%2F%3E%0D%0A++++CONSTRUCT+%7B%0D%0A++++++++%3Fplay+rdfs%3Alabel+%3Flabel+%3B%0D%0A++++++++++++dbo%3Aabstract+%3Fabstract+%3B%0D%0A++++++++++++dbpprop%3Abook+%3Fbook+%3B%0D%0A++++++++++++dbpprop%3Alyrics+%3Flyrics+%3B%0D%0A++++++++++++dbpprop%3Amusic+%3Fmusic+.%0D%0A++++%7D%0D%0A++++WHERE+%7B+%0D%0A++++++++%3Fplay+%3Chttp%3A%2F%2Fpurl.org%2Fdc%2Fterms%2Fsubject%3E+%3Chttp%3A%2F%2Fdbpedia.org%2Fresource%2FCategory%3ABroadway_musicals%3E+%3B%0D%0A++++++++++++rdfs%3Alabel+%3Flabel+%3B%0D%0A++++++++++++dbo%3Aabstract+%3Fabstract+%3B%0D%0A++++++++++++dbpprop%3Abook+%3Fbook+%3B%0D%0A++++++++++++dbpprop%3Alyrics+%3Flyrics+%3B%0D%0A++++++++++++dbpprop%3Amusic+%3Fmusic+.%0D%0A++++++++FILTER+%28LANG%28%3Flabel%29+%3D+%27en%27%29++++%0D%0A++++++++FILTER+%28LANG%28%3Fabstract%29+%3D+%27en%27%29%0D%0A++++++++FILTER+%28LANG%28%3Fbook%29+%3D+%27en%27%29%0D%0A++++++++FILTER+%28LANG%28%3Flyrics%29+%3D+%27en%27%29%0D%0A++++++++FILTER+%28LANG%28%3Fmusic%29+%3D+%27en%27%29%0D%0A++++%7D
Are the duplicates exact duplicates, i.e. is every value for every variable of each duplicate result identical?
If so, add the DISTINCT keyword after SELECT to force the SPARQL engine to discard duplicate solutions.
If not, then Glenn is entirely correct: because there are multiple values for the various properties, you will get multiple results. There are complex workarounds you can do with subqueries, GROUP BY etc., but they tend to lead to less efficient queries. Sometimes you just have to deal with the duplicates on the client side.
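Both the cartesian-product effect and client-side handling can be sketched in a few lines of Python. The property values below are placeholders, not actual DBpedia data, and the function names are invented for this example.

```python
from itertools import product

# Placeholder values: one musical with two composers and two lyricists.
MUSICAL = {
    "label": ["Mamma Mia!"],
    "music": ["composer A", "composer B"],
    "lyrics": ["lyricist A", "lyricist B"],
}


def select_rows(props):
    """Mimic SELECT: one row per combination of property values."""
    keys = list(props)
    return [dict(zip(keys, combo)) for combo in product(*(props[k] for k in keys))]


def collapse(rows, key="label"):
    """Client-side fix: merge rows sharing a key into sets of values."""
    grouped = {}
    for row in rows:
        entry = grouped.setdefault(row[key], {})
        for name, value in row.items():
            if name != key:
                entry.setdefault(name, set()).add(value)
    return grouped


rows = select_rows(MUSICAL)
print(len(rows))        # number of rows the table-shaped result contains
print(collapse(rows))   # one entry per musical, values gathered into sets
```

Two composers times two lyricists gives four rows, none of which is an exact duplicate of another, so DISTINCT would not remove any of them. The server-side counterpart of `collapse` is a GROUP BY on the musical with an aggregate such as GROUP_CONCAT over the multi-valued properties.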