Full text search in Jena SPARQL?

I am new to SPARQL and I am trying to search for a word in the values of one property. Simple queries work fine, but I don't know how to perform a full text search. I saw this example on the Jena website:
PREFIX text: <http://jena.apache.org/text#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?s
{ ?s text:query (rdfs:label 'word' 10) ;
rdfs:label ?label
}
My model contains a property named SUB: and I want to write a query for it. I don't understand what text and query mean in text:query in the example above. Pardon me if this question doesn't meet the requirements of SO.
Link to the website: http://jena.apache.org/documentation/query/text-query.html

You may not need a full text index:
SELECT ?s
{ ?s your:property ?o .
FILTER regex(str(?o), "word", "i")
}
But if you do, text:query is a "property function". Here text: is just the prefix declared for <http://jena.apache.org/text#>, and query is the name of the function in that namespace. Using it triggers a lookup in the Apache Lucene index and binds ?s to each match of 'word' (up to a limit of 10) over the rdfs:label properties, provided you have correctly configured the index and loaded the data into it.
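To make the contrast between the two approaches concrete, here is a minimal Python sketch. It is not how Jena or Lucene are implemented; the sample labels and the helper names (regex_scan, build_index, index_lookup) are made up for illustration. The regex version tests every value, while the index version tokenises the labels once and then looks a word up directly.

```python
import re
from collections import defaultdict

# Hypothetical (subject, label) pairs standing in for triples of one property.
LABELS = [
    ("ex:a", "A word about Jena"),
    ("ex:b", "Nothing relevant here"),
    ("ex:c", "Another WORD"),
]

def regex_scan(pairs, pattern):
    """Mimic FILTER regex(str(?o), pattern, "i"): test every value in turn."""
    rx = re.compile(pattern, re.IGNORECASE)
    return [s for s, label in pairs if rx.search(label)]

def build_index(pairs):
    """Build a toy inverted index: lower-cased token -> list of subjects."""
    index = defaultdict(list)
    for s, label in pairs:
        for token in re.findall(r"\w+", label.lower()):
            if s not in index[token]:
                index[token].append(s)
    return index

def index_lookup(index, word, limit=10):
    """Mimic the idea behind text:query: look the word up, capped at `limit`."""
    return index.get(word.lower(), [])[:limit]
```

One difference worth keeping in mind: the regex matches substrings anywhere in the value, while a token index only matches whole words as the analyzer split them.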

Related

Why does this federated SPARQL query work in TopBraid but not in Apache Fuseki?

I have the following federated SPARQL query that works as I expect in TopBraid Composer Free Edition (version 5.1.4) but does not work in Apache Fuseki (version 2.3.1):
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX movie: <http://data.linkedmdb.org/resource/movie/>
PREFIX dcterms: <http://purl.org/dc/terms/>
SELECT ?s WHERE {
SERVICE <http://data.linkedmdb.org/sparql> {
<http://data.linkedmdb.org/resource/film/1> movie:actor ?actor .
?actor movie:actor_name ?actorName .
}
SERVICE <http://dbpedia.org/sparql?timeout=30000> {
?s ?p ?o .
FILTER(regex(str(?s), replace(?actorName, " ", "_"))) .
}
}
I monitor the sub SPARQL queries that are being executed under the hood and notice that TopBraid correctly executes the following query to the http://dbpedia.org/sparql endpoint:
SELECT *
WHERE
{ ?s ?p ?o
FILTER regex(str(?s), replace("Paul Reubens", " ", "_"))
}
while Apache Fuseki executes the following sub query:
SELECT *
WHERE
{ ?s ?p ?o
FILTER regex(str(?s), replace(?actorName, " ", "_"))
}
Notice the difference: TopBraid replaces the variable ?actorName with a particular value, "Paul Reubens", while Apache Fuseki does not. This results in an error from the http://dbpedia.org/sparql endpoint because ?actorName is used in the result set but not assigned.
Is this a bug in Apache Fuseki or a feature in TopBraid? How can I make Apache Fuseki correctly execute this federated query?
update 1: to clarify the behaviour difference between TopBraid and Apache Fuseki a bit more. TopBraid executes the linkedmdb.org subquery first and then executes the dbpedia.org subquery for each result of the linkedmdb.org query (substituting ?actorName with the results from the linkedmdb.org query). I assumed Apache Fuseki behaves similarly, but the first subquery to dbpedia.org fails (because ?actorName is used in the result set but not assigned), so it does not continue. Now I am not sure whether it actually wants to execute the subquery to dbpedia.org multiple times, because it never gets there.
update 2: I think both TopBraid and Apache Fuseki use Jena/ARQ, but I noticed that in stack traces from TopBraid the package name is something like com.topbraid.jena.* which might indicate they use a modified version of Jena/ARQ?
update 3: Joshua Taylor says below: "Surely you wouldn't expect the second service block to be executed for each one of them?". Both TopBraid and Apache Fuseki use exactly this method for the following query:
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX movie: <http://data.linkedmdb.org/resource/movie/>
PREFIX dcterms: <http://purl.org/dc/terms/>
SELECT ?film ?label ?subject WHERE {
SERVICE <http://data.linkedmdb.org/sparql> {
?film a movie:film .
?film rdfs:label ?label .
?film owl:sameAs ?dbpediaLink
FILTER(regex(str(?dbpediaLink), "dbpedia", "i"))
}
SERVICE <http://dbpedia.org/sparql> {
?dbpediaLink dcterms:subject ?subject
}
}
LIMIT 50
but I agree that in principle they should execute both parts once and join them; maybe they chose a different strategy for performance reasons?
Additionally, notice that the above query works on Apache Fuseki, while the first query of this post does not. So Apache Fuseki actually behaves like TopBraid in this particular case. The difference seems to be between using a URI variable (?dbpediaLink) in two triple patterns (which works in Fuseki) and using a string variable (?actorName) from a triple pattern in a FILTER regex function (which does not work in Fuseki).
Updated (Simpler) Response
In the original answer I wrote (below), I said that the issue was that SPARQL queries are executed innermost first. I think that that still applies here, but I think the problem can be isolated even more easily. If you have
service <ex1> { ... }
service <ex2> { ... }
then the results have to be what you'd get from executing each query separately on the endpoints and then joining the results. The join will merge any results where the common variables have the same values. E.g.,
service <ex1> { values ?a { 1 2 3 } }
service <ex2> { values ?a { 2 3 4 } }
would execute, and you'd have two possible values for ?a in the outer query (2 and 3). In your query, the second service can't produce any results. If you take:
?s ?p ?o .
FILTER(regex(str(?s), replace(?actorName, " ", "_"))) .
and execute it at DBpedia, you shouldn't get any results, because ?actorName isn't bound, so the filter will never succeed. It appears that TopBraid is performing the first service first and then injecting the resulting values into your second service. That's convenient, but I don't think it's correct, because it returns different results than what you'd get if the DBpedia query had been executed first and the other query executed second.
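The join semantics described above can be sketched in a few lines of Python. This is only a model of the algebra, not of how ARQ or TopBraid evaluate SERVICE; the row data, the ex:actor1 identifier, and the helper names are assumptions, and a substring test stands in for the regex.

```python
def compatible(m1, m2):
    """Two solutions are compatible if every shared variable has the same value."""
    return all(m1[v] == m2[v] for v in m1.keys() & m2.keys())

def join(left, right):
    """SPARQL-style join: merge every compatible pair of solutions."""
    return [{**m1, **m2} for m1 in left for m2 in right if compatible(m1, m2)]

# The two VALUES blocks from the example, each evaluated on its own.
ex1 = [{"a": 1}, {"a": 2}, {"a": 3}]
ex2 = [{"a": 2}, {"a": 3}, {"a": 4}]
joined = join(ex1, ex2)

def filter_block(rows):
    """Keep rows whose ?s contains the underscored ?actorName.
    An unbound ?actorName makes the filter fail, so the row is dropped."""
    kept = []
    for row in rows:
        name = row.get("actorName")
        if name is not None and name.replace(" ", "_") in row["s"]:
            kept.append(row)
    return kept

# Hypothetical rows: the second block never sees ?actorName when run alone.
actors = [{"actor": "ex:actor1", "actorName": "Paul Reubens"}]
dbpedia_rows = [{"s": "http://dbpedia.org/resource/Paul_Reubens"}]
filtered_alone = filter_block(dbpedia_rows)
```

Here joined contains the solutions for 2 and 3, while filtered_alone is empty, so joining it with actors yields nothing regardless of which side is evaluated first.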
Original Answer
Subqueries in SPARQL are executed inner-most first. That means that a query like
select * {
{ select ?x { ?x a :Cat } }
?x foaf:name ?name
}
would first find all the cats, and would then find their names. "Candidate" values for ?x are determined first by the subquery, and then those values for ?x are made available to the outer query. Now, when there are two subqueries, e.g.,
select * {
{ select ?x { ?x a :Cat } }
{ select ?x ?name { ?x foaf:name ?name } }
}
the first subquery is going to find all the cats. The second subquery finds all the names of everything that has a name, and then in the outer query, the results are joined to get just the names of the cats. The values of ?x from the first subquery aren't available during the execution of the second subquery. (At least in principle, a query optimizer might be able to figure out that some things should be restricted.)
My understanding is that service blocks have the same kind of semantics. In your query, you have:
SERVICE <http://data.linkedmdb.org/sparql> {
<http://data.linkedmdb.org/resource/film/1> movie:actor ?actor .
?actor movie:actor_name ?actorName .
}
SERVICE <http://dbpedia.org/sparql?timeout=30000> {
?s ?p ?o .
FILTER(regex(str(?s), replace(?actorName, " ", "_"))) .
}
You say that tracing shows that TopBraid is executing
SELECT *
WHERE
{ ?s ?p ?o
FILTER regex(str(?s), replace("Paul Reubens", " ", "_"))
}
If TopBraid already executed the first service block and got a unique solution, then that might be an acceptable optimization, but what if, for instance, the first query had returned multiple bindings for ?actorName? Surely you wouldn't expect the second service block to be executed for each one of them? Instead, the second service block is executed as written, and will return a result set that will be joined with the result set from the first.
The reason that it probably "doesn't work" in Jena is because the second query doesn't actually bind any variables, so it's pretty much got to look at every triple in the data, which is obviously going to take a long time.
I think that you can get around this by nesting the service calls. If nested services are all launched by the "local" endpoint (i.e., nesting a service call doesn't ask a remote endpoint to make another remote query), then you might be able to do:
SERVICE <http://dbpedia.org/sparql?timeout=30000> {
SERVICE <http://data.linkedmdb.org/sparql> {
<http://data.linkedmdb.org/resource/film/1> movie:actor ?actor .
?actor movie:actor_name ?actorName .
}
?s ?p ?o .
FILTER(regex(str(?s), replace(?actorName, " ", "_"))) .
}
That might get you the kind of optimization that you want, but that still seems like it might not work unless DBpedia has some efficient ways of figuring out which triples to retrieve based on computing the replace. You're asking DBpedia to look at all its triples, and then to keep the ones where the string form of the subject matches a particular regular expression. It'd probably be better to construct that IRI manually in a subquery and then search for it. I.e.,
SERVICE <http://dbpedia.org/sparql?timeout=30000> {
{ select ?dbpActor {
SERVICE <http://data.linkedmdb.org/sparql> {
<http://data.linkedmdb.org/resource/film/1> movie:actor ?actor .
?actor movie:actor_name ?actorName .
}
# ?actor is already bound by the inner SERVICE, so BIND needs a fresh variable
bind(iri(concat("http://dbpedia.org/resource/",
replace(?actorName," ","_")))
as ?dbpActor)
} }
?dbpActor ?p ?o
}
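The IRI construction suggested above (replace spaces with underscores, then prepend the DBpedia resource namespace) can be sketched in Python as follows. The function name is made up, and the mapping is only a heuristic: names with disambiguation suffixes or special characters will not necessarily correspond to an existing DBpedia resource.

```python
# Namespace of DBpedia resources; note the trailing slash.
DBPEDIA_RESOURCE = "http://dbpedia.org/resource/"

def dbpedia_iri(actor_name):
    """Replace spaces with underscores and prepend the resource namespace."""
    return DBPEDIA_RESOURCE + actor_name.replace(" ", "_")
```

Looking up one known IRI is much cheaper for the endpoint than filtering every subject with a regular expression.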
(long comment)
Consider:
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX movie: <http://data.linkedmdb.org/resource/movie/>
PREFIX dcterms: <http://purl.org/dc/terms/>
SELECT ?s WHERE {
{
<http://data.linkedmdb.org/resource/film/1> movie:actor ?actor .
?actor movie:actor_name ?actorName .
}
{
?s ?p ?o .
FILTER(regex(str(?s), replace(?actorName, " ", "_"))) .
}
}
that is the same query but with no SERVICE calls.
?actorName is not in a pattern of the inner second {}.
As join is a commutative operation, this has the same answers as the first query.
SELECT ?s WHERE {
{
?s ?p ?o .
FILTER(regex(str(?s), replace(?actorName, " ", "_"))) .
}
{
<http://data.linkedmdb.org/resource/film/1> movie:actor ?actor .
?actor movie:actor_name ?actorName .
}
}
The SERVICE version highlights this because the parts are executed separately on different machines.
The join of the two parts happens on the results of each part.

using bind concat in construct query

I have the following query
CONSTRUCT{
?entity a something;
a label ?label .
}
WHERE
{
?entity a something;
a label ?label .
BIND(CONCAT(STR( ?label ), " | SOME ADDITIONAL TEXT I WOULD LIKE TO APPEND MANUALLY") ) AS ?label ) .
}
I simply want to concatenate some text with ?label, however when running the query I get the following error:
BIND clause alias '?label' was previously used
I only want to return a single instance of ?label hence, I defined it in the construct clause.
The error message is accurate, but it is only the first of many you will get with this query. As usual, I'd suggest looking at some SPARQL learning resources to understand at least the basics of triple-based graph pattern matching; below are a couple of hints on what to look for. CONSTRUCT isn't a bad place to start, and the following should almost do what I think you intend:
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX ex: <http://example.org/example#>
CONSTRUCT{
?entity rdfs:label ?label .
}
WHERE
{
?entity a ex:something ;
rdfs:label ?oldlabel .
BIND(CONCAT(STR( ?oldlabel ), " | SOME ADDITIONAL TEXT I WOULD LIKE TO APPEND MANUALLY") AS ?label ) .
}
There are quite a few differences in that query, so take a look to see whether it does what you want. One hint is the syntactic difference between using '.' and ';' to separate triple patterns. Another is that each term in a triple pattern must be either an IRI (written as a qname in the example) or a variable prefixed by '?'. Neither 'label' nor 'something' is valid on its own.
I say "almost" because CONSTRUCT only returns a set of triples. To modify the labels, which I think is the intent, you need to use SPARQL Update, i.e.:
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX ex: <http://example.org/example#>
DELETE {
?entity rdfs:label ?oldlabel .
}
INSERT{
?entity rdfs:label ?label .
}
WHERE
{
?entity a ex:something .
?entity rdfs:label ?oldlabel .
BIND(CONCAT(STR( ?oldlabel ), " | SOME ADDITIONAL TEXT I WOULD LIKE TO APPEND MANUALLY") AS ?label ) .
}
Note how the triple pattern finds matches for ?oldlabel and deletes them, inserting the newly bound ?label instead. This query assumes a default graph is defined that holds both the original data and the target for updates. If not then the graph needs to be specified using WITH or GRAPH. (Also included another hint on the syntactic difference between using '.' and ';' to separate triple patterns.)
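As a rough model of what the DELETE/INSERT above does, here is a Python sketch that treats a graph as a set of (subject, predicate, object) tuples. The function name, the sample identifiers, and the string form of the predicates are assumptions for illustration; a real store works on RDF terms, not plain strings.

```python
RDF_TYPE = "rdf:type"
RDFS_LABEL = "rdfs:label"
SUFFIX = " | SOME ADDITIONAL TEXT I WOULD LIKE TO APPEND MANUALLY"

def relabel(graph, cls):
    """Return a new graph where labels of instances of `cls` get SUFFIX appended."""
    instances = {s for s, p, o in graph if p == RDF_TYPE and o == cls}
    # WHERE: collect all matches against the original graph first.
    matches = [(s, o) for s, p, o in graph if p == RDFS_LABEL and s in instances]
    deletes = {(s, RDFS_LABEL, old) for s, old in matches}
    inserts = {(s, RDFS_LABEL, old + SUFFIX) for s, old in matches}
    # The DELETE template is applied before the INSERT template.
    return (graph - deletes) | inserts
```

Because the matches are collected before anything is changed, the old label is removed and the new one added exactly once per entity.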

Retrieving the wider dbpedia vocabulary for tagging pictures

I'm trying to develop a tool in JS for tagging pictures, so I need a set of possible "things" from DBpedia. I already tried retrieving them this way:
select ?s ?l {
?s a owl:Class .
?s rdf:type ?l
FILTER regex(str(?s), "House", "i").
}
http://dbpedia.org/snorql/?query=select+%3Fs+%3Fl+%7B%0D%0A+++%3Fs+a+owl%3AClass+.%0D%0A+++%3Fs+rdf%3Atype+%3Fl%0D%0A+++FILTER+regex%28str%28%3Fs%29%2C+%22House%22%2C+%22i%22%29.%0D%0A%7D
And also this way:
select ?label
WHERE {
?concept a skos:Concept.
?concept skos:prefLabel ?label.
FILTER regex(str(?label), "^House", "i").
}
http://dbpedia.org/snorql/?query=select+%3Flabel+%0D%0AWHERE+%7B%0D%0A++%3Fconcept+a+skos%3AConcept.%0D%0A++%3Fconcept+skos%3AprefLabel+%3Flabel.%0D%0A++FILTER+regex%28str%28%3Flabel%29%2C+%22%5EHouse%22%2C+%22i%22%29.%0D%0A%7D
In the first case, I get only "instances" of the house "thing", but not the "House" class itself. In the second, I never retrieve "house"; the closest match is "houses". Is there an alternative for retrieving a better vocabulary based on the DBpedia dataset?
If you don't restrict yourself to owl:Thing or to skos:Concept, you can just get things that have a label that contains "house". Rather than using regex, I chose to use contains and lcase, since a string containment test could be less expensive than invoking a full regular expression processor.
select ?thing ?label where {
?thing rdfs:label ?label .
filter contains(lcase(?label), "house")
}
SPARQL results (limited to 200)
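The difference between the anchored regex from the question and the containment test from the answer can be illustrated with a short Python sketch. The labels are invented for the example and are not actual DBpedia results.

```python
import re

# Hypothetical labels, not actual DBpedia data.
LABELS = ["House", "Houses", "Opera house", "Greenhouse gas", "Horse"]

def starts_with_house(labels):
    """Mimic FILTER regex(str(?label), "^House", "i"): anchored at the start."""
    rx = re.compile(r"^House", re.IGNORECASE)
    return [label for label in labels if rx.search(label)]

def contains_house(labels):
    """Mimic FILTER contains(lcase(?label), "house"): plain substring test."""
    return [label for label in labels if "house" in label.lower()]
```

The containment test is broader, so for tagging it may be worth filtering or ranking the matches afterwards.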

Obtain linked resources with SPARQL query on DBpedia

Hello everybody, I'm using a SPARQL query to retrieve the properties and values of a specified resource. For example, if I ask for Barry White, I obtain: birth place, associatedBand, recordLabels and so on.
However, for some instances, such as "Hammerfall", I obtain only these results:
Query results
But I want the properties and values shown on this page: Correct results.
My query is:
PREFIX db: <http://dbpedia.org/resource/>
PREFIX prop: <http://dbpedia.org/property/>
PREFIX onto: <http://dbpedia.org/ontology/>
SELECT ?property ?value
WHERE { db:Hammerfall ?property ?value }
Can anyone tell me how to access the correct resource and obtain the correct properties and values in every case?
select ?p ?o { dbpedia:HammerFall ?p ?o }
SPARQL results
The particular prefix doesn't matter; I just used dbpedia: because it's predefined on the endpoint as http://dbpedia.org/resource/, just like your db:. The issue is that HammerFall has a majuscule F in the middle, but your query uses a minuscule f.
As an alternative, since the results for Hammerfall (with a minuscule f) do include
http://dbpedia.org/ontology/wikiPageRedirects http://dbpedia.org/resource/HammerFall
you could use a property path to follow any wikiPageRedirects paths:
select ?p ?v {
dbpedia:Hammerfall dbpedia-owl:wikiPageRedirects* ?hammerfall .
?hammerfall ?p ?v
}
SPARQL results
See Retrieving dbpedia-owl:type value of resource with dbpedia-owl:wikiPageRedirect value? for more about that approach.
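What the wikiPageRedirects* path does can be modelled with a small Python sketch. The redirect table is hypothetical and assumes at most one redirect per resource; the function name is made up. Note that a zero-or-more path also matches the starting node itself.

```python
# Hypothetical redirect table: resource -> resource it redirects to.
REDIRECTS = {"dbpedia:Hammerfall": "dbpedia:HammerFall"}

def reachable(resource, redirects):
    """Return every node reachable via zero or more redirect links, start included."""
    found = [resource]
    # Follow the chain, stopping at the end or if a cycle is detected.
    while found[-1] in redirects and redirects[found[-1]] not in found:
        found.append(redirects[found[-1]])
    return found
```

Because the start node is included, the query returns the few properties of the redirect page as well as those of the target.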

Sparql to recover the Type of a DBpedia resource

I need a SPARQL query to retrieve the type of a specific DBpedia resource, e.g.:
pt.DBpedia resource: http://pt.dbpedia.org/resource/Argentina
Expected type: Country (as can be seen at http://pt.dbpedia.org/page/Argentina)
Using pt.DBpedia Sparql Virtuoso Interface (http://pt.dbpedia.org/sparql) I have the query below:
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
select ?l ?t where {
?l rdfs:label "Argentina"@pt .
?l rdf:type ?t .
}
But it returns nothing; it just prints the variable names (the Virtuoso answer).
Actually, I do not need to retrieve the label (?l) either.
Can anyone fix it, or help me define the correct query?
http in graph name
I'm not sure how you generated your query string, but when I copy and paste your query into the endpoint and run it, I get results, and the resulting URL looks like:
http://pt.dbpedia.org/sparql?default-graph-uri=http%3A%2F%2Fpt.dbpedia.org&sho...
However, the link in your question is:
http://pt.dbpedia.org/sparql?default-graph-uri=pt.dbpedia.org%2F&should-sponge...
If you look carefully, you'll see that the default-graph-uri parameters are different:
yours: pt.dbpedia.org%2F
mine: http%3A%2F%2Fpt.dbpedia.org
I'm not sure how you got a URL like the one you did, but it's not right; the default-graph-uri needs to be http://pt.dbpedia.org, not pt.dbpedia.org/.
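If the request URL is being built by hand or by a script, the two parameter values above can be reproduced with Python's standard library. This is only a sketch of the encoding step; the helper name is made up, and other parameters the endpoint accepts are left out.

```python
from urllib.parse import urlencode

def graph_param(graph_uri):
    """Percent-encode the default-graph-uri query parameter."""
    return urlencode({"default-graph-uri": graph_uri})

# The full graph IRI, scheme included, is what the endpoint expects.
good = graph_param("http://pt.dbpedia.org")
# Dropping the scheme produces the value seen in the failing link.
bad = graph_param("pt.dbpedia.org/")
```

Here good encodes to the form shown as "mine" above and bad to the form shown as "yours".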
The query is fine
When I run the query you've provided at the endpoint you've linked to, I get the results that I'd expect. It's worth noting that the label here is the literal "Argentina"@pt, and that what you've called ?l is the individual, not the label. The individual ?l has the label "Argentina"@pt.
We can simplify your query a bit, using ?i instead of ?l (to suggest individual):
select ?i ?type where {
?i rdfs:label "Argentina"@pt ;
a ?type .
}
When I run this at the Portuguese endpoint, I get these results:
If you don't want the individual in the results, you don't have to select it:
select ?type where {
?i rdfs:label "Argentina"@pt ;
a ?type .
}
or even:
select ?type where {
[ rdfs:label "Argentina"@pt ; a ?type ]
}
If you know the identifier of the resource, and don't need to retrieve it by using its label, you can even just do:
select ?type where {
dbpedia-pt:Argentina a ?type
}
type
==========================================
http://www.w3.org/2002/07/owl#Thing
http://www.opengis.net/gml/_Feature
http://dbpedia.org/ontology/Place
http://dbpedia.org/ontology/PopulatedPlace
http://dbpedia.org/ontology/Country
http://schema.org/Place
http://schema.org/Country