How to know if a string is a proper name of person or a place name using DBpedia? - sparql

I am using SPARQL queries on DBpedia in a Prolog project and I have a question. I would like to know whether a word is, most probably, a NAME OF A PERSON (something like John or Mario) or a PLACE (like a city: Rome, London, New York).
I have implemented the following two queries: the first gives me the number of persons having a specific name, and the second gives me the number of places having a specific name.
1) Query for a PERSON NAME:
select COUNT(?person) where {
?person a dbpedia-owl:Person .
{ ?person foaf:givenName "John"@en }
UNION
{ ?person foaf:surname "John"@en }
}
For the name John, I obtain the following output: callret-0: 7313, so I think that it has found 7313 instances for the proper name John. Is it right?
2) Query for a PLACE NAME:
select COUNT(?place) where {
?place a dbpedia-owl:Place .
{ ?x rdfs:label "John"@en }
}
The problem is that, as you can see in the previous “place” query, I passed John as the parameter, which is not a place name but a personal name, and yet I obtain the following strange result: callret-0: 81900104
As a result, if I compare the output of the two queries, it seems that John is a place name and not a person name! This is no good for my purpose; I have tried other personal names, and the place query always gives me a bigger number than the name query.
Why? What am I missing? Are there errors in my queries? How can I fix them to get a correct result?

Actually, when I run the query you provided:
select COUNT(?place) where {
?place a dbpedia-owl:Place .
{ ?x rdfs:label "John"@en }
}
I get the result 93027312, not 81900104, but that does not really matter much. The strange results arise because ?x and ?place don't have to be bound to the same thing, so you are getting all the dbpedia-owl:Places and counting them, but the number of result rows is the number of dbpedia-owl:Places multiplied by the number of things with the rdfs:label "John"@en:
select COUNT(?place) where { ?place a dbpedia-owl:Place }
=> 646023
select COUNT(?x) where { ?x rdfs:label "John"@en }
=> 144
646023 × 144 = 93027312
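The multiplication can be sketched outside SPARQL as well. The following is a minimal Python illustration, assuming tiny made-up lists in place of the real DBpedia data: two patterns that share no variable behave like a cross product, whereas requiring the same resource to satisfy both patterns behaves like an intersection.

```python
from itertools import product

# Toy stand-ins for the two triple patterns (made-up data, not DBpedia).
places = ["Rome", "London", "New_York"]                # ?place a dbpedia-owl:Place
labelled_john = ["John_(given_name)", "John_(band)"]   # ?x rdfs:label "John"@en

# No shared variable: every ?place is paired with every ?x.
rows = list(product(places, labelled_john))
cross_count = len(rows)

# Shared variable: only resources that satisfy both patterns survive.
joined = [p for p in places if p in labelled_john]

print(cross_count)     # number of places times number of labelled things
print(len(joined))     # nothing in the toy data is both a place and labelled "John"
print(646023 * 144)    # the same arithmetic with the counts quoted above
```

With the toy data, the first count is simply 3 × 2, in the same way that the endpoint reports 646023 × 144.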
If you actually ask for dbpedia-owl:Places that have the rdfs:label "John"@en, you'll get no results:
select (COUNT(?place) as ?numPlaces) where {
?place a dbpedia-owl:Place ;
rdfs:label "John"@en .
}
Also, you might consider using dbpprop:name instead of rdfs:label. Some results seem like they are more useful that way. For instance, let us find places called "Springfield". If we ask for places with that name we get no results:
select * where {
?place a dbpedia-owl:Place ;
rdfs:label "Springfield"@en .
}
However, if we modify the query and use dbpprop:name, we get 17 results. Some of these are duplicates, though, so you might have to do something else to remove them. The point is that dbpprop:name got some results and rdfs:label didn't.
select * where {
?place a dbpedia-owl:Place ;
dbpprop:name "Springfield"@en .
}
You can even use dbpprop:name in working with the names of persons, although it's not as useful, because the dbpprop:name value for most persons is their entire name. To find persons with the given name John using dbpprop:name requires a query like:
select * where {
?person a dbpedia-owl:Person ;
dbpprop:name ?name .
FILTER( STRSTARTS( str( ?name ), "John" ) )
}
(or you could use CONTAINS instead of STRSTARTS), but this becomes much more expensive, because it has to select all persons and their names, and then filter through that set. Being able to select persons based on a specific name (e.g., with foaf:givenName) is much more efficient.
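To make the cost difference concrete, here is a rough Python analogy, assuming a small made-up dictionary of names rather than real DBpedia data. Looking up a specific given name is like a keyed lookup, while a STRSTARTS filter is like scanning every name. The scan also matches longer names that merely begin with the same letters.

```python
# Hypothetical in-memory stand-in for person names (not real DBpedia data).
full_names = {
    "person1": "John Smith",
    "person2": "Mario Rossi",
    "person3": "John Doe",
    "person4": "Johnny Cash",
}

# Index keyed by given name, analogous to matching foaf:givenName "John"@en directly.
by_given_name = {}
for person, name in full_names.items():
    by_given_name.setdefault(name.split()[0], []).append(person)

exact = by_given_name.get("John", [])   # a single keyed lookup

# Analogous to FILTER(STRSTARTS(str(?name), "John")): every name is inspected.
scanned = [p for p, n in full_names.items() if n.startswith("John")]

print(exact)
print(scanned)   # also contains person4, because "Johnny" starts with "John"
```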

Related

Comparing two sets in SPARQL to find instances in Set 1 that do not exist in Set 2

I am trying to create two sets in SPARQL and return the number of instances in ?set1 that do not exist in ?set2. I want to do this entirely in SPARQL, but I was not able to find a good resource on how to do it with FILTER or with MINUS, and I am not having any luck.
SELECT COUNT(?set1)
WHERE {
?S ?P ?set1 .
?Q ?R ?set2 .
FILTER ( member in ?set1 doesn't exist in ?set2 )  # pseudocode for the condition I want
}
You could use FILTER NOT EXISTS:
SELECT (COUNT(?electronic) AS ?count)
WHERE {
?employee :usesElectronic ?electronic .
FILTER NOT EXISTS { ?company :ownsElectronic ?electronic . }
}
This counts the devices that are used by any employee and are not owned by any company.
If you just need to know if such a case exists, and you are not interested in the count, you could use an ASK query instead, which returns TRUE or FALSE. For that, replace the whole SELECT line with ASK.
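The logic of the query can be mirrored in a few lines of Python, assuming made-up (subject, object) pairs for the two predicates. One detail the sketch makes visible: COUNT(?electronic) counts result rows, so a device used by two employees is counted twice unless DISTINCT is added.

```python
# Made-up triples, written as (subject, object) pairs per predicate.
uses_electronic = [("alice", "laptop1"), ("bob", "phone7"), ("carol", "laptop1")]
owns_electronic = [("acme", "phone7")]

owned = {device for _, device in owns_electronic}

# FILTER NOT EXISTS keeps a row only if no matching ownership triple exists.
rows = [(emp, dev) for emp, dev in uses_electronic if dev not in owned]

count_rows = len(rows)                           # like COUNT(?electronic)
count_distinct = len({dev for _, dev in rows})   # like COUNT(DISTINCT ?electronic)
exists = any(dev not in owned for _, dev in uses_electronic)   # like the ASK form

print(count_rows, count_distinct, exists)
```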

SPARQL query including a subquery on Wikidata gives unexpected results

I know the following SPARQL query against the Wikidata SPARQL endpoint is senseless. A similar query is generated automatically from within my application. Please disregard its conceptual soundness, and let's dig into this strange (for me at least) thing that is happening.
SELECT ?year1 ?year_labelTemp
WHERE
{
?year1 <http://www.w3.org/2000/01/rdf-schema#label> ?year_labelTemp .
{ SELECT distinct ?year1
WHERE
{ ?film <http://www.wikidata.org/prop/direct/P577> ?date ;
<http://www.wikidata.org/prop/direct/P31> <http://www.wikidata.org/entity/Q11424>
BIND(year(?date) AS ?year1)
}
}
}
limit 10
According to query evaluation in SPARQL, the subquery is evaluated first, and its results are then projected out to the containing query. Consequently, this subquery will be evaluated first.
SELECT distinct ?year1
WHERE
{ ?film <http://www.wikidata.org/prop/direct/P577> ?date ;
<http://www.wikidata.org/prop/direct/P31> <http://www.wikidata.org/entity/Q11424>
BIND(year(?date) AS ?year1)
}
The subquery gives exactly the results expected (130 different years). Then, the results of this subquery (?year1 variable) will be projected out and joined with the triple pattern in the outer select.
?year1 <http://www.w3.org/2000/01/rdf-schema#label> ?year_labelTemp .
However, as the outer select shouldn't have any data (there are no labels for ?year1), the join should give no results.
Surprisingly (at least for me), executing the whole query (stated first) gives results, and the results are weird.
wd:Q43576 Mië
wd:Q221 Masèdonia
wd:Q221 Республикэу Македоние
wd:Q221 Republiek van Masedonië
wd:Q212 Украина
wd:Q212 Ukraina
wd:Q212 Украинэ
wd:Q212 Oekraïne
wd:Q207 George W. Bush
wd:Q207 George W. Bush
What am I missing?
The problem is that sometimes BIND does not project variables correctly.
You can check this with the following query:
SELECT ?year1 ?year_labelTemp ?projected
WHERE
{
?year1 rdfs:label ?year_labelTemp .
hint:Prior hint:runLast true .
{ SELECT DISTINCT ?year1
WHERE
{ ?film wdt:P577 ?date ;
wdt:P31 wd:Q11424
BIND(year(?date) AS ?year1)
hint:SubQuery hint:runOnce true
}
}
BIND(bound(?year1) AS ?projected)
}
LIMIT 10
Fortunately, the following trick helps:
SELECT ?year1 ?year_labelTemp
WHERE
{
?year1 rdfs:label ?year_labelTemp .
hint:Prior hint:runLast true .
{ SELECT DISTINCT ?year1
WHERE
{ ?film wdt:P577 ?date ;
wdt:P31 wd:Q11424
BIND(year(?date) AS ?year1)
FILTER (?year1 > 0)
}
}
}
LIMIT 10
The bug can be reproduced without nested subqueries and with hint:Query hint:optimizer "None", so it should not be a query optimizer bug. Interestingly, though, the bug disappears after replacing wd:Q11424 with wd:Q24862.
BLZG-963 seems to be the most related issue (as you can see, built-in functions are involved too).
You wrote that the subquery gave exactly the expected result, but I think you missed one value! There are films with an empty "unknown value" as their publication date, for example Q18844655 (at least at the time of writing). It was this empty value that resulted in the seemingly random objects being found.
If you change your inner SELECT by adding, for example, FILTER(datatype(?date) = xsd:dateTime), you will only get actual dates and therefore only actual years, which means one value fewer than without the filter.
(When this corrected inner SELECT is used, the whole query then times out. The labelling really doesn't like odd values like these, it seems.)
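Why a single odd value can produce "random" rows follows from standard SPARQL semantics: when the expression in a BIND raises an error, the variable is left unbound, and an unbound variable is compatible with any value in a join. The Python sketch below models only that behaviour. The dates, labels and helper names are made up for illustration, and it does not attempt to reproduce the Blazegraph issue discussed above.

```python
def year_of(date):
    """Mimic year(): works for (year, month, day) tuples, errors for anything else."""
    if isinstance(date, tuple):
        return date[0]
    raise TypeError("not a date")

def bind_year(date):
    # BIND leaves the variable unbound (None here) when the expression errors.
    try:
        return year_of(date)
    except TypeError:
        return None

# Made-up publication values; the last one stands in for an "unknown value" node.
dates = [(1999, 3, 31), (2003, 5, 15), "unknown-value"]
inner = {bind_year(d) for d in dates}   # like SELECT DISTINCT ?year1

# Made-up (subject, label) pairs for the outer rdfs:label pattern.
labels = [("wd:Q212", "Ukraina"), ("wd:Q207", "George W. Bush")]

def join(inner_values, label_rows):
    out = []
    for year1 in inner_values:
        for subject, label in label_rows:
            # An unbound ?year1 is compatible with every subject; a bound one must match.
            if year1 is None or year1 == subject:
                out.append((subject, label))
    return out

with_unknown = join(inner, labels)
# A filter such as ?year1 > 0 errors on an unbound value, which removes that row.
filtered = join({y for y in inner if y is not None and y > 0}, labels)

print(with_unknown)
print(filtered)
```

In this model the unbound row alone is responsible for every label that comes back, which matches the observation that filtering out the non-date value removes the unexpected results.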

SPARQL search query

I have some RDF data structured like this:
[ pref:ip_address "127.0.0.1" ;
pref:time "1459630844.482" ;
pref:url "https://google.com" ;
pref:user "johndoe"
] .
I need a query that will return all results matching a given IP, a time frame (between a start time and an end time), a URL (even a partial match) and a user (also even a partial match).
What I have now is simple query to get results based on single value, like this:
PREFIX pref: <http://something> SELECT DISTINCT * WHERE { ?u pref:user USER_VALUE . ?u ?p ?o . }
This returns all results with given user but only if given username is full match. Meaning it will return all results for johndoe if USER_VALUE is johndoe but not if it is john.
My knowledge of SPARQL is extremely limited and I appreciate any help. Thank you.
To do anything more than exact match, you need to use a FILTER and use operations like CONTAINS or REGEX.
Example:
{ ?u pref:user ?user .
?u ?p ?o .
FILTER( CONTAINS(?user, "john") )
}
There are a number of FILTER functions that may be useful including REGEX. See the spec for details.
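For orientation, the combined conditions from the question can be sketched in Python, assuming made-up records shaped like the sample data. The same structure would carry over to a SPARQL FILTER: an exact match on the IP, CONTAINS for the partial matches, and a numeric comparison for the time. Because the time is stored as a plain string, it has to be converted to a number first (in SPARQL this would presumably be a cast such as xsd:double(?time)).

```python
# Made-up records shaped like the RDF sample in the question.
records = [
    {"ip_address": "127.0.0.1", "time": "1459630844.482",
     "url": "https://google.com", "user": "johndoe"},
    {"ip_address": "127.0.0.1", "time": "1459640000.000",
     "url": "https://example.org", "user": "janedoe"},
    {"ip_address": "10.0.0.5", "time": "1459630900.000",
     "url": "https://google.com/maps", "user": "johnsmith"},
]

def search(rows, ip, start, end, url_part, user_part):
    """Exact IP, numeric time range, substring match on url and user."""
    return [
        r for r in rows
        if r["ip_address"] == ip
        and start <= float(r["time"]) <= end   # stored as a string, so convert first
        and url_part in r["url"]               # like CONTAINS(?url, url_part)
        and user_part in r["user"]             # like CONTAINS(?user, user_part)
    ]

hits = search(records, "127.0.0.1", 1459630000, 1459631000, "google", "john")
print(hits)
```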

restricting DISTINCT in SPARQL

I want to get the name, the surname and the place of birth from a certain graph. I get results like "John" "Doe" "London" and "John" "Doe" "UK". Is there a way to apply DISTINCT to only two (name and surname) of the three variables (name, surname and birthplace)?
You haven't shown any example data or queries you've tried, which makes this question harder to answer; please include those details in future.
One approach is to use GROUP BY instead of DISTINCT e.g.
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?name ?surname (SAMPLE(?loc) AS ?location)
WHERE
{
?person foaf:givenName ?name ;
foaf:familyName ?surname ;
foaf:based_near ?loc .
}
GROUP BY ?name ?surname
This query groups your results together by ?name and ?surname and selects one of the possible locations for each group of results.
The SAMPLE() aggregate used here basically asks the query engine to select one possible value for a non-group key variable from the grouped results.
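The effect of GROUP BY with SAMPLE can be pictured with a short Python sketch over made-up rows. Taking the first location of each group is an arbitrary choice made here for illustration; a query engine is free to return any member of the group.

```python
# Made-up result rows: (name, surname, location).
rows = [
    ("John", "Doe", "London"),
    ("John", "Doe", "UK"),
    ("Jane", "Roe", "Paris"),
]

# GROUP BY ?name ?surname: collect the locations for each (name, surname) key.
groups = {}
for name, surname, loc in rows:
    groups.setdefault((name, surname), []).append(loc)

# SAMPLE(?loc): pick one location per group (here simply the first one seen).
result = [(name, surname, locs[0]) for (name, surname), locs in groups.items()]

print(result)
```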

retrieving most specific classes of instances

Is it possible to get a definition of a resource (from DBpedia) with a SPARQL query? I want to have something like the TBox and ABox that are shown in (Conceptual) Clustering methods for the Semantic Web: issues and applications (slides 10–11). For example, for the DBpedia resource Stephen King, I would like to have:
Stephen_King : Person ⊓ Writer ⊓ Male ⊓ … (most specific classes)
You can use a query like the following to ask for the classes of which Stephen King is an instance which have no subclasses of which Stephen King is also an instance. This seems to align well with the idea of “most specific classes.” However, since (as far as I know) there's no reasoner attached to the DBpedia SPARQL endpoint, there may be subclass relationships that could be inferred but which aren't explicitly present in the data.
select distinct ?type where {
dbr:Stephen_King a ?type .
filter not exists {
?subtype ^a dbr:Stephen_King ;
rdfs:subClassOf ?type .
}
}
Actually, since every class is an rdfs:subClassOf itself, you might want to add another line to that query to exclude the case where ?subtype and ?type are the same:
select distinct ?type where {
dbr:Stephen_King a ?type .
filter not exists {
?subtype ^a dbr:Stephen_King ;
rdfs:subClassOf ?type .
filter ( ?subtype != ?type )
}
}
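The filtering logic of this query can be sketched in Python, assuming a made-up set of types and explicit subclass pairs (the class names are illustrative, not taken from the DBpedia data). A type is kept only if no other type of the same instance is declared as its subclass. The sub != t check plays the role of the inner filter.

```python
# Made-up type assertions and subclass axioms, written as (subclass, superclass).
types = {"Thing", "Agent", "Person", "Writer", "Male"}
sub_class_of = {
    ("Writer", "Person"), ("Person", "Agent"), ("Agent", "Thing"),
    ("Male", "Thing"),
    ("Writer", "Writer"),   # reflexive pair, as some stores or reasoners include
}

def most_specific(instance_types, subclass_pairs):
    """Keep types that have no *other* type of the instance as a declared subclass."""
    return {
        t for t in instance_types
        if not any(sup == t and sub in instance_types and sub != t
                   for sub, sup in subclass_pairs)
    }

specific = most_specific(types, sub_class_of)
print(sorted(specific))
```

Without the sub != t check, the reflexive pair would wrongly eliminate Writer as well, which is the case the second query guards against.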
If you actually want a result string like the one shown in those slides, you could use values to bind a variable to dbr:Stephen_King, and then use some grouping and string concatenation to get something nicer looking (sort of):
select
(concat( str(?person), " =\n", group_concat(str(?type); separator=" AND\n")) as ?sentence)
where {
values ?person { dbr:Stephen_King }
?type ^a ?person .
filter not exists {
?subtype ^a ?person ;
rdfs:subClassOf ?type .
filter ( ?subtype != ?type )
}
}
group by ?person
SPARQL results
http://dbpedia.org/resource/Stephen_King =
http://dbpedia.org/class/yago/AuthorsOfBooksAboutWritingFiction AND
http://dbpedia.org/ontology/Writer AND
http://schema.org/Person AND
http://xmlns.com/foaf/0.1/Person AND
http://dbpedia.org/class/yago/AmericanSchoolteachers AND
http://dbpedia.org/class/yago/LivingPeople AND
http://dbpedia.org/class/yago/PeopleFromBangor,Maine AND
http://dbpedia.org/class/yago/PeopleFromPortland,Maine AND
http://dbpedia.org/class/yago/PeopleFromSarasota,Florida AND
http://dbpedia.org/class/yago/PeopleSelf-identifyingAsAlcoholics AND
http://umbel.org/umbel/rc/Artist AND
http://umbel.org/umbel/rc/Writer AND
http://dbpedia.org/class/yago/20th-centuryNovelists AND
http://dbpedia.org/class/yago/21st-centuryNovelists AND
http://dbpedia.org/class/yago/AmericanHorrorWriters AND
http://dbpedia.org/class/yago/AmericanNovelists AND
http://dbpedia.org/class/yago/AmericanShortStoryWriters AND
http://dbpedia.org/class/yago/CthulhuMythosWriters AND
http://dbpedia.org/class/yago/HorrorWriters AND
http://dbpedia.org/class/yago/WritersFromMaine AND
http://dbpedia.org/class/yago/PeopleFromDurham,Maine AND
http://dbpedia.org/class/yago/PeopleFromLisbon,Maine AND
http://dbpedia.org/class/yago/PostmodernWriters