WikiData, locating entities with a specific type or subtype, located in a specific city - sparql

My specific problem is that I have a place called "Beacon Theater". What I want to find is the best match for this in Wikidata.
A Wikidata Search will give me three results:
Live at the Beacon Theater (Q6656601)
Beacon Theatre (Q264186): performing arts venue
Beacon Theaters (Q19110809)
The first one is a movie, the second is the correct result, and the third is a Supreme Court decision.
Using this API call, I can find the IDs for all three:
https://www.wikidata.org/w/api.php?action=query&format=json&list=search&srsearch=Beacon Theater
Next step is getting the details for each of these. I use this call to get the information for all three entities
"https://www.wikidata.org/w/api.php?action=wbgetentities&props=descriptions|labels|claims&ids=Q6656601|Q264186|Q19110809&languages=en&format=json"
At this point, I want to iterate over them and find the one that is a building. I may also later want to add a way to find the one located in New York.
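The iteration described above could look roughly like the following. This is a minimal sketch, assuming the usual shape of a wbgetentities JSON response (entities, then claims, then P31, then mainsnak.datavalue.value.id); the helper names are hypothetical and the sample dictionary is hand-written for illustration, not a real API response.

```python
def instance_of_ids(entity):
    """Collect the item IDs of all P31 (instance of) claims on one entity."""
    ids = []
    for claim in entity.get("claims", {}).get("P31", []):
        # Snaks of type "novalue"/"somevalue" carry no datavalue, so use .get()
        value = claim.get("mainsnak", {}).get("datavalue", {}).get("value", {})
        if "id" in value:
            ids.append(value["id"])
    return ids


def filter_by_type(response, wanted_types):
    """Return the entity IDs whose direct P31 value is one of wanted_types."""
    wanted = set(wanted_types)
    return [
        qid
        for qid, entity in response.get("entities", {}).items()
        if wanted & set(instance_of_ids(entity))
    ]


# Hand-written sample in the assumed response shape (illustrative only).
sample = {
    "entities": {
        "Q264186": {
            "claims": {
                "P31": [
                    {
                        "mainsnak": {
                            "snaktype": "value",
                            "datavalue": {
                                "type": "wikibase-entityid",
                                "value": {"entity-type": "item", "id": "Q3469910"},
                            },
                        }
                    }
                ]
            }
        }
    }
}

print(filter_by_type(sample, ["Q3469910"]))  # matches on the direct type
print(filter_by_type(sample, ["Q41176"]))    # direct match on "building" finds nothing
```

A direct comparison like this only sees the immediate P31 value, so it cannot tell that a more specific type is derived from a broader one.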
My problem is that the correct answer is not directly a building (Q41176). Its P31 value is Q3469910, which is a performing arts venue, so I can't really filter on that. (Imagine that in the future I use this code to search for a museum. A museum is also a building, but not a performing arts venue. The search for Beacon Theater is just an example.)
So question: How can I find the correct entry, which for the purposes of this question I define as:
Being a building (or perhaps being derived from a Building)
Optional Answer: Being located in New York (In case of multiple hits, this would further limit the results)
I think I need to do a SPARQL query as the second query to do this, but from the examples I could not quite figure out how, or if that would be the correct/easiest way. Maybe even a SPARQL query that would do all of the above in one query?

In your case, an exact label match might already do:
SELECT DISTINCT ?loc ?locLabel ?locDescription
WHERE
{
values ?locLabel {"Beacon Theater"@en }
?loc rdfs:label ?locLabel .
SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en"}
}
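To also cover the "building or derived from a building" and "located in New York" conditions from the question, one possible approach is a property path over P31 (instance of) and P279 (subclass of), plus P131 (located in the administrative territorial entity). The sketch below only builds the query text; the function name is hypothetical, Q60 is assumed to be the item for New York City, and whether the venue's type actually chains up to Q41176 has to be checked against live data. Note that the exact label match is spelling-sensitive: the search result above is labelled "Beacon Theatre".

```python
def build_query(label, type_qid="Q41176", place_qid=None, lang="en"):
    """Build a SPARQL query for items labelled `label` whose class is
    `type_qid` or one of its (transitive) subclasses.

    Assumes `label` contains no double quotes (no escaping is done).
    """
    patterns = [
        f'?item rdfs:label "{label}"@{lang} .',
        # P31 = instance of, followed by zero or more P279 = subclass of
        f"?item wdt:P31/wdt:P279* wd:{type_qid} .",
    ]
    if place_qid:
        # P131 = located in the administrative territorial entity;
        # "+" follows one or more hops (e.g. borough, then city)
        patterns.append(f"?item wdt:P131+ wd:{place_qid} .")
    patterns.append(
        'SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }'
    )
    body = "\n".join("  " + p for p in patterns)
    return (
        "SELECT DISTINCT ?item ?itemLabel ?itemDescription WHERE {\n"
        + body
        + "\n}"
    )


# Q60 is assumed here to be New York City.
query = build_query("Beacon Theatre", place_qid="Q60")
print(query)
```

The printed query can be pasted into the Wikidata Query Service. Dropping place_qid gives the type-only variant, which may be enough when there is a single hit.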

I have a project where I run into the same kind of problem, but for books, which can also be comic books, manga, etc. The easiest solution I found was to keep a list of "alias entities", that is, entities that could be considered a match when looking for a book. It's not as dynamic as a SPARQL query and requires periodic updates (adding newly found matching entities, removing problematic ones), but it is much faster and satisfies most of my needs.

Related

Retrieving all Wikipedia articles about people

I'm trying to retrieve all articles from Wikipedia that are about people. More specifically, I'm looking for:
only the page title (and perhaps the page ID)
of articles that are about people,
separated by gender (for the sake of simplicity, male and female),
from the current English Wikipedia.
There are several things I've tried, none of which have worked out:
The Wikipedia API lets me search for all pages in a given category. However, searching in "Men" or "Women" fetches mostly subcategory pages, and pages about actual people are buried further down the subcategory hierarchy. I can't find a way to auto-traverse the hierarchy.
PetScan lets me specify a hierarchy depth, but requests time out with a depth of more than 3. Also, like the Wikipedia API, results include articles that aren't about people.
Wikidata lets me write SPARQL queries to search for entities that have a gender of "male" or "female". This example seems to work, but once I include entity names in the query, it times out. Also, I'm not sure how exactly this data corresponds to Wikipedia articles — is this data guaranteed to be the same as on Wikipedia?
What's the best way to achieve what I'm looking for?
I've created a SPARQL-query doing the work. It's important to keep the query as simple as possible (for query optimisation read: https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/query_optimization). Here is the query for SPARQL: https://w.wiki/JhK
For the articles about women, this might work with the Wikidata Query Service (WQS), though it is right on the edge of timing out. For the articles about men (there are more of them), you need to add a LIMIT and step through the results with an increasing OFFSET. The WQS still seems to time out, but there are other endpoints for Wikidata. This one is limited to 100,000 results per query, but works with an increasing OFFSET: https://wikidata.demo.openlinksw.com/sparql
The resulting SPARQL query is something like this:
SELECT ?sitelink
WHERE {
?item wdt:P21 wd:Q6581097;
wdt:P31 wd:Q5.
?sitelink schema:about ?item;
schema:isPartOf <https://en.wikipedia.org/>.
}
LIMIT 100000 OFFSET 100000
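Stepping through the results with an increasing OFFSET can be scripted. The following is a rough sketch that only generates the query text for each page; the function name and the number of pages are illustrative assumptions, and sending the queries to an endpoint is left out.

```python
QUERY_TEMPLATE = """SELECT ?sitelink
WHERE {{
  ?item wdt:P21 wd:{gender};
        wdt:P31 wd:Q5.
  ?sitelink schema:about ?item;
            schema:isPartOf <https://en.wikipedia.org/>.
}}
LIMIT {limit} OFFSET {offset}"""


def paged_queries(gender="Q6581097", page_size=100000, pages=3):
    """Yield one query per page, each shifted by page_size results."""
    for page in range(pages):
        yield QUERY_TEMPLATE.format(
            gender=gender, limit=page_size, offset=page * page_size
        )


for query in paged_queries(pages=2):
    # Print only the paging clause of each generated query
    print(query.splitlines()[-1])
```

Without an ORDER BY, SPARQL does not guarantee a stable result order between requests, so pages may overlap or miss rows; adding ORDER BY ?sitelink would fix that but may make timeouts more likely.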

How to assign a suitable science category to keywords, using DBPedia and SPARQL?

I have some keywords like emotion perception ability, students’ motivation, self-efficacy. The goal is to map these keywords to the corresponding category (or categories) of psychology. In this case I know a priori that the answer is Educational psychology, but I want to get the same answer using DBpedia ontologies.
Using the following query I am able to extract different branches of psychology and corresponding abstracts:
SELECT DISTINCT ?subject ?abstract
WHERE {
?concept rdfs:label "Branches of psychology"@en .
?concept ^dct:subject ?subject .
?subject dbo:abstract ?abstract .
}
LIMIT 100
Now I want to add an OPTIONAL clause that would compare my keywords (using OR) with terms from the abstract (dbo:abstract). Is it possible to do this using SPARQL? Or should I use SPARQL just to obtain the abstracts and then do all further text processing in e.g. Java or Python?
Ideas for other approaches that might help reach the goal are also highly appreciated.
You could retrieve the data as text using SPARQL, but deciding whether a text matches a query should be done with text analysis techniques, i.e. text mining.
This is a whole science, but fortunately many libraries for many languages (including Java and Python) implement the related algorithms. Here is a list of software on Wikipedia. NLTK, a Python library, is well known for the job.
In your case, I can think of several ways, but I'm far from an expert, so my ideas may be wrong:
Create a corpus of abstracts for each desired category (educational psychology, …) and, for a given abstract A, compare A to each abstract of each category C. The result of the comparison gives, for each category, a score/likelihood that A belongs to C (cf. fuzzy sets).
The comparison could be implemented with a vector space model, which works on vocabulary similarity.
Named entity recognition could help detect the names of authors, techniques or tools related to a particular category.
The main idea is the following: once you have defined the particular traits of each category, using its vocabulary, authors, references or whatever, you can compute, for any abstract, a membership score for every category.
So the real question to ask is: which scoring function should I use?
The answer depends heavily on the data and on the results you want. When you say an abstract is about educational psychology, you have to know why, and then implement that as a scoring function.
As a side note, neural networks trained on the corpus could perhaps bypass the explicit scoring through automatic learning. I don't know enough about that field to say more.
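As one concrete instance of the vector space idea above, here is a minimal sketch of a scoring function: a plain bag-of-words cosine similarity between the keywords and each category's abstract, using only the standard library. The function names and the two toy abstracts are made up for illustration; real abstracts would come from the dbo:abstract values, and a real system would likely add stop-word removal, stemming or TF-IDF weighting.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lower-case the text and count its alphabetic tokens."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def score_categories(keywords, abstracts_by_category):
    """Score every category by similarity between keywords and its abstract."""
    query = bag_of_words(" ".join(keywords))
    return {
        category: cosine(query, bag_of_words(abstract))
        for category, abstract in abstracts_by_category.items()
    }


# Toy abstracts, invented for this example.
abstracts = {
    "Educational psychology": "study of students motivation and learning in education",
    "Sports psychology": "study of athletes performance in sport",
}
scores = score_categories(["students' motivation", "self-efficacy"], abstracts)
best = max(scores, key=scores.get)
print(best, scores)
```

The category with the highest score is taken as the best match; any threshold for "no category fits" would have to be tuned on real data.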

Sorting the response from the Foursquare Places API re:two word name?

We are trying to query the Foursquare API for a two-word name:
Cava Grill in Gaithersburg, MD
We are trying this via:
https://api.foursquare.com/v2/venues/search?intent=checkin&query=cava%20grill&near=gaithersburg,%20md&limit=1&oauth_token=SEB14NBLGO4HMFTOXQX0JZTSVGM41ENNKE0X1RXHCI5XP3P5&v=20150420
(don't worry ... this is the public API key from the FS page)
Two odd behaviors:
Even though we are explicitly searching for the Cava Grill in Gaithersburg, MD ... the Bethesda, MD one comes up first in the results (odd, why??)
Chipotle Mexican Grill shows up in this result set ... we suppose because of the word "Grill"
So ...
a. anyone know why the Bethesda one would show up higher in the result set? (Should we just narrow the radius tighter?)
b. anyone know if we can look for the "entire query" vs. each word in the query?
Results are queried and sorted differently based on your intent. If you're looking for a specific venue, I suggest changing your intent from checkin to match. Browse may also be a good choice, depending on your future search params.
Here are the intents in a nutshell:
intent=checkin returns a list of venues where the user is most likely located
intent=browse returns a list of most relevant venues for a requested region, not biased by distance from a central point.
intent=match returns a single result that, with high confidence, is the corresponding foursquare venue for the query-based request
I hope this helps
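For illustration, the request from the question with only the intent swapped could be assembled like this. It is a sketch, not a verified call: the function name is hypothetical, the token is a placeholder, and whether intent=match accepts exactly the same parameters as the original request (for example near) is an assumption to check against the Foursquare documentation.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.foursquare.com/v2/venues/search"


def venue_search_url(query, near, intent="match",
                     token="YOUR_OAUTH_TOKEN", version="20150420", limit=1):
    """Build a venues/search URL; parameter names are taken from the question."""
    params = {
        "intent": intent,
        "query": query,
        "near": near,
        "limit": limit,
        "oauth_token": token,  # placeholder, replace with a real token
        "v": version,
    }
    return BASE_URL + "?" + urlencode(params)


print(venue_search_url("cava grill", "gaithersburg, md"))
```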

Yago ontology for entity disambiguation

I am using the property rdf:type equal to dbpedia-owl:Organisation to select (obviously) organizations in my SPARQL query:
SELECT ?s
WHERE {
?s a dbpedia-owl:Organisation .
} LIMIT 10
I would like to use the YAGO ontology to improve my recall of real organizations. For example, the FBI (http://dbpedia.org/resource/Federal_Bureau_of_Investigation) is not typed as a dbpedia-owl:Organisation but is tagged as yago:Organization108008335 .
Note the "random" (at least for me) number at the end of the class name. Does anyone know what this number stands for? How am I supposed to know it a priori?
Moreover, when I look for more classes with this format (using the query below), I can find two more classes: http://dbpedia.org/class/yago/Organization108008335, http://dbpedia.org/class/yago/Organization101008378, http://dbpedia.org/class/yago/Organization101136519
SELECT DISTINCT ?t WHERE {
?s a ?t
FILTER(regex(str(?t), "http://dbpedia.org/class/yago/Organization\\d+"))
}
Are they different? Why aren't they all "yago:Organization"? Should I expect "new" organization classes as new versions of YAGO ontologies are made available? Is there any other class I should consider when selecting Organizations?
I have been digging into this lately, so I'll try to answer all your questions one by one:
Note the "random" (at least for me) number at the end of the class name. Does anyone know what this number stands for? How am I supposed to know it a priori?
That number corresponds to the synset ID of the word in WordNet. For example, if you look up wordnet_organization_101136519 in WordNet (the URI in DBpedia is not resolvable at this moment; maybe they have changed something in the last releases), you will see that it has the synset ID "101136519". I don't think you can know it a priori without looking into WordNet.
Are they different? Why aren't they all "yago:Organization"?
They are different because they have a different definition in wordnet. For example:
wordnet_organization_101136519: "the activity or result of distributing or disposing persons or things properly or methodically 'his organization of the work force was very efficient'". Example of an instance: Bogo-Indian_Defence. See more details here
wordnet_organization_101008378: "the act of organizing a business or an activity related to a business 'he was brought in to supervise the organization of a new department'". Example of an instance: Adam_Smith_Foundation. See more details here
If you follow the links I provided you can see more differences and common similarities.
Should I expect "new" organization classes as new versions of YAGO ontologies are made available?
When they generated YAGO, they associated every word in WordNet with a URI. If more senses of "organization" are added, then I guess that you'll have more classes. However, it is impossible to know beforehand.
Is there any other class I should consider when selecting Organizations?
You can look for all the classes with the label "organization" in wordnet and then add optionals to your query (or issue one query per class retrieving the different organizations you are interested in). These are the classes with the "organization" label in Wordnet.
I hope it helps.
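As a variation on the "one query per class" or OPTIONAL suggestion above, several classes can also be matched in a single query with a VALUES clause. The sketch below only builds the query text; the function name is hypothetical, the choice of classes is an assumption (dbpedia-owl:Organisation written out as a full URI, plus the YAGO class the FBI is tagged with), and which synsets really denote organizations in your sense is for you to decide.

```python
ORG_CLASSES = [
    "http://dbpedia.org/ontology/Organisation",
    "http://dbpedia.org/class/yago/Organization108008335",
]


def build_org_query(class_uris, limit=10):
    """Build a query selecting resources typed with any of the given classes."""
    values = " ".join(f"<{uri}>" for uri in class_uris)
    return (
        "SELECT DISTINCT ?s WHERE {\n"
        f"  VALUES ?t {{ {values} }}\n"
        "  ?s a ?t .\n"
        f"}} LIMIT {limit}"
    )


print(build_org_query(ORG_CLASSES))
```

Adding or removing a synset-specific class then only means editing the list, not the query.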

CategoryId in venues search not working correctly

In foursquare Api documentation for "Search venues" https://developer.foursquare.com/docs/venues/search it states
"categoryId - A comma separated list of categories to limit results to. This is an experimental feature and subject to change or may be unavailable. If you specify categoryId you may also specify a radius. If specifying a top-level category, all sub-categories will also match the query."
I realise it's supposed to be experimental, but when I provide the Food category, i.e. 4d4b7105d754a06374d81259, it only returns a few local results; the rest are miles away. However, if I run the same search on the website using the Food category, it correctly returns lots of results. I assume the last bit, "If specifying a top-level category, all sub-categories will also match the query", is not working, i.e. it's not searching sub-categories?
Is there any fix or workaround for this?
Thanks,
Neil Pepper
You're making a /venues/search request with its default intent of intent=checkin. This returns a filter on nearby results, heavily biased by distance since it's trying to guess where the user might be checking in.
Foursquare Explore uses the /venues/explore endpoint and attempts to return recommended results for a query. If you want to get the sorts of results you get in that tool, call /venues/explore?section=food