Retrieving all Wikipedia articles about people - wikipedia-api

I'm trying to retrieve all articles from Wikipedia that are about people. More specifically, I'm looking for:
only the page title (and perhaps the page ID)
of articles that are about people,
separated by gender (for the sake of simplicity, male and female),
from the current English Wikipedia.
There are several things I've tried, none of which have worked out:
The Wikipedia API lets me search for all pages in a given category. However, searching in "Men" or "Women" fetches mostly subcategory pages, and pages about actual people are buried further down the subcategory hierarchy. I can't find a way to auto-traverse the hierarchy.
PetScan lets me specify a hierarchy depth, but requests time out with a depth of more than 3. Also, like the Wikipedia API, results include articles that aren't about people.
Wikidata lets me write SPARQL queries to search for entities that have a gender of "male" or "female". This example seems to work, but once I include entity names in the query, it times out. Also, I'm not sure how exactly this data corresponds to Wikipedia articles — is this data guaranteed to be the same as on Wikipedia?
What's the best way to achieve what I'm looking for?

I've created a SPARQL query that does the work. It's important to keep the query as simple as possible (for query optimisation, read https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/query_optimization). Here is the query: https://w.wiki/JhK
For articles about women this might work with the Wikidata Query Service (WQS), though it's right on the edge of timing out. For the male articles (there are more of them), you need to add a LIMIT and step through the results with an increasing OFFSET. The WQS still seems to time out, but there are other endpoints to Wikidata; this one is limited to 100,000 results per request, but works with an increasing OFFSET: https://wikidata.demo.openlinksw.com/sparql
The resulting SPARQL query is something like this:
SELECT ?sitelink
WHERE {
  ?item wdt:P21 wd:Q6581097;   # sex or gender: male
        wdt:P31 wd:Q5.         # instance of: human
  ?sitelink schema:about ?item;
            schema:isPartOf <https://en.wikipedia.org/>.
}
LIMIT 100000 OFFSET 100000
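
If you want to automate the OFFSET stepping, a small client loop works: keep requesting pages until one comes back short. This is a minimal sketch assuming Python with the requests library; the page size, User-Agent string, and sleep interval are illustrative, and for the larger male result set you may need to point it at the Virtuoso endpoint mentioned above instead of the WQS.

import time
import requests

# Assumption: the official WQS endpoint; swap in
# https://wikidata.demo.openlinksw.com/sparql if the WQS times out.
ENDPOINT = "https://query.wikidata.org/sparql"
PAGE_SIZE = 100000

QUERY = """
SELECT ?sitelink WHERE {
  ?item wdt:P21 wd:Q6581097;   # sex or gender: male
        wdt:P31 wd:Q5.         # instance of: human
  ?sitelink schema:about ?item;
            schema:isPartOf <https://en.wikipedia.org/>.
}
LIMIT %d OFFSET %d
"""

offset = 0
while True:
    response = requests.get(
        ENDPOINT,
        params={"query": QUERY % (PAGE_SIZE, offset), "format": "json"},
        headers={"User-Agent": "people-lister-example/0.1 (hypothetical)"},
    )
    response.raise_for_status()
    rows = response.json()["results"]["bindings"]
    for row in rows:
        print(row["sitelink"]["value"])
    if len(rows) < PAGE_SIZE:
        break                  # short page: we've reached the end
    offset += PAGE_SIZE
    time.sleep(2)              # be polite to the shared endpoint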

Related

Wikidata Virtuoso SPARQL Endpoint - How to get more than 100,000 results

I need to get Wikidata artifacts (instance-types, redirects and disambiguations) for a project.
As the original Wikidata endpoint has time constraints when it comes to querying, I have come across the Virtuoso Wikidata endpoint.
The problem I have is that if I try to get, for example, the redirects with this query, it returns at most 100,000 results:
PREFIX owl: <http://www.w3.org/2002/07/owl#>
CONSTRUCT { ?resource owl:sameAs ?resource2 }
WHERE
{
  ?resource owl:sameAs ?resource2
}
I’m writing to ask if you know of any way to get more than 100,000 results. I would like to be able to achieve the maximum number of possible results.
Once the results are obtained, I must have 3 files (or as few files as possible) in the N-Triples format: wikidata_instance_types.nt, wikidata_redirections.nt and wikidata_disambiguations.nt.
Thank you very much in advance.
All the best,
Jose Manuel
Please recognize that in both cases (Wikidata itself, and the Virtuoso instance provided by OpenLink Software, my employer), you are querying against a shared resource, and various limits should be expected.
You should space your queries out over time, and consider smaller chunks than the 100,000 limit you've run into -- perhaps 50,000 at a time, waiting for each query to finish retrieving results, plus another second or ten, before issuing the next query.
Most of the guidance in this article about working with the DBpedia public SPARQL endpoint is relevant for any public SPARQL endpoint, especially those powered by Virtuoso. Specific settings on other endpoints will vary, but you'll be far more successful if you try to be friendly: limit the rate of your queries, and limit the size of the partial result sets you page through with ORDER BY, LIMIT, and OFFSET when a full result set would overflow the instance's maximum result set size.
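As a concrete sketch of that pattern, the redirects query above can be paged in stable slices; the ORDER BY makes the LIMIT/OFFSET windows deterministic (the 50,000 chunk size is just the suggestion from above, not a server requirement):

PREFIX owl: <http://www.w3.org/2002/07/owl#>
CONSTRUCT { ?resource owl:sameAs ?resource2 }
WHERE
{
  ?resource owl:sameAs ?resource2
}
ORDER BY ?resource ?resource2
LIMIT 50000
OFFSET 0      # then 50000, 100000, ... on subsequent requests

Each request reuses the same query with only the OFFSET increased, waiting a few seconds between requests as described above.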
You can get and host your own copy of Wikidata as explained in
https://wiki.bitplan.com/index.php/Get_your_own_copy_of_WikiData
There are also alternatives for getting a partial dump of Wikidata, e.g. with https://github.com/bennofs/wdumper
Or ask for access to one of the non-public copies we run by sending me a personal e-mail via my RWTH Aachen i5 account.

WikiData, locating entities with a specific type or subtype, located in a specific city

My Specific problem is that I have a place called "Beacon Theater". What I want to find is the best match for this in Wikidata.
A Wikidata Search will give me three results:
Live at the Beacon Theater (Q6656601)
Beacon Theatre (Q264186): performing arts venue
Beacon Theaters (Q19110809)
The first one is a movie, the second is the correct result, and the third is a Supreme Court decision.
Using this API call, I can find the IDs for all three:
https://www.wikidata.org/w/api.php?action=query&format=json&list=search&srsearch=Beacon%20Theater
The next step is getting the details for each of these. I use this call to get the information for all three entities:
https://www.wikidata.org/w/api.php?action=wbgetentities&props=descriptions|labels|claims&ids=Q6656601|Q264186|Q19110809&languages=en&format=json
At this point, I want to iterate over them and find the one that is a building. I may also later want to add a way to find the one located in New York.
My problem is that the correct answer is not a building (Q41176). Its P31 value is Q3469910, which is "performing arts venue", so I can't really sort on that. (Imagine that in the future I use this code to search for a museum. A museum is also a building, but not a performing arts venue. The search for Beacon Theater is just an example.)
So question: How can I find the correct entry, which for the purposes of this question I define as:
Being a building (or perhaps being derived from a Building)
Optional Answer: Being located in New York (In case of multiple hits, this would further limit the results)
I think I need to do a SPARQL query as the second query to do this, but from the examples I could not quite figure out how, or if that would be the correct/easiest way. Maybe even a SPARQL query that would do all of the above in one query?
In your case an exact label match might already do:
SELECT DISTINCT ?loc ?locLabel ?locDescription
WHERE
{
  VALUES ?locLabel { "Beacon Theater"@en }
  ?loc rdfs:label ?locLabel .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en" }
}
try it!
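To also apply the "building" and "New York" constraints from the question, you can follow the subclass (P279) and located-in (P131) hierarchies transitively. Here is a sketch; note that the Wikidata label is spelled "Beacon Theatre", and whether Q264186 actually matches depends on whether Wikidata's current class tree links performing arts venue (Q3469910) up to building (Q41176):

SELECT DISTINCT ?loc ?locLabel ?locDescription
WHERE
{
  VALUES ?name { "Beacon Theater"@en "Beacon Theatre"@en }
  ?loc rdfs:label ?name .
  ?loc wdt:P31/wdt:P279* wd:Q41176 .   # instance of (a subclass of) building
  ?loc wdt:P131* wd:Q60 .              # transitively located in New York City
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en" }
}

Dropping the P131 line turns the location back into an optional refinement, matching the way the question frames it.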
I have a project where I run into the same kind of problem, but for books, which can also be comic books, manga, etc. The easiest solution I found was to keep a list of "alias entities", that is, entities that could be considered a match when looking for a book. It's not as dynamic as a SPARQL query and requires periodic updates - adding newly found matching entities, removing problematic ones - but it's so much faster and satisfies most of my needs.

How to assign a suitable science category to keywords, using DBPedia and SPARQL?

I have some keywords like emotion perception ability, students' motivation, self-efficacy. The goal is to map these keywords to a corresponding category (or categories) of psychology. In this case I know a priori that the answer is Educational psychology; however, I want to get the same answer using DBpedia ontologies.
Using the following query I am able to extract different branches of psychology and corresponding abstracts:
SELECT DISTINCT ?subject ?abstract
WHERE {
  ?concept rdfs:label "Branches of psychology"@en .
  ?concept ^dct:subject ?subject .
  ?subject dbo:abstract ?abstract .
}
LIMIT 100
Now I want to add some OPTIONAL clause that would compare my keywords (using OR) with terms from the abstract (dbo:abstract). Is it possible to do this using SPARQL? Or should I use SPARQL just to obtain the abstracts and then to make all further text processing using e.g. Java or Python?
Also, the ideas of some other approaches that might be useful to reach the goal are highly appreciated.
You can retrieve data as text using SPARQL, but deciding whether a text matches a query should be done with text-data-analysis techniques, or text mining.
This is a whole science, but fortunately lots of libraries for many languages (including Java and Python) exist that implement the related algorithms. Here is a list of software on Wikipedia. NLTK is well known for the job and is available in Python.
In your case, I can think of several approaches, but I'm far from an expert, so my ideas may be wrong:
Create a corpus of abstracts for each desired category (educational psychology, ...), and, for a given abstract A, compare A to each abstract of each category C. The result of the comparison gives, for each category, a score/likelihood that A belongs to C (cf. fuzzy sets).
The comparison could be implemented with the vector space model, which works on vocabulary similarities; see the sketch at the end of this answer.
Named-entity recognition could help to detect names of authors, techniques, or tools related to a particular category.
The main idea is the following: once you have defined what the particular trait of each category is, by using its vocabulary, authors, references or whatever, you can compute, for any abstract, a score of membership in each category.
So the real question to ask is: which scoring function should I use?
The answer depends heavily on the data and the results you want. When you say an abstract is about educational psychology, you have to know why, and then implement that as a scoring function.
As a side note, I'll add that neural networks, trained on the corpus, could maybe bypass the explicit scoring through automatic learning. I don't know enough about that field to say more.
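
Here is a toy sketch of the vector-space idea, using Python with scikit-learn (an assumption; any TF-IDF implementation would do). The category abstracts are placeholders, not real DBpedia data; in practice you would feed in the ?abstract values returned by the query above:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Placeholder abstracts standing in for dbo:abstract values.
category_abstracts = {
    "Educational psychology": "the study of how humans learn in educational "
                              "settings, student motivation, self-efficacy "
                              "and the psychology of teaching",
    "Clinical psychology": "assessment and treatment of mental illness, "
                           "abnormal behavior and psychiatric distress",
}

keywords = "emotion perception ability students' motivation self-efficacy"

# Vectorize the category abstracts and the keyword string together so
# they share one vocabulary.
vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(list(category_abstracts.values()) + [keywords])

# Compare the keyword vector (last row) against each category abstract.
scores = cosine_similarity(matrix[-1], matrix[:-1]).flatten()
for category, score in zip(category_abstracts, scores):
    print(f"{category}: {score:.3f}")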

Yago ontology for entity disambiguation

I am using the property rdf:type with the value dbpedia-owl:Organisation for selecting (obviously) organizations in my SPARQL query:
SELECT ?s
WHERE {
  ?s a dbpedia-owl:Organisation .
} LIMIT 10
I would like to bring in the YAGO ontology to improve my coverage of real organizations. For example, the FBI (http://dbpedia.org/resource/Federal_Bureau_of_Investigation) is not typed as dbpedia-owl:Organisation but is tagged as yago:Organization108008335.
Note the "random" (at least to me) number at the end of the class name. Does anyone know what this number stands for? How am I supposed to know it a priori?
Moreover, when I look for more classes with this format (using the query below), I find three such classes: http://dbpedia.org/class/yago/Organization108008335, http://dbpedia.org/class/yago/Organization101008378, http://dbpedia.org/class/yago/Organization101136519
SELECT DISTINCT ?t WHERE {
  ?s a ?t
  FILTER(regex(str(?t), "http://dbpedia.org/class/yago/Organization\\d+"))
}
Are they different? Why aren't they all "yago:Organization"? Should I expect "new" organization classes as new versions of the YAGO ontology are made available? Is there any other class I should consider when selecting organizations?
I have been digging into this lately, so I'll try to answer all your questions one by one:
Note the "random" (at least for me) number in the end of the class name. Does anyone know what this number stands for? How do I suppose to know it a priori?
That number corresponds to the synset ID of the word in WordNet. For example, if you look up wordnet_organization_101136519 in WordNet (the URI in DBpedia is not resolvable at the moment; maybe they have changed something in the last releases), you will see that it has the synset ID "101136519". I don't think you can know it a priori without looking into WordNet.
Are they different? Why aren't they all "yago:Organization"?
They are different because they have different definitions in WordNet. For example:
wordnet_organization_101136519: "the activity or result of distributing or disposing persons or things properly or methodically 'his organization of the work force was very efficient'". Example of an instance: Bogo-Indian_Defence. See more details here
wordnet_organization_101008378: "the act of organizing a business or an activity related to a business 'he was brought in to supervise the organization of a new department'". Example of an instance: Adam_Smith_Foundation. See more details here
If you follow the links I provided you can see more differences and common similarities.
Should I expect "new" organization classes as new versions of YAGO ontologies are made available?
When they generated YAGO, they associated every word in WordNet with a URI. If more words about organizations are added, then I guess you'll have more definitions. However, it is impossible to know beforehand.
Is there any other class I should consider when selecting Organizations?
You can look for all the classes with the label "organization" in WordNet and then add OPTIONALs to your query (or issue one query per class, retrieving the different kinds of organizations you are interested in); a sketch follows below. These are the classes with the "organization" label in WordNet.
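For instance, instead of OPTIONALs, a VALUES block over the class URIs found above retrieves instances of any of them in one query (a sketch; extend the list with whatever other organization synsets you find):

SELECT DISTINCT ?s
WHERE {
  VALUES ?orgClass {
    dbpedia-owl:Organisation
    <http://dbpedia.org/class/yago/Organization108008335>
    <http://dbpedia.org/class/yago/Organization101008378>
    <http://dbpedia.org/class/yago/Organization101136519>
  }
  ?s a ?orgClass .
} LIMIT 10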
I hope it helps.

DBPedia queries missing certain chemical compounds

I am running this query to get a list of all compounds from the DBpedia public SPARQL endpoint.
SELECT * WHERE {
  ?y rdf:type dbpedia-owl:Drug .
  ?y rdfs:label ?Name .
  OPTIONAL { ?y dbpedia-owl:iupacName ?iupacname } .
  OPTIONAL { ?y dcterms:subject ?y1 }
  FILTER (langMatches(lang(?Name), "en"))
}
LIMIT 50000
I am downloading in batches of 50,000 (2 files) using the OFFSET parameter.
Somehow Isopropyl_alcohol is not covered by this, even though the page exists at
http://live.dbpedia.org/page/Isopropyl_alcohol
and it has the properties that I am searching for.
There are two issues here. The first is that DBpedia Live and DBpedia do not have exactly the same content. According to the DBpedia Live webpage:
Wikipedia users constantly revise Wikipedia articles with updates happening almost each second. Hence, data stored in the official DBpedia endpoint can quickly become outdated, and Wikipedia articles need to be re-extracted. DBpedia Live enables such a continuous synchronization between DBpedia and Wikipedia.
That page also lists two SPARQL endpoints for DBpedia Live:
http://live.dbpedia.org/sparql
http://dbpedia-live.openlinksw.com/sparql
However, you'll run into issues on both. Isopropyl_alcohol is in DBpedia, and its URI is
http://dbpedia.org/resource/Isopropyl_alcohol (which, when using a web browser, redirects to
http://dbpedia.org/page/Isopropyl_alcohol)
Looking there, we see that Isopropyl alcohol doesn't have rdf:type dbpedia-owl:Drug, but only
owl:Thing
dbpedia-owl:ChemicalCompound
dbpedia-owl:ChemicalSubstance
yago:Alcohols
http://umbel.org/umbel/rc/ChemicalSubstanceType
yago:AlcoholSolvents
so you won't be able to find it with your query on DBpedia, because it doesn't have the type dbpedia-owl:Drug. Now, Isopropyl_alcohol also exists in DBpedia Live, and its URL is
http://live.dbpedia.org/resource/Isopropyl_alcohol, which redirects to
http://live.dbpedia.org/page/Isopropyl_alcohol
but it only has the following rdf:types:
yago:Alcohols
yago:AlcoholSolvents
http://umbel.org/umbel/rc/ChemicalSubstanceType
so it won't be found by your query on DBpedia Live, for the same reason.
The second issue is the one that AndyS pointed out. Even if the query did select Isopropyl_alcohol in DBpedia or DBpedia Live, the LIMIT/OFFSET combination isn't guaranteed to return it unless you provide an ordering constraint; without one, the server could legitimately return the same set of 50,000 results to you every time.
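A stable-paging variant of the original query might look like this (a sketch; the ORDER BY makes the LIMIT/OFFSET slices deterministic, at some extra cost to the server):

SELECT * WHERE {
  ?y rdf:type dbpedia-owl:Drug .
  ?y rdfs:label ?Name .
  OPTIONAL { ?y dbpedia-owl:iupacName ?iupacname } .
  OPTIONAL { ?y dcterms:subject ?y1 }
  FILTER (langMatches(lang(?Name), "en"))
}
ORDER BY ?y
LIMIT 50000
OFFSET 0      # then 50000 on the second request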
Maybe it is not being returned in the LIMIT/OFFSET combination you are using. The server is not obliged to answer queries in the same order every time unless you use ORDER BY, so the slices you have may not in fact cover all results.
Maybe the SPARQL site and live.dbpedia are not in step.
Try asking directly for Isopropyl_alcohol.
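For example, listing the resource's types directly makes the check explicit (a sketch; the rdf prefix is predefined on the DBpedia endpoints, and per the answer above dbpedia-owl:Drug should not appear in the results):

SELECT ?type WHERE {
  <http://dbpedia.org/resource/Isopropyl_alcohol> rdf:type ?type .
}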