I'm working on a project that displays celebrities' heights, and Wikidata does not always provide the information needed. So I considered DBpedia as another option and have spent days trying to figure out how to work with it. How can I fetch every instance of the Person class, with its height, from DBpedia? Is that possible?
You can use SPARQL to export the data from DBpedia (http://live.dbpedia.org/sparql/).
SELECT * WHERE {
  # dbo:Person\/height is DBpedia's class-specific height property,
  # <http://dbpedia.org/ontology/Person/height>; the backslash
  # escapes the slash in the prefixed name.
  ?person dbo:Person\/height ?height .
}
LIMIT 10000 OFFSET 10000
The public endpoint limits the size of result sets, currently to 10,000 rows, so you'll have to step through the data with an increasing OFFSET, or find an endpoint better suited to your needs.
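As a sketch of how the paging might look (the ORDER BY clause is my addition; without some stable ordering, consecutive pages may overlap or skip rows):

SELECT * WHERE {
  ?person dbo:Person\/height ?height .
}
ORDER BY ?person
LIMIT 10000 OFFSET 0
# On the next request use OFFSET 10000, then OFFSET 20000, and so on,
# until a page comes back with fewer than 10000 rows.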
I need to get Wikidata artifacts (instance-types, redirects and disambiguations) for a project.
As the original Wikidata endpoint has time constraints when it comes to querying, I have come across the Virtuoso Wikidata endpoint.
The problem I have is that if I try to get, for example, the redirects with this query, it returns at most 100,000 results:
PREFIX owl: <http://www.w3.org/2002/07/owl#>
CONSTRUCT { ?resource owl:sameAs ?resource2 }
WHERE {
  ?resource owl:sameAs ?resource2
}
I'm writing to ask if you know of any way to get more than 100,000 results. I would like to retrieve as many results as possible.
Once the results are obtained, I need three files (or as few files as possible) in N-Triples format: wikidata_instance_types.nt, wikidata_redirections.nt, and wikidata_disambiguations.nt.
Thank you very much in advance.
All the best,
Jose Manuel
Please recognize that in both cases (Wikidata itself, and the Virtuoso instance provided by OpenLink Software, my employer), you are querying against a shared resource, and various limits should be expected.
You should space your queries out over time, and consider smaller chunks than the 100,000 limit you've run into -- perhaps 50,000 at a time, waiting for each query to finish retrieving results, plus another second or ten, before issuing the next query.
Most of the guidance in this article about working with the DBpedia public SPARQL endpoint is relevant for any public SPARQL endpoint, especially those powered by Virtuoso. Specific settings on other endpoints will vary, but if you try to be friendly (by limiting the rate of your queries, and by limiting the size of partial result sets when using ORDER BY, LIMIT, and OFFSET to step through a full result set that overflows the instance's maximum result set size), you'll be far more successful.
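As a sketch along those lines, a chunked version of the redirects query could look like this (the ORDER BY clause is my addition, to make the paging deterministic):

PREFIX owl: <http://www.w3.org/2002/07/owl#>
CONSTRUCT { ?resource owl:sameAs ?resource2 }
WHERE {
  ?resource owl:sameAs ?resource2
}
ORDER BY ?resource
LIMIT 50000
OFFSET 0
# Re-issue with OFFSET 50000, then 100000, and so on (pausing between
# queries) until a chunk comes back with fewer than 50000 triples.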
You can get and host your own copy of Wikidata, as explained at
https://wiki.bitplan.com/index.php/Get_your_own_copy_of_WikiData
There are also alternatives for getting a partial dump of Wikidata, e.g. https://github.com/bennofs/wdumper
Or ask for access to one of the non-public copies we run, by sending me a personal e-mail via my RWTH Aachen i5 account.
I'm trying to retrieve all articles from Wikipedia that are about people. More specifically, I'm looking for:
only the page title (and perhaps the page ID)
of articles that are about people,
separated by gender (for the sake of simplicity, male and female),
from the current English Wikipedia.
There are several things I've tried, none of which have worked out:
The Wikipedia API lets me search for all pages in a given category. However, searching in "Men" or "Women" fetches mostly subcategory pages, and pages about actual people are buried further down the subcategory hierarchy. I can't find a way to auto-traverse the hierarchy.
PetScan lets me specify a hierarchy depth, but requests time out with a depth of more than 3. Also, like the Wikipedia API, results include articles that aren't about people.
Wikidata lets me write SPARQL queries to search for entities that have a gender of "male" or "female". This example seems to work, but once I include entity names in the query, it times out. Also, I'm not sure how exactly this data corresponds to Wikipedia articles: is this data guaranteed to be the same as on Wikipedia?
What's the best way to achieve what I'm looking for?
I've created a SPARQL query that does the work. It's important to keep the query as simple as possible (for query optimisation, read https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/query_optimization). Here is the query: https://w.wiki/JhK
For articles about women, this might work with the Wikidata Query Service (WQS), though it's right on the edge of timing out. For the male articles (there are more of them), you need to add a LIMIT and step through by adding an increasing OFFSET. The WQS still seems to time out, but there are other endpoints to Wikidata; this one is limited to 100,000 results but works with an increasing OFFSET: https://wikidata.demo.openlinksw.com/sparql
The resulting SPARQL query is something like this:
SELECT ?sitelink
WHERE {
  ?item wdt:P21 wd:Q6581097 ;  # P21 (sex or gender) = Q6581097 (male)
        wdt:P31 wd:Q5 .        # P31 (instance of) = Q5 (human)
  ?sitelink schema:about ?item ;
            schema:isPartOf <https://en.wikipedia.org/> .
}
LIMIT 100000 OFFSET 100000
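For the articles about women, the same shape of query with wd:Q6581072 (Wikidata's item for "female") should work; this variant is my sketch, not part of the original answer:

SELECT ?sitelink
WHERE {
  ?item wdt:P21 wd:Q6581072 ;  # P21 (sex or gender) = Q6581072 (female)
        wdt:P31 wd:Q5 .        # P31 (instance of) = Q5 (human)
  ?sitelink schema:about ?item ;
            schema:isPartOf <https://en.wikipedia.org/> .
}
LIMIT 100000 OFFSET 0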
Using this SPARQL query on the Wikidata Query Service
# Number of items with P31 (instance of) being Q15284 (municipality)
SELECT (COUNT(?item) AS ?count) WHERE {
  ?item wdt:P31 wd:Q15284 .
}
the result has one row with a single column, count, which at the moment indicates 4248.
I would like to know not only the number of those items in the database that are limited by some attribute, but indeed the number of all items:
SELECT (COUNT(?item) AS ?count) WHERE {
  # empty, because I do not want any limitation
}
The expected result should be the number of all items in the database.
I attempted to read the specs, and it appears that this isn't even possible?
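The closest hack I can think of counts all triples rather than all items (and a query like this would almost certainly time out on the Wikidata Query Service):

SELECT (COUNT(*) AS ?count) WHERE {
  ?s ?p ?o .
}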
Hence the question here: is there a (hackish) way to use SPARQL to select all items, so as to be able to count them?
Using the DBpedia-Live SPARQL endpoint http://dbpedia-live.openlinksw.com/sparql, I am trying to count the total number of triples associated with the instances of type owl:Thing. As the count is really big, an exception is thrown: "Virtuoso 42000 Error The estimated execution time". To get around this, I tried to use a subselect with LIMIT and OFFSET in the query. However, when the offset is greater than or equal to the limit, this doesn't work, and the same exception is thrown again (Virtuoso 42000 Error). Can anyone identify the problem with my query, or suggest a workaround? Here is the query I was trying:
SELECT (COUNT(?s) AS ?count)
WHERE {
  ?s ?p ?o
  {
    SELECT ?s
    WHERE {
      ?s rdf:type owl:Thing .
    }
    LIMIT 10000
    OFFSET 10000
  }
}
Your solution starts with patience. Virtuoso's Anytime Query feature returns some results when a timeout strikes, and keeps running the query in the background -- so if you come back later, you'll typically get more solutions, up to the complete result set.
I had to guess at your original query, since you only posted the piecemeal one you were trying to use --
select ( count(?s) as ?count )
where
{
?s rdf:type owl:Thing.
}
I got 3,923,114 within a few seconds, without hitting any timeout. I had set a timeout of 3000000 milliseconds (= 3000 seconds = 50 minutes) on the form -- in contrast to the endpoint's default timeout of 30000 milliseconds (= 30 seconds) -- but clearly hit neither of these, nor the endpoint's server-side configured timeout.
I think you already understand this, but please do note that this count is a moving target, and will change regularly as the DBpedia-Live content continues to be updated from the Wikipedia firehose.
Your divide-and-conquer effort has a significant issue. Note that without an ORDER BY clause in combination with your LIMIT/OFFSET clauses, you may find that some solutions (in this case, some values of ?s) repeat and/or some solutions never appear in a final aggregation that combines all those partial results.
Also, as you are trying to count triples, you should probably use count(*) instead of count(?s). If nothing else, this helps readers of the query understand what you're doing.
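Putting those two points together, a corrected version of the paged query might look like this sketch (the ORDER BY ?s is my addition; you would still need to sum the per-page counts yourself):

SELECT (COUNT(*) AS ?count)
WHERE {
  ?s ?p ?o
  {
    SELECT ?s
    WHERE {
      ?s rdf:type owl:Thing .
    }
    ORDER BY ?s    # deterministic order, so pages don't overlap or skip
    LIMIT 10000
    OFFSET 10000   # step this up by 10000 per page, starting from 0
  }
}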
Toward being able to adjust such execution time limits as your query is hitting -- the easiest way would be to instantiate your own mirror via the DBpedia-Live AMI; unfortunately, this is not currently available for new customers, for a number of reasons. (Existing customers may continue to use their AMIs.) We will likely revive this at some point, but the timing is indefinite; you could open a Support Case to register your interest, and be notified when the AMI is made available for new users.
Toward an ultimate solution... There may be better ways to get to your actual end goal than those you're currently working on. You might consider asking on the DBpedia mailing list or the OpenLink Community Forum.
Is it possible to generate a random sample of triples using SPARQL?
I thought it might be via the SAMPLE function, but that returns just a single value.
My workaround would be to generate a random number to use with the OFFSET keyword and use the LIMIT keyword to return the desired sample size. I'll just hardcode the random number for offset to 200 for ease like so:
SELECT *
WHERE {
?s ?p ?o
}
OFFSET 200 #random number variable
LIMIT 100
Any better suggestions to generate a random sample of 100 data triples from a SPARQL endpoint?
In SPARQL 1.1, you can try to use
...} ORDER BY RAND() LIMIT 100
But whether this works might depend on the triple store.
The accepted answer may work, but it is not optimal, as the suggested approach can lead to errors on some triple stores (e.g. Apache Jena).
As pointed out in the ticket and by @Joshua-Taylor in the comment section above, a better answer is:
SELECT ... WHERE
{
...
BIND(RAND() AS ?sortKey)
} ORDER BY ?sortKey ...
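Applied to the triple-sampling query from the question, that pattern would look something like this sketch:

SELECT ?s ?p ?o
WHERE {
  ?s ?p ?o
  BIND(RAND() AS ?sortKey)  # one random sort key per solution
}
ORDER BY ?sortKey
LIMIT 100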