DBPedia queries missing certain chemical compounds - sparql

I am running this query to get a list of all compounds from the DBpedia public SPARQL endpoint.
SELECT * WHERE {
?y rdf:type dbpedia-owl:Drug.
?y rdfs:label ?Name .
OPTIONAL {?y dbpedia-owl:iupacName ?iupacname} .
OPTIONAL {?y dcterms:subject ?y1}
FILTER (langMatches(lang(?Name),"en"))
}
LIMIT 50000
I am downloading in batches of 50000 (2 files) using the OFFSET parameter.
Somehow Isopropyl_alcohol is not covered by this, even though its page exists at
http://live.dbpedia.org/page/Isopropyl_alcohol
and it has the properties that I am searching for.

There are two issues here. The first is that DBpedia Live and DBpedia do not have exactly the same content. According to the DBpedia Live webpage:
Wikipedia users constantly revise Wikipedia articles with updates
happening almost each second. Hence, data stored in the official
DBpedia endpoint can quickly become outdated, and Wikipedia articles
need to be re-extracted. DBpedia Live enables such a continuous
synchronization between DBpedia and Wikipedia.
That page also lists two SPARQL endpoints for DBpedia Live:
http://live.dbpedia.org/sparql
http://dbpedia-live.openlinksw.com/sparql
However, you'll run into issues on both. Isopropyl_alcohol is in DBpedia, and its URI is
http://dbpedia.org/resource/Isopropyl_alcohol (which, when using a web browser, redirects to
http://dbpedia.org/page/Isopropyl_alcohol)
Looking there, we see that Isopropyl alcohol doesn't have rdf:type dbpedia-owl:Drug, but only
owl:Thing
dbpedia-owl:ChemicalCompound
dbpedia-owl:ChemicalSubstance
yago:Alcohols
http://umbel.org/umbel/rc/ChemicalSubstanceType
yago:AlcoholSolvents
so you won't be able to find it with your query on DBpedia, because it doesn't have the type dbpedia-owl:Drug. Now, Isopropyl_alcohol also exists in DBpedia Live, and its URL is
http://live.dbpedia.org/resource/Isopropyl_alcohol, which redirects to
http://live.dbpedia.org/page/Isopropyl_alcohol
but it only has the following rdf:types:
yago:Alcohols
yago:AlcoholSolvents
http://umbel.org/umbel/rc/ChemicalSubstanceType
so it won't be found by your query on DBpedia Live, for the same reason.
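Based on the types listed above, one workaround on DBpedia (though not on DBpedia Live, where even dbpedia-owl:ChemicalCompound is missing) is to broaden the type constraint with a VALUES block. This is only a sketch, and it assumes the compounds you care about carry one of these types:
SELECT * WHERE {
  VALUES ?class { dbpedia-owl:Drug dbpedia-owl:ChemicalCompound }
  ?y rdf:type ?class .
  ?y rdfs:label ?Name .
  FILTER (langMatches(lang(?Name), "en"))
}
LIMIT 50000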
The second issue is the one that AndyS pointed out. Even if the query would select Isopropyl_alcohol in DBpedia or DBpedia Live, the LIMIT/OFFSET combination isn't guaranteed to return it unless you provide an ordering constraint: without ORDER BY, the server could legitimately return the same set of 50000 results to you every time.
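In other words, if you keep paging with LIMIT/OFFSET, add an ORDER BY so the slices are stable. A sketch based on the original query (sorting by ?y is just one reasonable choice):
SELECT * WHERE {
  ?y rdf:type dbpedia-owl:Drug .
  ?y rdfs:label ?Name .
  OPTIONAL { ?y dbpedia-owl:iupacName ?iupacname }
  OPTIONAL { ?y dcterms:subject ?y1 }
  FILTER (langMatches(lang(?Name), "en"))
}
ORDER BY ?y
LIMIT 50000
OFFSET 50000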

Maybe it is not finding it in the LIMIT/OFFSET combination you are using. The server is not obliged to answer queries in the same order every time unless you use ORDER BY, so the slices you have may not in fact add up to all the results.
Maybe the SPARQL site and live.dbpedia are not in step.
Try asking directly for Isopropyl_alcohol.
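For example, a minimal check using the resource IRI mentioned in the answer above:
SELECT ?p ?o
WHERE { <http://dbpedia.org/resource/Isopropyl_alcohol> ?p ?o }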

Related

Wikidata Virtuoso SPARQL Endpoint - How to get more than 100,000 results

I need to get Wikidata artifacts (instance-types, redirects and disambiguations) for a project.
As the original Wikidata endpoint has time constraints when it comes to querying, I have come across the Virtuoso Wikidata endpoint.
The problem I have is that if I try to get, for example, the redirects with this query, it returns at most 100,000 results:
PREFIX owl: <http://www.w3.org/2002/07/owl#>
CONSTRUCT {?resource owl:sameAs ?resource2}
WHERE
{
?resource owl:sameAs ?resource2
}
I’m writing to ask if you know of any way to get more than 100,000 results. I would like to retrieve as many results as possible.
Once the results are obtained, I must have 3 files (or as few files as possible) in N-Triples format: wikidata_instance_types.nt, wikidata_redirections.nt and wikidata_disambiguations.nt.
Thank you very much in advance.
All the best,
Jose Manuel
Please recognize that in both cases (Wikidata itself, and the Virtuoso instance provided by OpenLink Software, my employer), you are querying against a shared resource, and various limits should be expected.
You should space your queries out over time, and consider smaller chunks than the 100,000 limit you've run into -- perhaps 50,000 at a time, waiting for each query to finish retrieving results, plus another second or ten, before issuing the next query.
Most of the guidance in this article about working with the DBpedia public SPARQL endpoint is relevant for any public SPARQL endpoint, especially those powered by Virtuoso. Specific settings on other endpoints will vary, but if you try to be friendly — by limiting the rate of your queries; limiting the size of partial result sets when using ORDER BY, LIMIT, and OFFSET to step through to get a full result set for a query that overflows the instance's maximum result set size; and the like — you'll be far more successful.
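As a concrete sketch of that stepping approach (assuming this Virtuoso endpoint accepts a sub-select with ORDER BY; the 50,000 page size follows the suggestion above):
PREFIX owl: <http://www.w3.org/2002/07/owl#>
CONSTRUCT { ?resource owl:sameAs ?resource2 }
WHERE {
  SELECT ?resource ?resource2
  WHERE { ?resource owl:sameAs ?resource2 }
  ORDER BY ?resource ?resource2
  LIMIT 50000
  OFFSET 0
}
Increase the OFFSET by 50,000 on each subsequent request, and pause between requests as suggested.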
You can get and host your own copy of Wikidata as explained in
https://wiki.bitplan.com/index.php/Get_your_own_copy_of_WikiData
There are also alternatives for getting a partial dump of Wikidata, e.g. with https://github.com/bennofs/wdumper
Or ask for access to one of the non-public copies we run by sending me a personal e-mail via my RWTH Aachen i5 account.

Retrieving all Wikipedia articles about people

I'm trying to retrieve all articles from Wikipedia that are about people. More specifically, I'm looking for:
only the page title (and perhaps the page ID)
of articles that are about people,
separated by gender (for the sake of simplicity, male and female),
from the current English Wikipedia.
There are several things I've tried, none of which have worked out:
The Wikipedia API lets me search for all pages in a given category. However, searching in "Men" or "Women" fetches mostly subcategory pages, and pages about actual people are buried further down the subcategory hierarchy. I can't find a way to auto-traverse the hierarchy.
PetScan lets me specify a hierarchy depth, but requests time out with a depth of more than 3. Also, like the Wikipedia API, results include articles that aren't about people.
Wikidata lets me write SPARQL queries to search for entities that have a gender of "male" or "female". This example seems to work, but once I include entity names in the query, it times out. Also, I'm not sure how exactly this data corresponds to Wikipedia articles — is this data guaranteed to be the same as on Wikipedia?
What's the best way to achieve what I'm looking for?
I've created a SPARQL query that does the work. It's important to keep the query as simple as possible (for query optimisation, read: https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/query_optimization). Here is the query: https://w.wiki/JhK
For articles about women this might work with the Wikidata Query Service (WQS), though it's right on the edge of timing out. For the male articles (there are more) you need to add a LIMIT and step through by using an increasing OFFSET. The WQS seems to still time out, but there are other endpoints to Wikidata; this one is limited to 100,000 results but works with an increasing OFFSET: https://wikidata.demo.openlinksw.com/sparql
The resulting SPARQL query is something like this:
SELECT ?sitelink
WHERE {
?item wdt:P21 wd:Q6581097;
wdt:P31 wd:Q5.
?sitelink schema:about ?item;
schema:isPartOf <https://en.wikipedia.org/>.
}
LIMIT 100000 OFFSET 100000
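A corresponding sketch for articles about women simply swaps the gender value (wd:Q6581072 instead of wd:Q6581097) and steps through with LIMIT/OFFSET in the same way:
SELECT ?sitelink
WHERE {
  ?item wdt:P21 wd:Q6581072;   # sex or gender: female
        wdt:P31 wd:Q5.         # instance of: human
  ?sitelink schema:about ?item;
            schema:isPartOf <https://en.wikipedia.org/>.
}
LIMIT 100000 OFFSET 0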

No Data Available When Query on Apache Jena Fuseki

I'm new to Apache Jena Fuseki and SPARQL. I have a problem when I try to query data on Fuseki. The data I used is the 'Topical Concepts' dataset from DBpedia (can be found here). I uploaded the data through the control panel in a browser (on the default port 3030) and used the query below:
SELECT ?subject ?predicate ?object
WHERE {
?subject ?predicate ?subject
}
LIMIT 25
I got a null table and a message "no data available in this table". However, when I installed Fuseki and did the same thing on my Mac (the problem above happened on my desktop with the Ubuntu 16 operating system), I successfully got 25 entries of the data. I don't think it is a problem with the operating system, but I really have no idea why it happened. Does anybody encounter the same problem?
In your SPARQL query, you have the following pattern:
?subject ?predicate ?subject
Notice that you repeat ?subject. This query effectively asks: "give me all RDF triples for which the subject is the same value as the object". It's likely that the reason you're not getting a result is simply that no such triples exist in your database.
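A minimal fix is to bind the object to its own variable, matching the variables already listed in the SELECT clause:
SELECT ?subject ?predicate ?object
WHERE {
  ?subject ?predicate ?object
}
LIMIT 25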
As for why this didn't happen on a Mac, without more info about your setup we can only speculate. It's possible that you configured your database slightly differently there (e.g. enabling a reasoner which would result in additional RDF triples that do match the query), or it might be as simple as you did a slightly different query there.
I am making two assumptions for answering your question:
you have two different instances of Jena installed, one on your Mac and one on your desktop.
You are sure you have uploaded the data, possibly into a named graph while the default graph is empty.
Fuseki (I haven't tried this on TDB directly) has a feature that, by default, is often set to query only the default graph. If you activate tdb:unionDefaultGraph true ; in the config, it will query all the graphs. Before changing the settings, please do check that this is actually the case. You could check by executing this query:
SELECT DISTINCT ?g
WHERE {
  GRAPH ?g {
    ?s ?p ?o
  }
}
If you get a result, that means you need to change the settings for it to work, or be mindful of this fact and always call your queries with graphs (as in the above query).
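For example, the (corrected) query from the question can be scoped to the named graphs like this sketch, which keeps the original LIMIT:
SELECT ?subject ?predicate ?object
WHERE {
  GRAPH ?g {
    ?subject ?predicate ?object
  }
}
LIMIT 25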
For more explanation please refer to https://jena.apache.org/documentation/serving_data/

Downloaded DBpedia data does not include all instance-types triples

I downloaded some data from http://downloads.dbpedia.org/2015-04/core/,
including instance-types_en.nt, mappingbased-properties_en.nt, and some others.
I have successfully loaded them into OpenLink Virtuoso DB, but when I run some sample SPARQL query, e.g., a query to see all the triples about a subject Xiamen_University, the problem appears.
select ?s ?p ?o
where
{
?s rdfs:label "Xiamen University"@en .
?s ?p ?o .
}
From the DBpedia SPARQL endpoint, there are heaps of triples about Xiamen_University, while in my DB there are only 4 or 5 of them.
In particular, there is no triple in my DB indicating that Xiamen_University is of type University, nor any instance-types triples at all. I found similar cases with some other subjects as well.
I think the instance-types_en.nt file does not include all the instance-types triples from Wikipedia, and the same problem applies to mappingbased-properties. Is that right? If so, where could I find the right source file?
There's a whole list of the datasets available on the downloads page. I don't see much documentation about exactly what's in each of them, but the names are fairly descriptive, the question mark links next to each one show a preview of what kind of information it contains, and hovering over each title provides a short description.
It looks like to get most of the interesting properties, you'd probably want the mappingbased datasets, along with the labels dataset (since the query that you wrote identifies objects by label).
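As a quick sanity check that the instance-types triples actually loaded, you could ask directly for the types of the example resource (a sketch, assuming the standard DBpedia resource IRI):
SELECT ?type
WHERE { <http://dbpedia.org/resource/Xiamen_University> rdf:type ?type }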

How do triple stores use linked data?

Let's say I have the following scenario:
I have some different ontology files hosted somewhere on the web on different domains like _http://foo1.com/ontolgy1.owl#, _http://foo2.com/ontology2.owl# etc.
I also have a triple store in which I want to insert instances based on the ontology files mentioned like this:
INSERT DATA
{
<http://foo1.com/instance1> a <http://foo1.com/ontolgy1.owl#class1>.
<http://foo2.com/instance2> a <http://foo2.com/ontolgy2.owl#class2>.
<http://foo2.com/instance2x> a <http://foo2.com/ontolgy2.owl#class2x>.
}
Let's say that _http://foo2.com/ontolgy2.owl#class2x is a subclass of _http://foo2.com/ontolgy2.owl#class2 defined within the same ontology.
And after the insert if I run a SPARQL query like this:
select ?a
where
{
?a rdf:type ?type.
?type rdfs:subClassOf* <http://foo2.com/ontolgy2.owl#class2> .
}
the result would be:
<http://foo2.com/instance2>
and not:
<http://foo2.com/instance2>
<http://foo2.com/instance2x>
as it should be. This is happening because the ontology file _http://foo2.com/ontolgy2.owl# is not imported into the triple store.
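(For reference, importing an ontology document is typically a one-line operation; a sketch using SPARQL 1.1 Update's LOAD, assuming the ontology IRI without the trailing '#' dereferences to an RDF document and the store accepts updates:)
# load the ontology document into the store's default graph
LOAD <http://foo2.com/ontolgy2.owl>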
My question is:
Can we talk in this example about "linked" data? Because it seems to me that it is not linked at all. It has to be imported locally into a triple store, and after that you can start querying.
Let's say you want to run a query on some complex data that is described by 20 ontology files; then all 20 ontology files would need to be imported.
Isn't this a bit disappointing?
Have I misunderstood triple stores and linked data and how they work together?
as it should be.
I'm not certain that should is the right term here. The semantics of the SPARQL query is to query the data stored in a particular graph stored at the endpoint. IRIs are more or less opaque identifiers; just because they might also be URLs from which additional data can be retrieved doesn't obligate any particular system to actually do that kind of retrieval. Doing that would easily make query behavior unpredictable: "this query worked yesterday, why doesn't it work today? oh, a remote website is no longer available…".
Let's say that _http://foo2.com/ontolgy2.owl#class2x is a subclass of _http://foo2.com/ontolgy2.owl#class2 defined within the same ontology.
Remember, since IRIs are opaque, anyone can define a term in any ontology. It's always possible for someone else to come along and say something else about a resource. You have no way of tracking all that information. For instance, if I go and write an ontology, I can declare http://foo2.com/ontolgy2.owl#class2x as a class and assert that it's equivalent to http://dbpedia.org/ontology/Person. Should the system have some way to know about what I did someplace else, and even if it did, should it be required to go and retrieve information from it? What if I made an ontology that's 2GB in size? Surely your endpoint can't be expected to go and retrieve that just to answer a quick query?
Can we talk in this example about "linked" data? Because it seems to
me that it is not linked at all. It has to be imported locally into a
triple store, and after that you can start querying.
Let's say you want to run a query on some complex data that is described
by 20 ontology files; then all 20 ontology files would need to be
imported.
This is usually the case, and the point about linked data is that you have a way to get more information if you choose to, and that you don't have to do as much work in negotiating how to identify resources in that data. However, you can use the service keyword in SPARQL to reference other endpoints, and that can provide a type of linking. For instance, knowing that DBpedia has a SPARQL endpoint, I can run a local query that incorporates DBpedia with something like this:
select ?person ?localValue ?publicName {
  ?person :hasLocalValueOfInterest ?localValue
  service <http://dbpedia.org/sparql> {
    ?person foaf:name ?publicName
  }
}
You can use multiple service blocks to aggregate data from multiple endpoints; you're not limited to just one. That seems pretty "linked" to me.
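For instance, here is a sketch with a second, hypothetical endpoint and property (ex:someOtherProperty and http://example.org/sparql are placeholders, like :hasLocalValueOfInterest above); note that the join on ?person only works if both endpoints use the same IRIs:
select ?person ?localValue ?publicName ?otherValue {
  ?person :hasLocalValueOfInterest ?localValue
  service <http://dbpedia.org/sparql> {
    ?person foaf:name ?publicName
  }
  # a second remote endpoint queried in the same request
  service <http://example.org/sparql> {
    ?person ex:someOtherProperty ?otherValue
  }
}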