Get chemical compound data in SI units from wikidata - sparql

For a project I need to get data about chemical compounds like density, mass, boiling point and melting point in SI units (meters, kg, Degree Celsius,...) via the CAS number of the compound.
With the Query builder and some testing I managed to achieve some of it with the following code (CAS-Number is the property P231 and I am searching for e.g. 67-64-1):
Wikidata Query Service
SELECT DISTINCT ?itemLabel ?melting_point ?boiling_point ?mass ?density WHERE {
{
SELECT DISTINCT ?item WHERE {
?item p:P231 ?statement0.
?statement0 (ps:P231) "67-64-1".
}
}
OPTIONAL { ?item wdt:P2101 ?melting_point. }
OPTIONAL { ?item wdt:P2102 ?boiling_point. }
OPTIONAL { ?item wdt:P2054 ?density. }
OPTIONAL { ?item wdt:P2067 ?mass. }
SERVICE wikibase:label { bd:serviceParam wikibase:language "de". }
}
The problem is that the results include temperatures not only in degrees Celsius but also in Fahrenheit.

This is a very interesting question, as Wikidata offers plenty of tools to disambiguate -- this is why it's so powerful.
But these tools take some learning before you can use them.
Before starting, let me make three points about your query:
1-You don't actually need an inner query to select acetone, as you would in SQL. This is one of the reasons why SPARQL is so great compared to SQL -- you don't have to navigate an endless field of keys but your data is still 'normalised'.
2-You don't need the ?statement0 variable, as it is not used to disambiguate. You can just use the wdt:P231 property, which directly links acetone with its CAS registry number.
3-Since you do need to disambiguate the values of the physical quantities associated with acetone, you will need to go through the statement node of each of those properties.
Now, here is a query that works:
SELECT DISTINCT ?itemLabel ?melting_point ?boiling_point ?density ?mass
WHERE {
?item wdt:P231 "67-64-1".
OPTIONAL {
?item p:P2101 ?ps1 .
?ps1 ps:P2101 ?melting_point;
psv:P2101/wikibase:quantityUnit/wdt:P31/wdt:P279* wd:Q61610698
}
OPTIONAL {
?item p:P2102 ?ps2 .
?ps2 ps:P2102 ?boiling_point;
psv:P2102/wikibase:quantityUnit/wdt:P31/wdt:P279* wd:Q61610698
}
OPTIONAL {
?item p:P2054 ?ps3 .
?ps3 ps:P2054 ?density;
psv:P2054/wikibase:quantityUnit/wdt:P31/wdt:P279* wd:Q61610698
}
OPTIONAL {
?item p:P2067 ?ps4 .
?ps4 ps:P2067 ?mass;
psv:P2067/wikibase:quantityUnit/wdt:P31/wdt:P279* wd:Q61610698
}
SERVICE wikibase:label { bd:serviceParam wikibase:language "de". }
}
To begin with, I removed the inner query and the statement variable mentioned in points 1 and 2 above.
Then, I retrieve the statement node for the melting point by using the p:P2101/ps:P2101 combination.
This allows me to distinguish between Celsius and Fahrenheit values.
Now, since we have multiple physical quantities to look for (i.e. not just temperature), and we want all of them in SI units, we can use a property path (explained just below) to restrict the returned values to those whose unit is an SI unit. Filtering on Celsius, kg/m^3 and kg individually would be perfectly valid too, just more verbose.
For reference, a property path is just a way to shorten a query so:
SELECT ?person ?grandparent
WHERE {
?person :hasParent ?parent .
?parent :hasParent ?grandparent .
}
can be shortened to:
SELECT ?person ?grandparent
WHERE {
?person :hasParent/:hasParent ?grandparent .
}
Now, let's get back to the melting point being returned in Celsius and Fahrenheit.
The two statements that give us different units use a psv:P2101 property to tell us more about the value mentioned in the statement.
From this we can use the wikibase:quantityUnit property to determine the unit.
We then want to make sure the unit is an SI unit or belongs to a subclass thereof. So wdt:P31 tells us that the unit is "an instance of" some class, and wdt:P279* wd:Q61610698 (wdt:P279* is another property path) tells us that the class is either the class of SI units (wd:Q61610698 = SI units), or a direct or indirect subclass of SI units.
I added a picture of what the data looks like (although, confusingly, there are two Celsius melting points for acetone for some reason).
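To make the wdt:P31/wdt:P279* step more concrete, here is a small Python sketch of what that path checks. It is only a toy model: the unit names, class names and the two dictionaries are made up for illustration and are not real Wikidata content, and the helper names (superclasses, matches_path, keep_si) are hypothetical.

```python
# Toy model of the path  ?unit wdt:P31/wdt:P279* <target class>
# All data below is invented for illustration; it is NOT real Wikidata content.

INSTANCE_OF = {  # plays the role of wdt:P31
    "degree_celsius": {"si_derived_unit"},
    "degree_fahrenheit": {"imperial_unit"},
    "kilogram": {"si_base_unit"},
}
SUBCLASS_OF = {  # plays the role of wdt:P279
    "si_derived_unit": {"si_unit"},
    "si_base_unit": {"si_unit"},
    "imperial_unit": {"unit_of_measurement"},
    "si_unit": {"unit_of_measurement"},
}


def superclasses(cls):
    """Return cls plus everything reachable via zero or more subclass-of hops (the P279* part)."""
    seen, stack = set(), [cls]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(SUBCLASS_OF.get(current, ()))
    return seen


def matches_path(unit, target_class):
    """True if the unit is an instance of target_class or of any of its subclasses."""
    return any(target_class in superclasses(c) for c in INSTANCE_OF.get(unit, ()))


def keep_si(values):
    """Filter (amount, unit) pairs the way the query's path restriction does."""
    return [(amount, unit) for amount, unit in values if matches_path(unit, "si_unit")]


# Illustrative amounts only
melting_points = [(-95, "degree_celsius"), (-139, "degree_fahrenheit")]
print(keep_si(melting_points))
```

The Fahrenheit value is dropped because none of its unit's classes leads to the SI class, which is the same reason the query above no longer returns it.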

Related

Return a table of basic physical objects?

How can I return a table of basic physical objects, e.g., Ball (Q18545), Arrow (Q45922), in Wikidata using SPARQL?
I'm not able to directly return objects with the property Physical Object (Q223557) because it has way too many records. But its subtypes, e.g., Toy (Q11422) or Projectile (Q49393), are too narrow for me. I've tried the following to get my broad query working:
removing the label service
using LIMIT for a moderate number of records
filtering out records with very few sitelinks
limiting the objects to those with ids from BNCF Thesaurus ID, BabelNet ID, etc.
Nothing has worked for me. I suspect this is straightforward for anyone who's had more than a few days with Wikidata. Please help.
I shared my wrecked query below.
SELECT ?obj #?objLabel
WHERE {
{
SELECT ?obj WHERE {
?obj wdt:P508 ?bncfid;
wdt:P2581 ?bnid;
wdt:P227 ?gndid;
wdt:P8814 ?wsid;
wdt:P18 ?image;
wikibase:sitelinks ?sitelinks;
wdt:P31/wdt:P279* wd:Q223557.
#FILTER(?sitelinks > 5).
#FILTER(LANG(?objLabel)="en").
#SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en".
# ?obj rdfs:label ?objLabel}
}
LIMIT 1000
}
#?obj rdfs:label ?objLabel
#FILTER(LANG(?objLabel)="en").
}
(Run by clicking here)

Efficient filter query in Wikidata

I am trying to form an efficient filter query in SPARQL on Wikidata. Let me explain my process:
I query the search-entities API using keywords, e.g. (Apple, Orange).
The API query returns a list of relevant item IDs, e.g. (wd:Q629269, wd:Q154950, wd:Q312, wd:Q95, wd:Q4878289, wd:Q10817602).
With this list of IDs, I then query SPARQL to return the items that are instances of a certain class or of one of its subclasses, e.g. (p:P31/ps:P31/wdt:P279* wd:Q43229), which returns every item that is an organisation or a subclass thereof.
Then, for the items in the list that are of the required class, I return additional information if it exists, e.g. via OPTIONAL.
I am new to SPARQL. My question is: is this the most efficient method to achieve this output? It seems quite inefficient to me, and I cannot find a similar type of problem in the tutorial examples.
You can try the query here.
SELECT distinct ?item ?itemLabel ?itemDescription ?web ?inception ?ISIN
WHERE{
FILTER (?item IN (wd:Q629269, wd:Q154950, wd:Q312, wd:Q95, wd:Q4878289, wd:Q10817602))
?item p:P31/ps:P31/wdt:P279* wd:Q43229.
OPTIONAL {
?item wdt:P856 ?web. # get item-web
?item wdt:P571 ?inception. # get item-inception
?item wdt:P946 ?ISIN. # get item-isin
}
SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE]". }
}
LIMIT 10

How to get only the most recent value from a Wikidata property?

Suppose I want to get a list of every country (Q6256) and its most recently recorded Human Development Index (P1081) value. The Human Development Index property for the country contains a list of data points taken at different points in time, but I only care about the most recent data. This query will not work because it gets multiple results for each country (one for each Human Development Index data point):
SELECT
?country
?countryLabel
?hdi_value
?hdi_date
WHERE {
?country wdt:P31 wd:Q6256.
OPTIONAL { ?country p:P1081 ?hdi_statement.
?hdi_statement ps:P1081 ?hdi_value.
?hdi_statement pq:P585 ?hdi_date.
}
SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
Link to Query Console
I'm aware of GROUP BY/GROUP_CONCAT, but that will still give me every result when I'd prefer to just have one. GROUP BY/SAMPLE will not work either, since SAMPLE is not guaranteed to pick the most recent result.
Any help or link to a relevant example query is appreciated!
P.S. Another thing I'm confused about is why population (P1082) in this query returns only one population result per country:
SELECT
?country
?countryLabel
?population
WHERE {
?country wdt:P31 wd:Q6256.
OPTIONAL { ?country wdt:P1082 ?population. }
SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
while the same query but for HDI returns multiple results per country:
SELECT
?country
?countryLabel
?hdi
WHERE {
?country wdt:P31 wd:Q6256.
OPTIONAL { ?country wdt:P1081 ?hdi. }
SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
What is different about population and HDI that causes the behavior to be different? When I view the population data for each country on Wikidata I see multiple population points listed, but only one gets returned by the query.
Both your questions are duplicates, but I'll try to add interesting facts to existing answers.
Question 1 is a duplicate of SPARQL query to get only results with the most recent date.
This technique does the trick:
FILTER NOT EXISTS {
?country p:P1081/pq:P585 ?hdi_date_ .
FILTER (?hdi_date_ > ?hdi_date)
}
However, you should add this clause outside of the OPTIONAL block; it does not work inside OPTIONAL (and I'm not sure whether that is a bug).
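If the FILTER NOT EXISTS pattern looks odd, it may help to see the same logic outside SPARQL. The following Python sketch is only an illustration under assumed data: the rows are hypothetical (country, value, date) tuples, not real HDI figures, and latest_only is a made-up helper name. A row survives only if no other row for the same country has a later date.

```python
from datetime import date

# Hypothetical bindings shaped like (?country, ?hdi_value, ?hdi_date)
rows = [
    ("A", 0.70, date(2015, 1, 1)),
    ("A", 0.72, date(2018, 1, 1)),
    ("B", 0.90, date(2017, 1, 1)),
]


def latest_only(rows):
    """Keep a row only if no row for the same country carries a later date.

    This mirrors: FILTER NOT EXISTS { ... FILTER (?hdi_date_ > ?hdi_date) }
    """
    return [
        (country, value, when)
        for country, value, when in rows
        if not any(c == country and d > when for c, _, d in rows)
    ]


print(latest_only(rows))
```

Note that if two data points share the same latest date, both are kept, and the SPARQL filter behaves the same way.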
Question 2 is a duplicate of Some cities aren't instances of city or big city?
You can't rely on wdt: predicates to get every data point, because the missing statements are not truthy.
They are normal-rank statements, but a preferred-rank statement exists for the same property.
Truthy statements represent statements that have the best non-deprecated rank for given property. Namely, if there is a preferred statement for property P2, then only preferred statements for P2 will be considered truthy. Otherwise, all normal-rank statements are considered truthy.
The reason why P1082 always has a preferred statement is that this property is processed by PreferentialBot.
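The truthy rule quoted above can be written down in a few lines. This Python sketch is just a restatement of that rule on hypothetical data; the truthy function and the sample values are illustrative, not part of any Wikidata API.

```python
def truthy(statements):
    """Return the values that a wdt: predicate would expose.

    statements: list of (value, rank) pairs, where rank is one of
    "preferred", "normal" or "deprecated".
    """
    preferred = [value for value, rank in statements if rank == "preferred"]
    if preferred:
        # A preferred statement hides all normal-rank ones.
        return preferred
    # Otherwise every normal-rank statement is truthy; deprecated never is.
    return [value for value, rank in statements if rank == "normal"]


# Hypothetical data: one property with a preferred value, one without.
population = [(100, "normal"), (120, "preferred")]
hdi = [(0.70, "normal"), (0.72, "normal")]

print(truthy(population))  # only the preferred value
print(truthy(hdi))         # all normal-rank values
```

This is why a property that has a preferred statement yields one result with wdt:, while a property whose statements are all normal rank yields every data point.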

Wikidata SPARQL Query Qualifier Value

This should be fairly easy for anyone familiar with SPARQL (which I am not). I'm trying to return a qualifier/property value for "score_by" in this query and it's showing up blank:
SELECT ?item ?itemLabel ?IMDb_ID ?_review_score ?_score_by WHERE {
?item wdt:P345 "tt3315342".
SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
OPTIONAL { ?item wdt:P345 ?IMDb_ID. }
OPTIONAL { ?item wdt:P444 ?_review_score. }
OPTIONAL { ?item ps:P447 ?_score_by. }
}
Here is a link to this query
'Score by' is a tricky thing, because it qualifies a score.
Scores are complex things: they aren't just a value, but are qualified by the scorer (Rotten Tomatoes, IMDb, etc.). If your query worked, the answers would be misleading, since it wouldn't be clear whether ?_review_score corresponded to ?_score_by, i.e. whether the review score belonged to that reviewer.
(You might ask why P444 - score - is there, since without a reviewer the information isn't complete. It's a fair question. The actual property is wdt:P444, a wikidata direct property. What that means is that the property was created as a shortcut for convenience, at the expense of losing some context. They're like database views.)
The way they actually work is by 'reifying' the complex review score as a thing, an object 'the review', then hanging the information - score, reviewer etc - off that.
For example:
select * where {
wd:Q24053263 p:P444 ?review . # Get reviews for wolverine
?review ?p ?o # Get all info from the review
}
You can see here that the score is there under prop/statement/P444 (i.e. ps:P444), and there's a 'qualifier' under prop/qualifier/P447 (i.e. pq:P447), which is the reviewer.
Essentially properties in wikidata can appear in a number of guises, encoded in the prefix.
To answer your question:
OPTIONAL { ?item wdt:P444 ?_review_score. }
OPTIONAL { ?item ps:P447 ?_score_by. }
should be
OPTIONAL {
?item p:P444 ?review .
?review pq:P447 ?_score_by ; ps:P444 ?_review_score
}
i.e. Treat the review as a single thing, then get the score and corresponding reviewer from that.
(If you worry that there might be scores without reviewers you could add another optional within that)
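To illustrate why going through the statement node matters, here is a rough Python sketch. The two statement dictionaries are hypothetical (the scores are invented), and both function names are made up for this example. Fetching scores and reviewers independently produces every combination, whereas reading both from the same statement keeps each score with its own reviewer.

```python
from itertools import product

# Hypothetical statement nodes for one film; the scores are invented.
statements = [
    {"ps:P444": "93%", "pq:P447": "Rotten Tomatoes"},
    {"ps:P444": "7.4/10", "pq:P447": "IMDb"},
]


def independent_lookups(statements):
    """Scores and reviewers fetched separately: every pairing appears."""
    scores = [s["ps:P444"] for s in statements]
    reviewers = [s["pq:P447"] for s in statements]
    return list(product(scores, reviewers))


def via_statement_node(statements):
    """Score and reviewer read from the same statement: pairs stay aligned."""
    return [(s["ps:P444"], s["pq:P447"]) for s in statements]


print(len(independent_lookups(statements)))  # number of mixed-up combinations
print(via_statement_node(statements))
```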

Querying WikiData, difference between p and wdt default prefix

I am new to wikidata and I can't figure out when I should use -->
wdt prefix (http://www.wikidata.org/prop/direct/)
and when I should use -->
p prefix (http://www.wikidata.org/prop/).
in my sparql queries. Can someone explain what each of these mean and what is the difference?
Things in the p: namespace are used to select statements. Things in the wdt: namespace are used to select entities. Entity selection, with wdt:, allows you to simplify or summarize more complex queries involving statement selection.
When you see a p: you are usually going to see a ps: or pq: shortly following. This is because you rarely want a list of statements; you usually want to know something about those statements.
This example is a two-step process showing you all the graffiti in Wikidata:
SELECT ?graffiti ?graffitiLabel
WHERE
{
?graffiti p:P31 ?statement . # entities that have an "instance of" statement
?statement ps:P31 wd:Q17514 . # which state something is graffiti
SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
Two different versions of the P31 property are used here, housed in different namespaces. Each version comes with different expectations about how it will connect to other items. Things in the p: namespace connect entities to statements, and things in the ps: namespace connect statements to values. In the example, p:P31 is used to select statements about an entity. The entity will be graffiti, but we do not specify that until the next line, where ps:P31 is used to select the values (objects) of the statements, specifying that those values should be graffiti.
So, that's kind of complicated! The wdt: namespace is supposed to make this kind of query simpler. The example could be rewritten as:
SELECT ?graffiti ?graffitiLabel
WHERE
{
?graffiti wdt:P31 wd:Q17514 . # entities that are graffiti
SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
This is now one line shorter because we are no longer looking for statements about graffiti, but for graffiti itself. The dual p: and ps: linkages are summarized with a wdt: version of the same P31 property. However, be aware:
This technique only returns "truthy" statements, i.e. those with the best non-deprecated rank for the property. (The "t" in wdt: stands for "truthy").
Information available to wdt: is therefore sometimes missing some facts. In my experience, a p: and ps: query will often return a few more results than a wdt: query.
If you go to the Wikidata item page for Barack Obama at https://www.wikidata.org/wiki/Q76 and scroll down, you see the entry for the "spouse" property P26:
Think of the p: prefix as a way to get to the entire white box on the right side of the image.
In order to get to the information inside the white box, you need to dig deeper.
In order to get to the main part of the information ("Michelle Obama"), you combine the p: prefix with the ps: prefix like this:
SELECT ?spouse WHERE {
wd:Q76 p:P26 ?s .
?s ps:P26 ?spouse .
}
The variable ?s is an abstract statement node (aka the white box).
You can get the same information with only one triple in the body of the query by using wdt::
SELECT ?spouse WHERE {
wd:Q76 wdt:P26 ?spouse .
}
So why would you ever use p:?
You might have noticed that the white box also contains meta information ("start time" and "place of marriage").
In order to get to the meta information, you combine the p: prefix with the pq: prefix.
The following example query returns all the information together with the statement node:
SELECT ?s ?spouse ?time ?place WHERE {
wd:Q76 p:P26 ?s .
?s ps:P26 ?spouse .
?s pq:P580 ?time .
?s pq:P2842 ?place .
}
They're simply XML namespace prefixes, basically a shortcut for full URIs. So given wdt:Apples, the full URI is http://www.wikidata.org/prop/direct/Apples and given p:fruitType the URI is http://www.wikidata.org/prop/fruitType.
Prefixes/namespaces have no other meaning, they are simply ways to define the name of something with URL format. However conventions, such as defining properties in http://www.wikidata.org/prop/, are useful to separate the meanings of terms, so 'direct' is likely a sub-type of property as well (in this case having to do with wikipedia dumps).
For the specifics, you'd need to hope the authors have exposed some naming convention, or be caught in a loop of "was it p:P51 or p:P15 or maybe wdt:P51?". And may luck be with you because the "semantics" of semantic technology have been lost.
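For what it's worth, the prefix expansion itself is mechanical. Here is a minimal Python sketch, assuming the standard Wikidata prefix declarations used by the query service; the expand helper is a hypothetical name for illustration.

```python
# Namespace URIs behind the prefixes used in the queries above.
PREFIXES = {
    "wd": "http://www.wikidata.org/entity/",
    "wdt": "http://www.wikidata.org/prop/direct/",
    "p": "http://www.wikidata.org/prop/",
    "ps": "http://www.wikidata.org/prop/statement/",
    "pq": "http://www.wikidata.org/prop/qualifier/",
}


def expand(prefixed_name):
    """Turn a prefixed name such as 'wdt:P26' into its full URI."""
    prefix, local_name = prefixed_name.split(":", 1)
    return PREFIXES[prefix] + local_name


for name in ("wd:Q76", "wdt:P26", "p:P26", "ps:P26", "pq:P580"):
    print(name, "->", expand(name))
```

So the same property ID (P26 here) shows up under several namespaces, and the namespace decides whether it links an entity to a value, an entity to a statement, or a statement to its value or qualifier.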