SPARQL query to find results which do not meet certain criteria

I am trying to write a SPARQL query which will return a set of patient identifier codes (?Crid) which have associated with them a specific diagnosis code (?ICD9) and DO NOT have associated with them a specific medication AND which have an order date (?OrderDate) prior to their recruitment date (?RecruitDate). I have incorporated the OBIB ontology into my graph.
Here is what I have so far (a bit simplified and with a few steps through the graph omitted for readability/sensitivity):
SELECT DISTINCT ?Crid WHERE
{?Crid a obib:CRID .
#-- Return CRIDs with a diagnosis
?Crid obib:hasPart ?ICD9 .
?ICD9 a obib:diagnosis .
#-- Return CRIDs with a medical prescription record
?Crid obib:hasPart ?medRecord .
?medRecord a obib:medicalRecord .
#-- Return CRIDs with an order date
?medRecord obib:hasPart ?OrderDate .
?OrderDate a obib:dateOfDataEntry .
#-- Return CRIDs with a recruitment date
?Crid obib:hasPart ?FormFilling .
?FormFilling a obib:formFilling .
?RecruitDate obib:isAbout ?FormFilling .
?RecruitDate a obib:dateOfDataEntry .
#-- Filter results for specific ICD9 codes
FILTER (?ICD9 = '1')
#-- Subtract Results with Certain Medication and Order Date Prior to Recruitment
#-- This is the part that I think is giving me a problem
MINUS {
FILTER (regex (?medRecord, "medication_1", "i"))
FILTER (?RecruitDate-?OrderDate < "P0D"^^xsd:dayTimeDuration)
}
}
My gut feeling is that I am not using MINUS correctly. This query returns mostly the right results: I am expecting 10 results and it is returning 12. The extraneous 2 results did take "medication_1" and have order dates before their recruitment dates, so I do not want them to be included in the set.
In case it matters, I am using a Stardog endpoint to run this query and to store my graph data.

Instead of
#-- Subtract Results with Certain Medication and Order Date Prior to Recruitment
#-- This is the part that I think is giving me a problem
MINUS {
FILTER (regex (?medRecord, "medication_1", "i"))
FILTER (?RecruitDate-?OrderDate < "P0D"^^xsd:dayTimeDuration)
}
}
I'd probably just write this without MINUS as:
FILTER (!regex(?medRecord, "medication_1", "i"))
FILTER (?RecruitDate-?OrderDate >= "P0D"^^xsd:dayTimeDuration)
I'd also probably consider whether REGEX is the right tool here (would a simple string comparison work?), but that's a different issue.
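As for why the original MINUS removed nothing: the block contains only FILTERs and no triple patterns, so it binds no variables, and under the SPARQL 1.1 rule MINUS only discards a solution when the two sides are compatible and share at least one variable. The Python sketch below mimics that rule on plain dicts. The patient IDs and bindings are made up for illustration, and this is a toy model of the semantics, not a description of how Stardog evaluates the query.

```python
def compatible(a, b):
    """Two solutions are compatible if they agree on every shared variable."""
    return all(a[v] == b[v] for v in a.keys() & b.keys())

def minus(left, right):
    """Keep a left solution unless some right solution is compatible with it
    AND shares at least one variable with it (the SPARQL 1.1 MINUS rule)."""
    return [
        sol for sol in left
        if not any(compatible(sol, r) and (sol.keys() & r.keys()) for r in right)
    ]

# Hypothetical solutions of the outer query.
left = [{"Crid": "p1"}, {"Crid": "p2"}, {"Crid": "p3"}]

# A MINUS block without triple patterns binds nothing: at best one empty solution.
filters_only = [{}]
# A MINUS block that repeats the medication pattern binds ?Crid.
with_pattern = [{"Crid": "p2", "medRecord": "medication_1"}]

kept_filters_only = minus(left, filters_only)
kept_with_pattern = minus(left, with_pattern)
print(kept_filters_only)  # all three patients survive
print(kept_with_pattern)  # p2 is removed
```

If the goal is to exclude whole patients rather than individual rows, this suggests the medication and date patterns would need to appear inside the MINUS (or a FILTER NOT EXISTS) so that ?Crid is bound on both sides.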

Related

Why does the DISTINCT keyword lead to a different entity for these two queries?

Query 1
PREFIX ns: <http://rdf.freebase.com/ns/>
SELECT DISTINCT ?x
WHERE {
FILTER (!isLiteral(?x) OR lang(?x) = '' OR langMatches(lang(?x), 'en'))
?x ns:type.object.type ns:religion.religious_leadership_title .
?x ns:religion.religious_leadership_title.leaders ?c0 .
?c0 ns:religion.religious_organization_leadership.start_date ?sk0 .
}
ORDER BY ?sk0
LIMIT 1
Query 2
PREFIX ns: <http://rdf.freebase.com/ns/>
SELECT ?x
WHERE {
FILTER (!isLiteral(?x) OR lang(?x) = '' OR langMatches(lang(?x), 'en'))
?x ns:type.object.type ns:religion.religious_leadership_title .
?x ns:religion.religious_leadership_title.leaders ?c0 .
?c0 ns:religion.religious_organization_leadership.start_date ?sk0 .
}
ORDER BY ?sk0
LIMIT 1
So the only difference between Q1 and Q2 is the DISTINCT keyword in the SELECT ?x clause of Q1. However, Q1 gives answer m.01h_90 while Q2 gives answer m.05rd8.
Ideally, I feel this should not lead to different results, as the purpose of DISTINCT is only to get rid of duplicates in the results set if I understand it correctly, so if the original results do not have duplicates at all, there should not be any difference by adding the DISTINCT keyword.
You have a tie on the value you're ordering by. Specifying DISTINCT causes a different execution plan, which orders the tied rows differently: the rows are still ordered by the one requested column, but a different row comes out first. Add the output column to the ORDER BY clause and you should see consistent results between the two queries.
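The effect of a tie can be sketched in Python. This is only an illustration under assumptions: the two entity IDs come from the question, but the shared start date is invented, and the two row orders stand in for whatever order each execution plan happens to produce.

```python
# Hypothetical rows: both entities share one start date, so ORDER BY ?sk0 ties.
rows_plan_a = [
    {"x": "m.01h_90", "sk0": "1800-01-01"},
    {"x": "m.05rd8", "sk0": "1800-01-01"},
]
rows_plan_b = list(reversed(rows_plan_a))  # same rows, produced in another order

def first_row(rows, key):
    """Emulate ORDER BY ... LIMIT 1 with a stable sort."""
    return sorted(rows, key=key)[0]["x"]

def by_date(row):
    return row["sk0"]

def by_date_then_x(row):
    return (row["sk0"], row["x"])

tie_a = first_row(rows_plan_a, by_date)
tie_b = first_row(rows_plan_b, by_date)
fixed_a = first_row(rows_plan_a, by_date_then_x)
fixed_b = first_row(rows_plan_b, by_date_then_x)
print(tie_a, tie_b)      # the two plans disagree on the first row
print(fixed_a, fixed_b)  # with a tie-breaker they agree
```

In SPARQL terms the tie-breaker corresponds to ORDER BY ?sk0 ?x.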

Get a variable number of columns for output in SPARQL

Is there a way to get a variable number of columns for a given predicate? Essentially, I want to turn this:
title note
A. 1
A. 2
A. 3
B. 4
B. 5
into
title note1 note2 note3
A. 1 2 3
B. 4 5 null
Like, can I set the number of columns created to the maximum number of "notes" in the query or something? Thanks.
There are several ways you can approach this. One way is to change your query. Now, in the general case it is not possible to do a SELECT query that does exactly what you want. However, if you happen to know in advance what the maximum number of notes per title is, you can sort of do this.
Supposing your original query was something like this:
SELECT ?title ?note
WHERE { ?title :hasNote ?note }
And supposing you know titles have at most 3 notes, you could probably (untested) do something like this:
SELECT ?title ?note1 ?note2 ?note3
WHERE {
?title :hasNote ?note1 .
OPTIONAL { ?title :hasNote ?note2 . FILTER (?note2 != ?note1) }
OPTIONAL { ?title :hasNote ?note3 . FILTER (?note3 != ?note1 && ?note3 != ?note2) }
}
As you can see this is not a very nice solution though: it doesn't scale and is probably very inefficient to process as well.
Alternatives are various forms of post-processing. To make it simpler to post-process you could use an aggregate operator to get all notes for a single item on a single line at least:
SELECT ?title (GROUP_CONCAT(?note) as ?notes)
WHERE { ?title :hasNote ?note }
GROUP BY ?title
result:
title notes
A. "1 2 3"
B. "4 5"
You could then post-process the values of the ?notes variable to split them into the separate notes again.
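A minimal sketch of that post-processing step in Python, assuming the results have already been fetched as (title, notes) pairs and that the notes were joined with GROUP_CONCAT's default single-space separator. The function name split_notes is hypothetical.

```python
def split_notes(rows, separator=" "):
    """Turn (title, concatenated_notes) rows into fixed-width rows,
    padding shorter rows with None."""
    split = [(title, notes.split(separator)) for title, notes in rows]
    width = max((len(notes) for _, notes in split), default=0)
    header = ["title"] + [f"note{i}" for i in range(1, width + 1)]
    table = [
        [title] + notes + [None] * (width - len(notes))
        for title, notes in split
    ]
    return header, table

rows = [("A.", "1 2 3"), ("B.", "4 5")]
header, table = split_notes(rows)
print(header)
for line in table:
    print(line)
```

This breaks if a note itself contains the separator; in that case a rarer separator can be requested in the query, e.g. GROUP_CONCAT(?note; separator="|"), and passed to the function.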
Another solution is that instead of using a SELECT query, you use a CONSTRUCT query to give you back an RDF graph, rather than a table, and work directly with that in your code. Tables are kinda weird in an RDF world if you think about it: you're querying a graph model, why is the query result not a graph but a table?
CONSTRUCT
WHERE { ?title :hasNote ?note }
...and then process the result in whatever API you're using to do the queries.

Optimizing aggregation query against wikidata

I am running an aggregation query against Wikidata. The query tries to calculate the average duration of films, grouped by their genre and the year of publication.
The multiple grouping/subqueries in the query are intended to retain an n-1 relationship from film to the grouping criteria (year and genre) and a 1-1 relationship between a film and its duration. Reason for this is having approximately correct aggregations (n-1 relationships are familiar for OLAP and data warehousing practitioners).
More explanation is embedded in the query. Hence I cannot drop the groupings done in the subqueries, the IF expressions, or the group concatenation. This query times out on the Wikidata SPARQL endpoint.
QUESTION
I need some suggestion for performance enhancement... Any optimization hints? In case that's not possible, anyone aware of some authenticated way (so that they know I am not playing around) to query Wikidata so that timeout can be increased, or a way to increase timeout generally?
# Average duration of films, grouped by their genre and the year of publication
SELECT
?genre1 # film genre
?year1 # film year of publication
(AVG(?duration1) AS ?avg) # film average duration
WHERE
{
# Calculating the average duration for each single film.
# As there are films with multiple durations, these durations are
# averaged by grouping the durations by film.
# Hence, a single duration for each film is projected out from the subquery.
{
select ?film (avg(?duration) as ?duration1)
where{
?film <http://www.wikidata.org/prop/direct/P2047> ?duration .
}group by ?film
}
# Here the grouping criteria (genre and year) are calculated.
# The criteria is grouped by film, so that in case multiple
# genre/multiple year exist for a single film, all of them are
# group concated into a single value.
# Also in case of a lack of a value of year or genre for some
# specific film, a dummy value "OtherYear"/"OtherGenre" is generated.
{
select ?film (
IF
(
group_concat(distinct ?year ; separator="-- ") != "",
# In case multiple year exist for a single film, all of them are group concated into a single value.
group_concat(distinct ?year ; separator="-- "),
# In case of a lack of a value of year for some specific film, a dummy value "OtherYear" is generated.
"OtherYear"
)
as ?year1
)
(
IF
(
group_concat(distinct ?genre ; separator="-- ") != "",
# In case multiple genre exist for a single film, all of them are group concated into a single value.
group_concat(distinct ?genre ; separator="-- "),
# In case of a lack of a value of genre for some specific film, a dummy value "OtherGenre" is generated.
"OtherGenre"
)
as ?genre1
)
where
{
?film <http://www.wikidata.org/prop/direct/P31> <http://www.wikidata.org/entity/Q11424> .
optional {
?film <http://www.wikidata.org/prop/direct/P577> ?date .
BIND(year(?date) AS ?year)
}
optional {
?film <http://www.wikidata.org/prop/direct/P136> ?genre .
}
} group by ?film
}
} GROUP BY ?year1 ?genre1
The query seems to work after replacing the two IF expressions with a simple sample (which picks an arbitrary value from the group):
(sample(?year) as ?year1)
(sample(?genre) as ?genre1)
So it appears that the expense of group_concat is the main problem. I don't find that very intuitive and have no explanation.
Maybe the version with sample is good enough, or at least it may give you a baseline point for further improvements.
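To make the difference between the two grouping keys concrete, here is a small Python sketch of the same two-level aggregation. All film data is invented, the helper names are hypothetical, and "first value" stands in for sample, which is free to pick any value from the group.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-film data; an empty list marks a missing value.
durations = {"f1": [90, 100], "f2": [120]}
genres = {"f1": ["drama", "comedy"], "f2": []}
years = {"f1": [1999], "f2": [2001]}

def concat_key(values, fallback):
    """Like IF(group_concat(distinct ...) != "", ..., fallback)."""
    distinct = sorted({str(v) for v in values})  # sorted only for determinism
    return "-- ".join(distinct) if distinct else fallback

def sample_key(values, fallback):
    """Like sample(...): one value from the group, unbound (None) if empty."""
    return str(values[0]) if values else None

def average_by_group(key_fn):
    groups = defaultdict(list)
    for film, film_durations in durations.items():
        per_film = mean(film_durations)  # one duration per film
        key = (key_fn(genres[film], "OtherGenre"), key_fn(years[film], "OtherYear"))
        groups[key].append(per_film)
    return {key: mean(values) for key, values in groups.items()}

with_concat = average_by_group(concat_key)
with_sample = average_by_group(sample_key)
print(with_concat)
print(with_sample)
```

Note that with sample a multi-genre film is counted under only one of its genres, and films without a genre end up with an unbound key instead of the "OtherGenre" dummy value.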

Fast publication date lookup with Wikidata Query Service

Is there a way to lookup publication dates quickly in Wikidata Query Service's SPARQL to find publications of a certain date, e.g., today?
I was hoping that something like this query would be quick:
SELECT * WHERE {
?work wdt:P577 ?datetime .
BIND("2018-09-28T00:00:00Z"^^<http://www.w3.org/2001/XMLSchema#dateTime> as ?now_datetime)
FILTER (?datetime = ?now_datetime)
}
LIMIT 10
However, it times out when using it on the SPARQL endpoint at https://query.wikidata.org
A range query does not seem to be quick either. The query below returns after almost 30 seconds:
SELECT * WHERE {
?work wdt:P577 ?datetime .
FILTER (?datetime > "2018-09-28T00:00:00Z"^^xsd:dateTime)
}
LIMIT 1
The trick is to avoid a full scan and use indexes:
VALUES:
SELECT * WHERE {
VALUES (?datetime) {("2018-09-28T00:00:00Z"^^xsd:dateTime)}
?work wdt:P577 ?datetime .
} LIMIT 10
hint:rangeSafe:
SELECT * WHERE {
VALUES (?datetime) {("2018-09-28T00:00:00Z"^^xsd:dateTime)}
?work wdt:P577 ?date_time .
hint:Prior hint:rangeSafe true .
FILTER (?date_time > ?datetime)
} LIMIT 10
[The rangeSafe hint] declare[s] that the data touched by the query for a specific triple pattern is strongly typed, thus allowing a range filter to be pushed down onto an index.
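As a rough analogy for what pushing a range filter onto an index buys you, compare a full scan with a binary search over sorted data. This Python sketch says nothing about how the query service is implemented internally; the dates are made up, and all values are assumed to share one format (the "strongly typed" condition), which is what makes the sorted lookup valid.

```python
from bisect import bisect_right

# Hypothetical publication dates; ISO 8601 strings in one format sort chronologically.
dates = sorted(["2018-09-27", "2018-09-28", "2018-09-29", "2018-09-30"])

def full_scan(threshold):
    """Check every value against the filter, like an unindexed FILTER."""
    return [d for d in dates if d > threshold]

def index_range(threshold):
    """Jump to the first qualifying position in the sorted data, then read on."""
    return dates[bisect_right(dates, threshold):]

scanned = full_scan("2018-09-28")
ranged = index_range("2018-09-28")
print(scanned)
print(ranged)
```

Both functions return the same rows, but the second only touches the part of the data that can match.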

How to form SPARQL queries that refers to multiple resources

My question is a follow-up to my first question about SPARQL here.
My SPARQL query results for Mountain objects are here.
From those results I picked a certain object resource.
Now I want to get values of "is dbpedia-owl:highestPlace of" records for this chosen Mountain object.
That is, names of mountain ranges for which this mountain is highest place of.
This is, as I figure, complex. Not only because I do not know the required syntax, but also I get two objects here.
One of them is Mont Blanc massif, which is of type "place".
Another one is Western Alps which is of type "mountain range" - my desired record.
I need record # 2 above but not 1. I know 1 is also relevant but sometimes it doesn't follow same pattern. Sometimes the records appear to be of YAGO type, which can be totally misleading. To be safe, I simply want to discard those records whenever there is type mismatch.
How can I form my SPARQL query to get these "is dbpedia-owl:highestPlace of" records and also have the type filtering?
You can use this query. Note, however, that Mont_Blanc_massif in your example is both a dbpedia-owl:Place and a dbpedia-owl:MountainRange.
select * where {
?place dbpedia-owl:highestPlace :Mont_Blanc.
?place rdf:type dbpedia-owl:MountainRange.
}
edit after comment: filter
It is not really clear what you want to filter (yago?), technically you can filter for example like this:
select * where {
?place dbpedia-owl:highestPlace :Mont_Blanc.
?place rdf:type dbpedia-owl:MountainRange.
FILTER NOT EXISTS {
?place ?pred ?obj
FILTER (regex(str(?obj), "yago"))
}
}
This filters out results that have any object with 'yago' in its URI.
Extending the result from the previous answer, the appropriate query would be
select * where {
?mountain a dbpedia-owl:Mountain ;
dbpedia-owl:abstract ?abstract ;
foaf:depiction ?depiction .
?range a dbpedia-owl:MountainRange ;
dbpedia-owl:highestPlace ?mountain .
FILTER(langMatches(lang(?abstract),"EN"))
}
LIMIT 10
This selects mountains with English abstracts that have at least one depiction (or else the pattern wouldn't match) and for which there is some mountain range of which the mountain is the highest place. Without the parts from the earlier question, if you just want to retrieve mountains that are the highest place of a range, you can use a query like this:
select * where {
?mountain a dbpedia-owl:Mountain .
?range a dbpedia-owl:MountainRange ;
dbpedia-owl:highestPlace ?mountain .
}
LIMIT 10