For every sosa:FeatureOfInterest (room), get the lowest temperature per day from the associated sensors for that room. There are 100 rooms. Each room has 3 sensors. The timeframe is one year.
Goal: a query that selects the lowest temperature per day per room across its group of sensors, plus the time of day when that temperature occurred.
Example data (N3):
prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix sosa: <http://www.w3.org/ns/sosa/>
prefix xsd: <http://www.w3.org/2001/XMLSchema#>
prefix ex: <http://www.example.com/>
# Room FOIs
ex:room1Foi a sosa:FeatureOfInterest .
# ...
ex:room100Foi a sosa:FeatureOfInterest .
# Room 1 sensor observations 1/1/2021
ex:obs1Room1 a sosa:Observation ;
sosa:hasFeatureOfInterest ex:room1Foi ;
sosa:resultTime "2021-01-01T00:00:00"^^xsd:dateTime ;
sosa:observedProperty "TEMP"^^xsd:string ;
sosa:hasSimpleResult "51.4"^^xsd:decimal .
ex:obs2Room1 a sosa:Observation ;
sosa:hasFeatureOfInterest ex:room1Foi ;
sosa:resultTime "2021-01-01T08:00:00"^^xsd:dateTime ;
sosa:observedProperty "TEMP"^^xsd:string ;
sosa:hasSimpleResult "50.2"^^xsd:decimal .
ex:obs3Room1 a sosa:Observation ;
sosa:hasFeatureOfInterest ex:room1Foi ;
sosa:resultTime "2021-01-01T16:00:00"^^xsd:dateTime ;
sosa:observedProperty "TEMP"^^xsd:string ;
sosa:hasSimpleResult "49.8"^^xsd:decimal .
# Room 1 sensor observations 1/2/2021
ex:obs4Room1 a sosa:Observation ;
sosa:hasFeatureOfInterest ex:room1Foi ;
sosa:resultTime "2021-01-02T00:00:00"^^xsd:dateTime ;
sosa:observedProperty "TEMP"^^xsd:string ;
sosa:hasSimpleResult "61.4"^^xsd:decimal .
ex:obs5Room1 a sosa:Observation ;
sosa:hasFeatureOfInterest ex:room1Foi ;
sosa:resultTime "2021-01-02T08:00:00"^^xsd:dateTime ;
sosa:observedProperty "TEMP"^^xsd:string ;
sosa:hasSimpleResult "60.2"^^xsd:decimal .
ex:obs6Room1 a sosa:Observation ;
sosa:hasFeatureOfInterest ex:room1Foi ;
sosa:resultTime "2021-01-02T16:00:00"^^xsd:dateTime ;
sosa:observedProperty "TEMP"^^xsd:string ;
sosa:hasSimpleResult "59.8"^^xsd:decimal .
# ...
# Room 100 sensor observations 1/1/2021
ex:obs1Room100 a sosa:Observation ;
sosa:hasFeatureOfInterest ex:room100Foi ;
sosa:resultTime "2021-01-01T00:00:00"^^xsd:dateTime ;
sosa:observedProperty "TEMP"^^xsd:string ;
sosa:hasSimpleResult "50.7"^^xsd:decimal .
ex:obs2Room100 a sosa:Observation ;
sosa:hasFeatureOfInterest ex:room100Foi ;
sosa:resultTime "2021-01-01T08:00:00"^^xsd:dateTime ;
sosa:observedProperty "TEMP"^^xsd:string ;
sosa:hasSimpleResult "51.6"^^xsd:decimal .
ex:obs3Room100 a sosa:Observation ;
sosa:hasFeatureOfInterest ex:room100Foi ;
sosa:resultTime "2021-01-01T16:00:00"^^xsd:dateTime ;
sosa:observedProperty "TEMP"^^xsd:string ;
sosa:hasSimpleResult "48.0"^^xsd:decimal .
# Room 100 sensor observations 1/2/2021
# ...
One attempt:
prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix sosa: <http://www.w3.org/ns/sosa/>
prefix xsd: <http://www.w3.org/2001/XMLSchema#>
prefix ex: <http://www.example.com/>
select ?oFoi ?day (min(?val) as ?minTemp) ?time where {
{select ?f where {
?f a sosa:FeatureOfInterest .
}}
?o a sosa:Observation ;
sosa:hasFeatureOfInterest ?oFoi ;
sosa:resultTime ?time ;
sosa:observedProperty ?p ;
sosa:hasSimpleResult ?val .
filter(?oFoi = ?f) .
bind(day(?time) as ?day) .
} group by ?oFoi ?day ?time
order by desc(?oFoi) asc(?day)
Result:
-------------------------------------------------------------------------
| oFoi                            | day | minTemp | time                |
=========================================================================
| http://www.example.com/room1Foi | 1   | 51.4    | 2021-01-01 0:00:00  |
| http://www.example.com/room1Foi | 1   | 50.2    | 2021-01-01 8:00:00  |
| http://www.example.com/room1Foi | 1   | 49.8    | 2021-01-01 16:00:00 |
| http://www.example.com/room1Foi | 2   | 59.8    | 2021-01-02 16:00:00 |
| http://www.example.com/room1Foi | 2   | 60.2    | 2021-01-02 8:00:00  |
| http://www.example.com/room1Foi | 2   | 61.4    | 2021-01-02 0:00:00  |
| ...                             | ... | ...     | ...                 |
-------------------------------------------------------------------------
This doesn't work because ?time, being projected, must be included in the group by clause, which makes every observation its own group. Removing ?time from the projection and the group by clause returns the correct rows, but ?time needs to be in the result. The standard workaround is to compute the per-day minimum in a subquery and then join back to the observations to recover the matching time, which is what the update below does.
Ideal result:
-------------------------------------------------------------------------
| oFoi                            | day | minTemp | time                |
=========================================================================
| http://www.example.com/room1Foi | 1   | 49.8    | 2021-01-01 16:00:00 |
| http://www.example.com/room1Foi | 2   | 59.8    | 2021-01-02 16:00:00 |
| ...                             | ... | ...     | ...                 |
-------------------------------------------------------------------------
Update:
This gets closer, but it still returns multiple rows when two times in the same day have the same minimum temperature (both results are included):
select ?o2 ?oFoi2 ?day2 ?val2 (sample(?t2) as ?tx2) ?p2 where {
?o2 a sosa:Observation ;
sosa:hasFeatureOfInterest ?oFoi2 ;
sosa:resultTime ?t2 ;
sosa:observedProperty ?p2 ;
sosa:hasSimpleResult ?val2 .
bind(day(?t2) as ?day2) .
filter(?oFoi2 = ?oFoi) .
filter(?day2 = ?day) .
filter(?val2 = ?vx) .
{select ?oFoi ?day (min(?val) as ?vx) where {
{select ?f where {
?f a sosa:FeatureOfInterest .
}}
?o a sosa:Observation ;
sosa:hasFeatureOfInterest ?oFoi ;
sosa:resultTime ?t ;
sosa:observedProperty ?p ;
sosa:hasSimpleResult ?val .
filter(?oFoi = ?f) .
bind(day(?t) as ?day) .
} group by ?oFoi ?day
order by desc(?oFoi) asc(?day)
}
} group by ?o2 ?oFoi2 ?day2 ?p2 ?val2
Result:
------------------------------------------------------------------------------------------------------------------------------------------------------------------
| o2                                  | oFoi2                              | day2 | val2 | tx2                 | p2                                               |
==================================================================================================================================================================
| http://www.example3.com/obs3Room100 | http://www.example3.com/room100Foi | 1    | 48   | 2021-01-01 16:00:00 | "TEMP"^^http://www.w3.org/2001/XMLSchema#string  |
| http://www.example3.com/obs6Room1   | http://www.example3.com/room1Foi   | 2    | 59.8 | 2021-01-02 16:00:00 | "TEMP"^^http://www.w3.org/2001/XMLSchema#string  |
| http://www.example3.com/obs333Room1 | http://www.example3.com/room1Foi   | 1    | -9.8 | 2021-01-01 16:00:00 | "aTEMP"^^http://www.w3.org/2001/XMLSchema#string |
| http://www.example3.com/obs33Room1  | http://www.example3.com/room1Foi   | 1    | -9.8 | 2021-01-01 7:59:00  | "aTEMP"^^http://www.w3.org/2001/XMLSchema#string |
| ...                                 | ...                                | ...  | ...  | ...                 | ...                                              |
------------------------------------------------------------------------------------------------------------------------------------------------------------------
Oops: ?o2 is unnecessary, and removing it from the above query yields the correct solution.
Adding an outer select appears to solve this. Any aggregate over the time variable, such as min(), max(), avg(), or sample(), is a valid way to collapse it to a single value. (An extra property aTEMP was added to the data to help in understanding.)
Query:
select ?oFoi2 ?day2 ?val2 (sample(?t2) as ?tx2) ?p2 where {
?o2 a sosa:Observation ;
sosa:hasFeatureOfInterest ?oFoi2 ;
sosa:resultTime ?t2 ;
sosa:observedProperty ?p2 ;
sosa:hasSimpleResult ?val2 .
bind(day(?t2) as ?day2) .
filter(?oFoi2 = ?oFoi) .
filter(?day2 = ?day) .
filter(?val2 = ?vx) .
{select ?oFoi ?day (min(?val) as ?vx) where {
{select ?f where {
?f a sosa:FeatureOfInterest .
}}
?o a sosa:Observation ;
sosa:hasFeatureOfInterest ?oFoi ;
sosa:resultTime ?t ;
sosa:observedProperty ?p ;
sosa:hasSimpleResult ?val .
filter(?oFoi = ?f) .
bind(day(?t) as ?day) .
} group by ?oFoi ?day
order by desc(?oFoi) asc(?day)
}
} group by ?oFoi2 ?day2 ?p2 ?val2
Result:
--------------------------------------------------------------------------------------------------------------------------
| oFoi2                              | day2 | val2 | tx2                 | p2                                               |
============================================================================================================================
| http://www.example3.com/room100Foi | 1    | 48   | 2021-01-01 16:00:00 | "TEMP"^^http://www.w3.org/2001/XMLSchema#string  |
| http://www.example3.com/room1Foi   | 2    | 59.8 | 2021-01-02 16:00:00 | "TEMP"^^http://www.w3.org/2001/XMLSchema#string  |
| http://www.example3.com/room1Foi   | 1    | -9.8 | 2021-01-01 7:59:00  | "aTEMP"^^http://www.w3.org/2001/XMLSchema#string |
| ...                                | ...  | ...  | ...                 | ...                                              |
--------------------------------------------------------------------------------------------------------------------------
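One more wrinkle for the one-year timeframe: day() returns only the day of the month, so grouping on it folds 2021-01-01 together with 2021-02-01. Grouping on the full date avoids that. Below is a minimal sketch of the final query with that change, with the redundant ?f subquery dropped (?oFoi already ranges over the features of interest), and with min(?t2) in place of sample(?t2) so that ties resolve deterministically to the earliest time; substr over the string form of the timestamp is assumed here to be an acceptable per-day key:
select ?oFoi2 ?date2 ?val2 (min(?t2) as ?tx2) ?p2 where {
?o2 a sosa:Observation ;
sosa:hasFeatureOfInterest ?oFoi2 ;
sosa:resultTime ?t2 ;
sosa:observedProperty ?p2 ;
sosa:hasSimpleResult ?val2 .
# "2021-01-01T08:00:00" -> "2021-01-01"
bind(substr(str(?t2), 1, 10) as ?date2) .
filter(?oFoi2 = ?oFoi) .
filter(?date2 = ?date) .
filter(?val2 = ?vx) .
{select ?oFoi ?date (min(?val) as ?vx) where {
?o a sosa:Observation ;
sosa:hasFeatureOfInterest ?oFoi ;
sosa:resultTime ?t ;
sosa:hasSimpleResult ?val .
bind(substr(str(?t), 1, 10) as ?date) .
} group by ?oFoi ?date
}
} group by ?oFoi2 ?date2 ?p2 ?val2
order by ?oFoi2 ?date2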
In a data column in BigQuery, I have a JSON object with the structure:
{
"sections": [
{
"secName": "Flintstones",
"fields": [
{ "fldName": "Fred", "age": 55 },
{ "fldName": "Barney", "age": 44 }
]
},
{
"secName": "Jetsons",
"fields": [
{ "fldName": "George", "age": 33 },
{ "fldName": "Elroy", "age": 22 }
]
}
]}
I'm hoping to use unnest() and json_extract() to get results that resemble:
id | section_num | section_name | field_num | field_name | field_age
----+--------------+--------------+-----------+------------+-----------
1 | 1 | Flintstones | 1 | Fred | 55
1 | 1 | Flintstones | 2 | Barney | 44
1 | 2 | Jetsons | 1 | George | 33
1 | 2 | Jetsons | 2 | Elroy | 22
So far, I have the query:
SELECT id,
json_extract_scalar(curSection, '$.secName') as section_name,
json_extract_scalar(curField, '$.fldName') as field_name,
json_extract_scalar(curField, '$.age') as field_age
FROM `tick8s.test2` AS tbl
LEFT JOIN unnest(json_extract_array(tbl.data, '$.sections')) as curSection
LEFT JOIN unnest(json_extract_array(curSection, '$.fields')) as curField
that yields:
id | section_name | field_name | field_age
----+--------------+------------+-----------
1 | Flintstones | Fred | 55
1 | Flintstones | Barney | 44
1 | Jetsons | George | 33
1 | Jetsons | Elroy | 22
QUESTION: How, if possible, can I get the section_num and field_num ordinal positions from their array index values?
(If you are looking to duplicate my results, I have a table named test2 with 2 columns:
id - INTEGER, REQUIRED
data - STRING, NULLABLE
and I insert the data with:
insert into tick8s.test2 values (1,
'{"sections": [' ||
'{' ||
'"secName": "Flintstones",' ||
'"fields": [' ||
'{ "fldName": "Fred", "age": 55 },' ||
'{ "fldName": "Barney", "age": 44 }' ||
']' ||
'},' ||
'{' ||
'"secName": "Jetsons",' ||
'"fields": [' ||
'{ "fldName": "George", "age": 33 },' ||
'{ "fldName": "Elroy", "age": 22 }' ||
']' ||
'}]}'
);
)
Do you just want WITH OFFSET?
SELECT id,
json_extract_scalar(curSection, '$.secName') as section_name,
n_s,
json_extract_scalar(curField, '$.fldName') as field_name,
json_extract_scalar(curField, '$.age') as field_age,
n_c
FROM `tick8s.test2` tbl LEFT JOIN
unnest(json_extract_array(tbl.data, '$.sections')
) curSection WITH OFFSET n_s LEFT JOIN
unnest(json_extract_array(curSection, '$.fields')
) curField WITH OFFSET n_c;
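One detail: WITH OFFSET is zero-based, while the section_num and field_num in your desired output start at 1, so add one to each offset. A minimal variation of the query above:
SELECT id,
       n_s + 1 AS section_num,
       json_extract_scalar(curSection, '$.secName') AS section_name,
       n_c + 1 AS field_num,
       json_extract_scalar(curField, '$.fldName') AS field_name,
       json_extract_scalar(curField, '$.age') AS field_age
FROM `tick8s.test2` tbl
LEFT JOIN unnest(json_extract_array(tbl.data, '$.sections')) curSection WITH OFFSET n_s
LEFT JOIN unnest(json_extract_array(curSection, '$.fields')) curField WITH OFFSET n_c
ORDER BY id, section_num, field_num;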
I'm working with pyspark and want to transform this spark data frame:
+----+-------------------------------+-------------------------------+-------------------------------+-------------------------------+-------------------------------+-------------------------------+-------------------------------+-------------------------------+
| TS | ABC[0].VAL.VAL[0].UNT[0].sth1 | ABC[0].VAL.VAL[0].UNT[1].sth1 | ABC[0].VAL.VAL[1].UNT[0].sth1 | ABC[0].VAL.VAL[1].UNT[1].sth1 | ABC[0].VAL.VAL[0].UNT[0].sth2 | ABC[0].VAL.VAL[0].UNT[1].sth2 | ABC[0].VAL.VAL[1].UNT[0].sth2 | ABC[0].VAL.VAL[1].UNT[1].sth2 |
+----+-------------------------------+-------------------------------+-------------------------------+-------------------------------+-------------------------------+-------------------------------+-------------------------------+-------------------------------+
| 1 | some_value | some_value | some_value | some_value | some_value | some_value | some_value | some_value |
+----+-------------------------------+-------------------------------+-------------------------------+-------------------------------+-------------------------------+-------------------------------+-------------------------------+-------------------------------+
to that:
+----+-----+-----+------------+------------+
| TS | VAL | UNT | sth1 | sth2 |
+----+-----+-----+------------+------------+
| 1 | 0 | 0 | some_value | some_value |
| 1 | 0 | 1 | some_value | some_value |
| 1 | 1 | 0 | some_value | some_value |
| 1 | 1 | 1 | some_value | some_value |
+----+-----+-----+------------+------------+
Any idea how I can do that using some fancy transformation?
Edit:
So this is how I could solve it:
from pyspark.sql.functions import array, col, explode, struct, lit
import re
df = sc.parallelize([(1, 0.0, 0.6, 0.1, 0.4, 0.7, 0.2, 0.4, 0.1), (2, 0.6, 0.7, 0.1, 0.5, 0.8, 0.3, 0.1, 0.3)]).toDF(["TS", "ABC[0].VAL.VAL[0].UNT[0].sth1", "ABC[0].VAL.VAL[0].UNT[1].sth1", "ABC[0].VAL.VAL[1].UNT[0].sth1", "ABC[0].VAL.VAL[1].UNT[1].sth1", "ABC[0].VAL.VAL[0].UNT[0].sth2", "ABC[0].VAL.VAL[0].UNT[1].sth2", "ABC[0].VAL.VAL[1].UNT[0].sth2", "ABC[0].VAL.VAL[1].UNT[1].sth2"])
newcols = list(map(lambda x: x.replace(".", "_"), df.columns))
df = df.toDF(*newcols)
cols, dtypes = zip(*((c, t) for (c, t) in df.dtypes if c not in ["TS"]))
kvs = explode(array([struct(
lit( re.search(re.compile(r"VAL\[(\d{1,2})\]"),c).group(1) ).alias("VAL"),
lit( re.search(re.compile(r"UNT\[(\d{1,2})\]"),c).group(1) ).alias("UNT"),
lit( re.search(re.compile(r"([^_]+$)"),c).group(1) ).alias("Parameter"),
col(c).alias("data")) for c in cols
])).alias("kvs")
display(df.select(["TS"] + [kvs]).select(["TS"] + ["kvs.VAL", "kvs.UNT", "kvs.Parameter", "kvs.data"]).groupBy("TS","VAL","UNT").pivot("Parameter").sum("data").orderBy("TS","VAL","UNT"))
Output:
+----+-----+-----+------+------+
| TS | VAL | UNT | sth1 | sth2 |
+----+-----+-----+------+------+
| 1 | 0 | 0 | 0 | 0.7 |
| 1 | 0 | 1 | 0.6 | 0.2 |
| 1 | 1 | 0 | 0.1 | 0.4 |
| 1 | 1 | 1 | 0.4 | 0.1 |
| 2 | 0 | 0 | 0.6 | 0.8 |
| 2 | 0 | 1 | 0.7 | 0.3 |
| 2 | 1 | 0 | 0.1 | 0.1 |
| 2 | 1 | 1 | 0.5 | 0.3 |
+----+-----+-----+------+------+
How can it be done better?
Your approach is good (upvoted). The only thing I would really do is extract the essential parts from the column names in one regex search. I’d also remove a superfluous select in favor of groupBy, but that’s not as important.
import re
from pyspark.sql.functions import lit, explode, array, struct, col
df = sc.parallelize([(1, 0.0, 0.6, 0.1, 0.4, 0.7, 0.2, 0.4, 0.1), (2, 0.6, 0.7, 0.1, 0.5, 0.8, 0.3, 0.1, 0.3)]).toDF(
["TS", "ABC[0].VAL.VAL[0].UNT[0].sth1", "ABC[0].VAL.VAL[0].UNT[1].sth1", "ABC[0].VAL.VAL[1].UNT[0].sth1",
"ABC[0].VAL.VAL[1].UNT[1].sth1", "ABC[0].VAL.VAL[0].UNT[0].sth2", "ABC[0].VAL.VAL[0].UNT[1].sth2",
"ABC[0].VAL.VAL[1].UNT[0].sth2", "ABC[0].VAL.VAL[1].UNT[1].sth2"])
newcols = list(map(lambda x: x.replace(".", "_"), df.columns))
df = df.toDF(*newcols)
def extract_indices_and_label(column_name):
s = re.match(r"\D+\d+\D+(\d+)\D+(\d+)[^_]_(.*)$", column_name)
m, n, label = s.groups()
return int(m), int(n), label
def create_struct(column_name):
val, unt, label = extract_indices_and_label(column_name)
return struct(lit(val).alias("val"),
lit(unt).alias("unt"),
lit(label).alias("label"),
col(column_name).alias("value"))
df2 = (df.select(
df.TS,
explode(array([create_struct(c) for c in df.columns[1:]]))))
df2.printSchema() # this is instructional: it shows the structure is nearly there
# root
# |-- TS: long (nullable = true)
# |-- col: struct (nullable = false)
# | |-- val: integer (nullable = false)
# | |-- unt: integer (nullable = false)
# | |-- label: string (nullable = false)
# | |-- value: double (nullable = true)
df3 = (df2
.groupBy(df2.TS, df2.col.val.alias("VAL"), df2.col.unt.alias("UNT"))
.pivot("col.label", values=("sth1", "sth2"))
.sum("col.value"))
df3.orderBy("TS", "VAL", "UNT").show()
# +---+---+---+----+----+
# | TS|VAL|UNT|sth1|sth2|
# +---+---+---+----+----+
# | 1| 0| 0| 0.0| 0.7|
# | 1| 0| 1| 0.6| 0.2|
# | 1| 1| 0| 0.1| 0.4|
# | 1| 1| 1| 0.4| 0.1|
# | 2| 0| 0| 0.6| 0.8|
# | 2| 0| 1| 0.7| 0.3|
# | 2| 1| 0| 0.1| 0.1|
# | 2| 1| 1| 0.5| 0.3|
# +---+---+---+----+----+
If you know a priori that you will have only the two columns sth1 and sth2 to pivot, you can pass them to pivot's values parameter, as done above; that spares Spark an extra pass over the data to discover the distinct labels.
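If the labels are not known up front, dropping the values argument also works; Spark then runs that extra job itself to collect the distinct labels before pivoting. A quick sketch with the same df2 as above:
df3 = (df2
       .groupBy(df2.TS, df2.col.val.alias("VAL"), df2.col.unt.alias("UNT"))
       .pivot("col.label")  # Spark computes the distinct labels itself
       .sum("col.value"))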
This is what an example of the table looks like:
+---------------------+------------------+------------------+
| country_code | region | num_launches |
+---------------------+------------------+------------------+
| 'CA' | 'Ontario' | 5 |
+---------------------+------------------+------------------+
| 'CA' | 'Quebec' | 9 |
+---------------------+------------------+------------------+
| 'DE' | 'Bavaria' | 15 |
+---------------------+------------------+------------------+
| 'DE' | 'Saarland' | 12 |
+---------------------+------------------+------------------+
| 'DE' | 'Berlin' | 23 |
+---------------------+------------------+------------------+
| 'JP' | 'Tokyo' | 19 |
+---------------------+------------------+------------------+
I am able to write a query that returns each country_code with all regions nested within, but I am unable to get exactly what I am looking for.
My intended return looks like:
[
{ 'CA': [
{ 'Ontario': 5 },
{ 'Quebec': 9 }
]
},
{ 'DE': [
{ 'Bavaria': 15 },
{ 'Saarland': 12 },
{ 'Berlin': 23 }
]
},
{ 'JP': [
{ 'Tokyo': 19 }
]
}
]
How could this be calculated if the num_launches was not available?
+---------------------+------------------+
| country_code | region |
+---------------------+------------------+
| 'CA' | 'Ontario' |
+---------------------+------------------+
| 'CA' | 'Ontario' |
+---------------------+------------------+
| 'CA' | 'Ontario' |
+---------------------+------------------+
| 'CA' | 'Quebec' |
+---------------------+------------------+
| 'CA' | 'Quebec' |
+---------------------+------------------+
| 'DE' | 'Bavaria' |
+---------------------+------------------+
| 'DE' | 'Bavaria' |
+---------------------+------------------+
| 'DE' | 'Bavaria' |
+---------------------+------------------+
| 'DE' | 'Bavaria' |
+---------------------+------------------+
| 'DE' | 'Saarland' |
+---------------------+------------------+
| 'DE' | 'Berlin' |
+---------------------+------------------+
| 'DE' | 'Berlin' |
+---------------------+------------------+
| 'JP' | 'Tokyo' |
+---------------------+------------------+
Expected Return
[
{ 'CA': [
{ 'Ontario': 3 },
{ 'Quebec': 2 }
]
},
{ 'DE': [
{ 'Bavaria': 4 },
{ 'Saarland': 1 },
{ 'Berlin': 2 }
]
},
{ 'JP': [
{ 'Tokyo': 1 }
]
}
]
Thanks
You can use the json_agg and json_build_object functions in a subquery to build each country's array of regions, then aggregate again in the main query.
Schema (PostgreSQL v9.6)
CREATE TABLE T(
country_code varchar(50),
region varchar(50),
num_launches int
);
insert into t values ('CA','Ontario',5);
insert into t values ('CA','Quebec',9);
insert into t values ('DE','Bavaria',15);
insert into t values ('DE','Saarland',12);
insert into t values ('DE','Berlin',23);
insert into t values ('JP','Tokyo',19);
Query #1
select json_agg(json_build_object(country_code,arr)) results
from (
SELECT country_code,
json_agg(json_build_object(region,num_launches)) arr
FROM T
group by country_code
) t1;
results
[{"CA":[{"Ontario":5},{"Quebec":9}]},{"DE":[{"Bavaria":15},{"Saarland":12},{"Berlin":23}]},{"JP":[{"Tokyo":19}]}]
View on DB Fiddle
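For the second table, where num_launches is not available, count the rows per (country_code, region) first and then aggregate the counts the same way. A minimal sketch, assuming the same table minus the num_launches column:
select json_agg(json_build_object(country_code, arr)) results
from (
  select country_code,
         json_agg(json_build_object(region, cnt)) arr
  from (
    select country_code, region, count(*) cnt
    from t
    group by country_code, region
  ) t0
  group by country_code
) t1;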
I'm trying to extract intersectionOf and unionOf in an OWL file, where intersectionOf and unionOf consist of a collection of classes, someValuesFrom and/or onProperty. I have created a SPARQL query which extracts the "collection" for the intersectionOf, but the problem is that some of the retrieved data are not related to the class.
For example, I have a class called man. This class has an equivalent class which is the intersectionOf of three classes, namely adult, person, and male. My SPARQL query returns some incorrect results: it returns that the classes adult, person, and male are equivalent to the class man (this part is correct), but it also returns them as equivalent to every other class in my OWL file, such as haulage_worker, which is incorrect. Here is my SPARQL query:
PREFIX abc: <http://owl.cs.manchester.ac.uk/2009/07/sssw/people#>
PREFIX ghi: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX mno: <http://www.w3.org/2001/XMLSchema#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX list: <http://jena.hpl.hp.com/ARQ/list#>
SELECT Distinct ?class ?equivalentClass
WHERE{ ?class a owl:Class .
FILTER( STRSTARTS(STR(?class),"http://www.w3.org/2002/07/owl#") || STRSTARTS(STR(?class),"http://owl.cs.manchester.ac.uk/2009/07/sssw/people#")
)
?x a owl:Class ; owl:intersectionOf ?list .
?list rdf:rest*/rdf:first ?equivalentClass .
} GROUP BY ?class ?equivalentClass ORDER BY ?no
and this is my OWL file:
<?xml version="1.0"?>
<rdf:RDF
xmlns="http://owl.cs.manchester.ac.uk/2009/07/sssw/people#"
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:xsd="http://www.w3.org/2001/XMLSchema#"
xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
xmlns:owl="http://www.w3.org/2002/07/owl#"
xmlns:ns0="http://owl.cs.manchester.ac.uk/2009/07/sssw/people#"
xml:base="http://owl.cs.manchester.ac.uk/2009/07/sssw/people">
<owl:Ontology rdf:about="http://owl.cs.manchester.ac.uk/2009/07/sssw/people"/>
<owl:Class rdf:about="http://www.w3.org/2002/07/owl#Thing"/>
<owl:Class rdf:about="http://owl.cs.manchester.ac.uk/2009/07/sssw/people#haulage_worker">
<rdfs:comment rdf:datatype="http://www.w3.org/2001/XMLSchema#string"
></rdfs:comment>
<owl:equivalentClass>
<owl:Restriction>
<owl:onProperty>
<owl:ObjectProperty rdf:about="http://owl.cs.manchester.ac.uk/2009/07/sssw/people#works_for"/>
</owl:onProperty>
<owl:someValuesFrom>
<owl:Class>
<owl:unionOf rdf:parseType="Collection">
<owl:Restriction>
<owl:onProperty>
<owl:ObjectProperty rdf:about="http://owl.cs.manchester.ac.uk/2009/07/sssw/people#part_of"/>
</owl:onProperty>
<owl:someValuesFrom>
<owl:Class rdf:about="http://owl.cs.manchester.ac.uk/2009/07/sssw/people#haulage_company"/>
</owl:someValuesFrom>
</owl:Restriction>
<owl:Class rdf:about="http://owl.cs.manchester.ac.uk/2009/07/sssw/people#haulage_company"/>
</owl:unionOf>
</owl:Class>
</owl:someValuesFrom>
</owl:Restriction>
</owl:equivalentClass>
<rdfs:label rdf:datatype="http://www.w3.org/2001/XMLSchema#string"
>haulage worker</rdfs:label>
</owl:Class>
<owl:Class rdf:about="http://owl.cs.manchester.ac.uk/2009/07/sssw/people#man">
<owl:equivalentClass>
<owl:Class>
<owl:intersectionOf rdf:parseType="Collection">
<owl:Class rdf:about="http://owl.cs.manchester.ac.uk/2009/07/sssw/people#adult"/>
<owl:Class rdf:about="http://owl.cs.manchester.ac.uk/2009/07/sssw/people#person"/>
<owl:Class rdf:about="http://owl.cs.manchester.ac.uk/2009/07/sssw/people#male"/>
</owl:intersectionOf>
</owl:Class>
</owl:equivalentClass>
</owl:Class>
</rdf:RDF>
This is the output which I got (they are not correct output):
-----------------------------------------
| class | equivalentClass |
=========================================
| abc:adult | abc:adult |
| abc:adult | abc:male |
| abc:adult | abc:person |
| abc:haulage_company | abc:adult |
| abc:haulage_company | abc:male |
| abc:haulage_company | abc:person |
| abc:haulage_worker | abc:adult |
| abc:haulage_worker | abc:male |
| abc:haulage_worker | abc:person |
| abc:male | abc:adult |
| abc:male | abc:male |
| abc:male | abc:person |
| abc:man | abc:adult |
| abc:man | abc:male |
| abc:man | abc:person |
| abc:person | abc:adult |
| abc:person | abc:male |
| abc:person | abc:person |
| owl:Thing | abc:adult |
| owl:Thing | abc:male |
| owl:Thing | abc:person |
-----------------------------------------
The expected output would be like this:
-----------------------------------------
| class | equivalentClass |
=========================================
| abc:adult | abc:adult |
| abc:adult | abc:male |
| abc:adult | abc:person |
| abc:haulage_company | |
| abc:haulage_company | |
| abc:haulage_company | |
| abc:haulage_worker | |
| abc:haulage_worker | |
| abc:haulage_worker | |
| abc:male | abc:adult |
| abc:male | abc:male |
| abc:male | abc:person |
| abc:man | abc:adult |
| abc:man | abc:male |
| abc:man | abc:person |
| abc:person | abc:adult |
| abc:person | abc:male |
| abc:person | abc:person |
| owl:Thing | |
| owl:Thing | |
| owl:Thing | |
-----------------------------------------
What should I change in my SPARQL query in order to make my output like the previous table?
Cleaning up your query a bit, we have:
prefix abc: <http://owl.cs.manchester.ac.uk/2009/07/sssw/people#>
prefix ghi: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
prefix mno: <http://www.w3.org/2001/XMLSchema#>
prefix owl: <http://www.w3.org/2002/07/owl#>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
prefix list: <http://jena.hpl.hp.com/ARQ/list#>
select distinct ?class ?equivalentClass where {
?class a owl:Class .
filter( strstarts(str(?class),str(owl:)) || # since "owl:" is an IRI, you can
strstarts(str(?class),str(abc:)) ) # use str(owl:) and str(abc:)
?x a owl:Class ;
owl:intersectionOf ?list .
?list rdf:rest*/rdf:first ?equivalentClass .
}
group by ?class ?equivalentClass
order by ?class # ?class, not ?no
Your problem lies in that you're selecting ?class, which can be every owl:Class in the ontology (as long as it starts with an appropriate prefix), and then selecting ?equivalentClass from the list of intersection classes of ?x, and ?x has no connection whatsoever to ?class. (You were also sorting by ?no, but I think you meant to sort by ?class.)
Figuring out the right query to write will be easier if we take a look at the data in a more human readable format, e.g., Turtle. In Turtle, the man class is:
ns0:man a owl:Class ;
owl:equivalentClass [ a owl:Class ;
owl:intersectionOf ( ns0:adult ns0:person ns0:male )
] .
You're looking for things which are owl:Classes, are related by owl:equivalentClass to something else that's an owl:Class, and which has a list value for owl:intersectionOf. This isn't too hard in SPARQL, and the query actually has the same kind of structure as this Turtle text:
prefix abc: <http://owl.cs.manchester.ac.uk/2009/07/sssw/people#>
prefix ghi: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
prefix mno: <http://www.w3.org/2001/XMLSchema#>
prefix owl: <http://www.w3.org/2002/07/owl#>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
prefix list: <http://jena.hpl.hp.com/ARQ/list#>
select distinct ?class ?otherClass where {
?class a owl:Class ;
owl:equivalentClass [ a owl:Class ;
owl:intersectionOf [ rdf:rest*/rdf:first ?otherClass ] ] .
filter( strstarts(str(?class),str(owl:)) ||
strstarts(str(?class),str(abc:)) )
}
group by ?class ?otherClass
order by ?class
I changed the variable name from equivalentClass to otherClass, because adult, male, and person aren't equivalent to man. Their intersection is. Using Jena's command line sparql tool, you'll get results like this:
$ sparql --data data.rdf --query query.rq
------------------------
| class | otherClass |
========================
| abc:man | abc:adult |
| abc:man | abc:male |
| abc:man | abc:person |
------------------------
This query only retrieves classes that are equivalent to some intersection. Your expected results showed all the classes whose IRIs started with abc: or owl:, which means that the extra structure is actually optional, so we adjust the query accordingly by wrapping the optional parts in optional { … }, and we get the kind of results we're looking for:
prefix abc: <http://owl.cs.manchester.ac.uk/2009/07/sssw/people#>
prefix ghi: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
prefix mno: <http://www.w3.org/2001/XMLSchema#>
prefix owl: <http://www.w3.org/2002/07/owl#>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
prefix list: <http://jena.hpl.hp.com/ARQ/list#>
select distinct ?class ?otherClass where {
?class a owl:Class .
optional {
?class owl:equivalentClass [ a owl:Class ;
owl:intersectionOf [ rdf:rest*/rdf:first ?otherClass ] ] .
}
filter( strstarts(str(?class),str(owl:)) ||
strstarts(str(?class),str(abc:)) )
}
group by ?class ?otherClass
order by ?class
$ sparql --data data.rdf --query query.rq
------------------------------------
| class | otherClass |
====================================
| abc:adult | |
| abc:haulage_company | |
| abc:haulage_worker | |
| abc:male | |
| abc:man | abc:adult |
| abc:man | abc:male |
| abc:man | abc:person |
| abc:person | |
| owl:Thing | |
------------------------------------