Retrieve the collection of unionOf and intersectionOf for each OWL class - SPARQL

I'm trying to extract intersectionOf and unionOf in an OWL file, where intersectionOf and unionOf consist of a collection of classes, someValuesFrom and/or onProperty. I have created a SPARQL query which extracts the "collection" for the intersectionOf, but the problem is that some of the retrieved data are not related to the class.
For example, I have a class called man. This class has an equivalent class which is the intersectionOf of three classes, namely adult, person, and male. My SPARQL query returns some incorrect results: it returns that the classes adult, person, and male are equivalent to the class man (this part is correct), but it also returns them as equivalent classes of every other class in my OWL file, such as haulage_worker, which is incorrect. Here is my SPARQL query:
PREFIX abc: <http://owl.cs.manchester.ac.uk/2009/07/sssw/people#>
PREFIX ghi: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX mno: <http://www.w3.org/2001/XMLSchema#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX list: <http://jena.hpl.hp.com/ARQ/list#>
SELECT Distinct ?class ?equivalentClass
WHERE{ ?class a owl:Class .
FILTER( STRSTARTS(STR(?class),"http://www.w3.org/2002/07/owl#") || STRSTARTS(STR(?class),"http://owl.cs.manchester.ac.uk/2009/07/sssw/people#")
)
?x a owl:Class ; owl:intersectionOf ?list .
?list rdf:rest*/rdf:first ?equivalentClass .
} GROUP BY ?class ?equivalentClass ORDER BY ?no
and this is my OWL file:
<?xml version="1.0"?>
<rdf:RDF
    xmlns="http://owl.cs.manchester.ac.uk/2009/07/sssw/people#"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:xsd="http://www.w3.org/2001/XMLSchema#"
    xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
    xmlns:owl="http://www.w3.org/2002/07/owl#"
    xmlns:ns0="http://owl.cs.manchester.ac.uk/2009/07/sssw/people#"
    xml:base="http://owl.cs.manchester.ac.uk/2009/07/sssw/people">
  <owl:Ontology rdf:about="http://owl.cs.manchester.ac.uk/2009/07/sssw/people"/>
  <owl:Class rdf:about="http://www.w3.org/2002/07/owl#Thing"/>
  <owl:Class rdf:about="http://owl.cs.manchester.ac.uk/2009/07/sssw/people#haulage_worker">
    <rdfs:comment rdf:datatype="http://www.w3.org/2001/XMLSchema#string"></rdfs:comment>
    <owl:equivalentClass>
      <owl:Restriction>
        <owl:onProperty>
          <owl:ObjectProperty rdf:about="http://owl.cs.manchester.ac.uk/2009/07/sssw/people#works_for"/>
        </owl:onProperty>
        <owl:someValuesFrom>
          <owl:Class>
            <owl:unionOf rdf:parseType="Collection">
              <owl:Restriction>
                <owl:onProperty>
                  <owl:ObjectProperty rdf:about="http://owl.cs.manchester.ac.uk/2009/07/sssw/people#part_of"/>
                </owl:onProperty>
                <owl:someValuesFrom>
                  <owl:Class rdf:about="http://owl.cs.manchester.ac.uk/2009/07/sssw/people#haulage_company"/>
                </owl:someValuesFrom>
              </owl:Restriction>
              <owl:Class rdf:about="http://owl.cs.manchester.ac.uk/2009/07/sssw/people#haulage_company"/>
            </owl:unionOf>
          </owl:Class>
        </owl:someValuesFrom>
      </owl:Restriction>
    </owl:equivalentClass>
    <rdfs:label rdf:datatype="http://www.w3.org/2001/XMLSchema#string">haulage worker</rdfs:label>
  </owl:Class>
  <owl:Class rdf:about="http://owl.cs.manchester.ac.uk/2009/07/sssw/people#man">
    <owl:equivalentClass>
      <owl:Class>
        <owl:intersectionOf rdf:parseType="Collection">
          <owl:Class rdf:about="http://owl.cs.manchester.ac.uk/2009/07/sssw/people#adult"/>
          <owl:Class rdf:about="http://owl.cs.manchester.ac.uk/2009/07/sssw/people#person"/>
          <owl:Class rdf:about="http://owl.cs.manchester.ac.uk/2009/07/sssw/people#male"/>
        </owl:intersectionOf>
      </owl:Class>
    </owl:equivalentClass>
  </owl:Class>
</rdf:RDF>
This is the output I got (it is not correct):
-----------------------------------------
| class               | equivalentClass |
=========================================
| abc:adult           | abc:adult       |
| abc:adult           | abc:male        |
| abc:adult           | abc:person      |
| abc:haulage_company | abc:adult       |
| abc:haulage_company | abc:male        |
| abc:haulage_company | abc:person      |
| abc:haulage_worker  | abc:adult       |
| abc:haulage_worker  | abc:male        |
| abc:haulage_worker  | abc:person      |
| abc:male            | abc:adult       |
| abc:male            | abc:male        |
| abc:male            | abc:person      |
| abc:man             | abc:adult      |
| abc:man             | abc:male        |
| abc:man             | abc:person      |
| abc:person          | abc:adult       |
| abc:person          | abc:male        |
| abc:person          | abc:person      |
| owl:Thing           | abc:adult       |
| owl:Thing           | abc:male        |
| owl:Thing           | abc:person      |
-----------------------------------------
The expected output would be like this:
-----------------------------------------
| class               | equivalentClass |
=========================================
| abc:adult           | abc:adult       |
| abc:adult           | abc:male        |
| abc:adult           | abc:person      |
| abc:haulage_company |                 |
| abc:haulage_company |                 |
| abc:haulage_company |                 |
| abc:haulage_worker  |                 |
| abc:haulage_worker  |                 |
| abc:haulage_worker  |                 |
| abc:male            | abc:adult       |
| abc:male            | abc:male        |
| abc:male            | abc:person      |
| abc:man             | abc:adult       |
| abc:man             | abc:male        |
| abc:man             | abc:person      |
| abc:person          | abc:adult       |
| abc:person          | abc:male        |
| abc:person          | abc:person      |
| owl:Thing           |                 |
| owl:Thing           |                 |
| owl:Thing           |                 |
-----------------------------------------
What should I change in my SPARQL query in order to make my output like the previous table?

Cleaning up your query a bit, we have:
prefix abc: <http://owl.cs.manchester.ac.uk/2009/07/sssw/people#>
prefix ghi: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
prefix mno: <http://www.w3.org/2001/XMLSchema#>
prefix owl: <http://www.w3.org/2002/07/owl#>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
prefix list: <http://jena.hpl.hp.com/ARQ/list#>
select distinct ?class ?equivalentClass where {
  ?class a owl:Class .
  filter( strstarts(str(?class),str(owl:)) ||   # since "owl:" is an IRI, you can
          strstarts(str(?class),str(abc:)) )    # use str(owl:) and str(abc:)
  ?x a owl:Class ;
     owl:intersectionOf ?list .
  ?list rdf:rest*/rdf:first ?equivalentClass .
}
group by ?class ?equivalentClass
order by ?class   # ?class, not ?no
Your problem is that you're selecting ?class, which can be any owl:Class in the ontology (as long as its IRI starts with an appropriate prefix), and then selecting ?equivalentClass from the intersection list of ?x, and ?x has no connection whatsoever to ?class. (You were also sorting by ?no, but I think you meant to sort by ?class.)
Figuring out the right query to write will be easier if we take a look at the data in a more human readable format, e.g., Turtle. In Turtle, the man class is:
ns0:man a owl:Class ;
    owl:equivalentClass [ a owl:Class ;
                          owl:intersectionOf ( ns0:adult ns0:person ns0:male )
                        ] .
You're looking for things that are owl:Classes and are related by owl:equivalentClass to something else that is an owl:Class and has a list value for owl:intersectionOf. This isn't too hard in SPARQL, and the query actually has the same kind of structure as this Turtle text:
prefix abc: <http://owl.cs.manchester.ac.uk/2009/07/sssw/people#>
prefix ghi: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
prefix mno: <http://www.w3.org/2001/XMLSchema#>
prefix owl: <http://www.w3.org/2002/07/owl#>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
prefix list: <http://jena.hpl.hp.com/ARQ/list#>
select distinct ?class ?otherClass where {
  ?class a owl:Class ;
         owl:equivalentClass [ a owl:Class ;
                               owl:intersectionOf [ rdf:rest*/rdf:first ?otherClass ] ] .
  filter( strstarts(str(?class),str(owl:)) ||
          strstarts(str(?class),str(abc:)) )
}
group by ?class ?otherClass
order by ?class
I changed the variable name from equivalentClass to otherClass, because adult, male, and person aren't equivalent to man. Their intersection is. Using Jena's command line sparql tool, you'll get results like this:
$ sparql --data data.rdf --query query.rq
------------------------
| class   | otherClass |
========================
| abc:man | abc:adult  |
| abc:man | abc:male   |
| abc:man | abc:person |
------------------------
This query only retrieves classes that are equivalent to some intersection. Your expected results, though, showed all the classes whose IRIs start with abc: or owl:, which means that the extra structure is actually optional. We adjust the query accordingly, wrapping the optional parts in optional { … }, and we get the kind of results we're looking for:
prefix abc: <http://owl.cs.manchester.ac.uk/2009/07/sssw/people#>
prefix ghi: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
prefix mno: <http://www.w3.org/2001/XMLSchema#>
prefix owl: <http://www.w3.org/2002/07/owl#>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
prefix list: <http://jena.hpl.hp.com/ARQ/list#>
select distinct ?class ?otherClass where {
  ?class a owl:Class .
  optional { 
    ?class owl:equivalentClass [ a owl:Class ;
                                 owl:intersectionOf [ rdf:rest*/rdf:first ?otherClass ] ] .
  }
  filter( strstarts(str(?class),str(owl:)) ||
          strstarts(str(?class),str(abc:))    )
}
group by ?class ?otherClass
order by ?class
$ sparql --data data.rdf --query query.rq
------------------------------------
| class               | otherClass |
====================================
| abc:adult           |            |
| abc:haulage_company |            |
| abc:haulage_worker  |            |
| abc:male            |            |
| abc:man             | abc:adult  |
| abc:man             | abc:male   |
| abc:man             | abc:person |
| abc:person          |            |
| owl:Thing           |            |
------------------------------------
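Since the question also asks about unionOf, note that the same pattern extends to union class expressions with a SPARQL 1.1 property path alternative. Here is a minimal sketch (same prefixes as above; untested against your data):
prefix abc: <http://owl.cs.manchester.ac.uk/2009/07/sssw/people#>
prefix owl: <http://www.w3.org/2002/07/owl#>
prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>

select distinct ?class ?otherClass where {
  ?class a owl:Class .
  optional {
    # intersectionOf|unionOf matches either kind of list-valued class expression
    ?class owl:equivalentClass/(owl:intersectionOf|owl:unionOf) ?list .
    ?list rdf:rest*/rdf:first ?otherClass .
  }
  filter( strstarts(str(?class),str(owl:)) ||
          strstarts(str(?class),str(abc:)) )
}
order by ?class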

PySpark error loading data (JSON array) into a dataframe [duplicate]

I was trying to explode a column called phones, which has the following schema and content:
(customer_external_id, StringType)
(phones, StringType)
customer_id phones
x8x46x5 [{"phone" : "(xx) 35xx4x80"},{"phone" : "(xx) xxxx46605"}]
xx44xx5 [{"phone" : "(xx) xxx3x8443"}]
4xxxxx5 [{"phone" : "(xx) x6xx083x3"},{"areaCode" : "xx"},{"phone" : "(xx) 3xxx83x3"}]
xx6564x [{"phone" : "(x3) x88xx344x"}]
xx8x4x0 [{"phone" : "(xx) x83x5x8xx"}]
xx0434x [{"phone" : "(0x) 3x6x4080"},{"areaCode" : "xx"}]
x465x40 [{"phone" : "(6x) x6x445xx"}]
x0684x8 [{"phone" : "(xx) x8x88x4x4"},{"phone" : "(xx) x8x88x4x4"}]
x84x850 [{"phone" : "(xx) 55x56xx4"}]
x0604xx [{"phone" : "(xx) x8x4xxx68"}]
4x6xxx0 [{"phone" : "(xx) x588x43xx"},{"phone" : "(xx) 5x6465xx"},{"phone" : "(xx) x835xxxx8"},{"phone" : "(xx) x5x6465xx"}]
x6x000x [{"phone" : "(xx) xxx044xx4"}]
5x65533 [{"phone" : "(xx) xx686x0xx"}]
x3668xx [{"phone" : "(5x) 33x8x3x4"},{"phone" : "(5x) 8040x8x6"}]
So I tried to run this code and got the subsequent error:
df.select('customer_external_id', explode(df.phones))
AnalysisException: u"cannot resolve 'explode(`phones`)' due to data type mismatch: input to function explode should be array or map type, not StringType;;
'Project [customer_external_id#293, explode(phones#296) AS List()]\n+- Relation[order_id#292,customer_external_id#293,name#294,email#295,phones#296,phones_version#297,classification#298,locale#299] parquet\n"
From this error I found out my column was a StringType, so I ran this code to remove the brackets and convert the strings to JSON:
phones = df.select('customer_external_id', 'phones').rdd\
    .map(lambda x: str(x).replace('[','')\
        .replace(']','')\
        .replace('},{', ','))\
    .map(lambda x: json.loads(x).get('phone'))\
    .map(lambda x: Row(x))\
    .toDF(df.select('customer_external_id','phones').schema)
phones.show()
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 38.0 failed 4 times, most recent failure: Lost task 0.3 in stage 38.0 (TID 2740, 10.112.80.248, executor 8): org.apache.spark.api.python.PythonException: Traceback (most recent call last)
Apparently I can't cast to JSON and I can't explode the column. So how can I properly deal with this kind of data to get this output:
+-----------+--------+--------------+
|customer_id|areaCode|phone         |
+-----------+--------+--------------+
|x8x46x5    |null    |(xx) 35xx4x80 |
|x8x46x5    |null    |(xx) xxxx46605|
|xx44xx5    |null    |(xx) xxx3x8443|
|4xxxxx5    |null    |(xx) x6xx083x3|
|4xxxxx5    |xx      |null          |
|4xxxxx5    |null    |(xx) 3xxx83x3 |
|xx6564x    |null    |(x3) x88xx344x|
|xx8x4x0    |null    |(xx) x83x5x8xx|
|xx0434x    |null    |(0x) 3x6x4080 |
|xx0434x    |xx      |null          |
|x465x40    |null    |(6x) x6x445xx |
|x0684x8    |null    |(xx) x8x88x4x4|
|x0684x8    |null    |(xx) x8x88x4x4|
|x84x850    |null    |(xx) 55x56xx4 |
|x0604xx    |null    |(xx) x8x4xxx68|
|4x6xxx0    |null    |(xx) x588x43xx|
|4x6xxx0    |null    |(xx) 5x6465xx |
|4x6xxx0    |null    |(xx) x835xxxx8|
|4x6xxx0    |null    |(xx) x5x6465xx|
|x6x000x    |null    |(xx) xxx044xx4|
|5x65533    |null    |(xx) xx686x0xx|
|x3668xx    |null    |(5x) 33x8x3x4 |
|x3668xx    |null    |(5x) 8040x8x6 |
+-----------+--------+--------------+
What you want to do is use the from_json method to convert the string into an array and then explode:
from pyspark.sql.functions import *
from pyspark.sql.types import *

phone_schema = ArrayType(StructType([StructField("phone", StringType())]))

converted = inputDF\
    .withColumn("areaCode", get_json_object("phones", "$[*].areaCode"))\
    .withColumn("phones", explode(from_json("phones", phone_schema)))\
    .withColumn("phone", col("phones.phone"))\
    .drop("phones")\
    .filter(~isnull("phone"))
converted.show()
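As a variant, you could also put areaCode into the element schema itself, which avoids the separate get_json_object call. This is only a sketch under the same assumption of an inputDF with a JSON-array string column phones, not tested against your data:
from pyspark.sql.functions import col, explode, from_json
from pyspark.sql.types import ArrayType, StringType, StructField, StructType

# declare both keys; an element that lacks one simply yields null for it
full_schema = ArrayType(StructType([
    StructField("areaCode", StringType()),
    StructField("phone", StringType())
]))

variant = inputDF\
    .withColumn("entry", explode(from_json("phones", full_schema)))\
    .select("customer_external_id",
            col("entry.areaCode").alias("areaCode"),
            col("entry.phone").alias("phone"))
variant.show()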
I think you should be able to directly call json.loads() without using replace() as you have shown.
1. Map the string to ArrayType(MapType()) using json.loads().
2. Call flatMap() to create a new Row() for each element of the array.
3. Map these rows to the desired output.
Take a look at the following example:
from StringIO import StringIO
from pyspark.sql import Row
import json
import pandas as pd
# mock up some sample data
data = StringIO("""customer_id\tphones
x8x46x5\t[{"phone" : "(xx) 35xx4x80"},{"phone" : "(xx) xxxx46605"}]
xx44xx5\t[{"phone" : "(xx) xxx3x8443"}]
4xxxxx5\t[{"phone" : "(xx) x6xx083x3"},{"areaCode" : "xx"},{"phone" : "(xx) 3xxx83x3"}]
xx6564x\t[{"phone" : "(x3) x88xx344x"}]
xx8x4x0\t[{"phone" : "(xx) x83x5x8xx"}]
xx0434x\t[{"phone" : "(0x) 3x6x4080"},{"areaCode" : "xx"}]
x465x40\t[{"phone" : "(6x) x6x445xx"}]
x0684x8\t[{"phone" : "(xx) x8x88x4x4"},{"phone" : "(xx) x8x88x4x4"}]
x84x850\t[{"phone" : "(xx) 55x56xx4"}]
x0604xx\t[{"phone" : "(xx) x8x4xxx68"}]
4x6xxx0\t[{"phone" : "(xx) x588x43xx"},{"phone" : "(xx) 5x6465xx"},{"phone" : "(xx) x835xxxx8"},{"phone" : "(xx) x5x6465xx"}]
x6x000x\t[{"phone" : "(xx) xxx044xx4"}]
5x65533\t[{"phone" : "(xx) xx686x0xx"}]
x3668xx\t[{"phone" : "(5x) 33x8x3x4"},{"phone" : "(5x) 8040x8x6"}]""")
pandas_df = pd.read_csv(data, sep="\t")
df = sqlCtx.createDataFrame(pandas_df) # convert pandas to spark df
# run the steps outlined above
df.rdd\
    .map(lambda x: Row(customer_id=x['customer_id'], phones=json.loads(x['phones'])))\
    .flatMap(lambda x: [Row(customer_id=x['customer_id'], phone=phone) for phone in x['phones']])\
    .map(lambda x: Row(customer_id=x['customer_id'], phone=x['phone'].get('phone'), areaCode=x['phone'].get('areaCode')))\
    .toDF()\
    .select('customer_id', 'areaCode', 'phone')\
    .show(truncate=False, n=100)
The output:
+-----------+--------+--------------+
|customer_id|areaCode|phone         |
+-----------+--------+--------------+
|x8x46x5    |null    |(xx) 35xx4x80 |
|x8x46x5    |null    |(xx) xxxx46605|
|xx44xx5    |null    |(xx) xxx3x8443|
|4xxxxx5    |null    |(xx) x6xx083x3|
|4xxxxx5    |xx      |null          |
|4xxxxx5    |null    |(xx) 3xxx83x3 |
|xx6564x    |null    |(x3) x88xx344x|
|xx8x4x0    |null    |(xx) x83x5x8xx|
|xx0434x    |null    |(0x) 3x6x4080 |
|xx0434x    |xx      |null          |
|x465x40    |null    |(6x) x6x445xx |
|x0684x8    |null    |(xx) x8x88x4x4|
|x0684x8    |null    |(xx) x8x88x4x4|
|x84x850    |null    |(xx) 55x56xx4 |
|x0604xx    |null    |(xx) x8x4xxx68|
|4x6xxx0    |null    |(xx) x588x43xx|
|4x6xxx0    |null    |(xx) 5x6465xx |
|4x6xxx0    |null    |(xx) x835xxxx8|
|4x6xxx0    |null    |(xx) x5x6465xx|
|x6x000x    |null    |(xx) xxx044xx4|
|5x65533    |null    |(xx) xx686x0xx|
|x3668xx    |null    |(5x) 33x8x3x4 |
|x3668xx    |null    |(5x) 8040x8x6 |
+-----------+--------+--------------+
I'm not sure if this is the output you were hoping for, but this should help you get there.

SPARQL: Variable must be included in group by clause

For every sosa:FeatureOfInterest (room), get the lowest temperature per day from the associated sensors for that room. There are 100 rooms. Each room has 3 sensors. The timeframe is one year.
Goal: a query that selects the lowest temperature per day per room from the group of sensors, plus the time of day when that temperature occurred.
Example data (N3):
prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix sosa: <http://www.w3.org/ns/sosa/>
prefix xsd: <http://www.w3.org/2001/XMLSchema#>
prefix ex: <http://www.example.com/>

# Room FOIs
ex:room1Foi a sosa:FeatureOfInterest .
# ...
ex:room100Foi a sosa:FeatureOfInterest .

# Room 1 sensor observations 1/1/2021
ex:obs1Room1 a sosa:Observation ;
    sosa:hasFeatureOfInterest ex:room1Foi ;
    sosa:resultTime "2021-01-01T00:00:00"^^xsd:dateTime ;
    sosa:observedProperty "TEMP"^^xsd:string ;
    sosa:hasSimpleResult "51.4"^^xsd:decimal .
ex:obs2Room1 a sosa:Observation ;
    sosa:hasFeatureOfInterest ex:room1Foi ;
    sosa:resultTime "2021-01-01T08:00:00"^^xsd:dateTime ;
    sosa:observedProperty "TEMP"^^xsd:string ;
    sosa:hasSimpleResult "50.2"^^xsd:decimal .
ex:obs3Room1 a sosa:Observation ;
    sosa:hasFeatureOfInterest ex:room1Foi ;
    sosa:resultTime "2021-01-01T16:00:00"^^xsd:dateTime ;
    sosa:observedProperty "TEMP"^^xsd:string ;
    sosa:hasSimpleResult "49.8"^^xsd:decimal .

# Room 1 sensor observations 1/2/2021
ex:obs4Room1 a sosa:Observation ;
    sosa:hasFeatureOfInterest ex:room1Foi ;
    sosa:resultTime "2021-01-02T00:00:00"^^xsd:dateTime ;
    sosa:observedProperty "TEMP"^^xsd:string ;
    sosa:hasSimpleResult "61.4"^^xsd:decimal .
ex:obs5Room1 a sosa:Observation ;
    sosa:hasFeatureOfInterest ex:room1Foi ;
    sosa:resultTime "2021-01-02T08:00:00"^^xsd:dateTime ;
    sosa:observedProperty "TEMP"^^xsd:string ;
    sosa:hasSimpleResult "60.2"^^xsd:decimal .
ex:obs6Room1 a sosa:Observation ;
    sosa:hasFeatureOfInterest ex:room1Foi ;
    sosa:resultTime "2021-01-02T16:00:00"^^xsd:dateTime ;
    sosa:observedProperty "TEMP"^^xsd:string ;
    sosa:hasSimpleResult "59.8"^^xsd:decimal .

# ...
# Room 100 sensor observations 1/1/2021
ex:obs1Room100 a sosa:Observation ;
    sosa:hasFeatureOfInterest ex:room100Foi ;
    sosa:resultTime "2021-01-01T00:00:00"^^xsd:dateTime ;
    sosa:observedProperty "TEMP"^^xsd:string ;
    sosa:hasSimpleResult "50.7"^^xsd:decimal .
ex:obs2Room100 a sosa:Observation ;
    sosa:hasFeatureOfInterest ex:room100Foi ;
    sosa:resultTime "2021-01-01T08:00:00"^^xsd:dateTime ;
    sosa:observedProperty "TEMP"^^xsd:string ;
    sosa:hasSimpleResult "51.6"^^xsd:decimal .
ex:obs3Room100 a sosa:Observation ;
    sosa:hasFeatureOfInterest ex:room100Foi ;
    sosa:resultTime "2021-01-01T16:00:00"^^xsd:dateTime ;
    sosa:observedProperty "TEMP"^^xsd:string ;
    sosa:hasSimpleResult "48.0"^^xsd:decimal .

# Room 100 sensor observations 1/2/2021
# ...
One attempt:
prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix sosa: <http://www.w3.org/ns/sosa/>
prefix xsd: <http://www.w3.org/2001/XMLSchema#>
prefix ex: <http://www.example.com/>

select ?oFoi ?day (min(?val) as ?minTemp) ?time where {
  { select ?f where {
      ?f a sosa:FeatureOfInterest .
  } }
  ?o a sosa:Observation ;
     sosa:hasFeatureOfInterest ?oFoi ;
     sosa:resultTime ?time ;
     sosa:observedProperty ?p ;
     sosa:hasSimpleResult ?val .
  filter(?oFoi = ?f) .
  bind(day(?time) as ?day) .
}
group by ?oFoi ?day ?time
order by desc(?oFoi) asc(?day)
Result:
| oFoi                            | day | minTemp | time                |
|---------------------------------|-----|---------|---------------------|
| http://www.example.com/room1Foi | 1   | 51.4    | 2021-01-01 0:00:00  |
| http://www.example.com/room1Foi | 1   | 50.2    | 2021-01-01 8:00:00  |
| http://www.example.com/room1Foi | 1   | 49.8    | 2021-01-01 16:00:00 |
| http://www.example.com/room1Foi | 2   | 59.8    | 2021-01-02 16:00:00 |
| http://www.example.com/room1Foi | 2   | 60.2    | 2021-01-02 8:00:00  |
| http://www.example.com/room1Foi | 2   | 61.4    | 2021-01-02 0:00:00  |
| ...                             | ... | ...     | ...                 |
This doesn't work because ?time must be included in the group by clause. Removing ?time from the group by clause returns the correct rows, but then ?time can no longer appear in the results, and it needs to be included.
Ideal result:
| oFoi                            | day | minTemp | time                |
|---------------------------------|-----|---------|---------------------|
| http://www.example.com/room1Foi | 1   | 49.8    | 2021-01-01 16:00:00 |
| http://www.example.com/room1Foi | 2   | 59.8    | 2021-01-02 16:00:00 |
| ...                             | ... | ...     | ...                 |
Update:
This gets closer, but it still includes multiple results when two times on the same day share the same minimum temperature (both rows are returned):
select ?o2 ?oFoi2 ?day2 ?val2 (sample(?t2) as ?tx2) ?p2 where {
  ?o2 a sosa:Observation ;
      sosa:hasFeatureOfInterest ?oFoi2 ;
      sosa:resultTime ?t2 ;
      sosa:observedProperty ?p2 ;
      sosa:hasSimpleResult ?val2 .
  bind(day(?t2) as ?day2) .
  filter(?oFoi2 = ?oFoi) .
  filter(?day2 = ?day) .
  filter(?val2 = ?vx) .
  { select ?oFoi ?day (min(?val) as ?vx) where {
      { select ?f where {
          ?f a sosa:FeatureOfInterest .
      } }
      ?o a sosa:Observation ;
         sosa:hasFeatureOfInterest ?oFoi ;
         sosa:resultTime ?t ;
         sosa:observedProperty ?p ;
         sosa:hasSimpleResult ?val .
      filter(?oFoi = ?f) .
      bind(day(?t) as ?day) .
    }
    group by ?oFoi ?day
    order by desc(?oFoi) asc(?day)
  }
} group by ?o2 ?oFoi2 ?day2 ?p2 ?val2
Result:
| o2                                  | oFoi2                              | day2 | val2 | tx2                 | p2                                               |
|-------------------------------------|------------------------------------|------|------|---------------------|--------------------------------------------------|
| http://www.example3.com/obs3Room100 | http://www.example3.com/room100Foi | 1    | 48   | 2021-01-01 16:00:00 | "TEMP"^^http://www.w3.org/2001/XMLSchema#string  |
| http://www.example3.com/obs6Room1   | http://www.example3.com/room1Foi   | 2    | 59.8 | 2021-01-02 16:00:00 | "TEMP"^^http://www.w3.org/2001/XMLSchema#string  |
| http://www.example3.com/obs333Room1 | http://www.example3.com/room1Foi   | 1    | -9.8 | 2021-01-01 16:00:00 | "aTEMP"^^http://www.w3.org/2001/XMLSchema#string |
| http://www.example3.com/obs33Room1  | http://www.example3.com/room1Foi   | 1    | -9.8 | 2021-01-01 7:59:00  | "aTEMP"^^http://www.w3.org/2001/XMLSchema#string |
| ...                                 | ...                                | ...  | ...  | ...                 | ...                                              |
Oops: ?o2 is unnecessary, and removing it from the above query yields the correct solution.
Adding an outer select appears to solve this. Any aggregate over the time variable, such as min(), max(), avg(), or sample(), is a valid approach. (An extra property aTEMP is added to the data to help in understanding.)
Query:
select ?oFoi2 ?day2 ?val2 (sample(?t2) as ?tx2) ?p2 where {
  ?o2 a sosa:Observation ;
      sosa:hasFeatureOfInterest ?oFoi2 ;
      sosa:resultTime ?t2 ;
      sosa:observedProperty ?p2 ;
      sosa:hasSimpleResult ?val2 .
  bind(day(?t2) as ?day2) .
  filter(?oFoi2 = ?oFoi) .
  filter(?day2 = ?day) .
  filter(?val2 = ?vx) .
  { select ?oFoi ?day (min(?val) as ?vx) where {
      { select ?f where {
          ?f a sosa:FeatureOfInterest .
      } }
      ?o a sosa:Observation ;
         sosa:hasFeatureOfInterest ?oFoi ;
         sosa:resultTime ?t ;
         sosa:observedProperty ?p ;
         sosa:hasSimpleResult ?val .
      filter(?oFoi = ?f) .
      bind(day(?t) as ?day) .
    }
    group by ?oFoi ?day
    order by desc(?oFoi) asc(?day)
  }
} group by ?oFoi2 ?day2 ?p2 ?val2
Result:
| oFoi2                              | day2 | val2 | tx2                 | p2                                               |
|------------------------------------|------|------|---------------------|--------------------------------------------------|
| http://www.example3.com/room100Foi | 1    | 48   | 2021-01-01 16:00:00 | "TEMP"^^http://www.w3.org/2001/XMLSchema#string  |
| http://www.example3.com/room1Foi   | 2    | 59.8 | 2021-01-02 16:00:00 | "TEMP"^^http://www.w3.org/2001/XMLSchema#string  |
| http://www.example3.com/room1Foi   | 1    | -9.8 | 2021-01-01 7:59:00  | "aTEMP"^^http://www.w3.org/2001/XMLSchema#string |
| ...                                | ...  | ...  | ...                 | ...                                              |
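For reference, the same idea can be written a bit more compactly by grouping on an expression in the subquery and joining back on shared variable names. This is only a sketch, assuming the prefixes above and a store with standard SPARQL 1.1 semantics; it has not been run against the full dataset:
select ?oFoi ?p ?day ?vx (sample(?t2) as ?minTime)
where {
  # per room, per property, per day: the minimum value
  {
    select ?oFoi ?p ?day (min(?val) as ?vx)
    where {
      ?o a sosa:Observation ;
         sosa:hasFeatureOfInterest ?oFoi ;
         sosa:resultTime ?t ;
         sosa:observedProperty ?p ;
         sosa:hasSimpleResult ?val .
    }
    group by ?oFoi ?p (day(?t) as ?day)
  }
  # join back to recover the time(s) at which that minimum occurred
  ?o2 a sosa:Observation ;
      sosa:hasFeatureOfInterest ?oFoi ;
      sosa:resultTime ?t2 ;
      sosa:observedProperty ?p ;
      sosa:hasSimpleResult ?vx .
  filter(day(?t2) = ?day)
}
group by ?oFoi ?p ?day ?vx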

How to create nested JSON return with aggregate function and dynamic key values using `jsonb_build_object`

This is what an example of the table looks like.
+---------------------+------------------+------------------+
| country_code        | region           | num_launches     |
+---------------------+------------------+------------------+
| 'CA'                | 'Ontario'        | 5                |
+---------------------+------------------+------------------+
| 'CA'                | 'Quebec'         | 9                |
+---------------------+------------------+------------------+
| 'DE'                | 'Bavaria'        | 15               |
+---------------------+------------------+------------------+
| 'DE'                | 'Saarland'       | 12               |
+---------------------+------------------+------------------+
| 'DE'                | 'Berlin'         | 23               |
+---------------------+------------------+------------------+
| 'JP'                | 'Tokyo'          | 19               |
+---------------------+------------------+------------------+
I am able to write a query that returns each country_code with all regions nested within, but I am unable to get exactly what I am looking for.
My intended return looks like this:
[
  { 'CA': [
      { 'Ontario': 5 },
      { 'Quebec': 9 }
    ]
  },
  { 'DE': [
      { 'Bavaria': 15 },
      { 'Saarland': 12 },
      { 'Berlin': 23 }
    ]
  },
  { 'JP': [
      { 'Tokyo': 19 }
    ]
  }
]
How could this be calculated if the num_launches was not available?
+---------------------+------------------+
| country_code        | region           |
+---------------------+------------------+
| 'CA'                | 'Ontario'        |
+---------------------+------------------+
| 'CA'                | 'Ontario'        |
+---------------------+------------------+
| 'CA'                | 'Ontario'        |
+---------------------+------------------+
| 'CA'                | 'Quebec'         |
+---------------------+------------------+
| 'CA'                | 'Quebec'         |
+---------------------+------------------+
| 'DE'                | 'Bavaria'        |
+---------------------+------------------+
| 'DE'                | 'Bavaria'        |
+---------------------+------------------+
| 'DE'                | 'Bavaria'        |
+---------------------+------------------+
| 'DE'                | 'Bavaria'        |
+---------------------+------------------+
| 'DE'                | 'Saarland'       |
+---------------------+------------------+
| 'DE'                | 'Berlin'         |
+---------------------+------------------+
| 'DE'                | 'Berlin'         |
+---------------------+------------------+
| 'JP'                | 'Tokyo'          |
+---------------------+------------------+
Expected Return
[
  { 'CA': [
      { 'Ontario': 3 },
      { 'Quebec': 2 }
    ]
  },
  { 'DE': [
      { 'Bavaria': 4 },
      { 'Saarland': 1 },
      { 'Berlin': 2 }
    ]
  },
  { 'JP': [
      { 'Tokyo': 1 }
    ]
  }
]
Thanks
You can try to use the json_agg and json_build_object functions in a subquery to get each country's array, then do it again in the main query.
Schema (PostgreSQL v9.6)
CREATE TABLE T(
  country_code varchar(50),
  region varchar(50),
  num_launches int
);

insert into t values ('CA','Ontario',5);
insert into t values ('CA','Quebec',9);
insert into t values ('DE','Bavaria',15);
insert into t values ('DE','Saarland',12);
insert into t values ('DE','Berlin',23);
insert into t values ('JP','Tokyo',19);
Query #1
select json_agg(json_build_object(country_code,arr)) results
from (
  SELECT country_code,
         json_agg(json_build_object(region,num_launches)) arr
  FROM T
  group by country_code
) t1;
results
[{"CA":[{"Ontario":5},{"Quebec":9}]},{"DE":[{"Bavaria":15},{"Saarland":12},{"Berlin":23}]},{"JP":[{"Tokyo":19}]}]
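For the second part of the question, where num_launches isn't stored, you can derive the counts with count(*) in one more subquery and feed them into the same JSON shape. A sketch, assuming a table t2 with only country_code and region as in the second example:
select json_agg(json_build_object(country_code, arr)) results
from (
  select country_code,
         json_agg(json_build_object(region, cnt)) arr
  from (
    select country_code, region, count(*) cnt
    from t2
    group by country_code, region
  ) counts
  group by country_code
) t1;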

Printing out a SQL table in an R Sweave PDF

This seems like it should be very simple but I can't seem to find the answer anywhere I look.
This seems just as likely to be easier to solve with clever SQL queries as with R code.
The table is being pulled into the script with this code:
dbhandle <- SQLConn_remote(DBName = "DATABASE", ServerName = "SERVER")
Testdf <- sqlQuery(dbhandle, 'select * from TABLENAME
                              order by FileName, Number, Category', stringsAsFactors = FALSE)
I want to print out a SQL table in an R Sweave PDF. I'd like to do it with the following conditions:
Printing only specific columns. This seems simple enough using sqlQuery, but I've already created a variable in my script called Testdf that contains the whole table, so I'd rather just subset that if I can. The reason I'm not satisfied to simply do this is that the next condition seems beyond me in queries.
Here's the tricky part. In the sample table I gave below, there is a list of file names that are organized by Version numbers and group Numbers. I'd like to print the table in the .Rnw file so that there are 3 columns: the 1st column is the FileName column, the 2nd column holds all Values where Number == 1, and the 3rd column holds all Values where Number == 2.
Here's what the table looks like:
| Name | Version | Category | Value | Date | Number | Build | Error |
|:-----:|:-------:|:--------:|:-----:|:------:|:------:|:---------:|:-----:|
| File1 | 0.01 | Time | 123 | 1-1-12 | 1 | Iteration | None |
| File1 | 0.01 | Size | 456 | 1-1-12 | 1 | Iteration | None |
| File1 | 0.01 | Final | 789 | 1-1-12 | 1 | Iteration | None |
| File2 | 0.01 | Time | 312 | 1-1-12 | 1 | Iteration | None |
| File2 | 0.01 | Size | 645 | 1-1-12 | 1 | Iteration | None |
| File2 | 0.01 | Final | 978 | 1-1-12 | 1 | Iteration | None |
| File3 | 0.01 | Time | 741 | 1-1-12 | 1 | Iteration | None |
| File3 | 0.01 | Size | 852 | 1-1-12 | 1 | Iteration | None |
| File3 | 0.01 | Final | 963 | 1-1-12 | 1 | Iteration | None |
| File1 | 0.02 | Time | 369 | 1-1-12 | 2 | Iteration | None |
| File1 | 0.02 | Size | 258 | 1-1-12 | 2 | Iteration | None |
| File1 | 0.02 | Final | 147 | 1-1-12 | 2 | Iteration | None |
| File2 | 0.02 | Time | 753 | 1-1-12 | 2 | Iteration | None |
| File2 | 0.02 | Size | 498 | 1-1-12 | 2 | Iteration | None |
| File2 | 0.02 | Final | 951 | 1-1-12 | 2 | Iteration | None |
| File3 | 0.02 | Time | 753 | 1-1-12 | 2 | Iteration | None |
| File3 | 0.02 | Size | 915 | 1-1-12 | 2 | Iteration | None |
| File3 | 0.02 | Final | 438 | 1-1-12 | 2 | Iteration | None |
Here's what I'd like it to look like:
| Name | 0.01 | 0.02 |
|:-----:|:----:|:----:|
| File1 | 123 | 369 |
| File1 | 456 | 258 |
| File1 | 789 | 147 |
| File2 | 312 | 753 |
| File2 | 645 | 498 |
| File2 | 978 | 951 |
| File3 | 741 | 753 |
| File3 | 852 | 915 |
| File3 | 963 | 438 |
The middle and right column titles are derived from the original Version column. The values in the middle column are all of the entries in the Value column that correspond to both 0.01 in the Version column and 1 in the Number column. The values in the right column are all of the entries in the Value column that correspond to both 0.02 in the Version column and 2 in the Number column.
Here's a sample data frame for reference, in case you'd like to reproduce this using R:
rw1 <- c("File1", "File1", "File1", "File2", "File2", "File2", "File3", "File3", "File3", "File1", "File1", "File1", "File2", "File2", "File2", "File3", "File3", "File3", "File1", "File1", "File1", "File2", "File2", "File2", "File3", "File3", "File3")
rw2 <- c("0.01", "0.01", "0.01", "0.01", "0.01", "0.01", "0.01", "0.01", "0.01", "0.02", "0.02", "0.02", "0.02", "0.02", "0.02", "0.02", "0.02", "0.02", "0.03", "0.03", "0.03", "0.03", "0.03", "0.03", "0.03", "0.03", "0.03")
rw3 <- c("Time", "Size", "Final", "Time", "Size", "Final", "Time", "Size", "Final", "Time", "Size", "Final", "Time", "Size", "Final", "Time", "Size", "Final", "Time", "Size", "Final", "Time", "Size", "Final", "Time", "Size", "Final")
rw4 <- c(123, 456, 789, 312, 645, 978, 741, 852, 963, 369, 258, 147, 753, 498, 951, 753, 915, 438, 978, 741, 852, 963, 369, 258, 147, 753, 498)
rw5 <- c("01/01/12", "01/01/12", "01/01/12", "01/01/12", "01/01/12", "01/01/12", "01/01/12", "01/01/12", "01/01/12", "01/01/12", "01/01/12", "01/01/12", "01/01/12", "01/01/12", "01/01/12", "01/01/12", "01/01/12", "01/01/12", "01/01/12", "01/01/12", "01/01/12", "01/01/12", "01/01/12", "01/01/12", "01/01/12", "01/01/12", "01/01/12")
rw6 <- c(1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3)
rw7 <- c("Iteration", "Iteration", "Iteration", "Iteration", "Iteration", "Iteration", "Iteration", "Iteration", "Iteration", "Iteration", "Iteration", "Iteration", "Iteration", "Iteration", "Iteration", "Iteration", "Iteration", "Iteration", "Release", "Release", "Release", "Release", "Release", "Release", "Release", "Release", "Release")
rw8 <- c("None", "None", "None", "None", "None", "None", "None", "None", "None", "None", "None", "None", "None", "None", "None", "None", "None", "None", "Cannot Connect to Database", "None", "None", "None", "None", "None", "None", "None", "None")
Testdf = data.frame(rw1, rw2, rw3, rw4, rw5, rw6, rw7, rw8)
colnames(Testdf) <- c("FileName", "Version", "Category", "Value", "Date", "Number", "Build", "Error")
Here's a solution using dplyr and tidyr. The relevant variables are selected. An index column is then added to allow for the data to be spread without issues around duplicate indices. The data is then reshaped with spread, and finally the Index column removed.
library("dplyr")
library("tidyr")
Testdf %>%
select(FileName, Version, Value) %>%
group_by(FileName, Version) %>%
mutate(Index = 1:n()) %>%
spread(Version, Value) %>%
select(-Index)
If it can always be assumed that for each FileName there will be 9 Values, one for each combination of Version and Category, then this would work:
Testdf %>%
  select(FileName, Category, Version, Value) %>%
  spread(Version, Value) %>%
  select(-Category)
If you wanted to use data.table, you could do:
library(data.table)
setDT(Testdf)[, split(Value, Version), by = FileName]
If you want LaTeX output, then you could further pipe the output to xtable::xtable.
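For example, a minimal sketch of that last step (assuming the dplyr/tidyr result from above is stored in wide):
library(xtable)

wide <- Testdf %>%
  select(FileName, Version, Value) %>%
  group_by(FileName, Version) %>%
  mutate(Index = 1:n()) %>%
  spread(Version, Value) %>%
  select(-Index)

# emit LaTeX for the PDF; place this in a results=tex chunk in the .Rnw file
print(xtable(wide), include.rownames = FALSE)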

SPARQL select does not work as expected

If I run this query
SELECT ?a ?b ?c ?g ?maxtrust
WHERE
{
  {
    SELECT ?a (MAX(?t) AS ?maxtrust)
    WHERE {
      GRAPH ?g { ?a a gr:ProductOrServiceModel }
      GRAPH <uri:trust> { ?g exepo:trust ?t }
    }
    GROUP BY ?a
  }
  GRAPH ?g { ?a ?b ?c }
  GRAPH <uri:trust> { ?g exepo:trust ?maxtrust }
}
I get this response:
| a         | b                | c                        | g         | maxtrust |
|-----------|------------------|--------------------------|-----------|----------|
| uri:prodA | rdf:type         | gr:ProductOrServiceModel | uri:alice | 1.0      |
| uri:prodA | exe:EarCoupling  | Intraaural               | uri:alice | 1.0      |
| uri:prodA | exe:WearingStyle | In-ear                   | uri:alice | 1.0      |
| uri:prodB | rdf:type         | gr:ProductOrServiceModel | uri:bob   | 0.5      |
| uri:prodB | exe:EarCoupling  | Extraauricolare          | uri:bob   | 0.5      |
Since I'm only interested in the relation between a and g, I guessed that this query would do the trick:
SELECT ?a ?g
WHERE
{
  {
    SELECT ?a (MAX(?t) AS ?maxtrust)
    WHERE {
      GRAPH ?g { ?a a gr:ProductOrServiceModel }
      GRAPH <uri:trust> { ?g exepo:trust ?t }
    }
    GROUP BY ?a
  }
  GRAPH ?g { ?a ?b ?c }
  GRAPH <uri:trust> { ?g exepo:trust ?maxtrust }
}
I would have expected this result:
| a         | g         |
|-----------|-----------|
| uri:prodA | uri:alice |
| uri:prodA | uri:alice |
| uri:prodA | uri:alice |
| uri:prodB | uri:bob   |
| uri:prodB | uri:bob   |
Instead I got this:
| a         | g         |
|-----------|-----------|
| uri:prodA | uri:alice |
| uri:prodA | uri:alice |
| uri:prodA | uri:alice |
What is going on? Is my understanding of the SPARQL logic completely wrong?
Edit: more info.
The datasets are:
alice (GRAPH uri:alice):
uri:prodA
    a gr:ProductOrServiceModel ;
    exe:EarCoupling "Intraaural"^^xsd:string ;
    exe:WearingStyle "In-ear"^^xsd:string .
bob (GRAPH uri:bob):
uri:prodA
    a gr:ProductOrServiceModel ;
    exe:EarCoupling "Intraauricolare"^^xsd:string .
uri:prodB
    exe:WearingStyle "extraauricolare"^^xsd:string .
trust (GRAPH uri:trust):
uri:alice exe:trust "1.0"^^xsd:float .
uri:bob exe:trust "0.5"^^xsd:float .
I'm using Stardog as the triple store.
If you're only interested in the relation between ?a and ?g, your query could be much simpler:
select distinct ?a ?g where {
  graph ?g { ?a ?p ?o }
}
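If the max-trust restriction still matters, here is a sketch that keeps your subquery but projects only the pair (same prefixes and data assumed as in your query, untested):
select distinct ?a ?g where {
  {
    select ?a (max(?t) as ?maxtrust) where {
      graph ?g { ?a a gr:ProductOrServiceModel }
      graph <uri:trust> { ?g exepo:trust ?t }
    }
    group by ?a
  }
  graph ?g { ?a a gr:ProductOrServiceModel }
  graph <uri:trust> { ?g exepo:trust ?maxtrust }
}
The distinct collapses the duplicate rows you would otherwise get, one per matching triple in each graph.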