Printing out a SQL table in an R Sweave PDF - sql

This seems like it should be very simple, but I can't seem to find the answer anywhere.
It has just as much chance of being easier to solve with a clever SQL query as with R code.
The table is being pulled into the script with this code:
dbhandle <- SQLConn_remote(DBName = "DATABASE", ServerName = "SERVER")
Testdf<-sqlQuery(dbhandle, 'select * from TABLENAME
order by FileName, Number, Category', stringsAsFactors = FALSE)
I want to print out a SQL table in an R Sweave PDF. I'd like to do it with the following conditions:

- Printing only specific columns. This seems simple enough using sqlQuery, but I've already created a variable in my script called Testdf that contains the whole table, so I'd rather just subset that if I can. The reason I'm not satisfied to simply do this is that the next condition seems beyond me in queries.
- Here's the tricky part. In the sample table below, there is a list of file names organized by Version numbers and group Numbers. I'd like to print the table in the .Rnw file so that there are 3 columns: the 1st column is the FileName column, the 2nd column holds the Values where Version == 0.01 (Number == 1), and the 3rd (final) column holds the Values where Version == 0.02 (Number == 2).
Here's what the table looks like:
| Name | Version | Category | Value | Date | Number | Build | Error |
|:-----:|:-------:|:--------:|:-----:|:------:|:------:|:---------:|:-----:|
| File1 | 0.01 | Time | 123 | 1-1-12 | 1 | Iteration | None |
| File1 | 0.01 | Size | 456 | 1-1-12 | 1 | Iteration | None |
| File1 | 0.01 | Final | 789 | 1-1-12 | 1 | Iteration | None |
| File2 | 0.01 | Time | 312 | 1-1-12 | 1 | Iteration | None |
| File2 | 0.01 | Size | 645 | 1-1-12 | 1 | Iteration | None |
| File2 | 0.01 | Final | 978 | 1-1-12 | 1 | Iteration | None |
| File3 | 0.01 | Time | 741 | 1-1-12 | 1 | Iteration | None |
| File3 | 0.01 | Size | 852 | 1-1-12 | 1 | Iteration | None |
| File3 | 0.01 | Final | 963 | 1-1-12 | 1 | Iteration | None |
| File1 | 0.02 | Time | 369 | 1-1-12 | 2 | Iteration | None |
| File1 | 0.02 | Size | 258 | 1-1-12 | 2 | Iteration | None |
| File1 | 0.02 | Final | 147 | 1-1-12 | 2 | Iteration | None |
| File2 | 0.02 | Time | 753 | 1-1-12 | 2 | Iteration | None |
| File2 | 0.02 | Size | 498 | 1-1-12 | 2 | Iteration | None |
| File2 | 0.02 | Final | 951 | 1-1-12 | 2 | Iteration | None |
| File3 | 0.02 | Time | 753 | 1-1-12 | 2 | Iteration | None |
| File3 | 0.02 | Size | 915 | 1-1-12 | 2 | Iteration | None |
| File3 | 0.02 | Final | 438 | 1-1-12 | 2 | Iteration | None |
Here's what I'd like it to look like:
| Name | 0.01 | 0.02 |
|:-----:|:----:|:----:|
| File1 | 123 | 369 |
| File1 | 456 | 258 |
| File1 | 789 | 147 |
| File2 | 312 | 753 |
| File2 | 645 | 498 |
| File2 | 978 | 951 |
| File3 | 741 | 753 |
| File3 | 852 | 915 |
| File3 | 963 | 438 |
The middle and right column titles are derived from the original Version column. The values in the middle column are all of the entries in the Value column that correspond to both 0.01 in the Version column and 1 in the Number column. The values in the right column are all of the entries in the Value column that correspond to both 0.02 in the Version column and 2 in the Number column.
Here's some sample data for reference, if you'd like to reproduce this using R:
rw1 <- c("File1", "File1", "File1", "File2", "File2", "File2", "File3", "File3", "File3", "File1", "File1", "File1", "File2", "File2", "File2", "File3", "File3", "File3", "File1", "File1", "File1", "File2", "File2", "File2", "File3", "File3", "File3")
rw2 <- c("0.01", "0.01", "0.01", "0.01", "0.01", "0.01", "0.01", "0.01", "0.01", "0.02", "0.02", "0.02", "0.02", "0.02", "0.02", "0.02", "0.02", "0.02", "0.03", "0.03", "0.03", "0.03", "0.03", "0.03", "0.03", "0.03", "0.03")
rw3 <- c("Time", "Size", "Final", "Time", "Size", "Final", "Time", "Size", "Final", "Time", "Size", "Final", "Time", "Size", "Final", "Time", "Size", "Final", "Time", "Size", "Final", "Time", "Size", "Final", "Time", "Size", "Final")
rw4 <- c(123, 456, 789, 312, 645, 978, 741, 852, 963, 369, 258, 147, 753, 498, 951, 753, 915, 438, 978, 741, 852, 963, 369, 258, 147, 753, 498)
rw5 <- c("01/01/12", "01/01/12", "01/01/12", "01/01/12", "01/01/12", "01/01/12", "01/01/12", "01/01/12", "01/01/12", "01/01/12", "01/01/12", "01/01/12", "01/01/12", "01/01/12", "01/01/12", "01/01/12", "01/01/12", "01/01/12", "01/01/12", "01/01/12", "01/01/12", "01/01/12", "01/01/12", "01/01/12", "01/01/12", "01/01/12", "01/01/12")
rw6 <- c(1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3)
rw7 <- c("Iteration", "Iteration", "Iteration", "Iteration", "Iteration", "Iteration", "Iteration", "Iteration", "Iteration", "Iteration", "Iteration", "Iteration", "Iteration", "Iteration", "Iteration", "Iteration", "Iteration", "Iteration", "Release", "Release", "Release", "Release", "Release", "Release", "Release", "Release", "Release")
rw8 <- c("None", "None", "None", "None", "None", "None", "None", "None", "None", "None", "None", "None", "None", "None", "None", "None", "None", "None", "Cannot Connect to Database", "None", "None", "None", "None", "None", "None", "None", "None")
Testdf = data.frame(rw1, rw2, rw3, rw4, rw5, rw6, rw7, rw8)
colnames(Testdf) <- c("FileName", "Version", "Category", "Value", "Date", "Number", "Build", "Error")

Here's a solution using dplyr and tidyr. The relevant variables are selected. An index column is then added to allow for the data to be spread without issues around duplicate indices. The data is then reshaped with spread, and finally the Index column removed.
library("dplyr")
library("tidyr")
Testdf %>%
  select(FileName, Version, Value) %>%
  group_by(FileName, Version) %>%
  mutate(Index = 1:n()) %>%
  spread(Version, Value) %>%
  select(-Index)
If it can always be assumed that for each FileName there will be 9 Values, one for each combination of Version and Category, then this would work:
Testdf %>%
  select(FileName, Category, Version, Value) %>%
  spread(Version, Value) %>%
  select(-Category)
If you wanted to use data.table, you could do:
setDT(Testdf)[, split(Value, Version), by = FileName]
If you want LaTeX output, then you could further pipe the output to xtable::xtable.
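If you would rather push the reshaping into the query itself, as the question speculates, conditional aggregation produces the same three-column shape; the result could then be pulled in with sqlQuery and passed to xtable as above. This is only a sketch: it assumes one Value per (FileName, Category, Version) combination (as stated above) and SQL Server-style bracketed aliases; adjust the alias quoting and the '0.01'/'0.02' literals if Version is stored numerically.
-- Sketch only: one row per (FileName, Category), with the 0.01 and 0.02 Values side by side.
select FileName,
       max(case when Version = '0.01' then Value end) as [0.01],
       max(case when Version = '0.02' then Value end) as [0.02]
from TABLENAME
group by FileName, Category
order by FileName;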

Related

Swap values between columns based on third column

I have a table like this:
src_id | src_source | dst_id | dst_source | metadata
-------+------------+--------+------------+-------------------
123    | A          | 345    | B          | some_string
234    | B          | 567    | A          | some_other_string
498    | A          | 432    | A          | another_one   # this row should be ignored
765    | B          | 890    | B          | another_two   # this row should be ignored
What I would like is:
A_id | B_id | metadata
-----+------+-------------------
123  | 345  | some_string
567  | 234  | some_other_string
Here's the data to replicate:
data = [
("123", "A", "345", "B", "some_string"),
("234", "B", "567", "A", "some_other_string"),
("498", "A", "432", "A", "another_one"),
("765", "B", "890", "B", "another_two"),
]
cols = ["src_id", "src_source", "dst_id", "dst_source", "metadata"]
df = spark.createDataFrame(data).toDF(*cols)
I am a bit confused as to how to do this - I got to here:
output = (
    df
    .filter(F.col("src_source") != F.col("dst_source"))
    .withColumn("A_id",
                F.when(F.col("src_source") == "A", F.col("src_id")))
    .withColumn("B_id",
                F.when(F.col("src_source") == "B", F.col("src_id")))
)
I think I figured it out - I need to split the df and union it again!
ab_df = (
    df
    .filter(F.col("src_source") != F.col("dst_source"))
    .filter((F.col("src_source") == "A") & (F.col("dst_source") == "B"))
    .select(F.col("src_id").alias("A_id"),
            F.col("dst_id").alias("B_id"),
            "metadata")
)
ba_df = (
    df
    .filter(F.col("src_source") != F.col("dst_source"))
    .filter((F.col("src_source") == "B") & (F.col("dst_source") == "A"))
    .select(F.col("src_id").alias("B_id"),
            F.col("dst_id").alias("A_id"),
            "metadata")
)
all = ab_df.unionByName(ba_df)
You can do it without a union, in a single select, without needing to write the same filter twice.
from pyspark.sql import functions as F

output = (
    df
    .filter(F.col("src_source") != F.col("dst_source"))
    .select(
        F.when(F.col("src_source") == "A", F.col("src_id")).otherwise(F.col("dst_id")).alias("A_id"),
        F.when(F.col("src_source") == "A", F.col("dst_id")).otherwise(F.col("src_id")).alias("B_id"),
        "metadata"
    )
)
output.show()
# +----+----+-----------------+
# |A_id|B_id| metadata|
# +----+----+-----------------+
# | 123| 345| some_string|
# | 567| 234|some_other_string|
# +----+----+-----------------+
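For reference, the same single-pass logic can be written as a Spark SQL CASE expression, assuming the DataFrame has been registered as a temporary view (the view name pairs below is made up, e.g. via df.createOrReplaceTempView("pairs")):
-- Equivalent of the when/otherwise select above, run against a hypothetical temp view "pairs"
SELECT CASE WHEN src_source = 'A' THEN src_id ELSE dst_id END AS A_id,
       CASE WHEN src_source = 'A' THEN dst_id ELSE src_id END AS B_id,
       metadata
FROM pairs
WHERE src_source != dst_source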

Scala Unpivot Table

SCALA
I have a table with this struct:
| FName   | SName   | Email              | Jan 2021 | Feb 2021 | Mar 2021 | Total 2021 |
|:-------:|:-------:|:------------------:|:--------:|:--------:|:--------:|:----------:|
| Micheal | Scott   | scarrel#gmail.com  | 4000     | 5000     | 3400     | 50660      |
| Dwight  | Schrute | dschrute#gmail.com | 1200     | 6900     | 1000     | 35000      |
| Kevin   | Malone  | kmalone#gmail.com  | 9000     | 6000     | 18000    | 32000      |
And I want to transform it into a long (unpivoted) format, with one row per month column (see the output tables in the answers below).
I tried the 'stack' method but couldn't get it to work.
Thanks
You can flatten the monthly/total columns via explode as shown below:
val df = Seq(
  ("Micheal", "Scott", "scarrel#gmail.com", 4000, 5000, 3400, 50660),
  ("Dwight", "Schrute", "dschrute#gmail.com", 1200, 6900, 1000, 35000),
  ("Kevin", "Malone", "kmalone#gmail.com", 9000, 6000, 18000, 32000)
).toDF("FName", "SName", "Email", "Jan 2021", "Feb 2021", "Mar 2021", "Total 2021")
val moYrCols = Array("Jan 2021", "Feb 2021", "Mar 2021", "Total 2021") // (**)
val otherCols = df.columns diff moYrCols
val structCols = moYrCols.map{ c =>
  val moYr = split(lit(c), "\\s+")
  struct(moYr(1).as("Year"), moYr(0).as("Month"), col(c).as("Value"))
}
df.
  withColumn("flattened", explode(array(structCols: _*))).
  select(otherCols.map(col) :+ $"flattened.*": _*).
  show
/*
+-------+-------+------------------+----+-----+-----+
| FName| SName| Email|Year|Month|Value|
+-------+-------+------------------+----+-----+-----+
|Micheal| Scott| scarrel#gmail.com|2021| Jan| 4000|
|Micheal| Scott| scarrel#gmail.com|2021| Feb| 5000|
|Micheal| Scott| scarrel#gmail.com|2021| Mar| 3400|
|Micheal| Scott| scarrel#gmail.com|2021|Total|50660|
| Dwight|Schrute|dschrute#gmail.com|2021| Jan| 1200|
| Dwight|Schrute|dschrute#gmail.com|2021| Feb| 6900|
| Dwight|Schrute|dschrute#gmail.com|2021| Mar| 1000|
| Dwight|Schrute|dschrute#gmail.com|2021|Total|35000|
| Kevin| Malone| kmalone#gmail.com|2021| Jan| 9000|
| Kevin| Malone| kmalone#gmail.com|2021| Feb| 6000|
| Kevin| Malone| kmalone#gmail.com|2021| Mar|18000|
| Kevin| Malone| kmalone#gmail.com|2021|Total|32000|
+-------+-------+------------------+----+-----+-----+
*/
(**) Use pattern matching in case there are many columns; for example:
val moYrCols = df.columns.filter(_.matches("[A-Za-z]+\\s+\\d{4}"))
val data = Seq(
  ("Micheal", "Scott", "scarrel#gmail.com", 4000, 5000, 3400, 50660),
  ("Dwight", "Schrute", "dschrute#gmail.com", 1200, 6900, 1000, 35000),
  ("Kevin", "Malone", "kmalone#gmail.com", 9000, 6000, 18000, 32000))
val columns = Seq("FName", "SName", "Email", "Jan 2021", "Feb 2021", "Mar 2021", "Total 2021")
val newColumns = Array("FName", "SName", "Email", "Total 2021") // columns that will not be unpivoted
val df = spark.createDataFrame(data).toDF(columns: _*)
df
  .select(
    struct(
      (for {column <- df.columns} yield col(column)).toSeq: _*
    ).as("mystruct")) // create your data set with a single struct column
  .select(
    $"mystruct.Fname", // refer to a sub-element of the struct with the '.' operator
    $"mystruct.sname",
    $"mystruct.Email",
    explode( // make rows for every entry in the array
      array(
        (for {column <- df.columns if !(newColumns contains column)} // filter out the columns we already selected
          yield // for each remaining column yield the following expression (similar to map)
            struct(
              col(s"mystruct.$column").as("value"), // create the value column
              lit(column).as("date_year")) // create a date column
        ).toSeq: _*) // shorthand to pass a Scala array into the varargs of the array function
    )
  )
  .select(
    col("*"), // just being lazy instead of typing
    col("col.*") // expand the exploded struct into columns; separating the year/date should be easy from here
  ).drop($"col")
  .show(false)
+--------------+--------------+------------------+-----+---------+
|mystruct.Fname|mystruct.sname|mystruct.Email |value|date_year|
+--------------+--------------+------------------+-----+---------+
|Micheal |Scott |scarrel#gmail.com |4000 |Jan 2021 |
|Micheal |Scott |scarrel#gmail.com |5000 |Feb 2021 |
|Micheal |Scott |scarrel#gmail.com |3400 |Mar 2021 |
|Dwight |Schrute |dschrute#gmail.com|1200 |Jan 2021 |
|Dwight |Schrute |dschrute#gmail.com|6900 |Feb 2021 |
|Dwight |Schrute |dschrute#gmail.com|1000 |Mar 2021 |
|Kevin |Malone |kmalone#gmail.com |9000 |Jan 2021 |
|Kevin |Malone |kmalone#gmail.com |6000 |Feb 2021 |
|Kevin |Malone |kmalone#gmail.com |18000|Mar 2021 |
+--------------+--------------+------------------+-----+---------+
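Since the question mentions stack: Spark SQL's stack generator can also unpivot these columns directly. This is only a sketch, assuming the DataFrame above has been registered as a temporary view (the name salaries is made up); backticks are needed because the column names contain spaces.
-- stack(4, ...) emits 4 rows per input row, each carrying one (Month, Year, Value) triple
SELECT FName, SName, Email,
       stack(4,
             'Jan',   2021, `Jan 2021`,
             'Feb',   2021, `Feb 2021`,
             'Mar',   2021, `Mar 2021`,
             'Total', 2021, `Total 2021`) AS (Month, Year, Value)
FROM salaries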

Convert a json which is a list of dictionaries into column/row format in Postgresql

I have a JSON which is a list of dictionaries with the following syntax:
[
  {
    "Date_and_Time": "Dec 29, 2017 15:35:37",
    "Componente": "Bar",
    "IP_Origen": "175.11.13.6",
    "IP_Destino": "81.18.119.864",
    "Country": "Brazil",
    "Age": "3"
  },
  {
    "Date_and_Time": "Dec 31, 2017 17:35:37",
    "Componente": "Foo",
    "IP_Origen": "176.11.13.6",
    "IP_Destino": "80.18.119.864",
    "Country": "France",
    "Id": "123456",
    "Car": "Ferrari"
  },
  {
    "Date_and_Time": "Dec 31, 2017 17:35:37",
    "Age": "1",
    "Country": "France",
    "Id": "123456",
    "Car": "Ferrari"
  },
  {
    "Date_and_Time": "Mar 31, 2018 14:35:37",
    "Componente": "Foo",
    "Country": "Germany",
    "Id": "2468",
    "Genre": "Male"
  }
]
The JSON is really big and each dictionary has a different number of key/value fields. What I want to do is create a table in PostgreSQL where each key becomes a column and each value a row value. For the example above I would like a table like this:
Date_and_Time         | Componente | IP_Origen   | IP_Destino    | Country | Id     | Car     | Age | Genre
----------------------+------------+-------------+---------------+---------+--------+---------+-----+------
Dec 29, 2017 15:35:37 | Bar        | 175.11.13.6 | 81.18.119.864 | Brazil  | -      | -       | 3   | -
Dec 31, 2017 17:35:37 | Foo        | 176.11.13.6 | 80.18.119.864 | France  | 123456 | Ferrari | -   | -
Dec 31, 2017 17:35:37 | -          | -           | -             | France  | 123456 | Ferrari | 1   | -
Mar 31, 2018 14:35:37 | Foo        | -           | -             | Germany | 2468   | -       | -   | Male
The only solution I can think of is inserting the values one by one, but that is not efficient at all.
You can use jsonb_to_recordset to create a record set out of your JSON and then use INSERT INTO to insert the records.
insert into table
select * from jsonb_to_recordset('<your json>'::jsonb)
as rec(Date_and_Time timestamp, Componente text, IP_Origen text) -- specify all of the table's columns here
Sample DBFiddle
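For the sample JSON above, the full record definition might look like the following; keys that are missing from a particular dictionary simply come back as NULL. The identifiers are quoted so they keep the same capitalization as the JSON keys, and everything is typed as text here (cast Date_and_Time or Age afterwards if needed):
-- Sketch: one column per key that appears anywhere in the JSON; absent keys yield NULL
select *
from jsonb_to_recordset('<your json>'::jsonb)
     as rec("Date_and_Time" text, "Componente" text, "IP_Origen" text,
            "IP_Destino" text, "Country" text, "Age" text,
            "Id" text, "Car" text, "Genre" text);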

Pyspark transformation: Column names to rows

I'm working with pyspark and want to transform this spark data frame:
+----+-------------------------------+-------------------------------+-------------------------------+-------------------------------+-------------------------------+-------------------------------+-------------------------------+-------------------------------+
| TS | ABC[0].VAL.VAL[0].UNT[0].sth1 | ABC[0].VAL.VAL[0].UNT[1].sth1 | ABC[0].VAL.VAL[1].UNT[0].sth1 | ABC[0].VAL.VAL[1].UNT[1].sth1 | ABC[0].VAL.VAL[0].UNT[0].sth2 | ABC[0].VAL.VAL[0].UNT[1].sth2 | ABC[0].VAL.VAL[1].UNT[0].sth2 | ABC[0].VAL.VAL[1].UNT[1].sth2 |
+----+-------------------------------+-------------------------------+-------------------------------+-------------------------------+-------------------------------+-------------------------------+-------------------------------+-------------------------------+
| 1 | some_value | some_value | some_value | some_value | some_value | some_value | some_value | some_value |
+----+-------------------------------+-------------------------------+-------------------------------+-------------------------------+-------------------------------+-------------------------------+-------------------------------+-------------------------------+
to that:
+----+-----+-----+------------+------------+
| TS | VAL | UNT | sth1 | sth2 |
+----+-----+-----+------------+------------+
| 1 | 0 | 0 | some_value | some_value |
| 1 | 0 | 1 | some_value | some_value |
| 1 | 1 | 0 | some_value | some_value |
| 1 | 1 | 1 | some_value | some_value |
+----+-----+-----+------------+------------+
Any idea how I can do that using some fancy transformation?
Edit:
So this is how I could solve it:
from pyspark.sql.functions import array, col, explode, struct, lit
import re
df = sc.parallelize([(1, 0.0, 0.6, 0.1, 0.4, 0.7, 0.2, 0.4, 0.1), (2, 0.6, 0.7, 0.1, 0.5, 0.8, 0.3, 0.1, 0.3)]).toDF(["TS", "ABC[0].VAL.VAL[0].UNT[0].sth1", "ABC[0].VAL.VAL[0].UNT[1].sth1", "ABC[0].VAL.VAL[1].UNT[0].sth1", "ABC[0].VAL.VAL[1].UNT[1].sth1", "ABC[0].VAL.VAL[0].UNT[0].sth2", "ABC[0].VAL.VAL[0].UNT[1].sth2", "ABC[0].VAL.VAL[1].UNT[0].sth2", "ABC[0].VAL.VAL[1].UNT[1].sth2"])
newcols = list(map(lambda x: x.replace(".", "_"), df.columns))
df = df.toDF(*newcols)
cols, dtypes = zip(*((c, t) for (c, t) in df.dtypes if c not in ["TS"]))
kvs = explode(array([struct(
    lit(re.search(re.compile(r"VAL\[(\d{1,2})\]"), c).group(1)).alias("VAL"),
    lit(re.search(re.compile(r"UNT\[(\d{1,2})\]"), c).group(1)).alias("UNT"),
    lit(re.search(re.compile(r"([^_]+$)"), c).group(1)).alias("Parameter"),
    col(c).alias("data")) for c in cols
])).alias("kvs")
display(df.select(["TS"] + [kvs])
          .select(["TS"] + ["kvs.VAL", "kvs.UNT", "kvs.Parameter", "kvs.data"])
          .groupBy("TS", "VAL", "UNT")
          .pivot("Parameter")
          .sum("data")
          .orderBy("TS", "VAL", "UNT"))
Output:
+----+-----+-----+------+------+
| TS | VAL | UNT | sth1 | sth2 |
+----+-----+-----+------+------+
| 1 | 0 | 0 | 0 | 0.7 |
| 1 | 0 | 1 | 0.6 | 0.2 |
| 1 | 1 | 0 | 0.1 | 0.4 |
| 1 | 1 | 1 | 0.4 | 0.1 |
| 2 | 0 | 0 | 0.6 | 0.8 |
| 2 | 0 | 1 | 0.7 | 0.3 |
| 2 | 1 | 0 | 0.1 | 0.1 |
| 2 | 1 | 1 | 0.5 | 0.3 |
+----+-----+-----+------+------+
How can it be done better?
Your approach is good (upvoted). The only thing I would really do is extract the essential parts from the column names in one regex search. I’d also remove a superfluous select in favor of groupBy, but that’s not as important.
import re
from pyspark.sql.functions import lit, explode, array, struct, col
df = sc.parallelize([(1, 0.0, 0.6, 0.1, 0.4, 0.7, 0.2, 0.4, 0.1), (2, 0.6, 0.7, 0.1, 0.5, 0.8, 0.3, 0.1, 0.3)]).toDF(
["TS", "ABC[0].VAL.VAL[0].UNT[0].sth1", "ABC[0].VAL.VAL[0].UNT[1].sth1", "ABC[0].VAL.VAL[1].UNT[0].sth1",
"ABC[0].VAL.VAL[1].UNT[1].sth1", "ABC[0].VAL.VAL[0].UNT[0].sth2", "ABC[0].VAL.VAL[0].UNT[1].sth2",
"ABC[0].VAL.VAL[1].UNT[0].sth2", "ABC[0].VAL.VAL[1].UNT[1].sth2"])
newcols = list(map(lambda x: x.replace(".", "_"), df.columns))
df = df.toDF(*newcols)
def extract_indices_and_label(column_name):
    # pull the VAL index, UNT index, and trailing label out of e.g. "ABC[0]_VAL_VAL[0]_UNT[1]_sth1"
    s = re.match(r"\D+\d+\D+(\d+)\D+(\d+)[^_]_(.*)$", column_name)
    m, n, label = s.groups()
    return int(m), int(n), label

def create_struct(column_name):
    # wrap the indices, the label, and the column's value into a single struct column
    val, unt, label = extract_indices_and_label(column_name)
    return struct(lit(val).alias("val"),
                  lit(unt).alias("unt"),
                  lit(label).alias("label"),
                  col(column_name).alias("value"))

df2 = (df.select(
    df.TS,
    explode(array([create_struct(c) for c in df.columns[1:]]))))
df2.printSchema() # this is instructional: it shows the structure is nearly there
# root
# |-- TS: long (nullable = true)
# |-- col: struct (nullable = false)
# | |-- val: integer (nullable = false)
# | |-- unt: integer (nullable = false)
# | |-- label: string (nullable = false)
# | |-- value: double (nullable = true)
df3 = (df2
       .groupBy(df2.TS, df2.col.val.alias("VAL"), df2.col.unt.alias("UNT"))
       .pivot("col.label", values=("sth1", "sth2"))
       .sum("col.value"))
df3.orderBy("TS", "VAL", "UNT").show()
# +---+---+---+----+----+
# | TS|VAL|UNT|sth1|sth2|
# +---+---+---+----+----+
# | 1| 0| 0| 0.0| 0.7|
# | 1| 0| 1| 0.6| 0.2|
# | 1| 1| 0| 0.1| 0.4|
# | 1| 1| 1| 0.4| 0.1|
# | 2| 0| 0| 0.6| 0.8|
# | 2| 0| 1| 0.7| 0.3|
# | 2| 1| 0| 0.1| 0.1|
# | 2| 1| 1| 0.5| 0.3|
# +---+---+---+----+----+
If you know a priori that you will have only the two columns sth1 and sth2 that will be pivoted, you could add these to pivot’s values parameter, which will improve the efficiency further.

How to create nested JSON return with aggregate function and dynamic key values using `jsonb_build_object`

This is an example of what the table looks like.
+---------------------+------------------+------------------+
| country_code | region | num_launches |
+---------------------+------------------+------------------+
| 'CA' | 'Ontario' | 5 |
+---------------------+------------------+------------------+
| 'CA' | 'Quebec' | 9 |
+---------------------+------------------+------------------+
| 'DE' | 'Bavaria' | 15 |
+---------------------+------------------+------------------+
| 'DE' | 'Saarland' | 12 |
+---------------------+------------------+------------------+
| 'DE' | 'Berlin' | 23 |
+---------------------+------------------+------------------+
| 'JP' | 'Tokyo' | 19 |
+---------------------+------------------+------------------+
I am able to write a query that returns each country_code with all regions nested within, but I am unable to get exactly what I am looking for.
My intended return looks like this:
[
{ 'CA': [
{ 'Ontario': 5 },
{ 'Quebec': 9 }
]
},
{ 'DE': [
{ 'Bavaria': 15 },
{ 'Saarland': 12 },
{ 'Berlin': 23 }
]
},
{ 'JP': [
{ 'Tokyo': 19 }
]
}
]
How could this be calculated if the num_launches was not available?
+---------------------+------------------+
| country_code | region |
+---------------------+------------------+
| 'CA' | 'Ontario' |
+---------------------+------------------+
| 'CA' | 'Ontario' |
+---------------------+------------------+
| 'CA' | 'Ontario' |
+---------------------+------------------+
| 'CA' | 'Quebec' |
+---------------------+------------------+
| 'CA' | 'Quebec' |
+---------------------+------------------+
| 'DE' | 'Bavaria' |
+---------------------+------------------+
| 'DE' | 'Bavaria' |
+---------------------+------------------+
| 'DE' | 'Bavaria' |
+---------------------+------------------+
| 'DE' | 'Bavaria' |
+---------------------+------------------+
| 'DE' | 'Saarland' |
+---------------------+------------------+
| 'DE' | 'Berlin' |
+---------------------+------------------+
| 'DE' | 'Berlin' |
+---------------------+------------------+
| 'JP' | 'Tokyo' |
+---------------------+------------------+
Expected Return
[
{ 'CA': [
{ 'Ontario': 3 },
{ 'Quebec': 2 }
]
},
{ 'DE': [
{ 'Bavaria': 4 },
{ 'Saarland': 1 },
{ 'Berlin': 2 }
]
},
{ 'JP': [
{ 'Tokyo': 1 }
]
}
]
Thanks
You can use json_agg with the json_build_object function in a subquery to get the arrays, then do it again in the main query.
Schema (PostgreSQL v9.6)
CREATE TABLE T(
  country_code varchar(50),
  region varchar(50),
  num_launches int
);
insert into t values ('CA','Ontario',5);
insert into t values ('CA','Quebec',9);
insert into t values ('DE','Bavaria',15);
insert into t values ('DE','Saarland',12);
insert into t values ('DE','Berlin',23);
insert into t values ('JP','Tokyo',19);
Query #1
select json_agg(json_build_object(country_code,arr)) results
from (
SELECT country_code,
json_agg(json_build_object(region,num_launches)) arr
FROM T
group by country_code
) t1;
results
[{"CA":[{"Ontario":5},{"Quebec":9}]},{"DE":[{"Bavaria":15},{"Saarland":12},{"Berlin":23}]},{"JP":[{"Tokyo":19}]}]
View on DB Fiddle
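For the second table in the question, where num_launches is not available, the same nesting works if each region's launches are counted first with count(*); a sketch in the same style as Query #1, against the same sample schema:
-- Sketch: derive num_launches by counting rows per (country_code, region), then nest as before
select json_agg(json_build_object(country_code, arr)) results
from (
  select country_code,
         json_agg(json_build_object(region, num_launches)) arr
  from (
    select country_code, region, count(*) as num_launches
    from t
    group by country_code, region
  ) counts
  group by country_code
) t1;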