How to flatten a latlong array in bigquery to produce a linestring? - google-bigquery

I have a nested table structure, like this:
[
{
"startTime": "2017-09-02 09:08:00:000",
"endTime": "2017-09-02 09:09:00:000",
"startTimeMillis": "1504343280000",
"endTimeMillis": "1504343340000",
"uuid": "1748750880",
"country": "CI",
"city": "Punta Arenas",
"x": "-70.906904",
"y": "-53.133514"
},
{
"startTime": "2017-09-02 09:08:00:000",
"endTime": "2017-09-02 09:09:00:000",
"startTimeMillis": "1504343280000",
"endTimeMillis": "1504343340000",
"uuid": "1748750880",
"country": "CI",
"city": "Punta Arenas",
"x": "-70.907353",
"y": "-53.133253"
},
{
"startTime": "2017-09-02 09:08:00:000",
"endTime": "2017-09-02 09:09:00:000",
"startTimeMillis": "1504343280000",
"endTimeMillis": "1504343340000",
"uuid": "1748750880",
"country": "CI",
"city": "Punta Arenas",
"x": "-70.90771",
"y": "-53.133041"
},
{
"startTime": "2017-09-02 09:08:00:000",
"endTime": "2017-09-02 09:09:00:000",
"startTimeMillis": "1504343280000",
"endTimeMillis": "1504343340000",
"uuid": "1748750880",
"country": "CI",
"city": "Punta Arenas",
"x": "-70.908979",
"y": "-53.132287"
}
]
The resulting table is something like this:
Row|startTime|endTime|startTimeMillis|endTimeMillis|uuid|country|city|x|y|
1|2017-09-02 09:08:00:000|2017-09-02 09:09:00:000|1504343280000|1504343340000|1748750880|CI|Punta Arenas|-70.906904|-53.133514|
2|2017-09-02 09:08:00:000|2017-09-02 09:09:00:000|1504343280000|1504343340000|1748750880|CI|Punta Arenas|-70.907353|-53.133253|
3|2017-09-02 09:08:00:000|2017-09-02 09:09:00:000|1504343280000|1504343340000|1748750880|CI|Punta Arenas|-70.90771|-53.133041|
4|2017-09-02 09:08:00:000|2017-09-02 09:09:00:000|1504343280000|1504343340000|1748750880|CI|Punta Arenas|-70.908979|-53.132287|
I'd like to concatenate the repeated fields x and y to produce a GIS linestring on a single row, like this:
Row|startTime|endTime|startTimeMillis|endTimeMillis|uuid|country|city|linestring
1|2017-09-02 09:08:00:000|2017-09-02 09:09:00:000|1504343280000|1504343340000|1748750880|CI|Punta Arenas|LINESTRING(-70.906904 -53.133514, -70.907353 -53.133253, -70.90771 -53.133041, -70.908979 -53.132287)
How can I do this? The original x and y values are floats.

Below is for BigQuery Standard SQL
#standardSQL
WITH `yourTable` AS (
SELECT '2017-09-02 09:08:00:000' AS startTime, '2017-09-02 09:09:00:000' AS endTime, 1504343280000 AS startTimeMillis, 1504343340000 AS endTimeMillis, 1748750880 AS uuid, 'CI' AS country, 'Punta Arenas' AS city, -70.906904 AS x, -53.133514 AS y UNION ALL
SELECT '2017-09-02 09:08:00:000', '2017-09-02 09:09:00:000', 1504343280000, 1504343340000, 1748750880, 'CI', 'Punta Arenas', -70.907353, -53.133253 UNION ALL
SELECT '2017-09-02 09:08:00:000', '2017-09-02 09:09:00:000', 1504343280000, 1504343340000, 1748750880, 'CI', 'Punta Arenas', -70.90771, -53.133041 UNION ALL
SELECT '2017-09-02 09:08:00:000', '2017-09-02 09:09:00:000', 1504343280000, 1504343340000, 1748750880, 'CI', 'Punta Arenas', -70.908979, -53.132287
)
SELECT startTime, endTime, startTimeMillis, endTimeMillis, uuid, country, city,
STRING_AGG(CONCAT(CAST(x AS STRING), ' ', CAST(y AS STRING)), ',') AS linestring
FROM `yourTable`
GROUP BY startTime, endTime, startTimeMillis, endTimeMillis, uuid, country, city
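If you want the literal WKT text from the question (the LINESTRING( ... ) wrapper and a comma-plus-space separator), you can wrap the aggregate in CONCAT; a minimal variation of the final SELECT above, reusing the same `yourTable` CTE:
SELECT startTime, endTime, startTimeMillis, endTimeMillis, uuid, country, city,
  CONCAT('LINESTRING(',
    STRING_AGG(CONCAT(CAST(x AS STRING), ' ', CAST(y AS STRING)), ', '),
  ')') AS linestring
FROM `yourTable`
GROUP BY startTime, endTime, startTimeMillis, endTimeMillis, uuid, country, city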

You could use the ARRAY_AGG function available in Standard SQL, something like:
#standardSQL
WITH data AS(
SELECT "2017-09-02 09:08:00:000" AS startTime, "2017-09-02 09:09:00:000" endTime, "1504343280000" AS startTimeMillis, "1504343340000" endTimeMillis, "1748750880" AS uuid, "CI" AS country, "Punta Arenas" AS city, "-70.906904" AS x, "-53.133514" AS y UNION ALL
SELECT "2017-09-02 09:08:00:000", "2017-09-02 09:09:00:000", "1504343280000", "1504343340000", "1748750880", "CI", "Punta Arenas", "-70.907353", "-53.133253" UNION ALL
SELECT "2017-09-02 09:08:00:000", "2017-09-02 09:09:00:000", "1504343280000", "1504343340000", "1748750880", "CI", "Punta Arenas", "-70.90771", "-53.133041" UNION ALL
SELECT "2017-09-02 09:08:00:000", "2017-09-02 09:09:00:000", "1504343280000", "1504343340000", "1748750880", "CI", "Punta Arenas", "-70.908979", "-53.132287"
)
SELECT
startTime,
endTime,
startTimeMillis,
endTimeMillis,
uuid,
country,
city,
ARRAY_AGG(STRUCT(x, y)) AS LINESTRING
FROM data
GROUP BY
startTime,
endTime,
startTimeMillis,
endTimeMillis,
uuid,
country,
city
In the result, even though the output is an ARRAY of the x and y elements, notice that they have been structured together as a STRUCT, which allows you to access each field by its respective name.
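For example, here is a minimal, self-contained sketch (a small hand-built array stands in for the query result; the names are illustrative) of unnesting that array and reading each field by name:
#standardSQL
WITH result AS (
  SELECT '1748750880' AS uuid,
    ARRAY<STRUCT<x STRING, y STRING>>[('-70.906904', '-53.133514'), ('-70.907353', '-53.133253')] AS linestring
)
SELECT uuid, point.x, point.y
FROM result, UNNEST(linestring) AS point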

Thank you all!
I'm using Mikhail Berlyant's solution!
SELECT
w.startTime, w.endTime, w.startTimeMillis, w.endTimeMillis,
jams_u.uuid, jams_u.country, jams_u.city, jams_u.street,
jams_u.roadType, jams_u.turnType,
jams_u.type, jams_u.length, jams_u.speed, jams_u.level, jams_u.delay,
jams_u.startNode, jams_u.endNode, jams_u.pubMillis,
TIMESTAMP_MILLIS(jams_u.pubMillis) as pubdatetime_utc,
STRING_AGG(CONCAT(CAST(line_u.x AS STRING),' ',CAST(line_u.y AS STRING))) linestring_4326
FROM
a_import.table w,
UNNEST(jams) jams_u,
UNNEST(line) line_u
GROUP BY
w.startTime, w.endTime, w.startTimeMillis, w.endTimeMillis,
jams_u.uuid, jams_u.country, jams_u.city, jams_u.street,
jams_u.roadType, jams_u.turnType,
jams_u.type, jams_u.length, jams_u.speed, jams_u.level, jams_u.delay,
jams_u.startNode, jams_u.endNode, jams_u.pubMillis,
pubdatetime_utc

One concern with the proposed solutions that use GROUP BY only: without an ORDER BY clause inside the aggregation, the order of the elements within each group is undefined, so you can get an arbitrary order of points in the linestring, which is probably not what you want. Unfortunately, with small inline datasets you get stable results, but this might break once you have real data.
To solve this you need to decide which attributes define the group and which define the order. E.g. if uuid identifies a linestring and the start timestamp defines the order (the timestamps would need to differ, unlike in your sample), your query would group by uuid and sort by timestamp.
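For the STRING_AGG answers above, the minimal fix is to add an ORDER BY inside the aggregate; the ts column here is illustrative, so substitute whatever column defines the point order in your data:
STRING_AGG(CONCAT(CAST(x AS STRING), ' ', CAST(y AS STRING)), ', ' ORDER BY ts) AS linestring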
I also prefer to use the new geospatial functions to construct the linestring, rather than string concatenation, which gives:
#standardSQL
WITH `yourTable` AS (
SELECT * FROM UNNEST([
STRUCT('2017-09-02 09:08:00:000' AS startTime, '2017-09-02 09:09:00:000' AS endTime, 1504343280002 AS startTimeMillis, 1504343340000 AS endTimeMillis,
1748750880 AS uuid, 'CI' AS country, 'Punta Arenas' AS city, -70.906904 AS x, -53.133514 AS y),
STRUCT('2017-09-02 09:08:00:000', '2017-09-02 09:09:00:000', 1504343280001, 1504343340000, 1748750880, 'CI', 'Punta Arenas', -70.907353, -53.133253),
STRUCT('2017-09-02 09:08:00:000', '2017-09-02 09:09:00:000', 1504343280004, 1504343340000, 1748750880, 'CI', 'Punta Arenas', -70.90771, -53.133041),
STRUCT('2017-09-02 09:08:00:000', '2017-09-02 09:09:00:000', 1504343280003, 1504343340000, 1748750880, 'CI', 'Punta Arenas', -70.908979, -53.132287)])
)
SELECT uuid, MIN(startTime) startTime, MAX(endTime) endTime,
ANY_VALUE(country), ANY_VALUE(city),
ST_MakeLine(ARRAY_AGG(ST_GeogPoint(x, y)
ORDER BY startTime, startTimeMillis)) line
FROM `yourTable`
GROUP BY uuid
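If you still need the WKT text from the question rather than a GEOGRAPHY value, you can wrap the line in ST_ASTEXT; a small follow-up to the query above, reusing the same column names:
ST_ASTEXT(ST_MakeLine(ARRAY_AGG(ST_GeogPoint(x, y)
  ORDER BY startTime, startTimeMillis))) AS linestring_wkt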

Related

Concatenate column by condition (oracle)

source_id | source_groupid | source_nm | category_id | level_id
----------|----------------|-----------|-------------|---------
12345     | 34             | ABC       | 7           | 2
67549     | (null)         | GI        | 5           | 1
24751     | (null)         | BL        | (null)      | 6
Result
{"id": 12345, "groupid": 34, "name": ABC, "category_id": 7, "level_id": 2}
{"id": 67549, "groupid": , "name": GI, "category_id": 5, "level_id": 1}
SELECT CONCAT ('{','"id": ', source_id,', ', '"groupid": ', source_groupid,', ','"name": ',source_nm,', ','"category_id": ',category_id,', ', '"level_id": ', level_id, '}') as full_info
FROM table
I need to do the column concatenation according to the following pattern. If, for example, there is no entry for group_id or category_id, how do I write the code so that the template changes and rows 2 and 3 look as follows?
{"id": 12345, "groupid": 34, "name": ABC, "category_id": 7, "level_id": 2}
{"id": 67549, "name": GI, "category_id": 5, "level_id": 1}
{"id": 24751, "name": BL, "level_id": 6}
Well, in Oracle (whose tag you've used), the CONCAT function accepts only two arguments, and therefore that code won't work.
Instead, use the double pipe || operator. As for your main problem, CASE it is.
SQL> with test (source_id, source_groupid, source_nm, category_id, level_id) as
2 (select 12345, 34, 'ABC', 7, 2 from dual union all
3 select 67549, null, 'GI' , 5, 1 from dual union all
4 select 24751, null, 'BL' , null, 6 from dual
5 )
6 select '{' || '"id": ' || source_id ||
7 case when source_groupid is not null then ', "groupid": ' || source_groupid end ||
8 case when source_nm is not null then ', "name": ' || source_nm end ||
9 case when category_id is not null then ', "category_id": ' || category_id end ||
10 case when level_id is not null then ', "level_id": ' || level_id end || '}'
11 as result
12 from test;
RESULT
--------------------------------------------------------------------------------
{"id": 12345, "groupid": 34, "name": ABC, "category_id": 7, "level_id": 2}
{"id": 67549, "name": GI, "category_id": 5, "level_id": 1}
{"id": 24751, "name": BL, "level_id": 6}
SQL>
From Oracle 12, don't build JSON by hand; use the JSON_OBJECT function and then you can use ABSENT ON NULL:
SELECT JSON_OBJECT(
KEY 'id' VALUE source_id,
KEY 'groupid' VALUE source_groupid,
KEY 'name' VALUE source_nm,
KEY 'category_id' VALUE category_id,
KEY 'level_id' VALUE level_id
ABSENT ON NULL
) As json
FROM table_name;
Which, for the sample data:
CREATE TABLE table_name (source_id, source_groupid, source_nm, category_id, level_id) AS
SELECT 12345, 34, 'ABC', 7, 2 FROM DUAL UNION ALL
SELECT 67549, NULL, 'GI', 5, 1 FROM DUAL UNION ALL
SELECT 24751, NULL, 'BL', NULL, 6 FROM DUAL;
Outputs:
JSON
{"id":12345,"groupid":34,"name":"ABC","category_id":7,"level_id":2}
{"id":67549,"name":"GI","category_id":5,"level_id":1}
{"id":24751,"name":"BL","level_id":6}
db<>fiddle here

Format SQL output to custom JSON

I have this table, which is very simple, with this data:
CREATE TABLE #Prices
(
ProductId int,
SizeId int,
Price int,
Date date
)
INSERT INTO #Prices
VALUES (1, 1, 100, '2020-01-01'),
(1, 1, 120, '2020-02-01'),
(1, 1, 130, '2020-03-01'),
(1, 2, 100, '2020-01-01'),
(1, 2, 100, '2020-02-01'),
(2, 1, 100, '2020-01-01'),
(2, 1, 120, '2020-02-01'),
(2, 1, 130, '2020-03-01'),
(2, 2, 100, '2020-01-01'),
(2, 2, 100, '2020-02-01')
I would like to format the output to be something like this:
{
"Products": [
{
"Product": 2,
"UnitSizes": [
{
"SizeId": 1,
"PerDate": [
{
"Date": "2020-01-02",
"Price": 870.0
},
{
"Date": "2021-04-29",
"Price": 900.0
}
]
},
{
"SizeId": 2,
"PerDate": [
{
"Date": "2020-01-02",
"Price": 435.0
},
{
"Date": "2021-04-29",
"Price": 450.0
}
]
}
]
},
{
"Product": 4,
"UnitSizes": [
{
"SizeId": 1,
"PerDate": [
{
"Date": "2020-01-02",
"Price": 900.0
}
]
}
]
}
]
}
I almost have it, but I don't know how to format the output to get the array inside 'PerDate'. This is what I have:
SELECT
ProductId AS [Product],
SizeId AS 'Sizes.SizeId',
date AS 'Sizes.PerDate.Date',
price AS 'Sizes.PerDate.Price'
FROM
#Prices
ORDER BY
ProductId, [Sizes.SizeId], Date
FOR JSON PATH, ROOT('Products')
I have tried FOR JSON AUTO with no luck, and I've tried JSON_QUERY(), but I was not able to achieve the result I want.
Any help will be much appreciated.
Thanks
Unfortunately, SQL Server does not have the JSON_AGG function, which means you would normally need to use a number of correlated subqueries and keep on rescanning the base table.
However, we can simulate it by using STRING_AGG against single JSON objects generated in an APPLY. This means that we only scan the base table once.
Use of JSON_QUERY with no path prevents double-escaping.
WITH PerDate AS (
SELECT
p.ProductId,
p.SizeId,
PerDate = '[' + STRING_AGG(j.PerDate, ',') WITHIN GROUP (ORDER BY p.Date) + ']'
FROM #Prices AS p
CROSS APPLY ( -- This produces multiple rows of single JSON objects
SELECT p.Date, p.Price
FOR JSON PATH, WITHOUT_ARRAY_WRAPPER
) j(PerDate)
GROUP BY
p.ProductId,
p.SizeId
),
UnitSizes AS (
SELECT
p.ProductId,
UnitSizes = '[' + STRING_AGG(j.UnitSizes, ',') WITHIN GROUP (ORDER BY p.SizeId) + ']'
FROM PerDate p
CROSS APPLY (
SELECT p.SizeId, PerDate = JSON_QUERY(p.PerDate)
FOR JSON PATH, WITHOUT_ARRAY_WRAPPER
) j(UnitSizes)
GROUP BY
p.ProductId
)
SELECT
Product = p.ProductId,
UnitSizes = JSON_QUERY(p.UnitSizes)
FROM UnitSizes p
ORDER BY p.ProductId
FOR JSON PATH, ROOT('Products');
db<>fiddle
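As a brief aside on the JSON_QUERY remark above, here is a minimal illustration of the double-escaping it prevents (the values are made up):
-- without JSON_QUERY, the pre-built JSON is treated as a plain string and re-escaped
SELECT SizeId = 1, PerDate = '[{"Date":"2020-01-01","Price":100}]'
FOR JSON PATH, WITHOUT_ARRAY_WRAPPER;
-- {"SizeId":1,"PerDate":"[{\"Date\":\"2020-01-01\",\"Price\":100}]"}

-- with JSON_QUERY, it is embedded as-is
SELECT SizeId = 1, PerDate = JSON_QUERY('[{"Date":"2020-01-01","Price":100}]')
FOR JSON PATH, WITHOUT_ARRAY_WRAPPER;
-- {"SizeId":1,"PerDate":[{"Date":"2020-01-01","Price":100}]}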
This is one way of doing it
DROP TABLE IF EXISTS #Prices
CREATE TABLE #Prices
(
ProductId INT,
SizeId INT,
Price INT,
Date DATE
)
-- SQL Prompt formatting off
INSERT INTO #Prices
VALUES (1, 1, 100, '2020-01-01'),
(1, 1, 120, '2020-02-01'),
(1, 1, 130, '2020-03-01'),
(1, 2, 100, '2020-01-01'),
(1, 2, 100, '2020-02-01'),
(2, 1, 100, '2020-01-01'),
(2, 1, 120, '2020-02-01'),
(2, 1, 130, '2020-03-01'),
(2, 2, 100, '2020-01-01'),
(2, 2, 100, '2020-02-01')
-- SQL Prompt formatting on
SELECT m.ProductId AS Product,
(
SELECT s.SizeId,
(
SELECT p.Date,
p.Price
FROM #Prices AS p
WHERE p.SizeId = s.SizeId
GROUP BY p.Date,
p.Price
ORDER BY p.Date
FOR JSON PATH
) AS PerDate
FROM #Prices AS s
WHERE s.ProductId = m.ProductId
GROUP BY s.SizeId
ORDER BY s.SizeId
FOR JSON PATH
) AS UnitSizes
FROM #Prices AS m
GROUP BY m.ProductId
ORDER BY m.ProductId
FOR JSON PATH, ROOT('Products')
Output:
{
"Products":
[
{
"Product": 1,
"UnitSizes":
[
{
"SizeId": 1,
"PerDate":
[
{
"Date": "2020-01-01",
"Price": 100
},
{
"Date": "2020-02-01",
"Price": 120
},
{
"Date": "2020-03-01",
"Price": 130
}
]
},
{
"SizeId": 2,
"PerDate":
[
{
"Date": "2020-01-01",
"Price": 100
},
{
"Date": "2020-02-01",
"Price": 100
}
]
}
]
},
{
"Product": 2,
"UnitSizes":
[
{
"SizeId": 1,
"PerDate":
[
{
"Date": "2020-01-01",
"Price": 100
},
{
"Date": "2020-02-01",
"Price": 120
},
{
"Date": "2020-03-01",
"Price": 130
}
]
},
{
"SizeId": 2,
"PerDate":
[
{
"Date": "2020-01-01",
"Price": 100
},
{
"Date": "2020-02-01",
"Price": 100
}
]
}
]
}
]
}

Only extract json if field not null

I want to extract a key value from a (nullable) JSONB field. If the field is NULL, I want the record still present in my result set, but with a null field.
customer table:
id, name, phone_num, address
1, "john", 983, [ {"street":"23, johnson ave", "city":"Los Angeles", "state":"California", "current":true}, {"street":"12, marigold drive", "city":"Davis", "state":"California", "current":false}]
2, "jane", 9389, null
3, "sally", 352, [ "street":"90, park ave", "city":"Los Angeles", "state":"California", "current":true} ]
Current PostgreSQL query:
select id, name, phone_num, items.city
from customer,
jsonb_to_recordset(customer.address) as items(city text, current bool)
where items.current = true
It returns:
id, name, phone_num, city
1, "john", 983, "Los Angeles"
3, "sally", 352, "Los Angeles"
Required Output:
id, name, phone_num, city
1, "john", 983, "Los Angeles"
2, "jane", 9389, null
3, "sally", 352, "Los Angeles"
How do I achieve the above output?
Use a left join lateral instead of an implicit lateral join:
select c.id, c.name, c.phone_num, i.city
from customer c
left join lateral jsonb_to_recordset(c.address) as i(city text, current bool)
on i.current = true
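A note on where the filter lives: the current = true test has to stay in the ON clause. If you move it into a WHERE clause, the NULL row produced by the left join for jane is filtered out again and the query effectively becomes an inner join. A quick sketch of the variant to avoid:
-- this variant would drop jane again, because i.current is NULL for her row
select c.id, c.name, c.phone_num, i.city
from customer c
left join lateral jsonb_to_recordset(c.address) as i(city text, current bool) on true
where i.current = true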

folding over large BigQuery result

Is there any easy way for me to do something like OCaml's fold_left on the result of a BigQuery query, where each iteration corresponds to one row in the result?
What product or approach would be the easiest way? It would be great if:
all I need to do is to supply the initial state and the 'folder' function
preferably, I'd like to write the 'folder' function in a functional language
I don't need to install any GCP package
Since I don't know which product or language would work, I cannot be more specific, but the pseudocode would look like this:
let my_init = []
let my_folder = fun state row ->
  (* append for now, but it will get more complicated: I need to do some set operations here. The point is that I need some way of carrying "state" across rows while iterating over them in a predefined order. *)
  row.col1 :: state
let query = "SELECT col1, col2, col3 FROM table1 ORDER BY timestamp"
query |> List.fold_left my_folder my_init
The result that I want to get from this simplified example is the final "state".
--- UPDATED ---
There is no bound on the number of rows: if we receive more data, we get more rows. Typically, the number is more than a few million, but it can be larger than that.
Here's a simplified example that shows the major problem I'm encountering. We have a table with a few columns:
timestamp
user_id: a string id
operation_json: a stringified JSON object, which is a list of operations, each of which corresponds to either:
add user_id to a set
remove user_id from a set
For example, the following are valid rows:
----------+---------+----------------------------------------------
timestamp | user_id | operation_json
----------+---------+----------------------------------------------
1 | id1 | [ { "op": "add", "set": "set1" } ]
2 | id2 | [ { "op": "add", "set": "set1" } ]
3 | id1 | [ { "op": "add", "set": "set2" } ]
4 | id3 | [ { "op": "add", "set": "set2" } ]
5 | id1 | [ { "op": "remove", "set": "set1" } ]
----------+---------+----------------------------------------------
As a result, I'd like to get sets of users; i.e.,
set1 |-> { id2 }
set2 |-> { id1, id3 }
I thought a fold_left-like operation would be convenient. The state would be a map from set name to a set of user ids (i.e. map<string, set<string>>), and the initial state would be an empty map.
Below is a [quick and simple] example for BigQuery Standard SQL:
#standardSQL
CREATE TEMP FUNCTION fold(arr ARRAY<INT64>, init INT64)
RETURNS FLOAT64
LANGUAGE js AS """
const reducer = (accumulator, currentValue) => accumulator + parseInt(currentValue);
return arr.reduce(reducer, init); // start the fold from the supplied initial state
""";
WITH `project.dataset.table` AS (
SELECT 1 id, [1, 2, 3, 4] arr, 5 initial_state UNION ALL
SELECT 2, [1, 2, 3, 4, 5, 6, 7], 10
)
SELECT id, fold(arr, initial_state) result
FROM `project.dataset.table`
output is
Row id result
1 1 15.0
2 2 38.0
I think it is self-explanatory enough. See the BigQuery documentation for more on JS UDFs.
Folding a list of rows
See below an extension of the above.
Here you are assembling an array from the result's rows before applying the fold function (of course, you have the usual UDF limits to keep in mind here, as well as limits on how big your ARRAY of rows can grow, etc.).
#standardSQL
CREATE TEMP FUNCTION fold(arr ARRAY<INT64>, init INT64)
RETURNS FLOAT64
LANGUAGE js AS """
const reducer = (accumulator, currentValue) => accumulator + parseInt(currentValue);
return arr.reduce(reducer, init); // start the fold from the supplied initial state
""";
WITH `project.dataset.table` AS (
SELECT 1 id, 1 item UNION ALL
SELECT 1, 2 UNION ALL
SELECT 1, 3 UNION ALL
SELECT 1, 4 UNION ALL
SELECT 2, 1 UNION ALL
SELECT 2, 2 UNION ALL
SELECT 2, 3 UNION ALL
SELECT 2, 4 UNION ALL
SELECT 2, 5 UNION ALL
SELECT 2, 6 UNION ALL
SELECT 2, 7
)
SELECT id, fold(ARRAY_AGG(item), 5) result
FROM `project.dataset.table`
GROUP BY id
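Since a fold is order-sensitive and ARRAY_AGG without ORDER BY makes no ordering guarantee, you would typically pin the order inside the aggregation as well; for instance (assuming, purely for illustration, that item itself defines the order):
SELECT id, fold(ARRAY_AGG(item ORDER BY item), 5) result
FROM `project.dataset.table`
GROUP BY id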
Note: if you need to include more than one field from each row, you can use an ARRAY of STRUCTs, as in the example below:
ARRAY_AGG(STRUCT(id, item) ORDER BY id)
Of course, you will need to adjust the signature of the fold UDF accordingly.
For example:
#standardSQL
CREATE TEMP FUNCTION fold(arr ARRAY<STRUCT<id INT64, item INT64>>, init INT64)
RETURNS FLOAT64
LANGUAGE js AS """
const reducer = (accumulator, currentValue) => accumulator + parseInt(currentValue.item);
return arr.reduce(reducer, init); // start the fold from the supplied initial state
""";
WITH `project.dataset.table` AS (
SELECT 1 id, 1 item UNION ALL
SELECT 1, 2 UNION ALL
SELECT 1, 3 UNION ALL
SELECT 1, 4 UNION ALL
SELECT 2, 1 UNION ALL
SELECT 2, 2 UNION ALL
SELECT 2, 3 UNION ALL
SELECT 2, 4 UNION ALL
SELECT 2, 5 UNION ALL
SELECT 2, 6 UNION ALL
SELECT 2, 7
)
SELECT id, fold(ARRAY_AGG(t), 5) result
FROM `project.dataset.table` t
GROUP BY id
The approach below has nothing to do with folding per se; rather, it attempts to translate your challenge into a set-based one (which is more natural when you are dealing with SQL) by identifying the latest op action for each user per set: if it is "remove", that user is simply eliminated from further consideration; if it is "add", the latest "add" for that user / set is used. This assumes that there cannot be multiple consecutive "add" actions for the same user / set; rather, the sequence can be add / remove / add and so on. Of course, this can be further adjusted based on the real use case.
So, with the above in mind, below is an example for BigQuery Standard SQL:
#standardSQL
WITH `project.dataset.table` AS (
SELECT 1 ts, 'id1' user_id, '[ { "op": "add", "set": "set1" } ]' operation_json UNION ALL
SELECT 2, 'id2', '[ { "op": "add", "set": "set1" } ]' UNION ALL
SELECT 3, 'id1', '[ { "op": "add", "set": "set2" } ]' UNION ALL
SELECT 4, 'id3', '[ { "op": "add", "set": "set2" } ]' UNION ALL
SELECT 5, 'id1', '[ { "op": "remove", "set": "set1" } ]'
)
SELECT bin, STRING_AGG(user_id, ',' ORDER BY ts) result
FROM (
SELECT user_id, bin, ARRAY_AGG(ts ORDER BY ts DESC LIMIT 1)[OFFSET(0)] ts
FROM (
SELECT ts, user_id, op, bin, LAST_VALUE(op) OVER(win) fin
FROM (
SELECT ts, user_id,
JSON_EXTRACT_SCALAR(REGEXP_REPLACE(operation_json, r'^\[|\]$', ''), '$.op') op,
JSON_EXTRACT_SCALAR(REGEXP_REPLACE(operation_json, r'^\[|\]$', ''), '$.set') bin
FROM `project.dataset.table`
)
WINDOW win AS (
PARTITION BY user_id, bin
ORDER BY ts
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
)
WHERE fin = 'add'
GROUP BY user_id, bin
)
GROUP BY bin
-- ORDER BY bin
output is
Row bin result
1 set1 id2
2 set2 id1,id3
If applied to the dummy data below:
WITH `project.dataset.table` AS (
SELECT 1 ts, 'id1' user_id, '[ { "op": "add", "set": "set1" } ]' operation_json UNION ALL
SELECT 2, 'id2', '[ { "op": "add", "set": "set1" } ]' UNION ALL
SELECT 3, 'id1', '[ { "op": "add", "set": "set2" } ]' UNION ALL
SELECT 4, 'id3', '[ { "op": "add", "set": "set2" } ]' UNION ALL
SELECT 5, 'id1', '[ { "op": "remove", "set": "set1" } ]' UNION ALL
SELECT 6, 'id1', '[ { "op": "add", "set": "set1" } ]' UNION ALL
SELECT 7, 'id1', '[ { "op": "remove", "set": "set1" } ]' UNION ALL
SELECT 8, 'id1', '[ { "op": "add", "set": "set1" } ]' UNION ALL
SELECT 9, 'id1', '[ { "op": "remove", "set": "set2" } ]' UNION ALL
SELECT 10, 'id1', '[ { "op": "add", "set": "set2" } ]'
)
the result will be:
Row bin result
1 set1 id2,id1
2 set2 id3,id1
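Note that the query above assumes operation_json always holds exactly one operation, which is why stripping the outer brackets with REGEXP_REPLACE is enough. If a row can carry several operations, one hedged adjustment (using JSON_EXTRACT_ARRAY, available in BigQuery Standard SQL) is to unnest the JSON array first and feed each element into the same window logic, for example:
#standardSQL
SELECT ts, user_id,
  JSON_EXTRACT_SCALAR(op_item, '$.op') op,
  JSON_EXTRACT_SCALAR(op_item, '$.set') bin
FROM `project.dataset.table`,
UNNEST(JSON_EXTRACT_ARRAY(operation_json)) op_item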

Postgres Build Complex JSON Object from Wide Column Like Design to Key Value

I could really use some help here before my mind explodes...
Given the following data structure:
SELECT * FROM (VALUES (1, 1, 1, 1), (2, 2, 2, 2)) AS t(day, apple, banana, orange);
day | apple | banana | orange
-----+-------+--------+--------
1 | 1 | 1 | 1
2 | 2 | 2 | 2
I want to construct a JSON object which looks like the following:
{
"data": [
{
"day": 1,
"fruits": [
{
"key": "apple",
"value": 1
},
{
"key": "banana",
"value": 1
},
{
"key": "orange",
"value": 1
}
]
}
]
}
Maybe I am not so far away from my goal:
SELECT json_build_object(
'data', json_agg(
json_build_object(
'day', t.day,
'fruits', t)
)
) FROM (VALUES (1, 1, 1, 1), (2, 2, 2, 2)) AS t(day, apple, banana, orange);
Results in:
{
"data": [
{
"day": 1,
"fruits": {
"day": 1,
"apple": 1,
"banana": 1,
"orange": 1
}
}
]
}
I know that there is json_each, which may do the trick, but I am struggling to apply it to the query.
Edit:
This is my updated query which, I guess, is pretty close. I have dropped the idea of solving it with json_each. Now I only have to return an array of fruits instead of appending to the fruits object:
SELECT json_build_object(
'data', json_agg(
json_build_object(
'day', t.day,
'fruits', json_build_object(
'key', 'apple',
'value', t.apple,
'key', 'banana',
'value', t.banana,
'key', 'orange',
'value', t.orange
)
)
)
) FROM (VALUES (1, 1, 1, 1), (2, 2, 2, 2)) AS t(day, apple, banana, orange);
Would I need to add a subquery to prevent a nested aggregate function?
Use the function jsonb_each() to get pairs (key, value), so you do not have to know the number of columns and their names to get a proper output:
select jsonb_build_object('data', jsonb_agg(to_jsonb(s) order by day))
from (
select day, jsonb_agg(jsonb_build_object('key', key, 'value', value)) as fruits
from (
values (1, 1, 1, 1), (2, 2, 2, 2)
) as t(day, apple, banana, orange),
jsonb_each(to_jsonb(t)- 'day')
group by 1
) s;
The above query gives this object:
{
"data": [
{
"day": 1,
"fruits": [
{
"key": "apple",
"value": 1
},
{
"key": "banana",
"value": 1
},
{
"key": "orange",
"value": 1
}
]
},
{
"day": 2,
"fruits": [
{
"key": "apple",
"value": 2
},
{
"key": "banana",
"value": 2
},
{
"key": "orange",
"value": 2
}
]
}
]
}
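If the jsonb_each(to_jsonb(t) - 'day') step is hard to picture, here is a minimal sketch of just that part over the same VALUES list, showing the key/value rows it yields per input row (output shown in a convenient order):
select t.day, e.key, e.value
from (values (1, 1, 1, 1), (2, 2, 2, 2)) as t(day, apple, banana, orange),
     jsonb_each(to_jsonb(t) - 'day') as e(key, value);
-- day | key    | value
-- ----+--------+-------
--   1 | apple  | 1
--   1 | banana | 1
--   1 | orange | 1
--   2 | apple  | 2
--   2 | banana | 2
--   2 | orange | 2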