I have a lot of rows in a Postgres database that include a time value.
The database contains records of unit colors, something like this:
[
{
id: 1234,
unit: 2,
color: "red",
time: "Wed, 16 Dec 2020 21:45:30"
},
{
id: 1235,
unit: 2,
color: "red",
time: "Wed, 16 Dec 2020 21:47:30"
},{
id: 1236,
unit: 6,
color: "blue",
time: "Wed, 16 Dec 2020 21:48:30"
},
{
id: 1237,
unit: 6,
color: "green",
time: "Wed, 16 Dec 2020 21:49:30"
},
{
id: 1237,
unit: 6,
color: "blue",
time: "Wed, 16 Dec 2020 21:49:37"
},
]
I want to be able to query this list in 10-minute averages, which should return the earliest record that contains the average.
For example, in the 10-minute period 21:40 - 21:50 I should only receive the 2 unique units, each with the average value it had within that time period.
The returned data should look something like this:
[
{
id: 1234,
unit: 2,
color: "red",
time: "Wed, 16 Dec 2020 21:45:30"
},
{
id: 1236,
unit: 6,
color: "blue",
time: "Wed, 16 Dec 2020 21:48:30"
},
]
What type of query should I be using to achieve something like this?
Thanks
You can use distinct on:
select distinct on (x.time_trunc, t.unit) t.*
from mytable t
cross join lateral (values (
date_trunc('hour', t.time)
+ floor(extract(minute from t.time) / 10) * '10 minute'::interval)
) as x(time_trunc)
order by x.time_trunc, t.unit, t.time
The trick is to truncate the timestamps to 10-minute buckets. For this, we use date arithmetic; the floor() is needed because extract() does not return an integer, so the division alone would keep its fractional part and truncate nothing. I moved the computation into a lateral join so there is no need to repeat the expression. Then distinct on comes into play, to select the earliest record per timestamp bucket and per unit.
I don't see how the question relates to an average whatsoever.
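To make the bucketing concrete, here is a minimal sketch of the same logic outside the database, written in Python. It only illustrates the technique (truncate to a 10-minute bucket, then keep the earliest row per bucket and unit); the helper names are made up for this example and the rows are copied from the question as posted.

```python
from datetime import datetime

# Sample rows copied from the question (the repeated id 1237 is kept as posted).
rows = [
    {"id": 1234, "unit": 2, "color": "red",   "time": "Wed, 16 Dec 2020 21:45:30"},
    {"id": 1235, "unit": 2, "color": "red",   "time": "Wed, 16 Dec 2020 21:47:30"},
    {"id": 1236, "unit": 6, "color": "blue",  "time": "Wed, 16 Dec 2020 21:48:30"},
    {"id": 1237, "unit": 6, "color": "green", "time": "Wed, 16 Dec 2020 21:49:30"},
    {"id": 1237, "unit": 6, "color": "blue",  "time": "Wed, 16 Dec 2020 21:49:37"},
]

def parse(ts):
    # Hypothetical helper: parse the timestamp format used in the question.
    return datetime.strptime(ts, "%a, %d %b %Y %H:%M:%S")

def bucket_start(ts):
    # Round the minute down to a multiple of 10 and drop seconds.
    return ts.replace(minute=ts.minute - ts.minute % 10, second=0, microsecond=0)

def earliest_per_bucket_and_unit(rows):
    first = {}
    # Visit rows in time order so the first row seen per key is the earliest.
    for row in sorted(rows, key=lambda r: parse(r["time"])):
        key = (bucket_start(parse(row["time"])), row["unit"])
        first.setdefault(key, row)
    return [first[k] for k in sorted(first)]

result = earliest_per_bucket_and_unit(rows)
print([r["id"] for r in result])
```

For the sample rows, all five timestamps fall into the 21:40 bucket, so one row per unit survives.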
I'm trying to find rows with N count of identifier A AND M count of identifier B in an array of structs within a Google BigQuery table, using the new Standard SQL. The data in the table (simplified) where each row looks a bit like this:
{
"Session": "abc123",
"Information": [
{
"Identifier": "A",
"Count": 1
},
{
"Identifier": "B",
"Count": 2
},
{
"Identifier": "C",
"Count": 3
}
...
]
}
I've been struggling to work with the struct in an array. Any way I can do that?
Below is for BigQuery Standard SQL
#standardSQL
SELECT *
FROM `project.dataset.table`
WHERE 2 = (SELECT COUNT(1) FROM UNNEST(information) kv WHERE kv IN (('a', 5), ('b', 10)))
Applied to dummy data, as in the example below
#standardSQL
WITH `project.dataset.table` AS (
SELECT 'abc123' session, [STRUCT('a' AS identifier, 1 AS `count`), ('b', 2), ('c', 3)] information UNION ALL
SELECT 'abc456', [('a', 5), ('b', 10), ('c', 20)]
)
SELECT *
FROM `project.dataset.table`
WHERE 2 = (SELECT COUNT(1) FROM UNNEST(information) kv WHERE kv IN (('a', 5), ('b', 10)))
result is
Row session information.identifier information.count
1 abc456 a 5
b 10
c 20
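The filter in the WHERE clause can be sketched in Python to show what it counts. This is an illustration only, not BigQuery code: each row's array is modelled as a list of (identifier, count) tuples, and the names `wanted` and `matches` are invented for the example. Like the query, it assumes an identifier does not appear twice in the same array.

```python
# Rows mirror the dummy data above; each array element is an (identifier, count) pair.
rows = [
    {"session": "abc123", "information": [("a", 1), ("b", 2), ("c", 3)]},
    {"session": "abc456", "information": [("a", 5), ("b", 10), ("c", 20)]},
]

# The pairs that must all be present in a row's array.
wanted = {("a", 5), ("b", 10)}

def matches(row, wanted):
    # Count the array elements equal to one of the wanted pairs, then require
    # one hit per wanted pair (the "2 = (SELECT COUNT(1) ...)" comparison).
    hits = sum(1 for kv in row["information"] if kv in wanted)
    return hits == len(wanted)

matching_sessions = [r["session"] for r in rows if matches(r, wanted)]
print(matching_sessions)
```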
I have a dataset in Google BigQuery with vehicle positions over time and the direction they are going relative to a base, like
time | x | y | direction | vehicle_id
-----|-----|-----|-----------|-----------
0:00 | ... | ... | returning | 100
0:00 | ... | ... | returning | 200
0:00 | ... | ... | exploring | 300
0:05 | ... | ... | returning | 100
0:05 | ... | ... | exploring | 200
0:05 | ... | ... | exploring | 300
0:10 | ... | ... | exploring | 100
0:10 | ... | ... | exploring | 200
0:10 | ... | ... | exploring | 300
0:15 | ... | ... | exploring | 100
0:15 | ... | ... | exploring | 200
0:15 | ... | ... | returning | 300
I'm able to aggregate by vehicle easily, but I can't come up with a query that can break each vehicle series into 'trips', consisting of sequential occurrences of 'returning' or 'exploring'. I have read about analytic functions but none seem to fit the bill.
SELECT
vehicle_id,
ARRAY_AGG(
STRUCT(direction, time, x, y)
ORDER BY time) as series
FROM t
GROUP BY vehicle_id;
[
{
"vehicle_id": 100,
"series":
[
{"direction": "returning", "time": "0:00", "x": ..., "y": ...},
{"direction": "returning", "time": "0:05", "x": ..., "y": ...},
{"direction": "exploring", "time": "0:10", "x": ..., "y": ...},
{"direction": "exploring", "time": "0:15", "x": ..., "y": ...}
]
},
{
"vehicle_id": 200,
"series":
[
{"direction": "returning", "time": "0:00", "x": ..., "y": ...},
{"direction": "exploring", "time": "0:05", "x": ..., "y": ...},
{"direction": "exploring", "time": "0:10", "x": ..., "y": ...},
{"direction": "exploring", "time": "0:15", "x": ..., "y": ...}
]
},
{
"vehicle_id": 300,
"series":
[
{"direction": "exploring", "time": "0:00", "x": ..., "y": ...},
{"direction": "exploring", "time": "0:05", "x": ..., "y": ...},
{"direction": "exploring", "time": "0:10", "x": ..., "y": ...},
{"direction": "returning", "time": "0:15", "x": ..., "y": ...}
]
}
]
What I really want is to have a sequence of trips by vehicle, where each trip has a direction and a series of (t, x, y) positions. Is that possible to do?
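Below is for BigQuery Standard SQL and uses pure SQL to achieve the same result on the sample data. Note that it groups by direction, so it matches a split into consecutive trips only when each direction occurs as a single run per vehicle, as it does in the sample.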
#standardSQL
SELECT vehicle_id, ARRAY_AGG(STRUCT(direction, trip)) trips
FROM (
SELECT vehicle_id, direction, ARRAY_AGG(STRUCT(time, x, y) ORDER BY time) trip
FROM dataset
GROUP BY vehicle_id, direction
)
GROUP BY vehicle_id
Applied to the sample data from your question, as in the example below
#standardSQL
WITH dataset AS (
SELECT
TIMESTAMP '2019-09-07 00:00:00' AS time,
0.1 AS x, 0.1 AS y, 'returning' AS direction,
100 AS vehicle_id
UNION ALL SELECT TIMESTAMP '2019-09-07 00:00:00', 0.2, 0.2, 'returning', 200
UNION ALL SELECT TIMESTAMP '2019-09-07 00:00:00', 0.3, 0.3, 'exploring', 300
UNION ALL SELECT TIMESTAMP '2019-09-07 00:00:05', 1.1, 1.1, 'returning', 100
UNION ALL SELECT TIMESTAMP '2019-09-07 00:00:05', 1.2, 1.2, 'exploring', 200
UNION ALL SELECT TIMESTAMP '2019-09-07 00:00:05', 1.3, 1.3, 'exploring', 300
UNION ALL SELECT TIMESTAMP '2019-09-07 00:00:10', 2.1, 2.1, 'exploring', 100
UNION ALL SELECT TIMESTAMP '2019-09-07 00:00:10', 2.2, 2.2, 'exploring', 200
UNION ALL SELECT TIMESTAMP '2019-09-07 00:00:10', 2.3, 2.3, 'exploring', 300
UNION ALL SELECT TIMESTAMP '2019-09-07 00:00:15', 3.1, 3.1, 'exploring', 100
UNION ALL SELECT TIMESTAMP '2019-09-07 00:00:15', 3.2, 3.2, 'exploring', 200
UNION ALL SELECT TIMESTAMP '2019-09-07 00:00:15', 3.3, 3.3, 'returning', 300
)
SELECT vehicle_id, ARRAY_AGG(STRUCT(direction, trip)) trips
FROM (
SELECT vehicle_id, direction, ARRAY_AGG(STRUCT(time, x, y) ORDER BY time) trip
FROM dataset
GROUP BY vehicle_id, direction
)
GROUP BY vehicle_id
result is
I wasn't able to come up with a pure SQL solution, but BigQuery does provide a way to execute arbitrary processing within itself in the form of user-defined functions (UDF).
By aggregating the entire series of a vehicle into an array, we can feed it into a JavaScript function that performs the necessary logic and splits the series into a sequence of trips.
CREATE TEMPORARY FUNCTION split_trips(
series ARRAY<STRUCT<direction STRING,
time TIMESTAMP,
x FLOAT64,
y FLOAT64>>)
RETURNS ARRAY<STRUCT<direction STRING,
trip ARRAY<STRUCT<time TIMESTAMP,
x FLOAT64,
y FLOAT64>>>>
LANGUAGE js AS """
if (series.length == 0) {
return [];
}
let result = [];
let trip = [];
for (let i = 0; i < series.length-1; i++) {
let {direction, time, x, y} = series[i];
trip.push({time: time, x: x, y: y});
if (direction == series[i+1].direction) {
continue;
}
result.push({direction: direction, trip: trip});
trip = [];
}
let lastEntry = series[series.length-1];
trip.push({time: lastEntry.time, x: lastEntry.x, y: lastEntry.y});
result.push({direction: lastEntry.direction, trip: trip});
return result;
""";
WITH dataset AS (
SELECT
TIMESTAMP '2019-09-07 00:00:00' AS time,
0.1 AS x, 0.1 AS y, 'returning' AS direction,
100 AS vehicle_id
UNION ALL SELECT TIMESTAMP '2019-09-07 00:00:00', 0.2, 0.2, 'returning', 200
UNION ALL SELECT TIMESTAMP '2019-09-07 00:00:00', 0.3, 0.3, 'exploring', 300
UNION ALL SELECT TIMESTAMP '2019-09-07 00:00:05', 1.1, 1.1, 'returning', 100
UNION ALL SELECT TIMESTAMP '2019-09-07 00:00:05', 1.2, 1.2, 'exploring', 200
UNION ALL SELECT TIMESTAMP '2019-09-07 00:00:05', 1.3, 1.3, 'exploring', 300
UNION ALL SELECT TIMESTAMP '2019-09-07 00:00:10', 2.1, 2.1, 'exploring', 100
UNION ALL SELECT TIMESTAMP '2019-09-07 00:00:10', 2.2, 2.2, 'exploring', 200
UNION ALL SELECT TIMESTAMP '2019-09-07 00:00:10', 2.3, 2.3, 'exploring', 300
UNION ALL SELECT TIMESTAMP '2019-09-07 00:00:15', 3.1, 3.1, 'exploring', 100
UNION ALL SELECT TIMESTAMP '2019-09-07 00:00:15', 3.2, 3.2, 'exploring', 200
UNION ALL SELECT TIMESTAMP '2019-09-07 00:00:15', 3.3, 3.3, 'returning', 300
),
by_vehicle AS (
SELECT
vehicle_id,
ARRAY_AGG(STRUCT(direction, time, x, y)
ORDER BY TIME) AS series
FROM dataset
GROUP BY vehicle_id
)
SELECT
vehicle_id,
split_trips(series) AS trips
FROM by_vehicle
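For readers who want to see the run-splitting logic on its own, here is a minimal Python sketch of what split_trips does: it cuts a time-ordered series wherever the direction changes. It is an illustration, not a drop-in replacement for the UDF. The first four points are vehicle 100 from the sample data; the fifth point is hypothetical and was added only to show that a second 'returning' run becomes its own trip.

```python
from itertools import groupby

def split_trips(series):
    """Split a time-ordered series into runs of consecutive equal direction."""
    trips = []
    # groupby only merges adjacent items, which is exactly the "trip" notion here.
    for direction, run in groupby(series, key=lambda p: p["direction"]):
        trips.append({
            "direction": direction,
            "trip": [{"time": p["time"], "x": p["x"], "y": p["y"]} for p in run],
        })
    return trips

series = [
    {"direction": "returning", "time": "00:00:00", "x": 0.1, "y": 0.1},
    {"direction": "returning", "time": "00:00:05", "x": 1.1, "y": 1.1},
    {"direction": "exploring", "time": "00:00:10", "x": 2.1, "y": 2.1},
    {"direction": "exploring", "time": "00:00:15", "x": 3.1, "y": 3.1},
    # Hypothetical extra point, not in the original data: a second "returning" run.
    {"direction": "returning", "time": "00:00:20", "x": 4.1, "y": 4.1},
]

trips = split_trips(series)
print([(t["direction"], len(t["trip"])) for t in trips])
```

Grouping by direction alone would instead merge the first and last points into a single 'returning' group, which is the difference between the two approaches.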
The BigQuery documentation notes that each function invocation can produce at most 5 MiB of data, so, counting the timestamp (64 bits) and two floats (64 bits each) per entry, that gives us at most ~200K entries to manipulate at once for each vehicle.
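The estimate can be checked with a few lines of arithmetic, shown here in Python. It assumes each entry costs only its raw 24 bytes; any per-entry overhead in how the UDF output is actually encoded is not accounted for, so treat the figure as an optimistic upper bound.

```python
# Rough upper bound, assuming each entry costs only its raw payload.
LIMIT_BYTES = 5 * 1024 * 1024      # 5 MiB per invocation, as quoted above
BYTES_PER_ENTRY = 8 + 8 + 8        # one 64-bit timestamp + two 64-bit floats

max_entries = LIMIT_BYTES // BYTES_PER_ENTRY
print(max_entries)  # 218453, i.e. roughly 200K entries
```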
Is there any easy way for me to do something like Ocaml's fold_left on a result of a BigQuery query, where each iteration corresponds to one row in the result?
What product or approach would be the easiest way? It would be great if:
all I need to do is to supply the initial state and the 'folder' function
preferably, I'd like to write the 'folder' function in a functional language
I don't need to install any GCP package
Since I don't know which product or language would work, I cannot be more specific, but pseudocode would be like:
let my_init = []
let my_folder = fun state row ->
// append for now, but it will be complicated. I need to do some set operations here. The point is that I need some way of transferring "state" across rows, when I iterate over rows in a predefined order.
row.col1 :: state
let query = "SELECT col1, col2, col3 FROM table1 ORDER BY timestamp"
query |> List.fold my_folder my_init
The result that I want to get from this simplified example is the final "state".
--- UPDATED ---
There is no bound on the number of rows---if we receive more, we get more rows. Typically, the number is more than a few millions but it can be larger than that.
Here's a simplified example that shows the major problem I'm encountering. We have a table with a few columns:
timestamp
user_id: a string id
operation_json: a stringified JSON object, which is a list of operations, each of which corresponds to either:
add user_id to a set
remove user_id from a set
For example, the following are valid rows:
----------+---------+----------------------------------------------
timestamp | user_id | operation_json
----------+---------+----------------------------------------------
1 | id1 | [ { "op": "add", "set": "set1" } ]
2 | id2 | [ { "op": "add", "set": "set1" } ]
3 | id1 | [ { "op": "add", "set": "set2" } ]
4 | id3 | [ { "op": "add", "set": "set2" } ]
5 | id1 | [ { "op": "remove", "set": "set1" } ]
----------+---------+----------------------------------------------
As a result, I'd like to get sets of users; i.e.,
set1 |-> { id2 }
set2 |-> { id1, id3 }
I thought a fold_left-like operation would be convenient. The state would be a map from set name to a set of user ids, and the initial state would be an empty map.
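To pin down the intended semantics, here is a minimal sketch of that fold in Python (rather than OCaml), run over the five example rows. It assumes the rows fit in memory and are folded in timestamp order; the function name `folder` follows the pseudocode above, and everything else is illustrative.

```python
import json
from functools import reduce

# (timestamp, user_id, operation_json) rows copied from the example table.
rows = [
    (1, "id1", '[ { "op": "add", "set": "set1" } ]'),
    (2, "id2", '[ { "op": "add", "set": "set1" } ]'),
    (3, "id1", '[ { "op": "add", "set": "set2" } ]'),
    (4, "id3", '[ { "op": "add", "set": "set2" } ]'),
    (5, "id1", '[ { "op": "remove", "set": "set1" } ]'),
]

def folder(state, row):
    # state maps a set name to the set of user ids currently in it.
    _, user_id, operation_json = row
    for op in json.loads(operation_json):
        members = state.setdefault(op["set"], set())
        if op["op"] == "add":
            members.add(user_id)
        elif op["op"] == "remove":
            members.discard(user_id)
    return state

# Sorting the tuples orders them by timestamp, their first field.
final_state = reduce(folder, sorted(rows), {})
print(final_state)
```

The sketch mutates and returns the same dictionary for brevity; a strictly functional version would build a new map at each step.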
Below is a [quick and simple] example for BigQuery Standard SQL
#standardSQL
CREATE TEMP FUNCTION fold(arr ARRAY<INT64>, init INT64)
RETURNS FLOAT64
LANGUAGE js AS """
const reducer = (accumulator, currentValue) => accumulator + parseInt(currentValue);
// Start from the supplied initial state rather than a hard-coded constant.
return arr.reduce(reducer, parseInt(init));
""";
WITH `project.dataset.table` AS (
SELECT 1 id, [1, 2, 3, 4] arr, 5 initial_state UNION ALL
SELECT 2, [1, 2, 3, 4, 5, 6, 7], 10
)
SELECT id, fold(arr, initial_state) result
FROM `project.dataset.table`
output is
Row id result
1 1 15.0
2 2 38.0
I think it is self-explanatory enough
See more for JS UDF
folding list of rows
See the extension of the above below
Here you assemble an array from the result's rows before applying the fold function (of course, keep in mind the limits that apply to UDFs, how big your ARRAY of rows can get, etc.).
#standardSQL
CREATE TEMP FUNCTION fold(arr ARRAY<INT64>, init INT64)
RETURNS FLOAT64
LANGUAGE js AS """
const reducer = (accumulator, currentValue) => accumulator + parseInt(currentValue);
return arr.reduce(reducer, parseInt(init));
""";
WITH `project.dataset.table` AS (
SELECT 1 id, 1 item UNION ALL
SELECT 1, 2 UNION ALL
SELECT 1, 3 UNION ALL
SELECT 1, 4 UNION ALL
SELECT 2, 1 UNION ALL
SELECT 2, 2 UNION ALL
SELECT 2, 3 UNION ALL
SELECT 2, 4 UNION ALL
SELECT 2, 5 UNION ALL
SELECT 2, 6 UNION ALL
SELECT 2, 7
)
SELECT id, fold(ARRAY_AGG(item), 5) result
FROM `project.dataset.table`
GROUP BY id
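The same "collect the rows per id into an array, then fold" pattern can be sketched in plain Python to make the data flow explicit. This is only an illustration of the idea, with an initial state of 5 as in the query; the variable names are invented for the example.

```python
from collections import defaultdict
from functools import reduce

# (id, item) rows matching the dummy table above.
rows = [(1, 1), (1, 2), (1, 3), (1, 4),
        (2, 1), (2, 2), (2, 3), (2, 4), (2, 5), (2, 6), (2, 7)]

def fold(arr, init):
    # Sum the items, starting from the initial state.
    return reduce(lambda acc, item: acc + item, arr, init)

# Equivalent of ARRAY_AGG(item) ... GROUP BY id.
grouped = defaultdict(list)
for row_id, item in rows:
    grouped[row_id].append(item)

result = {row_id: fold(items, 5) for row_id, items in grouped.items()}
print(result)
```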
Note: if you need to include more than one field from each row, you can use an ARRAY of STRUCTs, as in the example below
ARRAY_AGG(STRUCT(id, item) ORDER BY id)
Of course, you will need to adjust the signature of the fold UDF accordingly
For example:
#standardSQL
CREATE TEMP FUNCTION fold(arr ARRAY<STRUCT<id INT64, item INT64>>, init INT64)
RETURNS FLOAT64
LANGUAGE js AS """
const reducer = (accumulator, currentValue) => accumulator + parseInt(currentValue.item);
return arr.reduce(reducer, parseInt(init));
""";
WITH `project.dataset.table` AS (
SELECT 1 id, 1 item UNION ALL
SELECT 1, 2 UNION ALL
SELECT 1, 3 UNION ALL
SELECT 1, 4 UNION ALL
SELECT 2, 1 UNION ALL
SELECT 2, 2 UNION ALL
SELECT 2, 3 UNION ALL
SELECT 2, 4 UNION ALL
SELECT 2, 5 UNION ALL
SELECT 2, 6 UNION ALL
SELECT 2, 7
)
SELECT id, fold(ARRAY_AGG(t), 5) result
FROM `project.dataset.table` t
GROUP BY id
The approach below has nothing to do with folding per se; rather, it attempts to translate your challenge into a set-based one (which is more natural when you are dealing with SQL). It identifies the latest op for each user per set: if it is "remove", that user is eliminated from further consideration for that set; if it is "add", the latest "add" for that user / set is used. This assumes that there cannot be multiple consecutive "add" actions for the same user / set; rather, the sequence is add / remove / add and so on. Of course, this can be further adjusted based on the real use case.
With the above in mind, below is an example for BigQuery Standard SQL
#standardSQL
WITH `project.dataset.table` AS (
SELECT 1 ts, 'id1' user_id, '[ { "op": "add", "set": "set1" } ]' operation_json UNION ALL
SELECT 2, 'id2', '[ { "op": "add", "set": "set1" } ]' UNION ALL
SELECT 3, 'id1', '[ { "op": "add", "set": "set2" } ]' UNION ALL
SELECT 4, 'id3', '[ { "op": "add", "set": "set2" } ]' UNION ALL
SELECT 5, 'id1', '[ { "op": "remove", "set": "set1" } ]'
)
SELECT bin, STRING_AGG(user_id, ',' ORDER BY ts) result
FROM (
SELECT user_id, bin, ARRAY_AGG(ts ORDER BY ts DESC LIMIT 1)[OFFSET(0)] ts
FROM (
SELECT ts, user_id, op, bin, LAST_VALUE(op) OVER(win) fin
FROM (
SELECT ts, user_id,
JSON_EXTRACT_SCALAR(REGEXP_REPLACE(operation_json, r'^\[|\]$', ''), '$.op') op,
JSON_EXTRACT_SCALAR(REGEXP_REPLACE(operation_json, r'^\[|\]$', ''), '$.set') bin
FROM `project.dataset.table`
)
WINDOW win AS (
PARTITION BY user_id, bin
ORDER BY ts
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
)
WHERE fin = 'add'
GROUP BY user_id, bin
)
GROUP BY bin
-- ORDER BY bin
output is
Row bin result
1 set1 id2
2 set2 id1,id3
If applied to the dummy data below
WITH `project.dataset.table` AS (
SELECT 1 ts, 'id1' user_id, '[ { "op": "add", "set": "set1" } ]' operation_json UNION ALL
SELECT 2, 'id2', '[ { "op": "add", "set": "set1" } ]' UNION ALL
SELECT 3, 'id1', '[ { "op": "add", "set": "set2" } ]' UNION ALL
SELECT 4, 'id3', '[ { "op": "add", "set": "set2" } ]' UNION ALL
SELECT 5, 'id1', '[ { "op": "remove", "set": "set1" } ]' UNION ALL
SELECT 6, 'id1', '[ { "op": "add", "set": "set1" } ]' UNION ALL
SELECT 7, 'id1', '[ { "op": "remove", "set": "set1" } ]' UNION ALL
SELECT 8, 'id1', '[ { "op": "add", "set": "set1" } ]' UNION ALL
SELECT 9, 'id1', '[ { "op": "remove", "set": "set2" } ]' UNION ALL
SELECT 10, 'id1', '[ { "op": "add", "set": "set2" } ]'
)
the result will be
Row bin result
1 set1 id2,id1
2 set2 id3,id1
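The "latest op wins" idea behind the query can be sketched in Python as follows. It is an illustration of the logic only, using the second set of dummy rows; it keeps the latest (timestamp, op) per user and set, then lists the users whose latest op is "add", ordered by that timestamp.

```python
import json

rows = [
    (1, "id1", '[ { "op": "add", "set": "set1" } ]'),
    (2, "id2", '[ { "op": "add", "set": "set1" } ]'),
    (3, "id1", '[ { "op": "add", "set": "set2" } ]'),
    (4, "id3", '[ { "op": "add", "set": "set2" } ]'),
    (5, "id1", '[ { "op": "remove", "set": "set1" } ]'),
    (6, "id1", '[ { "op": "add", "set": "set1" } ]'),
    (7, "id1", '[ { "op": "remove", "set": "set1" } ]'),
    (8, "id1", '[ { "op": "add", "set": "set1" } ]'),
    (9, "id1", '[ { "op": "remove", "set": "set2" } ]'),
    (10, "id1", '[ { "op": "add", "set": "set2" } ]'),
]

# Latest (ts, op) per (user_id, set); later rows overwrite earlier ones.
latest = {}
for ts, user_id, operation_json in sorted(rows):
    for op in json.loads(operation_json):
        latest[(user_id, op["set"])] = (ts, op["op"])

# Keep users whose latest op is "add", ordered by the timestamp of that op.
members = {}
for (user_id, set_name), (ts, op) in sorted(latest.items(), key=lambda kv: kv[1][0]):
    if op == "add":
        members.setdefault(set_name, []).append(user_id)

print(members)
```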
I have prepared a simple SQL Fiddle demonstrating my problem.
In a two-player game I store user chats in a table:
CREATE TABLE chat(
gid integer, /* game id */
uid integer, /* user id */
created timestamptz,
msg text
);
Here I fill the table with some simple test data:
INSERT INTO chat(gid, uid, created, msg) VALUES
(10, 1, NOW() + interval '1 min', 'msg 1'),
(10, 2, NOW() + interval '2 min', 'msg 2'),
(10, 1, NOW() + interval '3 min', 'msg 3'),
(10, 2, NOW() + interval '4 min', 'msg 4'),
(10, 1, NOW() + interval '5 min', 'msg 5'),
(10, 2, NOW() + interval '6 min', 'msg 6'),
(20, 3, NOW() + interval '7 min', 'msg 7'),
(20, 4, NOW() + interval '8 min', 'msg 8'),
(20, 4, NOW() + interval '9 min', 'msg 9');
And I can fetch the data by running the SELECT query:
SELECT ARRAY_TO_JSON(
COALESCE(ARRAY_AGG(ROW_TO_JSON(x)),
array[]::json[])) FROM (
SELECT
gid,
uid,
EXTRACT(EPOCH FROM created)::int AS created,
msg
FROM chat) x;
which returns a JSON array:
[{"gid":10,"uid":1,"created":1514813043,"msg":"msg 1"},
{"gid":10,"uid":2,"created":1514813103,"msg":"msg 2"},
{"gid":10,"uid":1,"created":1514813163,"msg":"msg 3"},
{"gid":10,"uid":2,"created":1514813223,"msg":"msg 4"},
{"gid":10,"uid":1,"created":1514813283,"msg":"msg 5"},
{"gid":10,"uid":2,"created":1514813343,"msg":"msg 6"},
{"gid":20,"uid":3,"created":1514813403,"msg":"msg 7"},
{"gid":20,"uid":4,"created":1514813463,"msg":"msg 8"},
{"gid":20,"uid":4,"created":1514813523,"msg":"msg 9"}]
This is close to what I need; however, I would like to use "gid" as the JSON object's property names and the rest of the data as the values in that object:
{"10": [{"uid":1,"created":1514813043,"msg":"msg 1"},
{"uid":2,"created":1514813103,"msg":"msg 2"},
{"uid":1,"created":1514813163,"msg":"msg 3"},
{"uid":2,"created":1514813223,"msg":"msg 4"},
{"uid":1,"created":1514813283,"msg":"msg 5"},
{"uid":2,"created":1514813343,"msg":"msg 6"}],
"20": [{"uid":3,"created":1514813403,"msg":"msg 7"},
{"uid":4,"created":1514813463,"msg":"msg 8"},
{"uid":4,"created":1514813523,"msg":"msg 9"}]}
Is that doable using the PostgreSQL JSON functions?
I think you're looking for json_object_agg for that last step. Here is how I'd do it:
SELECT json_object_agg(
gid::text, array_to_json(ar)
)
FROM (
SELECT gid,
array_agg(
json_build_object(
'uid', uid,
'created', EXTRACT(EPOCH FROM created)::int,
'msg', msg)
) AS ar
FROM chat
GROUP BY gid
) x
;
I left off the coalesce because I don't think an empty array is possible. But it should be easy to put it back if your real query is something more complicated that could require it.
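To show the target shape independently of PostgreSQL, here is a minimal Python sketch of the same two-step grouping: build one object per message, collect the objects per gid, then key the result by gid as text. The rows are taken from the JSON output in the question; the variable names are illustrative only.

```python
import json

# (gid, uid, created, msg) rows taken from the question's output.
rows = [
    (10, 1, 1514813043, "msg 1"),
    (10, 2, 1514813103, "msg 2"),
    (10, 1, 1514813163, "msg 3"),
    (10, 2, 1514813223, "msg 4"),
    (10, 1, 1514813283, "msg 5"),
    (10, 2, 1514813343, "msg 6"),
    (20, 3, 1514813403, "msg 7"),
    (20, 4, 1514813463, "msg 8"),
    (20, 4, 1514813523, "msg 9"),
]

chats = {}
for gid, uid, created, msg in rows:
    # JSON object keys are strings, hence str(gid) (gid::text in the query).
    chats.setdefault(str(gid), []).append(
        {"uid": uid, "created": created, "msg": msg}
    )

as_json = json.dumps(chats)
print(as_json)
```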