Distinct key with ARRAY_CONCAT with Struct<String, String> - google-bigquery

I need to do the following with two array fields in the table below. The arrays hold elements of type Struct<String, String>.
1. Merge the arrays together.
2. If there is a duplicate key between labels.key and project.key, keep only the key-value pair from the labels field.
3. Flatten the combined array into a delimited string and order the entries (so I can group by it).
Example Table Data
SELECT 1 as id, ARRAY
[STRUCT("testlabel2" as key, "thisvalueisbetter" as value), STRUCT("testlabel3", "testvalue3")] as labels,
[STRUCT("testlabel2" as key, "testvalue2" as value)] as project
The below query does everything except #2 and I'm not sure how to accomplish that. Does anyone have a suggestion on how to do this?
SELECT
id,
(SELECT STRING_AGG(DISTINCT CONCAT(l.key, ':', l.value) ORDER BY CONCAT(l.key, ':', l.value))
FROM UNNEST(
ARRAY_CONCAT(labels, project)) AS l) AS label,
FROM `mytestdata` AS t
GROUP BY id, label
Currently this query gives the output:
1 testlabel2:testvalue2,testlabel2:thisvalueisbetter,testlabel3:testvalue3
But I'm looking for:
1 testlabel2:thisvalueisbetter,testlabel3:testvalue3

Below is for BigQuery Standard SQL
#standardSQL
SELECT *,
ARRAY(
SELECT AS STRUCT key, ARRAY_AGG(value ORDER BY source LIMIT 1)[OFFSET(0)] AS value
FROM (
SELECT 0 AS source, * FROM t.labels UNION ALL
SELECT 1, * FROM t.project
)
GROUP BY key
) AS combined_array
FROM `project.dataset.table` t
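The precedence logic in this query can also be sketched outside SQL. The Python below is only an illustration, not part of the BigQuery solution: `merge_labels` and `to_delimited` are hypothetical helper names, and the data is the sample row from the question. Each pair is tagged with a source rank (0 for labels, 1 for project), and for every key the value with the lowest rank wins.

```python
def merge_labels(labels, project):
    """Merge two lists of (key, value) pairs; on duplicate keys, labels wins."""
    # Tag each pair with a source rank, mirroring
    # `SELECT 0 AS source, * ... UNION ALL SELECT 1, * ...`.
    tagged = [(0, k, v) for k, v in labels] + [(1, k, v) for k, v in project]
    best = {}
    for source, key, value in tagged:
        # Keep the value with the lowest source rank per key, like
        # ARRAY_AGG(value ORDER BY source LIMIT 1)[OFFSET(0)] ... GROUP BY key.
        if key not in best or source < best[key][0]:
            best[key] = (source, value)
    return [(key, value) for key, (_source, value) in best.items()]


def to_delimited(pairs):
    """Flatten pairs into a sorted, comma-delimited 'key:value' string."""
    return ",".join(sorted(f"{k}:{v}" for k, v in pairs))


labels = [("testlabel2", "thisvalueisbetter"), ("testlabel3", "testvalue3")]
project = [("testlabel2", "testvalue2")]
combined = to_delimited(merge_labels(labels, project))
print(combined)
```

Sorting the `key:value` strings before joining is what makes the result stable enough to group by.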
You can test and play with the above using the sample data from your question, as in the example below
#standardSQL
WITH `project.dataset.table` AS (
SELECT ARRAY
[STRUCT("testlabel2" AS key, "thisvalueisbetter" AS value), STRUCT("testlabel3", "testvalue3")] AS labels,
[STRUCT("testlabel2" AS key, "testvalue2" AS value)] AS project
)
SELECT *,
ARRAY(
SELECT AS STRUCT key, ARRAY_AGG(value ORDER BY source LIMIT 1)[OFFSET(0)] AS value
FROM (
SELECT 0 AS source, * FROM t.labels UNION ALL
SELECT 1, * FROM t.project
)
GROUP BY key
) AS combined_array
FROM `project.dataset.table` t
Or ... to fully match your expected output - use below
#standardSQL
SELECT *,
(SELECT STRING_AGG(x) FROM (
SELECT CONCAT(key, ':', ARRAY_AGG(value ORDER BY source LIMIT 1)[OFFSET(0)]) x
FROM (
SELECT 0 AS source, * FROM t.labels UNION ALL
SELECT 1, * FROM t.project
)
GROUP BY key
)) AS combined_result
FROM `project.dataset.table` t

Related

BQ - getting a field from an array of structs without join

I have a table with the following columns:
items ARRAY<STRUCT<label STRING, counter INTEGER>>
explore BOOLEAN
For each record I would like to choose the label with the highest counter, and then count explore on each unique label.
Ideally I would like to run something like:
SELECT FIRST_VALUE(items.label) OVER (ORDER BY items.counter DESC) as label,
COUNT(explore) as explore
FROM my_table
GROUP BY 1
If this is the data in my table:
explore items
1 [['A',1],['B',3]]
1 [['B',1]]
0 [['C',2],['D',1]]
Then I would like to get:
label explore
'B' 2
'C' 1
Consider below approach
select ( select label from t.items
order by counter desc limit 1
) label,
count(*) explore
from your_table t
group by label
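As a rough cross-check of what this query computes, here is the same logic in plain Python. It is only a sketch: `top_label` is a hypothetical helper, and the rows are the sample data from the question. Note that, like `count(*)` in the query, it counts rows per label regardless of the value of explore.

```python
from collections import Counter

# Sample rows from the question: (explore, items), items being (label, counter) pairs.
rows = [
    (1, [("A", 1), ("B", 3)]),
    (1, [("B", 1)]),
    (0, [("C", 2), ("D", 1)]),
]


def top_label(items):
    """Return the label with the highest counter (ORDER BY counter DESC LIMIT 1)."""
    return max(items, key=lambda item: item[1])[0]


# Count rows per top label, like COUNT(*) ... GROUP BY label.
label_counts = Counter(top_label(items) for _explore, items in rows)
print(dict(label_counts))
```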
If applied to the sample data in your question (prepend this CTE to the query above):
with your_table as (
select 1 explore, [struct('A' as label, 1 as counter), struct('B' as label, 3 as counter) ] items union all
select 1, [struct('B', 1)] union all
select 0, [struct('C', 2), struct('D', 1) ]
)
Using your sample data, consider below approach.
with data as (
select 1 as explore, [STRUCT( 'A' as label, 1 as counter), STRUCT( 'B' as label, 3 as counter) ] as items,
union all select 1 as explore, [STRUCT( 'B' as label, 1 as counter)] as items,
union all select 0 as explore, [STRUCT( 'C' as label, 2 as counter), STRUCT( 'D' as label, 1 as counter) ] as items
),
add_row_num as (
SELECT
explore,
items,
row_number() over (order by explore desc) as row_number
FROM data
),
get_highest_label as (
select
explore,
row_number,
label,
counter,
first_value(label) over (partition by row_number order by counter desc) as highest_label_per_row
from add_row_num, unnest(items)
),
-- https://stackoverflow.com/questions/36675521/delete-duplicate-rows-from-a-bigquery-table (REMOVE DUPLICATE)
remove_dups as (
SELECT
*,
ROW_NUMBER()
OVER (PARTITION BY row_number) as new_row_number
FROM get_highest_label
)
select
highest_label_per_row,
count(highest_label_per_row) as explore,
from remove_dups
where new_row_number = 1
group by highest_label_per_row

How to get the most frequent value in Google's Bigquery

Postgres has an easy way to achieve this: the mode() function returns the most frequent value. Is there something equivalent in Google's BigQuery?
How could a query like this be written in BigQuery?
select count(*),
avg(vehicles) as mean,
percentile_cont(0.5) within group (order by vehicles) as median,
mode() within group (order by vehicles) as most_frequent_value
FROM "driver"
WHERE vehicles is not null;
Below is for BigQuery Standard SQL
Option 1
#standardSQL
SELECT * FROM (
SELECT COUNT(*) AS cnt,
AVG(vehicles) AS mean,
APPROX_TOP_COUNT(vehicles, 1)[OFFSET(0)].value AS most_frequent_value
FROM `project.dataset.table`
WHERE vehicles IS NOT NULL
) CROSS JOIN (
SELECT PERCENTILE_CONT(vehicles, 0.5) OVER() AS median
FROM `project.dataset.table`
WHERE vehicles IS NOT NULL
LIMIT 1
)
Option 2
#standardSQL
SELECT * FROM (
SELECT COUNT(*) cnt,
AVG(vehicles) AS mean
FROM `project.dataset.table`
WHERE vehicles IS NOT NULL
) CROSS JOIN (
SELECT PERCENTILE_CONT(vehicles, 0.5) OVER() AS median
FROM `project.dataset.table`
WHERE vehicles IS NOT NULL
LIMIT 1
) CROSS JOIN (
SELECT vehicles AS most_frequent_value
FROM `project.dataset.table`
WHERE vehicles IS NOT NULL
GROUP BY vehicles
ORDER BY COUNT(1) DESC
LIMIT 1
)
Option 3
#standardSQL
CREATE TEMP FUNCTION median(arr ANY TYPE) AS ((
SELECT PERCENTILE_CONT(x, 0.5) OVER()
FROM UNNEST(arr) x LIMIT 1
));
CREATE TEMP FUNCTION most_frequent_value(arr ANY TYPE) AS ((
SELECT x
FROM UNNEST(arr) x
GROUP BY x
ORDER BY COUNT(1) DESC
LIMIT 1
));
SELECT COUNT(*) cnt,
AVG(vehicles) AS mean,
median(ARRAY_AGG(vehicles)) AS median,
most_frequent_value(ARRAY_AGG(vehicles)) AS most_frequent_value
FROM `project.dataset.table`
WHERE vehicles IS NOT NULL
and so on ...
You can use APPROX_TOP_COUNT to get top values, e.g.:
SELECT APPROX_TOP_COUNT(vehicles, 5) AS top_five_vehicles
FROM dataset.driver
If you just want the top value, you can select it from the array:
SELECT APPROX_TOP_COUNT(vehicles, 1)[OFFSET(0)] AS most_frequent_value
FROM dataset.driver
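For comparison, the statistics these queries compute can be sketched exactly in plain Python. This is only an illustration under stated assumptions: the `vehicles` values are made up, `most_frequent_value` is a hypothetical helper, and the count here is exact, whereas APPROX_TOP_COUNT is approximate.

```python
from collections import Counter
from statistics import mean, median


def most_frequent_value(values):
    """Exact mode: the value with the highest count (ties go to the first seen)."""
    return Counter(values).most_common(1)[0][0]


# Hypothetical `vehicles` column; None stands in for SQL NULL.
vehicles = [2, 1, 2, 3, None, 2, 1]
non_null = [v for v in vehicles if v is not None]  # WHERE vehicles IS NOT NULL

summary = {
    "cnt": len(non_null),
    "mean": mean(non_null),
    "median": median(non_null),
    "most_frequent_value": most_frequent_value(non_null),
}
print(summary)
```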
No, there is no equivalent of the mode() function in BigQuery, but you can define one yourself using any of the logic in the other answers to this thread. You could call it like so:
SELECT mode(`an_array`) AS top_count FROM `somewhere_with_arrays`
But this approach leads to multiple per-row subqueries, which is terrible for performance, so if you have never ground BQ to a halt before, you can do it with these functions. I use it (the second one) only for readability in quick fixes on very small datasets.
Check out the two UDFs below. A third approach would be to implement a JS function, in which case this one-liner should be useful:
return arr.sort((a,b) => arr.filter(v => v===a).length - arr.filter(v => v===b).length).pop();
This code establishes two mode()-like functions which eat arrays and return most common string:
CREATE TEMPORARY FUNCTION mode1(mystring ANY TYPE)
RETURNS STRING
AS
(
(
SELECT var FROM
( /* Count occurrences of each value of input */
SELECT var, COUNT(*) AS n FROM
( /* Unnest and name*/
SELECT var FROM UNNEST(mystring) var
)
GROUP BY var /* Output is one of existing values */
) /* -------------------------------- */
ORDER BY n DESC /* Output is value with HIGHEST n; sorted at the same level as LIMIT */
LIMIT 1 /* Only ONE string is the output */
)
);
CREATE TEMPORARY FUNCTION mode2(inp ANY TYPE)
RETURNS STRING
AS
(
(
SELECT result.value FROM UNNEST( (SELECT APPROX_TOP_COUNT(v,1) AS result FROM UNNEST(inp) v)) result
)
);
SELECT
inp,
mode1(inp) AS first_logic_output,
mode2(inp) AS second_logic_output
FROM
(
/* Test data */
SELECT ['Erdős','Turán', 'Erdős','Turán','Euler','Erdős'] AS inp
UNION ALL
SELECT ['Euler','Euler', 'Gauss', 'Euler'] AS inp
)
The method I prefer is to query off of an array, since you can easily adjust the criteria of the mode. Below are two examples, one using a limit and one using an offset. With the offset you can grab the Nth most or least frequent value.
WITH t AS (SELECT 18 AS length,
'HIGH' as amps,
99.95 price UNION ALL
SELECT 18, "HIGH", 99.95 UNION ALL
SELECT 18, "HIGH", 5.95 UNION ALL
SELECT 18, "LOW", 33.95 UNION ALL
SELECT 18, "LOW", 33.95 UNION ALL
SELECT 18, "LOW", 4.5 UNION ALL
SELECT 3, "HIGH", 77.95 UNION ALL
SELECT 3, "HIGH", 77.95 UNION ALL
SELECT 3, "HIGH", 9.99 UNION ALL
SELECT 3, "LOW", 44.95 UNION ALL
SELECT 3, "LOW", 44.95 UNION ALL
SELECT 3, "LOW", 5.65
)
SELECT
length,
amps,
-- By Limit
(SELECT x FROM UNNEST(price_array) x
GROUP BY x ORDER BY COUNT(*) DESC LIMIT 1 ) most_freq_price,
(SELECT x FROM UNNEST(price_array) x
GROUP BY x ORDER BY COUNT(*) ASC LIMIT 1 ) least_freq_price,
-- By Offset
ARRAY((SELECT x FROM UNNEST(price_array) x
GROUP BY x ORDER BY COUNT(*) DESC))[OFFSET(0)] most_freq_price_offset,
ARRAY((SELECT x FROM UNNEST(price_array) x
GROUP BY x ORDER BY COUNT(*) ASC))[OFFSET(0)] least_freq_price_offset
FROM (
SELECT
length,
amps,
ARRAY_AGG(price) price_array
FROM t
GROUP BY 1,2
)
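The "Nth most frequent value per group" idea can be sketched in Python as well. This is a minimal sketch rather than a translation of the query: `nth_by_frequency` is a hypothetical helper, and only the length-18 rows of the sample data are used to keep it short.

```python
from collections import Counter, defaultdict

# The length-18 rows from the sample data above: (length, amps, price).
rows = [
    (18, "HIGH", 99.95), (18, "HIGH", 99.95), (18, "HIGH", 5.95),
    (18, "LOW", 33.95), (18, "LOW", 33.95), (18, "LOW", 4.5),
]


def nth_by_frequency(values, offset=0, descending=True):
    """Return the value at position `offset` when values are ranked by frequency."""
    counts = Counter(values)
    ranked = sorted(counts, key=lambda v: counts[v], reverse=descending)
    return ranked[offset]


# ARRAY_AGG(price) ... GROUP BY length, amps
price_arrays = defaultdict(list)
for length, amps, price in rows:
    price_arrays[(length, amps)].append(price)

# (most frequent, least frequent) price per group
result = {
    group: (nth_by_frequency(prices), nth_by_frequency(prices, descending=False))
    for group, prices in price_arrays.items()
}
print(result)
```

Passing `offset=1` would return the second most (or second least) frequent price, which is the flexibility the offset variant of the query offers.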

Bigquery- Struct format

WITH yourTable AS (
SELECT 1 AS id, '2013,1625,1297,7634' AS string_col UNION ALL
SELECT 2, '1,2,3,4,5'
)
SELECT id,
(SELECT ARRAY_AGG(CAST(num AS INT64))
FROM UNNEST(SPLIT(string_col)) AS num
) AS num,
ARRAY(SELECT CAST(num AS INT64)
FROM UNNEST(SPLIT(string_col)) AS num
) AS num_2
FROM yourTable
This is exactly how my actual table is designed. Now I would like to multiply num * num_2 element by element and then sum the products. Is there a way to get this into a struct format such as id, nums.num, nums.num_2 so that I can simply multiply and get the result I need?
PS: I am looking for a solution within the SELECT statement above, not within the WITH clause.
OK, assuming you really have a reason to keep your table the way it is (see my comment on your question), the query below should work
#standardSQL
SELECT id,
(
SELECT SUM(num * num_2)
FROM (SELECT pos, num FROM UNNEST(num) num WITH OFFSET pos) a
JOIN (SELECT pos_2, num_2 FROM UNNEST(num_2) num_2 WITH OFFSET pos_2) b
ON a.pos = b.pos_2
) mul
FROM yourTable
You can test it with the following
#standardSQL
WITH yourTable AS (
SELECT 1 id, [2013,1625,1297,7634] num, [2013,1625,1297,7634] num_2 UNION ALL
SELECT 2, [1,2,3,4,5], [1,2,3,4,5]
)
SELECT id,
(
SELECT SUM(num * num_2)
FROM (SELECT pos, num FROM UNNEST(num) num WITH OFFSET pos) a
JOIN (SELECT pos_2, num_2 FROM UNNEST(num_2) num_2 WITH OFFSET pos_2) b
ON a.pos = b.pos_2
) mul
FROM yourTable
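The join on array offsets pairs elements by position, which is the same idea as `zip` in Python. The sketch below only illustrates that pairing; `sum_of_products` is a hypothetical helper and the rows mirror the test data above.

```python
def sum_of_products(num, num_2):
    """Pair elements by position and sum the products (the join on pos = pos_2)."""
    # zip() stops at the shorter list, so unmatched positions are dropped,
    # just as an inner join on the offsets would drop them.
    return sum(a * b for a, b in zip(num, num_2))


your_table = [
    {"id": 1, "num": [2013, 1625, 1297, 7634], "num_2": [2013, 1625, 1297, 7634]},
    {"id": 2, "num": [1, 2, 3, 4, 5], "num_2": [1, 2, 3, 4, 5]},
]

mul = {row["id"]: sum_of_products(row["num"], row["num_2"]) for row in your_table}
print(mul)
```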

Combine the most recent entries from a number of tables

I have a master table with a number of IDs in it:
ID ...
0 ...
1 ...
And multiple tables (say vtbl1, vtbl2, vtbl3) with a foreign key to master, a timestamp and a value:
ID Timestamp Value
0 01/01/01.. 2
1 01/01/02.. 7
0 01/01/03.. 5
I would like to get one or more entries for each ID in master with an entry (or null if no entries exist) containing the most recent entry in each v... table (grouped by timestamps):
ID Timestamp vtbl1.Value vtbl2.Value vtbl3.value
0 01/01/03.. 5 2
0 01/01/01.. 4
1 01/01/02.. 7 4 9
I'm sure this is fairly simple but my SQL is rusty and I've been going in circles. Any help would be appreciated.
Clarification
These values come from one or more sensors able to read one or more of the values. So the latest value in each value table for the ID is to be considered the current system state for that ID. If the timestamps match they are considered one update.
I need the minimal set of updates required for each ID to give a full data set for the current state.
Also the values can be of different types.
If I understand your question correctly, one option is to use conditional aggregation and union all:
select id, timestamp,
max(case when tbl = 'tbl1' then value end) t1value,
max(case when tbl = 'tbl2' then value end) t2value,
max(case when tbl = 'tbl3' then value end) t3value
from (
select id, timestamp, value, 'tbl1' tbl
from tbl1
union all
select id, timestamp, value, 'tbl2' tbl
from tbl2
union all
select id, timestamp, value, 'tbl3' tbl
from tbl3
) t
group by id, timestamp
Or, if you have multiple records per id and only want the most recent one per table (by timestamp), you can add row_number() to each subquery:
select id, timestamp,
max(case when tbl = 'tbl1' then value end) t1value,
max(case when tbl = 'tbl2' then value end) t2value,
max(case when tbl = 'tbl3' then value end) t3value
from (
select id, timestamp, value, 'tbl1' tbl,
row_number() over (partition by id order by timestamp desc) rn
from tbl1
union all
select id, timestamp, value, 'tbl2' tbl,
row_number() over (partition by id order by timestamp desc) rn
from tbl2
union all
select id, timestamp, value, 'tbl3' tbl,
row_number() over (partition by id order by timestamp desc) rn
from tbl3
) t
where rn = 1
group by id, timestamp
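To make the two steps of this query concrete (latest row per id in each table, then a pivot on id and timestamp), here is a small Python sketch. The table contents are hypothetical values chosen to resemble the question's example, ISO date strings are assumed so that string comparison orders them chronologically, and `latest_per_id` is an illustrative helper name.

```python
from collections import defaultdict

# Hypothetical child tables: lists of (id, timestamp, value).
tables = {
    "tbl1": [(0, "2001-01-01", 2), (1, "2001-01-02", 7), (0, "2001-01-03", 5)],
    "tbl2": [(0, "2001-01-03", 2), (1, "2001-01-02", 4)],
    "tbl3": [(0, "2001-01-01", 4), (1, "2001-01-02", 9)],
}


def latest_per_id(rows):
    """Keep the most recent row per id (row_number() ... order by timestamp desc, rn = 1)."""
    latest = {}
    for id_, ts, value in rows:
        if id_ not in latest or ts > latest[id_][0]:
            latest[id_] = (ts, value)
    return latest


# Pivot: one output row per (id, timestamp), one column per source table.
pivot = defaultdict(dict)
for name, rows in tables.items():
    for id_, (ts, value) in latest_per_id(rows).items():
        pivot[(id_, ts)][name] = value

for key in sorted(pivot):
    print(key, pivot[key])
```

Tables whose latest timestamps differ for the same id end up on separate output rows, which is the "minimal set of updates" shape the question asks for.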
This can get difficult though if max(timestamp) values aren't the same in each of the child tables. Which do you join on at that point?
select m.*, v1.value as t1_val, v2.value as t2_val, v3.value as t3_val
from master m
left join (select x.*
from vtbl1 x
join (select id, max(timestamp) as last_ts
from vtbl1
group by id) y
on x.id = y.id
and x.timestamp = y.last_ts) v1
on m.id = v1.id
left join (select x.*
from vtbl2 x
join (select id, max(timestamp) as last_ts
from vtbl2
group by id) y
on x.id = y.id
and x.timestamp = y.last_ts) v2
on m.id = v2.id
left join (select x.*
from vtbl3 x
join (select id, max(timestamp) as last_ts
from vtbl3
group by id) y
on x.id = y.id
and x.timestamp = y.last_ts) v3
on m.id = v3.id
The fastest query technique depends on the distribution of values. DISTINCT ON would be a simple solution in Postgres, ideal for just a few values per id in each child table. But guessing from your description I expect many rows per id, so I suggest a solution with LATERAL joins. Requires Postgres 9.3+:
Optimize GROUP BY query to retrieve latest record per user
One more complication for your already-not-so-simple case:
Also the values can be of different types
Alternative 1
Cast all values to text. Every data type can be cast to text.
Base query
SELECT m.id, v.timestamp, 1 AS tbl, v.value -- simple int as table id
FROM master m
, LATERAL (
SELECT timestamp, value::text -- cast to text
FROM vtbl1
WHERE id = m.id -- lateral reference
ORDER BY timestamp DESC NULLS LAST
LIMIT 1
) v
UNION ALL
SELECT m.id, v.timestamp, 2 AS tbl, v.value -- ascending without gaps
FROM master m
, LATERAL (
SELECT timestamp, value::text
FROM vtbl2
WHERE id = m.id
ORDER BY timestamp DESC NULLS LAST
LIMIT 1
) v
UNION ALL
SELECT m.id, v.timestamp, 3 AS tbl, value
FROM ...
;
All you need for this to be fast is an index on (id, timestamp) for each child table. Best in this form (adding value is only useful if you get index-only scans out of it):
CREATE INDEX vtbl1_combo_idx ON vtbl1 (id, timestamp DESC NULLS LAST, value)
1a. Aggregate (pseudo-crosstab)
To format as desired, use aggregate functions on CASE expressions in Postgres 9.3 or older (as demonstrated by @sgeddes) or (better) the new aggregate FILTER clause in Postgres 9.4+:
How can I simplify this game statistics query?
SELECT id, timestamp
, max(value) FILTER (WHERE tbl = 1) AS val1
, max(value) FILTER (WHERE tbl = 2) AS val2
, ...
FROM ( <query from above> ) t
GROUP BY 1, 2;
1b. Crosstab
Actual cross tabulation (also called "pivot" in other RDBMS) should be considerably faster. You need the additional module tablefunc installed, instructions below.
The special difficulty here: we have a composite "row name" (id, timestamp), but the function expects a single column as row name. So we substitute with row_number(), but do not display that surrogate key in the result:
SELECT id, timestamp, val1, val2, val3, ...
-- normally SELECT * is enough; explicit list to filter rn
FROM crosstab(
$$
SELECT row_number() OVER (ORDER BY id, timestamp DESC NULLS LAST) AS rn
, id, timestamp, tbl, value
FROM ( <query from above> ) t
ORDER BY 1
$$
, 'SELECT generate_series(1,3)' -- replace 3 with highest table nr.
) AS ct (
rn int, id int, timestamp date
, val1 text, val2 text, val3 text, ...);
Closely related:
Postgres - Transpose Rows to Columns
Relevant basics:
PostgreSQL Crosstab Query
Pivot on Multiple Columns using Tablefunc
Alternative 2
Simple, but may be just as fast and preserves original data types:
SELECT id, timestamp
, max(val1) AS val1, max(val2) AS val2, max(val3) AS val3, ...
FROM (
SELECT m.id, v.timestamp
, v.value AS val1, NULL::int AS val2, NULL::numeric AS val3, ...
-- list all values with actual data type
FROM master m
, LATERAL (
SELECT timestamp, value
FROM vtbl1
WHERE id = m.id
ORDER BY timestamp DESC NULLS LAST
LIMIT 1
) v
UNION ALL
SELECT m.id, v.timestamp
, NULL, v.value, NULL, ... -- column names & data types defined in first SELECT
FROM master m
, LATERAL (
SELECT timestamp, value
FROM vtbl2
WHERE id = m.id
ORDER BY timestamp DESC NULLS LAST
LIMIT 1
) v
UNION ALL
SELECT m.id, v.timestamp
, NULL, NULL, v.value, ...
FROM ...
) t
GROUP BY 1, 2
ORDER BY 1, 2;
Aside: Never use basic type names or reserved words (in standard SQL) like timestamp as identifiers.

Oracle SQL -- Analytic functions OVER a group?

My table:
ID NUM VAL
1 1 Hello
1 2 Goodbye
2 2 Hey
2 4 What's up?
3 5 See you
If I want to return the max number for each ID, it's really nice and clean:
SELECT MAX(NUM) FROM table GROUP BY (ID)
But what if I want to grab the value associated with the max of each number for each ID?
Why can't I do:
SELECT MAX(NUM) OVER (ORDER BY NUM) FROM table GROUP BY (ID)
Why is that an error? I'd like to have this select grouped by ID, rather than partitioning separately for each window...
EDIT: The error is "not a GROUP BY expression".
You could probably use the MAX() KEEP(DENSE_RANK LAST...) function:
with sample_data as (
select 1 id, 1 num, 'Hello' val from dual union all
select 1 id, 2 num, 'Goodbye' val from dual union all
select 2 id, 2 num, 'Hey' val from dual union all
select 2 id, 4 num, 'What''s up?' val from dual union all
select 3 id, 5 num, 'See you' val from dual)
select id, max(num), max(val) keep (dense_rank last order by num)
from sample_data
group by id;
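What this query computes per group can be sketched in Python: for each id, take the highest num and the val that goes with it. This is only an illustration using the sample rows; comparing `(num, val)` tuples mimics the query's behaviour on ties, where `max(val)` is taken among the rows sharing the highest num.

```python
# Sample rows from the question: (id, num, val).
sample_data = [
    (1, 1, "Hello"), (1, 2, "Goodbye"),
    (2, 2, "Hey"), (2, 4, "What's up?"),
    (3, 5, "See you"),
]

best = {}
for id_, num, val in sample_data:
    # Keep the row with the highest num per id; if num is tied, keep the
    # largest val, like max(val) KEEP (DENSE_RANK LAST ORDER BY num).
    if id_ not in best or (num, val) > best[id_]:
        best[id_] = (num, val)

print(best)
```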
When you use a windowing function, you don't need GROUP BY anymore; this would suffice:
select id,
max(num) over(partition by id)
from x
Actually you can get the result without using windowing function:
select *
from x
where (id,num) in
(
select id, max(num)
from x
group by id
)
Output:
ID NUM VAL
1 2 Goodbye
2 4 What's up
3 5 SEE YOU
http://www.sqlfiddle.com/#!4/a9a07/7
If you want to use a windowing function, you might be tempted to try this:
select id, val,
case when num = max(num) over(partition by id) then
1
else
0
end as to_select
from x
where to_select = 1
Or this:
select id, val
from x
where num = max(num) over(partition by id)
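But since neither of those is allowed (a window function cannot appear in WHERE, and a column alias cannot be referenced in the WHERE clause of the same query level), you have to do this: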
with list as
(
select id, val,
case when num = max(num) over(partition by id) then
1
else
0
end as to_select
from x
)
select *
from list
where to_select = 1
http://www.sqlfiddle.com/#!4/a9a07/19
If you're looking to get the rows which contain the values from MAX(num) GROUP BY id, this tends to be a common pattern...
WITH
sequenced_data
AS
(
SELECT
ROW_NUMBER() OVER (PARTITION BY id ORDER BY num DESC) AS sequence_id,
*
FROM
yourTable
)
SELECT
*
FROM
sequenced_data
WHERE
sequence_id = 1
EDIT
I don't know if TeraData will allow this, but the logic seems to make sense...
SELECT
*
FROM
yourTable
WHERE
num = MAX(num) OVER (PARTITION BY id)
Or maybe...
SELECT
*
FROM
(
SELECT
*,
MAX(num) OVER (PARTITION BY id) AS max_num_by_id
FROM
yourTable
)
AS sub_query
WHERE
num = max_num_by_id
This is slightly different from my previous answer; if multiple records are tied with the same MAX(num), this will return all of them, the other answer will only ever return one.
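The difference on ties is easy to see in a small Python sketch. The rows below are hypothetical (id 1 has two rows tied on the highest num), and the two functions are illustrative stand-ins for the ROW_NUMBER() pattern and the MAX(num) OVER pattern respectively.

```python
from itertools import groupby

# Hypothetical rows (id, num, val) with a tie on the highest num for id 1.
rows = [(1, 2, "Goodbye"), (1, 2, "Bye"), (1, 1, "Hello"), (2, 4, "Hey")]


def top_one_per_id(rows):
    """ROW_NUMBER() ... = 1: exactly one row per id, even when num is tied."""
    ordered = sorted(rows, key=lambda r: (r[0], -r[1]))
    return [next(group) for _id, group in groupby(ordered, key=lambda r: r[0])]


def all_max_per_id(rows):
    """num = MAX(num) OVER (PARTITION BY id): every row tied for the maximum."""
    max_num = {}
    for id_, num, _val in rows:
        max_num[id_] = max(num, max_num.get(id_, num))
    return [r for r in rows if r[1] == max_num[r[0]]]


print(top_one_per_id(rows))
print(all_max_per_id(rows))
```

Which of the tied rows the ROW_NUMBER() version keeps is arbitrary in SQL unless the ORDER BY includes a tie-breaker; the sketch simply keeps the first one it sees.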
EDIT
In your proposed SQL the error relates to the fact that the OVER() clause contains a field not in your GROUP BY. It's like trying to do this...
SELECT id, num FROM yourTable GROUP BY id
num is invalid, because there can be multiple values in that field for each row returned (with the rows returned being defined by GROUP BY id).
In the same way, you can't put num inside the OVER() clause.
SELECT
id,
MAX(num), -- Valid as it is an aggregate
MAX(num) -- still valid
OVER(PARTITION BY id), -- Also valid, as id is in the GROUP BY
MAX(num) -- still valid
OVER(PARTITION BY num) -- Not valid, as num is not in the GROUP BY
FROM
yourTable
GROUP BY
id
See this question for when you can't specify something in the OVER() clause, and an answer showing when (I think) you can: over-partition-by-question