How to get the most frequent value in Google's BigQuery

Postgres has an easy way to do this: the mode() function returns the most frequent value. Is there something equivalent in Google's BigQuery?
How could a query like this be written in BigQuery?
select count(*),
avg(vehicles) as mean,
percentile_cont(0.5) within group (order by vehicles) as median,
mode() within group (order by vehicles) as most_frequent_value
FROM "driver"
WHERE vehicles is not null;

Below is for BigQuery Standard SQL
Option 1
#standardSQL
SELECT * FROM (
SELECT COUNT(*) AS cnt,
AVG(vehicles) AS mean,
APPROX_TOP_COUNT(vehicles, 1)[OFFSET(0)].value AS most_frequent_value
FROM `project.dataset.table`
WHERE vehicles IS NOT NULL
) CROSS JOIN (
SELECT PERCENTILE_CONT(vehicles, 0.5) OVER() AS median
FROM `project.dataset.table`
WHERE vehicles IS NOT NULL
LIMIT 1
)
Option 2
#standardSQL
SELECT * FROM (
SELECT COUNT(*) cnt,
AVG(vehicles) AS mean
FROM `project.dataset.table`
WHERE vehicles IS NOT NULL
) CROSS JOIN (
SELECT PERCENTILE_CONT(vehicles, 0.5) OVER() AS median
FROM `project.dataset.table`
WHERE vehicles IS NOT NULL
LIMIT 1
) CROSS JOIN (
SELECT vehicles AS most_frequent_value
FROM `project.dataset.table`
WHERE vehicles IS NOT NULL
GROUP BY vehicles
ORDER BY COUNT(1) DESC
LIMIT 1
)
Option 3
#standardSQL
CREATE TEMP FUNCTION median(arr ANY TYPE) AS ((
SELECT PERCENTILE_CONT(x, 0.5) OVER()
FROM UNNEST(arr) x LIMIT 1
));
CREATE TEMP FUNCTION most_frequent_value(arr ANY TYPE) AS ((
SELECT x
FROM UNNEST(arr) x
GROUP BY x
ORDER BY COUNT(1) DESC
LIMIT 1
));
SELECT COUNT(*) cnt,
AVG(vehicles) AS mean,
median(ARRAY_AGG(vehicles)) AS median,
most_frequent_value(ARRAY_AGG(vehicles)) AS most_frequent_value
FROM `project.dataset.table`
WHERE vehicles IS NOT NULL
and so on ...
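The counting logic behind the last subquery of Option 2 (group by value, order by count descending, keep one row) is plain standard SQL, so it can be tried locally. Here is a minimal sketch that runs it through Python's built-in sqlite3 module as a stand-in for BigQuery. The table name driver comes from the question; the sample rows and the helper name most_frequent_value are invented for illustration. Note that without a second sort key, ties are broken arbitrarily.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for BigQuery (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE driver (vehicles INTEGER)")
# Invented sample data, including one NULL that the query must skip.
conn.executemany(
    "INSERT INTO driver (vehicles) VALUES (?)",
    [(1,), (2,), (2,), (3,), (2,), (None,), (1,)],
)


def most_frequent_value(connection):
    """Return the most frequent non-NULL value of driver.vehicles."""
    row = connection.execute(
        """
        SELECT vehicles
        FROM driver
        WHERE vehicles IS NOT NULL
        GROUP BY vehicles
        ORDER BY COUNT(*) DESC, vehicles  -- second key makes ties deterministic
        LIMIT 1
        """
    ).fetchone()
    return row[0] if row else None


mode_value = most_frequent_value(conn)
print(mode_value)
```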

You can use APPROX_TOP_COUNT to get top values, e.g.:
SELECT APPROX_TOP_COUNT(vehicles, 5) AS top_five_vehicles
FROM dataset.driver
If you just want the top value, you can select it from the array:
SELECT APPROX_TOP_COUNT(vehicles, 1)[OFFSET(0)].value AS most_frequent_value
FROM dataset.driver
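APPROX_TOP_COUNT returns an array of structs with value and count fields, which is why the top element is picked out with [OFFSET(0)]. As a rough illustration of that result shape only, here is an exact (not approximate) Python analogue built on collections.Counter. The function name top_count and the sample data are hypothetical.

```python
from collections import Counter


def top_count(values, number):
    """Return up to `number` records of the form {value, count},
    most frequent first. This is an exact count, not an approximation."""
    counts = Counter(values)
    return [{"value": v, "count": c} for v, c in counts.most_common(number)]


vehicles = [2, 1, 2, 3, 2, 1]  # invented sample data
top_two = top_count(vehicles, 2)
most_frequent = top_count(vehicles, 1)[0]
print(top_two, most_frequent["value"])
```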

No, there is no equivalent of the mode() function in BigQuery, but you can define one yourself using any of the approaches in the other answers to this thread. You could call it like so:
SELECT mode(`an_array`) AS top_count FROM `somewhere_with_arrays`
but this approach leads to a separate subquery for every row, which is terrible for performance, so if you have never ground BigQuery to a halt before, these functions will let you do it. I use it (the second one) only for readability in quick fixes on very small datasets.
Check out the two UDFs below. A third approach would be to implement a JavaScript function, in which case this one-liner should be useful:
return arr.sort((a,b) => arr.filter(v => v===a).length - arr.filter(v => v===b).length).pop();
This code defines two mode()-like functions that take an array and return its most common string:
CREATE TEMPORARY FUNCTION mode1(mystring ANY TYPE)
RETURNS STRING
AS
(
(
SELECT var FROM
( /* Count occurrences of each value of input */
SELECT var, COUNT(*) AS n FROM
( /* Unnest and name*/
SELECT var FROM UNNEST(mystring) var
)
GROUP BY var /* Output is one of existing values */
) /* -------------------------------- */
ORDER BY n DESC /* Output is value with HIGHEST n */
LIMIT 1 /* Only ONE string is the output */
)
);
CREATE TEMPORARY FUNCTION mode2(inp ANY TYPE)
RETURNS STRING
AS
(
(
SELECT result.value FROM UNNEST( (SELECT APPROX_TOP_COUNT(v,1) AS result FROM UNNEST(inp) v)) result
)
);
SELECT
inp,
mode1(inp) AS first_logic_output,
mode2(inp) AS second_logic_output
FROM
(
/* Test data */
SELECT ['Erdős','Turán', 'Erdős','Turán','Euler','Erdős'] AS inp
UNION ALL
SELECT ['Euler','Euler', 'Gauss', 'Euler'] AS inp
)
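The values both UDFs should produce on the two test rows can be checked outside BigQuery. This is a small Python sketch of the same "most common element" logic, not a replacement for the UDFs. The helper name mode_of is hypothetical, and ties (none occur in this data) would be resolved in favour of the element seen first.

```python
from collections import Counter


def mode_of(items):
    """Return the most common element of a non-empty list."""
    return Counter(items).most_common(1)[0][0]


# Same test data as in the query above.
rows = [
    ["Erdős", "Turán", "Erdős", "Turán", "Euler", "Erdős"],
    ["Euler", "Euler", "Gauss", "Euler"],
]
modes = [mode_of(row) for row in rows]
print(modes)
```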

The method I prefer is to query off of an array, since you can easily adjust the criteria of the mode. Below are two examples, one using a limit and one using an offset. With the offset you can grab the Nth most or least frequent value.
WITH t AS (SELECT 18 AS length,
'HIGH' as amps,
99.95 price UNION ALL
SELECT 18, "HIGH", 99.95 UNION ALL
SELECT 18, "HIGH", 5.95 UNION ALL
SELECT 18, "LOW", 33.95 UNION ALL
SELECT 18, "LOW", 33.95 UNION ALL
SELECT 18, "LOW", 4.5 UNION ALL
SELECT 3, "HIGH", 77.95 UNION ALL
SELECT 3, "HIGH", 77.95 UNION ALL
SELECT 3, "HIGH", 9.99 UNION ALL
SELECT 3, "LOW", 44.95 UNION ALL
SELECT 3, "LOW", 44.95 UNION ALL
SELECT 3, "LOW", 5.65
)
SELECT
length,
amps,
-- By Limit
(SELECT x FROM UNNEST(price_array) x
GROUP BY x ORDER BY COUNT(*) DESC LIMIT 1 ) most_freq_price,
(SELECT x FROM UNNEST(price_array) x
GROUP BY x ORDER BY COUNT(*) ASC LIMIT 1 ) least_freq_price,
-- By Offset
ARRAY((SELECT x FROM UNNEST(price_array) x
GROUP BY x ORDER BY COUNT(*) DESC))[OFFSET(0)] most_freq_price_offset,
ARRAY((SELECT x FROM UNNEST(price_array) x
GROUP BY x ORDER BY COUNT(*) ASC))[OFFSET(0)] least_freq_price_offset
FROM (
SELECT
length,
amps,
ARRAY_AGG(price) price_array
FROM t
GROUP BY 1,2
)
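To make the offset idea concrete, here is a hedged Python sketch of "order the distinct values by frequency, then index into the result". The function name nth_by_frequency is hypothetical; the sample array is the (18, 'HIGH') group from the query above.

```python
from collections import Counter


def nth_by_frequency(values, offset=0, descending=True):
    """Return the distinct value at position `offset` when the distinct
    values are ordered by how often they occur."""
    counts = Counter(values)
    ordered = sorted(counts, key=lambda v: counts[v], reverse=descending)
    return ordered[offset]


price_array = [99.95, 99.95, 5.95]
most_freq = nth_by_frequency(price_array, 0)
least_freq = nth_by_frequency(price_array, 0, descending=False)
second_most = nth_by_frequency(price_array, 1)
print(most_freq, least_freq, second_most)
```

Raising the offset walks down the frequency ranking, which is what changing [OFFSET(0)] to [OFFSET(1)] does in the SQL version.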

Related

Bigquery uncommon elements in array

Suppose you have a column of arrays like the one below. I am trying to group the rows based on the count of uncommon elements: once the number of distinct uncommon elements reaches 5, the row goes into the next group.
In the example below, the first three rows form group 1 because their uncommon elements are ['3','4','6','7'], which is 4 in length. If you add the next row to the group, the array of distinct uncommon elements becomes ['1','3','4','5','6','7'], which exceeds the limit of 5 distinct uncommon elements.
with arr as (
select 1 ord, ['1','2','3','4'] as ar
union all
select 2, ['1','2','3']
union all
select 3,['1','2','6','7']
union all
select 4,['2','4','5','7']
union all
select 5, ['string1','5','6','7','8']
)
select * from arr
I am looking for an output like below
Here is the code I have written so far. It is definitely missing a big piece, but I am adding it in case it is helpful.
with arr as (
select 1 ord, ['1','2','3','4'] as ar,1 subclass
union all
select 2, ['1','2','3'],1
union all
select 3,['1','2','6','7'],1
union all
select 4,['2','4','5','7'],1
union all
select 5, ['string1','5','6','7','8'],1
)
, history_t as (
select a.* ,
ARRAY_AGG(struct(ar)) OVER (PARTITION BY SUBCLASS ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as history
from arr a )
, tem2 as (
select a.* except(history,ar),
(SELECT COUNT(1) FROM UNNEST(history) AS col ) AS array_cnt
,b.ar unnest1
from history_t a
,unnest(history) b
)
, tem3 as (
select a.* except(unnest1),sku_lst
from tem2 a , unnest(unnest1) sku_lst
)
, all_sku_freq as (
select
ord, array_cnt , sku_lst , subclass,count(*) sku_freq
from tem3
group by 1,2,3,4 )
, uncommon_sku_cnt as (
select ord, subclass, count( sku_lst) uncommon_sku_count from all_sku_freq where sku_freq <> array_cnt group by 1,2 )
,rolling_uncomm_sku_cnt as (
select a.*, sum(uncommon_sku_count) over(partition by subclass order by ord asc range between unbounded preceding and current row ) roll_uncomm_sku_cnt
from uncommon_sku_cnt a
)
select a.* from rolling_uncomm_sku_cnt a
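One way to pin down the grouping rule before translating it to SQL is a procedural sketch. The Python below is only an interpretation of the question, under two assumptions: "uncommon elements" means the union of a group's arrays minus their intersection (this matches the ['3','4','6','7'] example), and a row starts a new group when adding it would push that count above the limit. The function name assign_groups and the max_uncommon parameter are hypothetical.

```python
def assign_groups(arrays, max_uncommon=5):
    """Greedy grouping: a row joins the current group unless that would
    push the number of uncommon elements above max_uncommon."""
    group_ids = []
    current = 0
    union, common = set(), set()
    for arr in arrays:
        items = set(arr)
        # Uncommon elements if this row joined: union minus intersection.
        if current and len((union | items) - (common & items)) <= max_uncommon:
            union, common = union | items, common & items
        else:
            current += 1  # start a new group with this row
            union, common = set(items), set(items)
        group_ids.append(current)
    return group_ids


arrays = [
    ["1", "2", "3", "4"],
    ["1", "2", "3"],
    ["1", "2", "6", "7"],
    ["2", "4", "5", "7"],
    ["string1", "5", "6", "7", "8"],
]
groups = assign_groups(arrays)
print(groups)
```

Whether the fifth row belongs with the fourth depends on whether "reaches 5" is meant inclusively, so the threshold is left as a parameter.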

Distinct key with ARRAY_CONCAT with Struct<String, String>

I need to do the following with the 2 array fields in the table below. The arrays are of type Struct<String, String>.
1. Merge the arrays together.
2. If there is a duplicate key between labels.key and project.key, keep only the key-value pair from the labels field.
3. Flatten the combined array into a delimited string and order the entries (so I can group by it).
Example Table Data
SELECT 1 as id, ARRAY
[STRUCT("testlabel2" as key, "thisvalueisbetter" as value), STRUCT("testlabel3", "testvalue3")] as labels,
[STRUCT("testlabel2" as key, "testvalue2" as value)] as project
The below query does everything except #2 and I'm not sure how to accomplish that. Does anyone have a suggestion on how to do this?
SELECT
id,
(SELECT STRING_AGG(DISTINCT CONCAT(l.key, ':', l.value) ORDER BY CONCAT(l.key, ':', l.value))
FROM UNNEST(
ARRAY_CONCAT(labels, project)) AS l) AS label,
FROM `mytestdata` AS t
GROUP BY id, label
Currently this query gives the output:
1 testlabel2:testvalue2,testlabel2:thisvalueisbetter,testlabel3:testvalue3
But I'm looking for:
1 testlabel2:thisvalueisbetter,testlabel3:testvalue3
Below is for BigQuery Standard SQL
#standardSQL
SELECT *,
ARRAY(
SELECT AS STRUCT key, ARRAY_AGG(value ORDER BY source LIMIT 1)[OFFSET(0)] AS value
FROM (
SELECT 0 AS source, * FROM t.labels UNION ALL
SELECT 1, * FROM t.project
)
GROUP BY key
) AS combined_array
FROM `project.dataset.table` t
You can test and play with the above using the sample data from your question, as in the example below
#standardSQL
WITH `project.dataset.table` AS (
SELECT ARRAY
[STRUCT("testlabel2" AS key, "thisvalueisbetter" AS value), STRUCT("testlabel3", "testvalue3")] AS labels,
[STRUCT("testlabel2" AS key, "testvalue2" AS value)] AS project
)
SELECT *,
ARRAY(
SELECT AS STRUCT key, ARRAY_AGG(value ORDER BY source LIMIT 1)[OFFSET(0)] AS value
FROM (
SELECT 0 AS source, * FROM t.labels UNION ALL
SELECT 1, * FROM t.project
)
GROUP BY key
) AS combined_array
FROM `project.dataset.table` t
with result
Or ... to fully match your expected output - use below
#standardSQL
SELECT *,
(SELECT STRING_AGG(x) FROM (
SELECT CONCAT(key, ':', ARRAY_AGG(value ORDER BY source LIMIT 1)[OFFSET(0)]) x
FROM (
SELECT 0 AS source, * FROM t.labels UNION ALL
SELECT 1, * FROM t.project
)
GROUP BY key
)) AS combined_result
FROM `project.dataset.table` t
with result
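For reference, the precedence rule the queries implement (on a duplicate key, the entry from labels wins over the one from project) can be sketched in a few lines of Python. The function name merge_labels is hypothetical, and the structs are modelled as (key, value) tuples. Unlike the SQL, this sketch also sorts the output by key, as the question asked.

```python
def merge_labels(labels, project):
    """Merge two lists of (key, value) pairs; on duplicate keys the pair
    from `labels` wins. Returns a 'key:value' string sorted by key."""
    merged = dict(project)
    merged.update(labels)  # labels overwrite project entries with the same key
    return ",".join(f"{k}:{v}" for k, v in sorted(merged.items()))


# Sample data from the question.
labels = [("testlabel2", "thisvalueisbetter"), ("testlabel3", "testvalue3")]
project = [("testlabel2", "testvalue2")]
combined = merge_labels(labels, project)
print(combined)
```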

sql - single line per distinct values in a given column

Is there a way using SQL, in BigQuery more specifically, to get one row per unique value in a given column?
I know this is possible with a sequence of UNION queries, one per distinct value in the column of interest, but I'm wondering if there is a better way to do it.
You can use row_number():
select t.* except (seqnum)
from (select t.*, row_number() over (partition by col order by col) as seqnum
from t
) t
where seqnum = 1;
This returns an arbitrary row. You can control which row by adjusting the order by.
Another fun solution in BigQuery uses structs:
select array_agg(t limit 1)[ordinal(1)].*
from t
group by col;
You can add an order by (order by X limit 1) if you want a particular row.
Here is the same query, formatted more readably:
select tab.* except(seqnum)
from (
select *, row_number() over (partition by column_x order by column_x) as seqnum
from `project.dataset.table`
) as tab
where seqnum = 1
Below is for BigQuery Standard SQL
#standardSQL
SELECT AS VALUE ANY_VALUE(t)
FROM `project.dataset.table` t
GROUP BY col
You can test and play with the above using dummy data, as in the example below
#standardSQL
WITH `project.dataset.table` AS (
SELECT 1 id, 1 col UNION ALL
SELECT 2, 1 UNION ALL
SELECT 3, 1 UNION ALL
SELECT 4, 2 UNION ALL
SELECT 5, 2 UNION ALL
SELECT 6, 3
)
SELECT AS VALUE ANY_VALUE(t)
FROM `project.dataset.table` t
GROUP BY col
with result
Row id col
1 1 1
2 4 2
3 6 3
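The ROW_NUMBER() pattern from the earlier answers is portable SQL, so it can be tried without a BigQuery project. The sketch below runs it on the same dummy data through Python's sqlite3 module (an assumption: SQLite 3.25 or newer, which added window functions). Ordering by id inside the window makes the chosen row deterministic, whereas ANY_VALUE may return any row of each group.

```python
import sqlite3

# In-memory SQLite database as a stand-in for BigQuery (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER, col INTEGER)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [(1, 1), (2, 1), (3, 1), (4, 2), (5, 2), (6, 3)],
)

# Keep the row with the smallest id for each distinct value of col.
one_per_col = conn.execute(
    """
    SELECT id, col
    FROM (
        SELECT id, col,
               ROW_NUMBER() OVER (PARTITION BY col ORDER BY id) AS seqnum
        FROM t
    )
    WHERE seqnum = 1
    ORDER BY col
    """
).fetchall()
print(one_per_col)
```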

Find the mode in BigQuery

The mode is the value that appears most often in a set.
I would like something like:
SELECT
t.id as t_id,
GROUP_CONCAT(t.value) as value_list,
MODE(t.value) AS value_mode
FROM dataset.table as t
GROUP BY t_id
such that, for example:
t_id value_list value_mode
1 2,2,2,3,6,6 2
How is that done?
EDIT: The value_list is there for illustration purposes only; I only need the mode.
select id, value as value_list, v as value_mode
from (
select
id, value, v,
count(1) as c,
row_number() over(partition by id order by c desc) as top
from (
select id, value, split(value) as v
from dataset.table
)
group by id, value, v
)
where top = 1
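The same idea (count each value per id, rank the counts within each id, keep rank 1) can be written with a CTE and tried locally. This is a sketch under assumptions: it uses Python's sqlite3 module (SQLite 3.25 or newer) instead of BigQuery, it assumes one value per row rather than a comma-separated string, and the table name samples and its rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE samples (id INTEGER, value INTEGER)")
conn.executemany(
    "INSERT INTO samples VALUES (?, ?)",
    [(1, 2), (1, 2), (1, 2), (1, 3), (1, 6), (1, 6), (2, 7), (2, 9), (2, 9)],
)

modes = conn.execute(
    """
    WITH counts AS (
        -- How often each value occurs within each id
        SELECT id, value, COUNT(*) AS c
        FROM samples
        GROUP BY id, value
    ),
    ranked AS (
        -- Rank values per id, most frequent first; value breaks ties
        SELECT id, value,
               ROW_NUMBER() OVER (PARTITION BY id ORDER BY c DESC, value) AS rn
        FROM counts
    )
    SELECT id, value FROM ranked WHERE rn = 1 ORDER BY id
    """
).fetchall()
print(modes)
```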
I often have to find the mode of prices for respective groups (e.g. length and amps) to filter out sale prices and the like.
I typically use two methods, both of which build an array and unnest it in order of frequency. One uses LIMIT; the other uses [OFFSET(0)], which helps if you want the Nth value.
Both are included below:
WITH t AS (SELECT 18 AS length,
'HIGH' as amps,
99.95 price UNION ALL
SELECT 18, "HIGH", 99.95 UNION ALL
SELECT 18, "HIGH", 5.95 UNION ALL
SELECT 18, "LOW", 33.95 UNION ALL
SELECT 18, "LOW", 33.95 UNION ALL
SELECT 18, "LOW", 4.5 UNION ALL
SELECT 3, "HIGH", 77.95 UNION ALL
SELECT 3, "HIGH", 77.95 UNION ALL
SELECT 3, "HIGH", 9.99 UNION ALL
SELECT 3, "LOW", 44.95 UNION ALL
SELECT 3, "LOW", 44.95 UNION ALL
SELECT 3, "LOW", 5.65
)
SELECT
length,
amps,
-- By Limit
(SELECT x FROM UNNEST(price_array) x
GROUP BY x ORDER BY COUNT(*) DESC LIMIT 1 ) most_freq_price,
(SELECT x FROM UNNEST(price_array) x
GROUP BY x ORDER BY COUNT(*) ASC LIMIT 1 ) least_freq_price,
-- By Offset
ARRAY((SELECT x FROM UNNEST(price_array) x
GROUP BY x ORDER BY COUNT(*) DESC))[OFFSET(0)] most_freq_price_offset,
ARRAY((SELECT x FROM UNNEST(price_array) x
GROUP BY x ORDER BY COUNT(*) ASC))[OFFSET(0)] least_freq_price_offset
FROM (
SELECT
length,
amps,
ARRAY_AGG(price) price_array
FROM t
GROUP BY 1,2
)
For your example, this is how I would solve it:
SELECT x, w mode
FROM (
SELECT COUNT(*) c, w, ROW_NUMBER() OVER(ORDER BY c DESC) rn, FIRST(x) x
FROM (
SELECT SPLIT(x) w, x FROM (SELECT "2,2,2,3,6,6" x)
)
GROUP BY 2
)
WHERE rn=1
And with the GROUP_CONCAT within query:
SELECT gc, w mode
FROM (
SELECT COUNT(*) c, w, ROW_NUMBER() OVER(ORDER BY c DESC) rn, FIRST(gc) gc
FROM (
SELECT GROUP_CONCAT(w) OVER() gc, w
FROM (FLATTEN((
SELECT SPLIT(x) w, x FROM (SELECT "2,2,2,3,6,6" x)), w)
)
)
GROUP BY 2
)
WHERE rn=1
And handling partitions:
SELECT tid, gc value_list, w value_mode
FROM (
SELECT tid, COUNT(*) c, w, ROW_NUMBER() OVER(PARTITION BY tid ORDER BY c DESC) rn, FIRST(gc) gc
FROM (
SELECT tid, GROUP_CONCAT(w) OVER(PARTITION BY tid) gc, w
FROM (FLATTEN((
SELECT 1 tid, SPLIT(x) w, x FROM (SELECT "2,2,2,3,6,6" x)), w)
)
)
GROUP BY tid, w
)
WHERE rn=1
There is a direct function available now
approx_top_count()
Here is an example of its usage
https://cloud.google.com/bigquery/docs/reference/standard-sql/approximate_aggregate_functions#approx_top_count

Oracle SQL -- Analytic functions OVER a group?

My table:
ID NUM VAL
1 1 Hello
1 2 Goodbye
2 2 Hey
2 4 What's up?
3 5 See you
If I want to return the max number for each ID, it's really nice and clean:
SELECT MAX(NUM) FROM table GROUP BY (ID)
But what if I want to grab the value associated with the max of each number for each ID?
Why can't I do:
SELECT MAX(NUM) OVER (ORDER BY NUM) FROM table GROUP BY (ID)
Why is that an error? I'd like to have this select grouped by ID, rather than partitioning separately for each window...
EDIT: The error is "not a GROUP BY expression".
You could probably use the MAX() KEEP(DENSE_RANK LAST...) function:
with sample_data as (
select 1 id, 1 num, 'Hello' val from dual union all
select 1 id, 2 num, 'Goodbye' val from dual union all
select 2 id, 2 num, 'Hey' val from dual union all
select 2 id, 4 num, 'What''s up?' val from dual union all
select 3 id, 5 num, 'See you' val from dual)
select id, max(num), max(val) keep (dense_rank last order by num)
from sample_data
group by id;
When you use a window function, you no longer need GROUP BY; this would suffice:
select id,
max(num) over(partition by id)
from x
Actually you can get the result without using windowing function:
select *
from x
where (id,num) in
(
select id, max(num)
from x
group by id
)
Output:
ID NUM VAL
1 2 Goodbye
2 4 What's up
3 5 SEE YOU
http://www.sqlfiddle.com/#!4/a9a07/7
If you want to use a window function, you might try this:
select id, val,
case when num = max(num) over(partition by id) then
1
else
0
end as to_select
from x
where to_select = 1
Or this:
select id, val
from x
where num = max(num) over(partition by id)
But since neither of those is allowed, you have to do this:
with list as
(
select id, val,
case when num = max(num) over(partition by id) then
1
else
0
end as to_select
from x
)
select *
from list
where to_select = 1
http://www.sqlfiddle.com/#!4/a9a07/19
If you're looking to get the rows which contain the values from MAX(num) GROUP BY id, this tends to be a common pattern...
WITH
sequenced_data
AS
(
SELECT
ROW_NUMBER() OVER (PARTITION BY id ORDER BY num DESC) AS sequence_id,
*
FROM
yourTable
)
SELECT
*
FROM
sequenced_data
WHERE
sequence_id = 1
EDIT
I don't know if TeraData will allow this, but the logic seems to make sense...
SELECT
*
FROM
yourTable
WHERE
num = MAX(num) OVER (PARTITION BY id)
Or maybe...
SELECT
*
FROM
(
SELECT
*,
MAX(num) OVER (PARTITION BY id) AS max_num_by_id
FROM
yourTable
)
AS sub_query
WHERE
num = max_num_by_id
This is slightly different from my previous answer; if multiple records are tied with the same MAX(num), this will return all of them, the other answer will only ever return one.
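The subquery form above can be checked against the table from the question. The sketch below does so with Python's sqlite3 module as a stand-in for Oracle (an assumption; SQLite 3.25 or newer is needed for window functions), using the table name x from the earlier answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE x (id INTEGER, num INTEGER, val TEXT)")
conn.executemany(
    "INSERT INTO x VALUES (?, ?, ?)",
    [
        (1, 1, "Hello"),
        (1, 2, "Goodbye"),
        (2, 2, "Hey"),
        (2, 4, "What's up?"),
        (3, 5, "See you"),
    ],
)

# Compute the per-id maximum in a subquery, then filter on it outside,
# because a window function cannot be used directly in WHERE.
rows_with_max = conn.execute(
    """
    SELECT id, num, val
    FROM (
        SELECT id, num, val,
               MAX(num) OVER (PARTITION BY id) AS max_num_by_id
        FROM x
    ) AS sub_query
    WHERE num = max_num_by_id
    ORDER BY id
    """
).fetchall()
print(rows_with_max)
```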
EDIT
In your proposed SQL the error relates to the fact that the OVER() clause contains a field not in your GROUP BY. It's like trying to do this...
SELECT id, num FROM yourTable GROUP BY id
num is invalid, because there can be multiple values in that field for each row returned (with the rows returned being defined by GROUP BY id).
In the same way, you can't put num inside the OVER() clause.
SELECT
id,
MAX(num), -- Valid as it is an aggregate
MAX(num) -- still valid
OVER(PARTITION BY id), -- Also valid, as id is in the GROUP BY
MAX(num) -- still valid
OVER(PARTITION BY num) -- Not valid, as num is not in the GROUP BY
FROM
yourTable
GROUP BY
id
See this question for when you can't specify something in the OVER() clause, and an answer showing when (I think) you can: over-partition-by-question