BigQuery array intersection across all rows

I have the following data:
col1
[1,2,3]
[1,2]
[2,3]
I want the following output:
ans
[2]
In other words, I want the intersection of arrays in all rows.
Any pointers?

Consider the approach below:
select array_agg(item) ans from (
  select distinct item
  from your_table t, t.col1 item
  qualify count(distinct format('%t', col1)) over(partition by item)
        = count(distinct format('%t', col1)) over()
)
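The QUALIFY clause keeps only the items whose count of distinct source arrays (fingerprinted with format('%t')) equals the count of distinct arrays overall, i.e. the items present in every row. For reference, a runnable sketch against the sample data from the question, assuming col1 is an ARRAY<INT64> column:
with your_table as (
  select [1,2,3] as col1
  union all select [1,2]
  union all select [2,3]
)
select array_agg(item) ans from (
  select distinct item
  from your_table t, t.col1 item
  qualify count(distinct format('%t', col1)) over(partition by item)
        = count(distinct format('%t', col1)) over()
)
-- returns ans = [2]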

Intersection of the arrays of all rows: we need to unnest each array, group by element, and count in how many rows each element is present.
Let's start by generating your dataset in table tbl. (The first array is generated as [1,2,3,3]; counting distinct row numbers below makes the query robust to such duplicates within a row.)
Then, in a helper table, we generate a row_number, starting with one for each row. We also count the total number of rows.
In the last step, we unnest the col1 column into its elements and call them y. We group by y and count the distinct row_num values for each element in the new column ref. The having clause compares ref to the total number of rows and keeps only the elements that are present in all arrays.
with tbl as (
  select a.x as col1
  from unnest([struct([1,2,3,3] as x), struct([1,2]), struct([2,3])]) as a
),
helper as (
  select col1, row_number() over () as row_num, count(col1) over () as counts
  from tbl
)
select y, count(distinct row_num) as ref, any_value(counts) as ref_target
from helper, unnest(col1) y
group by 1
having ref = ref_target

Another approach can be seen below. It assumes that your arrays do not contain duplicate values.
cte groups by array value and counts the number of rows in which each value appears (by collecting the unnest indices). The result can then be filtered to the values whose appearance count equals the total number of rows, i.e. the values present in every array.
with sample_data as (
  select [1,2,3] as col1
  union all select [1,2] as col1
  union all select [2,3] as col1
),
cte as (
  select
    col1[safe_offset(index)] as arr_value,
    array_length(array_agg(index)) as exist_in_n_rows -- get indices where the value appeared, then count them
  from sample_data,
  unnest(generate_array(0, array_length(col1) - 1)) as index
  group by 1
)
select
  array_agg(arr_value) as ans
from cte
where exist_in_n_rows = (select count(*) from sample_data) -- keep only values present in every row
Output:
ans
[2]
NOTE: If you have duplicate values within an array, it is much better to deduplicate them first, prior to running the query above.
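A minimal sketch of such a deduplication step, assuming the same sample_data CTE as above; it rebuilds each array with only distinct elements before the intersection query runs:
select array(select distinct x from unnest(col1) x) as col1 -- each array, deduplicated
from sample_data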

Related

How to select rows corresponding to a randomly selected column value in SQL

My query returns a result like shown in the table below. I would like to randomly pick an ID from the ID column and get all the rows having that ID. How can I do that in Snowflake or SQL?
ID      Postalcode   Value   ...
1e3d    NK25F4       3214    ...
1e3d    NK25F4       3258    ...
1e3d    NK25F4       3354    ...
1f74    NG2LK8       5524
1f74    NG2LK8       5548
3e9a    N6B7H4       3694
3e9a    N6B7H4       3325
38e4    N6C7H2       3654
...
There is a Snowflake function, SAMPLE, that returns a fixed number of "random" rows, so using it reduces the need to read all rows.
SELECT t.*
FROM your_table as t
JOIN (SELECT ID FROM your_table SAMPLE (1 ROWS)) as r
ON t.id = r.id
thus using your data above:
with your_table(id, postalcode, value) as (
    select * from values
    ('1e3d', 'NK25F4', 3214),
    ('1e3d', 'NK25F4', 3258),
    ('1e3d', 'NK25F4', 3354),
    ('1f74', 'NG2LK8', 5524),
    ('1f74', 'NG2LK8', 5548),
    ('3e9a', 'N6B7H4', 3694),
    ('3e9a', 'N6B7H4', 3325),
    ('38e4', 'N6C7H2', 3654)
)
SELECT t.*
FROM your_table as t
JOIN (SELECT ID FROM your_table SAMPLE (1 ROWS)) as r
ON t.id = r.id
I get a random set on each run, but one looks like:
ID      POSTALCODE   VALUE
1f74    NG2LK8       5,524
1f74    NG2LK8       5,548
You could also use a NATURAL JOIN like:
SELECT *
FROM your_table
NATURAL JOIN (SELECT ID FROM your_table SAMPLE (1 ROWS))
You could put your existing query in a common table expression, then pick a random ID from it, and use it to filter the dataset:
with
dat as ( ... your query ...),
tid as (select id from dat order by random() fetch first 1 row)
select d.*
from dat d
inner join tid t on t.id = d.id
The second CTE, tid, picks the random id; it does that by randomly ordering the dataset, then taking the id of the top row.
Something like
SELECT *
FROM Table_NAME
WHERE ID IN (SELECT ID FROM Table_Name ORDER BY RAND() LIMIT 1);
Should work. Though it's not particularly efficient and in many application scenarios it would arguably be more reasonable overall to compute the random ID in your application (e.g. keeping the set of all ids cached, periodically pulling it separately if need be etc).
(Note: The query assumes MYSQL, other variants may have slightly different keywords/structure, e.g. for the random function).
WITH DATA AS (
select '1e3d' id,'NK25F4' postalcode,3214 some_value union all
select '1e3d' id,'NK25F4' postalcode,3258 some_value union all
select '1e3d' id,'NK25F4' postalcode,3354 some_value union all
select '1f74' id,'NG2LK8' postalcode,5524 some_value union all
select '1f74' id,'NG2LK8' postalcode,5548 some_value union all
select '3e9a' id,'N6B7H4' postalcode,3694 some_value union all
select '3e9a' id,'N6B7H4' postalcode,3325 some_value union all
select '38e4' id,'N6C7H2' postalcode,3654 some_value )
SELECT *
FROM DATA,
LATERAL (SELECT ID FROM DATA SAMPLE(2 ROWS)) I
WHERE I.ID = DATA.ID
You can also play with the window frame a little and let qualify do the work
select *
from your_table
qualify id=first_value(id) over (order by random() rows between unbounded preceding and unbounded following)
Snowflake deviates from the ANSI standard on the default window frames for rank-related functions (first_value, last_value, nth_value), which makes the above equivalent to:
select *
from your_table
qualify id=first_value(id) over (order by random())

Is it possible to apply "Select Distinct" to any column of the query that isn’t in the first place?

For example, like the query below:
WITH T1 AS
(
SELECT DISTINCT
song_name,
year_rank AS rank,
group_name
FROM
billboard_top_100_year_end
WHERE
year = 2010
ORDER BY
rank
LIMIT 10
)
SELECT
rank,
group_name,
song_name
FROM
T1
LIMIT 10
I had to put the column song_name at the top because I didn't know how to use DISTINCT when the column song_name was in third place.
So I then had to query again just to obtain exactly the same result, but with another column order for display.
DISTINCT does not apply to a certain column of the result set, but to all of them. It just eliminates duplicate result rows.
SELECT DISTINCT a, b, c FROM tab;
is the same as
SELECT a, b, c FROM tab GROUP BY a, b, c;
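Applied to the query in the question, that means the position of song_name in the SELECT list is irrelevant; a sketch against the billboard_top_100_year_end table from the question:
SELECT DISTINCT
    year_rank AS rank,
    group_name,
    song_name
FROM billboard_top_100_year_end
WHERE year = 2010
ORDER BY rank
LIMIT 10;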
Perhaps you are looking for the (non-standard!) PostgreSQL extension DISTINCT ON:
SELECT DISTINCT ON (song_name)
       song_name, col2, col3, ...
FROM tab
ORDER BY song_name, col2;
With the ORDER BY, this will give you, for each song_name, the result row with the smallest col2. If you omit the ORDER BY, you will get an arbitrary result row for each song_name.

Count values if text contains in BigQuery

I have a datasource that looks like this:
combinations
A
A
A,B
B
A,C
A,B,C
What I want to do is create a table that counts every time one combination occurs OR is contained in another combination. For that there are two steps:
output all the unique combinations.
count every time that combination occurs or is contained in another combination.
In this example, the desired output is this one:
combinations   frequency
A              5
B              3
A,B            2
A,C            2
A,B,C          1
Any ideas on how I can achieve this with BigQuery or SQL? I have tried with Count(), but the results are not correct.
You can extract the unique combinations, then use a join to count:
select c.combination, count(*)
from (select distinct combination from t) c join
     t
     on concat(',', t.combination, ',') like concat('%,', c.combination, ',%')
group by c.combination;
EDIT:
The above treats combinations as contiguous strings, so it would, for example, not count "A,C" as contained in "A,B,C". If you want to treat the combinations as individual values, then storing them in strings is the wrong data structure. However, you can still do what you want by using arrays:
select t.combination, countif(num_combos = num_matches)
from (select t.combination, t.num_combos, t2.seqnum, count(*) as num_matches
      from (select distinct t.combination,
                   split(t.combination, ',') as ar_combos,
                   array_length(split(t.combination, ',')) as num_combos
            from t
           ) t cross join
           unnest(t.ar_combos) ar join
           (select t2.*, row_number() over (order by combination) as seqnum, ar2
            from t t2 cross join
                 unnest(split(t2.combination, ',')) ar2
           ) t2
           on t2.ar2 = ar
      group by t.combination, t.num_combos, t2.seqnum
     ) t
group by combination;
Consider the approach below:
select a.combination, countif((
    select count(1) = array_length(split(a.combination))
    from unnest(split(a.combination)) item
    join unnest(split(d.combination)) item
    using(item)
  )) frequency
from (select distinct combination from data) a, data d
group by a.combination
# order by array_length(split(a.combination)), a.combination
If applied to the sample data in your question, the output matches the expected result shown above.

Select max value of each group using partition by

I have the following code, which is taking a very long time to execute. What I need to do is select the rows having row number equal to 1 after partitioning by three columns (col_1, col_2, col_3) [which are also the key columns] and ordering by some columns, as mentioned below. The number of records in the table is around 90 million. Am I following the best approach, or is there a better one?
with cte as (
    SELECT b.*,
           ROW_NUMBER() OVER (PARTITION BY col_1, col_2, col_3
                              ORDER BY new_col DESC, new_col_2 DESC, new_col_3 DESC) AS ROW_NUMBER
    FROM (
        SELECT *,
               CASE
                   WHEN update_col = ' ' THEN new_update_col
                   ELSE update_col
               END AS new_col_1
        FROM schema_name.table_name
    ) b
)
select top 10 * from cte WHERE ROW_NUMBER = 1
Currently you are applying the CASE expression to every row in the table, and CASE with string comparison is a costly operation.
At the end, you keep only the records with ROW_NUMBER = 1. If that filter discards a large share of your records, you can reduce the query execution time by filtering first (generate the row number and keep the rows with RN = 1) and only then applying the CASE expression.
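A sketch of that reordering, reusing the column names from the question; it is valid here because new_col_1 is used neither in the PARTITION BY nor in the ORDER BY, so the CASE can be deferred until after the filter:
with cte as (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY col_1, col_2, col_3
                              ORDER BY new_col DESC, new_col_2 DESC, new_col_3 DESC) AS rn
    FROM schema_name.table_name
)
select top 10 *,
       -- apply the costly CASE only to the rows that survive the filter
       CASE WHEN update_col = ' ' THEN new_update_col ELSE update_col END AS new_col_1
from cte
WHERE rn = 1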

SQL select segment

I'm using SQL Server 2008.
I have a table with x rows. I would like to always divide x by 5 and select the 3rd group of records.
Let's say there are 100 records in the table:
100 / 5 = 20
The 3rd segment will be records 41 to 60.
How will I be able, in SQL, to calculate and select this 3rd segment only?
Thanks.
You can use NTILE.
Distributes the rows in an ordered partition into a specified number of groups.
Example:
SELECT col1, col2, ..., coln
FROM
(
    SELECT
        col1, col2, ..., coln,
        NTILE(5) OVER (ORDER BY id) AS groupno
    FROM yourtable
) AS t
WHERE groupno = 3
That's a perfect use for the NTILE ranking function.
Basically, you define your query inside a CTE and add an NTILE to your rows - a number going from 1 to n (the argument to NTILE). You order your rows by some column, and then you get the n groups of rows you're looking for, and you can operate on any one of those "groups" of data.
So try something like this:
;WITH SegmentedData AS
(
    SELECT
        (list of your columns),
        GroupNo = NTILE(5) OVER (ORDER BY SomeColumnOfYours)
    FROM dbo.YourTable
)
SELECT *
FROM SegmentedData
WHERE GroupNo = 3
Of course, you can also use an UPDATE statement after the CTE to update those rows.
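A sketch of that UPDATE variant, assuming the same CTE; SomeFlag is a hypothetical column used purely for illustration. SQL Server allows updating through a CTE as long as you modify base-table columns:
;WITH SegmentedData AS
(
    SELECT *,
           GroupNo = NTILE(5) OVER (ORDER BY SomeColumnOfYours)
    FROM dbo.YourTable
)
UPDATE SegmentedData
SET SomeFlag = 1   -- SomeFlag is hypothetical; replace with the column you need to change
WHERE GroupNo = 3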