Fewest number of buckets to bag elements in BigQuery - SQL

I have a matrix with buckets and elements like the one below. If an element can fit in a bucket, the corresponding cell is 1.
For example: if you look at the image, element x can fit in buckets a, b and c, but not in d and e.
I want to find the fewest buckets that cover all my elements. In this case, buckets c and d cover all the elements, so just two buckets are enough.
Any idea if I can do this in BigQuery dynamically and efficiently? The original data is not as simple as this.
select "element-x" as element , 1 as bucketa, 1 as bucketb, 1 as bucketc, 0 as bucketd, 0 as buckete
union all
select "element-y" as element , 0 as bucketa, 0 as bucketb, 1 as bucketc, 0 as bucketd, 0 as buckete
union all
select "element-z" as element , 1 as bucketa, 0 as bucketb, 1 as bucketc, 0 as bucketd, 0 as buckete
union all
select "element-p" as element , 0 as bucketa, 0 as bucketb, 1 as bucketc, 0 as bucketd, 0 as buckete
union all
select "element-q" as element , 1 as bucketa, 0 as bucketb, 0 as bucketc, 1 as bucketd, 0 as buckete
union all
select "element-r" as element , 0 as bucketa, 1 as bucketb, 0 as bucketc, 1 as bucketd, 1 as buckete

Consider the solution below - obviously you need to make sure you provide accurate data in the matrix CTE, and you also need to adjust the buckets_elements CTE so it reflects all the buckets in the matrix. The rest of the CTEs and the final query will do the work for you!
with matrix as (
  select "element-x" as element, 1 as bucketa, 1 as bucketb, 1 as bucketc, 0 as bucketd, 0 as buckete union all
  select "element-y", 0, 0, 1, 0, 0 union all
  select "element-z", 1, 0, 1, 0, 0 union all
  select "element-p", 0, 0, 1, 0, 0 union all
  select "element-q", 1, 0, 0, 1, 0 union all
  select "element-r", 0, 1, 0, 1, 1
), buckets_elements as (
  select array[struct(a), struct(b), struct(c), struct(d), struct(e)] buckets
  from (
    select
      array_agg(if(bucketa = 1, element, null) ignore nulls) a,
      array_agg(if(bucketb = 1, element, null) ignore nulls) b,
      array_agg(if(bucketc = 1, element, null) ignore nulls) c,
      array_agg(if(bucketd = 1, element, null) ignore nulls) d,
      array_agg(if(buckete = 1, element, null) ignore nulls) e
    from matrix
  )
), columns_names as (
  select
    regexp_extract_all(to_json_string((select as struct * except(element) from unnest([t]))), r'"([^"]+)"') cols
  from matrix t limit 1
), columns_index as (
  select generate_array(0, array_length(cols) - 1) as arr
  from columns_names
), buckets_combinations as (
  select
    (select array_agg(
        case when n & (1<<pos) <> 0 then arr[offset(pos)] end
        ignore nulls)
     from unnest(generate_array(0, array_length(arr) - 1)) pos
    ) as combo
  from columns_index cross join
  unnest(generate_array(1, cast(power(2, array_length(arr)) - 1 as int64))) n
)
select
  array(select cols[offset(i)] from columns_names, unnest(combo) i) winners
from (
  select combo,
    rank() over(order by (select count(distinct el) from unnest(val) v, unnest(v.a) el) desc, array_length(combo)) as rnk
  from (
    select any_value(c).combo, array_agg(buckets[offset(i)]) val
    from buckets_combinations c, unnest(combo) i, buckets_elements b
    group by format('%t', c)
  )
)
where rnk = 1
with output
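In effect this is a brute-force search for the smallest covering set of buckets: every integer n from 1 to 2^k - 1 (k being the number of bucket columns) is treated as a bit mask, and bit pos of n decides whether column pos joins the combination; the final rank prefers combinations covering the most distinct elements and, among those, the shortest one. A minimal standalone sketch of just that enumeration step, with a hard-coded three-column list purely for illustration:
with columns_index as (
  select ['a', 'b', 'c'] as cols, generate_array(0, 2) as arr
)
select n,
  (select array_agg(cols[offset(pos)])
   from unnest(arr) pos
   where n & (1 << pos) <> 0) as combo
from columns_index,
unnest(generate_array(1, cast(power(2, array_length(arr)) - 1 as int64))) n
For three columns this yields the 7 non-empty subsets of {a, b, c}; since the full query enumerates 2^k - 1 combinations, it is only practical for a modest number of buckets.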

Related

BigQuery SQL query to Indicate a sequence of 3 rows sharing the same value

I need a query so that every time the indicator column turns to zero and there are 3 zeros in a row, those rows get assigned a unique group number.
Here is a sample data:
select 0 as offset, 1 as indicator, -1 as grp union all
select 1, 1, -1 union all
select 2, 1, -1 union all
select 3, 1, -1 union all
select 4, 1, -1 union all
select 5, 1, -1 union all
select 6, 1, -1 union all
select 7, 0, 1 union all
select 8, 0, 1 union all
select 9, 0, 1 union all
select 10, 1, -1 union all
select 11, 0, 2 union all
select 12, 0, 2 union all
select 13, 0, 2 union all
select 14, 1, -1 union all
select 15, 1, -1 union all
select 16, 1, -1
In this example there are two sequences of 3 zeros, indicated as grp=1 and grp=2.
Consider the approach below
select offset, indicator, if(grp = 0, -1, grp) as grp
from (
  select offset, indicator, dense_rank() over(order by pregroup) - 1 as grp
  from (
    select offset, indicator,
      if(countif(indicator = 0) over(partition by pregroup) = 3 and indicator = 0, pregroup, -1) as pregroup
    from (
      select offset, indicator, count(*) over win - countif(indicator = 0) over win as pregroup
      from your_table
      window win as (order by offset)
    )
  )
)
If applied to slightly modified sample data from your question (with a sequence of 4 zeros, just for test purposes), the output is
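That modified data is not shown here; a possible version (an assumption on my part, any run of four zeros will do) changes one row of the sample so that offsets 11-14 become four zeros in a row:
-- in the sample data, replace
select 14, 1, -1 union all
-- with
select 14, 0, -1 union all
With the query above, that run has countif(indicator = 0) = 4 within its pregroup, so none of its rows get tagged - the check is for exactly 3 zeros in a row.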
The query below solves this.
First, it tags every row that is part of a desired group.
Second, it takes the row number of those tagged rows and uses integer casting on the row number to assign each group a unique number.
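For example, the tagged zero rows receive row numbers 1 through 6, and CAST((row_number + 1)/3 AS INT) maps 1, 2, 3 to group 1 and 4, 5, 6 to group 2 (in BigQuery the division yields a FLOAT64 and CAST rounds to the nearest integer, so 0.67 and 1.33 both become 1).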
with data as (select 0 as offset, 1 as indicator, -1 as grp union all
select 1, 1, -1 union all
select 2, 1, -1 union all
select 3, 1, -1 union all
select 4, 1, -1 union all
select 5, 1, -1 union all
select 6, 1, -1 union all
select 7, 0, 1 union all
select 8, 0, 1 union all
select 9, 0, 1 union all
select 10, 1, -1 union all
select 11, 0, 2 union all
select 12, 0, 2 union all
select 13, 0, 2 union all
select 14, 1, -1 union all
select 15, 1, -1 union all
select 16, 1, -1 ),
tagged as (select
*,
-- mark a row as part of a group if the two rows ahead of it, the two rows behind it, or one row on each side also have indicator = 0
case
when indicator = 0 and lead(indicator) over(order by offset) = 0 and lead(indicator, 2) over(order by offset) = 0 then true
when indicator = 0 and lead(indicator) over(order by offset) = 0 and lag(indicator) over(order by offset) = 0 then true
when indicator = 0 and lag(indicator) over(order by offset) = 0 and lag(indicator, 2) over(order by offset) = 0 then true
else false
end as part_of_group
from data),
group_tags as (
select
*,
-- use cast as int to acquire the group number from the row number
CAST((row_number() over(order by offset) + 1)/3 AS INT) as group_tag
from
tagged
where
part_of_group = true)
-- rejoin this data back together
select
d.*,
gt.group_tag
from data as d
left join
group_tags as gt
on
d.offset = gt.offset
You may consider the approach below as well.
WITH partitions AS (
SELECT *, indicator = 0 AND COUNT(div) OVER (PARTITION BY div, indicator) = 3 AS flag
FROM (
SELECT *, SUM(indicator) OVER (ORDER BY offset) AS div FROM sample_data
)
)
SELECT offset, indicator, IF(flag, DENSE_RANK() OVER w, -1) AS grp
FROM partitions
WINDOW w AS (PARTITION BY CASE WHEN flag THEN 0 ELSE 1 END ORDER BY div)
ORDER BY offset;
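The trick here is that SUM(indicator) OVER (ORDER BY offset) only grows on rows where indicator = 1, so every row in a run of zeros shares the same div value; PARTITION BY div, indicator then isolates each zero run, and the COUNT(...) = 3 check flags exactly the runs of three.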
Query results

Count the number of matches in an array in BigQuery

How can I count the number of matches in an array? For example, for the numbers [1,3] there are 2 matches in the array [1,2,3] and 1 match in the array [1,2]. Right now I can only check whether [1,3] is in the array or not.
WITH `arrays` AS (
SELECT 1 id, [1,2,3] as arr
UNION ALL
SELECT 2, [1,2]
UNION ALL
SELECT 3, [3]
)
SELECT id, arr, [1,3] as numbers,
CASE
1 IN UNNEST(arr) and
3 IN UNNEST(arr)
WHEN TRUE THEN 'numbers is in array'
ELSE 'numbers is not in array'
END conclusion
FROM `arrays`
I'm trying to get a result like this:
Using a bit of math, the following seems possible:
If the union of arr and numbers is the same size as arr, then all of numbers is in the array.
If the union of arr and numbers is larger than arr, then the amount by which it grew is the number of elements of numbers that are not in arr.
So numbers_len - (union_len - arr_len) gives the match count (check).
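A quick worked check against the sample data: for id 2, arr = [1,2] and numbers = [1,3], so the union is [1,2,3]; union_len = 3, arr_len = 2, numbers_len = 2, and check = 2 - (3 - 2) = 1 match. For id 1 the union adds nothing, so check = 2 - (3 - 3) = 2.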
WITH `arrays` AS (
SELECT 1 id, [1,2,3] as arr
UNION ALL
SELECT 2, [1,2]
UNION ALL
SELECT 3, [3]
),
calculated_arrays AS (
SELECT *, [1,3] as numbers,
ARRAY_LENGTH(ARRAY(SELECT DISTINCT * FROM UNNEST(arr || [1, 3]))) AS union_len,
ARRAY_LENGTH(arr) AS arr_len,
ARRAY_LENGTH([1, 3]) AS numbers_len
FROM `arrays`
)
SELECT id, arr, numbers,
numbers_len - union_len + arr_len AS check,
IF (union_len = arr_len, 'numbers is in array', 'numbers is not in array') AS conclusion
FROM calculated_arrays
;
output:
Consider the approach below
with `arrays` as (
  select 1 id, [1,2,3] as arr union all
  select 2, [1,2] union all
  select 3, [3]
)
select *,
  ( select count(*)
    from t.numbers num1 join t.arr num2
    on num1 = num2
  ) check,
  ( select format('number is %sin array',
      if(logical_and(if(num2 is null, false, true)), '', 'not '))
    from t.numbers num1 left join t.arr num2
    on num1 = num2
  ) conclusion
from (
  select id, arr, [1,3] as numbers
  from `arrays`
) t
with output
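If only the per-row match count is needed, the same idea can also be sketched with a correlated subquery and IN UNNEST (not from the original answers, just an equivalent formulation):
with `arrays` as (
  select 1 id, [1,2,3] as arr union all
  select 2, [1,2] union all
  select 3, [3]
)
select id, arr, [1,3] as numbers,
  (select count(*) from unnest(arr) x where x in unnest([1,3])) as matches
from `arrays`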

SQL: Get last referring and post referring page during a signup process

I'm trying to write an efficient SQL query to select the 'before' and 'after' pages for the signup process. I have a solution using for loops which doesn't scale, and am hoping for a SQL-native solution.
For a single clientId, I want to get the latest page before signup and the page right after signup (only 1 from each side of the join process).
The join process ALWAYS has /join/complete.
Input:
clientId  time  path
1         0     /page1
1         10    /page2
1         20    /join/<random_token_id>
1         30    /join/<random_token_id>/step2
1         40    /join/complete
1         50    /page2
2         0     /page3
2         10    /join/complete

Output:

ClientId  Before  After
1         /page2  /page2
2         /page3  null
I would be grateful if there is an easy solution in SQL. If it's complex, just leave it out. I will leave the code running overnight.
#standardSQL
WITH lineup AS (
SELECT clientId, time, path,
ROW_NUMBER() OVER(PARTITION BY clientId ORDER BY time) pos
FROM `project.dataset.table`
), start AS (
SELECT row.clientId, row.pos FROM (
SELECT ARRAY_AGG(t ORDER BY pos LIMIT 1)[OFFSET(0)] row
FROM lineup t WHERE STARTS_WITH(path, '/join/')
GROUP BY clientId)
), complete AS (
SELECT clientId, pos FROM lineup WHERE path = '/join/complete'
), before AS (
SELECT lineup.clientId, path FROM lineup JOIN start
ON lineup.clientId = start.clientId AND lineup.pos = start.pos - 1
), after AS (
SELECT lineup.clientId, path FROM lineup JOIN complete
ON lineup.clientId = complete.clientId AND lineup.pos = complete.pos + 1
)
SELECT clientId, before.path AS before, after.path AS after
FROM before FULL OUTER JOIN after USING (clientId)
You can test / play with the above using the dummy data from your question, as below
#standardSQL
WITH `project.dataset.table` AS (
SELECT 1 clientId, 0 time, '/page1' path UNION ALL
SELECT 1, 10, '/page2' UNION ALL
SELECT 1, 20, '/join/<random_token_id>' UNION ALL
SELECT 1, 30, '/join/<random_token_id>/step2' UNION ALL
SELECT 1, 40, '/join/complete' UNION ALL
SELECT 1, 50, '/page2' UNION ALL
SELECT 2, 0, '/page3' UNION ALL
SELECT 2, 10, '/join/complete' UNION ALL
SELECT 3, 0, '/join/complete' UNION ALL
SELECT 3, 10, '/page4'
), lineup AS (
SELECT clientId, time, path,
ROW_NUMBER() OVER(PARTITION BY clientId ORDER BY time) pos
FROM `project.dataset.table`
), start AS (
SELECT row.clientId, row.pos FROM (
SELECT ARRAY_AGG(t ORDER BY pos LIMIT 1)[OFFSET(0)] row
FROM lineup t WHERE STARTS_WITH(path, '/join/')
GROUP BY clientId)
), complete AS (
SELECT clientId, pos FROM lineup WHERE path = '/join/complete'
), before AS (
SELECT lineup.clientId, path FROM lineup JOIN start
ON lineup.clientId = start.clientId AND lineup.pos = start.pos - 1
), after AS (
SELECT lineup.clientId, path FROM lineup JOIN complete
ON lineup.clientId = complete.clientId AND lineup.pos = complete.pos + 1
)
SELECT clientId, before.path AS before, after.path AS after
FROM before FULL OUTER JOIN after USING (clientId)
with the result:

Row  clientId  before  after
1    1         /page2  /page2
2    2         /page3  null
3    3         null    /page4

How to convert a query with CONNECT BY from Oracle to Postgres?

I am converting my application DB from Oracle to Postgres. I am stuck on a function that uses the CONNECT BY syntax. Below is the Oracle query.
PROCEDURE Get_Report_Data(parm_Billing_Month VARCHAR2, p_Data OUT Ref_Cur) IS
BEGIN
OPEN p_Data FOR
SELECT CASE
WHEN Id = 1 THEN
'Amount < 10000'
WHEN Id = 2 THEN
'10000-15000'
WHEN Id = 3 THEN
'15000-20000'
ELSE
'Amount > 20000'
END "Range",
SUM(Nvl(N1, 0)) N1,
SUM(Nvl(N2, 0)) N2,
SUM(Nvl(C1, 0)) C1,
SUM(Nvl(C2, 0)) C2,
SUM(Nvl(C3, 0)) C3,
SUM(Nvl(S1, 0)) S1,
SUM(Nvl(S2, 0)) S2,
COUNT(Site_Id) "No of Sites"
FROM (SELECT CASE
WHEN Nvl(Ed.Actual_Bill, 0) < 10000 THEN
1
WHEN Ed.Actual_Bill < 15000 THEN
2
WHEN Ed.Actual_Bill < 20000 THEN
3
ELSE
4
END Amount_Sort,
Decode(Er.Region_Id, 1, 1, 0) N1,
Decode(Er.Region_Id, 2, 1, 0) N2,
Decode(Er.Region_Id, 3, 1, 0) C1,
Decode(Er.Region_Id, 4, 1, 0) C2,
Decode(Er.Region_Id, 5, 1, 0) C3,
Decode(Er.Region_Id, 6, 1, 0) S1,
Decode(Er.Region_Id, 7, 1, 0) S2,
Ed.Site_Id
FROM Tbl_Details Ed,
Tbl_Site Es,
Tbl_Region Er,
Tbl_Subregion Esr
WHERE Ed.Site_Id = Es.Site_Id
AND Es.Subregion_Id = Esr.Subregion_Id
AND Esr.Region_Id = Er.Region_Id
AND Ed.Billing_Month_f = parm_Billing_Month) Data,
(SELECT Regexp_Substr('1,2,3,4,', '[^,]+', 1, Rownum) Id
FROM Dual
CONNECT BY Rownum <= Length('1,2,3,4,') -
Length(REPLACE('1,2,3,4,', ','))) All_Value
WHERE Data.Amount_Sort(+) = All_Value.Id
GROUP BY All_Value.Id
ORDER BY AVG(All_Value.Id);
END;
When I convert this query to Postgres I make some changes, such as Ref_Cur to refcursor and NVL to COALESCE, but I am still unable to resolve the CONNECT BY syntax. Some people suggested using CTEs, but I am unable to get it working. Any help, guys?
Edit
For random googlers, below is the answer to my problem above. Special thanks to MTO.
WHERE Ed.Site_Id = Es.Site_Id
AND Es.Subregion_Id = Esr.Subregion_Id
AND Esr.Region_Id = Er.Region_Id
AND Ed.Billing_Month_f = p_Billing_Month) data
Right Outer Join (Select 1 as Id union All
Select 2 as Id union All
Select 3 as Id union All
Select 4 as Id) all_value
On data.Amount_Sort = all_value.Id
GROUP BY all_value.Id
ORDER BY AVG(all_value.Id);
The "generation" of IDs can be simplified in Postgres.
Either use a VALUES clause:
Right Outer Join ( values (1),(2),(3),(4) ) as all_value(id) On data.Amount_Sort = all_value.Id
or, if those are always consecutive numbers, use generate_series():
Right Outer Join generate_series(1,4) as all_value(id) On data.Amount_Sort = all_value.Id
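Both forms plug into the right outer join the same way; here is a tiny self-contained check (the left-hand data is made up, independent of the original tables):
SELECT all_value.id, COUNT(data.amount_sort) AS cnt
FROM (VALUES (1), (3)) AS data(amount_sort)
RIGHT OUTER JOIN generate_series(1, 4) AS all_value(id)
  ON data.amount_sort = all_value.id
GROUP BY all_value.id
ORDER BY all_value.id;
This returns one row per id 1 through 4, with cnt = 0 for ids that have no match, which is exactly what the outer join against all_value is there for.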
Since your hierarchical query appears to be using static strings, you can convert this:
SELECT Regexp_Substr('1,2,3,4,', '[^,]+', 1, Rownum) Id
FROM Dual
CONNECT BY Rownum <= Length('1,2,3,4,') - Length(REPLACE('1,2,3,4,', ','))
To:
SELECT 1 AS id FROM DUAL UNION ALL
SELECT 2 FROM DUAL UNION ALL
SELECT 3 FROM DUAL UNION ALL
SELECT 4 FROM DUAL
Which should then be simpler to convert to PostgreSQL.
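In PostgreSQL there is no DUAL, so the FROM clauses simply go away; a sketch of the converted inline ID list:
SELECT 1 AS id UNION ALL
SELECT 2 UNION ALL
SELECT 3 UNION ALL
SELECT 4
which can then be joined (or, as shown above, replaced by VALUES or generate_series()).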

T-SQL: counting '1's in a string, position by position

There are fields for categories like this:
"101011111000000101010011000101..." - every position in this string represents a certain category, where "1" means set and "0" means not set.
I would like to count, for each category position, how many rows have a "1" there, and order the categories descending by that count.
My current solution is like that:
SELECT COUNT(SUBSTRING([Interests], 1, 1)) AS xcount, 1 AS ID
FROM [db1].[dbo].[Contacts]
WHERE SUBSTRING([Interests], 1, 1) = '1'
UNION
SELECT COUNT(SUBSTRING([Interests], 2, 1)) AS xcount, 2 AS ID
FROM [db1].[dbo].[Contacts]
WHERE SUBSTRING([Interests], 2, 1) = '1'
UNION
SELECT COUNT(SUBSTRING([Interests], 3, 1)) AS xcount, 3 AS ID
FROM [db1].[dbo].[Contacts]
WHERE SUBSTRING([Interests], 3, 1) = '1'
UNION
SELECT COUNT(SUBSTRING([Interests], 4, 1)) AS xcount, 4 AS ID
FROM [db1].[dbo].[Contacts]
WHERE SUBSTRING([Interests], 4, 1) = '1'
UNION
SELECT COUNT(SUBSTRING([Interests], 5, 1)) AS xcount, 5 AS ID
FROM [db1].[dbo].[Contacts]
WHERE SUBSTRING([Interests], 5, 1) = '1'
ORDER BY xcount DESC
Is there a better or faster way to count those categories?
SELECT SUM(CASE WHEN SUBSTRING([Interests], _ID.ID, 1) = '1' THEN 1 ELSE 0 END) AS xcount, _ID.ID
FROM [db1].[dbo].[Contacts], (VALUES (1),(2),(3),(4),(5)) AS _ID(ID)
GROUP BY _ID.ID
ORDER BY xcount DESC
For more categories, just extend the _ID value list.
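If the bit string is long, listing every position by hand gets tedious; here is a sketch of the same idea with generated positions (assuming a fixed width of 30 characters and using sys.all_objects merely as a convenient row source):
SELECT SUM(CASE WHEN SUBSTRING([Interests], n.ID, 1) = '1' THEN 1 ELSE 0 END) AS xcount, n.ID
FROM [db1].[dbo].[Contacts]
CROSS JOIN (SELECT TOP (30) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS ID
            FROM sys.all_objects) AS n
GROUP BY n.ID
ORDER BY xcount DESC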
This will count the number of '1' characters in a string consisting of 0s and 1s:
declare @s varchar(100) = '101011111000000101010011000101';
select cnt = len(@s) - len(replace(@s, '1', ''))