Big query query is too complex after pivot - sql

Assume I have the following table table and a list of interests (cat, dog, music, soccer, coding)
| userId | user_interest | label |
| -------- | -------------- |----------|
| 12345 | cat | 1 |
| 12345 | dog | 1 |
| 6789 | music | 1 |
| 6789 | soccer | 1 |
I want to transform the user interest into a binary array (i.e. binarization), and the resulting table will be something like
| userId | labels |
| -------- | -------------- |
| 12345 | [1,1,0,0,0] |
| 6789 | [0,0,1,1,0] |
I am able to do it with PIVOT and ARRAY, e.g.
WITH user_interest_pivot AS (
SELECT
*
FROM (
SELECT userId, user_interest, label FROM table
) AS T
PIVOT
(
MAX(label) FOR user_interestc IN ('cat', 'dog', 'music', 'soccer', 'coding')
) AS P
)
SELECT
userId,
ARRAY[IFNULL(cat,0), IFNULL(dog,0), IFNULL(music,0), IFNULL(soccer,0), IFNULL(coding,0)] AS labels,
FROM user_interea_pivot
HOWEVER, in reality I have a very long list of interests, and the above method in bigquery seems to not work due to
Resources exceeded during query execution: Not enough resources for query planning - too many subqueries or query is too comple
Please help to let me know if there is anything I can do to deal with this situation. Thanks!

Still it's likely to face resource problem depending on your real data, but it is worth trying the following approach without PIVOT.
Create interests table with additional index column first
+----------+-----+-----------------+
| interest | idx | total_interests |
+----------+-----+-----------------+
| cat | 0 | 5 |
| dog | 1 | 5 |
| music | 2 | 5 |
| soccer | 3 | 5 |
| coding | 4 | 5 |
+----------+-----+-----------------+
find idx of each user interest and aggreage them like below. (assuming that user intererest is sparse over overall interests)
SELECT userId, ARRAY_AGG(idx) user_interests
FROM sample_table t JOIN interests i ON t.user_interest = i.interest
GROUP BY 1
Lastly, create labels vector using a sparse user interest array and dimension of interest space (i.e. total_interests) like below
ARRAY(SELECT IF(ui IS NULL, 0, 1)
FROM UNNEST(GENERATE_ARRAY(0, total_interests - 1)) i
LEFT JOIN t.user_interests ui ON i = ui
ORDER BY i
) AS labels
Query
CREATE TEMP TABLE sample_table AS
SELECT '12345' AS userId, 'cat' AS user_interest, 1 AS label UNION ALL
SELECT '12345' AS userId, 'dog' AS user_interest, 1 AS label UNION ALL
SELECT '6789' AS userId, 'music' AS user_interest, 1 AS label UNION ALL
SELECT '6789' AS userId, 'soccer' AS user_interest, 1 AS label;
CREATE TEMP TABLE interests AS
SELECT *, COUNT(1) OVER () AS total_interests
FROM UNNEST(['cat', 'dog', 'music', 'soccer', 'coding']) interest
WITH OFFSET idx
;
SELECT userId,
ARRAY(SELECT IF(ui IS NULL, 0, 1)
FROM UNNEST(GENERATE_ARRAY(0, total_interests - 1)) i
LEFT JOIN t.user_interests ui ON i = ui
ORDER BY i
) AS labels
FROM (
SELECT userId, total_interests, ARRAY_AGG(idx) user_interests
FROM sample_table t JOIN interests i ON t.user_interest = i.interest
GROUP BY 1, 2
) t;
Query results

I think below approach will "survive" any [reasonable] data
create temp function base10to2(x float64) returns string
language js as r'return x.toString(2);';
with your_table as (
select '12345' as userid, 'cat' as user_interest, 1 as label union all
select '12345' as userid, 'dog' as user_interest, 1 as label union all
select '6789' as userid, 'music' as user_interest, 1 as label union all
select '6789' as userid, 'soccer' as user_interest, 1 as label
), interests as (
select *, pow(2, offset) weight, max(offset + 1) over() as len
from unnest(['cat', 'dog', 'music', 'soccer', 'coding']) user_interest
with offset
)
select userid,
split(rpad(reverse(base10to2(sum(weight))), any_value(len), '0'), '') labels,
from your_table
join interests
using(user_interest)
group by userid
with output

Related

How can i find all linked rows by array values in postgres?

i have a table like this:
id | arr_val | grp
-----------------
1 | {10,20} | -
2 | {20,30} | -
3 | {50,5} | -
4 | {30,60} | -
5 | {1,5} | -
6 | {7,6} | -
I want to find out which rows are in a group together.
In this example 1,2,4 are one group because 1 and 2 have a common element and 2 and 4. 3 and 5 form a group because they have a common element. 6 Has no common elments with anybody else. So it forms a group for itself.
The result should look like this:
id | arr_val | grp
-----------------
1 | {10,20} | 1
2 | {20,30} | 1
3 | {50,5} | 2
4 | {30,60} | 1
5 | {1,5} | 2
6 | {7,6} | 3
I think i need recursive cte because my problem is graphlike but i am not sure how to that.
Additional info and background:
The Table has ~2500000 rows.
In reality the problem i try to solve has more fields and conditions for finding a group:
id | arr_val | date | val | grp
---------------------------------
1 | {10,20} | -
2 | {20,30} | -
Not only do the element of a group need to be linked by common elements in arr_val. They all need to have the same value in val and need to be linked by a timespan in date (gaps and islands). I solved the other two but now the condition of my question was added. If there is an easy way to do all three together in one query that would be awesome but it is not necessary.
----Edit-----
While both answer work for the example of five rows they do not work for a table with a lot more rows. Both answers have the problem that the number of rows in the recursive part explodes and only reduce them at the end.
A solutiuon should work for data like this too:
id | arr_val | grp
-----------------
1 | {1} | -
2 | {1} | -
3 | {1} | -
4 | {1} | -
5 | {1} | -
6 | {1} | -
7 | {1} | -
8 | {1} | -
9 | {1} | -
10 | {1} | -
11 | {1} | -
more rows........
Is there a solution to that problem?
You can handle this as a recursive CTE. Define the edges between the ids based on common values. Then traverse the edges and aggregate:
with recursive nodes as (
select id, val
from t cross join
unnest(arr_val) as val
),
edges as (
select distinct n1.id as id1, n2.id as id2
from nodes n1 join
nodes n2
on n1.val = n2.val
),
cte as (
select id1, id2, array[id1] as visited, 1 as lev
from edges
where id1 = id2
union all
select cte.id1, e.id2, visited || e.id2,
lev + 1
from cte join
edges e
on cte.id2 = e.id1
where e.id2 <> all(cte.visited)
),
vals as (
select id1, array_agg(distinct id2 order by id2) as id2s
from cte
group by id1
)
select *, dense_rank() over (order by id2s) as grp
from vals;
Here is a db<>fiddle.
Here is an approach at this graph-walking problem:
with recursive cte as (
select id, arr_val, array[id] path from mytable
union all
select t.id, t.arr_val, c.path || t.id
from cte c
inner join mytable t on t.arr_val && c.arr_val and not t.id = any(c.path)
)
select c.id, c.arr_val, dense_rank() over(order by min(x.id)) grp
from cte c
cross join lateral unnest(c.path) as x(id)
group by c.id, c.arr_val
order by c.id
The common-table-expression walks the graph, recursively looking for "adjacent" nodes to the current node, while keeping track of already-visited nodes. Then the outer query aggregates, identifies groups using the least node per path, and finally ranks the groups.
Demo on DB Fiddle:
id | arr_val | grp
-: | :------ | --:
1 | {10,20} | 1
2 | {20,30} | 1
3 | {50,5} | 2
4 | {30,60} | 1
5 | {1,5} | 2
6 | {7,6} | 3
While Gordon Linoffs solution is the fastest i found for a small amount of data where the groups are not so big it will not work for bigger datasets and bigger groups.
I changed his solution to make it work.
I moved the edges to an indexed table:
create table edges
(
id1 integer not null,
id2 integer not null,
constraint staffel_group_nodes_pk
primary key (id1, id2)
);
insert into edges(id1, id2) with
nodes(id, arr_val) as (
select id, arr_val
from my_table
)
select n1.id as id1, n2.id as id2
from nodes n1
join
nodes n2
on n1.arr_val && n2.arr_val ;
Tha alone didnt help. I changed his recursive part too:
with recursive
cte as (
select id1, array [id1] as visited
from edges
where id1 = id2
union all
select unnested.id1, array_agg(distinct unnested.vis) as visited
from (
select cte.id1,
unnest(cte.visited || e.id2) as vis
from cte
join
staffel_group_edges e
on e.id1 = any (cte.visited)
and e.id2 <> all (cte.visited)) as unnested
group by unnested.id1
),
vals as (
select id1, array_agg(distinct vis) as id2s
from (
select cte.id1,
unnest(cte.visited) as vis
from cte) as unnested
group by unnested.id1
)
select id1,id2s, dense_rank() over (order by id2s) as grp
from vals;
Every step i group all searches by their starting point. This reduces the amount of parallel walked paths a lot and works surprisingly fast.

Possible to use a column name in a UDF in SQL?

I have a query in which a series of steps is repeated constantly over different columns, for example:
SELECT DISTINCT
MAX (
CASE
WHEN table_2."GRP1_MINIMUM_DATE" <= cohort."ANCHOR_DATE" THEN 1
ELSE 0
END)
OVER (PARTITION BY cohort."USER_ID")
AS "GRP1_MINIMUM_DATE",
MAX (
CASE
WHEN table_2."GRP2_MINIMUM_DATE" <= cohort."ANCHOR_DATE" THEN 1
ELSE 0
END)
OVER (PARTITION BY cohort."USER_ID")
AS "GRP2_MINIMUM_DATE"
FROM INPUT_COHORT cohort
LEFT JOIN INVOLVE_EVER table_2 ON cohort."USER_ID" = table_2."USER_ID"
I was considering writing a function to accomplish this as doing so would save on space in my query. I have been reading a bit about UDF in SQL but don't yet understand if it is possible to pass a column name in as a parameter (i.e. simply switch out "GRP1_MINIMUM_DATE" for "GRP2_MINIMUM_DATE" etc.). What I would like is a query which looks like this
SELECT DISTINCT
FUNCTION(table_2."GRP1_MINIMUM_DATE") AS "GRP1_MINIMUM_DATE",
FUNCTION(table_2."GRP2_MINIMUM_DATE") AS "GRP2_MINIMUM_DATE",
FUNCTION(table_2."GRP3_MINIMUM_DATE") AS "GRP3_MINIMUM_DATE",
FUNCTION(table_2."GRP4_MINIMUM_DATE") AS "GRP4_MINIMUM_DATE"
FROM INPUT_COHORT cohort
LEFT JOIN INVOLVE_EVER table_2 ON cohort."USER_ID" = table_2."USER_ID"
Can anyone tell me if this is possible/point me to some resource that might help me out here?
Thanks!
There is no such direct as #Tejash already stated, but the thing looks like your database model is not ideal - it would be better to have a table that has USER_ID and GRP_ID as keys and then MINIMUM_DATE as seperate field.
Without changing the table structure, you can use UNPIVOT query to mimic this design:
WITH INVOLVE_EVER(USER_ID, GRP1_MINIMUM_DATE, GRP2_MINIMUM_DATE, GRP3_MINIMUM_DATE, GRP4_MINIMUM_DATE)
AS (SELECT 1, SYSDATE, SYSDATE, SYSDATE, SYSDATE FROM dual UNION ALL
SELECT 2, SYSDATE-1, SYSDATE-2, SYSDATE-3, SYSDATE-4 FROM dual)
SELECT *
FROM INVOLVE_EVER
unpivot ( minimum_date FOR grp_id IN ( GRP1_MINIMUM_DATE AS 1, GRP2_MINIMUM_DATE AS 2, GRP3_MINIMUM_DATE AS 3, GRP4_MINIMUM_DATE AS 4))
Result:
| USER_ID | GRP_ID | MINIMUM_DATE |
|---------|--------|--------------|
| 1 | 1 | 09/09/19 |
| 1 | 2 | 09/09/19 |
| 1 | 3 | 09/09/19 |
| 1 | 4 | 09/09/19 |
| 2 | 1 | 09/08/19 |
| 2 | 2 | 09/07/19 |
| 2 | 3 | 09/06/19 |
| 2 | 4 | 09/05/19 |
With this you can write your query without further code duplication and if you need use PIVOT-syntax to get one line per USER_ID.
The final query could then look like this:
WITH INVOLVE_EVER(USER_ID, GRP1_MINIMUM_DATE, GRP2_MINIMUM_DATE, GRP3_MINIMUM_DATE, GRP4_MINIMUM_DATE)
AS (SELECT 1, SYSDATE, SYSDATE, SYSDATE, SYSDATE FROM dual UNION ALL
SELECT 2, SYSDATE-1, SYSDATE-2, SYSDATE-3, SYSDATE-4 FROM dual)
, INPUT_COHORT(USER_ID, ANCHOR_DATE)
AS (SELECT 1, SYSDATE-1 FROM dual UNION ALL
SELECT 2, SYSDATE-2 FROM dual UNION ALL
SELECT 3, SYSDATE-3 FROM dual)
-- Above is sampledata query starts from here:
, unpiv AS (SELECT *
FROM INVOLVE_EVER
unpivot ( minimum_date FOR grp_id IN ( GRP1_MINIMUM_DATE AS 1, GRP2_MINIMUM_DATE AS 2, GRP3_MINIMUM_DATE AS 3, GRP4_MINIMUM_DATE AS 4)))
SELECT qcsj_c000000001000000 user_id, GRP1_MINIMUM_DATE, GRP2_MINIMUM_DATE, GRP3_MINIMUM_DATE, GRP4_MINIMUM_DATE
FROM INPUT_COHORT cohort
LEFT JOIN unpiv table_2
ON cohort.USER_ID = table_2.USER_ID
pivot (MAX(CASE WHEN minimum_date <= cohort."ANCHOR_DATE" THEN 1 ELSE 0 END) AS MINIMUM_DATE
FOR grp_id IN (1 AS GRP1,2 AS GRP2,3 AS GRP3,4 AS GRP4))
Result:
| USER_ID | GRP1_MINIMUM_DATE | GRP2_MINIMUM_DATE | GRP3_MINIMUM_DATE | GRP4_MINIMUM_DATE |
|---------|-------------------|-------------------|-------------------|-------------------|
| 3 | | | | |
| 1 | 0 | 0 | 0 | 0 |
| 2 | 0 | 1 | 1 | 1 |
This way you only have to write your calculation logic once (see line starting with pivot).

Category and Sub category in one column in sql

How we can show category and subcategory in one column in sql. Both columns are present in same table TableA.
Example: TableA
--------------------------------------
| category | subcategory | Values |
--------------------------------------
| Bird | parrot | 5 |
| Bird | pigeon | 10 |
| Animal | lion | 2 |
| Animal | Tiger | 5 |
--------------------------------------
Output table :
-------------------
| NEW | Value |
--------------------
| Bird | 15 |
| parrot | 5 |
| Piegon | 10 |
| Animal | 7 |
| lion | 2 |
| Tiger | 5 |
--------------------
In the output, New is a column where I want both category and sub category.
Sample Query to generate data:
CREATE TABLE #TEMP
(
catgory nvarchar(200),
sub_category nvarchar(200),
[values] nvarchar(200),
)
INSERT INTO #TEMP VALUES ('Bird','parrot',5)
INSERT INTO #TEMP VALUES ('Bird','pigeon',10)
INSERT INTO #TEMP VALUES ('Animal','lion',2)
INSERT INTO #TEMP VALUES ('Animal','Tiger',5)
Where the logic is:
I want category and sub category together, Where category should show the sum of all sub category values and should be in order as I have output table
select category, sum(values) values from table group by category
union
select subcategory, values from table
WITH cte AS
(
SELECT category,value FROM yourtable WHERE category IS NOT NULL
UNION ALL
SELECT subcategory, value FROM yourtable WHERE subcategory IS NOT NULL
)
SELECT Company as new, value FROM cte
You can use cte :
with cte as (
select t.*,
dense_rank() over (order by category) as seq,
sum([values]) over (partition by category) as sums
from table t
)
select t.cat as new, (case when cat_name = 'category' then sums else c.[values] end) as Value
from cte c cross apply
( values ('category', category), ('sub_category', sub_category) ) t(cat_name, cat)
order by seq, (case when cat_name = 'category' then 1 else 2 end);
Try this.
select new, [values] from (select catgory, sub_category as 'new', [values] from temp
union all
select catgory, catgory as 'new', sum([values]) from temp group by catgory) order by catgory
Output:
new values
-------------
parrot 5
pigeon 10
Bird 15
lion 2
lion 2
Tiger 5
Animal 9
Needs UNION ALL after getting the sums with the main table:
select
case t.sub_category
when '' then t.category
else sub_category
end new,
t.value
from (
select category, '' sub_category, sum([values]) value from temp group by category
union all
select category, sub_category, [values] value from temp
) t
order by t.category, t.sub_category
See the demo.
Results:
> new | value
> :----- | ----:
> Animal | 7
> lion | 2
> Tiger | 5
> Bird | 15
> parrot | 5
> pigeon | 10
A simple way to do this uses grouping sets:
select coalesce(subcategory, category) as new, sum(val)
from temp
group by grouping sets ( (category), (subcategory, category))
order by category, subcategory
Here is a db<>fiddle.
Table structure:
Id, name, parentcategoryId
1, animals, null
2, fish, 1
Example mssql:
Select name, subcats = STUFF((
SELECT ',' + NAME
FROM category as cat1 where cat1.parentcategoryId = cat. parentcategoryId
FOR XML PATH('')
), 1, 1, '') from category as cat where parentcategoryId = null
Preview:
Name, subcats
Animals, fish

Oracle Sql: Obtain a Sum of a Group, if Subgroup condition met

I have a dataset upon which I am trying to obain a summed value for each group, if a subgroup within each group meets a certain condition. I am not sure if this is possible, or if I am approaching this problem incorrectly.
My data is structured as following:
+----+-------------+---------+-------+
| ID | Transaction | Product | Value |
+----+-------------+---------+-------+
| 1 | A | 0 | 10 |
| 1 | A | 1 | 15 |
| 1 | A | 2 | 20 |
| 1 | B | 1 | 5 |
| 1 | B | 2 | 10 |
+----+-------------+---------+-------+
In this example I want to obtain the sum of values by the ID column, if a transaction does not contain any products labeled 0. In the above described scenario, all values related to Transaction A would be excluded because Product 0 was purchased. With the outcome being:
+----+-------------+
| ID | Sum of Value|
+----+-------------+
| 1 | 15 |
+----+-------------+
This process would repeat for multiple IDs with each ID only containing the sum of values if the transaction does not contain product 0.
Hmmm . . . one method is to use not exists for the filtering:
select id, sum(value)
from t
where not exists (select 1
from t t2
where t2.id = t.id and t2.transaction = t.transaction and
t2.product = 0
)
group by id;
Do not need to use correlated subquery with not exists.
Just use group by.
with s (id, transaction, product, value) as (
select 1, 'A', 0, 10 from dual union all
select 1, 'A', 1, 15 from dual union all
select 1, 'A', 2, 20 from dual union all
select 1, 'B', 1, 5 from dual union all
select 1, 'B', 2, 10 from dual)
select id, sum(sum_value) as sum_value
from
(select id, transaction,
sum(value) as sum_value
from s
group by id, transaction
having count(decode(product, 0, 1)) = 0
)
group by id;
ID SUM_VALUE
---------- ----------
1 15

Query for missing elements

I have a table with the following structure:
timestamp | name | value
0 | john | 5
1 | NULL | 3
8 | NULL | 12
12 | john | 3
33 | NULL | 4
54 | pete | 1
180 | NULL | 4
400 | john | 3
401 | NULL | 4
592 | anna | 2
Now what I am looking for is a query that will give me the sum of the values for each name, and treats the nulls in between (orderd by the timestamp) as the first non-null name down the list, as if the table were as follows:
timestamp | name | value
0 | john | 5
1 | john | 3
8 | john | 12
12 | john | 3
33 | pete | 4
54 | pete | 1
180 | john | 4
400 | john | 3
401 | anna | 4
592 | anna | 2
and I would query SUM(value), name from this table group by name. I have thought and tried, but I can't come up with a proper solution. I have looked at recursive common table expressions, and think the answer may lie in there, but I haven't been able to properly understand those.
These tables are just examples, and I don't know the timestamp values in advance.
Could someone give me a hand? Help would be very much appreciated.
With Inputs As
(
Select 0 As [timestamp], 'john' As Name, 5 As value
Union All Select 1, NULL, 3
Union All Select 8, NULL, 12
Union All Select 12, 'john', 3
Union All Select 33, NULL, 4
Union All Select 54, 'pete', 1
Union All Select 180, NULL, 4
Union All Select 400, 'john', 3
Union All Select 401, NULL, 4
Union All Select 592, 'anna', 2
)
, NamedInputs As
(
Select I.timestamp
, Coalesce (I.Name
, (
Select I3.Name
From Inputs As I3
Where I3.timestamp = (
Select Max(I2.timestamp)
From Inputs As I2
Where I2.timestamp < I.timestamp
And I2.Name Is not Null
)
)) As name
, I.value
From Inputs As I
)
Select NI.name, Sum(NI.Value) As Total
From NamedInputs As NI
Group By NI.name
Btw, what would be orders of magnitude faster than any query would be to first correct the data. I.e., update the name column to have the proper value, make it non-nullable and then run a simple Group By to get your totals.
Additional Solution
Select Coalesce(I.Name, I2.Name), Sum(I.value) As Total
From Inputs As I
Left Join (
Select I1.timestamp, MAX(I2.Timestamp) As LastNameTimestamp
From Inputs As I1
Left Join Inputs As I2
On I2.timestamp < I1.timestamp
And I2.Name Is Not Null
Group By I1.timestamp
) As Z
On Z.timestamp = I.timestamp
Left Join Inputs As I2
On I2.timestamp = Z.LastNameTimestamp
Group By Coalesce(I.Name, I2.Name)
You don't need CTE, just a simple subquery.
select t.timestamp, ISNULL(t.name, (
select top(1) i.name
from inputs i
where i.timestamp < t.timestamp
and i.name is not null
order by i.timestamp desc
)), t.value
from inputs t
And summing from here
select name, SUM(value) as totalValue
from
(
select t.timestamp, ISNULL(t.name, (
select top(1) i.name
from inputs i
where i.timestamp < t.timestamp
and i.name is not null
order by i.timestamp desc
)) as name, t.value
from inputs t
) N
group by name
I hope I'm not going to be embarassed by offering you this little recursive CTE query of mine as a solution to your problem.
;WITH
numbered_table AS (
SELECT
timestamp, name, value,
rownum = ROW_NUMBER() OVER (ORDER BY timestamp)
FROM your_table
),
filled_table AS (
SELECT
timestamp,
name,
value
FROM numbered_table
WHERE rownum = 1
UNION ALL
SELECT
nt.timestamp,
name = ISNULL(nt.name, ft.name),
nt.value
FROM numbered_table nt
INNER JOIN filled_table ft ON nt.rownum = ft.rownum + 1
)
SELECT *
FROM filled_table
/* or go ahead aggregating instead */