Select column names with X highest values - SQL

I have created a matrix of users and their interactions with product categories. My data looks like this, where each row is a user and each column is a category, with the number indicating how many interactions they have made with that category:
User  Cat1  Cat2  Cat3  Cat4  Cat5 ...
1     0     1     0     2     30
2     0     0     10    5     0
3     0     5     0     0     0
4     2     0     20    2     0
5     0     40    0     0     0
...
I'd like to add a column (either in this query or in a fresh query on this table) which returns, for each user, the 3 column names that contain the highest values.
My complete data has 200+ columns.
Any suggestions on how I could achieve this in StandardSQL?
Here is the code I used to build my grid:
SELECT
  customDimension.value AS UserID,
  SUM(IF(LOWER(hits_product.productbrand) LIKE 'brand 1', 1, 0)) AS brand_1,
  SUM(IF(LOWER(hits_product.productbrand) LIKE 'brand 2', 1, 0)) AS brand_2,
  SUM(IF(LOWER(hits_product.productbrand) LIKE 'brand 3', 1, 0)) AS brand_3
FROM
  `table*` AS t
CROSS JOIN
  UNNEST(hits) AS hits
CROSS JOIN
  UNNEST(t.customdimensions) AS customDimension
CROSS JOIN
  UNNEST(hits.product) AS hits_product
WHERE
  PARSE_DATE('%y%m%d', _TABLE_SUFFIX)
    BETWEEN DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)
    AND DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)
  AND customDimension.index = 2
  AND hits.eventInfo.eventCategory = 'Ecommerce'
  AND hits.eventInfo.eventAction = 'Purchase'
GROUP BY
  UserID
LIMIT 50

Below is for BigQuery Standard SQL (and it has no dependency on the number of category columns, even though the example has just 5):
#standardSQL
SELECT *,
  ARRAY_TO_STRING(ARRAY(
    SELECT SPLIT(kv, ':')[OFFSET(0)]
    FROM UNNEST(SPLIT(REGEXP_REPLACE(TO_JSON_STRING(t), r'[{"}]', ''))) kv
    WHERE LOWER(SPLIT(kv, ':')[OFFSET(0)]) <> 'user'
    ORDER BY CAST(SPLIT(kv, ':')[OFFSET(1)] AS INT64) DESC
    LIMIT 3
  ), ',') top3_cat
FROM `yourproject.yourdataset.yourtable` t
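To see what the inner query works with: for the first sample row, TO_JSON_STRING(t) returns a single string like the one below. REGEXP_REPLACE strips the braces and quotes, SPLIT breaks the remainder on commas into key:value pairs, and the ORDER BY ... LIMIT 3 keeps the three keys with the largest values. (This assumes the values themselves contain no commas or nested fields, which holds for integer counts.)
{"user":1,"cat1":0,"cat2":1,"cat3":0,"cat4":2,"cat5":30}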
You can test and play with the above using the dummy data from your question:
#standardSQL
WITH `project.dataset.table` AS (
  SELECT 1 user, 0 cat1, 1 cat2, 0 cat3, 2 cat4, 30 cat5 UNION ALL
  SELECT 2, 0, 0, 10, 5, 0 UNION ALL
  SELECT 3, 0, 5, 0, 0, 0 UNION ALL
  SELECT 4, 2, 0, 20, 2, 0 UNION ALL
  SELECT 5, 0, 40, 0, 0, 0
)
SELECT *,
  ARRAY_TO_STRING(ARRAY(
    SELECT SPLIT(kv, ':')[OFFSET(0)]
    FROM UNNEST(SPLIT(REGEXP_REPLACE(TO_JSON_STRING(t), r'[{"}]', ''))) kv
    WHERE LOWER(SPLIT(kv, ':')[OFFSET(0)]) <> 'user'
    ORDER BY CAST(SPLIT(kv, ':')[OFFSET(1)] AS INT64) DESC
    LIMIT 3
  ), ',') top3_cat
FROM `project.dataset.table` t
with result
Row  user  cat1  cat2  cat3  cat4  cat5  top3_cat
1    1     0     1     0     2     30    cat5,cat4,cat2
2    2     0     0     10    5     0     cat3,cat4,cat2
3    3     0     5     0     0     0     cat2,cat3,cat1
4    4     2     0     20    2     0     cat3,cat4,cat1
5    5     0     40    0     0     0     cat2,cat3,cat1
I've updated my question with the code I used to build the matrix, would you mind showing how I would integrate your solution?
#standardSQL
WITH `query_result` AS (
  SELECT
    customDimension.value AS UserID,
    SUM(IF(LOWER(hits_product.productbrand) LIKE 'brand 1', 1, 0)) AS brand_1,
    SUM(IF(LOWER(hits_product.productbrand) LIKE 'brand 2', 1, 0)) AS brand_2,
    SUM(IF(LOWER(hits_product.productbrand) LIKE 'brand 3', 1, 0)) AS brand_3,
    ...
    ...
  FROM
    `table*` AS t
  CROSS JOIN
    UNNEST(hits) AS hits
  CROSS JOIN
    UNNEST(t.customdimensions) AS customDimension
  CROSS JOIN
    UNNEST(hits.product) AS hits_product
  WHERE
    PARSE_DATE('%y%m%d', _TABLE_SUFFIX)
      BETWEEN DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)
      AND DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)
    AND customDimension.index = 2
    AND hits.eventInfo.eventCategory = 'Ecommerce'
    AND hits.eventInfo.eventAction = 'Purchase'
  GROUP BY
    UserID
  LIMIT 50
)
SELECT *,
  ARRAY_TO_STRING(ARRAY(
    SELECT SPLIT(kv, ':')[OFFSET(0)]
    FROM UNNEST(SPLIT(REGEXP_REPLACE(TO_JSON_STRING(t), r'[{"}]', ''))) kv
    WHERE LOWER(SPLIT(kv, ':')[OFFSET(0)]) <> LOWER('UserID')
    ORDER BY CAST(SPLIT(kv, ':')[OFFSET(1)] AS INT64) DESC
    LIMIT 3
  ), ',') top3_cat
FROM `query_result` t

Expanding on my comment: if your data were in a more reasonable format, like user | category | cat_count, you could run something like:
SELECT user, group_concat(category) as top_3_cat
FROM
(
  SELECT user, category, rank() OVER (PARTITION BY user ORDER BY cat_count DESC) as cat_rank
  FROM yourtable
) cat_ranking
WHERE cat_rank <= 3
GROUP BY user;
Doing this in your current schema would be nearly impossible given the number of categories you have as columns.
I would focus on unpivoting your table first so it can be run through the SQL above. This may be possible using BigQuery's unpivot transform, although I'm not sure what the limit is for unpivoting columns.
unpivot col:cat1, cat2, cat3, cat4, cat5, catN groupEvery:N
I don't use bigquery, so I'm not certain how that gets applied to your dataset, but it looks promising.
The other option is UNION many statements together to make up yourtable in that sql above:
SELECT user, 'cat1' AS category, cat1 AS cat_count FROM yourtable
UNION ALL SELECT user, 'cat2', cat2 FROM yourtable
UNION ALL SELECT user, 'cat3', cat3 FROM yourtable
UNION ALL SELECT user, 'cat4', cat4 FROM yourtable
UNION ALL SELECT user, 'cat5', cat5 FROM yourtable
UNION ALL SELECT user, 'catN', catN FROM yourtable;
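Putting those two pieces together: a minimal sketch in BigQuery Standard SQL, assuming the wide table is named yourtable (a placeholder) and substituting BigQuery's STRING_AGG for MySQL's group_concat:
#standardSQL
WITH long_format AS (
  -- unpivot the wide table into user | category | cat_count
  SELECT user, 'cat1' AS category, cat1 AS cat_count FROM yourtable UNION ALL
  SELECT user, 'cat2', cat2 FROM yourtable UNION ALL
  SELECT user, 'cat3', cat3 FROM yourtable UNION ALL
  SELECT user, 'cat4', cat4 FROM yourtable UNION ALL
  SELECT user, 'cat5', cat5 FROM yourtable
)
SELECT user, STRING_AGG(category ORDER BY cat_count DESC) AS top_3_cat
FROM (
  SELECT user, category, cat_count,
         RANK() OVER (PARTITION BY user ORDER BY cat_count DESC) AS cat_rank
  FROM long_format
)
WHERE cat_rank <= 3
GROUP BY user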

You would use arrays in BigQuery:
select t.*,
       (select array_agg(s.colname order by s.val desc limit 3)
        from unnest([struct('col1' as colname, col1 as val),
                     struct('col2' as colname, col2 as val),
                     . . .
                    ]) s
       ) as top3
from t
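To make that concrete, here is a runnable sketch of the same idea, spelling out the struct list for the five category columns from the question's sample data:
#standardSQL
WITH t AS (
  SELECT 1 AS user, 0 AS cat1, 1 AS cat2, 0 AS cat3, 2 AS cat4, 30 AS cat5 UNION ALL
  SELECT 2, 0, 0, 10, 5, 0
)
SELECT t.*,
  (SELECT ARRAY_AGG(s.colname ORDER BY s.val DESC LIMIT 3)
   FROM UNNEST([STRUCT('cat1' AS colname, cat1 AS val),
                STRUCT('cat2' AS colname, cat2 AS val),
                STRUCT('cat3' AS colname, cat3 AS val),
                STRUCT('cat4' AS colname, cat4 AS val),
                STRUCT('cat5' AS colname, cat5 AS val)]) s
  ) AS top3  -- e.g. for user 1: [cat5, cat4, cat2]
FROM t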

Related

SQL: how to select n rows from each interval of one column

For example, the table looks like:
a   b  c
1   1  1
2   1  1
3   1  1
4   1  1
5   1  1
6   1  1
7   1  1
8   1  1
9   1  1
10  1  1
11  1  1
I want to randomly pick 2 rows from every interval based on column a, where a ~ [0, 2], a ~ [4, 6], a ~ [9, 20].
A more complicated case would be selecting n rows from every interval based on multiple columns, for example a ~ [0, 2], a ~ [4, 6], b ~ [7, 9], ...
Is there a way to do so with just SQL?
Find out to which interval each row belongs, order by random partitioned by an interval id, get the top n rows for each interval:
create transient table mytable as
select seq8() id, random() data
from table(generator(rowcount => 100)) v;
create transient table intervals as
select 0 i_start, 6 i_end, 2 random_rows
union all select 7, 20, 1
union all select 21, 30, 3
union all select 31, 50, 1;
select *
from (
select *
, row_number() over(partition by i_start order by random()) rn
from mytable a
join intervals b
on a.id between b.i_start and b.i_end
)
where rn<=random_rows
Edit: Shorter and cleaner.
select a.*
from mytable a
join intervals b
on a.id between b.i_start and b.i_end
qualify row_number() over(partition by i_start order by random()) <= random_rows
To get two rows per group, you want to use row_number(). To define the groups, you can use a lateral join to define the groupings:
select t.*
from (select t.*,
row_number() over (partition by v.grp order by random()) as seqnum
from t cross join lateral
(values (case when a between 0 and 2 then 1
when a between 4 and 6 then 2
when a between 7 and 9 then 3
end)
) v(grp)
where grp is not null
) t
where seqnum <= 2;
You can adjust the case expression to define whatever groups you like.
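For reference, a self-contained sketch of the same approach (Postgres syntax, which supports cross join lateral and random()), using the sample rows and the intervals from the question; rows with a = 3, 7, or 8 fall outside every interval and are filtered out:
-- sample data from the question; intervals [0,2], [4,6], [9,20]
with t(a, b, c) as (
  values (1,1,1), (2,1,1), (3,1,1), (4,1,1), (5,1,1), (6,1,1),
         (7,1,1), (8,1,1), (9,1,1), (10,1,1), (11,1,1)
)
select a, b, c
from (select t.*,
             row_number() over (partition by v.grp order by random()) as seqnum
      from t cross join lateral
           (values (case when a between 0 and 2 then 1   -- interval labels are arbitrary
                         when a between 4 and 6 then 2
                         when a between 9 and 20 then 3
                    end)) v(grp)
      where v.grp is not null
     ) x
where seqnum <= 2;  -- two random rows per interval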

count zeros between 1s in same column

I have data like this:
ID  IND
1   0
2   0
3   1
4   0
5   1
6   0
7   0
I want to count the zeros before each 1 (and the trailing zeros at the end), so the output would look like this:
ID  IND  OUT
1   0    0
2   0    0
3   1    2
4   0    0
5   1    1
6   0    0
7   0    2
Is it possible without PL/SQL? I tried working with differences between row numbers but couldn't get it to work.
The match_recognize clause, introduced in Oracle 12.1, makes quick work of such "row pattern recognition" problems. The solution is just a bit involved because of the special treatment of a final run of rows with IND = 0 that is not followed by a 1 (the pattern variable X is left undefined, so it matches any row and catches the last zero of such a run), but it is straightforward otherwise.
As usual, the with clause is not part of the solution; I include it to test the query. Remove it and use your actual table and column names.
with
inputs (id, ind) as (
select 1, 0 from dual union all
select 2, 0 from dual union all
select 3, 1 from dual union all
select 4, 0 from dual union all
select 5, 1 from dual union all
select 6, 0 from dual union all
select 7, 0 from dual
)
select id, ind, out
from inputs
match_recognize(
order by id
measures case classifier() when 'Z' then 0
when 'O' then count(*) - 1
else count(*) end as out
all rows per match
pattern ( Z* ( O | X ) )
define Z as ind = 0, O as ind != 0
);
ID IND OUT
---------- ---------- ----------
1 0 0
2 0 0
3 1 2
4 0 0
5 1 1
6 0 0
7 0 2
You can treat this as a gaps-and-islands problem. You can define the "islands" by the number of 1s on or after each row. Then use a window function:
select t.*,
(case when ind = 1 or row_number() over (order by id desc) = 1
then sum(1 - ind) over (partition by grp)
else 0
end) as num_zeros
from (select t.*,
sum(ind) over (order by id desc) as grp
from t
) t;
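To check the logic, here is a runnable version (Oracle syntax, reusing the inputs CTE from the match_recognize answer above); the comment shows the groups the descending sum assigns:
with
inputs (id, ind) as (
  select 1, 0 from dual union all
  select 2, 0 from dual union all
  select 3, 1 from dual union all
  select 4, 0 from dual union all
  select 5, 1 from dual union all
  select 6, 0 from dual union all
  select 7, 0 from dual
)
select t.*,
       case when ind = 1 or row_number() over (order by id desc) = 1
            then sum(1 - ind) over (partition by grp)
            else 0
       end as num_zeros
from (select i.id, i.ind,
             sum(ind) over (order by id desc) as grp  -- ids 1-3 get grp 2, ids 4-5 grp 1, ids 6-7 grp 0
      from inputs i
     ) t
order by id;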
If id is sequential with no gaps, you can do this without a subquery:
select t.*,
(case when ind = 1 or row_number() over (order by id desc) = 1
then id - coalesce(lag(case when ind = 1 then id end ignore nulls) over (order by id), min(id) over () - 1)
else 0
end)
from t;
I would suggest removing the case conditions and just using the then clause for the expression, so the value is on all rows.

How to group user sessions by converted row

I'm doing simple multichannel attribution exploration and got stuck with grouping user sessions.
For example, I have simple sessions table:
client channel time converted
1 social 1 0
1 cpc 2 0
1 email 3 1
1 email 4 0
1 cpc 5 1
2 organic 1 0
2 cpc 2 1
3 email 1 0
Each row is a user session; the converted column shows whether the user converted in that session.
I need to group, for each user and each conversion, the sessions leading up to that conversion, so the ideal result would be:
client channels time converted
1 [social,cpc,email] 3 1
1 [email,cpc] 5 1
2 [organic,cpc] 2 1
3 [email] 1 0
Note user 3: he has not converted, but I still need his sessions.
You need to assign a group. For this purpose, a reverse running sum of converted (a cumulative sum taken in descending time order) looks like the right thing:
select client, array_agg(channel order by time) as channels,
max(time) as time, max(converted) as converted
from (select t.*,
sum(t.converted) over (partition by t.client order by t.time desc) as grp
from t
) t
group by client, grp;
Below is for BigQuery Standard SQL
#standardSQL
SELECT
client,
STRING_AGG(channel ORDER BY time) channels,
MAX(time) time,
MAX(converted) converted
FROM (
SELECT *, COUNTIF(converted = 1) OVER(PARTITION BY client ORDER BY time DESC) session
FROM `project.dataset.table`
)
GROUP BY client, session
-- ORDER BY client, time
You can test and play with the above using the sample data from your question, as in the example below:
#standardSQL
WITH `project.dataset.table` AS (
SELECT 1 client, 'social' channel, 1 time, 0 converted UNION ALL
SELECT 1, 'cpc', 2, 0 UNION ALL
SELECT 1, 'email', 3, 1 UNION ALL
SELECT 1, 'email', 4, 0 UNION ALL
SELECT 1, 'cpc', 5, 1 UNION ALL
SELECT 2, 'organic', 1, 0 UNION ALL
SELECT 2, 'cpc', 2, 1 UNION ALL
SELECT 3, 'email', 1, 0
)
SELECT
client,
STRING_AGG(channel ORDER BY time) channels,
MAX(time) time,
MAX(converted) converted
FROM (
SELECT *, COUNTIF(converted = 1) OVER(PARTITION BY client ORDER BY time DESC) session
FROM `project.dataset.table`
)
GROUP BY client, session
ORDER BY client, time
with result
Row client channels time converted
1 1 social,cpc,email 3 1
2 1 email,cpc 5 1
3 2 organic,cpc 2 1
4 3 email 1 0

SQL Server: how to find the record where a field is X for the first time and there are no later records where it isn't

I have tried for quite some time now but cannot figure out how best to do this without using cursors. What I want to do (in SQL Server) is:
Find the earliest (by Date) record where Criterion = 1 that is NOT followed by any record with Criterion = 0, for each Name and Category.
Or expressed differently:
Find the Date when Criterion turned 1 and did not turn 0 again afterwards (for each Name and Category).
Some sort of CTE would seem to make sense, I guess, but that's not my strong suit unfortunately. I tried nesting queries to find the latest record where Criterion = 0 and then selecting the next record if there is one, but I'm getting incorrect results. Another challenge is returning a record for a Name and Category that only has records with Criterion = 1.
Here's the sample data:
Name Category Criterion Date
------------------------------------------------
Bob Cat1 1 22.11.16 08:54 X
Bob Cat2 0 21.02.16 02:29
Bob Cat3 1 22.11.16 08:55
Bob Cat3 0 22.11.16 08:56
Bob Cat4 0 21.06.12 02:30
Bob Cat4 0 18.11.16 08:18
Bob Cat4 1 18.11.16 08:19
Bob Cat4 0 22.11.16 08:20
Bob Cat4 1 22.11.16 08:50 X
Bob Cat4 1 22.11.16 08:51
Hannah Cat1 1 22.11.16 08:54 X
Hannah Cat2 0 21.02.16 02:29
Hannah Cat3 1 22.11.16 08:55
Hannah Cat3 0 22.11.16 08:56
The rows with an X after the row are the ones I want to retrieve.
It's probably not all that complicated in the end...
If you just want the name, category, and date:
select name, category, min(date)
from t
where criterion = 1 and
not exists (select 1
from t t2
where t2.name = t.name and t2.category = t.category and
t2.criterion = 0 and t2.date >= t.date
)
group by name, category;
There are fancier ways to get this information, but this is a relatively simple method.
Actually, the fancier ways aren't particularly complicated:
select t.*
from (select t.*,
min(case when date > maxdate_0 or maxdate_0 is NULL then date end) over (partition by name, category) as mindate_1
from (select t.*,
max(case when criterion = 0 then date end) over (partition by name, category) as maxdate_0
from t
) t
where criterion = 1
) t
where mindate_1 = date;
EDIT:
SQL Fiddle doesn't seem to be working these days. The following is working for me (using Postgres):
with t(name, category, criterion, date) as (
values ('Bob', 'Cat1', 1, '2016-11-22 08:54'),
('Bob', 'Cat2', 0, '2016-02-21 02:29'),
('Bob', 'Cat3', 1, '2016-11-22 08:55'),
('Bob', 'Cat3', 0, '2016-11-22 08:56'),
('Bob', 'Cat4', 0, '2012-06-21 02:30'),
('Bob', 'Cat4', 0, '2016-11-18 08:18'),
('Bob', 'Cat4', 1, '2016-11-18 08:19'),
('Bob', 'Cat4', 0, '2016-11-22 08:20'),
('Bob', 'Cat4', 1, '2016-11-22 08:50'),
('Bob', 'Cat4', 1, '2016-11-22 08:51'),
('Hannah', 'Cat1', 1, '2016-11-22 08:54'),
('Hannah', 'Cat2', 0, '2016-02-21 02:29'),
('Hannah', 'Cat3', 1, '2016-11-22 08:55'),
('Hannah', 'Cat3', 0, '2016-11-22 08:56')
)
select t.*
from (select t.*,
min(case when date > maxdate_0 or maxdate_0 is NULL then date end) over (partition by name, category) as mindate_1
from (select t.*,
max(case when criterion = 0 then date end) over (partition by name, category) as maxdate_0
from t
) t
where criterion = 1
) t
where mindate_1 = date;
How about a left join, and filter the NULLs?
SELECT yt.Name, yt.Category, yt.Criterion, MIN(yt.Date) AS Date
FROM YourTable yt
LEFT JOIN YourTable lj ON lj.Name = yt.Name AND lj.Category = yt.Category AND
lj.Criterion != yt.Criterion AND lj.Date > yt.Date
WHERE yt.Criterion = 1 AND lj.Name IS NULL
GROUP BY yt.Name, yt.Category, yt.Criterion
There are tons of ways of doing it, especially with window functions. NOT EXISTS and the anti-join are two of the better methods, but just for fun, here is one of the fancier (to steal Gordon's term) ways of doing it with window functions:
;WITH cte AS (
SELECT
Name
,Category
,CASE WHEN Criterion = 1 THEN Date END as Criterion1Date
,MAX(CASE WHEN Criterion = 0 THEN Date END) OVER (PARTITION BY Name, Category) as MaxDateCriterion0
FROM
Table
)
SELECT
Name
,Category
,MIN(Criterion1Date) as Date
FROM
cte
WHERE
ISNULL(MaxDateCriterion0,'1/1/1900') < Criterion1Date
GROUP BY
Name
,Category
Or as a derived table, if you don't like CTEs; the only difference is that the CTE body is nested in the FROM clause.
SELECT
Name
,Category
,MIN(Criterion1Date) as Date
FROM
(
SELECT
Name
,Category
,CASE WHEN Criterion = 1 THEN Date END as Criterion1Date
,MAX(CASE WHEN Criterion = 0 THEN Date END) OVER (PARTITION BY Name, Category) as MaxDateCriterion0
FROM
Table
) t
WHERE
ISNULL(MaxDateCriterion0,'1/1/1900') < Criterion1Date
GROUP BY
Name
,Category
Modified answer
select name,category
,min (date) as date
from (select name,category,criterion,date
,min (criterion) over
(
partition by name,category
order by date
rows between current row and unbounded following
) as min_following_criterion
from t
) t
where criterion = 1
and ( min_following_criterion <> 0
or min_following_criterion is null
)
group by name,category
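A quick self-contained test of this query (SQL Server syntax) against the Bob/Cat4 rows from the sample data, which exercise both the excluded and the qualifying Criterion = 1 records:
with t(name, category, criterion, date) as (
  select 'Bob', 'Cat4', 0, cast('2012-06-21 02:30' as datetime) union all
  select 'Bob', 'Cat4', 0, cast('2016-11-18 08:18' as datetime) union all
  select 'Bob', 'Cat4', 1, cast('2016-11-18 08:19' as datetime) union all
  select 'Bob', 'Cat4', 0, cast('2016-11-22 08:20' as datetime) union all
  select 'Bob', 'Cat4', 1, cast('2016-11-22 08:50' as datetime) union all
  select 'Bob', 'Cat4', 1, cast('2016-11-22 08:51' as datetime)
)
select name, category, min(date) as date
from (select name, category, criterion, date,
             min(criterion) over
             (
               partition by name, category
               order by date
               rows between current row and unbounded following
             ) as min_following_criterion
      from t
     ) t
where criterion = 1
  and ( min_following_criterion <> 0
     or min_following_criterion is null
      )
group by name, category;
-- returns: Bob | Cat4 | 2016-11-22 08:50 (the row marked X in the question)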

How to transpose recordset columns into rows

I have a query whose code looks like this:
SELECT DocumentID, ComplexSubquery1 ... ComplexSubquery5
FROM Document
WHERE ...
ComplexSubquery are all numerical fields that are calculated using, duh, complex subqueries.
I would like to use this query as a subquery to a query that generates a summary like the following one:
Field DocumentCount Total
1 dc1 s1
2 dc2 s2
3 dc3 s3
4 dc4 s4
5 dc5 s5
Where:
dc<n> = SUM(CASE WHEN ComplexSubquery<n> > 0 THEN 1 END)
s<n>  = SUM(CASE WHEN Field = n THEN ComplexSubquery<n> END)
How could I do that in SQL Server?
NOTE: I know I could avoid the problem by discarding the original query and using unions:
SELECT '1' AS TypeID,
       SUM(CASE WHEN ComplexSubquery1 > 0 THEN 1 END) AS DocumentCount,
       SUM(ComplexSubquery1) AS Total
FROM (SELECT DocumentID, BLARGH ... AS ComplexSubquery1) T
UNION ALL
SELECT '2' AS TypeID,
       SUM(CASE WHEN ComplexSubquery2 > 0 THEN 1 END) AS DocumentCount,
       SUM(ComplexSubquery2) AS Total
FROM (SELECT DocumentID, BLARGH ... AS ComplexSubquery2) T
UNION ALL
...
But I want to avoid this route, because redundant code makes my eyes bleed. (Besides, there is a real possibility that the number of complex subqueries will grow in the future.)
WITH Document(DocumentID, Field) As
(
SELECT 1, 1 union all
SELECT 2, 1 union all
SELECT 3, 2 union all
SELECT 4, 3 union all
SELECT 5, 4 union all
SELECT 6, 5 union all
SELECT 7, 5
), CTE AS
(
SELECT DocumentID,
Field,
(select 10) As ComplexSubquery1,
(select 20) as ComplexSubquery2,
(select 30) As ComplexSubquery3,
(select 40) as ComplexSubquery4,
(select 50) as ComplexSubquery5
FROM Document
)
SELECT Field,
SUM(CASE WHEN RIGHT(Query,1) = Field AND QueryValue > 0 THEN 1 END ) AS DocumentCount,
SUM(CASE WHEN RIGHT(Query,1) = Field THEN QueryValue END ) AS Total
FROM CTE
UNPIVOT (QueryValue FOR Query IN
(ComplexSubquery1, ComplexSubquery2, ComplexSubquery3,
ComplexSubquery4, ComplexSubquery5)
)AS unpvt
GROUP BY Field
Returns
Field DocumentCount Total
----------- ------------- -----------
1 2 20
2 1 20
3 1 30
4 1 40
5 2 100
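One caveat on the condition above: RIGHT(Query, 1) = Field only works while there are at most nine subqueries; from ComplexSubquery10 on, the last character no longer identifies the column. A variant of the same query that strips the shared prefix instead:
SELECT Field,
       SUM(CASE WHEN REPLACE(Query, 'ComplexSubquery', '') = Field
                 AND QueryValue > 0 THEN 1 END) AS DocumentCount,
       SUM(CASE WHEN REPLACE(Query, 'ComplexSubquery', '') = Field
                THEN QueryValue END) AS Total
FROM CTE
UNPIVOT (QueryValue FOR Query IN
        (ComplexSubquery1, ComplexSubquery2, ComplexSubquery3,
         ComplexSubquery4, ComplexSubquery5)
        ) AS unpvt
GROUP BY Field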
I'm not 100% positive from your example, but perhaps the PIVOT operator will help you out here? I think if you selected your original query into a temporary table, you could pivot on the document ID and get the sums for the other queries.
I don't have much experience with it though, so I'm not sure how complex you can get with your subqueries - you might have to break it down.