How to group user sessions by converted row - sql

I'm doing a simple multichannel attribution exploration and got stuck on grouping user sessions.
For example, I have a simple sessions table:
client channel time converted
1 social 1 0
1 cpc 2 0
1 email 3 1
1 email 4 0
1 cpc 5 1
2 organic 1 0
2 cpc 2 1
3 email 1 0
Each row is one user session, and the converted column shows whether the user converted in that session.
I need to group the sessions that lead to a conversion, for each user and for each conversion, so the perfect result would be:
client channels time converted
1 [social,cpc,email] 3 1
1 [email,cpc] 5 1
2 [organic,cpc] 2 1
3 [email] 1 0
Notice user 3: he has not converted, but I still need his sessions in the result.

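You need to assign a group to each row. A reverse cumulative sum of converted (ordered by time descending) looks like the right thing: each session gets the number of conversions at or after it, so all sessions leading up to the same conversion share one value: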
select client, array_agg(channel order by time) as channels,
max(time) as time, max(converted) as converted
from (select t.*,
sum(t.converted) over (partition by t.client order by t.time desc) as grp
from t
) t
group by client, grp;
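To make the grouping rule easier to check by hand, here is a minimal Python sketch of the same reverse running sum, run on the sample rows from the question. The function name group_by_conversion and the tuple layout are assumptions made for illustration; this is not part of either SQL answer.

```python
from itertools import groupby

# Sample rows from the question: (client, channel, time, converted)
sessions = [
    (1, "social", 1, 0), (1, "cpc", 2, 0), (1, "email", 3, 1),
    (1, "email", 4, 0), (1, "cpc", 5, 1),
    (2, "organic", 1, 0), (2, "cpc", 2, 1),
    (3, "email", 1, 0),
]

def group_by_conversion(rows):
    """Mimic sum(converted) over (partition by client order by time desc)."""
    result = []
    ordered = sorted(rows, key=lambda r: (r[0], r[2]))
    for client, client_rows in groupby(ordered, key=lambda r: r[0]):
        groups = {}   # grp value -> rows that share it
        running = 0   # conversions seen so far, walking from newest to oldest
        for row in reversed(list(client_rows)):
            running += row[3]
            groups.setdefault(running, []).append(row)
        for members in groups.values():
            members.sort(key=lambda r: r[2])  # channels in time order
            result.append((
                client,
                [m[1] for m in members],
                max(m[2] for m in members),
                max(m[3] for m in members),
            ))
    return sorted(result, key=lambda r: (r[0], r[2]))

for line in group_by_conversion(sessions):
    print(line)
```

Sessions after a client's last conversion get the running value 0, so they form their own group with converted = 0, which is how user 3 stays in the output.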

Below is for BigQuery Standard SQL
#standardSQL
SELECT
client,
STRING_AGG(channel ORDER BY time) channels,
MAX(time) time,
MAX(converted) converted
FROM (
SELECT *, COUNTIF(converted = 1) OVER(PARTITION BY client ORDER BY time DESC) session
FROM `project.dataset.table`
)
GROUP BY client, session
-- ORDER BY client, time
You can test the above using the sample data from your question, as in the example below
#standardSQL
WITH `project.dataset.table` AS (
SELECT 1 client, 'social' channel, 1 time, 0 converted UNION ALL
SELECT 1, 'cpc', 2, 0 UNION ALL
SELECT 1, 'email', 3, 1 UNION ALL
SELECT 1, 'email', 4, 0 UNION ALL
SELECT 1, 'cpc', 5, 1 UNION ALL
SELECT 2, 'organic', 1, 0 UNION ALL
SELECT 2, 'cpc', 2, 1 UNION ALL
SELECT 3, 'email', 1, 0
)
SELECT
client,
STRING_AGG(channel ORDER BY time) channels,
MAX(time) time,
MAX(converted) converted
FROM (
SELECT *, COUNTIF(converted = 1) OVER(PARTITION BY client ORDER BY time DESC) session
FROM `project.dataset.table`
)
GROUP BY client, session
ORDER BY client, time
with result
Row client channels time converted
1 1 social,cpc,email 3 1
2 1 email,cpc 5 1
3 2 organic,cpc 2 1
4 3 email 1 0

Related

count zeros between 1s in same column

I've data like this.
ID IND
1 0
2 0
3 1
4 0
5 1
6 0
7 0
I want to count the zeros before the value 1. So that, the output will be like below.
ID IND OUT
1 0 0
2 0 0
3 1 2
4 0 0
5 1 1
6 0 0
7 0 2
Is it possible without pl/sql? I tried to find the differences between row numbers but couldn't achieve it.
The match_recognize clause, introduced in Oracle 12.1, can make quick work of such "row pattern recognition" problems. The solution is just a bit complex due to the special treatment of a "last row" with IND = 0, but it is straightforward otherwise.
As usual, the with clause is not part of the solution; I include it to test the query. Remove it and use your actual table and column names.
with
inputs (id, ind) as (
select 1, 0 from dual union all
select 2, 0 from dual union all
select 3, 1 from dual union all
select 4, 0 from dual union all
select 5, 1 from dual union all
select 6, 0 from dual union all
select 7, 0 from dual
)
select id, ind, out
from inputs
match_recognize(
order by id
measures case classifier() when 'Z' then 0
when 'O' then count(*) - 1
else count(*) end as out
all rows per match
pattern ( Z* ( O | X ) )
define Z as ind = 0, O as ind != 0
);
ID IND OUT
---------- ---------- ----------
1 0 0
2 0 0
3 1 2
4 0 0
5 1 1
6 0 0
7 0 2
You can treat this as a gaps-and-islands problem. You can define the "islands" by the number of "1"s on or after each row. Then use a window function:
select t.*,
(case when ind = 1 or row_number() over (order by id desc) = 1
then sum(1 - ind) over (partition by grp)
else 0
end) as num_zeros
from (select t.*,
sum(ind) over (order by id desc) as grp
from t
) t;
If id is sequential with no gaps, you can do this without a subquery. Subtracting ind keeps the row with the 1 itself out of the count:
select t.*,
(case when ind = 1 or row_number() over (order by id desc) = 1
then id - ind - coalesce(lag(case when ind = 1 then id end ignore nulls) over (order by id), min(id) over () - 1)
else 0
end) as num_zeros
from t;
I would suggest removing the case conditions and just using the then clause for the expression, so the value is on all rows.
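As a quick cross-check of the expected OUT column, the same rule can be sketched in plain Python. The function name zeros_before_ones is hypothetical, and the list is assumed to be already ordered by ID.

```python
def zeros_before_ones(ind):
    """For each 1, count the zeros since the previous 1.

    A trailing run of zeros is reported on the last row, matching
    the expected output in the question.
    """
    out = [0] * len(ind)
    zeros = 0
    for i, value in enumerate(ind):
        if value == 1:
            out[i] = zeros   # zeros seen since the previous 1
            zeros = 0
        else:
            zeros += 1
    if ind and ind[-1] == 0:
        out[-1] = zeros      # special case: the last row is a 0
    return out

print(zeros_before_ones([0, 0, 1, 0, 1, 0, 0]))
```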

Running Total of all Previous Rows BigQuery

I have a BigQuery Table which looks like Below:
ID SessionNumber CountOfAction Category
1 1 1 B
1 2 3 A
1 3 1 A
1 4 4 B
1 5 5 B
I am trying to get the running total of all previous rows for CountofAction where category = A. The final Output should be
ID SessionNumber CountOfAction
1 1 0 --no previous rows have countofAction for category = A
1 2 0 --no previous rows have countofAction for category = A
1 3 3 --previous row (Row 2) has countofAction = 3 for category = A
1 4 4 --previous rows (Row 2 and 3) have countofAction = 3 and 1 for category = A
1 5 4 --previous rows (Row 2 and 3) have countofAction = 3 and 1 for category = A
Below is the query I have written but it doesn't give me desired output
select
ID,
SessionNumber ,
SUM(CountofAction) OVER(Partition by ID ORDER BY SessionNumber ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) as CumulativeCountofAction
From TAble1 where category = 'A'
I would really appreciate any help on this! Thanks in advance
Filtering on category in the where clause evicts (id, sessionNumber) tuples where category 'A' does not appear, which is not what you want.
Instead, you can use aggregation and a conditional sum():
select
id,
sessionNumber,
sum(sum(if(category = 'A', countOfAction, 0))) over(
partition by id
order by sessionNumber
rows between unbounded preceding and 1 preceding
) CumulativeCountofAction
from mytable t
group by id, sessionNumber
order by id, sessionNumber
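The idea of a conditional sum over "all previous rows" can be sketched in Python as follows. The function name previous_running_total is hypothetical, and the sketch returns 0 for the first row of each id, whereas a SQL SUM over an empty window frame returns NULL unless it is wrapped in coalesce or IFNULL.

```python
# Sample rows from the question: (id, session_number, count_of_action, category)
rows = [
    (1, 1, 1, "B"),
    (1, 2, 3, "A"),
    (1, 3, 1, "A"),
    (1, 4, 4, "B"),
    (1, 5, 5, "B"),
]

def previous_running_total(rows, category="A"):
    """Sum count_of_action over earlier sessions of the same id,
    counting only rows whose category matches."""
    totals = {}  # id -> running total over the rows seen so far
    out = []
    for id_, session, count, cat in sorted(rows, key=lambda r: (r[0], r[1])):
        # Emit the total BEFORE adding the current row ("1 preceding").
        out.append((id_, session, totals.get(id_, 0)))
        if cat == category:
            totals[id_] = totals.get(id_, 0) + count
    return out

for line in previous_running_total(rows):
    print(line)
```

Every (id, session) row is kept and only the summed value is conditional, which is the difference from filtering in the where clause.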
Below is for BigQuery Standard SQL
#standardSQL
SELECT ID, SessionNumber,
IFNULL(SUM(IF(category = 'A', CountOfAction, 0)) OVER(win), 0) AS CountOfAction
FROM `project.dataset.table`
WINDOW win AS (PARTITION BY ID ORDER BY SessionNumber ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING)
Applied to the sample data from your question, as in the example below
#standardSQL
WITH `project.dataset.table` AS (
SELECT 1 ID, 1 SessionNumber, 1 CountOfAction, 'B' Category UNION ALL
SELECT 1, 2, 3, 'A' UNION ALL
SELECT 1, 3, 1, 'A' UNION ALL
SELECT 1, 4, 4, 'B' UNION ALL
SELECT 1, 5, 5, 'B'
)
SELECT ID, SessionNumber,
IFNULL(SUM(IF(category = 'A', CountOfAction, 0)) OVER(win), 0) AS CountOfAction
FROM `project.dataset.table`
WINDOW win AS (PARTITION BY ID ORDER BY SessionNumber ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING)
result is
Row ID SessionNumber CountOfAction
1 1 1 0
2 1 2 0
3 1 3 3
4 1 4 4
5 1 5 4

Is there a way to find active users in SQL?

I'm trying to find the total count of active users in a database. "Active" users here are defined as those who have registered an event on the selected day or on a later day. So if a user registered an event on days 1, 2 and 5, they are counted as "active" throughout days 1, 2, 3, 4 and 5.
My original dataset looks like this (note that this is a sample - the real dataset will run to up to 365 days, and has around 1000 users).
Day ID
0 1
0 2
0 3
0 4
0 5
1 1
1 2
2 1
3 1
4 1
4 2
As you can see, all 5 IDs are active on Day 0, and 2 IDs (1 and 2) are active until Day 4, so I'd like the finished table to look like this:
Day Count
0 5
1 2
2 2
3 2
4 2
I've tried using the following query:
select Day as days, sum(case when Day <= days then 1 else 0 end)
from df
But it gives incorrect output (only counts users who were active on each specific days).
I'm at a loss as to what I could try next. Does anyone have any ideas? Many thanks in advance!
I think I would just use generate_series():
select gs.d, count(*)
from (select id, min(day) as min_day, max(day) as max_day
from t
group by id
) t cross join lateral
generate_series(t.min_day, t.max_day, 1) gs(d)
group by gs.d
order by gs.d;
If you want to count everyone as active from day 1 -- but not all have a value on day 1 -- then use 1 instead of min_day.
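The same min/max expansion can be sketched in Python to verify the expected counts on the sample data. The function name active_counts is hypothetical; the sketch treats a user as active on every day from their first event to their last event, inclusive.

```python
from collections import Counter

# Sample rows from the question: (day, id)
events = [
    (0, 1), (0, 2), (0, 3), (0, 4), (0, 5),
    (1, 1), (1, 2),
    (2, 1),
    (3, 1),
    (4, 1), (4, 2),
]

def active_counts(events):
    """Count users per day, active from their first to their last event."""
    first, last = {}, {}
    for day, user in events:
        first[user] = min(day, first.get(user, day))
        last[user] = max(day, last.get(user, day))
    counts = Counter()
    for user in first:
        # Plays the role of generate_series(min_day, max_day, 1).
        for day in range(first[user], last[user] + 1):
            counts[day] += 1
    return sorted(counts.items())

print(active_counts(events))
```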
A bit verbose, but this should do:
with dt as (
select 0 d, 1 id
union all
select 0 d, 2 id
union all
select 0 d, 3 id
union all
select 0 d, 4 id
union all
select 0 d, 5 id
union all
select 1 d, 1 id
union all
select 1 d, 2 id
union all
select 2 d, 1 id
union all
select 3 d, 1 id
union all
select 4 d, 1 id
union all
select 4 d, 2 id
)
, active_periods as (
select id
, min(d) min_d
, max(d) max_d
from dt
group by id
)
, days as (
select distinct d
from dt
)
select d.d
, count(ap.id)
from days d
join active_periods ap on d.d between ap.min_d and ap.max_d
group by 1
order by 1 asc
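You need to count by day, not by id. Note that this counts only the users with an event on each specific day; it does not fill the gaps between a user's first and last event.
select
Day,
count(distinct id)
from df
GROUP BY
Day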

Select column names with X highest values

I have created a matrix of users and interactions with product categories, my data looks like this, where each row is a user and each column is a category, with the number indicating how many interactions they have made with that category:
User Cat1 Cat2 Cat3 Cat4 Cat5 ...
1 0 1 0 2 30
2 0 0 10 5 0
3 0 5 0 0 0
4 2 0 20 2 0
5 0 40 0 0 0
...
I'd like to add a column (either in this query or in a fresh query on this table) which returns, for each user, the 3 column names that contain the highest values.
My complete data has 200+ columns.
Any suggestions on how I could achieve this in StandardSQL?
Here is the code I used to build my grid:
SELECT
customDimension.value AS UserID,
SUM(IF(LOWER(hits_product.productbrand) LIKE "brand 1",1,0)) AS brand_1,
SUM(IF(LOWER(hits_product.productbrand) LIKE "brand 2",1,0)) AS brand_2,
SUM(IF(LOWER(hits_product.productbrand) LIKE "brand 3",1,0)) AS brand_3,
FROM
`table*` AS t
CROSS JOIN
UNNEST (hits) AS hits
CROSS JOIN
UNNEST(t.customdimensions) AS customDimension
CROSS JOIN
UNNEST(hits.product) AS hits_product
WHERE
parse_DATE('%y%m%d',
_table_suffix) BETWEEN DATE_SUB(CURRENT_DATE(), INTERVAL 1 day)
AND DATE_SUB(CURRENT_DATE(), INTERVAL 1 day)
AND customDimension.index = 2
AND hits.eventInfo.eventCategory = 'Ecommerce'
AND hits.eventInfo.eventAction = 'Purchase'
GROUP BY
UserID
LIMIT 50
Below is for BigQuery Standard SQL (and has no dependency on number of category columns - even though example has just 5)
#standardSQL
SELECT *,
ARRAY_TO_STRING(ARRAY(
SELECT SPLIT(kv, ':')[OFFSET(0)]
FROM UNNEST(SPLIT(REGEXP_REPLACE(TO_JSON_STRING(t), r'[{"}]', ''))) kv
WHERE LOWER(SPLIT(kv, ':')[OFFSET(0)]) <> 'user'
ORDER BY CAST(SPLIT(kv, ':')[OFFSET(1)] AS INT64) DESC
LIMIT 3
), ',') top3_cat
FROM `yourproject.yourdataset.yourtable` t
You can test the above using the dummy data from your question:
#standardSQL
WITH `project.dataset.table` AS (
SELECT 1 user, 0 cat1, 1 cat2, 0 cat3, 2 cat4, 30 cat5 UNION ALL
SELECT 2, 0, 0, 10, 5, 0 UNION ALL
SELECT 3, 0, 5, 0, 0, 0 UNION ALL
SELECT 4, 2, 0, 20, 2, 0 UNION ALL
SELECT 5, 0, 40, 0, 0, 0
)
SELECT *,
ARRAY_TO_STRING(ARRAY(
SELECT SPLIT(kv, ':')[OFFSET(0)]
FROM UNNEST(SPLIT(REGEXP_REPLACE(TO_JSON_STRING(t), r'[{"}]', ''))) kv
WHERE LOWER(SPLIT(kv, ':')[OFFSET(0)]) <> 'user'
ORDER BY CAST(SPLIT(kv, ':')[OFFSET(1)] AS INT64) DESC
LIMIT 3
), ',') top3_cat
FROM `project.dataset.table` t
with result
Row user cat1 cat2 cat3 cat4 cat5 top3_cat
1 1 0 1 0 2 30 cat5,cat4,cat2
2 2 0 0 10 5 0 cat3,cat4,cat2
3 3 0 5 0 0 0 cat2,cat3,cat1
4 4 2 0 20 2 0 cat3,cat4,cat1
5 5 0 40 0 0 0 cat2,cat3,cat1
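For comparison, the "top N column names per row" logic looks like this in Python, assuming each row is available as a dict of column name to value. The function name top_columns and its parameters are hypothetical. Ties (such as the several zero-valued categories for user 2) are broken here by column order, while the SQL above does not guarantee any particular order among ties.

```python
# Sample rows from the question, one dict per user.
table = [
    {"user": 1, "cat1": 0, "cat2": 1, "cat3": 0, "cat4": 2, "cat5": 30},
    {"user": 2, "cat1": 0, "cat2": 0, "cat3": 10, "cat4": 5, "cat5": 0},
    {"user": 3, "cat1": 0, "cat2": 5, "cat3": 0, "cat4": 0, "cat5": 0},
    {"user": 4, "cat1": 2, "cat2": 0, "cat3": 20, "cat4": 2, "cat5": 0},
    {"user": 5, "cat1": 0, "cat2": 40, "cat3": 0, "cat4": 0, "cat5": 0},
]

def top_columns(row, key="user", n=3):
    """Return the names of the n columns with the highest values,
    ignoring the key column."""
    cats = [(name, value) for name, value in row.items() if name != key]
    # sorted() is stable, so equal values keep their column order.
    ranked = sorted(cats, key=lambda kv: kv[1], reverse=True)
    return [name for name, _ in ranked[:n]]

for row in table:
    print(row["user"], ",".join(top_columns(row)))
```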
I've updated my question with the code I used to build the matrix, would you mind showing how I would integrate your solution?
#standardSQL
WITH `query_result` AS (
SELECT
customDimension.value AS UserID,
SUM(IF(LOWER(hits_product.productbrand) LIKE "brand 1",1,0)) AS brand_1,
SUM(IF(LOWER(hits_product.productbrand) LIKE "brand 2",1,0)) AS brand_2,
SUM(IF(LOWER(hits_product.productbrand) LIKE "brand 3",1,0)) AS brand_3,
...
...
FROM
`table*` AS t
CROSS JOIN
UNNEST (hits) AS hits
CROSS JOIN
UNNEST(t.customdimensions) AS customDimension
CROSS JOIN
UNNEST(hits.product) AS hits_product
WHERE
parse_DATE('%y%m%d',
_table_suffix) BETWEEN DATE_SUB(CURRENT_DATE(), INTERVAL 1 day)
AND DATE_SUB(CURRENT_DATE(), INTERVAL 1 day)
AND customDimension.index = 2
AND hits.eventInfo.eventCategory = 'Ecommerce'
AND hits.eventInfo.eventAction = 'Purchase'
GROUP BY
UserID
LIMIT 50
)
SELECT *,
ARRAY_TO_STRING(ARRAY(
SELECT SPLIT(kv, ':')[OFFSET(0)]
FROM UNNEST(SPLIT(REGEXP_REPLACE(TO_JSON_STRING(t), r'[{"}]', ''))) kv
WHERE LOWER(SPLIT(kv, ':')[OFFSET(0)]) <> LOWER('UserID')
ORDER BY CAST(SPLIT(kv, ':')[OFFSET(1)] AS INT64) DESC
LIMIT 3
), ',') top3_cat
FROM `query_result` t
Expanding on my comment: If your data were in a more reasonable format like user | category | cat_count you could run something like:
SELECT user, group_concat(category) as top_3_cat
FROM
(
-- rank highest counts first
SELECT user, category, rank() OVER (PARTITION BY user ORDER BY cat_count DESC) as cat_rank
FROM yourtable
) cat_ranking
WHERE cat_rank <= 3
GROUP BY user;
Doing this in your current schema would be nearly impossible given the number of categories you have as columns.
I would focus on unpivoting your table first so it can be run through the sql above. This may be possible using bigquery's unpivot transform, although I'm not sure what the limit is for unpivoting columns.
unpivot col:cat1, cat2, cat3, cat4, cat5, catN groupEvery:N
I don't use bigquery, so I'm not certain how that gets applied to your dataset, but it looks promising.
The other option is UNION many statements together to make up yourtable in that sql above:
SELECT user, 'cat1' as category, cat1 as cat_count FROM yourtable
UNION ALL SELECT user, 'cat2', cat2 FROM yourtable
UNION ALL SELECT user, 'cat3', cat3 FROM yourtable
UNION ALL SELECT user, 'cat4', cat4 FROM yourtable
UNION ALL SELECT user, 'cat5', cat5 FROM yourtable
UNION ALL SELECT user, 'catN', catN FROM yourtable;
You would use arrays in bigquery:
select t.*,
(select array_agg(s.colname order by s.val desc limit 3)
from unnest(array[struct('col1' as colname, col1 as val),
struct('col2' as colname, col2 as val),
. . .
]
) s
) as top3
from t

count consecutive statuses from each ID

I am trying to find a list of clients that have at least 3 consecutive items that are "processed". The following is what my table looks like:
ClientID ItemID Status
1 1 Pending
1 2 Processed
1 3 Processed
2 4 Processed
2 5 Processed
1 6 Processed
1 7 Pending
2 8 Pending
2 9 Processed
3 10 Pending
3 11 Pending
2 12 Processed
3 13 Pending
2 14 Processed
1 15 Processed
2 16 Processed
Expected results:
1 (since it had 3 consecutive processed records from 2, 3, 6 )
2 (since it had 4 consecutive processed records from 9, 12, 14, 16)
As you can see, I define "consecutive" as the next record with the same ClientID, not as the next record in the table. This is what I am having trouble with: my counter restarts when the next clientid in the table is different.
my attempt:
WITH count
AS
(
SELECT *, COUNT(1) OVER(PARTITION BY clientid, count) NotPending
FROM (
SELECT *, (
SELECT COUNT(ItemId)
FROM ##temp a
WHERE status like '%pend%'
AND ItemId < b.ItemId) AS count
FROM ##temp b
WHERE status not like '%pend%'
) t1
)
SELECT distinct clientid from count where NotPending >= 3
You can use row_number() to place rows with the same consecutive status in the same group:
select *,
row_number() over (partition by ClientId order by ItemId)
- row_number() over (partition by ClientId, Status order by ItemId) as groupName
from Table1
order by ClientId, ItemId
Then you can count the number of entries per group:
select distinct ClientId, count(*) from (
select *,
row_number() over (partition by ClientId order by ItemId)
- row_number() over (partition by ClientId, Status order by ItemId) as groupName
from Table1
) t
where Status = 'Processed'
group by ClientId, groupName
having count(*) >= 3
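The grouping that the row_number() difference produces (one group per unbroken run of the same status within a client) can be sketched in Python with itertools.groupby. The function name clients_with_run and its parameters are hypothetical; the data is the sample table from the question.

```python
from itertools import groupby

# Sample rows from the question: (client_id, item_id, status)
items = [
    (1, 1, "Pending"), (1, 2, "Processed"), (1, 3, "Processed"),
    (2, 4, "Processed"), (2, 5, "Processed"), (1, 6, "Processed"),
    (1, 7, "Pending"), (2, 8, "Pending"), (2, 9, "Processed"),
    (3, 10, "Pending"), (3, 11, "Pending"), (2, 12, "Processed"),
    (3, 13, "Pending"), (2, 14, "Processed"), (1, 15, "Processed"),
    (2, 16, "Processed"),
]

def clients_with_run(items, status="Processed", min_len=3):
    """Return clients having at least min_len consecutive items with the
    given status, where 'consecutive' means adjacent within the client."""
    by_client = {}
    for client, _item, st in sorted(items, key=lambda r: r[1]):
        by_client.setdefault(client, []).append(st)
    found = []
    for client, statuses in sorted(by_client.items()):
        # groupby yields one run per stretch of identical adjacent statuses.
        longest = max(
            (len(list(run)) for st, run in groupby(statuses) if st == status),
            default=0,
        )
        if longest >= min_len:
            found.append(client)
    return found

print(clients_with_run(items))
```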
In your example there are no clients with more than 5 consecutive items (and in your Select you check for >= 10).
Looking for 3 items, returns 1 & 2 for your example data:
WITH cte AS
(
SELECT ClientID, ItemID, Status,
-- returns 3 when there's only 'Processed'
Sum(CASE WHEN Status = 'Processed' THEN 1 end)
Over (PARTITION BY ClientID
ORDER BY ItemId
-- 3 rows including current row
ROWS 2 Preceding) AS Cnt
FROM ##temp
)
SELECT DISTINCT ClientID
FROM cte
WHERE Cnt = 3