PARTITION BY in CASE doesn't work with several AND statements - sql

I have a table with 4 columns: hitId, userId, timestamp and Camp.
I need to classify if a hit is a start of a new session or not (1 or 0) using two parameters: 1. the time difference between hits and 2. if the source of the hit is a new campaign.
I need a standard SQL query in BigQuery.
A hit is considered the start of a new session if one of the following is true:
it's the first hit from its userId;
the time difference from the timestamp of the previous hit from the same userId is more than 30 mins;
the time difference from the timestamp of the previous hit from the same userId is less than 30 mins, but the Camp (ad campaign) value is not NULL and occurs for the first time for the same userId within the previous 30 mins.
So if hit1 from user1 has a Camp equal to Campaign1, and hit2 from user1 has a Camp equal to Campaign1, and time difference between hit1 and hit2 is less than 30 mins, hit1 will be considered as a start of a session, and hit2 won't be considered as a start.
I have trouble with the Campaign part. I tried this code:
WITH timeDifference AS (
SELECT *,
TIMESTAMP_DIFF(timestamp, LAG(timestamp, 1) OVER
(PARTITION BY userId ORDER BY timestamp), SECOND) AS difference
FROM hitTable
ORDER BY timestamp)
SELECT *,
CASE
WHEN difference >= 30 * 60 THEN 1
WHEN difference IS NULL THEN 1
WHEN difference <= 30 * 60 AND Camp IS NOT NULL AND RANK()
OVER (PARTITION BY userId ORDER BY Camp) = 1 THEN 1
ELSE 0 END AS sess
FROM timeDifference
ORDER BY timestamp;
The condition RANK() OVER (PARTITION BY userId ORDER BY Camp) does not seem to work, as I receive this table:
hitId | userId | timestamp | Camp | difference | sess
_______________________________________________________________________
00150 | 858201 | 00:48:35.315 | NULL | NULL | 1
00151 | 858201 | 00:49:35.315 | NULL | 5 | 0
00152 | 858201 | 00:50:35.315 | Search-Ads-US | 10 | 0
00153 | 858201 | 00:53:35.315 | Search-Ads-US | 15 | 0
00154 | 858202 | 00:54:35.315 | Facebook-Ads | NULL | 1
00155 | 858202 | 00:54:55.315 | Facebook-Ads | 9 | 0
00156 | 858202 | 00:57:20.315 | Facebook-Ads | 12 | 0
whereas I expect 1 in the sess column for hitId = 00152:
hitId | userId | timestamp | Camp | difference | sess
_______________________________________________________________________
00150 | 858201 | 00:48:35.315 | NULL | NULL | 1
00151 | 858201 | 00:49:35.315 | NULL | 5 | 0
00152 | 858201 | 00:50:35.315 | Search-Ads-US | 10 | 1
00153 | 858201 | 00:53:35.315 | Search-Ads-US | 15 | 0
00154 | 858202 | 00:54:35.315 | Facebook-Ads | NULL | 1
00155 | 858202 | 00:54:55.315 | Facebook-Ads | 9 | 0
00156 | 858202 | 00:57:20.315 | Facebook-Ads | 12 | 0

RANK() OVER (PARTITION BY userId ORDER BY Camp) returns incorrect results when a user has multiple Camp values.
Notice that your PARTITION BY uses only userId, while you want to mark sessions within each Camp.
For userId 858201, the rows that actually get rank 1 from the RANK() (...) expression are the ones where Camp is NULL (hitId 00150 and 00151), so hitId 00152 never satisfies your CASE condition.
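You could try adding Camp to your PARTITION BY and ordering by timestamp, so that the first hit of each campaign for a user gets rank 1 (ordering by Camp inside a partition on Camp would tie every row at rank 1):
RANK() OVER (PARTITION BY userId, Camp ORDER BY timestamp)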
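Alternatively, you could replace the RANK() (...) with LAG(Camp) OVER (PARTITION BY userId ORDER BY timestamp), in addition to the LAG(timestamp) (...) you are already calculating.
This will retrieve the Camp value of the previous row (call it 'PreviousCampValue'). Then you could add something like WHEN Camp IS NOT NULL AND (PreviousCampValue IS NULL OR PreviousCampValue != Camp) THEN 1. The NULL check matters because PreviousCampValue != Camp evaluates to NULL, not TRUE, when the previous Camp is NULL, as it is for hitId 00152.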
Hope that's helpful
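The three rules from the question can also be sketched outside SQL to check the expected output. This is only an illustration in Python, not the BigQuery query itself: the function name mark_session_starts, the calendar date attached to the timestamps, and the in-memory list of hits are assumptions made for the example.

```python
from datetime import datetime, timedelta

SESSION_GAP = timedelta(minutes=30)

def mark_session_starts(hits):
    """hits: dicts with hitId, userId, timestamp (datetime), Camp (str or None).
    Returns {hitId: 1 if the hit starts a session, else 0}."""
    result = {}
    history = {}  # userId -> list of (timestamp, Camp) already processed
    for hit in sorted(hits, key=lambda h: (h["userId"], h["timestamp"])):
        seen = history.setdefault(hit["userId"], [])
        ts, camp = hit["timestamp"], hit["Camp"]
        if not seen:
            start = 1  # rule 1: first hit from this userId
        elif ts - seen[-1][0] > SESSION_GAP:
            start = 1  # rule 2: more than 30 minutes since the previous hit
        elif camp is not None and all(
            prev_camp != camp
            for prev_ts, prev_camp in seen
            if ts - prev_ts <= SESSION_GAP
        ):
            start = 1  # rule 3: campaign not seen for this user in the last 30 minutes
        else:
            start = 0
        seen.append((ts, camp))
        result[hit["hitId"]] = start
    return result

def at(h, m, s):
    # The date is arbitrary (assumed); only the time of day comes from the question.
    return datetime(2020, 1, 1, h, m, s)

hits = [
    {"hitId": "00150", "userId": 858201, "timestamp": at(0, 48, 35), "Camp": None},
    {"hitId": "00151", "userId": 858201, "timestamp": at(0, 49, 35), "Camp": None},
    {"hitId": "00152", "userId": 858201, "timestamp": at(0, 50, 35), "Camp": "Search-Ads-US"},
    {"hitId": "00153", "userId": 858201, "timestamp": at(0, 53, 35), "Camp": "Search-Ads-US"},
    {"hitId": "00154", "userId": 858202, "timestamp": at(0, 54, 35), "Camp": "Facebook-Ads"},
    {"hitId": "00155", "userId": 858202, "timestamp": at(0, 54, 55), "Camp": "Facebook-Ads"},
    {"hitId": "00156", "userId": 858202, "timestamp": at(0, 57, 20), "Camp": "Facebook-Ads"},
]
sessions = mark_session_starts(hits)
print(sessions)
```

On the sample rows this marks hitId 00150, 00152 and 00154 as session starts, which is the sess column the question expects.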

Related

How to find longest subsequence based on conditions in Impala SQL

I have a SQL table on Impala that contains ID, dt (monthly, with no skipped months), and the status of each person ID. I want to check how long each ID stays in each status (my expected answer is shown in the expected column).
I tried to solve this problem on the value column by using
count(status) over (partition by ID, status order by dt)
but it doesn't reset the value when the status is changed.
+------+------------+--------+-------+----------+
| ID | dt | status | value | expected |
+------+------------+--------+-------+----------+
| 0001 | 01/01/2020 | 0 | 1 | 1 |
| 0001 | 01/02/2020 | 0 | 2 | 2 |
| 0001 | 01/03/2020 | 1 | 1 | 1 |
| 0001 | 01/04/2020 | 1 | 2 | 2 |
| 0001 | 01/05/2020 | 1 | 3 | 3 |
| 0001 | 01/06/2020 | 0 | 3 | 1 |
| 0001 | 01/07/2020 | 1 | 4 | 1 |
| 0001 | 01/08/2020 | 1 | 5 | 2 |
+------+------------+--------+-------+----------+
Is there any way to reset the counter when the status changes?
When you partition by ID and status, two groups are formed for the values 0 and 1 in the status field. So months 1, 2 and 6 go into the first group with status 0, and months 3, 4, 5, 7 and 8 go into the second group with status 1. The count function then counts the rows within each group separately. Thus the first group has counts from 1 to 3 and the second group has counts from 1 to 5. So far, this query doesn't account for changes in status; it simply divides the record set by status value.
One approach would be to divide the records into different blocks where each status change starts a new block. The below query follows this approach and gives the expected result:
SELECT ID,dt,status,
COUNT(status) OVER(PARTITION BY ID,block_number ORDER BY dt) as value
FROM (
SELECT ID,dt,status,
SUM(change_in_status) OVER(PARTITION BY ID ORDER BY dt) as block_number
FROM(
SELECT ID,dt,status,
CASE WHEN
status<>LAG(status) OVER(PARTITION BY ID ORDER BY dt)
OR LAG(status) OVER(PARTITION BY ID ORDER BY dt) IS NULL
THEN 1
ELSE 0
END as change_in_status
FROM statuses
) derive_status_changes
) derive_blocks;
Here is a working example in DB Fiddle.
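The block-numbering idea can be traced step by step with a small sketch. The Python below is only an illustration of the same logic (flag a change, keep a running block number, count inside the block); the function name count_within_blocks is made up for the example and the rows are assumed to be sorted by ID and then dt already.

```python
def count_within_blocks(rows):
    """rows: (ID, dt, status) tuples, already sorted by ID and then dt.
    Returns (ID, dt, status, block_number, value) tuples."""
    result = []
    prev_id = prev_status = None
    block_number = 0
    value = 0
    for id_, dt, status in rows:
        if id_ != prev_id:
            # New ID: the first row always opens block 1.
            block_number, value = 1, 0
        elif status != prev_status:
            # Status changed: open a new block and reset the counter.
            block_number += 1
            value = 0
        value += 1
        result.append((id_, dt, status, block_number, value))
        prev_id, prev_status = id_, status
    return result

rows = [
    ("0001", "01/01/2020", 0),
    ("0001", "01/02/2020", 0),
    ("0001", "01/03/2020", 1),
    ("0001", "01/04/2020", 1),
    ("0001", "01/05/2020", 1),
    ("0001", "01/06/2020", 0),
    ("0001", "01/07/2020", 1),
    ("0001", "01/08/2020", 1),
]
blocks = count_within_blocks(rows)
print([r[4] for r in blocks])  # the "expected" column from the question
```

The block numbers come out as 1, 1, 2, 2, 2, 3, 4, 4, so counting within each block reproduces the expected column.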

How to add records for each user based on another existing row in BigQuery?

Posting here in case someone with more knowledge than me may be able to help me with some direction.
I have a table like this:
| Row | date |user id | score |
-----------------------------------
| 1 | 20201120 | 1 | 26 |
-----------------------------------
| 2 | 20201121 | 1 | 14 |
-----------------------------------
| 3 | 20201125 | 1 | 0 |
-----------------------------------
| 4 | 20201114 | 2 | 32 |
-----------------------------------
| 5 | 20201116 | 2 | 0 |
-----------------------------------
| 6 | 20201120 | 2 | 23 |
-----------------------------------
However, from this I need a record for each user for each day: if a day is missing for a user, the last recorded score should be carried forward. I would then have something like this:
| Row | date |user id | score |
-----------------------------------
| 1 | 20201120 | 1 | 26 |
-----------------------------------
| 2 | 20201121 | 1 | 14 |
-----------------------------------
| 3 | 20201122 | 1 | 14 |
-----------------------------------
| 4 | 20201123 | 1 | 14 |
-----------------------------------
| 5 | 20201124 | 1 | 14 |
-----------------------------------
| 6 | 20201125 | 1 | 0 |
-----------------------------------
| 7 | 20201114 | 2 | 32 |
-----------------------------------
| 8 | 20201115 | 2 | 32 |
-----------------------------------
| 9 | 20201116 | 2 | 0 |
-----------------------------------
| 10 | 20201117 | 2 | 0 |
-----------------------------------
| 11 | 20201118 | 2 | 0 |
-----------------------------------
| 12 | 20201119 | 2 | 0 |
-----------------------------------
| 13 | 20201120 | 2 | 23 |
-----------------------------------
I'm trying to do this in BigQuery using Standard SQL. I have an idea of how to keep the same score across the following empty dates, but I really don't know how to add new rows for the missing dates for each user. Also, keep in mind that this example has only 2 users, but my data has more than 1500.
My end goal would be to show something like the average of the score per day. For background, because of our logic, if the score wasn't recorded in a specific day, this means that the user is still in the last score recorded which is why I need a score for every user every day.
I'd really appreciate any help I could get! I've been trying different options without success
Below is for BigQuery Standard SQL
#standardSQL
select date, user_id,
last_value(score ignore nulls) over(partition by user_id order by date) as score
from (
select user_id, format_date('%Y%m%d', day) date,
from (
select user_id, min(parse_date('%Y%m%d', date)) min_date, max(parse_date('%Y%m%d', date)) max_date
from `project.dataset.table`
group by user_id
) a, unnest(generate_date_array(min_date, max_date)) day
)
left join `project.dataset.table` b
using(date, user_id)
-- order by user_id, date
if applied to sample data from your question - output is
One option uses generate_date_array() to create the series of dates of each user, then brings the table with a left join.
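select d.date, d.user_id,
last_value(t.score ignore nulls) over(partition by d.user_id order by d.date) as score
from (
-- one row per user per day between that user's first and last date
-- (assumes date is a DATE column; parse it first if it is stored as a string)
select u.user_id, day as date
from (
select user_id, min(date) as min_date, max(date) as max_date
from mytable
group by user_id
) u
cross join unnest(generate_date_array(u.min_date, u.max_date, interval 1 day)) as day
) d
left join mytable t on t.user_id = d.user_id and t.date = d.date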
I think the most efficient method is to use generate_date_array() but in a very particular way:
with t as (
select t.*,
date_add(lead(date) over (partition by user_id order by date), interval -1 day) as next_date
from t
)
select row_number() over (order by t.user_id, dte) as id,
t.user_id, dte, t.score
from t cross join
unnest(generate_date_array(date,
coalesce(next_date, date),
interval 1 day
)
) dte;
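The idea behind the last query (expand each record up to the day before the user's next record) can be checked with a short sketch. The Python below is an illustration only, using the sample rows from the question with the dates parsed into date objects; the function name fill_missing_days is an assumption for the example.

```python
from datetime import date, timedelta

def fill_missing_days(records):
    """records: (user_id, day, score) tuples.
    Returns one row per user per day between the user's first and last
    recorded day, carrying the last recorded score forward."""
    by_user = {}
    for user_id, day, score in records:
        by_user.setdefault(user_id, []).append((day, score))
    filled = []
    for user_id in sorted(by_user):
        days = sorted(by_user[user_id])
        # Pair each record with the next one; the last record has no successor.
        for (day, score), nxt in zip(days, days[1:] + [None]):
            end = nxt[0] if nxt else day + timedelta(days=1)
            current = day
            while current < end:  # repeat the score until the next record
                filled.append((user_id, current, score))
                current += timedelta(days=1)
    return filled

records = [
    (1, date(2020, 11, 20), 26),
    (1, date(2020, 11, 21), 14),
    (1, date(2020, 11, 25), 0),
    (2, date(2020, 11, 14), 32),
    (2, date(2020, 11, 16), 0),
    (2, date(2020, 11, 20), 23),
]
filled = fill_missing_days(records)
for row in filled:
    print(row)
```

On the sample data this yields the 13 rows shown in the question, for example score 14 for user 1 on 22 to 24 November.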

30 day rolling count of distinct IDs

So after looking at what seems to be a common question being asked and not being able to get any solution to work for me, I decided I should ask for myself.
I have a data set with two columns: session_start_time, uid
I am trying to generate a rolling 30 day tally of unique sessions
It is simple enough to query for the number of unique uids per day:
SELECT
COUNT(DISTINCT(uid))
FROM segment_clean.users_sessions
WHERE session_start_time >= CURRENT_DATE - interval '30 days'
It is also relatively simple to calculate the daily unique uids over a date range.
SELECT
DATE_TRUNC('day',session_start_time) AS "date"
,COUNT(DISTINCT uid) AS "count"
FROM segment_clean.users_sessions
WHERE session_start_time >= CURRENT_DATE - INTERVAL '90 days'
GROUP BY date(session_start_time)
I then tried several ways to do a rolling 30-day unique count over a time interval:
SELECT
DATE(session_start_time) AS "running30day"
,COUNT(distinct(
case when date(session_start_time) >= running30day - interval '30 days'
AND date(session_start_time) <= running30day
then uid
end)
) AS "unique_30day"
FROM segment_clean.users_sessions
WHERE session_start_time >= CURRENT_DATE - interval '3 months'
GROUP BY date(session_start_time)
Order BY running30day desc
I really thought this would work, but when looking into the results, it appears I'm getting the same results as the daily unique count rather than the unique count over 30 days.
I am writing this query from Metabase using the SQL query editor. The underlying tables are in Redshift.
If you read this far, thank you, your time has value and I appreciate the fact that you have spent some of it to read my question.
EDIT:
As rightfully requested, I added an example of the data set I'm working with and the desired outcome.
+-----+-------------------------------+
| UID | SESSION_START_TIME |
+-----+-------------------------------+
| 10 | 2020-01-13T01:46:07.000-05:00 |
| 5 | 2020-01-13T01:46:07.000-05:00 |
| 3 | 2020-01-18T02:49:23.000-05:00 |
| 9 | 2020-03-06T18:18:28.000-05:00 |
| 2 | 2020-03-06T18:18:28.000-05:00 |
| 8 | 2020-03-31T23:13:33.000-04:00 |
| 3 | 2020-08-28T18:23:15.000-04:00 |
| 2 | 2020-08-28T18:23:15.000-04:00 |
| 9 | 2020-08-28T18:23:15.000-04:00 |
| 3 | 2020-08-28T18:23:15.000-04:00 |
| 8 | 2020-09-15T16:40:29.000-04:00 |
| 3 | 2020-09-21T20:49:09.000-04:00 |
| 1 | 2020-11-05T21:31:48.000-05:00 |
| 6 | 2020-11-05T21:31:48.000-05:00 |
| 8 | 2020-12-12T04:42:00.000-05:00 |
| 8 | 2020-12-12T04:42:00.000-05:00 |
| 5 | 2020-12-12T04:42:00.000-05:00 |
+-----+-------------------------------+
Below is what the desired result looks like:
+------------+---------------------+
| DATE | UNIQUE 30 DAY COUNT |
+------------+---------------------+
| 2020-01-13 | 3 |
| 2020-01-18 | 1 |
| 2020-03-06 | 3 |
| 2020-03-31 | 1 |
| 2020-08-28 | 4 |
| 2020-09-15 | 2 |
| 2020-09-21 | 1 |
| 2020-11-05 | 2 |
| 2020-12-12 | 2 |
+------------+---------------------+
Thank you
You can approach this by keeping a counter of when users are counted and then uncounted -- 30 (or perhaps 31) days later. Then, determine the "islands" of being counted, and aggregate. This involves:
Unpivoting the data to have an "enters count" and "leaves" count for each session.
Accumulate the count so on each day for each user you know whether they are counted or not.
This defines "islands" of counting. Determine where the islands start and stop -- getting rid of all the detritus in-between.
Now you can simply do a cumulative sum on each date to determine the 30 day session.
In SQL, this looks like:
with t as (
select uid, date_trunc('day', session_start_time) as s_day, 1 as inc
from users_sessions
union all
select uid, date_trunc('day', session_start_time) + interval '31 day' as s_day, -1
from users_sessions
),
tt as ( -- increment the ins and outs to determine whether a uid is in or out on a given day
select uid, s_day, sum(inc) as day_inc,
sum(sum(inc)) over (partition by uid order by s_day rows between unbounded preceding and current row) as running_inc
from t
group by uid, s_day
),
ttt as ( -- find the beginning and end of the islands
select tt.uid, tt.s_day,
(case when running_inc > 0 then 1 else -1 end) as in_island
from (select tt.*,
lag(running_inc) over (partition by uid order by s_day) as prev_running_inc,
lead(running_inc) over (partition by uid order by s_day) as next_running_inc
from tt
) tt
where running_inc > 0 and (prev_running_inc = 0 or prev_running_inc is null) or
running_inc = 0 and (next_running_inc > 0 or next_running_inc is null)
)
select s_day,
sum(sum(in_island)) over (order by s_day rows between unbounded preceding and current row) as active_30
from ttt
group by s_day;
Here is a db<>fiddle.
I'm pretty sure the easier way to do this is to use a join. This creates a list of all the distinct users who had a session on each day and a list of all distinct dates in the data. Then it one-to-many joins the user list to the date list and counts the distinct users. The key here is the expanded join criteria, which match a range of dates to a single date via a pair of inequalities.
with users as
(select
distinct uid,
date_trunc('day',session_start_time) AS dt
from <table>
where session_start_time >= '2021-05-01'),
dates as
(select
distinct date_trunc('day',session_start_time) AS dt
from <table>
where session_start_time >= '2021-05-01')
select
count(distinct uid),
dates.dt
from users
join
dates
on users.dt >= dates.dt - interval '29 days'
and users.dt <= dates.dt
group by dates.dt
order by dt desc
;
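The range-join idea can be sketched in a few lines to see what it counts. The Python below is an illustration only: the data is a small hypothetical sample (not the rows from the question), the function name rolling_distinct is made up, and the window is assumed to cover the reporting date plus the 29 days before it.

```python
from datetime import date, timedelta

def rolling_distinct(sessions, window_days=30):
    """sessions: list of (uid, day) tuples.
    Returns {day: number of distinct uids seen in the window_days ending on day}."""
    days = sorted({d for _, d in sessions})
    counts = {}
    for day in days:
        start = day - timedelta(days=window_days - 1)
        # Same inequality pair as the join: start <= session day <= reporting day.
        counts[day] = len({uid for uid, d in sessions if start <= d <= day})
    return counts

# Hypothetical sample sessions.
sessions = [
    (1, date(2021, 1, 1)),
    (2, date(2021, 1, 1)),
    (1, date(2021, 1, 15)),
    (3, date(2021, 1, 31)),
    (2, date(2021, 2, 10)),
]
counts = rolling_distinct(sessions)
for day, n in counts.items():
    print(day, n)
```

Like the join, this compares every reporting date with every session, so it is simple but quadratic in the worst case.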

In Redshift, how do I run the opposite of a SUM function

Assuming I have a data table
date | user_id | user_last_name | order_id | is_new_session
------------+------------+----------------+-----------+---------------
2014-09-01 | A | B | 1 | t
2014-09-01 | A | B | 5 | f
2014-09-02 | A | B | 8 | t
2014-09-01 | B | B | 2 | t
2014-09-02 | B | test | 3 | t
2014-09-03 | B | test | 4 | t
2014-09-04 | B | test | 6 | t
2014-09-04 | B | test | 7 | f
2014-09-05 | B | test | 9 | t
2014-09-05 | B | test | 10 | f
I want to get another column in Redshift which assigns a session number to each user's sessions. It starts at 1 for the first record of each user and, as you move further down, increments whenever it encounters a true in the "is_new_session" column. It stays the same if it encounters a false. If it hits a new user, the value resets to 1. The ideal output for this table would be:
1
1
2
1
2
3
4
4
5
5
In my mind it's kind of the opposite of a SUM(1) over (Partition BY user_id, is_new_session ORDER BY user_id, date ASC)
Any ideas?
Thanks!
I think you want an incremental sum:
select t.*,
sum(case when is_new_session then 1 else 0 end) over (partition by user_id order by date) as session_number
from t;
In Redshift, you might need the windowing clause:
select t.*,
sum(case when is_new_session then 1 else 0 end) over
(partition by user_id
order by date
rows between unbounded preceding and current row
) as session_number
from t;
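The running sum can be verified against the sample table with a short sketch. The Python below is only an illustration of what the windowed SUM computes; the function name session_numbers is an assumption, and the rows are taken in the order shown in the question (sorted by user_id, then date).

```python
def session_numbers(rows):
    """rows: (user_id, date, order_id, is_new_session) tuples,
    sorted by user_id and then date. Returns the session number per row."""
    numbers = []
    current_user = None
    running = 0
    for user_id, _date, _order_id, is_new in rows:
        if user_id != current_user:
            # New partition: restart the running sum for this user.
            current_user, running = user_id, 0
        running += 1 if is_new else 0
        numbers.append(running)
    return numbers

rows = [
    ("A", "2014-09-01", 1, True),
    ("A", "2014-09-01", 5, False),
    ("A", "2014-09-02", 8, True),
    ("B", "2014-09-01", 2, True),
    ("B", "2014-09-02", 3, True),
    ("B", "2014-09-03", 4, True),
    ("B", "2014-09-04", 6, True),
    ("B", "2014-09-04", 7, False),
    ("B", "2014-09-05", 9, True),
    ("B", "2014-09-05", 10, False),
]
numbers = session_numbers(rows)
print(numbers)
```

Rows that share a date are processed here in the order listed; in SQL, adding a tie-breaker such as order_id to the ORDER BY would make that order explicit.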

Full outer join on a table itself and run some window functions

Background
I have some ETL job processing real-time log files hourly. Whenever the system generates a new event, it will take a snapshot of all historical event summary (if exists) and record it together with the current event. Then the data is loaded into Redshift.
Example
The table looks like something below:
+------------+--------------+---------+-----------+-------+-------+
| current_id | current_time | past_id | past_time | freq1 | freq2 |
+------------+--------------+---------+-----------+-------+-------+
| 2 | time2 | 1 | time1 | 13 | 5 |
| 3 | time3 | 1 | time1 | 13 | 5 |
| 3 | time3 | 2 | time2 | 2 | 1 |
| 4 | time4 | 1 | time1 | 13 | 5 |
| 4 | time4 | 2 | time2 | 2 | 1 |
| 4 | time4 | 3 | time3 | 1 | 1 |
+------------+--------------+---------+-----------+-------+-------+
This is what happened for the above table:
time1: event 1 happened. System took a snapshot, but nothing is recorded.
time2: event 2 happened. System took a snapshot and record event 1.
time3: event 3 happened. System took a snapshot and record event 1 & 2.
time4: event 4 happened. System took a snapshot and record event 1, 2 & 3.
Desired Outcome
I will need to transform the data into the following format in order to do some analysis:
+----+------------+-------+-------+
| id | event_time | freq1 | freq2 |
+----+------------+-------+-------+
| 1 | time1 | 0 | 0 |
| 2 | time2 | 13 | 5 | -- 13 | 5
| 3 | time3 | 15 | 6 | -- 13 + 2 | 5 + 1
| 4 | time4 | 16 | 7 | -- 15 + 1 | 6 + 1
+----+------------+-------+-------+
Basically, the new freq1 and freq2 are cumulative sum of lagged freq1 and freq2.
My Idea
I am thinking of a self full outer join on current_id and past_id and achieve the following result first:
+----+------------+-------+-------+
| id | event_time | freq1 | freq2 |
+----+------------+-------+-------+
| 1 | time1 | 13 | 5 |
| 2 | time2 | 2 | 1 |
| 3 | time3 | 1 | 1 |
| 4 | time4 | null | null |
+----+------------+-------+-------+
Then I can do a window function of lag over() and then sum over().
Question
Is this the correct approach? Is there a more efficient way to do this? This is just a small sample of the actual data, so performance could be a concern.
My query is always returning a lot of duplicated values, so I am not sure what went wrong.
Solution
The answer from @GordonLinoff is correct for the above use case. I am adding some minor updates in order to get it working on my actual table. The only difference is that my event_id values are 36-character Java UUIDs and the event_time values are timestamps.
select distinct past_id, past_time, 0 as freq1, 0 as freq2
from (
select past_id, past_time,
row_number() over (partition by current_id order by current_time desc) as seqnum
from t
) a
where a.seqnum = 1
union all
select current_id, current_time,
sum(freq1) over (order by current_time rows unbounded preceding) as freq1,
sum(freq2) over (order by current_time rows unbounded preceding) as freq2
from (
select current_id, current_time, freq1, freq2,
row_number() over (partition by current_id order by past_id desc) as seqnum
from t
) b
where b.seqnum = 1;
I'm thinking you want union all along with window functions. Here is an example:
select min(past_id) as id, min(past_time) as event_time, 0 as freq1, 0 as freq2
from t
union all
(select current_id, current_time,
sum(freq1) over (order by current_time),
sum(freq2) over (order by current_time)
from (select current_id, current_time, freq1, freq2,
row_number() over (partition by current_id order by past_id desc) as seqnum
from t
) t
where seqnum = 1
);
The way your data is in your snapshot table, I think the following SQL should give you what you are looking for in the desired outcome that you posted
SELECT 1 AS id
,'time1' AS event_time
,0 AS freq1
,0 AS freq2
UNION
SELECT T.current_id AS id
,T.current_time AS event_time
,SUM(T.freq1) AS freq1
,SUM(T.freq2) AS freq2
FROM snapshot AS T
GROUP
BY T.current_id
,T.current_time
The first SELECT in the above UNION returns the record for time1, since it has no entry in your base table that holds all the snapshots. It has no FROM clause since we are only selecting constants; if Redshift does not support that, you might need to look for something equivalent to the DUAL table in Oracle.
Hope this helps..
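The grouping approach can be checked against the sample snapshot table with a small sketch. The Python below is only an illustration: the function name cumulative_freqs is made up, and the first event is taken to be the smallest (past_id, past_time) pair, which holds for the sample data but is an assumption for UUID keys.

```python
def cumulative_freqs(snapshot):
    """snapshot: (current_id, current_time, past_id, past_time, freq1, freq2) tuples.
    Returns (id, event_time, freq1, freq2) rows in the desired format."""
    # Sum the snapshot rows recorded with each current event.
    totals = {}
    for cur_id, cur_time, _past_id, _past_time, f1, f2 in snapshot:
        t = totals.setdefault((cur_id, cur_time), [0, 0])
        t[0] += f1
        t[1] += f2
    # The very first event never appears as a current event, so add it with zeros.
    first_id, first_time = min((r[2], r[3]) for r in snapshot)
    result = [(first_id, first_time, 0, 0)]
    result += [(i, t, f1, f2) for (i, t), (f1, f2) in sorted(totals.items())]
    return result

snapshot = [
    (2, "time2", 1, "time1", 13, 5),
    (3, "time3", 1, "time1", 13, 5),
    (3, "time3", 2, "time2", 2, 1),
    (4, "time4", 1, "time1", 13, 5),
    (4, "time4", 2, "time2", 2, 1),
    (4, "time4", 3, "time3", 1, 1),
]
freqs = cumulative_freqs(snapshot)
for row in freqs:
    print(row)
```

Because each snapshot already lists every past event, a plain sum per current event gives the cumulative totals without a separate window function.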