In Redshift, how do I run the opposite of a SUM function?

Assuming I have a data table
date | user_id | user_last_name | order_id | is_new_session
------------+------------+----------------+-----------+---------------
2014-09-01 | A | B | 1 | t
2014-09-01 | A | B | 5 | f
2014-09-02 | A | B | 8 | t
2014-09-01 | B | B | 2 | t
2014-09-02 | B | test | 3 | t
2014-09-03 | B | test | 4 | t
2014-09-04 | B | test | 6 | t
2014-09-04 | B | test | 7 | f
2014-09-05 | B | test | 9 | t
2014-09-05 | B | test | 10 | f
I want to get another column in Redshift that assigns a session number to each user's sessions. It starts at 1 for the first record for each user and, as you move further down, it increments whenever it encounters a true in the "is_new_session" column and stays the same when it encounters a false. When it hits a new user, the value resets to 1. The ideal output for this table would be:
1
1
2
1
2
3
4
4
5
5
In my mind it's kind of the opposite of a SUM(1) over (Partition BY user_id, is_new_session ORDER BY user_id, date ASC)
Any ideas?
Thanks!

I think you want an incremental sum:
select t.*,
sum(case when is_new_session then 1 else 0 end) over (partition by user_id order by date) as session_number
from t;
In Redshift, you might need an explicit window frame clause:
select t.*,
       sum(case when is_new_session then 1 else 0 end) over (
           partition by user_id
           order by date
           rows between unbounded preceding and current row
       ) as session_number
from t;
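As a side note (a hedged addition, not part of the original answer): the sample data has several rows per user on the same date (e.g. user A on 2014-09-01), so if you need deterministic numbering you could add order_id as a tie-breaker in the ORDER BY, assuming order_id increases within a day:

-- Sketch: same incremental sum, with order_id breaking ties on date
-- (assumes order_id increases over time within a user's day).
select t.*,
       sum(case when is_new_session then 1 else 0 end) over (
           partition by user_id
           order by date, order_id
           rows between unbounded preceding and current row
       ) as session_number
from t;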

Related

BigQuery: delete duplicated rows that are not fully duplicated (delete the desired row)

I have a table recording customer steps on a daily basis. The table has Id, Date and Step columns. Some rows contain different steps on the same day for the same Id, as shown below on 5/3/2020 and 5/4/2020 for Id 1:
| Id | Date | Step |
|:-----|:---------|:-----|
| 1 | 5/1/2020 | 1 |
| 1 | 5/2/2020 | 1 |
| 1 | 5/3/2020 | 0 |
| 1 | 5/3/2020 | 5 |
| 1 | 5/4/2020 | 2 |
| 1 | 5/4/2020 | 10 |
| 1 | 5/5/2020 | 1 |
| 2 | 5/1/2020 | 1 |
| 2 | 5/2/2020 | 2 |
| 2 | 5/3/2020 | 0 |
I want to delete the rows that contain the lesser step, i.e. the 5/3/2020 row with step 0 and the 5/4/2020 row with step 2 for Id 1.
I tried using row_number() like this:
SELECT
Id,
Date,
step,
ROW_NUMBER() OVER (PARTITION BY Id, Date ORDER BY Id, Date) AS rn
FROM
`dataset.step`
WHERE rn>1
But that will give me the rows with the higher step, which is not what I want.
I was also able to select the rows with the lower step like this:
SELECT * FROM
`dataset.step` AS A
INNER JOIN
`dataset.step` AS B
ON A.Id = B.Id
AND A.Date = B.Date
WHERE A.step < B.step
But I found no way to use it for a delete.
Use the below approach:
select *
from your_table
qualify 1 = row_number() over win
window win as (partition by id, date order by step desc)
If applied to the sample data in your question, the output keeps only the highest-step row for each Id and Date.
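If you need to actually remove the lower-step rows rather than just select the survivors, one option (a hedged sketch, not part of the original answer) is to rewrite the table from that same query:

-- Sketch: materialize the de-duplicated result back over the same table.
-- Note this recreates `dataset.step` entirely, so partitioning/clustering or
-- other table options would need to be re-specified.
create or replace table `dataset.step` as
select *
from `dataset.step`
where true  -- some BigQuery versions require a WHERE (or GROUP BY/HAVING) before QUALIFY
qualify 1 = row_number() over (partition by Id, Date order by Step desc);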

Adding indicator column to table based on having two consecutive days within group

I need to add logic that flags the first of two consecutive days as 1 and the second day as 0, grouped by a column (test). If a test (a) has three consecutive days, the third should start with 1 again, etc.
An example table would look like the following, with new col being the column I need.
| test | test_date | new col |
|------|-----------|---------|
| a    | 1/1/2020  | 1       |
| a    | 1/2/2020  | 0       |
| a    | 1/3/2020  | 1       |
| b    | 1/1/2020  | 1       |
| b    | 1/2/2020  | 0       |
| b    | 1/15/2020 | 1       |
This seems to be a gaps-and-islands problem, and I assume some window function approach should get me there.
I tried something like the following to get the consecutive part, but I'm struggling with the indicator column.
Select
test,
test_date,
grp_var = dateadd(day,
-row_number() over (partition by test order by test_date), test_date)
from
my_table
This does read as a gaps-and-islands problem. I would recommend using the difference between row_number() and the date to generate the groups, and then arithmetic:
select
    test,
    test_date,
    row_number() over(
        partition by test, dateadd(day, -rn, test_date)
        order by test_date
    ) % 2 new_col
from (
    select
        t.*,
        row_number() over(partition by test order by test_date) rn
    from mytable t
) t
Demo on DB Fiddle:
test | test_date | new_col
:--- | :--------- | ------:
a | 2020-01-01 | 1
a | 2020-01-02 | 0
a | 2020-01-03 | 1
b | 2020-01-01 | 1
b | 2020-01-02 | 0
b | 2020-01-15 | 1

BigQuery: get highest possible steps, grouped by column

I have a question about counting row numbers based on a column iteration.
My table looks like this:
time | steps | name
13:02 | 0 | a
13:03 | 0 | a
13:04 | 1 | a
13:05 | 0 | a
13:07 | 1 | a
13:10 | 1 | a
13:12 | 2 | a
13:04 | 0 | b
13:06 | 0 | b
13:12 | 1 | b
13:14 | 2 | b
13:19 | 3 | b
13:14 | 0 | b
13:19 | 3 | b
From the table above I want to get the highest possible steps made by each name, but they must meet these conditions:
steps made by a name must be sequential (e.g. 0,1,2,3 returns 0,1,2,3; 0,1,2,4 returns 0,1,2)
each step must be sequential according to time
selecting any value is fine if there is more than one record (e.g. 0,1,1,2 returns 0, ANY(1,1), 2)
The table I am looking for is:
time | steps | name
13:05 | 0 | a
13:07 | 1 | a
13:12 | 2 | a
13:06 | 0 | b
13:12 | 1 | b
13:14 | 2 | b
13:19 | 3 | b
Is there any way I can do this in BigQuery?
First remove duplicates. Then identify the rows where the "next" step (by time) is what you expect.
The following almost works:
select t.*
from (select min(time) as time, steps, name,
             lead(steps) over (partition by name order by min(time)) as next_step
      from yourtable t
      group by steps, name
     ) t
where next_step = steps + 1;
However, you want the minimal set. For that, you also need the row number to match. It turns out that that condition alone is sufficient:
select t.*
from (select min(time) as time, steps, name,
             row_number() over (partition by name order by min(time)) as seqnum
      from yourtable t
      group by steps, name
     ) t
where steps = seqnum - 1;
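Since this is BigQuery, the same idea can also be written a bit more compactly with QUALIFY. A hedged sketch, assuming the same yourtable as above:

-- Dedupe by (steps, name) via GROUP BY, then keep only rows whose step value
-- matches its 0-based position in time order, i.e. the sequential prefix 0, 1, 2, ...
select min(time) as time, steps, name
from yourtable
group by steps, name
qualify steps = row_number() over (partition by name order by min(time)) - 1
order by name, steps;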

PARTITION BY in CASE doesn't work with several AND statements

I have a table with 4 columns: hitId, userId, timestamp and Camp.
I need to classify whether a hit is the start of a new session or not (1 or 0) using two parameters: 1. the time difference between hits, and 2. whether the source of the hit is a new campaign.
I need a standard SQL query in BigQuery.
A hit is considered as a start of a new session if one of the following is true:
it's the first hit from its userId
the time difference from the timestamp of the previous hit from the same userId is more than 30 mins
the time difference from the timestamp of the previous hit from the same userId is less than 30 mins, but the Camp (ad campaign) value is not NULL and occurs for the first time for the same userId within the previous 30 mins
So if hit1 from user1 has Camp equal to Campaign1, hit2 from user1 has Camp equal to Campaign1, and the time difference between hit1 and hit2 is less than 30 mins, hit1 will be considered the start of a session and hit2 won't.
I'm having trouble with the campaign part. I tried this code:
WITH timeDifference AS (
SELECT *,
TIMESTAMP_DIFF(timestamp, LAG(timestamp, 1) OVER
(PARTITION BY userId ORDER BY timestamp), SECOND) AS difference
FROM hitTable
ORDER BY timestamp)
SELECT *,
CASE
WHEN difference >= 30 * 60 THEN 1
WHEN difference IS NULL THEN 1
WHEN difference <= 30 * 60 AND Camp IS NOT NULL AND RANK()
OVER (PARTITION BY userId ORDER BY Camp) = 1 THEN 1
ELSE 0 END AS sess
FROM timeDifference
ORDER BY timestamp;
The RANK() OVER (PARTITION BY userId ORDER BY Camp) condition doesn't seem to work, as I receive this table:
hitId | userId | timestamp | Camp | difference | sess
_______________________________________________________________________
00150 | 858201 | 00:48:35.315 | NULL | NULL | 1
00151 | 858201 | 00:49:35.315 | NULL | 5 | 0
00152 | 858201 | 00:50:35.315 | Search-Ads-US | 10 | 0
00153 | 858201 | 00:53:35.315 | Search-Ads-US | 15 | 0
00154 | 858202 | 00:54:35.315 | Facebook-Ads | NULL | 1
00155 | 858202 | 00:54:55.315 | Facebook-Ads | 9 | 0
00156 | 858202 | 00:57:20.315 | Facebook-Ads | 12 | 0
While I expect the sess column to be 1 for hitId = 00152:
hitId | userId | timestamp | Camp | difference | sess
_______________________________________________________________________
00150 | 858201 | 00:48:35.315 | NULL | NULL | 1
00151 | 858201 | 00:49:35.315 | NULL | 5 | 0
00152 | 858201 | 00:50:35.315 | Search-Ads-US | 10 | 1
00153 | 858201 | 00:53:35.315 | Search-Ads-US | 15 | 0
00154 | 858202 | 00:54:35.315 | Facebook-Ads | NULL | 1
00155 | 858202 | 00:54:55.315 | Facebook-Ads | 9 | 0
00156 | 858202 | 00:57:20.315 | Facebook-Ads | 12 | 0
This RANK() OVER (PARTITION BY userId ORDER BY Camp) returns false results in cases where a user has had multiple Camps.
Notice your PARTITION BY uses only userId, while you want to rank within each Camp.
The actual "rank 1" of the RANK() (...) statement for userId 858201 is the row where Camp is NULL (hitId 00150), so it misses your CASE condition at hitId 00152.
You could try adding Camp to your PARTITION BY and ranking by timestamp, as follows:
RANK() OVER (PARTITION BY userId, Camp ORDER BY timestamp)
Alternatively, you could replace the RANK() (...) with a LAG(Camp) (... ORDER BY timestamp), in addition to the LAG(timestamp) (...) you are already calculating.
This retrieves the Camp value of the previous row (call it 'PreviousCampValue'); then you could add a condition like WHEN PreviousCampValue != Camp THEN 1, as in the sketch below.
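A rough sketch of that LAG() variant, built on the CTE from the question (hedged: it uses the same hitTable columns and treats "new campaign" as "different from the previous hit's Camp", which is what the LAG comparison can express):

WITH timeDifference AS (
    SELECT *,
        TIMESTAMP_DIFF(timestamp, LAG(timestamp) OVER
            (PARTITION BY userId ORDER BY timestamp), SECOND) AS difference,
        -- Camp of the previous hit from the same user (NULL for the first hit)
        LAG(Camp) OVER (PARTITION BY userId ORDER BY timestamp) AS previousCampValue
    FROM hitTable)
SELECT *,
    CASE
        WHEN difference IS NULL THEN 1     -- first hit of the user
        WHEN difference >= 30 * 60 THEN 1  -- more than 30 min since previous hit
        WHEN Camp IS NOT NULL
             AND (previousCampValue IS NULL OR previousCampValue != Camp) THEN 1  -- campaign changed
        ELSE 0
    END AS sess
FROM timeDifference
ORDER BY timestamp;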
Hope that's helpful

Select dynamic couples of lines in SQL (PostgreSQL)

My objective is to make dynamic groups of lines (of products by TYPE & COLOR, in fact).
I don't know if it's possible with just one select query.
I want to create groups of lines (a PRODUCT is a TYPE and a COLOR) according to the NB_PER_GROUP column, and I want to do this grouping in date order (ORDER BY DATE).
A single leftover product with a NB_PER_GROUP of 2 (here NUM 7 and NUM 10) is excluded from the final result.
Table :
-----------------------------------------------
NUM | TYPE | COLOR | NB_PER_GROUP | DATE
-----------------------------------------------
0 | 1 | 1 | 2 | ...
1 | 1 | 1 | 2 |
2 | 1 | 2 | 2 |
3 | 1 | 2 | 2 |
4 | 1 | 1 | 2 |
5 | 1 | 1 | 2 |
6 | 4 | 1 | 3 |
7 | 1 | 1 | 2 |
8 | 4 | 1 | 3 |
9 | 4 | 1 | 3 |
10 | 5 | 1 | 2 |
Results :
------------------------
GROUP_NUMBER | NUM |
------------------------
0 | 0 |
0 | 1 |
~~~~~~~~~~~~~~~~~~~~~~~~
1 | 2 |
1 | 3 |
~~~~~~~~~~~~~~~~~~~~~~~~
2 | 4 |
2 | 5 |
~~~~~~~~~~~~~~~~~~~~~~~~
3 | 6 |
3 | 8 |
3 | 9 |
If you have another way to solve this problem, I will accept it.
What about something like this?
select max(gn.group_number) group_number, ip.num
from products ip
join (
    select date, type, color,
           row_number() over (order by date) - 1 group_number
    from (
        select op.num, op.type, op.color, op.nb_per_group, op.date,
               (row_number() over (partition by op.type, op.color order by op.date) - 1)
                   % nb_per_group group_order
        from products op
    ) sq
    where sq.group_order = 0
) gn
on ip.type = gn.type
and ip.color = gn.color
and ip.date >= gn.date
group by ip.num
order by group_number, ip.num
This may only work if your nb_per_group values are the same for each combination of type and color. It may also require unique dates, but that could probably be worked around if required.
The innermost subquery partitions the rows by type and color, orders them by date, then calculates the row numbers modulo nb_per_group; this forms a 0-based count for the group that resets to 0 each time nb_per_group is exceeded.
The next-level subquery finds all of the 0 values we mapped in the lower subquery and assigns group numbers to them.
Finally, the outermost query ties each row in the products table to a group number, calculated as the highest group number that started on or before that product's date.
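If the leftover rows (NUM 7 and NUM 10 in the sample) also need to be excluded, here is a hedged alternative sketch along the same lines. It assumes the same products table, unique increasing dates, and a constant NB_PER_GROUP per (TYPE, COLOR):

-- Number the rows of each (type, color) product and integer-divide by
-- nb_per_group to get a per-product bucket; keep only buckets that are full,
-- then number the surviving groups by their earliest date.
with numbered as (
    select p.*,
           (row_number() over (partition by type, color order by date) - 1)
               / nb_per_group as bucket
    from products p
),
full_buckets as (
    select type, color, bucket, min(date) as group_start
    from numbered
    group by type, color, bucket, nb_per_group
    having count(*) = nb_per_group
)
select dense_rank() over (order by fb.group_start) - 1 as group_number,
       n.num
from numbered n
join full_buckets fb using (type, color, bucket)
order by group_number, n.num;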