How to select specific rows within "group by" groups using conditions on multiple columns? - sql

I have the following table with many userId (in the example only one userId for demo purpose):
For every userId I want to extract two rows:
The first row should have isTransaction = 0 and the earliest date.
The second row should have isTransaction = 1, a device different from that of the first row, and the earliest date after that of the first row.
That is, the output should be:
Time        userId    device  isTransaction
2021-01-27  10187675  mobile  0
2021-01-30  10187675  web     1
I tried to rank rows with partitioning and ordering but it didn't work:
Select * from
  (SELECT *, rank() over (partition by userId, device, isTransaction order by isTransaction, Time) as rnk
   FROM table1)
where rnk = 1
order by Time
Please help! It would also be good to check that the time difference between these two rows does not exceed 30 days; otherwise, the userId should be dropped.

You can first identify the earliest time for 0. Then enumerate the rows and take only the first one:
select t.*
from (select t.*,
             row_number() over (partition by userid, isTransaction order by time) as seqnum
      from (select t.*,
                   min(case when isTransaction = 0 then time end) over (partition by userid order by time) as time_0
            from t
           ) t
      where time >= time_0
     ) t
where seqnum = 1;
This satisfies the two conditions you enumerated.
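As a sanity check, here is that query run against a toy table in SQLite through Python's sqlite3 (the sample rows are invented; SQLite 3.25+ is assumed for window function support). The filter is written as Time >= time_0 so that the earliest isTransaction = 0 row itself survives the filter:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t (Time TEXT, userId INTEGER, device TEXT, isTransaction INTEGER);
INSERT INTO t VALUES
  ('2021-01-25', 10187675, 'web',    1),  -- before the first 0: ignored
  ('2021-01-27', 10187675, 'mobile', 0),  -- earliest 0
  ('2021-01-29', 10187675, 'mobile', 0),
  ('2021-01-30', 10187675, 'web',    1);  -- earliest 1 after the first 0
""")
rows = conn.execute("""
SELECT Time, userId, device, isTransaction
FROM (
  SELECT t.*,
         ROW_NUMBER() OVER (PARTITION BY userId, isTransaction ORDER BY Time) AS seqnum
  FROM (
    SELECT t.*,
           -- running minimum of the 0-row times per user
           MIN(CASE WHEN isTransaction = 0 THEN Time END)
             OVER (PARTITION BY userId ORDER BY Time) AS time_0
    FROM t
  ) t
  WHERE Time >= time_0
)
WHERE seqnum = 1
ORDER BY Time
""").fetchall()
print(rows)
```

The 1-row dated before the first 0-row is excluded because its running time_0 is still NULL at that point.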
Then, buried in the text, you want to eliminate rows where the difference is greater than 30 days. That is a little trickier, but not too hard:
select t.*
from (select t.*,
             min(case when isTransaction = 1 then time end) over (partition by userid) as time_1,
             row_number() over (partition by userid, isTransaction order by time) as seqnum
      from (select t.*,
                   min(case when isTransaction = 0 then time end) over (partition by userid order by time) as time_0
            from t
           ) t
      where time >= time_0
     ) t
where seqnum = 1 and
      time_1 < timestamp_add(time_0, interval 30 day);
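The 30-day cutoff can be checked the same way in SQLite (timestamp_add is BigQuery syntax, so the sketch below substitutes SQLite's date(..., '+30 day'); the second userId and its dates are invented to show a user being dropped):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t (Time TEXT, userId INTEGER, device TEXT, isTransaction INTEGER);
INSERT INTO t VALUES
  ('2021-01-27', 10187675, 'mobile', 0),
  ('2021-01-30', 10187675, 'web',    1),
  ('2021-01-01', 20000000, 'mobile', 0),
  ('2021-03-15', 20000000, 'web',    1);  -- gap > 30 days: user dropped
""")
rows = conn.execute("""
SELECT Time, userId, device, isTransaction
FROM (
  SELECT t.*,
         MIN(CASE WHEN isTransaction = 1 THEN Time END)
           OVER (PARTITION BY userId) AS time_1,
         ROW_NUMBER() OVER (PARTITION BY userId, isTransaction ORDER BY Time) AS seqnum
  FROM (
    SELECT t.*,
           MIN(CASE WHEN isTransaction = 0 THEN Time END)
             OVER (PARTITION BY userId ORDER BY Time) AS time_0
    FROM t
  ) t
  WHERE Time >= time_0
)
WHERE seqnum = 1
  AND time_1 < date(time_0, '+30 day')   -- SQLite stand-in for timestamp_add
ORDER BY userId, Time
""").fetchall()
print(rows)
```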

Related

count consecutive number of -1 in a column. count >=14

I'm trying to figure out a query to count "-1" values that have occurred more than 14 times in a row. Can anyone help me here? I tried everything from lead, row_number, etc. but nothing is working out.
The BP is recorded every minute, and I need to find the ids whose bp_level was "-1" for more than 14 minutes.
You may try the following:
Select Distinct B.Person_ID, B.[Consecutive]
From
(
  Select D.Person_ID,
         COUNT(D.bp_level) Over (Partition By D.grp, D.Person_ID Order By D.Time_) As [Consecutive]
  From
  (
    Select Time_, Person_ID, bp_level,
           DATEADD(Minute, -ROW_NUMBER() Over (Partition By Person_ID Order By Time_), Time_) As grp
    From mytable
    Where bp_level = -1
  ) D
) B
Where B.[Consecutive] >= 14
See a demo on db<>fiddle (using SQL Server).
DATEADD(Minute, -ROW_NUMBER() Over (Partition By Person_ID Order By Time_), Time_): defines a unique group for consecutive times per person, where bp_level = -1.
COUNT(D.bp_level) Over (Partition By D.grp, D.Person_ID Order By D.Time_): finds the cumulative count over increasing time within each group.
Once a non-(-1) value appears, the group splits in two and the counter resets for the new group.
NOTE: this solution works only if there are no gaps between consecutive times, i.e. the time increases by exactly one minute per row per person; otherwise the query will not work, but it can be modified to cover gaps.
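The DATEADD trick can be exercised in SQLite through Python's sqlite3 (DATEADD is SQL Server syntax, so the sketch substitutes datetime(..., '-N minutes'); the readings for persons "A" and "B" are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (Time_ TEXT, Person_ID TEXT, bp_level INTEGER)")
data = [(f"2023-01-01 08:{m:02d}:00", "A", -1) for m in range(15)]  # 15 consecutive -1 readings
data += [(f"2023-01-01 08:{m:02d}:00", "B", -1) for m in range(5)]  # only 5 in a row
conn.executemany("INSERT INTO mytable VALUES (?,?,?)", data)

out = conn.execute("""
SELECT DISTINCT B.Person_ID
FROM (
  SELECT D.Person_ID,
         COUNT(D.bp_level) OVER (PARTITION BY D.grp, D.Person_ID ORDER BY D.Time_) AS consec
  FROM (
    SELECT Time_, Person_ID, bp_level,
           -- SQLite stand-in for DATEADD(Minute, -ROW_NUMBER() ..., Time_):
           -- subtracting the row number collapses each consecutive run to one timestamp
           datetime(Time_, '-' || ROW_NUMBER() OVER (PARTITION BY Person_ID ORDER BY Time_) || ' minutes') AS grp
    FROM mytable
    WHERE bp_level = -1
  ) D
) B
WHERE B.consec >= 14
""").fetchall()
print(out)
```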
with data as (
  select *,
         count(case when bp_level <> -1 then 1 end) over
           (partition by person_id order by time) as grp
  from T
)
select distinct person_id
from data
where bp_level = -1
group by person_id, grp
having count(*) > 14; /* = or >= ? */
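The CTE version can be checked the same way (invented sample readings; the grouping column counts the non-(-1) rows seen so far, so each reading other than -1 starts a new island):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE T (time TEXT, person_id TEXT, bp_level INTEGER)")
rows = [(f"2023-01-01 08:{m:02d}:00", "A", -1) for m in range(15)]  # 15 consecutive -1s
rows += [(f"2023-01-01 08:{m:02d}:00", "B", -1) for m in range(5)]  # only 5 in a row
rows.append(("2023-01-01 08:05:00", "B", 1))                        # breaks B's run
conn.executemany("INSERT INTO T VALUES (?,?,?)", rows)

out = conn.execute("""
WITH data AS (
  SELECT *,
         -- each non-(-1) row bumps the counter, starting a new island
         COUNT(CASE WHEN bp_level <> -1 THEN 1 END)
           OVER (PARTITION BY person_id ORDER BY time) AS grp
  FROM T
)
SELECT DISTINCT person_id
FROM data
WHERE bp_level = -1
GROUP BY person_id, grp
HAVING COUNT(*) >= 14
""").fetchall()
print(out)
```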
If you want to rely on timestamps rather than a count of rows then you could use the time difference:
...
-- where 1 = 1 /* all rows */
group by person_id, grp
having datediff(minute, min(time), max(time)) > 14;
The accepted answer would have issues in scenarios where multiple rows share the same timestamp, if there is any potential for that to happen.
https://dbfiddle.uk/?rdbms=sqlserver_2019&fiddle=2ad6a1b515bb4091efba9b8831e5d579

How do I add an autoincrement counter based on conditions, with conditional reset, in Google BigQuery

I have my table in BigQuery and have a problem creating an incremental field based on a condition.
Basically, every time the score dips below 95% it returns Stage 1 for the first week. If it stays below 95% for a second straight week it returns Stage 2, and so on. However, if it goes above 95% the counter resets to "Good", and thereafter returns Stage 1 if it goes below 95% again, etc.
You can use row_number() -- but after assigning a group based on the count of > 95% values up to each row:
select t.*,
       (case when row_number() over (partition by grp order by month, week) = 1
             then 'Good'
             else concat('Stage ', row_number() over (partition by grp order by month, week) - 1)
        end) as level
from (select t.*,
             countif(score > 0.95) over (order by month, week) as grp
      from t
     ) t;
Consider below
select * except(grp),
       (case when Average_score >= 95 and 1 = row_number() over grps then 'Good'
             else format('Stage %i', row_number() over grps - sign(grp))
        end) as Level
from (
  select *, countif(Average_score >= 95) over (order by Month, Week) as grp
  from `project.dataset.table`
)
window grps as (partition by grp order by Month, Week)
If applied to the sample data in your question, this produces the expected output.
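Since countif, concat, and format are BigQuery-specific, here is a hedged SQLite/Python sketch of the first answer's logic, with countif replaced by SUM(CASE ...) and invented weekly scores (a week above 95% resets the run to "Good"; each following week below 95% advances the stage):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (month INTEGER, week INTEGER, score REAL)")
conn.executemany("INSERT INTO t VALUES (?,?,?)", [
    (1, 1, 0.97), (1, 2, 0.93), (1, 3, 0.90), (1, 4, 0.96), (2, 1, 0.94),
])

out = conn.execute("""
SELECT month, week, score,
       CASE WHEN ROW_NUMBER() OVER (PARTITION BY grp ORDER BY month, week) = 1
            THEN 'Good'
            ELSE 'Stage ' || (ROW_NUMBER() OVER (PARTITION BY grp ORDER BY month, week) - 1)
       END AS level
FROM (
  SELECT t.*,
         -- SQLite stand-in for BigQuery's COUNTIF(score > 0.95) OVER (...)
         SUM(CASE WHEN score > 0.95 THEN 1 ELSE 0 END)
           OVER (ORDER BY month, week) AS grp
  FROM t
) t
ORDER BY month, week
""").fetchall()
print([r[3] for r in out])
```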

SQL how to find continuous count of rows that exceed a value over time

ID  value  timestamp1  timestamp2
1   1      1           1
1   2      2           1
1   0      3           1
1   2      4           1
2   2      3           4
2   2      4           4
I need to figure out how to query this table to find, for each ID, the longest continuous run of rows in which the value is equal to or above 1 and timestamp1 is greater than or equal to timestamp2. For example:
ID 1 will have a count of 2, because the 3rd row has value 0 and, even though the 4th row has value 2, the run must be continuous.
ID 2 will have a count of 1, because one of its rows has timestamp1 below timestamp2.
Please help, I have tried a bunch of things.
If I understand correctly, this is a gaps-and-islands problem. You can get each range as:
select id, count(*)
from (select t.*,
             row_number() over (partition by id order by timestamp1) as seqnum,
             row_number() over (partition by id, (value >= 1), (timestamp1 >= timestamp2)
                                order by timestamp1) as seqnum_2
      from t
     ) t
where value >= 1 and (timestamp1 >= timestamp2)
group by id, (seqnum - seqnum_2);
In Postgres, you can get the maximum using distinct on. I'm guessing that you are using Redshift and under the mistaken impression that it is equivalent to Postgres (it might have been decades ago). You can get what you want using an additional level of aggregation:
select id, max(cnt)
from (select id, count(*) as cnt
      from (select t.*,
                   row_number() over (partition by id order by timestamp1) as seqnum,
                   row_number() over (partition by id, (value >= 1), (timestamp1 >= timestamp2)
                                      order by timestamp1) as seqnum_2
            from t
           ) t
      where value >= 1 and (timestamp1 >= timestamp2)
      group by id, (seqnum - seqnum_2)
     ) ig
group by id;
Here is a db<>fiddle.
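The aggregated version can be exercised against the question's sample rows in SQLite via Python's sqlite3 (the column `value` is renamed to `val` here to sidestep any keyword ambiguity; SQLite 3.25+ assumed):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER, val INTEGER, timestamp1 INTEGER, timestamp2 INTEGER)")
conn.executemany("INSERT INTO t VALUES (?,?,?,?)", [
    (1, 1, 1, 1), (1, 2, 2, 1), (1, 0, 3, 1), (1, 2, 4, 1),
    (2, 2, 3, 4), (2, 2, 4, 4),
])

out = conn.execute("""
SELECT id, MAX(cnt) AS max_streak
FROM (
  SELECT id, COUNT(*) AS cnt
  FROM (
    SELECT t.*,
           ROW_NUMBER() OVER (PARTITION BY id ORDER BY timestamp1) AS seqnum,
           -- rows that satisfy both conditions keep pace with seqnum;
           -- the difference is constant within each qualifying run
           ROW_NUMBER() OVER (PARTITION BY id, (val >= 1), (timestamp1 >= timestamp2)
                              ORDER BY timestamp1) AS seqnum_2
    FROM t
  ) t
  WHERE val >= 1 AND timestamp1 >= timestamp2
  GROUP BY id, (seqnum - seqnum_2)
) ig
GROUP BY id
ORDER BY id
""").fetchall()
print(out)
```

ID 1's run is broken by the value-0 row, and ID 2's first row fails the timestamp test, matching the counts the question expects.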

count consecutive record with timestamp interval requirement

Referring to this post: link, I used the answer provided by @Gordon Linoff:
select taxi, count(*)
from (select t.taxi, t.client, count(*) as num_times
      from (select t.*,
                   row_number() over (partition by taxi order by time) as seqnum,
                   row_number() over (partition by taxi, client order by time) as seqnum_c
            from t
           ) t
      group by t.taxi, t.client, (seqnum - seqnum_c)
      having count(*) >= 2
     ) t
group by taxi;
and got my answer perfectly like this:
Tom 3 (AA counts as 1, AAA counts as 1, and BB counts as 1, so a total count of 3)
Bob 1
But now I would like to add one more condition: the time between two consecutive rides with the same client for the same taxi should not be longer than 2 hours.
I know that I should probably use row_number() again and calculate the time difference with datediff, but I have no idea where to add it or how.
Any suggestions?
This requires a bit more logic. In this case, I would use lag() to calculate the groups:
select taxi, count(*)
from (select t.taxi, t.client, count(*) as num_times
      from (select t.*,
                   sum(case when prev_client = client and
                                 prev_time > time - interval '2 hour'
                            then 0 else 1
                       end) over (partition by taxi order by time) as grp
            from (select t.*,
                         lag(client) over (partition by taxi order by time) as prev_client,
                         lag(time) over (partition by taxi order by time) as prev_time
                  from t
                 ) t
           ) t
      group by t.taxi, t.client, grp
      having count(*) >= 2
     ) t
group by taxi;
Note: You don't specify the database, so this uses ISO/ANSI standard syntax for date/time comparisons. You can adjust this for your actual database.
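A hedged SQLite/Python sketch of this lag()-based grouping (the trips are invented, and interval arithmetic is rewritten with julianday since SQLite has no interval type). A new group starts whenever the client changes or the gap exceeds 2 hours:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (taxi TEXT, client TEXT, time TEXT)")
conn.executemany("INSERT INTO t VALUES (?,?,?)", [
    ("Tom", "A", "2023-01-01 10:00"), ("Tom", "A", "2023-01-01 11:00"),
    ("Tom", "A", "2023-01-01 15:00"),   # > 2h gap: starts a new run
    ("Tom", "B", "2023-01-01 16:00"), ("Tom", "B", "2023-01-01 17:00"),
    ("Bob", "C", "2023-01-01 10:00"), ("Bob", "D", "2023-01-01 11:00"),
])

out = conn.execute("""
SELECT taxi, COUNT(*) AS qualifying_runs
FROM (
  SELECT taxi, client, COUNT(*) AS num_times
  FROM (
    SELECT t.*,
           -- new group when the client changes or the gap exceeds 2 hours
           SUM(CASE WHEN prev_client = client
                     AND (julianday(time) - julianday(prev_time)) * 24 <= 2
                    THEN 0 ELSE 1 END)
             OVER (PARTITION BY taxi ORDER BY time) AS grp
    FROM (
      SELECT t.*,
             LAG(client) OVER (PARTITION BY taxi ORDER BY time) AS prev_client,
             LAG(time)   OVER (PARTITION BY taxi ORDER BY time) AS prev_time
      FROM t
    ) t
  ) t
  GROUP BY taxi, client, grp
  HAVING COUNT(*) >= 2
) r
GROUP BY taxi
""").fetchall()
print(out)
```

Tom gets two qualifying runs (A within 2 hours, then B); Bob has no run of length 2, so he does not appear.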

Running count distinct

I am trying to see how the cumulative number of subscribers changed over time, based on unique email addresses and the date they were created. Below is an example of the table I am working with.
I am trying to turn it into the table below. Email 1@gmail.com was created twice, and I would like to count it once. I cannot figure out how to generate the "Running count distinct" column.
Thanks for the help.
I would usually do this using row_number():
select date, count(*),
       sum(count(*)) over (order by date),
       sum(sum(case when seqnum = 1 then 1 else 0 end)) over (order by date)
from (select t.*,
             row_number() over (partition by email order by date) as seqnum
      from t
     ) t
group by date
order by date;
This is similar to the version using lag(). However, I get nervous using lag if the same email appears multiple times on the same date.
Getting the total count and cumulative count is straightforward. To get the cumulative distinct count, use lag to check whether the email already had a row on a previous date, and set the flag to 0 so that row is ignored in the running sum.
select distinct dt,
       count(*) over (partition by dt) as day_total,
       count(*) over (order by dt) as cumsum,
       sum(flag) over (order by dt) as cumdist
from (select t.*,
             case when lag(dt) over (partition by email order by dt) is not null
                  then 0 else 1 end as flag
      from tbl t
     ) t
Here is a solution that uses neither sum over nor lag, and produces the correct results.
Hence it may be simpler to read and maintain.
select t1.date_created,
       (select count(*) from my_table where date_created = t1.date_created) as emails_created,
       (select count(*) from my_table where date_created <= t1.date_created) as cumulative_sum,
       (select count(distinct email) from my_table where date_created <= t1.date_created) as running_count_distinct
from (select distinct date_created from my_table) t1
order by 1
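As a quick check of the correlated-subquery approach in SQLite via Python's sqlite3 (emails and dates invented, mirroring the question's duplicate address):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_table (date_created TEXT, email TEXT)")
conn.executemany("INSERT INTO my_table VALUES (?,?)", [
    ("2021-01-01", "1@gmail.com"), ("2021-01-01", "2@gmail.com"),
    ("2021-01-02", "1@gmail.com"),  # duplicate email: counted once in the distinct column
    ("2021-01-02", "3@gmail.com"),
])

out = conn.execute("""
SELECT t1.date_created,
       (SELECT COUNT(*) FROM my_table WHERE date_created = t1.date_created) AS emails_created,
       (SELECT COUNT(*) FROM my_table WHERE date_created <= t1.date_created) AS cumulative_sum,
       (SELECT COUNT(DISTINCT email) FROM my_table WHERE date_created <= t1.date_created) AS running_count_distinct
FROM (SELECT DISTINCT date_created FROM my_table) t1
ORDER BY 1
""").fetchall()
print(out)
```

Note the trade-off: the correlated subqueries rescan the table per distinct date, so on large tables the window-function versions above should scale better.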