Running count distinct - sql

I am trying to see how the cumulative number of subscribers changed over time based on unique email addresses and date they were created. Below is an example of a table I am working with.
I am trying to turn it into the table below. Email 1#gmail.com was created twice and I would like to count it once. I cannot figure out how to generate the Running count distinct column.
Thanks for the help.

I would usually do this using row_number():
select date, count(*),
sum(count(*)) over (order by date),
sum(sum(case when seqnum = 1 then 1 else 0 end)) over (order by date)
from (select t.*,
row_number() over (partition by email order by date) as seqnum
from t
) t
group by date
order by date;
This is similar to the version using lag(). However, I get nervous using lag if the same email appears multiple times on the same date.

Getting the total count and cumulative count is straight forward. To get the cumulative distinct count, use lag to check if the email had a row with a previous date, and set the flag to 0 so it would be ignored during a running sum.
select distinct dt
,count(*) over(partition by dt) as day_total
,count(*) over(order by dt) as cumsum
,sum(flag) over(order by dt) as cumdist
from (select t.*
,case when lag(dt) over(partition by email order by dt) is not null then 0 else 1 end as flag
from tbl t
) t
DEMO HERE

Here is a solution that does not uses sum over, neither lag... And does produces the correct results.
Hence it could appear as simpler to read and to maintain.
select
t1.date_created,
(select count(*) from my_table where date_created = t1.date_created) emails_created,
(select count(*) from my_table where date_created <= t1.date_created) cumulative_sum,
(select count( distinct email) from my_table where date_created <= t1.date_created) running_count_distinct
from
(select distinct date_created from my_table) t1
order by 1

Related

how can i reset the count to 0 in sql when i have a condition that is false?

i have a sql table which the following data shown in the picture
I need to create a query in sql which counts for ticker the number of consecutive days per year in which
the close_value is greater than the open_value, if close_value is less than the open value the counter must be reset to zero and I have to save the counter in that instant
This is an example of a gaps-and-islands problem. You can use the difference of row_numbers():
select ticker, min(date), max(date), min(open_value), max(close_value),
count(*) as num_rows
from (select t.*,
row_number() over (partition by ticker order by date) as seqnum,
row_number() over (partition by ticker, (case when close_value > open_value then 1 else 2 end) order by date) as seqnum_2
from t
) t
where close_value > open_value
group by ticker, (seqnum - seqnum_2);
This returns all such periods. You haven't specified what the result set should look like, but this should be pretty close.

MariaDB get first and last record of the month - nested query

I am using MariaDB and I have these kind of data:
I have also data for March and I am using this query to select distinct Months from the database:
select distinct(DATE_FORMAT(DT,'%m-%Y')) AS singleMonth FROM myTable
I want to be able to select FIRST and LAST record of P2 column for every month. How it is possible using the query above for getting all distinct months and also getting first record for the month and last?
Example what the query should return look-like:
You can use window functions and conditional aggregation:
select year(dt), month(dt),
min(case when seqnumn_asc = 1 then p2 end) as first_p2,
min(case when seqnumn_desc = 1 then p2 end) as last_p2
from (select t.*,
row_number() over (partition by year(dt), month(dt) order by dt asc) as seqnum_asc,
row_number() over (partition by year(dt), month(dt) order by dt desc) as seqnum_desc
from t
) t
group by year(dt), month(dt);

sql - find users who made 3 consecutive actions

i use sql server and i have this table :
ID Date Amount
I need to write a query that returns only users that have made at least 3 consecutive purchases, each one larger than the other.
I know that i need to use partition_id and row_number but i dont know how to do it
Thank you in advance
If you want three purchases in a row with increases in amount, then use lead() to get the next amounts:
select t.*
from (select t.*,
lead(amount, 1) over (partition by id order by date) as next_date,
lead(amount, 2) over (partition by id order by date) as next2_amount
from t
) t
where next_amount > amount and next2_amount > next_amount;
I originally missed the "greater than" part of the question. If you wanted purchases on three days in a row, then:
If you want three days in a row and there is at most one purchase per day, then you can use:
select t.*
from (select t.*,
lead(date, 2) over (partition by id order by date) as next2_date
from t
) t
where next2_date = dateadd(day, 2, date);
If you can have duplicates on a date, I would suggest this variant:
select t.*
from (select t.*,
lead(date, 2) over (partition by id order by date) as next2_date
from (select distinct id, date from t) t
) t
where next2_date = dateadd(day, 2, date);

SQL Random 10 Percent based on 2 other fields

I have reviewed the other posts here with no luck on finding a solution to get a random 10 percent of records based on 2 other fields. For example my table contains ID, Date and User. I want to flag 10 percent of the records for each user for each day.
You can use row_number() and count():
select t.*,
(case when seqnum * 10 <= cnt then 'Y' else 'N' end) as flag
from (select t.*,
row_number() over (partition by user, date order by newid()) as seqnum,
count(*) over (partition by user, date) as cnt
from t
) t;
You don't actually need the subquery. It is just to make it a bit easier to follow, so:
select t.*,
(case when row_number() over (partition by user, date order by newid()) * 10 <= count(*) over (partition by user, date)
then 'Y' else 'N'
end) as flag
from t;

count consecutive record with timestamp interval requirement

ref to this post: link, I used the answer provided by #Gordon Linoff:
select taxi, count(*)
from (select t.taxi, t.client, count(*) as num_times
from (select t.*,
row_number() over (partition by taxi order by time) as seqnum,
row_number() over (partition by taxi, client order by time) as seqnum_c
from t
) t
group by t.taxi, t.client, (seqnum - seqnum_c)
having count(*) >= 2
)
group by taxi;
and got my answer perfectly like this:
Tom 3 (AA count as 1, AAA count as 1 and BB count as 1, so total of 3 count)
Bob 1
But now I would like to add one more condition which is the time between two consecutive clients for same taxi should not be longer than 2hrs.
I know that I should probably use row_number() again and calculate the time difference with datediff. But I have no idea where to add and how to do.
So any suggestion?
This requires a bit more logic. In this case, I would use lag() to calculate the groups:
select taxi, count(*)
from (select t.taxi, t.client, count(*) as num_times
from (select t.*,
sum(case when prev_client = client and
prev_time > time - interval '2 hour'
then 1
else 0
end) over (partition by client order by time) as grp
from (select t.*,
lag(client) over (partition by taxi order by time) as prev_client,
lag(time) over (partition by taxi order by time) as prev_time
from t
) t
) t
group by t.taxi, t.client, grp
having count(*) >= 2
)
group by taxi;
Note: You don't specify the database, so this uses ISO/ANSI standard syntax for date/time comparisons. You can adjust this for your actual database.