Cumulative distinct count - SQL

I'm having trouble getting a cumulative distinct count, so let's just assume the dataset below.
DATE RID
1/1/18 1
1/1/18 2
1/1/18 3
1/1/18 3
So if we run this query
SELECT DATE, COUNT(DISTINCT RID) FROM TABLE GROUP BY DATE;
we would expect it to return 3. However, let's assume that the data for the next day is as follows.
DATE RID
1/2/18 1
1/2/18 6
1/2/18 9
How would you write a query so that the data for 1/1/18 is considered when returning the distinct count for 1/2/18?
So it would be the following results.
Date Count(*)
1/1/18 3
1/2/18 5 <- 1/1/18 distinct plus 1/2/18 distinct.
Hope that makes sense. Keep in mind this is a very large dataset, if that changes things.
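For reference, here is a minimal setup that reproduces the sample data (table and column names are taken from the queries below; DATE is a reserved word in many databases, so the column may need quoting):
create table t (date date, rid int);
insert into t values
    ('2018-01-01', 1), ('2018-01-01', 2), ('2018-01-01', 3), ('2018-01-01', 3),
    ('2018-01-02', 1), ('2018-01-02', 6), ('2018-01-02', 9);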

You can do a cumulative count of the earliest date for each rid:
select mindate, count(*), sum(count(*)) over (order by mindate)
from (select rid, min(date) as mindate
      from t
      group by rid
     ) t
group by mindate
order by mindate;
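On the sample data, the mindate is 1/1/18 for rids 1, 2 and 3, and 1/2/18 for rids 6 and 9 (rid 1 already appeared on 1/1/18), so this returns:
mindate count(*) running sum
1/1/18  3        3
1/2/18  2        5
The third column is the cumulative distinct count the question asks for.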
Note: This will be missing dates that are not the mindate for any rid. Here is one way to get all the dates, if that is an issue:
select mindate, count(rid), sum(count(rid)) over (order by mindate)
from ((select rid, min(date) as mindate
       from t
       group by rid
      )
      union all
      (select distinct NULL, date
       from t
      )
     ) rd
group by mindate
order by mindate;
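The union's second branch contributes one NULL rid per date; count(rid) ignores NULLs, so a date on which no rid appears for the first time still shows up, with a new-rid count of 0 and an unchanged running total.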

The query below can give the required cumulative distinct count.
--Step 3:
SELECT dt,
       cum_distinct_cnt
FROM (
      --Step 2:
      SELECT rid,
             dt,
             COUNT(CASE WHEN row_num = 1 THEN rid END) OVER (ORDER BY dt ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) cum_distinct_cnt
      FROM (
            --Step 1:
            SELECT rid,
                   dt,
                   ROW_NUMBER() OVER (PARTITION BY rid ORDER BY dt) row_num
            FROM table) innerTab1
     ) innerTab2
QUALIFY ROW_NUMBER() OVER (PARTITION BY dt ORDER BY cum_distinct_cnt DESC) = 1
Since your dataset is very large, you can break the query into the steps marked in the comments and create work tables to populate innerTab1/innerTab2, then build the final output from those.
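As a rough sketch, materializing Step 1 as a work table might look like this (the QUALIFY clause above suggests Teradata, where CREATE TABLE ... AS ... WITH DATA works; work_innertab1 is a hypothetical name, and the syntax will differ on other platforms):
CREATE TABLE work_innertab1 AS (
    SELECT rid,
           dt,
           ROW_NUMBER() OVER (PARTITION BY rid ORDER BY dt) row_num
    FROM table  -- placeholder table name from the answer above
) WITH DATA;
-- Step 2 then selects FROM work_innertab1 instead of the inline subquery,
-- and its result can be materialized the same way before applying Step 3.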

Related

Oracle SQL LAG() function results in duplicate rows

I have a very simple query that results in two rows:
SELECT DISTINCT
id,
trunc(start_date) start_date
FROM example.table
WHERE ID = 1
This results in the following rows:
id start_date
1 7/1/2012
1 9/1/2016
I want to add a column that simply shows the previous date for each row. So I'm using the following:
SELECT DISTINCT id,
Trunc(start_date) start_date,
Lag(start_date, 1)
over (
ORDER BY start_date) pdate
FROM example.table
WHERE id = 1
However, when I do this, I get four rows instead of two:
id start_date pdate
1 7/1/2012 NULL
1 7/1/2012 7/1/2012
1 9/1/2016 7/1/2012
1 9/1/2016 9/1/2016
If I change the offset to 2 or 3 the results remain the same. If I change the offset to 0, I get two rows again but of course now the start_date == pdate.
I can't figure out what's going on.
Use an explicit GROUP BY instead:
SELECT id, trunc(start_date) as start_date,
       LAG(trunc(start_date)) OVER (PARTITION BY id ORDER BY trunc(start_date))
FROM example.table
WHERE ID = 1
GROUP BY id, trunc(start_date)
The reason for this is the order of execution of a SQL statement: LAG runs before the DISTINCT.
You actually want to run the LAG after the DISTINCT, so the right query should be:
WITH t1 AS (
    SELECT DISTINCT id, trunc(start_date) start_date
    FROM example.table
    WHERE ID = 1
)
SELECT t1.*, LAG(start_date, 1) OVER (ORDER BY start_date) pdate
FROM t1
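With the LAG applied after the DISTINCT, this returns the expected two rows:
id start_date pdate
1  7/1/2012   NULL
1  9/1/2016   7/1/2012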

date window function

I don't know how to solve this problem. Maybe you can show me the right direction or give a link.
I have a table:
id Date
23 01.01.2020
23 03.01.2020
23 04.01.2020
56 07.01.2020
56 08.01.2020
87 11.01.2020
23 12.01.2020
23 18.01.2020
I want to aggregate the data (id, Date_min) and add a new column like this one:
id Date_min Date_new
23 01.01.2020 07.01.2020
56 07.01.2020 11.01.2020
87 11.01.2020 12.01.2020
23 12.01.2020 18.01.2020
In column Date_new I want to see the next user's first date. If there is no next user, use the user's max date.
LEAD will give you the next date, but we also have the slight sticking problem that your ID repeats, so we need something to make the second 23 distinct from the first. For that I guess we can establish a counter that ticks up every time the ID changes:
with a as (
    select '23' as id, '01.01.2020' as "date" union all
    select '23' as id, '03.01.2020' as "date" union all
    select '23' as id, '04.01.2020' as "date" union all
    select '56' as id, '07.01.2020' as "date" union all
    select '56' as id, '08.01.2020' as "date" union all
    select '87' as id, '11.01.2020' as "date" union all
    select '23' as id, '12.01.2020' as "date" union all
    select '23' as id, '18.01.2020' as "date"
), b as (
    SELECT *, LAG(id) OVER(ORDER BY "date") as last_id FROM a
), c AS (
    SELECT *,
           LEAD("date") OVER(ORDER BY "date") as next_date,
           SUM(CASE WHEN last_id <> id THEN 1 ELSE 0 END) OVER(ORDER BY "date" ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) id_ctr
    FROM b
)
SELECT id, MIN("date"), MAX(next_date)
FROM c
GROUP BY id, id_ctr
I haven't got a PG instance to test this on, but it works in SQL Server and I'm pretty sure that PG supports everything used here - there isn't any SQL Server specific stuff.
a takes the place of your table - you can drop it from your query and just start with b as (select ... from yourtablenamehere)
b calculates the previous ID; we'll use this to detect whether the id has changed between the current row and the previous row. If it changed we put a 1, otherwise a 0. When these are summed as a running total it effectively means the counter ticks up every time the ID changes, so we can group by this counter as well as the ID to split our two 23s apart. We need to do this separately because window functions can't be nested.
c takes the last_id and does the running total. It also computes next_date with a simple window function that pulls the date from the following row (rows ordered by date). The ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW is technically unnecessary as it's the default behavior for a SUM OVER ORDER BY, but I find being explicit helps document/change it if needed.
Then all that is required is to select the id, min date and max next_date, but throw the counter into the grouping too to split the 23s up - you're allowed to group by more columns than you select, but not the other way round.
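To make the counter concrete, here are the intermediate columns b and c produce on the sample data:
id  "date"      last_id  next_date   id_ctr
23  01.01.2020  NULL     03.01.2020  0
23  03.01.2020  23       04.01.2020  0
23  04.01.2020  23       07.01.2020  0
56  07.01.2020  23       08.01.2020  1
56  08.01.2020  56       11.01.2020  1
87  11.01.2020  56       12.01.2020  2
23  12.01.2020  87       18.01.2020  3
23  18.01.2020  23       NULL        3
Grouping by (id, id_ctr) and taking MIN("date") and MAX(next_date) then yields exactly the four desired rows.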
This is a particularly simple type of gaps-and-islands problem.
You can simply use lag() to determine the first row of each bunch of rows and then lead() to get date_new:
select id, date as date_min,
       lead(date, 1, max_date) over (order by date) as date_max
from (select t.*,
             lag(id) over (order by date) as prev_id,
             max(date) over () as max_date
      from t
     ) t
where prev_id is null or prev_id <> id;
Here is a db<>fiddle.
Three window functions and no aggregation: this should be by far the fastest approach to this problem.
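On the sample data, the where clause keeps only the four island-starting rows (prev_id is null or differs from id), and lead() then runs over just those rows:
id  date        prev_id  date_max
23  01.01.2020  NULL     07.01.2020
56  07.01.2020  23       11.01.2020
87  11.01.2020  56       12.01.2020
23  12.01.2020  87       18.01.2020  (lead default: max_date)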

Add columns to SQL query and filter by min(date) and sum(price)

I am trying to generate, in SQL, a list of users whose first purchase was in December 2018 and who have spent over 100 dollars since then. I'm able to generate the list of users, but I'm unable to include what their first purchase was or other variables. It appears to be an issue since the columns I'm trying to include are neither grouped nor aggregated, so I'm hoping someone can point me in the right direction as I'm new to SQL.
Here's my code to generate the list I want to add more columns to:
select billing_address.name, contact_email, min(processed_at) as First_Purchase_Date, sum(total_price) as Total_Revenue
FROM (
    SELECT *, ROW_NUMBER() OVER(PARTITION BY id) AS instance
    FROM `table.orders`
) orders -- identify duplicate rows
WHERE instance = 1
group by contact_email, billing_address.name
having min(processed_at) between '2019-01-01 00:00:00 UTC' and '2019-02-01 00:00:00 UTC' and sum(total_price) > 100
order by sum(total_price) desc
Is there some way I can modify this to pull each user's purchases from this list into separate rows and include more columns? So I'd pull in each user (and ALL of their purchases) who has a min(processed_at) in December 2018 AND a sum(total_price) > 100? Something like this:
SELECT contact_email, billing_address, line_items, min(processed_at), sum(total_price) OVER (PARTITION BY contact_email)
FROM (
    SELECT *, ROW_NUMBER() OVER(PARTITION BY id) AS instance
    FROM `table.orders`
) orders -- identify duplicate rows
WHERE instance = 1
However, the sum(total_price) doesn't work in this case and I can't filter by min(processed_at). Can someone guide me in the right direction?
I think you should use window functions instead of aggregation. You can compute the date of the first purchase and the total amount spent on the fly in a subquery, without aggregating (your original group by columns become the partition columns of the window functions). Then you can use this information to filter in the outer query.
This should get you close to what you want:
select o.*
from (
    select
        o.*,
        min(processed_at) over(partition by contact_email, billing_address) min_processed_at,
        sum(total_price) over(partition by contact_email, billing_address) sum_total_price
    from (
        select
            o.*,
            row_number() over(partition by id) instance
        from orders o
    ) o
    where instance = 1
) o
where
    min_processed_at between '2019-01-01 00:00:00 UTC' and '2019-02-01 00:00:00 UTC'
    and sum_total_price > 100
Your question was a bit unclear as you did not provide much detail about your input tables or your expected output, so this is a guess.
The following query gets all transactions from users who meet the criteria:
-- BigQuery StandardSQL
with ordered_orders as (
    -- rank each ID's orders by processed_at date, first to last
    select *, row_number() over(partition by id order by processed_at asc) as rn
    from `table.orders`
),
first_criteria as (
    -- select IDs whose first processed_at date is in 2018-12
    select id, processed_at as first_order_date
    from ordered_orders
    where rn = 1
    and extract(year from processed_at) = 2018
    and extract(month from processed_at) = 12
),
second_criteria as (
    -- further select IDs who meet the first criteria and have a total of > 100
    select id, sum(total_price) as total_revenue
    from ordered_orders
    inner join first_criteria using(id)
    group by id
    having total_revenue > 100
),
orders_with_criteria as (
    -- get all orders for users who meet both criteria
    select ordered_orders.* except(rn), first_order_date, total_revenue
    from ordered_orders
    inner join first_criteria using(id)
    inner join second_criteria using(id)
)
-- select any fields you want
select * from orders_with_criteria
I prefer liberal use of CTEs in cases like this to keep the logic clear.
I also wouldn't be surprised if this query doesn't work as you intend. It is highly doubtful that the id column in your orders table refers to the customer id, which is what you/we are partitioning on. Depending on who set up your tables, id probably refers to the order id. If you have a customer_id (or account #, etc.), then I would use that instead of id in the query.
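For example, if such a column exists, the ranking in the first CTE would partition by it instead (customer_id here is hypothetical):
select *, row_number() over(partition by customer_id order by processed_at asc) as rn
from `table.orders`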
No need to use row_number() to pick out the first order in BigQuery:
SELECT billing_address.name, contact_email,
       MIN(processed_at) as First_Purchase_Date,
       SUM(total_price) as Total_Revenue,
       ARRAY_AGG(o ORDER BY processed_at LIMIT 1) as first_order
FROM (
    SELECT *, ROW_NUMBER() OVER(PARTITION BY id) AS instance
    FROM `table.orders`
) o -- identify duplicate rows, as in your original query
WHERE instance = 1
GROUP BY contact_email, billing_address.name
HAVING MIN(processed_at) >= '2019-01-01 00:00:00 UTC' AND
       MIN(processed_at) < '2019-02-01 00:00:00 UTC' AND
       SUM(total_price) > 100
ORDER BY SUM(total_price) desc;
This returns the entire first order as a struct. You can select specific columns, if you prefer.
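For instance, to keep just a couple of fields rather than the whole row, something like this should work in BigQuery (the chosen fields are just examples):
ARRAY_AGG(STRUCT(processed_at, line_items) ORDER BY processed_at LIMIT 1)[OFFSET(0)] AS first_order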

Running count distinct

I am trying to see how the cumulative number of subscribers changed over time based on unique email addresses and date they were created. Below is an example of a table I am working with.
I am trying to turn it into the table below. Email 1@gmail.com was created twice, and I would like to count it once. I cannot figure out how to generate the "Running count distinct" column.
Thanks for the help.
I would usually do this using row_number():
select date, count(*),
       sum(count(*)) over (order by date),
       sum(sum(case when seqnum = 1 then 1 else 0 end)) over (order by date)
from (select t.*,
             row_number() over (partition by email order by date) as seqnum
      from t
     ) t
group by date
order by date;
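For instance, on a small hypothetical table where 1@gmail.com signs up on two different dates, seqnum marks only the first occurrence of each email:
date  email        seqnum
d1    1@gmail.com  1
d2    1@gmail.com  2
d2    2@gmail.com  1
Per date this gives: d1 -> count 1, cumulative 1, running distinct 1; d2 -> count 2, cumulative 3, running distinct 2 (the repeated 1@gmail.com contributes nothing because its seqnum is 2).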
This is similar to the version using lag(). However, I get nervous using lag if the same email appears multiple times on the same date.
Getting the total count and cumulative count is straightforward. To get the cumulative distinct count, use lag to check whether the email had a row with a previous date, and if so set the flag to 0 so the row is ignored in the running sum.
select distinct dt
      ,count(*) over(partition by dt) as day_total
      ,count(*) over(order by dt) as cumsum
      ,sum(flag) over(order by dt) as cumdist
from (select t.*
            ,case when lag(dt) over(partition by email order by dt) is not null then 0 else 1 end as flag
      from tbl t
     ) t
DEMO HERE
Here is a solution that uses neither sum over nor lag, and it does produce the correct results.
Hence it may be simpler to read and maintain.
select
    t1.date_created,
    (select count(*) from my_table where date_created = t1.date_created) emails_created,
    (select count(*) from my_table where date_created <= t1.date_created) cumulative_sum,
    (select count(distinct email) from my_table where date_created <= t1.date_created) running_count_distinct
from
    (select distinct date_created from my_table) t1
order by 1

Find date ranges between large gaps and ignore smaller gaps

I have a column of mostly continuous unique dates in ascending order. Although the dates are mostly continuous, there are gaps: some are less than 3 days, others are more than 3 days.
I need to create a table where each record has the start date and end date of a range whose internal gaps are all 3 days or less. A new record has to be started whenever a gap is longer than 3 days.
so if dates are:
1/2/2012
1/3/2012
1/4/2012
1/15/2012
1/16/2012
1/18/2012
1/19/2012
I need:
1/2/2012 1/4/2012
1/15/2012 1/19/2012
You can do something like this:
WITH CTE_Source AS
(
    SELECT *, ROW_NUMBER() OVER (ORDER BY DT) RN
    FROM dbo.Table1
)
,CTE_Recursion AS
(
    SELECT *, 1 AS Grp
    FROM CTE_Source
    WHERE RN = 1

    UNION ALL

    SELECT src.*, CASE WHEN DATEADD(DD, 3, rec.DT) < src.DT THEN rec.Grp + 1 ELSE Grp END AS Grp
    FROM CTE_Source src
    INNER JOIN CTE_Recursion rec ON src.RN = rec.RN + 1
)
SELECT MIN(DT) AS StartDT, MAX(DT) AS EndDT
FROM CTE_Recursion
GROUP BY Grp
The first CTE just assigns consecutive row numbers to all rows, ordered by date, so they can be joined later. Then, using a recursive CTE, you join each row to the next one, incrementing the group number whenever the date difference is larger than 3 days. In the end, just group by the grouping column and select the desired results.
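On the sample dates, the recursion assigns groups like this:
RN  DT         Grp
1   1/2/2012   1
2   1/3/2012   1
3   1/4/2012   1
4   1/15/2012  2   (1/4 plus 3 days is still before 1/15)
5   1/16/2012  2
6   1/18/2012  2
7   1/19/2012  2
so grouping by Grp yields (1/2/2012, 1/4/2012) and (1/15/2012, 1/19/2012).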
SQLFiddle DEMO