Cumulative distinct count with Spark SQL

Using Spark 1.6.2.
Here is the data:
day | visitorID
-------------
1 | A
1 | B
2 | A
2 | C
3 | A
4 | A
I want to count how many distinct visitors there are per day, cumulative with the previous days (I don't know the exact term for that, sorry).
This should give:
day | visitors
--------------
1 | 2 (A+B)
2 | 3 (A+B+C)
3 | 3
4 | 3
I tried a self-join, but it is really too slow.
I am sure a window function is what I am looking for, but I didn't manage to find it :/

You should be able to do:
select day, max(visitors) as visitors
from (select day,
count(distinct visitorId) over (order by day) as visitors
from t
) d
group by day;
Actually, I think a better approach is to record a visitor only on the first day s/he appears; this also avoids COUNT(DISTINCT ...) inside a window function, which Spark 1.6 does not support:
select startday, sum(count(*)) over (order by startday) as visitors
from (select visitorId, min(day) as startday
from t
group by visitorId
) t
group by startday
order by startday;
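Note that this second query only returns a row for days on which a new visitor first appears; with the sample data, days 3 and 4 would be missing from the output. A minimal sketch that joins the first-appearance counts back to the full list of days (same table t, untested):
select d.day,
       sum(count(f.visitorId)) over (order by d.day) as visitors
from (select distinct day from t) d
left join (select visitorId, min(day) as startday
           from t
           group by visitorId
          ) f
  on f.startday = d.day
group by d.day
order by d.day;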

In SQL, you could do this:
select t1.day,sum(max(t.cnt)) over(order by t1.day) as visitors
from tbl t1
left join (select minday,count(*) as cnt
from (select visitorID,min(day) as minday
from tbl
group by visitorID
) t
group by minday
) t
on t1.day=t.minday
group by t1.day
Get the first day each visitorID appears, using min.
Count the rows per such minday found above (with the sample data: minday 1 has 2 new visitors, A and B; minday 2 has 1, C).
Left join this back to your original table and take the cumulative sum, which gives 2, 3, 3, 3 for days 1 through 4.
Another approach would be
select t1.day,sum(count(t.visitorid)) over(order by t1.day) as cnt
from tbl t1
left join (select visitorID,min(day) as minday
from tbl
group by visitorID
) t
on t1.day=t.minday and t.visitorid=t1.visitorid
group by t1.day

Try this:
select
    day,
    count(distinct visitorID) as day_visitors,
    (
        select count(distinct b.visitorID)
        from your_table b
        where a.day >= b.day
    ) as cumulative
from your_table as a
group by a.day
order by 1
Note that the correlated subquery runs once per day, so this scales poorly compared to the window-function approaches above.

Related

Time Between First and Second Records SQL

I am trying to calculate the time between the first and second records. My thought was to add a ranking for each record and then do a calculation on RN 2 - RN 1. I'm struggling to actually get the subquery to do RN2-RN1.
SAMPLE Data:
user_id            | date                | rn
-------------------+---------------------+---
698998737289929044 | 2021-04-08 11:27:38 | 1
698998737289929044 | 2021-04-08 12:20:25 | 2
698998737289929044 | 2021-04-01 13:23:59 | 3
732850336550572910 | 2021-03-23 06:13:25 | 1
598830651911547855 | 2021-03-11 11:56:53 | 1
SELECT
user_id,
date,
row_number() over(partition by user_id order by date) as RN
FROM event_table
GROUP BY user_id, date
You can join the result with itself to get the first and second row.
For example:
with
q as (
-- your query here
)
select
f.user_id,
f.date,
s.date - f.date as diff
from q f
left join q s on s.user_id = f.user_id and s.rn = 2
where f.rn = 1
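An alternative sketch without the self-join, assuming a database that supports lead() (such as Postgres), where subtracting timestamps yields an interval:
with q as (
    select
        user_id,
        date,
        row_number() over (partition by user_id order by date) as rn,
        lead(date) over (partition by user_id order by date) as next_date
    from event_table
)
select user_id, date, next_date - date as diff
from q
where rn = 1
This keeps one row per user (the first event) with the gap to the second event, or null if there is none.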

SQL - unique users who are visiting for the first time

Given following table visitorLog, write a SQL to find the following by date.
Total_Visitors
VisitorGain - compare to previous day
VisitorLoss - compare to previous day
Total_New_Visitors - unique users who are visiting for the first time
visitorLog :
*----------------------*
| Date Visitor |
*----------------------*
| 01-Jan-2011 V1 |
| 01-Jan-2011 V2 |
| 01-Jan-2011 V3 |
| 02-Jan-2011 V2 |
| 03-Jan-2011 V2 |
| 03-Jan-2011 V4 |
| 03-Jan-2011 V5 |
*----------------------*
Expected output:
*---------------------------------------------------------------------*
| Date Total_Visitors VisitorGain VisitorLoss Total_New_Visitors |
*---------------------------------------------------------------------*
| 01-Jan-2011 3 3 0 3 |
| 02-Jan-2011 1 0 2 0 |
| 03-Jan-2011 3 2 0 2 |
*---------------------------------------------------------------------*
Here is my SQL and SQL fiddle.
with cte as
(
select
date,
total_visitors,
lag(total_visitors) over (order by date) as prev_visitors,
row_number() over (order by date ) as rnk
from
(
select
*,
count(visitor) over (partition by date) as total_visitors
from visitorLog
) val
group by
date,
total_visitors
),
cte2 as
(
select
date,
sum(case when rnk = 1 then 1 else 0 end) as total_new_visitors
from
(
select
date,
visitor,
row_number() over (partition BY visitor order by date) as rnk
from visitorLog
) t
group by
date
)
select
c.date,
sum(total_visitors) as total_visitors,
sum(
case
when rnk = 1 then total_visitors
when (rnk > 1 and prev_visitors < total_visitors) then (total_visitors - prev_visitors)
else
0
end
) as visitorGain,
sum(
case
when rnk = 1 then 0
when prev_visitors > total_visitors then (prev_visitors - total_visitors)
else
0
end
) as visitorLoss,
sum(total_new_visitors) as total_new_visitors
from cte c
join cte2 c2
on c.date = c2.date
group by
c.date
order by
c.date
My solution is working as expected, but I am wondering if I am missing any edge cases here which may break my logic. Any help would be great.
This logic does what you want:
select date, count(*) as num_visitor,
greatest(count(*) - lag(count(*)::int, 1, 0) over (order by date), 0) as visitor_gain,
greatest(lag(count(*)::int, 1, 0) over (order by date) - count(*), 0) as visitor_loss,
count(*) filter (where seqnum = 1) as num_new_visitors
from (select vl.*,
row_number() over (partition by visitor order by date) as seqnum
from visitorLog vl
) vl
group by date
order by date
Here is a db<>fiddle.
I would use window functions and aggregation:
select
date,
count(*) no_visitor,
count(*) - lag(count(*), 1, 0) over(order by date) no_visitor_diff,
count(*) filter(where rn = 1) no_new_visitors
from (
select t.*, row_number() over(partition by visitor order by date) rn
from visitorLog t
) t
group by date
order by date
The subquery ranks the visits of each customer using row_number() (the first visit of each customer gets row number 1). Then, the outer query aggregates by date, and uses lag() to get the visitor count of the "previous" day.
I don't really see the point of having two distinct columns for the difference in visitors compared to the previous day, so this gives you a single column, with a value that's either positive or negative depending on whether visitors were gained or lost.
If you really want two columns, then:
greatest(count(*) - lag(count(*), 1, 0) over(order by date), 0) visitor_gain,
- least(count(*) - lag(count(*), 1, 0) over(order by date), 0) visitor_loss

Count by week between dates

I'm trying to show a count by week, but I am unsure how to find the weeks that aren't showing between effdate and expdate. How do I show the week and count shown below? Thanks.
You could use a recursive query to enumerate the weeks, then join it with the table
with cte as (
select min(effweek) week, max(expweek) max_week from mytable
union all
select week + 1, max_week from cte where week < max_week
)
select c.week, count(t.id_num) cnt
from cte c
left join mytable t on c.week between t.effweek and t.expweek
group by c.week
order by c.week
(Simplified) demo on DB Fiddle:
week | cnt
---: | --:
12 | 2
13 | 1
14 | 1
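If you are on Postgres, the recursive query above needs to be written as with recursive cte as (...). Alternatively, a sketch that enumerates the weeks with generate_series instead of recursion (assuming integer week columns, as in the demo):
select c.week, count(t.id_num) as cnt
from generate_series(
         (select min(effweek) from mytable),
         (select max(expweek) from mytable)
     ) as c(week)
left join mytable t on c.week between t.effweek and t.expweek
group by c.week
order by c.week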

Finding the interval between dates in SQL Server

I have a table including more than 5 million rows of sales transactions. I would like to find the sum of the date intervals between each customer's three most recent purchases.
Suppose my table looks like this :
CustomerID ProductID ServiceStartDate ServiceExpiryDate
A X1 2010-01-01 2010-06-01
A X2 2010-08-12 2010-12-30
B X4 2011-10-01 2012-01-15
B X3 2012-04-01 2012-06-01
B X7 2012-08-01 2013-10-01
A X5 2013-01-01 2015-06-01
The result that I'm looking for looks like this:
CustomerID IntervalDays
A 802
B 135
I know the query needs to first retrieve the 3 most recent transactions of each customer (based on ServiceStartDate) and then calculate the intervals between the ExpiryDate and the StartDate of consecutive transactions.
You want to calculate the difference between the previous row's ServiceExpiryDate and the current row's ServiceStartDate based on descending dates and then sum up the last two differences:
with cte as
(
select tab.*,
row_number()
over (partition by customerId
order by ServiceStartDate desc
, ServiceExpiryDate desc -- don't know if this 2nd column is necessary
) as rn
from tab
)
select t2.customerId,
       sum(datediff(day, t1.ServiceExpiryDate, t2.ServiceStartDate)) as IntervalDays,
       count(*) as purchases
from cte as t2 left join cte as t1
on t1.customerId = t2.customerId
and t1.rn = t2.rn+1 -- previous and current row
where t2.rn <= 3 -- last three rows
group by t2.customerId;
Same result using LEAD:
with cte as
(
select tab.*,
row_number()
over (partition by customerId
order by ServiceStartDate desc) as rn
,lead(ServiceExpiryDate)
over (partition by customerId
order by ServiceStartDate desc
) as prevEnd
from tab
)
select customerId,
sum(datediff(day, prevEnd, ServiceStartDate)) as Intervaldays
,count(*) as purchases
from cte
where rn <= 3
group by customerId;
Both will not return the expected result unless you subtract purchases (or max(rn)) from IntervalDays. But as you only sum two differences, that adjustment does not seem quite right to me either...
Additional logic must be applied based on your rules regarding:
customer has less than 3 purchases
overlapping intervals
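For illustration: with the sample data, both queries return 805 days for A and 138 for B, and subtracting the purchase count (3) gives the asker's expected 802 and 135. A hedged sketch of that adjustment, only if that really is the intended rule:
with cte as
(
    select tab.*,
           row_number() over (partition by customerId
                              order by ServiceStartDate desc) as rn,
           lead(ServiceExpiryDate) over (partition by customerId
                                         order by ServiceStartDate desc) as prevEnd
    from tab
)
select customerId,
       sum(datediff(day, prevEnd, ServiceStartDate)) - count(*) as IntervalDays
from cte
where rn <= 3
group by customerId;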
Assuming there are no overlaps, I think you want this:
select customerId,
       sum(datediff(day, ServiceStartDate, ServiceExpiryDate)) as IntervalDays
from (select t.*, row_number() over (partition by customerId
                                     order by ServiceStartDate desc) as seqnum
      from yourTable t
     ) t
where seqnum <= 3
group by customerId;
Try this:
SELECT dt.CustomerID,
SUM(DATEDIFF(DAY, dt.PrevExpiry, dt.ServiceStartDate)) As IntervalDays
FROM (
SELECT *
, ROW_NUMBER() OVER (PARTITION BY CustomerID ORDER BY ServiceStartDate DESC) AS rn
, (SELECT Max(ti.ServiceExpiryDate)
FROM yourTable ti
WHERE t.CustomerID = ti.CustomerID
AND ti.ServiceStartDate < t.ServiceStartDate) As PrevExpiry
FROM yourTable t ) dt
WHERE dt.rn <= 3
GROUP BY dt.CustomerID
Result will be:
CustomerId | IntervalDays
-----------+--------------
A | 805
B | 138

Select distinct users group by time range

I have a table with the following info
|date | user_id | week_beg | month_beg|
SQL to create table with test values:
CREATE TABLE uniques
(
date DATE,
user_id INT,
week_beg DATE,
month_beg DATE
)
INSERT INTO uniques VALUES ('2013-01-01', 1, '2012-12-30', '2013-01-01')
INSERT INTO uniques VALUES ('2013-01-03', 3, '2012-12-30', '2013-01-01')
INSERT INTO uniques VALUES ('2013-01-06', 4, '2013-01-06', '2013-01-01')
INSERT INTO uniques VALUES ('2013-01-07', 4, '2013-01-06', '2013-01-01')
INPUT TABLE:
| date | user_id | week_beg | month_beg |
| 2013-01-01 | 1 | 2012-12-30 | 2013-01-01 |
| 2013-01-03 | 3 | 2012-12-30 | 2013-01-01 |
| 2013-01-06 | 4 | 2013-01-06 | 2013-01-01 |
| 2013-01-07 | 4 | 2013-01-06 | 2013-01-01 |
OUTPUT TABLE:
| date | time_series | cnt |
| 2013-01-01 | D | 1 |
| 2013-01-01 | W | 1 |
| 2013-01-01 | M | 1 |
| 2013-01-03 | D | 1 |
| 2013-01-03 | W | 2 |
| 2013-01-03 | M | 2 |
| 2013-01-06 | D | 1 |
| 2013-01-06 | W | 1 |
| 2013-01-06 | M | 3 |
| 2013-01-07 | D | 1 |
| 2013-01-07 | W | 1 |
| 2013-01-07 | M | 3 |
I want to calculate the number of distinct user_id's for a date:
For that date
For that week up to that date (Week to date)
For the month up to that date (Month to date)
1 is easy to calculate.
For 2 and 3 I am trying to use such queries:
SELECT
date,
'W' AS "time_series",
COUNT(DISTINCT user_id) OVER (PARTITION BY week_beg) AS "cnt"
FROM user_subtitles
SELECT
date,
'M' AS "time_series",
COUNT(DISTINCT user_id) OVER (PARTITION BY month_beg) AS "cnt"
FROM user_subtitles
Postgres does not allow window functions for DISTINCT calculation, so this approach does not work.
I have also tried a GROUP BY approach, but it does not work, as it gives me the numbers for whole weeks/months.
What's the best way to approach this problem?
Count all rows
SELECT date, '1_D' AS time_series, count(DISTINCT user_id) AS cnt
FROM uniques
GROUP BY 1
UNION ALL
SELECT DISTINCT ON (1)
date, '2_W', count(*) OVER (PARTITION BY week_beg ORDER BY date)
FROM uniques
UNION ALL
SELECT DISTINCT ON (1)
date, '3_M', count(*) OVER (PARTITION BY month_beg ORDER BY date)
FROM uniques
ORDER BY 1, time_series
Your columns week_beg and month_beg are 100% redundant and can easily be replaced by
date_trunc('week', date + 1) - 1 and date_trunc('month', date) respectively.
Your week seems to start on Sunday (off by one), therefore the + 1 .. - 1.
The default frame of a window function with ORDER BY in the OVER clause is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW. That's exactly what you need.
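Spelled out, the weekly count above is therefore equivalent to:
count(*) OVER (PARTITION BY week_beg ORDER BY date
               RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)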
Use UNION ALL, not UNION.
Your unfortunate choice for time_series (D, W, M) does not sort well, so I renamed the values to make the final ORDER BY easier.
This query can deal with multiple rows per day. Counts include all peers for a day.
More about DISTINCT ON:
Select first row in each GROUP BY group?
DISTINCT users per day
To count every user only once per day, use a CTE with DISTINCT ON:
WITH x AS (SELECT DISTINCT ON (1,2) date, user_id FROM uniques)
SELECT date, '1_D' AS time_series, count(user_id) AS cnt
FROM x
GROUP BY 1
UNION ALL
SELECT DISTINCT ON (1)
date, '2_W'
,count(*) OVER (PARTITION BY (date_trunc('week', date + 1)::date - 1)
ORDER BY date)
FROM x
UNION ALL
SELECT DISTINCT ON (1)
date, '3_M'
,count(*) OVER (PARTITION BY date_trunc('month', date) ORDER BY date)
FROM x
ORDER BY 1, 2
DISTINCT users over dynamic period of time
You can always resort to correlated subqueries. They tend to be slow with big tables!
Building on the previous queries:
WITH du AS (SELECT date, user_id FROM uniques GROUP BY 1,2)
,d AS (
SELECT date
,(date_trunc('week', date + 1)::date - 1) AS week_beg
,date_trunc('month', date)::date AS month_beg
FROM uniques
GROUP BY 1
)
SELECT date, '1_D' AS time_series, count(user_id) AS cnt
FROM du
GROUP BY 1
UNION ALL
SELECT date, '2_W', (SELECT count(DISTINCT user_id) FROM du
WHERE du.date BETWEEN d.week_beg AND d.date )
FROM d
GROUP BY date, week_beg
UNION ALL
SELECT date, '3_M', (SELECT count(DISTINCT user_id) FROM du
WHERE du.date BETWEEN d.month_beg AND d.date)
FROM d
GROUP BY date, month_beg
ORDER BY 1,2;
SQL Fiddle for all three solutions.
Faster with dense_rank()
@Clodoaldo came up with a major improvement: use the window function dense_rank(). Here is another idea for an optimized version. It should be even faster to exclude daily duplicates right away. The performance gain grows with the number of rows per day.
Building on a simplified and sanitized data model
- without the redundant columns
- day as column name instead of date
date is a reserved word in standard SQL and a basic type name in PostgreSQL, and shouldn't be used as an identifier.
CREATE TABLE uniques(
day date -- instead of "date"
,user_id int
);
Improved query:
WITH du AS (
SELECT DISTINCT ON (1, 2)
day, user_id
,date_trunc('week', day + 1)::date - 1 AS week_beg
,date_trunc('month', day)::date AS month_beg
FROM uniques
)
SELECT day, count(user_id) AS d, max(w) AS w, max(m) AS m
FROM (
SELECT user_id, day
,dense_rank() OVER(PARTITION BY week_beg ORDER BY user_id) AS w
,dense_rank() OVER(PARTITION BY month_beg ORDER BY user_id) AS m
FROM du
) s
GROUP BY day
ORDER BY day;
SQL Fiddle demonstrating the performance of 4 faster variants. It depends on your data distribution which is fastest for you.
All of them are about 10x as fast as the correlated subqueries version (which isn't bad for correlated subqueries).
Without correlated subqueries. SQL Fiddle
with u as (
select
"date", user_id,
date_trunc('week', "date" + 1)::date - 1 week_beg,
date_trunc('month', "date")::date month_beg
from uniques
)
select
"date", count(distinct user_id) D,
max(week_dr) W, max(month_dr) M
from (
select
user_id, "date",
dense_rank() over(partition by week_beg order by user_id) week_dr,
dense_rank() over(partition by month_beg order by user_id) month_dr
from u
) s
group by "date"
order by "date"
Try
SELECT
*
FROM
(
SELECT dates, count(user_id), 'D' as timeseries FROM users_data GROUP BY dates
UNION
SELECT max(dates), count(user_id), 'W' FROM users_data GROUP BY date_part('year',dates), date_part('week',dates)
UNION
SELECT max(dates), count(user_id), 'M' FROM users_data GROUP BY date_part('year',dates), date_part('month',dates)
) temp order by dates, timeseries
SQLFIDDLE
Try queries like this:
SELECT count(distinct user_id), to_char(date, 'YYYY-MM-DD') as date_period
FROM uniques
GROUP BY date_period