How to get 'COUNT DISTINCT' over a moving window - SQL

I'm working on a query to compute the distinct users of particular features of an app within a moving window. So, if the range is 15-20 October, I want the query to go 8-15 Oct, 9-16 Oct, etc. and get the count of distinct users per feature. So for each date, the result should have x rows, where x is the number of features.
I have the following query so far:
WITH V1(edate, code, total) AS
(
SELECT date, featurecode,
DENSE_RANK() OVER (PARTITION BY featurecode ORDER BY accountid ASC) + DENSE_RANK() OVER (PARTITION BY featurecode ORDER BY accountid DESC) - 1
FROM....
GROUP BY edate, featurecode, appcode, accountid
HAVING appcode='sample' AND eventdate BETWEEN '15-10-2018' And '20-10-2018'
)
Select distinct date, code, total
from V1
WHERE date between '2018-10-15' AND '2018-10-20'
This returns the same set of values for all the dates. Is there any way to do this efficiently? It's a DB2 database, by the way, but I'm looking for insight from PostgreSQL users too.
Present result - all the totals are being repeated:
date code total
10/15/2018 appname-feature1 123
10/15/2018 appname-feature2 234
10/15/2018 appname-feature3 321
10/16/2018 appname-feature1 123
10/16/2018 appname-feature2 234
10/16/2018 appname-feature3 321
Desired result.
date code total
10/15/2018 appname-feature1 123
10/15/2018 appname-feature2 234
10/15/2018 appname-feature3 321
10/16/2018 appname-feature1 212
10/16/2018 appname-feature2 577
10/16/2018 appname-feature3 2345

This is not easy to do efficiently. DISTINCT counts aren't incrementally maintainable (unless you go down the route of inexact DISTINCT counts such as HyperLogLog).
It is easy enough to code in SQL, though; try the usual indexing etc. to help.
It is (possibly) not possible, however, to code with OLAP functions, not least because you can only use RANGE BETWEEN with SUM(), COUNT(), MAX() etc., not with RANK() or DENSE_RANK(). So just use a traditional correlated sub-select.
First some data
CREATE TABLE T(D DATE,F CHAR(1),A CHAR(1));
INSERT INTO T (VALUES
('2018-10-10','X','A')
, ('2018-10-11','X','A')
, ('2018-10-15','X','A')
, ('2018-10-15','X','A')
, ('2018-10-15','X','B')
, ('2018-10-15','Y','A')
, ('2018-10-16','X','C')
, ('2018-10-18','X','A')
, ('2018-10-21','X','B')
)
;
Now a simple select
WITH B AS (
SELECT DISTINCT D, F FROM T
)
SELECT D,F
, (SELECT COUNT(DISTINCT A)
FROM T
WHERE T.F = B.F
AND T.D BETWEEN B.D - 3 DAYS AND B.D + 4 DAYS
) AS DISTINCT_A_MOVING_WEEK
FROM
B
ORDER BY F,D
;
giving, e.g.
D F DISTINCT_A_MOVING_WEEK
---------- - ----------------------
2018-10-10 X 1
2018-10-11 X 2
2018-10-15 X 3
2018-10-16 X 3
2018-10-18 X 3
2018-10-21 X 2
2018-10-15 Y 1
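Since the question also asks for PostgreSQL insight, here is a minimal sketch of the same correlated sub-select in PostgreSQL syntax, assuming the same table T(D, F, A) as above; the only real difference is the date arithmetic, since PostgreSQL lets you add an integer number of days to a DATE directly:
WITH B AS (
SELECT DISTINCT D, F FROM T
)
SELECT D, F
, (SELECT COUNT(DISTINCT A)
FROM T
WHERE T.F = B.F
AND T.D BETWEEN B.D - 3 AND B.D + 4  -- DATE +/- integer moves by whole days
) AS DISTINCT_A_MOVING_WEEK
FROM
B
ORDER BY F, D
;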

Related

Find last job change date with JOB_TITLE and EVENT_DATE

Hi I am working in an Azure Databricks and I am looking for a SQL query solution.
Assuming that my db has the following columns:
ID     EVENT_DATE   JOB_TITLE   PAY
12345  2021-01-01   VP1         100,000
12345  2020-01-10   VP1         90,000
12345  2019-01-20   Analyst1    80,000
12346  2021-02-01   VP2         200,000
12346  2020-02-10   Analyst2    150,000
12346  2020-01-20   Analyst2    110,000
Basically I want the EVENT_DATE when JOB_TITLE changed the last time. This is my desired output:
ID     JOB_TITLE   PAY       LAST_JOB_CHANGE_DATE
12345  VP1         90,000    2021-01-10
12346  VP2         200,000   2021-02-01
For the last column LAST_JOB_CHANGE_DATE, we are pulling from the 2nd and 4th row of the table because that's the date when they changed job the last time.
Thank you!
You can just use an INNER JOIN to accomplish that, i.e.:
%sql
SELECT a.*
FROM yourTable a
INNER JOIN
(
SELECT id, MAX(event_date) event_date
FROM yourTable b
GROUP BY id
) b ON a.id = b.id
AND a.event_date = b.event_date
The ROW_NUMBER approach would also work well:
%sql
WITH cte AS
(
SELECT
ROW_NUMBER() OVER( PARTITION BY id ORDER BY event_date DESC ) AS rn,
*
FROM yourTable a
)
SELECT *
FROM cte
WHERE rn = 1
There's probably a simpler solution for this, but the following should work.
I'm assuming you wanted the MOST recent job change for each employee. To illustrate this, I added an extra row for an Engineer1. The ROW_NUMBER() window function helps us with this.
ID     EVENT_DATE   JOB_TITLE   PAY
12345  2021-01-01   VP1         100,000
12345  2020-01-10   VP1         90,000
12345  2019-01-20   Analyst1    80,000
12345  2018-01-04   Engineer1   75,000
12346  2021-02-01   VP2         200,000
12346  2020-02-10   Analyst2    150,000
12346  2020-01-20   Analyst2    110,000
Here is the query:
SELECT <---- (4)
c.ID,
c.JOB_TITLE,
c.PAY,
c.last_job_change_date
FROM
(
SELECT <---- (3)
b.ID,
ROW_NUMBER() OVER (PARTITION BY b.ID ORDER BY b.last_job_change_date DESC) AS row_id,
b.JOB_TITLE,
b.PAY,
b.last_job_change_date
FROM
(
SELECT <---- (2)
a.ID,
a.JOB_TITLE,
a.PAY,
a.EVENT_DATE as last_job_change_date
FROM
(
SELECT <---- (1)
ID,
EVENT_DATE,
PAY,
JOB_TITLE,
LEAD(JOB_TITLE, 1) OVER (
PARTITION BY ID ORDER BY EVENT_DATE DESC) job_change
FROM yourtable
) a
WHERE JOB_TITLE <> job_change
) b
) c
WHERE row_id = 1
I used a 4-step process and annotated the query with each step:
(1) Returns a table with a column for the subsequent job title (ordered by most recent title) of each employee.
(2) Returns the table from (1) but removes rows where the employee did not change their job.
(3) Adds row numbers so we can get the most recent job change of each employee.
(4) Returns the most recent job change for each employee.
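For comparison, a roughly equivalent sketch using LAG instead of LEAD (assuming the same yourTable columns as above): it compares each row to the employee's previous title in chronological order, keeps only the rows where the title changed, and then takes the most recent of those per employee.
WITH changes AS
(
SELECT
ID, EVENT_DATE, JOB_TITLE, PAY,
-- previous title in chronological order; NULL for the first record per ID
LAG(JOB_TITLE) OVER (PARTITION BY ID ORDER BY EVENT_DATE) AS prev_title
FROM yourTable
)
SELECT ID, JOB_TITLE, PAY, EVENT_DATE AS LAST_JOB_CHANGE_DATE
FROM
(
SELECT c.*,
-- rank the change rows so the latest change per ID comes first
ROW_NUMBER() OVER (PARTITION BY ID ORDER BY EVENT_DATE DESC) AS rn
FROM changes c
-- keep only the rows where the title actually changed
WHERE prev_title IS NOT NULL AND JOB_TITLE <> prev_title
) x
WHERE rn = 1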

Vertica SQL for running count distinct and running conditional count

I'm trying to build a department-level score table based on a deeper, product-url-level score table.
- Dates are not consecutive.
- Not all urls get score updates on the same day (they are independent of each other).
- dist_url should be a running count distinct (cumulative count distinct).
- dist urls and urls score >=30 are both count distinct.
What I have now is:
Date url Store Dept Page Score
10/1 a US A X 10
10/1 b US A X 30
10/1 c US A X 60
10/4 a US A X 20
10/4 d US A X 60
10/6 b US A X 22
10/9 a US A X 40
10/9 e US A X 10
Date Store Dept Page dist urls urls score >=30
10/1 US A X 3 2
10/4 US A X 4 3
10/6 US A X 4 2
10/9 US A X 5 2
I think dist_url can be done by using a window function, I'm just not sure about the query.
The current query is below, but it's wrong since it's not a cumulative count distinct:
SELECT
bm.AnalysisDate,
su.SoID AS Store,
su.DptCaID AS DTID,
su.PageTypeID AS PTID,
COUNT(DISTINCT bm.SeoURLID) AS NumURLsWithDupScore,
SUM(CASE WHEN bm.DuplicationScore > 30 THEN 1 ELSE 0 END) AS Over30Count
FROM csn_seo.tblBotifyMetrics bm
INNER JOIN csn_seo.tblSEOURLs su
ON bm.SeoURLID = su.ID
WHERE su.DptCaID IS NOT NULL
AND su.DptCaID <> 0
AND su.PageTypeID IS NOT NULL
AND su.PageTypeID <> -1
AND bm.iscompliant = 1
GROUP BY bm.AnalysisDate, su.SoID, su.DptCaID, su.PageTypeID;
Please let me know if anyone has any idea.
Based on your question, you seem to want two levels of logic: count each url only on the first date it appears (per store/dept/page), then take a cumulative sum of those first-appearance counts by date:
select date, store, dept, page,
sum(sum(start)) over (partition by dept, page order by date) as distinct_urls,
sum(sum(start_30)) over (partition by dept, page order by date) as distinct_urls_30
from ((select store, dept, page, url, min(date) as date, 1 as start, 0 as start_30
from t
group by store, dept, page, url
) union all
(select store, dept, page, url, min(date) as date, 0, 1
from t
where score >= 30
group by store, dept, page, url
)
) t
group by date, store, dept, page;
I don't understand how your query is related to your question.
Try as I might, I don't get your output either.
But I think you can avoid the UNION SELECTs - does this do what you expect?
NULLs don't figure in COUNT DISTINCTs, and here you can combine an aggregate expression with an OLAP one.
And Vertica has named windows to increase readability.
WITH
input(Date,url,Store,Dept,Page,Score) AS (
SELECT DATE '2019-10-01','a','US','A','X',10
UNION ALL SELECT DATE '2019-10-01','b','US','A','X',30
UNION ALL SELECT DATE '2019-10-01','c','US','A','X',60
UNION ALL SELECT DATE '2019-10-04','a','US','A','X',20
UNION ALL SELECT DATE '2019-10-04','d','US','A','X',60
UNION ALL SELECT DATE '2019-10-06','b','US','A','X',22
UNION ALL SELECT DATE '2019-10-09','a','US','A','X',40
UNION ALL SELECT DATE '2019-10-09','e','US','A','X',10
)
SELECT
date
, store
, dept
, page
, SUM(COUNT(DISTINCT url) ) OVER(w) AS dist_urls
, SUM(COUNT(DISTINCT CASE WHEN score >=30 THEN url END)) OVER(w) AS dist_urls_gt_30
FROM input
GROUP BY
date
, store
, dept
, page
WINDOW w AS (PARTITION BY store,dept,page ORDER BY date)
;
-- out date | store | dept | page | dist_urls | dist_urls_gt_30
-- out ------------+-------+------+------+-----------+-----------------
-- out 2019-10-01 | US | A | X | 3 | 2
-- out 2019-10-04 | US | A | X | 5 | 3
-- out 2019-10-06 | US | A | X | 6 | 3
-- out 2019-10-09 | US | A | X | 8 | 4
-- out (4 rows)
-- out
-- out Time: First fetch (4 rows): 45.321 ms. All rows formatted: 45.364 ms

Get MAX count but keep the repeated calculated value if highest

I have the following table; I am using SQL Server 2008.
BayNo FixDateTime FixType
1 04/05/2015 16:15:00 tyre change
1 12/05/2015 00:15:00 oil change
1 12/05/2015 08:15:00 engine tuning
1 04/05/2016 08:11:00 car tuning
2 13/05/2015 19:30:00 puncture
2 14/05/2015 08:00:00 light repair
2 15/05/2015 10:30:00 super op
2 20/05/2015 12:30:00 wiper change
2 12/05/2016 09:30:00 denting
2 12/05/2016 10:30:00 wiper repair
2 12/06/2016 10:30:00 exhaust repair
4 12/05/2016 05:30:00 stereo unlock
4 17/05/2016 15:05:00 door handle repair
On any given day I need to find the highest number of fixes made on a given bay number, and if that calculated number is repeated then it should also appear in the result set.
So I would like to see the result set as follows:
BayNo FixDateTime noOfFixes
1 12/05/2015 00:15:00 2
2 12/05/2016 09:30:00 2
4 12/05/2016 05:30:00 1
4 17/05/2016 15:05:00 1
I managed to get the counts of each, but I'm struggling to get the max and keep the highest calculated repeated value. Can someone help please?
Use window functions.
Get the count for each day by bayno and also find the min fixdatetime for each day per bayno.
Then use dense_rank to compute the highest ranked row for each bayno based on the number of fixes.
Finally get the highest ranked rows.
select distinct bayno,minfixdatetime,no_of_fixes
from (
select bayno,minfixdatetime,no_of_fixes
,dense_rank() over(partition by bayno order by no_of_fixes desc) rnk
from (
select t.*,
count(*) over(partition by bayno,cast(fixdatetime as date)) no_of_fixes,
min(fixdatetime) over(partition by bayno,cast(fixdatetime as date)) minfixdatetime
from tablename t
) x
) y
where rnk = 1
You are looking for rank() or dense_rank(). I would write the query like this:
select bayno, thedate, numFixes
from (select bayno, cast(fixdatetime as date) as thedate,
count(*) as numFixes,
rank() over (partition by bayno order by count(*) desc) as seqnum
from t
group by bayno, cast(fixdatetime as date)
) b
where seqnum = 1;
Note that this returns the date in question. The date does not have a time component.

SQL script to partition data on a column and return the max value [duplicate]

This question already has answers here:
How to group by on consecutive values in SQL
(2 answers)
Closed 6 years ago.
I have a requirement to compute bonus payout based on spread goal and date achieved as follows:
Spread Goal | Date Achieved | Bonus Payout
----------------------------------------------
$3,500 | < 27 wks | $2,000
$3,500 | 27 wks to 34 wks | $1,000
$3,500 | > 34 wks | $0
I have a table in SQL Server 2014 where the subset of the data is as follows:
EMP_ID WK_NUM NET_SPRD_LCL
123 10 0
123 11 1500
123 15 3600
123 18 3800
123 19 4000
Based on the requirement, I need to look for records where NET_SPRD_LCL is greater than or equal to 3500 during 2 continuous wk_num.
So, in my example, WK_NUM 15 and 18 (which in my case are continuous because I have a calendar table that I join to in order to exclude the holiday weeks) are less than 27 wks and have NET_SPRD_LCL > 3500.
For this case, I want to output the MAX(WK_NUM), its associated NET_SPRD_LCL and BONUSPAYOUT = 2000. So the output should be as follows:
EMP_ID WK_NUM NET_SPRD_LCL BONUSPAYOUT
123 18 3800 2000
If this meets the first requirement, the script should output and quit. If not, then I will look for the second requirement where Date Achieved is between 27 wks to 34 wks.
I hope I was able to explain my requirement clearly :-)
Thanks for the help.
Nice question! I puzzled over situations where 4 rows in a row are at 3500 or more, and came up with this.
You can use a CTE, a recursive CTE and ROW_NUMBER():
;WITH cte AS(
SELECT EMP_ID,
WK_NUM,
NET_SPRD_LCL,
ROW_NUMBER() OVER (PARTITION BY EMP_ID ORDER BY WK_NUM) rn
FROM YourTable
)
, recur AS (
SELECT EMP_ID,
WK_NUM,
NET_SPRD_LCL,
rn,
1 as lev
FROM cte
WHERE rn = 1
UNION ALL
SELECT c.EMP_ID,
c.WK_NUM,
c.NET_SPRD_LCL,
c.rn,
CASE WHEN c.NET_SPRD_LCL < 3500 THEN Lev+1 ELSE Lev END
FROM cte c
INNER JOIN recur r
ON r.rn+1 = c.rn AND r.EMP_ID = c.EMP_ID
)
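-- lev stays the same across consecutive rows and only increases when a week falls
-- below 3500, so each unbroken run of weeks at or above the threshold shares a
-- single lev value.
-- Below, ROW_NUMBER() ... % 2 sorts the even-positioned weeks of each such run
-- first, and TOP 1 WITH TIES keeps every row tied with that first sort value -
-- for the sample data that is week 18, the second consecutive week at or above 3500.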
SELECT TOP 1 WITH TIES
EMP_ID,
WK_NUM,
NET_SPRD_LCL,
CASE WHEN WK_NUM < 27 THEN $2000
WHEN WK_NUM between 27 and 34 THEN $1000
ELSE $0 END as Bonus
FROM recur
WHERE NET_SPRD_LCL >= 3500
ORDER BY ROW_NUMBER() OVER(PARTITION BY EMP_ID,lev ORDER BY WK_NUM)%2
Output for data you provided:
EMP_ID WK_NUM NET_SPRD_LCL Bonus
123 18 3800 2000,00

SQL Server : count types with totals by date change

I need to count a value (M_Id) at each change of a date (RS_Date) and create a column grouped by the RS_Date that has an active total from that date.
So the table is:
Ep_Id Oa_Id M_Id M_StartDate RS_Date
--------------------------------------------
1 2001 5 1/1/2014 1/1/2014
1 2001 9 1/1/2014 1/1/2014
1 2001 3 1/1/2014 1/1/2014
1 2001 11 1/1/2014 1/1/2014
1 2001 2 1/1/2014 1/1/2014
1 2067 7 1/1/2014 1/5/2014
1 2067 1 1/1/2014 1/5/2014
1 3099 12 1/1/2014 3/2/2014
1 3099 14 2/14/2014 3/2/2014
1 3099 4 2/14/2014 3/2/2014
So my goal is like
RS_Date Active
-----------------
1/1/2014 5
1/5/2014 7
3/2/2014 10
If the M_StartDate = RS_Date I need to count the M_Id, and then for each RS_Date that is not equal to the start date I need to count the M_Id and add that to the M_StartDate count, and then count the next RS_Date and add that to the last active count.
I can get the basic counts with something like
(Case when M_StartDate <= RS_Date
then [m_Id] end) as Test.
But I am stuck as how to get to the result I want.
Any help would be greatly appreciated.
Brian
Added in response to comments: I am using SQL Server version 10 (2008).
If using SQL Server 2012+ you can use ROWS with the analytic/window functions:
;with cte AS (SELECT RS_Date
,COUNT(DISTINCT M_ID) AS CT
FROM Table1
GROUP BY RS_Date
)
SELECT *,SUM(CT) OVER(ORDER BY RS_Date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS Run_CT
FROM cte
Demo: SQL Fiddle
If stuck using something prior to 2012 you can use:
;with cte AS (SELECT RS_Date
,COUNT(DISTINCT M_ID) AS CT
FROM Table1
GROUP BY RS_Date
)
SELECT a.RS_Date
,SUM(b.CT)
FROM cte a
LEFT JOIN cte b
ON a.RS_DAte >= b.RS_Date
GROUP BY a.RS_Date
Demo: SQL Fiddle
You need a cumulative sum, easy in SQL Server 2012 using Windowed Aggregate Functions. Based on your description this will return the expected result
SELECT Ep_Id, RS_Date,
SUM(COUNT(*))
OVER (PARTITION BY Ep_Id
ORDER BY RS_Date
ROWS UNBOUNDED PRECEDING)
FROM tab
GROUP BY Ep_Id, RS_Date
It looks like you want something like this:
SELECT
RS_Date,
SUM(c) OVER (PARTITION BY M_StartDate ORDER BY RS_Date ROWS UNBOUNDED PRECEDING)
FROM
(
SELECT M_StartDate, RS_Date, COUNT(DISTINCT M_Id) AS c
FROM my_table
GROUP BY M_StartDate, RS_Date
) counts
The inline view computes the counts of distinct M_Id values within each (M_StartDate, RS_Date) group (distinctness enforced only within the group), and the outer query uses the analytic version of SUM() to add up the counts within each M_StartDate.
Note that this particular query will not exactly reproduce your example results. It will instead produce:
RS_Date Active
-----------------
1/1/2014 5
1/5/2014 7
3/2/2014 8
3/2/2014 2
This is on account of some rows in your example data with RS_Date 3/2/2014 having a later M_StartDate than others. If this is not what you want then you need to clarify the question, which currently seems a bit inconsistent.
Unfortunately, analytic functions like this are not available until SQL Server 2012. In SQL Server 2008, the job is messier. It could be done like this:
WITH gc AS (
SELECT M_StartDate, RS_Date, COUNT(DISTINCT M_Id) AS c
FROM my_table
GROUP BY M_StartDate, RS_Date
)
SELECT
RS_Date,
(
SELECT SUM(c)
FROM gc AS gc2
WHERE gc2.M_StartDate = gc.M_StartDate AND gc2.RS_Date <= gc.RS_Date
) AS Active
FROM gc
If you are using SQL Server 2012 or newer you can use window functions (such as a windowed SUM, or LAG) to produce a running total.
https://msdn.microsoft.com/en-us/library/hh231256(v=sql.110).aspx