How to use window functions to get metrics for today, the last 7 days, and the last 30 days for each date? - sql

My problem seems simple on paper:
For a given date, give me the active users on that date, the active users in the 7 days up to that date, and the active users in the 30 days up to that date.
Sample data:
"timestamp" "user_public_id"
"23-Sep-15" "805a47023fa611e58ebb22000b680490"
"28-Sep-15" "d842b5bc5b1711e5a84322000b680490"
"01-Oct-15" "ac6b5f70b95911e0ac5312313d06dad5"
"21-Oct-15" "8c3e91e2749f11e296bb12313d086540"
"29-Nov-15" "b144298810ee11e4a3091231390eb251"
For 01-Oct-15, the count for today would be 1, last_7_days would be 3, and last_30_days would be 3+n (where n is the count of user_ids on dates that precede Oct 1st within the 30-day window).
I am on Amazon Redshift. Can somebody provide sample SQL to help me get started?
The output should look like this:
"timestamp" "users_today" "users_last_7_days" "users_30_days"
"01-Oct-15" 1 3 (3+n)

I know asking for help with incomplete solutions is frowned upon, but this question is not getting any other attention, so I thought I would do my bit.
I have been pulling my hair out trying to nut this one out. Alas, I am a beginner and something is not clicking for me. Perhaps you or others will be able to drastically improve my answer, but I think I am on the right track.
SELECT replace(convert(varchar, [timestamp], 111), '/','-') AS [timestamp], -- to get date in same format as you require
(SELECT COUNT([TIMESTAMP]) FROM #SIMPLE WHERE ([TIMESTAMP]) = ([timestamp])) AS users_today,
(SELECT COUNT([TIMESTAMP]) FROM #SIMPLE WHERE [TIMESTAMP] BETWEEN DATEADD(DY,-7,[TIMESTAMP]) AND [TIMESTAMP]) AS users_last_7_days ,
(SELECT COUNT([TIMESTAMP]) FROM #SIMPLE WHERE [TIMESTAMP] BETWEEN DATEADD(DY,-30,[TIMESTAMP]) AND [timestamp]) AS users_last_30_days
FROM #SIMPLE
GROUP BY [timestamp]
Starting with this:
CREATE TABLE #SIMPLE (
[timestamp] datetime, user_public_id varchar(32)
)
INSERT INTO #SIMPLE
VALUES('23-Sep-15','805a47023fa611e58ebb22000b680490'),
('28-Sep-15','d842b5bc5b1711e5a84322000b680490'),
('01-Oct-15','ac6b5f70b95911e0ac5312313d06dad5'),
('21-Oct-15','8c3e91e2749f11e296bb12313d086540'),
('29-Nov-15','b144298810ee11e4a3091231390eb251')
The problem I am having is that each row contains the same counts, despite my grouping by [timestamp].
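A likely cause of the identical counts: inside each subquery the unqualified [timestamp] resolves to the inner #SIMPLE, so every comparison is a column against itself and is always true. Giving the outer and inner tables different aliases makes the subqueries correlated. Below is a minimal sketch of that idea, using Python's sqlite3 module as a stand-in for T-SQL or Redshift (so `date(x, '-6 days')` replaces DATEADD). The table name `simple`, the ISO date strings, and the reading of "last 7 days" as the date itself plus the 6 days before it are my assumptions.

```python
import sqlite3

# In-memory stand-in for #SIMPLE; dates stored as ISO strings so they sort correctly.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE simple (day TEXT, user_public_id TEXT)")
conn.executemany(
    "INSERT INTO simple VALUES (?, ?)",
    [
        ("2015-09-23", "805a47023fa611e58ebb22000b680490"),
        ("2015-09-28", "d842b5bc5b1711e5a84322000b680490"),
        ("2015-10-01", "ac6b5f70b95911e0ac5312313d06dad5"),
        ("2015-10-21", "8c3e91e2749f11e296bb12313d086540"),
        ("2015-11-29", "b144298810ee11e4a3091231390eb251"),
    ],
)

# "o" is the outer row, "i" the inner rows; the aliases are what make the
# subqueries depend on the outer date.
QUERY = """
SELECT o.day,
       (SELECT COUNT(DISTINCT i.user_public_id) FROM simple i
         WHERE i.day = o.day) AS users_today,
       (SELECT COUNT(DISTINCT i.user_public_id) FROM simple i
         WHERE i.day BETWEEN date(o.day, '-6 days') AND o.day) AS users_last_7_days,
       (SELECT COUNT(DISTINCT i.user_public_id) FROM simple i
         WHERE i.day BETWEEN date(o.day, '-29 days') AND o.day) AS users_last_30_days
FROM (SELECT DISTINCT day FROM simple) AS o
ORDER BY o.day
"""

counts = conn.execute(QUERY).fetchall()
for row in counts:
    print(row)
```

With only these five sample rows, 01-Oct-15 comes out as 1, 2, 3; the asker's figures are based on a larger table.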

Step 1-- Create a table which has daily counts.
create temp table daily_mobile_Sessions as
select "timestamp" ,
count(user_public_id) over (partition by "timestamp" ) as "today"
from mobile_sessions
group by 1, mobile_sessions.user_public_id
order by 1 DESC
Step 2 -- From the table above, select the "today" field and apply window functions to sum the counts over the current row and the preceding rows.
select "timestamp", today,
sum(today) over (order by "timestamp" rows between 6 PRECEDING and CURRENT ROW) as "last_7days",
sum(today) over (order by "timestamp" rows between 29 PRECEDING and CURRENT ROW) as "last_30days"
from daily_mobile_Sessions group by "timestamp" , 2 order by 1 desc
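One thing worth checking before relying on this: a ROWS frame counts rows, not calendar days, so "6 PRECEDING" only equals "the last 7 days" when every date has a row. The sketch below compares the two on the sparse sample dates from the question, again using Python's sqlite3 as a stand-in for Redshift. The table name `daily` and the one-user-per-day counts are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily (day TEXT PRIMARY KEY, today INTEGER)")
conn.executemany(
    "INSERT INTO daily VALUES (?, ?)",
    [
        ("2015-09-23", 1),
        ("2015-09-28", 1),
        ("2015-10-01", 1),
        ("2015-10-21", 1),
        ("2015-11-29", 1),
    ],
)

# by_rows: the frame from the answer (current row plus the 6 rows before it).
# by_days: rows whose date lies within the 7 calendar days ending on d.day.
comparison = conn.execute(
    """
    SELECT d.day,
           SUM(d.today) OVER (
               ORDER BY d.day
               ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
           ) AS by_rows,
           (SELECT SUM(x.today) FROM daily x
             WHERE x.day BETWEEN date(d.day, '-6 days') AND d.day) AS by_days
    FROM daily d
    ORDER BY d.day
    """
).fetchall()

for row in comparison:
    print(row)
```

On 2015-11-29 the row-based frame reaches back to September and reports 5, while the calendar-based count is 1. Joining against a table with one row per calendar date before applying the ROWS frame avoids the gap problem. Note also that summing daily counts counts a user once for each day they were active.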

Related

How to track whether field has changed after a date

In my problem, I want to be able to track whether a state has shifted from 04a. Lapsing - Lowering Engagement to 03d. Engaged - Very High after trigger_send_date has occurred.
I believe this needs a window function that checks whether the state is 04a. Lapsing - Lowering Engagement before trigger_send_date and then measures whether that changes after trigger_send_date, but I can't figure out how to write it. I made a start below, but have difficulty continuing!
Ideally I'd like a new True/False column showing whether that switch occurred within 31 days after trigger_send_date.
SELECT
cust_id,
state_date,
trigger_send_date,
total_state,
IF (
total_state IN ("04a. Lapsing - Lowering Engagement"),
True,
False
) as lapse,
-- Trying to write this column
sum(IF ((trigger_send_date >= state_date) & (total_state IN ("04a. Lapsing - Lowering Engagement") , 1, null)) OVER (
PARTITION BY cust_id,
state_date
ORDER BY
state_date
) as lapsed_and_returned_wirthin_31_days
FROM
base
ORDER BY
state_date,
trigger_send_date
Does anyone have any tips to help me write this?
This is what my table looks like with expected result as right-most column if it helps!
Let me preface my answer by saying that I don't have access to Spark SQL, so the below is written in MySQL (it would probably work in SQL Server as well). I've had a look at the docs and the window frame should still work, although you might need to make some tweaks.
The window frame tells the window function which rows of the partition to look at: by including UNBOUNDED PRECEDING you're telling the function to include every row before the current row, and by using UNBOUNDED FOLLOWING you're telling it to include every row after the current row.
I tried to include another test, for a customer that was engaged before the trigger date, and it seems to work. If you provide some sample data, we could test further.
DROP TABLE IF EXISTS Base;
CREATE TABLE Base
(
cust_id BIGINT,
state_date DATE,
trigger_send_date DATE,
total_state VARCHAR(256)
);
INSERT INTO Base (cust_id,state_date, trigger_send_date, total_state) VALUES
(9177819375032029782,'2022-03-07','2022-03-14','03d. Engaged - Very High'),
(9177819375032029782,'2022-03-13','2022-03-14','04a. Lapsing - Lowering Engagement'),
(9177819375032029782,'2022-03-19','2022-03-14','03d. Engaged - Very High'),
(9177819375032029782,'2022-05-07','2022-03-14','03d. Engaged - Very High'),
(819375032029782,'2022-03-07','2022-03-14','03d. Engaged - Very High'),
(819375032029782,'2022-03-10','2022-03-14','04a. Lapsing - Lowering Engagement'),
(819375032029782,'2022-03-11','2022-03-14','03d. Engaged - Very High'),
(819375032029782,'2022-03-19','2022-03-14','03d. Engaged - Very High'),
(819375032029782,'2022-05-07','2022-03-14','03d. Engaged - Very High');
With LapsedCTE AS
(
SELECT *, ROW_NUMBER() OVER (PARTITION BY cust_id ORDER BY state_date DESC) AS `RNum`
FROM Base
WHERE state_date <= trigger_send_date
AND LEFT(total_state, 3) IN ('03d','04a')
)
SELECT b.cust_id, b.state_date, b.trigger_send_date, b.total_state,
IF (
b.total_state IN ("04a. Lapsing - Lowering Engagement"),
True,
False
) as lapse,
-- Here we find the MIN engaged date (you can add other states if needed) AFTER the trigger date.
-- Then we compare that to the trigger_send_date from the list of customers that were lapsed prior to the trigger_send_date (this will be empty for non-lapsed customers,
-- so it will default to 0 in our results column).
-- Then we do a DATEDIFF between the trigger date and the engaged date, if the value is less than or equal to 31 days, Robert is your Mother's Brother..
IF(DATEDIFF(
MIN(IF(b.state_date > b.trigger_send_date AND LEFT(b.total_state, 3) IN ('03d'), b.state_date, NULL))
OVER (PARTITION BY b.cust_id ORDER BY b.state_date ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING), l.trigger_send_date) <= 31, 1, 0) AS `lapsed_and_returned_wirthin_31_days`
-- Here's some other stuff just to show you the inner working of the above
/*
DATEDIFF(
MIN(IF(b.state_date > b.trigger_send_date AND LEFT(b.total_state, 3) IN ('03d'), b.state_date, NULL))
OVER (PARTITION BY b.cust_id ORDER BY b.state_date ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING), b.trigger_send_date) AS `engaged_time_lag_days`,
MIN(IF(b.state_date > b.trigger_send_date AND LEFT(b.total_state, 3) IN ('03d'), b.state_date, NULL))
OVER (PARTITION BY b.cust_id ORDER BY b.state_date ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS `first_engaged_date_after_trigger`
*/
FROM Base b
LEFT JOIN LapsedCTE l ON l.cust_id = b.cust_id AND l.RNum = 1 AND LEFT(l.total_state, 3) IN ('04a');
It would be possible to remove the CTE if you need to; it just makes things a bit cleaner.
Here's a runnable DBFiddle just in case you don't have access to a MySQL database.
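For anyone who wants to sanity-check the rule outside a database, here is a minimal Python sketch of the same logic (it is not the query above). It assumes a customer counts as "lapsed" when their most recent state on or before trigger_send_date is 04a, and as "returned" when their first 03d state after the trigger falls within 31 days. The function name `lapsed_and_returned` and the tuple layout are hypothetical; the rows are the sample data from the INSERT above.

```python
from datetime import date

LAPSING = "04a. Lapsing - Lowering Engagement"
ENGAGED = "03d. Engaged - Very High"
TRIGGER = date(2022, 3, 14)

# One list per customer: (state_date, trigger_send_date, total_state).
CUSTOMER_A = [  # cust_id 9177819375032029782
    (date(2022, 3, 7), TRIGGER, ENGAGED),
    (date(2022, 3, 13), TRIGGER, LAPSING),
    (date(2022, 3, 19), TRIGGER, ENGAGED),
    (date(2022, 5, 7), TRIGGER, ENGAGED),
]
CUSTOMER_B = [  # cust_id 819375032029782
    (date(2022, 3, 7), TRIGGER, ENGAGED),
    (date(2022, 3, 10), TRIGGER, LAPSING),
    (date(2022, 3, 11), TRIGGER, ENGAGED),
    (date(2022, 3, 19), TRIGGER, ENGAGED),
    (date(2022, 5, 7), TRIGGER, ENGAGED),
]


def lapsed_and_returned(rows, max_days=31):
    """Return True if the customer was lapsing at the trigger date and
    re-engaged within max_days after it."""
    rows = sorted(rows)
    trigger = rows[0][1]  # assumes one trigger date per customer
    before = [r for r in rows if r[0] <= trigger]
    # Lapsed means the latest state on or before the trigger is 04a.
    if not before or before[-1][2] != LAPSING:
        return False
    returned = [r[0] for r in rows if r[0] > trigger and r[2] == ENGAGED]
    return bool(returned) and (min(returned) - trigger).days <= max_days


print(lapsed_and_returned(CUSTOMER_A))  # lapsed on 03-13, engaged again on 03-19
print(lapsed_and_returned(CUSTOMER_B))  # already engaged again before the trigger
```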

Z-Score in SQL based on last 1 year

I have daily data structured in the below format. Please note this is just a subset of the data and I had to make some modifications to be able to share it.
The first column is the [DataValue] for which I need to find the Z-score by IndexValue, [Qualifier], [QualifierCode] and [QualifierType]. I also have the [Date] column in there.
I essentially need to find the Z-score value for each data point by IndexValue, [Qualifier], [QualifierCode] and [QualifierType]. The main point of focus here is that I have data for the last 3 years but in order to calculate Z-score, I only want to take the average and standard deviation for the last one year.
Z-Score = ([DataValue] - (Avg in last 1 year)) / (Std Dev in last 1 year)
I am struggling with how to get average for the last one year. Would anybody be able to help me with this?
SELECT [IndexValue]
,[Qualifier]
,[QualifierCode]
,[QualifierType],[Date]
,[Month]
,[Year]
,[Z-Score] = ([DataValue] - ROUND(AVG([DataValue]),3))/ ROUND(STDEV([DataValue]),3)
FROM [TABLEA]
GROUP BY [IndexValue]
,[Qualifier]
,[QualifierCode]
,[QualifierType]
,[Date]
,[Month]
,[Year]
order by [IndexValue]
,[Qualifier]
,[QualifierCode]
,[QualifierType]
,[Date] desc
Sample data (image): https://i.stack.imgur.com/pqhJD.png
You need window functions for this:
SELECT a.*,
( (DataValue - AVG(DataValue) OVER ()) /
STDEV(DataValue) OVER ()
) as z_score
FROM [TABLEA] a;
Note: if DataValue is an integer, you will need to convert it to a decimal so that the average is not truncated:
SELECT a.*,
( (DataValue - AVG(DataValue * 1.0) OVER ()) /
STDEV(DataValue) OVER ()
) as z_score
FROM [TABLEA] a;
Rounding for the calculation seems to be way off base, unless your intention is to produce a z-like score that isn't really a z-score.
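The queries above take the average and standard deviation over every row. To match the question, the statistics would also need to be restricted to the last year and computed per group (for example with PARTITION BY on the qualifier columns plus a date filter). Here is a rough Python sketch of that arithmetic only. It assumes "last one year" means the 365 days before a chosen as-of date and uses the sample standard deviation, which is what STDEV returns in SQL Server. The group key, dates and values are made up for illustration.

```python
from collections import defaultdict
from datetime import date, timedelta
from statistics import mean, stdev

# (group_key, date, value); hypothetical data, one group for brevity.
rows = [
    ("A", date(2020, 1, 1), 100.0),  # older than one year: scored, but not used for stats
    ("A", date(2021, 6, 1), 10.0),
    ("A", date(2021, 9, 1), 20.0),
    ("A", date(2021, 12, 1), 30.0),
]
as_of = date(2021, 12, 31)


def z_scores(rows, as_of, window_days=365):
    """Score every row against the mean/stdev of its group's last-year values."""
    cutoff = as_of - timedelta(days=window_days)
    recent = defaultdict(list)
    for key, day, value in rows:
        if cutoff < day <= as_of:
            recent[key].append(value)

    scored = []
    for key, day, value in rows:
        values = recent[key]
        # stdev needs at least two points and must be non-zero to divide by.
        if len(values) < 2 or stdev(values) == 0:
            scored.append((key, day, None))
        else:
            scored.append((key, day, (value - mean(values)) / stdev(values)))
    return scored


result = z_scores(rows, as_of)
for row in result:
    print(row)
```

Note the parentheses: the mean is subtracted first and the difference is then divided by the standard deviation.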

How to obtain the "largest day" in a week?

How do I get data when the "largest day" of the week does not exist?
For example, if Friday does not exist in my database for a particular week (assuming that Saturdays and Sundays are never in my database), I would still like to be able to get the data from Thursday. If both Friday and Thursday do not exist, I would like to get the data from Wednesday, and so on.
This is what I currently have (this code allows me to obtain the last day of the month from my database):
/**Select Last Day Of Month**/
SELECT * FROM [mytable]
WHERE [date] IN (SELECT MAX([date])
FROM [mytable]
GROUP BY MONTH([Date]), YEAR([Date]))
I also understand that you can use the DATEPART function to get all the data for a particular day (i.e. Friday); the thing I'm not sure about is how to get a Thursday if Friday doesn't exist. I'm looking to grab all the data with this property, not just one particular week. Thanks!
As a clearer example:
Say I have four rows in my database->
1. 2016/8/19(Fri), red
2. 2016/8/18(Thu), blue
3. 2016/8/11(Thu), red
4. 2016/8/10(Wed), red
after the query is executed, I would like to have:
1. 2016/8/19(Fri), red
3. 2016/8/11(Thu), red
They are selected because each one is the "largest" (latest) date in its week.
See if the following query helps. It displays the last day of every week in the given input data.
declare @table table(
date datetime
)
insert into @table
values
('08/01/2016'),
('08/02/2016'),
('08/03/2016'),
('08/04/2016'),
('08/05/2016'),
('08/06/2016'),
('08/07/2016'),
('08/08/2016'),
('08/09/2016'),
('08/10/2016'),
('08/11/2016'),
('08/12/2016'),
('08/13/2016'),
('08/14/2016'),
('08/15/2016'),
('08/16/2016'),
('08/17/2016'),
('08/18/2016'),
('08/19/2016'),
('08/20/2016'),
('08/21/2016'),
('08/22/2016'),
('08/23/2016')
;with cte as
(
select date, row_number() over(partition by datepart(year,date),datepart(week,date) order by date desc) as rn from @table
)
select date,datename(weekday,date) from cte
where rn = 1
You can extract various weekparts and construct a suitable ROW_NUMBER() expression.
E.g. the following assigns the row number 1 to the latest day within each week:
ROW_NUMBER() OVER (
PARTITION BY DATEPART(year,[date]),DATEPART(week,[date])
ORDER BY DATEPART(weekday,[date]) DESC) as rn
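To see the "row number 1 per week" idea on the four rows from the question, here is a small Python sketch of the same grouping. It assumes ISO weeks (Monday to Sunday) via `date.isocalendar()`; in SQL Server, DATEPART(week, ...) depends on the DATEFIRST setting, so week boundaries may differ. The function name `latest_per_week` is just for illustration.

```python
from datetime import date

rows = [
    (date(2016, 8, 19), "red"),   # Fri
    (date(2016, 8, 18), "blue"),  # Thu
    (date(2016, 8, 11), "red"),   # Thu
    (date(2016, 8, 10), "red"),   # Wed
]


def latest_per_week(rows):
    """Keep, for each (ISO year, ISO week), the row with the latest date."""
    latest = {}
    for day, value in rows:
        iso_year, iso_week, _ = day.isocalendar()
        key = (iso_year, iso_week)
        if key not in latest or day > latest[key][0]:
            latest[key] = (day, value)
    return sorted(latest.values())


result = latest_per_week(rows)
for day, value in result:
    print(day.isoformat(), value)
```

This keeps 2016-08-11 and 2016-08-19, the two rows the question expects.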

Calculating a running count of Weeks

I am looking to calculate a running count of the weeks that have occurred since a starting point. The biggest problem here is that the calendar I am working on is not a traditional Gregorian calendar.
The easiest dimension to reference would be something like 'TWEEK' which actually tells you the week of the year that the record falls into.
Example data:
CREATE TABLE #foobar
( DateKey INT
,TWEEK INT
,CumWEEK INT
);
INSERT INTO #foobar (DateKey, TWEEK, CumWEEK)
VALUES(20150630, 1,1),
(20150701,1,1),
(20150702,1,1),
(20150703,1,1),
(20150704,1,1),
(20150705,1,1),
(20150706,1,1),
(20150707,2,2),
(20150708,2,2),
(20150709,2,2),
(20150710,2,2),
(20150711,2,2),
(20150712,2,2),
(20150713,2,2),
(20150714,1,3),
(20150715,1,3),
(20150716,1,3),
(20150717,1,3),
(20150718,1,3),
(20150719,1,3),
(20150720,1,3),
(20150721,2,4),
(20150722,2,4),
(20150723,2,4),
(20150724,2,4),
(20150725,2,4),
(20150726,2,4),
(20150727,2,4)
For sake of ease, I did not go all the way to 52, but you get the point. I am trying to recreate the 'CumWEEK' column. I have a column already that tells me the correct week of the year according to the weird calendar convention ('TWEEK').
I know this will involve some kind of OVER() windowing, but I cannot seem to figure it out.
The window function LAG(), combined with a running SUM over a flag that marks each change in TWEEK, should get you close enough to work with. The caveat is that a numeric offset in ROWS BETWEEN has to be an integer literal.
Year rollover: I guess you could create another ranking level based on mod 52 to start the count fresh, so 53 would become year 2, week 1, not 53.
SELECT
* ,
SUM(ChangedRow) OVER (ORDER BY DateKey ROWS BETWEEN 99999 PRECEDING AND CURRENT ROW)
FROM
(
SELECT
DateKey,
TWEEK,
ChangedRow=CASE WHEN LAG(TWEEK) OVER (ORDER BY DateKey) <> TWEEK THEN 1 ELSE 0 END
FROM
#foobar F2
)AS DETAIL
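Two details are worth checking against the sample data. LAG() returns NULL on the first row, so the flag there is 0 and the running sum starts at 0 rather than 1; adding 1 lines it up with CumWEEK. Also, UNBOUNDED PRECEDING avoids having to pick a large literal. Below is a minimal sketch of that variant, run through Python's sqlite3 module as a stand-in for SQL Server (the table is named `foobar` without the # because it is not a temp table here). The rows are rebuilt from the INSERT in the question.

```python
import sqlite3

# Rebuild the question's 28 rows: 7 days per week, TWEEK alternating 1, 2, 1, 2.
date_keys = [20150630] + list(range(20150701, 20150728))
sample = [
    (key, 1 if (i // 7) % 2 == 0 else 2, 1 + i // 7)
    for i, key in enumerate(date_keys)
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE foobar (DateKey INTEGER, TWEEK INTEGER, CumWEEK INTEGER)")
conn.executemany("INSERT INTO foobar VALUES (?, ?, ?)", sample)

# Inner query flags rows where TWEEK differs from the previous row;
# outer query turns the flags into a running count that starts at 1.
computed = conn.execute(
    """
    SELECT DateKey, TWEEK,
           1 + SUM(ChangedRow) OVER (
               ORDER BY DateKey
               ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
           ) AS CumWEEK
    FROM (
        SELECT DateKey, TWEEK,
               CASE WHEN LAG(TWEEK) OVER (ORDER BY DateKey) <> TWEEK
                    THEN 1 ELSE 0 END AS ChangedRow
        FROM foobar
    ) AS detail
    ORDER BY DateKey
    """
).fetchall()

print(computed == sample)
```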
Some minutes ago I answered a different question, in a way this is a similar question to
https://stackoverflow.com/a/31303395/5089204
The idea is roughly to create a table of running numbers and find the weeks by integer-dividing by 7. You could then use this as the grouping in an OVER clause...
EDIT: Example
CREATE FUNCTION dbo.RunningNumber(@Counter AS INT)
RETURNS TABLE
AS
RETURN
SELECT TOP (@Counter) ROW_NUMBER() OVER(ORDER BY o.object_id) AS RunningNumber
FROM sys.objects AS o; --take any large table here...
GO
SELECT 'test',CAST(numbers.RunningNumber/7 AS INT)
FROM dbo.RunningNumber(100) AS numbers
Dividing by 7 "as INT" offers a quite nice grouping criteria.
Hope this helps...

Display a rolling 12 weeks chart in SSRS report

I am calling the data query in ssrs like this:
SELECT * FROM [DATABASE].[dbo].[mytable]
So, the current week is the last week from the query (e.g. 3/31 - 4/4) and each number represents the week before until we have reached the 12 weeks prior to this week and display in a point chart.
How can I accomplish grouping all the visits for all locations by weeks and adding it to the chart?
I suggest updating your SQL query to Group by a descending Dense_Rank of DatePart(Week,ARRIVED_DATE). In this example, I have one column for Visits because I couldn't tell which columns you were using to get your Visit count:
-- load some test data
if object_id('tempdb..#MyTable') is not null
drop table #MyTable
create table #MyTable(ARRIVED_DATE datetime,Visits int)
while (select count(*) from #MyTable) < 1000
begin
insert into #MyTable values
(dateadd(day,round(rand()*100,0),'2014-01-01'),round(rand()*1000,0))
end
-- Sum Visits by WeekNumber relative to today's WeekNumber
select
dense_rank() over(order by datepart(week,ARRIVED_DATE) desc) [Week],
sum(Visits) Visits
from #MyTable
where datepart(week,ARRIVED_DATE) >= datepart(week,getdate()) - 11
group by datepart(week,ARRIVED_DATE)
order by datepart(week,ARRIVED_DATE)
Let me know if I can provide any more detail to help you out.
You are going to want to do the grouping of the visits within SQL. You should be able to add a calculated column to your table, something like WorkWeek, calculated from the difference in days from a fixed weekday such as Sunday. This column will then be your X value rather than the date field you were using.
Here is a good article that goes into first day of week: First Day of Week
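As a rough illustration of the days-difference idea, here is a Python sketch that buckets visits by how many weeks ago they happened, so week 0 is the current week and week 11 is the oldest of the 12. Because it works from date differences rather than week-of-year numbers, it is not affected by a year boundary. It assumes weeks start on Monday; the function names and the visit rows are hypothetical.

```python
from collections import Counter
from datetime import date, timedelta


def week_start(day):
    """Monday of the week containing day (assumed week convention)."""
    return day - timedelta(days=day.weekday())


def rolling_weeks(visits, today, n_weeks=12):
    """Sum visit counts per 'weeks ago' bucket for the last n_weeks weeks."""
    current = week_start(today)
    totals = Counter()
    for arrived, count in visits:
        weeks_ago = (current - week_start(arrived)).days // 7
        if 0 <= weeks_ago < n_weeks:
            totals[weeks_ago] += count
    # Emit every bucket, including empty weeks, so the chart has 12 points.
    return [(week, totals.get(week, 0)) for week in range(n_weeks)]


visits = [  # (ARRIVED_DATE, Visits), made-up values
    (date(2014, 3, 31), 5),
    (date(2014, 4, 2), 3),
    (date(2014, 3, 28), 7),
    (date(2014, 1, 13), 2),
    (date(2013, 12, 30), 4),  # more than 12 weeks back: dropped
]

chart = rolling_weeks(visits, today=date(2014, 4, 4))
for week, total in chart:
    print(week, total)
```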