Z-Score in SQL based on last 1 year

I have daily data structured in the below format. Please note this is just a subset of the data and I had to make some modifications to be able to share it.
The first column is [DataValue], for which I need to find the Z-score by [IndexValue], [Qualifier], [QualifierCode], and [QualifierType]. I also have a [Date] column.
I essentially need the Z-score for each data point within each [IndexValue], [Qualifier], [QualifierCode], and [QualifierType] group. The main point of focus here is that I have data for the last 3 years, but to calculate the Z-score I only want the average and standard deviation over the last one year.
Z-Score = ([DataValue] - (Avg over last 1 year)) / (Std Dev over last 1 year)
I am struggling with how to get average for the last one year. Would anybody be able to help me with this?
SELECT [IndexValue]
      ,[Qualifier]
      ,[QualifierCode]
      ,[QualifierType]
      ,[Date]
      ,[Month]
      ,[Year]
      ,[Z-Score] = ([DataValue] - ROUND(AVG([DataValue]), 3)) / ROUND(STDEV([DataValue]), 3)
FROM [TABLEA]
GROUP BY [IndexValue]
        ,[Qualifier]
        ,[QualifierCode]
        ,[QualifierType]
        ,[Date]
        ,[Month]
        ,[Year]
ORDER BY [IndexValue]
        ,[Qualifier]
        ,[QualifierCode]
        ,[QualifierType]
        ,[Date] DESC
Sample data: https://i.stack.imgur.com/pqhJD.png

You need window functions for this:
SELECT a.*,
       ( (DataValue - AVG(DataValue) OVER ()) /
         STDEV(DataValue) OVER ()
       ) AS z_score
FROM [TABLEA] a;
Note: if DataValue is an integer, you will need to convert it to a numeric value with decimal places, because the average of an integer column is itself truncated to an integer:
SELECT a.*,
       ( (DataValue - AVG(DataValue * 1.0) OVER ()) /
         STDEV(DataValue) OVER ()
       ) AS z_score
FROM [TABLEA] a;
Rounding for the calculation seems to be way off base, unless your intention is to produce a z-like score that isn't really a z-score.
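The question also asks for the statistics per group and restricted to the trailing year, which OVER () alone does not give you. Here is a sketch of one way to do that, assuming SQL Server and that "last one year" means the year ending today (GETDATE()); the NULLIF guards against a zero standard deviation:
SELECT t.*,
       (t.[DataValue] - s.avg_1y) / NULLIF(s.stdev_1y, 0) AS z_score
FROM [TABLEA] t
JOIN (
      -- average and standard deviation per group, over the trailing year only
      SELECT [IndexValue], [Qualifier], [QualifierCode], [QualifierType],
             AVG([DataValue] * 1.0) AS avg_1y,
             STDEV([DataValue])     AS stdev_1y
      FROM [TABLEA]
      WHERE [Date] >= DATEADD(YEAR, -1, GETDATE())  -- assumed meaning of "last 1 year"
      GROUP BY [IndexValue], [Qualifier], [QualifierCode], [QualifierType]
     ) s
  ON  s.[IndexValue]    = t.[IndexValue]
  AND s.[Qualifier]     = t.[Qualifier]
  AND s.[QualifierCode] = t.[QualifierCode]
  AND s.[QualifierType] = t.[QualifierType];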

Related

Find the difference of rows within a column

I want to create a new table where the difference in weight is displayed as weight_diff. For example, on the first day the difference is 0; from the second day onward, for the same id, the weight difference should show as +.. for a gain and -.. for a loss.
You seem to want lag():
select t.*,
       (weight -
        lag(weight, 1, weight) over (partition by id order by date)
       ) as weight_diff
from t;
Your image is really hard to read so I just used the names given in the description.
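For illustration, a minimal runnable sketch with made-up rows; the third argument of lag() supplies the default used on the first row of each id, which is what makes the first day's difference 0:
-- hypothetical sample data
with t (id, date, weight) as (
    select 1, '2024-01-01', 80.0 union all
    select 1, '2024-01-02', 81.5 union all  -- expect weight_diff = +1.5
    select 1, '2024-01-03', 80.5            -- expect weight_diff = -1.0
)
select t.*,
       (weight -
        lag(weight, 1, weight) over (partition by id order by date)
       ) as weight_diff
from t;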

Adding grouping in framing clause window while creating partitions

Using the dataset hosted on Google (MLB data) as an example, here is what I am trying to accomplish: obtain the last 3 weeks' score run for a given venue.
My aggregated dataset looks like this without the strikes_3wk column -
Logic for strikes_3wk column is to partition the aggregated dataset by venueName, order by YearWeek column and then obtain the last 3 weeks aggregated strikes data.
Here is the query I have written so far. I see that the windowing function is where I need to modify the logic. So, is there a way to add grouping within the windowing function? Is there any alternative way of doing this?
In the image I added a new column 'expected', showing values for two weeks.
select inr.*
,sum(inr.strikes) over (Venue_Week rows between current row and 2 following) as strikes_3wk
from
(
select seasonType
,gameStatus
,homeTeamName
,awayTeamName
,venueName
,CAST(
CONCAT(
CAST(EXTRACT(YEAR FROM createdAt) as string)
,CAST(EXTRACT(WEEK(Monday) FROM createdAt) as string)
) as INT64)
as YearWeek
,sum(homeFinalRuns) as homeFinalRuns
,sum(strikes) as strikes
from `bigquery-public-data.baseball.games_wide`
where createdAt is not null
group by seasonType
,gameStatus
,homeTeamName
,awayTeamName
,venueName
,YearWeek
)inr
window Venue_Week as (
partition by inr.venueName
order by inr.YearWeek desc
)
So you are looking for strikes per venue regardless of who did them, right?
Maybe something like:
SELECT INR.*, STATS.strikes_3wk
FROM `bigquery-public-data.baseball.games_wide` INR
LEFT JOIN (
    SELECT venueName, SUM(strikes) AS strikes_3wk
    FROM `bigquery-public-data.baseball.games_wide` INR2
    WHERE YearWeek IN (
        SELECT TOP 3 YearWeek
        FROM `bigquery-public-data.baseball.games_wide`
        WHERE venueName = INR2.venueName
        ORDER BY YearWeek DESC
    )
    GROUP BY venueName
) STATS
    ON INR.venueName = STATS.venueName
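Note that TOP is T-SQL rather than BigQuery, and the correlated IN subquery may not run there either, so treat the above as pseudocode. Under the same reading of the question (strikes summed per venue and week, regardless of teams), here is a window-function sketch in BigQuery itself; the YearWeek expression is copied from the question, and the venue_week name is mine:
WITH venue_week AS (
  SELECT venueName,
         CAST(CONCAT(
             CAST(EXTRACT(YEAR FROM createdAt) AS STRING),
             CAST(EXTRACT(WEEK(MONDAY) FROM createdAt) AS STRING)
         ) AS INT64) AS YearWeek,  -- the question's expression; the week number would need zero-padding to sort reliably
         SUM(strikes) AS strikes
  FROM `bigquery-public-data.baseball.games_wide`
  WHERE createdAt IS NOT NULL
  GROUP BY venueName, YearWeek
)
SELECT venueName,
       YearWeek,
       strikes,
       SUM(strikes) OVER (
           PARTITION BY venueName
           ORDER BY YearWeek
           ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
       ) AS strikes_3wk
FROM venue_week
ORDER BY venueName, YearWeek DESC;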

DAX running total based on 3 columns, one of which is a repeating integer running total

Very new to DAX/PowerPivot, and faced with a devilishly tricky question on day one.
I have some data (90,000 rows) I'm trying to use to calculate a cumulative fatigue score for folks working shifts (using PowerPivot/Excel 2016). As per the below screenshot, the dataset is shift data for multiple employees, with a cumulative count of days worked vs. days off that resets back to 1 whenever they switch from one state to the other, and a 'Score' column that in my production data contains a measure of how fatigued they are.
I would like to cumulatively sum that fatigue score, and reset it whenever they move between the 'Days worked' and 'Days off' states. My desired output is in the 'Desired' column far right, and I've used green highlighting to show days worked vs. days off as well as put a bold border around separate Emp_ID blocks to help demonstrate the data.
There is some similarity between my question and the SO post at DAX running total (or count) across 2 groups except that one of my columns (i.e. the Cumulative Days one) is in a repeating sequence from 1 to x. And Javier Guillén's post would probably make a good starting point if I'd had a couple of months of DAX under my belt, rather than the couple of hours I've gained today.
I can barely begin to conceptualize what the DAX would need to look like, given I'm a DAX newbie (my background is VBA, SQL, and Excel formulas). But lest someone berate me for not even providing a starting point, I tried to tweak the following DAX without really having a clue what I was doing:
Cumulative :=
CALCULATE (
    SUM ( Shifts[Score] ),
    FILTER ( Shifts, Shifts[Cumulative Days] <= VALUES ( Shifts[Cumulative Days] ) ),
    ALLEXCEPT ( Shifts, Shifts[Workday], Shifts[EMP_ID] )
)
Now I'll be the first to admit that this code is the DAX equivalent of the Infinite Monkey Theorem. And alas, I have no bananas today, and my only hope is that someone finds this problem suitably a-peeling.
The problem with this table is that there is no way to determine when to stop summing while performing the cumulative total.
I think one way to achieve it is to calculate, for each row, the first date at which the continuous workday status changes.
For example, the workday status in the first three rows for EMP_ID 70073 is the same until the fourth row, date 04-May, which is the date the workday status changes. My idea is to create a calculated column that finds the status-change date for each workday series. That column lets us implement the cumulative sum.
Below is the expression for the calculated column I named Helper.
Helper =
IF (
ISBLANK (
CALCULATE (
MIN ( [Date] ),
FILTER (
'Shifts',
'Shifts'[EMP_ID] = EARLIER ( 'Shifts'[EMP_ID] )
&& 'Shifts'[Workday] <> EARLIER ( 'Shifts'[Workday] )
&& [Date] > EARLIER ( 'Shifts'[Date] )
)
)
),
CALCULATE (
MAX ( [Date] ),
FILTER (
Shifts,
Shifts[Date] >= EARLIER ( Shifts[Date] )
&& Shifts[EMP_ID] = EARLIER ( Shifts[EMP_ID] )
)
)
+ 1,
CALCULATE (
MIN ( [Date] ),
FILTER (
'Shifts',
'Shifts'[EMP_ID] = EARLIER ( 'Shifts'[EMP_ID] )
&& 'Shifts'[Workday] <> EARLIER ( 'Shifts'[Workday] )
&& [Date] > EARLIER ( 'Shifts'[Date] )
)
)
)
In short, the expression says: if the search for the date where the current workday series changes returns blank, use the last date for that EMP_ID plus one day.
Note there is no way to calculate the change date for the last workday series (in this case the 08-May rows), so if the calculation returns blank it means it is being evaluated on the last series, and the expression should return the max date for that EMP_ID plus one day.
Once the calculated column is in the table you can use the following expression to create a measure for the cumulative value:
Cumulative Score =
CALCULATE (
SUM ( 'Shifts'[Score] ),
FILTER ( ALL ( 'Shifts'[Helper] ), [Helper] = MAX ( [Helper] ) ),
FILTER ( ALL ( 'Shifts'[Date] ), [Date] <= MAX ( [Date] ) )
)
In a table in Power BI (I have no access to PowerPivot for at least eight hours) the result is this:
I think there is an easier solution; my first thought was using a variable, but variables are only supported from the 2015 version of the DAX language, and it is quite possible you are not using Excel 2016.
UPDATE: Leaving only one FILTER in the measure calculation. FILTER is an iterator over the entire table, so using a single FILTER with combined logical conditions could be more performant.
Cumulative Score =
CALCULATE (
SUM ( 'Shifts'[Score] ),
FILTER (
ALL ( 'Shifts'[Helper], Shifts[Date] ),
[Helper] = MAX ( [Helper] )
&& [Date] <= MAX ( [Date] )
)
)
UPDATE 2: Solution for pivot tables (matrix), since the previous expression worked only for a tabular visualization. The measure expression was also optimized to use only one FILTER.
This should be the final expression for pivot table:
Cumulative Score =
CALCULATE (
SUM ( 'Shifts'[Score] ),
FILTER (
ALLSELECTED ( Shifts ),
[Helper] = MAX ( [Helper] )
&& [EMP_ID] = MAX ( Shifts[EMP_ID] )
&& [Date] <= MAX ( Shifts[Date] )
)
)
Note: If you want to ignore filters, use ALL instead of ALLSELECTED.
Results in Power BI Matrix:
Results in PowerPivot Pivot Table:
Let me know if this helps.

How to use window functions to get metrics for today, last 7 days, last 30 days for each value of the date?

My problem seems simple on paper:
For a given date, give me active users for that date, active users in given_Date()-7, and active users in given_Date()-30.
Sample data:
"timestamp" "user_public_id"
"23-Sep-15" "805a47023fa611e58ebb22000b680490"
"28-Sep-15" "d842b5bc5b1711e5a84322000b680490"
"01-Oct-15" "ac6b5f70b95911e0ac5312313d06dad5"
"21-Oct-15" "8c3e91e2749f11e296bb12313d086540"
"29-Nov-15" "b144298810ee11e4a3091231390eb251"
For 01-Oct the count for today would be 1, last_7_days would be 3, and last_30_days would be 3+n (where n is the count of the user_ids that fall in dates preceding Oct 1st within a 30-day window).
I am on Amazon Redshift. Can somebody provide a sample SQL to help me get started?
The output should look like this:
"timestamp" "users_today", "users_last_7_days", "users_30_days"
"01-Oct-15" 1 3 (3+n)
I know asking for help/posting incomplete solutions is frowned upon, but this is not getting any other attention so I thought I would do my bit.
I have been pulling my hair out trying to nut this one out; alas, I am a beginner and something is not clicking for me. Perhaps yourself or others will be able to drastically improve my answer, but I think I am on the right track.
SELECT replace(convert(varchar, [timestamp], 111), '/', '-') AS [timestamp], -- to get the date in the required format
       (SELECT COUNT([TIMESTAMP]) FROM #SIMPLE WHERE ([TIMESTAMP]) = ([timestamp])) AS users_today,
       (SELECT COUNT([TIMESTAMP]) FROM #SIMPLE WHERE [TIMESTAMP] BETWEEN DATEADD(DY, -7, [TIMESTAMP]) AND [TIMESTAMP]) AS users_last_7_days,
       (SELECT COUNT([TIMESTAMP]) FROM #SIMPLE WHERE [TIMESTAMP] BETWEEN DATEADD(DY, -30, [TIMESTAMP]) AND [timestamp]) AS users_last_30_days
FROM #SIMPLE
GROUP BY [timestamp]
Starting with this:
CREATE TABLE #SIMPLE (
[timestamp] datetime, user_public_id varchar(32)
)
INSERT INTO #SIMPLE
VALUES('23-Sep-15','805a47023fa611e58ebb22000b680490'),
('28-Sep-15','d842b5bc5b1711e5a84322000b680490'),
('01-Oct-15','ac6b5f70b95911e0ac5312313d06dad5'),
('21-Oct-15','8c3e91e2749f11e296bb12313d086540'),
('29-Nov-15','b144298810ee11e4a3091231390eb251')
The problem I am having is that each row contains the same counts, despite my grouping by [timestamp].
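My guess at the cause (an assumption, not something the poster confirmed): without table aliases, every [TIMESTAMP] inside the subqueries resolves to the subquery's own copy of #SIMPLE, so each WHERE compares a column with itself and every row gets the full count. Aliasing the outer table makes the subqueries actually correlate, staying in the T-SQL dialect of the attempt:
SELECT replace(convert(varchar, s.[timestamp], 111), '/', '-') AS [timestamp],
       (SELECT COUNT(*) FROM #SIMPLE i
        WHERE i.[timestamp] = s.[timestamp]) AS users_today,
       (SELECT COUNT(*) FROM #SIMPLE i
        WHERE i.[timestamp] BETWEEN DATEADD(DY, -7, s.[timestamp]) AND s.[timestamp]) AS users_last_7_days,
       (SELECT COUNT(*) FROM #SIMPLE i
        WHERE i.[timestamp] BETWEEN DATEADD(DY, -30, s.[timestamp]) AND s.[timestamp]) AS users_last_30_days
FROM #SIMPLE s
GROUP BY s.[timestamp];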
Step 1 -- Create a table which has daily counts.
create temp table daily_mobile_Sessions as
select "timestamp",
       count(user_public_id) over (partition by "timestamp") as "today"
from mobile_sessions
group by 1, mobile_sessions.user_public_id
order by 1 desc
Step 2 -- From the table above, we create yet another query that uses the "today" field and applies window functions to sum the counts.
select "timestamp",
       today,
       sum(today) over (order by "timestamp" rows between 6 preceding and current row) as "last_7days",
       sum(today) over (order by "timestamp" rows between 29 preceding and current row) as "last_30days"
from daily_mobile_Sessions
group by "timestamp", 2
order by 1 desc
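One caveat: the ROWS frame counts rows, not calendar days, so the 7- and 30-day sums drift whenever dates are missing from the table. A gap-safe sketch using a self-join on date arithmetic instead (it assumes "timestamp" casts to a date; daily and the dt alias are names I made up):
WITH daily AS (
    SELECT "timestamp"::date AS dt,
           COUNT(DISTINCT user_public_id) AS today
    FROM mobile_sessions
    GROUP BY 1
)
SELECT d.dt,
       d.today,
       -- each day joins to the 30 days ending on it; the CASE narrows to the last 7
       SUM(CASE WHEN w.dt >= d.dt - 6 THEN w.today ELSE 0 END) AS last_7days,
       SUM(w.today) AS last_30days
FROM daily d
JOIN daily w
  ON w.dt BETWEEN d.dt - 29 AND d.dt
GROUP BY d.dt, d.today
ORDER BY d.dt DESC;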

Find closest date in SQL Server

I have a table dbo.X with DateTime column Y which may have hundreds of records.
My stored procedure has a parameter @CurrentDate; I want to find the date in column Y of table dbo.X which is less than and closest to @CurrentDate.
How to find it?
The WHERE clause matches all rows with a date less than @CurrentDate and, since they are ordered descending, TOP 1 will be the closest date to the current date.
SELECT TOP 1 *
FROM x
WHERE x.date < @CurrentDate
ORDER BY x.date DESC
Use DATEDIFF and order your result by how many days or seconds lie between each date and the input.
Something like this:
select top 1 rowId, dateCol, datediff(second, @CurrentDate, dateCol) as SecondsBetweenDates
from myTable
where dateCol < @CurrentDate
order by datediff(second, @CurrentDate, dateCol) desc -- the diffs are negative here, so the closest date has the largest value
I have a better solution for this problem, I think.
I will show a few images to support and explain the final solution.
Background
In my solution I have a table of FX rates. These represent market rates for different currencies. However, our service provider has had a problem with the rate feed, and as such some rates have zero values. I want to fill the missing data with the rate for the same currency that is closest in time to the missing rate. Basically, I want to get the RateId of the nearest non-zero rate, which I will then substitute. (The substitution is not shown here in my example.)
1) So to start off, let's identify the missing rates information:
Query showing my missing rates, i.e. those with a rate value of zero
2) Next, let's identify the rates that are not missing:
Query showing rates that are not missing
3) This query is where the magic happens. I have made an assumption here which can be removed, but it was added to improve the efficiency/performance of the query: the join on cast(importDate as date) expects to find a substitute transaction on the same day as that of the missing/zero transaction.
The magic is the ROW_NUMBER function: it assigns 1 to the non-missing transaction with the shortest time difference from the missing one; the next closest transaction gets RowNum 2, and so on.
Please note that I must also join on currency so that I do not mismatch the currency types. That is, I don't want to substitute an AUD rate with CHF values; I want the closest match within the same currency.
Combining the two data sets with a row_number to identify the nearest transaction
4) Finally, let's get the rows where RowNum is 1:
The final query
The full query is as follows:
; with cte_zero_rates as
(
    select *
    from fxrates
    where (spot_exp = 0 or spot_imp = 0)    -- assuming the second column is spot_imp, which is selected below
),
cte_non_zero_rates as
(
    select *
    from fxrates
    where (spot_exp > 0 and spot_imp > 0)
),
cte_Nearest_Transaction as
(
    select z.FXRatesID as Zero_FXRatesID
          ,z.importDate as Zero_importDate
          ,z.currency as Zero_Currency
          ,nz.currency as NonZero_Currency
          ,nz.FXRatesID as NonZero_FXRatesID
          ,nz.spot_imp
          ,nz.importDate as NonZero_importDate
          ,DATEDIFF(ss, z.importDate, nz.importDate) as TimeDifference
          ,ROW_NUMBER() over (partition by z.FXRatesID
                              order by abs(DATEDIFF(ss, z.importDate, nz.importDate)) asc) as RowNum
    from cte_zero_rates z
    left join cte_non_zero_rates nz on nz.currency = z.currency
        and cast(nz.importDate as date) = cast(z.importDate as date)
    --order by z.currency desc, z.importDate desc
)
select n.Zero_FXRatesID
      ,n.Zero_Currency
      ,n.Zero_importDate
      ,n.NonZero_importDate
      ,DATEDIFF(s, n.NonZero_importDate, n.Zero_importDate) as Delay_In_Seconds
      ,n.NonZero_Currency
      ,n.NonZero_FXRatesID
from cte_Nearest_Transaction n
where n.RowNum = 1
  and n.NonZero_FXRatesID is not null
order by n.Zero_Currency, n.NonZero_importDate