Calculate STDEV over a variable range in SQL Server

Table format is as follows:
Date      ID   subID  value
-----------------------------
7/1/1996  100  1       .0543
7/1/1996  100  2       .0023
7/1/1996  200  1      -.0410
8/1/1996  100  1      -.0230
8/1/1996  200  1       .0121
I'd like to apply STDEV to the value column where date falls within a specified range, grouping on the ID column.
Desired output would look something like this:
DateRange  ID   std_v
1          100  .0232
2          100  .0323
1          200  .0423
One idea I've had that works, but is clunky, involves creating an additional column (which I've called 'partition') to identify a 'group' of values over which STDEV is taken, by using the OVER clause with PARTITION BY applied to the 'partition' and 'ID' columns.
Creating the partition column involves a prior CASE statement in which a given record is assigned a partition based on its date falling within a given range (i.e.,
...
, partition = CASE
WHEN date BETWEEN '7/1/1996' AND '10/1/1996' THEN 1
WHEN date BETWEEN '10/1/1996' AND '1/1/1997' THEN 2
...
Ideally, I'd be able to apply STDEV with the OVER clause, partitioning on the ID column and on variable date ranges (e.g., trailing 3 months from a given reference date). Once this works for the 3-month period described above, I'd like to be able to make the date range variable, creating an additional '#dateRange' variable at the start of the program so I can run this for 2-, 3-, 6-, etc. month ranges.

I ended up coming upon a solution to my question.
You can join the original table to a second table, consisting of a unique list of the dates in the first table, applying a BETWEEN clause to specify desired range.
Sample query below.
Initial table (#excessRet), with columns:
Date, ID, subID, value
Second table (#dates), a unique list of the dates in the first, with column:
Date
select d.date, er.id, STDEV(er.value) as std_v
from #dates d
inner join #excessRet er
    on er.date between DATEADD(m, -36, d.date) and d.date
group by d.date, er.id
order by er.id, d.date
To achieve the desired next step referenced above (making the range variable), simply declare a variable at the outset and replace "36" with it.
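For example, a minimal sketch of that last step (the variable name @monthsBack is my choice, not from the original):
-- Parameterize the trailing window in months; set to 2, 3, 6, etc.
DECLARE @monthsBack int = 3;

select d.date, er.id, STDEV(er.value) as std_v
from #dates d
inner join #excessRet er
    on er.date between DATEADD(month, -@monthsBack, d.date) and d.date
group by d.date, er.id
order by er.id, d.date;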

Related

Adjust recursive SQL query to exclude holidays and weekends

I have a dataset like this called data_per_day:
instructional_day  points
2023-01-24         2
2023-01-23         2
2023-01-20         1
2023-01-19         0
and so on. The table shows weekdays (days minus holidays and weekends) and the number of points someone has earned: 1 is the start of a streak, 0 is the end of a streak, and 2 is max points after a streak has started.
I need to find how long the latest streak is, so in this case the result should be 3.
I created a recursive CTE, but the query returns 2 as the streak count because I'm using a lag mechanism based on calendar days. Instead I need to adjust it so that instructional days are used rather than all dates.
WITH RECURSIVE cte AS (
SELECT
student_unique_id,
instructional_day,
points,
1 AS cnt
FROM
`data_per_day`
WHERE
instructional_day = DATE_ADD(CURRENT_DATE('America/Chicago'), INTERVAL -1 DAY)
UNION ALL
SELECT
a.student_unique_id,
a.instructional_day,
a.points,
c.cnt+1
FROM (
SELECT
*
FROM
`data_per_day`
WHERE
points > 0 ) a
INNER JOIN
cte c
ON
a.student_unique_id = c.student_unique_id
AND a.instructional_day = c.instructional_day - INTERVAL '1' day )
SELECT
student_unique_id,
MAX(cnt) AS streak
FROM
cte
WHERE
student_unique_id = "419"
GROUP BY
student_unique_id
How do I adjust the query?
This is not a trivial coding exercise, so I won't actually write the code and provide it.
What you have here is a gaps and islands question. You want to identify the largest "island" of days with points within a date range. Depending upon what dates are contained in your data, you may need to generate a list of sequential dates that meet your criteria.
One problem I see is that you are trying to combine the steps to generate the date range (the recursive CTE) with the points. You'll need to separate those steps.
1. Define the date range.
2. Generate the dates within the range.
3. Filter the dates with isweekday = 'no' and isholiday = 'no'. You will probably want to add a row number during this step.
4. [left] join the dates to your data, including coalesce(points, 0).
5. Filter the data to points > 0.
6. Identify the islands.
7. Identify the largest island per student.
A rough sketch of these steps is below.
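As a rough sketch only (the answer deliberately stops short of full code): steps 2-7 as a gaps-and-islands query for a single student. The calendar table and its cal_date/isweekday/isholiday columns are assumed names; adapt them to your schema.
WITH instructional_days AS (
  -- steps 2-3: keep only instructional days up to yesterday and number them,
  -- so consecutive instructional days differ by exactly 1 in rn
  SELECT cal_date, ROW_NUMBER() OVER (ORDER BY cal_date) AS rn
  FROM calendar            -- assumed calendar table
  WHERE isweekday = 'no'   -- assumed flags, per the steps above
    AND isholiday = 'no'
    AND cal_date <= DATE_ADD(CURRENT_DATE('America/Chicago'), INTERVAL -1 DAY)
),
scored AS (
  -- step 4: left join the points onto the instructional days
  SELECT d.rn, COALESCE(p.points, 0) AS points
  FROM instructional_days d
  LEFT JOIN `data_per_day` p ON p.instructional_day = d.cal_date
),
islands AS (
  -- steps 5-6: rows with points, grouped so that consecutive rn values
  -- share the same grp (the row_number-difference trick)
  SELECT rn, rn - ROW_NUMBER() OVER (ORDER BY rn) AS grp
  FROM scored
  WHERE points > 0
)
-- step 7: the island containing the most recent instructional day
SELECT COUNT(*) AS streak
FROM islands
GROUP BY grp
ORDER BY MAX(rn) DESC
LIMIT 1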

Start date and end date assigning based on date ranges and value change

I have two tables, temp_N and temp_C. The table script and data are given below. I am using Teradata.
Table Script and data
The first image is temp_N and the second one is temp_C.
Now I will try to explain my requirement. The key column for these two tables is 'nbr'. The two tables contain all the changes for a particular period of time (this is sample data; both tables get loaded daily based on the updates). I need to merge these two tables into one table with the date ranges assigned correctly. The expected result is given below. To explain the logic behind it: in the first row, fstrtdate is the least date from the two tables, which is 2022-01-31, and the end date for that row is 2022-07-10 because there is a change in cpnrate on 2022-07-11. The second row starts with 2022-07-11, giving the changed cpnrate. In the third row there is a change in ntr on 2022-08-31 and the data is updated accordingly. Please note these are all date fields; there won't be any timestamps, so please ignore the timestamps in the screenshots.
Now I would like to know: is it possible to achieve this in SQL, and if so, how?
You can combine all the changes into a single table and order by effective start date (fstrtdate). Then you can compute the effective end date as the day prior to the next change, and where one of the data values is NULL use LAG ... IGNORE NULLS to "bring down" the desired previous non-NULL value:
SELECT nbr, fstrtdate,
       PRIOR(LEAD(fstrtdate) OVER (PARTITION BY nbr ORDER BY fstrtdate)) AS fenddate,
       COALESCE(ntr, LAG(ntr) IGNORE NULLS OVER (PARTITION BY nbr ORDER BY fstrtdate)) AS ntr,
       COALESCE(cpnrate, LAG(cpnrate) IGNORE NULLS OVER (PARTITION BY nbr ORDER BY fstrtdate)) AS cpnrate
FROM (
    SELECT nbr, fstrtdate, MAX(ntr) AS ntr, MAX(cpnrate) AS cpnrate
    FROM (
        SELECT nbr, fstrtdate, ntr, NULL (DECIMAL(9,2)) AS cpnrate
        FROM temp_n
        UNION ALL
        SELECT nbr, fstrtdate, NULL (DECIMAL(9,2)) AS ntr, cpnrate
        FROM temp_c
    ) AS COMBINED
    GROUP BY 1, 2
) AS UNIQUESTART
ORDER BY fstrtdate;
The innermost SELECTs make the structure the same for data from both tables with NULLs for the values that come from the other table, so we can do a UNION to form one COMBINED derived table with rows for both types of change events. Note that you should explicitly assign datatype for those added NULL columns to match the datatype for the corresponding column in the other table; I somewhat arbitrarily chose DECIMAL(9,2) above since I didn't know the real data types. They can't be INT as in the example, though, since that would truncate the decimal part. There's no reason to carry along the original fenddate; a new non-overlapping fenddate will be computed in the outer query.
The intermediate GROUP BY is only to combine what would otherwise be two rows in the special case where both ntr and cpnrate changed on the same day for the same nbr. That case is not present in the example data - the dates are already unique - but it might be necessary to do this when processing the full table. The syntax requires an aggregate function, but there should be at most two rows for a (nbr, fstrtdate) group; and when there are two rows, in each of the other columns one row has NULL and the other row does not. In that case either MIN or MAX will return the non-NULL value.
In the outer query, the COALESCEs will return the value for that column from the current row in the UNIQUESTART derived table if it's not NULL; otherwise LAG is used to obtain the value from a previous row.
The first two rows in the result won't match the screenshot above but they do accurately reflect the data provided - specifically, the example does not identify a cpnrate for any date prior to 2022-05-11.
nbr  fstrtdate   fenddate    ntr         cpnrate
233  2022-01-31  2022-05-10  311,000.00  NULL
233  2022-05-11  2022-07-10  311,000.00  3.31
...

Single column values into multiple columns in hive

I have a table which updates on a weekly basis, and I need to check the count variation between one week's and the previous week's values. I just did the below:
Select
    case when F.wk_end_d = max(F.wk_end_d) over (partition by F.wk_end_d)
         then F.the_count end as count
from (
    select wk_end_d, count(*) as the_count
    from table A
    where wk_end_d between date_sub('2019-03-02', 7) and '2019-03-02'
    group by wk_end_d
) F
which gives me values like below:
100
200
but I need to get the values 100 and 200 in 2 different columns, as I need to build some other calculations on top of them.
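One common way to get this (a sketch, not from the original thread; the column names prev_week_count and curr_week_count are my choices) is conditional aggregation, which pivots the two weekly counts into columns:
select
    max(case when f.wk_end_d = date_sub('2019-03-02', 7) then f.the_count end) as prev_week_count,
    max(case when f.wk_end_d = '2019-03-02' then f.the_count end) as curr_week_count
from (
    select wk_end_d, count(*) as the_count
    from table A  -- same source table as above
    where wk_end_d between date_sub('2019-03-02', 7) and '2019-03-02'
    group by wk_end_d
) f;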

Redshift: Find MAX in list disregarding non-incremental numbers

I work for a sports film analysis company. We have teams with unique team IDs and I would like to find the number of consecutive weeks they have uploaded film to our site moving backwards from today. Each upload also has its own row in a separate table that I can join on teamid and has a unique date of when it was uploaded. So far I put together a simple query that pulls each unique DATEDIFF(week) value and groups on teamid.
Select teamid, MAX(weekdiff)
from (
    Select teamid, DATEDIFF(week, dateuploaded, GETDATE()) as weekdiff
    from leroy_events
    group by teamid, weekdiff
) w
group by teamid
What I am given is a list of teamIDs and unique weekly date differences. I would like to then find the max for each teamID without breaking an increment of 1. For example, if my data set is:
Team   datediff
11453  0
11453  1
11453  2
11453  5
11453  7
11453  13
I would like the max value for team: 11453 to be 2.
Any ideas would be awesome.
I have simplified your example by assuming a table that already has a weekdiff column; that column is what your DATEDIFF expression computes.
First, I'm using the LAG() window function to assign the previous value (in the ordered set) of weekdiff to the current row.
Then, using a WHERE condition, I'm retrieving the max(weekdiff) among rows whose previous value is the current value minus 1, i.e., consecutive weekdiffs.
Data:
create table leroy_events ( teamid int, weekdiff int);
insert into leroy_events values (11453,0),(11453,1),(11453,2),(11453,5),(11453,7),(11453,13);
Code:
WITH initial_data AS (
Select
teamid,
weekdiff,
lag(weekdiff,1) over (partition by teamid order by weekdiff) as lag_weekdiff
from
leroy_events
)
SELECT
teamid,
max(weekdiff) AS max_weekdiff_consecutive
FROM
initial_data
WHERE weekdiff = lag_weekdiff + 1 -- this ensures retrieving max() without breaking your consecutive increment
GROUP BY 1
Result:
teamid  max_weekdiff_consecutive
11453   2
You can use SQL window functions to probe relationships between rows of the table. In this case the lag() function can be used to look at the previous row relative to a given order and grouping. That way you can determine whether a given row is part of a group of consecutive rows.
You still need overall to aggregate or filter to reduce the number of rows for each group of interest (i.e. each team) to 1. It's convenient in this case to aggregate. Overall, it might look like this:
select
team,
case min(datediff)
when 0 then max(datediff)
else -1
end as max_weeks
from (
select
team,
datediff,
case
when (lag(datediff) over (partition by team order by datediff) != datediff - 1)
then 0
else 1
end as is_consec
from diffs
) cd
where is_consec = 1
group by team
The inline view just adds an is_consec column to the data, marking whether each row is part of a group of consecutive rows. The outer query filters on that column (you cannot filter directly on a window function), and chooses the maximum datediff from the remaining rows for each team.
There are a few subtleties there:
The case expression in the inline view is written as it is to exploit the fact that the lag() computed for the first row of each partition is NULL, and NULL compares neither unequal nor equal to any value. Thus the first row in each partition is always marked consecutive.
The case testing min(datediff) in the outer select clause picks up teams that have no record with datediff = 0, and assigns -1 to column max_weeks for them.
It would also have been possible to mark rows non-consecutive if the first in their group did not have datediff = 0, but then you would lose such teams from the results altogether.

Adjust date column for change over time

This is an easy enough problem, but I'm wondering if anyone can provide a more elegant solution.
I've got a table that consists of a date column (month-end dates over time) and several value columns--say, the price of a variety of stocks over time, one column for each stock. I'd like to calculate the change in each value column for each period represented in the date column (e.g., a monthly return from a table filled with month-end prices).
My current plan is to join the table to itself and simply create a new column for the return as ret = b.price/a.price - 1. Code as follows:
select b.Date, Ret = (b.stock1/a.stock1 - 1)
from #temp a, #temp b
where datediff(day, a.Date, b.Date) between 25 and 35
order by a.Date
This works fine, BUT:
(1) I need to do this for, say, dozens of stocks--is there a good way to replicate the calculation without copying and pasting the return calculation and replacing 'stock1' with each other stock name?
(2) Is there a better way to do this join? I'm effectively doing a cross join at this point and only keeping entries that are adjacent (as defined by the datediff and range), but wondering if there's a better way to join a table like this to itself.
EDIT: Per request, data is in the form (my data has multiple price columns though):
Date Price
7/1/1996 349.22
7/31/1996 337.72
8/30/1996 343.70
9/30/1996 357.23
10/31/1996 364.07
11/29/1996 385.04
12/31/1996 383.68
And from that, I'd like to calculate return, to generate a table like this (again, with additional columns for the extra price columns that exist in the actual table):
Date Ret
7/31/1996 -0.03
8/30/1996 0.02
9/30/1996 0.04
10/31/1996 0.02
11/29/1996 0.06
12/31/1996 0.00
I would do the following. First, use the month and year to do the self join. I would recommend you take the year * 12 plus the month number to get a unique value for each month and year combination. So, January 2011 would have a value of (2011 * 12 + 1 = 24133) and December 2010 would have a value of (2010 * 12 + 12 = 24132). This allows you to accurately compare months without having to mess with rolling over from December to January. Next, you need to supply the calculations in the select clause. If you have the stock values in different columns then you will have to type them out as b.stock1/a.stock1 - 1, b.stock2/a.stock2 - 1, etc. The only way around that would be to massage the data so there is only one stock value column, plus a stockname column identifying which stock each value is for (see the sketch after the query below).
Using the month and year for the self join, the following query should work (b is the later month, so the result matches ret = b.price/a.price - 1):
select b.Date, Ret = (b.stock1/a.stock1 - 1)
from #temp a
inner join #temp b on (YEAR(b.Date) * 12) + MONTH(b.Date) = (YEAR(a.Date) * 12) + MONTH(a.Date) + 1
order by b.Date
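And to address question (1) without copy-pasting the return expression per stock, a hedged sketch of the reshaping idea: unpivot the wide table into one (Date, StockName, Price) row per stock first. The stock column list and the names long_prices, StockName, and Price are assumptions for illustration.
-- Sketch: reshape to long form, then one return expression covers every stock.
;WITH long_prices AS (
    SELECT Date, StockName, Price
    FROM #temp
    UNPIVOT (Price FOR StockName IN (stock1, stock2, stock3)) AS u
)
SELECT b.Date, b.StockName, Ret = b.Price / a.Price - 1
FROM long_prices a
INNER JOIN long_prices b
    ON b.StockName = a.StockName
   AND (YEAR(b.Date) * 12) + MONTH(b.Date) = (YEAR(a.Date) * 12) + MONTH(a.Date) + 1
ORDER BY b.StockName, b.Date;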