I want to calculate a moving average for a column, which requires an arithmetic calculation using values from the previous record.
I have records for a meter reading X, with dates, and I want to calculate the average rate using the formula
(reading X - reading Y) / (date of reading X - date of reading Y)
where Y is always the meter reading from the previous record. The date difference is a DATEDIFF in days.
Meter | Reading | Date
-------+---------+------------
1      | 39,000  | 1 Aug 2016
1      | 39,200  | 1 Sep 2016
1      | 39,300  | 1 Oct 2016
I would like an additional column containing the calculated rate.
It would have to read from the latest record and process backwards,
not from the first, since I have two years of readings.
Meter | Reading | Date | Rate
------+---------+------------+--------------------
1 | 39,000 | 1 Aug 2016 | (200 / 31) = 6.45
1 | 39,200 | 1 Sep 2016 | (100 / 30) = 3.33
1 | 39,300 | 1 Oct 2016 | Z
I want to select this into a table for reporting.
-- EDIT --
I was getting divide-by-zero errors and decided to calculate the Reading X - Reading Y part separately as MeterDiff.
LEAD(MeterReading, 1, 0) OVER (PARTITION BY MeterID ORDER BY MeterReading) - MeterReading AS MeterDiff
Because there is more than one MeterID in the select list, how would I prevent it from calculating a MeterDiff between the last record of MeterID 1 and the first record of MeterID 2? Can I not set the first record for each MeterID to 0?
It would be something like this:
select t.*,
( (reading - lag(reading) over (partition by meter order by date)) /
nullif(datediff(day, lag(date) over (partition by meter order by date), date), 0)
) as rate
from t;
If reading is an integer, then be careful, because SQL Server does integer division. So, you might want:
select t.*,
( (1.0*reading - lag(reading) over (partition by meter order by date)) /
nullif(datediff(day, lag(date) over (partition by meter order by date), date), 0)
) as rate
from t;
Note: lag() is ANSI standard functionality implemented in SQL Server since version 2012. Prior to that, you would need to use a more computationally intensive method, such as outer apply.
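For example, a pre-2012 version might look something like this (a sketch only, using the same hypothetical table t as above):
select t.*,
       ( (1.0 * t.reading - prev.reading) /
         nullif(datediff(day, prev.date, t.date), 0)
       ) as rate
from t outer apply
     (select top (1) t2.reading, t2.date
      from t t2
      where t2.meter = t.meter and t2.date < t.date  -- previous reading for the same meter
      order by t2.date desc
     ) prev;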
Related
I have a dataset that includes a bunch of clients and date ranges that they had a "stay." For example:
| ClientID | DateStart | DateEnd |
+----------+-----------+---------+
| 1 | Jan 1 | Jan 31 | (datediff = 30)
| 1 | Apr 4 | May 4 | (datediff = 30)
| 2 | Jan 3 | Feb 27 | (datediff = 55)
| 3 | Jan 1 | Jan 7 | (datediff = 6)
| 3 | Jan 10 | Jan 17 | (datediff = 6)
| 3 | Jan 20 | Jan 27 | (datediff = 6)
| 3 | Feb 1 | Feb 7 | (datediff = 6)
| 3 | Feb 10 | Feb 17 | (datediff = 6)
| 3 | Feb 20 | Feb 27 | (datediff = 6)
My ultimate goal is to be able to identify the dates on which a client passed a threshold of N nights in the past X time. Let's say 30 days in the last 90 days. I also need to know when they pass out of the threshold. Use case: hotel stays and a VIP status.
In the example above, Client 1 passed the threshold on Jan 31 (had 30 nights in past 90 days), and still kept meeting the threshold until April 2 (now only 29 nights in the past 90 days), but passed the threshold again on May 4.
Client 2 passed the threshold on Feb 3, and kept meeting the threshold until April 28th, at which point the earliest days are more than 90 days ago and they expire.
Client 3 passed the threshold on or around Feb 17.
So I would like to generate a table like this:
| ClientID | VIPStart | VIPEnd |
+----------+-----------+---------+
| 1 | Jan 31 | Apr 2 |
| 1 | May 4 | Jul 5 |
| 2 | Feb 3 | Apr 28 |
| 3 | Feb 17 | Apr 11 |
(Forgive me if the dates are slightly off, I'm doing this in my head)
Ideally I would like to generate a view, as I will need to reference it often.
What I want to know is what's the most efficient way to generate this? Assuming I have thousands of clients and hundreds of thousands of stays.
The way I've been approaching this so far is a SQL statement that includes a parameter: as of {?Date}, who had VIP status and who didn't. I do that by calculating DATEADD(day, -90, {?Date}), excluding the records that fall outside the range, truncating DateStarts that extend earlier and DateEnds that extend later, calculating DATEDIFF(day, DateStart, DateEnd) for the resulting adjusted stays, and then taking the SUM() of those DATEDIFFs for each client as of {?Date}. It works, but it's not pretty. And it gives me a point-in-time snapshot; I want the history.
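Roughly, that snapshot query looks like this (a sketch from memory, using the Stays table and the {?Date} report parameter):
SELECT ClientID,
       SUM(DATEDIFF(day,
                    -- truncate DateStart to the start of the 90-day window
                    CASE WHEN DateStart > DATEADD(day, -90, {?Date})
                         THEN DateStart ELSE DATEADD(day, -90, {?Date}) END,
                    -- truncate DateEnd to the report date
                    CASE WHEN DateEnd < {?Date} THEN DateEnd ELSE {?Date} END)
          ) AS NightsLast90
FROM Stays
WHERE DateStart <= {?Date}
  AND DateEnd >= DATEADD(day, -90, {?Date})   -- exclude out-of-range stays
GROUP BY ClientID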
It seems a little inefficient to generate a table of dates and then, for every single date, use the above method.
Another option I considered was converting the raw data into an exploded table with each record corresponding to one night, so that counting is easier. Like this:
| ClientID | StayDate |
+----------+-----------+
| 1 | Jan 1 |
| 1 | Jan 2 |
| 1 | Jan 3 |
| 1 | Jan 4 |
etc.
Then I could just add a column counting the number of days in the past 90 days, and that'll get me most of the way there.
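Something like this self-join count is what I have in mind (just a sketch, with Exploded as a hypothetical name for the one-row-per-night table; a plain ROWS window wouldn't work here because nights away leave gaps in the dates):
SELECT e1.ClientID, e1.StayDate,
       COUNT(*) AS NightsLast90  -- nights in the 90 days ending on StayDate
FROM Exploded e1
JOIN Exploded e2
  ON  e2.ClientID = e1.ClientID
  AND e2.StayDate BETWEEN DATEADD(day, -89, e1.StayDate) AND e1.StayDate
GROUP BY e1.ClientID, e1.StayDate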
But I'm not sure how to do that exploding in a view. I have a code snippet that does it:
WITH DaysTally AS (
SELECT MAX(DATEDIFF(day, DateStart, DateEnd)) - 1 AS Tally
FROM Stays
UNION ALL
SELECT Tally - 1 AS Expr1
FROM DaysTally AS DaysTally_1
WHERE (Tally - 1 >= 0))
SELECT t.ClientID,
DATEADD(day, c.Tally, t.DateStart) AS "StayDate"
FROM Stays AS t
INNER JOIN DaysTally AS c ON
DATEDIFF(day, t.DateStart, t.DateEnd) - 1 >= c.Tally
OPTION (MAXRECURSION 0)
But I can't get it to work without the MAXRECURSION hint, and I don't think you can save a view that contains an OPTION (MAXRECURSION) clause.
And now I'm rambling. So the help that I'm looking for is: what is the most efficient method to pursue my goal? And if you have a code example, that would be helpful too! Thanks.
This is an interesting and pretty well-asked question. I would start by enumerating the days from the beginning of the first stay of each client until 90 days after the end of their last stay with a recursive cte. You can then bring in the stay table with a left join, and use window functions to flag the "VIP" days (note that this assumes no overlapping stays for a given client, which is consistent with your sample data).
What follows is gaps-and-islands: you can use a window sum to put "adjacent" VIP days in groups, and then aggregate.
with cte as (
select clientID, min(dateStart) dt, dateadd(day, 90, max(dateEnd)) dateMax
from stays
group by clientID
union all
select clientID, dateadd(day, 1, dt), dateMax
from cte
where dt < dateMax
)
select clientID, min(dt) VIPStart, max(dt) VIPEnd
from (
select t.*, sum(isNotVip) over(partition by clientID order by dt) grp
from (
select
c.clientID,
c.dt,
case when count(s.clientID) over(
partition by c.clientID
order by c.dt
rows between 90 preceding and current row
) >= 30
then 0
else 1
end isNotVip
from cte c
left join stays s
on c.clientID = s.clientID and c.dt between s.dateStart and s.dateEnd
) t
) t
where isNotVip = 0
group by clientID, grp
order by clientID, VIPStart
option (maxrecursion 0)
This demo on DB Fiddle with your sample data produces:
clientID | VIPStart | VIPEnd
-------: | :--------- | :---------
1 | 2020-01-30 | 2020-04-01
1 | 2020-05-03 | 2020-07-04
2 | 2020-02-01 | 2020-04-28
3 | 2020-02-07 | 2020-04-20
You can put this in a view as follows:
- the order by and option (maxrecursion) clauses must be omitted when creating the view
- each and every query that has the view in its from clause must end with option (maxrecursion 0)
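Concretely, the view definition is the same query with those two clauses removed, and the hint moves to the calling query (a sketch; clientVIP is just a placeholder name):
create view clientVIP as
with cte as (
    select clientID, min(dateStart) dt, dateadd(day, 90, max(dateEnd)) dateMax
    from stays
    group by clientID
    union all
    select clientID, dateadd(day, 1, dt), dateMax
    from cte
    where dt < dateMax
)
select clientID, min(dt) VIPStart, max(dt) VIPEnd
from (
    select t.*, sum(isNotVip) over(partition by clientID order by dt) grp
    from (
        select c.clientID, c.dt,
               case when count(s.clientID) over(
                        partition by c.clientID
                        order by c.dt
                        rows between 90 preceding and current row
                    ) >= 30
                    then 0 else 1
               end isNotVip
        from cte c
        left join stays s
          on c.clientID = s.clientID and c.dt between s.dateStart and s.dateEnd
    ) t
) t
where isNotVip = 0
group by clientID, grp;

-- and every query that reads the view carries the hint:
select * from clientVIP
order by clientID, VIPStart
option (maxrecursion 0);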
You can eliminate the recursion by creating a tally table in the view. The approach is then the following:
1. For each period, generate dates from 90 days before the period to 90 days after. These are all the "candidate days" that the period could affect.
2. For each row, add a flag as to whether it is in the period (as opposed to the 90 days before and after).
3. Aggregate by client id and date.
4. Use a running sum to get the days with 30+ in the previous 90 days.
5. Then filter for the ones with 30+ days and treat this as a gaps-and-islands problem.
Assuming that 1000 days is sufficient for the periods (including the 90 days before and after), then the query looks like this:
with n as (
select v.n
from (values (0), (1), (2), (3), (4), (5), (6), (7), (8), (9)) v(n)
),
nums as (
select (n1.n * 100 + n2.n * 10 + n3.n) as n
from n n1 cross join n n2 cross join n n3
),
running90 as (
select clientid, dte, sum(in_period) over (partition by clientid order by dte rows between 89 preceding and current row) as running_90
from (select t.clientid, dateadd(day, n.n - 90, datestart) as dte,
max(case when dateadd(day, n.n - 90, datestart) >= datestart and dateadd(day, n.n - 90, datestart) <= t.dateend then 1 else 0 end) as in_period
from t join
nums n
on dateadd(day, n.n - 90, datestart) <= dateadd(day, 90, dateend)
group by t.clientid, dateadd(day, n.n - 90, datestart)
) t
)
select clientid, min(dte) as VIPStart, max(dte) as VIPEnd
from (select r.*,
row_number() over (partition by clientid order by dte) as seqnum
from running90 r
where running_90 >= 30
) r
group by clientid, dateadd(day, - seqnum, dte);
Having no recursive CTE (although one could be used for n), this is not subject to the maxrecursion issue.
Here is a db<>fiddle.
The results are slightly different from your results. This is probably due to some slight difference in the definitions. The above includes the end day as an "occupied" day. The 90 days is 89 days before plus the current day in the above query. The second-to-last query shows the running 90-day counts, and those seem correct to me.
From an ISO week and year, I would like to get a date.
The date should be the first day of the week.
The first day of the week is Monday.
For example, ISO week 10 of ISO year 2019 should convert to 2019-03-04.
I am using Snowflake.
The date expression to do this is a little complex, but not impossible:
SELECT
DATEADD( /* Calculate start of ISOWeek as offset from Jan 1st */
DAY,
WEEK * 7 - CASE WHEN DAYOFWEEKISO(DATE_FROM_PARTS(YEAR, 1, 1)) < 5 THEN 7 ELSE 0 END
+ 1 - DAYOFWEEKISO(DATE_FROM_PARTS(YEAR, 1, 1)),
DATE_FROM_PARTS(YEAR, 1, 1)
)
FROM (VALUES (2000, 1), (2000, 2), (2001, 1), (2002, 1), (2003, 1)) v(YEAR, WEEK);
Unfortunately, Snowflake doesn't support this functionality natively.
While it's possible to compute the date from ISO week and year manually, it's very complex. So, as others suggested, generating a Date Dimension table for this is much easier.
Here is an example of a query that can generate the lookup table (note: this is not a full Date Dimension table, which typically has one row per day; this one has one row per week):
create or replace table iso_week_lookup as
select
date_part(yearofweek_iso, d) year_iso,
date_part(week_iso, d) week_iso,
min(d) first_day
from (
select dateadd(day, row_number() over (order by 1) - 1, '2000-01-03'::date) AS d
from table(generator(rowCount=>10000))
)
group by 1, 2 order by 1,2;
select * from iso_week_lookup limit 2;
----------+----------+------------+
YEAR_ISO | WEEK_ISO | FIRST_DAY |
----------+----------+------------+
2000 | 1 | 2000-01-03 |
2000 | 2 | 2000-01-10 |
----------+----------+------------+
select min(first_day), max(first_day) from iso_week_lookup;
----------------+----------------+
MIN(FIRST_DAY) | MAX(FIRST_DAY) |
----------------+----------------+
2000-01-03 | 2027-05-17 |
----------------+----------------+
select * from iso_week_lookup where year_iso = 2019 and week_iso = 10;
----------+----------+------------+
YEAR_ISO | WEEK_ISO | FIRST_DAY |
----------+----------+------------+
2019 | 10 | 2019-03-04 |
----------+----------+------------+
Note, you can play with the constants in create table to create a table of the range you want. Just remember to use Monday as the starting day, otherwise you'll get a wrong value for the first week in the table :)
If you do not have a Date Dimension table and/or utilities, as mentioned in the comments, you can parse the date from a textual form. This is DBMS-dependent, though:
In MySQL: STR_TO_DATE(CONCAT(year, ' ', week), '%x %v')
In PostgreSQL: TO_DATE(year || ' ' || week, 'IYYY IW')
(Oracle would be something similar.)
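For example, in PostgreSQL the conversion from the question would be (a quick illustration; it should return the Monday of the given ISO week):
SELECT TO_DATE('2019 10', 'IYYY IW');  -- 2019-03-04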
This is a similar scenario to
SQL: Count of rows since certain value first occurred
In SQL Server, I'm trying to calculate, per town, the count of days since the same weather as today's (let's assume today is 6th August 2018) was first observed within the past 5 days.
Here's the data:
+---------+---------+--------+--------+--------+
| Date | Toronto | Cairo | Zagreb | Ankara |
+---------+---------+--------+--------+--------+
| 1.08.18 | Rain | Sun | Clouds | Sun |
| 2.08.18 | Sun | Sun | Clouds | Sun |
| 3.08.18 | Rain | Sun | Clouds | Rain |
| 4.08.18 | Clouds | Sun | Clouds | Clouds |
| 5.08.18 | Rain | Clouds | Rain | Rain |
| 6.08.18 | Rain | Sun | Sun | Sun |
+---------+---------+--------+--------+--------+
This needs to perform well, but all I have come up with so far are individual queries for each town (and there are going to be dozens of towns, not just these four). This works but is not going to scale.
Here's the one for Toronto...
SELECT
DATEDIFF(DAY, MIN([Date]), GETDATE()) + 1
FROM
(SELECT TOP 5 *
FROM Weather
WHERE [Date] <= GETDATE()
ORDER BY [Date] DESC) a
WHERE
Toronto = (SELECT TOP 1 Toronto
           FROM Weather
           WHERE [Date] = GETDATE())
...which correctly returns 4 since today there is rain and the first occurrence of rain within the past 5 days was 3rd August.
But what I want returned is a table like this:
+---------+-------+--------+--------+
| Toronto | Cairo | Zagreb | Ankara |
+---------+-------+--------+--------+
| 4 | 5 | 1 | 5 |
+---------+-------+--------+--------+
Slightly modified from the accepted answer by @Used_By_Already is this code:
CREATE TABLE mytable(
Date date NOT NULL
,Toronto VARCHAR(9) NOT NULL
,Cairo VARCHAR(9) NOT NULL
,Zagreb VARCHAR(9) NOT NULL
,Ankara VARCHAR(9) NOT NULL
);
INSERT INTO mytable(Date,Toronto,Cairo,Zagreb,Ankara) VALUES ('20180801','Rain','Sun','Clouds','Sun');
INSERT INTO mytable(Date,Toronto,Cairo,Zagreb,Ankara) VALUES ('20180802','Sun','Sun','Clouds','Sun');
INSERT INTO mytable(Date,Toronto,Cairo,Zagreb,Ankara) VALUES ('20180803','Rain','Sun','Clouds','Rain');
INSERT INTO mytable(Date,Toronto,Cairo,Zagreb,Ankara) VALUES ('20180804','Clouds','Sun','Clouds','Clouds');
INSERT INTO mytable(Date,Toronto,Cairo,Zagreb,Ankara) VALUES ('20180805','Rain','Clouds','Rain','Rain');
INSERT INTO mytable(Date,Toronto,Cairo,Zagreb,Ankara) VALUES ('20180806','Rain','Sun','Sun','Sun');
with cte as (
select
date, city, weather
FROM (
SELECT * from mytable
) AS cp
UNPIVOT (
Weather FOR City IN (Toronto, Cairo, Zagreb, Ankara)
) AS up
)
select
date, city, weather, datediff(day,ca.prior,cte.date)+1 as daysPresent
from cte
cross apply (
select min(prev.date) as prior
from cte as prev
where prev.city = cte.city
and prev.date between dateadd(day,-4,cte.date) and dateadd(day,0,cte.date)
and prev.weather = cte.weather
) ca
order by city,date
Output:
However, what I'm trying to do now is to keep counting daysPresent up even beyond those five past days. Meaning that the last marked row in the output sample should show 6. The logic: increase the previous number by the count of days between the two occurrences if the gap is less than 5 days; if the same weather has not occurred in the past 5 days, go back to 1.
I experimented with LEAD and LAG but cannot get it to work. Is that even the right way to add another layer to it, or would the query need to look entirely different?
I'm a bit puzzled.
You have a major problem with your data structure. The values should be in rows, not columns. So, start with:
select d.dte, v.*
from data d cross apply
     (values ('Toronto', Toronto), ('Cairo', Cairo), . . .
     ) v(city, val)
where d.dte >= dateadd(day, -5, getdate());
From there, we can use the window function first_value() (or last_value()) to get the most recent reading. The rest is just aggregation by city:
with d as (
select d.dte, v.*,
first_value(v.val) over (partition by v.city order by d.dte desc) as last_val
from data d cross apply
(values ('Toronto', Toronto), ('Cairo', Cairo), . . .
) v(city, val)
where d.dte >= dateadd(day, -5, getdate())
)
select city, datediff(day, min(dte), getdate()) + 1
from d
where val = last_val
group by city;
This gives you the information you want, in rows rather than columns. You can re-pivot if you really want, but I advise you to keep the data with each city in a different row.
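If you do want the single-row layout from the question, conditional aggregation on top of the previous query would do it; a sketch, reusing the d CTE from above:
select max(case when city = 'Toronto' then days end) as Toronto,
       max(case when city = 'Cairo'   then days end) as Cairo,
       max(case when city = 'Zagreb'  then days end) as Zagreb,
       max(case when city = 'Ankara'  then days end) as Ankara
from (select city, datediff(day, min(dte), getdate()) + 1 as days
      from d
      where val = last_val
      group by city
     ) c;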
I'm calculating the growth rate between two time ranges (5 years) using the formula below:
growth rate = ((2016 net income / 2012 net income) ^ (1 / 5 years)) - 1
My IncomeStatements table is structured roughly like this:
id | stockid | year | netincome
1 | 1 | 2016 | 235235346
2 | 1 | 2015 | 432434545
..2014-2013 rows
5 | 1 | 2012 | 324324234
6 | 2 | 2016 | 234235234
7 | 2 | cycle continues..
How can I select the most recent and the earliest years (2016 and 2012) for each stockid (a foreign key), apply the formula, and update the result into the growthrate column of the stock table?
Below is my incomplete code. Kindly help me improve it or provide workarounds since I'm new to SQL.
UPDATE stock SET growthrate = (Help)
FROM IncomeStatements WHERE IncomeStatements.stockid= stock.id
If I understand correctly, you need to get the first and last values for the year and net income. You can do this with window functions.
The rest is just arithmetic:
with i as (
      select distinct stockid,
             first_value(year) over (partition by stockid order by year) as year_first,
             first_value(year) over (partition by stockid order by year desc) as year_last,
             first_value(netincome) over (partition by stockid order by year) as netincome_first,
             first_value(netincome) over (partition by stockid order by year desc) as netincome_last
      from incomestatements
     )
update s
    set growthrate = ((i.netincome_last - i.netincome_first) / nullif(i.year_last - i.year_first, 0)) - 1
    from stock s join
         i
         on s.id = i.stockid;
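Side note: the formula in the question is a compound growth rate, so if that's what you're after, the assignment would be something like this instead (a sketch using power(), assuming net income is always positive):
with i as (
      select distinct stockid,
             first_value(year) over (partition by stockid order by year) as year_first,
             first_value(year) over (partition by stockid order by year desc) as year_last,
             first_value(netincome) over (partition by stockid order by year) as netincome_first,
             first_value(netincome) over (partition by stockid order by year desc) as netincome_last
      from incomestatements
     )
update s
    set growthrate = power(1.0 * i.netincome_last / nullif(i.netincome_first, 0),  -- ratio of latest to earliest
                           1.0 / nullif(i.year_last - i.year_first, 0)) - 1        -- nth root over the year span
    from stock s join
         i
         on s.id = i.stockid;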
Say I’ve got an events table with just the columns id and occurred (which is just a datetime).
I want to get, for every day in a given period, the number of events in the previous week. So, let’s say the period was Jan 1 through April 1. I’d want the results of this query to look like:
+-------+------+
| count | date |
+-------+------+
| 3     | 1/1  |
| 2     | 1/2  |
| 0     | 1/3  |
| 4     | 1/4  |
+-------+------+
Where count is, for that date, the number of events that happened in the week prior. So, the 3 count for 1/1 is how many events happened between Dec 25th and Jan 1.
I could do this easily enough in code:
for (date in 1/1 to 4/1) {
    start_date = date - 7 days
    db.query('SELECT COUNT(1) FROM events WHERE occurred > start_date AND occurred < date')
}
Unfortunately, this would result in over a hundred separate queries. I’d like to figure out how to do this in one query.
Hmm, you can generate all the dates in the period using generate_series(). Then join in the data and do a moving sum:
select dd.dte,
sum(cnt) over (order by dd.dte rows between 6 preceding and current row) as cnt_7daymoving
from generate_series('2015-01-01'::timestamp, '2015-04-01'::timestamp, '1 day'::interval) dd(dte) left join
(select date_trunc('day', occurred) as dte, count(*) as cnt
from events e
group by date_trunc('day', occurred)
) e
on e.dte = dd.dte;
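Note that the question's occurred < date excludes the current day; if you want that exact window, shift the frame, e.g. (a sketch):
select dd.dte,
       sum(cnt) over (order by dd.dte
                      rows between 7 preceding and 1 preceding) as prev_week_cnt
from generate_series('2015-01-01'::timestamp, '2015-04-01'::timestamp, '1 day'::interval) dd(dte) left join
     (select date_trunc('day', occurred) as dte, count(*) as cnt
      from events e
      group by date_trunc('day', occurred)
     ) e
     on e.dte = dd.dte;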