Get first value outside where window with lag function - SQL

When using the lag function on time series in SQL Server, I always struggle with the first value in a time series.
Assume this trivial example
CREATE TABLE demo
([id] int, [time] date, [content] int)
;
INSERT INTO demo (id, time, content) VALUES
(1, '2021-05-31', cast(rand()*1000 as int)),
(2, '2021-06-01', cast(rand()*1000 as int)),
(3, '2021-06-02', cast(rand()*1000 as int)),
(4, '2021-06-03', cast(rand()*1000 as int)),
(5, '2021-06-04', cast(rand()*1000 as int)),
(6, '2021-06-05', cast(rand()*1000 as int)),
(7, '2021-06-06', cast(rand()*1000 as int)),
(8, '2021-06-07', cast(rand()*1000 as int)),
(9, '2021-06-08', cast(rand()*1000 as int));
I want to get all values and their previous value in June, so something like this
select content, lag(content, 1, null) over (order by time)
from demo
where time >= '2021-06-01'
So far so good; however, the first entry will have null for the previous value.
Of course there are many ways to fill that null value, e.g. by subselecting a larger range, but for very large tables I suspect there is a more elegant solution.
Sometimes I do stuff like this
select content, lag(content, 1,
       (select content
        from demo d1
        join (select max(time) maxtime from demo where time < '2021-06-01') d2
          on d1.time = d2.maxtime)
       ) over (order by time)
from demo
where time >= '2021-06-01'
Is there something more efficient? (Note: for this trivial example it doesn't make a difference, but for partitioned tables with 500'000'000 entries, one should find the most efficient solution.)
Check out the fiddle

The key idea is to use a subquery:
select d.*
from (select time, content, lag(content) over (order by time) as prev_content
      from demo
     ) d
where d.time >= '2021-06-01';
This is probably going to scan the entire table. However, you can create an index on demo(time, content) to help the lag().
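For instance (a minimal sketch; the index name is arbitrary):
CREATE INDEX ix_demo_time_content ON demo (time, content);
In SQL Server you could also make content an INCLUDE column if it only needs to be read, not sorted.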
Next, you can optimize this if you have a reasonable lookback period. For instance, if there are records every month, just go back one month in the subquery:
select d.*
from (select time, content, lag(content) over (order by time) as prev_content
      from demo
      where time >= '2021-05-01'
     ) d
where d.time >= '2021-06-01';
This can also be very important if your data is partitioned -- as large tables are wont to be.

For this particular case, going by your comments, you may first compute the lag over the entire unfiltered table, then subquery that based on date:
WITH cte AS (
SELECT time, content, LAG(content) OVER (ORDER BY time) lag_content
FROM demo
)
SELECT content, lag_content
FROM cte
WHERE time >= '2021-06-01';

What would you like the null values to be? I've used content - 1 in the example below.
SELECT content,
       COALESCE(LAG(content, 1, NULL) OVER (ORDER BY time), content - 1) AS lag_content
FROM demo
WHERE time >= '2021-06-01'
Output:
content lag_content
-------------------
2 1
3 2
4 3
5 4
6 5
7 6
8 7
9 8
Try it out here: dbfiddle

Related

Divide monthly spend into daily spend in BigQuery

I have monthly data in BigQuery in the following form:
CREATE TABLE if not EXISTS spend (
id int,
created_at DATE,
value float
);
INSERT INTO spend VALUES
(1, '2020-01-01', 100),
(2, '2020-02-01', 200),
(3, '2020-03-01', 100),
(4, '2020-04-01', 100),
(5, '2020-05-01', 50);
I would like a query to translate it into daily data in the following way:
One row per day.
The value of each day should be the monthly value divided by the number of days of the month.
What's the simplest way of doing this in BigQuery?
You can make use of GENERATE_DATE_ARRAY() to get an array between the desired dates (in your case, between 2020-01-01 and 2020-05-31), build a calendar table from it, and then divide the value of a given month among the days of that month :)
Try this and let me know if it worked:
with calendar_table as (
select
calendar_date
from
unnest(generate_date_array('2020-01-01', '2020-05-31', interval 1 day)) as calendar_date
),
final as (
select
ct.calendar_date,
s.value,
s.value / extract(day from last_day(ct.calendar_date)) as daily_value
from
spend as s
cross join
calendar_table as ct
where
format_date('%Y-%m', date(ct.calendar_date)) = format_date('%Y-%m', date(s.created_at))
)
select * from final
My recommendation is to do this "locally". That is, run generate_date_array() for each row in the original table. This is much faster than a join across rows. BigQuery also makes this easy with the last_day() function:
select t.id, dte,
       t.value / extract(day from last_day(t.created_at)) as daily_value
from `table` t cross join
     unnest(generate_date_array(t.created_at,
                                last_day(t.created_at, month))) as dte;

Irregular grouping of timestamp variable

I have a table organized as follows:
id lateAt
1231235 2019/09/14
1242123 2019/09/13
3465345 NULL
5676548 2019/09/28
8986475 2019/09/23
Where lateAt is a timestamp of when a certain loan's payment became late. For each current date (I need to look at these numbers daily) there is a certain number of entries that are late by 0-15, 15-30, 30-45, 45-60, 60-90 and 90+ days.
This is my desired output:
lateGroup Count
0-15 20
15-30 22
30-45 25
45-60 32
60-90 47
90+ 57
This is something I can easily calculate in R, but to get the results back to my BI dashboard I'd have to create a new table in my database, which I don't think is a good practice. What is the SQL-native approach to this problem?
I would define the "late groups" using ranges, then join against the number of days late:
with groups (grp) as (
values
(int4range(0,15, '[)')),
(int4range(15,30, '[)')),
(int4range(30,45, '[)')),
(int4range(45,60, '[)')),
(int4range(60,90, '[)')),
(int4range(90,null, '[)'))
)
select grp, count(t.id)
from groups g
left join the_table t on g.grp @> (current_date - t.lateAt)
group by grp
order by grp;
int4range(0,15, '[)') creates a range from 0 (inclusive) to 15 (exclusive).
Online example: https://rextester.com/QJSN89445
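You can sanity-check the boundaries directly with the containment operator used in the join (a quick illustration):
select int4range(0, 15, '[)') @> 14 as includes_14, -- true
       int4range(0, 15, '[)') @> 15 as includes_15; -- false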
The quick and dirty way to do this in SQL is:
SELECT '0-15' AS lateGroup,
COUNT(*) AS lateGroupCount
FROM my_table t
WHERE (CURRENT_DATE - t.lateAt) >= 0
AND (CURRENT_DATE - t.lateAt) < 15
UNION
SELECT '15-30' AS lateGroup,
COUNT(*) AS lateGroupCount
FROM my_table t
WHERE (CURRENT_DATE - t.lateAt) >= 15
AND (CURRENT_DATE - t.lateAt) < 30
UNION
SELECT '30-45' AS lateGroup,
COUNT(*) AS lateGroupCount
FROM my_table t
WHERE (CURRENT_DATE - t.lateAt) >= 30
AND (CURRENT_DATE - t.lateAt) < 45
-- Etc...
For production code, you would want to do something more like Ross' answer.
You didn't mention which DBMS you're using, but nearly all of them will have a construct known as a "value constructor" like this:
select bins.lateGroup, bins.minVal, bins.maxVal FROM
(VALUES
('0-15',0,15),
('15-30',15.0001,30), -- increase by a small fraction so bins don't overlap
('30-45',30.0001,45),
('45-60',45.0001,60),
('60-90',60.0001,90),
('90-99999',90.0001,99999)
) AS bins(lateGroup,minVal,maxVal)
If your DBMS doesn't have it, then you can probably use UNION ALL:
SELECT '0-15' as lateGroup, 0 as minVal, 15 as maxVal
union all SELECT '15-30',15,30
union all SELECT '30-45',30,45
Then your complete query, with the sample data you provided, would look like this:
--- example from SQL Server 2012 SP1
--- first let's set up some sample data
create table #temp (id int, lateAt datetime);
INSERT #temp (id, lateAt) values
(1231235,'2019-09-14'),
(1242123,'2019-09-13'),
(3465345,NULL),
(5676548,'2019-09-28'),
(8986475,'2019-09-23');
--- here's the actual query
select lateGroup, count(*) as Count
from #temp as T,
(VALUES
('0-15',0,15),
('15-30',15.0001,30), -- increase by a small fraction so bins don't overlap
('30-45',30.0001,45),
('45-60',45.0001,60),
('60-90',60.0001,90),
('90-99999',90.0001,99999)
) AS bins(lateGroup,minVal,maxVal)
where datediff(day,lateAt,getdate()) between minVal and maxVal
group by lateGroup
order by lateGroup
--- remove our sample data
drop table #temp;
Here's the output:
lateGroup Count
15-30 2
30-45 2
Note: rows with null lateAt are not counted.
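If you want those rows reported anyway, one option (a hedged sketch) is to UNION ALL an explicit bucket for them onto the query above:
select 'NULL' as lateGroup, count(*) as Count
from #temp
where lateAt is null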
I think you can do it all in one clear query:
with cte_lategroup as
(
select *
from (values(0,15,'0-15'),(15,30,'15-30'),(30,45,'30-45')) as t (mini, maxi, designation)
)
select
t2.designation
, count(*)
from test t
left outer join cte_lategroup t2
on current_date - t.lateat >= t2.mini
and current_date - lateat < t2.maxi
group by t2.designation;
With a setup like yours:
create table test
(
id int
, lateAt date
);
insert into test
values (1231235, to_date('2019/09/14', 'yyyy/mm/dd'))
,(1242123, to_date('2019/09/13', 'yyyy/mm/dd'))
,(3465345, null)
,(5676548, to_date('2019/09/28', 'yyyy/mm/dd'))
,(8986475, to_date('2019/09/23', 'yyyy/mm/dd'));

Find average time difference between stages

In SQL Server 2012, I have a database table called Stages that has 3 columns:
AccountID
StageNum
StartTime
I am trying to find out how long it usually takes between each stage, e.g. stage 2 usually takes 3 days to complete. Is this possible? Is it possible to skip weekends too?
Any SQL would be helpful!
Thank you
The simplest method is:
select ( datediff(minute, min(StartTime), max(StartTime)) /
         nullif(60.0 * (count(*) - 1), 0)
       ) as avg_hours
from Stages;
The nullif() prevents division by zero. The idea is simple: take the total elapsed time and divide it by one less than the number of stages.
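To get one average per account, add a GROUP BY (a sketch using the column names from the question):
select AccountID,
       datediff(minute, min(StartTime), max(StartTime)) /
       nullif(60.0 * (count(*) - 1), 0) as avg_hours
from Stages
group by AccountID;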
I would try the use of LEAD and AVG as follows:
CREATE TABLE Tab1(
AccountID INT, StageNum INT, StartTime DATETIME
)
INSERT INTO Tab1 VALUES(1, 1, '2018-01-01 07:00:00.000'), (1, 2, '2018-01-03 12:54:00.000'), (1, 3, '2018-02-01 12:00:00.000')
INSERT INTO Tab1 VALUES(2, 1, '2018-03-01 00:00:00.000'), (2, 2, '2018-04-03 12:54:00.000'), (2, 3, '2018-08-01 12:00:00.000')
WITH cte AS(
SELECT *
,LEAD(StartTime) OVER (PARTITION BY t.AccountID ORDER BY t.StageNum) NextStart
,DATEDIFF(MINUTE, StartTime, LEAD(StartTime) OVER (PARTITION BY t.AccountID ORDER BY t.StageNum))/60.0 TimeSpanHours
FROM Tab1 t
)
SELECT AccountID, AVG(TimeSpanHours) AvgTimeSpanHours
FROM cte
GROUP BY AccountID
ORDER BY AccountID
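Neither answer covers skipping weekends. A common T-SQL sketch for counting only weekdays between two dates (hedged: it assumes US English settings, so DATENAME(weekday, ...) returns 'Saturday'/'Sunday'):
DECLARE @s DATE = '2018-01-01', @e DATE = '2018-01-15';
SELECT (DATEDIFF(day, @s, @e) + 1)
     - (DATEDIFF(week, @s, @e) * 2)
     - (CASE WHEN DATENAME(weekday, @s) = 'Sunday' THEN 1 ELSE 0 END)
     - (CASE WHEN DATENAME(weekday, @e) = 'Saturday' THEN 1 ELSE 0 END) AS weekday_count;
You could substitute this for the raw DATEDIFF in either query above to exclude weekends from the spans.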

Is there a way to GROUP BY a time interval in this table?

I have a table like this one:
DateTime A
10:00:01 2
10:00:07 4
10:00:10 2
10:00:17 1
10:00:18 3
Is it possible to create a query that returns the average value of A for each 10 seconds? In this case the result would be:
3 (4+2)/2
2 (2+1+3)/3
Thanks in advance!
EDIT: If you really think that this can not be done just say NO WAY! :) It's an acceptable answer, I really don't know if this can be done.
EDIT2: I'm using SQL Server 2008. I would like to have different groupings but fixed. For example, ranges each 10 sec, 1 minute, 5 minutes, 30 minutes, 1 hour and 1 day (just an example but something like that)
In SQL Server, you can use DATEPART and then group by hour, minute, and the second integer-divided by 10.
CREATE TABLE #times
(
thetime time,
A int
)
INSERT #times
VALUES ('10:00:01', 2)
INSERT #times
VALUES ('10:00:07', 4)
INSERT #times
VALUES ('10:00:10', 2)
INSERT #times
VALUES ('10:00:17', 1)
INSERT #times
VALUES ('10:00:18', 3)
SELECT avg(A) -- integer division: if you need non-integer results, use avg(A * 1.0)
FROM #times
GROUP BY datepart(hour, thetime), DATEPART(minute, thetime), DATEPART(SECOND, thetime) / 10
DROP TABLE #times
It depends on the DBMS you are using.
In Oracle you can do the following:
SELECT AVG(A)
FROM MYTABLE
GROUP BY to_char(DateTime, 'HH24:MI')
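Note that this groups per minute, not per 10 seconds. For 10-second buckets in Oracle you could extend the grouping key with the seconds integer-divided by 10 (a hedged sketch):
SELECT AVG(A)
FROM MYTABLE
GROUP BY to_char(DateTime, 'HH24:MI') || floor(to_number(to_char(DateTime, 'SS')) / 10)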
In MySQL, you can group on the Unix timestamp integer-divided by 10:
CREATE TABLE IF NOT EXISTS `test` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`dtime` datetime NOT NULL,
`val` int(11) NOT NULL,
PRIMARY KEY (`id`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1 AUTO_INCREMENT=1 ;
INSERT INTO `test` (`id`, `dtime`, `val`) VALUES
(1, '2011-09-27 18:36:19', 8),
(2, '2011-09-27 18:36:21', 4),
(3, '2011-09-27 18:36:27', 5),
(4, '2011-09-27 18:36:35', 3),
(5, '2011-09-27 18:36:37', 2);
SELECT *, AVG(val) FROM test GROUP BY FLOOR(UNIX_TIMESTAMP(dtime) / 10)
Someone may come along and give you an answer with full code, but the way I would approach this is to break it down to several smaller problems/solutions:
(1) Create a temp table with intervals. See the accepted answer on this question:
Get a list of dates between two dates
This answer was for MySQL, but should get you started. Googling "Create intervals SQL" should also yield additional ways to accomplish this. You will want to use the MAX(DateTime) and MIN(DateTime) from your main table as inputs into whatever method you use (and 10 seconds for the span, obviously); a concrete sketch follows after the edit below.
(2) Join the temp table with your main table, with a join condition of (pseudocode):
FROM mainTable m INNER JOIN #tempTable t ON m.[DateTime] BETWEEN t.StartDate AND t.EndDate
(3) Now it should be as simple as correctly SELECTing and GROUPing:
SELECT
AVG(m.A)
FROM
mainTable m
INNER JOIN #tempTable t ON m.[DateTime] BETWEEN t.StartDate AND t.EndDate
GROUP BY
t.StartDate
Edit: if you want to see intervals that have no records (zero average), you would have to rearrange the query, use a LEFT JOIN, and COALESCE on m.A (see below). If you don't care about seeing such intervals, OCary's solution is better/cleaner.
SELECT
AVG(COALESCE(m.A, 0))
FROM
#tempTable t
LEFT JOIN mainTable m ON m.[DateTime] BETWEEN t.StartDate AND t.EndDate
GROUP BY
t.StartDate
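To make step (1) concrete, here is a hedged T-SQL sketch that builds the 10-second interval table used in the joins above (mainTable and its DateTime column are assumed names):
DECLARE @min DATETIME, @max DATETIME;
SELECT @min = MIN([DateTime]), @max = MAX([DateTime]) FROM mainTable;
CREATE TABLE #tempTable (StartDate DATETIME, EndDate DATETIME);
WITH intervals AS (
    SELECT @min AS StartDate
    UNION ALL
    SELECT DATEADD(SECOND, 10, StartDate) FROM intervals
    WHERE DATEADD(SECOND, 10, StartDate) <= @max
)
INSERT INTO #tempTable (StartDate, EndDate)
SELECT StartDate, DATEADD(SECOND, 10, StartDate)
FROM intervals
OPTION (MAXRECURSION 0);
Since the joins use BETWEEN (inclusive on both ends), you may want EndDate to stop just short of the next boundary to avoid double-counting rows that land exactly on it.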
I approached this by using a Common Table Expression to get all the periods between any given dates of my data. In principle you could change the interval to any SQL interval.
DECLARE @interval_minutes INT = 5, @start_date DATETIME = '20130201', @end_date DATETIME = GETDATE()
;WITH cte_period AS
(
SELECT CAST(@start_date AS DATETIME) AS [date]
UNION ALL
SELECT DATEADD(MINUTE, @interval_minutes, cte_period.[date]) AS [date]
FROM cte_period
WHERE DATEADD(MINUTE, @interval_minutes, cte_period.[date]) < @end_date
)
, cte_intervals AS
(SELECT [first].[date] AS [Start], [second].[date] AS [End]
FROM cte_period [first]
LEFT OUTER JOIN cte_period [second] ON DATEADD(MINUTE, @interval_minutes, [first].[date]) = [second].[date]
)
SELECT i.[Start], AVG(mu.data)
FROM cte_intervals i
LEFT OUTER JOIN your_data mu ON mu.your_date_time >= i.[Start] and mu.your_date_time < i.[End]
GROUP BY i.[Start]
OPTION (MAXRECURSION 0)
From http://www.sqlteam.com/forums/topic.asp?TOPIC_ID=142634 you can use the following query as well:
select dateadd(minute, datediff(minute, 0, timestamp ) / 10 * 10, 0), avg ( value )
from yourtable
group by dateadd(minute, datediff(minute, 0, timestamp ) / 10 * 10, 0)
which someone then expands upon to suggest:
Select
a.MyDate,
Start_of_10_Min =
dateadd(mi,(datepart(mi,a.MyDate)/10)*10,dateadd(hh,datediff(hh,0,a.Mydate),0))
from
( -- Test Data
select MyDate = getdate()
) a
although I'm not too sure how they plan on getting the average in the second suggestion.
Personally I prefer OCary's answer as I know what is going on there and that I'll be able to understand it in 6 months time without looking it up again but I include this one for completeness.
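One caveat on the sqlteam query above: it buckets by 10 minutes, not 10 seconds. A hedged 10-second variant needs a recent anchor date instead of 0, because DATEDIFF(second, 0, ...) overflows INT for modern dates (timestamp and value are the column names from that snippet; any anchor within about 68 years of your data works):
select dateadd(second, datediff(second, '20110101', timestamp) / 10 * 10, '20110101') as bucket,
       avg(value * 1.0) as avg_value
from yourtable
group by dateadd(second, datediff(second, '20110101', timestamp) / 10 * 10, '20110101')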

How to write a database view that expands data into multiple rows?

I have a database table that contains collection data for product collected from a supplier and I need to produce an estimate of month-to-date production figures for that supplier using an Oracle SQL query. Each day can have multiple collections, and each collection can contain product produced across multiple days.
Here's an example of the raw collection data:
Date Volume CollectionNumber ProductionDays
2011-08-22 500 1 2
2011-08-22 200 2 2
2011-08-20 600 1 2
Creating a month-to-date estimate is tricky because the first day of the month may have a collection covering two days' worth of production. Only a portion of that collected volume is actually attributable to the current month.
How can I write a query to produce this estimate?
My gut feeling is that I should be able to create a database view that transforms the raw data into estimated daily production figures by summing collections on the same day and distributing collection volumes across the number of days they were produced on. This would allow me to write a simple query to find the month-to-date production figure.
Here's what the above collection data would look like after being transformed into estimated daily production figures:
Date VolumeEstimate
2011-08-22 350
2011-08-21 350
2011-08-20 300
2011-08-19 300
Am I on the right track? If so, how can this be implemented? I have absolutely no idea how to do this type of transformation in SQL. If not, what is a better approach?
Note: I cannot do this calculation in application code since that would require a significant code change which we can't afford.
Try:
CREATE TABLE TableA (ProdDate DATE, Volume NUMBER, CollectionNumber NUMBER, ProductionDays NUMBER);
INSERT INTO TableA VALUES (TO_DATE ('20110822', 'YYYYMMDD'), 500, 1, 2);
INSERT INTO TableA VALUES (TO_DATE ('20110822', 'YYYYMMDD'), 200, 2, 2);
INSERT INTO TableA VALUES (TO_DATE ('20110820', 'YYYYMMDD'), 600, 1, 2);
COMMIT;
CREATE VIEW DailyProdVolEst AS
SELECT DateList.TheDate, SUM (DateRangeSums.DailySum) VolumeEstimate FROM
(
SELECT ProdStart, ProdEnd, SUM (DailyProduction) DailySum
FROM
(
SELECT (ProdDate - ProductionDays + 1) ProdStart, ProdDate ProdEnd, CollectionNumber, VolumeSum/ProductionDays DailyProduction
FROM
(
Select ProdDate, CollectionNumber, ProductionDays, Sum (Volume) VolumeSum FROM TableA
GROUP BY ProdDate, CollectionNumber, ProductionDays
)
)
GROUP BY ProdStart, ProdEnd
) DateRangeSums,
(
SELECT A.MinD + MyList.L TheDate FROM
(SELECT MIN (ProdDate - ProductionDays + 1) MinD FROM TableA) A,
(SELECT LEVEL - 1 L FROM DUAL CONNECT BY LEVEL <= (SELECT Max (ProdDate) - MIN (ProdDate - ProductionDays + 1) + 1 FROM TableA)) MyList
) DateList
WHERE DateList.TheDate BETWEEN DateRangeSums.ProdStart AND DateRangeSums.ProdEnd
GROUP BY DateList.TheDate;
The view DailyProdVolEst gives you dynamically the result you described... though some "constraints" apply:
the combination of ProdDate and CollectionNumber should be unique.
the ProductionDays need to be > 0 for all rows
EDIT - as requested in a comment:
How this query works:
It finds the smallest and biggest dates in the table, then builds one row per date in that range (DateList). This is matched against a list of rows containing the daily sum for unique combinations of ProdDate start/end (DateRangeSums), and summed up at the date level.
What do SUM (DateRangeSums.DailySum) and SUM (DailyProduction) do ?
Both sum things up: SUM (DateRangeSums.DailySum) sums up in cases of partially overlapping date ranges, and SUM (DailyProduction) sums up within one date range if there is more than one CollectionNumber. Without SUM the GROUP BY wouldn't be needed.
I think a UNION query will do the trick for you. You aren't using the CollectionNumber field in your example, so I excluded it from the sample below.
Something similar to the below query should work (Disclaimer: My oracle db isn't accessible to me at the moment):
SELECT Date, SUM(Volume) VolumeEstimate
FROM
(SELECT Date, SUM(Volume / ProductionDays) Volume
FROM [Table]
GROUP BY Date
UNION ALL
SELECT (Date - 1) Date, SUM(Volume / 2)
FROM [Table]
WHERE ProductionDays = 2
GROUP BY Date - 1)
GROUP BY Date
It sounds like what you want to do is sum up by day and then use a tally table to divide out the results.
Here's a runnable example with your data in T-SQL dialect:
DECLARE @tbl AS TABLE (
[Date] DATE
, Volume INT
, CollectionNumber INT
, ProductionDays INT);
INSERT INTO @tbl
VALUES ('2011-08-22', 500, 1, 2)
, ('2011-08-22', 200, 2, 2)
, ('2011-08-20', 600, 1, 2);
WITH Numbers AS (SELECT 1 AS N UNION ALL SELECT 2 AS N)
,AssignedVolumes AS (
SELECT t.*
, t.Volume / t.ProductionDays AS PerDay
, DATEADD(d, 1 - n.N, t.[Date]) AS AssignedDate
FROM @tbl AS t
INNER JOIN Numbers AS n
ON n.N <= t.ProductionDays
)
SELECT AssignedDate
, SUM(PerDay)
FROM AssignedVolumes
GROUP BY AssignedDate;
I dummied up a simple numbers table with only 1 and 2 in it to perform the pivot. Typically you'll have a table with a million numbers in sequence.
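If you need to build one, a hedged T-SQL sketch that materializes a large numbers table by cross-joining a system catalog:
SELECT TOP (100000)
       ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS N
INTO Numbers
FROM sys.all_objects a
CROSS JOIN sys.all_objects b;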
For Oracle, the only thing you should need to change would be the DATEADD.
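For illustration, a hedged Oracle version of the same pivot, reusing the TableA definition from the first answer and CONNECT BY for the numbers (LEVEL <= 2 assumes at most 2 production days; raise it as needed):
WITH Numbers AS (SELECT LEVEL AS N FROM DUAL CONNECT BY LEVEL <= 2)
SELECT t.ProdDate - (n.N - 1) AS AssignedDate,
       SUM(t.Volume / t.ProductionDays) AS VolumeEstimate
FROM TableA t
JOIN Numbers n ON n.N <= t.ProductionDays
GROUP BY t.ProdDate - (n.N - 1)
ORDER BY AssignedDate;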