SQL interpolating missing values for a specific date range - with some conditions

SQL interpolating missing values for a specific date range - with some conditions - sql

There are some similar questions on the site, but I believe mine warrants a new post because there are specific conditions that need to be incorporated.
I have a table with monthly intervals, structured like this:
+----+--------+--------------+--------------+
| ID | amount | interval_beg | interval_end |
+----+--------+--------------+--------------+
| 1 | 10 | 12/17/2017 | 1/17/2018 |
| 1 | 10 | 1/18/2018 | 2/18/2018 |
| 1 | 10 | 2/19/2018 | 3/19/2018 |
| 1 | 10 | 3/20/2018 | 4/20/2018 |
| 1 | 10 | 4/21/2018 | 5/21/2018 |
+----+--------+--------------+--------------+
I've found that sometimes there is a month of data missing around the end/beginning of the year where I know it should exist, like this:
+----+--------+--------------+--------------+
| ID | amount | interval_beg | interval_end |
+----+--------+--------------+--------------+
| 2 | 10 | 10/14/2018 | 11/14/2018 |
| 2 | 10 | 11/15/2018 | 12/15/2018 |
| 2 | 10 | 1/17/2019 | 2/17/2019 |
| 2 | 10 | 2/18/2019 | 3/18/2019 |
| 2 | 10 | 3/19/2019 | 4/19/2019 |
+----+--------+--------------+--------------+
What I need is a statement that will:
Identify where this year-end period is missing (but not find missing
months that aren't at the beginning/end of the year).
Create this interval by using the length of an existing interval for
that ID (maybe using the mean interval length for the ID to do it?). I could create the interval from the "gap" between the previous and next interval, except that won't work if I'm missing an interval at the beginning or end of the ID's record (i.e. if the record starts at say 1/16/2015, I need the amount for 12/15/2014-1/15/2015
Interpolate an 'amount' for this interval using the mean daily
'amount' from the closest existing interval.
The end result for the sample above should look like:
+----+--------+--------------+--------------+
| ID | amount | interval_beg | interval_end |
+----+--------+--------------+--------------+
| 2 | 10 | 10/14/2018 | 11/14/2018 |
| 2 | 10 | 11/15/2018 | 12/15/2018 |
| 2 | 10 | 12/16/2018 | 1/16/2018 |
| 2 | 10 | 1/17/2019 | 2/17/2019 |
| 2 | 10 | 2/18/2019 | 3/18/2019 |
+----+--------+--------------+--------------+
A 'nice to have' would be a flag indicating that this value is interpolated.
Is there a way to do this efficiently in SQL? I have written a solution in SAS, but have a need to move it to SQL, and my SAS solution is very inefficient (optimization isn't a goal, so any statement that does what I need is fantastic).
EDIT: I've made an SQLFiddle with my example table here:
http://sqlfiddle.com/#!18/8b16d

You can use a sequence of CTEs to build up the data for the missing periods. In this query, the first CTE (EOYS) generates all the end-of-year dates (YYYY-12-31) relevant to the table; the second (INTERVALS) the average interval length for each ID and the third (MISSING) attempts to find start (from t2) and end (from t3) dates of adjoining intervals for any missing (indicated by t1.ID IS NULL) end-of-year interval. The output of this CTE is then used in an INSERT ... SELECT query to add missing interval records to the table, generating missing dates by adding/subtracting the interval length to the end/start date of the adjacent interval as necessary.
First though we add the interp column to indicate if a row was interpolated:
ALTER TABLE Table1 ADD interp TINYINT NOT NULL DEFAULT 0;
This sets interp to 0 for all existing rows. Then we can do the INSERT, setting interp for all those rows to 1:
WITH EOYS AS (
SELECT DISTINCT DATEFROMPARTS(DATEPART(YEAR, interval_beg), 12, 31) AS eoy
FROM Table1
),
INTERVALS AS (
SELECT ID, AVG(DATEDIFF(DAY, interval_beg, interval_end)) AS interval_len
FROM Table1
GROUP BY ID
),
MISSING AS (
SELECT e.eoy,
ids.ID,
i.interval_len,
COALESCE(t2.amount, t3.amount) AS amount,
DATEADD(DAY, 1, t2.interval_end) AS interval_beg,
DATEADD(DAY, -1, t3.interval_beg) AS interval_end
FROM EOYS e
CROSS JOIN (SELECT DISTINCT ID FROM Table1) ids
JOIN INTERVALS i ON i.ID = ids.ID
LEFT JOIN Table1 t1 ON ids.ID = t1.ID
AND e.eoy BETWEEN t1.interval_beg AND t1.interval_end
LEFT JOIN Table1 t2 ON ids.ID = t2.ID
AND DATEADD(MONTH, -1, e.eoy) BETWEEN t2.interval_beg AND t2.interval_end
LEFT JOIN Table1 t3 ON ids.ID = t3.ID
AND DATEADD(MONTH, 1, e.eoy) BETWEEN t3.interval_beg AND t3.interval_end
WHERE t1.ID IS NULL
)
INSERT INTO Table1 (ID, amount, interval_beg, interval_end, interp)
SELECT ID,
amount,
COALESCE(interval_beg, DATEADD(DAY, -interval_len, interval_end)) AS interval_beg,
COALESCE(interval_end, DATEADD(DAY, interval_len, interval_beg)) AS interval_end,
1 AS interp
FROM MISSING
This adds the following rows to the table:
ID amount interval_beg interval_end interp
2 10 2017-12-05 2018-01-04 1
2 10 2018-12-16 2019-01-16 1
2 10 2019-12-28 2020-01-27 1
Demo on SQLFiddle

Related

SQL: usage time of item between dates combining two tables

Trying to create query that will give me usage time of each car part between dates when that part is used. Etc. let say part id 1 is installed on 2018-03-01 and on 2018-04-01 runs for 50min and then on 2018-05-10 runs 30min total usage of this part shoud be 1:20min as result.
These are examples of my tables.
Table1
| id | part_id | car_id | part_date |
|----|-------- |--------|------------|
| 1 | 1 | 3 | 2018-03-01 |
| 2 | 1 | 1 | 2018-03-28 |
| 3 | 1 | 3 | 2018-05-10 |
Table2
| id | car_id | run_date | puton_time | putoff_time |
|----|--------|------------|---------------------|---------------------|
| 1 | 3 | 2018-04-01 | 2018-04-01 12:00:00 | 2018-04-01 12:50:00 |
| 2 | 2 | 2018-04-10 | 2018-04-10 15:10:00 | 2018-04-10 15:20:00 |
| 3 | 3 | 2018-05-10 | 2018-05-10 10:00:00 | 2018-05-10 10:30:00 |
| 4 | 1 | 2018-05-11 | 2018-05-11 12:00:00 | 2018-04-01 12:50:00 |
Table1 contains dates when each part is installed, table2 contains usage time of each part and they are joined on car_id, I have try to write query but it does not work well if somebody can figure out my mistake in this query that would be healpful.
My SQL query
SELECT SEC_TO_TIME(SUM(TIME_TO_SEC(TIMEDIFF(t1.puton_time, t1.putoff_time)))) AS total_time
FROM table2 t1
LEFT JOIN table1 t2 ON t1.car_id=t2.car_id
WHERE t2.id=1 AND t1.run_date BETWEEN t2.datum AND
(SELECT COALESCE(MIN(datum), '2100-01-01') AS NextDate FROM table1 WHERE
id=1 AND t2.part_date > part_date);
Expected result
| part_id | total_time |
|---------|------------|
| 1 | 1:20:00 |
Hope that this problem make sence because in my search I found nothing like this, so I need help.
Solution, thanks to Kota Mori
SELECT t1.id, SEC_TO_TIME(SUM(TIME_TO_SEC(TIMEDIFF(t2.puton_time, t2.putoff_time)))) AS total_time
FROM table1 t1
LEFT JOIN table2 t2 ON t1.car_id = t2.car_id
AND t1.part_date >= t2.run_date
GROUP BY t1.id

You first need to join the two tables by the car_id and also a condition that part_date should be no greater than run_date.
Then compute the total minutes for each part_id separately.
The following is a query example for SQLite (The only SQL engine that I have access to right now).
Since SQLite does not have datetime type, I convert strings into unix timestamp by strftime function. This part should be changed in accordance with the SQL engine you are using. Apart from that, this is fairly a standard sql and mostly valid for other SQL dialect.
SELECT
t1.id,
sum(
cast(strftime('%s', t2.putoff_time) as integer) -
cast(strftime('%s', t2.puton_time) as integer)
) / 60 AS total_minutes
FROM
table1 t1
LEFT JOIN
table2 t2
ON
t1.car_id = t2.car_id
AND t1.part_date <= t2.run_date
GROUP BY
t1.id
The result is something like the below. Note that ID 1 gets 80 minutes (1:20) as expected.
id total_minutes
0 1 80
1 2 80
2 3 30

SQL- Invalid query because of aggregate function with simple calculations for each date and ID

I very new to SQL (less than 100Hrs). Problem case is as mentioned below. Every time I try a query either get incorrect output or error that "not contained in either an aggregate function or the GROUP BY clause"
Have tried searching similar questions or example but no results. I am lost now.
Please help
I have three tables
Table Calc,
Source_id | date(yyyymmdd) | metric1 | metric 2
-------------------------------------------------
1 | 20201010 | 2 | 3
2 | 20201010 | 4 | 5
3 | 20201010 | 6 | 7
1 | 20201011 | 8 | 9
2 | 20201011 | 10 | 11
3 | 20201011 | 12 | 13
1 | 20201012 | 14 | 15
2 | 20201012 | 16 | 17
3 | 20201012 | 18 | 19
Table Source
Source_id | Description
------------------------
1 | ABC
2 | DEF
3 | XYZ
Table Factor
Date | Factor
-----------------
20201010 | .3
20201011 | .5
20201012 | .7
If selected dates by user is 20201010 to 20201012 then result will be
Required result
Source_id | Calculated Value
-------------------------------------------------------------------------------
ABC | (((2x3)x.3 + (8x9)x.5 + (14x15)x.7))/(No of dates selected in this case =3)
DEF | (((4x5)x.3+ (10x11)x.5 + (16x17)x.7))/(No of dates selected in this case =3)
XYZ | (((6x7)x.3+ (12x13)x.5 + (18x19)x.7))/(No of dates selected in this case =3)
Dates will be user defined input so the calculated value should be average of that many dates. Selected dates will always be defined in range rather than random multiple selection.
In table calc, source_id and date together will be unique.
Each date has factor which is to be multiplied with all source_id for that date.
If selected dates by user is from 20201010 to 20201011 then result will be
Source_id | Calculated Value
-------------------------------------------------------------------------------
ABC | ((2x3)x.3+(8x9)x.5)/2
DEF | ((4x5)x.3+(10x11)x.5)/2
XYZ | ((6x7)x.3+(12x13)x.5)/2
If selected dates by user is 20201012 then result will be
Source_id | Calculated Value
-------------------------------------------------------------------------------
ABC | (14x15)x.7
DEF | (16x17)x.7
XYZ | (18x19)x.7

Create a CTE to store the starting and ending dates and cross join it to the join of the tables, group by source and aggregate:
WITH cte AS (SELECT '20201010' min_date, '20201012' max_date)
SELECT s.Description,
ROUND(SUM(c.metric1 * c.metric2 * f.factor / (DATEDIFF(day, t.min_date, t.max_date) + 1)), 2) calculated_value
FROM cte t CROSS JOIN Source s
LEFT JOIN Calc c ON c.Source_id = s.Source_id AND c.date BETWEEN t.min_date AND t.max_date
LEFT JOIN Factor f ON f.date = c.date
GROUP BY s.Source_id, s.Description
See the demo.
Results:
> Description | calculated_value
> :---------- | ---------------:
> ABC | 61.60
> DEF | 83.80
> XYZ | 110.00

30 day rolling count of distinct IDs

So after looking at what seems to be a common question being asked and not being able to get any solution to work for me, I decided I should ask for myself.
I have a data set with two columns: session_start_time, uid
I am trying to generate a rolling 30 day tally of unique sessions
It is simple enough to query for the number of unique uids per day:
SELECT
COUNT(DISTINCT(uid))
FROM segment_clean.users_sessions
WHERE session_start_time >= CURRENT_DATE - interval '30 days'
it is also relatively simple to calculate the daily unique uids over a date range.
SELECT
DATE_TRUNC('day',session_start_time) AS "date"
,COUNT(DISTINCT uid) AS "count"
FROM segment_clean.users_sessions
WHERE session_start_time >= CURRENT_DATE - INTERVAL '90 days'
GROUP BY date(session_start_time)
I then I tried several ways to do a rolling 30 day unique count over a time interval
SELECT
DATE(session_start_time) AS "running30day"
,COUNT(distinct(
case when date(session_start_time) >= running30day - interval '30 days'
AND date(session_start_time) <= running30day
then uid
end)
) AS "unique_30day"
FROM segment_clean.users_sessions
WHERE session_start_time >= CURRENT_DATE - interval '3 months'
GROUP BY date(session_start_time)
Order BY running30day desc
I really thought this would work but when looking into the results, it appears I'm getting the same results as I was when doing the daily unique rather than the unique over 30days.
I am writing this query from Metabase using the SQL query editor. the underlying tables are in redshift.
If you read this far, thank you, your time has value and I appreciate the fact that you have spent some of it to read my question.
EDIT:
As rightfully requested, I added an example of the data set I'm working with and the desired outcome.
+-----+-------------------------------+
| UID | SESSION_START_TIME |
+-----+-------------------------------+
| | |
| 10 | 2020-01-13T01:46:07.000-05:00 |
| | |
| 5 | 2020-01-13T01:46:07.000-05:00 |
| | |
| 3 | 2020-01-18T02:49:23.000-05:00 |
| | |
| 9 | 2020-03-06T18:18:28.000-05:00 |
| | |
| 2 | 2020-03-06T18:18:28.000-05:00 |
| | |
| 8 | 2020-03-31T23:13:33.000-04:00 |
| | |
| 3 | 2020-08-28T18:23:15.000-04:00 |
| | |
| 2 | 2020-08-28T18:23:15.000-04:00 |
| | |
| 9 | 2020-08-28T18:23:15.000-04:00 |
| | |
| 3 | 2020-08-28T18:23:15.000-04:00 |
| | |
| 8 | 2020-09-15T16:40:29.000-04:00 |
| | |
| 3 | 2020-09-21T20:49:09.000-04:00 |
| | |
| 1 | 2020-11-05T21:31:48.000-05:00 |
| | |
| 6 | 2020-11-05T21:31:48.000-05:00 |
| | |
| 8 | 2020-12-12T04:42:00.000-05:00 |
| | |
| 8 | 2020-12-12T04:42:00.000-05:00 |
| | |
| 5 | 2020-12-12T04:42:00.000-05:00 |
+-----+-------------------------------+
bellow is what the result I would like looks like:
+------------+---------------------+
| DATE | UNIQUE 30 DAY COUNT |
+------------+---------------------+
| | |
| 2020-01-13 | 3 |
| | |
| 2020-01-18 | 1 |
| | |
| 2020-03-06 | 3 |
| | |
| 2020-03-31 | 1 |
| | |
| 2020-08-28 | 4 |
| | |
| 2020-09-15 | 2 |
| | |
| 2020-09-21 | 1 |
| | |
| 2020-11-05 | 2 |
| | |
| 2020-12-12 | 2 |
+------------+---------------------+
Thank you

You can approach this by keeping a counter of when users are counted and then uncounted -- 30 (or perhaps 31) days later. Then, determine the "islands" of being counted, and aggregate. This involves:
Unpivoting the data to have an "enters count" and "leaves" count for each session.
Accumulate the count so on each day for each user you know whether they are counted or not.
This defines "islands" of counting. Determine where the islands start and stop -- getting rid of all the detritus in-between.
Now you can simply do a cumulative sum on each date to determine the 30 day session.
In SQL, this looks like:
with t as (
select uid, date_trunc('day', session_start_time) as s_day, 1 as inc
from users_sessions
union all
select uid, date_trunc('day', session_start_time) + interval '31 day' as s_day, -1
from users_sessions
),
tt as ( -- increment the ins and outs to determine whether a uid is in or out on a given day
select uid, s_day, sum(inc) as day_inc,
sum(sum(inc)) over (partition by uid order by s_day rows between unbounded preceding and current row) as running_inc
from t
group by uid, s_day
),
ttt as ( -- find the beginning and end of the islands
select tt.uid, tt.s_day,
(case when running_inc > 0 then 1 else -1 end) as in_island
from (select tt.*,
lag(running_inc) over (partition by uid order by s_day) as prev_running_inc,
lead(running_inc) over (partition by uid order by s_day) as next_running_inc
from tt
) tt
where running_inc > 0 and (prev_running_inc = 0 or prev_running_inc is null) or
running_inc = 0 and (next_running_inc > 0 or next_running_inc is null)
)
select s_day,
sum(sum(in_island)) over (order by s_day rows between unbounded preceding and current row) as active_30
from ttt
group by s_day;
Here is a db<>fiddle.

I'm pretty sure the easier way to do this is to use a join. This creates a list of all the distinct users who had a session on each day and a list of all distinct dates in the data. Then it one-to-many joins the user list to the date list and counts the distinct users, the key here is the expanded join criteria that matches a range of dates to a single date via a system of inequalities.
with users as
(select
distinct uid,
date_trunc('day',session_start_time) AS dt
from <table>
where session_start_time >= '2021-05-01'),
dates as
(select
distinct date_trunc('day',session_start_time) AS dt
from <table>
where session_start_time >= '2021-05-01')
select
count(distinct uid),
dates.dt
from users
join
dates
on users.dt >= dates.dt - 29
and users.dt <= dates.dt
group by dates.dt
order by dt desc
;

Classify unnest elements in PostgreSQL

I'm using PostgreSQL 10.6 and i want to read a table with an array of dates. My goal is to classify each date in this array and compare it to the date of the day, to affect them a category : past, present, futur.
With a case statment i've already classify records depending of their values and in the other hand i'm able to unnest elements from this array. But when i try a case statement in unnest elements, response isn't what i'm expecting.
myTable
id(integer) | dates(date[])
---------------------------
1 | {2020-03-17}
2 | {2020-03-17,2020-03-16}
3 | {2020-03-16,2020-03-15}
4 | {2020-03-17,2020-03-18}
5 | {2020-03-16,2020-03-18}
An simple query return me each date in a distinct row
SELECT id, UNNEST(dates) FROM myTable
An other query return me a result by not a good one, because some dates in the past are displayed "Futur" for example.
SELECT
id,
UNNEST(dates),
CASE
WHEN dates < ARRAY[now()::date] THEN 'Past'
WHEN dates = ARRAY[now()::date] THEN 'Present'
WHEN dates > ARRAY[now()::date] THEN 'Futur'
END AS myResult
FROM myTable
ORDER BY UNNEST(dates) DESC
How can i succeed to this result ? I think i'm missing something important.
id | dates | myResult
--------------------------------
4 | {2020-03-18} | Futur
5 | {2020-03-18} | Futur
1 | {2020-03-17} | Present
2 | {2020-03-17} | Present
4 | {2020-03-17} | Present
2 | {2020-03-16} | Past
3 | {2020-03-16} | Past
5 | {2020-03-16} | Past
3 | {2020-03-15} | Past

You need to unnest in the from clause - then you can classify in the from clause:
SELECT
t.id,
d.dt,
CASE
WHEN d.dt < current_date THEN 'Past'
WHEN d.dt = current_date THEN 'Present'
WHEN d.dt > current_date THEN 'Futur'
END AS myResult
FROM myTable t
CROSS JOIN LATERAL UNNEST(t.dates) d(dt)
ORDER BY t.id, d.dt DESC
Demo on DB Fiddle:
id | dt | myresult
-: | :--------- | :-------
1 | 2020-03-17 | Present
2 | 2020-03-17 | Present
2 | 2020-03-16 | Past
3 | 2020-03-16 | Past
3 | 2020-03-15 | Past
4 | 2020-03-18 | Futur
4 | 2020-03-17 | Present
5 | 2020-03-18 | Futur
5 | 2020-03-16 | Past

Get last value and this years value to compare

I have a table on sql server 2014 with date, value, type and project columns, i want to write a query that returns last value from past year and latest value from current year, but if the current year but if there is no entry in this year it returns empty in one row.
My table :
Id | date | value | type | project
1 | 2019-01-01 | 1 | a | p1
2 | 2018-01-01 | 1 | a | p1
3 | 2018-01-01 | 1 | b | p1
Result i expected :
CurrentyearDate | currentyearvalue | pastdate | pastvalue | type | project
2019-01-01 | 1 | 2018-01-01 | 1 | a | p1
Null | null | 2018-01-01 | 1 | b | p1
I've used ROW_NUMBER and partition method but can't make it though

You can try outer self join along with DATEPART:
select t1.date as CurrentyearDate, t1.value as currentyearvalue,
t2.date as pastdate, t2.pastvalue as currentyearvalue ,t2.type ,t2.project
from <<tableName>> t1 right join <<tableName>> t2
on DATEPART(MONTH, t1.lastUpdateTime)=DATEPART(MONTH, t2.lastUpdateTime)
and DATEPART(dd, t1.lastUpdateTime)=DATEPART(dd, t2.lastUpdateTime)
and DATEPART(year, t1.lastUpdateTime)=DATEPART(year, t2.lastUpdateTime)+1;

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas

SQL interpolating missing values for a specific date range - with some conditions - sql

Related

SQL: usage time of item between dates combining two tables

SQL- Invalid query because of aggregate function with simple calculations for each date and ID

30 day rolling count of distinct IDs

Classify unnest elements in PostgreSQL

Get last value and this years value to compare

Categories

Resources