Filling NULL values with preceding Non-NULL values using FIRST_VALUE - sql

I am joining two tables.
In the first table, I have some items starting at a specific time. In the second table, I have values and timestamps for each minute in between the start and end time of each item.
First table
UniqueID Items start_time
123 one 10:00 AM
456 two 11:00 AM
789 three 11:30 AM
Second table
UniqueID Items time_hit value
123 one 10:00 AM x
123 one 10:05 AM x
123 one 10:10 AM x
123 one 10:30 AM x
456 two 11:00 AM x
456 two 11:15 AM x
789 three 11:30 AM x
So when joining the two tables I get this:
UniqueID Items start_time time_hit value
123 one 10:00 AM 10:00 AM x
123 null null 10:05 AM x
123 null null 10:10 AM x
123 null null 10:30 AM x
456 two 11:00 AM 11:00 AM x
456 null null 11:15 AM x
789 three 11:30 AM 11:30 AM x
I'd like to replace these null values with the values from the preceding non-null row...
So the expected result is
UniqueID Items start_time time_hit value
123 one 10:00 AM 10:00 AM x
123 one 10:00 AM 10:05 AM x
123 one 10:00 AM 10:10 AM x
123 one 10:00 AM 10:30 AM x
456 two 11:00 AM 11:00 AM x
456 two 11:00 AM 11:15 AM x
789 three 11:30 AM 11:30 AM x
I tried to build my join using the following function without success:
FIRST_VALUE(Items IGNORE NULLS) OVER (
    PARTITION BY time_hit ORDER BY time_hit
    ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS test
My question was a bit off. I found out that the UniqueIDs were inconsistent, which is why I had these null values in my output. So the accepted answer is a good option for filling null values when joining two tables where one of the tables has more unique rows than the other.

You could use first_value (but last_value would also work in this scenario). The important part is to specify rows between unbounded preceding and current row to set the boundaries of the window.
Answer updated to reflect updated question, and preference for first_value
select
    first_value(t1.UniqueId ignore nulls) over (partition by t2.UniqueId
        order by t2.time_hit
        rows between unbounded preceding and current row) as UniqueId,
    first_value(t1.items ignore nulls) over (partition by t2.UniqueId
        order by t2.time_hit
        rows between unbounded preceding and current row) as Items,
    first_value(t1.start_time ignore nulls) over (partition by t2.UniqueId
        order by t2.time_hit
        rows between unbounded preceding and current row) as start_time,
    t2.time_hit,
    t2.item_value
from table2 t2
left join table1 t1 on t1.start_time = t2.time_hit
order by t2.time_hit;
Result
| UNIQUEID | ITEMS | START_TIME | TIME_HIT | ITEM_VALUE |
|----------|-------|------------|----------|------------|
| 123 | one | 10:00:00 | 10:00:00 | x |
| 123 | one | 10:00:00 | 10:05:00 | x |
| 123 | one | 10:00:00 | 10:10:00 | x |
| 123 | one | 10:00:00 | 10:30:00 | x |
| 456 | two | 11:00:00 | 11:00:00 | x |
| 456 | two | 11:00:00 | 11:15:00 | x |
| 789 | three | 11:30:00 | 11:30:00 | x |
SQL Fiddle Example
Note: I had to use Oracle in SQL Fiddle (so I had to change the data types and a column name). But it should work for your database.
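For reference, since last_value was mentioned above as an alternative: with the same frame it also carries the most recent non-null value forward, so a sketch of the Items column using last_value (same assumed tables and columns as the query above) would be:

-- Sketch only: the same fill expressed with last_value; ignore nulls picks
-- the most recent non-null value inside the frame ending at the current row
select
    last_value(t1.items ignore nulls) over (partition by t2.UniqueId
        order by t2.time_hit
        rows between unbounded preceding and current row) as Items,
    t2.time_hit,
    t2.item_value
from table2 t2
left join table1 t1 on t1.start_time = t2.time_hit
order by t2.time_hit;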

One alternative solution would be to use a NOT EXISTS clause as a JOIN condition, with a correlated subquery that ensures that we are relating to the relevant record.
SELECT t1.items, t1.start_time, t2.time_hit, t2.value
FROM table1 t1
INNER JOIN table2 t2
    ON  t1.items = t2.items
    AND t1.start_time <= t2.time_hit
    AND NOT EXISTS (
        SELECT 1 FROM table1 t10
        WHERE t10.items = t2.items
          AND t10.start_time <= t2.time_hit
          AND t10.start_time > t1.start_time
    )
Demo on DB Fiddle:
| items | start_time | time_hit | value |
| ----- | ---------- | -------- | ----- |
| one | 10:00:00 | 10:00:00 | x |
| one | 10:00:00 | 10:05:00 | x |
| one | 10:00:00 | 10:10:00 | x |
| one | 10:00:00 | 10:30:00 | x |
| two | 11:00:00 | 11:00:00 | x |
| two | 11:00:00 | 11:15:00 | x |
| three | 11:30:00 | 11:30:00 | x |
An alternative that avoids using EXISTS in a JOIN condition (not allowed in BigQuery) is to move that condition to the WHERE clause.
SELECT t1.items, t1.start_time, t2.time_hit, t2.value
FROM table1 t1
INNER JOIN table2 t2
    ON  t1.items = t2.items
    AND t1.start_time <= t2.time_hit
WHERE NOT EXISTS (
    SELECT 1 FROM table1 t10
    WHERE t10.items = t2.items
      AND t10.start_time <= t2.time_hit
      AND t10.start_time > t1.start_time
)
DB Fiddle

I guess you are expecting this output by using an INNER JOIN, but I am not sure why you used FIRST_VALUE.
SELECT I.Items, I.Start_Time, ID.Time_hit, ID.Value
FROM Items I
INNER JOIN ItemDetails ID
ON I.Items = ID.Items
Please explain if there are any specific reasons why this approach would not work for you.

Related

SQL: usage time of item between dates combining two tables

I am trying to create a query that will give me the usage time of each car part between the dates when that part is used. For example, say part id 1 is installed on 2018-03-01, runs for 50 min on 2018-04-01, and then runs for 30 min on 2018-05-10; the total usage of this part should be 1:20 as the result.
These are examples of my tables.
Table1
| id | part_id | car_id | part_date |
|----|-------- |--------|------------|
| 1 | 1 | 3 | 2018-03-01 |
| 2 | 1 | 1 | 2018-03-28 |
| 3 | 1 | 3 | 2018-05-10 |
Table2
| id | car_id | run_date | puton_time | putoff_time |
|----|--------|------------|---------------------|---------------------|
| 1 | 3 | 2018-04-01 | 2018-04-01 12:00:00 | 2018-04-01 12:50:00 |
| 2 | 2 | 2018-04-10 | 2018-04-10 15:10:00 | 2018-04-10 15:20:00 |
| 3 | 3 | 2018-05-10 | 2018-05-10 10:00:00 | 2018-05-10 10:30:00 |
| 4 | 1 | 2018-05-11 | 2018-05-11 12:00:00 | 2018-04-01 12:50:00 |
Table1 contains the dates when each part is installed, table2 contains the usage time of each part, and they are joined on car_id. I have tried to write a query, but it does not work well; if somebody can figure out my mistake in this query, that would be helpful.
My SQL query
SELECT SEC_TO_TIME(SUM(TIME_TO_SEC(TIMEDIFF(t1.puton_time, t1.putoff_time)))) AS total_time
FROM table2 t1
LEFT JOIN table1 t2 ON t1.car_id=t2.car_id
WHERE t2.id=1 AND t1.run_date BETWEEN t2.datum AND
(SELECT COALESCE(MIN(datum), '2100-01-01') AS NextDate FROM table1 WHERE
id=1 AND t2.part_date > part_date);
Expected result
| part_id | total_time |
|---------|------------|
| 1 | 1:20:00 |
I hope this problem makes sense; in my search I found nothing like this, so I need help.
Solution, thanks to Kota Mori
SELECT t1.id, SEC_TO_TIME(SUM(TIME_TO_SEC(TIMEDIFF(t2.putoff_time, t2.puton_time)))) AS total_time
FROM table1 t1
LEFT JOIN table2 t2 ON t1.car_id = t2.car_id
    AND t1.part_date <= t2.run_date
GROUP BY t1.id
You first need to join the two tables on car_id, with an additional condition that part_date should be no greater than run_date.
Then compute the total minutes for each part_id separately.
The following is a query example for SQLite (the only SQL engine that I have access to right now).
Since SQLite does not have a datetime type, I convert the strings into unix timestamps with the strftime function. This part should be changed in accordance with the SQL engine you are using. Apart from that, this is fairly standard SQL and mostly valid for other dialects.
SELECT
    t1.id,
    sum(
        cast(strftime('%s', t2.putoff_time) as integer) -
        cast(strftime('%s', t2.puton_time) as integer)
    ) / 60 AS total_minutes
FROM
    table1 t1
LEFT JOIN
    table2 t2
ON
    t1.car_id = t2.car_id
    AND t1.part_date <= t2.run_date
GROUP BY
    t1.id
The result is something like the below. Note that ID 1 gets 80 minutes (1:20) as expected.
| id | total_minutes |
|----|---------------|
| 1  | 80            |
| 2  | 80            |
| 3  | 30            |

Optimizing results for query with WHERE EXISTS clause

I have this table in postgres:
id | id_datetime | longitude | latitude
--------+---------------------+---------------------+--------------------
639438 | 2018-02-20 18:00:00 | -122.3880011217841 | 37.75538988423265
639439 | 2018-02-20 20:30:00 | -122.38756878451498 | 37.760550220844614
639440 | 2018-02-20 20:05:00 | -122.39640513677658 | 37.76130039041195
639441 | 2018-02-24 10:00:00 | -122.45819139221014 | 37.724317534370066
639442 | 2018-02-10 09:00:00 | -122.44693382058489 | 37.77000760474354
I want an output with all the different IDs which have at least one other ID within the previous 15 minutes and within 1000 meters (geographic distance).
My table has more than 100K rows, so I'm currently using the following query, which works but takes too long. Are there any steps I can take to optimize this?
SELECT DISTINCT
    x.id
FROM table x
WHERE EXISTS (
    SELECT 1
    FROM table t
    WHERE t.id <> x.id
      AND t.id_datetime BETWEEN x.id_datetime - interval '15 minutes' AND x.id_datetime
      AND ST_Distance(geography(ST_MakePoint(x.longitude, x.latitude)),
                      geography(ST_MakePoint(t.longitude, t.latitude))) <= 1000
)
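One direction that usually helps here (a sketch, not tested against this data; the table name events and the index names are illustrative) is to store the point once in an indexed geography column and filter with ST_DWithin, which can use a GiST index, instead of computing ST_Distance for every pair; an index on id_datetime supports the time filter:

-- Sketch, assuming PostGIS; "events" stands in for the real table name
ALTER TABLE events ADD COLUMN geog geography(Point, 4326);
UPDATE events SET geog = ST_SetSRID(ST_MakePoint(longitude, latitude), 4326)::geography;
CREATE INDEX events_geog_idx ON events USING GIST (geog);
CREATE INDEX events_dt_idx ON events (id_datetime);

SELECT DISTINCT x.id
FROM events x
WHERE EXISTS (
    SELECT 1
    FROM events t
    WHERE t.id <> x.id
      AND t.id_datetime BETWEEN x.id_datetime - interval '15 minutes' AND x.id_datetime
      AND ST_DWithin(x.geog, t.geog, 1000)  -- 1000 meters, can use the GiST index
);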

SQL interpolating missing values for a specific date range - with some conditions

There are some similar questions on the site, but I believe mine warrants a new post because there are specific conditions that need to be incorporated.
I have a table with monthly intervals, structured like this:
+----+--------+--------------+--------------+
| ID | amount | interval_beg | interval_end |
+----+--------+--------------+--------------+
| 1 | 10 | 12/17/2017 | 1/17/2018 |
| 1 | 10 | 1/18/2018 | 2/18/2018 |
| 1 | 10 | 2/19/2018 | 3/19/2018 |
| 1 | 10 | 3/20/2018 | 4/20/2018 |
| 1 | 10 | 4/21/2018 | 5/21/2018 |
+----+--------+--------------+--------------+
I've found that sometimes there is a month of data missing around the end/beginning of the year where I know it should exist, like this:
+----+--------+--------------+--------------+
| ID | amount | interval_beg | interval_end |
+----+--------+--------------+--------------+
| 2 | 10 | 10/14/2018 | 11/14/2018 |
| 2 | 10 | 11/15/2018 | 12/15/2018 |
| 2 | 10 | 1/17/2019 | 2/17/2019 |
| 2 | 10 | 2/18/2019 | 3/18/2019 |
| 2 | 10 | 3/19/2019 | 4/19/2019 |
+----+--------+--------------+--------------+
What I need is a statement that will:
1. Identify where this year-end period is missing (but not find missing months that aren't at the beginning/end of the year).
2. Create this interval by using the length of an existing interval for that ID (maybe using the mean interval length for the ID to do it?). I could create the interval from the "gap" between the previous and next interval, except that won't work if I'm missing an interval at the beginning or end of the ID's record (i.e. if the record starts at, say, 1/16/2015, I need the amount for 12/15/2014-1/15/2015).
3. Interpolate an 'amount' for this interval using the mean daily 'amount' from the closest existing interval.
The end result for the sample above should look like:
+----+--------+--------------+--------------+
| ID | amount | interval_beg | interval_end |
+----+--------+--------------+--------------+
| 2 | 10 | 10/14/2018 | 11/14/2018 |
| 2 | 10 | 11/15/2018 | 12/15/2018 |
| 2 | 10 | 12/16/2018 | 1/16/2019 |
| 2 | 10 | 1/17/2019 | 2/17/2019 |
| 2 | 10 | 2/18/2019 | 3/18/2019 |
+----+--------+--------------+--------------+
A 'nice to have' would be a flag indicating that this value is interpolated.
Is there a way to do this efficiently in SQL? I have written a solution in SAS, but have a need to move it to SQL, and my SAS solution is very inefficient (optimization isn't a goal, so any statement that does what I need is fantastic).
EDIT: I've made an SQLFiddle with my example table here:
http://sqlfiddle.com/#!18/8b16d
You can use a sequence of CTEs to build up the data for the missing periods. In this query, the first CTE (EOYS) generates all the end-of-year dates (YYYY-12-31) relevant to the table; the second (INTERVALS) computes the average interval length for each ID; and the third (MISSING) attempts to find the start (from t2) and end (from t3) dates of the adjoining intervals for any missing (indicated by t1.ID IS NULL) end-of-year interval. The output of this CTE is then used in an INSERT ... SELECT query to add the missing interval records to the table, generating missing dates by adding/subtracting the interval length to/from the end/start date of the adjacent interval as necessary.
First, though, we add the interp column to indicate whether a row was interpolated:
ALTER TABLE Table1 ADD interp TINYINT NOT NULL DEFAULT 0;
This sets interp to 0 for all existing rows. Then we can do the INSERT, setting interp for all those rows to 1:
WITH EOYS AS (
    SELECT DISTINCT DATEFROMPARTS(DATEPART(YEAR, interval_beg), 12, 31) AS eoy
    FROM Table1
),
INTERVALS AS (
    SELECT ID, AVG(DATEDIFF(DAY, interval_beg, interval_end)) AS interval_len
    FROM Table1
    GROUP BY ID
),
MISSING AS (
    SELECT e.eoy,
           ids.ID,
           i.interval_len,
           COALESCE(t2.amount, t3.amount) AS amount,
           DATEADD(DAY, 1, t2.interval_end) AS interval_beg,
           DATEADD(DAY, -1, t3.interval_beg) AS interval_end
    FROM EOYS e
    CROSS JOIN (SELECT DISTINCT ID FROM Table1) ids
    JOIN INTERVALS i ON i.ID = ids.ID
    LEFT JOIN Table1 t1 ON ids.ID = t1.ID
        AND e.eoy BETWEEN t1.interval_beg AND t1.interval_end
    LEFT JOIN Table1 t2 ON ids.ID = t2.ID
        AND DATEADD(MONTH, -1, e.eoy) BETWEEN t2.interval_beg AND t2.interval_end
    LEFT JOIN Table1 t3 ON ids.ID = t3.ID
        AND DATEADD(MONTH, 1, e.eoy) BETWEEN t3.interval_beg AND t3.interval_end
    WHERE t1.ID IS NULL
)
INSERT INTO Table1 (ID, amount, interval_beg, interval_end, interp)
SELECT ID,
       amount,
       COALESCE(interval_beg, DATEADD(DAY, -interval_len, interval_end)) AS interval_beg,
       COALESCE(interval_end, DATEADD(DAY, interval_len, interval_beg)) AS interval_end,
       1 AS interp
FROM MISSING
This adds the following rows to the table:
ID amount interval_beg interval_end interp
2 10 2017-12-05 2018-01-04 1
2 10 2018-12-16 2019-01-16 1
2 10 2019-12-28 2020-01-27 1
Demo on SQLFiddle

SQL: Find the longest date gap from multiple tables

I need some help.
I have two tables like this.
Table Person
p_id | name | registration date
-----------------------------
1 | ABC | 2018-01-01
2 | DEF | 2018-02-02
3 | GHI | 2018-03-01
4 | JKL | 2018-01-02
5 | MNO | 2018-02-01
6 | PQR | 2018-03-02
Table Order
Order_id| p_id | order_date
----------------------------
123 | 1 | 2018-01-05
345 | 2 | 2018-02-06
678 | 3 | 2018-03-07
910 | 4 | 2018-01-08
012 | 3 | 2018-03-04
234 | 4 | 2018-01-05
567 | 5 | 2018-02-08
890 | 6 | 2018-03-09
I need to find out how many days long the longest period is during which these two tables aren't updated.
What's the easiest query to get the result in SQL?
Thank you
UPDATE:
The result should show the longest date gap between order_date and registration_date. Because the longest gap is between 2018-01-08 and 2018-02-01, the result should return '24'.
Try this:
SELECT MAX(DATE_PART('day', now() - '2018-02-15'::TIMESTAMP)) FROM person p
JOIN order o
USING (p_id)
Assuming current PostgreSQL and lots of orders per person on avg., this should be among the fastest options:
SELECT o.order_date - p.registration_date AS days
FROM person p
CROSS JOIN LATERAL (
    SELECT order_date
    FROM "order"  -- order is a reserved word!
    WHERE p_id = p.p_id
    ORDER BY 1 DESC  -- assuming NOT NULL
    LIMIT 1
) o
ORDER BY 1 DESC
LIMIT 1;
Needs an index on "order"(p_id, order_date).
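If that index does not exist yet, it can be created like this (the index name is illustrative):

CREATE INDEX order_pid_orderdate_idx ON "order" (p_id, order_date);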
Detailed explanation:
Optimize GROUP BY query to retrieve latest record per user
Select first row in each GROUP BY group?
You seem to want:
select max(o.order_date - p.registration_date)
from person p join
orders o
on p.p_id = o.p_id;
select max((date_part('day',age(order_date, "registration date")))) + 1 as dif
from (
    select "p_id", max(order_date) as order_date
    from "Order"
    group by "p_id"
) T1
left join Person T2 on T1.p_id = T2.p_id
| maxday |
|--------|
| 8 |
[SQL Fiddle DEMO LINK]
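For the updated requirement (the longest gap between consecutive update dates across both tables, 24 days in the sample), a possible sketch is to merge both date columns into one stream of events and take the largest gap between consecutive dates; this assumes Postgres-style date arithmetic and the registration_date / order_date column names used in the answers above:

-- Sketch: treat registration dates and order dates as one stream of "updates"
WITH events AS (
    SELECT registration_date AS d FROM person
    UNION ALL
    SELECT order_date AS d FROM "order"
),
gaps AS (
    -- difference in days between each update date and the previous one
    SELECT d - LAG(d) OVER (ORDER BY d) AS gap_days
    FROM events
)
SELECT MAX(gap_days) AS longest_gap_days
FROM gaps;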

Can I put a condition on a window function in Redshift?

I have an events-based table in Redshift. I want to tie all events to the FIRST event in the series, provided that event falls within the N hours preceding this event.
If all I cared about was the very first row, I'd simply do:
SELECT
event_time
,first_value(event_time)
OVER (ORDER BY event_time rows unbounded preceding) as first_time
FROM
my_table
But because I only want to tie this to the first event in the past N-hours, I want something like:
SELECT
event_time
,first_value(event_time)
OVER (ORDER BY event_time rows between [N-hours ago] and current row) as first_time
FROM
my_table
A little background on my table. It's user actions, so effectively a user jumps on, performs 1-100 actions, and then leaves. Most users are 1-10x per day. Sessions rarely last over an hour, so I could set N=1.
If I just set a PARTITION BY date_trunc('hour', event_time), I'll end up splitting sessions that span the hour boundary.
Assume my_table looks like
id | user_id | event_time
----------------------------------
1 | 123 | 2015-01-01 01:00:00
2 | 123 | 2015-01-01 01:15:00
3 | 123 | 2015-01-01 02:05:00
4 | 123 | 2015-01-01 13:10:00
5 | 123 | 2015-01-01 13:20:00
6 | 123 | 2015-01-01 13:30:00
My goal is to get a result that looks like
id | parent_id | user_id | event_time
----------------------------------
1 | 1 | 123 | 2015-01-01 01:00:00
2 | 1 | 123 | 2015-01-01 01:15:00
3 | 1 | 123 | 2015-01-01 02:05:00
4 | 4 | 123 | 2015-01-01 13:10:00
5 | 4 | 123 | 2015-01-01 13:20:00
6 | 4 | 123 | 2015-01-01 13:30:00
The answer appears to be "no" as of now.
SQL Server has functionality for using RANGE instead of ROWS in the frame definition, which allows the query to compare values to the current row's value.
https://www.simple-talk.com/sql/learn-sql-server/window-functions-in-sql-server-part-2-the-frame/
When I attempt this syntax in Redshift, I get the error "Range is not yet supported".
Someone update this when that "yet" changes!
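Until that changes, one commonly used workaround (a sketch only, assuming the table and columns from the question and a 1-hour threshold) is gaps-and-islands sessionization: flag each row whose gap from the previous event exceeds the threshold with LAG, turn the flags into session numbers with a running SUM, and take the minimum id per session as the parent. For the sample data this reproduces the desired parent_id column:

-- Sketch: tie each event to the first event of its "session"
-- (consecutive events less than 1 hour apart), matching the desired output
WITH flagged AS (
    SELECT id, user_id, event_time,
           CASE WHEN LAG(event_time) OVER (PARTITION BY user_id ORDER BY event_time) IS NULL
                  OR event_time > LAG(event_time) OVER (PARTITION BY user_id ORDER BY event_time)
                                  + interval '1 hour'
                THEN 1 ELSE 0 END AS is_session_start
    FROM my_table
),
sessions AS (
    SELECT id, user_id, event_time,
           SUM(is_session_start) OVER (PARTITION BY user_id ORDER BY event_time
                                       ROWS UNBOUNDED PRECEDING) AS session_no
    FROM flagged
)
SELECT id,
       MIN(id) OVER (PARTITION BY user_id, session_no) AS parent_id,
       user_id,
       event_time
FROM sessions
ORDER BY id;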