Merging two data sets on closest date efficiently in PostgreSQL - sql

I'm trying to merge two tables with different time resolutions on their closest dates.
The tables are like this:
Table1:
id | date | device | value1
----------------------------------
1 | 10:22 | 13 | 0.53
2 | 10:24 | 13 | 0.67
3 | 10:25 | 14 | 0.83
4 | 10:25 | 13 | 0.32
Table2:
id | date | device | value2
----------------------------------
22 | 10:18 | 13 | 0.77
23 | 10:21 | 14 | 0.53
24 | 10:23 | 13 | 0.67
25 | 10:28 | 14 | 0.83
26 | 10:31 | 13 | 0.23
I want to merge these tables along the first one: to each row of Table1 I want to append the latest value2 recorded for the same device before that row's date.
Result:
id | date | device | value1 | value2
-------------------------------------------
1 | 10:22 | 13 | 0.53 | 0.77
2 | 10:24 | 13 | 0.67 | 0.67
3 | 10:25 | 14 | 0.83 | 0.53
4 | 10:25 | 13 | 0.32 | 0.67
I have some (20-30) devices, thousands of rows in Table2 (=m) and millions of them in Table1 (=n).
I could sort both tables by date (O(n log n)), write them to a text file, and iterate over Table1 merge-style while pulling data from Table2 until it is newer (I'd have to manage ~20-30 pointers to the latest data, one per device, but no more), and after the merge I could upload the result back to the database. The complexity would then be O(n log n) for the sorting and O(n + m) for iterating over the tables.
But it would be much better to do it entirely in the database. The best query I could achieve, however, has O(n^2) complexity:
SELECT DISTINCT ON (Table1.id)
Table1.id, Table1.date, Table1.device, Table1.value1, Table2.value2
FROM Table1, Table2
WHERE Table1.date > Table2.date and Table1.device = Table2.device
ORDER BY Table1.id, Table1.date-Table2.date;
It's really slow for the amount of data I need to process. Are there better ways to do this, or should I just do that work on the downloaded data?

Your query can be rewritten as:
SELECT DISTINCT ON (t1.id)
t1.id, t1.date, t1.device, t1.value1, t2.value2
FROM table1 t1
JOIN table2 t2 USING (device)
WHERE t1.date > t2.date
ORDER BY t1.id, t2.date DESC;
There is no need to calculate a date difference for every combination of rows (which is expensive and not sargable); just pick the row with the greatest t2.date from each set. Index support is advisable.
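For example, a multicolumn index along these lines would support it (a sketch; the index name is arbitrary, the table and column names are taken from the question):
CREATE INDEX table2_device_date_idx ON table2 (device, date);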
Details for DISTINCT ON:
Select first row in each GROUP BY group?
That's probably still not fast enough, though. Given your data distribution you would need a loose index scan, which can be emulated with correlated subqueries (like Gordon's query) or with the more modern and versatile JOIN LATERAL:
SELECT t1.id, t1.date, t1.device, t1.value1, t2.value2
FROM table1 t1
LEFT JOIN LATERAL (
   SELECT value2
   FROM   table2
   WHERE  device = t1.device
   AND    date < t1.date
   ORDER  BY date DESC
   LIMIT  1
   ) t2 ON TRUE;
The LEFT JOIN avoids losing rows when no match is found in t2. Details:
Optimize GROUP BY query to retrieve latest row per user
But that's still not very fast, since you have "thousands of rows in Table2 and millions of them in Table1".
Two ideas, probably faster, but also more complex:
1. UNION ALL plus window functions
Combine Table1 and Table2 in a UNION ALL query and run a window function over the derived table. This is enhanced by the "moving aggregate support" in Postgres 9.4 or later.
SELECT id, date, device, value1, value2
FROM  (
   SELECT id, date, device, value1
        , min(value2) OVER (PARTITION BY device, grp) AS value2
   FROM  (
      SELECT *
           , count(value2) OVER (PARTITION BY device ORDER BY date) AS grp
      FROM  (
         SELECT id, date, device, value1, NULL::numeric AS value2
         FROM   table1
         UNION  ALL
         SELECT id, date, device, NULL::numeric AS value1, value2
         FROM   table2
         ) s1
      ) s2
   ) s3
WHERE  value1 IS NOT NULL
ORDER  BY date, id;
You'll have to test if it can compete. Sufficient work_mem allows in-memory sorting.
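For instance, you might raise it for the session before running the query (the value below is only illustrative; size it to your data and available RAM):
SET work_mem = '256MB';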
db<>fiddle here for all three queries
Old sqlfiddle
2. PL/pgSQL function
Cursor for each device in Table2, loop over Table1, pick the value from respective device-cursor after advancing until cursor.date > t1.date and keeping value2 from the row before last. Similar to the winning implementation here:
Window Functions or Common Table Expressions: count previous rows within range
Probably fastest, but more code to write.
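As a rough illustration of the procedural idea, here is a sketch that makes a single ordered pass over both tables and carries the last seen value2 per device, rather than managing one cursor per device. The function name merge_closest is arbitrary and the column types are assumptions, so adapt them to your schema:
CREATE OR REPLACE FUNCTION merge_closest()
  RETURNS TABLE (id int, date timestamp, device int, value1 numeric, value2 numeric)
  LANGUAGE plpgsql AS
$func$
DECLARE
   r         record;
   last_dev  int;
   last_val2 numeric;   -- latest value2 seen so far for the current device
BEGIN
   FOR r IN
      SELECT u.*
      FROM  (SELECT t1.id, t1.date, t1.device, t1.value1, NULL::numeric AS value2
             FROM   table1 t1
             UNION  ALL
             SELECT t2.id, t2.date, t2.device, NULL::numeric, t2.value2
             FROM   table2 t2) u
      ORDER  BY u.device, u.date, (u.value1 IS NULL)   -- on equal dates, Table1 rows come first
   LOOP
      IF last_dev IS DISTINCT FROM r.device THEN
         last_dev  := r.device;
         last_val2 := NULL;                -- new device: forget the previous device's value2
      END IF;

      IF r.value1 IS NULL THEN             -- row from Table2: remember its value2
         last_val2 := r.value2;
      ELSE                                 -- row from Table1: emit it with the latest value2
         RETURN QUERY SELECT r.id, r.date, r.device, r.value1, last_val2;
      END IF;
   END LOOP;
END
$func$;
You would then call SELECT * FROM merge_closest(); and compare its run time against the set-based variants.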

Because Table2 is so much smaller, it might be more efficient to use a correlated subquery:
select t1.*,
       (select t2.value2
        from table2 t2
        where t2.device = t1.device and t2.date <= t1.date
        order by t2.date desc
        limit 1
       ) as value2
from table1 t1;
Also create an index on table2(device, date, value2) for performance.
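In Postgres syntax that could be, for example (the index name is only illustrative):
CREATE INDEX table2_device_date_value2_idx ON table2 (device, date, value2);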

Related

pulling rows with a max(column) value but with regards to being smaller than another column in SQL

I have two different data sets, and I want to append the columns of one onto the other based on its date being the max date that is still less than the date in the other data set, in SQL.
Example Table 1)
| ID | date | value |
| 05 | 10/13 | ab |
| 10 | 10/15 | sd |
Example Table 2)
| ID2 | date2 | value2 |
| 05 | 10/10 | rf |
| 05 | 10/23 | tx |
| 10 | 10/01 | jk |
| 10 | 10/12 | fr |
| 10 | 10/23 | as |
And the resulting table I want is:
| ID | date | value | date2 | value2 |
| 05 | 10/13 | ab | 10/10 | rf |
| 10 | 10/15 | sd | 10/12 | fr |
When I try to code it, I can't seem to get the correct result. I have tried something like this but I get an error:
select
t1.*
into final
from table1 t1
left join
(select
ID2,
date2,
value2
from
table2
Where max(date2 < date)) AS t2
on t1.ID = t2.ID2;
As noted in the comments, you don't have dates in your sample data. If we pretend you indeed have dates such as those found in this fiddle, then the following could work, using CTEs.
with max_date as (
   select t2.id, max(t2.date) as max_t2_date
   from table2 t2
   join table1 t1
     on t2.id = t1.id
   where t2.date < t1.date
   group by t2.id
),
max_date_value as (
   select t2.id, t2.value, mm.max_t2_date
   from table2 t2
   join max_date mm
     on t2.id = mm.id
    and t2.date = mm.max_t2_date
)
select t1.id, t1.date, t1.value, mdv.max_t2_date as date2, mdv.value as value2
from table1 t1
left join max_date_value mdv
  on t1.id = mdv.id
Output:
| id | date                     | value | date2                    | value2 |
| 05 | 2021-10-13T00:00:00.000Z | ab    | 2021-10-10T00:00:00.000Z | rf     |
| 10 | 2021-10-15T00:00:00.000Z | sd    | 2021-10-12T00:00:00.000Z | fr     |
There are likely shorter ways to achieve this, but that depends on knowing the RDBMS. This should at least get you started so you can play around with different methods.
Assuming that the date and date2 columns are of type CHAR(5) and store values like MM/DD, you can get the expected results with a query like this:
WITH t2_rn AS (
SELECT
*,
ROW_NUMBER() OVER (PARTITION BY id2 ORDER BY date2 DESC) AS rn2
FROM table2
WHERE EXISTS (
SELECT 1 FROM table1
WHERE id = id2 AND date > date2
)
)
SELECT
id,
date,
value,
date2,
value2
FROM table1
LEFT JOIN t2_rn ON id = id2 AND rn2 = 1
If you are using SQL Server, then you can use the OUTER APPLY operator, and the query will look like this:
SELECT
id,
date,
value,
date2,
value2
FROM table1
OUTER APPLY (
SELECT TOP 1
date2,
value2
FROM table2
WHERE id = id2 AND date > date2 ORDER BY date2 DESC
) t2
Both queries have the same output:
| id | date  | value | date2 | value2 |
| 5  | 10/13 | ab    | 10/10 | rf     |
| 10 | 10/15 | sd    | 10/12 | fr     |
You can check a working demo with both queries here

Find the nearest future or equal to date from a table of dates with sql stmt

I have two tables
Table 1
ID | T1_Date
---+-------------
1 | 09/08/2020
2 | 09/30/2020
Table 2
T2_Date | Label
-----------+-----
08/31/2020 | Aug-20
09/20/2020 | Sep-20
10/25/2020 | Oct-20
I'm trying to have the result link the nearest future date label from table 2 with each record in table 1. So my output would look like:
ID | T1_Date | Label
---+------------+--------
1 | 09/08/2020 | Sep-20
2 | 09/30/2020 | Oct-20
So far I can only return all the records that are greater than the T1_Date value, so it repeats all the labels.
Is there a way to just grab the nearest future date label or the equal to label?
One method is a correlated subquery:
select t1.*,
       (select max(t2.label) keep (dense_rank first order by t2.date asc)
        from t2
        where t2.date >= t1.date
       ) as label
from t1;
There are many ways you can solve this problem.
One way which is pretty simple but not optimal:
select t1.*,
       (select Label
        from table2
        where T2_Date = (select min(T2_Date)
                         from table2
                         where T2_Date >= t1.T1_Date)) as Label
from table1 t1
-----------------------------------------------------------
The second way:
with TempTable as
(
   select T1_Date, min(T2_Date) as T2_Date
   from table2
   left join table1 on T2_Date >= T1_Date
   group by T1_Date
)
select t1.T1_Date, t2.Label
from TempTable temp
left join table1 t1 on t1.T1_Date = temp.T1_Date
left join table2 t2 on t2.T2_Date = temp.T2_Date

Assistance with SQL Query in Oracle

I need your help with the following:
I have a table like this:
Table_Values
ID | Value | Date
1 | ASD | 01-Jan-2019
2 | ZXC | 10-Jan-2019
3 | ASD | 01-Jan-2019
4 | QWE | 05-Jan-2019
5 | RTY | 15-Jan-2019
6 | QWE | 29-Jan-2019
What I need is to get the values that are duplicated and have different dates. For example, the value "QWE" is duplicated and has different dates:
ID | Value | Date
4 | QWE | 05-Jan-2019
6 | QWE | 29-Jan-2019
With EXISTS:
select * from Table_Values t
where exists (
select 1 from Table_Values
where value = t.value and date <> t.date
)
Using Join:
select
t1.*
from
Table_Values t1
join
Table_Values t2
on t1.Value = t2.Value
and t1.Date <> t2.Date
However, the EXISTS approach is better.
You want all rows where there is more than one date per value. You can use COUNT OVER for this.
One method (featured as of Oracle 12c):
select id, value, date
from mytable
order by case when count(distinct date) over (partition by value) > 1 then 1 else 2 end
fetch first row with ties
But you'll have to put this into a subquery (derived table / CTE) if you want the result sorted.
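For example, the wrapped form could look like this (a sketch reusing the query above):
select *
from (
  select id, value, date
  from mytable
  order by case when count(distinct date) over (partition by value) > 1 then 1 else 2 end
  fetch first row with ties
)
order by id, value, date;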
And another method without FETCH FIRST clause (valid as of Oracle 8i):
select id, value, date
from
(
select id, value, date, count(distinct date) over (partition by value) as cnt
from mytable
)
where cnt > 1
order by id, value, date;
forpas' solution with EXISTS may be faster, though. Well, pick whichever method you like better :-)
EXISTS uses a correlated subquery, so I don't think it's better than a JOIN. However, the Oracle optimizer could rewrite the EXISTS as a join.
I like to use the JOIN in the classic way :)
SELECT t1.*
FROM table_values t1, table_values t2
WHERE t1.f_value = t2.f_value
AND t1.f_date <> t2.f_date
ORDER BY 1;

MS Access: Compare 2 tables with duplicates

I have two tables which look like this:
T1:
ID | Date | Hour
T2:
ID | Date | Hour
I basically need to join these tables when their IDs, dates, and hours match. However, I only want to return the results from table 1 that do not match up with the results in table 2.
I know this seems simple, but where I'm stuck is the fact that there are multiple rows in table 1 that match up with table 2 (there are multiple intervals for any given hour). I need to return all of these intervals so long as they do not fall within the same hour period in table 2.
Example data:
T1:
1 | 1/1/2011 | 1
1 | 1/1/2011 | 1
1 | 1/1/2011 | 1
1 | 1/1/2011 | 2
T2:
1 | 1/1/2011 | 1
1 | 1/1/2011 | 1
My expected result set for this would be the last 2 rows from T1. Can anyone point me in the right direction?
I think you just want not exists:
select t1.*
from t1
where not exists (select 1
from t2
where t2.id = t1.id and t2.date = t1.date and t2.hour = t1.hour
);
EDIT:
I misread the question. This is very hard to do in MS Access, but you can come close. The following returns the distinct rows in table 1 that do not have equivalent counts in table 2:
select t1.id, t1.date, t1.hour, (t1.cnt - nz(t2.cnt, 0)) as extra
from (select id, date, hour, count(*) as cnt
      from t1
      group by id, date, hour
     ) t1 left join
     (select id, date, hour, count(*) as cnt
      from t2
      group by id, date, hour
     ) t2
     on t2.id = t1.id and t2.date = t1.date and t2.hour = t1.hour
where t2.cnt is null or t1.cnt > t2.cnt;

How to get a single result with columns from multiple records in a single table?

Platform: Oracle 10g
I have a table (let's call it t1) like this:
ID | FK_ID | SOME_VALUE | SOME_DATE
----+-------+------------+-----------
1 | 101 | 10 | 1-JAN-2013
2 | 101 | 20 | 1-JAN-2014
3 | 101 | 30 | 1-JAN-2015
4 | 102 | 150 | 1-JAN-2013
5 | 102 | 250 | 1-JAN-2014
6 | 102 | 350 | 1-JAN-2015
For each FK_ID I wish to show a single result showing the two most recent SOME_VALUEs. That is:
FK_ID | CURRENT | PREVIOUS
------+---------+---------
101 | 30 | 20
102 | 350 | 250
There is another table (let's call it t2) for the FK_ID, and it is here that there is a reference saying which is the 'CURRENT' record. So, a table like:
ID | FK_CURRENT | OTHER_FIELDS
----+------------+-------------
101 | 3 | ...
102 | 6 | ...
I was attempting this with a flawed sub query join along the lines of:
SELECT id, curr.some_value as current, prev.some_value as previous FROM t2
JOIN t1 curr ON t2.fk_current = t1.id
JOIN t1 prev ON t1.id = (
SELECT * FROM (
SELECT id FROM (
SELECT id, ROW_NUMBER() OVER (ORDER BY SOME_DATE DESC) as rno FROM t1
WHERE t1.fk_id = t2.id
) WHERE rno = 2
)
)
However, the t1.fk_id = t2.id part is flawed (i.e. it won't run), as (I now know) you can't reference a parent field value in a subquery more than one level deep.
Then I started wondering if Common Table Expressions (CTEs) are the tool for this, but I have no experience using them (so I would like to know I'm not going down the wrong track in attempting to use them, if that is indeed the right tool).
So I guess the key complexity that is tripping me up is:
Determining the previous value by ordering, but while limiting it to the first record (and not the whole table). (Hence the somewhat convoluted sub query attempt.)
Otherwise, I can just write some code to first execute a query to get the 'current' value, and then
execute a second query to get the 'previous' - but I'd love to know how to solve this with a single
SQL query as it seems this would be a common enough thing to do (sure is with the DB I need to work
with).
Thanks!
Try an approach with LAG function:
SELECT FK_ID,
       SOME_VALUE AS "CURRENT",
       PREV_VALUE AS Previous
FROM (
      SELECT t1.*,
             lag(some_value) OVER (PARTITION BY fk_id ORDER BY some_date) AS prev_value
      FROM t1
     ) x
JOIN t2 ON t2.id = x.fk_id
       AND t2.fk_current = x.id
Demo: http://sqlfiddle.com/#!4/d3e640/15
Try out this:
select t1.fk_id,
       t1.some_value as "CURRENT",
       (select some_value from t1 where p1.id2 = t1.id and t1.fk_id = p1.fk_id) as previous
from t1
inner join (
       select t1.fk_id, max(t1.id) as id1, max(t1.id) - 1 as id2
       from t1
       group by t1.fk_id
      ) p1 on t1.id = p1.id1