SQL Join based on dates- Table2.Date=Next date after Table1.Date - sql

I have two separate tables which I want to join based on dates. However, I don't want the dates in the two tables to be equal; I want each date (and its accompanying value) from one table to be joined with the next available date after it in the second table.
I've put an example of the problem below:
Table 1:
Date Value
2015-04-13 A
2015-04-10 B
2015-04-09 C
2015-04-08 D
Table 2:
Date Value
2015-04-13 E
2015-04-10 F
2015-04-09 G
2015-04-08 H
Desired Output Table:
Table1.Date Table2.Date Table1.Value Table2.Value
2015-04-10 2015-04-13 B E
2015-04-09 2015-04-10 C F
2015-04-08 2015-04-09 D G
I'm at a bit of a loss as to where to even start with this, hence the lack of a starting SQL query.
Hopefully that is clear. I found this related question that comes close, but I get lost on incorporating it into a join statement:
SQL - Select next date query
Any help is much appreciated!
M.
EDIT: One important consideration is that the next date will not always be simply 1 day later. The query needs to find the next available date. This was in the original question, but I've updated my example to reflect it.

Since you want the next available date, and that might not necessarily be the following date (eg. date + 1) you'll want to use a correlated subquery with either min or top 1.
This will give you the desired output:
;WITH src AS (
SELECT
Date,
NextDate = (SELECT MIN(Date) FROM Table2 WHERE Date > t1.Date)
FROM table1 t1
)
SELECT src.Date, src.NextDate, t1.Value, t2.Value
FROM src
JOIN Table1 t1 ON src.Date = t1.Date
JOIN Table2 t2 ON src.NextDate = t2.Date
WHERE src.NextDate IS NOT NULL
ORDER BY src.Date DESC
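To check the idea against the sample data from the question, here is a minimal sketch that runs the same MIN-based correlated subquery in an in-memory SQLite database from Python. SQLite and the helper name next_date_join are my own assumptions for illustration (the answer above targets SQL Server), and the subquery is placed directly in the ON clause instead of a CTE.

```python
import sqlite3

def next_date_join(rows1, rows2):
    """Pair each Table1 row with the Table2 row on the next later date."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE Table1 (Date TEXT, Value TEXT)")
    con.execute("CREATE TABLE Table2 (Date TEXT, Value TEXT)")
    con.executemany("INSERT INTO Table1 VALUES (?, ?)", rows1)
    con.executemany("INSERT INTO Table2 VALUES (?, ?)", rows2)
    # ISO-8601 date strings compare correctly as plain text.
    query = """
        SELECT t1.Date, t2.Date, t1.Value, t2.Value
        FROM Table1 AS t1
        JOIN Table2 AS t2
          ON t2.Date = (SELECT MIN(x.Date) FROM Table2 AS x WHERE x.Date > t1.Date)
        ORDER BY t1.Date DESC
    """
    result = con.execute(query).fetchall()
    con.close()
    return result

table1 = [("2015-04-13", "A"), ("2015-04-10", "B"), ("2015-04-09", "C"), ("2015-04-08", "D")]
table2 = [("2015-04-13", "E"), ("2015-04-10", "F"), ("2015-04-09", "G"), ("2015-04-08", "H")]
pairs = next_date_join(table1, table2)
```

The row for 2015-04-13 drops out because no later date exists in Table2, which matches the desired output in the question.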

Try this (note that it only matches when the next date in Table 2 is exactly one day later):
select [Table 1].Date,[Table 1].Value,[Table 2].Date,[Table 2].Value
from [Table 1]
join [Table 2]
on dateadd(dd,1,[Table 1].Date) = [Table 2].Date

I'd go with a cross apply:
SELECT t1.*, t2.*
FROM Table1 t1
CROSS APPLY (
SELECT TOP 1 *
FROM Table2 t2
WHERE t2.Date > t1.Date
ORDER BY t2.Date) t2
ORDER BY t1.Date DESC

Related

How to join two SQL tables by extracting maximum numbers from one then into another?

As others have commented, I'm now going to add some code:
Imported tables
table3
Case No. is the primary key. Each report date shows one patient. Depending on if the patient is import or local, the cumulative column increases. You can see some days there are no cases so the date like 25/01/2020 is skipped
table2
Report date has no duplicate.
Now, I want to join the tables. Example outcome here:
[image: example outcome table]
The maximum cumulative of each date is joined into the new table. So although 26/01/2020 of table3 shows the increase from 6, 7, to 8, I only want the highest cumulative number there.
Thanks for letting me know how my previous query could be improved. Your opinion helps me a lot.
I have tried Gordon Linoff's by substituting the actual names (which I initially omitted because I thought they were ambiguous).
His code is as follows (I've upvoted):
SELECT t3.`Report date`,
max(max(t3.cumulative_local)) over (order by t3.`Report date`),
max(max(t3.cumulative_import)) over (order by t3.`Report date`)
from table3 t3 left join
table2 t2
using (`Report date`)
group by t2.`Report date`;
But I got an error
Error Code: 1055. Expression #1 of SELECT list is not in GROUP BY clause and contains nonaggregated column 'new.t3.Report date' which is not functionally dependent on columns in GROUP BY clause; this is incompatible with sql_mode=only_full_group_by
Anyways I am now experimenting. Both answers helped. If you know how to fix 1055, let me know, or if you could propose another solution. Thanks
I think you just want aggregation and window functions:
select t1.date,
max(max(cumulativea)) over (order by t1.date),
max(max(cumulativeb)) over (order by t1.date)
from table1 t1 left join
table2 t2
on t1.date = t2.date
group by t1.date;
This returns the maximum values of the two columns up to each date, which is, I think, what you are trying to describe.
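As a rough illustration of that "maximum up to each date" behaviour, the sketch below runs a similar query in SQLite from Python. Everything named here is hypothetical: the tables dates and cases, their columns, and the sample numbers (only the 6, 7, 8 on 26/01 and the skipped 25/01 come from the question). I have also written it with a derived table for the per-day maximum instead of the nested max(max(...)), and it assumes an SQLite build with window functions (3.25 or later).

```python
import sqlite3

def running_max_per_date(calendar, cases):
    """Per-day maximum, carried forward over days that have no cases."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE dates (report_date TEXT PRIMARY KEY)")
    con.execute(
        "CREATE TABLE cases (case_no INTEGER PRIMARY KEY, report_date TEXT, cumulative INTEGER)"
    )
    con.executemany("INSERT INTO dates VALUES (?)", [(d,) for d in calendar])
    con.executemany("INSERT INTO cases VALUES (?, ?, ?)", cases)
    query = """
        SELECT report_date,
               MAX(day_max) OVER (ORDER BY report_date) AS cumulative
        FROM (
            -- one row per calendar day; day_max is NULL when the day has no cases
            SELECT d.report_date, MAX(c.cumulative) AS day_max
            FROM dates AS d
            LEFT JOIN cases AS c ON c.report_date = d.report_date
            GROUP BY d.report_date
        ) AS per_day
        ORDER BY report_date
    """
    result = con.execute(query).fetchall()
    con.close()
    return result

calendar = ["2020-01-24", "2020-01-25", "2020-01-26"]
cases = [
    (1, "2020-01-24", 5),
    (2, "2020-01-26", 6),
    (3, "2020-01-26", 7),
    (4, "2020-01-26", 8),
]
totals = running_max_per_date(calendar, cases)
```

Grouping by the same column that is selected (d.report_date in this sketch) is also what avoids error 1055 in MySQL's only_full_group_by mode.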
I don't understand why you have cumulA and cumulB in table1. I suppose it is to store the maximum cumulA and cumulB for each day.
First, self-join table2 to find the maximum for each date (with a GROUP BY on the date):
SELECT t2.id, t2.date, td.cA
FROM t2
JOIN (
SELECT date AS d2, MAX(cumulA) AS cA
FROM t2
GROUP BY date
) AS td
ON t2.date = td.d2
AND t2.cumulA = td.cA
ORDER BY t2.date
Next, left join table1 to the result of that self-join so that every day is present.
SELECT * FROM `t1` LEFT JOIN t2 ON t1.date = t2.date ORDER BY t1.date
Here are the two joins combined:
SELECT * FROM `t1` LEFT JOIN (
SELECT t2.id, t2.date, td.cA
FROM t2
JOIN (
SELECT date AS d2, MAX(cumulA) AS cA
FROM t2
GROUP BY date
) AS td
ON t2.date = td.d2
AND t2.cumulA = td.cA
ORDER BY t2.date
) AS tt
ON t1.date = tt.date ORDER BY t1.date
Do the same for cumulB.
After that (I suppose), you INSERT the result INTO table1.
I hope this answers your question.
Good luck.
_Teddy_

SQL Server Return Rows Where Field Changed

I have a table with 3 values.
ID AuditDateTime UpdateType
12 12-15-2015 18:09 1
45 12-04-2015 17:41 0
75 12-21-2015 04:26 0
12 12-17-2015 07:43 0
35 12-01-2015 05:36 1
45 12-15-2015 04:35 0
I'm trying to return only records where the UpdateType has changed from AuditDateTime based on the IDs. So in this example, ID 12 changes from the 12-15 entry to the 12-17 entry. I would want that record returned. There will be multiple instances of ID 12, and I need all records returned where an ID's UpdateType has changed from its previous entry. I tried adding a row_number but it didn't insert sequentially because the records are not in the table in order. I've done a ton of searching with no luck. Any help would be greatly appreciated.
By using a CTE it is possible to find the previous record based upon the order of the AuditDateTime
WITH CTEData AS
(SELECT ROW_NUMBER() OVER (PARTITION BY ID ORDER BY AuditDateTime) [ROWNUM], *
FROM #tmpTable)
SELECT A.ID, A.AuditDateTime, A.UpdateType
FROM CTEData A INNER JOIN CTEData B
ON (A.ROWNUM - 1) = B.ROWNUM AND
A.ID = B.ID
WHERE A.UpdateType <> B.UpdateType
The Inner Join back onto the CTE will give in one query both the current record (Table Alias A) and previous row (Table Alias B).
This should do what you're trying to do I believe
SELECT
T1.ID,
T1.AuditDateTime,
T1.UpdateType
FROM
dbo.My_Table T1
INNER JOIN dbo.My_Table T2 ON
T2.ID = T1.ID AND
T2.UpdateType <> T1.UpdateType AND
T2.AuditDateTime < T1.AuditDateTime
LEFT OUTER JOIN dbo.My_Table T3 ON
T3.ID = T1.ID AND
T3.AuditDateTime < T1.AuditDateTime AND
T3.AuditDateTime > T2.AuditDateTime
WHERE
T3.ID IS NULL
Alternatively:
SELECT
T1.ID,
T1.AuditDateTime,
T1.UpdateType
FROM
dbo.My_Table T1
INNER JOIN dbo.My_Table T2 ON
T2.ID = T1.ID AND
T2.UpdateType <> T1.UpdateType AND
T2.AuditDateTime < T1.AuditDateTime
WHERE
NOT EXISTS
(
SELECT *
FROM
dbo.My_Table T3
WHERE
T3.ID = T1.ID AND
T3.AuditDateTime < T1.AuditDateTime AND
T3.AuditDateTime > T2.AuditDateTime
)
The basic gist of both queries is that you're looking for rows where an earlier row had a different type and no other rows exist between the two rows (hence, they're sequential). Both queries are logically identical, but might have differing performance.
Also, these queries assume that no two rows will have identical audit times. If that's not the case then you'll need to define what you expect to get when that happens.
You can use the lag() window function to find the previous value for the same ID. Now you can pick only those rows that introduce a change:
select *
from (
select lag(UpdateType) over (
partition by ID
order by AuditDateTime) as prev_updatetype
, *
from YourTable
) sub
where prev_updatetype <> updatetype
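Here is a small sketch of the lag() approach run against the sample rows from the question, using SQLite from Python. The table name audit and the ISO-style timestamps are my own assumptions (ISO strings sort correctly as text), and it needs an SQLite build with window functions (3.25 or later).

```python
import sqlite3

def changed_rows(rows):
    """Return rows whose UpdateType differs from the previous row of the same ID."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE audit (ID INTEGER, AuditDateTime TEXT, UpdateType INTEGER)")
    con.executemany("INSERT INTO audit VALUES (?, ?, ?)", rows)
    query = """
        SELECT ID, AuditDateTime, UpdateType
        FROM (
            SELECT ID, AuditDateTime, UpdateType,
                   LAG(UpdateType) OVER (PARTITION BY ID ORDER BY AuditDateTime) AS prev_type
            FROM audit
        ) AS sub
        -- the first row of each ID has prev_type NULL, so the comparison filters it out
        WHERE prev_type <> UpdateType
        ORDER BY ID, AuditDateTime
    """
    result = con.execute(query).fetchall()
    con.close()
    return result

sample = [
    (12, "2015-12-15 18:09", 1),
    (45, "2015-12-04 17:41", 0),
    (75, "2015-12-21 04:26", 0),
    (12, "2015-12-17 07:43", 0),
    (35, "2015-12-01 05:36", 1),
    (45, "2015-12-15 04:35", 0),
]
changes = changed_rows(sample)
```

Only the 12-17 entry for ID 12 comes back; ID 45 has two rows but its UpdateType stays 0.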

Forming query in DB2 to fetch row based on the values in one column along with order of another column

I apologize if the title seems absurd and lacks information. I will try to explain the situation through the following example:
Consider the following table-
ID Event Time
---------------------
1 EventA ta
1 EventB tx
2 EventB ty
1 EventC tb
2 EventC to
I wish to select the IDs for which there is an EventC after (based on Time) any instance of EventB.
I could think of the following query:
select ID from TabET where
((select TIME from TabET where Event = 'EventC' order by TIME desc fetch first row only)
>
(select TIME from TabET where Event = 'EventB' order by TIME desc fetch first row only))
I am looking for a better approach and alternative as the table in reality is a very big table and this query is just a subquery inside a big query to satisfy a condition.
Edit
The ID is not unique. The problem is to identify the IDs for which there is an EventC after (based on TIME) an EventB.
You can use a self join:
select distinct t1.ID
from TabET t1
join TabET t2 on
t1.ID = t2.ID and
t1.Event = 'EventB' and
t2.Event = 'EventC' and
t2.Time > t1.Time
Another approach:
with latest_times as (
select id, max(time) as time from TabET
where Event='EventC'
group by id
)
select distinct t1.ID from TabET t1
join latest_times on
t1.id = latest_times.id and
t1.Event = 'EventB' and
latest_times.time > t1.time
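A quick way to sanity-check the self-join is to run it on the example rows. The sketch below does that in SQLite from Python; the integer times are placeholders I made up for the symbolic ta, tx, ty, tb, to values in the question, chosen so that ID 1 has its EventC after its EventB and ID 2 does not.

```python
import sqlite3

def ids_with_c_after_b(rows):
    """IDs that have at least one EventC later than one of their EventB rows."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE TabET (ID INTEGER, Event TEXT, Time INTEGER)")
    con.executemany("INSERT INTO TabET VALUES (?, ?, ?)", rows)
    query = """
        SELECT DISTINCT b.ID
        FROM TabET AS b
        JOIN TabET AS c
          ON c.ID = b.ID
         AND c.Event = 'EventC'
         AND c.Time > b.Time
        WHERE b.Event = 'EventB'
        ORDER BY b.ID
    """
    result = [row[0] for row in con.execute(query)]
    con.close()
    return result

events = [
    (1, "EventA", 1),
    (1, "EventB", 2),
    (2, "EventB", 5),
    (1, "EventC", 4),
    (2, "EventC", 3),  # EventC happens before this ID's EventB
]
matching_ids = ids_with_c_after_b(events)
```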

hive sql aggregate

I have two tables in Hive, t1 and t2
>describe t1;
>date_id string
>describe t2;
>messageid string,
createddate string,
userid int
> select * from t1 limit 3;
> 2011-01-01 00:00:00
2011-01-02 00:00:00
2011-01-03 00:00:00
> select * from t2 limit 3;
87211389 2011-01-03 23:57:01 13864753
87211656 2011-01-03 23:57:59 13864769
87211746 2011-01-03 23:58:25 13864785
What I want is to count previous three-day distinct userid for a given date.
For example, for date 2011-01-03, I want to count distinct userid from 2011-01-01 to 2011-01-03.
for date 2011-01-04, I want to count distinct userid from 2011-01-02 to 2011-01-04
I wrote the following query. But it does not return three-day result. It returns distinct userid per day instead.
SELECT to_date(t1.date_id), count(distinct t2.userid) FROM t1 JOIN t2
ON (to_date(t2.createddate) = to_date(t1.date_id))
WHERE date_sub(to_date(t2.createddate),0) > date_sub(to_date(t1.date_id), 3)
AND to_date(t2.createddate) <= to_date(t1.date_id)
GROUP by to_date(t1.date_id);
`to_date()` and `date_sub()` are date functions in Hive.
That said, the following part does not take effect.
WHERE date_sub(to_date(t2.createddate),0) > date_sub(to_date(t1.date_id), 3)
AND to_date(t2.createddate) <= to_date(t1.date_id)
EDIT: One solution can be (but it is super slow):
SELECT to_date(t3.date_id), count(distinct t3.userid) FROM
(
SELECT * FROM t1 LEFT OUTER JOIN t2
WHERE
(date_sub(to_date(t2.createddate),0) > date_sub(to_date(t1.date_id), 3)
AND to_date(t2.createddate) <= to_date(t1.date_id)
)
) t3
GROUP by to_date(t3.date_id);
UPDATE: Thanks for all the answers. They are good.
But Hive is a bit different from SQL. Unfortunately, they cannot be used in Hive.
My current solution is to use UNION ALL.
SELECT * FROM t1 JOIN t2 ON (to_date(t1.date_id) = to_date(t2.createddate))
UNION ALL
SELECT * FROM t1 JOIN t2 ON (to_date(t1.date_id) = date_add(to_date(t2.createddate), 1))
UNION ALL
SELECT * FROM t1 JOIN t2 ON (to_date(t1.date_id) = date_add(to_date(t2.createddate), 2))
Then, I do group by and count. In this way, I can get what I want.
Although it is not elegant, it is much more efficient than a cross join.
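For reference, the whole UNION ALL workaround, including the final group by and count, can be sketched like this. It runs in SQLite from Python rather than Hive, so SQLite's date(x, '+1 day') stands in for Hive's date_add(to_date(x), 1); the sample rows and the helper name are made up for illustration.

```python
import sqlite3

def three_day_distinct_users(dates, messages):
    """Count distinct users over each date and the two days before it,
    using only equality joins combined with UNION ALL."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE t1 (date_id TEXT)")
    con.execute("CREATE TABLE t2 (messageid TEXT, createddate TEXT, userid INTEGER)")
    con.executemany("INSERT INTO t1 VALUES (?)", [(d,) for d in dates])
    con.executemany("INSERT INTO t2 VALUES (?, ?, ?)", messages)
    query = """
        SELECT day, COUNT(DISTINCT userid)
        FROM (
            SELECT date(t1.date_id) AS day, t2.userid
            FROM t1 JOIN t2 ON date(t1.date_id) = date(t2.createddate)
            UNION ALL
            SELECT date(t1.date_id), t2.userid
            FROM t1 JOIN t2 ON date(t1.date_id) = date(t2.createddate, '+1 day')
            UNION ALL
            SELECT date(t1.date_id), t2.userid
            FROM t1 JOIN t2 ON date(t1.date_id) = date(t2.createddate, '+2 day')
        ) AS u
        GROUP BY day
        ORDER BY day
    """
    result = con.execute(query).fetchall()
    con.close()
    return result

dates = ["2011-01-01 00:00:00", "2011-01-02 00:00:00",
         "2011-01-03 00:00:00", "2011-01-04 00:00:00"]
messages = [
    ("m1", "2011-01-01 10:00:00", 1),
    ("m2", "2011-01-02 11:00:00", 2),
    ("m3", "2011-01-03 23:57:01", 1),
    ("m4", "2011-01-04 09:00:00", 3),
]
counts = three_day_distinct_users(dates, messages)
```

User 1 posts on both 2011-01-01 and 2011-01-03 but is counted once for 2011-01-03, since the distinct count is taken after the three branches are combined.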
The following should work in standard SQL...
SELECT
to_date(t1.date_id),
count(distinct t2.userid)
FROM
t1
LEFT JOIN
t2
ON to_date(t2.createddate) >= date_sub(to_date(t1.date_id), 2)
AND to_date(t2.createddate) < date_add(to_date(t1.date_id), 1)
GROUP BY
to_date(t1.date_id)
It will, however, be slow, because you are storing dates as strings and then using to_date() to convert them to dates. This means that indexes can't be used, and the SQL engine can't do anything clever to reduce the effort being expended.
As a result, every possible combination of rows needs to be compared. If you have 100 entries in T1 and 10,000 entries in T2, your SQL engine is processing a million combinations.
If you store these values as dates, you don't need to_date(). And if you index the dates, the SQL engine can quickly home in on the range of dates being specified.
NOTE: The format of the ON clause means that you do not need to round t2.createddate down to a daily value.
EDIT Why your code didn't work...
SELECT to_date(t1.date_id), count(distinct t2.userid) FROM t1 JOIN t2
ON (to_date(t2.createddate) = to_date(t1.date_id))
WHERE date_sub(to_date(t2.createddate),0) > date_sub(to_date(t1.date_id), 3)
AND to_date(t2.createddate) <= to_date(t1.date_id)
GROUP by to_date(t1.date_id);
This joins t1 to t2 with an ON clause of (to_date(t2.createddate) = to_date(t1.date_id)). Because of that equality, every joined row has a t2.createddate on the same day as t1.date_id (with a LEFT OUTER JOIN it could also be NULL where nothing matches).
The WHERE clause allows a much wider range (3 days). But the ON clause of the JOIN has already restricted your data down to a single day.
The example I gave above simply takes your WHERE clause and puts it in place of the old ON clause.
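To make the difference concrete, the sketch below runs both forms side by side on a few made-up rows. It uses SQLite from Python, not Hive, so date(x, '-2 day') replaces date_sub(to_date(x), 2); the data and names are assumptions for illustration only.

```python
import sqlite3

DATES = [("2011-01-01 00:00:00",), ("2011-01-02 00:00:00",),
         ("2011-01-03 00:00:00",), ("2011-01-04 00:00:00",)]
MESSAGES = [
    ("m1", "2011-01-01 10:00:00", 1),
    ("m2", "2011-01-02 11:00:00", 2),
    ("m3", "2011-01-03 23:57:01", 1),
    ("m4", "2011-01-04 09:00:00", 3),
]

# Equality in ON: the 3-day WHERE filter can no longer widen the match.
EQUALITY_ON = """
    SELECT date(t1.date_id) AS day, COUNT(DISTINCT t2.userid)
    FROM t1 JOIN t2 ON date(t2.createddate) = date(t1.date_id)
    WHERE date(t2.createddate) > date(t1.date_id, '-3 day')
      AND date(t2.createddate) <= date(t1.date_id)
    GROUP BY day ORDER BY day
"""

# Range in ON: each date is matched with itself and the two previous days.
RANGE_ON = """
    SELECT date(t1.date_id) AS day, COUNT(DISTINCT t2.userid)
    FROM t1 LEFT JOIN t2
      ON date(t2.createddate) >= date(t1.date_id, '-2 day')
     AND date(t2.createddate) <= date(t1.date_id)
    GROUP BY day ORDER BY day
"""

def run(query):
    """Load the sample rows into a fresh in-memory database and run one query."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE t1 (date_id TEXT)")
    con.execute("CREATE TABLE t2 (messageid TEXT, createddate TEXT, userid INTEGER)")
    con.executemany("INSERT INTO t1 VALUES (?)", DATES)
    con.executemany("INSERT INTO t2 VALUES (?, ?, ?)", MESSAGES)
    rows = con.execute(query).fetchall()
    con.close()
    return rows

per_day = run(EQUALITY_ON)
three_day = run(RANGE_ON)
```

With these rows the equality version reports one user for every day, while the range version reports 1, 2, 2 and 3.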
EDIT
Hive doesn't allow <= and >= in the ON clause? Are you really locked in to using Hive?
If you really are, what about BETWEEN?
SELECT
to_date(t1.date_id),
count(distinct t2.userid)
FROM
t1
LEFT JOIN
t2
ON to_date(t2.createddate) BETWEEN date_sub(to_date(t1.date_id), 2) AND to_date(t1.date_id)
GROUP BY
to_date(t1.date_id)
Alternatively, refactor your table of dates to enumerate the dates you want to include...
TABLE t1 (calendar_date, inclusive_date) =
{ 2011-01-03, 2011-01-01
2011-01-03, 2011-01-02
2011-01-03, 2011-01-03
2011-01-04, 2011-01-02
2011-01-04, 2011-01-03
2011-01-04, 2011-01-04
2011-01-05, 2011-01-03
2011-01-05, 2011-01-04
2011-01-05, 2011-01-05 }
SELECT
to_date(t1.calendar_date),
count(distinct t2.userid)
FROM
t1
LEFT JOIN
t2
ON to_date(t2.createddate) = to_date(t1.inclusive_date)
GROUP BY
to_date(t1.calendar_date)
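If you go the enumerated-table route, the (calendar_date, inclusive_date) rows can be generated instead of typed by hand. This is only a sketch in Python; the function name window_pairs and the days parameter are hypothetical, and loading the pairs into Hive is left out.

```python
from datetime import date, timedelta

def window_pairs(calendar_dates, days=3):
    """For each calendar date, list the dates covered by its trailing window."""
    pairs = []
    for day in calendar_dates:
        # oldest date in the window first, the calendar date itself last
        for offset in range(days - 1, -1, -1):
            included = day - timedelta(days=offset)
            pairs.append((day.isoformat(), included.isoformat()))
    return pairs

pairs = window_pairs([date(2011, 1, 3), date(2011, 1, 4)])
```

With a table like this, the join needs only an equality condition, which is the kind of join Hive handles without complaint.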
You need a subquery.
Try something like this (I cannot test it because I don't have Hive):
SELECT to_date(t1.date_id), count(distinct t2.userid) FROM t1 JOIN t2
ON (to_date(t2.createddate) = to_date(t1.date_id))
WHERE t2.messageid in
(
select t2.messageid from t2 where
date_sub(to_date(t2.createddate),0) > date_sub(to_date(t1.date_id), 3)
AND
to_date(t2.createddate) <= to_date(t1.date_id)
)
GROUP by to_date(t1.date_id);
The key is that, with the subquery, the right records in t2 are selected for each date in t1.
EDIT:
To force the subquery into the FROM clause, you could try this:
SELECT to_date(t1.date_id), count(distinct t2.userid) FROM t1 JOIN
(select userid, createddate from t2 where
date_sub(to_date(t2.createddate),0) > date_sub(to_date(t1.date_id), 3)
AND
to_date(t2.createddate) <= to_date(t1.date_id)
) as t2
ON (to_date(t2.createddate) = to_date(t1.date_id))
GROUP by to_date(t1.date_id);
But I don't know whether it will work.
I am making an assumption that t1 is used to define the 3 day period. I suspect the puzzling approach is due to Hive's shortcomings.
This allows you to have an arbitrary number of 3 day periods.
Try the following 2 queries
SELECT substring(t1.date_id,1,10), count(distinct t2.userid)
FROM t1
JOIN t2
ON substring(t2.createddate,1,10) >= date_sub(substring(t1.date_id,1,10), 2)
AND substring(t2.createddate,1,10) <= substring(t1.date_id,1,10)
GROUP BY substring(t1.date_id,1,10)
--or--
SELECT substring(t1.date_id,1,10), count(distinct t2.userid)
FROM t1
JOIN t2
ON t2.createddate like concat(substring(t1.date_id ,1,10), '%')
OR t2.createddate like concat(substring(date_sub(t1.date_id, 1) ,1,10), '%')
OR t2.createddate like concat(substring(date_sub(t1.date_id, 2) ,1,10), '%')
GROUP BY substring(t1.date_id,1,10)
The latter minimizes the function calls on the t2 table. I am also assuming that t1 is the smaller of the 2.
substring should return the same result as to_date. According to the documentation, https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-DateFunctions, to_date returns a string data type.
Support for date data types seems minimal but I am not familiar with hive.

SQL - Query to return result

There is a table with Columns as below:
Id : long autoincrement;
timestamp:long;
price:long
Timestamp is given as a unix_time in ms.
Question: what is the average time difference between the records ?
First thought is a sub-query grabbing the record immediately previous:
SELECT timestamp -
(select top 1 timestamp from Table T1 where T1.Id < Table.Id order by Id desc)
FROM Table
Then you can take the average of that:
SELECT AVG(delta)
from (SELECT timestamp -
(select top 1 timestamp from Table T1 where T1.Id < Table.Id order by Id desc) as delta
FROM Table) T
There will probably need to be some handling of the null that results for the first row, but I haven't tested to be sure.
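On the NULL question: AVG skips NULL values, so the first row (which has no previous record) simply does not contribute. The sketch below checks that in SQLite from Python, with LIMIT 1 in place of TOP 1; the table name prices and the sample rows are hypothetical, and the gap in the Id sequence (no Id 4) is deliberate to show that the subquery does not rely on consecutive Ids.

```python
import sqlite3

def average_gap_ms(rows):
    """Average difference between each record's timestamp and the previous record's."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE prices (Id INTEGER PRIMARY KEY, timestamp INTEGER, price INTEGER)")
    con.executemany("INSERT INTO prices VALUES (?, ?, ?)", rows)
    query = """
        SELECT AVG(delta)
        FROM (
            SELECT t.timestamp - (
                       SELECT p.timestamp
                       FROM prices AS p
                       WHERE p.Id < t.Id
                       ORDER BY p.Id DESC
                       LIMIT 1
                   ) AS delta   -- NULL for the first row; AVG ignores it
            FROM prices AS t
        ) AS d
    """
    value = con.execute(query).fetchone()[0]
    con.close()
    return value

rows = [(1, 1000, 10), (2, 1500, 11), (3, 2500, 12), (5, 2600, 13)]
avg_gap = average_gap_ms(rows)
```

Here the three gaps are 500, 1000 and 100 ms, so the average is 1600 / 3.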
In SQL Server, you could write something like that to get that information:
SELECT
t1.ID, t2.ID,
DATEDIFF(MILLISECOND, t2.PriceTime, t1.PriceTime)
FROM table t1
INNER JOIN table t2 ON t2.ID = t1.ID-1
WHERE t1.ID > (SELECT MIN(ID) FROM table)
and if you're only interested in the AVG across all entries, you could use:
SELECT
AVG(DATEDIFF(MILLISECOND, t2.PriceTime, t1.PriceTime))
FROM table t1
INNER JOIN table t2 ON t2.ID = t1.ID-1
WHERE t1.ID > (SELECT MIN(ID) FROM table)
Basically, you need to join the table with itself, and use "t2.ID = t1.ID-1" to associate item no. 2 in one copy of the table with item no. 1 in the other, and then calculate the time difference between the two. In order to avoid accessing item no. 0, which doesn't exist, use the "t1.ID > (SELECT MIN(ID) FROM table)" clause to start from the second item. Note that this assumes the IDs are consecutive, with no gaps.
Marc
At a guess:
SELECT AVG(timestamp)
I think you need to provide more information in your question for us to help.
If you mean difference between each-other row:
select AVG(x) from (
select a.timestamp - b.timestamp as x
from table a, table b -- this multiplies a*b
) sub
SELECT AVG(T2.Timestamp - T1.TimeStamp)
FROM Table T1
JOIN Table T2 ON T2.ID = T1.ID + 1
Try this:
Select Avg(E.Timestamp - B.Timestamp)
From Table B Join Table E
On B.Timestamp =
(Select Max(Timestamp)
From Table
Where Timestamp < E.Timestamp)
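One more observation, offered as a shortcut and not as what any of the answers above do: if the records are taken in timestamp order, the consecutive differences telescope, so their average equals (latest - earliest) / (count - 1) and no self-join is needed. A minimal sketch in SQLite from Python, with a hypothetical prices table and made-up rows:

```python
import sqlite3

def average_gap_shortcut(rows):
    """(max - min) / (n - 1): the mean of consecutive timestamp differences."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE prices (Id INTEGER PRIMARY KEY, timestamp INTEGER, price INTEGER)")
    con.executemany("INSERT INTO prices VALUES (?, ?, ?)", rows)
    # Multiplying by 1.0 forces floating-point division.
    # With a single row the divisor is 0 and SQLite returns NULL (None in Python).
    query = """
        SELECT (MAX(timestamp) - MIN(timestamp)) * 1.0 / (COUNT(*) - 1)
        FROM prices
    """
    value = con.execute(query).fetchone()[0]
    con.close()
    return value

rows = [(1, 1000, 10), (2, 1500, 11), (3, 2500, 12), (5, 2600, 13)]
shortcut = average_gap_shortcut(rows)
single = average_gap_shortcut([(1, 1000, 10)])
```

This only matches the pairwise answers when "previous record" means previous by timestamp; if Id order and timestamp order can disagree, the two definitions differ.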