SQL Server Inner Join with Timestamps: is each record only assigned once? - sql

I am working with timestamped records and need to do an inner join based on the timestamp difference. I have been using the DATEDIFF function and it seems to be working well. However, the amount of time between timestamps varies. To clarify, sometimes the record appears in table 2 within the same second as table 1, and sometimes the record in table 2 is up to 15 seconds behind the record in table 1. The records in table 1 are always timestamped before table 2. There is no other common field with which I can join, however there is a register number in each table that I am using to increase accuracy by ensuring that the registers are the same.
My question is: if I increase the timestamp difference to do the inner join (e.g. where the DATEDIFF = 1 or 2 or 3... or 15) will records only be joined once? Or would my table contain duplicate records from table 1 (e.g. where record 1 is joined to record 4 in table 2 where the diff is 4 seconds, and is also joined with record 7 from table 2 where the diff is 11 seconds)?
The reason my statement works now is that no registers have records with less than 6 seconds in between, so even if there are multiple timestamps that would match, the matching of registers eliminates this problem.
My Statement is currently working as:
SELECT *
INTO AtriumSequoiaJoin5
FROM Atrium INNER JOIN Sequoia ON Atrium.Reader = Sequoia.theader_pos_name
WHERE (
((DateDiff(s,[Atrium].[Date2],[Sequoia].[theader_tdatetime]))=0
Or (DateDiff(s,[Atrium].[Date2],[Sequoia].[theader_tdatetime]))=1
Or (DateDiff(s,[Atrium].[Date2],[Sequoia].[theader_tdatetime]))=2
Or (DateDiff(s,[Atrium].[Date2],[Sequoia].[theader_tdatetime]))=3
Or (DateDiff(s,[Atrium].[Date2],[Sequoia].[theader_tdatetime]))=4
Or (Datediff(s,[Atrium].[Date2],[Sequoia].[theader_tdatetime]))=5)
)
ORDER BY Sequoia.theader_id;

You could CROSS APPLY to the closest record in time. That is by no means ideal, however: what if multiple records are written at the same time? A more robust design would give the first table an identity column and then write that value to the second table using SCOPE_IDENTITY().
SELECT *
INTO AtriumSequoiaJoin5
FROM Atrium
CROSS APPLY (
    -- closest Sequoia row at or after the Atrium timestamp, same register
    SELECT TOP 1 *
    FROM Sequoia
    WHERE Atrium.Reader = Sequoia.theader_pos_name
      AND Sequoia.theader_tdatetime >= Atrium.Date2
    ORDER BY Sequoia.theader_tdatetime
) DQ
ORDER BY DQ.theader_id;

Related

I don't understand how to do this task in SQL

There is a table with two fields: Id and Timestamp.
Id is an increasing sequence. Each insertion of a new record into the table leads to the generation of ID(n)=ID(n-1) + 1. Timestamp is a timestamp that, when inserted retroactively, can take any values less than the maximum time of all previous records.
Retroactive insertion is the operation of inserting a record into a table in which
ID(n) > ID(n-1)
Timestamp(n) < max(timestamp(1):timestamp(n-1))
Example of a table:
ID  Timestamp
1   2016.09.11
2   2016.09.12
3   2016.09.13
4   2016.09.14
5   2016.09.09
6   2016.09.12
7   2016.09.15
IDs 5 and 6 were inserted retroactively (their timestamps are lower than that of an earlier record, ID 4).
I need a query that will return a list of all ids that fit the definition of insertion retroactively. How can I do this?
The task can be rephrased as:
Find every entry for which the same table contains an entry with a lower id (a previous entry) and a greater timestamp.
This can be achieved with a WHERE EXISTS clause:
SELECT t.id, t.timestamp
FROM tbl t
WHERE EXISTS (
SELECT 1
FROM tbl t2
WHERE t.id > t2.id
AND t.timestamp < t2.timestamp
);
Fiddle for MySQL. It should work with any DBMS, since it uses standard SQL syntax.
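As a quick check of the EXISTS query against the sample table from the question, here is a minimal sketch using Python's built-in sqlite3 (an assumption made only so the example is self-contained; the query text is the same). The timestamps are stored as text in the question's year-first format, which compares correctly as strings because the fields are fixed width.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE tbl (id INTEGER PRIMARY KEY, timestamp TEXT)")
con.executemany(
    "INSERT INTO tbl VALUES (?, ?)",
    [
        (1, "2016.09.11"),
        (2, "2016.09.12"),
        (3, "2016.09.13"),
        (4, "2016.09.14"),
        (5, "2016.09.09"),
        (6, "2016.09.12"),
        (7, "2016.09.15"),
    ],
)

# A row is retroactive if some earlier id carries a later timestamp.
retro_ids = [
    row[0]
    for row in con.execute("""
        SELECT t.id
        FROM tbl t
        WHERE EXISTS (
            SELECT 1
            FROM tbl t2
            WHERE t.id > t2.id
              AND t.timestamp < t2.timestamp
        )
        ORDER BY t.id
    """)
]

print(retro_ids)  # [5, 6]
```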

Insert latest records efficiently in hive

I have around 90 tables in Hive. Each group of 10 is combined with UNION ALL into one of 9 master tables.
New rows are inserted into these 90 base tables every 15 minutes. We need to bring the newly inserted rows into the master tables every 15 minutes.
Checking the ID with "not in" takes a long time.
I also have a timestamp column, but fetching data based on that takes a long time as well.
Is there an efficient way to insert the newly added base-table records into the master tables every 15 minutes?
I can think of two options.
Option 1 - Create a new table that keeps the max timestamp for each master/stage combination. The table would look like this:
masters,stages, mxts
master1,stage1, 2021-01-01 12:30:30
...
Then use it in SQL like this:
select s.* from staging_table_1 s
join maxtimestamp m on s.timestamp > m.mxts and m.stages='stage1' and m.masters='master1'
union all
select s.* from staging_table_2 s
join maxtimestamp m on s.timestamp > m.mxts and m.stages='stage2' and m.masters='master1'
Then write the new max timestamp into this table after each load.
Option 2 - Add a new column to the master table, called record_created_by, to track which stage created each row.
Your insert statement would then look like this:
select s.*, 'master1~stage1' as record_created_by from staging_table_1 s
join (select max(timestamp) mxts from master where record_created_by='master1~stage1') mx on s.timestamp > mx.mxts
union all
select s.*, 'master1~stage2' as record_created_by from staging_table_2 s
join (select max(timestamp) mxts from master where record_created_by='master1~stage2') mx on s.timestamp > mx.mxts
Please note that the first-time insert would be the same SQL without the timestamp condition. If you have more stages, add them as further union all branches in the same way.
The first option is much faster, but you need to create and maintain a new table.
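Option 1 can be sketched end to end as follows. This is only an illustration of the watermark idea, run on SQLite through Python's sqlite3 rather than on Hive. The tables stage1 and master1, the column ts, and the helper load_new_rows are hypothetical names, and master1 holds rows from a single stage here so that MAX(ts) on it is a valid watermark.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE stage1  (id INTEGER, ts TEXT);
CREATE TABLE master1 (id INTEGER, ts TEXT);
CREATE TABLE maxtimestamp (masters TEXT, stages TEXT, mxts TEXT);
-- Start the watermark below any real timestamp so the first load takes everything.
INSERT INTO maxtimestamp VALUES ('master1', 'stage1', '0000-00-00 00:00:00');
""")

def load_new_rows(con):
    """Copy stage1 rows newer than the stored watermark, then advance the watermark."""
    con.execute("""
        INSERT INTO master1 (id, ts)
        SELECT s.id, s.ts
        FROM stage1 s
        JOIN maxtimestamp m
          ON m.masters = 'master1' AND m.stages = 'stage1'
        WHERE s.ts > m.mxts
    """)
    # Keep the old watermark if master1 is still empty.
    con.execute("""
        UPDATE maxtimestamp
        SET mxts = COALESCE((SELECT MAX(ts) FROM master1), mxts)
        WHERE masters = 'master1' AND stages = 'stage1'
    """)

def master_ids(con):
    return [r[0] for r in con.execute("SELECT id FROM master1 ORDER BY id")]

con.execute("INSERT INTO stage1 VALUES (1, '2021-01-01 12:15:00'), (2, '2021-01-01 12:30:30')")
load_new_rows(con)
after_first = master_ids(con)    # both staged rows are new

con.execute("INSERT INTO stage1 VALUES (3, '2021-01-01 12:45:00')")
load_new_rows(con)
after_second = master_ids(con)   # only the row added since the last load is copied

print(after_first, after_second)
```

The second load touches only rows above the stored watermark, so nothing is copied twice and no "not in" check against the master table is needed.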

Get the "most" optimal row in a JOIN

Problem
I have two tables, and I would like the entries from table 2 (let's call it table_2) to be matched up with the entries in table 1 (table_1) such that no row of table_2 is used more than once in the match-up.
Discussion
Specifically, in this case there are datetime stamps in each table (the field is utcdatetime). For each row in table_1, I want to find the row in table_2 whose utcdatetime is closest to the table_1 utcdatetime, such that the table_2 utcdatetime is older than the table_1 utcdatetime and within 30 minutes of it. Here is the catch: I do not want any repeats. If a row in table_2 gets used in a match on an earlier row in table_1, then I do not want it considered for a match later.
This is currently implemented in a Python routine, but iterating over all of the rows in table_1 is slow because the table is large. I thought I was there with a single SQL statement, but I found that my current SQL results in duplicate table_2 rows in the output data.
I would recommend using a nested select to get whatever results you're looking for.
For instance:
select *
from person p
where p.name_first = 'SCCJS'
and not exists (select 'x' from person p2 where p2.person_id != p.person_id
and p.name_first = 'SCCJS' and p.name_last = 'SC')

SQL INNER JOIN vs. WHERE ID IN(...) not the same results

I was surprised by the outcome of these two queries. I was expecting same from both. I have two tables that share a common field but there is not a relationship set up. The table (A) has a field EventID varchar(10) and table (B) has a field XXNumber varchar(15).
Values from table B column XXNumber are referenced in table A column EventID. Even though XXNumber can hold 15 chars, none of the 179K rows of data is longer than 10 chars.
So the requirement was:
"To avoid duplicate table B and table A entries, if the XXNumber is contained in a table A “Event ID” number, then it should not be counted."
To see how many common records I have, I ran this query first (call it query alpha):
SELECT dbo.TableB.XXNumber FROM dbo.TableB WHERE dbo.TableB.XXNumber in
( select distinct dbo.TableA.EventId FROM dbo.TableA )
The result was 5322 rows.
The following query (call it query delta) looks like this:
SELECT DISTINCT dbo.TableB.XXNumber, dbo.TableA.EventId
FROM dbo.TableB INNER JOIN dbo.TableA ON dbo.TableB.XXNumber = dbo.TableA.EventId
It returned 4308 rows.
Shouldn't the resulting number of rows be the same?
The WHERE ID IN () version will select all rows that match each distinct value in the list (regardless of whether you code DISTINCT inside the inner select or not; that's irrelevant). If a given value appears in the parent table more than once, you'll get multiple rows selected from the parent table for that single value found in the child table.
The INNER JOIN version will select each row from the parent table once for every successful join, so if there are 3 rows in the child table with the value, and 2 in the parent, then there will be 6 rows in the result for that value.
To make them "the same", add 'DISTINCT' to your main select.
To explain what you're seeing, we'd need to know more about your actual data.
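The row counts described above can be reproduced with a tiny made-up data set. This is a sketch using Python's built-in sqlite3 (an assumption made so it runs standalone). The table and column names follow the question, the values are invented, TableB plays the "parent" role and TableA the "child" role, and the helper count is hypothetical.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE TableA (EventId TEXT);    -- "child" table
CREATE TABLE TableB (XXNumber TEXT);   -- "parent" table
INSERT INTO TableA VALUES ('E1'), ('E1'), ('E1');  -- value appears 3 times
INSERT INTO TableB VALUES ('E1'), ('E1'), ('X9');  -- value appears 2 times
""")

def count(sql):
    """Number of rows a query returns."""
    return len(con.execute(sql).fetchall())

# IN: each matching parent row is returned once, however many child rows match.
in_rows = count("""
    SELECT XXNumber FROM TableB
    WHERE XXNumber IN (SELECT DISTINCT EventId FROM TableA)
""")

# JOIN: each parent row is returned once per matching child row (2 x 3).
join_rows = count("""
    SELECT b.XXNumber FROM TableB b
    JOIN TableA a ON b.XXNumber = a.EventId
""")

# DISTINCT on the outer select collapses both forms to the same result.
distinct_in = count("""
    SELECT DISTINCT XXNumber FROM TableB
    WHERE XXNumber IN (SELECT EventId FROM TableA)
""")
distinct_join = count("""
    SELECT DISTINCT b.XXNumber FROM TableB b
    JOIN TableA a ON b.XXNumber = a.EventId
""")

print(in_rows, join_rows, distinct_in, distinct_join)  # 2 6 1 1
```

Note that the IN form without an outer DISTINCT can return more rows than a DISTINCT join when the parent table holds repeated values, which matches the direction of the difference reported in the question.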

finding consecutive date pairs in SQL

I have a question here that looks a little like some of the ones that I found in search, but with solutions for slightly different problems and, importantly, ones that don't work in SQL 2000.
I have a very large table with a lot of redundant data that I am trying to reduce down to just the useful entries. It's a history table, and the way it works, if two entries are essentially duplicates and consecutive when sorted by date, the latter can be deleted. The data from the earlier entry will be used when historical data is requested from a date between the effective date of that entry and the next non-duplicate entry.
The data looks something like this:
id user_id effective_date important_value useless_value
1 1 1/3/2007 3 0
2 1 1/4/2007 3 1
3 1 1/6/2007 NULL 1
4 1 2/1/2007 3 0
5 2 1/5/2007 12 1
6 3 1/1/1899 7 0
With this sample set, we would consider two consecutive rows duplicates if the user_id and the important_value are the same. From this sample set, we would only delete row with id=2, preserving the information from 1-3-2007, showing that the important_value changed on 1-6-2007, and then showing the relevant change again on 2-1-2007.
My current approach is awkward and time-consuming, and I know there must be a better way. I wrote a script that uses a cursor to iterate through the user_id values (since that breaks the huge table up into manageable pieces), and creates a temp table of just the rows for that user. Then to get consecutive entries, it takes the temp table, joins it to itself on the condition that there are no other entries in the temp table with a date between the two dates. In the pseudocode below, UDF_SameOrNull is a function that returns 1 if the two values passed in are the same or if they are both NULL.
WHILE (@@FETCH_STATUS <> -1)
BEGIN
    SELECT * INTO #history FROM History WHERE user_id = @UserId
    --return entries to delete
    SELECT h2.id
    INTO #delete_history_ids
    FROM #history h1
    JOIN #history h2 ON
        h1.effective_date < h2.effective_date
        AND dbo.UDF_SameOrNull(h1.important_value, h2.important_value)=1
    WHERE NOT EXISTS (SELECT 1 FROM #history hx WHERE hx.effective_date > h1.effective_date and hx.effective_date < h2.effective_date)
    DELETE h1
    FROM History h1
    JOIN #delete_history_ids dh ON
        h1.id = dh.id
    FETCH NEXT FROM UserCursor INTO @UserId
END
It also loops over the same set of duplicates until there are none, since taking out rows creates new consecutive pairs that are potentially dupes. I left that out for simplicity.
Unfortunately, I must use SQL Server 2000 for this task and I am pretty sure that it does not support ROW_NUMBER() for a more elegant way to find consecutive entries.
Thanks for reading. I apologize for any unnecessary backstory or errors in the pseudocode.
OK, I think I figured this one out, excellent question!
First, I made the assumption that the effective_date column will not be duplicated for a user_id. I think it can be modified to work if that is not the case - so let me know if we need to account for that.
The process basically takes the table of values and self-joins on equal user_id and important_value and prior effective_date. Then, we do 1 more self-join on user_id that effectively checks to see if the 2 joined records above are sequential by verifying that there is no effective_date record that occurs between those 2 records.
It's just a select statement for now - it should select all records that are to be deleted. So if you verify that it is returning the correct data, simply change the select * to delete tcheck.
Let me know if you have questions.
select
*
from
History tcheck
inner join History tprev
on tprev.[user_id] = tcheck.[user_id]
and tprev.important_value = tcheck.important_value
and tprev.effective_date < tcheck.effective_date
left join History checkbtwn
on tcheck.[user_id] = checkbtwn.[user_id]
and checkbtwn.effective_date < tcheck.effective_date
and checkbtwn.effective_date > tprev.effective_date
where
checkbtwn.[user_id] is null
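To check the self-join against the sample rows from the question, here is a minimal sketch run on SQLite through Python's sqlite3 (an assumption made so it is self-contained; the join logic is unchanged). Dates are rewritten in ISO format so they compare correctly as text, and only tcheck.id is selected to keep the output short.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE History (
        id INTEGER, user_id INTEGER, effective_date TEXT,
        important_value INTEGER, useless_value INTEGER
    )
""")
con.executemany(
    "INSERT INTO History VALUES (?, ?, ?, ?, ?)",
    [
        (1, 1, "2007-01-03", 3, 0),
        (2, 1, "2007-01-04", 3, 1),
        (3, 1, "2007-01-06", None, 1),
        (4, 1, "2007-02-01", 3, 0),
        (5, 2, "2007-01-05", 12, 1),
        (6, 3, "1899-01-01", 7, 0),
    ],
)

# tprev: an earlier row for the same user with the same important_value.
# checkbtwn: any row for that user strictly between the two dates.
# Keeping only pairs with no such row leaves the consecutive duplicates.
ids_to_delete = [
    row[0]
    for row in con.execute("""
        SELECT tcheck.id
        FROM History tcheck
        INNER JOIN History tprev
            ON tprev.user_id = tcheck.user_id
           AND tprev.important_value = tcheck.important_value
           AND tprev.effective_date < tcheck.effective_date
        LEFT JOIN History checkbtwn
            ON tcheck.user_id = checkbtwn.user_id
           AND checkbtwn.effective_date < tcheck.effective_date
           AND checkbtwn.effective_date > tprev.effective_date
        WHERE checkbtwn.user_id IS NULL
        ORDER BY tcheck.id
    """)
]

print(ids_to_delete)  # [2]
```

One caveat: the plain = comparison never matches two NULLs, so consecutive rows that both have a NULL important_value would not be flagged here, unlike with the UDF_SameOrNull function described in the question.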
OK guys, I did some thinking last night and I think I found the answer. I hope this helps someone else who has to match consecutive pairs in data and for some reason is also stuck in SQL Server 2000.
I was inspired by the other results that say to use ROW_NUMBER(), and I used a very similar approach, but with an identity column.
--create table with identity column
CREATE TABLE #history (
id int,
user_id int,
effective_date datetime,
important_value int,
useless_value int,
idx int IDENTITY(1,1)
)
--insert rows ordered by effective_date and now indexed in order
INSERT INTO #history
SELECT * FROM History
WHERE user_id = @user_id
ORDER BY effective_date
--get pairs where consecutive values match
SELECT *
FROM #history h1
JOIN #history h2 ON
h1.idx+1 = h2.idx
WHERE h1.important_value = h2.important_value
With this approach, I still have to iterate over the results until it returns nothing, but I can't think of any way around that and this approach is miles ahead of my last one.