Joining two data sets with timestamps that are slightly off

Joining two data sets with timestamps that are slightly off - sql

I have two tables, one with transactions and one that reflects an audit done on each of these transactions before they were allowed process. I'd like to join the audit results to each transaction. Unfortunately, due to a system error, the TransID sometimes gets duplicated and thus does not always uniquely identify the transaction event. What DOES uniquely identify the event is the TransID + TransTimestamp. The row in the auditor's data that is uniquely associated with this transaction is the one with the same TransID and an AuditTimestamp that's the most recent timestamp that comes before the TransTimestamp. The code I've tried below causes SQL developer to run perpetually:
SELECT Trans.TransID, Audit.AuditTimestamp
FROM Trans
LEFT JOIN Audit ON Trans.TransID = Audit.TransID
WHERE Trans.TransTimestamp >= (SELECT MAX(Audit.AuditTimestamp)
FROM Audit
WHERE Trans.TransID = Audit.AuditID);

If a core piece of information such a transid is being duplicated then it is a "very messy business" and no amount of workarounds will properly overcome this.
To attempt this query I think you need to divide the transactions into those which have problems and those which don't and below I have used COUNT() OVER() to establish if a transaction has been duplicated or not. From there we can use a correlated subquery to locate a "probable match" for those transactions that need this approach. The probable match needs to have a timestamp >= the transaction but less than some allowed timeframe e.g. 1 second.
From there we join the good with the bad to arrive at (hopefully) something reasonable. But although it may appear reasonable it isn't guaranteed to be, you do need to fix the source problem.
WITH T AS (
SELECT
trans.*
, COUNT() over(partition by Trans.TransID) tran_count
FROM Trans
)
, A2 AS (
SELECT
T.TransID
, SELECT MIN(Audit.AuditTimestamp) FROM Audit
WHERE T.TransID = Audit.TransID
AND Audit.AuditTimestamp > T.TransTimestamp
AND Audit.AuditTimestamp < (T.TransTimestamp + interval '1' second) /* change to suit */
AS ProbableMatch
FROM T
WHERE T.tran_count > 1
)
SELECT
T.TransID
, T.trans_count
, COALESECE(A1.AuditTimestamp, A2.ProbableMatch) AS AuditTimestamp
, COALESECE(A1.ID, A3.ID) AS SourceAuditID
FROM T
LEFT JOIN Audit A1 ON T.TransID = A1.TransID AND T.tran_count = 1
LEFT JOIN A2 ON T.TransID = A2.TransID
LEFT JOIN Audit A3 ON A2.TransID = A3.TransID AND A2.AuditTimestamp = A3.AuditTimestamp

Related

Complex SQL View with Joins & Where clause

My SQL skill level is pretty basic. I have certainly written some general queries and done some very generic views. But once we get into joins, I am choking to get the results that I want, in the view I am creating.
I feel like I am almost there. Just can't get the final piece
SELECT dbo.ics_supplies.supplies_id,
dbo.ics_supplies.old_itemid,
dbo.ics_supplies.itemdescription,
dbo.ics_supplies.onhand,
dbo.ics_supplies.reorderlevel,
dbo.ics_supplies.reorderamt,
dbo.ics_supplies.unitmeasure,
dbo.ics_supplies.supplylocation,
dbo.ics_supplies.invtype,
dbo.ics_supplies.discontinued,
dbo.ics_supplies.supply,
dbo.ics_transactions.requsitionnumber,
dbo.ics_transactions.openclosed,
dbo.ics_transactions.transtype,
dbo.ics_transactions.originaldate
FROM dbo.ics_supplies
LEFT OUTER JOIN dbo.ics_orders
ON dbo.ics_supplies.supplies_id = dbo.ics_orders.suppliesid
LEFT OUTER JOIN dbo.ics_transactions
ON dbo.ics_orders.requisitionnumber =
dbo.ics_transactions.requsitionnumber
WHERE ( dbo.ics_transactions.transtype = 'PO' )
When I don't include the WHERE clause, I get 17,000+ records in my view. That is not correct. It's doing this because we are matching on a 1 to many table. Supplies table is 12,000 records. There should always be 12,000 records. Never more. Never less.
The pieces that I am missing are:
I only need ONE matching record from the ICS_Transactions Table. Ideally, the one that I want is the most current 'ICS_Transactions.OriginalDate'.
I only want the ICS_Transactions Table fields to populate IF ICS_Transacions.Type = 'PO'. Otherwise, these fields should remain null.
Sample code or anything would help a lot. I have done a lot of research on joins and it's still very confusing to get what I need for results.
EDIT/Update
I feel as if I asked my question in the wrong way, or didn't give a good overall view of what I am asking. For that, I apologize. I am still very new to SQL, but trying hard.
ICS_Supplies Table has 12,810 records
ICS_Orders Table has 3,666 records
ICS_Transaction Table has 4,701 records
In short, I expect to see a result of 12,810 records. No more and no less. I am trying to create a View of ALL records from the ICS_Supplies table.
Not all records in Supply Table are in Orders and or Transaction Table. But still, I want to see all 12,810 records, regardless.
My users have requested that IF any of these supplies have an open PO (ICS_Transactions.OpenClosed = 'Open' and ICS_Transactions.InvType = 'PO') Then, I also want to see additional fields from ICS_Transactions (ICS_Transactions.OpenClosed, ICS_Transactions.InvType, ICS_Transactions.OriginalDate, ICS_Transactions.RequsitionNumber).
If there are no open PO's for supply record, then these additional fields should be blank/null (regardless to what data is in these added fields, they should display null if they don't meet the criteria).
The ICS_Orders Table is nly needed to hop from the ICS_Supplies to the ICS_Transactions (I first, need to obtain the Requisition Number from the Orders field, if there is one).
I am sorry if I am not doing a good job to explain this. Please ask if you need clarification.

Here's a simplified version of Ross Bush's answer (It removes a join from the CTE to keep things more focussed, speed things up, and cut down the code).
;WITH
ordered_ics_transactions AS
(
SELECT
*,
ROW_NUMBER() OVER (PARTITION BY requisitionnumber
ORDER BY originaldate DESC
)
AS seq_id
FROM
dbo.ics_transactions
)
SELECT
s.supplies_id, s.old_itemid,
s.itemdescription, s.onhand,
s.reorderlevel, s.reorderamt,
s.unitmeasure, s.supplylocation,
s.invtype, s.discontinued,
s.supply,
t.requsitionnumber, t.openclosed,
t.transtype, t.originaldate
FROM
dbo.ics_supplies AS s
LEFT OUTER JOIN
dbo.ics_orders AS o
ON o.supplies_id = s.suppliesid
LEFT OUTER JOIN
ordered_ics_transactions AS t
ON t.requisitionnumber = o.requisitionnumber
AND t.transtype = 'PO'
AND t.seq_id = 1
This will only join the most recent transaction record for each requisitionnumber, and only if it has transtype = 'PO'
IF you want to reverse that (joining only transaction records that have transtype = 'PO', and of those only the most recent one), then move the transtype = 'PO' filter to be a WHERE clause inside the ordered_ics_transactions CTE.

You can possibly work with the query below to get what you need.
1. I only need ONE matching record from the ICS_Transactions Table. Ideally, the one that I want is the most current 'ICS_Transactions.OriginalDate'.
I would solve this by creating a CTE with all the ICS_Transaction fields needed in the query, rank-ordered by OPriginalDate, partitioned by suppliesid.
2. I only want the ICS_Transactions Table fields to populate IF ICS_Transacions.Type = 'PO'. Otherwise, these fields should remain null.
If you move the condition from the WHERE clause to the LEFT JOIN then ICS_Transactions not matching the criteria will be peeled and replaced with null values with the rest of the query records.
;
WITH ReqNumberRanked AS
(
SELECT
dbo.ICS_Orders.SuppliesID,
dbo.ICS_Transactions.RequisitionNumber,
dbo.ICS_Transactions.TransType,
dbo.ICS_Transactions.OriginalDate,
dbo.ICS_Transactions.OpenClosed,
RequisitionNumberRankReversed = RANK() OVER(PARTITION BY dbo.ICS_Orders.SuppliesID, dbo.ICS_Transactions.RequisitionNumber ORDER BY dbo.ICS_Transactions.OriginalDate DESC)
FROM
dbo.ICS_Orders
LEFT OUTER JOIN dbo.ICS_Transactions ON dbo.ICS_Orders.RequisitionNumber = dbo.ICS_Transactions.RequsitionNumber
)
SELECT
dbo.ICS_Supplies.Supplies_ID, dbo.ICS_Supplies.Old_ItemID,
dbo.ICS_Supplies.ItemDescription, dbo.ICS_Supplies.OnHand,
dbo.ICS_Supplies.ReorderLevel, dbo.ICS_Supplies.ReorderAmt,
dbo.ICS_Supplies.UnitMeasure,
dbo.ICS_Supplies.SupplyLocation, dbo.ICS_Supplies.InvType,
dbo.ICS_Supplies.Discontinued, dbo.ICS_Supplies.Supply,
ReqNumberRanked.RequsitionNumber,
ReqNumberRanked.OpenClosed,
ReqNumberRanked.TransType,
ReqNumberRanked.OriginalDate
FROM
dbo.ICS_Supplies
LEFT OUTER JOIN dbo.ICS_Orders ON dbo.ICS_Supplies.Supplies_ID = dbo.ICS_Orders.SuppliesID
LEFT OUTER JOIN ReqNumberRanked ON ReqNumberRanked.RequisitionNumber = dbo.ICS_Transactions.RequsitionNumber
AND (ReqNumberRanked.TransType = 'PO')
AND ReqNumberRanked.RequisitionNumberRankReversed = 1

Finding Acceptance Rate from Friend Requests

Given two tables Friend_request (requester_id, sent_to_id , time) and Request_accepted (acceptor_id, requestor_id, time) , find the overall acceptance rate of requests.
So my code/basic logic is the following:
select count(acceptor_id)/count(requester_id)
from Friend_Request left join Request_Accepted
on Friend_Request.sent_to_id = Request_Accepted.acceptor_id
where Request_Accepted.acceptor_id is not null
Would that be correct?

In theory it would be almost correct.
SELECT count(ra.acceptor_id) / count(fr.requester_id)
FROM Friend_Request fr
LEFT JOIN Request_Accepted ra ON ra.acceptor_id = fr.sent_to_id
Just remove the WHERE because you want the rows will NULL acceptors from your Request_Accepted table. Otherwise it would be the same number divided by the same number.
Edit: Although you have to consider if they send multiple requests with one acceptance... then that's a different can of worms.

I handled this question in a slightly different manner and thought I would share.
( SELECT COUNT(DISTINCT requester_id, accepter_id)
FROM request_accepted) /
( SELECT COUNT(DISTINCT sender_id, send_to_id)
FROM friend_request)

I guess there would be a more efficient solution.
Since all the request actions are stored in the Friend_request table, and all the accept actions are stored in the Request_Accepted table, we can simply get the counts of each table and divide them to get the rate we need.
requests_count = select count(*) from Friend_request;
accepts_count = select count(*) from Request_Accepted;
accept_rate = accepts_count / requests_count;
In most of the databases, count(*) option do not need to scan all the records in the table, it is an O(1) option.

Need to make SQL subquery more efficient

I have a table that contains all the pupils.
I need to look through my registered table and find all students and see what their current status is.
If it's reg = y then include this in the search, however student may change from y to n so I need it to be the most recent using start_date to determine the most recent reg status.
The next step is that if n, then don't pass it through. However if latest reg is = y then search the pupil table, using pupilnumber; if that pupil number is in the pupils table then add to count.
Select Count(*)
From Pupils Partition(Pupils_01)
Where Pupilnumber in (Select t1.pupilnumber
From registered t1
Where T1.Start_Date = (Select Max(T2.Start_Date)
From registered T2
Where T2.Pupilnumber = T1.Pupilnumber)
And T1.reg = 'N');
This query works, but it is very slow as there are several records in the pupils table.
Just wondering if there is any way of making it more efficient

Worrying about query performance but not indexing your tables is, well, looking for a kind word here... ummm... daft. That's the whole point of indexes. Any variation on the query is going to be much slower than it needs to be.
I'd guess that using analytic functions would be the most efficient approach since it avoids the need to hit the table twice.
SELECT COUNT(*)
FROM( SELECT pupilnumber,
startDate,
reg,
rank() over (partition by pupilnumber order by startDate desc) rnk
FROM registered )
WHERE rnk = 1
AND reg = 'Y'

You can look execution plan for this query. It will show you high cost operations. If you see table scan in execution plan you should index them. Also you can try "exists" instead of "in".

This query MIGHT be more efficient for you and hope at a minimum you have indexes per "pupilnumber" in the respective tables.
To clarify what I am doing, the first inner query is a join between the registered table and the pupil which pre-qualifies that they DO Exist in the pupil table... You can always re-add the "partition" reference if that helps. From that, it is grabbing both the pupil AND their max date so it is not doing a correlated subquery for every student... get all students and their max date first...
THEN, join that result to the registration table... again by the pupil AND the max date being the same and qualify the final registration status as YES. This should give you the count you need.
select
count(*) as RegisteredPupils
from
( select
t2.pupilnumber,
max( t2.Start_Date ) as MostRecentReg
from
registered t2
join Pupils p
on t2.pupilnumber = p.pupilnumber
group by
t2.pupilnumber ) as MaxPerPupil
JOIN registered t1
on MaxPerPupil.pupilNumber = t1.pupilNumber
AND MaxPerPupil.MostRecentRec = t1.Start_Date
AND t1.Reg = 'Y'
Note: If you have multiple records in the registration table, such as a person taking multiple classes registered on the same date, then you COULD get a false count. If that might be the case, you could change from
COUNT(*)
to
COUNT( DISTINCT T1.PupilNumber )

Is there a way to get different results for the same SQL query if the data stays the same?

I get a different result set for this query intermittently when I run it...sometimes it gives 1363, sometimes 1365 and sometimes 1366 results. The data doesn't change. What could be causing this and is there a way to prevent it? Query looks something like this:
SELECT *
FROM
(
SELECT
RC.UserGroupId,
RC.UserGroup,
RC.ClientId AS CLID,
CASE WHEN T1.MultipleClients = 1 THEN RC.Salutation1 ELSE RC.DisplayName1 END AS szDisplayName,
T1.MultipleClients,
RC.IsPrimaryRecord,
RC.RecordTypeId,
RC.ClientTypeId,
RC.ClientType,
RC.IsDeleted,
RC.IsCompany,
RC.KnownAs,
RC.Salutation1,
RC.FirstName,
RC.Surname,
Relationship,
C.DisplayName Client,
RC.DisplayName RelatedClient,
E.Email,
RC.DisplayName + ' is the ' + R.Relationship + ' of ' + C.DisplayName Description,
ROW_NUMBER() OVER (PARTITION BY E.Email ORDER BY Relationship DESC) AS sequence_id
FROM
SSDS.Client.ClientExtended C
INNER JOIN
SSDS.Client.ClientRelationship R WITH (NOLOCK)ON C.ClientId = R.ClientID
INNER JOIN
SSDS.Client.ClientExtended RC WITH (NOLOCK)ON R.RelatedClientId = RC.ClientId
LEFT OUTER JOIN
SSDS.Client.Email E WITH (NOLOCK)ON RC.ClientId = E.ClientId
LEFT OUTER JOIN
SSDS.Client.UserDefinedData UD WITH (NOLOCK)ON C.ClientId = UD.ClientId AND C.UserGroupId = UD.UserGroupId
INNER JOIN
(
SELECT
E.Email,
CASE WHEN (COUNT(DISTINCT RC.DisplayName) > 1) THEN 1 ELSE 0 END AS MultipleClients
FROM
SSDS.Client.ClientExtended C
INNER JOIN
SSDS.Client.ClientRelationship R WITH (NOLOCK)ON C.ClientId = R.ClientID
INNER JOIN
SSDS.Client.ClientExtended RC WITH (NOLOCK)ON R.RelatedClientId = RC.ClientId
LEFT OUTER JOIN
SSDS.Client.Email E WITH (NOLOCK)ON RC.ClientId = E.ClientId
LEFT OUTER JOIN
SSDS.Client.UserDefinedData UD WITH (NOLOCK)ON C.ClientId = UD.ClientId AND C.UserGroupId = UD.UserGroupId
WHERE
Relationship IN ('z-Group Principle', 'z-Group Member ')
AND E.Email IS NOT NULL
GROUP BY E.Email
) T1 ON E.Email = T1.Email
WHERE
Relationship IN ('z-Group Principle', 'z-Group Member ')
AND E.Email IS NOT NULL
) T
WHERE
sequence_id = 1
AND T.UserGroupId IN (Select * from iCentral.dbo.GetSubUserGroups('471b9cbd-2312-4a8a-bb20-35ea53d30340',0))
AND T.IsDeleted = 0
AND T.RecordTypeId = 1
AND T.ClientTypeId IN
(
'1', --Client
'-1652203805' --NTU
)
AND T.CLID NOT IN
(
SELECT DISTINCT
UDDF.CLID
FROM
SLacsis_SLM.dbo.T_UserDef UD WITH (NOLOCK)
INNER JOIN
SLacsis_SLM.dbo.T_UserDefData UDDF WITH (NOLOCK)
ON UD.UserDef_ID = UDDF.UserDef_ID
INNER JOIN
SLacsis_SLM.dbo.T_Client CLL WITH (NOLOCK)
ON CLL.CLID = UDDF.CLID AND CLL.UserGroup_CLID = UD.UserID
WHERE
UD.UserDef_ID in
(
'F68F31CE-525B-4455-9D50-6DA77C66FEE5',
'A7CECB03-866C-4F1F-9E1A-CEB09474FE47'
)
AND UDDF.Data = 'NO'
)
ORDER BY T.Surname
EDIT:
I have removed all NOLOCK's (including the ones in views and UDFs) and I'm still having the same issue. I get the same results every time for the nested select (T) and if I put the result set of T into a temp table in the beginning of the query and join onto the temp table instead of the nested select then the final result set is the same every time I run the query.
EDIT2:
I have been doing some more reading on ROW_NUMBER()...I'm partitioning by email (of which there are duplicates) and ordering by Relationship (where there are only 1 of 2 relationships). Could this cause the query to be non-deterministic and would there be a way to fix that?
EDIT3:
Here are the actual execution plans if anyone is interested http://www.mediafire.com/?qo5gkh5dftxf0ml. Is it possible to see that it is running as read committed from the execution plan? I've compared the files using WinMerge and the only differences seem to be the counts (ActualRows="").
EDIT4:
This works:
SELECT * FROM
(
SELECT *,
ROW_NUMBER() OVER (PARTITION BY B.Email ORDER BY Relationship DESC) AS sequence_id
FROM
(
SELECT DISTINCT
RC.UserGroupId,
...
) B...
EDIT5:
When running the same ROW_NUMBER() query (T in the original question, just selecting RC.DisplayName and ROW_NUMBER) twice in a row I get different rank for some people:
Does anyone have a good explanation/example of why or how ROW_NUMBER() over a result set that contains duplicates can rank differently each time it is run and ultimately change the number of results?
EDIT6:
Ok I think this makes sense to me now. This occurs when people 2 people have the same email address (e.g a husband and wife pair) and relationship. I guess in this case their ROW_NUMBER() ranking is arbitrary and can change every time it is run.

Your use of NOLOCK all over means you are doing dirty reads and will see uncommitted data, data that will be rolled back, transient and inconsistent data etc
Take these off, try again, report back pleas
Edit: some options with NOLOCKS removed
Data is really changing
Some parameter or filter is changing (eg GETDATE)
Some float comparisons running on different cores each time
See this on dba.se https://dba.stackexchange.com/q/4810/630
Embedded NOLOCKs in udfs or views (eg iCentral.dbo.GetSubUserGroups)
...

As I said yesterday in the comments the row numbering for rows with duplicate E.Email, Relationship values will be arbitrary.
To make it deterministic you would need to do PARTITION BY B.Email ORDER BY Relationship DESC, SomeUniqueColumn . Interesting that it changes between runs though using the same execution plan. I assume this is a consequence of the hash join.

I think your problem is the first row over the partition is not deterministic. I suspect that Email and Relationship is not unique.
ROW_NUMBER() OVER (PARTITION BY E.Email ORDER BY Relationship DESC) AS sequence_id
Later you examine the first row of the partition.
WHERE T.sequence_id = 1
AND T.UserGroupId ...
If that first row is arbitrary then you are going to get an arbitrary where comparison. You need to add to the ORDER BY to include a complete unique key. If there is no unique key then you need to make one or live with arbitrary results. Even on a table with a clustered PK the select row order is not guaranteed unless the entire PK is in the sort clause.

This probably has to do with ordering. You have a sequence_id defined as a row_number ordered by Relationship. You'll always get a sensible order by relationship, but other than that your row_number will be random. So you can get different rows with sequence_id 1 each time. That in turn will affect your where clause, and you can get different numbers of results. To fix this to get a consistent result, add another field to your row_number's order by. Use a primary key to be certain of consistent results.

There's a recent KB that addresses problems with ROW_NUMBER() ... see FIX: You receive an incorrect result when you run a query that uses the row_number function in SQL Server 2008 for the details.
However this KB indicates that it's a problem when parallelism is invoked for execution, and looking at your execution plans I can't see this kicking in. But the fact that MS have found a problem with it in one situation makes me a little bit wary - i.e., could the same issue occur for a sufficiently complicated query (and your execution plan does look sufficiently large).
So it may be worth checking your patch levels of SQL Server 2008.

U Must only use
Order by
without prtition by.
ROW_NUMBER() OVER (ORDER BY Relationship DESC) AS sequence_id

Optomizing a sql Count Query

I am still fairly new to SQL so I wanted to know if I am doing this the most optimized way.
SELECT DISTINCT ACCOUNTID, ACCOUNT_NAME
(SELECT COUNT(*)
FROM TICKET
WHERE (ACCOUNTID = OPPORTUNITY.ACCOUNTID)) AS [Number Of Tickets],
(SELECT COUNT(*)
FROM TICKET
WHERE (ACCOUNTID = OPPORTUNITY.ACCOUNTID) AND (STATUSCODE = 1 OR
STATUSCODE = 2 OR
STATUSCODE = 3)) AS [Active Tickets]
from OPPORTUNITY
where AccountID > #LowerBound and AccountID < #UpperBound
What I am trying to do is get a list of all accounts and have it show how many tickets the account has and how many are active (have a status code that is 1, 2, or 3). Is the select inside the select the correct way to do this or is there a way it can be done using something like group by.
My biggest concern is speed, it takes 3-5 seconds to just pull around 20 records and the query could potentially have 1000's of results.
I am not the DBA so any changes to table schema are not imposable but will require some pleading with upper management.
This is being run against SQL Server 2000.
EDIT--
as all of the answers where asking about it I checked on it. Both Opportunity and ticket index on accountid ascending.

I think the below should be logically equivalent and more efficient. Obviously test both aspects your end!
SELECT O.ACCOUNTID, O.ACCOUNT_NAME,
COUNT(*) AS [Number Of Tickets],
ISNULL(SUM(CASE WHEN STATUSCODE IN (1,2,3) THEN 1 ELSE 0 END),0)
AS [Active Tickets]
FROM OPPORTUNITY O
LEFT OUTER JOIN TICKET T ON T.ACCOUNTID = O.ACCOUNTID
WHERE O.ACCOUNTID > #LowerBound and O.ACCOUNTID < #UpperBound
GROUP BY O.ACCOUNTID, O.ACCOUNT_NAME
IF you can view the execution plans then you should check indexes exist on ACCOUNTID in both tables and are being used.

The SQL engine (even in 2000) is smart enough to optimize that sql. Based on your performance numbers with such few results, I'm guessing the source data had a bunch of records and does not have the indexes the sql needs.
Make sure there is an index on Opportunity.AccountID and an index on Ticket.AccountID.

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas