Unique Rows that match a specific criteria

My data is Microsoft Office 365 Mailbox audit logs.
I am working with 14 columns, incorporating names, timestamps, IP addresses, etc.
I have two tables, let's call them EXISTING and NEW. The column definitions, order and count are identical in the two tables.
The data in Existing is (very close to!) Distinct.
The data in New is drawn from multiple overlapping searches and is not Distinct.
There are millions of rows in Existing and hundreds of thousands in New.
Data is being written to New all the time, 24x7, with about 1 million rows a day being added.
~95% of the rows in New are already present in Existing and are therefore unwanted duplicates. However, the data in New has many gaps: there are many recent rows in Existing that are NOT present in New.
I want to select all rows from New that are not present in Existing, using Invoke-SqlCmd in PowerShell.
Then I want to delete all the processed rows from New so it doesn't grow uncontrollably.
My approach so far has been:
Add a [Processed] column to New.
Set [Processed] to 0 for all existing data for selection purposes. New rows that continue to be added will have [Processed] = NULL, and will be left alone.
SELECT DISTINCT all data with [Processed] = 0 from New and copy it to a temporary table called Staging. Find the oldest timestamp ([LastAccessed]) in this data. Then delete all rows from New with [Processed] = 0.
Copy all data from Existing with [LastAccessed] equal to or later than the above timestamp across to Staging, adding the column [Processed] = 1.
Now I want all data in Staging where [Processed] = 0 and there is No duplicate.
Nearest concept I can come up with is:
SELECT MailboxOwnerUPN
,MailboxResolvedOwnerName
,LastAccessed
,ClientIPAddress
,ClientInfoString
,MailboxGuid
,Operation
,OperationResult
,LogonType
,ExternalAccess
,InternalLogonType
,LogonUserDisplayName
,OriginatingServer
FROM dbo.Office365Staging
GROUP BY MailboxOwnerUPN
,MailboxResolvedOwnerName
,LastAccessed
,ClientIPAddress
,ClientInfoString
,MailboxGuid
,Operation
,OperationResult
,LogonType
,ExternalAccess
,InternalLogonType
,LogonUserDisplayName
,OriginatingServer
HAVING Count(1) = 1 and Processed = 0;
Which of course I can't do because [Processed] isn't part of the Select or Group. If I add the Column [Processed] then all lines are unique and there are no duplicates. Have tried a variety of joins and other techniques, without success thus far.
Initially without [Processed] = 0, the query worked, but returned unwanted unique lines from Existing. I only want unique lines from New.
Clearly due to the size of these structures efficiency is a consideration. This process will be happening regularly, every 15 minutes ideally.
Identifying these new lines then starts another process of Geo-IP, reputation, alerting, etc in PowerShell....
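
One possible way around the HAVING restriction described above is to aggregate [Processed] itself: a group whose highest [Processed] value is 0 contains no row copied from Existing. The sketch below is only an illustration under assumptions: it runs against an in-memory SQLite database from Python rather than SQL Server, the Staging table is cut down to three of the thirteen columns, and the sample rows are made up. In SQL Server, a BIT column would need casting to an integer type before MAX can be applied.

```python
import sqlite3

# Hypothetical, cut-down Staging table (3 of the 13 real columns).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Staging (MailboxGuid TEXT, LastAccessed TEXT, "
    "ClientIPAddress TEXT, Processed INTEGER)"
)
conn.executemany(
    "INSERT INTO Staging VALUES (?, ?, ?, ?)",
    [
        ("g1", "2020-01-01T10:00", "10.0.0.1", 1),  # copied from Existing
        ("g1", "2020-01-01T10:00", "10.0.0.1", 0),  # same row, from New
        ("g2", "2020-01-01T11:00", "10.0.0.2", 0),  # only in New
        ("g3", "2020-01-01T12:00", "10.0.0.3", 1),  # only in Existing
    ],
)

# Group on the data columns only; keep groups with no Existing row,
# i.e. groups where every Processed value is 0.
new_only = conn.execute(
    """
    SELECT MailboxGuid, LastAccessed, ClientIPAddress
    FROM Staging
    GROUP BY MailboxGuid, LastAccessed, ClientIPAddress
    HAVING MAX(Processed) = 0
    ORDER BY MailboxGuid
    """
).fetchall()
print(new_only)
```

Because the test is on MAX(Processed) rather than COUNT, a group still qualifies if the same New row happens to appear more than once in Staging.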

Thought the performance of the following would be horrid, but it is OK at ~27 seconds....
SELECT [MailboxOwnerUPN]
,[MailboxResolvedOwnerName]
,[LastAccessed]
,[ClientIPAddress]
,[ClientInfoString]
,[MailboxGuid]
,[Operation]
,[OperationResult]
,[LogonType]
,[ExternalAccess]
,[InternalLogonType]
,[LogonUserDisplayName]
,[OriginatingServer]
FROM dbo.New
WHERE [Processed] = 1 and
NOT EXISTS (Select * From dbo.Existing
Where New.LastAccessed = Existing.LastAccessed and
New.ClientIPAddress = Existing.ClientIPAddress and
New.ClientInfoString = Existing.ClientInfoString and
New.MailboxGuid = Existing.MailboxGuid)
GO
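
The NOT EXISTS approach above can be tried on a small scale as follows. This is a sketch under assumptions, not the production query: it uses an in-memory SQLite database from Python instead of SQL Server, the table names ExistingLog and NewLog are hypothetical stand-ins for Existing and New, only the four compared columns are modelled, and the sample rows are invented. DISTINCT is added because New is described as containing duplicates.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical stand-ins for dbo.Existing and dbo.New.
for table in ("ExistingLog", "NewLog"):
    conn.execute(
        f"CREATE TABLE {table} (LastAccessed TEXT, ClientIPAddress TEXT, "
        "ClientInfoString TEXT, MailboxGuid TEXT)"
    )

conn.execute(
    "INSERT INTO ExistingLog VALUES ('2020-01-01T10:00', '10.0.0.1', 'OWA', 'g1')"
)
conn.executemany(
    "INSERT INTO NewLog VALUES (?, ?, ?, ?)",
    [
        ("2020-01-01T10:00", "10.0.0.1", "OWA", "g1"),  # already in Existing
        ("2020-01-01T11:00", "10.0.0.2", "OWA", "g2"),  # genuinely new
        ("2020-01-01T11:00", "10.0.0.2", "OWA", "g2"),  # duplicate inside New
    ],
)

# Anti-join: keep New rows with no matching row in Existing.
unseen = conn.execute(
    """
    SELECT DISTINCT n.LastAccessed, n.ClientIPAddress,
                    n.ClientInfoString, n.MailboxGuid
    FROM NewLog AS n
    WHERE NOT EXISTS (
        SELECT 1
        FROM ExistingLog AS e
        WHERE e.LastAccessed = n.LastAccessed
          AND e.ClientIPAddress = n.ClientIPAddress
          AND e.ClientInfoString = n.ClientInfoString
          AND e.MailboxGuid = n.MailboxGuid
    )
    ORDER BY n.LastAccessed
    """
).fetchall()
print(unseen)
```

Note that `=` never matches NULL, so a row with NULL in any compared column will always be treated as new by this kind of comparison.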

Related

SQL: Choose latest uploaded data

I am seeing duplicates in my data after running my SQL query, and have traced the issue to our data team adding a new row instead of updating the existing one. In this instance, I need to use the largest LD_SEQ_NBR to get the latest data.
Given the following table -- ORDERS
ID  ORD_NBR    LD_SEQ_NBR
0   130263789  1665
1   130263789  1870
What do I need to add to my WHERE clause to make sure I'm taking the rows with the largest LD_SEQ_NBR?

LD_SEQ_NBR = (SELECT MAX(LD_SEQ_NBR) FROM ORDERS A WHERE A.ORD_NBR = ORDERS.ORD_NBR)
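
As a quick check of the predicate above, here is a minimal sketch that runs it against an in-memory SQLite database from Python (an assumption; the original database engine is not stated). The first two rows are the ones from the question; the third row, for a second order number, is made up to show that the maximum is taken per ORD_NBR.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ORDERS (ID INTEGER, ORD_NBR INTEGER, LD_SEQ_NBR INTEGER)")
conn.executemany(
    "INSERT INTO ORDERS VALUES (?, ?, ?)",
    [
        (0, 130263789, 1665),  # older load of the order
        (1, 130263789, 1870),  # latest load of the same order
        (2, 130263790, 1700),  # hypothetical second order, single load
    ],
)

# Correlated subquery: keep only the row with the highest LD_SEQ_NBR
# for each ORD_NBR.
latest = conn.execute(
    """
    SELECT ID, ORD_NBR, LD_SEQ_NBR
    FROM ORDERS
    WHERE LD_SEQ_NBR = (SELECT MAX(A.LD_SEQ_NBR)
                        FROM ORDERS A
                        WHERE A.ORD_NBR = ORDERS.ORD_NBR)
    ORDER BY ID
    """
).fetchall()
print(latest)
```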

SQL MERGE Conditional Delete on MATCH

I have two tables that I use in a MERGE. Something is wrong with my query and I cannot find any information as to why. When I run the query the first time, it works fine: my target table populates with the information I want. When I run it again immediately afterwards, it changes the table (38 rows affected). When I run it a third time it adds the rows back in, and on the next run the rows are deleted again. There are over 100 rows but only the same rows seem to be affected. Nothing changes in the source table.
I suspect that I am performing a DELETE on WHEN MATCHED - I can only find information for DELETE WHEN NOT MATCHED - but I don't know as this only seems to affect the same 38 rows every time.
The tables and (hopefully) information you need are here in my code:
MERGE QA.dbo.RMA AS target USING Touchstn02.dbo.RMA_Detail AS source
ON (target.RMANUM_52 = source.RMANUM_52)
WHEN MATCHED AND (source.STATUS_52 >3)
THEN
DELETE
WHEN NOT MATCHED AND (source.STATUS_52 < 4) AND (source.RECQTY_52 > 0)
THEN
INSERT (RMANUM_52, RMADATE_52, CUSTID_52, RETNUM_52, RETQTY_52,
SERIAL_52, REPAIR_52,
RPLACE_52, CREDIT_52, WRKNUM_52, KEYWRD_52, RECQTY_52,
RECDTE_52, STATUS_52,
REM1_52, REM2_52, REM3_52, Comment, CMPDTE)
VALUES (source.RMANUM_52, source.RMADTE_52, source.CUSTID_52,
source.RETNUM_52,
source.RETQTY_52, source.SERIAL_52, source.REPAIR_52,
source.RPLACE_52, source.CREDIT_52,
source.WRKNUM_52, source.KEYWRD_52, source.RECQTY_52,
source.RECDTE_52,
source.STATUS_52, source.REM1_52, source.REM2_52,
source.REM3_52, source.REM4_52,
source.CMPDTE_52);
As always, I appreciate any help/input

I think I have got it!
The 38 rows being deleted have at least two STATUS_52 values in the source table for these RMANUM_52 key values.
For a given RMANUM_52 key value:
One row has a STATUS_52 value < 4 and a Qty > 0.
The other row has a STATUS_52 value > 3.
So, on the first run through, a row is inserted with a STATUS_52 value < 4.
Then on the next run through the DELETE is triggered, because the matching is done on RMANUM_52 and the STATUS_52 condition is checked in the SOURCE table. The tricky thing here is that the MERGE looks at status values in all rows of the source table (including rows that were not part of the first insert). There is a different row in the source table with the same RMANUM_52 and a STATUS_52 > 3, so the DELETE condition matches and the delete happens.
I can't tell from your example exactly what your requirements are so I'm not going to hazard guessing a fix.
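
The alternating behaviour described above can be reproduced on a toy scale. The sketch below is an emulation, not the real statement: SQLite has no MERGE, so the two branches are written as a separate DELETE and INSERT in Python, with the insert candidates computed before the delete so that both branches see the target as it was at the start of the run. The tables are cut down to three columns and the single RMA number R100 with its two source rows is invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cols = "(RMANUM_52 TEXT, STATUS_52 INTEGER, RECQTY_52 INTEGER)"
conn.execute(f"CREATE TABLE source {cols}")
conn.execute(f"CREATE TABLE target {cols}")
# Two source rows share one key: one insertable, one that triggers DELETE.
conn.executemany(
    "INSERT INTO source VALUES (?, ?, ?)",
    [("R100", 2, 5), ("R100", 4, 0)],
)

def run_merge(conn):
    """Emulate one run of the MERGE and return the target row count."""
    # WHEN NOT MATCHED AND STATUS_52 < 4 AND RECQTY_52 > 0 THEN INSERT
    to_insert = conn.execute(
        """
        SELECT s.RMANUM_52, s.STATUS_52, s.RECQTY_52
        FROM source AS s
        WHERE s.STATUS_52 < 4 AND s.RECQTY_52 > 0
          AND NOT EXISTS (SELECT 1 FROM target AS t
                          WHERE t.RMANUM_52 = s.RMANUM_52)
        """
    ).fetchall()
    # WHEN MATCHED AND source.STATUS_52 > 3 THEN DELETE
    conn.execute(
        """
        DELETE FROM target
        WHERE EXISTS (SELECT 1 FROM source AS s
                      WHERE s.RMANUM_52 = target.RMANUM_52
                        AND s.STATUS_52 > 3)
        """
    )
    conn.executemany("INSERT INTO target VALUES (?, ?, ?)", to_insert)
    return conn.execute("SELECT COUNT(*) FROM target").fetchone()[0]

counts = [run_merge(conn) for _ in range(3)]
print(counts)  # [1, 0, 1]
```

The row is inserted on the first run, deleted on the second and re-inserted on the third, which matches the flip-flop reported in the question.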

SQL Query stopped working and can't figure out why

Basically I am using MS Access 2013 to import all active work items that are assigned to a specific group from an API and select the data into 2 new tables (Requests & Request_Tasks).
I then have a form sourced from a query to select specific fields from the 2 tables.
Until yesterday it was working with no problems and nothing has changed.
All of the data appears in the 2 tables so the import from the API works fine.
When it comes to the query selecting the data from the 2 tables (Which are already populated with the correct data) the query returns only data from Requests table with blank fields instead of data from Request_Tasks.
The strange part is that out of 28 active work items it returns 24 correctly and the last 4 are having the problem.
Every new task added to the group has the problem also.
Query is below.
SELECT
Request_Tasks.RQTASK_Number,
Request_Tasks.Request_Number,
Requests.Task, Requests.Entity,
Request_Tasks.Description,
Request_Tasks.Request_Status,
Requests.Requested_for_date,
Request_Tasks.Work_On_Date,
Request_Tasks.Estimated_Time,
Request_Tasks.Actual_Time_Analysis,
Request_Tasks.Offers_Built,
Request_Tasks.Number_of_links_Opened,
Request_Tasks.Number_of_Links_Extended,
Request_Tasks.Number_Of_links_closed,
Request_Tasks.Build_Allocated_to,
Request_Tasks.Buld_Review_Allocated_to,
Request_Tasks.Keying_Allocated_to,
Request_Tasks.Keying_Approval_allocated_to,
Request_Tasks.Actual_Build_Time,
Request_Tasks.Actual_Stakeholder_Support,
Request_Tasks.Task_Completed_Date
FROM Request_Tasks
RIGHT JOIN Requests
ON Request_Tasks.Request_Number = Requests.Request_Number
WHERE (((Request_Tasks.Task_Completed_Date)>=Date()
Or (Request_Tasks.Task_Completed_Date) Is Null)
AND ((Requests.Task)<>"7"
And (Requests.Task)<>"8" And (Requests.Task)<>"9"))
OR (((Request_Tasks.Task_Completed_Date)>=Date()
Or (Request_Tasks.Task_Completed_Date) Is Null)
AND ((Requests.Task)<>"7"
And (Requests.Task)<>"8"
And (Requests.Task)<>"9"))
ORDER BY Request_Tasks.Work_On_Date Is Null DESC , Request_Tasks.Work_On_Date, Requests.Entity Is Null DESC , Requests.Task;
Any help would be great.
Thanks.

The query uses a RIGHT JOIN, which means rows from the Requests table are always returned even if there is no corresponding entry in the Request_Tasks table.
A full example is here http://www.w3schools.com/Sql/sql_join_right.asp
In your case, most likely something changed during the data load/API and the Request_Tasks table is not being populated for those items. That is the reason you see blank data for fields from that table.
Solution
Manually check the data for the 4 faulty records in the Request_Tasks table.
Ensure the Request_Number keys in both tables match for the faulty records, including data type and any leading spaces or non-printable characters (if they are strings).
The query seems fine; based on the problem statement, it is more an issue with the data.
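
The key check suggested above could be sketched like this. It is an illustration under assumptions: an in-memory SQLite database driven from Python stands in for Access, the tables are reduced to a couple of columns, the sample values are invented, and a LEFT JOIN from Requests is used in place of the equivalent RIGHT JOIN. The query lists requests that have no exact match in Request_Tasks but do match once surrounding spaces are trimmed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Requests (Request_Number TEXT, Task TEXT)")
conn.execute("CREATE TABLE Request_Tasks (RQTASK_Number TEXT, Request_Number TEXT)")
conn.executemany(
    "INSERT INTO Requests VALUES (?, ?)",
    [("REQ001", "1"), ("REQ002", "2")],
)
conn.executemany(
    "INSERT INTO Request_Tasks VALUES (?, ?)",
    [("T1", "REQ001"), ("T2", " REQ002")],  # second key has a leading space
)

# x: exact join as in the original query; t: join on trimmed keys.
# Rows where x finds nothing but t does are the suspect keys.
suspects = conn.execute(
    """
    SELECT r.Request_Number, t.RQTASK_Number
    FROM Requests AS r
    LEFT JOIN Request_Tasks AS x
           ON x.Request_Number = r.Request_Number
    JOIN Request_Tasks AS t
           ON TRIM(t.Request_Number) = TRIM(r.Request_Number)
    WHERE x.Request_Number IS NULL
    """
).fetchall()
print(suspects)
```

TRIM here only strips spaces, so other non-printable characters would need a separate check.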

T-SQL query for SQL Server 2008 : how to query X # of rows where X is a value in a query while matching on another column

Summary:
I have a list of work items that I am attempting to assign to a list of workers. Each worker is allowed to have a maximum of 100 work items assigned to them. Each work item specifies the user that should work it (associated as an owner).
For example:
Jim works a total of 5 accounts, each with multiple work items. In total Jim already has 50 items assigned to him, so I am allowed to assign only 50 more.
My plight/goal:
I am using a temp table and a select statement to get the number of items each owner currently has assigned to them; I then calculate the available slots for new items and store the value in a new column. I need to be able to select from the items table where the owner matches my list of owners (in the temp table), retrieving for each user only as many rows as that user has available slots. The query would return only 50 rows for Jim even though there may be 200 matching the criteria, while Sam may get 0 rows because he has no available slots, even though there are 30 items for him in the items table.
I realize I may be approaching this problem wrong. I want to avoid using a cursor.
Edit: Adding some example code
SELECT
nUserID_Owner
, CASE
WHEN COUNT(c.nWorkID) >= 100 THEN 0
ELSE 100 - COUNT(c.nWorkID)
END
,COUNT(c.nWorkID)
FROM tblAccounts cic
LEFT JOIN tblWorkItems c
ON c.sAccountNumber = cic.sAccountNumber
AND c.nUserID_WorkAssignedTo = cic.nUserID_Owner
AND c.nTeamID_WorkAssignedTo = cic.nTeamID_Owner
WHERE cic.nUserID_Collector IS NOT NULL
AND nUserID_CurrentOwner = 5288
AND c.bCompleted = 0
GROUP BY nUserID_Owner
This provides output values of 5288, 50, 50 (in Jim's scenario)

It took longer than I wanted it to but I found a solution.
I did use a sub-query as suggested above to produce the work items with a unique row count by user.
I used PARTITION BY to produce a unique row count for each worker and included in my HAVING clause that the row number must be < the count of available slots. I'd post the code but it's beyond the character limit and I'd also have a lot of things to change to anonymize the system properly.
Originally I was approaching the problem incorrectly focusing on limiting the results rather than thinking about creating the necessary data to relate the result sets.
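
Since the actual code was not posted, here is a minimal sketch of the general idea only, not the author's query. It assumes hypothetical tables slots (owner and available slot count, as the temp table is described) and items, and runs against an in-memory SQLite database from Python; ROW_NUMBER() OVER (PARTITION BY ...) is also available in SQL Server 2008. The sketch filters in an outer WHERE with `rn <= available`, which is a choice made here so that exactly the available number of rows comes back.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical tables: slots per owner, and unassigned work items.
conn.execute("CREATE TABLE slots (owner_id INTEGER, available INTEGER)")
conn.execute("CREATE TABLE items (item_id INTEGER, owner_id INTEGER)")
conn.executemany("INSERT INTO slots VALUES (?, ?)", [(1, 2), (2, 0)])
conn.executemany(
    "INSERT INTO items VALUES (?, ?)",
    [(10, 1), (11, 1), (12, 1), (20, 2), (21, 2)],
)

# Number each owner's items, then keep only as many as the owner
# has available slots.
assigned = conn.execute(
    """
    SELECT item_id, owner_id
    FROM (
        SELECT i.item_id, i.owner_id, s.available,
               ROW_NUMBER() OVER (PARTITION BY i.owner_id
                                  ORDER BY i.item_id) AS rn
        FROM items AS i
        JOIN slots AS s ON s.owner_id = i.owner_id
    ) AS ranked
    WHERE rn <= available
    ORDER BY item_id
    """
).fetchall()
print(assigned)
```

Owner 1 has two slots and three candidate items, so two rows come back; owner 2 has no slots and gets none.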

Find lowest value and update other table with that value

I have 2 tables in SQL : Event and Swimstyle
The Event table has a value SwimstyleId which refers to Swimstyle.id
The Swimstyle table has 3 values : distance, relaycount and strokeid
Normally there would be somewhere between 30 and 50 rows in the table Swimstyle, which would hold all possible values (these are swimming distances like 50 (distance), 1 (relaycount), FREE (strokeid)).
However, due to a programming mistake the lookup for existing values didn't work and the importer of new results created a new swimstyle entry for each event added...
My Swimstyle table now consists of almost 200k rows, which of course is not the best idea performance-wise ;)
To fix this I want to go through all Events, get the swimstyle values that are attached, look up the first existing row in Swimstyle that has the same distance, relaycount and strokeid values, and update Event.SwimstyleId with that value.
When this is all done I can delete all orphaned Swimstyle rows, leaving a table with only 30-50 rows.
I have been trying to make a query that does this, but I am not getting anywhere. Can anyone point me in the right direction?

These 2 statements should fix the problem, if I've read it right. N.B. I haven't been able to try this out anywhere, and I've made a few assumptions about the table structure.
UPDATE event e
set swimstyle_id = (SELECT MIN(s_min.id)
FROM swimstyle s_min,swimstyle s_cur
WHERE s_min.distance = s_cur.distance
AND s_min.relaycount = s_cur.relaycount
AND s_min.strokeid = s_cur.strokeid
AND s_cur.id = e.swimstyle_id);
DELETE FROM swimstyle s
WHERE NOT EXISTS (SELECT 1
FROM event e
WHERE e.swimstyle_id = s.id);
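
Since the statements above were posted untested, here is a small trial of the same two steps. It is a sketch under assumptions: it runs in an in-memory SQLite database from Python, uses the answer's assumed column names (swimstyle_id, id), writes the self-join with explicit JOIN syntax, drops the table aliases on UPDATE and DELETE that not every engine accepts, and uses invented sample rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE swimstyle (id INTEGER PRIMARY KEY, distance INTEGER, "
    "relaycount INTEGER, strokeid TEXT)"
)
conn.execute("CREATE TABLE event (id INTEGER PRIMARY KEY, swimstyle_id INTEGER)")
conn.executemany(
    "INSERT INTO swimstyle VALUES (?, ?, ?, ?)",
    [(1, 50, 1, "FREE"), (2, 50, 1, "FREE"), (3, 100, 1, "BACK"), (4, 50, 1, "FREE")],
)
conn.executemany("INSERT INTO event VALUES (?, ?)", [(1, 1), (2, 2), (3, 3), (4, 4)])

# Step 1: point every event at the lowest id among identical swimstyles.
conn.execute(
    """
    UPDATE event
    SET swimstyle_id = (
        SELECT MIN(s_min.id)
        FROM swimstyle AS s_min
        JOIN swimstyle AS s_cur
          ON s_min.distance = s_cur.distance
         AND s_min.relaycount = s_cur.relaycount
         AND s_min.strokeid = s_cur.strokeid
        WHERE s_cur.id = event.swimstyle_id
    )
    """
)
# Step 2: remove swimstyles that no event references any more.
conn.execute(
    """
    DELETE FROM swimstyle
    WHERE NOT EXISTS (SELECT 1 FROM event AS e
                      WHERE e.swimstyle_id = swimstyle.id)
    """
)

events = conn.execute("SELECT id, swimstyle_id FROM event ORDER BY id").fetchall()
styles = conn.execute("SELECT id FROM swimstyle ORDER BY id").fetchall()
print(events, styles)
```

One caveat: an event whose swimstyle_id points at a missing swimstyle row would be set to NULL by the update, so it may be worth checking for such rows first.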
