Finding Duplicate Records on Different Criteria - sql

I am trying to write a group of queries that find duplicate records based on matches across several columns, some of which must match exactly and some only approximately. I believe all of these queries should be very similar, with just a few changes to the filters for the different variations, but I may be mistaken.
Two examples:
A query to find all instances of records with identical matches in columns A, B, D, and E, with C and F shown in the results but not required to match.
A query to find all instances of records with identical matches in columns A, B, and D, while E only has to be similar to the other matching records (again, C and F shown in the results but not required to match).
SQL is a language I have some familiarity with, but on the whole I am a novice. So far I have tried to use GROUP BY with a HAVING clause to find instances where the count of the group is > 1, but I can only get this to work for the identical matches and can only see the columns used in the GROUP BY (no C and F). I have figured out that I can use Levenshtein distance to find the similar instances cleanly, but I have not been able to work out how to integrate this into the GROUP BY approach.
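In rough form, the identical-match attempt looks like this (columns reduced to placeholder names A-F for brevity; the real query is further down):

-- Sketch of the identical-match attempt with placeholder column names A-F:
-- it finds groups that occur more than once, but only the grouped columns
-- (A, B, D, E) are visible in the output, not C and F.
SELECT A, B, D, E, COUNT(*) AS dup_count
FROM TicketTable
GROUP BY A, B, D, E
HAVING COUNT(*) > 1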
Any advice on this would be helpful, feel free to ask any questions.
Thanks
This block is as far as I have got with an example that uses three identical columns and one similar column. It produces some accurate results, but also a handful of records that have only one of the necessary matches, and I am unsure why. I am not married to having the query work this way; this is just what I came up with. The E and R keys are unique identifiers needed for later processing.
WITH A AS (
SELECT
ROW_NUMBER() OVER(ORDER BY "Tickets"."id", "Tickets"."vid" ASC) AS RowNo,
to_hex(MD5( TO_UTF8( CONCAT( "Tickets"."vid", "Tickets"."id", "Tickets"."num", cast("Tickets"."date" as varchar) )))) AS R_KEY,
"Tickets"."vid" AS "vid",
"Tickets"."id" AS "id",
"Tickets"."num" AS "num",
"Tickets"."date" AS "date",
"Tickets"."amount" AS "amount",
ROW_NUMBER() OVER(ORDER BY "TicketsB"."vid", "TicketsB"."id" ASC) AS RowNo2,
to_hex(MD5( TO_UTF8( CONCAT( "TicketsB"."vid", "TicketsB"."id", "TicketsB"."num", cast("TicketsB"."date" as varchar) )))) AS R_KEY2,
"TicketsB"."vid" AS "vid2",
"TicketsB"."id" AS "id2",
"TicketsB"."num" AS "num2",
"TicketsB"."date" AS "date2",
"TicketsB"."amount" AS "amount2"
FROM
TicketTable AS "Tickets",
TicketTable AS "TicketsB"
WHERE "Tickets".vid = "TicketsB".vid
AND "Tickets".date = "TicketsB".date
AND "Tickets".amount = "TicketsB".amount
AND "Tickets".num != "TicketsB".num
AND levenshtein_distance("Tickets".num,"TicketsB".num) < 2
AND length("TicketsB".num) - length("Tickets".num) = 1
LIMIT 100000
)
SELECT to_hex(MD5( TO_UTF8( CONCAT(VID , cast(DATE as varchar), cast(AMOUNT as varchar) )))) AS E_KEY,
R_KEY, VID, ID, NUM, DATE, AMOUNT FROM A
UNION
SELECT to_hex(MD5( TO_UTF8( CONCAT(VID , cast(DATE as varchar), cast(AMOUNT as varchar) )))) AS E_KEY,
R_KEY2, VID2, ID2, NUM2, DATE2, AMOUNT2 FROM A
ORDER BY VID, DATE, AMOUNT

BigQuery "Schrödingers Row" or why ROW_NUMBER() is not a good identifier

Situation
We have fairly complex internal logic to allocate marketing spend to various channels and have recently started to rework some of our queries to simplify the setup. We recently came across a really puzzling case where using ROW_NUMBER() OVER() to identify unique rows led to very strange results.
Problem
In essence, using ROW_NUMBER() OVER() resulted in what I call Schrödinger's rows, as they appear to be matched and unmatched at the same time (please find a reproducible query below). In the attached screenshot (which is a result of the query) it can clearly be seen that
german_spend + non_german_spend > total_spend
Which should not be the case.
Query
Please note that execution of the query will give you different results each time you run it, as it relies on RAND() to generate dummy data. Also please be aware that the query is a very dumbed-down version of what we are doing. For reasons beyond the scope of this post, we needed to uniquely identify the buckets.
###################
# CREATE Dummy Data
###################
DECLARE NUMBER_OF_DUMMY_RECORDS DEFAULT 1000000;
WITH data AS (
SELECT
num as campaign_id,
RAND() as rand_1,
RAND() as rand_2
FROM
UNNEST(GENERATE_ARRAY(1, NUMBER_OF_DUMMY_RECORDS)) AS num
),
spend_with_categories AS (
SELECT
campaign_id,
CASE
WHEN rand_1 < 0.25 THEN 'DE'
WHEN rand_1 < 0.5 THEN 'AT'
WHEN rand_1 < 0.75 THEN 'CH'
ELSE 'IT'
END AS country,
CASE
WHEN rand_2 < 0.25 THEN 'SMALL'
WHEN rand_2 < 0.5 THEN 'MEDIUM'
WHEN rand_2 < 0.75 THEN 'BIG'
ELSE 'MEGA'
END AS city_size,
CAST(RAND() * 1000000 AS INT64) as marketing_spend
FROM
data
),
###################
# END Dummy Data
###################
spend_buckets AS (
SELECT
country,
city_size,
CONCAT("row_", ROW_NUMBER() OVER()) AS identifier,
#MD5(CONCAT(country, city_size)) AS identifier, (this works)
SUM(marketing_spend) AS marketing_spend
FROM
spend_with_categories
GROUP BY 1,2
),
german_spend AS (
SELECT
country,
ARRAY_AGG(identifier) AS identifier,
SUM(marketing_spend) AS marketing_spend
FROM
spend_buckets
WHERE
country = 'DE'
GROUP BY
country
),
german_identifiers AS (
SELECT id AS identifier FROM german_spend, UNNEST(identifier) as id
),
non_german_spend AS (
SELECT SUM(marketing_spend) AS marketing_spend FROM spend_buckets WHERE identifier NOT IN (SELECT identifier FROM german_identifiers)
)
(SELECT "german_spend" AS category, SUM(marketing_spend) AS marketing_spend FROM german_spend
UNION ALL
SELECT "non_german_spend" AS category, SUM(marketing_spend) AS marketing_spend FROM non_german_spend
UNION ALL
SELECT "total_spend" AS category, SUM(marketing_spend) AS marketing_spend FROM spend_buckets)
Solution
We were actually able to solve the problem by using a hash of the key instead of the ROW_NUMBER() OVER() identifier, but out of curiosity I would still love to understand what causes this.
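For reference, the deterministic-identifier version of spend_buckets that worked looks roughly like this (the same idea as the commented-out MD5 line in the query above, wrapped in TO_HEX to keep a string identifier):

spend_buckets AS (
SELECT
country,
city_size,
# deterministic identifier derived from the grouping key, so every
# reference to this CTE computes the same value for the same bucket
TO_HEX(MD5(CONCAT(country, city_size))) AS identifier,
SUM(marketing_spend) AS marketing_spend
FROM
spend_with_categories
GROUP BY 1,2
),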
Additional Notes
Using GENERATE_UUID() AS identifier instead of CONCAT("row_", ROW_NUMBER() OVER()) AS identifier leads to almost zero matches, i.e. the entire spend is classified as non-German.
Writing spend_buckets to a table also solves the problem, which leads me to believe that maybe ROW_NUMBER() OVER() is lazily executed or something similar?
Using a small number of dummy records also produces non-matching results, regardless of the method used to generate a "unique" id.
Hash functions are much better for marking rows than a generated row number, which can change on every execution.
CTEs (WITH tables) are not persistent; they are recalculated each time they are referenced in your query.
Referencing the same CTE several times within one query can therefore produce different results:
With test as (Select rand() as x)
Select * from test
union all Select * from test
union all Select * from test
A good solution is to use a temp table. A workaround is to look for any CTE that creates a row number or generates random numbers and is referenced more than once later on: rename it, wrap it in a second CTE declared under WITH RECURSIVE, and have the later references use that wrapper. In your example that CTE is spend_buckets:
WITH recursive
...
spend_buckets_ as (
...),
spend_buckets as
(select * from spend_buckets_
union all select * from spend_buckets_
where false
),
Then the values will match.
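Alternatively, a rough sketch of the temp-table approach mentioned above (BigQuery scripting; the dummy-data CTEs would need to be repeated or inlined, so this is only an outline). Materializing the buckets once fixes the generated identifiers before they are read:

CREATE TEMP TABLE spend_buckets AS
SELECT
country,
city_size,
CONCAT("row_", ROW_NUMBER() OVER()) AS identifier,
SUM(marketing_spend) AS marketing_spend
FROM
spend_with_categories # repeat or inline the dummy-data CTEs here
GROUP BY 1,2;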

Modify my SQL Server query -- returns too many rows sometimes

I need to update the following query so that it only returns one child record (remittance) per parent (claim).
Table Remit_To_Activate contains exactly one date/timestamp per claim, which is what I wanted.
But when I join the full Remittance table to it, since some claims have multiple remittances with the same date/timestamps, the outermost query returns more than 1 row per claim for those claim IDs.
SELECT * FROM REMITTANCE
WHERE BILLED_AMOUNT>0 AND ACTIVE=0
AND REMITTANCE_UUID IN (
SELECT REMITTANCE_UUID FROM Claims_Group2 G2
INNER JOIN Remit_To_Activate t ON (
(t.ClaimID = G2.CLAIM_ID) AND
(t.DATE_OF_LATEST_REGULAR_REMIT = G2.CREATE_DATETIME)
)
where ACTIVE=0 and BILLED_AMOUNT>0
)
I believe the problem would be resolved if I included REMITTANCE_UUID as a column in Remit_To_Activate. That's the REAL issue. This is how I created the Remit_To_Activate table (trying to get the most recent remittance for a claim):
SELECT MAX(create_datetime) as DATE_OF_LATEST_REMIT,
MAX(claim_id) AS ClaimID
INTO Latest_Remit_To_Activate
FROM Claims_Group2
WHERE BILLED_AMOUNT>0
GROUP BY Claim_ID
ORDER BY Claim_ID
Claims_Group2 contains these fields:
REMITTANCE_UUID,
CLAIM_ID,
BILLED_AMOUNT,
CREATE_DATETIME
Here are the 2 rows that are currently giving me the problem -- they're both remittances for the SAME CLAIM, with the SAME TIMESTAMP. I only want one of them in the Remits_To_Activate table, so only ONE remittance will be "activated" per claim:
[screenshot: two remittance rows for the same claim with identical timestamps]
You can change your query like this:
SELECT
p.*, latest_remit.DATE_OF_LATEST_REMIT
FROM
Remittance AS p inner join
(SELECT MAX(create_datetime) as DATE_OF_LATEST_REMIT,
claim_id
FROM Claims_Group2
WHERE BILLED_AMOUNT>0
GROUP BY Claim_ID) as latest_remit
on latest_remit.claim_id = p.claim_id;
This will give you only one row. Untested (so please run and make changes).
Without having more information on the structure of your database -- especially the structure of Claims_Group2 and REMITTANCE, and the relationship between them -- it's not really possible to advise you on how to introduce a remittance UUID into DATE_OF_LATEST_REMIT.
Since you are using SQL Server, however, it is possible to use a window function to introduce a synthetic means to choose among remittances having the same timestamp. For example, it looks like you could approach the problem something like this:
select *
from (
select
r.*,
row_number() over (partition by cg2.claim_id order by cg2.create_datetime desc) as rn
from
remittance r
join claims_group2 cg2
on r.remittance_uuid = cg2.remittance_uuid
where
r.active = 0
and r.billed_amount > 0
and cg2.active = 0
and cg2.billed_amount > 0
) t
where t.rn = 1
Note that this does not depend on your DATE_OF_LATEST_REMIT table at all, it having been subsumed into the inline view. Note also that this will introduce one extra column into your results, though you could avoid that by enumerating the columns of table remittance in the outer select clause.
It also seems odd to be filtering on two sets of active and billed_amount columns, but that appears to follow from what you were doing in your original queries. In that vein, I urge you to check the results carefully, as lifting the filter conditions on cg2 columns up to the level of the join to remittance yields a result that may return rows that the original query did not (but never more than one per claim_id).
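A sketch of that last variant (the column list here is illustrative; substitute the actual columns of remittance):

select
t.remittance_uuid, t.claim_id, t.billed_amount, t.active -- list the real remittance columns here instead of t.*
from (
select
r.*,
row_number() over (partition by cg2.claim_id order by cg2.create_datetime desc) as rn
from
remittance r
join claims_group2 cg2
on r.remittance_uuid = cg2.remittance_uuid
where
r.active = 0
and r.billed_amount > 0
and cg2.active = 0
and cg2.billed_amount > 0
) t
where t.rn = 1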
A co-worker offered me this elegant demonstration of a solution. I'd never used "over" or "partition" before. Works great! Thank you John and Gaurasvsa for your input.
if OBJECT_ID('tempdb..#t') is not null
drop table #t
select *, ROW_NUMBER() over (partition by CLAIM_ID order by CLAIM_ID) as ROW_NUM
into #t
from
(
select '2018-08-15 13:07:50.933' as CREATE_DATE, 1 as CLAIM_ID, NEWID() as REMIT_UUID
union select '2018-08-15 13:07:50.933', 1, NEWID()
union select '2017-12-31 10:00:00.000', 2, NEWID()
) x
select *
from #t
order by CLAIM_ID, ROW_NUM
select CREATE_DATE, MAX(CLAIM_ID), MAX(REMIT_UUID)
from #t
where ROW_NUM = 1
group by CREATE_DATE

Find incorrect records by Id

I am trying to find records where the PersonID is associated with the incorrect SoundFile (string). I am trying to search for incorrect records among all PersonIDs, not just one specific one. Here are my example tables:
TASKS-
PersonID SoundFile(String)
123 D10285.18001231234.mp3
123 D10236.18001231234.mp3
123 D10237.18001231234.mp3
123 D10212.18001231234.mp3
123 D12415.18001231234.mp3
    126 D19542.18001231234.mp3
126 D10235.18001234567.mp3
126 D19955.18001234567.mp3
RECORDINGS-
PhoneNumber(Distinct Records)
18001231234
18001234567
So in this example, I am trying to find all records like the one that I indented. The majority of the sound files like '%18001231234%' are associated with PersonID 123, but this one record has PersonID 126. I need to find all records where, for each distinct number from the Recordings table, the PersonID is not the majority for that number.
Let me know if you need more information!
Thanks in advance!!
; WITH distinctRecordings AS (
SELECT DISTINCT PhoneNumber
FROM Recordings
),
PersonCounts as (
SELECT t.PersonID, dr.PhoneNumber, COUNT(*) AS num
FROM
Tasks t
JOIN distinctRecordings dr
ON t.SoundFile LIKE '%' + dr.PhoneNumber + '%'
GROUP BY t.PersonID, dr.PhoneNumber
)
SELECT t.PersonID, t.SoundFile
FROM PersonCounts pc1
JOIN PersonCounts pc2
ON pc2.PhoneNumber = pc1.PhoneNumber
AND pc2.PersonID <> pc1.PersonID
AND pc2.Num < pc1.Num
JOIN Tasks t
ON t.PersonID = pc2.PersonID
AND t.SoundFile LIKE '%' + pc2.PhoneNumber + '%'
SQL Fiddle Here
To summarize what this does... the first CTE, distinctRecordings, is just a distinct list of the Phone Numbers in Recordings.
Next, PersonCounts is a count of phone numbers associated with the records in Tasks for each PersonID.
This is then joined to itself to find any duplicates, and selects whichever duplicate has the smaller count... this is then joined back to Tasks to get the offending soundFile for that person / phone number.
(If your schema had some minor improvements made to it, this query would have been much simpler...)
Here you go: this returns all pairs (PersonID, PhoneNumber) where the person has fewer entries with the given phone number than the person with the maximum number of entries. Note that the query doesn't cater for multiple persons tied within a group.
select agg.pid
, agg.PhoneNumber
from (
select MAX(c) KEEP ( DENSE_RANK FIRST ORDER BY c DESC ) OVER ( PARTITION BY rt.PhoneNumber ) cmax
, rt.PhoneNumber
, rt.PersonID pid
, rt.c
from (
select r.PhoneNumber
, t.PersonID
, count(*) c
from recordings r
inner join tasks t on ( r.PhoneNumber = regexp_replace(t.SoundFile, '^[^.]+\.([^.]+)\.[^.]+$', '\1' ) )
group by r.PhoneNumber
, t.PersonID
) rt
) agg
where agg.c < agg.cmax
;
Caveat: the solution is in Oracle syntax, though the operations should be in the current SQL standard (possibly apart from regexp_replace, which might not matter too much since your sound file data seems to follow a fixed-position structure).
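If regexp_replace is not available (for example on SQL Server, which the other answer appears to target), the middle segment can be pulled out with plain string functions instead. A rough sketch, assuming every SoundFile follows the D#####.number.mp3 layout:

select
t.PersonID,
-- the text between the first and second '.' is the phone number
substring(
t.SoundFile,
charindex('.', t.SoundFile) + 1,
charindex('.', t.SoundFile, charindex('.', t.SoundFile) + 1) - charindex('.', t.SoundFile) - 1
) as PhoneNumber
from Tasks t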

filtering rows by checking a condition for group in one statement only

I have the following statement:
SELECT
(CONVERT(VARCHAR(10), f1, 120)) AS ff1,
CONVERT(VARCHAR(10), f2, 103) AS ff2,
...,
Bonus,
Malus,
ClientID
FROM
my_table
WHERE
<my_conditions>
ORDER BY
f1 ASC
This select returns several rows for each ClientID. I have to filter out all rows for clients that don't have any row with a non-empty Bonus or Malus.
How can I do this by changing this select, using one statement only and without duplicating the whole select?
I could store the result in a #temp_table, then group the data and use the result of the grouping to filter the temp table - BUT I should do it in one statement only.
I could perform this select twice - once grouping it and then filtering the rows based on the grouping result. BUT I don't want to select it twice.
Maybe a CTE (Common Table Expression) could be useful here to perform the select only once and use its result both for the grouping and then for selecting the desired rows based on the grouping result.
Any more elegant solution for this problem?
Thank you in advance!
Just to clarify what the SQL should do, I add an example:
ClientID Bonus Malus
1 1
1
1 1
2
2
3 4
3 5
3 1
So in this case I don't want the ClientID=2 rows to appear (they are not interesting). The result should be:
ClientID Bonus Malus
1 1
1
1 1
3 4
3 5
3 1
SELECT Bonus,
Malus,
ClientID
FROM my_table
WHERE ClientID not in
(
select ClientID
from my_table
group by ClientID
having count(Bonus) = 0 and count(Malus) = 0
)
A CTE will work fine, but in effect its contents will be executed twice because they are being cloned into all the places where the CTE is being used. This can be a net performance win or loss compared to using a temp table. If the query is very expensive it might come out as a loss. If it is cheap or if many rows are being returned the temp table will lose the comparison.
Which solution is better? Look at the execution plans and measure the performance.
The CTE is the easier, more maintainable and less redundant alternative.
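For illustration, a CTE-based variant of the NOT IN query above might look like this (a sketch against the simplified column list, untested):

WITH clients_with_values AS (
    SELECT ClientID
    FROM my_table
    GROUP BY ClientID
    HAVING COUNT(Bonus) > 0 OR COUNT(Malus) > 0 -- clients with at least one non-empty value
)
SELECT t.Bonus,
       t.Malus,
       t.ClientID
FROM my_table t
JOIN clients_with_values c
  ON c.ClientID = t.ClientID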
You haven't specified the data types of the Bonus and Malus columns. If they are integer (or can be converted to integer), then the query below should be helpful. It calculates the sum of both columns for each ClientID. These sums are the same for every detail line of the same client, so we can use them in the WHERE condition. SUM() OVER() is a window function and can't be used in a WHERE clause, so I had to wrap your select list in an outer query purely for syntax reasons.
SELECT *
FROM (
SELECT
CONVERT(VARCHAR(10), f1, 120) AS ff1,
CONVERT(VARCHAR(10), f2, 103) AS ff2,
...,
Bonus,
Malus,
ClientID,
SUM(Bonus) OVER (PARTITION BY ClientID) AS ClientBonusTotal,
SUM(Malus) OVER (PARTITION BY ClientID) AS ClientMalusTotal
FROM
my_table
WHERE
<my_conditions>
) a
WHERE ISNULL(a.ClientBonusTotal, 0) <> 0 OR ISNULL(a.ClientMalusTotal, 0) <> 0
ORDER BY f1 ASC

Weighted average in T-SQL (like Excel's SUMPRODUCT)

I am looking for a way to derive a weighted average from two rows of data with the same number of columns, where the average is as follows (borrowing Excel notation):
((A1*B1) + (A2*B2) + ... + (An*Bn)) / SUM(A1:An)
The first part reflects the same functionality as Excel's SUMPRODUCT() function.
My catch is that I need to dynamically specify which row gets averaged with weights, and which row the weights come from, and a date range.
EDIT: This is easier than I thought, because Excel was making me think I required some kind of pivot. My solution so far is thus:
select sum(baseSeries.Actual * weightSeries.Actual) / sum(weightSeries.Actual)
from (
select RecordDate , Actual
from CalcProductionRecords
where KPI = 'Weighty'
) baseSeries inner join (
select RecordDate , Actual
from CalcProductionRecords
where KPI = 'Tons Milled'
) weightSeries on baseSeries.RecordDate = weightSeries.RecordDate
Quassnoi's answer shows how to do the SumProduct, and using a WHERE clause would allow you to restrict by a Date field...
SELECT
SUM([tbl].data * [tbl].weight) / SUM([tbl].weight)
FROM
[tbl]
WHERE
[tbl].date >= '2009 Jan 01'
AND [tbl].date < '2010 Jan 01'
The more complex part is where you want to "dynamically specify" which field is [data] and which field is [weight]. The short answer is that realistically you'd have to make use of dynamic SQL. Something along the lines of the steps below, with a sketch after the list:
- Create a string template
- Replace all instances of [tbl].data with the appropriate data field
- Replace all instances of [tbl].weight with the appropriate weight field
- Execute the string
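A hedged sketch of that template approach in T-SQL (the column names and date range are placeholders, not taken from the question):

-- placeholders for the dynamically chosen columns
DECLARE @dataCol   sysname = N'Actual';
DECLARE @weightCol sysname = N'Weight';
DECLARE @sql nvarchar(max) =
      N'SELECT SUM(' + QUOTENAME(@dataCol) + N' * ' + QUOTENAME(@weightCol) + N')'
    + N'      / SUM(' + QUOTENAME(@weightCol) + N')'
    + N' FROM [tbl]'
    + N' WHERE [tbl].[date] >= @from AND [tbl].[date] < @to;';

-- QUOTENAME guards the injected identifiers; the dates stay as real parameters
EXEC sp_executesql @sql, N'@from date, @to date', @from = '20090101', @to = '20100101';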
Dynamic SQL, however, carries its own overhead. If the queries are relatively infrequent, or the execution time of the query itself is relatively long, this may not matter. If they are common and short, however, you may notice that using dynamic SQL introduces a noticeable overhead. (Not to mention needing to be careful about SQL injection attacks, etc.)
EDIT:
In your latest example you highlight three fields:
RecordDate
KPI
Actual
When the [KPI] is "Weight Y", then [Actual] the Weighting Factor to use.
When the [KPI] is "Tons Milled", then [Actual] is the Data you want to aggregate.
Some questions I have are:
Are there any other fields?
Is there only ever ONE actual per date per KPI?
The reason I ask is that you want to ensure the JOIN you do is only ever 1:1. (You don't want 5 actuals joining with 5 weights, giving 25 resulting records.)
Regardless, a slight simplification of your query is certainly possible...
SELECT
SUM([baseSeries].Actual * [weightSeries].Actual) / SUM([weightSeries].Actual)
FROM
CalcProductionRecords AS [baseSeries]
INNER JOIN
CalcProductionRecords AS [weightSeries]
ON [weightSeries].RecordDate = [baseSeries].RecordDate
-- AND [weightSeries].someOtherID = [baseSeries].someOtherID
WHERE
[baseSeries].KPI = 'Tons Milled'
AND [weightSeries].KPI = 'Weighty'
The commented out line only needed if you need additional predicates to ensure a 1:1 relationship between your data and the weights.
If you can't guarantee just one value per date, and don't have any other fields to join on, you can modify your subquery-based version slightly...
SELECT
SUM([baseSeries].Actual * [weightSeries].Actual) / SUM([weightSeries].Actual)
FROM
(
SELECT
RecordDate,
SUM(Actual) AS Actual
FROM
CalcProductionRecords
WHERE
KPI = 'Tons Milled'
GROUP BY
RecordDate
)
AS [baseSeries]
INNER JOIN
(
SELECT
RecordDate,
AVG(Actual) AS Actual
FROM
CalcProductionRecords
WHERE
KPI = 'Weighty'
GROUP BY
RecordDate
)
AS [weightSeries]
ON [weightSeries].RecordDate = [baseSeries].RecordDate
This assumes the AVG of the weight is valid if there are multiple weights for the same day.
EDIT : Someone just voted for this so I thought I'd improve the final answer :)
SELECT
SUM(Actual * Weight) / SUM(Weight)
FROM
(
SELECT
RecordDate,
SUM(CASE WHEN KPI = 'Tons Milled' THEN Actual ELSE NULL END) AS Actual,
AVG(CASE WHEN KPI = 'Weighty' THEN Actual ELSE NULL END) AS Weight
FROM
CalcProductionRecords
WHERE
KPI IN ('Tons Milled', 'Weighty')
GROUP BY
RecordDate
)
AS pivotAggregate
This avoids the JOIN and also only scans the table once.
It relies on the fact that NULL values are ignored when calculating the AVG().
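A quick throwaway check of that behaviour (not part of the query itself):

-- NULLs are ignored by AVG: this returns 2, the average of 1 and 3 only
SELECT AVG(x) FROM (VALUES (1), (NULL), (3)) AS v(x)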
SELECT SUM(A * B) / SUM(A)
FROM mytable
If I have understood the problem correctly, then try this:
SET DATEFORMAT dmy
declare #tbl table(A int, B int,recorddate datetime,KPI varchar(50))
insert into #tbl
select 1,10 ,'21/01/2009', 'Weighty'union all
select 2,20,'10/01/2009', 'Tons Milled' union all
select 3,30 ,'03/02/2009', 'xyz'union all
select 4,40 ,'10/01/2009', 'Weighty'union all
select 5,50 ,'05/01/2009', 'Tons Milled'union all
select 6,60,'04/01/2009', 'abc' union all
select 7,70 ,'05/01/2009', 'Weighty'union all
select 8,80,'09/01/2009', 'xyz' union all
select 9,90 ,'05/01/2009', 'kws' union all
select 10,100,'05/01/2009', 'Tons Milled'
select SUM(t1.A*t2.A)/SUM(t2.A)Result from
(select RecordDate,A,B,KPI from #tbl)t1
inner join(select RecordDate,A,B,KPI from #tbl t)t2
on t1.RecordDate = t2.RecordDate
and t1.KPI = t2.KPI