Finding & updating duplicate rows - SQL

I need to implement a query (or maybe a stored procedure) that will perform soft de-duplication of data in one of my tables. If any two records are similar enough, I need to "squash" them: deactivate one and update the other.
The similarity is based on a score, calculated the following way:
from both records, take the values of column A,
values equal? add A1 to the score,
values not equal? subtract A2 from the score,
move on to the next column.
Once all desired value pairs have been checked:
is the resulting score more than X?
yes – the records are duplicates; mark the older record as "duplicate" and append its id to the duplicate_ids column of the newer record.
no – do nothing.
How would I approach solving this task in SQL?
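To make the scoring rule concrete, here is a rough sketch of the kind of pairwise comparison I have in mind; the column names (first_name, last_name), the weights (10 and -5 standing in for A1 and A2), and the threshold (12 standing in for X) are all placeholders:
SELECT *
FROM (
  SELECT p1.id AS older_id, p2.id AS newer_id,
         -- +A1 when the values match, -A2 when they differ
         CASE WHEN p1.first_name = p2.first_name THEN 10 ELSE -5 END
       + CASE WHEN p1.last_name  = p2.last_name  THEN 10 ELSE -5 END AS score
  FROM people p1
  JOIN people p2 ON p1.id < p2.id -- compare each unordered pair once
) scored
WHERE score > 12; -- X, the duplicate threshold
A naive self-join like this compares every pair of rows, which is exactly what worries me at tens of millions of records.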
The table in question is called people. People records are entered by different admins. The de-duplication process exists to make sure no two records for the same person exist in the system.
The motivation for the task is simple: performance.
Right now the solution is implemented in a scripting language via several sub-par SQL queries and logic on top of them. However, the volume of data is expected to grow to tens of millions of records, and the script will eventually become very slow (it is meant to run via cron every night).
I'm using PostgreSQL.

It appears that de-duplication is generally a tough problem.
I found this: https://github.com/dedupeio/dedupe. There's a good description of how this works: https://dedupe.io/documentation/how-it-works.html.
I'm going to explore dedupe. I'm not going to try to implement it in SQL.

If I understand you correctly, this could help.
You can use PostgreSQL Window Functions to get all the duplicates and use "weights" to determine which records are duplicated so you can do whatever you like with them.
Here is an example:
-- Temporary table for the test; the primary key is id and
-- we have colA, colB, colC columns with a creation date:
CREATE TEMP TABLE test
(id serial, "colA" text, "colB" text, "colC" text,creation_date date);
-- Insert test data:
INSERT INTO test ("colA", "colB", "colC",creation_date) VALUES
('A','B','C','2017-05-01'),('D','E','F','2017-06-01'),('A','B','D','2017-08-01'),
('A','B','R','2017-09-01'),('C','J','K','2017-09-01'),('A','C','J','2017-10-01'),
('C','W','K','2017-10-01'),('R','T','Y','2017-11-01');
-- SELECT * FROM test
-- id | colA | colB | colC | creation_date
-- ----+------+------+------+---------------
--   1 | A    | B    | C    | 2017-05-01
--   2 | D    | E    | F    | 2017-06-01
--   3 | A    | B    | D    | 2017-08-01 <-- Duplicate A,B
--   4 | A    | B    | R    | 2017-09-01 <-- Duplicate A,B
--   5 | C    | J    | K    | 2017-09-01
--   6 | A    | C    | J    | 2017-10-01
--   7 | C    | W    | K    | 2017-10-01 <-- Duplicate C,K
--   8 | R    | T    | Y    | 2017-11-01
-- Here is the query you can use to get the ids of the duplicate records
-- (read the numbered comments from the innermost query outwards):
-- third, you select the id of the duplicates
SELECT id
FROM
(
-- Second, select all the columns needed and weight the duplicates.
-- You don't need to select every column; if only the id is needed,
-- you can select just the id.
-- Query this SQL to see results:
SELECT
id,"colA", "colB", "colC",creation_date,
-- The weights are simple: if the row number within its partition is
-- greater than 1 (the value has already appeared) then assign 1,
-- otherwise assign 0; sum them all and you have a
-- total weight of 'duplicity'.
CASE WHEN "num_colA">1 THEN 1 ELSE 0 END +
CASE WHEN "num_colB">1 THEN 1 ELSE 0 END +
CASE WHEN "num_colC">1 THEN 1 ELSE 0 END as weight
FROM
(
-- First, select using window functions and assign a row number.
-- You can run this query separately to see results
SELECT *,
-- NOTE that it is order by id, if needed you can order by creation_date instead
row_number() OVER(PARTITION BY "colA" ORDER BY id) as "num_colA",
row_number() OVER(PARTITION BY "colB" ORDER BY id) as "num_colB",
row_number() OVER(PARTITION BY "colC" ORDER BY id) as "num_colC"
FROM test ORDER BY id
) count_column_duplicates
) duplicates
-- Here you define which weight counts as a duplicate; for this test,
-- rows with a weight greater than 1 are selected.
WHERE weight>1
-- The full query returns all the duplicates according to the selected weight:
-- id
-- ----
-- 3
-- 4
-- 7
You can add this query to a stored procedure so you can run it whenever you like. Hope it helps.
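For example, a minimal sketch of such a wrapper as a PostgreSQL function (the function name is illustrative; the body is just the query above):
CREATE OR REPLACE FUNCTION find_duplicate_ids()
RETURNS SETOF integer AS $$
  SELECT id
  FROM
  (
    SELECT id,
      CASE WHEN "num_colA">1 THEN 1 ELSE 0 END +
      CASE WHEN "num_colB">1 THEN 1 ELSE 0 END +
      CASE WHEN "num_colC">1 THEN 1 ELSE 0 END as weight
    FROM
    (
      SELECT *,
        row_number() OVER(PARTITION BY "colA" ORDER BY id) as "num_colA",
        row_number() OVER(PARTITION BY "colB" ORDER BY id) as "num_colB",
        row_number() OVER(PARTITION BY "colC" ORDER BY id) as "num_colC"
      FROM test
    ) count_column_duplicates
  ) duplicates
  WHERE weight>1;
$$ LANGUAGE sql;
-- Then run it whenever needed:
-- SELECT * FROM find_duplicate_ids();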

Related

SQL Server query - keeping first and last unique records of a group

We are trying to remove and rank data in tables that are provided in a daily feed to our system. The example data of course isn't the actual product, but it clearly represents the concept.
Daily inserts:
data is imported daily into tables that continually updates the status of the products
the daily status updates tell us when products were listed, whether they are currently listed, and the last date they were listed
after a period of {X} time, we can normalize the data
Cleanup & ranking:
we are now trying to remove duplicate records for values in a group that fall in-between the first and last values
we also want to set identifiers for the records that represent the first and last occurrence of those unique values in that group
Sample data:
I've found that a photo is the easiest way to show the data and to show what's needed and not needed - I hope this makes it easier and not obtuse.
In the sample data:
"ridgerapp" we want to keep the records for 03/12/17 & 06/12/17.
"ridgerapp" we want to delete the records that fall between the dates above.
"ridgerapp" we want to also set/update the records for 03/12/17 & 06/12/17 as the first and last occurrence - something like -
update table set 03/12/17 = 0 (first), 06/12/17 = 1 (last)
"sierra" is just another expanded data sample, and we want to keep the records for 12/06/16 and 12/11/16.
"sierra" delete the records that fall between 12/06/16 and 12/11/16.
"sierra" update the status/rank for the 12/06/16 and 12/11/16 records as the first and last occurrence.
update table set 12/06/16 = 0 (first), 12/11/16 = 1 (last).
Conclusion:
Using pseudo code, this is the overall objective:
select distinct records in table (using id,name,color,value as unique identifiers)
for the records in each group look at the history and find the top and bottom dates
delete records between top and bottom dates for each group
update the history with a status/rank (field name is rank) of 0 and 1 for values in each group
Using the sample data, the results would end up as:
Updated table values:
23 ridgerapp blue 25 03/12/17 0
23 ridgerapp blue 25 06/12/17 1
57 sierra red 15 12/06/16 0
57 sierra red 15 12/11/16 1
I'd use a CTE with the row_number() window function to find the first and last rows for each group, and then update it.
You didn't specify what makes a group a group, so I based it only on the ID. If you want the group to be a set of columns, i.e. ID, Color, and Value, then just add those columns to the partition by list. For the sample data the result would be the same, but different sample data would have different outcomes.
Notice I didn't include the exact rows for the sierra group because I wanted to show you how it'd handle duplicate history dates.
declare @table table (id int, [name] varchar(64), color varchar(16), [value] int, history date)
insert into @table
values
(23,'ridgerapp','blue',25,'20170312'),
(23,'ridgerapp','blue',25,'20170325'),
(23,'ridgerapp','blue',25,'20170410'),
(23,'ridgerapp','blue',25,'20170610'),
(23,'ridgerapp','blue',25,'20170612'),
(57,'sierra','red',15,'20161206'),
(57,'sierra','red',15,'20161208'),
(57,'sierra','red',15,'20161210'),
(57,'sierra','red',15,'20161210') --notice this is a duplicate row
;with cte as(
select
*
,fst = row_number() over (partition by id order by history asc)
,lst = row_number() over (partition by id order by history desc)
from @table
)
delete from cte
where fst !=1 and lst !=1
select
*
,flag = case when row_number() over (partition by id order by history asc) = 1 then 0 else 1 end
from @table
RETURNS
+----+-----------+-------+-------+------------+------+
| id | name      | color | value | history    | flag |
+----+-----------+-------+-------+------------+------+
| 23 | ridgerapp | blue  | 25    | 2017-03-12 | 0    |
| 23 | ridgerapp | blue  | 25    | 2017-06-12 | 1    |
| 57 | sierra    | red   | 15    | 2016-12-06 | 0    |
| 57 | sierra    | red   | 15    | 2016-12-10 | 1    |
+----+-----------+-------+-------+------------+------+

Counting the total number of rows with SELECT DISTINCT ON without using a subquery

I am performing some queries using PostgreSQL's SELECT DISTINCT ON syntax. I would like the query to return the total number of rows alongside every result row.
Assume I have a table my_table like the following:
CREATE TABLE my_table(
id int,
my_field text,
id_reference bigint
);
I then have a couple of values:
 id | my_field | id_reference
----+----------+--------------
  1 | a        |            1
  1 | b        |            2
  2 | a        |            3
  2 | c        |            4
  3 | x        |            5
Basically my_table contains versioned data. The id_reference is a reference to a global version of the database. Every change to the database increases the global version number; changes always add new rows to the tables (instead of updating or deleting values), inserting the new version number.
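For illustration (a hypothetical example; 6 is assumed to be the next global version number):
-- "changing" the row with id=1 does not UPDATE it; a new row carrying
-- the next version number is inserted instead
INSERT INTO my_table (id, my_field, id_reference) VALUES (1, 'd', 6);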
My goal is to perform a query that will only retrieve the latest values in the table, alongside with the total number of rows.
For example, in the above case I would like to retrieve the following output:
| total | id | my_field | id_reference |
+-------+----+----------+--------------+
| 3 | 1 | b | 2 |
+-------+----+----------+--------------+
| 3 | 2 | c | 4 |
+-------+----+----------+--------------+
| 3 | 3 | x | 5 |
+-------+----+----------+--------------+
My attempt is the following:
select distinct on (id)
count(*) over () as total,
*
from my_table
order by id, id_reference desc
This returns almost the correct output, except that total is the number of rows in my_table instead of being the number of rows of the resulting query:
 total | id | my_field | id_reference
-------+----+----------+--------------
     5 |  1 | b        |            2
     5 |  2 | c        |            4
     5 |  3 | x        |            5
(3 rows)
As you can see it has 5 instead of the expected 3.
I can fix this by using a subquery and count as an aggregate function:
with my_values as (
select distinct on (id)
*
from my_table
order by id, id_reference desc
)
select count(*) over (), * from my_values
Which produces my expected output.
My question: is there a way to avoid using this subquery and have something similar to count(*) over () return the result I want?
You are looking at my_table 3 ways:
to find the latest id_reference for each id
to find my_field for the latest id_reference for each id
to count the distinct number of ids in the table
I therefore prefer this solution:
select
c.id_count as total,
a.id,
a.my_field,
b.max_id_reference
from
my_table a
join
(
select
id,
max(id_reference) as max_id_reference
from
my_table
group by
id
) b
on
a.id = b.id and
a.id_reference = b.max_id_reference
join
(
select
count(distinct id) as id_count
from
my_table
) c
on true;
This is a bit longer (especially the long thin way I write SQL) but it makes it clear what is happening. If you come back to it in a few months' time (somebody usually does) then it will take less time to understand what is going on.
The "on true" at the end is a deliberate cartesian product because there can only ever be exactly one result from the subquery "c" and you do want a cartesian product with that.
There is nothing necessarily wrong with subqueries.

SQL Server, complex query

I have an Azure SQL Database table which is filled by importing XML-files.
The order of the files is random so I could get something like this:
ID | Name  | DateFile | IsCorrection | Period | Other data
 1 | Mr. A | March, 1 | false        | 3      | Foo
20 | Mr. A | March, 1 | true         | 2      | Foo
13 | Mr. A | Apr, 3   | true         | 2      | Foo
 4 | Mr. B | Feb, 1   | false        | 2      | Foo
This table is joined with another table, which is also joined with a 3rd table.
I need to get the join of these 3 tables for the person with the newest data, based on Period, DateFile and Correction.
In my above example, Id=1 is the original data for Period 3; I need this record.
But in the same file there was also a correction for Period 2 (Id=20), and in the April file the data was corrected again (Id=13).
So for Period 3 I need Id=1; for Period 2 I need Id=13, because it has the last corrected data; and I need Id=4 because it is another person.
I would like to do this in a view, but using a stored procedure would not be a problem.
I have no idea how to solve this. Any pointers will be much appreciated.
EDIT:
My datamodel is of course much more complex than this sample. DateFile and Period are DateTime types in the table. Actually Period is two DateTime columns: StartPeriod and EndPeriod.
Well, looking at your data, I believe we can disregard the IsCorrection column and just pick the latest row for each user/period.
Let's start by ordering the rows, placing the latest on top (YourTable below is a stand-in for your table's name):
SELECT ROW_NUMBER() OVER (PARTITION BY Period, Name ORDER BY DateFile DESC), * FROM YourTable
And from this result you select all with row number 1:
;with numberedRows as (
SELECT ROW_NUMBER() OVER (PARTITION BY Period, Name ORDER BY DateFile DESC) as rowIndex, *
FROM YourTable
)
select * from numberedRows where rowIndex=1
The PARTITION BY tells ROW_NUMBER() to reset the counter whenever it encounters a change in the columns Period and Name. The ORDER BY tells ROW_NUMBER() that we want the newest row to be number 1, with older rows afterwards. We only need the latest row.
The WITH declares a "common table expression" which is a kind of subquery or temporary table.
Not knowing your exact data, I might recommend something wrong, but you should be able to join the last query with your other tables to get your desired result.
Something like:
;with numberedRows as (
SELECT ROW_NUMBER() OVER (PARTITION BY Period, Name ORDER BY DateFile DESC) as rowIndex, *
FROM YourTable
)
select * from numberedRows a
JOIN periods b on b.empId = a.Id
JOIN msg c on b.msgId = c.Id
where a.rowIndex=1

Insert multiple values and Insert value in parallel

I have a question about SQL in parallel queries. For example, suppose that I have this query:
INSERT INTO tblExample (num) VALUES (1), (2)
And this query:
INSERT INTO tblExample (num) VALUES (3)
The final table should look like this:
num
---
1
2
3
But I wonder: if those two queries run in parallel, could the final table end up looking like this:
num
---
1
3
2
Does someone know the answer?
Thanks in advance!
There is no inherent row order in SQL. You can sort your query results by adding an ORDER BY clause:
SELECT * FROM tblExample ORDER BY num
Or you could add a timestamp column to the table and order by that.
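A sketch of that approach (the column name created_at is illustrative, and the exact ALTER TABLE / DEFAULT syntax varies by database):
ALTER TABLE tblExample ADD created_at timestamp DEFAULT CURRENT_TIMESTAMP;
-- rows come back ordered by when they were inserted
SELECT num FROM tblExample ORDER BY created_at;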
How your table "looks" depends on how you asked for it (in the SELECT statement). Without an ORDER BY clause, the order of your table is undefined:
ORDER BY is the only way to sort the rows in the result set. Without this clause, the relational database system may return the rows in any order. If an ordering is required, the ORDER BY must be provided in the SELECT statement sent by the application.
For example:
SELECT num FROM tblExample ORDER BY num ASC
1
2
3
SELECT num FROM tblExample ORDER BY num DESC
3
2
1
If you want to order your rows manually, you can add a new column and sort on it:
+-----+-------+
| num | order |
+-----+-------+
|   1 |     1 |
|   2 |     3 |
|   3 |     2 |
+-----+-------+
SELECT num FROM tblExample ORDER BY "order" ASC -- "order" is a reserved word, so the column must be quoted
1
3
2

yet another date gap-fill SQL puzzle

I'm using Vertica, which precludes me from using CROSS APPLY, unfortunately. And apparently there are no CTEs in Vertica.
Here's what I've got:
t:
day        | id | metric | d_metric
-----------+----+--------+----------
2011-12-01 |  1 |     10 |       10
2011-12-03 |  1 |     12 |        2
2011-12-04 |  1 |     15 |        3
Note that on the first day, the delta is equal to the metric value.
I'd like to fill in the gaps, like this:
t_fill:
day        | id | metric | d_metric
-----------+----+--------+----------
2011-12-01 |  1 |     10 |       10
2011-12-02 |  1 |     10 |        0 -- a delta of 0
2011-12-03 |  1 |     12 |        2
2011-12-04 |  1 |     15 |        3
I've thought of a way to do this day by day, but what I'd really like is a solution that works in one go.
I think I could get something working with LAST_VALUE, but I can't come up with the right JOIN statements that will let me properly partition and order on each id's day-by-day history.
edit:
assume I have a table like this:
calendar:
day
------------
2011-01-01
2011-01-02
...
that can be involved with joins. My intent would be to maintain the date range in calendar to match the date range in t.
edit:
A few more notes on what I'm looking for, just to be specific:
In generating t_fill, I'd like to exactly cover the date range in t, as well as any dates that are missing in between. So a correct t_fill will start on the same date and end on the same date as t.
t_fill has two properties:
1) once an id appears on some date, it will always have a row for each later date. This is the gap-filling implied in the original question.
2) Should no row for an id ever appear again after some date, the t_fill solution should merrily generate rows with the same metric value (and 0 delta) from the date of that last data point up to the end date of t.
A solution might backfill earlier dates up to the start of the date range in t. That is, for any id that appears after the first date in t, rows between the first date in t and the first date for the id will be filled with metric=0 and d_metric=0. I don't prefer this kind of solution, since it has a higher growth factor for each id that enters the system. But I could easily deal with it by selecting into a new table only rows where metric!=0 and d_metric!=0.
This is about what Jonathan Leffler proposed, but in old-fashioned low-level SQL (without fancy CTEs or window functions or aggregating subqueries):
SET search_path='tmp';
DROP TABLE ttable CASCADE;
CREATE TABLE ttable
( zday date NOT NULL
, id INTEGER NOT NULL
, metric INTEGER NOT NULL
, d_metric INTEGER NOT NULL
, PRIMARY KEY (id,zday)
);
INSERT INTO ttable(zday,id,metric,d_metric) VALUES
('2011-12-01',1,10,10)
,('2011-12-03',1,12,2)
,('2011-12-04',1,15,3)
;
DROP TABLE ctable CASCADE;
CREATE TABLE ctable
( zday date NOT NULL
, PRIMARY KEY (zday)
);
INSERT INTO ctable(zday) VALUES
('2011-12-01')
,('2011-12-02')
,('2011-12-03')
,('2011-12-04')
;
CREATE VIEW v_cte AS (
SELECT t.zday,t.id,t.metric,t.d_metric
FROM ttable t
JOIN ctable c ON c.zday = t.zday
UNION
SELECT c.zday,t.id,t.metric, 0
FROM ctable c, ttable t
WHERE t.zday < c.zday
AND NOT EXISTS ( SELECT *
FROM ttable nx
WHERE nx.id = t.id
AND nx.zday = c.zday
)
AND NOT EXISTS ( SELECT *
FROM ttable nx
WHERE nx.id = t.id
AND nx.zday < c.zday
AND nx.zday > t.zday
)
)
;
SELECT * FROM v_cte;
The results:
    zday    | id | metric | d_metric
------------+----+--------+----------
 2011-12-01 |  1 |     10 |       10
 2011-12-02 |  1 |     10 |        0
 2011-12-03 |  1 |     12 |        2
 2011-12-04 |  1 |     15 |        3
(4 rows)
I am not a Vertica user, but if you do not want to use its native support for gap filling, you can find more generic SQL-only solutions to do so.
If you want to use something like a CTE, how about using a temporary table? Essentially, a CTE is a view for a particular query.
Depending on your needs you can make the temporary table transaction- or session-scoped.
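For instance, a sketch using the question's t table (Vertica's exact CREATE TEMPORARY TABLE options may differ by version):
-- Materialize an intermediate result into a session-scoped temporary
-- table, then join against it as you would a CTE.
CREATE LOCAL TEMPORARY TABLE t_latest ON COMMIT PRESERVE ROWS AS
SELECT id, MAX(day) AS last_day
FROM t
GROUP BY id;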
I'm still curious to know why gap-filling with constant-interpolation wouldn't work here.
Given the complete calendar table, it is doable, though not exactly trivial. Without the calendar table, it would be a lot harder.
Your query needs to be stated moderately precisely, which is usually half the battle in any issue with 'how to write the query'. I think you are looking for:
For each date in Calendar between the minimum and maximum dates represented in T (or other stipulated range),
For each distinct ID represented in T,
Find the metric for the given ID for the most recent record in T on or before the date.
This gives you a complete list of dates with metrics.
You then need to self-join two copies of that list with dates one day apart to form the deltas.
Note that if some ID values don't appear at the start of the date range, they won't show up.
With that as guidance, you should be able to get going, I believe.
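For what it's worth, here is an untested sketch of that outline in plain SQL (no CTEs or window functions), using the question's t and calendar tables, with a view standing in for the intermediate list:
-- One row per (calendar day, id), carrying forward the most recent
-- metric on or before that day; NULL before the id's first appearance.
CREATE VIEW t_dense AS
SELECT c.day, i.id,
       (SELECT t1.metric
          FROM t t1
         WHERE t1.id = i.id
           AND t1.day = (SELECT MAX(t2.day)
                           FROM t t2
                          WHERE t2.id = i.id
                            AND t2.day <= c.day)) AS metric
  FROM calendar c
 CROSS JOIN (SELECT DISTINCT id FROM t) i
 WHERE c.day BETWEEN (SELECT MIN(day) FROM t)
                 AND (SELECT MAX(day) FROM t);
-- Self-join the filled list one day apart to form the deltas; the first
-- row for an id has no predecessor, so its delta equals the metric itself.
SELECT cur.day, cur.id, cur.metric,
       cur.metric - COALESCE(prev.metric, 0) AS d_metric
  FROM t_dense cur
  LEFT JOIN t_dense prev
    ON prev.id = cur.id
   AND prev.day = cur.day - 1
 WHERE cur.metric IS NOT NULL;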