I need to implement a query (or maybe a stored procedure) that will perform soft de-duplication of data in one of my tables. If any two records are similar enough, I need to "squash" them: deactivate one and update another.
The similarity is based on a score, which is calculated as follows:
from both records, take values of column A,
values equal? add A1 to the score,
values not equal? subtract A2 from the score,
move on to the next column.
Once all desired value pairs have been checked:
is the resulting score more than X?
yes – the records are duplicates: mark the older record as "duplicate" and append its id to the duplicate_ids column of the newer record.
no – do nothing.
How would I approach solving this task in SQL?
The table in question is called people. People records are entered by different admins. The de-duplication process exists to make sure no person exists twice in the system.
The motivation for the task is simple: performance.
Right now the solution is implemented in a scripting language via several sub-par SQL queries with logic on top of them. However, the volume of data is expected to grow to tens of millions of records, and the script will eventually become very slow (it should run via cron every night).
I'm using PostgreSQL.
It appears that de-duplication is generally a tough problem.
I found this: https://github.com/dedupeio/dedupe. There's a good description of how this works: https://dedupe.io/documentation/how-it-works.html.
I'm going to explore dedupe. I'm not going to try to implement it in SQL.
If I get you correctly, this could help.
You can use PostgreSQL Window Functions to get all the duplicates and use "weights" to determine which records are duplicated so you can do whatever you like with them.
Here is an example:
-- Temporary table for the test; the primary key is id and
-- we have A, B, C columns with a creation date:
CREATE TEMP TABLE test
(id serial, "colA" text, "colB" text, "colC" text,creation_date date);
-- Insert test data:
INSERT INTO test ("colA", "colB", "colC",creation_date) VALUES
('A','B','C','2017-05-01'),('D','E','F','2017-06-01'),('A','B','D','2017-08-01'),
('A','B','R','2017-09-01'),('C','J','K','2017-09-01'),('A','C','J','2017-10-01'),
('C','W','K','2017-10-01'),('R','T','Y','2017-11-01');
-- SELECT * FROM test
-- id | colA | colB | colC | creation_date
-- ----+-------+-------+-------+---------------
-- 1 | A | B | C | 2017-05-01
-- 2 | D | E | F | 2017-06-01
-- 3 | A | B | D | 2017-08-01 <-- Duplicate A,B
-- 4 | A | B | R | 2017-09-01 <-- Duplicate A,B
-- 5 | C | J | K | 2017-09-01
-- 6 | A | C | J | 2017-10-01
-- 7 | C | W | K | 2017-10-01 <-- Duplicate C,K
-- 8 | R | T | Y | 2017-11-01
-- Here is the query you can use to get the ids of the duplicate records
-- (read the numbered comments from the innermost query outward):
-- third, you select the id of the duplicates
SELECT id
FROM
(
-- Second, select all the columns needed and weight the duplicates.
-- You don't need to select every column, if only the id is needed
-- then you can only select the id
-- Query this SQL to see results:
SELECT
id,"colA", "colB", "colC",creation_date,
-- The weights are simple: if the row number is more than 1 then assign 1,
-- if the row number is 1 then assign 0; sum them all and you have a
-- total weight of 'duplicity'.
CASE WHEN "num_colA">1 THEN 1 ELSE 0 END +
CASE WHEN "num_colB">1 THEN 1 ELSE 0 END +
CASE WHEN "num_colC">1 THEN 1 ELSE 0 END as weight
FROM
(
-- First, select using window functions and assign a row number.
-- You can run this query separately to see results
SELECT *,
-- NOTE that it is order by id, if needed you can order by creation_date instead
row_number() OVER(PARTITION BY "colA" ORDER BY id) as "num_colA",
row_number() OVER(PARTITION BY "colB" ORDER BY id) as "num_colB",
row_number() OVER(PARTITION BY "colC" ORDER BY id) as "num_colC"
FROM test ORDER BY id
) count_column_duplicates
) duplicates
-- HERE IS DEFINED WHICH WEIGHT TO SELECT; for the test,
-- it selects the rows whose weight is more than 1
WHERE weight>1;
-- The complete query returns all the duplicates according to the selected weight:
-- id
-- ----
-- 3
-- 4
-- 7
You can add this query to a stored procedure so you can run it whenever you like. Hope it helps.
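The pairwise scoring rule from the question can also be tried out directly. Below is a minimal sketch, not a definitive implementation: it uses Python's built-in sqlite3 module as a stand-in for PostgreSQL so that it runs anywhere, reuses the test rows above, and assumes hypothetical values for the weights and threshold (MATCH_BONUS, MISMATCH_PENALTY and THRESHOLD play the roles of A1, A2 and X, with the same weights for every column). Treating the lower id as the "older" record is also an assumption.

```python
import sqlite3

# Hypothetical values for the question's A1, A2 and X.
MATCH_BONUS = 2
MISMATCH_PENALTY = 1
THRESHOLD = 2

conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE test (id INTEGER PRIMARY KEY, "colA" TEXT, "colB" TEXT, "colC" TEXT)'
)
conn.executemany(
    'INSERT INTO test ("colA", "colB", "colC") VALUES (?, ?, ?)',
    [("A", "B", "C"), ("D", "E", "F"), ("A", "B", "D"), ("A", "B", "R"),
     ("C", "J", "K"), ("A", "C", "J"), ("C", "W", "K"), ("R", "T", "Y")],
)

# Compare every pair once (older.id < newer.id) and sum the per-column scores.
# Using id order as "older/newer" is an assumption; creation_date would also work.
SCORE_SQL = """
SELECT older_id, newer_id, score
FROM (
    SELECT older.id AS older_id,
           newer.id AS newer_id,
           CASE WHEN older."colA" = newer."colA" THEN :bonus ELSE -:penalty END
         + CASE WHEN older."colB" = newer."colB" THEN :bonus ELSE -:penalty END
         + CASE WHEN older."colC" = newer."colC" THEN :bonus ELSE -:penalty END
           AS score
    FROM test AS older
    JOIN test AS newer ON older.id < newer.id
) AS scored
WHERE score > :threshold
ORDER BY older_id, newer_id
"""

duplicates = conn.execute(
    SCORE_SQL,
    {"bonus": MATCH_BONUS, "penalty": MISMATCH_PENALTY, "threshold": THRESHOLD},
).fetchall()
print(duplicates)
```

With these assumed weights a pair needs at least two matching columns to pass the threshold. Note that the self-join compares every pair of rows, so the work grows quadratically with the table size; for tens of millions of records the candidate pairs would probably need to be narrowed down first.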
I have the table below:
Select X,Y from T
X | Y
------
1 | 2
1 | 3
2 | 1
3 | 5
3 | 1
Columns X and Y hold strings; I used numbers just as an example.
I need the following output from this table:
1,2
1,3
3,5
i.e., the unique sets from the table. Out of row 1 (1,2) and row 3 (2,1), I need only one set, because (1,2)=(2,1) in my set. Similarly (1,3)=(3,1).
So unique sets in this table are (1,2) (1,3) and (3,5).
I tried the SQL below. Let me know if there is a better way, as I am not sure whether I can use '>' or '<' with ROWID:
SELECT X||','||Y FROM T t1
WHERE NOT EXISTS (SELECT 1 FROM T t2
WHERE t1.X=t2.Y AND t1.Y=t2.X and t1.ROWID>t2.ROWID)
select distinct least(x,y), greatest(x,y)
from the_table;
least() and greatest() put the values into a consistent order, so that 1,2 and 2,1 are both returned as 1,2. The distinct then removes the duplicates.
DISTINCT gets you distinct rows, so all you need to do is to have your pairs ordered, first the smaller then the larger. You do this with LEAST and GREATEST.
select distinct least(x,y) || ',' || greatest(x,y)
from t;
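To see the normalise-then-DISTINCT idea run end to end, here is a small sketch using Python's built-in sqlite3 module. SQLite has no LEAST() or GREATEST(), so its two-argument scalar min() and max() are used as stand-ins; the table and rows are the ones from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (x TEXT, y TEXT)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [("1", "2"), ("1", "3"), ("2", "1"), ("3", "5"), ("3", "1")],
)

# Put each pair into (smaller, larger) order, then let DISTINCT drop the repeats.
# min(x, y) / max(x, y) are SQLite's scalar equivalents of LEAST / GREATEST.
pairs = conn.execute(
    "SELECT DISTINCT min(x, y), max(x, y) FROM t ORDER BY 1, 2"
).fetchall()
print(pairs)  # [('1', '2'), ('1', '3'), ('3', '5')]
```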
I am trying to get the last element of an ordered set, stored in a database table. The ordering is defined by one of the columns in the table. Also the table contains multiple sets, so I want the last one for each of the sets.
As an example consider the following table:
benchmarks=# select id,sorter from aggtest ;
id | sorter
----+--------
1 | 1
3 | 1
5 | 1
2 | 2
7 | 2
4 | 1
6 | 2
(7 rows)
Sorter 1 and 2 define each of the sets, sets are ordered by the id column. To get the last element of each set, I defined an aggregate function:
CREATE FUNCTION public.last_agg ( anyelement, anyelement )
RETURNS anyelement LANGUAGE sql IMMUTABLE STRICT AS $$
SELECT $2;
$$;
CREATE AGGREGATE public.last (
sfunc = public.last_agg,
basetype = anyelement,
stype = anyelement
);
As explained here.
However when I use this I get:
benchmarks=# select last(id),sorter from aggtest group by sorter order by sorter;
last | sorter
------+--------
4 | 1
6 | 2
(2 rows)
However, I want to get (5,1) and (7,2) as these are the last ids (numerically) in the set. Looking at how the aggregate mechanism works, I can see quite well, why the result is not what I want. The items are returned in the order I added them, and then aggregated so that the last one I added is returned.
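The order dependence can be reproduced outside the database. The sketch below is only an illustration in Python, not how PostgreSQL executes aggregates internally: last_agg mirrors the SQL state function, and the rows are listed in the order the table returned them above. The helper name last_per_group is made up for this example.

```python
from functools import reduce

def last_agg(state, value):
    """Mirror of the SQL state function: ignore the state, keep the new value."""
    return value

# (id, sorter) rows in the order the table returned them in the question.
rows = [(1, 1), (3, 1), (5, 1), (2, 2), (7, 2), (4, 1), (6, 2)]

def last_per_group(rows):
    """Fold last_agg over each sorter group, in the order the rows arrive."""
    groups = {}
    for row_id, sorter in rows:
        groups.setdefault(sorter, []).append(row_id)
    return {sorter: reduce(last_agg, ids) for sorter, ids in groups.items()}

scan_order = last_per_group(rows)            # {1: 4, 2: 6}
sorted_order = last_per_group(sorted(rows))  # {1: 5, 2: 7}
print(scan_order, sorted_order)
```

The aggregate only gives the wanted answer when its input arrives sorted. PostgreSQL 9.0 and later allow an ORDER BY inside the aggregate call itself, for example last(id ORDER BY id), which fixes the input order per group.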
I tried sorting by ids, so that each group is sorted independently, however that does not work:
benchmarks=# select last(id),sorter from aggtest group by sorter order by sorter,id;
ERROR: column "aggtest.id" must appear in the GROUP BY clause or be used in an aggregate function
LINE 1: ...(id),sorter from aggtest group by sorter order by sorter,id;
If I wrap the sorting criteria in another aggregate, I get wrong data again:
benchmarks=# select last(id),sorter from aggtest group by sorter order by sorter,last(id);
last | sorter
------+--------
4 | 1
6 | 2
(2 rows)
Also grouping by id in addition to sorter does not work obviously.
Of course there is an easier way to get the last (highest) id for each group: using the max aggregate. However, I am not so much interested in the id as in the data associated with it (i.e. in the same row). Hence I need to sort by id and then aggregate so that the row with the highest id is returned for each group.
What is the best way to accomplish this?
EDIT: Why does max(id) grouped by sorter not work
Assume the following complete table (unsorter represents the additional data I have in the table):
benchmarks=# select * from aggtest ;
id | sorter | unsorter
----+--------+----------
1 | 1 | 1
3 | 1 | 2
5 | 1 | 3
2 | 2 | 4
7 | 2 | 5
4 | 1 | 6
6 | 2 | 7
(7 rows)
I would like to retrieve the lines:
id | sorter | unsorter
----+--------+----------
5 | 1 | 3
7 | 2 | 5
However with max(id) and grouping by sorter I get:
benchmarks=# select max(id),sorter,unsorter from aggtest group by sorter;
ERROR: column "aggtest.unsorter" must appear in the GROUP BY clause or be used in an aggregate function
LINE 1: select max(id),sorter,unsorter from aggtest group by sorter;
Using a max(unsorter) obviously does not work either:
benchmarks=# select max(id),sorter,max(unsorter) from aggtest group by sorter;
max | sorter | max
-----+--------+-----
5 | 1 | 6
7 | 2 | 7
(2 rows)
However using distinct (the accepted answer) I get:
benchmarks=# select distinct on (sorter) id,sorter,unsorter from aggtest order by sorter, id desc;
id | sorter | unsorter
----+--------+----------
5 | 1 | 3
7 | 2 | 5
(2 rows)
Which has the correct additional data. The join approach also seems to work, but is slightly slower on the test data.
Why not use a window function:
select id, sorter
from (
select id, sorter,
row_number() over (partition by sorter order by id desc) as rn
from aggtest
) t
where rn = 1;
Or use the Postgres DISTINCT ON clause, which is usually faster:
select distinct on (sorter) id, sorter
from aggtest
order by sorter, id desc
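The window-function variant can be checked against the full table from the question, including the extra unsorter column. This is a sketch using Python's built-in sqlite3 module as a stand-in for PostgreSQL; it assumes the bundled SQLite is version 3.25 or newer (the first with window functions). DISTINCT ON is PostgreSQL-specific and is not available in SQLite.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE aggtest (id INTEGER, sorter INTEGER, unsorter INTEGER)")
conn.executemany(
    "INSERT INTO aggtest VALUES (?, ?, ?)",
    [(1, 1, 1), (3, 1, 2), (5, 1, 3), (2, 2, 4), (7, 2, 5), (4, 1, 6), (6, 2, 7)],
)

# Number the rows of each sorter group from the highest id down,
# then keep only the first row of every group.
last_rows = conn.execute("""
    SELECT id, sorter, unsorter
    FROM (
        SELECT id, sorter, unsorter,
               row_number() OVER (PARTITION BY sorter ORDER BY id DESC) AS rn
        FROM aggtest
    ) AS t
    WHERE rn = 1
    ORDER BY sorter
""").fetchall()
print(last_rows)  # [(5, 1, 3), (7, 2, 5)]
```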
You write:
Of course there is an easier way, to get the last (highest) id for
each group by using the max aggregate. However, I am not so much
interested in the id but as in data associated with it (i.e. in the
same row).
This query will give you the data associated with the highest id of each sorter group.
select a.* from aggtest a
join (
select max(id) max_id, sorter
from aggtest
group by sorter
) b on a.id = b.max_id and a.sorter = b.sorter
select distinct max(id) over (partition by sorter) id,sorter
from aggtest order by 2 asc
returns:
5;1
7;2
I noticed some repeating rows in a paginated recordset.
When I run this query:
SELECT "students".*
FROM "students"
ORDER BY "students"."status" asc
LIMIT 3 OFFSET 0
I get:
| id | name | status |
| 1 | foo | active |
| 12 | alice | active |
| 4 | bob | active |
Next query:
SELECT "students".*
FROM "students"
ORDER BY "students"."status" asc
LIMIT 3 OFFSET 3
I get:
| id | name | status |
| 1 | foo | active |
| 6 | cindy | active |
| 2 | dylan | active |
Why does "foo" appear in both queries?
Why does "foo" appear in both queries?
Because all rows that are returned have the same value for the status column. In that case the database is free to return the rows in any order it wants.
If you want a reproducible ordering, you need to add a second column to your ORDER BY clause to make it consistent, e.g. the ID column:
SELECT students.*
FROM students
ORDER BY students.status asc,
students.id asc
If two rows have the same value for the status column, they will be sorted by the id.
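The effect of the tie-breaker on paging can be sketched with Python's built-in sqlite3 module (a stand-in for PostgreSQL here). The rows are the five students visible in the question; fetch_page is a hypothetical helper name.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (id INTEGER PRIMARY KEY, name TEXT, status TEXT)")
conn.executemany(
    "INSERT INTO students VALUES (?, ?, ?)",
    [(1, "foo", "active"), (12, "alice", "active"), (4, "bob", "active"),
     (6, "cindy", "active"), (2, "dylan", "active")],
)

def fetch_page(offset, limit=3):
    """Return one page; id breaks ties so the overall order is unique."""
    return conn.execute(
        "SELECT id, name, status FROM students "
        "ORDER BY status ASC, id ASC LIMIT ? OFFSET ?",
        (limit, offset),
    ).fetchall()

page1 = fetch_page(0)
page2 = fetch_page(3)
print(page1)
print(page2)

# With a unique ordering, no row can show up on two pages.
overlap = {row[0] for row in page1} & {row[0] for row in page2}
```

Because (status, id) is unique per row, every row has exactly one position in the sorted result, so the pages cannot overlap regardless of the plan the database picks.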
For more details, see the PostgreSQL documentation (http://www.postgresql.org/docs/8.3/static/queries-limit.html):
When using LIMIT, it is important to use an ORDER BY clause that constrains the result rows into a unique order. Otherwise you will get an unpredictable subset of the query's rows. You might be asking for the tenth through twentieth rows, but tenth through twentieth in what ordering? The ordering is unknown, unless you specified ORDER BY.
The query optimizer takes LIMIT into account when generating a query plan, so you are very likely to get different plans (yielding different row orders) depending on what you give for LIMIT and OFFSET. Thus, using different LIMIT/OFFSET values to select different subsets of a query result will give inconsistent results unless you enforce a predictable result ordering with ORDER BY. This is not a bug; it is an inherent consequence of the fact that SQL does not promise to deliver the results of a query in any particular order unless ORDER BY is used to constrain the order.
select * from(
Select "students".*
from "students"
order by "students"."status" asc
limit 6
) as temp limit 3 offset 0;
select * from(
Select "students".*
from "students"
order by "students"."status" asc
limit 6
) as temp limit 3 offset 3;
where 6 is the total number of records under examination.
I have the following query:
select column_name, count(column_name)
from table
group by column_name
having count(column_name) > 1;
What would be the difference if I replaced all calls to count(column_name) with count(*)?
This question was inspired by How do I find duplicate values in a table in Oracle?.
To clarify the accepted answer (and maybe my question), replacing count(column_name) with count(*) would return an extra row in the result that contains a null and the count of null values in the column.
count(*) counts NULLs and count(column) does not
[edit] added this code so that people can run it
create table #bla(id int,id2 int)
insert #bla values(null,null)
insert #bla values(1,null)
insert #bla values(null,1)
insert #bla values(1,null)
insert #bla values(null,1)
insert #bla values(1,null)
insert #bla values(null,null)
select count(*),count(id),count(id2)
from #bla
results
7 3 2
Another minor difference, between using * and a specific column, is that in the column case you can add the keyword DISTINCT, and restrict the count to distinct values:
select column_a, count(distinct column_b)
from table
group by column_a
having count(distinct column_b) > 1;
A further and perhaps subtle difference is that in some database implementations the count(*) is computed by looking at the indexes on the table in question rather than the actual data rows. Since no specific column is specified, there is no need to bother with the actual rows and their values (as there would be if you counted a specific column). Allowing the database to use the index data can be significantly faster than making it count "real" rows.
The documentation explains this:
COUNT(*) returns the number of items in a group, including NULL values and duplicates.
COUNT(expression) evaluates expression for each row in a group and returns the number of nonnull values.
So count(*) includes nulls, the other method doesn't.
We can use the Stack Exchange Data Explorer to illustrate the difference with a simple query. The Users table in Stack Overflow's database has columns that are often left blank, like the user's Website URL.
-- count(column_name) vs. count(*)
-- Illustrates the difference between counting a column
-- that can hold null values, a 'not null' column, and count(*)
select count(WebsiteUrl), count(Id), count(*) from Users
If you run the query above in the Data Explorer, you'll see that the count is the same for count(Id) and count(*) because the Id column doesn't allow null values. The WebsiteUrl count is much lower, though, because that column allows null.
COUNT(*) tells SQL Server to count all the rows in a table, including those that contain NULLs.
COUNT(column_name) counts only the rows that have a non-null value in that column.
Please see the following test code for SQL Server 2008:
-- Table variable
DECLARE @Table TABLE
(
CustomerId int NULL
, Name nvarchar(50) NULL
)
-- Insert some records for tests
INSERT INTO @Table VALUES( NULL, 'Pedro')
INSERT INTO @Table VALUES( 1, 'Juan')
INSERT INTO @Table VALUES( 2, 'Pablo')
INSERT INTO @Table VALUES( 3, 'Marcelo')
INSERT INTO @Table VALUES( NULL, 'Leonardo')
INSERT INTO @Table VALUES( 4, 'Ignacio')
-- Count all the rows by indicating *
SELECT COUNT(*) AS 'AllRowsCount'
FROM @Table
-- Count only the rows with a non-NULL CustomerId (excludes NULLs)
SELECT COUNT(CustomerId) AS 'OnlyNotNullCounts'
FROM @Table
COUNT(*) – Returns the total number of records in a table (Including NULL valued records).
COUNT(Column Name) – Returns the total number of Non-NULL records. It means that, it ignores counting NULL valued records in that particular column.
Basically, COUNT(*) counts all the rows in a table, whereas COUNT(COLUMN_NAME) excludes the rows where that column is NULL, as everyone here has already answered.
But the most interesting part is optimization: unless you are doing multiple counts or a complex query, it is generally better to use COUNT(*) rather than COUNT(COLUMN_NAME), because counting a specific column can lower database performance when dealing with a huge amount of data.
Further elaborating upon the answers given by @SQLMenace and @Brannon, here is an example that uses the GROUP BY clause, which the OP mentioned but which is not present in the answer by @SQLMenace:
CREATE TABLE table1 (
id INT
);
INSERT INTO table1 VALUES
(1),
(2),
(NULL),
(2),
(NULL),
(3),
(1),
(4),
(NULL),
(2);
SELECT * FROM table1;
+------+
| id |
+------+
| 1 |
| 2 |
| NULL |
| 2 |
| NULL |
| 3 |
| 1 |
| 4 |
| NULL |
| 2 |
+------+
10 rows in set (0.00 sec)
SELECT id, COUNT(*) FROM table1 GROUP BY id;
+------+----------+
| id | COUNT(*) |
+------+----------+
| 1 | 2 |
| 2 | 3 |
| NULL | 3 |
| 3 | 1 |
| 4 | 1 |
+------+----------+
5 rows in set (0.00 sec)
Here, COUNT(*) counts the number of occurrences of each type of id including NULL
SELECT id, COUNT(id) FROM table1 GROUP BY id;
+------+-----------+
| id | COUNT(id) |
+------+-----------+
| 1 | 2 |
| 2 | 3 |
| NULL | 0 |
| 3 | 1 |
| 4 | 1 |
+------+-----------+
5 rows in set (0.00 sec)
Here, COUNT(id) counts the number of occurrences of each type of id but does not count the number of occurrences of NULL
SELECT id, COUNT(DISTINCT id) FROM table1 GROUP BY id;
+------+--------------------+
| id | COUNT(DISTINCT id) |
+------+--------------------+
| NULL | 0 |
| 1 | 1 |
| 2 | 1 |
| 3 | 1 |
| 4 | 1 |
+------+--------------------+
5 rows in set (0.00 sec)
Here, COUNT(DISTINCT id) counts the number of occurrences of each type of id only once (does not count duplicates) and also does not count the number of occurrences of NULL
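Applied to the HAVING query from the original question, the same table shows the extra NULL row directly. This is a sketch that uses Python's built-in sqlite3 module rather than the MySQL session above; SQLite sorts NULL first under ORDER BY, which is why it leads the first result.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (id INT)")
conn.executemany(
    "INSERT INTO table1 VALUES (?)",
    [(1,), (2,), (None,), (2,), (None,), (3,), (1,), (4,), (None,), (2,)],
)

# COUNT(*) counts every row of a group, so the NULL group (3 rows) passes HAVING.
with_star = conn.execute(
    "SELECT id, COUNT(*) FROM table1 GROUP BY id HAVING COUNT(*) > 1 ORDER BY id"
).fetchall()

# COUNT(id) skips NULLs, so the NULL group counts 0 and is filtered out.
with_column = conn.execute(
    "SELECT id, COUNT(id) FROM table1 GROUP BY id HAVING COUNT(id) > 1 ORDER BY id"
).fetchall()

print(with_star)    # [(None, 3), (1, 2), (2, 3)]
print(with_column)  # [(1, 2), (2, 3)]
```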
Some prefer to use
Count(1) in place of a column name or *
to count the number of rows in a table, on the theory that the database then never has to check whether a column name exists in the table. In practice most database engines treat Count(1) and Count(*) the same way, so expect little or no speed difference.
There is no difference if your table has one fixed column; if you want to count more than one column, you have to specify which columns you need to count.
Thanks,
As mentioned in the previous answers, Count(*) counts every row, including those where the column is NULL, whereas count(Columnname) counts only the rows where the column has a value.
It's always best practice to avoid * (Select *, count *, …)