Looking to find duplicates using DIFFERENCE() among 2+ columns - sql

I'm trying to write a SQL Select query that uses the DIFFERENCE() function to find similar names in a database to identify duplicates.
The short version of the code I'm using is:
SELECT *, DIFFERENCE(FirstName, LEAD(FirstName) OVER (ORDER BY SOUNDEX(FirstName))) d
WHERE d >= 3
The problem is my database has additional columns that include middle names and nicknames. So if I have a customer who has multiple names they go by, they might be in the database multiple times, and I need to compare a variety of columns against each other.
Sample Data:
+----+--------+--------+--------+--------+
|ID |First |Middle |AKA1 |AKA2 |
+----+--------+--------+--------+--------+
|1 |Sally |Ann |NULL |NULL |
|2 |Ann |NULL |NULL |NULL |
|3 |Sue |NULL |NULL |NULL |
|4 |Suzy |NULL |NULL |NULL |
|5 |Patricia|NULL |Trish |Patty |
|6 |Patty |NULL |Patricia|Trish |
|7 |Trish |NULL |Patty |Patricia|
+----+--------+--------+--------+--------+
In the above, rows 1+2 are duplicates of each other, as are 3+4, and 5+6+7.
So I'm not sure the best way to get what I want. Here's the longer version of the code I'm actually using:
WITH A AS (SELECT *,
SOUNDEX(FirstName) AS "FirstSoundex",
SOUNDEX(LastName) AS "LastSoundex",
LAG (SOUNDEX(FirstName)) OVER (ORDER BY SOUNDEX(FirstName)) AS "PreviousFirstSoundex",
LAG (SOUNDEX(LastName)) OVER (ORDER BY SOUNDEX(LastName)) AS "PreviousLastSoundex"
FROM Clients),
B AS (
SELECT *,
ISNULL(DIFFERENCE(FirstName, LEAD(FirstName) OVER (ORDER BY FirstSoundex)),0) AS "FirstScore",
ISNULL(DIFFERENCE(LastName, LEAD(LastName) OVER (ORDER BY LastSoundex)),0) AS "LastScore"
FROM A),
C AS (
SELECT *,
ISNULL(LAG (FirstScore) OVER (ORDER BY FirstSoundex),0) AS "PreviousFirstScore",
ISNULL(LAG (LastScore) OVER (ORDER BY LastSoundex),0) AS "PreviousLastScore"
FROM B
),
D AS (
SELECT *,
(CASE WHEN (PreviousFirstScore >=3 AND PreviousLastScore >=3) THEN (PreviousFirstSoundex + PreviousLastSoundex)
WHEN (FirstScore >= 3 AND LastScore >=3) THEN (FirstSoundex + LastSoundex)
END) AS "GroupName"
FROM C
WHERE ((PreviousFirstScore >=3 AND PreviousLastScore >=3) OR (FirstScore >= 3 AND LastScore >=3))
)
SELECT *,
LAG(GroupName) OVER (ORDER BY GroupName) AS "PreviousGroup",
LEAD(GroupName) OVER (ORDER BY GroupName) AS "NextGroup"
FROM D
WHERE (D.GroupName = D.PreviousGroup OR D.GroupName = D.NextGroup)
This lets me group together bundles of potential duplicates and it works well for me. However, I now want to add in a way to check against multiple columns, and I don't know how to do that.
I was thinking about creating a union, something like:
SELECT ClientID,
LastName,
FirstName AS "TempName"
FROM Clients
UNION
SELECT ClientID,
LastName,
MiddleName AS "TempName"
FROM Clients
WHERE MiddleName IS NOT NULL
...etc
But then my LAG() and LEAD() wouldn't work because I'd have multiple rows with the same ClientID. I don't want to identify a single Client as a duplicate of itself.
Anyways, any suggestions? Thanks in advance.

Related

Oracle SQL - How to return the name with the highest ID ending in a certain number

I have a table structured like this where I need to get the ID's last number, how many people's ID ends with that number, and the person with the highest ID:
Members: |ID |Name |
-----------------
|123 |foo |
|456 |bar |
|789 |boo |
|1226|far |
The result I need to get looks something like this
|LAST_NUMBER |OCCURENCES |HIGHEST_ID_GUY |
---------------------------------------------
|3 |1 |foo |
|6 |2 |far |
|9 |1 |boo |
However, while I can get the first two results to display correctly, I have no idea how to display HIGHEST_ID_GUY. My code looks like this:
SELECT DISTINCT SUBSTR(id, LENGTH(id - 1), LENGTH(id)) AS LAST_NUMBER,
COUNT(*) AS OCCURENCES
/* This is where I need to add HIGHEST_ID_GUY */
FROM Members
GROUP BY SUBSTR(id, LENGTH(id - 1), LENGTH(id))
ORDER BY LAST_NUMBER
Any help appreciated :)
If id is a number, then use arithmetic operations:
select mod(id, 10) as last_digit,
count(*),
max(name) keep (dense_rank first order by id desc) as name_at_biggest
from t
group by mod(id, 10);
If id is a string, then you need to convert to a number or something similar to define the "highest id". For instance:
select substr(id, -1) as last_digit,
count(*),
max(name) keep (dense_rank first order by to_number(id) desc) as name_at_biggest
from t
group by substr(id, -1);

group by value but only for continue value

OK, the title is far from obvious, I could not explain it better.
Let's consider the table with columns (date, xvalue, some other columns), what I need is to group them by xvalue but only when they are not interrupted considering time (column date), so for example, for:
Date |xvalue |yvalue|
1 Mar |10 |1 |
2 Mar |10 |2 |
3 Mar |20 |6 |
4 Mar |20 |1 |
5 Mar |10 |4 |
6 Mar |10 |2 |
From the above data, I would like to get three rows, for the first xvalue==10, for xvalue==20 and again for xvalue==10 and for each group aggregate of the other values, for example for sum:
1 Mar, 10, 3
3 Mar, 20, 7
5 Mar, 10, 6
It's like query:
select min(date), xvalue, sum(yvalue) from t group by xvalue
Except above will merge 1,2,5 and 6th of March and I want them separately
This is an example of a gaps-and-islands problem. But you need an ordering column. With such a column, you can use the difference of row numbers:
select min(date), xvalue, sum(yvalue)
from (select t.*,
row_number() over (partition by xvalue order by date) as seqnum_d,
row_number() over (order by date) as seqnum
from t
) t
group by xvalue, (seqnum - seqnum_d)
order by min(date)
Here is a db<>fiddle.
Datas in a database are logically stored in mathematicl sets inside which there is absolutly no order and no way to have a default ordering. they are comparable to bags in which objects can move during their use.
So there is no solution to answer your query until you add a specific column to give the requested sort order that the user need to have...

Counting rows for a particular name

I've a table temp(name int,count int). It stores:-
a|count
1|10
1|8
1|4
1|2
2|10
2|6
2|1
I want it's rows to be numbered, corresponding to a given name(also, note that count has to be in decreasing order), i.e, :-
a|count|row
1|10 |1
1|8 |2
1|4 |3
1|2 |4
2|10 |1
2|6 |2
2|1 |3
I tried How to show row numbers in PostgreSQL query? this post, but it just seems to number it from 1 to 7 and not name-wise. Can someone please help me with this? Thanks!
Use row_number() function
select a, count, row_number() over(partition by a order by count desc) as rn
from tablename

Select rows that are duplicates on two columns

I have data in a table. There are 3 columns (ID, Interval, ContactInfo). This table lists all phone contacts. I'm attempting to get a count of phone numbers that called twice on the same day and have no idea how to go about this. I can get duplicate entries for the same number but it does not match on date. The code I have so far is below.
SELECT ContactInfo, COUNT(Interval) AS NumCalls
FROM AllCalls
GROUP BY ContactInfo
HAVING COUNT(AllCalls.ContactInfo) > 1
I'd like to have it return the date, the number of calls on that date if more than 1, and the phone number.
Sample data:
|ID |Interval |ContactInfo|
|--------|------------|-----------|
|1 |3/1/2017 |8009999999 |
|2 |3/1/2017 |8009999999 |
|3 |3/2/2017 |8001234567 |
|4 |3/2/2017 |8009999999 |
|5 |3/3/2017 |8007771111 |
|6 |3/3/2017 |8007771111 |
|--------|------------|-----------|
Expected result:
|Interval |ContactInfo|NumCalls|
|------------|-----------|--------|
|3/1/2017 |8009999999 |2 |
|3/3/2017 |8007771111 |2 |
|------------|-----------|--------|
Just as juergen d suggested, you should try to add Interval in your GROUP BY. Like so:
SELECT AC.ContactInfo
, AC.Interval
, COUNT(*) AS qnty
FROM AllCalls AS AC
GROUP BY AC.ContactInfo
, AC.Interval
HAVING COUNT(*) > 1
The code should like this :
select Interval , ContactInfo, count(ID) AS NumCalls from AllCalls group by Interval, ContactInfo having count(ID)>1;

SQL query to return a grouped result as a single row

If I have a jobs table like:
|id|created_at |status |
----------------------------
|1 |01-01-2015 |error |
|2 |01-01-2015 |complete |
|3 |01-01-2015 |error |
|4 |01-02-2015 |complete |
|5 |01-02-2015 |complete |
|6 |01-03-2015 |error |
|7 |01-03-2015 |on hold |
|8 |01-03-2015 |complete |
I want a query that will group them by date and count the occurrence of each status and the total status for that date.
SELECT created_at status, count(status), created_at
FROM jobs
GROUP BY created_at, status;
Which gives me
|created_at |status |count|
-------------------------------
|01-01-2015 |error |2
|01-01-2015 |complete |1
|01-02-2015 |complete |2
|01-03-2015 |error |1
|01-03-2015 |on hold |1
|01-03-2015 |complete |1
I would like to now condense this down to a single row per created_at unique date with some sort of multi column layout for each status. One constraint is that status is any one of 5 possible words but each date might not have one of every status. Also I would like a total of all statuses for each day. So desired results would look like:
|date |total |errors|completed|on_hold|
----------------------------------------------
|01-01-2015 |3 |2 |1 |null
|01-02-2015 |2 |null |2 |null
|01-03-2015 |3 |1 |1 |1
the columns could be built dynamically from something like
SELECT DISTINCT status FROM jobs;
with a null result for any day that doesn't contain any of that type of status. I am no SQL expert but am trying to do this in a DB view so that I don't have to bog down doing multiple queries in Rails.
I am using Postresql but would like to try to keep it straight SQL. I have tried to understand aggregate function enough to use some other tools but not succeeding.
The following should work in any RDBMS:
SELECT created_at, count(status) AS total,
sum(case when status = 'error' then 1 end) as errors,
sum(case when status = 'complete' then 1 end) as completed,
sum(case when status = 'on hold' then 1 end) as on_hold
FROM jobs
GROUP BY created_at;
The query uses conditional aggregation so as to pivot grouped data. It assumes that status values are known before-hand. If you have additional cases of status values, just add the corresponding sum(case ... expression.
Demo here
An actual crosstab query would look like this:
SELECT * FROM crosstab(
$$SELECT created_at, status, count(*) AS ct
FROM jobs
GROUP BY 1, 2
ORDER BY 1, 2$$
,$$SELECT unnest('{error,complete,"on hold"}'::text[])$$)
AS ct (date date, errors int, completed int, on_hold int);
Should perform very well.
Basics:
PostgreSQL Crosstab Query
The above does not yet include the total per date.
Postgres 9.5 introduces the ROLLUP clause, which is perfect for the case:
SELECT * FROM crosstab(
$$SELECT created_at, COALESCE(status, 'total'), ct
FROM (
SELECT created_at, status, count(*) AS ct
FROM jobs
GROUP BY created_at, ROLLUP(status)
) sub
ORDER BY 1, 2$$
,$$SELECT unnest('{total,error,complete,"on hold"}'::text[])$$)
AS ct (date date, total int, errors int, completed int, on_hold int);
Up to Postgres 9.4, use this query instead:
WITH cte AS (
SELECT created_at, status, count(*) AS ct
FROM jobs
GROUP BY 1, 2
)
TABLE cte
UNION ALL
SELECT created_at, 'total', sum(ct)
FROM cte
GROUP BY 1
ORDER BY 1
Related:
Grouping() equivalent in PostgreSQL?
If you want to stick to a simple query, this is a bit shorter:
SELECT created_at
, count(*) AS total
, count(status = 'error' OR NULL) AS errors
, count(status = 'complete' OR NULL) AS completed
, count(status = 'on hold' OR NULL) AS on_hold
FROM jobs
GROUP BY 1;
count(status) for the total per date is error-prone, because it would not count rows with NULL values in status. Use count(*) instead, which is also shorter and a bit faster.
Here is a list of techniques:
For absolute performance, is SUM faster or COUNT?
In Postgres 9.4+ use the new aggregate FILTER clause, like #a_horse mentioned:
SELECT created_at
, count(*) AS total
, count(*) FILTER (WHERE status = 'error') AS errors
, count(*) FILTER (WHERE status = 'complete') AS completed
, count(*) FILTER (WHERE status = 'on hold') AS on_hold
FROM jobs
GROUP BY 1;
Details:
How can I simplify this game statistics query?