SQL find duplicates and assign group number

Situation
On a Microsoft SQL Server 2008 instance I have a table of about 2 million rows that contains duplicates (this should never have happened, but we inherited the situation). A sample follows:
usernum | phone | email
1 | 123 | user1@local.com
2 | 123 | user2@local.com
3 | 245 | user3@local.com
4 | 678 | user3@local.com
Aim
I would like to create a table that looks like this. The idea is that if 'phone' or 'email' is the same, they are assigned the same group number.
groupnum | usernum | phone | email
1 | 1 | 123 | user1@local.com
1 | 2 | 123 | user2@local.com
2 | 3 | 245 | user3@local.com
2 | 4 | 678 | user3@local.com
Tried so far
So far I have created a simple python script that conceptually does the following:
- for each usernum in the table
-- assign a group number
-- also assign the group number to all rows where phone or email is the same as this row
-- do not assign the group number if usernum already processed (else we would do things double)
Problem
The Python script basically has to check, for each row, whether there are duplicates for phone or email. Although this is perfectly fine for maybe 10,000 records or so, it is too slow for 2 million records. I think this is possible to do in T-SQL, which should be much faster than my Python script using pyodbc. The big question thus is how to do this in SQL.

I just noticed that you said email or phone is duplicated. For that, I think you would need to decide which has priority in cases where a user could be joined on either field. Alternatively, you could split the update into a few batches that assign group numbers based on phone AND email, then email (when not already matched), then phone (when not already matched), as follows:
insert into yourGroupsTable (phone, email) -- assuming identity column of groupNum here
select distinct phone, email
from yourUserTable
-- assign group nums with priority on matching phone AND email
update yourUserTable
set groupNum = g.groupNum
from yourUserTable u
join yourGroupsTable g on u.phone = g.phone
and u.email = g.email
It occurs to me now that this would not work, as each row would join on yourGroupsTable due to the distinct select. I also came across a scenario for which I'm unsure what your expected outcome would be (and it is too big for a comment). What happens in this instance?
Your test data, slightly modified:
groupnum | usernum | phone | email
1 | 1 | 123 | user1@local.com
1 | 2 | 123 | user2@local.com
? | 3 | 245 | user3@local.com
? | 4 | 678 | user3@local.com
? | 5 | 245 | user7@local.com
? | 6 | 678 | user7@local.com
What would the group numbers be in the above case?
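One possible reading of the question above is that matches are transitive: rows that share a phone or an email, directly or through a chain of other rows, belong to one group. Under that assumption (which the asker has not confirmed), the task is a connected-components problem, and a union-find structure handles it in a single pass instead of re-scanning the table for every row. The following is a minimal Python sketch; `assign_groups` and the in-memory row format are hypothetical names chosen for illustration, and rows with NULL phone or email are assumed to be skipped for that field.

```python
def assign_groups(rows):
    """rows: list of (usernum, phone, email) tuples. Returns {usernum: groupnum}."""
    parent = {}

    def find(x):
        # Find the representative of x's set, creating the set on first sight.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    for usernum, phone, email in rows:
        # Link each user to its phone and email values; users that share
        # a value end up in the same set, directly or transitively.
        if phone is not None:
            union(("user", usernum), ("phone", phone))
        if email is not None:
            union(("user", usernum), ("email", email))

    group_of_root = {}
    result = {}
    for usernum, _, _ in rows:
        root = find(("user", usernum))
        if root not in group_of_root:
            group_of_root[root] = len(group_of_root) + 1  # number groups 1, 2, ...
        result[usernum] = group_of_root[root]
    return result


rows = [
    (1, "123", "user1@local.com"),
    (2, "123", "user2@local.com"),
    (3, "245", "user3@local.com"),
    (4, "678", "user3@local.com"),
    (5, "245", "user7@local.com"),
    (6, "678", "user7@local.com"),
]
groups = assign_groups(rows)
print(groups)
```

With this interpretation, users 3 to 6 all land in the same group, because 3 and 4 share an email, 3 and 5 share a phone, and 4 and 6 share a phone. The resulting mapping could then be written back to the table in batches.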

Your Python script is a reasonable approach. If you want to move this into MySQL instead, write a stored procedure that, before inserting a record, checks whether a matching row already exists in the table:
IF it exists
THEN take that row's groupnum and assign it to the new record
IF NOT
THEN give the new record a new groupnum
I am still a little confused about one thing, though.
What if a record looks like this:
5 | 678 | user1@local.com
What should happen in that case?
I assume that both columns (phone and email) are considered when assigning the groupnum.
If my assumption is correct, then go with a MySQL procedure.

Related

How to return unique rows having count() of multiple columns = 1 using group by?

So here is my situation:
____________________________________________
| idnumber | name | sectiongroup |
--------------------------------------------
| 123 | Joe | one |
| 123 | Barry | two |
| 1234 | Laura | one |
| 1234 | LauraCopyCat | one |
--------------------------------------------
I am trying to build a query which will return any unique (i.e. COUNT(idnumber) = 1) id numbers in a given sectiongroup. So if you are in sectiongroup number one and no one else in your sectiongroup has the same ID number as you, then I want your idnumber. If someone in group two happens to have the same idnumber, that is okay, I still want your idnumber.
For example, Barry and Joe have the same id number but they are in separate sectiongroups, so I want to return their idnumbers. However, Laura and LauraCopyCat have the SAME sectiongroup, so I do NOT want their idnumbers to be returned. So far I have the following:
SELECT idnumber
FROM namestable
GROUP BY idnumber, sectiongroup
HAVING(COUNT(idnumber) = 1)
Is there a way to add sectiongroup into the COUNT()=1 condition?

Just use COUNT(*) to avoid confusion. This will count the number of records in each group. Remember, a group consists of one unique combination of values in the fields specified in your GROUP BY clause.
SELECT idnumber
FROM namestable
GROUP BY idnumber, sectiongroup
HAVING COUNT(*) = 1
Note that this will return duplicate idnumbers if you have records that share an id but have different sectiongroups. To remove the duplicates, just change SELECT to SELECT DISTINCT.
Tested here: http://sqlfiddle.com/#!9/b0a50c/3
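The behaviour described in this answer can be checked locally. The sketch below loads the sample rows from the question into an in-memory SQLite database through Python's standard `sqlite3` module (SQLite is used here only as a convenient stand-in for whichever engine the asker runs) and executes both the plain and the DISTINCT variant of the query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE namestable (idnumber INTEGER, name TEXT, sectiongroup TEXT)"
)
conn.executemany(
    "INSERT INTO namestable VALUES (?, ?, ?)",
    [
        (123, "Joe", "one"),
        (123, "Barry", "two"),
        (1234, "Laura", "one"),
        (1234, "LauraCopyCat", "one"),
    ],
)

# One output row per (idnumber, sectiongroup) group that holds exactly one record.
per_group = conn.execute(
    """
    SELECT idnumber
    FROM namestable
    GROUP BY idnumber, sectiongroup
    HAVING COUNT(*) = 1
    """
).fetchall()

# Same query with DISTINCT, so an id shared across sectiongroups appears once.
distinct_ids = conn.execute(
    """
    SELECT DISTINCT idnumber
    FROM namestable
    GROUP BY idnumber, sectiongroup
    HAVING COUNT(*) = 1
    """
).fetchall()

print(per_group)     # idnumber 123 appears twice (Joe and Barry)
print(distinct_ids)  # idnumber 123 appears once
```

Laura and LauraCopyCat fall into the same (1234, one) group with a count of 2, so 1234 is filtered out in both variants.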

Find spectators that have seen the same shows (match multiple rows for each)

For an assignment I have to write several SQL queries for a database stored in a PostgreSQL server running PostgreSQL 9.3.0. However, I find myself stuck on the last query. The database models a reservation system for an opera house. The query is about associating a spectator with the other spectators who attend the same events every time.
The model looks like this:
Reservations table
id_res | create_date | tickets_presented | id_show | id_spectator | price | category
-------+---------------------+---------------------+---------+--------------+-------+----------
1 | 2015-08-05 17:45:03 | | 1 | 1 | 195 | 1
2 | 2014-03-15 14:51:08 | 2014-11-30 14:17:00 | 11 | 1 | 150 | 2
Spectators table
id_spectator | last_name | first_name | email | create_time | age
---------------+------------+------------+----------------------------------------+---------------------+-----
1 | gonzalez | colin | colin.gonzalez@gmail.com | 2014-03-15 14:21:30 | 22
2 | bequet | camille | bequet.camille@gmail.com | 2014-12-10 15:22:31 | 22
Shows table
id_show | name | kind | presentation_date | start_time | end_time | id_season | capacity_cat1 | capacity_cat2 | capacity_cat3 | price_cat1 | price_cat2 | price_cat3
---------+------------------------+--------+-------------------+------------+----------+-----------+---------------+---------------+---------------+------------+------------+------------
1 | madama butterfly | opera | 2015-09-05 | 19:30:00 | 21:30:00 | 2 | 315 | 630 | 945 | 195 | 150 | 100
2 | don giovanni | opera | 2015-09-12 | 19:30:00 | 21:45:00 | 2 | 315 | 630 | 945 | 195 | 150 | 100
So far I've started by writing a query to get the id of the spectator and the date of the show they are attending. The query looks like this:
SELECT Reservations.id_spectator, Shows.presentation_date
FROM Reservations
LEFT JOIN Shows ON Reservations.id_show = Shows.id_show;
Could someone help me understand the problem better and point me towards a solution? Thanks in advance.
The result I'm expecting should be something like this:
id_spectator | other_id_spectators
-------------+--------------------
1| 2,3
Meaning that every time spectator with id 1 went to a show, spectators 2 and 3 did too.

Note based on comments: I want to make clear that this answer may be of limited use, as it was written in the context of SQL Server (the tag was present at the time).
There is probably a better way to do it, but you could do it with the STUFF function. The only drawback here is that, since your ids are ints, placing a comma between values requires a workaround (the ids need to be cast to strings). Below is the method I can think of using that workaround.
SELECT [id_spectator], [id_show]
, STUFF((SELECT ',' + CAST(A.[id_spectator] as NVARCHAR(10))
FROM reservations A
Where A.[id_show]=B.[id_show] AND a.[id_spectator] != b.[id_spectator] FOR XML PATH('')),1,1,'') As [other_id_spectators]
From reservations B
Group By [id_spectator], [id_show]
This will show you all other spectators that attended the same shows.

Meaning that every time spectator with id 1 went to a show, spectators 2 and 3 did too.
In other words, you want a list of ...
all spectators that have seen all the shows that a given spectator has seen (and possibly more than the given one)
This is a special case of relational division. We have assembled an arsenal of basic techniques here:
How to filter SQL results in a has-many-through relation
It is special because the list of shows each spectator has to have attended is dynamically determined by the given prime spectator.
This assumes that (id_spectator, id_show) is unique in reservations, which has not been clarified.
A UNIQUE constraint on those two columns (in that order) also provides the most important index.
For best performance in queries 2 and 3 below, also create an index with leading id_show.
1. Brute force
The primitive approach would be to form a sorted array of shows the given user has seen and compare the same array of others:
SELECT 1 AS id_spectator, array_agg(sub.id_spectator) AS id_other_spectators
FROM (
SELECT id_spectator
FROM reservations r
WHERE id_spectator <> 1
GROUP BY 1
HAVING array_agg(id_show ORDER BY id_show)
@> (SELECT array_agg(id_show ORDER BY id_show)
FROM reservations
WHERE id_spectator = 1)
) sub;
But this is potentially very expensive for big tables. The whole table has to be processed, and in a rather expensive way, too.
2. Smarter
Use a CTE to determine relevant shows, then only consider those
WITH shows AS ( -- all shows of id 1; 1 row per show
SELECT id_spectator, id_show
FROM reservations
WHERE id_spectator = 1 -- your prime spectator here
)
SELECT sub.id_spectator, array_agg(sub.other) AS id_other_spectators
FROM (
SELECT s.id_spectator, r.id_spectator AS other
FROM shows s
JOIN reservations r USING (id_show)
WHERE r.id_spectator <> s.id_spectator
GROUP BY 1,2
HAVING count(*) = (SELECT count(*) FROM shows)
) sub
GROUP BY 1;
@> (used in query 1) is the "contains" operator for arrays, so we get all spectators that have seen at least the same shows.
Query 2 is faster than query 1 because only relevant shows are considered.
3. Real smart
To also exclude spectators that are not going to qualify early from the query, use a recursive CTE:
WITH RECURSIVE shows AS ( -- produces exactly 1 row
SELECT id_spectator, array_agg(id_show) AS shows, count(*) AS ct
FROM reservations
WHERE id_spectator = 1 -- your prime spectator here
GROUP BY 1
)
, cte AS (
SELECT r.id_spectator, 1 AS idx
FROM shows s
JOIN reservations r ON r.id_show = s.shows[1]
WHERE r.id_spectator <> s.id_spectator
UNION ALL
SELECT r.id_spectator, idx + 1
FROM cte c
JOIN reservations r USING (id_spectator)
JOIN shows s ON s.shows[c.idx + 1] = r.id_show
)
SELECT s.id_spectator, array_agg(c.id_spectator) AS id_other_spectators
FROM shows s
JOIN cte c ON c.idx = s.ct -- has an entry for every show
GROUP BY 1;
Note that the first CTE is non-recursive. Only the second part is recursive (iterative really).
This should be fastest for small selections from big tables. Rows that don't qualify are excluded early. The two indexes I mentioned are essential.
SQL Fiddle demonstrating all three.

It sounds like you have one half of the total question--determining which id_shows a particular id_spectator attended.
What you want to ask yourself is how you can determine which id_spectators attended an id_show, given an id_show. Once you have that, combine the two answers to get the full result.
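Combining the two halves described above can be sketched as follows. This is an illustration only: it uses Python's standard `sqlite3` module in place of the asker's PostgreSQL server, the reservation rows are made up for the demonstration, and it assumes each (id_spectator, id_show) pair occurs at most once. The query first collects the shows of the given spectator, then keeps every other spectator whose number of matching reservations equals the number of those shows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE reservations ("
    " id_spectator INTEGER, id_show INTEGER,"
    " UNIQUE (id_spectator, id_show))"
)
# Hypothetical sample data, not taken from the question.
conn.executemany(
    "INSERT INTO reservations VALUES (?, ?)",
    [
        (1, 1), (1, 11),          # the given spectator saw shows 1 and 11
        (2, 1), (2, 11), (2, 5),  # saw both, plus one more
        (3, 1), (3, 11),          # saw exactly the same shows
        (4, 1),                   # saw only one of them
    ],
)

query = """
WITH shows AS (                      -- half 1: shows of the given spectator
    SELECT id_show
    FROM reservations
    WHERE id_spectator = :prime
)
SELECT r.id_spectator                -- half 2: who else attended those shows
FROM shows AS s
JOIN reservations AS r ON r.id_show = s.id_show
WHERE r.id_spectator <> :prime
GROUP BY r.id_spectator
HAVING COUNT(*) = (SELECT COUNT(*) FROM shows)   -- attended every one of them
ORDER BY r.id_spectator
"""
others = [row[0] for row in conn.execute(query, {"prime": 1})]
print(others)
```

Spectator 4 is excluded because only one of the two required shows matches, while spectator 2 qualifies even though they also attended an extra show.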

So the final answer I got looks like this:
SELECT id_spectator, id_show,(
SELECT string_agg(to_char(A.id_spectator, '999'), ',')
FROM Reservations A
WHERE A.id_show=B.id_show
) AS other_id_spectators
FROM Reservations B
GROUP By id_spectator, id_show
ORDER BY id_spectator ASC;
Which prints something like this:
id_spectator | id_show | other_id_spectators
-------------+---------+---------------------
1 | 1 | 1, 2, 9
1 | 14 | 1, 2
Which suits my needs, however if you have any improvements to offer, please share :) Thanks again everybody!

Increasing a +1 to the id without changing the content of a column

I have this random table with random contents.
id | name| mission
1 | aaaa | kitr
2 | bbbb | etre
3 | ccccc| qwqw
4 | dddd | qwert
5 | eeee | potentials
6 | ffffffff | toto
What I want is to add a row with id=3 to the above table, with a different name and a different mission, BUT the OLD id=3 should get id=4 while keeping the name and mission it had before, the OLD id=4 should become id=5 with its own name and mission, and so on.
It's like I want to insert a row in the middle of the table and increase the id of every row below it by 1, while the rest of each row stays the same. Example below:
id | name| mission
1 | aaaa | kitr
2 | bbbb | etre
3 | zzzzzz| zzzzz
4 | ccccc| qwqw
5 | dddd | qwert
6 | eeee | potentials
7 | ffffffff | toto
Why do I want to do this? I have a table that has 2 CLOBs. Inside those CLOBs there are different queries, e.g. id=1 has a CLOB that creates a table, id=2 has inserts for its columns, id=3 creates another table, and id=4 has functions.
If you concatenate all of these ids into one text (or CLOB), it will create, then insert, then create, then add functions. The table is like one huge script.
Why am I doing this? The developers are building their application and they want the SQL to run in a specific order. I have 6 developers and I am organizing the data modeling, the performance, and how the scripts run. So the above table is there to organize the order in which the scripts they want are called.

Simply put, don't do it.
This case highlights why you should never use any business value, i.e. any 'real world values' for a Primary Key.
In your case I would recommend primary keys not be used for any other purposes.
I recommend you add an extra column 'order' and then change THAT column in order to re-order the rows. That way your primary key and all the other records will not need to be touched.
This avoids the issue that your approach would need to change ALL the database records below the current record, which seems like a really bad approach. Just imagine trying to undo that update ;)
Some more info here: https://stackoverflow.com/a/8777574/631619
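The separate ordering column recommended above can be sketched like this. The example uses Python's standard `sqlite3` module rather than the asker's database, names the column `sort_order` (an assumed name, chosen because ORDER is a reserved word in SQL), and wraps the two statements in a hypothetical helper `insert_at`. The primary keys of existing rows are never touched; only `sort_order` values shift.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE random_table ("
    " id INTEGER PRIMARY KEY, name TEXT, mission TEXT, sort_order INTEGER)"
)
initial = [
    (1, "aaaa", "kitr"),
    (2, "bbbb", "etre"),
    (3, "ccccc", "qwqw"),
    (4, "dddd", "qwert"),
    (5, "eeee", "potentials"),
    (6, "ffffffff", "toto"),
]
# Start with sort_order equal to id.
conn.executemany(
    "INSERT INTO random_table (id, name, mission, sort_order) VALUES (?, ?, ?, ?)",
    [(i, n, m, i) for i, n, m in initial],
)


def insert_at(conn, position, name, mission):
    """Insert a row so that it sorts at `position`, shifting later rows down."""
    with conn:  # one transaction: either both statements apply or neither does
        conn.execute(
            "UPDATE random_table SET sort_order = sort_order + 1"
            " WHERE sort_order >= ?",
            (position,),
        )
        conn.execute(
            "INSERT INTO random_table (name, mission, sort_order) VALUES (?, ?, ?)",
            (name, mission, position),
        )


insert_at(conn, 3, "zzzzzz", "zzzzz")
ordered = conn.execute(
    "SELECT id, name, sort_order FROM random_table ORDER BY sort_order"
).fetchall()
for row in ordered:
    print(row)
```

Reading the scripts with `ORDER BY sort_order` yields the desired sequence, while the row that used to be third keeps id 3 and the new row simply receives the next free id.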

UPDATE random_table r1
SET id =
(SELECT CASE WHEN id > 2 THEN id+1 ELSE id END id FROM random_table r2
WHERE r1.mission=r2.mission
)
Then insert the new row with id = 3.

SQLite, selecting values having same criteria (throughout all table)

I have an sqlite database table similar to the one given below
Name | Surname | AddrType | Age | Phone
John | Kruger | Home | 23 | 12345
Sarah | Kats | Home | 33 | 12345
Bill | Kruger | Work | 15 | 12345
Lars | Kats | Home | 54 | 12345
Javier | Roux | Work | 45 | 12345
Ryne | Hutt | Home | 36 | 12345
I would like to select the Name values that match the same "Surname" value, for each of the rows in the table.
For example, for the first line the query would be "select Name from myTable where Surname='Kruger'", whereas for the second line the query would be "select Name from myTable where Surname='Kats'", and so on.
Is it possible to traverse the whole table and select all values like that?
PS: I will use this method in a C++ application. The alternative is to use sqlite3_exec() and process each row one by one. I just want to know if there is any other possible way to achieve the same thing.

I'd do:
sqlite> SELECT group_concat(Name, '|') Names FROM People GROUP BY Surname;
Names
----------
Ryne
Sarah|Lars
John|Bill
Javier
Then split each value of "Names" in C++ using the "|" separator (or any other separator you choose in the group_concat function).
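The query-then-split idea can be tried end to end with the sample table from the question. The sketch below is written in Python with the standard `sqlite3` module purely for brevity (the asker's application is C++, where the same SQL would be issued through the SQLite C API), and it also selects Surname so that each list of names can be keyed by it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE People ("
    " Name TEXT, Surname TEXT, AddrType TEXT, Age INTEGER, Phone TEXT)"
)
conn.executemany(
    "INSERT INTO People VALUES (?, ?, ?, ?, ?)",
    [
        ("John", "Kruger", "Home", 23, "12345"),
        ("Sarah", "Kats", "Home", 33, "12345"),
        ("Bill", "Kruger", "Work", 15, "12345"),
        ("Lars", "Kats", "Home", 54, "12345"),
        ("Javier", "Roux", "Work", 45, "12345"),
        ("Ryne", "Hutt", "Home", 36, "12345"),
    ],
)

# One row per surname; the names are concatenated with '|' by SQLite.
cursor = conn.execute(
    "SELECT Surname, group_concat(Name, '|') AS Names"
    " FROM People GROUP BY Surname"
)

# Split the concatenated string back into a list on the client side.
names_by_surname = {surname: names.split("|") for surname, names in cursor}

for surname, names in sorted(names_by_surname.items()):
    print(surname, names)
```

The order of names inside each concatenated value is not guaranteed by SQLite, so code that depends on a particular order should sort the list after splitting. The separator must also be a character that cannot occur in a name.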

Basically you just want to exclude any records that don't have a buddy.
Something simple like joining the table against itself should work:
SELECT DISTINCT a.Name
FROM tab AS a
JOIN tab AS b
ON a.Surname = b.Surname
AND a.rowid <> b.rowid; -- do not let a row match itself
Just returning the full sorted table and doing the duplicate check yourself may be faster if incidence is high (and will always be high for all sets of data). That would be a pretty strong assumption though.
SELECT Name
FROM tab
ORDER BY Surname;

Out of a sample of 100 demands what else was asked for?

I am doing some work on an inbound call demand capture system where each call could have one or more demands linked to it.
There is a CaptureHeader table with CallDate, CallReference and CaptureID and a CaptureDemand table with CaptureID and DemandID.
EDIT:
I have added some representative data to show what would be expected in each table.
CaptureHeader
CaptureID | CallReference | CallDate
-----------------------------------------------
1 | 1 | 2009-11-02 20:37:00
2 | 3 | 2009-11-02 20:37:05
3 | 2 | 2009-11-02 20:37:10
4 | 4 | 2009-11-02 20:38:00
5 | 5 | 2009-11-02 20:38:30
CaptureDemand
DemandID | CaptureID | DemandText
------------------------------------
1 | 1 | Fund value
2 | 2 | Password reset
3 | 2 | Fund value
4 | 3 | Change address
5 | 3 | Fund value
6 | 3 | Rate change
7 | 3 | Fund value
8 | 4 | Variable to fixed
9 | 4 | Change address
10 | 5 | Fund value
11 | 5 | Address change
Using the tables above a filter on 'Fund value' would bring back call references of 1, 2, 3, 3, 5 because 3 has two fund values.
If I applied DISTINCT to this, then because I order by date I would be required to include the date in the select list, which would also give me two lines for 3.
To get the full set of data I would do the following query:
SELECT * FROM CaptureHeader AS ch
JOIN CaptureDemand AS cd ON ch.CaptureID = cd.CaptureID
JOIN DemandDetails AS dd ON cd.DemandID = dd.DemandID
What I would like though is to get the last 100 headers by date for a particular demand. Where it gets tricky is when there is more than one of the same demand on a header for a particular reference which is possible.
I would like 100 unique call references because I then need to get back all the demands for those call references and then count how many of each other demand was also recorded in the same call.
EDIT:
I would like to be able to say 'WHERE DemandID = SomeValue' to select my 100 references.
In other words out of 100 "value requested" demands what else was asked for. If this doesn't make sense let me know and I will try and modify the question to be clearer.
I would like to get a table like this:
Demands | Count
------------------------
Demand asked for | 100
Another demand | 36
Third demand | 12
Fourth demand | 6
Cheers, Ian.

Now that the sample data has made your requirement more explicit, I believe the following will generally serve your needs. It is essentially the same as the previous submission, with an added condition on the JOIN; this condition excludes any CaptureDemand row for which we already have the same DemandText (within the same Capture), only retaining the one with the lowest DemandId.
WITH myCTE (CaptId, NbOfDemands)
AS (
SELECT CaptureID, COUNT(*) -- Can use COUNT(DISTINCT DemandText)
FROM CaptureDemand
WHERE CaptureID IN
(SELECT TOP 100 C.CaptureID
FROM CaptureHeader C
JOIN CaptureDemand D ON C.CaptureID = D.CaptureID
AND NOT EXISTS (
SELECT * FROM CaptureDemand X
WHERE X.CaptureId = D.CaptureId AND X.DemandText = D.DemandText
AND X.DemandId < D.DemandId
)
WHERE D.DemandText= 'Fund Value'
ORDER BY CallDate DESC)
GROUP BY CaptureID -- one row per capture; needed alongside COUNT(*)
)
SELECT NbOfDemands, COUNT(*)
FROM myCTE
GROUP BY NbOfDemands
ORDER BY NbOfDemands
What this query provides:
The number of Captures which had exactly one demand
The number of Captures which had exactly two demands
..
The number of Captures which had exactly n demands
For the 100 MOST RECENT Captures which included a Demand of a particular value 'someValue' (and, this time, giving indeed 100, i.e. not counting the same CaptureID twice in case of dups on the Demand Type).
A few points:
You may want to use COUNT(DISTINCT DemandText) rather than COUNT(*) in the select list of the CTE. (We do include 100 distinct CaptureIDs, i.e. Capture #3 in your sample doesn't come up twice and thereby hide another capture at the end of the list, but we need to know whether this Capture #3 should be counted as a 3-demand or a 4-demand capture.)
Oops, this is not quite what you required, because each line shows the number of Capture instances that have exactly this number of demands...
Use a CASE on NbOfDemands to display the text as in the question (trivial).
This may show Capture instances with more than 4 demands, but that is probably a plus.
This would not show 0 if, for example, there were no Capture instances with the given number of demands.
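For comparison, the "Demands | Count" shape that the question asks for can be sketched as below. This is not the answer's query: it is an alternative written against the question's sample data, run through Python's standard `sqlite3` module (so it uses LIMIT where SQL Server would use TOP), and it assumes that a demand repeated within one call should be counted once for that call. It first picks the most recent captures containing the chosen demand, each capture once, then counts in how many of those captures every demand text occurs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE CaptureHeader ("
    " CaptureID INTEGER PRIMARY KEY, CallReference INTEGER, CallDate TEXT)"
)
conn.execute(
    "CREATE TABLE CaptureDemand ("
    " DemandID INTEGER PRIMARY KEY, CaptureID INTEGER, DemandText TEXT)"
)
conn.executemany(
    "INSERT INTO CaptureHeader VALUES (?, ?, ?)",
    [
        (1, 1, "2009-11-02 20:37:00"),
        (2, 3, "2009-11-02 20:37:05"),
        (3, 2, "2009-11-02 20:37:10"),
        (4, 4, "2009-11-02 20:38:00"),
        (5, 5, "2009-11-02 20:38:30"),
    ],
)
conn.executemany(
    "INSERT INTO CaptureDemand VALUES (?, ?, ?)",
    [
        (1, 1, "Fund value"), (2, 2, "Password reset"), (3, 2, "Fund value"),
        (4, 3, "Change address"), (5, 3, "Fund value"), (6, 3, "Rate change"),
        (7, 3, "Fund value"), (8, 4, "Variable to fixed"),
        (9, 4, "Change address"), (10, 5, "Fund value"),
        (11, 5, "Address change"),
    ],
)

query = """
WITH recent AS (                     -- newest captures containing the demand
    SELECT ch.CaptureID
    FROM CaptureHeader AS ch
    WHERE EXISTS (SELECT 1
                  FROM CaptureDemand AS d
                  WHERE d.CaptureID = ch.CaptureID
                    AND d.DemandText = :demand)
    ORDER BY ch.CallDate DESC
    LIMIT :n
)
SELECT d.DemandText, COUNT(DISTINCT d.CaptureID) AS calls
FROM CaptureDemand AS d
JOIN recent AS r ON r.CaptureID = d.CaptureID
GROUP BY d.DemandText
ORDER BY calls DESC, d.DemandText
"""
counts = conn.execute(query, {"demand": "Fund value", "n": 100}).fetchall()
for demand_text, calls in counts:
    print(demand_text, calls)
```

Because the EXISTS filter yields each capture at most once, Capture 3 with its two "Fund value" rows cannot crowd another capture out of the sample, and the chosen demand's own count equals the sample size.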

It sounds like you are trying to solve a Many to Many problem with just two tables and you really need three tables. For example:
TABLE Calls
CallId | CallDate
----------------------------
1 | 2009-11-02 20:37:00
2 | 2009-11-02 20:37:05
3 | 2009-11-02 20:37:10
4 | 2009-11-02 20:38:00
5 | 2009-11-02 20:38:30
TABLE Requests
RequestId | RequestType
----------------------------
1 | Fund value
2 | Password reset
3 | Change address
4 | Rate change
5 | Variable to fixed
TABLE CallRequests (resolves the many to many)
CallId |RequestId
-----------------
1 |1
2 |2
2 |1
3 |3
3 |1
3 |4
3 |1
4 |5
4 |3
5 |1
5 |3
This data structure will let you query from the Call side of things and from the Request side of things.
