SQL where two fields are similar using "GROUP BY" - sql

I have a database of names, but some of them haven't been inserted correctly: SecondName has sometimes been entered as FirstName.
+-----------------+--------------+
| FirstName | SecondName |
+-----------------+--------------+
| Bob | Smith |
| Gary | Rose |
| John | Jones |
| Smith | Bob |
| Gary | Oberstein |
| Adam | Sorbet |
| Jones | John |
+-----------------+--------------+
I've tried different grouping queries
select `FirstName`
, `SecondName`
from `names`
where ( `FirstName`
, `SecondName` )
in ( select `FirstName`
, `SecondName`
from `names`
group
by `FirstName`
, `SecondName`
having count(*) > 1
)
But I can't get anything to produce
+-----------------+--------------+---------+
| FirstName | SecondName | Count |
+-----------------+--------------+---------+
| Bob | Smith | 2 |
| John | Jones | 2 |
+-----------------+--------------+---------+

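There is a trick to do this: you need to normalize your names. A quick way is to put the first name and second name in alphabetical order and then group on the result.
-- CONCAT() is used because || means logical OR in MySQL's default mode
-- the '|' separator keeps 'Ab'+'c' and 'A'+'bc' from colliding
SELECT name_normalized, count(*) as c
FROM (
SELECT CASE WHEN FirstName < SecondName THEN CONCAT(FirstName, '|', SecondName)
ELSE CONCAT(SecondName, '|', FirstName) END as name_normalized
FROM names
) X
GROUP BY name_normalized
HAVING count(*) > 1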
Notes:
This is the simple case. You could add the normalized result as a column if you want to see the original values.
You may need other normalization; it depends on what your rules are. For example, UPPER() to ignore case and TRIM() to remove whitespace.
You can add or ignore other columns as required for matching: birthday, middle initial, etc.
Often a hash of the normalized string is faster to work with than the string itself; your data model might require one or the other.
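Here is a minimal sketch of the normalize-then-group idea, run against the question's sample rows in an in-memory SQLite database from Python. It keeps the two ordered names as separate columns instead of concatenating them, which is my own variation and not part of the answer above; the aliases name_a and name_b are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE names (FirstName TEXT, SecondName TEXT)")
conn.executemany(
    "INSERT INTO names VALUES (?, ?)",
    [("Bob", "Smith"), ("Gary", "Rose"), ("John", "Jones"), ("Smith", "Bob"),
     ("Gary", "Oberstein"), ("Adam", "Sorbet"), ("Jones", "John")],
)

# Put each pair in alphabetical order so (Bob, Smith) and (Smith, Bob)
# fall into the same group, then keep only groups seen more than once.
query = """
    SELECT name_a, name_b, COUNT(*) AS c
    FROM (
        SELECT CASE WHEN FirstName < SecondName THEN FirstName ELSE SecondName END AS name_a,
               CASE WHEN FirstName < SecondName THEN SecondName ELSE FirstName END AS name_b
        FROM names
    ) AS x
    GROUP BY name_a, name_b
    HAVING COUNT(*) > 1
    ORDER BY name_a, name_b
"""
duplicates = conn.execute(query).fetchall()
print(duplicates)  # [('Bob', 'Smith', 2), ('John', 'Jones', 2)]
```

Note that the comparison is case-sensitive here, so the UPPER()/TRIM() advice from the notes still applies to real data.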

If the COUNT() itself isn't important, you can easily pick out the swapped duplicates with an INNER JOIN:
SELECT n.FirstName, n.SecondName, n2.FirstName, n2.SecondName
FROM Names n
INNER JOIN Names n2 on n.FirstName = n2.SecondName and n.SecondName = n2.FirstName
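One thing to be aware of with the self-join: each swapped pair comes back twice, once from each side. A small sketch with the question's data (SQLite via Python) shows this, along with one possible way to keep a single row per pair. The extra `n.FirstName < n2.FirstName` predicate is my assumption about what you'd want, not something from the answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE names (FirstName TEXT, SecondName TEXT)")
conn.executemany(
    "INSERT INTO names VALUES (?, ?)",
    [("Bob", "Smith"), ("Gary", "Rose"), ("John", "Jones"), ("Smith", "Bob"),
     ("Gary", "Oberstein"), ("Adam", "Sorbet"), ("Jones", "John")],
)

# The plain self-join reports every swapped pair from both directions.
both_directions = conn.execute("""
    SELECT n.FirstName, n.SecondName, n2.FirstName, n2.SecondName
    FROM names AS n
    INNER JOIN names AS n2
        ON n.FirstName = n2.SecondName AND n.SecondName = n2.FirstName
""").fetchall()

# Keeping only the side whose FirstName sorts first yields one row per pair.
one_per_pair = conn.execute("""
    SELECT n.FirstName, n.SecondName, n2.FirstName, n2.SecondName
    FROM names AS n
    INNER JOIN names AS n2
        ON n.FirstName = n2.SecondName AND n.SecondName = n2.FirstName
       AND n.FirstName < n2.FirstName
    ORDER BY n.FirstName
""").fetchall()

print(len(both_directions))  # 4
print(one_per_pair)
```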

Related

SQL split values on multiple rows

I made a select and it looks like this
SELECT name,class_name
FROM students
INNER JOIN classes on classes.id = students.id
and I get a table like
| name | class_name |
| -------- | -------------- |
| Daniel | Math |
| Johnny | Physics |
| Johnny | Math |
| Andrew | English |
...
How can I split the table to get the first two classes each student attends? (Students can attend a single class or more than two classes.)
Example:
| name | class_1 | class_2 |
| -------- | -------------- | ---------|
| Daniel | Math | |
| Johnny | Math | Physics |
| Andrew | English | |
...
I was thinking of transposing the table, however, I don't know how to actually do it or if it's a good approach.
I concatenated every different class for each student in a single column with commas as separators (because classes can contain spaces, for example, "Computer Science") then I split the single column into the two required classes.
WITH part
AS (SELECT NAME,
class_name
FROM students
INNER JOIN classes
ON classes.id = students.id),
ans
AS (SELECT NAME,
Array_to_string(Array_agg(class_name ORDER BY class_name), ',') AS cl
FROM part
GROUP BY NAME)
SELECT NAME,
Split_part(cl, ',', 1) AS class_1,
Split_part(cl, ',', 2) AS class_2
FROM ans
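An alternative that avoids building and splitting a string is to number each student's classes and pivot the first two numbers into columns. This is only a sketch, run in SQLite from Python (window functions need SQLite 3.25 or later). The enrolments table is a hypothetical stand-in for the rows the question's join already returns, and ordering the classes alphabetically is my assumption about what "first two" means.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical flattened input: one row per (student, class).
conn.execute("CREATE TABLE enrolments (name TEXT, class_name TEXT)")
conn.executemany(
    "INSERT INTO enrolments VALUES (?, ?)",
    [("Daniel", "Math"), ("Johnny", "Physics"), ("Johnny", "Math"), ("Andrew", "English")],
)

query = """
    WITH ranked AS (
        -- Number each student's classes 1, 2, 3, ... in alphabetical order.
        SELECT name, class_name,
               ROW_NUMBER() OVER (PARTITION BY name ORDER BY class_name) AS rn
        FROM enrolments
    )
    -- Pivot ranks 1 and 2 into columns; a missing rank stays NULL.
    SELECT name,
           MAX(CASE WHEN rn = 1 THEN class_name END) AS class_1,
           MAX(CASE WHEN rn = 2 THEN class_name END) AS class_2
    FROM ranked
    WHERE rn <= 2
    GROUP BY name
    ORDER BY name
"""
rows = conn.execute(query).fetchall()
print(rows)
# [('Andrew', 'English', None), ('Daniel', 'Math', None), ('Johnny', 'Math', 'Physics')]
```

The same ROW_NUMBER() pattern should carry over to Postgres, and it does not care whether class names contain commas.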

SQL exercise using a self Left Outer Join

This is the exercise:
How can you output a list of all members, including the individual who recommended them (if any)? Ensure that results are ordered by (surname, firstname).
The solution is below, but I don't understand why it's 'ON recs.memid = mems.recommendedby' and not 'mems.memid = recs.recommendedby'. Why doesn't the latter work? I want to correct my thinking on how to use a Left Outer Join of a table to itself.
CREATE TABLE members (
memid SERIAL PRIMARY KEY,
firstname VARCHAR(20),
surname VARCHAR(20),
recommendedby INTEGER References members(memid)
);
SELECT
mems.firstname AS memfname,
mems.surname AS memsname,
recs.firstname AS recfname,
recs.surname AS recsname
FROM
cd.members AS mems
LEFT OUTER JOIN cd.members AS recs
ON recs.memid = mems.recommendedby
ORDER BY memsname, memfname;
Consider this data:
MEMID | FIRSTNAME | SURNAME | RECOMMENDEDBY
------|-----------|---------|--------------
1 | John | Smith | null
2 | Karen | Green | 1
3 | Peter | Jones | 1
Here John recommended both Karen and Peter, but no one recommended John
'ON recs.memid = mems.recommendedby' (The one that "works")
You're getting a list of members and the ones that recommended them. Any member can only have been recommended by one member as per the table structure, so you'll get all the members just once. You're taking the recommendedby value and looking for it in the memid column in the "other table":
recommendedby --|--> memid of members that recommended them
MEMFNAME | MEMSNAME | RECFNAME | RECSNAME
---------|----------|----------|---------
Karen | Green | John | Smith
Peter | Jones | John | Smith
John | Smith | null | null
The recommendedby column only has John (1), so when looking for the value 1, John comes up.
'ON mems.memid = recs.recommendedby' (The one that doesn't work)
You'll again get all the members. But here you're getting them as the ones doing the recommending, so to speak. If they didn't recommend anyone, the paired record will be blank. This is because you're taking the memid value and looking to see if it matches the recommendedby column of the "other table". If a member recommended more than one person, that member's record will appear multiple times:
memid --|--> recommendedby
MEMFNAME | MEMSNAME | RECFNAME | RECSNAME
---------|----------|----------|---------
Karen | Green | null | null
Peter | Jones | null | null
John | Smith | Karen | Green
John | Smith | Peter | Jones
Karen and Peter didn't recommend anyone, but John recommended both the others.
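If you want to see the two conditions side by side, here is a small sketch that loads the three sample members into an in-memory SQLite database from Python and runs the same LEFT OUTER JOIN with each ON clause. The `cd.` schema prefix is dropped and the variable names are mine; the extra sort keys on the recs columns are only there to make the output order stable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE members (
        memid INTEGER PRIMARY KEY,
        firstname TEXT,
        surname TEXT,
        recommendedby INTEGER REFERENCES members(memid)
    )
""")
conn.executemany(
    "INSERT INTO members VALUES (?, ?, ?, ?)",
    [(1, "John", "Smith", None), (2, "Karen", "Green", 1), (3, "Peter", "Jones", 1)],
)

# Same query both times; only the ON condition changes.
template = """
    SELECT mems.firstname, mems.surname, recs.firstname, recs.surname
    FROM members AS mems
    LEFT OUTER JOIN members AS recs ON {condition}
    ORDER BY mems.surname, mems.firstname, recs.surname, recs.firstname
"""

# recs is the row whose memid the member points at: the recommender.
with_recommender = conn.execute(
    template.format(condition="recs.memid = mems.recommendedby")).fetchall()

# recs is any row that points at the member: the people they recommended.
with_recommended = conn.execute(
    template.format(condition="mems.memid = recs.recommendedby")).fetchall()

for row in with_recommender:
    print(row)
print("---")
for row in with_recommended:
    print(row)
```

The first result has exactly one row per member; the second has one row per recommendation, so John shows up twice.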

How to insert from each row into multiple tables

I'm pretty new to SQL Server (using ssms). I need some help to insert and organize data from one table into multiple tables (which are connected to each other by PK/FK).
The source table has the following columns:
Email, UserName, Phone
It's a messy table with lots of duplicates: same email with different username and so on..
My data tables are:
Person - PersonID(PK, int, not null)
Email - Email (nvarchar, null) , PersonID (FK, int, not null)
Phone - PhoneNumber (int, null) , PersonID (FK, int, not null)
UserName - UserName (nvarchar, null) , PersonID (FK, int, not null)
For each row in the source table, I need to check if the person already exists (by Email); if it does exist, I need to add the new data (if any), else I need to create a new person and add the data.
I searched here for solutions and found recommendations to use a CURSOR.
I tried it, but it takes a really long time to execute (hours, and still going).
Thanks for any help!
example:
from>
EMAIL | USERNAME | PHONE
------------------------
a#a.a | john | 956165
b#b.b | smith | 123456
c#c.c | bob | 654321
d#d.d | mike | 986514
a#a.a | dan | 658732
e#e.e | dave | 147258
f#f.f | harry | 951962
b#b.b | emmy | 456789
g#g.g | kelly | 789466
h#h.h | kelly | 258369
a#a.a | ana | 852369
to>
EMAIL | PERSONID
----------------
a#a.a | 1
b#b.b | 2
c#c.c | 3
d#d.d | 4
e#e.e | 5
f#f.f | 6
g#g.g | 7
h#h.h | 8
USERNAME | PERSONID
-------------------
john | 1
smith | 2
bob | 3
mike | 4
dan | 1
dave | 5
harry | 6
emmy | 2
kelly | 7
kelly | 8
ana | 1
PHONE | PERSONID
----------------
956165 | 1
123456 | 2
654321 | 3
986514 | 4
658732 | 1
147258 | 5
951962 | 6
456789 | 2
789466 | 7
258369 | 8
852369 | 1
Cursors will generally be slower because they operate row by row. Set-based operations, such as a join, will yield better performance. It's somewhat older, but this article further details the implications of cursors as opposed to set operations.
I wasn't entirely sure which columns you want to use to verify matches, or what data to add, so a basic example using the Email table is below; fill in the columns as necessary. The UPDATE changes existing rows based on the corresponding rows in the source table. Because it uses an INNER JOIN, only rows with matches on both sides are affected. The second statement is an INSERT that uses only rows from the source table that don't yet exist in the Email table.
The same functionality could also be accomplished with the MERGE statement, but MERGE has a number of known issues, including problems with deadlocks and key violations.
Update Existing Rows:
UPDATE E
SET E.ColumnA = SRC.ColumnA,
E.ColumnB = SRC.ColumnB
FROM YourDatabase.YourSchema.Email E
INNER JOIN YourDatabase.YourSchema.SourceTable SRC
ON E.Email = SRC.Email
Add New Rows:
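INSERT INTO YourDatabase.YourSchema.Email (ColumnA, ColumnB)
SELECT
SRC.ColumnA,
SRC.ColumnB
FROM YourDatabase.YourSchema.SourceTable SRC
-- NOT EXISTS instead of NOT IN: Email is nullable, and NOT IN matches nothing once the subquery returns a NULL
WHERE NOT EXISTS (SELECT 1 FROM YourDatabase.YourSchema.Email E WHERE E.Email = SRC.Email)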
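To make the set-based pattern concrete for the question's three target tables, here is a rough sketch using the sample rows, run in SQLite from Python rather than SQL Server. Several things are assumptions on my part: the separate Person table is left out and PersonID is generated by the Email table instead, the source table is called Source with an added id column, and the load() helper is invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Source   (id INTEGER PRIMARY KEY, Email TEXT, UserName TEXT, Phone INTEGER);
    -- Simplification: PersonID is generated here instead of in a separate Person table.
    CREATE TABLE Email    (PersonID INTEGER PRIMARY KEY, Email TEXT UNIQUE);
    CREATE TABLE UserName (UserName TEXT, PersonID INTEGER REFERENCES Email(PersonID));
    CREATE TABLE Phone    (PhoneNumber INTEGER, PersonID INTEGER REFERENCES Email(PersonID));
""")
conn.executemany(
    "INSERT INTO Source (Email, UserName, Phone) VALUES (?, ?, ?)",
    [("a#a.a", "john", 956165), ("b#b.b", "smith", 123456), ("c#c.c", "bob", 654321),
     ("d#d.d", "mike", 986514), ("a#a.a", "dan", 658732), ("e#e.e", "dave", 147258),
     ("f#f.f", "harry", 951962), ("b#b.b", "emmy", 456789), ("g#g.g", "kelly", 789466),
     ("h#h.h", "kelly", 258369), ("a#a.a", "ana", 852369)],
)

LOAD_SQL = """
    -- 1. One person per e-mail address that is not known yet.
    INSERT INTO Email (Email)
    SELECT s.Email
    FROM Source AS s
    WHERE NOT EXISTS (SELECT 1 FROM Email AS e WHERE e.Email = s.Email)
    GROUP BY s.Email
    ORDER BY MIN(s.id);

    -- 2. Attach every user name to its person, skipping pairs already stored.
    INSERT INTO UserName (UserName, PersonID)
    SELECT DISTINCT s.UserName, e.PersonID
    FROM Source AS s
    JOIN Email AS e ON e.Email = s.Email
    WHERE NOT EXISTS (SELECT 1 FROM UserName AS u
                      WHERE u.UserName = s.UserName AND u.PersonID = e.PersonID);

    -- 3. Same pattern for phone numbers.
    INSERT INTO Phone (PhoneNumber, PersonID)
    SELECT DISTINCT s.Phone, e.PersonID
    FROM Source AS s
    JOIN Email AS e ON e.Email = s.Email
    WHERE NOT EXISTS (SELECT 1 FROM Phone AS p
                      WHERE p.PhoneNumber = s.Phone AND p.PersonID = e.PersonID);
"""

def load(connection):
    """Run the three set-based statements; running them again adds nothing new."""
    connection.executescript(LOAD_SQL)

load(conn)
usernames_of_a = [row[0] for row in conn.execute(
    "SELECT u.UserName FROM UserName AS u "
    "JOIN Email AS e ON e.PersonID = u.PersonID "
    "WHERE e.Email = 'a#a.a' ORDER BY u.UserName")]
print(usernames_of_a)  # ['ana', 'dan', 'john']
```

Each statement touches the whole source table once, which is where the speedup over a cursor would come from. In SQL Server you would still need to decide how PersonID gets generated (for example an IDENTITY column on Person) before the child inserts can join to it.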

JOIN, aggregate and convert in postgres between two tables

Here are the two tables I have. All columns in both tables are of type "text"; each table name is followed by its column names.
Names
--------------------------------
Name | DoB | Team |
--------------------------------
Harry | 3/12/85 | England
Kevin | 8/07/86 | England
James | 5/05/89 | England
Scores
------------------------
ScoreName | Score
------------------------
James-1 | 120
Harry-1 | 30
Harry-2 | 40
James-2 | 56
The end result I need is a table with the following:
NameScores
---------------------------------------------
Name | DoB | Team | ScoreData
---------------------------------------------
Harry | 3/12/85 | England | "{"ScoreName":"Harry-1", "Score":"30"}, {"ScoreName":"Harry-2", "Score":"40"}"
Kevin | 8/07/86 | England | null
James | 5/05/89 | England | "{"ScoreName":"James-1", "Score":"120"}, {"ScoreName":"James-2", "Score":"56"}"
I need to do this using a single SQL command, which I will use to create a materialized view.
I have gotten as far as realising that it will involve a combination of string_agg, JOIN and JSON, but haven't been able to crack it fully. Please help :)
I don't think the join is tricky. The complication is building the JSON object:
select n.name, n.dob, n.team,
json_agg(json_build_object('ScoreName', s.scorename,
'Score', s.score)) as ScoreData
from names n left join
scores s
on s.scorename like concat(n.name, '-', '%')
group by n.name, n.dob, n.team;
Note: json_build_object() was introduced in Postgres 9.4.
EDIT:
I think you can add a case expression to get the simple NULL. The test uses count() because a bare s.scorename is neither grouped nor aggregated:
(case when count(s.scorename) = 0 then NULL
else json_agg(json_build_object('ScoreName', s.scorename,
'Score', s.score))
end) as ScoreData
Use json_agg() with row_to_json() to aggregate scores data into a json value:
select n.*, json_agg(row_to_json(s)) "ScoreData"
from "Names" n
left join "Scores" s
on n."Name" = regexp_replace(s."ScoreName", '(.*)-.*', '\1')
group by 1, 2, 3;
Name | DoB | Team | ScoreData
-------+---------+---------+---------------------------------------------------------------------------
Harry | 3/12/85 | England | [{"ScoreName":"Harry-1","Score":30}, {"ScoreName":"Harry-2","Score":40}]
James | 5/05/89 | England | [{"ScoreName":"James-1","Score":120}, {"ScoreName":"James-2","Score":56}]
Kevin | 8/07/86 | England | [null]
(3 rows)
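For anyone who wants to try the shape of this query without a Postgres instance, here is a rough SQLite analogue run from Python. It is a sketch, not the Postgres answer: json_group_array() and json_object() are SQLite's JSON functions (I'm assuming your SQLite build includes them), standing in for json_agg() and json_build_object(). The CASE guard returns a plain NULL for a player with no scores.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Names  (Name TEXT, DoB TEXT, Team TEXT);
    CREATE TABLE Scores (ScoreName TEXT, Score TEXT);
    INSERT INTO Names VALUES ('Harry', '3/12/85', 'England'),
                             ('Kevin', '8/07/86', 'England'),
                             ('James', '5/05/89', 'England');
    INSERT INTO Scores VALUES ('James-1', '120'), ('Harry-1', '30'),
                              ('Harry-2', '40'), ('James-2', '56');
""")

query = """
    SELECT n.Name, n.DoB, n.Team,
           -- COUNT ignores the NULLs a LEFT JOIN produces for unmatched names.
           CASE WHEN COUNT(s.ScoreName) = 0 THEN NULL
                ELSE json_group_array(
                         json_object('ScoreName', s.ScoreName, 'Score', s.Score))
           END AS ScoreData
    FROM Names AS n
    LEFT JOIN Scores AS s ON s.ScoreName LIKE n.Name || '-%'
    GROUP BY n.Name, n.DoB, n.Team
    ORDER BY n.Name
"""

# Parse the JSON text back into Python objects, keyed by player name.
result = {
    name: (json.loads(data) if data is not None else None)
    for name, _dob, _team, data in conn.execute(query)
}
print(result["Kevin"])  # None
print(result["Harry"])
```

Because every column is text, the scores come back as JSON strings ("30"), not numbers. Also note that a LIKE prefix match treats % and _ inside a name as wildcards, which may or may not matter for your data.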

Need select query

Consider the following table structure with data -
AdjusterID | CompanyID | FirstName | LastName | EmailID
============================================================
1001 | Sterling | Jane | Stewart | janexxx#sterlin.com
1002 | Sterling | David | Boon | dav#sterlin.com
1003 | PHH | Irfan | Ahmed | irfan#phh.com
1004 | PHH | Rahul | Khanna | rahul#phh.com
============================================================
AdjusterID is the primary key. Each company has a number of adjusters.
I need a query that lists a single adjuster per company, i.e. I need to get this result:
========================================================
1001 | Sterling | Jane | Stewart | janexxx#sterlin.com
1003 | PHH | Irfan | Ahmed | irfan#phh.com
========================================================
If anyone could help me, that would be great.
One way:
SELECT * FROM Adjusters
WHERE AdjusterID IN(SELECT min(AdjusterID)
FROM Adjusters GROUP BY CompanyID)
There are a handful of other ways involving unions and iteration, but this one is simple enough to get you started.
Edit: this assumes you want the adjuster with the lowest ID, as per your example
I know the answer from Jeremy is a valid one, so I will not repeat it. But you may try another one using a so called tie-breaker:
--//using a tie-breaker. Should be very fast on the PK field
--// but it would be good to have an index on CompanyID
SELECT t.*
FROM MyTable t
WHERE t.AdjusterID = (SELECT TOP 1 x.AdjusterID FROM MyTable x WHERE x.CompanyID = t.CompanyID ORDER BY AdjusterID)
This could perform better. It is even more useful when the table has another column and you want to select not just any row for each company but the best one, ranked by that column: instead of ORDER BY AdjusterID, you would order by that other column (or columns).
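Both approaches can be tried on the question's four rows with a short sketch in SQLite from Python. SQLite has no TOP, so LIMIT 1 stands in for TOP 1. Ranking by LastName in the second query is purely an illustrative choice of "some other column", not something the question asks for.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Adjusters (
        AdjusterID INTEGER PRIMARY KEY,
        CompanyID TEXT, FirstName TEXT, LastName TEXT, EmailID TEXT
    )
""")
conn.executemany(
    "INSERT INTO Adjusters VALUES (?, ?, ?, ?, ?)",
    [(1001, "Sterling", "Jane", "Stewart", "janexxx#sterlin.com"),
     (1002, "Sterling", "David", "Boon", "dav#sterlin.com"),
     (1003, "PHH", "Irfan", "Ahmed", "irfan#phh.com"),
     (1004, "PHH", "Rahul", "Khanna", "rahul#phh.com")],
)

# Approach 1: keep the lowest AdjusterID of every company.
by_min_id = conn.execute("""
    SELECT AdjusterID, CompanyID
    FROM Adjusters
    WHERE AdjusterID IN (SELECT MIN(AdjusterID) FROM Adjusters GROUP BY CompanyID)
    ORDER BY AdjusterID
""").fetchall()

# Approach 2: correlated subquery with a tie-breaker. Here the "best" row per
# company is the one whose LastName sorts first; AdjusterID breaks ties.
by_tie_breaker = conn.execute("""
    SELECT t.AdjusterID, t.CompanyID
    FROM Adjusters AS t
    WHERE t.AdjusterID = (SELECT x.AdjusterID
                          FROM Adjusters AS x
                          WHERE x.CompanyID = t.CompanyID
                          ORDER BY x.LastName, x.AdjusterID
                          LIMIT 1)
    ORDER BY t.AdjusterID
""").fetchall()

print(by_min_id)       # [(1001, 'Sterling'), (1003, 'PHH')]
print(by_tie_breaker)  # [(1002, 'Sterling'), (1003, 'PHH')]
```

The second result differs for Sterling because Boon sorts before Stewart, which is the point of the tie-breaker form: the ranking column is yours to choose.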