Questions with SQL - separating a column in two

I don't know how I can answer this question. Because the name and last name are in one column. I'm not allowed to change the columns.
"Get the average spending (per customer) of all customers who share a last name with another customer"
I thought to say in sqlite3
SELECT avg_spending
FROM customer
JOIN customer on WHERE name is name;
This is how the table is defined:
CREATE TABLE customer
(
cuid INTEGER,
name STRING,
age INTEGER,
avg_spending REAL,
PRIMARY KEY(cuid)
);
For example, these two rows share the same last name:
INSERT INTO customer VALUES (4, "Henk Krom", 65, 24);
INSERT INTO customer VALUES (9, "Bob Krom", 66, 4);

From the sample data you posted I guess the format of the column name is:
FirstName LastName
so you need to extract the LastName and use group by to get the average:
select
substr(name, instr(name, ' ') + 1) lastname,
avg(avg_spending) avg_spending
from customer
group by lastname
having count(*) > 1
The having clause restricts the results to those customer names that have at least 1 other customer name with the same last name.
For the sample data:
> cuid | name | age | avg_spending
> :--- | :-------- | :-- | :-----------
> 4 | Henk Krom | 65 | 24
> 9 | Bob Krom | 66 | 4
> 5 | Jack Doe | 66 | 4
> 7 | Jill Doe | 66 | 6
> 1 | Alice No | 66 | 44
you get results:
> lastname | avg_spending
> :------- | :-----------
> Doe | 5
> Krom | 14
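The grouping query above can be tried end to end with Python's built-in sqlite3 module. This is only a sketch: it assumes every name has the form "FirstName LastName" with a single space, and the in-memory table is loaded with the sample rows shown above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer ("
    " cuid INTEGER PRIMARY KEY, name TEXT, age INTEGER, avg_spending REAL)"
)
conn.executemany(
    "INSERT INTO customer VALUES (?, ?, ?, ?)",
    [
        (4, "Henk Krom", 65, 24),
        (9, "Bob Krom", 66, 4),
        (5, "Jack Doe", 66, 4),
        (7, "Jill Doe", 66, 6),
        (1, "Alice No", 66, 44),
    ],
)

# Everything after the first space is treated as the last name (assumption).
query = """
    SELECT substr(name, instr(name, ' ') + 1) AS lastname,
           avg(avg_spending) AS avg_spending
    FROM customer
    GROUP BY lastname
    HAVING count(*) > 1
    ORDER BY lastname
"""
shared_lastname_averages = conn.execute(query).fetchall()
print(shared_lastname_averages)
```

"Alice No" is dropped by the HAVING clause because no other customer shares that last name.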

As mentioned in the comments, the crux of this is to find a rule for reliably extracting the surname from the name. Apart from that you merely need an exists clause, because you want to select customers for whom another customer with the same surname exists.
("Get the average spending (per customer)" simply means get a row from the table, because each row contains exactly one customer and their average spending.)
If all names were in the format first name - blank - last name, that would be:
select *
from customer c
where exists
(
select *
from customer other
where other.cuid <> c.cuid
and substr(other.name, instr(other.name, ' ') + 1) = substr(c.name, instr(c.name, ' ') + 1)
);
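A quick way to check the EXISTS version is to run it against an in-memory SQLite table. The sketch below assumes the same "first name, blank, last name" format; the "Doe" and "No" rows are illustrative data added to the two rows from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer ("
    " cuid INTEGER PRIMARY KEY, name TEXT, age INTEGER, avg_spending REAL)"
)
conn.executemany(
    "INSERT INTO customer VALUES (?, ?, ?, ?)",
    [
        (4, "Henk Krom", 65, 24),
        (9, "Bob Krom", 66, 4),
        (5, "Jack Doe", 66, 4),   # illustrative row
        (7, "Jill Doe", 66, 6),   # illustrative row
        (1, "Alice No", 66, 44),  # illustrative row, unique last name
    ],
)

# Keep a customer only if a *different* customer has the same last name.
query = """
    SELECT c.cuid, c.name, c.avg_spending
    FROM customer c
    WHERE EXISTS (
        SELECT 1
        FROM customer other
        WHERE other.cuid <> c.cuid
          AND substr(other.name, instr(other.name, ' ') + 1)
            = substr(c.name, instr(c.name, ' ') + 1)
    )
    ORDER BY c.cuid
"""
customers_sharing_lastname = conn.execute(query).fetchall()
print(customers_sharing_lastname)
```

Each returned row is one customer with their own average spending, which is the "per customer" reading of the task.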

You were right to join the customer table to itself, but you also need to parse out the last name for the comparison and remove duplicates once a match is found: if nameA matches nameB, then nameB also matches nameA, so each pair shows up twice.
with custs AS
(
select distinct
a.name as name_1 ,
b.name as name_2
from customer a
join customer b
on substr(a.name, instr(a.name, ' ') + 1) = substr(b.name, instr(b.name, ' ') + 1)
where a.cuid <> b.cuid -- pair each customer with a different customer; no hard-coded surname filter
)
select * from customer where name in (select name_1 from custs)
union
select * from customer where name in (select name_2 from custs)

Related

Multi-Pass Duplication Identification with Exclusions

I have a customer table with several hundred thousand records. There are a LOT of duplicates of varying degrees. I am trying to identify duplicate records with level of possibility of being a duplicate.
My source table has 7 fields and looks like this:
I look for duplicates, and put them into an intermediate table with the level of possibility, table name, and the customer number.
Intermediate Table
CREATE TABLE DataCheck (
id int identity(1,1),
reason varchar(100) DEFAULT NULL,
tableName varchar(100) DEFAULT NULL,
tableID varchar(100) DEFAULT NULL
)
Here is my code to identify and insert:
-- Match on Company, Contact, Address, City, and Phone
-- DUPE
INSERT INTO DataCheck
SELECT 'Duplicate','CUSTOMER',tcd.uid
FROM #tmpCoreData tcd
INNER JOIN
(SELECT
company,
fname,
lname,
add1,
city,
phone1,
COUNT(*) AS count
FROM #tmpCoreData
WHERE company <> ''
GROUP BY company, fname, lname, add1, city, phone1
HAVING COUNT(*) > 1) dl
ON dl.company = tcd.company
ORDER BY tcd.company
In this example, it would insert ids 101, 102
The problem is when I perform the next pass:
-- Match on Company, Address, City, Phone (Diff Contacts)
-- LIKELY DUPE
INSERT INTO DataCheck
SELECT 'Likely Duplicate','CUSTOMER',tcd.uid
FROM #tmpCoreData tcd
INNER JOIN
(SELECT
company,
add1,
city,
phone1,
COUNT(*) AS count
FROM #tmpCoreData
WHERE company <> ''
GROUP BY company, add1, city, phone1
HAVING COUNT(*) > 1) dl
ON dl.company = tcd.company
ORDER BY tcd.company
This pass would then insert 101, 102 and 103.
The next pass drops the phone, so it would insert 101, 102, 103 and 104.
The next pass would look at company only, which would insert all 5.
I now have 14 entries in my intermediate table for 5 records.
How can I add an exclusion so that the 2nd pass groups on the same Company, Address, City, Phone but DIFFERENT fname and lname? Then it should only insert 101 and 103.
I considered adding a NOT IN (SELECT tableID FROM DataCheck) to ensure IDs aren't added multiple times, but on the 3rd or 4th pass it may find a duplicate and enter it 700 records after the row it duplicates, so you lose the context of which record it is a dupe of.
My output uses:
SELECT
dc.reason,
dc.tableName,
tcd.*
FROM DataCheck dc
INNER JOIN #tmpCoreData tcd
ON tcd.uid = dc.tableID
ORDER BY dc.id
And looks something like this, which is a bit confusing:
I'm going to challenge your perception of your issue, and instead propose that you calculate a simple "confidence score", which will also help you vastly simplify your results table:
WITH FirstCompany AS (SELECT custNo, company, fname, lname, add1, city, phone1
FROM(SELECT custNo, company, fname, lname, add1, city, phone1,
ROW_NUMBER() OVER(PARTITION BY company ORDER BY custNo) AS ordering
FROM CoreData) FC
WHERE ordering = 1)
SELECT RankMapping.description, Duplicate.custNo, Duplicate.company, Duplicate.fname, Duplicate.lname, Duplicate.add1, Duplicate.city, Duplicate.phone1
FROM (SELECT FirstCompany.custNo AS originalCustNo, Duplicate.*,
CASE WHEN FirstCompany.custNo = Duplicate.custNo THEN 1 ELSE 0 END
+ CASE WHEN FirstCompany.fname = Duplicate.fname AND FirstCompany.lname = Duplicate.lname THEN 1 ELSE 0 END
+ CASE WHEN FirstCompany.add1 = Duplicate.add1 AND FirstCompany.city = Duplicate.city THEN 1 ELSE 0 END
+ CASE WHEN FirstCompany.phone1 = Duplicate.phone1 THEN 1 ELSE 0 END
AS ranking
FROM FirstCompany
JOIN CoreData Duplicate
ON Duplicate.custNo >= FirstCompany.custNo
AND Duplicate.company = FirstCompany.company) Duplicate
JOIN (VALUES (4, 'original'),
(3, 'duplicate'),
(2, 'likely dupe'),
(1, 'possible dupe'),
(0, 'not likely dupe')) RankMapping(score, description)
ON RankMapping.score = Duplicate.ranking
ORDER BY Duplicate.originalCustNo, Duplicate.ranking DESC
... which generates results that look like this:
| description | custNo | company | fname | lname | add1 | city | phone1 |
|-----------------|--------|----------|---------|--------|--------------|--------------|------------|
| original | 101 | ACME INC | JOHN | DOE | 123 ACME ST | LOONEY HILLS | 1231234567 |
| duplicate | 102 | ACME INC | JOHN | DOE | 123 ACME ST | LOONEY HILLS | 1231234567 |
| likely dupe | 103 | ACME INC | JANE | SMITH | 123 ACME ST | LOONEY HILLS | 1231234567 |
| possible dupe | 104 | ACME INC | BOB | DOLE | 123 ACME ST | LOONEY HILLS | 4564567890 |
| not likely dupe | 105 | ACME INC | JESSICA | RABBIT | 456 ROGER LN | WARNER | 4564567890 |
This code arbitrarily assumes that the smallest custNo is the "original" and compares every other record only against that one, but it's completely possible to get other matches as well (just unnest the subquery in the CTE, and remove the row number).
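To make the scoring rule easier to follow, here is the same idea as a small Python sketch rather than SQL. The names `Customer`, `LABELS` and `score_against_first` are made up for this illustration; the scoring mirrors the four CASE expressions above, and the sample rows are the ones from the result table.

```python
from collections import namedtuple

Customer = namedtuple("Customer", "cust_no company fname lname add1 city phone1")

# Same score-to-description mapping as the VALUES list in the query.
LABELS = {
    4: "original",
    3: "duplicate",
    2: "likely dupe",
    1: "possible dupe",
    0: "not likely dupe",
}


def score_against_first(customers):
    """Label each record by how closely it matches the lowest custNo of its company."""
    first_by_company = {}
    for c in sorted(customers, key=lambda c: c.cust_no):
        first_by_company.setdefault(c.company, c)  # keeps the smallest cust_no

    labelled = []
    for c in customers:
        first = first_by_company[c.company]
        score = (
            int(c.cust_no == first.cust_no)
            + int((c.fname, c.lname) == (first.fname, first.lname))
            + int((c.add1, c.city) == (first.add1, first.city))
            + int(c.phone1 == first.phone1)
        )
        labelled.append((LABELS[score], c.cust_no))
    return labelled


sample = [
    Customer(101, "ACME INC", "JOHN", "DOE", "123 ACME ST", "LOONEY HILLS", "1231234567"),
    Customer(102, "ACME INC", "JOHN", "DOE", "123 ACME ST", "LOONEY HILLS", "1231234567"),
    Customer(103, "ACME INC", "JANE", "SMITH", "123 ACME ST", "LOONEY HILLS", "1231234567"),
    Customer(104, "ACME INC", "BOB", "DOLE", "123 ACME ST", "LOONEY HILLS", "4564567890"),
    Customer(105, "ACME INC", "JESSICA", "RABBIT", "456 ROGER LN", "WARNER", "4564567890"),
]
labelled = score_against_first(sample)
print(labelled)
```

Because every record gets exactly one label, there is one output row per customer instead of one row per pass.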

Why no similar ids in the results set when query with a correlated query inside where clause

I have a table with columns id, forename, surname, created (date).
I have a table such as the following:
ID | Forename | Surname | Created
---------------------------------
1 | Tom | Smith | 2008-01-01
1 | Tom | Windsor | 2008-02-01
2 | Anne | Thorn | 2008-01-05
2 | Anne | Baker | 2008-03-01
3 | Bill | Sykes | 2008-01-20
Basically, I want this to return the most recent name for each ID, so it would return:
ID | Forename | Surname | Created
---------------------------------
1 | Tom | Windsor | 2008-02-01
2 | Anne | Baker | 2008-03-01
3 | Bill | Sykes | 2008-01-20
I get the desired result with this query.
SELECT id, forename, surname, created
FROM name n
WHERE created = (SELECT MAX(created)
FROM name
GROUP BY id
HAVING id = n.id);
I am getting the result I want but I fail to understand WHY THE IDS ARE NOT BEING REPEATED in the result set. What I understand about correlated subquery is it takes one row from the outer query table and run the inner subquery. Shouldn't it repeat "id" when ids repeat in the outer query? Can someone explain to me what exactly is happening behind the scenes?
First, your subquery does not need a GROUP BY. It is more commonly written as:
SELECT n.id, n.forename, n.surname, n.created
FROM name n
WHERE n.created = (SELECT MAX(n2.created)
FROM name n2
WHERE n2.id = n.id
);
You should get in the habit of qualifying all column references, especially when your query has multiple table references.
I think you are asking why this works. Well, each row in the outer query is tested for the condition. The condition is: "is my created the same as the maximum created for all rows in the name table with the same id". In your data, only one row per id matches that condition, so ids are not repeated.
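One way to see this row-by-row test is to print, for every outer row, the value the correlated subquery returns for it. The sketch below uses Python's sqlite3 module with the sample data from the question; dates are stored as ISO text, which is an assumption about the column type.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE name (id INTEGER, forename TEXT, surname TEXT, created TEXT)"
)
conn.executemany(
    "INSERT INTO name VALUES (?, ?, ?, ?)",
    [
        (1, "Tom", "Smith", "2008-01-01"),
        (1, "Tom", "Windsor", "2008-02-01"),
        (2, "Anne", "Thorn", "2008-01-05"),
        (2, "Anne", "Baker", "2008-03-01"),
        (3, "Bill", "Sykes", "2008-01-20"),
    ],
)

# Show the subquery result next to each outer row instead of filtering on it.
per_row_check = conn.execute(
    """
    SELECT n.id, n.surname, n.created,
           (SELECT MAX(n2.created) FROM name n2 WHERE n2.id = n.id) AS max_created
    FROM name n
    ORDER BY n.id, n.created
    """
).fetchall()

for row in per_row_check:
    print(row, "kept" if row[2] == row[3] else "filtered out")

# The WHERE clause keeps exactly the rows whose created equals max_created.
kept = [(i, s) for (i, s, created, max_created) in per_row_check if created == max_created]
```

All five outer rows are evaluated, but only one row per id passes the comparison, which is why no id repeats in the result.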
You can also join the table to a derived table holding MAX(created) per id, matching on both id and created:
SELECT n.id, n.forename, n.surname, n.created
FROM name n
JOIN ( SELECT id, MAX(created) AS created FROM name GROUP BY id ) t
ON n.id = t.id AND n.created = t.created;
or using IN operator :
SELECT id, forename, surname, created
FROM name n
WHERE ( id, created ) IN (SELECT id, MAX(created)
FROM name
GROUP BY id );
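or using EXISTS with a HAVING clause in the subquery (the subquery must be correlated on id, otherwise another id's maximum could match):
SELECT id, forename, surname, created
FROM name n
WHERE EXISTS (SELECT n2.id
FROM name n2
WHERE n2.id = n.id
GROUP BY n2.id
HAVING MAX(n2.created) = n.created
);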

Select distinct row with an empty column but not when empty column is not part of a duplicate row

Using Hive, I have duplicate rows, and I want to drop the duplicates (keeping the row with the non-empty column) when a particular column is empty. But I want to keep a row with an empty column when it is not part of a duplicate.
e.g. Input is
id | name | fathername | address
1 | bob | john | street1
1 | bob | john |
2 | amir | khan |
3 | roby | johanson | street3
Output
id | name | fathername | address
1 | bob | john | street1
2 | amir | khan |
3 | roby | johanson | street3
We dropped the row for id 1 with the empty address because it was a duplicate. Although the address for id 2 is missing, we still want to keep that row because it's not a duplicate. I need this for Hive. The actual problem has many columns, and the solution needs to work with select * rather than a list of particular columns.
You can use GROUP BY with MAX:
select id, name, fathername, max(address)
from data
group by id, name, fathername
Or if you want to use select *:
select *
from data
where address is not null
union
select *
from data
where address is null and id not in (
select id
from data
where address is not null
)
You can prioritize the non-null address row in an order by using row_number.
select *
from (select t.*
,row_number() over(partition by id order by case when address is not null then 1 else 2 end) as rnum
from tbl t
) t
where rnum = 1
Note: If there is more than one non-null row, you might have to specify one or more columns to break the ties.
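The row_number approach can be sanity-checked outside Hive. The sketch below runs an equivalent query in SQLite through Python (window functions need SQLite 3.25 or newer), assumes the empty addresses are stored as NULL, and adds address itself as a tie-breaker, which is one possible choice rather than a requirement.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE data (id INTEGER, name TEXT, fathername TEXT, address TEXT)"
)
conn.executemany(
    "INSERT INTO data VALUES (?, ?, ?, ?)",
    [
        (1, "bob", "john", "street1"),
        (1, "bob", "john", None),
        (2, "amir", "khan", None),
        (3, "roby", "johanson", "street3"),
    ],
)

# Rank rows per id so that a non-NULL address sorts first, then keep rank 1.
query = """
    SELECT id, name, fathername, address
    FROM (
        SELECT t.*,
               row_number() OVER (
                   PARTITION BY id
                   ORDER BY CASE WHEN address IS NOT NULL THEN 1 ELSE 2 END,
                            address
               ) AS rnum
        FROM data t
    ) ranked
    WHERE rnum = 1
    ORDER BY id
"""
deduplicated = conn.execute(query).fetchall()
print(deduplicated)
```

The outer select lists the columns explicitly only to hide the helper column rnum; with select * the extra column would simply appear in the output.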

SQL query insert and update on duplicate key

I have aTable. aTable has the following records:
+----+------+------------------+--------+
| No | Name | Date(mm/dd/yyyy) | Salary |
+----+------+------------------+--------+
| 1 | Ed | 04/01/2016 | 1000 |
| 2 | Tom | 04/02/2016 | 1500 |
+----+------+------------------+--------+
How should the SQL Server query look to produce these results in another table:
+----+------+------------------+--------+---+
| No | Name | Date(yyyy/mm/dd) | Salary | k |
+----+------+------------------+--------+---+
| 1 | Ed | 04/01/2016 | 1000 | 0 |
| 2 | Tom | 04/02/2016 | 1500 | 0 |
+----+------+------------------+--------+---+
and update the row when the key is duplicated? The primary key is (No, Name).
You want to produce exactly the same data as your table in a new table only with a new column k which is "0" in any case?
SELECT *,0 AS k
INTO TheNewTable
FROM YourTable;
Then try it out with
SELECT * FROM TheNewTable;
But - to be honest - this seems quite strange...
The primary key is UNIQUE, so you can't duplicate it. Or maybe your logical key is some other combination, for example Name, Date, Salary; then an example query could look like this:
MERGE aNewTable as Target
USING
(
SELECT Name, Date, Salary, CASE WHEN Count(*) > 1 THEN 1 ELSE 0 END as K
FROM aTable
GROUP BY Name, Date, Salary
) as Source ON Source.Name=Target.Name AND Source.Date=Target.Date AND Source.Salary=Target.Salary
WHEN NOT MATCHED THEN
INSERT (Name, Date, Salary, K)
VALUES (Source.Name, Source.Date, Source.Salary, Source.K)
WHEN MATCHED THEN
UPDATE
SET K = Source.K
WHEN NOT MATCHED BY SOURCE THEN
DELETE;
or, simply to view the result:
SELECT Name, Date, Salary, CASE WHEN Count(*) > 1 THEN 1 ELSE 0 END as K
FROM aTable
GROUP BY Name, Date, Salary
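For readers who want to experiment with the "insert, and update on duplicate key" behaviour without a SQL Server instance, the sketch below uses SQLite's INSERT ... ON CONFLICT DO UPDATE (SQLite 3.24 or newer) as a stand-in. It is not MERGE and not SQL Server syntax; the table name new_table, the column names emp_no and pay_date, and the changed salary of 1200 are all made up for the illustration, with (emp_no, name) playing the role of the (No, Name) key from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE new_table (
        emp_no INTEGER, name TEXT, pay_date TEXT, salary INTEGER, k INTEGER,
        PRIMARY KEY (emp_no, name)
    )
    """
)

# Insert a row, or update date and salary if the (emp_no, name) key already exists.
upsert = """
    INSERT INTO new_table (emp_no, name, pay_date, salary, k)
    VALUES (?, ?, ?, ?, 0)
    ON CONFLICT (emp_no, name) DO UPDATE
    SET pay_date = excluded.pay_date,
        salary = excluded.salary
"""

# First load: both keys are new, so both rows are inserted.
conn.executemany(upsert, [(1, "Ed", "04/01/2016", 1000), (2, "Tom", "04/02/2016", 1500)])
# Second load: key (1, 'Ed') already exists, so the row is updated instead.
conn.executemany(upsert, [(1, "Ed", "04/01/2016", 1200)])

rows = conn.execute(
    "SELECT emp_no, name, pay_date, salary, k FROM new_table ORDER BY emp_no"
).fetchall()
print(rows)
```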
Try this :
Insert into second_table_name(No, Name, Date, Salary, k)
select
No, Name, Date, Salary, 0
from aTable

SQL group by with a count

I have a table (simplified below)
|company|name |age|
| 1 | a | 3 |
| 1 | a | 3 |
| 1 | a | 2 |
| 2 | b | 8 |
| 3 | c | 1 |
| 3 | c | 1 |
For various reasons the age column should be the same for each company. I have another process that updates this table, and sometimes it puts an incorrect age in. For company 1 the age should always be 3.
I want to find out which companies have a mismatch of age.
I've done this:
select company, name, age from table group by company, name, age
but I don't know how to get the rows where the age is different. This table is a lot wider and has loads of columns, so I cannot really eyeball it.
Can anyone help?
Thanks
You should not be including age in the group by clause.
SELECT company
FROM tableName
GROUP BY company, name
HAVING COUNT(DISTINCT age) <> 1
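As a quick check, the HAVING COUNT(DISTINCT age) query can be run on the sample rows from the question with Python's sqlite3 module. The table name company_age is a placeholder chosen for this sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE company_age (company INTEGER, name TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO company_age VALUES (?, ?, ?)",
    [(1, "a", 3), (1, "a", 3), (1, "a", 2), (2, "b", 8), (3, "c", 1), (3, "c", 1)],
)

# A group with more than one distinct age is a mismatch.
mismatched = conn.execute(
    """
    SELECT company
    FROM company_age
    GROUP BY company, name
    HAVING COUNT(DISTINCT age) <> 1
    """
).fetchall()
print(mismatched)
```

Only company 1 is reported, since it is the only group holding two different ages (3 and 2).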
If you want to find the row(s) with a different age than the max-count age of each company/name group:
WITH CTE AS
(
select company, name, age,
maxAge=(select top 1 age
from dbo.table1 t2
group by company,name, age
having( t1.company=t2.company and t1.name=t2.name)
order by count(*) desc)
from dbo.table1 t1
)
select * from cte
where age <> maxAge
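The same "compare against the most frequent age" idea can be sketched in SQLite, which has no TOP and uses LIMIT 1 instead. The table name company_age is a placeholder, and when two ages are equally frequent the choice of the "correct" one is arbitrary in this sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE company_age (company INTEGER, name TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO company_age VALUES (?, ?, ?)",
    [(1, "a", 3), (1, "a", 3), (1, "a", 2), (2, "b", 8), (3, "c", 1), (3, "c", 1)],
)

# For each row, look up the most common age of its company/name group
# and keep the row only if its own age differs from it.
query = """
    SELECT t1.company, t1.name, t1.age
    FROM company_age t1
    WHERE t1.age <> (
        SELECT t2.age
        FROM company_age t2
        WHERE t2.company = t1.company AND t2.name = t1.name
        GROUP BY t2.age
        ORDER BY COUNT(*) DESC
        LIMIT 1
    )
"""
odd_rows = conn.execute(query).fetchall()
print(odd_rows)
```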
If you want to update the incorrect ages with the correct ones, you just need to replace the SELECT with an UPDATE:
WITH CTE AS
(
select company, name, age,
maxAge=(select top 1 age
from dbo.table1 t2
group by company,name, age
having( t1.company=t2.company and t1.name=t2.name)
order by count(*) desc)
from dbo.table1 t1
)
UPDATE cte SET AGE = maxAge
WHERE age <> maxAge
Since you mentioned "how to get the rows where the age is different" and not just the companies:
Add a unique row id (a primary key) if there isn't already one. Let's call it id.
Then, do
select id from table
where company in
(select company from table
group by company
having count(distinct age)>1)