SQL Eliminate Duplicates whilst merging additional table

I have two tables, ADDRESSES and an additional table CONTACTS. Each contact has a SUPERID, which is the ID of the address it belongs to.
I want to identify duplicates (same last name, first name and birthday) in the ADDRESSES table and move the contacts of these duplicates onto the latest address (latest DATECREATE or highest ID).
Afterwards, the remaining duplicates should be deleted.
Deleting the duplicates works, but my approach for merging the contacts does not.
This is my approach. I would be grateful for pointers on what is wrong here.
Thank you!
UPDATE dbo.CONTACTS
SET SUPERID = ADDRESSES.ID FROM dbo.ADDRESSES
inner join CONTACTS on ADDRESSES.ID = CONTACTS.SUPERID
WHERE ADDRESSES.id in (
SELECT id FROM dbo.ADDRESSES
WHERE EXISTS(
SELECT NULL FROM ADDRESSES AS tmpcomment
WHERE dbo.ADDRESSES.FIRSTNAME0 = tmpcomment.FIRSTNAME0
AND dbo.ADDRESSES.LASTNAME0 = tmpcomment.LASTNAME0
and dbo.ADDRESSES.BIRTHDAY1 = tmpcomment.BIRTHDAY1
HAVING dbo.ADDRESSES.id > MIN(tmpcomment.id)
))
DELETE FROM ADDRESSES
WHERE id in (
SELECT id FROM dbo.ADDRESSES
WHERE EXISTS(
SELECT NULL FROM ADDRESSES AS tmpcomment
WHERE dbo.ADDRESSES.FIRSTNAME0 = tmpcomment.FIRSTNAME0
AND dbo.ADDRESSES.LASTNAME0 = tmpcomment.LASTNAME0
and dbo.ADDRESSES.BIRTHDAY1 = tmpcomment.BIRTHDAY1
HAVING dbo.ADDRESSES.id > MIN(tmpcomment.id)
)
)
Here is a sample for understanding the issue.
ADDRESSES
| ID | DATECREATE | LASTNAME0 | FIRSTNAME0 | BIRTHDAY1 |
|:-----------|------------:|:------------:|------------:|:------------:|
| 1 | 19.07.2011 | Arthur | James | 05.05.1980 |
| 2 | 23.08.2012 | Arthur | James | 05.05.1980 |
| 3 | 11.12.2015 | Arthur | James | 05.05.1980 |
| 4 | 22.10.2016 | Arthur | James | 05.05.1980 |
| 6 | 20.12.2014 | Doyle | Peter | 01.01.1950 |
| 7 | 09.01.2016 | Doyle | Peter | 01.01.1950 |
CONTACTS
| ID | SUPERID |
| 1 | 1 |
| 2 | 1 |
| 3 | 2 |
| 4 | 2 |
| 5 | 3 |
| 6 | 4 |
| 7 | 4 |
| 8 | 6 |
| 9 | 6 |
| 10 | 6 |
| 11 | 7 |
The result shall be like this
ADDRESSES
| ID | DATECREATE | LASTNAME0 | FIRSTNAME0 | BIRTHDAY1 |
|:-----------|------------:|:------------:|------------:|:------------:|
| 4 | 22.10.2016 | Arthur | James | 05.05.1980 |
| 7 | 09.01.2016 | Doyle | Peter | 01.01.1950 |
CONTACTS
| ID | SUPERID |
| 1 | 4 |
| 2 | 4 |
| 3 | 4 |
| 4 | 4 |
| 5 | 4 |
| 6 | 4 |
| 7 | 4 |
| 8 | 7 |
| 9 | 7 |
| 10 | 7 |
| 11 | 7 |

My approach would use a temporary table:
/*
CREATE TABLE addresses
([ID] int, [DATECREATE] varchar(10), [LASTNAME0] varchar(6), [FIRSTNAME0] varchar(5), [BIRTHDAY1] datetime);
INSERT INTO addresses
([ID], [DATECREATE], [LASTNAME0], [FIRSTNAME0], [BIRTHDAY1])
VALUES
(1, '19.07.2011', 'Arthur', 'James', '1980-05-05 00:00:00'),
(2, '23.08.2012', 'Arthur', 'James', '1980-05-05 00:00:00'),
(3, '11.12.2015', 'Arthur', 'James', '1980-05-05 00:00:00'),
(4, '22.10.2016', 'Arthur', 'James', '1980-05-05 00:00:00'),
(6, '20.12.2014', 'Doyle', 'Peter', '1950-01-01 00:00:00'),
(7, '09.01.2016', 'Doyle', 'Peter', '1950-01-01 00:00:00');
CREATE TABLE contacts
([ID] int, [SUPERID] int);
INSERT INTO contacts
([ID], [SUPERID])
VALUES
(1, 1),
(2, 1),
(3, 2),
(4, 2),
(5, 3),
(6, 4),
(7, 4),
(8, 6),
(9, 6),
(10, 6),
(11, 7);
*/
DROP TABLE IF EXISTS #t; --requires SQL Server 2016+; on older versions use: IF OBJECT_ID('tempdb..#t') IS NOT NULL DROP TABLE #t;
SELECT id as oldid, MAX(id) OVER(PARTITION BY lastname0, firstname0, birthday1) as newid INTO #t
FROM
addresses;
/*now #t contains data like
1, 4
2, 4
3, 4
4, 4
6, 7
7, 7*/
--remove the ones we don't need to change
DELETE FROM #t WHERE oldid = newid;
BEGIN TRANSACTION;
SELECT * FROM addresses;
SELECT * FROM contacts;
--now #t is the list of contact changes we need to make, so make those changes
UPDATE contacts
SET contacts.superid = #t.newid
FROM
contacts INNER JOIN #t ON contacts.superid = #t.oldid;
--Option 1: scrub every address that has no contact records. This catches all such records, not just those in #t
--(note: NOT IN matches nothing if contacts.superid contains a NULL)
DELETE FROM addresses WHERE id NOT IN (SELECT DISTINCT superid FROM contacts);
--Option 2: only clean up the addresses we merged away in this operation (pick one option; running both is harmless)
DELETE FROM addresses WHERE id IN (SELECT oldid FROM #t);
SELECT * FROM addresses;
SELECT * FROM contacts;
ROLLBACK TRANSACTION;
Please note: I have tested this and it produces the results you want, but I advise caution when copying an update/delete query off the internet and running it. I've wrapped the changes in a transaction that selects the data before and after and then rolls back, so nothing gets wrecked. Once you are happy with the result, change ROLLBACK to COMMIT. Run it on a test db first though!
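As for why the UPDATE in the question changes nothing: it joins on ADDRESSES.ID = CONTACTS.SUPERID and then assigns ADDRESSES.ID back to SUPERID, so every contact keeps the value it already had. A variant without a temp table is sketched below. It is only a sketch: it assumes DATECREATE is a real date/datetime column (the sample script above stores it as text), it breaks ties on the highest ID, and the names ranked, keep_id and rn are made up for illustration. FIRST_VALUE needs SQL Server 2012 or later.

```sql
BEGIN TRANSACTION;

-- Step 1: point every contact at the surviving address of its duplicate group.
-- keep_id is the ID of the newest address per (last name, first name, birthday).
WITH ranked AS (
    SELECT ID,
           FIRST_VALUE(ID) OVER (
               PARTITION BY LASTNAME0, FIRSTNAME0, BIRTHDAY1
               ORDER BY DATECREATE DESC, ID DESC
           ) AS keep_id
    FROM dbo.ADDRESSES
)
UPDATE c
SET c.SUPERID = r.keep_id
FROM dbo.CONTACTS AS c
INNER JOIN ranked AS r ON r.ID = c.SUPERID
WHERE r.ID <> r.keep_id;

-- Step 2: delete every address that is not the newest one in its group.
WITH ranked AS (
    SELECT ROW_NUMBER() OVER (
               PARTITION BY LASTNAME0, FIRSTNAME0, BIRTHDAY1
               ORDER BY DATECREATE DESC, ID DESC
           ) AS rn
    FROM dbo.ADDRESSES
)
DELETE FROM ranked
WHERE rn > 1;

-- Inspect the result, then replace ROLLBACK with COMMIT once it looks right.
SELECT * FROM dbo.ADDRESSES;
SELECT * FROM dbo.CONTACTS;
ROLLBACK TRANSACTION;
```

Both statements use the same PARTITION BY and ORDER BY, so the row kept in step 2 is the one the contacts were moved to in step 1. PARTITION BY puts NULLs into the same group, so addresses with equal names and a missing birthday would be treated as duplicates here.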

Related

Convert values in related table to comma-separated list

I have two SQL Server tables:
TableA TableB
+------+--------+ +-----+------------+
| aid | Name | | aid | Activity |
+------+--------+ +-----+------------+
| 1 | Jim | | 1 | Skiing |
| 2 | Jon | | 1 | Surfing |
| 3 | Stu | | 1 | Riding |
| 4 | Sam | | 3 | Biking |
| 5 | Kat | | 3 | Flying |
+------+--------+ +-----+------------+
I'm trying to get the following result, where the related activities are in a comma-separated list:
+------+--------+------------------------------+
| aid | Name | Activity |
+------+--------+------------------------------+
| 1 | Jim | Skiing, Surfing, Riding |
| 2 | Jon | NULL |
| 3 | Stu | Biking, Flying |
| 4 | Sam | NULL |
| 5 | Kat | NULL |
+------+--------+------------------------------+
I tried:
SELECT aid, Name, STRING_AGG([Activity], ',') AS Activity
FROM TableA
INNER JOIN TableB
ON TableA.aid = TableB.aid
GROUP BY aid, Name
Can someone help me with this SQL query? Thank you.

You could use OUTER APPLY to aggregate the string if you're using SQL Server 2017 or higher.
drop table if exists #TableA;
go
create table #TableA (
aid int not null,
[Name] varchar(10) not null);
insert #TableA(aid, [Name]) values
(1, 'Jim'),
(2, 'Jon'),
(3, 'Stu'),
(4, 'Sam'),
(5, 'Kat');
drop table if exists #TableB;
go
create table #TableB (
aid int not null,
[Activity] varchar(10) not null);
insert #TableB(aid, [Activity]) values
(1, 'Skiing'),
(1, 'Surfing'),
(1, 'Riding'),
(3, 'Biking'),
(3, 'Flying');
select a.aid, a.[Name], oa.sa
from #TableA a
outer apply (select string_agg(b.Activity, ', ') sa
from #TableB b
where a.aid=b.aid) oa;
aid Name sa
1 Jim Skiing, Surfing, Riding
2 Jon NULL
3 Stu Biking, Flying
4 Sam NULL
5 Kat NULL
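The attempt in the question fails for two reasons: aid is ambiguous because it exists in both tables, and the INNER JOIN drops people without any activity. A minimal sketch of a LEFT JOIN variant, assuming SQL Server 2017 or later and the table names from the question (the aliases a and b are arbitrary):

```sql
-- STRING_AGG ignores NULLs, so people without activities get NULL in the Activity column.
SELECT a.aid,
       a.[Name],
       STRING_AGG(b.Activity, ', ') WITHIN GROUP (ORDER BY b.Activity) AS Activity
FROM TableA AS a
LEFT JOIN TableB AS b
    ON b.aid = a.aid
GROUP BY a.aid, a.[Name]
ORDER BY a.aid;
```

WITHIN GROUP (ORDER BY b.Activity) sorts each list alphabetically, so Jim's row reads "Riding, Skiing, Surfing". Without it, the order of the items inside the list is not guaranteed.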

I want to write a sqlcmd script that pulls data from a database and then manipulates that data

I'm trying to find examples of sqlcmd script files that run a SELECT statement, keep the returned values inside the script and place them in a variable. I then want to iterate over those returned values, run some IF statements on them, and then run some SQL INSERT statements. I'm using SQL Server Management Studio, so I thought I could run the scripts in the SQLCMD mode of the Query Editor. Maybe there's a better way to do it, but that seemed like a good solution.
I've looked on the Microsoft website for sqlcmd and T-SQL examples that might help. I've also done general searches of the web, but all the examples that come up are too simplistic to be helpful. Any help would be appreciated.

Here is how I understand your starting position:
create table #data
(
id int,
column1 varchar(100),
column2 varchar(100),
newcolumn int
)
create table #lookup
(
id int,
column1 varchar(100),
column2 varchar(100)
)
insert into #data
values
(1, 'black', 'duck', NULL),
(2, 'white', 'panda', NULL),
(3, 'yellow', 'dog', NULL),
(4, 'orange', 'cat', NULL),
(5, 'blue', 'lemur', NULL)
insert into #lookup
values
(1, 'white', 'panda'),
(2, 'orange', 'cat'),
(3, 'black', 'duck'),
(4, 'blue', 'lemur'),
(5, 'yellow', 'dog')
select * from #data
select * from #lookup
Output:
select * from #data
/------------------------------------\
| id | column1 | column2 | newcolumn |
|----|---------|---------|-----------|
| 1 | black | duck | NULL |
| 2 | white | panda | NULL |
| 3 | yellow | dog | NULL |
| 4 | orange | cat | NULL |
| 5 | blue | lemur | NULL |
\------------------------------------/
select * from #lookup
/------------------------\
| id | column1 | column2 |
|----|---------|---------|
| 1 | white | panda |
| 2 | orange | cat |
| 3 | black | duck |
| 4 | blue | lemur |
| 5 | yellow | dog |
\------------------------/
From this starting point, you can achieve what you are asking for as follows:
update d set d.newcolumn = l.id
from #data d
left join #lookup l on d.column1 = l.column1 and d.column2 = l.column2
alter table #data
drop column column1, column2
This will leave the tables in the desired state, with the varchar values moved out into the lookup table:
select * from #data
/----------------\
| id | newcolumn |
|----|-----------|
| 1 | 3 |
| 2 | 1 |
| 3 | 5 |
| 4 | 2 |
| 5 | 4 |
\----------------/
select * from #lookup
/------------------------\
| id | column1 | column2 |
|----|---------|---------|
| 1 | white | panda |
| 2 | orange | cat |
| 3 | black | duck |
| 4 | blue | lemur |
| 5 | yellow | dog |
\------------------------/

Select MAX date using data from several columns SQL

I know this is a frequently asked question, and I've had a look through what's already available, but I believe my case is slightly different (if it's not, please point me in the right direction).
I am trying to find the latest row associated with a user; the data is currently spread across two tables and several columns.
table: statusUpdate
+-------+-----------+-----------+-------------------+
| id | name | status | date_change |
+-------+-----------+-----------+-------------------+
| 1 | Matt | 0 | 01-01-2001 |
| 2 | Jeff | 1 | 01-01-2001 |
| 3 | Jeff | 2 | 01-01-2002 |
| 4 | Bill | 2 | 01-01-2001 |
| 5 | Bill | 3 | 01-01-2004 |
+-------+-----------+-----------+-------------------+
table: relationship
+-------+-----------+--------------+
| id | userID |stautsUpdateID|
+-------+-----------+--------------+
| 1 | 22 | 1 |
| 2 | 33 | 2 |
| 3 | 33 | 3 |
| 4 | 44 | 4 |
| 5 | 44 | 5 |
+-------+-----------+--------------+
There is a third table which links userID to its own table but these sample tables should be good enough to get my question over.
I am looking to get the latest status change by date. The problem currently is that it returns all instances of a status change.
Current results:
+-------+---------+-----------+-------------------+
|userID |statusID | status | date_change |
+-------+---------+-----------+-------------------+
| 33 | 2 | 1 | 01-01-2001 |
| 33 | 3 | 2 | 01-01-2002 |
| 44 | 4 | 2 | 01-01-2001 |
| 44 | 5 | 3 | 01-01-2004 |
+-------+---------+-----------+-------------------+
Expected results:
+-------+-----------+-----------+-------------------+
|userID |statusID | status | date_change |
+-------+-----------+-----------+-------------------+
| 33 | 3 | 2 | 01-01-2002 |
| 44 | 5 | 3 | 01-01-2004 |
+-------+-----------+-----------+-------------------+
I hope this all makes sense; please ask if you need more information.
Just to reiterate, I only want to return the latest instance of a user's status change by date.
Sample code of one of my attempts:
select
st.ID, st.status, st.date_change, r.userID
from statusUpdate st
inner join Relationship r on st.ID = r.statusUpdateID
inner join (select ID, max(date_change) as recent from statusUpdate
group by ID) as y on r.stausUpdateID = y.ID and st.date_change =
y.recent
Hope someone can point me in the right direction.

Use row_number() to get the latest row per user:
select *
from
(
select st.ID, st.status, st.date_change, r.userID,
rn = row_number() over (partition by r.userID order by st.date_change desc)
from statusUpdate st
inner join Relationship r on st.ID = r.statusUpdateID
) as d
where rn = 1

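I added MAX aggregates and a GROUP BY to your query. Note that each MAX is evaluated independently, so this only returns a consistent row when ID, status and date_change all increase together, as they do in the sample data: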
CREATE TABLE #Table1
([id] int, [name] varchar(4), [status] int, [date_change] datetime)
;
INSERT INTO #Table1
([id], [name], [status], [date_change])
VALUES
(1, 'Matt', 0, '2001-01-01 00:00:00'),
(2, 'Jeff', 1, '2001-01-01 00:00:00'),
(3, 'Jeff', 2, '2002-01-01 00:00:00'),
(4, 'Bill', 2, '2001-01-01 00:00:00'),
(5, 'Bill', 3, '2004-01-01 00:00:00')
;
CREATE TABLE #Table2
([id] int, [userID] int, [stautsUpdateID] int)
;
INSERT INTO #Table2
([id], [userID], [stautsUpdateID])
VALUES
(1, 22, 1),
(2, 33, 2),
(3, 33, 3),
(4, 44, 4),
(5, 44, 5)
select
max(st.ID) id , max(st.status) status , max(st.date_change) date_change, r.userID
from #Table1 st
inner join #Table2 r on st.ID = r.stautsUpdateID
inner join (select ID, max(date_change) as recent from #Table1
group by ID) as y on r.stautsUpdateID = y.ID and st.date_change =
y.recent
group by r.userID
output
id status date_change userID
1 0 2001-01-01 00:00:00.000 22
3 2 2002-01-01 00:00:00.000 33
5 3 2004-01-01 00:00:00.000 44
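To illustrate the difference between the two answers, here is a small sketch with made-up data in which the status does not increase over time. The table variables and their values are hypothetical and only mimic the column layout of the question.

```sql
-- Hypothetical data: user 33 goes from status 5 (in 2001) down to status 2 (in 2002).
DECLARE @statusUpdate TABLE (id int, status int, date_change date);
DECLARE @relationship TABLE (userID int, statusUpdateID int);

INSERT @statusUpdate (id, status, date_change)
VALUES (2, 5, '2001-01-01'), (3, 2, '2002-01-01');
INSERT @relationship (userID, statusUpdateID)
VALUES (33, 2), (33, 3);

-- Independent MAX per column: yields id = 3 and status = 5,
-- although status 5 belongs to the older row with id = 2.
SELECT r.userID,
       MAX(st.id)          AS id,
       MAX(st.status)      AS status,
       MAX(st.date_change) AS date_change
FROM @statusUpdate AS st
INNER JOIN @relationship AS r ON st.id = r.statusUpdateID
GROUP BY r.userID;

-- ROW_NUMBER picks one whole row per user: yields id = 3 and status = 2.
SELECT d.userID, d.id, d.status, d.date_change
FROM (
    SELECT r.userID, st.id, st.status, st.date_change,
           ROW_NUMBER() OVER (PARTITION BY r.userID
                              ORDER BY st.date_change DESC) AS rn
    FROM @statusUpdate AS st
    INNER JOIN @relationship AS r ON st.id = r.statusUpdateID
) AS d
WHERE d.rn = 1;
```

If two rows of the same user can share the same date_change, adding a tie-breaker such as st.id DESC to the ORDER BY makes the chosen row deterministic.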

SQL count occurrences of values grouped by external tables references

What is the best approach in terms of performance and maintainability to count the number of occurrences of the same value in a table, grouping the results with the same reference that groups the entries of the table?
Let's say I have three tables (the concepts have been simplified to represent a scenario similar to the one I'm working on):
|----------| |----------------| |-----------------------------------|
| MEAL | | RECIPE | | INGREDIENT_ENTRY |
|----------| |----------------| |-----------------------------------|
| ID | ... | | ID | ID_m | ...| | ID | ID_r | amount and description|
|----------| |----------------| |-----------------------------------|
| 1 | ... | | 1 | 1 | ...| | 1 | 1 | '15gr of yeast' |
| 2 | ... | | 2 | 2 | ...| | 2 | 4 | '2 eggs' |
| 3 | ... | | 3 | 3 | ...| | 3 | 1 | '300cl of water' |
| 4 | ... | | 4 | 4 | ...| | 4 | 2 | '300cl of beer' |
|----------| | 5 | 1 | ...| | 5 | 3 | '250cl of milk' |
| 6 | 4 | ...| | 6 | 5 | '100gr of biscuits' |
| 7 | 5 | ...| | 7 | 2 | '15gr of yeast' |
| 8 | 6 | ...| | 8 | 1 | '500gr of flour' |
|----------------| | 9 | 2 | '500gr of flour' |
| 10 | 2 | '10gr of salt' |
| 11 | 4 | '15gr of yeast' |
|-----------------------------------|
The same MEAL can be cooked with a different RECIPE, and each RECIPE is made of different INGREDIENT_ENTRYs, organized in the same RECIPE by sharing the same ID_r value.
INGREDIENT_ENTRY.[amount and description] is a column of type VARCHAR(MAX), this is the value that must be compared.
In the example, making the query with (MEAL 1,RECIPE 1):
It has 3 ingredients (1,3,8), and shares:
Two ingredients with RECIPE 2 (7,9) -> and so can be found in MEAL 2
One ingredient with RECIPE 4 (11) -> and so can be found in MEAL 3
Result should look something like:
|------| |--------| |-------|
| MEAL | | RECIPE | | COUNT |
|------| |--------| |-------|
| 2 | | 2 | | 2 |
| 4 | | 4 | | 1 |
|------| |--------| |-------|
I'm experimenting with views to reduce SQL complexity, but I cannot make it with a single SQL statement and I would like to avoid going back and forth to code (C#) and perform multiple queries (for example query for every ingredient, and reconcile results with HashMaps or similar).
Please, note that I cannot modify the DB structure.

You can find common ingredients using EXISTS. In the below I have simply used a Common table expression so that I don't have to write out the joins more than once to get back to a meal ID:
DECLARE @SelectedMealID INT = 1;
WITH LinkedData AS
(
SELECT MealID = r.ID_m,
RecipeID = r.ID,
Ingredient = i.[amount and description]
FROM RECIPE AS r
INNER JOIN INGREDIENT_ENTRY AS i
ON i.ID_r = r.ID
)
SELECT a.MealID,
a.RecipeID,
CommonIngedients = COUNT(*)
FROM LinkedData AS a
WHERE a.MealID != @SelectedMealID
AND EXISTS
( SELECT 1
FROM LinkedData AS b
WHERE b.Ingredient = a.Ingredient
AND b.MealID = @SelectedMealID
)
GROUP BY a.MealID, a.RecipeID;
I have tested this with the below sample:
-- GENERATE TABLES AND DATA
DECLARE @Meal TABLE (ID INT);
INSERT @Meal (ID) VALUES (1), (2), (3), (4);
DECLARE @Recipe TABLE (ID INT, ID_m INT);
INSERT @Recipe (ID, ID_m)
VALUES (1, 1), (2, 2), (3, 3), (4, 4), (5, 1), (6, 4), (7, 5), (8, 6);
DECLARE @Ingredient TABLE (ID INT, ID_r INT, AmountAndDescription VARCHAR(MAX));
INSERT @Ingredient (ID, ID_R, AmountAndDescription)
VALUES
(1, 1, '15gr of yeast'), (2, 4, '2 eggs'),
(3, 1, '300cl of water'), (4, 2, '300cl of beer'),
(5, 3, '250cl of milk'), (6, 5, '100gr of biscuits'),
(7, 2, '15gr of yeast'), (8, 1, '500gr of flour'),
(9, 2, '500gr of flour'), (10, 2, '10gr of salt'),
(11, 4, '15gr of yeast');
-- TEST QUERY
DECLARE @SelectedMealID INT = 1;
WITH LinkedData AS
(
SELECT MealID = r.ID_m,
RecipeID = r.ID,
Ingredient = i.AmountAndDescription
FROM @Recipe AS r
INNER JOIN @Ingredient AS i
ON i.ID_r = r.ID
)
SELECT a.MealID,
a.RecipeID,
CommonIngedients = COUNT(*)
FROM LinkedData AS a
WHERE a.MealID != @SelectedMealID
AND EXISTS
( SELECT 1
FROM LinkedData AS b
WHERE b.Ingredient = a.Ingredient
AND b.MealID = @SelectedMealID
)
GROUP BY a.MealID, a.RecipeID;
OUTPUT
MealID RecipeID CommonIngedients
------------------------------------------
2 2 2
4 4 1
N.B. The expected output in the question differs slightly, but I think the question may contain a typo (it states that Recipe 4 relates to Meal 3, but this doesn't appear to be the case in the sample data).
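The question phrases the lookup in terms of one recipe (MEAL 1, RECIPE 1), whereas the query above filters by meal. A minimal sketch of a recipe-based variant follows, using the sample data from the question in table variables. The parameter name @SelectedRecipeID and the column alias CommonIngredients are assumptions made for illustration.

```sql
-- Sample data copied from the question, held in table variables for a self-contained run.
DECLARE @Recipe TABLE (ID INT, ID_m INT);
INSERT @Recipe (ID, ID_m)
VALUES (1, 1), (2, 2), (3, 3), (4, 4), (5, 1), (6, 4), (7, 5), (8, 6);

DECLARE @Ingredient TABLE (ID INT, ID_r INT, AmountAndDescription VARCHAR(MAX));
INSERT @Ingredient (ID, ID_r, AmountAndDescription)
VALUES
    (1, 1, '15gr of yeast'),  (2, 4, '2 eggs'),
    (3, 1, '300cl of water'), (4, 2, '300cl of beer'),
    (5, 3, '250cl of milk'),  (6, 5, '100gr of biscuits'),
    (7, 2, '15gr of yeast'),  (8, 1, '500gr of flour'),
    (9, 2, '500gr of flour'), (10, 2, '10gr of salt'),
    (11, 4, '15gr of yeast');

DECLARE @SelectedRecipeID INT = 1;  -- hypothetical parameter: the recipe to compare against

-- Count, per other recipe, the ingredient entries whose text also occurs in the selected recipe.
SELECT r.ID_m   AS MealID,
       r.ID     AS RecipeID,
       COUNT(*) AS CommonIngredients
FROM @Ingredient AS i
INNER JOIN @Recipe AS r
    ON r.ID = i.ID_r
WHERE i.ID_r <> @SelectedRecipeID
  AND EXISTS
      ( SELECT 1
        FROM @Ingredient AS s
        WHERE s.ID_r = @SelectedRecipeID
          AND s.AmountAndDescription = i.AmountAndDescription
      )
GROUP BY r.ID_m, r.ID;
```

With this sample data the query returns recipe 2 (meal 2) with a count of 2 and recipe 4 (meal 4) with a count of 1. Unlike the meal-based query, it does not exclude other recipes of the same meal, so recipe 5 would show up if it shared an ingredient with recipe 1.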

SQL Server : compare two tables and return similar rows

I want to compare two tables, source and target, and get the similar rows.
Compare source and target on Id, one row at a time:
If there is a match and Target has two or more rows for that Id => select all matching rows from Target
If there is a match and Source has two or more rows for that Id =>
for the first match, if it has not been selected before
select the matching row from Target
else (if it has already been selected)
check the next match
I think this needs a recursive expression to check source and target one by one.
Source
x------x---------x
| Id | Name |
x------x---------x
| 1 | a |
| 2 | b |
| 2 | c |
| 3 | d |
| 3 | e |
| 4 | x |
x------x---------x
Target
x------x---------x
| Id | Name |
x------x---------x
| 1 | f |
| 1 | g |
| 2 | h |
| 3 | i |
| 3 | j |
| 5 | y |
x------x---------x
Result
x------x---------x
| Id | Name |
x------x---------x
| 1 | f |
| 1 | g |
| 2 | h |
| 3 | i |
| 3 | j |
x------x---------x
Test data
declare @s table(Id int, name varchar(20))
DECLARE @t table( Id int, name varchar(20))
INSERT @s values(1, 'a'), (2, 'b'), (2, 'c'), (3, 'd'), (3, 'e')
INSERT @t values(1, 'f'), (1, 'g'), (2, 'h'), (3, 'i'), (3, 'j')

I think you just need the EXISTS operator to do this.
select * from @t t
where exists (select 1 from @s s where t.id=s.id)

SELECT DISTINCT
t.Id,
t.name
FROM SOURCE s
INNER JOIN target t ON s.id=t.Id
WHERE s.Id IN (SELECT Id FROM target)
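The two answers behave differently once the target contains identical rows. A short sketch with made-up data (two source rows and two identical target rows for Id 3) shows the effect; the table variables and values are hypothetical.

```sql
DECLARE @s TABLE (Id int, name varchar(20));
DECLARE @t TABLE (Id int, name varchar(20));

INSERT @s VALUES (3, 'd'), (3, 'e');
INSERT @t VALUES (3, 'i'), (3, 'i');  -- two identical target rows

-- EXISTS keeps every target row exactly once: 2 rows.
SELECT t.Id, t.name
FROM @t AS t
WHERE EXISTS (SELECT 1 FROM @s AS s WHERE s.Id = t.Id);

-- A plain join pairs each source row with each target row: 2 x 2 = 4 rows.
SELECT t.Id, t.name
FROM @s AS s
INNER JOIN @t AS t ON s.Id = t.Id;

-- DISTINCT removes the join duplicates, but also merges the two identical target rows: 1 row.
SELECT DISTINCT t.Id, t.name
FROM @s AS s
INNER JOIN @t AS t ON s.Id = t.Id;
```

With the sample data from the question, where all target rows differ, the EXISTS query and the DISTINCT join return the same five rows.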
