Removing dups and updating null values - sql

I've just been tasked with removing all the duplicate values in a database. Simple enough. But they also want me to go through and check if there are any Null values that were not Null in previous entries for that record.
So let's say that we have user 123. User 123 doesn't have a zip code listed for whatever reason. But in a past entry he had zip code 55555. I'm supposed to update the latest entry with that zip code from a past entry and then delete the past entry. Leaving me with only one entry for user 123 AND having the zip code 55555.
I'm just unsure how to do the update portion. Anybody have any suggestions?
Thanks!

Here is how you can do the update. It finds the last value for zip, and then updates the field, if necessary:
with lastval as (
select *
from (select id, zip, row_number() over (partition by id order by datecreated desc) as seqnum
from t
where zip is not null
) t
where seqnum = 1
)
update t
set t.zip = lastval.zip
from lastval
where t.id = lastval.id
and t.zip is null -- only fill in rows that are missing a zip
However, I would suggest that you create a new table with the data that you want. Don't bother deleting and updating a zillion rows; instead, create a table using a query such as:
select *
from (select t.*, row_number() over (partition by id order by datecreated desc) as seqnum
from t
where zip is not null
) t
where seqnum = 1
And insert the rows into a new table.
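As a rough sketch of that idea, the query below keeps the latest row per id and fills a missing zip from the most recent non-null value, writing the result to a new table. It assumes SQL Server syntax (SELECT ... INTO) and the columns id, zip and datecreated used above; the target name t_clean is hypothetical, and any other columns would need the same treatment.

```sql
-- Sketch only: t_clean is a hypothetical name for the new table.
WITH ranked AS (
    -- Number each id's rows, newest first.
    SELECT id, zip, datecreated,
           ROW_NUMBER() OVER (PARTITION BY id ORDER BY datecreated DESC) AS seqnum
    FROM t
),
lastzip AS (
    -- Most recent non-null zip per id.
    SELECT id, zip
    FROM (SELECT id, zip,
                 ROW_NUMBER() OVER (PARTITION BY id ORDER BY datecreated DESC) AS seqnum
          FROM t
          WHERE zip IS NOT NULL) z
    WHERE seqnum = 1
)
SELECT r.id,
       COALESCE(r.zip, lz.zip) AS zip,   -- keep the latest zip, else fall back
       r.datecreated
INTO t_clean
FROM ranked r
LEFT JOIN lastzip lz ON lz.id = r.id
WHERE r.seqnum = 1;
```

The LEFT JOIN keeps ids that never had a zip at all; their zip simply stays NULL.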
And, one more suggestion. Ask another question, with a better notion of what the fields are like in the table, and which ones you want to look up last values for. That will provide additional information for better solutions.

You could use a statement similar to the following one:
update t1
set t1.address = dt.address,
t1.city = dt.city,
... and so on ...
from your_table as t1
inner join
(
select
max(id) as id,
companyname,
max(address) as address,
max(city) as city,
... and so on ...
from your_table
group by companyname -- your duplicate detection goes here
) dt
on dt.id = t1.id
This way you fill up all gaps in your duplicates. Then you just have to delete the duplicates.
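A minimal sketch of that delete step, assuming SQL Server and the same your_table, id and companyname names as above. Because the update wrote the filled-in values to the row with the highest id in each group, that is the row kept here.

```sql
-- Sketch only: keeps the highest id per companyname, removes the rest.
WITH ranked AS (
    SELECT id,
           ROW_NUMBER() OVER (PARTITION BY companyname ORDER BY id DESC) AS rn
    FROM your_table
)
DELETE FROM ranked
WHERE rn > 1;
```

Running the inner SELECT on its own first is a cheap way to check which rows would be removed.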

Related

SQL: Deleting Duplicates using Not in and Group by

I have the following SQL Syntax to delete duplicate rows, but never are any rows affected.
DELETE FROM content_stacks WHERE id NOT IN (
SELECT id
FROM content_stacks
GROUP BY user_id, content_id
);
The subquery itself is returning the id list of first entries correctly.
SELECT id
FROM content_stacks
GROUP BY user_id, content_id
When I'm inserting the results list as a string it is working, too:
DELETE FROM content_stacks WHERE id NOT IN (239,231,217,218,219,232,233,220,230,226,234,235,224,225,221,223,222,227,228,229,236,237,238,216,208,209,210,204,211,212,242,203,240,201,241,205,206,207,213,214,215);
I checked many similar examples and this should be working in my opinion. What am I missing?
First, find the first row of each group using ROW_NUMBER, then delete the records with a row number greater than 1:
WITH CTE AS (
SELECT id, ROW_NUMBER() OVER (PARTITION BY user_id, content_id ORDER BY id) rn
FROM content_stacks
)
DELETE cs
FROM content_stacks cs
INNER JOIN CTE ON CTE.id = cs.id
WHERE rn > 1
Sorry to ask, but if you're deleting, why would you need to group the records?
Aren't you just increasing the runtime?
The code from Meyssam Toluie is not working as it is but I made a similar solution with the same idea with rownumbers:
DELETE FROM content_stacks WHERE id IN
(SELECT id FROM (
SELECT id, ROW_NUMBER() OVER(PARTITION BY user_id, content_id)row_num
FROM content_stacks
) sub
WHERE row_num > 1)
This is working for me now.
My first command did not work because the GROUP BY output only appears to leave out the other ids; they are still there, so in fact all ids were returned in the NOT IN list and nothing was deleted. Row numbers seem to be the easiest way to solve this problem.
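For comparison, here is a sketch of the GROUP BY approach with an explicit aggregate, so the subquery returns exactly one id per (user_id, content_id) group. It assumes id is never NULL. This form is standard SQL, but MySQL rejects a subquery that reads from the table being deleted from, so there the subquery would need to be wrapped in a derived table.

```sql
-- Sketch only: keeps the lowest id in each (user_id, content_id) group.
DELETE FROM content_stacks
WHERE id NOT IN (
    SELECT MIN(id)
    FROM content_stacks
    GROUP BY user_id, content_id
);
```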

Update one row based off distinct values of another column

I've got a data set with post codes, suburbs and their longitude and latitude.
For each postcode there are multiple rows with the corresponding suburbs within that postcode, so when I match it with another table which has sales by postcode in Power BI I end up with multiple rows returned for each post code.
What I'd like to do is insert a column called unique_postcode as a boolean, marking one line of each post code as True. I don't mind which one. I tried the below as well as a few other options. It didn't give any errors, but it didn't have any effect.
UPDATE postcodes
SET post_codes.unique_postcode = 1
FROM (
SELECT DISTINCT(postcode)
FROM postcodes
);
You could use an updatable CTE which targets a random row:
WITH cte AS (
SELECT *, ROW_NUMBER() OVER (PARTITION BY postcode ORDER BY postcode) rn
FROM postcodes
)
UPDATE cte
SET unique_postcode = 1
WHERE rn = 1;
Note that because the ordering used in ROW_NUMBER is the postal code itself, the "first" row could be any of the rows whenever a postal code has more than one record associated with it.
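If a repeatable choice is preferred, one option is to order by another column inside each partition. The sketch below assumes the table has a suburb column (the question mentions suburbs, but the column name is a guess); ties on that column would still be broken arbitrarily.

```sql
-- Sketch only: "suburb" is an assumed column name.
WITH cte AS (
    SELECT unique_postcode,
           ROW_NUMBER() OVER (PARTITION BY postcode ORDER BY suburb) AS rn
    FROM postcodes
)
UPDATE cte
SET unique_postcode = CASE WHEN rn = 1 THEN 1 ELSE 0 END;
```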
If the row doesn't matter then the simplest way would be to select TOP 1.
with cte as (select top 1 * from postcodes)
update cte
set unique_postcode = 1;
You can use row_number() to define a particular one to assign the flag to. In an update, this looks like:
WITH toupdate AS (
SELECT p.*, ROW_NUMBER() OVER (PARTITION BY postcode ORDER BY postcode) as seqnum
FROM postcodes p
)
UPDATE toupdate
SET unique_postcode = (CASE WHEN seqnum = 1 THEN 1 ELSE 0 END);
Note: This sets one value to "1" and the rest to "0". It is also safe to run multiple times on the table.

Last entry for non-unique id where X=Y

Please see the above image for rows I would like returned, those highlighted in yellow.
From the picture attached, I would like it to only return id 133766 and 133792 as they end at stage 5.
I want to pull the last entry for a non-unique id where there could be X amount of entries per non-unique id.
I am not overly experienced in SQL, but what I know is:
I could do
SELECT max(stage), id
FROM [dbo].[table] group by id
and this gives me a pretty good starting point. I'd rather sort on the date field, as the "stage" isn't actually an int, I've done that for simplicity here.
So I essentially need to get the last entry (figured out by date) for all non-unique id's where stage doesn't equal X
I feel like it's a really simple, everyday query, but I just can not wrap my head around a simple, efficient way to do it.
Any help is much appreciated.
Try this:
SELECT *
FROM(
SELECT Id, Stage, CompletionDate
,Row_number() OVER(PARTITION BY ID ORDER BY CompletionDate DESC) AS RN
FROM YourTable
) AS t
WHERE RN = 1 AND Stage = 5;
I want to point out that not exists is also a way to approach this:
select t.*
from t
where t.stage = 5 and
not exists (select 1
from t t2
where t2.id = t.id and t2.stage > t.stage
);
With an index on (id, stage), you might be surprised at how good the performance is.
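Since the question says the real stage column is not an integer and the date should decide which row is last, the same NOT EXISTS idea can be sketched against the date instead. This assumes the YourTable, Id, Stage and CompletionDate names from the previous answer; the index name is hypothetical.

```sql
-- Sketch only: a row qualifies if it is at stage 5 and no later row exists for its Id.
SELECT t.*
FROM YourTable t
WHERE t.Stage = 5
  AND NOT EXISTS (
        SELECT 1
        FROM YourTable t2
        WHERE t2.Id = t.Id
          AND t2.CompletionDate > t.CompletionDate
      );

-- Hypothetical supporting index for the correlated lookup.
CREATE INDEX IX_YourTable_Id_CompletionDate
    ON YourTable (Id, CompletionDate);
```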
You could use the window version of MAX:
;WITH CTE_DATA AS
(
SELECT *
, MAX(stage) OVER (PARTITION BY id) AS max_stage
FROM [dbo].[table]
)
SELECT *
FROM CTE_DATA
WHERE stage = max_stage
AND max_stage = 5;

How can I make a distinct with multiple field

I have some duplicate mail addresses in my database, but I can't remove them.
I want to select some fields, but without duplicate mail addresses.
I have a query like this:
SELECT
DISTINCT MAIL,
ID,
CIVILITE,
PRENOM,
NAME
FROM CONTACT WHERE CODE_PAYS = 'DE'
When I launch this request, my duplicate values on mail are still there.
Do you know how I can do that?
Update: I have tried this approach, but I need to use it in a view:
ALTER VIEW ALL_VW_CONTACT_DE WITH SCHEMABINDING
AS
with cte as
(
select rn = row_number() over (partition by c.Mail Order By c.Id asc), c.Mail, c.Id, c.Civilite, c.Prenom, c.Name
from dbo.CONTACT c
where code_pays = 'DE'
)
select Mail, Id, Civilite, Prenom, Name
from cte
where rn = 1
But this doesn't work; I get this error:
Cannot schema bind view 'MY_TABLE' because name 'CONTACT' is invalid
for schema binding. Name must be in two-part format and an object
cannot reference itself
"When I launch this request, my duplicate values on mail are still there."
The reason is that DISTINCT doesn't work the way you think. It doesn't look only at the first column after the DISTINCT keyword; it compares all columns in the select list. Two rows count as duplicates only if every one of those columns is equal.
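A small self-contained illustration of that behaviour, using an inline VALUES list with made-up rows rather than the real CONTACT table:

```sql
-- Two rows share a MAIL but have different IDs.
SELECT DISTINCT MAIL, ID
FROM (VALUES ('a@example.com', 1),
             ('a@example.com', 2)) AS c (MAIL, ID);
-- Returns both rows, because the (MAIL, ID) pairs differ.

SELECT DISTINCT MAIL
FROM (VALUES ('a@example.com', 1),
             ('a@example.com', 2)) AS c (MAIL, ID);
-- Returns a single row, because only MAIL is compared.
```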
One easy way is using ROW_NUMBER:
with cte as
(
select rn = row_number() over (partition by c.Mail Order By c.Id asc), c.*
from dbo.Contact c
where Code_Pays = 'DE'
)
select Mail, Id, Civilite, Prenom, Name
from cte
where rn = 1
Change the ORDER BY if you want to take a different record; here I take the one with the minimum Id.
You can use row_number as below:
Select top (1) with ties * from Contact
where CODE_PAYS = 'DE'
order by row_number() over(partition by mail order by id)
When you use DISTINCT with other fields, you get only the unique combinations of those fields.
In this case, you should exclude from the query every field that varies from row to row (probably ID):
SELECT
DISTINCT MAIL,
CIVILITE,
PRENOM,
NAME
FROM CONTACT WHERE CODE_PAYS = 'DE'
The problem here is probably the ID field. Since it should be unique for each row, the other fields can't be grouped. Remove it from the query and you should be fine.
When you do a distinct query, the trick is to look at the results and find which columns are returning different values; that's what's differentiating the rows. If you add the results to your question we can help you further.

Select a NON-DISTINCT column in a query that return distincts rows

The following query returns the results that I need, but I have to add the ID of the row so that I can then update it. If I add the ID directly in the select statement, it returns more results than I need, because each ID is unique, so the DISTINCT statement sees every line as unique.
SELECT DISTINCT ucpse.MemberID, ucpse.ProductID, ucpse.UserID
FROM UserCustomerProductSalaryExceptions as ucpse
WHERE EXISTS (SELECT NULL
FROM UserCustomerProductSalaryExceptions as upcse2
WHERE ucpse.userid = upcse2.userid AND ucpse.MemberID = upcse2.MemberID AND ucpse.ProductID = upcse2.ProductID
GROUP BY upcse2.UserID, upcse2.memberid, upcse2.productid
HAVING COUNT(UserID) >= 2
)
So basically I need to add ucpse.ID in the Select statement while keeping DISTINCT values for MemberID,ProductID and UserID.
Any Ideas ?
Thank you
According to your comment:
If the data has been duplicated 67 times for a given employee with a given product and a given client, I need to keep only one of those records. It's not important which one, so this is why I use DISTINCT to obtain unique combinations of a given employee with a given product and a given client.
You can use MIN() or MAX() and GROUP BY instead of DISTINCT:
SELECT MAX(ucpse.ID) AS ID, ucpse.MemberID, ucpse.ProductID, ucpse.UserID
FROM UserCustomerProductSalaryExceptions as ucpse
WHERE EXISTS (SELECT NULL
FROM UserCustomerProductSalaryExceptions as upcse2
WHERE ucpse.userid = upcse2.userid AND ucpse.MemberID = upcse2.MemberID AND ucpse.ProductID = upcse2.ProductID
GROUP BY upcse2.UserID, upcse2.memberid, upcse2.productid
HAVING COUNT(UserID) >= 2
)
GROUP BY ucpse.MemberID, ucpse.ProductID, ucpse.UserID
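If the EXISTS filter is only there to restrict the result to duplicated combinations, a shorter form with HAVING on the outer GROUP BY may give the same rows. This is a sketch that assumes MemberID, ProductID and UserID are never NULL; DuplicateCount is an extra column added here for inspection.

```sql
-- Sketch only: one row per duplicated combination, with the ID that would be kept.
SELECT MAX(ID) AS ID,
       COUNT(UserID) AS DuplicateCount,
       MemberID, ProductID, UserID
FROM UserCustomerProductSalaryExceptions
GROUP BY MemberID, ProductID, UserID
HAVING COUNT(UserID) >= 2;
```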
UPDATE:
From your comments, I think the query below is what you need:
DELETE FROM UserCustomerProductSalaryExceptions
WHERE ID NOT IN ( SELECT MAX(ucpse.ID) AS ID
FROM UserCustomerProductSalaryExceptions AS ucpse
GROUP BY ucpse.MemberID, ucpse.ProductID, ucpse.UserID
-- no HAVING filter here: a group with a single row must keep that row too
)
If all you want is to delete the duplicates, this will do it:
WITH X AS
(SELECT ID,
ROW_NUMBER() OVER (PARTITION BY MemberID, ProductID, UserID ORDER BY ID) AS DupRowNum
FROM UserCustomerProductSalaryExceptions
)
DELETE X WHERE DupRowNum > 1
ID's not necessary - try:
UPDATE uu SET
<your settings here>
FROM UserCustomerProductSalaryExceptions uu
JOIN ( <paste your entire query above here>
) uc ON uc.MemberID=uu.MemberId AND uc.ProductID=uu.ProductId AND uc.UserID=uu.UserId
From the sound of your data structure (which I would STRONGLY advise normalizing as soon as possible), it sounds like you should be updating all the records. It sounds as if each duplicate is important because it contains some information about an employee's relation to a customer or product.
I would probably update all the records. Try this:
UPDATE UCPSE
SET
--Do your updates here
FROM UserCustomerProductSalaryExceptions as ucpse
JOIN
(
SELECT UserID, MemberID, ProductID
FROM UserCustomerProductSalaryExceptions
GROUP BY UserID, MemberID, ProductID
HAVING COUNT(UserID) >= 2
) T
ON ucpse.UserID = T.UserID AND ucpse.MemberID = T.MemberID AND ucpse.ProductID = T.ProductID