How to Deduplicate an id column by adding an overflow column? - sql

Alright guys, I need some help! I need to deduplicate the ID column, and I'm having trouble adding a column without losing important data.
Is there a way to make an "overflow column" that would take the secondary [tags] and put them into a new column?
Here is the example:
**UniqueId** **Age** **Zip** **Tag**
1 20 11111 yellow
2 25 33333 blue
2 25 33333 black
3 30 44444 purple
3 30 44444 pink
3 30 44444 white
This is what I want the output to look like:
**UniqueId** **Age** **Zip** **Tag1** **Tag2** **Tag3**
1 20 11111 yellow NULL NULL
2 25 33333 blue black NULL
3 30 44444 purple pink white
Your help would be greatly appreciated!!!

If you know the maximum number of tags, you can use pivot or conditional aggregation:
select t.uniqueid, t.age, t.zip,
max(case when seqnum = 1 then tag end) as tag_1,
max(case when seqnum = 2 then tag end) as tag_2,
max(case when seqnum = 3 then tag end) as tag_3
from (select t.*,
row_number() over (partition by uniqueid order by (select null)) as seqnum
from t
) t
group by t.uniqueid, t.age, t.zip;

I tend to prefer conditional aggregation, as Gordon illustrated, since it offers a bit more flexibility.
That said, you can also do a simple PIVOT.
Example
Select *
From (
Select UniqueID
,Age
,Zip
,Tag
,Col = concat('Tag',Row_Number() over (Partition By UniqueId order by Tag) )
From YourTable
) A
Pivot (Max(Tag) for Col in ([Tag1],[Tag2],[Tag3],[Tag4]) ) p
Returns
UniqueID Age Zip Tag1 Tag2 Tag3 Tag4
1 20 11111 yellow NULL NULL NULL
2 25 33333 black blue NULL NULL
3 30 44444 pink purple white NULL

Attention: Do not store the age as an integer! Store a date of birth (DOB) instead and compute the age...
This is not a real answer to your question, but it is what you should do instead:
To be honest: Your question can be solved (and there are good answers already), but you should not do this.
Whenever you feel the need to add numbers to a column's name (Tag1, Tag2...), the design is almost always wrong. Push these values into a related side table (just the Id and the tag), remove the column from your original table, and give the new table a foreign key pointing back to the original table. Now you can join these values whenever you need them. PIVOT (or conditional aggregation) is for output only...
This is completely untested, so be careful with your data (backup!), but something along these lines should work:
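--the foreign key is added at the end: it needs a unique key on OriginalTable(UniqueId),
--which cannot exist while the duplicated rows are still there
CREATE TABLE TagTable (ID INT IDENTITY
,FKOriginal INT NOT NULL
,Tag VARCHAR(100) NOT NULL);
--an index to support the fk
CREATE NONCLUSTERED INDEX IX_TagTable_FKOriginal ON TagTable(FKOriginal);
GO
--shift the existing data
INSERT INTO TagTable (FKOriginal,Tag) --you might use DISTINCT...
SELECT UniqueId,Tag
FROM OriginalTable;
GO
--delete duplicated rows
WITH cte AS
(
SELECT *
,ROW_NUMBER() OVER(PARTITION BY UniqueId ORDER BY UniqueId) AS RowId --Find a better sort column if needed
FROM OriginalTable
)
DELETE FROM cte
WHERE RowId>1; --Only the first remains
GO
--throw away the tag column in the original table
ALTER TABLE OriginalTable DROP COLUMN Tag;
GO
--UniqueId is unique now, so the key and the foreign key can be created
--(skip the UNIQUE constraint if the table gets a primary key on UniqueId instead)
ALTER TABLE OriginalTable ADD CONSTRAINT UQ_OriginalTable_UniqueId UNIQUE (UniqueId);
ALTER TABLE TagTable ADD CONSTRAINT FK_TagTable_OriginalTable FOREIGN KEY (FKOriginal) REFERENCES OriginalTable(UniqueId);
GO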
--See the result via JOIN-Select
SELECT *
FROM OriginalTable AS o
INNER JOIN TagTable AS t ON o.UniqueId=t.FKOriginal;
If you need these pivoted columns, you can use the approaches provided in other answers with the final SELECT too.
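As a rough illustration of that last point, here is a minimal sketch that runs the pivot on top of the join of the two tables. It is not SQL Server: it uses Python's built-in sqlite3 module with an in-memory database (so conditional aggregation stands in for PIVOT), it assumes SQLite 3.25 or later for window functions, and it reuses the table names from the script above with the sample rows from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE OriginalTable (UniqueId INTEGER PRIMARY KEY, Age INTEGER, Zip TEXT);
CREATE TABLE TagTable (
    ID INTEGER PRIMARY KEY,
    FKOriginal INTEGER NOT NULL REFERENCES OriginalTable(UniqueId),
    Tag TEXT NOT NULL
);
INSERT INTO OriginalTable VALUES (1, 20, '11111'), (2, 25, '33333'), (3, 30, '44444');
INSERT INTO TagTable (FKOriginal, Tag) VALUES
    (1, 'yellow'), (2, 'blue'), (2, 'black'),
    (3, 'purple'), (3, 'pink'), (3, 'white');
""")

# Number the tags per UniqueId (insertion order, via the ID column),
# then spread them over a fixed set of columns with conditional aggregation.
pivoted = conn.execute("""
SELECT o.UniqueId, o.Age, o.Zip,
       MAX(CASE WHEN t.seqnum = 1 THEN t.Tag END) AS Tag1,
       MAX(CASE WHEN t.seqnum = 2 THEN t.Tag END) AS Tag2,
       MAX(CASE WHEN t.seqnum = 3 THEN t.Tag END) AS Tag3
FROM OriginalTable AS o
INNER JOIN (
    SELECT FKOriginal, Tag,
           ROW_NUMBER() OVER (PARTITION BY FKOriginal ORDER BY ID) AS seqnum
    FROM TagTable
) AS t ON o.UniqueId = t.FKOriginal
GROUP BY o.UniqueId, o.Age, o.Zip
ORDER BY o.UniqueId
""").fetchall()

for row in pivoted:
    print(row)
```

The stored data stays normalized; the Tag1..Tag3 columns exist only in the query result, and missing tags come back as NULL (None in Python).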

Related

Using SQL, how do I select which column to add a value to, based on the contents of the row?

I'm having a difficult time phrasing the question, so I think the best thing to do is to give some example tables. I have a table, Attribute_history, I'm trying to pull data from that looks like this:
ID Attribute_Name Attribute_Val Time Stamp
--- -------------- ------------- ----------
1 Color Red 2022/09/28 01:00
2 Color Blue 2022/09/28 01:30
1 Length 3 2022/09/28 01:00
2 Length 4 2022/09/28 01:30
1 Diameter 5 2022/09/28 01:00
2 Diameter 10 2022/09/28 01:30
2 Diameter 11 2022/09/28 01:32
I want to create a table that pulls the attributes of each ID, and if the same ID and attribute_name has been updated, pull the latest info based on Time Stamp.
ID Color Length Diameter
---- ------ ------- --------
1 Red 3 5
2 Blue 4 11
I've achieved this by nesting several select statements, adding one column at a time. I achieved selecting the latest date using this stack overflow post. However, this code seems inefficient, since I'm selecting from the same table multiple times. It also only chooses the latest value for an attribute I know is likely to have been updated multiple times, not all the values I'm interested in.
SELECT
COLOR, DIAMETER, DATE_
FROM
(
SELECT
COLORS.COLOR, ATTR.ATTRIBUTE_NAME AS DIAMETER, ATTR.TIME_STAMP AS DATE_, RANK() OVER (PARTITION BY COLORS.COLOR ORDER BY ATTR.TIME_STAMP DESC) DATE_RANK -- https://stackoverflow.com/questions/3491329/group-by-with-maxdate
FROM
(
SELECT
ATTRIBUTE_HISTORY.ATTRIBUTE_VAL
FROM
ATTRIBUTE_HISTORY
WHERE
ATTRIBUTE_HISTORY.ATTRIBUTE_NAME = 'Color'
GROUP BY ATTRIBUTE_HISTORY.ID
) COLORS
INNER JOIN ATTRIBUTE_HISTORY ATTR ON COLORS.ID = ATTR.ID
WHERE
ATTR.ATTRIBUTE_NAME = 'DIAMETER'
)
WHERE
DATE_RANK = 1
(I copied my real query and renamed values with Find+Replace to obscure the data so this code might not be perfect, but it gets across the idea of how I'm achieving my goal now.)
How can I rewrite this query to be more concise, and pull the latest date entry for each attribute?
For MS SQL Server
Your problem has two parts:
1. Identify the latest attribute value based on the Time Stamp column.
2. Convert the attribute names to columns (pivoting) in the final result.
Solution:
;with CTEx as
(
select
row_number() over(partition by id, Attr_name order by Time_Stamp desc) rnum,
id,Attr_name, Attr_value, time_stamp
from #temp
)
SELECT * FROM
(
SELECT id,Attr_name,Attr_value
FROM CTEx
where rnum = 1
) t
PIVOT(
max(Attr_value)
FOR Attr_name IN (Color,Diameter,[Length])
) AS pivot_table;
The first part of the problem is taken care of by the CTE with the help of the ROW_NUMBER() function. The second part is achieved with PIVOT.
Definition of #temp for reference
Create table #temp(id int, Attr_name varchar(200), Attr_value varchar(200), Time_Stamp datetime)
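For anyone who wants to try the two steps without SQL Server, the following is a minimal sketch using Python's built-in sqlite3 module. SQLite has no PIVOT, so conditional aggregation is swapped in for the second step; the table name attribute_history and the ISO-formatted timestamps are assumptions made for this example, and window functions need SQLite 3.25 or later.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE attribute_history "
    "(id INTEGER, attr_name TEXT, attr_value TEXT, time_stamp TEXT)"
)
conn.executemany("INSERT INTO attribute_history VALUES (?, ?, ?, ?)", [
    (1, "Color", "Red", "2022-09-28 01:00"),
    (2, "Color", "Blue", "2022-09-28 01:30"),
    (1, "Length", "3", "2022-09-28 01:00"),
    (2, "Length", "4", "2022-09-28 01:30"),
    (1, "Diameter", "5", "2022-09-28 01:00"),
    (2, "Diameter", "10", "2022-09-28 01:30"),
    (2, "Diameter", "11", "2022-09-28 01:32"),
])

# Step 1: rank the rows per (id, attribute), newest first, and keep rank 1.
# Step 2: turn the attribute names into columns with conditional aggregation.
latest = conn.execute("""
WITH ranked AS (
    SELECT id, attr_name, attr_value,
           ROW_NUMBER() OVER (PARTITION BY id, attr_name
                              ORDER BY time_stamp DESC) AS rnum
    FROM attribute_history
)
SELECT id,
       MAX(CASE WHEN attr_name = 'Color' THEN attr_value END) AS color,
       MAX(CASE WHEN attr_name = 'Length' THEN attr_value END) AS len,
       MAX(CASE WHEN attr_name = 'Diameter' THEN attr_value END) AS diameter
FROM ranked
WHERE rnum = 1
GROUP BY id
ORDER BY id
""").fetchall()

for row in latest:
    print(row)
```

The table is scanned once, and every attribute gets its latest value, not just the one known to change often.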

deleting specific duplicate and original entries in a table based on date

I have a table called "main" which has 4 columns: ID, name, DateID and Sign.
I want to create a query that deletes entries from this table if the same ID appears twice within a certain DateID range.
My WHERE clause searches the previous 3 weeks:
where DateID =((SELECT MAX( DateID)
WHERE DateID < ( SELECT MAX( DateID )-3))
An example of the dataset I'm working with:
id      name    DateID   sign
-----   -----   ------   ----
12345   Paul    1915     Up
23658   Danny   1915     Down
37868   Jake    1916     Up
37542   Elle    1917     Up
12345   Paul    1917     Down
87456   John    1918     Up
78563   Luke    1919     Up
23658   Danny   1920     Up
In the case above, both entries for ID 12345 would need to be removed.
However, the entries for ID 23658 would need to be kept, as their DateIDs are more than 3 apart.
How would this be possible?
You can use window functions for this.
It's not quite clear, but it seems LAG and conditional COUNT should fit what you need.
DELETE t
FROM (
SELECT *,
CountWithinDate = COUNT(CASE WHEN t.PrevDate >= t.DateId - 3 THEN 1 END) OVER (PARTITION BY t.id)
FROM (
SELECT *,
PrevDate = LAG(t.DateID) OVER (PARTITION BY t.id ORDER BY t.DateID)
FROM YourTable t
) t
) t
WHERE CountWithinDate > 0;
Note that you do not need to re-join the table, you can delete directly from the t derived table.
Hope this works:
DELETE FROM test_tbl
WHERE id IN (
SELECT T1.id
FROM test_tbl T1
WHERE EXISTS (SELECT 1 FROM test_tbl T2 WHERE T1.id = T2.id AND ABS(T2.dateid - T1.dateid) < 3 AND T1.dateid <> T2.dateid)
)
In case you need more logic for data processing, I would suggest using a stored procedure.
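To check this kind of delete against the sample data from the question, here is a minimal sketch using Python's built-in sqlite3 module with an in-memory table. The name test_tbl follows the query above; the WINDOW constant is an assumption about what "within a certain DateID" means and should be adjusted to the real rule.

```python
import sqlite3

WINDOW = 3  # assumption: two rows clash when their DateIDs differ by less than 3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test_tbl (id INTEGER, name TEXT, dateid INTEGER, sign TEXT)")
conn.executemany("INSERT INTO test_tbl VALUES (?, ?, ?, ?)", [
    (12345, "Paul", 1915, "Up"),
    (23658, "Danny", 1915, "Down"),
    (37868, "Jake", 1916, "Up"),
    (37542, "Elle", 1917, "Up"),
    (12345, "Paul", 1917, "Down"),
    (87456, "John", 1918, "Up"),
    (78563, "Luke", 1919, "Up"),
    (23658, "Danny", 1920, "Up"),
])

# Delete every row of an id that has two different DateIDs closer than WINDOW.
conn.execute("""
DELETE FROM test_tbl
WHERE id IN (
    SELECT t1.id
    FROM test_tbl AS t1
    WHERE EXISTS (
        SELECT 1
        FROM test_tbl AS t2
        WHERE t1.id = t2.id
          AND t1.dateid <> t2.dateid
          AND ABS(t2.dateid - t1.dateid) < ?
    )
)
""", (WINDOW,))

remaining = conn.execute(
    "SELECT id, dateid FROM test_tbl ORDER BY dateid, id"
).fetchall()
print(remaining)
```

Both rows for 12345 disappear, while the two rows for 23658 (five DateIDs apart) stay. Note that this form removes all rows of a clashing id, including any that are far away from the clash.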

update in oracle sql : multiple rows in 1 table

I am new to SQL and I am no good with more advanced queries and functions.
So, I have this 1 table with sales:
id date seller_name buyer_name
---- ------------ ------------- ------------
1 2015-02-02 null Adrian
1 2013-05-02 null John B
1 2007-11-15 null Chris F
2 2014-07-12 null Jane A
2 2011-06-05 null Ted D
2 2010-08-22 null Maryanne A
3 2015-12-02 null Don P
3 2012-11-07 null Chris T
3 2011-10-02 null James O
I would like to update the seller_name for each id, by putting the buyer_name from previous sale as seller_name to newer sale date. For example, for on id 1 John B would then be seller in 2015-02-02 and buyer in 2013-05-02. Does that make sense?
P.S. This is the perfect case, the table is big and the ids are not ordered so neat.
merge into your_table a
using ( select rowid rid,
lead(buyer_name, 1) over (partition by id order by date desc) seller
from your_table
) b
on (a.rowid = b.rid )
when matched then update set a.seller_name= b.seller;
Explanation: The MERGE INTO statement performs different operations depending on whether a row matches or not. Here you merge into your own table; the USING subquery supplies the new values together with the rowid, which serves as the matching key. The LEAD function returns a value from a following row, and the number after the comma says how many rows to look ahead. You also specify the window to work on, which in your case is partitioned by id and ordered by date descending, so the next row is the previous sale and its buyer becomes the seller. Hope this clears it up a bit.
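The way LEAD picks up the previous buyer can be tried outside Oracle with a minimal sketch like the one below. It uses Python's built-in sqlite3 module, which has no MERGE, so a correlated UPDATE keyed on rowid is swapped in; the table name sales and the column name sale_date are assumptions for this example, and window functions need SQLite 3.25 or later.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (id INTEGER, sale_date TEXT, seller_name TEXT, buyer_name TEXT)"
)
conn.executemany("INSERT INTO sales (id, sale_date, buyer_name) VALUES (?, ?, ?)", [
    (1, "2015-02-02", "Adrian"),
    (1, "2013-05-02", "John B"),
    (1, "2007-11-15", "Chris F"),
    (2, "2014-07-12", "Jane A"),
    (2, "2011-06-05", "Ted D"),
    (2, "2010-08-22", "Maryanne A"),
])

# With newest-first ordering, the "next" row within an id is the previous sale,
# so LEAD(buyer_name) yields the buyer of that sale, i.e. the current seller.
conn.execute("""
UPDATE sales
SET seller_name = (
    SELECT b.seller
    FROM (
        SELECT rowid AS rid,
               LEAD(buyer_name, 1) OVER (PARTITION BY id
                                         ORDER BY sale_date DESC) AS seller
        FROM sales
    ) AS b
    WHERE b.rid = sales.rowid
)
""")

updated = conn.execute(
    "SELECT id, sale_date, seller_name, buyer_name FROM sales "
    "ORDER BY id, sale_date DESC"
).fetchall()
for row in updated:
    print(row)
```

The oldest sale of each id keeps a NULL seller, because there is no earlier buyer to take the name from.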
Either of the queries below can be used to perform the desired action:
merge into sandeep24nov16_2 table1
using(select rowid r, lag(buyer_name) over (partition by id order by "DATE" asc) update_value from sandeep24nov16_2 ) table2
on (table1.rowid=table2.r)
when matched then update set table1.seller_name=table2.update_value;
or
merge into sandeep24nov16_2 table1
using(select rowid r, lead(buyer_name) over (partition by id order by "DATE" desc) update_value from sandeep24nov16_2 ) table2
on (table1.rowid=table2.r)
when matched then update set table1.seller_name=table2.update_value;
select a.*,
lag(buyer_name, 1) over(partition by id order by sale_date) seller_name
from <your_table> a;

How to increment a value in SQL based on a unique key

Apologies in advance if some of the trigger solutions already cover this but I can't get them to work for my scenario.
I have a table of over 50,000 rows, all of which have an ID, with roughly 5000 distinct ID values. There could be 100 rows with an instrumentID = 1 and 50 with an instrumentID = 2 within the table etc but they will have slightly different column entries. So I could write a
SELECT * from tbl WHERE instrumentID = 1
and have it return 100 rows (I know this is easy stuff but just to be clear)
What I need to do is form an incrementing value for each time a instrument ID is found, so I've tried stuff like this:
IntIndex INT IDENTITY(1,1),
dDateStart DATE,
IntInstrumentID INT,
IntIndex1 AS IntInstrumentID + IntIndex,
at the table create step.
However, I need the IntIndex1 to increment when an instrumentID is found, irrespective of where the record is found in the table so that it effectively would provide a count of the records just by looking at the last IntIndex1 value alone. Rather than what the above does which is increment on all of the rows of the table irrespective of the instrumentID so you would get 5001,4002,4003 etc.
An example would be: for intInstruments 5000 and 4000
intInstrumentID | IntIndex1
--------- ------------------
5000 | 5001
5000 | 5002
4000 | 4001
5000 | 5003
4000 | 4002
The reason I need to do this is because I need to join two tables based on these values (a start and end date for each instrumentID). I have tried GROUP BY etc but this can't work in both tables and the JOIN then doesn't work.
Many thanks
I'm not entirely sure I understand your problem, but if you just need IntIndex1 to join to, could you just join to the following query, rather than trying to actually keep the calculated value in the database:
SELECT *,
intInstrumentID + RANK() OVER(PARTITION BY intInstrumentID ORDER BY dDateStart ASC) AS IntIndex1
FROM tbl
Edit: If I understand your comment correctly (which is not certain!), then presumably you know that your end date and start date tables have exactly the same number of rows, which gives a one-to-one mapping between them based on the order of their respective dates within each instrument ID?
If that's the case then maybe this join is what you are looking for:
SELECT SD.intInstrumentID, SD.dDateStart, ED.dEndDate
FROM
(
SELECT intInstrumentID,
dDateStart,
RANK() OVER(PARTITION BY intInstrumentID ORDER BY dDateStart ASC) AS IntIndex1
FROM tblStartDate
) SD
JOIN
(
SELECT intInstrumentID,
dEndDate,
RANK() OVER(PARTITION BY intInstrumentID ORDER BY dEndDate ASC) AS IntIndex1
FROM tblEndDate
) ED
ON SD.intInstrumentID = ED.intInstrumentID
AND SD.IntIndex1 = ED.IntIndex1
If not, please will you post some example data for both tables and the expected results?
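A minimal sketch of this pairing idea, runnable with Python's built-in sqlite3 module, could look like the following. The table tblEndDate and all sample rows are invented for illustration, and ROW_NUMBER() is used in place of RANK() so that two identical dates within one instrument still get distinct positions; window functions need SQLite 3.25 or later.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tblStartDate (intInstrumentID INTEGER, dDateStart TEXT);
CREATE TABLE tblEndDate (intInstrumentID INTEGER, dEndDate TEXT);
-- Hypothetical rows, deliberately stored out of order.
INSERT INTO tblStartDate VALUES
    (5000, '2020-01-01'), (4000, '2020-02-01'), (5000, '2020-03-01');
INSERT INTO tblEndDate VALUES
    (5000, '2020-03-31'), (5000, '2020-01-31'), (4000, '2020-02-29');
""")

# Number the start and end dates per instrument and join on (instrument, position).
pairs = conn.execute("""
SELECT SD.intInstrumentID, SD.dDateStart, ED.dEndDate
FROM (
    SELECT intInstrumentID, dDateStart,
           ROW_NUMBER() OVER (PARTITION BY intInstrumentID
                              ORDER BY dDateStart) AS IntIndex1
    FROM tblStartDate
) AS SD
JOIN (
    SELECT intInstrumentID, dEndDate,
           ROW_NUMBER() OVER (PARTITION BY intInstrumentID
                              ORDER BY dEndDate) AS IntIndex1
    FROM tblEndDate
) AS ED
  ON SD.intInstrumentID = ED.intInstrumentID
 AND SD.IntIndex1 = ED.IntIndex1
ORDER BY SD.intInstrumentID, SD.dDateStart
""").fetchall()

for row in pairs:
    print(row)
```

The position is computed at query time, so it does not depend on where a row physically sits in either table, and no stored IntIndex1 column is needed.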

Update rows in table

I have a table (Fruits) with following column
Fruit_Name(varchar2(10)) | IsDuplicate Number(1)
Mango 0
Orange 0
Mango 0
What I have to do is update the IsDuplicate column to 1 for the first row of each distinct Fruit_Name, i.e.
Fruit_Name(varchar2(10)) | IsDuplicate Number(1)
Mango 1
Orange 1
Mango 0
How should I do this?
This should do it as far as I can tell
update fruits
set IsDuplicate =
(
select case
when dupe_count > 1 and row_num = 1 then 1
else 0
end as is_dupe
from (
select f2.fruit_name,
count(*) over (partition by f2.fruit_name) as dupe_count,
row_number() over (partition by f2.fruit_name order by f2.fruit_name) as row_num,
rowid as row_id
from fruits f2
) ft
where ft.row_id = fruits.rowid
and ft.fruit_name = fruits.fruit_name
)
Edit
But instead of actually updating the table, why don't you create a view that returns the information. Depending on the size of the table it might be more efficient.
create view fruit_dupe_view
as
select fruit_name,
case
when dupe_count > 1 and row_num = 1 then 1
else 0
end as is_duplicate
from (
select fruit_name,
count(*) over (partition by fruit_name) as dupe_count,
row_number() over (partition by fruit_name order by fruit_name) as row_num
from fruits
) ft
Straight and simple -- you can't. Not with vanilla SQL. SQL is a set-based processing language, and you do things in sets. There is no way for SQL to know which one of your many Mangos should be tagged 1. You can probably tag one of them with 1 using windowing functions or ROWNUM etc. in a SELECT, but I don't think it can be done with an UPDATE.
In other words, your table lacks a unique key in the first place, so it is not something that SQL is designed to process.
However, you may try adding a sequential primary key to each row. Then you can easily write an UPDATE query to set to 1 all the rows with COUNT > 1 and key = MIN(key).
In other words, you really have to look at your database design. Relational databases are not supposed to contain "duplicates". The fact that you need to mark something as a duplicate means that your tables are designed wrong in the first place. The database should not even allow duplicates to enter its data.
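The sequential-key suggestion above can be sketched roughly as follows, using Python's built-in sqlite3 module instead of Oracle. The column names fruit_key and is_duplicate are made up for this example, and the rule implemented is the one described in this answer (COUNT > 1 and key = MIN(key)).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# fruit_key is a hypothetical surrogate key; SQLite fills it in sequentially.
conn.execute(
    "CREATE TABLE fruits ("
    "fruit_key INTEGER PRIMARY KEY, fruit_name TEXT, is_duplicate INTEGER DEFAULT 0)"
)
conn.executemany(
    "INSERT INTO fruits (fruit_name) VALUES (?)",
    [("Mango",), ("Orange",), ("Mango",)],
)

# Flag the row with the smallest key in every group that has more than one row.
conn.execute("""
UPDATE fruits
SET is_duplicate = 1
WHERE fruit_key IN (
    SELECT MIN(fruit_key)
    FROM fruits
    GROUP BY fruit_name
    HAVING COUNT(*) > 1
)
""")

flags = conn.execute(
    "SELECT fruit_name, is_duplicate FROM fruits ORDER BY fruit_key"
).fetchall()
print(flags)
```

With this rule Orange stays at 0 because it occurs only once. Dropping the HAVING clause would flag the first row of every distinct name, which matches the output shown in the question.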