SQL: Dedupe table data and manipulate merged data

I have an SQL table with:
Id INT, Name NVARCHAR(MAX), OldName NVARCHAR(MAX)
There are multiple duplicates in the name column.
I would like to remove these duplicates, keeping only one master copy of 'Name'. When the dedupe happens I want to concatenate the old names into the OldName field.
E.G:
Dave | Steve
Dave | Will
Would become
Dave | Steve, Will
After merging.
I know how to de-dupe data using something like:
with x as (select *,rn = row_number()
over(PARTITION BY OrderNo,item order by OrderNo)
from #temp1)
select * from x
where rn > 1
But not sure how to update the new 'master' record whilst I am at it.

This is really too complicated to do in a single update, because you need to update and delete rows.
select n.name,
stuff((select ',' + t2.oldname
from sqltable t2
where t2.name = n.name
for xml path (''), type
).value('/', 'nvarchar(max)'
), 1, 1, '') as oldnames
into _temp
from (select distinct name from sqltable) n;
truncate table sqltable;
insert into sqltable(name, oldname)
select name, oldnames
from _temp;
Of course, test, test, test before deleting the old table (copy it for safe keeping). Note that _temp is a regular table, not a temporary table. That way, if something happens -- like a server reboot -- before the insert is finished, you still have all the data.
Your question doesn't specify what to do with the id column. You can add min(id) or max(id) to the _temp if you want to use one of those values.
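As a rough sketch of carrying the id along: on SQL Server 2017 or later, STRING_AGG could stand in for the FOR XML PATH construct used above. This is a swapped-in technique, not what the answer itself uses. The table and column names follow the question, and keeping min(Id) as the surviving id is an assumption.

```sql
-- Sketch only: assumes SQL Server 2017+ (STRING_AGG) and the question's
-- table, called sqltable(Id, Name, OldName) as in the answer above.
select min(t.Id) as Id,        -- assumption: the lowest Id survives as the master row
       t.Name,
       string_agg(t.OldName, ', ') within group (order by t.Id) as OldNames
into _temp                     -- regular table, so it survives a restart
from sqltable t
group by t.Name;
```

If Id is an identity column, re-inserting these explicit values would also need SET IDENTITY_INSERT sqltable ON around the insert.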

Related

Append values from 2 different columns in SQL

I have the following table
I need to get the output "SVGFRAMXPOSLSVG" from the 2 columns.
Is it possible to get these appended values from the 2 columns?
Please try this.
SELECT STUFF((
SELECT '' + DEPART_AIRPORT_CODE + ARRIVE_AIRPORT_CODE
FROM #tblName
FOR XML PATH('')
), 1, 0, '')
For example:
DECLARE @tbl TABLE(
id INT,
DEPART_AIRPORT_CODE VARCHAR(50),
ARRIVE_AIRPORT_CODE VARCHAR(50),
value VARCHAR(50)
)
INSERT INTO @tbl VALUES(1,'g1','g2',NULL)
INSERT INTO @tbl VALUES(2,'g2','g3',NULL)
INSERT INTO @tbl VALUES(3,'g3','g1',NULL)
SELECT STUFF((
SELECT '' + DEPART_AIRPORT_CODE + ARRIVE_AIRPORT_CODE
FROM @tbl
FOR XML PATH('')
), 1, 0, '')
Summary
Use Analytic functions and listagg to get the job done.
Detail
Create two lists of code_id and code values. Match the code_id values for the same airport codes (passengers depart from the same airport they just arrived at). Use lag and lead to grab values from other rows. NULLs will exist for code_id at the start and end of the itinerary. Default the first NULL to 0, and the last NULL to the previous code_id plus 1. This produces a list of codes with a matching index. Merge the lists together and remove duplicates by using a union. Finally, use listagg with no delimiter to aggregate the rows into a single string value.
with codes as
(
select
nvl(lag(t1.id) over (order by t1.id),0) as code_id,
t1.depart_airport_code as code
from table1 t1
union
select
nvl(lead(t1.id) over (order by t1.id)-1,lag(t1.id) over (order by t1.id)+1) as code_id,
t1.arrive_airport_code as code
from table1 t1
)
select
listagg(c.code,'') WITHIN GROUP (ORDER BY c.code_id) as result
from codes c;
Note: This solution does rely on an integer id field being available. Otherwise the analytic functions wouldn't have a column to sort by. If id doesn't exist, then you would need to manufacture one based on another column, such as a timestamp or another identifier that ensures the rows are in the correct order.
Use row_number() over (order by myorderedidentifier) as id in a subquery or view to achieve this. Don't use rownum. It could give you unpredictable results. Without an ORDER BY clause, there is no guarantee that the same query will return the same results each time.
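A minimal sketch of manufacturing such an id in Oracle, assuming a hypothetical departure_time column that puts the legs in itinerary order (that column is not in the original question):

```sql
-- Sketch only: departure_time is a hypothetical ordering column.
with numbered as
(
  select
    row_number() over (order by t.departure_time) as id,
    t.depart_airport_code,
    t.arrive_airport_code
  from table1 t
)
select n.id, n.depart_airport_code, n.arrive_airport_code
from numbered n
order by n.id;
```

The numbered subquery could then take the place of table1 in the codes query above.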
Output
| RESULT |
|-----------------|
| SVGFRAMXPOSLSVG |

Return a XML field when using GROUP BY clause in MS SQL Management Studio?

I have the following table structure (partially excluded for clarity of question):
The table sometimes receives two lowFareRQ and lowFareRS rows that are considered to be only one booking under BookingNumber. The booking is then processed into a ticket, where each booking number always has the same TicketRQ and TicketRS if the user proceeded with the booking. TicketRS contains the 3rd party reference number.
I now want to display all the active bookings to the user in order to allow the user to cancel a booking if he wanted to.
So I would naturally want to retrieve each booking number with active status, as well as the TicketRS XML data, in order to get the 3rd party reference number.
Here is the SQL query I started with:
SELECT TOP 100
[BookingNumber]
,[Status]
,[TicketRS]
FROM [VTResDB].[dbo].[LowFareRS]
GROUP BY [BookingNumber],[Status],[TicketRS]
ORDER BY [recID] desc
Now with MS SQL Management Studio you have to add the field [TicketRS] to 'GROUP BY' if you want to have it in the 'SELECT' field list... but you cannot have a XML field in the 'GROUP BY' list.
The XML data type cannot be compared or sorted, except when using the IS NULL operator.
I know that if I change the table structure this problem can be solved without any issue but I want to avoid changing the table structure because I am just completing the software and do not want to rewrite existing code.
Is there a way to return a XML field when using GROUP BY clause in MS SQL Management Studio?
Uhm, this seems dirty... If your XMLs are identical within the group, you might try something like this:
DECLARE @tbl TABLE(ID INT IDENTITY,Col1 VARCHAR(100),SomeValue INT,SomeXML XML);
INSERT INTO @tbl(col1,SomeValue,SomeXML) VALUES
('testA',1,'<root><a>testA</a></root>')
,('testA',2,'<root><a>testA</a></root>')
,('testB',3,'<root><a>testB</a></root>')
,('testB',4,'<root><a>testB</a></root>');
WITH GroupedSource AS
(
SELECT SUM(SomeValue) AS SumColumn
,CAST(SomeXml AS NVARCHAR(MAX)) AS XmlColumn
FROM @tbl AS tbl
GROUP BY Col1,CAST(SomeXml AS NVARCHAR(MAX))
)
SELECT SumColumn
,CAST(XmlColumn AS XML) AS ReCasted
FROM GroupedSource;
Another approach is this:
WITH GroupedSource AS
(
SELECT SUM(SomeValue) AS SumColumn
,MIN(ID) AS FirstID
FROM @tbl AS tbl
GROUP BY Col1
)
SELECT SumColumn
,(SELECT SomeXML FROM @tbl WHERE ID=FirstID) AS ReTaken
FROM GroupedSource
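Applied to the table from the question, the second approach might look roughly like the following. This is a sketch under the assumption that recID is unique per row and that taking the lowest recID per booking is acceptable; neither is stated in the question.

```sql
-- Sketch only: assumes recID uniquely identifies a row in LowFareRS.
WITH FirstPerBooking AS
(
    SELECT BookingNumber, [Status], MIN(recID) AS FirstRecID
    FROM [VTResDB].[dbo].[LowFareRS]
    GROUP BY BookingNumber, [Status]   -- the XML column is left out of the grouping
)
SELECT f.BookingNumber, f.[Status], r.TicketRS
FROM FirstPerBooking AS f
JOIN [VTResDB].[dbo].[LowFareRS] AS r
    ON r.recID = f.FirstRecID          -- fetch the XML from one representative row
ORDER BY f.FirstRecID DESC;
```

This avoids casting the XML to NVARCHAR(MAX) and back, at the cost of a join back to the table.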
Cast it to nvarchar(max) and back
with t(xc,val) as (
select xc=cast(N'<x><y>txt</y></x>' as xml), val = 5
union all
select xc=cast(N'<x><y>txt</y></x>' as xml), val = 6
)
select xc = cast(xc as XML), val
from (
select xc = cast(xc as nvarchar(max)), val = sum(val)
from t
group by cast(xc as nvarchar(max))
) tt
;

How to determine what fields were updated in an update trigger

UPDATE: Using COLUMNS_UPDATED() is not an answer to this question, as the order of the fields may change, which would break the trigger (COLUMNS_UPDATED() depends on the column order).
UPDATE 2: I already know that the Deleted and Inserted tables hold the data. The question is how to determine what has changed without having to hard code the field names, as the field names may change or fields may be added.
Let's say I have a table with three fields.
The row already exists, and now the user updates fields 1 and 2.
How do I determine, in the update trigger, which fields were updated, and what the before and after values were?
I then want to log these to a log table. If two fields were updated, it should result in two rows in the history table.
Table
Id  intField1  charField2  dateField3
7   3          Fred        1995-03-05
Updated To
7   3          Freddy      1995-05-06
History Table
_____________
Id  IdOfRowThatWasUpdated  BeforeValue  AfterValue (as string)
1   7                      Fred         Freddy
2   7                      1995-03-05   1995-05-06
I know I can use the Deleted table to get the old values, and the Inserted table to get the new values. The question, however, is how to do this dynamically. In other words, the actual table has 50 columns, and I don't want to hard code 50 fields into a SQL statement; if the fields change, I also don't want to have to worry about keeping the SQL in sync with table changes.
Greg
You can use one of my favorite XML tricks to do this:
create trigger utr_Table1_update on Table1
after update, insert, delete
as
begin
with cte_inserted as (
select id, (select t.* for xml raw('row'), type) as data
from inserted as t
), cte_deleted as (
select id, (select t.* for xml raw('row'), type) as data
from deleted as t
), cte_i as (
select
c.ID,
t.c.value('local-name(.)', 'nvarchar(128)') as Name,
t.c.value('.', 'nvarchar(max)') as Value
from cte_inserted as c
outer apply c.Data.nodes('row/@*') as t(c)
), cte_d as (
select
c.ID,
t.c.value('local-name(.)', 'nvarchar(128)') as Name,
t.c.value('.', 'nvarchar(max)') as Value
from cte_deleted as c
outer apply c.Data.nodes('row/@*') as t(c)
)
insert into Table1_History (ID, Name, OldValue, NewValue)
select
isnull(i.ID, d.ID) as ID,
isnull(i.Name, d.Name) as Name,
d.Value,
i.Value
from cte_i as i
full outer join cte_d as d on d.ID = i.ID and d.Name = i.Name
where
not exists (select i.value intersect select d.value)
end;
sql fiddle demo
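To try the trigger, something along these lines could be used. The history table definition is an assumption inferred from the trigger's INSERT column list, and the column names charField2 and dateField3 are taken from the question's example table.

```sql
-- Hypothetical history table matching the trigger's insert column list.
create table Table1_History (
    ID       int,
    Name     nvarchar(128),   -- name of the column that changed
    OldValue nvarchar(max),
    NewValue nvarchar(max)
);

-- With the trigger in place, change two columns of one row ...
update Table1
set charField2 = 'Freddy',
    dateField3 = '1995-05-06'
where id = 7;

-- ... and inspect what was logged for that row.
select ID, Name, OldValue, NewValue
from Table1_History
where ID = 7;
```

Because the trigger compares the attribute values column by column, unchanged columns should not appear in the history.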
In this post:
How to refer to "New", "Old" row for Triggers in SQL server?
It is mentioned that/how you can access the original and the new values, and if you can access, you can compare them.
"INSERTED is the new row on INSERT/UPDATE. DELETED is the deleted row on DELETE and the updated row on UPDATE (i.e. the old values before the row was updated)"

PostgreSQL update query

I need to update a table in my database. For the sake of simplicity, let's assume that the table's name is tab and it has 2 columns: id (PRIMARY KEY, NOT NULL) and col (UNIQUE VARCHAR(300)). I need to update the table this way:
id col
----------------------------------------------------
1 'One two three'
2 'One twothree'
3 'One two three'
4 'Remove white spaces'
5 'Something'
6 'Remove whitespaces '
to:
id col
----------------------------------------------------
1 'Onetwothree'
2 'Removewhitespaces'
3 'Something'
Id numbers and order of the rows after update is not important and can be different. I use PostgreSQL. Some of the columns are FOREIGN KEYs. That's why dropping UNIQUE constraint from col would be troublesome.
I think just using replace in this format will do what you want.
update tab
set col = replace(col, ' ', '');
Here's a SQLFiddle for it.
You shouldn't be using the non-descriptive column name id, even if some half-wit ORMs are in the habit of doing that. I use tab_id instead for this demo.
I interpret your description this way: You have other tables with FK columns pointing to tab.col. Like table child1 in my example below.
To clean up the mess, do all of this in a single session to preserve the temporary table I use. Better yet, do it all in a single transaction.
Update all referencing tables so that all referencing rows point to the "first" row (unambiguously, however you define that) in each set of going-to-be duplicates in tab.
Create a translation table up to be used for all updates:
CREATE TEMP TABLE up AS
WITH t AS (
SELECT tab_id, col, replace(col, ' ', '') AS col1
,row_number() OVER (PARTITION BY replace(col, ' ', '')
ORDER BY tab_id) AS rn
FROM tab
)
SELECT b.col AS old_col, a.col AS new_col
FROM (SELECT * FROM t WHERE rn = 1) a
JOIN (SELECT * FROM t WHERE rn > 1) b USING (col1);
Then update all your referencing tables.
UPDATE child1 c
SET col = up.new_col
FROM up
WHERE c.col = up.old_col;
-- more tables?
Now, all references point to the "first" in a group of dupes, and you have got your license to kill the rest.
Remove duplicate rows except the first from tab.
DELETE FROM tab t
USING up
WHERE t.col = up.old_col;
Be sure that all referencing FK constraints have the ON UPDATE CASCADE clause.
ALTER TABLE child1 DROP CONSTRAINT child1_col_fkey;
ALTER TABLE child1 ADD CONSTRAINT child1_col_fkey FOREIGN KEY (col)
REFERENCES tab (col)
ON UPDATE CASCADE;
-- more tables?
Sanitize your values by removing white space
UPDATE tab
SET col = replace(col, ' ', '');
This only takes care of good old space characters (ASCII value 32, Unicode U+0020). Do you have others?
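If other whitespace characters (tabs, newlines) might be present, one option is regexp_replace with the \s class and the 'g' flag. This is a sketch that assumes the default standard_conforming_strings = on, so the backslash reaches the regex engine unchanged; the same expression would also have to be used when building the translation table up.

```sql
-- Check first which rows contain any whitespace character.
SELECT tab_id, col
FROM   tab
WHERE  col ~ '\s';

-- Strip every run of whitespace; 'g' replaces all matches, not only the first.
UPDATE tab
SET    col = regexp_replace(col, '\s+', '', 'g')
WHERE  col ~ '\s';
```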
All FK constraints should be pointing to tab.tab_id to begin with. Your tables would be smaller and faster and all of this would be easier.
I solved it much more easily than Erwin. I don't have SQL on my computer to test it, but something like this worked for me:
DELETE FROM tab WHERE id IN (
SELECT id FROM (
-- the 'g' flag replaces every match, not only the first
SELECT id, row_number() OVER (PARTITION BY regexp_replace(col, '[ \t\n]+', '', 'g') ORDER BY id) AS c
FROM tab
) AS numbered
WHERE c > 1
);
UPDATE tab SET col = regexp_replace(col, '[ \t\n]+', '', 'g');

Is it possible to concatenate column values into a string using CTE?

Say I have the following table:
id|myId|Name
-------------
1 | 3 |Bob
2 | 3 |Chet
3 | 3 |Dave
4 | 4 |Jim
5 | 4 |Jose
-------------
Is it possible to use a recursive CTE to generate the following output:
3 | Bob, Chet, Dave
4 | Jim, Jose
I've played around with it a bit but haven't been able to get it working. Would I do better using a different technique?
I do not recommend this, but I managed to work it out.
Table:
CREATE TABLE [dbo].[names](
[id] [int] NULL,
[myId] [int] NULL,
[name] [char](25) NULL
) ON [PRIMARY]
Data:
INSERT INTO names values (1,3,'Bob')
INSERT INTO names values (2,3,'Chet')
INSERT INTO names values (3,3,'Dave')
INSERT INTO names values (4,4,'Jim')
INSERT INTO names values (5,4,'Jose')
INSERT INTO names values (6,5,'Nick')
Query:
WITH CTE (id, myId, Name, NameCount)
AS (SELECT id,
myId,
Cast(Name AS VARCHAR(225)) Name,
1 NameCount
FROM (SELECT Row_number() OVER (PARTITION BY myId ORDER BY myId) AS id,
myId,
Name
FROM names) e
WHERE id = 1
UNION ALL
SELECT e1.id,
e1.myId,
Cast(Rtrim(CTE.Name) + ',' + e1.Name AS VARCHAR(225)) AS Name,
CTE.NameCount + 1 NameCount
FROM CTE
INNER JOIN (SELECT Row_number() OVER (PARTITION BY myId ORDER BY myId) AS id,
myId,
Name
FROM names) e1
ON e1.id = CTE.id + 1
AND e1.myId = CTE.myId)
SELECT myID,
Name
FROM (SELECT myID,
Name,
(Row_number() OVER (PARTITION BY myId ORDER BY namecount DESC)) AS id
FROM CTE) AS p
WHERE id = 1
As requested, here is the XML method:
SELECT myId,
STUFF((SELECT ',' + rtrim(convert(char(50),Name))
FROM namestable b
WHERE a.myId = b.myId
FOR XML PATH('')),1,1,'') Names
FROM namestable a
GROUP BY myId
A CTE is just a glorified derived table with some extra features (like recursion). The question is, can you use recursion to do this? Probably, but it's using a screwdriver to pound in a nail. The nice part about doing the XML path (seen in the first answer) is it will combine grouping the MyId column with string concatenation.
How would you concatenate a list of strings using a CTE? I don't think that's its purpose.
A CTE is just a temporarily-created relation (tables and views are both relations) which only exists for the "life" of the current query.
I've played with the CTE names and the field names. I really don't like reusing field names like id in multiple places; I tend to think those get confusing. And since the only use for names.id is as an ORDER BY in the first ROW_NUMBER() statement, I don't reuse it going forward.
WITH namesNumbered as (
select myId, Name,
ROW_NUMBER() OVER (
PARTITION BY myId
ORDER BY id
) as nameNum
FROM names
)
, namesJoined(myId, Name, nameCount) as (
SELECT myId,
Cast(Name AS VARCHAR(225)),
1
FROM namesNumbered nn1
WHERE nameNum = 1
UNION ALL
SELECT nn2.myId,
Cast(
Rtrim(nj.Name) + ',' + nn2.Name
AS VARCHAR(225)
),
nj.nameCount + 1
FROM namesJoined nj
INNER JOIN namesNumbered nn2 ON nn2.myId = nj.myId
and nn2.nameNum = nj.nameCount + 1
)
SELECT myId, Name
FROM (
SELECT myID, Name,
ROW_NUMBER() OVER (
PARTITION BY myId
ORDER BY nameCount DESC
) AS finalSort
FROM namesJoined
) AS tmp
WHERE finalSort = 1
The first CTE, namesNumbered, returns two fields we care about and a sorting value; we can't just use names.id for this because we need, for each myId value, to have values of 1, 2, .... names.id will have 1, 2 ... for the first myId value but it will have a higher starting value for subsequent myId values.
The second CTE, namesJoined, has to have the field names specified in the CTE signature because it will be recursive. The base case (the part before UNION ALL) gives us records where nameNum = 1. We have to CAST() the Name field because it will grow with subsequent passes; we need to ensure that we CAST() it large enough to handle any of the outputs; we can always RTRIM() it later, if needed. We don't have to specify aliases for the fields because the CTE signature provides those. The recursive case (after the UNION ALL) joins the rows produced by the previous pass of namesJoined to namesNumbered, ensuring that subsequent passes use ever-higher nameNum values. We need to RTRIM() the prior iterations of Name, then add the comma and the new Name. The result is CAST() to the same VARCHAR(225) as the base case, because the anchor and recursive parts must return matching types.
The final query grabs only the fields we care about (myId, Name) and, within the subquery, pointedly re-sorts the records so that the highest namesJoined.nameCount value will get a 1 as the finalSort value. Then, we tell the WHERE clause to only give us this one record (for each myId value).
Yes, I aliased the subquery as tmp, which is about as generic as you can get. Most SQL engines require that you give a subquery an alias, even if it's the only relation visible at that point.