Is it possible to concatenate column values into a string using CTE? - sql

Say I have the following table:
id|myId|Name
-------------
1 | 3 |Bob
2 | 3 |Chet
3 | 3 |Dave
4 | 4 |Jim
5 | 4 |Jose
-------------
Is it possible to use a recursive CTE to generate the following output:
3 | Bob, Chet, Dave
4 | Jim, Jose
I've played around with it a bit but haven't been able to get it working. Would I do better using a different technique?

I do not recommend this, but I managed to work it out.
Table:
CREATE TABLE [dbo].[names](
[id] [int] NULL,
[myId] [int] NULL,
[name] [char](25) NULL
) ON [PRIMARY]
Data:
INSERT INTO names values (1,3,'Bob')
INSERT INTO names values (2,3,'Chet')
INSERT INTO names values (3,3,'Dave')
INSERT INTO names values (4,4,'Jim')
INSERT INTO names values (5,4,'Jose')
INSERT INTO names values 6,5,'Nick')
Query:
WITH CTE (id, myId, Name, NameCount)
AS (SELECT id,
myId,
Cast(Name AS VARCHAR(225)) Name,
1 NameCount
FROM (SELECT Row_number() OVER (PARTITION BY myId ORDER BY myId) AS id,
myId,
Name
FROM names) e
WHERE id = 1
UNION ALL
SELECT e1.id,
e1.myId,
Cast(Rtrim(CTE.Name) + ',' + e1.Name AS VARCHAR(225)) AS Name,
CTE.NameCount + 1 NameCount
FROM CTE
INNER JOIN (SELECT Row_number() OVER (PARTITION BY myId ORDER BY myId) AS id,
myId,
Name
FROM names) e1
ON e1.id = CTE.id + 1
AND e1.myId = CTE.myId)
SELECT myID,
Name
FROM (SELECT myID,
Name,
(Row_number() OVER (PARTITION BY myId ORDER BY namecount DESC)) AS id
FROM CTE) AS p
WHERE id = 1
As requested, here is the XML method:
SELECT myId,
STUFF((SELECT ',' + rtrim(convert(char(50),Name))
FROM names b
WHERE a.myId = b.myId
FOR XML PATH('')),1,1,'') Names
FROM names a
GROUP BY myId
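If you are on SQL Server 2017 or later, STRING_AGG gives the same grouped concatenation without the XML trick; a minimal sketch against the names table defined above:
SELECT myId,
STRING_AGG(RTRIM(Name), ', ') WITHIN GROUP (ORDER BY id) AS Names
FROM names
GROUP BY myId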

A CTE is just a glorified derived table with some extra features (like recursion). The question is, can you use recursion to do this? Probably, but it's using a screwdriver to pound in a nail. The nice part about doing the XML path (seen in the first answer) is it will combine grouping the MyId column with string concatenation.
How would you concatenate a list of strings using a CTE? I don't think that's its purpose.

A CTE is just a temporarily-created relation (tables and views are both relations) which only exists for the "life" of the current query.
I've played with the CTE names and the field names. I really don't like reusing field names like id in multiple places; I tend to think those get confusing. And since the only use for names.id is as an ORDER BY in the first ROW_NUMBER() statement, I don't reuse it going forward.
WITH namesNumbered as (
select myId, Name,
ROW_NUMBER() OVER (
PARTITION BY myId
ORDER BY id
) as nameNum
FROM names
)
, namesJoined(myId, Name, nameCount) as (
SELECT myId,
Cast(Name AS VARCHAR(225)),
1
FROM namesNumbered nn1
WHERE nameNum = 1
UNION ALL
SELECT nn2.myId,
Cast(
Rtrim(nj.Name) + ',' + nn2.Name
AS VARCHAR(225)
),
nn2.nameNum
FROM namesJoined nj
INNER JOIN namesNumbered nn2 ON nn2.myId = nj.myId
and nn2.nameNum = nj.nameCount + 1
)
SELECT myId, Name
FROM (
SELECT myID, Name,
ROW_NUMBER() OVER (
PARTITION BY myId
ORDER BY nameCount DESC
) AS finalSort
FROM namesJoined
) AS tmp
WHERE finalSort = 1
The first CTE, namesNumbered, returns the two fields we care about and a sorting value; we can't just use names.id for this because we need, for each myId value, to have values of 1, 2, .... names.id will have 1, 2 ... for the first myId value but a higher starting value for each subsequent myId value.
The second CTE, namesJoined, has to have the field names specified in the CTE signature because it will be recursive. The base case (part before UNION ALL) gives us records where nameNum = 1. We have to CAST() the Name field because it will grow with subsequent passes; we need to ensure that we CAST() it large enough to handle any of the outputs; we can always TRIM() it later, if needed. We don't have to specify aliases for the fields because the CTE signature provides those. The recursive case (after the UNION ALL) joins the current CTE with the prior one, ensuring that subsequent passes use ever-higher nameNum values. We need to TRIM() the prior iterations of Name, then add the comma and the new Name. The result will be, implicitly, CAST()ed to a larger field.
The final query grabs only the fields we care about (myId, Name) and, within the subquery, pointedly re-sorts the records so that the highest namesJoined.nameCount value will get a 1 as the finalSort value. Then, we tell the WHERE clause to only give us this one record (for each myId value).
Yes, I aliased the subquery as tmp, which is about as generic as you can get. Most SQL engines require that you give a subquery an alias, even if it's the only relation visible at that point.

Related

Display SUM of 2 listed/GROUP BY values using WHERE condition

I want to add the values of two columns displayed and display as 1 column name.
This is the output I'm getting,
ID Total
Apple 10
RawApple 10
Mango 10
RawMango 10
I want the output as
ID Total
Apples 20
Mangoes 20
If the issue is removing the first three characters -- if they are "Raw" -- then you can do:
select (case when id like 'Raw%' then stuff(id, 1, 3, '') else id end) as id,
sum(total)
from t
group by (case when id like 'Raw%' then stuff(id, 1, 3, '') else id end);
If you want to replace specific values with other values, I would suggest an in-query lookup table:
select coalesce(v.new_id, t.id) as id, sum(total)
from t left join
(values ('RawApple', 'Apple'),
('RawMango', 'Mango')
) v(id, new_id)
on t.id = v.id
group by coalesce(v.new_id, t.id);
If we can assume that the name of the fruit is after the prefix, and the prefix ends with a hyphen (-), then we can use STUFF to remove the prefix and then aggregate:
WITH VTE AS(
SELECT *
FROM (VALUES('Apple',10),
('Raw-Apple',10),
('Mango',10),
('Raw-Mango',10))V(ID,Total))
SELECT S.ID,
SUM(V.Total) AS Total
FROM VTE V
CROSS APPLY(VALUES(STUFF(V.ID,1,CHARINDEX('-',V.ID),'')))S(ID)
GROUP BY S.ID;
Note I don't change the names of the fruits to the plural, as the plural form depends on the fruit. You'll need a dictionary table to store the plural of each fruit and then JOIN to that (a sketch of that join follows the table definition below). So a table that looks like this:
CREATE TABLE dbo.FruitPlural (Fruit varchar(20), Plural varchar(20));
INSERT INTO dbo.FruitPlural
VALUES ('Apple','Apples'),
('Mango','Mangoes'),
('Strawberry','Strawberries'),
...;
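A minimal sketch of that join, reusing the VTE sample data and STUFF expression from the query above; COALESCE falls back to the original name when no plural is stored for a fruit:
WITH VTE AS(
SELECT *
FROM (VALUES('Apple',10),
('Raw-Apple',10),
('Mango',10),
('Raw-Mango',10))V(ID,Total))
SELECT COALESCE(FP.Plural, S.ID) AS ID,
SUM(V.Total) AS Total
FROM VTE V
CROSS APPLY(VALUES(STUFF(V.ID,1,CHARINDEX('-',V.ID),'')))S(ID)
LEFT JOIN dbo.FruitPlural FP
ON FP.Fruit = S.ID
GROUP BY COALESCE(FP.Plural, S.ID);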
Note: this answer was invalidated when the OP moved the goal posts (the sample data was not representative of their actual data); however, I am leaving it here as it may help future users.

Append values from 2 different columns in SQL

I have the following table
I need to get the following output, "SVGFRAMXPOSLSVG", from the 2 columns.
Is it possible to get these appended values from the 2 columns?
Please try this.
SELECT STUFF((
SELECT '' + DEPART_AIRPORT_CODE + ARRIVE_AIRPORT_CODE
FROM #tblName
FOR XML PATH('')
), 1, 0, '')
For Example:-
Create Table #tbl(
id INT ,
DEPART_AIRPORT_CODE Varchar(50),
ARRIVE_AIRPORT_CODE Varchar(50),
value varchar(50)
)
INSERT INTO #tbl VALUES(1,'g1','g2',NULL)
INSERT INTO #tbl VALUES(2,'g2','g3',NULL)
INSERT INTO #tbl VALUES(3,'g3','g1',NULL)
SELECT STUFF((
SELECT '' + DEPART_AIRPORT_CODE + ARRIVE_AIRPORT_CODE
FROM #tbl
FOR XML PATH('')
), 1, 0, '')
Summary
Use Analytic functions and listagg to get the job done.
Detail
Create two lists of code_id and code values. Match the code_id values for the same airport codes (passengers depart from the same airport they just arrived at). Use lag and lead to grab values from other rows. NULLs will exist for code_id at the start and end of the itinerary, so default the first NULL to 0 and the last NULL to the previous code_id plus 1. This produces a list of codes with a matching index. Merge the lists together and remove duplicates by using a union. Finally, use listagg with no delimiter to aggregate the rows into a single string value.
with codes as
(
select
nvl(lag(t1.id) over (order by t1.id),0) as code_id,
t1.depart_airport_code as code
from table1 t1
union
select
nvl(lead(t1.id) over (order by t1.id)-1,lag(t1.id) over (order by t1.id)+1) as code_id,
t1.arrive_airport_code as code
from table1 t1
)
select
listagg(c.code,'') WITHIN GROUP (ORDER BY c.code_id) as result
from codes c;
Note: This solution does rely on an integer id field being available. Otherwise the analytic functions wouldn't have a column to sort by. If id doesn't exist, then you would need to manufacture one based on another column, such as a timestamp or another identifier that ensures the rows are in the correct order.
Use row_number() over (order by myorderedidentifier) as id in a subquery or view to achieve this (a sketch follows the output below). Don't use rownum; it could give you unpredictable results, because without an ORDER BY clause there is no guarantee that the same query will return the rows in the same order each time.
Output
| RESULT |
|-----------------|
| SVGFRAMXPOSLSVG |
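As mentioned above, if no integer id exists you can manufacture one with row_number(). A minimal sketch (Oracle syntax; the source table flight_legs and its ordering column leg_timestamp are hypothetical stand-ins for whatever guarantees itinerary order):
create or replace view table1 as
select row_number() over (order by leg_timestamp) as id,
depart_airport_code,
arrive_airport_code
from flight_legs;
The listagg query above can then run against this view unchanged.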

Remove duplicated subsets from very large table

The data I'm working with is fairly complicated, so I'm just going to provide a simpler example so I can hopefully expand that out to what I'm working on.
Note: I've already found a way to do it, but it's extremely slow and not scalable. It works great on small datasets, but if I applied it to the actual tables it needs to run on, it would take forever.
I need to remove entire duplicate subsets of data within a table. Removing duplicate rows is easy, but I'm stuck finding an efficient way to remove duplicate subsets.
Example:
GroupID Subset Value
------- ---- ----
1 a 1
1 a 2
1 a 3
1 b 1
1 b 3
1 b 5
1 c 1
1 c 3
1 c 5
2 a 1
2 a 2
2 a 3
2 b 4
2 b 5
2 b 6
2 c 1
2 c 3
2 c 6
So in this example, from GroupID 1, I would need to remove either subset 'b' or subset 'c', doesn't matter which since both contain Values 1,2,3. For GroupID 2, none of the sets are duplicated, so none are removed.
Here's the code I used to solve this on a small scale. It works great, but when applied to 10+ million records you can imagine how slow it would be (I was later informed of the actual record count; the sample data I was given was much smaller):
CREATE TABLE #values (GroupID INT NOT NULL, SubSet VARCHAR(1) NOT NULL, [Value] INT NOT NULL)
INSERT INTO #values (GroupID, SubSet, [Value])
VALUES (1,'a',1),(1,'a',2),(1,'a',3) ,(1,'b',1),(1,'b',3),(1,'b',5) ,(1,'c',1),(1,'c',3),(1,'c',5),
(2,'a',1),(2,'a',2),(2,'a',3) ,(2,'b',2),(2,'b',4),(2,'b',6) ,(2,'c',1),(2,'c',3),(2,'c',6)
SELECT *
FROM #values v
ORDER BY v.GroupID, v.SubSet, v.[Value]
SELECT x.GroupID, x.NameValues, MIN(x.SubSet)
FROM (
SELECT t1.GroupID, t1.SubSet
, NameValues = (SELECT ',' + CONVERT(VARCHAR(10), t2.[Value]) FROM #values t2 WHERE t1.GroupID = t2.GroupID AND t1.SubSet = t2.SubSet ORDER BY t2.[Value] FOR XML PATH(''))
FROM #values t1
GROUP BY t1.GroupID, t1.SubSet
) x
GROUP BY x.GroupID, x.NameValues
All I'm doing here is grouping by GroupID and Subset and concatenating all of the values into a comma delimited string...and then taking that and grouping on GroupID and Value list, and taking the MIN subset.
I'd go with something like this:
;with cte as
(
select v.GroupID, v.SubSet, checksum_agg(v.Value) h, avg(v.Value) a
from #values v
group by v.GroupID, v.SubSet
)
delete v
from #values v
join
(
select c1.GroupID, case when c1.SubSet > c2.SubSet then c1.SubSet else c2.SubSet end SubSet
from cte c1
join cte c2 on c1.GroupID = c2.GroupID and c1.SubSet <> c2.SubSet and c1.h = c2.h and c1.a = c2.a
)x on v.GroupID = x.GroupID and v.SubSet = x.SubSet
select *
from #values
From Checksum_Agg:
The CHECKSUM_AGG result does not depend on the order of the rows in
the table.
This is because it behaves like a sum of the values: 1 + 2 + 3 = 3 + 2 + 1 = 6. Note that a different set, such as 3 + 3, also gives 6, so two different subsets can produce the same checksum; that is why the query above also compares the AVG.
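A quick way to see the order-independence for yourself; both of these queries return the same CHECKSUM_AGG value:
SELECT CHECKSUM_AGG(v) AS cs FROM (VALUES (1),(2),(3)) t(v);
SELECT CHECKSUM_AGG(v) AS cs FROM (VALUES (3),(2),(1)) t(v);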
HashBytes is designed to produce a different value for two inputs that differ only in the order of the bytes, as well as other differences. (There is a small possibility that two inputs, perhaps of wildly different lengths, could hash to the same value. You can't take an arbitrary input and squeeze it down to an absolutely unique 16-byte value.)
The following code demonstrates how to use HashBytes to return a hash for each GroupId/Subset.
-- Thanks for the sample data!
CREATE TABLE #values (GroupID INT NOT NULL, SubSet VARCHAR(1) NOT NULL, [Value] INT NOT NULL)
INSERT INTO #values (GroupID, SubSet, [Value])
VALUES (1,'a',1),(1,'a',2),(1,'a',3) ,(1,'b',1),(1,'b',3),(1,'b',5) ,(1,'c',1),(1,'c',3),(1,'c',5),
(2,'a',1),(2,'a',2),(2,'a',3) ,(2,'b',2),(2,'b',4),(2,'b',6) ,(2,'c',1),(2,'c',3),(2,'c',6);
SELECT *
FROM #values v
ORDER BY v.GroupID, v.SubSet, v.[Value];
with
DistinctGroups as (
select distinct GroupId, Subset
from #Values ),
GroupConcatenatedValues as (
select GroupId, Subset, Convert( VarBinary(256), (
select Convert( VarChar(8000), Cast( Value as Binary(4) ), 2 ) AS [text()]
from #Values as V
where V.GroupId = DG.GroupId and V.SubSet = DG.SubSet
order by Value
for XML Path('') ), 2 ) as GroupedBinary
from DistinctGroups as DG )
-- To see the intermediate results from the CTE you can use one of the
-- following two queries instead of the last select :
-- select * from DistinctGroups;
-- select * from GroupConcatenatedValues;
select GroupId, Subset, GroupedBinary, HashBytes( 'MD4', GroupedBinary ) as Hash
from GroupConcatenatedValues
order by GroupId, Subset;
You can use checksum_agg() over a set of rows. If the checksums are the same, this is strong evidence that the 'values' columns are equal within the grouped fields.
In the 'getChecksums' cte below, I group by the group and subset, with a checksum based on your 'value' column.
In the 'maybeBadSubsets' cte, I put a row_number over each aggregation just to identify the 2nd+ row in the event the checksums match.
Finally, I delete any subgroups so identified.
with
getChecksums as (
select groupId,
subset,
cs = checksum_agg(value)
from #values v
group by groupId,
subset
),
maybeBadSubsets as (
select groupId,
subset,
cs,
deleteSubset =
case
when row_number() over (
partition by groupId, cs
order by subset
) > 1
then 1
end
from getChecksums
)
delete v
from #values v
where exists (
select 0
from maybeBadSubsets mbs
where v.groupId = mbs.groupId
and v.SubSet = mbs.subset
and mbs.deleteSubset = 1
);
I don't know what the exact likelihood is for checksums to match. If you're not comfortable with the false positive rate, you can still use it to eliminate some branches in a more algorithmic approach in order to vastly improve performance.
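If you want that belt-and-braces behaviour, here is a minimal sketch (assuming the #values table from above): matching checksums only nominate candidate pairs, and an exact set comparison with EXCEPT confirms each pair before anything is deleted.
with getChecksums as (
select GroupID, SubSet, checksum_agg([Value]) as cs
from #values
group by GroupID, SubSet
),
candidatePairs as (
-- checksum matches only nominate pairs; the "smaller" subset name is kept
select c1.GroupID, c1.SubSet as keepSubset, c2.SubSet as dropSubset
from getChecksums c1
join getChecksums c2 on c1.GroupID = c2.GroupID
and c1.cs = c2.cs
and c1.SubSet < c2.SubSet
)
delete v
from #values v
where exists (
select 1
from candidatePairs cp
where cp.GroupID = v.GroupID
and cp.dropSubset = v.SubSet
-- confirm the two subsets hold exactly the same values before deleting
and not exists (
select a.[Value] from #values a
where a.GroupID = cp.GroupID and a.SubSet = cp.keepSubset
except
select b.[Value] from #values b
where b.GroupID = cp.GroupID and b.SubSet = cp.dropSubset
)
and not exists (
select a.[Value] from #values a
where a.GroupID = cp.GroupID and a.SubSet = cp.dropSubset
except
select b.[Value] from #values b
where b.GroupID = cp.GroupID and b.SubSet = cp.keepSubset
)
);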
Note: CTEs can have a quirk performance-wise. If you find that the query engine is running 'maybeBadSubsets' for each row of #values, you may need to put its results into a temp table or table variable before using it. But I believe with 'exists' you're okay as far as that goes.
EDIT:
I didn't catch it, but as the OP noticed, checksum_agg seems to perform very poorly in terms of false hits/misses. I suspect it might be due to the simplicity of the input. I changed
cs = checksum_agg(value)
above to
cs = checksum_agg(convert(int,hashbytes('md5', convert(char(1),value))))
and got better results. But I don't know how it would perform on larger datasets.

Rotate rows into columns with column names not coming from the row

I've looked at some answers but none of them seem to be applicable to me.
Basically I have this result set:
RowNo | Id | OrderNo |
1 101 1
2 101 10
I just want to convert this to
| Id | OrderNo_0 | OrderNo_1 |
101 1 10
I know I should probably use PIVOT. But the syntax is just not clear to me.
To make things clearer: there are always exactly two order numbers per Id.
And if you want to use PIVOT then the following works with the data provided:
create table #Orders (RowNo int, Id int, OrderNo int)
insert into #Orders (RowNo, Id, OrderNo)
select 1, 101, 1 union all select 2, 101, 10
select Id, [1] OrderNo_0, [2] OrderNo_1
from (
select RowNo, Id, OrderNo
from #Orders
) SourceTable
pivot (
sum(OrderNo)
for RowNo in ([1],[2])
) as PivotTable
Reference: https://learn.microsoft.com/en-us/sql/t-sql/queries/from-using-pivot-and-unpivot?view=sql-server-2017
Note: To build each row in the result set, the pivot function groups by the columns not being pivoted. Therefore you need an aggregate function on the column that is being pivoted. You won't notice it in this instance because you have unique rows to start with, but if you had multiple rows with the same RowNo and Id you would find the aggregation coming into play, as in the hypothetical illustration below.
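A hypothetical illustration of that point, reusing the same pattern but with a duplicate (RowNo, Id) pair (the table name #OrdersDup is made up):
create table #OrdersDup (RowNo int, Id int, OrderNo int)
insert into #OrdersDup (RowNo, Id, OrderNo)
values (1, 101, 1), (1, 101, 4), (2, 101, 10)
select Id, [1] OrderNo_0, [2] OrderNo_1
from (
select RowNo, Id, OrderNo
from #OrdersDup
) SourceTable
pivot (
sum(OrderNo)
for RowNo in ([1],[2])
) as PivotTable
-- Returns Id = 101, OrderNo_0 = 5 (1 + 4 summed together), OrderNo_1 = 10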
As you say there are only ever two order numbers per ID, you could join the results set to itself on the ID column. For the purposes of the example below, I'm assuming your results set is merely selecting from a single Orders table, but it should be easy enough to replace this with your existing query.
SELECT o1.ID, o1.OrderNo AS [OrderNo_0], o2.OrderNo AS [OrderNo_1]
FROM Orders AS o1
INNER JOIN Orders AS o2
ON (o1.ID = o2.ID AND o1.OrderNo <> o2.OrderNo)
From your sample data, the simplest approach is to use the MIN and MAX functions.
SELECT Id,min(OrderNo) OrderNo_0,MAX(OrderNo) OrderNo_1
FROM T
GROUP BY Id

PostgreSQL. I need a hierarchical table to have a constraint so no node could have the same name on the same level

I have a hierarchical table with duplicate names on the same level. Example -
user (int id, string name, int parent_id)
1, Sam, null
2, Mike, 1
3, Mike, 1
4, Mike, 1
I need to make them like this
1, Sam, null
2, Mike#1, 1
3, Mike#2, 1
4, Mike#3, 1
And somehow add a constraint. How can I do that?
You may use ROW_NUMBER() to generate those sequence numbers, and the COUNT analytic function to check whether a sequence number is needed at all:
SELECT id,
CONCAT(name,
CASE
WHEN COUNT(*) OVER(
PARTITION BY name
) > 1 THEN --multiple names exist?
'#' || ROW_NUMBER() OVER(
PARTITION BY name
ORDER BY id )
END
) AS name, --else defaults to null (for single ones).
parent_id
FROM t
ORDER BY id;
It is not clear, when you say "I need a hierarchical table", whether you want to simply select them or create a new table. I would recommend not creating another table simply to store mostly the same values; just create a VIEW instead, using the above query as the base.
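A minimal sketch of such a view, simply wrapping the query above (the view name is made up, and t stands for your user table):
CREATE VIEW users_with_unique_names AS
SELECT id,
CONCAT(name,
CASE
WHEN COUNT(*) OVER (PARTITION BY name) > 1 THEN
'#' || ROW_NUMBER() OVER (PARTITION BY name ORDER BY id)
END
) AS name,
parent_id
FROM t;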
First I needed to rename duplicates on the same level, even in the root directory where parent_id is null. This code will do it:
update user user_update set name = name || '#' || (
select user_count_ids.number
from (
select user_row_count.id id, row_number() over (order by user_row_count.id) number
from user user_row_count
where user_update.name = user_row_count.name and user_update.parent_id is not distinct from user_row_count.parent_id
) as user_count_ids
where user_count_ids.id = user_update.id
)
where (
select count(*) > 1
from user user_count
where user_update.name = user_count.name and user_update.parent_id is not distinct from user_count.parent_id
);
Then I needed some sort of constraint. Thanks to http://stackoverflow.com/a/8289253/5292928 I added this code
create unique index unique_name_parentId_when_parentId_is_not_null
on user (name, parent_id)
where parent_id is not null;
create unique index unique_name_when_parentId_is_null
on user (name)
where parent_id is null;