Calculate TF-IDF using SQL - sql

I have a table in my DB containing a free-text column.
I would like to know how often each word appears across all the rows, or maybe even calculate TF-IDF for all words, where each row's value in that column is treated as a document.
Is it possible to calculate this using a SQL query? If not, or if there's a simpler way, could you please direct me to it?
Many Thanks,
Jon

In SQL Server 2008, depending on your needs, you could apply full-text indexing to the column and then query the sys.dm_fts_index_keywords and sys.dm_fts_index_keywords_by_document dynamic management functions to get the occurrence counts.
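As a rough sketch of that first option: assuming a full-text index already exists on a table called dbo.MyTable (the table name is only a placeholder), the per-document function could be aggregated like this to get one count per word.

```sql
-- Assumption: dbo.MyTable has a populated full-text index on the text column.
SELECT display_term,
       SUM(occurrence_count) AS Cnt           -- total occurrences over all rows
FROM sys.dm_fts_index_keywords_by_document(DB_ID(), OBJECT_ID(N'dbo.MyTable'))
WHERE display_term <> N'END OF FILE'           -- skip the end-of-index marker row, if present
GROUP BY display_term
ORDER BY Cnt DESC;
```

The same function also returns document_id, so grouping by document_id and display_term instead would give per-row term counts.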
Edit: Actually, even without creating a persistent full-text index you can still leverage the parser:
WITH testTable AS
(
SELECT 1 AS Id, N'how now brown cow' AS txt UNION ALL
SELECT 2, N'she sells sea shells upon the sea shore' UNION ALL
SELECT 3, N'red lorry yellow lorry' UNION ALL
SELECT 4, N'the quick brown fox jumped over the lazy dog'
)
SELECT display_term, COUNT(*) As Cnt
FROM testTable
CROSS APPLY sys.dm_fts_parser('"' + REPLACE(txt,'"','""') + '"', 1033, 0,0)
WHERE TXT IS NOT NULL
GROUP BY display_term
HAVING COUNT(*) > 1
ORDER BY Cnt DESC
Returns
display_term Cnt
------------------------------ -----------
the 3
brown 2
lorry 2
sea 2
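Building on the parser query, TF-IDF itself could be sketched roughly as below. This assumes the common definitions tf = (occurrences of the term in the row) / (total terms in the row) and idf = ln(number of rows / number of rows containing the term); other weightings exist. The CTE names (docs, tokens, term_freq, inv_doc_freq) are made up for the example, and stop words are not filtered out.

```sql
WITH docs AS
(
    SELECT 1 AS Id, N'how now brown cow' AS txt UNION ALL
    SELECT 2, N'she sells sea shells upon the sea shore' UNION ALL
    SELECT 3, N'red lorry yellow lorry' UNION ALL
    SELECT 4, N'the quick brown fox jumped over the lazy dog'
),
tokens AS
(
    -- one row per word occurrence
    SELECT d.Id, p.display_term AS term
    FROM docs d
    CROSS APPLY sys.dm_fts_parser('"' + REPLACE(d.txt, '"', '""') + '"', 1033, 0, 0) p
),
term_freq AS
(
    -- share of the document's words that are this term
    SELECT Id, term,
           COUNT(*) * 1.0 / SUM(COUNT(*)) OVER (PARTITION BY Id) AS tf
    FROM tokens
    GROUP BY Id, term
),
inv_doc_freq AS
(
    -- natural log of (total documents / documents containing the term)
    SELECT t.term,
           LOG(total.n * 1.0 / COUNT(DISTINCT t.Id)) AS idf
    FROM tokens t
    CROSS JOIN (SELECT COUNT(*) AS n FROM docs) total
    GROUP BY t.term, total.n
)
SELECT tf.Id, tf.term, tf.tf, idf.idf, tf.tf * idf.idf AS tf_idf
FROM term_freq tf
JOIN inv_doc_freq idf ON idf.term = tf.term
ORDER BY tf.Id, tf_idf DESC;
```

For a real table, the docs CTE would be replaced by a SELECT of the key and text columns.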

Solution for SQL Server 2008:
here is the table:
CREATE TABLE MyTable (id INT, txt VARCHAR(MAX));
Here is the SQL query:
SELECT SUM(CASE WHEN TSplitted.txt_word = 'searched' THEN 1 ELSE 0 END) AS cnt_searched
, COUNT(*) AS cnt_all
FROM MyTable MYT
-- CROSS APPLY is needed because the function's arguments come from MYT
CROSS APPLY dbo.Fn_Split(MYT.id, ' ', MYT.txt) TSplitted
Here is the table-valued function Fn_Split(@id INT, @separator VARCHAR(32), @string VARCHAR(MAX)) (taken from here):
CREATE FUNCTION dbo.Fn_Split (@id INT, @separator VARCHAR(32), @string VARCHAR(MAX))
RETURNS @t TABLE
(
ret_id INT
,txt_word VARCHAR(MAX)
)
AS
BEGIN
DECLARE @xml XML
-- Wrap every word in <r>...</r>; text containing &, < or > would need escaping first
SET @xml = N'<root><r>' + REPLACE(@string, @separator, '</r><r>') + '</r></root>'
INSERT INTO @t(ret_id, txt_word)
SELECT @id, r.value('.','VARCHAR(MAX)') AS Item
FROM @xml.nodes('//root/r') AS RECORDS(r)
RETURN
END

Related

Split string and display below other column data using SQL Server [duplicate]

I have a table that looks like this:
ProductId, Color
"1", "red, blue, green"
"2", null
"3", "purple, green"
And I want to expand it to this:
ProductId, Color
1, red
1, blue
1, green
2, null
3, purple
3, green
What's the easiest way to accomplish this? Is it possible without a loop in a proc?
Take a look at this function. I've done similar tricks to split and transpose data in Oracle: loop over the data, inserting the decoded values into a temp table. The convenient thing is that MS will let you do this on the fly, while Oracle requires an explicit temp table.
MS SQL Split Function
Better Split Function
Edit by author:
This worked great. Final code looked like this (after creating the split function):
select p.productid, colortable.items as color
from product p
cross apply split(p.color, ',') as colortable
based on your tables:
create table test_table
(
ProductId int
,Color varchar(100)
)
insert into test_table values (1, 'red, blue, green')
insert into test_table values (2, null)
insert into test_table values (3, 'purple, green')
create a new table like this:
CREATE TABLE Numbers
(
Number int not null primary key
)
that has rows containing values 1 to 8000 or so.
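One way to fill such a table is sketched below. It borrows rows from sys.all_objects purely as a row source (an assumption of this example, not part of the answer above); any sufficiently large row source would do.

```sql
-- Populate Numbers with 1..8000. The cross join of sys.all_objects with itself
-- is used only to generate enough rows; its contents are irrelevant.
INSERT INTO Numbers (Number)
SELECT TOP (8000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
FROM sys.all_objects a
CROSS JOIN sys.all_objects b;
```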
this will return what you want:
EDIT
Here is a much better query, slightly modified from the great answer by @Christopher Klein:
I added the LTRIM() so that the spaces in the color list ("red, blue, green") are handled properly; his solution requires a list with no spaces ("red,blue,green"). Also, I prefer to use my own Numbers table rather than master.dbo.spt_values, which allows the removal of one derived table too.
SELECT
ProductId, LEFT(PartialColor, CHARINDEX(',', PartialColor + ',')-1) as SplitColor
FROM (SELECT
t.ProductId, LTRIM(SUBSTRING(t.Color, n.Number, 200)) AS PartialColor
FROM test_table t
LEFT OUTER JOIN Numbers n ON n.Number<=LEN(t.Color) AND SUBSTRING(',' + t.Color, n.Number, 1) = ','
) t
EDIT END
SELECT
ProductId, Color --,number
FROM (SELECT
ProductId
,CASE
WHEN LEN(List2)>0 THEN LTRIM(RTRIM(SUBSTRING(List2, number+1, CHARINDEX(',', List2, number+1)-number - 1)))
ELSE NULL
END AS Color
,Number
FROM (
SELECT ProductId,',' + Color + ',' AS List2
FROM test_table
) AS dt
LEFT OUTER JOIN Numbers n ON (n.Number < LEN(dt.List2)) OR (n.Number=1 AND dt.List2 IS NULL)
WHERE SUBSTRING(List2, number, 1) = ',' OR List2 IS NULL
) dt2
ORDER BY ProductId, Number, Color
here is my result set:
ProductId Color
----------- --------------
1 red
1 blue
1 green
2 NULL
3 purple
3 green
(6 row(s) affected)
which is the same order you want...
You can try this out; it doesn't require any additional functions:
declare @t table (col1 varchar(10), col2 varchar(200))
insert @t
select '1', 'red,blue,green'
union all select '2', NULL
union all select '3', 'green,purple'
select col1, left(d, charindex(',', d + ',')-1) as e from (
select *, substring(col2, number, 200) as d from @t col1 left join
(select distinct number from master.dbo.spt_values where number between 1 and 200) col2
on substring(',' + col2, number, 1) = ',') t
I arrived at this question 10 years after the post.
SQL Server 2016 added the STRING_SPLIT function.
Using that, this can be written as below.
declare @product table
(
ProductId int,
Color varchar(max)
);
insert into @product values (1, 'red, blue, green');
insert into @product values (2, null);
insert into @product values (3, 'purple, green');
select
p.ProductId as ProductId,
ltrim(split_table.value) as Color
from @product p
outer apply string_split(p.Color, ',') as split_table;
Fix your database if at all possible. Comma delimited lists in database cells indicate a flawed schema 99% of the time or more.
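To make that concrete, a normalized layout might look something like the sketch below. The table and column names (Product, ProductColor) are illustrative guesses, not taken from the question.

```sql
-- One row per product, and one row per (product, color) pair.
CREATE TABLE Product
(
    ProductId INT NOT NULL PRIMARY KEY
);

CREATE TABLE ProductColor
(
    ProductId INT NOT NULL REFERENCES Product (ProductId),
    Color     VARCHAR(50) NOT NULL,
    PRIMARY KEY (ProductId, Color)
);

INSERT INTO Product (ProductId) VALUES (1);
INSERT INTO Product (ProductId) VALUES (2);
INSERT INTO Product (ProductId) VALUES (3);

INSERT INTO ProductColor (ProductId, Color) VALUES (1, 'red');
INSERT INTO ProductColor (ProductId, Color) VALUES (1, 'blue');
INSERT INTO ProductColor (ProductId, Color) VALUES (1, 'green');
INSERT INTO ProductColor (ProductId, Color) VALUES (3, 'purple');
INSERT INTO ProductColor (ProductId, Color) VALUES (3, 'green');

-- The "expanded" result is now a plain join; product 2 comes back with a NULL color.
SELECT p.ProductId, pc.Color
FROM Product p
LEFT JOIN ProductColor pc ON pc.ProductId = p.ProductId
ORDER BY p.ProductId;
```

With this shape no string splitting is needed at query time.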
I would create a CLR table-valued function for this:
http://msdn.microsoft.com/en-us/library/ms254508(VS.80).aspx
The reason for this is that CLR code is going to be much better at parsing apart the strings (computational work) and can pass that information back as a set, which is what SQL Server is really good at (set management).
The CLR function would return a series of records based on the parsed values (and the input id value).
You would then use a CROSS APPLY on each element in your table.
Just convert your columns into xml and query it. Here's an example.
select
a.value('.', 'varchar(42)') c
from (select cast('<r><a>' + replace(@CSV, ',', '</a><a>') + '</a></r>' as xml) x) t1
cross apply x.nodes('//r/a') t2(a)
Why not use dynamic SQL for this purpose? Something like this (adapt to your needs):
DECLARE @dynSQL VARCHAR(MAX)
SET @dynSQL = 'insert into DestinationTable(field) values '
-- Turn each list 'a,b,c' into ('a'),('b'),('c'), and skip NULL lists
SELECT @dynSQL = @dynSQL + '(''' + REPLACE(Color, ',', '''),(''') + '''),' FROM [Table] WHERE Color IS NOT NULL
SET @dynSQL = LEFT(@dynSQL, LEN(@dynSQL) - 1) -- delete the last comma
EXEC (@dynSQL)
One advantage is that you can use it on any SQL Server version

Dynamic SELECT statement, generate columns based on present and future values

I am currently building a SELECT statement in SQL Server 2008 and would like to make it dynamic, so that the columns are defined by the values in a table. I have heard about pivot tables and cursors, but they seem hard to understand at my current level. Here is the code:
DECLARE @date DATE = NULL
IF @date IS NULL
SET @date = CAST(GETDATE() AS DATE)
SELECT
Name,
value1,
value2,
value3,
value4
FROM ref_Table a
FULL OUTER JOIN (
SELECT
PK_ID ID,
sum(case when FK_ContainerType_ID = 1 then 1 else null end) Box,
sum(case when FK_ContainerType_ID = 2 then 1 else null end) Pallet,
sum(case when FK_ContainerType_ID = 3 then 1 else null end) Bag,
sum(case when FK_ContainerType_ID = 4 then 1 else null end) Drum
from
Packages
WHERE
@date between PackageStart AND PackageEnd
group by PK_ID ) b on a.Name = b.ID
where
[Group] = 0
The following works great for me, but PK_Type_ID and the column names (PackageNameX, ...) are hard-coded. I need the query to be dynamic, so that it can build itself based on present or future values in the Package table.
Any help or guidance in the right direction would be greatly appreciated.
As requested
ref_Table (PK_ID, Name)
1, John
2, Mary
3, Albert
4, Jane
Packages (PK_ID, FK_ref_Table_ID, FK_ContainerType_ID, PackageStartDate, PackageEndDate)
1 , 1, 4, 1JAN2014, 30JAN2014
2 , 2, 3, 1JAN2014, 30JAN2014
3 , 3, 2, 1JAN2014, 30JAN2014
4 , 4, 1, 1JAN2014, 30JAN2014
ContainerType (PK_ID, Type)
1, Box
2, Pallet
3, Bag
4, Drum
and the result should look like this;
Name Box Pallet Bag Drum
---------------------------------------
John 1
Mary 1
Albert 1
Jane 1
As I said, the code above works great; the issue is that the Container table is going to grow, and I need to reproduce the same report without hard-coding the columns.
What you need to build is called a dynamic pivot. There are plenty of good references on Stack Overflow if you search for that term.
Here is a solution to your scenario:
IF OBJECT_ID('tempdb..##ref_Table') IS NOT NULL
DROP TABLE ##ref_Table
IF OBJECT_ID('tempdb..##Packages') IS NOT NULL
DROP TABLE ##Packages
IF OBJECT_ID('tempdb..##ContainerType') IS NOT NULL
DROP TABLE ##ContainerType
SET NOCOUNT ON
CREATE TABLE ##ref_Table (PK_ID INT, NAME NVARCHAR(50))
CREATE TABLE ##Packages (PK_ID INT, FK_ref_Table_ID INT, FK_ContainerType_ID INT, PackageStartDate DATE, PackageEndDate DATE)
CREATE TABLE ##ContainerType (PK_ID INT, [Type] NVARCHAR(50))
INSERT INTO ##ref_Table (PK_ID,NAME)
SELECT 1,'John' UNION
SELECT 2,'Mary' UNION
SELECT 3,'Albert' UNION
SELECT 4,'Jane'
INSERT INTO ##Packages (PK_ID, FK_ref_Table_ID, FK_ContainerType_ID, PackageStartDate, PackageEndDate)
SELECT 1,1,4,'2014-01-01','2014-01-30' UNION
SELECT 2,2,3,'2014-01-01','2014-01-30' UNION
SELECT 3,3,2,'2014-01-01','2014-01-30' UNION
SELECT 4,4,1,'2014-01-01','2014-01-30'
INSERT INTO ##ContainerType (PK_ID, [Type])
SELECT 1,'Box' UNION
SELECT 2,'Pallet' UNION
SELECT 3,'Bag' UNION
SELECT 4,'Drum'
DECLARE @DATE DATE, @PARAMDEF NVARCHAR(MAX), @COLS NVARCHAR(MAX), @SQL NVARCHAR(MAX)
SET @DATE = '2014-01-15'
SET @COLS = STUFF((SELECT DISTINCT ',' + QUOTENAME(T.[Type])
FROM ##ContainerType T
FOR XML PATH, TYPE).value('.', 'NVARCHAR(MAX)'),1,1,'')
SET @SQL = 'SELECT [Name], ' + @COLS + '
FROM (SELECT [Name], [Type], 1 AS Value
FROM ##ref_Table R
JOIN ##Packages P ON R.PK_ID = P.FK_ref_Table_ID
JOIN ##ContainerType T ON P.FK_ContainerType_ID = T.PK_ID
WHERE @DATE BETWEEN P.PackageStartDate AND P.PackageEndDate) X
PIVOT (COUNT(Value) FOR [Type] IN (' + @COLS + ')) P
'
PRINT @COLS
PRINT @SQL
SET @PARAMDEF = '@DATE DATE'
EXEC SP_EXECUTESQL @SQL, @PARAMDEF, @DATE=@DATE
Output:
Name Bag Box Drum Pallet
Albert 0 0 0 1
Jane 0 1 0 0
John 0 0 1 0
Mary 1 0 0 0
Static Query:
SELECT [Name],[Box],[Pallet],[Bag],[Drum] FROM
(
SELECT *
FROM
(
SELECT rf.Name,cnt.[Type], pk.PK_ID AS PKID, rf.PK_ID AS RFID
FROM ref_Table rf INNER JOIN Packages pk ON rf.PK_ID = pk.FK_ref_Table_ID
INNER JOIN ContainerType cnt ON cnt.PK_ID = pk.FK_ContainerType_ID
) AS SourceTable
PIVOT
(
COUNT(PKID )
FOR [Type]
IN ( [Box],[Pallet],[Bag],[Drum])
) AS PivotTable
) AS Main
ORDER BY RFID
Dynamic Query:
DECLARE @columnList nvarchar (MAX)
DECLARE @pivotsql nvarchar (MAX)
SELECT @columnList = STUFF(
(
SELECT ',' + '[' + [Type] + ']'
FROM ContainerType
FOR XML PATH( '')
)
,1, 1,'' )
SET @pivotsql =
N'SELECT [Name],' + @columnList + ' FROM
(
SELECT *
FROM
(
SELECT rf.Name,cnt.[Type], pk.PK_ID AS PKID, rf.PK_ID AS RFID
FROM ref_Table rf INNER JOIN Packages pk ON rf.PK_ID = pk.FK_ref_Table_ID
INNER JOIN ContainerType cnt ON cnt.PK_ID = pk.FK_ContainerType_ID
) AS SourceTable
PIVOT
(
COUNT(PKID )
FOR [Type]
IN ( ' + @columnList + ')
) AS PivotTable
) AS Main
ORDER BY RFID;'
EXEC sp_executesql @pivotsql
Following my tutorial below will help you to understand the PIVOT functionality:
We write SQL queries to get different result sets (full, partial, calculated, grouped, sorted, etc.) from database tables. Sometimes, however, we need to rotate a table. Sounds confusing?
Let's keep it simple and consider the following two screen grabs.
SQL Table:
Expected Results:
Wow, that looks like a lot of work! A combination of tricky SQL, temporary tables, loops, aggregation... blah blah blah.
Don't worry, let's keep it simple, stupid (KISS).
MS SQL Server 2005 and above has a relational operator called PIVOT. It's very simple to use and powerful. With its help we can rotate SQL tables and result sets.
Simple steps to make it happen:
1. Identify all the columns that will be part of the desired result set.
2. Find the column on which we will apply aggregation (SUM, AVG, MAX, MIN, etc.).
3. Identify the column whose values will become the column headers.
4. Specify the column values mentioned in step 3 as a comma-separated list, each surrounded by square brackets.
So, if we now follow the above four steps and extract the information from the above sales table, it will be as below:
1. Year, Month, SalesAmount
2. SalesAmount
3. Month
4. [Jan],[Feb],[Mar] ... etc.
We are nearly there if all the above steps made sense to you so far.
Now we have all the information we need. All we have to do now is to fill the below template with required information.
Template:
Our SQL query should look like below:
SELECT *
FROM
(
SELECT SalesYear, SalesMonth,Amount
FROM Sales
) AS SourceTable
PIVOT
(
SUM(Amount )
FOR SalesMonth
IN ( [Jan],[Feb] ,[Mar],
[Apr],[May],[Jun] ,[Jul],
[Aug],[Sep] ,[Oct],[Nov] ,[Dec])
) AS PivotTable;
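To try the query, a small Sales table along these lines would do. Only the column names come from the query above; the data types and the rows are made up for illustration, and the PIVOT lists just two months to keep the sketch short.

```sql
-- Hypothetical sample data for the Sales table used in the tutorial.
CREATE TABLE Sales (SalesYear INT, SalesMonth VARCHAR(3), Amount DECIMAL(10, 2));

INSERT INTO Sales (SalesYear, SalesMonth, Amount) VALUES (2013, 'Jan', 100.00);
INSERT INTO Sales (SalesYear, SalesMonth, Amount) VALUES (2013, 'Jan',  50.00);
INSERT INTO Sales (SalesYear, SalesMonth, Amount) VALUES (2013, 'Feb', 200.00);
INSERT INTO Sales (SalesYear, SalesMonth, Amount) VALUES (2014, 'Jan', 300.00);

-- One output row per SalesYear, one column per listed month.
-- 2013 gets Jan = 150.00 and Feb = 200.00; 2014 gets Jan = 300.00 and Feb = NULL.
SELECT *
FROM (SELECT SalesYear, SalesMonth, Amount FROM Sales) AS SourceTable
PIVOT (SUM(Amount) FOR SalesMonth IN ([Jan], [Feb])) AS PivotTable
ORDER BY SalesYear;
```

Every column of the source query that is neither aggregated nor pivoted (here only SalesYear) acts as a grouping column, which is why the inner SELECT lists exactly the three columns needed.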
In the above query we have hard-coded the column names. That's no fun when you have to specify a large number of columns.
However, there is a workaround, as follows:
DECLARE @columnList nvarchar (MAX)
DECLARE @pivotsql nvarchar (MAX)
SELECT @columnList = STUFF(
(
SELECT ',' + '[' + SalesMonth + ']'
FROM Sales
GROUP BY SalesMonth
FOR XML PATH( '')
)
,1, 1,'' )
SET @pivotsql =
N'SELECT *
FROM
(
SELECT SalesYear, SalesMonth,Amount
FROM Sales
) AS SourceTable
PIVOT
(
SUM(Amount )
FOR SalesMonth
IN ( ' + @columnList +' )
) AS PivotTable;'
EXEC sp_executesql @pivotsql
Hopefully this tutorial will be a help to someone somewhere.
Enjoy coding.

SQL Count distinct values within the field

I have this weird scenario (at least it is for me) where I have a table (actually a result set, but I want to make it simpler) that looks like the following:
ID | Actions
------------------
1 | 10,12,15
2 | 11,12,13
3 | 15
4 | 14,15,16,17
And I want to count the distinct actions across the whole table. In this case, I want the result to be 8 (just counting 10, 11, ..., 17, and ignoring the repeated values).
In case it matters, I am using MS SQL 2008.
If it makes it any easier, the Actions were previously in XML that looks like
<root>
<actions>10,12,15</actions>
</root>
I doubt it makes it easier, but somebody might come back with an XML function that I am not aware of and that makes everything easier.
Let me know if there's something else I should say.
Using approach similar to http://codecorner.galanter.net/2012/04/25/t-sql-string-deaggregate-split-ungroup-in-sql-server/:
First you need a function that splits a string. There are many examples on SO; here's one of them:
CREATE FUNCTION dbo.Split (@sep char(1), @s varchar(512))
RETURNS table
AS
RETURN (
WITH Pieces(pn, start, stop) AS (
SELECT 1, 1, CHARINDEX(@sep, @s)
UNION ALL
SELECT pn + 1, stop + 1, CHARINDEX(@sep, @s, stop + 1)
FROM Pieces
WHERE stop > 0
)
SELECT pn,
SUBSTRING(@s, start, CASE WHEN stop > 0 THEN stop-start ELSE 512 END) AS s
FROM Pieces
)
Using this you can run a simple query:
SELECT COUNT(DISTINCT S) FROM MyTable CROSS APPLY dbo.Split(',', Actions)
Here is the demo: http://sqlfiddle.com/#!3/5e706/3/0
SQL Fiddle
MS SQL Server 2008 Schema Setup:
CREATE TABLE Table1
([ID] int, [Actions] varchar(11))
;
INSERT INTO Table1
([ID], [Actions])
VALUES
(1, '10,12,15'),
(2, '11,12,13'),
(3, '15'),
(4, '14,15,16,17')
;
Query 1:
DECLARE @S varchar(255)
DECLARE @X xml
SET @S = (SELECT Actions + ',' FROM Table1 FOR XML PATH(''))
SELECT @X = CONVERT(xml,'<root><s>' + REPLACE(@S,',','</s><s>') + '</s></root>')
SELECT count(distinct [Value])
FROM (
SELECT [Value] = T.c.value('.','varchar(20)')
FROM @X.nodes('/root/s') T(c)) AS Result
WHERE [Value] > 0
Results:
| COLUMN_0 |
|----------|
| 8 |
EDIT :
I think this is exactly what you are looking for :
SQL Fiddle
MS SQL Server 2008 Schema Setup:
Query 1:
DECLARE @X xml
SELECT @X = CONVERT(xml,replace('
<root>
<actions>10,12,15</actions>
<actions>11,12,13</actions>
<actions>15</actions>
<actions>14,15,16,17</actions>
</root>
',',','</actions><actions>'))
SELECT count(distinct [Value])
FROM (
SELECT [Value] = T.c.value('.','varchar(20)')
FROM @X.nodes('/root/actions') T(c)) AS Result
Results:
| COLUMN_0 |
|----------|
| 8 |
It's a bit of a mess, but here it is. Create the function first and then run the code below it.
/* Helper Function */
CREATE FUNCTION dbo.Split (@sep char(1), @s varchar(8000))
RETURNS table
AS
RETURN (
WITH splitter_cte AS (
SELECT CHARINDEX(@sep, @s) as pos, 0 as lastPos
UNION ALL
SELECT CHARINDEX(@sep, @s, pos + 1), pos
FROM splitter_cte
WHERE pos > 0
)
SELECT SUBSTRING(@s, lastPos + 1,
case when pos = 0 then 8000
else pos - lastPos -1 end) as chunk
FROM splitter_cte
)
GO
---------------- End of Function
/* Function Call */
Declare @Actions varchar(1000)
SELECT @Actions = STUFF((SELECT ',' + actions
FROM tblActions
ORDER BY actions
FOR XML PATH('')), 1, 1, '')
SELECT Distinct *
FROM dbo.Split(',', @Actions)
OPTION(MAXRECURSION 0);
If you have a table of Actions with one row per possible action id, you can do this with a join:
select count(distinct a.ActionId)
from t join
Actions a
on ','+t.Actions+',' like '%,'+cast(a.ActionId as varchar(255))+',%';
You could also create a table of numbers (using a CTE) if you know the actions are within some range.
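A minimal sketch of that last idea, assuming the action ids are known to lie between 1 and 100 (the range and the @t table variable are assumptions for the example):

```sql
DECLARE @t TABLE (ID INT, Actions VARCHAR(50));
INSERT INTO @t (ID, Actions) VALUES
    (1, '10,12,15'),
    (2, '11,12,13'),
    (3, '15'),
    (4, '14,15,16,17');

-- Recursive CTE producing 1..100; this stays within the default MAXRECURSION of 100.
WITH Numbers AS
(
    SELECT 1 AS n
    UNION ALL
    SELECT n + 1 FROM Numbers WHERE n < 100
)
SELECT COUNT(DISTINCT num.n) AS DistinctActions   -- 8 for the sample rows
FROM @t t
JOIN Numbers num
  ON ',' + t.Actions + ',' LIKE '%,' + CAST(num.n AS VARCHAR(10)) + ',%';
```

Wrapping both sides in commas keeps 1 from matching inside 10 or 15.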

SQL Server split a single column multiple times

I have a database table with a column of stacked data, two levels deep, that I want to break apart. Here is an example of the data (data changed to protect the innocent :) ):
Table
ID = varchar(100)
CarData = varchar(1000)
ID CarData
1 Nissan:blue:20000,Ford:green:10000
2 Nissan:steel:20001,Ford:blue:10001,Chevy:blue:10000,Ford:olive:10000
** Note that CarData is not fixed and can have many cars in it
Output Desired:
ID Manufacture Color Cost
1 Nissan Blue 20000
1 Ford green 10000
2 Nissan steel 20001
... and on
So to say it plainly I need to break the first stacked field which is a comma and create a row for that, then break the second stacked field which is a colon into columns.
Any help would be greatly appreciated.
-- Sample data
declare @T table(ID int, CarData varchar(100))
insert into @T values
(1, 'Nissan:blue:20000,Ford:green:10000'),
(2, 'Nissan:steel:20001,Ford:blue:10001,Chevy:blue:10000,Ford:olive:10000')
-- Recursive CTE to get one row for each car
;with cte(ID, Car, CarData) as
(
select ID,
cast(substring(CarData+',', 1, charindex(',', CarData+',')-1) as varchar(100)),
stuff(CarData, 1, charindex(',', CarData), '')+','
from @T
where len(CarData) > 0
union all
select ID,
cast(substring(CarData, 1, charindex(',', CarData)-1) as varchar(100)),
stuff(CarData, 1, charindex(',', CarData), '')
from cte
where len(CarData) > 0
)
-- Use parsename to split the car data
select ID,
parsename(replace(Car, ':', '.'), 3) as Manufacture,
parsename(replace(Car, ':', '.'), 2) as Color,
parsename(replace(Car, ':', '.'), 1) as Cost
from cte
order by ID
Result:
ID Manufacture Color Cost
-- ----------- ------ -----
1 Nissan blue 20000
1 Ford green 10000
2 Nissan steel 20001
2 Ford blue 10001
2 Chevy blue 10000
2 Ford olive 10000
Edit 1
You will have trouble with parsename if the color, cost or manufacturer name contains a period (.). If that is the case, you should try this instead.
-- Sample data
declare @T table(ID int, CarData varchar(100))
insert into @T values
(1, 'Nissan:blue:20000,Ford:green:10000'),
(2, 'Nissan:steel:20001,Ford:blue:10001,Chevy:blue:10000,Ford:olive:10000')
-- Recursive CTE to get one row for each car
;with cte(ID, Car, CarData) as
(
select ID,
cast(substring(CarData+',', 1, charindex(',', CarData+',')-1) as varchar(100)),
stuff(CarData, 1, charindex(',', CarData), '')+','
from @T
where len(CarData) > 0
union all
select ID,
cast(substring(CarData, 1, charindex(',', CarData)-1) as varchar(100)),
stuff(CarData, 1, charindex(',', CarData), '')
from cte
where len(CarData) > 0
)
-- Split the car data with substring
select ID,
substring(Car, 1, P1.Pos-1) as Manufacture,
substring(Car, P1.Pos+1, P2.Pos-P1.Pos-1) as Color,
substring(Car, P2.Pos+1, len(Car)-P2.Pos) as Cost
from cte
cross apply (select charindex(':', Car)) as P1(Pos)
cross apply (select charindex(':', Car, P1.Pos+1)) as P2(Pos)
order by ID
Use this string splitting function to produce a table of results.
I would first call dbo.split() using a , as the separator character. Then you'll have a list of items like:
Nissan:blue:20000
Ford:green:10000
Nissan:steel:20001
Ford:blue:10001
Chevy:blue:10000
Ford:olive:10000
From there you can call dbo.split() again using : as your separator. Each call will result in exactly three records (assuming your design is at least that "normal").
As @JNK mentioned in his comment, hopefully this is not something you'd want to be running regularly.
EDIT:
Some sample code to get you started:
SELECT *
INTO #YuckyCar
FROM (
SELECT 1 ID, 'Nissan:blue:20000,Ford:green:10000' CarData
UNION
SELECT 2, 'Nissan:steel:20001,Ford:blue:10001,Chevy:blue:10000,Ford:olive:10000'
) T;
-- Shows logical step #1
SELECT ID, X.items MoreCarData
FROM #YuckyCar CROSS APPLY dbo.Split(CarData, ',') X;
-- Shows logical step #2
SELECT Q.ID, Y.items
FROM (
SELECT ID, X.items MoreCarData
FROM #YuckyCar CROSS APPLY dbo.Split(CarData, ',') X) Q CROSS APPLY dbo.Split(Q.MoreCarData, ':') Y
DROP TABLE #YuckyCar;
The problem in the last part is that you can't guarantee row 1 = Manufacturer, row 2 = Color, row 3 = Cost.
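One possible way around that is a splitter that also returns each piece's position, so the three fields can be told apart reliably. The sketch below uses a hypothetical helper, dbo.SplitOrdered, modelled on the recursive-CTE splitters commonly posted for SQL Server 2005/2008; it is not the dbo.Split used above.

```sql
-- Hypothetical helper: returns every piece together with its position (pn).
CREATE FUNCTION dbo.SplitOrdered (@sep CHAR(1), @s VARCHAR(8000))
RETURNS TABLE
AS
RETURN (
    WITH Pieces(pn, start, stop) AS (
        SELECT 1, 1, CHARINDEX(@sep, @s)
        UNION ALL
        SELECT pn + 1, stop + 1, CHARINDEX(@sep, @s, stop + 1)
        FROM Pieces
        WHERE stop > 0
    )
    SELECT pn,
           SUBSTRING(@s, start, CASE WHEN stop > 0 THEN stop - start ELSE 8000 END) AS s
    FROM Pieces
);
GO

DECLARE @T TABLE (ID INT, CarData VARCHAR(1000));
INSERT INTO @T (ID, CarData) VALUES
    (1, 'Nissan:blue:20000,Ford:green:10000'),
    (2, 'Nissan:steel:20001,Ford:blue:10001,Chevy:blue:10000,Ford:olive:10000');

-- First split on ',' (one row per car), then on ':' (one row per field),
-- and fold the fields back into columns by their position.
SELECT t.ID,
       MAX(CASE WHEN f.pn = 1 THEN f.s END) AS Manufacture,
       MAX(CASE WHEN f.pn = 2 THEN f.s END) AS Color,
       MAX(CASE WHEN f.pn = 3 THEN f.s END) AS Cost
FROM @T t
CROSS APPLY dbo.SplitOrdered(',', t.CarData) car
CROSS APPLY dbo.SplitOrdered(':', car.s) f
GROUP BY t.ID, car.pn
ORDER BY t.ID, car.pn;
```

Grouping by car.pn as well as ID keeps two cars from the same row apart, and ordering by it preserves the original order of the list.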
This should solve your problem:
[EDIT] Your ID is a varchar(100) and you do not specify whether it is a primary key, so I made some changes: ID doesn't have to be a primary key in this case.
declare @T table(ID varchar(100), CarData varchar(1000))
declare @OUT table(pk INT IDENTITY(1,1), ID varchar(100), Manufacture varchar(100), Color VARCHAR(100), Cost INT)
insert into @T (ID, CarData) values
('1', 'Nissan:blue:20000,Ford:green:10000'),
('2', 'Nissan:steel:20001,Ford:blue:10001,Chevy:blue:10000,Ford:olive:10000')
DECLARE @x XML, @i INT, @ID VARCHAR(100), @maxi INT;
;WITH list AS (SELECT pk = ROW_NUMBER() OVER(ORDER BY ID), * FROM @T)
SELECT @i=1, @maxi=MAX(pk) FROM list;
WHILE @i <= @maxi
BEGIN
;WITH list AS (SELECT pk = ROW_NUMBER() OVER(ORDER BY ID), * FROM @T)
SELECT
@x = CAST( '<root><car><prop>' +
REPLACE(
REPLACE(
CarData
,':'
,'</prop><prop>'
)
,','
,'</prop></car><car><prop>'
) +
'</prop></car></root>'
AS XML)
, @ID = ID
FROM list
WHERE pk = @i
INSERT INTO @OUT
SELECT
ID = @ID
,Manufacture = x.value('./prop[1]','VARCHAR(100)')
,Color = x.value('./prop[2]','VARCHAR(100)')
,Cost = x.value('./prop[3]','INT')
FROM @x.nodes('/root/car') AS T(x)
SET @i = @i + 1;
END
SELECT * FROM @OUT
/* -- OUTPUT
ID Manufacture Color Cost
--------------------------------
1 Nissan blue 20000
1 Ford green 10000
2 Nissan steel 20001
2 Ford blue 10001
2 Chevy blue 10000
2 Ford olive 10000
*/
