SQL separating address into multiple columns using spaces

I have over 7 million rows, otherwise I would use Excel.
My address column has a varying number of words. Some are as short as '123 bay street', while others can be as long as '1234 west spring hill drive apt 123'.
My goal is to put each word into its own column. I was able to get the first word using the query below, but I can't create a query efficient enough to do the rest.
update X
set X.Address_number = Y.[address]
from
(SELECT
unique_id,
CASE
WHEN SUBSTRING(phy_addr1, 1, CHARINDEX(' ', phy_addr1)) = ''
THEN phy_addr1 + ' '
ELSE SUBSTRING(phy_addr1, 1, CHARINDEX(' ', phy_addr1))
END 'address'
FROM
[RD_GeoCode].[dbo].[PA_Stg_excel]) as Y
inner join
[RD_GeoCode].[dbo].[rg_ApplicationData_AllForms_20160401_address] as X on X.unique_id = Y.unique_id
where
X.Address_number is null

You need a Numbers table and one of the string splitters mentioned here. Once you have that, it's simple.
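If you don't already have one, here is a common one-time setup for the dbo.Numbers table that the splitter below depends on (a sketch; size it to cover at least your longest address):
-- Build a Numbers table holding 1..100000 with a clustered primary key.
SELECT TOP (100000) Number = ROW_NUMBER() OVER (ORDER BY s1.[object_id])
INTO dbo.Numbers
FROM sys.all_objects AS s1
CROSS JOIN sys.all_objects AS s2;

ALTER TABLE dbo.Numbers
    ADD CONSTRAINT PK_Numbers PRIMARY KEY CLUSTERED (Number);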
-----String splitter function
CREATE FUNCTION dbo.SplitStrings_Numbers
(
    @List NVARCHAR(MAX),
    @Delimiter NVARCHAR(255)
)
RETURNS TABLE
WITH SCHEMABINDING
AS
RETURN
(
    SELECT Item = SUBSTRING(@List, Number,
        CHARINDEX(@Delimiter, @List + @Delimiter, Number) - Number)
    FROM dbo.Numbers
    WHERE Number <= CONVERT(INT, LEN(@List))
        AND SUBSTRING(@Delimiter + @List, Number, LEN(@Delimiter)) = @Delimiter
);
GO
You can use the above function like below:
select *
from yourtable t
cross apply dbo.SplitStrings_Numbers(t.address, ' ') b
Instead of updating values in the same table, I suggest creating another table that links back to it. This requires some schema modification to your existing setup:
create table addressreferences
(
    address varchar(300),
    delimitedvalue varchar(100)
)

insert into addressreferences
select t.address, b.*
from yourtable t
cross apply dbo.SplitStrings_Numbers(t.address, ' ') b
This is just pseudo code to give you the idea; you will have to take care of the references. Updating the same table will not work, because you don't know in advance how many rows one address column can span.
Update:
I think a trigger will suit your scenario better than manual reference maintenance, but you will have to do an insert on the references table first for the existing values. Here is some pseudo code:
create trigger trg_test
on dbo.yourtable
after insert, update, delete
as
begin
    ---check for pure inserts (rows in inserted only)
    if exists (select * from inserted) and not exists (select * from deleted)
    begin
        insert into addressreferences
        select i.address, b.*
        from inserted i
        cross apply dbo.SplitStrings_Numbers(i.address, ' ') b
    end

    --check for pure deletes (rows in deleted only)
    if exists (select * from deleted) and not exists (select * from inserted)
    begin
        delete a
        from addressreferences a
        join deleted d
            on a.address = d.address
    end

    --updates (rows in both)
    if update(address) and exists (select * from inserted) and exists (select * from deleted)
    begin
        ---here i recommend doing the delete first, since the old address and the new one may not span an equal number of rows
        delete a
        from addressreferences a
        join deleted d
            on a.address = d.address

        --then do an insert
        insert into addressreferences
        select i.address, b.*
        from inserted i
        cross apply dbo.SplitStrings_Numbers(i.address, ' ') b
    end
end

A sequence table is a good thing. As in Louis Davidson's 'Pro Relational Database Design and Implementation', you can create it like this:
CREATE SCHEMA tools
go
CREATE TABLE tools.sequence
(
i int CONSTRAINT PKtools_sequence PRIMARY KEY
)
-- Then I will load it, up to 99999:
;WITH DIGITS (i) as(--set up a set of numbers from 0-9
SELECT i
FROM (VALUES (0),(1),(2),(3),(4),(5),(6),(7),(8),(9)) as digits (i))
--builds a table from 0 to 99999
,sequence (i) as (
SELECT D1.i + (10*D2.i) + (100*D3.i) + (1000*D4.i) + (10000*D5.i)
--+ (100000*D6.i)
FROM digits AS D1 CROSS JOIN digits AS D2 CROSS JOIN digits AS D3
CROSS JOIN digits AS D4 CROSS JOIN digits AS D5
/* CROSS JOIN digits AS D6 */)
INSERT INTO tools.sequence(i)
SELECT i
FROM sequence
Then split your input, again with code from L. Davidson's book:
DECLARE @delimitedList VARCHAR(100) = '1,2,3,4,5'

SELECT word = SUBSTRING(',' + @delimitedList + ',', i + 1,
       CHARINDEX(',', ',' + @delimitedList + ',', i + 1) - i - 1)
FROM tools.sequence
WHERE i >= 1
  AND i < LEN(',' + @delimitedList + ',') - 1
  AND SUBSTRING(',' + @delimitedList + ',', i, 1) = ','
ORDER BY i
using a space rather than a comma.
Finally, I would think of using the PIVOT operator to turn the rows into columns, but for it to work, you need to specify the maximum number of words.
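For example, capping at five words, a minimal sketch (untested against your data; yourtable(unique_id, address) and the dbo.Numbers table are assumptions) could look like this:
;WITH words AS (
    SELECT t.unique_id,
           SUBSTRING(t.address, n.Number,
                     CHARINDEX(' ', t.address + ' ', n.Number) - n.Number) AS word,
           ROW_NUMBER() OVER (PARTITION BY t.unique_id ORDER BY n.Number) AS pos
    FROM yourtable AS t
    JOIN dbo.Numbers AS n
      ON n.Number <= LEN(t.address)
     AND SUBSTRING(' ' + t.address, n.Number, 1) = ' '
)
SELECT unique_id,
       [1] AS word1, [2] AS word2, [3] AS word3, [4] AS word4, [5] AS word5
FROM words
PIVOT (MAX(word) FOR pos IN ([1], [2], [3], [4], [5])) AS p;
Words beyond the fifth are simply dropped; widen the IN list to keep more of them.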

Related

replace value in varchar(max) field with join

I have a table that contains a text field with placeholders. Something like this:
Row Notes
1. This is some notes ##placeholder130## this ##myPlaceholder##, #oneMore#. End.
2. Second row...just a ##test#.
(This table contains about 1-5k rows on average. Average number of placeholders in one row is 5-15).
Now, I have a lookup table that looks like this:
Name Value
placeholder130 Dog
myPlaceholder Cat
oneMore Cow
test Horse
(Lookup table will contain anywhere from 10k to 100k records)
I need to find the fastest way to join those placeholders from the strings to a lookup table and replace them with values. So, my result should look like this (1st row):
This is some notes Dog this Cat, Cow. End.
What I came up with was to split each row into multiple rows, one per placeholder, join those to the lookup table, and then concatenate the records back into the original row with the new values, but it takes around 10-30 seconds on average.
You could try to split the string using a numbers table and rebuild it with for xml path.
select (
select coalesce(L.Value, T.Value)
from Numbers as N
cross apply (select substring(Notes.notes, N.Number, charindex('##', Notes.notes + '##', N.Number) - N.Number)) as T(Value)
left outer join Lookup as L
on L.Name = T.Value
where N.Number <= len(notes) and
substring('##' + notes, Number, 2) = '##'
order by N.Number
for xml path(''), type
).value('text()[1]', 'varchar(max)')
from Notes
SQL Fiddle
I borrowed the string splitting from this blog post by Aaron Bertrand
SQL Server is not very fast with string manipulation, so this is probably best done client-side. Have the client load the entire lookup table, and replace the notes as they arrive.
Having said that, it can of course be done in SQL. Here's a solution with a recursive CTE. It performs one lookup per recursion step:
; with Repl as
(
select row_number() over (order by l.name) rn
, Name
, Value
from Lookup l
)
, Recurse as
(
select Notes
, 0 as rn
from Notes
union all
select replace(Notes, '##' + l.name + '##', l.value)
, r.rn + 1
from Recurse r
join Repl l
on l.rn = r.rn + 1
)
select *
from Recurse
where rn =
(
select count(*)
from Lookup
)
option (maxrecursion 0)
Example at SQL Fiddle.
Another option is a while loop to keep replacing lookups until no more are found:
declare @notes table (notes varchar(max))

insert @notes
select Notes
from Notes

while 1=1
begin
    update n
    set Notes = replace(n.Notes, '##' + l.name + '##', l.value)
    from @notes n
    outer apply
    (
        select top 1 Name, Value
        from Lookup l
        where n.Notes like '%##' + l.name + '##%'
    ) l
    where l.name is not null

    if @@rowcount = 0
        break
end

select *
from @notes
Example at SQL Fiddle.
I second the comment that T-SQL is just not suited for this operation, but if you must do it in the DB, here is an example using a function to manage the multiple replace statements.
Since you have a relatively small number of tokens in each note (5-15) and a very large number of tokens overall (10k-100k), my function first extracts the potential tokens from the input and uses that set to join to your lookup (dbo.[Lookup] below). It would be far too much work to look for an occurrence of every one of your tokens in each note.
I did a bit of perf testing using 50k tokens and 5k notes and this function runs really well, completing in <2 seconds (on my laptop). Please report back how this strategy performs for you.
Note: in your example data the token format was not consistent (##_#, ##_##, #_#); I am guessing this was simply a typo and assume all tokens take the form ##TokenName##.
--setup
if object_id('dbo.[Lookup]') is not null
    drop table dbo.[Lookup];
go
if object_id('dbo.fn_ReplaceLookups') is not null
    drop function dbo.fn_ReplaceLookups;
go

create table dbo.[Lookup] (LookupName varchar(100) primary key, LookupValue varchar(100));

insert into dbo.[Lookup]
select '##placeholder130##','Dog' union all
select '##myPlaceholder##','Cat' union all
select '##oneMore##','Cow' union all
select '##test##','Horse';
go

create function [dbo].[fn_ReplaceLookups](@input varchar(max))
returns varchar(max)
as
begin
    declare @xml xml;
    select @xml = cast(('<r><i>' + replace(@input, '##', '</i><i>') + '</i></r>') as xml);

    --extract the potential tokens
    declare @LookupsInString table (LookupName varchar(100) primary key);

    insert into @LookupsInString
    select distinct '##' + v + '##'
    from ( select [v] = r.n.value('(./text())[1]', 'varchar(100)'),
                  [r] = row_number() over (order by n)
           from @xml.nodes('r/i') r(n)
         ) d(v,r)
    where r % 2 = 0;

    --tokenize the input
    select @input = replace(@input, l.LookupName, l.LookupValue)
    from dbo.[Lookup] l
    join @LookupsInString lis on
        l.LookupName = lis.LookupName;

    return @input;
end
go
--usage
declare @Notes table ([Id] int primary key, notes varchar(100));

insert into @Notes
select 1, 'This is some notes ##placeholder130## this ##myPlaceholder##, ##oneMore##. End.' union all
select 2, 'Second row...just a ##test##.';

select *,
       dbo.fn_ReplaceLookups(notes)
from @Notes;
Returns:
Tokenized
--------------------------------------------------------
This is some notes Dog this Cat, Cow. End.
Second row...just a Horse.
Try this
;WITH CTE (org, calc, [Notes], [level]) AS
(
SELECT [Notes], [Notes], CONVERT(varchar(MAX),[Notes]), 0 FROM PlaceholderTable
UNION ALL
SELECT CTE.org, CTE.[Notes],
CONVERT(varchar(MAX), REPLACE(CTE.[Notes],'##' + T.[Name] + '##', T.[Value])), CTE.[level] + 1
FROM CTE
INNER JOIN LookupTable T ON CTE.[Notes] LIKE '%##' + T.[Name] + '##%'
)
SELECT DISTINCT org, [Notes], level FROM CTE
WHERE [level] = (SELECT MAX(level) FROM CTE c WHERE CTE.org = c.org)
SQL FIDDLE DEMO
Check the devioblog post for reference.
To get speed, you can preprocess the note templates into a more efficient form. This will be a sequence of fragments, with each ending in a substitution. The substitution might be NULL for the last fragment.
Notes
Id FragSeq Text SubsId
1 1 'This is some notes ' 1
1 2 ' this ' 2
1 3 ', ' 3
1 4 '. End.' null
2 1 'Second row...just a ' 4
2 2 '.' null
Subs
Id Name Value
1 'placeholder130' 'Dog'
2 'myPlaceholder' 'Cat'
3 'oneMore' 'Cow'
4 'test' 'Horse'
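Here's a hedged DDL sketch for those two tables (names and types are my assumptions based on the sample data above; multi-row VALUES needs SQL Server 2008+):
CREATE TABLE Subs
(
    Id int PRIMARY KEY,
    Name varchar(100),
    Value varchar(100)
);

CREATE TABLE Notes
(
    Id int,
    FragSeq int,
    Text varchar(max),
    SubsId int NULL, -- NULL marks the trailing fragment
    PRIMARY KEY (Id, FragSeq)
);

INSERT INTO Subs VALUES
    (1, 'placeholder130', 'Dog'), (2, 'myPlaceholder', 'Cat'),
    (3, 'oneMore', 'Cow'), (4, 'test', 'Horse');

INSERT INTO Notes VALUES
    (1, 1, 'This is some notes ', 1), (1, 2, ' this ', 2),
    (1, 3, ', ', 3), (1, 4, '. End.', NULL),
    (2, 1, 'Second row...just a ', 4), (2, 2, '.', NULL);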
Now we can do the substitutions with a simple join.
SELECT Notes.Text + COALESCE(Subs.Value, '')
FROM Notes
LEFT JOIN Subs ON Notes.SubsId = Subs.Id
WHERE Notes.Id = ?
ORDER BY FragSeq
This produces a list of fragments with substitutions complete. I am not an MS SQL user, but in most dialects of SQL you can concatenate these fragments into a variable quite easily:
DECLARE @Note VARCHAR(8000)

SELECT @Note = COALESCE(@Note, '') + Notes.Text + COALESCE(Subs.Value, '')
FROM Notes
LEFT JOIN Subs ON Notes.SubsId = Subs.Id
WHERE Notes.Id = ?
ORDER BY FragSeq
Pre-processing a note template into fragments will be straightforward using the string splitting techniques of other posts.
Unfortunately I'm not at a location where I can test this, but it ought to work fine.
I really don't know how it will perform with 10k+ lookups.
How does good old dynamic SQL perform?
DECLARE @sqlCommand NVARCHAR(MAX)

SELECT @sqlCommand = N'PlaceholderTable.[Notes]'

SELECT @sqlCommand = 'REPLACE( ' + @sqlCommand +
       ', ''##' + LookupTable.[Name] + '##'', ''' +
       LookupTable.[Value] + ''')'
FROM LookupTable

SELECT @sqlCommand = 'SELECT *, ' + @sqlCommand + ' FROM PlaceholderTable'

EXECUTE sp_executesql @sqlCommand
Fiddle demo
And now for some recursive CTE.
If your indexes are correctly set up, this one should be very fast or very slow. SQL Server always surprises me with performance extremes when it comes to the r-CTE...
;WITH T AS (
SELECT
Row,
StartIdx = 1, -- 1 as first starting index
EndIdx = CAST(patindex('%##%', Notes) as int), -- first ending index
Result = substring(Notes, 1, patindex('%##%', Notes) - 1)
-- (first) temp result bounded by indexes
FROM PlaceholderTable -- **this is your source table**
UNION ALL
SELECT
pt.Row,
StartIdx = newstartidx, -- starting index (calculated in calc1)
EndIdx = EndIdx + CAST(newendidx as int) + 1, -- ending index (calculated in calc4 + total offset)
Result = Result + CAST(ISNULL(newtokensub, newtoken) as nvarchar(max))
-- temp result taken from subquery or original
FROM
T
JOIN PlaceholderTable pt -- **this is your source table**
ON pt.Row = T.Row
CROSS APPLY(
SELECT newstartidx = EndIdx + 2 -- new starting index moved by 2 from last end ('##')
) calc1
CROSS APPLY(
SELECT newtxt = substring(pt.Notes, newstartidx, len(pt.Notes))
-- current piece of txt we work on
) calc2
CROSS APPLY(
SELECT patidx = patindex('%##%', newtxt) -- current index of '##'
) calc3
CROSS APPLY(
SELECT newendidx = CASE
WHEN patidx = 0 THEN len(newtxt) + 1
ELSE patidx END -- if last piece of txt, end with its length
) calc4
CROSS APPLY(
SELECT newtoken = substring(pt.Notes, newstartidx, newendidx - 1)
-- get the new token
) calc5
OUTER APPLY(
SELECT newtokensub = Value
FROM LookupTable
WHERE Name = newtoken -- substitute the token if you can find it in **your lookup table**
) calc6
WHERE newstartidx + len(newtxt) - 1 <= len(pt.Notes)
-- recurse while {new starting index} + {length of txt we work on} does not exceed the total length
)
,lastProcessed AS (
SELECT
Row,
Result,
rn = row_number() over(partition by Row order by StartIdx desc)
FROM T
) -- enumerate all (including intermediate) results
SELECT *
FROM lastProcessed
WHERE rn = 1 -- filter out intermediate results (display only last ones)

Get the value of a column replacing the comma separator

How can I get each value of a column that has a comma separator in its value?
Example:
ID ColumnUnified
1 12,34,56,78
2 80,99,70,56
What I want is a query to get the numbers without the commas. If possible, in columns.
12 34 56 78
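If the column always holds exactly four values, one quick trick worth mentioning (my addition, separate from the link below, and limited to four parts) is PARSENAME:
-- PARSENAME splits a dotted four-part name, so turn the commas into dots first.
-- Table and column names follow the sample above.
SELECT ID,
       PARSENAME(REPLACE(ColumnUnified, ',', '.'), 4) AS Value1,
       PARSENAME(REPLACE(ColumnUnified, ',', '.'), 3) AS Value2,
       PARSENAME(REPLACE(ColumnUnified, ',', '.'), 2) AS Value3,
       PARSENAME(REPLACE(ColumnUnified, ',', '.'), 1) AS Value4
FROM YourTable;
For a varying number of values you'll still want a proper splitter, like the one below.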
This will work for any number of values http://beyondrelational.com/modules/2/blogs/70/posts/10844/splitting-delimited-data-to-columns-set-based-approach.aspx
The solution Madhivanan's link refers to is very creative, but I had a slight problem with it on SQL Server 2012 related to the name of one of the columns (Start). I've modified the code in his answer to use StartPos instead of Start for the column name.
I was not familiar with the system procedure spt_values, but I found a very informative description of the procedure here on SO for those who are interested in exactly how this solution works.
Finally, here's the (slightly) revised code from Madhivanan's answer:
CREATE TABLE #test(id int, data varchar(100))

INSERT INTO #test VALUES (1,'This,is,a,test,string')
INSERT INTO #test VALUES (2,'See,if,it,can,be,split,into,many,columns')

DECLARE @pivot varchar(8000)
DECLARE @select varchar(8000)

SELECT @pivot = COALESCE(@pivot + ',', '') + '[col'
    + CAST(number + 1 AS VARCHAR(10)) + ']'
FROM master..spt_values
WHERE type = 'p'
    AND number <= ( SELECT MAX(LEN(data) - LEN(REPLACE(data, ',', '')))
                    FROM #test
                  )

SELECT @select = '
select p.*
from (
    select
        id, substring(data, StartPos+2, endPos-StartPos-2) as token,
        ''col''+cast(row_number() over(partition by id order by StartPos) as varchar(10)) as n
    from (
        select
            id, data, n as StartPos, charindex('','',data,n+2) endPos
        from (select number as n from master..spt_values where type=''p'') num
        cross join
        (
            select
                id, '','' + data +'','' as data
            from
                #test
        ) m
        where n < len(data)-1
        and substring(data,n+1,1) = '','') as data
) pvt
Pivot ( max(token)for n in (' + @pivot + '))p'

EXEC(@select)

DROP TABLE #test

Find special characters in all rows in specific columns in table

I have a database containing about 50 tables; each table has about 10-100 columns, with at most 1 million rows in each table (quite big for a newbie :P).
The database is old and some rows contain special characters (invisible characters or some weird Unicode) that I would like to remove.
I was searching Google and found a small snippet that lists all columns of a specific type:
SELECT
OBJECT_NAME(col.OBJECT_ID) AS [TableName]
,col.[name] AS [ColName]
,typ.[name] AS [TypeName]
FROM
sys.all_columns col
INNER JOIN sys.types typ
ON col.user_type_id = typ.user_type_id
WHERE
col.user_type_id IN (167,231)
AND
OBJECT_NAME(col.OBJECT_ID) = 'Orders'
This lists all columns that are varchar or nvarchar.
I found two functions: one that returns a table of all characters in a string, and a second that checks whether a string contains any special characters:
CREATE FUNCTION AllCharactersInString (@str nvarchar(max))
RETURNS TABLE
AS
RETURN
    (SELECT
        substring(B.main_string, C.int_seq, 1) AS character
        ,Unicode(substring(B.main_string, C.int_seq, 1)) AS unicode_value
     FROM
        (SELECT @str AS main_string) B,
        (SELECT A.int_seq
         FROM
            (SELECT row_number() OVER (ORDER BY name) AS int_seq
             FROM sys.all_objects) A
         WHERE A.int_seq <= len(@str)) C
    )
And second:
CREATE FUNCTION ContainsInvisibleCharacter (@str nvarchar(max))
RETURNS int
AS
BEGIN
    DECLARE @Result Int

    IF exists
        (SELECT *
         FROM AllCharactersInString(@str)
         WHERE unicode_value IN (1,9,10,11,12,13,14,28,29,31,129,141,143,144,157,160))
    BEGIN
        SET @Result = 1
    END
    ELSE
    BEGIN
        SET @Result = 0
    END

    RETURN @Result
END
My question is how to combine those two functions into one (if that is possible, and if it will be faster), and second: how to run that function on all records in all columns (of a specific type) in a table.
I have this code:
SELECT
O.Order_Id
,Rn_Descriptor
FROM
dbo.Order O
WHERE
dbo.ContainsInvisibleCharacter(O.Rn_Descriptor) = 1
AND
O.Order_Id IN (SELECT TOP 1000
Order.Order_Id
FROM
dbo.Order
WHERE
Order.Rn_Descriptor IS NOT NULL
)
But it works sooo slow :/
Maybe there is a faster way to remove the unwanted characters?
What would be fine is to find the rows containing those characters and list them; then I could check them manually.
You can do this more efficiently using LIKE.
CREATE FUNCTION ContainsInvisibleCharacter(@str nvarchar(max)) RETURNS int
AS
BEGIN
    RETURN
        (SELECT CASE WHEN @str LIKE
            '%[' + NCHAR(1) + NCHAR(9) + NCHAR(10) + NCHAR(11) + NCHAR(12)
            + NCHAR(13) + NCHAR(14) + NCHAR(28) + NCHAR(29) + NCHAR(31)
            + NCHAR(129) + NCHAR(141) + NCHAR(143) + NCHAR(144)
            + NCHAR(157) + NCHAR(160) + ']%'
         THEN 1 ELSE 0 END)
END
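For the "find the rows and list them" part, the same character class can be used inline, without the scalar function call (a sketch; table and column names follow the question's example):
-- Build the list of control characters once, then scan with a single LIKE.
DECLARE @bad nvarchar(50) =
      NCHAR(1) + NCHAR(9) + NCHAR(10) + NCHAR(11) + NCHAR(12)
    + NCHAR(13) + NCHAR(14) + NCHAR(28) + NCHAR(29) + NCHAR(31)
    + NCHAR(129) + NCHAR(141) + NCHAR(143) + NCHAR(144)
    + NCHAR(157) + NCHAR(160);

SELECT O.Order_Id, O.Rn_Descriptor
FROM dbo.[Order] AS O
WHERE O.Rn_Descriptor LIKE '%[' + @bad + ']%';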

TSQL - Querying a table column to pull out popular words for a tag cloud

Just an exploratory question to see if anyone has done this, or if, in fact, it is at all possible.
We all know what a tag cloud is, and usually a tag cloud is created by someone assigning tags. Is it possible, within the current features of SQL Server, to create this automatically, maybe via a trigger when a table has a record added or updated, by looking at the data within a certain column and picking out popular words?
It is similar to this question: How can I get the most popular words in a table via mysql?. But, that is MySQL not MSSQL.
Thanks in advance.
James
Here is a good bit on parsing a delimited string into rows:
http://anyrest.wordpress.com/2010/08/13/converting-parsing-delimited-string-column-in-sql-to-rows/
http://www.sqlteam.com/article/parsing-csv-values-into-multiple-rows
http://www.sqlteam.com/forums/topic.asp?TOPIC_ID=50648
T-SQL: Opposite to string concatenation - how to split string into multiple records
If you want to parse all words, you can use the space ' ' as your delimiter; then you get a row for each word.
Next you would simply select the result set, GROUPing by the word and aggregating the COUNT.
Order your results and you're there.
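A minimal sketch of those three steps, assuming a Posts table with a Body column (both assumed names) and using the dbo.Split function shown further down:
-- Split every body on spaces, then count and rank the words.
SELECT S.Value AS Word, COUNT(*) AS WordCount
FROM Posts AS P
CROSS APPLY dbo.Split(P.Body, ' ') AS S
WHERE LEN(S.Value) > 0 -- skip empty fragments from doubled spaces
GROUP BY S.Value
ORDER BY COUNT(*) DESC;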
IMO, the design approach is what makes this difficult. Just because you allow users to assign tags does not mean the tags must be stored as a single delimited list of words. You can normalize the structure into something like:
Create Table Posts ( Id ... not null primary key )
Create Table Tags( Id ... not null primary key, Name ... not null Unique )
Create Table PostTags
( PostId ... not null References Posts( Id )
, TagId ... not null References Tags( Id ) )
Now your question becomes trivial:
Select T.Id, T.Name, Count(*) As TagCount
From PostTags As PT
Join Tags As T
On T.Id = PT.TagId
Group By T.Id, T.Name
Order By Count(*) Desc
If you insist on storing tags as delimited values, then the only solution is to split the values on their delimiter by writing a custom Split function and then doing your count. At the bottom is an example of a Split function. With it, your query would look something like this (using a comma delimiter):
Select Tag.Value, Count(*) As TagCount
From Posts As P
Cross Apply dbo.Split( P.Tags, ',' ) As Tag
Group By Tag.Value
Order By Count(*) Desc
Split Function:
Create Function [dbo].[Split]
(
    @DelimitedList nvarchar(max)
    , @Delimiter nvarchar(2) = ','
)
RETURNS TABLE
AS
RETURN
(
    With CorrectedList As
    (
        Select Case When Left(@DelimitedList, DataLength(@Delimiter)/2) <> @Delimiter Then @Delimiter Else '' End
            + @DelimitedList
            + Case When Right(@DelimitedList, DataLength(@Delimiter)/2) <> @Delimiter Then @Delimiter Else '' End
            As List
            , DataLength(@Delimiter)/2 As DelimiterLen
    )
    , Numbers As
    (
        Select TOP (Coalesce(Len(@DelimitedList),1)) Row_Number() Over ( Order By c1.object_id ) As Value
        From sys.objects As c1
        Cross Join sys.columns As c2
    )
    Select CharIndex(@Delimiter, CL.list, N.Value) + CL.DelimiterLen As Position
        , Substring (
            CL.List
            , CharIndex(@Delimiter, CL.list, N.Value) + CL.DelimiterLen
            , Case
                When CharIndex(@Delimiter, CL.list, N.Value + 1)
                    - CharIndex(@Delimiter, CL.list, N.Value)
                    - CL.DelimiterLen < 0 Then Len(CL.List)
                Else CharIndex(@Delimiter, CL.list, N.Value + 1)
                    - CharIndex(@Delimiter, CL.list, N.Value)
                    - CL.DelimiterLen
              End
          ) As Value
    From CorrectedList As CL
    Cross Join Numbers As N
    Where N.Value < Len(CL.List)
        And Substring(CL.List, N.Value, CL.DelimiterLen) = @Delimiter
)
Word or Tag clouds need two fields: a string and a value of how many times that word or string appeared in your collection. You can then pass the results into a tag cloud tool that will display the data as you require.
Not to take away from the previous answers, as they do answer the original challenge. However, I have a simpler solution using two functions (similar to @Thomas's answer), one of which uses regex to "clean" the words.
The two functions are:
dbo.fnStripChars(a, b) --use regex 'b' to cleanse a string 'a'
dbo.fnMakeTableFromList(a, b) --convert a single field 'a' into a tabled list, delimited by 'b'
I then apply them in a single SQL statement, using the TOP n feature to give me the top 10 words I want to pass on to Power BI or some other graphical tool for actually displaying a word or tag cloud.
SELECT TOP 10 b.[words], b.[total]
FROM
    (SELECT a.[words], count(*) AS [total]
     FROM
        (SELECT upper(l.item) AS [words]
         FROM dbo.MyTableWithWords AS c
         CROSS APPLY dbo.fnMakeTableFromList(dbo.fnStripChars(c.myColumnThatHasTheWords, '[^a-zA-Z ]'), ' ') AS l) AS a
     GROUP BY a.[words]) AS b
ORDER BY 2 DESC
As you can see, the regex is [^a-zA-Z ], which is to give me only alphabetical characters and spaces. The space is then used as a delimiter to the make table function to separate each word individually. I apply a count(*), to give me the number of times that word appears, hence then I have everything I need to give me the TOP 10 results.
Note that CROSS APPLY is important here so I get only data with actual "words" in each record found. Otherwise it will go through every record with or without words to extract from the column I want.
fnStripChars()
CREATE FUNCTION [dbo].[fnStripChars]
(
    @String NVARCHAR(4000),
    @MatchExpression VARCHAR(255)
)
RETURNS NVARCHAR(MAX)
AS
BEGIN
    SET @MatchExpression = '%' + @MatchExpression + '%'

    WHILE PatIndex(@MatchExpression, @String) > 0
        SET @String = Stuff(@String, PatIndex(@MatchExpression, @String), 1, '')

    RETURN @String
END
fnMakeTableFromList()
CREATE FUNCTION [dbo].[fnMakeTableFromList](
    @List VARCHAR(MAX),
    @Delimiter CHAR(1))
RETURNS TABLE
AS
RETURN (SELECT Item = CONVERT(VARCHAR, Item)
        FROM (SELECT Item = x.i.value('(./text())[1]', 'varchar(max)')
              FROM (SELECT [XML] = CONVERT(XML, '<i>' + REPLACE(@List, @Delimiter, '</i><i>') + '</i>').query('.')) AS a
              CROSS APPLY [XML].nodes('i') AS x(i)) AS y
        WHERE Item IS NOT NULL);
I've tested this with over 400K records and it's able to come back with my results in under 60 seconds. I think that's reasonable.

SQL: break up one row into many (normalization)

I am in middle of upgrading from a poorly designed legacy database to a new database. In the old database there is tableA with fields Id and Commodities. Id is the primary key and contains an int and Commodities contains a comma delimited list.
TableA:
id | commodities
1135 | fish,eggs,meat
1127 | flour,oil
In the new database, I want tableB to be in the form id, commodity where each commodity is a single item from the comma delimited list in tableA.
TableB:
id | commodity
1135 | fish
1135 | eggs
1135 | meat
1127 | flour
1127 | oil
I have a function, functionA, that when given an id, a list, and a delimiter, returns a table with an id and item field. How can I use this function to turn the two fields from tableA into tableB?
(Note: I had trouble figuring out what to title this question. Please feel free to edit the title to make it more accurately reflect the question!)
Here is the function code:
ALTER FUNCTION dbo.functionA
(
    @id int,
    @List VARCHAR(6000),
    @Delim varchar(5)
)
RETURNS
    @ParsedList TABLE
    (
        id int,
        item VARCHAR(6000)
    )
AS
BEGIN
    DECLARE @item VARCHAR(6000), @Pos INT

    SET @List = LTRIM(RTRIM(@List)) + @Delim
    SET @Pos = CHARINDEX(@Delim, @List, 1)

    WHILE @Pos > 0
    BEGIN
        SET @item = LTRIM(RTRIM(LEFT(@List, @Pos - 1)))
        IF @item <> ''
        BEGIN
            INSERT INTO @ParsedList (id, item)
            VALUES (@id, CAST(@item AS VARCHAR(6000)))
        END
        SET @List = RIGHT(@List, LEN(@List) - @Pos)
        SET @Pos = CHARINDEX(@Delim, @List, 1)
    END

    RETURN
END
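(For reference: since functionA already returns (id, item) pairs, the whole migration can be sketched in one statement, assuming SQL Server 2005+ for CROSS APPLY and TableB(id, commodity) as described above.)
-- Populate TableB by applying functionA to every row of TableA.
INSERT INTO TableB (id, commodity)
SELECT f.id, f.item
FROM TableA AS a
CROSS APPLY dbo.functionA(a.id, a.commodities, ',') AS f;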
Here's the link I posted as a comment:
http://www.sqlteam.com/article/parsing-csv-values-into-multiple-rows
You need a way to split and process the string in TSQL; there are many ways to do this. This article covers the PROs and CONs of just about every method:
Arrays and Lists in SQL Server 2000 and Earlier
You need to create a split function. This is how a split function can be used:
SELECT *
FROM YourTable y
INNER JOIN dbo.yourSplitFunction(@Parameter) s ON y.ID = s.Value
I prefer the number table approach to splitting a string in TSQL (see 'Arrays and Lists in SQL Server 2000 and Earlier', linked above), but there are numerous ways to split strings in SQL Server; the previous link explains the PROs and CONs of each.
For the Numbers Table method to work, you need to do this one-time table setup, which will create a table Numbers containing rows from 1 to 10,000:
SELECT TOP 10000 IDENTITY(int,1,1) AS Number
INTO Numbers
FROM sys.objects s1
CROSS JOIN sys.objects s2
ALTER TABLE Numbers ADD CONSTRAINT PK_Numbers PRIMARY KEY CLUSTERED (Number)
Once the Numbers table is set up, create this split function:
CREATE FUNCTION inline_split_me (@SplitOn char(1), @param varchar(7998)) RETURNS TABLE AS
RETURN(SELECT substring(@SplitOn + @param + ',', Number + 1,
           charindex(@SplitOn, @SplitOn + @param + @SplitOn, Number + 1) - Number - 1)
           AS Value
       FROM Numbers
       WHERE Number <= len(@SplitOn + @param + @SplitOn) - 1
         AND substring(@SplitOn + @param + @SplitOn, Number, 1) = @SplitOn)
GO
You can now easily split a CSV string into a table and join on it:
select * from dbo.inline_split_me(';','1;22;333;4444;;') where LEN(Value)>0
OUTPUT:
Value
----------------------
1
22
333
4444
(4 row(s) affected)
To make your new table, use this:
--set up tables:
create table TableA (id int, commodities varchar(8000))
INSERT TableA VALUES (1135,'fish,eggs,meat')
INSERT TableA VALUES (1127,'flour,oil')
Create table TableB (id int, commodities varchar(8000))
--populate TableB
INSERT TableB
(id, commodities)
SELECT
a.id,c.value
FROM TableA a
CROSS APPLY dbo.inline_split_me(',',a.commodities) c
--show tableB contents:
select * from TableB
OUTPUT:
id commodities
----------- -------------
1135 fish
1135 eggs
1135 meat
1127 flour
1127 oil
(5 row(s) affected)
EDIT: after Conrad Frix's comment that SQL Server 2000 doesn't support CROSS APPLY,
this will do the same:
INSERT TableB
(id, commodities)
SELECT
a.id,NullIf(SubString(',' + a.commodities + ',' , number , CharIndex(',' , ',' + a.commodities + ',' , number) - number) , '')
FROM TableA a
INNER JOIN Numbers n ON 1=1
WHERE SubString(',' + a.commodities + ',' , number - 1, 1) = ','
AND CharIndex(',' , ',' + a.commodities + ',' , number) - number > 0
AND number <= Len(',' + a.commodities + ',')
and is based on the code from the link in the answer by @Rup. It basically removes the function call and does the split in the main query (using a similar Numbers table split), so there's no need for CROSS APPLY.
You write a SQL batch that loops through table A and inserts into table B the results of your function call.
Call me lazy, but I'd pull the combined rows out of the database, split them, then reinsert the split rows. This kind of thing seems kind of unnatural for SQL...
SSIS has a pretty handy Unpivot transform if that's available to you.
create table Project (ProjectId int, Description varchar(50));
insert into Project values (1, 'Chase tail, change directions');
insert into Project values (2, 'ping-pong ball in clothes dryer');
create table ProjectResource (ProjectId int, ResourceId int, Name varchar(15));
insert into ProjectResource values (1, 1, 'Adam');
insert into ProjectResource values (1, 2, 'Kerry');
insert into ProjectResource values (1, 3, 'Tom');
insert into ProjectResource values (2, 4, 'David');
insert into ProjectResource values (2, 5, 'Jeff');
-- a bit of SQL magic involving XML and voila
SELECT *,
(SELECT Name + ' ' AS [text()]
FROM ProjectResource pr
WHERE pr.ProjectId = p.ProjectId
FOR XML PATH ('')) AS ResourceList
FROM Project p
The output of this will be :
ProjectId Description ResourceList
1 Chase tail, change directions Adam Kerry Tom
2 ping-pong ball in clothes dryer David Jeff