Split data in a field into rows using SQL 2000 - sql

Please help me to find a solution. I have data in a table like this:
ID  Code
1   123,456,789,12
2   456,073
3   69,76,56
I need to list each code in its own row:
ID  Code            Ref
1   123,456,789,12  123
1   123,456,789,12  456
1   123,456,789,12  789
1   123,456,789,12  12
2   456,073         456
2   456,073         073
3   69,76,56        69
3   69,76,56        76
3   69,76,56        56
How do I do this in a query? I'll be using the value in the Ref column to join to columns in other tables.
Thanks for your support.

My first advice is to normalize your database. A column should contain a single piece of information. Your comma-delimited values violate this rule, which is why you're facing such difficulty. Since people seldom take that advice though, here's a kludge which might work for you. Since you're joining this to another table, you don't really need to separate out each value into its own row; you just need to be able to find a matching value in your column:
SELECT
T1.id,
T1.code,
T2.ref
FROM
My_Table T1
INNER JOIN Table_I_Am_Joining T2 ON
T1.code LIKE '%,' + CAST(T2.ref AS VARCHAR(20)) + ',%' OR
T1.code LIKE CAST(T2.ref AS VARCHAR(20)) + ',%' OR
T1.code LIKE '%,' + CAST(T2.ref AS VARCHAR(20)) OR
T1.code = CAST(T2.ref AS VARCHAR(20))
This relies on the codes in your column being in an exact format: comma-delimited with no spaces. If that's not the case then this will likely not return what you're trying to get.
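If stray spaces are a concern, a variation on the same kludge (my sketch, not part of the original answer) strips spaces and wraps the list in commas so a single LIKE covers all four cases:
SELECT
T1.id,
T1.code,
T2.ref
FROM
My_Table T1
INNER JOIN Table_I_Am_Joining T2 ON
',' + REPLACE(T1.code, ' ', '') + ',' LIKE '%,' + CAST(T2.ref AS VARCHAR(20)) + ',%'
Because every value in the wrapped string is now delimited on both sides, the first, middle, and last positions no longer need separate predicates.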

The answer is to normalize your database.
In the meantime, a workaround that will perform better on large sets is to use a temp table. (LIKE searches can't use an index.)
This approach also shows some steps towards normalizing the data, and it handles whitespace.
First create a "Tally Table" if you don't have one. This is a one-time deal, and Tally tables come in handy for all kinds of things.
/*--- Create a Tally table. This only needs to be done once.
Note that "Master.dbo.SysColumns" is in all SQL 2000 installations.
For SQL 2005, or later, use "master.sys.all_columns".
*/
SELECT TOP 11000 -- Adequate for most business purposes.
IDENTITY (INT, 1, 1) AS N
INTO
dbo.Tally
FROM
Master.dbo.SysColumns sc1,
Master.dbo.SysColumns sc2
--- Add a Primary Key to maximize performance.
ALTER TABLE dbo.Tally
ADD CONSTRAINT PK_Tally_N PRIMARY KEY CLUSTERED (N) WITH FILLFACTOR = 100
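As a quick optional sanity check (my addition, not part of the original steps), confirm the range and row count:
SELECT MIN(N) AS MinN, MAX(N) AS MaxN, COUNT(*) AS RowCnt
FROM dbo.Tally
-- Expect MinN = 1, MaxN = 11000, RowCnt = 11000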
Now suppose your tables are:
CREATE TABLE ListO_Codes (ID INT IDENTITY(1,1), Code VARCHAR(88))
INSERT INTO ListO_Codes (Code)
SELECT '123,456,789,12' UNION ALL
SELECT '456,073' UNION ALL
SELECT '69,76,56'
CREATE TABLE AnotherTable (ID INT IDENTITY(1,1), Ref VARCHAR(8), CodeWord VARCHAR (88))
INSERT INTO AnotherTable (Ref, CodeWord)
SELECT '12', 'Children' UNION ALL
SELECT '123', 'of' UNION ALL
SELECT '456', '-' UNION ALL
SELECT '789', 'sun,' UNION ALL
SELECT '073', 'see' UNION ALL
SELECT '56', 'your' UNION ALL
SELECT '69', 'time' UNION ALL
SELECT '76', 'has'
Then the temp table is:
CREATE TABLE #NORMALIZED_Data (LOD_id INT, Ref int) -- Make Ref varchar if it's not numeric
INSERT INTO
#NORMALIZED_Data (LOD_id, Ref)
SELECT
L.ID,
-- Split Code string using Tally table and Delimiters
LTrim (RTrim (SUBSTRING (',' + L.Code + ',', T.N+1, CHARINDEX (',', ',' + L.Code + ',', T.N+1) - T.N - 1 ) ) )
FROM
dbo.Tally T,
ListO_Codes L
WHERE
T.N < LEN (',' + L.Code + ',')
AND
SUBSTRING (',' + L.Code + ',', T.N, 1) = ','
--- Index for performance
CREATE CLUSTERED INDEX CL_NORMALIZED_Data_LOD_id_Ref
ON #NORMALIZED_Data (LOD_id, Ref) WITH FILLFACTOR = 100
Then the search is:
SELECT
L.ID,
L.Code,
A.Ref,
A.CodeWord
FROM
#NORMALIZED_Data N
INNER JOIN
ListO_Codes L ON N.LOD_id = L.ID
LEFT JOIN
AnotherTable A ON N.Ref = A.Ref
ORDER BY
L.ID,
A.Ref
And the results are:
ID Code           Ref CodeWord
-- -------------- --- --------
1  123,456,789,12 12  Children
1  123,456,789,12 123 of
1  123,456,789,12 456 -
1  123,456,789,12 789 sun,
2  456,073        073 see
2  456,073        456 -
3  69,76,56       56  your
3  69,76,56       69  time
3  69,76,56       76  has

Related

Advice on improving query performance

I have 5 tables, each with tens of thousands of records:
1 main/very important table (TABLE A)
2 other tables (TABLES B/C) that are still important, but not as important as table A
2 side tables (TABLES D/E) that hold primary keys between A<=>B and A<=>C, i.e. they only have two columns each
The 3 main tables have ~140 columns each, and all have the same column names.
The purpose of my query is to perform column-level matching between all the tables, A<=>D<=>B and A<=>E<=>C, in one query.
The final query will have about 286 columns (two ID columns from each main table, plus the fn_TESTMatcher result columns):
select tableA.ID1 as [TABLEAID1],
tableA.ID2 as [TABLEAID2],
tableB.ID1 as [TABLEBID1],
tableB.ID2 as [TABLEBID2],
tableC.ID1 as [TABLECID1],
tableC.ID2 as [TABLECID2],
fn_TESTMatcher(tableA.[postCode], tableB.[postCode]) as
[TABLEAB.postCode.RESULT],
fn_TESTMatcher(tableA.[CityCode], tableB.[CityCode]) as
[TABLEAB.CityCode.RESULT],
.
.
. x238 more 'fn_TESTMatcher(...) as xyz' columns
.
INTO #Results
From tableA WITH (NOLOCK)
FULL JOIN tableD WITH (NOLOCK) ON tableA.ID1 = tableD.A
FULL JOIN tableB WITH (NOLOCK) ON tableD.B = tableB.ID1
FULL JOIN tableE WITH (NOLOCK) ON tableA.ID1 = tableE.A
FULL JOIN tableC WITH (NOLOCK) ON tableE.B = tableC.ID1
fn_TESTMatcher is a function: it is fed the same column from two main tables, removes/replaces special characters/abbreviations, and then tries to match the values. If they match it returns a bit '1'; if not, a bit '0'.
At the moment the query takes about a day to run (I can't really time it with some sort of query timer). I can comment out all the columns except the last and run it, and it's fairly quick, but I don't think I can just scale that up.
Does anyone have some advice? My first thought is to start googling what indexes are and maybe apply one to the ID1 of every table, although I'm a bit hesitant about a) messing up my tables and b) adding an index that ends up being useless.
===========================================
Update 1: table-structure-wise, all the columns in all the main tables are varchars, length 100-250 characters, where ID (the primary key) is not nullable.
The two side tables just have two columns, both varchar, 100-character limit (they're both foreign keys). The most important table's ID in this is not nullable.
For functions, I technically have two:
CREATE FUNCTION [dbo].[fn_TESTStripCharacters]
(
@String NVARCHAR(MAX) ,
@MatchExpression VARCHAR(255)
)
RETURNS NVARCHAR(MAX)
AS
BEGIN
DECLARE @expres VARCHAR(50) = '%[~,#,@,^,_,+,-,$,%,&,/,|,\,*,(,),.,!,`,:,<,>,?]%'
WHILE PATINDEX( @expres, @String ) > 0
SET @String = REPLACE(REPLACE(REPLACE( @String, SUBSTRING( @String, PATINDEX( @expres, @String ), 1 ),''),';',''),'-','')
RETURN @String
END
And the second function:
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
ALTER FUNCTION [dbo].[fn_TESTMatcher](@Field1 NVARCHAR(MAX), @Field2 NVARCHAR(MAX))
RETURNS BIT
AS
BEGIN
SET @Field1 = UPPER(LTRIM(RTRIM(REPLACE(dbo.fn_TESTStripCharacters(@Field1,@SpecialCharacters),'-',''))))
SET @Field2 = UPPER(LTRIM(RTRIM(REPLACE(dbo.fn_TESTStripCharacters(@Field2,@SpecialCharacters),'-',''))))
SET @Field1 = REPLACE(@Field1,' RD ',' ROAD ')
SET @Field2 = REPLACE(@Field2,' RD ',' ROAD ')
SET @Field1 = REPLACE(@Field1,' ST ',' STREET ')
SET @Field2 = REPLACE(@Field2,' ST ',' STREET ')
SET @Field1 = REPLACE(@Field1,' ','')
SET @Field2 = REPLACE(@Field2,' ','')
RETURN
CASE WHEN @Field1=@Field2
THEN '1'
ELSE '0'
END
END
=============================
Update 2:
Example table data, assuming the same two records exist in all 3 tables (not always the case):
TableA (main + most important table):
ID1   ID2  postCode cityCode   ................
10001 1221 IG11PJ   London     ................
10230 1022 IG22PJ   Nottingham ................
tableB (slightly less important table):
ID1   ID2  postCode cityCode   ................
10031 1011 IG1 1PJ  london     ................
10980 982  IG2 2PJ  nottingham ................
tableC (slightly less important table):
ID1   ID2  postCode cityCode   ................
10551 1011 iG1 1pj  london     ................
20980 982  iG2 2pJ  nottingham ................
tableD (side table):
A     B
10001 10031
10230 10980
tableE (side table):
A     B
10001 10551
10230 20980
If Tables A, B, and C should be identical save for formatting differences, I would suggest you create 3 CTEs: the first selecting the TableA ID and a HASHBYTES of all other columns (the columns will need to be cast to char/varchar so any formatting and replacing can take place there), the second CTE the same for Table B, and the third for Table C.
Then just match the HASHBYTES values.
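A minimal sketch of that idea for just two of the ~140 columns (column names taken from the sample data; extend the concatenation to the remaining columns, and note HASHBYTES input is limited to 8000 bytes before SQL Server 2016):
;with A as
(
select ID1, HB = HASHBYTES('SHA1',
upper(replace(isnull(postCode, ''), ' ', '')) + '|' + upper(isnull(cityCode, '')))
from tableA
),
B as
(
select ID1, HB = HASHBYTES('SHA1',
upper(replace(isnull(postCode, ''), ' ', '')) + '|' + upper(isnull(cityCode, '')))
from tableB
)
select a.ID1 as TABLEAID1, b.ID1 as TABLEBID1,
case when a.HB = b.HB then 1 else 0 end as AllColumnsMatch
from A a
join tableD d on a.ID1 = d.A
join B b on d.B = b.ID1
The '|' separator guards against two different column values concatenating into the same string.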
As has already been said though, without sample data, table structures, DDL for the function etc., we are just guessing.
Sean and Milney both make very good points regarding scalar vs. inline table functions and the use of NOLOCK.
I see this as a task that does not belong in one query. I would create a new set of tables (or reuse these tables if you have a backup/don't need to preserve the data) and then perform your data-cleaning steps into those new tables.
Once you are happy the data has been normalized, do a single query to compare the tables.
Trying to put it all in one query gives no advantage, and you can't make stepwise progress. For example, if you find you forgot to strip spaces out of one field, you have to redo EVERYTHING. If you make new tables with the "cleaned" data, you can incrementally invest time in cleaning the data (which is clearly the slow part of this process) until the data is perfect, and then run your quick comparison. Forgot something? It is a relatively quick update and re-run.
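A sketch of that staging step for a couple of columns (table and column names assumed from the question; the cleaning reuses the question's own function, and you would repeat the pattern for the remaining columns):
select ID1, ID2,
dbo.fn_TESTStripCharacters(postCode, '%[~,#,@,^,_,+,-,$,%,&,/,|,\,*,(,),.,!,`,:,<,>,?]%') as postCode,
dbo.fn_TESTStripCharacters(cityCode, '%[~,#,@,^,_,+,-,$,%,&,/,|,\,*,(,),.,!,`,:,<,>,?]%') as cityCode
into CleanTableA
from tableA
Forgot a cleaning rule? Re-run just this step against the source table and the comparison query stays untouched.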
Instead of all the headaches you are running through, making copies of everything and then trying to parse based on functions that cannot be optimized, I would suggest the following. You state you have columns that get stripped of special characters. I would add a "CleanKey" column for each table and represented column. Then, via a possible table trigger, or before the add/save, pre-clean that value into the "CleanKey" column and you are done. Then have an index on THOSE "clean" columns and do a direct join.
Since the rest of the system does not know of these "clean" columns, you can add the columns, clean them with whatever function you have, and not worry about duplicating or otherwise ruining other data.
Yes, it may take a bit to pre-clean these columns, but then it's DONE. Your query should be fast after that.
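As a sketch of that approach for one column (all names hypothetical), the one-off setup might look like:
ALTER TABLE tableA ADD CleanPostCode NVARCHAR(250) NULL
GO
-- backfill once, reusing whatever cleaning logic you already have
UPDATE tableA
SET CleanPostCode = REPLACE(REPLACE(REPLACE(
UPPER(LTRIM(RTRIM(dbo.fn_TESTStripCharacters(postCode, '%[~,#,@,^,_,+,-,$,%,&,/,|,\,*,(,),.,!,`,:,<,>,?]%')))),
' RD ', ' ROAD '), ' ST ', ' STREET '), ' ', '')
GO
-- index the pre-cleaned column so the comparison join can seek
CREATE NONCLUSTERED INDEX IX_tableA_CleanPostCode ON tableA (CleanPostCode)
After that, the match becomes a plain equality join on CleanPostCode instead of a per-row function call.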
I would agree with the others that cleaning these string values would be a good idea. But since you still need to accomplish that, and I absolutely hate loops and scalar functions with a passion, I decided to roll up an inline table-valued function instead of these two nested scalar functions. I am not using any loops here, and the performance might surprise you.
I am using a tally or numbers table for this. I like to keep one of these around as a view. Here is the code for the view I use.
create View [dbo].[cteTally] as
WITH
E1(N) AS (select 1 from (values (1),(1),(1),(1),(1),(1),(1),(1),(1),(1))dt(n)),
E2(N) AS (SELECT 1 FROM E1 a, E1 b), --10E+2 or 100 rows
E4(N) AS (SELECT 1 FROM E2 a, E2 b), --10E+4 or 10,000 rows max
cteTally(N) AS
(
SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E4
)
select N from cteTally
GO
Then you can use this tally table to derive a set-based approach that accommodates the business rules you have for deciding whether two values match. You also don't need comma delimiters in here: in your example you had a comma nearly every other character in the list of characters to remove. A single instance of each character is sufficient.
create function [dbo].[fn_TESTMatcher_Sean]
(
@Field1 nvarchar(max)
, @Field2 nvarchar(max)
, @CharsToRemove nvarchar(max)
) returns table as
RETURN
with MyValues1 as
(
select substring(@Field1, N, 1) as MyChar
, t.N
from cteTally t
where N <= len(@Field1)
and charindex(substring(@Field1, N, 1), @CharsToRemove) = 0
)
, MyValues2 as
(
select substring(@Field2, N, 1) as MyChar
, t.N
from cteTally t
where N <= len(@Field2)
and charindex(substring(@Field2, N, 1), @CharsToRemove) = 0
)
select convert(bit, case when mv1.MyResult = mv2.MyResult then 1 else 0 end) as IsMatch
from
(
select distinct MyResult =
replace(
replace(replace(STUFF((select MyChar + ''
from MyValues1 mv2
order by mv2.N
FOR XML PATH(''),TYPE).value('.','NVARCHAR(MAX)'), 1, 0, '')
, ' RD ', ' ROAD ')
, ' ST ', ' STREET ')
, ' ', '')
from MyValues1 mv
) mv1
cross join
(
select distinct MyResult =
replace(
replace(replace(STUFF((select MyChar + ''
from MyValues2 mv2
order by mv2.N
FOR XML PATH(''),TYPE).value('.','NVARCHAR(MAX)'), 1, 0, '')
, ' RD ', ' ROAD ')
, ' ST ', ' STREET ')
, ' ', '')
from MyValues2 mv
) mv2
;
Give this a shot and let me know if this works in your environment.
For example:
select *
from fn_TESTMatcher_Sean('123 any st rd or something', '123 any street road or something', '%[~,#@^_+-$%&/|\*().!`:<>?]%')
The above returns 1 because they are a match under the rules you defined.
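To plug the inline TVF into the original query shape you would APPLY it per row pair rather than call it like a scalar function. A sketch of that wiring (my assumption of the intended usage, with the long column list elided):
select tableA.ID1 as [TABLEAID1],
tableB.ID1 as [TABLEBID1],
pc.IsMatch as [TABLEAB.postCode.RESULT]
from tableA
full join tableD on tableA.ID1 = tableD.A
full join tableB on tableD.B = tableB.ID1
outer apply dbo.fn_TESTMatcher_Sean(tableA.postCode, tableB.postCode, '%[~,#@^_+-$%&/|\*().!`:<>?]%') pc
OUTER APPLY (rather than CROSS APPLY) keeps the rows that the FULL JOINs produce with NULLs on one side.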

How to change the information in this table into an easy-to-use form?

I have a legacy product that I have to maintain. One of the tables is somewhat similar to the following example:
DECLARE @t TABLE
(
id INT,
DATA NVARCHAR(30)
);
INSERT INTO @t
SELECT 1,
'name: Jim Ey'
UNION ALL
SELECT 2,
'age: 43'
UNION ALL
SELECT 3,
'----------------'
UNION ALL
SELECT 4,
'name: Johnson Dom'
UNION ALL
SELECT 5,
'age: 34'
UNION ALL
SELECT 6,
'----------------'
UNION ALL
SELECT 7,
'name: Jason Thwe'
UNION ALL
SELECT 8,
'age: 22'
SELECT *
FROM @t;
/*
You will get the following result
id DATA
----------- ------------------------------
1 name: Jim Ey
2 age: 43
3 ----------------
4 name: Johnson Dom
5 age: 34
6 ----------------
7 name: Jason Thwe
8 age: 22
*/
Now I want to get the information in the following form:
name age
-------------- --------
Jim Ey 43
Johnson Dom 34
Jason Thwe 22
What's the easiest way to do this?
Thanks.
Out of (slightly morbid) curiosity I tried to come up with a means of transforming the exact input data you have provided.
Far better, of course, would be to properly structure the original data. With a legacy system, this may not be possible, but an ETL process could be created to bring this information into an intermediate location so that an ugly query like this would not need to be run in real time.
Example #1
This example assumes that all IDs are consistent and sequential (otherwise, an additional ROW_NUMBER() column or a new identity column would need to be used to guarantee correct remainder operations on ID).
SELECT
Name = REPLACE( Name, 'name: ', '' ),
Age = REPLACE( Age, 'age: ', '' )
FROM
(
SELECT
Name = T2.Data,
Age = T1.Data,
RowNumber = ROW_NUMBER() OVER( ORDER BY T1.Id ASC )
FROM @t T1
INNER JOIN @t T2 ON T1.id = T2.id +1 -- offset by one to combine two rows
WHERE T1.id % 3 != 0 -- skip delimiter records
) Q1
-- skip every other record (minus delimiters, which have already been stripped)
WHERE RowNumber % 2 != 0
Example #2: No Dependency on Sequential IDs
This is a more practical example because the actual ID values do not matter, only the row sequence.
DECLARE @NumberedData TABLE( RowNumber INT, Data VARCHAR( 100 ) );
INSERT @NumberedData( RowNumber, Data )
SELECT
RowNumber = ROW_NUMBER() OVER( ORDER BY id ASC ),
Data
FROM @t;
SELECT
Name = REPLACE( N2.Data, 'name: ', '' ),
Age = REPLACE( N1.Data, 'age: ', '' )
FROM @NumberedData N1
INNER JOIN @NumberedData N2 ON N1.RowNumber = N2.RowNumber + 1
WHERE ( N1.RowNumber % 3 ) = 2;
DELETE @NumberedData;
Example #3: Cursor
Again, it would be best to avoid running a query like this in real time and use a scheduled, transactional ETL process. In my experience, semi-structured data like this is prone to anomalies.
While examples #1 and #2 (and the solutions provided by others) demonstrate clever ways of working with the data, a more practical way to transform it would be a cursor. Why? It may actually perform better (no nested queries, recursion, pivoting, or row numbering), and even if it is slower it provides much better opportunities for error handling.
-- this could be a table variable, temp table, or staging table
DECLARE @Results TABLE ( Name VARCHAR( 100 ), Age INT );
DECLARE @Index INT = 0, @Data VARCHAR( 100 ), @Name VARCHAR( 100 ), @Age INT;
DECLARE Person_Cursor CURSOR FOR SELECT Data FROM @t;
OPEN Person_Cursor;
FETCH NEXT FROM Person_Cursor INTO @Data;
WHILE( 1 = 1 ) BEGIN -- busy loop so we can handle the iteration following completion
IF( @Index = 2 ) BEGIN
INSERT @Results( Name, Age ) VALUES( @Name, @Age );
SET @Index = 0;
END
ELSE BEGIN
-- optional: examine @Data for integrity
IF( @Index = 0 ) SET @Name = REPLACE( @Data, 'name: ', '' );
IF( @Index = 1 ) SET @Age = CAST( REPLACE( @Data, 'age: ', '' ) AS INT );
SET @Index = @Index + 1;
END
-- optional: examine @Index to see that there are no superfluous trailing
-- rows or rows omitted at the end.
IF( @@FETCH_STATUS != 0 ) BREAK;
FETCH NEXT FROM Person_Cursor INTO @Data;
END
CLOSE Person_Cursor;
DEALLOCATE Person_Cursor;
Performance
I created sample source data of 100K rows and the three aforementioned examples seem roughly equivalent for transforming data.
I created a million rows of source data and a query similar to the following gives excellent performance for selecting a subset of rows (such as would be used in a grid on a web page or a report).
-- INT IDENTITY( 1, 1 ) numbers the rows for us
DECLARE @NumberedData TABLE( RowNumber INT IDENTITY( 1, 1 ), Data VARCHAR( 100 ) );
-- subset selection; ordering/filtering can be done here but it will need to preserve
-- the original 3 rows-per-result structure and it will impact performance
INSERT @NumberedData( Data )
SELECT TOP 1000 Data FROM @t;
SELECT
N1.RowNumber,
Name = REPLACE( N2.Data, 'name: ', '' ),
Age = REPLACE( N1.Data, 'age: ', '' )
FROM @NumberedData N1
INNER JOIN @NumberedData N2 ON N1.RowNumber = N2.RowNumber + 1
WHERE ( N1.RowNumber % 3 ) = 2;
DELETE @NumberedData;
I'm seeing execution times of 4-10ms (i7-3960x) against a set of a million records.
Given that table you can do this:
;WITH DATA
AS
(
SELECT
SUBSTRING(t.DATA,CHARINDEX(':',t.DATA)+2,LEN(t.DATA)) AS value,
SUBSTRING(t.DATA,0,CHARINDEX(':',t.DATA)) AS ValueType,
ID,
ROW_NUMBER() OVER(ORDER BY ID) AS RowNbr
FROM
@t AS t
WHERE
NOT t.DATA='----------------'
)
, RecursiveCTE
AS
(
SELECT
Data.RowNbr,
Data.value,
Data.ValueType,
NEWID() AS ID
FROM
Data
WHERE
Data.RowNbr=1
UNION ALL
SELECT
Data.RowNbr,
Data.value,
Data.ValueType,
CASE
WHEN Data.ValueType='age'
THEN RecursiveCTE.ID
ELSE NEWID()
END AS ID
FROM
Data
JOIN RecursiveCTE
ON RecursiveCTE.RowNbr+1=Data.RowNbr
)
SELECT
pvt.name,
pvt.age
FROM
(
SELECT
ID,
value,
ValueType
FROM
RecursiveCTE
) AS SourceTable
PIVOT
(
MAX(Value)
FOR ValueType IN ([name],[age])
) AS pvt
Output
Name        Age
----------- ---
Jim Ey      43
Jason Thwe  22
Johnson Dom 34
Here's another option if you upgrade to SQL Server 2012, which implements the OVER clause for aggregate functions. This approach will allow you to choose only those tags that you know you want and find them regardless of how many rows there are between names.
This will also work if the names and ages are not always in the same order within a group of rows representing a single person.
with Ready2Pivot(tag,val,part) as (
select
CASE WHEN DATA like '_%:%' THEN SUBSTRING(DATA,1,CHARINDEX(':',DATA)-1) END as tag,
CASE WHEN DATA like '_%:%' THEN SUBSTRING(DATA,CHARINDEX(':',DATA)+1,8000) END as val,
max(id * CASE WHEN DATA LIKE 'name:%' THEN 1 ELSE 0 END)
over (
order by id
)
from @t
where DATA like '_%:%'
)
select [name], [age]
from Ready2Pivot
pivot (
max(val)
for tag in ([name], [age])
) as p
If your legacy data has an entry with extra items (say "altName: Jimmy"), this query will ignore it. If your legacy data has no row (and no id number) for someone's age, it will give you NULL in that spot. It will associate all information with the closest preceding row with "name: ..." as the DATA, so it is important that every group of rows has a "name: ..." row.

How to check if first five characters of one field match another?

Assuming I have the following table:
AAAAAA
AAAAAB
CCCCCC
How could I craft a query that would let me know that AAAAAA and AAAAAB are similar (as they share five characters in a row)? Ideally I would like to write this as a query that would check if the two fields share five characters in a row anywhere in the string, but this seems outside the scope of SQL and something I should write in a C# application?
Ideally the query would add another column that displays: Similar to 'AAAAAA', 'AAAAAB'
I suggest you do not try to violate 1NF by introducing a multi-valued attribute.
Noting that SUBSTRING is highly portable:
WITH T
AS
(
SELECT *
FROM (
VALUES ('AAAAAA'),
('AAAAAB'),
('CCCCCC')
) AS T (data_col)
)
SELECT T1.data_col,
T2.data_col AS data_col_similar_to
FROM T AS T1, T AS T2
WHERE T1.data_col < T2.data_col
AND SUBSTRING(T1.data_col, 1, 5)
= SUBSTRING(T2.data_col, 1, 5);
Alternatively:
T1.data_col LIKE SUBSTRING(T2.data_col, 1, 5) + '%';
This will find all matches, including those in the middle of the word, but it will not perform well on a big table:
declare @t table(a varchar(20))
insert @t select 'AAAAAA'
insert @t select 'AAAAAB'
insert @t select 'CCCCCC'
insert @t select 'ABCCCCC'
insert @t select 'DDD'
declare @compare smallint = 5
;with cte as
(
select a, left(a, @compare) suba, 1 h
from @t
union all
select a, substring(a, h + 1, @compare), h+1
from cte where cte.h + @compare <= len(a)
)
select t.a, cte.a match from @t t
-- if you don't want the null matches, remove the 'left' from this join
left join cte on charindex(suba, t.a) > 0 and t.a <> cte.a
group by t.a, cte.a
Result:
a                    match
-------------------- --------------------
AAAAAA               AAAAAB
AAAAAB               AAAAAA
ABCCCCC              CCCCCC
CCCCCC               ABCCCCC
You can use LEFT to compare the first five characters, and you can use FOR XML PATH to concatenate the similar strings into one column.
declare @T table
(
ID int identity primary key,
Col varchar(10)
)
insert into @T values
('AAAAAA'),
('AAAAAB'),
('AAAAAC'),
('CCCCCC')
select Col,
stuff((select ','+T2.Col
from @T as T2
where left(T1.Col, 5) = left(T2.Col, 5) and
T1.ID <> T2.ID
for xml path(''), type).value('.', 'varchar(max)'), 1, 1, '') as Similar
from @T as T1
Result:
Col        Similar
---------- -------------------------
AAAAAA     AAAAAB,AAAAAC
AAAAAB     AAAAAA,AAAAAC
AAAAAC     AAAAAA,AAAAAB
CCCCCC     NULL

Parse SQL field into multiple rows

How can I take a SQL table that looks like this:
MemberNumber JoinDate  Associate
1234         1/1/2011  A1 free A2 upgrade A31
5678         3/15/2011 A4
9012         5/10/2011 free
And output (using a view or writing to another table or whatever is easiest) this:
MemberNumber Date
1234-P       1/1/2011
1234-A1      1/1/2011
1234-A2      1/1/2011
1234-A31     1/1/2011
5678-P       3/15/2011
5678-A4      3/15/2011
9012-P       5/10/2011
Where each row results in a "-P" (primary) output line as well as any A# (associate) lines. The Associate field can contain a number of different non-"A#" values, but the "A#"s are all I'm interested in (# is from 1 to 99). There can be many "A#"s in that one field too.
Of course a table redesign would greatly simplify this query, but sometimes we just need to get it done. I wrote the below query using multiple CTEs; I find it's easier to follow and see exactly what's going on, but you could simplify it further once you grasp the technique.
To inject your "P" primary row, you will see that I simply jammed it into the Associate column, but it might be better placed in a simple UNION outside the CTEs.
In addition, if you do choose to refactor your schema the below technique can be used to "split" your Associate column into rows.
;with
Split (MemberNumber, JoinDate, AssociateItem)
as ( select MemberNumber, JoinDate, p.n.value('(./text())[1]','varchar(25)')
from ( select MemberNumber, JoinDate, n=cast('<n>'+replace(Associate + ' P',' ','</n><n>')+'</n>' as xml).query('.')
from @t
) a
cross apply n.nodes('n') p(n)
)
select MemberNumber + '-' + AssociateItem,
JoinDate
from Split
where left(AssociateItem, 1) in ('A','P')
order
by MemberNumber;
The XML method is not a great option performance-wise, as its speed degrades as the number of items in the "array" increases. If you have long arrays the following approach might be of use to you:
--* should be physical table, but use this cte if needed
--;with
--number (n)
--as ( select top(50) row_number() over(order by number) as n
-- from master..spt_values
-- )
select MemberNumber + '-' + substring(Associate, n, isnull(nullif(charindex(' ', Associate + ' P', n)-1, -1), len(Associate)) - n+1),
JoinDate
from ( select MemberNumber, JoinDate, Associate + ' P' from @t
) t (MemberNumber, JoinDate, Associate)
cross
apply number n
where n <= convert(int, len(Associate)) and
substring(' ' + Associate, n, 1) = ' ' and
left(substring(Associate, n, isnull(nullif(charindex(' ', Associate, n)-1, -1), len(Associate)) - n+1), 1) in ('A', 'P');
Try this new version:
declare @t table (MemberNumber varchar(8), JoinDate date, Associate varchar(50))
insert into @t values ('1234', '1/1/2011', 'A1 free A2 upgrade A31'),('5678', '3/15/2011', 'A4'),('9012', '5/10/2011', 'free')
;with b(f, t, membernumber, joindate, associate)
as
(
select 1, 0, membernumber, joindate, Associate
from @t
union all
select t+1, charindex(' ',Associate + ' ', t+1), membernumber, joindate, Associate
from b
where t < len(Associate)
)
select MemberNumber + case when t = 0 then '-P' else '-'+substring(Associate, f,t-f) end NewMemberNumber, JoinDate
from b
where t = 0 or substring(Associate, f,1) = 'A'
--where t = 0 or substring(Associate, f,2) like 'A[1-9]'
-- order by MemberNumber, t
Result is the same as the requested output.
I would recommend altering your database structure by adding a link table instead of the "Associate" column. A link table would consist of two or more columns like this:
MemberNumber Associate Details
-----------------------------------
1234         A1        free
1234         A2        upgrade
1234         A31
5678         A4
Then the desired result can be obtained with a simple JOIN:
SELECT CONCAT(m.`MemberNumber`, '-', 'P'), m.`JoinDate`
FROM `members` m
UNION
SELECT CONCAT(m.`MemberNumber`, '-', IFNULL(a.`Associate`, 'P')), m.`JoinDate`
FROM `members` m
RIGHT JOIN `members_associates` a ON m.`MemberNumber` = a.`MemberNumber`
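The query above is written in MySQL syntax; since the rest of the thread is SQL Server, a straight T-SQL translation of the same join (table names as assumed above) would be:
SELECT m.MemberNumber + '-' + 'P', m.JoinDate
FROM members m
UNION
SELECT m.MemberNumber + '-' + ISNULL(a.Associate, 'P'), m.JoinDate
FROM members m
RIGHT JOIN members_associates a ON m.MemberNumber = a.MemberNumber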

Table Normalization (Parse comma separated fields into individual records)

I have a table like this:
Device
DeviceId Parts
1        Part1, Part2, Part3
2        Part2, Part3, Part4
3        Part1
I would like to create a table 'Parts' and export the data from the Parts column to the new table. I will drop the Parts column after that.
Expected result
Parts
PartId PartName
1      Part1
2      Part2
3      Part3
4      Part4
DevicePart
DeviceId PartId
1        1
1        2
1        3
2        2
2        3
2        4
3        1
Can I do this in SQL Server 2008 without using cursors?
-- Setup:
declare @Device table(DeviceId int primary key, Parts varchar(1000))
declare @Part table(PartId int identity(1,1) primary key, PartName varchar(100))
declare @DevicePart table(DeviceId int, PartId int)
insert @Device
values
(1, 'Part1, Part2, Part3'),
(2, 'Part2, Part3, Part4'),
(3, 'Part1')
--Script:
declare @DevicePartTemp table(DeviceId int, PartName varchar(100))
insert @DevicePartTemp
select DeviceId, ltrim(x.value('.', 'varchar(100)'))
from
(
select DeviceId, cast('<x>' + replace(Parts, ',', '</x><x>') + '</x>' as xml) XmlColumn
from @Device
)tt
cross apply
XmlColumn.nodes('x') as Nodes(x)
insert @Part
select distinct PartName
from @DevicePartTemp
insert @DevicePart
select tmp.DeviceId, prt.PartId
from @DevicePartTemp tmp
join @Part prt on
prt.PartName = tmp.PartName
-- Result:
select *
from @Part
PartId PartName
------ --------
1      Part1
2      Part2
3      Part3
4      Part4
select *
from @DevicePart
DeviceId PartId
-------- ------
1        1
1        2
1        3
2        2
2        3
2        4
3        1
You will need a Tally table to accomplish this without a cursor.
Follow the instructions to create a tally table here: Tally Tables by Jeff Moden
This script will put the table into your tempdb database, so you probably want to change the USE statement.
Then you can run the script below to insert a breakdown of Devices and Parts into a temp table. You should then be able to join to your Part table by the part name (to get the ID) and insert into your new DevicePart table.
select *,
--substring(d.parts, 1, t.n)
substring(d.parts, t.n, charindex(', ', d.parts + ', ',t.n) - t.n) 'Part'
into #devicesparts
from device d
cross join tally t
where t.n < (select max(len(parts))+ 1 from device)
and substring(', ' + d.parts, t.n, 1) = ', '
Have a look at using fn_Split to create a table variable from the comma-separated values.
You can then use this to drive your insert.
EDIT: Actually, I think you may still need a cursor. Leaving this answer in case fn_Split helps.
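fn_Split is not a built-in function, so its exact shape varies; assuming a user-defined splitter dbo.fn_Split(@list, @delimiter) that returns one row per item in a column named value (signature assumed, purely illustrative), the insert could be driven like this:
insert into DevicePart (DeviceId, PartId)
select d.DeviceId, p.PartId
from Device d
cross apply dbo.fn_Split(d.Parts, ',') s
join Parts p on p.PartName = ltrim(rtrim(s.value))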
If there is a maximum number of parts per device then, yes, it can be done without a cursor, but it is quite complex.
Essentially, create a table (or view or subquery) that has a DeviceId and one part-name column for each possible index in the Parts string; see the sketch below. This can be accomplished by making the part-name columns calculated columns using fn_Split or another method of your choice. From there you do a multiple self-UNION of this table, with one table in the self-UNION for each part-name column; each branch of the self-UNION includes only one of the part-name columns in its select list.
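A sketch of that self-UNION idea which sidesteps fn_Split by abusing PARSENAME (my illustration; it only works for up to four comma-separated parts and assumes part names contain no periods):
;with D as
(
select DeviceId,
P1 = ltrim(parsename(replace(Parts, ',', '.'), 3)),
P2 = ltrim(parsename(replace(Parts, ',', '.'), 2)),
P3 = ltrim(parsename(replace(Parts, ',', '.'), 1))
from Device
)
select DeviceId, P1 as PartName from D where P1 is not null
union all
select DeviceId, P2 from D where P2 is not null
union all
select DeviceId, P3 from D where P3 is not null
PARSENAME numbers parts from the right, but since each non-null column appears in exactly one branch of the UNION, every part still comes out exactly once.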