SQL - STRING_SPLIT string position

I have a table with two columns of comma-separated strings. The way the data is formatted, the number of comma-separated items in both columns is equal, and the first value in colA is related to the first value in colB, and so on. (It's obviously not a very good data format, but it's what I'm working with.)
If I have the following row (PrimaryKeyID | column1 | column2):
1 | a,b,c | A,B,C
then in this data format, a & A are logically related, b & B, etc.
I want to use STRING_SPLIT to split these columns, but using it twice obviously crosses them with each other, resulting in a total of 9 rows.
1 | a | A
1 | b | A
1 | c | A
1 | a | B
1 | b | B
1 | c | B
1 | a | C
1 | b | C
1 | c | C
What I want is just the 3 "logically-related" rows:
1 | a | A
1 | b | B
1 | c | C
However, STRING_SPLIT(myCol,',') doesn't appear to save the String Position anywhere.
I have done the following:
SELECT tbl.ID,
       t1.Column1Value,
       t2.Column2Value
FROM myTable tbl
INNER JOIN (
    SELECT t.ID,
           ss.value AS Column1Value,
           ROW_NUMBER() OVER (PARTITION BY t.ID ORDER BY t.ID) AS StringOrder
    FROM myTable t
    CROSS APPLY STRING_SPLIT(t.column1,',') ss
) t1 ON tbl.ID = t1.ID
INNER JOIN (
    SELECT t.ID,
           ss.value AS Column2Value,
           ROW_NUMBER() OVER (PARTITION BY t.ID ORDER BY t.ID) AS StringOrder
    FROM myTable t
    CROSS APPLY STRING_SPLIT(t.column2,',') ss
) t2 ON tbl.ID = t2.ID AND t1.StringOrder = t2.StringOrder
This appears to work on my small test set, but in my opinion there is no reason to expect it to be guaranteed to work every time. The ROW_NUMBER() OVER (PARTITION BY ID ORDER BY ID) is obviously a meaningless ordering, but it appears that, in the absence of any real ordering, STRING_SPLIT returns the values in the "default" order they were already in. Is this expected behaviour? Can I count on it? Is there any other way of accomplishing what I'm attempting to do?
Thanks.
======================
EDIT
I got what I wanted (I think) with the following UDF. However, it's pretty slow. Any suggestions?
CREATE FUNCTION fn.f_StringSplit(@string VARCHAR(MAX), @delimiter VARCHAR(1))
RETURNS @r TABLE
(
    Position INT,
    String VARCHAR(255)
)
AS
BEGIN
    DECLARE @current_position INT
    SET @current_position = 1
    WHILE CHARINDEX(@delimiter, @string) > 0 BEGIN
        INSERT INTO @r (Position, String) VALUES (@current_position, SUBSTRING(@string, 1, CHARINDEX(@delimiter, @string) - 1))
        SET @current_position = @current_position + 1
        SET @string = SUBSTRING(@string, CHARINDEX(@delimiter, @string) + 1, LEN(@string) - CHARINDEX(@delimiter, @string))
    END
    --add the last one
    INSERT INTO @r (Position, String) VALUES (@current_position, @string)
    RETURN
END
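A set-based alternative should be much faster than the WHILE loop above. Here is a minimal sketch, assuming SQL Server 2016+ (database compatibility level 130 or higher) for OPENJSON and STRING_ESCAPE; the fn schema is kept from the question, and the name f_StringSplit_Json is my own. JSON arrays preserve element order, and OPENJSON's [key] column exposes the zero-based index:
CREATE FUNCTION fn.f_StringSplit_Json (@string VARCHAR(MAX), @delimiter CHAR(1))
RETURNS TABLE
AS RETURN
    -- build a JSON array from the delimited string, then let OPENJSON number it;
    -- assumes the delimiter is not a character STRING_ESCAPE itself rewrites
    SELECT Position = CAST(j.[key] AS INT) + 1,    -- [key] is the zero-based array index
           String   = CAST(j.[value] AS VARCHAR(255))
    FROM OPENJSON('["' + REPLACE(STRING_ESCAPE(@string, 'json'), @delimiter, '","') + '"]') AS j;
Being an inline table-valued function, it can be expanded into the calling query, which usually beats a multi-statement UDF like the one above.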

The only way I've discovered to reliably maintain the order of the String_Split() function is to use the Row_Number() function with a literal value in the "order by".
For example:
declare @Version nvarchar(128)
set @Version = '1.2.3';
with V as (select value v, Row_Number() over (order by (select 0)) n from String_Split(@Version, '.'))
select
(select v from V where n = 1) Major,
(select v from V where n = 2) Minor,
(select v from V where n = 3) Revision
Returns:
Major Minor Revision
----- ----- --------
1     2     3
Update: if you are using a newer version of SQL Server, you can now provide an optional third bit argument which indicates that an ordinal column should also be included in the result. See my other answer here for more details.

Fortunately, in newer versions of SQL Server (Azure SQL and SQL Server 2022), an optional flag has been added to String_Split to include an "ordinal" column. If you are on one of these versions, this finally provides a solution that is logically correct rather than implementation-specific.
New definition:
String_Split(string, separator [, enable_ordinal])
e.g. String_Split('1.2.3', '.', 1)
Example:
with V as (select Value v, Ordinal n from String_Split('1.2.3', '.', 1))
select
(select v from V where n = 1) Major,
(select v from V where n = 2) Minor,
(select v from V where n = 3) Revision
Returns:
Major Minor Revision
----- ----- --------
1     2     3

Your idea is fine, but your order by is not using a stable sort. I think it is safer to do:
SELECT tbl.ID, t1.Column1Value, t2.Column2Value
FROM myTable tbl
INNER JOIN (
    SELECT t.ID, ss.value AS Column1Value,
           ROW_NUMBER() OVER (PARTITION BY t.ID
                              ORDER BY CHARINDEX(',' + ss.value + ',', ',' + t.column1 + ',')
                             ) AS StringOrder
    FROM myTable t
    CROSS APPLY STRING_SPLIT(t.column1, ',') ss
) t1 ON tbl.ID = t1.ID
INNER JOIN (
    SELECT t.ID, ss.value AS Column2Value,
           ROW_NUMBER() OVER (PARTITION BY t.ID
                              ORDER BY CHARINDEX(',' + ss.value + ',', ',' + t.column2 + ',')
                             ) AS StringOrder
    FROM myTable t
    CROSS APPLY STRING_SPLIT(t.column2, ',') ss
) t2 ON tbl.ID = t2.ID AND t1.StringOrder = t2.StringOrder;
Note: This may not work as desired if the strings have non-adjacent duplicates.
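To make that caveat concrete (my own small repro, not part of the original answer): with duplicate tokens, both copies get the same CHARINDEX, so their relative order under ROW_NUMBER() is arbitrary:
SELECT ss.value,
       CHARINDEX(',' + ss.value + ',', ',' + 'a,b,a' + ',') AS pos
FROM STRING_SPLIT('a,b,a', ',') AS ss;
-- value  pos
-- a      1
-- b      3
-- a      1   <-- same pos as the first 'a'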

I'm a little late to this question, but I was just attempting the same thing with string_split, since I've run into a performance problem lately. My experience with string splitters in T-SQL has led me to use recursive CTEs for most things containing fewer than 1,000 delimited values. Ideally, a CLR procedure would be used if you need the ordinal in your string split.
That said, I've come to a similar conclusion as you on getting the ordinal from string_split. You can see the queries and statistics below which, in order, exercise the bare string_split function, a ROW_NUMBER over string_split, the Moden splitter, and my personal string-split CTE function, which I derived from this awesome write-up (a sketch of such a function follows this paragraph). The main difference between my CTE-based function and the one in the write-up is that I made it an inline TVF instead of their multi-statement TVF implementation; you can read about the differences between the two here.
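The benchmark below references dbo.fn_splitByDelim, which isn't included in the post; here is a minimal sketch of what such a recursive-CTE splitter might look like as an inline TVF (my reconstruction, matching the strValue column name and the OPTION (MAXRECURSION 0) hint used in the benchmark):
CREATE FUNCTION dbo.fn_splitByDelim (@str varchar(max), @delim char(1))
RETURNS TABLE AS RETURN
WITH split AS
(
    -- anchor: the first delimiter position (0 if none)
    SELECT ordinal  = 1,
           startPos = CAST(1 AS bigint),
           endPos   = CAST(CHARINDEX(@delim, @str) AS bigint)
    UNION ALL
    -- recurse: walk to the next delimiter after the previous one
    SELECT ordinal + 1,
           endPos + 1,
           CAST(CHARINDEX(@delim, @str, endPos + 1) AS bigint)
    FROM split
    WHERE endPos > 0
)
SELECT ordinal,
       strValue = SUBSTRING(@str, startPos,
                            CASE WHEN endPos = 0
                                 THEN LEN(@str) - startPos + 1
                                 ELSE endPos - startPos END)
FROM split;
-- callers must add OPTION (MAXRECURSION 0) for more than 100 elements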
In my experiments I haven't seen ROW_NUMBER over a constant deviate from the internal order of the delimited string, so I will be using it until I find a problem with it. However, if order is imperative in a business setting, I would probably recommend the Moden splitter featured in the first link above (which links to the author's article here), since its performance is right in line with that of the less-safe string_split-with-ROW_NUMBER approach.
set nocount on;
declare
    @iter int = 0,
    @rowcount int,
    @val varchar(max) = '';
while len(@val) < 1e6
    select
        @val += replicate(concat(@iter, ','), 8e3),
        @iter += 1;
raiserror('Begin string_split Built-In', 0, 0) with nowait;
set statistics time, io on;
select
*
from
string_split(@val, ',')
where
[value] > '';
select
@rowcount = @@ROWCOUNT;
set statistics time, io off;
print '';
raiserror('End string_split Built-In | Return %d Rows', 0, 0, @rowcount) with nowait;
print '';
raiserror('Begin string_split Built-In with RowNumber', 0, 0) with nowait;
set statistics time, io on;
with cte
as (
select
*,
[group] = 1
from
string_split(@val, ',')
where
[value] > ''
),
cteCount
as (
select
*,
[id] = row_number() over (order by [group])
from
cte
)
select
*
from
cteCount;
select
@rowcount = @@ROWCOUNT;
set statistics time, io off;
print '';
raiserror('End string_split Built-In with RowNumber | Return %d Rows', 0, 0, @rowcount) with nowait;
print '';
raiserror('Begin Moden String Splitter', 0, 0) with nowait;
set statistics time, io on;
select
*
from
dbo.SplitStrings_Moden(@val, ',')
where
item > '';
select
@rowcount = @@ROWCOUNT;
set statistics time, io off;
print '';
raiserror('End Moden String Splitter | Return %d Rows', 0, 0, @rowcount) with nowait;
print '';
raiserror('Begin Recursive CTE String Splitter', 0, 0) with nowait;
set statistics time, io on;
select
*
from
dbo.fn_splitByDelim(@val, ',')
where
strValue > ''
option
(maxrecursion 0);
select
@rowcount = @@ROWCOUNT;
set statistics time, io off;
The statistics:
Begin string_split Built-In
SQL Server Execution Times:
CPU time = 2000 ms, elapsed time = 5325 ms.
SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 0 ms.
End string_split Built-In | Return 331940 Rows
Begin string_split Built-In with RowNumber
SQL Server Execution Times:
CPU time = 2094 ms, elapsed time = 8119 ms.
SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 0 ms.
End string_split Built-In with RowNumber | Return 331940 Rows
Begin Moden String Splitter
SQL Server parse and compile time:
CPU time = 0 ms, elapsed time = 6 ms.
SQL Server Execution Times:
CPU time = 8734 ms, elapsed time = 9009 ms.
SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 0 ms.
End Moden String Splitter | Return 331940 Rows
Begin Recursive CTE String Splitter
Table 'Worktable'. Scan count 2, logical reads 1991648, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
CPU time = 147188 ms, elapsed time = 147480 ms.
SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 0 ms.
End Recursive CTE String Splitter | Return 331940 Rows

SELECT PrimaryKeyID, t2.items AS column1, t1.items AS column2
FROM [YourTableName]
CROSS APPLY [dbo].[Split](column2) AS t1
CROSS APPLY [dbo].[Split](column1) AS t2

Mark, here is a solution I would use. Assuming that [column1] in your table has the "key" values, which are more or less stable, and [column2] has the corresponding "field" values, which can sometimes be omitted or NULL:
There will be two extractions, one for [column1] (which I assume is the key) and another for [column2] (which I assume holds the sort of "values" for the "key"); they will be auto-parsed by the STRING_SPLIT function.
These two INDEPENDENT result sets will then be re-numbered based on the time of the operation (which is always sequential). Take note, we renumber not by the field content or the position of the comma etc., BUT by the timestamp.
Then they will get joined back together by a LEFT OUTER JOIN; note, not by an INNER JOIN, due to the fact that our "field values" could be omitted, while the "keys" will always be there.
Below is the T-SQL code. As this is my first post to this site, I hope it looks OK:
SELECT T1.ID, T1.KeyValue, T2.FieldValue
from (select t1.ID, row_number() OVER (PARTITION BY t1.ID ORDER BY current_timestamp) AS KeyRow, t2.value AS KeyValue
from myTable t1
CROSS APPLY STRING_SPLIT(t1.column1,',') as t2) T1
LEFT OUTER JOIN
(select t1.ID, row_number() OVER (PARTITION BY t1.ID ORDER BY current_timestamp) AS FieldRow, t3.value AS FieldValue
from myTable t1
CROSS APPLY STRING_SPLIT(t1.column2,',') as t3) T2 ON T1.ID = T2.ID AND T1.KeyRow = T2.FieldRow

This is very simple:
CREATE TABLE #a (
    id INT IDENTITY(1,1) NOT NULL,
    OrgId INT
)

INSERT INTO #a (OrgId)
SELECT value FROM STRING_SPLIT('18,44,45,46,47,48,49,50,51,52,53', ',')

SELECT * FROM #a
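One caveat: without an ORDER BY, SQL Server does not guarantee that the IDENTITY values are assigned in the order the split rows are produced. On SQL Server 2022 / Azure SQL you can make this deterministic with the ordinal column; a sketch:
INSERT INTO #a (OrgId)
SELECT value
FROM STRING_SPLIT('18,44,45,46,47,48,49,50,51,52,53', ',', 1)
ORDER BY ordinal;  -- ORDER BY in INSERT...SELECT guarantees how IDENTITY values are assigned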

Here is a T-SQL function that uses string_split and adds the ordinal column:
drop function if exists [dbo].[varchar_split2];
go

create function [dbo].[varchar_split2]
(
    @text varchar(max),
    @delimiter char(1) = ','
)
returns @result table ([Ordinal] int not null identity(1, 1) primary key, [Value] varchar(128) not null)
as
begin
    insert @result ([Value])
    select [Value]
    from string_split(@text, @delimiter)
    where 0 != len([Value]);

    return;
end;
go
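A usage sketch against the question's table (myTable, ID, column1, column2 are taken from the original post; this relies on the function's IDENTITY assignment following the split order):
SELECT t.ID, c1.Ordinal, c1.Value AS Column1Value, c2.Value AS Column2Value
FROM myTable t
CROSS APPLY dbo.varchar_split2(t.column1, ',') AS c1
CROSS APPLY dbo.varchar_split2(t.column2, ',') AS c2
WHERE c1.Ordinal = c2.Ordinal;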


Issue using CHARINDEX function in SQL Server

Could someone help me? I'm trying to get a specific value from my delimited column.
Column_A is my data
Column_B is what I could get
Column_C is what I want
Basically I'm trying to get the values between the 3rd ":" and the 4th ":"
I'm using this piece of code here:
select SourceID
, SUBSTRING(SourceID,CHARINDEX(':', SourceID, CHARINDEX(':', SourceID) + 1) + 1,
CHARINDEX(':', SourceID, CHARINDEX(':', SourceID, CHARINDEX(':', SourceID) + 1) + 1) -6)
from temp.table
Thanks in advance
You may try a recursive CTE to retrieve any part of the string you wish. Something like this:
CREATE TABLE #Temp
(
MyString NVARCHAR(2000)
)
INSERT INTO #TEMP
VALUES('42:45:ABCD:GGRFG34:SADSAD'),('65:213:5435423:234234')
;WITH CTE AS
(
SELECT
ParentString = MyString,
MyString = CASE CHARINDEX(':',MyString) WHEN 0 THEN NULL ELSE SUBSTRING(MyString,CHARINDEX(':',MyString)+1,LEN(MyString)) END,
Part = CASE CHARINDEX(':',MyString) WHEN 0 THEN MyString ELSE SUBSTRING(MyString,1,CHARINDEX(':',MyString)-1) END,
Seq = 1
FROM
#Temp
UNION ALL
SELECT
ParentString,
MyString = CASE CHARINDEX(':',MyString) WHEN 0 THEN NULL ELSE SUBSTRING(MyString,CHARINDEX(':',MyString)+1,LEN(MyString)) END,
Part = CASE CHARINDEX(':',MyString) WHEN 0 THEN MyString ELSE SUBSTRING(MyString,1,CHARINDEX(':',MyString)-1) END,
Seq = ISNULL(Seq,0)+1
FROM
CTE
WHERE
ISNULL(MyString, '') <> ''
)
SELECT
*
FROM
CTE
WHERE
Seq = 3 -- for retrieving the 3rd string, change this accordingly
Result (for Seq = 3):
ParentString                MyString          Part      Seq
--------------------------- ----------------- --------- ---
42:45:ABCD:GGRFG34:SADSAD   GGRFG34:SADSAD    ABCD      3
65:213:5435423:234234       234234            5435423   3
First, if performance is important, then a recursive CTE is NOT what you want; I'll demonstrate why in a moment.
I have a simple solution here, called SubstringBetween8K, but it's overkill for what you are doing. For this, a simple cascading APPLY will do the trick and perform the best. First, the sample data:
IF OBJECT_ID('tempdb..#temp') IS NOT NULL DROP TABLE #temp;
GO
CREATE TABLE #temp (SourceId VARCHAR(1000));
INSERT #temp VALUES('42:45:10856x2019035x1200:GGRFG34:SADSAD.9999999999999999'),
('65:213:999555x2019035x9444:5435423:234234,123123.111'),
('999:12344:5555511056x35x9111:5435423:234234,555555555555'),
('225:0:11056x2019035x9444:5435423:ABAFLHG.882');
Next, the cascading APPLY solution:
SELECT Item = SUBSTRING(t.SourceId, f2.Pos+1, f3.Pos-f2.Pos-1)
FROM #temp AS t
CROSS APPLY (VALUES(CHARINDEX(':',t.SourceId))) AS f1(Pos)
CROSS APPLY (VALUES(CHARINDEX(':',t.SourceId,f1.Pos+1))) AS f2(Pos)
CROSS APPLY (VALUES(CHARINDEX(':',t.SourceId,f2.Pos+1))) AS f3(Pos);
Results:
Item
------------------------
10856x2019035x1200
999555x2019035x9444
5555511056x35x9111
11056x2019035x9444
Now a quick performance test which will demonstrate why not to use a recursive CTE.
--==== Sample data
IF OBJECT_ID('tempdb..#temp') IS NOT NULL DROP TABLE #temp;
GO
CREATE TABLE #temp (SourceId VARCHAR(1000));
INSERT #temp VALUES('42:45:10856x2019035x1200:GGRFG34:SADSAD.9999999999999999'),
('65:213:999555x2019035x9444:5435423:234234,123123.111'),
('999:12344:5555511056x35x9111:5435423:234234,555555555555'),
('225:0:11056x2019035x9444:5435423:ABAFLHG.882');
--==== Add 100K rows for performance testing
INSERT #temp
SELECT TOP (100000) sourceId
FROM #temp
CROSS JOIN sys.all_columns, sys.all_columns AS b
GO
--==== Performance Test
IF OBJECT_ID('tempdb..#t1') IS NOT NULL DROP TABLE #t1;
IF OBJECT_ID('tempdb..#t2') IS NOT NULL DROP TABLE #t2;
GO
SET STATISTICS TIME, IO ON;
PRINT CHAR(10)+'Cascading CTE'+CHAR(10)+REPLICATE('-',90);
SELECT Item = SUBSTRING(t.SourceId, f2.Pos+1, f3.Pos-f2.Pos-1)
INTO #t1
FROM #temp AS t
CROSS APPLY (VALUES(CHARINDEX(':',t.SourceId))) AS f1(Pos)
CROSS APPLY (VALUES(CHARINDEX(':',t.SourceId,f1.Pos+1))) AS f2(Pos)
CROSS APPLY (VALUES(CHARINDEX(':',t.SourceId,f2.Pos+1))) AS f3(Pos);
PRINT CHAR(10)+'Recursive CTE'+CHAR(10)+REPLICATE('-',90);
;WITH CTE AS
(
SELECT
ParentString = SourceId,
SourceId = CASE CHARINDEX(':',SourceId) WHEN 0 THEN NULL ELSE SUBSTRING(SourceId,CHARINDEX(':',SourceId)+1,LEN(SourceId)) END,
Part = CASE CHARINDEX(':',SourceId) WHEN 0 THEN SourceId ELSE SUBSTRING(SourceId,1,CHARINDEX(':',SourceId)-1) END,
Seq = 1
FROM #temp
UNION ALL
SELECT
ParentString,
MyString = CASE CHARINDEX(':',SourceId) WHEN 0 THEN NULL ELSE SUBSTRING(SourceId,CHARINDEX(':',SourceId)+1,LEN(SourceId)) END,
Part = CASE CHARINDEX(':',SourceId) WHEN 0 THEN SourceId ELSE SUBSTRING(SourceId,1,CHARINDEX(':',SourceId)-1) END,
Seq = ISNULL(Seq,0)+1
FROM CTE
WHERE ISNULL(SourceId, '') <> ''
)
SELECT Part
INTO #t2
FROM CTE
WHERE Seq = 3
SET STATISTICS TIME, IO OFF;
Test Results:
Cascading CTE
------------------------------------------------------------------------------------------
Table '#temp'. Scan count 9, logical reads 807, physical reads 0...
SQL Server Execution Times: CPU time = 327 ms, elapsed time = 111 ms.
Recursive CTE
------------------------------------------------------------------------------------------
Table 'Worktable'. Scan count 2, logical reads 4221845, physical reads 0...
Table '#temp'. Scan count 1, logical reads 807, physical reads 0...
SQL Server Execution Times: CPU time = 8781 ms, elapsed time = 9370 ms.
From 10 seconds down to roughly 1/10th of a second: about a 100X performance improvement. Part of the issue with the recursive CTE is the excessive IO (reads). Note the 4.2 million reads for a simple 100K rows.

T-SQL - Count unique characters in a variable

Goal: To count # of distinct characters in a variable the fastest way possible.
DECLARE @String1 NVARCHAR(4000) = N'1A^' ;   --> output = 3
DECLARE @String2 NVARCHAR(4000) = N'11' ;    --> output = 1
DECLARE @String3 NVARCHAR(4000) = N'*' ;     --> output = 1
DECLARE @String4 NVARCHAR(4000) = N'*A-zz' ; --> output = 4
I've found some posts in regards to distinct characters in a column, grouped by characters, and etc, but not one for this scenario.
Using NGrams8K as a base, you can change the input parameter to nvarchar(4000) and tweak the DATALENGTH calls, making NGramsN4K. Then you can use that to split the string into individual characters and count them:
SELECT COUNT(DISTINCT NG.token) AS DistinctCharacters
FROM dbo.NGramsN4k(@String1,1) NG;
Altered NGrams8K:
IF OBJECT_ID('dbo.NGramsN4k','IF') IS NOT NULL DROP FUNCTION dbo.NGramsN4k;
GO
CREATE FUNCTION dbo.NGramsN4k
(
    @string nvarchar(4000), -- Input string
    @N int                  -- requested token size
)
/****************************************************************************************
Purpose:
 A character-level N-Grams function that outputs a contiguous stream of @N-sized tokens
 based on an input string (@string). Accepts strings up to 4000 nvarchar characters long.
 For more information about N-Grams see: http://en.wikipedia.org/wiki/N-gram.
Compatibility:
 SQL Server 2008+, Azure SQL Database
Syntax:
--===== Autonomous
 SELECT position, token FROM dbo.NGramsN4k(@string,@N);
--===== Against a table using APPLY
 SELECT s.SomeID, ng.position, ng.token
 FROM dbo.SomeTable s
 CROSS APPLY dbo.NGramsN4k(s.SomeValue,@N) ng;
Parameters:
 @string = The input string to split into tokens.
 @N = The size of each token returned.
Returns:
 Position = bigint; the position of the token in the input string
 token = nvarchar(4000); a @N-sized character-level N-Gram token
Developer Notes:
 1. NGramsN4k is not case sensitive
 2. Many functions that use NGramsN4k will see a huge performance gain when the optimizer
    creates a parallel execution plan. One way to get a parallel query plan (if the
    optimizer does not choose one) is to use make_parallel by Adam Machanic which can be
    found here:
    sqlblog.com/blogs/adam_machanic/archive/2013/07/11/next-level-parallel-plan-porcing.aspx
 3. When @N is less than 1 or greater than the datalength of the input string then no
    tokens (rows) are returned. If either @string or @N are NULL no rows are returned.
    This is a debatable topic but the thinking behind this decision is that: because you
    can't split 'xxx' into 4-grams, you can't split a NULL value into unigrams and you
    can't turn anything into NULL-grams, no rows should be returned.
    For people who would prefer that a NULL input forces the function to return a single
    NULL output you could add this code to the end of the function:
    UNION ALL
    SELECT 1, NULL
    WHERE NOT(@N > 0 AND @N <= DATALENGTH(@string)) OR (@N IS NULL OR @string IS NULL)
 4. NGramsN4k can also be used as a tally table with the position column being your "N"
    row. To do so use REPLICATE to create an imaginary string, then use NGramsN4k to split
    it into unigrams then only return the position column. NGramsN4k will get you up to
    4000 numbers. There will be no performance penalty for sorting by position in
    ascending order but there is for sorting in descending order. To get the numbers in
    descending order without forcing a sort in the query plan use the following formula:
    N = <highest number>-position+1.
 Pseudo Tally Table Examples:
--===== (1) Get the numbers 1 to 100 in ascending order:
 SELECT N = position
 FROM dbo.NGramsN4k(REPLICATE(0,100),1);
--===== (2) Get the numbers 1 to 100 in descending order:
 DECLARE @maxN int = 100;
 SELECT N = @maxN-position+1
 FROM dbo.NGramsN4k(REPLICATE(0,@maxN),1)
 ORDER BY position;
 5. NGramsN4k is deterministic. For more about deterministic functions see:
    https://msdn.microsoft.com/en-us/library/ms178091.aspx
Usage Examples:
--===== Turn the string, 'abcd' into unigrams, bigrams and trigrams
 SELECT position, token FROM dbo.NGramsN4k('abcd',1); -- unigrams (@N=1)
 SELECT position, token FROM dbo.NGramsN4k('abcd',2); -- bigrams (@N=2)
 SELECT position, token FROM dbo.NGramsN4k('abcd',3); -- trigrams (@N=3)
--===== How many times the substring "AB" appears in each record
 DECLARE @table TABLE(stringID int identity primary key, string varchar(100));
 INSERT @table(string) VALUES ('AB123AB'),('123ABABAB'),('!AB!AB!'),('AB-AB-AB-AB-AB');
 SELECT string, occurrences = COUNT(*)
 FROM @table t
 CROSS APPLY dbo.NGramsN4k(t.string,2) ng
 WHERE ng.token = 'AB'
 GROUP BY string;
----------------------------------------------------------------------------------------
Revision History:
 Rev 00 - 20140310 - Initial Development - Alan Burstein
 Rev 01 - 20150522 - Removed DQS N-Grams functionality, improved iTally logic. Also added
                     conversion to bigint in the TOP logic to remove implicit conversion
                     to bigint - Alan Burstein
 Rev 03 - 20150909 - Added logic to only return values if @N is greater than 0 and less
                     than the length of @string. Updated comment section. - Alan Burstein
 Rev 04 - 20151029 - Added ISNULL logic to the TOP clause for the @string and @N
                     parameters to prevent a NULL string or NULL @N from causing "an
                     improper value" being passed to the TOP clause. - Alan Burstein
****************************************************************************************/
RETURNS TABLE WITH SCHEMABINDING AS RETURN
WITH
L1(N) AS
(
 SELECT 1
 FROM (VALUES -- 90 NULL values used to create the CTE Tally table
      (NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),
      (NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),
      (NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),
      (NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),
      (NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),
      (NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),
      (NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),
      (NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),
      (NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL)
      ) t(N)
),
iTally(N) AS -- my cte Tally table
(
 SELECT TOP(ABS(CONVERT(BIGINT,((DATALENGTH(ISNULL(@string,N''))/2)-(ISNULL(@N,1)-1)),0)))
  ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) -- Order by a constant to avoid a sort
 FROM L1 a CROSS JOIN L1 b -- cartesian product for 8100 rows (90^2)
)
SELECT
 position = N, -- position of the token in the string(s)
 token = SUBSTRING(@string,CAST(N AS int),@N) -- the @N-sized token
FROM iTally
WHERE @N > 0 AND @N <= (DATALENGTH(@string)/2); -- Protection against bad parameter values
Here is another alternative using the power of the tally table. It has been called the "Swiss Army Knife of T-SQL". I keep a tally table as a view on my system, which makes it insanely fast.
create View [dbo].[cteTally] as
WITH
E1(N) AS (select 1 from (values (1),(1),(1),(1),(1),(1),(1),(1),(1),(1))dt(n)),
E2(N) AS (SELECT 1 FROM E1 a, E1 b), --10E+2 or 100 rows
E4(N) AS (SELECT 1 FROM E2 a, E2 b), --10E+4 or 10,000 rows max
cteTally(N) AS
(
SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E4
)
select N from cteTally
Now we can use that tally anytime we need it, like for this exercise.
declare @Something table
(
    String1 nvarchar(4000)
)

insert @Something values
(N'1A^')
, (N'11')
, (N'*')
, (N'*A-zz')

select count(distinct substring(s.String1, t.N, 1))
, s.String1
from @Something s
join cteTally t on t.N <= len(s.String1)
group by s.String1
To be honest, I don't know if this would be any faster than Larnu's usage of NGrams, but testing on a large table would be fun to see.
----- EDIT -----
Thanks to Shnugo for the idea. Using a cross apply to a correlated subquery here is actually quite an improvement.
select count(distinct substring(s.String1, A.N, 1))
, s.String1
from @Something s
CROSS APPLY (SELECT TOP(LEN(s.String1)) t.N FROM cteTally t) A(N)
group by s.String1
The reason this is so much faster is that it no longer uses a triangular join, which can be painfully slow. I also switched out the view for an indexed physical tally table. The improvement there was noticeable on larger datasets, but not nearly as big as using the cross apply.
If you want to read more about triangular joins and why we should avoid them Jeff Moden has a great article on the topic. https://www.sqlservercentral.com/articles/hidden-rbar-triangular-joins
Grab a copy of NGrams8k and you can do this:
DECLARE @String1 NVARCHAR(4000) = N'1A^' ;   --> output = 3
DECLARE @String2 NVARCHAR(4000) = N'11' ;    --> output = 1
DECLARE @String3 NVARCHAR(4000) = N'*' ;     --> output = 1
DECLARE @String4 NVARCHAR(4000) = N'*A-zz' ; --> output = 4
SELECT s.String, Total = COUNT(DISTINCT ng.token)
FROM (VALUES(@String1),(@String2),(@String3),(@String4)) AS s(String)
CROSS APPLY dbo.NGrams8k(s.String,1) AS ng
GROUP BY s.String;
Returns:
String   Total
-------- -----------
*        1
*A-zz    4
11       1
1A^      3
UPDATED
Just a quick update based on @Larnu's post and comments. I did not notice that the OP was dealing with Unicode, i.e. NVARCHAR. I created an NVARCHAR(4000) version here, similar to what @Larnu posted above. I just updated the return token to use the Latin1_General_BIN collation.
SUBSTRING(@string COLLATE Latin1_General_BIN,CAST(N AS int),@N)
This returns the correct answer:
DECLARE @String5 NVARCHAR(4000) = N'ᡣᓡ'; --> output = 2
SELECT COUNT(DISTINCT ng.token)
FROM dbo.NGramsN4k(@String5,1) AS ng;
Without the collation in place, you can use what Larnu posted and get the right answer like this:
DECLARE @String5 NVARCHAR(4000) = N'ᡣᓡ'; --> output = 2
SELECT COUNT(DISTINCT UNICODE(ng.token))
FROM dbo.NGramsN4k(@String5,1) AS ng;
Here's my updated NGramsN4K function:
ALTER FUNCTION dbo.NGramsN4K
(
    @string nvarchar(4000), -- Input string
    @N int                  -- requested token size
)
/****************************************************************************************
Purpose:
 A character-level N-Grams function that outputs a contiguous stream of @N-sized tokens
 based on an input string (@string). Accepts strings up to 4000 nvarchar characters long.
 For more information about N-Grams see: http://en.wikipedia.org/wiki/N-gram.
Compatibility:
 SQL Server 2008+, Azure SQL Database
Syntax:
--===== Autonomous
 SELECT position, token FROM dbo.NGramsN4K(@string,@N);
--===== Against a table using APPLY
 SELECT s.SomeID, ng.position, ng.token
 FROM dbo.SomeTable s
 CROSS APPLY dbo.NGramsN4K(s.SomeValue,@N) ng;
Parameters:
 @string = The input string to split into tokens.
 @N = The size of each token returned.
Returns:
 Position = bigint; the position of the token in the input string
 token = nvarchar(4000); a @N-sized character-level N-Gram token
Developer Notes:
 1. NGramsN4K is not case sensitive
 2. Many functions that use NGramsN4K will see a huge performance gain when the optimizer
    creates a parallel execution plan. One way to get a parallel query plan (if the
    optimizer does not choose one) is to use make_parallel by Adam Machanic which can be
    found here:
    sqlblog.com/blogs/adam_machanic/archive/2013/07/11/next-level-parallel-plan-porcing.aspx
 3. When @N is less than 1 or greater than the datalength of the input string then no
    tokens (rows) are returned. If either @string or @N are NULL no rows are returned.
    This is a debatable topic but the thinking behind this decision is that: because you
    can't split 'xxx' into 4-grams, you can't split a NULL value into unigrams and you
    can't turn anything into NULL-grams, no rows should be returned.
    For people who would prefer that a NULL input forces the function to return a single
    NULL output you could add this code to the end of the function:
    UNION ALL
    SELECT 1, NULL
    WHERE NOT(@N > 0 AND @N <= DATALENGTH(@string)) OR (@N IS NULL OR @string IS NULL);
 4. NGramsN4K is deterministic. For more about deterministic functions see:
    https://msdn.microsoft.com/en-us/library/ms178091.aspx
Usage Examples:
--===== Turn the string, 'abcd' into unigrams, bigrams and trigrams
 SELECT position, token FROM dbo.NGramsN4K('abcd',1); -- unigrams (@N=1)
 SELECT position, token FROM dbo.NGramsN4K('abcd',2); -- bigrams (@N=2)
 SELECT position, token FROM dbo.NGramsN4K('abcd',3); -- trigrams (@N=3)
--===== How many times the substring "AB" appears in each record
 DECLARE @table TABLE(stringID int identity primary key, string nvarchar(100));
 INSERT @table(string) VALUES ('AB123AB'),('123ABABAB'),('!AB!AB!'),('AB-AB-AB-AB-AB');
 SELECT string, occurrences = COUNT(*)
 FROM @table t
 CROSS APPLY dbo.NGramsN4K(t.string,2) ng
 WHERE ng.token = 'AB'
 GROUP BY string;
------------------------------------------------------------------------------------------
Revision History:
 Rev 00 - 20170324 - Initial Development - Alan Burstein
 Rev 01 - 20191108 - Added Latin1_General_BIN collation to token output - Alan Burstein
*****************************************************************************************/
RETURNS TABLE WITH SCHEMABINDING AS RETURN
WITH
L1(N) AS
(
 SELECT 1 FROM (VALUES -- 64 dummy values to CROSS join for 4096 rows
 ($),($),($),($),($),($),($),($),($),($),($),($),($),($),($),($),
 ($),($),($),($),($),($),($),($),($),($),($),($),($),($),($),($),
 ($),($),($),($),($),($),($),($),($),($),($),($),($),($),($),($),
 ($),($),($),($),($),($),($),($),($),($),($),($),($),($),($),($)) t(N)
),
iTally(N) AS
(
 SELECT
  TOP (ABS(CONVERT(BIGINT,((DATALENGTH(ISNULL(@string,''))/2)-(ISNULL(@N,1)-1)),0)))
  ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) -- Order by a constant to avoid a sort
 FROM L1 a CROSS JOIN L1 b -- cartesian product for 4096 rows (64^2)
)
SELECT
 position = N, -- position of the token in the string(s)
 token = SUBSTRING(@string COLLATE Latin1_General_BIN,CAST(N AS int),@N) -- the @N-sized token
FROM iTally
WHERE @N > 0 -- Protection against bad parameter values:
 AND @N <= (ABS(CONVERT(BIGINT,((DATALENGTH(ISNULL(@string,''))/2)-(ISNULL(@N,1)-1)),0)));
You can do this natively in SQL Server using a CTE and some string manipulation:
DECLARE @TestString NVARCHAR(4000);
SET @TestString = N'*A-zz';
WITH letters AS
(
    SELECT 1 AS Pos,
           @TestString AS Stri,
           MAX(LEN(@TestString)) AS MaxPos,
           SUBSTRING(@TestString, 1, 1) AS [Char]
    UNION ALL
    SELECT Pos + 1,
           @TestString,
           MaxPos,
           SUBSTRING(@TestString, Pos + 1, 1) AS [Char]
    FROM letters
    WHERE Pos + 1 <= MaxPos
)
SELECT COUNT(*) AS LetterCount
FROM (
    SELECT UPPER([Char]) AS [Char]
    FROM letters
    GROUP BY [Char]
) a
Example outputs:
SET @TestString = N'*A-zz';
{execute code}
LetterCount = 4

SET @TestString = N'1A^';
{execute code}
LetterCount = 3

SET @TestString = N'1';
{execute code}
LetterCount = 1

SET @TestString = N'*';
{execute code}
LetterCount = 1
CREATE TABLE #STRINGS(
STRING1 NVARCHAR(4000)
)
INSERT INTO #STRINGS (
STRING1
)
VALUES
(N'1A^'),(N'11'),(N'*'),(N'*A-zz')
;WITH CTE_T AS (
SELECT DISTINCT
S.STRING1
,SUBSTRING(S.STRING1, V.number + 1, 1) AS Val
FROM
#STRINGS S
INNER JOIN
[master]..spt_values V
ON V.number < LEN(S.STRING1)
WHERE
V.[type] = 'P'
)
SELECT
T.STRING1
,COUNT(1) AS CNT
FROM
CTE_T T
GROUP BY
T.STRING1

Loop through sql result set and remove [n] duplicates

I've got a SQL Server db with quite a few dupes in it. Removing the dupes manually is just not going to be fun, so I was wondering if there is any sort of sql programming or scripting I can do to automate it.
Below is my query that returns the ID and the Code of the duplicates.
select a.ID, a.Code
from Table1 a
inner join (
SELECT Code
FROM Table1 GROUP BY Code HAVING COUNT(Code)>1)
x on x.Code= a.Code
I'll get a return like this, for example:
5163 51727
5164 51727
5165 51727
5166 51728
5167 51728
5168 51728
This snippet shows three rows for each Code (so a primary "good" record and two dupes). However, this isn't always the case. There can be up to [n] dupes, although 2-3 seems to be the norm.
I just want to somehow loop through this result set and delete everything but one record. THE RECORDS TO DELETE ARE ARBITRARY, as any of them can be "kept".
You can use ROW_NUMBER to drive your delete, i.e.:
CREATE TABLE #table1
(id INT,
 code INT
);

WITH cte AS
(select a.ID, a.Code, ROW_NUMBER() OVER (PARTITION BY Code ORDER BY ID) AS rn
 from #table1 a
)
DELETE x
FROM #table1 x
JOIN cte ON x.id = cte.id
WHERE cte.rn > 1
But...
If you are going to be doing a lot of deletes from a very large table, you might be better off selecting the rows you need into a temp table, then truncating the table and re-inserting the rows you kept.
This keeps the transaction log from getting hammered and your clustered index from getting fragmented, and it should be quicker too!
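A minimal sketch of that approach, using the Table1/ID/Code names from the question (if ID is an IDENTITY column you would also need SET IDENTITY_INSERT, and a wider table would carry its other columns along):
-- keep one arbitrary row per Code
SELECT MIN(ID) AS ID, Code
INTO #keep
FROM Table1
GROUP BY Code;

TRUNCATE TABLE Table1;   -- minimally logged, unlike a large DELETE

INSERT INTO Table1 (ID, Code)
SELECT ID, Code
FROM #keep;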
It is actually very simple:
DELETE FROM Table1
WHERE ID NOT IN
(SELECT MAX(ID)
FROM Table1
GROUP BY CODE)
A self-join solution, with a performance test vs. the CTE.
create table codes(
    id int IDENTITY(1,1) NOT NULL,
    code int null,
    CONSTRAINT [PK_codes_id] PRIMARY KEY CLUSTERED
    (
        id ASC
    ))

declare @counter int, @code int
set @counter = 1
set @code = 1
while (@counter <= 1000000)
begin
    print ABS(Checksum(NewID()) % 1000)
    insert into codes(code) select ABS(Checksum(NewID()) % 1000)
    set @counter = @counter + 1
end
GO
set statistics time on;
delete a
from codes a left join(
select MIN(id) as id from codes
group by code) b
on a.id = b.id
where b.id is null
set statistics time off;
--set statistics time on;
-- WITH cte AS
-- (select a.id, a.code, ROW_NUMBER() OVER(PARTITION by code ORDER BY id) AS rn
-- from codes a
-- )
-- delete x
-- FROM codes x
-- JOIN cte ON x.id = cte.id
-- WHERE cte.rn > 1
--set statistics time off;
Performance test results:
With Join:
SQL Server Execution Times:
CPU time = 3198 ms, elapsed time = 3200 ms.
(999000 row(s) affected)
With CTE:
SQL Server Execution Times:
CPU time = 4197 ms, elapsed time = 4229 ms.
(999000 row(s) affected)
It's basically done like this:
WITH CTE_Dup AS
(
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY SalesOrderno, ItemNo ORDER BY SalesOrderno, ItemNo) AS ROW_NO
    FROM dbo.SalesOrderDetails
)
DELETE FROM CTE_Dup WHERE ROW_NO > 1;
NOTICE: MUST INCLUDE ALL FIELDS!!
Here is another example:
CREATE TABLE #Table (C1 INT,C2 VARCHAR(10))
INSERT INTO #Table VALUES (1,'SQL Server')
INSERT INTO #Table VALUES (1,'SQL Server')
INSERT INTO #Table VALUES (2,'Oracle')
SELECT * FROM #Table
;WITH Delete_Duplicate_Row_cte
AS (SELECT ROW_NUMBER()OVER(PARTITION BY C1, C2 ORDER BY C1,C2) ROW_NUM,*
FROM #Table )
DELETE FROM Delete_Duplicate_Row_cte WHERE ROW_NUM > 1
SELECT * FROM #Table

Group data without changing query flow

It's hard for me to explain what I want, so the title may be unclear, but I hope I can describe it with code.
I have some data with two important values, time t and value f(t), stored in a table, for example:
1 - 1000
2 - 1200
3 - 1100
4 - 1500
...
I want to plot a graph from it, and this graph should contain N points. If the table has fewer rows than N, then we just return the table as-is. But if it has more, we should group the points. For example, with N = Count/2, the table above becomes:
1 - (1000+1200)/2 = 1100
2 - (1100+1500)/2 = 1300
...
I wrote an SQL script (it works fine for N >> Count) (MonitoringDateTime is t, and ResultCount is f(t)):
ALTER PROCEDURE [dbo].[usp_GetRequestStatisticsData]
    @ResourceTypeID bigint,
    @DateFrom datetime,
    @DateTo datetime,
    @EstimatedPointCount int
AS
BEGIN
    SET NOCOUNT ON;
    SET ARITHABORT ON;

    declare @groupSize int;
    declare @resourceCount int;

    select @resourceCount = Count(*)
    from ResourceType
    where ID & @ResourceTypeID > 0

    SELECT d.ResultCount
          ,MonitoringDateTime = d.GeneratedOnUtc
          ,ResourceType = a.ResourceTypeID
          ,ROW_NUMBER() OVER(ORDER BY d.GeneratedOnUtc asc) AS Row
    into #t
    FROM dbo.AgentData d
    INNER JOIN dbo.Agent a ON a.CheckID = d.CheckID
    WHERE d.EventType = 'Result' AND
          a.ResourceTypeID & @ResourceTypeID > 0 AND
          d.GeneratedOnUtc between @DateFrom AND @DateTo AND
          d.Result = 1

    select @groupSize = Count(*) / (@EstimatedPointCount * @resourceCount)
    from #t

    if @groupSize = 0 -- return all points
        select ResourceType, MonitoringDateTime, ResultCount
        from #t
    else
        select ResourceType, CAST(AVG(CAST(#t.MonitoringDateTime AS DECIMAL(18, 6))) AS DATETIME) MonitoringDateTime, AVG(ResultCount) ResultCount
        from #t
        where [Row] % @groupSize = 0
        group by ResourceType, [Row]
        order by MonitoringDateTime
END
but it doesn't work for N ~= Count, and it spends a lot of time on the inserts.
That's why I wanted to use CTEs, but a CTE doesn't work with an IF/ELSE statement.
So I calculated a formula for a group number (to use in the GROUP BY clause), because we have
GroupNumber = Count < N ? Row : Row*NumberOfGroups
where Count is the number of rows in the table, and NumberOfGroups = Count/EstimatedPointCount.
Using some trivial mathematics we get the formula
GroupNumber = Row + (Row*Count/EstimatedPointCount - Row)*MAX(Count - Count/EstimatedPointCount,0)/(Count - Count/EstimatedPointCount)
but it doesn't work, because of the Count aggregate function:
Column 'dbo.AgentData.ResultCount' is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause.
My English is not very good (I'm trying to improve it), but hope dies last, so please advise.
Results of the query:
SELECT d.ResultCount
, MonitoringDateTime = d.GeneratedOnUtc
, ResourceType = a.ResourceTypeID
FROM dbo.AgentData d
INNER JOIN dbo.Agent a ON a.CheckID = d.CheckID
WHERE d.GeneratedOnUtc between '2015-01-28' AND '2015-01-30' AND
a.ResourceTypeID & 1376256 > 0 AND
d.EventType = 'Result' AND
d.Result = 1
https://onedrive.live.com/redir?resid=58A31FC352FC3D1A!6118&authkey=!AATDebemNJIgHoo&ithint=file%2ccsv
Here's an example using NTILE and your simple sample data at the top of your question:
declare @samples table (ID int, sample int)
insert into @samples (ID,sample) values
(1,1000),
(2,1200),
(3,1100),
(4,1500)

declare @results int
set @results = 2

;With grouped as (
select *,NTILE(@results) OVER (order by ID) as nt
from @samples
)
select nt,AVG(sample) from grouped
group by nt
Which produces:
nt
-------------------- -----------
1                    1100
2                    1300
If @results is changed to 4 (or any higher number) then you just get back your original result set.
Unfortunately, I don't have your full data, nor can I fully understand what you're trying to do with the full stored procedure, so the above would probably need to be adapted somewhat; a sketch of one possible adaptation follows.
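A rough sketch of that adaptation (assuming the #t temp table and the parameters from your procedure): NTILE conveniently puts every row in its own group when there are fewer rows than @EstimatedPointCount, which would remove the IF/ELSE entirely:
WITH grouped AS
(
    SELECT ResourceType, MonitoringDateTime, ResultCount,
           NTILE(@EstimatedPointCount) OVER (PARTITION BY ResourceType
                                             ORDER BY MonitoringDateTime) AS nt
    FROM #t
)
SELECT ResourceType,
       MIN(MonitoringDateTime) AS MonitoringDateTime, -- earliest time as the group label
       AVG(ResultCount)        AS ResultCount
FROM grouped
GROUP BY ResourceType, nt
ORDER BY MonitoringDateTime;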
I haven't tried it, but how about instead of
select ResourceType, CAST(AVG(CAST(#t.MonitoringDateTime AS DECIMAL( 18, 6))) AS DATETIME) MonitoringDateTime, AVG(ResultCount) ResultCount
from #t
where [Row] % @groupSize = 0
group by ResourceType, [Row]
order by MonitoringDateTime
perhaps something like
select ResourceType, CAST(AVG(CAST(#t.MonitoringDateTime AS DECIMAL( 18, 6))) AS DATETIME) MonitoringDateTime, AVG(ResultCount) ResultCount
from #t
group by ResourceType, convert(int, [Row]/@groupSize)
order by MonitoringDateTime
Maybe that points you in a new direction? By converting to int we truncate everything after the decimal, so I'm hoping that will give you a better grouping. You might need to put your row number over resource type for this to work; a sketch of that change is below.
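Concretely, that last point would mean changing the numbering in the procedure to something like this (a sketch, partitioning by resource type so each resource's points are bucketed independently):
ROW_NUMBER() OVER (PARTITION BY a.ResourceTypeID ORDER BY d.GeneratedOnUtc ASC) AS Row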

Parsing / Indexing a Binary String in SQL Server

I have searched extensively for a relevant answer, but none quite satisfy what I need to be doing.
For our purposes I have a column with a 50-character binary string. In our database, it is actually hundreds of characters long.
There is one string for each unique item ID in our database. The location of each '1' flags a specific criteria being true, and a '0' false, so the indexed location of the ones and zeros are very important. Mostly, I care about where the 1's are.
I am not updating any databases, so I first decided to try and make a loop to look through each string and create a list of the 1's locations.
declare @binarystring varchar(50) = '10000010000110000001000000000000000000000000000001'
declare @position int = 0
declare @list varchar(200) = ''

while (@position <= len(@binarystring))
begin
    set @position = charindex('1', @binarystring, @position)
    set @list = @list + ', ' + convert(varchar(10), @position)
    set @position = charindex('1', @binarystring, @position) + 1
end

select right(@list, len(@list)-2)
This creates the following list:
1, 7, 12, 13, 20, 50
However, the loop will bomb if there is not a '1' at the end of the string, as I am searching through the string by occurrences of 1's rather than one character at a time. I am not sure how to satisfy the break criteria when the loop reaches the end of the string without finding another 1.
Is there a simple solution to my loop bombing, and should I even be looping in the first place?
I have tried other methods of parsing, union joining, indexing, etc, but given this very specific set of circumstances I couldn't find any combination that did quite what I needed. The above code is the best I've got so far.
I don't specifically need a comma-delimited list as output, but I need to know the locations of all 1's within the string. The number of 1's varies, but the string size is always the same.
This is my first time posting to stackoverflow, but I have used answers many times. I seek to give a clear question with relevant information. If there is anything I can do to help, I will try to fulfill any requests.
How about changing the while condition to this?
while (charindex('1', @binarystring, @position) > 0)
Alternatively, keep the original condition and break out explicitly when no further '1' is found:
while (@position <= len(@binarystring))
begin
    set @position = charindex('1', @binarystring, @position)
    if @position != 0
    begin
        set @list = @list + ', ' + convert(varchar(10), @position)
        set @position = charindex('1', @binarystring, @position) + 1
    end
    else
    begin
        break
    end;
end
It's often useful to have a source of large ranges of sequential integers handy. I have a table, dbo.range, that has a single column, id, containing all the sequential integers from -500,000 to +500,000. That column is a clustered primary key, so lookups against it are fast. With such a table, solving your problem is easy.
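For reference, a one-time setup sketch for such a table (the name and range are taken from the description above; the row source is arbitrary):
create table dbo.range
(
    id int not null primary key clustered
);

insert dbo.range (id)
select top (1000001)
       row_number() over (order by (select null)) - 500001
from sys.all_columns a
cross join sys.all_columns b; -- plenty of rows to draw from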
Assuming your table has a schema something like
create table dbo.some_table
(
    id    int           not null primary key ,
    flags varchar(1000) not null ,
)
The following query should do you:
select row_id = t.id ,
flag_position = r.id
from dbo.some_table t
join dbo.range r on r.id between 1 and len(t.flags)
and substring(t.flags,r.id,1) = '1'
For each 1 value in the flags column, you'll get a row containing the ID from your source table's ID column, plus the position in which the 1 was found in flags.
There are a number of techniques for generating such sequences. This link shows several:
http://sqlperformance.com/2013/01/t-sql-queries/generate-a-set-1
For instance, you could use common table expressions (CTEs) to generate your sequences, like this:
WITH
s1(n) AS -- 10 (10^1)
( SELECT 1
UNION ALL SELECT 1
UNION ALL SELECT 1
UNION ALL SELECT 1
UNION ALL SELECT 1
UNION ALL SELECT 1
UNION ALL SELECT 1
UNION ALL SELECT 1
UNION ALL SELECT 1
UNION ALL SELECT 1
) ,
s2(n) as ( select 1 from s1 a cross join s1 b ) , -- 10^2 100
s3(n) as ( select 1 FROM s1 a cross join s2 b ) , -- 10^3 1,000
s4(n) as ( select 1 from s1 a cross join s3 b ) , -- 10^4 10,000
s5(n) as ( select 1 from s1 a cross join s4 b ) , -- 10^5 100,000
s6(n) as ( select 1 from s1 a cross join s5 b ) , -- 10^6 1,000,000
seq(n) as ( select row_number() over ( order by n ) from s6 )
select *
from dbo.some_table t
join seq s on s.n between 1 and len(t.flags)
and substring(t.flags,s.n,1) = '1'