Get the Capital letters from a string - sql

I have this requirement to extract the capital letters from a column in SQL Server.
EX: ABC_DEF_ghi
I only want to extract ABC_DEF.
Sometimes the string could be like ABC_DEF_GHI_jkl, so in this case it will be ABC_DEF_GHI
Any suggestions would be helpful.
Thanks in advance.

As Tim Biegeleisen mentioned, this isn't easy in SQL Server, as it doesn't support regular expressions. As such, you have to be somewhat inventive.
As we don't know what version of SQL Server you are using (though I did ask) I am assuming you are using the latest version of SQL Server, and have access to both STRING_AGG and TRIM. If not, you'll need to use the old FOR XML PATH and STUFF method for string aggregation, and LTRIM and RTRIM with nested REPLACEs for TRIM.
Anyway, what I do here is collate the value to a binary collation that is both case sensitive and orders the letters as uppercase and then lowercase (a collation that orders lowercase and then uppercase would be fine too; what matters is that it doesn't order alphabetically first and then by case). So the order is like ABC...Zabc...z rather than AaBb...Zz. I then use a Tally to split the collated string into its individual characters.
I then use STRING_AGG with a CASE expression to only retain the Underscore characters (which you appear to want as well) and just the uppercase letters. Finally I use TRIM to remove any leading and trailing underscores; without this the value returned would be 'ABC_DEF_GHI_'.
I also assume you are doing this against a table, rather than a scalar value, which gives this:
DECLARE @SomeString varchar(100) = 'ABC_DEF_GHI_jkl';
WITH N AS(
SELECT N
FROM (VALUES(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL))N(N)),
Tally AS(
SELECT TOP (SELECT MAX(LEN(V.SomeString)) FROM (VALUES(@SomeString))V(SomeString)) --This would be your table
ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS I
FROM N N1, N N2) --100 rows, add more cross joins for more rows
SELECT TRIM('_' FROM STRING_AGG(CASE WHEN SS.C LIKE '[A-Z_]' THEN SS.C END,'') WITHIN GROUP (ORDER BY T.I)) AS NewString
FROM (VALUES(@SomeString))V(SomeString) --This would be your table
CROSS APPLY (VALUES(V.SomeString COLLATE Latin1_General_BIN))C(SomeString) --Collate to a collation that is both case sensitive and orders Uppercase first
JOIN Tally T ON LEN(C.SomeString) >= T.I
CROSS APPLY (VALUES(SUBSTRING(C.SomeString,T.I,1)))SS(C) --Get each character
GROUP BY V.SomeString;
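To illustrate why the collation matters for the [A-Z] range, here is a minimal sketch comparing a case-insensitive collation with the binary one used above. The sample characters and column aliases are only for illustration, and Latin1_General_CI_AS is assumed to be available on your instance.

```sql
-- Compare how the same range pattern behaves under two collations.
SELECT V.C,
       CASE WHEN V.C COLLATE Latin1_General_CI_AS LIKE '[A-Z]' THEN 1 ELSE 0 END AS MatchesCaseInsensitive,
       CASE WHEN V.C COLLATE Latin1_General_BIN   LIKE '[A-Z]' THEN 1 ELSE 0 END AS MatchesBinary
FROM (VALUES('A'),('a'),('Z'),('z'),('_'))V(C);
```

Under the case-insensitive collation the lowercase letters match as well, whereas under the binary collation only 'A' and 'Z' do. The underscore matches neither, which is why it is listed explicitly in '[A-Z_]'.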
Of course, a "simpler" solution might be to find and implement a Regex CLR function and just use that. 🙃
Turns out the OP is using 2014... This means the above needs some significant refactoring. I am afraid I don't explain how FOR XML PATH or REPLACE work here (as I put the effort into the original solution); however, a search will yield you the details:
WITH N AS(
SELECT N
FROM (VALUES(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL))N(N)),
Tally AS(
SELECT TOP (SELECT MAX(LEN(V.SomeString)) FROM (VALUES(@SomeString))V(SomeString)) --This would be your table
ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS I
FROM N N1, N N2) --100 rows, add more cross joins for more rows
SELECT REPLACE(LTRIM(RTRIM(REPLACE((SELECT CASE WHEN SS.C LIKE '[A-Z_]' THEN SS.C END
FROM (VALUES(V.SomeString COLLATE Latin1_General_BIN))C(SomeString) --Collate to a collation that is both case sensitive and orders Uppercase first
JOIN Tally T ON LEN(C.SomeString) >= T.I
CROSS APPLY (VALUES(SUBSTRING(C.SomeString,T.I,1)))SS(C) --Get each character
ORDER BY T.I
FOR XML PATH(''),TYPE).value('(./text())[1]','varchar(100)'),'_',' '))),' ','_') AS NewString
FROM (VALUES(@SomeString))V(SomeString) --This would be your table
GROUP BY V.SomeString;

For SQL Server 2017 and later:
DECLARE @SomeString varchar(100) = 'ABC_DEF_GHI_jkl';
WITH T0 AS
(
SELECT 1 AS INDICE,
SUBSTRING(@SomeString, 1, 1) AS RAW_LETTER,
SUBSTRING(UPPER(@SomeString), 1, 1) AS UP_LETTER
UNION ALL
SELECT INDICE + 1,
SUBSTRING(@SomeString, INDICE + 1, 1) AS RAW_LETTER,
SUBSTRING(UPPER(@SomeString), INDICE + 1, 1)
FROM T0
WHERE INDICE < LEN(@SomeString)
)
SELECT STRING_AGG(RAW_LETTER, '') WITHIN GROUP (ORDER BY INDICE)
FROM T0
WHERE RAW_LETTER COLLATE Latin1_General_BIN = UP_LETTER;
For SQL Server versions earlier than 2017:
WITH T0 AS
(
SELECT 1 AS INDICE,
SUBSTRING(@SomeString, 1, 1) AS RAW_LETTER,
SUBSTRING(UPPER(@SomeString), 1, 1) AS UP_LETTER
UNION ALL
SELECT INDICE + 1,
SUBSTRING(@SomeString, INDICE + 1, 1) AS RAW_LETTER,
SUBSTRING(UPPER(@SomeString), INDICE + 1, 1)
FROM T0
WHERE INDICE < LEN(@SomeString)
)
SELECT STUFF((SELECT '' + RAW_LETTER
FROM T0
WHERE RAW_LETTER COLLATE Latin1_General_BIN = UP_LETTER
ORDER BY INDICE
FOR XML PATH('')), 1, 0, '');
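One caveat with the recursive CTE approach: SQL Server stops a recursive CTE after 100 levels by default, so a string longer than about 100 characters raises an error unless the limit is lifted. A minimal sketch, assuming a hypothetical 250-character input built with REPLICATE:

```sql
-- Hypothetical 250-character input, longer than the default limit of 100 recursion levels.
DECLARE @SomeString varchar(300) = REPLICATE('AB_cd', 50);

WITH T0 AS
(
    SELECT 1 AS INDICE,
           SUBSTRING(@SomeString, 1, 1) AS RAW_LETTER
    UNION ALL
    SELECT INDICE + 1,
           SUBSTRING(@SomeString, INDICE + 1, 1)
    FROM T0
    WHERE INDICE < LEN(@SomeString)
)
SELECT COUNT(*) AS CharacterCount -- one row per character
FROM T0
OPTION (MAXRECURSION 0); -- 0 removes the limit; without this hint the query fails
```

The same OPTION (MAXRECURSION ...) hint can be appended to either of the two queries above if the column may hold longer values.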

Related

Is it possible to find the first occurrence of a string that's NOT within a set of delimiters in SQL Server 2016+?

I have a column in a SQL Server table that has strings of varying lengths. I need to find the position of the first occurrence of the string , -- that's not enclosed in single quotes or square brackets.
For example, in the following two strings, I've bolded the portion I would like to get the position of. Notice in the first string, the first time , -- appears on its own (without being between single quote or square bracket delimiters) is at position 13 and in the second string, it's at position 16.
'a, --'[, --]**, --**[, --]
[a, --b]aaaaaaa_ **, --**', --'
Also I should mention that , -- itself could appear multiple times in the string.
Here's a simple query that shows the strings and my desired output.
SELECT
t.string, t.desired_pos
FROM
(VALUES (N'''a, --''[, --], --[, --]', 14),
(N'[a, —-b]aaaaaaa_ , --'', --''', 18)) t(string, desired_pos)
Is there any way to accomplish this using a SELECT query (or multiple) without using a function?
Thank you in advance!
I've tried variations of SUBSTRING, CHARINDEX, and even some CROSS APPLYs but I can't seem to get the result I'm looking for.
Before I write down my solution, I must warn you: DON'T USE IT. Use a function, or do this in some other language. This code is probably buggy.
It doesn't handle things like escaped quotes, etc.
The idea is to first remove the content inside brackets [] and quotes '', and then just do a "simple" charindex.
To remove that content, I'm using a recursive CTE that loops over every pair of matching brackets or quotes and replaces the span with a placeholder string.
One important point is that quotes and brackets might be embedded in each other, so you have to try both variants and choose the one that starts earliest.
WITH CTE AS (
SELECT *
FROM
(VALUES (N'''a, --''[, --], --[, --]', 14),
(N'[a, —-b]aaaaaaa_ , --'', --''', 18)) t(string, desired_pos)
)
, cte2 AS (
select x.start
, x.finish
, case when x.start > 0 THEN STUFF(string, x.start, x.finish - x.start + 1, REPLICATE('a', x.finish - x.start + 1)) ELSE string END AS newString
, 1 as level
, string as orig
, desired_pos
from cte
CROSS APPLY (
SELECT *
, ROW_NUMBER() OVER(ORDER BY case when start > 0 THEN 0 ELSE 1 END, start) AS sortorder
FROM (
SELECT charindex('[', string) AS start
, charindex(']', string) AS finish
UNION ALL
SELECT charindex('''', string) AS startQ
, charindex('''', string, charindex('''', string) + 1) AS finishQ
) x
) x
WHERE x.sortorder = 1
UNION ALL
select x.start
, x.finish
, STUFF(newString, x.start, x.finish - x.start + 1, REPLICATE('a', x.finish - x.start + 1))
, cte2.level + 1 as level -- increment so the final query can pick the last iteration
, orig
, desired_pos
from cte2
CROSS APPLY (
SELECT *
, ROW_NUMBER() OVER(ORDER BY case when start > 0 THEN 0 ELSE 1 END, start) AS sortorder
FROM (
SELECT charindex('[', newString) AS start
, charindex(']', newString) AS finish
UNION ALL
SELECT charindex('''', newString) AS startQ
, charindex('''', newString, charindex('''', newString) + 1) AS finishQ
) x
) x
WHERE x.sortorder = 1
AND x.start > 0
AND cte2.start > 0 -- Must have been a match
)
SELECT PATINDEX('%, --%', newString), *
from (
select *, row_number() over(partition by orig order by level desc) AS sort
from cte2
) x
where x.sort = 1
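The masking step at the heart of the query above can be seen in isolation. This is only a sketch of a single iteration, using the first sample string and masking the first quoted span; the variable names are made up for the illustration.

```sql
DECLARE @s nvarchar(100) = N'''a, --''[, --], --[, --]';

-- Locate the first pair of single quotes.
DECLARE @start  int = CHARINDEX('''', @s);
DECLARE @finish int = CHARINDEX('''', @s, @start + 1);

-- Overwrite the quoted span (delimiters included) with filler of the same length,
-- so every later character keeps its original position.
SELECT STUFF(@s, @start, @finish - @start + 1,
             REPLICATE(N'a', @finish - @start + 1)) AS Masked;
-- Masked: aaaaaaa[, --], --[, --]
```

Because the filler has the same length as the span it replaces, the position found at the end is still valid for the original string.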
Try this approach: replace the delimited strings you don't need with another string of the same length, then look for the position of the string you are interested in.
SELECT string, desired_pos,
CHARINDEX(', --', REPLACE(REPLACE(string, ''', --''', '******'), '[, --]', '******')
) start_index
FROM (VALUES (N''', --''[, --], --[, --]', 13),
(N'[, --]aaaaaaa_ , --'', --''', 16)) t(string, desired_pos)
I don't know if a C# solution makes sense here, but this class for CSV is a nice little parser: TextFieldParser
Then you just define the delimiters etc., and assuming the input is escaped consistently, all is good.
I'm late to the game here, but this kind of thing is simple in SQL Server when leveraging NGrams8k. You don't need regex, a CLR function, or C#. Furthermore, NGrams8k will be the fastest by far; in 8 years nobody has produced anything remotely as fast. This code will also be faster and far less complex than a recursive CTE solution (which are almost always slow in SQL Server).
;--==== Sample Data
DECLARE @T TABLE (String VARCHAR(100))
INSERT @T
VALUES (N'''a, --''[, --], --[, --]'),
(N'[a, —-b]aaaaaaa_ , --'', --''');
;--==== Solution
SELECT
t.String, ng.Position
FROM @T AS t
CROSS APPLY (VALUES(REPLACE(t.String,'[',CHAR(1)))) AS f(S)
CROSS APPLY samd.NGrams8k(f.S,4) AS ng
CROSS APPLY (VALUES(SUBSTRING(f.S,ng.Position-2,7))) AS g(String)
WHERE ng.Token = ', --'
AND g.String NOT LIKE '%''%''%'
AND g.String NOT LIKE '%'+CHAR(1)+'%]%';
Results:
String                        Position
----------------------------- --------------------
'a, --'[, --], --[, --]       14
[a, —-b]aaaaaaa_ , --', --'   18

How to pull out information from a long string of data

I have this data point:
455-U-202007302233,455-L-202007302233,422-U-202008011052,422-L-202008011052,857-U-202008041142,857-L-202008061215
Column: ,[t810str]
How would I be able to modify column [t810str] in order to pull out the last comma set before 857?
Desired Result = 422-L-202008011052
First you need to implement some kind of splitter that respects ordinal position (STRING_SPLIT does not). I'm therefore going to make use of DelimitedSplit8k_LEAD. Then you can split the value, and use LAG to get the prior value. Finally you can filter on where the item has a value LIKE '857%' but the previous does not:
WITH CTE AS(
SELECT DS.Item,
LAG(DS.Item) OVER (PARTITION BY YourColumn ORDER BY DS.itemNumber) AS PrevItem
FROM (VALUES('455-U-202007302233,455-L-202007302233,422-U-202008011052,422-L-202008011052,857-U-202008041142,857-L-202008061215'))V(YourColumn)
CROSS APPLY dbo.DelimitedSplit8K_LEAD(V.YourColumn,',') DS)
SELECT C.PrevItem
FROM CTE C
WHERE C.Item LIKE '857%'
AND C.PrevItem NOT LIKE '857%';
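If you happen to be on SQL Server 2022 or Azure SQL Database, the built-in STRING_SPLIT accepts a third enable_ordinal argument and returns an ordinal column, so the same LAG logic can be sketched without a custom splitter. This is a variation on the query above under that version assumption, not a replacement for it on older versions:

```sql
WITH CTE AS(
    SELECT SS.value AS Item,
           LAG(SS.value) OVER (PARTITION BY V.YourColumn ORDER BY SS.ordinal) AS PrevItem
    FROM (VALUES('455-U-202007302233,455-L-202007302233,422-U-202008011052,422-L-202008011052,857-U-202008041142,857-L-202008061215'))V(YourColumn)
         CROSS APPLY STRING_SPLIT(V.YourColumn, ',', 1) SS) -- 1 = enable_ordinal
SELECT C.PrevItem
FROM CTE C
WHERE C.Item LIKE '857%'
  AND C.PrevItem NOT LIKE '857%';
```

For the sample value this returns 422-L-202008011052.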
Based on your data and the assumption that items are 18 characters (your data do not indicate otherwise):
DECLARE @t AS NVARCHAR(255) = '455-U-202007302233,455-L-202007302233,422-U-202008011052,422-L-202008011052,857-U-202008041142,857-L-202008061215';
SELECT RIGHT(LEFT(@t,CHARINDEX(',857',@t)-1),18)
Using CROSS APPLY (which you can also rewrite using a CTE or a subquery for readability). This removes everything from the first occurrence of ,857 onward and then grabs the last item that's left. So even if you have multiple 857 items and delimited strings of varying length, this should work:
select *, right(remind , charindex (',' ,reverse(remind))-1)
from t t1
cross apply (select stuff(col, charindex(',857',col), len(col),'') as remind) t2
Another solution uses a recursive CTE:
DECLARE @Var VARCHAR(200) = '455-U-202007302233,455-L-202007302233,422-U-202008011052,422-L-202008011052,857-U-202008041142,857-L-202008061215';
WITH CTE AS
(
SELECT 0 N, LEFT(@Var, CHARINDEX(',', @Var)-1) Part,
RIGHT(@Var, LEN(@Var) - CHARINDEX(',', @Var)) Remind
UNION ALL
SELECT N + 1,
LEFT(Remind, CHARINDEX(',', Remind) - 1),
RIGHT(Remind, LEN(Remind) - CHARINDEX(',', Remind))
FROM CTE
WHERE CHARINDEX(',', Remind) <> 0
)
SELECT TOP 1 Part
FROM CTE
WHERE LEFT(Remind, 3) = '857'
ORDER BY N;
Implemented with string functions (and assuming your data items can have variable length :-) it might look a bit confusing (therefore I'd prefer @Larnu's answer):
DECLARE @string VARCHAR(2000) = '455-U-202007302233,455-L-202007302233,422-U-202008011052,422-L-202008011052,857-U-202008041142,857-L-202008061215'
SELECT SUBSTRING(@string, CHARINDEX(',857',@string) - CHARINDEX(',', REVERSE( LEFT(@string, PATINDEX('%,857%',@string) - 1)) ) + 1, CHARINDEX(',', REVERSE( LEFT(@string, PATINDEX('%,857%',@string) - 1)))-1 )
The parts of the expression above, evaluated separately:
DECLARE @string VARCHAR(2000) = '455-U-202007302233,455-L-202007302233,422-U-202008011052,422-L-202008011052,857-U-202008041142,857-L-202008061215'
SELECT CHARINDEX(',857',@string)
SELECT LEFT(@string, PATINDEX('%,857%',@string) - 1)
SELECT REVERSE( LEFT(@string, PATINDEX('%,857%',@string) - 1) )
SELECT CHARINDEX(',', REVERSE( LEFT(@string, PATINDEX('%,857%',@string) - 1)) )

SQL Get string between second and third underscore

I need to extract a certain string from a column in a table as part of an SSIS package.
The contents of the column is formatted like this "TST_AB1_ABC123456_TEST".
I need to get the string between the second and 3rd "_", e.g. "ABC123456" without changing too much of the package so would rather do it in 1 SQL command if possible.
I've tried a few different methods using SUBSTRING, REVERSE and CHARINDEX but can't figure out how to get just that string.
Using the base string functions:
SELECT
SUBSTRING(col,
CHARINDEX('_', col, CHARINDEX('_', col) + 1) + 1,
CHARINDEX('_', col, CHARINDEX('_', col, CHARINDEX('_', col) + 1) + 1) -
CHARINDEX('_', col, CHARINDEX('_', col) + 1) - 1)
FROM yourTable;
In notes format, the above call to SUBSTRING is saying:
SELECT
SUBSTRING(<your column>,
<starting at one past the second underscore>,
<for a length of the number of characters in between the 2nd and 3rd
underscore>)
FROM yourTable;
On other databases, such as Postgres and Oracle, there are substring index and regex functions which can handle the above more gracefully. Actually, more recent versions of SQL Server have a STRING_SPLIT function, which could be used here, but it does not maintain the order of the resulting parts.
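To make the nested CHARINDEX calls easier to follow, here is a small sketch that computes the three underscore positions one at a time for the sample value. The variable names are only for illustration; in a real query the expressions would stay inline against the column.

```sql
DECLARE @col varchar(100) = 'TST_AB1_ABC123456_TEST';

DECLARE @u1 int = CHARINDEX('_', @col);          -- 1st underscore: 4
DECLARE @u2 int = CHARINDEX('_', @col, @u1 + 1); -- 2nd underscore: 8
DECLARE @u3 int = CHARINDEX('_', @col, @u2 + 1); -- 3rd underscore: 18

-- Start one past the 2nd underscore; length is the gap between the 2nd and 3rd.
SELECT SUBSTRING(@col, @u2 + 1, @u3 - @u2 - 1) AS Middle; -- ABC123456
```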
If your column values always have 4 parts you can use the PARSENAME() function like this.
DECLARE @MyString VARCHAR(100)
SET @MyString = 'TST_AB1_ABC123456_TEST';
SELECT PARSENAME(REPLACE(@MyString, '_', '.'), 2)
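Note that PARSENAME numbers the parts from the right, which is why the third of four parts is requested as 2. A quick sketch showing all four parts of the sample value (the aliases are just illustrative):

```sql
DECLARE @MyString varchar(100) = 'TST_AB1_ABC123456_TEST';
DECLARE @Dotted   varchar(100) = REPLACE(@MyString, '_', '.');

SELECT PARSENAME(@Dotted, 4) AS Part1, -- TST
       PARSENAME(@Dotted, 3) AS Part2, -- AB1
       PARSENAME(@Dotted, 2) AS Part3, -- ABC123456
       PARSENAME(@Dotted, 1) AS Part4; -- TEST
```

PARSENAME returns NULL when the value has more than four parts, and values that already contain a dot will be split in unexpected places, so this trick only suits data with a known, fixed shape.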
You could also do this using CROSS APPLY. I added a WHERE clause to make sure you don't get an error resulting from strings without 3 underscores.
with your_table as (select 'TST_AB1_ABC123456_TEST' as txt1)
select txt1, txt2
from your_table t1
cross apply (select charindex( '_', txt1) as i1) t2 -- locate the 1st underscore
cross apply (select charindex( '_', txt1, (i1 + 1)) as i2 ) t3 -- then the 2nd
cross apply (select charindex( '_', txt1, (i2 + 1)) as i3 ) t4 -- then the 3rd
cross apply (select substring( txt1,(i2+1), (i3-i2-1)) as txt2) t5 -- between 2nd & 3rd
where txt1 like '%[_]%[_]%[_]%' -- [_] matches a literal underscore; a bare _ is a single-character wildcard
Outputs
+------------------------+-----------+
| txt1                   | txt2      |
+------------------------+-----------+
| TST_AB1_ABC123456_TEST | ABC123456 |
+------------------------+-----------+

sql string split for defined number of char

I have a string like 'aabbcczx' and I need to split it into 2-character pieces.
The result expected is something like:
aabbcczx aa
aabbcczx bb
aabbcczx cc
aabbcczx zx
How can I do this?
Consider also that the length of the string changes from row to row.
Thanks
If it's always 2 chars:
SELECT A.Val,
CA1.N,
SUBSTRING(A.Val,n,2)
FROM (
VALUES ('aabbcczx')
) AS A(Val)
CROSS
APPLY dbo.GetNums(1,LEN(A.Val)) AS CA1
WHERE CA1.n % 2 = 1;
GetNums is a number table/tally table generator; you can find several versions of it online.
It provides the position of each character, which we use as the SUBSTRING start position. The WHERE clause uses the modulo operator (%) so that we only keep every other starting position.
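If you don't have a GetNums function installed, the same idea can be sketched with an inline tally built from a VALUES list. The CTE names N and Tally are arbitrary, and this version only covers strings of up to 100 characters; add more cross joins for longer ones.

```sql
WITH N AS(
    SELECT N
    FROM (VALUES(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL))N(N)),
Tally AS(
    SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS I
    FROM N N1, N N2) -- 10 x 10 = 100 rows
SELECT A.Val,
       SUBSTRING(A.Val, T.I, 2) AS Pair
FROM (VALUES('aabbcczx')) AS A(Val)
     JOIN Tally T ON T.I <= LEN(A.Val)
                 AND T.I % 2 = 1 -- odd positions only: 1, 3, 5, ...
ORDER BY A.Val, T.I;
```

For 'aabbcczx' this yields aa, bb, cc and zx, one per row.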
You can use a recursive query:
with cte as (
select convert(varchar(max), left(str, 2)) as val2, convert(varchar(max), stuff(str, 1, 2, '')) as rest, str
from (values ( 'aabbcczx' )) v(str)
union all
select left(rest, 2) as val2, stuff(rest, 1, 2, '') as rest, str
from cte
where rest <> ''
)
select str, val2
from cte;
You can use a recursive query to extract pairs of characters:
with instring as
( select 'aabbcczx' as s )
, splitter as
(
select s, substring(s, 1, 2) as rslt, 3 as next -- first two chars
from instring
union all
select s, substring(s, next, 2), next + 2 -- next two chars
from splitter
where len(s) >= next
)
select *
from splitter

Remove Characters in a String in SQL

I have a column u_manualdoc which contains values like CGY DR# 7405. I want to remove the CGY DR# part.
Here's the code:
select u_manualdoc, cardcode, cardname from ODLN
I want only the 7405 number. Thanks!
Try this:
--sample data you provided in comments
declare @tbl table(codes varchar(20))
insert into @tbl values
('CGY PST - 58277') , ('CGY RMC PST # 58083'), ('CGY DR # 7443'), ('CSI # 1304'), ('PO# 0568 , 0570'), ('CGY DR# 7446')
--actual query that you can apply to your table
select SUBSTRING(codes, PATINDEX('%[0-9]%', codes), len(codes)) from @tbl
The key point here is to use PATINDEX, which searches for a pattern and returns the index where that pattern first occurs. I specified %[0-9]%, which means we search for any digit, so it returns the position of the first digit. Since this is the starting point of our substring, we pass it to SUBSTRING. The third parameter of SUBSTRING is the length; since we want the rest of the string, passing LEN makes sure that we get it :)
Applying to your naming:
select SUBSTRING(u_manualdoc, PATINDEX('%[0-9]%', u_manualdoc), len(u_manualdoc)),
cardcode,
cardname
from ODLN
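One edge case worth knowing: when a value contains no digit at all, PATINDEX returns 0 and SUBSTRING then returns most of the original text rather than nothing. A hedged sketch of a guard using NULLIF, with a made-up 'NO NUMBER' row added for the demonstration:

```sql
DECLARE @tbl table(codes varchar(20));
INSERT INTO @tbl VALUES ('CGY DR# 7446'), ('NO NUMBER'); -- second row is hypothetical

SELECT codes,
       SUBSTRING(codes, PATINDEX('%[0-9]%', codes), LEN(codes)) AS Unguarded,
       -- NULLIF turns a "not found" position of 0 into NULL, so the result is NULL
       SUBSTRING(codes, NULLIF(PATINDEX('%[0-9]%', codes), 0), LEN(codes)) AS Guarded
FROM @tbl;
```

Whether NULL or an empty string is the better result for such rows depends on how the column is used downstream.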
You should use the string functions CHARINDEX, LEN and SUBSTRING to get it.
See the code below.
select SUBSTRING(u_manualdoc,CHARINDEX('#',u_manualdoc)+1,LEN(u_manualdoc)- CHARINDEX('#',u_manualdoc))
from ODLN
EDIT
In addition to the other answers, you can use this simple method:
select
substring(
u_manualdoc,
len(u_manualdoc) - patindex('%[^0-9]%', reverse(u_manualdoc)) + 2,
len(u_manualdoc)
),
cardcode, cardname
from ODLN
In this example, PATINDEX finds the first non-digit (as specified by [^0-9]) from the right side of the string, and that position is then used to compute the starting point of the substring.
This will work on all of your sample strings (including 'PO# 0568 , 0570 CGY DR# 7446').
Or use SQL Server Regex, which lets you use more powerful regular expressions within your queries.
TRY THIS
DECLARE @table TABLE(DirtyCol VARCHAR(100));
INSERT INTO @table
VALUES('AB ABCDE # 123'), ('ABCDE# 123'), ('AB: ABC# 123 AB: ABC# 123'), ('AB#'), ('AB # 1 000 000'), ('AB # 1`234`567'), ('AB # (9)(876)(543)');
WITH tally
AS (
SELECT TOP (100) N = ROW_NUMBER() OVER(ORDER BY @@SPID)
FROM sys.all_columns),
data
AS (
SELECT DirtyCol,
Col
FROM @table
CROSS APPLY
(
SELECT
(
SELECT C+''
FROM
(
SELECT N,
SUBSTRING(DirtyCol, N, 1) C
FROM tally
WHERE N <= DATALENGTH(DirtyCol)
) [1]
WHERE C BETWEEN '0' AND '9'
ORDER BY N FOR XML PATH('')
)
) p(Col)
WHERE p.Col IS NOT NULL)
SELECT DirtyCol,
CAST(Col AS INT) IntCol
FROM data;