How to check whether a string matches a custom template in SQL Server - sql

I have a column like this:
Codes
--------------------------------------------------
3/1151---------366-500-2570533-1
9/6809---------------------368-510-1872009-1
1-260752-305-154----------------154-200-260752-1--------154-800-13557-1
2397/35425---------------------------------377-500-3224575-1
17059----------------377-500-3263429-1
126/42906---------------------377-500-3264375-1
2269/2340-------------------------377-500-3065828-1
2267/767---------377-500-1452908-4
2395/118593---------377-500-3284699-1
2395/136547---------377-500-3303413-1
92/10260---------------------------377-500-1636038-1
2345-2064---------377-500-3318493-1
365-2290--------377-500-3278261-12
365-7212--------377-500-2587120-1
How can I extract codes with this format:
3digit-3digit-5to7digit-1to2digit
xxx-xxx-xxxxxx-xx
The result must be :
Codes
--------------------------------------------------
366-500-2570533-1
368-510-1872009-1
154-200-260752-1 , 154-800-13557-1 -- this row contains two matching codes
377-500-3224575-1
377-500-3263429-1
377-500-3264375-1
377-500-3065828-1
377-500-1452908-4
377-500-3284699-1
377-500-3303413-1
377-500-1636038-1
377-500-3318493-1
377-500-3278261-12
377-500-2587120-1
------------------------------------
This problem has completely worn me out.
Thanks for reading about my problem.

This is really ugly, really really ugly. I don't for one second suggest doing this in your RDBMS, and I really suggest you fix your data. You should not be storing "delimited" data (I use that word loosely to describe your data) in your tables; you should be storing it in separate columns and rows. In this case, the first "code" should be in one column, with a one-to-many relationship to another table holding the codes you're trying to extract.
As you haven't tagged or mentioned your version of SQL Server, I've used the latest SQL Server syntax. STRING_SPLIT is available in SQL Server 2016+ and STRING_AGG in 2017+. If you aren't on those versions you will need to replace those functions with suitable alternatives (I suggest delimitedsplit8k(_lead) and FOR XML PATH respectively; a rough FOR XML PATH sketch follows the main query below).
Anyway, here's what this does. First we need to fix the data into something more usable, so I change the double hyphens (--) to a pipe (|), as that character doesn't seem to appear in your data. I then use that pipe to split your data into parts (individual codes).
Because your delimiter isn't a consistent width, this leaves some codes with a leading hyphen, so I then have to get rid of that. Next I use my answer from your other question to split each code further into its components, and reverse the WHERE; previously the answer was looking for "bad" rows, whereas now we want "good" rows.
Then, after all of that, it's as "simple" as using STRING_AGG to delimit the "good" rows:
SELECT STRING_AGG(ca.Code,',') AS Codes
FROM (VALUES('3/1151---------366-500-2570533-1'),
('9/6809---------------------368-510-1872009-1'),
('1-260752-305-154----------------154-200-260752-1--------154-800-13557-1'),
('2397/35425---------------------------------377-500-3224575-1'),
('17059----------------377-500-3263429-1'),
('126/42906---------------------377-500-3264375-1'),
('2269/2340-------------------------377-500-3065828-1'),
('2267/767---------377-500-1452908-4'),
('2395/118593---------377-500-3284699-1'),
('2395/136547---------377-500-3303413-1'),
('92/10260---------------------------377-500-1636038-1'),
('2345-2064---------377-500-3318493-1'),
('365-2290--------377-500-3278261-12'),
('365-7212--------377-500-2587120-1')) V(Codes)
CROSS APPLY (VALUES(REPLACE(V.Codes,'--','|'))) D(DelimitedCodes)
CROSS APPLY STRING_SPLIT(D.DelimitedCodes,'|') SS
CROSS APPLY (VALUES(CASE LEFT(SS.[value],1) WHEN '-' THEN STUFF(SS.[value],1,1,'') ELSE SS.[value] END)) ca(Code)
CROSS APPLY (VALUES(PARSENAME(REPLACE(ca.Code,'-','.'),4),
PARSENAME(REPLACE(ca.Code,'-','.'),3),
PARSENAME(REPLACE(ca.Code,'-','.'),2),
PARSENAME(REPLACE(ca.Code,'-','.'),1))) PN(P1, P2, P3, P4)
WHERE LEN(PN.P1) = 3
AND LEN(PN.P2) = 3
AND LEN(PN.P3) BETWEEN 5 AND 7
AND LEN(PN.P4) BETWEEN 1 AND 2
AND ca.Code NOT LIKE '%[^0-9\-]%' ESCAPE '\'
GROUP BY V.Codes;
db<>fiddle
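For completeness, if you're on SQL Server 2016 (so STRING_SPLIT is available but STRING_AGG isn't), only the final aggregation step needs swapping out. Here is a rough FOR XML PATH sketch of just that step, assuming the "good" codes have already been extracted into a CTE or temp table called GoodCodes(SourceCodes, Code) (my naming, not part of the answer above):
SELECT g.SourceCodes,
       STUFF((SELECT ',' + g2.Code
              FROM GoodCodes g2
              WHERE g2.SourceCodes = g.SourceCodes
              FOR XML PATH(''), TYPE).value('.', 'varchar(8000)'), 1, 1, '') AS Codes
FROM GoodCodes g
GROUP BY g.SourceCodes;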

You have several problems here:
Splitting your longer strings into the codes you want.
Dealing with the fact that your separator for the longer strings is the same as your separator for the shorter ones.
Finding the patterns that you want.
The last is perhaps the simplest, because you can use brute force to solve that.
Here is a solution that extracts the values you want:
with t as (
select v.*
from (values ('3/1151---------366-500-2570533-1'),
('9/6809---------------------368-510-1872009-1'),
('1-260752-305-154----------------154-200-260752-1--------154-800-13557-1'),
('2397/35425---------------------------------377-500-3224575-1')
) v(str)
)
select t.*, ss.value
from t cross apply
(values (replace(replace(replace(replace(replace(t.str, '--', '><'), '<>', ''), '><', '|'), '|-', '|'), '-|', '|'))
) v(str_sep) cross apply
string_split(v.str_sep, '|') ss
where ss.value like '%-%-%-%' and
ss.value not like '%-%-%-%-%' and
(ss.value like '[0-9][0-9][0-9]-[0-9][0-9][0-9]-[0-9][0-9][0-9][0-9][0-9]-[0-9]' or
ss.value like '[0-9][0-9][0-9]-[0-9][0-9][0-9]-[0-9][0-9][0-9][0-9][0-9]-[0-9][0-9]' or
ss.value like '[0-9][0-9][0-9]-[0-9][0-9][0-9]-[0-9][0-9][0-9][0-9][0-9][0-9]-[0-9]' or
ss.value like '[0-9][0-9][0-9]-[0-9][0-9][0-9]-[0-9][0-9][0-9][0-9][0-9][0-9]-[0-9][0-9]' or
ss.value like '[0-9][0-9][0-9]-[0-9][0-9][0-9]-[0-9][0-9][0-9][0-9][0-9][0-9][0-9]-[0-9]' or
ss.value like '[0-9][0-9][0-9]-[0-9][0-9][0-9]-[0-9][0-9][0-9][0-9][0-9][0-9][0-9]-[0-9][0-9]'
);
Here is a db<>fiddle.
I would strongly encourage you to find some way of doing this string parsing anywhere other than SQL.
The key to this working is getting the long string of hyphens down to a single delimiter. SQL Server does not offer regular expression support for collapsing the hyphens (as some other databases do, and as is available in other programming languages). In Python, for instance, this would be much simpler.
The strange values statement with a zillion replaces is handling the repeated delimiters, replacing them with a single pipe delimiter.
Note: This uses string_split() as a convenience. It was introduced in SQL Server 2016. For earlier versions, there are plenty of examples of string splitting functions on the web.
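If you'd rather not install a splitter function, one common workaround on older versions converts the string to XML and shreds it. A minimal sketch (the function name dbo.SplitString is hypothetical, and it assumes the input contains no XML-special characters such as & or <); it can then be used in place of string_split() in the query above:
CREATE FUNCTION dbo.SplitString (@s varchar(8000), @delim char(1))
RETURNS TABLE
AS
RETURN
    SELECT n.x.value('.', 'varchar(8000)') AS [value]
    FROM (SELECT CAST('<i>' + REPLACE(@s, @delim, '</i><i>') + '</i>' AS xml) AS doc) d
    CROSS APPLY d.doc.nodes('/i') AS n(x);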

Related

T-SQL - How to pattern match for a list of values?

I'm trying to find the most efficient way to do some pattern validation in T-SQL and struggling with how to check against a list of values. This example works:
SELECT *
FROM SomeTable
WHERE Code LIKE '[0-9]JAN[0-9][0-9]'
OR Code LIKE '[0-9]FEB[0-9][0-9]'
OR Code LIKE '[0-9]MAR[0-9][0-9]'
OR Code LIKE '[0-9]APRIL[0-9][0-9]'
but I am stuck wondering if there is a syntax that will support a list of possible values within a single LIKE statement, something like this (which does not work):
SELECT *
FROM SomeTable
WHERE Code LIKE '[0-9][JAN, FEB, MAR, APRIL][0-9][0-9]'
I know I can leverage charindex, patindex, etc., just wondering if there is a simpler supported syntax for a list of possible values or some way to nest an IN statement within the LIKE. thanks!
I think the closest you'll be able to get is with a table value constructor, like this:
SELECT *
FROM SomeTable st
INNER JOIN (VALUES
('[0-9]JAN[0-9][0-9]'),
('[0-9]FEB[0-9][0-9]'),
('[0-9]MAR[0-9][0-9]'),
('[0-9]APRIL[0-9][0-9]')) As p(Pattern) ON st.Code LIKE p.Pattern
This is still less typing and slightly more efficient than the OR option, if not as brief as we hoped for. If you knew the month was always three characters we could do a little better:
Code LIKE '[0-9]___[0-9][0-9]'
Unfortunately, I'm not aware of a SQL Server pattern character for "0 or 1" characters. But if you want ALL months, we can use this much to reduce our match:
SELECT *
FROM SomeTable
WHERE (Code LIKE '[0-9]___[0-9][0-9]'
OR Code LIKE '[0-9]____[0-9][0-9]'
OR Code LIKE '[0-9]_____[0-9][0-9]')
You'll want to test this to check if the data might contain false positive matches, and of course the table-value constructor could use this strategy, too. Also, I really hope you're not storing dates in a varchar column, which is a broken schema design.
One final option you might have is building the pattern on the fly. Something like this:
Code LIKE '[0-9]' + 'JAN' + '[0-9][0-9]'
But how you find that middle portion is up to you.
The native TSQL string functions don't support anything like that.
But you can use a workaround (dbfiddle) such as
WHERE CASE WHEN Code LIKE '[0-9]%[^ ][0-9][0-9]' THEN SUBSTRING(Code, 2, LEN(Code) - 3) END
IN
( 'JAN', 'FEB', 'MAR', 'APRIL' )
So, first check that the string starts with a digit and ends in a non-space character followed by two digits, then check that the remainder of the string (the part not matched by the digit checks) is one of the values you want.
The reason for including the SUBSTRING inside the CASE is so that it is only evaluated on strings that pass the LIKE check, to avoid possible "Invalid length parameter passed to the LEFT or SUBSTRING function." errors if it were evaluated on a shorter string.
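To see the predicate in action, here is a small self-contained sketch (the sample values are my own, not from the question):
SELECT v.Code
FROM (VALUES ('1JAN23'), ('2APRIL09'), ('3XYZ45'), ('9FEB1')) v(Code)
WHERE CASE WHEN v.Code LIKE '[0-9]%[^ ][0-9][0-9]'
           THEN SUBSTRING(v.Code, 2, LEN(v.Code) - 3) END
      IN ('JAN', 'FEB', 'MAR', 'APRIL');
-- returns 1JAN23 and 2APRIL09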

How can I automatically extract content from a field in a SQL query?

The environment I am currently working in is Snowflake.
As a matter of data sensitivity, I will be using pseudonyms for my following question.
I have a specific field in one of my tables called FIELD_1. The data in this field is structured as such:
I am trying to figure out how to automatically extract from my FIELD_1 the output I have in FIELD_2.
Does anyone have any idea what kind of query I would need to achieve this? Any help would be GREATLY appreciated! I am really quite stuck on this problem.
Thank you!
You seem to want everything up to and including the first four-digit number, and then to replace the underscores with spaces. If so:
select replace(regexp_substr(field_1, '^[^0-9]*[0-9]{4}'), '_', ' ')
Or alternatively, if you want the first three components separated by underscores:
select replace(regexp_substr(field_1, '^[^_]+_[^_]+_[0-9]{4}'), '_', ' ')
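As a hypothetical illustration of the first expression (the sample value is mine; the real FIELD_1 data was shown in an image that isn't reproduced here):
SELECT REPLACE(REGEXP_SUBSTR('Jane_Doe_2021_10_05', '^[^0-9]*[0-9]{4}'), '_', ' ') AS field_2;
-- Jane Doe 2021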
If the data is as simplistic in reality as you've described here, you can use a variable-length LEFT() function in conjunction with REPLACE() to get the desired output:
SELECT FIELD_1, REPLACE(LEFT(FIELD_1, LEN(FIELD_1)-10),'_',' ') AS FIELD_2
FROM table_name
See also:
SELECT - Snowflake Documentation
LEFT - Snowflake Documentation
REPLACE - Snowflake Documentation
LENGTH, LEN - Snowflake Documentation

SQL - Split string with multiple delimiters into multiple rows and columns

I am trying to split a string in SQL with the following format:
'John, Mark, Peter|23, 32, 45'.
The idea is to have all the names in the first columns and the ages in the second column.
The query should be "dynamic", the string can have several records depending on user entries.
Does anyone know how to do this, and if possible without SQL functions? I have tried the cross apply approach but I wasn't able to make it work.
Any ideas?
This solution uses Jeff Moden's DelimitedSplit8K. Why? Because his solution provides the ordinal position of the item, something that many other functions, including Microsoft's own STRING_SPLIT, do not provide. It's going to be vitally important for getting this to work correctly.
Once you have that, the solution becomes fairly simple:
DECLARE @NameAges varchar(8000) = 'John, Mark, Peter|23, 32, 45';

WITH Splits AS (
    SELECT S1.ItemNumber AS PipeNumber,
           S2.ItemNumber AS CommaNumber,
           S2.Item
    FROM dbo.DelimitedSplit8K (REPLACE(@NameAges,' ',''), '|') S1 --As you have spaces between the delimiters I've removed these. Be CAREFUL with that
    CROSS APPLY dbo.DelimitedSplit8K (S1.Item, ',') S2)
SELECT S1.Item AS [Name],
       S2.Item AS Age
FROM Splits S1
JOIN Splits S2 ON S1.CommaNumber = S2.CommaNumber
              AND S2.PipeNumber = 2
WHERE S1.PipeNumber = 1;
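As an aside: on SQL Server 2022 and later, STRING_SPLIT does accept a third enable_ordinal argument, so the same idea works without installing a custom splitter. A sketch under that assumption (not part of the original answer):
DECLARE @NameAges varchar(8000) = 'John, Mark, Peter|23, 32, 45';

WITH Splits AS (
    SELECT S1.ordinal AS PipeNumber,
           S2.ordinal AS CommaNumber,
           S2.[value] AS Item
    FROM STRING_SPLIT(REPLACE(@NameAges, ' ', ''), '|', 1) S1
    CROSS APPLY STRING_SPLIT(S1.[value], ',', 1) S2
)
SELECT S1.Item AS [Name],
       S2.Item AS Age
FROM Splits S1
JOIN Splits S2 ON S1.CommaNumber = S2.CommaNumber
              AND S2.PipeNumber = 2
WHERE S1.PipeNumber = 1;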

BigQuery: Convert accented characters to their plain ascii equivalents

I have the following string:
brasília
And I need to convert to:
brasilia
Without the ´ accent!
How can I do this in BigQuery?
Thank you!
Try the below as a quick and simple option:
#standardSQL
WITH lookups AS (
SELECT
'ç,æ,œ,á,é,í,ó,ú,à,è,ì,ò,ù,ä,ë,ï,ö,ü,ÿ,â,ê,î,ô,û,å,ø,Ø,Å,Á,À,Â,Ä,È,É,Ê,Ë,Í,Î,Ï,Ì,Ò,Ó,Ô,Ö,Ú,Ù,Û,Ü,Ÿ,Ç,Æ,Œ,ñ' AS accents,
'c,ae,oe,a,e,i,o,u,a,e,i,o,u,a,e,i,o,u,y,a,e,i,o,u,a,o,O,A,A,A,A,A,E,E,E,E,I,I,I,I,O,O,O,O,U,U,U,U,Y,C,AE,OE,n' AS latins
),
pairs AS (
SELECT accent, latin FROM lookups,
UNNEST(SPLIT(accents)) AS accent WITH OFFSET AS p1,
UNNEST(SPLIT(latins)) AS latin WITH OFFSET AS p2
WHERE p1 = p2
),
yourTableWithWords AS (
SELECT word FROM UNNEST(
SPLIT('brasília,ångström,aperçu,barège, beau idéal, belle époque, béguin, bête noire, bêtise, Bichon Frisé, blasé, blessèd, bobèche, boîte, bombé, Bön, Boötes, boutonnière, bric-à-brac, Brontë Beyoncé,El Niño')
) AS word
)
SELECT
word AS word_with_accent,
(SELECT STRING_AGG(IFNULL(latin, char), '')
FROM UNNEST(SPLIT(word, '')) char
LEFT JOIN pairs
ON char = accent) AS word_without_accent
FROM yourTableWithWords
Output is
word_with_accent      word_without_accent
blessèd               blessed
El Niño               El Nino
belle époque          belle epoque
boîte                 boite
Boötes                Bootes
blasé                 blase
ångström              angstrom
bobèche               bobeche
barège                barege
bric-à-brac           bric-a-brac
bête noire            bete noire
Bichon Frisé          Bichon Frise
Brontë Beyoncé        Bronte Beyonce
bêtise                betise
beau idéal            beau ideal
bombé                 bombe
brasília              brasilia
boutonnière           boutonniere
aperçu                apercu
béguin                beguin
Bön                   Bon
UPDATE
Below is how to pack this logic into a SQL UDF, so accent2latin(word) can be called to do the "magic":
#standardSQL
CREATE TEMP FUNCTION accent2latin(word STRING) AS
((
WITH lookups AS (
SELECT
'ç,æ,œ,á,é,í,ó,ú,à,è,ì,ò,ù,ä,ë,ï,ö,ü,ÿ,â,ê,î,ô,û,å,ø,Ø,Å,Á,À,Â,Ä,È,É,Ê,Ë,Í,Î,Ï,Ì,Ò,Ó,Ô,Ö,Ú,Ù,Û,Ü,Ÿ,Ç,Æ,Œ,ñ' AS accents,
'c,ae,oe,a,e,i,o,u,a,e,i,o,u,a,e,i,o,u,y,a,e,i,o,u,a,o,O,A,A,A,A,A,E,E,E,E,I,I,I,I,O,O,O,O,U,U,U,U,Y,C,AE,OE,n' AS latins
),
pairs AS (
SELECT accent, latin FROM lookups,
UNNEST(SPLIT(accents)) AS accent WITH OFFSET AS p1,
UNNEST(SPLIT(latins)) AS latin WITH OFFSET AS p2
WHERE p1 = p2
)
SELECT STRING_AGG(IFNULL(latin, char), '')
FROM UNNEST(SPLIT(word, '')) char
LEFT JOIN pairs
ON char = accent
));
WITH yourTableWithWords AS (
SELECT word FROM UNNEST(
SPLIT('brasília,ångström,aperçu,barège, beau idéal, belle époque, béguin, bête noire, bêtise, Bichon Frisé, blasé, blessèd, bobèche, boîte, bombé, Bön, Boötes, boutonnière, bric-à-brac, Brontë Beyoncé,El Niño')
) AS word
)
SELECT
word AS word_with_accent,
accent2latin(word) AS word_without_accent
FROM yourTableWithWords
It's worth mentioning that what you're asking for is a simplified case of unicode text normalization. Many languages have a function for this in their standard libraries (e.g., Java). One good approach would be to insert your text into BigQuery already normalized. If that won't work -- for example, because you need to retain the original text and you're concerned about hitting BigQuery's row size limit -- then you'll need to do normalization on the fly in your queries.
Some databases have implementations of Unicode normalization of various completeness (e.g., PostgreSQL's unaccent method, PrestoDB's normalize method) for use in queries. Unfortunately, BigQuery is not one of them. There is no text normalization function in BigQuery as of this writing. The implementations on this answer are kind of a "roll your own unaccent." When BigQuery releases an official function, everyone should use that instead!
Assuming you need to do the normalization in your query (and Google still hasn't come out with a function for this yet), these are some reasonable options.
Approach 1: Use NORMALIZE
Google has now come out with a NORMALIZE function. (Thanks to @WillianFuks in the comments for flagging!) This is now the obvious choice for text normalization. For example:
SELECT REGEXP_REPLACE(NORMALIZE(text, NFD), r"\pM", '') FROM yourtable;
Briefly, how this works and why the call to REGEXP_REPLACE is needed: NORMALIZE(text, NFD) decomposes each accented character into its base character plus combining marks, and the REGEXP_REPLACE then strips those marks (r"\pM" matches Unicode mark characters), leaving the plain base characters.
I have left the additional approaches for reference.
Approach 2: Use REGEXP_REPLACE and REPLACE on Content
I implemented the lowercase-only case of text normalization in legacy SQL using REGEXP_REPLACE. (The analog in Standard SQL is fairly self-evident.) I ran some tests on a text field with average length around 1K in a large table of 28M rows using the query below:
SELECT id, text FROM
(SELECT
id,
CASE
WHEN REGEXP_CONTAINS(LOWER(text), r"[àáâäåæçèéêëìíîïòóôöøùúûüÿœ]") THEN
REGEXP_REPLACE(
REGEXP_REPLACE(
REGEXP_REPLACE(
REGEXP_REPLACE(
REGEXP_REPLACE(
REPLACE(REPLACE(REPLACE(REPLACE(LOWER(text), 'œ', 'ce'), 'ÿ', 'y'), 'ç', 'c'), 'æ', 'ae'),
r"[ùúûü]", 'u'),
r"[òóôöø]", 'o'),
r"[ìíîï]", 'i'),
r"[èéêë]", 'e'),
r"[àáâäå]", 'a')
ELSE
LOWER(text)
END AS text
FROM
yourtable ORDER BY id LIMIT 10);
versus:
WITH lookups AS (
SELECT
'ç,æ,œ,á,é,í,ó,ú,à,è,ì,ò,ù,ä,ë,ï,ö,ü,ÿ,â,ê,î,ô,û,å,ø,ñ' AS accents,
'c,ae,oe,a,e,i,o,u,a,e,i,o,u,a,e,i,o,u,y,a,e,i,o,u,a,o,n' AS latins
),
pairs AS (
SELECT accent, latin FROM lookups,
UNNEST(SPLIT(accents)) AS accent WITH OFFSET AS p1,
UNNEST(SPLIT(latins)) AS latin WITH OFFSET AS p2
WHERE p1 = p2
)
SELECT foo FROM (
SELECT
id,
(SELECT STRING_AGG(IFNULL(latin, char), '') AS foo FROM UNNEST(SPLIT(LOWER(text), '')) char LEFT JOIN pairs ON char=accent) AS foo
FROM
yourtable ORDER BY id LIMIT 10);
On average, the REGEXP_REPLACE implementation ran in about 2.9s; the array-based implementation ran in about 12.5s.
Approach 3: Use REGEXP_REPLACE on Search Pattern
What brought me to this question was a search use case. For this use case, I can either normalize my corpus text so that it looks more like my query, or I can "denormalize" my query so that it looks more like my text. The above describes an implementation of the first approach. This describes an implementation of the second.
When searching for a single word, one can use the REGEXP_MATCH function (REGEXP_CONTAINS in Standard SQL) and merely rewrite the query term using the following patterns:
a -> [aàáaâäãåā]
e -> [eèéêëēėę]
i -> [iîïíīįì]
o -> [oôöòóøōõ]
u -> [uûüùúū]
y -> [yÿ]
s -> [sßśš]
l -> [lł]
z -> [zžźż]
c -> [cçćč]
n -> [nñń]
æ -> (?:æ|ae)
œ -> (?:œ|ce)
So the query "hello" would look like this, as a regexp:
r"h[eèéêëēėę][lł][lł][oôöòóøōõ]"
Transforming the word into this regular expression should be fairly straightforward in any language. This isn't a solution to the posted question -- "How do I remove accents in BigQuery?" -- but is rather a solution to a related use case, which might have brought people (like me!) to this page.
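For illustration, the expansion can even be done in SQL itself. A minimal Standard SQL sketch (my own, covering only a few of the letters above; the table and column names are made up) that turns a search term into an accent-tolerant pattern and matches it with REGEXP_CONTAINS:
#standardSQL
WITH corpus AS (
  SELECT 'héllo wörld' AS text
),
term AS (
  SELECT REPLACE(REPLACE(REPLACE(LOWER('hello'),
           'a', '[aàáâäãåā]'),
           'e', '[eèéêëēėę]'),
           'o', '[oôöòóøōõ]') AS pattern
)
SELECT c.text
FROM corpus c, term t
WHERE REGEXP_CONTAINS(LOWER(c.text), t.pattern)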
I like this answer's explanation. You can use:
REGEXP_REPLACE(NORMALIZE(text, NFD), r'\pM', '')
As a simple example:
WITH data AS(
SELECT 'brasília / paçoca' AS text
)
SELECT
REGEXP_REPLACE(NORMALIZE(text, NFD), r'\pM', '') RemovedDiacritics
FROM data
brasilia / pacoca
UPDATE
With the new TRANSLATE string function, it's much simpler:
WITH data AS(
SELECT 'brasília / paçoca' AS text
)
SELECT
translate(text, "ŠŽšžŸÀÁÂÃÄÅÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖÙÚÛÜÝàáâãäåçèéêëìíîïðñòóôõöùúûüýÿ", "SZszYAAAAAACEEEEIIIIDNOOOOOUUUUYaaaaaaceeeeiiiidnooooouuuuyy") as RemovedDiacritics
FROM data
brasilia / pacoca
You can call REPLACE() or REGEXP_REPLACE(). You can find some regular expressions at Remove accents/diacritics in a string in JavaScript.
Alternatively, you can use a JavaScript UDF, but I expect it to be slower.
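For what it's worth, such a JavaScript UDF could look like this rough sketch (the function name is mine; it uses the same normalize-and-strip-combining-marks trick as the NORMALIZE answers above):
CREATE TEMP FUNCTION remove_accents_js(word STRING)
RETURNS STRING
LANGUAGE js AS """
  return word ? word.normalize('NFD').replace(/[\u0300-\u036f]/g, '') : word;
""";

SELECT remove_accents_js('brasília') AS word_without_accent;
-- brasilia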
Just call --> select bigfunctions.eu.remove_accents('Voilà !') as cleaned_string
(BigFunctions are open-source BigQuery functions callable by anyone from their BigQuery Project)
https://unytics.io/bigfunctions/reference/#remove_accents

query for substring formation

I want to take the 01 part of a string abcd_01 using SQL. What should the query for this be, given that the length before the _ varies? That is, it may be abcde_01 or ab_01. Basically, I want the part after the _.
This is one of those examples of how the various SQL dialects offer similar functionality, but are just different enough that you cannot guarantee portability between all databases.
The SUBSTRING keyword, using the PostgreSQL-style syntax (without the pattern-matching form), is ANSI SQL-99. Why it took the standard that long, I don't know.
The crux of your need is to obtain a substring of the existing column value, so you need to know what the database substring function(s) are called.
Oracle
SELECT SUBSTR('abcd_01', -2) FROM DUAL
Oracle doesn't have a RIGHT function, which is really just a wrapper for the substring function anyway. But Oracle's SUBSTR does allow you to specify a negative number in order to process the string in reverse (from the end towards the start).
SQL Server
Two options - SUBSTRING, and RIGHT:
SELECT SUBSTRING('abcd_01', LEN('abcd_01') - 1, 2)
SELECT RIGHT('abcd_01', 2)
For brevity, RIGHT is ideal. But for portability, SUBSTRING is a better choice...
MySQL
Similar to SQL Server, but with three options - SUBSTR, SUBSTRING, and RIGHT:
SELECT SUBSTR('abcd_01', LENGTH('abcd_01') - 1, 2)
SELECT SUBSTRING('abcd_01', LENGTH('abcd_01') - 1, 2)
SELECT RIGHT('abcd_01', 2)
PostgreSQL
PostgreSQL only has SUBSTRING:
SELECT SUBSTRING('abcd_01' FROM LENGTH('abcd_01')-1 for 2)
...but it does support limited pattern matching, which you can see is not supported elsewhere.
SQLite
SQLite only supports SUBSTR:
SELECT SUBSTR('abcd_01', LENGTH('abcd_01') - 1, 2)
Conclusion
Use RIGHT if it's available; SUBSTR/SUBSTRING is the better choice if there's a need to port the query to other databases, since it's explicit to others what is happening and it's easier to find equivalent functionality.
If it's always the last 2 characters then use RIGHT(MyString, 2) in most SQL dialects
To get 01 from abcd_01 you could write it this way (assuming the column name is col1):
SELECT substring(col1,-2) FROM TABLE
This will give you the last two characters.
select substring(names,charindex('_',names)+1,len(names)-charindex('_',names)) from test
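That last approach is the one that actually copes with a variable-length prefix, since it takes everything after the first underscore. Run against the literal from the question it would look something like this (a quick T-SQL sketch):
SELECT SUBSTRING('abcd_01', CHARINDEX('_', 'abcd_01') + 1, LEN('abcd_01')) AS suffix;
-- 01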