How to extract digits from field using regex - sql

I am using Firebird 2.5 and I have a field (called identifier) with mixed letters, numbers and special characters. I would like to use regex to extract only the numbers in a new column. I have tried something like below, but it is not working.
Any idea how I can achieve this using regex without using stored procedures or execute block
SELECT ORDER_ID,
ORDER_DATE,
SUBSTRING(IDENTIFIER FROM 1 TO 10) SIMILAR TO '^[0-9]{10}$' --- DESIRED EXTRACTION COLUMN
FROM ORDERS
Example of data
IDENTIFIER DESIRED OUTPUT
ANDRE 02869567995 02869567995
02869567995 MARIA 02869567995
028.695.67.995 02869567995
028695679-95 02869567995

You cannot do this in Firebird 2.5, at least not without help from a UDF, or a (selectable) stored procedure. I'm not aware of third-party UDFs providing regular expressions, so you might have to write this yourself.
In Firebird 3.0, you could also use a UDR or stored function to achieve this. Unfortunately, using the regular expression functionality available in Firebird alone will not be enough to solve this.
NOTE: The rest of the answer is based on the assumption to extract digits if the first 10 characters of string are digits. With the updated question, this assumption is no longer valid.
That said, if your need is exactly as shown in your question, that is only extract the first 10 characters from a string if they are all digits, then you could use:
case
when IDENTIFIER similar to '[[:DIGIT:]]{10}%'
then substring(IDENTIFIER from 1 for 10)
end
(as an aside, the positional SUBSTRING syntax is from <start> for <length>, not from <start> to <end>)
In Firebird 3.0 and higher, you can use SUBSTRING(... SIMILAR ...) with a SQL regular expression pattern. Assuming you want to extract 10 digits from the start of a string, you can do:
substring(IDENTIFIER similar '#"[[:DIGIT:]]{10}#"%' escape '#')
The #" delimits the pattern to extract (where # is a custom escape character as specified in the ESCAPE clause). The remainder of the pattern must match the rest of the string, hence the use of % here (in other cases, you may need to specify a pattern before the first #" as well.
See this dbfiddle for an example.

It is not possible in any version of Firebird.

Related

How to use Regular Expressions to replace part of a string in SQLite?

I currently would like some advice on how to find and replace part of a string using regular expressions in SQLite? i am using Rstudio/R as the SQLite connector.
I have the following strings:
my_strings
--------------
1244599arts
3490872testing
4478933great
2342340obvious
gremlin2342678
i would like to replace the numbers with the word "final" - now I would like to use regular expressions to achieve this as I want to be able to capture the numbers only and then replace them with the word "final" and not affect any other part of the string
the output i would like to achieve is the following:
my_strings
--------------
finalarts
finaltesting
finalgreat
finalobvious
gremlinfinal
As you can see the numbers have now been replaced by the word "final" - please note that I have around 8 million rows so I cannot just repeat a REPLACE function as there are simply too many numbers!
I have written some regex to capture those numbers and the following statement will match those numbers:
[0-9]{7}
Here is an example of how the above matches those numbers
Now I would like to use this regex statement to amend these strings - the reason is that I would like to learn how to use regex in sqlite to find and replace matching parts of a string.
Has anyone got any advice?
for reference, I can use the REGEXP function as I have already made a sqlite instance in R.
You can use the sqlean-regexp extension, which provides regular expressions search and replace functions:
-- replace 7 digits with the word 'final'
update t set my_strings = regexp_replace(my_strings, '[0-9]{7}', 'final');

Redshift SQL - Extract numbers from string

In Amazon Redshift tables, I have a string column from which I need to extract numbers only out. For this currently I use
translate(stringfield, '0123456789'||stringfield, '0123456789')
I was trying out REPLACE function, but its not gonna be elegant.
Any thoughts with converting the string into ASCII first and then doing some operation to extract only number? Or any other alternatives.
It is hard here as Redshift do not support functions and is missing lot of traditional functions.
Edit:
Trying out the below, but it only returns 051-a92 where as I need 05192 as output. I am thinking of substring etc, but I only have regexp_substr available right now. How do I get rid of any characters in between
select REGEXP_SUBSTR('somestring-051-a92', '[0-9]+..[0-9]+', 1)
might be late but I was solving the same problem and finally came up with this
select REGEXP_replace('somestring-051-a92', '[a-z/-]', '')
alternatively, you can create a Python UDF now
Typically your inputs will conform to some sort of pattern that can be used to do the parsing using SUBSTRING() with CHARINDEX() { aka STRPOS(), POSITION() }.
E.g. find the first hyphen and the second hyphen and take the data between them.
If not (and assuming your character range is limited to ASCII) then your best bet would be to nest 26+ REPLACE() functions to remove all of the standard alpha characters (and any punctuation as well).
If you have multibyte characters in your data though then this is a non-starter.
Better method is to remove all the non-numeric values:
select REGEXP_replace('somestring-051-a92', '[^0-9]', '')
You can specify "any non digit" that includes non-printable, symbols, alpha, etc.
e.g., regexp_replace('brws--A*1','[\D]')
returns
"1"

Regular expression filter

I have this regular expression in my sql query
DECLARE #RETURN_VALUE VARCHAR(MAX)
IF #value LIKE '%[0-9]%[^A-Z]%[0-9]%'
BEGIN
SET #RETURN_VALUE = NULL
END
I am not sure, but whenever I have this in my row 12 TEST then it gives me the value of 12, but if I have three digit number then it filters out the three digit numbers.How can I modify the regular expression to return me the three digits numbers too.
any help will be appreciated.
SQL doesn't have regular expressions: it has SQL wildcard expressions. They are much simpler than regular expressions and long predate regular expressions. For instance, there is no way to specify alternation (a|b) or repetition ( a*, a+, a?, a{m,n} ) such as you might find in a regular expression.
The 'like expression' that you have
LIKE '%[0-9]%[^A-Z]%[0-9]%'
will match any string containing the following pattern anywhere in the string
zero or more of any character, followed by...
a single decimal digit, followed by...
zero or more of any character, followed by...
a single character other than A–Z (whether it's case sensitive or not depends on the collating sequence in use), followed by...
zero or of any character, followed by...
a single decimal digit, followed by...
zero or more of any character
One should note that the % is likely to match perhaps more than you might like.
Have you tried ([0-9]*). I believe that this will capture every digit for you. However, I am not as strong at regex. When I ran this through rubular, it worked, though :) BTW, rubular is a great way to test out regular expressions
You can easily create a SQL CLR function and use this in your queries. Visual Studio has a project template for this and makes deploying the functions a snap.
Here is more information from Microsoft about how to create the function and how to use it (for boolean matches and for data extraction).
First of all, note that this is not really a "regular expression", it's a SQL-specific form of wildcard matching. You are very limited in what you can accomplish with SQL wildcards. As one example, you cannot "optionally" match a specific character or character set.
Your expression, as you've written it, will match any value that contains two digits with at least one non-letter character in between them, meaning it will match:
111
1^1
1?7
1AAAAAAAAAAA?AAAAAAAAA1
-----------------------5-----------------3-------
And infinitely more items of a similar structure.
Oddly, one string that would not match this pattern is "12 TEST" because there is no character between the 1 and 2. The pattern also won't "give you" the value of 12 back because it's not a parsing expression, just a matching expression: it returns 1 (true) or 0 (false).
There is clearly something else going on in your application, possibly even an actual regular expression, but it has nothing to do with the SQL you've included here.

Return sql rows where field contains ONLY non-alphanumeric characters

I need to find out how many rows in a particular field in my sql server table, contain ONLY non-alphanumeric characters.
I'm thinking it's a regular expression that I need along the lines of [^a-zA-Z0-9] but Im not sure of the exact syntax I need to return the rows if there are no valid alphanumeric chars in there.
SQL Server doesn't have regular expressions. It uses the LIKE pattern matching syntax which isn't the same.
As it happens, you are close. Just need leading+trailing wildcards and move the NOT
WHERE whatever NOT LIKE '%[a-z0-9]%'
If you have short strings you should be able to create a few LIKE patterns ('[^a-zA-Z0-9]', '[^a-zA-Z0-9][^a-zA-Z0-9]', ...) to match strings of different length. Otherwise you should use CLR user defined function and a proper regular expression - Regular Expressions Make Pattern Matching And Data Extraction Easier.
This will not work correctly, e.g. abcÑxyz will pass thru this as it has a,b,c... you need to work with Collate or check each byte.

How do I check the end of a particular string using SQL pattern matching?

I am trying to use sql pattern matching to check if a string value is in the correct format.
The string code should have the correct format of:
alphanumericvalue.alphanumericvalue
Therefore, the following are valid codes:
D0030.2190
C0052.1925
A0025.2013
And the following are invalid codes:
D0030
.2190
C0052.
A0025.2013.
A0025.2013.2013
So far I have the following SQL IF clause to check that the string is correct:
IF #vchAccountNumber LIKE '_%._%[^.]'
I believe that the "_%" part checks for 1 or more characters. Therefore, this statement checks for one or more characters, followed by a "." character, followed by one or more characters and checking that the final character is not a ".".
It seems that this would work for all combinations except for the following format which the IF clause allows as a valid code:
A0025.2013.2013
I'm having trouble correcting this IF clause to allow it to treat this format as incorrect. Can anybody help me to correct this?
Thank you.
This stackoverflow question mentions using word-boundaries: [[:<:]] and [[:>:]] for whole word matches. You might be able to use this since you don't have spaces in your code.
This is ANSI SQL solution
This LIKE expression will find any pattern not alphanumeric.alphanumeric. So NOT LIKE find only this that match as you wish:
IF #vchAccountNumber NOT LIKE '%[^A-Z0-9].[^A-Z0-9]%'
However, based on your examples, you can use this...
LIKE '[A-Z][0-9][0-9][0-9][0-9].[0-9][0-9][0-9][0-9]'
...or one like this if you 5 alphas, dot, 4 alphas
LIKE '[A-Z0-9][A-Z0-9][A-Z0-9][A-Z0-9][A-Z0-9].[A-Z0-9][A-Z0-9][A-Z0-9][A-Z0-9]'
The 2nd one is slightly more obvious for fixed length values. The 1st one is slighty less intuitive but works with variable length code either side of the dot.
Other SO questions Creating a Function in SQL Server with a Phone Number as a parameter and returns a Random Number and Best equivalent for IsInteger in SQL Server