How to use Regular Expressions to replace part of a string in SQLite? - sql

I currently would like some advice on how to find and replace part of a string using regular expressions in SQLite? i am using Rstudio/R as the SQLite connector.
I have the following strings:
my_strings
--------------
1244599arts
3490872testing
4478933great
2342340obvious
gremlin2342678
i would like to replace the numbers with the word "final" - now I would like to use regular expressions to achieve this as I want to be able to capture the numbers only and then replace them with the word "final" and not affect any other part of the string
the output i would like to achieve is the following:
my_strings
--------------
finalarts
finaltesting
finalgreat
finalobvious
gremlinfinal
As you can see the numbers have now been replaced by the word "final" - please note that I have around 8 million rows so I cannot just repeat a REPLACE function as there are simply too many numbers!
I have written some regex to capture those numbers and the following statement will match those numbers:
[0-9]{7}
Here is an example of how the above matches those numbers
Now I would like to use this regex statement to amend these strings - the reason is that I would like to learn how to use regex in sqlite to find and replace matching parts of a string.
Has anyone got any advice?
for reference, I can use the REGEXP function as I have already made a sqlite instance in R.

You can use the sqlean-regexp extension, which provides regular expressions search and replace functions:
-- replace 7 digits with the word 'final'
update t set my_strings = regexp_replace(my_strings, '[0-9]{7}', 'final');

Related

How to extract digits from field using regex

I am using Firebird 2.5 and I have a field (called identifier) with mixed letters, numbers and special characters. I would like to use regex to extract only the numbers in a new column. I have tried something like below, but it is not working.
Any idea how I can achieve this using regex without using stored procedures or execute block
SELECT ORDER_ID,
ORDER_DATE,
SUBSTRING(IDENTIFIER FROM 1 TO 10) SIMILAR TO '^[0-9]{10}$' --- DESIRED EXTRACTION COLUMN
FROM ORDERS
Example of data
IDENTIFIER DESIRED OUTPUT
ANDRE 02869567995 02869567995
02869567995 MARIA 02869567995
028.695.67.995 02869567995
028695679-95 02869567995
You cannot do this in Firebird 2.5, at least not without help from a UDF, or a (selectable) stored procedure. I'm not aware of third-party UDFs providing regular expressions, so you might have to write this yourself.
In Firebird 3.0, you could also use a UDR or stored function to achieve this. Unfortunately, using the regular expression functionality available in Firebird alone will not be enough to solve this.
NOTE: The rest of the answer is based on the assumption to extract digits if the first 10 characters of string are digits. With the updated question, this assumption is no longer valid.
That said, if your need is exactly as shown in your question, that is only extract the first 10 characters from a string if they are all digits, then you could use:
case
when IDENTIFIER similar to '[[:DIGIT:]]{10}%'
then substring(IDENTIFIER from 1 for 10)
end
(as an aside, the positional SUBSTRING syntax is from <start> for <length>, not from <start> to <end>)
In Firebird 3.0 and higher, you can use SUBSTRING(... SIMILAR ...) with a SQL regular expression pattern. Assuming you want to extract 10 digits from the start of a string, you can do:
substring(IDENTIFIER similar '#"[[:DIGIT:]]{10}#"%' escape '#')
The #" delimits the pattern to extract (where # is a custom escape character as specified in the ESCAPE clause). The remainder of the pattern must match the rest of the string, hence the use of % here (in other cases, you may need to specify a pattern before the first #" as well.
See this dbfiddle for an example.
It is not possible in any version of Firebird.

Regex not working in LIKE condition

I'm currently using Oracle SQL developer and am trying to write a query that will allow me to search for all fields that resemble a certain value but always differ from it.
SELECT last_name FROM employees WHERE last_name LIKE 'Do[^e]%';
So the result that I'm after would be: Give me all last names that start with 'Do' but are not 'Doe'.
I got the square brackets method from a general SQL basics book so I assume any SQL database should be able to run it.
This is my first post and I'd be happy to clarify if my question wasn't clear enough.
In Oracle's LIKE no regular expressions can be used. But you can use REGEXP_LIKE.
SELECT * FROM EMPLOYEES WHERE REGEXP_LIKE (Name, '^Do[^e]');
The ^ at the beginning of the pattern anchors it to the beginning of the compared string. In other words the string must start with the pattern to match. And there is no wildcard needed at the end, as there is no anchor for the end of the string (which would be $). And you seem to already know the meaning of [^e].

SQL Server regular expression select and update

I have a column that I need to clean the data up on.
First I'd like to do a select to get a record of the bad data then I've like to run a replace on the invalid charters.
I'm looking to select anything that contains non alphanumeric characters but ignores the slash "\" as the second character and also ignores underscores and dashes in the rest of the string. Here's a couple of example of the data I'm expecting to get back from this query.
#\AAA
A\Adam's
A\Amanda.Smith
B\Bear's-ltd
C\Couple & More
After this I'd like to run a replace on any of these invalid characters and replace them with underscores so the result would look like this:
_\AAA
A\Adam_s
A\Amanda_Smith
B\Bear_s-ltd
C\Couple_More
I do not think there is native support for that. You can create a CLR to support regex, ex: https://www.simple-talk.com/sql/t-sql-programming/clr-assembly-regex-functions-for-sql-server-by-example/

Return sql rows where field contains ONLY non-alphanumeric characters

I need to find out how many rows in a particular field in my sql server table, contain ONLY non-alphanumeric characters.
I'm thinking it's a regular expression that I need along the lines of [^a-zA-Z0-9] but Im not sure of the exact syntax I need to return the rows if there are no valid alphanumeric chars in there.
SQL Server doesn't have regular expressions. It uses the LIKE pattern matching syntax which isn't the same.
As it happens, you are close. Just need leading+trailing wildcards and move the NOT
WHERE whatever NOT LIKE '%[a-z0-9]%'
If you have short strings you should be able to create a few LIKE patterns ('[^a-zA-Z0-9]', '[^a-zA-Z0-9][^a-zA-Z0-9]', ...) to match strings of different length. Otherwise you should use CLR user defined function and a proper regular expression - Regular Expressions Make Pattern Matching And Data Extraction Easier.
This will not work correctly, e.g. abcÑxyz will pass thru this as it has a,b,c... you need to work with Collate or check each byte.

How do I check the end of a particular string using SQL pattern matching?

I am trying to use sql pattern matching to check if a string value is in the correct format.
The string code should have the correct format of:
alphanumericvalue.alphanumericvalue
Therefore, the following are valid codes:
D0030.2190
C0052.1925
A0025.2013
And the following are invalid codes:
D0030
.2190
C0052.
A0025.2013.
A0025.2013.2013
So far I have the following SQL IF clause to check that the string is correct:
IF #vchAccountNumber LIKE '_%._%[^.]'
I believe that the "_%" part checks for 1 or more characters. Therefore, this statement checks for one or more characters, followed by a "." character, followed by one or more characters and checking that the final character is not a ".".
It seems that this would work for all combinations except for the following format which the IF clause allows as a valid code:
A0025.2013.2013
I'm having trouble correcting this IF clause to allow it to treat this format as incorrect. Can anybody help me to correct this?
Thank you.
This stackoverflow question mentions using word-boundaries: [[:<:]] and [[:>:]] for whole word matches. You might be able to use this since you don't have spaces in your code.
This is ANSI SQL solution
This LIKE expression will find any pattern not alphanumeric.alphanumeric. So NOT LIKE find only this that match as you wish:
IF #vchAccountNumber NOT LIKE '%[^A-Z0-9].[^A-Z0-9]%'
However, based on your examples, you can use this...
LIKE '[A-Z][0-9][0-9][0-9][0-9].[0-9][0-9][0-9][0-9]'
...or one like this if you 5 alphas, dot, 4 alphas
LIKE '[A-Z0-9][A-Z0-9][A-Z0-9][A-Z0-9][A-Z0-9].[A-Z0-9][A-Z0-9][A-Z0-9][A-Z0-9]'
The 2nd one is slightly more obvious for fixed length values. The 1st one is slighty less intuitive but works with variable length code either side of the dot.
Other SO questions Creating a Function in SQL Server with a Phone Number as a parameter and returns a Random Number and Best equivalent for IsInteger in SQL Server