SQL Server regular expression select and update - sql

I have a column that I need to clean the data up on.
First I'd like to do a select to get a record of the bad data, then I'd like to run a replace on the invalid characters.
I'm looking to select anything that contains non-alphanumeric characters, but ignores the backslash "\" as the second character and also ignores underscores and dashes in the rest of the string. Here are a couple of examples of the data I'm expecting to get back from this query.
#\AAA
A\Adam's
A\Amanda.Smith
B\Bear's-ltd
C\Couple & More
After this I'd like to run a replace on any of these invalid characters and replace them with underscores so the result would look like this:
_\AAA
A\Adam_s
A\Amanda_Smith
B\Bear_s-ltd
C\Couple_More

I do not think there is native support for that. You can create a CLR function to support regex, e.g.: https://www.simple-talk.com/sql/t-sql-programming/clr-assembly-regex-functions-for-sql-server-by-example/
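If a regex function is available (for example through CLR), the rule from the question can be written as a single negated character class. The sketch below uses Python's re module only to illustrate the pattern logic. The helper names split_prefix, has_invalid and clean are hypothetical. Collapsing a run of invalid characters into one underscore is an assumption taken from the expected output C\Couple_More.

```python
import re

# Anything other than letters, digits, underscore or dash counts as invalid.
# The trailing "+" collapses a run of invalid characters into one match.
INVALID = re.compile(r"[^A-Za-z0-9_-]+")


def split_prefix(value):
    """Split off a backslash in second position so it is never flagged."""
    if value[1:2] == "\\":
        return value[:1], "\\", value[2:]
    return value, "", ""


def has_invalid(value):
    """True when the value contains a character that should be replaced."""
    head, _, tail = split_prefix(value)
    return bool(INVALID.search(head) or INVALID.search(tail))


def clean(value):
    """Replace each run of invalid characters with a single underscore."""
    head, sep, tail = split_prefix(value)
    return INVALID.sub("_", head) + sep + INVALID.sub("_", tail)


samples = [
    "#\\AAA",
    "A\\Adam's",
    "A\\Amanda.Smith",
    "B\\Bear's-ltd",
    "C\\Couple & More",
    "A\\Adam",  # already clean, should not be selected
]
bad = [s for s in samples if has_invalid(s)]
cleaned = [clean(s) for s in bad]
```

Here bad holds the five rows from the question and cleaned holds their underscore versions. The same character class could be passed to whatever regex function the CLR assembly exposes.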

Related

How to use Regular Expressions to replace part of a string in SQLite?

I would like some advice on how to find and replace part of a string using regular expressions in SQLite. I am using RStudio/R as the SQLite connector.
I have the following strings:
my_strings
--------------
1244599arts
3490872testing
4478933great
2342340obvious
gremlin2342678
I would like to replace the numbers with the word "final". I would like to use regular expressions to achieve this, as I want to capture only the numbers and replace them with the word "final" without affecting any other part of the string.
The output I would like to achieve is the following:
my_strings
--------------
finalarts
finaltesting
finalgreat
finalobvious
gremlinfinal
As you can see the numbers have now been replaced by the word "final" - please note that I have around 8 million rows so I cannot just repeat a REPLACE function as there are simply too many numbers!
I have written some regex to capture those numbers and the following statement will match those numbers:
[0-9]{7}
Here is an example of how the above matches those numbers
Now I would like to use this regex statement to amend these strings - the reason is that I would like to learn how to use regex in sqlite to find and replace matching parts of a string.
Has anyone got any advice?
For reference, I can use the REGEXP function, as I have already made a SQLite instance in R.
You can use the sqlean-regexp extension, which provides regular expressions search and replace functions:
-- replace 7 digits with the word 'final'
update t set my_strings = regexp_replace(my_strings, '[0-9]{7}', 'final');
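To try the statement without installing the extension, one option is to register a stand-in function under the same name. The sketch below does this from Python's sqlite3 module. The Python function regexp_replace here is an assumption that mimics the argument order used above (source, pattern, replacement); it is not the sqlean implementation, and its regex dialect is Python's.

```python
import re
import sqlite3


def regexp_replace(source, pattern, replacement):
    """Stand-in for the extension function: replace every match of pattern."""
    if source is None:
        return None
    return re.sub(pattern, replacement, source)


conn = sqlite3.connect(":memory:")
# Register the Python function so SQL can call regexp_replace(a, b, c).
conn.create_function("regexp_replace", 3, regexp_replace)

conn.execute("CREATE TABLE t (my_strings TEXT)")
conn.executemany(
    "INSERT INTO t VALUES (?)",
    [
        ("1244599arts",),
        ("3490872testing",),
        ("4478933great",),
        ("2342340obvious",),
        ("gremlin2342678",),
    ],
)

# Same statement as in the answer above.
conn.execute(
    "UPDATE t SET my_strings = regexp_replace(my_strings, '[0-9]{7}', 'final')"
)
rows = [r[0] for r in conn.execute("SELECT my_strings FROM t ORDER BY rowid")]
```

After the update, rows contains the five strings with each seven-digit run replaced by "final". A Python callback is invoked once per row, so on 8 million rows the native extension is likely the faster choice.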

How to select values around .(dot) using sql

I am running below query in Teradata :
sel requesttext from dbc.tables
where tablename='old_employee_table'
Result:
alter table DB_NAME.employee_table,no fallback ;
I want to get below result using SQL:
DB_NAME.employee_table
Requesttext can be:
create set table DB_NAME.employee_table;
The DB name and table can occur anywhere in the result. Since a . (dot) joins them, I want to split on the . (dot).
Basically, I need SQL that returns the values surrounding the . (dot).
I want the DB name and table name in the result.
I'm not a Teradata person, but this should work for both strings given so far, as long as teradata's regexp_substr() supports positive look-behind and positive look-ahead assertions (I might have the Teradata syntax wrong, so a little tweaking may be needed):
SELECT REGEXP_SUBSTR(requesttext, '(?<= )(\w+\.\w+)(?=[,$]?)', 1, 1)
FROM dbc.tables
WHERE tablename='old_employee_table'
See the regex101 example. Hopefully it translates to Teradata easily.
The regex looks for and returns the words on either side of, and including, the period, when preceded by a space. The trailing look-ahead is meant to allow an optional comma or the end of the line; because it is optional, it never rejects a match.
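The pattern can be checked outside Teradata first. The sketch below runs it with Python's re module against the two request texts from the question. The function name qualified_name is hypothetical, and Python's regex dialect may differ from Teradata's in details such as look-behind support.

```python
import re

# Same pattern as in the answer: a word, a dot, a word, preceded by a space.
PATTERN = re.compile(r"(?<= )(\w+\.\w+)(?=[,$]?)")


def qualified_name(request_text):
    """Return the first database.table token found, or None."""
    match = PATTERN.search(request_text)
    return match.group(1) if match else None


alter_name = qualified_name("alter table DB_NAME.employee_table,no fallback ;")
create_name = qualified_name("create set table DB_NAME.employee_table;")
```

Both calls yield DB_NAME.employee_table, because \w does not match the comma or the semicolon that follows the table name.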
You could do this with either regexp_substr() or strtok().
As Jamie Zawinski said:
Some people, when confronted with a problem, think "I know, I'll use
regular expressions." Now they have two problems.
So I would go with the strtok() method. Also I'm lazy and regular expressions are hard.
Function strtok() takes three arguments:
The string being split
The delimiter to split the string
The number of the token to grab.
To get at the <database>.<table> from that string that is returned in your query, we can split by a space, grab the third token, then split that by a comma and grab the first token.
That would look like:
SELECT strtok(strtok(requestText,' ',3),',',1)
FROM dbc.tables
WHERE tablename='old_employee_table'
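To see what the nested call does, here is a rough Python emulation. The helper strtok below is hypothetical and assumes that every character in the delimiter string is a separator and that empty tokens are skipped, which may not match Teradata exactly.

```python
import re


def strtok(text, delimiters, token_number):
    """Return the 1-based token_number-th token of text, or None."""
    if text is None:
        return None
    # Treat each character in delimiters as a separator and drop empty tokens.
    tokens = [t for t in re.split("[" + re.escape(delimiters) + "]", text) if t]
    if 1 <= token_number <= len(tokens):
        return tokens[token_number - 1]
    return None


def db_and_table(request_text):
    """Mirror strtok(strtok(requestText, ' ', 3), ',', 1)."""
    return strtok(strtok(request_text, " ", 3), ",", 1)


from_alter = db_and_table("alter table DB_NAME.employee_table,no fallback ;")
from_create = db_and_table("create set table DB_NAME.employee_table;")
```

For the alter statement this gives DB_NAME.employee_table. For the create set table statement the third space-separated token is the word table, so the token number would have to change with the statement type.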

SQL LIKE to find either -,/,_

Trying to select from table where the format can be either 1/2/2014, 1-2-2014 or 1_2_2014 in a text field. There's other text involved outside of this format but it shouldn't matter, but that's why this is text not a date type.
I tried '%[-,_,/]%[-,_,/]%', which doesn't work, and I've tried escaping the special characters in the brackets, such as '%[-,!_,/]%[-,!_,/]%' ESCAPE '!', which also doesn't work. Any suggestions?
I wanted to avoid using three searches like,
LIKE '%/%/%'
OR '%-%-%'
OR '%!_%!_%' ESCAPE '!'
EDIT: Using SQLite3
There is no regex-like behavior in the LIKE operator in SQL. You would have to use two expressions and OR them together:
select * from table
where column like '%-%-%'
or column like '%/%/%'
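Thanks for the information. I ended up switching to the GLOB operator, which supports [] in SQLite.
The example was altered to GLOB '?[-/_]?[-/_]??*', where * serves as % and ? serves as _ for the GLOB function. The hyphen is listed first inside the brackets so that it is matched literally; written as [/-_], it would be read as a range from / to _.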
Also thanks to Amadeaus9 for pointing out minimum characters between delimiters so that '//' isn't a valid answer.
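A quick way to check a GLOB pattern is an in-memory SQLite database. In the sketch below the table notes and column body are hypothetical. In SQLite's GLOB, a hyphen between two characters inside brackets denotes a range, so the hyphen is listed first to keep it literal; the last two queries show what the range form [/-_] does instead.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE notes (body TEXT)")
conn.executemany(
    "INSERT INTO notes VALUES (?)",
    [("1/2/2014",), ("1-2-2014",), ("1_2_2014",), ("123456",), ("//",)],
)

# Hyphen first inside the brackets, so it is a literal character.
matches = [
    row[0]
    for row in conn.execute(
        "SELECT body FROM notes WHERE body GLOB '?[-/_]?[-/_]??*' ORDER BY rowid"
    )
]

# With the hyphen in the middle, [/-_] is the range from '/' to '_',
# which covers digits and uppercase letters but not '-' itself.
range_matches_digits = conn.execute(
    "SELECT '123456' GLOB '?[/-_]?[/-_]??*'"
).fetchone()[0]
range_matches_dashes = conn.execute(
    "SELECT '1-2-2014' GLOB '?[/-_]?[/-_]??*'"
).fetchone()[0]
```

Only the three date-like strings end up in matches. The range form accepts 123456 and rejects 1-2-2014, which is the opposite of what is wanted.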
If you're using T-SQL (AKA SQL Server) you don't want to have commas in the character set - i.e. LIKE '%[/_-]%[/_-]%'. However, keep in mind that this can match ANYTHING that has, anywhere within it, any two characters from the set.
EDIT: it doesn't look like SQLite supports that sort of use of its LIKE operator, based on this link.
Relevant quote:
There are two wildcards used in conjunction with the LIKE operator:
The percent sign (%)
The underscore (_)
However, you may want to take a look at this question, which details using regex in SQLite.
It is not possible using the LIKE syntax.
However, SQLite3 does support the REGEXP operator; this is syntactic sugar for calling a user-defined function that actually does the matching. If one is provided by your platform, then you could use, for example:
x REGEXP '.*[/_-].*[/_-].*'
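Where no such function is provided, it can be registered by the application. The sketch below does so from Python's sqlite3 module. SQLite evaluates x REGEXP y by calling regexp(y, x), so the pattern arrives as the first argument. The table notes and column body are hypothetical, and the regex dialect is Python's.

```python
import re
import sqlite3


def regexp(pattern, value):
    """Backing function for the REGEXP operator: 1 on match, else 0."""
    if value is None:
        return 0
    return 1 if re.search(pattern, value) else 0


conn = sqlite3.connect(":memory:")
conn.create_function("regexp", 2, regexp)

conn.execute("CREATE TABLE notes (body TEXT)")
conn.executemany(
    "INSERT INTO notes VALUES (?)",
    [
        ("due 1/2/2014",),
        ("due 1-2-2014",),
        ("due 1_2_2014",),
        ("no date here",),
        ("a/b",),  # only one delimiter
    ],
)

found = [
    row[0]
    for row in conn.execute(
        "SELECT body FROM notes WHERE body REGEXP '.*[/_-].*[/_-].*' ORDER BY rowid"
    )
]
```

The query returns the three rows that contain at least two of the delimiter characters. Like the LIKE version, it does not require the two delimiters to be the same character.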

Return sql rows where field contains ONLY non-alphanumeric characters

I need to find out how many rows in a particular field in my sql server table, contain ONLY non-alphanumeric characters.
I'm thinking it's a regular expression that I need along the lines of [^a-zA-Z0-9], but I'm not sure of the exact syntax I need to return the rows if there are no valid alphanumeric chars in there.
SQL Server doesn't have regular expressions. It uses the LIKE pattern matching syntax which isn't the same.
As it happens, you are close. Just need leading+trailing wildcards and move the NOT
WHERE whatever NOT LIKE '%[a-z0-9]%'
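The idea is that "only non-alphanumeric" is the same as "no alphanumeric character anywhere". The sketch below demonstrates that logic with SQLite's GLOB through Python's sqlite3 module, purely as a stand-in for the T-SQL LIKE above. The table items and column val are hypothetical, and both letter cases are spelled out because GLOB is case-sensitive.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (val TEXT)")
conn.executemany(
    "INSERT INTO items VALUES (?)",
    [("!!!",), ("abc",), ("--1",), ("#$%",), ("a b",)],
)

# Keep rows where no alphanumeric character appears anywhere in the value.
only_symbols = [
    row[0]
    for row in conn.execute(
        "SELECT val FROM items WHERE val NOT GLOB '*[a-zA-Z0-9]*' ORDER BY rowid"
    )
]

row_count = conn.execute(
    "SELECT COUNT(*) FROM items WHERE val NOT GLOB '*[a-zA-Z0-9]*'"
).fetchone()[0]
```

Two rows qualify here. An empty string would also pass this test, so an extra condition may be needed if empty values should not be counted.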
If you have short strings, you should be able to create a few LIKE patterns ('[^a-zA-Z0-9]', '[^a-zA-Z0-9][^a-zA-Z0-9]', ...) to match strings of different lengths. Otherwise, you should use a CLR user-defined function and a proper regular expression - Regular Expressions Make Pattern Matching And Data Extraction Easier.
This will not work correctly, e.g. abcÑxyz will pass thru this as it has a,b,c... you need to work with Collate or check each byte.

How do I check the end of a particular string using SQL pattern matching?

I am trying to use sql pattern matching to check if a string value is in the correct format.
The string code should have the correct format of:
alphanumericvalue.alphanumericvalue
Therefore, the following are valid codes:
D0030.2190
C0052.1925
A0025.2013
And the following are invalid codes:
D0030
.2190
C0052.
A0025.2013.
A0025.2013.2013
So far I have the following SQL IF clause to check that the string is correct:
IF @vchAccountNumber LIKE '_%._%[^.]'
I believe that the "_%" part checks for 1 or more characters. Therefore, this statement checks for one or more characters, followed by a "." character, followed by one or more characters and checking that the final character is not a ".".
It seems that this would work for all combinations except for the following format which the IF clause allows as a valid code:
A0025.2013.2013
I'm having trouble correcting this IF clause to allow it to treat this format as incorrect. Can anybody help me to correct this?
Thank you.
This stackoverflow question mentions using word-boundaries: [[:<:]] and [[:>:]] for whole word matches. You might be able to use this since you don't have spaces in your code.
This is an ANSI SQL solution.
This LIKE expression finds any value that has a dot with a non-alphanumeric character immediately before and after it, so NOT LIKE keeps only the values without that pattern:
IF @vchAccountNumber NOT LIKE '%[^A-Z0-9].[^A-Z0-9]%'
However, based on your examples, you can use this...
LIKE '[A-Z][0-9][0-9][0-9][0-9].[0-9][0-9][0-9][0-9]'
...or one like this if you have 5 alphanumerics, a dot, then 4 alphanumerics
LIKE '[A-Z0-9][A-Z0-9][A-Z0-9][A-Z0-9][A-Z0-9].[A-Z0-9][A-Z0-9][A-Z0-9][A-Z0-9]'
The 2nd one is slightly more obvious for fixed-length values. The 1st one is slightly less intuitive but works with variable-length codes on either side of the dot.
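For variable-length codes, one way to reject values such as A0025.2013.2013 is to combine three pattern tests: an alphanumeric on each side of a dot, no second dot, and no character outside the allowed set. The sketch below is an assumption rather than part of the answers above. It uses SQLite's GLOB through Python's sqlite3 module as a stand-in for T-SQL LIKE, and the function name is_valid is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

CHECK = """
SELECT :code GLOB '*[A-Za-z0-9].[A-Za-z0-9]*'   -- alphanumeric on both sides of a dot
   AND :code NOT GLOB '*.*.*'                   -- at most one dot
   AND :code NOT GLOB '*[^A-Za-z0-9.]*'         -- nothing but alphanumerics and dots
"""


def is_valid(code):
    """True when code looks like alphanumericvalue.alphanumericvalue."""
    return bool(conn.execute(CHECK, {"code": code}).fetchone()[0])


valid_codes = ["D0030.2190", "C0052.1925", "A0025.2013"]
invalid_codes = ["D0030", ".2190", "C0052.", "A0025.2013.", "A0025.2013.2013"]

valid_results = [is_valid(c) for c in valid_codes]
invalid_results = [is_valid(c) for c in invalid_codes]
```

All three valid codes pass and all five invalid codes fail. In T-SQL the same three conditions would be written with LIKE and NOT LIKE, using % in place of *.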