Redshift SQL - Extract numbers from string

In Amazon Redshift tables, I have a string column from which I need to extract only the numbers. Currently I use
translate(stringfield, '0123456789'||stringfield, '0123456789')
I was trying out the REPLACE function, but it is not going to be elegant.
Any thoughts on converting the string to ASCII first and then doing some operation to extract only the numbers? Or any other alternatives?
It is hard here, as Redshift does not support user-defined functions and is missing a lot of traditional functions.
Edit:
Trying out the query below, but it only returns 051-a92, whereas I need 05192 as the output. I am thinking of substring etc., but I only have regexp_substr available right now. How do I get rid of the characters in between?
select REGEXP_SUBSTR('somestring-051-a92', '[0-9]+..[0-9]+', 1)

I might be late, but I was solving the same problem and finally came up with this:
select REGEXP_replace('somestring-051-a92', '[a-z/-]', '')
Alternatively, you can now create a Python UDF.
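For reference, a minimal sketch of such a UDF (the function name f_digits_only is hypothetical; this assumes Redshift's plpythonu scalar UDF support):
create or replace function f_digits_only(s varchar)
returns varchar
immutable
as $$
    import re
    # strip every character that is not a digit; pass NULL through unchanged
    return re.sub(r'[^0-9]', '', s) if s is not None else None
$$ language plpythonu;
-- usage: select f_digits_only('somestring-051-a92');  -- returns '05192'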

Typically your inputs will conform to some sort of pattern that can be used to do the parsing using SUBSTRING() with CHARINDEX() (aka STRPOS(), POSITION()).
E.g. find the first hyphen and the second hyphen and take the data between them.
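For instance, a minimal sketch against the sample value from the question (assuming Redshift's two-argument CHARINDEX()), pulling out whatever sits between the first and second hyphen:
with t as (select 'somestring-051-a92' as s)
select substring(
         s,
         charindex('-', s) + 1,
         charindex('-', substring(s, charindex('-', s) + 1)) - 1
       ) as between_hyphens   -- returns '051'
from t;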
If not (and assuming your character range is limited to ASCII) then your best bet would be to nest 26+ REPLACE() functions to remove all of the standard alpha characters (and any punctuation as well).
If you have multibyte characters in your data though then this is a non-starter.

A better method is to remove all non-numeric characters:
select REGEXP_replace('somestring-051-a92', '[^0-9]', '')

You can specify "any non-digit", which includes non-printable characters, symbols, alpha, etc.
e.g., regexp_replace('brws--A*1', '[\D]') returns "1"

Related

Pattern matching in BigQuery vs SSMS - return strings which contain special characters or numerics

I'm a bit lost.
I've had a look at the documentation, but I'm not sure if you can use LIKE and pattern matching in BigQuery the same way as in SSMS.
The code shown here works in SSMS, but the results are not correct in BigQuery, so I was wondering if there was another way to do it.
WHERE column_name NOT LIKE '[a-Z]%'
I'm looking to return strings which contain special characters or numerics.
Use REGEXP_CONTAINS instead
where not regexp_contains(column_name, r'[a-zA-Z]')
Meanwhile, LIKE is also supported as a comparison operator.
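Put together, a minimal sketch of the REGEXP_CONTAINS approach (the table name my_dataset.my_table is hypothetical):
select column_name
from `my_dataset.my_table`
where not regexp_contains(column_name, r'[a-zA-Z]');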

How to extract digits from a field using regex

I am using Firebird 2.5 and I have a field (called identifier) with mixed letters, numbers and special characters. I would like to use regex to extract only the numbers into a new column. I have tried something like the query below, but it is not working.
Any idea how I can achieve this using regex, without using stored procedures or an execute block?
SELECT ORDER_ID,
ORDER_DATE,
SUBSTRING(IDENTIFIER FROM 1 TO 10) SIMILAR TO '^[0-9]{10}$' --- DESIRED EXTRACTION COLUMN
FROM ORDERS
Example of data:
IDENTIFIER            DESIRED OUTPUT
ANDRE 02869567995     02869567995
02869567995 MARIA     02869567995
028.695.67.995        02869567995
028695679-95          02869567995
You cannot do this in Firebird 2.5, at least not without help from a UDF, or a (selectable) stored procedure. I'm not aware of third-party UDFs providing regular expressions, so you might have to write this yourself.
In Firebird 3.0, you could also use a UDR or stored function to achieve this. Unfortunately, using the regular expression functionality available in Firebird alone will not be enough to solve this.
NOTE: The rest of the answer is based on the assumption that the goal is to extract the first 10 characters of the string when they are all digits. With the updated question, this assumption is no longer valid.
That said, if your need is exactly as shown in your question, that is, to extract the first 10 characters from a string only if they are all digits, then you could use:
case
when IDENTIFIER similar to '[[:DIGIT:]]{10}%'
then substring(IDENTIFIER from 1 for 10)
end
(as an aside, the positional SUBSTRING syntax is from <start> for <length>, not from <start> to <end>)
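Wrapped in a full query against the ORDERS table from the question, that could look like the following sketch (Firebird 2.5 syntax; the column alias EXTRACTED_DIGITS is made up):
SELECT ORDER_ID,
       ORDER_DATE,
       CASE
         WHEN IDENTIFIER SIMILAR TO '[[:DIGIT:]]{10}%'
           THEN SUBSTRING(IDENTIFIER FROM 1 FOR 10)
       END AS EXTRACTED_DIGITS
FROM ORDERS;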
In Firebird 3.0 and higher, you can use SUBSTRING(... SIMILAR ...) with a SQL regular expression pattern. Assuming you want to extract 10 digits from the start of a string, you can do:
substring(IDENTIFIER similar '#"[[:DIGIT:]]{10}#"%' escape '#')
The #" delimits the pattern to extract (where # is a custom escape character as specified in the ESCAPE clause). The remainder of the pattern must match the rest of the string, hence the use of % here (in other cases, you may need to specify a pattern before the first #" as well.
See this dbfiddle for an example.
It is not possible in any version of Firebird.

Substring from reverse in PostgreSql

How do we substring from the reverse in Postgres? In Oracle we can provide the number of occurrences of the pattern and extract the expected records. In Postgres we do not have such an option.
I tried using the substring(), left() and right() functions, but still it is not working. Any suggestion would be helpful.
Value in the column,
col1
100~500~~~~Bangalore~~~~KA~null~Train
Expected result,
Train
To get the characters after the last ~ you can use substring() with a regex:
substring(col1 from '~([^~]+)$')
Or get the position of the last ~ by using the reverse function:
right(col1, strpos(reverse(col1), '~') - 1)
A more general approach is to convert the string to an array, then pick the last array element:
(string_to_array(col1, '~'))[cardinality(string_to_array(col1, '~'))]
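A quick check of all three expressions against the sample value (a self-contained snippet; cardinality() needs PostgreSQL 9.4 or later):
select substring(col1 from '~([^~]+)$')                                      as via_regex,
       right(col1, strpos(reverse(col1), '~') - 1)                           as via_reverse,
       (string_to_array(col1, '~'))[cardinality(string_to_array(col1, '~'))] as via_array
from (values ('100~500~~~~Bangalore~~~~KA~null~Train')) as t(col1);
-- all three columns return 'Train'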
The best solution to this sort of problem is to not store multiple values delimited by some character in a single column. If you really need to de-normalize, using arrays or JSON would at least be a bit more flexible (and robust).

Query Chinese characters (UTF-8) in Google BigQuery

I want to query out titles which contain Chinese characters (e.g. 數學) from my Google dataset, and I have tried many methods, as follows.
Google BigQuery only has a LENGTH() function; it doesn't have DATALENGTH() to compare the character length against the data size.
Then I tried to use REGEXP_MATCH() with '[\u4e00-\u9fa5]' to match Chinese characters, but that doesn't work either.
I can't figure out if there are other methods to solve this problem.
Please help, thank you.
BigQuery's LENGTH function currently has a bug which returns the incorrect STRING length for characters that fall out of the ASCII encoding range: https://code.google.com/p/google-bigquery/issues/detail?id=109
Possible workaround: If you just need an accurate LENGTH count, you could use the REGEXP_REPLACE function to convert your characters into a random ASCII character (such as '_'), and count that:
SELECT '數學',
LENGTH(REGEXP_REPLACE('數學', r'.', '_')) as correct,
LENGTH('數學') as incorrect;
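Building on that workaround, a hedged sketch for the original goal of finding titles containing non-ASCII (e.g. Chinese) characters: rows where the buggy LENGTH disagrees with the corrected count must contain characters outside the ASCII range (the table [my_dataset.titles] and column title are hypothetical; legacy BigQuery SQL assumed):
SELECT title
FROM [my_dataset.titles]
WHERE LENGTH(title) != LENGTH(REGEXP_REPLACE(title, r'.', '_'));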

Return sql rows where field contains ONLY non-alphanumeric characters

I need to find out how many rows in a particular field in my SQL Server table contain ONLY non-alphanumeric characters.
I'm thinking it's a regular expression that I need, along the lines of [^a-zA-Z0-9], but I'm not sure of the exact syntax needed to return the rows where there are no valid alphanumeric characters.
SQL Server doesn't have regular expressions. It uses the LIKE pattern matching syntax which isn't the same.
As it happens, you are close. You just need leading and trailing wildcards, and to move the NOT:
WHERE whatever NOT LIKE '%[a-z0-9]%'
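Applied to the original question of counting such rows, a minimal sketch (the table name dbo.my_table and column name whatever are hypothetical; a case-insensitive collation is assumed, as in the pattern above):
SELECT COUNT(*)
FROM dbo.my_table
WHERE whatever NOT LIKE '%[a-z0-9]%';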
If you have short strings, you should be able to create a few LIKE patterns ('[^a-zA-Z0-9]', '[^a-zA-Z0-9][^a-zA-Z0-9]', ...) to match strings of different lengths. Otherwise you should use a CLR user-defined function and a proper regular expression - see Regular Expressions Make Pattern Matching And Data Extraction Easier.
This will not work correctly, e.g. abcÑxyz will pass through this as it has a, b, c... you need to work with COLLATE or check each byte.