How to remove only letters from a string in BigQuery? - sql

So I'm working with BigQuery SQL right now trying to figure out how to remove letters but keep numeric numbers. For example:
XXXX123456
AAAA123456789
XYZR12345678
ABCD1234567
1111
2222
All have the same amount of letters in front of the numbers along with regular numbers no letters. I want the end result to look like:
123456
123456789
12345678
1234567
1111
2222
I tried using PATINDEX but BigQuery doesn't support the function. I've also tried using LEFT but that function will get rid of any value and I don't want to get rid of any numeric value only letter values. Any help would be much appreciated!
-Maykid

You can use regexp_replace():
select regexp_replace(str, '[^0-9]', '')

Below example is for BigQuery Standard SQL
#standardSQL
WITH `project.dataset.table` AS (
SELECT 'XXXX123456' str UNION ALL
SELECT 'AAAA123456789' UNION ALL
SELECT 'XYZR12345678' UNION ALL
SELECT 'ABCD1234567' UNION ALL
SELECT '1111' UNION ALL
SELECT '2222'
)
SELECT str, REGEXP_REPLACE(str, r'[a-zA-Z]', '') str_adjusted
FROM `project.dataset.table`

Related

USING SQL . extract numbers comma separated from string 'HEADER|N1000|E1001|N1002|E1003|N1004|N1005'

'HEADER|N1000|E1001|N1002|E1003|N1004|N1005'
'HEADER|N156|E1|N7|E122|N4|E5'
'HEADER|E0|E1|E2|E3|E4|E5'
'HEADER|N0|N1|N2|N3|N4|N5'
'HEADER|N125'
How to extract the numbers in comma-separated format from this stringS?
Expected result:
1000,1001,1002,1003,1004,1005
How to extract the numbers with N or E as suffix/prefix ie.
N1000
Expected result:
1000,1002,1004,1005
below regex does not return the result needed. But I want some thing like this
select REGEXP_REPLACE(REGEXP_REPLACE('HEADER|N1000|E1001|N1002|E1003|N1004|N1005', '.*?(\d+)', '\1,'), ',?\.*$', '') from dual
the problem here is
when i want numbers with E OR N
select REGEXP_REPLACE(REGEXP_REPLACE('HEADER|N1000|E1001|N1002|E1003|N1004|N1005', '.*?N(\d+)', '\1,'), ',?\.*$', '') from dual
select REGEXP_REPLACE(REGEXP_REPLACE('HEADER|N1000|E1001|N1002|E1003|N1004|N1005', '.*?E(\d+)', '\1,'), ',?\.*$', '') from dual
they give good results for this scenerio
but when i input 'HEADER|N1000|E1001' it gives wrong answer plzzz verify and correct it
Update
Based on the changes to the question, the original answer is not valid. Instead, the solution is considerably more complex, using a hierarchical query to extract all the numbers from the string and then LISTAGG to put back together a list of numbers extracted from each string. To extract all numbers we use this query:
WITH cte AS (
SELECT DISTINCT data, level AS l, REGEXP_SUBSTR(data, '[NE]\d+', 1, level) AS num FROM test
CONNECT BY REGEXP_SUBSTR(data, '[NE]\d+', 1, level) IS NOT NULL
)
SELECT data, LISTAGG(SUBSTR(num, 2), ',') WITHIN GROUP (ORDER BY l) AS "All numbers"
FROM cte
GROUP BY data
Output (for the new sample data):
DATA All numbers
HEADER|E0|E1|E2|E3|E4|E5 0,1,2,3,4,5
HEADER|N0|N1|N2|N3|N4|N5 0,1,2,3,4,5
HEADER|N1000|E1001|N1002|E1003|N1004|N1005 1000,1001,1002,1003,1004,1005
HEADER|N125 125
HEADER|N156|E1|N7|E122|N4|E5 156,1,7,122,4,5
To select only numbers beginning with E, we modify the query to replace the [EN] in the REGEXP_SUBSTR expressions with just E i.e.
SELECT DISTINCT data, level AS l, REGEXP_SUBSTR(data, 'E\d+', 1, level) AS num FROM test
CONNECT BY REGEXP_SUBSTR(data, 'E\d+', 1, level) IS NOT NULL
Output:
DATA E-numbers
HEADER|E0|E1|E2|E3|E4|E5 0,1,2,3,4,5
HEADER|N0|N1|N2|N3|N4|N5
HEADER|N1000|E1001|N1002|E1003|N1004|N1005 1001,1003
HEADER|N125
HEADER|N156|E1|N7|E122|N4|E5 1,122,5
A similar change can be made to extract numbers commencing with N.
Demo on dbfiddle
Original Answer
One way to achieve your desired result is to replace a string of characters leading up to a number with that number and a comma, and then replace any characters from the last ,| to the end of string from the result:
SELECT REGEXP_REPLACE(REGEXP_REPLACE('HEADER|N1000|E1001|N1002|E1003|N1004|N1005|', '.*?(\d+)', '\1,'), ',?\|.*$', '') FROM dual
Output:
1000,1001,1002,1003,1004,1005
To only output the numbers beginning with N, we add that to the prefix string before the capture group:
SELECT REGEXP_REPLACE(REGEXP_REPLACE('HEADER|N1000|E1001|N1002|E1003|N1004|N1005|', '.*?N(\d+)', '\1,'), ',?\|.*$', '') FROM dual
Output:
1000,1002,1004,1005
To only output the numbers beginning with E, we add that to the prefix string before the capture group:
SELECT REGEXP_REPLACE(REGEXP_REPLACE('HEADER|N1000|E1001|N1002|E1003|N1004|N1005|', '.*?E(\d+)', '\1,'), ',?\|.*$', '') FROM dual
Output:
1001,1003
Demo on dbfiddle
I don't know what DBMS you are using, but here's one way to do it in Postgres:
WITH cte AS (
SELECT CAST('HEADER|N1000|E1001|N1002|E1003|N1004|N1005|' AS VARCHAR(1000)) AS myValue
)
SELECT SUBSTRING(MyVal FROM 2)
FROM (
SELECT REGEXP_SPLIT_TO_TABLE(myValue,'\|') MyVal
FROM cte
) src
WHERE SUBSTRING(MyVal FROM 1 FOR 1) = 'N'
;
SQL Fiddle
As Far as I have understood the question , you want to extract substrings starting with N from the string, You can try following (And then you can merge the output seperated by commas if needed)
select REPLACE(value, 'N', '')
from STRING_SPLIT('HEADER|N1000|E1001|N1002|E1003|N1004|N1005|', '|')
where value like 'N%'
OutPut :
1000
1002
1004
1005

Extract Value from a string PostgreSQL

Simple Question
I have the following type of results in a string field
'Number=123456'
'Number=1234567'
'Number=12345678'
How do I extract the value from the string with regard that the value can change between 5-8 figures
So far I did this but I doubt that fits my requirement
SELECT substring('Size' from 8 for ....
If I can tell it to start from the = sign till the end that would help!
The likely simplest solution is to trim 7 leading characters with right():
right(str, -7)
Demo:
SELECT str, right(str, -7)
FROM (
VALUES ('Number=123456')
, ('Number=1234567')
, ('Number=12345678')
) t(str);
str | right
-----------------+----------
Number=123456 | 123456
Number=1234567 | 1234567
Number=12345678 | 12345678
You could use REPLACE:
SELECT col, REPLACE(col, 'Number=', '')
FROM tab;
DBFiddle Demo
Based on this question:
Split comma separated column data into additional columns
You could probably do the following:
SELECT *, split_part(col, '=', 2)
FROM table;
You may use regexp_matches :
with t(str) as
(
select 'Number=123456' union all
select 'Number=1234567' union all
select 'Number=12345678' union all
select 'Number=12345678x9'
)
select t.str as "String",
regexp_matches(t.str, '=([A-Za-z0-9]+)', 'g') as "Number"
from t;
String Number
-------------- ---------
Number=123456 123456
Number=1234567 1234567
Number=12345678 12345678
Number=12345678x9 12345678x9
--> the last line shows only we look chars after equal sign even if non-digit
Rextester Demo

Using REGEXP_SUBSTR with Strings Qualifier

Getting Examples from similar Stack Overflow threads,
Remove all characters after a specific character in PL/SQL
and
How to Select a substring in Oracle SQL up to a specific character?
I would want to retrieve only the first characters before the occurrence of a string.
Example:
STRING_EXAMPLE
TREE_OF_APPLES
The Resulting Data set should only show only STRING_EXAM and TREE_OF_AP because PLE is my delimiter
Whenever i use the below REGEXP_SUBSTR, It gets only STRING_ because REGEXP_SUBSTR treats PLE as separate expressions (P, L and E), not as a single expression (PLE).
SELECT REGEXP_SUBSTR('STRING_EXAMPLE','[^PLE]+',1,1) from dual;
How can i do this without using numerous INSTRs and SUBSTRs?
Thank you.
The problem with your query is that if you use [^PLE] it would match any characters other than P or L or E. You are looking for an occurence of PLE consecutively. So, use
select REGEXP_SUBSTR(colname,'(.+)PLE',1,1,null,1)
from tablename
This returns the substring up to the last occurrence of PLE in the string.
If the string contains multiple instances of PLE and only the substring up to the first occurrence needs to be extracted, use
select REGEXP_SUBSTR(colname,'(.+?)PLE',1,1,null,1)
from tablename
Why use regular expressions for this?
select substr(colname, 1, instr(colname, 'PLE')-1) from...
would be more efficient.
with
inputs( colname ) as (
select 'FIRST_EXAMPLE' from dual union all
select 'IMPLEMENTATION' from dual union all
select 'PARIS' from dual union all
select 'PLEONASM' from dual
)
select colname, substr(colname, 1, instr(colname, 'PLE')-1) as result
from inputs
;
COLNAME RESULT
-------------- ----------
FIRST_EXAMPLE FIRST_EXAM
IMPLEMENTATION IM
PARIS
PLEONASM

SQL How to extract numbers from a string?

I am working on a query in SQL that should be able to extract numbers on different/random lenght from the beginning of the text string.
Text string: 666 devils number is not 8888.
Text string: 12345 devils number is my PIN, that is 6666.
I want to get in a column
666
12345
Use a combination of Substr & instr
SELECT Substr (textstring, 1,instr(textstring,' ') - 1) AS Output
FROM yourtable
Result:
OUTPUT
666
12345
Use this if you have text at the beginning e.g. aa12345 devils number is my PIN, that is 6666. as it utilises the REGEXP_REPLACE function.
SELECT REGEXP_REPLACE(Substr (textstring, 1,instr(textstring,' ') - 1), '[[:alpha:]]','') AS Output
FROM yourtable
SQL Fiddle: http://sqlfiddle.com/#!4/8edc9/1/0
This version utilizes a regular expression which gives you the first number whether or not it's preceded by text and does not use the ghastly nested instr/substr calls:
SQL> with tbl(data) as (
select '666 devils number is not 8888' from dual
union
select '12345 devils number is my PIN, that is 6666' from dual
union
select 'aa12345 devils number is my PIN, that is 6666' from dual
)
select regexp_substr(data, '^\D*(\d+) ', 1, 1, null, 1) first_nbr
from tbl;
FIRST_NBR
---------------------------------------------
12345
666
12345
SQL>

Parse column value based on delimeters

Here is a sample of my data:
ABC*12345ABC
BCD*234()
CDE*3456789(&(&
DEF*4567A*B*C
Using SQL Server 2008 or SSIS, I need to parse this data and return the following result:
12345
234
3456789
4567
As you can see, the asterisk (*) is my first delimiter. The second "delimiter" (I use this term loosely) is when the sequence of numbers STOP.
So, basically, just grab the sequence of numbers after the asterisk...
How can I accomplish this?
EDIT:
I made a mistake in my original post. An example of another possible value would be:
XWZ*A12345%$%
In this case, I would like to return the following:
A12345
The value can START with an alpha character, but it will always END with a number. So, grab everything after the asterisk, but stop at the last number in the sequence.
Any help with this will be greatly appreciated!
You could do this with a little patindex and charindex trickery, like:
; with YourTable(col1) as
(
select 'ABC*12345ABC'
union all select 'BCD*234()'
union all select 'CDE*3456789(&(&'
union all select 'DEF*4567A*B*C'
union all select 'XWZ*A12345%$%'
)
select left(AfterStar, len(Leader) + PATINDEX('%[^0-9]%', AfterLeader) - 1)
from (
select RIGHT(AfterStar, len(AfterStar) - PATINDEX('%[0-9]%', AfterStar) + 1)
as AfterLeader
, LEFT(AfterStar, PATINDEX('%[0-9]%', AfterStar) - 1) as Leader
, AfterStar
from (
select RIGHT(col1, len(col1) - CHARINDEX('*', col1)) as AfterStar
from YourTable
) as Sub1
) as Sub2
This prints:
12345
234
3456789
4567
A12345
If you ignore that this is in SQL then the first thing that comes to mind is Regex:
^.*\*(.*[0-9])[^0-9]*$
The capture group there should get what you want. I don't know if SQL has a regex function.