Google Big Query SQL to extract numeric ID from string - sql

How do I write a SQL query in Google BigQuery to extract a numeric ID from strings like these?
Example 1:
Column Value: "http://www.google.com/abc/eeq/entity/32132"
Desired Extraction: 32132
Example 2:
Column Value: "http://www.google.com/abc/eeq/entity/32132/ABC/2138"
Desired Extraction: 32132
Example 3:
Column Value: "http://www.google.com/abc/eeq/entity/32132http://www.google.com/abc/eeq/entity/32132"
Desired Extraction: 32132

Below is an example for BigQuery Standard SQL:
#standardSQL
WITH `project.dataset.table` AS (
SELECT "http://www.google.com/abc/eeq/entity/32132" url UNION ALL
SELECT "http://www.google.com/abc/eeq/entity/32132/ABC/2138" UNION ALL
SELECT "http://www.google.com/abc/eeq/entity/32132http://www.google.com/abc/eeq/entity/32132"
)
SELECT url, REGEXP_EXTRACT(url, r'\d+') extracted_id
FROM `project.dataset.table`
with output
Row url extracted_id
1 http://www.google.com/abc/eeq/entity/32132 32132
2 http://www.google.com/abc/eeq/entity/32132/ABC/2138 32132
3 http://www.google.com/abc/eeq/entity/32132http://www.google.com/abc/eeq/entity/32132 32132

You can use regexp_extract(). To get the first series of digits in the string:
select regexp_extract(col, '[0-9]+')
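Taking the first run of digits works for the three sample URLs, but it would pick up the wrong value if digits ever appeared earlier in the string (for example in a host name or port). A minimal sketch, assuming the ID always directly follows the literal path segment /entity/ (an assumption drawn from the samples, not something stated in the question; the CTE name sample is hypothetical):

```sql
#standardSQL
WITH sample AS (  -- hypothetical inline data mirroring the question's examples
  SELECT "http://www.google.com/abc/eeq/entity/32132" AS url UNION ALL
  SELECT "http://www.google.com/abc/eeq/entity/32132/ABC/2138" UNION ALL
  SELECT "http://www.google.com/abc/eeq/entity/32132http://www.google.com/abc/eeq/entity/32132"
)
SELECT
  url,
  -- With one capturing group, REGEXP_EXTRACT returns only the group's text.
  REGEXP_EXTRACT(url, r'/entity/(\d+)') AS extracted_id,
  -- SAFE_CAST returns NULL rather than failing if the digits do not fit in INT64.
  SAFE_CAST(REGEXP_EXTRACT(url, r'/entity/(\d+)') AS INT64) AS extracted_id_int
FROM sample
```

The integer column is optional; it is only useful if the ID is later compared or joined as a number.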

Related

Extract a substring and take second value in a Bigquery Column

I have this data:
id val
1 ajkdks - jkdj
2 djs - djsd
I want to take only the second value. Which is:
id val
1 jkdj
2 djsd
I know the query if using MySQL:
SUBSTRING_INDEX(SUBSTRING_INDEX(val, " - ", 2)," - ",-1)
But what is the query if I am using BigQuery?
Use the query below:
select id, split(val, ' - ')[safe_offset(1)] val
from your_table
When applied to the sample data in your question, the output is:
id val
1 jkdj
2 djsd
We could phrase this using REGEXP_EXTRACT:
SELECT id, REGEXP_EXTRACT(val, r'[^ -]+$') AS val
FROM yourTable
ORDER BY id;
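Note that the above regex approach also handles the case where val has no hyphen separator, in which case the entire value is returned, provided the value itself contains no spaces or hyphens. The pattern matches only the trailing run of characters that are neither a space nor a hyphen.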
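To see how the two answers differ on values that lack the ' - ' separator, here is a small side-by-side sketch. The sample values 'nodash' and 'two words' are hypothetical and not from the question:

```sql
#standardSQL
WITH sample AS (  -- hypothetical values for illustration
  SELECT 'ajkdks - jkdj' AS val UNION ALL
  SELECT 'nodash' UNION ALL
  SELECT 'two words'
)
SELECT
  val,
  -- SAFE_OFFSET(1) yields NULL when the split produced only one element.
  SPLIT(val, ' - ')[SAFE_OFFSET(1)] AS via_split,
  -- Matches the trailing run of characters that are neither space nor hyphen.
  REGEXP_EXTRACT(val, r'[^ -]+$') AS via_regexp
FROM sample
```

For 'nodash' the SPLIT version should give NULL while the regex gives the whole value; for 'two words' the regex gives only 'words', since the space ends the trailing run.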

How can I count repeated values in the string in BigQuery?

Example:
I have the following string:
201904,BLANK,201902,BLANK,BLANK,201811,201810,201809
How can I count how many times the value "BLANK" is repeated consecutively?
In the described example the answer is 2, but what is the query?
Thanks for your help in advance!
Below is for BigQuery Standard SQL (with quick simplified example)
Corrected Version
#standardSQL
WITH `project.dataset.table` AS (
SELECT '201904,BLANK,201902,BLANK,BLANK,201811,201810,201809,BLANK,BLANK,BLANK' value UNION ALL
SELECT '201904,BLANK,201902,BLANK,BLANK,BLANK,201811' UNION ALL
SELECT '201904,BLANK,201902,BLANK,201811,201902,BLANK,201811'
)
SELECT value,
(
SELECT MAX(ARRAY_LENGTH(SPLIT(list))) - 1
FROM UNNEST(REGEXP_EXTRACT_ALL(value || ',', r'(?:BLANK,){1,}')) list
) max_repeated_count
FROM `project.dataset.table`
The idea here is to:
extract all instances of consecutive BLANK
split each such instance into an array of BLANK elements
and finally take the max length of those arrays as the result (minus 1, because the trailing comma leaves one empty element after the split)
This is just a quick approach that came to mind.
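To make the intermediate steps visible, the following sketch runs the same expressions on a single shortened literal (a hypothetical excerpt of the sample string) and returns one row per run of BLANK:

```sql
#standardSQL
SELECT
  run,                                          -- e.g. 'BLANK,BLANK,'
  SPLIT(run) AS parts,                          -- default delimiter is ','; the trailing comma adds an empty element
  ARRAY_LENGTH(SPLIT(run)) - 1 AS blanks_in_run -- subtract that empty element
FROM UNNEST(REGEXP_EXTRACT_ALL(
  '201904,BLANK,201902,BLANK,BLANK,201811' || ',',
  r'(?:BLANK,){1,}'
)) AS run
```

This should return two rows, one for the single BLANK and one for the pair; the query above simply takes the MAX over these per-run counts.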
Refactored Version
#standardSQL
WITH `project.dataset.table` AS (
SELECT '201904,BLANK,201902,BLANK,BLANK,201811,201810,201809,BLANK,BLANK,BLANK' value UNION ALL
SELECT '201904,BLANK,201902,BLANK,BLANK,BLANK,201811' UNION ALL
SELECT '201904,BLANK,201902,BLANK,201811,201902,BLANK,201811'
)
SELECT value,
(
SELECT MAX(LENGTH(element) - 1)
FROM UNNEST(REGEXP_EXTRACT_ALL(REPLACE(value || ',', 'BLANK', ''), r',+')) element
) max_repeated_count
FROM `project.dataset.table`
Both with output
Row value max_repeated_count
1 201904,BLANK,201902,BLANK,BLANK,201811,201810,201809,BLANK,BLANK,BLANK 3
2 201904,BLANK,201902,BLANK,BLANK,BLANK,201811 3
3 201904,BLANK,201902,BLANK,201811,201902,BLANK,201811 1
The refactored version is slightly different (but the main idea is the same):
it removes all BLANKs (assuming BLANK cannot be part of another element; if it can, the code can easily be adjusted)
then extracts all runs of consecutive commas into an array
and calculates the max length of such runs of commas (minus 1, since n consecutive BLANKs between two other elements leave n + 1 commas)
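One edge case may be worth checking with the comma-counting approach: when the string begins with BLANK, the first run has no comma in front of it, so the count comes up one short. A sketch of one possible adjustment, prepending a comma as well as appending one (the sample value is hypothetical):

```sql
#standardSQL
WITH sample AS (  -- hypothetical value that starts with BLANK
  SELECT 'BLANK,BLANK,201811' AS value
)
SELECT
  value,
  (
    -- Same expression as the refactored version: only a trailing comma is added.
    SELECT MAX(LENGTH(element) - 1)
    FROM UNNEST(REGEXP_EXTRACT_ALL(REPLACE(value || ',', 'BLANK', ''), r',+')) AS element
  ) AS as_written,
  (
    -- Adjusted: a leading comma is added too, so a run at the start is delimited on both sides.
    SELECT MAX(LENGTH(element) - 1)
    FROM UNNEST(REGEXP_EXTRACT_ALL(REPLACE(',' || value || ',', 'BLANK', ''), r',+')) AS element
  ) AS with_leading_comma
FROM sample
```

With this input, as_written should come out as 1 and with_leading_comma as 2. Strings that do not start with BLANK are unaffected, because the extra leading comma then forms a run of length 1 that contributes 0.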
Maybe I misunderstood, but can't you simply split by the value you're looking for and subtract 2 (1 for the first element and 1 for counting elements after splitting):
declare t DEFAULT '201904,BLANK,201902,BLANK,BLANK,201811,201810,201809';
SELECT
t as theString,
split(t,'BLANK') as theSplittedString,
array_length(split(t,'BLANK'))-2 as theAmount
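Reading the result:
n > 0 - the value occurs n + 1 times,
0 - the value occurs once (no repetition),
-1 - element not found
Note that this counts every occurrence in the string, whether or not the occurrences are adjacent.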
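The split-based count and the longest-consecutive-run count agree on the question's example but can differ on other inputs. A sketch comparing the two on two of the sample strings used in this thread (the run-based subquery is borrowed from the other answer; the CTE name sample is hypothetical):

```sql
#standardSQL
WITH sample AS (
  SELECT '201904,BLANK,201902,BLANK,BLANK,201811,201810,201809' AS value UNION ALL
  SELECT '201904,BLANK,201902,BLANK,201811,201902,BLANK,201811'
)
SELECT
  value,
  -- Number of pieces after splitting on 'BLANK', minus 2 = total occurrences minus 1.
  ARRAY_LENGTH(SPLIT(value, 'BLANK')) - 2 AS split_based,
  -- Length of the longest run of adjacent BLANK elements.
  (
    SELECT MAX(ARRAY_LENGTH(SPLIT(run))) - 1
    FROM UNNEST(REGEXP_EXTRACT_ALL(value || ',', r'(?:BLANK,){1,}')) AS run
  ) AS longest_run
FROM sample
```

Both columns should be 2 for the first string. For the second string, which contains three isolated BLANKs, split_based is still 2 while longest_run is 1.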

IBM Db2: select numeric characters only from a column

I have a column 'TEST_COLUMN' that carries 3 values:
123
123ad(44)
w-eq1dfd2
I need to SELECT TEST_COLUMN but get the following result:
123
12344
12
I am running on Db2 Warehouse on Cloud.
You can use REGEXP_REPLACE:
SELECT REGEXP_REPLACE(
'123Red345', '[A-Za-z]','',1)
FROM sysibm.sysdummy1
The query would return "123345".
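Because you asked below, here is the generic version. Note that the pattern '[A-Za-z]' strips only letters, so other characters such as parentheses or hyphens would remain in the result: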
SELECT REGEXP_REPLACE(YOUR_COLUMN, '[A-Za-z]','',1)
FROM SCHEMA.TABLE
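For the three sample values in the question, which also contain parentheses and a hyphen, a negated character class is one way to drop everything that is not a digit. A minimal Db2 sketch; the inline VALUES rows and the alias t are stand-ins for the real table:

```sql
-- Db2 sketch: the VALUES rows stand in for TEST_COLUMN from the question.
SELECT t.test_column,
       -- '[^0-9]' matches any single character that is not a digit;
       -- start position 1, all occurrences replaced with the empty string.
       REGEXP_REPLACE(t.test_column, '[^0-9]', '', 1) AS digits_only
FROM (VALUES ('123'), ('123ad(44)'), ('w-eq1dfd2')) AS t(test_column)
```

Under these assumptions the three rows should come back as 123, 12344 and 12, matching the result requested in the question.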

Finding the second last occurrence of a string (date) in Regex

I got the following strings:
(1640.31; 08/19/2016; 09/13/2016;); (250000.0; 09/30/2016; 02/17/2018;); (100000.0; 03/12/2018; 12/31/2025;);
Or
(1000000.0; 05/30/2018; 06/03/2028;);
I need to return this second to last date, so in these cases for example 1: 03/12/2018 and example 2: 05/30/2018.
Because there are a lot of string parts ending with ;, I can't quite figure out how to get the second-to-last date.
Below is an example for BigQuery Standard SQL:
#standardSQL
WITH `project.dataset.table` AS (
SELECT '(1640.31; 08/19/2016; 09/13/2016;); (250000.0; 09/30/2016; 02/17/2018;); (100000.0; 03/12/2018; 12/31/2025;);' AS str UNION ALL
SELECT '(1000000.0; 05/30/2018; 06/03/2028;);'
)
SELECT ARRAY_REVERSE(REGEXP_EXTRACT_ALL(str, r'\d\d/\d\d/\d\d\d\d'))[SAFE_OFFSET(1)] dt
FROM `project.dataset.table`
with result:
Row dt
1 03/12/2018
2 05/30/2018
note: above assumes that dates are always in mm/dd/yyyy or dd/mm/yyyy format, but can be adjusted if different
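If the extracted text should become a real DATE rather than a string, the format has to be pinned down. A sketch that assumes mm/dd/yyyy (the sample value 12/31/2025 rules out dd/mm/yyyy for this data, but that is an inference, not something the question states):

```sql
#standardSQL
WITH sample AS (
  SELECT '(1640.31; 08/19/2016; 09/13/2016;); (250000.0; 09/30/2016; 02/17/2018;); (100000.0; 03/12/2018; 12/31/2025;);' AS str UNION ALL
  SELECT '(1000000.0; 05/30/2018; 06/03/2028;);'
)
SELECT
  -- The SAFE. prefix returns NULL instead of raising an error
  -- when the text is not a valid date in the given format.
  SAFE.PARSE_DATE(
    '%m/%d/%Y',
    ARRAY_REVERSE(REGEXP_EXTRACT_ALL(str, r'\d\d/\d\d/\d\d\d\d'))[SAFE_OFFSET(1)]
  ) AS dt
FROM sample
```

Rows with fewer than two dates yield NULL, because SAFE_OFFSET(1) returns NULL and PARSE_DATE passes it through.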
I think this does what you want:
select (select array_agg(val order by o desc limit 2) -- the limit is just for efficiency
from unnest(split(str, ';')) val with offset o
where val like '%/%/%'
)[ordinal(2)] a
from (select '1640.31; 08/19/2016; 09/13/2016;' as str) x;
Note that this also (happens to) work with parentheses, if they are really part of the strings.

Get group maxima from combined strings

I have a table with a column code containing multiple pieces of data like this:
001/2017/TT/000001
001/2017/TT/000002
001/2017/TN/000003
001/2017/TN/000001
001/2017/TN/000002
001/2016/TT/000001
001/2016/TT/000002
001/2016/TT/000001
002/2016/TT/000002
There are 4 items in 001/2016/TT/000001: 001, 2016, TT and 000001.
How can I extract the max for every group formed by the first 3 items? The result I want is this:
001/2017/TT/000003
001/2017/TN/000002
001/2016/TT/000002
002/2016/TT/000002
Edit
The subfield separator is /, and the length of subfields can vary.
I use PostgreSQL 9.3.
Obviously, you should normalize the table and split the combined string into 4 columns with proper data types. The function split_part() is the tool of choice if the separator '/' is constant in your string and the length of the subfields can vary.
CREATE TABLE tbl_better AS
SELECT split_part(code, '/', 1)::int AS col_1 -- better names?
, split_part(code, '/', 2)::int AS col_2
, split_part(code, '/', 3) AS col_3 -- text?
, split_part(code, '/', 4)::int AS col_4
FROM tbl_bad
ORDER BY 1,2,3,4 -- optionally cluster data.
Then the task is trivial:
SELECT col_1, col_2, col_3, max(col_4) AS max_nr
FROM tbl_better
GROUP BY 1, 2, 3;
Related:
Split comma separated column data into additional columns
Of course, you can do it on the fly, too. For varying subfield length you could use substring() with a regular expression like this:
SELECT max(substring(code, '([^/]*)$')) AS max_nr
FROM tbl_bad
GROUP BY substring(code, '^(.*)/');
Related (with basic explanation for regexp pattern):
Filter strings with regex before casting to numeric
Or to get only the complete string as result:
SELECT DISTINCT ON (substring(code, '^(.*)/'))
code
FROM tbl_bad
ORDER BY substring(code, '^(.*)/'), code DESC;
About DISTINCT ON:
Select first row in each GROUP BY group?
Be aware that data items cast to a suitable type may behave differently from their string representation. The max of 900001 and 1000001 is 900001 for text and 1000001 for integer ...
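The difference between text and integer comparison can be checked directly. A small PostgreSQL sketch using the two values from the sentence above (the alias t and column x are hypothetical):

```sql
-- PostgreSQL sketch: the same two values compared as text and as integer.
SELECT max(x)      AS max_as_text,     -- character-by-character comparison: '9' sorts after '1'
       max(x::int) AS max_as_integer   -- numeric comparison
FROM  (VALUES ('900001'), ('1000001')) AS t(x);
```

This is why casting the last subfield to a numeric type matters once the numbers can have different lengths.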
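Use the LEFT and RIGHT functions. Note that this relies on fixed subfield lengths, as in the sample data (12 characters for the prefix, 6 for the number), so it will not work if the lengths vary: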
SELECT MAX(RIGHT(code,6)) AS MAX_CODE
FROM yourtable
GROUP BY LEFT(code,12)
Check this out, it may be helpful:
select
distinct on (tab[4],tab[2]) tab[4],tab[3],tab[2],tab[1]
from
(
select
string_to_array(exe.x,'/') as tab,
exe.x
from
(
select
unnest
(
array
['001/2017/TT/000001',
'001/2017/TT/000002',
'001/2017/TN/000003',
'001/2017/TN/000001',
'001/2017/TN/000002',
'001/2016/TT/000001',
'001/2016/TT/000002',
'001/2016/TT/000001',
'002/2016/TT/000002']
) as x
) exe
) exe2
order by tab[4] desc,tab[2] desc,tab[3] desc;