Extract number or string after string in BigQuery - google-bigquery

I have several thousand URLs and want to extract some values from the URL parameters.
Here are some examples from the DB:
["www.xxx.com?uci=6666&rci=fefw"]
["www.xxx.com?uci=61
["www.xxx.com?rci=62&uci=5536"]
["www.xxx.com?uci=6666&utm_source=XXX"]
["www.xxx.com?pccst=TEST%20sTESTg"]
["www.xxx.com?pccst=TEST2%20s&uci=1"]
["www.xxx.com?uci=1pccst=TEST42rt24&rci=2"]
How can I extract the value of the parameter uci? It is always a number (I don't know the exact length).
I tried it with REGEXP_EXTRACT, but I didn't succeed:
REGEXP_EXTRACT(URL, '(uci)\=[0-9]+') AS UCI_extract
I also want to extract the value of the parameter pccst. It can contain any character and I don't know the exact length, but it always ends with " or ? or &
I also tried it with REGEXP_EXTRACT but didn't succeed:
REGEXP_EXTRACT(URL, r'pccst\=(.*)(\"|\&|\?)') AS pccst_extract
I am really not the REGEX expert.
So would be great if someone could help me.
Thanks a lot in advance,
Peter

You can adapt this solution
#standardSQL
# Extract query parameters from a URL as ARRAY in BigQuery; standard-sql; 2018-04-08
# #see http://www.pascallandau.com/bigquery-snippets/extract-url-parameters-array/
WITH examples AS (
SELECT 1 AS id, 'www.xxx.com?uci=6666&rci=fefw' AS query
UNION ALL SELECT 2, 'www.xxx.com?uci=1pccst%20TEST42rt24&rci=2'
UNION ALL SELECT 3, 'www.xxx.com?pccst=TEST2%20s&uci=1'
)
SELECT
id,
query,
REGEXP_EXTRACT_ALL(query,r'(?:\?|&)((?:[^=]+)=(?:[^&]*))') as params,
REGEXP_EXTRACT_ALL(query,r'(?:\?|&)(?:([^=]+)=(?:[^&]*))') as keys,
REGEXP_EXTRACT_ALL(query,r'(?:\?|&)(?:(?:[^=]+)=([^&]*))') as values
FROM examples
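To see what the three patterns above return, here is a minimal Python sketch of the same idea. It uses Python's `re` module as a stand-in for BigQuery's regex engine, and the helper name `extract_params` is hypothetical, not part of the linked snippet.

```python
import re

# One pattern with two capture groups: the key before '=' and the value up to
# the next '&'. The SQL above uses three patterns, one per output column.
PAIR = re.compile(r'(?:\?|&)([^=]+)=([^&]*)')

def extract_params(url):
    """Return params, keys and values, mirroring the three SQL columns."""
    pairs = PAIR.findall(url)
    return {
        "params": [f"{k}={v}" for k, v in pairs],
        "keys": [k for k, _ in pairs],
        "values": [v for _, v in pairs],
    }

result = extract_params("www.xxx.com?pccst=TEST2%20s&uci=1")
print(result["keys"])    # ['pccst', 'uci']
print(result["values"])  # ['TEST2%20s', '1']
```

Selecting a single parameter then amounts to looking up its position in `keys` and reading the same position in `values`.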

Below is an example for BigQuery Standard SQL:
#standardSQL
WITH `project.dataset.table` AS (
SELECT "www.xxx.com?uci=6666&rci=fefw" url UNION ALL
SELECT "www.xxx.com?uci=61" UNION ALL
SELECT "www.xxx.com?rci=62&uci=5536" UNION ALL
SELECT "www.xxx.com?uci=6666&utm_source=XXX" UNION ALL
SELECT "www.xxx.com?pccst=TEST%20sTESTg" UNION ALL
SELECT "www.xxx.com?pccst=TEST2%20s&uci=1" UNION ALL
SELECT "www.xxx.com?uci=1&pccst=TEST42rt24&rci=2"
)
SELECT
url,
REGEXP_EXTRACT(url, r'[?&]uci=(.*?)(?:$|&)') uci,
REGEXP_EXTRACT(url, r'[?&]pccst=(.*?)(?:$|&)') pccst
FROM `project.dataset.table`
The result is:
Row  url                                        uci   pccst
1    www.xxx.com?pccst=TEST%20sTESTg            null  TEST%20sTESTg
2    www.xxx.com?pccst=TEST2%20s&uci=1          1     TEST2%20s
3    www.xxx.com?uci=1&pccst=TEST42rt24&rci=2   1     TEST42rt24
4    www.xxx.com?uci=61                         61    null
5    www.xxx.com?rci=62&uci=5536                5536  null
6    www.xxx.com?uci=6666&rci=fefw              6666  null
7    www.xxx.com?uci=6666&utm_source=XXX        6666  null
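The pattern can be tried outside BigQuery as well. The sketch below uses Python's `re` module as a stand-in (an assumption, since BigQuery uses a different regex engine, though this pattern behaves the same in both as far as I can tell), and `get_param` is a hypothetical helper name. It also shows the likely reason the question's first attempt failed: when a pattern has a capturing group, the group's text is what gets returned, and `(uci)` captures the parameter name rather than the digits.

```python
import re

def get_param(url, name):
    """Return the value of query parameter `name`, or None if it is absent."""
    pattern = r'[?&]' + re.escape(name) + r'=(.*?)(?:$|&)'
    m = re.search(pattern, url)
    return m.group(1) if m else None

print(get_param("www.xxx.com?rci=62&uci=5536", "uci"))        # 5536
print(get_param("www.xxx.com?pccst=TEST2%20s&uci=1", "pccst"))  # TEST2%20s
print(get_param("www.xxx.com?pccst=TEST%20sTESTg", "uci"))    # None

# The question's first pattern captures the name, not the number:
first_try = re.search(r'(uci)=[0-9]+', "www.xxx.com?uci=6666&rci=fefw").group(1)
print(first_try)  # uci
```

The leading `[?&]` keeps a parameter such as `xuci=` from matching, and the lazy `(.*?)` stops at the first `&` or at the end of the string.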
Also, below is an option that parses out all key-value pairs, so you can then dynamically select the ones you need:
#standardSQL
WITH `project.dataset.table` AS (
SELECT "www.xxx.com?uci=6666&rci=fefw" url UNION ALL
SELECT "www.xxx.com?uci=61" UNION ALL
SELECT "www.xxx.com?rci=62&uci=5536" UNION ALL
SELECT "www.xxx.com?uci=6666&utm_source=XXX" UNION ALL
SELECT "www.xxx.com?pccst=TEST%20sTESTg" UNION ALL
SELECT "www.xxx.com?pccst=TEST2%20s&uci=1" UNION ALL
SELECT "www.xxx.com?uci=1&pccst=TEST42rt24&rci=2"
)
SELECT url,
ARRAY(
SELECT AS STRUCT
SPLIT(kv, '=')[SAFE_OFFSET(0)] key,
SPLIT(kv, '=')[SAFE_OFFSET(1)] value
FROM UNNEST(SPLIT(SUBSTR(url, LENGTH(NET.HOST(url)) + 2), '&')) kv
) key_value_pair
FROM `project.dataset.table`
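The split-based approach can be sketched in Python as follows. This is an illustration under assumptions: `parse_pairs` is a hypothetical name, and the query string is taken as everything after the first `?`, which is what the `SUBSTR(url, LENGTH(NET.HOST(url)) + 2)` expression yields for the sample URLs.

```python
def parse_pairs(url):
    """Split the query string on '&', then each piece on '='."""
    _, _, query = url.partition("?")
    pairs = []
    for kv in query.split("&"):
        parts = kv.split("=")
        key = parts[0]
        # Like SAFE_OFFSET(1): a missing value becomes None instead of an error.
        value = parts[1] if len(parts) > 1 else None
        pairs.append((key, value))
    return pairs

pairs = parse_pairs("www.xxx.com?uci=1&pccst=TEST42rt24&rci=2")
print(pairs)  # [('uci', '1'), ('pccst', 'TEST42rt24'), ('rci', '2')]
print(dict(pairs).get("pccst"))  # TEST42rt24
```

Because only the first two pieces of each split are kept, a value that itself contains `=` would be cut off at that character.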

Related

Regexp pattern for special characters

I have the data in the format like
Input:
Code_1
FAB
?
USP BEN,
.
-
,
Output:
Code_1
FAB
USP BEN,
I need to exclude only the values that have a length of 1 and are special characters.
I am using (regexp_like(code_1,'^[^<>{}"/|;:.,~!?##$%^=&*\]\\()\[¿§«»ω⊙¤°℃℉€¥£¢¡®©0-9_+]')) AND LENGTH(CODE_1)>=1
I have also tried REGEXP_LIKE(CODE_1,'[A-Za-z0-9]')
As I understand your requirements, you want to exclude data that is both a single character and non-alphanumeric (at the same time). This should do it for you.
The 'WITH' clause just sets up test data in this case and can be thought of like a temp table here. It is a great way to help people help you by setting up test data. Always include data you don't expect!
The actual query starts below and selects data that uses grouping to get the data that is NOT a group of non-alpha numeric with a length of one. It uses a POSIX shortcut of [:alnum:] to indicate [A-Za-z0-9].
Note your requirements will allow multiple non-alnum characters to be selected as is indicated by the test data.
WITH tbl(DATA) AS (
SELECT 'FAB' FROM dual UNION ALL
SELECT '?' FROM dual UNION ALL
SELECT 'USP BEN,' FROM dual UNION ALL
SELECT '.' FROM dual UNION ALL
SELECT '-' FROM dual UNION ALL
SELECT '----' FROM dual UNION ALL
SELECT ',' FROM dual UNION ALL
SELECT 'A' FROM dual UNION ALL
SELECT 'b' FROM dual UNION ALL
SELECT '5' FROM dual
)
SELECT DATA
FROM tbl
WHERE NOT (REGEXP_LIKE(DATA, '[^[:alnum:]]')
AND LENGTH(DATA) = 1);
DATA
----------
FAB
USP BEN,
----
A
b
5
6 rows selected.
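The same filter can be checked quickly in Python. This is only a sketch of the logic, with `keep` as a hypothetical name and `[^A-Za-z0-9]` written out in place of the POSIX class `[^[:alnum:]]`, which Python's `re` module does not support.

```python
import re

def keep(value):
    """Drop a value only if it is one character long AND non-alphanumeric."""
    is_single = len(value) == 1
    has_non_alnum = re.search(r'[^A-Za-z0-9]', value) is not None
    return not (is_single and has_non_alnum)

rows = ['FAB', '?', 'USP BEN,', '.', '-', '----', ',', 'A', 'b', '5']
kept = [r for r in rows if keep(r)]
print(kept)  # ['FAB', 'USP BEN,', '----', 'A', 'b', '5']
```

As in the SQL output, `----` survives because it is longer than one character, and `A`, `b` and `5` survive because they are alphanumeric.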

Find value that is not a number or a predefined string

I have to test a column of a sql table for invalid values and for NULL.
Valid values are: Any number and the string 'n.v.' (with and without the dots and in every possible combination as listed in my sql command)
So far, I've tried this:
select count(*)
from table1
where column1 is null
or not REGEXP_LIKE(column1, '^[0-9,nv,Nv,nV,NV,n.v,N.v,n.V,N.V]+$');
The regular expression also matches the single character values 'n','N','v','V' (with and without a following dot). This shouldn't be the case, because I only want the exact character combinations as written in the sql command to be matched. I guess the problem has to do with using REGEXP_LIKE. Any ideas?
I guess this regexp will work:
NOT REGEXP_LIKE(column1, '^([0-9]+|n\.?v\.?)$', 'i')
Note that , is not a separator, . means any character, \. means the dot character itself and 'i' flag could be used to ignore case instead of hard coding all combinations of upper and lower case characters.
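A quick way to convince yourself of these points is to run the pattern against a few values. The sketch below uses Python's `re` module as a stand-in for Oracle's REGEXP_LIKE (an assumption; the two engines differ in general but agree on this pattern as far as I know). `is_valid` is a hypothetical helper name.

```python
import re

# Same alternation as above; re.IGNORECASE plays the role of the 'i' flag,
# and fullmatch replaces the ^ and $ anchors.
VALID = re.compile(r'[0-9]+|n\.?v\.?', re.IGNORECASE)

def is_valid(value):
    """True for a run of digits or for n.v. with optional dots, any case."""
    return value is not None and VALID.fullmatch(value) is not None

for v in ['123', 'nv', 'N.V.', 'n.V', 'Nv.', 'n', 'V.', '123nv', 'n,v']:
    print(v, is_valid(v))
```

Only the first five values are accepted; the single letters, the mixed value and the comma variant are rejected.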
There is no need to use a regexp here (avoiding it improves performance on large data); plain old TRANSLATE is good enough for your validation.
Note that the first translate(column1,'x0123456789','x') removes all numeric characters from the string, so if you end up with null, the string is OK.
The second translate(lower(column1),'x.','x') removes all dots from the lowered string so you expect the result nv.
To avoid cases as n.....v.... you also limit the string length.
select
column1,
case when
translate(column1,'x0123456789','x') is null or /* numeric string */
translate(lower(column1),'x.','x') = 'nv' and length(column1) <= 4 then 'OK'
end as status
from table1
COLUMN1 STATUS
--------- ------
1010101 OK
1012828n
1012828nv
n.....v....
n.V OK
Test data
create table table1 as
select '1010101' column1 from dual union all -- OK numbers
select '1012828n' from dual union all -- invalid
select '1012828nv' from dual union all -- invalid
select 'n.....v....' from dual union all -- invalid
select 'n.V' from dual; -- OK nv
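The two checks can be mirrored in Python to see why each test row gets its status. This is a sketch only: `status` is a hypothetical name, and an empty string stands in for Oracle's NULL result from TRANSLATE.

```python
# Deleting all digits plays the role of translate(column1,'x0123456789','x').
DIGITS = str.maketrans("", "", "0123456789")

def status(value):
    """Return 'OK' for a digits-only string or an n.v. variant, else None."""
    if value.translate(DIGITS) == "":
        return "OK"  # nothing left after removing digits
    # Remove the dots from the lowered string and limit the original length.
    if value.lower().replace(".", "") == "nv" and len(value) <= 4:
        return "OK"
    return None

for v in ['1010101', '1012828n', '1012828nv', 'n.....v....', 'n.V']:
    print(v, status(v))
```

Only `1010101` and `n.V` come out as OK, matching the query output above. Note that the length limit alone still admits values such as `..nv`.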
You can use:
select count(*)
from table1
WHERE TRANSLATE(column1, ' 0123456789', ' ') IS NULL
OR LOWER(column1) IN ('nv', 'n.v', 'nv.', 'n.v.');
Which, for the sample data:
CREATE TABLE table1 (column1) AS
SELECT '12345' FROM DUAL UNION ALL
SELECT 'nv' FROM DUAL UNION ALL
SELECT 'NV' FROM DUAL UNION ALL
SELECT 'nV' FROM DUAL UNION ALL
SELECT 'n.V.' FROM DUAL UNION ALL
SELECT '...................n.V.....................' FROM DUAL UNION ALL
SELECT '..nV' FROM DUAL UNION ALL
SELECT 'n..V' FROM DUAL UNION ALL
SELECT 'nV..' FROM DUAL UNION ALL
SELECT 'xyz' FROM DUAL UNION ALL
SELECT '123nv' FROM DUAL;
Outputs:
COUNT(*)
5
or, if you want any quantity of . then:
select count(*)
from table1
WHERE TRANSLATE(column1, ' 0123456789', ' ') IS NULL
OR REPLACE(LOWER(column1), '.') = 'nv';
Which outputs:
COUNT(*)
9

Query to find if a aggregate string contains certain numbers

I am working on Big Query Standard SQL. I have a data table like shown below (using ; as separator):
id;operation
107327;-1,-1,-1,-1,5,-1,0,2,-1
108296;-1,6,2,-1,-1,-1
690481;0,-1,-1,-1,5
102643;5,-1,-1,-1,-1,-2,2,3,-1,0,-1,-1,-1,-1,-1,-1
103171;0,5
789481;0,-1,5
I would like to select the ids whose operation contains only 0,5 or 0,-1,5, so the result will show:
690481
103171
789481
Below is for BigQuery Standard SQL
#standardSQL
SELECT *
FROM `project.dataset.table`
WHERE 0 = (
SELECT COUNT(1)
FROM UNNEST(SPLIT(operation)) op
WHERE NOT op IN ('0', '-1', '5')
)
You can test and play with the above using the sample data from your question, as in the example below:
#standardSQL
WITH `project.dataset.table` AS (
SELECT 107327 id, '-1,-1,-1,-1,5,-1,0,2,-1' operation UNION ALL
SELECT 108296, '-1,6,2,-1,-1,-1' UNION ALL
SELECT 690481, '0,-1,-1,-1,5' UNION ALL
SELECT 102643, '5,-1,-1,-1,-1,-2,2,3,-1,0,-1,-1,-1,-1,-1,-1' UNION ALL
SELECT 103171, '0,5' UNION ALL
SELECT 789481, '0,-1,5'
)
SELECT *
FROM `project.dataset.table`
WHERE 0 = (
SELECT COUNT(1)
FROM UNNEST(SPLIT(operation)) op
WHERE NOT op IN ('0', '-1', '5')
)
The output contains the three expected ids: 690481, 103171 and 789481.
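The logic of the subquery (count the elements that are not in the allowed set and require that count to be zero) can be sketched in Python. `only_allowed` is a hypothetical name used for illustration.

```python
ALLOWED = {"0", "-1", "5"}

def only_allowed(operation):
    """True when every comma-separated element is one of the allowed values."""
    return all(op in ALLOWED for op in operation.split(","))

rows = [
    (107327, "-1,-1,-1,-1,5,-1,0,2,-1"),
    (108296, "-1,6,2,-1,-1,-1"),
    (690481, "0,-1,-1,-1,5"),
    (102643, "5,-1,-1,-1,-1,-2,2,3,-1,0,-1,-1,-1,-1,-1,-1"),
    (103171, "0,5"),
    (789481, "0,-1,5"),
]
ids = [row_id for row_id, operation in rows if only_allowed(operation)]
print(ids)  # [690481, 103171, 789481]
```

This check ignores order, so a value such as `5,-1,0` would pass as well.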
I think a regular expression does what you want:
select t.*
from t
where regexp_contains(operation, '^0,(-1,)*5$');
If you want matches to rows that contain only 0, -1, or 5, you would use:
where regexp_contains(operation, '^((0|-1|5),)*(0|-1|5)$');
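The difference between the two patterns is easiest to see side by side. The sketch below runs them with Python's `re` module as a stand-in for `regexp_contains` (an assumption about engine equivalence that holds for these simple patterns as far as I know). The names `STRICT` and `LOOSE` are hypothetical.

```python
import re

STRICT = re.compile(r'^0,(-1,)*5$')              # 0 first, then any -1s, then 5
LOOSE = re.compile(r'^((0|-1|5),)*(0|-1|5)$')    # any mix of 0, -1 and 5

samples = ["0,5", "0,-1,5", "0,-1,-1,-1,5", "5,-1,0", "-1,6,2,-1,-1,-1"]
for s in samples:
    print(s, bool(STRICT.search(s)), bool(LOOSE.search(s)))
```

The first three samples match both patterns, `5,-1,0` matches only the loose one, and the last sample matches neither.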

How to cut everything after a specific character, but in case string doesn't contain it do nothing?

Let's say I have the following data:
fjflka, kdjf
ssssllkjf fkdsjl
skfjjsld, kjl
jdkfjlj, ksd
lkjlkj hjk
I want to cut out everything after ',', but if the string doesn't contain this character, it should be left unchanged. If I use substr to cut everything after ',', the strings that don't contain this character come back as null. How do I achieve this? I'm using Oracle 11g.
This should work. Simply use regexp_substr
with t_view as (
select 'fjflka, kdjf' as text from dual union
select 'ssssllkjf fkdsjl' from dual union
select 'skfjjsld, kjl' from dual union
select 'jdkfjlj, ksd' from dual union
select 'lkjlkj hjk' from dual
)
select text,regexp_substr(text,'[^,]+',1,1) from t_view;
Assuming your table :
SQL> desc mytable
s varchar2(100)
you may use:
select decode(instr(s,','),0,s,substr(s,1,instr(s,',')-1)) from mytable;
Well the below query works as per your requirement.
with mytable as
(select 'aaasfasf wqwe' s from dual
union all
select 'aaasfasf, wqwe' s from dual)
select s,substr(s||',',1,instr(s||',',',')-1) from mytable;
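The trick in the last query is to append a comma so that the search always finds one. A minimal Python sketch of that idea, with `before_comma` as a hypothetical name:

```python
def before_comma(s):
    """Return the text before the first comma, or s unchanged if none exists."""
    padded = s + ","                   # guarantees that a comma is found
    return padded[:padded.index(",")]

data = ["fjflka, kdjf", "ssssllkjf fkdsjl", "skfjjsld, kjl", "jdkfjlj, ksd", "lkjlkj hjk"]
for s in data:
    print(before_comma(s))
```

In Python itself, `s.split(",", 1)[0]` gives the same result; the padded version is shown only because it maps directly onto the `substr`/`instr` expression.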

Format a number to have commas (1000000 -> 1,000,000)

In Bigquery: How do we format a number that will be part of the resultset to have it formatted with commas: like 1000000 to 1,000,000 ?
Below is for Standard SQL:
SELECT
input,
FORMAT("%'d", input) as formatted
FROM (
SELECT 123 AS input UNION ALL
SELECT 1234 AS input UNION ALL
SELECT 12345 AS input UNION ALL
SELECT 123456 AS input UNION ALL
SELECT 1234567 AS input UNION ALL
SELECT 12345678 AS input UNION ALL
SELECT 123456789 AS input
)
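For comparison, and as a way to check the expected strings outside BigQuery, Python's format mini-language has a similar grouping option. This is only an illustrative sketch; `with_commas` is a hypothetical name and the code is not BigQuery's implementation.

```python
def with_commas(n):
    """Format an integer with a comma as the thousands separator."""
    return f"{n:,}"

inputs = [123, 1234, 12345, 123456, 1234567, 12345678, 123456789]
for n in inputs:
    print(n, with_commas(n))
```

For example, 1234567 comes out as 1,234,567, which is the form the question asks for.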
This works great for integers, but if you need floats too, you can use:
SELECT
input,
CONCAT(FORMAT("%'d", CAST(input AS int64)),
SUBSTR(FORMAT("%.2f", CAST(input AS float64)), -3)) as formatted
FROM (
SELECT 123 AS input UNION ALL
SELECT 1234 AS input UNION ALL
SELECT 12345 AS input UNION ALL
SELECT 123456.1 AS input UNION ALL
SELECT 1234567.12 AS input UNION ALL
SELECT 12345678.123 AS input UNION ALL
SELECT 123456789.1234 AS input
)
Added for Legacy SQL:
By the way, if for whatever reason you are bound to Legacy SQL, below is a quick example for it:
SELECT input, formatted
FROM JS((
SELECT input
FROM
(SELECT 123 AS input ),
(SELECT 1234 AS input ),
(SELECT 12345 AS input ),
(SELECT 123456 AS input ),
(SELECT 1234567 AS input ),
(SELECT 12345678 AS input ),
(SELECT 123456789 AS input)
),
// input
input,
// output
"[
{name: 'input', type:'integer'},
{name: 'formatted', type:'string'}
]",
// function
"function (r, emit) {
emit({
input: r.input,
formatted: r.input.toString().replace(/(\d)(?=(\d{3})+(?!\d))/g, '$1,')
});
}"
)
The example above uses the inline version of Legacy SQL User-Defined Functions, which is usually used for quick demos and examples but is not recommended in production. If you find it useful, you will need to transform it very slightly; see https://cloud.google.com/bigquery/user-defined-functions#webui for an example.
With Standard SQL:
SELECT FORMAT("%'d", 1000123)
1,000,123
Instruction to enable Standard SQL: https://cloud.google.com/bigquery/sql-reference/enabling-standard-sql
An improvement on Mikhail's answer for the float section:
CAST(input AS int64) rounds, so a number like 12345.5 becomes 12,346.50 in the output.
Instead, I split on "." to get the integer part of the number, then cast that to int64.
CREATE TEMP FUNCTION format_n(x float64) AS (
CONCAT(
FORMAT("%'d", CAST(SPLIT(CAST(x AS string), '.')[OFFSET(0)] AS int64)),
SUBSTR(FORMAT("%.2f", x), -3)));
SELECT
input,
format_n(input)
FROM (
SELECT
123 AS input
UNION ALL
SELECT
1234 AS input
UNION ALL
SELECT
12345 AS input
UNION ALL
SELECT
123456.8 AS input
UNION ALL
SELECT
1234567.12 AS input
UNION ALL
SELECT
12345678.127 AS input
UNION ALL
SELECT
123456789.1234 AS input )
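The rounding problem and the fix can be reproduced in a few lines of Python. This is a sketch of the idea, not of BigQuery's behaviour in every edge case. Both function names are hypothetical, and `math.floor(x + 0.5)` is used as an assumed stand-in for the rounding done by CAST on positive inputs.

```python
import math

def format_rounded(x):
    """Integer part obtained by rounding, as in the CAST-based version."""
    whole = math.floor(x + 0.5)
    return f"{whole:,}" + f"{x:.2f}"[-3:]

def format_truncated(x):
    """Integer part obtained by truncation, as in the split-on-'.' version."""
    whole = math.trunc(x)
    return f"{whole:,}" + f"{x:.2f}"[-3:]

print(format_rounded(12345.5))     # 12,346.50  (integer part is off by one)
print(format_truncated(12345.5))   # 12,345.50
print(format_truncated(123456.8))  # 123,456.80
```

One caveat applies to the truncating version too: the last three characters still come from a rounded two-decimal string, so an input such as 12345.999 would produce 12,345.00.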