After streaming some JSON data into BigQuery, we have a record that looks like:
"{\"Type\": \"Some_type\", \"Identification\": {\"Name\": \"First Last\"}}"
How would I extract the type from this? E.g. I would like to get Some_type.
I tried every combination shown in https://cloud.google.com/bigquery/docs/reference/standard-sql/json_functions without success. In particular, I thought that:
SELECT JSON_EXTRACT_SCALAR(raw_json , "$[\"Type\"]") as parsed_type FROM `table` LIMIT 1000
is what I need. However, I get:
Invalid token in JSONPath at: ["Type"]
Below example is for BigQuery Standard SQL
#standardSQL
WITH `project.dataset.table` AS (
SELECT 1 id, "{\"Type\": \"Some_type\", \"Identification\": {\"Name\": \"First Last\"}}" raw_json UNION ALL
SELECT 2, '{"Type": "Some_type", "Identification": {"Name": "First Last"}}'
)
SELECT id, JSON_EXTRACT_SCALAR(raw_json , "$.Type") AS parsed_type
FROM `project.dataset.table`
with result
Row id parsed_type
1 1 Some_type
2 2 Some_type
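As a small sketch of the same idea: the error in the question most likely comes from the bracket form of the JSONPath, since JSON_EXTRACT_SCALAR expects single quotes inside the brackets rather than double quotes. The query below assumes a one-row inline table and shows the dot form, the bracket form, and a nested field; the column aliases are only illustrative.

```sql
#standardSQL
WITH `project.dataset.table` AS (
  SELECT '{"Type": "Some_type", "Identification": {"Name": "First Last"}}' AS raw_json
)
SELECT
  -- dot notation, as in the answer above
  JSON_EXTRACT_SCALAR(raw_json, "$.Type") AS parsed_type,
  -- bracket notation: single quotes inside the brackets
  JSON_EXTRACT_SCALAR(raw_json, "$['Type']") AS parsed_type_bracket,
  -- nested field
  JSON_EXTRACT_SCALAR(raw_json, "$.Identification.Name") AS parsed_name
FROM `project.dataset.table`
```

The bracket form is mainly useful when a key contains characters such as dots that the dot notation cannot express.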
See the updated example below. Take a look at the third record, which I think mimics your case:
#standardSQL
WITH `project.dataset.table` AS (
SELECT 1 id, "{\"Type\": \"Some_type\", \"Identification\": {\"Name\": \"First Last\"}}" raw_json UNION ALL
SELECT 2, '''{"Type": "Some_type", "Identification": {"Name": "First Last"}}''' UNION ALL
SELECT 3, '''"{\"Type\": \"
null1\"}"
'''
)
SELECT id,
JSON_EXTRACT_SCALAR(REGEXP_REPLACE(raw_json, r'^"|"$', '') , "$.Type") AS parsed_type
FROM `project.dataset.table`
with result
Row id parsed_type
1 1 Some_type
2 2 Some_type
3 3 null1
Note: I use null1 instead of null so you can easily see that it is the string null1 rather than a NULL.
Here is my BigQuery table. I am trying to find out the URLs that were displayed but not viewed.
create table dataset.url_visits(ID INT64 ,displayed_url string , viewed_url string);
select * from dataset.url_visits;
ID Displayed_URL Viewed_URL
1 url11,url12 url12
2 url9,url12,url13 url9
3 url1,url2,url3 NULL
In this example, I want to display
ID Displayed_URL Viewed_URL unviewed_URL
1 url11,url12 url12 url11
2 url9,url12,url13 url9 url12,url13
3 url1,url2,url3 NULL url1,url2,url3
Split each string into an array and unnest it. Use a CASE expression to check whether each displayed URL also appears among the viewed URLs, then aggregate the results back into a string or an array.
Select ID, string_agg(viewing ) as viewed,
string_agg(not_viewing ) as not_viewed,
array_agg(viewing ignore nulls) as viewed_array
from (
Select ID ,
case when display in unnest(split(Viewed_URL)) then display else null end as viewing,
case when display in unnest(split(Viewed_URL)) then null else display end as not_viewing,
from (
Select 1 as ID, "url11,url12" as Displayed_URL, "url12" as Viewed_URL UNION ALL
Select 2, "url9,url12,url13", "url9" UNION ALL
Select 3, "url1,url2,url3", NULL UNION ALL
Select 4, "url9,url12,url13", "url9,url12"
),unnest(split(Displayed_URL)) as display
)
group by 1
Consider below approach
select *, (
select string_agg(url)
from unnest(split(Displayed_URL)) url
where url != ifnull(Viewed_URL, '')
) unviewed_URL
from `project.dataset.table`
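If applied to the sample data in your question, the output has the unviewed_URL values you asked for: url11 for ID 1, url12,url13 for ID 2, and url1,url2,url3 for ID 3.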
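Note that the comparison above treats Viewed_URL as a single URL. A hedged variant for rows where several URLs were viewed (such as the ID 4 row used in the other answer) could split Viewed_URL as well; the WITH OFFSET / ORDER BY part is an addition here, only to keep the original URL order in the aggregated string.

```sql
#standardSQL
WITH `project.dataset.table` AS (
  SELECT 1 AS ID, 'url11,url12' AS Displayed_URL, 'url12' AS Viewed_URL UNION ALL
  SELECT 2, 'url9,url12,url13', 'url9' UNION ALL
  SELECT 3, 'url1,url2,url3', NULL UNION ALL
  SELECT 4, 'url9,url12,url13', 'url9,url12'
)
SELECT *, (
  -- keep only displayed URLs that do not appear in the viewed list
  SELECT STRING_AGG(url ORDER BY pos)
  FROM UNNEST(SPLIT(Displayed_URL)) url WITH OFFSET pos
  -- a NULL Viewed_URL becomes '', which matches no real URL
  WHERE url NOT IN UNNEST(SPLIT(IFNULL(Viewed_URL, '')))
) AS unviewed_URL
FROM `project.dataset.table`
```

For ID 4 this should leave only url13, while IDs 1 to 3 give the same result as before.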
I am trying to use SUBSTR to get values like +4 and +06 from (UTC+4:00) and (UTC+06:00) in BigQuery. However, I don't see a FIND function that would give me the position of : and the position of +, so that I could directly use:
SUBSTR(X,FIND(X,"+"),FIND(X,":")-1)
Are there any alternative solutions to achieve this, or do we need to use REGEXP functions?
You can use REGEXP_EXTRACT(x, r'(\+.*):') as in example below
#standardSQL
WITH `project.dataset.table` AS (
SELECT 'UTC+4:00' x UNION ALL
SELECT 'UTC+06:00'
)
SELECT x, REGEXP_EXTRACT(x, r'(\+.*):')
FROM `project.dataset.table`
with result
Row x f0_
1 UTC+4:00 +4
2 UTC+06:00 +06
Update for more cases:
#standardSQL
WITH `project.dataset.table` AS (
SELECT 'UTC+4:00' x UNION ALL
SELECT 'UTC+06:00' UNION ALL
SELECT 'UTC' UNION ALL
SELECT 'UTC-03:00'
)
SELECT x, IFNULL(REGEXP_EXTRACT(x, r'UTC(.*):'), '0')
FROM `project.dataset.table`
with result
Row x f0_
1 UTC+4:00 +4
2 UTC+06:00 +06
3 UTC 0
4 UTC-03:00 -03
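If you prefer the SUBSTR approach from the question, STRPOS is the closest BigQuery counterpart to FIND. The sketch below assumes every value contains both a + and a :, so it does not cover the plain UTC or negative-offset cases that the regular expression above handles; the utc_offset alias is only illustrative.

```sql
#standardSQL
WITH `project.dataset.table` AS (
  SELECT 'UTC+4:00' x UNION ALL
  SELECT 'UTC+06:00'
)
SELECT x,
  -- STRPOS returns the 1-based position of the first match, or 0 if absent;
  -- the third SUBSTR argument is a length, hence the subtraction
  SUBSTR(x, STRPOS(x, '+'), STRPOS(x, ':') - STRPOS(x, '+')) AS utc_offset
FROM `project.dataset.table`
```

This should return +4 and +06 for the two rows.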
Right now I am filtering my rows using the WHERE clause and 2 conditions. It seems somewhat inefficient to write 2 conditions. Would it be possible to check whether both "amznbida" and "ksga" are in the array with a single condition?
#standardSQL
-- Get all the keys
SELECT
*
FROM `encoded-victory-198215.DFP_TEST.test3`
WHERE
"amznbida" IN UNNEST(ARRAY(SELECT name FROM UNNEST(keywords)))
AND
"ksga" IN UNNEST(ARRAY(SELECT name FROM UNNEST(keywords)))
Just remove the UNNEST(ARRAY( part and leave the subquery - you should be fine.
working example:
SELECT
*,
t in (select * from unnest(a)) condition
FROM unnest([
struct('a' as t, ['a', 'b', 'c'] as a),
('b',['r', 'f'])
])
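Applied to the query from the question, that advice could look roughly like this. The sample CTE and its two rows are made up for illustration and carry only the name field of each keyword.

```sql
#standardSQL
WITH sample AS (
  SELECT 1 AS impression,
    [STRUCT('ksga' AS name), STRUCT('amznbida' AS name)] AS keywords UNION ALL
  SELECT 2,
    [STRUCT('xxxa' AS name), STRUCT('amznbida' AS name)]
)
SELECT *
FROM sample
-- the subquery can be used with IN directly, without UNNEST(ARRAY(...))
WHERE 'amznbida' IN (SELECT name FROM UNNEST(keywords))
  AND 'ksga' IN (SELECT name FROM UNNEST(keywords))
```

Only the row with impression 1 contains both names, so only that row should be returned.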
Below is for BigQuery Standard SQL
#standardSQL
SELECT *
FROM `encoded-victory-198215.DFP_TEST.test3`
WHERE 2 = (SELECT COUNT(DISTINCT name) FROM UNNEST(keywords) WHERE name IN ("amznbida", "ksga"))
You can test and play with the above using dummy data, as below:
#standardSQL
WITH `encoded-victory-198215.DFP_TEST.test3` AS (
SELECT
ARRAY<STRUCT<value ARRAY<STRING>, name STRING>>[
STRUCT(['ksg-1', 'ksg-2'], 'ksga'), STRUCT(['amznbid-1', 'amznbid-2'], 'amznbida')
] keywords,
1 impression UNION ALL
SELECT
ARRAY<STRUCT<value ARRAY<STRING>, name STRING>>[
STRUCT(['xxx-1', 'xxx-2'], 'xxxa'), STRUCT(['amznbid-1', 'amznbid-2'], 'amznbida')
] keywords,
2 impression
)
SELECT *
FROM `encoded-victory-198215.DFP_TEST.test3`
WHERE 2 = (SELECT COUNT(DISTINCT name) FROM UNNEST(keywords) WHERE name IN ("amznbida", "ksga"))
with result
Row keywords.value keywords.name impression
1 ksg-1 ksga 1
ksg-2
amznbid-1 amznbida
amznbid-2
I have several thousand URLs and want to extract some values from the URL parameters.
Here are some examples from the DB:
["www.xxx.com?uci=6666&rci=fefw"]
["www.xxx.com?uci=61
["www.xxx.com?rci=62&uci=5536"]
["www.xxx.com?uci=6666&utm_source=XXX"]
["www.xxx.com?pccst=TEST%20sTESTg"]
["www.xxx.com?pccst=TEST2%20s&uci=1"]
["www.xxx.com?uci=1pccst=TEST42rt24&rci=2"]
How can I extract the value of the parameter uci? It is always a number (I don't know the exact length).
I tried it with REGEXP_EXTRACT. But I didn't succeed:
REGEXP_EXTRACT(URL, '(uci)\=[0-9]+') AS UCI_extract
I also want to extract the value of the parameter pccst. It can contain any character and I don't know the exact length, but it always ends with " or ? or &
I tried it also with REGEXP_EXTRACT but didn't succeed:
REGEXP_EXTRACT(URL, r'pccst\=(.*)(\"|\&|\?)') AS pccst_extract
I am really not the REGEX expert.
So would be great if someone could help me.
Thanks a lot in advance,
Peter
You can adapt this solution
#standardSQL
# Extract query parameters from a URL as ARRAY in BigQuery; standard-sql; 2018-04-08
# #see http://www.pascallandau.com/bigquery-snippets/extract-url-parameters-array/
WITH examples AS (
SELECT 1 AS id, 'www.xxx.com?uci=6666&rci=fefw' AS query
UNION ALL SELECT 2, 'www.xxx.com?uci=1pccst%20TEST42rt24&rci=2'
UNION ALL SELECT 3, 'www.xxx.com?pccst=TEST2%20s&uci=1'
)
SELECT
id,
query,
REGEXP_EXTRACT_ALL(query,r'(?:\?|&)((?:[^=]+)=(?:[^&]*))') as params,
REGEXP_EXTRACT_ALL(query,r'(?:\?|&)(?:([^=]+)=(?:[^&]*))') as keys,
REGEXP_EXTRACT_ALL(query,r'(?:\?|&)(?:(?:[^=]+)=([^&]*))') as values
FROM examples
The example below is for BigQuery Standard SQL
#standardSQL
WITH `project.dataset.table` AS (
SELECT "www.xxx.com?uci=6666&rci=fefw" url UNION ALL
SELECT "www.xxx.com?uci=61" UNION ALL
SELECT "www.xxx.com?rci=62&uci=5536" UNION ALL
SELECT "www.xxx.com?uci=6666&utm_source=XXX" UNION ALL
SELECT "www.xxx.com?pccst=TEST%20sTESTg" UNION ALL
SELECT "www.xxx.com?pccst=TEST2%20s&uci=1" UNION ALL
SELECT "www.xxx.com?uci=1&pccst=TEST42rt24&rci=2"
)
SELECT
url,
REGEXP_EXTRACT(url, r'[?&]uci=(.*?)(?:$|&)') uci,
REGEXP_EXTRACT(url, r'[?&]pccst=(.*?)(?:$|&)') pccst
FROM `project.dataset.table`
result is
Row url uci pccst
1 www.xxx.com?pccst=TEST%20sTESTg null TEST%20sTESTg
2 www.xxx.com?pccst=TEST2%20s&uci=1 1 TEST2%20s
3 www.xxx.com?uci=1&pccst=TEST42rt24&rci=2 1 TEST42rt24
4 www.xxx.com?uci=61 61 null
5 www.xxx.com?rci=62&uci=5536 5536 null
6 www.xxx.com?uci=6666&rci=fefw 6666 null
7 www.xxx.com?uci=6666&utm_source=XXX 6666 null
Also, below is an option that parses out all key-value pairs, so that you can then dynamically select the ones you need
#standardSQL
WITH `project.dataset.table` AS (
SELECT "www.xxx.com?uci=6666&rci=fefw" url UNION ALL
SELECT "www.xxx.com?uci=61" UNION ALL
SELECT "www.xxx.com?rci=62&uci=5536" UNION ALL
SELECT "www.xxx.com?uci=6666&utm_source=XXX" UNION ALL
SELECT "www.xxx.com?pccst=TEST%20sTESTg" UNION ALL
SELECT "www.xxx.com?pccst=TEST2%20s&uci=1" UNION ALL
SELECT "www.xxx.com?uci=1pccst=TEST42rt24&rci=2"
)
SELECT url,
ARRAY(
SELECT AS STRUCT
SPLIT(kv, '=')[SAFE_OFFSET(0)] key,
SPLIT(kv, '=')[SAFE_OFFSET(1)] value
FROM UNNEST(SPLIT(SUBSTR(url, LENGTH(NET.HOST(url)) + 2), '&')) kv
) key_value_pair
FROM `project.dataset.table`
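To pick individual parameters out of key_value_pair, one option is a scalar subquery per key. This is only a sketch on two of the sample URLs; the parsed CTE name and the LIMIT 1 (which keeps the subquery scalar if a key repeats) are assumptions added here.

```sql
#standardSQL
WITH `project.dataset.table` AS (
  SELECT 'www.xxx.com?uci=6666&rci=fefw' url UNION ALL
  SELECT 'www.xxx.com?pccst=TEST2%20s&uci=1'
), parsed AS (
  -- same key/value parsing as in the query above
  SELECT url,
    ARRAY(
      SELECT AS STRUCT
        SPLIT(kv, '=')[SAFE_OFFSET(0)] key,
        SPLIT(kv, '=')[SAFE_OFFSET(1)] value
      FROM UNNEST(SPLIT(SUBSTR(url, LENGTH(NET.HOST(url)) + 2), '&')) kv
    ) key_value_pair
  FROM `project.dataset.table`
)
SELECT url,
  -- look up one value per requested key; NULL when the key is missing
  (SELECT value FROM UNNEST(key_value_pair) WHERE key = 'uci' LIMIT 1) AS uci,
  (SELECT value FROM UNNEST(key_value_pair) WHERE key = 'pccst' LIMIT 1) AS pccst
FROM parsed
```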
Here's some sample data:
record1: field1 = test2
record2: field1 = test3
The actual output I want is
record1: field1 = test2 | field2 = test3
I've looked around the net but can't find what I'm looking for. I can use a custom function to get it in this format but I'm trying to see if there's a way to make it work without resorting to that.
thanks a lot
You need to use pivot:
with t(id, d) as (
select 1, 'field1 = test2' from dual union all
select 2, 'field1 = test3' from dual
)
select *
from t
pivot (max (d) for id in (1, 2))
If you don't have an id field, you can generate one, but the result will then be of XML type:
with t(d) as (
select 'field1 = test2' from dual union all
select 'field1 = test3' from dual
), t1(id, d) as (
select ROW_NUMBER() OVER(ORDER BY d), d from t
)
select *
from t1
pivot xml (max (d) for id in (select id from t1))
There are several ways to approach this; search for "pivot rows to columns". Here is one set of answers: http://www.dba-oracle.com/t_converting_rows_columns.htm