Extract value after specific sequence of string in BigQuery - google-bigquery

I have the following example of strings in BigQuery:
string
action_1 plan_id=266 revenue=123.93
action_2 plan_id=057 revenue=33.54 action_1 plan_id=432 revenue=127.12
action_4 plan_id=854 revenue=123.46 action_1 plan_id=138 revenue=98.43
action_3 plan_id=266 revenue=123.93
What I want to extract is the value of the revenue after action_1. So, the new field should be this:
string action_1_revenue
action_1 plan_id=266 revenue=123.93 123.93
action_2 plan_id=057 revenue=33.54 action_1 plan_id=432 revenue=127.12 127.12
action_4 plan_id=854 revenue=123.46 action_1 plan_id=138 revenue=98.43 98.43
action_3 plan_id=266 revenue=123.93 NULL
Any ideas?

Consider below
select *, regexp_extract(string, r'action_1 .* revenue=([\d.]+)') action_1_revenue
from your_table
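If applied to the sample data in your question, the output matches the expected action_1_revenue column shown above (NULL for the last row, which has no action_1).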
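For anyone who wants to check the pattern outside BigQuery, here is a minimal Python sketch. It assumes Python's `re` module behaves like BigQuery's RE2 engine for this particular pattern, and the helper names (`action_1_revenue`, `action_1_revenue_strict`) are made up for illustration. It also shows one caveat: because `.*` is greedy, a string in which another action follows action_1 would return the last revenue, so a stricter pattern may be safer if that can occur in your data.

```python
import re

# Same pattern as in the answer above.
GREEDY = re.compile(r'action_1 .* revenue=([\d.]+)')
# Hypothetical stricter variant: only accepts the plan_id/revenue pair
# that immediately follows action_1.
STRICT = re.compile(r'action_1 plan_id=\d+ revenue=([\d.]+)')


def action_1_revenue(s):
    """Return the captured revenue as a string, or None if there is no match."""
    m = GREEDY.search(s)
    return m.group(1) if m else None


def action_1_revenue_strict(s):
    """Same as above, but using the stricter pattern."""
    m = STRICT.search(s)
    return m.group(1) if m else None


samples = [
    "action_1 plan_id=266 revenue=123.93",
    "action_2 plan_id=057 revenue=33.54 action_1 plan_id=432 revenue=127.12",
    "action_4 plan_id=854 revenue=123.46 action_1 plan_id=138 revenue=98.43",
    "action_3 plan_id=266 revenue=123.93",
]
results = [action_1_revenue(s) for s in samples]

# Assumed edge case (not in the question): action_1 comes first.
edge = "action_1 plan_id=1 revenue=1.00 action_2 plan_id=2 revenue=2.00"
```

On the four sample strings both patterns agree; they only differ on the assumed edge case, where the greedy pattern picks up the revenue of the later action.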

BigQuery >> Extract Value for Dictionary Key in JSON Object

I have a string object in a table column with structure:
[{"__food": "true"},{"item": "1"},{"toppings": "true"},{"__discount_amount": "4.95"},{"__original_price": "24.95"}]
How can I extract the value true from the toppings key from this?
I tried turning it into JSON first but json_extract(parse_json(string_object_column), '$.toppings') just returns null
The closest I got was keeping it as a string and doing
json_extract(string_object_column, '$[2]')
Which gets me:
{"toppings":"true"}
Is this doable without unnesting?
You may try and consider below approach using REGEXP_EXTRACT:
SELECT REGEXP_EXTRACT('[{"__food": "true"},{"item": "1"},{"toppings": "true"},{"__discount_amount": "4.95"},{"__original_price": "24.95"}]', r'"toppings": "(\D+)"}') as EXTRACT_TOPPINGS
OUTPUT: true
You may just update the REGEX to make it more strict based on your use case.
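To see why `'$.toppings'` returns null, it may help to look at the shape of the data: the top level is an array of single-key objects, not an object with a `toppings` key. The following Python sketch (not BigQuery code; `find_key` is a hypothetical helper) walks the array and returns the value of the first object that contains the key.

```python
import json

raw = ('[{"__food": "true"},{"item": "1"},{"toppings": "true"},'
       '{"__discount_amount": "4.95"},{"__original_price": "24.95"}]')


def find_key(raw_json, key):
    """Return the value for `key` from the first object in the array that has it."""
    for obj in json.loads(raw_json):
        if key in obj:
            return obj[key]
    return None


toppings = find_key(raw, "toppings")
# The top level is a list, so there is no "toppings" key to look up directly.
top_level_is_list = isinstance(json.loads(raw), list)
```

The regex approach sidesteps this by treating the column as plain text, which is why it works without unnesting.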

dynamic split json string in bigquery [duplicate]

I have loaded the entire JSON file into a STRING column of a BigQuery table. Now I am trying to access the keys using the JSON_EXTRACT_SCALAR function, but I am getting null results for the child keys that contain a period (".") in their names.
Here's the snippet of the data:
{"server_received_time":"2019-01-17 15:00:00.482000","app":161,"device_carrier":null,"$schema":12,"city":"Caro","user_id":null,"uuid":"9018","event_time":"2019-01-17 15:00:00.045000","platform":"Web","os_version":"49","vendor_id":711,"processed_time":"2019-01-17 15:00:00.817195","user_creation_time":"2018-11-01 19:16:34.971000","version_name":null,"ip_address":null,"paying":null,"dma":null,"group_properties":{},"user_properties":{"location.radio":"ca","vendor.userTier":"free","vendor.userID":"a989","user.id":"a989","user.tier":"free","location.region":"ca"},"client_upload_time":"2019-01-17 15:00:00.424000","$insert_id":"e8410","event_type":"LOADED","library":"amp\/4.5.2","vendor_attribution_ids":null,"device_type":"Mac","device_manufacturer":null,"start_version":null,"location_lng":null,"server_upload_time":"2019-01-17 15:00:00.493000","event_id":64,"location_lat":null,"os_name":"Chrome","vendor_event_type":null,"device_brand":null,"groups":{},"event_properties":{"content.authenticated":false,"content.subsection1":"regions","custom.DNT":true,"content.subsection2":"ca","referrer.url":"","content.url":"","content.type":"index","content.title":"","custom.cookiesenabled":true,"app.pillar":"feed","content.area":"news","app.name":"oc"},"data":{},"device_id":"","language":"English","device_model":"Mac","country":"","region":"","is_attribution_event":false,"adid":null,"session_id":15,"device_family":"Mac","sample_rate":null,"idfa":null,"client_event_time":"2019-01-17 14:59:59.987000"}
{"server_received_time":"2019-01-17 15:00:00.913000","app":161,"device_carrier":null,"$schema":12,"city":"Fo","user_id":null,"uuid":"9052","event_time":"2019-01-17 15:00:00.566000","platform":"Web","os_version":"71","vendor_id":797,"processed_time":"2019-01-17 15:00:01.301936","user_creation_time":"2019-01-17 15:00:00.566000","version_name":null,"ip_address":null,"paying":null,"dma":"CO","group_properties":{},"user_properties":{"user.tier":"free"},"client_upload_time":"2019-01-17 15:00:00.157000","$insert_id":"69ae","event_type":"START WEB SESSION","library":"amp\/4.5.2","vendor_attribution_ids":null,"device_type":"Android","device_manufacturer":null,"start_version":null,"location_lng":null,"server_upload_time":"2019-01-17 15:00:00.925000","event_id":1,"location_lat":null,"os_name":"Chrome Mobile","vendor_event_type":null,"device_brand":null,"groups":{},"event_properties":{"content.subsection3":"home","content.subsection2":"archives","content.title":"","content.keywords.subject":["Lifestyle\/Recreation and leisure\/Outdoor recreation\/Boating","Lifestyle\/Relationships\/Couples","General news\/Weather","Oddities"],"content.publishedtime":154687,"app.name":"oc","referrer.url":"","content.subsection1":"archives","content.url":"","content.authenticated":false,"content.keywords.location":["Ot"],"content.originaltitle":"","content.type":"story","content.authors":["Archives"],"app.pillar":"feed","content.area":"news","content.id":"1.49","content.updatedtime":1546878600538,"content.keywords.tag":["24 1","boat house","Ot","Rockcliffe","River","m"],"content.keywords.person":["Ber","Shi","Jea","Jean\u00e9tien"]},"data":{"first_event":true},"device_id":"","language":"English","device_model":"Android","country":"","region":"","is_attribution_event":false,"adid":null,"session_id":15477,"device_family":"Android","sample_rate":null,"idfa":null,"client_event_time":"2019-01-17 14:59:59.810000"}
{"server_received_time":"2019-01-17 15:00:00.913000","app":16,"device_carrier":null,"$schema":12,"city":"","user_id":null,"uuid":"905","event_time":"2019-01-17 15:00:00.574000","platform":"Web","os_version":"71","vendor_id":7973,"processed_time":"2019-01-17 15:00:01.301957","user_creation_time":"2019-01-17 15:00:00.566000","version_name":null,"ip_address":null,"paying":null,"dma":"DCO","group_properties":{},"user_properties":{"user.tier":"free"},"client_upload_time":"2019-01-17 15:00:00.157000","$insert_id":"d045","event_type":"LOADED","library":"am-js\/4.5.2","vendor_attribution_ids":null,"device_type":"Android","device_manufacturer":null,"start_version":null,"location_lng":null,"server_upload_time":"2019-01-17 15:00:00.925000","event_id":2,"location_lat":null,"os_name":"Chrome Mobile","vendor_event_type":null,"device_brand":null,"groups":{},"event_properties":{"content.subsection3":"home","content.subsection2":"archives","content.subsection1":"archives","content.keywords.subject":["Lifestyle\/Recreation and leisure\/Outdoor recreation\/Boating","Lifestyle\/Relationships\/Couples","General news\/Weather","Oddities"],"content.type":"story","content.keywords.location":["Ot"],"app.pillar":"feed","app.name":"oc","content.authenticated":false,"custom.DNT":false,"content.id":"1.4","content.keywords.person":["Ber","Shi","Jea","Je\u00e9tien"],"content.title":"","content.url":"","content.originaltitle":"","custom.cookiesenabled":true,"content.authors":["Archives"],"content.publishedtime":1546878600538,"referrer.url":"","content.area":"news","content.updatedtime":1546878600538,"content.keywords.tag":["24 1","boat house","O","Rockcliffe","River","pr"]},"data":{},"device_id":"","language":"English","device_model":"Android","country":"","region":"","is_attribution_event":false,"adid":null,"session_id":1547737199081,"device_family":"Android","sample_rate":null,"idfa":null,"client_event_time":"2019-01-17 14:59:59.818000"}
Here's the sample query against the table:
SELECT
CAST(JSON_EXTRACT_SCALAR(data,'$.uuid')AS INT64) AS uuid_id,
CAST(JSON_EXTRACT_SCALAR(data,'$.event_time') AS TIMESTAMP) AS event_time,
JSON_EXTRACT_SCALAR(data,'$[event_properties].app.name') AS app_name,
JSON_EXTRACT_SCALAR(data,'$[user_properties].user.tier') AS user_tier
FROM
mytable
The above query gives null results for the app_name and user_tier columns even though data exists for them.
Following the BigQuery JSON function documentation - JSON Functions in Standard SQL
In cases where a JSON key uses invalid JSONPath characters, you can escape those characters using single quotes and brackets, [' '].
and running the query as:
SELECT
CAST(JSON_EXTRACT_SCALAR(data,"$.uuid")AS INT64) AS uuid_id,
CAST(JSON_EXTRACT_SCALAR(data,"$.event_time") AS TIMESTAMP) AS event_time,
JSON_EXTRACT_SCALAR(data,"$.event_properties.['app.name']") AS app_name,
JSON_EXTRACT_SCALAR(data,"$.user_properties.['user.tier']") AS user_tier
FROM
mytable
results in the following error:
Invalid token in JSONPath at: .['app.name']
Please advise. What am I missing here?
You have an extra . before the [. Use
"$.event_properties['app.name']"
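The underlying point is that `app.name` is a single key that happens to contain a period, not a nested `app` object with a `name` field. A small Python sketch (using a trimmed-down record, not the full rows from the question) illustrates the same structure:

```python
import json

# Trimmed-down record; only the keys relevant here are kept.
raw = ('{"uuid": "9018", '
       '"event_properties": {"app.name": "oc"}, '
       '"user_properties": {"user.tier": "free"}}')
record = json.loads(raw)

# The whole dotted string is the key, which is what ['app.name'] expresses in JSONPath.
app_name = record["event_properties"]["app.name"]
user_tier = record["user_properties"]["user.tier"]

# There is no nested "app" object, so a path like event_properties.app.name finds nothing.
has_nested_app = "app" in record["event_properties"]
```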

How do you write a presto query to split a string into its own column

Trying to split a string into multiple columns in Qubole using a Presto query.
{"field0":[{"startdate":"2022-07-13","lastnightdate":"2022-07-16","adultguests":5,"childguests":0,"pets":null}]}
I would like startdate, lastnightdate, adultguests, childguests and pets each in its own column.
I tried to unnest string but that didn't work.
The data looks a lot like JSON, so you can process it using JSON functions first (parse, extract, cast to array(map(varchar, json)) or array(map(varchar, varchar))) and then flatten with unnest:
-- sample data
WITH dataset(json_payload) AS (
VALUES
('{"field0":[{"startdate":"2022-07-13","lastnightdate":"2022-07-16","adultguests":5,"childguests":0,"pets":null}]}')
)
-- query
select m['startdate'] startdate,
m['lastnightdate'] lastnightdate,
m['adultguests'] adultguests,
m['childguests'] childguests,
m['pets'] pets
from dataset,
unnest(cast(json_extract(json_parse(json_payload), '$.field0') as array(map(varchar, json)))) t(m)
Output:
startdate lastnightdate adultguests childguests pets
2022-07-13 2022-07-16 5 0 null
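The parse, extract and unnest steps can also be sketched in plain Python to make the data flow explicit. This is only an illustration of the same idea outside Presto; `flatten` and `COLUMNS` are hypothetical names.

```python
import json

json_payload = ('{"field0":[{"startdate":"2022-07-13","lastnightdate":"2022-07-16",'
                '"adultguests":5,"childguests":0,"pets":null}]}')

COLUMNS = ["startdate", "lastnightdate", "adultguests", "childguests", "pets"]


def flatten(payload):
    """Parse the payload, take the field0 array, and emit one tuple per element."""
    records = json.loads(payload)["field0"]            # json_parse + json_extract
    return [tuple(rec.get(col) for col in COLUMNS)     # one output row per array item
            for rec in records]


rows = flatten(json_payload)
```

Each element of the `field0` array becomes one row, which is the role `unnest` plays in the query.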

pyspark hive sql convert array(map(varchar, varchar)) to string by rows

I would like to transform a column of
array(map(varchar, varchar))
to strings, one per row, in a table on Presto DB, using PySpark Hive SQL programmatically from a Jupyter notebook (Python 3).
example
user_id sport_ids
'aca' [ {'sport_id': '5818'}, {'sport_id': '6712'}, {'sport_id': '1065'} ]
expected results
user_id sport_ids
'aca' '5818'
'aca' '6712'
'aca' '1065'
I have tried
sql_q= """
select distinct, user_id, transform(sport_ids, x -> element_at(x, 'sport_id')
from tab """
spark.sql(sql_q)
but got error:
'->' cannot be resolved
I have also tried
sql_q= """
select distinct, user_id, sport_ids
from tab"""
spark.sql(sql_q)
but got error:
org.apache.spark.sql.AnalysisException: Cannot have map type columns in DataFrame which calls set operations(intersect, except, etc.), but the type of column request_features[0] is map<string,string>;;
Did I miss something ?
I have tried these, but they were not helpful:
hive convert array<map<string, string>> to string
Extract map(varchar, array(varchar)) - Hive SQL
thanks
Let's use higher-order functions to pull out the map values and explode them into individual rows:
from pyspark.sql.functions import explode, expr
df.withColumn('sport_ids', explode(expr("transform(sport_ids, x->map_values(x)[0])"))).show()
+-------+---------+
|user_id|sport_ids|
+-------+---------+
| aca| 5818|
| aca| 6712|
| aca| 1065|
+-------+---------+
You can process the JSON data (json_parse, cast to array of json, and json_extract_scalar; see the Presto documentation for more JSON functions) and flatten it (unnest) on the Presto side:
-- sample data
WITH dataset(user_id, sport_ids) AS (
VALUES
('aca', '[ {"sport_id": "5818"}, {"sport_id": "6712"}, {"sport_id": "1065"} ]')
)
-- query
select user_id,
json_extract_scalar(record, '$.sport_id') sport_id
from dataset,
unnest(cast(json_parse(sport_ids) as array(json))) as t(record)
Output:
user_id sport_id
aca 5818
aca 6712
aca 1065
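Both answers do the same thing conceptually: take each map in the array, pull out its `sport_id` value, and emit one row per value. A plain-Python sketch of that logic, with no Spark or Presto dependency and a hypothetical `explode_sport_ids` helper, might look like this:

```python
def explode_sport_ids(rows):
    """Turn (user_id, [ {'sport_id': ...}, ... ]) into one (user_id, sport_id) pair per map."""
    out = []
    for user_id, sport_ids in rows:
        for mapping in sport_ids:
            out.append((user_id, mapping["sport_id"]))
    return out


table = [
    ("aca", [{"sport_id": "5818"}, {"sport_id": "6712"}, {"sport_id": "1065"}]),
]
exploded = explode_sport_ids(table)
```

Once the values are plain strings rather than maps, operations such as `distinct` no longer run into the map-type restriction reported in the question.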

Pandas read_sql_query with parameters for a string with no quotes

I want to insert a string of identifiers into a piece of SQL code using
df = pd.read_sql_query(query, self.connection,params=sql_parameter)
my parameter dictionary looks like this
sql_parameter = {'itemids':itemids_str}
where itemids_str is a string like
282940499, 276686324, 2665846, 46875436, 530272885, 2590230, 557021480, 282937154, 46259344
The SQL code looks like
SELECT
xxx,
yyy,
zzz
FROM tablexyz
where some_column_name in ( %(itemids)s )
My current code inserts the parameter with quotes around it:
where some_column_name in ( '282940499, 276686324, 2665846, 46875436, 530272885, 2590230, 557021480, 282937154, 46259344' )
How can I prevent the string from being inserted with the ' characters? They are not part of my string; I assume they come from the parameter being treated as a string when using %s.
I don't think there is a provision in params to send a list of numeric values for one condition. I always add such condition directly to the query
item_ids = [str(item_id) for item_id in item_ids]
where_str = ','.join(item_ids)
query = f"""SELECT
xxx,
yyy,
zzz
FROM tablexyz
where some_column_name in ({where_str})"""
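An alternative that keeps the values out of the SQL text is to generate one placeholder per identifier and pass the values as a list. The sketch below is an assumption-laden illustration: it uses the standard-library `sqlite3` driver (whose placeholder is `?`) so that it runs on its own, whereas the driver in the question uses the `%(name)s` / `%s` style, so the placeholder token would need to be adapted. The helper `build_in_query` and the `item_id` / `name` columns are hypothetical.

```python
import sqlite3


def build_in_query(item_ids, placeholder="?"):
    """Return (query, params) with one placeholder per identifier."""
    # Coerce to int so that non-numeric input fails here instead of reaching the database.
    ids = [int(i) for i in item_ids]
    placeholders = ", ".join(placeholder for _ in ids)
    query = f"SELECT item_id, name FROM tablexyz WHERE item_id IN ({placeholders})"
    return query, ids


# Small in-memory table so the example is self-contained.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tablexyz (item_id INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO tablexyz VALUES (?, ?)",
    [(282940499, "a"), (2665846, "b"), (1, "c")],
)

itemids_str = "282940499, 276686324, 2665846"
query, params = build_in_query(itemids_str.split(","))
rows = conn.execute(query, params).fetchall()
```

The same `query` and `params` pair can be handed to `pd.read_sql_query(query, connection, params=params)`, provided the placeholder style matches the driver behind the connection.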