REGEXP_EXTRACT with String Value in Google BigQuery - google-bigquery

I want to extract each part into its own column.
Column name: results
Value: {"rID":"09257a3e-f251-4a2e-a63a-ba92c0f86c72","error":{xxx},"num":809}
My code:
select REGEXP_EXTRACT(results, r":([0-9]+)") as num
from table
This gets me num, but I have problems extracting the other parts.

Use below

create temp function get_keys(input string) returns array<string> language js as """
  return Object.keys(JSON.parse(input));
""";
create temp function get_values(input string) returns array<string> language js as """
  return Object.values(JSON.parse(input));
""";
select * except (results) from (
  select results, key, value
  from your_table,
  unnest(get_keys(results)) key with offset
  join unnest(get_values(results)) value with offset
  using(offset)
)
pivot (any_value(value) for key in ('rID', 'error', 'num'))
If applied to the sample data in your question, the output is a single row with the columns rID, error, and num.
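If you only need these three fields and the error object keeps the shape shown, a per-field REGEXP_EXTRACT also works (a sketch with illustrative patterns, not part of the original answer; your_table stands in for your table name):
select
  regexp_extract(results, r'"rID":"([^"]+)"') as rID,
  regexp_extract(results, r'"error":(\{[^}]*\})') as error,
  regexp_extract(results, r'"num":([0-9]+)') as num
from your_table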

Related

How to get the value of jsonb data using only the key number id?

I have a jsonb column with the following data:
{"oz": "2835", "cup": "229", "jar": "170"}
I have the key number 0 that represents the first item "oz". How can I pull this value using the 0?
I'm thinking something similar to:
SELECT units->[0] as test
I only have the key ID to reference this data. I do not have the key name "oz".
Sounds like a horrible idea. But you can still create a function to implement this horrible idea:
create function jsonb_disaster(jsonb, int) returns jsonb language sql as $$
  select value from jsonb_each($1) with ordinality where ordinality = 1 + $2
$$;
select jsonb_disaster('{"oz": "2835", "cup": "229", "jar": "170"}',0);
jsonb_disaster
----------------
"2835"
You could also create your own operator to wrap up this disaster:
create operator !> ( function = jsonb_disaster, leftarg=jsonb, rightarg=int);
select '{"cup": "229", "jar": "170", "oz": "2835"}' !> 1;
?column?
----------
"229"

How to use ARRAY contains operator with ANY

I have a table where one column is an array:
CREATE TABLE inherited_tags (
id serial,
tags text[]
);
Sample values:
INSERT INTO inherited_tags (tags) VALUES
(ARRAY['A','B','C']), -- id: 1
(ARRAY['D','E']), -- id: 2
(ARRAY['A','B']), -- id: 3
(ARRAY['C','D']), -- id: 4
(ARRAY['D','F']), -- id: 5
(ARRAY['A']); -- id: 6
I want to find rows whose tags column contains some subset of words from an input array. For example, for the input:
ARRAY[ARRAY['A','C'], ARRAY['F'], ARRAY['E']]::text[][]
I want to find all rows that contain ('A' and 'C') OR ('F') OR ('E'). So for the example above I should get the rows with ids: 1, 2, 5.
I was hoping that I could use syntax like this:
SELECT * FROM inherited_tags WHERE
tags @> ANY(ARRAY[ARRAY['A','C'], ARRAY['F'], ARRAY['E']]::text[][])
but I get error:
ERROR: operator does not exist: text[] @> text
LINE 1: SELECT * FROM inherited_tags where tags @> ANY(ARRAY[ARRAY['...
Postgres 9.6
plpgsql solution is acceptable but SQL is preferred.
DB-FIDDLE: https://www.db-fiddle.com/f/cKCr7Sfab6u8rqaCHhJvPk/0
The problem comes from the fact that the text[] and text[][] data types are internally the same data type. An array has a base type and dimensions, and the ANY operator will always extract the base type to compare, which will always be text and not text[]. It doesn't help that multidimensional arrays require that each subelement has the same length as every other. You can have ARRAY[ARRAY['A','C'],ARRAY['B','N']], but not ARRAY[ARRAY[2,3],ARRAY[1]].
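A minimal illustration of that flattening (not part of the original answer):
SELECT 'A' = ANY(ARRAY[ARRAY['A','C'], ARRAY['F','E']]::text[][]);
-- returns true: 'A' was compared against each scalar element ('A','C','F','E'),
-- not against the sub-arrays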
In short, there is no direct way to make that particular query work. I tried to create a function and an operator for this as well, and that doesn't work, either, for different reasons. See how that went:
CREATE OR REPLACE FUNCTION check_tag_matches(
  IN leftside text[],
  IN rightside text)
RETURNS BOOLEAN AS
$BODY$
DECLARE rightarr text[];
BEGIN
  SELECT CAST(rightside as text[]) INTO rightarr;
  RETURN leftside @> rightarr;
END;
$BODY$
LANGUAGE plpgsql STABLE;
CREATE OPERATOR public.>>(
PROCEDURE = check_tag_matches,
LEFTARG = text[],
RIGHTARG = text,
COMMUTATOR = >>);
Then when testing it:
test=# SELECT * FROM inherited_tags WHERE
tags >> ANY(ARRAY[ARRAY['A','M'], ARRAY['F','E'], ARRAY['E','R']]::text[][]);
ERROR: malformed array literal: "A"
DETAIL: Array value must start with "{" or dimension information.
CONTEXT: SQL statement "SELECT CAST(rightside as text[])"
PL/pgSQL function check_tag_matches(text[],text) line 4 at SQL statement
It seems that when you try using a multidimensional array like ARRAY[ARRAY['A','M'], ARRAY['F','E'], ARRAY['E','R']]::text[][] in ANY(), it iterates not over ARRAY['A','M'], then ARRAY['F','E'], then ARRAY['E','R'], but over 'A','M','F','E','E','R'. The same thing happens with unnest.
test=# SELECT unnest(ARRAY[ARRAY['A','M'], ARRAY['F','E'], ARRAY['E','R']]::text[][]);
unnest
--------
A
M
F
E
E
R
(6 rows)
Your remaining options are to define a function that reads array_length(rightside,1) and array_length(rightside,2) and uses nested loops to check it all, to send multiple queries to get the inherited tags for each tag, or to restructure your data somehow. And you can't even access the ARRAY['A','M'] element using rightside[1] to iterate over it; you're forced to go to the deepest level. A sketch of the nested-loop option is below.
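A rough sketch of that nested-loop function, assuming a text[] argument on the right side and at least one non-NULL tag per group (illustrative only, not part of the original answer):
create function tags_match_any(leftside text[], rightside text[]) returns boolean
language plpgsql stable as $$
declare
  i int;
  sub text[];
begin
  -- walk the outer dimension of the 2-D array
  for i in 1 .. coalesce(array_length(rightside, 1), 0) loop
    -- rebuild the i-th row as a 1-D array, dropping the NULLs used as
    -- padding to satisfy the equal-length rule for multidimensional arrays
    sub := array(select rightside[i][j]
                 from generate_subscripts(rightside, 2) j
                 where rightside[i][j] is not null);
    if leftside @> sub then
      return true;
    end if;
  end loop;
  return false;
end;
$$;
select * from inherited_tags
where tags_match_any(tags, array[array['A','C'], array['F',null], array['E',null]]::text[][]);
-- returns the rows with ids 1, 2 and 5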
I don't think you can do that with a single condition because of the "contains A and C" requirement.
SELECT *
FROM inherited_tags
WHERE tags @> ARRAY['A','C']
OR tags && array['F', 'E'];
tags @> ARRAY['A','C'] selects the rows whose tags contains all elements of ARRAY['A','C'], and tags && array['F', 'E'] selects the rows whose tags overlap (share at least one element) with array['F', 'E'].
Updated DB Fiddle: https://www.db-fiddle.com/f/rXsjqEN3ry67uxJtEs3GM9/0
You can try:
SELECT * FROM table WHERE
tags @> ARRAY['A','C']::varchar[]
OR
tags @> ARRAY['E']::varchar[]
OR
tags @> ARRAY['F']::varchar[]

Extract data from Redshift

elements
[{"name":"email",
"value":"abc#gmail.com",
"nodeName":"INPUT",
"type":"text"},
{"name":"password",
"value":"*****",
"nodeName":"INPUT",
"type":"password"},
{"name":"checkbox",
"value":null,
"nodeName":"INPUT",
"type":"checkbox"}]
I have data like this in Redshift. How do I extract the value abc@gmail.com from it? This query is for Redshift. Please help me with the SQL. elements is a column name and its value is a JSON array (it starts with [).
Query I tried:
select
id,
json_extract_path_text(ELEMENTS, 'name') as name1
from table
error:[XX000][500310] Amazon Invalid operation: JSON parsing error Details: ----------------------------------------------- error: JSON parsing error code: 8 ...
You can create a UDF in Python; for your case I've created one, please test and edit as suits:
create or replace function f_py_json (jsonVar varchar(512),
jsonElem varchar(10), occ integer)
returns varchar(512)
stable
as $$
import json
asJson = json.loads(jsonVar)
ret = str(asJson['elements'][occ][jsonElem])
return ret
$$ language plpythonu;
select f_py_json (id, 'value', 1) from test;
-- Input is {"elements":[{"name":"email","value":"abc@gmail.com"},{"name":"password","value":"*****"}]}
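Alternatively, assuming elements holds the bare JSON array shown in the question, Redshift's built-in JSON functions can do the same without a UDF (a sketch, not part of the original answer; my_table stands in for your table name):
-- json_extract_array_element_text takes the array element at the given index,
-- json_extract_path_text then pulls the 'value' key out of it
select
  json_extract_path_text(
    json_extract_array_element_text(elements, 0),
    'value'
  ) as email
from my_table;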

BigQuery - json_extract all elements from an array

I'm trying to extract two keys from every JSON in an array of JSONs (using legacy SQL).
Currently I am using the json_extract function:
json_extract(json_column , '$[1].X') AS X,
json_extract(json_column , '$[1].Y') AS Y,
How can I make it run on every JSON in the JSON array column, and not just [1] (for example)?
An example JSON:
[
{"blabla":000,"X":1,"blabla":000,"blabla":000,"blabla":000,"Y":"2"},
{"blabla":000,"X":3,"blabla":000,"blabla":000,"blabla":000,"Y":"4"}
]
Thanks in advance!
Update 2020: JSON_EXTRACT_ARRAY()
Now BigQuery supports JSON_EXTRACT_ARRAY():
https://cloud.google.com/bigquery/docs/reference/standard-sql/json_functions#json_extract_array
For example, to solve this particular question:
SELECT id
, ARRAY(
SELECT JSON_EXTRACT_SCALAR(x, '$.author.email')
FROM UNNEST(JSON_EXTRACT_ARRAY(payload, "$.commits"))x
) emails
FROM `githubarchive.day.20180830`
WHERE type='PushEvent'
AND id='8188163772'
Previous answer
Let's start with a similar problem - this is not a very convenient way to extract all emails from a json array:
SELECT id
, [ JSON_EXTRACT_SCALAR(JSON_EXTRACT(payload, '$.commits'), '$[0].author.email')
, JSON_EXTRACT_SCALAR(JSON_EXTRACT(payload, '$.commits'), '$[1].author.email')
, JSON_EXTRACT_SCALAR(JSON_EXTRACT(payload, '$.commits'), '$[2].author.email')
, JSON_EXTRACT_SCALAR(JSON_EXTRACT(payload, '$.commits'), '$[3].author.email')
] emails
FROM `githubarchive.day.20180830`
WHERE type='PushEvent'
AND id='8188163772'
The best way we have right now to deal with this is to use some JavaScript in an UDF to split a json-array into a SQL array:
CREATE TEMP FUNCTION json2array(json STRING)
RETURNS ARRAY<STRING>
LANGUAGE js AS """
return JSON.parse(json).map(x=>JSON.stringify(x));
""";
SELECT * EXCEPT(array_commits),
ARRAY(SELECT JSON_EXTRACT_SCALAR(x, '$.author.email') FROM UNNEST(array_commits) x) emails
FROM (
SELECT id
, json2array(JSON_EXTRACT(payload, '$.commits')) array_commits
FROM `githubarchive.day.20180830`
WHERE type='PushEvent'
AND id='8188163772'
)
May 1st, 2020 Update
A new function, JSON_EXTRACT_ARRAY, has just been added to the list of JSON
functions. This function allows you to extract the contents of a JSON document as
a string array.
So in the solution below you can replace the CUSTOM_JSON_EXTRACT UDF with the built-in JSON_EXTRACT_ARRAY function, as in this example:
#standardSQL
SELECT
JSON_EXTRACT_SCALAR(json , '$.X') AS X,
JSON_EXTRACT_SCALAR(json , '$.Y') AS Y
FROM t, UNNEST(JSON_EXTRACT_ARRAY(json_column , '$')) json
==============
The example below is for BigQuery Standard SQL and lets you stay close to the standard way of working with JSONPath, with no extra manipulation needed, so you simply use the CUSTOM_JSON_EXTRACT(json, json_path) function:
#standardSQL
CREATE TEMPORARY FUNCTION CUSTOM_JSON_EXTRACT(json STRING, json_path STRING)
RETURNS ARRAY<STRING>
LANGUAGE js AS """
return jsonPath(JSON.parse(json), json_path);
"""
OPTIONS (
library="gs://your_bucket/jsonpath-0.8.0.js"
);
WITH t AS (
SELECT '''
[
{"blabla1":1,"X":1,"blabla2":3,"blabla3":5,"blabla4":7,"Y":"2"},
{"blabla1":2,"X":3,"blabla2":4,"blabla3":6,"blabla4":8,"Y":"4"}
]
''' AS json_column
)
SELECT
CUSTOM_JSON_EXTRACT(json_column , '$[*].X') AS X,
CUSTOM_JSON_EXTRACT(json_column , '$[*].Y') AS Y
FROM t
result will be
Row X Y
1   1 2
    3 4
Note: to overcome BigQuery's current JSONPath "limitation", the above solution uses a custom function along with an external library - jsonpath-0.8.0.js - which can be downloaded from https://code.google.com/archive/p/jsonpath/downloads and uploaded to Google Cloud Storage as gs://your_bucket/jsonpath-0.8.0.js
Just re-read Felipe's answer - applied to his example, the above solution looks like below (just as an FYI):
SELECT
id,
CUSTOM_JSON_EXTRACT(payload, '$.commits[*].author.email') emails
FROM `githubarchive.day.20180830`
WHERE type='PushEvent'
AND id='8188163772'

How can I pass a row from my table to a UDF without specifying the complete type?

Let's say that I want to do some processing with a JavaScript UDF on a table that has a nested structure (such as the sample GitHub commits). I may want to change the fields that I look at in the UDF as I iterate on its implementation, so I decide just to pass entire rows from the table to it. My UDF ends up looking something like this:
#standardSQL
CREATE TEMP FUNCTION GetCommitStats(
input STRUCT<commit STRING, tree STRING, parent ARRAY<STRING>,
author STRUCT<name STRING, email STRING, ...>>)
RETURNS STRUCT<
parent ARRAY<STRING>,
author_name STRING,
diff_count INT64>
LANGUAGE js AS """
[UDF content here]
""";
Then I call the function with a query such as:
SELECT GetCommitStats(t).*
FROM `bigquery-public-data.github_repos.sample_commits` AS t;
The most cumbersome part of the UDF declaration is the input struct, since I have to include all of the nested fields and their types. Is there a better way to do this?
You can use TO_JSON_STRING to convert arbitrary structs and arrays to JSON, then parse it inside your UDF into an object for further processing. For example,
#standardSQL
CREATE TEMP FUNCTION GetCommitStats(json_str STRING)
RETURNS STRUCT<
parent ARRAY<STRING>,
author_name STRING,
diff_count INT64>
LANGUAGE js AS """
var row = JSON.parse(json_str);
var result = new Object();
result['parent'] = row.parent;
result['author_name'] = row.author.name;
result['diff_count'] = row.difference.length;
return result;
""";
SELECT GetCommitStats(TO_JSON_STRING(t)).*
FROM `bigquery-public-data.github_repos.sample_commits` AS t;
If you want to cut down on the number of columns that are scanned, you can pass a struct of the relevant columns to TO_JSON_STRING instead:
#standardSQL
CREATE TEMP FUNCTION GetCommitStats(json_str STRING)
RETURNS STRUCT<
parent ARRAY<STRING>,
author_name STRING,
diff_count INT64>
LANGUAGE js AS """
var row = JSON.parse(json_str);
var result = new Object();
result['parent'] = row.parent;
result['author_name'] = row.author.name;
result['diff_count'] = row.difference.length;
return result;
""";
SELECT
GetCommitStats(TO_JSON_STRING(
STRUCT(parent, author, difference)
)).*
FROM `bigquery-public-data.github_repos.sample_commits`;