I've imported a csv into a table in BigQuery, and some columns hold a string which really should be a list of floats.
So now I'm trying to convert these strings into arrays. Thanks to SO I've managed to convert a string to a list of floats, but now I get one row per element of the list instead of one row per initial row, i.e. the array is "unnested". That's a problem, as it would generate a huge amount of duplicated data. Would someone know how to do the conversion from STRING to ARRAY<FLOAT64>, please?
partial code:
with tbl as (
select "id1" as id,
"10000\t10001\t10002\t10003\t10004" col1_str,
"10000\t10001.1\t10002\t10003\t10004" col2_str
)
select id, cast(elem1 as float64) col1_floatarray, cast(elem2 as float64) col2_floatarray
from tbl
-- note: the two comma joins below cross join the unnested elements,
-- producing 5 x 5 = 25 rows here instead of the single original row
, unnest(split(col1_str, "\t")) elem1
, unnest(split(col2_str, "\t")) elem2
expected:
1 row, with 3 columns of types STRING id, ARRAY<FLOAT64> col1_floatarray, ARRAY<FLOAT64> col2_floatarray
Thank you!
Use below
select id,
array(select cast(elem as float64) from unnest(split(col1_str, "\t")) elem) col1_floatarray,
array(select cast(elem as float64) from unnest(split(col2_str, "\t")) elem) col2_floatarray
from tbl
if applied to sample data in your question - output is

id   col1_floatarray                                 col2_floatarray
id1  [10000.0, 10001.0, 10002.0, 10003.0, 10004.0]  [10000.0, 10001.1, 10002.0, 10003.0, 10004.0]
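If some elements might not parse cleanly as numbers, a SAFE_CAST variant of the same query (my addition, not part of the original answer) returns the arrays without erroring - note the filter, since result arrays cannot contain NULL elements:

select id,
       array(select safe_cast(elem as float64)
             from unnest(split(col1_str, "\t")) elem
             where safe_cast(elem as float64) is not null) col1_floatarray,
       array(select safe_cast(elem as float64)
             from unnest(split(col2_str, "\t")) elem
             where safe_cast(elem as float64) is not null) col2_floatarray
from tbl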
Related
I have a column with inconsistent data formats: some values are array-like lists [], and some are JSON-like objects {}
id  prices
1   [100,100,110]
2   {200,210,190}
create table test(id integer, prices varchar(255));
insert into test
values
(1,'[100,100,110]'),
(2,'{200,210,190}');
When I try to unnest, my query works fine for the first row, but it fails on the second. Is there a way I can convert the {} format to the [] list format?
This is my query:
select id,prices,price from test
cross join UNNEST(cast(json_parse(prices) as array<varchar>)) as t (price)
You can use replace and then parse the data into an array:
select json_parse(replace(replace('{200,210,190}', '}', ']'), '{', '['))
Output:
_col0
[200,210,190]
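Putting that fix together with the original query - a sketch, assuming the same Presto/Trino dialect as the question (json_parse, UNNEST ... AS t(price)):

select id, prices, price
from test
cross join unnest(
  cast(json_parse(replace(replace(prices, '{', '['), '}', ']')) as array<varchar>)
) as t (price)

For rows already in [] form the replaces are no-ops, so both formats unnest cleanly.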
I have a string of numbers like this:
670000000000100000000000000000000000000000000000000000000000000
I want to add up these digits, which in the above example would result in 14: 6+7+0+...+1+0+...+0+0+0 = 14
How would I do this in BigQuery?
Consider below approach
with example as (
select '670000000000100000000000000000000000000000000000000000000000000' as s
)
select s, (select sum(cast(num as int64)) from unnest(split(s,'')) num) result
from example
with output

s                                                                 result
670000000000100000000000000000000000000000000000000000000000000  14
Yet another [fun] option
create temp function sum_digits(expression string)
returns int64
language js as """
return eval(expression);
""";
with example as (
select '670000000000100000000000000000000000000000000000000000000000000' as s
)
select s, sum_digits(regexp_replace(replace(s, '0', ''), r'(\d)', r'+\1')) result
from example
with output

s                                                                 result
670000000000100000000000000000000000000000000000000000000000000  14
What it does is -
first it transforms the initial long string into a shorter one - 671,
then it transforms that into an expression - +6+7+1,
and finally passes it to the JavaScript eval function (unfortunately BigQuery does not have [hopefully yet] an eval function).
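A quick way to see the intermediate transformations (my own illustration, not part of the answer):

with example as (
  select '670000000000100000000000000000000000000000000000000000000000000' as s
)
select
  replace(s, '0', '') as step1,                                   -- '671'
  regexp_replace(replace(s, '0', ''), r'(\d)', r'+\1') as step2   -- '+6+7+1'
from example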
I'm trying to group BigQuery columns using an array like so:
with test as (
select 1 as A, 2 as B
union all
select 3, null
)
select *,
[A,B] as grouped_columns
from test
However, this won't work, since there is a null value in column B row 2.
In fact this won't work either:
select [1, null] as test_array
When reading the documentation on BigQuery though, it says Nulls should be allowed.
In BigQuery, an array is an ordered list consisting of zero or more
values of the same data type. You can construct arrays of simple data
types, such as INT64, and complex data types, such as STRUCTs. The
current exception to this is the ARRAY data type: arrays of arrays are
not supported. Arrays can include NULL values.
There doesn't seem to be any attribute or SAFE prefix that can be used with ARRAY() to handle nulls.
So what is the best approach for this?
Per documentation - for Array type
Currently, BigQuery has two following limitations with respect to NULLs and ARRAYs:
BigQuery raises an error if query result has ARRAYs which contain NULL elements, although such ARRAYs can be used inside the query.
BigQuery translates NULL ARRAY into empty ARRAY in the query result, although inside the query NULL and empty ARRAYs are two distinct values.
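A minimal illustration of the first limitation (my own example) - an ARRAY containing NULL is fine as long as it is consumed inside the query, and only errors when it reaches the result:

select array_length([1, null, 3])  -- works, returns 3: the array stays inside the query
-- select [1, null, 3]             -- fails: the result array would contain a NULL element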
So, as for your example - you can use the below "trick"
with test as (
select 1 as A, 2 as B union all
select 3, null
)
select *,
array(select cast(el as int64) el
from unnest(split(translate(format('%t', t), '()', ''), ', ')) el
where el != 'NULL'
) as grouped_columns
from test t
above gives below output

A  B     grouped_columns
1  2     [1, 2]
3  null  [3]
Note: above approach does not require explicit referencing to all involved columns!
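The trick works because format('%t', t) renders the whole row as text. A minimal peek at what it produces (my own illustration):

with test as (
  select 1 as A, 2 as B union all
  select 3, null
)
select format('%t', t) as row_as_text
from test t
-- row_as_text:
-- (1, 2)
-- (3, NULL)

From there the answer strips the parentheses, splits on ', ', and filters out the literal 'NULL' - which assumes the column values themselves contain no commas.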
My current solution---and I'm not a fan of it---is to use a combo of IFNULL(), UNNEST() and ARRAY() like so:
select
  *,
  array(
    select *
    from unnest(
      [
        -- cast to string so the '' sentinel type-checks against the INT64 columns
        ifnull(cast(A as string), ''),
        ifnull(cast(B as string), '')
      ]
    ) as grouping
    where grouping <> ''
  ) as grouped_columns
from test
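A simpler variant (my own sketch, leaning on the limitation quoted above that NULLs are allowed inside a query) skips the sentinel value entirely and just filters the NULLs out:

select *,
       array(select x from unnest([A, B]) x where x is not null) as grouped_columns
from test

This keeps grouped_columns typed as ARRAY<INT64> and avoids reserving '' or 0 as a magic value.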
An alternative way: you can replace the NULL value with some non-NULL figure using the function IFNULL(null, 0), as given below:
with test as (
select 1 as A, 2 as B
union all
select 3, IFNULL(null, 0)
)
select *,
[A,B] as grouped_columns
from test
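As written, the IFNULL only patches the sample data. To handle NULLs coming from a real table you would apply it in the select list instead (my rephrasing of the same idea, with the usual caveat that 0 becomes indistinguishable from a genuine 0):

select *,
       [ifnull(A, 0), ifnull(B, 0)] as grouped_columns
from test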
I have the value category:ops,client:acompany,type:sometype which, as you can see, is effectively a dictionary. I would like to extract the value for the dictionary key client; in other words, I want to extract acompany.
Here's how I've done it:
select CASE WHEN INSTR(client_step1, ",") > 0 THEN SUBSTR(client_step1, 0, INSTR(client_step1, ",") - 1)
ELSE client_step1
END AS client
from (
select CASE WHEN INSTR(dict, "client") > 0 THEN SUBSTR(dict, INSTR(dict, "client") + 7)
ELSE CAST(NULL as STRING)
END as client_step1
from (
select "category:ops,client:acompany,type:sometype" as dict
)
)
but that seems rather verbose (and frankly, slicing up strings with a combination of INSTR(), SUBSTR() and derived tables feels a bit meh). I'm wondering if there's a better way to do it that I don't know about (I'm fairly new to bq).
thanks in advance
It sounds like you want the REGEXP_EXTRACT function. Here is an example:
SELECT REGEXP_EXTRACT(dict, r'client:([^,:]+)') AS client_step1
FROM (
SELECT "category:ops,client:acompany,type:sometype" AS dict
)
This returns the string acompany as its result. The regexp looks for client: inside the string, and matches everything after it up until the next , or : or the end of the string.
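One useful property (my own note): if the key is absent, REGEXP_EXTRACT returns NULL instead of raising an error:

SELECT REGEXP_EXTRACT("category:ops,type:sometype", r'client:([^,:]+)') AS client
-- returns NULL: there is no client: key in the string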
Another option to parse a dictionary like yours is as below (for BigQuery Standard SQL)
#standardSQL
WITH `project.dataset.table` AS (
SELECT 1 id, "category:ops,client:acompany,type:sometype" AS dict
)
SELECT id,
ARRAY(
SELECT AS STRUCT
SPLIT(x, ':')[OFFSET(0)] key,
SPLIT(x, ':')[OFFSET(1)] value
FROM UNNEST(SPLIT(dict)) x
) items
FROM `project.dataset.table`
with result as below

Row  id  items.key  items.value
1    1   category   ops
         client     acompany
         type       sometype
As you can see here, this parses out all the dictionary items.
If you only need a specific element's value, you can use the below
#standardSQL
WITH `project.dataset.table` AS (
SELECT 1 id, "category:ops,client:acompany,type:sometype" AS dict
)
SELECT id,
( SELECT
SPLIT(x, ':')[OFFSET(1)]
FROM UNNEST(SPLIT(dict)) x
WHERE SPLIT(x, ':')[OFFSET(0)] = 'client'
LIMIT 1
) client
FROM `project.dataset.table`
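For the sample row in the question, this returns

Row  id  client
1    1   acompany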
=# select row(0, 1) ;
row
-------
(0,1)
(1 row)
How do I get 0 within the same query? I figured out that the below sort of works, but is there any simpler way?
=# select json_agg(row(0, 1))->0->'f1' ;
?column?
----------
0
(1 row)
No luck with array-like syntax [0].
Thanks!
Your row type is anonymous and therefore you cannot access its elements easily. What you can do is create a TYPE and then cast your anonymous row to that type and access the elements defined in the type:
CREATE TYPE my_row AS (
x integer,
y integer
);
SELECT (row(0,1)::my_row).x;
As Craig Ringer commented on your question, you should avoid producing anonymous rows to begin with if you can help it, and type whatever data you use in your data model and queries.
If you just want the first element from any row, convert the row to JSON and select f1...
SELECT row_to_json(row(0,1))->'f1'
Or, if you are always going to have two integers or a strict structure, you can create a temporary table (or type) and a function that selects the first column.
CREATE TABLE tmptable(f1 int, f2 int);
CREATE FUNCTION gettmpf1(tmptable) RETURNS int AS 'SELECT $1.f1' LANGUAGE SQL;
SELECT gettmpf1(ROW(0,1));
Resources:
https://www.postgresql.org/docs/9.2/static/functions-json.html
https://www.postgresql.org/docs/9.2/static/sql-expressions.html
The json solution is very elegant. Just for fun, this is a solution using regexp (much uglier):
WITH r AS (SELECT row('quotes, "commas",
and a line break".',null,null,'"fourth,field"')::text AS r)
--WITH r AS (SELECT row('',null,null,'')::text AS r)
--WITH r AS (SELECT row(0,1)::text AS r)
SELECT CASE WHEN r.r ~ '^\("",' THEN ''
WHEN r.r ~ '^\("' THEN regexp_replace(regexp_replace(regexp_replace(right(r.r, -2), '""', '\"', 'g'), '([^\\])",.*', '\1'), '\\"', '"', 'g')
ELSE (regexp_matches(right(r.r, -1), '^[^,]*'))[1] END
FROM r
When converting a row to text, PostgreSQL uses quoted CSV formatting. I couldn't find any tools for importing quoted CSV into an array, so the above is a crude text manipulation via mostly regular expressions. Maybe someone will find this useful!
With PostgreSQL 13+, you can just reference individual elements in the row with .fN notation. For your example:
select (row(0, 1)).f1; --> returns 0.
See https://www.postgresql.org/docs/13/sql-expressions.html#SQL-SYNTAX-ROW-CONSTRUCTORS