I have a table with a field A where each entry is a fixed length array A of integers (say length=1000). I want to know how to convert it into 1000 columns, with column name given by index_i, for i=0,1,2,...,999, and each element is the corresponding integer. I can have it done by something like
A[OFFSET(0)] as index_0,
A[OFFSET(1)] as index_1
A[OFFSET(2)] as index_2,
A[OFFSET(3)] as index_3,
A[OFFSET(4)] as index_4,
...
A[OFFSET(999)] as index_999,
I want to know what would be an elegant way of doing this. thanks!
The first thing to say is that, sadly, this is going to be much more complicated than most people expect. It can be conceptually easier to pass the values into a scripting language (e.g. Python) and work there, but clearly keeping things inside BigQuery is going to be much more performant. So here is an approach.
Cross-joining to turn array fields into long-format tables
I think the first thing you're going to want to do is get the values out of the arrays and into rows.
Typically in BigQuery this is accomplished using CROSS JOIN. The syntax is a tad unintuitive:
WITH raw AS (
SELECT "A" AS name, [1,2,3,4,5] AS a
UNION ALL
SELECT "B" AS name, [5,4,3,2,1] AS a
),
long_format AS (
SELECT name, vals
FROM raw
CROSS JOIN UNNEST(raw.a) AS vals
)
SELECT * FROM long_format
UNNEST(raw.a) is taking those arrays of values and turning each array into a set of (five) rows, every single one of which is then joined to the corresponding value of name (the definition of a CROSS JOIN). In this way we can 'unwrap' a table with an array field.
This will yields results like
name | vals
-------------
A | 1
A | 2
A | 3
A | 4
A | 5
B | 5
B | 4
B | 3
B | 2
B | 1
Confusingly, there is a shorthand for this syntax in which CROSS JOIN is replaced with a simple comma:
WITH raw AS (
SELECT "A" AS name, [1,2,3,4,5] AS a
UNION ALL
SELECT "B" AS name, [5,4,3,2,1] AS a
),
long_format AS (
SELECT name, vals
FROM raw, UNNEST(raw.a) AS vals
)
SELECT * FROM long_format
This is more compact but may be confusing if you haven't seen it before.
Typically this is where we stop. We have a long-format table, created without any requirement that the original arrays all had the same length. What you're asking for is harder to produce - you want a wide-format table containing the same information (relying on the fact that each array was the same length.
Pivot tables in BigQuery
The good news is that BigQuery now has a PIVOT function! That makes this kind of operation possible, albeit non-trivial:
WITH raw AS (
SELECT "A" AS name, [1,2,3,4,5] AS a
UNION ALL
SELECT "B" AS name, [5,4,3,2,1] AS a
),
long_format AS (
SELECT name, vals, offset
FROM raw, UNNEST(raw.a) AS vals WITH OFFSET
)
SELECT *
FROM long_format PIVOT(
ANY_VALUE(vals) AS vals
FOR offset IN (0,1,2,3,4)
)
This makes use of WITH OFFSET to generate an extra offset column (so that we know which order the values in the array originally had).
Also, in general pivoting requires us to aggregate the values returned in each cell. But here we expect exactly one value for each combination of name and offset, so we simply use the aggregation function ANY_VALUE, which non-deterministically selects a value from the group you're aggregating over. Since, in this case, each group has exactly one value, that's the value retrieved.
The query yields results like:
name vals_0 vals_1 vals_2 vals_3 vals_4
----------------------------------------------
A 1 2 3 4 5
B 5 4 3 2 1
This is starting to look pretty good, but we have a fundamental issue, in that the column names are still hard-coded. You wanted them generated dynamically.
Unfortunately expressions for the pivot column values aren't something PIVOT can accept out-of-the-box. Note that BigQuery has no way to know that your long-format table will resolve neatly to a fixed number of columns (it relies on offset having the values 0-4 for each and every set of records).
Dynamically building/executing the pivot
And yet, there is a way. We will have to leave behind the comfort of standard SQL and move into the realm of BigQuery Procedural Language.
What we must do is use the expression EXECUTE IMMEDIATE, which allows us to dynamically construct and execute a standard SQL query!
(as an aside, I bet you - OP or future searchers - weren't expecting this rabbit hole...)
This is, of course, inelegant to say the least. But here is the above toy example, implemented using EXECUTE IMMEDIATE. The trick is that the executed query is defined as a string, so we just have to use an expression to inject the full range of values you want into this string.
Recall that || can be used as a string concatenation operator.
EXECUTE IMMEDIATE """
WITH raw AS (
SELECT "A" AS name, [1,2,3,4,5] AS a
UNION ALL
SELECT "B" AS name, [5,4,3,2,1] AS a
),
long_format AS (
SELECT name, vals, offset
FROM raw, UNNEST(raw.a) AS vals WITH OFFSET
)
SELECT *
FROM long_format PIVOT(
ANY_VALUE(vals) AS vals
FOR offset IN ("""
|| (SELECT STRING_AGG(CAST(x AS STRING)) FROM UNNEST(GENERATE_ARRAY(0,4)) AS x)
|| """
)
)
"""
Ouch. I've tried to make that as readable as possible. Near the bottom there is an expression that generates the list of column suffices (pivoted values of offset):
(SELECT STRING_AGG(CAST(x AS STRING)) FROM UNNEST(GENERATE_ARRAY(0,4)) AS x)
This generates the string "0,1,2,3,4" which is then concatenated to give us ...FOR offset IN (0,1,2,3,4)... in our final query (as in the hard-coded example before).
REALLY dynamically executing the pivot
It hasn't escaped my notice that this is still technically insisting on your knowing up-front how long those arrays are! It's a big improvement (in the narrow sense of avoiding painful repetitive code) to use GENERATE_ARRAY(0,4), but it's not quite what was requested.
Unfortunately, I can't provide a working toy example, but I can tell you how to do it. You would simply replace the pivot values expression with
(SELECT STRING_AGG(DISTINCT CAST(offset AS STRING)) FROM long_format)
But doing this in the example above won't work, because long_format is a Common Table Expression that is only defined inside the EXECUTE IMMEDIATE block. The statement in that block won't be executed until after building it, so at build-time long_format has yet to be defined.
Yet all is not lost. This will work just fine:
SELECT *
FROM d.long_format PIVOT(
ANY_VALUE(vals) AS vals
FOR offset IN ("""
|| (SELECT STRING_AGG(DISTINCT CAST(offset AS STRING)) FROM d.long_format)
|| """
)
)
... provided you first define a BigQuery VIEW (for example) called long_format (or, better, some more expressive name) in a dataset d. That way, both the job that builds the query and the job that runs it will have access to the values.
If successful, you should see both jobs execute and succeed. You should then click 'VIEW RESULTS' on the job that ran the query.
As a final aside, this assumes you are working from the BigQuery console. If you're instead working from a scripting language, that gives you plenty of options to either load and manipulate the data, or build the query in your scripting language rather than massaging BigQuery into doing it for you.
Consider below approach
execute immediate ( select '''
select * except(id) from (
select to_json_string(A) id, * except(A)
from your_table, unnest(A) value with offset
)
pivot (any_value(value) index for offset in ('''
|| (select string_agg('' || val order by offset) from unnest(generate_array(0,999)) val with offset) || '))'
)
If to apply to dummy data like below (with 10 instead of 1000 elements)
select [10,11,12,13,14,15,16,17,18,19] as A union all
select [20,21,22,23,24,25,26,27,28,29] as A union all
select [30,31,32,33,34,35,36,37,38,39] as A
the output is
I have a field in my dataset that include json objects in the following format:
cars
[{"element":{"name":"honda","id":"34"}}]
[{"element":{"name":"Lexus","id":"56"}}]
I am using the following query to extract the names of the cars, but just returns empty (null) rows. Any ideas what I am doing wrong?
select JSON_QUERY(cars,"$.name") AS car_names
from myTable
limit 100
Consider below approach
select *,
( select string_agg(json_extract_scalar(car, '$.element.name'))
from unnest(json_extract_array(cars)) car
) car_names
from `project.dataset.table`
if applied to sample data in your question - as in below example
with `project.dataset.table` as (
select '[{"element":{"name":"honda","id":"34"}}]' cars union all
select '[{"element":{"name":"Lexus","id":"56"}}]'
)
select *,
( select string_agg(json_extract_scalar(car, '$.element.name'))
from unnest(json_extract_array(cars)) car
) car_names
from `project.dataset.table`
the output is
if you are trying to extract a scalar value, You should simply use JSON_VALUE(expression,path)
For Example:
An object of Info contain a variable name and another object address,
You can get the value of name by using JSON_VALUE as it isn't an object
BUT
To get the address, you have to use JSON_QUERY.
I have a problem with a data query where I query a single column like this:
SELECT a.ad_morg_key, count(a.sid_mpenduduk_key) AS total_population
FROM sid_mpenduduk a
GROUP BY a.ad_morg_key;
and it really works. But when I query with multiple columns with a query like this:
SELECT a.ad_morg_key, b."name",
count(b.sid_magama_key) AS total,
count(b.sid_magama_key)::float / (SELECT count(a.sid_mpenduduk_key)
FROM sid_mpenduduk a
GROUP BY a.ad_morg_key)::float * 100::float AS percentage,
(SELECT count(a.sid_mpenduduk_key) FROM sid_mpenduduk a GROUP BY a.ad_morg_key) AS total_population
FROM sid_mpenduduk a
INNER JOIN sid_magama b ON a.sid_magama_key = b.sid_magama_key
GROUP BY a.ad_morg_key, b."name";
But it fails with:
ERROR: more than one row returned by a subquery used as an expression
I want the final result like this :
Your subquery is grouped by a.ad_morg_key so it will get you a row for each different value of a.ad_morg_key.
In general terms each subquery in a SELECT statement should return a single value. Suppose you have the following table called A.
A_key
A_value
A1
200
A2
200
If you execute
SELECT (SELECT A_KEY FROM A) as keys
FROM A
the subquery (SELECT A_KEY FROM A) returns
A_key
A1
A2
so what should be the value for keys?
SQL cannot handle this decision so you should pick one of the values or aggregate them into a single value.
Use a correlation clause instead:
(SELECT count(a.sid_mpenduduk_key)
FROM sid_mpenduduk a2
WHERE a2.ad_morg_key = a.ad_morg_key
) AS total_population
I'm not sure if the subquery is really necessary. So, you might consider asking a new question with sample data, desired results, and a clear explanation of what you are trying to do.
You're getting burned by
GROUP BY ...
(SELECT count(a.sid_mpenduduk_key)
FROM sid_mpenduduk a
GROUP BY a.ad_morg_key) AS total_population
because the outer GROUP BY wants a scalar, but the subquery is producing a count for each a.ad_morg_key.
I don't write my queries that way. Instead, produce a virtual table,
SELECT a.ad_morg_key, b."name",
...
JOIN
(SELECT ad_morg_key,
count(sid_mpenduduk_key) as N
FROM sid_mpenduduk
GROUP BY ad_morg_key) AS morgs
on ad_morg_key = morgs.ad_morg_key
That way, you have the count for each row as N, and you can divide at will,
count(b.sid_magama_key)::float / morgs.N
and, if you get tripped up, you'll have many more rows than you expected instead of an error message.
I am trying to pull records whose arrays only meet a certain condition.
For example, I want only the results that contain "IAB3".
Here is what the table looks like
Table Name:
bids
Column Names:
BidderBanner / WinCat
Entries:
1600402 / null
1911048 / null
1893069 / [IAB3-11, IAB3]
1214894 / IAB3
How I initially thought it would be
SELECT * FROM bids WHERE WinCat = "IAB3"
but I get an error that says no match for operator types array, string.
The database is in Google Big Query.
Below is for BigQuery Standard SQL
#standardSQL
SELECT * FROM `project.dataset.bids` WHERE 'IAB3' IN UNNEST(WinCat)
You can test, play with above using sample data from your question as in example below
#standardSQL
WITH `project.dataset.bids` AS (
SELECT 1600402 BidderBanner, NULL WinCat UNION ALL
SELECT 1911048, NULL UNION ALL
SELECT 1893069, ['IAB3-11', 'IAB3'] UNION ALL
SELECT 1214894, ['IAB3']
)
SELECT * FROM `project.dataset.bids` WHERE 'IAB3' IN UNNEST(WinCat)
with result
you need to use single quotes in sql for all strings. it should be WHERE WinCat = 'IAB3' not WHERE WinCat = "IAB3"
One method uses unnest(), something like this:
SELECT b.*
FROM bids b
WHERE 'IAB3' IN (SELECT unnest(b.WinCats))
However, array syntax varies among the databases that support them and they are no part of "standard SQL".
this will work:
SELECT * FROM bids WHERE REGEXP_LIKE (WinCat, '(.)*(IAB3)+()*');
I can do this in SQL Server:
SELECT 'HERRAMIENTA ELÉCTRICA' AS TIPO_PRODUCTO,
0 AS DEPRECIACION,
(select sum(empid) from HR.employees) STOCK
but in Access the same query show me the next error:
Query input must contain at least one table or query
So which could be the best form to emulate this? Make a query with any other table looks dirty for me.
EDIT 1:, HR.employees It may no have data, but i want show constants ('HERRAMIENTA ELÉCTRICA',''0') and 0 in the third column, maybe using isnull and this is not the problem here.
Why not to select directly:
select 'HERRAMIENTA ELÉCTRICA' AS TIPO_PRODUCTO,
0 AS DEPRECIACION,
IIF(ISNULL(sum(empid)), 0, sum(empid)) AS STOCK
from HR.employees
This simply doesn't work in Access. You need a FROM clause.
So you need to have a dummy table with one record, even if you don't use a single field from that table.
SELECT 'HERRAMIENTA ELÉCTRICA' AS TIPO_PRODUCTO,
0 AS DEPRECIACION,
(select sum(empid) from HR.employees) STOCK
FROM Dummy_Table
Using this example as empty table:
with employ as
(select 2 as col from dual
minus
select 2 as col from dual)
The query is this one:
select 'HERRAM' as tipo,
0 as deprec,
coalesce(sum(col), 0) as STOCK
from employ;
coalesce(x, value) sets the column to value when X is null
In Access, you can use a system table, and Val and Nz for the zero value:
SELECT TOP 1
'HERRAMIENTA ELÉCTRICA' AS TIPO_PRODUCTO,
0 AS DEPRECIACION,
Val(Nz((select sum(empid) from HR.employees), 0)) AS STOCK
FROM
MSysObjects