I have a table with a field A where each entry is a fixed length array A of integers (say length=1000). I want to know how to convert it into 1000 columns, with column name given by index_i, for i=0,1,2,...,999, and each element is the corresponding integer. I can have it done by something like
A[OFFSET(0)] as index_0,
A[OFFSET(1)] as index_1
A[OFFSET(2)] as index_2,
A[OFFSET(3)] as index_3,
A[OFFSET(4)] as index_4,
...
A[OFFSET(999)] as index_999,
I want to know what would be an elegant way of doing this. thanks!
The first thing to say is that, sadly, this is going to be much more complicated than most people expect. It can be conceptually easier to pass the values into a scripting language (e.g. Python) and work there, but clearly keeping things inside BigQuery is going to be much more performant. So here is an approach.
Cross-joining to turn array fields into long-format tables
I think the first thing you're going to want to do is get the values out of the arrays and into rows.
Typically in BigQuery this is accomplished using CROSS JOIN. The syntax is a tad unintuitive:
WITH raw AS (
SELECT "A" AS name, [1,2,3,4,5] AS a
UNION ALL
SELECT "B" AS name, [5,4,3,2,1] AS a
),
long_format AS (
SELECT name, vals
FROM raw
CROSS JOIN UNNEST(raw.a) AS vals
)
SELECT * FROM long_format
UNNEST(raw.a) is taking those arrays of values and turning each array into a set of (five) rows, every single one of which is then joined to the corresponding value of name (the definition of a CROSS JOIN). In this way we can 'unwrap' a table with an array field.
This will yields results like
name | vals
-------------
A | 1
A | 2
A | 3
A | 4
A | 5
B | 5
B | 4
B | 3
B | 2
B | 1
Confusingly, there is a shorthand for this syntax in which CROSS JOIN is replaced with a simple comma:
WITH raw AS (
SELECT "A" AS name, [1,2,3,4,5] AS a
UNION ALL
SELECT "B" AS name, [5,4,3,2,1] AS a
),
long_format AS (
SELECT name, vals
FROM raw, UNNEST(raw.a) AS vals
)
SELECT * FROM long_format
This is more compact but may be confusing if you haven't seen it before.
Typically this is where we stop. We have a long-format table, created without any requirement that the original arrays all had the same length. What you're asking for is harder to produce - you want a wide-format table containing the same information (relying on the fact that each array was the same length.
Pivot tables in BigQuery
The good news is that BigQuery now has a PIVOT function! That makes this kind of operation possible, albeit non-trivial:
WITH raw AS (
SELECT "A" AS name, [1,2,3,4,5] AS a
UNION ALL
SELECT "B" AS name, [5,4,3,2,1] AS a
),
long_format AS (
SELECT name, vals, offset
FROM raw, UNNEST(raw.a) AS vals WITH OFFSET
)
SELECT *
FROM long_format PIVOT(
ANY_VALUE(vals) AS vals
FOR offset IN (0,1,2,3,4)
)
This makes use of WITH OFFSET to generate an extra offset column (so that we know which order the values in the array originally had).
Also, in general pivoting requires us to aggregate the values returned in each cell. But here we expect exactly one value for each combination of name and offset, so we simply use the aggregation function ANY_VALUE, which non-deterministically selects a value from the group you're aggregating over. Since, in this case, each group has exactly one value, that's the value retrieved.
The query yields results like:
name vals_0 vals_1 vals_2 vals_3 vals_4
----------------------------------------------
A 1 2 3 4 5
B 5 4 3 2 1
This is starting to look pretty good, but we have a fundamental issue, in that the column names are still hard-coded. You wanted them generated dynamically.
Unfortunately expressions for the pivot column values aren't something PIVOT can accept out-of-the-box. Note that BigQuery has no way to know that your long-format table will resolve neatly to a fixed number of columns (it relies on offset having the values 0-4 for each and every set of records).
Dynamically building/executing the pivot
And yet, there is a way. We will have to leave behind the comfort of standard SQL and move into the realm of BigQuery Procedural Language.
What we must do is use the expression EXECUTE IMMEDIATE, which allows us to dynamically construct and execute a standard SQL query!
(as an aside, I bet you - OP or future searchers - weren't expecting this rabbit hole...)
This is, of course, inelegant to say the least. But here is the above toy example, implemented using EXECUTE IMMEDIATE. The trick is that the executed query is defined as a string, so we just have to use an expression to inject the full range of values you want into this string.
Recall that || can be used as a string concatenation operator.
EXECUTE IMMEDIATE """
WITH raw AS (
SELECT "A" AS name, [1,2,3,4,5] AS a
UNION ALL
SELECT "B" AS name, [5,4,3,2,1] AS a
),
long_format AS (
SELECT name, vals, offset
FROM raw, UNNEST(raw.a) AS vals WITH OFFSET
)
SELECT *
FROM long_format PIVOT(
ANY_VALUE(vals) AS vals
FOR offset IN ("""
|| (SELECT STRING_AGG(CAST(x AS STRING)) FROM UNNEST(GENERATE_ARRAY(0,4)) AS x)
|| """
)
)
"""
Ouch. I've tried to make that as readable as possible. Near the bottom there is an expression that generates the list of column suffices (pivoted values of offset):
(SELECT STRING_AGG(CAST(x AS STRING)) FROM UNNEST(GENERATE_ARRAY(0,4)) AS x)
This generates the string "0,1,2,3,4" which is then concatenated to give us ...FOR offset IN (0,1,2,3,4)... in our final query (as in the hard-coded example before).
REALLY dynamically executing the pivot
It hasn't escaped my notice that this is still technically insisting on your knowing up-front how long those arrays are! It's a big improvement (in the narrow sense of avoiding painful repetitive code) to use GENERATE_ARRAY(0,4), but it's not quite what was requested.
Unfortunately, I can't provide a working toy example, but I can tell you how to do it. You would simply replace the pivot values expression with
(SELECT STRING_AGG(DISTINCT CAST(offset AS STRING)) FROM long_format)
But doing this in the example above won't work, because long_format is a Common Table Expression that is only defined inside the EXECUTE IMMEDIATE block. The statement in that block won't be executed until after building it, so at build-time long_format has yet to be defined.
Yet all is not lost. This will work just fine:
SELECT *
FROM d.long_format PIVOT(
ANY_VALUE(vals) AS vals
FOR offset IN ("""
|| (SELECT STRING_AGG(DISTINCT CAST(offset AS STRING)) FROM d.long_format)
|| """
)
)
... provided you first define a BigQuery VIEW (for example) called long_format (or, better, some more expressive name) in a dataset d. That way, both the job that builds the query and the job that runs it will have access to the values.
If successful, you should see both jobs execute and succeed. You should then click 'VIEW RESULTS' on the job that ran the query.
As a final aside, this assumes you are working from the BigQuery console. If you're instead working from a scripting language, that gives you plenty of options to either load and manipulate the data, or build the query in your scripting language rather than massaging BigQuery into doing it for you.
Consider below approach
execute immediate ( select '''
select * except(id) from (
select to_json_string(A) id, * except(A)
from your_table, unnest(A) value with offset
)
pivot (any_value(value) index for offset in ('''
|| (select string_agg('' || val order by offset) from unnest(generate_array(0,999)) val with offset) || '))'
)
If to apply to dummy data like below (with 10 instead of 1000 elements)
select [10,11,12,13,14,15,16,17,18,19] as A union all
select [20,21,22,23,24,25,26,27,28,29] as A union all
select [30,31,32,33,34,35,36,37,38,39] as A
the output is
I have a question about SQL, and I honestly tried to search methods before asking. I will give an abstract (but precise) description below, and will greatly appreciate your example of solution (SQL query).
What I have:
Table A with category ids of the items and prices (in USD) for each item. category id has int type of value, price is string and looks like "USD 200000000" (real value is multiplied by 10^7). Tables also has a kind column with int type of value.
Table B with relation of category id and name.
What I need:
Get a table with price diapasons (like 0-100 | 100-200 | ...) as column names and count amount of items for each category id (as lines names) in all of the price diapasons. All results must be filtered by kind parameter (from table A) with value 3.
Questions, that I encountered (and which caused to ask for an example of SQL query):
Cut "USD from price string value, divide it by 10^7 and convert to float.
Gather diapasons of price values (0-100 | 100-200 | ...), with given step in the given interval (max price is considered as unknown at the start). Example: step 100 on 0-500 interval, and step 200 for values >500.
Put diapasons of price values into column names of the result table.
For each diapason, count amount of items in each category (category_id). Left limit of diapason shall not be considered (e.g. on 1000-1200 diapason, items with price 1000 shall not be considered).
Using B table, display name instead of category id.
Response is appreciated, ignorance will be understood.
If you only need category ids, then you do not need B. What you are looking for is conditional aggregation, something like:
select category_id,
sum(case when cast(substring(price, 4, 100) as int)/10000000 < 100 then 1 else 0 end) as price_000_100
sum(case when cast(substring(price, 4, 100) as int)/10000000 >= 100 and cast(substring(price, 4, 100) as int)/10000000 < 200
then 1 else 0
end) as price_100_200,
. . .
from a
group by category_id
There is no standard way to do what you describe.
That is because to do (3) you need a pivot aka crosstab, and this is not in ANSI SQL. Each DBMS has it's own implementation. Plus dynamic columns in a pivot table are an additional complication.
For example, Postgres calls it a "crosstab" and requires the tablefunc module to be installed. See this SO question and the documentation. Compare to SQL Server, which uses the PIVOT command.
You can get close using reasonably standard SQL.
Here is an example based on SQLite. A little bit of conversion would provide a solution for other systems, e.g. SUBSTR would be substring(string [from int] [for int]) in postgre.
Assuming a data table of format:
and a category name table of:
then the following code will produce:
WITH dataCTE AS
(SELECT product_id AS 'ID', CAST(SUBSTR(price, 5) AS INT)/1000000 AS 'USD',
CASE WHEN (CAST(SUBSTR(price, 5) AS INT)/1000000) <= 500 THEN
100 ELSE 200
END AS 'Interval'
FROM data
WHERE kind = 3),
groupCTE AS
(SELECT dataCTE.ID AS 'ID', dataCTE.USD AS 'USD', dataCTE.Interval AS 'Interval',
CASE WHEN dataCTE.Interval = 100 THEN
CAST(dataCTE.USD AS INT)/100
ELSE
(CAST(dataCTE.USD-500 AS INT)/200)+5
END AS 'GroupID'
FROM dataCTE),
cleanCTE AS
(SELECT *, CASE WHEN groupCTE.Interval = 100 THEN
CAST(groupCTE.GroupID *100 AS VARCHAR)
|| '-' ||
CAST((groupCTE.GroupID *100)+99 AS VARCHAR)
ELSE
CAST(((groupCTE.GroupID-5)*200)+500 AS VARCHAR)
|| '-' ||
CAST(((groupCTE.GroupID-5)*200)+500+199 AS VARCHAR)
END AS 'diapason'
FROM groupCTE
INNER JOIN cat_name AS cn ON groupCTE.ID = cn.cat_id)
SELECT *
FROM cleanCTE;
If you modify the last SELECT to:
SELECT name, diapason, COUNT(diapason)
FROM cleanCTE
GROUP BY name, diapason;
then you get a grouped output:
This is as close as you will get without specifying the exact system; even then you will have a problem with dynamically creating the column names.
I need your kind help with this SQL query. I have a query with 6 layers of subqueries that's currently structured like this. I am looking forward to advice how to:
Reduce the layers without repeating the same statement (for example, I could replace 'case when E>200' with '(Case when T2.BB >100 then B+C else B+D end) > 200' and write the statement in layer 1, hence eliminating layer2. I can't do this because in my raw queries I have a computed column that is based on another computed column in its sub-query, which is then calculated based on another computed column in the sub-sub-query... So repeating codes 5/6 time will confuse me and drive me crazy...
Avoid using select 2., select 1. while still keeping all the columns (F,E,A,B,C,D,T2.BB) in the final output. I want to do this because in my raw query there is 5 select.* -- I feel like this causes the servers to do much redundant work and slows down query execution.
Thanks very much for your help!
Select
2.*,
case when E > 200 then 'OK' else 'OH NO' end F
From
(Select
1.*,
Case when T2.BB >100 then B+C else B+D end E
From
(Select
A, B, C, D, T2.BB
From
T1
Join
T2 on T1.A = T2.AA) 1
) 2
try WITH statements, they look cleaner because they help to maintain the order in which you apply the logic, so it will look like this:
WITH
t1 as (
select ...
from src_table
)
,t2 as (
select *, ...
from t1
)
<<as much layers as needed>>
also if you need to reuse something in 2 different places you can reference the prior WITH statement from any following statement, i.e. encapsulate that logic in one place
I know there are some posts on pivoting, which I have used to get where I am today (thanks to the BQ community!). But this post seeks some advice on optimising this where there is a large number of pivot columns needed, distributed table joins are needed....as well and deudping. Not asking much right!
Objective:
We have 2 large BQ tables, with a full 10 years history that needs joining:
sales_order_header (13 GB - 1.35 million rows)
sales_order_line (50GM - 5 million rows)
This is a typical 'header/line' one to many relationship. The data for the tables arrives as 2 seperate streams unfortunately rather then 1 document style where the line is nested inside the header which would be ideal - but its not so distributed joins become necesary for some of the views our BI tool (Tableau) wants to periodically (every 60 mins) call to ingest 'cleansed' data that is:
deduped (both tables that is)
joined header to line (on salesOrderId)
each has its own array of 'sourceData' namve / value paris that needs unpacking / 'pivot' so its not an array
Point 3 presents an issue in its own right. We have a column called 'sourceData' which is basically where the core data is - its an array of string name value pairs (a row in BQ is a replication of a single row from a DB so the key is a column name and value the value for a single row).
Now I think here lay the issue, as there are 250 array entries (we know the exact number up front) , this equates to 250 'unnest' statements each and using the best approach I can think of using sub selects:
(SELECT val FROM UNNEST(sourceData) WHERE name = 'a') AS a,
250 times
And this is done as a pattern for each of the header and the line tables repsective views.
So the SQL for the view for just retrieving a deduped, flattened/pivoted array for the sales_order_header table is as follows. The sales_order_line has the same pattern for its view:
#standardSQL
WITH latest_snapshot_dups AS (
SELECT
salesOrderId,
PARSE_TIMESTAMP("%Y-%m-%dT%H:%M:%E*S%Ez", lastUpdated) AS lastUpdatedTimestampUTC,
sourceData,
_PARTITIONTIME AS bqPartitionTime
FROM
`project.ds.sales_order_header_refdata`
),
latest_snapshot_nodups AS (
SELECT
*,
ROW_NUMBER() OVER (PARTITION BY salesOrderId ORDER BY lastUpdatedTimestampUTC DESC) AS rowNum
FROM latest_snapshot_dups
)
SELECT
salesOrderId,
lastUpdatedTimestampUTC,
(SELECT val FROM UNNEST(sourceData) WHERE name = 'a') AS a,
(SELECT val FROM UNNEST(sourceData) WHERE name = 'b') AS b,
....250 of these
FROM
latest_snapshot_nodups
WHERE
rowNum = 1
Although just showing one here, we have these two similar views (with total of 250 + 300 = 550 unique subqueries that unnest/pivot), and now I want to join the header with the line views and I run into an issue straight away exceeding a limit of subqueries.
Is there a better way to do this, assuming this is the data there is to work with? A better way to 'pivot' perhaps? Or a more efficient way building a single view that optimises the order of things, rather then using 2 discrete views?
Thanks for your help BQ Community!
I run into an issue straight away exceeding a limit of subqueries
You currently using below pattern (removed mot significant part of code for simplicity)
#standardSQL
SELECT
salesOrderId,
(SELECT val FROM UNNEST(sourceData) WHERE name = 'a') AS a,
(SELECT val FROM UNNEST(sourceData) WHERE name = 'b') AS b,
....250 OF these
FROM latest_snapshot_nodups
Try below pattern
#standardSQL
SELECT
salesOrderId,
MAX(IF(name = 'a', val, NULL)) AS a,
MAX(IF(name = 'b', val, NULL)) AS b,
....250 OF these
FROM latest_snapshot_nodups, UNNEST(sourceData) kv
GROUP BY salesOrderId
I'm currently working on calculating a larger set of data with a number of joins and the end result is a calculation across two tables. My current script looks like the following:
USE db1
Go
SELECT
customer, tb1.custid
FROM
[dbo].[tb1]
LEFT OUTER JOIN
[dbo].[tb2] ON tb1.custid = tb2.custid
LEFT OUTER JOIN
[dbo].[tb3] ON tb2.custnumber = tb3.custnumber
LEFT OUTER JOIN
[dbo].[tb4] ON tb2.custid = tb4.custid
WHERE
tb1.custclass = 'non-person'
AND tb4.zip IN ('11111', '11112')
GO
As you can see, it's not the cleanest, but it's working for gathering initial information. The reasoning for the number of joins is due to an incredibly odd table structure I did not create and the fact that the numerical data I need is only stored in tb3.
What I'm now trying to do is calculate the sum of 3 fields from tb3 that are all set as numeric fields and do an AND/OR comparison against a 4th field (also numeric). I know I can SUM them together, but I'm hoping for some input on three things:
Where to place that SUM calculation in the query?
Where to place and how to do the comparison of the SUM total against the 4th field?
Is it possible to return the higher of the two values to a TOTAL column in the initial SELECT?
Thank you in advance.
Where to place that SUM calculation in the query?
If you want it output, you probably want to just add it to the SELECT
SELECT
customer, tb1.custid
(tb3.col1 + tb3.col2 + tb3.col3) as Sum
FROM
...
Where to place and how to do the comparison of the SUM total against the 4th field?
You probably want to do this with a CASE statement, and this also answers your last question
Is it possible to return the higher of the two values to a TOTAL column in the initial SELECT?
SELECT
customer, tb1.custid
CASE WHEN (tb3.col1 + tb3.col2 + tb3.col3) > tb3.col4
THEN (tb3.col1 + tb3.col2 + tb3.col3)
ELSE tb3.col4
END as Total
FROM
...
You should be able to calculate the sum as a nested query:
SELECT (field1 + field2 + field3) AS fields_sum FROM tb3 (...)
Then in your main query you could do something like:
SELECT customer, tb1.custid, (CASE WHEN fields_sum > fourth_field THEN fields_sum ELSE fourth_field END) AS TOTAL (...)