I have a table with a field A, where each entry is a fixed-length array of integers (say length = 1000). I want to know how to convert it into 1000 columns, with the column names given by index_i for i = 0, 1, 2, ..., 999, where each element is the corresponding integer. I can get it done with something like
A[OFFSET(0)] as index_0,
A[OFFSET(1)] as index_1,
A[OFFSET(2)] as index_2,
A[OFFSET(3)] as index_3,
A[OFFSET(4)] as index_4,
...
A[OFFSET(999)] as index_999
I want to know what would be an elegant way of doing this. thanks!
The first thing to say is that, sadly, this is going to be much more complicated than most people expect. It can be conceptually easier to pass the values into a scripting language (e.g. Python) and work there, but clearly keeping things inside BigQuery is going to be much more performant. So here is an approach.
Cross-joining to turn array fields into long-format tables
I think the first thing you're going to want to do is get the values out of the arrays and into rows.
Typically in BigQuery this is accomplished using CROSS JOIN. The syntax is a tad unintuitive:
WITH raw AS (
SELECT "A" AS name, [1,2,3,4,5] AS a
UNION ALL
SELECT "B" AS name, [5,4,3,2,1] AS a
),
long_format AS (
SELECT name, vals
FROM raw
CROSS JOIN UNNEST(raw.a) AS vals
)
SELECT * FROM long_format
UNNEST(raw.a) is taking those arrays of values and turning each array into a set of (five) rows, every single one of which is then joined to the corresponding value of name (the definition of a CROSS JOIN). In this way we can 'unwrap' a table with an array field.
This yields results like:
name | vals
-------------
A | 1
A | 2
A | 3
A | 4
A | 5
B | 5
B | 4
B | 3
B | 2
B | 1
Confusingly, there is a shorthand for this syntax in which CROSS JOIN is replaced with a simple comma:
WITH raw AS (
SELECT "A" AS name, [1,2,3,4,5] AS a
UNION ALL
SELECT "B" AS name, [5,4,3,2,1] AS a
),
long_format AS (
SELECT name, vals
FROM raw, UNNEST(raw.a) AS vals
)
SELECT * FROM long_format
This is more compact but may be confusing if you haven't seen it before.
Typically this is where we stop. We have a long-format table, created without any requirement that the original arrays all had the same length. What you're asking for is harder to produce - you want a wide-format table containing the same information (relying on the fact that each array was the same length).
Pivot tables in BigQuery
The good news is that BigQuery now has a PIVOT operator! That makes this kind of operation possible, albeit non-trivial:
WITH raw AS (
SELECT "A" AS name, [1,2,3,4,5] AS a
UNION ALL
SELECT "B" AS name, [5,4,3,2,1] AS a
),
long_format AS (
SELECT name, vals, offset
FROM raw, UNNEST(raw.a) AS vals WITH OFFSET
)
SELECT *
FROM long_format PIVOT(
ANY_VALUE(vals) AS vals
FOR offset IN (0,1,2,3,4)
)
This makes use of WITH OFFSET to generate an extra offset column (so that we know which order the values in the array originally had).
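To see WITH OFFSET in isolation, here is a minimal sketch:

SELECT vals, offset
FROM UNNEST([10, 20, 30]) AS vals WITH OFFSET
-- vals | offset
--  10  |   0
--  20  |   1
--  30  |   2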
Also, in general pivoting requires us to aggregate the values returned in each cell. But here we expect exactly one value for each combination of name and offset, so we simply use the aggregation function ANY_VALUE, which non-deterministically selects a value from the group you're aggregating over. Since, in this case, each group has exactly one value, that's the value retrieved.
The query yields results like:
name vals_0 vals_1 vals_2 vals_3 vals_4
----------------------------------------------
A 1 2 3 4 5
B 5 4 3 2 1
This is starting to look pretty good, but we have a fundamental issue, in that the column names are still hard-coded. You wanted them generated dynamically.
Unfortunately expressions for the pivot column values aren't something PIVOT can accept out-of-the-box. Note that BigQuery has no way to know that your long-format table will resolve neatly to a fixed number of columns (it relies on offset having the values 0-4 for each and every set of records).
Dynamically building/executing the pivot
And yet, there is a way. We will have to leave behind the comfort of standard SQL and move into the realm of BigQuery Procedural Language.
What we must do is use the EXECUTE IMMEDIATE statement, which allows us to dynamically construct and execute a standard SQL query!
(as an aside, I bet you - OP or future searchers - weren't expecting this rabbit hole...)
This is, of course, inelegant to say the least. But here is the above toy example, implemented using EXECUTE IMMEDIATE. The trick is that the executed query is defined as a string, so we just have to use an expression to inject the full range of values you want into this string.
Recall that || can be used as a string concatenation operator.
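For instance:

SELECT 'FOR offset IN (' || '0,1,2' || ')'
-- returns: FOR offset IN (0,1,2)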
EXECUTE IMMEDIATE """
WITH raw AS (
SELECT "A" AS name, [1,2,3,4,5] AS a
UNION ALL
SELECT "B" AS name, [5,4,3,2,1] AS a
),
long_format AS (
SELECT name, vals, offset
FROM raw, UNNEST(raw.a) AS vals WITH OFFSET
)
SELECT *
FROM long_format PIVOT(
ANY_VALUE(vals) AS vals
FOR offset IN ("""
|| (SELECT STRING_AGG(CAST(x AS STRING)) FROM UNNEST(GENERATE_ARRAY(0,4)) AS x)
|| """
)
)
"""
Ouch. I've tried to make that as readable as possible. Near the bottom there is an expression that generates the list of column suffixes (pivoted values of offset):
(SELECT STRING_AGG(CAST(x AS STRING)) FROM UNNEST(GENERATE_ARRAY(0,4)) AS x)
This generates the string "0,1,2,3,4" which is then concatenated to give us ...FOR offset IN (0,1,2,3,4)... in our final query (as in the hard-coded example before).
REALLY dynamically executing the pivot
It hasn't escaped my notice that this is still technically insisting on your knowing up-front how long those arrays are! It's a big improvement (in the narrow sense of avoiding painful repetitive code) to use GENERATE_ARRAY(0,4), but it's not quite what was requested.
Unfortunately, I can't provide a working toy example, but I can tell you how to do it. You would simply replace the pivot values expression with
(SELECT STRING_AGG(DISTINCT CAST(offset AS STRING)) FROM long_format)
But doing this in the example above won't work, because long_format is a Common Table Expression that is only defined inside the EXECUTE IMMEDIATE block. The statement in that block isn't executed until after the query string has been built, and at build time long_format has yet to be defined.
Yet all is not lost. This will work just fine:
SELECT *
FROM d.long_format PIVOT(
ANY_VALUE(vals) AS vals
FOR offset IN ("""
|| (SELECT STRING_AGG(DISTINCT CAST(offset AS STRING)) FROM d.long_format)
|| """
)
)
... provided you first define a BigQuery VIEW (for example) called long_format (or, better, some more expressive name) in a dataset d. That way, both the job that builds the query and the job that runs it will have access to the values.
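A minimal sketch of such a view definition, assuming your raw data lives in a table d.raw shaped like the toy example (both names here are placeholders):

CREATE OR REPLACE VIEW d.long_format AS
SELECT name, vals, offset
FROM d.raw AS r, UNNEST(r.a) AS vals WITH OFFSET;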
If successful, you should see both jobs execute and succeed. You should then click 'VIEW RESULTS' on the job that ran the query.
As a final aside, this assumes you are working from the BigQuery console. If you're instead working from a scripting language, that gives you plenty of options to either load and manipulate the data, or build the query in your scripting language rather than massaging BigQuery into doing it for you.
Consider the approach below:
execute immediate ( select '''
select * except(id) from (
select to_json_string(A) id, * except(A)
from your_table, unnest(A) value with offset
)
pivot (any_value(value) index for offset in ('''
|| (select string_agg(cast(val as string) order by offset) from unnest(generate_array(0,999)) val with offset) || '))'
)
Applied to dummy data like below (with 10 elements instead of 1000, so using generate_array(0,9) above),
select [10,11,12,13,14,15,16,17,18,19] as A union all
select [20,21,22,23,24,25,26,27,28,29] as A union all
select [30,31,32,33,34,35,36,37,38,39] as A
the output is

index_0 index_1 index_2 index_3 index_4 index_5 index_6 index_7 index_8 index_9
--------------------------------------------------------------------------------
10      11      12      13      14      15      16      17      18      19
20      21      22      23      24      25      26      27      28      29
30      31      32      33      34      35      36      37      38      39
I have a source table in which one important column is broken JSON. Below is a sample of the data:
event_properties
"{\"source\":\"barcode\",\"voucher_id\":684883298,\"voucher_name\":\"voucher 1\"}"
"{\"entryPoint\":\"voucher_selection-popup\",\"entry_point\":\"voucher_selection-popup\",\"source\":\"mobile\",\"voucher_id\":712001960,\"voucher_name\":\"voucher 2\"}"
"{\"source\":\"barcode\",\"voucher_id\":638584138,\"voucher_name\":\"voucher 1\"}"
"{\"source\":\"QR Static\",\"voucher_id\":642124374,\"voucher_name\":\"voucher 3\"}"
Each line represents one record. Is there a way to extract the voucher_id and voucher_name information, given that there is more than one variation in the data?
The goal is to extract the voucher id and voucher name like this:
voucher_id voucher_name
684883298 voucher 1
712001960 voucher 2
638584138 voucher 1
642124374 voucher 3
I'm using Redshift.
You can try this:
First, verify whether the JSON is valid with is_valid_json().
If it is not, inspect what is required to make it valid - in this case, removing the leading and trailing ".
In this scenario, use trim() to remove the redundant " chars.
Use json_extract_path_text() to obtain the values.
SQL:
with json_data as (
select '"{\"source\":\"barcode\",\"voucher_id\":684883298,\"voucher_name\":\"voucher 1\"}"'::text as j union
select '"{\"entryPoint\":\"voucher_selection-popup\",\"entry_point\":\"voucher_selection-popup\",\"source\":\"mobile\",\"voucher_id\":712001960,\"voucher_name\":\"voucher 2\"}"'::text union
select '"{\"source\":\"barcode\",\"voucher_id\":638584138,\"voucher_name\":\"voucher 1\"}"'::text union
select '"{\"source\":\"QR Static\",\"voucher_id\":642124374,\"voucher_name\":\"voucher 3\"}"'::text)
select j,
is_valid_json(j),
trim('"' from j) as j_trimmed,
is_valid_json(j_trimmed),
json_extract_path_text(j_trimmed, 'voucher_id') as voucher_id
from json_data;
This yields the voucher_id values. Then use the same method to get the other keys' values, as sketched below.
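For completeness, a final query returning both requested fields might look like this (reusing the json_data CTE from above):

select json_extract_path_text(trim('"' from j), 'voucher_id')   as voucher_id,
       json_extract_path_text(trim('"' from j), 'voucher_name') as voucher_name
from json_data;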
I have the following data in a Postgres table, where data is a jsonb column. I would like to get the result as
[
{field_type: "Design", briefings_count: 1, meetings_count: 13},
{field_type: "Engineering", briefings_count: 1, meetings_count: 13},
{field_type: "Data Science", briefings_count: 0, meetings_count: 3}
]
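The original sample table didn't survive here; a plausible reconstruction, consistent with the desired output and the query below (the table name tbl and the count_type column come from that query; the exact rows are an assumption), would be:

create table tbl (count_type text, data jsonb);

insert into tbl values
  ('briefings_count', '{"Design": 1, "Engineering": 1}'),
  ('meetings_count',  '{"Design": 13, "Engineering": 13, "Data Science": 3}');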
Explanation
Use the jsonb_each_text function to extract data from the jsonb column named data. Then aggregate the rows with GROUP BY to get one row for each distinct field_type. Each aggregation also needs to include the meetings and briefings counts, which is done by selecting the maximum value with a CASE expression, creating two separate columns for the two counts. On top of that, apply the coalesce function to return 0 instead of NULL where some information is missing - in your example, that would be briefings for Data Science.
At the higher level of the statement, now that we have the results as a table of fields, we need to build a jsonb object out of each row and aggregate them all into one row. For that we use jsonb_build_object, passing pairs that consist of the name of a field and its value. That leaves us with 3 rows of data, each holding a separate jsonb column. Since we want only one row (an aggregated JSON) in the output, we apply jsonb_agg on top of that. This gives the result you're looking for.
Code
select
jsonb_agg(
jsonb_build_object('field_type', field_type,
'briefings_count', briefings_count,
'meetings_count', meetings_count
)
) as agg_data
from (
select
j.k as field_type
, coalesce(max(case when t.count_type = 'briefings_count' then j.v::int end),0) as briefings_count
, coalesce(max(case when t.count_type = 'meetings_count' then j.v::int end),0) as meetings_count
from tbl t,
jsonb_each_text(data) j(k,v)
group by j.k
) t
You can aggregate columns like this and then insert the data into another table:
select array_agg(data)
from the_table
Or use one of the built-in JSON functions to create a new JSON array, for example jsonb_agg(expression).
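A minimal sketch with the same the_table and its data column:

select jsonb_agg(data) as agg_data
from the_table;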
I want to use the result of a subquery as a column name in another query, since the data changes columns all the time and the subquery decides which column the current forecast data is stored in. My example:
select item,
item_type,
...
forcast_0 * 0.9 as finalforcast,
forcast_0 * 0.8 as newforcast
from sales_data
But the forcast_0 column name is the result (fore_column_name) of this subquery, and the result may change to forcast_1 or forcast_2:
select
fore_column_name
from forecast_history
where ...
Also, the forcast column will be used multiple times in the first query. How could I implement this?
Use your subquery as an inline table. Something like:
select item,
item_type,
..
decode(fore_column_name, 'forcast_0', forcast_0, 'forcast_1', forcast_1) * 0.9 as finalforcast,
decode(fore_column_name, 'forcast_0', forcast_0, 'forcast_1', forcast_1) * 0.8 as newforcast
from sales_data,
(
select fore_column_name
from forecast_history
where ...
) inlineTable
I'm assuming here that the value from the sub-query will be the same for each row - so a quick cross-join will suffice. If the value will vary depending on the values in each row of the sales_data table, then some other type of join would be more appropriate.
See Oracle's documentation on decode, in case you aren't familiar with it.
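For illustration, decode walks a list of search/result pairs and returns the result for the first match, with an optional default at the end:

-- hypothetical values: returns 20, because 'forcast_1' matches the second pair
select decode('forcast_1', 'forcast_0', 10, 'forcast_1', 20, 0) from dual;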
This is asked regarding an Oracle 11g database.
I'm trying to query an Atlassian Confluence calendar table. It stores calendar entries for an entire calendar into a single value in a single row, which is this gigantic glob of iCal crap.
If the fields within each entry were in a consistent order, my regex fu would be strong enough to parse out the particular entry I am searching for... but since I need to search for a date, the description, and the summary, all of which can apparently be in any order within the BEGIN/END VEVENT, this is impossible. I'm halfway certain it would be impossible even with lookahead and lookbehind.
Is there a SQL (not PL/SQL) construction that would chop this single string/blob value up into multiple rows, so that I could do something like:
select * from (chopped up value) where x like '%something%';
This would make it sort of the reverse of a wm_concat() or group_concat...
A typical entry looks something like this (and it has 50 or 60 already):
BEGIN:VEVENT
UID:20130724T153322Z--922125579@atlassianzzz.zzz.edu
SUMMARY:Richard Smichard
ATTENDEE;X-CONFLUENCE-USER=rismich:https://atlassianzzz.zzz.edu/c
onfluence/display/~rismich
LOCATION:
DESCRIPTION:Primary
DTSTART;VALUE=DATE:20130726
DTEND;VALUE=DATE:20130729
DTSTAMP:20130724T153322Z
CREATED:20130724T153322Z
LAST-MODIFIED:20130724T153322Z
ORGANIZER;X-CONFLUENCE-USER=botard:MAILTO:botard@zzz.edu
SEQUENCE:0
END:VEVENT
I can't use PL/SQL or build a proper parser because the environment this will run in doesn't make that possible. I get to run a select statement, and it either returns the value I'm looking for, or it doesn't.
Also, NoSQL sucks. Big time.
This is a quick test:
with w1 as
(
select 'BEGIN:VEVENT
UID:20130724T153322Z--922125579@atlassianzzz.zzz.edu
SUMMARY:Richard Smichard
ATTENDEE;X-CONFLUENCE-USER=rismich:https://atlassianzzz.zzz.edu/c
onfluence/display/~rismich
LOCATION:
DESCRIPTION:Primary
DTSTART;VALUE=DATE:20130726
DTEND;VALUE=DATE:20130729
DTSTAMP:20130724T153322Z
CREATED:20130724T153322Z
LAST-MODIFIED:20130724T153322Z
ORGANIZER;X-CONFLUENCE-USER=botard:MAILTO:botard@zzz.edu
SEQUENCE:0
END:VEVENT' text from dual
),
w2 as
(
select 'SUMMARY' label from dual
union all
select 'DESCRIPTION' label from dual
)
select regexp_substr(w1.text, 'UID.*') id, w2.label,
substr(regexp_substr(w1.text, w2.label || '.*'),
instr(regexp_substr(w1.text, w2.label || '.*'), ':') + 1) spl
from w1, w2;
It gives:
1 UID:20130724T153322Z--922125579@atlassianzzz.zzz.edu SUMMARY Richard Smichard
2 UID:20130724T153322Z--922125579@atlassianzzz.zzz.edu DESCRIPTION Primary
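To filter as in the original question, you could wrap that select (a sketch reusing the same w1 and w2 CTEs as above):

with w1 as ( ... ), w2 as ( ... )  -- as defined above
select *
from (
  select regexp_substr(w1.text, 'UID.*') id, w2.label,
         substr(regexp_substr(w1.text, w2.label || '.*'),
                instr(regexp_substr(w1.text, w2.label || '.*'), ':') + 1) spl
  from w1, w2
)
where spl like '%Rich%';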