I'm streaming JSON input from blob storage. Most data in the JSON is stored as name/value pairs in an array. I need to send each input as a single output where each name/value pair is transposed to a column in the output. I have code that works when using the "Test" feature while editing the query. However when testing live, only the debugblob1 output receives data.
Why would the live test behave differently from the query test? Is there a better way to transpose array data to columns?
Note: The array's name/value pairs are always the same, though I don't want a solution that depends on their order always being the same, since that is out of my control.
QUERY
-- Get one row per input and array value
WITH OneRowPerArrayValue AS
(
    SELECT
        INPUT.id AS id,
        ARRAYVALUE.ArrayValue.value1 AS value1,
        ARRAYVALUE.ArrayValue.value2 AS value2
    FROM
        [inputblob] INPUT
        CROSS APPLY GetElements(INPUT.arrayValues) AS ARRAYVALUE
),
-- Get one row per input, transposing the array values to columns.
OneRowPerInput AS
(
    SELECT
        INPUT.id AS id,
        ORPAV_value1.value1 AS value1,
        ORPAV_value2.value2 AS value2
    FROM
        [inputblob] INPUT
        LEFT JOIN OneRowPerArrayValue ORPAV_value1
            ON ORPAV_value1.id = INPUT.id
            AND ORPAV_value1.value1 IS NOT NULL
            AND DATEDIFF(microsecond, INPUT, ORPAV_value1) = 0
        LEFT JOIN OneRowPerArrayValue ORPAV_value2
            ON ORPAV_value2.id = INPUT.id
            AND ORPAV_value2.value2 IS NOT NULL
            AND DATEDIFF(microsecond, INPUT, ORPAV_value2) = 0
    WHERE
        -- Keep only one row per input, instead of one row per input
        -- multiplied by the number of array values.
        ORPAV_value1.value1 IS NOT NULL
)
SELECT * INTO debugblob1 FROM OneRowPerArrayValue
SELECT * INTO debugblob2 FROM OneRowPerInput
DATA
{"id":"1","arrayValues":[{"value1":"1"},{"value2":"2"}]}
{"id":"2","arrayValues":[{"value1":"3"},{"value2":"4"}]}
See my generic example below. I believe this is what you're asking: you have a JSON object that contains an array of JSON objects.
WITH MyValues AS
(
SELECT
arrayElement.ArrayIndex,
arrayElement.ArrayValue
FROM Input as event
CROSS APPLY GetArrayElements(event.<JSON Array Name>) AS arrayElement
)
SELECT ArrayValue.Value1, CAST(ArrayValue.Value2 AS FLOAT) AS Value
INTO Output
FROM MyValues
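Applied to the data shape in your question, a sketch along these lines (untested; [output] is a placeholder sink, and it assumes the values are numeric and that a given id appears at most once per one-second window) flattens the array once and then pivots with aggregates, so the order of the name/value pairs never matters:

-- Minimal sketch, not your exact query: flatten each event's array,
-- then pivot with aggregates so the pair order is irrelevant.
WITH Flattened AS
(
    SELECT
        event.id AS id,
        arrayElement.ArrayValue.value1 AS value1,
        arrayElement.ArrayValue.value2 AS value2
    FROM [inputblob] event
    CROSS APPLY GetArrayElements(event.arrayValues) AS arrayElement
)
SELECT
    id,
    MAX(CAST(value1 AS FLOAT)) AS value1,  -- picks the element that carried value1
    MAX(CAST(value2 AS FLOAT)) AS value2   -- picks the element that carried value2
INTO [output]
FROM Flattened
GROUP BY id, TumblingWindow(second, 1)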
Related
I have a table with "Number", "Name" and "Result" columns. Result is a 2D text array, and I need to create a column named "Average" that sums all the first values of the Result array and divides the sum by 2. Can somebody help me, please? I must use CREATE FUNCTION for this. It looks like this:
Table1

Number | Name  | Result                       | Average
-------+-------+------------------------------+--------
01     | Kevin | {{2.0,10},{3.0,50}}          | 2.5
02     | Max   | {{1.0,10},{4.0,30},{5.0,20}} | 5.0
Average = ((2.0+3.0)/2) = 2.5
= ((1.0+4.0+5.0)/2) = 5.0
First of all: you should avoid storing arrays in a table (and avoid generating them in a subquery unless it is absolutely necessary). Normalize the data; it makes life much easier in nearly every use case.
Second: you should avoid multi-dimensional arrays. They are very hard to handle. See Unnest array by one level
However, in your special case you could do something like this:
demo:db<>fiddle
SELECT
    number,
    name,
    SUM(value) FILTER (WHERE idx % 2 = 1) / 2 AS average        -- 2
FROM mytable,
    unnest(avg_result) WITH ORDINALITY as elements(value, idx)  -- 1
GROUP BY number, name
unnest() expands the array elements into one element per record. But this is not a one-level expansion: it expands ALL elements in depth. To keep track of your elements, you can add an index using WITH ORDINALITY.
Because you have nested two-element arrays, the unnested data can be used as follows: you want to sum the first of every two elements, which is every second element (the ones with odd indexes). Using the FILTER clause in the aggregation lets you aggregate exactly these elements.
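To see that depth-first flattening concretely, this self-contained snippet can be run as-is; the odd ordinals (1, 3) carry the first element of each sub-array:

SELECT value, idx
FROM unnest(ARRAY[[2.0,10],[3.0,50]]) WITH ORDINALITY AS elements(value, idx);

-- value | idx
-- 2.0   | 1
-- 10    | 2
-- 3.0   | 3
-- 50    | 4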
However: if that array was the result of a subquery, you should think about doing the operation BEFORE the array aggregation (if the aggregation is really necessary at all). That makes things easier.
Assumptions:
The number column is the primary key.
The result column is of type text or varchar.
Here are the steps for your requirements:
Add the column to your table using the following query (you can skip this step if the column already exists):
alter table table1 add column average decimal;
Update the calculated value using the query below:
update table1 t1
set average = t2.value_
from
(
    select
        number,
        sum(t::decimal) / 2 as value_
    from table1
    cross join lateral unnest((result::text[][])[1:999][1]) as t
    group by 1
) t2
where t1.number = t2.number
Explanation: Here unnest((result::text[][])[1:999][1]) will return the first value of each child array (this assumes your 2D array has at most 999 child arrays; you can increase or decrease that bound as per your requirement).
DEMO
Now you can create your function as per your requirements using the above query.
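If you need it wrapped in a function (as your requirement says), a sketch could look like the following; the function name calc_average is only an example:

-- Hypothetical helper: takes the text form of the 2D array, sums the first
-- element of each child array and divides by 2.
create or replace function calc_average(result_text text)
returns decimal
language sql
as $$
    select sum(t::decimal) / 2
    from unnest((result_text::text[][])[1:999][1]) as t;
$$;

-- Usage:
update table1 set average = calc_average(result);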
I have an input table: input and one or more maptables, where input contains data for multiple identifiers and dates stacked under each other. The schemas are as follows:
#input
Id: string (might contain empty values)
Id2: string (might contain empty values)
Id3: string (might contain empty values)
Date: datetime
Value: number
#maptable_1
Id: string
Id2: string
Target_1: string
#maptable_2
Id3: string
Target_2: string
What I do now is run a pipeline that, for each date/(id, id2, id3) combination, loads the data from input and applies a left merge in Python against one or more maptables (both as DataFrames). I then stream the results to a third table named output with the schema:
#output
Id: string
Id2: string
Id3: string
Date: datetime
Value: number
Target_1: string (from maptable_1)
Target_2: string (from maptable_2)
Target_x: ...
Now I was thinking that this is not really efficient: if I change one value in a maptable, I have to redo all the pipelines for each date/(id, id2, id3) combination.
Therefore I was wondering whether it's possible to apply a left merge directly when loading the data. What would such a query look like?
In the case of multiple maptables and target columns, would it also be beneficial to do the same? Would the query not become too complex or unreadable, in particular since the id columns are not the same?
What would such a query look like?
Below is for BigQuery Standard SQL
INSERT `project.dataset.output`
SELECT *
FROM `project.dataset.input` i
LEFT JOIN `project.dataset.maptable_1` m1 USING(id, id2)
LEFT JOIN `project.dataset.maptable_2` m2 USING(id3)
In the case of multiple maptables and target columns ...
If all your map tables are the same as, or similar to, the two maps in your question, then it is just an extra LEFT JOIN for each extra map.
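For example, with a hypothetical maptable_3 keyed on a hypothetical id4 column (which would of course also have to exist in input), the pattern just grows by one join:

INSERT `project.dataset.output`
SELECT *
FROM `project.dataset.input` i
LEFT JOIN `project.dataset.maptable_1` m1 USING(id, id2)
LEFT JOIN `project.dataset.maptable_2` m2 USING(id3)
-- maptable_3 and id4 are hypothetical, only to show how the pattern scales
LEFT JOIN `project.dataset.maptable_3` m3 USING(id4)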
I am trying to calculate and retrieve some indicators from multiple tables I have in my dataset on BigQuery. I want to invoke nesting on sfam, which is a column of strings, but I can't do that for now because it can have values or be null. So the goal is to transform that column into an array/record; that's the idea that came to mind, and I have no idea how to go about doing it.
The product and cart are grouped by key_web, dat_log, univ, suniv, fam and sfam.
The data is broken down into universes referred to as univ, which are composed of sub-universes referred to as suniv. Sub-universes contain families referred to as fam, which may or may not have sub-families referred to as sfam. I want to invoke nesting on prd.sfam to reduce the resulting columns.
The data is collected from Google Analytics for insight into website traffic and user activity.
I am trying to get information and indicators about each visitor: the amount of time he/she spent on particular pages, actions taken and so on. The resulting table gives me the sum of time spent on those pages, the total number of visits for a single day, and a breakdown of which category each row belongs to, hence the univ, suniv, fam and sfam columns, which are of type string (sfam can be null since some sub-universes suniv only have families fam and don't go down to a sub-family level sfam).
dat_log: refers to the date
nrb_fp: number of views for a product page
tps_fp: total time spent on said page
I tried different methods that I found online but none worked, so I am posting my code and problem in the hope of finding guidance and a solution!
A simpler query would be:
select
prd.key_web
, dat_log
, prd.nrb_fp
, prd.tps_fp
, prd.univ
, prd.suniv
, prd.fam
, prd.sfam
from product as prd
left join cart as cart
on prd.key_web = cart.key_web
and prd.dat_log = cart.dat_log
and prd.univ = cart.univ
and prd.suniv = cart.suniv
and prd.fam = cart.fam
and prd.sfam = cart.sfam
And this is a sample result of the query for the last 6 columns in text and images:
Again, I want to get a column of arrays named sfam where I have all the string values of sfam, even nulls.
I limited the output to only the last 6 columns; the first 3 are the row number, key_web and dat_log. Each fam is composed of several sfam or none (null), and I want to be able to do nesting on either fam or sfam.
I want to get a column of array as sfam where I have all the string values of sfam even nulls.
This is not possible in BigQuery. As the documentation explains:
Currently, BigQuery has two following limitations with respect to NULLs and ARRAYs:
BigQuery raises an error if query result has ARRAYs which contain NULL elements, although such ARRAYs can be used inside the query.
That is, your result set cannot contain an array with NULL elements.
Obviously, in BigQuery you cannot output an array which holds NULLs, but if for some reason you need to preserve them, the workaround is to create an array of structs as opposed to an array of single elements.
For example (BigQuery Standard SQL), if you try to execute the query below
SELECT ['a', 'b', NULL] arr1, ['x', NULL, NULL] arr2
you will get an error: Array cannot have a null element; error in writing field arr1
Whereas if you try the query below
SELECT ARRAY_AGG(STRUCT(val1, val2)) arr
FROM UNNEST(['a', 'b', NULL]) val1 WITH OFFSET
JOIN UNNEST(['x', NULL, NULL]) val2 WITH OFFSET
USING(OFFSET)
you get this result:
Row   arr.val1   arr.val2
1     a          x
      b          null
      null       null
As you can see, with this approach you can even have both elements NULL.
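Applied to your tables, a sketch along the same lines (with the grouping keys simplified for illustration) could collect the sfam values per family as an array of structs, NULLs included:

SELECT
  prd.univ,
  prd.suniv,
  prd.fam,
  -- each element is a one-field struct, so NULL sfam values survive
  ARRAY_AGG(STRUCT(prd.sfam AS sfam)) AS sfam_arr
FROM product AS prd
GROUP BY prd.univ, prd.suniv, prd.fam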
I have the following data in a Postgres table, where data is a jsonb column. I would like to get the result as
[
{field_type: "Design", briefings_count: 1, meetings_count: 13},
{field_type: "Engineering", briefings_count: 1, meetings_count: 13},
{field_type: "Data Science", briefings_count: 0, meetings_count: 3}
]
Explanation
Use the jsonb_each_text function to extract data from the jsonb column named data. Then aggregate rows using GROUP BY to get one row for each distinct field_type. Each group also needs to include the meetings and briefings counts, which is done by selecting the maximum value with a CASE expression, giving you two separate columns for the two counts. On top of that, apply the coalesce function to return 0 instead of NULL where some information is missing; in your example that would be briefings for Data Science.
At the outer level of the statement, now that we have the results as a plain table of fields, we need to build a jsonb object per row and aggregate them all into one row. For that we use jsonb_build_object, passing it pairs that consist of a field name and a value. That leaves us with 3 rows of data, each holding a separate jsonb value. Since we want only one row (an aggregated JSON array) in the output, we apply jsonb_agg on top of that. This produces the result you're looking for.
Code
Check LIVE DEMO to see how it works.
select
jsonb_agg(
jsonb_build_object('field_type', field_type,
'briefings_count', briefings_count,
'meetings_count', meetings_count
)
) as agg_data
from (
select
j.k as field_type
, coalesce(max(case when t.count_type = 'briefings_count' then j.v::int end),0) as briefings_count
, coalesce(max(case when t.count_type = 'meetings_count' then j.v::int end),0) as meetings_count
from tbl t,
jsonb_each_text(data) j(k,v)
group by j.k
) t
You can aggregate columns like this and then insert the data into another table:
select array_agg(data)
from the_table
Or use one of the built-in JSON functions to create a new JSON array, for example jsonb_agg(expression).
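A minimal sketch of that variant, assuming the same table and jsonb column names as above:

-- collapses the jsonb value of every row into a single jsonb array
select jsonb_agg(data) as agg_data
from the_table;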
I have a huge table with a column which contains large XML documents. I want to get all the values of a particular attribute name (Surname), occurring at any point in any of the XML values.
Currently I have this query...
select distinct XmlCol.value('(//@Surname)[1]', 'varchar(200)')
from (
    select * from MyTable
) t
...it grabs the first occurrence of my desired attribute in each entry of the XML column. However, as it only grabs the first, any number of attributes may appear after that occurrence in the same XML value.
The value() function only works with a single result, hence I need to provide the [1] specifying that the first hit should be returned.
Is there a way to repeat this function to get all the hits in a piece of XML, or is there another function which takes an XPath and can return multiple values?
Illustrated example
In case the above is not clear, a simple example would be if MyTable had a single XmlCol column, with just 2 rows.
Row 1
<SimpleXML>
<ArbitraryElement Surname="Smith"/>
<ArbitraryElement>
<ArbitraryInnerElement Surname="Bauer"/>
</ArbitraryElement>
</SimpleXML>
Row 2
<SimpleXML Surname="Bond">
</SimpleXML>
Note the attribute appears at different locations and in different elements, I want it to work with any amount of nested elements.
Currently my method only hits the first element per XML entry, so it gives the output:
Smith, Bond
I'd like it to return an arbitrary amount per entry, meaning the result should be:
Smith, Bauer, Bond
You would want to use a CROSS APPLY to achieve this.
select distinct [Table].[Column].value('.', 'varchar(max)') as [Value]
from MyTable
CROSS APPLY MyTable.XmlCol.nodes('(//@Surname)') as [Table]([Column])
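A quick self-contained check against the two sample rows from your question (the table variable is only for the demo):

-- Demo data matching the question's two rows
DECLARE @MyTable TABLE (XmlCol xml);
INSERT INTO @MyTable (XmlCol) VALUES
(N'<SimpleXML>
     <ArbitraryElement Surname="Smith"/>
     <ArbitraryElement>
       <ArbitraryInnerElement Surname="Bauer"/>
     </ArbitraryElement>
   </SimpleXML>'),
(N'<SimpleXML Surname="Bond"></SimpleXML>');

-- Same CROSS APPLY pattern: every Surname attribute becomes its own row
SELECT t.c.value('.', 'varchar(200)') AS Surname
FROM @MyTable m
CROSS APPLY m.XmlCol.nodes('//@Surname') AS t(c);

-- Returns: Smith, Bauer, Bond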