A BigQuery challenge:
input
I have a table of incoming product batches that go into the factory; multiple sensors along the way measure different defects on different parts of the individual products. We read the data out of the devices in a flat structure.
The data is written to an incoming table.
Batch_id | Sensor_id | Product_part_id | defect_id | Count_defects | Event_Date
1        | 1         | 1               | 2         | 5             | 2018-7-1
1        | 2         | 1               | 2         | 6             | 2018-7-1
1        | 2         | 2               | 3         | 7             | 2018-7-1
1        | 3         | 2               | 3         | 8             | 2018-7-1
1        | 3         | 2               | 4         | 9             | 2018-7-1
1        | 3         | 3               | 5         | 10            | 2018-7-1
We can do de-duplication on these tables, as the same sensor might spit out the same data multiple times (by mistake, or on purpose when Count_defects is updated), based on the last [updated_time].
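For instance, a de-duplication along these lines (a minimal sketch in BigQuery Standard SQL; the table name is hypothetical, and updated_time is the column mentioned above):
SELECT * EXCEPT(rn)
FROM (
  SELECT
    *,
    -- keep only the most recent reading per sensor/part/defect within a batch
    ROW_NUMBER() OVER (
      PARTITION BY Batch_id, Sensor_id, Product_part_id, defect_id
      ORDER BY updated_time DESC
    ) AS rn
  FROM `project.dataset.incoming`
)
WHERE rn = 1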
the problem: transform into multiple nested repeated structs
Now I'm trying to materialize the raw input into fact tables partitioned by Event_Date, but for maximum performance and cheapest storage I want to achieve the following structure:
Batch_id | Sensor_id | Product_part_id | defect_id | Count_defects | Event_Date
1        | 1         | 1               | 2         | 5             | 2018-7-1
         | 2         | 1               | 2         | 6             | 2018-7-1
         |           | 2               | 3         | 7             | 2018-7-1
         | 3         | 2               | 3         | 8             | 2018-7-1
         |           |                 | 4         | 9             | 2018-7-1
         |           | 3               | 5         | 10            | 2018-7-1
I cannot do multiple nested ARRAY() calls: it is not allowed, and it would also perform badly, since it would take the same base table as input multiple times.
Looking for suggestions on how to tackle this.
Thanks!
I'm using sequential application of array_agg() + GROUP BY for that, starting with the innermost array. After the first iteration I put the query into a WITH clause and start over, creating the next array again using array_agg() + GROUP BY.
Performance-wise this approach has the same constraints all GROUP BY queries have: you should avoid skewed group sizes if you can, otherwise it will just take longer, because BigQuery has to re-plan resources in the background when it realizes a group takes too much memory. But you can optimize using the query execution plan.
For your example table my result query looks like this:
WITH t AS (
  SELECT 1 AS batch_id, 1 AS sensor_id, 1 AS product_part_id, 2 AS defect_id, 5 AS count_defects, '2018-7-1' AS event_date
  UNION ALL SELECT 1, 2, 1, 2, 6, '2018-7-1'
  UNION ALL SELECT 1, 2, 2, 3, 7, '2018-7-1'
  UNION ALL SELECT 1, 3, 2, 3, 8, '2018-7-1'
  UNION ALL SELECT 1, 3, 2, 4, 9, '2018-7-1'
  UNION ALL SELECT 1, 3, 3, 5, 10, '2018-7-1'
),
defect_nesting AS (
  SELECT
    batch_id,
    sensor_id,
    product_part_id,
    ARRAY_AGG(STRUCT(defect_id, count_defects, event_date) ORDER BY defect_id) AS defectInfo
  FROM t
  GROUP BY 1, 2, 3
),
product_nesting AS (
  SELECT
    batch_id,
    sensor_id,
    ARRAY_AGG(STRUCT(product_part_id, defectInfo) ORDER BY product_part_id) AS productInfo
  FROM defect_nesting
  GROUP BY 1, 2
)
SELECT
  batch_id,
  ARRAY_AGG(STRUCT(sensor_id, productInfo) ORDER BY sensor_id) AS sensorInfo
FROM product_nesting
GROUP BY 1
The resulting JSON:
[
{
"batch_id": "1",
"sensorInfo": [
{
"sensor_id": "1",
"productInfo": [
{
"product_part_id": "1",
"defectInfo": [
{
"defect_id": "2",
"count_defects": "5",
"event_date": "2018-7-1"
}
]
}
]
},
{
"sensor_id": "2",
"productInfo": [
{
"product_part_id": "1",
"defectInfo": [
{
"defect_id": "2",
"count_defects": "6",
"event_date": "2018-7-1"
}
]
},
{
"product_part_id": "2",
"defectInfo": [
{
"defect_id": "3",
"count_defects": "7",
"event_date": "2018-7-1"
}
]
}
]
},
{
"sensor_id": "3",
"productInfo": [
{
"product_part_id": "2",
"defectInfo": [
{
"defect_id": "3",
"count_defects": "8",
"event_date": "2018-7-1"
},
{
"defect_id": "4",
"count_defects": "9",
"event_date": "2018-7-1"
}
]
},
{
"product_part_id": "3",
"defectInfo": [
{
"defect_id": "5",
"count_defects": "10",
"event_date": "2018-7-1"
}
]
}
]
}
]
}
]
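One note on the Event_Date partitioning from the question: BigQuery can only partition on a top-level DATE/TIMESTAMP column, so event_date has to be kept outside the arrays (carried through each GROUP BY) rather than only nested inside defectInfo. A hedged sketch of the materialization, with hypothetical table names and assuming event_date is stored as a DATE:
CREATE TABLE `project.dataset.fact_batches`
PARTITION BY event_date
AS
SELECT
  event_date,
  batch_id,
  ARRAY_AGG(STRUCT(sensor_id, productInfo) ORDER BY sensor_id) AS sensorInfo
FROM product_nesting  -- the CTE chain above, with event_date added to each GROUP BY
GROUP BY event_date, batch_id;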
Hope that helps!
Related
So, I have 2 tables.
Type table
id | Name
1  | General
2  | Mostly Used
3  | Low
Component table
id | Name        | typeId
1  | Component 1 | 1
2  | Component 2 | 1
4  | Component 4 | 2
6  | Component 6 | 2
7  | Component 5 | 3
There can be numerous types, but I want to get only 'General' and 'Others' as types, along with their components, as follows:
[{
"General": [{
"id": "1",
"name": "General",
"component": [{
"id": 1,
"name": "component 1",
"componentTypeId": 1
}, {
"id": 2,
"name": "component 2",
"componentTypeId": 1
}]
}],
"Others": [{
"id": "2",
"name": "Mostly Used",
"component": [{
"id": 4,
"name": "component 4",
"componentTypeId": 2
}, {
"id": 6,
"name": "component 6",
"componentTypeId": 2
}]
},
{
"id": "3",
"name": "Low",
"component": [{
"id": 7,
"name": "component 5",
"componentTypeId": 3
}]
}
]
}]
WITH CTE_TYPES AS (
  SELECT
    CASE WHEN t.name <> 'General' THEN 'Others' ELSE 'General' END AS type,
    t.id,
    t.name
  FROM type AS t
  GROUP BY type, t.id
),
CTE_COMPONENT AS (
  SELECT
    c.id,
    c.name,
    c.typeid
  FROM component c
)
SELECT
  JSON_AGG(jsonb_build_object('id', CT.id, 'name', CT.name, 'type', CT.type, 'component', CC))
FROM CTE_TYPES CT
INNER JOIN CTE_COMPONENT CC ON CT.id = CC.typeid
GROUP BY CT.type
I get 2 types from the query, as I expected, but the components are not grouped together.
Can you also point to resources to learn advanced SQL queries?
Here is a solution to get your expected result as specified in your question.
First part
The first part of the query aggregates all the components with the same typeId into a jsonb array. It also computes the new type column, with the value 'Others' for every type name different from 'General', and the value 'General' otherwise:
SELECT CASE WHEN t.name <> 'General' THEN 'Others' ELSE 'General' END AS type
, t.id, t.name
, jsonb_build_object('id', t.id, 'name', t.name, 'component', jsonb_agg(jsonb_build_object('id', c.id, 'name', c.name, 'componentTypeId', c.typeid))) AS list
FROM component AS c
INNER JOIN type AS t
ON t.id = c.typeid
GROUP BY t.id, t.name
jsonb_build_object builds a jsonb object from a set of key/value arguments.
jsonb_agg aggregates jsonb objects into a single jsonb array.
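For illustration, a minimal standalone example combining the two (generate_series just fabricates two rows):
SELECT jsonb_agg(jsonb_build_object('id', g.n, 'name', 'item ' || g.n))
FROM generate_series(1, 2) AS g(n) ;
-- [{"id": 1, "name": "item 1"}, {"id": 2, "name": "item 2"}]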
Second part
The second part of the query is much more complex because of the structure of your expected result, where the types different from 'General' are nested inside each other, together with their components, according to the typeId order; i.e. the Low type with typeId = 3 is nested inside the Mostly Used type with typeId = 2:
{ "id": "2",
, "name": "Mostly Used"
, "component": [ { "id": 4
, "name": "component 4"
, "componentTypeId": 2
}
, { ... }
, { "id": "3"
, "name": "Low" --> 'Low' type is nested inside 'Mostly Used' type
, "component": [ { "id": 7
, "name": "component 5"
, "componentTypeId": 3
}
, { ... }
]
}
]
}
To build such a nested structure with an arbitrary number of typeIds, you could write a recursive query, but I prefer here to create a user-defined aggregate function, which makes the query much simpler and more readable (see the manual). The aggregate function jsonb_set_inv_agg is based on the user-defined function jsonb_set_inv, which inserts the jsonb object x inside the existing jsonb object z according to the path p. This function relies on the standard jsonb_set function:
CREATE OR REPLACE FUNCTION jsonb_set_inv(x jsonb, p text[], z jsonb, b boolean)
RETURNS jsonb LANGUAGE sql IMMUTABLE AS
$$
SELECT jsonb_set(z, p, COALESCE(z#>p || x, z#>p), b) ;
$$ ;
CREATE AGGREGATE jsonb_set_inv_agg(p text[], z jsonb, b boolean)
( sfunc = jsonb_set_inv
, stype = jsonb
) ;
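To see what the aggregate does in isolation, here is a hypothetical two-row demo: ordered by id descending, each object is folded into the component array of the object before it:
SELECT jsonb_set_inv_agg('{component}', v.j, true ORDER BY v.id DESC)
FROM (VALUES
  (2, '{"id": 2, "component": []}'::jsonb),
  (3, '{"id": 3, "component": []}'::jsonb)
) AS v(id, j) ;
-- {"id": 2, "component": [{"id": 3, "component": []}]}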
Based on the newly created aggregate function jsonb_set_inv_agg and the jsonb_agg and jsonb_build_object standard functions already seen above, the final query is :
SELECT jsonb_agg(jsonb_build_object('General', x.list)) FILTER (WHERE x.type = 'General')
|| jsonb_build_object('Others', jsonb_set_inv_agg('{component}', x.list, true ORDER BY x.id DESC) FILTER (WHERE x.type = 'Others'))
FROM
( SELECT CASE WHEN t.name <> 'General' THEN 'Others' ELSE 'General' END AS type
, t.id, t.name
, jsonb_build_object('id', t.id, 'name', t.name, 'component', jsonb_agg(jsonb_build_object('id', c.id, 'name', c.name, 'componentTypeId', c.typeid))) AS list
FROM component AS c
INNER JOIN type AS t
ON t.id = c.typeid
GROUP BY t.id, t.name
) AS x
See the full test result in dbfiddle.
I have a query that outputs two arrays of structs:
SELECT modelId, oldClassCounts, newClassCounts
FROM `xyz`
GROUP BY 1
How do I create another column that is TRUE if oldClassCounts = newClassCounts?
Here is a sample result in JSON:
[
{
"modelId": "FBF21609-65F8-4076-9B22-D6E277F1B36A",
"oldClassCounts": [
{
"id": "A041EBB1-E041-4944-B231-48BC4CCE025B",
"count": "33"
},
{
"id": "B8E4812B-A323-47DD-A6ED-9DF877F501CA",
"count": "82"
}
],
"newClassCounts": [
{
"id": "A041EBB1-E041-4944-B231-48BC4CCE025B",
"count": "33"
},
{
"id": "B8E4812B-A323-47DD-A6ED-9DF877F501CA",
"count": "82"
}
]
}
]
I want the equality column to be TRUE if oldClassCounts and newClassCounts are exactly the same like the output above.
Anything else should be false.
I would go about it with this solution:
#standardSQL
WITH xyz AS (
  SELECT "FBF21609-65F8-4076-9B22-D6E277F1B36A" AS modelId,
  [STRUCT("A041EBB1-E041-4944-B231-48BC4CCE025B" AS id, "33" AS count),
   STRUCT("B8E4812B-A323-47DD-A6ED-9DF877F501CA" AS id, "82" AS count)] AS oldClassCounts,
  [STRUCT("A041EBB1-E041-4944-B231-48BC4CCE025B" AS id, "33" AS count),
   STRUCT("B8E4812B-A323-47DD-A6ED-9DF877F501CA" AS id, "82" AS count)] AS newClassCounts),
o AS (SELECT modelId, id, count, ARRAY_LENGTH(oldClassCounts) AS len FROM xyz, UNNEST(oldClassCounts) AS old_c),
n AS (SELECT modelId, id, count, ARRAY_LENGTH(newClassCounts) AS len FROM xyz, UNNEST(newClassCounts) AS new_c),
uneq AS (SELECT * FROM o EXCEPT DISTINCT SELECT * FROM n)
SELECT xyz.*, IF(uneq.modelId IS NOT NULL, false, true) AS equal
FROM xyz
LEFT JOIN (SELECT DISTINCT modelId FROM uneq) uneq
  ON xyz.modelId = uneq.modelId
It works regardless of the order of elements, and of duplicates within the arrays. The idea is that we treat each of the arrays as a separate temporary table, removing all elements that exist in one but not the other (using EXCEPT DISTINCT), plus an extra check on the length of the arrays in case there are duplicates, e.g.:
"FBF21609-65F8-4076-9B22-D6E277F1B36A" AS modelId,
[STRUCT("A041EBB1-E041-4944-B231-48BC4CCE025B" as id, "33" as count),
STRUCT("B8E4812B-A323-47DD-A6ED-9DF877F501CA" as id, "82" as count),
STRUCT("B8E4812B-A323-47DD-A6ED-9DF877F501CA" as id, "82" as count)]
I would consider comparing the results of the TO_JSON_STRING function applied to both of these arrays.
In the query it would be done in the following way:
SELECT modelId,
oldClassCounts,
newClassCounts,
CASE WHEN TO_JSON_STRING(oldClassCounts) = TO_JSON_STRING(newClassCounts)
THEN true
ELSE false
END
FROM `xyz`;
I'm not sure about the GROUP BY 1 part, because none of the fields are grouped or aggregated.
It is not going to work if the order of elements in the arrays differs. This solution is not perfect, but it worked for the data you provided.
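If the element order can differ between the two arrays, one workaround is to sort both arrays before serializing them; a sketch, assuming the struct field id is a usable sort key:
SELECT modelId,
       TO_JSON_STRING(ARRAY(SELECT o FROM UNNEST(oldClassCounts) AS o ORDER BY o.id)) =
       TO_JSON_STRING(ARRAY(SELECT n FROM UNNEST(newClassCounts) AS n ORDER BY n.id)) AS equal
FROM `xyz`;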
Column data is jsonb
SELECT
json_agg(shop_order)
FROM (
SELECT data from shop_order
WHERE data->'contacts'->'customer'->>'phone' LIKE '%1234567%' LIMIT 3 OFFSET 3
) shop_order
and here is the result, as an array:
[
{
"data": {
"id": 211111,
"cartCount": 4,
"created_at": "2020-10-28T12:58:33.387Z",
"modified_at": "2020-10-28T12:58:33.387Z"
}
}
]
Nice. But... I need to hide the data node.
The result must be
[
{
"id": 211111,
"cartCount": 4,
"created_at": "2020-10-28T12:58:33.387Z",
"modified_at": "2020-10-28T12:58:33.387Z"
}
]
Is it possible?
You should be able to perform a second select over the result, then specifically select the data node:
SELECT (result->'data') AS result
FROM result
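Equivalently, the aggregation itself can target the data node, so no second pass is needed; a sketch based on the original query:
SELECT json_agg(shop_order.data)
FROM (
  SELECT data
  FROM shop_order
  WHERE data->'contacts'->'customer'->>'phone' LIKE '%1234567%'
  LIMIT 3 OFFSET 3
) shop_order ;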
My json field looks something like this:
{
"Ui": [
{
"element": "TG1",
"mention": "in",
"time": 123
},
{
"element": "TG1",
"mention": "out",
"time": 125
},
{ "element": "TG2",
"mention": "in",
"time": 251
},
{
"element": "TG2",
"mention": "out",
"time": 259
},
{ "element": "TG2",
"mention": "in",
"time": 251
}
]
}
I am trying to get the sum of the time differences per element, which is as below:
| element | Timespent |
| TG1 | 2 |
| TG2 | 8 |
The problem is that, ideally, for every "in" entry there should be an "out" entry, which is clearly not the case in the above example. I want to calculate the difference only for complete in/out pairs, and to ignore any "in" value that doesn't have a corresponding "out". How can I do that?
Below is what I am using to get the time difference:
select element, sum(time) as time_spent
from my_table
cross join lateral (
select
value->>'element' as element,
case value->>'mention' when 'in' then -(value->>'time')::numeric else (value->>'time')::numeric end as time
from json_array_elements(json_column->'Ui')) as elements
group by 1
order by 1
I was not sure about your json_column attribute - you need to group by it in order not to mix values between rows, so I included it in the window partitions in the CTE part. But you don't have it in your results, so I skipped it in the final query as well. In short - you can check whether an entry's predecessor has an even order number while the entry itself has the maximum order number for its element, and then just skip it:
with logic as (
select json_column
, e->>'element' element
, case when
mod(lag(i) over (partition by json_column::text ,e->>'element' order by i),2) = 0
and
max(i) over (partition by json_column::text ,e->>'element') = i
then true else false end "skip"
, case when e->>'mention' = 'in' then -(e->>'time')::int else (e->>'time')::int end times
from my_table, json_array_elements(json_column->'Ui') with ordinality o(e,i)
)
select element, sum (times)
from logic
where not skip
group by element
;
 element | sum
---------+-----
 TG1     |   2
 TG2     |   8
(2 rows)
I'm using LegacySQL, but am not strictly limited to it (though it does have some methods I find useful; "HASH", for example).
Anyhow, the simple task is that I want to group by one top-level column, while still keeping the first instance of a nested+repeated set of data alongside.
So, the following "works", and produces nested output:
SELECT
cd,
subarray.*
FROM [magicalfairy.land]
And now I attempt to just grab the entire first subarray (honestly, I don't expect this to work of course)
The following is what doesn't work:
SELECT
cd,
FIRST(subarray.*)
FROM [magicalfairy.land]
GROUP BY cd
Any alternate approaches would be appreciated.
Edit, with an example of the expected data behaviour.
If the input data were roughly:
[
{
"cd": "something",
"subarray": [
{
"hello": 1,
"world": 1
},
{
"hello": 2,
"world": 2
}
]
},
{
"cd": "something",
"subarray": [
{
"hello": 1,
"world": 1
},
{
"hello": 2,
"world": 2
}
]
}
]
I would expect to get out:
[
{
"cd": "something",
"subarray": [
{
"hello": 1,
"world": 1
},
{
"hello": 2,
"world": 2
}
]
}
]
You'll have a much better time preserving the structure using standard SQL, e.g.:
WITH T AS (
SELECT
cd,
ARRAY<STRUCT<x INT64, y BOOL>>[
STRUCT(off, MOD(off, 2) = 0),
STRUCT(off - 1, false)] AS subarray
FROM UNNEST([1, 2, 1, 2]) AS cd WITH OFFSET off)
SELECT
cd,
ANY_VALUE(subarray) AS subarray
FROM T
GROUP BY cd;
ANY_VALUE will return some value of subarray for each group. If you wanted to concatenate the arrays instead, you could use ARRAY_CONCAT_AGG.
To run this against your table, try below:
SELECT
cd,
ANY_VALUE(subarray) AS subarray
FROM `magicalfairy.land`
GROUP BY cd
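And if you wanted the concatenation behaviour mentioned above instead of picking one array per group, a sketch swapping in ARRAY_CONCAT_AGG:
SELECT
  cd,
  ARRAY_CONCAT_AGG(subarray) AS subarray
FROM `magicalfairy.land`
GROUP BY cd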
Try below (BigQuery Standard SQL)
SELECT cd, subarray
FROM (
SELECT cd, subarray,
ROW_NUMBER() OVER(PARTITION BY cd) AS num
FROM `magicalfairy.land`
) WHERE num = 1
This gives you the expected result - the equivalent of "ANY ARRAY".
This solution can be extended to "FIRST ARRAY" by adding ORDER BY sort_col to the OVER() clause, assuming that sort_col defines the logical order; see the sketch below.
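For example (sort_col here is a placeholder for whatever column defines that order):
SELECT cd, subarray
FROM (
  SELECT cd, subarray,
    ROW_NUMBER() OVER(PARTITION BY cd ORDER BY sort_col) AS num
  FROM `magicalfairy.land`
) WHERE num = 1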