BigQuery: joining a field and unnested JSON returns only the left-hand side table - SQL

I need help getting a select statement that returns both the normal text records and the unnested JSON answers.
I am getting only the left-hand normal text records. Am I missing something?
#standardSQL
CREATE TEMP FUNCTION jsonunnest(input STRING)
RETURNS ARRAY<STRING>
LANGUAGE js AS """
return JSON.parse(input).map(j=>JSON.stringify(j));
""";
WITH `Impact_JSON` AS (
SELECT
Impact_Question_id,
Impact_Question_text, json,
ROW_NUMBER() OVER (PARTITION BY bdmp_id, DATE(Impact_Question_aktualisiert_am_ts)
ORDER BY
Impact_Question_aktualisiert_am_ts DESC) AS ROW
FROM
`<project.dataset.table>` basetable
),
json_answers AS (
SELECT
regexp_replace(SPLIT(ANY_VALUE(JSON_EXTRACT_SCALAR(Impact, '$.Impact_antwort_id')),'_')[SAFE_OFFSET(1)], "[^0-9]+"," " ) AS Interview_ID,
regexp_replace(SPLIT(ANY_VALUE(JSON_EXTRACT_SCALAR(Impact, '$.Impact_antwort_id')),'_')[SAFE_OFFSET(3)], "[^0-9]+"," " ) AS Quest_ID,
STRING_AGG(DISTINCT(JSON_EXTRACT_SCALAR(Impact, '$.Impact_antwort_id')), ',\n')
AS Impact_antwort_id,
STRING_AGG(DISTINCT(JSON_EXTRACT_SCALAR(Impact, '$.Impact_antwort_daten_typ')),',\n')
AS Impact_reply_data_type,
IFNULL(JSON_EXTRACT_SCALAR(Impact, '$.Impact_topic_text'), 'Empty') AS Impact_topic_text,
IFNULL(JSON_EXTRACT_SCALAR(Impact, '$.Impact_reply_get'), 'Empty') AS Impact_reply_get,
FROM `Impact_JSON`,
UNNEST(jsonunnest(JSON_EXTRACT(json, '$.reply'))) Impact
GROUP by 5,6
),
Impact_Question_id_TBL AS (
select Impact_Question_id from `Impact_JSON` AS C
)
SELECT
Impact_Question_id
FROM
`Impact_JSON` AS T
left join
json_answers as J
ON
(SAFE_CAST(J.Interview_ID as INT64))
=
T.Impact_Question_id
Both the left-hand side records and the right-hand side records should be captured in the same result table.

What do you expect as output?
For me the udf did not work. Apart from that, I generated some sample data and shortened your query for testing - that would have been a good starting point for the question!
I also changed the join to a full outer join.
For each dataset there is a number in the column Impact_Question_id. There is a json column containing a nested array data structure. The json data is unnested and grouped by the column Impact_reply_get. The first part of the column Impact_antwort_id is extracted and named Interview_ID. Because of the grouping, you select a random value. Then you join this to the master table on the column Impact_Question_id.
The random selection by ANY_VALUE of the column Impact_antwort_id (Interview_ID) could be the cause of the mismatch. I would also group by this value, at the risk of double matches.
#standardSQL
CREATE TEMP FUNCTION jsonunnest(input STRING)
RETURNS ARRAY<STRING>
LANGUAGE js AS """
try {
return [].concat(JSON.parse(input) || []).map(j => JSON.stringify(j));
} catch (e) {
// fall back to the raw input if it cannot be parsed as JSON
return ["no", input];
}
""";
WITH
basetable as (Select row_number() over () Impact_Question_id,
"txt" as Impact_Question_text,
json
#1 as bdmp_id,
#current_date() Impact_Question_aktualisiert_am_ts,
from unnest([ '{"reply":[{"Impact_antwort_id":"anytext_2","Impact_reply_get":"ok test 2"}]}','{"reply":[{"Impact_antwort_id":"anytext_3"}]}']) as json
),
`Impact_JSON` AS (
SELECT
Impact_Question_id,
Impact_Question_text, json,
#ROW_NUMBER() OVER (PARTITION BY bdmp_id, DATE(Impact_Question_aktualisiert_am_ts)
# ORDER BY
# Impact_Question_aktualisiert_am_ts DESC) AS ROW
FROM
basetable
),
json_answers AS (
SELECT
regexp_replace(SPLIT(ANY_VALUE(JSON_EXTRACT_SCALAR(Impact, '$.Impact_antwort_id')),'_')[SAFE_OFFSET(1)], "[^0-9]+"," " ) AS Interview_ID,
# regexp_replace(SPLIT(ANY_VALUE(JSON_EXTRACT_SCALAR(Impact, '$.Impact_antwort_id')),'_')[SAFE_OFFSET(3)], "[^0-9]+"," " ) AS Quest_ID,
# STRING_AGG(DISTINCT(JSON_EXTRACT_SCALAR(Impact, '$.Impact_antwort_id')), ',\n')
# AS Impact_antwort_id,
#STRING_AGG(DISTINCT(JSON_EXTRACT_SCALAR(Impact, '$.Impact_antwort_daten_typ')),',\n')
# AS Impact_reply_data_type,
# IFNULL(JSON_EXTRACT_SCALAR(Impact, '$.Impact_topic_text'), 'Empty') AS Impact_topic_text,
IFNULL(JSON_EXTRACT_SCALAR(Impact, '$.Impact_reply_get'), 'Empty') AS Impact_reply_get,
string_agg(Impact) as impact_parsed_json,
FROM `Impact_JSON`,
UNNEST(jsonunnest(JSON_EXTRACT(json, '$.reply'))) Impact
GROUP by 2 #5,6
)
SELECT
*
FROM
`Impact_JSON` AS T
full join
json_answers as J
ON
(SAFE_CAST(J.Interview_ID as INT64))
=
T.Impact_Question_id
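To see what the join key extraction produces for the sample rows, here is a quick sketch reusing the expression from json_answers:
select id,
safe_cast(regexp_replace(split(id, '_')[safe_offset(1)], '[^0-9]+', ' ') as int64) as interview_id
from unnest(['anytext_2', 'anytext_3']) as id
With this sample the Interview_IDs are 2 and 3 while Impact_Question_id is 1 and 2, so only one pair matches - the full outer join then keeps the unmatched rows from both sides, which the original left join would not do for the right-hand side.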

Related

Geography function over a column

I am trying to use the st_makeline() function in order to create lines between every point and the next one, all from a single column.
Do I need to create another column that already contains the 2 points?
with t1 as(
SELECT *, ST_GEOGPOINT(cast(long as float64) , cast(lat as float64)) geometry FROM `my_table.faissal.trajets_flix`
where id = 1
order by index_loc
)
select index_loc, geometry
from t1
Here are the results
Thanks for your help
You seem to want to write this code:
https://cloud.google.com/bigquery/docs/reference/standard-sql/geography_functions#st_makeline
WITH t1 as (
SELECT *, ST_GEOGPOINT(cast(long as float64), cast(lat as float64)) geometry
FROM `my_table.faissal.trajets_flix`
-- WHERE id = 1
)
SELECT id, ST_MAKELINE(ARRAY_AGG(geometry ORDER BY index_loc)) traj
FROM t1
GROUP BY id;
with output:
When visualized on the map.
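Note that the ORDER BY index_loc moved inside ARRAY_AGG: ST_MAKELINE connects the vertices in array order, so ordering while aggregating - rather than in an outer ORDER BY - is what makes the line follow the trajectory.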
Also consider the simple and cheap option below
select st_geogfromtext(format('linestring(%s)',
string_agg(long || ' ' || lat order by index_loc))
) as path
from `my_table.faissal.trajets_flix`
where id = 1
If applied to the sample data in your question, the output is
which is visualized as
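If you want to try either version without the original table, below is a self-contained sketch (the coordinates are made up; the table and column names follow the question):
with trajets_flix as (
select 1 as id, 1 as index_loc, '2.35' as long, '48.85' as lat union all
select 1, 2, '2.36', '48.86' union all
select 1, 3, '2.37', '48.87'
)
select id,
st_makeline(array_agg(st_geogpoint(cast(long as float64), cast(lat as float64)) order by index_loc)) as traj
from trajets_flix
group by id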

How to convert fields to JSON in Postgresql

I have a table with the following schema (postgresql 14):
message  | sentiment | classification
---------+-----------+----------------------
any text | positive  | mobile, communication
message is just a string, phrases.
sentiment is a string, only one word.
classification is a string but can have 1 to many words, comma separated.
I would like to create a json field with these columns, like this:
{"msg":"any text", "sentiment":"positive","classification":["mobile,"communication"]}
Also, if possible, is there a way to consider the classification this way:
{"msg":"any text", "sentiment":"positive","classification 1":"mobile","classification 2" communication"}
The first part of the question is easy - Postgres provides functions for splitting a string and converting it to json:
with t(message, sentiment, classification) as (values
('any text','positive','mobile, communication')
)
select row_to_json(x.*)
from (
select t.message
, t.sentiment
, array_to_json(string_to_array(t.classification, ', ')) as classification
from t
) x
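With the sample row this returns {"message":"any text","sentiment":"positive","classification":["mobile","communication"]}. If you need the key to be msg rather than message, alias the column inside the subquery (t.message as msg).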
The second part is harder - you want the json to have a variable number of attributes, a mix of grouped and non-grouped data. I suggest unwinding all attributes and then assembling them back (note the numbered CTE is actually not needed if your real table has an id - I just needed some column to group by):
with t(message, sentiment, classification) as (values
('any text','positive','mobile, communication')
)
, numbered (id, message, sentiment, classification) as (
select row_number() over (order by null)
, t.*
from t
)
, extracted (id,message,sentiment,classification,index) as (
select n.id
, n.message
, n.sentiment
, l.c
, l.i
from numbered n
join lateral unnest(string_to_array(n.classification, ', ')) with ordinality l(c,i) on true
), unioned (id, attribute, value) as (
select id, concat('classification ', index::text), classification
from extracted
union all
select id, 'message', message
from numbered
union all
select id, 'sentiment', sentiment
from numbered
)
select json_object_agg(attribute, value)
from unioned
group by id;
DB fiddle
Use jsonb_build_object and concatenate the columns you want
SELECT
jsonb_build_object(
'msg',message,
'sentiment',sentiment,
'classification',
string_to_array(classification,','))
FROM mytable;
Demo: db<>fiddle
The second output is definitely not trivial. The SQL code would be much larger and harder to maintain - not to mention that parsing such a file also requires a little more effort.
You can use a cte to handle the flattening of the classification attributes and then perform the necessary grouping in the main query for each part of the problem:
with cte(r, m, s, k) as (
select row_number() over (order by t.message), t.message, t.sentiment, v.* from tbl t
cross join json_array_elements(array_to_json(string_to_array(t.classification, ', '))) v
)
-- first part --
select json_build_object('msg', t1.message, 'sentiment', t1.sentiment, 'classification', string_to_array(t1.classification, ', ')) from tbl t1
-- second part --
select jsonb_build_object('msg', t1.m, 'sentiment', t1.s)||('{'||t1.g||'}')::jsonb
from (select c.m, c.s, array_to_string(array_agg('"classification '||c.r||'":'||c.k), ', ') g
from cte c group by c.m, c.s) t1
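For the second shape there is also a shorter variant building the numbered keys with jsonb_object_agg over the unnested classification values - a sketch, assuming the same sample row:
with t(message, sentiment, classification) as (values
('any text','positive','mobile, communication')
)
select jsonb_build_object('msg', t.message, 'sentiment', t.sentiment)
|| (select jsonb_object_agg('classification ' || i, c)
from unnest(string_to_array(t.classification, ', ')) with ordinality u(c, i))
from t;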

BigQuery - Count how many words in array are equal

I want to count how many similar words I have in a path (which will be split at delimiter /) and return a matching array of integers.
Input data will be something like:
I want to add another column, match_count, with an array of integers. For example:
To replicate this case, this is the query I'm working with:
CREATE TEMP FUNCTION HOW_MANY_MATCHES_IN_PATH(src_path ARRAY<STRING>, test_path ARRAY<STRING>) RETURNS ARRAY<INTEGER> AS (
-- WHAT DO I PUT HERE?
);
SELECT
*,
HOW_MANY_MATCHES_IN_PATH(src_path, test_path) as dir_path_match_count
FROM (
SELECT
ARRAY_AGG(x) AS src_path,
ARRAY_AGG(y) as test_path
FROM
UNNEST([
'lib/client/core.js',
'lib/server/core.js'
]) AS x, UNNEST([
'test/server/core.js'
]) as y
)
I've tried working with ARRAY and UNNEST in the HOW_MANY_MATCHES_IN_PATH function, but I either end up with an error or an array of 4 items (in this example)
Consider the approach below
create temp function how_many_matches_in_path(src_path string, test_path string) returns integer as (
(select count(distinct src)
from unnest(split(src_path, '/')) src,
unnest(split(test_path, '/')) test
where src = test)
);
select *,
array( select how_many_matches_in_path(src, test)
from t.src_path src with offset
join t.test_path test with offset
using(offset)
) dir_path_match_count
from your_table t
If applied to the sample input data from your question
with your_table as (
select
['lib/client/core.js', 'lib/server/core.js'] src_path,
['test/server/core.js', 'test/server/core.js'] test_path
)
output is
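To sanity-check the function on a single pair (a direct call, assuming the temp function above is defined in the same script):
select how_many_matches_in_path('lib/server/core.js', 'test/server/core.js') as matches
-- both paths split on '/' into ['lib','server','core.js'] and ['test','server','core.js'];
-- 'server' and 'core.js' coincide, so this returns 2.
-- 'lib/client/core.js' shares only 'core.js' with the test path, giving 1,
-- hence dir_path_match_count = [1, 2] for the sample above.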

pivot multi-level nested fields in bigquery

My bq table schema:
Continuing this post: bigquery pivoting with nested field
I'm trying to flatten this table. I would like to unnest the timeseries.data fields, i.e. the final number of rows should be equal to the total length of the timeseries.data arrays. I would also like to add annotation.properties.key with a certain value as additional columns, and annotation.properties.value as their value. So in this case, it would be the "margin" column. However, the following query gives me the error "Unrecognized name: data", even though after the last FROM I already have: unnest(timeseries.data) as data.
Desired output:
flow_timestamp, channel_name, number_of_digits, timestamp, value, margin
2019-10-31 15:31:15.079674 UTC, channel_1, 4, 2018-02-28T02:00:00, 50, 0.01
query:
SELECT
flow_timestamp, timeseries.channel_name,
( SELECT MAX(IF(channel_properties.key = 'number_of_digits', channel_properties.value, NULL))
FROM UNNEST(timeseries.channel_properties) AS channel_properties
),
data.timestamp ,data.value
,(with subq as (select * from unnest(data.annotation))
select max(if (properties.key = 'margin', properties.value, null))
from (
select * from unnest(subq.properties)
) as properties
) as margin
FROM my_table
left join unnest(timeseries.data) as data
WHERE DATE(flow_timestamp) between "2019-10-28" and "2019-11-02"
order by flow_timestamp
Try below
#standardSQL
SELECT
flow_timestamp,
timeseries.channel_name,
( SELECT MAX(IF(channel_properties.key = 'number_of_digits', channel_properties.value, NULL))
FROM UNNEST(timeseries.channel_properties) AS channel_properties
) AS number_of_digits,
item.timestamp,
item.value,
( SELECT MAX(IF(prop.key = 'margin', prop.value, NULL))
FROM UNNEST(item.annotation) AS annot, UNNEST(annot.properties) prop
) AS margin
FROM my_table
LEFT JOIN UNNEST(timeseries.data) item
WHERE DATE(flow_timestamp) BETWEEN '2019-10-28' AND '2019-11-02'
ORDER BY flow_timestamp
Below is a little less verbose version of the same solution, but I usually prefer the one above as it is simpler to maintain
#standardSQL
SELECT
flow_timestamp,
timeseries.channel_name,
( SELECT MAX(IF(key = 'number_of_digits', value, NULL))
FROM UNNEST(timeseries.channel_properties) AS channel_properties
) AS number_of_digits,
timestamp,
value,
( SELECT MAX(IF(key = 'margin', value, NULL))
FROM UNNEST(annotation), UNNEST(properties)
) AS margin
FROM my_table
LEFT JOIN UNNEST(timeseries.data)
WHERE DATE(flow_timestamp) BETWEEN '2019-10-28' AND '2019-11-02'
ORDER BY flow_timestamp
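The LEFT JOIN UNNEST is what keeps rows whose timeseries.data array is empty or NULL; a minimal sketch with hypothetical inline data:
select id, item
from (
select 1 as id, [10, 20] as arr union all
select 2, cast([] as array<int64>)
)
left join unnest(arr) as item
-- id 1 yields two rows (10 and 20); id 2 survives with item = NULL,
-- whereas a comma (cross) join would drop it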

ORACLE SQL Pivot Issue

I am trying to pivot a SQL result, and I need to do it all in one query. The query below is telling me "invalid identifier" for header_id. I am using an Oracle database.
Code
Select * From (
select ppd.group_id,g.group_name, ct.type_desc,ht.hos_cat_descr
from item_history ih, item ci, contract ppd,
header ch, group g, cd_std_type ct, cd_hos h,
cd_std_hospital_cat ht
where ih.item_id = ci.item_id
and ih.header_id = ch.header_id
and ci.hos_id = h.hos_id
and ih.item_id = ci.item_id
and ch.user_no = ppd.user_no
and ppd.group_id = g.group_id
and ch.header_type = ct.header_type_id
and ci.hos_id = h.hos_id
and h.cat_id = ht.cat_id
)
Pivot
(
count(distinct header_id) as Volume
For hos_cat_descr IN ('A')
)
Your inner query doesn't have header_id in its projection, so the pivot clause doesn't have that column available to use. You need to add it, either as:
Select * From (
select ppd.group_id,g.group_name, ct.type_desc,ht.hos_cat_descr,ih.header_id
---------------------------------------------------------------^^^^^^^^^^^^^
from ...
)
Pivot
(
count(distinct header_id) as Volume
For hos_cat_descr IN ('A')
)
or:
Select * From (
select ppd.group_id,g.group_name, ct.type_desc,ht.hos_cat_descr,ch.header_id
---------------------------------------------------------------^^^^^^^^^^^^^
from ...
)
Pivot
(
count(distinct header_id) as Volume
For hos_cat_descr IN ('A')
)
It doesn't really matter which, since those two values must be equal as they are part of a join condition.
You could achieve the same thing with simpler aggregation instead of a pivot, but presumably you are doing more work in the pivot really.
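In case you do not need the pivot for anything else, the conditional-aggregation equivalent would look roughly like this (a sketch reusing the corrected inner query; only category 'A' is counted, matching your IN list):
Select group_id, group_name, type_desc,
count(distinct case when hos_cat_descr = 'A' then header_id end) as a_volume
From (
select ppd.group_id, g.group_name, ct.type_desc, ht.hos_cat_descr, ih.header_id
from ...
)
Group by group_id, group_name, type_desc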