I am trying to access the data stored in a snowflake table using python sql. Below is the columns given below i want to access - sql

Below is the data-sample and i want to access columns value,start. This data i dumped in one column(DN) of a table (stg)
{
"ok": true,
"metrics": [
{
"name": "t_in",
"data": [{"value": 0, "group": {"start": "00:00"}}]
},
{
"name": "t_out",
"data": [{"value": 0,"group": {"start": "00:00"}}]
}
]
}
##consider many lines stored in same column in different rows.
Below query only fetched data for name. I want to access other columns value also. This query is a part of python script.
select
replace(DN : metrics[0].name , '"' , '')as metrics_name, #able to get
replace(DN : metrics[2].data , '"' , '')as metrics_data_value,##suggestion needed
replace(DN : metrics.data.start, '"','') as metrics_start, ##suggestion needed
replace(DN : metrics.data.group.finish, '"','') as metrics_finish, ##suggestion needed
from stg
Do i need to iterate over data and group? If yes, please suggest the code.

Here is an example of how to query that data.
Set up sample data:
create or replace transient table test_db.public.stg (DN variant);
insert overwrite into test_db.public.stg (DN)
select parse_json('{
"ok": true,
"metrics": [
{
"name": "t_in",
"data": [
{"value": 0, "group": {"start": "00:00"}}
]
},
{
"name": "t_out",
"data": [
{"value": 0,"group": {"start": "00:00"}}
]
}
]
}');
Select statement example:
select
DN:metrics[0].name::STRING,
DN:metrics[1].data,
DN:metrics[1].data[0].group.start::TIME,
DN:metrics[1].data[0].group.finish::TIME
from test_db.public.stg;
Instead of querying individual indexes of the JSON arrays, I think you'll want to use the flatten function which is documented here.
Here is how you do it with the flatten which is what I am guessing you want:
select
mtr.value:name::string,
dta.value,
dta.value:group.start::string,
dta.value:group.finish::string
from test_db.public.stg stg,
lateral flatten(input => stg.DN:metrics) mtr,
lateral flatten(input => mtr.value:data) dta

Related

Querying over PostgreSQL JSONB column

I have a table "blobs" with a column "metadata" in jsonb data-type,
Example:
{
"total_count": 2,
"items": [
{
"name": "somename",
"metadata": {
"metas": [
{
"id": "11258",
"score": 6.1,
"status": "active",
"published_at": "2019-04-20T00:29:00",
"nvd_modified_at": "2022-04-06T18:07:00"
},
{
"id": "9251",
"score": 5.1,
"status": "active",
"published_at": "2018-01-18T23:29:00",
"nvd_modified_at": "2021-01-08T12:15:00"
}
]
}
]
}
I want to identify statuses in the "metas" array that match with certain, given strings. I have tried the following so far but without results:
SELECT * FROM blobs
WHERE metadata is not null AND
(
SELECT count(*) FROM jsonb_array_elements(metadata->'metas') AS cn
WHERE cn->>'status' IN ('active','reported')
) > 0;
It would also be sufficient if I could compare the string with "status" in the first array object.
I am using PostgreSQL 9.6.24
for some clarity I usually break code into series of WITH statements. My idea for your problem would be to use json path (https://www.postgresql.org/docs/12/functions-json.html#FUNCTIONS-SQLJSON-PATH) and function jsonb_path_query.
Below code gives a list of counts, I will leave the rest to you, to get final data.
I've added ID column just to have something to join on. Otherwise join on metadata.
Also, note additional " in where condition. Left join in blob_ext is there just to have null value if metadata is not present or that path does not work.
with blob as (
select row_number() over()"id", * from (VALUES
(
'{
"total_count": 2,
"items": [
{
"name": "somename",
"metadata": {
"metas": [
{
"id": "11258",
"score": 6.1,
"status": "active",
"published_at": "2019-04-20T00:29:00",
"nvd_modified_at": "2022-04-06T18:07:00"
},
{
"id": "9251",
"score": 5.1,
"status": "active",
"published_at": "2018-01-18T23:29:00",
"nvd_modified_at": "2021-01-08T12:15:00"
}
]
}
}
]}'::jsonb),
(null::jsonb)) b(metadata)
)
, blob_ext as (
select bb.*, blob_sts.status
from blob bb
left join (
select
bb2.id,
jsonb_path_query (bb2.metadata::jsonb, '$.items[*].metadata.metas[*].status'::jsonpath)::character varying "status"
FROM blob bb2
) as blob_sts ON
blob_sts.id = bb.id
)
select bbe.id, count(*) cnt, bbe.metadata
from blob_ext bbe
where bbe.status in ('"active"', '"reported"')
group by bbe.id, bbe.metadata;
A way is to peel one layer at a time with jsonb_extract_path() and jsonb_array_elements():
with cte_items as (
select id,
metadata,
jsonb_extract_path(jx.value,'metadata','metas') as metas
from blobs,
lateral jsonb_array_elements(jsonb_extract_path(metadata,'items')) as jx),
cte_metas as (
select id,
metadata,
jsonb_extract_path_text(s.value,'status') as status
from cte_items,
lateral jsonb_array_elements(metas) s)
select distinct
id,
metadata
from cte_metas
where status in ('active','reported');

How to understand the field "groups" and the agg "GROUPING" in EnumerableAggregate

I am new to Calcite and I am using Calcite to convert a SQL query to an optimized plan, where I will translate the plan to a dataflow graph in an execution engine. One challenge is the translation of different RelNodes (e.g., Filter, Project, Aggregate, Calc, etc). I found a difficulty in understanding the EnumerableAggregate RelNode. Specifically, for the following example, where I defined a table T as
create table T (src int, dst int, label int, time int);
and wrote a toy query as
select count(distinct dst), sum(distinct label), count(*)
from T
where dst > 1
group by src
having src = 0;
I will obtain an optimized plan which contains two EnumerableAggregate RelNodes and here is the first EnumerableAggregate RelNode:
{
"id": "2",
"relOp": "org.apache.calcite.adapter.enumerable.EnumerableAggregate",
"group": [ 0, 1, 2 ],
"groups": [
[ 0, 1 ], [ 0, 2 ], [ 0 ]
],
"aggs": [
{
"agg": {
"name": "COUNT",
"kind": "COUNT",
"syntax": "FUNCTION_STAR"
},
"type": {
"type": "BIGINT",
"nullable": false
},
"distinct": false,
"operands": [],
"name": "EXPR$2"
},
{
"agg": {
"name": "GROUPING",
"kind": "GROUPING",
"syntax": "FUNCTION"
},
"type": {
"type": "BIGINT",
"nullable": false
},
"distinct": false,
"operands": [ 0, 1, 2 ],
"name": "$g"
}
]
}
I think I understand the reason why there are two Aggregate RelNodes. The reason is due to the use of distinct on dst in count and the use of distinct on label in sum in the query, where the optimizer wants to first group the data by (1) the group key src and (2) the two distinct columns (dst and label), in order to remove duplications. Then in the second Aggregate we calculate count and sum,
What I do not understand is how the first Aggregate processes the input data, what does the field groups (i.e., [0, 1], [0, 2] and [0]) do, what does the agg function GROUPING do and how many columns are there in the ouput of the first Aggregate.
For example, given the following input data: [[2,3,4,0], [2,3,4,1], [3,2,4,2], [3,2,4,3], [5,6,7,4], [5,6,7,5]], I think the data will be firstly divided into three groups: [[2,3,4,0], [2,3,4,1]] and [[3,2,4,2], [3,2,4,3]] and [[5,6,7,4], [5,6,7,5]]. But what is the next step?
Any help would be appreciated. Thanks!

Average of numeric values in Postgres JSON column

I have a jsonb column with the following structure:
{
"key1": {
"type": "...",
"label": "...",
"variables": [
{
"label": "Height",
"value": 131315.9289,
"variable": "myVar1"
},
{
"label": "Width",
"value": 61085.7525,
"variable": "myVar2"
}
]
},
}
I want to query for the average height across all rows. The top-level key values are unknown, so I have something like this:
select id,
avg((latVars ->> 'value')::numeric) as avg
from "MyTable",
jsonb_array_elements((my_json_field->jsonb_object_keys(my_json_field)->>'variables')::jsonb) as latVars
where my_json_field is not null
group by id;
It's throwing the following error:
ERROR: set-returning functions must appear at top level of FROM
Moving the jsonb_array_elements function above MyTable in the FROM clause doesn't work.
I'm following the basic advice found in this SO answer to no avail.
Any advice?
jsonb_array_elements is not relevant until my_json_field is a json array at the top level.
You can use instead the jsonb_path_query function based on the jsonpath language if postgres >= 12 :
select id
, avg(v.value :: numeric) as avg
from "MyTable"
, jsonb_path_query(my_json_field, '$.*.variables[*] ? (#.label == "Height").value') AS v(value)
where my_json_field is not null
group by id;

JSON Parsing in Snowflake - Square Brackets At Start

I'm trying to parse out some JSON files in snowflake. In this case, I'd like to extract the "gift card" from the line that has "fulfillment_service": "gift_card". I've had success querying one dimensional JSON data, but this - with the square brackets - is confounding me.
Here's my simple query - I've created a small table called "TEST_WEEK"
select line_items:fulfillment_service
from TEST_WEEK
, lateral flatten(FULFILLMENTS:line_items) line_items;
Hopefully this isn't too basic a question. I'm very new with parsing JSON.
Thanks in advance!
Here's the start of the FULLFILLMENTS field with the info I want to get at.
[
{
"admin_graphql_api_id": "gid://shopify/Fulfillment/2191015870515",
"created_at": "2020-08-10T14:54:38Z",
"id": 2191015870515,
"line_items": [
{
"admin_graphql_api_id": "gid://shopify/LineItem/5050604355635",
"discount_allocations": [],
"fulfillable_quantity": 0,
"fulfillment_service": "gift_card",
"fulfillment_status": "fulfilled",
"gift_card": true,
"grams": 0,
"id": 5050604355635,
"name": "Gift Card - $100.00",
"origin_location": {
"address1": "100 Indian Road",
"address2": "",
"city": "Toronto",
"country_code": "CA",
Maybe you can use two lateral flatten to process values in line_items array:
Sample table:
create table TEST_WEEK( FULFILLMENTS variant ) as
select parse_json(
'[
{
"admin_graphql_api_id": "gid://shopify/Fulfillment/2191015870515",
"created_at": "2020-08-10T14:54:38Z",
"id": 2191015870515,
"line_items": [
{
"admin_graphql_api_id": "gid://shopify/LineItem/5050604355635",
"discount_allocations": [],
"fulfillable_quantity": 0,
"fulfillment_service": "gift_card",
"fulfillment_status": "fulfilled",
"gift_card": true,
"grams": 0,
"id": 5050604355635,
"name": "Gift Card - $100.00",
"origin_location": {
"address1": "100 Indian Road",
"address2": "",
"city": "Toronto",
"country_code": "CA"
}
}
]
}
]');
Sample query:
select s.VALUE:fulfillment_service
from TEST_WEEK,
lateral flatten( FULFILLMENTS ) f,
lateral flatten( f.VALUE:line_items ) s;
The output:
+-----------------------------+
| S.VALUE:FULFILLMENT_SERVICE |
+-----------------------------+
| "gift_card" |
+-----------------------------+
Those square brackets indicate that you have an array of JSON objects in your FULLFILLMENTS field. Unless there is a real need to have an array of objects in one field you should have a look at the STRIP_OUTER_ARRAY property of the COPY command. An example can be found here in the Snowflake documentation:
copy into <table>
from #~/<file>.json
file_format = (type = 'JSON' strip_outer_array = true);
In case others are stuck with same data issue (all json data in one array), I have this solution:
select f.VALUE:fulfillment_service::string
from TEST_WEEK,
lateral flatten( FULFILLMENTS[0].line_items ) f;
With this, you just grab the first element of the array (which is the only element).
If you have nested array elements, just add this to the lateral flatten:
, RECURSIVE => TRUE, mode => 'array'

Create table and query json data using Amazon Athena?

I want to query JSON data of format using Amazon Athena:
[{"id":"0581b7c92be",
"key":"0581b7c92be",
"value":{"rev":"1-ceeeecaa040"},
"doc":{"_id":"0581b7c92be497d19e5ab51e577ada12","_rev":"1ceeeecaa04","node":"belt","DeviceId":"C001"}},
{"id":"0581b7c92be49",
"key":"0581b7c92be497d19e5",
"value":{"rev":"1-ceeeecaa04031842d3ca"},
"doc":{"_id":"0581b7c92be497","_rev":"1ceeeecaa040318","node":"belt","DeviceId":"C001"}
}
]
Athena DDL is based on Hive, so u will want each json object in your array to be in a separate line:
{"id": "0581b7c92be", "key": "0581b7c92be", "value": {"rev": "1-ceeeecaa040"}, "doc": {"_id": "0581b7c92be497d19e5ab51e577ada12", "_rev": "1ceeeecaa04", "node": "belt", "DeviceId": "C001"} }
{"id": "0581b7c92be49", "key": "0581b7c92be497d19e5", "value": {"rev": "1-ceeeecaa04031842d3ca"}, "doc": {"_id": "0581b7c92be497", "_rev": "1ceeeecaa040318", "node": "belt", "DeviceId": "C001"} }
You might have problems with the nested fields ("value","doc"), so if you can flatten the jsons you will have it easier. (see for example: Hive for complex nested Json)