I am new to Calcite and am using it to convert a SQL query into an optimized plan, which I then translate into a dataflow graph in an execution engine. One challenge is translating the different RelNodes (e.g., Filter, Project, Aggregate, Calc). I am having difficulty understanding the EnumerableAggregate RelNode. Specifically, for the following example, I defined a table T as
create table T (src int, dst int, label int, time int);
and wrote a toy query as
select count(distinct dst), sum(distinct label), count(*)
from T
where dst > 1
group by src
having src = 0;
I obtain an optimized plan that contains two EnumerableAggregate RelNodes. Here is the first one:
{
"id": "2",
"relOp": "org.apache.calcite.adapter.enumerable.EnumerableAggregate",
"group": [ 0, 1, 2 ],
"groups": [
[ 0, 1 ], [ 0, 2 ], [ 0 ]
],
"aggs": [
{
"agg": {
"name": "COUNT",
"kind": "COUNT",
"syntax": "FUNCTION_STAR"
},
"type": {
"type": "BIGINT",
"nullable": false
},
"distinct": false,
"operands": [],
"name": "EXPR$2"
},
{
"agg": {
"name": "GROUPING",
"kind": "GROUPING",
"syntax": "FUNCTION"
},
"type": {
"type": "BIGINT",
"nullable": false
},
"distinct": false,
"operands": [ 0, 1, 2 ],
"name": "$g"
}
]
}
I think I understand why there are two Aggregate RelNodes. The query uses distinct on dst in count and distinct on label in sum, so the optimizer first groups the data by (1) the group key src and (2) the two distinct columns (dst and label) in order to remove duplicates. The second Aggregate then calculates count and sum.
What I do not understand is how the first Aggregate processes the input data: what the field groups (i.e., [0, 1], [0, 2] and [0]) does, what the agg function GROUPING does, and how many columns there are in the output of the first Aggregate.
For example, given the following input data: [[2,3,4,0], [2,3,4,1], [3,2,4,2], [3,2,4,3], [5,6,7,4], [5,6,7,5]], I think the data will first be divided into three groups: [[2,3,4,0], [2,3,4,1]], [[3,2,4,2], [3,2,4,3]] and [[5,6,7,4], [5,6,7,5]]. But what is the next step?
Any help would be appreciated. Thanks!
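One way to reason about the plan above is to emulate the grouping sets by hand. The following is a minimal sketch in plain Python, not Calcite code. It assumes the Aggregate's input has already been filtered and projected down to the three columns (src, dst, label), and it assumes GROUPING follows the usual SQL convention: one bit per argument, leftmost argument in the highest bit, and a bit set to 1 when that column is not part of the current grouping set. The function name and data layout are hypothetical.

```python
from collections import Counter

def aggregate_grouping_sets(rows, group, grouping_sets):
    """Emulate an Aggregate with grouping sets, COUNT(*) and GROUPING(group...).

    Output columns: one per entry in `group` (None where the column is
    rolled up), then COUNT(*), then the GROUPING bitmap ($g).
    """
    out = []
    for gset in grouping_sets:
        # Keep only the columns in this grouping set; roll up the others.
        counts = Counter(
            tuple(row[c] if c in gset else None for c in group) for row in rows
        )
        # GROUPING bitmap: leftmost argument is the highest bit, and a bit
        # is 1 when that column is NOT in the grouping set (assumption).
        g = 0
        for c in group:
            g = (g << 1) | (0 if c in gset else 1)
        for key, count in counts.items():
            out.append(key + (count, g))
    return out

# Input rows from the question, projected to (src, dst, label).
rows = [(2, 3, 4), (2, 3, 4), (3, 2, 4), (3, 2, 4), (5, 6, 7), (5, 6, 7)]
result = aggregate_grouping_sets(
    rows, group=[0, 1, 2], grouping_sets=[[0, 1], [0, 2], [0]]
)
for r in result:
    print(r)
```

Under these assumptions every output row has five columns (src, dst, label, EXPR$2, $g), and $g tells a downstream operator which grouping set a row came from: 1 for (src, dst), 2 for (src, label) and 3 for (src) alone.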
What am I trying to achieve:
I would like to have a time series chart showing the total number of members in my club at any time. This member count should be calculated by using the field "Eintrittsdatum" (joining-date) and "Austrittsdatum" (leaving-date). I’m thinking of it as a running sum - every filled field with a joining-date means +1 on the member count, every leaving-date entry is a -1.
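The running-sum idea can be sketched outside Grafana to make the intended result concrete. This is a minimal Python sketch, not a Grafana transformation. It assumes each member record has been reduced to its two date properties, and the sample members are hypothetical.

```python
from itertools import accumulate

def member_count_series(members):
    """Return sorted (date, running_member_count) pairs.

    Each member is a dict with an ISO "Eintrittsdatum" and an optional
    "Austrittsdatum" (None while the person is still a member).
    """
    deltas = {}
    for m in members:
        joined, left = m.get("Eintrittsdatum"), m.get("Austrittsdatum")
        if joined:
            deltas[joined] = deltas.get(joined, 0) + 1  # joining: +1
        if left:
            deltas[left] = deltas.get(left, 0) - 1      # leaving: -1
    dates = sorted(deltas)  # ISO dates sort chronologically as strings
    totals = accumulate(deltas[d] for d in dates)
    return list(zip(dates, totals))

# Hypothetical sample data, reduced to the two date properties.
members = [
    {"Eintrittsdatum": "2020-03-01", "Austrittsdatum": None},
    {"Eintrittsdatum": "2020-03-01", "Austrittsdatum": "2021-06-30"},
    {"Eintrittsdatum": "2020-05-15", "Austrittsdatum": None},
]
series = member_count_series(members)
print(series)
```

The key design choice is to merge joins and leaves into one signed per-day delta before accumulating, so the two event streams end up in a single series.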
Data structure
I’m calling the API of webling.ch with a secret key. This is my data structure with sample data per member:
[
{
"type": "member",
"meta": {
"created": "2020-03-02 11:33:00",
"createuser": {
"label": "Joana Doe",
"type": "user"
},
"lastmodified": "2022-12-06 16:32:56",
"lastmodifieduser": {
"label": "Joana Doe",
"type": "user"
}
},
"readonly": true,
"properties": {
"Mitglieder ID": 99,
"Anrede": "Dear",
"Vorname": "Jon",
"Name": "Doe",
"Strasse": "Doeington Street",
"Adresszusatz": null,
"PLZ": "9999",
"Ort": "Doetown",
"E-Mail": "jon.doe#doenet.net",
"Telefon Privat": null,
"Telefon Geschäft": null,
"Mobile": "099 877 54 54",
"Geschlecht": "m",
"Geburtstag": "1966-03-10",
"Mitgliedschaftstyp": "Aktivmitgliedschaft",
"Eintrittsdatum": "2020-03-01",
"Austrittsdatum": null,
"Passfoto": null,
"Wordpress Benutzername": null,
"Wohnhaft im Glarnerland": false,
"Lat": "43.1563379",
"Long": "6.0474622"
},
"parents": [
240
],
"children": {
},
"links": {
"debitor": [
2124,
3056,
3897
],
"attendee": [
2576
]
},
"id": 1815
}
]
Grafana data source
I am using the “JSON API” data source by Marcus Olsson (grafana/grafana-json-datasource on GitHub), a data source plugin for loading JSON APIs into Grafana.
Grafana v9.3.1 (89b365f8b1) on Linux
My current approach
Queries:
Query C - uses a filter on the source-API to only show entries with "Eintrittsdatum" IS NOT EMPTY
Field 1 (alias "datum") has a JSONata-Query of:
properties.Eintrittsdatum
Field 2 (alias "names") should return the full name and has a query of:
$map($.properties, function($v) {(
($v.Vorname&" "&$v.Name);
)})
Field 3 (alias "value") should return "1" for every entry and has a query of:
$map($.properties, function($v) {(
(1);
)})
Query D - uses a filter on the source-API to only show entries with "Austrittsdatum" IS NOT EMPTY
Field 1 (alias "datum") has a JSONata-Query of:
properties.Austrittsdatum
Field 2 (alias "names") should return the full name and has a query of:
$map($.properties, function($v) {(
($v.Vorname&" "&$v.Name);
)})
Field 3 (alias "value") should return "1" for every entry and has a query of:
$map($.properties, function($v) {(
(1);
)})
Here's a screenshot to clarify things
(https://zigerschlitzmakers.ch/wp-content/uploads/2023/01/ScreenshotGrafana-1.png)
Transformations:
My applied transformations
(https://zigerschlitzmakers.ch/wp-content/uploads/2023/01/ScreenshotGrafana-2.png)
What's working
I can correctly gather the number of members added/subtracted per day.
What's not working
I can't get the graph to display the way I want: I'd like to have a running sum of these numbers instead of the following two graphs.
Time series graph with merged queries
(https://zigerschlitzmakers.ch/wp-content/uploads/2023/01/ScreenshotGrafana-3.png)
Time series graph with unmerged queries
(https://zigerschlitzmakers.ch/wp-content/uploads/2023/01/ScreenshotGrafana-4.png)
I can't get the names to display within the tooltip of the data points (really not THAT necessary).
I have been banging my head here for the past 2 hours with all the available JSON_... functions in BigQuery. I've read quite a few questions here, but no matter what I try, I never succeed in extracting the "amounts" from my JSON below.
This is my JSON stored in a BQ column:
{
"lines": [
{
"id": "70223039-83d6-463d-a482-7ce4d50bf0fc",
"charges": [
{
"type": "price",
"amount": 50.0
},
{
"type": "discount",
"amount": -40.00
}
]
},
{
"id": "70223039-83d6-463d-a482-7ce4d50bf0fc",
"charges": [
{
"type": "price",
"amount": 20.00
},
{
"type": "discount",
"amount": 0.00
}
]
}
]
}
Imagine the above being an order containing multiple items.
I am trying to get a sum of all amounts => 50-40+20+0. The result needs to be 30 = the total order price.
Is it possible to pull all the amount values and then have them summed up just via SQL without any custom JS functions? I guess the summing is the easy part - getting the amounts into an array is the challenge here.
Use below
select (
select sum(cast(json_value(charge, '$.amount') as float64))
from unnest(json_extract_array(order_as_json, '$.lines')) line,
unnest(json_extract_array(line, '$.charges')) charge
) total
from your_table
If applied to the sample data in your question, the output is a single row with total = 30.0.
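As a cross-check of the expected total, the same two-level traversal (lines, then charges) can be sketched in Python. This is only an illustration of the logic, not BigQuery code. The variable name order_as_json mirrors the column name used in the query, and the function name is hypothetical.

```python
import json

# The sample order from the question.
order_as_json = """
{
  "lines": [
    {"id": "70223039-83d6-463d-a482-7ce4d50bf0fc",
     "charges": [{"type": "price", "amount": 50.0},
                 {"type": "discount", "amount": -40.00}]},
    {"id": "70223039-83d6-463d-a482-7ce4d50bf0fc",
     "charges": [{"type": "price", "amount": 20.00},
                 {"type": "discount", "amount": 0.00}]}
  ]
}
"""

def total_amount(order_json):
    """Sum every charge amount across all lines of one order."""
    order = json.loads(order_json)
    return sum(
        charge["amount"]
        for line in order.get("lines", [])      # first unnest: lines
        for charge in line.get("charges", [])   # second unnest: charges
    )

total = total_amount(order_as_json)
print(total)
```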
Below is a data sample; I want to access the value and start fields. I dumped this data into one column (DN) of a table (stg).
{
"ok": true,
"metrics": [
{
"name": "t_in",
"data": [{"value": 0, "group": {"start": "00:00"}}]
},
{
"name": "t_out",
"data": [{"value": 0,"group": {"start": "00:00"}}]
}
]
}
##consider many lines stored in same column in different rows.
The query below only fetches the data for name. I also want to access the other values. This query is part of a Python script.
select
replace(DN : metrics[0].name , '"' , '')as metrics_name, #able to get
replace(DN : metrics[2].data , '"' , '')as metrics_data_value,##suggestion needed
replace(DN : metrics.data.start, '"','') as metrics_start, ##suggestion needed
replace(DN : metrics.data.group.finish, '"','') as metrics_finish, ##suggestion needed
from stg
Do I need to iterate over data and group? If yes, please suggest the code.
Here is an example of how to query that data.
Set up sample data:
create or replace transient table test_db.public.stg (DN variant);
insert overwrite into test_db.public.stg (DN)
select parse_json('{
"ok": true,
"metrics": [
{
"name": "t_in",
"data": [
{"value": 0, "group": {"start": "00:00"}}
]
},
{
"name": "t_out",
"data": [
{"value": 0,"group": {"start": "00:00"}}
]
}
]
}');
Select statement example:
select
DN:metrics[0].name::STRING,
DN:metrics[1].data,
DN:metrics[1].data[0].group.start::TIME,
DN:metrics[1].data[0].group.finish::TIME
from test_db.public.stg;
Instead of querying individual indexes of the JSON arrays, I think you'll want to use the flatten function which is documented here.
Here is how to do it with flatten, which I am guessing is what you want:
select
mtr.value:name::string,
dta.value,
dta.value:group.start::string,
dta.value:group.finish::string
from test_db.public.stg stg,
lateral flatten(input => stg.DN:metrics) mtr,
lateral flatten(input => mtr.value:data) dta
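Conceptually, the two lateral flatten calls behave like nested loops: one output row per (metric, data item) pair. The following Python sketch only illustrates that shape and is not Snowflake code. The function name is hypothetical, and missing keys such as finish come back as None here, much as a missing path yields NULL in the query.

```python
import json

def flatten_metrics(dn):
    """Yield (name, data_item, start, finish) for every metric/data pair."""
    for mtr in dn.get("metrics", []):       # like flatten(input => DN:metrics)
        for dta in mtr.get("data", []):     # like flatten(input => mtr.value:data)
            group = dta.get("group", {})
            yield (mtr.get("name"), dta, group.get("start"), group.get("finish"))

# The sample document from the question.
dn = json.loads("""
{
  "ok": true,
  "metrics": [
    {"name": "t_in",  "data": [{"value": 0, "group": {"start": "00:00"}}]},
    {"name": "t_out", "data": [{"value": 0, "group": {"start": "00:00"}}]}
  ]
}
""")

flat_rows = list(flatten_metrics(dn))
for row in flat_rows:
    print(row)
```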
I'm using knex to build a postgres query and have a table of recipes with a many-to-many relationship to both a table of ingredients and a table of steps (each step being a part of an instruction). I'm trying to aggregate both the steps and ingredients into their own arrays within the query. My problem is that as soon as I join the second table, both arrays lose their distinctness (i.e., table a has 2 elements and table b has 3 elements; after I join table b, both arrays have 6 elements).
I've tried using distinct but every attempt has resulted in an error being thrown.
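The duplication can be reproduced without a database. This is a minimal Python sketch of the join fan-out, using the ingredient and step values from the desired output as hypothetical rows for a single recipe. The helper name is also hypothetical.

```python
from itertools import product

# Hypothetical child rows for recipe 1 (values taken from the desired output).
ingredients = [("avacado", 24), ("asparagus", 42)]
instructions = [(1, "one"), (2, "two"), (3, "three")]

# Joining both child tables to the same recipe pairs every ingredient
# with every instruction: 2 x 3 = 6 joined rows.
joined = list(product(ingredients, instructions))

# Aggregating without DISTINCT sees each value once per joined row.
ingredients_agg = [ing for ing, _ in joined]
instructions_agg = [step for _, step in joined]

def distinct_in_order(values):
    """Drop duplicates while keeping first-seen order."""
    return list(dict.fromkeys(values))

print(len(ingredients_agg), len(instructions_agg))   # both 6
print(distinct_in_order(ingredients_agg))
print(distinct_in_order(instructions_agg))
```

Deduplicating inside each aggregate restores the original 2 and 3 elements, which is what a DISTINCT inside the aggregate call would do on the SQL side.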
Here's what I'm trying to output:
"id": 1,
"title": "sometitle",
"ingredients": [
{
"ingredient": "avacado",
"quantity": 24
},
{
"ingredient": "asparagus",
"quantity": 42
},
],
"instructions": [
{
"step": 1,
"instruction": "one"
},
{
"step": 2,
"instruction": "two"
},
{
"step": 3,
"instruction": "three"
},
]
Here's what I have so far:
knex(`recipes as r`)
.where({'r.id': 1})
.join('ingredients_list as list', {'list.recipe_id': 'r.id'})
.join('ingredients', {'list.ingredient_id': 'ingredients.id'})
.join('instructions', {'instructions.recipe_id': 'r.id'})
.select(
'r.id',
db.raw(`json_agg(json_build_object(
'ingredient', ingredients.name,
'quantity', list.quantity
)) as ingredients`),
db.raw(`json_agg(json_build_object(
'step', instructions.step_number,
'instruction', instructions.description
)) as instructions`)
)
.groupBy('r.id')
.first()
Here's the solution I came up with in case anyone else runs into this issue. I assume this works because postgres is unable to evaluate equality of json objects, whereas jsonb is a binary object. I'd love a more thorough explanation of this if somebody has one.
json_agg(distinct jsonb_build_object(...))
knex(`recipes as r`)
.where({'r.id': 1})
.join('ingredients_list as list', {'list.recipe_id': 'r.id'})
.join('ingredients', {'list.ingredient_id': 'ingredients.id'})
.join('instructions', {'instructions.recipe_id': 'r.id'})
.select(
'r.id',
// DISTINCT goes inside the aggregate call; jsonb values can be compared, json values cannot
db.raw(`json_agg(distinct jsonb_build_object(
'ingredient', ingredients.name,
'quantity', list.quantity
)) as ingredients`),
db.raw(`json_agg(distinct jsonb_build_object(
'step', instructions.step_number,
'instruction', instructions.description
)) as instructions`)
)
.groupBy('r.id')
.first()
I am having a hard time converting this simple SQL Query below into Druid:
SELECT country, city, Count(*)
FROM people_data
WHERE name = 'Mary'
GROUP BY country, city;
So I came up with this query so far:
{
"queryType": "groupBy",
"dataSource" : "people_data",
"granularity": "all",
"metric" : "num_of_pages",
"dimensions": ["country", "city"],
"filter" : {
"type" : "and",
"fields" : [
{
"type": "in",
"dimension": "name",
"values": ["Mary"]
},
{
"type" : "javascript",
"dimension" : "email",
"function" : "function(value) { return (value.length !== 0) }"
}
]
},
"aggregations": [
{ "type": "longSum", "name": "num_of_pages", "fieldName": "count" }
],
"intervals": [ "2016-07-20/2016-07-21" ]
}
The query above runs but it doesn't seem like groupBy in the Druid datasource is even being evaluated since I see people in my output with names other than Mary. Does anyone have any input on how to make this work?
The simple answer is that you cannot select arbitrary dimensions in your groupBy queries.
Strictly speaking, even the SQL query does not make sense. If for a given combination of country and city there are many different values of name and street, then how do you squeeze that into a single row? You have to aggregate them, e.g. by using the max function.
In this case you can include the same column in your data as both a dimension and a metric, e.g. name_dim and name_metric, and include a corresponding aggregation over your metric, max(name_metric).
Please note that if these columns (name etc.) have high-granularity values, that will kill Druid's roll-up feature.
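The point about squeezing many names into one row can be illustrated with a small Python sketch. It is not Druid code. The rows are hypothetical, the function name is made up for illustration, and max is used only because it is the aggregate mentioned above.

```python
from collections import defaultdict

# Hypothetical rows; several names can share one (country, city) pair.
rows = [
    {"country": "US", "city": "NYC", "name": "Mary"},
    {"country": "US", "city": "NYC", "name": "Ann"},
    {"country": "FR", "city": "Paris", "name": "Mary"},
]

def group_by_country_city(rows, name=None):
    """Group by (country, city), optionally filtering on name first.

    A non-grouped column such as name has to be reduced to one value per
    group; here that is done with max, alongside a row count.
    """
    groups = defaultdict(list)
    for r in rows:
        if name is None or r["name"] == name:   # the WHERE clause
            groups[(r["country"], r["city"])].append(r["name"])
    return {
        key: {"count": len(names), "max_name": max(names)}
        for key, names in groups.items()
    }

unfiltered = group_by_country_city(rows)
only_mary = group_by_country_city(rows, name="Mary")
print(unfiltered)
print(only_mary)
```

With the filter applied before grouping, every group contains only the requested name, so the count per (country, city) matches what the SQL query asks for.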