There's a count difference between a Druid native query and a Druid SQL query

I have a problem with a Druid query. I want to get the data count at hour granularity, so I used Druid SQL like this:
SELECT TIME_FLOOR(__time, 'PT1H') AS t, count(1) AS cnt FROM mydatasource GROUP BY 1
Then I got a response like this:
[
  {
    "t": "2022-08-31T09:00:00.000Z",
    "cnt": 12427
  },
  {
    "t": "2022-08-31T10:00:00.000Z",
    "cnt": 16693
  },
  {
    "t": "2022-08-31T11:00:00.000Z",
    "cnt": 16694
  },
  ...
But when I use a native query like this,
{
  "queryType": "timeseries",
  "dataSource": "mydatasource",
  "intervals": "2022-08-31T07:01Z/2022-09-01T07:01Z",
  "granularity": {
    "type": "period",
    "period": "PT1H",
    "timeZone": "Etc/UTC"
  },
  "aggregations": [
    {
      "name": "count",
      "type": "longSum",
      "fieldName": "count"
    }
  ],
  "context": {
    "skipEmptyBuckets": "true"
  }
}
I get a different result:
[
  {
    "timestamp": "2022-08-31T09:00:00.000Z",
    "result": {
      "count": 1288965
    }
  },
  {
    "timestamp": "2022-08-31T10:00:00.000Z",
    "result": {
      "count": 1431215
    }
  },
  {
    "timestamp": "2022-08-31T11:00:00.000Z",
    "result": {
      "count": 1545258
    }
  },
  ...
I want to use the result of the native query. What's the problem with my Druid SQL query? How do I write a SQL query that returns the native query's results?
I found the difference: when I use a longSum aggregation, I get the same result as the native query. So I want to know how to express the aggregation below in SQL.
"aggregations": [
{
"type": "longSum",
"name": "count",
"fieldName": "count"
}
]

I found the solution. Query like this:
SELECT TIME_FLOOR(__time, 'PT1H') AS t, sum("count") AS cnt FROM mydatasource GROUP BY 1

Given that your datasource has a "count" column, I'm assuming it comes from an ingestion that uses rollup. This means the original raw rows have been aggregated, and the "count" column contains the number of raw rows that were summarized into each aggregate row.
The native query uses the longSum aggregator over the "count" column, while your original SQL simply counts the aggregate rows.
So yes, the correct way to get the count of raw rows is SUM("count").
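For reference, a minimal sketch that returns both counts side by side (assuming the rollup ingestion described above), which makes the difference visible per hour:
SELECT
  TIME_FLOOR(__time, 'PT1H') AS t,
  COUNT(*) AS aggregate_rows,  -- rows stored after rollup
  SUM("count") AS raw_rows     -- original raw rows; matches the native longSum
FROM mydatasource
GROUP BY 1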

Related

Athena query JSON Array without struct

In Athena, how can I structure a SELECT statement to query the data below by timestamp? The data is stored as a string:
[{
  "data": [{
    "ct": "26.7"
  }, {
    "ct": "24.9",
  }, {
    "ct": "26.8",
  }],
  "timestamp": "1658102460"
}, {
  "data": [{
    "ct": "26.7",
  }, {
    "ct": "25.0",
  }],
  "timestamp": "1658102520"
}]
I tried the below, but it just came back empty:
SELECT json_extract_scalar(insights, '$.timestamp') as ts
FROM history
What I am trying to get to is returning only the data where a timestamp is between X and Y.
When I try doing this as a struct with a cross join and unnest, it's very slow, so I am trying to find another way.
json_extract_scalar will not help here because it returns only one value. Trino has vastly improved its JSON path support, but Athena runs a much older version of the Presto engine, which does not support it. So you need to cast to an array and use unnest (I removed the trailing commas from the JSON):
-- sample data
WITH dataset (json_str) AS (
  values ('[{
    "data": [{
      "ct": "26.7"
    }, {
      "ct": "24.9"
    }, {
      "ct": "26.8"
    }],
    "timestamp": "1658102460"
  }, {
    "data": [{
      "ct": "26.7"
    }, {
      "ct": "25.0"
    }],
    "timestamp": "1658102520"
  }]')
)

-- query
select mp['timestamp'] timestamp,
       mp['data'] data
from dataset,
     unnest(cast(json_parse(json_str) as array(map(varchar, json)))) as t(mp)
Output:
timestamp  | data
1658102460 | [{"ct":"26.7"},{"ct":"24.9"},{"ct":"26.8"}]
1658102520 | [{"ct":"26.7"},{"ct":"25.0"}]
After that you can apply filtering and process the data.
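For the timestamp-range filter from the question, a sketch that reuses the dataset CTE above (1658102400 and 1658102520 are stand-ins for X and Y):
-- the map values are JSON, so cast to varchar, then to bigint, before comparing
select mp['timestamp'] ts,
       mp['data'] data
from dataset,
     unnest(cast(json_parse(json_str) as array(map(varchar, json)))) as t(mp)
where cast(cast(mp['timestamp'] as varchar) as bigint)
      between 1658102400 and 1658102520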

I want to combine JSON rows in T-SQL into a single JSON row

I have a table:
id | json
1  | {"url":"url1"}
2  | {"url":"url2"}
I want to combine these into a single statement where the output is:
{
  "graphs": [
    {
      "id": "1",
      "json": [
        {
          "url": "url1"
        }
      ]
    },
    {
      "id": "2",
      "json": [
        {
          "url": "url2"
        }
      ]
    }
  ]
}
I am using T-SQL. I've noticed there is some support for this in Postgres, but I can't find much for T-SQL.
Any help would be greatly appreciated.
You need to use JSON_QUERY on the json column to ensure it is not escaped. Note that FOR JSON PATH cannot format an unnamed column expression, so the JSON_QUERY expression needs an alias:
SELECT
    id,
    JSON_QUERY('[' + [json] + ']') AS [json]
FROM YourTable t
FOR JSON PATH, ROOT('graphs');
db<>fiddle
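As a self-contained sketch you can run directly (the table variable is a hypothetical stand-in for YourTable):
-- sample data
DECLARE @YourTable TABLE (id int, [json] nvarchar(max));
INSERT INTO @YourTable VALUES
    (1, N'{"url":"url1"}'),
    (2, N'{"url":"url2"}');

-- wrap each json value in [] and name the column so FOR JSON can format it
SELECT
    t.id,
    JSON_QUERY('[' + t.[json] + ']') AS [json]
FROM @YourTable t
FOR JSON PATH, ROOT('graphs');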

Sequelize Postgres - Select fields not in groupby

Offer.findAll({
  where: {
    userId: id
  },
  attributes: [
    "productDetailId",
    "id",
    "createdAt",
    "userId",
    [Sequelize.fn("MAX", Sequelize.col("offerPrice")), "offerPrice"]
  ],
  group: ["productDetailId"],
  include: [
    {
      model: ProductDetail
    }
  ]
})
I have the above Sequelize query, which aims to find the offer with the maximum offerPrice for each ProductDetail and group the results by productDetailId. The above works in MySQL but throws an error in Postgres. I suspect this is because the select statement contains fields that are not in the group by, which Postgres does not allow, but I'm not sure how to update this.
UPDATE:
Error message:
"error": {
  "name": "SequelizeDatabaseError",
  "parent": {
    "name": "error",
    "length": 169,
    "severity": "ERROR",
    "code": "42803",
    "position": "43",
    "file": "parse_agg.c",
    "line": "1388",
    "routine": "check_ungrouped_columns_walker",
    ...
I had a similar problem, and grouping by all the columns in the select was not an option. You need to group by the column using its name qualified with the table name, so:
...
group: ["productDetail.productDetailId"],
...
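For context, the Postgres rule behind error 42803 is that every selected column that is not wrapped in an aggregate must appear in the GROUP BY. A sketch of the generated SQL (assuming Sequelize's default "Offers" table name):
-- rejected by Postgres: id, createdAt, userId are selected but not grouped
SELECT "productDetailId", "id", "createdAt", "userId", MAX("offerPrice") AS "offerPrice"
FROM "Offers"
WHERE "userId" = 1
GROUP BY "productDetailId";

-- accepted: every non-aggregated column appears in the GROUP BY
SELECT "productDetailId", "id", "createdAt", "userId", MAX("offerPrice") AS "offerPrice"
FROM "Offers"
WHERE "userId" = 1
GROUP BY "productDetailId", "id", "createdAt", "userId";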

How do I count the number of elements in an array when the elements are objects in BigQuery Standard SQL?

I have the following JSON:
{
  "CustomerId": "B0001",
  "Items": [
    {
      "ItemId": "00001",
      "ItemName": "Banana"
    },
    {
      "ItemId": "00001",
      "ItemName": "Orange"
    },
    {
      "ItemId": "00001",
      "ItemName": "apple"
    }
  ]
}
I want to count the number of items; in this case the column should return 3.
I have tried
select ARRAY_LENGTH(Items) as Number_of_items2
but this throws an error in BigQuery.
Assuming it's actually stored as a JSON string, you can try:
select ARRAY_LENGTH(SPLIT(Items, '},')) as Number_of_items2
FROM dataset.table
This relies on the specific format of the JSON, but if you need more advanced processing logic, you can use a JavaScript UDF.
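Alternatively, a sketch using JSON_EXTRACT_ARRAY (assuming the JSON string lives in a column named json_str), which parses the array properly instead of relying on the '},' delimiter:
select ARRAY_LENGTH(JSON_EXTRACT_ARRAY(json_str, '$.Items')) as Number_of_items
FROM dataset.table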

Working with arrays with BigQuery LegacySQL

Each row in my table has a field that is an array, and I'd like to get a field from the first array entry.
For example, if my row is
[
  {
    "user_dim": {
      "user_id": "123",
      "user_properties": [
        {
          "key": "content_group",
          "value": {
            "value": {
              "string_value": "my_group"
            }
          }
        }
      ]
    },
    "event_dim": [
      {
        "name": "main_menu_item_selected",
        "timestamp_micros": "1517584420597000"
      },
      {
        "name": "screen_view",
        "timestamp_micros": "1517584420679001"
      }
    ]
  }
]
I'd like to get:
user_id: 123, content_group: my_group, timestamp: 1517584420597000
As Elliott mentioned, BigQuery Standard SQL has much better support for arrays than legacy SQL, and in general the BigQuery team recommends using Standard SQL.
So, below is for BigQuery Standard SQL (including handling of the wildcard tables):
#standardSQL
SELECT
  user_dim.user_id AS user_id,
  (SELECT value.value.string_value
   FROM UNNEST(user_dim.user_properties)
   WHERE key = 'content_group'
   LIMIT 1) AS content_group,
  (SELECT event.timestamp_micros
   FROM UNNEST(event_dim) event
   WHERE name = 'main_menu_item_selected') AS ts
FROM `project.dataset.app_events_*`
WHERE _TABLE_SUFFIX BETWEEN '20180129' AND '20180202'
with result (for the dummy example from your question):
Row | user_id | content_group | ts
1   | 123     | my_group      | 1517584420597000
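If you literally want a field from the first array entry, as in the question, Standard SQL also supports positional access (a sketch against the same assumed wildcard tables):
#standardSQL
SELECT
  user_dim.user_id AS user_id,
  event_dim[OFFSET(0)].name AS first_event_name,
  event_dim[OFFSET(0)].timestamp_micros AS first_event_ts
FROM `project.dataset.app_events_*`
WHERE _TABLE_SUFFIX BETWEEN '20180129' AND '20180202'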