Working with arrays with BigQuery LegacySQL - google-bigquery

Each row in my table has a field that is an array, and I'd like to get a field from the first array entry.
For example, if my row is
[
  {
    "user_dim": {
      "user_id": "123",
      "user_properties": [
        {
          "key": "content_group",
          "value": {
            "value": {
              "string_value": "my_group"
            }
          }
        }
      ]
    },
    "event_dim": [
      {
        "name": "main_menu_item_selected",
        "timestamp_micros": "1517584420597000"
      },
      {
        "name": "screen_view",
        "timestamp_micros": "1517584420679001"
      }
    ]
  }
]
I'd like to get
user_id: 123, content_group: my_group, timestamp: 1517584420597000

As Elliott mentioned, BigQuery Standard SQL has much better support for ARRAYs than legacy SQL, and in general the BigQuery team recommends using Standard SQL.
So, below is for BigQuery Standard SQL (including handling of wildcard tables):
#standardSQL
SELECT
  user_dim.user_id AS user_id,
  (SELECT value.value.string_value
   FROM UNNEST(user_dim.user_properties)
   WHERE key = 'content_group'
   LIMIT 1
  ) content_group,
  (SELECT event.timestamp_micros
   FROM UNNEST(event_dim) event
   WHERE name = 'main_menu_item_selected'
   LIMIT 1  -- guards against rows with more than one matching event
  ) ts
FROM `project.dataset.app_events_*`
WHERE _TABLE_SUFFIX BETWEEN '20180129' AND '20180202'
with the result (for the dummy example from your question):
Row  user_id  content_group  ts
1    123      my_group       1517584420597000
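If you literally just need the first entry of each array regardless of its key, a positional lookup also works. Below is a minimal sketch against the same assumed `project.dataset.app_events_*` tables; SAFE_OFFSET returns NULL instead of erroring when an array is empty:
#standardSQL
SELECT
  user_dim.user_id,
  -- first element of each array by position, NULL if the array is empty
  user_dim.user_properties[SAFE_OFFSET(0)].value.value.string_value AS first_property_value,
  event_dim[SAFE_OFFSET(0)].timestamp_micros AS first_event_ts
FROM `project.dataset.app_events_*`
WHERE _TABLE_SUFFIX BETWEEN '20180129' AND '20180202'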

Related

How to use PSQL to extract data from an object (inside an array inside an object inside an array)

This data is currently sitting in a single cell in our database (e.g. in the warehouse_data column of a warehouse table). I'm unable to change the structure/DB design, so I need to work with this. How would I select the name of the shirt with the largest width? In this case, I'd expect the output to be tshirt_b (without quotation marks):
{
  "wardrobe": {
    "apparel": {
      "variety": [
        {
          "data": {
            "shirt": {
              "size": {
                "width": 30
              }
            }
          },
          "names": [
            {
              "name": "tshirt_a"
            }
          ]
        },
        {
          "data": {
            "shirt": {
              "size": {
                "width": 40
              }
            }
          },
          "names": [
            {
              "name": "tshirt_b"
            }
          ]
        }
      ]
    }
  }
}
I've tried a SELECT statement and was able to get out
"names": [
{
"name": "tshirt_b"
}
]
but not much further than that, e.g.:
select jsonb_array_elements(warehouse_data #> '{wardrobe,apparel,variety}') ->> 'names'
from warehouse
where id = 1;
In this table we'd have 2 columns: one with the data and one with a unique identifier. I imagine I'd need to select size ->> 'width', order descending, and limit to 1, if that can then be made to return the entire object (with data & shirt), or perhaps do it with the max() function?
I'm really stuck so any help would be appreciated, thank you!
You'll first want to normalise the data into a relational structure:
SELECT
  (obj #>> '{data,shirt,size,width}')::int AS width,
  (obj #>> '{names,0,name}') AS name
FROM warehouse,
  jsonb_array_elements(warehouse_data #> '{wardrobe,apparel,variety}') obj
WHERE id = 1;
Then you can do your processing on that as a subquery, e.g.
SELECT name
FROM (
  SELECT
    (obj #>> '{data,shirt,size,width}')::int AS width,
    (obj #>> '{names,0,name}') AS name
  FROM warehouse,
    jsonb_array_elements(warehouse_data #> '{wardrobe,apparel,variety}') obj
  WHERE id = 1
) shirts
ORDER BY width DESC
LIMIT 1;
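An alternative, if you'd rather avoid the subquery, is Postgres's DISTINCT ON, which keeps the first row per id after sorting; a sketch under the same assumed schema:
-- keep only the widest shirt per id; DISTINCT ON picks the first row
-- for each id according to the ORDER BY
SELECT DISTINCT ON (id) (obj #>> '{names,0,name}') AS name
FROM warehouse,
  jsonb_array_elements(warehouse_data #> '{wardrobe,apparel,variety}') obj
WHERE id = 1
ORDER BY id, (obj #>> '{data,shirt,size,width}')::int DESC;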

How do I fetch the latest repeated entry of a record in bigquery?

I have a table wherein all the updates are pushed as new entries. The table's schema is something of this sort:
[
  {
    "id": "221212",
    "fieldsData": [
      {
        "key": "someDate",
        "value": "12-12-2022"
      },
      {
        "key": "someString",
        "value": "ABCDEF"
      }
    ],
    "name": "Sample data 1",
    "createdOn": "12-11-2022",
    "insertedDate": "14-11-2022",
    "updatedOn": "14-11-2022"
  },
  {
    "id": "221212",
    "fieldsData": [
      {
        "key": "someDate",
        "value": "12-12-2022"
      },
      {
        "key": "someString",
        "value": "ABCDEF"
      },
      {
        "key": "someMoreString",
        "value": "12qwwe122"
      }
    ],
    "name": "Sample data 1",
    "createdOn": "12-11-2022",
    "insertedDate": "15-11-2022",
    "updatedOn": "15-11-2022"
  }
]
It is partitioned by month using the createdOn field. The fieldsData field is generic and can have any number of records/fields as separate rows.
How do I fetch the latest entry of id = 221212 and get the repeated records of only the latest one?
I know I can use flatten, but flatten queries all the records, and that defeats the purpose of having a partitioned table.
The query I've got right now is:
select * from
(
  SELECT
    id, createdAt, createdBy, fields.key, fields.value,
    DENSE_RANK() OVER (PARTITION BY id ORDER BY insertedDate DESC) AS Rank1
  FROM `mytableName`, UNNEST(fieldsData) as fields
  WHERE createdAt IS NULL or DATE(createdAt) = CURRENT_DATE()
)
where rank1 = 1
PS: This table is going to have almost 10k records pushed every day.
Let me know if this serves your purpose.
SELECT AS value ARRAY_AGG(t ORDER BY insertedDate DESC LIMIT 1)[OFFSET(0)]
FROM `project.dataset.table` t
GROUP BY id
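Note that as written this scans the whole table. To keep partition pruning, add a filter on the partitioning column; a sketch assuming createdOn is the TIMESTAMP column the table is partitioned on, with an illustrative date range:
SELECT AS VALUE ARRAY_AGG(t ORDER BY insertedDate DESC LIMIT 1)[OFFSET(0)]
FROM `project.dataset.table` t
WHERE id = '221212'
  -- assumed partition filter; adjust the range to your retention needs
  AND DATE(createdOn) >= '2022-11-01'
GROUP BY id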

There's a count difference between Druid Native Query and Druid SQL

I have a problem with a Druid query.
I wanted to get the data count with hourly granularity,
so I used Druid SQL like this:
SELECT TIME_FLOOR(__time, 'PT1H') AS t, count(1) AS cnt FROM mydatasource GROUP BY 1
Then I got a response like this:
[
  {
    "t": "2022-08-31T09:00:00.000Z",
    "cnt": 12427
  },
  {
    "t": "2022-08-31T10:00:00.000Z",
    "cnt": 16693
  },
  {
    "t": "2022-08-31T11:00:00.000Z",
    "cnt": 16694
  },
  ...
But when using a native query like this,
{
  "queryType": "timeseries",
  "dataSource": "mydatasource",
  "intervals": "2022-08-31T07:01Z/2022-09-01T07:01Z",
  "granularity": {
    "type": "period",
    "period": "PT1H",
    "timeZone": "Etc/UTC"
  },
  "aggregations": [
    {
      "name": "count",
      "type": "longSum",
      "fieldName": "count"
    }
  ],
  "context": {
    "skipEmptyBuckets": "true"
  }
}
the result is different:
[
  {
    "timestamp": "2022-08-31T09:00:00.000Z",
    "result": {
      "count": 1288965
    }
  },
  {
    "timestamp": "2022-08-31T10:00:00.000Z",
    "result": {
      "count": 1431215
    }
  },
  {
    "timestamp": "2022-08-31T11:00:00.000Z",
    "result": {
      "count": 1545258
    }
  },
  ...
I want to get the result of the native query.
What's the problem in my Druid SQL query?
How do I write a query that matches the native query results?
I found the difference:
when using a longSum type aggregation, I get the same result as the native query.
So I want to know how to express the aggregation below in SQL:
"aggregations": [
{
"type": "longSum",
"name": "count",
"fieldName": "count"
}
]
I found the solution. Query like this:
SELECT TIME_FLOOR(__time, 'PT1H') AS t, sum("count") AS cnt FROM mydatasource GROUP BY 1
Given that your datasource has a "count" column, I'm assuming it comes from an ingestion that uses rollup. This means the original raw rows have been aggregated, and the "count" column contains the count of raw rows that were summarized into each aggregate row.
The native query is using the longSum function over the "count" column.
The original SQL you used is just counting the aggregate rows.
So yes, the correct way to get the count of raw rows is SUM("count").
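To also match the native query's interval rather than scanning all time, you can add a time filter; a sketch assuming the same mydatasource and the interval used in the native query above:
SELECT TIME_FLOOR(__time, 'PT1H') AS t, SUM("count") AS cnt
FROM mydatasource
-- restrict to the same interval the native query used
WHERE __time >= TIMESTAMP '2022-08-31 07:01:00'
  AND __time <  TIMESTAMP '2022-09-01 07:01:00'
GROUP BY 1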

BigQuery JSON Array extraction

I have this JSON
"type": "list",
"data": [
{
"id": "5bc7a3396fbc71aaa1f744e3",
"type": "company",
"url": "/companies/5bc7a3396fbc71aaa1f744e3"
},
{
"id": "5b0aa0ac6e378450e980f89a",
"type": "company",
"url": "/companies/5b0aa0ac6e378450e980f89a"
}
],
"url": "/contacts/5802b14755309dc4d75d184d/companies",
"total_count": 2,
"has_more": false
}
I want to dynamically create one column per company, holding the company ids, for example:
company_0                   company_1
5bc7a3396fbc71aaa1f744e3    5b0aa0ac6e378450e980f89a
I tried to use BigQuery's JSON functions but didn't get far with them.
Thank you.
Consider below approach
select * except(json) from (
  select json, json_extract_scalar(line, '$.id') company, offset
  from your_table t, unnest(json_extract_array(json, '$.data')) line with offset
  where json_extract_scalar(line, '$.type') = 'company'
)
pivot (any_value(company) company for offset in (0, 1))
if applied to the sample data in your question, the output is a single row with company_0 = 5bc7a3396fbc71aaa1f744e3 and company_1 = 5b0aa0ac6e378450e980f89a
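The IN list of the PIVOT has to be spelled out literally, so with a variable number of companies you would need to build the query dynamically. A sketch using BigQuery scripting, assuming the same your_table with its json column:
DECLARE offsets STRING;

-- collect every offset that occurs in the data arrays, e.g. "0,1"
SET offsets = (
  SELECT STRING_AGG(DISTINCT CAST(pos AS STRING))
  FROM your_table t, UNNEST(json_extract_array(json, '$.data')) line WITH OFFSET pos
);

-- splice the offset list into the PIVOT clause and run it
EXECUTE IMMEDIATE FORMAT("""
  SELECT * EXCEPT(json) FROM (
    SELECT json, json_extract_scalar(line, '$.id') company, pos
    FROM your_table t, UNNEST(json_extract_array(json, '$.data')) line WITH OFFSET pos
    WHERE json_extract_scalar(line, '$.type') = 'company'
  )
  PIVOT (ANY_VALUE(company) company FOR pos IN (%s))
""", offsets);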

How does Elasticsearch set the filtering in a group query

On this official page
https://www.elastic.co/guide/en/elasticsearch/reference/current/_executing_aggregations.html
they have this query:
curl -XPOST 'localhost:9200/bank/_search?pretty' -d '
{
  "size": 0,
  "aggs": {
    "group_by_state": {
      "terms": {
        "field": "state"
      }
    }
  }
}'
and then say it is similar to:
SELECT state, COUNT(*) FROM bank GROUP BY state ORDER BY COUNT(*) DESC
My question is: where is the mapping?
Is it that selecting the properties inside the terms part corresponds to the SELECT state, COUNT(*)?
And where is the code in that Elasticsearch query that says to order descending?
curl -XPOST 'localhost:9200/bank/_search?pretty' -d '
{
  "size": 0,
  "aggs": {
    "group_by_state": {
      "terms": {
        "field": "state"
      }
    }
  }
}'
Its SQL analogy, as you mentioned, is SELECT state, COUNT(*) FROM bank GROUP BY state ORDER BY COUNT(*) DESC.
We defined a terms aggregation above, which is a type of bucket aggregation. It returns a list of buckets, each containing:
Key: a unique term indexed for the given field
doc_count: the number of matching documents
The field value inside the terms aggregation defines two things:
Which field is used for grouping (state in our case; this refers to GROUP BY state in SQL)
What the key of each bucket will be (the unique values of state indexed, in our case; this refers to SELECT state in SQL)
doc_count, which refers to COUNT(*) in SQL, is returned because we are using a bucket aggregation.
The terms aggregation by default returns buckets ordered by doc_count descending, which is analogous to ORDER BY COUNT(*) DESC in SQL.
Hope this answers all your questions.
What did you mean by "mapping"?
terms is an aggregation type that returns buckets, each containing a key (in this case a state field value) and the count of that term across all retrieved docs.
The ordering by count descending is the Elasticsearch default, so it's implicit.
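If you want that ordering spelled out rather than implicit, the terms aggregation accepts an order parameter; a minimal sketch (equivalent to the default):
{
  "size": 0,
  "aggs": {
    "group_by_state": {
      "terms": {
        "field": "state",
        "order": { "_count": "desc" }
      }
    }
  }
}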
Output example:
{
  ...
  "aggregations": {
    "group_by_state": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "Florida",
          "doc_count": 10
        },
        {
          "key": "Rio de Janeiro",
          "doc_count": 8
        },
        {
          "key": "Lisbon",
          "doc_count": 5
        }
      ]
    }
  }
  ...
}