How to avoid InternalError on joins with subselects - google-bigquery

I am getting an internalError when running a query that joins two subselects. The query looks like this:
SELECT data_table.title AS name
FROM
  (
    SELECT id, title
    FROM
      [citrius-altrius-fortius:analyze.topic_735_1354233600],
      [citrius-altrius-fortius:analyze.topic_735_1354320000]
  ) AS data_table
INNER JOIN
  (
    SELECT id
    FROM
      [citrius-altrius-fortius:analyze.topic_735_1354233600],
      [citrius-altrius-fortius:analyze.topic_735_1354320000]
    WHERE UPPER(title) CONTAINS ("BUDGET")
  ) AS filter_table
ON data_table.id == filter_table.id
This is the reply I get:
{
  "status": {
    "state": "DONE",
    "errors": [
      {
        "reason": "internalError",
        "message": "Unexpected. Please try again."
      }
    ],
    "errorResult": {
      "reason": "internalError",
      "message": "Unexpected. Please try again."
    }
  },
  "kind": "bigquery#job",
  "statistics": {
    "endTime": "1354849344685",
    "startTime": "1354849344148"
  },
  "jobReference": {
    "projectId": "citrius-altrius-fortius",
    "jobId": "job_e870b1fe67b94897a157dfa4f60b9725"
  },
  "etag": "\"OLvSfkwPDZ7M36YVEFvi-PBaFQM/SnvQhZKTutJxTY1bJLqqdphkA00\"",
  "configuration": {
    "query": {
      "createDisposition": "CREATE_IF_NEEDED",
      "query": "SELECT data_table.title AS name\nFROM\n ( \n SELECT id, title\n FROM\n [citrius-altrius-fortius:analyze.topic_735_1354233600], [citrius-altrius-fortius:analyze.topic_735_1354320000]\n ) AS data_table\n INNER JOIN\n ( \n SELECT id FROM\n [citrius-altrius-fortius:analyze.topic_735_1354233600], [citrius-altrius-fortius:analyze.topic_735_1354320000]\n WHERE\n UPPER(title) CONTAINS (\"BUDGET\") LIMIT 1000\n ) AS filter_table\nON data_table.id == filter_table.id\n;",
      "destinationTable": {
        "projectId": "citrius-altrius-fortius",
        "tableId": "anonba898655_04dc_4d8d_8c29_673a36ba4c9f",
        "datasetId": "_6fff29df40f86299d525686858be44df27c8dfd0"
      }
    }
  },
  "id": "citrius-altrius-fortius:job_e870b1fe67b94897a157dfa4f60b9725",
  "selfLink": "https://www.googleapis.com/bigquery/v2/projects/citrius-altrius-fortius/jobs/job_e870b1fe67b94897a157dfa4f60b9725"
}
title and id are plain string columns (non-repeated). Both tables are fairly small (adding LIMIT 1000 on the right side of the join does not change anything). However, when I alter the query to use just one table:
SELECT data_table.title AS name
FROM
  (
    SELECT id, title
    FROM [citrius-altrius-fortius:analyze.topic_735_1354233600]
  ) AS data_table
INNER JOIN
  (
    SELECT id
    FROM [citrius-altrius-fortius:analyze.topic_735_1354233600]
    WHERE UPPER(title) CONTAINS ("BUDGET") LIMIT 1000
  ) AS filter_table
ON data_table.id == filter_table.id
it works every time.
What would be the proper way of joining two subselects when the subselects select from multiple tables?

You're hitting a known bug with table unions in a join:
BigQuery - UNION on left side of JOIN == Unexpected. Please try again
We're actively working on a fix.
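Until the fix lands, one possible workaround is to materialize the comma-union into its own table first (for example, by running it as a separate query job with destinationTable set, as visible in the job configuration above) and then join against that single table, which is the shape that already works for you. A sketch, assuming a scratch table named analyze.topic_735_union you can write to:

-- Step 1: materialize the union (run as a separate job with
-- destinationTable set to the hypothetical analyze.topic_735_union).
SELECT id, title
FROM [citrius-altrius-fortius:analyze.topic_735_1354233600],
     [citrius-altrius-fortius:analyze.topic_735_1354320000]

-- Step 2: join two subselects over the single materialized table,
-- matching the one-table form that works every time.
SELECT data_table.title AS name
FROM (
  SELECT id, title
  FROM [citrius-altrius-fortius:analyze.topic_735_union]
) AS data_table
INNER JOIN (
  SELECT id
  FROM [citrius-altrius-fortius:analyze.topic_735_union]
  WHERE UPPER(title) CONTAINS ("BUDGET")
) AS filter_table
ON data_table.id == filter_table.id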

Related

How do I fetch the latest repeated entry of a record in BigQuery?

I have a table where all updates are pushed as new entries. The table's schema is something of this sort:
[
  {
    "id": "221212",
    "fieldsData": [
      {
        "key": "someDate",
        "value": "12-12-2022"
      },
      {
        "key": "someString",
        "value": "ABCDEF"
      }
    ],
    "name": "Sample data 1",
    "createdOn": "12-11-2022",
    "insertedDate": "14-11-2022",
    "updatedOn": "14-11-2022"
  },
  {
    "id": "221212",
    "fieldsData": [
      {
        "key": "someDate",
        "value": "12-12-2022"
      },
      {
        "key": "someString",
        "value": "ABCDEF"
      },
      {
        "key": "someMoreString",
        "value": "12qwwe122"
      }
    ],
    "name": "Sample data 1",
    "createdOn": "12-11-2022",
    "insertedDate": "15-11-2022",
    "updatedOn": "15-11-2022"
  }
]
It is partitioned by month using the createdOn field. The fieldsData field is generic and can have any number of records/fields as separate rows.
How do I fetch the latest entry of id = 221212 and get the repeated records of only the latest one?
I know I can use FLATTEN, but flattening queries all the records, and that defeats the purpose of having a partitioned table.
The query I've got right now is:
select * from
(
  SELECT
    id, createdAt, createdBy, fields.key, fields.value,
    DENSE_RANK() OVER (PARTITION BY id ORDER BY insertedDate DESC) AS Rank1
  FROM `mytableName`, UNNEST(fieldsData) AS fields
  WHERE createdAt IS NULL OR DATE(createdAt) = CURRENT_DATE()
)
where rank1 = 1
PS: This table is going to have almost 10k records pushed every day.
Let me know if this serves your purpose.
SELECT AS value ARRAY_AGG(t ORDER BY insertedDate DESC LIMIT 1)[OFFSET(0)]
FROM `project.dataset.table` t
GROUP BY id
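ARRAY_AGG(t ORDER BY insertedDate DESC LIMIT 1)[OFFSET(0)] keeps the newest complete row per id, repeated fieldsData included, so nothing needs to be flattened. Here is a sketch of the same pattern with your partition filter and an id predicate kept in place; column names are taken from your question and query, so adjust them to your actual schema:

-- Keep the partition filter so BigQuery can still prune partitions
-- (createdAt as in your query; the schema sample calls it createdOn).
SELECT AS value ARRAY_AGG(t ORDER BY insertedDate DESC LIMIT 1)[OFFSET(0)]
FROM `project.dataset.table` t
WHERE id = '221212'
  AND (createdAt IS NULL OR DATE(createdAt) = CURRENT_DATE())
GROUP BY id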

Multiple Choice Question - Postgres Join two tables

I am new to SQL; previously we were using MongoDB, and now we have shifted to Postgres. Basically, we have two tables, questions and options. Each question can have multiple options, so I designed the database with the tables below.
Table : dialogues
dialogue_id
dialogue_text
...
Table : options
option_id
option_name
dialogue_id (FK to the dialogues table)
Now I am trying to get all the questions with their options. I have tried an inner join like this:
SELECT options.option_name, dialogues.*
FROM options INNER JOIN dialogues ON dialogues.dialogue_id = options.dialogue_id
But it's not what I wanted.
Just to give you an example in Mongo we have used this query
const aggregateQuery = [
  {
    $lookup: {
      from: "options",              // the collection to join to
      localField: "_id",            // the field on questions to match
      foreignField: "question_id",  // the field on options to match
      as: "options"                 // name of the array holding the joined documents
    }
  }
]
const questionList = await questionModel.aggregate(aggregateQuery)
As a result, I want all the dialogues along with a field called "options" that contains all the relevant options from the "options" table.
Let me share the final JSON that I get from Mongo:
[
  {
    "_id": "yyyyy",
    "question": "What is your gender?",
    "options": [
      {
        "_id": "xxxxx",
        "option": "Male",
        "question_id": "yyyyy",
        "created_at": "2020-07-04T05:57:00.293Z",
        "updated_at": "2020-07-04T05:57:00.293Z"
      },
      {
        "_id": "xxxx",
        "option": "Female",
        "question_id": "yyyyy",
        "created_at": "2020-07-04T05:57:00.293Z",
        "updated_at": "2020-07-04T05:57:00.293Z"
      }
    ]
  }
]
Can anybody help?
This may be your solution.
select jsonb_agg(to_jsonb(t))
from (
  select
    d.dialogue_id "_id",
    d.dialogue_text "question",
    (
      select jsonb_agg(jsonb_build_object('_id', o.option_id, 'option', o.option_name))
      from options o
      where o.dialogue_id = d.dialogue_id
    ) "options"
  from dialogues d
) t;
Here is a JOIN/GROUP BY version (note the join key is dialogue_id, the FK on options):
select jsonb_agg(to_jsonb(t))
from (
  select
    d.dialogue_id "_id",
    d.dialogue_text "question",
    jsonb_agg(jsonb_build_object('_id', o.option_id, 'option', o.option_name)) "options"
  from dialogues d
  inner join options o using (dialogue_id)
  group by d.dialogue_id, d.dialogue_text
) t;
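A minimal sketch to try either query, with table and column names assumed from the question:

create table dialogues (
  dialogue_id   serial primary key,
  dialogue_text text
);

create table options (
  option_id   serial primary key,
  option_name text,
  dialogue_id int references dialogues (dialogue_id)
);

insert into dialogues (dialogue_text) values ('What is your gender?');
insert into options (option_name, dialogue_id) values ('Male', 1), ('Female', 1);

Both versions should then return a single JSON array shaped like the Mongo result, with each dialogue carrying its own "options" array.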

Postgres knex query joining columns

I have a Postgres DB that stores data in a table that has the following columns "date | uid | id | value | data".
Using knex on a node.js server I currently get from this query:
data = await db(tableName)
  .select(["date", "uid", "id", "value", "value as market_cap", "data"])
  .whereIn("id", ["index", "market_cap", "market"]);
the following result:
"data": [
{
"date": "2020-11-07T21:43:11.709Z",
"uid": "nmvdqy0kh87sd8a",
"id": "index",
"value": "999.9999999999999",
"market_cap": "999.9999999999999",
"data": null
},
{
"date": "2020-11-07T21:43:11.709Z",
"uid": "nmvdqy0kh87sd8b",
"id": "market_cap",
"value": "10125616413",
"market_cap": "10125616413",
"data": null
},
{
"date": "2020-11-07T21:43:11.709Z",
"uid": "nmvdqy0kh87sd8c",
"id": "market",
"value": null,
"market_cap": null,
"data": {
"1": [],
"2": []
}
},
...
];
The dates of the related rows are exactly the same. Data stored under id "market_cap" actually lives in the "value" column, and data stored under id "market" actually lives in the "data" column.
Now, what I actually need is:
"data": [
{
"date": "2020-11-07T21:43:11.709Z",
"value": "999.9999999999999",
"market_cap": "10125616413",
"data": {
"1": [],
"2": []
}
},
...
];
Is there a way to obtain this data structure directly from the database instead of transforming the data on the server? Bonus points if you provide the knex query / SQL query. Thank you!
You can accomplish this with a self-join on date, selecting the rows with the relevant ids. The left joins ensure a result for each date even if one of the data types is missing.
select mv."date", mv.value, mc.value as market_cap, md.data
from market_snapshots mv
left join market_snapshots mc on mv."date" = mc."date" and mc.id = 'market_cap'
left join market_snapshots md on mv."date" = md."date" and md.id = 'market'
where mv.id = 'index';
In Knex it would be something like...
knex.select(['mv.date', 'mv.value', 'mc.value as market_cap', 'md.data'])
  .from({ mv: 'market_snapshots' })
  .leftJoin({ mc: 'market_snapshots' }, function () {
    this.on('mv.date', '=', 'mc.date').andOn(knex.raw('mc.id = ?', 'market_cap'))
  })
  .leftJoin({ md: 'market_snapshots' }, function () {
    this.on('mv.date', '=', 'md.date').andOn(knex.raw('md.id = ?', 'market'))
  })
  .where('mv.id', 'index')
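If you'd rather avoid the triple self-join, a conditional-aggregation sketch can produce the same shape in one scan; this assumes the same hypothetical market_snapshots table and that the data column is jsonb:

select "date",
       max(case when id = 'index' then value end)      as value,
       max(case when id = 'market_cap' then value end) as market_cap,
       -- jsonb has no max(), so aggregate the (at most one) market row per date
       (jsonb_agg(data) filter (where id = 'market')) -> 0 as data
from market_snapshots
where id in ('index', 'market_cap', 'market')
group by "date";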

Working with arrays with BigQuery LegacySQL

Each row in my table has a field that is an array, and I'd like to get a field from the first array entry.
For example, if my row is
[
  {
    "user_dim": {
      "user_id": "123",
      "user_properties": [
        {
          "key": "content_group",
          "value": {
            "value": {
              "string_value": "my_group"
            }
          }
        }
      ]
    },
    "event_dim": [
      {
        "name": "main_menu_item_selected",
        "timestamp_micros": "1517584420597000"
      },
      {
        "name": "screen_view",
        "timestamp_micros": "1517584420679001"
      }
    ]
  }
]
I'd like to get:
user_id: 123, content_group: my_group, timestamp: 1517584420597000
As Elliott mentioned, BigQuery Standard SQL has much better support for arrays than legacy SQL, and in general the BigQuery team recommends using Standard SQL.
So, below is a BigQuery Standard SQL version (including handling of the wildcard tables):
#standardSQL
SELECT
  user_dim.user_id AS user_id,
  (SELECT value.value.string_value
   FROM UNNEST(user_dim.user_properties)
   WHERE key = 'content_group'
   LIMIT 1
  ) content_group,
  (SELECT event.timestamp_micros
   FROM UNNEST(event_dim) event
   WHERE name = 'main_menu_item_selected'
  ) ts
FROM `project.dataset.app_events_*`
WHERE _TABLE_SUFFIX BETWEEN '20180129' AND '20180202'
with this result (for the dummy example from your question):
Row  user_id  content_group  ts
1    123      my_group       1517584420597000
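One caveat: a scalar subquery raises an error if it returns more than one row, so if a single record can ever hold several main_menu_item_selected events, bound the ts subquery the same way as content_group, for example:

(SELECT event.timestamp_micros
 FROM UNNEST(event_dim) event
 WHERE name = 'main_menu_item_selected'
 ORDER BY event.timestamp_micros
 LIMIT 1
) ts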

json extract multiple level value in sql

This is a follow-up question of Extract all values from json in sql table
What if the json value has multiple levels?
For example,
{
  "store-1": {
    "Apple": {
      "category": "fruit",
      "price": 100
    },
    "Orange": {
      "category": "fruit",
      "price": 80
    }
  },
  "store-2": {
    "Orange": {
      "category": "fruit",
      "price": 90
    },
    "Potato": {
      "category": "vegetable",
      "price": 40
    }
  }
}
In this case, I want to extract the price for all the items, but I get an error when I run the query below.
with my_table(items) as (
  values (
    '{"store-1":{"Apple":{"category":"fruit","price":100},"Orange":{"category":"fruit","price":80}},
      "store-2":{"Orange":{"category":"fruit","price":90},"Potato":{"category":"vegetable","price":40}}}'::json
  )
)
select key, (value->value->>'price')::numeric as price
from my_table,
  json_each(json_each(items))
I get the below error.
ERROR: function json_each(record) does not exist
LINE 10: json_each(json_each(items))
If I remove one json_each(), it throws
ERROR: operator does not exist: json -> json
LINE 8: select key, (value->value->>'price')::numeric as price
You can use a lateral join, something like:
with my_table(items) as (
  values (
    '{"store-1":{"Apple":{"category":"fruit","price":100},"Orange":{"category":"fruit","price":80}},
      "store-2":{"Orange":{"category":"fruit","price":90},"Potato":{"category":"vegetable","price":40}}}'::json
  )
)
select outer_key, key, value->>'price' as price
from (
  select key as outer_key, value as val
  from my_table
  join lateral json_each(items) on true
) t
join lateral json_each(val) on true
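For the sample document above, this should return one row per item, something like (row order may vary):

 outer_key |  key   | price
-----------+--------+-------
 store-1   | Apple  | 100
 store-1   | Orange | 80
 store-2   | Orange | 90
 store-2   | Potato | 40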