How to load data from a query result into a table in BigQuery

I have the following BigQuery tables:
orders:
[
  {
    "name": "orders_id",
    "type": "INTEGER",
    "mode": "NULLABLE"
  },
  {
    "name": "customer_id",
    "type": "INTEGER",
    "mode": "NULLABLE"
  }
]
customers:
[
  {
    "name": "customer_id",
    "type": "INTEGER",
    "mode": "NULLABLE"
  },
  {
    "name": "customer_name",
    "type": "STRING",
    "mode": "NULLABLE"
  }
]
I want to create new_orders as follows:
[
  {
    "name": "orders_id",
    "type": "INTEGER",
    "mode": "NULLABLE"
  },
  {
    "name": "customer_name",
    "type": "STRING",
    "mode": "NULLABLE"
  }
]
So I created an empty table for new_orders and wrote this query:
SELECT o.orders_id, c.customer_name
FROM `project.orderswh.orders` AS o
INNER JOIN `project.orderswh.customers` AS c ON o.customer_id = c.customer_id
My problem is how to load the data from this query result into the new table.
I have about 15M rows. To the best of my knowledge, a regular INSERT is expensive and incredibly slow. How can I do this as a load job?

You could do this from the BigQuery console. Follow these steps:
1) click "Show Options"
2) open "Destination Table"
3) choose the dataset and provide "new_orders" as the Table ID
4) set "Write Preference" to "Write if empty", since this is a one-time thing as you said
If needed, look also for this tutorial: https://cloud.google.com/bigquery/docs/writing-results
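Alternatively, if your project supports standard SQL DDL, the table can be created and populated in one statement with CREATE TABLE AS SELECT, so you would not even need to pre-create the empty table. A minimal sketch using the names from the question:
-- creates and populates new_orders in a single query job;
-- no separate load step is needed
CREATE TABLE `project.orderswh.new_orders` AS
SELECT o.orders_id, c.customer_name
FROM `project.orderswh.orders` AS o
INNER JOIN `project.orderswh.customers` AS c
  ON o.customer_id = c.customer_id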

You could use the bq command line tool:
bq query --append_table \
--nouse_legacy_sql \
--allow_large_results \
--destination_table 'project:orderswh.new_orders' \
'SELECT o.orders_id, c.customer_name
FROM `project.orderswh.orders` AS o
INNER JOIN `project.orderswh.customers` AS c ON o.customer_id = c.customer_id'
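One more flag worth knowing (verify against your bq version): --replace overwrites the destination table instead of appending, which is handy if you ever need to re-run the job from scratch.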

Related

BigQuery: select rows with two (or more/fewer) matches in a repeated field

I have a schema that looks like:
[
  {
    "name": "name",
    "type": "STRING",
    "mode": "REQUIRED"
  },
  {
    "name": "frm",
    "type": "RECORD",
    "mode": "REPEATED",
    "fields": [
      {
        "name": "c",
        "type": "STRING",
        "mode": "REQUIRED"
      },
      {
        "name": "n",
        "type": "STRING",
        "mode": "REQUIRED"
      }
    ]
  },
  {
    "name": "",
    "type": "STRING",
    "mode": "NULLABLE"
  }
]
I am trying to write a query that selects a row when there is one element in frm that matches c = 'X' and another element that has c = 'Z'. Only when both conditions are true, I would love to select the "name" of the parent row. I actually have no clue how I could achieve this. Any suggestions?
E.g. this works, but I am unnesting frm two times; there must be a more efficient way, I guess.
SELECT name FROM `t2`
WHERE 'X' IN UNNEST(frm.c) AND 'Z' IN UNNEST(frm.c)
Consider below approach
select name
from your_table t
where 2 = (
  select count(distinct c)
  from t.frm
  where c in ('X', 'Z')
)
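A quick way to sanity-check the approach, with inlined sample data (hypothetical values, not from the question):
-- only parent1 has both 'X' and 'Z' in frm
with your_table as (
  select 'parent1' as name, [struct('X' as c, 'n1' as n), struct('Z' as c, 'n2' as n)] as frm
  union all
  select 'parent2', [struct('X' as c, 'n3' as n), struct('Y' as c, 'n4' as n)]
)
select name
from your_table t
where 2 = (
  select count(distinct c)
  from t.frm
  where c in ('X', 'Z')
)
-- returns: parent1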

Unable to get clustered table definition using bq.py cli

I have a clustered, partitioned Google BigQuery table, and I am trying to get its table definition using the bq.py CLI tool. I get the JSON output, but it does not have the clustering information.
% bq version
This is BigQuery CLI 2.0.69
% bq show \
--schema \
--format=prettyjson \
uk-kingdom-911:ld_cv_1_population.cv_performance_t1_agg3_auh_mvpart
[
  {
    "name": "hstamp",
    "type": "TIMESTAMP"
  },
  {
    "name": "account_id",
    "type": "INTEGER"
  },
  ...
]
The JSON output does not have the clustering information. Not sure what I am missing here.
That's because you've used the --schema flag. The --schema flag only shows the basic schema of the table and nothing else. Remove that flag and you should see everything:
> bq show --format=prettyjson example_ds.test_table
==========================================================
{
  "clustering": {
    "fields": [
      "second_col"
    ]
  },
  "creationTime": "1623325001027",
  "etag": "xxx",
  "id": "my-project:example_ds.test_table",
  "kind": "bigquery#table",
  "lastModifiedTime": "1623325001112",
  "location": "xxx",
  "numBytes": "0",
  "numLongTermBytes": "0",
  "numRows": "0",
  "schema": {
    "fields": [
      {
        "mode": "NULLABLE",
        "name": "first_col",
        "type": "DATE"
      },
      {
        "mode": "NULLABLE",
        "name": "second_col",
        "type": "INTEGER"
      }
    ]
  },
  "selfLink": "xxxx",
  "tableReference": {
    "datasetId": "example_ds",
    "projectId": "my-project",
    "tableId": "test_table"
  },
  "timePartitioning": {
    "type": "DAY"
  },
  "type": "TABLE"
}
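If you would rather query for it, the clustering columns should also be visible through INFORMATION_SCHEMA. A sketch, assuming your dataset's region exposes INFORMATION_SCHEMA.COLUMNS and its clustering_ordinal_position column:
-- lists the table's clustering columns in clustering order
SELECT column_name, clustering_ordinal_position
FROM example_ds.INFORMATION_SCHEMA.COLUMNS
WHERE table_name = 'test_table'
  AND clustering_ordinal_position IS NOT NULL
ORDER BY clustering_ordinal_position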

Inserting data from one BigQuery table to another returns 0 rows on group by

I am trying to insert data from one BigQuery table into another by running the query shown below, but I get 0 rows in return. However, if I take out the Survey column, I get the correct number of rows.
Both nested fields have the same type of schema. I have checked and double-checked the column names too, but can't seem to figure out what's wrong with the Survey field.
INSERT INTO destination_table
(
  Title, Description, Address, Survey
)
SELECT
  Title as Title,
  Description as Description,
  [STRUCT(
    ARRAY_AGG(STRUCT(Address_Instance.Field1, Address_Instance.Field2)) AS Address_Record
  )] as Address,
  [STRUCT(
    ARRAY_AGG(STRUCT(Survey_Instance.Field1, Survey_Instance.Field2)) AS Survey_Record
  )] as Survey
FROM
  source_table,
  UNNEST(Survey) AS Survey,
  UNNEST(Survey_Instance) as Survey_Instance
GROUP BY
  Title,
  Description
Here's what the schema of my source table looks like:
[
  {
    "name": "Title",
    "type": "STRING"
  },
  {
    "name": "Description",
    "type": "STRING"
  },
  {
    "name": "Address",
    "type": "RECORD",
    "mode": "REPEATED",
    "fields": [
      {
        "name": "Address_Instance",
        "type": "RECORD",
        "mode": "REPEATED",
        "fields": [
          {
            "name": "Field1",
            "type": "STRING"
          },
          {
            "name": "Field2",
            "type": "STRING"
          }
        ]
      }
    ]
  },
  {
    "name": "Survey",
    "type": "RECORD",
    "mode": "REPEATED",
    "fields": [
      {
        "name": "Survey_Instance",
        "type": "RECORD",
        "mode": "REPEATED",
        "fields": [
          {
            "name": "Field1",
            "type": "STRING"
          },
          {
            "name": "Field2",
            "type": "STRING"
          }
        ]
      }
    ]
  }
]
While mapping to the destination table, I rename the nested repeated records, but that's not causing any problems. I am wondering if I am overlooking something important and need some suggestions and advice; basically an extra set of eyes to help me figure out what I am doing wrong.
Would appreciate some help. Thanks in advance.
Use explicit JOINs in general. In this case, use LEFT JOIN. The comma in the FROM clause is a CROSS JOIN, so any source row whose Survey array is empty is dropped entirely, which is why the query can return 0 rows; LEFT JOIN keeps those rows:
FROM source_table st LEFT JOIN
     UNNEST(st.Survey) Survey
     ON 1=1 LEFT JOIN
     UNNEST(Survey.Survey_Instance) Survey_Instance
     ON 1=1
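A minimal, self-contained illustration of the difference (hypothetical data, not from the question):
-- the comma join drops rows with an empty array; LEFT JOIN keeps them
WITH t AS (
  SELECT 'kept' AS title, [1, 2] AS arr
  UNION ALL
  SELECT 'dropped', ARRAY<INT64>[]
)
SELECT title, v
FROM t LEFT JOIN UNNEST(t.arr) v ON 1=1
-- 3 rows come back ('dropped' appears with v = NULL);
-- with "FROM t, UNNEST(t.arr) v" the 'dropped' row vanishes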

BigQuery fails to load data from Google Cloud Storage

I have a JSON record which looks like:
{"customer_id":"2349uslvn2q3","order_id":"9sufd23rdl40",
"line_item": [{"line":"1","sku":"10","amount":10},
{"line":"2","sku":"20","amount":20}]}
I am trying to load the record stated above into a table which has the following schema definition:
"fields": [
{
"mode": "NULLABLE",
"name": "customer_id",
"type": "STRING"
},
{
"mode": "NULLABLE",
"name": "order_id",
"type": "STRING"
},
{
"mode": "REPEATED",
"name": "line_item",
"type": "STRING"
}
]
I am getting the following error message:
JSON parsing error in row starting at position 0 at file:
gs://gcs_bucket/file0. JSON object specified for non-record field:
line_item
I want line_item, which can have more than one element, to be stored as an array of JSON objects in the line_item column of the table.
Any suggestion?
The first thing is that your input JSON shouldn't have "\n" characters inside a record: BigQuery loads newline-delimited JSON, so each record must sit on a single line. Your file should look like this:
{"customer_id":"2349uslvn2q3","order_id":"9sufd23rdl40", "line_item": [{"line":"1","sku":"10","amount":10}, {"line":"2","sku":"20","amount":20}]}
{"customer_id":"2","order_id":"2", "line_item": [{"line":"2","sku":"20","amount":20}, {"line":"2","sku":"20","amount":20}]}
{"customer_id":"3","order_id":"3", "line_item": [{"line":"3","sku":"30","amount":30}, {"line":"3","sku":"30","amount":30}]}
And your schema is also not correct: line_item holds objects, not strings, so it must be declared as a repeated RECORD (that is what "JSON object specified for non-record field" is telling you). It should be:
[
  {
    "mode": "NULLABLE",
    "name": "customer_id",
    "type": "STRING"
  },
  {
    "mode": "NULLABLE",
    "name": "order_id",
    "type": "STRING"
  },
  {
    "mode": "REPEATED",
    "name": "line_item",
    "type": "RECORD",
    "fields": [
      {"name": "line", "type": "STRING"},
      {"name": "sku", "type": "STRING"},
      {"name": "amount", "type": "INTEGER"}
    ]
  }
]
For a better understanding of how schemas work, I've tried writing sort of a guide in this answer. Hopefully it can be of some value.
If your data is saved, for instance, in a file called gs://gcs_bucket/file0 and your schema in schema.json, then this command should work for you:
bq load --source_format=NEWLINE_DELIMITED_JSON dataset.table gs://gcs_bucket/file0 schema.json
(supposing you are using the CLI tool, as seems to be the case in your question).
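Once the data is loaded with the corrected schema, the repeated record can be flattened at query time. A sketch, using the dataset.table name from the load command above:
-- one output row per line item, with the parent order's ids repeated
SELECT customer_id, order_id, li.line, li.sku, li.amount
FROM `dataset.table` t, UNNEST(t.line_item) AS li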

How to create new table with nested schema entirely in BigQuery

I've got a nested table A in BigQuery with a schema as follows:
{
  "name": "page_event",
  "mode": "repeated",
  "type": "RECORD",
  "fields": [
    {
      "name": "id",
      "type": "STRING"
    }
  ]
}
I would like to enrich table A with data from another table and save the result as a new nested table. Let's say I would like to add a "description" field to table A (creating table B), so my schema would be as follows:
{
  "name": "page_event",
  "mode": "repeated",
  "type": "RECORD",
  "fields": [
    {
      "name": "id",
      "type": "STRING"
    },
    {
      "name": "description",
      "type": "STRING"
    }
  ]
}
How do I do this in BigQuery? It seems that there are no functions for creating nested structures in BigQuery SQL (except the NEST function, which produces a list; but that function doesn't seem to work either, failing with an unexpected error).
The only way of doing this I can think of is to:
1) use string concatenation functions to produce table B with a single field called "json", whose content is the enriched data from A converted to a JSON string
2) export B to GCS as a set of files F
3) load F as table C
Is there an easier way to do it?
To enrich the schema of an existing table, one can use the tables patch API:
https://cloud.google.com/bigquery/docs/reference/v2/tables/patch
The request will look like below:
PATCH https://www.googleapis.com/bigquery/v2/projects/{project_id}/datasets/{dataset_id}/tables/{table_id}?key={YOUR_API_KEY}
{
  "schema": {
    "fields": [
      {
        "name": "page_event",
        "mode": "repeated",
        "type": "RECORD",
        "fields": [
          {
            "name": "id",
            "type": "STRING"
          },
          {
            "name": "description",
            "type": "STRING"
          }
        ]
      }
    ]
  }
}
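Note that patching only widens the schema; it does not populate the new field. If the goal is to materialize the enriched rows as a new table B, a standard SQL query can rebuild the nested structure directly. A sketch, assuming a hypothetical lookup table dataset.descriptions with columns id and description supplying the new values:
-- rebuild page_event, attaching a description to each element
SELECT
  ARRAY(
    SELECT AS STRUCT pe.id, d.description
    FROM UNNEST(a.page_event) pe
    LEFT JOIN dataset.descriptions d ON d.id = pe.id
  ) AS page_event
FROM dataset.A a
The result can then be written to B via a destination table or CREATE TABLE AS SELECT, as in the first question above.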