I have the following BigQuery tables:
orders:
[
{
"name": "orders_id",
"type": "INTEGER",
"mode": "NULLABLE"
},
{
"name": "customer_id",
"type": "INTEGER",
"mode": "NULLABLE"
}
]
customers:
[
{
"name": "customer_id",
"type": "INTEGER",
"mode": "NULLABLE"
},
{
"name": "customer_name",
"type": "INTEGER",
"mode": "NULLABLE"
}
]
I want to create new_orders as follows:
[
{
"name": "orders_id",
"type": "INTEGER",
"mode": "NULLABLE"
},
{
"name": "customer_name",
"type": "INTEGER",
"mode": "NULLABLE"
}
]
So I created an empty table for new_orders and wrote this query:
SELECT o.orders_id, c.customer_name
FROM `project.orderswh.orders` AS o
INNER JOIN `project.orderswh.customers` AS c ON o.customer_id = c.customer_id
My problem is how to load the data from this query result into the new table.
I have about 15M rows. To the best of my knowledge, a regular INSERT is expensive and incredibly slow. How can I do this as a load job?
You could do this from the BigQuery Console. Follow these steps:
1) Show Options
2) Destination Table
3) Choose the dataset and provide "new_orders" as the Table ID
4) Set "Write Preference" to "Write if empty", as this is a one-time thing, as you said
If needed, see also this tutorial: https://cloud.google.com/bigquery/docs/writing-results
You could use the bq command line tool:
bq query --append_table \
--nouse_legacy_sql \
--allow_large_results \
--destination_table project:orderswh.new_orders \
'SELECT o.orders_id, c.customer_name
FROM `project.orderswh.orders` AS o
INNER JOIN `project.orderswh.customers` AS c ON o.customer_id = c.customer_id'
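Another option is a single DDL statement: BigQuery standard SQL supports CREATE TABLE ... AS SELECT, which writes the join result as a query job rather than row-by-row inserts. Since you already created an empty new_orders, the OR REPLACE form would overwrite it (table names taken from the question):

CREATE OR REPLACE TABLE `project.orderswh.new_orders` AS
SELECT o.orders_id, c.customer_name
FROM `project.orderswh.orders` AS o
INNER JOIN `project.orderswh.customers` AS c ON o.customer_id = c.customer_id;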
I have a schema that looks like:
[
{
"name": "name",
"type": "STRING",
"mode": "REQUIRED"
},
{
"name": "frm",
"type": "RECORD",
"mode": "REPEATED",
"fields": [
{
"name": "c",
"type": "STRING",
"mode": "REQUIRED"
},
{
"name": "n",
"type": "STRING",
"mode": "REQUIRED"
}
]
},
{
"name": "",
"type": "STRING",
"mode": "NULLABLE"
}
]
I am trying to write a query that selects a row when its frm array contains one entry with c = 'X' and another entry with c = 'Z'. Only when both conditions are true do I want to select the "name" of the parent row. I actually have no clue how I could achieve this. Any suggestions?
E.g. this works, but I am unnesting frm twice; there must be a more efficient way, I guess.
SELECT name FROM `t2`
WHERE 'X' IN UNNEST(frm.c) AND 'Z' IN UNNEST(frm.c)
Consider the approach below. The correlated subquery counts the distinct matching values of c within each row's frm array, so the row is kept only when both 'X' and 'Z' are present:
select name
from your_table t
where 2 = (
select count(distinct c)
from t.frm
where c in ('X', 'Z')
)
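A quick way to sanity-check this is with inline sample data (hypothetical rows, not taken from the question):

with your_table as (
  select 'row1' as name, [struct('X' as c, '1' as n), struct('Z' as c, '2' as n)] as frm
  union all
  select 'row2' as name, [struct('X' as c, '1' as n), struct('Y' as c, '2' as n)] as frm
)
select name
from your_table t
where 2 = (
  select count(distinct c)
  from t.frm
  where c in ('X', 'Z')
)
-- returns only row1, since row2 lacks a 'Z' entry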
I have a Google BigQuery clustered, partitioned table, and I am trying to get its table definition using the bq.py CLI tool. I get the JSON output, but it does not have the clustering information.
% bq version
This is BigQuery CLI 2.0.69
% bq show \
--schema \
--format=prettyjson \
uk-kingdom-911:ld_cv_1_population.cv_performance_t1_agg3_auh_mvpart
[
{
"name": "hstamp",
"type": "TIMESTAMP"
},
{
"name": "account_id",
"type": "INTEGER"
},
....
.....
]
The JSON output does not have the clustering information. Not sure what I am missing here.
That's because you've used the --schema flag. The --schema flag only shows the basic schema of the table and nothing else. Remove that flag and you should see everything:
> bq show --format=prettyjson example_ds.test_table
==========================================================
{
"clustering": {
"fields": [
"second_col"
]
},
"creationTime": "1623325001027",
"etag": "xxx",
"id": "my-project:example_ds.test_table",
"kind": "bigquery#table",
"lastModifiedTime": "1623325001112",
"location": "xxx",
"numBytes": "0",
"numLongTermBytes": "0",
"numRows": "0",
"schema": {
"fields": [
{
"mode": "NULLABLE",
"name": "first_col",
"type": "DATE"
},
{
"mode": "NULLABLE",
"name": "second_col",
"type": "INTEGER"
}
]
},
"selfLink": "xxxx",
"tableReference": {
"datasetId": "example_ds",
"projectId": "my-project",
"tableId": "test_table"
},
"timePartitioning": {
"type": "DAY"
},
"type": "TABLE"
}
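If you only need the clustering (and partitioning) details rather than the whole table resource, you could filter the output, for example with jq (assuming jq is installed; the table name is the one from the question):

bq show --format=json uk-kingdom-911:ld_cv_1_population.cv_performance_t1_agg3_auh_mvpart | jq '{clustering, timePartitioning}'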
I am trying to insert data from one BigQuery table into another by running the query shown below, but I get 0 rows in return. However, if I take out the Survey column, I get the correct number of rows.
Both nested fields have the same type of schema. I have checked and double-checked the column names too, but I can't seem to figure out what's wrong with the Survey field.
INSERT INTO destination_table
(
Title, Description, Address, Survey
)
SELECT
Title as Title,
Description as Description,
[STRUCT(
ARRAY_AGG(STRUCT(Address_Instance.Field1, Address_Instance.Field2)) AS Address_Record
)]
as Address,
[STRUCT(
ARRAY_AGG(STRUCT(Survey_Instance.Field1, Survey_Instance.Field2)) AS Survey_Record
)]
as Survey
FROM
source_table,
UNNEST(Survey) AS Survey,
UNNEST(Survey_Instance) as Survey_Instance
GROUP BY
Title,
Description
Here's how the schema of my source table looks:
[
{
"name": "Title",
"type": "STRING"
},
{
"name": "Description",
"type": "STRING"
},
{
"name": "Address",
"type": "RECORD",
"mode": "REPEATED",
"fields": [
{
"name": "Address_Instance",
"type": "RECORD",
"mode": "REPEATED",
"fields": [
{
"name": "Field1",
"type": "STRING"
},
{
"name": "Field2",
"type": "STRING"
}
]
}
]
},
{
"name": "Survey",
"type": "RECORD",
"mode": "REPEATED",
"fields": [
{
"name": "Survey_Instance",
"type": "RECORD",
"mode": "REPEATED",
"fields": [
{
"name": "Field1",
"type": "STRING"
},
{
"name": "Field2",
"type": "STRING"
}
]
}
]
}
]
While mapping to the destination table, I rename the nested repeated records, but that's not causing any problems. I am wondering if I am overlooking something important and need some suggestions and advice, basically an extra set of eyes to help me figure out what I am doing wrong.
Would appreciate some help. Thanks in advance.
Use explicit JOINs in general. In this case, use LEFT JOIN, so that rows whose Survey or Survey_Instance arrays are empty are not eliminated (a comma join against UNNEST acts as a CROSS JOIN and drops such rows, which would explain the 0 rows):
FROM source_table st
LEFT JOIN UNNEST(st.Survey) Survey ON 1=1
LEFT JOIN UNNEST(Survey.Survey_Instance) Survey_Instance ON 1=1
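A minimal sketch of the difference, with hypothetical data simplified to one level of nesting (an empty array survives a LEFT JOIN but not a comma join):

WITH source_table AS (
  SELECT 'kept either way' AS title, [STRUCT('a' AS field1, 'b' AS field2)] AS survey_instance
  UNION ALL
  SELECT 'dropped by a comma join' AS title, ARRAY<STRUCT<field1 STRING, field2 STRING>>[] AS survey_instance
)
SELECT title, si.field1
FROM source_table st
LEFT JOIN UNNEST(st.survey_instance) si ON 1=1
-- with "FROM source_table st, UNNEST(st.survey_instance) si" the second row would disappear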
I have a json record which looks like
{"customer_id":"2349uslvn2q3","order_id":"9sufd23rdl40",
"line_item": [{"line":"1","sku":"10","amount":10},
{"line":"2","sku":"20","amount":20}]}
I am trying to load the record stated above into a table which has the following schema definition:
"fields": [
{
"mode": "NULLABLE",
"name": "customer_id",
"type": "STRING"
},
{
"mode": "NULLABLE",
"name": "order_id",
"type": "STRING"
},
{
"mode": "REPEATED",
"name": "line_item",
"type": "STRING"
}
]
I am getting the following error message:
JSON parsing error in row starting at position 0 at file:
gs://gcs_bucket/file0. JSON object specified for non-record field:
line_item
I want the line_item column in the table to hold the line_item JSON, which can have more than one element, as an array of JSON strings.
Any suggestions?
The first thing is that your input JSON shouldn't have a "\n" character inside a record (newline-delimited JSON requires one record per line), so you should save it like:
{"customer_id":"2349uslvn2q3","order_id":"9sufd23rdl40", "line_item": [{"line":"1","sku":"10","amount":10}, {"line":"2","sku":"20","amount":20}]}
Here is an example of how your JSON file should look:
{"customer_id":"2349uslvn2q3","order_id":"9sufd23rdl40", "line_item": [{"line":"1","sku":"10","amount":10}, {"line":"2","sku":"20","amount":20}]}
{"customer_id":"2","order_id":"2", "line_item": [{"line":"2","sku":"20","amount":20}, {"line":"2","sku":"20","amount":20}]}
{"customer_id":"3","order_id":"3", "line_item": [{"line":"3","sku":"30","amount":30}, {"line":"3","sku":"30","amount":30}]}
And also your schema is not correct. It should be:
[
{
"mode": "NULLABLE",
"name": "customer_id",
"type": "STRING"
},
{
"mode": "NULLABLE",
"name": "order_id",
"type": "STRING"
},
{
"mode": "REPEATED",
"name": "line_item",
"type": "RECORD",
"fields": [{"name": "line", "type": "STRING"}, {"name": "sku", "type": "STRING"}, {"name": "amount", "type": "INTEGER"}]
}
]
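Once the data is loaded with that schema, the repeated record can be flattened with UNNEST when querying, e.g. (a sketch, with dataset.table standing in for your table name):

SELECT customer_id, order_id, li.line, li.sku, li.amount
FROM dataset.table, UNNEST(line_item) AS li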
For a better understanding of how schemas work, I've tried writing sort of a guide in this answer. Hopefully it can be of some value.
If your data is saved, for instance, in a file called gs://gcs_bucket/file0 and your schema in schema.json, then this command should work for you:
bq load --source_format=NEWLINE_DELIMITED_JSON dataset.table gs://gcs_bucket/file0 schema.json
(supposing you are using the CLI tool, as seems to be the case in your question).
I've got a nested table A in BigQuery with a schema as follows:
{
"name": "page_event",
"mode": "repeated",
"type": "RECORD",
"fields": [
{
"name": "id",
"type": "STRING"
}
]
}
I would like to enrich table A with data from another table and save the result as a new nested table. Let's say I would like to add a "description" field to table A (creating table B), so my schema will be as follows:
{
"name": "page_event",
"mode": "repeated",
"type": "RECORD",
"fields": [
{
"name": "id",
"type": "STRING"
},
{
"name": "description",
"type": "STRING"
}
]
}
How do I do this in BigQuery? It seems that there are no functions for creating nested structures in BigQuery SQL (except the NEST function, which produces a list, but it doesn't seem to work, failing with an "Unexpected error").
The only way of doing this I can think of is to:
1) use string concatenation functions to produce table B with a single field called "json", whose content is the enriched data from A converted to a JSON string
2) export B to GCS as a set of files F
3) load F as table C
Is there an easier way to do it?
To enrich the schema of an existing table, one can use the tables patch API:
https://cloud.google.com/bigquery/docs/reference/v2/tables/patch
The request will look like below:
PATCH https://www.googleapis.com/bigquery/v2/projects/{project_id}/datasets/{dataset_id}/tables/{table_id}?key={YOUR_API_KEY}
{
"schema": {
"fields": [
{
"name": "page_event",
"mode": "repeated",
"type": "RECORD",
"fields": [
{
"name": "id",
"type": "STRING"
},
{
"name": "description",
"type": "STRING"
}
]
}
]
}
}
Before Patch / After Patch (screenshots of the table schema before and after the patch)
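Note that the patch only adds the new field (NULLABLE by default); it does not populate it. To actually fill description, you could then rewrite the table with standard SQL, for example (a sketch: table_a stands for your table A and descriptions is a hypothetical lookup table keyed by id):

SELECT
  ARRAY(
    SELECT AS STRUCT pe.id, d.description
    FROM UNNEST(t.page_event) pe
    LEFT JOIN descriptions d ON d.id = pe.id
  ) AS page_event
FROM table_a t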