I have a Google BigQuery clustered, partitioned table, and I am trying to get its table definition using the bq.py CLI tool. I get JSON output, but it does not contain the clustering information.
% bq version
This is BigQuery CLI 2.0.69
% bq show \
--schema \
--format=prettyjson \
uk-kingdom-911:ld_cv_1_population.cv_performance_t1_agg3_auh_mvpart
[
{
"name": "hstamp",
"type": "TIMESTAMP"
},
{
"name": "account_id",
"type": "INTEGER"
},
....
.....
]
The JSON output does not include the clustering information. I am not sure what I am missing here.
That's because of the --schema flag, which only shows the table's column schema and nothing else. Remove that flag and you should see the full table definition, including the clustering fields:
> bq show --format=prettyjson example_ds.test_table
==========================================================
{
"clustering": {
"fields": [
"second_col"
]
},
"creationTime": "1623325001027",
"etag": "xxx",
"id": "my-project:example_ds.test_table",
"kind": "bigquery#table",
"lastModifiedTime": "1623325001112",
"location": "xxx",
"numBytes": "0",
"numLongTermBytes": "0",
"numRows": "0",
"schema": {
"fields": [
{
"mode": "NULLABLE",
"name": "first_col",
"type": "DATE"
},
{
"mode": "NULLABLE",
"name": "second_col",
"type": "INTEGER"
}
]
},
"selfLink": "xxxx",
"tableReference": {
"datasetId": "example_ds",
"projectId": "my-project",
"tableId": "test_table"
},
"timePartitioning": {
"type": "DAY"
},
"type": "TABLE"
}
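If you only want the clustering columns out of that output, you can filter the JSON on the command line. A small sketch, assuming you have jq installed, using the same example table as above:
bq show --format=json example_ds.test_table | jq '.clustering.fields'
For the table above this prints ["second_col"].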
I am trying to understand Avro schemas and I am stuck with complex types (record). The problem is very simple: create a schema which contains one record field with two primitive fields (a string and a timestamp) nested inside the record. I see two options for the schema:
option 1
{
"type": "record",
"name": "cool_subject",
"namespace": "com.example",
"fields": [
{
"name": "field_1",
"type": "record"
"fields": [
{"name": "operation", "type": "string"},
{"name": "timestamp", "type": "long", "logical_type": "timestamp_millis"}
]
}
]
}
option 2
{
"type": "record",
"name": "cool_subject",
"namespace": "com.example",
"fields": [
{
"name": "field_1",
"type": {
"type": "record",
"name": "field_1_type",
"fields": [
{"name": "operation", "type": "string"},
{"name": "timestamp", "type": {"type": "long", "logical_type": "timestamp_millis"}}
]
}
}
]
}
The difference is in the "type" attribute.
As far as I know, option 2 is the correct way. Am I right? Is option 1 valid?
The second one is correct. The first one is not valid.
A record schema is something that looks like this:
{
"type": "record",
"name": <Name of the record>,
"fields": [...],
}
And for fields, it should be like this:
[
{
"name": <name of field>,
"type": <type of field>,
},
...
]
So in the case of a field which contains a record, it should always look like this:
[
{
"name": <name of field>,
"type": {
"type": "record",
"name": <Name of the record>,
"fields": [...],
}
},
...
]
The format in the first example would make it unclear whether the name "field_1" is the name of the field or the name of the record.
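For completeness, here is your schema written out in the valid form (essentially your option 2). One extra note: if you want the timestamp field to be recognized as a logical type, the standard Avro attribute is "logicalType" and the logical type name is "timestamp-millis":
{
  "type": "record",
  "name": "cool_subject",
  "namespace": "com.example",
  "fields": [
    {
      "name": "field_1",
      "type": {
        "type": "record",
        "name": "field_1_type",
        "fields": [
          {"name": "operation", "type": "string"},
          {"name": "timestamp", "type": {"type": "long", "logicalType": "timestamp-millis"}}
        ]
      }
    }
  ]
}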
I was trying to load an Avro file with a nested record. One of the records had a union schema. When loaded into BigQuery, it created a very long column name like com_mycompany_data_nestedClassname_value for each union element. That name is unwieldy. I am wondering if there is a way to specify the name without having the full package name prefixed.
For example. The following Avro schema
{
"type": "record",
"name": "EventRecording",
"namespace": "com.something.event",
"fields": [
{
"name": "eventName",
"type": "string"
},
{
"name": "eventTime",
"type": "long"
},
{
"name": "userId",
"type": "string"
},
{
"name": "eventDetail",
"type": [
{
"type": "record",
"name": "Network",
"namespace": "com.something.event",
"fields": [
{
"name": "hostName",
"type": "string"
},
{
"name": "ipAddress",
"type": "string"
}
]
},
{
"type": "record",
"name": "DiskIO",
"namespace": "com.something.event",
"fields": [
{
"name": "path",
"type": "string"
},
{
"name": "bytesRead",
"type": "long"
}
]
}
]
}
]
}
BigQuery came up with field names like eventDetail.com_something_event_Network_value. Is it possible to make that long field name something like eventDetail.Network instead?
Avro loading is not as flexible as it should be in BigQuery (a basic example is that it does not support loading only a subset of the fields, i.e. a reader schema). Also, renaming columns is not supported today in BigQuery (refer to the documentation on modifying table schemas). The only options are to recreate the table with the proper names, for example by creating a new table from a query over your existing table, or to recreate the table from your source data; a rough sketch of the first option is below.
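A minimal sketch, assuming a source table `my-project.my_dataset.events` loaded from the Avro above, and assuming the union produced the field names shown in the question (the DiskIO field name follows the same pattern and may differ in your table):
CREATE OR REPLACE TABLE `my-project.my_dataset.events_renamed` AS
SELECT
  eventName,
  eventTime,
  userId,
  -- rebuild the union column as a STRUCT with shorter nested names
  STRUCT(
    eventDetail.com_something_event_Network_value AS Network,
    eventDetail.com_something_event_DiskIO_value AS DiskIO
  ) AS eventDetail
FROM `my-project.my_dataset.events`;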
I have the following BigQuery tables:
orders:
[
{
"name": "orders_id",
"type": "INTEGER",
"mode": "NULLABLE"
},
{
"name": "customer_id",
"type": "INTEGER",
"mode": "NULLABLE"
}
]
customers:
[
{
"name": "customer_id",
"type": "INTEGER",
"mode": "NULLABLE"
},
{
"name": "customer_name",
"type": "INTEGER",
"mode": "NULLABLE"
}
]
I want to create new_orders as follows:
[
{
"name": "orders_id",
"type": "INTEGER",
"mode": "NULLABLE"
},
{
"name": "customer_name",
"type": "INTEGER",
"mode": "NULLABLE"
}
]
So I created an empty table for new_orders and wrote this query:
SELECT o.orders_id,c.customer_name
from `project.orderswh.orders` as o
inner join `project.orderswh.customers` as c on o.customer_id = c.customer_id
My problem is how to load the data from this query result into the new table.
I have about 15M rows. To the best of my knowledge, regular row-by-row inserts are expensive and incredibly slow. How can I do this as a load job?
You could do this from the BigQuery console by following these steps:
1) Show Options
2) Destination Table
3) Choose your dataset and provide "new_orders" as the Table ID
4) Set "Write Preference" to "Write if empty", as this is a one-time thing as you said
If needed, see also this guide: https://cloud.google.com/bigquery/docs/writing-results
You could use the bq command line tool:
bq query --append_table \
--nouse_legacy_sql \
--allow_large_results \
--destination_table project:orderswh.new_orders 'SELECT o.orders_id, c.customer_name
from `project.orderswh.orders` as o
inner join `project.orderswh.customers` as c on o.customer_id = c.customer_id'
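Alternatively, if you prefer a single statement instead of pre-creating the empty table, a CREATE TABLE ... AS SELECT sketch (using the same project/dataset names as in the question) also runs as a regular query job rather than row-by-row inserts:
CREATE TABLE `project.orderswh.new_orders` AS
SELECT
  o.orders_id,
  c.customer_name
FROM `project.orderswh.orders` AS o
INNER JOIN `project.orderswh.customers` AS c
  ON o.customer_id = c.customer_id;
Since you already created an empty new_orders table, either drop it first or use CREATE OR REPLACE TABLE.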
I have a JSON record which looks like this:
{"customer_id":"2349uslvn2q3","order_id":"9sufd23rdl40",
"line_item": [{"line":"1","sku":"10","amount":10},
{"line":"2","sku":"20","amount":20}]}
I am trying to load the record above into a table with the following schema definition:
"fields": [
{
"mode": "NULLABLE",
"name": "customer_id",
"type": "STRING"
},
{
"mode": "NULLABLE",
"name": "order_id",
"type": "STRING"
},
{
"mode": "REPEATED",
"name": "line_item",
"type": "STRING"
}
]
I am getting the following error message:
JSON parsing error in row starting at position 0 at file:
gs://gcs_bucket/file0. JSON object specified for non-record field:
line_item
I want line_item, which can have more than one element, to be stored as an array of JSON strings in the line_item column of the table.
Any suggestions?
The first thing is that each JSON record shouldn't contain a "\n" character, i.e. each record must sit on a single line, so you should save it like:
{"customer_id":"2349uslvn2q3","order_id":"9sufd23rdl40", "line_item": [{"line":"1","sku":"10","amount":10}, {"line":"2","sku":"20","amount":20}]}
An example of how your JSON file should look:
{"customer_id":"2349uslvn2q3","order_id":"9sufd23rdl40", "line_item": [{"line":"1","sku":"10","amount":10}, {"line":"2","sku":"20","amount":20}]}
{"customer_id":"2","order_id":"2", "line_item": [{"line":"2","sku":"20","amount":20}, {"line":"2","sku":"20","amount":20}]}
{"customer_id":"3","order_id":"3", "line_item": [{"line":"3","sku":"30","amount":30}, {"line":"3","sku":"30","amount":30}]}
Your schema is also not correct. It should be:
[
{
"mode": "NULLABLE",
"name": "customer_id",
"type": "STRING"
},
{
"mode": "NULLABLE",
"name": "order_id",
"type": "STRING"
},
{
"mode": "REPEATED",
"name": "line_item",
"type": "RECORD",
"fields": [{"name": "line", "type": "STRING"}, {"name": "sku", "type": "STRING"}, {"name": "amount", "type": "INTEGER"}]
}
]
For a better understanding of how schemas work, I've tried writing sort of a guide in this answer. Hopefully it can be of some value.
If your data is saved, for instance, in a file called gs://gcs_bucket/file0 and your schema in schema.json, then this command should work for you:
bq load --source_format=NEWLINE_DELIMITED_JSON dataset.table gs://gcs_bucket/file0 schema.json
(supposing you are using the CLI tool as it seems to be the case in your question).
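Once the data is loaded, the repeated RECORD can be flattened with UNNEST at query time. A quick sketch, assuming the same dataset.table used in the load command above:
bq query --nouse_legacy_sql \
'SELECT customer_id, li.line, li.sku, li.amount
 FROM `dataset.table`, UNNEST(line_item) AS li'
This returns one output row per line_item element, alongside the parent record's customer_id.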
The BigQuery documentation says that tables.insert creates a new, empty table (https://cloud.google.com/bigquery/docs/reference/v2/tables/insert).
But when I try to test that endpoint from that page, I get a 404 Table not found error.
Request:
POST https://www.googleapis.com/bigquery/v2/projects/myProject/datasets/temp/tables?key={YOUR_API_KEY}
{
"expirationTime": "1452627594",
"tableReference": {
"tableId": "myTable"
}
}
Response:
404 OK
- Show headers -
{
"error": {
"errors": [
{
"domain": "global",
"reason": "notFound",
"message": "Not found: Table myProject.myTable"
}
],
"code": 404,
"message": "Not found: Table myProject:temp.myTable"
}
}
I am sure there is a reasonable explanation for the 404 error, but I don't understand it: since I am trying to create the table, shouldn't it be OK that it doesn't exist yet?
Thank you
The info you provided in the request body is not enough for BigQuery to create a new table.
At a minimum, add a schema, something like below (just as an example):
{
"schema": {
"fields": [
{
"name": "page_event",
"mode": "repeated",
"type": "RECORD",
"fields": [
{
"name": "id",
"type": "STRING"
},
{
"name": "description",
"type": "STRING"
}
]
}
]
}
}
See the Tables resource documentation for more on what you can supply in the request body.
Below is an example with the above dummy schema that should work:
POST https://www.googleapis.com/bigquery/v2/projects/myProject/datasets/temp/tables?key={YOUR_API_KEY}
{
"schema": {
"fields": [
{
"name": "page_event",
"mode": "repeated",
"type": "RECORD",
"fields": [
{
"name": "id",
"type": "STRING"
},
{
"name": "description",
"type": "STRING"
}
]
}
]
},
"expirationTime": "1452560523000",
"tableReference": {
"tableId": "myTable"
}
}
EDIT:
And the reason it didn't work with "expirationTime": "1452627594" from your question is that expirationTime is expressed in milliseconds since the epoch, so that value is interpreted as a moment in January 1970, which is already in the past.
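For reference, a minimal curl sketch of the same call; the project, dataset, single-column schema, and the use of gcloud for the access token are placeholders/assumptions, and expirationTime is omitted here (if you include it, use a future time in epoch milliseconds):
curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  "https://www.googleapis.com/bigquery/v2/projects/myProject/datasets/temp/tables" \
  -d '{
        "schema": {"fields": [{"name": "id", "type": "STRING"}]},
        "tableReference": {"projectId": "myProject", "datasetId": "temp", "tableId": "myTable"}
      }'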