Big JSON record to BigQuery is not showing up - google-bigquery

I wanted to try to upload big JSON record objects to BigQuery.
I am talking about JSON records of 1.5 MB each, with a complex schema nested up to seven levels deep.
For simplicity, I started by loading a file with a single record on one line.
At first I tried to have BigQuery autodetect my schema, but that resulted in a table that was unresponsive and that I could not query, although it reported having at least one record.
Then, assuming that my schema could be too hard for the loader to reverse-engineer, I wrote the schema myself and tried to load my single-record file.
At first I got a simple error that just said "invalid".
bq load --source_format=NEWLINE_DELIMITED_JSON invq_data.test_table my_single_json_record_file
Upload complete.
Waiting on bqjob_r5a4ce64904bbba9d_0000015e14aba735_1 ... (3s) Current status: DONE
BigQuery error in load operation: Error processing job 'invq-test:bqjob_r5a4ce64904bbba9d_0000015e14aba735_1': JSON table encountered too many errors, giving up. Rows: 1; errors: 1.
Checking the job afterwards, the error detail was just the following:
"status": {
"errorResult": {
"location": "file-00000000",
"message": "JSON table encountered too many errors, giving up. Rows: 1; errors: 1.",
"reason": "invalid"
},
"errors": [
{
"location": "file-00000000",
"message": "JSON table encountered too many errors, giving up. Rows: 1; errors: 1.",
"reason": "invalid"
}
],
"state": "DONE"
},
Then, after a couple more attempts at creating new tables, the load actually started to succeed on the command line, without reporting errors:
bq load --max_bad_records=1 --source_format=NEWLINE_DELIMITED_JSON invq_data.test_table_4 my_single_json_record_file
Upload complete.
Waiting on bqjob_r368f1dff98600a4b_0000015e14b43dd5_1 ... (16s) Current status: DONE
with no error on the status checker...
"statistics": {
"creationTime": "1503585955356",
"endTime": "1503585973623",
"load": {
"badRecords": "0",
"inputFileBytes": "1494390",
"inputFiles": "1",
"outputBytes": "0",
"outputRows": "0"
},
"startTime": "1503585955723"
},
"status": {
"state": "DONE"
},
But no actual records are added to my tables.
I tried to perform the same load from the web UI, but the result is the same: green on the completed job, but no actual record added.
Is there something else I can do to check where the data is sinking to? Maybe some more logs?
I can imagine that I may be on the edge of the 2 MB JSON row size limit but, if so, shouldn't this be reported as an error?
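To rule out the file itself, a quick local check can confirm that every line parses as JSON and report each row's serialized size. This is a sketch; the 2 MB figure is the limit assumed in the question, not a verified quota:

```python
import json

MAX_ROW_BYTES = 2 * 1024 * 1024  # assumed limit from the question; check current BigQuery quotas

def check_ndjson(path):
    """Return a list of (line_number, problem) tuples for an NDJSON file."""
    problems = []
    with open(path, "r", encoding="utf-8") as fh:
        for lineno, line in enumerate(fh, start=1):
            line = line.strip()
            if not line:
                continue  # skip blank lines
            try:
                json.loads(line)
            except json.JSONDecodeError as exc:
                problems.append((lineno, "invalid JSON: {}".format(exc)))
                continue
            size = len(line.encode("utf-8"))
            if size > MAX_ROW_BYTES:
                problems.append((lineno, "row is {} bytes, over the assumed limit".format(size)))
    return problems
```

An empty result means the file is at least syntactically valid NDJSON with rows under the assumed size, which narrows the problem down to the schema.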
Thanks in advance for the help!!
EDIT:
It turned out the complexity of my schema was the devil here.
My JSON files were valid, but my complex schema contained several errors.
I had to simplify the schema anyway, because I got a new batch of data where single JSON instances were more than 30 MB, and I had to restructure the data in a more relational way, producing smaller rows to insert into the database.
Funnily enough, once the schema was split across multiple entities (that is, simplified), the actual errors and inconsistencies in the schema started to show up in the errors returned, and it was easier to fix them. (Mostly it was new nested undocumented data which I was not aware of anyway... but still my bad.)
The lesson here is that when a table schema is too big (I didn't experiment with how big exactly is too big), BigQuery just hides behind a "too many errors" message.
But that is the point where you should consider simplifying the schema (and structure) of your data.
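The restructuring described above can be sketched as splitting one nested record into several flat, relational row sets joined by a key. All field and table names here are hypothetical, just to illustrate the shape of the transformation:

```python
def split_order(order):
    """Split a nested order record into flat parent/child rows joined by order_id."""
    order_id = order["id"]
    parent = {"order_id": order_id, "customer": order["customer"]}
    items = [
        {"order_id": order_id, "sku": item["sku"], "qty": item["qty"]}
        for item in order.get("items", [])
    ]
    # Each key maps to a table; each value is a list of small, flat rows.
    return {"orders": [parent], "order_items": items}

nested = {"id": 7, "customer": "acme",
          "items": [{"sku": "A", "qty": 2}, {"sku": "B", "qty": 1}]}
tables = split_order(nested)
```

Each resulting table gets a simple schema and small rows, which is exactly what made the loader's error messages usable again in my case.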

Related

Bigquery Can't Upload RECORD Data Type File

I want to upload my sales data into BigQuery. The file contains STRUCT (RECORD) data types, therefore I have to upload my file as NDJSON.
However, once I do that, I get the error message: "Failed to create table: Error while reading data, error message: JSON table encountered too many errors, giving up. Rows: 1; errors: 1. Please look into the errors[] collection for more details."
The NDJSON I'm trying to upload looks something like this: {"SalesOrder": "SalesOrder", "CreatedBy": "CreatedBy", "CustomerID": "CustomerID", "SalesChannel": "SalesChannel", "SalesChannelType": "SalesChannelType", "CreatedDate": "CreatedDate", "WebDiscount": "WebDiscount", "RevenueFromPlatform": "RevenueFromPlatform", "DeliveryFee": "DeliveryFee", "PriceBeforeVAT": "PriceBeforeVAT", "VAT": "VAT", "DeliveryChannel": "DeliveryChannel", "DeliveryStatus": "DeliveryStatus", "Payment.Status": "Payment.Status", "Payment.Channel": "Payment.Channel", "Payment.Value": "Payment.Value", "Payment.Date": "Payment.Date", "Product.SKU": "Product.SKU", "Product.UnitsSold": "Product.UnitsSold", "Product.UnitPrice": "Product.UnitPrice", "Product.Discount": "Product.Discount", "Product.Price": "Product.Price", "": ""}
How do I solve this issue? Thanks in advance
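One likely culprit, although the question does not confirm it, is that keys such as "Payment.Status" are literal dotted names, while a RECORD column expects a real nested object; the empty "" key at the end is also not a valid BigQuery column name. A minimal sketch that rebuilds nested objects from dotted keys and drops empty keys:

```python
def nest_dotted_keys(flat):
    """Turn {"Payment.Status": x} into {"Payment": {"Status": x}}, dropping empty keys."""
    nested = {}
    for key, value in flat.items():
        if not key:
            continue  # BigQuery column names cannot be empty
        parts = key.split(".")
        node = nested
        for part in parts[:-1]:
            node = node.setdefault(part, {})
        node[parts[-1]] = value
    return nested

row = {"SalesOrder": "SO-1", "Payment.Status": "PAID", "Payment.Value": "9.99", "": ""}
fixed = nest_dotted_keys(row)
```

Note that if a record can repeat, e.g. several Product entries per order, the field must be a REPEATED record (a JSON array of objects), which this sketch does not handle.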

How to fetch the current run status (i.e. success or failure) of all metronome jobs?

We are using metronome, and we want to create a dashboard for our jobs scheduled by it against its rest API.
Alas, the job endpoint
/v1/jobs
does not contain the last state, i.e. success or failure, but only its configuration.
Googling on how to get the history of a job, I found out that I can query the job history through embed=history GET parameter for each jobId.
I could combine fetching the id list and then fetch each job's history through:
/v1/jobs/{job_id}?embed=history
Yet this includes all the runs and also requires us to fetch each job individually.
Is there a way to get the metronome job status without querying all the jobs one by one?
You can click on each GET or POST endpoint on the official docs to see if it supports additional query params.
The endpoint for jobs indeed supports historic data
As you can see, you can use embed=history or embed=historySummary. For this use case embed=historySummary is better suited, as it only contains the timestamps of the last runs, in the form below, and is less expensive and time-consuming:
[
  {
    "id": "your_job_id",
    "historySummary": {
      "failureCount": 6,
      "lastFailureAt": "2018-01-26T12:18:46.406+0000",
      "lastSuccessAt": "2018-04-19T13:50:14.132+0000",
      "successCount": 226
    },
    ...
  },
  ...
]
You can compare those dates to figure out if the last run was successful. Yet keep in mind that lastFailureAt and lastSuccessAt might be null, as a job might never have run in the first place:
{
  "id": "job-that-never-ran",
  "labels": {},
  "run": {
    ...
  },
  "historySummary": {
    "successCount": 0,
    "failureCount": 0,
    "lastSuccessAt": null,
    "lastFailureAt": null
  }
}
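The date comparison suggested above can be sketched like this. The field names come from the historySummary payload shown, and a job that has never run is reported as having no known state:

```python
from datetime import datetime

def last_run_status(summary):
    """Return 'success', 'failure', or None from a Metronome historySummary."""
    def parse(ts):
        # Timestamps look like 2018-04-19T13:50:14.132+0000
        return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%S.%f%z") if ts else None

    ok = parse(summary.get("lastSuccessAt"))
    ko = parse(summary.get("lastFailureAt"))
    if ok is None and ko is None:
        return None  # job has never run
    if ko is None or (ok is not None and ok > ko):
        return "success"
    return "failure"
```

Running this over the embed=historySummary response gives one dashboard-ready status per job from a single request.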

How to get useful BigQuery errors

I have a job that I run with jobs().insert()
Currently I have the job failing with:
2014-11-11 11:19:15,937 - ERROR - bigquery - invalid: Could not convert value to string
Considering I have 500+ columns, I find this error message useless and pretty pathetic. What can I do to receive proper, more detailed errors from BigQuery?
The structured error return dictionary contains 3 elements, a "reason", a "location", and a "message". From the log line you include, it looks like only the message is logged.
Here's an example error return from a CSV import with data that doesn't match the target table schema:
"errors": [
{
"reason": "invalid",
"location": "File: 0 / Line:2 / Field:1",
"message": "Value cannot be converted to expected type."
},
...
Similar errors are returned from JSON imports with data that doesn't match the target table schema.
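To surface all three fields for every error rather than just the message, you can walk the errors collection of the job's status. This is a sketch over the status dictionary as returned by jobs().get(); the output formatting is my own:

```python
def format_job_errors(status):
    """Render each entry of status['errors'] as 'reason @ location: message'."""
    lines = []
    for err in status.get("errors", []):
        lines.append("{} @ {}: {}".format(
            err.get("reason", "?"),
            err.get("location", "?"),
            err.get("message", "?"),
        ))
    return lines

status = {"errors": [{"reason": "invalid",
                      "location": "File: 0 / Line:2 / Field:1",
                      "message": "Value cannot be converted to expected type."}]}
```

With the location included, "Could not convert value to string" at least points to a file, line, and field instead of leaving you to search 500+ columns.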
I hope this helps!

BigQuery Load Job [invalid] Too many errors encountered

I'm trying to insert data into BigQuery using the BigQuery Api C# Sdk.
I created a new Job with Json Newline Delimited data.
When I use:
100 lines of input: OK
250 lines of input: OK
500 lines of input: KO
2500 lines of input: KO
The error encountered is :
"status": {
"state": "DONE",
"errorResult": {
"reason": "invalid",
"message": "Too many errors encountered. Limit is: 0."
},
"errors": [
{
"reason": "internalError",
"location": "File: 0",
"message": "Unexpected. Please try again."
},
{
"reason": "invalid",
"message": "Too many errors encountered. Limit is: 0."
}
]
}
The file works well when I use the Bq Tools with command :
bq load --source_format=NEWLINE_DELIMITED_JSON dataset.datatable pathToJsonFile
Something seems to be wrong on the server side, or maybe when I transmit the file, but we cannot get any more detail than "internal server error".
Does anyone have more information on this?
Thank you
"Unexpected. Please try again." could either indicate that the contents of the files you provided had unexpected characters, or it could mean that an unexpected internal server condition occurred. There are several questions which might help shed some light on this:
does this consistently happen no matter how many times you retry?
does this directly depend on the lines in the file, or can you construct a simple upload file which doesn't trigger the error condition?
One option to potentially avoid these problems is to send the load job request with configuration.load.maxBadRecords higher than zero.
Feel free to comment with more info and I can maybe update this answer.
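In the REST payload, raising configuration.load.maxBadRecords is just one extra field on the load configuration. A sketch of the request body (project, dataset, and table names are placeholders):

```python
# Minimal load-job configuration body with maxBadRecords raised above zero,
# so that a handful of bad rows no longer fails the whole job.
load_config = {
    "configuration": {
        "load": {
            "sourceFormat": "NEWLINE_DELIMITED_JSON",
            "maxBadRecords": 10,
            "destinationTable": {
                "projectId": "my-project",
                "datasetId": "dataset",
                "tableId": "datatable",
            },
        }
    }
}
```

The C# SDK exposes the same field on its load-job configuration object; the key point is that it lives under configuration.load, next to sourceFormat.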

Frequent 503 errors raised from BigQuery Streaming API

Streaming data into BigQuery keeps failing due to the following error, which occurs more frequently recently:
com.google.api.client.googleapis.json.GoogleJsonResponseException: 503 Service Unavailable
{
"code" : 503,
"errors" : [ {
"domain" : "global",
"message" : "Connection error. Please try again.",
"reason" : "backendError"
} ],
"message" : "Connection error. Please try again."
}
at com.google.api.client.googleapis.json.GoogleJsonResponseException.from(GoogleJsonResponseException.java:145)
at com.google.api.client.googleapis.services.json.AbstractGoogleJsonClientRequest.newExceptionOnError(AbstractGoogleJsonClientRequest.java:113)
at com.google.api.client.googleapis.services.json.AbstractGoogleJsonClientRequest.newExceptionOnError(AbstractGoogleJsonClientRequest.java:40)
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest$1.interceptResponse(AbstractGoogleClientRequest.java:312)
at com.google.api.client.http.HttpRequest.execute(HttpRequest.java:1049)
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:410)
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:343)
Relevant question references:
Getting high rate of 503 errors with BigQuery Streaming API
BigQuery - BackEnd error when loading from JAVA API
We (the BigQuery team) are looking into your report of increased connection errors. From internal monitoring, there hasn't been a global spike in connection errors in the last several days. However, that doesn't mean that your tables, specifically, weren't affected.
Connection errors can be tricky to chase down, because they can be caused by errors before they get to the BigQuery servers or after they leave. The more information you can provide, the easier it is for us to diagnose the issue.
The best practice for streaming input is to handle temporary errors like this to retry the request. It can be a little tricky, since when you get a connection error you don't actually know whether the insert succeeded. If you include a unique insertId with your data (see the documentation here), you can safely resend the request (within the deduplication window period, which I think is 15 minutes) without worrying that the same row will get added multiple times.
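The retry-with-insertId pattern can be sketched as follows. The row-id scheme and backoff values are my own choices, and send_rows stands in for whatever client call performs the tabledata.insertAll request:

```python
import time
import uuid

def insert_with_retry(send_rows, rows, max_attempts=5, backoff=1.0):
    """Attach a stable insertId to each row and retry transient failures.

    send_rows is a callable taking the payload rows; it should raise
    ConnectionError on transport failures. Because the insertIds stay the
    same across retries, BigQuery can deduplicate rows that were actually
    inserted before the connection dropped.
    """
    payload = [{"insertId": str(uuid.uuid4()), "json": row} for row in rows]
    for attempt in range(1, max_attempts + 1):
        try:
            return send_rows(payload)
        except ConnectionError:
            if attempt == max_attempts:
                raise
            time.sleep(backoff * 2 ** (attempt - 1))  # exponential backoff
```

The important detail is that the payload, including the insertIds, is built once and reused for every attempt; generating fresh ids per retry would defeat the deduplication.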