How to get more info than the generic "Failed to parse JSON: No active field found.; ParsedString returned false; Could not parse value" on a BigQuery load? - google-bigquery

We're trying BigQuery for the first time, with data extracted from Mongo in JSON format. I kept getting this generic parse error when loading the file. But then I tried a smaller subset of the file, 20 records, and it loaded fine. This tells me the problem isn't the general structure of the file, which is what I had originally suspected. Is there any way to get more info on the parse error, such as the string of the record it's trying to parse when the error occurs?
I also tried using the max errors field, but that didn't work either.
This was via the website. I also tried it via the Google Cloud SDK command line 'bq load...' and got the same error.

This error is most likely caused by some of the JSON records not complying with the table schema. It is not clear whether you used the schema auto-detect feature or supplied a schema for the load, but here is one example where such an error could happen:
{ "a" : "1" }
{ "a" : { "b" : "2" } }
If you only have a few of these and they really are invalid records, you can automatically ignore them by using the max_bad_records option for the load job. More details at: https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-json
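A minimal sketch of that option with the bq command-line tool (the dataset, table, and file names here are placeholders):
bq load --source_format=NEWLINE_DELIMITED_JSON --max_bad_records=10 mydataset.mytable ./mongo_export.json
To get more detail than the generic message, inspecting the finished load job with bq show --format=prettyjson -j <job_id> should print the job's full error list, which often helps narrow down the offending records.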

Related

In a Matillion query, I am able to create a query profile correctly but have errors with the parameters. "The value of error could not be accessed"

I am trying to extract data from Gainsight to Snowflake with Matillion, using the API. I was able to create the query profile and the data is pulled correctly, but there are errors in the parameters. The error is "Exception Testing table - The error was ***. The value of attribute could not be accessed. The attribute does not exist."
I tried escaping [ with \ as suggested below, but it did not work - https://metlcommunity.matillion.com/s/question/0D54G00007uPCSSSA4/i-new-to-api-and-i-am-getting-below-error-while-running-the-api-query-componentparameter-validation-failure-the-value-of-the-attribute-could-not-be-accessed-the-attribute-does-not-existi-can-successfully-create-the-api-query-profile
I was expecting data to show under the "Data Preview".

Apache Beam Java 2.26.0: BigQueryIO 'No rows present in the request'

Since the Beam 2.26.0 update, we have been running into errors in our Java SDK streaming data pipelines. We have been investigating the issue for quite some time now but are unable to track down the root cause. When downgrading to 2.25.0, the pipeline works as expected.
Our pipelines are responsible for ingestion, i.e., consume from Pub/Sub and ingest into BigQuery. Specifically, we use the PubSubIO source and the BigQueryIO sink (streaming mode). When running the pipeline, we encounter the following error:
{
  "code" : 400,
  "errors" : [ {
    "domain" : "global",
    "message" : "No rows present in the request.",
    "reason" : "invalid"
  } ],
  "message" : "No rows present in the request.",
  "status" : "INVALID_ARGUMENT"
}
Our initial guess was that the pipeline's logic was somehow bugged, causing the BigQueryIO sink to fail. After investigation, we concluded that the PCollection feeding the sink does indeed contain correct data.
Earlier today I was looking in the changelog and noticed that the BigQueryIO sink received numerous updates. I was specifically worried about the following changes:
BigQuery’s DATETIME type now maps to Beam logical type org.apache.beam.sdk.schemas.logicaltypes.SqlTypes.DATETIME
Java BigQuery streaming inserts now have timeouts enabled by default. Pass --HTTPWriteTimeout=0 to revert to the old behavior
With respect to the first update, I made sure to remove all DATETIME values from the resulting TableRow objects. In this specific scenario, the error still occurs.
For the second change, I'm unsure how to pass the --HTTPWriteTimeout=0 flag to the pipeline. How is this best achieved?
Any other suggestions as to the root cause of this issue?
Thanks in advance!
We have finally been able to fix this issue and, rest assured, it has been a hell of a ride. We basically debugged the entire BigQueryIO connector and came to the following conclusions:
The TableRow objects being forwarded to BigQuery used to contain enum values. Because these are not serializable, an empty payload was forwarded to BigQuery. In my opinion, this error should be made more explicit (why was this suddenly changed anyway?).
The issue was solved by adding the @Value annotation (com.google.api.client.util.Value) to each enum entry.
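For illustration, a minimal sketch of that fix (the enum name and constants here are made up): each constant carries the @Value annotation so the Google API client serializes it as a string instead of producing an empty payload.
import com.google.api.client.util.Value;

// Annotating every constant makes the enum usable inside a TableRow payload.
public enum SenderType {
  @Value("CUSTOMER")
  CUSTOMER,
  @Value("AGENT")
  AGENT
}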
The same TableRow object also contained values of type byte[]. These values were inserted into a BigQuery column of the BYTES type. While this previously worked without explicitly computing a base64 encoding, it now yields errors.
The issue was solved by computing the base64 ourselves (this setup is also discussed in the following post).
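A minimal sketch of that workaround, assuming a BYTES column named "payload" (the column name and helper class are illustrative):
import com.google.api.services.bigquery.model.TableRow;
import java.util.Base64;

public class RowEncoding {
  // Base64-encode the raw bytes ourselves before handing the row to BigQueryIO.
  static TableRow toRow(byte[] rawBytes) {
    return new TableRow().set("payload", Base64.getEncoder().encodeToString(rawBytes));
  }
}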
--HTTPWriteTimeout is a pipeline option. You can set it the same way you set the runner, etc. (typically on the command line).
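As a sketch, assuming a standard main method (the class name, runner, and project are placeholders), the flag is simply appended to the other command-line arguments and picked up by PipelineOptionsFactory:
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class MyPipeline {
  public static void main(String[] args) {
    // e.g. run with: --runner=DataflowRunner --project=my-project --HTTPWriteTimeout=0
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().create();
    Pipeline pipeline = Pipeline.create(options);
    // ... build the pipeline as before ...
    pipeline.run();
  }
}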

Dynamic BigQuery Schema using Auto Detection: Error Schema has no fields

I am trying out the auto-detection feature in BigQuery and am currently encountering issues when updating the schema on my table.
What I have done so far:
I manually created a dataset and table in BigQuery.
Executed my first bq load command (works perfectly fine):
bq --location=${LOCATION} load --autodetect --source_format=${FORMAT} ${DATASET}.${TABLE} ${PATH_TO_SOURCE}
I then tried to append a new JSON object that introduces a new field, in order to update the current schema.
Executed the second bq load command:
bq --location=${LOCATION} load --autodetect --schema_update_option=ALLOW_FIELD_ADDITION --source_format=${FORMAT} ${DATASET}.${TABLE} ${PATH_TO_SOURCE}
Throws an error:
Error in query string.
Error processing job. Schema has no fields.
I thought that when the --autodetect flag is enabled, the bq load command would not ask for a schema on the load job. Has anyone encountered this issue before?
First object:
{
  "chatSessionId": "123",
  "chatRequestId": "1234",
  "senderType": "CUSTOMER",
  "senderFriendlyName": "Player"
}
Second object:
{
  "chatSessionId": "456",
  "chatRequestId": "5678",
  "senderType": "CUSTOMER",
  "senderFriendlyName": "Player",
  "languageCode": "EN"
}
I reproduced your steps but I couldn't reproduce the same error, as shown in the screenshots (loading the first JSON, the first table's data, loading the second JSON, the second table's data).
The only thing I changed in your data was the format: you provided plain JSON and I turned it into newline-delimited JSON (the type of JSON that BigQuery expects).
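In that form each record sits on a single line, so the two objects above (each in its own load file in this scenario) become:
{"chatSessionId": "123", "chatRequestId": "1234", "senderType": "CUSTOMER", "senderFriendlyName": "Player"}
{"chatSessionId": "456", "chatRequestId": "5678", "senderType": "CUSTOMER", "senderFriendlyName": "Player", "languageCode": "EN"}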
You can find more information about it here.
Please let me know if it clarifies something for you.
I hope it helps.

Export a database (having nested documents) from Cloud Firestore to BigQuery using a command

I am trying to export Cloud Firestore data into BigQuery to do SQL operations.
I exported Cloud Firestore to Cloud Storage using https://cloud.google.com/firestore/docs/manage-data/export-import:
gcloud beta firestore export gs://htna-3695c-storage --collection-ids='users','feeds'
Then I followed https://cloud.google.com/bigquery/docs/loading-data-cloud-firestore to import into BigQuery.
We have 2 collections, Users and Feeds, in Cloud Firestore.
I have successfully exported the Feeds collection but am not able to export the Users collection.
I am getting an error while importing the data from Cloud Storage into BigQuery:
Error: unexpected property name 'Contacts'. We have a Contacts field in the Users collection.
This Contacts field is of type 'Map'.
I also tried the command line. Below is the command to load into BigQuery:
bq --location=US load --source_format=DATASTORE_BACKUP myproject_Dataset.users gs://myproject-storage/2019-04-19T13:29:28_75338/all_namespaces/kind_users/all_namespaces_kind_users.export_metadata
Here too I got the same error:
unexpected property name 'Contacts'
I thought of adding projection fields to export only specified fields, something like below:
bq --location=US load --projection_fields=[Coins,Referral,Profile] --source_format=DATASTORE_BACKUP myproject_Dataset.users gs://myproject-storage/2019-04-19T13:29:28_75338/all_namespaces/kind_users/all_namespaces_kind_users.export_metadata
Here I got this error:
Waiting on bqjob_r73b7ddbc9398b737_0000016a4909dd27_1 ... (0s) Current status: DONE
BigQuery error in load operation: Error processing job 'myproject:bqjob_r73b7ddbc9398b737_0000016a4909dd27_1': An internal error occurred and the request could not be completed. Error: 4550751
Can anyone please let me know how to fix these issues?
Thank you in advance. [Image of the Firestore DB]
It seems to be an issue with some of your documents. According to the limitations of the BigQuery import of Firestore data, you must have a schema with fewer than 10,000 unique field names. In your Contacts schema you are using the contacts' names as keys; that design is likely to produce a large number of field names. You would need to check whether this is happening in your other documents as well.
As a workaround, you could change the design (at some stage of the loading process) from:
"Contacts" : {
"contact1" : "XXXXX",
"contact2" : "YYYYY",
}
to:
"Contacts" : [
{
"name" : "contact1",
"number" : "XXXXX"
},
{
"name" : "contact2",
"number" : "YYYYY"
},
]
This schema would drastically reduce the number of field names, which will make the data easier to manipulate from BigQuery.
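If it helps, a minimal sketch of that reshaping with Gson (the class name and field names are illustrative, and where in your loading process you run it is up to you):
import com.google.gson.JsonArray;
import com.google.gson.JsonElement;
import com.google.gson.JsonObject;
import com.google.gson.JsonParser;
import java.util.Map;

public class ContactsReshaper {
  // Turn {"Contacts": {"contact1": "XXXXX", ...}} into
  // {"Contacts": [{"name": "contact1", "number": "XXXXX"}, ...]}.
  static JsonObject reshape(String json) {
    JsonObject doc = JsonParser.parseString(json).getAsJsonObject();
    JsonObject contacts = doc.getAsJsonObject("Contacts");
    JsonArray reshaped = new JsonArray();
    for (Map.Entry<String, JsonElement> e : contacts.entrySet()) {
      JsonObject entry = new JsonObject();
      entry.addProperty("name", e.getKey());
      entry.add("number", e.getValue());
      reshaped.add(entry);
    }
    doc.add("Contacts", reshaped);
    return doc;
  }
}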

How do I make a BigQuery dataset public using the command line tool or Python?

I'm making an open data website powered by BigQuery. How do I make a BigQuery dataset public using the command line tool or Python?
Note: I tried to make every dataset in my project public but got an unexplained error. In the project permission settings in the web UI, under "Add members", I entered allAuthenticatedUsers and assigned the Data Viewer permission. The error was: "Sorry, there’s a problem. If you entered information, check it and try again. Otherwise, the problem might clear up on its own, so check back later."
I wasn't able to find any command-line examples for updating permissions. I also couldn't find a JSON string to pass to https://cloud.google.com/bigquery/docs/reference/rest/v2/datasets/update
To achieve this programmatically, you need to use a dataset patch request with a specialGroup item whose value is allAuthenticatedUsers, like so:
{
  "datasetReference": {
    "projectId": "<removed>",
    "datasetId": "<removed>"
  },
  "access": [
    ... // other access roles
    {
      "specialGroup": "allAuthenticatedUsers",
      "role": "READER"
    }
  ]
}
Note: You should use a read-modify-write cycle as described here & here:
Note about arrays: Patch requests that contain arrays replace the existing array with the one you provide. You cannot modify, add, or delete items in an array in a piecemeal fashion.
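A minimal sketch of that read-modify-write cycle with the bq command-line tool (the project and dataset names are placeholders):
# Dump the dataset's current metadata, including the full access array.
bq show --format=prettyjson myproject:mydataset > dataset.json
# Edit dataset.json: append the allAuthenticatedUsers/READER entry to "access",
# keeping every existing entry, since the update replaces the whole array.
# Send the modified metadata back.
bq update --source dataset.json myproject:mydataset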