Dataflow export to BigQuery: insertAll error, invalid table reference - google-bigquery

I'm trying to create and export a stream of synthetic data using Dataflow, Pub/Sub, and BigQuery. I followed the synthetic data generation instructions using the following schema:
{
  "id": "{{uuid()}}",
  "test_value": {{integer(1,50)}}
}
The schema is in a file gs://my-folder/my-schema.json. The stream seems to be running correctly - I can export from the corresponding Pub/Sub topic to a GCS bucket using the "Export to Cloud Storage" template. When I try to use the "Export to BigQuery" template, I keep getting this error:
Request failed with code 400, performed 0 retries due to IOExceptions, performed 0 retries due to unsuccessful status codes, HTTP framework says request can be retried, (caller responsible for retrying): https://bigquery.googleapis.com/bigquery/v2/projects/<my-project>/datasets/<my-dataset>/tables/<my-table>/insertAll.
Before starting the export job, I created an empty table <my-project>:<my-dataset>.<my-table> with fields that match the JSON schema above:
id STRING NULLABLE
test_value INTEGER NULLABLE
I have outputTableSpec set to <my-project>:<my-dataset>.<my-table>.

If the BigQuery table is given in the form project:dataset.table, then there cannot be any hyphens in the table name (the hyphen in the project ID was not the problem). I was using my-project.test.stream-data-102720 when I got the code 400 error. Creating a new table my-project.test.stream_data_102720 and re-running the job with the new name fixed the problem.
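For illustration, here is a small helper (not part of the Dataflow template, just a sketch of the fix above) that builds the outputTableSpec value and replaces hyphens in the table name; the project, dataset, and table names are the ones from this question:

import re

def output_table_spec(project: str, dataset: str, table: str) -> str:
    """Build a project:dataset.table spec, replacing hyphens in the table name."""
    safe_table = table.replace("-", "_")
    if not re.fullmatch(r"[A-Za-z0-9_]+", safe_table):
        raise ValueError(f"Unsupported characters in table name: {table!r}")
    return f"{project}:{dataset}.{safe_table}"

print(output_table_spec("my-project", "test", "stream-data-102720"))
# -> my-project:test.stream_data_102720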

Related

How do I create a BigQuery Pub/Sub subscription directly in GCP? I get an error: failed to create

I am trying to publish data directly from Pub/Sub to BigQuery.
I have created a topic with a schema.
I have created a table.
But when I create the subscription, I get the error "Request contains an invalid argument":
gcloud pubsub subscriptions create check-me.httpobs --topic=check-me.httpobs --bigquery-table=agilicus:checkme.httpobs --write-metadata --use-topic-schema
ERROR: Failed to create subscription [projects/agilicus/subscriptions/check-me.httpobs]: Request contains an invalid argument.
ERROR: (gcloud.pubsub.subscriptions.create) Failed to create the following: [check-me.httpobs].
There's not much in the way of diagnostics I can do here.
Is there any worked-out example that shows this? What am I doing wrong to get this error?
Side note: it's really a pain to have to create the BQ schema in its native JSON format and then create the message schema in Avro format. Similar, but different, and no conversion tools that I can find.
If I run with --log-http, it doesn't really enlighten:
{
  "error": {
    "code": 400,
    "message": "Request contains an invalid argument.",
    "status": "INVALID_ARGUMENT"
  }
}
Update: switched to protobuf, same problem.
https://gist.github.com/donbowman/5ea8f8d8017493cbfa3a9e4f6e736bcc has the details.
gcloud version
Google Cloud SDK 404.0.0
alpha 2022.09.23
beta 2022.09.23
bq 2.0.78
bundled-python3-unix 3.9.12
core 2022.09.23
gsutil 5.14
I have confirmed all the fields are present and in the correct format, as per https://github.com/googleapis/googleapis/blob/master/google/pubsub/v1/pubsub.proto#L639
specifically:
{"ackDeadlineSeconds": 900, "bigqueryConfig": {"dropUnknownFields": true, "table": "agilicus:checkme.httpobs", "useTopicSchema": true, "writeMetadata": true}, "name": "projects/agilicus/subscriptions/check-me.httpobs", "topic": "projects/agilicus/topics/check-me.httpobs"}
I have also tried using the API Explorer to post this, same effect.
I have also tried using the Python example:
https://cloud.google.com/pubsub/docs/samples/pubsub-create-bigquery-subscription#pubsub_create_bigquery_subscription-python
to create the subscription. Same error, with a little more info (IP, grpc_status 3):
debug_error_string = "UNKNOWN:Error received from peer ipv6:%5B2607:f8b0:400b:807::200a%5D:443 {created_time:"2022-10-04T20:54:44.600831924-04:00", grpc_status:3, grpc_message:"Request contains an invalid argument."}"
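For reference, the linked Python sample amounts to a call like the sketch below (the names mirror this question; note that the sample's placeholder uses the dotted project.dataset.table form for the table, rather than the project:dataset.table form passed to gcloud above):

from google.cloud import pubsub_v1

project_id = "agilicus"
topic_id = "check-me.httpobs"
subscription_id = "check-me.httpobs"
bigquery_table_id = "agilicus.checkme.httpobs"  # dotted form, as in the sample's placeholder

publisher = pubsub_v1.PublisherClient()
subscriber = pubsub_v1.SubscriberClient()
topic_path = publisher.topic_path(project_id, topic_id)
subscription_path = subscriber.subscription_path(project_id, subscription_id)

# BigQueryConfig mirrors the bigqueryConfig block shown in the question.
bigquery_config = pubsub_v1.types.BigQueryConfig(
    table=bigquery_table_id,
    use_topic_schema=True,
    write_metadata=True,
)

with subscriber:
    subscription = subscriber.create_subscription(
        request={
            "name": subscription_path,
            "topic": topic_path,
            "bigquery_config": bigquery_config,
        }
    )
    print(f"BigQuery subscription created: {subscription}")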

Data Factory copy pipeline from API

We use an Azure Data Factory copy pipeline to transfer data from REST APIs to an Azure SQL Database, and it is doing some strange things. Because we loop over a set of APIs that need to be transferred, the mapping in the copy activity is left empty.
But for one API the automatic mapping goes wrong. The destination table is created with all the needed columns and correct data types based on the received metadata, but when we run the pipeline for that specific API, the following message is shown:
{ "errorCode": "2200", "message": "ErrorCode=SchemaMappingFailedInHierarchicalToTabularStage,'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message=Failed to process hierarchical to tabular stage, error message: Ticks must be between DateTime.MinValue.Ticks and DateTime.MaxValue.Ticks.\r\nParameter name: ticks,Source=Microsoft.DataTransfer.ClientLibrary,'", "failureType": "UserError", "target": "Copy data1", "details": [] }
As a test, we did the mapping for that API manually using the "Import Schema" option on the Mapping page; there we see that all the fields are correctly mapped. We executed the pipeline again using that mapping and everything works fine.
But of course, we don't want to use a manual mapping, because the same copy activity is used in a loop for other APIs as well.

Can't connect Azure Table Storage to PowerBI (415) Unsupported Media Type)

I'm getting the error below while connecting to Azure Table Storage,
Details:
Blockquote "AzureTables: Request failed: The remote server returned an error:
(415) Unsupported Media Type. (None of the provided media types are
supported)
One thing I noticed is that if I fill in only the account name, it automatically appends the rest of the URL, which is ".table.core.windows.net", whereas in the portal it is table.cosmosdb.azure.com.
With core.windows.net I'm getting the error "AzureTables: Request failed: The remote name could not be resolved". But it might be messing up some headers when using table.cosmosdb.azure.com.
Please advise.
Thank you.
You should be able to connect to your Azure Table storage/Cosmos DB account from Power BI using the following link structure: https://STORAGEACCOUNTNAME.table.core.windows.net/ for Table storage, or https://yourcosmosdbname.documents.azure.com:443/ for Cosmos DB.
You can get the correct link in the portal: go to Storage accounts > Tables (or to your Cosmos DB account), find the link of the table you would like to connect to Power BI, and remove the trailing table name after the last "/". Use that URL to connect in Power BI; it will then let you select the specific table.
415 errors:
These can be caused by the cache, which can be flushed as follows: in Power BI Desktop, go to "File" and select "Options"; under "Data Load" you have the option to clear the cache. After doing this you can use "Get Data" and "OData feed" as normal, and the URL won't return the 415 error.
Check the following link for additional suggestions:
It's not clear how you consume the Table service API, but here is the solution that worked for me for a React SPA using the fetch API.
Request header must contain:
"Content-Type":"application/json"
It was failing for me with single quotes, and worked with double.
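For comparison, here is a minimal sketch of the same header requirement using Python requests instead of fetch (the account name, table name, and SAS token are placeholders, and the SAS token is assumed to grant add permission on the table):

import json
import requests

ACCOUNT = "mystorageaccount"          # placeholder
TABLE = "MyTable"                     # placeholder
SAS_TOKEN = "?sv=<...>&sig=<...>"     # placeholder SAS token

url = f"https://{ACCOUNT}.table.core.windows.net/{TABLE}{SAS_TOKEN}"

# Without JSON media-type headers the Table service answers 415 Unsupported Media Type.
headers = {
    "Content-Type": "application/json",
    "Accept": "application/json;odata=nometadata",
}

entity = {"PartitionKey": "pk1", "RowKey": "rk1", "value": 42}

resp = requests.post(url, headers=headers, data=json.dumps(entity))
print(resp.status_code, resp.text)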

Streaming data to Bigquery using Appengine

I'm collecting data (derived from cookies installed on some websites) in BigQuery, using a streaming approach with Python code on App Engine.
The function I use to save the data is the following:
# Imports assumed for this snippet (legacy oauth2client App Engine credentials
# and the Google API client library):
import httplib2
from googleapiclient import discovery
from oauth2client.contrib import appengine


def stream_data(data):
    PROJECT_ID = "project_id"
    DATASET_ID = "dataset_id"
    _SCOPE = 'https://www.googleapis.com/auth/bigquery'
    credentials = appengine.AppAssertionCredentials(scope=_SCOPE)
    http = credentials.authorize(httplib2.Http())
    table = "table_name"
    body = {
        "ignoreUnknownValues": True,
        "kind": "bigquery#tableDataInsertAllRequest",
        "rows": [
            {
                "json": data,
            },
        ]
    }
    bigquery = discovery.build('bigquery', 'v2', http=http)
    bigquery.tabledata().insertAll(projectId=PROJECT_ID, datasetId=DATASET_ID,
                                   tableId=table, body=body).execute()
I have deployed the solution on two different App Engine instances and I get different results. My question is: how is that possible?
On the other hand, comparing the results with Google Analytics metrics, I also notice that not all the data is stored in BigQuery. Do you have any idea about this problem?
In your code there is no exception handling around the insertAll operation, so if BigQuery can't write the data, you don't catch the error.
In your last line, try this code:
bQreturn = bigquery.tabledata().insertAll(projectId=PROJECT_ID, datasetId=DATASET_ID, tableId=table, body=body).execute()
logging.debug(bQreturn)
This way you can easily find a possible error from the insertAll operation in the Google Cloud Platform logs.
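To make that concrete: insertAll reports per-row failures in an insertErrors field of the response instead of raising an exception, so it is worth checking that field explicitly. A short sketch, reusing bigquery, PROJECT_ID, DATASET_ID, table, and body from the stream_data function above:

import logging

bQreturn = bigquery.tabledata().insertAll(
    projectId=PROJECT_ID, datasetId=DATASET_ID, tableId=table, body=body
).execute()

insert_errors = bQreturn.get("insertErrors")
if insert_errors:
    # Each entry names the failing row index and the reasons it was rejected.
    logging.error("BigQuery rejected rows: %s", insert_errors)
else:
    logging.debug("insertAll succeeded: %s", bQreturn)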
When using the insertAll() method you have to keep this in mind:
Data is streamed temporarily in the streaming buffer, which has different availability characteristics than managed storage. Certain operations in BigQuery do not interact with the streaming buffer, such as table copy jobs and API methods like tabledata.list. {1}
If you are using the table preview, streaming buffered entries may not be visible.
Doing SELECT COUNT(*) from your table should return your total number of entries.
{1}: https://cloud.google.com/bigquery/troubleshooting-errors#missingunavailable-data

Can BigQuery report the mismatched schema field?

When I upsert a row that doesn't match the schema, I get a PartialFailureError along with a message, e.g.:
[ { errors:
    [ { message: 'Repeated record added outside of an array.',
        reason: 'invalid' } ],
  ...
]
However, for large rows this isn't sufficient, because I have no idea which field is the one causing the error. The bq command-line tool does report the malformed field.
Is there a way to configure or access the name of the offending field, or can this be added to the API endpoint?
Please see this GitHub issue: https://github.com/googleapis/nodejs-bigquery/issues/70. Apparently the Node.js client library is not picking up the location field from the API, so it's not able to return it to the caller.
A workaround that worked for me: I copied the JSON payload into my Postman client and manually sent the request to the REST API (let me know if you need more details on how to do it).
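Roughly the same workaround in Python, for anyone who prefers it to Postman: send the payload straight to the REST endpoint with google-auth, so the raw response (including the per-field location that the linked issue says the Node.js client drops) is visible. The project, dataset, table, and sample row below are placeholders:

import google.auth
from google.auth.transport.requests import AuthorizedSession

credentials, _ = google.auth.default(
    scopes=["https://www.googleapis.com/auth/bigquery"]
)
session = AuthorizedSession(credentials)

url = (
    "https://bigquery.googleapis.com/bigquery/v2/projects/my-project"
    "/datasets/my_dataset/tables/my_table/insertAll"
)
body = {"rows": [{"json": {"id": "abc", "repeated_field": {"x": 1}}}]}

resp = session.post(url, json=body)
# insertErrors[].errors[] entries carry "message", "reason", and (for schema
# mismatches) a "location" naming the offending field.
print(resp.json())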