BigQuery: Load Data into EU Dataset from GCS

In the past I have successfully loaded data into US-hosted BigQuery datasets from CSV data in US-hosted GCS buckets. We have since decided to move our BigQuery data to the EU, and I created a new dataset with the EU region selected. I have successfully populated those of our tables that are small enough to be uploaded from my machine at home, but two tables are far too large for this, so I would like to load them from files in GCS. I have tried doing this from both a US-hosted GCS bucket and an EU-hosted GCS bucket (thinking that bq load might not like to cross regions), but the load fails every time. Below is the error detail I'm getting from the bq command line (500, Internal Error). Does anyone know why this might be happening? Is loading data into EU-hosted BigQuery datasets from GCS known to work for others?
{
  "configuration": {
    "load": {
      "destinationTable": {
        "datasetId": "######",
        "projectId": "######",
        "tableId": "test"
      },
      "schema": {
        "fields": [
          {
            "name": "test_col",
            "type": "INTEGER"
          }
        ]
      },
      "sourceFormat": "CSV",
      "sourceUris": [
        "gs://######/test.csv"
      ]
    }
  },
  "etag": "######",
  "id": "######",
  "jobReference": {
    "jobId": "job_Y4U58uTyxitsvbgljFi2x534N7M",
    "projectId": "######"
  },
  "kind": "bigquery#job",
  "selfLink": "https://www.googleapis.com/bigquery/v2/projects/######",
  "statistics": {
    "creationTime": "1445336673213",
    "endTime": "1445336674738",
    "startTime": "1445336674738"
  },
  "status": {
    "errorResult": {
      "message": "An internal error occurred and the request could not be completed.",
      "reason": "internalError"
    },
    "errors": [
      {
        "message": "An internal error occurred and the request could not be completed.",
        "reason": "internalError"
      }
    ],
    "state": "DONE"
  },
  "user_email": "######"
}
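For reference, the same job detail shown above can also be retrieved programmatically. A minimal sketch with the BigQuery Python client, assuming the job ran in the EU location and using a placeholder project ID:

    from google.cloud import bigquery

    client = bigquery.Client(project="my-project")  # placeholder project ID

    # Fetch the failed load job by its ID and print the error detail that the
    # bq command line reported above.
    job = client.get_job("job_Y4U58uTyxitsvbgljFi2x534N7M", location="EU")
    print(job.state, job.error_result)
    for error in job.errors or []:
        print(error["reason"], "-", error["message"])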

After searching through other related questions on Stack Overflow I eventually realised that I had set my GCS bucket's location to the europe-west1 region and not the multi-region EU location. Things are now working as expected.
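For anyone hitting the same wall, here is a minimal sketch of the working setup, using the Python client rather than the bq CLI and with placeholder project, dataset and bucket names. The key point, per the resolution above, is that the bucket and the dataset are both in the EU multi-region:

    from google.cloud import bigquery

    # Placeholder project/dataset/table/bucket names; the bucket is assumed to
    # have been created in the EU multi-region so that it is co-located with
    # the EU dataset.
    client = bigquery.Client(project="my-project")

    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        schema=[bigquery.SchemaField("test_col", "INTEGER")],
    )

    load_job = client.load_table_from_uri(
        "gs://my-eu-bucket/test.csv",      # bucket created in the EU multi-region
        "my-project.my_eu_dataset.test",
        location="EU",                     # matches the dataset's location
        job_config=job_config,
    )
    load_job.result()  # waits for completion and raises if the load fails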

Related

BigQuery Execute fails with no meaningful error on Cloud Data Fusion

I'm trying to use the BigQuery Execute function in Cloud Data Fusion (Google). The component validates fine and the SQL checks out, but I get this unhelpful error with every execution:
02/11/2022 12:51:25 ERROR Pipeline 'test-bq-execute' failed.
02/11/2022 12:51:25 ERROR Workflow service 'workflow.default.test-bq-execute.DataPipelineWorkflow.<guid>' failed.
02/11/2022 12:51:25 ERROR Program DataPipelineWorkflow execution failed.
I can see nothing else to help me debug this. Any ideas? The SQL in question is a simple DELETE FROM dataset.table WHERE ds = CURRENT_DATE().
This was the pipeline:
{
  "name": "test-bq-execute",
  "description": "Data Pipeline Application",
  "artifact": {
    "name": "cdap-data-pipeline",
    "version": "6.5.1",
    "scope": "SYSTEM"
  },
  "config": {
    "resources": {
      "memoryMB": 2048,
      "virtualCores": 1
    },
    "driverResources": {
      "memoryMB": 2048,
      "virtualCores": 1
    },
    "connections": [],
    "comments": [],
    "postActions": [],
    "properties": {},
    "processTimingEnabled": true,
    "stageLoggingEnabled": false,
    "stages": [
      {
        "name": "BigQuery Execute",
        "plugin": {
          "name": "BigQueryExecute",
          "type": "action",
          "label": "BigQuery Execute",
          "artifact": {
            "name": "google-cloud",
            "version": "0.18.1",
            "scope": "SYSTEM"
          },
          "properties": {
            "project": "auto-detect",
            "sql": "DELETE FROM GCPQuickStart.account WHERE ds = CURRENT_DATE()",
            "dialect": "standard",
            "mode": "batch",
            "dataset": "GCPQuickStart",
            "table": "account",
            "useCache": "false",
            "location": "US",
            "rowAsArguments": "false",
            "serviceAccountType": "filePath",
            "serviceFilePath": "auto-detect"
          }
        },
        "outputSchema": [
          {
            "name": "etlSchemaBody",
            "schema": ""
          }
        ],
        "id": "BigQuery-Execute",
        "type": "action",
        "label": "BigQuery Execute",
        "icon": "fa-plug"
      }
    ],
    "schedule": "0 1 */1 * *",
    "engine": "spark",
    "numOfRecordsPreview": 100,
    "maxConcurrentRuns": 1
  }
}
I was able to catch the error using Cloud Logging. To enable Cloud Logging in Cloud Data Fusion, you can follow the GCP documentation, and then follow the documented steps to view the Data Fusion logs in Cloud Logging. Replicating your scenario, this is the error I found:
"logMessage": "Program DataPipelineWorkflow execution failed.\njava.util.concurrent.ExecutionException: com.google.cloud.bigquery.BigQueryException: Cannot set destination table in jobs with DML statements\n at java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357)\n at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1908)\n at io.cdap.cdap.internal.app.runtime.distributed.AbstractProgramTwillRunnable.run(AbstractProgramTwillRunnable.java:274)\n at org.apache.twill.interna..."
}
What we did to resolve the error "Cannot set destination table in jobs with DML statements" was to leave the Dataset Name and Table Name empty in the pipeline properties, since there is no need to specify a destination table for a DML statement.
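For context, the restriction comes from BigQuery itself: a query job that runs a DML statement must not have a destination table set, which appears to be what the plugin was doing when the Dataset Name and Table Name fields were filled in. A minimal sketch of the equivalent call with the BigQuery Python client (statement and location taken from the pipeline above):

    from google.cloud import bigquery

    client = bigquery.Client()

    # Run the DML statement as a plain query job; no destination table is set,
    # which is exactly the condition the error message complains about.
    query_job = client.query(
        "DELETE FROM GCPQuickStart.account WHERE ds = CURRENT_DATE()",
        location="US",
    )
    query_job.result()  # raises if the statement fails
    print(query_job.num_dml_affected_rows, "rows deleted")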

BigQuery: Routine deployment failing with error "Unknown option: description"

We use Terraform to deploy BigQuery objects (datasets, tables, routines etc.) to region europe-west2 in GCP. We do this many times a day, and all of a sudden, at "2021-08-18T21:15:44.033910202Z", our deployments started failing when attempting to deploy BigQuery routines. They all fail with errors of the form:
status: {
  code: 3
  message: "Unknown option: description"
}
Here is the first log message I can find pertaining to this error (I have redacted project names):
"protoPayload": {
"#type": "type.googleapis.com/google.cloud.audit.AuditLog",
"status": {
"code": 3,
"message": "Unknown option: description"
},
"authenticationInfo": {
"principalEmail": "deployer-dev#myadminproject.iam.gserviceaccount.com",
"serviceAccountDelegationInfo": [
{
"firstPartyPrincipal": {
"principalEmail": "deployer-dev#myadminproject.iam.gserviceaccount.com"
}
}
]
},
"requestMetadata": {
"callerIp": "10.51.0.116",
"callerSuppliedUserAgent": "Terraform/0.14.7 (+https://www.terraform.io) Terraform-Plugin-SDK/2.5.0 terraform-provider-google/3.69.0,gzip(gfe)",
"callerNetwork": "//compute.googleapis.com/projects/myadminproject/global/networks/__unknown__",
"requestAttributes": {},
"destinationAttributes": {}
},
"serviceName": "bigquery.googleapis.com",
"methodName": "google.cloud.bigquery.v2.RoutineService.InsertRoutine",
"authorizationInfo": [
{
"resource": "projects/myproject/datasets/p00003818_dp_model",
"permission": "bigquery.routines.create",
"granted": true,
"resourceAttributes": {}
}
],
"resourceName": "projects/myproject/datasets/p00003818_dp_model/routines/UserProfile_Events_AllCarData_Deployment",
"metadata": {
"routineCreation": {
"routine": {
"routineName": "projects/myproject/datasets/p00003818_dp_model/routines/UserProfile_Events_AllCarData_Deployment"
},
"reason": "ROUTINE_INSERT_REQUEST"
},
"#type": "type.googleapis.com/google.cloud.audit.BigQueryAuditMetadata"
}
},
"insertId": "ak27xdbke",
"resource": {
"type": "bigquery_dataset",
"labels": {
"dataset_id": "p00003818_dp_model",
"project_id": "myproject"
}
},
"timestamp": "2021-08-18T21:15:43.109609Z",
"severity": "ERROR",
"logName": "projects/myproject/logs/cloudaudit.googleapis.com%2Factivity",
"receiveTimestamp": "2021-08-18T21:15:44.033910202Z"
}
The fact that this occurred without any changes on our side indicates that this is a problem at the Google end. I also observe that, whilst we witnessed this in a few projects, it occurred first in one project and then a few minutes later in another - that may or may not be helpful information.
Posting here in case anyone else hits this problem, and also hoping it might catch the attention of a Googler.
UPDATE! I have reproduced the problem using the REST API: https://cloud.google.com/bigquery/docs/reference/rest/v2/routines/insert
A payload that does not include a description successfully creates a routine. However, if I include a description, which according to the API reference is a valid parameter, the request fails with the same "Unknown option: description" error.
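For anyone wanting to reproduce this outside Terraform, here is a minimal sketch with the BigQuery Python client; the routine ID is a placeholder and the body is a trivial no-argument SQL function, so the only interesting part is the description field that the backend was rejecting:

    from google.cloud import bigquery

    client = bigquery.Client(project="myproject")

    routine = bigquery.Routine(
        "myproject.p00003818_dp_model.demo_routine",  # placeholder routine ID
        type_="SCALAR_FUNCTION",
        language="SQL",
        body="1",                     # trivial no-argument SQL UDF
        description="Demo routine",   # the option the backend rejected
    )

    # During the incident this call failed with "Unknown option: description";
    # omitting description made the same insert succeed.
    client.create_routine(routine)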

Azure Data Factory v2 If activity always fails

I'm currently struggling with the Azure Data Factory v2 If activity, which always fails with the error message quoted below.
I've designed two separate pipelines: one takes a full snapshot of the data (1333 records) from the on-premises SQL Server and loads it into the Azure SQL Database, and the other just takes the delta from the same source.
Both pipelines work fine when executed independently.
I then decided to wrap these two pipelines in one parent pipeline which would do this:
1. Execute a Lookup activity to check whether the target table in the Azure SQL Database has any records, using a basic Select Count(Request_ID) As record_count From target_table - the activity works fine and I can preview the returned record count.
2. Pass the output from the Lookup activity to the If activity, with the condition that if record_count = 0 the parent pipeline invokes the full load pipeline, otherwise it invokes the delta load pipeline.
This is the actual expression:
{#activity('lookup_sites_record_count').output.firstRow.record_count}==0
Whenever I try to execute this parent pipeline, it fails with the message "Activity failed: Activity failed because an inner activity failed."
Both inner activities, that is, the full load and delta load pipelines, work just fine when triggered independently.
What am I missing?
Many thanks in advance :).
mikhailg
Pipeline's JSON definition below:
{
  "name": "pl_remedyreports_load_rs_sites",
  "properties": {
    "activities": [
      {
        "name": "lookup_sites_record_count",
        "type": "Lookup",
        "policy": {
          "timeout": "7.00:00:00",
          "retry": 0,
          "retryIntervalInSeconds": 30,
          "secureOutput": false
        },
        "typeProperties": {
          "source": {
            "type": "SqlSource",
            "sqlReaderQuery": "Select Count(Request_ID) As record_count From mdp.RS_Sites;"
          },
          "dataset": {
            "referenceName": "ds_azure_sql_db_sites",
            "type": "DatasetReference"
          }
        }
      },
      {
        "name": "If_check_site_record_count",
        "type": "IfCondition",
        "dependsOn": [
          {
            "activity": "lookup_sites_record_count",
            "dependencyConditions": [
              "Succeeded"
            ]
          }
        ],
        "typeProperties": {
          "expression": {
            "value": "{#activity('lookup_sites_record_count').output.firstRow.record_count}==0",
            "type": "Expression"
          },
          "ifFalseActivities": [
            {
              "name": "pl_remedyreports_invoke_load_sites_inc",
              "type": "ExecutePipeline",
              "typeProperties": {
                "pipeline": {
                  "referenceName": "pl_remedyreports_load_sites_inc",
                  "type": "PipelineReference"
                }
              }
            }
          ],
          "ifTrueActivities": [
            {
              "name": "pl_remedyreports_invoke_load_sites_full",
              "type": "ExecutePipeline",
              "typeProperties": {
                "pipeline": {
                  "referenceName": "pl_remedyreports_load_sites_full",
                  "type": "PipelineReference"
                }
              }
            }
          ]
        }
      }
    ],
    "folder": {
      "name": "Load Remedy Reference Data"
    }
  }
}
Your expression should be:
@equals(activity('lookup_sites_record_count').output.firstRow.record_count, 0)
The If activity's expression must evaluate to a boolean, so use the equals() function rather than wrapping the value in string interpolation and comparing it with ==.

Invalid Path error while inserting job from google cloud storage to google bigquery

I am trying to insert a job through an HTTP POST request, but I am getting an Invalid path error.
My request body is as follows:
{
  "configuration": {
    "load": {
      "sourceUris": [
        "gs://onianalytics/PersData.csv"
      ],
      "schema": {
        "fields": [
          {
            "name": "Name",
            "type": "STRING"
          },
          {
            "name": "Age",
            "type": "INTEGER"
          }
        ]
      },
      "destinationTable": {
        "datasetId": "Test_Dataset",
        "projectId": "lithe-anvil-404",
        "tableId": "tb_test_Pers"
      }
    }
  },
  "jobReference": {
    "jobId": "10",
    "projectId": "lithe-anvil-404"
  }
}
For the sourceUris parameter, I am passing "gs://onianalytics/PersData.csv", where onianalytics is my bucket name and PersData.csv is the CSV file from which I want to upload data into Google BigQuery.
I am getting the response below:
"status": {
"state": "DONE",
"errorResult": {
"reason": "invalid",
"message": "Invalid path: gs://onianalytics/PersData.csv"
},
"errors": [
{
"reason": "invalid",
"message": "Invalid path: gs://onianalytics/PersData.csv"
}
]
},
"statistics": {
"creationTime": "1387276603674",
"startTime": "1387276603751",
"endTime": "1387276603751"
}
}
My bucket is under the same project ID that has the BigQuery service activated. Also, I have Google Cloud Storage enabled under APIs and Auth. The following scopes are added while authenticating:
googleapis.com/auth/bigquery, googleapis.com/auth/cloud-platform, googleapis.com/auth/devstorage.full_control, googleapis.com/auth/devstorage.read_only, googleapis.com/auth/devstorage.read_write
I am inserting this job via the "Try it!" link available on developers.google.com/bigquery/docs/reference/v2/jobs/insert.
In fact I am able to create buckets and objects in Google Cloud Storage through the APIs, but when I try to insert a job from the uploaded object (which is a CSV file), I get the "Invalid Path" error. Can anyone please help me identify why this error is occurring?
The error I get when trying the code above is "Not found: URI gs://onianalytics/PersData.csv".
I'm wondering if instead of /onianalytics/ you had a different path with invalid characters?
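One quick way to check that theory is to list the bucket and look at the exact object names. A minimal sketch with the Cloud Storage Python client, using the bucket and project names from the question:

    from google.cloud import storage

    client = storage.Client(project="lithe-anvil-404")

    # repr() makes stray spaces or unexpected characters in object names visible.
    for blob in client.list_blobs("onianalytics"):
        print(repr(blob.name))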

Invalid Path while inserting job from google cloud storage to google bigquery

I am trying to insert a job through an HTTP POST request, but I am getting an Invalid path error.
My request body is as follows:
{
  "configuration": {
    "load": {
      "sourceUris": [
        "gs://onianalytics/PersData.csv"
      ],
      "schema": {
        "fields": [
          {
            "name": "Name",
            "type": "STRING"
          },
          {
            "name": "Age",
            "type": "INTEGER"
          }
        ]
      },
      "destinationTable": {
        "datasetId": "Test_Dataset",
        "projectId": "lithe-anvil-404",
        "tableId": "tb_test_Pers"
      }
    }
  },
  "jobReference": {
    "jobId": "10",
    "projectId": "lithe-anvil-404"
  }
}
For the sourceUris parameter, I am passing "gs://onianalytics/PersData.csv", where onianalytics is my bucket name and PersData.csv is the CSV file from which I want to upload data into Google BigQuery.
I am getting the response below:
"status": {
"state": "DONE",
"errorResult": {
"reason": "invalid",
"message": "Invalid path: gs://onianalytics/PersData.csv"
},
"errors": [
{
"reason": "invalid",
"message": "Invalid path: gs://onianalytics/PersData.csv"
}
]
},
"statistics": {
"creationTime": "1387276603674",
"startTime": "1387276603751",
"endTime": "1387276603751"
}
}
Please explain why this error is occurring?
Is your bucket under the same projectId that has the BigQuery service activated and that you requested tokens with? If not, have you tried enabling read/write access for that project?
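A minimal sketch for checking that, using the Cloud Storage Python client and the names from the question; it shows which project the default credentials belong to and whether they can read the object:

    import google.auth
    from google.cloud import storage

    # Which project do the default credentials belong to?
    credentials, default_project = google.auth.default()
    print("credentials project:", default_project)

    # Can those credentials actually read the object the load job points at?
    client = storage.Client(project="lithe-anvil-404", credentials=credentials)
    blob = client.bucket("onianalytics").blob("PersData.csv")
    print("object readable with these credentials:", blob.exists())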