BigQuery Execute fails with no meaningful error on Cloud Data Fusion - google-bigquery

I'm trying to use the BigQuery Execute function in Cloud Data Fusion (Google). The component validates fine, the SQL checks out but I get this non-meaningful error with every execution:
02/11/2022 12:51:25 ERROR Pipeline 'test-bq-execute' failed.
02/11/2022 12:51:25 ERROR Workflow service 'workflow.default.test-bq-execute.DataPipelineWorkflow.<guid>' failed.
02/11/2022 12:51:25 ERROR Program DataPipelineWorkflow execution failed.
I can see nothing else to help me debug this. Any ideas? The SQL in question is a simple DELETE from dataset.table WHERE ds = CURRENT_DATE()
This was the pipeline
{
"name": "test-bq-execute",
"description": "Data Pipeline Application",
"artifact": {
"name": "cdap-data-pipeline",
"version": "6.5.1",
"scope": "SYSTEM"
},
"config": {
"resources": {
"memoryMB": 2048,
"virtualCores": 1
},
"driverResources": {
"memoryMB": 2048,
"virtualCores": 1
},
"connections": [],
"comments": [],
"postActions": [],
"properties": {},
"processTimingEnabled": true,
"stageLoggingEnabled": false,
"stages": [
{
"name": "BigQuery Execute",
"plugin": {
"name": "BigQueryExecute",
"type": "action",
"label": "BigQuery Execute",
"artifact": {
"name": "google-cloud",
"version": "0.18.1",
"scope": "SYSTEM"
},
"properties": {
"project": "auto-detect",
"sql": "DELETE FROM GCPQuickStart.account WHERE ds = CURRENT_DATE()",
"dialect": "standard",
"mode": "batch",
"dataset": "GCPQuickStart",
"table": "account",
"useCache": "false",
"location": "US",
"rowAsArguments": "false",
"serviceAccountType": "filePath",
"serviceFilePath": "auto-detect"
}
},
"outputSchema": [
{
"name": "etlSchemaBody",
"schema": ""
}
],
"id": "BigQuery-Execute",
"type": "action",
"label": "BigQuery Execute",
"icon": "fa-plug"
}
],
"schedule": "0 1 */1 * *",
"engine": "spark",
"numOfRecordsPreview": 100,
"maxConcurrentRuns": 1
}
}

I was able to catch the error using Cloud Logging. To enable Cloud Logging in Cloud Data Fusion, you may use this GCP Documentation. And follow these steps to view the logs from Data Fusion to Cloud Logging. Replicating your scenario this is the error I found:
"logMessage": "Program DataPipelineWorkflow execution failed.\njava.util.concurrent.ExecutionException: com.google.cloud.bigquery.BigQueryException: Cannot set destination table in jobs with DML statements\n at java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357)\n at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1908)\n at io.cdap.cdap.internal.app.runtime.distributed.AbstractProgramTwillRunnable.run(AbstractProgramTwillRunnable.java:274)\n at org.apache.twill.interna..."
}
What we did to resolve this error: Cannot set destination table in jobs with DML statements is we left the Dataset Name and Table Name empty inside the pipeline properties as there is no need for the destination table to be specified.
Output:

Related

REST dataset for Copy Activity Source give me error Invalid PaginationRule

My Copy Activity is setup to use a REST Get API call as my source. I keep getting Error Code 2200 Invalid PaginationRule RuleKey=supportRFC5988.
I can call the GET Rest URL using the Web Activity, but this isn't optimal as I then have to pass the output to a stored procedure to load the data to the table. I would much rather use the Copy Activity.
Any ideas why I would get an Invalid PaginationRule error on a call?
I'm using a REST Linked Service with the following properties:
Name: Workday
Connect via integration runtime: link-unknown-self-hosted-ir
Base URL: https://wd2-impl-services1.workday.com/ccx/service
Authentication type: Basic
User name: Not telling
Azure Key Vault for password
Server Certificate Validation is enabled
Parameters: Name:format Type:String Default value:json
Datasource:
"name": "Workday_Test_REST_Report",
"properties": {
"linkedServiceName": {
"referenceName": "Workday",
"type": "LinkedServiceReference",
"parameters": {
"format": "json"
}
},
"folder": {
"name": "Workday"
},
"annotations": [],
"type": "RestResource",
"typeProperties": {
"relativeUrl": "/customreport2/company1/person%40company.com/HIDDEN_BI_RaaS_Test_Outbound"
},
"schema": []
}
}
Copy Activity
{
"name": "Copy Test Workday REST API output to a table",
"properties": {
"activities": [
{
"name": "Copy data1",
"type": "Copy",
"dependsOn": [],
"policy": {
"timeout": "7.00:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false,
"secureInput": false
},
"userProperties": [],
"typeProperties": {
"source": {
"type": "RestSource",
"httpRequestTimeout": "00:01:40",
"requestInterval": "00.00:00:00.010",
"requestMethod": "GET",
"paginationRules": {
"supportRFC5988": "true"
}
},
"sink": {
"type": "SqlMISink",
"tableOption": "autoCreate"
},
"enableStaging": false
},
"inputs": [
{
"referenceName": "Workday_Test_REST_Report",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "Destination_db",
"type": "DatasetReference",
"parameters": {
"schema": "ELT",
"tableName": "WorkdayTestReportData"
}
}
]
}
],
"folder": {
"name": "Workday"
},
"annotations": []
}
}
Well after posting this, I noticed that in the copy activity code there is a nugget about "supportRFC5988": "true" I switched the true to false, and everything just worked for me. I don't see a way to change this in the Copy Activity GUI
Editing source code and setting this option to false helped!

Is there a way to obtain pipeline activity run details for Azure Synapse from Log Analytics?

I have set up a log analytics workspace and added it to the diagnostic settings on my Synapse workspace. However, I am unable to write a query that extracts pipeline activity information such as dataRead, rowsCopied, etc.
I have tried using
dataReadvar = parse_json(Output).dataRead to extract the JSON within SynapseIntegrationActivityRuns but it doesn’t seem to be able to find ‘Output’.
Azure Data Factory and Azure Synapse Analytics have three groupings of activities: data movement activities, data transformation activities, and control activities. An activity can take zero or more input datasets and produce one or more output datasets.
Check for the sample pipeline for how it is defined and its pipeline activities:
{
"name": "PipelineName",
"properties":
{
"description": "pipeline description",
"activities":
[
],
"parameters": {
},
"concurrency": <your max pipeline concurrency>,
"annotations": [
]
}
}
Below is the JSON format for Top level structure for Execution Activities:
{
"name": "Execution Activity Name",
"description": "description",
"type": "<ActivityType>",
"typeProperties":
{
},
"linkedServiceName": "MyLinkedService",
"policy":
{
},
"dependsOn":
{
}
}
Activity Policy:
{
"name": "MyPipelineName",
"properties": {
"activities": [
{
"name": "MyCopyBlobtoSqlActivity",
"type": "Copy",
"typeProperties": {
...
},
"policy": {
"timeout": "00:10:00",
"retry": 1,
"retryIntervalInSeconds": 60,
"secureOutput": true
}
}
],
"parameters": {
...
}
}
}
Here are few MS Docs as Docs1, Docs2 which are related to your scenario which can help.

Sql Database Elastic pool and sku combination is invalid

I am able to create Sql Server, Sql database, sql elastic Pool Successfully using ARM templates. But when I trying to create new database with existing elastic pool name. I am getting below error.
Without elastic pool id, database is creating successfully.
Both Sql database Elastic Pool and database are using same location, tier, edition etc.Also When tried in azure portal it created successfully.
"error": {
"code": "ResourceDeploymentFailure",
"message": "The resource operation completed with terminal provisioning state 'Failed'.",
"details": [
{
"code": "ElasticPoolSkuCombinationInvalid",
"message": "Elastic pool 'sqlsamplepool' and sku 'Basic' combination is invalid."
}
]
ARM Template:
{
"$schema": "https://schema.management.azure.com/schemas/2019-04-01/deploymentTemplate.json#",
"contentVersion": "1.0.0.0",
"parameters": {
"collation": {
"type": "string",
"metadata": {
"description": "The collation of the database."
},
"defaultValue": "SQL_Latin1_General_CP1_CI_AS"
},
"skutier": {
"type": "string",
"metadata": {
"description": "The edition of the database. The DatabaseEditions enumeration contains all the
valid editions. e.g. Basic, Premium."
},
"allowedValues": [ "Basic", "Standard", "Premium" ],
"defaultValue": "Basic"
},
"resourcelocation": {
"type": "string",
"defaultValue": "[resourceGroup().location]",
"metadata": {
"description": "Location for all resources."
}
},
"sqlservername": {
"type": "string",
"metadata": {
"description": "The name of the sql server."
}
},
"zoneRedundant": {
"type": "bool",
"metadata": {
"description": "Whether or not this database is zone redundant, which means the replicas of this database will be spread across multiple availability zones."
},
"defaultValue": false
},
"sqlElasticPoolName": {
"type": "string",
"metadata": {
"description": "The Elastic Pool name."
}
},
"databaseName": {
"type": "string"
}
},
"functions": [],
"variables": { },
"resources": [
{
"type": "Microsoft.Sql/servers/databases",
"apiVersion": "2020-08-01-preview",
"name": "[concat(parameters('sqlservername'),'/',parameter('databaseName'))]",
"location": "[parameters('resourcelocation')]",
"sku": {
"name": "[parameters('skutier')]",
"tier": "[parameters('skutier')]"
},
"properties": {
"collation": "[parameters('collation')]",
"zoneRedundant": "[parameters('zoneRedundant')]",
"elasticPoolId":"[concat('/subscriptions/',subscription().subscriptionId,'/resourceGroups/',resourceGroup().name,'/providers/Microsoft.Sql/servers/',parameters('sqlservername'),'/elasticPools/',parameters('sqlElasticPoolName'))]"
}
}
]
}
I am not sure what wrong with "2020-08-01-preview" version but its working fine with stable version. below is my partial arm template code that working.
I changed to 2014-04-01 api version.
"comments": "If Elastic Pool Name is defined, then curent database will be added to elastic pool.",
"type": "Microsoft.Sql/servers/databases",
"apiVersion": "2014-04-01",
"name": "[concat(parameters('sqlservername'),'/',variables('dbname'))]",
"location": "[parameters('resourcelocation')]",
"properties": {
"collation": "[parameters('collation')]",
"zoneRedundant": "[parameters('zoneRedundant')]",
"elasticPoolName":"[if(not(empty(parameters('sqlElasticPoolName'))),parameters('sqlElasticPoolName'),'')]",
"edition": "[parameters('skutier')]"
}

Azure Data Factory v2 If activity always fails

I'm currently struggling with the Azure Data Factory v2 If activity which always fails with this error message:
enter image description here
I've designed two separate pipelines, one takes the full snapshot of the data (1333 records) from the on-premises SQL Server and loads the data into the Azure SQL Database, and another one just takes delta from the same source.
Both pipelines work fine when executed independently.
I then decided to wrap these two pipelines into the one parent pipeline which would do this:
1.
Execute LookUp activity to check if the target table in Azure SQL Database has any records, basic Select Count(Request_ID) As record_count From target_table - activity works fine, I can preview the returned record count.
2.
Pass the output from the LookUp activity to the If activity with the conditions that if record_count = 0, the parent pipeline would invoke the full load pipeline, otherwise the parent pipeline would invoke the delta load pipeline.
This is the actual expression:
{#activity('lookup_sites_record_count').output.firstRow.record_count}==0"
Whenever I try to execute this parent pipeline, it fails with the above message of "Activity failed: Activity failed because an inner activity failed."
Both inner activities, that is, full load and delta load pipelines, work just fine when triggered independently.
What I'm missing?
Many thanks in advance :).
mikhailg
Pipeline's JSON definition below:
{
"name": "pl_remedyreports_load_rs_sites",
"properties": {
"activities": [
{
"name": "lookup_sites_record_count",
"type": "Lookup",
"policy": {
"timeout": "7.00:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false
},
"typeProperties": {
"source": {
"type": "SqlSource",
"sqlReaderQuery": "Select Count(Request_ID) As record_count From mdp.RS_Sites;"
},
"dataset": {
"referenceName": "ds_azure_sql_db_sites",
"type": "DatasetReference"
}
}
},
{
"name": "If_check_site_record_count",
"type": "IfCondition",
"dependsOn": [
{
"activity": "lookup_sites_record_count",
"dependencyConditions": [
"Succeeded"
]
}
],
"typeProperties": {
"expression": {
"value": "{#activity('lookup_sites_record_count').output.firstRow.record_count}==0",
"type": "Expression"
},
"ifFalseActivities": [
{
"name": "pl_remedyreports_invoke_load_sites_inc",
"type": "ExecutePipeline",
"typeProperties": {
"pipeline": {
"referenceName": "pl_remedyreports_load_sites_inc",
"type": "PipelineReference"
}
}
}
],
"ifTrueActivities": [
{
"name": "pl_remedyreports_invoke_load_sites_full",
"type": "ExecutePipeline",
"typeProperties": {
"pipeline": {
"referenceName": "pl_remedyreports_load_sites_full",
"type": "PipelineReference"
}
}
}
]
}
}
],
"folder": {
"name": "Load Remedy Reference Data"
}
}
}
Your expression should be:
#equals(activity('lookup_sites_record_count').output.firstRow.record_count,0)

Invalid Path error while inserting job from google cloud storage to google bigquery

I am trying to insert a job through HTTP Post request, but i am getting Invalid path error.
My request body is as follows:
{
"configuration": {
"load": {
"sourceUris": [
"gs://onianalytics/PersData.csv"
],
"schema": {
"fields": [
{
"name": "Name",
"type": "STRING"
},
{
"name": "Age",
"type": "INTEGER"
}
]
},
"destinationTable": {
"datasetId": "Test_Dataset",
"projectId": "lithe-anvil-404",
"tableId": "tb_test_Pers"
}
}
},
"jobReference": {
"jobId": "10",
"projectId": "lithe-anvil-404"
}
}
For the sourceuri parameter, I am passing "gs://onianalytics/PersData.csv", where onianalytics is my bucket name and PersData.csv is my csv file (from which I want to upload data into google bigquery).
I am getting below response:
"status": {
"state": "DONE",
"errorResult": {
"reason": "invalid",
"message": "Invalid path: gs://onianalytics/PersData.csv"
},
"errors": [
{
"reason": "invalid",
"message": "Invalid path: gs://onianalytics/PersData.csv"
}
]
},
"statistics": {
"creationTime": "1387276603674",
"startTime": "1387276603751",
"endTime": "1387276603751"
}
}
My bucket is under the same projectid which has the BigQuery service activated. Also, I have Google Cloud Storage enabled under APIs and Auth. Following scopes are added while authenticating:
googleapis.com/auth/bigquery, googleapis.com/auth/cloud-platform, googleapis.com/auth/devstorage.full_control,googleapis.com/auth/devstorage.read_only,googleapis.com/auth/devstorage.read_write
I am inserting this job by "Try it!" link which is available on developers.google.com/bigquery/docs/reference/v2/jobs/insert.
In fact I am able to create buckets and objects in goggle cloud storage through APIs. But when i try to insert job from the uploaded object (which is a csv file), i got "Invalid Path" error. Can anyone please help me to identify why this error is occurring?
The error I get when trying the code above is "Not found: URI gs://onianalytics/PersData.csv".
I'm wondering if instead of /onianalytics/ you had a different path with invalid characters?