Azure Data Factory v2 If activity always fails - azure-data-factory-2

I'm currently struggling with the Azure Data Factory v2 If activity, which always fails with an error.
I've designed two separate pipelines, one takes the full snapshot of the data (1333 records) from the on-premises SQL Server and loads the data into the Azure SQL Database, and another one just takes delta from the same source.
Both pipelines work fine when executed independently.
I then decided to wrap these two pipelines into one parent pipeline, which would do this:
1. Execute LookUp activity to check if the target table in Azure SQL Database has any records, basic Select Count(Request_ID) As record_count From target_table - activity works fine, I can preview the returned record count.
2. Pass the output from the LookUp activity to the If activity with the conditions that if record_count = 0, the parent pipeline would invoke the full load pipeline, otherwise the parent pipeline would invoke the delta load pipeline.
This is the actual expression:
{@activity('lookup_sites_record_count').output.firstRow.record_count}==0
Whenever I try to execute this parent pipeline, it fails with the message "Activity failed: Activity failed because an inner activity failed."
Both inner activities, that is, full load and delta load pipelines, work just fine when triggered independently.
What am I missing?
Many thanks in advance :).
mikhailg
Pipeline's JSON definition below:
{
"name": "pl_remedyreports_load_rs_sites",
"properties": {
"activities": [
{
"name": "lookup_sites_record_count",
"type": "Lookup",
"policy": {
"timeout": "7.00:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false
},
"typeProperties": {
"source": {
"type": "SqlSource",
"sqlReaderQuery": "Select Count(Request_ID) As record_count From mdp.RS_Sites;"
},
"dataset": {
"referenceName": "ds_azure_sql_db_sites",
"type": "DatasetReference"
}
}
},
{
"name": "If_check_site_record_count",
"type": "IfCondition",
"dependsOn": [
{
"activity": "lookup_sites_record_count",
"dependencyConditions": [
"Succeeded"
]
}
],
"typeProperties": {
"expression": {
"value": "{@activity('lookup_sites_record_count').output.firstRow.record_count}==0",
"type": "Expression"
},
"ifFalseActivities": [
{
"name": "pl_remedyreports_invoke_load_sites_inc",
"type": "ExecutePipeline",
"typeProperties": {
"pipeline": {
"referenceName": "pl_remedyreports_load_sites_inc",
"type": "PipelineReference"
}
}
}
],
"ifTrueActivities": [
{
"name": "pl_remedyreports_invoke_load_sites_full",
"type": "ExecutePipeline",
"typeProperties": {
"pipeline": {
"referenceName": "pl_remedyreports_load_sites_full",
"type": "PipelineReference"
}
}
}
]
}
}
],
"folder": {
"name": "Load Remedy Reference Data"
}
}
}

Your expression should be:
@equals(activity('lookup_sites_record_count').output.firstRow.record_count,0)
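For context, here is a minimal sketch of how that condition might sit inside the If activity from the question's pipeline JSON. The If Condition expression has to evaluate to a boolean, and the Data Factory expression language compares values with functions such as equals() rather than an infix == operator. Only the expression block is shown; dependsOn, ifTrueActivities and ifFalseActivities are assumed to stay exactly as in the question.

```json
{
    "name": "If_check_site_record_count",
    "type": "IfCondition",
    "typeProperties": {
        "expression": {
            "value": "@equals(activity('lookup_sites_record_count').output.firstRow.record_count, 0)",
            "type": "Expression"
        }
    }
}
```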

Related

BigQuery Execute fails with no meaningful error on Cloud Data Fusion

I'm trying to use the BigQuery Execute function in Cloud Data Fusion (Google). The component validates fine, the SQL checks out but I get this non-meaningful error with every execution:
02/11/2022 12:51:25 ERROR Pipeline 'test-bq-execute' failed.
02/11/2022 12:51:25 ERROR Workflow service 'workflow.default.test-bq-execute.DataPipelineWorkflow.<guid>' failed.
02/11/2022 12:51:25 ERROR Program DataPipelineWorkflow execution failed.
I can see nothing else to help me debug this. Any ideas? The SQL in question is a simple DELETE from dataset.table WHERE ds = CURRENT_DATE()
This was the pipeline
{
"name": "test-bq-execute",
"description": "Data Pipeline Application",
"artifact": {
"name": "cdap-data-pipeline",
"version": "6.5.1",
"scope": "SYSTEM"
},
"config": {
"resources": {
"memoryMB": 2048,
"virtualCores": 1
},
"driverResources": {
"memoryMB": 2048,
"virtualCores": 1
},
"connections": [],
"comments": [],
"postActions": [],
"properties": {},
"processTimingEnabled": true,
"stageLoggingEnabled": false,
"stages": [
{
"name": "BigQuery Execute",
"plugin": {
"name": "BigQueryExecute",
"type": "action",
"label": "BigQuery Execute",
"artifact": {
"name": "google-cloud",
"version": "0.18.1",
"scope": "SYSTEM"
},
"properties": {
"project": "auto-detect",
"sql": "DELETE FROM GCPQuickStart.account WHERE ds = CURRENT_DATE()",
"dialect": "standard",
"mode": "batch",
"dataset": "GCPQuickStart",
"table": "account",
"useCache": "false",
"location": "US",
"rowAsArguments": "false",
"serviceAccountType": "filePath",
"serviceFilePath": "auto-detect"
}
},
"outputSchema": [
{
"name": "etlSchemaBody",
"schema": ""
}
],
"id": "BigQuery-Execute",
"type": "action",
"label": "BigQuery Execute",
"icon": "fa-plug"
}
],
"schedule": "0 1 */1 * *",
"engine": "spark",
"numOfRecordsPreview": 100,
"maxConcurrentRuns": 1
}
}
I was able to catch the error using Cloud Logging. To enable Cloud Logging in Cloud Data Fusion, you can follow the GCP documentation, and then follow the steps for viewing Data Fusion logs in Cloud Logging. After replicating your scenario, this is the error I found:
"logMessage": "Program DataPipelineWorkflow execution failed.\njava.util.concurrent.ExecutionException: com.google.cloud.bigquery.BigQueryException: Cannot set destination table in jobs with DML statements\n at java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357)\n at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1908)\n at io.cdap.cdap.internal.app.runtime.distributed.AbstractProgramTwillRunnable.run(AbstractProgramTwillRunnable.java:274)\n at org.apache.twill.interna..."
}
To resolve the error "Cannot set destination table in jobs with DML statements", we left the Dataset Name and Table Name empty in the pipeline properties, as there is no need to specify a destination table for a DML statement.
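A sketch of how the plugin properties from the question's pipeline might look after that change. It assumes that dropping the dataset and table keys from the exported JSON is equivalent to leaving Dataset Name and Table Name empty in the UI; all other values are copied unchanged from the question.

```json
{
    "properties": {
        "project": "auto-detect",
        "sql": "DELETE FROM GCPQuickStart.account WHERE ds = CURRENT_DATE()",
        "dialect": "standard",
        "mode": "batch",
        "useCache": "false",
        "location": "US",
        "rowAsArguments": "false",
        "serviceAccountType": "filePath",
        "serviceFilePath": "auto-detect"
    }
}
```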

Is there a way to obtain pipeline activity run details for Azure Synapse from Log Analytics?

I have set up a log analytics workspace and added it to the diagnostic settings on my Synapse workspace. However, I am unable to write a query that extracts pipeline activity information such as dataRead, rowsCopied, etc.
I have tried using
dataReadvar = parse_json(Output).dataRead to extract the JSON within SynapseIntegrationActivityRuns but it doesn’t seem to be able to find ‘Output’.
Azure Data Factory and Azure Synapse Analytics have three groupings of activities: data movement activities, data transformation activities, and control activities. An activity can take zero or more input datasets and produce one or more output datasets.
The sample below shows how a pipeline and its activities are defined:
{
"name": "PipelineName",
"properties":
{
"description": "pipeline description",
"activities":
[
],
"parameters": {
},
"concurrency": <your max pipeline concurrency>,
"annotations": [
]
}
}
Below is the top-level JSON structure for execution activities:
{
"name": "Execution Activity Name",
"description": "description",
"type": "<ActivityType>",
"typeProperties":
{
},
"linkedServiceName": "MyLinkedService",
"policy":
{
},
"dependsOn":
{
}
}
Activity Policy:
{
"name": "MyPipelineName",
"properties": {
"activities": [
{
"name": "MyCopyBlobtoSqlActivity",
"type": "Copy",
"typeProperties": {
...
},
"policy": {
"timeout": "00:10:00",
"retry": 1,
"retryIntervalInSeconds": 60,
"secureOutput": true
}
}
],
"parameters": {
...
}
}
}
Here are a few Microsoft docs (Docs1, Docs2) related to your scenario that may help.

HTTP request in Azure Data Factory

In Azure Data Factory, I need to call an HTTP endpoint by URL using the HTTP connector. I was able to do this and to set up the dataset. Where I'm having issues is the pipeline. Here's what I need to do. What is the best way to accomplish this?
1. Call the service base URL and retrieve the TotalPages value from the returned header.
2. Using the value of TotalPages, make subsequent requests to the URL with the page parameter (e.g., page=1, page=2, etc.), up to TotalPages.
Thanks.
OK. So the issue here is that you cannot nest control structures in Data Factory more than one level deep. The solution is to create two or more pipelines (aka Master and Child).
From the Master pipeline, retrieve the number of tasks you need to execute and pass them to a for loop. Inside the loop, launch a new Child pipeline for each item; the Child pipeline then executes the second activity.
If the activity is simple enough, you can skip the Child pipeline altogether and run it directly inside the first for loop.
As a JSON representation, the pipelines in question should look along these lines:
{
"name": "generic_master",
"properties": {
"activities": [
{
"name": "Web1",
"type": "WebActivity",
"dependsOn": [],
"policy": {
"timeout": "7.00:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false,
"secureInput": false
},
"userProperties": [],
"typeProperties": {
"url": "https://jsonplaceholder.typicode.com/posts/1",
"method": "GET"
}
},
{
"name": "ForEach1",
"type": "ForEach",
"dependsOn": [
{
"activity": "Web1",
"dependencyConditions": [
"Succeeded"
]
}
],
"userProperties": [],
"typeProperties": {
"items": {
"value": "@activity('Web1').output",
"type": "Expression"
},
"activities": [
{
"name": "Execute Pipeline1",
"type": "ExecutePipeline",
"dependsOn": [],
"userProperties": [],
"typeProperties": {
"pipeline": {
"referenceName": "generic_child",
"type": "PipelineReference"
},
"waitOnCompletion": true
}
}
]
}
}
],
"annotations": []
}
}
{
"name": "generic_child",
"properties": {
"activities": [
{
"name": "Web1",
"type": "WebActivity",
"dependsOn": [],
"policy": {
"timeout": "7.00:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false,
"secureInput": false
},
"userProperties": [],
"typeProperties": {
"url": "https://jsonplaceholder.typicode.com/posts/1",
"method": "POST"
}
}
],
"annotations": []
}
}
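The master pipeline above does not yet hand the current loop item to the child. One way this might be done is through the parameters property of the Execute Pipeline activity, sketched below. The parameter name page is hypothetical, and generic_child would need to declare a matching parameter in its own parameters section.

```json
{
    "name": "Execute Pipeline1",
    "type": "ExecutePipeline",
    "typeProperties": {
        "pipeline": {
            "referenceName": "generic_child",
            "type": "PipelineReference"
        },
        "waitOnCompletion": true,
        "parameters": {
            "page": {
                "value": "@item()",
                "type": "Expression"
            }
        }
    }
}
```

Inside the child, the value would then be available as @pipeline().parameters.page, for example when building the request URL.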
In order to read the TotalPages values from the HTTP Request's response, you can use a "Lookup" activity to submit the HTTP request and store the TotalPages value in a variable with the "Set variable" activity.
Actions:
1. Pipeline level: create a variable called TotalPages.
2. Lookup activity: tick the "First row only" box on the Settings tab, use the dataset defined for your HTTP request as the source dataset, and select the GET method.
3. Set variable activity: select the TotalPages variable on the Variables tab. In the value box, click "Add dynamic content" and enter something like this: @{activity('GetTotalPages').output.firstRow.RegisterSearch['#TotalPages']}
In my case, the lookup activity is called GetTotalPages, and my HTTP request returns the total number of pages in a RegisterSearch array, under a column name #TotalPages
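In pipeline JSON, the Set variable step might look roughly like the sketch below. The activity name SetTotalPages is hypothetical, and the property path is copied from the expression above, so it depends on the shape of your own response.

```json
{
    "name": "SetTotalPages",
    "type": "SetVariable",
    "dependsOn": [
        {
            "activity": "GetTotalPages",
            "dependencyConditions": [
                "Succeeded"
            ]
        }
    ],
    "typeProperties": {
        "variableName": "TotalPages",
        "value": {
            "value": "@{activity('GetTotalPages').output.firstRow.RegisterSearch['#TotalPages']}",
            "type": "Expression"
        }
    }
}
```

A ForEach activity could then iterate over the pages with an items expression such as @range(1, int(variables('TotalPages'))), assuming the service numbers its pages from 1.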

"Value cannot be null.\r\nParameter name: endpoint" in Azure Data Factory V2

I am getting the error above when I execute the Azure ML Batch Execution activity in ADF V2.
I have written the following JSON for the ML activity:
{
"name": "MLBatchExecution1",
"description": "",
"type": "AzureMLBatchExecution",
"linkedServiceName": {
"name": "AzureMLLinkedservice2",
"type": "AzureML"
},
"typeProperties": {
"webServiceInputs": {
"input1": {
"LinkedServiceName":{
"name": "azureblobstoragelinkedservice",
"type": "AzureStorage"
},
"FilePath":"tutoial/Input/TraiData.csv"
},
"input2": {
"LinkedServiceName":{
"name": "azureblobstoragelinkedservice",
"type": "AzureStorage"
},
"FilePath":"tutoial/Input/TestData.csv"
}
},
"webServiceOutputs": {
"output1": {
"LinkedServiceName":{
"name": "AzureStorageLinkedService2",
"type": "AzureStorage"
},
"FilePath":"tutoial/Output/Output.csv"
}
}
}
}
I used the following link to create the linked service and activity:
https://learn.microsoft.com/en-us/azure/data-factory/transform-data-using-machine-learning
Can anyone help with this, please?
Any help will be appreciated.
Thanks
Deepak
Try passing the argument when triggering the pipeline. You are getting this error because a required parameter (endpoint, according to the message) has no value.

Copy Activity Properties to update Azure DW data from an on prem SQL stored procedure in data factory

I'm not sure that what I'm trying to achieve is even possible in Data Factory, but I guess there should be a way.
Simply put, I have a table in the DW that needs to be updated by a stored procedure once a day.
This stored procedure resides on the source DB. I am looking for a way to pass some IDs to it, get the results from that SP, and store them in the DW.
Any help would be appreciated. The pipeline below is all I could think of:
{
"name": "UpdateColumnX",
"properties": {
"activities": [
{
"type": "SqlServerStoredProcedure?? Not Really Sure",
"typeProperties": {
"source": {
"type": "SqlSource",
"sqlReaderQuery": "$$Text.Format('Passing IDs to the stored Procedure', Time.AddHours(WindowStart,10), Time.AddHours(WindowEnd,10))\n"
},
"storedProcedureName": "UpdateDataThroughSP",
"storedProcedureParameters": {
"StartDate": "$$Text.Format('{0:yyyy-MM-dd HH:mm:ss}', Time.AddHours(WindowStart,10))",
"EndDate ": "$$Text.Format('{0:yyyy-MM-dd HH:mm:ss}', Time.AddHours(WindowEnd,10))"
}
},
"inputs": [
{
"name": "Not Sure which table should be my Input, the DW table having the IDs or the source table? "
}
],
"outputs": [
{
"name": "Sames and Input not sure"
}
],
"policy": {
"timeout": "01:00:00",
"concurrency": 1,
"retry": 3
},
"scheduler": {
"frequency": "Day",
"interval": 1,
"offset": "20:30:00"
},
"name": "Update Data through Source SP"
}
],
"start": "2017-09-13T20:30:00.045Z",
"end": "2099-12-30T13:00:00Z",
"isPaused": false,
"hubName": "HubName",
"pipelineMode": "Scheduled"
}
}