I am running a Custom Activity in ADF v2 using the Batch Service. Whenever this runs, it only creates one CloudTask within my Batch job, even though I have more than two dozen Parallel.Invoke calls running. Is there a way I can create multiple tasks from one Custom Activity in ADF so that the processing can be spread across all nodes in the Batch pool?
I have a fixed pool with two nodes. Max tasks per node is set to 8 and the scheduling policy is set to "Spread". I have only one Custom Activity in my pipeline, with multiple Parallel.Invoke calls (almost two dozen). I was hoping this would create multiple CloudTasks and spread them across both of my nodes, as both nodes are single core. It looks like when each Custom Activity runs in ADF, it creates only one task (CloudTask) for the Batch service.
My other hope was to use
https://learn.microsoft.com/en-us/azure/batch/tutorial-parallel-dotnet
and manually create multiple CloudTasks programmatically in my console application, then run that console application with an ADF Custom Activity. But CloudTask takes a job ID and a command line. I wanted to do something like the following, but instead of passing taskCommandLine, I wanted to pass a C# method name and parameters to execute:
string taskId = "task" + i.ToString().PadLeft(3, '0');
string taskCommandLine = "ping -n " + rand.Next(minPings, maxPings + 1).ToString() + " localhost";
CloudTask task = new CloudTask(taskId, taskCommandLine);
// Wanted to do: CloudTask task = new CloudTask(taskId, SomeMethod(args));
tasks.Add(task);
Also, it looks like we can't create CloudTasks by using the Batch .NET API from within a Custom Activity of ADF.
What I wanted to achieve:
I have data in a SQL Server table and I want to run different transformations on it by slicing it horizontally or vertically (by picking rows or columns). I want to run those transformations in parallel, i.e. have multiple CloudTask instances so that each one can operate on a specific column independently and, after the transformation, load it into a different table. But the issue is that it looks like we can't use the Batch service .NET API within ADF, and the only way seems to be having multiple Custom Activities in my Data Factory pipeline.
The application needs to be deployed on each and every node within the Batch pool, and CloudTasks need to be created by calling the application with a command line:
CloudTask task = new CloudTask(
    "MyTask",
    "cmd /c %AZ_BATCH_APP_PACKAGE_MyTask%\\myTask.exe -args -here");
I'm trying to prototype using the SmartRedis Python client to interact with the SmartSim Orchestrator. Is it possible to launch the orchestrator without any other models in the experiment? If so, what would be the best way to do so?
It is entirely possible to do that. A SmartSim Experiment can contain different types of entities, including Models, Ensembles (i.e. groups of Models), and the Orchestrator (i.e. the Redis-backed database). None of these entities, however, is required to be in the Experiment.
Here's a short script that creates an experiment which includes only a database.
from smartsim import Experiment

NUM_DB_NODES = 3

# create an experiment containing only the Orchestrator (database)
exp = Experiment("Database Only")
db = exp.create_database(db_nodes=NUM_DB_NODES)
exp.generate(db)
exp.start(db)
After this, the Orchestrator (with the number of shards specified by NUM_DB_NODES) will have been spun up. You can then connect the Python client using the following line:
client = smartredis.Client(db.get_address()[0], NUM_DB_NODES > 1)
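From there you can exchange data with the Orchestrator. A minimal sketch (the tensor name is arbitrary):

import numpy as np
import smartredis

# connect to the first shard's address; the second argument tells the client whether the DB is clustered
client = smartredis.Client(db.get_address()[0], NUM_DB_NODES > 1)

# send a tensor to the database and read it back
client.put_tensor("example_tensor", np.random.rand(10))
retrieved = client.get_tensor("example_tensor")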
I have a requirement where I need to set assignees for all the user tasks in a process instance as soon as the instance is created, based on the candidate group set on each user task.
I tried getting the user tasks using this:
Collection<UserTask> userTasks = execution.getBpmnModelInstance().getModelElementsByType(UserTask.class);
which is correct in some way, but I am not able to set the assignees. Also, it looks like this would apply to the process definition itself and not the process instance.
Secondly, I tried getting them from the task query, which gives me only the next task and not all the user tasks inside the process.
Please help !!
It does not work that way. A process flow can be simplified to "a token moves through the BPMN diagram" ... only the current position of the token is relevant. So naturally, the task list only gives you the current task, not what could happen afterwards, which you cannot know anyway: what if there is a gateway that continues differently based on the task outcome? So stop working with the BPMN meta model and focus on the runtime.
You have two choices to dynamically assign user tasks:
1.) in the modeler, instead of hard-assigning the task to "a-user", use an expression like ${taskAssignment.assignTask(task)} where "taskAssignment" is a bean that provides a method returning the assignee as a String.
2.) add a taskListener on "create" to the task and set the assignee in the listener.
For option 2 you can use the Camunda Spring Boot events (or the (outdated) camunda-bpm-reactor extension) to register one central component rather than adding a listener to every task.
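A minimal sketch of option 2 (class name and the assignee lookup are made up; only the TaskListener wiring is the point here):

import org.camunda.bpm.engine.delegate.DelegateTask;
import org.camunda.bpm.engine.delegate.TaskListener;

// register this on the "create" event of the user task (in the modeler or via a parse listener)
public class AssignFromCandidateGroupListener implements TaskListener {

    @Override
    public void notify(DelegateTask delegateTask) {
        // pick an assignee based on your own candidate-group logic (stubbed here)
        String assignee = resolveAssignee(delegateTask);
        delegateTask.setAssignee(assignee);
    }

    private String resolveAssignee(DelegateTask delegateTask) {
        // e.g. query the IdentityService for members of the task's candidate group
        return "demo"; // placeholder
    }
}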
I really liked BigQuery's Data Transfer Service. I have flat files in the exact schema sitting ready to be loaded into BQ. It would have been awesome to just set up a DTS schedule that picked up GCS files matching a pattern and loaded them into BQ. I like the built-in option to delete source files after the copy and to email in case of trouble. But the biggest bummer is that the minimum interval is 60 minutes. That is crazy. I could have lived with a 10-minute delay, perhaps.
So if I set up the DTS to be on-demand, how can I invoke it from an API? I am thinking of creating a cron job that calls it on demand every 10 minutes, but I can't figure out from the docs how to call it.
Also, what is my next-best, most reliable and cheapest way of moving GCS files (no ETL needed) into BQ tables that match the exact schema? Should I use Cloud Scheduler, Cloud Functions, Dataflow, Cloud Run, etc.?
If I use a Cloud Function, how can I submit all files in my GCS bucket at the time of invocation as one BQ load job?
Lastly, does anyone know if DTS will lower the limit to 10 minutes in the future?
So if I set up the DTS to be on-demand, how can I invoke it from an API? I am thinking of creating a cron job that calls it on demand every 10 minutes, but I can't figure out from the docs how to call it.
StartManualTransferRuns is part of the RPC library but does not have a REST API equivalent as of now. How to use that will depend on your environment. For instance, you can use the Python Client Library (docs).
As an example, I used the following code (you'll need to run pip install google-cloud-bigquery-datatransfer for the dependencies):
import time
from google.cloud import bigquery_datatransfer_v1
from google.protobuf.timestamp_pb2 import Timestamp
client = bigquery_datatransfer_v1.DataTransferServiceClient()
PROJECT_ID = 'PROJECT_ID'
TRANSFER_CONFIG_ID = '5e6...7bc' # alphanumeric ID you'll find in the UI
parent = client.project_transfer_config_path(PROJECT_ID, TRANSFER_CONFIG_ID)
start_time = bigquery_datatransfer_v1.types.Timestamp(seconds=int(time.time() + 10))
response = client.start_manual_transfer_runs(parent, requested_run_time=start_time)
print(response)
Note that you'll need to use the right Transfer Config ID and the requested_run_time has to be of type bigquery_datatransfer_v1.types.Timestamp (for which there was no example in the docs). I set a start time 10 seconds ahead of the current execution time.
You should get a response such as:
runs {
name: "projects/PROJECT_NUMBER/locations/us/transferConfigs/5e6...7bc/runs/5e5...c04"
destination_dataset_id: "DATASET_NAME"
schedule_time {
seconds: 1579358571
nanos: 922599371
}
...
data_source_id: "google_cloud_storage"
state: PENDING
params {
...
}
run_time {
seconds: 1579358581
}
user_id: 28...65
}
and the transfer is triggered as expected.
Also, what is my next-best, most reliable and cheapest way of moving GCS files (no ETL needed) into BQ tables that match the exact schema? Should I use Cloud Scheduler, Cloud Functions, Dataflow, Cloud Run, etc.?
With this you can set up a cron job to execute your function every ten minutes. As discussed in the comments, the minimum transfer interval is 60 minutes, so it won't pick up files less than one hour old (docs).
Apart from that, this is not a very robust solution, and this is where your follow-up questions come into play. I think these might be too broad to address in a single Stack Overflow question, but I would say that, for on-demand refresh, Cloud Scheduler + Cloud Functions/Cloud Run can work very well.
Dataflow would be best if you needed ETL, but it also has a GCS connector that can watch a file pattern (example). With this you would skip the transfer and set the watch interval and the load-job triggering frequency to write the files into BigQuery. VM(s) would be running constantly in a streaming pipeline, as opposed to the previous approach, but a 10-minute watch period is possible.
If you have complex workflows/dependencies, Airflow has recently introduced operators to start manual runs.
If I use a Cloud Function, how can I submit all files in my GCS bucket at the time of invocation as one BQ load job?
You can use wildcards to match a file pattern in the Cloud Storage URI when you create the transfer.
Also, this can be done on a file-by-file basis using Pub/Sub notifications for Cloud Storage to trigger a Cloud Function.
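As a rough sketch of that per-file approach (dataset/table names are placeholders, and it assumes CSV files whose schema already matches the table):

from google.cloud import bigquery

bq_client = bigquery.Client()
TABLE_ID = "your-project.your_dataset.your_table"  # placeholder

def load_new_file(event, context):
    # Cloud Storage notifications delivered via Pub/Sub carry the bucket and object as message attributes
    uri = "gs://{}/{}".format(event["attributes"]["bucketId"], event["attributes"]["objectId"])

    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,  # adjust to your file format
        skip_leading_rows=1,
        write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    )
    load_job = bq_client.load_table_from_uri(uri, TABLE_ID, job_config=job_config)
    load_job.result()  # wait so errors surface in the function logs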
Lastly, does anyone know if DTS will lower the limit to 10 minutes in the future?
There is already a Feature Request here. Feel free to star it to show your interest and to receive updates.
Now you can easily run a manual BigQuery data transfer using the REST API:
HTTP request
POST https://bigquerydatatransfer.googleapis.com/v1/{parent=projects/*/locations/*/transferConfigs/*}:startManualRuns
For the {parent=projects/*/locations/*/transferConfigs/*} part, check the CONFIGURATION tab of your transfer and note the resource name shown there; it contains the project, location and transfer config ID you need.
More here:
https://cloud.google.com/bigquery-transfer/docs/reference/datatransfer/rest/v1/projects.locations.transferConfigs/startManualRuns
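As a sketch (substitute your own project, location and transfer config ID; requestedRunTime asks for a run at the given timestamp):

curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  -d '{"requestedRunTime": "2024-01-01T12:00:00Z"}' \
  "https://bigquerydatatransfer.googleapis.com/v1/projects/PROJECT_ID/locations/us/transferConfigs/TRANSFER_CONFIG_ID:startManualRuns"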
Following Guillem's answer and the API updates, this is my new code:
import time
from google.cloud.bigquery import datatransfer_v1
from google.protobuf.timestamp_pb2 import Timestamp
client = datatransfer_v1.DataTransferServiceClient()
config = '34y....654'
PROJECT_ID = 'PROJECT_ID'
TRANSFER_CONFIG_ID = config
parent = client.transfer_config_path(PROJECT_ID, TRANSFER_CONFIG_ID)
start_time = Timestamp(seconds=int(time.time()))
request = datatransfer_v1.types.StartManualTransferRunsRequest(
{ "parent": parent, "requested_run_time": start_time }
)
response = client.start_manual_transfer_runs(request, timeout=360)
print(response)
For this to work, you need to know the correct TRANSFER_CONFIG_ID.
In my case, I wanted to list all the BigQuery scheduled queries to get a specific ID. You can do it like this:
# Put your project ID here
PROJECT_ID = 'PROJECT_ID'

from google.cloud import bigquery_datatransfer_v1

bq_transfer_client = bigquery_datatransfer_v1.DataTransferServiceClient()
parent = bq_transfer_client.project_path(PROJECT_ID)

# Iterate over all results
for element in bq_transfer_client.list_transfer_configs(parent):
    # Print the display name of each scheduled query
    print(f'[Schedule Query Name]:\t{element.display_name}')
    # Print the name of each element (it contains the ID)
    print(f'[Name]:\t\t{element.name}')
    # Extract the ID:
    TRANSFER_CONFIG_ID = element.name.split('/')[-1]
    print(f'[TRANSFER_CONFIG_ID]:\t\t{TRANSFER_CONFIG_ID}')
    # You can print the entire element for debug purposes
    print(element)
I have a build pipeline, let's say A, that stores a file within a folder (this file has a variable value that is set within that build pipeline). Pipeline A triggers another pipeline, B, that publishes the folder as an artifact using the Publish Artifact task. But the folder name is dynamic, as it is fetched from that file within Pipeline A. I need to pass the file with that variable value from Pipeline A to Pipeline B while triggering it. Is there any way to do this in Azure DevOps without using YAML pipelines?
I have a somewhat complex set of pipelines that I set up using the Classic mode, and converting them all to YAML would take a long time, so I would like to know if there is any workaround for this.
There are a few workarounds:
1.) Create a variable group, and during Pipeline A set the variable value there with the REST API; Pipeline B then uses this variable.
2.) During Pipeline A, update the Pipeline B definition with the new value via the REST API.
3.) In Pipeline A, trigger Pipeline B with the Trigger Build Task, where you can pass the variable value to Pipeline B (you do it in the "Build Parameters" field).
I don't think there's a clean way to do this if you need to trigger the build by adding Pipeline A under the triggers section of Pipeline B.
Consider triggering Pipeline B when Pipeline A completes using the REST API. That way, you can have your 'file path' as a variable on Pipeline B and pass it in the parameters collection.
Something like:
POST https://dev.azure.com/{organization}/{project}/_apis/build/builds?ignoreWarnings={ignoreWarnings}&checkInTicket={checkInTicket}&sourceBuildId={sourceBuildId}&api-version=5.0
{
  "definition": {
    "id": 1234
  },
  "parameters": "{\"fileName\":\"yourfilename\"}"
}
fileName (the key in the parameters JSON above) would be the name of your variable in Pipeline B.
Have a look at the Builds - Queue documentation for more info.
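For instance, with a Personal Access Token the queue request could look like this (a sketch; the organization, project, definition id and variable value are placeholders):

curl -u ":$AZURE_DEVOPS_PAT" \
  -H "Content-Type: application/json" \
  -d '{"definition": {"id": 1234}, "parameters": "{\"fileName\":\"yourfilename\"}"}' \
  "https://dev.azure.com/{organization}/{project}/_apis/build/builds?api-version=5.0"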
I have an Oozie workflow which contains a sub-workflow. My main workflow takes three Sqoop job names at a time in a fork. It then has to pass those names to the sub-workflow. In the main workflow there are three shell actions which receive the job names in three respective variables (${job1}, ${job2}, ${job3}), but my sub-workflow is common to all three shell actions. I want to assign the value of ${job1} to ${job}. Where do I create the property ${job} and how do I transfer the value of ${job1} to ${job}? Please help.
Use a java action in between, along with capture-output, so that you can do whatever assignment or renaming logic you need there.
The Java action will accept job1 and output job=job1 via capture-output, which you can in turn pass to the sub-workflow.
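A rough sketch of how the pieces could fit together (action, class and property names are made up). The Java main writes the captured property to the file Oozie points it at:

import java.io.File;
import java.io.FileOutputStream;
import java.io.OutputStream;
import java.util.Properties;

public class JobNamePropagator {
    public static void main(String[] args) throws Exception {
        // args[0] is ${job1} (or ${job2}/${job3}) passed in from the workflow
        Properties props = new Properties();
        props.setProperty("job", args[0]);
        // Oozie reads captured output from the file named by this system property
        File outFile = new File(System.getProperty("oozie.action.output.properties"));
        try (OutputStream os = new FileOutputStream(outFile)) {
            props.store(os, "");
        }
    }
}

The workflow then forwards the captured value to the sub-workflow as the ${job} property:

<action name="set-job-name">
    <java>
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <main-class>com.example.JobNamePropagator</main-class>
        <arg>${job1}</arg>
        <capture-output/>
    </java>
    <ok to="run-subwf"/>
    <error to="fail"/>
</action>

<action name="run-subwf">
    <sub-workflow>
        <app-path>${subWorkflowPath}</app-path>
        <configuration>
            <property>
                <name>job</name>
                <value>${wf:actionData('set-job-name')['job']}</value>
            </property>
        </configuration>
    </sub-workflow>
    <ok to="end"/>
    <error to="fail"/>
</action>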