My team at work is currently looking for a replacement for a rather expensive ETL tool that, at this point, we are using as a glorified scheduler. Every integration the ETL tool offers we have since improved with our own Python code, so I really just need its scheduling ability. One option we are looking at is Data Pipeline, which I am currently piloting.
My problem is this: imagine we have two datasets to load, products and sales. Each of these datasets requires a number of steps to load (get source data, call a Python script to transform, load to Redshift). However, product needs to be loaded before sales runs, because we need product cost, etc. to calculate margin. Is it possible to have a "master" pipeline in Data Pipeline that calls products first, waits for its successful completion, and then calls sales? If so, how? I'm open to other product suggestions as well if Data Pipeline is not well-suited to this type of workflow. I appreciate the help.
I think I can relate to this use case. Anyhow, Data Pipeline does not do this kind of dependency management on its own. It can, however, be simulated using file preconditions.
In this approach, each child pipeline depends on a file being present (as a precondition) before starting. The master pipeline creates trigger files based on some logic executed in its activities. A child pipeline may in turn create other trigger files that start subsequent pipelines downstream.
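To make this concrete, here is a minimal sketch, assuming made-up pipeline IDs, bucket names, and script names, of how the child ("sales") pipeline could be gated on a trigger file written by the master ("product") pipeline, using boto3's put_pipeline_definition and an S3KeyExists precondition. The master pipeline's final activity would simply write the trigger object (e.g. with aws s3 cp) once the product load succeeds; only the relevant objects are shown here.

```python
# Hedged sketch: attach an S3KeyExists precondition to the first activity of the
# child ("sales") pipeline so it only starts once the master ("product") pipeline
# has written a trigger file. Pipeline ID, bucket, and script names are made up,
# and a real definition would also need a Default object, schedule, etc.
import boto3

dp = boto3.client("datapipeline")

dp.put_pipeline_definition(
    pipelineId="df-SALES-CHILD-ID",  # hypothetical child pipeline id
    pipelineObjects=[
        {
            "id": "ProductLoadedCheck",
            "name": "ProductLoadedCheck",
            "fields": [
                {"key": "type", "stringValue": "S3KeyExists"},
                {"key": "s3Key", "stringValue": "s3://my-etl-bucket/triggers/product_done"},
            ],
        },
        {
            "id": "LoadSales",
            "name": "LoadSales",
            "fields": [
                {"key": "type", "stringValue": "ShellCommandActivity"},
                {"key": "command", "stringValue": "python load_sales.py"},
                # The precondition gates this activity until the trigger file exists.
                {"key": "precondition", "refValue": "ProductLoadedCheck"},
            ],
        },
    ],
)
```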
Another solution is to use Amazon Simple Workflow (SWF). It has the features you are looking for, but it would need custom coding using the Flow SDK.
This is a basic use case of Data Pipeline and should definitely be possible. You can use the graphical pipeline editor to create this pipeline. Breaking down the problem:
There are two datasets:
Product
Sales
Steps to load these datasets:
Get source data: Say from S3. For this, use S3DataNode
Call a Python script to transform: Use ShellCommandActivity with staging. Data Pipeline does data staging implicitly for S3DataNodes attached to a ShellCommandActivity, and you can access the staged data through the special environment variables it provides (see the documentation for details, and the sketch after this list).
Load output to Redshift: Use a RedshiftCopyActivity writing to a RedshiftDataNode (which references your RedshiftDatabase).
You will need to add the above components for each dataset you work with (product and sales in this case). For easy management, you can run them all on a single EC2 instance.
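For illustration, here is a minimal sketch of what the transform script launched by a staged ShellCommandActivity might look like. It assumes staging is enabled (stage = true), so the documented INPUT1_STAGING_DIR / OUTPUT1_STAGING_DIR environment variables are available; the file and column names are made up.

```python
# Hedged sketch of a transform script invoked by a ShellCommandActivity with
# staging enabled. Data Pipeline copies the attached S3DataNode contents to local
# disk and exposes the locations via environment variables; file and column names
# below are placeholders.
import csv
import os

in_dir = os.environ["INPUT1_STAGING_DIR"]    # staged copy of the input S3DataNode
out_dir = os.environ["OUTPUT1_STAGING_DIR"]  # files written here are pushed to the output S3DataNode

with open(os.path.join(in_dir, "products_raw.csv")) as src, \
     open(os.path.join(out_dir, "products_clean.csv"), "w", newline="") as dst:
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=["sku", "cost"])
    writer.writeheader()
    for row in reader:
        # Example transform: keep only the columns needed downstream.
        writer.writerow({"sku": row["sku"], "cost": row["cost"]})
```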
Condition: 'product' needs to be loaded before 'sales' runs
Add a dependsOn relationship: add this field to the ShellCommandActivity of Sales so that it refers to the ShellCommandActivity of Product (see the dependsOn field in the documentation, and the sketch below). The documentation says: 'One or more references to other Activities that must reach the FINISHED state before this activity will start'.
Tip: In most cases, you would not want the next day's execution to start while the previous day's execution is still active (i.e. RUNNING). To avoid that scenario, use the 'maxActiveInstances' field and set it to '1'.
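Putting the last two points together, the Sales activity's definition might carry both fields. A hedged sketch in the same boto3 object format as the earlier example (the object IDs, command, and schedule reference are placeholders):

```python
# Hedged sketch: only the Sales activity object is shown; it would be added to the
# pipeline definition alongside the Product activity, data nodes, and schedule.
sales_activity = {
    "id": "TransformSales",
    "name": "TransformSales",
    "fields": [
        {"key": "type", "stringValue": "ShellCommandActivity"},
        {"key": "command", "stringValue": "python transform_sales.py"},  # placeholder command
        {"key": "schedule", "refValue": "DailySchedule"},                # placeholder schedule object
        # Sales will not start until the Product activity reaches FINISHED.
        {"key": "dependsOn", "refValue": "TransformProduct"},
        # Stop a new daily run from starting while yesterday's is still RUNNING.
        {"key": "maxActiveInstances", "stringValue": "1"},
    ],
}
```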
Related
I have a folder of pipelines, and I want to execute all the pipelines inside the folder from a single pipeline. There will be times when another pipeline is added to the folder, so creating a pipeline filled with Execute Pipeline activities is not an option (well, it is the current method, but it's not very "automate-y", and adding another Execute Pipeline activity whenever a new pipeline is added is, as you can imagine, a pain). I thought of the ForEach activity, but I don't know what the approach would be.
I have not tried this approach, but I think you can use the ADF REST API to get the details of all the pipelines that need to be executed. Since the response is JSON, you can write it to a temporary blob and filter it down to the pipelines you need.
https://learn.microsoft.com/en-us/rest/api/datafactory/pipelines/list-by-factory?tabs=HTTP
You can then use the Create Run API to trigger each pipeline.
https://learn.microsoft.com/en-us/rest/api/datafactory/pipelines/create-run?tabs=HTTP
As Joel called out, if different pipelines have different numbers of parameters, it will be a little messy to maintain.
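I have not tested this end to end, but a rough sketch of those two calls through the Python SDK (azure-mgmt-datafactory, which wraps the REST endpoints linked above) might look like the following. The subscription, resource group, factory, and folder names are placeholders, and the folder attribute check reflects my understanding that each pipeline resource carries its folder name.

```python
# Hedged sketch: list every pipeline in the factory, keep the ones in a given
# folder, and trigger a run for each. Names are placeholders; authentication is
# assumed to be set up for DefaultAzureCredential.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

rg, factory, target_folder = "my-rg", "my-adf", "DailyLoads"

for p in client.pipelines.list_by_factory(rg, factory):
    folder = getattr(p, "folder", None)
    if folder and folder.name == target_folder:
        # Per Joel's point, a per-pipeline parameter map would be needed if the
        # pipelines take different parameters; an empty dict is used here.
        run = client.pipelines.create_run(rg, factory, p.name, parameters={})
        print(f"started {p.name}: run id {run.run_id}")
```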
Folders are really just organizational structures for the code assets that describe pipelines (same for Datasets and Data Flows), they have no real substance or purpose inside the executing environment. This is why pipeline names have to be globally unique rather than unique to their containing folder.
Another problem you are going to face is that the Execute Pipeline activity is not very dynamic. The pipeline name has to be known at design time, and while parameter values are dynamic, the parameter names are not. For these reasons, you can't have a ForEach loop that dynamically executes child pipelines.
If I were tackling this problem, it would be through an external pipeline management system that you would have to build yourself. This is not trivial, and in your case would have additional challenges because of the folder level focus.
While configuring a particular data pipeline in Mosaic Decisions, I want to try out different operations by using the available process nodes. I would like to keep the first few configured nodes for future reference and continue to add some other nodes.
To do this, I'm currently cloning the flow after each incremental change. But because of this, many flows are getting configured and it becomes very difficult to keep track of them.
Is there any alternative way to save the history of these multiple configurations of the flow for future reference without cloning and executing them separately?
You can save the history of the changes you have made to the flow by saving it as a version, using the Save As Version option provided in the canvas header.
You can also add a description for each incremental step and edit a particular version later if you want. Later, you can execute each saved version separately by publishing that version from the Version tab and then executing it normally.
I am currently working on a very interesting ETL project using Azure to transform my data manually. However, transforming data manually becomes exhausting and lengthy once I have several source files to process. My pipeline works fine for now because I only have a few files to transform, but what if I have thousands of Excel files?
So what I want to achieve is to extend the project: extract the Excel files arriving by email using a Logic App, then apply ETL directly on top of them. Is there any way I can automate ETL in Azure? Can I do ETL without modifying the pipeline manually for each different type of data? How can I make my pipeline flexible enough to handle data transformation for various types of source data?
Thank you in advance for your help.
"Can I do ETL without modifying the pipeline manually for each different type of data?"
According to your description, I suppose you already know that the ADF connector is supported in Logic Apps. You can execute an ADF pipeline from a Logic App flow and even pass parameters into the ADF pipeline.
Normally, the source and sink services are fixed within one Copy activity, but you can define dynamic file paths in the datasets, so you don't need to create multiple Copy activities.
If the data types are different, you could pass a parameter from the Logic App into ADF. Then, before the data movement, you could use a Switch activity to route the run into different branches.
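As a hedged illustration of the parameter idea (all names below are made up), whatever triggers the pipeline, whether the Logic App ADF connector or an SDK/REST call like the one below, can pass values that the Switch activity and the dynamic dataset paths then consume:

```python
# Hedged sketch: trigger a pipeline with parameters that a Switch activity can
# branch on (e.g. @pipeline().parameters.sourceType) and that a parameterized
# dataset path can use. Pipeline, parameter, and resource names are placeholders.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

client.pipelines.create_run(
    "my-rg", "my-adf", "IngestExcelPipeline",
    parameters={
        "sourceType": "sales",                         # drives the Switch activity
        "sourceFile": "attachments/report_2023.xlsx",  # drives the dynamic dataset path
    },
)
```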
The way my ADF setup currently works is that I have multiple pipelines, each containing at least one activity. Then I have one big pipeline that chains these pipelines together.
However, now in the big "master" pipeline, I would like to use the output of an activity from one pipeline and then pass it to another pipeline. All of this orchestrated from the "master" pipeline.
My "master" pipeline would look something like this:
What I have tried is adding a parameter to "Execute Pipeline2" and passing:
#activity('Execute Pipeline1').output.pipeline.runId.output.runOutput
#activity('Execute Pipeline1').output.pipelineRunId.output.runOutput
#activity('Execute Pipeline1').output.runOutput
How would one go about doing this?
Unfortunately, we don't have a way to pass the output of an activity across pipelines. Right now, pipelines don't have outputs (only activities do).
We have a work item that will allow a user to choose what the output of a pipeline should be (imagine a pipeline with 40 activities, where the user could choose the output of activity 3 as the pipeline output). However, this work item is in very early stages, so don't expect to see it soon.
For now, the only way would be to save the output you want to storage (blob, for example) and then read it and pass it to the other pipeline. Another method is a Web activity that gets the pipeline run (passing the run ID); you retrieve the output using the ADF SDK or REST API and then pass it to the next Execute Pipeline activity.
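A rough sketch of that second option, assuming the run ID is taken from the Execute Pipeline activity's output (@activity('Execute Pipeline1').output.pipelineRunId) and that the code runs somewhere a Web activity can call, such as an Azure Function; resource names and the time window are placeholders:

```python
# Hedged sketch: given the run id of the child pipeline run, query its activity
# runs via the ADF SDK and pull out the output of a named activity, which can then
# be handed to the next Execute Pipeline activity. Names/dates are placeholders.
from datetime import datetime, timedelta

from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import RunFilterParameters

client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

def get_activity_output(run_id: str, activity_name: str):
    window = RunFilterParameters(
        last_updated_after=datetime.utcnow() - timedelta(days=1),
        last_updated_before=datetime.utcnow(),
    )
    result = client.activity_runs.query_by_pipeline_run("my-rg", "my-adf", run_id, window)
    for activity in result.value:
        if activity.activity_name == activity_name:
            return activity.output  # pass this on to the next Execute Pipeline activity
    return None
```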
I have a build job that takes a parameter (say, which branch to build) and, when it completes, triggers a testing job (actually several jobs) that does things like download a bunch of test data and check that the new version works with the test data.
My problem is that I can't seem to figure out a way to show the test results in a sensible way. If I just use one testing job, then the test results for "stable" and "dodgy-future-branch" get mixed up, which isn't what I want. If I create a separate testing job for each branch the build job understands, it quickly becomes unmanageable because of combinatorial explosion (say, 6 branches and 6 different types of testing mean I need 36 testing jobs, and when I want to make a change, say to save more builds, I need to update all 36 by hand).
I've been looking at the Job Generator plugin and ez-templates in the hope that I might be able to create and manage just the templates for the testing jobs and have the actual jobs created/updated on the fly. I can't shake the feeling that this is so hard because my basic model is wrong. Is it just that separating the build and test jobs like this is not recommended, or is there some other method for filtering a job's test results based on build parameters that I haven't found yet?
I would define a set of simple use cases:
Check in on development branch triggers build
Successful build triggers UpdateBuildPage
Successful build of development triggers IntegrationTest
Successful IntegrationTest triggers LoadTest
Successful IntegrationTest triggers UpdateTestPage
Successful LoadTest triggers UpdateTestPage
etc.
In particular, I wouldn't look through all the Jenkins job results for an overview; I would create a web page or something like that instead.
I wouldn't expect the full matrix of builds and tests; the combinations that are actually used will become clear from the use cases.