I have a pipeline that executes another pipeline in Azure Data Factory v2. In the executed (child) pipeline I assign a value to a variable that I want returned to the master (parent) pipeline. Is this possible?
Thanks
Pipelines are independent entities - while you can execute "child" pipelines, there is no functional connection between the two. One way around this is to have the child pipeline write the value to some form of intermediate storage (blob storage, a SQL table, etc.), and then have the "parent" pipeline read the value after the child pipeline completes. You should also make sure the Execute Pipeline activity has the "Wait on completion" property checked. If you don't want the value retained in the storage medium, you could have the parent pipeline delete the data once it has processed it.
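To make the hand-off concrete, here is a minimal sketch using the Azure Blob Storage Python SDK. It only illustrates the write-then-read pattern; the container/blob names and connection string are placeholders, and inside ADF itself you would typically do the write with a Copy or Web activity in the child and the read with a Lookup activity in the parent.

from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string("<storage connection string>")
state_blob = service.get_blob_client(container="pipeline-state", blob="child-output.txt")

# Last step of the child pipeline: persist the variable's value.
state_blob.upload_blob("value-to-return", overwrite=True)

# Parent pipeline, after Execute Pipeline (with "Wait on completion") finishes:
returned_value = state_blob.download_blob().readall().decode("utf-8")
print(returned_value)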
I have a number of pipelines that need to cycle depending on the availability of data. If the data is not there, wait and try again. Pipeline behaviour is largely controlled by a database that captures logs, which are used to make decisions about processing.
I read the Microsoft documentation about the Execute Pipeline activity which states that
The Execute Pipeline activity allows a Data Factory or Synapse pipeline to invoke another pipeline.
It does not explicitly state that a pipeline invoking itself is impossible, though. I tried to reference Pipe_A from Pipe_A, but the pipeline is not visible in the drop-down. I need a workaround for this restriction.
Constraints:
The pipe must not call all pipes again, just the pipe in question. The preceding pipe is running all pipes in parallel.
I don't know how many iterations are needed and cannot specify this quantity.
As far as possible, best-effort behaviour has been implemented, and this pattern should continue.
Ideas:
Create an intermediary pipe that can be referenced. This is no good: I would need to do this for every pipe that requires this behaviour, because dynamic content is not allowed for pipe selection. This approach would also pollute the Data Factory workspace.
Direct control flow backwards, after waiting, inside the same pipeline if a condition is met. This won't work either: the If activity does not allow flow to be expressed within the same context as the If activity itself.
I thought about externalising this behaviour to a Python application which could be attached to an Azure Function if needed. The application would handle the scheduling and waiting. The application could call any pipe it needed and could itself be invoked by the pipe in question. This seems drastic!
Finally, I discovered the Until activity, which has do-while behaviour. I could wrap these pipes in an Until: the pipe executes and either finishes and sets the database state to 'finished', or cannot finish, sets the state to 'incomplete', and waits. The Until expression then either kicks off another iteration or it does not. Additional conditional logic can be included as required in the procedure used to set the value of the variable evaluated by the Until expression. I would need a variable per pipe.
I think idea 4 makes sense, but I thought I would post this anyway in case people can spot limitations in this approach and/or recommend a better one.
Yes, I absolutely agree with All About BI; it seems the ADF activity best suited to your scenario is Until:
The Until activity in ADF functions as a wrapper and parent component for iterations, with inner child activities comprising the block of items to iterate over. The result(s) from those inner child activities must then be used in the parent Until expression to determine if another iteration is necessary. Alternatively, if the pipeline can be maintained […]
The assessment condition for the Until activity might comprise outputs from other activities, pipeline parameters, or variables.
When used in conjunction with the Wait activity, the Until activity allows you to create loop conditions to periodically check the status of specific operations. Here are some examples:
Check to see if the database table has been updated with new rows.
Check to see if the SQL job is complete.
Check to see whether any new files have been added to a specific folder.
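As a loose illustration only (plain Python standing in for the ADF building blocks; in the real pipeline the loop is the Until activity, the pause is a Wait activity, and the status check would be a Lookup or stored-procedure activity against your control database), the pattern looks like this:

import time

def poll_until_done(check_status, wait_seconds=60, timeout_seconds=3600):
    """Re-check a status until it reports 'finished' or the timeout elapses."""
    waited = 0
    while waited < timeout_seconds:
        if check_status() == "finished":   # e.g. a state flag your procedure sets
            return True
        time.sleep(wait_seconds)           # the Wait activity
        waited += wait_seconds
    return False                           # the Until activity's own timeout behaves similarly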
I would like to have a simple CQRS implementation on an API.
In short:
Separate routes for Command and Query.
Separate DB tables (on the same DB at the moment): a normalized one for Commands and a denormalized one for Queries.
Asynchronous event-driven update of Query Read Model, using existing external Event Bus.
After the Command is executed, naturally I need to raise an event and pass it to the Event Bus.
The Event Bus would process the event and pass it to its subscriber(s).
In this case the subscriber is Read Model which needs to be updated.
So I need a callback route on the API which gets the event from the Event Bus and updates the Read Model projection (i.e. updating the denormalized DB table which is used for Queries).
The problem is that the update of the Read Model projection is neither a Command (we do not execute any Domain Logic) nor a Query.
The question is:
How should this async Read Model update work in order to be compliant both with CQRS and DDD?
How should this async Read Model update work in order to be compliant both with CQRS and DDD?
I normally think of the flow of information as a triangle.
We copy information from the outside world into our "write model", via commands
We copy information from the write model into our "read model"
We copy information from the read model to the outside world, via queries.
Common language for that middle step is "projection".
So the projection (typically) runs asynchronously, querying the "write model" and updating the "read model".
In the architecture you outlined, it would be the projection that is subscribed to the bus. When the bus signals that the write model has changed, we wake up the projection, and let it run so that it can update the read model.
(Note the flow of information - the signal we get from the bus triggers the projection to run, but the projection copies data from the write model, not from the event bus message. This isn't the only way to arrange things, but it is simple, and therefore easy to reason about when things start going pear shaped.)
It is often the case that the projection will store some of its own metadata when it updates the read model, so as to not repeat work.
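Here is a minimal, in-memory sketch of that flow (illustrative names only, no particular framework assumed): the bus message just wakes the projection up; the projection then reads changed rows from the write model, denormalizes them into the read model, and remembers how far it got so it doesn't repeat work.

write_model = []                            # append-only list of (position, normalized row)
read_model = {}                             # denormalized rows keyed by id, for queries
projection_state = {"last_position": -1}    # the projection's own metadata

def append_to_write_model(row):
    write_model.append((len(write_model), row))

def run_projection(_event=None):
    """Triggered by the event bus signal; copies write model -> read model."""
    start = projection_state["last_position"] + 1
    for position, row in write_model[start:]:
        # Denormalize into the shape the queries need.
        read_model[row["id"]] = {
            "id": row["id"],
            "name": row["name"],
            "tag_count": len(row["tags"]),
        }
        projection_state["last_position"] = position

# Command side writes, the bus signal triggers the projection, the query side reads.
append_to_write_model({"id": "42", "name": "Widget A", "tags": ["new"]})
run_projection()            # pretend the bus delivered a "write model changed" signal
print(read_model["42"])     # {'id': '42', 'name': 'Widget A', 'tag_count': 1}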
As far as I understand, TensorFlow uses MLMD to record and retrieve metadata associated with workflows. This may include:
results of pipeline components
metadata about artifacts generated through the components of the pipelines
metadata about executions of these components
metadata about the pipeline and associated lineage information
Features:
Does the above (e.g. #1, "results of pipeline components") imply that MLMD stores actual data (e.g. input features for ML training)? If not, what is meant by the results of pipeline components?
Orchestration and pipeline history:
Also, when using TFX with e.g. Airflow, which uses its own metastore (e.g. metadata about DAGs, their runs, and other Airflow configuration like users, roles, and connections), does MLMD store redundant information? Does it supersede it?
TFX is an ML pipeline/workflow framework, so when you write a TFX application you are essentially constructing the structure of the workflow and preparing it to accept a particular set of data and process or use it (transformation, model building, inference, deployment, etc.). In that respect it never stores the actual data; it stores the information (metadata) needed to process or use the data. For example, when it checks for anomalies, it needs to remember the previous data schema/statistics (not the actual data), so it saves that information as metadata in MLMD, alongside the actual run metadata.
Airflow will also save run metadata. This can be seen as a subset of all the metadata, and it is very limited in comparison to the metadata saved in MLMD. There is some redundancy involved, though.
The controller is TFX, which defines and makes use of the underlying Airflow orchestration. MLMD will not supersede the Airflow metastore, but things will definitely fail if there is a clash.
Imagine the filesystem of a disk drive. The contents of the files are stored on the disk, but it is the index and the pointers to those data that we call the filesystem. It is that metadata that brings value to the user, who can find the relevant data when they need it by searching or navigating through the filesystem.
Similarly with MLMD: it stores the metadata of an ML pipeline, like which hyperparameters you used in an execution, which version of the training data, what the distribution of the features looked like, etc. But it goes beyond being just a registry of the runs. This metadata can be used to power two killer features of an ML pipeline tool:
asynchronous execution of its components, for example retraining a model when there is new data, without necessarily having to generate a new vocabulary
reuse of results from previous runs, i.e. step-level output caching. For example, do not run a step if its input parameters haven't changed; instead reuse the output of a previous run from the cache to feed the next component.
So yes, the actual data are indeed stored in some storage, maybe a cloud bucket, in the form of Parquet files between transformations, or model files and schema protobufs. MLMD stores the URI to these data along with some meta information. For example, a SavedModel is stored at s3://mymodels/1, and it has an entry in the Artifacts table of MLMD, with a relation to the Trainer run and its TrainArgs parameters in the ContextProperty table.
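Here is a small sketch with the ml-metadata Python API showing the kind of record MLMD keeps: a pointer (URI) to the model plus some metadata, not the model bytes themselves. The "train_args_fingerprint" custom property is purely an illustrative convention for this sketch, not something TFX defines:

from ml_metadata import metadata_store
from ml_metadata.proto import metadata_store_pb2

config = metadata_store_pb2.ConnectionConfig()
config.sqlite.filename_uri = "/tmp/mlmd.sqlite"
config.sqlite.connection_mode = 3            # READWRITE_OPENCREATE
store = metadata_store.MetadataStore(config)

# Register an artifact type once (here: a trained SavedModel).
model_type = metadata_store_pb2.ArtifactType()
model_type.name = "SavedModel"
model_type.properties["version"] = metadata_store_pb2.INT
model_type_id = store.put_artifact_type(model_type)

# Record where the model lives and a few facts about the run that produced it.
model = metadata_store_pb2.Artifact()
model.type_id = model_type_id
model.uri = "s3://mymodels/1"                # pointer to the data, not the data itself
model.properties["version"].int_value = 1
model.custom_properties["train_args_fingerprint"].string_value = "abc123"
[model_id] = store.put_artifacts([model])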
If not, what does it mean by results of pipeline components?
It means the pointers to the data which have been generated by the run of a component, including the input parameters. In our previous example, if neither the input data nor the TrainArgs of a Trainer component have changed for a run, that expensive component shouldn't be run again; the model file should be reused from the cache instead.
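Continuing the sketch above, here is a hedged illustration of how step-level caching can build on such records: if an artifact already exists whose fingerprint matches the current Trainer inputs, reuse its URI instead of re-training. Real orchestrators implement this internally; the fingerprint property is our own convention from the previous snippet.

def find_cached_model(store, fingerprint):
    """Return the URI of a SavedModel whose inputs match, or None on a cache miss."""
    for artifact in store.get_artifacts_by_type("SavedModel"):
        props = artifact.custom_properties
        if ("train_args_fingerprint" in props
                and props["train_args_fingerprint"].string_value == fingerprint):
            return artifact.uri              # reuse the existing model file
    return None                              # cache miss: the Trainer has to run

cached_uri = find_cached_model(store, "abc123")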
This requirement of a continuous ML pipeline makes the use of workflow managers such as Tekton or Argo more relevant compared to Airflow, and MLMD a more focused metadata store compared to the latter.
As per the Apache Ignite Spring Data documentation, there are two methods to save data in the Ignite cache:
1. org.apache.ignite.springdata.repository.IgniteRepository.save(key, value)
and
2. org.apache.ignite.springdata.repository.IgniteRepository.save(Map<ID, S> entities)
So I just want to understand the transaction behavior of the 2nd method. Suppose we are going to save 100 records using the save(Map<ID, S>) method and, for some reason, after 70 records some nodes go down. In this case, will it roll back those 70 records?
Note: as per the 1st method's behavior, if we use @Transactional at the method level then it will roll back the particular entity.
First of all, you should read about the transaction mechanism used in Apache Ignite. It is described very well in the articles presented here:
https://apacheignite.readme.io/v1.0/docs/transactions#section-two-phase-commit-2pc
The most interesting parts for you are "Backup Node Failures" and "Primary Node Failures":
Backup Node Failures
If a backup node fails during either "Prepare" phase or "Commit" phase, then no special handling is needed. The data will still be committed on the nodes that are alive. GridGain will then, in the background, designate a new backup node and the data will be copied there outside of the transaction scope.
Primary Node Failures
If a primary node fails before or during the "Prepare" phase, then the coordinator will designate one of the backup nodes to become primary and retry the "Prepare" phase. If the failure happens before or during the "Commit" phase, then the backup nodes will detect the crash and send a message to the Coordinator node to find out whether to commit or rollback. The transaction still completes and the data within distributed cache remains consistent.
In your case, all updates for all the values in the map should be done in one transaction or rolled back together. I think these articles answer your question.
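To make the prepare/commit flow above easier to picture, here is a deliberately simplified, in-memory toy model in Python. It is not Ignite code; the classes and failure handling are made up purely to illustrate that the whole map travels through the same prepare and commit phases as one unit.

class Node:
    def __init__(self, name):
        self.name = name
        self.alive = True
        self.prepared = {}    # staged (locked) entries
        self.data = {}        # committed, visible entries

    def prepare(self, entries):
        if not self.alive:
            raise ConnectionError(self.name + " is down")
        self.prepared.update(entries)     # acquire locks / stage the values

    def commit(self):
        if not self.alive:
            raise ConnectionError(self.name + " is down")
        self.data.update(self.prepared)   # make the staged values visible
        self.prepared.clear()

def two_phase_put_all(primary, backups, entries):
    """Coordinator's view of saving a whole map: prepare everywhere, then commit."""
    # Phase 1: prepare. If the primary dies here, promote a backup and retry.
    try:
        primary.prepare(entries)
    except ConnectionError:
        primary = backups.pop(0)          # a backup becomes the new primary
        primary.prepare(entries)
    for b in backups:
        try:
            b.prepare(entries)
        except ConnectionError:
            pass                          # dead backup: no special handling needed

    # Phase 2: commit on every node that is still alive.
    for n in [primary] + backups:
        try:
            n.commit()
        except ConnectionError:
            pass                          # surviving nodes stay consistent

In the real protocol, if the primary fails during the commit phase, the backups ask the coordinator whether to commit or roll back, as the quoted documentation describes; the sketch only shows that all entries in the map share one prepare/commit cycle rather than being applied independently.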
In CQRS, we separate Commands and Queries. As I understand it, Commands raise Domain Events that may modify Entity state, while Queries return view-specific DTOs directly from a data store. According to this article, the UI issues commands through a Command Bus, which creates Commands that are handled by their respective CommandHandlers, which then orchestrate the Domain Logic to determine the occurrence of Domain Events and persist/publish any state changes to a Repository (optionally using Event Sourcing). After being persisted, state changes are available through Queries.
Now, what if a Command creates an Entity that is not persisted/published immediately? Firstly, where is that not-yet-persisted Entity held? Is it in the Command Bus, the Command Handler, the Repository, or should a new thin application layer hold it? How should a Query gain access to it?
The problem here is that it seems like any Queries for unpersisted Entities differ significantly from those of persisted Entities, unless CQRS demands that ALL Entities be persisted upon creation, which IMO is not necessarily compatible with all Domains.
Specifically, I'm trying to build software to record training information for various Training Sessions. However, I would like it if Training Sessions were persisted manually by a Save Session button as opposed to always upon creation. I don't know where a StartNewTrainingSessionCommand would store the new Training Session so that it can be Queried, if not in the data store.
I think you have understood things a bit wrong: a command is sent via a service bus to a command handler, which uses the business objects to do the work. Domain events should be generated by the business (domain) objects, but sometimes the command handler does that too.
I don't see a reason for a created entity not to be saved. In your particular case, if the domain allows it, you can have a default, empty TrainingSession saved automatically and then updated when the user presses the Save button.
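As a minimal sketch of that first option (Python, with illustrative names; no particular framework is assumed): the StartNewTrainingSession handler persists an empty draft session immediately, so it is queryable like any other entity, and the explicit Save button just issues another command that marks it as saved.

import uuid
from dataclasses import dataclass, field

@dataclass
class TrainingSession:
    id: str
    status: str = "draft"                    # 'draft' until the user presses Save
    exercises: list = field(default_factory=list)

class StartNewTrainingSessionHandler:
    def __init__(self, repository, event_bus):
        self.repository = repository
        self.event_bus = event_bus

    def handle(self, command):
        session = TrainingSession(id=str(uuid.uuid4()))
        self.repository.save(session)        # persisted right away, as a draft
        self.event_bus.publish({"type": "TrainingSessionStarted", "id": session.id})
        return session.id

class SaveTrainingSessionHandler:
    def __init__(self, repository, event_bus):
        self.repository = repository
        self.event_bus = event_bus

    def handle(self, command):
        session = self.repository.get(command["id"])
        session.status = "saved"             # the explicit Save Session button
        self.repository.save(session)
        self.event_bus.publish({"type": "TrainingSessionSaved", "id": session.id})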
If this approach is not feasible, then simply store the input data (pretty much the view models) in a temporary place (session, DB) and issue the command only when the user clicks the button.