Azure Data Factory: Execute Pipeline activity cannot reference calling pipeline, cyclical behaviour required

I have a number of pipelines that need to cycle depending on availability of data. If the data is not there, wait and try again. Pipe behaviour is largely controlled by a database that captures logs, which are used to make decisions about processing.
I read the Microsoft documentation about the Execute Pipeline activity which states that
The Execute Pipeline activity allows a Data Factory or Synapse
pipeline to invoke another pipeline.
It does not explicitly state that a pipeline invoking itself is impossible, though. I tried to reference Pipe_A from within Pipe_A, but the pipe is not visible in the drop-down. I need a work-around for this restriction.
Constraints:
The pipe must not call all pipes again, just the pipe in question. The preceding pipe is running all pipes in parallel.
I don't know how many iterations are needed and cannot specify this quantity.
Best-effort behaviour has been implemented as far as possible, and this pattern should continue.
Ideas:
Create an intermediary pipe that can be referenced. This is no good: I would need one for every pipe that requires this behaviour, because dynamic content is not allowed for pipe selection. This approach would also pollute the Data Factory workspace.
Direct control flow backwards, after waiting, within the same pipeline if a condition is met. This won't work either: the If activity does not allow expression of flow within the same context as the If activity itself.
I thought about externalising this behaviour to a Python application which could be attached to an Azure Function if needed. The application would handle the scheduling and waiting. The application could call any pipe it needed and could itself be invoked by the pipe in question. This seems drastic!
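For what it's worth, a minimal sketch of what that external poller could look like, assuming the azure-mgmt-datafactory Python SDK, placeholder resource names, and a hypothetical stand-in for the control-database check:

    # Hedged sketch of idea 3: an external poller (e.g. run from an
    # Azure Function) that re-runs a pipeline until the control database
    # reports it finished. Resource names and the DB check are assumptions.
    import time
    from azure.identity import DefaultAzureCredential
    from azure.mgmt.datafactory import DataFactoryManagementClient

    client = DataFactoryManagementClient(
        DefaultAzureCredential(), "<subscription-id>"
    )

    def control_db_says_finished(pipeline_name: str) -> bool:
        return False  # placeholder: query your log tables here

    def run_until_finished(pipeline_name: str, poll_seconds: int = 60) -> None:
        while not control_db_says_finished(pipeline_name):
            run = client.pipelines.create_run(
                "<resource-group>", "<factory-name>", pipeline_name
            )
            status = "Queued"
            # Poll the run until ADF reports a terminal state.
            while status in ("Queued", "InProgress"):
                time.sleep(poll_seconds)
                status = client.pipeline_runs.get(
                    "<resource-group>", "<factory-name>", run.run_id
                ).status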
Finally, I discovered the Until activity, which has do-while behaviour. I could wrap these pipes in an Until: the pipe executes and either finishes and sets the database state to 'finished', or cannot finish, sets the state to 'incomplete', and waits. The Until expression then either kicks off another iteration or it does not (e.g. something like @equals(variables('PipeState'), 'finished')). Additional conditional logic can be included as required in the procedure that sets the value of the variable used by the Until expression. I would need a variable per pipe.
I think idea 4 makes sense, but I thought I would post this anyway in case people can spot limitations in this approach and/or recommend a better one.

Yes, I absolutely agree with All About BI; it seems the ADF activity best suited to your scenario is Until:
The Until activity in ADF functions as a wrapper and parent component
for iterations, with inner child activities comprising the block of
items to iterate over. The result(s) from those inner child
activities must then be used in the parent Until expression to
determine if another iteration is necessary.
The assessment condition for the Until activity can reference outputs from other activities, pipeline parameters, or variables.
When used in conjunction with the Wait activity, the Until activity allows you to create loop conditions to periodically check the status of specific operations. Here are some examples:
Check whether a database table has been updated with new rows.
Check whether a SQL job is complete.
Check whether any new files have been added to a specific folder.
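Outside of ADF, the Until + Wait combination is just a poll loop. As a hedged illustration of the first example, assuming pyodbc and an invented staging table:

    # Poll a table for new rows: the loop body is the Until iteration,
    # time.sleep() plays the role of the Wait activity. Table, column,
    # and connection string are illustrative assumptions.
    import time
    import pyodbc

    conn = pyodbc.connect("<connection-string>")

    def wait_for_new_rows(last_seen_id: int, wait_seconds: int = 30) -> int:
        while True:
            row = conn.execute("SELECT MAX(id) FROM Staging.Incoming").fetchone()
            if row[0] is not None and row[0] > last_seen_id:
                return row[0]      # "Until" condition satisfied
            time.sleep(wait_seconds)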

Related

Fix inconsistent state right away or lazily when data is requested

Our users go through several steps of a workflow - the further they go, the more objects we create. We also allow users to go back to Step#1 and change one of the existing objects, which may cause inconsistencies, so we must update/delete some of the objects at Step#2. I see 2 options:
Update/delete objects from Step#2 right away. This leads to:
Operation that's supposed to be a simple PATCH of an entity field becomes complicated. And it's a shared object between multiple workflows - so we'll have to add if-statements and do different things depending on the workflow.
Circular dependencies. Operations on Step#1 have to know about objects/operations on Step#2.
On each request in Step#1 we'd have to load data for Step#2 in order to determine whether Step#2 really needs to be updated. This slows down operations on Step#1: to change 1 record in the DB we'd have to load hundreds (or even thousands) of records for Step#2.
Many actions on Step#1 may need fixing state at Step#2. So we have to ensure we don't forget anything today and in the future.
Fix Step#2 lazily - when the user goes there (our current approach). Step#2 will recognize that objects are inconsistent and fix them. This leads to just 1 place where we need to care, but:
Until the user opens Step#2, the DB will contain inconsistent objects. This hasn't resulted in any problems so far. But I can imagine it may complicate future SQL migrations.
We update DB state on a GET request. This doesn't seem like that big of a deal since GET stays idempotent anyway. But it still feels awkward.
Anyone knows better approaches? Or maybe improvements to these two?
Update
I haven't found perfect solution, but eventually we implemented an improved version of #1. When updating state on Step#1 we also set a flag "need to rebuild Step#2", when UI opens Step#2 it first checks this flag and issues a PUT to rebuild the state, and only then it GETs Step#2.
This still means that DB state is inconsistent for some period of time. But at least we'll know this for sure from the flag in DB. And if needed - we could write migrations taking this flag into account. This also allows (if needed in the future) to create an async job to fix the state.
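For illustration, a hedged sketch of that flag-based flow as a tiny Flask app; routes, storage, and the rebuild logic are all invented placeholders:

    # Step #1 edits only set a "rebuild needed" flag; the UI issues a
    # PUT to rebuild Step #2 before reading it, so GET stays a pure read.
    from flask import Flask, jsonify

    app = Flask(__name__)
    db = {"step2_dirty": False, "step2": {"items": []}}  # in-memory stand-in

    @app.patch("/step1")
    def patch_step1():
        db["step2_dirty"] = True  # mark Step #2 as needing a rebuild
        return "", 204

    @app.put("/step2/rebuild")
    def rebuild_step2():
        if db["step2_dirty"]:
            db["step2"] = {"items": ["rebuilt from Step #1 state"]}
            db["step2_dirty"] = False
        return "", 204

    @app.get("/step2")
    def get_step2():
        return jsonify(db["step2"])  # no state mutation on GET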
I think it is more flexible to separate the state from the context in which the objects are stored. Any creation of a new object at any step is accompanied by preserving the invariants and consistency of the context.
There are separate rules for states - rules for transitioning from one to another and which objects are available for creation - and separate rules for the context - rules for its consistency, which is ensured every time it changes.
What about asynchronous cleanup of dirty data?
1) Whenever the user goes back to Step #1 and changes something, mark all related data as "dirty" (e.g. add links to it in a "DirtyData" table) and be done for now.
2) Have a DataCleanup worker (e.g. a separate thread or similar) that constantly looks for data to be cleaned up.
3) Before editing data for Step #2, check that the data is not dirty.
Depending on your logic, 3) might result in user error (e.g. user would need to repeat Step #2). If DataCleanup worker has enough resources (i.e. it processes DirtyData table almost instantaneously), that should happen only on very rare occasions. If that is not OK, you could opt for checking for dirty data on each fetch, but that could be expensive.
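A minimal sketch of that worker, with an in-process queue standing in for the DirtyData table and a print standing in for the real cleanup logic:

    # Background DataCleanup worker draining a dirty-data queue.
    import queue
    import threading

    dirty: "queue.Queue[str]" = queue.Queue()

    def clean_up(item: str) -> None:
        print(f"cleaning {item!r}")  # placeholder for real cleanup logic

    def data_cleanup_worker() -> None:
        while True:  # constantly look for data to clean up
            try:
                item = dirty.get(timeout=1)
            except queue.Empty:
                continue
            clean_up(item)
            dirty.task_done()

    threading.Thread(target=data_cleanup_worker, daemon=True).start()

    # Step #1 edits just mark data dirty and return immediately:
    dirty.put("order-42/step2")
    dirty.join()  # demo only: wait until the worker has drained the queue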
It sounds like you're familiar with the HTTP spec regarding GET requests, but for future readers:
Why shouldn't a GET request change data on the server?
Why is using a HTTP GET to update state on the server in a RESTful call incorrect?
For the other bullet under 2, we probably don't need a specification to agree that persisting valid data is preferable to persisting invalid data.
So what can we do for the bullets under 1 to avoid complex branching logic in a particular step and also circular dependencies? My suggestion is an event-driven design. When step #2 changes it should fire a change event. In this scenario, step #2 has no knowledge of the concrete listener(s) who may receive its events, so it remains decoupled from any complex handling logic.
There's probably no way to guarantee you don't forget anything in the future; but if every step in the workflow is defined as a listener, it forces you to consider change events to some extent every time you implement a new step.
One side note on granularity: if a step has many changes, it can batch up its events rather than fire each one individually. You can adjust the batch size for efficiency.
In summary, I would strongly consider the Observer design pattern.
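To make that concrete, a minimal Observer sketch with hypothetical step objects, including the batching mentioned above:

    # Step #2 fires change events without knowing its listeners; events
    # are accumulated and flushed together as a batch for efficiency.
    from typing import Callable, Dict, List

    Listener = Callable[[Dict], None]

    class Step2:
        def __init__(self) -> None:
            self._listeners: List[Listener] = []
            self._pending: List[Dict] = []

        def subscribe(self, listener: Listener) -> None:
            self._listeners.append(listener)

        def change(self, change: Dict) -> None:
            self._pending.append(change)  # accumulate instead of firing

        def flush(self) -> None:
            batch, self._pending = self._pending, []
            for listener in self._listeners:
                listener({"changes": batch})

    step2 = Step2()
    step2.subscribe(lambda event: print("Step #3 reacting to", event))
    step2.change({"field": "qty", "value": 3})
    step2.change({"field": "price", "value": 9.5})
    step2.flush()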

Prevent one failed subtask failing all tasks in Flyte

I have a dynamic_task which kicks off a number of python_tasks. However, as soon as one of the python_tasks fails, the other ones that are still running would fail as well. Is this by design? Is there a way to change this behavior so that other tasks can still complete without failing?
This is by design, as a means to save resources, but it is configurable. Presumably, dynamic tasks are related to each other, and downstream tasks will need the output of all of them. So if one fails, the default behavior is to fail the rest.
If you'd like to change this, create your dynamic task with a float as this argument in the decorator: https://github.com/lyft/flytekit/blob/d4cfedc4c580f08bf904e6e474a0b948a4608737/flytekit/common/tasks/sdk_dynamic.py#L84
The idea is that partial failures are not tolerated within a data passing DAG. If some node fails, then by definition the data is partial.
But for dynamic array tasks, Flyte makes a special provision (via the Array tasks plugin) that allows users to provide a ratio of acceptable successful tasks.
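As a hedged illustration: in recent flytekit releases the equivalent knob is min_success_ratio on a map task (the linked legacy decorator exposed a similar float argument); verify the names against your flytekit version:

    # Accept the fan-out as successful if at least 80% of the mapped
    # sub-tasks succeed; in recent flytekit versions, failed entries
    # surface as None in the output list.
    from typing import List, Optional
    from flytekit import map_task, task, workflow

    @task
    def process(x: int) -> int:
        return x * 2

    @workflow
    def fan_out(xs: List[int]) -> List[Optional[int]]:
        return map_task(process, min_success_ratio=0.8)(x=xs)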

Mule batch processing vs foreach vs splitter-aggregator

In Mule, I have quite many records to process, where processing includes some calculations, going back and forth to the database, etc. We can process collections of records with these options:
Batch processing
ForEach
Splitter-Aggregator
So what are the main differences between them? When should we prefer one to others?
The Mule batch processing option does not seem to have batch-job-scope variable definition, for example. Or, what if I want to benefit from multithreading to speed up the overall task? Or, which is better if I want to modify the payload during processing?
When you write "quite many" I assume it's too much for main memory, this rules out spliter/aggregator because it has to collect all records to return them as a list.
I assume you have your records in a stream or iterator, otherwise you probably have a memory problem...
So when to use for-each and when to use batch?
For Each
The simplest solution, but it has some drawbacks:
It is single threaded (so may be too slow for your use case)
It is "fire and forget": You can't collect anything within the loop, e.g. a record count
There is no support for handling "broken" records
Within the loop, you can have several steps (message processors) to process your records (e.g. for the mentioned database lookup).
May be a drawback, may be an advantage: the loop is synchronous. (If you want to process asynchronously, wrap it in an async scope.)
Batch
A little more stuff to do / to understand, but more features:
When called from a flow, always asynchronous (this may be a drawback).
Can be standalone (e.g. with a poll inside for starting)
When the data generated in the loading phase is too big, it is automatically offloaded to disk.
Multithreading for free (number of threads configurable)
Handling for "broken records": Batch steps may be executed for good/broken records only.
You get statistics at the end (number of records, number of successful records, etc.)
So it looks like you'd better use batch.
For Splitter and Aggregator, you are responsible for writing the splitting logic and then joining the records back at the end of processing. It is useful when you want to process records asynchronously using different servers. It is less reliable compared to the other options, but parallel processing is possible here.
Foreach is more reliable, but it processes records iteratively using a single thread (synchronously), hence parallel processing is not possible. Each record creates a single message by default.
Batch processing is designed to process millions of records in a very fast and reliable way. By default 16 threads will process your records and it is reliable as well.
Please go through the links below for more details.
https://docs.mulesoft.com/mule-user-guide/v/3.8/splitter-flow-control-reference
https://docs.mulesoft.com/mule-user-guide/v/3.8/foreach
I have been using the approach of passing records as an array to a stored procedure.
You can call the stored procedure inside a for-each loop, setting the loop's batch size accordingly to avoid round trips. I have used this approach and the performance is good. You may have to create another table to log results and have that logic in the stored procedure as well.
Below is the link which has all the details
https://dzone.com/articles/passing-java-arrays-in-oracle-stored-procedure-fro
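If it helps, a hedged Python sketch of the same batching idea, assuming pyodbc against SQL Server with an invented staging table and procedure; chunking keeps the round trips down:

    # Load records in batches into a staging table, then invoke the
    # stored procedure once per batch (which can also log its results).
    import pyodbc

    def chunks(seq, size):
        for i in range(0, len(seq), size):
            yield seq[i:i + size]

    conn = pyodbc.connect("<connection-string>")
    cur = conn.cursor()
    cur.fast_executemany = True  # one round trip per batch insert

    records = [(i, f"name-{i}") for i in range(10_000)]
    for batch in chunks(records, 500):
        cur.executemany(
            "INSERT INTO Staging.Records (id, name) VALUES (?, ?)", batch
        )
        cur.execute("{CALL dbo.ProcessRecords}")  # ODBC call syntax
    conn.commit()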

Camunda modular design

I want to manage a huge workflow in Camunda.
I have decided to split this into different processes like Create, Configuration, Review & Confirm. Each of these processes have 10 to 15 tasks. These processes should be executed in sequence.
If I want to design my workflow like this, how will I link the processes? What is the proper way to achieve a modular design in Camunda?
You would probably go with some kind of SubProcess. If you plan to model different processes, you will most likely use Call Activities and execute them one after another in some kind of root process.
Beware of the fact that each sub-process starts its own process instance, and thus you have to handle different execution scopes. That will be relevant if you request information from the system, e.g. the list of UserTasks. You cannot use the processInstanceId of the root process in this case and will have to use a businessKey.
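For example, a hedged sketch of such a task query against Camunda's REST API, assuming the Call Activities propagate the business key and an illustrative engine URL:

    # Query user tasks across the root process and all called sub-process
    # instances by correlating on the business key.
    import requests

    CAMUNDA = "http://localhost:8080/engine-rest"  # illustrative URL

    def user_tasks_for(business_key: str) -> list:
        resp = requests.get(
            f"{CAMUNDA}/task",
            params={"processInstanceBusinessKey": business_key},
        )
        resp.raise_for_status()
        return [t["name"] for t in resp.json()]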
You also have to handle the process variables and decide which variables you want to propagate to the sub process.

Recover from SQL batch-abort errors inside a transaction? Alternative?

I'm looking for a way to continue execution of a transaction despite errors while inserting low-priority data. It seems like real nested transactions could be a solution, but they aren't supported by SQL Server 2005/2008. Another solution would be to have logic to decide whether an error is critical or not, but it would seem that's not possible either.
Here's more detail on my scenario:
Data is periodically inserted into the database using ADO.NET/C#, and while some of it is vital, some could also be missing without problems. When the inserts are done, some computations are made on the data (both vital and non-vital). This whole process is inside a transaction so everything remains in sync.
Currently, transaction save points are used, and partial rollbacks are made on exceptions which occur during non-vital inserts. However, this doesn't work for "batch-abort" errors, which automatically roll back the entire transaction. I understand some errors are critical, but things like failed casts are considered by SQL Server to be batch-abort errors. (Info on batch errors) I'm trying to prevent these errors from bringing down the whole insert if they occur on low-priority data.
If what I'm describing isn't possible, I'm willing to consider any alternative way to achieve data integrity but allow the failure of the non-vital inserts.
Thanks for your help.
Unfortunately, this can't be done as you describe (full support for nested transactions would be key here). Here are a couple of things that have been used to get around this in the past:
Best option would probably be to separate the commands into important/non-important commands that could be executed distinctly, naturally this would require that they not be order-dependent on each other
Could also use a messaging-based approach (see Service Broker) where you would execute the primary commands inline and push the non-primary commands onto a queue for execution later/separately. The push to the queue would be transactional within the batch, but the execution of the command when you pop it off the queue would be separate. This too would require that they not be order-dependent on each other.
If order-dependent, you could use the messaging approach for everything, which would ensure order and could have separate messages per operation, then grouping them together (via conversation groups) would allow you to pull them off the queue in order as well and use separate transactions for each 'type' of operation (i.e. primary vs. non-primary). This would require some special coding on your part if all the grouped messages must be a single autonomous operation, but could be done.
I hesitate to even mention this option, because it is a terrible option, but for full disclosure I suppose you could consider it at your discretion if you think it fits (but it is definitely not an architecture that would apply to almost any scenario). You could use xp_cmdshell to call out to the command line and execute sqlcmd/osql for the non-critical tasks - this sqlcmd execution would be in a separate transaction from the module you are executing from, and simply ignoring the xp_cmdshell failure should allow the primary batch to continue.
Those are some ideas...
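As a hedged sketch of the first option, assuming pyodbc and invented table names: the vital rows share one transaction, while each non-vital row gets its own tiny transaction, so a batch-aborting error (like a failed cast) cannot take the vital work down with it:

    import pyodbc

    conn = pyodbc.connect("<connection-string>", autocommit=False)
    cur = conn.cursor()

    vital_rows = [(1, "a"), (2, "b")]       # illustrative data
    nonvital_rows = [(3, "c"), (4, "d")]

    # Vital inserts: all-or-nothing in a single transaction.
    for row in vital_rows:
        cur.execute("INSERT INTO dbo.VitalData (id, val) VALUES (?, ?)", row)
    conn.commit()

    # Non-vital inserts: commit each one separately; failures are
    # logged and skipped instead of dooming everything else.
    for row in nonvital_rows:
        try:
            cur.execute("INSERT INTO dbo.ExtraData (id, val) VALUES (?, ?)", row)
            conn.commit()
        except pyodbc.Error as exc:
            conn.rollback()
            print(f"skipped non-vital row {row!r}: {exc}")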
Can you do your import into a temporary location, using transactions only for the important parts? Once the temp location is loaded, having absorbed any non-critical errors, you can copy the data into its final destination in a single transaction. It depends on the nature of the work you are doing, but this is potentially a viable option.