How best to handle errors and/or missing data in a Neuraxle pipeline? - neuraxle

Let's assume you have a pipeline with steps that can fail for some input elements, for example:
FetchSomeImagesFromIds -> Resize -> DoSomethingElse
In this case the first step downloads only 10 out of 100 images and passes those to Resize.
I'm looking for suggestions on how to report or handle this missing data at the pipeline level, for example something like:
Pipeline.errors() -> PluginX: Succeed: 10, Failed: 90, Total: 100, Errors: key: error
My current implementation removes the missing keys from current_keys so that the key -> data mapping is kept, and it actually exits the whole program if anything is missing, given the previous problem with https://github.com/Neuraxio/Neuraxle/issues/418
Thoughts?

I think that using a Service in your pipeline would be a good way to do this. Here's what I'd do, although other solutions could exist:
1. Create your pipeline and pipeline steps.
2. Create a context, and add to the context a custom memory bank service in which you keep track of which data was processed properly or not (see the sketch after this list). Depending on your needs and broader context, it could be either a positive data bank or a negative one: you'd respectively either add the processed examples to the set or subtract them from it.
3. Adapt the pipeline made at point 1, and its steps, such that they use the service from the context in the handle_transform_data_container methods. You could even have a WhileTrue() step that loops until a BreakIf() step evaluates that everything has been processed, if you want your pipeline to run until everything is done, fetching the batches as they come with no end condition other than the BreakIf step and its lambda. The lambda would call the service to know where the data processing is at.
4. At the end of the whole processing, whether you broke prematurely (without any while loop) or broke only at the end, you still have access to the context and to what's stored inside.
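For illustration, here's a minimal sketch of what such a memory bank service could look like, in plain Python. This is not an existing Neuraxle API; the names ProcessedDataBank, mark_success, mark_failure, all_done and report are made up for the example:

from dataclasses import dataclass, field
from typing import Dict, Set

@dataclass
class ProcessedDataBank:
    """Hypothetical context service tracking which data IDs processed properly."""
    succeeded: Set[str] = field(default_factory=set)
    failed: Dict[str, str] = field(default_factory=dict)  # id -> error message

    def mark_success(self, data_id: str) -> None:
        self.succeeded.add(data_id)
        self.failed.pop(data_id, None)  # a retry may fix an earlier failure

    def mark_failure(self, data_id: str, error: str) -> None:
        self.failed[data_id] = error

    def all_done(self, expected_ids: Set[str]) -> bool:
        # What a BreakIf-style condition would call to stop the loop.
        return expected_ids <= self.succeeded

    def report(self) -> str:
        total = len(self.succeeded) + len(self.failed)
        return (f"Succeeded: {len(self.succeeded)}, Failed: {len(self.failed)}, "
                f"Total: {total}, Errors: {self.failed}")

A step's handle method would fetch this service from the context and call mark_success or mark_failure per element; at the end, report() gives exactly the kind of summary the question asks for.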
More info:
To see an example of how to use the service and context together, and how to use this in steps, see this answer: https://stackoverflow.com/a/64850395/2476920
Also note that the BreakIf and While steps are core steps that are not yet developed in Neuraxle. We recently had a brilliant idea with Vincent Antaki: Neuraxle is a language, and therefore steps in a pipeline are like basic language keywords (While, Break, Continue, ForEach, and so forth). This abstraction is powerful in the sense that it makes it possible to control a data flow as a logical execution flow.
This is my best solution for now, and exactly this has never been done yet. There may be many other ways to do this, with creativity. One could even think of adding TryCatch steps to the pipeline to catch some errors and manage what happens in the execution flow of the data.
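As a rough sketch of that TryCatch idea (again, not an existing Neuraxle step; the wrapped step's interface is simplified to a bare transform method), it could look like:

class TryCatchStep:
    """Hypothetical wrapper: runs the wrapped step per element, routing
    failures into the data bank instead of crashing the whole pipeline."""

    def __init__(self, wrapped_step, data_bank):
        self.wrapped_step = wrapped_step
        self.data_bank = data_bank

    def transform(self, keyed_items):
        results = []
        for data_id, value in keyed_items:
            try:
                results.append((data_id, self.wrapped_step.transform(value)))
                self.data_bank.mark_success(data_id)
            except Exception as error:  # per-element error: record it and keep going
                self.data_bank.mark_failure(data_id, str(error))
        return results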
All of this is to be done in Neuraxle and is not yet done. We're looking for collaborators. It would make a nice paper as well: "Machine Learning Pipelines as a Programming Language" :)

Related

Best practice for system design: checking calculation results in JSON

I have a program that reads a JSON file, calculates, and outputs a JSON file on S3.
My question is how I should systematically check that the output calculation seems okay.
I understand that writing a unit test is one way to do this, but it doesn't guarantee that the output file is safe. I'm thinking of making another program, running on Lambda, that checks the output JSON.
For example, let's say the program is calculating dynamic pricing in an area which has an upper-bound value. Then I want to make sure all the calculation results in the JSON file don't exceed the upper-bound value, or at least I'd like to monitor whether they are all safe or there are some anomalies.
I want to build an efficient and robust anomaly-detection system, so I don't want to build the anomaly check into the same program, to avoid a single point of failure. Any suggestions are welcome.
One option is to create a second Lambda function with an S3 trigger that fires when the JSON file is written into S3 by the original function.
In this second Lambda, you can verify the data, and if there is an anomaly, you may trigger an SNS or EventBridge event which can be used to log/inform/alert about the issue, or maybe to trigger a separate process to auto-correct anomalies.
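A minimal sketch of that second Lambda in Python with boto3; the environment variables, the JSON layout (a list of entries with a "price" field), and the alert topic are assumptions to adapt:

import json
import os

import boto3

s3 = boto3.client("s3")
sns = boto3.client("sns")

UPPER_BOUND = float(os.environ["PRICE_UPPER_BOUND"])  # e.g. "100.0"
TOPIC_ARN = os.environ["ALERT_TOPIC_ARN"]             # SNS topic for alerts

def handler(event, context):
    # S3 put-event trigger: one record per written object.
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        results = json.loads(body)  # assumed: a list of {"price": ...} entries

        anomalies = [r for r in results if r["price"] > UPPER_BOUND]
        if anomalies:
            sns.publish(
                TopicArn=TOPIC_ARN,
                Subject=f"Anomalies in {key}",
                Message=json.dumps(anomalies[:10]),  # a sample, not the full list
            )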
You should use Design by Contract, aka Contract-Oriented Programming - that is, preconditions and postconditions.
If the output shall never exceed a certain value, then that is a postcondition of the code producing the value. The program should assert its postconditions.
If some other code relies on a value being bounded, then that is a precondition of that code. That code should assert its precondition. This is a type of Defensive Programming technique.
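In Python, the cheapest form of this is plain assert statements; the function names and the pricing formula here are only illustrative:

def compute_price(demand: float, base_price: float, upper_bound: float) -> float:
    price = base_price * demand  # stand-in for the real calculation
    # Postcondition: this code promises never to produce a price above the bound.
    assert price <= upper_bound, f"postcondition violated: {price} > {upper_bound}"
    return price

def publish_price(price: float, upper_bound: float) -> None:
    # Precondition: this code relies on the bound already holding (defensive check).
    assert price <= upper_bound, f"precondition violated: {price} > {upper_bound}"
    ...  # write the validated price out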

Problem with PDDL code using conditional effects

I'm trying to solve a Pacman-styled PDDL problem, and there's a particular scenario I've been stuck on for days now. I'm getting the classic
ff: goal can be simplified to FALSE. No plan will solve it
which means the issue is trivial and logic-related. However, I'm new to PDDL and can't seem to figure out what's causing it.
The problem is simple: Pacman (P) has to eat the Food (F), but two ghost agents (G) are blocking it. To get past them, Pacman needs to consume the Capsule (C), making him invisible.
(Edit: I've deleted the question as it was part of an assignment. I managed to solve the issue and will post the solution as soon as the assignment is graded.)
In this thread: About PDDL in AI planning, haz mentioned a good methodology to debug your PDDL model when the goal is unreachable from the initial state:
The best way to test this out is to follow the following strategy: (1) write down a plan you know will solve it; (2) starting with the first action, set the goal to the precondition; (3) repeat to the end. If that fails, start changing the initial state to what you expect the complete state to be during the execution of the plan. – haz May 1 at 2:04
I uploaded a new version which is able to find a solution (provided the Ghosts are not in the path of the food):
http://editor.planning.domains/#read_session=c7Vez9nrti
Two main issues:
You never delete GhostPos, but it appears in the goal formula.
If a GhostPos is on the way to a FoodPos, then you can never reach that food, as the move action requires the destination not to hold a Ghost.

In “Given-When-Then” style BDD tests, is it OK to have multiple “When”s conjoined with an “And”?

I read Bob Martin's brilliant article on how "Given-When-Then" can actually be compared to an FSM. It got me thinking: is it OK for a BDD test to have multiple "When"s?
For example:
GIVEN my system is in a defined state
WHEN an event A occurs
AND an event B occurs
AND an event C occurs
THEN my system should behave in this manner
I personally think these should be 3 different tests for good separation of intent. But other than that, are there any compelling reasons for or against this approach?
When multiple steps (WHEN) are needed before you do your actual assertion (THEN), I prefer to group them in the initial-condition part (GIVEN) and keep only one in the WHEN section. This shows that the event that really triggers the "action" of my SUT is that one, and that the previous ones are merely steps to get there.
Your test would become:
GIVEN my system is in a defined state
AND an event A occurs
AND an event B occurs
WHEN an event C occurs
THEN my system should behave in this manner
but this is more of a personal preference, I guess.
If you truly need to test that a system behaves in a particular manner under those specific conditions, it's a perfectly acceptable way to write a test.
I found that another limiting factor can appear in an E2E testing scenario where you would like to reuse a statement multiple times. In my case, the BDD framework of my choice (pytest_bdd) is implemented such that a given statement can have a single return value, which it maps to a fixture automagically by the name of the function bound to the given step. This design prevents reusability, whereas in my case I wanted it: in short, I needed to create objects and add them to a sequence object provided by another given statement. The way I worked around this limitation is with a test fixture (which I named test_context), a Python dictionary (a hashmap), plus when statements, which don't have the same single-return requirement; the '(when) add object to sequence' step looked the sequence up in the context and appended the object in question to it. Now I could reuse the 'add object to sequence' action multiple times.
This requirement was tricky because BDD aims to be descriptive. I could have used a single given statement with a pickled memory map of the sequence object that I wanted to perform the test action on. But would it have been useful? I think not. I needed to get the sequence constructed first, and that needed reusable statements. And although this is not in the BDD bible, I think it is, in the end, a practical and pragmatic solution to a very real E2E descriptive-testing problem.
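A condensed sketch of that workaround with pytest_bdd (the step wording and fixture names are illustrative, and target_fixture needs a reasonably recent pytest-bdd):

import pytest
from pytest_bdd import given, when, parsers

@pytest.fixture
def test_context():
    # Shared hashmap the steps use to pass objects around.
    return {}

@given("an empty sequence", target_fixture="sequence")
def empty_sequence(test_context):
    test_context["sequence"] = []
    return test_context["sequence"]

@when(parsers.parse('I add object "{name}" to the sequence'))
def add_object_to_sequence(test_context, name):
    # Reusable: this step can appear several times in one scenario,
    # because it looks the sequence up in the shared context.
    test_context["sequence"].append(name)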

Restarting agent program after it crashes

Consider a distributed bank application wherein distributed agent machines modify the value of a global variable, say "balance".
So, the agents' requests are queued. A request carries a value that is to be added to the global variable on behalf of the particular agent. So the code for the agent is of the form:
agent
{
    look_queue();  // take a look at the leftmost request on the queue without dequeuing
    lock_global_variable(balance, agent_machine_id);
    ///////////////////// **POINT A**
    modify(balance, value);
    unlock_global_variable(balance, agent_machine_id);
    /////////////////// **POINT B**
    dequeue();  // once the transaction is complete, the request can be dequeued
}
Now, if an agent's code crashes at POINT B, then obviously the request should not be processed again, otherwise the variable will be modified twice for the same request. To avoid this, we can make the code atomic, thus:
agent
{
    look_queue();  // take a look at the leftmost request on the queue without dequeuing
    *atomic*
    {
        lock_global_variable(balance, agent_machine_id);
        modify(balance, value);
        unlock_global_variable(balance, agent_machine_id);
        dequeue();  // once the transaction is complete, the request can be dequeued
    }
}
I am looking for answers to these questions:
How do we identify points in code which need to be executed atomically, 'automatically'?
If the code crashes during execution, how much will "logging the transaction and variable values" help? Are there other approaches for solving the problem of crashed agents?
Again, logging is not scalable to big applications with a large number of variables. What can we do in those cases, instead of restarting execution from scratch?
In general, how can we identify such atomic blocks in the case of agents that work together? If one agent fails, do the others have to wait for it to restart? How can software testing help us identify potential cases wherein, if an agent crashes, an inconsistent program state is observed?
How can we make the atomic blocks more fine-grained, to reduce performance bottlenecks?
Q> How do we identify points in code which need to be executed atomically, 'automatically'?
A> Any time there is anything stateful shared across different contexts (not necessarily all parties need to be mutators; it's enough to have at least one). In your case, it's the balance that is shared between different agents.
Q> If the code crashes during execution, how much will "logging the transaction and variable values" help? Are there other approaches for solving the problem of crashed agents?
A> It can help, but it has high costs attached. You need to roll back X entries, replay the scenario, etc. A better approach is to either make it all transactional or have an effective automatic-rollback scenario.
Q> Again, logging is not scalable to big applications with a large number of variables. What can we do in those cases, instead of restarting execution from scratch?
A> In some cases you can relax consistency. For example, CopyOnWriteArrayList does a concurrent write-behind and switches the data on for new readers once it becomes available. If the write fails, it can safely discard that data. There's also compare-and-swap (sketched below). Also see the link for the previous question.
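To illustrate the compare-and-swap idea in Python (real CAS is an atomic hardware instruction or an API like Java's AtomicLong; the lock below merely simulates that atomicity for the sketch):

import threading

_cas_lock = threading.Lock()

def compare_and_swap(cell: dict, key: str, expected, new) -> bool:
    """Set cell[key] = new only if it still equals expected; report success."""
    with _cas_lock:  # stands in for the hardware-level atomic instruction
        if cell[key] == expected:
            cell[key] = new
            return True
        return False

def add_to_balance(cell: dict, value: float) -> None:
    # Optimistic retry loop: recompute and retry if another agent won the race.
    while True:
        current = cell["balance"]
        if compare_and_swap(cell, "balance", current, current + value):
            return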
Q> In general, how can we identify such atomic blocks in the case of agents that work together?
A> See your first question.
Q> If one agent fails, do the others have to wait for it to restart?
A> Most policies/APIs define maximum timeouts for critical-section execution; otherwise the system risks ending up in a perpetual deadlock.
Q> How can software testing help us identify potential cases wherein, if an agent crashes, an inconsistent program state is observed?
A> It can, to a fair degree. However, testing concurrent code requires as much skill as writing the code itself, if not more.
Q> How can we make the atomic blocks more fine-grained, to reduce performance bottlenecks?
A> You have answered the question yourself :) If one atomic operation needs to modify 10 different shared-state variables, there's not much you can do apart from trying to push the external contract down so that it needs to modify fewer. This is pretty much the reason why databases are not as scalable as NoSQL stores - they might need to modify dependent foreign keys, execute triggers, etc. Or try to promote immutability. One concrete fine-graining trick is lock striping, sketched below.
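A sketch of that striping idea in Python (names made up): one lock per account instead of one global lock, so agents touching different accounts no longer contend:

import threading

class StripedBalances:
    """Fine-grained locking: a lock per account rather than one global lock."""

    def __init__(self, account_ids):
        self._balances = {a: 0.0 for a in account_ids}
        self._locks = {a: threading.Lock() for a in account_ids}  # created up front

    def modify(self, account_id: str, value: float) -> None:
        with self._locks[account_id]:  # only blocks agents on the same account
            self._balances[account_id] += value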
If you were a Java programmer, I would definitely recommend reading this book. I'm sure there are good counterparts for other languages, too.

How would you create a cyclic task graph in TPL, and/or is this possible?

My project has a requirement to gather data from a number of sources and then do things in response to the completion of that gathering. Some of the gathering tasks have dependencies on prior gathering tasks. TPL has been a good fit because tasks naturally continue from their antecedents, and the "final" tasks that use the results are again dependents. Great. However, we would like to have a "sleep and regather" task that starts upon completion of the "final" tasks; this task's job is logically to be the antecedent of the "final" tasks and to kick off the next cycle. In effect, the TPL's DAG becomes cyclic or, thought of sequentially, a loop.
Is it possible to express this cyclic requirement entirely within the TPL API? If so, how? Our current implementation instead does a WaitAll() on the antecedents and then a Task.StartNew() given a delegate that sleeps and then rebuilds the task graph, ending with the WaitAll() again. This works, but seems a bit artificial.
There are a few options here. What you are doing now seems reasonable.
However, you could potentially set up the entire operation as a producer/consumer scenario using BlockingCollection<T>. If your consuming enumerable used a ManualResetEvent that was set after the WaitAll completed, it could allow a single "item" to be consumed at a time, using tasks as you have them written now (a language-agnostic sketch of this looping idea follows).
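Since the pattern itself is language-agnostic, here is a sketch of that looping producer/consumer idea in Python, with queue.Queue and threading.Event standing in for BlockingCollection<T> and ManualResetEvent (this is an analogy, not TPL code):

import queue
import threading
import time

cycle_tokens = queue.Queue()    # stands in for BlockingCollection<T>
cycle_done = threading.Event()  # stands in for ManualResetEvent

def run_task_graph():
    ...  # gather tasks, then the "final" tasks (the WaitAll equivalent)

def consumer():
    while True:
        token = cycle_tokens.get()  # block until a cycle is requested
        if token is None:           # sentinel: shut down cleanly
            return
        run_task_graph()
        cycle_done.set()            # signal that the "final" tasks finished

def producer(cycles, sleep_seconds):
    for _ in range(cycles):
        cycle_done.clear()
        cycle_tokens.put(object())  # kick off the next gather cycle
        cycle_done.wait()           # wait for the graph to complete
        time.sleep(sleep_seconds)   # the "sleep" in sleep-and-regather
    cycle_tokens.put(None)          # stop the consumer

# Usage: run the consumer on a worker thread, drive the cycles from the producer.
threading.Thread(target=consumer, daemon=True).start()
producer(cycles=3, sleep_seconds=1.0)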
That being said, this seems like a perfect candidate for the TPL Dataflow library (in CTP).