Best practice for system design: checking calculation results in JSON - amazon-s3

I have a program that reads a JSON file, runs a calculation, and writes a JSON file to S3.
My question is: how should I systematically check that the calculated output is okay?
I understand that writing unit tests is something I should do, but that doesn't guarantee the output file is safe. I'm thinking of making another program, running on Lambda, that checks the output JSON.
For example, let's say the program is calculating dynamic pricing in an area that has an upper-bound value. Then I want to make sure that none of the calculation results in the JSON file exceed the upper bound, or at least I'd like to monitor whether they are all safe or whether there are anomalies.
I want to build an efficient and robust anomaly detection system, and I don't want to build the anomaly check into the same program, to avoid a single point of failure. Any suggestions are welcome.

One option is to create a second Lambda function with an S3 trigger that fires when the JSON file is written to S3 by the original function.
In this second Lambda you can verify the data, and if there is an anomaly you can publish an SNS or EventBridge event, which can be used to log/inform/alert about the issue, or perhaps to trigger a separate process that auto-corrects the anomaly.
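A minimal sketch of such a validator, assuming Python with boto3, a list-of-records JSON layout, and a hypothetical PRICE_UPPER_BOUND / ALERT_TOPIC_ARN configuration (adapt the field names to your actual output format):

    import json
    import os

    import boto3

    s3 = boto3.client("s3")
    sns = boto3.client("sns")

    # Assumed configuration: the bound and the alert topic come from env vars.
    UPPER_BOUND = float(os.environ["PRICE_UPPER_BOUND"])
    ALERT_TOPIC_ARN = os.environ["ALERT_TOPIC_ARN"]

    def handler(event, context):
        """Fired by the s3:ObjectCreated trigger on the output bucket/prefix."""
        for record in event["Records"]:
            bucket = record["s3"]["bucket"]["name"]
            key = record["s3"]["object"]["key"]

            body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
            results = json.loads(body)  # assumed shape: [{"area": ..., "price": ...}, ...]

            anomalies = [r for r in results if r["price"] > UPPER_BOUND]
            if anomalies:
                sns.publish(
                    TopicArn=ALERT_TOPIC_ARN,
                    Subject=f"Pricing anomaly in {key}",
                    Message=json.dumps({"count": len(anomalies),
                                        "sample": anomalies[:10]}),
                )

Keeping the check in its own function (and ideally its own deployment) also gives you the separation from the producer that you asked for.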

You should use Design by Contract, a.k.a. Contract-Oriented Programming; in other words, preconditions and postconditions.
If the output must never exceed a certain value, then that is a postcondition of the code producing the value. The program should assert its postconditions.
If some other code relies on a value being bounded, then that is a precondition of that code, and that code should assert its precondition. This is a defensive programming technique.
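A small illustration in Python (the names compute_price and UPPER_BOUND are made up for the dynamic-pricing example in the question):

    UPPER_BOUND = 100.0  # illustrative bound for the example

    def compute_price(base: float, demand_factor: float) -> float:
        """Producer: the bound is a postcondition of the calculation."""
        price = min(base * (1.0 + demand_factor), UPPER_BOUND)
        assert price <= UPPER_BOUND, f"postcondition violated: {price} > {UPPER_BOUND}"
        return price

    def publish_price(price: float) -> None:
        """Consumer: relies on the bound, so it asserts it as a precondition."""
        assert price <= UPPER_BOUND, f"precondition violated: {price} > {UPPER_BOUND}"
        # ... write the value to the output JSON ...

The assertions document the contract in the code itself; the external Lambda check from the other answer then acts as a second, independent line of defence.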

Related

Language for data-driven programming

I'm wondering if there is a language that supports programming in the following style that I think of as "data driven":
Imagine a language like C or Python except that it's possible to define a function whose input parameters are bound to particular variables. Then, whenever one of those input variables changes, the function is re-run.
Alternatively, the function might only ever be run when its output is needed, regardless of when its inputs are changed.
The first option is useful when things need to be kept up to date whenever something changes. The second option is useful when the computation is expensive, and you want to run it only when needed.
In my mind, this type of programming paradigm would require many stacks, perhaps one for each function that was defined in the above manner. This would allow those functions to be run in any order, and it would allow their execution to be occasionally blocked. Execution on one stack would be blocked any time it needs the output of a function whose inputs are not yet ready.
This would be a single-threaded application. The runtime system would take care of switching from one stack to another in an appropriate manner. Deadlocks would be possible.
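The two triggering options can be sketched in plain Python (the Cell class below is purely illustrative and ignores the multi-stack runtime described above): option 1 re-runs dependents on every assignment, option 2 recomputes only when the value is demanded.

    class Cell:
        """A variable that knows which derived computations depend on it."""

        def __init__(self, value):
            self._value = value
            self._observers = []   # option 1 (push): re-run on every change
            self._thunks = {}      # option 2 (pull): name -> (fn, valid, cache)

        def set(self, value):
            self._value = value
            for fn in self._observers:                       # push: re-run now
                fn(self._value)
            self._thunks = {k: (fn, False, None)             # pull: invalidate caches
                            for k, (fn, _, _) in self._thunks.items()}

        def on_change(self, fn):       # option 1: bind fn's input to this variable
            self._observers.append(fn)
            fn(self._value)

        def lazy(self, name, fn):      # option 2: register, but don't run yet
            self._thunks[name] = (fn, False, None)

        def demand(self, name):        # option 2: run only when the output is needed
            fn, valid, cache = self._thunks[name]
            if not valid:
                cache = fn(self._value)
                self._thunks[name] = (fn, True, cache)
            return cache

    x = Cell(2)
    x.on_change(lambda v: print("square is now", v * v))  # prints 4 immediately
    x.lazy("cube", lambda v: v ** 3)
    x.set(3)                    # push: prints "square is now 9"
    print(x.demand("cube"))     # pull: computes 27 only at this point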

Dropping samples with tagged streams

I'm trying to accomplish the following behavior:
I have a continuous stream of symbols that periodically contains pilot symbols and data symbols. I have the Correlation Estimator block, which tags the locations of the pilots in the stream. Now I would like to filter out the pilots so that the following blocks receive data only, with the pilots discarded based on the tags received from the Correlation Estimator block.
Are there any existing blocks that allow me to achieve this? I've tried to search but am a bit lost.
Hm, technically, the packet-header Demux could do that, but it's a complex beast and the things you need to do to satisfy its input requirements might be a bit complicated.
So, instead, simply write your own (general) block! That's pretty easy: just save your current state (PASSING or DROPPING) in a member of the block class, and change it based on the tags you see (when in PASSING mode, you look for the correlator tag) or on whether you've dropped enough symbols (when in DROPPING mode). A classical finite state machine!
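A rough sketch of such a block in Python, assuming the correlator writes a tag with key "corr_est" at the start of each pilot and that the pilot length is fixed (both are assumptions; adjust them to your flowgraph):

    import numpy as np
    import pmt
    from gnuradio import gr

    class drop_pilots(gr.basic_block):
        """Pass data symbols; drop pilot_len symbols after each correlator tag.

        tag_key and pilot_len are assumptions -- set them to the tag your
        Correlation Estimator actually emits and to your real pilot length.
        """

        def __init__(self, tag_key="corr_est", pilot_len=64):
            gr.basic_block.__init__(self, name="drop_pilots",
                                    in_sig=[np.complex64], out_sig=[np.complex64])
            self.tag_key = tag_key
            self.pilot_len = pilot_len
            self.passing = True   # FSM state: PASSING (copy) or DROPPING (discard)
            self.to_drop = 0      # symbols still to discard while DROPPING

        def general_work(self, input_items, output_items):
            in0, out = input_items[0], output_items[0]
            n_read = self.nitems_read(0)          # absolute index of in0[0]
            consumed = produced = 0

            while consumed < len(in0) and produced < len(out):
                if self.passing:
                    # Find the next correlator tag in the not-yet-consumed input.
                    tags = [t for t in self.get_tags_in_window(0, consumed, len(in0))
                            if pmt.symbol_to_string(t.key) == self.tag_key]
                    next_pilot = min((t.offset - n_read for t in tags), default=len(in0))
                    # Copy data symbols up to the pilot (or the end of the buffer).
                    n = min(next_pilot - consumed, len(out) - produced)
                    out[produced:produced + n] = in0[consumed:consumed + n]
                    produced += n
                    consumed += n
                    if consumed == next_pilot and consumed < len(in0):
                        self.passing = False
                        self.to_drop = self.pilot_len
                else:
                    # DROPPING: discard pilot symbols, produce nothing.
                    n = min(self.to_drop, len(in0) - consumed)
                    consumed += n
                    self.to_drop -= n
                    if self.to_drop == 0:
                        self.passing = True

            self.consume(0, consumed)
            return produced

The while loop just walks the input buffer, copying in PASSING state and skipping pilot_len symbols after each tag in DROPPING state; consume() and the return value tell the scheduler how much was eaten and produced.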

How best to handle errors and/or missing data in a Neuraxle pipeline?

Let's assume you have a pipeline with steps that can fail for some input elements for example:
FetchSomeImagesFromIds -> Resize -> DoSomethingElse
In this case the first step downloads only 10 out of 100 images... and passes those on to Resize.
I'm looking for suggestions on how to report or handle this missing data at the pipeline level for example something like:
Pipeline.errors() -> PluginX: Succeed: 10, Failed: 90, Total: 100, Errors: key: error
My current implementation removes the missing keys from current_keys so that the key -> data mapping is kept, and it actually exits the whole program if anything is missing, given the previous problem with https://github.com/Neuraxio/Neuraxle/issues/418
Thoughts?
I think that using a Service in your pipeline would be a good way. Here's what I'd do if I think about it, although more solutions could exist:
Create your pipeline and pipeline steps.
Create a context and add to the context a custom memory bank service in which you can keep track of which data was processed properly and which was not. Depending on your needs and broader context, it could be either a positive data bank or a negative one, to which you would respectively add the successfully processed examples or subtract the failed ones from the set.
Adapt the pipeline made at point 1, and its steps, so that they can use the service from the context in the handle_transform_data_container methods. You could even have a WhileTrue() step that loops forever until a BreakIf() step evaluates that everything has been processed, if you want your pipeline to keep working and fetching batches as they come, with no end condition other than the BreakIf step and its lambda. The lambda would call the service to know how far the data processing has progressed.
At the end of the whole processing, whether you broke out prematurely (without any while loop) or only at the end, you still have access to the context and to what's stored inside it.
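A framework-agnostic sketch of that "memory bank service" idea in plain Python (these class and function names are made up for illustration, not actual Neuraxle API):

    from dataclasses import dataclass, field

    @dataclass
    class ProcessingReportService:
        """The 'memory bank' kept in the execution context across steps."""
        succeeded: set = field(default_factory=set)
        failed: dict = field(default_factory=dict)   # item id -> error message
        total: int = 0

        def record_success(self, item_id):
            self.succeeded.add(item_id)

        def record_failure(self, item_id, error):
            self.failed[item_id] = str(error)

        def all_done(self):
            return len(self.succeeded) + len(self.failed) >= self.total

        def summary(self):
            return {"Succeeded": len(self.succeeded), "Failed": len(self.failed),
                    "Total": self.total, "Errors": dict(self.failed)}

    def download_image(image_id):
        """Stub standing in for the real FetchSomeImagesFromIds work."""
        if image_id % 10 != 0:                       # pretend 90% of downloads fail
            raise IOError(f"image {image_id} not found")
        return f"image-{image_id}.png"

    def fetch_images_step(image_ids, context):
        """What a step's handler would do: report per-item outcomes to the service."""
        report = context["report"]
        report.total = len(image_ids)
        fetched = []
        for image_id in image_ids:
            try:
                fetched.append(download_image(image_id))
                report.record_success(image_id)
            except Exception as err:
                report.record_failure(image_id, err)
        return fetched

    context = {"report": ProcessingReportService()}
    images = fetch_images_step(list(range(100)), context)
    print(context["report"].summary())
    # {'Succeeded': 10, 'Failed': 90, 'Total': 100, 'Errors': {...}}

In Neuraxle itself you would register such a service on the context and look it up from the step's handler methods, as shown in the linked answer below.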
More info:
To see an example on how to use the service and context together and using this in steps, see this answer: https://stackoverflow.com/a/64850395/2476920
Also note that the BreakIf and While steps are core steps that have not yet been developed in Neuraxle. We recently had a brilliant idea with Vincent Antaki: Neuraxle is a language, and therefore steps in a pipeline are like basic language keywords (While, Break, Continue, ForEach, and so forth). This abstraction is powerful in the sense that it makes it possible to control a data flow as a logical execution flow.
This is my best solution for now, and exactly this has never been done yet. There may be many other ways to do this, with creativity. One could even think of adding TryCatch steps to the pipeline to catch some errors and manage what happens in the execution flow of the data.
All of this is to be done in Neuraxle and is not yet done. We're looking for collaborators. It would make a nice paper as well: "Machine Learning Pipelines as a Programming Language" :)

How to quickly analyse the impact of a program change?

Lately I have needed to do an impact analysis on changing a DB column definition in a widely used table (like PRODUCT, USER, etc.). I find it a very time-consuming, boring, and difficult task. I would like to ask if there is any known methodology for doing this.
The question also applies to changes in applications, file systems, search engines, etc. At first I thought this kind of functional relationship should be pre-documented or somehow tracked, but then I realized that everything can change, so it would be impossible to do so.
I don't even know how this question should be tagged; please help.
Sorry for my poor English.
Sure. One can technically at least know what code touches the DB column (reads or writes it), by determining program slices.
Methodology: Find all SQL code elements in your sources. Determine which ones touch the column in question. (Careful: SELECT ALL may touch your column, so you need to know the schema). Determine which variables read or write that column. Follow those variables wherever they go, and determine the code and variables they affect; follow all those variables too. (This amounts to computing a forward slice). Likewise, find the sources of the variables used to fill the column; follow them back to their code and sources, and follow those variables too. (This amounts to computing a backward slice).
All the elements of the slice are potentially affecting or affected by a change. There may be conditions in the slice-selected code that are clearly outside the conditions expected by your new use case, and you can eliminate that code from consideration. Everything else in the slices you may have to inspect/modify to make your change.
Now, your change may affect some other code (e.g., a new place to use the DB column, or combine the value from the DB column with some other value). You'll want to inspect up and downstream slices on the code you change too.
You can apply this process for any change you might make to the code base, not just DB columns.
Manually this is not easy to do in a big code base, and it certainly isn't quick. There is some automation for doing this for C and C++ code, but not much for other languages.
You can get a rough approximation by running test cases that involve your desired variable or action and inspecting the test coverage. (Your approximation gets better if you also run test cases you are sure do NOT cover your desired variable or action, and eliminate all the code they cover.)
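As a crude, automatable first pass at the "find all code elements that touch the column" step, you can at least collect the seed points for the slices; a sketch in Python (the file extensions and reporting format are assumptions, and this only finds textual mentions; the slicing from those seeds is still up to you or a slicing tool):

    import re
    import sys
    from pathlib import Path

    # Usage: python find_column_refs.py <source_root> <column_name>
    # Prints candidate seed points (file:line) for forward/backward slicing:
    # any line that mentions the column name as a whole word.
    root, column = Path(sys.argv[1]), sys.argv[2]
    pattern = re.compile(rf"\b{re.escape(column)}\b", re.IGNORECASE)
    extensions = {".java", ".cs", ".py", ".sql", ".xml"}   # adjust to your stack

    for path in root.rglob("*"):
        if path.suffix.lower() not in extensions or not path.is_file():
            continue
        for lineno, line in enumerate(path.read_text(errors="ignore").splitlines(), 1):
            if pattern.search(line):
                print(f"{path}:{lineno}: {line.strip()}")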
Ultimately this task cannot be fully automated or reduced to an algorithm; otherwise there would be a tool to preview refactored changes. The better the code was written in the first place, the easier the task.
Let me explain how to reach the answer: isolation is the key. Mapping everything to object properties can help you automate your review.
I can give you an example. If you can manage to map your specific case to the below, it will save your life.
The OR/M change pattern
Like Hibernate or Entity Framework...
A change to a database column may be simply previewed by analysing which code uses the corresponding object property. Since all DB columns are mapped to object properties, and assuming no code uses raw SQL, you are good to go with your estimations.
This is a very simple pattern for change management.
In order to reduce a file system, network, or data file issue to the above pattern, you need other software patterns implemented. I mean, if you can reduce a complex scenario to a change in your objects' properties, you can leverage your IDE to detect the changes for you, including code that needs a slight modification to compile or needs to be rewritten entirely.
If you want to manage a change in a remote service, wrap that service in an interface when you initially write your software. Then you will only have to modify its implementation.
If you want to manage a possible change in a data file format (e.g. a length-of-field change in a positional format, or column reordering), write a service that maps that file to an object (for example using the BeanIO parser).
If you want to manage a possible change in file system paths, design your application to use more runtime variables
If you want to manage a possible change in cryptography algorithms, wrap them in services (e.g. HashService, CryptoService, SignService)
If you do the above, your manual requirements review will be easier, because although the overall task is manual, it can be aided by automated tools. For instance, you can try changing the name of a class's property and see the side effects in the compiler.
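For the "wrap the remote service in an interface" advice above, a minimal Python sketch (all names here are illustrative):

    from abc import ABC, abstractmethod

    class PriceLookupService(ABC):
        """The rest of the code depends only on this interface."""

        @abstractmethod
        def current_price(self, product_id: str) -> float:
            ...

    class RemotePriceLookup(PriceLookupService):
        """The only class to touch if the remote API changes."""

        def __init__(self, base_url: str):
            self.base_url = base_url

        def current_price(self, product_id: str) -> float:
            # Call the remote endpoint here; a change stays confined to this method.
            raise NotImplementedError("left out of the sketch")

    class FixedPriceLookup(PriceLookupService):
        """Test double, useful when re-running the test suite after a change."""

        def current_price(self, product_id: str) -> float:
            return 9.99

With this in place, "find usages" on current_price (or renaming it) gives the IDE-assisted impact set the answer talks about.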
Worst case
Obviously, if you need to change the name, type, and length of a specific column in a database, in software with plain SQL hardcoded and scattered in multiple places around the code, where (worse) many tables have similar column names, and without project documentation (I did say worst case, right?), across a total of 10,000+ classes, you have no other way than manually exploring your project, using find tools but not relying on them.
And if you don't have a test plan, which is the document from which you can hope to derive a software test suite, it will be time to make one.
Just adding my 2 cents. I'm assuming you're working in a production environment, so there must be some form of unit tests, integration tests, and system tests already written.
If yes, then a good way to validate your changes is to run all these tests again and create any new tests which might be necessary.
And to state the obvious, do not integrate your code changes into the main production code base without running these tests.
Then again, changes which worked fine in a test environment may not work in a production environment.
Have some form of source code configuration management system like Subversion, GitHub, CVS, etc.
This enables you to roll back your changes.

In “Given-When-Then” style BDD tests, is it OK to have multiple “When”s conjoined with an “And”?

I read Bob Martin's brilliant article on how "Given-When-Then" can actually be compared to an FSM. It got me thinking: is it OK for a BDD test to have multiple "When"s?
For eg.
GIVEN my system is in a defined state
WHEN an event A occurs
AND an event B occurs
AND an event C occurs
THEN my system should behave in this manner
I personally think these should be 3 different tests for good separation of intent. But other than that, are there any compelling reasons for or against this approach?
When multiple steps (WHEN) are needed before you do your actual assertion (THEN), I prefer to group them in the initial-condition part (GIVEN) and keep only one in the WHEN section. This shows that the event that really triggers the "action" of my SUT is that one, and that the previous ones are just steps to get there.
Your test would become:
GIVEN my system is in a defined state
AND an event A occurs
AND an event B occurs
WHEN an event C occurs
THEN my system should behave in this manner
but this is more of a personal preference I guess.
If you truly need to test that a system behaves in a particular manner under those specific conditions, it's a perfectly acceptable way to write a test.
I found that another limiting factor can appear in an E2E testing scenario where you would like to reuse a statement multiple times. In my case, the BDD framework of my choice (pytest_bdd) is implemented in such a way that a given statement can have a single return value, which it maps into later steps' input parameters automagically by the name of the function that was mapped to the given step. This design prevents reusability, whereas in my case I wanted it: I needed to create objects and add them to a sequence object provided by another given statement. The way I worked around this limitation was with a test fixture (which I named test_context), a Python dictionary (a hashmap), and when statements, which don't have the same single-return requirement, so the '(when) add object to sequence' step looked up the sequence in the context and appended the object in question to it. Now I could reuse the 'add object to sequence' action multiple times.
This requirement was tricky because BDD aims to be descriptive. I could have used a single given statement with a pickled memory map of the sequence object that I wanted to perform the test action on, BUT would that have been useful? I think not. I needed to get the sequence constructed first, and that needed reusable statements. And although this is not in the BDD bible, I think it is, in the end, a practical and pragmatic solution to a very real E2E descriptive-testing problem.
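A compressed sketch of that workaround with pytest_bdd (a recent version with target_fixture is assumed; the feature file, step wordings, and object names are made up, and only the shared test_context dictionary is the point):

    import pytest
    from pytest_bdd import given, when, then, parsers, scenario

    @pytest.fixture
    def test_context():
        """Shared mutable state that the reusable when-steps read and write."""
        return {}

    @given("an empty sequence", target_fixture="sequence")
    def empty_sequence(test_context):
        test_context["sequence"] = []
        return test_context["sequence"]

    @when(parsers.parse('I add object "{name}" to the sequence'))
    def add_object(test_context, name):
        # Reusable: the step looks the sequence up in the context instead of
        # returning it, so it can appear as many times as needed in a scenario.
        test_context["sequence"].append(name)

    @then(parsers.parse("the sequence has {count:d} objects"))
    def sequence_has(test_context, count):
        assert len(test_context["sequence"]) == count

    @scenario("sequence.feature", "Building a sequence from reusable steps")
    def test_build_sequence():
        pass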