Rails / PostgreSQL: collision of two operations - SQL

I've got this shop-like application (Rails 3.2 + PostgreSQL), where two of my resources/tables are Users and Operations. It has the following characteristics:
Amongst other attributes, Users have a certain :credit at each moment in time.
Operations represent either:
A purchase of a product (whose price is deducted from the credit of the User who purchased it).
A purchase of credit (the amount of which is added to the User's credit).
Each Operation stores:
:precredit - The credit the User had before the Operation.
:postcredit - The final credit after the Operation.
:price - The amount of money involved, whether it's positive or negative.
There was a problem with two Operations because they happened at exactly the same second (my guess is that there was an internet problem for a while and then both queries were executed at the same second; see below).
This is the sequence of operations sorted by created_at (credit operations add to the credit, product operations subtract from it):
#1 Category:credit Precredit:2.9 Price:30.0 Postcredit:32.9 Created_at:16:34:02
#2 Category:product Precredit:32.9 Price:30.0 Postcredit:2.9 Created_at:16:42:06
#3 Category:credit Precredit:32.9 Price:5.0 Postcredit:37.9 Created_at:16:42:06
#4 Category:product Precredit:37.9 Price:4.0 Postcredit:33.9 Created_at:16:45:24
As one can see, Operation#3 should have a precredit = 2.9, which is the postcredit of Operation#2. However, the result of Operation#2 is not taken into account when Operation#3 is executed.
Ideally I would have:
#1 Category:credit Precredit:2.9 Price:30.0 Postcredit:32.9 Created_at:16:34:02
#2 Category:product Precredit:32.9 Price:30.0 Postcredit:2.9 Created_at:16:42:06
#3 Category:credit Precredit:2.9 Price:5.0 Postcredit:-2.1 Created_at:16:42:06
Note that Operation#3 would've raised an error due to enough_balance?-type validations resulting in false.
Questions
Any ideas regarding how this might have happened?
How can this type of collision be avoided?

I'm not sure how you're creating the operations, but this kind of situation can happen in concurrent environments. Consider the following example:
Process A: gets the User object to obtain the current credit (equal to precredit)
Process B: gets the User object to obtain the current credit (at this point both have the same value)
Process A: calculates the postcredit (precredit +/- value)
Process B: calculates the postcredit
Process B: saves the record
Process A: saves the record
Even if the record in process A and the record in process B are not saved in the exact same millisecond (which is even more unlikely), both records are still saved with the same precredit, and this depends on how that value was calculated. This is a classic concurrency problem in operating systems, and it's solved with a lock (see Peterson's algorithm).
Now, Rails provides a mechanism for achieving this; I recommend you take a look at http://api.rubyonrails.org/classes/ActiveRecord/Locking/Pessimistic.html. The object you'll want to lock will probably be the user.
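For example, a minimal sketch of what that could look like (assuming an Operation model with category/price/precredit/postcredit columns and a hypothetical Operation.perform! helper; the important part is that the users row stays locked for the whole read-compute-write cycle):

class Operation < ActiveRecord::Base
  belongs_to :user

  def self.perform!(user, category, price)
    user.with_lock do                       # wraps a transaction and SELECT ... FOR UPDATE on the users row
      precredit  = user.credit
      postcredit = category == 'credit' ? precredit + price : precredit - price

      operation = create!(user:       user,
                          category:   category,
                          price:      price,
                          precredit:  precredit,
                          postcredit: postcredit)

      user.update_attributes!(credit: postcredit)
      operation
    end                                     # the lock is released when the transaction commits
  end
end

With the lock in place, the second request blocks until the first transaction commits, so it reads the updated credit instead of the stale value, and your enough_balance?-type validations run against the real balance.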

Orderbook matching engine

My question is more of a conceptual one rather than a coding question, but I also accept code (that would be the ideal answer).
So I have a huge dataset of per-second orderbook snapshots (that is, for each second, I have the best 200 ask prices (and their volumes) and the best 200 bid prices (and their volumes)). This is real data: real orders that were submitted at some point in time. Each state is represented as a pandas DataFrame with the columns timestamp, side, price, volume. So an example row is:
2023-02-14 00:01:01, 'ask', 19874.11, 0.3
But we have many ask and bid orders per state. My question is the following: for a state s_i, if I decide to place a limit order with a specified price and volume, how would that change state s_(i+1)? (This is just a simulation.) The same question goes for a market order with some volume.
Purpose:
I am trying to optimize order execution, and there is already existing literature on this subject. The idea is that, when I train my agent, I want to reflect each decision it makes, so that I can update the next states based on the actions/decisions the agent has taken.
Literature:
https://www.econstor.eu/bitstream/10419/216206/1/1696077540.pdf
You can try to deploy your own exchange and test there, if you can implement the logic you need for working with orders.
There is an open-source crypto exchange project, OpenCEX; here is a link to it:
https://github.com/Polygant/OpenCEX
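If deploying a full exchange is heavier than you need, the bookkeeping for the simulation itself can be sketched roughly as below (a deliberately naive pandas model using the timestamp/side/price/volume layout from the question; it ignores queue position, hidden liquidity and the reaction of other participants, none of which snapshot data can tell you about):

import pandas as pd

def apply_limit_order(book, side, price, volume, timestamp):
    # Add our own passive limit order to the snapshot (assumes the price does not cross the spread).
    book = book.copy()
    mask = (book["side"] == side) & (book["price"] == price)
    if mask.any():
        # a level at that price already exists: just add our volume to it
        book.loc[mask, "volume"] += volume
        return book
    row = pd.DataFrame([{"timestamp": timestamp, "side": side,
                         "price": price, "volume": volume}])
    return pd.concat([book, row], ignore_index=True)

def apply_market_order(book, side, volume):
    # Consume volume from the opposite side of the book, best price first.
    book = book.copy()
    opposite = "ask" if side == "buy" else "bid"
    levels = book[book["side"] == opposite].sort_values(
        "price", ascending=(opposite == "ask"))
    remaining = volume
    for idx, level in levels.iterrows():
        take = min(remaining, level["volume"])
        book.loc[idx, "volume"] -= take
        remaining -= take
        if remaining <= 0:
            break
    return book[book["volume"] > 0].reset_index(drop=True)

The next state s_(i+1) would then be the real historical snapshot with these adjustments layered on top, which is the usual (and admittedly optimistic) market-replay assumption in the execution-optimization literature.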

Is it worth introducing "incorrect" results to avoid crashing a program?

In my organisation, I see a lot of places where code has been put inside monitor blocks (RPG's version of try..except) to prevent raising exceptions on arithmetic errors. For instance:
Monitor;
  Pxxhour = Bctime/60;
  PxxMin = %Rem(Bctime:60);
On-Error;
  Pxxhour = 0;
  PxxMin = 0;
EndMon;
Pxxhour and Pxxmin are screen fields that will be displayed to users. So if there is an error in the operations, these get a value of 0. Though this prevents the program from crashing, how does it help? Users keep seeing the wrong values on the screen. Similarly, I see code which assigns the highest possible value for a given variable rather than allowing an overflow exception. Though this will prevent the program from blowing up, how does it help in the long run? Wouldn't calculations have wrong values and result in wrong business data?
The answers given below by #jmarkmurphy and #Charles successfully address the question from an RPG and IBM midrange perspective, which is what I was after.
There are two use cases for a MONITOR block:
Expected errors
Unexpected errors
For expected errors, replacing bad or invalid data with an accepted value is a valid solution in some cases. The trick is knowing which cases. The answer to that is something your business people would need to help decide; it depends on what the program is doing and which data has the problem.
For instance, given some sort of internal sales report, you might have something like so:
dcl-c DIVIDE_BY_ZERO const(00102);
dcl-c RESULT_TO_LARGE const(00103);

monitor;
  averageSale = totalSalesAmount / numberSales;
on-error DIVIDE_BY_ZERO;
  averageSale = 0;
on-error RESULT_TO_LARGE;
  averageSale = *HIVAL;
endmon;
What's important about the above is that I'm expecting one of two possible errors and I've decided to handle them in a certain way. The business people don't care that technically averageSale is undefined when numberSales is *ZERO. They just want a zero to appear on the report. They also understand that there's only so much room on the page, and that if the number is all nines, the actual value might be bigger.
An unexpected error, such as a decimal data error, would not be caught by this MONITOR block.
For an unexpected error caught by a MONITOR block via an ON-ERROR with *ALL or no error code specified, I'd expect to see some sort of logging of the issue, followed by either skipping the problem record or cleanly shutting down, depending on what the program is doing in the first place.
It appears that your code is expecting certain errors, but without explicitly defining which error codes it's willing to handle. This is lazy and not a good practice.
As far as your question about whether or not the handling of those expected errors is valid... only you and your users can decide that.
You might want to take a look at Chapter 7 - Exception and error handling of the IBM Redbook Who Knew You Could Do That with RPG IV? Modern RPG for the Modern Programmer
What Should I Do When I Have Errors in my Calculations
Programs that blow up on users are bad, even if it is the user's fault. It makes the user believe that the program is buggy, and then anything unexpected that happens becomes the program's fault; something to be fixed. Things can get really out of hand in this manner causing help desk calls for ordinary occurrences that just appear a little odd, even when the outcome is actually correct.
One option is to validate the user input to prevent calculation errors, but what do you do when you can't really prevent all of them? In our world, one of these situations is invoicing. 5250 screens have limited real estate and you can't always make the fields big enough to hold all eventualities. So there are tradeoffs. Maybe you need to be able to sell thousands of some small item on a single invoice, but the largest total invoice you have ever had is $100K. So you size your fields like this:
dcl-s quantity Packed(5:0);
dcl-s unitPrice Packed(7:2);
dcl-s amount Packed(9:2);
All are odd because they take up the same space on disk as the next lower even precision. You don't sell fractional quantities, and the maximum value in each field is:
quantity = 99,999;
unitPrice = $99,999.99;
amount = $9,999,999.99;
Now you can see that these maximums should easily handle all valid invoices, but they also leave plenty of potential for calculation errors. If the user keys in maximum numbers for quantity and unitPrice, the resulting amount would require a Packed(12:2) field, and that would cause an overflow. When the unit price is stored in the invoice detail, we can add an edit, at the time the quantity and unit price are entered, that checks for an extended-amount overflow and sends an appropriate error message. But what if unit prices are not stored in the invoice detail, but instead in a pricing table? Then there is no good way, if a price is changed for example, to ensure that none of the existing invoices will be affected adversely.
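A rough sketch of that kind of entry-time edit, in the same MONITOR style as above (the field names and the message handling here are illustrative, not taken from any real program):

dcl-c RESULT_TO_LARGE const(00103);

monitor;
  // try the extended-amount calculation into the real Packed(9:2) field
  amount = quantity * unitPrice;
on-error RESULT_TO_LARGE;
  // reject the entry and redisplay the screen with a message,
  // instead of letting the overflow reach the invoice file
  screenErrorMsg = 'Quantity times unit price exceeds the invoice limit';
endmon;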
So what do you do about a decimal overflow, or any other calculation error, be it a data problem or something else? And what happens if the error occurs? Blowing up the program is not a good option. Another option, the one that seems to be taken in the question, is to apply some default value that the users will quickly recognize as out of the ordinary. It will appear in reports and on screens. When the users see those excessively large or small numbers, they know to go back and check the data.

Can AMPL handle this recursively or is remodeling necessary?

I'm using AMPL to model a production problem where I have two particular constraints that I am not quite sure how to handle.
subject to Constraint1 {t in T}:
prod[t] = sum{i in I} x[i,t]*u[i] + Recycle[f]*RecycledU[f];
subject to Constraint2 {t in T}:
Solditems[t]+Recycle[t]=prod[t];
EDIT: where x[i,t] is the amount of raw material from supply point i. u[i] denotes the "exchange rate" of the raw material from supply point i to create the product, i.e. a percentage of the raw material will become finished product, whereas some raw material will go to waste. The same is true for RecycledU[f], where f is in F, which denotes the refinement station where it has been refined. The difference is that RecycledU[f] has a much lower percentage that goes to waste, because Recycle is already a finished product from f (albeit a much less profitable one). I.e. Recycle has already "gone through" the process of being a raw material, x, and has become a finished product in some earlier stage, or hopefully (if it can be modelled) in the same time period as this. In the actual model, things such as "products" and "refinement stations" exist as well, but I figured for this question those could be left out to keep it simpler.
What I want to accomplish is that the amount of products produced is the sum of all items sold in time period t and the amount of products recycled in time period t (by recycled I mean that the finished product is kept at the production site for further refinement in some timestep g, g>t).
Is it possible to write two equations for prod[t] like I have done? Also, how do I handle Recycle[t]? Can AMPL "understand" that, since these are represented at the same time step, it must handle the constraints recursively, i.e. compute a solution for Recycle[t] and subsequently try to improve that solution in every time step?
EDIT: The time periods are expressed in years which is why I want to avoid having an expression with Recycle[t-1].
EDIT2: prod and x are parameters and Recycle and Solditems are variables.
Hope someone can shed some light on this!
Cenderze
The two constraints will be considered simultaneously (unless you explicitly exclude one from the problem). AMPL and optimization solvers don't have a notion of time steps; the complete problem is considered at once, so you might need to add linking constraints between time periods yourself. In particular, you might need to make sure that the inventory (such as the amount of finished product kept at the production site for further refinement) is carried over from one period to the next, something like:
Recycle[t + 1] = Recycle[t] - RecycleDecrease + RecycleIncrease;
You have to figure out the expressions for the amounts by which Recycle is increased (RecycleIncrease) and decreased (RecycleDecrease).
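Spelled out a bit more, a rough sketch of such a balance constraint, assuming the years are indexed as T = 1..NT and that RecycleIncrease/RecycleDecrease are indexed by year (you still have to define what goes into them):

param NT > 0;                       # number of years in the horizon
set T = 1..NT;

var Recycle {T} >= 0;               # recycled stock available in year t
var RecycleIncrease {T} >= 0;       # finished product set aside for recycling in year t
var RecycleDecrease {T} >= 0;       # recycled stock consumed in year t

subject to RecycleBalance {t in 1..NT-1}:
    Recycle[t+1] = Recycle[t] - RecycleDecrease[t] + RecycleIncrease[t];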
Also, if you instead want some kind of iterative procedure with one constraint considered at a time, then you should use an AMPL script.

Creating a testing strategy to check data consistency between two systems

A quick search over Stack Overflow did not turn up anything, so here is my question.
I am trying to write down the testing strategy for an application where two applications sync with each other every day to keep a huge amount of data in sync.
As it's a huge amount of data, I don't really want to cross-check everything, but just want to do a random check every time a data sync happens. What should the strategy be for such a system?
I am thinking of these two approaches (a rough sketch of both follows below):
1) Get a count of all records and cross-check that both sides have the same number.
2) Choose 5 random data entries and verify that their properties are in sync.
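In Python, the two checks could look roughly like this (the system objects and their fetch_* helpers are hypothetical placeholders for however you access each side, whether that is SQL, a REST API or export files):

import random

def counts_match(system_a, system_b, entity):
    # Check 1: compare record counts on both sides.
    return system_a.fetch_count(entity) == system_b.fetch_count(entity)

def sample_matches(system_a, system_b, entity, sample_size=5):
    # Check 2: pick random records and compare their properties field by field.
    ids = system_a.fetch_all_ids(entity)
    mismatches = []
    for record_id in random.sample(ids, min(sample_size, len(ids))):
        a = system_a.fetch_record(entity, record_id)
        b = system_b.fetch_record(entity, record_id)
        if a != b:
            mismatches.append((record_id, a, b))
    return mismatches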
Any suggestion would be great.
What you need is known as Risk Management; in software testing it is called Software Risk Management.
It seems your question is not about "how to test" what you are about to test but how to describe what you do and why you do that (based on the question I assume you need this explanation for yourself too...).
Adding SRM to your Test Strategy should describe:
The risks of not fully testing each and every piece of data in the mirrored system
A table scaling SRM against the amount of data tested (i.e. the probability of error if only n% of the data is tested versus, e.g., 2n% tested); in other words stating, e.g., a 5% risk of lost/invalid/corrupted data if x% of the data is tested with a k minute/hour execution time
Based on the previous point, a breakdown of the resources used for the different options (e.g. HW load % for n hours, y man-hours used, HW/SW/HR costs of z USD)
The probability, and cost, of errors/issues in the automation code (i.e. the data comparison goes wrong and produces a false positive or a false negative, creating overhead for the DBAs, devs and/or testers)
What happens if the SRM option taken (e.g. 10% of data tested, giving a 3% risk of data corruption/loss and a 0.75% overhead risk of false positive/negative results) results in an actual failure, i.e. a reference to Business Continuity and the effects of losing data, integrity, etc.
Everything else that comes to your mind and that you feel applies to your *current issue* in your *current system* with your *actual preferences*.

SQL connection lifetime

I am working on an API to query a database server (Oracle in my case) to retrieve massive amounts of data. (This is actually a layer on top of JDBC.)
The API I created tries to avoid, as much as possible, loading all of the queried information into memory. I mean that I prefer to iterate over the result set and process the returned rows one by one instead of loading every row into memory and processing them later.
But I am wondering if this is the best practice, since it has some issues:
The result set is kept open during the whole processing; if the processing takes as long as retrieving the data, my result set will be open twice as long.
Doing another query inside my processing loop means opening another result set while I am already using one, and it may not be a good idea to open too many result sets simultaneously.
On the other hand, it has some advantages:
I never have more than one row of data in memory per result set; since my queries tend to return around 100k rows, that may be worth it.
Since my framework is heavily based on functional programming concepts, I never rely on multiple rows being in memory at the same time.
Starting the processing on the first rows returned while the database engine is still returning other rows is a great performance boost.
In response to Gandalf, I add some more information:
I will always have to process the entire result set
I am not doing any aggregation of rows
I am integrating with a master data management application and retrieving data in order to either validate them or export them using many different formats (to the ERP, to the web platform, etc.)
There is no universal answer. I personally implemented both solutions dozens of times.
It depends on what matters more to you: memory or network traffic.
If you have a fast network connection (LAN) and a poor client machine, then fetch data row by row from the server.
If you work over the Internet, then batch fetching will help you.
You can set the prefetch count in your database layer properties and find a golden mean.
The rule of thumb is: fetch everything that you can keep without noticing it.
If you need a more detailed analysis, there are six factors involved:
Row generation response time / rate (how soon Oracle generates the first row / the last row)
Row delivery response time / rate (how soon you can get the first row / the last row)
Row processing response time / rate (how soon you can show the first row / the last row)
One of them will be the bottleneck.
As a rule, rate and response time are antagonists.
With prefetching, you can control the row delivery response time and row delivery rate: higher prefetch count will increase rate but decrease response time, lower prefetch count will do the opposite.
Choose which one is more important to you.
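For a JDBC layer over Oracle, the knob for this is the statement fetch size (the Oracle driver's default is 10 rows per round trip). A minimal sketch, assuming you already have a Connection and a hypothetical big_table to stream from:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class StreamingQuery {
    static void process(Connection conn) throws SQLException {
        try (PreparedStatement stmt = conn.prepareStatement(
                "SELECT id, payload FROM big_table")) {
            stmt.setFetchSize(500);                 // rows fetched per network round trip
            try (ResultSet rs = stmt.executeQuery()) {
                while (rs.next()) {
                    handleRow(rs.getLong("id"), rs.getString("payload"));
                }
            }
        }
    }

    static void handleRow(long id, String payload) {
        // process one row at a time; nothing else is kept in memory
    }
}

A large fetch size improves throughput (rate) at the cost of time to first row, and a small one does the opposite, which is exactly the trade-off described above.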
You can also do the following: create separate threads for fetching and processing.
Select just enough rows to keep the user amused in low prefetch mode (with high response time), then switch into high prefetch mode.
It will fetch the rows in the background and you can process them in the background too, while the user browses over the first rows.
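A rough sketch of that producer/consumer split (the Row type and the queue capacity are placeholders; the fetching thread would wrap the ResultSet loop shown earlier):

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class BackgroundFetch {
    static class Row { /* hypothetical holder for one fetched row */ }

    private static final Row END_OF_DATA = new Row();   // sentinel marking the end of the result set

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<Row> queue = new ArrayBlockingQueue<>(1000);

        Thread fetcher = new Thread(() -> {
            try {
                // in real code: iterate the ResultSet and put() each fetched row
                queue.put(new Row());
                queue.put(END_OF_DATA);                  // signal that fetching is done
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        Thread processor = new Thread(() -> {
            try {
                for (Row row = queue.take(); row != END_OF_DATA; row = queue.take()) {
                    // process / display the row while more rows arrive in the background
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        fetcher.start();
        processor.start();
        fetcher.join();
        processor.join();
    }
}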