Orderbook matching engine - pandas

My question is more conceptual than a coding question, but I would also accept code (that would be the ideal answer).
So I have a huge dataset of per-second orderbook snapshots (that is, for each second, I have the best 200 ask prices (and their volumes) and the best 200 bid prices (and their volumes)). This is real data: real orders that were submitted at some point in time. For each state, the data is represented as a pandas DataFrame with the columns timestamp, side, price, volume. So, an example row is:
2023-02-14 00:01:01, 'ask', 19874.11, 0.3
But we have many ask and bid orders per state. My question is the following: for a state s_i, if I decide to place a limit order with a specified price and volume, how would that change state s_(i+1)? (This is just a simulation.) The same question applies if I placed a market order with some volume.
Purpose:
I am trying to optimize order execution, and there is already existing literature on this subject. The idea is, when I train my agent, I want to reflect each decision it makes so I can update my next states based on what actions/decisions the agent has done.
Literature:
https://www.econstor.eu/bitstream/10419/216206/1/1696077540.pdf

You could deploy your own exchange and test against it, provided you can implement the order-handling logic you need.
OpenCEX is an open-source crypto exchange project; here is a link to it:
https://github.com/Polygant/OpenCEX
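If a full exchange is more than the simulation needs, a lighter option is to apply the order directly to a snapshot. Below is a minimal Python sketch, assuming the ask side of one state has been pulled out of the DataFrame as (price, volume) pairs. The function name `apply_market_buy` and the rule that a market buy simply walks the asks from the best price upward are illustrative assumptions, not something the dataset or the linked paper prescribes.

```python
def apply_market_buy(asks, volume):
    """Consume `volume` from the ask side of one snapshot.

    asks: list of (price, volume) pairs (hypothetical layout, e.g. built
          from the rows of the snapshot where side == 'ask').
    Returns (remaining_asks, fills, unfilled_volume).
    """
    remaining = volume
    fills = []
    new_asks = []
    # Best (lowest) ask is matched first.
    for price, available in sorted(asks):
        if remaining > 0:
            take = min(available, remaining)
            fills.append((price, take))
            remaining -= take
            available -= take
        if available > 0:
            new_asks.append((price, available))
    return new_asks, fills, remaining


asks = [(101.0, 1.0), (100.0, 0.5)]
new_asks, fills, unfilled = apply_market_buy(asks, 1.0)
print(fills)     # [(100.0, 0.5), (101.0, 0.5)]
print(new_asks)  # [(101.0, 0.5)]
```

A limit buy could be handled along the same lines by stopping at the limit price and adding any unfilled remainder to the bid side. How the consumed liquidity is carried over into the next historical snapshot is a modelling decision that this sketch does not settle.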


Total coin supply, how it works, and what the code means?

I am currently studying the Bitcoin and Litecoin source code to try to get a better understanding of cryptocurrencies, and blockchains in general, and I have spotted something in the code that I have a question about.
In src/amount.h I see the following code:
/** No amount larger than this (in satoshi) is valid.
*
* Note that this constant is *not* the total money supply, which in Bitcoin
* currently happens to be less than 21,000,000 BTC for various reasons, but
* rather a sanity check. As this sanity check is used by consensus-critical
* validation code, the exact value of the MAX_MONEY constant is consensus
* critical; in unusual circumstances like a(nother) overflow bug that allowed
* for the creation of coins out of thin air modification could lead to a fork.
* */
static const CAmount MAX_MONEY = 84000000 * COIN;
Now, the comment here seems to suggest that this code does not actually define what the total supply of the currency will be, even though the amount of Litecoin available is in fact 84,000,000...
So, my real question :
Is the real total supply held in another piece of code? If so, what am I missing, where can I find this code, and if I were to be trying to edit this (I'm not - but I want to understand what is going on here) - would I need to edit code in multiple places?
NOTE : Tagged bitcoin even though this is litecoin source in the question, because litecoin doesn't appear to have a stackoverflow tag, and the two codebases are similar anyway.
EDIT : I also wanted to add, that I performed a grep for "84000000" - and only really found that one line of code to be relevant... So I must be missing something...
EDIT 2 : According to literally every coin out there on git that I have looked at - this is the number that they change when adjusting the total supply - so is the comment just wrong - or did I misunderstand it?
I realise this is an old question, but since it hasn't been updated I'll provide an answer.
As the source suggests, MAX_MONEY is simply a sanity check. If someone tries to create a transaction spending 500 million Bitcoin, and it somehow manages to bypass all other sanity checks, the network will still reject it because the amount exceeds MAX_MONEY. So MAX_MONEY is not directly related to total supply, but as you have observed, many alts will set MAX_MONEY to the expected total supply over the lifetime of the coin.
For a pure proof-of-work coin with a consistent reward scheme (e.g. halving every X blocks), the total supply can be pre-calculated, but a future fork could change that.
For a typical proof-of-stake or hybrid proof-of-work and proof-of-stake coin, the maximum supply can be estimated by simulation, but the exact amount will vary depending on network activity.
(This assumes there is not another part of the code that cuts off all rewards after a limit is reached.)
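To make the pre-calculation concrete, here is a small Python sketch for a halving schedule. It assumes the parameters commonly cited for Bitcoin (an initial reward of 50 coins, halving every 210,000 blocks, rewards tracked in integer base units and halved by a right shift). The function name `total_supply` is made up for illustration and is not taken from the Bitcoin or Litecoin source.

```python
COIN = 100_000_000  # base units (satoshi) per coin


def total_supply(initial_reward_coins, halving_interval):
    """Sum every block reward until integer halving drives the reward to zero."""
    reward = initial_reward_coins * COIN
    total = 0
    while reward > 0:
        total += reward * halving_interval
        reward >>= 1  # integer halving: fractions of a base unit are dropped
    return total


btc = total_supply(50, 210_000)
print(btc)                      # 2099999997690000 base units
print(btc < 21_000_000 * COIN)  # True: slightly under the round number
```

Under the same assumptions, calling `total_supply(50, 840_000)` (the interval commonly cited for Litecoin) gives a figure slightly under 84,000,000 coins, which fits the comment's point that MAX_MONEY is an upper bound used as a sanity check rather than the place where the supply is defined.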

Data Science: Using Inferential Statistics to label train dataset

A lack of high schools in remote areas is a problem for students in developing countries. Students in some locations perform better than those in others, so I have to find those locations. Now, the main problem is defining "BETTER". I have made some rules that will define the profile of a location.
Right now, I am concerned with the good students.
So, what I have done is:
1. Used some inferential statistics and made some rules to conclude that locations A, B, C, etc. are the locations with the most potential for new high schools, because according to my rules these locations contain quality students.
I did all of the above to label the data: I needed to define "BETTER" and label the data so that I can now use a machine learning algorithm to learn the factors that make a location a potential location. Then, if I give the model a data point from the test data, it will instantly tell me whether the location is better or not.
Overview of the method:
For each location, I have these four pieces of information:
total_students_staying_for_high_school_education (A),
total_students_leaving_for_high_school_education_in_another_place (B),
mean_grade_point_of_students_of_type_B,
ratio (calculated as B/A).
For the locations whose ratio > 1, I applied the chi-squared significance test to get a statistic that tells me whether significantly more students are leaving that place than staying. I used ANOVA and then the Tukey test to compare the mean grade points and find the pairs of locations whose means differ, and which location in each pair has the greater mean.
I then wrote a Python program with a custom comparator that first checks whether the mean grades of the two locations differ and, if so, returns the one with the greater mean. If the means don't differ, the comparator returns the location whose chi-squared value is greater.
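The comparator logic can be sketched roughly as follows. This is a simplified illustration, not the actual program: the names `stats`, `differing_pairs` and `make_comparator` are hypothetical, the numbers are made up, and the set of differing pairs is assumed to come from the Tukey test.

```python
from functools import cmp_to_key


def make_comparator(stats, differing_pairs):
    """stats: location -> {'mean_grade': float, 'chi2': float} (hypothetical layout).
    differing_pairs: set of frozensets of locations whose means differ."""
    def compare(a, b):
        # Use the mean grade only when the test says the means differ;
        # otherwise fall back to the chi-squared statistic.
        key = "mean_grade" if frozenset((a, b)) in differing_pairs else "chi2"
        return (stats[a][key] > stats[b][key]) - (stats[a][key] < stats[b][key])
    return compare


stats = {
    "A": {"mean_grade": 3.8, "chi2": 5.0},
    "B": {"mean_grade": 3.2, "chi2": 9.0},
    "C": {"mean_grade": 3.7, "chi2": 7.0},
}
differing_pairs = {frozenset(("A", "B")), frozenset(("B", "C"))}

ranked = sorted(stats, key=cmp_to_key(make_comparator(stats, differing_pairs)),
                reverse=True)
print(ranked)  # ['C', 'A', 'B']
```

One thing I am unsure about is that a comparator which switches criteria per pair is not guaranteed to be transitive, so the resulting order may depend on the data.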
This is how the whole process comes up with a few suggested locations, and I call those locations "BETTER".
What I am concerned about is:
1. How do I verify if my rules are valid? Or do I even need to verify it?
2. Most importantly, is mingling statistics with machine learning as described above an appropriate approach? Is there any major leakage in the method? Can anyone suggest a more general method?

Can proc sql embedded in SAS macros dynamically merge two data sets, simulating residential treatment placement decisions for troubled youth?

Good afternoon and happy Friday, folks
I’m trying to automate a placement simulation of youth into residential treatment where they will have the highest likelihood of success. Success is operationalized as “not recidivating” within 3 years of entering treatment. Equations predicting recidivism have been generated for each location, and the equations have been applied to each individual in the scenario (based on youth characteristics like risk, age, etc., LOS). Each youth has predicted success rates for every location, which throws in a wrench: youth are not qualified for all of the treatment facilities for which they have predicted success rates. Indeed, treatment locations have differing, yet overlapping qualifications.
Let’s take a made-up example. Johnny (ID # 5, below) is a 15-year-old boy with drug charges. He could have “predicted success rates” of 91% for location A, 88% for location B, 50% for location C, and 75% for location D. Johnny is most likely to be successful (i.e., not recidivate within three years of entering treatment) if he is treated at location A; unfortunately, location A only accepts youth who are 17 years old or older; therefore, Johnny would not qualify for treatment here. Alternatively, for Johnny, location B is the next best location. Let us assume that Johnny is qualified for location B, but that all of location-B beds are filled; so, we must now look to location D, as it is now Johnny’s “best available” option at 75%.
The score so far: We are matching youth to available beds in locations for which they qualify and where they might enjoy the greatest likelihood of success. Unfortunately, each location only has a certain number of available beds, and the number of available beds differs across locations. The qualifications for entry into treatment facilities differ, yet overlap (e.g., 12-17 year-olds vs 14-20 year-olds).
In order to simulate what placement decisions might look like based on success rates, I went through the scenario described above for over 400 youth, by hand, in Excel. It took me about a week. I’d like to use PROC SQL embedded in a SAS MACRO to automate these placement scenarios with the ultimate goals of a) obtaining the ability to bootstrap iterations in order to examine effect sizes across distributions, b) saving time, and c) preventing further brain damage from banging my head against desk and wall in frustration whilst doing this by hand. Whilst never having had the necessity, nay, the privilege of using SQL in my typical role as a researcher, I believe that this time has now come to pass and I’m excited about it! Honestly. I believe it has the capacity I’m looking for. Unfortunately, it is beating the devil out of me!
Here’s what I’ve got cookin’ so far: I want to create and automate the placement simulation with the clever use of merging/joining/switching/or something like that.
I have two datasets (tables). The first dataset contains all of the youth information (one row per youth; several columns with demographics and location ranks, which correspond to the predicted success rates). The order of rows in the youth dataset was/will be randomly generated (to simulate the randomness with which youth enter the system and are subsequently placed into treatment). Note that I will be “cleaning” the youth dataset prior to merging such that rank-column cells will only be populated for programs for which a respective youth qualifies. This should take the “does the youth even qualify for the program” problem out of the equation.
However, it still leaves the issue of availability to be contended with in the scenario.
The second dataset contains the treatment facility beds, with each row corresponding to an available bed in one of the treatment locations; two columns contain bed numbers and location names. Each bed (row) has only one location cell populated, but a location will populate several cells.
Thus, in descending order, I want to merge each youth row with the available bed that represents his/her best chance of success, and so the merge/join/switch/thing should take place
on youth.Rank1= distinct TF.Location,
and if youth.Rank1≠ TF.location then
merge on youth.Rank2= TF.location,
if youth.Rank2≠ TF.location then merge at
youth.Rank3 = TF.location, etc.
Put plainly: “Merge on rank1 unless rank1 location is no longer available, then merge on rank2, unless rank2 location is no longer available, and on down the line, etc., etc., until all options are exhausted and foster care (i.e., alternative services) is the only option.”
I’ve had no success getting this to work. I haven’t even been successful getting the union function to work. About the only successful thing I’ve done in SQL so far is create a view of a single dataset. It’s pretty sad. I’ve been following this guidance, but I get hung up around the “where” command:
proc sql; /*Calls the SQL procedure*/
create table x as /*Tells SAS to create a table called x*/
select /*Specifies the column(s) to be selected*/
from /*Specifies the table(s) (data sets) to be queried*/
where /*Subsets the data based on a condition*/
group by /*Classifies the data into groups based on the specified column(s)*/
order by /*Sorts the resulting rows (observations) by the specified column(s)*/
; quit; /*Ends the proc sql procedure*/
Frankly, I’m stuck and I could use some advice. This greenhorn is in way over his head.
I appreciate any help or guidance anyone might lend.
Cheers!
P
The process you describe (and to be honest I skipped to the end, so I might have missed something) does not lend itself to SQL because each step could affect the results of the next one. However, you want to get the best results for the most kids. (I think a lot of that text was to convince us how important it is to help out.) You don't actually give us anything we can really use to help, since you don't give any details of your data model, your data, or expected results. There really is no way to answer this question. But I don't care: I'm going to go forward with some suggestions because it is a Friday and I've never done a stream-of-consciousness answer to a stream-of-consciousness question before. I suggest you don't formulate your solution just in SQL, but instead use a higher-level program and engage in a process like the one described below. Because this is a DB question, I've noted the places where the DB might be involved.
1. Generate a list of kids (this can be in a table called NEEDY-KID).
2. Have a list of locations to assign (this can also be a table, LOCATION).
3. Run your matching for best fit from kid to location. At this point don't worry about assigning more than one kid to a location; there can be duplicates (put this in a table called KID2LOC using a query).
4. Check KID2LOC for locations assigned twice and use some method to remove the duplicates so each location is only assigned once (remove from KID2LOC using a query).
5. Prune the LOCATION list to remove assigned locations (once again, a query).
6. If kids exist without a location, go to 3 with the new pruned location list.
7. Done.
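Outside the database, that loop might look something like this in Python. It is only a sketch under assumptions of mine: the function name `place_youth` is made up, each location is given a count of free beds, each kid's preference list already contains only locations they qualify for, and when a location is oversubscribed the kids earlier in the list win.

```python
def place_youth(ranked_prefs, beds):
    """ranked_prefs: kid id -> list of qualifying locations, best first.
    beds: location -> number of free beds.
    Returns kid id -> location, or None when no option is left."""
    free = dict(beds)
    placement = {}
    unplaced = list(ranked_prefs)  # keeps the original row order
    while unplaced:
        # Step 3: every unplaced kid proposes to their best location with a free bed.
        proposals = {}
        for kid in unplaced:
            choice = next((loc for loc in ranked_prefs[kid] if free.get(loc, 0) > 0), None)
            if choice is None:
                placement[kid] = None  # options exhausted: alternative services
            else:
                proposals.setdefault(choice, []).append(kid)
        # Steps 4 and 5: keep only as many kids as there are beds, then prune.
        rejected = set()
        for loc, kids in proposals.items():
            capacity = free[loc]
            for kid in kids[:capacity]:
                placement[kid] = loc
            rejected.update(kids[capacity:])
            free[loc] -= min(capacity, len(kids))
        # Step 6: repeat for the kids who lost out this round.
        unplaced = [kid for kid in unplaced if kid in rejected]
    return placement


prefs = {4: ["B"], 5: ["B", "D", "C"], 6: ["D"]}
print(place_youth(prefs, {"B": 1, "C": 1, "D": 1}))
# {4: 'B', 6: 'D', 5: 'C'}
```

Note that this round-based version can give a different result from placing kids strictly one at a time in arrival order; which of the two matches the real intake process is for you to decide.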

Rails postgresql. Collision of two operations

I've got this shop-like application (Rails 3.2 + PostgreSQL), where two of my resources/tables are Users and Operations. It has the following characteristics:
Amongst other attributes, Users have a certain :credit at each moment in time.
Operations represent either:
A purchase of a product (whose price is deducted from the credit of the User who purchased it).
A purchase of credit (the amount of which is added to the User's credit).
Each Operation stores:
:precredit - The credit the User had before the Operation.
:postcredit - The final credit after the Operation.
:price - The amount of money involved, whether it's positive or negative.
There was a problem with two Operations since they happened at exactly the same second (my guess is that there was an internet problem for a while and then both queries were executed in the same second, see below).
This is the sequence of operations sorted by created_at (credit operations add to the credit and product operations subtract from it):
Category:credit Precredit:2.9 Price:30.0 Postcredit:32.9 Created_at:16:34:02
Category:product Precredit:32.9 Price:30.0 Postcredit:2.9 Created_at:16:42:06
Category:credit Precredit:32.9 Price:5.0 Postcredit:37.9 Created_at:16:42:06
Category:product Precredit:37.9 Price:4.0 Postcredit:33.9 Created_at:16:45:24
As one can see, Operation#3 should have a precredit = 2.9, which is the postcredit of Operation#2. However, the result of Operation#2 is not taken into account when Operation#3 is executed.
Ideally I would have:
Category:credit Precredit:2.9 Price:30.0 Postcredit:32.9 Created_at:16:34:02
Category:product Precredit:32.9 Price:30.0 Postcredit:2.9 Created_at:16:42:06
Category:credit Precredit:2.9 Price:5.0 Postcredit:-2.1 Created_at:16:42:06
Note that Operation#3 would've raised an error due to enough_balance?-type validations resulting in false.
Questions
Any ideas regarding how this might have happened?
How can this type of collision be avoided?
I'm not sure how you're creating the operations, but this kind of situation can happen in concurrent environments. Consider the following example:
Process A: gets the User object to obtain the current credit (equal to precredit)
Process B: gets the User object to obtain the current credit (at this point both have the same value)
Process A: calculates the postcredit (precredit +/- value)
Process B: calculates the postcredit
Process B: saves the record
Process A: saves the record
Even if the record in process A and the record in process B are not saved in the exact same millisecond (which is quite unlikely anyway), both records can still be saved with the same precredit, depending on how each process calculated this value. This is a common problem in operating systems, and it is solved with a lock (see, for example, Peterson's algorithm).
Now, Rails provides a mechanism for achieving this. I recommend you take a look at http://api.rubyonrails.org/classes/ActiveRecord/Locking/Pessimistic.html; the object you'll want to lock will probably be the user.
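To show the idea independently of Rails, here is a small Python sketch of a lock around the read-calculate-save sequence. It is not the ActiveRecord API: the `Account` class is hypothetical, the lock is an in-process `threading.Lock` rather than a database row lock, and amounts are kept in integer cents (my assumption) to avoid float rounding.

```python
import threading


class Account:
    """Hypothetical stand-in for the User row and its Operations."""

    def __init__(self, credit_cents):
        self.credit = credit_cents
        self.operations = []  # (precredit, price, postcredit)
        self._lock = threading.Lock()

    def apply(self, price_cents):
        # Read, calculate and save happen while holding the lock, so a second
        # operation cannot read the credit before the first one has saved.
        with self._lock:
            precredit = self.credit
            postcredit = precredit + price_cents
            if postcredit < 0:
                raise ValueError("not enough balance")
            self.credit = postcredit
            self.operations.append((precredit, price_cents, postcredit))


account = Account(3290)
threads = [threading.Thread(target=account.apply, args=(price,))
           for price in (-3000, 500)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(account.credit)  # 790, whichever operation ran first
```

With the lock in place, the precredit of the second operation is always the postcredit of the first, which is the property missing from the sequence in the question.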

Creating a testing strategy to check data consistency between two systems

A quick search on Stack Overflow did not turn up anything, so here is my question.
I am trying to write down the testing strategy for a setup where two applications sync with each other every day to keep a huge amount of data in sync.
As it's a huge amount of data, I don't really want to cross-check everything. I just want to do a random check every time a data sync happens. What should the strategy be for such a system?
I am thinking of these two approaches:
1) Get a count of all the data on both sides and check that the counts are the same
2) Choose 5 random data entries and verify that their properties are in sync.
Any suggestions would be great.
What you need is known as Risk Management; in Software Testing it is called Software Risk Management (SRM).
It seems your question is not about "how to test" what you are about to test but how to describe what you do and why you do that (based on the question I assume you need this explanation for yourself too...).
The SRM section you add to your Test Strategy should describe:
The risks of not fully testing each and every piece of data in the mirrored system
A table scaling SRM against the amount of data tested (i.e. the probability of error if only n% of the data is tested versus, e.g., 2n%); in other words, a statement such as (again, just an example) a 5% risk of lost data, invalid data, data corruption, etc. if x% of the data was tested with an execution time of k minutes/hours
Based on the previous point, a breakdown of the resources used for the different options (e.g. HW load % for n hours, man-hours used is y, costs of HW/SW/HR use are z USD)
The probability and cost of errors/issues with the automation code (i.e. the data comparison goes wrong and results in a false positive or false negative, giving an overhead to DBA, dev and/or testing)
What happens if the SRM option taken (purely as an example: 10% of data tested, giving a 3% risk of data corruption/loss and a 0.75% overhead risk from false positive/negative results) results in actual failure, i.e. a reference to Business Continuity and the effects of losing data, integrity, etc.
Anything else that comes to mind and that you feel applies to your *current issue* in your *current system* with your *actual preferences*.
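For the table in the second point, a rough starting figure can come from a simple sampling model. The Python sketch below assumes records are sampled independently and uniformly at random and that a fraction p of all records is out of sync; the function names are made up for illustration. Under those assumptions, the chance that a sample of n records contains at least one bad record is 1 - (1 - p)^n.

```python
import math


def detection_probability(bad_fraction, sample_size):
    """Chance that a random sample contains at least one out-of-sync record."""
    return 1 - (1 - bad_fraction) ** sample_size


def required_sample_size(bad_fraction, confidence):
    """Smallest sample size that catches a bad record with the given confidence."""
    return math.ceil(math.log(1 - confidence) / math.log(1 - bad_fraction))


print(round(detection_probability(0.01, 5), 3))  # 0.049
print(required_sample_size(0.01, 0.95))          # 299
```

So, under this model, a random check of 5 entries would miss a 1% corruption rate about 95% of the time, which is the kind of number the risk table is meant to make visible.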