SQL Server payment estimate probability - best maths equations and recursive query?

I have a table which holds a list of transactions.
Task: To estimate the next transaction amount.
Problem:
The actual payment period for each row is variable; it can be weekly, monthly or anything chosen by the end user.
Can anyone suggest a good method to estimate the next payment based on the previous data?
At the moment I basically normalise the figure back to a daily amount and then multiply by the period, i.e. week/month/quarter/year. Then, given the history, I choose the result that has the highest incidence (count).
This does not generate accurate estimates due to payments within payments that I don't need to care about, i.e. a £100 real payment but +£20 for additional charges that are irrelevant.
Another way is to calculate the average, standard deviation and variance between payments and then choose the highest probability.
The problem is, I've been unable to code this in SQL.
SELECT [Identifier]
      ,[DateTranEntered]
      ,[Type]
      ,[TranDateFrom]
      ,[TranDateTo]
      ,[Amount]
      ,[ReferenceForTran]
      ,[CreatedDate]
FROM [TranTable]
Perhaps something that recurses through the table, calculates the daily amount for every transaction and then, using the variance and incidence, chooses an estimated guess from the last 'x' transactions?
The problem is that I have got stuck writing the recursive query for this.
Any thoughts about this?

SQL Server Analysis Services has a suite of data mining tools that provide algorithms such as Linear Regression, Decision Trees and Neural Networks. You can learn more about them here: http://msdn.microsoft.com/en-us/library/ms175595.aspx. It sounds like Linear Regression might be the best place to start for this problem.
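If the history is pulled out of SQL Server, the linear regression idea is easy to prototype outside the database first. The sketch below is illustrative only: the column names are taken from the query in the question, while the 30-day period and the normalisation to a daily amount are assumptions.
# Sketch: estimate the next payment by normalising each transaction to a
# daily amount and fitting a straight line over time. Column names follow
# the SELECT in the question; the 30-day period is an assumption.
import numpy as np
import pandas as pd

def estimate_next_payment(tran: pd.DataFrame, period_days: int = 30) -> float:
    tran = tran.copy()
    covered = (tran["TranDateTo"] - tran["TranDateFrom"]).dt.days.clip(lower=1)
    tran["daily_amount"] = tran["Amount"] / covered

    # Days since the first transaction form the regression's x axis.
    x = (tran["TranDateFrom"] - tran["TranDateFrom"].min()).dt.days.to_numpy()
    y = tran["daily_amount"].to_numpy()

    slope, intercept = np.polyfit(x, y, deg=1)   # trend in the daily amount
    next_x = x.max() + period_days               # one period beyond the last payment
    return (slope * next_x + intercept) * period_days
One-off extras such as the £100 + £20 case can be trimmed before the fit, for example by dropping daily amounts more than two standard deviations from the mean.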

Related

Orderbook matching engine

My question is more of a conceptual one than a coding question, but I also accept code (that would be the ideal answer).
So I have a huge dataset of per-second orderbook snapshots (that is, for each second, I have the best 200 ask prices (and their volumes) and the best 200 bid prices (and their volumes)). This is real data, real orders that were submitted at some point in time. For each state, the data is represented as a pandas dataframe which has timestamp, side, price, volume. So, an example is:
2023-02-14 00:01:01, 'ask', 19874.11, 0.3
But we have many ask and bid orders per state. My question is the following: for a state s_i, if I decide to place a limit order with a specified price and volume, how would that change state s_(i+1) (this is just a simulation)? The same question applies if I placed a market order with some volume.
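For concreteness, a market buy against a single snapshot could be simulated roughly as below (a sketch that assumes price-time priority and that the snapshot is the complete available liquidity; the columns are the ones in the example row above).
# Sketch: consume ask liquidity with a market buy of a given volume and
# return the resulting snapshot. Assumes the snapshot is the whole book.
import pandas as pd

def apply_market_buy(snapshot: pd.DataFrame, volume: float) -> pd.DataFrame:
    book = snapshot.copy()
    asks = book[book["side"] == "ask"].sort_values("price")  # best ask first
    remaining = volume
    for idx, row in asks.iterrows():
        take = min(remaining, row["volume"])
        book.loc[idx, "volume"] -= take
        remaining -= take
        if remaining <= 0:
            break
    return book[book["volume"] > 0]  # fully consumed levels vanish from s_(i+1)
A limit order that does not cross the spread simply adds volume at its price level on its own side; one that does cross is matched against the opposite side exactly as above, with any leftover volume resting in the book. The caveat is that the real s_(i+1) also contains everyone else's orders from that second, so this only approximates the marginal impact of your own action.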
Purpose:
I am trying to optimize order execution, and there is already existing literature on this subject. The idea is that, when I train my agent, I want to reflect each decision it makes so that I can update the next states based on the actions/decisions the agent has taken.
Literature:
https://www.econstor.eu/bitstream/10419/216206/1/1696077540.pdf
You can try to deploy your own exchange and test there, if you can implement the logic you need for working with orders.
There is an open-source crypto exchange project, OpenCEX; here is a link to it:
https://github.com/Polygant/OpenCEX

Can AMPL handle this recursively or is a remodeling necessary?

I'm using AMPL to model a production problem where I have two particular constraints that I am not quite sure how to handle.
subject to Constraint1 {t in T}:
prod[t] = sum{i in I} x[i,t]*u[i] + Recycle[f]*RecycledU[f];
subject to Constraint2 {t in T}:
Solditems[t]+Recycle[t]=prod[t];
EDIT: where x[i,t] is the amount of products from supply point i. u[i] denotes the "exchange rate" of the raw material from supply point i to create the product, i.e. a percentage of the raw material will become the finished product, whereas some raw material will go to waste. The same is true for RecycledU[f], where f is in F, which denotes the refinement station where it has been refined. The difference is that RecycledU[f] has a much lower percentage going to waste, because Recycle is already a finished product from f (albeit a much less profitable one). I.e. Recycle has already "gone through" the process of being a raw material, x, and has become a finished product in some earlier stage, or hopefully (if it can be modelled) in the same time period as this. In the actual model, things such as "products" and "refinement stations" exist as well, but I figured those could be left out of this question to keep it simpler.
What I want to accomplish is that the amount of products produced is the sum of all items sold in time period t and the amount of products recycled in time period t (by recycled I mean that the finished product is kept at the production site for further refinement in some timestep g, g>t).
Is it possible to write two equations for prod[t] like I have done? Also, how do I handle Recycle[t]? Can AMPL "understand" that, since these are represented in the same time step, it must handle the constraints recursively, i.e. compute a solution for Recycle[t] and subsequently try to improve that solution in every time step?
EDIT: The time periods are expressed in years, which is why I want to avoid having an expression with Recycle[t-1].
EDIT2: prod and x are parameters and Recycle and Solditems are variables.
Hope someone can shed some light on this!
Cenderze
The two constraints will be considered simultaneously (unless you explicitly exclude one from the problem). AMPL and optimization solvers don't have a notion of time steps; the complete problem is considered at once, so you might need to add linking constraints between time periods yourself. In particular, you might need to make sure that inventory (such as the amount of finished product kept at the production site for further refinement) is carried over from one period to another, something like:
Recycle[t + 1] = Recycle[t] - RecycleDecrease + RecycleIncrease;
You have to figure out the expressions for the amounts by which Recycle is increased (RecycleIncrease) and decreased (RecycleDecrease).
Also, if you instead want some kind of iterative procedure where one constraint is considered at a time, you should use an AMPL script.

Statistical calculations in an Access 2010 query

Currently we're building a database to track different factories' pollutant emissions. Now a query is needed that gives us information about relative quantities. Somehow I feel this should be straightforward, but I have had no success implementing it in SQL.
I'm starting from a working query that returns the following fields:
PRODUCTION_YEAR, COMPANY, PRODUCT_CATEGORY, POLLUTANT, TOTAL_EMISSIONS, SHARE
TOTAL_EMISSIONS contains the total emissions for each company in a particular year and product category. SHARE is a computed field and contains the contribution (as a fraction) of each company to that year's overall emissions of that particular pollutant in that particular product category.
Now the task is to count the factories contributing to each pollutant. I arrived at this:
SELECT PRODUCTION_YEAR, POLLUTANT, PRODUCT_CATEGORY, Count(COMPANY)
FROM theQuery
GROUP BY PRODUCTION_YEAR, POLLUTANT, PRODUCT_CATEGORY;
However, now our client wants something more sophisticated: count only the biggest polluters, those who contribute 95% of the emissions. In a script, I'd probably sort the pollution shares in each category in ascending order, then walk the dataset, sum up the shares and only start counting after reaching 5%. How to do it in SQL, I have no idea.
My first step (adding a SUM(SHARE) field to the new query) already resulted in errors ("expression not included in aggregate function", roughly translated, not sure what to make of it because all the expressions were indeed included). Is there even a way to do this in an SQL query, or am I wasting my time and would be better off just writing some VBA?
Thanks for any input!
Best,
Ben
Gord's method (see link in comment) works well for this task.
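For comparison, if exporting the query result is acceptable, the cumulative-share cutoff is also a short computation outside Access. Below is a sketch in Python/pandas, assuming theQuery has been exported to a dataframe with the columns listed above.
# Sketch: count, per year/pollutant/category, only the biggest polluters
# whose cumulative SHARE reaches the 95% cutoff. Assumes theQuery has been
# exported to a dataframe df with the columns listed in the question.
import pandas as pd

def count_big_polluters(df: pd.DataFrame, cutoff: float = 0.95) -> pd.DataFrame:
    keys = ["PRODUCTION_YEAR", "POLLUTANT", "PRODUCT_CATEGORY"]

    def count_group(g: pd.DataFrame) -> int:
        g = g.sort_values("SHARE", ascending=False)
        running = g["SHARE"].cumsum()
        # Keep companies until the running total first reaches the cutoff.
        return int((running.shift(fill_value=0) < cutoff).sum())

    return df.groupby(keys).apply(count_group).reset_index(name="BIG_POLLUTER_COUNT")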

Ranking algorithm in a rails app

We have a model in our Rails app whose objects are assigned a score based on positive user actions. We'll call them products for simplicity's sake. If a user likes a product, buys a product or views a product, the score is incremented with various weights (a like might be worth more than a view, two views in the span of 30 seconds might be worth more than three views spread over an hour, etc.).
We'd like to use these scores to help sort and rank products, say for a popular products list, but, for various reasons, using the straight ranking is going to unevenly favor older products, since they'll have had more time to amass a higher score.
My question is how to normalize the scores between new and old products. I thought about dividing each product's score by a unit of time, say the number of days it has been in existence, but am worried that will cut down the older products too much. Any thoughts on the best way to fairly normalize the scores between old and new products?
I'm also considering an example of a bayesian rating system I found in another question:
rating = ((avg_num_votes * avg_rating) + (product_num_votes * product_rating)) / (avg_num_votes + product_num_votes)
where the avg numbers are calculated by looking at the scores across all products that have more than one vote (or, in our case, a positive action). This might not be the best way, because we don't have negative ratings in our system and it doesn't take time into consideration at all.
Your question reminds me of the concept of exponential discounting of cash flows in finance.
The concept is the following: $100 in two years is worth less than $100 in one year, which is worth less than $100 now, and so on.
I think we can make a good comparison here: a product from yesterday is worth more than a product from the day before, but less than a product from today.
The formula is simple:
Vn = V0 * (1-t)^n
with V0 the initial value (the real number of positive votes), t a discount rate (you have to choose it, for example 10%) and n the time elapsed (for example n days). Thus a product will lose 10% of its value each day (but 10% of the previous day's value, not of the initial value).
You can also look at hyperbolic discounting, which is closer to your attempt. The formula would be something like this, I guess:
Vn = V0 * (1/(1+k*n))
Another approach, simpler but cruder, is linear discounting. You can simply give the scores an initial value, like 1000, and each day decrement all scores by 1 (or another constant).
Vn = V0 - k*n
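A minimal sketch of the exponential variant (the 10% daily rate is just the example figure above, and nothing here is Rails-specific, so the same one-liner ports directly to Ruby):
# Sketch: exponentially discount a product's raw score by its age in days,
# mirroring Vn = V0 * (1-t)^n with t = 10% per day.
def discounted_score(initial_score: float, age_days: int,
                     discount_rate: float = 0.10) -> float:
    return initial_score * (1 - discount_rate) ** age_days

# Ranking then becomes a sort by the discounted value:
# products.sort(key=lambda p: discounted_score(p.score, p.age_days), reverse=True)
Switching to the hyperbolic or linear variant only changes the return expression.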

How would I calculate EXPECTED income if I have PAST income data in MySQL?

OK, I'm just curious what the formula would be for calculating expected income over the next X weeks/months/etc., if the only data I have in my MySQL DB is all past transactions (dates of transactions, amounts, etc.).
I am thinking of taking some averages and whatnot, but I can't think of a specific formula (there must be something along those lines) to take, say, the average rise of income over time (weekly/monthly), apply it to a selected future period and display it weekly/monthly/etc.
Any suggestions?
Use AVG() on the past income and divide it into the proper weekly/monthly amounts if necessary.
see http://dev.mysql.com/doc/refman/5.1/en/group-by-functions.html#function_avg for more info on AVG()
Linear regression + simple integration is probably sufficient for your needs. I leave sorting out the exact implementation for your DB up to you, but follow that link to the "Estimation Methods" section, and probably use Ordinary Least Squares.
Alternatively, you can always slurp your data into something like R where the details are already implemented.
EDIT:
For more detail: you're trying to model INCOME = BASE + SCALING*T, where we are assuming that a linear model is "good" (it's probably not great, but it's probably good enough on a short time scale). For two-variable linear regression, you're pretty much just taking averages; follow that link to "Fitting the Regression Line" and you'll see which things you need to average (y = INCOME and x = T). There are some tricks you can play to simplify the calculation for the computer if you can enforce some other conditions (e.g. equally spaced time periods and no missing data), but you'll need to do a bit more maths yourself first if you want that (and you'll be less flexible in the face of changing DB assumptions).
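As a concrete sketch of that two-variable fit (the weekly grouping, the dataframe and the column names are assumptions, not something prescribed above):
# Sketch: fit INCOME = BASE + SCALING*T by ordinary least squares on weekly
# totals, then project a few periods ahead. The column names are assumptions.
import numpy as np
import pandas as pd

def project_income(transactions: pd.DataFrame, weeks_ahead: int = 4) -> pd.Series:
    weekly = (transactions.set_index("date")["amount"]
              .resample("W").sum())                       # past weekly income
    t = np.arange(len(weekly))                            # T = 0, 1, 2, ...
    scaling, base = np.polyfit(t, weekly.to_numpy(), deg=1)

    future_t = np.arange(len(weekly), len(weekly) + weeks_ahead)
    future_index = pd.date_range(weekly.index[-1] + pd.Timedelta(weeks=1),
                                 periods=weeks_ahead, freq="W")
    return pd.Series(base + scaling * future_t, index=future_index)
The same weekly or monthly totals can of course be produced in MySQL itself with GROUP BY WEEK() or MONTH() if you prefer to stay in the database.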