How would I calculate EXPECTED income if I have PAST income data in mySQL? - sql

OK, I'm just curious what the formula would be for calculating expected income over the next X weeks/months/etc. if the only data I have in a MySQL DB is past transactions (transaction dates, amounts, etc.).
I am thinking of taking some averages, but I can't think of a specific formula (there must be something along those lines) that takes, say, the average rise in income over time (weekly/monthly), applies it to a selected future period, and displays the result weekly/monthly/etc.
Any suggestions?

Use AVG() on the past income and divide it into the proper weekly/monthly amounts if necessary.
See http://dev.mysql.com/doc/refman/5.1/en/group-by-functions.html#function_avg for more info on AVG().

Linear regression + simple integration is probably sufficient for your needs. I leave sorting out the exact implementation for your DB up to you, but follow that link to the "Estimation Methods" section, and probably use Ordinary Least Squares.
Alternatively, you can always slurp your data into something like R where the details are already implemented.
EDIT:
For more detail: you're trying to model INCOME = BASE + SCALING*T, where we are assuming that a linear model is "good" (it's probably not great, but it's probably good enough on a short time scale). For two-variable linear regression, you're pretty much just taking averages; follow that link to "Fitting the Regression Line" and you'll see which things you need to average (y = INCOME and x = T). There are some tricks you can play to simplify the calculation for the computer if you can enforce some other conditions (e.g., equally spaced time periods + no missing data), but you'll need to do a bit more of the math yourself first if you want to do that (and you'll be less flexible in the face of changing DB assumptions).
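As a rough sketch of the averaging described above: this assumes the transactions have already been summed into one income total per period (e.g. with GROUP BY in MySQL) and that periods are numbered 0, 1, 2, ... The function names and sample figures are made up for illustration and are not from the original answer.

```python
def fit_line(ts, incomes):
    """Ordinary least squares fit of INCOME = BASE + SCALING * T."""
    n = len(ts)
    if n < 2:
        raise ValueError("need at least two periods to fit a line")
    mean_t = sum(ts) / n
    mean_y = sum(incomes) / n
    # Sums of products of deviations from the means (the "things to average").
    cov_ty = sum((t - mean_t) * (y - mean_y) for t, y in zip(ts, incomes))
    var_t = sum((t - mean_t) ** 2 for t in ts)
    if var_t == 0:
        raise ValueError("all periods are identical; cannot fit a slope")
    scaling = cov_ty / var_t
    base = mean_y - scaling * mean_t
    return base, scaling


def forecast(ts, incomes, periods_ahead):
    """Predicted income for each of the next `periods_ahead` periods."""
    base, scaling = fit_line(ts, incomes)
    last_t = max(ts)
    return [base + scaling * (last_t + k) for k in range(1, periods_ahead + 1)]


# Hypothetical weekly totals, oldest first.
weekly_income = [100.0, 110.0, 120.0, 130.0]
weeks = list(range(len(weekly_income)))

expected = forecast(weeks, weekly_income, 3)
# "Simple integration": total expected income is the sum over the future periods.
total_expected = sum(expected)
```

With equally spaced periods and no gaps, the sums above are the only quantities needed, so they could also be computed with SUM()/AVG() in the database itself.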

Related

SQL Server Payment estimate probabilty - best maths equations and recursive query?

I have a table which holds a list of transactions.
Task: To estimate the next transaction amount.
Problem:
The actual payment period for each row is variable: it can be weekly, monthly or anything chosen by the end user.
To estimate the next payment, based on previous data, can anyone suggest a good method?
At the moment I basically convert the figure to a daily amount and then multiply by the period, i.e. week/month/quarter/year. Then, given the history, I choose the result that has the highest incidence (count).
This does not generate accurate estimates because of payments within payments that I don't need to care about, e.g. a £100 real payment but +20 for additional charges that are irrelevant.
Another way is to calculate the average, standard deviation and variance between payments, then choose the highest probability.
Problem is, I've been unable to code this in SQL.
SELECT [Identifier]
,[DateTranEntered]
,[Type]
,[TranDateFrom]
,[TranDateTo]
,[Amount]
,[ReferenceForTran]
,[CreatedDate]
FROM .[TranTable]
Perhaps something that recurses through the table and calculates every transaction's daily amount, then uses the variance and incidence to choose the estimate from the last 'x' transactions?
Problem is, I have gotten stuck with the recursive query for this.
Any thoughts about this?
SQL Server Analysis Services has a suite of data mining tools that provide algorithms such as Linear Regression, Decision Trees and Neural Networks. You can learn more about them here: http://msdn.microsoft.com/en-us/library/ms175595.aspx. It sounds like Linear Regression might be the best place to start for this problem.
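Before reaching for data mining tools, the "daily amount" idea from the question can be prototyped outside SQL. The sketch below is only an illustration. It assumes each row gives an inclusive TranDateFrom/TranDateTo range, and it swaps the "highest incidence" rule for a median of the most recent daily rates, so that occasional extra charges have less pull. The function names, the `last_n` parameter and the sample rows are all hypothetical.

```python
from datetime import date
from statistics import median


def daily_rate(date_from, date_to, amount):
    """Amount per day, treating the date range as inclusive (an assumption)."""
    days = (date_to - date_from).days + 1
    if days <= 0:
        raise ValueError("date_to must not be before date_from")
    return amount / days


def estimate_next(transactions, period_days, last_n=5):
    """Estimate the next payment for a period of `period_days` days.

    `transactions` is a list of (date_from, date_to, amount) tuples.
    Only the `last_n` most recent rows (by end date) are used.
    """
    recent = sorted(transactions, key=lambda tr: tr[1])[-last_n:]
    rates = [daily_rate(*tr) for tr in recent]
    return median(rates) * period_days


# Hypothetical history: weekly payments, one inflated by extra charges,
# followed by a 30-day payment at the same underlying daily rate.
history = [
    (date(2024, 1, 1), date(2024, 1, 7), 70.0),
    (date(2024, 1, 8), date(2024, 1, 14), 84.0),
    (date(2024, 1, 15), date(2024, 1, 21), 70.0),
    (date(2024, 1, 22), date(2024, 2, 20), 300.0),
]

next_weekly = estimate_next(history, period_days=7)
```

Once the logic gives sensible numbers, the same per-row calculation can be expressed with DATEDIFF in a plain SELECT. Recursion is not strictly needed for this variant.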

Neural Network Input and Output Data formatting

Thanks for reading my thread.
I have read some of the previous posts on formatting/normalising input data for a Neural Network, but cannot find something that addresses my queries specifically. I apologise for the long post.
I am attempting to build a radial basis function network for analysing horse racing data. I realise that this has been done before, but the data that I have is "special" and I have a keen interest in racing/sportsbetting/programming so would like to give it a shot!
Whilst I think I understand the principles for the RBFN itself, I am having some trouble understanding the normalisation/formatting/scaling of the input data so that it is presented in a "sensible manner" for the network, and I am not sure how I should formulate the output target values.
For example, in my data I look at the "Class change", which compares the class of race that the horse is running in now with the race before, and can have a value between -5 and +5. I expect that I need to rescale these to between -1 and +1 (right?!), but I have noticed that many more runners have a class change of 1, 0 or -1 than any other value, so I am worried about "over-representation". It is not possible to gather more data for the higher/lower class changes because that's just 'the way the data comes'. Would it be best to use the data as-is after scaling, or should I trim extreme values, or something else?
Similarly, there are "continuous" inputs - like the "Days Since Last Run". It can have a value between 1 and about 1000, but values in the range of 10-40 vastly dominate. I was going to scale these values to be between 0 and 1, but even if I trim the most extreme values before scaling, I am still going to have a huge representation of a certain range - is this going to cause me an issue? How are problems like this usually dealt with?
Finally, I am having trouble understanding how to present the "target" values for training to the network. My existing results data has the "win/lose" (0 or 1?) and the odds at which the runner won or lost. If I just use the "win/lose", it treats all wins and losses the same when really they're not - I would be quite happy with a network that ignored all the small winners but was highly profitable from picking 10-1 shots. Similarly, a network could be forgiven for "losing" on a 20-1 shot, but losing a bet at 2/5 would be a bad loss. I considered making the results (+1 * odds) for a winner and (-1 / odds) for a loser to capture the issue above, but this will mean that my results are not a continuous function, as there will be a "discontinuity" between short-priced winners and short-priced losers.
Should I have two outputs to cover this - one for bet/no bet, and another for "stake"?
I am sorry for the flood of questions and the long post, but this would really help me set off on the right track.
Thank you for any help anyone can offer me!
Kind regards,
Paul
The documentation that came with your RBFN is a good starting point to answer some of these questions.
Trimming data, aka "clamping" or "winsorizing", is something I use for similar data. For example, "days since last run" for a horse could be anything from just one day to several years but tends to centre in the region of 20 to 30 days. Some experts use a figure of, say, 63 days to indicate a "spell", so you could have an indicator variable like "> 63 = 1 else 0", for example. One clue is to look at outliers, say the upper or lower 5% of any variable, and clamp these.
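A minimal sketch of the clamping idea, assuming a simple nearest-rank percentile (other percentile definitions would give slightly different cut-offs). The function names and the sample "days since last run" values are invented for illustration.

```python
import math


def clamp_to_percentiles(values, lower_pct=5.0, upper_pct=95.0):
    """Clamp (winsorize) values to the given lower/upper percentiles."""
    ordered = sorted(values)
    n = len(ordered)

    def percentile(p):
        # Nearest-rank definition: smallest value with at least p% of data at or below it.
        rank = max(1, math.ceil(p * n / 100))
        return ordered[rank - 1]

    low = percentile(lower_pct)
    high = percentile(upper_pct)
    return [min(max(v, low), high) for v in values]


def spell_indicator(days_since_last_run, threshold=63):
    """1 if the horse is coming back from a "spell", else 0."""
    return 1 if days_since_last_run > threshold else 0


# Hypothetical "days since last run" values for 20 runners.
days = [1, 12, 15, 18, 20, 21, 22, 23, 24, 25,
        26, 27, 28, 30, 31, 33, 35, 40, 90, 1000]

clamped = clamp_to_percentiles(days)             # the 1000 is pulled down to 90
tighter = clamp_to_percentiles(days, 10.0, 90.0)  # range becomes 12..40
spells = [spell_indicator(d) for d in days]
```

The clamped values can then be rescaled to 0..1 without a handful of extreme runners compressing everything else into a narrow band.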
If you use odds/dividends anywhere, make sure you use the probabilities, i.e. 1/(odds+1); a useful idea is to normalize these to 100%.
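For instance, assuming the odds are fractional "odds to one" (so 2/5 is 0.4 and 10-1 is 10.0), the conversion and normalization for one race might look like the sketch below. The function name and the sample prices are made up.

```python
def implied_probabilities(odds_to_one):
    """Convert fractional odds for one race into probabilities summing to 1."""
    # Raw implied probability of each runner: 1 / (odds + 1).
    raw = [1.0 / (odds + 1.0) for odds in odds_to_one]
    # The raw figures rarely sum to exactly 1 (bookmaker margin or track take),
    # so rescale them to total 100%.
    total = sum(raw)
    return [p / total for p in raw]


# Hypothetical three-runner race at evens, 3-1 and 4-1.
race_odds = [1.0, 3.0, 4.0]
probs = implied_probabilities(race_odds)
```

Here the raw figures are 0.5, 0.25 and 0.2, which sum to 0.95, so each is divided by 0.95.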
The odds or pari-mutuel prices tend to swamp other predictors, so one technique is to develop separate models: one for the market variables (the market model) and another for the non-market variables (often called the "fundamental" model).

A better way to generate pricing based on f(N)

I have in-game currency in my game. For a user to buy the next upgrade I currently use a very simple method whereby the Nth upgrade costs N*1000 coins.
I'm not a massive fan of this at the moment, as I'd like it to be a bit easier to start off with and possibly scale better so it's not quite as hard to get upgrades.
One solution would be to use Fibonacci, which gives great early results but would make later upgrades nigh on impossible.
Can anyone offer a solution, as my maths knowledge is pretty limited?
What about a sigmoid function? It starts rising slowly, then rises nearly linearly, and at the end it slows down.
If you look at the graph at wolfram alpha, you can calculate your price like this:
price = a_bit_more_than_maximum_upgrade_price * sigmoid( x )
You have to choose what fraction of the maximum price the starting upgrade will cost. If you choose a starting x = -4, you'll get a price below 5% of the maximum (sigmoid(-4) is about 0.018). The ending x could be 4, which reaches above 95% of the maximum (sigmoid(4) is about 0.982). Then you have the number of upgrades. You could calculate the input for the sigmoid like this:
x = (upgrade_index / (number_of_upgrades-1)) * 8.0 - 4.0
The upgrade index starts from zero, and you need at least 2 upgrades :)
You can trim off last few digits or round them up to get nicer looking numbers.
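Put together, the formulas above could be sketched like this. The function names, the `round_to` parameter and the sample ceiling of 100000 coins are just illustrative choices.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def upgrade_prices(number_of_upgrades, price_ceiling, round_to=10):
    """Price of each upgrade, following a sigmoid from x = -4 to x = 4.

    `price_ceiling` plays the role of a_bit_more_than_maximum_upgrade_price.
    """
    if number_of_upgrades < 2:
        raise ValueError("need at least 2 upgrades")
    prices = []
    for upgrade_index in range(number_of_upgrades):
        # Map upgrade_index 0..N-1 onto x in [-4, 4].
        x = (upgrade_index / (number_of_upgrades - 1)) * 8.0 - 4.0
        raw = price_ceiling * sigmoid(x)
        # Round to a multiple of `round_to` for nicer looking numbers.
        prices.append(int(round(raw / round_to)) * round_to)
    return prices


prices = upgrade_prices(number_of_upgrades=20, price_ceiling=100000)
```

Changing the -4..4 range changes how cheap the first upgrades are and how flat the curve gets at the end.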
This seems like a question better suited to http://programmers.stackoverflow.com
But anyway, I would say try using an exponential function, something like
f(n) = 1000 * 1.1^n
Obviously, once you have 100 or more upgrades the price gets a bit ridiculous. You could then use a condition to check whether n is larger than a certain number, and from there resume with your linear function to calculate the price of the next upgrade.
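One way to read that suggestion, as a sketch only: grow exponentially up to some cutoff, then add a fixed amount per upgrade after it. The cutoff of 50 and the step of 1000 coins are arbitrary assumptions, as are the function names.

```python
def exponential_price(n, base=1000.0, growth=1.1):
    """f(n) = base * growth^n"""
    return base * growth ** n


def upgrade_price(n, cutoff=50, linear_step=1000.0, base=1000.0, growth=1.1):
    """Exponential up to `cutoff`, then linear growth from the cutoff price."""
    if n <= cutoff:
        return exponential_price(n, base, growth)
    # Past the cutoff: continue from the last exponential price in fixed steps.
    return exponential_price(cutoff, base, growth) + (n - cutoff) * linear_step


first_ten = [round(upgrade_price(n)) for n in range(10)]
```

Continuing from the price at the cutoff (rather than jumping back to N*1000) keeps the price list increasing, so a later upgrade never costs less than an earlier one.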

Ranking algorithm in a rails app

We have a model in our Rails app whose objects are assigned a score based on positive user actions. We'll call them products for simplicity's sake. If a user likes a product or buys a product or views a product, the score is incremented at various weights (a like might be worth more than a view, two views in the span of 30 seconds might be worth more than three views spread over an hour, etc.)
We'd like to use these scores to help sort and rank products, say for a popular products list, but using the straight ranking is going to unevenly favor older products, since they'll have had more time to amass a higher score.
My question is how to normalize the scores between new and old products. I thought about dividing the product's score by a unit of time, say the number of days it's been in existence, but am worried that will cut down the older products too much. Any thoughts on the best way to fairly normalize the scores between the old and new products?
I'm also considering an example of a bayesian rating system I found in another question:
rating = ((avg_num_votes * avg_rating) + (product_num_votes * product_rating)) / (avg_num_votes + product_num_votes)
Where the avg numbers are calculated by looking at the scores across all products that have more than one vote (or in our case, a positive action). This might not be the best way, because we don't have a negative rating in our system and it doesn't take time into consideration at all.
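To see how that formula behaves, here is a small sketch with invented numbers (a site-wide average of 10 votes at a rating of 3.0). The function name is hypothetical. The formula itself is the one quoted above.

```python
def bayesian_rating(product_num_votes, product_rating, avg_num_votes, avg_rating):
    """Weighted blend of the product's own rating and the site-wide average."""
    return (
        (avg_num_votes * avg_rating) + (product_num_votes * product_rating)
    ) / (avg_num_votes + product_num_votes)


# A product with only 2 votes stays close to the site-wide average...
few_votes = bayesian_rating(2, 5.0, avg_num_votes=10, avg_rating=3.0)
# ...while one with 200 votes is dominated by its own rating.
many_votes = bayesian_rating(200, 5.0, avg_num_votes=10, avg_rating=3.0)
```

This pulls products with little data towards the average, but as noted it has no notion of when the votes happened.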
Your question reminds me of the concept of exponential discounting of cash flows in finance.
The concept is the following: $100 in two years is worth less than $100 in one year, which is worth less than $100 now, ...
I think we can make a good comparison here: a product from yesterday is worth more than a product from the day before, but less than a product from today.
The formula is simple:
Vn = V0 * (1-t)^n
with V0 the initial value (the real number of positive votes), t a discount rate (you have to fix it, e.g. 10%) and n the time passed (for example n days). Thus a product will lose 10% of its value each day (10% of the previous day's value, not of the initial value).
You can also look at hyperbolic discounting, which is closer to what you tried. The formula can be something like this, I guess:
Vn = V0 * (1/(1+k*n))
Another approach, simpler but cruder: linear discounting. You can simply give the scores an initial value, like 1000, and each day decrement all scores by 1 (or another constant).
Vn = V0 - k*n
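The three formulas side by side, as a small sketch. The function names are made up, and flooring the linear version at zero is an added assumption so that scores never go negative.

```python
def exponential_discount(v0, t, n):
    """Vn = V0 * (1 - t)^n, with t the discount rate per day."""
    return v0 * (1 - t) ** n


def hyperbolic_discount(v0, k, n):
    """Vn = V0 / (1 + k*n)"""
    return v0 / (1 + k * n)


def linear_discount(v0, k, n):
    """Vn = V0 - k*n, floored at zero (the floor is an assumption)."""
    return max(0.0, v0 - k * n)


# A score of 100 after 2 days at a 10% daily discount rate.
after_two_days = exponential_discount(100.0, 0.10, 2)
# The same score after 10 days with k = 0.1, hyperbolic version.
after_ten_days = hyperbolic_discount(100.0, 0.1, 10)
```

In a Rails app the discounted value would typically be computed at query or ranking time from the stored score and the age of each action, rather than by rewriting every score daily.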

Spike removal algorithm

I have an array of values ranging from 30 to 300. I want to somehow make a weighted average where, if I have 5 values and one is a lot bigger than the rest (a spike), it won't influence the average as much as it would if I simply took the arithmetic average, e.g. (n1+n2+n3+n4+n5)/5.
Does anyone have an idea how to make a simple algorithm that does just that, or where to look?
Sounds like you're looking to discard data that falls outside some parameter range you've specified. You could do it by computing the median/mode and ignoring values outside of this range when computing your mean. You'll have to adjust the divisor accordingly, of course, to account for the number of discarded values. What this "tolerable" range should be is ultimately up to you to decide, and will likely depend on your specific application needs.
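Alternatively, you could try something like eliminating items more than r% out of range of your total average. Something like this (in JavaScript, with r given as a fraction, e.g. 0.5 for 50%):
function Average(arr)
{
  var sum = 0;
  for (var i = 0; i < arr.length; i++) sum += arr[i];
  return sum / arr.length;
}
function RangedAverage(arr, r)
{
  var x = Average(arr);
  // keep only items within r of the average (filtering avoids splicing
  // the array while iterating over it, which would skip elements)
  var kept = arr.filter(function (v) {
    return v >= x * (1 - r) && v <= x * (1 + r);
  });
  // compute new average; fall back to the plain one if nothing is left
  return kept.length > 0 ? Average(kept) : x;
}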
You could try a median filter rather than a mean filter. It's often used in image processing to mitigate spurious pixel values (as opposed to white noise).
As you have noticed, the mean is susceptible to skewing by spikes. Perhaps the median or mode would be a better statistic, as they tend to be less skewed?
(This should be a comment, but JS seems to be broken for me at the moment.) It's not quite clear whether you are after a single number that is characteristic of your array (i.e. an average) or a new array with the spikes removed (a median filter).
In response to that, I'd suggest you first look at whether the median or mode is more appropriate as a statistic. If not, apply a median filter (very good at removing spikes) and then average.
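Both options can be sketched in a few lines. The window size of 3 and the choice to shrink the window at the ends of the array are assumptions, and the sample readings are invented.

```python
from statistics import median


def median_filter(values, window=3):
    """Replace each value with the median of its neighbourhood."""
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd number")
    half = window // 2
    filtered = []
    for i in range(len(values)):
        # At the edges the window is simply cut short (an assumption).
        start = max(0, i - half)
        end = min(len(values), i + half + 1)
        filtered.append(median(values[start:end]))
    return filtered


readings = [40, 42, 300, 41, 43]  # one spike

plain_mean = sum(readings) / len(readings)   # dragged up by the spike
single_number = median(readings)             # option 1: just use the median
despiked = median_filter(readings)           # option 2: filter...
filtered_mean = sum(despiked) / len(despiked)  # ...then average
```

For this sample the plain mean is 93.2, while both the median and the mean of the filtered array come out at 42.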
A Kalman filter is often used in similar applications. I don't know if it qualifies as "simple," but it's robust and well known.
Lots of ways of doing this: You could implement a low-pass digital filter.
Or, if you're just concerned about removing outliers from a statistical summary, you could just remove the highest and lowest N% of your data values from the dataset before averaging.
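That last idea is usually called a trimmed mean. A minimal sketch, with a hypothetical function name and a default of 20% trimmed from each end:

```python
def trimmed_mean(values, trim_fraction=0.2):
    """Mean after dropping the lowest and highest `trim_fraction` of the values."""
    if not 0 <= trim_fraction < 0.5:
        raise ValueError("trim_fraction must be in [0, 0.5)")
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    # Number of values to drop from each end.
    k = int(len(ordered) * trim_fraction)
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


readings = [40, 42, 300, 41, 43]
robust_average = trimmed_mean(readings)  # drops 40 and 300, averages the rest
```

With 5 values and 20% trimmed from each end, one value is dropped on each side, which is enough to remove a single spike.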
"Robust statistics" is the search term that will get you into the literature. An advantage of a Kalman filter is that you have a running estimate of the variability of the data, and this allows you eventually to "discard observations that are more than x% likely to be spurious given the whole set of observations so far".