Understanding Stratified sampling in numpy - numpy

I am currently working through an exercise book on machine learning to get my feet wet, so to speak, in the discipline. Right now I am working on a real estate data set: each instance is a district of California and has several attributes, including the district's median income, which has been scaled and capped at 15. The median income histogram reveals that most median income values are clustered around 2 to 5, but some values go far beyond 6. The author wants to use stratified sampling, basing the strata on the median income value. He offers the following piece of code to create an income category attribute.
housing["income_cat"] = np.ceil(housing["median_income"] / 1.5)
housing["income_cat"].where(housing["income_cat"] < 5, 5.0, inplace=True)
He explains that he divides the median_income by 1.5 to limit the number of categories and that he then keeps only those categories lower than 5 and merges all other categories into category 5.
What I don't understand is:
Why is it mathematically sound to divide the median_income of each instance to create the strata? What exactly does the result of this division mean? Are there other ways to calculate/limit the number of strata?
How does the division restrict the number of categories and why did he choose 1.5 as the divisor instead of a different value? How did he know which value to pick?
Why does he only want 5 categories and how did he know beforehand that there would be at least 5 categories?
Any help understanding these decisions would be greatly appreciated.
I'm also not sure if this is the Stack Overflow category I should post this question in, so if I made a mistake by doing so, please let me know what the appropriate forum might be.
Thank you!

You are in the best position to analyze this further, since you know your data set, but I can help you understand stratified sampling so that you get the general idea.
STRATIFIED SAMPLING: suppose you have a data set of consumers who eat different fruits. One feature is 'fruit type', and this feature has 10 different categories (apple, orange, grapes, etc.). If you just sample randomly from the data set, there is a possibility that the sample will not cover all the categories, which is very bad when you train a model on it. To avoid this scenario, we have a method called stratified sampling: the data is split into groups (strata) by category and each group is sampled separately, usually in proportion to its size, so that every category is represented in the sample and no useful data is missed.
Please let me know if you still have any questions, I would be very happy to help you.
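As a rough illustration of both the category code in the question and the sampling idea described above, here is a minimal sketch in plain Python. The helper names (income_category, stratified_sample), the proportional allocation and the "at least one row per stratum" rule are assumptions made for this example; it is not the book's code.

```python
import math
import random
from collections import defaultdict

def income_category(median_income, width=1.5, top=5):
    """Bin a scaled median income: ceil(income / width), capped at `top`."""
    return min(math.ceil(median_income / width), top)

def stratified_sample(rows, key, fraction, seed=0):
    """Sample roughly `fraction` of the rows from every stratum separately."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for row in rows:
        strata[key(row)].append(row)
    sample = []
    for members in strata.values():
        # Assumption: proportional allocation, but never skip a stratum.
        k = min(len(members), max(1, round(len(members) * fraction)))
        sample.extend(rng.sample(members, k))
    return sample
```

With a width of 1.5, incomes up to 1.5 fall in category 1, incomes up to 3.0 in category 2, and so on; everything above 6.0 ends up in category 5. A different width or cap simply produces a different number of strata.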


Select the best "team" of 9 "players" based on overall "team" performance only

I have 9 bins named A through I containing the following number of objects:
A(8), B(7), C(6), D(7), E(5), F(6), G(6), H(6), I(6)
Objects from each bin fulfill a specific role and cannot be interchanged. I am selecting one object from each bin at random forming a "team" of 9 "players":
T_ijklmnopq = {a_i, b_j, c_k, d_l, e_m, f_n, g_o, h_p, i_q}
There are 15,240,960 such combinations - a huge number. I have means of evaluating performance of each "team" via a costly objective function, F(T_ijklmnopq). Thus, I can feasibly sample a limited number of random combinations, say no more than 500 samples.
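The count above can be checked, and random "teams" drawn, with a few lines of Python. This is only a sketch: the function names are hypothetical, and drawing distinct teams uniformly at random is an assumption about how the sampling is done.

```python
import math
import random

# Bin sizes exactly as listed in the question.
BIN_SIZES = {"A": 8, "B": 7, "C": 6, "D": 7, "E": 5, "F": 6, "G": 6, "H": 6, "I": 6}

def count_teams(bin_sizes):
    """Number of distinct teams: the product of the bin sizes."""
    return math.prod(bin_sizes.values())

def random_team(bin_sizes, rng):
    """Pick one object index (1-based) from every bin."""
    return {name: rng.randint(1, size) for name, size in bin_sizes.items()}

def sample_teams(bin_sizes, n, seed=0):
    """Draw n distinct random teams, each returned as a tuple of indices."""
    rng = random.Random(seed)
    seen = set()
    while len(seen) < n:
        seen.add(tuple(random_team(bin_sizes, rng).values()))
    return sorted(seen)
```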
Having results of such sampling, I want to predict the most likely best combination of "players". How to do it?
Keep in mind this is different from classical team selection because there is no meaningful evaluation of F() based on individual performance. For example, "player" a_6 may be good individually, but he may not "like" e_2 and therefore the performance of a "team" containing the two suffers. Conversely, three mediocre players b_1, f_5, i_2 may be a part of an awesome "team". What's known is the whole "team" performance, that's all.
One more detail: contributions of the individual roles A through I are not weighted equally. Position of, say, E may be more important than, say, H. Unfortunately, these weights are not known upfront.
The described problem must be known to combinatorial analysts, but I haven't found anything exactly like it. Linear programming solutions with known individual "player" scores do not apply here. I will be most grateful for a specific name under which this problem is known to experts.
So far I have collected 400 samples. Here is a graph of the sorted F(T) values vs. an (arbitrary) sample number to illustrate that F(T) is "reasonable".
F(T) graph of sorted samples

Neural Network Input and Output Data formatting

Thanks for reading my thread.
I have read some of the previous posts on formatting/normalising input data for a Neural Network, but cannot find something that addresses my queries specifically. I apologise for the long post.
I am attempting to build a radial basis function network for analysing horse racing data. I realise that this has been done before, but the data that I have is "special" and I have a keen interest in racing/sportsbetting/programming so would like to give it a shot!
Whilst I think I understand the principles for the RBFN itself, I am having some trouble understanding the normalisation/formatting/scaling of the input data so that it is presented in a "sensible manner" for the network, and I am not sure how I should formulate the output target values.
For example, in my data I look at the "Class change", which compares the class of race that the horse is running in now with the race before, and can have a value between -5 and +5. I expect that I need to rescale these to between -1 and +1 (right?!), but I have noticed that many more runners have a class change of 1, 0 or -1 than any other value, so I am worried about "over-representation". It is not possible to gather more data for the higher/lower class changes because that's just 'the way the data comes'. Would it be best to use the data as-is after scaling, or should I trim extreme values, or something else?
Similarly, there are "continuous" inputs - like the "Days Since Last Run". It can have a value between 1 and about 1000, but values in the range of 10-40 vastly dominate. I was going to scale these values to be between 0 and 1, but even if I trim the most extreme values before scaling, I am still going to have a huge representation of a certain range - is this going to cause me an issue? How are problems like this usually dealt with?
Finally, I am having trouble understanding how to present the "target" values for training to the network. My existing results data has the "win/lose" (0 or 1?) and the odds at which the runner won or lost. If I just use the "win/lose", it treats all wins and losses the same when really they're not - I would be quite happy with a network that ignored all the small winners but was highly profitable from picking 10-1 shots. Similarly, a network could be forgiven for "losing" on a 20-1 shot, but losing a bet at 2/5 would be a bad loss. I considered making the results (+1 * odds) for a winner and (-1 / odds) for a loser to capture the issue above, but this will mean that my results are not a continuous function, as there will be a "discontinuity" between short-priced winners and short-priced losers.
Should I have two outputs to cover this - one for bet/no bet, and another for "stake"?
I am sorry for the flood of questions and the long post, but this would really help me set off on the right track.
Thank you for any help anyone can offer me!
Kind regards,
The documentation that came with your RBFN is a good starting point to answer some of these questions.
Trimming data, also known as "clamping" or "winsorizing", is something I use for similar data. For example, "days since last run" for a horse could be anything from just one day to several years, but tends to centre in the region of 20 to 30 days. Some experts use a figure of, say, 63 days to indicate a "spell", so you could have an indicator variable like "> 63 = 1, else 0". One approach is to look at the outliers, say the upper or lower 5% of any variable, and clamp these.
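A minimal sketch of the clamping idea and the "spell" indicator, in plain Python. The function names are made up for this example, and the simple rank-based rule used to find the 5% and 95% cut-offs is an assumption; a statistics library may define percentiles slightly differently.

```python
def clamp_to_percentiles(values, lower_pct=5, upper_pct=95):
    """Clamp every value into the range between two rank-based cut-offs."""
    ordered = sorted(values)
    n = len(ordered)
    # Assumption: cut-offs are picked by rank, rounding the rank down.
    low = ordered[int((n - 1) * lower_pct / 100)]
    high = ordered[int((n - 1) * upper_pct / 100)]
    return [min(max(v, low), high) for v in values]

def spell_indicator(days_since_last_run, threshold=63):
    """1 if the horse is coming back from a 'spell', else 0."""
    return 1 if days_since_last_run > threshold else 0
```

The clamped values can then be rescaled to 0..1 as planned in the question, with the indicator fed to the network as an extra input.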
If you use odds/dividends anywhere, make sure you use the probabilities, i.e. 1/(odds+1); a useful idea is to normalize these so that they sum to 100%.
The odds or pari-mutuel prices tend to swamp other predictors, so one technique is to develop separate models: one for the market variables (the market model) and another for the non-market variables (often called the "fundamental" model).
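The odds-to-probability conversion can be sketched as follows. It assumes fractional odds expressed as a single number (10-1 becomes 10.0, 2/5 becomes 0.4) and normalizes over the runners of one race; the function name is hypothetical.

```python
def implied_probabilities(fractional_odds):
    """Convert the odds of all runners in one race to probabilities summing to 1."""
    raw = [1.0 / (odds + 1.0) for odds in fractional_odds]
    total = sum(raw)  # usually above 1 because of the bookmaker's margin
    return [p / total for p in raw]
```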

Program to optimize cost

This is a problem set for one of my CS classes and I am kind of stuck. Here is a summary of the problem.
Create a program that will:
1) take a list of grocery stores and their available items and prices
2) take a list of required items that you need to buy
3) output a supermarket where you can get all your items with the cheapest price
input: supermarkets.list, [tomato, orange, turnip]
output: supermarket_1
The list looks something like
$2.00 tomato
$3.00 orange
$4.00 tomato, orange, turnip
$3.00 tomato
$2.00 orange
$3.00 turnip
$15.00 tomato, orange, turnip
If we want to buy tomato and orange, then the optimal solution would be buying
from supermarket_1 for $4.00. Note that it is possible for an item to be bought
twice. So if we wanted to buy 2 tomatoes, buying from supermarket_1 would be the
optimal solution.
So far, I have been able to put the dataset into a data structure that I hope will allow me to easily do operations on it. I basically have a dictionary of supermarkets, where each value points to another dictionary containing the mapping from each entry to its price.
supermarket_1 --> [turnip --> $2.00]
[orange --> $1.50]
One way is to use brute force, to get all combinations and find whichever satisfies the solution and find the one with the minimum. So far, this is what I can come up with. There is no assumption that the price of a combination of two items would be less than buying each separately.
Any suggestions or hints are welcome.
Finding the optimal solution for a specific supermarket is a generalization of the set cover problem, which is NP-complete. The reduction goes as follows:
Given an instance of the set cover problem, just define a cost function assigning 1 to each combination, apply an algorithm that solves your problem, and you obtain an optimal solution of the set cover instance. (Finding the minimal price hence corresponds to finding the minimum number of covering sets.) Thus, your problem is NP-hard, and you cannot expect to find a solution that runs in polynomial time.
You really should implement the brute-force method you mentioned; I too recommend doing this as a first step. If the performance is not sufficient, you can try using a MIP formulation and a solver like CPLEX, or you will have to develop a heuristic approach.
For a single supermarket, it is rather trivial to formulate a mixed integer program (MIP). Let x_i be the integer number of times product combination i is contained in a solution, c_i its cost, and w_ij the number of times product j is contained in product combination i. Then, you are minimizing
sum_i x_i * c_i
subject to conditions like
sum_i x_i * w_ij >= r_j,
where r_j is the number of times product j is required.
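For inputs as small as the example in the question, the single-supermarket problem can also be solved exactly without a MIP solver. The sketch below is an exhaustive search with memoization over the still-missing quantities, not the MIP itself; min_cost and the offer representation (price, tuple of items) are assumptions made for illustration. Its running time grows quickly with the number of offers and required items, in line with the NP-hardness argument above.

```python
from functools import lru_cache

def min_cost(offers, required):
    """Cheapest way to buy at least `required` items using the given offers.

    offers:   list of (price, items) pairs, e.g. (4.0, ("tomato", "orange", "turnip"))
    required: dict mapping item name to the quantity needed
    Returns float("inf") if the requirement cannot be met.
    """
    items = sorted(required)
    start = tuple(required[item] for item in items)

    @lru_cache(maxsize=None)
    def solve(remaining):
        if not any(remaining):
            return 0.0
        best = float("inf")
        for price, contents in offers:
            after = tuple(
                max(0, need - contents.count(item))
                for need, item in zip(remaining, items)
            )
            if after != remaining:  # only use offers that make progress
                best = min(best, price + solve(after))
        return best

    return solve(start)

# Offers taken from the first price list in the question.
supermarket_1 = [
    (2.0, ("tomato",)),
    (3.0, ("orange",)),
    (4.0, ("tomato", "orange", "turnip")),
]
```

Running min_cost once per supermarket and taking the minimum then gives the store to report.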
Well, you have one method, so implement it now so you have something that works to submit. A brute-force solution should not take long to code up, then you can get some performance data and you can think about the problem more deeply. Guesstimate the number of supermarkets in a reasonable shopping range in a large city. Create that many supermarket records and link them to product tables with random-ish prices, (this is more work than the solution).
Run your brute-force solution. Does it work? If it outputs a solution, 'manually' add up the prices and list them against three other 'supermarket' records taken at random, (just pick a number), showing that the total is less or equal. Modify the price of an item on your list so that the solution is no longer cheap and re-run, so that you get a different solution.
Is it so fast that no further work is justified? If so, say so in the conclusions section of your report and post the lot to your prof/TA. You understood the task, thought about it, came up with a solution, implemented it, tested it with a representative dataset and so shown that functionality is demonstrable and performance is adequate - your assignment is over, go to the bar, think about the next one over a beer.
I am not sure what you mean by "brute force" solution.
Why don't you just calculate the cost of your list of items in each of the supermarkets, and then select the minimum? Complexity would be in O(#items x #supermarkets) which is good.
Regarding your data structure you can also simply have a matrix (2 dimension array) price[super][item], and use ids for your supermarkets/items.
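A sketch of this simpler approach, assuming every item has one individual price per store and that combination offers are ignored (so it does not cover the bundled prices in the question). The store name "store_b" and its prices are invented for the example; the supermarket_1 prices are the ones shown in the question's data structure.

```python
def cheapest_store(prices, shopping_list):
    """Return (store, total) for the cheapest store that stocks every item, or None."""
    totals = {}
    for store, price_of in prices.items():
        if all(item in price_of for item in shopping_list):
            totals[store] = sum(price_of[item] for item in shopping_list)
    if not totals:
        return None
    best = min(totals, key=totals.get)
    return best, totals[best]

prices = {
    "supermarket_1": {"turnip": 2.00, "orange": 1.50},
    "store_b": {"turnip": 1.00, "orange": 3.00, "tomato": 2.00},  # hypothetical store
}
```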

How would I calculate EXPECTED income if I have PAST income data in mySQL?

OK, I'm just curious what the formula would be for calculating an expected income over the next X weeks/months/etc., if the only data I have in a MySQL DB is all past transactions (dates of transactions, amounts, etc.).
I am thinking of taking some averages and whatnot, but I can't think of a specific formula (there must be something along those lines) that takes, say, the average rise of income over time (weekly/monthly), applies it to a selected future period, and displays it weekly/monthly/etc.
Any suggestions?
Use AVG() on the past income and divide it into proper weekly/monthly amounts if necessary.
see http://dev.mysql.com/doc/refman/5.1/en/group-by-functions.html#function_avg for more info on AVG()
Linear regression + simple integration is probably sufficient for your needs. I leave sorting out the exact implementation for your DB up to you, but follow that link to the "Estimation Methods" section, and probably use Ordinary Least Squares.
Alternatively, you can always slurp your data into something like R where the details are already implemented.
For more detail: you're trying to model INCOME = BASE + SCALING*T where we are assuming that a linear model is "good" (it's probably not great, but it's probably good enough on a short time scale). For two value linear regression, you're pretty much just taking averages; follow that link to "Fitting the Regression Line" and you'll see which things you need to average (y = INCOME and x = T). There are some tricks you can play to simplify the calculation for the computer if you can enforce some other conditions (e.g., having equally spaced time periods + no missing data), but you'll need to math a bit more yourself first if you want to do that (and you'll be less flexible in the face of changing db assumptions).
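The model INCOME = BASE + SCALING*T can be fitted with ordinary least squares using nothing but averages, as in this sketch. It assumes the transactions have already been aggregated into one income total per period, with T numbered 0, 1, 2, ...; the function names are hypothetical, and summing the predicted values per period stands in for the "simple integration".

```python
def fit_line(ts, incomes):
    """Ordinary least squares fit of income = base + scaling * t."""
    n = len(ts)
    mean_t = sum(ts) / n
    mean_y = sum(incomes) / n
    sxx = sum((t - mean_t) ** 2 for t in ts)
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in zip(ts, incomes))
    scaling = sxy / sxx
    base = mean_y - scaling * mean_t
    return base, scaling

def expected_income(base, scaling, first_period, n_periods):
    """Sum of the predicted income over the next n_periods periods."""
    return sum(
        base + scaling * t
        for t in range(first_period, first_period + n_periods)
    )
```

For example, with four past periods the next three would be periods 4, 5 and 6, so the forecast is expected_income(base, scaling, 4, 3).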

Design Problem

Recently I was faced with this interview question (a K-Means Clustering solution). The design I came up with did not meet the expectations of the interviewer (to put it simply, I didn't get the job because I lost to another candidate on this design problem). I am wondering how many different / efficient / simple solutions the SO community can come up with (by doing this I am hoping to hone my skills):
To implement a simple algorithm to cluster people according to their weight and height. The
data set includes a list of people with their weights and heights like so:
Person     Weight (kg)   Height (inches)
Person 1   70            70
Person 2   75            80
Person 3   120           85
You can plot the data as a 2 dimensional data. Weight being one dimension and height being
the other dimension. Weight can range from a minimum of 50kg to 150kg. Height can range
from a minimum of 38inches to 90inches
The algorithm (called K-means clustering), which will cluster the data into K groups, goes as follows:
Start with K clusters. Each cluster is defined by its center point, which will start off as
a random weight and a random height. Pick random numbers from within the
corresponding ranges defined above.
For each person
Calculate distance to center of each cluster using formula
distance = sqrt(pow((wperson - wcenter), 2) + pow((hperson - hcenter), 2))
where wperson = weight of person,
hperson = height of person
wcenter = weight of cluster center point,
hcenter = height of cluster center point
Assign the person to the cluster with the shortest distance to center point of cluster
After end of step 2, you will end up with K clusters each assigned with a set of people
For each cluster, set the weight and height of the center point to the average of the
people in the cluster
wcenter = (sum of weight of each person in cluster)/(number of people in cluster)
hcenter = (sum of height of each person in cluster)/(number of people in cluster)
Repeat steps 2 to 5 for 1000 iterations, then print out the following information for each cluster:
weight and height of center of cluster.
list of people in cluster.
I am not looking for an implementation/solution but for a high-level design. Can you list the interfaces, classes, etc.?
I don't want to give my solution now, but will post it later in the day.
This is my attempt at the design. I only show the static diagram since the algorithm is pretty much laid out already. I would plan to have a visitor for the representation of the clusters, which could allow different types of output (XML, strings, CSV, etc.). Maybe the visitor is overkill; if it was, then I'd just have something like a ToString method that could be overridden.
Note: the Cluster creates a CenterClusterItem on the SetCenter and FindNewCenter methods. The CenterClusterItem is not a PersonClusterItem, it just holds the same amount of AClusterValues as a PersonClusterItem would (since the average isn't really a person).
Also, I forgot to make a method on the KCluster to begin the process, but that's implied.
Class Diagram http://img11.imageshack.us/img11/499/kcluster.png
Well, I would first tackle all the constants/magic numbers that reduce the reusability of the algorithm:
instead of a fixed number of iterations, use a stopping criterion (e.g., if clusters don't change too much, terminate)
don't restrict yourself to 2-dim data, use vectors
let the user define the number of clusters to be found
Then, you could hide some specifics behind interfaces, e.g. the distance might be calculated differently (for example, it might at some point have to cope with values other than double).
On the other hand, if you really have this simple problem, some of these generalizations might well be overkill - but that's what I would discuss with someone telling me to implement this algorithm.
You can create the following classes:
Person to store data about persons and centers. Properties: id, weight and height. Method: calculateDistance
Cluster to store one center and a list of persons: Properties: center and list of Person. Method: calculateCenter.
KCluster to hold your algorithm and store a list of clusters: Property: list of Cluster. Methods: generateClusters.
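A rough Python skeleton of these three classes might look like the following. The class names and responsibilities come from the list above; the method names are adapted to Python conventions, and the decision to represent a center as a Person with the id "center" is an assumption of this sketch. The ranges and the fixed iteration count follow the problem statement.

```python
import math
import random
from dataclasses import dataclass, field

@dataclass
class Person:
    id: str
    weight: float
    height: float

    def distance_to(self, other):
        """Euclidean distance in the (weight, height) plane."""
        return math.hypot(self.weight - other.weight, self.height - other.height)

@dataclass
class Cluster:
    center: Person
    members: list = field(default_factory=list)

    def calculate_center(self):
        """Move the center to the average of the members (if there are any)."""
        if self.members:
            n = len(self.members)
            self.center = Person(
                "center",
                sum(p.weight for p in self.members) / n,
                sum(p.height for p in self.members) / n,
            )

class KCluster:
    def __init__(self, k, seed=None):
        rng = random.Random(seed)
        # Random starting centers within the ranges given in the problem.
        self.clusters = [
            Cluster(Person("center", rng.uniform(50, 150), rng.uniform(38, 90)))
            for _ in range(k)
        ]

    def generate_clusters(self, people, iterations=1000):
        for _ in range(iterations):
            for cluster in self.clusters:
                cluster.members = []
            for person in people:
                nearest = min(
                    self.clusters, key=lambda c: person.distance_to(c.center)
                )
                nearest.members.append(person)
            for cluster in self.clusters:
                cluster.calculate_center()
        return self.clusters
```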
I'm not sure what your question actually is, the steps you point out effectively define the algorithm you're talking about.
A better idea may be to include exactly what you did then people can give you some hints / tips as to where you might have gone wrong or what they would have done differently.
That sounds like a really good way to do it. K-means will usually converge quickly (though not necessarily to the global optimum), so my one suggestion would be to run the algorithm until no more changes occur, rather than a fixed number of 1000 iterations. You could then repeat the entire process a few times with different random starting points.
One weakness of k-means is that it does require you to specify (i.e. guess) an appropriate value for k up-front. I think you would get points for asking the interviewer what an appropriate value for k would be, or, if there is no way to know, describing some goodness-of-fit measure and then calculating that measure for different values of k to find a "just low enough" value.
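The suggestions above (stop when the assignments no longer change, restart from several random starting points, and compare a goodness-of-fit measure across values of k) can be sketched as follows. Two things here are assumptions rather than part of the original problem: the centers are initialized from randomly chosen data points instead of random values within the ranges, and the goodness-of-fit measure is the within-cluster sum of squared distances. The function names are hypothetical.

```python
import math
import random

def kmeans(points, k, rng, max_iter=1000):
    """One k-means run; returns (sum of squared distances, centers, assignment)."""
    centers = rng.sample(points, k)  # assumption: start from k data points
    assignment = None
    for _ in range(max_iter):
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points
        ]
        if new_assignment == assignment:
            break  # converged: nobody changed cluster
        assignment = new_assignment
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    fit = sum(math.dist(p, centers[a]) ** 2 for p, a in zip(points, assignment))
    return fit, centers, assignment

def best_of_restarts(points, k, restarts=10, seed=0):
    """Repeat k-means from different starting points and keep the best fit."""
    rng = random.Random(seed)
    return min((kmeans(points, k, rng) for _ in range(restarts)), key=lambda r: r[0])
```

Calling best_of_restarts for k = 1, 2, 3, ... and looking at where the first element of the result stops improving much is one way to pick a "just low enough" k.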