Minimum sampling for maximising prediction accuracy - Bayesian

Suppose that I'm training a machine learning model to predict a person's age from a picture of their face. Let's say I have a dataset of people aged from 1 to 100. But I want to choose just 9 ages out of this 100-age dataset, and the model should still be able to predict the age of a given person. My question is: how should I choose the optimal 9 ages out of the 100-age dataset?

Related

How to label a whole dataset?

I have a question. I have a pandas dataframe that contains 5000 columns and 12 rows. Each row represents the signal received from one electrocardiogram lead. I want to assign 3 labels to this dataset; these 3 labels belong to the entire dataset and are not related to any specific row. How can I do this?
I have attached a picture of my pandas dataframe.
My labels are:
Atrial Fibrillation: 0
Right bundle branch block: 1
T Wave Change: 2
I tried to assign 3 labels to the dataset as a whole (not to a specific row or column), but I didn't find a solution.
As you can see, the dataframe has 12 rows and 5000 columns. Each row holds 5000 samples from one specific lead, and overall we have 12 leads corresponding to the 12 rows (I, II, III, aVR, ..., V6). Professional experts have recognised 3 labels for this dataframe, which helps us train an ML model to detect different heart diseases. I have 10000 dataframes just like this, and each one has 3 or 4 specific labels. Here is my question: how can I assign these 3 labels to the dataset I mentioned? As I said before, the labels don't refer to specific rows; each dataframe has 3 or 4 labels for its whole. In other words, how can I assign 3 labels to a whole dataframe?
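One common pattern (an assumption on my part, not something from the question; the names LABELS and multi_hot are mine) is to keep the labels out of the signal dataframe entirely and store a multi-hot label vector per recording in a parallel structure:

import numpy as np
import pandas as pd

LABELS = {"Atrial Fibrillation": 0, "Right bundle branch block": 1, "T Wave Change": 2}

def multi_hot(label_ids, n_classes=len(LABELS)):
    # Encode a set of dataset-level labels as a fixed-length 0/1 vector.
    v = np.zeros(n_classes, dtype=np.float32)
    v[list(label_ids)] = 1.0
    return v

# e.g. one 12-lead recording labelled with all three conditions:
record = pd.DataFrame(np.random.randn(12, 5000))  # placeholder for real signal data
X = [record.to_numpy()]                           # list of (12, 5000) arrays
y = [multi_hot([0, 1, 2])]                        # one label vector per recording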

Where did I go wrong in numpy normalization of input data in linear regression?

When following Andrew Ng's Machine Learning course assignment (Exercise 1) in Python, I had to predict the price of a house given its size in square feet and number of bedrooms, using multivariable linear regression.
In one of the steps we had to predict the cost of the house for a new example X = [1, 1650, 3], where 1 is the bias term, 1650 is the size of the house and 3 is the number of bedrooms. I used the code below to normalize and predict the output:
X_vect = np.array([1,1650,3])
X_vect[1:3] = (X_vect[1:3] - mu)/sigma
pred_price = np.dot(X_vect,theta)
print("the predicted price for 1650 sq-ft,3 bedroom house is ${:.0f}".format(pred_price))
Here mu is the mean of the training set, calculated previously as [2000.68085106 3.17021277]; sigma is the standard deviation of the training data, calculated previously as [7.86202619e+02 7.52842809e-01]; and theta is [340412.65957447 109447.79558639 -6578.3539709]. The value of X_vect after the calculation was [1 0 0]. Hence the prediction code:
pred_price = np.dot(X_vect,theta_vals[0])
gave the result: the predicted price for a 1650 sq-ft, 3-bedroom house is $340413.
But this was wrong according to the answer key, so I did it manually, as below:
print((np.array([1650,3]).reshape(1,2) - np.array([2000.68085106,3.17021277]).reshape(1,2))/sigma)
This is the normalized form of X_vect, and the output was [[-0.44604386 -0.22609337]].
The next line of code to calculate the hypothesis was:
print(340412.65957447 + 109447.79558639*-0.44604386 + 6578.3539709*-0.22609337)
Or in cleaner code:
X1_X2 = (np.array([1650,3]).reshape(1,2) - np.array([2000.68085106,3.17021277]).reshape(1,2))/sigma
xo = 1
x1 = X1_X2[:,0:1]
x2 = X1_X2[:,1:2]
hThetaOfX = (340412.65957447*xo + 109447.79558639*x1 + 6578.3539709*x2)
print("The price of a 1650 sq-feet house with 3 bedrooms is ${:.02f}".format(hThetaOfX[0][0]))
This gave the predicted price as $290106.82, which matched the answer key.
My question is where did I go wrong in my first approach?
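For reference, here is a minimal sketch reproducing the [1 0 0] value quoted above (using the mu and sigma from this question): np.array([1, 1650, 3]) creates an integer array, so the float results of the normalization are silently truncated to 0 on assignment, which appears to be the culprit.

import numpy as np

mu = np.array([2000.68085106, 3.17021277])
sigma = np.array([7.86202619e+02, 7.52842809e-01])

X_int = np.array([1, 1650, 3])           # dtype is int64
X_int[1:3] = (X_int[1:3] - mu) / sigma   # float results truncated on assignment
print(X_int)                             # [1 0 0]

X_float = np.array([1, 1650, 3], dtype=float)
X_float[1:3] = (X_float[1:3] - mu) / sigma
print(X_float)                           # [ 1.         -0.44604386 -0.22609337]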

pandas dataframe subsetting on multiple loop conditions

I have a pandas df holding users, their answers to a survey, and a score, e.g.
Userid  incomebracket  insurance-knowledge  ...  score
123     3              3                         56
346     4              6                         65
Assume incomebracket has 6 levels (1: 1000-5000, ..., 6: 100000+); similarly, insurance-knowledge has 6 levels (1: very little, ..., 6: expert).
Now I have another df which has user profile features like
userid, age, gender, education, ... (10 such features)
Now I iterate through the set of users in the first df, and for each of them I want to get the entire subset of other users who have the same user profile but a higher answer in a given column of the first df, say income. I am doing this as follows for 3 profile features (age, gender and education):
df_sameusergroup=df[(df['PPGENDER']==sameuser_gender.values[0])
& (df['EDUC']==sameuser_educ.values[0])
& (df['age']==sameuser_agecat.values[0])
& (df['incomebracket']>user_feature.values[0])]
Although this works, the profile features here are hardcoded, which is a problem for longer conditions. What I want is:
get the subset of users who have the same profile on all 10 features but a higher answer; if you don't get any such record (which is possible), reduce to 9 features, then 8, 7, ... down to 2 (the most important features, say age and gender). My pseudocode for this looks like this:
for n_features in range(10, 1, -1):    # iterate over all 10 profile features, then 9, then 8, ...
    similar_user_df = df[subset where all n_features features match and income is higher]
    if len(similar_user_df) == 0:      # no such users with all n_features the same
        continue                       # relax: reduce the number of features to match on
    else:
        return similar_user_df
I am stuck trying to do this and have been searching everywhere for a solution. Any help would be greatly appreciated. Thanks.
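A hedged sketch of that loop, assuming the two dataframes have been merged on userid beforehand and that profile_cols is ordered from most to least important (both assumptions on my part):

import pandas as pd

def similar_users(merged, target_userid, profile_cols, answer_col="incomebracket"):
    # Find users matching the target on as many profile features as possible
    # (all 10 first, then the top 9, ... down to the top 2) with a higher answer.
    target = merged.loc[merged["userid"] == target_userid].iloc[0]
    for k in range(len(profile_cols), 1, -1):   # 10, 9, ..., 2
        mask = merged[answer_col] > target[answer_col]
        for col in profile_cols[:k]:            # keep only the k most important features
            mask &= merged[col] == target[col]
        subset = merged[mask]
        if len(subset) > 0:                     # found matches at this level of strictness
            return subset
    return merged.iloc[0:0]                     # empty frame: no match even on 2 features

# e.g. similar_users(merged_df, 123, ["age", "PPGENDER", "EDUC", ...])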

SPSS Compute Variable

Below is some data:
Test  Day1  Day2  Score
A     1     2     100
B     1     3     62
C     3     4     90
D     2     4     20
E     4     5     80
I am trying to take the values from the columns 'Day1' and 'Day2' and use them to select row numbers in the Score column. For example, for Test A I would like to find the sum of 100 and 62, because those are the values in the first and second rows of Score. For Test B I would like the sum of 100, 62 and 90.
Is there any way to do this in the Compute Variable window, found in the menu Transform > Compute Variable?
I tried the following:
Score(MEAN(VALUE(Day1), VALUE(DAY2)))
This is not the proper way to reference the cell location of Score, and I received an error.
Can anyone help?
Thank you!
You really have two different datasets here. One is a dataset of scores numbered 1 through 5.
The other is a dataset that includes indexes into the score dataset. So the steps would be something like this:
First take the scores dataset and transpose it so that it has one row and 5 columns (Data>Transpose)
Then match that dataset to each case in the main dataset (Data>Merge Files>Add Variables).
Next you have to resort to using syntax directly.
You would declare a vector for the scores (VECTOR)
Finally, you use COMPUTE to index into the scores.
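For comparison, here is the same range-sum logic sketched in Python/pandas (purely illustrative, using the table from the question; column and variable names are mine), which may make the vector-indexing idea clearer:

import pandas as pd

df = pd.DataFrame({
    "Test": ["A", "B", "C", "D", "E"],
    "Day1": [1, 1, 3, 2, 4],
    "Day2": [2, 3, 4, 4, 5],
    "Score": [100, 62, 90, 20, 80],
})

scores = df["Score"].tolist()   # the score "vector", indexed 1..5 in SPSS terms
df["Total"] = [sum(scores[d1 - 1:d2]) for d1, d2 in zip(df["Day1"], df["Day2"])]
print(df)                       # Test A: 100 + 62 = 162; Test B: 100 + 62 + 90 = 252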
For your real problem, I suppose that you might have batches of scores and maybe there are some gaps. The Restructure Data Wizard can help you generalize this - convert cases into variables, but let's not go there yet.
HTH,
Jon Peck

Predicting game outcomes at a point in time

Hi, I am trying to gauge, from past information I have in an MSSQL database, the predicted outcome of soccer games (win, tie or loss for the home team) at any point in time, based on the minutes played and the scoreline.
What I had envisaged as output was something like what Fangraphs does for baseball:
http://www.fangraphs.com/scoreboard.aspx?date=2010-11-01
although with two lines, as there are three rather than two possible outcomes.
From the data and the existing tables I can create game records like this:
Time  TeamID  Venue  MatchID  Result
6     TOT     H      5        W
27    ASV     A      5        W
58    ASV     A      5        W
66    TOT     H      5        W
77    TOT     H      5        W
So, for the graph of this game, the home team TOT would start with the win line at around 45% (based on the historical probability of a home win); it would spike when they score their goal, dip significantly after ASV score twice, but probably be above 90% when they score to go 3-2 up, and then rise gently to 100% at the closing 90-minute mark.
So I want to go through the 7500 games I have data on and, based on them, establish for every minute of a 90-minute game the chances of a win, tie or loss for the home team, given these results.
For instance, in the simplest situation, after 1 minute of play, 44 of the home teams had actually scored; 33 of them went on to win, 6 tied and 5 lost. In the corresponding case where the away team scored, there were 9 wins, 8 ties and 23 losses for the home team. However, I am having trouble getting my head around how to get the scorelines for all 90 minutes and compare them with the final result. (Only one goal can be scored in any specific minute.)
TIA for any help
There will always be things you can add to the model, but the first thing I would do is, for each game, pull out the score at each minute, and assume that the probability of winning doesn't depend on when the goal was scored, but depends on what the score is now.
So now you would have 90 data points per game.
game1:
Minute: 0 10 20 30 40 50 60 70 80 90
Score :[0 0] [0 0] [0 1] [0 1] [1 1] [1 1] [1 1] [2 1] [2 1] [3 1]
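A hedged sketch of that first step (the event format and function name are my own): expand a game's goal events into a per-minute (home, away) scoreline.

def per_minute_scores(goal_events, n_minutes=90):
    # goal_events: list of (minute, 'H' or 'A') pairs for one game.
    # Returns one (home_goals, away_goals) tuple per minute 0..n_minutes.
    states, home, away, i = [], 0, 0, 0
    events = sorted(goal_events)
    for minute in range(n_minutes + 1):
        while i < len(events) and events[i][0] <= minute:
            if events[i][1] == 'H':
                home += 1
            else:
                away += 1
            i += 1
        states.append((home, away))
    return states

# e.g. the TOT vs ASV game above (TOT scores at 6, 66, 77; ASV at 27, 58):
print(per_minute_scores([(6, 'H'), (27, 'A'), (58, 'A'), (66, 'H'), (77, 'H')]))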
The next thing I would do is, for each minute slice, add up the number of wins, losses, and draws over all games, for each configuration of scores.
So each entry in that table might correspond to something like this:
#minute 27, for score {home:5, away:2} : {homeWins: 9, draws:1, homeLosses:0}
You might want to try using the difference in score instead of the actual score values.
Either way, once you have the data formatted that way, getting a reasonable solution is easy.
If a game is on, it's minute 77, and the score is {home:5, away:2}, the (MLE) estimate is 90% win, 10% draw, 0% loss (according to the example table entry above).
You can already see how it will help to include Laplace smoothing: add +1 to the final value of each of those win/lose/draw counters. This way, if you've never seen a loss in this exact situation, you don't say it's impossible; impossible is a very strong word (look up the Beta or Dirichlet distributions for background).
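A hedged sketch of that counting-plus-smoothing scheme (the data structures and names are my own): tally (minute, score) -> outcome counts over all games, then apply the +1 smoothing at estimation time.

from collections import Counter, defaultdict

counts = defaultdict(Counter)   # (minute, (home, away)) -> Counter of 'W'/'D'/'L'

def tally(game_states, final_result):
    # game_states: iterable of (minute, home_goals, away_goals) for one game;
    # final_result: 'W', 'D' or 'L' from the home team's perspective.
    for minute, home, away in game_states:
        counts[(minute, (home, away))][final_result] += 1

def estimate(minute, home, away):
    # Laplace-smoothed P(win), P(draw), P(loss) for the home team.
    c = counts[(minute, (home, away))]
    smoothed = {r: c[r] + 1 for r in 'WDL'}   # +1 so no outcome is ever 'impossible'
    total = sum(smoothed.values())
    return {r: smoothed[r] / total for r in 'WDL'}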
The obvious problem with this approach is that if you've never seen a particular score combination before it will predict (33%,33%,33%), which is obviously wrong in some cases.
The simplest fix would be to enforce a rule like "leading by 6 points is at least as good as leading by 5 points". It's ugly, but it's a start.
To avoid that sort of special-case logic you could try averaging that approach with a monte-carlo approximation.
The simplest of those approaches is to say: over all my data I expect about a 1-in-30 chance of a goal by each team in each minute of play -> simulate the game 10000 times from the current point, count the number of wins/losses/draws, and you're done.
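A hedged sketch of that Monte Carlo fallback (the 1-in-30 per-minute goal probability is the rough estimate above, and for simplicity this allows both teams to score in the same minute, which the question's data rules out):

import random

def simulate(minute, home, away, p_goal=1/30, n_sims=10000):
    # Play out the remaining minutes of a 90-minute game many times from the
    # current scoreline; return home (win, draw, loss) probability estimates.
    wins = draws = losses = 0
    for _ in range(n_sims):
        h, a = home, away
        for _ in range(minute, 90):
            if random.random() < p_goal:
                h += 1
            if random.random() < p_goal:
                a += 1
        if h > a:
            wins += 1
        elif h == a:
            draws += 1
        else:
            losses += 1
    return wins / n_sims, draws / n_sims, losses / n_sims

# e.g. minute 77, score {home:5, away:2}:
print(simulate(77, 5, 2))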
If that's too random, or too processor-intensive, switch to Markov chains.