I am following the https://codelabs.developers.google.com/codelabs/tfjs-training-regression/index.html#7 tutorial and trying to understand how to use tensorflow.js - I have a pretty simple problem I'm trying to solve, but no idea how to get there. Rather than mpg and horsepower in the tutorial, I have data with
{rating: x, value: y} - Rating being a scale of 1-5, and value being a float. What I'm trying to do is make the model more likely to predict closer to a value that has a rating of 5, rather than randomly. When I evaluated the predictions, I noticed that wasn't happening, because the model doesn't know that I want it to be more biased based on the rating. Can anyone explain how to do this? (And if there is a machine learning term that describes what I'm trying to do?)
Related
So I have been looking at XGBoost as a place to start with this, however I am not sure the best way to accomplish what I want.
My data is set up something like this
Where every value, whether it be input or output is numerical. The issue I'm facing is that I only have 3 input data points per several output data points.
I have seen that XGBoost has a multi-output regression method, however I am only really seeing it used to predict around 2 outputs per 1 input, whereas my data may have upwards of 50 output points needing to be predicted with only a handful of scalar input features.
I'd appreciate any ideas you may have.
For reference, I've been looking at mainly these two demos (they are the same idea just one is scikit and the other xgboost)
https://machinelearningmastery.com/multi-output-regression-models-with-python/
https://xgboost.readthedocs.io/en/stable/python/examples/multioutput_regression.html
I have a dataframe which contains three more or less significant correlations between target column and other columns ( LinarRegressionModel.coef_ from sklearn shows 57, 97 and 79). And I don't know what exact model to choose: should I use only most correlated column for regression or use regression with all three predictors. Is there any way to compare models effectiveness? Sorry, I'm very new to data analysis, I couldn't google any tools for this task
Well first at all, you must know that when we are choosing the best model to apply to new data, we are going to choose the best model to fit out of sample data, which is the kind of samples that might are not present in the training process, after all, you want to predict new probabilities or cases. In your case, predict a new number.
So, how can we do this? Well, the best is to use metrics which can help us to choose which model is better for our dataset.
There are so many kinds of metrics for regression:
MAE: Mean absolute error is the mean of the absolute value of the errors. This is the easiest of the metrics to understand since it’s just the average error.
MSE: Mean squared error is the mean of the squared error. It’s more popular than a mean absolute error because the focus is geared more towards large errors.
RMSE: Root means the squared error is the square root of the mean squared error. This is one of the most popular of the evaluation metrics because root means the squared error is interpretable in the same units as the response vector or y units, making it easy to relate its information.
RAE: Relative absolute error, also known as the residual sum of a square, where y bar is a mean value of y, takes the total absolute error and normalizes it by dividing by the total absolute error of the simple predictor.
You can work with any of these, but I highly recommend to use MSE and RMSE.
ml beginner here.
I have a dataset containing the GPA, GRE, TOEFL, SOP&LOR Ranking(out of 5)etc. (all numerical) , and a final column that states whether or not they were admitted to a university(0 or 1), which is what we'll use as y_train.
I'm supposed to not just classify the predicted labels, but also calculate the probability of each person getting admitted.
edit: so from the from the first comment, I built a Logistic Regression model, and with some googling I found 'predict_proba' from sklearn and tried implementing it. There werent any syntactical errors but the code values given by predict_proba were horribly wrong.
Link: https://github.com/tarunn2799/gre-pred/blob/master/GRE%20Admission%20Probability-%20Extraaedge.ipynb
please help me finding where I've gone wrong, and also tips to reduce the loss
thank you!
I read your notebook, but I'm confused why you think the predict_proba are horribly wrong..
Is the predict accuracy not good, or the format of predict_proba not as you expected?
You could use sklearn.metrics.accuracy_score(), sklearn.metrics.confusion_matrix() to check your predict label, or use sklearn.metrics.roc_auc_score() to check the result of predict_proba. Check both train & text parts are better.
I think the format of predict_proba is correct, or maybe you could try the predict_log_proba() to calculate the log probability?
Hope this could help you.
I am training an object detector for my own data using Tensorflow Object Detection API. I am following the (great) tutorial by Dat Tran https://towardsdatascience.com/how-to-train-your-own-object-detector-with-tensorflows-object-detector-api-bec72ecfe1d9. I am using the provided ssd_mobilenet_v1_coco-model pre-trained model checkpoint as the starting point for the training. I have only one object class.
I exported the trained model, ran it on the evaluation data and looked at the resulted bounding boxes. The trained model worked nicely; I would say that if there was 20 objects, typically there were 13 objects with spot on predicted bounding boxes ("true positives"); 7 where the objects were not detected ("false negatives"); 2 cases where problems occur were two or more objects are close to each other: the bounding boxes get drawn between the objects in some of these cases ("false positives"<-of course, calling these "false positives" etc. is inaccurate, but this is just for me to understand the concept of precision here). There are almost no other "false positives". This seems much better result than what I was hoping to get, and while this kind of visual inspection does not give the actual mAP (which is calculated based on overlap of the predicted and tagged bounding boxes?), I would roughly estimate the mAP as something like 13/(13+2) >80%.
However, when I run the evaluation (eval.py) (on two different evaluation sets), I get the following mAP graph (0.7 smoothed):
mAP during training
This would indicate a huge variation in mAP, and level of about 0.3 at the end of the training, which is way worse than what I would assume based on how well the boundary boxes are drawn when I use the exported output_inference_graph.pb on the evaluation set.
Here is the total loss graph for the training:
total loss during training
My training data consist of 200 images with about 20 labeled objects each (I labeled them using the labelImg app); the images are extracted from a video and the objects are small and kind of blurry. The original image size is 1200x900, so I reduced it to 600x450 for the training data. Evaluation data (which I used both as the evaluation data set for eval.pyand to visually check what the predictions look like) is similar, consists of 50 images with 20 object each, but is still in the original size (the training data is extracted from the first 30 min of the video and evaluation data from the last 30 min).
Question 1: Why is the mAP so low in evaluation when the model appears to work so well? Is it normal for the mAP graph fluctuate so much? I did not touch the default values for how many images the tensorboard uses to draw the graph (I read this question: Tensorflow object detection api validation data size and have some vague idea that there is some default value that can be changed?)
Question 2: Can this be related to different size of the training data and the evaluation data (1200x700 vs 600x450)? If so, should I resize the evaluation data, too? (I did not want to do this as my application uses the original image size, and I want to evaluate how well the model does on that data).
Question 3: Is it a problem to form the training and evaluation data from images where there are multiple tagged objects per image (i.e. surely the evaluation routine compares all the predicted bounding boxes in one image to all the tagged bounding boxes in one image, and not all the predicted boxes in one image to one tagged box which would preduce many "false false positives"?)
(Question 4: it seems to me the model training could have been stopped after around 10000 timesteps were the mAP kind of leveled out, is it now overtrained? it's kind of hard to tell when it fluctuates so much.)
I am a newbie with object detection so I very much appreciate any insight anyone can offer! :)
Question 1: This is the tough one... First, I think you don't understand correctly what mAP is, since your rough calculation is false. Here is, briefly, how it is computed:
For each class of object, using the overlap between the real objects and the detected ones, the detections are tagged as "True positive" or "False positive"; all the real objects with no "True positive" associated to them are labelled "False Negative".
Then, iterate through all your detections (on all images of the dataset) in decreasing order of confidence. Compute the accuracy (TP/(TP+FP)) and recall (TP/(TP+FN)), only counting the detections that you've already seen ( with confidence bigger than the current one) for TP and FP. This gives you a point (acc, recc), that you can put on a precision-recall graph.
Once you've added all possible points to your graph, you compute the area under the curve: this is the Average Precision for this category
if you have multiple categories, the mAP is the standard mean of all APs.
Applying that to your case: in the best case your true positive are the detections with the best confidence. In that case your acc/rec curve will look like a rectangle: you'd have 100% accuracy up to (13/20) recall, and then points with 13/20 recall and <100% accuracy; this gives you mAP=AP(category 1)=13/20=0.65. And this is the best case, you can expect less in practice due to false positives which higher confidence.
Other reasons why yours could be lower:
maybe among the bounding boxes that appear to be good, some are still rejected in the calculations because the overlap between the detection and the real object is not quite big enough. The criterion is that Intersection over Union (IoU) of the two bounding boxes (real one and detection) should be over 0.5. While it seems like a gentle threshold, it's not really; you should probably try and write a script to display the detected bounding boxes with a different color depending on whether they're accepted or not (if not, you'll get both a FP and a FN).
maybe you're only visualizing the first 10 images of the evaluation. If so, change that, for 2 reasons: 1. maybe you're just very lucky on these images, and they're not representative of what follows, just by luck. 2. Actually, more than luck, if these images are the first from the evaluation set, they come right after the end of the training set in your video, so they are probably quite similar to some images in the training set, so they are easier to predict, so they're not representative of your evaluation set.
Question 2: if you have not changed that part in the config file mobilenet_v1_coco-model, all your images (both for training and testing) are rescaled to 300x300 pixels at the start of the network, so your preprocessings don't matter.
Question 3: no it's not a problem at all, all these algorithms were designed to detect multiple objects in images.
Question 4: Given the fluctuations, I'd actually keep training it until you can see improvement or clear overtraining. 10k steps is actually quite small, maybe it's enough because your task is relatively easy, maybe it's not enough and you need to wait ten times that to have significant improvement...
I have some data in a pandas dataframe (although pandas is not the point of this question). As an experiment I made column ZR as column Z divided by column R. As a first step using scikit learn I wanted to see if I could predict ZR from the other columns (which should be possible as I just made it from R and Z). My steps have been.
columns=['R','T', 'V', 'X', 'Z']
for c in columns:
results[c] = preprocessing.scale(results[c])
results['ZR'] = preprocessing.scale(results['ZR'])
labels = results["ZR"].values
features = results[columns].values
#print labels
#print features
regr = linear_model.LinearRegression()
regr.fit(features, labels)
print(regr.coef_)
print np.mean((regr.predict(features)-labels)**2)
This gives
[ 0.36472515 -0.79579885 -0.16316067 0.67995378 0.59256197]
0.458552051342
The preprocessing seems wrong as it destroys the Z/R relationship I think. What's the right way to preprocess in this situation?
Is there some way to get near 100% accuracy? Linear regression is the wrong tool as the relationship is not-linear.
The five features are highly correlated in my data. Is non-negative least squares implemented in scikit learn ? ( I can see it mentioned in the mailing list but not the docs.) My aim would be to get as many coefficients set to zero as possible.
You should easily be able to get a decent fit using random forest regression, without any preprocessing, since it is a nonlinear method:
model = RandomForestRegressor(n_estimators=10, max_features=2)
model.fit(features, labels)
You can play with the parameters to get better performance.
The solutions is not as easy and can be very influenced by your data.
If your variables R and Z are bounded (for ex 0<R<1 -3<Z<2) then you should be able to get a good estimation of the output variable using neural network.
Using neural network you should be able to estimate your output even without preprocessing the data and using all the variables as input.
(Of course here you will have to solve a minimization problem).
Sklearn do not implement neural network so you should use pybrain or fann.
If you want to preprocess the data in order to make the minimization problem easier you can try to extract the right features from the predictor matrix.
I do not think there are a lot of tools for non linear features selection. I would try to estimate the important variables from you dataset using in this order :
1-lasso
2- sparse PCA
3- decision tree (you can actually use them for features selection ) but I would avoid this as much as possible
If this is a toy problem I would sugges you to move towards something of more standard.
You can find a lot of examples on google.