Tensorboard: Filter out certain time steps - tensorflow

Is there a way to filter out the first X timesteps when visualizing the histograms and scalars, only showing steps > X?
Much like the zooming feature for scalars, but something which updates as time progresses, and something which works for histograms too?

Related

Predict a nonlinear array based on 2 features with scalar values using XGBoost or equivalent

So I have been looking at XGBoost as a place to start with this, however I am not sure the best way to accomplish what I want.
My data is set up something like this
Where every value, whether it be input or output is numerical. The issue I'm facing is that I only have 3 input data points per several output data points.
I have seen that XGBoost has a multi-output regression method, however I am only really seeing it used to predict around 2 outputs per 1 input, whereas my data may have upwards of 50 output points needing to be predicted with only a handful of scalar input features.
I'd appreciate any ideas you may have.
For reference, I've been looking at mainly these two demos (they are the same idea just one is scikit and the other xgboost)
https://machinelearningmastery.com/multi-output-regression-models-with-python/
https://xgboost.readthedocs.io/en/stable/python/examples/multioutput_regression.html

Using optical flow to predict velocities

I am no expert in this field but more of a beginner with a bit of experience, so please keep the answer as simple as possible.
I cannot be very specific about this topic but what I am trying to do is predict the velocity of multiple objects(that should have a pattern because they are similar). I am taking the optical flow from every tenth frame and building a histogram of every tenth frame(x and y velocities separate). Then, I convert these histograms to vectors and store them in a CSV file. I am trying to use these vectors in an LSTM for Timeseries forecast. I do not know how to input each x and y velocity vector as a time step to output the next x and y velocities every (let's say) 5 steps.
The tutorials I see are usually about predicting temperature, and the input values(not vectors but single values) of humidity, precipitation, etc. and then output a single value(being the temperature)
please help, I hope I made it relatively clear.
Maybe there is a better approach.

Linear regression graph interpretation

I have a histogram showing frequency of some data.
I have two type of files: Pdbs and Uniprots. Each Uniprot file is associated with a certain number of Pdbs. So this histogram shows how many Uniprot files are associated with 0 Pdb files, 1 Pdb file, 2 Pdb files ... 80 Pdb files.
Y-axis is in a log scale.
I did a regression on the same dataset and this is the result.
Here is the code I'm using for the regression graph:
# Fitting Simple Linear Regression to the Training set
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
x = np.array(x).reshape((-1, 1))
y = np.array(y)
regressor.fit(x, y)
# Predicting the Test set results
y = regressor.predict(x)
# Visualizing the Training set results
plt.scatter(x, y, color = 'red')
plt.plot(x, regressor.predict(x), color = 'blue')
plt.title('Uniprot vs Pdb')
plt.xlabel('Pdbs')
plt.ylabel('Uniprot')
plt.savefig('regression_test.png')
plt.show()
Can you help me interpret the regression graph?
I can understand that as the number of Pdbs increases, there will be less Uniprots associated with them.
But why is it going negative on the y-axis? Is this normal?
The correct way to interpret this linear regression is "this linear regression is 90% meaningless." In fact, some of that 90% is worse than meaningless, it's downright misleading, as you have pointed out with the negative y values. OTOH, there is about 10% of it that we can interpret to good effect, but you have to know what you're looking for.
The Why: Amongst other often less apparent things, one of the assumptions of a linear regression model is that the data are more-or-less linear. If your data aren't linear with some very regular "noise" added in, then all bets are off. Your data aren't linear. They're not even close. So all bets are off.
Since all bets are off, it is helpful to examine the sort of things that we might have otherwise wanted to do with a linear regression model. The hardest thing is extrapolation, which is predicting y outside of the original x range. Your model's abilities at extrapolation are pretty well illustrated by its behavior at the endpoints. This is where you noticed "hey, my graph is all negative!". This is, in a very simplistic sense, because you took a linear model, fit it to data that did not satisfy the "linear" assumption, and then tried to make it do the hardest thing for a model to do. The second hardest thing for a model to do is interpolation which is making predictions inside the original x range. This linear regression isn't very good at that either. Further down the list is, if we simply look at the slope of the linear regression line, we can get a general idea of whether our data are increasing or decreasing. Note that even this bet is off if your data aren't linear. However, it generally works out in a not-entirely-useless sort of way for large classes of even non-linear real-world data. So, this one thing, your linear regression model gets kind of right. Your data are decreasing, and the linear model is also decreasing. That's the 10% I spoke of previously.
What to do: Try to fit a better model. You say that you log-transformed your original data, but it doesn't look like that helped much. In general, the whole point of "transforming" data is to make it look linear. The log transform is helpful for exponential data. If your starting data didn't look exponential-like, then the log transform probably isn't going to help. Since you are trying to do density estimation, you almost certainly want to fit a probability distribution to this stuff, for which you don't even need to do a transform to make the data linear. Here is another Stack Overflow answer with details about how to fit a beta distribution to data. However, there are many options.
Can you help me interpret the regression graph?
Linear Regression tries to built a line between x-variables and a target y-variable which assimates the 'real' value in the most closed possible way (graph you find also here: https://en.wikipedia.org/wiki/Linear_regression):
the line here is the blue line, and the original points are the black lines. The goal is to minimize the error (black dots to blue line) for all black dots.
The regression line is the blue line. That means you can describe a uniprot with a linear equatation y = m*x +b , which has a constant value m=0.1 (example) and b=0.2 (example) and x=Pdbs.
I can understand that as the number of Pdbs increases, there will be less Uniprots associated with them. But why is it going negative on the y-axis?
This is normal, you could plot this line until -10000000 Pdbs or whateever, it is just a equation. Not a real line.
But there is one mistake in your plot, you need to plot the original black dots also or not?
y = regressor.predict(x)
plt.scatter(x, y, color = 'red')
This is wrong, you should add the original values to it, to get the plot from my graphic, something like:
y = df['Uniprot']
plt.scatter(x, y, color = 'red')
should help to understand it.

TensorFlow Object Detection API: evaluation mAP behaves weirdly?

I am training an object detector for my own data using Tensorflow Object Detection API. I am following the (great) tutorial by Dat Tran https://towardsdatascience.com/how-to-train-your-own-object-detector-with-tensorflows-object-detector-api-bec72ecfe1d9. I am using the provided ssd_mobilenet_v1_coco-model pre-trained model checkpoint as the starting point for the training. I have only one object class.
I exported the trained model, ran it on the evaluation data and looked at the resulted bounding boxes. The trained model worked nicely; I would say that if there was 20 objects, typically there were 13 objects with spot on predicted bounding boxes ("true positives"); 7 where the objects were not detected ("false negatives"); 2 cases where problems occur were two or more objects are close to each other: the bounding boxes get drawn between the objects in some of these cases ("false positives"<-of course, calling these "false positives" etc. is inaccurate, but this is just for me to understand the concept of precision here). There are almost no other "false positives". This seems much better result than what I was hoping to get, and while this kind of visual inspection does not give the actual mAP (which is calculated based on overlap of the predicted and tagged bounding boxes?), I would roughly estimate the mAP as something like 13/(13+2) >80%.
However, when I run the evaluation (eval.py) (on two different evaluation sets), I get the following mAP graph (0.7 smoothed):
mAP during training
This would indicate a huge variation in mAP, and level of about 0.3 at the end of the training, which is way worse than what I would assume based on how well the boundary boxes are drawn when I use the exported output_inference_graph.pb on the evaluation set.
Here is the total loss graph for the training:
total loss during training
My training data consist of 200 images with about 20 labeled objects each (I labeled them using the labelImg app); the images are extracted from a video and the objects are small and kind of blurry. The original image size is 1200x900, so I reduced it to 600x450 for the training data. Evaluation data (which I used both as the evaluation data set for eval.pyand to visually check what the predictions look like) is similar, consists of 50 images with 20 object each, but is still in the original size (the training data is extracted from the first 30 min of the video and evaluation data from the last 30 min).
Question 1: Why is the mAP so low in evaluation when the model appears to work so well? Is it normal for the mAP graph fluctuate so much? I did not touch the default values for how many images the tensorboard uses to draw the graph (I read this question: Tensorflow object detection api validation data size and have some vague idea that there is some default value that can be changed?)
Question 2: Can this be related to different size of the training data and the evaluation data (1200x700 vs 600x450)? If so, should I resize the evaluation data, too? (I did not want to do this as my application uses the original image size, and I want to evaluate how well the model does on that data).
Question 3: Is it a problem to form the training and evaluation data from images where there are multiple tagged objects per image (i.e. surely the evaluation routine compares all the predicted bounding boxes in one image to all the tagged bounding boxes in one image, and not all the predicted boxes in one image to one tagged box which would preduce many "false false positives"?)
(Question 4: it seems to me the model training could have been stopped after around 10000 timesteps were the mAP kind of leveled out, is it now overtrained? it's kind of hard to tell when it fluctuates so much.)
I am a newbie with object detection so I very much appreciate any insight anyone can offer! :)
Question 1: This is the tough one... First, I think you don't understand correctly what mAP is, since your rough calculation is false. Here is, briefly, how it is computed:
For each class of object, using the overlap between the real objects and the detected ones, the detections are tagged as "True positive" or "False positive"; all the real objects with no "True positive" associated to them are labelled "False Negative".
Then, iterate through all your detections (on all images of the dataset) in decreasing order of confidence. Compute the accuracy (TP/(TP+FP)) and recall (TP/(TP+FN)), only counting the detections that you've already seen ( with confidence bigger than the current one) for TP and FP. This gives you a point (acc, recc), that you can put on a precision-recall graph.
Once you've added all possible points to your graph, you compute the area under the curve: this is the Average Precision for this category
if you have multiple categories, the mAP is the standard mean of all APs.
Applying that to your case: in the best case your true positive are the detections with the best confidence. In that case your acc/rec curve will look like a rectangle: you'd have 100% accuracy up to (13/20) recall, and then points with 13/20 recall and <100% accuracy; this gives you mAP=AP(category 1)=13/20=0.65. And this is the best case, you can expect less in practice due to false positives which higher confidence.
Other reasons why yours could be lower:
maybe among the bounding boxes that appear to be good, some are still rejected in the calculations because the overlap between the detection and the real object is not quite big enough. The criterion is that Intersection over Union (IoU) of the two bounding boxes (real one and detection) should be over 0.5. While it seems like a gentle threshold, it's not really; you should probably try and write a script to display the detected bounding boxes with a different color depending on whether they're accepted or not (if not, you'll get both a FP and a FN).
maybe you're only visualizing the first 10 images of the evaluation. If so, change that, for 2 reasons: 1. maybe you're just very lucky on these images, and they're not representative of what follows, just by luck. 2. Actually, more than luck, if these images are the first from the evaluation set, they come right after the end of the training set in your video, so they are probably quite similar to some images in the training set, so they are easier to predict, so they're not representative of your evaluation set.
Question 2: if you have not changed that part in the config file mobilenet_v1_coco-model, all your images (both for training and testing) are rescaled to 300x300 pixels at the start of the network, so your preprocessings don't matter.
Question 3: no it's not a problem at all, all these algorithms were designed to detect multiple objects in images.
Question 4: Given the fluctuations, I'd actually keep training it until you can see improvement or clear overtraining. 10k steps is actually quite small, maybe it's enough because your task is relatively easy, maybe it's not enough and you need to wait ten times that to have significant improvement...

How to make gnuplot generate figures with smaller/fized size (Bytes)?

I would like to avoid using every command since it simply discards data that, however, might be very important (like a spike for instance). I would like also to avoid posterior downsizing since this might lead to the deterioration of the text on the figure...
Is there a manner/option to force gnuplot generating files (eps) with maximum size?
You'd need some adaptive compression on your data. Without actually knowing it, that's rather tough.
The stats command can tell you how many datapoints you actually have, and you can then adjust the every statement to a sensible value. Otherwise, you can use smooth to achieve a predefined (set sample) number of datapoints, or (if you have a sensible model for you data) you can do a fit and simply plot the fitted model function instead of you dataset.
If you specifically want outliers to show in the plot, this might be helpful:
fit f(x) data via *parameter*
plot f(x), data using ((abs($2-f($1)) > threshold) ? $2 : NaN)
It plots a fit to your dataset, and all actual datapoints that deviate from the fit by more than threshold.