Train / Test split % for Object Detection - what's the current recommendation? - tensorflow

Using the Tensorflow Object Detection API, what's the current recommendation / best practice around the train / test split percentage for labeled examples? I've seen a lot of conflicting info, anywhere from 70/30 to 95/5. Any recent real world experience is appreciated.

Traditional advice is ~70-75% training and the rest test data. More recent articles indeed suggest a different split. I read 95/2.5/2.5 (train / test / dev for hyperparameter tuning) a lot these days.
I guess your optimal split depends on the amount of available data and the bias/variance characteristics. Poor performance on training data may be caused by underfitting and need more training data. If your model is fitting well or even overfitting, you should be able to allocate some of the training data away to test data.
If you're stuck in the middle, you may also consider cross validation as a computationally expensive but data friendly option.

It depends on the size of the dataset as Andrew Ng suggests:
(train/ dev or Val /test)
If the size of the dataset is 100 to 10K ~ 60/20/20
If the size of the dataset is 1M to INF ==> 98/1/1 or 99.5/0.25/0.25
Note that these are not fixed and just suggestions.
The goal of the test set mentioned here is to give you an unbiased performance measurement of your work. In some works, it is OK not to have only two sets set (then they will call it train/test, though test set here is actually working as dev set ratio can be 70/30 )

Related

Tensorflow / Keras: Normalize train / test / realtime Data or how to handle reality?

I started developing some LSTM-models and now have some questions about normalization.
Lets pretend I have some time series data that is roughly ranging between +500 and -500. Would it be more realistic to scale the Data from -1 to 1, or is 0 to 1 a better way, I tested it and 0 to 1 seemed to be faster. Is there a wrong way to do it? Or would it just be slower to learn?
Second question: When do I normalize the data? I split the data into training and testdata, do I have to scale / normalize this data seperately? maybe the trainingdata is only ranging between +300 to -200 and the testdata ranges from +600 to -100. Thats not very good I guess.
But on the other hand... If I scale / normalize the entire dataframe and split it after that, the data is fine for training and test, but how do I handle real new incomming data? The model is trained to scaled data, so I have to scale the new data as well, right? But what if the new Data is 1000? the normalization would turn this into something more then 1, because its a bigger number then everything else before.
To make a long story short, when do I normalize data and what happens to completely new data?
I hope I could make it clear what my problem is :D
Thank you very much!
Would like to know how to handle reality as well tbh...
On a serious note though:
1. How to normalize data
Usually, neural networks benefit from data coming from Gaussian Standard distribution (mean 0 and variance 1).
Techniques like Batch Normalization (simplifying), help neural net to have this trait throughout the whole network, so it's usually beneficial.
There are other approaches that you mentioned, to tell reliably what helps for which problem and specified architecture you just have to check and measure.
2. What about test data?
Mean to subtract and variance to divide each instance by (or any other statistic you gather by any normalization scheme mentioned previously) should be gathered from your training dataset. If you take them from test, you perform data leakage (info about test distribution is incorporated into training) and you may get false impression your algorithm performs better than in reality.
So just compute statistics over training dataset and use them on incoming/validation/test data as well.

How to test a machine learning model?

I want to develop a framework(for QA testing purpose) that validates a machine learning model. I had a lot of discussions with my peers and read articles from the google.
Most of the discussions or articles are telling machine learning model will evolve with the test data that we provide. correct me if I'm wrong.
What is the possibility of developing a framework that validates the machine learning model will give accurate results?
Few ways to test the model from the articles I read: Split and Multi-split technique, Metamorphic testing
Please also suggest any other approaches
QA testing of ML-based software requires additional, and rather unconventional, tests because oftentimes their outputs for a given set of inputs are not defined, deterministic, or known a priori and they produce approximations rather than exact results.
QA may be designed to test against:
naive but predictable benchmark methods: the average method in forecasting, the class-frequency-based classifier in classification, etc.
sanity checks (the outputs being feasible/rational): e.g., is the predicted age positive?
preset objective acceptance levels: e.g., is its AUCROC > 0.5?
extreme/boundary cases: e.g., thunderstorm conditions for a weather forecast model.
bias-variance tradeoff: what is its performance on in-sample and out-of-sample data? K-Fold cross-validation is useful here.
the model itself: is the coefficient of variation of its performance measure (e.g., AUCROC) from n runs on the same data for same/random train and test partitioning within a reasonable bound?
Some of these tests need performance measures. Here is a comprehensive library of them.
I think the data flow is, actually, the one that needs to be tested here such as raw input, manipulation, test output and predictions. For example, if you have a simple linear model you actually want to test the predictions produced from that model instead of the coefficients of the model. So, maybe, the high level steps are summarized as below;
Raw Input: Does the raw input make sense? Before you start manipulating, you need to be sure the raw data values are within the expected limits. For example, if you normally see 5-10% NA rate in some data, having 95% NA rate in a new batch might be an indicator that something is wrong.
Train/Predict Ready Input: Either you train a new model or feeding new data into a already trained model for prediction, you probably want to be sure that manipulated data makes sense, too. Some ML algorithms are delicate to data anomalies. You don't want to predict a credit score around thousands just because you have some data anomalies in the input.
Model Success: By this time, you should have some idea about your model success. So, you can measure the model's performance on a new test data. You can also check train and test score if they are not significantly different (i.e. Overfitting). If you're retraining, you can compare with the previous training scores. Or, you can separate some test set and compare its score.
Predictions: Finally, you need to be sure your final output makes sense before delivering to production/clients. For example, if you're revenue forecasting for a very small shop, the daily revenue predictions can't be million dollars or some negative amounts.
Full disclosure, I wrote a small Python package for this. You can check here or download as below,
pip install mlqa

Imbalanced Dataset for Multi Label Classification

So I trained a deep neural network on a multi label dataset I created (about 20000 samples). I switched softmax for sigmoid and try to minimize (using Adam optimizer) :
tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(labels=y_, logits=y_pred)
And I end up with this king of prediction (pretty "constant") :
Prediction for Im1 : [ 0.59275776 0.08751075 0.37567005 0.1636796 0.42361438 0.08701646 0.38991812 0.54468459 0.34593087 0.82790571]
Prediction for Im2 : [ 0.52609032 0.07885984 0.45780018 0.04995904 0.32828355 0.07349177 0.35400775 0.36479294 0.30002621 0.84438241]
Prediction for Im3 : [ 0.58714485 0.03258472 0.3349618 0.03199361 0.54665488 0.02271551 0.43719986 0.54638696 0.20344526 0.88144571]
At first, I thought I just neeeded to find a threshold value for each class.
But I noticed that, for instance, among my 20000 samples, the 1st class appears about 10800 so a 0.54 ratio and it the value around which my prediction is every time. So I think I need to find a way to tackle tuis "imbalanced datset" issue.
I thought about reducing my dataset (Undersampling) to have about the same number of occurence for each class but only 26 samples correspond to one of my classes... That would make me loose a lot of samples...
I read about oversampling or about penalizing even more the classes that are rare but did not really understood how it works.
Can someone share some explainations about these methods please ?
In practice, on Tensorflow, are there functions that help doing that ?
Any other suggestions ?
Thank you :)
PS: Neural Network for Imbalanced Multi-Class Multi-Label Classification This post raises the same problem but had no answer !
Well, having 10000 samples in one class and just 26 in a rare class will be indeed a problem.
However, what you experience, to me, seems more like "outputs don't even see the inputs" and thus the net just learns your output distribution.
To debug this I would create a reduced set (just for this debugging purpose) with say 26 samples per class and then try to heavily overfit. If you get correct predictions my thought is wrong. But if the net cannot even detect those undersampled overfit samples then indeed it's an architecture/implementation problem and not due to the schewed distribution (which you will then need to fix. But it'll be not as bad as your current results).
Your problem is not the class imbalance, rather just the lack of data. 26 samples are considered to be a very small dataset for practically any real machine learning task. A class imbalance could be easily handled by ensuring that each minibatch will have at least one sample from every class (this leads to situations when some samples will be used much more frequently than another, but who cares).
However, in the case of presence only 26 samples this approach (and any other) will quickly lead to overfitting. This problem could be partly solved with some form of data augmentation, but there still too few samples to construct something reasonable.
So, my suggestion will be to collect more data.

Word2Vec: Any way to train model fastly?

I use Gensim Word2Vec to train word sets in my database.
I have about 400,000 phrase(Each phrase is short. Total 700MB) in my PostgreSQL database.
This is how I train these data using Django ORM:
post_vector_list = []
for post in Post.objects.all():
post_vector = my_tokenizer(post.category.name)
post_vector.extend(my_tokenizer(post.title))
post_vector.extend(my_tokenizer(post.contents))
post_vector_list.append(post_vector)
word2vec_model = gensim.models.Word2Vec(post_vector_list, window=10, min_count=2, size=300)
But this job getting a lot of time and feels like not efficient.
Especially, creating post_vector_list part took a lot of time and space..
I want to improve speed of training but have no idea how to do.
Want to get your advices. Thanks.
To optimize such code, you need to collect good information about where the time is spent.
Is most of the time spent preparing post_vector_list?
If so, you will want to make sure my_tokenizer (whose code is not shown) is as efficient as possible. You may want to try to minimize the number of extend()s and append()s that are done on large lists. You might have to even take a look at your DB's configuration or options to speed up the DB-to-Object mapping started inside Post.objects.all().
Is most of the time spent in the call to Word2Vec()?
If so, other steps may help:
ensure you're using gensim's Cython-optimized routines – if not, you should be seeing a logged warning (and training will be up to 100X slower)
consider using a workers=4 or workers=8 optional argument to use more threads, if your machine has at least 4 or 8 CPU cores
consider using a larger min_count, which speeds training somewhat (and since vectors for words where there are only a few examples typically aren't very good anyway, doesn't lose much and can even improve the quality of the surviving words)
consider using a smaller window, since training takes longer for larger windows
consider using a smaller vector_size (previously called size), since training takes longer for larger-size vectors
consider using a more-aggressive (smaller) value for the optional sample argument, which randomly skips more of the most-frequent words. The default is 1e-04, but values of 1e-05 or 1e-06 (especially on larger corpuses) can offer additional speedup, and even often improve the final vectors (by spending relatively less training time on words with an excess of usage examples)
consider using a lower-than-default (5) value for the optional epochs parameter (previously called iter). (I wouldn't recommend this unless the corpus is very large – so it already has many redundant, equally-good examples of the same words throughout.)
you could use a python generator instead of loading all the data into the list. Gensim works with python generators too. The code will look something like this
class Post_Vectors(object):
def __init__(self, Post):
self.Post = Post
def __iter__(self):
for post in Post.objects.all():
post_vector = my_tokenizer(post.category.name)
post_vector.extend(my_tokenizer(post.title))
post_vector.extend(my_tokenizer(post.contents))
yield post_vector
post_vectors = Post_Vectors(Post)
word2vec_model = gensim.models.Word2Vec(post_vectors, window=10, min_count=2, size=300, workers=??)
For the gensim speedup, if you have a multi-core CPU, you could use the workers parameter. (By default it is 3)

How should I test on a small dataset?

I use Weka to test machine learning algorithms on my dataset. I have 3800 rows and around 25 features. I am testing the combination of different features for prediction models and seem to predict lower than just the oneR algorithm does with the use of Cross-validation. Even C4.5 does not predict better, sometimes it does and sometimes it does not on basis of the features that are still able to classify.
But, on a certain moment I splitted my dataset in a testset and dataset(20/80), and testing it on the testset, the C4.5 algorithm had a far higher accuracy than my OneR algorithm had. I thought, with the small size of the dataset, it probably is just a coincidence that it predicted very well(the target was still splitted up relatively as target attributes). And therefore, its more useful to use Cross-validation on small datasets like these.
However, testing it on another testset, did give the high accuracy towards the testset using C4.5. So, my question actually is, what is the best way to test datasets when the datasets are actually pretty small?
I saw some posts where it is discussed, but I am still not sure what is the right way to do it.
It's almost always a good approach to test your model via Cross-Validation.
A rule of thumb is to use 10 fold cross validation.
In your case, 10 fold cross validation will do the following in Weka:
split your 3800 training instances into 10 sets of 380 instances
for each set (s = 1 .. 10) :
use the instances from s for testing and the other 9 sets for training a model (3420 training instances)
the result will be an average of the results obtained with the 10 models used.
Try to avoid testing your dataset using the training set option, because that could result in creating a model that works very well for you existing data but could have big problems with other new instances (overfitting).