How many epochs should Word2Vec be trained? What is a recommended training dataset? - tensorflow

I am learning about Word2Vec using the TensorFlow tutorial. The code I am running for Word2Vec is also from the TensorFlow tutorial: https://github.com/tensorflow/models/blob/master/tutorials/embedding/word2vec_optimized.py . When I ran the code for 15 epochs, the test accuracy was around 30%. When I ran for 100 epochs, test accuracy got up to around 39%. I am using the Text8 dataset for training and questions-words.txt for evaluation.
Do I need to run for more epochs? Should I be using a different dataset? How can I improve test accuracy?

Larger datasets are better; text8 is very, very small – sufficient for showing some of the analogy-solving power of word-vectors, but not good enough for other purposes.
More iterations may help squeeze slightly stronger vectors out of smaller datasets, but with diminishing returns. (No number of extra iterations over a weak dataset can extract the same rich interrelationships that a larger, more varied corpus can provide.)
There's a related text9 from the same source that, if I recall correctly, is 10x larger. You'll likely get better evaluation results from using it than from doing 10x more iterations on text8.
I believe the 3 million pretrained vectors Google once released – the GoogleNews set – were trained on a corpus of 100 billion words' worth of news articles, but with just 3 passes.
Note that there's no single standard for word-vector quality: the questions-words.txt analogy solving is just one convenient evaluation, but it's possible the word-vectors best at that won't be best at your own domain-specific analyses. Similarly, word-vectors trained on one domain of text, like the GoogleNews set from news articles, might underperform compared to vectors trained on text that better matches your domain (like perhaps forum posts, scientific articles, etc. – which all use different words in different ways).
So it's often best to use your own corpus, and your own goal-specific quantitative evaluation, to help adjust corpus/parameter choices.
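As a rough illustration of what a goal-specific quantitative evaluation can look like, here is a minimal sketch of an analogy scorer in plain Python. The toy vectors and questions are made up for the example, and the helper names (cosine, solve_analogy, analogy_accuracy) are my own, not part of the TensorFlow tutorial. With real vectors you would load your trained embeddings into the dictionary and read the question tuples from a file such as questions-words.txt or your own domain-specific list.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def solve_analogy(vectors, a, b, c):
    """Return the word d that best completes 'a is to b as c is to d'."""
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    # The three query words are excluded from the candidates.
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(vectors[w], target))


def analogy_accuracy(vectors, questions):
    """Fraction of (a, b, c, expected) questions answered correctly.

    Questions containing out-of-vocabulary words are skipped.
    """
    usable = [q for q in questions if all(w in vectors for w in q)]
    if not usable:
        return 0.0
    correct = sum(solve_analogy(vectors, a, b, c) == d
                  for a, b, c, d in usable)
    return correct / len(usable)


# Toy 2-D vectors, invented for illustration only.
toy_vectors = {
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man": [0.5, 1.0],
    "woman": [0.5, -1.0],
    "apple": [-1.0, 0.1],
}
toy_questions = [
    ("man", "woman", "king", "queen"),
    ("king", "queen", "man", "woman"),
]
print(analogy_accuracy(toy_vectors, toy_questions))
```

Running the same scorer on vectors trained with different corpora or epoch counts gives a like-for-like number to compare, which is more informative than the epoch count alone.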

Related

Why machine learning algorithms focus on speed and not accuracy?

I study ML, and I see that most of the time the focus of the algorithms is on run time rather than accuracy: reducing features, sampling from the dataset, using approximations, and so on.
I'm not sure why that is the focus. Once I have trained my model, I don't need to train it again if its accuracy is high enough, so whether training takes 1 hour or 10 days does not really matter: I do it only once, and my goal is to predict my outcomes as well as I can (minimum loss).
If I train a model to distinguish cats from dogs, I want it to be as accurate as possible, not the fastest, since once I have trained this model I don't need to train any more models.
I can understand why models that depend on fast-changing data need this focus on speed, but for general models I don't understand why the focus is on speed.
Speed is a relative term. Accuracy is also relative, depending on the difficulty of the task. Currently the goal is to achieve human-like performance at a reasonable cost, because this will replace human labor and cut costs.
From what I have seen in reading papers, people usually focus on accuracy first to produce something that works. They then do ablation studies (studies where pieces of the model are removed or modified) to achieve the same performance with less time or memory.
The field is very much validated by experiment. There really isn't much theory that explains why CNNs work so well, other than that they can model any function given non-linear activation functions (https://en.wikipedia.org/wiki/Universal_approximation_theorem). There have been some recent efforts to explain why they work well. One I recall is MobileNetV2: Inverted Residuals and Linear Bottlenecks. Its explanation of embedding data into a low-dimensional space without losing information might be worth reading.

CNN : Fine tuning small network vs feature extracting from a big network

To elaborate: under what circumstances would fine-tuning all layers of a small network (say SqueezeNet) perform better than feature extraction, or fine-tuning only the last 1 or 2 convolution layers, of a big network (e.g. InceptionV4)?
My understanding is that the computing resources required for both are somewhat comparable. I also remember reading in a paper that the extreme options, i.e. fine-tuning 90% or 10% of the network, are far better than more moderate ones like 50%. So what should be the default choice when extensive experimentation is not an option?
Any past experiments with an intuitive description of their results, research papers, or blog posts would be especially helpful. Thanks.
I don't have much experience in training models like SqueezeNet, but I think it is much easier to finetune only the last 1 or 2 layers of a big network: you don't have to extensively search for many optimal hyperparameters. Transfer learning works amazingly well out of the box with the LR finder and the cyclical learning rate from fast.ai.
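For readers who have not seen a cyclical learning rate before, here is a minimal sketch of the triangular policy from Leslie Smith's cyclical learning rates paper. This is an assumption about which variant is meant: fast.ai's own scheduler may differ in its details, and the function name and parameters below are mine, not a fast.ai API.

```python
import math


def triangular_lr(iteration, base_lr, max_lr, step_size):
    """Learning rate at a given iteration under a triangular cyclical schedule.

    The rate climbs linearly from base_lr to max_lr over step_size iterations,
    falls back to base_lr over the next step_size iterations, then repeats.
    """
    cycle = math.floor(1 + iteration / (2 * step_size))
    x = abs(iteration / step_size - 2 * cycle + 1)
    return base_lr + (max_lr - base_lr) * max(0.0, 1.0 - x)


# Hypothetical settings: one full cycle lasts 200 iterations.
for it in (0, 50, 100, 150, 200):
    print(it, triangular_lr(it, base_lr=0.001, max_lr=0.01, step_size=100))
```

The bounds base_lr and max_lr are usually chosen from a short range test (the "LR finder"), which is what removes most of the manual search.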
If you want fast inference after the training, then it is preferable to train SqueezeNet. It might also be the case if the new task is very different from ImageNet.
Some intuition from http://cs231n.github.io/transfer-learning/
New dataset is small and similar to original dataset. Since the data is small, it is not a good idea to fine-tune the ConvNet due to overfitting concerns. Since the data is similar to the original data, we expect higher-level features in the ConvNet to be relevant to this dataset as well. Hence, the best idea might be to train a linear classifier on the CNN codes.
New dataset is large and similar to the original dataset. Since we have more data, we can have more confidence that we won’t overfit if we were to try to fine-tune through the full network.
New dataset is small but very different from the original dataset. Since the data is small, it is likely best to only train a linear classifier. Since the dataset is very different, it might not be best to train the classifier from the top of the network, which contains more dataset-specific features. Instead, it might work better to train the SVM classifier from activations somewhere earlier in the network.
New dataset is large and very different from the original dataset. Since the dataset is very large, we may expect that we can afford to train a ConvNet from scratch. However, in practice it is very often still beneficial to initialize with weights from a pretrained model. In this case, we would have enough data and confidence to fine-tune through the entire network.
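The four cases above can be condensed into a small lookup. This is only a sketch of the rule of thumb as quoted; the function name and the boolean inputs are my own simplification, and deciding what counts as "large" or "similar" is still up to you.

```python
def transfer_strategy(dataset_is_large, similar_to_original):
    """Map the four quoted cases to a suggested starting strategy."""
    if similar_to_original:
        if dataset_is_large:
            return "fine-tune through the full network"
        return "train a linear classifier on the CNN codes"
    if dataset_is_large:
        return "initialize from pretrained weights, then fine-tune the entire network"
    return "train a linear classifier on activations from an earlier layer"


# Example: a small dataset that looks a lot like ImageNet.
print(transfer_strategy(dataset_is_large=False, similar_to_original=True))
```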

How to make a model of 10000 Unique items using tensorflow? Will it scale?

I have a use case where I have around 100 images each of 10000 unique items. I have 10 items with me which are all from the 10000 set and I know which 10 items too but only at the time of testing on live data. I have to now match the 10 items with their names. What would be an efficient way to recognise these items? I have full control of training environment background and the testing environment background. If I make one model of all 10000 items, will it scale? Or should I make 10000 different models and run the 10 items on the 10 models I have pretrained.
Your question concerns something called "one-vs-all classification". You can search for that term; the first hit is a video lecture by Andrew Ng that's almost certainly worth watching.
The question has been long studied and in a plethora of contexts. The answer to your question does very much depend on what model you use. But I'll assume that, if you're doing image classification, you are using convolutional neural networks, because, after all, they're state of the art for most such image classification tasks.
In the context of convolutional networks, there is something called "multi-task learning" that you should read up on. Boiled down to a single sentence, the concept is that the more you ask the network to learn, the better it is at the individual tasks. So, in this case, you're almost certain to perform better training 1 model on 10,000 classes than training 10,000 models that each perform a one-vs-all classification.
Take for example the 1,000-class ImageNet dataset and the 10-class CIFAR-10 dataset. It has been demonstrated in numerous papers that first training against ImageNet's 1,000 classes, and then simply replacing the last layer with a 10-class output and re-training on CIFAR-10, will produce a better result than training on CIFAR-10 alone. There are admittedly multiple reasons for this result; for one, ImageNet is a larger dataset. But the richness of class labels in ImageNet, which is a form of multi-task learning, is certainly among the reasons.
So that was a long winded way of saying, use one model with 10,000 classes.
An aside:
If you want to get really, really interesting, and jump into the realm of research level thinking, you might consider a 1-hot vector of 10,000 classes rather sparse and start thinking about whether you could reduce the dimensionality of your output layer using an embedding. An embedding would be a dense vector, let's say size 100 as a good starting point. Now class labels turn into clusters of points in your 100 dimensional space. I bet your network will perform even better under these conditions.
If this little aside didn't make sense, it's completely safe to ignore it; your 10,000-class output is fine. But if it did pique your interest, look up information on Word2Vec, and read this really nice post (https://medium.com/@ageitgey/machine-learning-is-fun-part-4-modern-face-recognition-with-deep-learning-c3cffc121d78) on how face recognition is achieved using embeddings. You might also consider using an autoencoder to generate an embedding for the images (though I favor triplet embeddings, as typically used in face recognition, myself).
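To make the triplet idea a little more concrete, here is a minimal sketch of a triplet loss on squared Euclidean distances. The margin of 0.2 and the 2-D vectors are assumptions chosen for the example; in practice the embeddings would come from your network and the loss would be computed by your training framework.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge-style triplet loss.

    The loss is zero once the negative is at least `margin` farther from the
    anchor (in squared distance) than the positive is.
    """
    return max(0.0,
               squared_distance(anchor, positive)
               - squared_distance(anchor, negative)
               + margin)


# Invented embeddings: two images of the same item, one of a different item.
same_item_a = [0.0, 0.0]
same_item_b = [0.1, 0.0]
other_item = [1.0, 0.0]

print(triplet_loss(same_item_a, same_item_b, other_item))  # well separated
print(triplet_loss(same_item_a, other_item, same_item_b))  # badly separated
```

Training against this loss pulls images of the same item together and pushes different items apart, which is what turns class labels into clusters in the embedding space.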

Is it possible to train Neural Network with low amount of instances?

I ran into a problem where I needed to solve a regression task using as few instances as possible. With XGBoost I had to feed 4 instances to get a reasonable result. A multilayer perceptron tuned for regression needed 20 instances; I tried changing the number of neurons and layers, but the answer was still 20. Is it possible to make a neural network solve regression tasks with 2 to 4 instances? If so, please explain what I should do to succeed. Is there perhaps some correlation between the number of instances needed to get reasonable results from a perceptron and how valuable the features in the dataset are?
Thanks in advance for any help
With small numbers of samples, there are likely better methods to apply; XGBoost definitely comes to mind as a method that does quite well at avoiding overfitting.
Neural networks tend to work well with larger numbers of samples. They often overfit small datasets and underperform other algorithms.
There is, however, an active area of research in semi-supervised techniques using neural networks with large datasets of unlabeled data and small datasets of labeled samples.
Here's a paper to start you down that path; you can also search for 'semi-supervised learning'.
http://vdel.me.cmu.edu/publications/2011cgev/paper.pdf
Another area of interest to reduce overfitting in smaller datasets is in multi-task learning.
http://ruder.io/multi-task/
Multi task learning requires the network to achieve multiple target goals for a given input. Adding more requirements tends to reduce the space of solutions that the network can converge on and often achieves better results because of it. To say that differently: when multiple objectives are defined, the parameters necessary to do well at one task are often beneficial for the other task and vice versa.
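One common way to express multiple objectives is a weighted sum of per-task losses computed over a shared network. The sketch below shows only the loss arithmetic, assuming every task is a regression scored with mean squared error; the task names and weights are hypothetical.

```python
def mean_squared_error(predictions, targets):
    """Average squared difference between predictions and targets."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)


def multi_task_loss(task_predictions, task_targets, task_weights):
    """Weighted sum of per-task losses (all tasks assumed to be regression)."""
    return sum(
        task_weights[name]
        * mean_squared_error(task_predictions[name], task_targets[name])
        for name in task_targets
    )


# Hypothetical main task plus one auxiliary task with a smaller weight.
predictions = {"main": [1.0, 2.0], "aux": [0.0]}
targets = {"main": [1.0, 4.0], "aux": [1.0]}
weights = {"main": 1.0, "aux": 0.5}
print(multi_task_loss(predictions, targets, weights))
```

Because the gradient of this sum flows into the shared layers, parameters that help only one task at the expense of the other are penalized.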
Lastly, another area of open research is GANs and how they might be used in semi-supervised learning. No papers pop to the forefront of my mind on the subject just now, so I'll leave this mention as a footnote.

Must each tensorflow batch contain a uniform distribution of the inputs for all expected classifications?

This is probably a newbie question but I'm trying to get my head around how training on small batches works.
Scenario -
For the MNIST classification problem, let's say that we have a model with appropriate hyperparameters that allow training on the digits 0-9. If we feed it small batches with a uniform distribution of inputs (more or less the same number of each digit in every batch), it'll learn to classify as expected.
Now, imagine that instead of a uniform distribution, we trained the model on images containing only 1s so that the weights are adjusted until it works perfectly for 1s. And then we start training on images that contain only 2s. Note that only the inputs have changed, the model and everything else has stayed the same.
Question -
What does the training exclusively on 2s after the model was already trained exclusively on 1s do? Will it keep adjusting the weights till it has forgotten (so to say) all about 1s and is now classifying on 2s? Or will it still adjust the weights in a way that it remembers both 1s and 2s?
In other words, must each batch contain a uniform distribution of different classifications? Does retraining a trained model in Tensorflow overwrite previous trainings? If yes, if it is not possible to create small (< 256) batches that are sufficiently uniform, does it make sense to train on very large (>= 500-2000) batch sizes?
That is a good question without a clear answer. In general, the order and selection of training samples has a large impact on the performance of the trained net, in particular in respect to the generalization properties it shows.
The impact is so strong, actually, that selecting specific examples and ordering them in a particular way to maximize the performance of the net constitutes a genuine research area called "curriculum learning". See this research paper.
So back to your specific question: you should try different possibilities and evaluate each of them (which might actually be an interesting learning exercise anyway). I would expect uniformly distributed samples to generalize well across categories, and samples drawn from the original distribution to achieve the highest overall score. (If 90% of your samples come from one category A, then reaching 70% accuracy on every category gives a lower total accuracy than reaching 99% on category A and 0% everywhere else.) Other sample selection mechanisms will show different behavior.
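The accuracy comparison in that parenthetical is just a frequency-weighted average, which a few lines make explicit. The split of the remaining 10% over two categories B and C is an assumption added for the example.

```python
def overall_accuracy(class_fractions, per_class_accuracy):
    """Average of per-class accuracies, weighted by how common each class is."""
    return sum(class_fractions[c] * per_class_accuracy[c]
               for c in class_fractions)


# Assumed test set: 90% category A, the rest split evenly over B and C.
fractions = {"A": 0.90, "B": 0.05, "C": 0.05}

balanced_model = {"A": 0.70, "B": 0.70, "C": 0.70}  # 70% on every category
skewed_model = {"A": 0.99, "B": 0.0, "C": 0.0}      # only good at category A

print(overall_accuracy(fractions, balanced_model))  # about 0.70
print(overall_accuracy(fractions, skewed_model))    # about 0.89
```

The skewed model wins on total accuracy while being useless for B and C, which is why the metric you choose should match what you actually care about.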
An interesting read on such questions is Bengio's 2012 paper, Practical Recommendations for Gradient-Based Training of Deep Architectures.
There is a section about online learning where the distribution of training data is unknown. I quote from the original paper:
"It means that online learners, when given a stream of non-repetitive training data, really optimize (maybe not in the optimal way, i.e., using a first-order gradient technique) what we really care about: generalization error."
The best way to find out how your dataset behaves under the different scenarios, though, is to try them both and measure experimentally how the distribution of the training data affects your generalization error.
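If you want to run that experiment, the two regimes differ only in how the batches are built. Here is a minimal sketch in plain Python, using a stand-in dataset of placeholder inputs with 10 labels; the helper name make_batches is my own, and with real data you would feed each list of batches to the same model and compare the resulting test error.

```python
import random
from collections import Counter


def make_batches(samples, batch_size, shuffle, seed=0):
    """Split (input, label) pairs into consecutive batches.

    With shuffle=True the samples are shuffled first, so each batch roughly
    follows the overall label distribution.
    """
    ordered = list(samples)
    if shuffle:
        random.Random(seed).shuffle(ordered)
    return [ordered[i:i + batch_size]
            for i in range(0, len(ordered), batch_size)]


# Stand-in dataset: 100 placeholder samples for each of 10 labels,
# stored sorted by label.
dataset = [(None, label) for label in range(10) for _ in range(100)]

sorted_batches = make_batches(dataset, batch_size=50, shuffle=False)
mixed_batches = make_batches(dataset, batch_size=50, shuffle=True)

print(Counter(label for _, label in sorted_batches[0]))  # a single label
print(Counter(label for _, label in mixed_batches[0]))   # a mix of labels
```

Training on sorted_batches reproduces the "all 1s, then all 2s" scenario from the question, while mixed_batches is the usual shuffled setup.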