Understanding what happens during a single epoch of DBOW - batch-processing

I am using Distributed Bag of Words (DBOW) and I'm curious what happens during a single epoch. Does DBOW cycle through all documents (i.e. a batch) or does it cycle through a subset of documents (i.e. a mini-batch)? In addition, for a given document, DBOW randomly samples a word from a text window and learns the weights that associate that target word with the surrounding words in the window. Does this mean that DBOW may not go through all the text in a document?
I've gone through the gensim (https://github.com/RaRe-Technologies/gensim) code to see whether there is a parameter for batching, but no luck.

One epoch of PV-DBOW training in gensim Doc2Vec will iterate through all the texts, and then for each text iterate through all their words, attempting to predict each word in turn, and then back-propagating corrections for that predicted word immediately. That is, there's no "mini-batching" at all: each target word is an individual prediction/back-propagation.
(There is a sort-of-batching in how groups-of-texts are sent to worker threads, which can change the ordering somewhat, but each individual training-example presented to the neural-network is corrected individually, so no SGD-mini-batching is occurring.)
The words of each text are considered in order, and only skipped if (a) the word appeared fewer than min_count times, or (b) the word is very frequent and is chosen for random dropping via the value of the sample parameter. So you can generally think of the training as including all significant words of every document.
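For reference, a minimal sketch of how those parameters appear in a gensim Doc2Vec PV-DBOW setup (the toy corpus and the vector_size/epochs values are illustrative, not from the question; parameter names follow gensim 4.x):
'''
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Toy corpus; each document is a TaggedDocument with a list of tokens and a tag.
corpus = [
    TaggedDocument(words=["the", "quick", "brown", "fox"], tags=[0]),
    TaggedDocument(words=["jumped", "over", "the", "lazy", "dog"], tags=[1]),
]

model = Doc2Vec(
    corpus,
    dm=0,            # dm=0 selects PV-DBOW
    vector_size=50,  # illustrative value
    min_count=1,     # words seen fewer than min_count times are skipped
    sample=1e-5,     # very frequent words may be randomly dropped
    epochs=10,
    workers=1,       # one worker avoids the thread-level reordering mentioned above
)
'''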

Related

How to train data of different lengths in machine learning?

I am analyzing the text of some literary works and I want to look at the distance between certain words in the text. Specifically, I am looking for parallelism.
Since I can't know the specific number of tokens in a text, I can't simply put all the words of a text into the training data, because it would not be uniform across all training data.
For example, the text:
“I have a dream that my four little children will one day live in a nation where they will not be judged by the color of their skin but by the content of their character. I have a dream today."
Is not the same text length as
"My fellow Americans, ask not what your country can do for you, ask what you can do for your country."
So therefore I could not make columns out of each word and then assign the distance in a row, because the lengths would be different.
How could I go about representing this in training data? I was under the assumption that training data had to be the same type and length.
In order to solve this problem you can use something called pad_sequences. The process: first transform the data with a vectorization technique such as TF-IDF or any other algorithm that turns the text into sequences of numbers; after converting the textual data into vectors, use the shape method to figure out the maximum length you have, and then pass that maximum to pad_sequences. Here is how you implement this method:
'''
from keras.preprocessing.sequence import pad_sequences

# name_of_your_data: a list of integer sequences; your_maximum_length: length of the longest sequence
padded_data = pad_sequences(name_of_your_data, maxlen=your_maximum_length,
                            padding='post', truncating='post')
'''
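For context, a minimal end-to-end sketch of that workflow, assuming the Keras Tokenizer is used to turn the two example texts from the question into integer sequences (the Tokenizer step is an illustrative choice, not part of the original answer):
'''
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

texts = [
    "I have a dream that my four little children will one day live in a nation ...",
    "My fellow Americans, ask not what your country can do for you ...",
]

# Turn each text into a sequence of integer word indices.
tokenizer = Tokenizer()
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)

# Pad every sequence to the length of the longest one so all rows are uniform.
max_len = max(len(s) for s in sequences)
padded = pad_sequences(sequences, maxlen=max_len, padding='post', truncating='post')
print(padded.shape)  # (2, max_len)
'''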

Tensorflow Shuffle Batch Non Deterministic

I am trying to get deterministic behaviour from tf.train.shuffle_batch(). I could, instead, use tf.train.batch() which works fine (always the same order of elements), but I need to get examples from multiple tf-records and so I am stuck with shuffle_batch().
I am using:
import random
import numpy as np
import tensorflow as tf

random.seed(0)
np.random.seed(0)
tf.set_random_seed(0)

# data and batch_size are defined elsewhere in my script
data_entries = tf.train.shuffle_batch(
    [data], batch_size=batch_size, num_threads=1, capacity=512,
    seed=57, min_after_dequeue=32)
But every time I restart my script I get slightly different results (not completely different, but about 20% of the elements are in the wrong order).
Is there anything I am missing?
Edit: Solved it! See my answer below!
Maybe I misunderstood something, but you can collect multiple tf-records in a queue with tf.train.string_input_producer(), then read the examples into tensors and finally use tf.train.batch().
Take a look at CIFAR-10 input.
Answering my own question:
First, the reason shuffle_batch is non-deterministic:
The time until I request a batch is inherently random.
In that time, a random number of tensors are available in the queue.
TensorFlow runs a shuffle operation that is seeded, but depending on the number of items available, it will return a different order.
So no matter the seeding, the order is always different unless the number of elements is constant. The solution, then, is to keep the number of elements constant, but how do we do it?
By setting capacity=min_after_dequeue+batch_size. This will force Tensorflow to fill up the queue until it reaches full capacity before dequeuing an item. Therefore, at the time of the shuffle operation, we have capacity many items which is a constant number.
So why are we doing this? Because one tf.record contains many examples but we want examples from multiple tf.records. With a normal batch we would first get all the examples of one record and then of the next one. This also means we should set min_after_dequeue to something larger than the number of items in one tf.record. In my example, I have 50 examples in one file so I set min_after_dequeue=2048.
Alternatively, we can also shuffle the examples before creating the tf.records, but this was not possible for me because I read tf.records from multiple directories (each with their own dataset).
Last note: you should also use a batch size of 1 to be super safe.
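Putting those pieces together, a minimal sketch of the deterministic configuration (assuming the TF1 queue-based pipeline from the question, where data is the tensor read from the tf-records; the min_after_dequeue value is the one mentioned above and batch_size follows the last note):
'''
import tensorflow as tf

batch_size = 1            # per the last note, the safest choice
min_after_dequeue = 2048  # larger than the number of examples in any single tf-record

# capacity = min_after_dequeue + batch_size forces TensorFlow to fill the queue
# completely before dequeuing, so the seeded shuffle always sees the same,
# constant number of items and therefore produces the same order.
data_entries = tf.train.shuffle_batch(
    [data], batch_size=batch_size, num_threads=1,
    capacity=min_after_dequeue + batch_size,
    min_after_dequeue=min_after_dequeue, seed=57)
'''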

Inverse transform word count vector to original document

I am training a simple model for text classification (currently with scikit-learn). To transform my document samples into word count vectors using a vocabulary I use
CountVectorizer(vocabulary=myDictionaryWords).fit_transform(myDocumentsAsArrays)
from sklearn.feature_extraction.text.
This works great and I can subsequently train my classifier on these word count vectors as feature vectors. But what I don't know is how to inverse transform these word count vectors back to the original documents. CountVectorizer does have a function inverse_transform(X), but this only gives you back the unique non-zero tokens.
As far as I know CountVectorizer doesn't have any implementation of a mapping back to the original documents.
Anyone know how I can restore the original sequences of tokens from their count-vectorized representation? Is there maybe a Tensorflow or any other module for this?
CountVectorizer is "lossy": for a document such as
"This is the amazing string in amazing program",
it will only store the counts of the words in the document (i.e. string -> 1, amazing -> 2, etc.) but loses the position information.
So by reversing it, you can create a document having the same words repeated the same number of times, but their sequence in the document cannot be retraced.
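To see the lossiness concretely, a small sketch using scikit-learn's CountVectorizer on the example document above:
'''
from sklearn.feature_extraction.text import CountVectorizer

docs = ["This is the amazing string in amazing program"]
cv = CountVectorizer()
X = cv.fit_transform(docs)  # sparse matrix of word counts

# inverse_transform only recovers the set of non-zero tokens per document,
# not their original order or how the repetitions were arranged.
print(cv.inverse_transform(X))
# e.g. [array(['amazing', 'in', 'is', 'program', 'string', 'the', 'this'], ...)]
'''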

Visualize Gensim's Phrases' vectors in 2D

I'm using the Phrases class and want to visualize the vectors in a 2D space. In order to do this with Word2Vec I've used T-SNE and it worked perfectly. When I'm trying to do the same with Phrases it doesn't make any sense (words appear next to irrelevant words).
Any suggestions on how to visualize the Phrases output?
As suggested/reported on the gensim mailing list, the key problem was that merely wrapping a corpus in Phrases results in an iterator that offers only one pass over the data. The Word2Vec model needs a corpus over which it can make multiple passes: one for its vocabulary discovery and then multiple passes of training. (If closely watching INFO-level logging, there should be indications that 'training' ended almost instantly in such a situation.)
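A minimal sketch of one way to address this, assuming gensim 4.x parameter names: materialize the phrase-transformed sentences into a list (or any re-iterable) before handing them to Word2Vec, so the model can make its multiple passes (the toy corpus is illustrative):
'''
from gensim.models.phrases import Phrases
from gensim.models import Word2Vec

sentences = [
    ["machine", "learning", "is", "fun"],
    ["machine", "learning", "with", "gensim"],
]  # illustrative tokenized corpus

phrases = Phrases(sentences, min_count=1, threshold=1)

# Materialize the phrase-transformed corpus into a list so Word2Vec can iterate
# over it multiple times: once for vocabulary discovery, then once per epoch.
phrased_corpus = [phrases[sentence] for sentence in sentences]

model = Word2Vec(phrased_corpus, vector_size=50, min_count=1, epochs=10)
'''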

Influence of number of steps back when using truncated backpropagation through time

I'm currently developing a model for time series prediction using LSTM cells with TensorFlow. My model is similar to the ptb_word_lm. It works, but I'm not sure how to understand the number-of-steps-back parameter when using truncated backpropagation through time (the parameter is called num_steps in the example).
As far as I understand, the model parameters are updated after every num_steps steps. But does that also mean that the model does not recognize dependencies that are farther away than num_steps? I think it should, because the internal state should capture them. But then what effect does a large/small num_steps value have?
The num_steps in the ptb_word_lm example specifies the sequence length, i.e. the number of words to be processed when predicting the next word.
For example if you have a sentence.
"Widows and orphans occur when the first line of a paragraph is the last line in a column or page, or when the last line of a paragraph is the first line of a new column or page."
If you set num_steps = 5,
then you mean, for
input = "Widows and orphans occur when"
output = "and orphans occur when the"
i.e. given the words ("Widows", "and", "orphans", "occur", "when"), you are trying to predict the occurrence of the word ("the").
So num_steps actually plays an important role in remembering the larger context (i.e. the given words) when predicting the probability of the next word.
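To make the windowing concrete, a small sketch (plain Python, not taken from the ptb_word_lm code) of how input/target sequences of length num_steps are sliced from the sentence above:
'''
words = ("Widows and orphans occur when the first line of a paragraph "
         "is the last line in a column or page").split()
num_steps = 5

# Each training example is a window of num_steps words; the target is the same
# window shifted one word to the right, so every position predicts the next word.
for i in range(0, len(words) - num_steps, num_steps):
    inputs = words[i:i + num_steps]
    targets = words[i + 1:i + num_steps + 1]
    print(inputs, "->", targets)

# First pair: ['Widows', 'and', 'orphans', 'occur', 'when'] -> ['and', 'orphans', 'occur', 'when', 'the']
'''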
Hope this is helpful.