Normalization in TensorFlow text classification

A few days ago I read about text classification with TensorFlow. I found this repo on GitHub and built my own model based on the example: https://github.com/dmesquita/understanding_tensorflow_nn
It works well, but I don't understand this part:
for text in texts:
    layer = np.zeros(total_words, dtype=float)
    for word in text.split(' '):
        layer[word2index[word.lower()]] += 1
When the same word appears a second time, it increases the value in layer (+= 1), but where is the normalization? I read that neural networks work with input values between 0 and 1. I scanned all of the code but couldn't find any normalization. Can anybody explain why? Is it a mistake in the example?
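(For reference, a minimal sketch of the kind of scaling I would have expected, assuming layer is the count vector built above:)

if layer.max() > 0:
    layer = layer / layer.max()   # scale raw counts into [0, 1]
    # or layer / layer.sum() for relative term frequencies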
And a second question: when we build the vocabulary in this example, we use Counter():
for text in newsgroups_train.data:
    for word in text.split(' '):
        vocab[word.lower()] += 1
We increment a vocab entry every time we encounter a word, but what for? We don't use these counts later.

Related

Does keras.Tokenizer.texts_to_sequences simply translate into number vectors, or something more?

I'm currently trying to learn the ins and outs of Keras.
While working with a dataset containing sentences, I'm doing the following:
from keras.preprocessing.text import Tokenizer
max_features = 2000
tokens = Tokenizer(num_words=max_features)
tokens.fit_on_texts(list(X_train))
tokenized_train = tokens.texts_to_sequences(X_train) # Converting to ints
tokenized_test = tokens.texts_to_sequences(X_test)
Now, after doing this, I seem to have a vector for every word, but I am unsure about something: when I call tokens.texts_to_sequences(), is each word simply being converted to an integer, or is something more going on?
By "something more" I mean: is Keras able to cluster words that are semantically related?
Edit: if anyone knows what is going on behind the scenes (Word2Vec, GloVe or fastText), that would be really cool to know too.
Here is what happens.
When you first create a Tokenizer with num_words and then call fit_on_texts on it, the top num_words words with the highest frequency in your text are kept, and a simple dictionary is created in the following format:
dict = {'topmost_word': 1,
        '2nd_topmost_word': 2,
        ...
        'num_words-th_topmost_word': num_words}
Thus, the words are simply being converted into integers. No other magic happens here.
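As a quick sanity check, here is a minimal sketch with a tiny toy corpus (not the asker's data) that prints the word-to-integer mapping and the resulting sequences:

from keras.preprocessing.text import Tokenizer

texts = ["the cat sat", "the dog sat on the mat"]   # toy corpus
tokens = Tokenizer(num_words=10)
tokens.fit_on_texts(texts)

print(tokens.word_index)                 # e.g. {'the': 1, 'sat': 2, 'cat': 3, ...}
print(tokens.texts_to_sequences(texts))  # each word replaced by its integer index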
The "something more" about clustering happens in tf.keras.layers.Embedding, where an embedding is learned for each word. The embeddings you get from there are learned vectors, and words used in similar contexts will end up close to each other.

How to reduce the dimension of the document embedding?

Let us assume that I have a set of D document embeddings.
Each document embedding consists of N word vectors, where each of these pre-trained vectors has 300 dimensions.
The corpus would be represented as [D, N, 300].
My question is: what would be the best way to reduce [D, N, 300] to [D, 1, 300]? How should I represent each document as a single vector instead of N vectors?
Thank you in advance.
I would say that what you are looking for is doc2vec. Using this you can convert a whole document into a single 300-dimensional vector. You can use it like this:
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# your_documents: a list of tokenized documents (each one a list of words)
documents = [TaggedDocument(doc, [i]) for i, doc in enumerate(your_documents)]
model = Doc2Vec(documents, vector_size=300, window=2, min_count=1, workers=4)
This will train the model on your data, and you will be able to represent each document with a single vector, as you specified in the question.
You can run inference with:
vector = model.infer_vector(doc_words)   # doc_words: a list of tokens for a new document
I hope this is helpful :)
It's fairly common and fairly (perhaps surprisingly) effective to simply average the word vectors.
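For instance, a minimal sketch assuming the corpus is already a [D, N, 300] NumPy array named corpus:

import numpy as np

doc_vectors = corpus.mean(axis=1)   # shape (D, 300): one averaged vector per document
# use corpus.mean(axis=1, keepdims=True) if you want the (D, 1, 300) shape from the question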
Good question, but all of the answers will result in some loss of information. The best approach is to use a Bi-LSTM/GRU layer, feed your word embeddings into that layer, and take the output of the last time step.
The output of the last time step will carry the contextual information of the document in both the forward and backward directions, so this is a good way to get what you want, since the model learns the representation.
Note that the larger the document, the greater the loss of information.
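A minimal sketch of this idea in Keras (layer sizes and N are placeholders, and the [D, N, 300] pre-trained vectors are assumed to be fed in directly):

import tensorflow as tf

N = 50   # assumed number of word vectors per document

inputs = tf.keras.Input(shape=(N, 300))   # pre-trained word vectors
# with return_sequences=False (the default) only the last time step is kept,
# concatenating the forward and backward directions into one 300-d document vector
doc_vector = tf.keras.layers.Bidirectional(tf.keras.layers.GRU(150))(inputs)
outputs = tf.keras.layers.Dense(1, activation="sigmoid")(doc_vector)   # e.g. a classifier head
model = tf.keras.Model(inputs, outputs)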

Inverse transform word count vector to original document

I am training a simple model for text classification (currently with scikit-learn). To transform my document samples into word count vectors using a vocabulary, I use
CountVectorizer(vocabulary=myDictionaryWords).fit_transform(myDocumentsAsArrays)
from sklearn.feature_extraction.text.
This works great and I can subsequently train my classifier on these word count vectors as feature vectors. But what I don't know is how to inverse-transform these word count vectors back to the original documents. CountVectorizer does have a function inverse_transform(X), but this only gives you back the unique non-zero tokens.
As far as I know, CountVectorizer doesn't have any implementation of a mapping back to the original documents.
Does anyone know how I can restore the original sequences of tokens from their count-vectorized representation? Is there maybe a TensorFlow module or any other module for this?
CountVectorizer is "lossy", i.e. for a document like:
This is the amazing string in amazing program
it will only store the counts of the words in the document (i.e. string -> 1, amazing -> 2, etc.), but it loses the position information.
So by reversing it, you can create a document with the same words repeated the same number of times, but their original sequence in the document cannot be recovered.
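To see the lossiness concretely, here is a small sketch (with a toy document, not the asker's data):

from sklearn.feature_extraction.text import CountVectorizer

docs = ["This is the amazing string in amazing program"]
vec = CountVectorizer()
X = vec.fit_transform(docs)

print(vec.inverse_transform(X))   # only the unique non-zero tokens; word order is gone
counts = X.toarray()[0]
words = vec.get_feature_names_out()   # get_feature_names() on older scikit-learn versions
bag = [w for w, c in zip(words, counts) for _ in range(c)]
print(bag)   # same words with the right multiplicities, but not the original sequence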

Influence of number of steps back when using truncated backpropagation through time

I'm currently developing a model for time series prediction using LSTM cells with TensorFlow. My model is similar to ptb_word_lm. It works, but I'm not sure how to interpret the number-of-steps-back parameter when using truncated backpropagation through time (the parameter is called num_steps in the example).
As far as I understand, the model parameters are updated after every num_steps steps. But does that also mean that the model does not recognize dependencies that are farther away than num_steps? I think it should, because the internal state should capture them. But then what effect does a large/small num_steps value have?
The num_steps parameter in the ptb_word_lm example is the sequence length, i.e. the number of words processed to predict the next word.
For example, if you have the sentence:
"Widows and orphans occur when the first line of a paragraph is the last line in a column or page, or when the last line of a paragraph is the first line of a new column or page."
If you set num_steps = 5, then you mean:
input = "Widows and orphans occur when"
output = "and orphans occur when the"
i.e. given the words ("Widows", "and", "orphans", "occur", "when"), you are trying to predict the occurrence of the word ("the").
So num_steps actually plays an important role in remembering the larger context (i.e. the given words) when predicting the probability of the next word.
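A minimal sketch (not taken from the ptb_word_lm code) of how num_steps slices a token sequence into input/target windows, where tokens is an assumed list of words or word ids:

def make_windows(tokens, num_steps):
    inputs, targets = [], []
    for i in range(len(tokens) - num_steps):
        inputs.append(tokens[i:i + num_steps])           # e.g. words 0..4
        targets.append(tokens[i + 1:i + num_steps + 1])  # the same window shifted by one
    return inputs, targets

words = "Widows and orphans occur when the first line".split()
x, y = make_windows(words, num_steps=5)
print(x[0])  # ['Widows', 'and', 'orphans', 'occur', 'when']
print(y[0])  # ['and', 'orphans', 'occur', 'when', 'the']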
Hope this is helpful.

Using tensorflow for sequence tagging: Synced sequence input and output

I would like to use TensorFlow for sequence tagging, namely part-of-speech tagging. I tried to use the same model outlined here: http://tensorflow.org/tutorials/seq2seq/index.md (which outlines a model to translate English to French).
Since in tagging the input sequence and output sequence have exactly the same length, I configured the buckets so that input and output sequences have the same length, and tried to learn a POS tagger using this model on CoNLL 2000.
However, it seems that the decoder sometimes outputs a tagged sequence shorter than the input sequence (the EOS tag seems to appear prematurely).
For example:
He reckons the current account deficit will narrow to only # 1.8 billion in September .
The above sentence is tokenized into 18 tokens, which get padded to 20 (due to bucketing).
When asked to decode the above, the decoder spits out the following:
PRP VBD DT JJ JJ NN MD VB TO VB DT NN IN NN . _EOS . _EOS CD CD
So here it ends the sequence (EOS) after 15 tokens, not 18.
How can I force the model to learn that the decoded sequence should be the same length as the encoded one in my scenario?
If your input and output sequences are the same length, you probably want something simpler than a seq2seq model (since handling different sequence lengths is one of its strengths).
Have you tried just training (word -> tag)?
Note that for something like POS tagging, where there is a clear signal from the tokens on either side, you'll definitely get a benefit from a bidirectional net.
If you want to go all out, there are some fun character-level variants too, where you only emit the tag at the token boundary (the rationale being that POS tagging benefits from character-level features, e.g. things like out-of-vocab names). So many variants to try! :D
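A minimal sketch of that simpler word -> tag setup in Keras (not the seq2seq tutorial code; vocab_size, num_tags and max_len are assumed to come from your own preprocessing):

import tensorflow as tf

vocab_size, num_tags, max_len = 10000, 45, 20   # placeholder values

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, 128),                                      # word ids -> vectors
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64, return_sequences=True)),  # context from both sides
    tf.keras.layers.TimeDistributed(tf.keras.layers.Dense(num_tags, activation="softmax")),  # one tag per token
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
# x: (batch, max_len) word ids, y: (batch, max_len) tag ids -> output stays the same length as the input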
There are various ways of specifying an end-of-sequence parameter. The translate demo uses an <EOS> flag to determine the end of the sequence. However, you can also determine the end of the sequence by counting the number of expected words in the output. In lines 225-227 of translate.py:
# If there is an EOS symbol in outputs, cut them at that point.
if data_utils.EOS_ID in outputs:
    outputs = outputs[:outputs.index(data_utils.EOS_ID)]
You can see that the outputs are cut off whenever <EOS> is encountered. You can easily tweak this to constrain the number of output words. You might also consider getting rid of the <EOS> flag altogether during training, depending on your application.
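For example, one possible tweak (a sketch, not part of the original translate.py) would be to ignore <EOS> entirely and keep exactly as many output tags as there are input tokens, where input_tokens is assumed to be the tokenized input sentence:

outputs = outputs[:len(input_tokens)]   # force the tagged output to match the input length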
I ran into the same problem. In the end I found that the ptb_word_lm.py example in TensorFlow's examples is exactly what we need for tokenization, NER and POS tagging.
If you look into the details of the language model example, you can see that it treats the input sequence as X and X shifted right by one position as Y. That is exactly what fixed-length sequence labeling needs.
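A minimal sketch of the X / Y shift described above, where tokens is an assumed list of word ids for one sentence:

tokens = [12, 7, 93, 4, 55, 2]   # placeholder word ids
x = tokens[:-1]   # input sequence
y = tokens[1:]    # the same sequence shifted right by one position (the targets)
# for tagging, you would instead pair x with a tag sequence of the same length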