I would like to use a Naive Bayes classifier for classifying text. If a word such as "beautiful" occurs many times in one text, do I count every occurrence or only distinct occurrences?
For training, Naive Bayes uses the distinct words from the training texts, but when classifying a text it uses every word occurrence in the text for scoring.
My Java implementation of Naive Bayes is here: http://pastebin.com/YtTa2cXm
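To make the distinction concrete, here is a minimal sketch of the two usual ways to turn a text into Naive Bayes features: counting every occurrence (the multinomial event model) versus recording only whether a word is present (the Bernoulli event model). The function names and the whitespace tokenizer are illustrative assumptions and are not taken from the linked implementation.

```python
from collections import Counter

def tokenize(text):
    # Naive whitespace tokenizer; an assumption for illustration only.
    return text.lower().split()

def multinomial_features(text):
    # Every occurrence counts: a word that appears 3 times contributes 3.
    return Counter(tokenize(text))

def bernoulli_features(text):
    # Only presence matters: repeated words are recorded once.
    return {word: 1 for word in set(tokenize(text))}

text = "beautiful day beautiful sky beautiful"
counts = multinomial_features(text)
presence = bernoulli_features(text)
print(counts["beautiful"], presence["beautiful"])
```

Which variant works better depends on the data, so it is worth trying both on a held-out set.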
I have completed some online courses on sentence classification problems using TensorFlow.
But I don't understand how to start on the following problem.
I am interested in the binary classification of sentences based on the meaning of a specified word. This word can have two meanings, and I want to train a model that will tell them apart.
I have training data. All sentences contain this word, and each sentence has a label of 0 or 1.
Do I need a neural network for this, or can it be done using the nltk library?
How should I implement such a project? I have learned about word embeddings, but I have no idea how to use them in this project.
Where can I read about it?
I am quite new to machine learning and neural nets. I've used the following model for sentiment analysis of short texts. I generally understand how signals are computed, all the way to the output layer. What I don't understand is how the inputs are found. When the model classifies a word, how is that word translated to the 512 input units? What features of the word does the model assess, and how is that decided?
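# Assumed imports; the snippet as posted did not show them.
from keras.models import Sequential
from keras.layers import Dense, Dropout

# max_words is the length of each input vector and must be defined beforehand.
model = Sequential()
model.add(Dense(512, input_shape=(max_words,), activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(256, activation='sigmoid'))
model.add(Dropout(0.5))
model.add(Dense(2, activation='softmax'))
model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])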
When the model classifies a word, how is that word translated to the 512 input units?
As you already noticed, before any kind of written information (single words, sentences or whole texts) can be processed by a neural network, it must be encoded into a vector representation. This is called an embedding or a representation, and finding suitable embeddings is a subfield of Natural Language Processing (NLP) research.
Over the years, a number of different representations have been published. For single words there is, for example, Word2Vec, in which a neural network has "learned" the embedding based on the semantic similarity of the words: words that appear in similar contexts should lie close together in the vector space.
The simplest embedding for a sentence is a bag-of-words embedding. We count how many distinct words occur in our corpus of sentences (say N) and transform each sentence into a vector of length N, where each index represents one word and the value at that index is the number of occurrences of that word in the sentence.
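A minimal sketch of the bag-of-words encoding just described, in plain Python. The helper names and the tiny corpus are made up for illustration. Note that in the model from the question the input vector has length max_words (per input_shape), which is presumably a vector of this kind; 512 is the number of units in the first Dense layer, not the size of the input.

```python
def build_vocabulary(sentences):
    # Map each distinct word in the corpus to a fixed index (0..N-1).
    words = sorted({w for s in sentences for w in s.lower().split()})
    return {w: i for i, w in enumerate(words)}

def bag_of_words(sentence, vocab):
    # Vector of length N; position i holds the count of word i.
    vector = [0] * len(vocab)
    for w in sentence.lower().split():
        if w in vocab:  # words outside the vocabulary are ignored
            vector[vocab[w]] += 1
    return vector

corpus = ["the movie was good", "the movie was bad bad"]
vocab = build_vocabulary(corpus)          # 5 distinct words
vec = bag_of_words("bad bad movie", vocab)
print(vocab)
print(vec)
```

Word order is lost in this encoding, which is the main thing the more sophisticated embeddings try to recover.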
Of course there are many more sophisticated text embeddings.
There are multiple methods by which you can obtain the vector embedding of a word.
Count based methods: PMI, PPMI and SVD
Prediction based methods: CBOW and Skip-Gram
The count-based methods build a co-occurrence matrix of shape Vocabulary × Vocabulary, where each word is represented by some count of its co-occurrences with the other words within a window of K neighboring words.
The prediction-based models are trained on a corpus and produce vector embeddings based on how similar the contexts of two words are.
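As a rough illustration of the count-based route, the sketch below counts co-occurrences within a window of K words and turns them into PPMI scores, using PPMI(w, c) = max(0, log2(P(w, c) / (P(w) P(c)))). The function names, the toy corpus, and the choice of window=1 are assumptions for the example; a real pipeline would typically use sparse matrices and possibly SVD on top.

```python
import math
from collections import Counter

def cooccurrence_counts(sentences, window=2):
    # Count (word, context) pairs that appear within `window` positions of each other.
    pairs = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    pairs[(word, tokens[j])] += 1
    return pairs

def ppmi(pairs):
    # Positive pointwise mutual information for every observed pair.
    total = sum(pairs.values())
    word_totals, context_totals = Counter(), Counter()
    for (w, c), n in pairs.items():
        word_totals[w] += n
        context_totals[c] += n
    scores = {}
    for (w, c), n in pairs.items():
        pmi = math.log2(n * total / (word_totals[w] * context_totals[c]))
        scores[(w, c)] = max(0.0, pmi)
    return scores

corpus = ["the cat sat on the mat", "the dog sat on the rug"]
pairs = cooccurrence_counts(corpus, window=1)
scores = ppmi(pairs)
print(pairs[("sat", "on")], scores[("cat", "sat")])
```

Each word's row of scores (one entry per context word) can then serve as its vector.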
In spaCy's text classification example train_textcat, two labels are specified, POSITIVE and NEGATIVE. Hence the cats score is represented as
cats = [{"POSITIVE": bool(y), "NEGATIVE": not bool(y)} for y in labels]
I am working on multilabel classification, which means I have more than two labels to tag in one text. I have added my labels as
textcat.add_label("CONSTRUCTION")
and to specify cats score I have used
cats = [{"POSITIVE": bool(y), "NEGATIVE": not bool(y)} for y in labels]
I am pretty sure this is not correct. Any suggestions on how to specify the cats scores for multilabel classification, and how to train a multilabel classifier? Does the example from spaCy work for multilabel classification too?
If I understood you correctly, you have a list of categories, and each data item can belong to multiple categories at once. In that case you cannot use "POSITIVE": bool(y), "NEGATIVE": not bool(y) to mark your classes. Instead, try writing a function that returns a dictionary of categories based on the classes. For example, consider the following list of categories: categories = ['POLITICS', 'ECONOMY', 'SPORT']. You can then iterate over your training data, calling the function for each training example.
The function can look like this:
def func(categories):
    cats = {'POLITICS': 0, 'ECONOMY': 0, 'SPORT': 0}
    for category in categories:
        cats[category] = 1
    return {'cats': cats}
For a training example with two categories (for example POLITICS and ECONOMY), you can call this function with a list of categories (labels = func(['POLITICS', 'ECONOMY'])) and you will get a full dictionary of classes for this example.
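A slightly more general sketch of the same idea, which builds the label dictionary from any label list and pairs it with the text in the (text, {"cats": ...}) shape that the example script's training data appears to use. The names make_cats, LABELS and raw_data are hypothetical, as are the sample sentences; no spaCy call is made here.

```python
LABELS = ["POLITICS", "ECONOMY", "SPORT"]  # example label set from the answer above

def make_cats(active_labels, all_labels=LABELS):
    # Reject labels that were never registered, to catch typos early.
    unknown = set(active_labels) - set(all_labels)
    if unknown:
        raise ValueError(f"Unknown labels: {sorted(unknown)}")
    # Every known label gets an entry; several can be True at once.
    return {"cats": {label: label in active_labels for label in all_labels}}

raw_data = [  # hypothetical examples
    ("Parliament votes on the new budget", ["POLITICS", "ECONOMY"]),
    ("Local team wins the cup", ["SPORT"]),
]
train_data = [(text, make_cats(labels)) for text, labels in raw_data]
print(train_data[0])
```

Each label still has to be registered on the component with textcat.add_label(...), as in the question.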
The example scripts are mainly quick demos for a single use case and you're right that this isn't the right kind of evaluation for a multilabel case.
The underlying spacy Scorer and the spacy evaluate CLI (https://spacy.io/api/cli#evaluate) report the macro-averaged AUC ROC score for multilabel classification.
You can use the Scorer with nlp.evaluate() (https://spacy.io/api/language#evaluate) or through spacy evaluate / spacy train.
If your data is in the simple TRAIN_DATA format from the example script, nlp.evaluate() is probably the easiest way to run the Scorer, since spacy evaluate would require you to convert your data to spacy's internal JSON training format.
The model settings (specified when you initialize the pipeline component) are used to pick an appropriate evaluation metric (obviously these aren't the only possible metrics, just one suitable metric for each configuration):
f-score on positive label for binary exclusive,
macro-averaged f-score for 3+ exclusive,
macro-averaged AUC ROC score for multilabel
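For readers unfamiliar with macro averaging, here is a small hand-rolled sketch of a macro-averaged F-score for the exclusive multi-class case: compute F1 per label, then take the unweighted mean. This is not spaCy's Scorer code, only an illustration of the metric; the function names and toy labels are assumptions.

```python
def f1_for_label(gold, pred, label):
    # Per-label true positives, false positives and false negatives.
    tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
    fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
    fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def macro_f1(gold, pred, labels):
    # Unweighted mean of the per-label F1 scores.
    return sum(f1_for_label(gold, pred, label) for label in labels) / len(labels)

gold = ["A", "A", "B", "C"]
pred = ["A", "B", "B", "C"]
score = macro_f1(gold, pred, ["A", "B", "C"])
print(round(score, 4))
```

Because every label weighs the same, rare labels influence the macro average as much as frequent ones.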
I have two questions about how to use the TensorFlow implementation of the Transformer for text classification.
First, it seems people mostly use only the encoder layer for text classification. However, the encoder layer generates one prediction for each input word. Based on my understanding of Transformers, the input to the encoder at each step is one word from the input sentence. The attention weights and the output are then calculated using the current input word, and we repeat this process for all of the words in the input sentence. As a result, we end up with a pair of (attention weights, outputs) for each word in the input sentence. Is that correct? If so, how would you use these pairs to perform text classification?
Second, based on the TensorFlow implementation of the Transformer here, they embed the whole input sentence into one vector and feed a batch of these vectors to the Transformer. However, I expected the input to be a batch of words instead of sentences, based on what I've learned from The Illustrated Transformer.
Thank you!
There are two approaches you can take:
Just average the states you get from the encoder;
Prepend a special token [CLS] (or whatever you like to call it) and use the hidden state for the special token as input to your classifier.
The second approach is used by BERT. During pre-training, the hidden state corresponding to this special token is used for predicting whether two sentences are consecutive. In downstream tasks, it is also used for sentence classification. However, my experience is that averaging the hidden states sometimes gives a better result.
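The two pooling options can be sketched in a few lines of plain Python, independent of any framework. Here hidden_states stands for the encoder output of one sentence (one vector per token) and mask marks real tokens versus padding; both names and the toy numbers are assumptions for illustration, and position 0 is assumed to hold the special token.

```python
def mean_pool(hidden_states, mask):
    # Average only over real tokens (mask == 1), ignoring padding positions.
    dim = len(hidden_states[0])
    kept = [h for h, m in zip(hidden_states, mask) if m]
    return [sum(h[d] for h in kept) / len(kept) for d in range(dim)]

def cls_pool(hidden_states):
    # Assumes the special [CLS] token was prepended, so it sits at position 0.
    return hidden_states[0]

# One sentence: three real tokens plus one padding position, hidden size 2.
hidden_states = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 1, 0]

sentence_mean = mean_pool(hidden_states, mask)
sentence_cls = cls_pool(hidden_states)
print(sentence_mean, sentence_cls)
```

Either resulting vector is then fed to a small classifier, for example a single dense layer with softmax.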
Instead of training a Transformer model from scratch, it is probably more convenient to use (and eventually finetune) a pre-trained model (BERT, XLNet, DistilBERT, ...) from the transformers package. It has pre-trained models ready to use in PyTorch and TensorFlow 2.0.
Transformers are designed to take the whole input sentence at once. The main motivation for designing the Transformer was to enable parallel processing of the words in a sentence. This parallel processing is not possible in LSTMs, RNNs or GRUs, as they take the words of the input sentence one by one.
So in the encoder part of the Transformer, the very first layer has as many units as there are words in the sentence, and each unit converts its word into the corresponding embedding vector. The rest of the processing then follows. For more details, you can go through the article: http://jalammar.github.io/illustrated-transformer/
How to use this Transformer for text classification: since the output in text classification is a single number rather than a sequence of numbers or vectors, we can remove the decoder part and use only the encoder. The output of the encoder is a set of vectors, one for each word in the input sentence. We can then feed this set of output vectors into a CNN, or add an LSTM or RNN model on top, and perform the classification.
The input is the whole sentence, or a batch of sentences, not one word at a time. That is probably the point you misunderstood.
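To illustrate what "a batch of sentences" looks like as input, here is a small sketch that converts sentences to token ids and pads them to a common length, together with a mask that marks the real tokens. The vocabulary construction, the padding id of 0, and the whitespace tokenizer are assumptions for the example; real models ship their own tokenizers.

```python
PAD_ID = 0  # assumed id reserved for padding

def build_vocab(sentences):
    # Assign ids in order of first appearance, keeping 0 for padding.
    vocab = {"[PAD]": PAD_ID}
    for sentence in sentences:
        for token in sentence.lower().split():
            vocab.setdefault(token, len(vocab))
    return vocab

def encode_batch(sentences, vocab):
    # One row of token ids per sentence, padded to the longest sentence.
    ids = [[vocab[t] for t in s.lower().split()] for s in sentences]
    max_len = max(len(row) for row in ids)
    padded = [row + [PAD_ID] * (max_len - len(row)) for row in ids]
    mask = [[1] * len(row) + [0] * (max_len - len(row)) for row in ids]
    return padded, mask

sentences = ["i like this film", "terrible"]
vocab = build_vocab(sentences)
batch_ids, batch_mask = encode_batch(sentences, vocab)
print(batch_ids)
print(batch_mask)
```

The encoder receives the whole (batch, sequence length) array at once; the embedding layer then looks up one vector per id, giving one vector per word rather than one per sentence.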
I am trying to cluster some sentences using similarity (maybe cosine) and then maybe use a classifier to put the text into predefined classes.
My idea is to use TensorFlow to generate the word embeddings, then average them for each sentence, and then apply a clustering/classification algorithm.
Does TensorFlow provide a ready-to-use word2vec generation algorithm?
Would a bag-of-words model generate good output?
No, TensorFlow does not provide a ready-to-use word2vec, but it does have a tutorial on word2vec.
Yes, a bag of words can produce surprisingly good output (though not state-of-the-art), and it has the benefit of being much faster. I have a small amount of data (tens of thousands of sentences) and have achieved F1 scores above 0.90 for classification.
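The averaging idea from the question can be sketched as follows: look up a vector per word, average them into a sentence vector, and compare sentences with cosine similarity. The three-dimensional vectors in WORD_VECTORS are made-up values standing in for trained embeddings, and the function names are hypothetical.

```python
import math

# Toy 3-dimensional word vectors; hypothetical values, not trained embeddings.
WORD_VECTORS = {
    "good": [1.0, 0.0, 0.0],
    "great": [0.9, 0.1, 0.0],
    "bad": [-1.0, 0.0, 0.0],
    "movie": [0.0, 1.0, 0.0],
}

def sentence_vector(sentence):
    # Average the vectors of all known words; unknown words are skipped.
    vectors = [WORD_VECTORS[w] for w in sentence.lower().split() if w in WORD_VECTORS]
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]

def cosine(a, b):
    # Cosine similarity; returns 0.0 if either vector has zero length.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

sim_close = cosine(sentence_vector("good movie"), sentence_vector("great movie"))
sim_far = cosine(sentence_vector("good movie"), sentence_vector("bad movie"))
print(sim_close, sim_far)
```

The resulting sentence vectors can be passed to any standard clustering or classification algorithm.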