I am using Gensim for a vector space model. After creating a dictionary and corpus with Gensim, I calculated the TF-IDF (term frequency * inverse document frequency) using the following lines:
from gensim.models import TfidfModel
Term_IDF = TfidfModel(corpus)
corpus_tfidf = Term_IDF[corpus]
corpus_tfidf contains a list of lists holding term ids and the corresponding TF-IDF weights. I then separated the TF-IDF values from the ids with the following lines:
IDS = []
tfidfmtx = []
for doc in corpus_tfidf:
    for ids, tfidf in doc:
        IDS.append(ids)
        tfidfmtx.append(tfidf)
Now I want to use k-means clustering, so I want to compute cosine similarities on the TF-IDF matrix. The problem is that Gensim does not produce a square matrix, so the following line raises an error. How can I get a square matrix from Gensim to calculate the similarities between all the documents in the vector space model? Also, how do I convert the TF-IDF matrix (which in this case is a list of lists) into a 2D NumPy array? Any comments are much appreciated.
dumydist = 1 - cosine_similarity(tfidfmtx)
When you fit your corpus to a Gensim Dictionary, get the number of documents and tokens in the dictionary:
from gensim.corpora.dictionary import Dictionary
dictionary = Dictionary(corpus_lists)
num_docs = dictionary.num_docs
num_terms = len(dictionary.keys())
Transform into bow:
corpus_bow = [dictionary.doc2bow(doc) for doc in corpus_lists]
Transform into tf-idf:
from gensim.models.tfidfmodel import TfidfModel
tfidf = TfidfModel(corpus_bow)
corpus_tfidf = tfidf[corpus_bow]
Now you can transform into sparse/dense matrix:
from gensim.matutils import corpus2dense, corpus2csc
corpus_tfidf_dense = corpus2dense(corpus_tfidf, num_terms, num_docs)
corpus_tfidf_sparse = corpus2csc(corpus_tfidf, num_terms, num_docs)
Now fit your model using either the sparse or dense matrix (after transposing, since Gensim's matrices are terms x documents):
from sklearn.cluster import KMeans

model = KMeans(n_clusters=7)
clusters = model.fit_predict(corpus_tfidf_dense.T)
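With the document-term matrix in hand, the square matrix the question asks for is just the pairwise cosine similarity of the documents. A minimal sketch, assuming scikit-learn's cosine_similarity and the corpus_tfidf_dense built above:
from sklearn.metrics.pairwise import cosine_similarity

doc_term = corpus_tfidf_dense.T                 # rows are documents, columns are terms
dumydist = 1 - cosine_similarity(doc_term)      # square num_docs x num_docs distance matrix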
To create a document-term matrix from Gensim, you may use matutils.corpus2csc.
corpus: a list of lists (a Gensim corpus)
import gensim
scipy_csc_matrix = gensim.matutils.corpus2csc(corpus)   # sparse, terms x documents
full_matrix = scipy_csc_matrix.toarray()                 # dense 2D NumPy array
You may want to keep the scipy sparse format if your corpus is very large.
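As a hedged usage sketch (my addition, not part of the answer): corpus2csc returns a terms x documents matrix, so transpose it before treating rows as documents; scikit-learn's KMeans accepts the sparse matrix directly:
from sklearn.cluster import KMeans

doc_term_sparse = scipy_csc_matrix.T.tocsr()      # documents x terms, still sparse
model = KMeans(n_clusters=7)
clusters = model.fit_predict(doc_term_sparse)     # cluster documents without densifying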
I have a pandas data frame, containing two columns: sentences and annotations:
Col 0 | Sentence                  | Annotation
1     | [This, is, sentence]      | [l1, l2, l3]
2     | [This, is, sentence, too] | [l1, l2, l3, l4]
There are several things I need to do:
split to features and labels
split into train-val-test data
vectorize train data using:
vectorize_layer = tf.keras.layers.TextVectorization(
    max_tokens=maxlen,
    standardize='lower',
    split='whitespace',
    ngrams=(1, 3),
    output_mode='tf-idf',
    pad_to_max_tokens=True,
)
I haven't worked with tensors before, so I am a little confused about how to order the steps above and how to access the information inside the tensors. Specifically, at what point do I have to split into features and labels, and how do I access one or the other? Also, should I split into features and labels before splitting into train-val-test (I want to do it properly and not use sklearn's train_test_split when working with TensorFlow), or is it the other way around?
You can split your dataset before creating the model. After splitting, you need to tokenize your sentences using:
tensorflow.keras.preprocessing.text.Tokenizer(num_words=vocab_size, oov_token=oov_tok)
After tokenizing, you need to pad the sequences using:
training_padded = pad_sequences(training_sequences, maxlen=max_length, truncating = trunc_type)
Then you can train your model with the data. For more details, please refer to this working code example. Thank you.
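As a rough sketch of the ordering (my own example; it assumes the DataFrame columns are named 'Sentence' and 'Annotation' and that a simple positional 80/10/10 split is acceptable): separate features and labels first, then split into train/val/test, then fit the tokenizer on the training sentences only:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

vocab_size, oov_tok = 10000, "<OOV>"
max_length, trunc_type = 120, "post"

# features and labels
sentences = df['Sentence'].tolist()     # each entry is a list of tokens
labels = df['Annotation'].tolist()

# train / val / test split (80 / 10 / 10, by position)
n = len(sentences)
train_end, val_end = int(0.8 * n), int(0.9 * n)
train_sentences, train_labels = sentences[:train_end], labels[:train_end]
val_sentences, val_labels = sentences[train_end:val_end], labels[train_end:val_end]
test_sentences, test_labels = sentences[val_end:], labels[val_end:]

# fit the tokenizer on the training data only, then pad
tokenizer = Tokenizer(num_words=vocab_size, oov_token=oov_tok)
tokenizer.fit_on_texts(train_sentences)
training_sequences = tokenizer.texts_to_sequences(train_sentences)
training_padded = pad_sequences(training_sequences, maxlen=max_length, truncating=trunc_type)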
I'm computing topic models with scikit-learn using this script (I'm starting from a dataset df that has one document per row in the column "Text"):
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Applying LDA
# the vectorizer object will be used to transform text to vector form
vectorizer = CountVectorizer(max_df=int(0.9*len(df)), min_df=int(0.01*len(df)), token_pattern=r'\w+|\$[\d\.]+|\S+')

# apply transformation
tf = vectorizer.fit_transform(df.Text).toarray()

# tf_feature_names tells us what word each column in the matrix represents
tf_feature_names = vectorizer.get_feature_names()

number_of_topics = 6
model = LatentDirichletAllocation(n_components=number_of_topics, random_state=0)
model.fit(tf)
I'm interested in comparing models with different numbers of topics (roughly from 2 to 20) using a coherence measure. How can I do that?
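One possible approach (a sketch of my own, not an authoritative answer; it reuses vectorizer, tf, tf_feature_names and df from the script above): extract each model's top words and score them with gensim's CoherenceModel. The 'c_v' measure and the top-10 cutoff are arbitrary choices:
from gensim.corpora.dictionary import Dictionary
from gensim.models.coherencemodel import CoherenceModel

# re-tokenize the documents with the vectorizer's analyzer so the
# coherence texts match the LDA vocabulary
analyzer = vectorizer.build_analyzer()
texts = [analyzer(doc) for doc in df.Text]
dictionary = Dictionary(texts)

def top_words(lda_model, feature_names, n_top=10):
    # top-n terms per topic from the sklearn component matrix
    return [[feature_names[i] for i in topic.argsort()[:-n_top - 1:-1]]
            for topic in lda_model.components_]

scores = {}
for k in range(2, 21):
    lda = LatentDirichletAllocation(n_components=k, random_state=0).fit(tf)
    cm = CoherenceModel(topics=top_words(lda, tf_feature_names),
                        texts=texts, dictionary=dictionary, coherence='c_v')
    scores[k] = cm.get_coherence()   # higher is better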
The code below is an example for analyzing a massive corpus. I want to restrict the term-document matrix to the 1000 most frequent unigrams, but changing the max_features parameter to n only returns the first n unigrams. Any suggestions?
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd
corpus = ['Hi my name is Joe.', 'Hi my name is Donald.']
vectorizer = TfidfVectorizer(max_features=3)
X = vectorizer.fit_transform(corpus).todense()
df = pd.DataFrame(X, columns=vectorizer.get_feature_names())
df.to_csv('test.csv')
I am assuming this is only a problem in your toy example, because the sklearn documentation for TfidfVectorizer says the following about max_features:
If not None, build a vocabulary that only consider the top max_features ordered by term frequency across the corpus.
It might be that the first n terms are picked when words have equal frequency, but otherwise it should return the correct result. If it still does not work, I strongly suggest opening a bug report in the sklearn repository. However, you can also construct a vocabulary manually (with your own interpretation of "frequency") by setting the vocabulary option:
vocabulary: Mapping or iterable, default=None
Either a Mapping (e.g., a dict) where keys are terms and values are indices in the feature matrix, or an iterable over terms. If not given, a vocabulary is determined from the input documents.
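For example, a minimal sketch of the manual route (my own code, not from the documentation; get_feature_names_out assumes a recent scikit-learn, older versions use get_feature_names):
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = ['Hi my name is Joe.', 'Hi my name is Donald.']
n_top = 3   # use 1000 for the real corpus

# count raw unigram frequencies across the whole corpus
count_vec = CountVectorizer(ngram_range=(1, 1))
counts = count_vec.fit_transform(corpus)
freqs = counts.sum(axis=0).A1                     # total frequency per term
terms = count_vec.get_feature_names_out()

# keep the n_top most frequent unigrams as a fixed vocabulary
top_terms = [t for t, _ in sorted(zip(terms, freqs), key=lambda p: -p[1])[:n_top]]

vectorizer = TfidfVectorizer(vocabulary=top_terms)
X = vectorizer.fit_transform(corpus).todense()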
I have a CSV with both categorical and float dtypes. I want to do the following:
For each categorical column I will use pandas to compute the unique values (pd.unique()) present in the column, say u_l for a column.
I will use len(u_l) to decide the dimension of the embeddings for a particular categorical column that I want to embed (this step is the reason I cannot use tensorflow_transform).
I want to create some stateful node that can map a category (token) value to an embedding index, so that I can subsequently look up the embedding from the embedding matrix I created in step 2.
I don't know how to go about doing this currently. A very inelegant solution I can see is using tensorflow_datasets:
encoder = tfds.features.text.TokenTextEncoder(u_l,decode_token_separator=' ')
Then I would concatenate the entire column using a space delimiter into one string (c_l) and call encoder.encode(c_l).
This is a very basic thing that I think TensorFlow should be able to do relatively easily. Please guide me to the right solution.
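One way to get such a stateful token-to-index mapping is sketched below; it uses tf.keras.layers.StringLookup, and the column name 'color' plus the dimension rule are purely illustrative assumptions on my part:
import pandas as pd
import tensorflow as tf

df = pd.DataFrame({'color': ['red', 'blue', 'red', 'green']})   # illustrative column
u_l = pd.unique(df['color'])                                    # step 1: unique values
embed_dim = min(50, (len(u_l) + 1) // 2)                        # step 2: pick a dimension (any rule works)

lookup = tf.keras.layers.StringLookup(vocabulary=list(u_l))     # step 3: stateful token -> index map
embedding = tf.keras.layers.Embedding(input_dim=lookup.vocabulary_size(),
                                      output_dim=embed_dim)

ids = lookup(tf.constant(df['color'].values))                   # integer ids, one per row
vectors = embedding(ids)                                        # embedding lookup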
If you want to use your word corpus for embeddings, say you have a corpus like this:
corpus:
"This pasta is good"
"This pasta is very good"
then you can use TF's Tokenizer (see this). It will create a dict with words as keys and indices as values; for the corpus above the dict looks like:
word_index = {"this" : 1, "pasta" : 2, "good" : 3, "very" : 4}
You can also drop stopwords, which is why "is" does not appear in the dict above.
Now you can build the index sequence for each sentence using this word_index dict, so that it looks like:
For corpus 1 : [1, 2, 3]
For corpus 2 : [1, 2, 4, 3]
Enough talk, let's see some code. Also define an oov_token for out-of-vocabulary words. You can do it like this:
vocab_size = 10000
embedding_dim = 16
max_length = 120
trunc_type = 'post'
oov_tok = "<OOV>"

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

tokenizer = Tokenizer(num_words=vocab_size, oov_token=oov_tok)
tokenizer.fit_on_texts(training_sentences)
word_index = tokenizer.word_index
sequences = tokenizer.texts_to_sequences(training_sentences)  # word-index sequences, one per sentence
padded = pad_sequences(sequences, maxlen=max_length, truncating=trunc_type)  # pads with zeros; truncating='post' drops tokens from the end of long sentences
testing_sequences = tokenizer.texts_to_sequences(testing_sentences)
testing_padded = pad_sequences(testing_sequences, maxlen=max_length)
Also see this GitHub code of mine; I hope it helps.
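If it helps, here is a minimal sketch of a model that consumes the padded sequences above; the architecture, the binary output and the label arrays training_labels/testing_labels are assumptions for illustration, not part of the answer:
import numpy as np
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_length),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(24, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid'),   # assumes binary labels
])
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# training_labels / testing_labels assumed to be NumPy arrays of 0/1 labels
model.fit(padded, np.array(training_labels), epochs=10,
          validation_data=(testing_padded, np.array(testing_labels)))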
I am following the WildML blog on text classification using TensorFlow. I am not able to understand the purpose of max_document_length in the code statement:
vocab_processor = learn.preprocessing.VocabularyProcessor(max_document_length)
Also, how can I extract the vocabulary from the vocab_processor?
I have figured out how to extract the vocabulary from the VocabularyProcessor object. This worked perfectly for me.
import numpy as np
from tensorflow.contrib import learn
x_text = ['This is a cat','This must be boy', 'This is a a dog']
max_document_length = max([len(x.split(" ")) for x in x_text])
## Create the vocabularyprocessor object, setting the max length of the documents.
vocab_processor = learn.preprocessing.VocabularyProcessor(max_document_length)
## Transform the documents using the vocabulary.
x = np.array(list(vocab_processor.fit_transform(x_text)))
## Extract word:id mapping from the object.
vocab_dict = vocab_processor.vocabulary_._mapping
## Sort the vocabulary dictionary on the basis of values(id).
## Both statements perform same task.
#sorted_vocab = sorted(vocab_dict.items(), key=operator.itemgetter(1))
sorted_vocab = sorted(vocab_dict.items(), key = lambda x : x[1])
## Treat the id's as index into list and create a list of words in the ascending order of id's
## word with id i goes at index i of the list.
vocabulary = list(list(zip(*sorted_vocab))[0])
print(vocabulary)
print(x)
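As a small follow-up (my addition), the same mapping can be inverted to decode a transformed row back into words:
## Invert the word:id mapping and decode the first transformed document.
id_to_word = {idx: word for word, idx in vocab_dict.items()}
print([id_to_word.get(int(idx), '<UNK>') for idx in x[0]])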
Regarding "not able to understand the purpose of max_document_length": the VocabularyProcessor maps your text documents into vectors, and you need these vectors to be of a consistent length.
Your input data records may not (and probably won't) all be the same length. For example, if you're working with sentences for sentiment analysis, they'll be of various lengths. You provide this parameter to the VocabularyProcessor so that it can adjust the length of the output vectors. According to the documentation:
max_document_length: Maximum length of documents. if documents are longer, they will be trimmed, if shorter - padded.
Check out the source code.
def transform(self, raw_documents):
  """Transform documents to word-id matrix.

  Convert words to ids with vocabulary fitted with fit or the one
  provided in the constructor.

  Args:
    raw_documents: An iterable which yield either str or unicode.

  Yields:
    x: iterable, [n_samples, max_document_length]. Word-id matrix.
  """
  for tokens in self._tokenizer(raw_documents):
    word_ids = np.zeros(self.max_document_length, np.int64)
    for idx, token in enumerate(tokens):
      if idx >= self.max_document_length:
        break
      word_ids[idx] = self.vocabulary_.get(token)
    yield word_ids
Note the line word_ids = np.zeros(self.max_document_length).
Each row in the raw_documents variable will be mapped to a vector of length max_document_length.