Let us assume that I have a set of D document embeddings.
Each document embedding consists of N word vectors, where each of these pre-trained vectors has 300 dimensions.
The corpus would therefore be represented as [D, N, 300].
My question is: what would be the best way to reduce [D, N, 300] to [D, 1, 300]? How should I represent a document with a single vector instead of N vectors?
Thank you in advance.
I would say that what you are looking for is doc2vec. Using it, you can convert a whole document into a single 300-dimensional vector. You can use it like this:
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# your_documents: a list of tokenised documents, e.g. [['first', 'doc', ...], ['second', 'doc', ...], ...]
documents = [TaggedDocument(doc, [i]) for i, doc in enumerate(your_documents)]
model = Doc2Vec(documents, vector_size=300, window=2, min_count=1, workers=4)
This will train the model on your data, and you will be able to represent each document with a single vector, as you specified in the question.
You can run inference on a new document with:
vector = model.infer_vector(doc_words)  # doc_words: the list of tokens of the new document
I hope this is helpful :)
It's fairly common, and perhaps surprisingly effective, to simply average the word vectors.
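For example, a minimal NumPy sketch, assuming the corpus is stored as an array of shape [D, N, 300] (the array below is a random placeholder):

import numpy as np

corpus = np.random.rand(8, 50, 300)  # placeholder for your [D, N, 300] embeddings

# Average the N word vectors of each document into one 300-dimensional vector.
doc_vectors = corpus.mean(axis=1)            # shape (D, 300)
doc_vectors = doc_vectors[:, np.newaxis, :]  # shape (D, 1, 300), if the middle axis is needed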
Good question, but all the answers will result in some loss of information. The best way for you is to use a Bi-LSTM/GRU layer, provide your word embeddings as input to that layer, and take the output of the last time step.
The output of the last time step will contain the contextual information of the document in both the forward and backward directions, so this is a good way to get what you want, as the model learns the representation.
Note that the larger the document, the greater the loss of information.
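As a rough illustration, here is a minimal Keras sketch, assuming documents are padded or truncated to N = 50 words and fed in as sequences of pre-trained 300-dimensional vectors; the layer sizes are arbitrary choices:

import tensorflow as tf
from tensorflow.keras import layers, models

N = 50  # assumed maximum document length in words

# Input: one document as a sequence of N pre-trained 300-d word vectors.
inputs = layers.Input(shape=(N, 300))

# Bidirectional LSTM; with return_sequences=False (the default) it returns only the
# final output, summarising the document in both directions (150 + 150 = 300 dims).
doc_vector = layers.Bidirectional(layers.LSTM(150))(inputs)

model = models.Model(inputs, doc_vector)
model.summary()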
The paper time2vector (link; the relevant theory is in section 4) shows an approach to include a time embedding for features in order to improve model performance. I would like to give this a try. I found an implementation as a Keras layer, which I changed a little bit. Basically it creates two matrices for one feature:
(1) linear = w * x + b
(2) periodic = sin(w * x + b)
Currently I choose this feature manually. Concerning the paper, there are a few things I don't understand. The first is the term k, the number of sinusoids. The authors use up to 64 sinusoids. What does this mean? I have just 1 sinusoid at the moment, right? Secondly, I'm about to put every feature I have through the sine transformation; for my dataset that would make 6 (sinusoid) periodic features. The authors use only one linear term. How should I choose the feature for the linear term? Unfortunately the code from the paper is not available anymore. Has anyone worked with time embeddings, or even with this particular approach?
From my limited understanding, the linear transformation of time is a fixed element of the produced embedding, and the parameter k lets you select how many different learned time representations you want to use in your model. So the resulting embedding has a size of k + 1 elements.
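To make that concrete, here is a rough Keras sketch of the idea as I understand it (an assumption based on the two formulas above, not the authors' original code): one learned linear term plus k learned sinusoids, applied to a single scalar time feature.

import tensorflow as tf
from tensorflow.keras import layers

class Time2Vec(layers.Layer):
    # Rough sketch of a time2vec-style embedding: 1 linear term + k periodic terms.
    def __init__(self, k=64, **kwargs):
        super().__init__(**kwargs)
        self.k = k

    def build(self, input_shape):
        # Linear term: w0 * t + b0
        self.w0 = self.add_weight(name='w0', shape=(1,), initializer='uniform')
        self.b0 = self.add_weight(name='b0', shape=(1,), initializer='uniform')
        # k periodic terms: sin(w_i * t + b_i), each with its own learned frequency and phase
        self.w = self.add_weight(name='w', shape=(1, self.k), initializer='uniform')
        self.b = self.add_weight(name='b', shape=(1, self.k), initializer='uniform')

    def call(self, t):
        # t has shape (batch, 1): the scalar time feature
        linear = self.w0 * t + self.b0                     # (batch, 1)
        periodic = tf.sin(tf.matmul(t, self.w) + self.b)   # (batch, k)
        return tf.concat([linear, periodic], axis=-1)      # (batch, k + 1)

t = tf.random.uniform((8, 1))
print(Time2Vec(k=64)(t).shape)  # (8, 65): 1 linear term + 64 sinusoids

So with k = 64 the layer learns 64 different frequencies and phases, rather than the single sinusoid you have at the moment.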
I am training a simple model for text classification (currently with scikit-learn). To transform my document samples into word count vectors using a vocabulary, I use
CountVectorizer(vocabulary=myDictionaryWords).fit_transform(myDocumentsAsArrays)
from sklearn.feature_extraction.text.
This works great, and I can subsequently train my classifier on these word count vectors as feature vectors. But what I don't know is how to inverse-transform the word count vectors back into the original documents. CountVectorizer does have an inverse_transform(X) function, but it only gives you back the unique non-zero tokens.
As far as I know, CountVectorizer doesn't implement a mapping back to the original documents.
Does anyone know how I can restore the original sequences of tokens from their count-vectorized representation? Is there maybe a TensorFlow or any other module for this?
CountVectorizer is "lossy": for a document such as "This is the amazing string in amazing program", it only stores the counts of the words in the document (i.e. string -> 1, amazing -> 2, etc.) and loses the position information.
So by reversing it you can create a document that has the same words repeated the same number of times, but their sequence in the document cannot be retraced.
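A small sketch of that point, using the example sentence above (get_feature_names_out requires a recent scikit-learn; older versions call it get_feature_names):

from sklearn.feature_extraction.text import CountVectorizer

docs = ['This is the amazing string in amazing program']
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

# The count vector keeps word frequencies but not word order.
print(dict(zip(vectorizer.get_feature_names_out(), X.toarray()[0])))
# e.g. {'amazing': 2, 'in': 1, 'is': 1, 'program': 1, 'string': 1, 'the': 1, 'this': 1}

# inverse_transform only returns the non-zero tokens of each document, not the sentence.
print(vectorizer.inverse_transform(X))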
How do you train a neural network to map from a vector representation to one-hot vectors? The example I'm interested in is where the vector representation is the output of a word2vec embedding, and I'd like to map it onto the individual words of the language used to train the embedding, so I guess this is vec2word?
In a bit more detail: if I understand correctly, a cluster of points in the embedded space represents similar words. So if you sample points from that cluster and use them as input to vec2word, the output should be a mapping to similar individual words?
I guess I could do something similar to an encoder-decoder, but does it have to be that complicated/use so many parameters?
There's this TensorFlow tutorial on how to train word2vec, but I can't find any help on doing the reverse. I'm happy to do it using any deep learning library, and it's OK to do it using sampling/probabilistic methods.
Thanks a lot for your help, Ajay.
The easiest thing you can do is use the nearest-neighbour word. Given a query feature fq of an unknown word and a reference feature set R = {fr} of known words, you can find the nearest fr* to fq and use the word corresponding to fr* as fq's word.
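A minimal NumPy sketch of that idea using cosine similarity; the toy vocabulary and random reference matrix below are placeholders for your trained embedding:

import numpy as np

def nearest_word(fq, R, vocab):
    # fq: query vector, shape (d,); R: reference features, shape (V, d); vocab: V words aligned with the rows of R
    sims = R @ fq / (np.linalg.norm(R, axis=1) * np.linalg.norm(fq) + 1e-9)  # cosine similarity to every known word
    return vocab[int(np.argmax(sims))]

vocab = ['cat', 'dog', 'car']
R = np.random.rand(3, 300)
print(nearest_word(R[0] + 0.01, R, vocab))  # a point near R[0] maps back to 'cat'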
I've been using TensorFlow to build a word2vec model, following this reference: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/examples/tutorials/word2vec/word2vec_basic.py#L118
My question is: how can I find the top n similar words for a given word? I know that in gensim I can save and load a word2vec model and then use model.most_similar to find what I want, but how do I do this in TensorFlow? And moreover, is there any way to save the model in TensorFlow, since what I get is only an embedding vector? Is that right?
As long as you have computed the weight vector for each token, you can manipulate all the tokens in the vector space. Simply calculate the cosine similarity between the query word's vector and every other vector, and then sort by score. For reference, you can look at the source code of the most_similar method implemented in the gensim word2vec model. Hope this helps.
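A minimal NumPy sketch of that, assuming you kept the trained embedding matrix and the word/index dictionaries from the tutorial script (the names below are illustrative placeholders); you can persist them with np.save or pickle and reload them later:

import numpy as np

def top_n_similar(word, n, embeddings, dictionary, reverse_dictionary):
    # embeddings: (vocab_size, dim) matrix; dictionary: word -> row index; reverse_dictionary: row index -> word
    vec = embeddings[dictionary[word]]
    sims = embeddings @ vec / (np.linalg.norm(embeddings, axis=1) * np.linalg.norm(vec) + 1e-9)
    best = np.argsort(-sims)[1:n + 1]  # skip the first hit, which is the word itself
    return [(reverse_dictionary[i], float(sims[i])) for i in best]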
I have some data in a pandas DataFrame (although pandas is not the point of this question). As an experiment I made a column ZR by dividing column Z by column R. As a first step using scikit-learn I wanted to see if I could predict ZR from the other columns (which should be possible, as I just made it from R and Z). My steps have been:
import numpy as np
from sklearn import preprocessing, linear_model

# results: the pandas DataFrame described above
columns = ['R', 'T', 'V', 'X', 'Z']
for c in columns:
    results[c] = preprocessing.scale(results[c])
results['ZR'] = preprocessing.scale(results['ZR'])
labels = results['ZR'].values
features = results[columns].values
# print(labels)
# print(features)
regr = linear_model.LinearRegression()
regr.fit(features, labels)
print(regr.coef_)
print(np.mean((regr.predict(features) - labels) ** 2))
This gives
[ 0.36472515 -0.79579885 -0.16316067 0.67995378 0.59256197]
0.458552051342
The preprocessing seems wrong, as I think it destroys the Z/R relationship. What's the right way to preprocess in this situation?
Is there some way to get near 100% accuracy? Linear regression is the wrong tool, as the relationship is not linear.
The five features are highly correlated in my data. Is non-negative least squares implemented in scikit-learn? (I can see it mentioned in the mailing list but not in the docs.) My aim would be to get as many coefficients set to zero as possible.
You should easily be able to get a decent fit using random forest regression, without any preprocessing, since it is a nonlinear method:
from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor(n_estimators=10, max_features=2)
model.fit(features, labels)
You can play with the parameters to get better performance.
The solution is not that easy and can be heavily influenced by your data.
If your variables R and Z are bounded (for example 0 < R < 1 and -3 < Z < 2), then you should be able to get a good estimate of the output variable using a neural network.
Using a neural network you should be able to estimate your output even without preprocessing the data, using all the variables as input.
(Of course, here you will have to solve a minimization problem.)
sklearn does not implement neural networks, so you should use pybrain or FANN.
If you want to preprocess the data in order to make the minimization problem easier, you can try to extract the right features from the predictor matrix.
I do not think there are a lot of tools for non-linear feature selection. I would try to estimate the important variables of your dataset using, in this order:
1. Lasso (a minimal sketch follows this list)
2. Sparse PCA
3. Decision trees (you can actually use them for feature selection), but I would avoid this as much as possible
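A minimal scikit-learn sketch of the Lasso step, reusing the features and labels arrays from the question; the alpha value is just a starting guess to tune (e.g. with LassoCV):

from sklearn.linear_model import Lasso

lasso = Lasso(alpha=0.1)       # regularization strength: an assumption, tune it
lasso.fit(features, labels)
print(lasso.coef_)             # coefficients driven exactly to zero mark dropped features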
If this is a toy problem, I would suggest you move towards something more standard.
You can find a lot of examples on Google.