How can I cluster articles with 2 embeddings (English and French)? - hierarchical-clustering

I have a dataset of articles containing French and English content. My final goal is to have similar articles end up in the same cluster.
What I did first is train word embeddings with fastText, so I have one set of word embeddings for the French content and another for the English content.
What can I do next to produce a single clustering from these two embeddings? Is there a good practice?

Related

Non English Word Embedding from English Word Embedding

How can I generate non-English (French, Spanish, Italian) word embeddings from English word embeddings?
What are the best ways to generate high-quality word embeddings for non-English words?
Words may include (samsung-galaxy-s9)
How can I generate non-English (French, Spanish, Italian) word embeddings from English word embeddings?
You can't really, unless you have words which mean exactly the same thing. If you know the French words for king, queen, woman and man, you can give those words the embeddings of the exact same words in English, and they will show the same syntactic and semantic properties that the English words do. But you can't really use the English embeddings to make embeddings for a different language.
What are the best ways to generate high-quality word embeddings for non-English words?
English words and non-English words can be treated the same way: represent your non-English words as strings/tokens and train a w2v model. Use gensim for this. You'll have to find a huge corpus for the language you want, then train your model on that corpus for a few epochs, and you're done. Alternatively, look for pre-existing models in your required language.
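As an illustration, here is a minimal sketch of training word2vec on a non-English corpus with gensim (assuming gensim 4.x; 'corpus_fr.txt' and 'fr_vectors.kv' are placeholder file names for a large tokenised French corpus, one sentence per line, and for the saved vectors):
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

# Stream sentences from a plain-text corpus, one tokenised sentence per line.
sentences = LineSentence('corpus_fr.txt')

# Train word2vec for a few epochs; the hyperparameters here are only illustrative.
model = Word2Vec(sentences, vector_size=300, window=5, min_count=5, workers=4, epochs=5)

# Keep just the word vectors for later use.
model.wv.save('fr_vectors.kv')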
Words may include (samsung-galaxy-s9)
Unless your corpus has words like "samsung-galaxy-s9", your model won't know what it means. Use a corpus which might have more words in the domain you're hoping to use the embeddings for.
For non-English words, you can try using a bilingual dictionary to carry over the embedding vectors of the corresponding English words.
You need a large corpus to generate high-quality word embeddings. For non-English words, you can also add bilingual constraints to the original w2v loss, using bilingual corpora as input.
You can treat a compound word as a single word or split it, depending on your application.
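A minimal sketch of the bilingual-dictionary idea mentioned above, assuming an English KeyedVectors file saved by gensim; the file name and the tiny dictionary are placeholders:
from gensim.models import KeyedVectors

# Load pretrained English vectors (placeholder file name).
en_vectors = KeyedVectors.load('en_vectors.kv')

# Hypothetical French-to-English dictionary.
fr_to_en = {'roi': 'king', 'reine': 'queen', 'femme': 'woman', 'homme': 'man'}

# Give each French word the vector of its English translation, if available.
fr_vectors = {fr: en_vectors[en] for fr, en in fr_to_en.items() if en in en_vectors}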

Spacy - English language model outruns german language model on german text?

Is it by design that the English language model performs better on German salutation entities than the German model?
# pip install spacy
# python -m spacy download en
# python -m spacy download de
import spacy
nlp = spacy.load('en')
# Uncomment line below to get less good results
# nlp = spacy.load('de')
# Process text
text = (u"Das Auto kauft Herr Müller oder Frau Meier, Frank Muster")
doc = nlp(text)
# Find named entities
for entity in doc.ents:
    print(entity.text, entity.label_)
Expected result when using nlp = spacy.load('en'): all three PERSON entities are returned
Das Auto ORG
Herr Müller PERSON
Frau Meier PERSON
Frank Muster PERSON
Unexpected result when using nlp = spacy.load('de'): only one of the three PERSON entities is returned
Frank Muster PERSON
Info about spaCy
spaCy version: 2.0.12
Platform: Linux-4.17.2-1-ARCH-x86_64-with-arch-Arch-Linux
Python version: 3.6.5
Models: en, de
It's not by design, but it's certainly possible that this is a side-effect of the training data and the statistical predictions. The English model is trained on a larger NER corpus with more entity types, while the German model uses NER data based on Wikipedia.
In Wikipedia text, full names like "Frank Muster" are quite common, whereas things like "Herr Muster" are typically avoided. This might explain why the model only labels the full name as a person and not the others. The example sentence also makes it easy for the English model to guess correctly – in English, capitalization is a much stronger indicator of a named entity than it is in German. This might explain why the model consistently labels all capitalised multi-word spans as entities.
In any case, this is a good example of how subtle language-specific or stylistic conventions end up influencing a model's predictions. It also shows why you almost always want to fine-tune a model with more examples specific to your data. I do think that the German model will likely perform better on German texts overall, but if references like "Herr Müller" are common in your texts, you probably want to update the model with more examples of them in different contexts.
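As a rough sketch of such an update, following the pattern of the spaCy 2.x training examples (the sentences, character offsets and iteration count below are made up, and note that the German model's person label may be PER rather than PERSON):
import random
import spacy

nlp = spacy.load('de')

# Hypothetical examples of "Herr/Frau + surname" references; entities are
# given as (start_char, end_char, label) spans.
TRAIN_DATA = [
    ("Das Auto kauft Herr Müller.", {"entities": [(15, 26, "PER")]}),
    ("Frau Meier wohnt in Berlin.", {"entities": [(0, 10, "PER"), (20, 26, "LOC")]}),
]

# Only update the NER component; leave the tagger and parser untouched.
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'ner']
with nlp.disable_pipes(*other_pipes):
    optimizer = nlp.entity.create_optimizer()
    for i in range(20):
        random.shuffle(TRAIN_DATA)
        for text, annotations in TRAIN_DATA:
            nlp.update([text], [annotations], sgd=optimizer, drop=0.35)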

SpaCy similarity score makes no sense

I am trying to figure out whether I can trust SpaCy's similarity function, and I am getting confused. Here's my toy example:
import spacy
nlp = spacy.load('en')
doc1 = nlp(u'Unsalted butter')
doc2 = nlp(u'babi carrot peel babi carrot grim french babi fresh babi roundi fresh exot petit petit peel shred carrot dole shred')
doc1.similarity(doc2)
I get the similarity of 0.64. How can it be this high for two sentences with no overlapping tokens? Could someone please explain this to me? Thank you!
The problem is that you are using the en model, which is most likely linked to en_core_web_sm.
The en_core_web_sm model doesn't have pretrained GloVe vectors, so you are computing similarity from the vectors produced by the NER, PoS and dependency tagger.
These vectors encode structural information, e.g. having the same PoS tag or dependency role in the sentence. There is no semantic information encoded in them, so the weird result you are getting is to be expected.
Have also a look here.
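As a sanity check, here is a minimal sketch assuming the medium English model (en_core_web_md), which does ship with pretrained word vectors, has been installed via python -m spacy download en_core_web_md:
import spacy

# Load a model that includes real word vectors.
nlp = spacy.load('en_core_web_md')

doc1 = nlp(u'Unsalted butter')
doc2 = nlp(u'babi carrot peel babi carrot grim french babi fresh babi roundi fresh exot petit petit peel shred carrot dole shred')

# With pretrained vectors, similarity is the cosine similarity of the averaged
# word vectors rather than of tagger/parser features.
print(doc1.similarity(doc2))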

How to train a reverse embedding, like vec2word?

How do you train a neural network to map from a vector representation to one-hot vectors? The example I'm interested in is where the vector representation is the output of a word2vec embedding, and I'd like to map onto the individual words which were in the language used to train the embedding, so I guess this is vec2word?
In a bit more detail; if I understand correctly, a cluster of points in embedded space represents similar words. Thus if you sample from points in that cluster, and use it as the input to vec2word, the output should be a mapping to similar individual words?
I guess I could do something similar to an encoder-decoder, but does it have to be that complicated/use so many parameters?
There's this TensorFlow tutorial on how to train word2vec, but I can't find any help on doing the reverse. I'm happy to do it using any deep learning library, and it's OK to do it using sampling/probabilistic methods.
Thanks a lot for your help, Ajay.
The easiest thing you can do is use the nearest-neighbor word. Given a query feature f_q of an unknown word and a reference feature set of known words R = {f_r}, you can find the nearest f_r* to f_q and use the word corresponding to f_r* as f_q's word.
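A minimal sketch of this nearest-neighbor lookup with gensim, assuming a trained word2vec model saved as 'w2v.model' (the file name and the random query vector are placeholders for a real model and a point sampled from a cluster):
import numpy as np
from gensim.models import Word2Vec

model = Word2Vec.load('w2v.model')

# Stand-in for a vector sampled from a cluster in the embedding space.
query_vector = np.random.rand(model.vector_size)

# Return the known words whose embeddings are closest (by cosine similarity)
# to the query vector.
print(model.wv.similar_by_vector(query_vector, topn=5))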

Lucene Term Vector Multivariate Bayes Model Expectation Maximization

I am trying to implement an Expectation Maximization algorithm for document clustering. I am planning to use Lucene term vectors for finding the similarity between 2 documents. There are 2 kinds of EM algorithms using naive Bayes: the multivariate model and the multinomial model. In simple terms, the multinomial model uses the frequencies of the different words in the documents, while the multivariate model just uses the information of whether a word is present or not in the document (a boolean vector).
I know that the term vectors in Lucene store the terms present in the current document along with their frequencies. This is exactly what is needed for the multinomial model.
But the multivariate model requires the following:
A vector which stores the presence or absence of a particular term. Thus all the terms in all the documents must be handled by this vector.
As an example:
doc1 : field CONTENT has the following terms : this is the world of pleasure.
doc2 : field CONTENT has the following terms : this amazing world is full of sarcastic people.
now the vector that I need should be
< this is the world of pleasure amazing full sarcastic people > ( it contains all the words in all the documents )
for doc1 the value of this vector is <1 1 1 1 1 1 0 0 0 0>
for doc2 the value of this vector is <1 1 0 1 0 0 1 1 1 1>
Is there any way to generate such a boolean vector in Lucene?
I would first generate the multinomial vectors, and then process them (maybe their textual representation) to get the multivariate vectors.
If the set of documents is not very small, storing full vectors is wasteful. You should have a sparse representation, because every document contains a small subset of the possible terms.
This blog post describes generating feature vectors from Lucene/Solr documents, although I do not think it goes much farther than what you already did.
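For illustration only (plain Python rather than Lucene), here is a minimal sketch of turning per-document term frequencies (multinomial) into a sparse boolean presence representation (multivariate), along the lines suggested above:
# Term frequencies per document, e.g. as read back from Lucene term vectors.
doc_term_freqs = {
    'doc1': {'this': 1, 'is': 1, 'the': 1, 'world': 1, 'of': 1, 'pleasure': 1},
    'doc2': {'this': 1, 'amazing': 1, 'world': 1, 'is': 1, 'full': 1,
             'of': 1, 'sarcastic': 1, 'people': 1},
}

# Global vocabulary over all documents, with a fixed index per term.
vocab = sorted({term for freqs in doc_term_freqs.values() for term in freqs})
index = {term: i for i, term in enumerate(vocab)}

# Sparse multivariate representation: just the set of indices of terms present.
doc_presence = {doc: {index[t] for t in freqs} for doc, freqs in doc_term_freqs.items()}

# Dense boolean vector for a single document, if ever needed.
dense_doc1 = [1 if i in doc_presence['doc1'] else 0 for i in range(len(vocab))]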