Non-English Word Embeddings from English Word Embeddings - tensorflow

How can I generate non-English (French, Spanish, Italian) word embeddings from English word embeddings?
What are the best ways to generate high-quality word embeddings for non-English words?
Words may include tokens such as (samsung-galaxy-s9).

How can I generate non-English (French, Spanish, Italian) word embeddings from English word embeddings?
You can't, really, unless you have words that mean exactly the same thing. If you know the French words for king, queen, woman and man, you can give those words the embeddings of the corresponding English words, and they will show the same syntactic and semantic properties that the English words do. But you cannot use the English embeddings to produce embeddings for a different language in general.
What are the best ways to generate high-quality word embeddings for non-English words?
English and non-English words can be treated the same way. Represent your non-English words as strings/tokens and train a w2v model; gensim works well for this. You will have to find a large corpus for the language you want and then train the model on that corpus for a few epochs (sketched below). Alternatively, look for pre-existing models in your required language.
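A minimal sketch of that workflow, assuming gensim 4.x; the toy sentences and hyperparameters (vector_size, window, epochs) are placeholders you would replace with your own corpus and tuning:

from gensim.models import Word2Vec

# Each item is one tokenized sentence from a (hypothetical) French corpus.
sentences = [
    ["le", "roi", "et", "la", "reine"],
    ["un", "samsung-galaxy-s9", "est", "un", "téléphone"],
]

# Train word2vec directly on the non-English tokens; hyperparameters are illustrative, not tuned.
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, epochs=5)

# Nearest neighbours of "roi" (king) in the learned space.
print(model.wv.most_similar("roi"))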
Words may include tokens such as (samsung-galaxy-s9).
Unless your corpus contains words like "samsung-galaxy-s9", your model won't know what they mean. Use a corpus that is likely to contain words from the domain you're hoping to use the embeddings for.

For non-English words, you can try using a bilingual dictionary: translate the English words and reuse their embedding vectors.
You need a large corpus to generate high-quality word embeddings. For non-English languages, you can also add bilingual constraints to the original w2v loss and train on bilingual corpora.
You can treat the compound word as a single token or split it, depending on your application.
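A minimal sketch of the dictionary approach, assuming gensim 4.x, a pre-trained English model at a hypothetical path, and a tiny hand-written French-English dictionary; it simply copies each English vector over to its translation:

from gensim.models import KeyedVectors

# Hypothetical path and dictionary; replace with your own resources.
en_vectors = KeyedVectors.load_word2vec_format("en_embeddings.bin", binary=True)
fr_to_en = {"roi": "king", "reine": "queen", "femme": "woman", "homme": "man"}

# Give each French word the vector of its English translation.
fr_words = [fr for fr, en in fr_to_en.items() if en in en_vectors]
fr_vectors = KeyedVectors(vector_size=en_vectors.vector_size)
fr_vectors.add_vectors(fr_words, [en_vectors[fr_to_en[w]] for w in fr_words])

print(fr_vectors.most_similar("roi"))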

Related

one_hot vs Tokenizer for word representation

I have seen in many blogs people using one_hot (from tf.keras.preprocessing.text.one_hot) to convert a string of words into an array of numbers representing indices. This does not ensure unicity, whereas the Tokenizer class (tf.keras.preprocessing.text.Tokenizer) does.
Then why is one_hot preferred over Tokenizer?
Update: I learned that one_hot uses hashing to convert words into numbers, but I don't see its advantage, since the Tokenizer class can do the same thing more accurately (without hash collisions).
Not sure what you mean by unicity. I expect it has to do with the sequential relationship between the words, which of course is lost with one-hot encoding. One-hot encoding is used when the number of words is limited: if you have 10 words in the vocabulary, you create 10 new features, which is fine for most neural networks to process, and if your data set has other features besides the word sequences (say numeric ordinal parameters), you can still build a single-input model.

However, if you have 10,000 words in the vocabulary, you would create 10,000 new features, which at best will take a lot to process. So in the case of a large vocabulary it is best to use "dense" encoding rather than the sparse encoding generated by one-hot encoding. You can use the result of the Tokenizer encoding as input to a Keras Embedding layer, which encodes the words into an N-dimensional space where N is a value you specify (see the sketch below). If you have additional ordinal features, your model will then need multiple inputs; perhaps that is why some people prefer to one-hot encode the words.
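A minimal sketch of that dense route, assuming tf.keras 2.x; the vocabulary size, sequence length and embedding dimension are illustrative:

import tensorflow as tf

texts = ["the king and the queen", "the man and the woman"]

# Tokenizer assigns each word a unique integer index (no hash collisions).
tokenizer = tf.keras.preprocessing.text.Tokenizer(num_words=1000)
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)
padded = tf.keras.preprocessing.sequence.pad_sequences(sequences, maxlen=10)

# The Embedding layer maps each integer index into a dense N-dimensional vector (N=8 here).
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=1000, output_dim=8),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
print(model(padded).shape)  # (2, 1)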

How can I cluster articles with 2 embeddings (english and french)?

I have a dataset of articles containing French and English content. My final goal is to have similar articles end up in the same cluster.
What I did first is word embeddings with fastText, so I have one set of word embeddings for the French content and another for the English content.
What can I do next to produce a single clustering from these two embeddings? Is there a good practice?

Why does spacy return vectors for words like 'zz'? Shouldn't they be zero vectors?

nlp('zz').vector.sum() is -10.
nlp('asc').vector.sum() is -9.677
Shouldn't these words be out of vocabulary and have zero vectors?
Depending on the model you're using, the training corpus may contain a lot of abbreviations, informal words (such as the ones in your example), typos and even words from other languages. These are still treated as lexemes and are assigned vectors.
https://spacy.io/usage/models
spaCy's default English model doesn't include word vectors, so it tries to deduce them from your text. If you use the larger models, they include vectors.

import spacy

# This will not have effective vectors: the small 'en' model ships without pretrained word vectors
nlp = spacy.load('en')

# This will have the vectors you're looking for (I believe): 'en_core_web_md' includes pretrained vectors
nlp = spacy.load('en_core_web_md')
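A quick way to verify whether a token really has a vector, assuming the medium model is installed; has_vector and is_oov are standard spaCy token attributes:

import spacy

nlp = spacy.load('en_core_web_md')
for word in ('zz', 'asc', 'king'):
    token = nlp(word)[0]
    # Tokens without a pretrained vector report has_vector == False and get an all-zero vector.
    print(word, token.has_vector, token.is_oov, token.vector.sum())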

How to train a reverse embedding, like vec2word?

How do you train a neural network to map from a vector representation to one-hot vectors? The example I'm interested in is where the vector representation is the output of a word2vec embedding, and I'd like to map onto the individual words that were in the language used to train the embedding, so I guess this is vec2word?
In a bit more detail: if I understand correctly, a cluster of points in embedded space represents similar words. Thus if you sample from points in that cluster and use it as the input to vec2word, the output should be a mapping to similar individual words?
I guess I could do something similar to an encoder-decoder, but does it have to be that complicated / use so many parameters?
There's this TensorFlow tutorial on how to train word2vec, but I can't find any help on doing the reverse. I'm happy to do it using any deep learning library, and it's OK to do it using sampling / a probabilistic approach.
Thanks a lot for your help, Ajay.
The easiest thing you can do is use the nearest-neighbour word. Given a query feature fq of an unknown word and a reference feature set of known words R = {fr}, you can find the nearest fr* to fq and use the word corresponding to fr* as fq's word.
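A minimal sketch of that nearest-neighbour lookup with gensim 4.x; the toy corpus is only there so the example runs, and similar_by_vector returns the vocabulary words closest to an arbitrary query vector:

import numpy as np
from gensim.models import Word2Vec

# Toy corpus so the example runs; substitute your own trained model.
sentences = [["king", "queen", "man", "woman"], ["king", "rules", "the", "kingdom"]]
model = Word2Vec(sentences, vector_size=50, min_count=1, epochs=50)

# Pretend this vector was sampled from embedding space (here: a noisy copy of "king").
query = model.wv["king"] + 0.01 * np.random.randn(50).astype(np.float32)

# Nearest-neighbour decoding: map the vector back to the closest known words.
print(model.wv.similar_by_vector(query, topn=3))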

Extrapolate Sentence Similarity Given Word Similarities

Assuming that I have a word similarity score for each pair of words in two sentences, what is a decent approach to determining the overall sentence similarity from those scores?
The word scores are calculated using cosine similarity from vectors representing each word.
Now that I have individual word scores, is it too naive to sum them and divide by the total word count of both sentences to get a score for the two sentences?
I've read about constructing vectors to represent the sentences from the word scores and then again using cosine similarity to compare the sentences. But I'm not familiar with how to construct sentence vectors from the existing word scores, nor am I aware of the tradeoffs compared with the naive approach described above, which at the very least I can easily comprehend. :)
Any insights are greatly appreciated.
Thanks.
What I ended up doing was taking the mean of each sentence's set of word vectors and then applying cosine similarity to the two means, resulting in a score for the sentences.
I'm not sure how mathematically sound this approach is, but I've seen it done in other places (like Python's gensim).
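A minimal sketch of that averaging approach with numpy; the tiny word-vector dictionary is hypothetical and stands in for vectors you would get from word2vec or fastText:

import numpy as np

# Hypothetical 3-dimensional word vectors; in practice these come from a trained embedding model.
vectors = {
    "the": np.array([0.1, 0.0, 0.2]),
    "cat": np.array([0.9, 0.3, 0.1]),
    "sat": np.array([0.2, 0.8, 0.4]),
    "dog": np.array([0.8, 0.4, 0.2]),
    "ran": np.array([0.3, 0.7, 0.5]),
}

def sentence_vector(tokens):
    # Mean of the word vectors for the tokens we have embeddings for.
    return np.mean([vectors[t] for t in tokens if t in vectors], axis=0)

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

s1 = sentence_vector("the cat sat".split())
s2 = sentence_vector("the dog ran".split())
print(cosine(s1, s2))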
It would be better to use contextual word embeddings (vector representations) for the words.
Here is an approach to sentence similarity via pairwise word similarities: BERTScore.
You can check the math here.
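A minimal usage sketch, assuming the bert-score package is installed; it aggregates pairwise contextual token similarities into sentence-level precision, recall and F1:

from bert_score import score

candidates = ["The cat sat on the mat."]
references = ["A cat was sitting on the rug."]

# P, R, F1 are tensors with one entry per candidate/reference pair.
P, R, F1 = score(candidates, references, lang="en")
print(F1.item())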