SpaCy similarity score makes no sense

I am trying to figure out if I trust SpaCy's similarity function and I am getting confused. Here's my toy example:
import spacy
nlp = spacy.load('en')
doc1 = nlp(u'Unsalted butter')
doc2 = nlp(u'babi carrot peel babi carrot grim french babi fresh babi roundi fresh exot petit petit peel shred carrot dole shred')
doc1.similarity(doc2)
I get a similarity of 0.64. How can it be this high for two sentences with no overlapping tokens? Could someone please explain this to me? Thank you!

The problem is that you are using the en shortcut, which is most likely linked to en_core_web_sm.
The en_core_web_sm model doesn't ship with pretrained GloVe vectors, so you are using the context tensors produced by the NER, PoS and dependency parser components to compute similarity.
These tensors encode structural information, e.g. having the same PoS tag or dependency role in the sentence. There is no semantic information encoded in them, so the weird result you are getting is expected.
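A quick way to see the difference is to compare against a model that actually ships with word vectors. A minimal sketch, assuming en_core_web_md has been downloaded (python -m spacy download en_core_web_md):
import spacy

nlp = spacy.load('en_core_web_md')
# The vector table is non-empty here, unlike en_core_web_sm:
print(nlp.vocab.vectors.shape)

doc1 = nlp(u'Unsalted butter')
doc2 = nlp(u'babi carrot peel babi carrot grim french babi fresh babi')
# With real word vectors this should come out noticeably lower than 0.64:
print(doc1.similarity(doc2))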

Related

Why don't spaCy transformer models do NER for non-English languages?

Why is it that spaCy transformer models for languages like Spanish (es_dep_news_trf) don't do named entity recognition, while the English one (en_core_web_trf) does?
In code:
import spacy

nlp = spacy.load("en_core_web_trf")
doc = nlp("my name is John Smith and I work at Apple and I like visiting the Eiffel Tower")
print(doc.ents)
# (John Smith, Apple, the Eiffel Tower)

nlp = spacy.load("es_dep_news_trf")
doc = nlp("mi nombre es John Smith y trabajo en Apple y me gusta visitar la Torre Eiffel")
print(doc.ents)
# ()
Why doesn't the Spanish model extract entities when the English one does?
It has to do with the available training data: ner is only included in the trf models if the training data has NER annotation on the exact same text as the tagging and parsing annotation.
Training trf models on partial annotation does not work well in practice, and an independent NER component (as in the CNN pipelines) would mean including an additional transformer component in the pipeline, which would make the pipeline a lot larger and slower.
The spaCy models vary with regard to which NLP features they provide - this is just a result of how the respective authors created/trained them. For example, https://spacy.io/models/en#en_core_web_trf lists "ner" in its components, but https://spacy.io/models/es#es_dep_news_trf does not.
The Spanish https://spacy.io/models/es#es_core_news_lg (as well as the two smaller variants) does list "ner" in its components, so those models do produce named entities:
>>> import spacy
>>> nlp=spacy.load("es_core_news_sm")
>>> doc=nlp("mi nombre es John Smith y trabajo en Apple y me gusta visitar la Torre Eiffel")
>>> print(doc.ents)
(John Smith, Apple, Torre Eiffel)
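You can also check which components a loaded pipeline actually contains by inspecting pipe_names (the exact list depends on the model version):
>>> import spacy
>>> spacy.load("en_core_web_trf").pipe_names   # includes 'ner'
>>> spacy.load("es_dep_news_trf").pipe_names   # no 'ner' component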

spaCy - English language model outperforms German language model on German text?

Is it by design that the English language model performs better on German salutation entities than the German model?
# pip install spacy
# python -m spacy download en
# python -m spacy download de

import spacy

nlp = spacy.load('en')
# Uncomment the line below to get less good results
# nlp = spacy.load('de')

# Process text
text = u"Das Auto kauft Herr Müller oder Frau Meier, Frank Muster"
doc = nlp(text)

# Find named entities
for entity in doc.ents:
    print(entity.text, entity.label_)
Expected result if using nlp = spacy.load('en'): all three PERSON entities are returned
Das Auto ORG
Herr Müller PERSON
Frau Meier PERSON
Frank Muster PERSON
Unexpected result if using nlp = spacy.load('de'): only one of the three PERSON entities is returned
Frank Muster PERSON
Info about spaCy

spaCy version: 2.0.12
Platform: Linux-4.17.2-1-ARCH-x86_64-with-arch-Arch-Linux
Python version: 3.6.5
Models: en, de
It's not by design, but it's certainly possible that this is a side-effect of the training data and the statistical predictions. The English model is trained on a larger NER corpus with more entity types, while the German model uses NER data based on Wikipedia.
In Wikipedia text, full names like "Frank Muster" are quite common, whereas things like "Herr Muster" are typically avoided. This might explain why the model only labels the full name as a person and not the others. The example sentence also makes it easy for the English model to guess correctly – in English, capitalization is a much stronger indicator of a named entity than it is in German. This might explain why the model consistently labels all capitalised multi-word spans as entities.
In any case, this is a good example of how subtle language-specific or stylistic conventions end up influencing a model's predictions. It also shows why you almost always want to fine-tune a model with more examples specific to your data. I do think that the German model will likely perform better on German texts overall, but if references like "Herr Müller" are common in your texts, you probably want to update the model with more examples of them in different contexts.
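If you do want to update the German model with such examples, here is a minimal sketch for the spaCy 2.x API used in the question. TRAIN_DATA is illustrative only; real fine-tuning needs many varied examples, and the person label may be 'PER' or 'PERSON' depending on the model version (check nlp.get_pipe('ner').labels):
import random
import spacy

TRAIN_DATA = [
    ("Das Auto kauft Herr Müller.", {"entities": [(15, 26, "PER")]}),
    ("Frau Meier liest ein Buch.", {"entities": [(0, 10, "PER")]}),
]

nlp = spacy.load('de')
ner = nlp.get_pipe('ner')

# Update only the NER component; leave the tagger and parser untouched.
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'ner']
with nlp.disable_pipes(*other_pipes):
    # create_optimizer() keeps the pretrained weights
    # (newer 2.x versions also offer nlp.resume_training()).
    optimizer = ner.create_optimizer()
    for itn in range(10):
        random.shuffle(TRAIN_DATA)
        losses = {}
        for text, annotations in TRAIN_DATA:
            nlp.update([text], [annotations], sgd=optimizer, losses=losses)
        print(itn, losses)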

Can home-made embeddings work for RNNs, or do they HAVE to be trained?

Let's say I'm training an RNN for classification, using a vocabulary of 100 words. I can skip the embedding and pass in the sentences as one-hot vectors, but using one-hot vectors for a space of 100 features seems very wasteful in terms of memory, and it only gets worse as the vocab grows. Is there any reason why I couldn't create my own embedding where each value from 0-100 is converted into binary and stored as an array of length 7, i.e. 0=[0,0,0,0,0,0,0], 1=[1,0,0,0,0,0,0], ..., 100=[1,1,0,0,1,0,0]? I realize the dimensionality is low, but aside from that, I wasn't sure if this arbitrary encoding is a bad idea, since there are none of the relationships between the word vectors that you get with GloVe. BTW I can't use pre-made embeddings here, and my sample size isn't huge, which is why I'm exploring making my own.
Nice logic, but it is missing two important features of embeddings.

1) We use word embeddings to get similar vector representations for similar words.

For example, "apple" and "mango" will have nearly the same representation, and so will "cricket" and "football". You might ask why this is beneficial. The answer: imagine we have mostly trained our model on apples. If the test data involves mangoes, we will still get an appropriate answer even though we never explicitly trained on mangoes.

Training: I like apples, I drink apple juice every day.
Testing: I like mangoes, I drink _____ juice every day.

The blank will be filled with "mango" even though we didn't train on mangoes explicitly. This is achieved by the similar vector representations of "mango" and "apple", which cannot be achieved by your method.

2) Don't you think that even with your scheme the vectors will be sparse? I agree it's better than one-hot encoding, but not compared to word embeddings: roughly 90% of a word embedding's components are non-zero, while in your case it will be only about 50%.
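A middle ground worth noting: instead of fixed binary codes, you can let the network learn a small dense embedding jointly with the classifier, with no pretrained vectors required. A minimal sketch with tf.keras (layer sizes are illustrative):
import tensorflow as tf

VOCAB_SIZE = 100
EMBED_DIM = 16  # small, dense, and learned - neither one-hot nor fixed binary codes

model = tf.keras.Sequential([
    # The Embedding layer is trained with the rest of the network,
    # so words that behave similarly can end up with similar vectors.
    tf.keras.layers.Embedding(VOCAB_SIZE, EMBED_DIM),
    tf.keras.layers.SimpleRNN(32),
    tf.keras.layers.Dense(1, activation='sigmoid'),
])
model.compile(optimizer='adam', loss='binary_crossentropy')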

Why does spaCy return vectors for words like 'zz'? Shouldn't they be zero vectors?

nlp('zz').vector.sum() is -10.
nlp('asc').vector.sum() is -9.677.
Shouldn't these words be out of vocabulary and have zero vectors?
Depending on the model you're using, the training corpus may contain a lot of abbreviations, informal words (such as the ones in your example), typos and even words of external languages. These are still treated as lexemes and are assigned vectors.
https://spacy.io/usage/models
spaCy's default English model doesn't include word vectors, so it tries to deduce them from your text. If you use the larger models, they include vectors.

This will not have effective vectors:
import spacy
nlp = spacy.load('en')

This will have the vectors you're looking for (I believe):
import spacy
nlp = spacy.load('en_core_web_md')
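To check whether a token actually has a real vector rather than a deduced one, inspect has_vector and is_oov (a quick sketch, assuming en_core_web_md is installed):
import spacy

nlp = spacy.load('en_core_web_md')
for word in ('zz', 'asc', 'carrot'):
    token = nlp(word)[0]
    print(word, token.has_vector, token.is_oov, token.vector.sum())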

What is a simple example of a TensorFlow file pipeline for a language model?

I am building an RNN language model in TensorFlow. My raw input consists of files of text. I am able to tokenize them, so that the data I am working with consists of sequences of integers that are indexes into a vocabulary.
Following the example in ptb_word_lm.py, I have written code to build a language model that gets its training data via the feed_dict method. However, I do not want to be limited to data sets that can fit in memory, so I would like to use file pipelines to read in data instead. I cannot find any examples of how to do this.
The file pipeline examples I've seen all have a tensor of some length n associated with a label that is a tensor of length 1. (The classic example is a 28 x 28 = 784 item tensor representing an MNIST bitmap, associated with a single integer label ranging from 0 to 9.) However, RNN training data consists of a vector of n consecutive tokens and a label also consisting of n consecutive tokens (shifted one ahead of the vector), for example:
"the quick brown fox jumped"
vectors (n=3): the quick brown, quick brown fox, brown fox jumped
labels (n=3): quick brown fox, brown fox jumped, fox jumped EOF
Can someone give me a code snippet that shows how to write a file pipeline to feed this shape of data into a TensorFlow graph?
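One way to produce data of this shape is the tf.data API (a sketch, not from the original thread; it uses the modern API rather than the queue-based pipelines of the TF 1.x era, and the token IDs are illustrative - in practice they would come from something like tf.data.TextLineDataset plus a vocabulary lookup):
import tensorflow as tf

SEQ_LEN = 3

# Token IDs for "the quick brown fox jumped" plus an EOF id, purely illustrative.
tokens = [1, 2, 3, 4, 5, 0]

ds = tf.data.Dataset.from_tensor_slices(tokens)
# Overlapping windows of SEQ_LEN + 1 tokens, shifted by one token each time.
ds = ds.window(SEQ_LEN + 1, shift=1, drop_remainder=True)
ds = ds.flat_map(lambda w: w.batch(SEQ_LEN + 1))
# Split each window into (vector, label): the label is the vector shifted by one.
ds = ds.map(lambda w: (w[:-1], w[1:]))

for x, y in ds:
    print(x.numpy(), y.numpy())  # e.g. [1 2 3] -> [2 3 4]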