Why does SpaCy change the entire dependency tree because of a capital letter? - spacy

The problem:
spaCy will produce a different parse tree if a single word is capitalized.
Examples:
Ketone is negative, protein is negative, Leucocytes is negative
Ketone is negative, Protein is negative, Leucocytes is negative
When Protein is capitalized, spaCy produces two dependency trees; when protein is lowercase it produces only one tree, see the attached photo.
Environment:
Windows Server 2019, spaCy 2.1.4, spaCy model: en_core_web_md
Questions:
Why does SpaCy change the entire dependency tree because of a capital letter?
Is there any way to make this more consistent?
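For reference, a minimal sketch to reproduce and inspect the two parses side by side (assuming en_core_web_md is installed, as in the environment above); it prints how many sentences the parser produced and each token's dependency label and head, so the two trees can be diffed:

import spacy

nlp = spacy.load("en_core_web_md")

texts = [
    "Ketone is negative, protein is negative, Leucocytes is negative",
    "Ketone is negative, Protein is negative, Leucocytes is negative",
]

for text in texts:
    doc = nlp(text)
    print(text)
    print("sentences:", len(list(doc.sents)))  # how many trees the parser produced
    for token in doc:
        print("  ", token.text, token.dep_, "head=" + token.head.text)
    print()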

Related

one_hot Vs Tokenizer for Word representation

I have seen in many blogs , people using one_hot (from tf.keras.preprocessing.text.one_hot ) to convert the string of words into array of numbers which represent indices. This does not ensure unicity. Whereas Tokenizer class ensures unicity (tf.keras.preprocessing.text.Tokenizer ).
Then why is one_hot prefered over tokenizer?
Update: I got to know that hashing is used in One_hot to convert words into numbers but didn't get its importance as we can use the tokenizer class to do the same thing with more accuracy.
Not sure what you mean by unicity. I expect it has to do with the sequential relationship between the words, which of course is lost with one-hot encoding.

One-hot encoding is used when the number of words is limited. If, say, you have 10 words in the vocabulary, you will create 10 new features, which is fine for most neural networks to process. If you have other features in your data set besides the word sequences, say numeric ordinal parameters, you can still create a single-input model. However, if you have 10,000 words in the vocabulary, you would create 10,000 new features, which at best will take a lot to process. So in the case of a large vocabulary it is best to use "dense" encoding versus the sparse encoding generated by one-hot encoding.

You can use the results of the tokenizer encoding as input to a Keras embedding layer, which will encode the words into an n-dimensional space, where n is a value you specify. If you have additional ordinal features, then to process the data your model will need multiple inputs. Perhaps that is why some people prefer to one-hot encode the words.
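As a rough illustration of the uniqueness point (a sketch, assuming tensorflow.keras is available; the sample texts and the hash-space size of 50 are made up):

from tensorflow.keras.preprocessing.text import Tokenizer, one_hot

texts = ["the cat sat on the mat", "the dog sat on the log"]

# one_hot hashes each word into a fixed-size index space, so two different
# words can end up with the same index (uniqueness is not guaranteed).
hashed = [one_hot(t, 50) for t in texts]

# Tokenizer builds an explicit word -> index vocabulary, so every distinct
# word gets its own index.
tokenizer = Tokenizer()
tokenizer.fit_on_texts(texts)
indexed = tokenizer.texts_to_sequences(texts)

print(hashed)
print(indexed)
print(tokenizer.word_index)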

Discrepancy documentation and implementation of spaCy vectors for German words?

According to documentation:
spaCy's small models (all packages that end in sm) don't ship with
word vectors, and only include context-sensitive tensors. [...]
individual tokens won't have any vectors assigned.
But when I use the de_core_news_sm model, the tokens do have entries for x.vector, and x.has_vector is True.
It looks like these are context vectors, but as far as I understood the documentation, only word vectors are accessible through the vector attribute, and sm models should have none. Why does this work for a "small model"?
has_vector behaves differently than you expect.
This is discussed in the comments on an issue raised on GitHub. The gist is: since vectors are available, it is True, even though those vectors are context vectors. Note that you can still use them, e.g. to compute similarity.
Quote from spaCy contributor Ines:
We've been going back and forth on how the has_vector should behave in
cases like this. There is a vector, so having it return False would be
misleading. Similarly, if the model doesn't come with a pre-trained
vocab, technically all lexemes are OOV.
Version 2.1.0 has been announced to include German word vectors.
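To see the behaviour described above, something like the following sketch can help (assuming de_core_news_sm is installed; the sample sentence is made up):

import spacy

# Despite being a "small" model, tokens still expose vectors (context tensors).
nlp = spacy.load("de_core_news_sm")
doc = nlp("Berlin ist eine Stadt in Deutschland.")

for token in doc:
    print(token.text, token.has_vector, token.vector.shape)

# Similarity still works on top of these vectors.
print(doc[0].similarity(doc[3]))  # Berlin vs. Stadt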

Why does spacy return vectors for words like 'zz' ? Should they not be zero vectors

nlp('zz').vector.sum() is -10.
nlp('asc').vector.sum() is -9.677.
Shouldn't these words be out of vocabulary and have zero vectors ?
Depending on the model you're using, the training corpus may contain a lot of abbreviations, informal words (such as the ones in your example), typos, and even words from other languages. These are still treated as lexemes and are assigned vectors.
https://spacy.io/usage/models
spaCy's default English model doesn't include word vectors, so it tries to deduce them from your text. If you use the larger models, they include vectors.
import spacy
nlp = spacy.load('en')  # this will not have effective vectors

import spacy
nlp = spacy.load('en_core_web_md')  # this will have the vectors you're looking for (I believe)
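If you want to check whether a token actually came with a pretrained vector rather than relying on the vector sum, a small sketch along these lines may help (assuming en_core_web_md is installed; the sample words are made up):

import spacy

nlp = spacy.load("en_core_web_md")

# Check whether each token is flagged as out of vocabulary and whether it
# carries a non-trivial vector.
for word in ("zz", "asc", "apple"):
    token = nlp(word)[0]
    print(word, token.is_oov, token.has_vector, token.vector_norm)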

Spacy 2.0 en_vectors_web_lg vs en_core_web_lg

What is the difference between the word vectors given in en_core_web_lg and en_vectors_web_lg? The number of keys is different: 1.1m vs 685k. I assume this means en_vectors_web_lg has broader coverage, perhaps by keeping more morphological variants as distinct tokens, since both are trained on the Common Crawl corpus yet end up with different numbers of keys.
The en_vectors_web_lg package has exactly every vector provided by the original GloVe model. The en_core_web_lg model uses the vocabulary from the v1.x en_core_web_lg model, which from memory pruned out all entries which occurred fewer than 10 times in a 10 billion word dump of Reddit comments.
In theory, most of the vectors that were removed should be things that the spaCy tokenizer never produces. However, earlier experiments with the full GloVe vectors did score slightly higher than the current NER model --- so it's possible we're actually missing out on something by losing the extra vectors. I'll do more experiments on this, and likely switch the lg model to include the unpruned vector table, especially now that we have the md model, which strikes a better compromise than the current lg package.
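If both packages are installed, you can compare their vector tables directly (a small sketch; the exact key counts will depend on the package versions you have):

import spacy

# Compare the size and shape of the vector tables of the two packages.
for name in ("en_core_web_lg", "en_vectors_web_lg"):
    nlp = spacy.load(name)
    print(name, nlp.vocab.vectors.n_keys, nlp.vocab.vectors.shape)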

Inverse transform word count vector to original document

I am training a simple model for text classification (currently with scikit-learn). To transform my document samples into word count vectors using a vocabulary I use
CountVectorizer(vocabulary=myDictionaryWords).fit_transform(myDocumentsAsArrays)
from sklearn.feature_extraction.text.
This works great, and I can subsequently train my classifier on these word count vectors as feature vectors. But what I don't know is how to inverse-transform these word count vectors back to the original documents. CountVectorizer indeed has a function inverse_transform(X), but this only gives you back the unique non-zero tokens.
As far as I know CountVectorizer doesn't have any implementation of a mapping back to the original documents.
Anyone know how I can restore the original sequences of tokens from their count-vectorized representation? Is there maybe a Tensorflow or any other module for this?
CountVectorizer is "lossy": for a document such as This is the amazing string in amazing program, it only stores the counts of the words in the document (i.e. string -> 1, amazing -> 2, etc.) and loses the position information.
So by reversing it you can create a document with the same words repeated the same number of times, but their original order in the document cannot be recovered.
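A short sketch (with a made-up document) showing what is and isn't recoverable from CountVectorizer:

from sklearn.feature_extraction.text import CountVectorizer

docs = ["This is the amazing string in amazing program"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

print(vectorizer.vocabulary_)           # word -> column index mapping
print(X.toarray())                      # counts only; word order is gone
print(vectorizer.inverse_transform(X))  # unique non-zero tokens per document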