spacy Matcher: get original keys - spacy

Once I get a match from spaCy's Matcher, I want to retrieve the key of the rule that matched. According to this guide, one specifies a key when adding a pattern:
matcher_ex = Matcher(nlp.vocab)
matcher_ex.add("mickey_key", None, [{"ORTH": "Mickey"}])
matcher_ex.add("minnie_key", None, [{"ORTH": "Minnie"}])
Next I run matching:
doc = nlp("Ub Iwerks designed Mickey's body out of circles in order to make the character simple to animate")
matcher_ex(doc)
# [(7888036183581346977, 3, 4)]
That's where it gets strange. It returns some other integer key, and I cannot figure out how to match that integer key 7888036183581346977 to mickey_key. This is what help(matcher_ex) says:
Call docstring:
Find all token sequences matching the supplied pattern.
doclike (Doc or Span): The document to match over.
RETURNS (list): A list of `(key, start, end)` tuples,
describing the matches. A match tuple describes a span
`doc[start:end]`. The `label_id` and `key` are both integers.
The object has no label_id property, and in any case that does not seem to be what I am looking for.
Seems like the Matcher must keep them both somewhere:
matcher_ex.has_key('mickey_key') # True
matcher_ex.has_key(7888036183581346977) # True
but the docs say nothing about how to map between them. I tried introspecting the code, but it's all in C.
Any idea how to match 7888036183581346977 to mickey_key?

Use nlp.vocab.strings to map the match IDs back to the rule names.
import spacy
from spacy.matcher import Matcher
nlp = spacy.load("en_core_web_sm")
matcher_ex = Matcher(nlp.vocab)
matcher_ex.add("mickey_key", None, [{"ORTH": "Mickey"}])
matcher_ex.add("minnie_key", None, [{"ORTH": "Minnie"}])
doc = nlp("Ub Iwerks designed Mickey's body out of circles in order to make the character simple to animate")
matches = matcher_ex(doc) # [(7888036183581346977, 3, 4)]
print(matches)
# [(7888036183581346977, 3, 4)]
rule_ids = dict()
for match in matches:
    rule_ids[match[0]] = nlp.vocab.strings[match[0]]
print(rule_ids)
# {7888036183581346977: 'mickey_key'}
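The integer is just the hash of the rule name in the shared StringStore, so the lookup works in both directions; for a single match you can also unpack the tuple directly (a minimal sketch using the objects above):
match_id, start, end = matches[0]
print(nlp.vocab.strings[match_id])      # 'mickey_key'
print(nlp.vocab.strings["mickey_key"])  # 7888036183581346977
print(doc[start:end].text)              # 'Mickey'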

Related

spacy IS_DIGIT or LIKE_NUM not working as expected for certain chars

I am trying to extract some numbers using the IS_DIGIT and LIKE_NUM attributes, but it seems to behave a bit strangely for a beginner like me.
The matcher is only able to detect the numbers when the 5-character string ends in M, G, or T. If it ends in any other character, the IS_DIGIT and LIKE_NUM attributes do not detect it. What am I missing here?
nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)
pattern = [{'LIKE_NUM': True}]
matcher.add("DIGIT",[pattern])
doc = nlp("1231M 1232G 1233H 1234J 1235V 1236T")
matches = matcher(doc, as_spans=True)
for span in matches:
    print(span.text, span.label_)
# prints only 1231, 1232 and 1236
It may be helpful to just check which tokens are true for LIKE_NUM, like this:
import spacy
from spacy.matcher import Matcher
nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)
pattern = [{"LIKE_NUM": True}]
matcher.add("DIGIT", [pattern])
doc = nlp("1231M 1232G 1233H 1234J 1235V 1236T")
for tok in doc:
    print(tok, tok.like_num)
Here you'll see that sometimes the tokens you have are split in two, and sometimes they aren't. The tokens you match are only the ones that consist just of digits.
Now, why are M, G, and T split off, while H, J, and V aren't? It's because the tokenizer treats them as units, as in mega-, giga-, or terabytes.
This behaviour with units may seem inconsistent and weird, but it's been chosen to be consistent with the training data used for the English models. If you need to change it for your application, look at this section in the docs, which covers customizing the exceptions.
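For example, if you wanted a trailing letter to always be split off from a preceding number, one option (a rough sketch; the extra suffix rule here is only an illustration, not the model's default behaviour) is to extend the tokenizer's suffix patterns:
import spacy
from spacy.util import compile_suffix_regex
nlp = spacy.load("en_core_web_sm")
# Add a suffix rule: a single letter directly after digits gets split off,
# so "1233H" tokenizes the same way "1231M" does
suffixes = list(nlp.Defaults.suffixes) + [r"(?<=\d)[A-Za-z]$"]
nlp.tokenizer.suffix_search = compile_suffix_regex(suffixes).search
doc = nlp("1231M 1232G 1233H 1234J 1235V 1236T")
print([t.text for t in doc])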

Obtaining the index of a word between two columns in pandas

I am checking which words the spaCy Spanish lemmatizer works on, using the .has_vector method. In one column of the dataframe I have the output of the function that indicates which words can be lemmatized, and in the other the corresponding phrase.
I would like to know how I can extract all the words with a False output, so that I can correct them and then lemmatize.
So I created the function:
def lemmatizer(text):
    doc = nlp(text)
    return ' '.join([str(word.has_vector) for word in doc])
And applied it to the column sentences in the DataFrame
df["Vectors"] = df.reviews.apply(lemmatizer)
And put in another data frame as:
df2= pd.DataFrame(df[['Vectors', 'reviews']])
The output is
index Vectors reviews
1 True True True False 'La pelicula es aburridora'
Two ways to do this:
import pandas
import spacy
nlp = spacy.load('en_core_web_lg')
df = pandas.DataFrame({'reviews': ["aaabbbcccc some example words xxxxyyyz"]})
If you want to use has_vector:
def get_oov1(text):
    return [word.text for word in nlp(text) if not word.has_vector]
Alternatively you can use the is_oov attribute:
def get_oov2(text):
    return [word.text for word in nlp(text) if word.is_oov]
Then as you already did:
df["oov_words1"] = df.reviews.apply(get_oov1)
df["oov_words2"] = df.reviews.apply(get_oov2)
Which will return:
> reviews oov_words1 oov_words2
0 aaabbbcccc some example words xxxxyyyz [aaabbbcccc, xxxxyyyz] [aaabbbcccc, xxxxyyyz]
Note:
When working with either of these attributes it is important to know that they are model dependent: smaller models ship without word vectors, so there is nothing backing the attributes and they simply fall back to a default value!
That means when you run the exact same code but e.g. with en_core_web_sm you get this:
> reviews oov_words1 oov_words2
0 aaabbbcccc some example words xxxxyyyz [] [aaabbbcccc, some, example, words, xxxxyyyz]
This is because neither attribute is backed by real data in the small model: has_vector ends up reporting True for every token, while is_oov falls back to its default of True for every token. So the has_vector check wrongly shows all words as known, and the is_oov check wrongly shows all words as unknown.
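A quick sanity check (not part of the original answer, just a suggestion) is to look at the size of the loaded model's vectors table before trusting either attribute:
import spacy
for name in ("en_core_web_sm", "en_core_web_lg"):
    nlp = spacy.load(name)
    # Models that ship without static word vectors report an empty vectors table
    print(name, nlp.vocab.vectors.shape)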

spacy matcher dealing with overlapping matches

I am new to spacy and trying to experiment with the Matcher. What I do not know is how to make the matcher pick one match when matches overlap. I want to be able to match both brain and tumor, because there may be other types of tumor. But I don't know how, once it finds both matches, to pick one. I tried playing with the callback functions but cannot figure out from the examples how to make it work.
doc = nlp("brain tumor resection")
pattern1 = [{'LOWER':'brain'}, [{'LOWER':'tumor'}]
pattern2 = [[{'LOWER':'tumor'}]
matcher.add("tumor", None, pattern1, pattern2)
phrase_matches = matcher(doc)
this gives me (0,2, Brain Tumor) and (1,2, Tumor)
The desired output is to pick just one, in this case brain tumor. I am also not sure how to adapt this if in other cases you find, say, spine tumor. How do you add logic so that the final output picks one match based on whatever the expert needs?
You need to fix the syntax a bit (remove the redundant [ in the pattern definitions) and use spacy.util.filter_spans to get the final matches.
See a code demo:
import spacy
from spacy.matcher import Matcher
nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)
doc = nlp("brain tumor resection")
pattern1 = [{'LOWER':'brain'}, {'LOWER':'tumor'}]
pattern2 = [{'LOWER':'tumor'}]
matcher.add("tumor", None, pattern1, pattern2)
spans = [doc[start:end] for _, start, end in matcher(doc)]
for span in spacy.util.filter_spans(spans):
    print((span.start, span.end, span.text))
Output: (0, 2, 'brain tumor').
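spacy.util.filter_spans keeps the longest spans and drops any span that overlaps one it has already kept, which is why brain tumor wins over tumor here. If you are on spaCy v3, you can also get Span objects straight from the matcher (a sketch assuming v3's add() signature and the as_spans flag):
matcher = Matcher(nlp.vocab)
matcher.add("tumor", [pattern1, pattern2])  # v3 style: one list of patterns, no on_match positional
for span in spacy.util.filter_spans(matcher(doc, as_spans=True)):
    print((span.start, span.end, span.text))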

Discovering taxonomic relationships with Spacy

How can we make some generic inference of taxonomic relations between entities from text? Looking at the words near 'type of' in the word2vec vectors of the en_core_web_lg model, they all seem unrelated. The words near 'type', however, are more similar to it. But how can I use common phrases in my text and apply some generic similarity measure for inferring taxonomy from SVO triples etc.? I could do a Sense2Vec-type approach, but I am wondering if something existing can be used without new training.
Output of code below:
['eradicate', 'wade', 'equator', 'educated', 'lcd', 'byproducts', 'two', 'propensity', 'rhinos', 'procrastinate']
def get_related(word):
    filtered_words = [w for w in word.vocab if w.is_lower == word.is_lower and w.prob >= -15]
    similarity = sorted(filtered_words, key=lambda w: word.similarity(w), reverse=True)
    return similarity[:10]
print ([w.lower_ for w in get_related(nlp.vocab[u'type_of'])])
All the similarities your code retrieves are 0.0, so sorting the list has no effect.
You are treating "type_of" as a single word (more accurately, a lexeme), and assuming spaCy will understand it as the phrase "type of". Note that the first has an underscore while the second does not; however, even without the underscore, it is not a lexeme in the model's vocabulary. Since the model has no data for "type_of", the similarity score is 0.0 for every word you compare it to.
Instead, you can create a Span of the words "type of" and call similarity() on that. This requires only a small change to your code:
import spacy
def get_related(span): # this now expects a Span instead of a Lexeme
    filtered_words = [w for w in span.vocab if
                      w.is_lower == span.text.islower()
                      and w.prob >= -15]  # filter by probability and case
    # (use the lowercase words if and only if the whole Span is in lowercase)
    similarity = sorted(filtered_words,
                        key=lambda w: span.similarity(w),
                        reverse=True)  # sort by the similarity of each word to the whole Span
    return similarity[:10]  # return the 10 most similar words
nlp = spacy.load('en_core_web_lg') # load the model
print([w.lower_ for w in get_related(nlp(u'type')[:])]) # print related words for "type"
print([w.lower_ for w in get_related(nlp(u'type of')[:])]) # print related words for "type of"
Output:
['type', 'types', 'kind', 'sort', 'specific', 'example', 'particular', 'similar', 'different', 'style']
['type', 'of', 'types', 'kind', 'particular', 'sort', 'different', 'such', 'same', 'associated']
As you can see, all the words are related to the input to some degree, and the output is similar but not identical for "type" and "type of".
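Under the hood a Span's vector is simply the average of its tokens' vectors, which is also why a generic word like "of" ranks highly for "type of". A quick check (assuming nlp is the en_core_web_lg model loaded above):
import numpy as np
span = nlp(u'type of')[:]
# The Span vector is the element-wise mean of the two token vectors
print(np.allclose(span.vector, (span[0].vector + span[1].vector) / 2))  # True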

Tensorflow vocabularyprocessor

I am following the WildML blog on text classification using TensorFlow. I am not able to understand the purpose of max_document_length in the code statement:
vocab_processor = learn.preprocessing.VocabularyProcessor(max_document_length)
Also, how can I extract the vocabulary from the vocab_processor?
I have figured out how to extract the vocabulary from the VocabularyProcessor object. This worked perfectly for me.
import numpy as np
from tensorflow.contrib import learn
x_text = ['This is a cat','This must be boy', 'This is a a dog']
max_document_length = max([len(x.split(" ")) for x in x_text])
## Create the VocabularyProcessor object, setting the max length of the documents.
vocab_processor = learn.preprocessing.VocabularyProcessor(max_document_length)
## Transform the documents using the vocabulary.
x = np.array(list(vocab_processor.fit_transform(x_text)))
## Extract word:id mapping from the object.
vocab_dict = vocab_processor.vocabulary_._mapping
## Sort the vocabulary dictionary on the basis of values(id).
## Both statements perform same task.
#sorted_vocab = sorted(vocab_dict.items(), key=operator.itemgetter(1))
sorted_vocab = sorted(vocab_dict.items(), key = lambda x : x[1])
## Treat the id's as index into list and create a list of words in the ascending order of id's
## word with id i goes at index i of the list.
vocabulary = list(list(zip(*sorted_vocab))[0])
print(vocabulary)
print(x)
not able to understand the purpose of max_document_length
The VocabularyProcessor maps your text documents into vectors, and you need these vectors to be of a consistent length.
Your input data records may not (or probably won't) be all the same length. For example if you're working with sentences for sentiment analysis they'll be of various lengths.
You provide this parameter to the VocabularyProcessor so that it can adjust the length of output vectors. According to the documentation,
max_document_length: Maximum length of documents. if documents are
longer, they will be trimmed, if shorter - padded.
Check out the source code.
def transform(self, raw_documents):
"""Transform documents to word-id matrix.
Convert words to ids with vocabulary fitted with fit or the one
provided in the constructor.
Args:
raw_documents: An iterable which yield either str or unicode.
Yields:
x: iterable, [n_samples, max_document_length]. Word-id matrix.
"""
for tokens in self._tokenizer(raw_documents):
word_ids = np.zeros(self.max_document_length, np.int64)
for idx, token in enumerate(tokens):
if idx >= self.max_document_length:
break
word_ids[idx] = self.vocabulary_.get(token)
yield word_ids
Note the line word_ids = np.zeros(self.max_document_length, np.int64).
Each document in raw_documents will be mapped to a vector of length max_document_length.
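As a quick illustration (reusing the vocab_processor fitted above, where max_document_length is 5 for this toy corpus), a shorter document gets zero-padded and a longer one gets trimmed:
short_doc = ["This is"]                      # 2 tokens -> padded with 0s up to length 5
long_doc = ["This is a cat and a dog too"]   # 8 tokens -> trimmed to the first 5 ids
print(list(vocab_processor.transform(short_doc))[0])  # something like [1 2 0 0 0], depending on the ids assigned during fit
print(list(vocab_processor.transform(long_doc))[0])   # only the first 5 word ids are kept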