I want to use spaCy matcher to match:
orange, apple and grape are fruits.
That is: [NOUN[,and]]+ NOUN are fruits.
However, my pattern is not correct. Could anyone help me write a correct pattern?
Thank you!
It seems that spaCy with en_core_web_sm tags orange as ADJ rather than NOUN. You can check this by running the code below:
import spacy
nlp = spacy.load("en_core_web_sm")
# Print the coarse-grained part-of-speech tag of every token.
for token in nlp('orange, apple and grape are fruits.'):
    print(token.pos_, end=' ')
>>> 'ADJ PUNCT NOUN CCONJ NOUN VERB NOUN PUNCT'
You can either add a custom entity and train the model on it, or use plain text matching and handle orange as a special case. It really depends on what you are trying to achieve.
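As a rough illustration of the text-matching route, the sentence can be matched with a plain regular expression instead of POS tags, which sidesteps the orange/ADJ problem. This is only a sketch: the pattern and the helper name find_fruit_list are made up here, and it assumes single-word items separated by commas with a final "and".

```python
import re

# Hypothetical pattern: zero or more "word," items, then "word and word",
# followed by the literal phrase "are fruits".
FRUIT_LIST = re.compile(r"\b((?:\w+,\s*)*\w+\s+and\s+\w+)\s+are\s+fruits\b")


def find_fruit_list(text):
    """Return the listed items, or an empty list if the sentence does not match."""
    match = FRUIT_LIST.search(text)
    if match is None:
        return []
    # Split the captured list on commas and on the word "and".
    return [item for item in re.split(r",\s*|\s+and\s+", match.group(1)) if item]


print(find_fruit_list("orange, apple and grape are fruits."))
```

Because no tagger is involved, this will also accept words that are not nouns at all, so it only suits input where that is acceptable.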
I am trying to extract some numbers using the IS_DIGIT and LIKE_NUM attributes, but as a beginner I find the behaviour a bit strange.
The matcher only detects the numbers when the 5-character string ends in M, G, or T. If it ends in any other character, neither IS_DIGIT nor LIKE_NUM detects the number. What am I missing here?
import spacy
from spacy.matcher import Matcher
nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)
pattern = [{'LIKE_NUM': True}]
matcher.add("DIGIT", [pattern])
doc = nlp("1231M 1232G 1233H 1234J 1235V 1236T")
matches = matcher(doc, as_spans=True)
for span in matches:
    print(span.text, span.label_)
# prints only 1231, 1232 and 1236
It may be helpful to just check which tokens are true for LIKE_NUM, like this:
import spacy
from spacy.matcher import Matcher
nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)
pattern = [{"LIKE_NUM": True}]
matcher.add("DIGIT", [pattern])
doc = nlp("1231M 1232G 1233H 1234J 1235V 1236T")
for tok in doc:
    print(tok, tok.like_num)
Here you'll see that sometimes the tokens you have are split in two, and sometimes they aren't. The tokens you match are only the ones that consist just of digits.
Now, why are M, G, and T split off, while H, J, and V aren't? This is because they are units, as for mega, giga, or terabytes.
This behaviour with units may seem inconsistent and weird, but it's been chosen to be consistent with the training data used for the English models. If you need to change it for your application, look at this section in the docs, which covers customizing the exceptions.
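If changing the tokenizer is more than you need, one workaround is to pull the digits out of the raw text yourself. The sketch below uses only Python's re module, not spaCy; the helper name split_number_and_suffix is made up for illustration, and it assumes each item is a run of digits followed by exactly one letter.

```python
import re

# Digits followed by a single trailing letter, e.g. "1233H".
NUMBER_WITH_SUFFIX = re.compile(r"\b(\d+)([A-Za-z])\b")


def split_number_and_suffix(text):
    """Return (number, suffix) pairs for every digits-plus-letter item in text."""
    return [(int(num), suffix) for num, suffix in NUMBER_WITH_SUFFIX.findall(text)]


for number, suffix in split_number_and_suffix("1231M 1232G 1233H 1234J 1235V 1236T"):
    print(number, suffix)
```

This treats every suffix letter the same way, so H, J and V are handled just like M, G and T.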
I am new to spaCy and am experimenting with the Matcher. What I do not know is how to make the matcher pick a single match when matches overlap. I want to be able to match both brain and tumor, because there may be other types of tumor, but once it finds both matches I don't know how to pick one. I tried playing with the callback functions but cannot figure out from the examples how to make it work.
doc = nlp("brain tumor resection")
pattern1 = [{'LOWER':'brain'}, [{'LOWER':'tumor'}]
pattern2 = [[{'LOWER':'tumor'}]
matcher.add("tumor", None, pattern1, pattern2)
phrase_matches = matcher(doc)
This gives me (0,2, Brain Tumor) and (1,2, Tumor).
The desired output is just one match, in this case brain tumor. I am also not sure how to adapt this when other cases contain, say, spine tumor. How do you add logic so that the final output picks one match based on whatever the expert needs?
You need to fix the syntax a bit (remove the redundant [ in the pattern definitions) and use spacy.util.filter_spans to get the final matches.
See a code demo:
import spacy
from spacy.matcher import Matcher
nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)
doc = nlp("brain tumor resection")
pattern1 = [{'LOWER':'brain'}, {'LOWER':'tumor'}]
pattern2 = [{'LOWER':'tumor'}]
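# spaCy v3 signature; in v2 this was matcher.add("tumor", None, pattern1, pattern2)
matcher.add("tumor", [pattern1, pattern2])
spans = [doc[start:end] for _, start, end in matcher(doc)]
# filter_spans keeps the longest span when candidates overlap
for span in spacy.util.filter_spans(spans):
    print((span.start, span.end, span.text))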
Output: (0, 2, 'brain tumor').
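To make the selection rule concrete, here is a plain-Python sketch of the kind of greedy filtering that filter_spans is documented to perform: the longest span wins, and among equally long spans the earlier one wins. It is an illustration rather than spaCy's own code, and the function name filter_overlaps is made up.

```python
def filter_overlaps(spans):
    """Keep the longest spans first and drop any span that overlaps a kept one.

    Each span is a (start, end) pair of token offsets, end exclusive.
    """
    # Longest first; ties go to the span that starts earlier.
    ordered = sorted(spans, key=lambda s: (s[1] - s[0], -s[0]), reverse=True)
    kept = []
    seen_tokens = set()
    for start, end in ordered:
        # Skip the span if any of its tokens already belongs to a kept span.
        if any(i in seen_tokens for i in range(start, end)):
            continue
        kept.append((start, end))
        seen_tokens.update(range(start, end))
    return sorted(kept)


print(filter_overlaps([(0, 2), (1, 2)]))
```

If an expert rule should decide instead of span length, the sort key is the place to change the priority, for example by ranking on a per-label score before length.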
I want to tokenize bs-it to ["bs","it"] using spaCy, as I am using it with Rasa. The output I get is ["bs-it"]. Can somebody help me with that?
You can add custom rules to spaCy's tokenizer. spaCy's tokenizer treats hyphenated words as a single token. To change that, you can add a custom tokenization rule. In your case, you want to split on an infix, i.e. something that occurs between two words; these are usually hyphens or underscores.
import re
import spacy
from spacy.tokenizer import Tokenizer
infix_re = re.compile(r'[-]')
def custom_tokenizer(nlp):
    return Tokenizer(nlp.vocab, infix_finditer=infix_re.finditer)
nlp = spacy.load("en_core_web_sm")
nlp.tokenizer = custom_tokenizer(nlp)
doc = nlp("bs-it")
print([t.text for t in doc])
Output
['bs', '-', 'it']
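Note that the output above still contains the hyphen as its own token, whereas the question asked for ["bs", "it"]. One simple option, assuming it is acceptable to filter after tokenization rather than inside the tokenizer, is to drop the hyphen tokens from the resulting list. The helper name drop_hyphens is made up for this sketch; with a spaCy Doc the input would be [t.text for t in doc].

```python
def drop_hyphens(tokens):
    """Remove tokens that are just a hyphen, keeping everything else in order."""
    return [tok for tok in tokens if tok != "-"]


print(drop_hyphens(["bs", "-", "it"]))
```

This only changes the list of strings. If a downstream component reads the Doc directly, the hyphen token will still be present there.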
I've found that spaCy's similarity does a decent job of comparing my documents using "en_core_web_lg" out-of-the-box.
I'd like to tighten up relationships in some areas and thought adding custom NER labels to the model would help, but my results before and after show no improvements, even though I've been able to create a test set of custom entities.
Now I'm wondering, was my theory completely wrong, or could I simply be missing something in my pipeline?
If I was wrong, what's the best approach to improve results? Seems like some sort of custom labeling should help.
Here's an example of what I've tested so far:
import spacy
from spacy.pipeline import EntityRuler
from spacy.tokens import Doc
from spacy.gold import GoldParse
nlp = spacy.load("en_core_web_lg")
docA = nlp("Add fractions with like denominators.")
docB = nlp("What does one-third plus one-third equal?")
sim_before = docA.similarity(docB)
print(sim_before)
0.5949629181460099
^^ Not too shabby, but I'd like to see results closer to 0.85 in this example.
So, I use EntityRuler and add some patterns to try and tighten up the relationships:
ruler = EntityRuler(nlp)
patterns = [
    {"label": "ADDITION", "pattern": "Add"},
    {"label": "ADDITION", "pattern": "plus"},
    {"label": "FRACTION", "pattern": "one-third"},
    {"label": "FRACTION", "pattern": "fractions"},
    {"label": "FRACTION", "pattern": "denominators"},
]
ruler.add_patterns(patterns)
nlp.add_pipe(ruler, before='ner')
print(nlp.pipe_names)
['tagger', 'parser', 'entity_ruler', 'ner']
Adding GoldParse seems to be important, so I added the following and updated NER:
doc1 = Doc(nlp.vocab, [u'What', u'does', u'one-third', u'plus', u'one-third', u'equal'])
gold1 = GoldParse(doc1, [u'0', u'0', u'U-FRACTION', u'U-ADDITION', u'U-FRACTION', u'O'])
doc2 = Doc(nlp.vocab, [u'Add', u'fractions', u'with', u'like', u'denominators'])
gold2 = GoldParse(doc2, [u'U-ADDITION', u'U-FRACTION', u'O', u'O', u'U-FRACTION'])
ner = nlp.get_pipe("ner")
losses = {}
optimizer = nlp.begin_training()
ner.update([doc1, doc2], [gold1, gold2], losses=losses, sgd=optimizer)
{'ner': 0.0}
You can see my custom entities are working, but the test results show zero improvement:
test1 = nlp("Add fractions with like denominators.")
test2 = nlp("What does one-third plus one-third equal?")
print([(ent.text, ent.label_) for ent in test1.ents])
print([(ent.text, ent.label_) for ent in test2.ents])
sim = test1.similarity(test2)
print(sim)
[('Add', 'ADDITION'), ('fractions', 'FRACTION'), ('denominators', 'FRACTION')]
[('one-third', 'FRACTION'), ('plus', 'ADDITION'), ('one-third', 'FRACTION')]
0.5949629181460099
Any tips would be greatly appreciated!
Doc.similarity only uses the word vectors, not any other annotation. From the Doc API:
The default estimate is cosine similarity using an average of word vectors.
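To see why entity labels cannot move the score, it helps to write out what "cosine similarity using an average of word vectors" means. The sketch below uses made-up 3-dimensional vectors in plain Python rather than real spaCy vectors, and the function names are only illustrative.

```python
import math


def average(vectors):
    """Element-wise mean of equal-length vectors."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity of two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up 3-dimensional "word vectors" for two short documents.
doc_a = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
doc_b = [[1.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

# Entity labels never appear in this computation, only the vectors do.
similarity = cosine(average(doc_a), average(doc_b))
print(round(similarity, 4))
```

Adding an EntityRuler or updating the NER component leaves the token vectors unchanged, which is consistent with the identical score before and after in the question.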
I found my solution was nestled in this tutorial: Text Classification in Python Using spaCy, which generates a BoW matrix for spaCy's text data by using SciKit-Learn's CountVectorizer.
I avoided sentiment analysis tutorials because they use binary classification and I need support for multiple categories. The trick was to set multi_class='auto' on the LogisticRegression linear model, and to use average='micro' for the precision and recall scores, so that all my text data, such as entities, were leveraged:
classifier = LogisticRegression(solver='lbfgs', multi_class='auto')
and...
print("Logistic Regression Accuracy:",metrics.accuracy_score(y_test, predicted))
print("Logistic Regression Precision:",metrics.precision_score(y_test, predicted,average='micro'))
print("Logistic Regression Recall:",metrics.recall_score(y_test, predicted,average='micro'))
Hope this helps save someone some time!
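For readers unsure what average='micro' changes, micro-averaging pools the true and false positives of every class before dividing, instead of averaging per-class scores. Here is a minimal plain-Python sketch with invented labels; the function name micro_precision is made up and this is not scikit-learn's implementation.

```python
def micro_precision(y_true, y_pred):
    """Micro-averaged precision: pool true and false positives over all classes."""
    true_pos = 0
    false_pos = 0
    for label in set(y_true) | set(y_pred):
        for truth, guess in zip(y_true, y_pred):
            if guess == label:
                if truth == label:
                    true_pos += 1
                else:
                    false_pos += 1
    return true_pos / (true_pos + false_pos)


# Invented labels, purely for illustration.
y_true = ["add", "add", "fraction", "decimal"]
y_pred = ["add", "fraction", "fraction", "decimal"]
print(micro_precision(y_true, y_pred))
```

When every sample has exactly one label, as here, the micro-averaged precision is the same number as the accuracy.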
I want to use tf.data.TextLineDataset() to read Chinese sentences and then use the map() function to split each sentence into single words, but tf.split doesn't work for Chinese.
I also hope someone can help us kindly with the issue.
This is my current solution:
1. Read the Chinese sentences from a UTF-8 encoded file.
2. Tokenize the sentences with a tool such as jieba.
3. Construct the vocab table.
4. Convert the source/target sentences according to the vocab table.
5. Convert the result to a dataset using from_tensor_slices.
6. Get an iterator from the dataset.
7. Do other things.
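The first four steps can be sketched in plain Python as follows. This is only an illustration under stated assumptions: a per-character tokenizer stands in for jieba, the two sample sentences are hard-coded instead of being read from a file, and the names tokenize, build_vocab and encode are made up.

```python
def tokenize(sentence):
    """Stand-in tokenizer: one token per character, skipping whitespace."""
    return [ch for ch in sentence if not ch.isspace()]


def build_vocab(token_lists):
    """Map each distinct token to an integer id; 0 is reserved for unknown tokens."""
    vocab = {"<unk>": 0}
    for tokens in token_lists:
        for tok in tokens:
            vocab.setdefault(tok, len(vocab))
    return vocab


def encode(tokens, vocab):
    """Convert tokens to ids, falling back to the unknown id."""
    return [vocab.get(tok, vocab["<unk>"]) for tok in tokens]


# Lines as they would arrive from a UTF-8 file: bytes that must be decoded first.
raw_lines = ["我爱,南京".encode("utf-8"), "我爱北京".encode("utf-8")]
sentences = [line.decode("utf-8") for line in raw_lines]

tokenized = [tokenize(s) for s in sentences]
vocab = build_vocab(tokenized)
ids = [encode(tokens, vocab) for tokens in tokenized]
print(ids)
```

The resulting id lists have different lengths, so they would need padding to a common length before being passed to from_tensor_slices.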
If I use TextLineDataset to load the Chinese sentences directly, the content of the dataset looks strange: it is displayed as a stream of bytes.
Maybe we can treat every byte the way we would treat one character in an English-like language.
Can anyone confirm this, or does anyone have another suggestion, please?
The above answer is one common option when handling languages that are not written like English, such as Chinese, Korean, or Japanese.
You can also use the code below.
By the way, as you know, TextLineDataset reads text content as a byte string.
So if we want to handle Chinese, we first need to decode it to unicode.
Unfortunately, there is no such option in TensorFlow.
We need to choose another method, such as py_func, to do this.
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import tensorflow as tf
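def preprocess_func(x):
    # x arrives as UTF-8 bytes: decode, join the characters with "*", re-encode
    ret = "*".join(x.decode('utf-8'))
    return ret.encode('utf-8')
# tf.py_func and tf.Session are TensorFlow 1.x APIs
joined = tf.py_func(
    preprocess_func,
    [tf.constant(u"我爱,南京")],
    tf.string)
with tf.Session() as sess:
    value = sess.run(joined)
    print(value.decode('utf-8'))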
output: 我*爱*,*南*京