How to tokenize a word with a hyphen in spaCy

I want to tokenize "bs-it" into ["bs", "it"] using spaCy, as I am using it with Rasa. The output I currently get is ["bs-it"]. Can somebody help me with that?

You can add custom rules to spaCy's tokenizer. By default, spaCy's tokenizer treats hyphenated words as a single token. To change that, you can add a custom tokenization rule. In your case you want to split on an infix, i.e. something that occurs between two words; infixes are usually characters like hyphens or underscores.
import re
import spacy
from spacy.tokenizer import Tokenizer
infix_re = re.compile(r'[-]')
def custom_tokenizer(nlp):
    return Tokenizer(nlp.vocab, infix_finditer=infix_re.finditer)
nlp = spacy.load("en_core_web_sm")
nlp.tokenizer = custom_tokenizer(nlp)
doc = nlp("bs-it")
print([t.text for t in doc])
Output
['bs', '-', 'it']
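Note that the hyphen remains as its own token. If you want exactly ["bs", "it"], one option (just a small sketch, not part of the original answer) is to drop punctuation tokens afterwards:
# drop punctuation tokens such as "-" to get ['bs', 'it']
print([t.text for t in doc if not t.is_punct])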


spacy IS_DIGIT or LIKE_NUM not working as expected for certain chars

I am trying to extract some numbers using the IS_DIGIT and LIKE_NUM attributes, but the behaviour seems a bit strange to a beginner like me.
The matcher is only able to detect the numbers when the five-character string ends in M, G, or T. If it ends in any other character, the IS_DIGIT and LIKE_NUM attributes do not detect it. What am I missing here?
nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)
pattern = [{'LIKE_NUM': True}]
matcher.add("DIGIT",[pattern])
doc = nlp("1231M 1232G 1233H 1234J 1235V 1236T")
matches = matcher(doc, as_spans=True)
for span in matches:
    print(span.text, span.label_)
# prints only 1231, 1232 and 1236
It may be helpful to just check which tokens are true for LIKE_NUM, like this:
import spacy
from spacy.matcher import Matcher
nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)
pattern = [{"LIKE_NUM": True}]
matcher.add("DIGIT", [pattern])
doc = nlp("1231M 1232G 1233H 1234J 1235V 1236T")
for tok in doc:
    print(tok, tok.like_num)
Here you'll see that some of your tokens are split in two and some aren't. The tokens you match are only the ones that consist of just digits.
Now, why are M, G, and T split off, while H, J, and V aren't? It's because they are treated as units, as in mega-, giga-, or terabytes.
This behaviour with units may seem inconsistent and weird, but it's been chosen to be consistent with the training data used for the English models. If you need to change it for your application, look at this section in the docs, which covers customizing the exceptions.
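As a rough illustration of that kind of customization (a sketch, not from the original answer; it extends the tokenizer's suffix patterns rather than the exception tables, and assumes you actually want H, J, and V split off the same way as the unit letters):
import spacy
from spacy.util import compile_suffix_regex

nlp = spacy.load("en_core_web_sm")

# add H, J and V to the characters that may be split off after a number
suffixes = list(nlp.Defaults.suffixes) + [r"(?<=[0-9])[HJV]"]
nlp.tokenizer.suffix_search = compile_suffix_regex(suffixes).search

doc = nlp("1231M 1232G 1233H 1234J 1235V 1236T")
print([t.text for t in doc])
# expected: ['1231', 'M', '1232', 'G', '1233', 'H', '1234', 'J', '1235', 'V', '1236', 'T']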

TensorFlow text tokenizer incorrect tokenization

I am trying to use the TF Tokenizer for an NLP model:
from tensorflow.keras.preprocessing.text import Tokenizer
tokenizer = Tokenizer(num_words=200, split=" ")
sample_text = ["This is a sample sentence1 created by sample person AB.CDEFGHIJKLMNOPQRSTUVWXYZ",
"This is another sample sentence1 created by another sample person AB.CDEFGHIJKLMNOPQRSTUVWXYZ"]
tokenizer.fit_on_texts(sample_text)
print (tokenizer.texts_to_sequences(["sample person AB.CDEFGHIJKLMNOPQRSTUVWXYZ"]))
Output:
[[1, 7, 8, 9]]
Word_Index:
print(tokenizer.index_word[8]) ===> 'ab'
print(tokenizer.index_word[9]) ===> 'cdefghijklmnopqrstuvwxyz'
The problem is that the tokenizer also splits on . in this case. I am passing split=" " to the Tokenizer, so I expect the following output:
[[1,7,8]], where tokenizer.index_word[8] should be 'ab.cdefghijklmnopqrstuvwxyz'
In other words, I want the tokenizer to split words on spaces (" ") only, not on any special characters.
How do I make the tokenizer create tokens only on spaces?
The Tokenizer takes another argument called filters, which by default contains all ASCII punctuation (filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'). During tokenization, every character contained in filters is replaced by the specified split string.
If you look at the source code of the Tokenizer, and specifically at the method fit_on_texts, you will see it uses the function text_to_word_sequence, which receives the filters characters and treats them the same as the split string it also receives:
def text_to_word_sequence(...):
    ...
    translate_dict = {c: split for c in filters}
    translate_map = maketrans(translate_dict)
    text = text.translate(translate_map)
    seq = text.split(split)
    return [i for i in seq if i]
So, in order to split on nothing but the specified split string, just pass an empty string to the filters argument.
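A minimal sketch of that fix, using the same sample text as in the question:
from tensorflow.keras.preprocessing.text import Tokenizer

# filters="" disables the punctuation stripping, so only split=" " is used
tokenizer = Tokenizer(num_words=200, split=" ", filters="")
sample_text = ["This is a sample sentence1 created by sample person AB.CDEFGHIJKLMNOPQRSTUVWXYZ",
               "This is another sample sentence1 created by another sample person AB.CDEFGHIJKLMNOPQRSTUVWXYZ"]
tokenizer.fit_on_texts(sample_text)
print(tokenizer.texts_to_sequences(["sample person AB.CDEFGHIJKLMNOPQRSTUVWXYZ"]))
# expected: [[1, 7, 8]], with tokenizer.index_word[8] == 'ab.cdefghijklmnopqrstuvwxyz'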

How not to get "datum" as the lemma for "data" when using Spacy?

I've run into the quite common word "data", which gets assigned the lemma "datum" from the lookups exceptions table spaCy uses. I understand that the lemma is technically correct, but in today's English, "data" in its basic form is just "data".
I am using the lemmas to get a sort of keyword set from a text, and if I have a text about data, I can't really tag it with "datum".
I was wondering if there is another way to arrive at plain "data" other than constructing another "my_exceptions" dictionary used for overriding in post-processing.
Thanks for any suggestions.
You could use Lemminflect, which works as an add-in pipeline component for spaCy. It should give you better results.
To use it with spaCy, just import lemminflect and call the new ._.lemma() function on the Token, i.e. token._.lemma(). Here's an example:
import lemminflect
import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp('I got the data')
for token in doc:
    print('%-6s %-6s %s' % (token.text, token.lemma_, token._.lemma()))
I -PRON- I
got get get
the the the
data datum data
Lemminflect has a prioritized list of lemmas, based on occurrence in corpus data. You can see all lemmas with...
print(lemminflect.getAllLemmas('data'))
{'NOUN': ('data', 'datum')}
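If you want the regular token.lemma_ attribute to reflect Lemminflect's choice (so downstream code doesn't have to know about the extension), one option is a small custom pipeline component. This is only a sketch, using the spaCy v2-style add_pipe API that matches the rest of this answer:
import lemminflect
import spacy

def use_lemminflect_lemmas(doc):
    # overwrite the default lemma with Lemminflect's top-ranked lemma
    for token in doc:
        token.lemma_ = token._.lemma()
    return doc

nlp = spacy.load('en_core_web_sm')
nlp.add_pipe(use_lemminflect_lemmas, last=True)  # spaCy v2 API; v3 uses registered factories

doc = nlp('I got the data')
print([t.lemma_ for t in doc])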
It's relatively easy to customize the lemmatizer once you know where to look. The original lemmatizer tables are from the package spacy-lookups-data and are loaded in the model under nlp.vocab.lookups. You can use a local install of spacy-lookups-data to customize the tables for new/blank models, but if you just want to make a few modifications to the entries for an existing model, you can modify the lemmatizer tables on the fly.
Depending on whether your pipeline includes a tagger, the lemmatizer may be referring to rules+exceptions (with a tagger) or to a simple lookup table (without a tagger), both of which include an exception that lemmatizes data to datum by default. If you remove this exception, you should get data as the lemma for data.
For a pipeline that includes a tagger (rule-based lemmatizer)
# tested with spaCy v2.2.4
import spacy
nlp = spacy.load("en_core_web_sm")
# remove exception from rule-based exceptions
lemma_exc = nlp.vocab.lookups.get_table("lemma_exc")
del lemma_exc[nlp.vocab.strings["noun"]]["data"]
assert nlp.vocab.morphology.lemmatizer("data", "NOUN") == ["data"]
# "data" with the POS "NOUN" has the lemma "data"
doc = nlp("data")
doc[0].pos_ = "NOUN" # just to make sure the POS is correct
assert doc[0].lemma_ == "data"
For a pipeline without a tagger (simple lookup lemmatizer)
import spacy
nlp = spacy.blank("en")
# remove exception from lookups
lemma_lookup = nlp.vocab.lookups.get_table("lemma_lookup")
del lemma_lookup[nlp.vocab.strings["data"]]
assert nlp.vocab.morphology.lemmatizer("data", "") == ["data"]
doc = nlp("data")
assert doc[0].lemma_ == "data"
For both: save model for future use with these modifications included in the lemmatizer tables
nlp.to_disk("/path/to/model")
Also be aware that the lemmatizer uses a cache, so make any changes before running your model on any texts or you may run into problems where it returns lemmas from the cache rather than the updated tables.

Don't include apostrophe s in Spacy named entities

Is there a way to avoid an apostrophe s being included in a named entity, and keep it as a separate token?
For example, I would like to keep the "'s" separate after merging the ents in the following sentence:
import spacy
nlp = spacy.load('en')
s = 'Donald Trump\'s role in the negotiations.'
doc = nlp(s)
for ent in doc.ents:
    ent.merge(tag=ent.root.tag_, lemma=ent.text, ent_type=ent.label_)

for t in doc:
    print(t)
Thanks a lot!
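One possible approach (only a sketch, not from the original thread): shrink each entity span so a trailing possessive "'s" stays outside before merging, using Doc.retokenize.
import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp("Donald Trump's role in the negotiations.")

with doc.retokenize() as retokenizer:
    for ent in doc.ents:
        end = ent.end
        # leave a possessive "'s" at the end of the entity out of the merge
        if doc[end - 1].tag_ == "POS":
            end -= 1
        if end - ent.start > 1:
            retokenizer.merge(doc[ent.start:end])

for t in doc:
    print(t)
# expected tokens: Donald Trump, 's, role, in, the, negotiations, .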

How do I split a Chinese string into characters using TensorFlow?

I want to use tf.data.TextLineDataset() to read Chinese sentences, then use the map() function to split them into single words, but tf.split doesn't work for Chinese.
I hope someone can kindly help with this issue.
This is my current solution (a rough code sketch follows after the notes below):
1. Read the Chinese sentences from the file with UTF-8 encoding.
2. Tokenize the sentences with a tool like jieba.
3. Construct the vocab table.
4. Convert the source/target sentences according to the vocab table.
5. Convert to a dataset using from_tensor_slices.
6. Get an iterator from the dataset.
7. Do other things.
If you use TextLineDataset to load Chinese sentences directly, the content of the dataset looks strange, displayed as a byte stream.
Maybe we could treat every byte as one character, as in English-like languages.
Can anyone confirm this or offer any other suggestion, please?
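A rough sketch of the steps above (assuming jieba is installed and TensorFlow 2.x eager execution; all names here are illustrative, not from the original answer):
import jieba
import tensorflow as tf

sentences = ["我爱南京", "南京是一个城市"]

# step 2: tokenize with jieba
tokenized = [list(jieba.cut(s)) for s in sentences]

# step 3: build a vocab table
vocab = {w: i for i, w in enumerate(sorted({w for sent in tokenized for w in sent}), start=1)}

# step 4: convert the sentences to id sequences
ids = [[vocab[w] for w in sent] for sent in tokenized]

# steps 5 and 6: wrap in a dataset and iterate over it
dataset = tf.data.Dataset.from_tensor_slices(tf.ragged.constant(ids))
for example in dataset:
    print(example.numpy())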
The above answer is one common option when handling non-English languages like Chinese, Korean, Japanese, etc.
You can also use the code below.
By the way, as you know, TextLineDataset reads text content as byte strings, so if we want to handle Chinese, we first need to decode it to Unicode. Unfortunately, there is no built-in option for this in TensorFlow, so we need to use another method such as tf.py_func to do it.
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import tensorflow as tf
def preprocess_func(x):
    ret = "*".join(x.decode('utf-8'))
    return ret

str = tf.py_func(
    preprocess_func,
    [tf.constant(u"我爱,南京")],
    tf.string)

with tf.Session() as sess:
    value = sess.run(str)
    print(value.decode('utf-8'))
output: 我*爱*,*南*京