I am checking which words the spaCy Spanish lemmatizer can handle, using the .has_vector method. One column of the DataFrame holds the output of the function that indicates which words can be lemmatized, and the other holds the corresponding phrase.
I would like to know how I can extract all the words with a False output so that I can correct them and lemmatize.
So I created the function:
def lemmatizer(text):
    doc = nlp(text)
    return ' '.join([str(word.has_vector) for word in doc])
And applied it to the reviews column of the DataFrame:
df["Vectors"] = df.reviews.apply(lemmatizer)
And put the result in another DataFrame:
df2 = pd.DataFrame(df[['Vectors', 'reviews']])
The output is
index  Vectors               reviews
1      True True True False  'La pelicula es aburridora'
Two ways to do this:
import pandas
import spacy
nlp = spacy.load('en_core_web_lg')
df = pandas.DataFrame({'reviews': ["aaabbbcccc some example words xxxxyyyz"]})
If you want to use has_vector:
def get_oov1(text):
    return [word.text for word in nlp(text) if not word.has_vector]
Alternatively you can use the is_oov attribute:
def get_oov2(text):
    return [word.text for word in nlp(text) if word.is_oov]
Then as you already did:
df["oov_words1"] = df.reviews.apply(get_oov1)
df["oov_words2"] = df.reviews.apply(get_oov2)
Which will return:
> reviews oov_words1 oov_words2
0 aaabbbcccc some example words xxxxyyyz [aaabbbcccc, xxxxyyyz] [aaabbbcccc, xxxxyyyz]
Note:
When working with either of these attributes it is important to know that they are model-dependent: smaller models ship without word vectors, so the attributes have nothing backing them and just return a default value!
That means when you run the exact same code but e.g. with en_core_web_sm you get this:
> reviews oov_words1 oov_words2
0 aaabbbcccc some example words xxxxyyyz [] [aaabbbcccc, some, example, words, xxxxyyyz]
This is because is_oov has a default value of True that the small model never sets, while has_vector still reports True for every token (the small model has no real vector table to check against). So has_vector wrongly shows all words as known, and is_oov wrongly shows all words as unknown.
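If you are unsure whether a model ships with real vectors, you can check the size of its vector table before trusting either attribute (a quick sketch, assuming spaCy v2 and the two models used above):
import spacy

# A model without a real vector table stores zero vectors, so
# has_vector/is_oov cannot give meaningful per-word answers.
nlp_sm = spacy.load('en_core_web_sm')
nlp_lg = spacy.load('en_core_web_lg')
print(len(nlp_sm.vocab.vectors))  # 0 for the small model
print(len(nlp_lg.vocab.vectors))  # a large number for the vector model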
Related
I have a pandas dataframe that looks like this
Text | Label
Some text | 0
hellow bye what | 1
...
Each row is a data point. Label is a 0/1 binary. The only feature is Text, which contains a set of words. I want to use the presence or absence of each word as a feature. For example, the features could be contains_some, contains_what, contains_hello, contains_bye, etc. This is typical one-hot encoding.
However, I don't want to manually create a feature for every single word in the vocabulary (the vocabulary is not huge, so I am not worried about the feature set exploding). I just want to supply the words as a single column to TensorFlow and have it create a binary feature for each word in the vocabulary.
Does TensorFlow/Keras have an API to do this?
You can use scikit-learn for this. Try:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(binary=True)
X = vectorizer.fit_transform(old_df['Text'])
new_df = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names())
new_df['Label'] = old_df['Label']
and this should give you:
   bye  hellow  some  text  what  Label
0    0       0     1     1     0      0
1    1       1     0     0     1      1
CountVectorizer converts a collection of text documents to a matrix of token counts.
The implementation produces a sparse representation of the counts using scipy.sparse.csr_matrix, and if binary=True then all non-zero counts are set to 1. This is useful for discrete probabilistic models that model binary events rather than integer counts.
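As a minimal illustration of what binary=True changes (a made-up one-document example): repeated words collapse from counts to a 0/1 presence flag.
from sklearn.feature_extraction.text import CountVectorizer

docs = ["bye bye what"]
# Default: token counts per document -> [[2 1]] (columns: bye, what)
print(CountVectorizer().fit_transform(docs).toarray())
# binary=True: presence/absence only -> [[1 1]]
print(CountVectorizer(binary=True).fit_transform(docs).toarray())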
What you're looking for is a (binary) bag of words, which you can get from scikit-learn's CountVectorizer.
You can do something like:
from sklearn.feature_extraction.text import CountVectorizer
bow = CountVectorizer(ngram_range=(1, 1), binary=True)
X_train = bow.fit_transform(df_train['text'].values)
This will create an array of binary values indicating the presence of each word in each text. Use binary=True to output 1 or 0 for whether the word is present; without it you get counts of occurrences per word. Either method works fine.
To inspect the counts you could use the following:
# Create sample dataframe of BoW outputs
count_vect_df = pd.DataFrame(X_train[:1].todense(), columns=bow.get_feature_names())
# Re-order counts in descending order. Keep top 10 counts for demo purposes
count_vect_df = count_vect_df[count_vect_df.iloc[-1, :].sort_values(ascending=False).index[:10]]
# Print combination of original train dataframe with BoW counts
pd.concat([df_train['text'][:1].reset_index(drop=True), count_vect_df], axis=1)
Update
If your features include categorical data you could try using to_categorical from tf.keras. See the docs for more information.
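For instance, to_categorical turns integer class labels into one-hot rows (a small sketch, separate from the text features above):
from tensorflow.keras.utils import to_categorical

# Integer labels -> one-hot rows, one column per class.
labels = [0, 2, 1]
print(to_categorical(labels, num_classes=3))
# [[1. 0. 0.]
#  [0. 0. 1.]
#  [0. 1. 0.]]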
Once I get a match with spaCy's Matcher, I want to get the key of the match. According to this guide, one can specify a key when adding a pattern:
matcher_ex = Matcher(nlp.vocab)
matcher_ex.add("mickey_key", None, [{"ORTH": "Mickey"}])
matcher_ex.add("minnie_key", None, [{"ORTH": "Minnie"}])
Next I run matching:
doc = nlp("Ub Iwerks designed Mickey's body out of circles in order to make the character simple to animate")
matcher_ex(doc)
# [(7888036183581346977, 3, 4)]
That's where it gets strange: it returns some other integer key, and I cannot figure out how to map the integer 7888036183581346977 back to mickey_key. This is what help(matcher_ex) says:
Call docstring:
Find all token sequences matching the supplied pattern.
doclike (Doc or Span): The document to match over.
RETURNS (list): A list of `(key, start, end)` tuples,
describing the matches. A match tuple describes a span
`doc[start:end]`. The `label_id` and `key` are both integers.
The object has no label_id property, but that doesn't seem to be what I am looking for anyway.
Seems like the Matcher must keep them both somewhere:
matcher_ex.has_key('mickey_key') # True
matcher_ex.has_key(7888036183581346977) # True
but the docs say nothing about how to map between them. I tried code introspection, but it's all in C.
Any idea how to match 7888036183581346977 to mickey_key?
Use nlp.vocab.strings to retrieve the rule ids.
import spacy
from spacy.matcher import Matcher
nlp = spacy.load("en_core_web_sm")
matcher_ex = Matcher(nlp.vocab)
matcher_ex.add("mickey_key", None, [{"ORTH": "Mickey"}])
matcher_ex.add("minnie_key", None, [{"ORTH": "Minnie"}])
doc = nlp("Ub Iwerks designed Mickey's body out of circles in order to make the character simple to animate")
matches = matcher_ex(doc) # [(7888036183581346977, 3, 4)]
print(matches)
# [(7888036183581346977, 3, 4)]
rule_ids = dict()
for match in matches:
    rule_ids[match[0]] = nlp.vocab.strings[match[0]]
print(rule_ids)
# {7888036183581346977: 'mickey_key'}
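This works because nlp.vocab.strings is spaCy's bidirectional StringStore: Matcher.add interns the key string, the match tuples report its hash, and you can look it up in either direction:
match_id = nlp.vocab.strings["mickey_key"]  # string -> hash: 7888036183581346977
print(nlp.vocab.strings[match_id])          # hash -> string: 'mickey_key'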
I've run into the quite common word "data", which gets assigned the lemma "datum" from the lookup exceptions table spaCy uses. I understand that the lemma is technically correct, but in today's English, "data" in its basic form is just "data".
I am using the lemmas to extract a sort of keyword set from text, and if I have a text about data, I can't possibly tag it with "datum".
I was wondering if there is another way to arrive at plain "data" than constructing another "my_exceptions" dictionary and overriding the lemma in post-processing.
Thanks for any suggestions.
You could use Lemminflect, which works as an add-in pipeline component for spaCy. It should give you better results.
To use it with spaCy, just import lemminflect and call the new ._.lemma() function on the Token, i.e. token._.lemma(). Here's an example:
import lemminflect
import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp('I got the data')
for token in doc:
    print('%-6s %-6s %s' % (token.text, token.lemma_, token._.lemma()))
I      -PRON- I
got    get    get
the    the    the
data   datum  data
Lemminflect has a prioritized list of lemmas, based on occurrence in corpus data. You can see all lemmas with...
print(lemminflect.getAllLemmas('data'))
{'NOUN': ('data', 'datum')}
It's relatively easy to customize the lemmatizer once you know where to look. The original lemmatizer tables are from the package spacy-lookups-data and are loaded in the model under nlp.vocab.lookups. You can use a local install of spacy-lookups-data to customize the tables for new/blank models, but if you just want to make a few modifications to the entries for an existing model, you can modify the lemmatizer tables on the fly.
Depending on whether your pipeline includes a tagger, the lemmatizer may be referring to rules+exceptions (with a tagger) or to a simple lookup table (without a tagger), both of which include an exception that lemmatizes data to datum by default. If you remove this exception, you should get data as the lemma for data.
For a pipeline that includes a tagger (rule-based lemmatizer)
# tested with spaCy v2.2.4
import spacy
nlp = spacy.load("en_core_web_sm")
# remove exception from rule-based exceptions
lemma_exc = nlp.vocab.lookups.get_table("lemma_exc")
del lemma_exc[nlp.vocab.strings["noun"]]["data"]
assert nlp.vocab.morphology.lemmatizer("data", "NOUN") == ["data"]
# "data" with the POS "NOUN" has the lemma "data"
doc = nlp("data")
doc[0].pos_ = "NOUN" # just to make sure the POS is correct
assert doc[0].lemma_ == "data"
For a pipeline without a tagger (simple lookup lemmatizer)
import spacy
nlp = spacy.blank("en")
# remove exception from lookups
lemma_lookup = nlp.vocab.lookups.get_table("lemma_lookup")
del lemma_lookup[nlp.vocab.strings["data"]]
assert nlp.vocab.morphology.lemmatizer("data", "") == ["data"]
doc = nlp("data")
assert doc[0].lemma_ == "data"
For both: save model for future use with these modifications included in the lemmatizer tables
nlp.to_disk("/path/to/model")
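Reloading the saved model later should pick up the modified tables (a sketch; the path is just a placeholder):
import spacy

# Load the model saved above; the edited lemmatizer tables come with it.
nlp = spacy.load("/path/to/model")
print(nlp("data")[0].lemma_)  # should now be 'data'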
Also be aware that the lemmatizer uses a cache, so make any changes before running your model on any texts, or you may run into problems where it returns lemmas from the cache rather than from the updated tables.
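A sketch of that pitfall, assuming spaCy v2.2 and a blank English pipeline as above: lemmatize once before editing the table and the stale result can stick around.
import spacy

nlp = spacy.blank("en")
print(nlp("data")[0].lemma_)  # 'datum' -- and now cached
lemma_lookup = nlp.vocab.lookups.get_table("lemma_lookup")
del lemma_lookup[nlp.vocab.strings["data"]]
print(nlp("data")[0].lemma_)  # may still print 'datum' from the cache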
I am a beginner in Python. This is an excerpt of code from https://github.com/minsuk-heo/kaggle-titanic/blob/master/titanic-solution.ipynb (line no. 12). I was trying to understand a bar chart with it:
def bar_chart(feature):
    survived = train[train['Survived']==1][feature].value_counts()
    dead = train[train['Survived']==0][feature].value_counts()
    df = pd.DataFrame([survived, dead])
    df.index = ['Survived', 'Dead']
    df.plot(kind='bar', stacked=True, figsize=(10, 5))
Pranjal, try to learn a Python module first (here, pandas) before jumping into any challenge (say, Kaggle's Titanic).
To answer your question, consider the lines you asked about:
Line 2: survived = train[train['Survived']==1][feature].value_counts()
Line 3: dead = train[train['Survived']==0][feature].value_counts()
The train['Survived']==1 expression produces a boolean (True/False) pandas Series: True where the Survived column equals 1, False otherwise. That Series is then used to index train, keeping only the rows mapped to True and dropping the rest. Next, you select just the feature column from the resulting DataFrame and call value_counts(), which returns an object containing counts of unique values. Line 3 proceeds in the same way.
Tip: none of this makes any permanent change to the train DataFrame.
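Here is a minimal standalone illustration of the same boolean-mask pattern, using toy data rather than the actual Titanic set:
import pandas as pd

train = pd.DataFrame({'Survived': [1, 0, 1], 'Sex': ['female', 'male', 'female']})
mask = train['Survived'] == 1          # boolean Series: [True, False, True]
print(train[mask]['Sex'].value_counts())
# female    2
# Name: Sex, dtype: int64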
I am following the WildML blog on text classification using TensorFlow. I am not able to understand the purpose of max_document_length in the code statement:
vocab_processor = learn.preprocessing.VocabularyProcessor(max_document_length)
Also, how can I extract the vocabulary from the vocab_processor?
I have figured out how to extract the vocabulary from the VocabularyProcessor object. This worked perfectly for me.
import numpy as np
from tensorflow.contrib import learn
x_text = ['This is a cat','This must be boy', 'This is a a dog']
max_document_length = max([len(x.split(" ")) for x in x_text])
## Create the VocabularyProcessor object, setting the max length of the documents.
vocab_processor = learn.preprocessing.VocabularyProcessor(max_document_length)
## Transform the documents using the vocabulary.
x = np.array(list(vocab_processor.fit_transform(x_text)))
## Extract word:id mapping from the object.
vocab_dict = vocab_processor.vocabulary_._mapping
## Sort the vocabulary dictionary on the basis of values(id).
## Both statements perform the same task.
#sorted_vocab = sorted(vocab_dict.items(), key=operator.itemgetter(1))
sorted_vocab = sorted(vocab_dict.items(), key = lambda x : x[1])
## Treat the id's as index into list and create a list of words in the ascending order of id's
## word with id i goes at index i of the list.
vocabulary = list(list(zip(*sorted_vocab))[0])
print(vocabulary)
print(x)
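For reference, VocabularyProcessor reserves id 0 for padding/unknown and assigns ids in order of first appearance, so on the x_text above the printed output should look roughly like:
['<UNK>', 'This', 'is', 'a', 'cat', 'must', 'be', 'boy', 'dog']
[[1 2 3 4 0]
 [1 5 6 7 0]
 [1 2 3 3 8]]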
not able to understand the purpose of max_document_length
The VocabularyProcessor maps your text documents into vectors, and you need these vectors to be of a consistent length.
Your input data records may not (or probably won't) be all the same length. For example if you're working with sentences for sentiment analysis they'll be of various lengths.
You provide this parameter to the VocabularyProcessor so that it can adjust the length of output vectors. According to the documentation,
max_document_length: Maximum length of documents. if documents are
longer, they will be trimmed, if shorter - padded.
Check out the source code.
def transform(self, raw_documents):
    """Transform documents to word-id matrix.

    Convert words to ids with vocabulary fitted with fit or the one
    provided in the constructor.

    Args:
      raw_documents: An iterable which yield either str or unicode.

    Yields:
      x: iterable, [n_samples, max_document_length]. Word-id matrix.
    """
    for tokens in self._tokenizer(raw_documents):
        word_ids = np.zeros(self.max_document_length, np.int64)
        for idx, token in enumerate(tokens):
            if idx >= self.max_document_length:
                break
            word_ids[idx] = self.vocabulary_.get(token)
        yield word_ids
Note the line word_ids = np.zeros(self.max_document_length, np.int64).
Each row of raw_documents will be mapped to a vector of length max_document_length.
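A quick demonstration of the trim/pad behavior (same TF 1.x contrib API as the answer above):
import numpy as np
from tensorflow.contrib import learn

vp = learn.preprocessing.VocabularyProcessor(max_document_length=4)
docs = ['short text', 'a much longer sentence than four tokens']
print(np.array(list(vp.fit_transform(docs))))
# Every row has exactly 4 ids: the first document is zero-padded,
# the second is trimmed to its first 4 tokens.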