This is a very specific question about a very specific case in spaCy. I hope Stack Overflow is still the right place to ask it.
There seems to be a difference in the results when using nlp.pipe(docs) vs nlp(doc). When I use the former instead of the latter, I don't get vectors under certain circumstances. Here's what I noticed:
When using the default model (en_core_web_sm) with the tagger and parser disabled, I get word vectors with nlp(doc) but not with nlp.pipe(docs)
Larger models, e.g. en_core_web_lg, generate vectors in both cases
Re-enabling the tagger also "fixes" this for the default model
Apart from the vectors, disabling the tagger and parser seems to work as expected in both cases
Can somebody explain this to me, especially why the tagger seems to influence it?
Code:
import spacy

documents = [
    "spaCy is a free, open-source library for advanced Natural Language Processing (NLP) in Python.",
    "If you're working with a lot of text, you'll eventually want to know more about it. For example, what's it about? What do the words mean in context? Who is doing what to whom? What companies and products are mentioned? Which texts are similar to each other?",
    "spaCy is designed specifically for production use and helps you build applications that process and \"understand\" large volumes of text. It can be used to build information extraction or natural language understanding systems, or to pre-process text for deep learning."
]

nlp = spacy.load('en', disable=['tagger', 'parser'])

print('classic model, disabled tagger and parser, without pipe')
for i, doc in enumerate(documents):
    spacy_doc = nlp(doc)
    print(f'document {i} has_vector: {spacy_doc.has_vector}, is_parsed: {spacy_doc.is_parsed}, token_0_pos: {spacy_doc[0].pos_}')

print('classic model, disabled tagger and parser, with pipe')
for i, spacy_doc in enumerate(nlp.pipe(documents)):
    print(f'document {i} has_vector: {spacy_doc.has_vector}, is_parsed: {spacy_doc.is_parsed}, token_0_pos: {spacy_doc[0].pos_}')

nlp = spacy.load('en', disable=['parser'])

print('classic model, disabled only parser, without pipe')
for i, doc in enumerate(documents):
    spacy_doc = nlp(doc)
    print(f'document {i} has_vector: {spacy_doc.has_vector}, is_parsed: {spacy_doc.is_parsed}, token_0_pos: {spacy_doc[0].pos_}')

print('classic model, disabled only parser, with pipe')
for i, spacy_doc in enumerate(nlp.pipe(documents)):
    print(f'document {i} has_vector: {spacy_doc.has_vector}, is_parsed: {spacy_doc.is_parsed}, token_0_pos: {spacy_doc[0].pos_}')

nlp = spacy.load('en', disable=['tagger'])

print('classic model, disabled only tagger, without pipe')
for i, doc in enumerate(documents):
    spacy_doc = nlp(doc)
    print(f'document {i} has_vector: {spacy_doc.has_vector}, is_parsed: {spacy_doc.is_parsed}, token_0_pos: {spacy_doc[0].pos_}')

print('classic model, disabled only tagger, with pipe')
for i, spacy_doc in enumerate(nlp.pipe(documents)):
    print(f'document {i} has_vector: {spacy_doc.has_vector}, is_parsed: {spacy_doc.is_parsed}, token_0_pos: {spacy_doc[0].pos_}')

nlp = spacy.load('en_core_web_lg', disable=['tagger', 'parser'])

print('model en_core_web_lg, disabled tagger and parser, with pipe')
for i, spacy_doc in enumerate(nlp.pipe(documents)):
    print(f'document {i} has_vector: {spacy_doc.has_vector}, is_parsed: {spacy_doc.is_parsed}, token_0_pos: {spacy_doc[0].pos_}')
Output:
classic model, disabled tagger and parser, without pipe
document 0 has_vector: True, is_parsed: False, token_0_pos:
document 1 has_vector: True, is_parsed: False, token_0_pos:
document 2 has_vector: True, is_parsed: False, token_0_pos:
classic model, disabled tagger and parser, with pipe
document 0 has_vector: False, is_parsed: False, token_0_pos:
document 1 has_vector: False, is_parsed: False, token_0_pos:
document 2 has_vector: False, is_parsed: False, token_0_pos:
classic model, disabled only parser, without pipe
document 0 has_vector: True, is_parsed: False, token_0_pos: ADJ
document 1 has_vector: True, is_parsed: False, token_0_pos: ADP
document 2 has_vector: True, is_parsed: False, token_0_pos: ADJ
classic model, disabled only parser, with pipe
document 0 has_vector: True, is_parsed: False, token_0_pos: ADJ
document 1 has_vector: True, is_parsed: False, token_0_pos: ADP
document 2 has_vector: True, is_parsed: False, token_0_pos: ADJ
classic model, disabled only tagger, without pipe
document 0 has_vector: True, is_parsed: True, token_0_pos:
document 1 has_vector: True, is_parsed: True, token_0_pos:
document 2 has_vector: True, is_parsed: True, token_0_pos:
classic model, disabled only tagger, with pipe
document 0 has_vector: False, is_parsed: True, token_0_pos:
document 1 has_vector: False, is_parsed: True, token_0_pos:
document 2 has_vector: False, is_parsed: True, token_0_pos:
model en_core_web_lg, disabled tagger and parser, with pipe
document 0 has_vector: True, is_parsed: False, token_0_pos:
document 1 has_vector: True, is_parsed: False, token_0_pos:
document 2 has_vector: True, is_parsed: False, token_0_pos:
Related
If I have a (possibly multidimensional) Python list where each element is one of True, False, or ma.masked, what's the idiomatic way of turning this into a masked numpy array of bool?
Example:
>>> print(somefunc([[True, ma.masked], [False, True]]))
[[True --]
[False True]]
A masked array has two attributes, data and mask:
In [342]: arr = np.ma.masked_array([[True, False],[False,True]])
In [343]: arr
Out[343]:
masked_array(
data=[[ True, False],
[False, True]],
mask=False,
fill_value=True)
That starts without anything masked. Then as you suggest, assigning np.ma.masked to an element masks the slot:
In [344]: arr[0,1]=np.ma.masked
In [345]: arr
Out[345]:
masked_array(
data=[[True, --],
[False, True]],
mask=[[False, True],
[False, False]],
fill_value=True)
Here the arr.mask has been changed from scalar False (applying to the whole array) to a boolean array of False, and then the selected item has been changed to True.
arr.data hasn't changed:
In [346]: arr.data[0,1]
Out[346]: False
Looks like this change to arr.mask occurs in MaskedArray.__setitem__ at:
if value is masked:
    # The mask wasn't set: create a full version.
    if _mask is nomask:
        _mask = self._mask = make_mask_none(self.shape, _dtype)
    # Now, set the mask to its value.
    if _dtype.names is not None:
        _mask[indx] = tuple([True] * len(_dtype.names))
    else:
        _mask[indx] = True
    return
It checks whether the assigned value is the special constant np.ma.masked; if so, it creates the full mask and assigns True to the selected element.
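Putting the pieces together, here is a minimal sketch of somefunc for the 2-D case from the question; it detects the ma.masked sentinel by identity and builds the data and the mask separately (the name somefunc and the 2-D assumption come from the example above):
import numpy.ma as ma

def somefunc(lst):
    # Fill masked slots with an arbitrary False in the data, and record
    # their positions in the mask by identity comparison with ma.masked.
    data = [[False if x is ma.masked else bool(x) for x in row] for row in lst]
    mask = [[x is ma.masked for x in row] for row in lst]
    return ma.masked_array(data, mask=mask)

print(somefunc([[True, ma.masked], [False, True]]))
# [[True --]
#  [False True]]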
I have a 3D array of size NxNxN. I would like to fill this array with random booleans, which I can do with:
a = np.random.choice([False,True],size=(N,N,N))
However, I would like the likelihood (or probability) of choosing either True or False to depend on the element's position in the array. I thought maybe I could do this with the p parameter, but that only lets me set how often True/False is chosen for the array as a whole.
Is there any way to set a specific probability for every element of the (N,N,N) array? I guess that would amount to an (N,N,N,2) array then, with the extra 2 being for the probability of False and the probability of True (though p_True = 1 - p_False). I feel like there's a simpler way to do this that I'm not thinking of.
Edit:
So say I want to create a simple array, a, of shape (1,2) (just two elements, but multidimensional on purpose). I want to fill these two elements with True/False. I have another array filled with the probability with which I want those elements to be False, say p_False, where p_False.shape = (1,2). Let's say I want the first element to have a 25% chance of being False, but the second element to have a 50% chance of being False, so then p_False = np.array([[0.25, 0.5]]).
I tried something along the lines of:
a = np.random.choice([[False,True],[False,True]],p=[[.25,.75],[.5,.5]])
but I got a ValueError: a must be 1-dimensional.
To generate an array with different probabilities, you can use the following code:
import numpy as np

# define an initial value of N
N = 512
# generate an array of probabilities; you can build your own, as long as its size matches
prob_array = np.arange(N * N * N)
# rescale the probabilities to the range [0, 1]
prob_array = (prob_array - np.min(prob_array)) / (np.max(prob_array) - np.min(prob_array))
# draw 0/1 values with those probabilities, cast to booleans and reshape
np.reshape(np.array(np.random.binomial(1, p=prob_array, size=N * N * N), dtype=bool), (N, N, N))
This generates an array with mostly False values at the beginning and mostly True values at the end:
array([[[False, False, False, ..., False, False, False],
[False, False, False, ..., False, False, False],
[False, False, False, ..., False, False, False],
...,
[False, False, False, ..., False, False, False],
[False, False, False, ..., False, False, False],
[False, False, False, ..., False, False, False]],
...,
[[ True, True, True, ..., True, True, True],
[ True, True, True, ..., True, True, True],
[ True, True, True, ..., True, True, True],
...,
[ True, True, True, ..., True, True, True],
[ True, True, True, ..., True, True, True],
[ True, True, True, ..., True, True, True]]])
Use the binomial method with an array of numbers in [0, 1]. Here is an example, which sets each element to 0 or 1 depending on a randomly chosen probability:
import numpy
gen = numpy.random.Generator(numpy.random.PCG64())
ret = gen.binomial(1, gen.uniform(size=(3, 3, 3)))
If you want each item to be True or False rather than 0 or 1, I'm afraid I don't know how to do so.
Note that numpy.random.Generator was introduced in NumPy 1.17. Using the latest version of NumPy is recommended; if you can't upgrade, you can use the following:
import numpy
ret = numpy.random.binomial(1, numpy.random.uniform(size=(3, 3, 3)))
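For completeness, a straightforward cast with astype(bool) would turn the 0/1 output into actual booleans; a minimal sketch:
import numpy
# Cast the 0/1 binomial draws to booleans.
ret = numpy.random.binomial(1, numpy.random.uniform(size=(3, 3, 3))).astype(bool)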
Assume a numpy array (actually Pandas) of the form:
[value, included,
0.123, False,
0.127, True,
0.140, True,
0.111, False,
0.159, True,
0.321, True,
0.444, True,
0.323, True,
0.432, False]
I'd like to split the array such that False elements are excluded and successive runs of True elements are split into their own array. So for the above case, we'd end up with:
[[0.127, True,
0.140, True],
[0.159, True,
0.321, True,
0.444, True,
0.323, True]]
I can certainly do this by pushing individual elements onto lists, but surely there must be a more numpy-ish way to do this.
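The answer below assumes the example is already a pandas DataFrame; a minimal reconstruction of that setup (column names taken from the question):
import pandas as pd

# Rebuild the example data as a DataFrame with 'value' and 'included' columns.
df = pd.DataFrame({
    'value': [0.123, 0.127, 0.140, 0.111, 0.159, 0.321, 0.444, 0.323, 0.432],
    'included': [False, True, True, False, True, True, True, True, False],
})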
You can create group identifiers by taking the inverted mask with ~ and applying Series.cumsum, keep only the True rows by boolean indexing, and then build a list of DataFrames with DataFrame.groupby:
dfs = [v for k, v in df.groupby((~df['included']).cumsum()[df['included']])]
print(dfs)
[ value included
1 0.127 True
2 0.140 True, value included
4 0.159 True
5 0.321 True
6 0.444 True
7 0.323 True]
It is also possible to convert the DataFrames to arrays with DataFrame.to_numpy:
dfs = [v.to_numpy() for k, v in df.groupby((~df['included']).cumsum()[df['included']])]
print(dfs)
[array([[0.127, True],
[0.14, True]], dtype=object), array([[0.159, True],
[0.321, True],
[0.444, True],
[0.32299999999999995, True]], dtype=object)]
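For a DataFrame-free variant, here is a hedged pure-NumPy sketch of the same grouping idea (it assumes the two columns are plain arrays; np.split cuts just before every False):
import numpy as np

values = np.array([0.123, 0.127, 0.140, 0.111, 0.159, 0.321, 0.444, 0.323, 0.432])
included = np.array([False, True, True, False, True, True, True, True, False])

# Split the index range before each False, drop the False rows inside
# each chunk, and discard chunks that end up empty.
idx = np.arange(len(included))
chunks = np.split(idx, np.flatnonzero(~included))
runs = [values[c[included[c]]] for c in chunks if included[c].any()]
print(runs)
# [array([0.127, 0.14]), array([0.159, 0.321, 0.444, 0.323])]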
I recently worked on training a part-of-speech model for Hindi in spaCy. The model is already trained, but when analyzing any text, the .pos_ attribute of every token always points to X. The fine-grained tags, .tag_ (which are the ones the model was trained with), are correct, though.
The mapping between these fine-grained tags and the "universal" tags (VERB, NOUN, ADJ, etc.) is found in the spacy/lang/hi/tag_map.py file.
Lemma यूरोप, Lemmatized: False, POS: X, TAG: NNP
Lemma के, Lemmatized: False, POS: X, TAG: PSP
Lemma जिन, Lemmatized: False, POS: X, TAG: DEM
Lemma राजनीतिक, Lemmatized: False, POS: X, TAG: JJ
Lemma दलों, Lemmatized: False, POS: X, TAG: NN
Lemma को, Lemmatized: False, POS: X, TAG: PSP
Lemma व्यवस्था, Lemmatized: False, POS: X, TAG: NN
Lemma ,, Lemmatized: False, POS: SYM, TAG: SYM
Lemma राजनेताओं, Lemmatized: False, POS: X, TAG: NN
Lemma और, Lemmatized: False, POS: X, TAG: CC
Lemma मीडिया, Lemmatized: False, POS: X, TAG: NN
Lemma द्वारा, Lemmatized: False, POS: X, TAG: PSP
Lemma अति, Lemmatized: False, POS: X, TAG: INTF
Lemma दक्षिणपंथी, Lemmatized: False, POS: X, TAG: NN
Lemma कहा, Lemmatized: False, POS: X, TAG: VM
Lemma जाता, Lemmatized: False, POS: X, TAG: VAUX
Lemma है, Lemmatized: False, POS: X, TAG: VAUX
Lemma (, Lemmatized: False, POS: SYM, TAG: SYM
Lemma परन्तु, Lemmatized: False, POS: X, TAG: CC
Lemma मेरी, Lemmatized: False, POS: X, TAG: PRP
Lemma ओर, Lemmatized: False, POS: X, TAG: NST
Lemma से, Lemmatized: False, POS: X, TAG: PSP
Lemma सभ्यतावादी, Lemmatized: False, POS: X, TAG: NNP
Lemma कहा, Lemmatized: False, POS: X, TAG: VM
Lemma जाता, Lemmatized: False, POS: X, TAG: VAUX
Lemma है, Lemmatized: False, POS: X, TAG: VAUX
Lemma ), Lemmatized: False, POS: SYM, TAG: SYM
Lemma उनकी, Lemmatized: False, POS: X, TAG: PRP
Lemma आलोचना, Lemmatized: False, POS: X, TAG: NN
Lemma उनकी, Lemmatized: False, POS: X, TAG: PRP
Lemma भूलों, Lemmatized: False, POS: X, TAG: NN
Lemma और, Lemmatized: False, POS: X, TAG: CC
Lemma अतिवादिता, Lemmatized: False, POS: X, TAG: NN
Lemma के, Lemmatized: False, POS: X, TAG: PSP
Lemma कारण, Lemmatized: False, POS: X, TAG: PSP
Lemma की, Lemmatized: False, POS: X, TAG: VM
Lemma जाती, Lemmatized: False, POS: X, TAG: VAUX
Lemma है|, Lemmatized: False, POS: X, TAG: NNPC
Investigating a little, I found out that the reason .pos_ has this X value is that in the generated lang_model/tagger/tag_map binary file, all of the keys point to 101, which is the code assigned to the part-of-speech tag X, meaning "other".
I deduce it generates the keys pointing to 101 because there is no information about how to map each of the tags provided in the dataset to the "universal" ones. The thing is, I can provide a tag_map.py in the definition of my Hindi(Language) class, but when passing a text through the pipeline, it will eventually use the tag map defined in the tagger/ directory created by the output of the train command.
Here's a link which will clarify what I'm explaining: https://universaldependencies.org/tagset-conversion/hi-conll-uposf.html
The items in the first column (CC, DEM, INTF, etc.) are the tags provided to the model; the universal tags are the ones in the second column.
My question is, where should I define the tag_map to overwrite the one generated by the spacy train command?
You need to add your tag_map.py to spacy/lang/hi/ and tell the default model (which is what gets loaded by spacy train hi) to load it. It sounds like you already have a tag_map.py, but if not, you can see examples in any of the languages that ship with spaCy models, like:
https://github.com/explosion/spaCy/blob/master/spacy/lang/en/tag_map.py
Import the tag map and add it to HindiDefaults in spacy/lang/hi/__init__.py:
from .tag_map import TAG_MAP

class HindiDefaults(Language.Defaults):
    tag_map = TAG_MAP
I think you could also modify the tag map in nlp.vocab.morphology.tag_map on the fly after initializing the blank model, before you start training, but I don't think there's any easy way to do that with command-line options to spacy train, so it would require a custom training script.
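A rough sketch of that custom-script idea, assuming spaCy v2 and that mutating the tag map in place takes effect (unverified, as noted above):
import spacy
from spacy.lang.hi.tag_map import TAG_MAP  # assumes you added tag_map.py under spacy/lang/hi/

# Merge the custom tag map into the blank model before training starts.
nlp = spacy.blank('hi')
nlp.vocab.morphology.tag_map.update(TAG_MAP)
# ... continue with a custom training loop on `nlp` instead of spacy train ...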
You can use spacy debug-data hi train.json dev.json to make sure the settings worked, since it will show warnings for any tags in your training data that aren't in the tag map.
How do I count the number of True values in a boolean matrix with TensorFlow? Thanks!
[[False True False False False True False False False False]
[False True False False False True False False False False]
[False True False False False True False False False False]
[False True False False False True False False False False]]
Cast the boolean values to tf.int32 and sum them:
import tensorflow as tf

data = [
    [False, True, False, False, False, True, False, False, False, False],
    [False, True, False, False, False, True, False, False, False, False],
    [False, True, False, False, False, True, False, False, False, False],
    [False, True, False, False, False, True, False, False, False, False]
]

var = tf.Variable(data)
num_true = tf.reduce_sum(tf.cast(var, tf.int32))

init = tf.global_variables_initializer()
with tf.Session() as sess:
    sess.run(init)
    print(sess.run(num_true))
8
Method 1:
In [29]: data = [
    ...:     [False, True, False, False, False, True, False, False, False, False],
    ...:     [False, True, False, False, False, True, False, False, False, False],
    ...:     [False, True, False, False, False, True, False, False, False, False],
    ...:     [False, True, False, False, False, True, False, False, False, False]
    ...: ]
In [30]: if_true = tf.where(tf.equal(tf.constant(data), True))
# or just if_true = tf.where(tf.constant(data))
In [31]: sess.run(if_true).shape[0]
# tf.where returns only the indices where the item is True, so the
# number of returned indices equals the number of True values
Out[31]: 8
This method has some advantages; for example, you can use it to work out how many False values there are, and so on.
Method 2:
In [34]: if_true = tf.count_nonzero(tf.constant(data))
In [35]: sess.run(if_true)
Out[35]: 8
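For reference, the same counts in TF 2.x eager mode, where no Session or variable initialization is needed (this assumes TensorFlow 2):
import tensorflow as tf

data = [[False, True, False, False, False, True, False, False, False, False]] * 4
data_t = tf.constant(data)
print(int(tf.reduce_sum(tf.cast(data_t, tf.int32))))  # 8
print(int(tf.math.count_nonzero(data_t)))             # 8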