Spacy custom POS model for Hindi

I recently trained a part-of-speech model for Hindi in spaCy. The model is trained, but when I analyze any text, the .pos_ attribute is X for almost every token. The fine-grained tags in .tag_, which are the ones the model was trained on, are correct though.
The mapping between these fine-grained tags and the "universal" tags (VERB, NOUN, ADJ, etc.) is found in the spacy/lang/hi/tag_map.py file.
Lemma यूरोप, Lemmatized: False, POS: X, TAG: NNP
Lemma के, Lemmatized: False, POS: X, TAG: PSP
Lemma जिन, Lemmatized: False, POS: X, TAG: DEM
Lemma राजनीतिक, Lemmatized: False, POS: X, TAG: JJ
Lemma दलों, Lemmatized: False, POS: X, TAG: NN
Lemma को, Lemmatized: False, POS: X, TAG: PSP
Lemma व्यवस्था, Lemmatized: False, POS: X, TAG: NN
Lemma ,, Lemmatized: False, POS: SYM, TAG: SYM
Lemma राजनेताओं, Lemmatized: False, POS: X, TAG: NN
Lemma और, Lemmatized: False, POS: X, TAG: CC
Lemma मीडिया, Lemmatized: False, POS: X, TAG: NN
Lemma द्वारा, Lemmatized: False, POS: X, TAG: PSP
Lemma अति, Lemmatized: False, POS: X, TAG: INTF
Lemma दक्षिणपंथी, Lemmatized: False, POS: X, TAG: NN
Lemma कहा, Lemmatized: False, POS: X, TAG: VM
Lemma जाता, Lemmatized: False, POS: X, TAG: VAUX
Lemma है, Lemmatized: False, POS: X, TAG: VAUX
Lemma (, Lemmatized: False, POS: SYM, TAG: SYM
Lemma परन्तु, Lemmatized: False, POS: X, TAG: CC
Lemma मेरी, Lemmatized: False, POS: X, TAG: PRP
Lemma ओर, Lemmatized: False, POS: X, TAG: NST
Lemma से, Lemmatized: False, POS: X, TAG: PSP
Lemma सभ्यतावादी, Lemmatized: False, POS: X, TAG: NNP
Lemma कहा, Lemmatized: False, POS: X, TAG: VM
Lemma जाता, Lemmatized: False, POS: X, TAG: VAUX
Lemma है, Lemmatized: False, POS: X, TAG: VAUX
Lemma ), Lemmatized: False, POS: SYM, TAG: SYM
Lemma उनकी, Lemmatized: False, POS: X, TAG: PRP
Lemma आलोचना, Lemmatized: False, POS: X, TAG: NN
Lemma उनकी, Lemmatized: False, POS: X, TAG: PRP
Lemma भूलों, Lemmatized: False, POS: X, TAG: NN
Lemma और, Lemmatized: False, POS: X, TAG: CC
Lemma अतिवादिता, Lemmatized: False, POS: X, TAG: NN
Lemma के, Lemmatized: False, POS: X, TAG: PSP
Lemma कारण, Lemmatized: False, POS: X, TAG: PSP
Lemma की, Lemmatized: False, POS: X, TAG: VM
Lemma जाती, Lemmatized: False, POS: X, TAG: VAUX
Lemma है|, Lemmatized: False, POS: X, TAG: NNPC
After investigating a bit, I found that .pos_ has the value X because, in the generated lang_model/tagger/tag_map binary file, all keys point to 101, which is the code assigned to the part of speech X (Other).
I deduce that it generates keys pointing to 101 because it has no information about how to map each of the tags in the dataset to the "universal" ones. The thing is, I can provide a tag_map.py in the definition of my Hindi(Language) class, but when a text passes through the pipeline, it ends up using the tag map in the tagger/ directory created by the train command.
Here's a link which will clarify what I'm explaining: https://universaldependencies.org/tagset-conversion/hi-conll-uposf.html
The first items in the first column (CC, DEM, INTF, etc.) are the tags provided to the model. The universal tags are the ones in the second column.
My question is, where should I define the tag_map to overwrite the one generated by the spacy train command?

You need to add your tag_map.py to spacy/lang/hi/ and tell the default model (which is what gets loaded with spacy train hi) to load it. It sounds like you already have a tag_map.py, but if not, you can see examples for any of the languages that have provided spaCy models, like:
https://github.com/explosion/spaCy/blob/master/spacy/lang/en/tag_map.py
Import the tag map and add it to the HindiDefaults in spacy/lang/hi/__init__.py to load the tag map:
from .tag_map import TAG_MAP

class HindiDefaults(Language.Defaults):
    tag_map = TAG_MAP
I think you could also modify the tag map in nlp.vocab.morphology.tag_map on the fly after initializing the blank model and before you start training, but I don't think there's an easy way to do that with command-line options to spacy train, so it would require a custom training script.
You can use spacy debug-data hi train.json dev.json to make sure the settings worked, since it will show warnings for any tags in your training data that aren't in the tag map.
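As a rough illustration of what such a tag map could contain, here is a minimal sketch that uses plain string keys rather than the symbol imports seen in the English tag_map.py. The individual tag-to-POS pairs are my assumptions and should be checked against the conversion table linked in the question, and coarse_pos is a hypothetical helper that only mimics the fallback to X for unmapped tags; it is not part of spaCy.

```python
# Sketch of the contents of a Hindi tag_map.py.
# The pairs below are assumptions; verify each one against the
# UD conversion table before training.
TAG_MAP = {
    "NN": {"pos": "NOUN"},
    "NNP": {"pos": "PROPN"},
    "JJ": {"pos": "ADJ"},
    "PSP": {"pos": "ADP"},
    "PRP": {"pos": "PRON"},
    "VM": {"pos": "VERB"},
    "VAUX": {"pos": "AUX"},
}


def coarse_pos(tag, tag_map=TAG_MAP):
    """Hypothetical helper: return the coarse tag, or X if the tag is unmapped."""
    return tag_map.get(tag, {}).get("pos", "X")


print(coarse_pos("VM"))   # a mapped tag
print(coarse_pos("DEM"))  # not in the sketch above, so it falls back to X
```

Any tag that is missing from the map ends up as X, which matches the symptom described in the question.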

Related

Create masked array from list containing ma.masked

If I have a (possibly multidimensional) Python list where each element is one of True, False, or ma.masked, what's the idiomatic way of turning this into a masked numpy array of bool?
Example:
>>> print(somefunc([[True, ma.masked], [False, True]]))
[[True --]
[False True]]

A masked array has two attributes, data and mask:
In [342]: arr = np.ma.masked_array([[True, False],[False,True]])
In [343]: arr
Out[343]:
masked_array(
data=[[ True, False],
[False, True]],
mask=False,
fill_value=True)
That starts without anything masked. Then as you suggest, assigning np.ma.masked to an element masks the slot:
In [344]: arr[0,1]=np.ma.masked
In [345]: arr
Out[345]:
masked_array(
data=[[True, --],
[False, True]],
mask=[[False, True],
[False, False]],
fill_value=True)
Here the arr.mask has been changed from scalar False (applying to the whole array) to a boolean array of False, and then the selected item has been changed to True.
arr.data hasn't changed:
In [346]: arr.data[0,1]
Out[346]: False
It looks like this change to arr.mask happens in MaskedArray.__setitem__, at:
if value is masked:
    # The mask wasn't set: create a full version.
    if _mask is nomask:
        _mask = self._mask = make_mask_none(self.shape, _dtype)
    # Now, set the mask to its value.
    if _dtype.names is not None:
        _mask[indx] = tuple([True] * len(_dtype.names))
    else:
        _mask[indx] = True
    return
It checks whether the assigned value is the special constant np.ma.masked; if so, it creates the full mask and sets the selected element of the mask to True.
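Building on the data/mask split described above, one way to convert the nested list from the question is to build the two arrays separately. This is only a sketch: to_masked_bool and its inner walk helper are hypothetical names, and it assumes the input is made of plain Python lists.

```python
import numpy as np
import numpy.ma as ma


def to_masked_bool(nested):
    """Build a boolean masked array from nested lists of True/False/ma.masked."""

    def walk(item, fn):
        # Recurse through plain lists and apply fn to every leaf element.
        if isinstance(item, list):
            return [walk(sub, fn) for sub in item]
        return fn(item)

    # True wherever the element is the ma.masked constant.
    mask = np.array(walk(nested, lambda x: x is ma.masked), dtype=bool)
    # Masked slots get a placeholder value of False in the data.
    data = np.array(
        walk(nested, lambda x: False if x is ma.masked else bool(x)), dtype=bool
    )
    return ma.masked_array(data, mask=mask)


arr = to_masked_bool([[True, ma.masked], [False, True]])
print(arr)
```

The placeholder stored under a masked slot is arbitrary, since the mask hides it.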

How to change p-value of Numpy.random.choice based on position in array?

I have a 3D array of size NxNxN. I would like to fill this array with random booleans, which I can do with:
a = np.random.choice([False,True],size=(N,N,N))
However, I would like the likelihood (or p-value) of choosing either True or False to depend on the element's position in the array. I thought maybe I could do this with the p parameter, but that only sets how often True/False is chosen across the entire array.
Is there any way to set specific p-values for the entire (N,N,N) array? I guess that would amount to an (N,N,N,2) array then, with the extra 2 being for the p-value for False and p-value for True (though p_True = 1 - p_False). I feel like there's a simpler way to do this that I'm not thinking of.
Edit:
So say I want to create a simple array, a, of shape (1,2) (just two elements, but multidimensional on purpose). I want to fill these two elements with True/False. I have another array filled with the likelihood or p-value with which I want those elements to be False, say p_False, where p_False.shape = (1,2). Let's say I want the first element to have a 25% chance of being False, but the second element to have a 50% chance of being false, so then p_False = np.array([0.25,0.5]).
I tried something along the lines of:
a = np.random.choice([[False,True],[False,True]],p=[[.25,.75],[.5,.5]])
but I got a ValueError: a must be 1-dimensional.

To generate an array with different probabilities, you can use the following code:
import numpy as np
# define an initial value of N
N = 512
# generate an array of probabilities, one per element; you can build your own as long as the size matches
prob_array = np.arange(N * N * N)
# rescale the probabilities to the range [0, 1]
prob_array = (prob_array - np.min(prob_array)) / (np.max(prob_array) - np.min(prob_array))
# draw one sample per element with its own probability, cast to booleans and reshape
np.reshape(np.array(np.random.binomial(1, p=prob_array, size=N*N*N), dtype=bool), (N,N,N))
This generates an array with lots of Falses in the beginning and lots of Trues in the end:
array([[[False, False, False, ..., False, False, False],
[False, False, False, ..., False, False, False],
[False, False, False, ..., False, False, False],
...,
[False, False, False, ..., False, False, False],
[False, False, False, ..., False, False, False],
[False, False, False, ..., False, False, False]],
...,
[[ True, True, True, ..., True, True, True],
[ True, True, True, ..., True, True, True],
[ True, True, True, ..., True, True, True],
...,
[ True, True, True, ..., True, True, True],
[ True, True, True, ..., True, True, True],
[ True, True, True, ..., True, True, True]]])

Use the binomial method with an array of numbers in [0, 1]. Here is an example, which sets each element to 0 or 1 depending on a randomly chosen probability:
import numpy
gen=numpy.random.Generator(numpy.random.PCG64())
ret=gen.binomial(1, gen.uniform(size=(3, 3, 3)))
If you want each item to be True or False rather than 0 or 1, you can cast the result to booleans, for example with ret.astype(bool).
Note that numpy.random.Generator was introduced in NumPy 1.17. On older versions, you can use the legacy interface instead:
import numpy
ret=numpy.random.binomial(1, numpy.random.uniform(size=(3, 3, 3)))
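An alternative that produces booleans directly is to compare uniform random numbers against the per-element probabilities. The sketch below uses the p_False values from the question's edit; random_bools is a hypothetical helper name, and it assumes NumPy 1.17 or later for default_rng.

```python
import numpy as np


def random_bools(p_false, rng=None):
    """Return a bool array where element i is False with probability p_false[i]."""
    if rng is None:
        rng = np.random.default_rng()
    p_false = np.asarray(p_false, dtype=float)
    # rng.random draws from [0, 1), so u < p happens with probability p.
    # Those draws become False; everything else becomes True.
    return rng.random(p_false.shape) >= p_false


p_false = np.array([[0.25, 0.5]])  # shape (1, 2), as in the question
a = random_bools(p_false)
print(a.shape, a.dtype)
```

The same call works for an (N, N, N) array of probabilities, because the comparison is applied element by element.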

How can I convert a boolean matrix into a greyscale map?

I am trying to visualize the Mandelbrot set so that it looks like this image (https://mathworld.wolfram.com/SeaHorseValley.html).
I have a function that returns a boolean array, with True indicating that a point is part of the Mandelbrot set and False that it is not.
def mandelbrot(x, y, dx, dy, dims=(300, 400), threshold=25, iterations=200):
    '''Returns a boolean matrix of the given dimensions, indicating whether a
    position inside the rectangle spanning from (x, y) to (x+dx, y+dy) is part of
    the Mandelbrot set or not'''
    xs, ys = np.meshgrid(np.linspace(x, x + dx, dims[1]),
                         np.linspace(y, y + dy, dims[0]))
    c = xs + ys * 1j
    zs = np.zeros(dims, dtype=np.complex128)
    for i in range(iterations):
        abs_zs = np.abs(zs)
        zs[abs_zs < threshold] = zs[abs_zs < threshold] ** 2 + c[abs_zs < threshold]
    return np.abs(zs) < threshold
I guess the code above doesn't really matter, but when I input (-1.5, -1, 2, 2), it returns an array like the one below.
array([[False, False, False, ..., False, False, False],
[False, False, False, ..., False, False, False],
[False, False, False, ..., False, False, False],
...,
[False, False, False, ..., False, False, False],
[False, False, False, ..., False, False, False],
[False, False, False, ..., False, False, False]])
The result I want is a greyscale image like the one linked above. I have no idea how I should approach this. Could you at least suggest tools I can use, or hints if you have any ideas? Thanks in advance!
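One possible approach, sketched here under my own assumptions rather than taken from the thread: a boolean matrix only carries two levels, so the simplest greyscale map is black for True and white for False. For smoother shading, a variant of the function above can record how many iterations each point stayed below the threshold. The names bool_to_grey, escape_counts and counts_to_grey are hypothetical.

```python
import numpy as np


def bool_to_grey(mask):
    """Two-level image: True -> 0 (black), False -> 255 (white)."""
    return np.where(mask, 0, 255).astype(np.uint8)


def escape_counts(x, y, dx, dy, dims=(300, 400), threshold=25, iterations=200):
    """Count, per point, the iterations during which |z| stayed below threshold."""
    xs, ys = np.meshgrid(np.linspace(x, x + dx, dims[1]),
                         np.linspace(y, y + dy, dims[0]))
    c = xs + ys * 1j
    zs = np.zeros(dims, dtype=np.complex128)
    counts = np.zeros(dims, dtype=np.int64)
    for _ in range(iterations):
        active = np.abs(zs) < threshold
        zs[active] = zs[active] ** 2 + c[active]
        counts += active  # points that have escaped stop accumulating
    return counts


def counts_to_grey(counts, iterations):
    """Scale counts to 8-bit grey: points that never escape become black."""
    return (255 * (1 - counts / iterations)).astype(np.uint8)


counts = escape_counts(-1.5, -1, 2, 2, dims=(60, 80), iterations=100)
grey = counts_to_grey(counts, 100)
print(grey.shape, grey.dtype)
```

The resulting uint8 array can then be displayed with an image tool of your choice, for example matplotlib's imshow with cmap="gray".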

Splitting a numpy array / pandas dataframe by boolean delimiters

Assume a numpy array (actually Pandas) of the form:
[value, included,
0.123, False,
0.127, True,
0.140, True,
0.111, False,
0.159, True,
0.321, True,
0.444, True,
0.323, True,
0.432, False]
I'd like to split the array such that False elements are excluded and successive runs of True elements are split into their own array. So for the above case, we'd end up with:
[[0.127, True,
0.140, True],
[0.159, True,
0.321, True,
0.444, True,
0.323, True]]
I can certainly do this by pushing individual elements onto lists, but surely there must be a more numpy-ish way to do this.

You can build group labels by inverting the mask with ~ and taking Series.cumsum, keep only the True rows with boolean indexing, and then create a list of DataFrames with DataFrame.groupby:
dfs = [v for k, v in df.groupby((~df['included']).cumsum()[df['included']])]
print (dfs)
[ value included
1 0.127 True
2 0.140 True, value included
4 0.159 True
5 0.321 True
6 0.444 True
7 0.323 True]
It is also possible to convert the DataFrames to arrays with DataFrame.to_numpy:
dfs = [v.to_numpy() for k, v in df.groupby((~df['included']).cumsum()[df['included']])]
print (dfs)
[array([[0.127, True],
[0.14, True]], dtype=object), array([[0.159, True],
[0.321, True],
[0.444, True],
[0.32299999999999995, True]], dtype=object)]
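For the plain NumPy case, the run boundaries can also be found without pandas by looking at where the boolean column changes value. This is a sketch under the assumption that the values and the included flags are available as two separate 1-D arrays; split_true_runs is a hypothetical helper name.

```python
import numpy as np


def split_true_runs(values, included):
    """Return a list of sub-arrays of values, one per consecutive run of True."""
    values = np.asarray(values)
    included = np.asarray(included, dtype=bool)
    # Pad with False on both sides so every run has a start and an end edge.
    padded = np.concatenate(([False], included, [False]))
    # Indices where the flag flips; they alternate start, stop, start, stop...
    edges = np.flatnonzero(padded[1:] != padded[:-1])
    starts, stops = edges[::2], edges[1::2]
    return [values[s:e] for s, e in zip(starts, stops)]


values = [0.123, 0.127, 0.140, 0.111, 0.159, 0.321, 0.444, 0.323, 0.432]
included = [False, True, True, False, True, True, True, True, False]
for run in split_true_runs(values, included):
    print(run)
```

Each returned piece is a slice of the original array, so no element-by-element appending is needed.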

Combining logical (boolean) expressions in numpy [duplicate]

I want to combine logical expressions but I get an exception:
array = np.arange(10)
array > 1
array([False, False, True, True, True, True, True, True, True,
True])
array < 4
array([ True, True, True, True, False, False, False, False, False,
False])
(array > 1 & array < 4)
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
What I would expect instead would be a boolean array of length 10 with True value at the indices 2 and 3 --where both conditions are met-- and False elsewhere.

You need numpy's logical_and function.
import numpy as np
array = np.arange(10)
np.logical_and(array > 1, array < 4)  # [False, False, True, True, False, False, False, False, False, False]
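The exception in the question comes from operator precedence: & binds more tightly than the comparison operators, so array > 1 & array < 4 is evaluated as the chained comparison array > (1 & array) < 4, which needs the truth value of a whole array. A short sketch showing that parenthesising each comparison makes the & form equivalent to np.logical_and:

```python
import numpy as np

array = np.arange(10)

# Parentheses force each comparison to be evaluated before the element-wise AND.
with_operator = (array > 1) & (array < 4)
with_function = np.logical_and(array > 1, array < 4)

print(with_operator)
print(np.array_equal(with_operator, with_function))
```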
