I'm aware of the basic spaCy workflow for getting various attributes from a document; however, I can't find a built-in function to return the position (start/end) of a word which is part of a sentence.
Would anyone know if this is possible with spaCy?
These are available as attributes of the tokens in a sentence. The docs say:
idx (int): The character offset of the token within the parent document.
i (int): The index of the token within the parent document.
>>> import spacy
>>> nlp = spacy.load('en')
>>> parsed_sentence = nlp(u'This is my sentence')
>>> [(token.text,token.i) for token in parsed_sentence]
[(u'This', 0), (u'is', 1), (u'my', 2), (u'sentence', 3)]
>>> [(token.text,token.idx) for token in parsed_sentence]
[(u'This', 0), (u'is', 5), (u'my', 8), (u'sentence', 11)]
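Since idx is the start offset, the end offset follows from idx plus the token length:
>>> [(token.text, token.idx, token.idx + len(token.text)) for token in parsed_sentence]
[(u'This', 0, 4), (u'is', 5, 7), (u'my', 8, 10), (u'sentence', 11, 19)]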
In short: where do the emoji vectors in spaCy come from, and where is this documented?
import spacy
import numpy as np
nlp = spacy.load('en_core_web_sm')
a = "🔥"
b = "❄️"
v = "🥑"
h = "💔"
l = "💌"
e = [a,b,v,h,l]
# emoji vector
ev = [nlp(emoji).vector for emoji in e]
# numpy array
ev = np.array(ev)
ev.shape
The shape is (5, 96), so I am curious where I can learn more about the source of these vectors. At first, I assumed they were OOV, but:
ev.sum(axis=1)
yields
array([2.906692 , 3.8687153, 1.2295313, 3.986846 , 1.9255924],
dtype=float32)
All of the above was run in a Colab environment as of 2021-02-21.
The sm models do not contain word vectors. If there aren't any word vectors, token.vector returns token.tensor as a backoff, which is the context-sensitive tensor from the tagger component. See the first warning box here: https://v2.spacy.io/usage/vectors-similarity
If you want word vectors, use an md or lg model instead, and then the emoji will be OOV and token.vector will return an all-0 300d vector.
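A quick check along these lines (a minimal sketch, assuming the en_core_web_md model is downloaded):
import spacy

nlp = spacy.load('en_core_web_md')  # md model ships real word vectors
token = nlp("🔥")[0]
print(token.is_oov)        # True: the emoji is out of vocabulary
print(token.vector.shape)  # (300,)
print(token.vector.sum())  # 0.0, since OOV tokens get an all-zero vector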
In the given line of code tokenizer = Tokenizer(num_words=, oov_token='<OOV>'), what does the num_words parameter actually do, and what should be taken into consideration before deciding the value to assign to it? What is the effect of assigning a very high value versus a very low one?
It is basically the size of the vocabulary you want your model to use, based on the data you have. The simple examples below explain it in detail.
Without num_words:
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
tokenizer = Tokenizer(oov_token='<OOV>')
fit_text = ["Example with the first sentence of the tokenizer"]
tokenizer.fit_on_texts(fit_text)
test_text = ["Example with the test sentence of the tokenizer"]
sequences = tokenizer.texts_to_sequences(test_text)
print("sequences : ",sequences,'\n')
print("word_index : ",tokenizer.word_index)
print("word counts : ",tokenizer.word_counts)
sequences : [[3, 4, 2, 1, 6, 7, 2, 8]]
word_index : {'<OOV>': 1, 'the': 2, 'example': 3, 'with': 4, 'first': 5, 'sentence': 6, 'of': 7, 'tokenizer': 8}
word counts : OrderedDict([('example', 1), ('with', 1), ('the', 2), ('first', 1), ('sentence', 1), ('of', 1), ('tokenizer', 1)])
Here tokenizer.fit_on_texts(fit_text) creates the word_index from the words present in fit_text, starting with oov_token at index 1, followed by the most frequent words from word_counts.
If you don't specify num_words, all the unique words of fit_text are included in word_index and used to represent the sequences.
If num_words is given, it restricts the sequences: only the first num_words - 1 entries of word_index (including the oov_token) are used by tokenizer.texts_to_sequences(); any word with an index beyond num_words - 1 is replaced with oov_token.
Below is an example of it.
With use of num_words:
tokenizer = Tokenizer(num_words=4,oov_token='<OOV>')
fit_text = ["Example with the first sentence of the tokenizer"]
tokenizer.fit_on_texts(fit_text)
test_text = ["Example with the test sentence of the tokenizer"]
sequences = tokenizer.texts_to_sequences(test_text)
print("sequences : ",sequences,'\n')
print("word_index : ",tokenizer.word_index)
print("word counts : ",tokenizer.word_counts)
sequences : [[3, 1, 2, 1, 1, 1, 2, 1]]
word_index : {'<OOV>': 1, 'the': 2, 'example': 3, 'with': 4, 'first': 5, 'sentence': 6, 'of': 7, 'tokenizer': 8}
word counts : OrderedDict([('example', 1), ('with', 1), ('the', 2), ('first', 1), ('sentence', 1), ('of', 1), ('tokenizer', 1)])
Regarding model accuracy, it is generally better for the sequences to carry the correct representation of the words in your data rather than oov_token.
For large datasets, it is better to set the num_words parameter than to push the load of the full vocabulary onto the model.
It is good practice to do preprocessing such as stopword removal and lemmatization/stemming to remove unnecessary words first, and then fit the Tokenizer on the preprocessed data; this makes it easier to choose a good num_words value, as sketched below.
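For instance (a minimal sketch; the tiny stopword list and corpus are made up for illustration):
from tensorflow.keras.preprocessing.text import Tokenizer

stopwords = {"the", "of", "with"}  # toy stopword list, just for illustration
corpus = ["Example with the first sentence of the tokenizer"]

# drop stopwords before fitting, so the num_words budget is spent on useful words
cleaned = [" ".join(w for w in text.lower().split() if w not in stopwords)
           for text in corpus]

tokenizer = Tokenizer(num_words=4, oov_token='<OOV>')
tokenizer.fit_on_texts(cleaned)
print(tokenizer.word_index)
# {'<OOV>': 1, 'example': 2, 'first': 3, 'sentence': 4, 'tokenizer': 5}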
# Step 2: Build the dictionary and replace rare words with UNK token.
import collections

vocabulary_size = 50000

def build_dataset(words, n_words):
    """Process raw inputs into a dataset."""
    count = [['UNK', -1]]
    count.extend(collections.Counter(words).most_common(n_words - 1))
    dictionary = dict()
    for word, _ in count:
        dictionary[word] = len(dictionary)
    data = list()
    unk_count = 0
    for word in words:
        if word in dictionary:
            index = dictionary[word]
        else:
            index = 0  # dictionary['UNK']
            unk_count += 1
        data.append(index)
    count[0][1] = unk_count
    reversed_dictionary = dict(zip(dictionary.values(), dictionary.keys()))
    return data, count, dictionary, reversed_dictionary

# vocabulary is the word list produced in Step 1 of the tutorial
data, count, dictionary, reverse_dictionary = build_dataset(vocabulary,
                                                            vocabulary_size)
I am working through the elementary example of vector representation of words in TensorFlow.
Step 2 is titled "Build the dictionary and replace rare words with UNK token"; however, there is no prior definition of what "UNK" refers to.
To specify the question:
0) What does UNK generally refer to in NLP?
1) What does count = [['UNK', -1]] mean? I know that brackets [] denote a list in Python, but why do we pair it with -1?
As already mentioned in the comments, when you see the UNK token in tokenization and NLP, it usually indicates an unknown word.
For example, suppose you want to predict a missing word in a sentence. How would you feed your data in? You definitely need a token showing where the missing word is. So if "house" is our missing word, after tokenizing it will look like:
'my house is big' -> ['my', 'UNK', 'is', 'big']
PS: count = [['UNK', -1]] initializes the count; each entry has the form ['word', number_of_occurrences], as Ivan Aksamentov has already said. The -1 is just a placeholder that is later overwritten with the real UNK count (count[0][1] = unk_count).
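A toy run of the Step 2 function makes the -1 placeholder visible (a sketch; the tiny word list is made up, and tie ordering among equal counts can vary by Python version):
words = "the quick brown fox jumps over the lazy dog the end".split()
data, count, dictionary, reverse_dictionary = build_dataset(words, n_words=5)

print(count[0])    # ['UNK', 5]: the -1 was overwritten with the real UNK count
print(dictionary)  # {'UNK': 0, 'the': 1, 'quick': 2, 'brown': 3, 'fox': 4}
print(data)        # [1, 2, 3, 4, 0, 0, 1, 0, 0, 1, 0]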
I'm trying to use Cython to improve the performance of a loop, but I'm running into some issues declaring the types of the inputs.
How do I include a field in my typed struct which is a string that can be either 'front' or 'back'?
I have a np.recarray that looks like the following (note that the length of the recarray is unknown at compile time):
import numpy as np
weights = np.recarray(4, dtype=[('a', np.int64), ('b', np.str_, 5), ('c', np.float64)])
weights[0] = (0, "front", 0.5)
weights[1] = (0, "back", 0.5)
weights[2] = (1, "front", 1.0)
weights[3] = (1, "back", 0.0)
as well as inputs of a list of strings and a pandas.Timestamp
import pandas as pd
ts = pd.Timestamp("2015-01-01")
contracts = ["CLX16", "CLZ16"]
I am trying to Cythonize the following loop:
def ploop(weights, contracts, timestamp):
    cwts = []
    for gen_num, position, weighting in weights:
        if weighting != 0:
            if position == "front":
                cntrct_idx = gen_num
            elif position == "back":
                cntrct_idx = gen_num + 1
            else:
                raise ValueError("transition.columns must contain "
                                 "'front' or 'back'")
            cwts.append((gen_num, contracts[cntrct_idx], weighting, timestamp))
    return cwts
My attempt involved typing the weights input as a struct in Cython, in a file struct_test.pyx, as follows:
import numpy as np
cimport numpy as np

cdef packed struct tstruct:
    np.int64_t gen_num
    char[5] position
    np.float64_t weighting

def cloop(tstruct[:] weights_array, contracts, timestamp):
    cdef tstruct w
    cdef int k
    cdef int cntrct_idx
    cwts = []
    for k in xrange(len(weights_array)):
        w = weights_array[k]
        if w.weighting != 0:
            if w.position == "front":
                cntrct_idx = w.gen_num
            elif w.position == "back":
                cntrct_idx = w.gen_num + 1
            else:
                raise ValueError("transition.columns must contain "
                                 "'front' or 'back'")
            cwts.append((w.gen_num, contracts[cntrct_idx], w.weighting,
                         timestamp))
    return cwts
But I am receiving runtime errors, which I believe are related to the char[5] position.
import pyximport
pyximport.install()
import struct_test
struct_test.cloop(weights, contracts, ts)
ValueError: Does not understand character buffer dtype format string ('w')
In addition, I am a bit unclear on how I would go about typing contracts as well as timestamp.
Your ploop (without the timestamp variable) produces:
In [226]: ploop(weights, contracts)
Out[226]: [(0, 'CLX16', 0.5), (0, 'CLZ16', 0.5), (1, 'CLZ16', 1.0)]
Equivalent function without a loop:
def ploopless(weights, contracts):
    arr_contracts = np.array(contracts)  # to allow array indexing
    wgts1 = weights[weights['c'] != 0]
    mask = wgts1['b'] == 'front'
    wgts1['b'][mask] = arr_contracts[wgts1['a'][mask]]
    mask = wgts1['b'] == 'back'
    wgts1['b'][mask] = arr_contracts[wgts1['a'][mask] + 1]
    return wgts1.tolist()
In [250]: ploopless(weights, contracts)
Out[250]: [(0, 'CLX16', 0.5), (0, 'CLZ16', 0.5), (1, 'CLZ16', 1.0)]
I'm taking advantage of the fact that the returned list of tuples has the same (int, str, float) layout as the input weights array, so I'm just making a copy of weights and replacing selected values of the b field.
Note that I apply the field-selection index before the mask. A boolean mask produces a copy, so we have to be careful about indexing order; see the sketch below.
I'm guessing that the loop-less array version will be competitive in time with cloop (on realistic arrays). The string and list operations in cloop probably limit its speedup.
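The indexing-order point matters in practice: indexing the field first gives a writable view, while masking first gives a throwaway copy (a minimal sketch of the gotcha):
# field first, then mask: wgts1['b'] is a view, so the assignment sticks
wgts1['b'][mask] = arr_contracts[wgts1['a'][mask]]

# mask first, then field: wgts1[mask] is already a copy,
# so this assignment would be silently discarded
# wgts1[mask]['b'] = arr_contracts[wgts1['a'][mask]]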
I'm writing a custom TensorFlow op using the tutorial, and I'm having trouble understanding how to read from and write to Tensors.
Let's say I have a Tensor in my OpKernel that I get from:
const Tensor& values_tensor = context->input(0); (where context is the OpKernelContext* passed to Compute)
If that Tensor has shape, say, [2, 10, 20], how can I index into it (e.g. auto x = values_tensor[1, 4, 12], etc.)?
Equivalently, if I have:
Tensor* output_tensor = NULL;
OP_REQUIRES_OK(context, context->allocate_output(
    0,
    {batch_size, value_len - window_size, window_size},
    &output_tensor));
How can I assign to output_tensor, like output_tensor[1, 2, 3] = 11, etc.?
Sorry for the dumb question, but the docs are really tripping me up here, and the examples in the TensorFlow kernel code for built-in ops somehow obfuscate this to the point that I get very confused :)
Thank you!
The easiest way to read from and write to tensorflow::Tensor objects is to convert them to an Eigen tensor using the tensorflow::Tensor::tensor<T, NDIMS>() method. Note that you have to specify the (C++) element type of the tensor as the template parameter T.
For example, to read a particular value from a DT_FLOAT32 tensor:
const Tensor& values_tensor = context->input(0);
auto x = values_tensor.tensor<float, 3>()(1, 4, 12);
To write a particular value to a DT_FLOAT32 tensor:
Tensor* output_tensor = ...;
output_tensor->tensor<float, 3>()(1, 2, 3) = 11.0;
There are also convenience methods for accessing a scalar, vector, or matrix.
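For instance (a minimal sketch; the some_* tensor names are hypothetical placeholders):
// tensorflow::Tensor convenience accessors for low-rank tensors
int i = 0, j = 0;
float s = some_scalar_tensor.scalar<float>()();      // 0-D (scalar)
float v = some_vec_tensor.vec<float>()(i);           // 1-D (vector)
float m = some_matrix_tensor.matrix<float>()(i, j);  // 2-D (matrix)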