spaCy NER does not recognize some numeric texts

I have a custom spacy NER model for entities such as amounts, IDs, etc.
Some numeric values could not be found by the model. The texts look fine and the numbers look fine, but the model does not extract the value. (Accuracy for these numeric entities is above 98%.)
An ID number, 81249012, could not be found by the model. If the number is changed slightly, e.g. to 81249013, 81249014, or 8124901, the model finds the number.
So I ran a trial to check how often it fails: I changed the ID number in the text over the range (81249012, 90000000) and applied NER prediction. About 0.5% of the numbers could not be found by the model, even though the text was otherwise identical except for the ID number.
An amount text "7.653,20" could not be found by the model. As in the first case, if I change the amount slightly, the model finds it.
Some recognized examples:
"7,653.21", "7.653,2", "7.653,200", "117.653,20", "17.653,20"
What could be the reason that some numeric values are missed in prediction in these two cases?
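The sweep described in the question can be sketched like this. Note that `find_id` is a hypothetical stand-in for the real pipeline; in practice you would run `nlp(text)` and check whether any entity in `doc.ents` covers the inserted number.

```python
import re

# Hypothetical template; the real text stays fixed while only the ID changes.
TEMPLATE = "Payment for ID {} was received."

def find_id(text, number):
    # Stand-in check; replace with a lookup over the spaCy model's doc.ents.
    return any(tok == str(number) for tok in re.findall(r"\d+", text))

ids = range(81249012, 81249112)        # a small slice of the full range
missed = [n for n in ids if not find_id(TEMPLATE.format(n), n)]
miss_rate = len(missed) / len(ids)     # the question observed roughly 0.5%
```

Running this against the real model makes it easy to collect the exact IDs that fail and compare their digit patterns.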

Related

Vertex AI AutoML Batch Predict returns prediction scores as Array Float64 in BigQuery Table instead of just FLOAT64 values

So I have this tabular AutoML model in Vertex AI. It successfully ran batch predictions and output to BigQuery. However, when I try to query the prediction data based on the score being above a certain threshold, I get an error saying the datatype doesn't support float operations. When I tried to cast the scores to float, it said that the scores are a FLOAT64 array. This confuses me because they're just individual values of a column in the table. I don't understand why they aren't normal floats, nor do I know how to convert them. Any help would be greatly appreciated.
I tried casting the datatype to float, which didn't work. I tried using different operators like BETWEEN and LIKE, but again they won't work because it says the value is an array. I just don't understand why it's getting converted to an array; each value should be its own value, just as the table shows it.
AutoML does store your result in a so-called RECORD, at least if you're doing classification. If that is the case for you, it stores two things within this RECORD: classes and scores. scores is itself an array, consisting of the probability of class 0 and the probability of class 1. So to access it you have to do something like this:
prediction_variable.scores[offset(1)]
This will give you the value for the probability of your classification being 1.
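For example, a threshold query might look like this (a sketch with hypothetical table and column names; in AutoML classification output the struct column is usually named after your target, e.g. `predicted_<target>`):

```sql
SELECT *
FROM `my_project.my_dataset.batch_predictions`   -- hypothetical table name
WHERE predicted_label.scores[OFFSET(1)] > 0.8    -- probability of class 1
```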

How to train data of different lengths in machine learning?

I am analyzing the text of some literary works and I want to look at the distance between certain words in the text. Specifically, I am looking for parallelism.
Since I can’t know the specific number of tokens in a text, I can’t simply put all the words of the text in the training data, because the length would not be uniform across all training samples.
For example, the text:
“I have a dream that my four little children will one day live in a nation where they will not be judged by the color of their skin but by the content of their character. I have a dream today."
Is not the same text length as
"My fellow Americans, ask not what your country can do for you, ask what you can do for your country."
Therefore I could not make a column out of each word and then assign the distances in a row, because the lengths would be different.
How could I go about representing this as training data? I was under the assumption that training data had to be the same type and length.
To solve this problem you can use something called pad_sequences. The process: first transform the data with a vectorization technique such as TF-IDF (or any other algorithm); after converting the textual data into vectors, use the shape attribute to find the maximum length you have; then pass that maximum to pad_sequences. Here is how you implement this method:
'''
from keras.preprocessing.sequence import pad_sequences

# X: your list of vectorized sequences; max_len: the maximum sequence length
padded_data = pad_sequences(X, maxlen=max_len, padding='post', truncating='post')
'''
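To see what the call above does under the hood, here is a minimal plain-Python sketch of post-padding and post-truncating (illustrative only; the Keras function also handles dtypes and returns an array):

```python
def pad_post(sequences, maxlen, value=0):
    """Pad (or truncate) each sequence at the end to exactly maxlen items,
    mirroring pad_sequences(..., padding='post', truncating='post')."""
    out = []
    for seq in sequences:
        seq = list(seq)[:maxlen]                         # truncate at the end
        out.append(seq + [value] * (maxlen - len(seq)))  # pad at the end
    return out

padded = pad_post([[1, 2], [3, 4, 5, 6, 7]], maxlen=4)
# padded -> [[1, 2, 0, 0], [3, 4, 5, 6]]
```

After this, every row has the same length and can be stacked into a uniform training matrix.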

How to predict the winner based on teammates

I am trying to create a machine learning model to predict the position of each team, but I am having trouble organizing the data in a way the model can train on it.
I want the pandas dataframe to look something like this
Where each tournament has team members constantly shifting teams.
And based on the inputted teammates, the model makes a prediction of the team's position. Does anyone have suggestions on how I can make a pandas dataframe like this that a model can use as training data? I'm completely stumped. Thanks in advance!
Coming to the question of how to create this sheet: you can easily get the data and store it in the format you described above. The trick is in how to use it as training data for your model. We need to convert it into numerical form to be usable as training data for any model.
Since the max team size is 3 in most cases, we can split the three names into three columns (keeping a column blank if there are fewer than 3 members in the team). Now we can use either label encoding or one-hot encoding to convert the names to numbers. You should create a combined list of all three columns to fit a LabelEncoder, and then use its transform function individually on each column (since the names might be shared across these 3 columns).
With label encoding, we can easily use tree-based models. One-hot encoding might lead to the curse of dimensionality, as there will be many names, so I would prefer not to use it for an initial simple model.
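A minimal sketch of the combined-fit-then-transform-per-column idea in plain Python (sklearn's LabelEncoder produces the same kind of mapping; the player names here are made up):

```python
# Teams with up to 3 members; an empty string marks a missing slot.
teams = [
    ("alice", "bob",   "carol"),
    ("bob",   "dave",  ""),
    ("carol", "alice", "dave"),
]

# Fit one encoder on the combined set of names from all three columns,
# so a player shared across columns always maps to the same integer.
all_names = sorted({name for team in teams for name in team})
encode = {name: i for i, name in enumerate(all_names)}

# Transform each column with the shared mapping.
encoded = [tuple(encode[name] for name in team) for team in teams]
```

The resulting integer columns can be fed directly to a tree-based model alongside the position label.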

Can I use labeled data and rule-based matching for multiclass text classification with Spacy?

I have some labeled data (about 1000 rows; shape: text, category) and up to 10k unlabeled rows. I want to use spaCy's rule-based matching to define a pattern for every category. After that, I would like to train a new model using the rules and the data I've labeled. Is this possible? I've seen a tutorial on YouTube* that does something similar, but it uses the labeled data to determine whether a sentence contains some entity, whereas I want to put a label on an entire paragraph.
https://www.youtube.com/watch?v=IqOJU1-_Fi0
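As a sketch of the rule side, spaCy Matcher patterns are plain lists of per-token attribute dicts, so one hypothetical pattern per category could look like this (the category names and tokens here are invented for illustration):

```python
# Hypothetical rule patterns, one per category, in spaCy Matcher format:
# each pattern is a list of dicts, one dict of attributes per token.
patterns = {
    "BILLING": [{"LOWER": "invoice"}, {"IS_DIGIT": True}],
    "SHIPPING": [{"LOWER": {"IN": ["shipment", "delivery"]}}],
}

# With a loaded pipeline you would register them like:
#   from spacy.matcher import Matcher
#   matcher = Matcher(nlp.vocab)
#   for label, pattern in patterns.items():
#       matcher.add(label, [pattern])
# and then label a paragraph with the category whose pattern matches it,
# using those rule-derived labels together with your hand-labeled data.
```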

Should my seq2seq RNN idea work?

I want to predict stock price.
Normally, people would feed the input as a sequence of stock prices.
Then they would feed the output as the same sequence but shifted to the left.
When testing, they would feed the output of the prediction into the next input timestep like this:
I have another idea, which is to fix the sequence length, for example 50 timesteps.
The input and output are exactly the same sequence.
When training, I replace the last 3 elements of the input with zeros to let the model know that I have no input for those timesteps.
When testing, I would feed the model a sequence of 50 elements whose last 3 are zeros. The predictions I care about are the last 3 elements of the output.
Would this work or is there a flaw in this idea?
The main flaw of this idea is that it does not add anything to the model's learning, and it reduces the model's capacity, as you force your model to learn the identity mapping for the first 47 steps (50-3). Note that providing 0 as input is equivalent to not providing input to an RNN, since a zero input, after being multiplied by a weight matrix, is still zero; the only sources of information are the bias and the output from the previous timestep, both of which are already there in the original formulation. Now the second add-on, where we have output for the first 47 steps: there is nothing to be gained by learning the identity mapping, yet the network will have to "pay the price" for it, since it will need to use weights to encode this mapping in order not to be penalised.
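The zero-input claim can be checked numerically with a vanilla RNN step h_t = tanh(W_x·x_t + W_h·h_prev + b) (random weights here are purely illustrative):

```python
import numpy as np

# A single vanilla RNN step with random (illustrative) weights.
rng = np.random.default_rng(0)
W_x = rng.normal(size=(4, 3))   # input-to-hidden weights
W_h = rng.normal(size=(4, 4))   # hidden-to-hidden weights
b = rng.normal(size=4)          # bias
h_prev = rng.normal(size=4)     # previous hidden state

x_zero = np.zeros(3)            # the "no input" timestep from the idea above

# With a zero input, W_x @ x_zero vanishes, so the step reduces to
# exactly what the RNN computes from bias and recurrence alone.
step_with_zero_input = np.tanh(W_x @ x_zero + W_h @ h_prev + b)
step_without_input = np.tanh(W_h @ h_prev + b)
```

The two hidden states are identical, confirming that zeroed timesteps contribute no information beyond the bias and the recurrent state.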
So in short: yes, your idea will work, but it is nearly impossible to get better results this way compared to the original approach. You do not provide any new information and do not really modify the learning dynamics, yet you limit capacity by requiring an identity mapping to be learned per step; and since the identity is an extremely easy thing to learn, gradient descent will discover this relation first, before even trying to "model the future".