Custom NER model produces bad results after being saved to disk? - spacy

I've created some training data (about 300 samples) to do NER for recipe ingredients and followed the code example at https://spacy.io/usage/training#example-train-ner. The newly created model does a decent job when predicting terms on my test dataset, but after saving the model to disk and loading it again, it does not do well at all. I must be missing something about saving the model to disk that loses a lot of accuracy. Is there something I should be doing prior to running the nlp.to_disk or some options I need to set?
For example, the new model before saving produced this output:
2 pounds tomatillos (about 15 medium), husks removed
Entities:
2 = QUANTITY
pounds = UNIT
tomatillos = INGREDIENT
(about 15 medium) = COMMENT
husks removed = COMMENT
and after saving and loading (like in the example code):
2 pounds tomatillos (about 15 medium), husks removed
Entities:
2 pounds tomatillos (about 15 medium), husks removed = COMMENT

Fix should be published shortly. The v2.1 release was pretty big, so we had a couple of regressions like. I thought we already had tests for this as this area of the code has caused problems before, but apparently it slipped through nonetheless.
https://github.com/explosion/spaCy/issues/3433

Related

Can you forecast with multiple trajectories?

I am new to time-series machine learning and have a, perhaps, trivial question.
I would like like to forecast the temperature for a particular region. I could train a model using the hourly data points from the first 6 days of the week and then evaluate its performance on the final day. Therefore the training set would have 144 data points (6*24) and the test set would have 24 data points (24*1). Likewise, I can train a new model for regions B-Z and evaluate each of their individual performances. My question is, can you train a SINGLE model for the predictions across multiple different regions? So the region label should be an input of course since that will effect the temperature evolution.
Can you train a single model that forecasts for multiple trajectories rather than just one? Also, what might be a good metric for evaluating its performance? I was going to use mean absolute error but maybe a correlation is better?
Yes you can train with multiple series of data from different region the question that you ask is an ultimate goal of deep learning by create a 1 model to do every things, predict every region correctly and so on. However, if you want to generalize your model that much you normally need a really huge model, I'm talking about 100M++ parameter and to train that data you also need tons of Data maybe couple TB or PB, so you also need a super powerful computer to train that thing something like GOOGLE data center. Coming to your next question, the metric, you may use just simple RMS error or mean absolute error will work fine.
Here is what you need to focus Training Data, there is no super model that take garbage and turn it in to gold, same thing here garbage in garbage out. You need a pretty good datasets that can represent whole environment of what u are trying to solve. For example, you want to create model to predict that if you hammer a glass will it break, so you have maybe 10 data for each type of glass and all of them break when u hammer it. so, you train the model and it just predict break every single time, then you try to predict with a bulletproof glass and it does not break, so your model is wrong. Therefore, you need a whole data of different type of glass then your model maybe predict it correctly. Then compare this to your 144 data points, I'm pretty sure it won't work for your case.
Therefore, I would say yes you can build that 1 model fits all but there is a huge price to pay.

Does deeper LSTM need more units?

I'm applying LSTM on time series forecasting with 20 lags. Suppose that we have two cases. The first one just using five lags and the second one (like my case) is using 20 lags. Is it correct that for the second case we need more units compared to the former one? If yes, how can we support this idea? I have 2000 samples for training the model, so this is the main limitation for increasing number of units here.
It is very difficult to give an exact answer as the relationship between timesteps and number of hidden units is not an exact science. For example, following factors can affect the number of units required.
Short term memory problem vs long-term memory problem
If your problem can be solved with relatively less memory (i.e. requires to remember only a few time steps) you wouldn't get much benefit from adding more neurons while increasing the number of steps.
The amount of data
If you don't have enough data for the model to learn from (which I feel like you will run into with 2000 data points - but I could be wrong), then increasing the number of timesteps won't help you much.
The type of model you use
Depending on the type of model you use (e.g. LSTM / GRU ) you might get different results (this is not always true but can happen for certain problems)
I'm sure there are other factors out there, but these are few that came to my mind.
Proving more units give better results while having more time steps (if true)
That should be relatively easy as you can try few different options,
5 lags with 10 / 20 / 50 hidden units
20 lags with 10 / 20 / 50 hidden units
And if you get better performance (e.g. lower MSE) with 20 lags problem than 5 lags problem (when you use 50 units), then you have gotten your point across. And you can reinforce your claims by showing results with different types of models (e.g. LSTMs vs GRUs).

Shuffling the training dataset with Tensorflow object detection api

I'm working on a logo detection algorithm using the Faster-RCNN model with the Tensorflow object detection api.
My dataset is alphabetically ordered (so there are a hundred adidas logo, then hundred apple logo etc.). And i would like it to be shuffled while training.
I've put some values in the config file:
train_input_reader:{
shuffle: true
queue_capacity: some value
min_after_dequeue : some other value}
However whatever are the values, I'm putting in, algorithm is at first training on all of the a's logos (adidas, apple and so on) and only a lapse of time after starting to see the b's logos (bmw etc.) and the c's one etc.
Of course I could just shuffle my input dataset directly, but I would like to understand the logic behind it.
PS: I've seen this post about shuffling and min_after_dequeue, but I still dont quite get it. My batch size is 1 so it shouldn't be using tf.train.shuffle_batch() but only tf.RandomShuffleQueue
My training dataset size is 5000 and if I write min_after_dequeue: 4000 or 5000 it is still not shuffled right. Why though?
Update:
#AllenLavoie It's a bit hard for me; as there is a lot of dependencies and I'm new to Tensorflow.
But in the end the queue is constructed by
tf.contrib.slim.parallel_reader.parallel_read( _, string_tensor = parallel_reader.parallel_read(
config.input_path,
reader_class=tf.TFRecordReader,
num_epochs=(input_reader_config.num_epochs
if input_reader_config.num_epochs else None),
num_readers=input_reader_config.num_readers,
shuffle=input_reader_config.shuffle,
dtypes=[tf.string, tf.string],
capacity=input_reader_config.queue_capacity,
min_after_dequeue=input_reader_config.min_after_dequeue)
It seems that when I'm putting num_readers = 1 in the config file the dataset is finally shuffling as I want, (at least in the beginning), but when there are more somehow on the start the logos are getting in the alphabetical order.
I recommend shuffling the dataset prior to training. The way shuffling currently happens is imperfect and my guess at what is happening is that at the beginning the queue starts off empty and only gets examples that start with 'A' --- after a while it may be more shuffled, but there is no getting around the beginning part when the queue hasn't been filled yet.

Word2Vec: Any way to train model fastly?

I use Gensim Word2Vec to train word sets in my database.
I have about 400,000 phrase(Each phrase is short. Total 700MB) in my PostgreSQL database.
This is how I train these data using Django ORM:
post_vector_list = []
for post in Post.objects.all():
post_vector = my_tokenizer(post.category.name)
post_vector.extend(my_tokenizer(post.title))
post_vector.extend(my_tokenizer(post.contents))
post_vector_list.append(post_vector)
word2vec_model = gensim.models.Word2Vec(post_vector_list, window=10, min_count=2, size=300)
But this job getting a lot of time and feels like not efficient.
Especially, creating post_vector_list part took a lot of time and space..
I want to improve speed of training but have no idea how to do.
Want to get your advices. Thanks.
To optimize such code, you need to collect good information about where the time is spent.
Is most of the time spent preparing post_vector_list?
If so, you will want to make sure my_tokenizer (whose code is not shown) is as efficient as possible. You may want to try to minimize the number of extend()s and append()s that are done on large lists. You might have to even take a look at your DB's configuration or options to speed up the DB-to-Object mapping started inside Post.objects.all().
Is most of the time spent in the call to Word2Vec()?
If so, other steps may help:
ensure you're using gensim's Cython-optimized routines – if not, you should be seeing a logged warning (and training will be up to 100X slower)
consider using a workers=4 or workers=8 optional argument to use more threads, if your machine has at least 4 or 8 CPU cores
consider using a larger min_count, which speeds training somewhat (and since vectors for words where there are only a few examples typically aren't very good anyway, doesn't lose much and can even improve the quality of the surviving words)
consider using a smaller window, since training takes longer for larger windows
consider using a smaller vector_size (previously called size), since training takes longer for larger-size vectors
consider using a more-aggressive (smaller) value for the optional sample argument, which randomly skips more of the most-frequent words. The default is 1e-04, but values of 1e-05 or 1e-06 (especially on larger corpuses) can offer additional speedup, and even often improve the final vectors (by spending relatively less training time on words with an excess of usage examples)
consider using a lower-than-default (5) value for the optional epochs parameter (previously called iter). (I wouldn't recommend this unless the corpus is very large – so it already has many redundant, equally-good examples of the same words throughout.)
you could use a python generator instead of loading all the data into the list. Gensim works with python generators too. The code will look something like this
class Post_Vectors(object):
def __init__(self, Post):
self.Post = Post
def __iter__(self):
for post in Post.objects.all():
post_vector = my_tokenizer(post.category.name)
post_vector.extend(my_tokenizer(post.title))
post_vector.extend(my_tokenizer(post.contents))
yield post_vector
post_vectors = Post_Vectors(Post)
word2vec_model = gensim.models.Word2Vec(post_vectors, window=10, min_count=2, size=300, workers=??)
For the gensim speedup, if you have a multi-core CPU, you could use the workers parameter. (By default it is 3)

Simple voice recognition when whispering

I'm trying to do simple voice to text mapping using pocketsphinx (. The grammar is very simple such as:
public <grammar> = (Matt, Anna, Tom, Christine)+ (One | Two | Three | Four | Five | Six | Seven | Eight | Nine | Zero)+ ;
e.g:
Tom Anna Three Three
yields
Tom Anna 33
I adapted the acoustic model (to take into account my foreign accent) and after that I received decent performance (~94% accuracy). I used training dataset of ~3minutes.
Right now I'm trying to do the same but by whispering to the microphone. The accuracy dropped significantly to ~50% w/o training. With training for accent
I got ~60%. I tried other thinks including denoising and boosting volume. I read the whole docs but was wondering if anyone could answer some questions so I can
better know in which direction should I got to improve performance.
1) in tutorial you are adapting hub4wsj_sc_8k acustic model. I guess "8k" is a sampling parameter. When using sphinx_fe you use "-samprate 16000". Was it used deliberately to train 8k model using data with 16k sampling rate? Why data with 8k sampling haven't been used? Does it have influence on performance?
2) in sphinx 4.1 (in comparison to pocketsphinx) there are differenct acoustic models e.g. WSJ_8gau_13dCep_16k_40mel_130Hz_6800Hz.jar. Can those models be used with pocketsphinx? Will acustic model with 16k sampling have typically better performance with data having 16k sampling rate?
3) when using data for training should I use those with normal speaking mode (to adapt only for my accent) or with whispering mode (to adapt to whisper and my accent)? I think I tried both scenarios and didn't notice any difference to draw any conclussion but I don't know pocketsphinx internals so I might be doing something wrong.
4) I used the following script to record adapting training and testing data from the tutorial:
for i in `seq 1 20`; do
fn=`printf arctic_%04d $i`;
read sent; echo $sent;
rec -r 16000 -e signed-integer -b 16 -c 1 $fn.wav 2>/dev/null;
done < arctic20.txt
I noticed that each time I hit Control-C this keypress is distinct in the recorded audio that leaded to errors. Trimming audio somtimes helped to correct to or lead to
other error instead. Is there any requirement that each recording has some few seconds of quite before and after speaking?
5) When accumulating observation counts is there any settings I can tinker with to improve performance?
6) What's the difference between semi-continuous and continuous model? Can pocketsphinx use continuous model?
7) I noticed that 'mixture_weights' file from sphinx4 is much smaller comparing to the one you got in pocketsphinx-extra. Does it make any difference?
8) I tried different combination of removing white noise (using 'sox' toolkit e.g. sox noisy.wav filtered.wav noisered profile.nfo 0.1). Depending on the last parameter
sometimes it improved a little bit (~3%) and sometimes it makes worse. Is it good to remove noise or it's something pocketsphinx doing as well? My environment is quite
is there is only white noise that I guess can have more inpack when audio recorded whispering.
9) I noticed that boosting volume (gain) alone most of the time only maked the performance a little bit worse even though for humans it was easier to distinguish words. Should I avoid it?
10) Overall I tried different combination and the best results I got is ~65% when only removing noise, so only slight (5%) improvement. Below are some stats:
//ORIGNAL UNPROCESSED TESTING FILES
TOTAL Words: 111 Correct: 72 Errors: 43
TOTAL Percent correct = 64.86% Error = 38.74% Accuracy = 61.26%
TOTAL Insertions: 4 Deletions: 13 Substitutions: 26
//DENOISED + VOLUME UP
TOTAL Words: 111 Correct: 76 Errors: 42
TOTAL Percent correct = 68.47% Error = 37.84% Accuracy = 62.16%
TOTAL Insertions: 7 Deletions: 4 Substitutions: 31
//VOLUME UP
TOTAL Words: 111 Correct: 69 Errors: 47
TOTAL Percent correct = 62.16% Error = 42.34% Accuracy = 57.66%
TOTAL Insertions: 5 Deletions: 12 Substitutions: 30
//DENOISE, threshold 0.1
TOTAL Words: 111 Correct: 77 Errors: 41
TOTAL Percent correct = 69.37% Error = 36.94% Accuracy = 63.06%
TOTAL Insertions: 7 Deletions: 3 Substitutions: 31
//DENOISE, threshold 0.21
TOTAL Words: 111 Correct: 80 Errors: 38
TOTAL Percent correct = 72.07% Error = 34.23% Accuracy = 65.77%
TOTAL Insertions: 7 Deletions: 3 Substitutions: 28
Those processing I was doing only for testing data. Should the training data be processed in the same way? I think I tried that but there was barely any difference.
11) In all those testing I used ARPA language model. When using JGSF results where usually much worse (I have the latest pocketsphinx branch). Why is that?
12) Because is each sentence the maximum number would be '999' and no more than 3 names, I modified the JSGF and replaced repetition sign '+' by repeating content in the parentheses manually. This time the result where much closer to ARPA. Is there any way in grammar to tell maximum number of repetition like in regular expression?
13) When using ARPA model I generated it by using all possible combinations (since dictionary is fixed and really small: ~15 words) but then testing I was still receiving somtimes illegal results e.g. Tom Anna (without any required number). Is there any way to enforce some structure using ARPA model?
14) Should the dictionary be limited only to those ~15 words or just full dictionary will only affect speed but not performance?
15) Is modifying dictionary (phonemes) the way to go to improve recognition when whispering? (I'm not an expert but when we whisper I guess some words might sounds different?)
16) Any other tips how to improve accuracy would be really helpful!
Regarding whispering: when you do so, the sound waves don't have meaningful aperiodic parts (vibrations that result from your vocal cords resonating normally, but not when whispering). You can try this by putting your finger to your throat while loudly speaking 'aaaaaa', and then just whispering it.
AFAIR acoustic modeling relies a lot on taking the frequency spectrum of the sound to detect peaks (formants) and relate them to phones (like vowels).
Educated guess: when whispering, the spectrum is mostly white-noise, slightly shaped by the oral position (tongue, openness of mouth, etc), which is enough for humans, but far not enough to make the peeks distinguishable by a computer.