Training with spacy on full dataset - spacy

When I train my spacy model as follows
spacy train config.cfg --paths.train ./train.spacy --paths.dev ./dev.spacy
the model is trained on the train.spacy data file and scored on dev.spacy, and output_updated/model-best is then the model with the highest score.
Is this best model ultimately trained on a combination of both the train and dev data? I understand that it makes sense to split the datasets to avoid overfitting, but given how little training data I have, I would like the final model to be trained on all the data I have at hand.

No, spaCy does not automatically merge your datasets before training model-best. If you want to do that, you need to manually create a new training dataset (see the sketch below).
If you have so little data that merging seems like a good idea, you should probably prioritize collecting more data instead.
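If you do decide to merge, a minimal sketch (assuming spaCy v3 and the file names from your command; combined.spacy is a new file) is to load both DocBins, merge them, and write out a combined corpus:

from spacy.tokens import DocBin

train_db = DocBin().from_disk("./train.spacy")   # load the existing training docs
dev_db = DocBin().from_disk("./dev.spacy")       # load the dev docs
train_db.merge(dev_db)                           # append the dev docs to the train DocBin
train_db.to_disk("./combined.spacy")             # write the combined corpus

You could then run
spacy train config.cfg --paths.train ./combined.spacy --paths.dev ./dev.spacy
but keep in mind that the dev scores are no longer a meaningful estimate of generalization, since the dev data is now part of the training set.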

Related

Keras LSTM: how to predict beyond the validation data?

When dealing with time series forecasting, I've seen most people follow these steps when using an LSTM model:
Obtain, clean, and pre-process data
Take out validation dataset for future comparison with model predictions
Initialise and train LSTM model
Pre-process a copy of the validation dataset exactly like the training data
Use trained model to make predictions on the transformed validation data
Evaluate results: predictions vs validation
However, if the model is accurate, how do you make predictions that go beyond the end of the validation period?
The following only accepts data that have been transformed in the same way as the training data, but for predictions that go beyond the validation period, you don't have any input data to feed to the model. So, how do people do this?
# Predictions vs validation
predictions = model.predict(transformed_validation)
# Future predictions
future_predictions = model.predict(?)
To predict the i-th value, your LSTM model needs the last N values.
So if you want to forecast further ahead, you have to use each prediction as input for the next one.
In other words, you have to loop over something like
prediction = model.predict(X[-N:])
X.append(prediction)
As you can see, you are feeding your own output back in as input, which is why the predictions can diverge and the uncertainty amplifies.
Other models are more stable when predicting far into the future.
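For concreteness, here is a minimal sketch of that loop, assuming the model was trained on windows of the last N scaled values with input shape (N, 1) and outputs a single next value; N, horizon, model, and transformed_series are placeholders for your own setup.

import numpy as np

N = 30                                   # window length used during training (assumption)
horizon = 10                             # number of future steps to forecast (assumption)
history = list(transformed_series[-N:])  # last N values, scaled like the training data
future_predictions = []

for _ in range(horizon):
    window = np.array(history[-N:]).reshape(1, N, 1)       # (batch, timesteps, features)
    next_value = float(model.predict(window, verbose=0)[0, 0])
    future_predictions.append(next_value)
    history.append(next_value)           # feed the prediction back in as the next input

Remember to invert any scaling on future_predictions before comparing them with real values.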
You have to split your data into training and test sets and then fit your model. Finally, you make predictions like this:
future_predictions = model.predict(X_test)
Check out the link below for all details.
Time-Series Forecasting: Predicting Stock Prices Using An LSTM Model

Training dataset repeatedly - Keras

I am doing an image classification task using Keras.
I used the VGG16 architecture because I thought it would be easier to work with; the task is to classify whether MRI images contain a tumor or not.
As usual, I read all the images, resized them to the same shape (224×224×3), and normalised them by dividing by 255. Then I did a train/test split, with 25% test data and 75% training data.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
Then I trained the model and got a val_loss of 0.64 and a val_accuracy of 0.7261.
I saved the trained model to my Google Drive.
The next day, I followed the same procedure, trying to improve the model's performance by loading the saved model.
I didn't change the model architecture; I simply loaded the saved model that had scored 0.7261 accuracy.
This time I got better performance: the val_loss is 0.58 and the val_accuracy is 0.7976.
I wondered how the accuracy got this high. Then I realised that train_test_split splits the images at random, so some of the test data from the first training session became training data in the second session, and the model predicted those images well because it had already seen them.
I want to clarify: is the model truly learning the tumor patterns, or are we effectively training and testing the model on the same image samples?
Thanks
When using train_test_split and validating in different sessions, always set your random seed. Otherwise you will be using different splits and leaking data, as you described. The model is not "learning" more; rather, it is being validated on data it has already trained on. You will likely get worse real-world performance.
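For example, fixing the seed with scikit-learn's random_state argument (42 is an arbitrary value) makes the split identical in every session, so no test images leak into later training runs:

from sklearn.model_selection import train_test_split

# Same seed => same split every time the script runs.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)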

Tensorflow: Is it possible to identify the data used for training?

I have created a text classification model (.pb) using TensorFlow. Prediction works well.
Is it possible to check whether a sentence given for prediction was already used to train the model? I need to retrain the model when a new sentence is given to the model for prediction.
I did some research and couldn't find a way to recover the training data from the .pb file alone, because that file only stores the learned parameters and not the actual training data. If you still have the dataset, though, you can easily verify it yourself.
I don't think you can ever recover the exact training data from only the trained model, because the model contains the learned features (weights), not the actual training examples.
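If you do still have the training corpus, a simple membership check is enough. A minimal sketch, assuming the training sentences are stored one per line in a hypothetical train_sentences.txt:

# Build a set of normalised training sentences (hypothetical file).
with open("train_sentences.txt", encoding="utf-8") as f:
    seen = {line.strip().lower() for line in f}

def is_new_sentence(sentence):
    # True if the sentence was not part of the training data,
    # i.e. retraining may be needed.
    return sentence.strip().lower() not in seen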

Training trained seq2seq model on additional training data

I have trained a seq2seq model with 1M samples and saved the latest checkpoint. Now, I have some additional training data of 50K sentence pairs which has not been seen in previous training data. How can I adapt the current model to this new data without starting the training from scratch?
You do not have to re-run the whole training from a fresh initialization; you can run incremental training instead.
Training from pre-trained parameters
Another use case is to use a base model and train it further with new training options (in particular the optimization method and the learning rate). Using -train_from without -continue will start a new training run with parameters initialized from a pre-trained model.
Remember to tokenize your 50K corpus the same way you tokenized the previous one.
Also, beginning with OpenNMT 0.9 you do not have to use the same vocabulary. See the Updating the vocabularies section and use the appropriate value for the -update_vocab option.
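As a rough sketch for the Lua OpenNMT toolkit referenced above (file names are placeholders, and the exact options may differ in your version), incremental training looks something like:

th train.lua -data new_data-train.t7 \
    -train_from base_model_checkpoint.t7 \
    -save_model adapted_model

Here new_data-train.t7 would be the preprocessed 50K corpus and base_model_checkpoint.t7 the checkpoint saved from the 1M-sample run.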

Tensorflow embeddings

I know what embeddings are and how they are trained. Specifically, while going through TensorFlow's documentation, I came across two different articles, and I would like to know what exactly the difference between them is.
link 1: Tensorflow | Vector Representations of words
In the first tutorial, the embeddings are explicitly trained on a specific dataset; there is a distinct session run to train them. I can then save the learnt embeddings as a NumPy object and use the
tf.nn.embedding_lookup() function while training an LSTM network.
link 2: Tensorflow | Embeddings
In this second article, however, I couldn't understand what is happening.
word_embeddings = tf.get_variable("word_embeddings",
                                  [vocabulary_size, embedding_size])
embedded_word_ids = tf.gather(word_embeddings, word_ids)
This is given under the training-embeddings section. My question is: does the gather function train the embeddings automatically? I am not sure, since this op ran very quickly on my machine.
More generally: what is the right way to convert words into vectors in TensorFlow (link 1 or link 2) for training a seq2seq model? Also, how do I train embeddings for a seq2seq dataset, given that my data consists of separate sequences rather than a continuous sequence of words as in the link 1 dataset?
Alright, I have found the answer to this question and I am posting it so that others might benefit from it.
The first link is more of a tutorial that steps you through exactly how the embeddings are learnt.
In practical cases, such as training seq2seq or any other encoder-decoder models, we use the second approach, where the embedding matrix gets tuned appropriately while the model is trained.
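For reference, here is a minimal TF1-style sketch of that second approach; the sizes and the toy loss are just placeholders. The key point is that the embedding matrix is an ordinary trainable variable, so the optimizer updates it together with the rest of the model.

import tensorflow as tf  # TF1-style graph API

vocabulary_size, embedding_size = 10000, 128           # example sizes
word_ids = tf.placeholder(tf.int32, shape=[None])      # a batch of token ids

word_embeddings = tf.get_variable("word_embeddings",
                                  [vocabulary_size, embedding_size])
embedded_word_ids = tf.nn.embedding_lookup(word_embeddings, word_ids)

# Toy loss standing in for a real seq2seq objective; what matters is only
# that word_embeddings is a trainable variable reachable from the loss.
loss = tf.reduce_mean(tf.reduce_sum(tf.square(embedded_word_ids), axis=1))

# minimize() computes gradients for all trainable variables, including
# word_embeddings, so the embedding matrix is tuned as the model trains.
train_op = tf.train.AdamOptimizer(0.001).minimize(loss)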