Can I choose some of the training data for validation and testing after doing data augmentation? - data-augmentation

I am training a U-Net for semantic segmentation, but I only have 200 labeled images. Given how small the dataset is, it definitely needs data augmentation.
I have a question about the test and validation sets.
I have a custom data generator that keeps feeding data from a folder to the model during training.
So what I plan to do is:
augment the training set and keep all of the results in the same folder
"randomly" move some of that training data into the test and validation sets (before training, of course).
I am not sure whether this is fine, since the augmentation is only simple processing (flipping, transposing, adjusting brightness).
Would it be better to split the data first and then augment only the data that remains in the training folder?
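The alternative raised in the last sentence, splitting first and augmenting only the training portion, can be sketched as follows. This is a minimal illustration, not a full pipeline: `split_then_augment`, the image IDs, and the augmentation names are hypothetical placeholders, and the "augmentation" here only records which transform would be applied rather than touching pixel data.

```python
import random

def split_then_augment(items, augmentations, n_val, n_test, seed=0):
    """Shuffle, hold out validation/test items, then augment only the rest."""
    rng = random.Random(seed)          # fixed seed so the split is reproducible
    shuffled = list(items)
    rng.shuffle(shuffled)

    val = shuffled[:n_val]
    test = shuffled[n_val:n_val + n_test]
    train = shuffled[n_val + n_test:]

    # Each training item is kept once as-is and once per augmentation.
    # Held-out items are never augmented and never enter the training list.
    train_augmented = [(name, "original") for name in train]
    train_augmented += [(name, aug) for name in train for aug in augmentations]
    return train_augmented, val, test

# Hypothetical IDs standing in for the 200 labeled images.
image_ids = [f"img_{i:03d}" for i in range(200)]
train_aug, val_ids, test_ids = split_then_augment(
    image_ids, ["flip", "transpose", "brightness"], n_val=30, n_test=30
)
```

With this ordering, no augmented variant of a validation or test image can end up in the training folder, which is the risk when the random pick happens after augmentation.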

Related

Should training data be different from validation data

So I have this CNN in Python. My data has 1000 training images and 100 validation images for each class. However, my validation images are the same as my training images, just fewer of them. I'm facing some accuracy problems, so could this be one of the reasons?
Yes, the validation data should be different from the training data, not a subset of it.
That's because the validation data is there to check that the model isn't overfitted to the training data. If it were a subset of the training data, the model would already have seen every validation image, so the check could not reveal overfitting.
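A quick way to check for this kind of leakage is to compare the two sets directly. The sketch below assumes each image can be identified by its file name; the `overlap` helper and the file names are made up for illustration.

```python
def overlap(train_items, val_items):
    """Return the items that appear in both the training and validation sets."""
    return sorted(set(train_items) & set(val_items))

# Hypothetical file lists; in practice these would come from the two folders.
train_files = ["cat_001.jpg", "cat_002.jpg", "dog_001.jpg"]
val_files = ["cat_002.jpg", "dog_007.jpg"]

leaked = overlap(train_files, val_files)
if leaked:
    print(f"{len(leaked)} validation image(s) also appear in training: {leaked}")
```

An empty result is what you want before trusting the validation accuracy.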

What if I predict on data that is in the training dataset?

I'm developing a recommender system using a somewhat modified NCF.
My situation is that the data for prediction occasionally includes data used in training.
For example, my training set has 100,000 rows, and negative sampling adds some unobserved entries to it.
I want to predict all of the unobserved entries with the trained model, so some of the negatively sampled entries are in both the training data and the prediction data.
Will this cause any problems?
Should I remove the negatively sampled entries from the prediction data?
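If the overlap does need to be removed, it comes down to a set difference over (user, item) pairs. The sketch below only shows that bookkeeping with toy IDs; all names and values are assumptions for illustration, and it does not settle whether filtering is the right choice for a given model.

```python
# Toy interaction data: (user_id, item_id) pairs.
observed = {(1, 10), (1, 11), (2, 10)}      # positives in the training set
sampled_negatives = {(1, 12), (2, 11)}      # unobserved pairs added by negative sampling

users = [1, 2]
items = [10, 11, 12]
all_pairs = {(u, i) for u in users for i in items}

# Everything the user has not interacted with.
unobserved = all_pairs - observed

# Prediction candidates with the training-time negatives filtered out.
to_predict = unobserved - sampled_negatives
```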

Should I create JSON annotations for validation images?

I am trying to implement Mask R-CNN for my own dataset but couldn't find any information about annotations for the val folder, which contains the images for validation. I created JSON annotations using VIA 2.0.8 for my training set, and that makes sense. But if the validation images are the images to test on later, why make annotations for them? I can't train my model without a JSON file in the val folder.
I tried copying the JSON annotations for the training images to the validation folder. It worked, I think, but that means I would need the same number of images in both the training and val folders, with the same names as well.
You can take a look at this answer. Basically, you need a validation set to validate the output and measure the performance of your model. After the model is trained on the training set, the validation set is used to measure its performance in terms of accuracy, average precision, etc. This means the validation set needs the same kind of annotation files (ground truth) as the training set, so that the model's predictions can be compared with the true results you defined. For example, the model performs segmentation on an image and outputs a result. This result is then compared with the annotation (the expected correct output) in the validation set to measure the accuracy of the prediction. The test set is just for you to run your model on and see how it performs. Without annotations, however, there are no exact measurements in the test set from which to calculate performance and accuracy.
For segmentation, one popular measurement is the Dice score, which needs the annotations (in the validation set) to be calculated.
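To make the role of the annotation concrete, here is a minimal sketch of the Dice score for a single binary mask, computed as 2 * |prediction ∩ truth| / (|prediction| + |truth|). The masks are tiny made-up examples, and returning 1.0 when both masks are empty is a convention assumed here, not something the answer specifies.

```python
def dice_score(pred, truth):
    """Dice coefficient between two binary masks given as lists of rows."""
    pred_flat = [p for row in pred for p in row]
    truth_flat = [t for row in truth for t in row]

    # Pixels marked as foreground in both the prediction and the annotation.
    intersection = sum(1 for p, t in zip(pred_flat, truth_flat) if p and t)
    total = sum(1 for p in pred_flat if p) + sum(1 for t in truth_flat if t)

    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2 * intersection / total

predicted_mask = [[1, 1, 0],
                  [0, 1, 0]]
annotated_mask = [[1, 0, 0],
                  [0, 1, 1]]

score = dice_score(predicted_mask, annotated_mask)
```

Without `annotated_mask` there is nothing to compare the prediction against, which is why the val folder needs its own annotations rather than a copy of the training ones.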

Training trained seq2seq model on additional training data

I have trained a seq2seq model on 1M samples and saved the latest checkpoint. Now I have an additional 50K sentence pairs that were not in the previous training data. How can I adapt the current model to this new data without starting training from scratch?
You do not have to re-run the whole network initialization. You can run incremental training.
Training from pre-trained parameters
Another use case is to use a base model and train it further with new training options (in particular the optimization method and the learning rate). Using -train_from without -continue will start a new training with parameters initialized from a pre-trained model.
Remember to tokenize your 50K corpus the same way you tokenized the previous one.
Also, beginning with OpenNMT 0.9, you do not have to use the same vocabulary. See the Updating the vocabularies section and use the appropriate value with the -update_vocab option.

SSAS Data Mining: Testing and Training Data Sets...please explain

Can someone explain what happens when you split up the data set for testing and training?
Put simply, the accuracy of your data mining model is evaluated by training it on the training set and then having it make predictions for the test set, where the results are already known.
More information on the testing and validation of data mining models (MSDN)
To be able to test the predictive analysis model you built, you need to split your dataset into two sets: training and test datasets. These datasets should be selected at random and should be a good representation of the actual population.
Similar data should be used for both the training and test datasets.
Normally the training dataset is significantly larger than the test dataset.
Using the test dataset helps you avoid errors such as overfitting.
The trained model is run against test data to see how well the model will perform.
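The point about overfitting can be illustrated with a deliberately naive model. The sketch below is a toy, not how SSAS works internally: `fit_memorizer` simply memorizes every training row and falls back to the most common label, and all data values are made up.

```python
from collections import Counter

def fit_memorizer(rows):
    """'Train' by memorizing each features -> label pair seen in training."""
    table = {features: label for features, label in rows}
    majority = Counter(label for _, label in rows).most_common(1)[0][0]
    # Unseen inputs fall back to the most common training label.
    return lambda features: table.get(features, majority)

def accuracy(model, rows):
    """Fraction of rows for which the model predicts the known label."""
    return sum(model(features) == label for features, label in rows) / len(rows)

train_rows = [((1,), "a"), ((2,), "b"), ((3,), "a"), ((4,), "a")]
test_rows = [((5,), "b"), ((6,), "a")]

model = fit_memorizer(train_rows)
train_acc = accuracy(model, train_rows)   # perfect, because the rows were memorized
test_acc = accuracy(model, test_rows)     # lower, because these rows were never seen
```

Scoring only on the training rows would suggest a perfect model; the held-out test rows show how it behaves on data it has not seen.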