Split image dataset for Keras model.fit_generator

I have a single directory, dataset, which contains sub-folders (one per label/class) of animal images.
I want to split the dataset into train and test sets for model.fit_generator().
How can I do that?

Use glob to get an iterator over the file paths.
You can then use scikit-learn's train_test_split to get the train and test data paths (use the stratify parameter to get the same class distribution in the test/train sets as in the whole dataset).
The result would be two lists of paths, which you can copy into the appropriate train/test folders, and then you can apply the generator's flow_from_directory method.
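A minimal sketch of that approach (the dataset directory name, the .jpg extension, and the train/test folder names here are assumptions, not from the question):

import glob
import os
import shutil
from sklearn.model_selection import train_test_split

# One sub-folder per class inside "dataset/" (directory layout is assumed).
paths = glob.glob('dataset/*/*.jpg')
labels = [os.path.basename(os.path.dirname(p)) for p in paths]

train_paths, test_paths = train_test_split(
    paths, test_size=0.2, stratify=labels, random_state=42)

# Copy the files into train/<class>/ and test/<class>/ for flow_from_directory.
for split_name, split_paths in [('train', train_paths), ('test', test_paths)]:
    for p in split_paths:
        dest = os.path.join(split_name, os.path.basename(os.path.dirname(p)))
        os.makedirs(dest, exist_ok=True)
        shutil.copy(p, dest)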
EDIT:
The second way would be not to use flow_from_directory, but to load the train/test sets yourself (either load everything and use the scikit-learn method, or use what I've described before) and then use the generator's flow method.
Also note that you might not want to use generators for the test/validation data, since that would make comparing accuracy hard: you wouldn't have a fixed validation/test set.
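A rough sketch of that second approach with the generator's flow method (the arrays below are dummies just to make the snippet self-contained):

import numpy as np
from sklearn.model_selection import train_test_split
from tensorflow.keras.preprocessing.image import ImageDataGenerator

x = np.random.rand(100, 64, 64, 3)          # stand-in for the loaded images
y = np.random.randint(0, 3, size=(100,))    # stand-in for the integer labels

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, stratify=y, random_state=42)

gen = ImageDataGenerator(rescale=1. / 255)
train_flow = gen.flow(x_train, y_train, batch_size=32)
# model.fit_generator(train_flow, validation_data=(x_test, y_test), ...)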

Related

Best way to evaluate performance with tf.data.Dataset

I trained a model and now want to evaluate its performance on a test set. The test set is loaded as a tf.data.TFRecordDataset object (from multiple TFRecords with multiple examples in each of them), which consists of about a million examples in the form of (image, label) tuples; the data are batched. The raw labels are then mapped to the target integers (one-hot encoded) that the model needs to predict.
I understand that I can pass the Dataset object as an input to model.predict(), which will output predictions for each example in the dataset. However, to compute some metrics I need to compare the true target values to the predicted ones, and to obtain the former I need to iterate through the Dataset, because all the true labels are stored in there.
This seems like a common task, but I couldn't find a straightforward solution that works for a huge dataset in TFRecord format. What would be the best way to compute, for instance, AUC per class in this case? Should I use Callbacks with model.predict(test_dataset)? Or should I process each example one by one in a loop, save the true and predicted values into arrays, and then use, for example, sklearn.metrics.roc_auc_score() to compute AUC scores for the two arrays? Or maybe I'm missing some obvious way to do it?
Thanks in advance!
If you need all labels, why not just:
model.evaluate(test_dataset.take(-1))
or, if your dataset is too large for that, just iterate over it, calculate your metric per batch, and take the mean at the end.
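A rough sketch of the "iterate and aggregate" option for per-class AUC, assuming eager execution (TF 2.x) and the model and test_dataset from the question; everything else here is illustrative:

import numpy as np
from sklearn.metrics import roc_auc_score

y_true, y_score = [], []
for images, labels in test_dataset:          # batched (image, label) tuples
    y_true.append(labels.numpy())            # one-hot targets
    y_score.append(model.predict_on_batch(images))

y_true = np.concatenate(y_true)
y_score = np.concatenate(y_score)
auc_per_class = roc_auc_score(y_true, y_score, average=None)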

Should I create JSON annotations for validation images?

I am trying to implement Mask R-CNN for my own dataset but couldn't find any info about annotations for the val folder that contains the validation images. I created JSON annotations using VIA 2.0.8 for my training set, and that makes sense. But if the validation images are the images to test on later, why make annotations for them? I can't train my model without a JSON file in the val folder.
I tried copying the JSON annotation for the training images to the validation folder. It worked, I think, but that means I would need the same number of images in both training and val, with the same names as well.
You can take a look at this answer. Basically, you need a validation set to validate the output and to measure the performance of your model. After the model is trained using the training set, the validation set is used to measure the model's performance in terms of accuracy, average precision, etc. This means that the validation set needs to have annotation files (ground truth) similar to the training set's, so that the result of the model's prediction can be compared to the true results defined by you. For example, the model performs segmentation on an image and outputs some result; this result is then compared with the annotation (the expected correct output) in the validation set to measure the accuracy of the model's prediction. The test set is just for you to test your model on and see how it is performing. However, there are no exact measurements on the test set to calculate the performance and accuracy.
In the case of segmentation, one of the popular measurements is the dice score, for which we need the annotations (in the validation set) to calculate.
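For reference, a minimal sketch of how a dice score might be computed from a predicted mask and the annotated ground-truth mask (numpy arrays, purely illustrative):

import numpy as np

def dice_score(pred_mask, gt_mask, eps=1e-7):
    pred = pred_mask.astype(bool)
    gt = gt_mask.astype(bool)
    intersection = np.logical_and(pred, gt).sum()
    # 2 * |pred ∩ gt| / (|pred| + |gt|); the ground truth comes from the
    # validation annotations, which is why they are needed.
    return (2.0 * intersection + eps) / (pred.sum() + gt.sum() + eps)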

In TensorFlow, why does slim.dataset.Dataset require `num_samples`? Is there any way to use it without knowing `num_samples`?

I am trying TFRecords using the slim.dataset.Dataset example.
slim.dataset.Dataset(
    data_sources=file_pattern,
    reader=reader,
    decoder=decoder,
    num_samples=SPLITS_TO_SIZES[split_name],
    items_to_descriptions=ITEMS_TO_DESCRIPTIONS,
    num_classes=NUM_CLASSES,
    labels_to_names=labels_to_names)
Later I am going to use this as in
provider = dataset_data_provider.DatasetDataProvider(dataset, shuffle=False)
I am wondering why the Dataset requires num_samples. E.g., what if one provided a pre-compiled kitti_val.tfrecord?
From the KITTI dataset definition, I may be able to set other arguments such as num_classes, labels_to_names, etc. However, I may not know how many samples the val split would have (it could be 10% or 20% of the training set.)
Moreover, I am guessing that num_samples could be calculated internally, and deterministically.
Is there any way to use a pre-compiled TFRecord without knowing num_samples?
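One hedged sketch of working around this: count the records in the pre-compiled .tfrecord files up front and pass that number in (TF 1.x API; the file pattern below is illustrative):

import glob
import tensorflow as tf

def count_records(file_pattern):
    total = 0
    for path in glob.glob(file_pattern):
        # tf_record_iterator yields every serialized example in the file.
        total += sum(1 for _ in tf.python_io.tf_record_iterator(path))
    return total

num_samples = count_records('kitti_val*.tfrecord')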

In the TensorFlow seq2seq framework, how to train data of different bucket sizes in one batch

I applied a queued reader to the TensorFlow seq2seq model to avoid reading the whole dataset into memory and processing it all in advance. I didn't bucket the dataset into different bucket files first to ensure one bucket size per batch, because that would also take a lot of time. As a consequence, each batch of data from the queue reader may contain sequences of different bucket sizes, which leads to a failure when running the original seq2seq model (it assumes that the data in one batch is of the same bucket size, and chooses only one sub-graph, depending on the bucket size, to execute).
What i have tried:
In the original implementation, one sub-graph per bucket is constructed, all sharing the same parameters. The only difference between them is the number of computation steps taken during the RNN process.
I changed each sub-graph to a conditional one, which, when its switch is True, computes the bucket_loss of that bucket and adds it to loss_list, and when the switch is False, does nothing and adds tf.constant(0.0) to loss_list. Finally, I use total_loss = tf.reduce_sum(loss_list) to collect all the losses and construct the gradient graph on it. Also, I feed a switches_list into the model at every step. The size of switches_list is the same as the number of buckets; if there is any data of the ith bucket size in the batch, the corresponding ith switch in switches_list is True, otherwise False.
The Problems encountered:
When the backpropagation process went through the tf.cond(...) node, I was warned by gradient.py that some sparse tensors were converted to dense ones.
When I tried to fetch total_loss or bucket_loss, I was told:
ValueError: Operation u'cond/model_with_one_buckets/sequence_loss/truediv' has been marked as not fetchable.
Would you please help me with the following:
How can I solve the two problems above?
How should I modify the graph to meet my requirements?
Any better ideas for training data of different bucket sizes in one batch?
Any better ideas for applying an asynchronous queue reader to the seq2seq framework without bucketing the whole dataset first?
I would (and did) throw out the bucketing entirely. Go with dynamic_rnn. The idea is to fill up your batch with a padding symbol, as many as needed for THAT batch to reach equal length across all its members (usually just the length of the longest member of the respective batch). This solves all four of your questions, but yes, it is some hassle to rewrite. (I don't regret it at all, though.)
I did many things along the way that were very particular to my case and data, so sharing them makes no sense, but maybe you want to check out this implementation: Variable Sequence Lengths in TensorFlow
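A hedged sketch of the pad-per-batch plus dynamic_rnn idea (TF 1.x API; the shapes, cell, and names are illustrative and not taken from the linked implementation):

import tensorflow as tf

embed_dim, hidden_units = 128, 256

# [batch, time, embed]; the time dimension is whatever THIS batch was padded to.
inputs = tf.placeholder(tf.float32, [None, None, embed_dim])
# True (unpadded) length of each sequence in the batch.
seq_len = tf.placeholder(tf.int32, [None])

cell = tf.nn.rnn_cell.GRUCell(hidden_units)
# dynamic_rnn unrolls to this batch's actual length and stops early per example,
# so no bucketing and no per-bucket sub-graphs are needed.
outputs, final_state = tf.nn.dynamic_rnn(
    cell, inputs, sequence_length=seq_len, dtype=tf.float32)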

Caching Computations in TensorFlow

Is there a canonical way to reuse computations from a previously-supplied placeholder in TensorFlow? My specific use case:
supply many inputs (using one placeholder) simultaneously, all of which are fed through a network to obtain smaller representations
define a loss based on various combinations of these smaller representations
train on one batch at a time, where each batch uses some subset of the inputs, without recomputing the smaller representations
Here is the goal in code, which is defective because the same computations are carried out again and again:
X_in = some_fixed_data
combinations_in = large_set_of_combination_indices
for combination_batch_in in batches(combinations_in, batch_size=128):
    session.run(train_op, feed_dict={X: X_in, combinations: combination_batch_in})
Thanks.
The canonical way to share computed values across sess.run() calls is to use a Variable. In this case, you could set up your graph so that when the Placeholders are fed, they compute a new value of the representation that is saved into a Variable. A separate portion of the graph reads those Variables to compute the loss. This will not work if you need to compute gradients through the part of the graph that computes the representation; computing those gradients would require recomputing every Op in the encoder.
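A minimal sketch of that Variable-caching idea (TF 1.x graph mode; the toy encoder, sizes, and pairwise loss below are illustrative, not the asker's model):

import numpy as np
import tensorflow as tf

num_inputs, in_dim, rep_dim = 1000, 784, 32

X = tf.placeholder(tf.float32, [num_inputs, in_dim])
reps = tf.layers.dense(X, rep_dim, activation=tf.nn.relu)      # toy "encoder"

# Non-trainable Variable caching the representations between run() calls.
cached = tf.get_variable("cached_reps", [num_inputs, rep_dim], trainable=False)
store_op = tf.assign(cached, reps)          # run once whenever the inputs change

combinations = tf.placeholder(tf.int32, [None, 2])
pairs = tf.gather(cached, combinations)     # subsequent runs only read the cache
loss = tf.reduce_mean(tf.norm(pairs[:, 0] - pairs[:, 1], axis=-1))

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(store_op, feed_dict={X: np.random.rand(num_inputs, in_dim)})
    for batch in np.split(np.random.randint(num_inputs, size=(512, 2)), 4):
        sess.run(loss, feed_dict={combinations: batch})   # no re-encoding here

Note that, as the answer warns, gradients are cut off at the Variable, so this only works when the encoder itself does not need to be trained through the cached values.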
This is the kind of thing that should be solved automatically with CSE (common subexpression elimination). I'm not sure what the support in TensorFlow is right now; it might be kind of spotty, but there's an optimizer_do_cse flag in the Graph options which defaults to false, and you can set it to true using GraphConstructorOptions. Here's a C++ example of using GraphConstructorOptions (sorry, couldn't find a Python one).
If that doesn't work, you could do "manual CSE", i.e., figure out which part is being needlessly recomputed, factor it out into a separate Tensor, and reference that tensor in all the calculations.