Feature wise center in ImageDataGenerator - tensorflow

Feature-wise centering means subtracting the dataset mean from each image. In ImageDataGenerator, setting featurewise_center=True does exactly that. I have two questions:
Is the mean computed over the augmented data or over the data stored in the train directory?
At test time I want to subtract the same mean from the test images. How do I get that value?

Is the mean computed over the augmented data or over the data stored in the train directory?
According to the Keras documentation:
fit(x, augment=False, rounds=1, seed=None )
Fits the data generator to some sample data.
This computes the internal data stats related to the data-dependent
transformations, based on an array of sample data.
Only required if featurewise_center or featurewise_std_normalization
or zca_whitening are set to True.
When rescale is set to a value, rescaling is applied to sample data
before computing the internal data stats.
So you must fit the ImageDataGenerator to image data stored as a rank-4 array, and you choose whether the stats are computed on augmented images by setting the 'augment' parameter to True or False. If you don't fit the ImageDataGenerator, it skips the featurewise centering transformation.
At test time I want to subtract the same mean from the test images. How do I get that value?
You can copy the stats from one data generator to another, so you don't have to fit a data generator on the test set. After you fit the train data generator, copy its stats to the test data generator, e.g.:
import tensorflow as tf

# fit_array: rank-4 array of training images
image_train_datagen = tf.keras.preprocessing.image.ImageDataGenerator(
    featurewise_center=True,
    horizontal_flip=True,
    rotation_range=20,
    zoom_range=0.2,
    shear_range=0.1,
)
image_train_datagen.fit(fit_array)

# Reuse the training mean for the test generator instead of fitting it again
image_test_datagen = tf.keras.preprocessing.image.ImageDataGenerator(featurewise_center=True)
image_test_datagen.mean = image_train_datagen.mean
You can also copy the standard deviation (for featurewise std normalization, via the 'std' attribute) and the principal components (for ZCA whitening, via the 'principal_components' attribute; newer Keras releases appear to name this attribute 'zca_whitening_matrix', so check your version).
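To make the statistic concrete, here is a minimal NumPy sketch of what feature-wise centering amounts to for channels-last data when fitting without augmentation: a per-channel mean taken over samples, rows and columns, then subtracted from both train and test images. The array names and shapes are placeholders, and this illustrates the idea rather than reproducing the Keras implementation.

```python
import numpy as np

rng = np.random.RandomState(0)
# Stand-in data: (samples, height, width, channels), channels-last is assumed
train_images = rng.rand(16, 8, 8, 3).astype("float32")
test_images = rng.rand(4, 8, 8, 3).astype("float32")

# Per-channel mean over samples, rows and columns, shaped to broadcast over an image
mean = train_images.mean(axis=(0, 1, 2)).reshape(1, 1, 3)

# The same training mean is subtracted from train and test images
centered_train = train_images - mean
centered_test = test_images - mean

print(mean.shape)
```

Storing `mean` alongside the model (for example with `np.save`) is one way to reuse it at inference time without rebuilding the generator.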

Related

How to see the indices of the split on the data that GridSearchCV used when it made the split?

When using GridSearchCV() to perform a k-fold cross validation analysis on some data is there a way to know which data was used for each split?
For example, assume the goal is to build a binary classifier of your choosing, named 'model'. There are 100 data points (rows) with 5 features each and an associated 1 or 0 target. 20 of the 100 data points are held out for testing after training and hyperparameter tuning, so GridSearchCV never sees those 20 data points. The other 80 rows are passed to the estimator as X and Y, so GridSearchCV only sees 80 rows of data. Various hyperparameters are tuned and laid out in the param_grid variable. In this case the cv parameter is set to 3, as shown:
grid = GridSearchCV(estimator=model, param_grid=param_grid, cv=3)
grid_result = grid.fit(X, Y)
Is there a way to see which data was used as the training data and as the cross validation data for each fold? Maybe seeing which indices were used for the split?
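One way to make the splits visible is to build them yourself and hand them to GridSearchCV, which accepts an iterable of (train, test) index arrays for cv. The sketch below assumes a classifier, for which an integer cv corresponds to an unshuffled StratifiedKFold; the data, the LogisticRegression estimator and the parameter grid are placeholders, not the setup from the question.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Placeholder data standing in for the 80 training rows with 5 features
rng = np.random.RandomState(0)
X = rng.rand(80, 5)
Y = np.tile([0, 1], 40)

# Materialize the folds so the indices can be inspected and reused
cv = StratifiedKFold(n_splits=3)
splits = list(cv.split(X, Y))

for fold, (train_idx, val_idx) in enumerate(splits):
    print("fold", fold, "train rows:", len(train_idx), "validation rows:", len(val_idx))
    print("  validation indices:", val_idx)

# Passing the precomputed splits guarantees the search uses exactly these folds
grid = GridSearchCV(
    estimator=LogisticRegression(),
    param_grid={"C": [0.1, 1.0]},
    cv=splits,
)
grid_result = grid.fit(X, Y)
```

The indices refer to row positions in the X and Y passed to fit, so they can be mapped back to the original 80 rows directly.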

How to add custom evaluation metrics in Tensorflow Object Detection API?

I would like to have my own list of metrics when evaluating an instance segmentation model in Tensorflow's Object Detection API, which can be summarized as follows:
Precision values for IOUs of 0.5-0.95 with increments of 0.05
Recall values for IOUs of 0.5-0.95 with increments of 0.05
AUC values for precision and recall between 0-1 with increments of 0.05
What I've tested so far is modifying the existing COCO evaluation metrics by tweaking some code in the PythonAPI of pycocotools and the additional metrics file within Tensorflow's research model. Currently the default output values for COCO evaluation are the following:
Precision/mAP
Precision/mAP#.50IOU
Precision/mAP#.75IOU
Precision/mAP (small)
Precision/mAP (medium)
Precision/mAP (large)
Recall/AR#1
Recall/AR#10
Recall/AR#100
Recall/AR#100 (small)
Recall/AR#100 (medium)
Recall/AR#100 (large)
So I decided first to use coco_detection_metrics in my eval_config field inside the .config file used for training:
eval_config: {
  metrics_set: "coco_detection_metrics"
}
Then I edited cocoeval.py and coco_tools.py multiple times (proportional to the number of values), adding more items to the stats list and the stats summary dictionary to get the desired result. For demonstration purposes, I am only going to show one example: adding precision at IOU=0.55 on top of precision at IOU=0.5.
So, this is the modified method of the COCOeval class inside cocoeval.py:
def _summarizeDets():
    stats[1] = _summarize(1, iouThr=.5, maxDets=self.params.maxDets[2])
    # new entry: precision at IOU=0.55
    stats[12] = _summarize(1, iouThr=.55, maxDets=self.params.maxDets[2])
and the edited methods under the COCOEvalWrapper class inside coco_tools.py:
summary_metrics = OrderedDict([
    ('Precision/mAP#.50IOU', self.stats[1]),
    ('Precision/mAP#.55IOU', self.stats[12])
])
for category_index, category_id in enumerate(self.GetCategoryIdList()):
    per_category_ap['Precision mAP#.50IOU ByCategory/{}'.format(category)] = self.category_stats[1][category_index]
    per_category_ap['Precision mAP#.55IOU ByCategory/{}'.format(category)] = self.category_stats[12][category_index]
It would be useful to know a more efficient way to deal with my problem and easily request a list of custom evaluation metrics without having to tweak the already existing COCO files. Ideally, my primary goal is to
Be able to create a custom console output based on the metrics provided at the beginning of the question
and my secondary goals would be to
Export the metrics with their respective values in JSON format
Visualize the three graphs in Tensorboard
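A possible route that avoids editing the COCO files is to read per-threshold values out of the arrays that COCOeval stores after accumulate(): eval['precision'], indexed as [IoU threshold, recall threshold, category, area range, max detections], and eval['recall'], indexed as [IoU threshold, category, area range, max detections], with -1 marking missing entries. The sketch below uses random arrays in place of a real evaluation so it runs on its own; the function name, metric keys and index defaults are hypothetical choices, not part of pycocotools or the Object Detection API.

```python
import json
import numpy as np

def per_iou_summary(precision, recall, iou_thrs, area_idx=0, max_det_idx=-1):
    """Average precision and recall per IoU threshold, skipping -1 (missing) entries."""
    summary = {}
    for t, thr in enumerate(iou_thrs):
        p = precision[t, :, :, area_idx, max_det_idx]
        r = recall[t, :, area_idx, max_det_idx]
        p = p[p > -1]
        r = r[r > -1]
        summary["Precision/mAP@{:.2f}IOU".format(thr)] = float(p.mean()) if p.size else -1.0
        summary["Recall/AR@{:.2f}IOU".format(thr)] = float(r.mean()) if r.size else -1.0
    return summary

# IoU thresholds 0.50, 0.55, ..., 0.95
iou_thrs = np.linspace(0.5, 0.95, 10)

# Random stand-ins shaped like the accumulated arrays (3 categories, 4 area ranges, 3 maxDets)
rng = np.random.RandomState(0)
precision = rng.rand(10, 101, 3, 4, 3)
recall = rng.rand(10, 3, 4, 3)

summary = per_iou_summary(precision, recall, iou_thrs)
print(json.dumps(summary, indent=2))  # console output and JSON export in one step
```

With a real evaluation, the two arrays would come from the COCOeval object after evaluate() and accumulate() have run; the resulting dictionary could then be written to disk or logged as scalars for Tensorboard.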

OneHotEncoder Multiple Columns

I am trying to encode a data table with multiple columns to a given set of categories:
ohe1 = OneHotEncoder(categories = [list_names_data_rest.values],dtype = 'int8')
data_rest1 = ohe1.fit_transform(data_rest.values).toarray()
Here, list_names_data_rest.values is an array of shape (664,). I have 664 unique features and I am trying to encode data_rest, which has shape (5050, 6). After encoding, I expect a shape of (5050, 664).
I am one-hot encoding to a predefined feature set because I am downloading the data sets in chunks (due to RAM limitations) and I would like the input shape to my neural network to be consistent.
If I use pd.get_dummies, then depending on the data set I could get different categories and a different input shape for my NN.
ohe1.fit_transform requires input of shape (n_samples, n_features), but I do not know how to handle this.
HashingVectorizer may be a good solution for your case. It is independent of the number of input features; just set the initial size large enough.
If you wish to use pd.get_dummies, there is an option to include your encodings iteratively for every batch.
For your first batch:
ohe = pd.get_dummies(data_rest, columns=['label_col'])
For every subsequent batch:
for b in batches:
    batch_ohe = pd.get_dummies(b, columns=['label_col'])
    ohe = pd.concat([ohe, batch_ohe], axis=0)
ohe = ohe.fillna(0)
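If the goal is a fixed output width equal to the number of known categories (664 in the question), one option to consider is scikit-learn's MultiLabelBinarizer with an explicit classes list: each row is treated as a set of labels and becomes a multi-hot vector over that list. This is a sketch with a toy four-category vocabulary, and it assumes that the column a value came from does not matter, which may not hold for your data.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Toy stand-in for the predefined list of 664 category names
categories = ["a", "b", "c", "d"]

# Fixing `classes` pins the output columns regardless of what a batch contains
mlb = MultiLabelBinarizer(classes=categories)

# Two batches whose rows contain different subsets of the categories
batch1 = [["a", "b"], ["c", "a"]]
batch2 = [["d", "d"]]

enc1 = mlb.fit_transform(batch1)
enc2 = mlb.fit_transform(batch2)

print(enc1.shape, enc2.shape)  # both have len(categories) columns
```

Because the column order follows `categories`, every batch yields the same input width for the network, whichever values it happens to contain.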

Incorporating very large constants in Tensorflow

For example, the comments for the Tensorflow image captioning example model state:
NOTE: This script will consume around 100GB of disk space because each image
in the MSCOCO dataset is replicated ~5 times (once per caption) in the output.
This is done for two reasons:
1. In order to better shuffle the training data.
2. It makes it easier to perform asynchronous preprocessing of each image in
TensorFlow.
The primary goal of this question is to see if there is an alternative to this type of duplication. In my use case, storing the data in this way would require each image to be duplicated in the TFRecord files many more times, on the order of 20 - 50 times.
I should note first that I have already fed the images through VGGnet to extract 4096 dim features, and I have these stored as a mapping between filename and the vectors.
Before switching over to Tensorflow, I had been feeding batches containing filename strings and then looking up the corresponding vector on a per-batch basis. This allows me to store all of the image data in ~15GB without needing to duplicate the data on disk.
My first attempt to do this in Tensorflow involved storing indices in the TFExample buffers and then doing a "preprocessing" step to slice into the corresponding matrix:
img_feat = pd.read_pickle("img_feats.pkl")
img_matrix = np.stack(img_feat)
preloaded_images = tf.Variable(img_matrix)
first_image = tf.slice(preloaded_images, [0,0], [1,4096])
However, in this case, Tensorflow disallows a variable larger than 2GB. So my next thought was to partition this across several variables:
img_tensors = []
for i in range(NUM_SPLITS):
    with tf.Graph().as_default():
        img_tensors.append(tf.Variable(img_matrices[i], name="preloaded_images_%i"%i))
first_image = tf.concat(1, [tf.slice(t, [0,0], [1,4096//NUM_SPLITS]) for t in img_tensors])
In this case, I'm forced to store each partition on a separate graph, because it seems any one graph cannot be this large either. However, now the concat fails because each tensor I am concatenating is on a separate graph.
Is there any advice on incorporating a large amount (~15GB) of preloaded data into the Tensorflow graph?
Potentially related is this question; however in this case I'd like to override the decoding of the actual JPEG file with the preprocessed value in a tensor op.
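One alternative worth considering is to keep the feature matrix outside the graph entirely and look rows up by index while batches are produced, which mirrors the per-batch lookup described above. The sketch below assumes a TensorFlow 2 release that supports output_signature in tf.data.Dataset.from_generator; the small random matrix and the (image index, caption id) pairs are placeholders for the real features and captions.

```python
import numpy as np
import tensorflow as tf

FEAT_DIM = 4096

# Small stand-in for the preloaded feature matrix; a real one could be an np.memmap
img_matrix = np.random.rand(8, FEAT_DIM).astype(np.float32)

# Hypothetical (image_index, caption_id) pairs: one image can appear many times
pairs = [(0, 10), (0, 11), (3, 12), (7, 13)]

def gen():
    # Only the index is stored per example; the vector is looked up on the fly
    for img_idx, caption_id in pairs:
        yield img_matrix[img_idx], caption_id

dataset = tf.data.Dataset.from_generator(
    gen,
    output_signature=(
        tf.TensorSpec(shape=(FEAT_DIM,), dtype=tf.float32),
        tf.TensorSpec(shape=(), dtype=tf.int32),
    ),
).batch(2)

batches = list(dataset.as_numpy_iterator())
print(len(batches), batches[0][0].shape)
```

The matrix never becomes a graph constant or a variable, so the 2GB limit does not apply, and each image is stored once no matter how many captions refer to it.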

Discretization of continuous attributes using np.histogram - how to apply on a new data point?

Continuing from "How to do discretization of continuous attributes in sklearn?"
After I "learned" my bins from the train data using np.histogram(A['my_var']), how do I apply them to my test set? That is, which bin does the my_var attribute of each data point fall into? Both my train and test data are in pandas data frames, if it matters.
Thanks
oops. it's easy.
hist = np.histogram(A['my_var'])
A.loc[:, 'my_bin'] = np.digitize(A['my_var'], hist[1])
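Applied to a separate test frame, the same idea could look like the sketch below. The frames `train` and `test` are made-up examples. Note that np.digitize returns 0 for values below the first edge and len(edges) for values at or above the last edge, so the sketch clips those into the first and last bin, which is an assumption about how out-of-range test values should be handled.

```python
import numpy as np
import pandas as pd

train = pd.DataFrame({"my_var": [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]})
test = pd.DataFrame({"my_var": [-1.0, 0.5, 9.5, 10.0, 12.0]})

# Learn the bin edges on the training data only (10 bins by default, so 11 edges)
_, edges = np.histogram(train["my_var"])

# Assign test values to the learned bins; raw indices may fall outside 1..10
raw = np.digitize(test["my_var"], edges)

# Clip out-of-range values into the first or last training bin
test["my_bin"] = np.clip(raw, 1, len(edges) - 1)

print(test)
```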