CNTK ImageDeserializer and DCGAN sample - cntk

I'm reworking this sample https://github.com/Microsoft/CNTK/blob/master/Tutorials/CNTK_206B_DCGAN.ipynb to work with png MNIST files (rather than flat 1d array image input that tutorial uses). I use ImageDeserializer (and map file to load the data):
def create_mb_source(map_file, image_dims, num_classes, randomize=True):
transforms = [
xforms.scale(width=image_dims[2], height=image_dims[1], channels=image_dims[0], interpolations='linear')]
return MinibatchSource(ImageDeserializer(map_file, StreamDefs(
features=StreamDef(field='image', transforms=transforms),
labels=StreamDef(field='label', shape=num_classes))),
randomize=randomize)
I changed the input output of to Discriminator to expect 28x28 image (and output of Generator). See the code here: https://github.com/olgaliak/cntk-cyclegan/blob/master/trainDCGan.py
the problem is that trainDCGan.py is generating noise now. Appreciate your help!

The issue got solved once I
1) Switched to used 3 channels in ImageDeserializer
2) Changed network architecture to use 2d strides\kernels instead 1d.
This commit highlights the changes that made things working.

Related

How to Modify Data Augmentation in Darknet's YOLOv4

I built a digital scale reader using Darknet's YOLOv4Tiny. It is having trouble confusing 2's and 5's which leads me to believe that I am doing some unwanted data augmentation during training. (The results are mostly correct, and glare could be a factor, but I am expecting better results).
I have referenced this post:
Understanding darknet's yolo.cfg config files
and the darknet github:
https://github.com/AlexeyAB/darknet/wiki/CFG-Parameters-in-the-%5Bnet%5D-section
Below is a link to the yolov4-tiny.cfg that I modified for my model:
https://github.com/AlexeyAB/darknet/blob/master/cfg/yolov4-tiny.cfg
And a snippet from the link above:
[net]
# Testing
#batch=1
#subdivisions=1
# Training
batch=64
subdivisions=1
width=416
height=416
channels=3
momentum=0.9
decay=0.0005
angle=0
saturation = 1.5
exposure = 1.5
hue=.1
Am I correct that angle=0 means that there is no rotation?
Are there any other possible ways I might be augmenting my data that could cause an issue?
Edit: If I wanted, how could I eliminate all data augmentation?
Or do I just need more data (currently 2484 images for 10 digit classes)?
horizontal flip is applied by default, add "flip=0" to disable.
https://github.com/AlexeyAB/darknet
Here, the angle, saturation, exposure, and hue all are part of data augmentation. You can eliminate all data augmentation by setting the value to 0. As data argumentation's values are hyperparameters, modifying the value of these could lead the accuracy to better or worse both. Here I suggest you keep the value given by darknet the same as they found these values are good to get good accuracy. And by all these 4 types of data augmentation darknet initially generated all images. If you add more images without replication it is always good to add more images for the deep learning model to learn the necessary complexity during training.

Convert a .npy file to wav following tacotron2 training

I am training the Tacotron2 model using TensorflowTTS for a new language.
I managed to train the model (performed pre-processing, normalization, and decoded the few generated output files)
The files in the output directory are .npy files. Which makes sense as they are mel-spectograms.
I am trying to find a way to convert said files to a .wav file in order to check if my work has been fruitfull.
I used this :
melspectrogram = librosa.feature.melspectrogram(
"/content/prediction/tacotron2-0/paol_wavpaol_8-norm-feats.npy", sr=22050,
window=scipy.signal.hanning, n_fft=1024, hop_length=256)
print('melspectrogram.shape', melspectrogram.shape)
print(melspectrogram)
audio_signal = librosa.feature.inverse.mel_to_audio(
melspectrogram, sr22050, n_fft=1024, hop_length=256, window=scipy.signal.hanning)
print(audio_signal, audio_signal.shape)
sf.write('test.wav', audio_signal, sample_rate)
But it is given me this error : Audio data must be of type numpy.ndarray.
Although I am already giving it a numpy.ndarray file.
Does anyone know where the issue might be, and if anyone knows a better way to do it?
I'm not sure what your error is, but the output of a Tacotron 2 system are log Mel spectral features and you can't just apply the inverse Fourier transform to get a waveform because you are missing the phase information and because the features are not invertible. You can learn about why this is at places like Speech.Zone (https://speech.zone/courses/)
Instead of using librosa like you are doing, you need to use a vocoder like HiFiGan (https://github.com/jik876/hifi-gan) that is trained to reconstruct a waveform from log Mel spectral features. You can use a pre-trained model, and most off-the-shelf vocoders, but make sure that the sample rate, Mel range, FFT, hop size and window size are all the same between your Tacotron2 feature prediction network and whatever vocoder you choose otherwise you'll just get noise!

How to batch an object detection dataset?

I am working on implementing a face detection model on the wider face dataset. I learned it was built into Tensorflow datasets and I am using it.
However, I am facing an issue while batching the data. Since, an Image can have multiple faces, therefore the number of bounding boxes output are different for each Image. For example, an Image with 2 faces will have 2 bounding box, whereas one with 4 will have 4 and so on.
But the problem is, these unequal number of bounding boxes is causing each of the Dataset object tensors to be of different shapes. And in TensorFlow afaik we cannot batch tensors of unequal shapes ( source - Tensorflow Datasets: Make batches with different shaped data). So I am unable to batch the dataset.
So after loading the following code and batching -
ds,info = tfds.load('wider_face', split='train', shuffle_files=True, with_info= True)
ds1 = ds.batch(12)
for step, (x,y,z) in enumerate(ds1) :
print(step)
break
I am getting this kind of error on run Link to Error Image
In general any help on how can I batch the Tensorflow object detection datasets will be very helpfull.
It might be a bit late but I thought I should post this anyways. The padded_batch feature ought to do the trick here. It kind of goes around the issue by matching dimension via padding zeros
ds,info = tfds.load('wider_face', split='train', shuffle_files=True, with_info= True)
ds1 = ds.padded_batch(12)
for step, (x,y,z) in enumerate(ds1) :
print(step)
break
Another solution would be to process not use batch and process with custom buffers with for loops but that kind of defeats the purpose. Just for posterity I'll add the sample code here as an example of a simple workaround.
ds,info = tfds.load('wider_face', split='train', shuffle_files=True, with_info= True)
batch_size = 12
image_annotations_pair = [x['image'], x['faces']['bbox'] for n, x in enumerate(ds) if n < batch_size]
Then use a train_step modified for this.
For details one may refer to - https://www.kite.com/python/docs/tensorflow.contrib.autograph.operators.control_flow.dataset_ops.DatasetV2.padded_batch

Non Max Suppression settings and postprocessing for EfficientDet

I've downloaded and installed the Tensorflow Object Detection API and downloaded one of the EfficientDet models. As I want to do some work on the raw scores directly before Non-Max Suppression reduces it to class output, my first goal was to try and get the same final outputs from the raw scores, using the downloaded model config as a guide.
post_processing {
batch_non_max_suppression {
score_threshold: 9.99999993922529e-09
iou_threshold: 0.5
max_detections_per_class: 100
max_total_detections: 100
}
score_converter: SIGMOID
As the Object Detection API has no score converter method under postprocessing, I'm not sure what this does, but the only batch NMS method in utils seems to be batch_multiclass_non_max_suppression.
So, having fed an image into the network and got an output detections, to try and replicate its results:
result = post_processing.batch_multiclass_non_max_suppression(tf.expand_dims(detections['raw_detection_boxes'], 2), detections['raw_detection_scores'], 9.99999993922529e-09, 0.5, 100, max_total_size=100)
detections['detection_boxes'] = result[0]
detections['detection_scores'] = result[1]
detections['detection_classes'] = result[2]
i.e., substitute the relevant scores in the detections with the output of NMS, and insert the dimension needed for the batch function to work. This is then visualised following per the TensorFlow Hub colab.
The problem is that whilst the input image (this from the MSCOCO dataset) should produce this:
It instead produces this:
The bounding boxes are all (seemingly) shifted upwards and the categories are simply off, which suggests there's more processing being done between the raw scores, NMS, and output, but it's entirely unclear what. The scores are correct, so it appears to be pruning correctly.
Edit: I suspect, after looking at the SSD model template, that the problem with the misaligned bounding boxes is because I'm not passing the resized image dimensions along to NMS, which is generated by the preprocessing step, which should be easy enough to address via generating the image resize function. However, after applying the slice operation to remove a background class doesn't address the incorrect labels:
Instead, it seems to have lost the person class entirely--this makes sense; it isn't configured to include a background class of any sort and if Person (id 1) is instead coming out as index 0, then this would cut them off.
EDIT 2: I looked at the original meta-architecture further and copied the image-resizing function, i.e.:
from object_detection.protos import image_resizer_pb2
from object_detection.utils import config_util as c
from object_detection.utils import shape_utils
config = c.get_configs_from_pipeline_file(r"C:\Users\Person\.keras\datasets\efficientdet_d7_coco17_tpu-32\pipeline.config")
image_config = c.get_image_resizer_config(config['model'])
resize = image_resizer_builder.build(image_config)
def compute_clip_window(preprocessed_images, true_image_shapes):
# identical to the meta-arch definition
# image resizing
im = tf.cast(input_tensor, tf.float32)
channel_offset = [0.485, 0.456, 0.406]
channel_scale = [0.229, 0.224, 0.225]
im = ((im / 255.0) - [[channel_offset]]) / [[channel_scale]]
resized = shape_utils.resize_images_and_return_shapes(im, resize)
clip = compute_clip_window(resized[0], resized[1])
Therefore allowing the clip argument to be supplied to NMS. However, this doesn't change anything, and it still returns the same mis-aligned boxes as the second image. This is incredibly confusing, as this seems like it should replicate everything the model needs in both the preprocessing and postprocessing steps to generate its own output: the image is normalized and resized; the true image size is retained alongside the resized image; no further processing of the raw boxes or raw scores happens before they get passed to the NMS (the returned versions of the raw values are identical to the values passed to NMS except with one dimension and the model itself doesn't interfere with the post-processing at all--and the call signature calls preprocessing, prediction, and postprocessing in turn, so nothing else should be happening in the interim.
Edit 3: I added another line was added (to no effect)--setting the multiclass scores in the NMS additional fields to the detection scores with backgrounds (i.e., the raw scores). By adding +1 to all the label classes, I got the following image:
Whilst this is correct, this only corrects for the earlier parts of the dataset, i.e. where the only empty class is the 0th. It still appears that there must be some mapping step I'm not following, alongside whatever is causing the image misalignment.
The easiest solution in my case was to load the model from the checkpoint and configs, rather than use the saved model directly, in order to access the original preprocess, predict, and postprocess methods, rather than having a single function call.

TF object detection: return subset of inference payload

Problem
I'm working on training and deploying an instance segmentation model using TF's object detection API. I'm able to successfully train the model, package it into a TF Serving Docker image (latest tag as of Oct 2020), and process inference requests via the REST interface. However, the amount of data returned from an inference request is very large (hundreds of Mb). This is a big problem when the inference request and processing don't happen on the same machine because all that returned data has to go over the network.
Is there a way to trim down the number of outputs (either during model export or within the TF Serving image) so allow faster round trip times during inference?
Details
I'm using TF OD API (with TF2) to train a Mask RCNN model, which is a modified version of this config. I believe the full list of outputs is described in code here. The list of items I get during inference is also pasted below. For a model with 100 object proposals, that information is ~270 Mb if I just write the returned inference as json to disk.
inference_payload['outputs'].keys()
dict_keys(['detection_masks', 'rpn_features_to_crop', 'detection_anchor_indices', 'refined_box_encodings', 'final_anchors', 'mask_predictions', 'detection_classes', 'num_detections', 'rpn_box_predictor_features', 'class_predictions_with_background', 'proposal_boxes', 'raw_detection_boxes', 'rpn_box_encodings', 'box_classifier_features', 'raw_detection_scores', 'proposal_boxes_normalized', 'detection_multiclass_scores', 'anchors', 'num_proposals', 'detection_boxes', 'image_shape', 'rpn_objectness_predictions_with_background', 'detection_scores'])
I already encode the images within my inference requests as base64, so the request payload is not too large when going over the network. It's just that the inference response is gigantic in comparison. I only need 4 or 5 of the items out of this response, so it'd be great to exclude the rest and avoid passing such a large package of bits over the network.
Things I've tried
I've tried setting the score_threshold to a higher value during the export (code example here) to reduce the number of outputs. However, this seems to just threshold the detection_scores. All the extraneous inference information is still returned.
I also tried just manually excluding some of these inference outputs by adding the names of keys to remove here. That also didn't seem to have any effect, and I'm worried this is a bad idea because some of those keys might be needed during scoring/evaluation.
I also searched here and on tensorflow/models repo, but I wasn't able to find anything.
I was able to find a hacky workaround. In the export process (here), some of the components of the prediction dict are deleted. I added additional items to the non_tensor_predictions list, which contains all keys that will get removed during the postprocess step. Augmenting this list cut down my inference outputs from ~200MB to ~12MB.
Full code for the if self._number_of_stages == 3 block:
if self._number_of_stages == 3:
non_tensor_predictions = [
k for k, v in prediction_dict.items() if not isinstance(v, tf.Tensor)]
# Add additional keys to delete during postprocessing
non_tensor_predictions = non_tensor_predictions + ['raw_detection_scores', 'detection_multiclass_scores', 'anchors', 'rpn_objectness_predictions_with_background', 'detection_anchor_indices', 'refined_box_encodings', 'class_predictions_with_background', 'raw_detection_boxes', 'final_anchors', 'rpn_box_encodings', 'box_classifier_features']
for k in non_tensor_predictions:
tf.logging.info('Removing {0} from prediction_dict'.format(k))
prediction_dict.pop(k)
return prediction_dict
I think there's a more "proper" way to deal with this using signature definitions during the creation of the TF Serving image, but this worked for a quick and dirty fix.
I've ran into the same problem. In the exporter_main_v2 code there is stated that the outputs should be:
and the following output nodes returned by the model.postprocess(..):
* `num_detections`: Outputs float32 tensors of the form [batch]
that specifies the number of valid boxes per image in the batch.
* `detection_boxes`: Outputs float32 tensors of the form
[batch, num_boxes, 4] containing detected boxes.
* `detection_scores`: Outputs float32 tensors of the form
[batch, num_boxes] containing class scores for the detections.
* `detection_classes`: Outputs float32 tensors of the form
[batch, num_boxes] containing classes for the detections.
I've submitted an issue on the tensorflow object detection github repo, I hope we will get feedback from the tensorflow dev team.
The github issue can be found here
If you are using exporter_main_v2.py file to export your model, you can try this hack way to solve this problem.
Just add following codes in the function _run_inference_on_images of exporter_lib_v2.py file:
detections[classes_field] = (
tf.cast(detections[classes_field], tf.float32) + label_id_offset)
############# START ##########
ignored_model_output_names = ["raw_detection_boxes", "raw_detection_scores"]
for key in ignored_model_output_names:
if key in detections.keys(): del detections[key]
############# END ##########
for key, val in detections.items():
detections[key] = tf.cast(val, tf.float32)
Therefore, the generated model will not output the values of ignored_model_output_names.
Please let me know if this can solve your problem.
Another approach would be to alter the signatures of the saved model:
model = tf.saved_model.load(path.join("models", "efficientdet_d7_coco17_tpu-32", "saved_model"))
infer = model.signatures["serving_default"]
outputs = infer.structured_outputs
for o in ["raw_detection_boxes", "raw_detection_scores"]:
outputs.pop(o)
tf.saved_model.save(
model,
export_dir="export",
signatures={"serving_default" : infer},
options=None
)