I have an input image tensor with shape [?, 448, 448, 3], and my network predicts a bounding box with shape [?, 4]. I want to slice the image tensor with the bounding box tensor and resize the result into a fixed-size image for further processing.
Is this possible with TensorFlow (or, even better, natively in Keras)? I have read the related questions (e.g. this and this), but they do not apply when both the indexing tensor and the original tensor have an unknown first dimension.
Any help in the right direction is much appreciated!
The best way to do this is tf.image.crop_and_resize. From the documentation:
Extracts crops from the input image tensor and bilinearly resizes them (possibly aspect ratio change) to a common output size specified by crop_size. This is more general than the crop_to_bounding_box op which extracts a fixed size slice from the input image and does not allow resizing or aspect ratio change.
Returns a tensor with crops from the input image at positions defined at the bounding box locations in boxes. The cropped boxes are all resized (with bilinear interpolation) to a fixed size = [crop_height, crop_width]. The result is a 4-D tensor [num_boxes, crop_height, crop_width, depth].
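A minimal sketch of how this could look for the shapes in the question, assuming TF 1.x-style placeholders, boxes given as normalized [y1, x1, y2, x2] coordinates, and a hypothetical 224x224 output size:

import tensorflow as tf

# Images: [?, 448, 448, 3]; boxes: [?, 4] with normalized [y1, x1, y2, x2] values.
images = tf.placeholder(tf.float32, [None, 448, 448, 3])
boxes = tf.placeholder(tf.float32, [None, 4])

# One predicted box per image: box i is cropped from image i.
box_ind = tf.range(tf.shape(boxes)[0])

# Crop each box and bilinearly resize it to a fixed 224x224 output.
crops = tf.image.crop_and_resize(images, boxes, box_ind, crop_size=[224, 224])
# crops has shape [num_boxes, 224, 224, 3]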
I've been playing with different models from TF Hub to extract feature vectors:
module = hub.load('https://tfhub.dev/google/tf2-preview/inception_v3/feature_vector/4')
features = module(image)
What I don't quite understand is how the input image should be preprocessed.
Every model from the hub has this generic instruction:
The input images are expected to have color values in the range [0,1], following the common image input conventions. The expected size of the input images is height x width = 299 x 299 pixels by default, but other input sizes are possible (within limits).
where "common image input" is a link to the following:
A signature that takes a batch of images as input accepts them as a dense 4-D tensor of dtype float32 and shape [batch_size, height, width, 3] whose elements are RGB color values of pixels normalized to the range [0, 1]. This is what you get from tf.image.decode_*() followed by tf.image.convert_image_dtype(..., tf.float32).
and this is indeed what I see quite often online:
image = tf.io.read_file(path)
# Decode the image to an H x W x 3 uint8 tensor
image = tf.io.decode_jpeg(image, channels=3)
# Resize the image to the model's input size
image = tf.image.resize(image, [model_input_size, model_input_size])
# 1 x model_input_size x model_input_size x 3 tensor with the data type of float32
image = tf.image.convert_image_dtype(image, tf.float32)[tf.newaxis, ...]
BUT color values are expected to be in the range [0,1]; in this case the colors are in the range [0,255] and should be scaled down:
image = numpy.array(image) * (1. / 255)
Is this just a common mistake, or is the TF documentation not up to date?
I was playing with models from tf.keras.applications and reading the source code on GitHub. I noticed that in some of the models (EfficientNet) the first layer is:
x = layers.Rescaling(1. / 255.)(x)
but in some models there is no such layer; instead, a utility function is expected to rescale the inputs, for example tf.keras.applications.mobilenet.preprocess_input.
So, how important is it for TF Hub saved models that the image colors are in the [0,1] range?
This is just a convention TF Hub proposes: "Models for the same task are encouraged to implement a common API so that model consumers can easily exchange them without modifying the code that uses them, even if they come from different publishers" (from here).
As you've noted, the publisher of google/tf2-preview/inception_v3/feature_vector/4 decided that input images "are expected to have color values in the range [0,1]", while the publisher of tensorflow/efficientdet/d1/1 decided to add a Rescaling layer to the model itself such that "[a tensor] with values in [0, 255]" can be passed. So ultimately, it's up to the publisher how they implement their model. In any case, when using models from tfhub.dev, the expected preprocessing steps will always be documented on the model page.
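For the inception_v3 feature-vector module specifically, a minimal sketch of a pipeline that actually yields values in [0, 1] (assuming TF 2.x, the default 299x299 input size, and a placeholder image path) is to convert the dtype before resizing, so that convert_image_dtype rescales the uint8 values:

import tensorflow as tf
import tensorflow_hub as hub

def load_image(path, size=299):
    image = tf.io.read_file(path)
    image = tf.io.decode_jpeg(image, channels=3)              # uint8, values in [0, 255]
    image = tf.image.convert_image_dtype(image, tf.float32)   # float32, values in [0, 1]
    image = tf.image.resize(image, [size, size])              # stays in [0, 1]
    return image[tf.newaxis, ...]                             # 1 x size x size x 3

module = hub.load('https://tfhub.dev/google/tf2-preview/inception_v3/feature_vector/4')
features = module(load_image('path/to/image.jpg'))            # placeholder path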
I am using https://tfhub.dev/google/imagenet/resnet_v2_50/feature_vector/3 to extract image feature vectors. However, I'm confused when it comes to how to preprocess the images prior to passing them through the module.
Based on the related GitHub explanation, the following should be done:
image_path = "path/to/the/jpg/image"
image_string = tf.read_file(image_path)
image = tf.image.decode_jpeg(image_string, channels=3)
image = tf.image.convert_image_dtype(image, tf.float32)
# All other transformations (during training), in my case:
image = tf.random_crop(image, [224, 224, 3])
image = tf.image.random_flip_left_right(image)
# During testing:
image = tf.image.resize_image_with_crop_or_pad(image, 224, 224)
However, using the aforementioned transformations, the results I am getting suggest that something might be wrong. Moreover, the ResNet paper says that the images should be preprocessed as follows:
A 224×224 crop is randomly sampled from an image or its
horizontal flip, with the per-pixel mean subtracted...
and I can't quite understand what that means. Can someone point me in the right direction?
Looking forward to your answers!
The image modules on TensorFlow Hub all expect pixel values in range [0,1], like you get in your code snippet above. This makes it easy and safe to switch between modules.
Inside the module, the input values are scaled to the range that the network was trained for. The module https://tfhub.dev/google/imagenet/resnet_v2_50/feature_vector/3 has been published from a TF-Slim checkpoint (see documentation), which uses yet another convention for normalizing inputs than He et al. -- but all this is taken care of.
To demystify the language in He et al.: it refers to the mean R, G and B values aggregated over all pixels of the dataset they studied, following the old wisdom that normalizing inputs to zero mean helps neural networks train better. However, later papers on image classification no longer paid this degree of attention to dataset-specific preprocessing.
The quotation from the ResNet paper you mentioned is based on the following explanation from the AlexNet paper:
ImageNet consists of variable-resolution images, while our system requires a constant input dimensionality. Therefore, we down-sampled the images to a fixed resolution of 256×256. Given a rectangular image, we first rescaled the image such that the shorter side was of length 256, and then cropped out the central 256×256 patch from the resulting image. We did not pre-process the images in any other way, except for subtracting the mean activity over the training set from each pixel.
So in the ResNet paper, a similar process consists of taking a 224×224-pixel crop of the image (or of its horizontal flip) to ensure the network is given constant-sized inputs, and then centering it by subtracting the mean.
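As a rough sketch of those training-time steps (assuming image holds raw RGB values in [0, 255]; the per-channel means below are the commonly quoted ImageNet values, used here only as example numbers):

import tensorflow as tf

# Example per-channel RGB means; the actual values depend on the training dataset.
channel_mean = tf.constant([123.68, 116.78, 103.94])

def preprocess_for_training(image):
    image = tf.image.random_flip_left_right(image)       # maybe flip horizontally
    image = tf.image.random_crop(image, [224, 224, 3])   # random 224x224 crop
    return image - channel_mean                          # subtract the mean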
I want the graph visualizer to label edges with tensor dimensions, and edge thickness to reflect total tensor size. Basically exactly the same as written in this doc:
When the serialized GraphDef includes tensor shapes, the graph visualizer labels edges with tensor dimensions, and edge thickness reflects total tensor size. To include tensor shapes in the GraphDef, pass the actual graph object (as in sess.graph) to the FileWriter when serializing the graph. The images below show the CIFAR-10 model with tensor shape information:
I pass the graph object to my summary.FileWriter:
writer = tf.summary.FileWriter(_dir_tensorboard, graph=sess.graph, flush_secs=300)
But I do not get any thickness information (all the edges have the same width); I only see the shape of each tensor and the number of tensors.
How can I achieve the same visual effect that the tutorial describes?
There was a regression impacting 1.11 and 1.12. It should be fixed with https://github.com/tensorflow/tensorboard/pull/1544. Sorry about it :( Please file a GitHub issue next time this happens.
Is the Convolution symbol computed cyclically, i.e., does it assume that the padded input symbol is periodic in all dimensions?
More specifically, if I've got an input symbol of dimensions 1x3xHxW, representing an RGB image, and I define a convolution operating on it as below:
conv1 = mxnet.symbol.Convolution(data=input, kernel=(3, 5, 5), pad=(0, 2, 2)...
what will the trained filter look like? I expect it to be composed of linear combinations of 2-D filters operating on each of the color channels R, G, B.
Am I correct?
It turns out that convolutions in mxnet are 3-D: the first two dimensions reflect the image coordinates, while the third dimension reflects the depth, i.e. the dimension of the feature space. For an RGB image at the input layer, the depth is 3 (unless it is a grayscale image, which has depth == 1). For any other layer, the depth is the number of features.
The convolution across the depth dimension is of course cyclical, such that all features of the current layer can affect any feature of the following layer by finding linear combinations that optimize the detection precision.
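A small sketch (assuming mxnet's symbol API) that makes this concrete: for a 2-D convolution over a 1x3xHxW RGB input, shape inference shows that each learned filter spans all 3 input channels, i.e. the weight has shape (num_filter, 3, kernel_h, kernel_w):

import mxnet as mx

data = mx.symbol.Variable('data')
conv = mx.symbol.Convolution(data=data, kernel=(5, 5), pad=(2, 2), num_filter=16)

# Infer shapes for an RGB input; the weight entry should come out as (16, 3, 5, 5).
arg_shapes, out_shapes, _ = conv.infer_shape(data=(1, 3, 224, 224))
for name, shape in zip(conv.list_arguments(), arg_shapes):
    print(name, shape)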
I have an image data tensor with shape B*H*W*C and a position tensor with shape B*H*W*2. The values in the position tensor are pixel coordinates, and I want to sample pixels from the image data tensor according to these coordinates. I have tried doing this by reshaping the tensor into a one-dimensional tensor, but that is really inconvenient. I wonder whether I could implement it with a more convenient approach, like a matrix mapping (e.g. remap in OpenCV).
I would first ask whether you are sure the position matrix isn't redundant. If the position matrix entries simply correspond to the pixel locations in the image array, then whatever access pattern you use on the position matrix could be applied directly to the image data instead.
Perhaps as a starting point, running
sess = tf.Session()
np_img, np_pos = sess.run([tf_img, tf_pos], feed_dict={...})
will convert tensors to numpy arrays, which may make your operations easier.
Otherwise, a 1-D tensor isn't that bad, and there are TF functions that make reshaping easy.
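For example, once the tensors are numpy arrays, the sampling could be done with fancy indexing along the lines of the following sketch (the B x H x W x C / B x H x W x 2 shapes and the (row, col) ordering of the position entries are assumptions here):

import numpy as np

# np_img: B x H x W x C image data; np_pos: B x H x W x 2 (row, col) coordinates,
# e.g. the arrays returned by the sess.run call above.
b = np.arange(np_img.shape[0])[:, None, None]          # broadcastable batch index
sampled = np_img[b, np_pos[..., 0], np_pos[..., 1]]    # B x H x W x C sampled pixels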