Dynamic Tensor Aligment/Cropping - tensorflow

I implemented Fully-Convolution Network at TensorFlow. It use encdoder-decoder structure.
When training, I use always same image size (224x224, using random crop) and everything works nicely.
In interference phase, I want to predict one image at a time, because I want to use full-image (not croped). For example, such image have size [406,256]. And here is problem.
In Encoder-Decoder architecture I add two tesors (z = x + y). When training, sizes of both tensor matches. When predicting my single image, sizes does not match (tensor sizes: [1,47,47,64] vs [1,46,46,64]). I think it is cause by some rounding done in Conv and Pool layer.
What should I change in my architecture to works for any image size I want? Should I change rounding parameters? Or add 'cropping' of tensor?
Link to implementation of architecture:
https://gist.github.com/melgor/0e43cadf742fe3336148ab64dd63138f
(the problem occur in line 166)

I found the solution for variable input size:)
What we really need was a 'Crop-layer', that crop one tensor to match other. I found really similar layer here: http://tf-unet.readthedocs.io/en/latest/_modules/tf_unet/layers.html
(crop_and_concat).
I have just made it `crop_and_add' and it is working:
def crop_and_add(x1,x2):
x1_shape = tf.shape(x1)
x2_shape = tf.shape(x2)
# offsets for the top left corner of the crop
offsets = [0, (x1_shape[1] - x2_shape[1]) // 2, (x1_shape[2] - x2_shape[2]) // 2, 0]
size = [-1, x2_shape[1], x2_shape[2], -1]
x1_crop = tf.slice(x1, offsets, size)
return x1_crop + x2
All addition in model I replaced by above layer (so merging encoder and decoder data).
Also, the input to model need to be defined as:
image = tf.placeholder(tf.float32, shape=[1, None, None, 3], name="input_image")
So we know that we will pass single image and that image have 3 channels. but we do not know neither width nor height. And it works very nice! (40 FPS on K80 as AWS P2, size of image is 224x{}-shoter side of image have 224)
FYI, I was also trying to run ENET (2x faster than LinkNet), but in TensorFlow it is slower. I think it is because of PReLu (which is slow at TF). Also it does not support arbitraty size of image becauese of UnPool layer, which need to have predefined output size by list of integers (not placeholders). So LinkNet look better in case of Speed and Performacance in TF.

Related

Filter size and stride when upsampling image using Conv2D Transpose

I am using Conv2D Transpose to upsample image by factors 18, 9, 6, 3.
My images are of sizes (1,1), (2,2), (3,3), (6,6). The goal is to upsample them to size (18,18).
The problem I am having is when choosing the correct filter size, stride and padding to achieve this. I have read articles about checkboard patterns that might arise when using improper sizes, but I still have not found any solution about which sizes to choose.
For the image (1,1) -> (18,18), I have chosen the filter size of (18,18) with no stride and padding. This makes sense to me as this one pixel is solely responsible for the look of the entire upsampled image.
But the other three are giving me problems.
One solution that I have thought of is that for (2,2) -> (18,18), I use the filter size (9,9) with stride (9,9). This would result that each pixel (2,2) provides 9,9 upsampled pixels.
Is this a proper way or would you recommend something else.
Have a look at the Keras docs. You can find the formula to calculate the output shape there:
new_rows = ((rows - 1) * strides[0] + kernel_size[0] - 2 * padding[0] + output_padding[0])
new_cols = ((cols - 1) * strides[1] + kernel_size[1] - 2 * padding[1] + output_padding[1])

Is it possible to train YOLO (any version) for a single class where the image has text data. (find region of equations)

I am wondering if YOLO (any version, specially the one with accuracy, not speed) can be trained on the text data. What I am trying to do is to find the Region in the text image where any equation is present.
For example, I want to find the 2 of the Gray regions of interest in this image so that I can outline and eventually, crop the equations separately.
I am asking this questions because :
First of all I have not found a place where the YOLO is used for text data.
Secondly, how can we customise for low resolution unlike the (416,416) as all the images are either cropped or horizontal mostly in (W=2H) format.
I have implemented the YOLO-V3 version for text data but using OpenCv which is basically for CPU. I want to train the model from scratch.
Please help. Any of the Keras, Tensorflow or PyTorch would do.
Here is the code I used for implementing in OpenCv.
net = cv2.dnn.readNet(PATH+"yolov3.weights", PATH+"yolov3.cfg") # build the model. NOTE: This will only use CPU
layer_names = net.getLayerNames() # get all the layer names from the network 254 layers in the network
output_layers = [layer_names[i[0] - 1] for i in net.getUnconnectedOutLayers()] # output layer is the
# 3 output layers in otal
blob = cv2.dnn.blobFromImage(image=img, scalefactor=0.00392, size=(416,416), mean=(0, 0, 0), swapRB=True,)
# output as numpy array of (1,3,416,416). If you need to change the shape, change it in the config file too
# swap BGR to RGB, scale it to a threshold, resize, subtract it from the mean of 0 for all the RGB values
net.setInput(blob)
outs = net.forward(output_layers) # list of 3 elements for each channel
class_ids = [] # id of classes
confidences = [] # to store all the confidence score of objects present in bounding boxes if 0, no object is present
boxes = [] # to store all the boxes
for out in outs: # get all channels one by one
for detection in out: # get detection one by one
scores = detection[5:] # prob of 80 elements if the object(s) is/are inside the box and if yes, with what prob
class_id = np.argmax(scores) # Which class is dominating inside the list
confidence = scores[class_id]
if confidence > 0.1: # consider only those boxes which have a prob of having an object > 0.55
# grid coordinates
center_x = int(detection[0] * width) # centre X of grid
center_y = int(detection[1] * height) # Center Y of grid
w = int(detection[2] * width) # width
h = int(detection[3] * height) # height
# Rectangle coordinates
x = int(center_x - w / 2)
y = int(center_y - h / 2)
boxes.append([x, y, w, h]) # get all the bounding boxes
confidences.append(float(confidence)) # get all the confidence score
class_ids.append(class_id) # get all the clas ids
Being an object detector Yolo can be used for specific text detection only, not for detecting any text that might be present in the image.
For example Yolo can be trained to do text based logo detection like this:
I want to find the 2 of the Gray regions of interest in this image so
that I can outline and eventually, crop the equations separately.
Your problem statement talks about detecting any equation (math formula) that's present in the image so it can't be done using Yolo alone. I think mathpix is similar to your use-case. They will be using OCR (Optical Character Recognition) system trained and fine tuned towards their use-case.
Eventually to do something like mathpix, OCR system customised for your use case is what you need. There won't be any ready ready made solution out there for this. You'll have to build one.
Proposed Methods:
Mathematical Formula Detection in Heterogeneous Document Images
A Simple Equation Region Detector for Printed Document Images in Tesseract
Note: Tesseract as it is can't be used because it is a pre-trained model trained for reading any character. You can refer 2nd paper to train tesseract towards fitting your use case.
To get some idea about OCR, you can read about it here.
EDIT:
So idea is to build your own OCR to detect something that constitutes equation/math formula rather than detecting every character. You need to have data set where equations are marked. Basically you look for region with math symbols(say summation, integration etc.).
Some Tutorials to train your own OCR:
Tesseract training guide
Creating OCR pipeline using CV and DL
Build OCR pipeline
Build Your OCR
Attention OCR
So idea is that you follow these tutorials to get to know how to train
and build your OCR for any use case and then you read research papers
I mentioned above and also some of the basic ideas I gave above to
build OCR towards your use case.

When forward using MXNet, how to do with varying 'batch size' in data_shapes?

Hi,I have a question that, how can I make predict with unfixed input data? I will try to describe in detail clear:
I use MTCNN for face detection(it's ok unfamiliar with that), and it employs 3 networks: PNet, RNet, ONet. PNet detects a mass of proposal face bounding boxes, then these boxes are coarse-to-fine by the rest net one after another, finally get precise face bbox(s). When taking an image as input to PNet, image's size is unfixed, and the output proposal box number from PNet is also unfixed, so as RNet, ONet. Reference to another MTCNN code I set a large data_shapes(e.g., image size, batch size) when I bind the module, and initialize all to zero,then make predict. That works though, Isn't that a redundant calculation? (Question 1)
PNet:
max_img_w=1000
max_img_h=1000
sym, arg_params, aux_params = mx.model.load_checkpoint(‘det1’, 0)
self.PNets = mx.mod.Module(symbol=sym, context=ctx,label_names=None)
self.PNets.bind(data_shapes=[(‘data’, (1, 3, max_img_w, max_img_h))],for_training=False)
self.PNets.set_params(arg_params,aux_params)
RNet
sym, arg_params, aux_params = mx.model.load_checkpoint(‘det2’, 0)
self.RNet = mx.mod.Module(symbol=sym, context=ctx,label_names=None)
self.RNet.bind(data_shapes=[(‘data’, (2048,3, 24, 24))],for_training=False)
self.RNet.set_params(arg_params,aux_params,allow_missing=True)
ONet
sym, arg_params, aux_params = mx.model.load_checkpoint(‘det3’, 0)
self.ONet = mx.mod.Module(symbol=sym, context=ctx,label_names=None)
self.ONet.bind(data_shapes=[(‘data’, (256, 3, 48, 48))],for_training=False)
self.ONet.set_params(arg_params,aux_params,allow_missing=True)
And I try mx.mod.Module.reshape before predict, which will adjust data'shape according to last network's output, but I get this error:(Question 2)
AssertionError: Shape of unspecified array arg:prob1_label changed. This can cause the new executor to not share parameters with the old one. Please check for error in the network. If this is intended, set partial_shaping=True to suppress this warning.
One more thing is that The MTCNN code (https://github.com/pangyupo/mxnet_mtcnn_face_detection) primary use deprecated function to load models:
self.PNet = mx.model.FeedForward.load(‘det1’,0)
One single line to work with arbitrary data_shapes, why this function be deprecated..?(Question 3)
I found a little difference that after load model, FeedFroward takes 0MB memory before make one predict, but mx.mod.Module takes up memory once loaded, and increase obviously after making one prediction.
You can use MXNet imperative API Gluon and that will let you use different batch-sizes.
If like in this case, your model was trained using the symbolic API or has been exported in the serialized MXNet format ('-0001.params', '-symbol.json' for e.g), you can load it in Gluon that way:
ctx = mx.cpu()
sym = mx.sym.load_json(open('det1-symbol.json', 'r').read())
PNet = gluon.nn.SymbolBlock(outputs=sym, inputs=mx.sym.var('data'))
PNet.load_params('det1-0001.params', ctx=ctx)
Then you can use it the following way:
# a given batch size (1)
data1 = mx.nd.ones((1, C, W, H))
output1 = PNet(data1)
# a different batch size (5)
data2 = mx.nd.ones((5, C, W, H))
output2 = PNet(data2)
And it would work.
You can get started with MXNet Gluon with the official 60 minutes crash course

Keras video frame prediction with lower output dimension then input dimension

I want to train a Keras DNN for video frame prediction:
Input: 4 consecutive Frames of a Video
Output: Next frame, predicted from the network
So, basically, the dimensions are: input: (number_samples, 4, 60, 60), output: (number_samples, 1, 60, 60). I need some help in getting from the 4 frames in the input down to 1 frame in the output.
I've found an example here and would like to work with that.
Problem is, in that network, the output is not one frame, but the same number of frames as the input. (so my task is actually simpler, because I want to generate only one next frame, not 4). Now I don't know, which layers I could append at the end of the network or how I could change the network, so the output dimensions are as desired. (one frame instead of 4).
Appending a Conv2D Layer at the end did not work, because it does not match with the dimensions of the Conv3D.
Any Ideas on how to go about that problem and how my network architecture could look like? Any advice on my task in general and how I could build a good network for it is also appreciated.
This loop in the code example (for which you gave the URL) can be tailored to do what you desire.
for j in range(16):
new_pos = seq.predict(track[np.newaxis, ::, ::, ::, ::])
new = new_pos[::, -1, ::, ::, ::]
track = np.concatenate((track, new), axis=0)

Darknet YOLO image size

I am trying to train custom object classifier in Darknet YOLO v2
https://pjreddie.com/darknet/yolo/
I gathered a dataset for images most of them are 6000 x 4000 px and some lower resolutions as well.
Do I need to resize the images before training to be squared ?
I found that the config uses:
[net]
batch=64
subdivisions=8
height=416
width=416
channels=3
momentum=0.9
decay=0.0005
angle=0
saturation = 1.5
exposure = 1.5
hue=.1
thats why I was wondering how to use it for different sizes of data sets.
You don't have to resize it, because Darknet will do it instead of you!
It means you really don't need to do that and you can use different image sizes during your training. What you posted above is just network configuration. There should be full network definition as well. And the height and the width tell you what's the network resolution. And it also keeps aspect ratio, check e.g this.
You don't need to resize your database images. PJReddie's YOLO architecture does it by itself keeping the aspect ratio safe (no information will miss) according to the resolution in .cfg file.
For Example, if you have image size 1248 x 936, YOLO will resize it to 416 x 312 and then pad the extra space with black bars to fit into 416 x 416 network.
It is very common to resize images before training. 416x416 is slightly larger than common. Most imagenet models resize and square the images to 256x256 for example. So I would expect the same here. Trying to train on 6000x4000 is going to require a farm of GPUs. The standard process is to square the image to the largest dimension (height, or width), padding with 0's on the shorter side, then resizing using standard image resizing tools like PIL.
You do not need to resize the images, you can directly change the values in darknet.cfg file.
When you open darknet.cfg (yolo-darknet.cfg) file, you can all
hyper-parameters and their values.
As showed in your cfg file images dimensions are (416,416)->(weight,height), you can change the values, so that darknet will automatically resize the images before training.
Since the images have high dimensions, you can adjust batch and sub-division values (lower the values 32,16,8 . it has to be multiples of 2), so that darknet will not crash (memory allocation error)
By default the darknet api changes the size of the images in both inference and training, but in theory any input size w, h = 32 x X where X belongs to a natural number should, W is the width, H the height. By default X = 13, so the input size is w, h = (416, 416). I use this rule with yolov3 in opencv, and it works better the bigger X is.