How can I achieve better than 80% on the test set - tensorflow

My goal is to detect digits from 0 to 9 on a random background. I wrote a dataset generator with the following features:
Grayscale data
Random digit rotation
Random digit blur
43 different fonts
Random noisy blurred background
Here are 1024 samples of my dataset:
[Image: 1024 testset samples]
I adapted the MNIST expert model to train on this dataset and get almost 100% accuracy on the training and validation sets.
On the test set I get approximately 80% correct.
Here is a sample. The green digit is the digit predicted:
[Image: 9 predicted as 5]
It seems that my model has trouble distinguishing between:
1 and 7
8 and 3
9 and 6
5 and 9
I need to detect the digit on any background because the test images are not always binary images.
Now my questions:
For the testset generator:
How useful is applying digit rotation? When I rotate a 7 then I get a 1 for some fonts. When I rotate a 9 I get a 6 (rotation > 90°)
Does the convolution filter already handle image rotation?
Are 180'000 image samples enough to train the model?
For the model:
Should I increase the image size from 28x28 to 56x56 when I apply a blur filter onto the dataset?
What filter size should I use?
Do I have to increase the number of hidden layers?
Thanks a lot for any guidance.

If you are stuck on the varying image backgrounds, I suggest you try image thresholding, which gives every image the same uniform background and foreground, assuming your images are of good quality.
Try this (scikit-image library):
import numpy as np
from skimage import filters as flt
filtered_image = np.array(original_image > flt.threshold_li(original_image))
Then you can use the filtered images for both training and prediction.

I ended up extracting the dataset patches out of existing images instead of using a random background with random digits. This gives us less variance and a much better accuracy on the test set.
Here is a working but not so performant implementation which allows us to define shape and stride size:
def patchify(self, arr, shape, stride):
    """Cut a 2-D array into patches; returns (patch, origin) tuples."""
    patches = []
    arr_shape = arr.shape
    (shape_h, shape_w) = shape
    (stride_h, stride_w) = stride
    # Number of patch origins along each axis (requires: import numpy as np).
    num_patches = np.floor(np.array(arr_shape) / np.array(stride))
    (num_patches_row, num_patches_col) = (int(num_patches[0]), int(num_patches[1]))
    for row in range(num_patches_row):
        row_from = row * stride_h
        row_to = row_from + shape_h
        for col in range(num_patches_col):
            col_from = col * stride_w
            col_to = col_from + shape_w
            origin_information = (row_from, row_to, col_from, col_to)
            # Slicing clips at the array border, so edge patches can be smaller than `shape`.
            roi = arr[row_from:row_to, col_from:col_to]
            patches.append((roi, origin_information))
    return patches
Alternatively, use scikit-learn, where image is a NumPy array:
from sklearn.feature_extraction.image import extract_patches_2d
patches = extract_patches_2d(image, (patch_height, patch_width))
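Since the loop-based patchify above is described as not very performant, here is a minimal vectorised sketch of the same idea. It assumes NumPy 1.20 or newer (for `sliding_window_view`) and a 2-D input; the name `patchify_strided` is made up for this example. Unlike the loop version, it skips windows that would run past the array border, so every patch has the full requested shape.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def patchify_strided(arr, shape, stride):
    """Return all full-size patches of a 2-D array plus their row/column origins."""
    shape_h, shape_w = shape
    stride_h, stride_w = stride
    # All windows at stride 1, as a view: (H - h + 1, W - w + 1, h, w).
    windows = sliding_window_view(arr, (shape_h, shape_w))
    # Keep every stride-th window along each axis (still a view, no copy).
    patches = windows[::stride_h, ::stride_w]
    rows = np.arange(patches.shape[0]) * stride_h
    cols = np.arange(patches.shape[1]) * stride_w
    return patches, rows, cols


image = np.arange(36).reshape(6, 6)
patches, rows, cols = patchify_strided(image, shape=(4, 4), stride=(2, 2))
print(patches.shape)  # (2, 2, 4, 4)
```

Here `patches[i, j]` is the patch whose top-left corner is at `(rows[i], cols[j])`. The result is a read-only view, so copy it before modifying patches in place.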


How to batch an object detection dataset?

I am working on implementing a face detection model on the wider face dataset. I learned it was built into Tensorflow datasets and I am using it.
However, I am facing an issue while batching the data. Since an image can have multiple faces, the number of bounding boxes differs per image. For example, an image with 2 faces has 2 bounding boxes, whereas one with 4 faces has 4, and so on.
The problem is that this unequal number of bounding boxes gives the tensors of each Dataset element different shapes, and as far as I know TensorFlow cannot batch tensors of unequal shapes (source: Tensorflow Datasets: Make batches with different shaped data). So I am unable to batch the dataset.
So after loading and batching with the following code:
ds,info = tfds.load('wider_face', split='train', shuffle_files=True, with_info= True)
ds1 = ds.batch(12)
for step, (x,y,z) in enumerate(ds1) :
I get this kind of error when I run it: Link to Error Image
In general, any help on how to batch TensorFlow object detection datasets would be very helpful.
It might be a bit late, but I thought I should post this anyway. The padded_batch feature ought to do the trick here. It works around the issue by padding with zeros until the dimensions within a batch match.
ds,info = tfds.load('wider_face', split='train', shuffle_files=True, with_info= True)
ds1 = ds.padded_batch(12)
for step, (x,y,z) in enumerate(ds1) :
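To make the effect of zero padding concrete, here is a small NumPy sketch of what happens to variable-length box lists when they are padded to a common shape. It is only an illustration of the idea, not how tf.data implements padded_batch; `pad_boxes` and the `mask` it returns are hypothetical helpers, and the box values are random stand-ins.

```python
import numpy as np


def pad_boxes(box_lists, pad_value=0.0):
    """Stack per-image (n_i, 4) box arrays into one (batch, max_n, 4) array."""
    max_boxes = max(len(b) for b in box_lists)
    batch = np.full((len(box_lists), max_boxes, 4), pad_value, dtype=np.float32)
    mask = np.zeros((len(box_lists), max_boxes), dtype=bool)
    for i, boxes in enumerate(box_lists):
        n = len(boxes)
        batch[i, :n] = boxes
        mask[i, :n] = True  # True marks real boxes, False marks padding
    return batch, mask


# Two hypothetical images: one with 2 faces, one with 4.
boxes_a = np.random.rand(2, 4).astype(np.float32)
boxes_b = np.random.rand(4, 4).astype(np.float32)
batch, mask = pad_boxes([boxes_a, boxes_b])
print(batch.shape)       # (2, 4, 4)
print(mask.sum(axis=1))  # [2 4]
```

The padded rows are all zeros, so the loss or the matching step downstream needs some way to ignore them. A mask like the one above is one option.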
Another solution would be to skip batch entirely and fill custom buffers with for loops, but that kind of defeats the purpose. Just for posterity, I'll add sample code here as an example of a simple workaround.
ds,info = tfds.load('wider_face', split='train', shuffle_files=True, with_info= True)
batch_size = 12
image_annotations_pair = [(x['image'], x['faces']['bbox']) for x in ds.take(batch_size)]
Then use a train_step modified for this.

Yolo Training: multiple objects in one image

I have a set of training images that contain many small objects (10-20). The image resolution is high (9000x6000).
Is it better to split the image into the specific objects before running yolo training? Or just leave it as is.
Does yolo resize an entire image, or does it ‘extract’ the annotated object first before resizing?
If it is the former, I am concerned that the resolution will be bad. Imagine 20 objects in a 416x416 image.
Does yolo resize an entire image, or does it ‘extract’ the annotated
object first before resizing?
Yes, YOLO resizes the entire image; it does not extract the annotated objects before resizing.
Since your input images have very high resolution, what you can do is:
YOLO can handle object sizes of 25 x 25 effectively with a network input layer size of 608 x 608. So if the objects in your original input images are larger than 250 x 250, you can train on the images as they are (with a 608 x 608 network size). In that case, even after the images are resized to the network size, the objects will be larger than 25 x 25. This should give you good accuracy.
(6000/600) * 25 = 250, rounding the 608-pixel network size down to 600
If the objects in the original images are smaller than 200 x 200, split your input image into 8 smaller blocks/tiles, say tiles of 2250 x 3000 (a 4 x 2 grid). Train on these tiles as individual images, so each original image (9000 x 6000) corresponds to 8 training images. Each tile might contain zero to many objects. You can process the image with a sliding window.
The method you choose for training should be used for inference as well.
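The size arithmetic above can be sketched in a few lines of Python. This assumes the whole image is stretched to the square network size without letterboxing, so each axis shrinks by its own factor. The function names and the 2250 x 3000 tile size (chosen so that a 9000 x 6000 image yields 8 tiles) are illustrative assumptions, not darknet settings.

```python
def min_object_size(image_size, network_size=608, min_network_pixels=25):
    """Per-axis object size (in original pixels) that maps to
    `min_network_pixels` after the whole image is resized."""
    return tuple(side / network_size * min_network_pixels for side in image_size)


def tile_boxes(image_size, tile_size):
    """Non-overlapping tiles as (left, top, right, bottom), clipped to the image."""
    image_w, image_h = image_size
    tile_w, tile_h = tile_size
    return [
        (left, top, min(left + tile_w, image_w), min(top + tile_h, image_h))
        for top in range(0, image_h, tile_h)
        for left in range(0, image_w, tile_w)
    ]


min_w, min_h = min_object_size((9000, 6000))
print(round(min_w), round(min_h))  # 370 247

tiles = tile_boxes((9000, 6000), (2250, 3000))
print(len(tiles))  # 8
```

Under this assumption the 250-pixel figure matches the 6000-pixel side. Along the 9000-pixel side the shrink factor is larger, so objects would need to be wider (roughly 370 pixels) to keep 25 pixels after resizing. Bounding-box annotations have to be shifted by each tile's `(left, top)` offset when the tiles are used for training.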
To train on objects of all sizes, use the following models: [use this if you train on the original images as they are]
If all of the objects you want to detect are small, then for effective detection use YOLOv4 with the following changes: [use this if you split the original image into 8 blocks]
Set layers = 23 instead of layers = 54
Set stride=4 instead of stride=2
Set stride=4 instead of stride=2
Refer to this relevant GitHub thread
darknet documentation

Is it possible to train YOLO (any version) for a single class where the image has text data. (find region of equations)

I am wondering if YOLO (any version, especially one tuned for accuracy rather than speed) can be trained on text data. What I am trying to do is find the regions in a text image where an equation is present.
For example, I want to find the 2 gray regions of interest in this image so that I can outline and, eventually, crop the equations separately.
I am asking this question because:
First of all, I have not found any place where YOLO is used for text data.
Secondly, how can we customise the input size for low resolutions other than (416,416), as all the images are either cropped or horizontal, mostly in a (W=2H) format?
I have run YOLO-V3 on text data, but using OpenCV, which basically runs on the CPU. I want to train the model from scratch.
Please help. Any of Keras, TensorFlow or PyTorch would do.
Here is the code I used with OpenCV.
import cv2
import numpy as np

net = cv2.dnn.readNet(PATH + "yolov3.weights", PATH + "yolov3.cfg")  # build the model. NOTE: this will only use the CPU
layer_names = net.getLayerNames()  # names of all 254 layers in the network
# Names of the 3 output layers. flatten() handles both the older (N, 1) and the newer (N,) return shape.
output_layers = [layer_names[i - 1] for i in np.array(net.getUnconnectedOutLayers()).flatten()]
height, width = img.shape[:2]  # original image size, used to rescale the boxes
# Swap BGR to RGB, scale pixel values by 1/255 (0.00392), resize to 416x416, subtract a mean of 0 per channel.
# The blob is a numpy array of shape (1, 3, 416, 416). If you need to change the shape, change it in the config file too.
blob = cv2.dnn.blobFromImage(image=img, scalefactor=0.00392, size=(416, 416), mean=(0, 0, 0), swapRB=True)
net.setInput(blob)  # feed the blob to the network before the forward pass
outs = net.forward(output_layers)  # list of 3 arrays, one per output layer
class_ids = []  # class ids
confidences = []  # confidence score of each kept box
boxes = []  # bounding boxes
for out in outs:  # go through the output layers one by one
    for detection in out:  # go through the detections one by one
        scores = detection[5:]  # per-class scores (80 classes for the COCO weights)
        class_id = np.argmax(scores)  # class with the highest score
        confidence = scores[class_id]
        if confidence > 0.1:  # keep only boxes whose best class score exceeds 0.1
            # Box centre and size, scaled from relative to pixel coordinates
            center_x = int(detection[0] * width)
            center_y = int(detection[1] * height)
            w = int(detection[2] * width)
            h = int(detection[3] * height)
            # Top-left corner of the rectangle
            x = int(center_x - w / 2)
            y = int(center_y - h / 2)
            boxes.append([x, y, w, h])
            confidences.append(float(confidence))
            class_ids.append(class_id)
Being an object detector, YOLO can only be used for detecting specific text, not for detecting any text that might be present in the image.
For example, YOLO can be trained to do text-based logo detection like this:
I want to find the 2 of the Gray regions of interest in this image so
that I can outline and eventually, crop the equations separately.
Your problem statement is about detecting any equation (math formula) present in the image, so it can't be done using YOLO alone. I think Mathpix is similar to your use case. They presumably use an OCR (Optical Character Recognition) system trained and fine-tuned for their use case.
Eventually, to do something like Mathpix, what you need is an OCR system customised for your use case. There won't be any ready-made solution out there for this. You'll have to build one.
Proposed Methods:
Mathematical Formula Detection in Heterogeneous Document Images
A Simple Equation Region Detector for Printed Document Images in Tesseract
Note: Tesseract can't be used as is, because it is a pre-trained model trained to read any character. You can refer to the second paper to train Tesseract for your use case.
To get some idea about OCR, you can read about it here.
So the idea is to build your own OCR to detect whatever constitutes an equation/math formula rather than detecting every character. You need a dataset in which the equations are marked. Basically, you look for regions with math symbols (say summation, integration, etc.).
Some Tutorials to train your own OCR:
Tesseract training guide
Creating OCR pipeline using CV and DL
Build OCR pipeline
Build Your OCR
Attention OCR
So the idea is that you follow these tutorials to learn how to train
and build your OCR for any use case, and then you read the research papers
I mentioned above, along with the basic ideas I gave above, to
build an OCR for your use case.

RGB to gray filter doesn't preserve the shape

I have 209 cat/non-cat images and I am looking to augment my dataset. To do so, I am using the following code to convert each NumPy array of RGB values to greyscale. The problem is that I need the dimensions to stay the same for my neural network to work, but they come out different. The code:
def rgb2gray(rgb):
    return np.dot(rgb[..., :3], [0.2989, 0.5870, 0.1140])
Normal Image Dimension: (64, 64, 3)
After Applying the Filter: (64, 64)
I know that the missing 3 is probably the RGB channels or something, but I cannot find a way to add a "dummy" third dimension that would not affect the actual image. Can someone provide an alternative to the rgb2gray function that maintains the dimensions?
The whole point of applying that greyscale filter is to reduce the number of channels from 3 (i.e. R,G and B) down to 1 (i.e. grey).
If you really, really want to get a 3-channel image that looks just the same but takes 3x as much memory, just make all 3 channels equal:
grey = np.dstack((grey, grey, grey))
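Alternatively, build the three identical channels inside rgb2gray:
def rgb2gray(rgb):
    # Each output channel is the same weighted sum of R, G and B, so the shape stays (64, 64, 3).
    weights = np.array([[0.2989, 0.5870, 0.1140]] * 3)
    return np.dot(rgb[..., :3], weights.T)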
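If the network only needs a third axis rather than three identical channels, a length-1 "dummy" channel is cheaper. The sketch below shows both options with NumPy; `rgb2gray_keep_dims` and its `channels` parameter are names made up for this example, and the random array stands in for a real image.

```python
import numpy as np

WEIGHTS = np.array([0.2989, 0.5870, 0.1140])


def rgb2gray_keep_dims(rgb, channels=1):
    """Greyscale conversion that keeps a trailing channel axis."""
    grey = np.dot(rgb[..., :3], WEIGHTS)       # (H, W)
    grey = grey[..., np.newaxis]               # (H, W, 1)
    return np.repeat(grey, channels, axis=-1)  # (H, W, channels)


image = np.random.rand(64, 64, 3)
print(rgb2gray_keep_dims(image).shape)              # (64, 64, 1)
print(rgb2gray_keep_dims(image, channels=3).shape)  # (64, 64, 3)
```

With `channels=1` the first layer of the network has to accept one input channel. With `channels=3` the array matches the original (64, 64, 3) shape at three times the memory.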

Dynamic Tensor Alignment/Cropping

I implemented a fully convolutional network in TensorFlow. It uses an encoder-decoder structure.
When training, I always use the same image size (224x224, using random crop) and everything works nicely.
In the inference phase, I want to predict one image at a time, because I want to use the full image (not cropped). For example, such an image has size [406,256]. And here is the problem.
In the encoder-decoder architecture I add two tensors (z = x + y). When training, the sizes of both tensors match. When predicting on my single image, the sizes do not match (tensor sizes: [1,47,47,64] vs [1,46,46,64]). I think it is caused by rounding in the Conv and Pool layers.
What should I change in my architecture to make it work for any image size I want? Should I change the rounding parameters? Or add 'cropping' of the tensor?
Link to implementation of architecture:
(the problem occurs at line 166)
I found the solution for variable input size :)
What we really needed was a 'crop layer' that crops one tensor to match the other. I found a really similar layer here:
I just turned it into `crop_and_add` and it is working:
import tensorflow as tf

def crop_and_add(x1, x2):
    # Centre-crop x1 to the spatial size of x2 (NHWC), then add. Assumes x1 is at least as large as x2.
    x1_shape = tf.shape(x1)
    x2_shape = tf.shape(x2)
    # offsets for the top left corner of the crop
    offsets = [0, (x1_shape[1] - x2_shape[1]) // 2, (x1_shape[2] - x2_shape[2]) // 2, 0]
    size = [-1, x2_shape[1], x2_shape[2], -1]
    x1_crop = tf.slice(x1, offsets, size)
    return x1_crop + x2
I replaced every addition in the model with the layer above (that is, wherever encoder and decoder data are merged).
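To check the offset arithmetic on the shapes from the question ([1,47,47,64] vs [1,46,46,64]), here is a NumPy analogue of the same centre crop. It is only a sketch for experimenting outside a TensorFlow graph; `crop_and_add_np` is a name made up for this example, and it assumes NHWC layout with x1 at least as large as x2.

```python
import numpy as np


def crop_and_add_np(x1, x2):
    """Centre-crop x1 (NHWC) to the spatial size of x2, then add."""
    off_h = (x1.shape[1] - x2.shape[1]) // 2
    off_w = (x1.shape[2] - x2.shape[2]) // 2
    x1_crop = x1[:, off_h:off_h + x2.shape[1], off_w:off_w + x2.shape[2], :]
    return x1_crop + x2


encoder_out = np.ones((1, 47, 47, 64), dtype=np.float32)
decoder_out = np.ones((1, 46, 46, 64), dtype=np.float32)
print(crop_and_add_np(encoder_out, decoder_out).shape)  # (1, 46, 46, 64)
```

With a one-pixel difference the offset is 0, so the crop drops the last row and column. Larger differences are split evenly between both sides.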
Also, the input to the model needs to be defined as:
image = tf.placeholder(tf.float32, shape=[1, None, None, 3], name="input_image")
So we know that we will pass a single image and that the image has 3 channels, but we know neither its width nor its height. And it works very nicely! (40 FPS on a K80 on an AWS P2 instance; the image size is 224x{}, i.e. the shorter side of the image is 224.)
FYI, I also tried to run ENet (2x faster than LinkNet), but in TensorFlow it is slower. I think this is because of PReLU (which is slow in TF). It also does not support arbitrary image sizes because of the UnPool layer, which needs its output size predefined as a list of integers (not placeholders). So LinkNet looks better in terms of speed and performance in TF.