I wrote code in TensorFlow using a convolutional neural network to detect text in images. I used a TFRecords file to read the Street View Text dataset, then resized the images to 128×128 (height and width).
I used 9 conv layers with zero padding and three max_pool layers with a window size of (2×2) and a stride of 2. Since I use just three pooling layers, the last layer's spatial shape is (16×16). The last conv layer has 256 filters.
I also used two fully connected regression layers (tf.nn.sigmoid) and tf.losses.mean_squared_error as the loss function.
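For reference, here is a minimal sketch of the setup described above (the filter counts before the final 256, the hidden width of 512, and the 4-value regression output are assumptions; images and targets are assumed to come from the TFRecords pipeline):

import tensorflow as tf

def conv_block(x, filters, n_convs=3):
    for _ in range(n_convs):  # zero-padded ('same') convolutions
        x = tf.layers.conv2d(x, filters, 3, padding='same', activation=tf.nn.relu)
    return tf.layers.max_pooling2d(x, pool_size=2, strides=2)  # 2x2 window, stride 2

h = conv_block(images, 64)     # 128 -> 64
h = conv_block(h, 128)         # 64 -> 32
h = conv_block(h, 256)         # 32 -> 16; last conv layer has 256 filters
h = tf.layers.flatten(h)
h = tf.layers.dense(h, 512, activation=tf.nn.sigmoid)    # first regression FC layer
preds = tf.layers.dense(h, 4, activation=tf.nn.sigmoid)  # e.g. one normalized box (assumed)
loss = tf.losses.mean_squared_error(targets, preds)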
My question is:
Is this architecture enough for the detection process? I know there is something called NMS for detection. Also, what is the label in this case?
In general, and this is not a rule, just based on my experience: you should start with a smaller net of 2 or 3 conv layers and see what happens. If you get some good results, focus more on the winning topology and adapt the hyperparameters (learning rate, batch size, and so on). If you don't get good results at all, go deeper, meaning add conv layers, and evaluate again. Nine conv layers is really big; your problem's complexity should be comparably big too! Otherwise you will reach good accuracy but waste a lot of compute power and time for nothing. And by the way, use a pyramid form, meaning start wide and finish tiny.
The VGG16 architecture takes 224x224x3 images as input. I want to use 48x48x3 inputs, but to do this in Keras, we remove the last FC layers, which have 4096 neurons each. Why do we have to do this? And is it necessary to add FC layers of another size for this input?
The final pooling layer of VGG16 has dimension 7x7x512 for a 224x224 input image. From there, VGG16 uses a fully connected layer of (7x7x512)x4096 to get a 4096-dimensional output. However, since your input size is different, your feature output from the final pooling layer will also have a different dimension (1x1x512 for a 48x48 input, since each of the five pooling layers halves the spatial size). So you need to change the matrix dimensions of the fully connected layer to make it work. You have two other options, though:
Use global average pooling across the spatial dimensions to get a 512-dimensional feature, and then use a few fully connected layers to get to your number of classes (see the sketch after these options).
Resize your input images to 224x224x3 and you won't need to change anything in the model architecture.
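A minimal sketch of the global-average-pooling option using tf.keras (the ImageNet weights, the 256-unit hidden layer, and the 10-class output are assumptions):

import tensorflow as tf

base = tf.keras.applications.VGG16(include_top=False,        # drop the original FC head
                                   weights='imagenet',
                                   input_shape=(48, 48, 3))
x = tf.keras.layers.GlobalAveragePooling2D()(base.output)    # -> 512-dimensional feature
x = tf.keras.layers.Dense(256, activation='relu')(x)         # hidden width is an assumption
outputs = tf.keras.layers.Dense(10, activation='softmax')(x) # your number of classes
model = tf.keras.Model(inputs=base.input, outputs=outputs)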
Removing the last FC layers is for fine-tuning or transfer learning, where you adapt an existing network to a new problem, such as changing the number of categories that your classifier can choose between.
You are adapting the network to take a different sized input, so you need to adjust the first layer(s) of the network.
Via tf.summary I visualized the first convolutional layer in my model, of shape [5,5,3,32], as a set of individual images, one per filter. So this layer has filters of 5x5 dimensions, of depth 3, and there are 32 of them. I'm viewing these filters as 5x5 color (RGB) images.
I'm wondering how to generalize this to the second convolutional layer, and the third, and so on.
The shape of the second convolutional layer is [5,5,32,64].
My question is: how would I transform that tensor into individual 5x5x3 images?
With the first conv layer of shape [5,5,3,32], I visualize it by first transposing with tf.transpose(W_conv1,(3,0,1,2)), which gives 32 5x5x3 images.
Doing a tf.transpose(W_conv2,(3,0,1,2)) would produce a shape of [64,5,5,32]. How would I then use those "32 color channels"? (I know it's not that simple :) ).
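For reference, a minimal sketch of the first-layer visualization described above (the summary name and the rescaling to [0, 1] are my own additions):

filters = tf.transpose(W_conv1, (3, 0, 1, 2))    # [5,5,3,32] -> [32,5,5,3]
f_min = tf.reduce_min(filters)
f_max = tf.reduce_max(filters)
filters = (filters - f_min) / (f_max - f_min)    # rescale to [0, 1] for display
tf.summary.image('conv1_filters', filters, max_outputs=32)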
Visualization of higher level filters is usually done indirectly. To visualize a particular filter, you look for images that the filter would respond the most to. For that you perform gradient ascent in the space of images (instead of changing the parameters of the network like when you train the network, you change the input image).
You will understand it easier if you play with the following Keras code: https://github.com/keras-team/keras/blob/master/examples/conv_filter_visualization.py
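A rough TF1-style sketch of that idea, assuming images is your input placeholder, conv_out is the activation tensor of the layer you want to inspect, numpy is imported as np, and sess is a session with the trained weights already restored:

filter_index = 0                                           # which filter to visualize
loss = tf.reduce_mean(conv_out[:, :, :, filter_index])     # mean activation of that filter
grad = tf.gradients(loss, images)[0]
grad /= tf.sqrt(tf.reduce_mean(tf.square(grad))) + 1e-8    # normalize the step size

img = np.random.uniform(0.45, 0.55, size=(1, 128, 128, 3)).astype(np.float32)  # size assumed
for _ in range(100):                                       # gradient ascent on the image
    step = sess.run(grad, feed_dict={images: img})
    img += step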
How to select the following: Size of filter for convolution, strides, pooling, and densely connected layer
There is no single answer to this question. This Reddit thread and this answer have some nice discussion. To quote the second post in the Reddit thread, "Start simple."
CelebA has a similar, maybe exactly the same, image size. When I was working with CelebA on a DCGAN project, I gently cropped and then reshaped the images to 64 x 64 x 3. My discriminator was a convolutional neural network that used 4 convolutional layers and one fully connected layer. All conv layers had a 5 x 5 window size and a stride size of 2 x 2, with SAME padding and no pooling. The output channels per layer were 128 -> 256 -> 512 -> 1024, so the last conv layer output a 4 x 4 x 1024 tensor. My dense layer then had a weight size of classes x 1024. (I had 1 class, since its purpose was to determine whether the input image was from the dataset or made by the generator.)
That relatively simple architecture had good results, but it was intentionally built to not overpower the generator. If you're looking for pure classification, you may want a deeper architecture. You might not want to crop as aggressively as I did. Then you can include more conv layers before the fully connected layer. You may want to use a 3 x 3 window size with a stride size of 1 x 1 and use pooling - although I see architectures abandoning pooling in favor of larger stride size. If your dataset is small, it is prone to overfitting. Having smaller weights helps combat this when dropout isn't enough. That means fewer output channels per layer.
There are a lot of possibilities when choosing an architecture, and there is no hard-and-fast rule for the best architecture. Remember to start simple.
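For concreteness, a rough sketch of the discriminator described above (the leaky-ReLU activation and the spatial averaging that yields the classes x 1024 dense weights are my own interpretation):

import tensorflow as tf

def discriminator(images, n_classes=1):
    # images: a [N, 64, 64, 3] batch of cropped-and-reshaped samples
    h = images
    for filters in (128, 256, 512, 1024):            # output channels per layer
        h = tf.layers.conv2d(h, filters, kernel_size=5, strides=2,
                             padding='same',          # SAME padding, no pooling
                             activation=tf.nn.leaky_relu)
    # 64 -> 32 -> 16 -> 8 -> 4 spatially, so h is [N, 4, 4, 1024]
    h = tf.reduce_mean(h, axis=[1, 2])                # [N, 1024] feature
    return tf.layers.dense(h, n_classes)              # dense weights: 1024 x n_classes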
When one passes tensors to tf.train.batch, it looks like the shape of each element has to be strictly defined, or else it complains that All shapes must be fully defined if there exist Tensors with shape Dimension(None). How, then, does one train on images of different sizes?
You could set dynamic_pad=True in the argument of tf.train.batch.
dynamic_pad: Boolean. Allow variable dimensions in input shapes. The given dimensions are padded upon dequeue so that tensors within a batch have the same shapes.
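A minimal sketch, assuming image and label come from your input pipeline and image has a partially defined shape such as [None, None, 3]:

image_batch, label_batch = tf.train.batch(
    [image, label],
    batch_size=32,
    capacity=256,
    dynamic_pad=True)  # each dequeued batch is zero-padded to the largest
                       # height/width among its own members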
Usually, images are resized to a certain number of pixels.
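For instance (the target size here is an assumption):

image = tf.image.resize_images(image, [224, 224])  # shape is now fully defined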
Depending on your task, you might be able to use other techniques to process images of varying sizes. For example, for face recognition and OCR, a fixed-size window is used that is then moved over the image. For other tasks, convolutional neural networks with pooling layers or recurrent neural networks can be helpful.
I see that this is quite an old question, but in case someone is searching for how variable-size images can still be used in batches, I can describe what I did for an image-to-image convolutional network (inference), which was trained for variable image sizes with batch size 1. Why: when I tried to process images in batches using padding, the results became much worse, because the signal was "spreading" inside the network and started to influence its convolution pyramids.
What I did is possible when you have the source code and can load weights manually into the convolutional layers. I modified the network in the following way: along with a batch of zero-padded images, I added an additional placeholder which received a batch of binary masks with 1 where actual data was on the patch and 0 where padding was applied. Then I multiplied the signal by these masks after each convolutional layer inside the network, fighting the "spreading". Multiplication isn't an expensive operation, so it did not affect performance much.
The result was no longer deformed, but it still had some border artifacts, so I refined the approach by adding a small (2px) symmetric padding around the input images (the kernel size of all the layers of my CNN was 3) and keeping it during propagation by using a slightly bigger (+[2px,2px]) mask.
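A rough sketch of the basic idea (tensor names, layer widths, and kernel size are assumptions; the 2px symmetric-padding refinement is not shown):

# images is a zero-padded batch [N, H, W, C]; masks holds 1 over real pixels
# and 0 over the padding, shape [N, H, W, 1].
masks = tf.placeholder(tf.float32, [None, None, None, 1])

def masked_conv(x, filters):
    y = tf.layers.conv2d(x, filters, kernel_size=3, padding='same',
                         activation=tf.nn.relu)
    return y * masks  # zero out activations over the padded region after every conv

h = masked_conv(images, 64)
h = masked_conv(h, 64)
# ... and so on through the rest of the network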
One can apply the same approach to training as well. Then some sort of "masked" loss is needed, where only the ROI on each patch is used to calculate the loss. For example, for an L1/L2 loss you can calculate the difference image between the generated and label images and apply the masks before summing up. More complicated losses might involve unstacking or iterating over the batch and extracting the ROI using tf.where or tf.boolean_mask.
Such training can indeed be beneficial in some cases, because you can combine small and big inputs to the network without the small inputs being affected by the loss from their big padded surroundings.
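For example, a masked L1 loss along those lines might look like this (tensor names assumed; masks is the same [N, H, W, 1] binary mask as above):

diff = tf.abs(generated - labels)                     # per-pixel difference image
n_valid = tf.reduce_sum(masks) * tf.cast(tf.shape(diff)[-1], tf.float32)
loss = tf.reduce_sum(diff * masks) / tf.maximum(n_valid, 1.0)  # mean over the ROI only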
I have been trying to build a simple convolutional network to classify CIFAR images, and for whatever reason, I just can't get it to perform better than chance. Here is the relevant bit of code (x is the CIFAR image data, of shape (n,32,32,3)):
As you'll see, it's a pretty basic setup - I know it's not a very elaborate network but I just wanted to test it first before moving to more complicated architectures. However, no matter what I try, I can't get the prediction accuracy to exceed "chance guessing". When I train it, the loss steadily (more or less) declines, but gradually tapers off and saturates at the level of random guessing (i.e. 10% accuracy for CIFAR-10).
I've tried a variety of different parameters, e.g. different kernel sizes, strides, max pooling sizes, with and without normalization, etc., but it doesn't seem to budge. I just feel like I must have missed some crucial pre-processing step or something, since even a relatively simple/poorly tuned network should at least be able to manage better than chance, surely? Thoughts, please ;-)
import numpy as np
import tensorflow as tf

# First conv layer with max pooling
# (x is the input CIFAR data, shape (n,32,32,3); y holds the one-hot labels)
normx=tf.nn.lrn(x)  # local response normalization applied directly to the input
w1=tf.Variable(tf.truncated_normal((5,5,3,depth1),stddev=0.05))
b1=tf.Variable(tf.zeros([depth1]))
act1_=tf.nn.relu(tf.nn.conv2d(normx,w1,[1,1,1,1],padding='VALID')+b1)
act1=tf.nn.max_pool(act1_,[1,3,3,1],[1,2,2,1],padding='VALID')  # 3x3 window, stride 2
# Second conv layer with max pooling
w2=tf.Variable(tf.truncated_normal((5,5,depth1,depth2),stddev=0.05))
b2=tf.Variable(tf.zeros([depth2]))
act2_=tf.nn.relu(tf.nn.conv2d(act1,w2,[1,1,1,1],padding='VALID')+b2)
act2=tf.nn.max_pool(act2_,[1,3,3,1],[1,2,2,1],padding='VALID')
# fully connected
d=np.prod(act2.get_shape()[1:].as_list())  # flattened feature size
wfull=tf.Variable(tf.truncated_normal((d,nhidden),stddev=0.05))
bfull=tf.Variable(tf.zeros([nhidden]))
reshaped=tf.reshape(act2,(-1,d))
actfull=tf.nn.dropout(tf.nn.relu(tf.matmul(reshaped,wfull)+bfull),0.6)  # keep_prob=0.6
# Output
wout=tf.Variable(tf.truncated_normal((nhidden,nclasses),stddev=0.05))
bout=tf.Variable(tf.zeros([nclasses]))
logits=tf.matmul(actfull,wout)+bout
loss=tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=y,logits=logits))
opt=tf.train.AdagradOptimizer(0.2).minimize(loss)  # Adagrad with learning rate 0.2