Visualizing DeepLearning CNN Layers - tensorflow

I came across an experimental use of deep learning with TensorFlow in which the author trained CNNs to play Pong. It all seems straightforward to me, except the visualization that illustrates the Q-values in the CNN layers. Here's the YouTube video. Can anyone explain how the heat-map-like graphs are plotted?

I dug into the code and found this file in a previous commit; it is no longer present in the master version (weird).
Inside you will find the visualization code. The important lines are:
self.l1.imshow(np.reshape(np.rollaxis(c1, 2, 1),(20,20*32)),aspect = 6)
self.l2.imshow(np.reshape(np.rollaxis(c2, 2, 1),(5,5*64)),aspect = 12)
self.l3.imshow(np.reshape(np.rollaxis(c3, 2, 1),(3,3*64)),aspect = 12)
Here they take the activation map of size (20, 20, 32) and plot all the activations. They reshape it to (20, 20*32) so that all 32 feature maps sit side by side. To make the result fit on the screen, they use an aspect ratio of 6, which compresses the image horizontally.
To sum it up, they plot all the feature maps side by side and compress the result to fit on the screen.
I would advise against changing the aspect ratio. Instead, use a small block for each feature map (32 blocks in total) and arrange the blocks in, for instance, an 8x4 layout.
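A minimal sketch of that block layout, assuming the activation has shape (height, width, channels) as in the (20, 20, 32) case above. The helper name tile_feature_maps is made up for this example and is not part of the original repository.

```python
import numpy as np

def tile_feature_maps(act, rows, cols):
    """Arrange the feature maps of a (H, W, C) activation in a rows x cols grid.

    Hypothetical helper for illustration; requires rows * cols == C.
    """
    h, w, c = act.shape
    if rows * cols != c:
        raise ValueError("rows * cols must equal the number of feature maps")
    # (H, W, C) -> (C, H, W) -> (rows, cols, H, W)
    maps = act.transpose(2, 0, 1).reshape(rows, cols, h, w)
    # (rows, cols, H, W) -> (rows, H, cols, W) -> (rows*H, cols*W)
    return maps.transpose(0, 2, 1, 3).reshape(rows * h, cols * w)

# Example: 32 feature maps of size 20x20 laid out as 4 rows of 8 blocks
activation = np.random.rand(20, 20, 32)
grid = tile_feature_maps(activation, rows=4, cols=8)
```

The resulting 2D array can be passed to imshow with the default aspect ratio, so each feature map keeps its square shape.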


Simple Captcha Solving

I'm trying to solve some simple captchas using OpenCV and pytesseract. Some captcha samples are:
I tried to remove the noisy dots with some filters:
import cv2
import numpy as np
import pytesseract
img = cv2.imread(image_path)
_, img = cv2.threshold(img, 127, 255, cv2.THRESH_BINARY)
img = cv2.morphologyEx(img, cv2.MORPH_OPEN, np.ones((4, 4), np.uint8), iterations=1)
img = cv2.medianBlur(img, 3)
img = cv2.medianBlur(img, 3)
img = cv2.medianBlur(img, 3)
img = cv2.medianBlur(img, 3)
img = cv2.GaussianBlur(img, (5, 5), 0)
cv2.imwrite('res.png', img)
The resulting transformed images are:
Unfortunately, pytesseract recognizes only the first captcha correctly. Is there a better transformation?
Final Update:
As @Neil suggested, I tried to remove noise by detecting connected pixels. To find connected pixels, I found a function named connectedComponentsWithStats, which detects connected pixels and assigns each group (component) a label. By finding connected components and removing the ones with a small number of pixels, I managed to get better overall detection accuracy with pytesseract.
And here are the new resulting images:
I've taken a much more direct approach to filtering ink splotches from PDF documents. I won't share the whole thing because it's a lot of code, but here is the general strategy I adopted:
Use the Python Pillow library to get an image object where you can manipulate pixels directly.
Binarize the image.
Find all connected pixels and how many pixels are in each group of connected pixels. You can do this with the minesweeper algorithm (a flood fill), which is easy to search for.
Set some threshold value of pixels that all legitimate letters are expected to have. This will be dependent on your image resolution.
Replace all black pixels in groups below the threshold with white pixels.
Convert back to image.
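The strategy above can be sketched roughly as follows. This is not the answerer's code: it assumes the image has already been binarized into a 2D boolean NumPy array where True marks ink, and the function name and the min_pixels parameter are made up for this example.

```python
from collections import deque

import numpy as np

def remove_small_components(binary, min_pixels):
    """Clear every 8-connected group of ink pixels smaller than min_pixels.

    binary: 2D bool array, True = ink (black). Returns a cleaned copy.
    Hypothetical helper; the threshold depends on the image resolution.
    """
    h, w = binary.shape
    seen = np.zeros((h, w), dtype=bool)
    out = binary.copy()
    for sy in range(h):
        for sx in range(w):
            if not binary[sy, sx] or seen[sy, sx]:
                continue
            # Flood fill from this seed to collect one connected group
            group = [(sy, sx)]
            seen[sy, sx] = True
            queue = deque(group)
            while queue:
                y, x = queue.popleft()
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < h and 0 <= nx < w
                                and binary[ny, nx] and not seen[ny, nx]):
                            seen[ny, nx] = True
                            group.append((ny, nx))
                            queue.append((ny, nx))
            # Groups below the threshold are treated as noise
            if len(group) < min_pixels:
                for y, x in group:
                    out[y, x] = False
    return out

# Example: one isolated speck and one 3x3 "letter"
img = np.zeros((6, 6), dtype=bool)
img[0, 0] = True
img[2:5, 2:5] = True
cleaned = remove_small_components(img, min_pixels=4)
```

The pure-Python loop is slow on large scans; it is meant to show the logic rather than to be fast.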
Your final output image is too blurry. To improve pytesseract's performance, you need to sharpen it.
Sharpening is not as easy as blurring, but a few code snippets and tutorials exist.
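As one illustration, unsharp masking is a common way to sharpen: subtract a blurred copy from the image and add the difference back. The sketch below is an assumption about what could work here, not a method from the answer. It uses a plain 3x3 box blur on a grayscale uint8 array, and the function names and the amount parameter are made up for this example.

```python
import numpy as np

def box_blur3(img):
    """3x3 box blur of a 2D array, replicating the border pixels."""
    h, w = img.shape
    padded = np.pad(img.astype(np.float64), 1, mode="edge")
    acc = np.zeros((h, w), dtype=np.float64)
    for dy in range(3):
        for dx in range(3):
            acc += padded[dy:dy + h, dx:dx + w]
    return acc / 9.0

def unsharp_mask(img, amount=1.0):
    """Sharpen a grayscale uint8 image: img + amount * (img - blurred)."""
    img_f = img.astype(np.float64)
    sharp = img_f + amount * (img_f - box_blur3(img))
    return np.clip(np.rint(sharp), 0, 255).astype(np.uint8)

# Example: a soft vertical edge gets more contrast on both sides
edge = np.full((5, 6), 100, dtype=np.uint8)
edge[:, 3:] = 150
sharpened = unsharp_mask(edge)
```

Flat regions are left unchanged, while pixels next to an edge are pushed away from their neighbours, which is what makes character strokes look crisper.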
Rather than chaining blurs, blur once using either a Gaussian or a median blur and experiment with the parameters to get the amount of blur you need. You could try one method after the other, but there is no reason to chain blurs of the same method.
There is an OCR example in Python that detects the characters. Save several images, apply the filter, and train an SVM; that may help you. I trained one with only a few images, and the results were acceptable. Check this link.
Wish you luck
I know the post is a bit old, but I suggest you try this library I developed some time ago. If you have a set of labelled captchas, that service would fit your needs. Take a look:
In the README there is a section "Train model on external data" that you might be interested in.

Faster R-CNN object detection and deep-sort tracking algorithm integration

I have been trying to integrate the Faster R-CNN object detection model with the deep-sort tracking algorithm. However, for some reason, the tracking algorithm does not perform well: the tracking ID just keeps increasing for the same person.
I used this repository to build my own script (check deep-sort yolov3).
What I did:
1 detection every 30 frames
created a list for detection scores
created a list for detection bounding boxes (considering the input format of deep-sort)
called the tracker:
# tracking and drawing bounding boxes
for i in range(0, len(refine_person_detection)):
    confidence_worker.append(refine_person_detection[i][4])  # scores
    # bounding boxes as [x, y, w, h], the input format of deep-sort
    bboxes.append([refine_person_detection[i][0], refine_person_detection[i][2],
                   (refine_person_detection[i][1] - refine_person_detection[i][0]),
                   (refine_person_detection[i][3] - refine_person_detection[i][2])])
features = encoder(frame, bboxes)
detections = [Detection(bbox, confidence, feature) for bbox, confidence, feature in
              zip(bboxes, confidence_worker, features)]
# non-maximum suppression on the detections
boxes = np.array([d.tlwh for d in detections])
scores = np.array([d.confidence for d in detections])
indices = preprocessing.non_max_suppression(boxes, nms_max_overlap, scores)
detections = [detections[i] for i in indices]
tracker.predict()  # calling the tracker
tracker.update(detections)  # update step, assumed from the deep-sort demo script
for track in tracker.tracks:
    if not track.is_confirmed() or track.time_since_update > 1:
        continue  # skip unconfirmed or stale tracks, as in the deep-sort demo script
    bbox = track.to_tlbr()
    cv2.rectangle(frame, (int(bbox[0]), int(bbox[1])), (int(bbox[2]), int(bbox[3])),
                  (255, 255, 255), 2)
    cv2.putText(frame, str(track.track_id), (int(bbox[0]), int(bbox[1])), 0, 5e-3 * 200,
                (0, 255, 0), 2)
Here is an example of the bad results, where the tracking ID keeps increasing.
Thanks in advance for any suggestions.
I am also studying the same thing and trying to combine them. Have you managed it yet? Any progress?
The provided code is correct.
However, the detection must be done on every frame.
Since deep-sort uses the features within the bounding box for tracking, the gap between detection frames caused the tracking ID to keep increasing for the same person.
@Mustafa, please try the code above with detection on every frame; it should work.
Feel free to comment if it does not.
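A rough sketch of the per-frame structure described above. It deliberately avoids the real deep-sort and Faster R-CNN APIs: detect and step_tracker are hypothetical callables standing in for the detector and for the encoder plus tracker code from the question, and the box layout [x_min, x_max, y_min, y_max, score] is inferred from the indexing used there.

```python
def to_tlwh(det):
    """Convert [x_min, x_max, y_min, y_max, score] to [x, y, w, h]."""
    x_min, x_max, y_min, y_max = det[:4]
    return [x_min, y_min, x_max - x_min, y_max - y_min]

def run_tracking(frames, detect, step_tracker):
    """Run detection and tracking on every frame, with no skipped frames.

    detect(frame) -> list of [x_min, x_max, y_min, y_max, score]
    step_tracker(frame, boxes_tlwh, scores) -> tracks for that frame
    Both callables are placeholders for illustration only.
    """
    results = []
    for frame in frames:
        dets = detect(frame)  # detector runs on each frame, not every 30th
        boxes = [to_tlwh(d) for d in dets]
        scores = [d[4] for d in dets]
        results.append(step_tracker(frame, boxes, scores))
    return results
```

Running the detector on every frame costs more compute, but it gives the tracker a fresh box and appearance feature at each step.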

Implement CVAE for a single image

I have a multi-dimensional, hyperspectral image (channels, width, height = 15, 2500, 2500). I want to compress its 15 channels into 5, so the output would be (channels, width, height = 5, 2500, 2500). One simple way to do this is to apply PCA. However, its performance is not so good, so I want to use a Variational AutoEncoder (VAE).
When I looked at the available solutions in the TensorFlow and Keras libraries, they show examples of clustering whole images using a Convolutional Variational AutoEncoder (CVAE).
However, I have a single image. What is the best practice for implementing a CVAE? Is it to generate sample images with a moving-window approach?
One way of doing it would be to have a CVAE that takes as input (and output) values of all the spectral features for each of the spatial coordinates (the stacks circled in red in the picture). So, in the case of your image, you would have 2500*2500 = 6250000 input data samples, which are all vectors of length 15. And then the dimension of the middle layer would be a vector of length 5. And, instead of 2D convolutions that are normally used along the spatial domain of images, in this case it would make sense to use 1D convolution over the spectral domain (since the values of neighbouring wavelengths are also correlated). But I think using only fully-connected layers would also make sense.
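A minimal sketch of the data preparation for this per-pixel approach, assuming the image is a NumPy array of shape (channels, H, W). The function names are made up for this example; the model itself is left out.

```python
import numpy as np

def pixels_as_samples(cube):
    """Turn a (C, H, W) cube into (H*W, C): one spectral vector per pixel."""
    c, h, w = cube.shape
    return cube.reshape(c, h * w).T

def samples_to_cube(samples, h, w):
    """Inverse step: (H*W, C') latent vectors back to a (C', H, W) cube."""
    return samples.T.reshape(-1, h, w)

# Small stand-in for the (15, 2500, 2500) image
cube = np.random.rand(15, 4, 5)
samples = pixels_as_samples(cube)  # shape (20, 15)
```

After training, the encoder would map each 15-value row to a 5-value latent vector, and samples_to_cube would reassemble those rows into the compressed (5, H, W) image.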
As a disclaimer, I haven’t seen CVAEs used in this way before, but this way you would also get many data samples, which are needed for the learning to generalise well.
Another option would indeed be what you suggested: generate the samples (patches) using a moving window (maybe with a stride of half the patch size). Even though you wouldn't necessarily get enough data samples for the CVAE to generalise really well on all HSI images, I guess it doesn't matter if it overfits, since you want to use it on that same image.
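The moving-window option could be sketched like this, again assuming a (channels, H, W) NumPy array. The function name, the patch size, and the default stride of half the patch size are choices made for this example.

```python
import numpy as np

def extract_patches(cube, patch, stride=None):
    """Cut a (C, H, W) cube into overlapping (C, patch, patch) windows.

    Hypothetical helper; stride defaults to half the patch size.
    Returns an array of shape (num_patches, C, patch, patch).
    """
    if stride is None:
        stride = max(1, patch // 2)
    c, h, w = cube.shape
    patches = [cube[:, y:y + patch, x:x + patch]
               for y in range(0, h - patch + 1, stride)
               for x in range(0, w - patch + 1, stride)]
    return np.stack(patches)

# Small stand-in for the hyperspectral image
cube = np.random.rand(15, 8, 8)
patches = extract_patches(cube, patch=4)  # 3 x 3 window positions
```

Windows that would run past the border are skipped, so a strip at the right and bottom edges may be left uncovered when the size is not a multiple of the stride.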

Shifted convolutions as a replacement for masked convolutions in PixelCNN++

I read the PixelCNN++ paper and code.
In the code, there is this line (298):
''' utilities for shifting the image around, efficient alternative to masking convolutions '''
Afterwards, they define several functions for that purpose:
down_shifted_conv2d, down_right_shifted_conv2d, down_shift, right_shift.
Using these and gated_resnet layers, they (based on figure 2 of the paper) convert the image from 32x32 to 8x8 and back to 32x32. I looked into these layers: it seems that down_shift adds a bottom row of zeros, and down_shifted_conv2d adds some specific padding and uses a specific kernel size.
Also, they divide the model into u_list (line 37) and ul_list (line 38), which I think might correspond to the downward and downward+rightward streams mentioned briefly in the paper after figure 2.
Lastly, at the beginning of the model, they pad the last axis with ones (line 37) and state that it is to:
add channel of ones to distinguish image from padding later on
My questions are:
How are the shifted convolutions a replacement for masked convolutions? That is, how do they prevent the network from seeing future pixel values through the layers? And why are they called "shifted"?
What are the downward and downward+rightward streams, how do they work, and are they the same as u_list and ul_list?
Why do they pad the last axis of the input with ones, and in what way does it help them later?
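A small NumPy experiment may help with the first question. It is only a sketch on a single 2D map, and it assumes that down_shift prepends a row of zeros at the top and drops the last row, and that right_shift does the same with columns. That assumption is worth checking against the repository, since the description above reads it as a bottom row.

```python
import numpy as np

def down_shift(x):
    """Shift a (H, W) map down by one row: zeros on top, last row dropped."""
    return np.concatenate([np.zeros((1, x.shape[1]), x.dtype), x[:-1]], axis=0)

def right_shift(x):
    """Shift a (H, W) map right by one column: zeros on the left, last column dropped."""
    return np.concatenate([np.zeros((x.shape[0], 1), x.dtype), x[:, :-1]], axis=1)

x = np.arange(1, 10).reshape(3, 3)
shifted_down = down_shift(x)    # row i now holds what was row i - 1
shifted_right = right_shift(x)  # column j now holds what was column j - 1
```

Under that assumption, position (i, j) of the down-shifted map only contains information from row i - 1, so a layer reading it at (i, j) cannot see the pixel it is supposed to predict.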

Dynamic Tensor Alignment/Cropping

I implemented a Fully Convolutional Network in TensorFlow. It uses an encoder-decoder structure.
During training, I always use the same image size (224x224, using random crops), and everything works nicely.
In the inference phase, I want to predict one image at a time, because I want to use the full image (not cropped). For example, such an image has size [406,256]. And here is the problem.
In the encoder-decoder architecture I add two tensors (z = x + y). During training, the sizes of both tensors match. When predicting on my single image, the sizes do not match (tensor sizes: [1,47,47,64] vs [1,46,46,64]). I think this is caused by some rounding done in the Conv and Pool layers.
What should I change in my architecture so that it works for any image size I want? Should I change the rounding parameters? Or add 'cropping' of the tensor?
Link to implementation of architecture:
(the problem occurs in line 166)
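One way to see how rounding can produce such a one-pixel mismatch is to trace the sizes by hand. The sketch below assumes every encoder level halves the size with 'SAME'-style rounding (output = ceil(input / 2)) and every decoder level doubles it exactly. The real network may round differently, so the numbers are illustrative only, and the function names are made up for this example.

```python
import math

def encoder_sizes(n, levels):
    """Sizes after each assumed stride-2 'SAME' layer: out = ceil(in / 2)."""
    sizes = [n]
    for _ in range(levels):
        sizes.append(math.ceil(sizes[-1] / 2))
    return sizes

def first_mismatch(n, levels):
    """Return (encoder_size, decoder_size) at the first skip connection where
    2x upsampling does not reproduce the encoder size, or None if all match."""
    sizes = encoder_sizes(n, levels)
    for enc, deeper in zip(reversed(sizes[:-1]), reversed(sizes[1:])):
        if 2 * deeper != enc:
            return enc, 2 * deeper
    return None

print(encoder_sizes(224, 5))   # every size stays even until the last level
print(first_mismatch(224, 5))
print(first_mismatch(406, 5))
```

With these assumptions, a side of 224 survives five halvings without any odd intermediate size, while 406 hits an odd size on the way down, so one decoder tensor ends up a pixel larger than its encoder partner.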
I found the solution for variable input size :)
What we really needed was a 'crop layer' that crops one tensor to match the other. I found a really similar layer here:
I just turned it into crop_and_add, and it works:
import tensorflow as tf

def crop_and_add(x1, x2):
    # centrally crop x1 to the spatial size of x2, then add the two tensors
    x1_shape = tf.shape(x1)
    x2_shape = tf.shape(x2)
    # offsets for the top left corner of the crop
    offsets = [0, (x1_shape[1] - x2_shape[1]) // 2, (x1_shape[2] - x2_shape[2]) // 2, 0]
    # -1 keeps the full batch and channel dimensions
    size = [-1, x2_shape[1], x2_shape[2], -1]
    x1_crop = tf.slice(x1, offsets, size)
    return x1_crop + x2
All additions in the model I replaced with the layer above (that is, wherever encoder and decoder data are merged).
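To check the offset arithmetic without building a graph, the same crop can be mirrored in NumPy. This is only a sketch for experimenting with shapes: the function name is made up for this example, and it assumes x1 is at least as large as x2 in both spatial dimensions.

```python
import numpy as np

def center_crop_and_add(x1, x2):
    """NumPy mirror of the crop: cut x1 (N, H, W, C) centrally to x2's H and W, then add."""
    off_h = (x1.shape[1] - x2.shape[1]) // 2
    off_w = (x1.shape[2] - x2.shape[2]) // 2
    x1_crop = x1[:, off_h:off_h + x2.shape[1], off_w:off_w + x2.shape[2], :]
    return x1_crop + x2

# The mismatching shapes from the question
a = np.ones((1, 47, 47, 64), dtype=np.float32)
b = np.ones((1, 46, 46, 64), dtype=np.float32)
merged = center_crop_and_add(a, b)  # shape (1, 46, 46, 64)
```

With a one-pixel difference the offset is 0, so the crop simply drops the last row and column of the larger tensor.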
Also, the input to the model needs to be defined as:
image = tf.placeholder(tf.float32, shape=[1, None, None, 3], name="input_image")
So we know that we will pass a single image and that the image has 3 channels, but we know neither the width nor the height. And it works very nicely! (40 FPS on a K80 on an AWS P2 instance; the image size is 224x{}, i.e. the shorter side of the image is 224.)
FYI, I was also trying to run ENet (2x faster than LinkNet), but in TensorFlow it is slower. I think it is because of PReLU (which is slow in TF). It also does not support arbitrary image sizes because of the UnPool layer, which needs a predefined output size given as a list of integers (not placeholders). So LinkNet looks better in terms of speed and performance in TF.