Faster R-CNN object detection and deep-sort tracking algorithm integration - tracking

I have been trying to integrate the Faster R-CNN object detection model with a deep-sort tracking algorithm. However, for some reason, the tracking algorithm does not perform well which means tracking ID just keeps increasing for the same person.
I have used this repository for building my own script. (check demo.py) deep-sort yolov3
What I did:
1 detection every 30 frames
created a list for detection scores
created a list for detection bounding boxes (considering the input format of deep-sort)
calling the tracker !!!
# tracking and draw bounding boxes
for i in range(0, len(refine_person_detection)):
confidence_worker.append(refine_person_detection[i][4]) # scores
bboxes.append([refine_person_detection[i][0], refine_person_detection[i][2],
(refine_person_detection[i][1] - refine_person_detection[i][0]),
(refine_person_detection[i][3] - refine_person_detection[i][2])]) # bounding boxes
features = encoder(frame, bboxes)
detections = [Detection(bbox, confidence, feature) for bbox, confidence, feature in
zip(bboxes, confidence_worker, features)]
boxes = np.array([d.tlwh for d in detections])
scores = np.array([d.confidence for d in detections])
indices = preprocessing.non_max_suppression(boxes, nms_max_overlap, scores)
detections = [detections[i] for i in indices]
tracker.predict() # calling the tracker
tracker.update(detections)
for track in tracker.tracks:
k.append(track)
if not track.is_confirmed() or track.time_since_update > 1:
continue
bbox = track.to_tlbr()
cv2.rectangle(frame, (int(bbox[0]), int(bbox[1])), (int(bbox[2]), int(bbox[3])),
(255, 255, 255), 2)
cv2.putText(frame, str(track.track_id), (int(bbox[0]), int(bbox[1])), 0, 5e-3 * 200,
(0, 255, 0), 2)
Here is an example of bad results that tracking ID increases.
Thanks in advance for any suggestion

I also study the same thing, I try to combine them, too. Have you done it yet, any progress?

The provided code it correct.
However, the detection must be done every frame.
Since the deep-sort uses the features within the bounding box for tracking, having a gap between the detection frames caused the issue of increasing numbers for the same person
P.S:
#Mustafa please check the code above with every frame detection, should work.
feel free to comment if it did not

Related

How to batch an object detection dataset?

I am working on implementing a face detection model on the wider face dataset. I learned it was built into Tensorflow datasets and I am using it.
However, I am facing an issue while batching the data. Since, an Image can have multiple faces, therefore the number of bounding boxes output are different for each Image. For example, an Image with 2 faces will have 2 bounding box, whereas one with 4 will have 4 and so on.
But the problem is, these unequal number of bounding boxes is causing each of the Dataset object tensors to be of different shapes. And in TensorFlow afaik we cannot batch tensors of unequal shapes ( source - Tensorflow Datasets: Make batches with different shaped data). So I am unable to batch the dataset.
So after loading the following code and batching -
ds,info = tfds.load('wider_face', split='train', shuffle_files=True, with_info= True)
ds1 = ds.batch(12)
for step, (x,y,z) in enumerate(ds1) :
print(step)
break
I am getting this kind of error on run Link to Error Image
In general any help on how can I batch the Tensorflow object detection datasets will be very helpfull.
It might be a bit late but I thought I should post this anyways. The padded_batch feature ought to do the trick here. It kind of goes around the issue by matching dimension via padding zeros
ds,info = tfds.load('wider_face', split='train', shuffle_files=True, with_info= True)
ds1 = ds.padded_batch(12)
for step, (x,y,z) in enumerate(ds1) :
print(step)
break
Another solution would be to process not use batch and process with custom buffers with for loops but that kind of defeats the purpose. Just for posterity I'll add the sample code here as an example of a simple workaround.
ds,info = tfds.load('wider_face', split='train', shuffle_files=True, with_info= True)
batch_size = 12
image_annotations_pair = [x['image'], x['faces']['bbox'] for n, x in enumerate(ds) if n < batch_size]
Then use a train_step modified for this.
For details one may refer to - https://www.kite.com/python/docs/tensorflow.contrib.autograph.operators.control_flow.dataset_ops.DatasetV2.padded_batch

try to implement cv2.findContours for person detection

I'm new to opencv and I'm trying to detect person through cv2.findContours with morphological transformation of the video. Here is the code snippet..
import numpy as np
import imutils
import cv2 as cv
import time
cap = cv.VideoCapture(0)
while(cap.isOpened()):
ret, frame = cap.read()
#frame = imutils.resize(frame, width=700,height=100)
gray = cv.cvtColor(frame, cv.COLOR_BGR2GRAY)
gray = cv.GaussianBlur(gray, (21, 21), 0)
cv.accumulateWeighted(gray, avg, 0.5)
mask2 = cv.absdiff(gray, cv.convertScaleAbs(avg))
mask = cv.absdiff(gray, cv.convertScaleAbs(avg))
contours0, hierarchy = cv.findContours(mask2,cv.RETR_EXTERNAL,cv.CHAIN_APPROX_SIMPLE)
for cnt in contours0:
.
.
.
The rest of the code has the logic of a contour passing a line and incrementing the count.
The problem I'm encountering is, cv.findContours detects every movement/change in the frame (including the person). What I want is cv.findContours to detect only person and not any other movement. I know that person detection can be achieved through harrcasacade but is there any way I can implement detection using cv2.findContours?
If not then is there a way I can still do morphological transformation and detect people because the project I'm working on requires filtering of noise and much of the background to detect the person and increment it's count on passing the line.
I will show you two options to do this.
The method I mentioned in the comments which you can use with Yolo to detect humans:
Use saliency to detect the standout parts of the video
Apply K-Means Clustering to cluster the objects into individual clusters.
Apply Background Subtraction and erosion or dilation (or both depends on the video but try them all and see which one does the best job).
Crop the objects
Send the cropped objects to Yolo
If the class name is a pedestrian or human then draw the bounding boxes on them.
Using OpenCV's builtin pedestrian detection which is much more easier:
Convert frames to black and white
Use pedestrian_cascade.detectMultiScale() on the grey frames.
Draw a bounding box over each pedestrian
Second method is much simpler but it depends what is expected of you for this project.

Is it possible to train YOLO (any version) for a single class where the image has text data. (find region of equations)

I am wondering if YOLO (any version, specially the one with accuracy, not speed) can be trained on the text data. What I am trying to do is to find the Region in the text image where any equation is present.
For example, I want to find the 2 of the Gray regions of interest in this image so that I can outline and eventually, crop the equations separately.
I am asking this questions because :
First of all I have not found a place where the YOLO is used for text data.
Secondly, how can we customise for low resolution unlike the (416,416) as all the images are either cropped or horizontal mostly in (W=2H) format.
I have implemented the YOLO-V3 version for text data but using OpenCv which is basically for CPU. I want to train the model from scratch.
Please help. Any of the Keras, Tensorflow or PyTorch would do.
Here is the code I used for implementing in OpenCv.
net = cv2.dnn.readNet(PATH+"yolov3.weights", PATH+"yolov3.cfg") # build the model. NOTE: This will only use CPU
layer_names = net.getLayerNames() # get all the layer names from the network 254 layers in the network
output_layers = [layer_names[i[0] - 1] for i in net.getUnconnectedOutLayers()] # output layer is the
# 3 output layers in otal
blob = cv2.dnn.blobFromImage(image=img, scalefactor=0.00392, size=(416,416), mean=(0, 0, 0), swapRB=True,)
# output as numpy array of (1,3,416,416). If you need to change the shape, change it in the config file too
# swap BGR to RGB, scale it to a threshold, resize, subtract it from the mean of 0 for all the RGB values
net.setInput(blob)
outs = net.forward(output_layers) # list of 3 elements for each channel
class_ids = [] # id of classes
confidences = [] # to store all the confidence score of objects present in bounding boxes if 0, no object is present
boxes = [] # to store all the boxes
for out in outs: # get all channels one by one
for detection in out: # get detection one by one
scores = detection[5:] # prob of 80 elements if the object(s) is/are inside the box and if yes, with what prob
class_id = np.argmax(scores) # Which class is dominating inside the list
confidence = scores[class_id]
if confidence > 0.1: # consider only those boxes which have a prob of having an object > 0.55
# grid coordinates
center_x = int(detection[0] * width) # centre X of grid
center_y = int(detection[1] * height) # Center Y of grid
w = int(detection[2] * width) # width
h = int(detection[3] * height) # height
# Rectangle coordinates
x = int(center_x - w / 2)
y = int(center_y - h / 2)
boxes.append([x, y, w, h]) # get all the bounding boxes
confidences.append(float(confidence)) # get all the confidence score
class_ids.append(class_id) # get all the clas ids
Being an object detector Yolo can be used for specific text detection only, not for detecting any text that might be present in the image.
For example Yolo can be trained to do text based logo detection like this:
I want to find the 2 of the Gray regions of interest in this image so
that I can outline and eventually, crop the equations separately.
Your problem statement talks about detecting any equation (math formula) that's present in the image so it can't be done using Yolo alone. I think mathpix is similar to your use-case. They will be using OCR (Optical Character Recognition) system trained and fine tuned towards their use-case.
Eventually to do something like mathpix, OCR system customised for your use case is what you need. There won't be any ready ready made solution out there for this. You'll have to build one.
Proposed Methods:
Mathematical Formula Detection in Heterogeneous Document Images
A Simple Equation Region Detector for Printed Document Images in Tesseract
Note: Tesseract as it is can't be used because it is a pre-trained model trained for reading any character. You can refer 2nd paper to train tesseract towards fitting your use case.
To get some idea about OCR, you can read about it here.
EDIT:
So idea is to build your own OCR to detect something that constitutes equation/math formula rather than detecting every character. You need to have data set where equations are marked. Basically you look for region with math symbols(say summation, integration etc.).
Some Tutorials to train your own OCR:
Tesseract training guide
Creating OCR pipeline using CV and DL
Build OCR pipeline
Build Your OCR
Attention OCR
So idea is that you follow these tutorials to get to know how to train
and build your OCR for any use case and then you read research papers
I mentioned above and also some of the basic ideas I gave above to
build OCR towards your use case.

When forward using MXNet, how to do with varying 'batch size' in data_shapes?

Hi,I have a question that, how can I make predict with unfixed input data? I will try to describe in detail clear:
I use MTCNN for face detection(it's ok unfamiliar with that), and it employs 3 networks: PNet, RNet, ONet. PNet detects a mass of proposal face bounding boxes, then these boxes are coarse-to-fine by the rest net one after another, finally get precise face bbox(s). When taking an image as input to PNet, image's size is unfixed, and the output proposal box number from PNet is also unfixed, so as RNet, ONet. Reference to another MTCNN code I set a large data_shapes(e.g., image size, batch size) when I bind the module, and initialize all to zero,then make predict. That works though, Isn't that a redundant calculation? (Question 1)
PNet:
max_img_w=1000
max_img_h=1000
sym, arg_params, aux_params = mx.model.load_checkpoint(‘det1’, 0)
self.PNets = mx.mod.Module(symbol=sym, context=ctx,label_names=None)
self.PNets.bind(data_shapes=[(‘data’, (1, 3, max_img_w, max_img_h))],for_training=False)
self.PNets.set_params(arg_params,aux_params)
RNet
sym, arg_params, aux_params = mx.model.load_checkpoint(‘det2’, 0)
self.RNet = mx.mod.Module(symbol=sym, context=ctx,label_names=None)
self.RNet.bind(data_shapes=[(‘data’, (2048,3, 24, 24))],for_training=False)
self.RNet.set_params(arg_params,aux_params,allow_missing=True)
ONet
sym, arg_params, aux_params = mx.model.load_checkpoint(‘det3’, 0)
self.ONet = mx.mod.Module(symbol=sym, context=ctx,label_names=None)
self.ONet.bind(data_shapes=[(‘data’, (256, 3, 48, 48))],for_training=False)
self.ONet.set_params(arg_params,aux_params,allow_missing=True)
And I try mx.mod.Module.reshape before predict, which will adjust data'shape according to last network's output, but I get this error:(Question 2)
AssertionError: Shape of unspecified array arg:prob1_label changed. This can cause the new executor to not share parameters with the old one. Please check for error in the network. If this is intended, set partial_shaping=True to suppress this warning.
One more thing is that The MTCNN code (https://github.com/pangyupo/mxnet_mtcnn_face_detection) primary use deprecated function to load models:
self.PNet = mx.model.FeedForward.load(‘det1’,0)
One single line to work with arbitrary data_shapes, why this function be deprecated..?(Question 3)
I found a little difference that after load model, FeedFroward takes 0MB memory before make one predict, but mx.mod.Module takes up memory once loaded, and increase obviously after making one prediction.
You can use MXNet imperative API Gluon and that will let you use different batch-sizes.
If like in this case, your model was trained using the symbolic API or has been exported in the serialized MXNet format ('-0001.params', '-symbol.json' for e.g), you can load it in Gluon that way:
ctx = mx.cpu()
sym = mx.sym.load_json(open('det1-symbol.json', 'r').read())
PNet = gluon.nn.SymbolBlock(outputs=sym, inputs=mx.sym.var('data'))
PNet.load_params('det1-0001.params', ctx=ctx)
Then you can use it the following way:
# a given batch size (1)
data1 = mx.nd.ones((1, C, W, H))
output1 = PNet(data1)
# a different batch size (5)
data2 = mx.nd.ones((5, C, W, H))
output2 = PNet(data2)
And it would work.
You can get started with MXNet Gluon with the official 60 minutes crash course

Visualizing DeepLearning CNN Layers

I came across an experimental use of Deep Learning using Tensorflow, https://github.com/asrivat1/DeepLearningVideoGames. The author trained CNNs to play Pong game. All seem straightforward to me, except the visualization to illustrate Q-value in the CNN layers. Here's the youtube video, https://www.youtube.com/watch?v=W9jGIzkVCsM. Anyone can explain how the graphs (heat map looking) are plotted?
Thx.
I digged into the code and found this file from a previous commit, but it is not present anymore in the master version (weird).
Inside you will find the code to visualize, the important lines are:
self.l1.imshow(np.reshape(np.rollaxis(c1, 2, 1),(20,20*32)),aspect = 6)
self.l2.imshow(np.reshape(np.rollaxis(c2, 2, 1),(5,5*64)),aspect = 12)
self.l3.imshow(np.reshape(np.rollaxis(c3, 2, 1),(3,3*64)),aspect = 12)
Here they take the activation map of size (20, 20, 32) and plot all the activations. They reshape to (20, 20*32) to plot all the feature maps (32 in total) side by side. To make it fit into the screen, they use an aspect ratio of 6, which compresses the image horizontally.
To sum it up, they plot all the feature maps side by side, and compress it to fit into the screen.
I would advise you to avoid changing the aspect ratio, and instead use little blocks for each activation (32 blocks in total) and arrange the blocks in a 8x4 layout for instance.