Viola-Jones - what does the 24x24 window mean?

I'm learning about the Viola-Jones detection framework and I read that it uses a 24x24 base detection window[1][2]. I'm having trouble understanding this base detection window.
Let's say I have an image of size 1280x960 pixels and 3 people in it. When I try to perform face detection on this image, will the algorithm:
Shrink the picture to 24x24 pixels,
Tile the picture into 24x24-pixel sections and then test each section,
Position the 24x24 window in the top left of the image and then move it by 1px over the whole image area?
Any help is appreciated, even a link to another explanation.
Source: https://www.cs.cmu.edu/~efros/courses/LBMV07/Papers/viola-cvpr-01.pdf
[1] - page 2, last paragraph before Integral images
[2] - page 4, Results

Does this video help? It is 40 minutes long.
Adam Harvey Explains Viola-Jones Face Detection
Also called Haar cascades, the algorithm is very popular for face detection.
About halfway down that page is another video which shows a super slow-mo scan in progress, so you can see how the window starts small (although much larger than 24x24 for the purpose of demonstration) and shifts around the image pixel by pixel, then does it again and again on successively larger square portions. At each stage, it's still only looking at those windows as though they were resampled to the 24x24 size.
You can also see how it quickly rejects many of those windows and spends most of its time in areas that seem face-like while it computes more and more complex comparisons that become more stringent. This is where the term "cascade" comes into play.
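To make the scanning order concrete, here is a rough Python sketch of option 3 combined with rescaling: the 24x24 base window is slid across the image, then the window grows and the scan repeats. The shift and scale-step values below are illustrative assumptions, not the paper's exact settings.

def scan_windows(image_width, image_height, base=24, scale_step=1.25, shift=1):
    """Yield (x, y, side) for every window position the detector would evaluate."""
    side = base
    while side <= min(image_width, image_height):
        step = max(1, int(shift * side / base))  # the shift grows with the window
        for y in range(0, image_height - side + 1, step):
            for x in range(0, image_width - side + 1, step):
                # Each side x side window is evaluated as though it had been
                # resampled down to the 24x24 base resolution.
                yield x, y, side
        side = int(side * scale_step)

# Example: count how many windows a 1280x960 image produces.
print(sum(1 for _ in scan_windows(1280, 960)))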

I found this video that perfectly explains how the detection window moves and scales on a picture. I wanted to draw a flowchart of how this looks, but I think the video illustrates it better:
https://vimeo.com/12774628
Credits to the original author of the video.

Related

Creating an ML model that detects card values

This is a more generic question about training an ML model to detect cards.
The cards are from a kids' game: 4 different colors, with numbers and symbols. I don't need to detect the color, just the value (a.k.a. the symbol) of the cards.
I tried to take pictures with my iPhone of every card and used RectLabel to draw rectangles around the symbols in the upper left corner (the cards have an upside-down symbol in the lower right corner too; I didn't mark these as they'll be hidden during detection).
I cropped the images so only the card is visible, no surroundings.
Then I uploaded my images to app.roboflow.ai and let them do their magic (using Auto-Orient, Resize to 416x416, Grayscale, Auto-Adjust Contrast, Rotation, Shear, Blur and Noise).
That gave me another set of images which I used to train my model with CreateML from Apple.
However, when I use that model in my app (I'm using the Breakfast Finder demo from Apple), the card values aren't detected - well, sometimes it works, but only at a certain distance from the phone, and the labels are either upside down or sideways.
My guess is this is because my images aren't taken the way they should be?
Any hints on how I'd have to set this whole thing up so my model gets trained well?
My bet would be on this being the problem:
I cropped the images so only the card is visible, no surroundings
You want your training images to be as similar as possible to the images your model will see in the wild. If it's trained only on images of cards with no surroundings and you then show it images of cards with things around them, it won't know what to do.
This UNO scoring example is extremely similar to your problem and might provide some ideas and guidance.

Tiled map editor cutting off edges of imported png

I started to write a small engine to render a 2D isometric map. A friend of mine made a small, basic image of a train station to use as example art for my engine. I tried to import the .png into Tiled and create a tileset for it, to then use the information for rendering that house.
When I import the image, Tiled cuts off the edges of the picture (see attachment "tiled .png import to tileset") on the right and bottom side. I looked into the menus and tried to find information about it, but I couldn't find any helpful advice on why it happens.
Another thing I find curious is the information within the .tsx file:
<?xml version="1.0" encoding="UTF-8"?>
<tileset version="1.2" tiledversion="1.2.1" name="bAHNHOF" tilewidth="30"
tileheight="30" tilecount="195" columns="13">
<image source="bAHNHOF.png" width="401" height="468"/>
</tileset>
Shouldn't columns (13) multiplied by tile width (30) equal the width of the imported image (i.e. 401)? It is only 390 though, so roughly 11 pixels less than the original width.
I probably made a mistake somewhere or am confusing something. Maybe someone can help me?
Thanks in advance :)
Seems like whatever editor you are using wants "whole tile" sizes. This is not uncommon. Increase the size of your base image so that the X and Y dimensions align to tile-size boundaries to prevent this. 30 for a tile size is also very unusual; I'd expect a power of 2 like 32 or 16.
In short, your importer is culling tiles that are not full size. I'd expect it to display a warning about image size before it did this, but who knows, as you didn't state which programs you're using.
When this goes onto whatever platform you are using, a power-of-2 tile size will help in terms of efficiency as well, so consider making that change sooner rather than later.
Finally, tiling is often done to save memory. If, when you divide up your image into tiles (tile it), you can create identical tiles, the computer can use that knowledge to lessen the amount of memory needed.
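To see exactly what gets culled with your numbers, here is a quick check in Python using the values from the .tsx file:

image_width, image_height = 401, 468
tile_size = 30

columns = image_width // tile_size    # 13 whole tiles across
rows = image_height // tile_size      # 15 whole tiles down
print(columns * tile_size)            # 390 -> the 11 rightmost pixels are dropped
print(rows * tile_size)               # 450 -> the 18 bottom pixels are dropped
print(columns * rows)                 # 195, which matches tilecount="195"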

Is there any way I can enlarge a stimulus in #psychopy without losing image quality?

I'm importing my stimuli from a folder. I would like to make them bigger (the actual image size is 120 px high x 170 px wide). I've tried to double the size by using this code in the PsychoPy Coder:
stimuli.append(visual.ImageStim(win=win, name='image', units='cm', size=[9, 6.3]))
(I used double the numbers, in cm) but this distorts the image. Is there any way to enlarge it without distorting it, or do I have to change the stimuli themselves?
Thank you
Just to answer what Michael said in the comment: no, if you scale an image up, the only way of guessing what is in between pixels is interpolation. This is what PsychoPy does and what ANY software would do. To make an analogy: take a picture of a distant tree using your digital camera. Then scale the image up using all kinds of software. You won't suddenly be able to see the individual leaves, since the software had no such information as input.
If you need higher resolution, put higher-resolution images in your folder. If it's simple shapes, you may use built-in methods such as visual.ShapeStim and its variants: visual.Polygon, visual.Rect and visual.Circle. PsychoPy can scale these shapes freely so they always stay sharp.
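For example, here is a minimal PsychoPy sketch of that approach; the window settings and the 'testMonitor' name are placeholders, not taken from your experiment:

from psychopy import visual

win = visual.Window(size=(1024, 768), units='cm', monitor='testMonitor')

# Vector shapes are re-drawn at whatever size you request, so they never blur.
rect = visual.Rect(win, width=9, height=6.3, fillColor='white', lineColor='black')
circle = visual.Circle(win, radius=2, fillColor='red')

rect.draw()
circle.draw()
win.flip()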

Perspective distortions correction

I'm searching for methods of text recognition based on document borders.
Or methods that can solve the problem of finding a new viewpoint.
For example, the camera is at point (x1,y1,z1) and the resulting picture has perspective distortions, but we could find a point (x2,y2,z2) for the camera that would correct the picture.
Thanks.
The usual approach, which assumes that the document's page is approximately flat in 3D space, is to warp the quadrangle encompassing the page into a rectangle. To do so you must estimate a homography, i.e. a (linear) projective transformation between the original image and its warped counterpart.
The estimation requires matching points (or lines) between the two images, and a common choice for documents is to map the page corners in the original images to the image corners of the warped image. This will in general produce a rectangle with an incorrect aspect ratio (i.e. the warped page will look "wider" or "taller" than the real one), but this can be easily corrected if you happen to know in advance what the real aspect ratio is (for example, because you know the type of paper used, whether letter, A4, etc.).
A simple algorithm to perform the estimation is the so-called Direct Linear Transformation.
The OpenCV library contains routines to help accomplish all these tasks; look into it.
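As a rough sketch of this pipeline in Python with OpenCV (the file names, corner coordinates and A4 output size below are assumptions for illustration, not values from your setup):

import cv2
import numpy as np

img = cv2.imread('document.jpg')  # hypothetical input image

# Page corners in the source image, ordered TL, TR, BR, BL (example values;
# in practice they come from a corner/contour detector or manual marking).
corners = np.float32([[120, 80], [980, 140], [940, 1240], [60, 1180]])

# Target rectangle with the known aspect ratio of the paper (A4: 210 x 297 mm).
w, h = 840, 1188  # roughly 4 px per mm
target = np.float32([[0, 0], [w, 0], [w, h], [0, h]])

# Homography from the four correspondences (a DLT solve), then warp.
H = cv2.getPerspectiveTransform(corners, target)
warped = cv2.warpPerspective(img, H, (w, h))
cv2.imwrite('document_warped.jpg', warped)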

Finding significant images in a set of surveillance camera images

I've had theft problems outside my house, so I set up a simple webcam to capture a frame every second with Dorgem (http://dorgem.sf.net).
Dorgem does offer a feature to use motion detection to only capture frames where something is moving on the screen. The problem is that the motion detection algorithm it uses is extremely sensitive. It goes off because of variations in color between successive shots on my cheap webcam, and it also goes off because the trees in front of the house are blowing in the wind. Additionally, the front of my house is a high traffic area so there is also a large number of legitimately captured frames.
I average capturing 2800 out of 3600 frames every hour using Dorgem's motion detection. This is too much for me to search through to find where the interesting activity is.
I wish I could re-position the camera to a more optimal position where it would only capture the areas I'm interested in, so that motion detection would be simpler, however this is not an option for me.
I think that because my camera has a fixed position and each picture frames the same area in front of my house, then I should be able to scan the images and figure out which ones have motion in some interesting region of that image, throwing out all other frames.
For example: if there's a change at pixel 320,240 then someone has stepped in front of my house and I want to see that frame, but if there's a change at pixel 1,1 then it's just the trees blowing in the wind and the frame can be discarded.
I've looked at pdiff, a tool for finding diffs in sets of pictures, but it seems to focus on diffing the entire picture rather than a specific region of it:
http://pdiff.sourceforge.net/
I've also looked at phash, a tool for calculating a hash based on human perception of an image, but it seems too complex:
http://www.phash.org/
I suppose I could implement it in a shell script using imagemagick's mogrify -crop to cherry pick the regions of the image I'm interested in, then running pdiff to find the interesting ones, and using that to pick out the interesting frames.
Any thoughts? ideas? existing tools?
Cropping and then using pdiff seems like the best choice to me.
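A rough Python/Pillow sketch of that idea, as an alternative to mogrify + pdiff; the region box, threshold and folder name are assumptions you would tune for your camera:

import glob
from PIL import Image, ImageChops

REGION = (280, 200, 360, 280)  # (left, top, right, bottom) box around pixel 320,240
THRESHOLD = 1000               # total absolute difference that counts as "motion"

def region_changed(prev_path, curr_path):
    prev = Image.open(prev_path).convert('L').crop(REGION)
    curr = Image.open(curr_path).convert('L').crop(REGION)
    diff = ImageChops.difference(prev, curr)
    # Only pixels inside the region of interest contribute to the score.
    return sum(diff.getdata()) > THRESHOLD

frames = sorted(glob.glob('captures/*.jpg'))
interesting = [b for a, b in zip(frames, frames[1:]) if region_changed(a, b)]
print(interesting)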