Let's say that I want to detect a character on a background image. The character can be tilted/moved in a number of ways, which will slightly change how we see him. Luckily, I have the sprite sheet for all of his possible positions. Is there a way to train tensorflow to detect objects/characters based on sprite sheets?
You can take different approaches:
1) I would first try out template matching: you slide your sprite over the image and see how well it matches. You do the same for the tilts: you tilt the sprite image and slide the tilted sprite over the image. You do this for, say, every tenth of a degree, and take the best-matching template (see the sketch after this list).
2) If that's too computationally intensive, I would still use template matching, but only to gather data for a machine learning model. You can run template matching, record the best match for each frame along with its bounding box, and then use that data to train an object detection network. There's more state-of-the-art stuff than this, but for ease of use I would use YOLOv3. It also has a tiny version, which is less accurate but way faster.
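A minimal sketch of the template-matching idea from approach 1, assuming OpenCV in Python; the file names are placeholders, and the rotation step is much coarser than a tenth of a degree to keep the sketch fast:

```python
import cv2

# Placeholder file names for the screenshot and one sprite frame.
scene = cv2.imread("screenshot.png", cv2.IMREAD_GRAYSCALE)
sprite = cv2.imread("sprite_frame.png", cv2.IMREAD_GRAYSCALE)

h, w = sprite.shape
center = (w / 2, h / 2)
best_score, best_loc, best_angle = -1.0, None, None

# 1-degree steps here; corners get clipped when rotating inside the sprite's
# own canvas, so pad the sprite first if that matters for your art.
for angle in range(0, 360):
    rot = cv2.getRotationMatrix2D(center, angle, 1.0)
    rotated = cv2.warpAffine(sprite, rot, (w, h))
    result = cv2.matchTemplate(scene, rotated, cv2.TM_CCOEFF_NORMED)
    _, max_val, _, max_loc = cv2.minMaxLoc(result)
    if max_val > best_score:
        best_score, best_loc, best_angle = max_val, max_loc, angle

print(f"best match {best_score:.2f} at {best_loc}, rotated {best_angle} degrees")
```

The winning location, angle and sprite size give you exactly the bounding box you would record per frame for approach 2.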
Related
This is a more generic question about training an ML model to detect cards.
The cards are from a kids' game with 4 different colors, numbers and symbols. I don't need to detect the color, just the value (i.e. the symbol) of each card.
I took pictures of every card with my iPhone and used RectLabel to draw rectangles around the symbols in the upper-left corner (the cards also have an upside-down symbol in the lower-right corner; I didn't mark those, as they'll be hidden during detection).
I cropped the images so only the card is visible, no surroundings.
Then I uploaded my images to app.roboflow.ai and let them do their magic (using Auto-Orient, Resize to 416x416, Grayscale, Auto-Adjust Contrast, Rotation, Shear, Blur and Noise).
That gave me another set of images which I used to train my model with CreateML from Apple.
However, when I use that model in my app (I'm using the Breakfast Finder demo from Apple), the card values aren't detected. Well, sometimes it works, but only at a certain distance from the phone, and the labels are either upside down or sideways.
My guess is this is because my images aren't taken the way they should be?
Any hints on how I'd have to set this whole thing up so my model gets trained well?
My bet would be on this being the problem:
"I cropped the images so only the card is visible, no surroundings."
You want your training images to be as similar as possible to the images your model will see in the wild. If it's trained only on images of cards with no surroundings, and you then show it images of cards with things around them, it won't know what to do.
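One way to make the training set look more like what the model will actually see is to composite the cropped cards onto varied background photos before training. A rough sketch with Pillow; the folder names are placeholders, and the bounding-box annotations would need to be transformed the same way:

```python
import random
from pathlib import Path
from PIL import Image

# Placeholder folders: your cropped card photos and some unrelated background photos.
cards = sorted(Path("cropped_cards").glob("*.jpg"))
backgrounds = list(Path("backgrounds").glob("*.jpg"))
out_dir = Path("composited")
out_dir.mkdir(exist_ok=True)

for i, card_path in enumerate(cards):
    card = Image.open(card_path).convert("RGBA")
    bg = Image.open(random.choice(backgrounds)).convert("RGBA").resize((1024, 1024))

    # Random size and rotation so the model sees cards at different scales and orientations.
    target_w = int(bg.width * random.uniform(0.2, 0.4))
    card = card.resize((target_w, int(card.height * target_w / card.width)))
    card = card.rotate(random.uniform(0, 360), expand=True)

    # Paste at a random position; the card's alpha channel acts as the paste mask.
    x = random.randint(0, bg.width - card.width)
    y = random.randint(0, bg.height - card.height)
    bg.paste(card, (x, y), card)
    bg.convert("RGB").save(out_dir / f"card_{i:04d}.jpg")
    # The symbol's bounding box must be shifted/scaled/rotated by the same transform.
```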
This UNO scoring example is extremely similar to your problem and might provide some ideas and guidance.
In case the title confused you: I want to remove the background around the object. The boundary is rather complex, so doing it by hand is time-consuming. However, I have several images of the same object on different backgrounds.
So I've put these images on different layers, so that the object is in the same place on each layer. Now I would like to combine all the layers into one, so that the object persists but the differing backgrounds are removed. Is there a function/filter/script that works this way: take pixels from the different layers and, if they differ, remove them or make them (more) transparent, while leaving pixels that don't differ unchanged?
I've tried "addition" and "multiply" modes for layers, but they don't work that way - they still change pixels that are "the same".
With two images:
Set the top layer's mode to Difference
Get a new layer from the result: Layer>New from visible
Color-select the black with a low threshold.
Your selection is the pixels that are black, i.e. those where the difference between the images was 0, i.e. those that are identical in both images.
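The same two-image trick can also be scripted outside GIMP; a small sketch with NumPy and Pillow, assuming the two shots are already aligned (file names and the threshold are placeholders):

```python
import numpy as np
from PIL import Image

# Placeholder file names: two aligned shots of the same object on different backgrounds.
a = np.asarray(Image.open("shot_bg1.png").convert("RGB"), dtype=np.int16)
b = np.asarray(Image.open("shot_bg2.png").convert("RGB"), dtype=np.int16)

# Pixels whose per-channel difference stays under a small threshold count as
# "identical", i.e. the object; everything else is assumed to be background.
threshold = 10
same = np.abs(a - b).max(axis=2) <= threshold

# Keep identical pixels, make differing pixels fully transparent.
alpha = np.where(same, 255, 0).astype(np.uint8)
rgba = np.dstack([a.astype(np.uint8), alpha])
Image.fromarray(rgba, "RGBA").save("object_only.png")
```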
With more images:
A solution likely uses a "median filter". Such a filter makes pixels "vote": each output pixel takes the most common value among the corresponding pixels in the source images. This is typically applied to remove random objects (tourists) in front of a fixed subject (a building): take several shots, and the filter will keep the pixels of the building and remove the tourists.
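The per-pixel vote is easy to sketch with NumPy, assuming the shots are aligned (file names are placeholders). Note that, as mentioned below, this keeps whatever is consistent across the shots but does not by itself make the differing background transparent:

```python
import numpy as np
from PIL import Image

# Placeholder file names: several aligned shots of the same object on different backgrounds.
paths = ["shot1.png", "shot2.png", "shot3.png", "shot4.png", "shot5.png"]
stack = np.stack([np.asarray(Image.open(p).convert("RGB")) for p in paths])

# Per-pixel median across the stack: each pixel takes the value that "wins the vote".
median = np.median(stack, axis=0).astype(np.uint8)
Image.fromarray(median).save("median_vote.png")
```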
There is a median filter in the GMIC plugin/filter suite. Otherwise if you have good computer skills (some install tweaks required) there is an experimental one in Python.
However, the median filter doesn't erase the background, so the technique is likely more complex than the tourist-removal one. Can you show a sample picture?
I'm trying to collect my own training data set for image detection (well, recognition for now). Right now, I have 4 classes and 750 images for each. Each image is just a regular picture of its class; however, some images are blurry or contain outside elements, such as different backgrounds or other factors (but nothing distinguishable). Using that training data set, image recognition is really bad.
My questions are:
1. Does the training image set need to contain the object in various backgrounds/settings/environments (I believe not...)?
2. Let's just say training worked fairly accurately and I want to know the location of the object in the image. I figure there is no way I can find the location using image recognition alone, so if I use bounding boxes, how/where in the code can I see the location of a bounding box?
Thank you in advance!
It is difficult to know in advance which features your program will learn for each class. But then again, if your unseen images will have the same background, the background will play no role. I would suggest data augmentation during training: random color distortion, random flipping, random cropping.
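For instance, with TensorFlow such augmentation could be wired into the input pipeline roughly like this; a sketch where the directory name, image sizes and distortion ranges are all placeholders:

```python
import tensorflow as tf

# Placeholder directory: one sub-folder per class, as image_dataset_from_directory expects.
train_ds = tf.keras.utils.image_dataset_from_directory(
    "training_images", image_size=(256, 256), batch_size=None)

def augment(image, label):
    # Random flips, colour distortion and cropping, as suggested above.
    image = tf.cast(image, tf.float32) / 255.0
    image = tf.image.random_flip_left_right(image)
    image = tf.image.random_brightness(image, max_delta=0.2)
    image = tf.image.random_saturation(image, lower=0.8, upper=1.2)
    image = tf.image.random_crop(image, size=[224, 224, 3])
    return image, label

train_ds = train_ds.map(augment).shuffle(1000).batch(32)
```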
You can't see in the code where the bounding box is. You have to label/annotate the objects yourself first in your collected data, using a tool such as LabelMe. Then comes training the object detector.
I've been searching around the web for how to do this, and I know that it needs to be done with OpenCV. The problem is that all the tutorials and examples I find are for detecting separate shapes or for template matching.
What I need is a way to detect the contents between 3 circles (which can be a photo or something else). From what I've read, it's not too difficult to find the circles with the camera using contours, but how do I extract what is between them? The circles work like a pattern on the image to grab what is "inside the pattern".
Do I need to use the contours of each circle and measure the distance between them to grab my contents? If so, what if the image is a bit rotated/distorted on the camera?
I'm using Xamarin.iOS for this, but from what I've already seen, I believe I need to go native, and any Objective-C example is welcome too.
EDIT
Imagine that the image captured by the camera is this:
What I want is to match the 3 circles and get the following part of the image as result:
Since the images come from the camera, they can be rotated or scaled up/down.
The warpAffine function will let you map the desired area of the source image to a destination image, performing cropping, rotation and scaling in a single go.
Talking about rotation and scaling seems to indicate that you want to extract a rectangle of a given aspect ratio, hence perform a similarity transform. To define such a transform, three points are too many; two suffice. The construction of the affine matrix is a little tricky.
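A sketch of that, shown in Python with OpenCV for brevity (the same calls exist in the native APIs), assuming you have already matched two reference points, e.g. two of the circle centres, between the camera frame and the layout you want to extract; all coordinates and file names are placeholders:

```python
import cv2
import numpy as np

# Placeholder coordinates: two reference points detected in the camera frame
# and where they should land in the extracted image.
src_pts = np.float32([[412, 310], [887, 295]])
dst_pts = np.float32([[40, 40], [460, 40]])
out_size = (500, 350)  # width, height of the extracted image

# Two correspondences are enough to define a similarity transform
# (rotation + uniform scale + translation).
matrix, _ = cv2.estimateAffinePartial2D(src_pts, dst_pts)

image = cv2.imread("camera_frame.jpg")  # placeholder file name
extracted = cv2.warpAffine(image, matrix, out_size)
cv2.imwrite("extracted.jpg", extracted)
```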
I'm searching for methods of text recognition based on document borders.
Or methods that can solve the problem of finding a new viewpoint.
For example, the camera is at point (x1,y1,z1) and the resulting picture has perspective distortions, but we can find a point (x2,y2,z2) for the camera that corrects the picture.
Thanks.
The usual approach, which assumes that the document's page is approximately flat in 3D space, is to warp the quadrangle encompassing the page into a rectangle. To do so you must estimate a homography, i.e. a (linear) projective transformation between the original image and its warped counterpart.
The estimation requires matching points (or lines) between the two images, and a common choice for documents is to map the page corners in the original image to the image corners of the warped image. This will in general produce a rectangle with an incorrect aspect ratio (i.e. the warped page will look "wider" or "taller" than the real one), but this can be easily corrected if you happen to know the real aspect ratio in advance (for example, because you know the type of paper used, whether letter, A4, etc.).
A simple algorithm to perform the estimation is the so-called Direct Linear Transformation.
The OpenCV library contains routines to help accomplish all these tasks; look into it.
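For instance, a minimal OpenCV sketch of the warp in Python, assuming the four page corners have already been found; the coordinates and file names are placeholders:

```python
import cv2
import numpy as np

# Placeholder corner coordinates of the page in the photo,
# ordered top-left, top-right, bottom-right, bottom-left.
corners = np.float32([[102, 83], [781, 120], [735, 1012], [64, 968]])

# Target rectangle: pick width/height to match the real paper's aspect ratio
# (e.g. 1:sqrt(2) for A4) to avoid the "wider"/"taller" distortion mentioned above.
width, height = 700, 990
target = np.float32([[0, 0], [width, 0], [width, height], [0, height]])

# getPerspectiveTransform solves the homography exactly from four point pairs
# (the minimal case of the Direct Linear Transformation).
H = cv2.getPerspectiveTransform(corners, target)

page = cv2.imread("document_photo.jpg")  # placeholder file name
warped = cv2.warpPerspective(page, H, (width, height))
cv2.imwrite("rectified_page.jpg", warped)
```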