Capture a live video of handwriting using pen and paper and replace the hand in video with some object or cursor - video-capture

I want to process the captured video. I will try to capture the video of handwriting on paper / drawing on paper. But I do not want to show the hand or pen on the paper while live streaming via p5.js.
Can this be done using machine learning?
Any idea how to implement this?

If I understand you right you want to detect where in the image the hand is a draw an overlay on this position right?
If so You can use YOLO more information to detect where the hand is.
There are some trained networks that you can download maybe they are good enough, maybe you have to train your own just for handy.
There are also some libery for yolo and JS https://github.com/ModelDepot/tfjs-yolo-tiny

You may not need to go the full ML object segmentation route.
If the paper's position and illumination are constant (or at least knowable) you could try some simple heuristic comparing the pixels in the current frame with a short history and using the most constant pixel values. There might be some lag as new parts of your drawing 'become constant' so maybe you could try some modification to the accumulation, such as if the pixel was white and is going black.

Related

Creating an ML-Model that detects card-values

This is a more generic question about training an ML-Model to detect cards.
The cards are a kid's game, 4 different colors, numbers and symbols. I don't need to detect the color, just the value (a.k.a symbol) of the cards.
I tried to take pictures with my iPhone of every card, used RectLabel to draw the rectangles around the symbols in the upper left corner (the cards have an upside down-symbol in the lower right corner, too, I didn't mark these as they'll be hidden during detection).
I cropped the images so only the card is visible, no surroundings.
Then I uploaded my images to app.roboflow.ai and let them do their magic (using Auto-Orient, Resize to 416x416, Grayscale, Auto-Adjust Contrast, Rotation, Shear, Blur and Noise).
That gave me another set of images which I used to train my model with CreateML from Apple.
However, when I use that model in my app (I'm using the Breakfast Finder Demo from Apple), the cards values aren't detected - well, sometimes it works, but only at a certain distance from the phone and the labels are either upside down or sideways.
My guess is this is because my images aren't taken the way they should be?
Any hints on how I'd have to set this whole thing up so my model gets trained well?
My bet would be on this being the problem:
I cropped the images so only the card is visible, no surroundings
You want your training images to be as similar as possible to the images your model will see in the wild. If it's trained only on images of cards with no surroundings and then you show it images of cards with things around them it won't know what to do.
This UNO scoring example is extremely similar to your problem and might provide some ideas and guidance.

How a robust background removal is implemented?

I found that a deep-learning-based method (e.g., 1) is much more robust than a non-deep-learning-based method (e.g., 2, using OpenCV).
https://www.remove.bg
How do I remove the background from this kind of image?
In the OpenCV example, Canny is used to detect the edges. But this step can be very sensitive to the image. The contour detection may end up with wrong contours. It is also difficult to determine which contours should be kept.
How a robust deep-learning method is implemented? Is any good example code? Thanks.
For that to work you need to use Unet. You can search for that on github.
Unet transofrm is: I->I.
Space of the image will become image (of same or similar size).
You need to have say 10.000 images with bg removed. People, (long hair people), cats, cars, shoes, T-shirts, etc.
So you set different backgrounds on all these images as source and prediction should be images with removed background.
You can also do a segmentation model and when you find the foreground you can remove the bg.

How to detect an image between shapes from camera

I've been searching around the web about how to do this and I know that it needs to be done with OpenCV. The problem is that all the tutorials and examples that I find are for separated shapes detection or template matching.
What I need is a way to detect the contents between 3 circles (which can be a photo or something else). From what I searched, its not to difficult to find the circles with the camera using contours but, how do I extract what is between them? The circles work like a pattern on the image to grab what is "inside the pattern".
Do I need to use the contours of each circle and measure the distance between them to grab my contents? If so, what if the image is a bit rotated/distorted on the camera?
I'm using Xamarin.iOS for this but from what I already saw, I believe I need to go native for this and any Objective C example is welcome too.
EDIT
Imagining that the image captured by the camera is this:
What I want is to match the 3 circles and get the following part of the image as result:
Since the images come from the camera, they can be rotated or scaled up/down.
The warpAffine function will let you map the desired area of the source image to a destination image, performing cropping, rotation and scaling in a single go.
Talking about rotation and scaling seem to indicate that you want to extract a rectangle of a given aspect ratio, hence perform a similarity transform. To define such a transform, three points are too much, two suffice. The construction of the affine matrix is a little tricky.

iOS - 2d image turn into a 3d

I was checking out this cool app called Morfo. According to their product description -
Use Morfo to quickly turn a photo of your friend's face into a
talking, dancing, crazy 3D character! Once captured, you can make your
friend say anything you want in a silly voice, rock out, wear makeup,
sport a pair of huge green cat eyes, suddenly gain 300lbs, and more.
So if you take a normal 2D image of steve jobs & feed it to this app it converts it into a 3D model of that image & the user can interact with it.
My questions are as following -
How are they doing this?
How is this possible in iPad?
Isn't it computationally intensive to render and convert 2D image into 3D?
Any pointers, links to websites or libraries in objectiveC which do this is very much appreciated.
UPDATE: this demo of this product here shows how morfo, uses a template mechanism to do the conversion. i.e. after a 2D image is fed, one needs to set the boundaries of the face, where the eyes are located, size & length of lips. then it goes off to convert it into a 3D model. How is this part done? What frameworks or libraries they might be using?
This is a broad question but i can point you in the right direction of how 3D Rendering works, trust me this is a huge subject with decades of work behind it and to much to put here. Not sure how up to speed you are on 3D Rendering techniques so i will give you a basic idea of texturing and point you to a good set of tutorials.
How are they doing this?
The idea is that in 3D Rendering, 3D models can be textured with a 2d image known as a texture map. You use a 2D image and wrap it around a 3d model, be that a simple primitive like a sphere of a cube or more advanced such as the classic teapot or the model of a human head e.t.c. A texture can be taken from anywhere, I have used the camera feed in the past to texture meshes with the video from the camera stream, I have used photos from the camera which s how there doing it. So this is how the face is rendered to the 3D Model.
Is this efficient?
On iOS and most mobile devices 3D rendering uses hardware acceleration utilizing OpenGLES. In regards to your question this is really fast depending on how you implement your render code.
The way it uses the mapping (scale rotate template in the video) as mentioned by anticyclope allows you to make the texture fit a model and also place the eyes which are part of there render code.
So if you want to pick this up i recommend reading Jeff Lamarche Tutorial "from the ground up" as a primer:
http://iphonedevelopment.blogspot.com/2009/05/opengl-es-from-ground-up-table-of.html
Second to that i have read about 4 books on OpenGLES, for general design and for platforms specifics. I recommend this book:
http://www.amazon.co.uk/iPhone-Programming-Developing-Graphical-Applications/dp/0596804822/ref=sr_1_1?ie=UTF8&qid=1331114559&sr=8-1
In my opinion, there is how they doing it. Just my thoughts, haven't saw the application in real-life.
They have a 3D model of human's head. When you click on certain points on 2D image, they are adjusting corresponding points in 3D model, so it is represents a specific face's features like distance between eyes, lips width and so on. Next, texture from 2D image is applied to 3D model using that control points, so we have a textured 3D model of human's head. Given the fact, that our perception is able to reconstruct a 3D shape from 2D images (say, we looking at 2D photo and still imagining a 3D person), there's no need to reconstruct 3D shape accurately, texture will do the work.
There is an issue in the rendering of 3D images, called UV mapping, takes the 3D model and defines a set of edges, and this creates an image that is used to generate different textures to the model.
Now if you notice in Morfo, you define the edge of the head, eyes, mouth and nose. with this information the Morfo knows how to place it texture to the model that has defined.
the process of loading a texture on a model is not very complex and this can be done on any device that has support of some technology such as OpenGL
Isn't it computationally intensive to render and convert 2D image into 3D?
Apple is sinking billions of dollars into developing custom chipsets, and recent models have impressive performance, considering the battery life and low operating temperature (no fans).

Finding significant images in a set of surveillance camera images

I've had theft problems outside my house so I setup a simple webcam to capture every second with Dorgem (http://dorgem.sf.net).
Dorgem does offer a feature to use motion detection to only capture frames where something is moving on the screen. The problem is that the motion detection algorithm it uses is extremely sensitive. It goes off because of variations in color between successive shots on my cheap webcam, and it also goes off because the trees in front of the house are blowing in the wind. Additionally, the front of my house is a high traffic area so there is also a large number of legitimately captured frames.
I average capturing 2800/3600 frames every second using Dorgem's motion detection. This is too much for me to search through to find out where the interesting activity is.
I wish I could re-position the camera to a more optimal position where it would only capture the areas I'm interested in, so that motion detection would be simpler, however this is not an option for me.
I think that because my camera has a fixed position and each picture frames the same area in front of my house, then I should be able to scan the images and figure out which ones have motion in some interesting region of that image, throwing out all other frames.
For example: if there's a change in pixel 320,240 then someone has stepped in front of my house and I want to see that frame, but if there's a change in pixel 1,1 then its just the trees blowing in the wind and the frame can be discarded.
I've looked at pdiff, a tool for finding diffs in sets of pictures, but it seems to be also focused on diffing the entire picture, rather than a specific region of it:
http://pdiff.sourceforge.net/
I've also looked at phash, a tool for calculating a hash based on human perception of an image, but it seems too complex:
http://www.phash.org/
I suppose I could implement it in a shell script using imagemagick's mogrify -crop to cherry pick the regions of the image I'm interested in, then running pdiff to find the interesting ones, and using that to pick out the interesting frames.
Any thoughts? ideas? existing tools?
cropping and then using pdiff seems like the best choice to me.