How to decide the weight of Luma and Chroma distortion in the RDO process? - hevc

In the HEVC reference software HM, the distortion used in the RDO process contains both luma and chroma components, and they have different weights. By debugging the code, I found that the chroma weight is higher than 1. Does anyone know how these weights are decided? Are there any papers discussing this problem?

I will answer my own question.
This topic is discussed in the JCT-VC proposal F386, which covers the RDO weighting of chroma.

Is TensorFlow the way to go for this optimization problem?

I have to optimize the result of a process that depends on a large number of variables, e.g. a laser engraving system where the engraving depth depends on the laser speed, distance, power and so on.
The final objective is the minimization of the engraving time, or the maximization of the laser speed. All the other parameters can vary, but must stay within safe bounds.
I have never used any machine learning tools, but to my very limited knowledge this seems like a good use case for TensorFlow or any other machine learning library.
I would experimentally gather data points to train the algorithm, test it and then use a gradient descent optimizer to find the parameters (within bounds) that maximize the laser travel velocity.
Does this sound feasible? How would you approach such a problem? Can you link to any examples available online?
Thank you,
Riccardo
I’m not quite sure I understood the problem correctly; could you add some example data and a desired output?
As far as I understand, it could be feasible to use TensorFlow, but I believe there are better solutions to this problem. Let me expand on this.
TensorFlow is a framework focused on the development of deep learning models. These usually require lots of data (how much really depends on the problem), and I don't believe the data you could gather manually would be enough unless your team is quite big or you already have some data collected.
Also, since you have a minimization (or maximization) problem over variables that lie within known ranges, this looks like a case for operations research (mathematical optimization) rather than machine learning. Check this example of OR.
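To make the operations-research suggestion concrete, here is a minimal sketch using SciPy's SLSQP solver. The engraving-depth model, its coefficients, the target depth and the bounds are all invented for illustration; in practice you would first fit the model to your experimental data.

```python
from scipy.optimize import minimize

# Hypothetical fitted model: engraving depth as a function of speed and power.
# In a real system these coefficients would come from your measurements.
def depth(x):
    speed, power = x
    return 0.5 * power / speed  # toy model: deeper at low speed / high power

TARGET_DEPTH = 0.2  # required depth in arbitrary units (made-up value)

# Maximizing speed == minimizing -speed, subject to reaching the target
# depth and keeping every parameter inside its safe range.
result = minimize(
    lambda x: -x[0],                    # objective: negative laser speed
    x0=[1.0, 1.0],                      # initial guess (speed, power)
    method="SLSQP",
    bounds=[(0.1, 10.0), (0.1, 5.0)],   # safe bounds for speed and power
    constraints=[{"type": "ineq",
                  "fun": lambda x: depth(x) - TARGET_DEPTH}],
)
print(result.x)  # optimal (speed, power) under the toy model
```

The same pattern extends to distance and any other bounded parameters; each one simply becomes another entry in `x0` and `bounds`.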

Should deep learning classification be used to classify details such as liquid level in bottle

Can deep learning classification be used to precisely label/classify both an object and one of its features? For example, to identify the bottle (like Grants Whiskey) and the liquid level in the bottle (in 10 percent steps - like 50% full). Is this a problem best solved using one of the deep learning frameworks (TensorFlow etc.), or is some other approach more effective?
Well, this should be quite possible if the liquor is well colored. If not (e.g. gin, vodka), I'd say you have no chance with today's technologies when observing the object from a natural viewing angle and distance.
For colored liquor, I'd train two detectors. One for detecting the bottle, and a second one to detect the liquor given the bottle. The ratio between the two will be your percentage.
Some of the proven state-of-the-art deep learning-based object detectors (just Google them):
Multibox
YOLO
Faster RCNN
Or non-deep-learning-based:
Deformable part model
EDIT:
I was asked to elaborate more. Here is an example:
The box detector draws, e.g., a box in the image at [0.1, 0.2, 0.5, 0.6] (min_height, min_width, max_height, max_width), which is the relative location of your bottle.
Now you crop the bottle from the original image and feed it to the second detector. The second detector draws, e.g., [0.2, 0.3, 0.7, 0.8] in your cropped bottle image; this location indicates the fluid it has detected. Now (0.7 - 0.2) * (0.8 - 0.3) = 0.25 is the relative area of the fluid with respect to the area of the bottle, which is what OP is asking for.
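The two-box arithmetic above is easy to sketch in code; `box_area` and `fill_ratio` are hypothetical helper names, and the boxes use the same (min_height, min_width, max_height, max_width) relative convention as the example:

```python
def box_area(box):
    """Area of a (min_h, min_w, max_h, max_w) box in relative coordinates."""
    min_h, min_w, max_h, max_w = box
    return max(0.0, max_h - min_h) * max(0.0, max_w - min_w)

def fill_ratio(fluid_box_in_crop):
    """Fluid area relative to the bottle.

    The second detector runs on the cropped bottle image, so its box is
    already expressed relative to the bottle; its area therefore *is* the
    fluid-to-bottle ratio.
    """
    return box_area(fluid_box_in_crop)

print(fill_ratio((0.2, 0.3, 0.7, 0.8)))  # ~0.25, as in the worked example
```

If you care about the fill *level* rather than area, the height fraction `max_h - min_h` alone may be the more natural measure.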
EDIT 2:
I entered this reply assuming OP wants to use deep learning. I'd agree other methods should be considered if OP is still unsure with deep learning. For bottle detection, deep learning-based methods have shown to outperform traditional methods by a large margin. Bottle detection happens to be one of the classes in the PASCAL VOC challenge. See results comparison here: http://rodrigob.github.io/are_we_there_yet/build/detection_datasets_results.html#50617363616c20564f43203230313020636f6d7034
For the liquid detection however, deep learning might be slightly overkill. E.g. if you know what color you are looking for, even a simple color filter will give you "something"....
The rule of thumb for deep learning is: if the answer is visible in the image, i.e. an expert could tell you the answer solely from the image, then the chances are very high that you can learn it with deep learning, given enough annotated data.
However, you are quite unlikely to have the data required for such a task, so I would ask myself whether the problem can be simplified. For example, you could take gin, vodka and so on, use SIFT features to find the bottle again in a new scene, then use RANSAC to verify the bottle detection and cut the bottle out of the image.
Then I would try gradient features to find the edge at the liquid level. Finally, you can calculate the percentage as (liquid edge - bottle bottom) / (bottle top - bottle bottom).
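A minimal sketch of that last step, assuming you already have a grayscale crop of the bottle plus its top/bottom row indices, and that the liquid surface produces the strongest horizontal edge (the function name and these simplifications are mine):

```python
import numpy as np

def liquid_level_percent(bottle_gray, top_row, bottom_row):
    """Estimate the fill percentage from a grayscale bottle crop.

    bottle_gray: 2D array (rows x cols), row index increasing downward.
    top_row, bottom_row: row indices of the bottle's top and bottom.
    """
    # Average intensity of each row inside the bottle region.
    rows = bottle_gray[top_row:bottom_row].mean(axis=1)
    # Vertical gradient: large where intensity jumps between adjacent rows,
    # which we take to be the liquid surface.
    grad = np.abs(np.diff(rows))
    liquid_row = top_row + int(np.argmax(grad))
    # (liquid edge - bottle bottom) / (bottle top - bottle bottom), as a %.
    return 100.0 * (bottom_row - liquid_row) / (bottom_row - top_row)
```

A real implementation would smooth the intensity profile and mask out edges caused by the label or cap.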
Identifying the bottle label should not be hard to do - it's even available "on tap" for cheap (these guys actually use it to identify wine bottle labels on their website): https://aws.amazon.com/blogs/aws/amazon-rekognition-image-detection-and-recognition-powered-by-deep-learning/
Regarding the liquid level, this might be a problem AWS wouldn't be able to solve straight away - but I'm sure it's possible to make a custom CNN for it. Alternatively, you could use good old humans on Amazon Mechanical Turk to do the work for you.
(Note: I do not work for Amazon)

determine camera rotation and translation matrix from essential matrix

I am trying to extract rotation matrix and translation matrix from essential matrix.
I took these answers as reference:
Correct way to extract Translation from Essential Matrix through SVD
Extract Translation and Rotation from Fundamental Matrix
Now I've done the above steps, applying SVD to the essential matrix, but here comes the problem. As I understand it, both R and t have two possible values, which leads to 4 possible solutions for [R|t]. However, only one of the solutions fits the physical situation.
My question is how can I determine which one of the 4 solutions is the correct one?
I am just a beginner on studying camera position. So if possible, please make the answer be as clear (but simple) as possible. Any suggestion would be appreciated, thanks.
The simplest approach is to test the 3D position of a reconstructed point against each candidate: the point will lie in front of both cameras in only one of the 4 possible solutions.
So, assuming one camera matrix is P = [I|0], you have 4 options for the other camera, but only one of the pairs will place such a point in front of both of them.
More details in Hartley and Zisserman's Multiple View Geometry (page 259).
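A rough NumPy sketch of that cheirality test, following the standard Hartley and Zisserman decomposition (function names are mine; a real implementation would vote over many correspondences and tolerate noise):

```python
import numpy as np

def triangulate(P1, P2, x1, x2):
    """Linear (DLT) triangulation of one correspondence (normalized coords)."""
    A = np.array([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X / X[3]

def select_pose(E, x1, x2):
    """Pick the (R, t) candidate that puts the point in front of both cameras.

    x1, x2: normalized image coordinates (x, y) of one correspondence.
    Returns R and a unit-norm t (the translation scale is unobservable).
    """
    U, _, Vt = np.linalg.svd(E)
    # Make sure we end up with proper rotations (determinant +1).
    if np.linalg.det(U) < 0:
        U = -U
    if np.linalg.det(Vt) < 0:
        Vt = -Vt
    W = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
    P1 = np.hstack([np.eye(3), np.zeros((3, 1))])
    for R in (U @ W @ Vt, U @ W.T @ Vt):       # the two rotation candidates
        for t in (U[:, 2], -U[:, 2]):          # the two translation signs
            P2 = np.hstack([R, t.reshape(3, 1)])
            X = triangulate(P1, P2, x1, x2)
            depth1 = X[2]                      # depth in the first camera
            depth2 = (R @ X[:3] + t)[2]        # depth in the second camera
            if depth1 > 0 and depth2 > 0:
                return R, t
    raise ValueError("no candidate places the point in front of both cameras")
```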
If you can use OpenCV (version 3.0+), it provides a function called recoverPose that will do that job for you.
Ref: OpenCV documentation, http://docs.opencv.org/trunk/modules/calib3d/doc/calib3d.html

suggestion on implementation of visual odometry using monocular camera

How can I overcome the scale problem in monocular visual odometry?
Can someone suggest a good paper on the implementation?
Visual-inertial fusion is the most recent approach used for monocular odometry. You need to calculate the scale between world units and meters.
You can use an Extended Kalman Filter approach or an optimization-based one, as you prefer. Try this paper: http://www.ee.ucr.edu/~mourikis/papers/Li2012-ICRA.pdf
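Whichever estimator you use, the scale itself boils down to a one-parameter least-squares fit between the up-to-scale visual displacements and metric displacements over the same time intervals (e.g. obtained from inertial integration). A sketch, with a hypothetical function name:

```python
import numpy as np

def estimate_scale(vo_disp, metric_disp):
    """Least-squares scale s between up-to-scale VO displacement vectors
    and metric displacement vectors covering the same intervals.

    Minimizing sum ||s * v_i - m_i||^2 gives the closed form
        s = sum(v_i . m_i) / sum(v_i . v_i).
    """
    vo = np.asarray(vo_disp, dtype=float)
    metric = np.asarray(metric_disp, dtype=float)
    return float(np.sum(vo * metric) / np.sum(vo * vo))
```

In a full visual-inertial pipeline this scale would be estimated jointly with biases and gravity inside the filter or optimizer, not in a separate batch step.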

FFT Pitch Detection for iOS using Accelerate Framework?

I have been reading up on FFT and Pitch Detection for a while now, but I'm having trouble piecing it all together.
I have worked out that the Accelerate framework is probably the best way to go with this, and I have read Apple's example code to see how to use it for FFTs. What would the input data to the FFT be if I wanted to run pitch detection in real time? Do I just pass in the audio stream from the microphone? How would I do this?
Also, after I get the FFT output, how can I get the frequency from it? I have been reading everywhere and can't find any examples or explanations of this.
Thanks for any help.
Frequency and pitch are not the same thing - frequency is a physical quantity, pitch is a psychological percept - they are similar, but there are important differences, which may or may not matter to you, depending on the type of instrument for which you are trying to measure pitch.
You need to read up a little on the various pitch detection algorithms (and on the meaning of pitch itself), decide what algorithm you want to use and only then set about implementing it. See this Wikipedia page for a good overview of pitch and pitch detection (note that you can use FFT for the autocorrelation-based and frequency domain methods).
As for using the FFT to identify peaks in a spectrum and their associated frequencies, there are many questions and answers related to this on SO already, see for example: How do I obtain the frequencies of each value in an FFT?
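To make the bin-to-frequency mapping concrete: bin k of an N-point FFT at sample rate fs sits at k*fs/N Hz, so the peak-magnitude bin gives a frequency estimate that is only as fine as the bin spacing. A small NumPy sketch with a synthetic 1 kHz tone:

```python
import numpy as np

fs = 44100          # sample rate in Hz
n = 4096            # FFT size
t = np.arange(n) / fs
signal = np.sin(2 * np.pi * 1000.0 * t)   # 1 kHz test tone

spectrum = np.abs(np.fft.rfft(signal))
freqs = np.fft.rfftfreq(n, d=1.0 / fs)    # frequency of bin k is k * fs / n
peak = freqs[np.argmax(spectrum)]
print(peak)  # ~1001 Hz: the 1 kHz tone snapped to the ~10.8 Hz bin grid
```

For finer estimates you would interpolate between bins (e.g. parabolic interpolation on the log magnitudes) or use a longer window.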
I have an example implementation of an autocorrelation function available online for iOS 5.1. Look at this post for a link to the implementation AND functions for finding the nearest note and creating a string representing the pitch (A, A#, B, B#, etc...)
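Platform aside, the autocorrelation method itself is compact. Here is a language-neutral NumPy sketch (the function name and pitch-range defaults are my own choices) that picks the lag of strongest self-similarity and converts it to a frequency:

```python
import numpy as np

def autocorr_pitch(signal, fs, fmin=50.0, fmax=1000.0):
    """Estimate pitch (Hz) of a mono frame via the autocorrelation method.

    Searches for the lag with the strongest self-similarity inside the
    plausible pitch range, then returns pitch = fs / best_lag.
    """
    signal = signal - np.mean(signal)
    # Full autocorrelation; keep non-negative lags only.
    ac = np.correlate(signal, signal, mode="full")[len(signal) - 1:]
    lag_min = int(fs / fmax)   # smallest lag = highest allowed pitch
    lag_max = int(fs / fmin)   # largest lag  = lowest allowed pitch
    best_lag = lag_min + int(np.argmax(ac[lag_min:lag_max]))
    return fs / best_lag
```

On iOS the same autocorrelation can be computed with Accelerate's FFT routines (correlation via the frequency domain), which is much faster than the O(N^2) `np.correlate` shown here.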
While the FFT is very useful in many applications, it might not be the most accurate choice if you're trying to do simple pitch detection. (It can be as accurate, but then you have to work with the complex output and do a lot of phase calculations.)