How are samples inside a PU calculated in the intra mode of HEVC? - hevc

I've read several articles about intra prediction in HEVC and I still have some questions.
For a PU of NxN pixels, we use 4xN + 1 reference samples (the row above the PU, the column at the left of the PU and the sample at the top left). Then, based on the MPM, a mode is selected to work with.
I now have a row of reference samples, a column of reference samples and a mode. Based on this, how are the samples inside the PU calculated ?
In this article http://codepaint.kaist.ac.kr/wp-content/uploads/2013/10/Intra-Coding-of-the-HEVC-Standard.pdf there are ready-to-use formulae which take coordinate and selected mode as parameters. Is it really that simple ?
Now, imagine we have a picture of a checkerboard. How intra prediction can be used ? In some cases, we might not want to use reference samples of previously decoded PU. How to deal with that ?
Thanks

I now have a row of reference samples, a column of reference samples
and a mode. Based on this, how are the samples inside the PU
calculated ?
As it is stated in this article first encoder should decide about the mode and the sizes of PUs and TUs during the RDO process. Among the list of
modes lets say mode number 25 is chosen to predict the current block. Mode number 25 is one of angular modes so we will use the mentioned formula for
angular modes and obtain the output. It worth mentioning that although formula is simple details of reference samples make it a little tricky.
Now, imagine we have a picture of a checkerboard. How intra prediction
can be used ?
First the prediction modes should be found. Lets say we decided on mode X then we should refer to the related formula to mode X and form our prediction block simililar to what discussed in previous question.
In some cases, we might not want to use reference
samples of previously decoded PU. How to deal with that ?
Intra prediction basically is formed based on these reference samples and if you are not using these pixels your not doing INTRA prediction. Maybe you should shift to INTER prediction where it uses other blocks in successive frames and MVs to predict the current block.

The question is interest for me.
I can easy to say that the mode is selected by encode.
In the HEVC encoder, it run all the mode(35, in the view of complexity,encoder uses fast algorithm to simplify the selection process, you can find some paper to read), finally encoder selects the best mode(RDO process). so,decoder can not select reference sample. decoder have to select the samples which are same with encoder.
In the SCC(screen content coding) coding which is a extension of the HEVC, using IBC(intra block copy) mode to select the reference sample in reconstructed area.

Related

Understanding Time2Vec embedding for implementing this as a keras layer

The paper time2vector link (the relevant theory is in section 4) shows an approach to include a time embedding for features to improve model performance. I would like to give this a try. I found a implementation as keras layer which I changed a little bit. Basically it creates two matrices for one feature:
(1) linear = w * x + b
(2) periodic = sin(w * x + b)
Currently I choose this feature manually. Concerning the paper there are a few things i don't understand. The first thing is the term k as the number of sinusoids. The authors use up to 64 sinusoids. What does this mean? I have just 1 sinusoid at the moment, right? Secondly I'm about to put every feature I have through the sinus transformation for me dataset that would make 6 (sinusoids) periodic features. The authors use only one linear term. How should I choose the feature for the linear term? Unfortunately the code from the paper is not available anymore. Has anyone worked with time embeddings or even with this particularly approach?
For my limited understanding, the linear transformation of time is a fixed element of the produced embedding and the parameter K allows you to select how many different learned time representations you want to use in your model. So, the resulting embedding has a size of K+1 elements.

TensorFlow Object Detection API: evaluation mAP behaves weirdly?

I am training an object detector for my own data using Tensorflow Object Detection API. I am following the (great) tutorial by Dat Tran https://towardsdatascience.com/how-to-train-your-own-object-detector-with-tensorflows-object-detector-api-bec72ecfe1d9. I am using the provided ssd_mobilenet_v1_coco-model pre-trained model checkpoint as the starting point for the training. I have only one object class.
I exported the trained model, ran it on the evaluation data and looked at the resulted bounding boxes. The trained model worked nicely; I would say that if there was 20 objects, typically there were 13 objects with spot on predicted bounding boxes ("true positives"); 7 where the objects were not detected ("false negatives"); 2 cases where problems occur were two or more objects are close to each other: the bounding boxes get drawn between the objects in some of these cases ("false positives"<-of course, calling these "false positives" etc. is inaccurate, but this is just for me to understand the concept of precision here). There are almost no other "false positives". This seems much better result than what I was hoping to get, and while this kind of visual inspection does not give the actual mAP (which is calculated based on overlap of the predicted and tagged bounding boxes?), I would roughly estimate the mAP as something like 13/(13+2) >80%.
However, when I run the evaluation (eval.py) (on two different evaluation sets), I get the following mAP graph (0.7 smoothed):
mAP during training
This would indicate a huge variation in mAP, and level of about 0.3 at the end of the training, which is way worse than what I would assume based on how well the boundary boxes are drawn when I use the exported output_inference_graph.pb on the evaluation set.
Here is the total loss graph for the training:
total loss during training
My training data consist of 200 images with about 20 labeled objects each (I labeled them using the labelImg app); the images are extracted from a video and the objects are small and kind of blurry. The original image size is 1200x900, so I reduced it to 600x450 for the training data. Evaluation data (which I used both as the evaluation data set for eval.pyand to visually check what the predictions look like) is similar, consists of 50 images with 20 object each, but is still in the original size (the training data is extracted from the first 30 min of the video and evaluation data from the last 30 min).
Question 1: Why is the mAP so low in evaluation when the model appears to work so well? Is it normal for the mAP graph fluctuate so much? I did not touch the default values for how many images the tensorboard uses to draw the graph (I read this question: Tensorflow object detection api validation data size and have some vague idea that there is some default value that can be changed?)
Question 2: Can this be related to different size of the training data and the evaluation data (1200x700 vs 600x450)? If so, should I resize the evaluation data, too? (I did not want to do this as my application uses the original image size, and I want to evaluate how well the model does on that data).
Question 3: Is it a problem to form the training and evaluation data from images where there are multiple tagged objects per image (i.e. surely the evaluation routine compares all the predicted bounding boxes in one image to all the tagged bounding boxes in one image, and not all the predicted boxes in one image to one tagged box which would preduce many "false false positives"?)
(Question 4: it seems to me the model training could have been stopped after around 10000 timesteps were the mAP kind of leveled out, is it now overtrained? it's kind of hard to tell when it fluctuates so much.)
I am a newbie with object detection so I very much appreciate any insight anyone can offer! :)
Question 1: This is the tough one... First, I think you don't understand correctly what mAP is, since your rough calculation is false. Here is, briefly, how it is computed:
For each class of object, using the overlap between the real objects and the detected ones, the detections are tagged as "True positive" or "False positive"; all the real objects with no "True positive" associated to them are labelled "False Negative".
Then, iterate through all your detections (on all images of the dataset) in decreasing order of confidence. Compute the accuracy (TP/(TP+FP)) and recall (TP/(TP+FN)), only counting the detections that you've already seen ( with confidence bigger than the current one) for TP and FP. This gives you a point (acc, recc), that you can put on a precision-recall graph.
Once you've added all possible points to your graph, you compute the area under the curve: this is the Average Precision for this category
if you have multiple categories, the mAP is the standard mean of all APs.
Applying that to your case: in the best case your true positive are the detections with the best confidence. In that case your acc/rec curve will look like a rectangle: you'd have 100% accuracy up to (13/20) recall, and then points with 13/20 recall and <100% accuracy; this gives you mAP=AP(category 1)=13/20=0.65. And this is the best case, you can expect less in practice due to false positives which higher confidence.
Other reasons why yours could be lower:
maybe among the bounding boxes that appear to be good, some are still rejected in the calculations because the overlap between the detection and the real object is not quite big enough. The criterion is that Intersection over Union (IoU) of the two bounding boxes (real one and detection) should be over 0.5. While it seems like a gentle threshold, it's not really; you should probably try and write a script to display the detected bounding boxes with a different color depending on whether they're accepted or not (if not, you'll get both a FP and a FN).
maybe you're only visualizing the first 10 images of the evaluation. If so, change that, for 2 reasons: 1. maybe you're just very lucky on these images, and they're not representative of what follows, just by luck. 2. Actually, more than luck, if these images are the first from the evaluation set, they come right after the end of the training set in your video, so they are probably quite similar to some images in the training set, so they are easier to predict, so they're not representative of your evaluation set.
Question 2: if you have not changed that part in the config file mobilenet_v1_coco-model, all your images (both for training and testing) are rescaled to 300x300 pixels at the start of the network, so your preprocessings don't matter.
Question 3: no it's not a problem at all, all these algorithms were designed to detect multiple objects in images.
Question 4: Given the fluctuations, I'd actually keep training it until you can see improvement or clear overtraining. 10k steps is actually quite small, maybe it's enough because your task is relatively easy, maybe it's not enough and you need to wait ten times that to have significant improvement...

Should my seq2seq RNN idea work?

I want to predict stock price.
Normally, people would feed the input as a sequence of stock prices.
Then they would feed the output as the same sequence but shifted to the left.
When testing, they would feed the output of the prediction into the next input timestep like this:
I have another idea, which is to fix the sequence length, for example 50 timesteps.
The input and output are exactly the same sequence.
When training, I replace last 3 elements of the input by zero to let the model know that I have no input for those timesteps.
When testing, I would feed the model a sequence of 50 elements. The last 3 are zeros. The predictions I care are the last 3 elements of the output.
Would this work or is there a flaw in this idea?
The main flaw of this idea is that it does not add anything to the model's learning, and it reduces its capacity, as you force your model to learn identity mapping for first 47 steps (50-3). Note, that providing 0 as inputs is equivalent of not providing input for an RNN, as zero input, after multiplying by a weight matrix is still zero, so the only source of information is bias and output from previous timestep - both are already there in the original formulation. Now second addon, where we have output for first 47 steps - there is nothing to be gained by learning the identity mapping, yet network will have to "pay the price" for it - it will need to use weights to encode this mapping in order not to be penalised.
So in short - yes, your idea will work, but it is nearly impossible to get better results this way as compared to the original approach (as you do not provide any new information, do not really modify learning dynamics, yet you limit capacity by requesting identity mapping to be learned per-step; especially that it is an extremely easy thing to learn, so gradient descent will discover this relation first, before even trying to "model the future").

Feature Pyramid Network with tensorflow/models/object_detection

If I want to implement k = k0 + log2(√(w*h)/224) in Feature Pyramid Networks for Object Detection, where and which file should I change?
Note, this formula is for ROI pooling. W and H are the width and height of ROI, whereas k represents the level of the feature pyramid this ROI should be used on.
*saying the FasterRCNN meta_architecture file of in object_detection might be helpful, but please inform me which method I can change.
Take a look at this document for a rough overview of the process. In a nutshell, you'll have to create a "FeatureExtractor" sub-class for you desired meta-architecture. For FasterRCNN, you can probably start with a copy of our Resnet101 Feature Extractor as a starting point.
The short answer is that the change won't be trivial as we don't currently support cropping regions from multiple layers. Here is an outline of what would need to change if you would like to pursue this anyway:
Generating a new anchor set
Currently Faster RCNN uses a “GridAnchorGenerator” as the first_stage_anchor_generator - instead you will have to use a MultipleGridAnchorGenerator (same as we use in SSD pipeline).
You will have to use a 32^2 anchor box -> for the scales field of the anchor generator, basically you will have to add a .125
You will have to modify the code to generate and crop from multiple layers: to start, look for a function in the faster_rcnn_meta_arch file called "_extract_rpn_feature_maps", which is suggestively named, but currently returns just a single tensor! You will also have to add some logic to determine which layer to crop from based on the size of the proposal (Eqn 1 from the paper)
You will have to finally create a new feature extractor following the directions that Derek linked to.

Has anyone managed to make Asynchronous advantage actor critic work with Mujoco experiments?

I'm using an open source version of a3c implementation in Tensorflow which works reasonably well for atari 2600 experiments. However, when I modify the network for Mujoco, as outlined in the paper, the network refuses to learn anything meaningful. Has anyone managed to make any open source implementations of a3c work with continuous domain problems, for example mujoco?
I have done a continuous action of Pendulum and it works well.
Firstly, you will build your neural network and output mean (mu) and standard deviation (sigma) for selecting an action.
The essential part of the continuous action is to include a normal distribution. I'm using tensorflow, so the code is looks like:
normal_dist = tf.contrib.distributions.Normal(mu, sigma)
log_prob = normal_dist.log_prob(action)
exp_v = log_prob * td_error
entropy = normal_dist.entropy() # encourage exploration
exp_v = tf.reduce_sum(0.01 * entropy + exp_v)
actor_loss = -exp_v
When you wanna sample an action, use the function tensorflow gives:
sampled_action = normal_dist.sample(1)
The full code of Pendulum can be found in my Github. https://github.com/MorvanZhou/tutorials/blob/master/Reinforcement_learning_TUT/10_A3C/A3C_continuous_action.py
I was hung up on this for a long time, hopefully this helps someone in my shoes:
Advantage Actor-critic in discrete spaces is easy: if your actor does better than you expect, increase the probability of doing that move. If it does worse, decrease it.
In continuous spaces though, how do you do this? The entire vector your policy function outputs is your move -- if you are on-policy, and you do better than expected, there's no way of saying "let's output that action even more!" because you're already outputting exactly that vector.
That's where Morvan's answer comes into play. Instead of outputting just an action, you output a mean and a std-dev for each output-feature. To choose an action, you pass your inputs in to create a mean/stddev for each output-feature, and then sample each feature from this normal distribution.
If you do well, you adjust the weights of your policy network to change the mean/stddev to encourage this action. If you do poorly, you do the opposite.