Criteria for choosing an object detection method

I'm in the research phase of my project and I'm trying to build an object detector using a CNN. I know that in general there are two types of CNN object detector: region-proposal-based methods (e.g. R-CNN and R-FCN) and regression/classification-based methods (e.g. YOLO and SSD). The problem is that I'm not sure which method I should use. I would like to know the usual reasoning for choosing one method over the other. There are a few general criteria, such as speed vs. accuracy, but is there any other commonly used reasoning?

There are two categories of detectors: one-stage and two-stage. YOLO, SSD, RetinaNet, CenterNet, etc. are one-stage detectors, while R-FCN, R-CNN, Faster R-CNN, etc. are two-stage detectors.
A direct quote from [1] on the advantages of two-stage detectors compared to one-stage detectors:
"Compared to one-stage detectors, the two-stage ones have the following
advantages: 1) By sampling a sparse set of region proposals, two-stage
detectors filter out most of the negative proposals; while one-stage
detectors directly face all the regions on the image and have a
problem of class imbalance if no specialized design is introduced. 2)
Since two-stage detectors only process a small number of proposals,
the head of the network (for proposal classification and regression)
can be larger than one-stage detectors, so that richer features will
be extracted. 3) Two-stage detectors have high-quality features of
sampled proposals by use of the RoIAlign [10] operation that extracts
the location consistent feature of each proposal; but different region
proposals can share the same feature in one-stage detectors and the
coarse and spatially implicit representation of proposals may cause
severe feature misalignment. 4) Two-stage detectors regress the object
location twice (once on each stage) and the bounding boxes are better
refined than one-stage methods."
A quote from the same paper on accuracy vs. efficiency:
"One-stage detectors are more efficient and elegant in design, but
currently the two-stage detectors have domination in accuracy."
One-stage detectors can be deployed on edge devices such as phones for fast real-time detection. This can save energy compared to more compute-intensive detectors.
In summary, go for a two-stage detector if accuracy is more important; otherwise go for a one-stage detector for faster detection while maintaining good enough accuracy.
The related-work section of [1] contains easy-to-read details, and each of the referenced papers also reviews two-stage vs. one-stage detectors.
Object detection benchmarks
https://paperswithcode.com/task/object-detection
References
[1] MimicDet, https://www.ecva.net/papers/eccv_2020/papers_ECCV/papers/123590528.pdf
[2] Speed/accuracy trade-offs for modern convolutional object detectors, https://arxiv.org/pdf/1611.10012.pdf
[3] RetinaNet, https://arxiv.org/pdf/1708.02002.pdf
[4] Object detection review, https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=9186021
[5] CSPNET, https://arxiv.org/pdf/1911.11929v1.pdf
[6] CenterNet, https://arxiv.org/pdf/1904.08189v3.pdf
[7] EfficientDet, https://arxiv.org/pdf/1911.09070.pdf
[8] SpineNet, https://arxiv.org/pdf/1912.05027.pdf
Related articles
https://jonathan-hui.medium.com/object-detection-speed-and-accuracy-comparison-faster-r-cnn-r-fcn-ssd-and-yolo-5425656ae359
https://www.jeremyjordan.me/object-detection-one-stage/

Reproducibility, Controlling Randomness, Operator-level Randomness in TFF

I have a TFF code that takes a slightly different optimization path while training across different runs, despite having set all the operator-level seeds, numpy seeds for sampling clients in each round, etc. The FAQ section on TFF website does talk about randomness and expectation in TFF, but I found the answer slightly confusing. Is it the case that some aspects of the randomness can't be directly controlled even after setting all the operator-level seeds that one could; because one can't control the way sub-sessions are started and ended?
To be more specific, these are all the operator-level seeds that my code already sets: dataset.shuffle, create_tf_dataset_from_all_clients, keras.initializers and np.random.seed for per-round client sampling (which uses numpy). I have verified that the initial model state is the same across runs, but as soon as training starts, the model states start diverging across different runs. The divergence is gradual/slow in most cases, but not always.
The code is quite complex, so not adding it here.
There is one more source of non-determinism that would be very hard to control: summation of float32 numbers is not associative, so the result depends on the order in which the terms are added.
When you simulate a number of clients in a round, the TFF executor does not have a way to control the order in which the model updates are added together. As a result, there could be some differences in the least significant bits of the float32 results. While this may sound negligible, it can add up over a number of rounds (I have seen it take hundreds, but it could also be fewer), and eventually cause different loss/accuracy/model weight trajectories, as the gradients will start to be computed at slightly different points.
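As a minimal illustration of the order dependence described above (plain NumPy, not TFF code; the three values are chosen by hand so that the rounding is visible), adding the same float32 numbers with a different grouping gives a different result:

```python
import numpy as np

# Three float32 values picked so that rounding depends on the grouping.
a = np.float32(1e8)
b = np.float32(-1e8)
c = np.float32(1.0)

# Near 1e8 the spacing between adjacent float32 values is 8,
# so adding 1.0 to -1e8 rounds straight back to -1e8.
left = (a + b) + c   # (1e8 - 1e8) + 1
right = a + (b + c)  # 1e8 + (-1e8 + 1)

print(left, right)   # the two groupings disagree
```

In a federated round the "grouping" is simply the order in which client updates happen to arrive, which is why the effect cannot be removed by seeding.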
BTW, this tutorial has more info on best practices in controlling randomness in TFF.

Creating a good training set for one-class detection

I am training a one-class (hands) object detector on the egohands data set. My problem is that it detects way too many things as hands. It feels like it is detecting everything that is skin-colored as a hand.
I assume the most likely explanation for this is that my training set is poor, as every single image of the set contains hands, and also almost no other skin-toned elements are on the images. I guess it is necessary to also present the network images that are not what you try to detect?
I just want to verify that my assumptions are right before investing lots of time into creating a better training set. I would therefore be very grateful for any hint about what I am doing wrong.
Preprocessing is a critical step in object detection. Take extra care with it, as detection networks are sensitive to geometric transformations.
Some proven data augmentation methods include:
1. Random geometry transformation for random cropping (with constraints)
2. Random expansion
3. Random horizontal flip
4. Random resize (with random interpolation)
5. Random color jittering for brightness, hue, saturation, and contrast
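To show why geometric augmentations need care in detection, here is a minimal sketch of item 3 (random horizontal flip) in plain NumPy. The function name, the `[x_min, y_min, x_max, y_max]` pixel box layout, and the `p` parameter are assumptions for illustration; the point is that the boxes must be transformed together with the image.

```python
import numpy as np

def random_horizontal_flip(image, boxes, rng, p=0.5):
    """Flip an HxWxC image and its boxes left-right with probability p.

    boxes is an Nx4 array of [x_min, y_min, x_max, y_max] in pixels
    (an assumed layout; adapt it to your dataset's convention).
    """
    if rng.random() >= p:
        return image, boxes
    width = image.shape[1]
    flipped = image[:, ::-1].copy()
    new_boxes = boxes.copy()
    # The old right edge becomes the new left edge, and vice versa.
    new_boxes[:, 0] = width - boxes[:, 2]
    new_boxes[:, 2] = width - boxes[:, 0]
    return flipped, new_boxes

rng = np.random.default_rng(0)
image = np.zeros((4, 10, 3), dtype=np.uint8)
boxes = np.array([[1.0, 0.0, 4.0, 2.0]])
flipped_image, flipped_boxes = random_horizontal_flip(image, boxes, rng, p=1.0)
print(flipped_boxes)
```

The same principle applies to cropping, expansion, and resizing: every transformation of the pixels has a matching transformation of the box coordinates.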

Scalable, Efficient Hierarchical Softmax in Tensorflow?

I'm interested in implementing a hierarchical softmax model that can handle large vocabularies, say on the order of 10M classes. What is the best way to do this so that it both scales to large class counts and is efficient? For instance, at least one paper has shown that HS can achieve a ~25x speedup for large vocabularies when using a 2-level tree where each node has sqrt(N) classes. I'm also interested in a more general version for an arbitrary-depth tree with an arbitrary branching factor.
There are a few options that I see here:
1) Run tf.gather for every batch, where we gather the indices and splits. This creates problems with large batch sizes and fat trees where now the coefficients are being duplicated a lot, leading to OOM errors.
2) Similar to #1, we could use tf.embedding_lookup, which would help with OOM errors but now keeps everything on the CPU and slows things down quite a bit.
3) Use tf.map_fn with parallel_iterations=1 to process each sample separately and go back to using gather. This is much more scalable but does not really get close to the 25x speedup due to the serialization.
Is there a better way to implement HS? Are there different ways for deep and narrow vs. short and wide trees?
You mention that you want GPU-class performance:
"but now keeps everything on the CPU and slows things down quite a bit"
and wish to use a 300-unit hidden size and 10M-word dictionaries.
This means that (assuming float32), you'll need 4 * 300 * 10M * 2 bytes = 24 GB just to store the parameters and the gradient for the output layer.
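The arithmetic above can be checked in a few lines; the variable names are only labels for the four factors in the formula:

```python
bytes_per_float32 = 4
hidden_size = 300
vocab_size = 10_000_000
copies = 2  # one copy for the parameters, one for their gradients

total_bytes = bytes_per_float32 * hidden_size * vocab_size * copies
print(total_bytes / 1e9, "GB")
```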
Hierarchical Softmax (HSM) doesn't reduce the memory requirements - it just speeds up the training.
Realistically, you'll need a lot more GPU memory, because you'll also need to store:
other parameters and their gradients
optimizer data, e.g. velocities in momentum training
activations and backpropagated temporary data
framework-specific overhead
Therefore, if you want to do all computation on GPUs, you'll have no choice but to distribute this layer across multiple high-memory GPUs.
However, you now have another problem:
To make this concrete, let's suppose you have a 2-level HSM with 3K classes, with 3K words per class (9M words in total). You distribute the 3K classes across 8 GPUs, so that each hosts 384 classes.
What if all target words in a batch are from the same 384 classes, i.e. they belong to the same GPU? One GPU will be doing all the work, while the other 7 wait for it.
The problem is that even if the target words in a batch belong to different GPUs, you'll still get the same performance as in the worst-case scenario if you do this computation in TensorFlow. This is because TensorFlow is a "specify-and-run" framework: the computational graph is the same for the best case and the worst case.
"What is the best way to do this to both be scalable to large class counts and efficient?"
The above inefficiency of model parallelism (each GPU must process the whole batch) suggests that one should try to keep everything in one place.
Let us suppose that you are either implementing everything on the host, or on 1 humongous GPU.
If you are not modeling sequences, or if you are, but there is only one output for the whole sequence, then the memory overhead from copying the parameters, to which you referred, is negligible compared to the memory requirements described above:
400 == batch size << number of classes == 3K
In this case, you could simply use gather or embedding_lookup (although the copying is inefficient).
However, if you do model sequences of length, say, 100, with output at every time step, then the parameter copying becomes a big issue.
In this case, I think you'll need to drop down to C++ / CUDA C and implement this whole layer and its gradient as a custom op.
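For reference, the gather-based variant discussed above can be sketched in plain NumPy. This is a minimal sketch under stated assumptions, not a TensorFlow implementation: it uses a 2-level tree with the same number of words in every class, omits biases, and the function and argument names are made up for illustration. The `word_w[target_class]` line is the per-example parameter copy that becomes expensive for long sequences.

```python
import numpy as np

def log_softmax(x):
    # Numerically stable log-softmax over the last axis.
    x = x - x.max(axis=-1, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=-1, keepdims=True))

def two_level_hsm_log_prob(h, class_w, word_w, target_class, target_word):
    """log p(word | h) = log p(class | h) + log p(word | class, h).

    h: [B, D] hidden states, class_w: [C, D], word_w: [C, K, D],
    target_class: [B] class ids, target_word: [B] word ids within the class.
    """
    class_logp = log_softmax(h @ class_w.T)             # [B, C]
    gathered = word_w[target_class]                      # [B, K, D] copy
    word_logits = np.einsum("bd,bkd->bk", h, gathered)   # [B, K]
    word_logp = log_softmax(word_logits)
    rows = np.arange(h.shape[0])
    return class_logp[rows, target_class] + word_logp[rows, target_word]

rng = np.random.default_rng(0)
B, D, C, K = 2, 5, 3, 4
h = rng.normal(size=(B, D))
class_w = rng.normal(size=(C, D))
word_w = rng.normal(size=(C, K, D))
log_probs = two_level_hsm_log_prob(
    h, class_w, word_w, np.array([0, 2]), np.array([1, 3])
)
print(log_probs)
```

Each example only touches one class softmax (C outputs) and one word softmax (K outputs) instead of all C * K words, which is where the speedup comes from; the memory for `class_w` and `word_w` is unchanged.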

Is it feasible to train an A3C algorithm in an episodic context?

The A3C algorithm (and N-step Q-learning) updates the globally shared network once every N timesteps. N is usually pretty small, 5 or 20 as far as I remember.
Wouldn't it be possible to set N to infinity, meaning that the networks are only trained at the end of an episode? I do not argue that it is necessarily better (though to me it sounds like it could be), but at least it should not be a lot worse, right?
The lack of asynchronous training, which relies on multiple agents exploring different environments asynchronously and thereby stabilizes training without replay memory, might be a problem if the training is done sequentially (as in: for each worker thread, train the network on the whole observed SAR sequence). Though the training could still be done asynchronously with sub-sequences; it would only make training with stateful LSTMs a little more complicated.
The reason why I am asking is the "Evolution Strategies as a Scalable Alternative to Reinforcement Learning" paper. To compare it to algorithms like A3C, it would make more sense - from a code engineering point of view - to train both algorithms in the same episodic way.
Definitely, just set N to be larger than the maximum episode length (or modify the source to remove the batching condition). Note that in the original A3C paper, this is done for the dynamic control environments (with continuous action spaces), with good results. It is commonly argued that being able to update mid-episode (which is not necessary) is a key advantage of TD methods: it exploits the Markov property of the MDP.
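To make the effect of a large N concrete: when a rollout is cut off mid-episode, the n-step target bootstraps from the critic's value of the last state; when the rollout runs to the end of the episode, the bootstrap value is 0 and the targets become plain discounted episode returns. A minimal sketch (the function name and arguments are illustrative and not taken from any particular A3C implementation):

```python
def discounted_returns(rewards, gamma, bootstrap_value=0.0):
    """Return targets R_t = r_t + gamma * R_{t+1} for one rollout.

    bootstrap_value is the critic's estimate for the state after the last
    reward; use 0.0 when the rollout ended in a terminal state.
    """
    returns = []
    running = bootstrap_value
    for reward in reversed(rewards):
        running = reward + gamma * running
        returns.append(running)
    returns.reverse()
    return returns

# Full episode (N larger than the episode length): no bootstrapping.
episode_targets = discounted_returns([1.0, 0.0, 2.0], gamma=0.5)

# Truncated rollout (small N): bootstrap from an assumed critic value of 4.0.
truncated_targets = discounted_returns([1.0, 0.0], gamma=0.5, bootstrap_value=4.0)
print(episode_targets, truncated_targets)
```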

Encoding invariance for deep neural network

I have a set of data in the form of 2D matrices (like grayscale pictures), and I use a CNN as the classifier.
I would like to know if there is any study or experience on the accuracy impact of changing the encoding from the traditional one.
I suppose there is an impact; the question is rather which transformations of the encoding leave the accuracy invariant and which ones degrade it.
To clarify, this mainly concerns the quantization of the raw data into input data.
EDIT:
Quantizing the raw data into input data is already a pre-processing step that adds or removes some features (even minor ones). The impact of this quantization on accuracy in real DNN computation does not seem very clear.
Maybe some research is available.
I'm not aware of any research specifically dealing with quantization of input data, but you may want to check out some related work on quantization of CNN parameters: http://arxiv.org/pdf/1512.06473v2.pdf. Depending on what your end goal is, the "Q-CNN" approach may be useful for you.
My own experience with using various quantizations of the input data for CNNs has been that there's a heavy dependency between the degree of quantization and the model itself. For example, I've played around with using various interpolation methods to reduce image sizes and reducing the color palette size, and in the end, I discovered that each variant required a different tuning of hyper-parameters to achieve optimal results. Generally, I found that minor quantization of data had a negligible impact, but there was a knee in the curve where throwing away additional information dramatically impacted the achievable accuracy. Unfortunately, I'm not aware of any way to determine what degree of quantization will be optimal without experimentation, and even deciding what's optimal involves a trade-off between efficiency and accuracy which doesn't necessarily have a one-size-fits-all answer.
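One way to run this kind of experiment is to sweep the number of gray levels and retrain or re-evaluate at each setting to locate the knee. The helper below is a hypothetical sketch of uniform quantization for 8-bit grayscale input; it is not the exact procedure I used, and the name and `levels` parameter are illustrative only.

```python
import numpy as np

def quantize_levels(image, levels):
    """Uniformly quantize a uint8 image to `levels` gray levels (2..256)."""
    # Bin index in [0, levels - 1] for each pixel.
    idx = image.astype(np.int32) * levels // 256
    # Spread the bin indices back over the full 0..255 range.
    return (idx * 255 // (levels - 1)).astype(np.uint8)

image = np.array([[0, 100, 200, 255]], dtype=np.uint8)
quantized = quantize_levels(image, levels=4)
print(quantized)
```

Training the same model on `quantize_levels(x, k)` for decreasing `k`, with hyper-parameters retuned for each `k`, gives the accuracy-versus-quantization curve described above.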
On a theoretical note, keep in mind that CNNs need to be able to find useful, spatially-local features, so it's probably reasonable to assume that any encoding that disrupts the basic "structure" of the input would have a significantly detrimental effect on the accuracy achievable.
In usual practice -- a discrete classification task in classic implementation -- it will have no effect. However, the critical point is in the initial computations for back-propagation. The classic definition depends only on strict equality of the predicted and "base truth" classes: a simple right/wrong evaluation. Changing the class coding has no effect on whether or not a prediction is equal to the training class.
However, this function can be altered. If you change the code to have something other than a right/wrong scoring, something that depends on the encoding choice, then encoding changes can most definitely have an effect. For instance, if you're rating movies on a 1-5 scale, you likely want 1 vs 5 to contribute a higher loss than 4 vs 5.
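A small sketch of the difference, using two hand-written scoring functions (hypothetical names, not from any library): a right/wrong score treats every mistake equally and is unaffected by how the classes are coded, while a distance-based score depends on the numeric encoding.

```python
def zero_one_loss(pred, target):
    # Right/wrong scoring: only equality matters, so relabeling classes
    # consistently changes nothing.
    return 0.0 if pred == target else 1.0

def absolute_rating_loss(pred, target):
    # Encoding-dependent scoring: the numeric distance between labels matters.
    return float(abs(pred - target))

# Predicting 1 or 4 when the true rating is 5:
print(zero_one_loss(1, 5), zero_one_loss(4, 5))
print(absolute_rating_loss(1, 5), absolute_rating_loss(4, 5))
```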
Does this reasonably deal with your concerns?
I see now. My answer above is useful ... but not for what you're asking. I had my eye on the classification encoding; you're wondering about the input.
Please note that asking for off-site resources is a classic off-topic question category. I am unaware of any such research -- for what little that is worth.
Obviously, there should be some effect, as you're altering the input data. The effect would be dependent on the particular quantization transformation, as well as the individual application.
I do have some limited-scope observations from general big-data analytics.
In our typical environment, where the data were scattered with some inherent organization within their natural space (F dimensions, where F is the number of features), we often use two simple quantization steps: (1) Scale all feature values to a convenient integer range, such as 0-100; (2) Identify natural micro-clusters, and represent all clustered values (typically no more than 1% of the input) by the cluster's centroid.
This speeds up analytic processing somewhat. Given the fine-grained clustering, it has little effect on the classification output. In fact, it sometimes improves the accuracy minutely, as the clustering provides wider gaps among the data points.
Take with a grain of salt, as this is not the main thrust of our efforts.