Principal component analysis (PCA) assumptions - k-means

I used PCA to reduce a 180-dimensional feature space to 3 principal components.
Afterwards I used k-means clustering to cluster the data according to the 3 principal components from the PCA.
I read on Wikipedia that principal components are guaranteed to be independent only if the data set is jointly normally distributed. I didn't check the joint distribution of all my features (180)... is that a problem?
What are the assumptions (if any), or the best practices, when using PCA for dimensionality reduction?
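As a minimal sketch of this workflow (PCA down to 3 components, then k-means), assuming scikit-learn, placeholder data of shape 1000 x 180, and an arbitrary choice of k:

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# Placeholder data: 1000 samples with 180 features.
X = np.random.rand(1000, 180)

# Standardize first; PCA is sensitive to the scale of each feature.
X_std = StandardScaler().fit_transform(X)

# Project onto the first 3 principal components and check how much variance they keep.
pca = PCA(n_components=3)
X_pca = pca.fit_transform(X_std)
print("explained variance ratio:", pca.explained_variance_ratio_)

# Cluster in the reduced 3-D space (k = 5 is purely illustrative).
labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X_pca)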

Related

Criteria to choose an object detection method

I'm in the research phase of my project and I'm trying to build an object detector using a CNN. I know that in general there are two "types" of CNN object detectors: region-proposal-based methods (e.g. R-CNN and R-FCN) and regression/classification-based methods (e.g. YOLO and SSD). The problem is I'm not so sure which method I should use. I would like to know the usual reasoning for choosing one method over the other. There are a few general criteria, such as speed vs. accuracy, but is there any other commonly used reasoning?
There are two categories of detectors: one-stage and two-stage. YOLO, SSD, RetinaNet, CenterNet, etc. fall into the one-stage category, while R-FCN, R-CNN, Faster R-CNN, etc. fall into the two-stage category.
A direct quote from [1] about the advantages of two-stage detectors compared to one-stage ones:
Compared to one-stage detectors, the two-stage ones have the following advantages:
1) By sampling a sparse set of region proposals, two-stage detectors filter out most of the negative proposals; while one-stage detectors directly face all the regions on the image and have a problem of class imbalance if no specialized design is introduced.
2) Since two-stage detectors only process a small number of proposals, the head of the network (for proposal classification and regression) can be larger than one-stage detectors, so that richer features will be extracted.
3) Two-stage detectors have high-quality features of sampled proposals by use of the RoIAlign [10] operation that extracts the location consistent feature of each proposal; but different region proposals can share the same feature in one-stage detectors and the coarse and spatially implicit representation of proposals may cause severe feature misalignment.
4) Two-stage detectors regress the object location twice (once on each stage) and the bounding boxes are better refined than one-stage methods.
A quote on accuracy vs. efficiency:
One-stage detectors are more efficient and elegant in design, but currently the two-stage detectors have domination in accuracy.
One-stage detectors can be deployed on edge devices such as phones for fast real-time detection. This can save more energy compared to more compute-intensive detectors.
In summary, go for a two-stage detector if accuracy is more important; otherwise go for a one-stage detector for faster detection while maintaining good enough accuracy.
The related-works section of [1] contains easy-to-read details, and each of the referenced papers also reviews two-stage vs. one-stage detectors.
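As a rough, hedged sketch of the speed difference (not a substitute for the benchmarks linked below), one can time a pretrained two-stage and one-stage detector from torchvision on the same input; the model choices and the random image are placeholders, and meaningful accuracy numbers require a real benchmark dataset:

import time
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn, ssd300_vgg16

# Pretrained detectors: Faster R-CNN (two-stage) vs. SSD300 (one-stage).
models = {
    "faster_rcnn (two-stage)": fasterrcnn_resnet50_fpn(weights="DEFAULT").eval(),
    "ssd300 (one-stage)": ssd300_vgg16(weights="DEFAULT").eval(),
}

# A single random image stands in for real data.
image = [torch.rand(3, 480, 640)]

with torch.no_grad():
    for name, model in models.items():
        start = time.time()
        outputs = model(image)  # list of dicts with 'boxes', 'labels', 'scores'
        print(f"{name}: {time.time() - start:.2f} s, {len(outputs[0]['boxes'])} boxes")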
Object detection benchmarks
https://paperswithcode.com/task/object-detection
References
[1] MimicDet, https://www.ecva.net/papers/eccv_2020/papers_ECCV/papers/123590528.pdf
[2] Speed/accuracy trade-offs for modern convolutional object detectors, https://arxiv.org/pdf/1611.10012.pdf
[3] RetinaNet, https://arxiv.org/pdf/1708.02002.pdf
[4] Object detection review, https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=9186021
[5] CSPNET, https://arxiv.org/pdf/1911.11929v1.pdf
[6] CenterNet, https://arxiv.org/pdf/1904.08189v3.pdf
[7] EfficientDet, https://arxiv.org/pdf/1911.09070.pdf
[8] SpineNet, https://arxiv.org/pdf/1912.05027.pdf
Related articles
https://jonathan-hui.medium.com/object-detection-speed-and-accuracy-comparison-faster-r-cnn-r-fcn-ssd-and-yolo-5425656ae359
https://www.jeremyjordan.me/object-detection-one-stage/

How to find good observations for reinforcement learning?

I am starting my study of RL and was wondering how one would approach observation features that are not able to fully represent the (hidden) state.
Is there a systematic approach, or are there guidelines, on what one would prefer the feature vector to look like? Discreteness, dimension, Markov properties, embedding quality...?
I would like to process machine-operation data streams, and I actually have a lot of direct measurements and many high-dimensional feature vectors (also streamed).
Thank you very much for your input.

Is there any unsupervised clustering technique which can identify the number of clusters itself?

I checked unsupervised clustering in gensim, fastText, and sklearn, but did not find any documentation on clustering my text data with unsupervised learning without specifying the number of clusters to be identified.
For example, in sklearn's KMeans clustering:
from sklearn.cluster import KMeans
km = KMeans(n_clusters=true_k, init='k-means++', max_iter=100)
I have to provide n_clusters.
In my case, I have text, and the method should automatically identify the number of clusters in it and cluster the text. Any reference article or link would be much appreciated.
DBSCAN is a density-based clustering method for which we don't have to specify the number of clusters beforehand.
sklearn implementation : http://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html
Here is a good tutorial that gives an intuitive understanding on DBSCAN: http://mccormickml.com/2016/11/08/dbscan-clustering/
I extracted the following from the above tutorial, which may be useful for you.
k-means requires specifying the number of clusters, ‘k’. DBSCAN does not, but does require specifying two parameters which influence the decision of whether two nearby points should be linked into the same cluster.
These two parameters are a distance threshold, ε (epsilon), and “MinPts” (minimum number of points), to be explained.
There are also other methods (follow the link given in the comments); however, DBSCAN is a popular choice.
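A minimal sketch of DBSCAN on text, assuming TF-IDF features; the tiny corpus and the eps/min_samples values are placeholders that need tuning on real data:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import DBSCAN

# Tiny placeholder corpus; replace with your own documents.
docs = [
    "the cat sat on the mat",
    "a cat and a dog",
    "stock markets fell sharply",
    "investors worry about markets",
]

# TF-IDF representation of the text; cosine distance suits sparse text vectors.
X = TfidfVectorizer().fit_transform(docs)

# eps and min_samples are illustrative and must be tuned; label -1 marks noise points.
labels = DBSCAN(eps=0.8, min_samples=2, metric="cosine").fit_predict(X)
print(labels)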

Encoding invariance for deep neural network

I have a set of data, 2D matrices (like greyscale pictures), and I use a CNN as the classifier.
I would like to know if there is any study/experience on the accuracy impact of changing the encoding from the traditional encoding.
I suppose yes; the question is rather which transformations of the encoding leave the accuracy invariant, and which ones deteriorate it...
To clarify, this concerns mainly the quantization process of the raw data into input data.
EDIT:
Quantizing the raw data into input data is already a pre-processing of the data, adding or removing some features (even if minor). The impact of this quantization process on accuracy in real DNN computation does not seem very clear.
Maybe some research is available.
I'm not aware of any research specifically dealing with quantization of input data, but you may want to check out some related work on quantization of CNN parameters: http://arxiv.org/pdf/1512.06473v2.pdf. Depending on what your end goal is, the "Q-CNN" approach may be useful for you.
My own experience with using various quantizations of the input data for CNNs has been that there's a heavy dependency between the degree of quantization and the model itself. For example, I've played around with using various interpolation methods to reduce image sizes and reducing the color palette size, and in the end, I discovered that each variant required a different tuning of hyper-parameters to achieve optimal results. Generally, I found that minor quantization of data had a negligible impact, but there was a knee in the curve where throwing away additional information dramatically impacted the achievable accuracy. Unfortunately, I'm not aware of any way to determine what degree of quantization will be optimal without experimentation, and even deciding what's optimal involves a trade-off between efficiency and accuracy which doesn't necessarily have a one-size-fits-all answer.
On a theoretical note, keep in mind that CNNs need to be able to find useful, spatially-local features, so it's probably reasonable to assume that any encoding that disrupts the basic "structure" of the input would have a significantly detrimental effect on the accuracy achievable.
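As a hedged sketch of that kind of experiment (not taken from any particular study), one can reduce the bit depth of greyscale inputs step by step and re-train/evaluate the CNN at each level; the bit depths and the random image batch below are placeholders:

import numpy as np

def quantize(images, bits):
    # Reduce 8-bit greyscale images to the given bit depth, keeping the 0-255 range.
    step = 256 // (2 ** bits)
    return (images // step) * step + step // 2

# Placeholder batch of 8-bit greyscale images (N, H, W).
images = np.random.randint(0, 256, size=(16, 28, 28), dtype=np.uint8)

for bits in (8, 4, 2, 1):
    quantized = quantize(images, bits)
    # Train/evaluate the CNN on `quantized` here and record accuracy per bit depth,
    # re-tuning hyper-parameters at each level as discussed above.
    print(bits, "bits ->", len(np.unique(quantized)), "distinct grey levels")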
In usual practice -- a discrete classification task in classic implementation -- it will have no effect. However, the critical point is in the initial computations for back-propagation. The classic definition depends only on strict equality of the predicted and "base truth" classes: a simple right/wrong evaluation. Changing the class coding has no effect on whether or not a prediction is equal to the training class.
However, this function can be altered. If you change the code to have something other than a right/wrong scoring, something that depends on the encoding choice, then encoding changes can most definitely have an effect. For instance, if you're rating movies on a 1-5 scale, you likely want 1 vs 5 to contribute a higher loss than 4 vs 5.
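A minimal sketch of that idea for a 1-5 rating task, using an illustrative (not standard) distance-weighted loss where the penalty grows with how far apart the encoded classes are:

import numpy as np

def ordinal_loss(y_true, y_pred):
    # Absolute distance on the rating scale: predicting 1 for a true 5 costs more than predicting 4.
    return np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred)))

# Plain right/wrong scoring treats both mistakes the same; this loss does not.
print(ordinal_loss([5, 5], [4, 1]))  # 2.5, dominated by the 5-vs-1 error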
Does this reasonably deal with your concerns?
I see now. My answer above is useful ... but not for what you're asking. I had my eye on the classification encoding; you're wondering about the input.
Please note that asking for off-site resources is a classic off-topic question category. I am unaware of any such research -- for what little that is worth.
Obviously, there should be some effect, as you're altering the input data. The effect would be dependent on the particular quantization transformation, as well as the individual application.
I do have some limited-scope observations from general big-data analytics.
In our typical environment, where the data were scattered with some inherent organization within their natural space (F dimensions, where F is the number of features), we often use two simple quantization steps: (1) Scale all feature values to a convenient integer range, such as 0-100; (2) Identify natural micro-clusters, and represent all clustered values (typically no more than 1% of the input) by the cluster's centroid.
This speeds up analytic processing somewhat. Given the fine-grained clustering, it has little effect on the classification output. In fact, it sometimes improves the accuracy minutely, as the clustering provides wider gaps among the data points.
Take with a grain of salt, as this is not the main thrust of our efforts.
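A hedged sketch of those two steps with scikit-learn; the toy data, the 0-100 range, and the DBSCAN radius used to find micro-clusters are all placeholders:

import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.cluster import DBSCAN

# Toy data standing in for the F-dimensional feature space described above.
X = np.random.rand(500, 5)

# Step 1: scale every feature to a convenient integer range (0-100 here).
X_int = np.rint(MinMaxScaler(feature_range=(0, 100)).fit_transform(X))

# Step 2: find tight micro-clusters and replace their members with the centroid.
# DBSCAN with a small radius is just one possible way to find such micro-clusters.
labels = DBSCAN(eps=1.0, min_samples=3).fit_predict(X_int)
X_quant = X_int.copy()
for label in set(labels) - {-1}:
    mask = labels == label
    X_quant[mask] = X_int[mask].mean(axis=0)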

Clustering: Cluster validation

I want to use some clustering method on a large social network dataset. The problem is how to evaluate the clustering method. Yes, I can use some external, internal, and relative cluster validation methods. I used normalized mutual information (NMI) as an external validation method for cluster validation based on synthetic data. I produced synthetic datasets of 5 clusters, each with an equal number of nodes, with some strongly connected links inside each cluster and weak links between clusters, to check the clustering methods. Then I analysed spectral clustering and modularity-based community detection methods on these synthetic datasets. I used the clustering with the best NMI for my real-world dataset, checked the error (cost function) of my algorithm, and the result was good. Is my testing method for my cost function good, or should I also validate the clusters of my real-world data again?
Thanks.
Try more than one measure.
There are a dozen cluster validation measures, and it's hard to predict which one is most appropriate for a problem. The differences between them are not really understood yet, so it's best if you consult more than one.
Also note that if you don't use a normalized measure, the baseline may be really high. So the measures are mostly useful to say "result A is more similar to result B than result C", but should not be taken as an absolute measure of quality. They are a relative measure of similarity.
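A minimal sketch of consulting more than one normalized/adjusted external measure with scikit-learn, assuming ground-truth labels are available for the synthetic data; the label arrays below are placeholders:

from sklearn.metrics import (
    normalized_mutual_info_score,
    adjusted_mutual_info_score,
    adjusted_rand_score,
)

# Placeholder ground-truth and predicted cluster labels for the synthetic data.
true_labels = [0, 0, 1, 1, 2, 2, 3, 3, 4, 4]
pred_labels = [0, 0, 1, 2, 2, 2, 3, 3, 4, 4]

# Adjusted measures correct (at least partly) for chance; compare results
# against each other rather than reading any single score as absolute quality.
print("NMI:", normalized_mutual_info_score(true_labels, pred_labels))
print("AMI:", adjusted_mutual_info_score(true_labels, pred_labels))
print("ARI:", adjusted_rand_score(true_labels, pred_labels))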