Why can the chi-square test be used to check for a significant difference between expected and observed frequencies?

I know where the chi-square distribution comes from, and I also know how to apply the chi-square test.
However, I can't figure out why the chi-square test can be used to check for a significant difference between the expected frequencies and the observed frequencies.

Because the distribution of the chi-squared test statistic (the sum over cells of (observed - expected)^2 / expected) approaches the chi-squared distribution as the sample size grows. The exact distribution of this statistic is complicated and depends on the base distribution (the distribution whose goodness of fit we are checking). But with large samples it approaches the chi-squared distribution for any base distribution (a consequence of the central limit theorem). Therefore we do not need to care about the base distribution and can use the chi-squared test universally, provided the sample size is large enough.
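A small simulation makes this concrete. The sketch below (assuming NumPy and SciPy are available) draws repeated multinomial samples from an arbitrary base distribution, computes the statistic for each draw, and compares its quantiles with those of the chi-squared distribution with k - 1 degrees of freedom:

```python
import numpy as np
from scipy import stats

# Sketch: simulate the goodness-of-fit statistic under a multinomial null
# and compare its distribution with chi-squared(k - 1).
rng = np.random.default_rng(0)
p = np.array([0.2, 0.3, 0.1, 0.4])   # an arbitrary "base" distribution
n = 500                              # sample size per experiment
k = len(p)

statistic = []
for _ in range(10_000):
    observed = rng.multinomial(n, p)
    expected = n * p
    statistic.append(np.sum((observed - expected) ** 2 / expected))

# With n this large, the simulated quantiles should closely match chi2(k - 1):
print(np.quantile(statistic, [0.90, 0.95, 0.99]))
print(stats.chi2.ppf([0.90, 0.95, 0.99], df=k - 1))
```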


Is there a reason why the number of channels/filters and batch sizes in many deep learning models are in powers of 2?

In many models the number of channels is kept at a power of 2, and batch sizes are also specified as powers of 2. Is there any reason behind this design choice?
There is nothing significant about keeping the number of channels and the batch size as powers of 2. You can use any number you want.
While both could probably be optimized for speed (cache alignment? optimal usage of CUDA cores?) as powers of two, I am 95% certain that 99.9% of people do it because others used the same numbers and it worked.
For both hyperparameters you could choose any positive integer, so what would you try? Keep in mind that each complete evaluation takes at least several hours. Hence my guess is that when people play with these parameters, they do something like a binary search: starting from one number, keep doubling as long as it improves, until an upper bound is found. At some point the differences are minor, and then it is irrelevant what you choose. And people will wonder less if you write that you used a batch size of 64 than if you write that you used 50. Or 42.
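For illustration only, a minimal sketch of that doubling search; train_and_evaluate is a hypothetical placeholder that trains a model with the given batch size and returns a validation score:

```python
# Hypothetical sketch of a doubling search over batch size.
# train_and_evaluate is an assumed placeholder, not a real API.
def doubling_search(train_and_evaluate, start=16, max_value=1024):
    best_value = start
    best_score = train_and_evaluate(start)
    value = start * 2
    while value <= max_value:
        score = train_and_evaluate(value)
        if score <= best_score:   # no further improvement: stop doubling
            break
        best_value, best_score = value, score
        value *= 2
    return best_value, best_score

# Usage (run_training is made up for the example):
# best_batch, best_score = doubling_search(lambda b: run_training(batch_size=b))
```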

Smoothing aircraft GPS data with realistic turns

I have historical aircraft trajectory data with points separated by anywhere from 1 second to 1 minute. Often these points present sharp turns. I'm looking for suggestions of the best methods of resampling the data to generate smooth paths (e.g. a point every n seconds) that more realistically represent the path followed. It would be useful to be able to parameterize the function with certain performance characteristics (e.g. rate of change of direction).
I'm aware of algorithms like the Kalman filter, Bezier curve fitting, splines etc. for data smoothing. But what algorithms would you suggest exploring as a starting point for generating smooth turns?
Schneider's algorithm approximately fits curves through a series of points.
The resulting curves have a drastically reduced point count, and its error tolerance is configurable, so you can adjust it as much as you need to.
In general:
Lower error tolerance: more points, more accurate, slower execution
Higher error tolerance: fewer points, less accurate, faster execution
Some useful links:
A live JavaScript example, and its implementation here.
Python Example
C++ implementation
If the resulting curve must pass exactly through your points, you need an interpolation algorithm instead of an approximation algorithm, but keep in mind that those do not reduce the point count.
A really good type of interpolating spline is the Centripetal Catmull-Rom Spline.
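For reference, a minimal NumPy sketch of a single centripetal Catmull-Rom segment (alpha = 0.5), evaluated between the two middle control points; chaining such segments over consecutive quadruples of track points yields the interpolated path:

```python
import numpy as np

# Sketch: one centripetal Catmull-Rom segment, evaluated between p1 and p2.
# p0 and p3 are the neighbouring control points; alpha = 0.5 is "centripetal".
def catmull_rom_segment(p0, p1, p2, p3, num=20, alpha=0.5):
    p0, p1, p2, p3 = map(np.asarray, (p0, p1, p2, p3))

    def next_t(ti, pa, pb):
        return ti + np.linalg.norm(pb - pa) ** alpha

    t0 = 0.0
    t1 = next_t(t0, p0, p1)
    t2 = next_t(t1, p1, p2)
    t3 = next_t(t2, p2, p3)

    t = np.linspace(t1, t2, num)[:, None]
    a1 = (t1 - t) / (t1 - t0) * p0 + (t - t0) / (t1 - t0) * p1
    a2 = (t2 - t) / (t2 - t1) * p1 + (t - t1) / (t2 - t1) * p2
    a3 = (t3 - t) / (t3 - t2) * p2 + (t - t2) / (t3 - t2) * p3
    b1 = (t2 - t) / (t2 - t0) * a1 + (t - t0) / (t2 - t0) * a2
    b2 = (t3 - t) / (t3 - t1) * a2 + (t - t1) / (t3 - t1) * a3
    return (t2 - t) / (t2 - t1) * b1 + (t - t1) / (t2 - t1) * b2

# Example: four consecutive (x, y) track points
pts = [(0.0, 0.0), (1.0, 2.0), (3.0, 3.0), (5.0, 1.0)]
curve = catmull_rom_segment(*pts)   # 20 points between (1, 2) and (3, 3)
```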

Testing the validity of a boolean pattern recognition algorithm

How do I determine a sufficient sample size to test an algorithm that cannot be unit tested (a pattern recognition problem)?
I have a relatively simple algorithm that uses vehicle position data and bridge position data to determine whether a vehicle has crossed a bridge (true/false). The algorithm is allowed to give false positives but must never give a false negative.
I have tested the algorithm manually 400 times (200 instances where it is known the vehicle crossed, and 200 instances where it is known the vehicle did not cross). It has performed very well with no false negative results.
My concern is that I cannot feasibly test the many thousands of bridges for every conceivable GPS approach, so I must rely on a certain sample of tested bridges to be confident in my algorithm. I have read the Wikipedia page on sample size and I do not see how it applies to my situation.
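As a rough way to quantify what a zero-failure sample establishes, assuming the tested crossings are independent and representative: with 0 false negatives in n known crossings, the exact one-sided 95% upper confidence bound on the false-negative rate is 1 - 0.05^(1/n), roughly 3/n (the "rule of three"). A minimal sketch:

```python
from scipy import stats

# Sketch: with 0 false negatives observed in n known crossings, the exact
# one-sided 95% upper confidence bound on the false-negative rate is
# 1 - 0.05**(1/n), roughly 3/n (the "rule of three").
n = 200
print(1 - 0.05 ** (1 / n))            # ~0.0149, i.e. about 1.5%

# The same bound via the Clopper-Pearson / beta formulation (0 failures):
print(stats.beta.ppf(0.95, 1, n))     # identical value
```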

Clustering: Cluster validation

I want to use a clustering method on a large social network dataset. The problem is how to evaluate the clustering method. Yes, I can use external, internal, and relative cluster validation methods. I used normalized mutual information (NMI) as an external validation measure on synthetic data: I produced synthetic datasets of 5 clusters with an equal number of nodes, with strongly connected links inside each cluster and weak links between clusters, and then analysed spectral clustering and modularity-based community detection methods on these synthetic datasets. I then used the clustering with the best NMI on my real-world dataset and checked the error (cost function) of my algorithm, and the result was good. Is this a sound way to test my cost function, or should I also validate the clusters of my real-world data again?
Thanks.
Try more than one measure.
There are a dozen cluster validation measures, and it's hard to predict which one is most appropriate for a problem. The differences between them are not really understood yet, so it's best if you consult more than one.
Also note that if you don't use a normalized measure, the baseline may be really high. So the measures are mostly useful to say "result A is more similar to result B than result C", but should not be taken as an absolute measure of quality. They are a relative measure of similarity.
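As a concrete illustration, a minimal sketch (assuming scikit-learn) that consults two measures at once: NMI (normalized) and the adjusted Rand index (corrected for chance), on made-up label vectors:

```python
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

# Made-up ground-truth labels (e.g. from a synthetic benchmark) and a
# clustering result; compare them with more than one measure.
true_labels = [0, 0, 0, 1, 1, 1, 2, 2, 2]
predicted   = [0, 0, 1, 1, 1, 1, 2, 2, 0]

print("NMI:", normalized_mutual_info_score(true_labels, predicted))
print("ARI:", adjusted_rand_score(true_labels, predicted))
```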

Retrieving the most significant features gained from SIFT / SURF

I'm using SURF to extract features from images and match them to others. My problem is that some images have in excess of 20000 features, which slows down matching to a crawl.
Is there a way I can extract only the n most significant features from that set?
I tried computing MSER for the image and only using features that are within those regions. That gives me a reduction anywhere from 5% to 40% without affecting matching quality negatively, but that's unreliable and still not enough.
I could additionally size the image down, but that seems to affect the quality of the features severely in some cases.
SURF offers a few parameters (Hessian threshold, octaves, and layers per octave), but I couldn't find anything on how changing these would affect feature significance.
After some research and testing I have found that the Hessian response of each feature is a rough estimate of its strength; however, simply taking the top n features sorted by the Hessian is not optimal.
I achieved better results by doing the following until the number of features is below the target n (sketched in code below):
Size the image down if it is overly large
Only features that lie in MSER regions are considered
For features that lie very close to each other, only the one with the higher Hessian response is kept
Of the n features per image that I want to save, 75% are those with the highest Hessian values
The remaining features are taken randomly from the remainder, weighted by the distribution of Hessian values computed through a histogram
Now I only need to find a suitable n, but around 1500 seems enough currently.
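A rough sketch of that selection (leaving out the downscaling and MSER steps, and weighting the random remainder directly by Hessian response rather than through a histogram); it assumes OpenCV built with the contrib / nonfree modules, where SURF lives under cv2.xfeatures2d:

```python
import cv2
import numpy as np

# Sketch: keep roughly the n strongest SURF features.
# 75% are the highest-response keypoints (after a crude proximity filter),
# the rest are sampled from the remainder, weighted by response.
def select_features(gray, n=1500, min_dist=5.0, seed=0):
    surf = cv2.xfeatures2d.SURF_create(hessianThreshold=400)  # needs opencv-contrib (nonfree)
    kps, desc = surf.detectAndCompute(gray, None)
    rng = np.random.default_rng(seed)

    order = np.argsort([-kp.response for kp in kps])  # strongest first
    kept, kept_pts = [], []
    for i in order:
        pt = np.array(kps[i].pt)
        # drop keypoints that sit very close to an already kept, stronger one
        if kept_pts and np.min(np.linalg.norm(np.array(kept_pts) - pt, axis=1)) < min_dist:
            continue
        kept.append(i)
        kept_pts.append(pt)

    top = kept[: int(0.75 * n)]
    rest = kept[int(0.75 * n):]
    if rest and len(top) < n:
        weights = np.array([kps[i].response for i in rest], dtype=float)
        k = min(n - len(top), len(rest))
        extra = rng.choice(rest, size=k, replace=False, p=weights / weights.sum())
        top = top + list(extra)

    return [kps[i] for i in top], None if desc is None else desc[top]
```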