What are the Redis TimeSeries module's restrictions?

I couldn't find this in any official documentation. What are the Redis TimeSeries module's restrictions in terms of the following:
Max number of labels which can be added?
Max size of keys?
Max number of keys or time series?
Please let me know

Max number of labels which can be added?
A. There is no hard limit but you might experience some performance degradation when querying a very large number of labels.
Max size of keys?
A. There is no limit as long as you have memory available for the module. It is recommended to downsample your data and retire raw data by using the RETENTION option. Data compression was recently added to the module, which reduces the memory footprint significantly.
Max number of keys or time series?
A. There is no limit beyond the usual limits on Redis itself.
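To make the RETENTION and label points concrete, here is a minimal sketch using redis-py against a server with the RedisTimeSeries module loaded. The key names, label values, and retention periods are made-up examples, and keyword argument names may differ slightly between client versions.

```python
# Sketch only: assumes redis-py 4+ and a Redis server with RedisTimeSeries.
import redis

r = redis.Redis(host="localhost", port=6379)
ts = r.ts()

# Raw series: keep 24h of data (RETENTION is in milliseconds) and tag it
# with labels so it can be found with filtered queries later.
ts.create("temp:raw:sensor1", retention_msecs=24 * 3600 * 1000,
          labels={"sensor_id": "1", "kind": "raw"})

# Downsampled series plus a compaction rule: hourly averages, kept for 30 days.
ts.create("temp:hourly:sensor1", retention_msecs=30 * 24 * 3600 * 1000,
          labels={"sensor_id": "1", "kind": "hourly_avg"})
ts.createrule("temp:raw:sensor1", "temp:hourly:sensor1",
              aggregation_type="avg", bucket_size_msec=3600 * 1000)

ts.add("temp:raw:sensor1", "*", 21.5)   # "*" = server-assigned timestamp

# Query every series carrying a given label, however many keys exist.
print(ts.mrange(0, "+", filters=["sensor_id=1"]))
```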

Related

k-means clustering - inertia only gets larger

I am trying to use the KMeans clustering from faiss on a human pose dataset of body joints. I have 16 body parts, so a dimension of 32. The joints are scaled to a range between 0 and 1. My dataset consists of ~900,000 instances. As mentioned by faiss (faiss_FAQ):
As a rule of thumb there is no consistent improvement of the k-means quantizer beyond 20 iterations and 1000 * k training points
Applying this to my problem, I randomly select 50,000 instances for training, as I want to check a number of clusters k between 1 and 30.
Now to my "problem":
The inertia increases steadily as the number of clusters increases (n_cluster on the x-axis):
I tried varying the number of iterations, the number of redos, verbose and spherical, but the results stay the same or get worse. I do not think that it is a problem of my implementation; I tested it on a small example with 2D data and very clear clusters and it worked.
Is it that the data is just badly clustered, or is there another problem/mistake I have missed? Maybe the scaling of the values between 0 and 1? Should I try another approach?
I found my mistake. I had to increase the parameter max_points_per_centroid. As I have so many data points, faiss sampled a sub-batch for the fit, and for a larger number of clusters this sub-batch is larger. See the faiss FAQ:
[when the number of training points is more than] max_points_per_centroid * k: there are too many points, making k-means unnecessarily slow. Then the training set is sampled
The larger sub-batch of course has a larger inertia, as there are more points in total.
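For illustration, here is a rough sketch of how that parameter can be raised when training with the faiss Python API. It assumes that extra keyword arguments to faiss.Kmeans are forwarded to the underlying ClusteringParameters; the data array is just a placeholder for the pose vectors.

```python
# Hedged sketch: raise max_points_per_centroid so the full training set is
# used instead of a sub-sample (parameter assumed to be forwarded to
# faiss ClusteringParameters by the Kmeans wrapper).
import numpy as np
import faiss

d = 32                     # 16 joints x 2 coordinates
k = 30                     # one of the cluster counts to evaluate
x = np.random.rand(50_000, d).astype("float32")  # placeholder for pose data

kmeans = faiss.Kmeans(
    d, k,
    niter=20,
    verbose=True,
    max_points_per_centroid=10_000_000,  # effectively disable sub-sampling
)
kmeans.train(x)
print("objective (sum of squared distances) per iteration:", kmeans.obj)
```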

How can I study the properties of outliers in high-dimensional data?

I have a bundle of high-dimensional data and the instances are labeled as outliers or not. I am looking to get some insights around where these outliers reside within the data. I seek to answer questions like:
Are the outliers spread far apart from each other? Or are they clustered together?
Are the outliers lying 'in-between' clusters of good data? Or are they on the 'edge' boundaries of the data?
If outliers are clustered together, how do these cluster densities compare with clusters of good data?
'Where' are the outliers?
What kind of techniques will let me find these insights? If the data were 2- or 3-dimensional, I could easily plot it and just look at it. But I can't do that with high-dimensional data.
Analyzing the Statistical Properties of Outliers
First of all, you can choose to focus on specific features. For example, if you know a feature is subject to high variation, you can draw a box plot. You can also draw a 2D graph if you want to focus on two features. This shows how much the labelled outliers vary.
Next, there's a metric called the Z-score, which basically says how many standard deviations a point varies from the mean. The Z-score is signed, meaning that if a point is below the mean, the Z-score will be negative. This can be used to analyze all the features of the dataset. You can find the threshold value in your labelled dataset above which all points are labelled outliers.
Lastly, we can find the interquartile range and filter based on it in a similar way. The IQR is simply the difference between the 75th percentile and the 25th percentile. You can use this similarly to the Z-score.
Using these techniques, we can analyze some of the statistical properties of the outliers.
If you also want to analyze the clusters, you can adapt the DBSCAN algorithm to your problem. This algorithm clusters data based on densities, so it will be easy to apply the techniques to outliers.
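As a rough illustration of the Z-score, IQR, and DBSCAN ideas above, here is a small sketch with placeholder data; X and y_outlier stand in for your feature matrix and outlier labels, and the eps/min_samples values are arbitrary assumptions.

```python
# Hedged sketch: compare labelled outliers vs. inliers with per-feature
# Z-scores and IQR bounds, then see where they land under DBSCAN.
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))            # placeholder data
y_outlier = rng.random(1000) < 0.05        # placeholder outlier labels

# Z-scores: how many standard deviations each value is from the feature mean.
z = (X - X.mean(axis=0)) / X.std(axis=0)
print("mean |z| of outliers:", np.abs(z[y_outlier]).mean())
print("mean |z| of inliers: ", np.abs(z[~y_outlier]).mean())

# IQR rule per feature: flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
q1, q3 = np.percentile(X, [25, 75], axis=0)
iqr = q3 - q1
outside = (X < q1 - 1.5 * iqr) | (X > q3 + 1.5 * iqr)
print("fraction of feature values outside IQR bounds (outliers):",
      outside[y_outlier].mean())

# Density-based clustering: do the labelled outliers form their own clusters
# or fall into DBSCAN's noise label (-1)?
labels = DBSCAN(eps=3.0, min_samples=10).fit_predict(X)
print("DBSCAN labels of the outliers:", np.unique(labels[y_outlier]))
```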

dask dataframe optimal partition size for 70GB data join operations

I have a dask dataframe of around 70GB and 3 columns that does not fit into memory. My machine is an 8-core Xeon with 64GB of RAM running a local Dask cluster.
I have to take each of the 3 columns and join them to another even larger dataframe.
The documentation recommends partition sizes of 100MB. However, given this amount of data, joining 700 partitions seems to be a lot more work than, for example, joining 70 partitions of 1000MB each.
Is there a reason to keep it at 700 x 100MB partitions?
If not which partition size should be used here?
Does this also depend on the number of workers I use?
1 x 50GB worker
2 x 25GB worker
3 x 17GB worker
Optimal partition size depends on many different things, including available RAM, the number of threads you're using, how large your dataset is, and in many cases the computation that you're doing.
For example, in your case with your join/merge code, it could be that your data is highly repetitive, and so your 100MB partitions may quickly expand out 100x to 10GB partitions and quickly fill up memory. Or they might not; it depends on your data. On the other hand, join/merge code does produce n*log(n) tasks, so reducing the number of tasks (and so increasing partition size) can be highly advantageous.
Determining optimal partition size is challenging. Generally the best we can do is to provide insight about what is going on. That is available here:
https://docs.dask.org/en/latest/best-practices.html#avoid-very-large-partitions
https://docs.dask.org/en/latest/best-practices.html#avoid-very-large-graphs
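As a sketch of the mechanics (not a recommendation for a particular size), repartitioning before the merge looks roughly like this; the file paths, column name, worker configuration, and the 250MB target are all made-up placeholders.

```python
# Hedged sketch: fewer, larger partitions -> fewer tasks, but each task
# needs more memory. Watch the dashboard while trying different sizes.
import dask.dataframe as dd
from dask.distributed import Client

client = Client(n_workers=2, memory_limit="25GB")   # e.g. 2 x 25GB workers

left = dd.read_parquet("left/*.parquet")    # ~70GB, 3 columns (placeholder path)
right = dd.read_parquet("right/*.parquet")  # the even larger dataframe

left = left.repartition(partition_size="250MB")     # assumed target size

joined = left.merge(right, on="key", how="inner")   # "key" is a placeholder
joined.to_parquet("joined/")
```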

Metric/density based clustering/grouping

I have a finite number of points (a cloud), with a metric defined on them. I would like to find the maximum number of clusters in this cloud such that:
1) the maximum distance between any two points in one cluster is smaller than a given epsilon (const)
2) each cluster has exactly k (const) points in it
I looked at all kinds of different clustering methods, and clustering with a restriction on the maximum intra-cluster distance is not a problem (density based). Constraint 2) and the requirement to find "the maximum number of clusters s.t." seem to be problematic, though. Any suggestions for an efficient solution?
Thank you,
A~
Given your constraints, there might be no solution. And actually, that may happen quite often...
The most obvious case is when you don't have a multiple of k points.
But also if epsilon is set too low, there might be points that cannot be put into clusters anymore.
I think you need to rethink your requirements and problem, instead of looking for an algorithm to solve an unreasonably hard requirement that might not be satisfiable.
Also consider whether you really need to find the guaranteed maximum, or just a good solution.
There are some rather obvious approaches that will at least find a good approximation fast.
I have the same impression as @Anony-Mousse, actually: you have not understood your problem and requirements yet.
If you want your cluster sizes to be k, there is no question of how many clusters you will get: it's obviously n/k. So you can try to use a k-means variant that produces clusters of the same size, as described e.g. in this tutorial: Tutorial on same-size k-means, and set the desired number of clusters to n/k.
Note that this is not a particularly sensible or good clustering algorithm. It does something to satisfy the constraints, but the clusters are not really meaningful from a cluster analysis point of view. It's constraint satisfaction, but not cluster analysis.
In order to also satisfy your epsilon constraint, you can then start off with this initial solution (which is probably what @Anony-Mousse referred to as "obvious approaches") and try to perform the same kind of optimization-by-swapping-elements in order to satisfy the epsilon condition; a rough sketch of the equal-size part of this idea follows after the links below.
You may need a number of restarts, because there may be no solution.
See also:
Group n points in k clusters of equal size
K-means algorithm variation with equal cluster size
for essentially redundant questions.
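Here is a rough sketch of that equal-size-assignment idea under stated assumptions: run ordinary k-means with n/k centers, greedily fill every cluster to exactly k points, and then check the epsilon (diameter) constraint. The helper name and parameters are illustrative, and the swap-based refinement is only marked as a stub.

```python
# Hedged sketch, not a proper clustering algorithm: constraint satisfaction
# via same-size assignment on top of k-means centers.
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans

def equal_size_assign(X, group_size, eps, random_state=0):
    n = len(X)
    assert n % group_size == 0, "needs a multiple of group_size points"
    n_clusters = n // group_size

    centers = KMeans(n_clusters=n_clusters, n_init=10,
                     random_state=random_state).fit(X).cluster_centers_
    dist = cdist(X, centers)

    # Assign points in order of how strongly they prefer their best center,
    # skipping clusters that are already full.
    order = np.argsort(dist.min(axis=1) - dist.max(axis=1))
    labels = np.full(n, -1)
    fill = np.zeros(n_clusters, dtype=int)
    for i in order:
        for c in np.argsort(dist[i]):
            if fill[c] < group_size:
                labels[i] = c
                fill[c] += 1
                break

    # Check the diameter (epsilon) constraint; a swap-based refinement that
    # exchanges points between clusters would go here.
    ok = all(cdist(X[labels == c], X[labels == c]).max() <= eps
             for c in range(n_clusters))
    return labels, ok

X = np.random.default_rng(0).normal(size=(60, 2))   # placeholder point cloud
labels, ok = equal_size_assign(X, group_size=6, eps=3.0)
print("all clusters within epsilon:", ok)
```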

k-means empty cluster

I am trying to implement k-means as a homework assignment. My exercise sheet gives me the following remark regarding empty centers:
During the iterations, if any of the cluster centers has no data points associated with it, replace it with a random data point.
That confuses me a bit: firstly, Wikipedia and the other sources I read do not mention that at all. I also read about the problem of 'choosing a good k for your data'. How is my algorithm supposed to converge if I keep setting new centers for clusters that were empty?
If I ignore empty clusters I converge after 30-40 iterations. Is it wrong to ignore empty clusters?
Check out this example of how empty clusters can happen: http://www.ceng.metu.edu.tr/~tcan/ceng465_f1314/Schedule/KMeansEmpty.html
It basically means either 1) a random tremor in the force, or 2) the number of clusters k is wrong. You should iterate over a few different values for k and pick the best.
If during your iterating you should encounter an empty cluster, place a random data point into that cluster and carry on.
I hope this helped on your homework assignment last year.
Handling empty clusters is not part of the k-means algorithm, but it might result in better cluster quality. As for convergence, it is never guaranteed exactly, only heuristically, and hence the convergence criterion is usually extended with a maximum number of iterations.
Regarding the strategy to tackle this problem, I would say that randomly assigning some data point to the empty cluster is not very clever, since it can hurt cluster quality regardless of whether that point's distance to its currently assigned center is large or small. A heuristic for this case would be to choose the farthest point from the biggest cluster, move it to the empty cluster, and repeat until there are no empty clusters.
Statement: k-means can lead to empty clusters.
Consider the following distribution of data points (diagram not shown):
overlapping points mean that the distance between them is del; del tends to 0, so you can assume an arbitrarily small value for it, e.g. 0.01
the dashed box represents a cluster assignment
the legend in the footer represents a number line
N = 6 points
k = 3 clusters (coloured)
final clusters = 2
the blue cluster is orphaned and ends up empty.
Empty clusters can be obtained if no points are allocated to a cluster during the assignment step. If this happens, you need to choose a replacement centroid, otherwise the SSE would be larger than necessary.
* Choose the point that contributes most to the SSE
* Choose a point from the cluster with the highest SSE
* If there are several empty clusters, the above can be repeated several times.
SSE = Sum of Squared Errors.
Check this site https://chih-ling-hsu.github.io/2017/09/01/Clustering#
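As a small illustration of the first strategy (replace an empty cluster's center with the point that contributes most to the SSE), here is a sketch of a single k-means update step; the function and variable names are made up for the example.

```python
# Hedged sketch of one k-means iteration with empty-cluster replacement.
import numpy as np
from scipy.spatial.distance import cdist

def kmeans_step(X, centers):
    dist = cdist(X, centers)                       # (n_points, k) distances
    labels = dist.argmin(axis=1)
    sse_per_point = dist[np.arange(len(X)), labels] ** 2

    new_centers = centers.copy()
    for c in range(len(centers)):
        members = X[labels == c]
        if len(members) == 0:
            # Empty cluster: take the point with the largest SSE contribution.
            worst = sse_per_point.argmax()
            new_centers[c] = X[worst]
            sse_per_point[worst] = 0.0             # don't reuse the same point
        else:
            new_centers[c] = members.mean(axis=0)  # usual centroid update
    return new_centers, labels
```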
You should not ignore empty clusters but replace them. k-means is an algorithm that can only give you local minima, and the empty clusters are local minima that you don't want.
Your program is going to converge even if you replace a center with a random point. Remember that at the beginning of the algorithm you choose the initial K centers randomly; if that can converge, why wouldn't K-1 converged centers plus 1 random point? Just a couple more iterations are needed.
"Choosing good k for your data" refers to the problem of choosing the right number of clusters. Since the k-means algorithm works with a predetermined number of cluster centers, their number has to be chosen at first. Choosing the wrong number could make it hard to divide the data points into clusters or the clusters could become small and meaningless.
I can't give you an answer on whether it is a bad idea to ignore empty clusters. If you do, you might end up with a smaller number of clusters than you defined at the beginning. This will confuse people who expect k-means to work in a certain way, but it is not necessarily a bad idea.
If you relocate empty cluster centers, your algorithm will probably converge anyway, provided that happens a limited number of times. However, if you have to relocate too often, it might happen that your algorithm doesn't terminate.
For "Choosing good k for your data", Andrew Ng gives the example of a tee shirt manufacturer looking at potential customer measurements and doing k-means to decide if you want to offer S/M/L (k=3) or 2XS/XS/S/M/L/XL/2XL (k=7). Sometimes the decision is driven by the data (k=7 gives empty clusters) and sometimes by business considerations (manufacturing costs are less with only three sizes, or marketing says customers want more choices).
Set a variable to track the farthest point and its cluster, based on the distance measure used.
After the allocation step for all the points, check the number of data points in each cluster.
If any is 0, as is the case in this question, split the biggest cluster obtained into 2 sub-clusters.
Replace the selected cluster with these two sub-clusters, one of which takes the place of the empty cluster.
I hope the issue is fixed now. Random assignment would affect the clustering structure of the already obtained clustering.