What is the meaning of bucket height in the Kademlia paper? - kademlia

It said:
We start with some definitions. For a k-bucket covering the distance range 2i,2i+1 , define the index of the bucket to be i. Define the depth, h, of a node to be 160 − i, where i is the smallest index of a non-empty bucket. Define node y’s bucket height in node x to be the index of the bucket into which x would insert y minus the index of x’s least significant empty bucket. Because node IDs are randomly chosen, it follows that highly non-uniform distributions are unlikely. Thus with overwhelming probability the height of a any given node will be within a constant of log n for a system with n nodes. Moreover, the bucket height of the closest node to an ID in the kth-closest node will likely be within a constant of log k.
I can understand the definition of bucket height, but I don't know why we need that definition, and I don't understand the last sentence of the paragraph.
Updates:
I also think that the paper has a typo: the bucket height should be the index of the bucket containing y minus the index of x’s least significant "NON-"empty bucket. Am I wrong?

but I don't know why we need that definition
The argument for O(log n) efficiency of kademlia in terms of routing table size and lookup steps is based on mapping the entire keyspace of n nodes into k-buckets where further-away buckets cover exponentially larger fractions of the keyspace. Effectively compressing the whole network into a biased list of samples.
Then arguments further down are then based on this bucket-based projection.
Moreover, the bucket height of the closest node to an ID in the kth-closest node will likely be within a constant of log k.
I think this is a convoluted way of saying that your k nearest neighbors will all end up in or near the same bucket, i.e. the deepest one (smallest index non-empty bucket).
Note that this is expressed in terms of the flat layout, in the tree layout the smallest bucket would be akin to but not necessarily identical with the the own-ID-covering bucket.

Related

What is the benefit of using same size cluster in K- Means Algorithm?

since K Means using the Centroid and distance measurement Techniques. so was wondering what is the benefit of producing same size cluster in the - K Means Algorithm.
Also also there is one point for the observation, that producing the different size of cluster in k means its a drawback.

Breadth first or Depth first

There is a theory that says six degrees of seperations is the highest
degree for people to be connected through a chain of acquaintances.
(You know the Baker - Degree of seperation 1, the Baker knows someone
you don't know - Degree of separation 2)
We have a list of People P, list A of corresponding acquaintances
among these people, and a person x
We are trying to implement an algorithm to check if person x respects
the six degrees of separations. It returns true if the distance from x
to all other people in P is at most six, false otherwise.
We are tying to accomplish O(|P| + |A|) in the worst-case.
To implement this algorithm, I thought about implementing an adjacency list over an adjacency matrix to represent the Graph G with vertices P and edges A, because an Adjacency Matrix would take O(n^2) to traverse.
Now I thought about using either BFS or DFS, but I can't seem to find a reason as to why the other is more optimal for this case.
I want to use BFS or DFS to store the distances from x in an array d, and then loop over the array d to look if any Degree is larger than 6.
DFS and BFS have the same Time Complexity, but Depth is better(faster?) in most cases at finding the first Degree larger than 6, whereas Breadth is better at excluding all Degrees > 6 simultaneously.
After DFS or BFS I would then loop over the array containing the distances from person x, and return true if there was no entry >6 and false when one is found.
With BFS, the degrees of separations would always be at the end of the Array, which would maybe lead to a higher time complexity?
With DFS, the degrees of separations would be randomly scattered in the Array, but the chance to have a degree of separation higher than 6 early in the search is higher.
I don't know if it makes any difference to the Time Complexity if using DFS or BFS here.
Time complexity of BFS and DFS is exactly the same. Both methods visit all connected vertices of the graph, so in both cases you have O(V + E), where V is the number of vertices and E is the number of edges.
That being said, sometimes one algorithm can be preferred over the other precisely because the order of vertex visitation is different. For instance, if you were to evaluate a mathematical expression, DFS would be much more convenient.
In your case, BFS could be used to optimize graph traversal, because you can simply cut-off BFS at the required degree of separation level. All the people who have the required (or bigger) degree of separation would be left unmarked as visited.
The same trick would be much more convoluted to implement with DFS, because as you've astutely noticed, DFS first gets "to the bottom" of the graph, and then it goes back recursively (or via stack) up level by level.
I believe that you can use the the Dijkstra algorithm.
Is a BFS approach that updates your path, is the path have a smaller value. Think as distance have always a cost of 1 and, if you have two friends (A and B) for a person N.
Those friends have a common friend C but, in a first time your algorithm checks a distance for friend A with cost 4 and mark as visited, they can't check the friend B that maybe have a distance of 3. The Dijkstra will help you doing checking this.
The Dijkstra solve this in O(|V|+|E|log|V)
See more at https://en.wikipedia.org/wiki/Dijkstra%27s_algorithm

Non-empty buckets in LSH

I'm reading this survey about LSH, in particular citing the last paragraph of section 2.2.1:
To improve the recall, L hash tables are constructed, and the items
lying in the L (L ′ , L ′ < L) hash buckets h_1 (q), · · · , h_L (q)
are retrieved as near items of q for randomized R-near neighbor search
(or randomized c- approximate R-near neighbor search).
To guarantee the precision, each of the L hash codes, y_i , needs to
be a long code, which means that the total number of the buckets is
too large to index directly. Thus, only the nonempty buckets are
retained by resorting to convectional hashing of the hash codes h_l
(x).
I have 3 questions:
The bold sentence is not clear to me: what does it mean by "resorting to convenctional hashing of the hash codes h_l (x)"?
Always about the bold sentence, I'm not sure that I got the problem: I totally understand that h_l(x) can be a long code and so the number of possible buckets can be huge. For example, if h_l(x) is a binary code and length is h_l(x)'s length, then we have in total L*2^length possible buckets (since we use L hash tables)...is that correct?
Last question: once we find which bucket the query vector q belongs to, in order to find the nearest neighbor we have to use the original vector q and the original distance metric? For example, let suppose that the original vector q is in 128 dimensions q=[1,0,12,...,14.3]^T and it uses the euclidean distance in our application. Now suppose that our hashing function (supposing that L=1 for simplicity) used in LSH maps this vector to a binary space in 20 dimensions y=[0100...11]^T in order to decide which bucket assign q to. So y has the same index of the bucket B, and which already contains 100 vectors. Now, in order to find the nearest neighbor, we have to compare q with all the others 100 128-dimensions vectors using euclidean distance. Is this correct?
Approach they are using to improve recall constructs more hash tables and essentially stores multiple copies of the ID for each reference item, hence space cost is larger [4]. If there are a lot of empty buckets which increases the retrieval cost, the double-hash scheme or fast search algorithm in the Hamming space can be used to fast retrieve the hash buckets. I think in this case they are using double hash function to retrieve non-empty buckets.
No of buckets/memory cells [1][2][3] -> O(nL)
References:
[1] http://simsearch.yury.name/russir/03nncourse-hand.pdf
[2] http://joyceho.github.io/cs584_s16/slides/lsh-12.pdf
[3] https://users.soe.ucsc.edu/~niejiazhong/slides/kumar.pdf
[4] http://research.microsoft.com/en-us/um/people/jingdw/Pubs%5CLTHSurvey.pdf

k-means empty cluster

I try to implement k-means as a homework assignment. My exercise sheet gives me following remark regarding empty centers:
During the iterations, if any of the cluster centers has no data points associated with it, replace it with a random data point.
That confuses me a bit, firstly Wikipedia or other sources I read do not mention that at all. I further read about a problem with 'choosing a good k for your data' - how is my algorithm supposed to converge if I start setting new centers for cluster that were empty.
If I ignore empty clusters I converge after 30-40 iterations. Is it wrong to ignore empty clusters?
Check out this example of how empty clusters can happen: http://www.ceng.metu.edu.tr/~tcan/ceng465_f1314/Schedule/KMeansEmpty.html
It basically means either 1) a random tremor in the force, or 2) the number of clusters k is wrong. You should iterate over a few different values for k and pick the best.
If during your iterating you should encounter an empty cluster, place a random data point into that cluster and carry on.
I hope this helped on your homework assignment last year.
Handling empty clusters is not part of the k-means algorithm but might result in better clusters quality. Talking about convergence, it is never exactly but only heuristically guaranteed and hence the criterion for convergence is extended by including a maximum number of iterations.
Regarding the strategy to tackle down this problem, I would say randomly assigning some data point to it is not very clever since we might be affecting the clusters quality since the distance to its currently assigned center is large or small. An heuristic for this case would be to choose the farthest point from the biggest cluster and move that the empty cluster, then do so until there are no empty clusters.
Statement: k-means can lead to
Consider above distribution of data points.
overlapping points mean that the distance between them is del. del tends to 0 meaning you can assume arbitary small enough value eg 0.01 for it.
dash box represents cluster assign
legend in footer represents numberline
N=6 points
k=3 clusters (coloured)
final clusters = 2
blue cluster is orphan and ends up empty.
Empty clusters can be obtained if no points are allocated to a cluster during the assignment step. If this happens, you need to choose a replacement centroid otherwise SSE would be larger than neccessary.
*Choose the point that contributes most to SSE
*Choose a point from the cluster with the highest SSE
*If there are several empty clusters, the above can be repeated several times.
***SSE = Sum of Square Error.
Check this site https://chih-ling-hsu.github.io/2017/09/01/Clustering#
You should not ignore empty clusters but replace it. k-means is an algorithm could only provides you local minimums, and the empty clusters are the local minimums that you don't want.
your program is going to converge even if you replace a point with a random one. Remember that at the beginning of the algorithm, you choose the initial K points randomly. if it can converge, how come K-1 converge points with 1 random point can't? just a couple more iterations are needed.
"Choosing good k for your data" refers to the problem of choosing the right number of clusters. Since the k-means algorithm works with a predetermined number of cluster centers, their number has to be chosen at first. Choosing the wrong number could make it hard to divide the data points into clusters or the clusters could become small and meaningless.
I can't give you an answer on whether it is a bad idea to ignore empty clusters. If you do, you might end up with a smaller number of clusters than you defined at the beginning. This will confuse people who expect k-means to work in a certain way, but it is not necessarily a bad idea.
If you re-locate any empty cluster centers, your algorithm will probably converge anyway if that happens a limited number of times. However, you if you have to relocate too often, it might happen that your algorithm doesn't terminate.
For "Choosing good k for your data", Andrew Ng gives the example of a tee shirt manufacturer looking at potential customer measurements and doing k-means to decide if you want to offer S/M/L (k=3) or 2XS/XS/S/M/L/XL/2XL (k=7). Sometimes the decision is driven by the data (k=7 gives empty clusters) and sometimes by business considerations (manufacturing costs are less with only three sizes, or marketing says customers want more choices).
Set a variable to track the farthest distanced point and its cluster based on the distance measure used.
After the allocation step for all the points, check the number of datapoints in each cluster.
If any is 0, as is the case for this question, split the biggest cluster obtained and split further into 2 sub-clusters.
Replace the selected cluster with these sub-clusters.
I hope the issue is fixed now.. Random assignment will affect the clustering structure of the already obtained clustering.

Search optimization problem

Suppose you have a list of 2D points with an orientation assigned to them. Let the set S be defined as:
S={ (x,y,a) | (x,y) is a 2D point, a is an orientation (an angle) }.
Given an element s of S, we will indicate with s_p the point part and with s_a the angle part. I would like to know if there exist an efficient data structure such that, given a query point q, is able to return all the elements s in S such that
(dist(q_p, s_p) < threshold_1) AND (angle_diff(q_a, s_a) < threshold_2) (1)
where dist(p1,p2), with p1,p2 2D points, is the euclidean distance, and angle_diff(a1,a2), with a1,a2 angles, is the difference between angles (taken to be the smallest one). The data structure should be efficient w.r.t. insertion/deletion of elements and the search as defined above. The number of vectors can grow up to 10.000 and more, but take this with a grain of salt.
Now suppose to change the above requirement: instead of using the condition (1), let's request all the elements of S such that, given a distance function d, we want all elements of S such that d(q,s) < threshold. If i remember well, this last setup is called range-search. I don't know if the first case can be transformed in the second.
For the distance search I believe the accepted best method is a Binary Space Partition tree. This can be stored as a series of bits. Each two bits (for a 2D tree) or three bits (for a 3D tree) subdivides the space one more level, increasing resolution.
Using a BSP, locating a set of objects to compare distances with is pretty easy. Just find the smallest set of squares or cubes which contain the edges of your distance box.
For the angle, I don't know of anything. I suppose that you could store each object in a second list or tree sorted by its angle. Then you would find every object at the proper distance using the BSP, every object at the proper angles using the angle tree, then do a set intersection.
You have effectively described a "three dimensional cyclindrical space", ie. a space that is locally three dimensional but where one dimension is topologically cyclic. In other words, it is locally flat and may be modeled as the boundary of a four-dimensional object C4 in (x, y, z, w) defined by
z^2 + w^2 = 1
where
a = arctan(w/z)
With this model, the space defined by your constraints is a 2-dimensional cylinder wrapped "lengthwise" around a cross section wedge, where the wedge wraps around the 4-d cylindrical space with an angle of 2 * threshold_2. This can be modeled using a "modified k-d tree" approach (modified 3-d tree), where the data structure is not a tree but actually a graph (it has cycles). You can still partition this space into cells with hyperplane separation, but traveling along the curve defined by (z, w) in the positive direction may encounter a point encountered in the negative direction. The tree should be modified to actually lead to these nodes from both directions, so that the edges are bidirectional (in the z-w curve direction - the others are obviously still unidirectional).
These cycles do not change the effectiveness of the data structure in locating nearby points or allowing your constraint search. In fact, for the most part, those algorithms are only slightly modified (the simplest approach being to hold a visited node data structure to prevent cycles in the search - you test the next neighbors about to be searched).
This will work especially well for your criteria, since the region you define is effectively bounded by these axis-defined hyperplane-bounded cells of a k-d tree, and so the search termination will leave a region on average populated around pi / 4 percent of the area.