Cluster 1:
Data 0 [1, 2, 3, 4, 5]
Data 1 [4, 32, 21, 3, 2]
Data 2 [2, 82, 51, 2, 1]
#end of cluster
These are some made-up values (dimension = 5) representing the members of a cluster for k-means.
To calculate a centroid, I understand that the average is taken. However, I am not clear whether we take one average over all the feature values or a separate average for each column.
An example of what I mean:
Average of everything
sum = (1 + 2 + 3 + 4 + 5 + 4 + 32 + 21 + ... + 1) / (total length)
centroid = [sum ,sum, sum, sum, sum]
Average of features
sum1 = avg of first col = (1 + 4 + 2) / 3
sum2 = avg of 2nd col = (2 + 32 + 82) / 3
...
centroid = [sum1 , sum2, sum3, sum4, sum5]
From what I have been told the first seems like the correct way. However, the second makes more sense to me. Can anyone explain which is correct and why?
It's the average of features. The centroid will be
centroid^T = ( (1 + 4 + 2) / 3, (2 + 32 + 82) / 3, ..., (5 + 2 + 1) / 3 )
= ( 7/3, ..., 8/3 )
This makes sense because you want a vector that works as a representative for every data point in the cluster. Therefore, each component of the centroid is the average of that component over all the points, and the resulting point in R^5 serves as the representative of the cluster. Averaging everything into one number would throw away the difference between features and put the same value in every component.
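A minimal sketch of the per-feature average in Python, using the cluster from the question. The function name `centroid` is just an illustrative choice, and the sketch assumes every point has the same number of features.

```python
# Cluster members from the question (each row is one 5-dimensional point).
cluster = [
    [1, 2, 3, 4, 5],
    [4, 32, 21, 3, 2],
    [2, 82, 51, 2, 1],
]


def centroid(points):
    """Return the per-feature (column-wise) mean of equal-length points."""
    n = len(points)
    # zip(*points) regroups the rows into columns, one tuple per feature.
    return [sum(column) / n for column in zip(*points)]


print(centroid(cluster))
```

The first and last components come out as 7/3 and 8/3, matching the hand calculation above.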
In Cracking the Coding Interview, 6th edition, page 6, the amortized time for insertion is explained as:
As we insert elements, we double the capacity when the size of the array is a power of 2. So after X elements, we double the capacity at
array sizes 1, 2, 4, 8, 16, ... , X.
That doubling takes, respectively, 1, 2, 4, 8, 16, 32, 64, ... , X
copies. What is the sum of 1 + 2 + 4 + 8 + 16 + ... + X?
If you read this sum left to right, it starts with 1 and doubles until
it gets to X. If you read right to left, it starts with X and halves
until it gets to 1.
What then is the sum of X + X/2 + X/4 + ... + 1? This is roughly 2X.
Therefore, X insertions take O( 2X) time. The amortized time for each
insertion is O(1).
While for this code snippet (a recursive algorithm),
`
int f(int n) {
    if (n <= 1) {
        return 1;
    }
    return f(n - 1) + f(n - 1);
}
`
The explanation is:
The tree will have depth N. Each node has two children. Therefore,
each level will have twice as many calls as the one above it.
Therefore, there will be 2^0 + 2^1 + 2^2 + 2^3 + ... + 2^N (which is
2^(N+1) - 1) nodes. In this case, this gives us O(2^N).
My question is:
In the first case, we have a GP 1 + 2 + 4 + 8 + ... + X. In the 2nd case we have the same GP 1 + 2 + 4 + 8 + ... + 2^N. Why is the sum 2X in one case while it is 2^(N+1) - 1 in the other?
I think that it might be because we can't represent X as 2^N but I'm not sure.
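Because in the second case N is the depth of the tree and not the total number of nodes. It would be X = 2^N, the substitution you mention. With that substitution the two results agree: 1 + 2 + 4 + ... + X is exactly 2X - 1, and 2X - 1 = 2 * 2^N - 1 = 2^(N+1) - 1. The book only rounds this to "roughly 2X" in the first case and writes it exactly in the second.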
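To check numerically that both sums are the same geometric series, here is a small Python sketch. The helper names `doubling_copies` and `tree_nodes` are made up for illustration, and the sketch assumes X is a power of 2.

```python
def doubling_copies(x):
    """Total elements copied when capacity doubles at sizes 1, 2, 4, ..., x."""
    total, size = 0, 1
    while size <= x:
        total += size  # doubling at this size copies `size` elements
        size *= 2
    return total


def tree_nodes(depth):
    """Nodes in a full binary tree with levels 0..depth: 2^0 + ... + 2^depth."""
    return sum(2 ** level for level in range(depth + 1))


for n in range(11):
    x = 2 ** n
    # Same series, two names for the largest term: X versus 2^N.
    assert doubling_copies(x) == 2 * x - 1
    assert tree_nodes(n) == 2 ** (n + 1) - 1
    assert doubling_copies(x) == tree_nodes(n)
```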
The R metaop should reverse the effect of the operator it applies to. However, it apparently does a bit more than that, reversing lists if that's what it's applied to:
my @crossed = <1 2 3> Z <4 5 6>; # [(1 4) (2 5) (3 6)]
say [RZ] @crossed; # ((3 2 1) (6 5 4))
What I would like to obtain is the original lists; however, the result is reversed. Is there something I'm missing here?
The R metaop does not reverse the effect of the operator. Instead it reverses the order of the operands, i.e.
$lhs <op> $rhs === $rhs R<op> $lhs
Or in your example the semantics are like this:
[RZ] [<1 4>, <2 5>, <3 6>] # is the same as [Z] [<3 6>, <2 5>, <1 4>]
Z itself already recreates the original lists when applied to the zipped result, so there is no need for the R metaop.
my @crossed = <1 2 3> Z <4 5 6>; # [(1 4) (2 5) (3 6)]
say [Z] @crossed; # ((1 2 3) (4 5 6))
Given a huge array of integers, optimize the functions sum(i,j) and update(i,value), so that both the functions take less than O(n).
Update
It's an interview question. I have tried an O(n) sum(i,j) with an O(1) update(i, value). The other solution is to preprocess the input array into a 2-d array to give an O(1) answer for sum(i,j), but that makes the update function O(n).
For example, given an array:
A[] = {4,5,4,3,6,8,9,1,23,34,45,56,67,8,9,898,64,34,67,67,56,...}
The operations to be defined are sum(i,j) and update(i,value).
sum(i,j) gives sum of numbers from index i to j.
update(i, value) updates the value at index i with the given value.
The most straightforward answer is that sum(i,j) can be calculated in O(n) time and update(i,value) in O(1) time.
The second approach is to precompute sum(i,j) and store it in a 2-d array SUM[n][n], so that a query is answered in O(1) time. But then update(i,value) becomes O(n), as an entire row/column corresponding to index i has to be updated.
The interviewer gave me a hint to do some preprocessing and use some data structure, but I couldn't think of one.
What you need is a Segment Tree. A Segment tree can perform sum(i, j) and update(i, value) in O(log(n)) time.
Quote from Wikipedia:
In computer science, a segment tree is a tree data structure for storing intervals, or segments. It allows querying which of the stored segments contain a given point. It is, in principle, a static structure; that is, its structure cannot be modified once it is built.
The leaves of the tree will be the initial array elements. Their parents will be the sum of their children. For example: Suppose data[] = {2, 5, 4, 7, 8, 9, 5}, then our tree will be as follows:
This tree structure is represented using arrays. Let's call this array seg_tree. The root of seg_tree[] will be stored at index 1 of the array. Its two children will be stored at indexes 2 and 3. The general pattern for the 1-indexed representation is:
Left child of index i is at index 2*i.
Right child of index i is at index 2*i+1.
Parent of index i is at index i/2.
and for 0-indexed representation:
Left child of index i is at index 2*i+1.
Right child of index i is at index 2*i+2.
Parent of index i is at index (i-1)/2.
Each interval [i, j] in the above picture denotes the sum of all elements in the interval data[i, j]. The root node will denote the sum of the whole data[], i.e., sum(0, 6) . Its two children will denote the sum(0, 3) and sum(4, 6) and so on.
The length of seg_tree[], MAX_LEN, will be (if n = length of data[]):
2*n - 1 when n is a power of 2
2*2^(floor(log_2(n)) + 1) - 1 when n is not a power of 2.
Construction of seg_tree[] from data[]:
We will assume a 0-indexed construction in this case. Index 0 will be the root of the tree and the elements of the initial array will be stored in the leaves.
data[0...(n-1)] is the initial array and seg_tree[0...(MAX_LEN-1)] is the segment tree representation of data[]. It will be easier to understand how to construct the tree from the pseudocode:
build(node, start, end) {
    // invalid interval
    if start > end:
        return
    // leaf nodes
    if start == end:
        tree[node] = data[start]
        return
    // build left and right subtrees
    build(2*node+1, start, (start + end)/2)
    build(2*node+2, 1+(start+end)/2, end)
    // initialize the parent with the sum of its children
    tree[node] = tree[2*node+1] + tree[2*node+2]
}
Here,
[start, end] denotes the interval in data[] for which segment tree representation is to be formed. Initially, this is (0, n-1).
node represents the current index in the seg_tree[].
We start the building process by calling build(0, 0, n-1). The first argument denotes the position of the root in seg_tree[]. The second and third arguments denote the interval in data[] for which the segment tree representation is to be formed. In each subsequent call, node will represent the index in seg_tree[] and (start, end) will denote the interval for which seg_tree[node] will store the sum.
There are three cases:
start > end is an invalid interval and we simply return from this call.
start == end represents a leaf of seg_tree[], so we initialize tree[node] = data[start].
Otherwise, we are in a valid interval which is not a leaf. So we first build the left child of this node by calling build(2*node+1, start, (start + end)/2), then the right subtree by calling build(2*node+2, 1+(start+end)/2, end). Then we initialize the current index in seg_tree[] with the sum of its child nodes.
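The construction can be sketched in Python as follows. This is only an illustrative translation of the pseudocode, not a tuned implementation: `tree_size` is a helper name I made up, and it sizes the array by rounding n up to the next power of two.

```python
def tree_size(n):
    """Array length for a 0-indexed segment tree over n >= 1 elements."""
    leaves = 1
    while leaves < n:
        leaves *= 2  # round n up to the next power of two
    return 2 * leaves - 1


def build(tree, data, node, start, end):
    """Fill tree[node] with the sum of data[start..end], recursively."""
    if start > end:  # invalid interval
        return
    if start == end:  # leaf node
        tree[node] = data[start]
        return
    mid = (start + end) // 2
    build(tree, data, 2 * node + 1, start, mid)    # left subtree
    build(tree, data, 2 * node + 2, mid + 1, end)  # right subtree
    tree[node] = tree[2 * node + 1] + tree[2 * node + 2]


data = [2, 5, 4, 7, 8, 9, 5]
tree = [0] * tree_size(len(data))
build(tree, data, 0, 0, len(data) - 1)
print(tree[0], tree[1], tree[2])  # sums for (0, 6), (0, 3), (4, 6)
```

Unused slots in the array simply stay 0, which is harmless for sums.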
For sum(i, j):
We need to check whether the intervals at nodes overlap (partial/complete) with the given interval (i, j) or they do not overlap at all. These are the three cases:
For no overlap we can simply return 0.
For complete overlap, we will return the value stored at that node.
For partial overlap, we will visit both the children and continue this check recursively.
Suppose we need to find the value of sum(1, 5). We proceed as follows:
Let us take an empty container (Q) which will store the intervals of interest. Eventually all these ranges will be replaced by the values that they return. Initially, Q = {(0, 6)}.
We notice that (1, 5) only partially overlaps (0, 6), so we remove this range and add its children ranges.
Q = {(0, 3), (4, 6)}
Now, (1, 5) partially overlaps (0, 3). So we remove (0, 3) and insert its two children. Q = {(0, 1), (2, 3), (4, 6)}
(1, 5) partially overlaps (0, 1), so we remove this and insert its two children range. Q = {(0, 0), (1, 1), (2, 3), (4, 6)}
Now (1, 5) does not overlap (0, 0), so we replace (0, 0) with the value that it will return (which is 0 because of no overlap). Q = {0, (1, 1), (2, 3), (4, 6)}
Next, (1, 5) completely overlaps (1, 1), so we return the value stored in the node that represents this range (i.e., 5). Q = {0, 5, (2, 3), (4, 6)}
Next, (1, 5) again completely overlaps (2, 3), so we return the value 11. Q = {0, 5, 11, (4, 6)}
Next, (1, 5) partially overlaps (4, 6) so we replace this range by its two children. Q = {0, 5, 11, (4, 5), (6, 6)}
Fast forwarding the operations, we notice that (1, 5) completely overlaps (4, 5) so we replace this by 17 and (1, 5) does not overlap (6, 6), so we replace it with 0. Finally, Q = {0, 5, 11, 17, 0}. The answer of the query is the sum of all the elements in Q, which is 33.
For update(i, value):
For update(i, value), the process is somewhat similar. First we will search for the range (i, i). All the nodes that we encounter in this path will also need to be updated. Let change = (new_value - old_value). Then while traversing the tree in the search of the range (i, i), we will add this change to all those nodes except the last node which will simply be replaced by the new value. For example, let the query be update(5, 8).
change = 8-9 = -1.
The path encountered will be (0, 6) -> (4, 6) -> (4, 5) -> (5, 5).
Final Value of (0, 6) = 40 + change = 40 - 1 = 39.
Final Value of (4, 6) = 22 + change = 22 - 1 = 21.
Final Value of (4, 5) = 17 + change = 17 - 1 = 16.
Final Value of (5, 5) = 8.
The final tree will look like this:
We can create a segment tree representation using arrays in O(n) time and both of these operations have a time complexity of O(log(n)).
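Both operations can be sketched together in Python. The class name `SegmentTree` and its method names are my own choices for illustration, the sketch assumes a non-empty input, and it allocates 4*n slots as a simple safe upper bound rather than the exact MAX_LEN. One small difference from the description above: the change is added to the ancestors on the way back up from the leaf, since the old value is only known once the leaf is reached.

```python
class SegmentTree:
    """Sum segment tree over a non-empty list, stored 0-indexed in an array."""

    def __init__(self, data):
        self.n = len(data)
        self.tree = [0] * (4 * self.n)  # simple safe upper bound on the size
        self._build(data, 0, 0, self.n - 1)

    def _build(self, data, node, start, end):
        if start == end:
            self.tree[node] = data[start]
            return
        mid = (start + end) // 2
        self._build(data, 2 * node + 1, start, mid)
        self._build(data, 2 * node + 2, mid + 1, end)
        self.tree[node] = self.tree[2 * node + 1] + self.tree[2 * node + 2]

    def sum(self, i, j, node=0, start=0, end=None):
        """Return data[i] + ... + data[j]."""
        if end is None:
            end = self.n - 1
        if j < start or end < i:     # no overlap
            return 0
        if i <= start and end <= j:  # complete overlap
            return self.tree[node]
        mid = (start + end) // 2     # partial overlap: ask both children
        return (self.sum(i, j, 2 * node + 1, start, mid)
                + self.sum(i, j, 2 * node + 2, mid + 1, end))

    def update(self, i, value, node=0, start=0, end=None):
        """Set data[i] = value and return the change (new - old)."""
        if end is None:
            end = self.n - 1
        if start == end:             # reached the leaf for (i, i)
            change = value - self.tree[node]
            self.tree[node] = value
            return change
        mid = (start + end) // 2
        if i <= mid:
            change = self.update(i, value, 2 * node + 1, start, mid)
        else:
            change = self.update(i, value, 2 * node + 2, mid + 1, end)
        self.tree[node] += change    # every node on the path gets the change
        return change


st = SegmentTree([2, 5, 4, 7, 8, 9, 5])
print(st.sum(1, 5))
st.update(5, 8)
print(st.sum(0, 6), st.sum(4, 6), st.sum(4, 5))
```

Run on the example array, the first query should reproduce the 33 worked out above, and the three sums after update(5, 8) should match the final values 39, 21 and 16.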
In general Segment Trees can perform the following operations efficiently:
Update: It can update:
an element at a given index.
all the elements in an interval to a given value. Lazy Propagation technique is generally employed in this case to achieve efficiency.
Query: We query for some value in a given interval. The few basic types of queries are:
Minimum element in an interval
Maximum element in an interval
Sum/Product of all elements in an interval
Another data structure, the Interval Tree, can also be used to solve this problem.
Suggested Readings:
Wikipedia Page on Segment Tree
PEGWiki Page on Segment Tree
HackerEarth notes on Segment Tree
Visualization of Segment Tree Operations
I have this number x which I need to find such that (40 mod x) = 1.
A possible answer for x is 3, or 39, as it goes into the number 40 and leaves a remainder of 1.
What kind of code would I need to find all possible answers for x?
Mathematically, to solve (a mod x) = b, just find all of the divisors of a-b that are greater than b (a remainder is always smaller than the divisor, so a divisor of a-b that is at most b cannot work). e.g. for (40 mod x) = 1, find the divisors of 40 - 1 (i.e. 39), which are 1, 3, 13, and 39. Dropping 1, which is not greater than the remainder, the solutions are 3, 13, and 39.
For (40 mod x) = 5, you find the divisors of 40 - 5 (i.e. 35), which are 1, 5, 7, and 35. 1 and 5 are not greater than 5, but the other two are, so the solutions are 7 and 35.
Of course, for such small numbers, it's more work to find all of the factors of a and a-b than it is to simply do all of the trial divisions of a by x, so the right way to solve your problem is to take exactly the question you asked and put it into code (forgive my VB, I haven't written any in the past 15 years or so...)
For x = 2 To 39
    If (40 Mod x) = 1 Then
        MsgBox(x)
    End If
Next
Enumerable.Range(1, 40).Where(Function(x) 40 Mod x = 1)
The answer to that question is the set of unique integer factors of 39, excluding 1 (since 40 mod 1 = 0).
You can find them by looping from 1 to Math.Sqrt(39) and checking divisibility; each divisor d you find also gives you the paired factor 39 / d.
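A rough Python sketch of the square-root loop, generalized to (a mod x) = b. The function names `divisors` and `solve_mod` are made up for illustration, and the sketch assumes integers with 0 <= b < a.

```python
import math


def divisors(n):
    """All positive divisors of n >= 1, by trial division up to sqrt(n)."""
    found = set()
    for d in range(1, math.isqrt(n) + 1):
        if n % d == 0:
            found.add(d)
            found.add(n // d)  # the paired divisor above sqrt(n)
    return sorted(found)


def solve_mod(a, b):
    """All x with a mod x == b, assuming integers 0 <= b < a."""
    # x must divide a - b, and a remainder is always smaller than the divisor.
    return [x for x in divisors(a - b) if x > b]


print(solve_mod(40, 1))
print(solve_mod(40, 5))
```

For numbers this small the brute-force loop is just as good; the divisor approach only pays off when a is large.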