Choosing a centroid for k-means with multi-dimensional data

Cluster 1:
Data 0 [1, 2, 3, 4, 5]
Data 1 [4, 32, 21, 3, 2]
Data 2 [2, 82, 51, 2, 1]
#end of cluster
These are some made-up values (dimension = 5) representing the members of a cluster for k-means.
To calculate the centroid, I understand that an average is taken. However, I am not clear whether we average all of the values together or average each feature (column) separately.
An example of what I mean:
Average of everything
sum = (1 + 2 + 3 + 4 + 5 + 4 + 32 + 21 + ... + 1) / (total number of values)
centroid = [sum, sum, sum, sum, sum]
Average of features
sum1 = avg of first col = (1 + 4 + 2) / 3
sum2 = avg of 2nd col = (2 + 32 + 82) / 3
...
centroid = [sum1 , sum2, sum3, sum4, sum5]
From what I have been told the first seems like the correct way. However, the second makes more sense to me. Can anyone explain which is correct and why?

It's the average of features. The centroid will be
centroid^T = ( (1 + 4 + 2) / 3, (2 + 32 + 82) / 3, ..., (5 + 2 + 1) / 3 )
           = ( 7/3, ..., 8/3 )
This makes sense because you want a vector that acts as a representative of every data point in the cluster. Therefore, each component of the centroid is the average of that component over all the points, and the resulting vector is the point in R^5 space that represents the cluster.
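For instance, a minimal NumPy sketch with the made-up cluster from the question (this snippet is just my illustration, not part of the original post):

import numpy as np

# The three made-up cluster members from the question (5 features each).
cluster = np.array([
    [1,  2,  3, 4, 5],
    [4, 32, 21, 3, 2],
    [2, 82, 51, 2, 1],
])

# Average each feature (column), i.e. along axis 0, to get the centroid.
centroid = cluster.mean(axis=0)
print(centroid)   # 7/3, 116/3, 25, 3, 8/3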


I am interpreting my code as practice, but I don't know why the number 3 comes out in the result

I am currently trying to roll a dice repeatedly and set up the cases below for the random_walk.
When the dice roll is less than or equal to 2, you move one step less than the number rolled.
When the dice roll is less than or equal to 5, you move one step more than the number rolled.
Otherwise, you move 6 steps + alpha (roll the dice once more and add that value to 6).
When I run the code, the result shows like this:
[0, 3, 4, 5, 4, 5, 6, 7, 6, 5, 4, 3, ....]
but I still do not understand why 3 comes out from the result.
For instance, if I roll the dice and get the number 2, then it should be 1 for my available steps. If I roll the dice and get the number 3, then it should be 4 for my available steps.
I don't see any room for 3 here, can you tell me where it came from?
Here is my code:
import numpy as np

random_walk = [0]
for x in range(100):
    step = random_walk[-1]
    dice = np.random.randint(1, 7)
    if dice <= 2:
        # Replace below: use max to make sure step can't go below 0
        step = max(0, step - 1)
    elif dice <= 5:
        step = step + 1
    else:
        step = step + np.random.randint(1, 7)
    random_walk.append(step)
print(random_walk)
Your current logic adds/subtracts 1 from your previous step, hence you get 3 when your previous step is 2. It seems like it should be:
random_walk = [0]
for x in range(100):
    step = random_walk[-1]
    dice = np.random.randint(1, 7)
    if dice <= 2:
        # Replace below: use max to make sure step can't go below 0
        step = max(0, dice - 1)
    elif dice <= 5:
        step = dice + 1
    else:
        step = 6 + np.random.randint(1, 7)
    random_walk.append(step)
print(random_walk)
By the way, your question is not related to pandas or DataFrames, so you should remove those tags.

How to find a polynomial as an approximate solution to a nonlinear equation?

For my small FLOSS project, I want to approximate the Green et al. equation for maximum shear stress for point contact:
which should look like this when plotted
the same equation in Maxima:
A: (3 / 2 / (1 + zeta^2) - 1 - nu + zeta * (1 + nu) * acot(zeta)) / 2;
Now, to find the maximum 𝜏max, I differentiate the above equation with respect to 𝜁:
diff(A, zeta);
trying to solve the derivative for 𝜁:
solve(diff(A, zeta), zeta);
I ended up with a multipage equation that I can't actually use or test.
Now I was wondering if I can find the polynomial:
𝜁max = a + b*𝜈 + c*𝜈^2 + ...
that approximately solves the
diff(A, zeta) = 0
equation for 0 < 𝜈 < 0.5 and 0 < 𝜁 < 1.
(1) Probably the first thing to try is just to solve diff(A, zeta) = 0 numerically (via find_root in this case). Here is an approximate solution for one value of nu:
(%i2) A: (3 / 2 / (1 + zeta^2) - 1 - nu + zeta * (1 + nu) * acot(zeta)) / 2;
(%o2) ((nu + 1)*zeta*acot(zeta) + 3/(2*(zeta^2 + 1)) - nu - 1)/2
(%i3) dAdzeta: diff(A, zeta);
(%o3) ((nu + 1)*acot(zeta) - (nu + 1)*zeta/(zeta^2 + 1) - 3*zeta/(zeta^2 + 1)^2)/2
(%i4) find_root (subst ('nu = 0.25, dAdzeta), zeta, 0, 1);
(%o4) 0.4643131929806135
Here I'll plot the approximate solution for different values of nu:
(%i5) plot2d (find_root (dAdzeta, zeta, 0, 1), [nu, 0, 0.5]) $
Let's plot that together with Eq. 10 which is the approximation derived in the paper by Green:
(%i6) plot2d ([find_root (dAdzeta, zeta, 0, 1), 0.38167 + 0.33136*nu], [nu, 0, 0.5]) $
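As a sanity check outside Maxima (a sketch of my own, not part of the original answer), the same root can be found numerically in Python with SciPy; acot(zeta) is written as arctan(1/zeta), which is valid for zeta > 0:

import numpy as np
from scipy.optimize import brentq

nu = 0.25

def dA_dzeta(zeta):
    # Same derivative as %o3 above, with acot(zeta) = arctan(1/zeta) for zeta > 0.
    return ((nu + 1) * np.arctan(1 / zeta)
            - (nu + 1) * zeta / (zeta**2 + 1)
            - 3 * zeta / (zeta**2 + 1)**2) / 2

# Bracket the root just inside (0, 1) to avoid dividing by zero at zeta = 0.
print(brentq(dA_dzeta, 1e-9, 1.0))   # about 0.4643, matching find_root above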
(2) I looked at some different ways to get to a symbolic solution and here is something which is maybe workable. Note that this is also an approximation since it's derived from a Taylor series. You would have to look at whether it works well enough.
Find a low-order Taylor series for acot and plug it into dAdzeta.
(%i7) acot_approx: taylor (acot(zeta), zeta, 1/2, 3);
(%o7)/T/ atan(2) - 4*(zeta - 1/2)/5 + 8*(zeta - 1/2)^2/25 + 16*(zeta - 1/2)^3/375 + ...
(%i8) dAdzeta_approx: subst (acot(zeta) = acot_approx, dAdzeta);
(%o8)/T/ ((25*atan(2) - 10)*nu + 25*atan(2) - 34)/50 - (80*nu + 104)*(zeta - 1/2)/125
         + (320*nu + 1184)*(zeta - 1/2)^2/625 - (640*nu + 11584)*(zeta - 1/2)^3/9375 + ...
The approximate dAdzeta is a cubic polynomial in zeta, so we can solve it. The result is a big messy expression. The first two solutions are complex and the third is real, so I guess that's the one we want.
(%i9) zeta_max: solve (dAdzeta_approx = 0, zeta);
<large mess omitted here>
(%i10) grind (zeta_max[3]);
zeta = ((625*sqrt((22500*atan(2)^2+30000*atan(2)-41200)*nu^4
+(859500*atan(2)^2-1878000*atan(2)+926000)
*nu^3
+(9022725*atan(2)^2-15859620*atan(2)+7283316)
*nu^2
+(15556950*atan(2)^2-36812760*atan(2)
+19709144)
*nu+7371225*atan(2)^2-22861140*atan(2)
+17716484))
/(256*(10*nu+181)^2)
+((3*((9375*nu+9375)*atan(2)+4810*nu+6826))/(1280*nu+23168)
-((90*nu+549)*(1410*nu+4281))/((10*nu+181)*(80*nu+1448)))
/6+(90*nu+549)^3/(27*(10*nu+181)^3))
^(1/3)
-((1410*nu+4281)/(3*(80*nu+1448))
+((-1)*(90*nu+549)^2)/(9*(10*nu+181)^2))
/((625*sqrt((22500*atan(2)^2+30000*atan(2)-41200)*nu^4
+(859500*atan(2)^2-1878000*atan(2)+926000)
*nu^3
+(9022725*atan(2)^2-15859620*atan(2)+7283316)
*nu^2
+(15556950*atan(2)^2-36812760*atan(2)
+19709144)
*nu+7371225*atan(2)^2-22861140*atan(2)
+17716484))
/(256*(10*nu+181)^2)
+((3*((9375*nu+9375)*atan(2)+4810*nu+6826))
/(1280*nu+23168)
-((90*nu+549)*(1410*nu+4281))
/((10*nu+181)*(80*nu+1448)))
/6+(90*nu+549)^3/(27*(10*nu+181)^3))
^(1/3)+(90*nu+549)/(3*(10*nu+181))$
I tried some ideas to simplify the solution, but didn't find anything workable. Whether it's usable in its current form, I'll let you be the judge. Plotting the approximate solution along with the other two seems to show they're all pretty close together.
(%i18) plot2d ([find_root (dAdzeta, zeta, 0, 1),
0.38167 + 0.33136*nu,
rhs(zeta_max[3])],
[nu, 0, 0.5]) $
Here's a different approach, which is to calculate some approximate values by find_root and then assemble an approximation function which is a cubic polynomial. This makes use of a little function I wrote named polyfit. See: https://github.com/maxima-project-on-github/maxima-packages/tree/master/robert-dodier and then look in the polyfit folder.
(%i2) A: (3 / 2 / (1 + zeta^2) - 1 - nu + zeta * (1 + nu) * acot(zeta)) / 2;
(%o2) ((nu + 1)*zeta*acot(zeta) + 3/(2*(zeta^2 + 1)) - nu - 1)/2
(%i3) dAdzeta: diff(A, zeta);
(%o3) ((nu + 1)*acot(zeta) - (nu + 1)*zeta/(zeta^2 + 1) - 3*zeta/(zeta^2 + 1)^2)/2
(%i4) nn: makelist (k/10.0, k, 0, 5);
(%o4) [0.0, 0.1, 0.2, 0.3, 0.4, 0.5]
(%i5) makelist (find_root (dAdzeta, zeta, 0, 1), nu, nn);
(%o5) [0.3819362006941755, 0.4148794361988409,
0.4478096487716516, 0.4808644852928955, 0.5141748609122403,
0.5478684611102143]
(%i7) load ("polyfit.mac");
(%o7) polyfit.mac
(%i8) foo: polyfit (nn, %o5, 3) $
(%i9) grind (foo);
[beta = matrix([0.4643142407230925],[0.05644202066198245],
[2.746081069103333e-4],[1.094924180450318e-4]),
Yhat = matrix([0.3819365703555216],[0.4148782994206623],
[0.4478104992708994],[0.4808650578507559],
[0.5141738631047557],[0.5478688029774219]),
residuals = matrix([-3.696613460890674e-7],
[1.136778178534303e-6],
[-8.504992477509354e-7],
[-5.725578604010018e-7],
[9.97807484637292e-7],
[-3.418672076538343e-7]),
mse = 5.987630959972099e-13,Xmean = 0.25,
Xsd = 0.1707825127659933,
f = lambda([X],
block([Xtilde:(X-0.25)/0.1707825127659933,X1],
X1:[1,Xtilde,Xtilde^2,Xtilde^3],
X1 . matrix([0.4643142407230925],
[0.05644202066198245],
[2.746081069103333e-4],
[1.094924180450318e-4])))]$
(%o9) done
Not sure which pieces are going to be most relevant, so I just returned several things. Items can be extracted via assoc. Here I'll extract the constructed function.
(%i10) assoc ('f, foo);
(%o10) lambda([X], block([Xtilde : (X - 0.25)/0.1707825127659933, X1],
               X1 : [1, Xtilde, Xtilde^2, Xtilde^3],
               X1 . matrix([0.4643142407230925], [0.05644202066198245],
                           [2.746081069103333e-4], [1.094924180450318e-4])))
(%i11) %o10(0.25);
(%o11) 0.4643142407230925
Plotting the function shows it is close to the values returned by find_root.
(%i12) plot2d ([find_root (dAdzeta, zeta, 0, 1), %o10], [nu, 0, 0.5]);
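If Maxima is not a hard requirement, the same idea of fitting a cubic to numerically computed roots can be sketched in Python with NumPy/SciPy (again only an illustration of mine; note that numpy.polyfit does not standardize the inputs the way polyfit.mac does, although the fitted cubic is mathematically the same):

import numpy as np
from scipy.optimize import brentq

def dA_dzeta(zeta, nu):
    # Derivative from %o3, with acot(zeta) = arctan(1/zeta) for zeta > 0.
    return ((nu + 1) * np.arctan(1 / zeta)
            - (nu + 1) * zeta / (zeta**2 + 1)
            - 3 * zeta / (zeta**2 + 1)**2) / 2

nus = np.arange(0.0, 0.51, 0.1)
zeta_max = [brentq(dA_dzeta, 1e-9, 1.0, args=(nu,)) for nu in nus]

coeffs = np.polyfit(nus, zeta_max, 3)   # cubic fit, highest-degree coefficient first
print(np.polyval(coeffs, 0.25))         # about 0.4643, matching %o11 above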

Different big O notation for the same calculation (Cracking the Coding Interview)

In Cracking the Coding Interview, 6th edition, page 6, the amortized time for insertion is explained as:
As we insert elements, we double the capacity when the size of the array is a power of 2. So after X elements, we double the capacity at
array sizes 1, 2, 4, 8, 16, ... , X.
That doubling takes, respectively, 1, 2, 4, 8, 16, 32, 64, ... , X
copies. What is the sum of 1 + 2 + 4 + 8 + 16 + ... + X?
If you read this sum left to right, it starts with 1 and doubles until
it gets to X. If you read right to left, it starts with X and halves
until it gets to 1.
What then is the sum of X + X/2 + X/4 + ... + 1? This is roughly 2X.
Therefore, X insertions take O(2X) time. The amortized time for each
insertion is O(1).
While for this code snippet (a recursive algorithm),
int f(int n) {
    if (n <= 1) {
        return 1;
    }
    return f(n - 1) + f(n - 1);
}
The explanation is:
The tree will have depth N. Each node has two children. Therefore,
each level will have twice as many calls as the one above it.
Therefore, there will be 2^0 + 2^1 + 2^2 + 2^3 + ... + 2^N (which is
2^(N+1) - 1) nodes. In this case, this gives us O(2^N).
My question is:
In the first case, we have the GP 1 + 2 + 4 + 8 + ... + X. In the second case we have the same GP 1 + 2 + 4 + 8 + ... + 2^N. Why is the sum 2X in one case while it is 2^(N+1) - 1 in the other?
I think it might be because we can't represent X as 2^N, but I'm not sure.
Because in the second case N is the depth of the tree, not the total number of elements. If you set X = 2^N (the largest term), as you already stated, then 2^(N+1) - 1 = 2*2^N - 1, which is roughly 2X, so the two results agree; the only difference is whether the sum is expressed in terms of the largest term X or the depth N.
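A small Python sketch (my own illustration, not from the book) makes the difference concrete: both counts are the same geometric series, parameterized by the final size X in one case and by the depth N in the other.

def doubling_copies(x):
    # Total copies made while doubling the capacity through sizes 1, 2, 4, ..., x
    # (x is assumed to be a power of 2).
    size, copies = 1, 0
    while size <= x:
        copies += size
        size *= 2
    return copies

def calls(n):
    # Number of calls made by f(n) = f(n - 1) + f(n - 1) with base case n <= 1.
    if n <= 1:
        return 1
    return 1 + 2 * calls(n - 1)

print(doubling_copies(16))   # 31 = 2*16 - 1, i.e. roughly 2X -> O(X)
print(calls(4))              # 15 = 2^4 - 1, the same order as 2^(N+1) - 1 -> O(2^N)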

limiting a variable to a set of values in linear programming

I am writing a linear program in LINGO to balance a drum using a minimum number of weights. My question is how do I limit a variable to a set of values? For example, if I wanted a variable called Weight to be limited to the values (0, 1, 2, 4, 5, or 10) how could I achieve this?
The usual way to achieve this is by introducing several binary (0,1) indicator variables into the formulation.
Let's say that X is the variable of interest and it can take the discrete values {0, 1, 2, 4, 5, 10}.
Introduce six binary indicator variables (Y_0, Y_1, ..., Y_10), one per allowed value.
We want exactly one of these Y's to take the value 1 and all the others to be 0:
Y_0 + Y_1 + Y_2 + Y_4 + Y_5 + Y_10 = 1 (Mutual exclusivity constraint)
Now tie the indicator variables with the Original variable.
X = 0 Y_0 + 1 Y_1 + 2 Y_2 + 4 Y_4 + 5 Y_5 + 10 Y_10
(X will take on the appropriate value depending on which indicator variable is 1.)
Now use X in the rest of your formulation.
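The question is about LINGO, but as an illustration of the same indicator-variable formulation, here is a small sketch in Python with the PuLP modeling library (the variable names and the dummy objective are my own, not from the question):

from pulp import LpProblem, LpVariable, LpMinimize, lpSum, value

allowed = [0, 1, 2, 4, 5, 10]

prob = LpProblem("restrict_weight_to_a_set", LpMinimize)

# One binary indicator per allowed value.
y = {v: LpVariable(f"y_{v}", cat="Binary") for v in allowed}
weight = LpVariable("Weight", lowBound=0)

prob += lpSum(y.values()) == 1                       # mutual exclusivity constraint
prob += weight == lpSum(v * y[v] for v in allowed)   # tie Weight to the chosen value
prob += lpSum([weight])                              # dummy objective so the model is complete

prob.solve()
print(value(weight))   # 0.0 here (the dummy objective minimizes Weight); only allowed values are feasible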

Checkerboard indexing in CUDA

So, here's the question. I want to do a computation in CUDA where I have a large 1D array (which represents a lattice), I partition it into subarrays of length #part, and I want each thread to do a couple of computations on each subarray.
More specifically, let's say that we have a number of threads, #threads, and a number of blocks, #blocks. The array is of size N = 2 * #part * #threads * #blocks. If we number the subarrays from 1 to 2*#blocks*#threads, we want to first use the #threads*#blocks threads to do computation on the subarrays with an even number and then the same number of threads to do computation on the subarrays with an odd number.
I thought that I could have a local index in each thread which would denote where its subarray starts.
So, I used the following index :
localIndex = #part * (2 * threadIdx.x + var) + 2 * #part * #Nthreads * blockIdx.x;
var is either 1 or 0, depending on whether we want the thread to work on a subarray with an even or an odd number.
I've tried to run it, and something seems to go wrong when I use more than one block. Have I done something wrong with the indexing?
Thanks.
Why is it important that the threads collectively do the even subarrays first and then the odd ones? Since block and thread execution is not guaranteed to happen in any particular order, there is no benefit to that.
Assuming you index only using x-dimension for your kernel dimension setup:
subArrayIndexEven = 2 * (blockIdx.x * blockDim.x + threadIdx.x) * part
subArrayIndexOdd = subArrayIndexEven + part
Check:
BLOCK_SIZE = 3
NUM_OF_BLOCKS = 2
PART = 4
N = 2 * 3 * 2 * 4 = 48
T(threadIdx.x, blockIdx.x)
T(0, 1) -> even = 2 * (1 * 3 + 0) * 4 = 24, odd = 28
T(1, 1) -> even = 2 * (1 * 3 + 1) * 4 = 32, odd = 36
T(2, 1) -> even = 2 * (1 * 3 + 2) * 4 = 40, odd = 44
int idx = threads_per_block*blockIdx.x + threadIdx.x;
int my_even_offset, my_odd_offset, my_even_idx, my_odd_idx;
int my_offset = floor(float(idx)/float(num_part));
my_even_offset = 2*my_offset*num_part;
my_odd_offset = (2*my_offset+1)*num_part;
my_even_idx = idx + my_even_offset;
my_odd_idx = idx + my_odd_offset;
//Do stuff with the indices.
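A quick host-side enumeration (a Python sketch of my own, reusing the toy sizes from the example above) confirms that the subArrayIndexEven/subArrayIndexOdd formulas together cover every subarray start exactly once:

BLOCK_SIZE = 3        # threads per block (blockDim.x)
NUM_OF_BLOCKS = 2
PART = 4              # elements per subarray
N = 2 * BLOCK_SIZE * NUM_OF_BLOCKS * PART   # 48

starts = []
for block in range(NUM_OF_BLOCKS):
    for thread in range(BLOCK_SIZE):
        even = 2 * (block * BLOCK_SIZE + thread) * PART
        odd = even + PART
        starts += [even, odd]

# Every subarray start 0, 4, 8, ..., 44 appears exactly once.
assert sorted(starts) == list(range(0, N, PART))
print(sorted(starts))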