Clustering method: choosing variables when the variables sum to 1 for each observation

There are three variables: x, y, and z = 1 - (x + y), with x >= 0, y >= 0, z >= 0.
The data look like this:
O(1)= (x1,y1,z1)
O(2)= (x2,y2,z2)
...
O(n)= (xn,yn,zn)
I thought that z is not necessary to express each observation, because z is determined by x and y.
So I clustered the data using only x and y.
I also clustered the same data using x, y, and z.
The results are different, because the pairwise distances in 2D and 3D are not equal.
But what is the right way to do clustering in this case?
Do I have to use x and y, or x, y, and z? And why?
Please, someone help me. Thanks in advance!
Below is my R code.
############
x <- sample(c(0:100), 100, replace = TRUE)
y <- sample(c(0:100), 100, replace = TRUE)
z <- 200 - (x + y)                  # so x + y + z = 200
xyz <- cbind(x, y, z)
xyz <- xyz / 200                    # normalize: now x + y + z = 1, i.e. z = 1 - (x + y)
xy <- xyz[,-3]                      # drop z
require(fpc)
(xy.pamk <- pamk(xy))               # cluster on (x, y) only
plot(xy, col = xy.pamk$pamobject$clustering)
(xyz.pamk <- pamk(xyz))             # cluster on (x, y, z)
require(rgl)
plot3d(xyz, col = xyz.pamk$pamobject$clustering, xlim = c(0,1), ylim = c(0,1), zlim = c(0,1))
##############

Your theory that you don't need z because it can be computed from x and y is flawed.
If that were true, then clustering on (x, y), on (x, z), and on (y, z) would all give the same result.
But the algorithms don't assume x + y + z = 1. They won't assume x*x + y*y + z*z = 1 either, or any other dependency among the features. Concretely, since differences between observations satisfy dz = -(dx + dy), the squared 3D distance is 2*(dx^2 + dy^2 + dx*dy), which is not a constant multiple of the squared 2D distance dx^2 + dy^2; the two representations induce genuinely different geometries.
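Here is a quick numeric illustration (my own sketch, in Python, mirroring the R example above): if dropping z were harmless, every pairwise 3D distance would be a fixed multiple of the corresponding 2D distance, but the ratio actually varies between 1 and sqrt(3).

import numpy as np

rng = np.random.default_rng(0)
x = rng.integers(0, 101, size=100)
y = rng.integers(0, 101, size=100)
z = 200 - (x + y)
xyz = np.column_stack([x, y, z]) / 200.0   # rows sum to 1, as in the R code
xy = xyz[:, :2]

def pairwise(a):
    # full matrix of Euclidean distances between rows of a
    d = a[:, None, :] - a[None, :, :]
    return np.sqrt((d ** 2).sum(axis=-1))

d2 = pairwise(xy)
d3 = pairwise(xyz)
mask = d2 > 0
ratio = d3[mask] / d2[mask]
print(ratio.min(), ratio.max())   # spread within [1, sqrt(3)]: the metrics disagree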

Related

Why higher-order functions are not popular in NumPy

For example, x * y + z could be expressed with a higher-order function,
something like on3(x, y, z, lambda x, y, z: x * y + z), which in theory could save a lot of computation.
My question is why such patterns are rare in NumPy.
If you write x * y + z with NumPy arrays, x * y will allocate a temporary array, and then + z will allocate the final array. If you need to squeeze out every bit of performance, you can avoid the intermediate temporary array like this:
r = x * y
r += z
That will only allocate a single array, which is nearly optimal assuming you don't want to mutate the inputs. If you do want to mutate them, you could instead do this:
x *= y # or np.multiply(x, y, out=x)
x += z # or np.add(x, z, out=x)
Then you allocate nothing.
The above may still not be optimal if the data do not fit in cache, because you have to traverse x twice. You can solve this problem using Numba, by writing a vectorized function:
import numba

@numba.vectorize
def fma(x, y, z):
    return x * y + z
Now when you run fma(x, y, z) it will visit the first element of each array, run x * y + z on those three elements, detect the type of the result, allocate an output array of that type, and then do the calculation for the rest of the elements.
Putting it all together, doing a single pass over the inputs and allocating nothing:
fma(x, y, z, out=x) # can also use out=y or out=z
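For a rough sense of the payoff, here is a hedged benchmark sketch (my own, not from the original answer; actual numbers depend on array size and cache behavior):

import numpy as np
import numba
import timeit

@numba.vectorize
def fma(x, y, z):
    return x * y + z

n = 10_000_000
x = np.random.rand(n)
y = np.random.rand(n)
z = np.random.rand(n)

fma(x, y, z)                                           # trigger compilation once before timing
print(timeit.timeit(lambda: x * y + z, number=10))     # two passes, two temporaries
print(timeit.timeit(lambda: fma(x, y, z), number=10))  # one fused pass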

Python to fit a linear-plateau curve

I have a curve where Y initially increases linearly with X, then reaches a plateau at a point C.
In other words, the curve can be defined as:
if X < C:
    Y = k * X + b
else:
    Y = k * C + b
The training data is a list of X ~ Y values. I need to determine k, b and C through a machine-learning approach (or similar), since the data is noisy and the inflection point C changes over time. I want something more robust than estimating C by eyeballing the current sample data.
How can I do it using sklearn or maybe scipy?
WLOG you can say the second equation is
Y = C
It looks like you have a linear regression to fit the line, and then a detection problem to find the constant.
You know that at high values of X, i.e. X > C, you are already at the constant. So just check how far back down the values of X you keep getting the same constant.
Then do a linear regression to find the line on the values of X with X <= C.
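A rough sketch of that two-step idea (my own; the helper name, the top-20% heuristic, and the tolerance are assumptions, not from the answer):

import numpy as np

def fit_linear_plateau(x, y, tol=0.05):
    order = np.argsort(x)
    x, y = x[order], y[order]
    plateau = y[-len(y) // 5:].mean()            # plateau level from the largest X values
    rising = y < plateau - tol * abs(plateau)    # points clearly below the plateau
    k, b = np.polyfit(x[rising], y[rising], 1)   # regression on the rising part
    c = (plateau - b) / k                        # breakpoint where the line meets the plateau
    return k, b, c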
Your model is nonlinear.
I think the smartest way to solve this is to do these steps:
find the maximum value of Y, which is equal to k*C + b:
M = max(Y)
drop this maximum value from your dataset:
df1 = df[df.Y != M]
and then you have a simple dataset on which to fit X against Y, and you can use sklearn for that.
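Alternatively, the piecewise model can be fit directly. A minimal sketch using scipy.optimize.curve_fit (my own, not from the answers above; the data here are synthetic, for illustration only):

import numpy as np
from scipy.optimize import curve_fit

def linear_plateau(x, k, b, c):
    # Y = k*X + b below the breakpoint c, constant k*c + b above it
    return np.where(x < c, k * x + b, k * c + b)

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 200)
y = linear_plateau(x, 2.0, 1.0, 6.0) + rng.normal(0, 0.5, x.size)

(k, b, c), _ = curve_fit(linear_plateau, x, y, p0=[1.0, 0.0, x.mean()])
print(k, b, c)   # should recover roughly k = 2, b = 1, c = 6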

Tensorflow: ignore a specific dependency during tf.gradients()

Given variables y and z, both of which depend on a tensor x: by the product rule, if I do tf.gradients(y*z, x), it gives me y'(x)z(x) + z'(x)y(x). Is there a way I can declare y constant with respect to x, so that tf.gradients(y*z, x) only gives me z'(x)y(x)?
I know y_ = tf.constant(sess.run(y)) would give me y as a constant, but I cannot use that solution in my code.
You can use tf.stop_gradient() to block backpropagation. To block gradients in your example:
y = function1(x)
z = function2(x)
blocked_y = tf.stop_gradient(y)
product = blocked_y * z
After you backpropagate through product, the backpropagation will continue to z and not y.
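Continuing the snippet above, a quick sanity check (my own sketch, TF1-style graph mode as in the question):

# gradient of the blocked product: only the z'(x) * y term survives
grad_blocked = tf.gradients(product, x)[0]
# gradient of the unblocked product: the full product rule applies
grad_full = tf.gradients(y * z, x)[0]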

Conditional Entropy if outcome is known

I have a question about entropy and information flow. Suppose that X = {-1, 1}, meaning that it can be either -1 or 1, and consider the following assignment for Y:
Y := X * X
My point is that the value of Y, after the assignment, will always be 1: if X = -1 then Y = 1, and if X = 1 then Y = 1. Knowing this, can I still assume that the conditional entropy H(X/Y) = 0, because knowing X will always tell you the value of Y? On the other hand, the conditional entropy H(Y/X) = 1.0, because knowing Y will not give me the value of X.
Am I thinking in the right direction? Please help.
You are partially correct, though it seems like your notation and your definition are swapped.
H(X|Y) is the entropy of X given Y, not the entropy of Y given X.
Also, you should look at the conditioning here more carefully. You have a completely deterministic relationship between X and Y, i.e. Y = f(X). In that situation, just as you say, the conditional entropy is 0, but with the notation the other way around. Thus it should be
H(Y|X) = 0
On the other hand, if you are given Y, you have no clue what X is: both -1 and 1 have equal probability. So in this case
H(X|Y) = 1
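A tiny numeric check (my own sketch, assuming X is uniform on {-1, 1}):

import math

p_x = {-1: 0.5, 1: 0.5}
# Y = X * X is deterministic given X, so H(Y|X) = 0.
# Y is always 1, so conditioning on Y reveals nothing: H(X|Y) = H(X).
h_x = -sum(p * math.log2(p) for p in p_x.values())
print(h_x)   # 1.0 bit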

How to rename a variable which respects the name scope?

Given that x and y are tensors, I know I can do
with tf.name_scope("abc"):
    z = tf.add(x, y, name="z")
So that z is named "abc/z".
I am wondering if there exists a function f which assigns the name directly in the following case:
with tf.name_scope("abc"):
    z = x + y
    f(z, name="z")
The stupid f I am using now is z = tf.add(0, z, name="z")
If you want to "rename" an op, there is no way to do that directly, because a tf.Operation (or tf.Tensor) is immutable once it has been created. The typical way to rename an op is therefore to use tf.identity(), which has almost no runtime cost:
with tf.name_scope("abc"):
    z = x + y
    z = tf.identity(z, name="z")
Note however that the recommended way to structure your name scope is to assign the name of the scope itself to the "output" from the scope (if there is a single output op):
with tf.name_scope("abc") as scope:
    # z will get the name "abc". x and y will have names in "abc/..." if they
    # are converted to tensors.
    z = tf.add(x, y, name=scope)
This is how the TensorFlow libraries are structured, and it tends to give the best visualization in TensorBoard.
It seems this also works without tf.name_scope, using only z = tf.identity(z, name="z_name"). If you additionally run z = tf.identity(z, name="z_name_new"), then you can retrieve the value under both names: tf.get_default_graph().get_tensor_by_name("z_name:0") or tf.get_default_graph().get_tensor_by_name("z_name_new:0").
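A small check of that comment (my own sketch, TF1 graph mode assumed; note the identities are chained, so each name refers to its own op producing the same value):

import tensorflow as tf

x = tf.constant(1.0)
y = tf.constant(2.0)
z = tf.identity(x + y, name="z_name")
z = tf.identity(z, name="z_name_new")
g = tf.get_default_graph()
print(g.get_tensor_by_name("z_name:0"))      # Tensor("z_name:0", ...)
print(g.get_tensor_by_name("z_name_new:0"))  # Tensor("z_name_new:0", ...)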