Question
Why does the same value -3.29686744 result in a different mean and standard deviation?
Expected
import numpy as np

X = np.array([
[-1.11793447, -3.29686744, -3.50615096],
[-1.11793447, -3.29686744, -3.50615096],
[-1.11793447, -3.29686744, -3.50615096]
])
mean = np.mean(X, axis=0)
print(f"mean is \n{mean}\nX-mean is \n{X-mean}\n")
sd = np.std(X, axis=0)
print(f"SD is \n{sd}\n")
Result:
mean is
[-1.11793447 -3.29686744 -3.50615096]
X-mean is
[[0. 0. 0.]
[0. 0. 0.]
[0. 0. 0.]]
SD is
[0. 0. 0.]
Unexpected
X = np.array([
[-1.11793447, -3.29686744, -3.50615096],
[-1.11793447, -3.29686744, -3.50615096],
[-1.11793447, -3.29686744, -3.50615096],
[-1.11793447, -3.29686744, -3.50615096],
[-1.11793447, -3.29686744, -3.50615096]
])
mean = np.mean(X, axis=0)
print(f"mean is \n{mean}\nX-mean is \n{X-mean}\n")
sd = np.std(X, axis=0)
print(f"SD is \n{sd}\n")
Result is:
mean is
[-1.11793447 -3.29686744 -3.50615096]
X-mean is
[[0.0000000e+00 4.4408921e-16 4.4408921e-16]
[0.0000000e+00 4.4408921e-16 4.4408921e-16]
[0.0000000e+00 4.4408921e-16 4.4408921e-16]
[0.0000000e+00 4.4408921e-16 4.4408921e-16]
[0.0000000e+00 4.4408921e-16 4.4408921e-16]]
SD is
[0.0000000e+00 4.4408921e-16 4.4408921e-16]
This is normal behavior when you consider that IEEE-754 double precision floats are stored as 64 bits of data: 1 sign bit, 11 exponent bits, and 52 explicitly stored mantissa bits (53 significant bits once the implicit leading 1 is counted). You can look up the details elsewhere.
The important part is that floats are effectively stored as integers with a scale factor. This is the binary analog of scientific notation. In fact, you can intuit exactly what is happening using more familiar decimal scientific notation.
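You can see this integer-times-scale-factor form directly in Python; float.hex() prints the sign, the hexadecimal mantissa, and the power-of-two exponent (a quick illustration, not part of the original explanation):
v = -3.29686744
print(v.hex())   # '-0x1.<13 hex digits>p+1': a 53-bit mantissa scaled by 2**1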
Let's say you have three digits of decimal precision available, and you want to compute the mean of [2.31e2, 2.31e2, 2.31e2]. The sum is 6.93e2, and so the mean is unambiguously 2.31e2. But what if your array was [2.31e2, 2.31e2, 2.31e2, 2.31e2, 2.31e2]. Now the sum is 1.155e3, but with only three digits available, the best you can do is 1.15e3 or 1.16e3 depending on whether you truncate or round. Dividing by five and truncating/rounding gives you either 2.30e2 or 2.32e2. There will generally be some quantization error the moment your sum has a scale different from your original number.
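The same toy computation can be reproduced with Python's decimal module limited to three significant digits (a sketch, assuming the module's default round-half-even behavior):
from decimal import Decimal, getcontext

getcontext().prec = 3       # three significant decimal digits
v = Decimal("2.31E+2")

print(v + v + v)            # 693 -- still exact, so the mean is exactly 2.31e2
total = v + v + v + v + v   # each partial sum is rounded to 3 digits
print(total)                # 1.16E+3 -- the exact sum 1155 no longer fits
print(total / 5)            # 232, i.e. 2.32e2 rather than 2.31e2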
Hopefully you can see that this translates directly to binary representations as well: you are seeing the differences in the last digit from the scale change during the mean operation.
Notice that 2^-53 ~= 1.11e-16 is the relative rounding error of a double. The elements of X have magnitude a bit over 3, so one unit in the last place is 2^-51 ~= 4.44e-16, which corresponds exactly to the quantization error you are seeing.
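As a quick check against the outputs above (a sketch, not from the original answer), np.spacing gives the gap between a double and the next representable value, i.e. one unit in the last place:
import numpy as np

v = -3.29686744
X3 = np.full((3, 3), v)
X5 = np.full((5, 3), v)
print(np.mean(X3, axis=0) - v)   # [0. 0. 0.] -- exact, as in the 3-row case above
print(np.mean(X5, axis=0) - v)   # ~4.44e-16  -- off by one ULP, as for this value in the 5-row case
print(np.spacing(abs(v)))        # 4.440892098500626e-16: one ULP at magnitude ~3.3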
This is very closely related to "Is floating point math broken?"
Related
Question
What numpy function to use for mathematical dot product in the case below?
Backpropagation for a Linear Layer
Define sample (2,3) array:
In [299]: dldx = np.arange(6).reshape(2,3)
In [300]: w
Out[300]:
array([[0.1, 0.2, 0.3],
[0. , 0. , 0. ]])
Element-wise multiplication:
In [301]: dldx*w
Out[301]:
array([[0. , 0.2, 0.6],
[0. , 0. , 0. ]])
and summing on the last axis (size 3) produces a 2 element array:
In [302]: (dldx*w).sum(axis=1)
Out[302]: array([0.8, 0. ])
Your (6) is the first term, dropping the 0. One might argue that the use of a dot/inner in (5) is a bit sloppy.
np.einsum borrows ideas from physics, where dimensions may be higher. This case can be expressed as
In [303]: np.einsum('ij,ij->i', dldx, w)
Out[303]: array([0.8, 0. ])
inner and dot do more calculations than we want. We just want the diagonal:
In [304]: np.dot(dldx,w.T)
Out[304]:
array([[0.8, 0. ],
[2.6, 0. ]])
In [305]: np.inner(dldx,w)
Out[305]:
array([[0.8, 0. ],
[2.6, 0. ]])
In matmul/@ terms, the size 2 dimension is a 'batch' one, so we have to add dimensions:
In [306]: dldx[:,None,:] @ w[:,:,None]
Out[306]:
array([[[0.8]],
[[0. ]]])
This is (2,1,1), so we need to squeeze out the 1s.
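For completeness, one way to squeeze those out (a small follow-up sketch, assuming the same dldx and w as above):
(dldx[:, None, :] @ w[:, :, None]).squeeze()   # -> array([0.8, 0. ])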
I'm having trouble using a broadcasting subtraction. My problem is the following. I have an array x of shape [L,N], where L is an integer and N is the number of variables of my problem.
I need to compute a [L,N,N] array where at each element l,i,j it contains x[l,i]-x[l,j].
If L=1 this is equivalent to broadcasting the subtraction: x - x.T
For example here with L=1 and N=3:
import numpy as np
x = np.array([[0,2,4]])
x-x.T
However, if one increases the dimension L things become more complicated and enter the realm of the np.einsum function.
So I tried to recreate my example in the case L=2, where I've replicated the same row twice. What I'd expect is a 2x3x3 array containing two identical 3x3 matrices.
x = np.array([[0,2,4],[0,2,4]])
n = 3
k = 2
X = np.zeros([k,n,n])
for l in range(k):
    for i in range(n):
        for j in range(n):
            X[l,i,j] = x[l,i]-x[l,j]
print(X)
which returns
[[[ 0. -2. -4.]
  [ 2.  0. -2.]
  [ 4.  2.  0.]]

 [[ 0. -2. -4.]
  [ 2.  0. -2.]
  [ 4.  2.  0.]]]
But how to make this with numpy einsum? I can only obtain the product:
np.einsum('ki,kj->kij',x,-x)
Are there specific examples of numpy batched subtractions or additions with increased dimension?
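For comparison, a plain-broadcasting sketch (assuming broadcasting rather than einsum is acceptable) that reproduces the loop above by pairing complementary singleton axes:
import numpy as np

x = np.array([[0, 2, 4],
              [0, 2, 4]])

# x[:, :, None] has shape (L, N, 1) and x[:, None, :] has shape (L, 1, N);
# broadcasting the subtraction gives the (L, N, N) array with X[l,i,j] = x[l,i] - x[l,j]
X = x[:, :, None] - x[:, None, :]
print(X)   # same values as the loop output above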
I am learning to solve optimization problems using CVXPY, so I started with the following simple facility location allocation problem.
The code in CVXPY is given as:
import numpy as np
import cvxpy as cvx

Fi = np.array([1, 1, 1])            # Fixed cost of each facility
Ci = np.array([15, 10, 10])         # Capacity of each facility
Dj = np.array([5, 5, 5, 3, 3, 4])   # Demand of each demand point
m = len(Fi)                         # number of facilities
n = len(Dj)                         # number of demand points
Cij = np.ones((m, n))               # unit cost of serving demand j from facility i
# Decision Variables
Xij = cvx.Bool(m, n)                # (m, n) assignment matrix
Yi = cvx.Bool(m)                    # vector of length m: facility open/closed
# Objective
fixed_cost = cvx.sum_entries(Fi*Yi)
var_cost = cvx.sum_entries(Cij.T * Dj *Xij)
total_cost = fixed_cost + var_cost
objective = cvx.Minimize(total_cost)
# Constraints
constraints = []
# Maximum number of facility locations to be selected
constraints.append(cvx.sum_entries(Yi) == 2)
# Sum of demands allocated to a facility shall be <= facility capacity -
# Capacity Fixed Cost
constraints.append(cvx.sum_entries(Dj * Xij.T, axis=0) <= Ci*Yi)
# Every demand point shall be supplied by only one facility.
constraints.append(cvx.sum_entries(Xij, axis=1) == 1)
# Solve the problem
prob = cvx.Problem(objective, constraints)
prob.solve(solver=cvx.GLPK_MI)
# Print the values
#print("status:", prob.status)
print("optimal value", prob.value)
print("Selected Facility Locations", Yi.value)
print("Assigned Nodes", Xij.value, )
As per the last constraint, a demand location should be supplied by only one facility; however, the output of Xij.value shows wrong results.
Using CVXPY version: 0.4.10
status: optimal
optimal value 91.0
Selected Facility Locations [[1.]
[0.]
[1.]]
Assigned Nodes to Facility 1) [[1. 0. 0. 0. 0. 0.]]
Assigned Nodes to Facility 2) [[1. 0. 0. 0. 0. 0.]]
Assigned Nodes to Facility 3) [[1. 0. 0. 0. 0. 0.]]
The Xij.value should be something like this:
Using CVXPY version: 0.4.10
status: optimal
optimal value 91.0
Selected Facility Locations [[1.]
[1.]
[0.]]
Assigned Nodes to Facility 1) [[1. 1. 1. 0. 0. 0.]]
Assigned Nodes to Facility 2) [[0. 0. 0. 1. 1. 1.]]
Assigned Nodes to Facility 3) [[0. 0. 0. 0. 0. 0.]]
Which means, facility 1 and 2 are selected.
The first three points are allocated to facility 1 and the next three to facility 2.
We know that (A*B)*C = A*(B*C), but why does this matrix multiplication give different results?
import numpy as np
A = np.array([[1,2,3],[4,5,6]])
B = np.array([[1,2,3],[4,5,6],[7,8,9]])
print( A.dot( np.linalg.inv(B) ).dot(A.T) )
print( A.dot( np.linalg.inv(B).dot(A.T) ) )
The result is
[[ 0.5 2. ]
[ 1. 4. ]]
and
[[ 2. 4.]
[ 8. 16.]]
B is of insufficient rank to have an inverse. To get at least consistent results, use np.linalg.pinv for the pseudo-inverse.
np.linalg.matrix_rank(B)
# we want 3
# we got 2
2
A = np.array([[1,2,3],[4,5,6]])
B = np.array([[1,2,3],[4,5,6],[7,8,9]])
print( A.dot( np.linalg.pinv(B) ).dot(A.T) )
print( A.dot( np.linalg.pinv(B).dot(A.T) ) )
[[ 1. 4.]
[ 2. 5.]]
[[ 1. 4.]
[ 2. 5.]]
Floating point arithmetical operations are not associative. Usually we don't notice this because the numerical differences between matrices A*(B*C) and (A*B)*C are tiny. But in this case, you are trying to invert a non-invertible matrix B, which Numpy actually tries to do, getting some absurd result:
[[ 3.15251974e+15 -6.30503948e+15 3.15251974e+15]
[ -6.30503948e+15 1.26100790e+16 -6.30503948e+15]
[ 3.15251974e+15 -6.30503948e+15 3.15251974e+15]]
The magnitude of these numbers is such that errors of size ~1 are to be expected at double precision (you get about 16 accurate digits). The multiplication by A and A.T brings the matrix entries back to something small through a lot of cancellation. But when very large numbers cancel each other, the relative error grows, and the result ends up being fairly meaningless.
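A minimal scalar illustration of that non-associativity (not specific to matrices):
print((0.1 + 0.2) + 0.3)   # 0.6000000000000001
print(0.1 + (0.2 + 0.3))   # 0.6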
I am trying to apply DBSCAN on a dataset of (Lon, Lat) points. The algorithm is very sensitive to its parameters, eps and MinPts.
I would like to look at a histogram over the data to determine proper values. Unfortunately, Matplotlib's hist() takes only a 1D array.
When passing a 2D matrix as the argument, hist() treats each column as a separate input.
Scatter plot and histograms:
Does anyone have a way to solve this?
If you follow the DBSCAN article, you only need the 4-nearest-neighbor distance for each object, not all pairwise distances, i.e. a 1-dimensional array.
Instead of a histogram, they sort these values and try to choose a knee in the resulting plot (a sketch follows the list below):
find the 4 nearest neighbor of each object
collect all 4NN distances in one array
sort this array in descending order
plot the resulting curve
look for a knee, often best at around 5%-10% of your x axis (so 95%-90% of objects are core points).
For details, see the original DBSCAN publication!
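A minimal sketch of that sorted 4-NN distance ("k-distance") plot, assuming scikit-learn's NearestNeighbors is available (any k-NN routine would do):
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
points = rng.random((500, 2))               # stand-in for your (lat, lng) data

# n_neighbors=5 because the nearest neighbor of each point is the point itself
nn = NearestNeighbors(n_neighbors=5).fit(points)
distances, _ = nn.kneighbors(points)
k_dist = np.sort(distances[:, -1])[::-1]    # 4-NN distance of each point, descending

plt.plot(k_dist)
plt.xlabel("points sorted by 4-NN distance")
plt.ylabel("4-NN distance")
plt.show()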
You could use numpy.histogram2d:
import numpy as np
np.random.seed(2016)
N = 100
arr = np.random.random((N, 2))
xedges = np.linspace(0, 1, 10)
yedges = np.linspace(0, 1, 10)
lat = arr[:, 0]
lng = arr[:, 1]
hist, xedges, yedges = np.histogram2d(lat, lng, (xedges, yedges))
print(hist)
yields
[[ 0. 0. 5. 0. 3. 0. 0. 0. 3.]
[ 0. 3. 0. 3. 0. 0. 4. 0. 2.]
[ 2. 2. 1. 1. 1. 1. 3. 0. 1.]
[ 2. 1. 0. 3. 1. 2. 1. 1. 3.]
[ 3. 0. 3. 2. 0. 1. 0. 2. 0.]
[ 3. 2. 3. 1. 1. 2. 1. 1. 0.]
[ 2. 3. 0. 1. 0. 1. 3. 0. 0.]
[ 1. 1. 1. 1. 2. 0. 2. 1. 1.]
[ 0. 1. 1. 0. 1. 1. 2. 0. 0.]]
Or to visualize the histogram:
import matplotlib.pyplot as plt
fig, ax = plt.subplots()
ax.imshow(hist)
plt.show()
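A possible refinement (an assumption about the desired orientation, not part of the original answer): np.histogram2d returns the lat bins along the first axis, so transposing and passing the bin edges as the extent gives a conventionally oriented plot:
fig, ax = plt.subplots()
ax.imshow(hist.T, origin='lower', aspect='auto',
          extent=[xedges[0], xedges[-1], yedges[0], yedges[-1]])
ax.set_xlabel('lat')
ax.set_ylabel('lng')
plt.show()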