Why does this numpy array comparison fail? - numpy

I am trying to compare the results of some numpy.array calculations with expected results, but the comparison always fails even though the printed arrays look the same, e.g.:
def test_gen_sine():
    A, f, phi, fs, t = 1.0, 10.0, 1.0, 50.0, 0.1
    expected = array([0.54030231, -0.63332387, -0.93171798, 0.05749049, 0.96724906])
    result = gen_sine(A, f, phi, fs, t)
    npt.assert_array_equal(expected, result)
prints back:
> raise AssertionError(msg)
E AssertionError:
E Arrays are not equal
E
E (mismatch 100.0%)
E x: array([ 0.540302, -0.633324, -0.931718, 0.05749 , 0.967249])
E y: array([ 0.540302, -0.633324, -0.931718, 0.05749 , 0.967249])
My gen_sine function is:
def gen_sine(A, f, phi, fs, t):
    sampling_period = 1 / fs
    num_samples = fs * t
    samples_range = (np.arange(0, num_samples) * 2 * f * np.pi * sampling_period) + phi
    return A * np.cos(samples_range)
Why is that? How should I compare the two arrays?
(I'm using numpy 1.9.3 and pytest 2.8.1)

The problem is that np.testing.assert_array_equal returns None and does the assert statement internally. It is incorrect to preface it with a separate assert as you do:
assert np.testing.assert_array_equal(x, y)
Instead in your test you would just do something like:
import numpy as np
from numpy.testing import assert_array_equal
def test_equal():
    assert_array_equal(np.arange(0,3), np.array([0,1,2]))  # No assertion raised
    assert_array_equal(np.arange(0,3), np.array([2,0,1]))  # Raises AssertionError
Update:
A few comments
Don't rewrite your entire original question, because it then becomes unclear what an answer was actually addressing.
As for your updated question, the issue is that assert_array_equal is not appropriate for comparing floating-point arrays, as explained in the documentation. Instead, use assert_allclose and set the desired relative and absolute tolerances.
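As a rough illustration of this advice, a test along the following lines should pass where assert_array_equal fails. The gen_sine function is copied from the question. The tolerances rtol=1e-6 and atol=1e-8 are assumptions chosen to fit the eight decimals given in expected; they are not values from the original posts.

```python
import numpy as np
from numpy.testing import assert_allclose

def gen_sine(A, f, phi, fs, t):
    sampling_period = 1 / fs
    num_samples = fs * t
    samples_range = (np.arange(0, num_samples) * 2 * f * np.pi * sampling_period) + phi
    return A * np.cos(samples_range)

def test_gen_sine():
    A, f, phi, fs, t = 1.0, 10.0, 1.0, 50.0, 0.1
    expected = np.array([0.54030231, -0.63332387, -0.93171798, 0.05749049, 0.96724906])
    result = gen_sine(A, f, phi, fs, t)
    # Passes when |result - expected| <= atol + rtol * |expected| elementwise
    # (tolerances are assumed values, see the note above).
    assert_allclose(result, expected, rtol=1e-6, atol=1e-8)

test_gen_sine()
```

The expected values were typed in with eight decimals, while the computed values carry full double precision, so an exact comparison fails even though the default array printout shows both rounded to the same digits.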

Related

making numpy binary file data to two decimal [duplicate]

I have a numpy array, something like below:
data = np.array([ 1.60130719e-01, 9.93827160e-01, 3.63108206e-04])
and I want to round each element to two decimal places.
How can I do so?
Numpy provides two identical methods to do this. Either use
np.round(data, 2)
or
np.around(data, 2)
as they are equivalent.
See the documentation for more information.
Examples:
>>> import numpy as np
>>> a = np.array([0.015, 0.235, 0.112])
>>> np.round(a, 2)
array([0.02, 0.24, 0.11])
>>> np.around(a, 2)
array([0.02, 0.24, 0.11])
>>> np.round(a, 1)
array([0. , 0.2, 0.1])
If you want the output to be
array([1.6e-01, 9.9e-01, 3.6e-04])
the problem is not really a missing feature of NumPy, but rather that this sort of rounding is not a standard thing to do. You can make your own rounding function which achieves this like so:
def my_round(value, N):
    exponent = np.ceil(np.log10(value))
    return 10**exponent*np.round(value*10**(-exponent), N)
For a general solution handling 0 and negative values as well, you can do something like this:
def my_round(value, N):
    value = np.asarray(value).copy()
    zero_mask = (value == 0)
    value[zero_mask] = 1.0
    sign_mask = (value < 0)
    value[sign_mask] *= -1
    exponent = np.ceil(np.log10(value))
    result = 10**exponent*np.round(value*10**(-exponent), N)
    result[sign_mask] *= -1
    result[zero_mask] = 0.0
    return result
It is worth noting that the accepted answer will round small floats down to zero as demonstrated below:
>>> import numpy as np
>>> arr = np.asarray([2.92290007e+00, -1.57376965e-03, 4.82011728e-08, 1.92896977e-12])
>>> print(arr)
[ 2.92290007e+00 -1.57376965e-03 4.82011728e-08 1.92896977e-12]
>>> np.round(arr, 2)
array([ 2.92, -0. , 0. , 0. ])
You can use set_printoptions and a custom formatter to fix this and get a more numpy-esque printout with fewer decimal places:
>>> np.set_printoptions(formatter={'float': "{0:0.2e}".format})
>>> print(arr)
[2.92e+00 -1.57e-03 4.82e-08 1.93e-12]
This way, you get the full versatility of format and maintain the precision of numpy's datatypes.
Also note that this only affects printing, not the actual precision of the stored values used for computation.
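To illustrate that last point, the sketch below formats the same array with np.array2string, which leaves the global print options alone, and then checks that a stored value is unchanged. Using np.array2string is a choice made for this illustration; the answer above uses set_printoptions.

```python
import numpy as np

arr = np.asarray([2.92290007e+00, -1.57376965e-03, 4.82011728e-08, 1.92896977e-12])

# Format each float in scientific notation with two decimals, for display only.
text = np.array2string(arr, formatter={'float_kind': "{0:0.2e}".format})
print(text)

# The stored values keep their full precision.
print(arr[2] == 4.82011728e-08)
```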

How to fit data to model with analytical gradient in basinhopping or with another gradient descent method?

I'd like to fit experimental data to a model and extract the optimal model parameters, i.e. the parameters that minimize the error between the model function and the experimental data. To get the optimal parameters, I'd like to use a gradient descent method, tensorflow, Bayesian inference, basinhopping or something else that deals well with bad initial estimates and is robust. To speed things up, I'd like to use the analytical gradient, for example in basinhopping. How do I do that with the basinhopping routine from scipy? In the following example code, I have an example function and I'd like to use the analytical Jacobian instead of the numerical one, but I get an error. Do I have to sum up the Jacobian components?
Example code (my actual function is much more complex)
import random
import matplotlib.pyplot as plt
import numpy as np
# symbolic math
from sympy import lambdify, symbols, cos
from sympy.tensor.array import derive_by_array
# fitting
from scipy.optimize import basinhopping
# symbolic math with sympy ---
s_lst = x, a, b, c, d = symbols('x, a, b, c d', positive=True)
# mathematical function
y = a*x + cos(b*x)**2 * c*x**2 + d
# jacobian (derivatives after model parameters)
params = s_lst[1:]
jac_y = derive_by_array(y, params)
# translate sympy expression to python function
# function
get_y = lambdify(s_lst, y)
# jacobian (derivatives in a, b, c, d)
get_jac_y = [lambdify(s_lst, element) for element in jac_y]
#print(len(get_jac_y))
# data ---
x = np.linspace(0, 1, 500)
# measurement data
a = [random.randrange(4, 6, 1) for i in range(len(x))]
b = [random.randrange(3190, 3290, 1) for i in range(len(x))]
c = [random.randrange(90, 109, 1) for i in range(len(x))]
d = [0.1*random.randrange(0, 2, 1) for i in range(len(x))]
y_measured = get_y(x, a, b, c, d)
# exemplary model data
a, b, c, d = 5, 3200, 100, 1
y_model = get_y(x, a, b, c, d)
# plot
plt.plot(x, y_measured)
plt.plot(x, y_model)
plt.title('exemplary model and measured data')
plt.show()
# functions for fitting
def func(params, args1, args2=None):
    a, b, c, d = params
    y = get_y(args1, a, b, c, d)
    if args2 is None:
        return y
    return np.sum((y - args2)**2)
# derivatives
def dfunc(params, args1, args2):
    a, b, c, d = params
    jac = [jac(args1, a, b, c, d) for jac in get_jac_y]
    # because derivative in d is one
    jac[-1] = np.ones(len(args1))
    return np.asarray(jac)
# function and derivatives
def objective_func(params, args1, args2):
    f = func(params, args1, args2)
    df = dfunc(params, args1, args2)
    return f, df
# fit with basinhopping and scipy ---
# initial model parameters
x0 = [1, 2, 33, 4]
# minimization with numerical jacobian, gives a result
minimizer_kwargs = {"args":(x, y_measured), 'method':'L-BFGS-B'}
ret = basinhopping(func, x0, minimizer_kwargs=minimizer_kwargs)
# minimization with analytical jacobian, fails,
# error: failed in converting 7th argument `g' of _lbfgsb.setulb to C/Fortran array
minimizer_kwargs = {"args":(x, y_measured), 'method':'L-BFGS-B', 'jac':True}
ret = basinhopping(objective_func, x0, minimizer_kwargs=minimizer_kwargs)
If I put in dfunc something like return [np.sum((j)) for j in jac] the program runs but fails. What would be the correct expression?
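A possible reading of the problem, offered as a sketch and not a tested fix for the code above: the objective is the scalar sum of squared residuals, so its gradient needs one entry per parameter, whereas dfunc returns the model's Jacobian with one row per parameter and one column per data point. By the chain rule, each row has to be weighted by the residual before summing, so a plain sum over j does not give the right values. The example below uses NumPy only, with hand-derived partial derivatives of the same model; the names model, model_jac and objective_and_grad are made up for this illustration.

```python
import numpy as np

def model(x, a, b, c, d):
    # Same model as in the question: y = a*x + cos(b*x)**2 * c*x**2 + d
    return a * x + np.cos(b * x) ** 2 * c * x ** 2 + d

def model_jac(x, a, b, c, d):
    # Hand-derived partial derivatives of the model, one row per parameter.
    return np.array([
        x,                                 # dy/da
        -c * x ** 3 * np.sin(2 * b * x),   # dy/db
        np.cos(b * x) ** 2 * x ** 2,       # dy/dc
        np.ones_like(x),                   # dy/dd
    ])

def objective_and_grad(params, x, y_measured):
    residual = model(x, *params) - y_measured
    f = np.sum(residual ** 2)
    # Chain rule: d/dp sum(r**2) = 2 * sum(r * dy/dp), giving shape (4,).
    grad = 2.0 * model_jac(x, *params) @ residual
    return f, grad

x = np.linspace(0, 1, 50)
y_measured = model(x, 1.5, 2.5, 2.0, 3.0)
f, grad = objective_and_grad(np.array([1.0, 2.0, 3.0, 4.0]), x, y_measured)
print(f, grad.shape)
```

A function of this form returns the pair (f, grad) that the 'jac': True setting in minimizer_kwargs asks for, so it could take the place of objective_func in the basinhopping call.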

How do I correct this Value Error due to Buffer having the wrong dimensions in a quadprog SVM implementation?

I'm using the quadprog module to set up an SVM for speech recognition. I took a QP implementation from here: https://github.com/stephane-caron/qpsolvers/blob/master/qpsolvers/quadprog_.py
Here is their implementation:
def quadprog_solve_qp(P, q, G=None, h=None, A=None, b=None, initvals=None,
                      verbose=False):
    if initvals is not None:
        print("quadprog: note that warm-start values ignored by wrapper")
    qp_G = P
    qp_a = -q
    if A is not None:
        if G is None:
            qp_C = -A.T
            qp_b = -b
        else:
            qp_C = -vstack([A, G]).T
            qp_b = -np.insert(h, 0, 0, axis=0)
        meq = A.shape[0]
    else:  # no equality constraint
        qp_C = -G.T if G is not None else None
        qp_b = -h if h is not None else None
        meq = 0
    try:
        return solve_qp(qp_G, qp_a, qp_C, qp_b, meq)[0]
    except ValueError as e:
        if "matrix G is not positive definite" in str(e):
            # quadprog writes G the cost matrix that we write P in this package
            raise ValueError("matrix P is not positive definite")
        raise
Shapes:
P: (127, 127)
h: (254, 1)
q: (127, 1)
A: (1, 127)
G: (254, 127)
Originally, qp_b was assigned an hstack of an array arr = array([0]) with h, but the shape (1,) of arr prevented numpy from concatenating the arrays. I fixed this error by inserting a 0 with np.insert instead.
When I try quadprog_solve_qp(P, q, G, h, A) I get a:
File "----------------------------.py", line 95, in quadprog_solve_qp
return solve_qp(qp_G, qp_a, qp_C, qp_b, meq)[0]
File "quadprog/quadprog.pyx", line 12, in quadprog.solve_qp
ValueError: Buffer has wrong number of dimensions (expected 1, got 2)
And I have no idea where it's coming from, nor what I can do. If anyone has any idea how the quadprog module works or simply what I might be doing wrong I would be pleased to hear.
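One possible cause, offered as an assumption and not a confirmed diagnosis: quadprog's solve_qp takes its vector arguments as 1-D arrays, while q and h here are column vectors of shape (n, 1), and np.insert along axis=0 keeps h two-dimensional. The sketch below uses only NumPy, with zero-filled stand-ins for the real data and a hypothetical equality right-hand side b, to show how flattening yields 1-D inputs of matching lengths.

```python
import numpy as np

n = 127
# Zero-filled stand-ins with the shapes listed in the question.
q = np.zeros((n, 1))
h = np.zeros((2 * n, 1))
A = np.zeros((1, n))
G = np.zeros((2 * n, n))
b = np.zeros(1)  # hypothetical right-hand side of the equality constraint

qp_a = -q.ravel()                                # 1-D, shape (127,)
qp_C = -np.vstack([A, G]).T                      # 2-D, shape (127, 255)
qp_b = -np.concatenate([b.ravel(), h.ravel()])   # 1-D, shape (255,)

print(qp_a.shape, qp_C.shape, qp_b.shape)
```

With the inputs flattened this way, the original hstack of b and h would also work, because both pieces are then one-dimensional.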

Best way to find modes of an array along the column

Suppose that I have an array
a = np.array([[1,2.5,3,4],[1, 2.5, 3,3]])
I want to find the mode of each column without using stats.mode().
The only way I can think of is the following:
result = np.zeros(a.shape[1])
for i in range(len(result)):
    curr_col = a[:,i]
    result[i] = curr_col[np.argmax(np.unique(curr_col, return_counts = True))]
update:
There is some error in the above code and the correct one should be:
values, counts = np.unique(a[:,i], return_counts = True)
result[i] = values[np.argmax(counts)]
I have to use the loop because np.unique does not output compatible result for each column and there is no way to use np.bincount because the dtype is not int.
If you look at the numpy.unique documentation, this function returns the values and the associated counts (because you specified return_counts=True). A slight modification of your code is necessary to give the correct result. What you are trying to do is find the value associated with the highest count:
import numpy as np
a = np.array([[1,5,3,4],[1,5,3,3],[1,5,3,3]])
result = np.zeros(a.shape[1])
for i in range(len(result)):
    values, counts = np.unique(a[:,i], return_counts = True)
    result[i] = values[np.argmax(counts)]
print(result)
Output:
% python3 script.py
[1. 5. 3. 3.]
Here is some code that compares your solution with the scipy.stats.mode function:
import numpy as np
import scipy.stats as sps
import time
a = np.random.randint(1,100,(100,100))
t_start = time.time()
result = np.zeros(a.shape[1])
for i in range(len(result)):
    values, counts = np.unique(a[:,i], return_counts = True)
    result[i] = values[np.argmax(counts)]
print('Timer 1: ', (time.time()-t_start), 's')
t_start = time.time()
result_2 = sps.mode(a, axis=0).mode
print('Timer 2: ', (time.time()-t_start), 's')
print('Matrices are equal!' if np.allclose(result, result_2) else 'Matrices differ!')
Output:
% python3 script.py
Timer 1: 0.002721071243286133 s
Timer 2: 0.003339052200317383 s
Matrices are equal!
I tried several values for parameters and your code is actually faster than scipy.stats.mode function so it is probably close to optimal.
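For reference, the loop can be wrapped in a small helper and applied to the float array from the question, which np.bincount could not handle. The name column_modes is made up for this sketch.

```python
import numpy as np

def column_modes(a):
    """Return the most frequent value of each column of a 2-D array."""
    result = np.zeros(a.shape[1], dtype=a.dtype)
    for i in range(a.shape[1]):
        values, counts = np.unique(a[:, i], return_counts=True)
        # np.unique returns sorted values and argmax picks the first maximum,
        # so ties resolve to the smallest value.
        result[i] = values[np.argmax(counts)]
    return result

a = np.array([[1, 2.5, 3, 4], [1, 2.5, 3, 3]])
print(column_modes(a))
```

Note the tie in the last column, where 3 and 4 each occur once: this helper reports the smaller value.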

TypeError: ufunc 'subtract' did not contain a loop with signature matching types dtype('<U1') dtype('<U1') dtype('<U1')

Strange error from numpy via matplotlib when trying to get a histogram of a tiny toy dataset. I'm just not sure how to interpret the error, which makes it hard to see what to do next.
Didn't find much related, though this nltk question and this gdsCAD question are superficially similar.
I intend the debugging info at bottom to be more helpful than the driver code, but if I've missed something, please ask. This is reproducible as part of an existing test suite.
    if n > 1:
        return diff(a[slice1]-a[slice2], n-1, axis=axis)
    else:
>       return a[slice1]-a[slice2]
E TypeError: ufunc 'subtract' did not contain a loop with signature matching types dtype('<U1') dtype('<U1') dtype('<U1')
../py2.7.11-venv/lib/python2.7/site-packages/numpy/lib/function_base.py:1567: TypeError
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> entering PDB >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> py2.7.11-venv/lib/python2.7/site-packages/numpy/lib/function_base.py(1567)diff()
-> return a[slice1]-a[slice2]
(Pdb) bt
[...]
py2.7.11-venv/lib/python2.7/site-packages/matplotlib/axes/_axes.py(5678)hist()
-> m, bins = np.histogram(x[i], bins, weights=w[i], **hist_kwargs)
py2.7.11-venv/lib/python2.7/site-packages/numpy/lib/function_base.py(606)histogram()
-> if (np.diff(bins) < 0).any():
> py2.7.11-venv/lib/python2.7/site-packages/numpy/lib/function_base.py(1567)diff()
-> return a[slice1]-a[slice2]
(Pdb) p numpy.__version__
'1.11.0'
(Pdb) p matplotlib.__version__
'1.4.3'
(Pdb) a
a = [u'A' u'B' u'C' u'D' u'E']
n = 1
axis = -1
(Pdb) p slice1
(slice(1, None, None),)
(Pdb) p slice2
(slice(None, -1, None),)
(Pdb)
I got the same error, but in my case I was subtracting a dict key from a dict value. I fixed it by subtracting the dict value for the corresponding key instead.
cosine_sim = cosine_similarity(e_b-e_a, w-e_c)
Here I got the error because e_b, e_a and e_c are the embedding vectors for the words b, a and c respectively, whereas w is a string, which I had not realized. Once I found that out, I fixed it with the following line:
cosine_sim = cosine_similarity(e_b-e_a, word_to_vec_map[w]-e_c)
Instead of subtracting the dict key, I now subtract the corresponding value for that key.
I had a similar issue where an integer in a row of a DataFrame I was iterating over was of type numpy.int64. I got the
TypeError: ufunc 'subtract' did not contain a loop with signature matching types dtype('<U1') dtype('<U1') dtype('<U1')
error when trying to subtract a float from it.
The easiest fix for me was to convert the row using pd.to_numeric(row).
Why is it applying diff to an array of strings?
I get an error at the same point, though with a different message:
In [23]: a=np.array([u'A' u'B' u'C' u'D' u'E'])
In [24]: np.diff(a)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-24-9d5a62fc3ff0> in <module>()
----> 1 np.diff(a)
C:\Users\paul\AppData\Local\Enthought\Canopy\User\lib\site-packages\numpy\lib\function_base.pyc in diff(a, n, axis)
1112 return diff(a[slice1]-a[slice2], n-1, axis=axis)
1113 else:
-> 1114 return a[slice1]-a[slice2]
1115
1116
TypeError: unsupported operand type(s) for -: 'numpy.ndarray' and 'numpy.ndarray'
Is this array a the bins parameter? What do the docs say bins should be?
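The failure can be reproduced without matplotlib, which is consistent with a string array ending up where numeric bin edges were expected. The sketch below is an illustration under that assumption. Counting the labels with np.unique is one possible alternative for categorical data; it is not something proposed in the thread.

```python
import numpy as np

labels = np.array(['A', 'B', 'C', 'D', 'E'])

# Subtraction is not defined for string dtypes, so np.diff raises TypeError.
try:
    np.diff(labels)
    diff_failed = False
except TypeError:
    diff_failed = True
print(diff_failed)

# For categorical data, counting occurrences avoids numeric bins entirely.
values, counts = np.unique(labels, return_counts=True)
print(dict(zip(values.tolist(), counts.tolist())))
```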
I am fairly new to this myself, but I had a similar error and found that it was due to a type casting issue. I was trying to concatenate and not take a difference, but I think the principle is the same here. I provided a similar answer on another question, so I hope that is OK.
In essence, you need to cast to a different data type. In my case I needed str and not float, and I suspect yours is the same, so my suggested solution is below. I am sorry I could not test it before suggesting it, but it is unclear to me from your example what you were doing.
return diff(str(a[slice1])-str(a[slice2]), n-1, axis=axis)
Please see my example code below for the fix to my code, the change occurs on the third to last line. The code is to produce a basic random forest model.
import scipy
import math
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn import preprocessing, metrics, cross_validation
Data = pd.read_csv("Free_Energy_exp.csv", sep=",")
Data = Data.fillna(Data.mean()) # replace the NA values with the mean of the descriptor
header = Data.columns.values # Use the column headers as the descriptor labels
Data.head()
test_name = "Test.csv"
npArray = np.array(Data)
print header.shape
npheader = np.array(header[1:-1])
print("Array shape X = %d, Y = %d " % (npArray.shape))
datax, datay = npArray.shape
names = npArray[:,0]
X = npArray[:,1:-1].astype(float)
y = npArray[:,-1] .astype(float)
X = preprocessing.scale(X)
XTrain, XTest, yTrain, yTest = cross_validation.train_test_split(X,y, random_state=0)
# Predictions results initialised
RFpredictions = []
RF = RandomForestRegressor(n_estimators = 10, max_features = 5, max_depth = 5, random_state=0)
RF.fit(XTrain, yTrain) # Train the model
print("Training R2 = %5.2f" % RF.score(XTrain,yTrain))
RFpreds = RF.predict(XTest)
with open(test_name,'a') as fpred :
    lenpredictions = len(RFpreds)
    lentrue = yTest.shape[0]
    if lenpredictions == lentrue :
        fpred.write("Names/Label,, Prediction Random Forest,, True Value,\n")
        for i in range(0,lenpredictions) :
            fpred.write(RFpreds[i]+",,"+yTest[i]+",\n")
    else :
        print "ERROR - names, prediction and true value array size mismatch."
This leads to the following error:
Traceback (most recent call last):
File "min_example.py", line 40, in <module>
fpred.write(RFpreds[i]+",,"+yTest[i]+",\n")
TypeError: ufunc 'add' did not contain a loop with signature matching types dtype('S32') dtype('S32') dtype('S32')
The solution is to convert each variable with str() on the third to last line before writing to the file. No other changes to the code have been made from the above.
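The same failure and fix can be shown in a few lines. This sketch uses made-up values as stand-ins for RFpreds[i] and yTest[i], and it is written for Python 3, unlike the Python 2 listing above.

```python
import numpy as np

pred = np.float64(1.25)   # stand-in for RFpreds[i]
true = np.float64(1.5)    # stand-in for yTest[i]

# Adding a NumPy float to a str is not a supported operation and raises TypeError.
try:
    line = pred + ",," + true + ",\n"
except TypeError:
    line = None

# Converting each number to text first gives an ordinary string.
line_str = str(pred) + ",," + str(true) + ",\n"
print(repr(line_str))
```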
import scipy
import math
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn import preprocessing, metrics, cross_validation
Data = pd.read_csv("Free_Energy_exp.csv", sep=",")
Data = Data.fillna(Data.mean()) # replace the NA values with the mean of the descriptor
header = Data.columns.values # Use the column headers as the descriptor labels
Data.head()
test_name = "Test.csv"
npArray = np.array(Data)
print header.shape
npheader = np.array(header[1:-1])
print("Array shape X = %d, Y = %d " % (npArray.shape))
datax, datay = npArray.shape
names = npArray[:,0]
X = npArray[:,1:-1].astype(float)
y = npArray[:,-1] .astype(float)
X = preprocessing.scale(X)
XTrain, XTest, yTrain, yTest = cross_validation.train_test_split(X,y, random_state=0)
# Predictions results initialised
RFpredictions = []
RF = RandomForestRegressor(n_estimators = 10, max_features = 5, max_depth = 5, random_state=0)
RF.fit(XTrain, yTrain) # Train the model
print("Training R2 = %5.2f" % RF.score(XTrain,yTrain))
RFpreds = RF.predict(XTest)
with open(test_name,'a') as fpred :
    lenpredictions = len(RFpreds)
    lentrue = yTest.shape[0]
    if lenpredictions == lentrue :
        fpred.write("Names/Label,, Prediction Random Forest,, True Value,\n")
        for i in range(0,lenpredictions) :
            fpred.write(str(RFpreds[i])+",,"+str(yTest[i])+",\n")
    else :
        print "ERROR - names, prediction and true value array size mismatch."
These examples are from a larger code so I hope the examples are clear enough.
I think @James is right. I got stuck on the same error while working with polyval(). The solution is to use variables of the same type. You can use a type cast to convert all variables to the same type.
Below is an example:
import numpy
P = numpy.array(input().split(), float)
x = float(input())
print(numpy.polyval(P,x))
Here I used float as the output type, so even if the user enters an integer value (a whole number), the final answer will be cast to float.
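The same conversion can be tried without interactive input. In this sketch the string raw and the value "2" are made-up stand-ins for what the user would type.

```python
import numpy as np

raw = "1.1 2 3"                      # stand-in for the text read via input()
P = np.array(raw.split(), float)     # string tokens converted to float64
x = float("2")                       # "2" is parsed as 2.0

# Evaluates 1.1*x**2 + 2*x + 3 at x = 2.0
print(P.dtype, np.polyval(P, x))
```

Without the float argument, P would be an array of strings, and the arithmetic inside polyval would have no numeric values to work with.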
I ran into the same issue, but in my case the cause was simply that a Python list was used instead of a NumPy array. Using two NumPy arrays solved the issue for me.