Rounding numpy binary file data to two decimals [duplicate] - numpy

I have a numpy array, something like below:
data = np.array([ 1.60130719e-01, 9.93827160e-01, 3.63108206e-04])
and I want to round each element to two decimal places.
How can I do so?

NumPy provides two equivalent functions for this. Either use
np.round(data, 2)
or
np.around(data, 2)
as they are just two names for the same operation.
See the documentation for more information.
Examples:
>>> import numpy as np
>>> a = np.array([0.015, 0.235, 0.112])
>>> np.round(a, 2)
array([0.02, 0.24, 0.11])
>>> np.around(a, 2)
array([0.02, 0.24, 0.11])
>>> np.round(a, 1)
array([0. , 0.2, 0.1])

If you want the output to be
array([1.6e-01, 9.9e-01, 3.6e-04])
the problem is not really a missing feature of NumPy, but rather that this sort of rounding is not a standard operation. You can write your own rounding function that achieves it like so:
def my_round(value, N):
    exponent = np.ceil(np.log10(value))
    return 10**exponent * np.round(value * 10**(-exponent), N)
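Applied to the array from the question, this reproduces the desired output, since N=2 keeps two significant digits per element (output shown in NumPy's modern default print style):
>>> my_round(data, 2)
array([1.6e-01, 9.9e-01, 3.6e-04])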
For a general solution handling 0 and negative values as well, you can do something like this:
def my_round(value, N):
    value = np.asarray(value).copy()
    zero_mask = (value == 0)
    value[zero_mask] = 1.0          # placeholder so log10 is defined
    sign_mask = (value < 0)
    value[sign_mask] *= -1          # work with magnitudes only
    exponent = np.ceil(np.log10(value))
    result = 10**exponent * np.round(value * 10**(-exponent), N)
    result[sign_mask] *= -1         # restore the signs
    result[zero_mask] = 0.0         # restore the zeros
    return result
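A quick sanity check of the general version, with values chosen to exercise the zero and negative branches (the expected results in the comment are worked out by hand):
vals = np.array([0.0, -0.00157376965, 2.92290007])
print(my_round(vals, 2))  # zero stays 0.0, the sign survives: 0.0, -0.0016, 2.9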

It is worth noting that the accepted answer will round small floats to zero, as demonstrated below:
>>> import numpy as np
>>> arr = np.asarray([2.92290007e+00, -1.57376965e-03, 4.82011728e-08, 1.92896977e-12])
>>> print(arr)
[ 2.92290007e+00 -1.57376965e-03 4.82011728e-08 1.92896977e-12]
>>> np.round(arr, 2)
array([ 2.92, -0. , 0. , 0. ])
You can use set_printoptions and a custom formatter to fix this and get a more numpy-esque printout with fewer decimal places:
>>> np.set_printoptions(formatter={'float': "{0:0.2e}".format})
>>> print(arr)
[2.92e+00 -1.57e-03 4.82e-08 1.93e-12]
This way, you get the full versatility of format and maintain the precision of numpy's datatypes.
Also note that this only affects printing, not the actual precision of the stored values used for computation.
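If you only want the formatting temporarily, NumPy 1.15+ also offers np.printoptions, a context manager that restores the previous settings on exit; a minimal sketch:
with np.printoptions(formatter={'float': '{0:0.2e}'.format}):
    print(arr)  # the two-decimal scientific format applies only inside this block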

Related

Best way to find modes of an array along the column

Suppose that I have an array
a = np.array([[1,2.5,3,4],[1, 2.5, 3,3]])
I want to find the mode of each column without using stats.mode().
The only way I can think of is the following:
result = np.zeros(a.shape[1])
for i in range(len(result)):
    curr_col = a[:, i]
    result[i] = curr_col[np.argmax(np.unique(curr_col, return_counts=True))]
Update: there is an error in the above code; the correct loop body should be:
    values, counts = np.unique(a[:, i], return_counts=True)
    result[i] = values[np.argmax(counts)]
I have to use the loop because np.unique does not produce results of compatible shape for each column, and there is no way to use np.bincount because the dtype is not int.
If you look at the numpy.unique documentation, this function returns the values and the associated counts (because you specified return_counts=True). A slight modification of your code is necessary to give the correct result. What you are trying to do is find the value associated with the highest count:
import numpy as np

a = np.array([[1, 5, 3, 4], [1, 5, 3, 3], [1, 5, 3, 3]])
result = np.zeros(a.shape[1])
for i in range(len(result)):
    values, counts = np.unique(a[:, i], return_counts=True)
    result[i] = values[np.argmax(counts)]
print(result)
Output:
% python3 script.py
[1. 5. 3. 4.]
Here is code that compares your solution with the scipy.stats.mode function:
import numpy as np
import scipy.stats as sps
import time

a = np.random.randint(1, 100, (100, 100))

t_start = time.time()
result = np.zeros(a.shape[1])
for i in range(len(result)):
    values, counts = np.unique(a[:, i], return_counts=True)
    result[i] = values[np.argmax(counts)]
print('Timer 1: ', (time.time() - t_start), 's')

t_start = time.time()
result_2 = sps.mode(a, axis=0).mode
print('Timer 2: ', (time.time() - t_start), 's')

print('Matrices are equal!' if np.allclose(result, result_2) else 'Matrices differ!')
Output:
% python3 script.py
Timer 1: 0.002721071243286133 s
Timer 2: 0.003339052200317383 s
Matrices are equal!
I tried several parameter values, and your code is actually faster than the scipy.stats.mode function, so it is probably close to optimal.

Numpy bug using .any()?

I'm having the following error using NumPy:
>>> distance = 0.9014179933248182
>>> min_distance = np.array([0.71341723, 0.07322284])
>>> distance < min_distance
array([False, False])
which is right, but when I try:
>>> distance < min_distance.any()
True
which is obviously wrong, since distance is not smaller than any number in min_distance.
What is going on here? I'm using NumPy on Google Colab, on version '1.17.3'.
Whilst numpy bugs are common, this is not one. Note that min_distance.any() returns a boolean result. So in this expression:
distance < min_distance.any()
you are comparing a float with a boolean, which unfortunately works, because of a comedy of errors:
bool is a subclass of int
True is equal to 1
floats are comparable with integers.
E.g.
>>> 0.9 < True
True
>>> 1.1 < True
False
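Putting the pieces together for the arrays from the question: min_distance.any() is True because the array has nonzero entries, so the original expression silently becomes a comparison with 1:
>>> min_distance.any()
True
>>> distance < min_distance.any()  # effectively 0.9014... < True, i.e. < 1
True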
What you wanted instead:
>>> (distance < min_distance).any()
False
try (distance < min_distance).any()

apply generic function in a vectorized fashion using numpy/pandas

I am trying to vectorize my code and, thanks in large part to some users (https://stackoverflow.com/users/3293881/divakar, https://stackoverflow.com/users/625914/behzad-nouri), I was able to make huge progress. Essentially, I am trying to apply a generic function (in this case max_dd_array_ret) to each of the bins I found (see vectorize complex slicing with pandas dataframe for details on the date vectorization and Start, End and Duration of Maximum Drawdown in Python for the rationale behind max_dd_array_ret). The problem is the following: I should be able to obtain the result df_2, and ranged_DD(asd_1.values, starts, ends+1) is, to some degree, what I am looking for, except that it behaves as if the first two bins were merged and the last one were missing, as can be gauged from the results below.
Any explanation and fix is very welcome.
import pandas as pd
import numpy as np
from time import time
from scipy.stats import binned_statistic

def max_dd_array_ret(xs):
    xs = (xs + 1).cumprod()
    i = np.argmax(np.maximum.accumulate(xs) - xs)  # end of the period
    j = np.argmax(xs[:i])                          # start of the period
    max_dd = abs(xs[j] / xs[i] - 1)
    return max_dd if max_dd is not None else 0

def get_ranges_arr(starts, ends):
    # Taken from https://stackoverflow.com/a/37626057/3293881
    # Builds one flat index array covering [starts[0], ends[0]), [starts[1], ends[1]), ...
    counts = ends - starts
    counts_csum = counts.cumsum()
    id_arr = np.ones(counts_csum[-1], dtype=int)
    id_arr[0] = starts[0]
    id_arr[counts_csum[:-1]] = starts[1:] - ends[:-1] + 1
    return id_arr.cumsum()

def ranged_DD(arr, starts, ends):
    # Get all indices and the IDs corresponding to the same groups
    idx = get_ranges_arr(starts, ends)
    id_arr = np.repeat(np.arange(starts.size), ends - starts)
    slice_arr = arr[idx]
    return binned_statistic(id_arr, slice_arr, statistic=max_dd_array_ret)[0]
asd_1 = pd.Series(0.01 * np.random.randn(500), index=pd.date_range('2011-1-1', periods=500)).pct_change()
index_1 = pd.to_datetime(['2011-2-2', '2011-4-3', '2011-5-1','2011-7-2', '2011-8-3', '2011-9-1','2011-10-2', '2011-11-3', '2011-12-1','2012-1-2', '2012-2-3', '2012-3-1',])
index_2 = pd.to_datetime(['2011-2-15', '2011-4-16', '2011-5-17','2011-7-17', '2011-8-17', '2011-9-17','2011-10-17', '2011-11-17', '2011-12-17','2012-1-17', '2012-2-17', '2012-3-17',])
starts = asd_1.index.searchsorted(index_1)
ends = asd_1.index.searchsorted(index_2)
df_2 = pd.DataFrame([max_dd_array_ret(asd_1.loc[i:j]) for i, j in zip(index_1, index_2)], index=index_1)
print(df_2[0].values)
print(ranged_DD(asd_1.values, starts, ends+1))
results:
df_2
[ 1.75893509 6.08002911 2.60131797 1.55631781 1.8770067 2.50709085
1.43863472 1.85322338 1.84767224 1.32605754 1.48688414 5.44786663]
ranged_DD(asd_1.values, starts, ends+1)
[ 6.08002911 2.60131797 1.55631781 1.8770067 2.50709085 1.43863472
1.85322338 1.84767224 1.32605754 1.48688414]
which are identical except that df_2 has an extra first value:
[ 1.75893509  6.08002911 ...   vs   [ 6.08002911 ...
and an extra last value:
...  1.48688414  5.44786663]   vs   ...  1.48688414]
P.S.: while looking in more detail at the docs (http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.binned_statistic.html) I found what might be the problem:
"All but the last (righthand-most) bin is half-open. In other words,
if bins is [1, 2, 3, 4], then the first bin is [1, 2) (including 1,
but excluding 2) and the second [2, 3). The last bin, however, is [3,
4], which includes 4. New in version 0.11.0."
The problem is that I don't know how to change this behavior.
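As a minimal illustration of the half-open bin behavior quoted above (this only demonstrates what the docs say; it is not a fix for the question):
from scipy.stats import binned_statistic

# bins=[1, 2, 3, 4] defines three bins: [1, 2), [2, 3) and the closed [3, 4]
stat, edges, binnumber = binned_statistic(
    [1, 2, 3, 4], [10, 20, 30, 40], statistic='sum', bins=[1, 2, 3, 4])
print(stat)  # [10. 20. 70.] -- both 3 and 4 fall into the closed last bin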

numpy concatenate function not working when handling a large matrix

I was trying to concatenate a 3-by-n 3D coordinate matrix called VTrans with a 1-by-n all-ones vector called lr, to augment the coordinate matrix into a 4-by-n homogeneous matrix. n in my case is the vertex number, 141669, which is pretty big.
The code below does not work, although it works on a very small dataset.
lr = np.ones(vertexNum).reshape((1, vertexNum))
VtransAppend = np.concatenate((VTrans, lr), axis=0)
Update 2:
Just found the problem, my vertexNum is wrong! It is actually 47223 instead of 141669; 141669 is the total size of VTrans. All solutions work and I will accept the first one. Thank you all!
The error says "all the input array dimensions except for the concatenation axis must match exactly"
I further verified that lr and VTrans have the same length by printing their sizes:
print lr.size
print VTrans.size
Has anyone run into the same weird problem before and knows how to solve it?
Here is the update:
My VTrans matrix is attached, where vertexNum is 141669.
This is the code following YXD's suggestion, but the issue still exists...
vertexNum = VTrans.size # Total vertex in current model
lr = np.ones(vertexNum)
VtransAppend = np.concatenate((VTrans, lr.reshape(1, -1)), axis=0)
You have to reshape lr so that it has the same number of dimensions as vTrans:
>>> n = 4
>>> vTrans = np.random.random_sample((3, n))
>>> lr = np.ones(n)
>>> np.concatenate((vTrans, lr.reshape(1, -1)), axis=0)
array([[ 0.65769116, 0.41008341, 0.66046706, 0.86501781],
[ 0.51584699, 0.60601466, 0.93800371, 0.25077702],
[ 0.16696658, 0.41839794, 0.0938594 , 0.48484606],
[ 1. , 1. , 1. , 1. ]])
>>>
i.e. after the reshape, the non-concatenation dimension matches vTrans
>>> lr.shape
(4,)
>>> lr.reshape(1, -1).shape
(1, 4)
>>>
Try vstack instead of concatenate:
a = np.random.random((3,5))
b = np.random.random(5)
np.vstack((a, b))
Alternatively:
np.concatenate((a, b[None,:]))
The None adds an axis to the 1D array b.
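To see what the None index does to the shapes:
>>> b = np.random.random(5)
>>> b.shape
(5,)
>>> b[None, :].shape
(1, 5)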

Why does this numpy array comparison fail?

I am trying to compare the results of some numpy.array calculations with expected results, and the comparison constantly fails even though the printed arrays look the same, e.g.:
import numpy as np
import numpy.testing as npt

def test_gen_sine():
    A, f, phi, fs, t = 1.0, 10.0, 1.0, 50.0, 0.1
    expected = np.array([0.54030231, -0.63332387, -0.93171798, 0.05749049, 0.96724906])
    result = gen_sine(A, f, phi, fs, t)
    npt.assert_array_equal(expected, result)
prints back:
> raise AssertionError(msg)
E AssertionError:
E Arrays are not equal
E
E (mismatch 100.0%)
E x: array([ 0.540302, -0.633324, -0.931718, 0.05749 , 0.967249])
E y: array([ 0.540302, -0.633324, -0.931718, 0.05749 , 0.967249])
My gen_sine function is:
def gen_sine(A, f, phi, fs, t):
    sampling_period = 1 / fs
    num_samples = fs * t
    samples_range = (np.arange(0, num_samples) * 2 * f * np.pi * sampling_period) + phi
    return A * np.cos(samples_range)
Why is that? How should I compare the two arrays?
(I'm using numpy 1.9.3 and pytest 2.8.1)
The problem is that numpy.testing.assert_array_equal returns None and performs the assertion internally. It is incorrect to preface it with a separate assert, as you originally did:
assert np.assert_array_equal(x,y)
Instead in your test you would just do something like:
import numpy as np
from numpy.testing import assert_array_equal
def test_equal():
    assert_array_equal(np.arange(0, 3), np.array([0, 1, 2]))  # No assertion raised
    assert_array_equal(np.arange(0, 3), np.array([2, 0, 1]))  # Raises AssertionError
Update:
A few comments
Don't rewrite your entire original question, because then it is unclear what an answer was actually addressing.
As for your updated question, the issue is that assert_array_equal is not appropriate for comparing floating-point arrays, as explained in the documentation. Instead, use assert_allclose and set the desired relative and absolute tolerances.
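For instance, a sketch of that fix (the expected values are copied from the question; the perturbation and tolerance are illustrative):
import numpy as np
from numpy.testing import assert_allclose

expected = np.array([0.54030231, -0.63332387, -0.93171798, 0.05749049, 0.96724906])
result = expected + 1e-9  # simulate tiny floating-point noise from a computation
assert_allclose(result, expected, rtol=1e-7)  # passes, where assert_array_equal would fail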