numpy set_printoptions precision ignored in list of tuples

(Windows 7, Python 3.4.5 | Anaconda 2.2.0 (64-bit) | (default, Jul 5 2016, 14:53:07) [MSC v.1600 64 bit (AMD64)])
I was trying to neatly print some data using np.set_printoptions(precision=4), but it seems to be ignored. Why?
import numpy as np
np.set_printoptions(precision=4)
a = [[1, 15.02],
     [2, 14.38],
     [3, 14.60]]
b = np.array(a)
print(b)
at = b.T
l = list(zip(at[0], at[1]))
print(l)
Output:
[[ 1.   15.02]
 [ 2.   14.38]
 [ 3.   14.6 ]]
[(1.0, 15.02), (2.0, 14.380000000000001), (3.0, 14.6)]

This is a floating-point representation problem:
In [118]: ref = decimal.Decimal('14.380000000000000000000000000000000000000000000000')
In [119]: decimal.Decimal(14.38)
Out[119]: Decimal('14.3800000000000007815970093361102044582366943359375')
In [120]: decimal.Decimal(14.38)-ref
Out[120]: Decimal('7.815970093361102044582366943E-16')
In [121]: decimal.Decimal(14.38-2**-50)-ref
Out[121]: Decimal('-9.947598300641402602195739746E-16')
This shows that 14.380000000000001 is the best float64 approximation of 14.38.
To work around this, you can downcast to np.float32:
In [140]: tuple(zip(*np.array(a).T.astype(np.float32)))
Out[140]: ((1.0, 15.02), (2.0, 14.38), (3.0, 14.6))
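Another option (a sketch, not part of the original answer): convert to native Python floats, or format the values explicitly, instead of relying on repr:
import numpy as np

a = [[1, 15.02], [2, 14.38], [3, 14.60]]
b = np.array(a)

# tolist() turns np.float64 elements into native Python floats, whose repr
# is the shortest round-tripping form: 14.38, not 14.380000000000001
pairs = list(zip(*b.T.tolist()))
print(pairs)  # [(1.0, 15.02), (2.0, 14.38), (3.0, 14.6)]

# Or format each value explicitly for full control over displayed precision:
print([tuple(format(v, '.4g') for v in row) for row in a])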

In [34]: a
Out[34]: [[1, 15.02], [2, 14.38], [3, 14.6]]
In [35]: b=np.array(a, dtype=float).T
In [36]: b
Out[36]:
array([[  1.  ,   2.  ,   3.  ],
       [ 15.02,  14.38,  14.6 ]])
In [37]: list(zip(*b))
Out[37]: [(1.0, 15.02), (2.0, 14.380000000000001), (3.0, 14.6)]
However, if I first pass b through tolist:
In [38]: list(zip(*b.tolist()))
Out[38]: [(1.0, 15.02), (2.0, 14.38), (3.0, 14.6)]
In the first case the tuple elements still have the np.float64 wrapper, while tolist extracts them all to native Python numbers:
In [39]: type(list(zip(*b))[1][1])
Out[39]: numpy.float64
In [40]: type(list(zip(*b.tolist()))[1][1])
Out[40]: float
.item() is another way of extracting the native number:
In [41]: list(zip(*b))[1][1]
Out[41]: 14.380000000000001
In [42]: list(zip(*b))[1][1].item()
Out[42]: 14.38
I can't say why set_printoptions doesn't apply to an np.float64 scalar but does to an np.array.
As a general rule, it is better to use tolist() if you want to convert an array, and all its values, into a native Python list. Operations like list and zip aren't enough. They iterate on the first dimension of the array, but don't recursively convert the elements:
Partial conversion(s):
In [43]: list(b)
Out[43]: [array([ 1., 2., 3.]), array([ 15.02, 14.38, 14.6 ])]
In [44]: list(b[1])
Out[44]: [15.02, 14.380000000000001, 14.6]
Full conversion:
In [45]: b.tolist()
Out[45]: [[1.0, 2.0, 3.0], [15.02, 14.38, 14.6]]
Apparently the formatter for float64 shows all precision, regardless of the set_printoptions values:
In [58]: 14.380000000000001
Out[58]: 14.38
In [59]: np.array(14.380000000000001)
Out[59]: array(14.38)
In [60]: np.float64(14.380000000000001)
Out[60]: 14.380000000000001
In [61]: np.float32(14.380000000000001)
Out[61]: 14.38
An np.float64(...) object is in many ways like a single-item array, but differs in subtle ways. Usually, though, we don't create such an object directly.
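A quick check of that formatter difference (a sketch; the exact scalar repr varies with NumPy version):
import numpy as np

np.set_printoptions(precision=4)
v = np.float64(14.380000000000001)
print(np.array([v]))       # [14.38] -- the array formatter honors set_printoptions
print(repr(v))             # full precision on older NumPy versions
print('{:.4f}'.format(v))  # 14.3800 -- Python string formatting always works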

Related

extract CSV column data to individual NumPy arrays

Extract data from the given SalaryGender CSV file and store the data from each column in a separate NumPy array
SalaryGender.csv sample data
Salary,Gender,Age,PhD
140,1,47,1
30,0,65,1
35.1,0,56,0
30,1,23,0
80,0,53,1
Use DataFrame.groupby, which will create a list where each element is the NumPy array of one column:
[group.values for i, group in df.groupby(level=0, axis=1)]
If you aren't looking for a list, then use:
for i, group in df.groupby(level=0, axis=1):
    print(group.values)
You can also use DataFrame.iteritems (renamed to DataFrame.items in newer pandas):
for i, col in df.iteritems():
    print(col.to_numpy())
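Here is a minimal runnable sketch of the one-array-per-column idea, using the sample data from the question (the dict layout is just one convenient choice):
import io
import pandas as pd

csv_text = """Salary,Gender,Age,PhD
140,1,47,1
30,0,65,1
35.1,0,56,0
30,1,23,0
80,0,53,1"""

df = pd.read_csv(io.StringIO(csv_text))
# One NumPy array per column, keyed by column name:
arrays = {name: col.to_numpy() for name, col in df.items()}
print(arrays['Salary'])   # [140.   30.   35.1  30.   80. ]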
In [199]: txt = """Salary,Gender,Age,PhD
...: 140,1,47,1
...: 30,0,65,1
...: 35.1,0,56,0
...: 30,1,23,0
...: 80,0,53,1"""
We can load your sample as a structured array:
In [203]: data = np.genfromtxt(txt.splitlines(), dtype=None, delimiter=',', encoding=None, names=True)
In [204]: data
Out[204]:
array([(140. , 1, 47, 1), ( 30. , 0, 65, 1), ( 35.1, 0, 56, 0),
       ( 30. , 1, 23, 0), ( 80. , 0, 53, 1)],
      dtype=[('Salary', '<f8'), ('Gender', '<i8'), ('Age', '<i8'), ('PhD', '<i8')])
Each element of the array is a row of the file; field names come from the header line, field dtype is deduced from the data.
Fields can be accessed by name:
In [205]: data['Salary']
Out[205]: array([140. , 30. , 35.1, 30. , 80. ])
In [206]: data['Gender']
Out[206]: array([1, 0, 0, 1, 0])
They can be used that way directly, or assigned to a variable:
salary = data['Salary']
You can also use unpack:
In [213]: a,b,c,d = np.genfromtxt(txt.splitlines(), delimiter=',', encoding=None, skip_header=1, unpack=True)
In [214]: a
Out[214]: array([140. , 30. , 35.1, 30. , 80. ])
In [215]: b
Out[215]: array([1., 0., 0., 1., 0.])
In [216]: c
Out[216]: array([47., 65., 56., 23., 53.])
In [217]: d
Out[217]: array([1., 1., 0., 0., 1.])
Sometimes it's simpler to load the file one (or selected) column at a time:
In [218]: b = np.genfromtxt(txt.splitlines(), delimiter=',', encoding=None, skip_header=1, usecols=[1])
In [219]: b
Out[219]: array([1., 0., 0., 1., 0.])
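usecols also accepts several indices, so selected columns can be pulled and unpacked in one call (a sketch reusing the txt string defined above; the variable names are illustrative):
salary, age = np.genfromtxt(txt.splitlines(), delimiter=',',
                            skip_header=1, usecols=[0, 2], unpack=True)
# salary -> array([140. ,  30. ,  35.1,  30. ,  80. ])
# age    -> array([47., 65., 56., 23., 53.])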
Please try this:
SG[SG.columns].values
where SG is your DataFrame. The code above gives you all the column values as one array in a single go.

Multidimensional numpy.outer without flatten

x is N by M matrix.
y is 1 by L vector.
I want to return "outer product" between x and y, let's call it z.
z[n,m,l] = x[n,m] * y[l]
I could probably do this using einsum.
np.einsum("ij,k->ijk", x[:, :, k], y[:, k])
or reshape afterwards.
np.outer(x[:, :, k], y).reshape((x.shape[0],x.shape[1],y.shape[0]))
But I'm wondering if this can be done with np.outer alone, or something simpler and more memory efficient.
Is there a way?
It's one of those numpy "can't know unless you happen to know" bits: np.outer flattens multidimensional inputs while np.multiply.outer doesn't:
m,n,l = 3,4,5
x = np.arange(m*n).reshape(m,n)
y = np.arange(l)
np.multiply.outer(x,y).shape
# (3, 4, 5)
The code for outer is:
multiply(a.ravel()[:, newaxis], b.ravel()[newaxis, :], out)
As its docs say, it flattens the inputs (i.e. ravels them). If the arrays are already 1d, that expression could be written as
a[:,None] * b[None,:]
a[:,None] * b # broadcasting auto adds the None to b
We could apply broadcasting rules to your (n,m)*(1,l):
In [2]: x = np.arange(12).reshape(3,4); y = np.array([[1,2]])
In [3]: x.shape, y.shape
Out[3]: ((3, 4), (1, 2))
You want a (n,m,l), which a (n,m,1) * (1,1,l) achieves. We need to add a trailing dimension to x. The extra leading 1 on y is automatic:
In [4]: z = x[...,None]*y
In [5]: z.shape
Out[5]: (3, 4, 2)
In [6]: z
Out[6]:
array([[[ 0,  0],
        [ 1,  2],
        [ 2,  4],
        [ 3,  6]],

       [[ 4,  8],
        [ 5, 10],
        [ 6, 12],
        [ 7, 14]],

       [[ 8, 16],
        [ 9, 18],
        [10, 20],
        [11, 22]]])
Using einsum:
In [8]: np.einsum('nm,kl->nml', x, y).shape
Out[8]: (3, 4, 2)
The fact that you approved:
In [9]: np.multiply.outer(x,y).shape
Out[9]: (3, 4, 1, 2)
suggests y isn't really (1,l) but rather (l,). Adjusting for either is easy.
I don't think there's much difference in memory efficiency among these. In this small example In[4] is fastest, but not by much.
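To tie the three approaches together, a short sketch assuming y really has shape (l,), per the note above:
import numpy as np

n, m, l = 3, 4, 5
x = np.arange(n * m).reshape(n, m)
y = np.arange(l)

z1 = np.multiply.outer(x, y)        # (n, m, l); no flattening
z2 = x[..., None] * y               # broadcasting: (n, m, 1) * (l,)
z3 = np.einsum('nm,l->nml', x, y)   # explicit index notation

assert z1.shape == z2.shape == z3.shape == (n, m, l)
assert (z1 == z2).all() and (z1 == z3).all()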

Numpy summation with sliding window is really slow

Code:
shape = np.array([6, 6])
grid = np.array([x.ravel() for x in np.meshgrid(*[np.arange(x) for i, x in enumerate(shape)], indexing='ij')]).T
slices = [tuple(slice(box[i], box[i] + 2) for i in range(len(box))) for box in grid]
score = np.zeros((7,7,3))
column = np.random.randn(36, 12) #just for example
column
>> array([[ 0, 1, 2, 3, ... 425, 426, 427, 428, 429, 430, 431]])
column = column.reshape((16, 3, 3, 3))
for i, window in enumerate(slices):
    score[window] += column[i]
score
>> array([[[0.000e+00, 1.000e+00, 2.000e+00],
           [3.000e+01, 3.200e+01, 3.400e+01],
           [9.000e+01, 9.300e+01, 9.600e+01], ...
           [8.280e+02, 8.300e+02, 8.320e+02],
           [4.290e+02, 4.300e+02, 4.310e+02]]])
It works, but the last two lines take a long time, as they will be in a loop. The problem is that the 'grid' variable contains an array of windows, and I don't know how to speed up the process.
Let's simplify the problem a bit - reduce the dimensions, and drop the final size 3 dimension:
In [265]: shape = np.array([4,4])
In [266]: grid = np.array([x.ravel() for x in np.meshgrid(*[np.arange(x) for i, x in enumerate(shape)], indexing='ij')]).T
     ...: grid = [tuple(slice(box[i], box[i] + 3) for i in range(len(box))) for box in grid]
In [267]: len(grid)
Out[267]: 16
In [268]: score = np.arange(36).reshape(6,6)
In [269]: X = np.array([score[x] for x in grid]).reshape(4,4,3,3)
In [270]: X
Out[270]:
array([[[[ 0,  1,  2],
         [ 6,  7,  8],
         [12, 13, 14]],

        [[ 1,  2,  3],
         [ 7,  8,  9],
         [13, 14, 15]],

        [[ 2,  3,  4],
         [ 8,  9, 10],
         [14, 15, 16]],
        ....
        [[21, 22, 23],
         [27, 28, 29],
         [33, 34, 35]]]])
This is a moving window: one (3,3) array, shifted over by 1, ..., shifted down by 1, etc.
With as_strided it is possible to construct a view of the array that consists of all these windows, without actually copying values. Having worked with as_strided before, I was able to construct the equivalent strides as:
In [271]: score.shape
Out[271]: (6, 6)
In [272]: score.strides
Out[272]: (48, 8)
In [273]: ast = np.lib.stride_tricks.as_strided
In [274]: x=ast(score, shape=(4,4,3,3), strides=(48,8,48,8))
In [275]: np.allclose(X,x)
Out[275]: True
This could be extended to your (28,28,3) dimensions, and turned into the summation.
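As a sketch of that summation step, using the same (6,6) score array as above: summing over the window axes of the strided view gives every window total at once, with no Python loop:
import numpy as np
from numpy.lib.stride_tricks import as_strided

score = np.arange(36).reshape(6, 6)
s0, s1 = score.strides
# a 4x4 grid of 3x3 windows; the strides repeat because the step is one element
windows = as_strided(score, shape=(4, 4, 3, 3), strides=(s0, s1, s0, s1))
totals = windows.sum(axis=(2, 3))   # (4, 4) array of per-window sums
print(totals.shape)                 # (4, 4)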
Generating such moving windows has been covered in previous SO questions. And it's also implemented in one of the image processing packages.
Adaptation for a 3-channel image:
In [45]: arr.shape
Out[45]: (6, 6, 3)
In [46]: arr.strides
Out[46]: (144, 24, 8)
In [47]: arr[:3,:3,0]
Out[47]:
array([[ 0.,  1.,  2.],
       [ 6.,  7.,  8.],
       [12., 13., 14.]])
In [48]: x = ast(arr, shape=(4,4,3,3,3), strides=(144,24,144,24,8))
In [49]: x[0,0,:,:,0]
Out[49]:
array([[ 0.,  1.,  2.],
       [ 6.,  7.,  8.],
       [12., 13., 14.]])
Since we are moving the window by one element at a time, the strides for x are easily derived from the source strides.
For 4x4 windows, just change the shape:
x = ast(arr, shape=(3,3,4,4,3), strides=(144,24,144,24,8))
In Efficiently Using Multiple Numpy Slices for Random Image Cropping, @Divakar suggests using skimage.
With the default step=1, the result is compatible:
In [55]: from skimage.util.shape import view_as_windows
In [63]: y = view_as_windows(arr,(4,4,3))
In [64]: y.shape
Out[64]: (3, 3, 1, 4, 4, 3)
In [69]: np.allclose(x,y[:,:,0])
Out[69]: True

xarray: simple weighted rolling mean example using .construct()

Xarray can do a weighted rolling mean via the .construct() method, as stated in an answer on SO and also in the docs.
The weighted rolling mean example in the docs doesn't quite look right, as it seems to give the same answer as the ordinary rolling mean.
import xarray as xr
import numpy as np
arr = xr.DataArray(np.arange(0, 7.5, 0.5).reshape(3, 5),
... dims=('x', 'y'))
arr.rolling(y=3, center=True).mean()
#<xarray.DataArray (x: 3, y: 5)>
#array([[nan, 0.5, 1. , 1.5, nan],
# [nan, 3. , 3.5, 4. , nan],
# [nan, 5.5, 6. , 6.5, nan]])
#Dimensions without coordinates: x, y
weight = xr.DataArray([0.25, 0.5, 0.25], dims=['window'])
arr.rolling(y=3, center=True).construct('window').dot(weight)
#<xarray.DataArray (x: 3, y: 5)>
#array([[nan, 0.5, 1. , 1.5, nan],
# [nan, 3. , 3.5, 4. , nan],
# [nan, 5.5, 6. , 6.5, nan]])
#Dimensions without coordinates: x, y
Here is a simpler example, for which I would like to get the syntax right:
da = xr.DataArray(np.arange(1,6), dims='x')
da.rolling(x=3, center=True).mean()
#<xarray.DataArray (x: 5)>
#array([nan, 2., 3., 4., nan])
#Dimensions without coordinates: x
weight = xr.DataArray([0.5, 1, 0.5], dims=['window'])
da.rolling(x=3, center=True).construct('window').dot(weight)
#<xarray.DataArray (x: 5)>
#array([nan, 4., 6., 8., nan])
#Dimensions without coordinates: x
It returns 4, 6, 8. I thought it would do:
((1 x 0.5) + (2 x 1) + (3 x 0.5)) / 3 = 4/3
((2 x 0.5) + (3 x 1) + (4 x 0.5)) / 3 = 2
((3 x 0.5) + (4 x 1) + (5 x 0.5)) / 3 = 8/3
i.e. 1.33, 2, 2.67
In the first example, you use linearly increasing data for arr.
Therefore, the weighted mean (with [0.25, 0.5, 0.25]) is the same as the simple mean: symmetric weights that sum to 1 reproduce the center value of a linear sequence, just as the plain average does.
If you consider non-linear data, the result differs:
In [50]: arr = xr.DataArray((np.arange(0, 7.5, 0.5)**2).reshape(3, 5),
...: dims=('x', 'y'))
...:
In [51]: arr.rolling(y=3, center=True).mean()
Out[51]:
<xarray.DataArray (x: 3, y: 5)>
array([[ nan,  0.416667,  1.166667,  2.416667,  nan],
       [ nan,  9.166667, 12.416667, 16.166667,  nan],
       [ nan, 30.416667, 36.166667, 42.416667,  nan]])
Dimensions without coordinates: x, y
In [52]: weight = xr.DataArray([0.25, 0.5, 0.25], dims=['window'])
...: arr.rolling(y=3, center=True).construct('window').dot(weight)
...:
Out[52]:
<xarray.DataArray (x: 3, y: 5)>
array([[ nan,  0.375,  1.125,  2.375,  nan],
       [ nan,  9.125, 12.375, 16.125,  nan],
       [ nan, 30.375, 36.125, 42.375,  nan]])
Dimensions without coordinates: x, y
For the second example, you use [0.5, 1, 0.5] as weight, the total of which is 2.
Therefore, the first non-nan item will be
(1 x 0.5) + (2 x 1) + (3 x 0.5) = 4
If you want the weighted mean rather than the weighted sum, use weights that sum to 1, such as [0.25, 0.5, 0.25].
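A sketch of that fix on the second example: divide the weights by their sum inside the dot, so the result is a weighted mean (the expected values follow from the arithmetic above):
import numpy as np
import xarray as xr

da = xr.DataArray(np.arange(1, 6), dims='x')
weight = xr.DataArray([0.5, 1.0, 0.5], dims=['window'])
# weight / weight.sum() makes the weights sum to 1
wmean = da.rolling(x=3, center=True).construct('window').dot(weight / weight.sum())
print(wmean.values)   # [nan  2.  3.  4. nan]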

numpy mask un-shaped N-dimensional array

Here is a Numpy array I would like to mask (note it is not a strict 2D array):
a = array([array([0, 1, 2, 3, 4]), array([0, 1]), array([0, 1, 2, 3, 4])], dtype=object)
This seems impossible, however. I would like to understand why, and possibly how to treat this kind of example, where I get a mask from a's values and apply it to another array with the same shape.
Thank you very much.
This is an object dtype array, containing 3 elements (which happen to be arrays themselves):
In [94]: a = np.array([np.array([0, 1, 2, 3, 4]), np.array([0, 1]), np.array([0,
...: 1, 2, 3, 4])], dtype=object)
In [95]: a
Out[95]: array([array([0, 1, 2, 3, 4]), array([0, 1]), array([0, 1, 2, 3, 4])], dtype=object)
In [96]: a.shape
Out[96]: (3,)
In [97]: a[1]
Out[97]: array([0, 1])
What do you mean by mask?
I can apply a boolean index to it:
In [99]: a[np.array([True,False,True])]
Out[99]: array([array([0, 1, 2, 3, 4]), array([0, 1, 2, 3, 4])], dtype=object)
a == np.array([0,1]) produces a warning and False; in general, == (and other comparison tests) does not work well with object dtype arrays.
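One workable pattern is to keep the masking per sub-array (a sketch; b is a hypothetical second array with the same ragged shape):
import numpy as np

a = np.array([np.array([0, 1, 2, 3, 4]), np.array([0, 1]),
              np.array([0, 1, 2, 3, 4])], dtype=object)
b = np.array([np.array([5, 6, 7, 8, 9]), np.array([5, 6]),
              np.array([5, 6, 7, 8, 9])], dtype=object)

masks = [row > 1 for row in a]                  # elementwise tests work per sub-array
picked = [brow[m] for brow, m in zip(b, masks)]
print(picked)   # [array([7, 8, 9]), array([], dtype=...), array([7, 8, 9])]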
Maybe what you need is a Pandas DataFrame, which can hold missing values. In your case you could do something like this:
>>> import pandas as pd
>>> df = pd.DataFrame([a[0].tolist(), a[1].tolist(), a[2].tolist()])
>>> df
   0  1    2    3    4
0  0  1  2.0  3.0  4.0
1  0  1  NaN  NaN  NaN
2  0  1  2.0  3.0  4.0
DataFrames are very powerful and they have more appropriate methods than Numpy arrays when you think of something that is more like a spreadsheet than like a matrix.