Partitioning np.array into sub-arrays with no np.nan values - numpy

Say I have a np.array, e.g. a = np.array([np.nan, 2., 3., 4., 5., np.nan, np.nan, np.nan, 8., 9., 10., np.nan, 14., np.nan, 16.]). I want to obtain all sub-arrays with no np.nan value, i.e. my desired output is:
sub_arrays_list = [array([2., 3., 4., 5.]), array([8., 9., 10.]), array([14.]), array([16.])]
I kind of managed to solve this with the following but it is quite inefficient:
sub_arrays_list = []
start, end = 0, 0
while end < len(a) - 1:
    if np.isnan(a[end]).any():
        end += 1
        start = end
    else:
        while not np.isnan(a[end]).any():
            if end < len(a) - 1:
                end += 1
            else:
                sub_arrays_list.append(a[start:])
                break
        else:
            sub_arrays_list.append(a[start:end])
            start = end
Would anyone please suggest a faster and better alternative to achieve this? Many thanks!

You can use:
# identify NaN values
m = np.isnan(a)
# array([ True, False, False, False, False, True, True, True, False,
# False, False, True, False, True, False])
# compute groups
idx = np.cumsum(m)
# array([1, 1, 1, 1, 1, 2, 3, 4, 4, 4, 4, 5, 5, 6, 6])
# remove NaNs, get indices of first non-NaN per group and split
out = np.split(a[~m], np.unique(idx[~m], return_index=True)[1][1:])
output:
[array([2., 3., 4., 5.]), array([ 8., 9., 10.]), array([14.]), array([16.])]
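As a cross-check, the same partition can be found by locating the run boundaries of the non-NaN mask directly. This is only a sketch of an alternative, not the method used above; the helper name split_on_nan is made up for illustration.

```python
import numpy as np

def split_on_nan(a):
    """Return the maximal runs of consecutive non-NaN values of a 1-D array."""
    a = np.asarray(a, dtype=float)
    valid = ~np.isnan(a)
    # Pad with False on both sides so runs touching either end are detected.
    padded = np.concatenate(([False], valid, [False]))
    # +1 marks the first index of a run, -1 marks one past its last index.
    edges = np.diff(padded.astype(np.int8))
    starts = np.flatnonzero(edges == 1)
    stops = np.flatnonzero(edges == -1)
    return [a[s:e] for s, e in zip(starts, stops)]

a = np.array([np.nan, 2., 3., 4., 5., np.nan, np.nan, np.nan,
              8., 9., 10., np.nan, 14., np.nan, 16.])
parts = split_on_nan(a)
```

Each returned piece is a slice (a view) of the input, so no values are copied.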

Related

Insert or append empty rows to a numpy array

There are references to using np.append to add to an initially empty array, such as How to add a new row to an empty numpy array.
Instead, my question is how to allocate extra empty space at the end of an array so that it can later be assigned to.
An example:
# Inefficient: The data in new_rows gets copied twice.
array = np.arange(6).reshape(2, 3)
new_rows = np.square(array)
new = np.concatenate((array, new_rows), axis=0)
# Instead, we would like something like the following:
def append_new_empty_rows(array, num_rows):
    new_rows = np.empty_like(array, shape=(num_rows, array.shape[1]))
    return np.concatenate((array, new_rows), axis=0)
array = np.arange(6).reshape(2, 3)
new = append_new_empty_rows(array, 2)
np.square(array[:2], out=new[2:])
However, the np.concatenate() likely still copies the empty data array?
Is there something like an np.append_empty()?
Here's what you are doing:
Make an array that's big enough for both pieces. np.zeros avoids any illusions that we are saving memory or work.
In [15]: arr1 = np.zeros((4,3), int)
In [16]: arr1
Out[16]:
array([[0, 0, 0],
[0, 0, 0],
[0, 0, 0],
[0, 0, 0]])
Now copy values from the initial (2,3) to part of arr1:
In [17]: arr1[:2] = arr
In [18]: arr1
Out[18]:
array([[0, 1, 2],
[3, 4, 5],
[0, 0, 0],
[0, 0, 0]])
and use the out to copy square values to the 2nd part
In [19]: np.square(arr[:2], out=arr1[2:])
Out[19]:
array([[ 0, 1, 4],
[ 9, 16, 25]])
In [21]: arr1
Out[21]:
array([[ 0, 1, 2],
[ 3, 4, 5],
[ 0, 1, 4],
[ 9, 16, 25]])
I don't see how that saves any effort or memory compared to:
In [22]: np.concatenate((arr, np.square(arr)), axis=0)
Out[22]:
array([[ 0, 1, 2],
[ 3, 4, 5],
[ 0, 1, 4],
[ 9, 16, 25]])
Under the covers, concatenate must be making a result array of the right size and copying the pieces into it. There's really no getting around that if you want an array that contains both arr and np.square(arr).
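A small check of that point, as a sketch: the concatenated result does not share memory with its input, and filling a preallocated array by hand produces the same values. The variable names joined and manual are only illustrative.

```python
import numpy as np

arr = np.arange(6).reshape(2, 3)

# concatenate allocates a new array and copies both pieces into it.
joined = np.concatenate((arr, np.square(arr)), axis=0)
shares = np.shares_memory(joined, arr)

# Doing the same by hand: allocate once, copy arr, write the squares in place.
manual = np.empty((4, 3), dtype=arr.dtype)
manual[:2] = arr
np.square(arr, out=manual[2:])

print(shares)                          # whether joined aliases arr
print(np.array_equal(joined, manual))  # whether both approaches agree
```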
Why don't you do it as follows:
array = np.arange(6).reshape(2, 3)
n_rows = 4
new = np.vstack([array, np.zeros((n_rows, array.shape[1]) )])
The new array will be this:
array([[0., 1., 2.],
[3., 4., 5.],
[0., 0., 0.],
[0., 0., 0.],
[0., 0., 0.],
[0., 0., 0.]])
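If what you want is to save an allocation, consider the out parameter of concatenate. The out array must already have the shape (and a compatible dtype) of the concatenated result, so it has to be preallocated and cannot be the original array itself:
array = np.arange(6).reshape(2, 3)
n_rows = 4
new = np.empty((array.shape[0] + n_rows, array.shape[1]))
np.concatenate([array, np.zeros((n_rows, array.shape[1]))], out=new)
This writes the result into new instead of allocating a fresh result array, although the values of array are still copied into it.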
I find that the fastest solution is to create an empty larger array and then copy the input array into its initial rows:
shape = (1000, 1000)
array = np.ones(shape)
new_shape = (2000, 1000)
def version1():  # Uses np.concatenate().
    new_rows = np.square(array)
    return np.concatenate((array, new_rows), axis=0)
def version2():  # Initializes new array using np.zeros().
    new = np.zeros(new_shape)
    new[:shape[0]] = array
    np.square(array, out=new[shape[0]:])
    return new
def append_new_empty_rows(array, num_rows):
    new = np.empty((array.shape[0] + num_rows, array.shape[1]))
    new[:array.shape[0]] = array
    return new
def version3():  # Initializes new array using np.empty().
    new = append_new_empty_rows(array, num_rows=array.shape[0])
    np.square(array, out=new[array.shape[0]:])
    return new
assert np.all(version1() == version2())
assert np.all(version1() == version3())
%timeit version1() # 4.34 ms per loop
%timeit version2() # 3.15 ms per loop
%timeit version3() # 2.24 ms per loop

extract CSV columns data to individual Numpy array

Extract data from the given SalaryGender CSV file and store the data from each column in a separate NumPy array
SalaryGender.csv sample data
Salary,Gender,Age,PhD
140,1,47,1
30,0,65,1
35.1,0,56,0
30,1,23,0
80,0,53,1
Use DataFrame.groupby,
which creates a list in which each element is a NumPy array holding one column:
[group.values for i,group in df.groupby(level=0,axis=1)]
If you aren't looking for a list, then use:
for i,group in df.groupby(level=0,axis=1):
    print(group.values)
.....
Also you can use DataFrame.items (named DataFrame.iteritems before pandas 2.0, where it was removed):
for i,col in df.items():
    print(col.to_numpy())
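Put together, a minimal sketch of the pandas route could look like this. It assumes the sample data is available as text (read here through io.StringIO instead of the real SalaryGender.csv file), and the dictionary name arrays is only illustrative.

```python
import io

import pandas as pd

csv_text = """Salary,Gender,Age,PhD
140,1,47,1
30,0,65,1
35.1,0,56,0
30,1,23,0
80,0,53,1"""

# In practice this would be pd.read_csv("SalaryGender.csv").
df = pd.read_csv(io.StringIO(csv_text))

# One NumPy array per column, keyed by the column name.
arrays = {name: col.to_numpy() for name, col in df.items()}

salary = arrays["Salary"]
age = arrays["Age"]
```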
In [199]: txt = """Salary,Gender,Age,PhD
...: 140,1,47,1
...: 30,0,65,1
...: 35.1,0,56,0
...: 30,1,23,0
...: 80,0,53,1"""
We can load your sample as a structured array:
In [203]: data = np.genfromtxt(txt.splitlines(), dtype=None, delimiter=',', encoding=None, names=True)
In [204]: data
Out[204]:
array([(140. , 1, 47, 1), ( 30. , 0, 65, 1), ( 35.1, 0, 56, 0),
( 30. , 1, 23, 0), ( 80. , 0, 53, 1)],
dtype=[('Salary', '<f8'), ('Gender', '<i8'), ('Age', '<i8'), ('PhD', '<i8')])
Each element of the array is a row of the file; field names come from the header line, field dtype is deduced from the data.
Fields can be accessed by name:
In [205]: data['Salary']
Out[205]: array([140. , 30. , 35.1, 30. , 80. ])
In [206]: data['Gender']
Out[206]: array([1, 0, 0, 1, 0])
They can be accessed that way or can be assigned to a variable
salary = data['Salary']
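To get every column as its own array in one step, the field names of the structured array can be looped over. A short sketch using the same sample text; the dictionary name columns is an illustrative choice.

```python
import numpy as np

txt = """Salary,Gender,Age,PhD
140,1,47,1
30,0,65,1
35.1,0,56,0
30,1,23,0
80,0,53,1"""

data = np.genfromtxt(txt.splitlines(), dtype=None, delimiter=",",
                     names=True, encoding=None)

# data.dtype.names holds the field names taken from the header line.
columns = {name: data[name] for name in data.dtype.names}

salary = columns["Salary"]
gender = columns["Gender"]
```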
You can also use unpack:
In [213]: a,b,c,d = np.genfromtxt(txt.splitlines(), delimiter=',', encoding=None, skip_header=1, unpack=True)
In [214]: a
Out[214]: array([140. , 30. , 35.1, 30. , 80. ])
In [215]: b
Out[215]: array([1., 0., 0., 1., 0.])
In [216]: c
Out[216]: array([47., 65., 56., 23., 53.])
In [217]: d
Out[217]: array([1., 1., 0., 0., 1.])
Sometimes it's simpler to load the file one (or selected) column at a time:
In [218]: b = np.genfromtxt(txt.splitlines(), delimiter=',', encoding=None, skip_header=1, usecols=[1])
In [219]: b
Out[219]: array([1., 0., 0., 1., 0.])
Please try this:
SG[SG.columns].values
where SG is the DataFrame loaded from your file. The code above gives you the values of all columns as an array in a single go.

Numpy summation with sliding window is really slow

Code:
shape = np.array([6, 6])
grid = np.array([x.ravel() for x in np.meshgrid(*[np.arange(x) for i, x in enumerate(shape)], indexing='ij')]).T
slices = [tuple(slice(box[i], box[i] + 2) for i in range(len(box))) for box in grid]
score = np.zeros((7,7,3))
column = np.random.randn(36, 12) #just for example
column
>> array([[ 0, 1, 2, 3, ... 425, 426, 427, 428, 429, 430, 431]])
column = column.reshape((16, 3, 3, 3))
for i, window in enumerate(slices):
    score[window] += column[i]
score
>> array([[[0.000e+00, 1.000e+00, 2.000e+00],
[3.000e+01, 3.200e+01, 3.400e+01],
[9.000e+01, 9.300e+01, 9.600e+01], ...
[8.280e+02, 8.300e+02, 8.320e+02],
[4.290e+02, 4.300e+02, 4.310e+02]]])
It works, but the last two lines take a lot of time because they run in a loop. The problem is that the 'grid' variable contains an array of windows, and I don't know how to speed up the process.
Let's simplify the problem a bit: reduce the dimensions, and drop the final size-3 dimension:
In [265]: shape = np.array([4,4])
In [266]: grid = np.array([x.ravel() for x in np.meshgrid(*[np.arange(x) for i, x in enumerate(shape)], indexing='ij')]).T
...: grid = [tuple(slice(box[i], box[i] + 3) for i in range(len(box))) for box in grid]
In [267]: len(grid)
Out[267]: 16
In [268]: score = np.arange(36).reshape(6,6)
In [269]: X = np.array([score[x] for x in grid]).reshape(4,4,3,3)
In [270]: X
Out[270]:
array([[[[ 0, 1, 2],
[ 6, 7, 8],
[12, 13, 14]],
[[ 1, 2, 3],
[ 7, 8, 9],
[13, 14, 15]],
[[ 2, 3, 4],
[ 8, 9, 10],
[14, 15, 16]],
....
[[21, 22, 23],
[27, 28, 29],
[33, 34, 35]]]])
This is a moving window: one (3,3) array, shift over 1, ..., shift down 1, etc.
With as_strided it is possible to construct a view of the array that consists of all these windows, without actually copying values. Having worked with as_strided before, I was able to construct the equivalent strides as:
In [271]: score.shape
Out[271]: (6, 6)
In [272]: score.strides
Out[272]: (48, 8)
In [273]: ast = np.lib.stride_tricks.as_strided
In [274]: x=ast(score, shape=(4,4,3,3), strides=(48,8,48,8))
In [275]: np.allclose(X,x)
Out[275]: True
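For comparison, NumPy 1.20 and later ship a helper, sliding_window_view, that builds this kind of view without hand-written strides. The sketch below assumes such a NumPy version and checks it against the manual as_strided call; the names manual and windows are illustrative.

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided, sliding_window_view

score = np.arange(36).reshape(6, 6)

# Manual version: the source strides are reused for both the window-position
# axes and the within-window axes, because the window moves one element at a time.
manual = as_strided(score, shape=(4, 4, 3, 3), strides=score.strides * 2)

# Library helper: the same (4, 4, 3, 3) view, with shape and strides derived for us.
windows = sliding_window_view(score, (3, 3))

print(windows.shape)
print(np.array_equal(manual, windows))
```

The view returned by sliding_window_view is read-only by default, which guards against accidental writes through overlapping windows.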
This could be extended to your (28,28,3) dimensions, and turned into the summation.
Generating such moving windows has been covered in previous SO questions. And it's also implemented in one of the image processing packages.
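One possible way to do the summation itself, sketched here under the simplified 2-D setup (a 4x4 grid of window positions, 3x3 windows), is to loop over the offsets inside the window instead of over the window positions. This is not the as_strided approach described above, just an alternative that cuts the Python-level iterations from one per window to one per window cell; patches, expected and score are illustrative names.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 4, 3                      # 4x4 window positions, 3x3 windows
patches = rng.standard_normal((n, n, k, k))

# Reference: one addition per window position (n * n iterations).
expected = np.zeros((n + k - 1, n + k - 1))
for i in range(n):
    for j in range(n):
        expected[i:i + k, j:j + k] += patches[i, j]

# Alternative: one addition per offset inside the window (k * k iterations).
# For a fixed offset (di, dj), window (i, j) contributes to cell (i + di, j + dj),
# so all windows hit distinct cells and a plain sliced += is safe.
score = np.zeros_like(expected)
for di in range(k):
    for dj in range(k):
        score[di:di + n, dj:dj + n] += patches[:, :, di, dj]

print(np.allclose(score, expected))
```

The gain grows with the number of window positions, since the loop length depends only on the window size.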
Adaptation for a 3 channel image,
In [45]: arr.shape
Out[45]: (6, 6, 3)
In [46]: arr.strides
Out[46]: (144, 24, 8)
In [47]: arr[:3,:3,0]
Out[47]:
array([[ 0., 1., 2.],
[ 6., 7., 8.],
[12., 13., 14.]])
In [48]: x = ast(arr, shape=(4,4,3,3,3), strides=(144,24,144,24,8))
In [49]: x[0,0,:,:,0]
Out[49]:
array([[ 0., 1., 2.],
[ 6., 7., 8.],
[12., 13., 14.]])
Since we are moving the window by one element at a time, the strides for x are easily derived from the source strides.
For 4x4 windows, just change the shape
x = ast(arr, shape=(3,3,4,4,3), strides=(144,24,144,24,8))
In "Efficiently Using Multiple Numpy Slices for Random Image Cropping", @Divikar suggests using skimage.
With the default step=1, the result is compatible:
In [55]: from skimage.util.shape import view_as_windows
In [63]: y = view_as_windows(arr,(4,4,3))
In [64]: y.shape
Out[64]: (3, 3, 1, 4, 4, 3)
In [69]: np.allclose(x,y[:,:,0])
Out[69]: True

Pandas MultiIndex Vector Setting

I have a DataFrame with multiindex like this:
0 1 2
a 0 0.928295 0.828225 -0.612509
1 1.103340 -0.540640 -0.344500
2 -1.760918 -1.426488 -0.647610
3 -0.782976 0.359211 1.601602
4 0.334406 -0.508752 -0.611212
b 2 0.717163 0.902514 1.027191
3 0.296955 1.543040 -1.429113
4 -0.651468 0.665114 0.949849
c 0 0.195620 -0.240177 0.745310
1 1.244997 -0.817949 0.130422
2 0.288510 1.123550 0.211385
3 -1.060227 1.739789 2.186224
4 -0.109178 -1.645732 0.022480
d 3 0.021789 0.747183 0.614485
4 -1.074870 0.407974 -0.961013
What I want : array([1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0])
Now I want to generate a zero vector that has the same length as this DataFrame and has ones only at the first entry of the level[1] index within each outer group.
For example, here the df has a shape of (15, 3). Therefore I want a vector of length 15 that has 1 at (a, 0), (b, 2), (c, 0), (d, 3) and 0 everywhere else.
How could I generate a vector like that? (If possible, without looping over each sub-vector and then using np.concatenate().) Thanks a lot!
IIUC, using duplicated:
(~df.index.get_level_values(0).duplicated()).astype(int)
Out[726]: array([1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0])
Or using groupby and head
df.loc[df.groupby(level=0).head(1).index,'New']=1
df.New.fillna(0).values
Out[721]: array([1., 0., 0., 0., 0., 1., 0., 0., 1., 0., 0., 0., 0., 1., 0.])
Get the labels of your first multiindex, turn them into a series, then find where they are not equal to the adjacent ones
labels = pd.Series(df.index.labels[0])
v = labels.ne(labels.shift()).astype(int).values
>>> v
array([1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0])
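pd.Index(df.index.labels[0])
Int64Index([0, 0, 0, 0, 0, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3], dtype='int64')
res = pd.Index(df.index.labels[0]).duplicated(keep='first')
array([False, True, True, True, True, False, True, True, False,
True, True, True, True, False, True])
MultiIndex has an attribute labels (renamed codes in pandas 0.24) that gives the integer position of each row's label within its level.
The first occurrence of each code, where res is False, is exactly the position the question asks for, so (~res).astype(int) gives the required vector.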
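A self-contained sketch of two of these approaches, assuming a pandas version where the attribute is called codes (0.24 or later). The index is rebuilt by hand to match the question's layout, and the data values are placeholders since only the index matters here.

```python
import numpy as np
import pandas as pd

# Rebuild an index shaped like the one in the question.
index = pd.MultiIndex.from_tuples(
    [("a", i) for i in range(5)]
    + [("b", i) for i in (2, 3, 4)]
    + [("c", i) for i in range(5)]
    + [("d", i) for i in (3, 4)]
)
df = pd.DataFrame(np.zeros((15, 3)), index=index)  # placeholder values

# Approach 1: a row is "first" if its outer label has not been seen before.
first = (~df.index.get_level_values(0).duplicated()).astype(int)

# Approach 2: compare each integer code with the previous one.
codes = np.asarray(df.index.codes[0])
by_codes = np.r_[1, (codes[1:] != codes[:-1]).astype(int)]
```

The second approach marks every change of the outer label, so it agrees with the first only when equal outer labels are stored contiguously, as they are here.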

xarray: simple weighted rolling mean example using .construct()

Xarray can compute a weighted rolling mean via the .construct() method, as stated in an answer on SO here and also in the docs.
The weighted rolling mean example in the docs doesn't quite look right as it seems to give the same answer as the ordinary rolling mean.
import xarray as xr
import numpy as np
arr = xr.DataArray(np.arange(0, 7.5, 0.5).reshape(3, 5),
                   dims=('x', 'y'))
arr.rolling(y=3, center=True).mean()
#<xarray.DataArray (x: 3, y: 5)>
#array([[nan, 0.5, 1. , 1.5, nan],
# [nan, 3. , 3.5, 4. , nan],
# [nan, 5.5, 6. , 6.5, nan]])
#Dimensions without coordinates: x, y
weight = xr.DataArray([0.25, 0.5, 0.25], dims=['window'])
arr.rolling(y=3, center=True).construct('window').dot(weight)
#<xarray.DataArray (x: 3, y: 5)>
#array([[nan, 0.5, 1. , 1.5, nan],
# [nan, 3. , 3.5, 4. , nan],
# [nan, 5.5, 6. , 6.5, nan]])
#Dimensions without coordinates: x, y
Here is a simpler example on which I would like to get the syntax right:
da = xr.DataArray(np.arange(1,6), dims='x')
da.rolling(x=3, center=True).mean()
#<xarray.DataArray (x: 5)>
#array([nan, 2., 3., 4., nan])
#Dimensions without coordinates: x
weight = xr.DataArray([0.5, 1, 0.5], dims=['window'])
da.rolling(x=3, center=True).construct('window').dot(weight)
#<xarray.DataArray (x: 5)>
#array([nan, 4., 6., 8., nan])
#Dimensions without coordinates: x
It returns 4, 6, 8. I thought it would do:
((1 x 0.5) + (2 x 1) + (3 x 0.5)) / 3 = 4/3
((2 x 0.5) + (3 x 1) + (4 x 0.5)) / 3 = 2
((3 x 0.5) + (4 x 1) + (5 x 0.5)) / 3 = 8/3
i.e. 1.33, 2.0, 2.67
In the first example, you use evenly spaced data for arr.
Therefore, the weighted mean (with [0.25, 0.5, 0.25]) will be the same as the simple mean.
If you consider non-linear data, the results differ:
In [50]: arr = xr.DataArray((np.arange(0, 7.5, 0.5)**2).reshape(3, 5),
...: dims=('x', 'y'))
...:
In [51]: arr.rolling(y=3, center=True).mean()
Out[51]:
<xarray.DataArray (x: 3, y: 5)>
array([[ nan, 0.416667, 1.166667, 2.416667, nan],
[ nan, 9.166667, 12.416667, 16.166667, nan],
[ nan, 30.416667, 36.166667, 42.416667, nan]])
Dimensions without coordinates: x, y
In [52]: weight = xr.DataArray([0.25, 0.5, 0.25], dims=['window'])
...: arr.rolling(y=3, center=True).construct('window').dot(weight)
...:
Out[52]:
<xarray.DataArray (x: 3, y: 5)>
array([[ nan, 0.375, 1.125, 2.375, nan],
[ nan, 9.125, 12.375, 16.125, nan],
[ nan, 30.375, 36.125, 42.375, nan]])
Dimensions without coordinates: x, y
For the second example, you use [0.5, 1, 0.5] as weight, the total of which is 2.
Therefore, the first non-nan item will be
(1 x 0.5) + (2 x 1) + (3 x 0.5) = 4
If you want weighted mean, rather than the weighted sum, use [0.25, 0.5, 0.25] instead.
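The arithmetic can be checked independently of xarray with a few lines of plain NumPy. This sketch only reproduces the interior (non-NaN) values of the second example; dividing by the sum of the weights is the same as using the normalized weights [0.25, 0.5, 0.25].

```python
import numpy as np

values = np.arange(1, 6, dtype=float)   # 1, 2, 3, 4, 5
weights = np.array([0.5, 1.0, 0.5])     # sums to 2, not 1

# One dot product per full window of length 3.
weighted_sum = np.array([values[i:i + 3] @ weights
                         for i in range(len(values) - 2)])

# Dividing by the total weight turns the weighted sum into a weighted mean.
weighted_mean = weighted_sum / weights.sum()

print(weighted_sum)
print(weighted_mean)
```

Because the data is evenly spaced and the weights are symmetric, the weighted mean coincides with the plain rolling mean here.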