Insert or append empty rows to a numpy array

There are references to using np.append to add to an initially empty array, such as How to add a new row to an empty numpy array.
Instead, my question is how to allocate extra empty space at the end of an array so that it can later be assigned to.
An example:
import numpy as np

# Inefficient: The data in new_rows gets copied twice.
array = np.arange(6).reshape(2, 3)
new_rows = np.square(array)
new = np.concatenate((array, new_rows), axis=0)

# Instead, we would like something like the following:
def append_new_empty_rows(array, num_rows):
    new_rows = np.empty_like(array, shape=(num_rows, array.shape[1]))
    return np.concatenate((array, new_rows), axis=0)

array = np.arange(6).reshape(2, 3)
new = append_new_empty_rows(array, 2)
np.square(array[:2], out=new[2:])
However, np.concatenate() presumably still copies the empty array's data into the result?
Is there something like an np.append_empty()?

Here's what you are doing:
Make an array that's big enough for both pieces. np.zeros avoids any illusions that we are saving memory or work.
In [14]: arr = np.arange(6).reshape(2, 3)
In [15]: arr1 = np.zeros((4, 3), int)
In [16]: arr1
Out[16]:
array([[0, 0, 0],
       [0, 0, 0],
       [0, 0, 0],
       [0, 0, 0]])
Now copy values from the initial (2,3) array into part of arr1:
In [17]: arr1[:2] = arr
In [18]: arr1
Out[18]:
array([[0, 1, 2],
       [3, 4, 5],
       [0, 0, 0],
       [0, 0, 0]])
and use the out parameter to write the squared values into the second part:
In [19]: np.square(arr[:2], out=arr1[2:])
Out[19]:
array([[ 0,  1,  4],
       [ 9, 16, 25]])
In [21]: arr1
Out[21]:
array([[ 0,  1,  2],
       [ 3,  4,  5],
       [ 0,  1,  4],
       [ 9, 16, 25]])
I don't see how that saves any effort or memory compared to:
In [22]: np.concatenate((arr, np.square(arr)), axis=0)
Out[22]:
array([[ 0,  1,  2],
       [ 3,  4,  5],
       [ 0,  1,  4],
       [ 9, 16, 25]])
concatenate, under the covers, must make a result array of the right size and copy the pieces into it. There's really no getting around that if you want a single array that contains both arr and np.square(arr).
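One way to convince yourself (a minimal check, not from the original exchange): the result of concatenate never shares memory with its inputs, which shows the copy really happens:
import numpy as np

arr = np.arange(6).reshape(2, 3)
result = np.concatenate((arr, np.square(arr)), axis=0)

# concatenate allocated fresh storage and copied both pieces into it,
# so neither input aliases the result.
print(np.shares_memory(arr, result))  # False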

Why don't you do it as follows:
array = np.arange(6).reshape(2, 3)
n_rows = 4
new = np.vstack([array, np.zeros((n_rows, array.shape[1]))])
The new array will be this:
array([[0., 1., 2.],
       [3., 4., 5.],
       [0., 0., 0.],
       [0., 0., 0.],
       [0., 0., 0.],
       [0., 0., 0.]])
If what you want is to save some space, concatenate also provides an out parameter. Note, though, that out must already have the shape of the concatenated result, so it does not remove the need to preallocate; it only guarantees that the pieces are written straight into the buffer you supply:
array = np.arange(6).reshape(2, 3)
n_rows = 4
out = np.empty((array.shape[0] + n_rows, array.shape[1]))
np.concatenate([array, np.zeros((n_rows, array.shape[1]))], out=out)
As you can see, no additional result array is allocated: the pieces are copied directly into out. (Passing the original (2, 3) array as out would raise a ValueError, since the shapes don't match.)

I find that the fastest solution is to create an empty larger array and then copy the input array into its initial rows:
shape = (1000, 1000)
array = np.ones(shape)
new_shape = (2000, 1000)

def version1():  # Uses np.concatenate().
    new_rows = np.square(array)
    return np.concatenate((array, new_rows), axis=0)

def version2():  # Initializes new array using np.zeros().
    new = np.zeros(new_shape)
    new[:shape[0]] = array
    np.square(array, out=new[shape[0]:])
    return new

def append_new_empty_rows(array, num_rows):
    new = np.empty((array.shape[0] + num_rows, array.shape[1]))
    new[:array.shape[0]] = array
    return new

def version3():  # Initializes new array using np.empty().
    new = append_new_empty_rows(array, num_rows=array.shape[0])
    np.square(array, out=new[array.shape[0]:])
    return new

assert np.all(version1() == version2())
assert np.all(version1() == version3())

%timeit version1()  # 4.34 ms per loop
%timeit version2()  # 3.15 ms per loop
%timeit version3()  # 2.24 ms per loop
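The gap between version2 and version3 plausibly comes from np.zeros having to provide zero-filled memory while np.empty skips initialization. A minimal sketch to check that on your own machine (times vary, and the OS may hand out lazily zeroed pages, which narrows the gap):
import timeit
import numpy as np

new_shape = (2000, 1000)

# Compare allocation cost alone; writing into the arrays afterwards
# costs the same for both.
t_zeros = timeit.timeit(lambda: np.zeros(new_shape), number=1000)
t_empty = timeit.timeit(lambda: np.empty(new_shape), number=1000)
print(f"zeros: {t_zeros:.4f} s, empty: {t_empty:.4f} s")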

Related

Extract CSV column data to individual Numpy arrays

Extract data from the given SalaryGender CSV file and store the data from each column in a separate NumPy array
SalaryGender.csv sample data
Salary,Gender,Age,PhD
140,1,47,1
30,0,65,1
35.1,0,56,0
30,1,23,0
80,0,53,1
Use DataFrame.groupby. This creates a list where each element is a numpy array of one column:
[group.values for i, group in df.groupby(level=0, axis=1)]
If you aren't looking for a list, then use:
for i, group in df.groupby(level=0, axis=1):
    print(group.values)
You can also use DataFrame.iteritems (renamed to DataFrame.items in newer pandas):
for i, col in df.iteritems():
    print(col.to_numpy())
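For the literal task of storing each column in its own named array, a minimal pandas sketch (assuming the CSV sits next to the script; the filename comes from the question):
import pandas as pd

df = pd.read_csv('SalaryGender.csv')

# One numpy array per column, keyed by column name.
arrays = {name: col.to_numpy() for name, col in df.items()}
salary = arrays['Salary']  # array([140. , 30. , 35.1, 30. , 80. ])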
In [199]: txt = """Salary,Gender,Age,PhD
...: 140,1,47,1
...: 30,0,65,1
...: 35.1,0,56,0
...: 30,1,23,0
...: 80,0,53,1"""
We can load your sample as a structured array:
In [203]: data = np.genfromtxt(txt.splitlines(), dtype=None, delimiter=',', encoding=None, names=True)
In [204]: data
Out[204]:
array([(140. , 1, 47, 1), ( 30. , 0, 65, 1), ( 35.1, 0, 56, 0),
       ( 30. , 1, 23, 0), ( 80. , 0, 53, 1)],
      dtype=[('Salary', '<f8'), ('Gender', '<i8'), ('Age', '<i8'), ('PhD', '<i8')])
Each element of the array is a row of the file; field names come from the header line, field dtype is deduced from the data.
Fields can be accessed by name:
In [205]: data['Salary']
Out[205]: array([140. , 30. , 35.1, 30. , 80. ])
In [206]: data['Gender']
Out[206]: array([1, 0, 0, 1, 0])
They can be accessed that way, or assigned to a variable:
salary = data['Salary']
You can also use unpack:
In [213]: a, b, c, d = np.genfromtxt(txt.splitlines(), delimiter=',', encoding=None, skip_header=1, unpack=True)
In [214]: a
Out[214]: array([140. , 30. , 35.1, 30. , 80. ])
In [215]: b
Out[215]: array([1., 0., 0., 1., 0.])
In [216]: c
Out[216]: array([47., 65., 56., 23., 53.])
In [217]: d
Out[217]: array([1., 1., 0., 0., 1.])
Sometimes it's simpler to load the file one (or selected) column at a time:
In [218]: b = np.genfromtxt(txt.splitlines(), delimiter=',', encoding=None, skip_header=1, usecols=[1])
In [219]: b
Out[219]: array([1., 0., 0., 1., 0.])
Please try this:
SG[SG.columns].values
where SG is your DataFrame (read from the CSV file). The code above gives you all column values in a single 2D array in one go.
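Since that returns a single 2D array rather than per-column arrays, a hedged follow-up (assuming all columns are numeric, as in the sample, so they share a float dtype):
import pandas as pd

SG = pd.read_csv('SalaryGender.csv')

# to_numpy() upcasts the numeric columns to a common float dtype;
# transposing lets each column be unpacked into its own array.
salary, gender, age, phd = SG.to_numpy().T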

Numpy summation with sliding window is really slow

Code:
shape = np.array([6, 6])
grid = np.array([x.ravel() for x in np.meshgrid(*[np.arange(x) for i, x in enumerate(shape)], indexing='ij')]).T
slices = [tuple(slice(box[i], box[i] + 2) for i in range(len(box))) for box in grid]
score = np.zeros((7, 7, 3))
column = np.random.randn(36, 12)  # just for example
column
>> array([[ 0, 1, 2, 3, ... 425, 426, 427, 428, 429, 430, 431]])
column = column.reshape((16, 3, 3, 3))
for i, window in enumerate(slices):
    score[window] += column[i]
score
>> array([[[0.000e+00, 1.000e+00, 2.000e+00],
           [3.000e+01, 3.200e+01, 3.400e+01],
           [9.000e+01, 9.300e+01, 9.600e+01], ...
           [8.280e+02, 8.300e+02, 8.320e+02],
           [4.290e+02, 4.300e+02, 4.310e+02]]])
It works, but the last two lines take a very long time, since they will run inside a loop. The problem is that the grid variable contains an array of windows, and I don't know how to speed up the process.
Let's simplify the problem a bit - reduce the dimensions, and drop the final size-3 dimension:
In [265]: shape = np.array([4,4])
In [266]: grid = np.array([x.ravel() for x in np.meshgrid(*[np.arange(x) for i, x in enumerate(shape)], indexing='ij')]).T
     ...: grid = [tuple(slice(box[i], box[i] + 3) for i in range(len(box))) for box in grid]
In [267]: len(grid)
Out[267]: 16
In [268]: score = np.arange(36).reshape(6,6)
In [269]: X = np.array([score[x] for x in grid]).reshape(4,4,3,3)
In [270]: X
Out[270]:
array([[[[ 0,  1,  2],
         [ 6,  7,  8],
         [12, 13, 14]],

        [[ 1,  2,  3],
         [ 7,  8,  9],
         [13, 14, 15]],

        [[ 2,  3,  4],
         [ 8,  9, 10],
         [14, 15, 16]],
        ....
        [[21, 22, 23],
         [27, 28, 29],
         [33, 34, 35]]]])
This is a moving window - one (3,3) array, shifted over by 1, ..., shifted down by 1, etc.
With as_strided it is possible to construct a view of the array that consists of all these windows, without actually copying values. Having worked with as_strided before, I was able to construct the equivalent strides as:
In [271]: score.shape
Out[271]: (6, 6)
In [272]: score.strides
Out[272]: (48, 8)
In [273]: ast = np.lib.stride_tricks.as_strided
In [274]: x=ast(score, shape=(4,4,3,3), strides=(48,8,48,8))
In [275]: np.allclose(X,x)
Out[275]: True
This could be extended to your (28,28,3) dimensions, and turned into the summation.
Generating such moving windows has been covered in previous SO questions. And it's also implemented in one of the image processing packages.
Adaptation for a 3-channel image:
In [45]: arr.shape
Out[45]: (6, 6, 3)
In [46]: arr.strides
Out[46]: (144, 24, 8)
In [47]: arr[:3,:3,0]
Out[47]:
array([[ 0.,  1.,  2.],
       [ 6.,  7.,  8.],
       [12., 13., 14.]])
In [48]: x = ast(arr, shape=(4,4,3,3,3), strides=(144,24,144,24,8))
In [49]: x[0,0,:,:,0]
Out[49]:
array([[ 0.,  1.,  2.],
       [ 6.,  7.,  8.],
       [12., 13., 14.]])
Since we are moving the window by one element at a time, the strides for x are easily derived from the source strides.
For 4x4 windows, just change the shape:
x = ast(arr, shape=(3,3,4,4,3), strides=(144,24,144,24,8))
In Efficiently Using Multiple Numpy Slices for Random Image Cropping, @Divakar suggests using skimage.
With the default step=1, the result is compatible:
In [55]: from skimage.util.shape import view_as_windows
In [63]: y = view_as_windows(arr,(4,4,3))
In [64]: y.shape
Out[64]: (3, 3, 1, 4, 4, 3)
In [69]: np.allclose(x,y[:,:,0])
Out[69]: True
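Newer NumPy (1.20+) also ships this directly as np.lib.stride_tricks.sliding_window_view, a safe wrapper around as_strided. A minimal sketch of generating the same windows and reducing each one to a sum:
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

arr = np.arange(6 * 6 * 3, dtype=float).reshape(6, 6, 3)

# All 4x4x3 windows as a zero-copy, read-only view; the shape
# (3, 3, 1, 4, 4, 3) matches the view_as_windows result above.
windows = sliding_window_view(arr, (4, 4, 3))

# Summing within each window then needs no Python loop.
window_sums = windows.sum(axis=(-3, -2, -1))
print(window_sums.shape)  # (3, 3, 1)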

Fastest way to compare neighboring elements in multi-dimensional numpy array

What is the fastest way to compare neighboring elements in a 3-dimensional array?
Assume I have a numpy array of (4,4,4). I want to loop in the k-direction and compare elements in pairs. So, compare all neighboring elements and assign the lowest index if they are not equal. Essentially this:
if array[(0, 0, 0)] != array[(0, 0, 1)]:
    array[(0, 0, 0)] = 111
Thus, the comparisons would be:
(0, 0, 0) and (0, 0, 1)
(0, 0, 1) and (0, 0, 2)
(0, 0, 2) and (0, 0, 3)
... for all i and j ...
However, I want to do this for every i and j in the array, and writing a standard Python for loop for this on huge arrays with millions of cells is incredibly slow. Is there a more 'standard' numpy way to do this without the explicit for loop?
Maybe there's some trick using the slicing step (e.g. array[:,:,::2], array[:,:,1::2])?
Try np.diff.
import numpy as np
a = np.arange(9).reshape(3, 3)
A = np.array([a, a, a + 1]).T
same_with_neighbor_on_last_axis = np.diff(A, axis=-1) == 0
print(A)
print(same_with_neighbor_on_last_axis)
A is constructed to have 2 consecutive equal entries along the third axis:
>>> A
array([[[0, 0, 1],
        [3, 3, 4],
        [6, 6, 7]],

       [[1, 1, 2],
        [4, 4, 5],
        [7, 7, 8]],

       [[2, 2, 3],
        [5, 5, 6],
        [8, 8, 9]]])
The output then yields:
>>> print(same_with_neighbor_on_last_axis)
[[[ True False]
  [ True False]
  [ True False]]

 [[ True False]
  [ True False]
  [ True False]]

 [[ True False]
  [ True False]
  [ True False]]]
Using the axis keyword, you can choose whichever axis you need to do this operation on; if you need all of them, you can use a loop. np.diff does little more than the following:
np.diff(A, axis=-1)  # equivalent to A[..., 1:] - A[..., :-1]
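Applied to the question's actual task - write 111 into the lower-indexed element wherever a pair along the last axis differs - a minimal sketch (the random test array is illustrative, not from the question):
import numpy as np

rng = np.random.default_rng(0)
arr = rng.integers(0, 3, size=(4, 4, 4))

# True wherever an element differs from its right-hand neighbour
# along the last axis; shape (4, 4, 3).
differs = arr[..., :-1] != arr[..., 1:]

# Assign through a view of the first three k-positions, so the
# lower-indexed element of each unequal pair is overwritten.
arr[..., :-1][differs] = 111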

Reduce a dimension of numpy array by selecting

I have a 3d array
A = np.random.random((4,4,3))
and a index matrix
B = np.int_(np.random.random((4,4))*3)
How do I get a 2D array from A based on index matrix B?
In general, how to get a N-1 dimensional array from a ND array and a N-1 dimensional index array?
Let's take an example:
>>> A = np.random.randint(0,10,(3,3,2))
>>> A
array([[[0, 1],
        [8, 2],
        [6, 4]],

       [[1, 0],
        [6, 9],
        [7, 7]],

       [[1, 2],
        [2, 2],
        [9, 7]]])
Use fancy indexing to take simple indices. Note that all indices must be of the same shape, and the shape of each index array is the shape of what is returned.
>>> ind = np.arange(2)
>>> A[ind,ind,ind]
array([0, 9])  # elements (0,0,0) and (1,1,1)
>>> ind = np.arange(2).reshape(2,1)
>>> A[ind,ind,ind]
array([[0],
       [9]])
So for your example we need to supply the grid for the first two dimensions:
>>> A = np.random.random((4,4,3))
>>> B = np.int_(np.random.random((4,4))*3)
>>> A
array([[[ 0.95158697,  0.37643036,  0.29175815],
        [ 0.84093397,  0.53453123,  0.64183715],
        [ 0.31189496,  0.06281937,  0.10008886],
        [ 0.79784114,  0.26428462,  0.87899921]],

       [[ 0.04498205,  0.63823379,  0.48130828],
        [ 0.93302194,  0.91964805,  0.05975115],
        [ 0.55686047,  0.02692168,  0.31065731],
        [ 0.92822499,  0.74771321,  0.03055592]],

       [[ 0.24849139,  0.42819062,  0.14640117],
        [ 0.92420031,  0.87483486,  0.51313695],
        [ 0.68414428,  0.86867423,  0.96176415],
        [ 0.98072548,  0.16939697,  0.19117458]],

       [[ 0.71009607,  0.23057644,  0.80725518],
        [ 0.01932983,  0.36680718,  0.46692839],
        [ 0.51729835,  0.16073775,  0.77768313],
        [ 0.8591955 ,  0.81561797,  0.90633695]]])
>>> B
array([[1, 2, 0, 0],
       [1, 2, 0, 1],
       [2, 1, 1, 1],
       [1, 2, 1, 2]])
>>> x, y = np.meshgrid(np.arange(A.shape[0]), np.arange(A.shape[1]), indexing='ij')
>>> x
array([[0, 0, 0, 0],
       [1, 1, 1, 1],
       [2, 2, 2, 2],
       [3, 3, 3, 3]])
>>> y
array([[0, 1, 2, 3],
       [0, 1, 2, 3],
       [0, 1, 2, 3],
       [0, 1, 2, 3]])
>>> A[x,y,B]
array([[ 0.37643036,  0.64183715,  0.31189496,  0.79784114],
       [ 0.63823379,  0.05975115,  0.55686047,  0.74771321],
       [ 0.14640117,  0.87483486,  0.86867423,  0.16939697],
       [ 0.23057644,  0.46692839,  0.16073775,  0.90633695]])
Note the indexing='ij' argument: meshgrid's default 'xy' ordering would swap the first two axes and pick out the wrong elements.
If you prefer to use mesh as suggested by Daniel, you may also use
A[tuple( np.ogrid[:A.shape[0], :A.shape[1]] + [B] )]
to work with sparse indices. In the general case you could use
A[tuple( np.ogrid[ [slice(0, end) for end in A.shape[:-1]] ] + [B] )]
Note that this may also be used when you'd like to index by B along an axis different from the last one.
Otherwise you can do it using broadcasting:
A[np.arange(A.shape[0])[:, np.newaxis], np.arange(A.shape[1])[np.newaxis, :], B]
This may be generalized too but it's a bit more complicated.
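A hedged modern alternative (NumPy 1.15+): np.take_along_axis does exactly this pick-one-per-position selection without building index grids:
import numpy as np

A = np.random.random((4, 4, 3))
B = np.random.randint(0, 3, (4, 4))

# B gains a trailing length-1 axis so it matches A's ndim; squeeze
# removes that axis again, leaving the (4, 4) selection A[i, j, B[i, j]].
selected = np.take_along_axis(A, B[..., np.newaxis], axis=-1).squeeze(-1)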

referencing rows in a matrix using index from another matrix

You have an original sparse matrix X:
>>> print(type(X))
<class 'scipy.sparse.csr.csr_matrix'>
>>> print(X.todense())
[[1 4 3]
 [3 4 1]
 [2 1 1]
 [3 6 3]]
You have a second sparse matrix Z, which is derived from some rows of X (say the values are doubled so we can see the difference between the two matrices). In pseudo-code:
>>> Z = X[[0, 2, 3]]
>>> print(Z.todense())
[[1 4 3]
 [2 1 1]
 [3 6 3]]
>>> Z = Z * 2
>>> print(Z.todense())
[[ 2  8  6]
 [ 4  2  2]
 [ 6 12  6]]
What's the best way of retrieving rows in Z using the ORIGINAL indices from X? So for instance, in pseudo-code:
>>> print(Z[[0, 3]])
[[ 2  8  6]   # row 0 of Z, which was row 0 of X
 [ 6 12  6]]  # row 2 of Z, but row 3 of the original X
That is, how can you retrieve rows from Z using indices that refer to those rows' positions in the original matrix X? To do this, you can't modify X in any way (you can't add an index column to the matrix X), but there are no other limits.
If you have the original indices in an array i, and the values in i are in increasing order (as in your example), you can use numpy.searchsorted(i, [0, 3]) to find the indices in Z that correspond to indices [0, 3] in the original X. Here's a demonstration in an IPython session:
In [38]: import numpy as np
    ...: from scipy.sparse import csr_matrix

In [39]: X = csr_matrix([[1,4,3],[3,4,1],[2,1,1],[3,6,3]])
In [40]: X.todense()
Out[40]:
matrix([[1, 4, 3],
        [3, 4, 1],
        [2, 1, 1],
        [3, 6, 3]])
In [41]: i = np.array([0, 2, 3])
In [42]: Z = 2 * X[i]
In [43]: Z.todense()
Out[43]:
matrix([[ 2,  8,  6],
        [ 4,  2,  2],
        [ 6, 12,  6]])
In [44]: Zsub = Z[np.searchsorted(i, [0, 3])]
In [45]: Zsub.todense()
Out[45]:
matrix([[ 2,  8,  6],
        [ 6, 12,  6]])