2D Numpy Array Boolean Slicing [duplicate] - numpy

I've got a strange situation.
I have a 2D Numpy array, x:
x = np.random.randint(0, 6, (20, 8))  # integers 0..5; random_integers is deprecated
And I have two indexers: one with indices for the rows, and one with indices for the columns. In order to index x, I have to do the following:
row_indices = [4,2,18,16,7,19,4]
col_indices = [1,2]
x_rows = x[row_indices,:]
x_indexed = x_rows[:,col_indices]
Instead of just:
x_new = x[row_indices,col_indices]
(which fails with a broadcasting error, because index lists of shapes (7,) and (2,) cannot be broadcast together)
I'd like to do the indexing in one line using broadcasting, since that would keep the code clean and readable. I also don't know much about Python under the hood, but as I understand it, doing it in one line should be faster (and I'll be working with pretty big arrays).
Test Case:
x = np.random.randint(0, 6, (20, 8))  # integers 0..5; random_integers is deprecated
row_indices = [4,2,18,16,7,19,4]
col_indices = [1,2]
x_rows = x[row_indices,:]
x_indexed = x_rows[:,col_indices]
x_doesnt_work = x[row_indices,col_indices]

Selections or assignments with np.ix_ using indexing or boolean arrays/masks
1. With indexing-arrays
A. Selection
We can use np.ix_ to get a tuple of indexing arrays that are broadcastable against each other, so that together they describe every combination of the given indices. When that tuple is used to index the input array, the result has the same broadcast shape, i.e. one row per row index and one column per column index. Hence, to make a selection based on two 1D indexing arrays, it would be -
x_indexed = x[np.ix_(row_indices,col_indices)]
B. Assignment
We can use the same notation for assigning scalar or a broadcastable array into those indexed positions. Hence, the following works for assignments -
x[np.ix_(row_indices,col_indices)] = value  # value: a scalar or a broadcastable array
2. With masks
We can also use boolean arrays/masks with np.ix_, similar to how indexing arrays are used. Again, this can be used both to select a block from the input array and to assign into it.
A. Selection
Thus, with row_mask and col_mask boolean arrays as the masks for row and column selections respectively, we can use the following for selections -
x[np.ix_(row_mask,col_mask)]
B. Assignment
And the following works for assignments -
x[np.ix_(row_mask,col_mask)] = value  # value: a scalar or a broadcastable array
Sample Runs
1. Using np.ix_ with indexing-arrays
Input array and indexing arrays -
In [221]: x
Out[221]:
array([[17, 39, 88, 14, 73, 58, 17, 78],
[88, 92, 46, 67, 44, 81, 17, 67],
[31, 70, 47, 90, 52, 15, 24, 22],
[19, 59, 98, 19, 52, 95, 88, 65],
[85, 76, 56, 72, 43, 79, 53, 37],
[74, 46, 95, 27, 81, 97, 93, 69],
[49, 46, 12, 83, 15, 63, 20, 79]])
In [222]: row_indices
Out[222]: [4, 2, 5, 4, 1]
In [223]: col_indices
Out[223]: [1, 2]
Tuple of indexing arrays with np.ix_ -
In [224]: np.ix_(row_indices,col_indices) # Broadcasting of indices
Out[224]:
(array([[4],
[2],
[5],
[4],
[1]]), array([[1, 2]]))
Make selections -
In [225]: x[np.ix_(row_indices,col_indices)]
Out[225]:
array([[76, 56],
[70, 47],
[46, 95],
[76, 56],
[92, 46]])
As suggested by OP, this is in effect the same as plain broadcasting with a 2D version of row_indices: its indices are placed along axis=0, which creates a singleton dimension at axis=1 and so allows broadcasting against col_indices. Thus, we would have an alternative solution like so -
In [227]: x[np.asarray(row_indices)[:,None],col_indices]
Out[227]:
array([[76, 56],
[70, 47],
[46, 95],
[76, 56],
[92, 46]])
As discussed earlier, assignments use the same indexing expression on the left-hand side.
Row, col indexing arrays -
In [36]: row_indices = [1, 4]
In [37]: col_indices = [1, 3]
Make assignments with scalar -
In [38]: x[np.ix_(row_indices,col_indices)] = -1
In [39]: x
Out[39]:
array([[17, 39, 88, 14, 73, 58, 17, 78],
[88, -1, 46, -1, 44, 81, 17, 67],
[31, 70, 47, 90, 52, 15, 24, 22],
[19, 59, 98, 19, 52, 95, 88, 65],
[85, -1, 56, -1, 43, 79, 53, 37],
[74, 46, 95, 27, 81, 97, 93, 69],
[49, 46, 12, 83, 15, 63, 20, 79]])
Make assignments with a 2D block (broadcastable array) -
In [40]: rand_arr = -np.arange(4).reshape(2,2)
In [41]: x[np.ix_(row_indices,col_indices)] = rand_arr
In [42]: x
Out[42]:
array([[17, 39, 88, 14, 73, 58, 17, 78],
[88, 0, 46, -1, 44, 81, 17, 67],
[31, 70, 47, 90, 52, 15, 24, 22],
[19, 59, 98, 19, 52, 95, 88, 65],
[85, -2, 56, -3, 43, 79, 53, 37],
[74, 46, 95, 27, 81, 97, 93, 69],
[49, 46, 12, 83, 15, 63, 20, 79]])
2. Using np.ix_ with masks
Input array -
In [19]: x
Out[19]:
array([[17, 39, 88, 14, 73, 58, 17, 78],
[88, 92, 46, 67, 44, 81, 17, 67],
[31, 70, 47, 90, 52, 15, 24, 22],
[19, 59, 98, 19, 52, 95, 88, 65],
[85, 76, 56, 72, 43, 79, 53, 37],
[74, 46, 95, 27, 81, 97, 93, 69],
[49, 46, 12, 83, 15, 63, 20, 79]])
Input row, col masks -
In [20]: row_mask = np.array([0,1,1,0,0,1,0],dtype=bool)
In [21]: col_mask = np.array([1,0,1,0,1,1,0,0],dtype=bool)
Make selections -
In [22]: x[np.ix_(row_mask,col_mask)]
Out[22]:
array([[88, 46, 44, 81],
[31, 47, 52, 15],
[74, 95, 81, 97]])
Make assignments with scalar -
In [23]: x[np.ix_(row_mask,col_mask)] = -1
In [24]: x
Out[24]:
array([[17, 39, 88, 14, 73, 58, 17, 78],
[-1, 92, -1, 67, -1, -1, 17, 67],
[-1, 70, -1, 90, -1, -1, 24, 22],
[19, 59, 98, 19, 52, 95, 88, 65],
[85, 76, 56, 72, 43, 79, 53, 37],
[-1, 46, -1, 27, -1, -1, 93, 69],
[49, 46, 12, 83, 15, 63, 20, 79]])
Make assignments with a 2D block (broadcastable array) -
In [25]: rand_arr = -np.arange(12).reshape(3,4)
In [26]: x[np.ix_(row_mask,col_mask)] = rand_arr
In [27]: x
Out[27]:
array([[ 17, 39, 88, 14, 73, 58, 17, 78],
[ 0, 92, -1, 67, -2, -3, 17, 67],
[ -4, 70, -5, 90, -6, -7, 24, 22],
[ 19, 59, 98, 19, 52, 95, 88, 65],
[ 85, 76, 56, 72, 43, 79, 53, 37],
[ -8, 46, -9, 27, -10, -11, 93, 69],
[ 49, 46, 12, 83, 15, 63, 20, 79]])
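To tie the pieces above back to the original question, here is a minimal sketch that checks the one-line forms against the two-step indexing the OP started with. The seeded generator and the variable names two_step, with_ix and with_none are only illustrative assumptions, not part of the answer above.

```python
import numpy as np

rng = np.random.default_rng(0)          # seeded only to make the sketch repeatable
x = rng.integers(0, 6, size=(20, 8))    # same shape and value range as in the question

row_indices = [4, 2, 18, 16, 7, 19, 4]
col_indices = [1, 2]

# The original two-step approach: rows first, then columns.
two_step = x[row_indices, :][:, col_indices]

# One-line form with the open mesh built by np.ix_.
with_ix = x[np.ix_(row_indices, col_indices)]

# One-line form with a manually created column vector of row indices.
with_none = x[np.asarray(row_indices)[:, None], col_indices]

print(with_ix.shape)                       # (7, 2)
print(np.array_equal(two_step, with_ix))   # True
print(np.array_equal(with_ix, with_none))  # True
```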

What about:
x[row_indices][:,col_indices]
For example,
x = np.random.randint(0, 6, (5, 5))  # integers 0..5; random_integers is deprecated
## array([[4, 3, 2, 5, 0],
## [0, 3, 1, 4, 2],
## [4, 2, 0, 0, 3],
## [4, 5, 5, 5, 0],
## [1, 1, 5, 0, 2]])
row_indices = [4,2]
col_indices = [1,2]
x[row_indices][:,col_indices]
## array([[1, 5],
## [2, 0]])
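One thing worth keeping in mind with the chained form: it is fine for reading, but it does not work for assignment, because the first list-based index returns a copy and the assignment then lands in that temporary. A small sketch of the difference (the array contents and the names chained and meshed are just made up for illustration):

```python
import numpy as np

x = np.arange(25).reshape(5, 5)
row_indices = [4, 2]
col_indices = [1, 2]

# Chained indexing: x[row_indices] is a copy, so the assignment is lost.
chained = x.copy()
chained[row_indices][:, col_indices] = -1
print(np.array_equal(chained, x))   # True: nothing was changed

# np.ix_ indexes the array once, so the assignment reaches the original data.
meshed = x.copy()
meshed[np.ix_(row_indices, col_indices)] = -1
print((meshed == -1).sum())         # 4
```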

import numpy as np
x = np.random.randint(0, 6, (4, 4))  # integers 0..5; random_integers is deprecated
x
array([[5, 3, 3, 2],
[4, 3, 0, 0],
[1, 4, 5, 3],
[0, 4, 3, 4]])
# This indexes the elements 1,1 and 2,2 and 3,3
indexes = (np.array([1,2,3]),np.array([1,2,3]))
x[indexes]
# returns array([3, 5, 4])
Notice that numpy has very different rules depending on what kind of indexes you use. To index several individual elements, use a tuple of np.ndarray (see the indexing manual); the arrays are then paired up element by element.
So you only need to convert your lists to np.ndarray, with shapes that match or broadcast against each other, and it should work as expected.

I think you are trying to do one of the following (equivalent) operations:
x_does_work = x[row_indices,:][:,col_indices]
x_does_work = x[:,col_indices][row_indices,:]
This will actually create a subset of x with only the selected rows, then select the columns from that, or vice versa in the second case. The first case can be thought of as
x_does_work = (x[row_indices,:])[:,col_indices]

Your first try would work if you write it with np.newaxis (row_indices has to be a NumPy array for this, not a list):
x_new = x[np.asarray(row_indices)[:, np.newaxis], col_indices]

Related

Take multiple slices in numpy/pytorch

I have a big one dimensional array X.shape = (10000,), and a vector of indices y = [0, 7, 9995].
I would like to get a matrix with rows
[
X[0 : 100],
X[7 : 107],
concat(X[9995:], X[:95]),
]
That is, slices of length 100, starting at each index, with wrap-around.
I can do that with a python loop, but I'm wondering if there's a smarter batched way of doing it in pytorch or numpy, since my arrays can be quite large.
Quite simple, actually.
For each element E in y, create a range from E to E + 100
Concatenate all the ranges horizontally
Modulo the resulting array by the length of X
indexes = np.hstack([np.arange(v, v + 100) for v in y]) % X.shape[0]
Output:
>>> indexes
array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10,
11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21,
22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32,
33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43,
44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54,
55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65,
66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76,
77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87,
88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98,
99, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,
17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27,
28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38,
39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49,
50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60,
61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71,
72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82,
83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93,
94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104,
105, 106, 9995, 9996, 9997, 9998, 9999, 0, 1, 2, 3,
4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14,
15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25,
26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36,
37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47,
48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58,
59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69,
70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80,
81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91,
92, 93, 94])
Now just index X with that:
X[indexes]
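Note that X[indexes] is a flat array of length 300, while the question asks for a matrix with one row per start index. A minimal sketch of the extra reshape, assuming stand-in data where each value equals its own position so the wrap-around is easy to see (the names window and rows are not from the answer above):

```python
import numpy as np

X = np.arange(10000)      # stand-in data: each value equals its own position
y = [0, 7, 9995]
window = 100              # slice length from the question

# Build the wrapped indices exactly as above, then index and reshape.
indexes = np.hstack([np.arange(v, v + window) for v in y]) % X.shape[0]
rows = X[indexes].reshape(len(y), window)   # one row per start index

print(rows.shape)      # (3, 100)
print(rows[2, :8])     # 9995 ... 9999 followed by 0, 1, 2
```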
This is a version of user17242583's answer that doesn't use a python loop:
N, BS, S = 10000, 1000, 100
X = np.random.randn(N)
h = np.random.randint(N, size=(BS,))
indexes = (h[..., None] + np.arange(S)) % N
result = X[indexes]
In pytorch I also found another solution using unfold:
wrapped = torch.cat((X, X[:S-1]))
strides = wrapped.unfold(dimension=0, size=S, step=1)
result = strides[h]
But I haven't done experiments to see which one is more efficient yet.
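For what it's worth, the unfold idea can also be sketched in pure NumPy with sliding_window_view, which as far as I know needs NumPy 1.20 or later. This is only an illustration that both routes give the same rows, not a benchmark; the names by_index and by_window are made up here.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

N, BS, S = 10000, 1000, 100
rng = np.random.default_rng(0)        # seeded only for repeatability
X = rng.standard_normal(N)
h = rng.integers(N, size=BS)          # random start positions

# Route 1: broadcast the starts against the offsets and wrap with modulo.
by_index = X[(h[:, None] + np.arange(S)) % N]

# Route 2: pad with the first S-1 values, then view every window of length S.
wrapped = np.concatenate((X, X[:S - 1]))
windows = sliding_window_view(wrapped, S)   # shape (N, S), a read-only view
by_window = windows[h]                      # fancy indexing copies the chosen rows

print(by_window.shape)                      # (1000, 100)
print(np.array_equal(by_index, by_window))  # True
```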

reshape numpy array, do a transformation and do the inverse reshape

Here is my problem :
I’m trying to do an operation on a numpy array after reshaping it.
But after this operation, I want to reshape my array again to get back my original shape with the same indexing.
So I want to find the appropriate "inverse reshape" so that inverse_reshape(reshape(a))==a
length = 10
a = np.arange(length**2).reshape((length,length))
#a.shape = (10,10)
b = (a.reshape((length//2, 2, -1, 2))
.swapaxes(1, 2)
.reshape(-1, 2, 2))
#b.shape = (25,2,2)
b = my_function(b)
#b.shape = (25,2,2) still the same shape
# b --> a ?
I know that the numpy reshape function doesn’t copy the array, but the swapaxes one does.
How can I get the appropriate reshaping?
Simply reverse the order of the a=>b conversion.
The original made:
In [53]: a.reshape((length//2, 2, -1, 2)).shape
Out[53]: (5, 2, 5, 2)
In [54]: a.reshape((length//2, 2, -1, 2)).swapaxes(1,2).shape
Out[54]: (5, 5, 2, 2)
In [55]: b.shape
Out[55]: (25, 2, 2)
So we need to get b back to the 4d shape, swap the axes back, and reshape to original a shape:
In [56]: b.reshape(5,5,2,2).swapaxes(1,2).reshape(10,10)
Out[56]:
array([[ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
[10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
[20, 21, 22, 23, 24, 25, 26, 27, 28, 29],
[30, 31, 32, 33, 34, 35, 36, 37, 38, 39],
[40, 41, 42, 43, 44, 45, 46, 47, 48, 49],
[50, 51, 52, 53, 54, 55, 56, 57, 58, 59],
[60, 61, 62, 63, 64, 65, 66, 67, 68, 69],
[70, 71, 72, 73, 74, 75, 76, 77, 78, 79],
[80, 81, 82, 83, 84, 85, 86, 87, 88, 89],
[90, 91, 92, 93, 94, 95, 96, 97, 98, 99]])
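The same round trip can be wrapped in a pair of helpers so the block size is not hard-coded. This is only a sketch: to_blocks and from_blocks are hypothetical names, and it assumes the array dimensions are exact multiples of the block size.

```python
import numpy as np

def to_blocks(a, bh, bw):
    """Split a 2D array into (bh, bw) tiles, returned as shape (n_tiles, bh, bw)."""
    rows, cols = a.shape
    return (a.reshape(rows // bh, bh, cols // bw, bw)
             .swapaxes(1, 2)
             .reshape(-1, bh, bw))

def from_blocks(b, shape):
    """Undo to_blocks: apply the same steps in reverse order."""
    rows, cols = shape
    bh, bw = b.shape[1:]
    return (b.reshape(rows // bh, cols // bw, bh, bw)
             .swapaxes(1, 2)
             .reshape(rows, cols))

length = 10
a = np.arange(length ** 2).reshape(length, length)
b = to_blocks(a, 2, 2)
restored = from_blocks(b, a.shape)

print(b.shape)                      # (25, 2, 2)
print(np.array_equal(restored, a))  # True
```

As long as my_function keeps the (25, 2, 2) shape and the order of the tiles, from_blocks(my_function(b), a.shape) puts every tile back where it came from.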

How to get initialisation point after sklearn.cluster.KMeans

How can I find the initialisation points that were used for KMeans after running KMeans from sklearn.cluster?
For each of my clusters, I need to return each feature of the initialisation points used (the original input was in a Pandas dataframe).
import numpy as np
from sklearn.cluster import KMeans
from sklearn import datasets
np.random.seed(0)
# Use Iris data
iris = datasets.load_iris()
X = iris.data
y = iris.target
# KMeans with 3 clusters
clf = KMeans(n_clusters=3)
clf.fit(X,y)
#Coordinates of cluster centers with shape [n_clusters, n_features]
clf.cluster_centers_
#Labels of each point
clf.labels_
# Nice Pythonic way to get the indices of the points for each corresponding cluster
mydict = {i: np.where(clf.labels_ == i)[0] for i in range(clf.n_clusters)}
# Transform this dictionary into list (if you need a list as result)
dictlist = []
for key, value in mydict.items():  # .items() replaces Python 2's .iteritems()
    temp = [key,value]
    dictlist.append(temp)
print(dictlist)
[[0, array([ 50, 51, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63,
64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76,
78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90,
91, 92, 93, 94, 95, 96, 97, 98, 99, 101, 106, 113, 114,
119, 121, 123, 126, 127, 133, 138, 142, 146, 149])],
[1, array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,
17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33,
34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49])],
[2, array([ 52, 77, 100, 102, 103, 104, 105, 107, 108, 109, 110, 111, 112,
115, 116, 117, 118, 120, 122, 124, 125, 128, 129, 130, 131, 132,
134, 135, 136, 137, 139, 140, 141, 143, 144, 145, 147, 148])]]
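The code above recovers the members of each cluster, not the starting points. As far as I know, a fitted KMeans object does not expose the initial centres it picked, so one possible workaround is to choose them yourself and pass them in through init. A minimal sketch, assuming a scikit-learn version that provides sklearn.cluster.kmeans_plusplus (0.24 or later, I believe); the names init_centers and init_indices are made up for this example:

```python
import numpy as np
from sklearn import datasets
from sklearn.cluster import KMeans, kmeans_plusplus

X = datasets.load_iris().data

# Pick the starting centres explicitly so that they are known afterwards.
# kmeans_plusplus returns the centres and the row indices they were taken from.
init_centers, init_indices = kmeans_plusplus(X, n_clusters=3, random_state=0)

# Hand those centres to KMeans; n_init=1 because the start is fixed.
clf = KMeans(n_clusters=3, init=init_centers, n_init=1).fit(X)

print(init_indices)            # rows of X used as initial centres
print(init_centers)            # their feature values, shape (3, 4)
print(clf.cluster_centers_)    # final centres after fitting
```

If the data lives in a Pandas dataframe, the same init_indices can be used with .iloc to look up the original rows with all their features.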

How tf.train.shuffle_batch works?

Does it shuffle once per epoch, or does it work some other way?
What is the difference between tf.train.shuffle_batch and tf.train.batch?
Could someone explain it? Thanks.
First take a look at the documentation (https://www.tensorflow.org/api_docs/python/tf/train/shuffle_batch and https://www.tensorflow.org/api_docs/python/tf/train/batch ). Internally, batch is built around a FIFOQueue and shuffle_batch is built around a RandomShuffleQueue.
Consider the following toy example: it puts the numbers 1 to 100 in a constant, feeds that through tf.train.shuffle_batch and tf.train.batch, and then prints the results.
import tensorflow as tf
import numpy as np
data = np.arange(1, 100 + 1)
data_input = tf.constant(data)
batch_shuffle = tf.train.shuffle_batch([data_input], enqueue_many=True, batch_size=10, capacity=100, min_after_dequeue=10, allow_smaller_final_batch=True)
batch_no_shuffle = tf.train.batch([data_input], enqueue_many=True, batch_size=10, capacity=100, allow_smaller_final_batch=True)
with tf.Session() as sess:
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(coord=coord)
    for i in range(10):
        print(i, sess.run([batch_shuffle, batch_no_shuffle]))
    coord.request_stop()
    coord.join(threads)
Which yields:
0 [array([23, 48, 15, 46, 78, 89, 18, 37, 88, 4]), array([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10])]
1 [array([80, 10, 5, 76, 50, 53, 1, 72, 67, 14]), array([11, 12, 13, 14, 15, 16, 17, 18, 19, 20])]
2 [array([11, 85, 56, 21, 86, 12, 9, 7, 24, 1]), array([21, 22, 23, 24, 25, 26, 27, 28, 29, 30])]
3 [array([ 8, 79, 90, 81, 71, 2, 20, 63, 73, 26]), array([31, 32, 33, 34, 35, 36, 37, 38, 39, 40])]
4 [array([84, 82, 33, 6, 39, 6, 25, 19, 19, 34]), array([41, 42, 43, 44, 45, 46, 47, 48, 49, 50])]
5 [array([27, 41, 21, 37, 60, 16, 12, 16, 24, 57]), array([51, 52, 53, 54, 55, 56, 57, 58, 59, 60])]
6 [array([69, 40, 52, 55, 29, 15, 45, 4, 7, 42]), array([61, 62, 63, 64, 65, 66, 67, 68, 69, 70])]
7 [array([61, 30, 53, 95, 22, 33, 10, 34, 41, 13]), array([71, 72, 73, 74, 75, 76, 77, 78, 79, 80])]
8 [array([45, 52, 57, 35, 70, 51, 8, 94, 68, 47]), array([81, 82, 83, 84, 85, 86, 87, 88, 89, 90])]
9 [array([35, 28, 83, 65, 80, 84, 71, 72, 26, 77]), array([ 91, 92, 93, 94, 95, 96, 97, 98, 99, 100])]
tf.train.shuffle_batch() shuffles every epoch.
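To get a feel for what a queue-based shuffle does, here is a deliberately simplified, single-threaded model in plain Python. It is not TensorFlow's implementation: shuffled_batches and buffer_size are hypothetical names, with buffer_size loosely playing the role of the queue's capacity, and the sketch only covers a single pass over the data.

```python
import random

def shuffled_batches(items, batch_size, buffer_size, seed=None):
    """Yield batches drawn at random from a bounded buffer of pending items."""
    rng = random.Random(seed)
    buffer, batch = [], []
    for item in items:
        buffer.append(item)
        if len(buffer) < buffer_size:
            continue                      # keep filling until the buffer is full
        # Dequeue one element chosen uniformly at random from the buffer.
        pick = rng.randrange(len(buffer))
        buffer[pick], buffer[-1] = buffer[-1], buffer[pick]
        batch.append(buffer.pop())
        if len(batch) == batch_size:
            yield batch
            batch = []
    # The input is exhausted: hand out whatever is left, in random order.
    rng.shuffle(buffer)
    for item in buffer:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch                       # smaller final batch

batches = list(shuffled_batches(range(1, 101), batch_size=10, buffer_size=20, seed=0))
for i, b in enumerate(batches):
    print(i, b)
```

With a buffer of 20 the first batch can only contain values up to 29, so the shuffle is local; the larger the buffer relative to the data, the closer the result gets to a full shuffle.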

Shuffle groups of rows of a 2D array - NumPy

Suppose I have a (50, 5) array. Is there a way for me to shuffle it on the basis of groupings of rows/sequences of datapoints, i.e. instead of shuffling every row, shuffle chunks of say, 5 rows?
Thanks
Approach #1 : Here's an approach that reshapes the array into 3D based on the group size, indexes the block axis with a shuffled order obtained from np.random.permutation, and finally reshapes back to 2D -
N = 5 # Blocks of N rows
M,n = a.shape[0]//N, a.shape[1]
out = a.reshape(M,-1,n)[np.random.permutation(M)].reshape(-1,n)
Sample run -
In [141]: a
Out[141]:
array([[89, 26, 12],
[97, 60, 96],
[94, 38, 54],
[41, 63, 29],
[88, 62, 48],
[95, 66, 32],
[28, 58, 80],
[26, 35, 89],
[72, 91, 38],
[26, 70, 93]])
In [142]: N = 2 # Blocks of N rows
In [143]: M,n = a.shape[0]//N, a.shape[1]
In [144]: a.reshape(M,-1,n)[np.random.permutation(M)].reshape(-1,n)
Out[144]:
array([[94, 38, 54],
[41, 63, 29],
[28, 58, 80],
[26, 35, 89],
[89, 26, 12],
[97, 60, 96],
[72, 91, 38],
[26, 70, 93],
[88, 62, 48],
[95, 66, 32]])
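Approach #1 can be wrapped into a small helper. This is just a sketch: shuffle_row_blocks is a hypothetical name, it uses the newer Generator API instead of np.random.permutation, and it raises an error when the row count is not a multiple of the block size, which is a choice made here rather than something taken from the code above.

```python
import numpy as np

def shuffle_row_blocks(a, block_rows, rng=None):
    """Return a copy of 2D array a with consecutive blocks of rows shuffled."""
    rng = np.random.default_rng() if rng is None else rng
    n_blocks, remainder = divmod(a.shape[0], block_rows)
    if remainder:
        raise ValueError("number of rows must be a multiple of block_rows")
    order = rng.permutation(n_blocks)          # new order of the blocks
    return a.reshape(n_blocks, block_rows, -1)[order].reshape(a.shape)

a = np.arange(30).reshape(10, 3)               # row i holds 3*i, 3*i+1, 3*i+2
out = shuffle_row_blocks(a, 2, rng=np.random.default_rng(0))
print(out)
```

Rows inside each block keep their relative order; only the blocks themselves move.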
Approach #2 : One can also simply use np.random.shuffle for an in-situ change -
np.random.shuffle(a.reshape(M,-1,n))
Sample run -
In [156]: a
Out[156]:
array([[15, 12, 14],
[55, 39, 35],
[73, 78, 36],
[54, 52, 32],
[83, 34, 91],
[42, 11, 98],
[27, 65, 47],
[78, 75, 82],
[33, 52, 93],
[87, 51, 80]])
In [157]: N = 2 # Blocks of N rows
In [158]: M,n = a.shape[0]//N, a.shape[1]
In [159]: np.random.shuffle(a.reshape(M,-1,n))
In [160]: a
Out[160]:
array([[15, 12, 14],
[55, 39, 35],
[27, 65, 47],
[78, 75, 82],
[73, 78, 36],
[54, 52, 32],
[33, 52, 93],
[87, 51, 80],
[83, 34, 91],
[42, 11, 98]])
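Approach #2 changes a itself because the reshape returns a view that shares memory with a, so shuffling the view along its first axis moves the blocks inside the original array. A quick sketch of that, using np.shares_memory and the Generator API (the small test array is an assumption made for illustration):

```python
import numpy as np

a = np.arange(30).reshape(10, 3)       # row i holds 3*i, 3*i+1, 3*i+2
blocks = a.reshape(5, 2, 3)            # 5 blocks of 2 rows each

# The blocks are a view on a, which is why the in-place shuffle is visible in a.
print(np.shares_memory(a, blocks))     # True

rng = np.random.default_rng(0)
rng.shuffle(blocks)                    # shuffles along axis 0, in place
print(a)                               # blocks of two rows, in a new order
```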