Boolean indexing with 2D arrays - numpy

I have two arrays, a and b, one 2D and one 1D, containing values of two related quantities that are filled in the same order, such that a[0] is related to b[0] and so on.
I would like to access the element of b where a is equal to a given value, where the value is a 1D array itself.
For example
a=np.array([[0,0],[0,1],[1,0],[1,1]])
b=np.array([0, 7, 9, 4])
value = np.array([0,1])
In 1D cases I could use boolean indexing easily and do
b[a==value]
The result I want is 7.
But in this case it does not work, because the comparison checks each element of a individually instead of comparing whole subarrays...
Is there a quick way to do this?

The question doesn't seem to match the example, but this returns [7]:
b[(a == value).all(axis=-1)]
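To see why this works, here is a small step-by-step sketch (the intermediate names are only for illustration):
import numpy as np
a = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
b = np.array([0, 7, 9, 4])
value = np.array([0, 1])
rowwise = a == value            # shape (4, 2): value is broadcast against each row of a
mask = rowwise.all(axis=-1)     # shape (4,): True only where the whole row matches
print(mask)                     # [False  True False False]
print(b[mask])                  # [7]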

Selecting two sets of columns from a DataFrame with all rows

I have a DataFrame with 28 columns (features) and 600 rows (instances). I want to select all rows, but only columns 0-12 and 16-27, meaning that I don't want to select columns 13-15.
I wrote the following code, but it doesn't work and throws a syntax error at : in 0:12 and 16:. Can someone help me understand why?
X = df.iloc[:,[0:12,16:]]
I know there are other ways for selecting these rows, but I am curious to learn why this one does not work, and how I should write it to work (if there is a way).
For now, I have written it as:
X = df.iloc[:,0:12]
X = X + df.iloc[:,16:]
Which seems to return an incorrect result, because I have already treated the NaN values of df, but when I use this code, X includes lots of NaNs!
Thanks for your feedback in advance.
You can use np.r_ to concatenate the slices (np.r_ needs an explicit stop for each slice, so spell out the final endpoint):
x = df.iloc[:, np.r_[0:12, 16:28]]
iloc has these allowed inputs (from the docs):
An integer, e.g. 5.
A list or array of integers, e.g. [4, 3, 0].
A slice object with ints, e.g. 1:7.
A boolean array.
A callable function with one argument (the calling Series or DataFrame) and that returns valid output for indexing (one of the above). This is useful in method chains, when you don’t have a reference to the calling object, but would like to base your selection on some value.
What you're passing to iloc in X = df.iloc[:,[0:12,16:]] is not a list of integers or a slice of ints, but a list of slice objects. You need to convert those slices to a list of integers, and the best way to do that is using the numpy.r_ function.
X = df.iloc[:, np.r_[0:13, 16:28]]
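For illustration, np.r_ simply expands the slices into one integer array that iloc accepts. A minimal sketch, assuming a 600 x 28 frame like the one in the question:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.rand(600, 28))   # hypothetical stand-in for the question's df
cols = np.r_[0:13, 16:28]                    # integer positions 0..12 and 16..27 in one array
X = df.iloc[:, cols]                         # all rows, only those columns
print(X.shape)                               # (600, 25)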

NumPy indexing ambiguity in 3D arrays [duplicate]

I have the following minimal example:
a = np.zeros((5,5,5))
a[1,1,:] = [1,1,1,1,1]
print(a[1,:,range(4)])
I would expect as output an array with 5 rows and 4 columns, where we have ones on the second row. Instead it is an array with 4 rows and 5 columns with ones on the second column. What is happening here, and what can I do to get the output I expected?
This is an example of mixed basic and advanced indexing, as discussed in https://docs.scipy.org/doc/numpy/reference/arrays.indexing.html#combining-advanced-and-basic-indexing
The slice dimension has been appended to the end.
With a single scalar index this is a borderline case of the ambiguity described there. It has been discussed in previous SO questions and in one or more bug/issue reports, for example:
Numpy sub-array assignment with advanced, mixed indexing
In this case you can replace the range with a slice, and get the expected order:
In [215]: a[1,:,range(4)].shape
Out[215]: (4, 5) # slice dimension last
In [216]: a[1,:,:4].shape
Out[216]: (5, 4)
In [219]: a[1][:,[0,1,3]].shape
Out[219]: (5, 3)
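A minimal way to see the difference is to compare the one-shot mixed index with a two-step version (same array as in the question; the two-step form keeps the axes in the order that was expected):
import numpy as np
a = np.zeros((5, 5, 5))
a[1, 1, :] = [1, 1, 1, 1, 1]
mixed = a[1, :, range(4)]     # advanced-index dimension moves to the front: shape (4, 5)
twostep = a[1][:, range(4)]   # index the first axis separately: shape (5, 4)
print(mixed.shape, twostep.shape)         # (4, 5) (5, 4)
print(np.array_equal(mixed.T, twostep))   # True -- same values, transposed layout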

How to find if any column in an array has duplicate values

Let's say I have a numpy matrix A
A = np.array([[0.5, 0.5, 3.7],
              [3.8, 2.7, 3.7],
              [3.3, 1.0, 0.2]])
I would like to know whether there are at least two rows i and i' such that A[i, j] == A[i', j] for some column j.
In the example A, i=0 and i'=1 for j=2 and the answer is yes.
How can I do this?
I tried this:
def test(A, n):
    for j in range(n):
        i = 0
        while i < n:
            a = A[i, j]
            for s in range(i+1, n):
                if A[s, j] == a:
                    return True
            i += 1
    return False
Is there a faster/better way?
There are a number of ways of checking for duplicates. The idea is to use as few loops in the Python code as possible to do this. I will present a couple of ways here:
Use np.unique. You would still have to loop over the columns since it wouldn't make sense for unique to accept an axis argument because each column could have a different number of unique elements. While it still requires a loop, unique allows you to find the positions and other stats of repeated elements:
def test(A):
    for i in range(A.shape[1]):
        if np.unique(A[:, i]).size < A.shape[0]:
            return True
    return False
With this method, you basically check if the number of unique elements in a column is equal to the size of the column. If not, there are duplicates.
Use np.sort, np.diff and np.any. This is a fully vectorized solution that does not require any loops because you can specify an axis for each of these functions:
def test(A):
    return np.any(np.diff(np.sort(A, axis=0), axis=0) == 0)
This literally reads "if any of the column-wise differences in the column-wise sorted array are zero, return True". A zero difference in the sorted array means that there are identical elements. axis=0 makes sort and diff operate on each column individually.
You never need to pass in n since the size of the matrix is encoded in the attribute shape. If you need to look at the subset of a matrix, just pass in the subset using indexing. It will not copy the data, just return a view object with the required dimensions.
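As a quick sanity check, either version of test above reports True for the example matrix (column 2 contains 3.7 twice):
A = np.array([[0.5, 0.5, 3.7],
              [3.8, 2.7, 3.7],
              [3.3, 1.0, 0.2]])
print(test(A))   # True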
A solution without numpy would look like this: First, swap columns and rows with zip()
zipped = zip(*A)
then check if any new row has duplicates. You can check for duplicates by turning a list into a set, which discards duplicates, and checking the length.
has_duplicates = any(len(set(row)) != len(row) for row in zip(*A))
This is most probably slower and more memory-intensive than the pure numpy solutions, but it may help with clarity.

initialize pandas SparseArray

Is it possible to initialize a pandas SparseArray by providing only the dense entries? I could not figure this out from the documentation: http://pandas.pydata.org/pandas-docs/stable/sparse.html .
For example, say I want a length 1000 SparseArray with a one at index 9 and zeros everywhere else, how would I go about creating it? This is one way:
a = [0] * 1000
a[9] = 1
sparse_a = pd.SparseArray(data=a, fill_value=0)
But, in the above, we have to create the dense array before the sparse one. Is there a way to specify only the indices and the dense entries to create the SparseArray directly?
A length 10 SparseArray with a one at index 9 and zeros everywhere else:
pd.SparseArray(1, index=range(1), kind='block',
               sparse_index=BlockIndex(10, [9], [1]),
               fill_value=0)
Notes:
index can be any list as long as its length equals the length of the non-sparse part of the array (the smaller part of the data), in this case the number of 1s in the sparse array.
BlockIndex(10, [9], [1]) is the object describing the positions of the non-sparse part of the data: the first argument is the TOTAL length of the array (sparse + non-sparse), the second argument is a list of starting positions of the non-sparse blocks, and the third argument is a list of the lengths of those blocks. Note that the length of the index list mentioned in the first note is the sum of all elements of the third argument of this BlockIndex.
A more general example: to make a length 20 SparseArray where the 2nd, 3rd, 6th, 7th and 8th elements are 1 and the rest is 0:
pd.SparseArray(1, index=range(5), kind='block',
               sparse_index=BlockIndex(20, [1, 5], [2, 3]),
               fill_value=0)
or
pd.SparseArray(1, index=[None, 3, 2, 7, np.inf], kind='block',
               sparse_index=BlockIndex(20, [1, 5], [2, 3]),
               fill_value=0)
Sadly, I don't know a good way to specify an array of non-sparse data as the first argument for SparseArray; that does not mean it can't be done, this is only a disclaimer. I think as long as you specify index=..., pandas will require a scalar for the first argument (the data).
Tested on Windows 7, pandas version 0.20.2 installed via Anaconda.
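As a rough consistency check of the length-10 example (a sketch against the pandas 0.20-era API used above; the BlockIndex import path is an assumption and may differ between versions):
import numpy as np
import pandas as pd
from pandas._libs.sparse import BlockIndex   # assumed location of BlockIndex
sp = pd.SparseArray(1, index=range(1), kind='block',
                    sparse_index=BlockIndex(10, [9], [1]),
                    fill_value=0)
print(sp.to_dense())   # expected: zeros everywhere except a 1.0 at index 9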

Numpy index array of unknown dimensions?

I need to compare a bunch of numpy arrays with different dimensions, say:
a = np.array([1,2,3])
b = np.array([1,2,3],[4,5,6])
assert(a == b[0])
How can I do this if I do not know the shape of either a or b, other than that
len(shape(a)) == len(shape(b)) - 1
and neither do I know which dimension to skip from b. I'd like to use np.index_exp, but that does not seem to help me ...
def compare_arrays(a, b, skip_row):
    u = np.index_exp[ ... ]
    assert(a[:] == b[u])
Edit
Or to put it another way, I want to construct the slicing if I know the shape of the array and the dimension I want to skip. How do I dynamically create the np.index_exp if I know the number of dimensions and the positions where to put ":" and where to put "0"?
I was just looking at the code for apply_along_axis and apply_over_axes, studying how they construct indexing objects.
Let's make a 4d array:
In [355]: b=np.ones((2,3,4,3),int)
Make a list of slices (using list replication with *):
In [356]: ind=[slice(None)]*b.ndim
In [357]: b[ind].shape # same as b[:,:,:,:]
Out[357]: (2, 3, 4, 3)
In [358]: ind[2]=2 # replace one slice with index
In [359]: b[ind].shape # a slice, indexing on the third dim
Out[359]: (2, 3, 3)
Or with your example
In [361]: b = np.array([1,2,3],[4,5,6]) # missing []
...
TypeError: data type not understood
In [362]: b = np.array([[1,2,3],[4,5,6]])
In [366]: ind=[slice(None)]*b.ndim
In [367]: ind[0]=0
In [368]: a==b[ind]
Out[368]: array([ True, True, True], dtype=bool)
This indexing is basically the same as np.take, but the same idea can be extended to other cases.
I don't quite follow your questions about the use of :. Note that when building an indexing list I use slice(None). The interpreter translates all indexing : into slice objects: [start:stop:step] => slice(start, stop, step).
Usually you don't need to use a[:]==b[0]; a==b[0] is sufficient. With lists alist[:] makes a copy; with arrays it does nothing (unless used on the left-hand side of an assignment, a[:] = ...).
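As a rough generalization of that pattern, a hypothetical helper (not from the original answer; note that current NumPy requires the index to be a tuple rather than a list):
import numpy as np

def take_at(arr, dim, i):
    # select position i along dimension dim, full slices everywhere else
    ind = [slice(None)] * arr.ndim
    ind[dim] = i
    return arr[tuple(ind)]   # tuple() avoids the deprecated list-as-index form

a = np.array([1, 2, 3])
b = np.array([[1, 2, 3], [4, 5, 6]])
print(np.array_equal(a, take_at(b, 0, 0)))   # True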