numpy unique over multiple arrays

Numpy.unique expects a 1-D array. If the input is not a 1-D array, it flattens it by default.
Is there a way for it to accept multiple arrays? To keep it simple, let's just say a pair of arrays, and we are unique-ing the pair of elements across the 2 arrays.
For example, say I have 2 numpy arrays as inputs:
a = [1, 2, 3, 3]
b = [10, 20, 30, 31]
I'm unique-ing against both of these arrays, so against these 4 pairs: (1, 10), (2, 20), (3, 30), and (3, 31). These 4 are all unique, so I want my result to say
[True, True, True, True]
If instead the inputs are as follows
a = [1, 2, 3, 3]
b = [10, 20, 30, 30]
Then the last 2 elements are not unique. So the output should be
[True, True, True, False]

You could use the unique_indices value returned by numpy.unique():
In [243]: def is_unique(*lsts):
     ...:     arr = np.vstack(lsts)
     ...:     _, ind = np.unique(arr, axis=1, return_index=True)
     ...:     out = np.zeros(shape=arr.shape[1], dtype=bool)
     ...:     out[ind] = True
     ...:     return out
In [244]: a = [1, 2, 2, 3, 3]
In [245]: b = [1, 2, 2, 3, 3]
In [246]: c = [1, 2, 0, 3, 3]
In [247]: is_unique(a, b)
Out[247]: array([ True, True, False, True, False])
In [248]: is_unique(a, b, c)
Out[248]: array([ True, True, True, True, False])
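Equivalently, you can stack the pairs as rows and take the unique rows along axis=0; a minimal sketch of the same idea (just a variant of the function above):

import numpy as np

a = [1, 2, 3, 3]
b = [10, 20, 30, 30]

pairs = np.stack([a, b], axis=1)                        # shape (4, 2): one (a, b) pair per row
_, first = np.unique(pairs, axis=0, return_index=True)  # index of each pair's first occurrence
mask = np.zeros(len(pairs), dtype=bool)
mask[first] = True
print(mask)   # [ True  True  True False]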

Related

Can I create a view from a boolean selection of a numpy array?

If I create a numpy array, and another to serve as a selective index into it:
>>> x
array([[ 2,  3,  4],
       [ 5,  6,  7],
       [ 6,  7,  8],
       [11, 12, 13]])
>>> nz
array([ True,  True, False,  True], dtype=bool)
then indexing with nz directly lets me read and update the original array in place:
>>> x[nz,:]
array([[ 2,  3,  4],
       [ 5,  6,  7],
       [11, 12, 13]])
>>> x[nz,:] += 2
>>> x
array([[ 4,  5,  6],
       [ 7,  8,  9],
       [ 6,  7,  8],
       [13, 14, 15]])
however, naturally, an assignment makes a copy:
>>> v = x[nz,:]
Any operation on v is on the copy, and has no effect on the original array.
Is there any way to create a named view, from x[nz,:], simply to abbreviate code, or which I can pass around, so operations on the named view will affect only the selected elements of x?
Numpy has masked_array, which might be what you are looking for:
import numpy as np
x = np.asarray([[ 2, 3, 4],[ 5, 6, 7],[ 6, 7, 8],[11, 12, 13]])
nz = np.asarray([ True, True, False, True], dtype=bool)
mx = np.ma.masked_array(x, ~nz.repeat(3)) # True means masked, so "~" is needed
mx += 2
# x changed as well because it is the base of mx
print(x)
print(x is mx.base)
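If a masked array is more machinery than you need, another common workaround (a sketch, not a true named view) is to keep the selection itself around and route every in-place update through it:

import numpy as np

x = np.asarray([[2, 3, 4], [5, 6, 7], [6, 7, 8], [11, 12, 13]])
nz = np.asarray([True, True, False, True])

idx = np.flatnonzero(nz)   # store the selection as row indices
x[idx] += 2                # in-place update of only the selected rows
print(x)

Any later update still has to go through x[idx] (or x[nz]) explicitly; binding v = x[idx] would again give a copy.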

Use a ufunc analogous to numpy.where

For example, if I want to add conditionally, I can use:
y = numpy.where(condition, a+b, b)
Is there a way to directly combine a ufunc and where? Something like:
y = numpy.add.where(condition, a, b)
Something along that line is add.at.
In [21]: b = np.arange(10)
In [22]: cond = b%3==0
Your where:
In [24]: np.where(cond, 10+b, b)
Out[24]: array([10, 1, 2, 13, 4, 5, 16, 7, 8, 19])
Use the other where (or np.nonzero) to turn the boolean mask into an index tuple:
In [25]: cond
Out[25]: array([ True, False, False, True, False, False, True, False, False, True], dtype=bool)
In [26]: idx = np.where(cond)
In [27]: idx
Out[27]: (array([0, 3, 6, 9], dtype=int32),)
add.at does in-place, unbuffered addition:
In [28]: np.add.at(b,idx[0],10)
In [29]: b
Out[29]: array([10, 1, 2, 13, 4, 5, 16, 7, 8, 19])
add.at is intended as a way of getting around buffering problems with the more direct index +=:
In [30]: b = np.arange(10)
In [31]: b[idx[0]] += 10
In [32]: b
Out[32]: array([10, 1, 2, 13, 4, 5, 16, 7, 8, 19])
Here the action is the same (add.at is slower). But if there were duplicates in idx, the results would be different.
+= also works with the boolean mask:
In [33]: b[cond] -= 10
In [34]: b
Out[34]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
There's got to be a ufunc equivalent to the += operator, but I don't use ufuncs enough to know it off hand.
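For the record, NumPy ufuncs do take out= and where= keyword arguments, which is about as close as it gets to the hypothetical numpy.add.where; a small sketch (assuming a reasonably recent NumPy):

import numpy as np

b = np.arange(10)
cond = b % 3 == 0

# Add 10 in place, but only where cond is True; where it is False,
# the corresponding entries of out (here b itself) are left untouched.
np.add(b, 10, out=b, where=cond)
print(b)   # [10  1  2 13  4  5 16  7  8 19]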

Tensorflow indicator matrix for top n values

Does anyone know how to extract the top n largest values per row of a rank 2 tensor?
For instance, if I wanted the top 2 values of a tensor of shape [2,4] with values:
[[40, 30, 20, 10], [10, 20, 30, 40]]
The desired condition matrix would look like:
[[True, True, False, False],[False, False, True, True]]
Once I have the condition matrix, I can use tf.select to choose actual values.
Thank you for assistance!
You can do it using the built-in tf.nn.top_k function:
a = tf.convert_to_tensor([[40, 30, 20, 10], [10, 20, 30, 40]])
b = tf.nn.top_k(a, 2)
print(sess.run(b))
TopKV2(values=array([[40, 30],
[40, 30]], dtype=int32), indices=array([[0, 1],
[3, 2]], dtype=int32))
print(sess.run(b).values)
array([[40, 30],
[40, 30]], dtype=int32)
To get boolean True/False values, you can take the k-th largest value in each row and then use tf.greater_equal:
kth = tf.reduce_min(b.values, axis=1, keepdims=True)  # per-row k-th largest value
top2 = tf.greater_equal(a, kth)
print(sess.run(top2))
array([[ True, True, False, False],
[False, False, True, True]], dtype=bool)
You can also use tf.contrib.framework.argsort:
a = [[40, 30, 20, 10], [10, 20, 30, 40]]
idx = tf.contrib.framework.argsort(a, direction='DESCENDING') # sorted indices
ranks = tf.contrib.framework.argsort(idx, direction='ASCENDING') # ranks
b = ranks < 2
# [[ True True False False] [False False True True]]
Moreover, you can replace 2 with a 1d tensor so that each row/column can have different n values.
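For newer TensorFlow versions (tf.contrib was removed in TF 2.x), roughly the same indicator matrix can be built eagerly with tf.math.top_k; this is a sketch assuming TF 2.x:

import tensorflow as tf

a = tf.constant([[40, 30, 20, 10], [10, 20, 30, 40]])
k = 2

values, _ = tf.math.top_k(a, k)                     # per-row top-k values, sorted descending
kth = tf.reduce_min(values, axis=1, keepdims=True)  # k-th largest value in each row
mask = a >= kth                                     # True for the top-k entries (ties included)

print(mask.numpy())
# [[ True  True False False]
#  [False False  True  True]]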

Finding minimal subset of columns that make rows in a matrix unique

What is a generic, efficient algorithm to find the minimal subset of columns in a discrete-valued matrix that makes the rows unique?
For example, consider this matrix (with named columns):
a b c d
2 1 0 0
2 0 0 0
2 1 2 2
1 2 2 2
2 1 1 0
Each row in the matrix is unique. However, if we remove columns a and d we maintain that same property.
I could enumerate all possible combinations of the columns; however, that quickly becomes intractable as my matrix grows. Is there a faster, optimal algorithm for doing this?
Actually, my original formulation wasn't very good; this is better formulated as a set-cover problem:
import pulp

# Input data
A = [
    [2, 1, 0, 0],
    [2, 0, 0, 0],
    [2, 1, 2, 2],
    [1, 2, 2, 2],
    [2, 1, 1, 0]
]

# Preprocess the data a bit.
# B[i][k][j] = 1 if A[i][j] != A[k][j], 0 otherwise
B = []
for i in range(len(A)):
    Bi = []
    for k in range(len(A)):
        Bik = [int(A[i][j] != A[k][j]) for j in range(len(A[i]))]
        Bi.append(Bik)
    B.append(Bi)

model = pulp.LpProblem('Tim', pulp.LpMinimize)

# Variables turn columns on and off.
x = [pulp.LpVariable('x_%d' % j, cat=pulp.LpBinary) for j in range(len(A[0]))]

# Each pair of rows must differ in at least one selected column.
for i in range(len(A)):
    for k in range(i + 1, len(A)):
        model += sum(B[i][k][j] * x[j] for j in range(len(A[i]))) >= 1

model.setObjective(pulp.lpSum(x))

assert model.solve() == pulp.LpStatusOptimal
print([xi.value() for xi in x])
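The solved x values indicate which columns to keep (1 = keep). For this matrix the only optimal choice is columns b and c, so the printed list should come out as [0.0, 1.0, 1.0, 0.0] (exact formatting depends on the solver).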
An observation: if M has unique rows after removing both columns i and j, then it also has unique rows after removing only column i or only column j (in other words, adding a column back to a matrix with unique rows cannot make the rows non-unique). Therefore, you should be able to find the minimum (not just a minimal) solution by using a depth-first search.
def has_unique_rows(M):
    return len(set([tuple(i) for i in M])) == len(M)

def remove_cols(M, cols):
    ret = []
    for row in M:
        new_row = []
        for i in range(len(row)):
            if i in cols:
                continue
            new_row.append(row[i])
        ret.append(new_row)
    return ret

def minimum_unique_rows(M):
    if not has_unique_rows(M):
        raise ValueError("M must have unique rows")
    cols = list(range(len(M[0])))

    def _cols_to_remove(M, removed_cols=(), max_removed_cols=()):
        for i in set(cols) - set(removed_cols):
            new_removed_cols = removed_cols + (i,)
            new_M = remove_cols(M, new_removed_cols)
            if not has_unique_rows(new_M):
                continue
            if len(new_removed_cols) > len(max_removed_cols):
                max_removed_cols = new_removed_cols
            return _cols_to_remove(M, new_removed_cols, max_removed_cols)
        return max_removed_cols

    removed_cols = _cols_to_remove(M)
    return remove_cols(M, removed_cols), removed_cols
(note that my variable naming is terrible)
Here it is on your matrix:
In [172]: rows = [
   .....:     [2, 1, 0, 0],
   .....:     [2, 0, 0, 0],
   .....:     [2, 1, 2, 2],
   .....:     [1, 2, 2, 2],
   .....:     [2, 1, 1, 0]
   .....: ]
In [173]: minimum_unique_rows(rows)
Out[173]: ([[1, 0], [0, 0], [1, 2], [2, 2], [1, 1]], (0, 3))
I generated a random matrix (using sympy.randMatrix), which is shown below:
⎡0 1 0 1 0 1 1⎤
⎢0 1 1 2 0 0 2⎥
⎢1 0 1 1 1 0 0⎥
⎢1 2 2 1 1 2 2⎥
⎢2 0 0 0 0 1 1⎥
⎢2 0 2 2 1 1 0⎥
⎢2 1 2 1 1 0 1⎥
⎢2 2 1 2 1 0 1⎥
⎣2 2 2 1 1 2 1⎦
(note that sorting the rows of M helps a lot in checking these things by hand)
In [224]: M1 = [[0, 1, 0, 1, 0, 1, 1], [0, 1, 1, 2, 0, 0, 2], [1, 0, 1, 1, 1, 0, 0], [1, 2, 2, 1, 1, 2, 2], [2, 0, 0, 0, 0, 1, 1], [2, 0, 2, 2, 1, 1, 0], [2, 1, 2, 1, 1, 0, 1], [2, 2, 1, 2, 1, 0, 1], [2, 2, 2, 1, 1, 2, 1]]
In [225]: minimum_unique_rows(M1)
Out[225]: ([[1, 1, 1], [2, 0, 2], [1, 0, 0], [1, 2, 2], [0, 1, 1], [2, 1, 0], [1, 0, 1], [2, 0, 1], [1, 2, 1]], (0, 1, 2, 4))
Here's a brute-force check that it's the minimum answer (actually there are quite a few minimums).
In [229]: from itertools import combinations
In [230]: print([has_unique_rows(remove_cols(M1, r)) for r in combinations(range(7), 6)])
[False, False, False, False, False, False, False]
In [231]: print([has_unique_rows(remove_cols(M1, r)) for r in combinations(range(7), 5)])
[False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False]
In [232]: print([has_unique_rows(remove_cols(M1, r)) for r in combinations(range(7), 4)])
[False, True, False, False, False, False, False, False, False, False, True, False, False, False, False, False, True, False, False, False, False, False, False, False, True, False, False, True, False, False, False, False, False, True, True]
Here is my greedy solution. (Yes, that fails your "optimal" criterion.) Randomly pick a column that can be safely thrown away and throw it away. Keep going until there are no more such columns. I'm sure is_valid could be optimized.
rows = [
    [2, 1, 0, 0],
    [2, 0, 0, 0],
    [2, 1, 2, 2],
    [1, 2, 2, 2],
    [2, 1, 1, 0]
]
col_names = [0, 1, 2, 3]

def is_valid(rows, col_names):
    # it's valid if every row has a distinct "signature"
    signatures = { tuple(row[col] for col in col_names) for row in rows }
    return len(signatures) == len(rows)

import random

def minimal_distinct_columns(rows, col_names):
    col_names = col_names[:]
    random.shuffle(col_names)
    for i, col in enumerate(col_names):
        fewer_col_names = col_names[:i] + col_names[(i+1):]
        if is_valid(rows, fewer_col_names):
            return minimal_distinct_columns(rows, fewer_col_names)
    return col_names
Since it's greedy, it doesn't always get the best answer, but it should be relatively speedy (and simple).
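For instance (hedged: the shuffle makes the traversal order random, though on this particular matrix the greedy search always lands on the same pair of columns):

print(minimal_distinct_columns(rows, col_names))
# e.g. [2, 1] -- column indices for c and b, the same {b, c} answer as above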
Although I'm sure there are better approaches, this fondly reminded me of some Genetic Algorithms stuff I did in the 90s. I wrote up a quick version using R's GA package.
library(GA)

matrix_to_minimize <- matrix(c(2, 2, 1, 1, 2,
                               1, 0, 1, 2, 1,
                               0, 0, 2, 2, 1,
                               0, 0, 2, 2, 0), ncol = 4)

evaluate <- function(indices) {
  if (all(indices == 0)) {
    return(0)
  }
  selected_cols <- matrix_to_minimize[, as.logical(indices), drop = FALSE]
  are_unique <- nrow(selected_cols) == nrow(unique(selected_cols))
  if (are_unique == FALSE) {
    return(0)
  }
  retval <- 1 / sum(as.logical(indices))
  return(retval)
}

ga_results <- ga("binary", evaluate,
                 nBits = ncol(matrix_to_minimize),
                 popSize = 10 * ncol(matrix_to_minimize), # why not
                 maxiter = 1000,
                 run = 10) # probably want to play with this

print("Best Solution: ")
print(ga_results@solution)
I don't know that it's good or optimal, but I bet it will provide a reasonably good answer in a reasonable amount of time? :)

Fastest way to compare neighboring elements in multi-dimensional numpy array

What is the fastest way to compare neighboring elements in a 3-dimensional array?
Assume I have a numpy array of shape (4, 4, 4). I want to loop in the k-direction and compare elements in pairs. So, compare all neighboring elements and assign a value to the element at the lower index if they are not equal. Essentially this:
if array[(0, 0, 0)] != array[(0, 0, 1)]:
    array[(0, 0, 0)] = 111
Thus, the comparisons would be:
(0, 0, 0) and (0, 0, 1)
(0, 0, 1) and (0, 0, 2)
(0, 0, 2) and (0, 0, 3)
... for all i and j ...
However, I want to do this for every i and j in the array and writing a standard Python for loop for this on huge arrays with millions of cells is incredibly slow. Is there a more 'standard' numpy way to do this without the explicit for loop?
Maybe there's some trick using the slicing step (i.e array[:,:,::2], array[:,:,1::2])?
Try np.diff.
import numpy as np
a = np.arange(9).reshape(3, 3)
A = np.array([a, a, a + 1]).T
same_with_neighbor_on_last_axis = np.diff(A, axis=-1) == 0
print A
print same_with_neighbor_on_last_axis
A is constructed to have 2 consecutive equal entries along the third axis,
>>> print A
array([[[0, 0, 1],
        [3, 3, 4],
        [6, 6, 7]],

       [[1, 1, 2],
        [4, 4, 5],
        [7, 7, 8]],

       [[2, 2, 3],
        [5, 5, 6],
        [8, 8, 9]]])
The output array then reads
>>> print same_with_neighbor_on_last_axis
[[[ True False]
  [ True False]
  [ True False]]

 [[ True False]
  [ True False]
  [ True False]]

 [[ True False]
  [ True False]
  [ True False]]]
Using the axis keyword, you can choose whichever axis you need to do this operation on. If you need all of them, you can loop over the axes. np.diff does little more than the following:
np.diff(A, axis=-1) == A[..., 1:] - A[..., :-1]
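To come back to the original question (overwrite the element at the lower index wherever it differs from its right neighbor along the last axis), here is a vectorized sketch, assuming the comparisons are meant against the original values rather than against values already overwritten inside a loop:

import numpy as np

arr = np.random.randint(0, 3, size=(4, 4, 4))   # hypothetical example data

differs = arr[..., :-1] != arr[..., 1:]   # compare each element with its right-hand neighbor
lower = arr[..., :-1]                     # basic slicing, so this is a view into arr
lower[differs] = 111                      # writes through the view back into arr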