Finding minimal subset of columns that make rows in a matrix unique - optimization

What is a generic, efficient algorithm to find the minimal subset of columns in a discrete-valued matrix that makes the rows unique?
For example, consider this matrix (with named columns):
a b c d
2 1 0 0
2 0 0 0
2 1 2 2
1 2 2 2
2 1 1 0
Each row in the matrix is unique. However, if we remove columns a and d we maintain that same property.
I could enumerate all possible combinations of the columns; however, that quickly becomes intractable as my matrix grows. Is there a faster, optimal algorithm for doing this?
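For reference, the brute force described above can be written down as a baseline (a sketch of the enumeration; it is exponential in the number of columns, so only usable for small matrices):
from itertools import combinations

def brute_force_min_columns(A):
    # Try subsets in order of increasing size; the first valid one is a minimum.
    n_cols = len(A[0])
    for k in range(1, n_cols + 1):
        for cols in combinations(range(n_cols), k):
            projected = [tuple(row[j] for j in cols) for row in A]
            if len(set(projected)) == len(A):
                return cols
    return tuple(range(n_cols))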

Actually, my original formulation wasn't very good. This is better formulated as a set cover problem.
import pulp

# Input data
A = [
    [2, 1, 0, 0],
    [2, 0, 0, 0],
    [2, 1, 2, 2],
    [1, 2, 2, 2],
    [2, 1, 1, 0]
]

# Preprocess the data a bit.
# B[i][k][j] = 1 if A[i][j] != A[k][j], 0 otherwise
B = []
for i in range(len(A)):
    Bi = []
    for k in range(len(A)):
        Bik = [int(A[i][j] != A[k][j]) for j in range(len(A[i]))]
        Bi.append(Bik)
    B.append(Bi)

model = pulp.LpProblem('Tim', pulp.LpMinimize)

# Variables turn columns on and off.
x = [pulp.LpVariable('x_%d' % j, cat=pulp.LpBinary) for j in range(len(A[0]))]

# Each pair of rows must differ in at least one selected column.
for i in range(len(A)):
    for k in range(i + 1, len(A)):
        model += sum(B[i][k][j] * x[j] for j in range(len(A[i]))) >= 1

model.setObjective(pulp.lpSum(x))

assert model.solve() == pulp.LpStatusOptimal
print([xi.value() for xi in x])
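To map the solver's 0/1 values back to column names, something like this works (a small follow-up; col_names is my own addition, not part of the model):
col_names = ['a', 'b', 'c', 'd']
keep = [name for name, xi in zip(col_names, x) if xi.value() > 0.5]
print(keep)  # ['b', 'c'] for the example matrix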

An observation: if M has unique rows without both columns i and j, then it has unique rows without column i and without column j independently (in other words, adding a column to a matrix with unique rows cannot make the rows not unique). Therefore, you should be able to find the minimum (not just minimal) solution by using a depth first search.
def has_unique_rows(M):
    return len(set(tuple(row) for row in M)) == len(M)

def remove_cols(M, cols):
    ret = []
    for row in M:
        new_row = []
        for i in range(len(row)):
            if i in cols:
                continue
            new_row.append(row[i])
        ret.append(new_row)
    return ret

def minimum_unique_rows(M):
    if not has_unique_rows(M):
        raise ValueError("M must have unique rows")
    cols = list(range(len(M[0])))

    def _cols_to_remove(M, removed_cols=(), max_removed_cols=()):
        for i in set(cols) - set(removed_cols):
            new_removed_cols = removed_cols + (i,)
            new_M = remove_cols(M, new_removed_cols)
            if not has_unique_rows(new_M):
                continue
            if len(new_removed_cols) > len(max_removed_cols):
                max_removed_cols = new_removed_cols
            return _cols_to_remove(M, new_removed_cols, max_removed_cols)
        return max_removed_cols

    removed_cols = _cols_to_remove(M)
    return remove_cols(M, removed_cols), removed_cols
(note that my variable naming is terrible)
Here it is on your matrix:
In [172]: rows = [
   .....:     [2, 1, 0, 0],
   .....:     [2, 0, 0, 0],
   .....:     [2, 1, 2, 2],
   .....:     [1, 2, 2, 2],
   .....:     [2, 1, 1, 0]
   .....: ]
In [173]: minimum_unique_rows(rows)
Out[173]: ([[1, 0], [0, 0], [1, 2], [2, 2], [1, 1]], (0, 3))
I generated a random matrix (using sympy.randMatrix), which is shown below:
⎡0  1  0  1  0  1  1⎤
⎢0  1  1  2  0  0  2⎥
⎢1  0  1  1  1  0  0⎥
⎢1  2  2  1  1  2  2⎥
⎢2  0  0  0  0  1  1⎥
⎢2  0  2  2  1  1  0⎥
⎢2  1  2  1  1  0  1⎥
⎢2  2  1  2  1  0  1⎥
⎣2  2  2  1  1  2  1⎦
(note that sorting the rows of M helps a lot in checking these things by hand)
In [224]: M1 = [[0, 1, 0, 1, 0, 1, 1], [0, 1, 1, 2, 0, 0, 2], [1, 0, 1, 1, 1, 0, 0],
   .....:       [1, 2, 2, 1, 1, 2, 2], [2, 0, 0, 0, 0, 1, 1], [2, 0, 2, 2, 1, 1, 0],
   .....:       [2, 1, 2, 1, 1, 0, 1], [2, 2, 1, 2, 1, 0, 1], [2, 2, 2, 1, 1, 2, 1]]
In [225]: minimum_unique_rows(M1)
Out[225]: ([[1, 1, 1], [2, 0, 2], [1, 0, 0], [1, 2, 2], [0, 1, 1], [2, 1, 0], [1, 0, 1], [2, 0, 1], [1, 2, 1]], (0, 1, 2, 4))
Here's a brute-force check that it's the minimum answer (actually there are quite a few minimums).
In [229]: from itertools import combinations
In [230]: print([has_unique_rows(remove_cols(M1, r)) for r in combinations(range(7), 6)])
[False, False, False, False, False, False, False]
In [231]: print([has_unique_rows(remove_cols(M1, r)) for r in combinations(range(7), 5)])
[False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False]
In [232]: print([has_unique_rows(remove_cols(M1, r)) for r in combinations(range(7), 4)])
[False, True, False, False, False, False, False, False, False, False, True, False, False, False, False, False, True, False, False, False, False, False, False, False, True, False, False, True, False, False, False, False, False, True, True]
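As a follow-up (my addition), you can list which 4-column removals succeed instead of scanning the True/False flags:
print([r for r in combinations(range(7), 4) if has_unique_rows(remove_cols(M1, r))])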

Here is my greedy solution. (Yes, that fails your "optimal" criterion.) Randomly pick a column that can be safely thrown away and throw it away. Keep going until there are no more such columns. I'm sure is_valid could be optimized.
rows = [
    [2, 1, 0, 0],
    [2, 0, 0, 0],
    [2, 1, 2, 2],
    [1, 2, 2, 2],
    [2, 1, 1, 0]
]
col_names = [0, 1, 2, 3]

def is_valid(rows, col_names):
    # it's valid if every row has a distinct "signature"
    signatures = {tuple(row[col] for col in col_names) for row in rows}
    return len(signatures) == len(rows)

import random

def minimal_distinct_columns(rows, col_names):
    col_names = col_names[:]
    random.shuffle(col_names)
    for i, col in enumerate(col_names):
        fewer_col_names = col_names[:i] + col_names[(i+1):]
        if is_valid(rows, fewer_col_names):
            return minimal_distinct_columns(rows, fewer_col_names)
    return col_names
Since it's greedy, it won't always find the best answer, but it should be relatively speedy (and simple).
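For what it's worth, on the example matrix every valid subset has to contain both columns 1 and 2 (b and c), so the greedy search always lands on the true minimum here. A quick run (my addition; the order of the result varies with the shuffle):
print(minimal_distinct_columns(rows, col_names))  # e.g. [1, 2]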

Although I'm sure there are better approaches, this fondly reminded me of some Genetic Algorithms work I did in the 90s. I wrote up a quick version using R's GA package.
library(GA)

matrix_to_minimize <- matrix(c(2, 2, 2, 1, 2,
                               1, 0, 1, 2, 1,
                               0, 0, 2, 2, 1,
                               0, 0, 2, 2, 0), ncol=4)

evaluate <- function(indices) {
  if (all(indices == 0)) {
    return(0)
  }
  selected_cols <- matrix_to_minimize[, as.logical(indices), drop=FALSE]
  are_unique <- nrow(selected_cols) == nrow(unique(selected_cols))
  if (!are_unique) {
    return(0)
  }
  # Reward smaller column subsets: fitness is 1 / (number of selected columns).
  retval <- 1 / sum(as.logical(indices))
  return(retval)
}

ga_results <- ga("binary", evaluate,
                 nBits=ncol(matrix_to_minimize),
                 popSize=10 * ncol(matrix_to_minimize),  # why not
                 maxiter=1000,
                 run=10)  # probably want to play with this

print("Best Solution: ")
print(ga_results@solution)
I don't know that it's good or optimal, but I bet it will provide a reasonably good answer in a reasonable amount of time? :)

Related

Length of the most recent period over which the current value is the maximum

Given l = [5, 1, 1, 1, 5, 3, 6], the expected result is [0, 0, 0, 0, 3, 0, 6]. For each element, compare with the earlier elements from right to left: while an element is smaller than the current one, keep counting; as soon as an element is greater than or equal, stop and move on to the next position.
How can this be implemented (numpy, pandas)?
My attempt (numpy):
def TOPRANGE(S):
    rt = np.zeros(len(S))
    for i in range(1, len(S)):
        rt[i] = np.argmin(np.flipud(S[:i] < S[i]))
    return rt.astype('int')
l = [5,1,1,1,5,3,6]
s = np.array(l)
TOPRANGE(s)
output: [0, 0, 0, 0, 3, 0, 0]
The expected result is [0, 0, 0, 0, 3, 0, 6], but the last element comes out as 0. How can I fix this?
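The failure is in np.argmin: when every entry of the flipped mask is True (all earlier values are smaller), there is no False to find and argmin returns 0. A guarded version of the same approach (my sketch):
import numpy as np

def TOPRANGE(S):
    rt = np.zeros(len(S), dtype=int)
    for i in range(1, len(S)):
        smaller = np.flipud(S[:i] < S[i])
        # argmin locates the first False, i.e. the first earlier value >= S[i];
        # if all earlier values are smaller, the whole prefix counts.
        rt[i] = i if smaller.all() else np.argmin(smaller)
    return rt

print(TOPRANGE(np.array([5, 1, 1, 1, 5, 3, 6])))  # [0 0 0 0 3 0 6]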
l = [5, 1, 1, 1, 5, 3, 6]
# l = [5, 1, 3, 6]
l.reverse()
out = []
for i in range(len(l)):
    acc = 0
    for j in range(i + 1, len(l)):
        if l[i] > l[j]:
            acc += 1
        else:
            break  # interrupt count
    out.append(acc)
out.reverse()
out
Gives required output:
[0, 0, 0, 0, 3, 0, 6]
If I understand the logic correctly, this looks like an expanding comparison:
pure python
l = [5,1,3,6]
out = [sum(l[i]>x for x in l[:i]) for i in range(len(l))]
pandas
import pandas as pd

l = [5, 1, 3, 6]
s = pd.Series(l)
# .lt keeps the comparison strict, matching the pure-python version above
out = s.expanding().apply(lambda x: sum(x.iloc[:-1].lt(x.iloc[-1]))).astype(int)
out.tolist()
numpy
import numpy as np

l = [5, 1, 3, 6]
a = np.array(l)
out = np.tril(a[:, None] > a).sum(1)
out.tolist()
output: [0, 0, 1, 3]
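One caveat worth checking: the expanding count includes every earlier smaller value, not just the consecutive run, so it diverges from the expected output on the original list:
l = [5, 1, 1, 1, 5, 3, 6]
print([sum(l[i] > x for x in l[:i]) for i in range(len(l))])
# [0, 0, 0, 0, 3, 3, 6] -- position 5 counts the non-adjacent 1s; the expected output is [0, 0, 0, 0, 3, 0, 6]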

Identify vectors being a multiple of another in rectangular matrix

Given an n×m matrix (n > m) of integers, I'd like to identify rows that are a multiple of a single other row, so not a linear combination of multiple other rows.
I could scale all rows to unit length and find unique rows, but that is prone to floating-point issues and would also not detect vectors that are opposites of each other (pointing in the other direction).
Any ideas?
Example
A = array([[-1, -1,  0,  0],
           [-1, -1,  0,  1],
           [-1,  0, -1,  0],
           [-1,  0,  0,  0],
           [-1,  0,  0,  1],
           [-1,  0,  1,  1],
           [-1,  1, -1,  0],
           [-1,  1,  0,  0],
           [-1,  1,  1,  0],
           [ 0, -1,  0,  0],
           [ 0, -1,  0,  1],
           [ 0, -1,  1,  0],
           [ 0, -1,  1,  1],
           [ 0,  0, -1,  0],
           [ 0,  0,  0,  1],
           [ 0,  0,  1,  0],
           [ 0,  1, -1,  0],
           [ 0,  1,  0,  0],
           [ 0,  1,  0,  1],
           [ 0,  1,  1,  0],
           [ 0,  1,  1,  1],
           [ 1, -1,  0,  0],
           [ 1, -1,  1,  0],
           [ 1,  0,  0,  0],
           [ 1,  0,  0,  1],
           [ 1,  0,  1,  0],
           [ 1,  0,  1,  1],
           [ 1,  1,  0,  0],
           [ 1,  1,  0,  1],
           [ 1,  1,  1,  0]])
For example, rows 0 and -3 just point in opposite directions (multiply one by -1 to make them equal).
You can normalize each row by dividing it by its GCD:
import numpy as np

def normalize(a):
    return a // np.gcd.reduce(a, axis=1, keepdims=True)
And you can define a distance that considers opposite vectors as equal:
def distance(a, b):
    equal = np.all(a == b) or np.all(a == -b)
    return 0 if equal else 1
Then you can use standard clustering methods:
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster

def cluster(a):
    norm_a = normalize(a)
    distances = pdist(norm_a, metric=distance)
    return fcluster(linkage(distances), t=0.5)
For example:
>>> A = np.array([( 1, 2, 3, 4),
... ( 0, 2, 4, 8),
... (-1, -2, -3, -4),
... ( 0, 1, 2, 4),
... (-1, 2, -3, 4),
... ( 2, -4, 6, -8)])
>>> cluster(A)
array([2, 3, 2, 3, 1, 1], dtype=int32)
Interpretation: cluster 1 is formed by rows 4 and 5, cluster 2 by rows 0 and 2, and cluster 3 by rows 1 and 3.
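If you'd rather avoid the O(n²) pairwise clustering, a variant of the same idea (my own sketch, not part of the answer above) is to canonicalize each row's sign after the GCD division and then group identical rows with np.unique:
import numpy as np

def canonical_form(a):
    # Divide each row by its GCD, then flip signs so the first nonzero entry is positive.
    g = np.gcd.reduce(a, axis=1, keepdims=True)
    g[g == 0] = 1  # guard all-zero rows
    b = a // g
    first_nz = (b != 0).argmax(axis=1)
    signs = np.sign(b[np.arange(len(b)), first_nz])
    signs[signs == 0] = 1  # an all-zero row keeps its sign
    return b * signs[:, None]

# Rows i and k are multiples of one another (up to sign) iff labels[i] == labels[k].
_, labels = np.unique(canonical_form(A), axis=0, return_inverse=True)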
You can take advantage of the fact that the inner product of two normalized, linearly dependent vectors is 1 or -1, so the code could look like this:
>>> A_normalized = (A.T/np.linalg.norm(A, axis=-1)).T
>>> M = np.absolute(np.einsum('ix,jx->ij', A_normalized, A_normalized))
>>> i, j = np.where(np.isclose(M, 1))
>>> i, j = i[i < j], j[i < j] # Remove repetitions
>>> print(i, j)
output: [ 0 2 3 6 7 9 11 13] [27 25 23 22 21 17 16 15]
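To read that result as explicit pairs (a small follow-up using the i, j computed above):
for r1, r2 in zip(i, j):
    print(f"row {r1} is a multiple of row {r2} (up to sign)")  # the first pair is rows 0 and 27, i.e. row -3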

numpy unique over multiple arrays

numpy.unique expects a 1-D array; if the input is not 1-D, it flattens it by default.
Is there a way for it to accept multiple arrays? To keep it simple, let's say a pair of arrays, and we are unique-ing the pairs of elements across the two arrays.
For example, say I have two numpy arrays as inputs:
a = [1, 2, 3, 3]
b = [10, 20, 30, 31]
I'm unique-ing against both of these arrays, so against these 4 pairs: (1,10), (2,20), (3,30), and (3,31). These 4 are all unique, so I want my result to say
[True, True, True, True]
If instead the inputs are as follows
a = [1, 2, 3, 3]
b = [10, 20, 30, 30]
Then the last 2 elements are not unique. So the output should be
[True, True, True, False]
You could use the unique_indices value returned by numpy.unique():
In [243]: def is_unique(*lsts):
     ...:     arr = np.vstack(lsts)
     ...:     _, ind = np.unique(arr, axis=1, return_index=True)
     ...:     out = np.zeros(shape=arr.shape[1], dtype=bool)
     ...:     out[ind] = True
     ...:     return out
In [244]: a = [1, 2, 2, 3, 3]
In [245]: b = [1, 2, 2, 3, 3]
In [246]: c = [1, 2, 0, 3, 3]
In [247]: is_unique(a, b)
Out[247]: array([ True, True, False, True, False])
In [248]: is_unique(a, b, c)
Out[248]: array([ True, True, True, True, False])

Numpy Vectorization: add row above to current row on ndarray

I would like to add the values in the row above to the row below, using vectorization. For example, if I had the ndarray
[[0, 0, 0, 0],
 [1, 1, 1, 1],
 [2, 2, 2, 2],
 [3, 3, 3, 3]]
Then after one iteration through this method, it would result in
[[0, 0, 0, 0],
 [1, 1, 1, 1],
 [3, 3, 3, 3],
 [5, 5, 5, 5]]
One can simply do this with a for loop:
import numpy as np

def addAboveRow(arr):
    cpy = arr.copy()
    r, c = arr.shape
    for i in range(1, r):
        for j in range(c):
            cpy[i][j] += arr[i - 1][j]
    return cpy

ndarr = np.array([0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3]).reshape(4, 4)
print(addAboveRow(ndarr))
I'm not sure how to approach this using vectorization, though. I think slicing should be used? Also, I'm not really sure how to deal with the top border, because nothing should be added onto the first row. Any help would be appreciated. Thanks!
Note: I am really new to vectorization, so an explanation would be great!
You can use indexing directly:
b = np.zeros_like(a)
b[0] = a[0]
b[1:] = a[1:] + a[:-1]
>>> b
array([[0, 0, 0, 0],
       [1, 1, 1, 1],
       [3, 3, 3, 3],
       [5, 5, 5, 5]])
An alternative:
b = a.copy()
b[1:] += a[:-1]
Or:
b = a.copy()
np.add(b[1:], a[:-1], out=b[1:])
You could try the following
np.put(arr, np.arange(arr.shape[1], arr.size), arr[1:]+arr[:-1])
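One caveat (my note): np.put writes into arr in place, so take a copy first if the original matrix is still needed:
import numpy as np

arr = np.array([[0, 0, 0, 0], [1, 1, 1, 1], [2, 2, 2, 2], [3, 3, 3, 3]])
result = arr.copy()
np.put(result, np.arange(result.shape[1], result.size), arr[1:] + arr[:-1])
print(result)  # rows are now 0s, 1s, 3s, 5s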

Generating a boolean mask indexing one array into another array

It's hard to explain what I'm trying to do with words so here's an example.
Let's say we have the following inputs:
In [76]: x
Out[76]:
0    a
1    a
2    c
3    a
4    b
In [77]: z
Out[77]: ['a', 'b', 'c', 'd', 'e']
I want to get:
In [78]: ii
Out[78]:
array([[1, 0, 0, 0, 0],
       [1, 0, 0, 0, 0],
       [0, 0, 1, 0, 0],
       [1, 0, 0, 0, 0],
       [0, 1, 0, 0, 0]])
ii is an array of boolean masks which can be applied to z to get back the original x.
My current solution is to write a function that converts z to a list, uses the index method to find the index of each element of x in z, and then builds a row of zeros with a one at that index. The function is applied to each row of x to get the desired result.
A first possibility:
>>> choices = np.diag([1]*5)
>>> choices[[z.index(i) for i in x]]
As noted elsewhere, you can replace the list comprehension [z.index(i) for i in x] with np.searchsorted(z, x):
>>> choices[np.searchsorted(z, x)]
Note that, as suggested in a comment by @seberg, you should use np.eye(len(z)) instead of np.diag([1]*len(z)). The np.eye function directly gives you a 2D array with 1 on the diagonal and 0 elsewhere.
This is a numpy method for the case of z being sorted. You did not specify that... If pandas needs something different, I don't know:
# Assuming z is sorted.
indices = np.searchsorted(z, x)
Now, I really don't know why you want a boolean mask; these indices can be applied to z to give back x directly, and they are more compact:
z[indices] == x # if z included all x.
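If you do still want the boolean mask, you can scatter ones at those indices (a small sketch building on the indices above):
mask = np.zeros((len(x), len(z)), dtype=int)
mask[np.arange(len(x)), indices] = 1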
Surprised no one mentioned the outer method of numpy.equal (here s holds the values of x):
In [51]: np.equal.outer(s, z)
Out[51]:
array([[ True, False, False, False, False],
       [ True, False, False, False, False],
       [False, False,  True, False, False],
       [ True, False, False, False, False],
       [False,  True, False, False, False]], dtype=bool)
In [52]: np.equal.outer(s, z).astype(int)
Out[52]:
array([[1, 0, 0, 0, 0],
       [1, 0, 0, 0, 0],
       [0, 0, 1, 0, 0],
       [1, 0, 0, 0, 0],
       [0, 1, 0, 0, 0]])
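For what it's worth, a plain broadcasting comparison gives the same mask (my variant, in case np.equal.outer balks at a particular dtype; I rebuild the inputs explicitly here):
import numpy as np
import pandas as pd

x = pd.Series(list('aacab'))
z = ['a', 'b', 'c', 'd', 'e']
ii = (x.to_numpy()[:, None] == np.asarray(z)).astype(int)
print(ii)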