I have a reflexive, asymmetric, transitive relation represented as an n x n sparse scipy csr matrix.
Now as a result of some transformations I am left with many 'unnecessary' pairs:
set([('a','b'),('b','c'),('a','c')])
I need to remove pairs ('a','c') that can be seen as 'direct' edges when there are 'indirect' ones.
At first I thought this was a special spanning arborescence, but in the following case:
set([('a','b'),('b','d'),('a','c'),('c','d')])
... no pair should be removed. The result is not necessarily a tree.
Is there a name for this kind of problem?
Is there an implementation in scipy?
If not, can you suggest an efficient algorithm in python/numpy/scipy?
EDIT: Seems like this is called a transitive reduction?
But there is no scipy.sparse.csgraph implementation?
EDIT: I guess to get an acyclic directed graph I would have to (temporarily) remove the 'reflexiveness', but this is not a problem.
So the problem is called Transitive Reduction of a directed acyclic graph.
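If a networkx dependency is acceptable, networkx also ships a transitive_reduction for DAGs, so one option is a round trip through a DiGraph. A rough sketch (the sparse conversion helpers are named from_scipy_sparse_array / to_scipy_sparse_array in recent networkx releases; older ones use from_scipy_sparse_matrix / to_scipy_sparse_matrix):

import networkx as nx

# edges: irreflexive boolean csr adjacency matrix of a DAG
G = nx.from_scipy_sparse_array(edges, create_using=nx.DiGraph)
R = nx.transitive_reduction(G)          # raises if G is not a DAG
reduced = nx.to_scipy_sparse_array(R, format='csr')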
The following code should solve the problem, although it might be far from optimal:
def transitive_reduction(edges):  # edges is irreflexive and a boolean scipy sparse matrix (a DAG)
    reduction = edges.copy()
    num, i = 99, 2                   # 99 is just a nonzero seed so the loop runs at least once
    while num > 0:
        new = edges**i               # matrix power: nonzero where a path of length exactly i exists
        num = len(new.nonzero()[0])  # number of such paths
        reduction = reduction > new  # drop direct edges that also have an indirect path of length i
        i += 1
    reduction.eliminate_zeros()      # might or might not be required
    return reduction
Explanation: as long as paths of length i exist, we remove from reduction all direct edges for which an indirect path of length i exists.
Credits to @PaulPanzer.
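A quick sanity check on the example from the question, mapping a, b, c to indices 0, 1, 2 (this assumes a scipy version where boolean sparse matrix products and comparisons are supported); the direct edge (a, c) should disappear because of the indirect path a -> b -> c:

import numpy as np
from scipy.sparse import csr_matrix

edges = csr_matrix(np.array([[0, 1, 1],
                             [0, 0, 1],
                             [0, 0, 0]], dtype=bool))
reduced = transitive_reduction(edges)
print(reduced.toarray().astype(int))
# expected:
# [[0 1 0]
#  [0 0 1]
#  [0 0 0]]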
If I have indices of shape (D_0,...,D_k) and params of shape (D_0,...,D_k,I,F) (with 0 ≤ indices[i_0,...,i_k] < I), what is the fastest/most elegant way to get an output array of shape (D_0,...,D_k,F) with
output[i_0,...,i_k,f]=params[i_0,...,i_k,indices[i_0,...,i_k],f]
If k=0, then we can use gather. So, in the past, I had a solution based on flattening. Is there a nicer solution now that tensorflow has matured?
Most of the time, when I want this kind of gathering, indices is obtained by indices = tf.argmax(params[..., 0], axis=-1). For every (i_0,...,i_k), I have I vectors of size (F,) and I want to keep only the one with the maximal value for one of the features. A solution that only works for this special case (a kind of reduce_max that uses a single feature to decide how to reduce) would satisfy me.
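One possible sketch (assuming a TensorFlow version where tf.gather supports the batch_dims argument, i.e. 1.14+/2.x): treat the leading k+1 dimensions as batch dimensions and gather along the I axis; NumPy's take_along_axis is the analogous trick:

import numpy as np
import tensorflow as tf

# Toy shapes: (D_0, D_1) = (2, 3), I = 4, F = 5, so k = 1
params = tf.random.normal((2, 3, 4, 5))
indices = tf.argmax(params[..., 0], axis=-1)               # shape (2, 3), values in [0, I)

output = tf.gather(params, indices, axis=2, batch_dims=2)  # shape (2, 3, 5)

# NumPy equivalent for comparison
out_np = np.take_along_axis(params.numpy(),
                            indices.numpy()[..., None, None], axis=-2)[..., 0, :]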
Quite simply, what I want to do is the following
A = np.ones((3,3)) #arbitrary matrix
B = np.ones((2,2)) #arbitrary matrix
A[1:,1:] = A[1:,1:] + B
except in Tensorflow (where the matrices can be arbitrarily complicated tensor expressions). Neither A nor B is a Tensorflow Variable, but just a run-of-the-mill tensor.
What I have gathered so far: tensors are immutable, so I cannot assign to a submatrix. tf.scatter_nd is the current option for sub-assignment, but does not appear to support sub-matrices, only slices.
Methods that should work, but are perhaps not ideal:
I could pad B with zeros, but I'm sure this leads to the instantiation of an unnecessarily large B - can it be made sparse, maybe? (A tf.pad sketch of this option appears below.)
I could use the padding idea, but write it as a low-rank decomposition, e.g. in NumPy: A + U.dot(B).dot(U.T), where U is a stacked zero and identity matrix. I'm not sure this is actually advantageous.
I could split A into submatrices, and stack them back together. Might be the most efficient, but sounds like the code would be convoluted.
Ideally, I want to do this operation N times for progressively smaller matrices, resulting in one large final result, but this is tangential.
I'll use one of the hacks for now, but I'm hoping someone can tell me what the idiomatic version is!
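A minimal sketch of the padding option using tf.pad (whether the padded intermediate is really a problem is worth profiling; this at least avoids manual splitting and stacking):

import tensorflow as tf

A = tf.ones((3, 3))   # stands in for an arbitrary tensor expression
B = tf.ones((2, 2))   # stands in for an arbitrary tensor expression

# Pad B with one row and one column of zeros at the top/left so it lines
# up with the lower-right 2x2 block of A, then add.
B_padded = tf.pad(B, paddings=[[1, 0], [1, 0]])
result = A + B_padded          # same as A with B added to A[1:, 1:]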
I'm having some problems solving this exercise about schemata in genetic algorithms. Suppose I have the following situation, where three parents {1101, 0101, 0001} have, respectively, fitness {0.7, 4.3, 3.5} with respect to an unknown fitness function. The question is: which schema will have the highest survival probability in the case of a maximization problem? The possible answers I was given are: {**01}, {0***}, {***1} and {*101}.
Thank you in advance!
For the general case, the schema theorem states that schemata with above-average fitness, short defining length and low order are more likely to survive.
For a schema H:
the order o(H) = the number of fixed bits (e.g. o({01*0*}) = 3)
the defining length δ(H) = distance between the first and the last fixed bits (e.g. δ({*0*10}) = 3)
the probability of a gene not being changed is (1 - p) where p is the mutation probability. So the probability a schema H survives under mutation is S(H) = (1-p) ^ o(H)
...but this isn't the general case.
Every individual matches the two schemas {**01} and {***1}.
No matter what parent is selected for crossover / copy (these operations are fitness-dependent), children will match (at least before mutation) both schemas (with 100% probability).
Assuming mutation is applied gene by gene, for a schema H to survive, all fixed bits must remain unchanged. So {***1} is more likely to survive (has lower order).
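A small sketch of these quantities (the helper names are mine, not from any library); it reproduces the conclusion that {***1} has the lowest order and hence the highest mutation-survival probability:

def order(schema):
    # number of fixed (non-'*') bits
    return sum(c != '*' for c in schema)

def defining_length(schema):
    # distance between the first and the last fixed bit
    fixed = [i for i, c in enumerate(schema) if c != '*']
    return fixed[-1] - fixed[0] if fixed else 0

def mutation_survival(schema, p):
    # probability that no fixed bit is flipped: S(H) = (1 - p) ** o(H)
    return (1 - p) ** order(schema)

for h in ['**01', '0***', '***1', '*101']:
    print(h, order(h), defining_length(h), mutation_survival(h, p=0.01))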
This question and answer demonstrate that when feature selection is performed using one of scikit-learn's dedicated feature selection routines, then the names of the selected features can be retrieved as follows:
np.asarray(vectorizer.get_feature_names())[featureSelector.get_support()]
For example, in the above code, featureSelector might be an instance of sklearn.feature_selection.SelectKBest or sklearn.feature_selection.SelectPercentile, since these classes implement the get_support method which returns a boolean mask or integer indices of the selected features.
When one performs feature selection via linear models penalized with the L1 norm, it's unclear how to accomplish this. sklearn.svm.LinearSVC has no get_support method and the documentation doesn't make clear how to retrieve the feature indices after using its transform method to eliminate features from a collection of samples. Am I missing something here?
For sparse estimators you can generally find the support by checking where the non-zero entries are in the coefficient vector (provided a coefficient vector exists, which is the case for e.g. linear models):
support = np.flatnonzero(estimator.coef_)
For your LinearSVC with l1 penalty it would accordingly be
import numpy as np
from sklearn.svm import LinearSVC

svc = LinearSVC(C=1., penalty='l1', dual=False)
svc.fit(X, y)
selected_feature_names = np.asarray(vectorizer.get_feature_names())[np.flatnonzero(svc.coef_)]
I've been using sklearn 0.15.2, and according to the LinearSVC documentation, coef_ is an array of shape [n_features] if n_classes == 2, else [n_classes, n_features].
So first, np.flatnonzero doesn't work for the multi-class case; you'll get an index-out-of-range error. Second, it should be np.where(svc.coef_ != 0)[1] instead of np.where(svc.coef_ != 0)[0]; index 0 refers to the classes, not the features. I ended up using np.asarray(vectorizer.get_feature_names())[list(set(np.where(svc.coef_ != 0)[1]))]
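Put differently, a sketch of the multi-class variant (assuming coef_ has shape (n_classes, n_features)) could be:

import numpy as np

# column indices (features) with a non-zero coefficient for at least one class
selected = np.unique(np.where(svc.coef_ != 0)[1])
selected_feature_names = np.asarray(vectorizer.get_feature_names())[selected]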
Suppose that there are multiple source-destination pairs in an undirected graph. I want to generate disjoint paths for the pairs. What would be the complexity of such a problem? Is there any polynomial-time heuristic for finding edge-disjoint paths for these pairs? (i.e. the path between s1 and d1 should not share any edges with the path between s2 and d2)
This looks like a variant of the multi-commodity flow problem: http://en.wikipedia.org/wiki/Multi-commodity_flow_problem
Treat each source/sink pair as a new commodity, and give your edges unit capacities to enforce disjoint paths. Now search the literature for approximations to this class of MCFP with unit capacities.
Your problem is NP-hard, even for the case of two sources and two sinks. It becomes polynomially solvable if you stop caring which source matches with which sink.
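Since the exact problem is NP-hard, one practical fallback is a greedy heuristic: route each pair along a shortest path, then remove the used edges so later pairs cannot reuse them. A rough sketch using networkx (pair order matters, and it can fail even when a disjoint routing exists):

import networkx as nx

def greedy_edge_disjoint_paths(graph, pairs):
    g = graph.copy()
    paths = {}
    for s, d in pairs:
        try:
            path = nx.shortest_path(g, s, d)
        except nx.NetworkXNoPath:
            paths[(s, d)] = None                   # heuristic gave up on this pair
            continue
        paths[(s, d)] = path
        g.remove_edges_from(zip(path, path[1:]))   # reserve these edges for this pair
    return paths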