Solving an underdetermined scipy.sparse matrix using svd - pandas

Problem
I have a set of equations with variables denoted with lowercase variables and constants with uppercase variables as such
A = a + b
B = c + d
C = a + b + c + d + e
I'm provided the information as to the structure of these equations in a pandas DataFrame with two columns: Constants and Variables
E.g.
df = pd.DataFrame([['A','a'],['A','b'],['B','c'],['B','d'],['C','a'],['C','b'],
['C','c'],['C','d'],['C','e']],columns=['Constants','Variables'])
I then convert this to a sparse CSC matrix by using NetworkX
table = nx.bipartite.biadjacency_matrix(nx.from_pandas_dataframe(df,'Constants','Variables')
,df.Constants.unique(),df.Variables.unique(),format='csc')
When converted to a dense matrix, table looks like the following
matrix([[1, 1, 0, 0, 0],[0, 0, 1, 1, 0],[1, 1, 1, 1, 1]], dtype=int64)
What I want from here is to find which variables are solvable (in this example, only e is solvable) and for each solvable variable, what constants is its value dependent on (in this case, since e = C-B-A, it is dependent on A, B, and C)
Attempts at Solution
I first tried to use rref to solve for the solvable variables. I used the symbolics library sympy and the function sympy.Matrix.rref, which gave me exactly what I wanted, since any solvable variable would have its own row with almost all zeros and 1 one, which I could check for row by row.
However, this solution was not stable. Primarily, it was exceedingly slow, and didn't make use of the fact that my datasets are likely to be very sparse. Moreover, rref doesn't do too well with floating points. So I decided to move on to another approach motivated by Removing unsolvable equations from an underdetermined system, which suggested using svd
Conveniently, there is a svd function in the scipy.sparse library, namely scipy.sparse.linalg.svds. However, given my lack of linear algebra background, I don't understand the results outputted by running this function on my table, or how to use those results to get what I want.
Further Details in the Problem
The coefficient of every variable in my problem is 1. This is how the data can be expressed in the two column pandas DataFrame shown earlier
The vast majority of variables in my actual examples will not be solvable. The goal is to find the few that are solvable
I'm more than willing to try an alternate approach if it fits the constraints of this problem.
This is my first time posting a question, so I apologize if this doesn't exactly follow guidelines. Please leave constructive criticism but be gentle!

The system you are solving has the form
[ 1 1 0 0 0 ] [a] [A]
[ 0 0 1 1 0 ] [b] = [B]
[ 1 1 1 1 1 ] [c] [C]
[d]
[e]
i.e., three equations for five variables a, b, c, d, e. As the answer linked in your question mentions, one can tackle such underdetermined system with the pseudoinverse, which Numpy directly provides in terms of the pinv function.
Since M has linearly independent rows, the psudoinverse has in this case the property that M.pinv(M) = I, where I denotes identity matrix (3x3 in this case). Thus formally, we can write the solution as:
v = pinv(M) . b
where v is the 5-component solution vector, and b denotes the right-hand side 3-component vector [A, B, C]. However, this solution is not unique, since one can add a vector from the so-called kernel or null space of the matrix M (i.e., a vector w for which M.w=0) and it will be still a solution:
M.(v + w) = M.v + M.w = b + 0 = b
Therefore, the only variables for which there is a unique solution are those for which the corresponding component of all possible vectors from the null space of M is zero. In other words, if you assemble the basis of the null space into a matrix (one basis vector per column), then the "solvable variables" will correspond to zero rows of this matrix (the corresponding component of any linear combination of the columns will be then also zero).
Let's apply this to your particular example:
import numpy as np
from numpy.linalg import pinv
M = [
[1, 1, 0, 0, 0],
[0, 0, 1, 1, 0],
[1, 1, 1, 1, 1]
]
print(pinv(M))
[[ 5.00000000e-01 -2.01966890e-16 1.54302378e-16]
[ 5.00000000e-01 1.48779676e-16 -2.10806254e-16]
[-8.76351626e-17 5.00000000e-01 8.66819360e-17]
[-2.60659800e-17 5.00000000e-01 3.43000417e-17]
[-1.00000000e+00 -1.00000000e+00 1.00000000e+00]]
From this pseudoinverse, we see that the variable e (last row) is indeed expressible as - A - B + C. However, it also "predicts" that a=A/2 and b=A/2. To eliminate these non-unique solutions (equally valid would be also a=A and b=0 for example), let's calculate the null space borrowing the function from SciPy Cookbook:
print(nullspace(M))
[[ 5.00000000e-01 -5.00000000e-01]
[-5.00000000e-01 5.00000000e-01]
[-5.00000000e-01 -5.00000000e-01]
[ 5.00000000e-01 5.00000000e-01]
[-1.77302319e-16 2.22044605e-16]]
This function returns already the basis of the null space assembled into a matrix (one vector per column) and we see that, within a reasonable precision, the only zero row is indeed only the last one corresponding to the variable e.
EDIT:
For the set of equations
A = a + b, B = b + c, C = a + c
the corresponding matrix M is
[ 1 1 0 ]
[ 0 1 1 ]
[ 1 0 1 ]
Here we see that the matrix is in fact square, and invertible (the determinant is 2). Thus the pseudoinverse coincides with "normal" inverse:
[[ 0.5 -0.5 0.5]
[ 0.5 0.5 -0.5]
[-0.5 0.5 0.5]]
which corresponds to the solution a = (A - B + C)/2, .... Since M is invertible, its kernel / null space is empty, that's why the cookbook function returns only []. To see this, let's use the definition of the kernel - it is formed by all non-zero vectors x such that M.x = 0. However, since M^{-1} exists, x is given as x = M^{-1} . 0 = 0 which is a contradiction. Formally, this means that the found solution is unique (or that all variables are "solvable").

To build on ewcz's answer, both the nullspace and pseudo-inverse can be calculated using numpy.linalg.svd. See the links below:
pseudo-inverse
nullspace

Related

How should I handle music key (scale) as a feature in the knn algorithm

I'm doing a data science project, and I was wondering how to handle a music key (scale) as a feature in the KNN algorithm.
I know KNN is based on distances, therefore giving each key a number like 1-24 doesn't make that much sense (because key number 24 is close to 1 as much as 7 close to 8).
I have thought about making a column for "Major/Minor" and another for the note itself,
but I'm still facing the same problem, I need to specify the note with a number, but because notes are cyclic I cannot number them linearly 1-12.
For the people that have no idea how music keys work my question is equivalent to handling states in KNN, you can't just number them linearly 1-50.
One way you could think about the distance between scales is to think of each scale as a 12-element binary vector where there's a 1 wherever a note is in the scale and a zero otherwise.
Then you can compute the Hamming distance between scales. The Hamming distance, for example, between a major scale and its relative minor scale should be zero because they both contain the same notes.
Here's a way you could set this up in Python
from enum import IntEnum
import numpy as np
from scipy.spatial.distance import hamming
class Note(IntEnum):
C = 0
Db = 1
D = 2
Eb = 3
E = 4
F = 5
Gb = 6
G = 7
Ab = 8
A = 9
Bb = 10
B = 11
major = np.array((1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1))
minor = np.array((1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0)) #WHWWHWW Natural Minor
# Transpose the basic scale form to a key using Numpy's `roll` function
cMaj = np.roll(major, Note.C) # Rolling by zero changes nothing
aMin = np.roll(minor, Note.A)
gMaj = np.roll(major, Note.G)
fMaj = np.roll(major, Note.F)
print('Distance from cMaj to aMin', hamming(cMaj, aMin))
print('Distance from cMaj to gMaj', hamming(cMaj, gMaj)) # One step clockwise on circle of fifths
print('Distance from cMaj to fMaj', hamming(cMaj, fMaj)) # One step counter-clockwise on circle of fifths
IIUC, you can convert your features to something like sin as follows. Hear I have 10 values 1-10 and I am transforming them to keep their circular relation.
a = np.around(np.sin([np.deg2rad(x*18) for x in np.array(list(range(11)))]), 3)
import matplotlib.pyplot as plt
plt.plot(a)
Output:
Through this feature engineering you can see that the circularity of your feature is encoded. The value of 0 is equal to 10.

How to find most similar numerical arrays to one array, using Numpy/Scipy?

Let's say I have a list of 5 words:
[this, is, a, short, list]
Furthermore, I can classify some text by counting the occurrences of the words from the list above and representing these counts as a vector:
N = [1,0,2,5,10] # 1x this, 0x is, 2x a, 5x short, 10x list found in the given text
In the same way, I classify many other texts (count the 5 words per text, and represent them as counts - each row represents a different text which we will be comparing to N):
M = [[1,0,2,0,5],
[0,0,0,0,0],
[2,0,0,0,20],
[4,0,8,20,40],
...]
Now, I want to find the top 1 (2, 3 etc) rows from M that are most similar to N. Or on simple words, the most similar texts to my initial text.
The challenge is, just checking the distances between N and each row from M is not enough, since for example row M4 [4,0,8,20,40] is very different by distance from N, but still proportional (by a factor of 4) and therefore very similar. For example, the text in row M4 can be just 4x as long as the text represented by N, so naturally all counts will be 4x as high.
What is the best approach to solve this problem (of finding the most 1,2,3 etc similar texts from M to the text in N)?
Generally speaking, the most widely standard technique of bag of words (i.e. you arrays) for similarity is to check cosine similarity measure. This maps your bag of n (here 5) words to a n-dimensional space and each array is a point (which is essentially also a point vector) in that space. The most similar vectors(/points) would be ones that have the least angle to your text N in that space (this automatically takes care of proportional ones as they would be close in angle). Therefore, here is a code for it (assuming M and N are numpy arrays of the similar shape introduced in the question):
import numpy as np
cos_sim = M[np.argmax(np.dot(N, M.T)/(np.linalg.norm(M)*np.linalg.norm(N)))]
which gives output [ 4 0 8 20 40] for your inputs.
You can normalise your row counts to remove the length effect as you discussed. Row normalisation of M can be done as M / M.sum(axis=1)[:, np.newaxis]. The residual values can then be calculated as the sum of the square difference between N and M per row. The minimum difference (ignoring NaN or inf values obtained if the row sum is 0) is then the most similar.
Here is an example:
import numpy as np
N = np.array([1,0,2,5,10])
M = np.array([[1,0,2,0,5],
[0,0,0,0,0],
[2,0,0,0,20],
[4,0,8,20,40]])
# sqrt of sum of normalised square differences
similarity = np.sqrt(np.sum((M / M.sum(axis=1)[:, np.newaxis] - N / np.sum(N))**2, axis=1))
# remove any Nan values obtained by dividing by 0 by making them larger than one element
similarity[np.isnan(similarity)] = similarity[0]+1
result = M[similarity.argmin()]
result
>>> array([ 4, 0, 8, 20, 40])
You could then use np.argsort(similarity)[:n] to get the n most similar rows.

How to add magnitude or value to a vector in Python?

I am using this function to calculate distance between 2 vectors a,b, of size 300, word2vec, I get the distance between 'hot' and 'cold' to be equal 1.
How to add this value (1) to a vector, becz i thought simply new_vec=model['hot']+1, but when I do the calc dist(new_vec,model['hot'])=17?
import numpy
def dist(a,b):
return numpy.linalg.norm(a-b)
a=model['hot']
c=a+1
dist(a,c)
17
I expected dist(a,c) will give me back 1!
You should review what the norm is. In the case of numpy, the default is to use the L-2 norm (a.k.a the Euclidean norm). When you add 1 to a vector, the call is to add 1 to all of the elements in the vector.
>> vec1 = np.random.normal(0,1,size=300)
>> print(vec1[:5])
... [ 1.18469795 0.04074346 -1.77579852 0.23806222 0.81620881]
>> vec2 = vec1 + 1
>> print(vec2[:5])
... [ 2.18469795 1.04074346 -0.77579852 1.23806222 1.81620881]
Now, your call to norm is saying sqrt( (a1-b1)**2 + (a2-b2)**2 + ... + (aN-bN)**2 ) where N is the length of the vector and a is the first vector and b is the second vector (and ai being the ith element in a). Since (a1-b1)**2 == (a2-b2)**2 == ... == (aN-bN)**2 == 1 we expect this sum to produce N which in your case is 300. So sqrt(300) = 17.3 is the expected answer.
>> print(np.linalg.norm(vec1-vec2))
... 17.320508075688775
To answer the question, "How to add a value to a vector": you have done this correctly. If you'd like to add a value to a specific element then you can do vec2[ix] += value where ix indexes the element that you wish to add. If you want to add a value uniformly across all elements in the vector that will change the norm by 1, then add np.sqrt(1/300).
Also possibly relevant is a more commonly used distance metric for word2vec vectors: the cosine distance which measures the angle between two vectors.

Solving and producing and equation with variables indexed by a given set (Haskell)

I want to solve the following problem in Haskell:
Let n be a natural number and let A = [d_1 , ..., d_r] be a set of positive numbers.
I want to find all the positive solutions of the following equation:
n = Sum d_i^2 x_i.
For example if n= 12 and the set A= [1,2,3]. I would like to solve the following equation over the natural numbers:
x+4y+9z=12.
It's enough to use the following code:
[(x,y,z) | x<-[0..12], y<-[0..12], z<-[0..12], x+4*y+9*z==12]
My problem is if n is not fixed and also the set A are not fixed. I don't know how to "produce" a certain amount of variables indexed by the set A.
Instead of a list-comprehension you can use a recursive call with do-notation for the list-monad.
It's a bit more tricky as you have to handle the edge-cases correctly and I allowed myself to optimize a bit:
solve :: Integer -> [Integer] -> [[Integer]]
solve 0 ds = [replicate (length ds) 0]
solve _ [] = []
solve n (d:ds) = do
let maxN = floor $ fromIntegral n / fromIntegral (d^2)
x <- [0..maxN]
xs <- solve (n - x * d^2) ds
return (x:xs)
it works like this:
It's keeping track of the remaining sum in the first argument
when there this sum is 0 where are obviously done and only have to return 0's (first case) - it will return a list of 0s with the same length as the ds
if the remaining sum is not 0 but there are no d's left we are in trouble as there are no solutions (second case) - note that no solutions is just the empty list
in every other case we have a non-zero n (remaining sum) and some ds left (third case):
now look for the maximum number that you can pick for x (maxN) remember that x * d^2 should be <= n so the upper limit is n / d^2 but we are only interested in integers (so it's floor)
try all from x from 0 to maxN
look for all solutions of the remaining sum when using this x with the remaining ds and pick one of those xs
combine x with xs to give a solution to the current subproblem
The list-monad's bind will handle the rest for you ;)
examples
λ> solve 12 [1,2,3]
[[0,3,0],[3,0,1],[4,2,0],[8,1,0],[12,0,0]]
λ> solve 37 [2,3,4,6]
[[3,1,1,0],[7,1,0,0]]
remark
this will fail when dealing with negative numbers - if you need those you gonna have to introduce some more cases - I'm sure you figure them out (it's really more math than Haskell at this point)
Some hints:
Ultimately you want to write a function with this signature:
solutions :: Int -> [Int] -> [ [Int] ]
Examples:
solutions 4 [1,2] == [ [4,0], [0,1] ]
-- two solutions: 4 = 4*1^2 + 0*2^2, 4 = 0*1^2 + 1*2^2
solutions 22 [2,3] == [ [1,2] ]
-- just one solution: 22 = 1*2^2 + 2*3^2
solutions 10 [2,3] == [ ]
-- no solutions
Step 2. Define solutions recursively based on the structure of the list:
solutions x [a] = ...
-- This will either be [] or a single element list
solutions x (a:as) = ...
-- Hint: you will use `solutions ... as` here

torch logical indexing of tensor

I looking for an elegant way to select a subset of a torch tensor which satisfies some constrains.
For example, say I have:
A = torch.rand(10,2)-1
and S is a 10x1 tensor,
sel = torch.ge(S,5) -- this is a ByteTensor
I would like to be able to do logical indexing, as follows:
A1 = A[sel]
But that doesn't work.
So there's the index function which accepts a LongTensor but I could not find a simple way to convert S to a LongTensor, except the following:
sel = torch.nonzero(sel)
which returns a K x 2 tensor (K being the number of values of S >= 5). So then I have to convert it to a 1 dimensional array, which finally allows me to index A:
A:index(1,torch.squeeze(sel:select(2,1)))
This is very cumbersome; in e.g. Matlab all I'd have to do is
A(S>=5,:)
Can anyone suggest a better way?
One possible alternative is:
sel = S:ge(5):expandAs(A) -- now you can use this mask with the [] operator
A1 = A[sel]:unfold(1, 2, 2) -- unfold to get back a 2D tensor
Example:
> A = torch.rand(3,2)-1
-0.0047 -0.7976
-0.2653 -0.4582
-0.9713 -0.9660
[torch.DoubleTensor of size 3x2]
> S = torch.Tensor{{6}, {1}, {5}}
6
1
5
[torch.DoubleTensor of size 3x1]
> sel = S:ge(5):expandAs(A)
1 1
0 0
1 1
[torch.ByteTensor of size 3x2]
> A[sel]
-0.0047
-0.7976
-0.9713
-0.9660
[torch.DoubleTensor of size 4]
> A[sel]:unfold(1, 2, 2)
-0.0047 -0.7976
-0.9713 -0.9660
[torch.DoubleTensor of size 2x2]
There are two simpler alternatives:
Use maskedSelect:
result=A:maskedSelect(your_byte_tensor)
Use a simple element-wise multiplication, for example
result=torch.cmul(A,S:gt(0))
The second one is very useful if you need to keep the shape of the original matrix (i.e A), for example to select neurons in a layer at backprop. However, since it puts zeros in the resulting matrix whenever the condition dictated by the ByteTensor doesn't apply, you can't use it to compute the product (or median, etc.). The first one only returns the elements that satisfy the condittion, so this is what I'd use to compute products or medians or any other thing where I don't want zeros.