To find an inverse matrix of A with LU decomposition - numpy

The task asks me to generate a matrix A with 50 rows and 50 columns, using a random number generator seeded with 1007092020 and values in the range [0,1].
import numpy as np
np.random.seed(1007092020)
A = np.random.randint(2, size=(3,3))
Then I have to find an inverse matrix of A with LU decomposition.
No idea how to do that.

If you need matrix A to be a 50 x 50 matrix of random floating-point numbers in [0, 1), you can create it with the following code:
import numpy as np
np.random.seed(1007092020)
A = np.random.random((50,50))
If you instead want integers that are either 0 or 1 (1 included), you can do this:
A = np.random.randint(0,2,(50,50))
If you want to compute the inverse using LU decomposition, you can use SciPy. Note that because you are generating random matrices, your matrix may not have an inverse (this is a real possibility for the 0/1 integer matrix in particular). In that case, the inverse cannot be found.
Here's some code that will work when A does have an inverse.
from scipy.linalg import lu
p,l,u = lu(A, permute_l = False)
lu returns a permutation matrix (p) together with the lower (l) and upper (u) triangular matrices, such that A = P L U. Folding the permutation into L, we can find the inverse of A with the following equation: A^-1 = U^-1 (P L)^-1
l = np.dot(p,l)
l_inv = np.linalg.inv(l)
u_inv = np.linalg.inv(u)
A_inv = np.dot(u_inv,l_inv)
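As a possible alternative to inverting the two factors with np.linalg.inv, the LU factors can be reused to solve A X = I directly. The sketch below assumes SciPy's lu_factor and lu_solve; the generator rng and the name residual are illustrative choices, not part of the original task, and default_rng does not reproduce the numbers that np.random.seed would give.

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

# Illustrative generator (assumption): the original task uses np.random.seed instead.
rng = np.random.default_rng(1007092020)
A = rng.random((50, 50))

# Factor A once (LU with partial pivoting), then solve A X = I column by column.
lu_piv = lu_factor(A)
A_inv = lu_solve(lu_piv, np.eye(A.shape[0]))

# Largest deviation of A @ A_inv from the identity; should be close to zero.
residual = np.abs(A @ A_inv - np.eye(A.shape[0])).max()
print(residual)
```

Checking the residual is a cheap way to confirm that the matrix was in fact invertible and reasonably well conditioned.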

Related

How to detect multivariate outliers within large dataset?

How do I detect multivariate outliers in a large dataset with more than 50 variables? Do I need to plot all of the variables, do I have to group them into independent and dependent variables, or do I need an algorithm for this?
There is a distance measure designed for finding multivariate outliers. It is called the Mahalanobis distance (MD).
The MD measures the separation between a distribution D and a data point x by generalizing the z-score: it indicates how far x is from the mean of D in terms of standard deviations.
You can use the function below to find outliers. It returns the indices of the outliers along with the distances.
from scipy.stats import chi2
from scipy import linalg
import numpy as np
def mahalanobis_method(df):
    # M-Distance
    x_minus_mean = df - df.mean()  # subtract each column's mean
    cov = np.cov(df.values.T)  # Covariance
    inv_covmat = linalg.inv(cov)  # Inverse covariance
    left_term = np.dot(x_minus_mean, inv_covmat)
    mahal = np.dot(left_term, x_minus_mean.T)
    md = np.sqrt(mahal.diagonal())
    # Flag as outliers
    outliers = []
    # Cut-off point
    C = np.sqrt(chi2.ppf((1-0.001), df=df.shape[1]))  # degrees of freedom = number of variables
    for i, v in enumerate(md):
        if v > C:
            outliers.append(i)
    return outliers, md
If you want to study more about Mahalanobis Distance and its formula you can read this blog.
So, how to understand the above formula? Let’s take the (x – m)^T . C^(-1) term. (x – m) is essentially the distance of the vector from the mean. We then divide this by the covariance matrix (or multiply by the inverse of the covariance matrix). If you think about it, this is essentially a multivariate equivalent of the regular standardization (z = (x – mu)/sigma).
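For a large dataset, forming the full n x n product only to read its diagonal can exhaust memory. A minimal sketch of the same computation done row by row with np.einsum follows; the function name mahalanobis_rowwise, the alpha parameter, and the synthetic data are assumptions made for illustration.

```python
import numpy as np
from scipy.stats import chi2

def mahalanobis_rowwise(X, alpha=0.001):
    """Return (outlier indices, distances) without building an n x n matrix."""
    X = np.asarray(X, dtype=float)
    diff = X - X.mean(axis=0)                      # (x - m) for every row
    inv_cov = np.linalg.inv(np.cov(X, rowvar=False))
    # Squared distance per row: (x - m)^T C^-1 (x - m)
    d2 = np.einsum("ij,jk,ik->i", diff, inv_cov, diff)
    cutoff = chi2.ppf(1 - alpha, df=X.shape[1])    # cut-off on the squared distance
    return np.flatnonzero(d2 > cutoff), np.sqrt(d2)

# Synthetic example (assumption): 1000 rows, 5 variables, one planted outlier.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
X[0] = 25.0
outliers, md = mahalanobis_rowwise(X)
print(outliers[:10])
```

The cut-off is applied to the squared distance here, which is equivalent to comparing the distance against the square root of the chi-square quantile.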

numpy.corrcoef() MemoryError

I can't understand the MemoryError I get when using numpy.corrcoef() to find the correlation coefficient between two vectors, smin and smax, as follows:
import numpy as np
from numpy import random as rn
r=0.01
sigma=0.2
T=1
K=1
N=252
h=T/N
M = 50000
Z = rn.randn(M,N)
S=np.ones((M,N+1))
smax=np.ones((M,1))
smin=np.ones((M,1))
for i in range(0,N):
    S[:,i+1]=S[:,i]*(np.exp((r-(sigma**2)/2)*h+sigma*Z[:,i]*np.sqrt(h)))
for j in range(0,M):
    smax[j,:]=np.exp(-r*T)*(np.max(S[j,:])>K)*(np.max(S[j,:])-K)
    smin[j,:]=np.exp(-r*T)*(np.min(S[j,:])<K)*(K-np.min(S[j,:]))
c=np.corrcoef(smax,smin)
print(c)
If there is another way to find the correlation coefficient, such as using pandas, that is also fine.
The shape of your arrays is the problem. The function documentation states that x is a "1-D or 2-D array containing multiple variables and observations. Each row of x represents a variable, and each column a single observation of all those variables." and that y is an additional set of variables and observations. Your arrays have shape (50000, 1), so each one is treated as 50000 variables with a single observation, and stacking them gives 100000 variables. This is therefore trying to allocate an array of size (100000, 100000), which is huge.
If you just want to calculate the Pearson correlation coefficient between two one-dimensional vectors, you can use a much simpler formula than what is implemented here. This documentation has the formula I am referring to.
https://hydroerr.readthedocs.io/en/stable/api/HydroErr.HydroErr.pearson_r.html#HydroErr.HydroErr.pearson_r
To keep using the numpy version, x and y need to be 1D arrays. You can pass them as two separate arguments or stacked together in the single parameter x, as here:
import numpy as np
simulated_array = np.random.rand(50000)
observed_array = np.random.rand(50000)
c = np.corrcoef([simulated_array, observed_array])[1, 0]
More explanation about this here.
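Applied to column vectors of shape (M, 1) like smax and smin in the question, one way to do this is to flatten them with ravel() first. The sketch below uses random stand-in data (an assumption, not the simulated prices from the question) and also evaluates the Pearson formula by hand as a cross-check.

```python
import numpy as np

# Stand-in data (assumption) with the same (M, 1) shape as smax and smin.
rng = np.random.default_rng(0)
M = 50000
smax = rng.random((M, 1))
smin = rng.random((M, 1))

# Flatten to 1-D so each array is one variable with M observations.
c = np.corrcoef(smax.ravel(), smin.ravel())[0, 1]

# The same coefficient from the Pearson formula directly.
x = smax.ravel() - smax.mean()
y = smin.ravel() - smin.mean()
c_manual = (x @ y) / np.sqrt((x @ x) * (y @ y))
print(c, c_manual)
```

With 1-D inputs the result of np.corrcoef is only a 2 x 2 matrix, so memory is no longer an issue.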

Efficient way to calculate the pairwise matrix product between one tensor and all the rolling of another tensor

Suppose we have two tensors:
tensor A whose shape is (d,m,n)
tensor B whose shape is (d,n,l).
If we want the pairwise matrix products of the right-most matrices of A and B, I think we can use np.einsum('dmn,...nl->d...ml',A,B), whose result has shape (d,d,m,l). However, I do not want the products of all the pairs.
Given a parameter k, 1<=k<=d, I want to get the following pairwise matrix products:
from A(0,...)@B(0,...) to A(0,...)@B(k-1,...);
from A(1,...)@B(1,...) to A(1,...)@B(k,...);
....;
from A(d-2,...)@B(d-2,...), A(d-2,...)@B(d-1,...) to A(d-2,...)@B(k-3,...);
from A(d-1,...)@B(d-1,...) to A(d-1,...)@B(k-2,...).
Note that here we treat tensor B in a rolling (wrap-around) way, like numpy.roll.
In the end, we get a tensor whose shape is (d,k,m,l).
What is the most efficient way to do this?
I know several ways like:
First compute np.einsum('dmn,...nl->d...ml',A,B), then use a mask to extract the (d,k) pairs.
Tile B first, then use einsum in some way.
But I think there exists a better way.
I doubt you can do much better than a for loop. Here is, for example, a vectorized version using einsum and stride_tricks compared to a double for loop:
Code:
from simple_benchmark import BenchmarkBuilder, MultiArgument
import numpy as np
from numpy.lib.stride_tricks import as_strided
B = BenchmarkBuilder()
@B.add_function()
def loopy(A,B,k):
    d,m,n = A.shape
    l = B.shape[-1]
    out = np.empty((d,k,m,l),int)
    for i in range(d):
        for j in range(k):
            # product of A[i] with the j-th rolled matrix of B
            out[i,j] = A[i]@B[(i+j)%d]
    return out
@B.add_function()
def vectory(A,B,k):
    d,m,n = A.shape
    l = B.shape[-1]
    # append the first k-1 matrices so the rolling window can wrap around
    BB = np.concatenate([B,B[:k-1]],0)
    # view BB as (d,k,n,l) sliding windows without copying
    BB = as_strided(BB,(d,k,n,l),np.repeat(BB.strides,(2,1,1)))
    return np.einsum("ikl,ijln->ijkn",A,BB)
@B.add_arguments('d x k x m x n x l')
def argument_provider():
    for exp in range(10):
        d,k,m,n,l = (np.r_[1.6,1.5,1.5,1.5,1.5]**exp*(4,2,2,2,2)).astype(int)
        print(d,k,m,n,l)
        A = np.random.randint(0,10,(d,m,n))
        B = np.random.randint(0,10,(d,n,l))
        yield k*d*m*n*l,MultiArgument([A,B,k])
r = B.run()
r.plot()
import pylab
pylab.savefig('diagwa.png')
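For readers who would rather avoid as_strided, a simpler (though not necessarily faster) sketch builds the rolled index table explicitly and lets fancy indexing gather the windows. The function name rolling_pairs and the small test shapes are assumptions for illustration; unlike the strided view, B[idx] makes a copy of size d*k*n*l.

```python
import numpy as np

def rolling_pairs(A, B, k):
    """out[i, j] = A[i] @ B[(i + j) % d], returned with shape (d, k, m, l)."""
    d = A.shape[0]
    # idx[i, j] is the index into B paired with A[i] at offset j, wrapping around.
    idx = (np.arange(d)[:, None] + np.arange(k)) % d
    # B[idx] has shape (d, k, n, l); contract over the shared axis n.
    return np.einsum("imn,ijnl->ijml", A, B[idx])

rng = np.random.default_rng(0)
d, k, m, n, l = 6, 3, 2, 4, 5
A = rng.integers(0, 10, (d, m, n))
B = rng.integers(0, 10, (d, n, l))
out = rolling_pairs(A, B, k)

# Reference result from the plain double loop.
expected = np.stack([np.stack([A[i] @ B[(i + j) % d] for j in range(k)])
                     for i in range(d)])
print(out.shape, np.array_equal(out, expected))
```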

Difficulty with numpy broadcasting

I have two 2d point clouds (oldPts and newPts) which I wish to combine. They are mx2 and nx2 numpy integer arrays with m and n of order 2000. newPts contains many duplicates or near duplicates of oldPts and I need to remove these before combining.
So far I have used the histogram2d function to produce a 2d representation of oldPts (H). I then compare each newPt to an NxN area of H and if it is empty I accept the point. This last part I am currently doing with a Python loop, which I would like to remove. Can anybody show me how to do this with broadcasting, or perhaps suggest a completely different method of going about the problem? The working code is below.
npzfile = np.load(path+datasetNo+'\\temp.npz')
arrs = npzfile.files
oldPts = npzfile[arrs[0]]
newPts = npzfile[arrs[1]]
# remove all the negative values
oldPts = oldPts[oldPts.min(axis=1)>=0,:]
newPts = newPts[newPts.min(axis=1)>=0,:]
# round to integers
oldPts = np.around(oldPts).astype(int)
newPts = newPts.astype(int)
# put the oldPts into 2d array
H, xedg,yedg= np.histogram2d(oldPts[:,0],oldPts[:,1],
                             bins = [xMax,yMax],
                             range = [[0, xMax], [0, yMax]])
finalNewList = []
N = 5
for pt in newPts:
    if not H[max(0,pt[0]-N):min(xMax,pt[0]+N),
             max(0,pt[1]- N):min(yMax,pt[1]+N)].any():
        finalNewList.append(pt)
finalNew = np.array(finalNewList)
The right way to do this is to use linear algebra to compute the distance between each pair of 2-long vectors, and then accept only the new points that are "different enough" from each old point: using scipy.spatial.distance.cdist:
import numpy as np
oldPts = np.random.randn(1000,2)
newPts = np.random.randn(2000,2)
from scipy.spatial.distance import cdist
dist = cdist(oldPts, newPts)
print(dist.shape) # (1000, 2000)
okIndex = np.min(dist, axis=0) > 5 # True where even the nearest old point is more than 5 away
print(np.sum(okIndex)) # number of accepted new points
finalNew = newPts[okIndex,:]
print(finalNew.shape)
Above I use the Euclidean distance of 5 as the threshold for "too close": any point in newPts that's farther than 5 from all points in oldPts is accepted into finalNew. You will have to look at the range of values in dist to find a good threshold, but your histogram can guide you in picking the best one.
(One good way to visualize dist is to use matplotlib.pyplot.imshow(dist).)
This is a more refined version of what you were doing with the histogram. In fact, you ought to be able to get close to the same answer as the histogram by passing the metric='chebyshev' keyword argument to cdist (the histogram check looks at a square window around each point, which corresponds to the maximum coordinate difference), assuming your histogram bin widths are the same in both dimensions, and using 5 again as the threshold.
(PS. If you're interested in another useful function in scipy.spatial.distance, check out my answer that uses pdist to find unique rows/columns in an array.)
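If the full m x n distance matrix becomes too large, a nearest-neighbour query gives the same accept/reject decision while storing only one distance per new point. This is a sketch assuming scipy.spatial.cKDTree; the synthetic integer points and the threshold value are placeholders, not values from the question.

```python
import numpy as np
from scipy.spatial import cKDTree

# Synthetic stand-ins (assumption) for the two integer point clouds.
rng = np.random.default_rng(0)
oldPts = rng.integers(0, 500, size=(2000, 2))
newPts = rng.integers(0, 500, size=(2000, 2))
threshold = 5.0  # placeholder "too close" distance

# Distance from every new point to its nearest old point.
tree = cKDTree(oldPts)
nearest, _ = tree.query(newPts, k=1)

# Keep only new points whose nearest old point is farther than the threshold.
finalNew = newPts[nearest > threshold]
combined = np.vstack([oldPts, finalNew])
print(finalNew.shape, combined.shape)
```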

Numpy / Scipy - Sparse matrix to vector

I have sparse CSR matrices (each from a product of two sparse vectors) and I want to convert each matrix to a flat vector. I want to avoid using any dense representation or iterating over indexes.
So far, the only solution I have come up with is to iterate over the non-zero elements using the coo representation:
import numpy
from scipy import sparse as sp
matrices = [sp.csr_matrix([[1,2],[3,4]])]*3
vectorSize = matrices[0].shape[0]*matrices[0].shape[1]
flatMatrixData = []
flatMatrixRows = []
flatMatrixCols = []
for i in range(len(matrices)):
    matrix = matrices[i].tocoo()
    flatMatrixData += matrix.data.tolist()
    flatMatrixRows += [i]*matrix.nnz
    flatMatrixCols += [r+c*2 for r,c in zip(matrix.row, matrix.col)]
flatMatrix = sp.coo_matrix((flatMatrixData,(flatMatrixRows, flatMatrixCols)), shape=(len(matrices), vectorSize), dtype=numpy.float64).tocsr()
It is unsatisfying and inelegant. Does anyone know how to achieve this in an efficient way?
Your flatMatrix is (3,4); each row is [1 3 2 4]. If a submatrix is x, then the row is x.A.T.flatten().
F = sp.vstack([x.T.tolil().reshape((1,vectorSize)) for x in matrices])
F is the same (dtype is int). I had to convert each submatrix to lil since csr has not implemented reshape (in my version of sparse). I don't know if other formats work.
Ideally sparse would let you do the whole range of numpy array (or matrix) manipulations, but it isn't there yet.
Given the small dimensions in this example, I won't speculate on the speed of the alternatives.
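In more recent SciPy releases, reshape is available on sparse matrices beyond lil, so the conversion step may no longer be needed. The sketch below assumes a SciPy version in which coo_matrix.reshape exists; the name vector_size mirrors vectorSize from the question.

```python
import numpy as np
from scipy import sparse as sp

matrices = [sp.csr_matrix([[1, 2], [3, 4]])] * 3
vector_size = matrices[0].shape[0] * matrices[0].shape[1]

# Transpose, then reshape row-major: this flattens each matrix column by column,
# matching the r + c*2 column index used in the question.
F = sp.vstack(
    [x.T.tocoo().reshape((1, vector_size)) for x in matrices]
).tocsr()
print(F.toarray())
```

Everything stays sparse throughout; toarray() is called here only to display the small result.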