I have 2 sets of datapoints:
A: mx10
B: nx10
Distance matrix D of data points in A and B: mxn
How can I use the distance matrix D to extract the k rows of A whose distances to the data points in B are smallest? The matrix should have the size of nxk. I do not want to loop through each column and row of the matrix, so I am interested in a way to do this using matrix operations only.
from scipy.spatial import distance_matrix
D = distance_matrix(A, B)
Assuming that the full array D is already given and "distance to B" means "smallest of all the distances to all elements in B", then it should be something like
d = D.min(axis=1) # m-long vector of distances from points in A to B
order = d.argsort() # an array of indices in d sorted by the corresponding values
kD = D[order[:k], :] # rows of D for the k closest points; A[order[:k], :] gives the points themselves
This is not very efficient if k is much smaller than m, since it sorts all m elements instead of just finding the k smallest. But it should do the trick.
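If the full sort turns out to be a bottleneck, np.argpartition can find the k smallest entries without sorting everything. The snippet below is only a sketch: the random A and B, the value of k, and the broadcasting-based distance matrix are placeholders for illustration, not part of the original question.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(200, 10))  # placeholder data, m = 200
B = rng.normal(size=(50, 10))   # placeholder data, n = 50
k = 5

# Pairwise Euclidean distances via broadcasting, shape (m, n)
D = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)

d = D.min(axis=1)                    # distance from each row of A to its nearest row of B
idx = np.argpartition(d, k)[:k]      # indices of the k smallest distances, in no particular order
idx = idx[np.argsort(d[idx])]        # optional: order those k indices by distance
kA = A[idx]                          # the k rows of A closest to B
```

The partition step is linear in m on average, so only the final k entries are actually sorted. This requires k < m.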
Assume I have data like this:
x = np.random.randn(4, 100000)
and I fit a histogram
hist = np.histogramdd(x, density=True)
What I want is to get the probability of a number g, e.g. g=0.1. Assume some hypothetical function foo:
g = 0.1
prob = foo(hist, g)
print(prob)
>> 0.2223124214
How could I do something like this, where I get a probability back for a single number or a vector of numbers from a fitted histogram? I am especially interested in histograms that are N-dimensional.
histogramdd takes O(r^D) memory, where r is the number of bins per dimension and D is the number of dimensions, and unless you have a very large dataset or a very small dimension you will get a poor estimate. Consider your example data, 100k points in 4-D space: the default histogram will be 10 x 10 x 10 x 10, so it will have 10k bins.
x = np.random.randn(4, 100000)
hist = np.histogramdd(x.transpose(), density=True)
np.mean(hist[0] == 0)
gives something around 0.77, meaning that 77% of the bins in the histogram have no points.
You probably want to smooth the distribution. Unless you have a good reason not to, I would suggest using a Gaussian kernel density estimate:
import numpy as np
import scipy.stats
x = np.random.randn(4, 100000) # d x n array
f = scipy.stats.gaussian_kde(x) # d-dimensional PDF
f([1,2,3,4]) # evaluate the PDF at a given point
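If you do want to read values off the histogram itself, one option is to locate the bin each query point falls into and return that bin's density. The helper below is a rough sketch; hist_density is a made-up name, not a NumPy function, and it returns a density (which you would multiply by the bin volume to get a probability), with 0 for points outside the histogram range.

```python
import numpy as np

def hist_density(hist, edges, points):
    """Look up the histogram density at each query point (hypothetical helper)."""
    points = np.atleast_2d(np.asarray(points, dtype=float))  # (n_points, d)
    inside = np.ones(len(points), dtype=bool)
    idx = []
    for dim, e in enumerate(edges):
        # Bin index along this dimension: the last edge that is <= the coordinate
        i = np.searchsorted(e, points[:, dim], side="right") - 1
        # np.histogramdd treats the right-most edge as part of the last bin
        i[points[:, dim] == e[-1]] = len(e) - 2
        inside &= (i >= 0) & (i < len(e) - 1)
        idx.append(np.clip(i, 0, len(e) - 2))
    dens = np.zeros(len(points))
    dens[inside] = hist[tuple(idx)][inside]
    return dens

rng = np.random.default_rng(0)
x = rng.normal(size=(1000, 2))                     # n x d sample, small for illustration
hist, edges = np.histogramdd(x, bins=5, density=True)

# Query the centre of bin (2, 3) and a point far outside the data range
centre = [(edges[0][2] + edges[0][3]) / 2, (edges[1][3] + edges[1][4]) / 2]
values = hist_density(hist, edges, [centre, [100.0, 100.0]])
```

Because most bins are empty in higher dimensions, many lookups will return exactly 0, which is the reason smoothing is usually the better choice.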
I have N Gaussian distributions (multivariate) with N different means (covariance is the same for all of them) in D dimensions.
I also have N evaluation points, where I want to evaluate each of these (log) PDFs.
This means I need to get a NxN matrix, call it "kernels". That is, the (i,j)-th entry is the j-th Gaussian evaluated at the i-th point. A naive approach is:
from torch.distributions.multivariate_normal import MultivariateNormal
import numpy as np
# means contains all N means as rows and is thus N x D
# same for eval_points
# cov is not a problem , just a DxD matrix that is equal for all N Gaussians
kernels = np.empty((N,N))
for i in range(N):
    for j in range(N):
        kernels[i][j] = MultivariateNormal(means[j], cov).log_prob(eval_points[i])
One of the for loops is easy to get rid of: if, for example, we wanted all the evaluations of the first Gaussian, we could simply do:
MultivariateNormal(means[0], cov).log_prob(eval_points).squeeze()
and this gives us an N x 1 list of values, that is, the first Gaussian evaluated at all N points.
My problem is that this doesn't work for getting the full N x N matrix:
kernels = MultivariateNormal(means, cov).log_prob(eval_points).squeeze()
It doesn't figure out that it should evaluate each mean with all evaluation points in eval_points, and it doesn't return a NxN matrix with these which would be what I want. Therefore, I am not able to get rid of the second for loop, over all N Gaussians.
You are passing wrongly shaped tensors to MultivariateNormal's constructor. You should pass a collection of mean vectors of shape (N, D) and a collection of covariance matrices of shape (N, D, D) for N D-dimensional Gaussians.
You are passing mu of shape (N, D), but your covariance matrix does not have a matching batch shape. You can repeat the covariance matrix N times before passing it to the MultivariateNormal constructor. Here's one way to do it.
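import numpy as np
import torch
from sklearn.datasets import make_spd_matrix
from torch.distributions.multivariate_normal import MultivariateNormal
N = 10
D = 3
# mu contains all N means as rows and is thus N x D; same for eval_points
# cov is a single D x D covariance matrix shared by all N Gaussians
mu = torch.from_numpy(np.random.randn(N, D))
eval_points = torch.from_numpy(np.random.randn(N, D))
cov = torch.from_numpy(make_spd_matrix(D))
cov_n = cov[None, ...].repeat_interleave(N, 0)
assert cov_n.shape == (N, D, D)
dist = MultivariateNormal(mu, cov_n) # batch of N Gaussians
# add a singleton axis so that every point is broadcast against every Gaussian
kernels = dist.log_prob(eval_points[:, None, :]) # N x N, entry (i, j) is Gaussian j at point i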
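For checking the result, the same N x N matrix can also be computed directly from the Gaussian log-density formula. The following is a NumPy sketch rather than the torch approach discussed above; pairwise_gaussian_logpdf is a made-up helper name, and the random means, points and covariance are placeholders.

```python
import numpy as np

def pairwise_gaussian_logpdf(eval_points, means, cov):
    """Entry (i, j) is the log-density of Gaussian j (shared cov) at point i."""
    n_points, dim = eval_points.shape
    L = np.linalg.cholesky(cov)                            # cov = L @ L.T
    diff = eval_points[:, None, :] - means[None, :, :]     # (n_points, n_means, dim)
    # Solve L z = diff for every (point, mean) pair at once
    z = np.linalg.solve(L, diff.reshape(-1, dim).T)        # (dim, n_points * n_means)
    maha = (z ** 2).sum(axis=0).reshape(n_points, len(means))  # squared Mahalanobis distances
    log_det = 2.0 * np.log(np.diag(L)).sum()               # log |cov|
    return -0.5 * (maha + dim * np.log(2.0 * np.pi) + log_det)

rng = np.random.default_rng(0)
N, D = 10, 3
means = rng.normal(size=(N, D))
eval_points = rng.normal(size=(N, D))
M = rng.normal(size=(D, D))
cov = M @ M.T + D * np.eye(D)                              # placeholder SPD covariance

kernels = pairwise_gaussian_logpdf(eval_points, means, cov)  # shape (N, N)
```

The only trick is the pair of singleton axes in diff, which makes broadcasting form all N x N differences in one step.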
My dataframe is [11 x 300]: the column headers are the 'x' values ([0.75,1,1.25,1.5,1.75,2,2.25,2.5,2.75,3,3.25]), and each row holds the corresponding 'y' values. Each row can be described by a function of the following form: a * x ^k + b.
The goal is to add three additional columns, describing a, k and b for that specific row, just like in "Python curve fitting on pandas dataframe then add coef to new columns".
Instead of a polynomial function, my data needs to be described in the following format: a * x **k + b.
As I cannot find any way to derive the coefficients using np.polyfit, I split my dataframe into separate arrays.
import numpy as np
from scipy.optimize import curve_fit
x = np.array([0.75,1,1.25,1.5,1.75,2,2.25,2.5,2.75,3,3.25])
y1 = np.array([288.79,238.32,199.42,181.22,165.50,154.74,152.25,152.26,144.81,144.81,144.81])
y2 = np.array([309.92,255.75,214.02,194.48,177.61,166.06,163.40,163.40,155.41,155.41,155.41])
...
y300 = np.array([352.18,290.63,243.20,221.00,201.83,188.71,185.68,185.68,176.60,176.60,176.60])
def func(x,a,k,b):
    return a * (x**k) + b
popt1, pcov = curve_fit(func,x,y1, p0 = (300,-0.5,0))
...
popt300, pcov = curve_fit(func,x,y300, p0 = (300,-0.5,0))
output:
popt1
[107.73727907 -1.545475 123.48621504]
...
popt300
[131.38411712 -1.5454452 150.59522147]
This works when I split all dataframe rows into arrays and compute popt for every array/row.
To avoid splitting the dataframe into 300 separate arrays, I would prefer to apply the same approach as in "Python curve fitting on pandas dataframe then add coef to new columns":
my_coep_array = pd.DataFrame(np.polyfit(x, df.values,1)).T
But how do I make this fit a * x **k + b instead of the polynomial that np.polyfit uses?
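np.polyfit only fits polynomials, so one possible route is to keep curve_fit and apply it to every row in a single comprehension. This is a sketch under assumptions: Y stands in for df.to_numpy(), the two rows are synthetic data generated from made-up parameters, and the column names in the final comment are only a suggestion.

```python
import numpy as np
from scipy.optimize import curve_fit

def func(x, a, k, b):
    return a * x**k + b

x = np.array([0.75, 1, 1.25, 1.5, 1.75, 2, 2.25, 2.5, 2.75, 3, 3.25])

# Synthetic stand-in for df.to_numpy(): one row of y values per curve
true_params = np.array([[100.0, -1.5, 120.0],
                        [130.0, -1.2, 150.0]])
Y = np.array([func(x, *p) for p in true_params])

# Fit every row with the same starting guess; each popt is (a, k, b)
p0 = (300, -0.5, 0)
coefs = np.array([curve_fit(func, x, y, p0=p0, maxfev=10000)[0] for y in Y])

# With a real dataframe the three columns could then be attached like this:
# df[["a", "k", "b"]] = coefs
```

This still loops over rows internally, since curve_fit fits one curve per call, but it avoids creating 300 named variables.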
I generated a new matrix B (50, 40) by randomly picking rows from a matrix A (100, 40):
B = A[np.random.randint(0,100,size=50)] # it works fine.
Now, I want to take the rows from A that aren't in matrix B.
C = A not in B # pseudocode.
This should do the job:
import numpy as np
A = np.random.randint(5, size=[100, 40])
l = np.random.choice(100, size=50, replace=False)
B = A[l]
C = A[np.setdiff1d(np.arange(0, 100), l)]
l stores the selected rows, and for C you take the complement of l. Then C is the required matrix.
Note that I set l=np.random.choice(100, size=50, replace=False) to avoid replacement. If you use np.random.randint(0,100,size=50) you may get repeated rows as the same number is selected at random.
Inspired by the question "Check whether each row of a matrix is in another matrix [Python]": first get the indices of the rows of A that exist in B, then take the difference from the full set of A's indices, and finally select the rows of A using that difference.
index = np.argwhere((B[:,None,:] == A[:,:]).all(-1))[:, 1]
C = A[np.setdiff1d(np.arange(100), index)]
The numpy_indexed package (disclaimer: I am its author) has efficient vectorized functionality for all these kinds of operations.
import numpy_indexed as npi
C = npi.difference(A, B)
For each location in the result matrix, instead of storing the dot product of the corresponding row and column in the argument matrices, I would like to store the element-wise product, which will be a vector extending into a third dimension.
One idea would be to convert the argument matrices to vectors with vector entries, and then take their outer product, but I'm not sure how to do this either.
EDIT:
I figured it out before I saw there was a reply. Here is my solution:
def newdot(A, B):
    A = A.reshape((1,) + A.shape)
    B = B.reshape((1,) + B.shape)
    A = A.transpose(2, 1, 0)
    B = B.transpose(1, 0, 2)
    return A * B
What I am doing is taking apart each row and column pair that will have their outer product taken, and forming two lists of them, which then get their contents matrix multiplied together in parallel.
It's a little convoluted (and difficult to explain) but this function should get you what you're looking for:
def f(m1, m2):
    return (m2.A.T * m1.A.reshape(m1.shape[0],1,m1.shape[1]))
m3 = m1 * m2
m3_el = f(m1, m2)
m3[i,j] == sum(m3_el[i,j,:])
m3 == m3_el.sum(2)
The basic idea is to turn the matrices into arrays and do element-by-element multiplication. One of the arrays gets reshaped to have a size of one in its middle dimension, and array broadcasting rules expand this dimension out to match the height of the other array.
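For plain ndarrays (rather than np.matrix objects) the same broadcasting idea can be written with explicit singleton axes, or as a one-line np.einsum. This is a minimal sketch with placeholder random inputs; the index layout, result[i, j, k] = A[i, k] * B[k, j], is chosen to match f above.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 4))   # placeholder inputs
B = rng.normal(size=(4, 5))

# (3, 1, 4) * (1, 5, 4) broadcasts to (3, 5, 4):
# prod[i, j, k] = A[i, k] * B[k, j]
prod = A[:, None, :] * B.T[None, :, :]

# The same tensor written as an einsum without summing over k
prod_einsum = np.einsum("ik,kj->ijk", A, B)

# Summing over the last axis recovers the ordinary matrix product
recovered = prod.sum(axis=2)
```

The einsum form makes the intent explicit: leaving k in the output subscript is what keeps the individual products instead of summing them.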