How can I vectorize this loop in NumPy? It uses sampling from NumPy's binomial() function to estimate the probability that, out of 55 events, exactly m of a particular type occur, where each event has a 5% chance of being that type; i.e. it estimates 55Cm * (0.05)^m * (0.95)^(55-m), where 55Cm = 55!/(m! * (55-m)!).
import numpy as np
M = 7
m = np.arange(M+1)
ntrials = 1000000
p = np.empty(M+1)
for r in m:
    p[r] = np.sum(np.random.binomial(55, 0.05, ntrials)==r)/ntrials
Here is the equivalent code:
p = np.zeros(M+1)
print p
I imagine you didn't intend for your output to always be all zero, but it is! (Under Python 2, the integer count returned by np.sum() divided by the integer ntrials truncates to zero.) So the first thing to do is add a dtype=float argument to your np.sum() call. With that out of the way, we can vectorize the whole thing like this:
samples = np.random.binomial(55, 0.05, (ntrials, M+1))
p = np.sum(samples == m, dtype=float, axis=0) / ntrials
This produces an equivalent, though not identical, result. The reason is that the random number generation is done in a different sequence, so you will get an answer which is "correct" but not identical to the old code. If you want the identical result to before, you can get that by changing the first line to this:
samples = np.random.binomial(55, 0.05, (M+1, ntrials)).T
Then you draw in the same order as before, with no real performance penalty.
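Putting the pieces together, the full vectorized replacement for the original loop looks like this:

import numpy as np

M = 7
m = np.arange(M + 1)
ntrials = 1000000

# one draw of shape (M+1, ntrials), transposed so column r holds the r-th batch of trials
samples = np.random.binomial(55, 0.05, (M + 1, ntrials)).T
# compare column r against the value r, count matches, and normalise by the number of trials
p = np.sum(samples == m, dtype=float, axis=0) / ntrials
print(p)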
I'm trying to calculate the cross-correlation between 2 signals without considering a lag. Essentially I want to recreate the cross-correlation of 2 signals at zero lag, to see if my understanding of how cross-correlation is calculated is correct.
The following is my code:
import numpy as np

x1 = np.linspace(0,2*np.pi,1000)
y1 = np.sin(x1)
#second signal is with a phase shift of pi/4
y2 = np.sin(x1 - np.pi/4)
#do FFT on each signal
y1_fft = np.fft.fft(y1)
y2_fft = np.fft.fft(y2)
#complex conjugate of y2_fft
y2_conj = np.conjugate(y2_fft)
#take inner product of the fft and conjugate of fft
np.inner(y1_fft,y2_conj)
The result is -353199.837 - 2.59E-11i, which is wrong.
In comparison, when I use scipy.signal.correlate, the following is the result:
from scipy import signal as sg
import matplotlib.pyplot as plt

corr = sg.correlate(y1,y2,method='fft')
lags = sg.correlation_lags(y1.shape[0],y2.shape[0])
fig,ax = plt.subplots(1,1,figsize = (10,10))
ax.plot(lags,corr)
As seen, the cross-correlation at zero lag is around 475; however, my result is very different.
Where am I going wrong?
The unnormalized circular correlation is calculated as follows:
# your code
import numpy as np
x1 = np.linspace(0,2*np.pi,1000)
y1 = np.sin(x1)
y2 = np.sin(x1 - np.pi/4)
y1_fft = np.fft.fft(y1)
y2_fft = np.fft.fft(y2)
y2_conj = np.conjugate(y2_fft)
# calculate the correlation (without padding)
corr = np.fft.ifft(y1_fft * y2_conj)
np.sum(y1*y2), corr[0]
By circular correlation I mean that the signal is rolled rather than shifted. Rolling and shifting are equivalent at lag 0, but for e.g. lag=3 you would get something like
np.sum(y1*np.roll(y2, 3)), corr[3]
not the same as
np.sum(y1[3:] * y2[:-3])
If you only want the correlation at lag 0, to be honest I think it is better to compute it directly from the definition: np.inner(y1, y2).
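If you do want the same linear (non-circular) correlation that scipy.signal.correlate returns, one common trick is to zero-pad before the FFT so nothing wraps around. A rough sketch:

import numpy as np
from scipy import signal as sg

x1 = np.linspace(0, 2*np.pi, 1000)
y1 = np.sin(x1)
y2 = np.sin(x1 - np.pi/4)

n = len(y1)
nfft = 2*n - 1                     # enough zero padding to avoid wrap-around
corr_fft = np.fft.ifft(np.fft.fft(y1, nfft) * np.conjugate(np.fft.fft(y2, nfft))).real
# reorder so the lags run from -(n-1) to n-1, like sg.correlation_lags
corr_fft = np.concatenate([corr_fft[-(n-1):], corr_fft[:n]])

corr_ref = sg.correlate(y1, y2, method='fft')
print(np.allclose(corr_fft, corr_ref))   # True, up to numerical error
print(np.inner(y1, y2), corr_fft[n-1])   # correlation at lag 0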
I have a vector of numbers (here random). I'd like to calculate, for each position in the vector, the relation of some statistic (here the mean, to keep the example clear) of the values to the left and to the right of that position.
Here is a procedural example. I'm interested in the vectorized form.
from numpy.random import rand
import numpy as np
numbers = rand(40)
k=np.zeros(numbers.shape)
for i in range(*numbers.shape):
    k[i]=np.mean(numbers[:i])/np.mean(numbers[i:])
This example will return nan in the first iteration but it is not a problem now.
Here's a vectorized way -
n = len(numbers)
fwd = numbers.cumsum()/np.arange(1,n+1)                  # means of numbers[:i+1]
bwd = (numbers[::-1].cumsum()[::-1])/np.arange(n,0,-1)   # means of numbers[i:]
k_out = np.r_[np.nan,fwd[:-1]]/bwd                       # mean(numbers[:i]) / mean(numbers[i:])
Optimizing a bit further with one cumsum, it would be -
n = len(numbers)
r = np.arange(1,n+1)
c = numbers.cumsum()
fwd = c/r                        # means of numbers[:i+1]
b = c[-1]-c                      # sums of numbers[i+1:]
bwd = np.r_[1,b[:-1]]/r[::-1]    # means of numbers[i:]; the leading 1 is a placeholder, as k_out[0] is nan anyway
k_out = np.r_[np.nan,fwd[:-1]]/bwd
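As a quick sanity check, the vectorized result matches the loop everywhere except index 0, which is nan in both:

import numpy as np

rng = np.random.default_rng(0)
numbers = rng.random(40)

# reference loop
k = np.empty(numbers.shape)
for i in range(numbers.shape[0]):
    k[i] = np.mean(numbers[:i]) / np.mean(numbers[i:])   # warns about the empty slice at i=0

# vectorized version from above
n = len(numbers)
r = np.arange(1, n + 1)
c = numbers.cumsum()
fwd = c / r
b = c[-1] - c
bwd = np.r_[1, b[:-1]] / r[::-1]
k_out = np.r_[np.nan, fwd[:-1]] / bwd

print(np.allclose(k[1:], k_out[1:]))   # True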
I spent some time on this, and there is a simple and general solution: numpy.vectorize with the excluded parameter, where the vector to be split must be excluded from vectorisation. The example still uses np.mean, but it can be replaced with any function:
def split_mean(vect,i):
    return np.mean(vect[:i])/np.mean(vect[i:])
v_split_mean = np.vectorize(split_mean)
v_split_mean.excluded.add(0)
numbers = np.random.rand(30)
indexes = np.arange(*numbers.shape)
v_split_mean(numbers,indexes)
Problem
I was working on the problem described here. I have two goals.
For any given system of linear equations, figure out which variables have unique solutions.
For those variables with unique solutions, return the minimal list of equations such that knowing those equations determines the value of that variable.
For example, in the following set of equations
X = a + b
Y = a + b + c
Z = a + b + c + d
The appropriate output should be c and d, where X and Y determine c and Y and Z determine d.
Parameters
I'm provided a two-column pandas DataFrame named InputDataSet, whose columns are Equation and Variable. Each row represents a variable's membership in a given equation. For example, the above set of equations would be represented as
InputDataSet = pd.DataFrame([['X','a'],['X','b'],['Y','a'],['Y','b'],['Y','c'],
['Z','a'],['Z','b'],['Z','c'],['Z','d']],columns=['Equation','Variable'])
The output will be stored in a 2-column DataFrame named OutputDataSet as well, where the first column contains the variables that have a unique solution and the second is a comma-delimited string of the minimal set of equations needed to solve for that variable. For example, the correct OutputDataSet would look like
OutputDataSet = pd.DataFrame([['c','X,Y'],['d','Y,Z']],columns=['Variable','EquationList'])
Current Solution
My current solution takes the InputDataSet and converts it into a NetworkX graph. After splitting the graph into connected subgraphs, it converts each subgraph into a biadjacency matrix (since the graph is by nature bipartite). After this conversion, the SVD is computed, and the nullspace and pseudoinverse are calculated from the SVD (to see how they are calculated, see here and here: look at the source code for numpy.linalg.pinv and the cookbook function for nullspace; I fused the two functions since they both use the SVD).
After calculating the nullspace and pseudo-inverse, and rounding to a given tolerance, I find all rows of the nullspace whose coefficients are all 0 and report those variables as having a unique solution; for each such variable, I return the equations that have non-zero coefficients in the corresponding row of the pseudo-inverse.
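To make that extraction step concrete, here is a small sketch on the toy system above, using numpy.linalg.svd and numpy.linalg.pinv directly rather than the fused routine below:

import numpy as np

# equations X, Y, Z over variables a, b, c, d; all coefficients are 1
A = np.array([[1, 1, 0, 0],    # X = a + b
              [1, 1, 1, 0],    # Y = a + b + c
              [1, 1, 1, 1]],   # Z = a + b + c + d
             dtype=float)

u, s, vt = np.linalg.svd(A)
rank = (s > 1e-10).sum()
nullspace = vt[rank:].conj().T   # one row per variable, one column per nullspace basis vector
print(np.round(nullspace, 2))    # rows for c and d are zero, so c and d are uniquely determined

pinv = np.linalg.pinv(A)
print(np.round(pinv, 2))
# row for c is approximately [-1,  1, 0]  ->  c = Y - X
# row for d is approximately [ 0, -1, 1]  ->  d = Z - Y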
Here is the code:
import networkx as nx
import pandas as pd
import numpy as np
import numpy.core as cr
def svd_lite(a, tol=1e-2):
    wrap = getattr(a, "__array_prepare__", a.__array_wrap__)
    rcond = cr.asarray(tol)
    a = a.conjugate()
    u, s, vt = np.linalg.svd(a)
    nnz = (s >= tol).sum()
    ns = vt[nnz:].conj().T
    shape = a.shape
    if shape[0]>shape[1]:
        u = u[:,:shape[1]]
    elif shape[1]>shape[0]:
        vt = vt[:shape[0]]
    cutoff = rcond[..., cr.newaxis] * cr.amax(s, axis=-1, keepdims=True)
    large = s > cutoff
    s = cr.divide(1, s, where=large, out=s)
    s[~large] = 0
    res = cr.matmul(cr.swapaxes(vt, -1, -2), cr.multiply(s[..., cr.newaxis],
                                                         cr.swapaxes(u, -1, -2)))
    return (wrap(res),ns)
cols = InputDataSet.columns
tolexp=2
graphs = nx.connected_component_subgraphs(nx.from_pandas_dataframe(InputDataSet,cols[0],
cols[1]))
OutputDataSet = []
Eqs = InputDataSet[cols[0]].unique()
Vars = InputDataSet[cols[1]].unique()
for i in graphs:
    EqList = np.array([val for val in np.array(i.nodes) if val in Eqs])
    VarList = [val for val in np.array(i.nodes) if val in Vars]
    pinv,nulls = svd_lite(nx.bipartite.biadjacency_matrix(i,EqList,VarList,format='csc')
                          .astype(float).todense(),tol=10**-tolexp)
    df2 = np.where(~np.round(nulls,tolexp).any(axis=1))[0]
    df3 = np.round(np.array(pinv),tolexp)
    OutputDataSet.extend([[VarList[i],",".join(EqList[np.nonzero(df3[i])])] for i in df2])
OutputDataSet = pd.DataFrame(OutputDataSet)
Issues
On the data that I've tested this algorithm on, it performs pretty well with decent execution time. However, the main issue is that it suggests far too many equations as required to determine a given variable.
Often, with datasets of 10,000 equations, the algorithm will claim that 8,000 of those 10,000 are required to determine a given variable, which most definitely is not the case.
I tried raising the tolerance (the precision to which I round the coefficients in the pseudo-inverse) to 0.1, but even then, nearly 5,000 equations had non-zero coefficients.
I had conjectured that perhaps the pseudo-inverse is collapsing upon a non-optimal set of coefficients, but the Moore-Penrose pseudoinverse is unique, so that isn't a possibility.
Am I doing something wrong here? Or is the approach I'm taking not going to give me what I desire?
Further Notes
All of the coefficients of all of the variables are 1
The results the current algorithm is producing are reliable ... when I multiply any vector of equation totals by the pseudoinverse generated by the algorithm, I correctly recover the values of the variables claimed to have a unique solution, which is promising.
What I want to know here is either whether I'm doing something wrong in how I'm extrapolating information from the pseudo-inverse, or whether my approach is completely wrong.
I apologize for not posting any actual results, but not only are they quite large, they are also somewhat unintuitive, since they are reformatted into XML, which would probably take another question to explain anyway.
Thank you for your time!
I want to understand how convolution works.
Here is some code:
import numpy
import matplotlib.pyplot as plt
from scipy import misc

data = misc.imread("path_to_a_512x512_grayscale_image.png")
data = data/255.0

masque = numpy.array([[-1,0,1],
                      [-2,0,0],
                      [-1,0,1]],numpy.double)

def my_convolution(image, masque):
    hauteur,largeur = image.shape
    resultat = numpy.empty((hauteur,largeur))
    for y in range(1,hauteur-1):
        for x in range(1,largeur-1):
            pixel = 0.0
            for ym in range(3):
                for xm in range(3):
                    pixel += masque[ym,xm]*image[y-1+ym,x-1+ym]
            resultat[y,x]=pixel/9.0
    return resultat

my_result = my_convolution(data,masque)
plt.imshow(my_result, cmap='gray')
The result is not exactly the same as with this basic method below. My method above gives a picture that seems to be darker.
from scipy import signal
result2 = signal.convolve2d(data, masque)
result2 = result2[1:-1,1:-1]
plt.imshow(result2, cmap='gray')
Can anyone explain to me why these two pieces of code do not give the same result?
I do not want to know which method is faster; I know the first method is very ugly, I just want to understand.
Thanks
The convolution requires going backwards over one of the convolved functions, which means subtracting the inner indices, not adding them. Also, the second index in the image access expression has mismatched terms (ym where xm should be). So,
pixel += masque[ym,xm]*image[y-1+ym,x-1+ym]
should instead be more like
pixel += masque[ym,xm]*image[y-1-ym,x-1-xm]
For confirmation look at the code that runs when you call signal.convolve2d (specifically here and here). The inner indices match their respective outer indices, and they are subtracted during a convolution, not added.
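As a sanity check, here is a small sketch on a random array (instead of an image) showing that a manual loop with subtracted kernel indices, and without the division by 9, matches scipy.signal.convolve2d on the interior pixels; the exact +1/-1 offset just depends on how you align or crop the scipy output (mode='same' here):

import numpy as np
from scipy import signal

rng = np.random.default_rng(0)
image = rng.random((8, 8))
masque = np.array([[-1, 0, 1],
                   [-2, 0, 0],
                   [-1, 0, 1]], float)

def manual_convolution(image, masque):
    hauteur, largeur = image.shape
    resultat = np.zeros((hauteur, largeur))
    for y in range(1, hauteur - 1):          # interior only, so the 3x3 window stays inside
        for x in range(1, largeur - 1):
            pixel = 0.0
            for ym in range(3):
                for xm in range(3):
                    # kernel indices are subtracted from the image indices
                    pixel += masque[ym, xm] * image[y + 1 - ym, x + 1 - xm]
            resultat[y, x] = pixel           # no division by 9
    return resultat

reference = signal.convolve2d(image, masque, mode='same')
mine = manual_convolution(image, masque)
print(np.allclose(mine[1:-1, 1:-1], reference[1:-1, 1:-1]))   # True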
I don't understand why ifft(fft(myFunction)) is not the same as my function. It seems to be the same shape, but a factor of 2 out (ignoring the constant y-offset). All the documentation I can see says there is some normalisation that fft doesn't do, but that ifft should take care of that. Here's some example code below - you can see where I've bodged the factor of 2 to give me the right answer. Thanks for any help - it's driving me nuts.
import numpy as np
import scipy.fftpack as fftp
import matplotlib.pyplot as plt
def fourier_series(x, y, wn, n=None):
    # get FFT
    myfft = fftp.fft(y, n)
    # kill higher freqs above wavenumber wn
    myfft[wn:] = 0
    # make new series
    y2 = fftp.ifft(myfft).real
    # find constant y offset
    myfft[1:]=0
    c = fftp.ifft(myfft)[0]
    # remove c, apply factor of 2 and re apply c
    y2 = (y2-c)*2 + c
    plt.figure(num=None)
    plt.plot(x, y, x, y2)
    plt.show()

if __name__=='__main__':
    x = np.array([float(i) for i in range(0,360)])
    y = np.sin(2*np.pi/360*x) + np.sin(2*2*np.pi/360*x) + 5
    fourier_series(x, y, 3, 360)
You're removing half the spectrum when you do myfft[wn:] = 0. The negative frequencies are those in the top half of the array and are required.
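A quick way to see that layout: np.fft.fftfreq prints the frequency of each FFT bin, with the negative frequencies sitting in the top half of the array:

import numpy as np
print(np.fft.fftfreq(8))
# [ 0.     0.125  0.25   0.375 -0.5   -0.375 -0.25  -0.125]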
You have a second fudge to get your results, which is taking the real part to find y2: y2 = fftp.ifft(myfft).real (fftp.ifft(myfft) has a non-negligible imaginary part because of the asymmetry in the spectrum).
Fix it with myfft[wn:-wn] = 0 instead of myfft[wn:] = 0, and remove the fudges. So the fixed code looks something like:
import numpy as np
import scipy.fftpack as fftp
import matplotlib.pyplot as plt

def fourier_series(x, y, wn, n=None):
    # get FFT
    myfft = fftp.fft(y, n)
    # kill higher freqs above wavenumber wn
    myfft[wn:-wn] = 0
    # make new series
    y2 = fftp.ifft(myfft)
    plt.figure(num=None)
    plt.plot(x, y, x, y2)
    plt.show()

if __name__=='__main__':
    x = np.array([float(i) for i in range(0,360)])
    y = np.sin(2*np.pi/360*x) + np.sin(2*2*np.pi/360*x) + 5
    fourier_series(x, y, 3, 360)
It's really worth paying attention to the interim arrays that you are creating when trying to do signal processing. Invariably, there are clues as to what is going wrong that should direct you to the problem. In this case, taking the real part masked the problem and made your task more difficult.
Just to add another quick point: sometimes taking the real part of the resultant array is exactly the correct thing to do. It's often the case that you end up with an imaginary part in the signal output which is just down to numerical errors in the input to the inverse FFT. Typically this manifests itself as very small imaginary values, so taking the real part gives you essentially the same array.
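For instance, a tiny illustration of how small that residue typically is when you simply round-trip a real signal:

import numpy as np

y = np.sin(np.linspace(0, 2*np.pi, 360))
y2 = np.fft.ifft(np.fft.fft(y))
print(np.max(np.abs(y2.imag)))   # on the order of 1e-16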
You are killing the negative frequencies between 0 and -wn.
I think what you mean to do is to set myfft to 0 for all frequencies outside [-wn, wn].
Change the following line:
myfft[wn:] = 0
to:
myfft[wn:-wn] = 0
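As a quick check, with the symmetric zeroing the inverse FFT reproduces the original signal directly, with no factor of 2 and only a negligible imaginary part (a small sketch using the same test signal as above, with wn = 3):

import numpy as np
import scipy.fftpack as fftp

x = np.arange(360.0)
y = np.sin(2*np.pi/360*x) + np.sin(2*2*np.pi/360*x) + 5

myfft = fftp.fft(y)
myfft[3:-3] = 0                      # keep the offset plus the first two harmonics, positive and negative
y2 = fftp.ifft(myfft)

print(np.max(np.abs(y2.imag)))       # ~1e-15
print(np.max(np.abs(y2.real - y)))   # ~1e-13: matches the original signal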