numpy - find all pixels near a set of pixels

I have a PIL.Image object input of mode '1' (a black & white bitmap) and I would like to determine, for every pixel in the image, whether it's within n pixels (Euclidean distance - n may be around 100 or so) of any of the white pixels.
The motivation is: input represents every pixel that is different between two other images, and I would like to create a highlight region around all those differences to show clearly where the differences occur.
So far I haven't been able to find a fast algorithm for this - the following code works, but the convolution is very slow because the kernel is apparently too large for ndimage.convolve to handle efficiently:
from scipy import ndimage
import numpy as np
from PIL import Image
n = 100
y, x = np.ogrid[:2*n, :2*n]
kernel = (x-n)**2 + (y-n)**2 <= n**2
img = Image.open('input.png')
result = ndimage.convolve(np.array(img), kernel) != 0
Image.fromarray(result).save('result.png')
Example input input.png:
Desired output result.png (there are also some undesired artifacts here that I assume come from over/underflow):
Even with these small images, the computation takes 30 seconds or so.
Can someone recommend a better procedure to compute this? Thanks.

ndimage.convolve uses a very inefficient algorithm to perform the convolution, running in O(n m kn km) time where (n, m) is the shape of the image and (kn, km) is the shape of the kernel. You can use an FFT to do this much more efficiently, in O(n m log(n m)) time. Fortunately, SciPy provides such a function. Here is an example of usage:
import scipy.signal
import numpy as np
from PIL import Image
n = 100
y, x = np.ogrid[:2*n, :2*n]
kernel = (x-n)**2 + (y-n)**2 <= n**2
img = Image.open('input.png')
result = scipy.signal.fftconvolve(img, kernel, mode='same') >= 1.0
Image.fromarray(result).save('result.png')
This is >500 times faster on my machine, and it also fixes the artifacts. Here is the result:

Related

NumPy matrix multiplication is 20X slower than OpenCV's cvtColor

OpenCV converts BGR images to grayscale using the linear transformation Y = 0.299R + 0.587G + 0.114B, according to their documentation.
I tried to mimic it using NumPy, by multiplying the HxWx3 BGR matrix by the 3x1 vector of coefficients [0.114, 0.587, 0.299]', a multiplication that should result in an HxWx1 grayscale image matrix.
The NumPy code is as follows:
import cv2
import numpy as np
import time
im = cv2.imread(IM_PATHS[0], cv2.IMREAD_COLOR)
# Prepare destination grayscale memory
dst = np.zeros(im.shape[:2], dtype = np.uint8)
# BGR -> Grayscale projection column vector
bgr_weight_arr = np.array((0.114,0.587,0.299), dtype = np.float32).reshape(3,1)
for im_path in IM_PATHS:
    im = cv2.imread(im_path, cv2.IMREAD_COLOR)
    t1 = time.time()
    # NumPy multiplication comes here
    dst[:,:] = (im @ bgr_weight_arr).reshape(*dst.shape)
    t2 = time.time()
    print(f'runtime: {(t2-t1):.3f}sec')
Using 12MP images (4000x3000 pixels), the above NumPy-powered process typically takes around 90ms per image, and that is without rounding the multiplication results.
On the other hand, when I replace the matrix multiplication part by OpenCV's function: dst[:,:] = cv2.cvtColor(im, cv2.COLOR_BGR2GRAY), the typical runtime I get is around 5ms per image. I.e., 18X faster!
Can anyone explain how that is possible? I have always been taught to believe that NumPy uses all available acceleration techniques, such as SIMD. So how can OpenCV get so dramatically faster?
Update:
Even when using quantized multiplications, NumPy's runtimes stay in the same range, around 90ms...
bgr_weight_arr_uint16 = np.round(256 * np.array((0.114, 0.587, 0.299))).astype('uint16').reshape(3, 1)
for im_path in IM_PATHS:
    im = cv2.imread(im_path, cv2.IMREAD_COLOR)
    t1 = time.time()
    # NumPy multiplication comes here
    dst[:,:] = np.right_shift(im @ bgr_weight_arr_uint16, 8).reshape(*dst.shape)
    t2 = time.time()
    print(f'runtime: {(t2-t1):.3f}sec')

Central Limit Theorem: Sample means do not follow a normal distribution

The Problem
Good evening.
I am learning about the Central Limit Theorem. As practice, I ran simulations in an attempt to find the mean of a fair die (I know, a toy problem).
I took 4000 samples, and in each sample I rolled a die 50 times (screenshot of the code at the bottom). For each of these 4000 samples I computed the mean. Then, I plotted these 4000 sample means in a histogram (with bin size 0.03) using matplotlib.
Here is the result:
Question
Why aren't the sample means normally distributed given that the conditions for CLT (sample size >= 30) were respected?
Specifically, why does the histogram look like two normal distributions superimposed on top of each other? More intriguingly, why does the "outer" distribution look "discrete" with empty spaces occurring at regular intervals?
It almost seems like the result is off in a systematic way.
All help is greatly appreciated. I am very lost.
Supplementary Code
The code I used to generate the 4000 sample means.
"""
Take multiple samples of dice rolls. For
each sample, compute the sample mean.
With the sample means, plot a histogram.
By the Central Limit Theorem, the sample
means should be normally distributed.
"""
import random

sample_means = []
num_samples = 4000
for i in range(num_samples):
    # Large enough for CLT to hold
    num_rolls = 50
    sample = []
    for j in range(num_rolls):
        observation = random.randint(1, 6)
        sample.append(observation)
    sample_mean = sum(sample) / len(sample)
    sample_means.append(sample_mean)
When num_rolls equals 50, each possible mean is a fraction with denominator 50. So, in reality, you are looking at a discrete distribution.
To create a histogram of a discrete distribution, the bin boundaries are best placed neatly in between the possible values. With a step size of 0.03, some bin boundaries coincide with the values, putting twice as many values into one bin compared to its neighbor. Moreover, due to subtle floating-point rounding problems, the result can become unpredictable when values and boundaries coincide.
Here is some code to illustrate what is going on:
from matplotlib import pyplot as plt
import numpy as np
import random
sample_means = []
num_samples = 4000
for i in range(num_samples):
    num_rolls = 50
    sample = []
    for j in range(num_rolls):
        observation = random.randint(1, 6)
        sample.append(observation)
    sample_mean = sum(sample) / len(sample)
    sample_means.append(sample_mean)

fig, axs = plt.subplots(2, 2, figsize=(14, 8))
random_y = np.random.rand(len(sample_means))
for (ax0, ax1), step in zip(axs, [0.03, 0.02]):
    bins = np.arange(3.01, 4, step)
    ax0.hist(sample_means, bins=bins)
    ax0.set_title(f'step={step}')
    ax0.vlines(bins, 0, ax0.get_ylim()[1], ls=':', color='r')  # show the bin boundaries in red
    ax1.scatter(sample_means, random_y, s=1)  # show the sample means with a random y
    ax1.vlines(bins, 0, 1, ls=':', color='r')  # show the bin boundaries in red
    ax1.set_xticks(np.arange(3, 4, 0.02))
    ax1.set_xlim(3.0, 3.3)  # zoom in to a region to better see the bins
    ax1.set_title('bin boundaries between values' if step == 0.02 else 'chaotic bin boundaries')
plt.show()
PS: Note that the code would run much, much faster if it worked entirely with NumPy arrays instead of Python lists.
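For illustration, here is a minimal sketch (my own, not part of the original answer) of what that fully vectorized version could look like, assuming the same 4000 samples of 50 rolls each:
import numpy as np

# draw all 4000 samples of 50 die rolls in a single call
rolls = np.random.randint(1, 7, size=(4000, 50))
# one mean per row of 50 rolls
sample_means = rolls.mean(axis=1)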

Fastest way to find nearest nonzero value in array from columns in pandas dataframe

I am looking for the nearest nonzero cell in a numpy 3d array based on the i,j,k coordinates stored in a pandas dataframe. My solution below works, but it is slower than I would like. I know my optimization skills are lacking, so I am hoping someone can help me find a faster option.
It takes 2 seconds to find the nearest non-zero for a 100 x 100 x 100 binary array, and I have hundreds of files, so any speed enhancements would be much appreciated!
import numpy as np
import pandas as pd

a = np.random.randint(0, 2, size=(100, 100, 100))
# df with i,j,k of interest
df = pd.DataFrame(np.random.randint(100, size=(100, 3)).tolist(),
                  columns=['i', 'j', 'k'])

def find_nearest(a, df):
    import numpy as np
    import pandas as pd
    import time
    t0 = time.time()
    nzi = np.nonzero(a)
    for i, r in df.iterrows():
        dist = ((r['k'] - nzi[0])**2 +
                (r['i'] - nzi[1])**2 +
                (r['j'] - nzi[2])**2)
        nidx = dist.argmin()
        df.loc[i, ['nk', 'ni', 'nj']] = (nzi[0][nidx],
                                         nzi[1][nidx],
                                         nzi[2][nidx])
    print(time.time() - t0)
    return df
The problem that you are trying to solve looks like a nearest-neighbor search.
The worst-case complexity of the current code is O(n m), with n the number of points to search and m the number of neighbour candidates. With n = 100 and m = 100**3 = 1,000,000, this means about a hundred million iterations. To solve this efficiently, one can use a better algorithm.
The common way to solve this kind of problem consists in putting all elements in a space-partitioning tree data structure (such as a quadtree or octree). Such a data structure helps you locate the nearest elements near a location in O(log(m)) time. As a result, the overall complexity of this method is O(n log(m))! SciPy already implements KD-trees.
Vectorization generally also helps to speed up the computation.
def find_nearest_fast(a, df):
    from scipy.spatial import KDTree
    import numpy as np
    import pandas as pd
    import time
    t0 = time.time()
    candidates = np.array(np.nonzero(a)).transpose().copy()
    tree = KDTree(candidates, leafsize=1024, compact_nodes=False)
    searched = np.array([df['k'], df['i'], df['j']]).transpose()
    distances, indices = tree.query(searched)
    nearestPoints = candidates[indices, :]
    df[['nk', 'ni', 'nj']] = nearestPoints
    print(time.time() - t0)
    return df
This implementation is 16 times faster on my machine. Note the results may differ a bit, since there can be multiple nearest points for a given input point (all at the same distance).

Scipy Butter bandpass is not producing the desired results

So I'm trying to bandpass filter a 24-bit PCM 44.1 kHz wav file. What I would like to do is bandpass each frequency from 0 Hz to 22 kHz.
So far I have loaded the data and can display it on Matplot and it looks like the following.
But when I go to apply the bandpass filter which I got from here
http://scipy-cookbook.readthedocs.io/items/ButterworthBandpass.html
I get the following result:
So I'm trying to bandpass at 100-101Hz as a test, here is my code:
from WaveData import WaveData
import matplotlib.pyplot as plt
from scipy.signal import butter, lfilter, freqz
from scipy.io.wavfile import read
import numpy as np

class Filter:
    def __init__(self, wav):
        self.waveData = WaveData(wav)

    def butter_bandpass(self, lowcut, highcut, fs, order=5):
        nyq = 0.5 * fs
        low = lowcut / nyq
        high = highcut / nyq
        b, a = butter(order, [low, high], btype='band')
        return b, a

    def butter_bandpass_filter(self, data, lowcut, highcut, fs, order):
        b, a = self.butter_bandpass(lowcut, highcut, fs, order=order)
        y = lfilter(b, a, data)
        return y

    def getFilteredSignal(self, freq):
        return self.butter_bandpass_filter(data=self.waveData.file['Data'], lowcut=100, highcut=101, fs=44100, order=3)

    def getUnprocessedData(self):
        return self.waveData.file['Data']

    def plot(self, signalA, signalB=None):
        plt.plot(signalA)
        if signalB is not None:
            plt.plot(signalB)
        plt.show()

if __name__ == "__main__":
    # file = WaveData("kick.wav")
    # fileA = read("kick0.wav")
    f = Filter("kick.wav")
    a, b = f.butter_bandpass(lowcut=100, highcut=101, fs=44100)
    w, h = freqz(b, a, worN=22000)  ## Filtered signal is not working?
    f.plot(h, w)
    print("break")
I don't understand where I have gone wrong.
Thanks
What @WoodyDev said is true: 1 Hz out of 44.1 kHz is way, way too narrow a bandpass for any kind of filter. Just look at the filter coefficients butter returns:
In [3]: butter(5, [100/(44.1e3/2), 101/(44.1e3/2)], btype='band')
Out[3]:
(array([ 1.83424060e-21, 0.00000000e+00, -9.17120299e-21, 0.00000000e+00,
1.83424060e-20, 0.00000000e+00, -1.83424060e-20, 0.00000000e+00,
9.17120299e-21, 0.00000000e+00, -1.83424060e-21]),
array([ 1. , -9.99851389, 44.98765092, -119.95470631,
209.90388506, -251.87018009, 209.88453023, -119.93258575,
44.9752074 , -9.99482662, 0.99953904]))
Look at the b coefficients (the first array): their values are around 1e-20, meaning the filter design totally failed to converge, and if you apply it to any signal, the output will be zero, which is what you found.
You didn't mention your application but if you really really want to keep the signal's frequency content between 100 and 101 Hz, you could take a zero-padded FFT of the signal, zero out the portions of the spectrum outside that band, and IFFT (look at rfft, irfft, and rfftfreq in numpy.fft module).
Here's a function that applies a brick-wall bandpass filter in the Fourier domain using FFTs:
import numpy.fft as fft
import numpy as np

def fftBandpass(x, low, high, fs=1.0):
    """
    Apply a bandpass filter to a signal via FFTs.

    Parameters
    ----------
    x : array_like
        Input signal vector. Assumed to be real-only.
    low : float
        Lower bound of the passband in Hertz. (If less than or equal
        to zero, a high-pass filter is applied.)
    high : float
        Upper bound of the passband, Hertz.
    fs : float
        Sample rate in units of samples per second. If `high > fs / 2`,
        the output is low-pass filtered.

    Returns
    -------
    y : ndarray
        Output signal vector with all frequencies outside the `[low, high]`
        passband zeroed.

    Caveat
    ------
    Note that the energy in `y` will be lower than the energy in `x`, i.e.,
    `sum(abs(y)) < sum(abs(x))`.
    """
    xf = fft.rfft(x)
    f = fft.rfftfreq(len(x), d=1 / fs)
    xf[f < low] = 0
    xf[f > high] = 0
    return fft.irfft(xf, len(x))
if __name__ == '__main__':
    fs = 44.1e3
    N = int(fs)
    x = np.random.randn(N)
    t = np.arange(N) / fs

    import pylab as plt
    plt.figure()
    plt.plot(t, x, t, 100 * fftBandpass(x, 100, 101, fs=fs))
    plt.xlabel('time (seconds)')
    plt.ylabel('signal')
    plt.legend(['original', 'scaled bandpassed'])
    plt.show()
You can put this in a file, fftBandpass.py, and just run it with python fftBandpass.py to see it create the following plot:
Note I had to scale the 1 Hz bandpassed signal by 100 because, after bandpassing that much, there's very little energy in the signal. Also note that the signal living inside this small a passband is pretty much just a sinusoid at around 100 Hz.
If you put the following in your own code: from fftBandpass import fftBandpass, you can use the fftBandpass function.
Another thing you could try is to decimate the signal 100x, i.e., convert it to a signal sampled at 441 Hz. 1 Hz out of 441 Hz is still a crazy-narrow passband, but you might have better luck than trying to bandpass the original signal. See scipy.signal.decimate, but don't call it with q=100; instead, decimate the signal recursively, by 2, then 2, then 5, then 5 (for a total decimation of 100x).
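For reference, here is a minimal sketch of that staged decimation (my own illustration, not code from the answer), assuming one second of noise sampled at 44.1 kHz:
from scipy import signal
import numpy as np

fs = 44100
x = np.random.randn(fs)  # one second of noise at 44.1 kHz, stand-in for the wav data

y = x
for q in (2, 2, 5, 5):  # total decimation factor is 2 * 2 * 5 * 5 = 100
    y = signal.decimate(y, q)  # anti-alias filter, then downsample by q

fs_decimated = fs / 100  # the decimated signal is now sampled at 441 Hz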
So there are some problems with your code which mean you aren't plotting the results correctly, although I believe this isn't your main problem.
Check your code
In the example you linked, they show precisely the process for calculating, and plotting the filter at different orders:
for order in [3, 6, 9]:
    b, a = butter_bandpass(lowcut, highcut, fs, order=order)
    w, h = freqz(b, a, worN=2000)
    plt.plot((fs * 0.5 / np.pi) * w, abs(h), label="order = %d" % order)
You are currently not scaling your frequency axis correctly, nor taking the absolute value of h to get the magnitude information, as the correct code above does.
Check your theory
However, your main issue is the extremely steep bandpass (i.e. only 100 Hz - 101 Hz). It is very rare to see a filter this sharp, as it is very processing intensive (it requires a lot of filter coefficients), and because you are only keeping a 1 Hz range, it will completely get rid of all other frequencies.
So the graph you have shown with the gain at 0 may very well be correct. If you use their example and change the bandpass cutoff frequencies to 100 Hz -> 101 Hz, then the output is an array of (almost if not completely) zeros. This is because it only keeps the energy of the signal in a 1 Hz range, which will be very, very small if you think about it.
If you are doing this for analysis, the frequency spacing tends to be much larger, i.e. octave bands (or smaller divisions of octave bands).
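As a rough illustration (my own, not from the answer) of how much wider those analysis bands are than a 1 Hz passband, here are the full-octave band edges around the standard 1 kHz center frequency:
import numpy as np

# full-octave band centers around the 1 kHz reference
centers = 1000.0 * 2.0 ** np.arange(-5, 5)  # 31.25 Hz ... 16 kHz
lower = centers / np.sqrt(2)                # lower edge of each octave band
upper = centers * np.sqrt(2)                # upper edge of each octave band

for c, lo, hi in zip(centers, lower, upper):
    print(f"{c:8.1f} Hz band: {lo:8.1f} Hz to {hi:8.1f} Hz")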
The Spectrogram
As I am not sure of your end purpose I cannot clarify exactly which route you should take to get there. However, using bandpass filters on every single frequency up to 20kHz seems kind of silly in this day and age.
If I remember correctly, some of the first spectrogram attempts with needles on paper used this technique with analog band pass filter banks to analyze the frequency content. So this makes me think you may be looking for something to do with a spectrogram? It lets you analyze the whole signal's frequency information vs time and still has all of the signal's amplitude information. Python already has spectrogram functionality included as part of scipy or Matplotlib.
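For completeness, here is a minimal sketch (my own, with an assumed stand-in signal) of the spectrogram route using scipy.signal.spectrogram:
import numpy as np
from scipy import signal
import matplotlib.pyplot as plt

fs = 44100
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 440 * t)  # stand-in for the wav data

f, tt, Sxx = signal.spectrogram(x, fs=fs, nperseg=1024)
plt.pcolormesh(tt, f, 10 * np.log10(Sxx + 1e-12))  # power in dB versus time and frequency
plt.xlabel('time (s)')
plt.ylabel('frequency (Hz)')
plt.show()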

Estimation of t-distribution by mean of samples does not work

I am trying to create a t-distribution by taking the mean of many samples from a normal distribution (and then estimating the shape with kernel density estimation).
For some reason, I am getting pretty different results when I compare what I get with a proper t-distribution. I don't understand what is going wrong, so I think I am confused about something.
Here is the code:
import numpy as np
from scipy.stats import gaussian_kde
import matplotlib.pyplot as plt
import seaborn
inner_sample_size = 10
X = np.arange(-3, 3, 0.01)
results = [
    np.mean(np.random.normal(size=inner_sample_size))
    for _ in range(10000)
]
estimation = gaussian_kde(results)
plt.plot(X, estimation.evaluate(X))
t_samples = np.random.standard_t(inner_sample_size, 10000)
t_estimator = gaussian_kde(t_samples)
plt.plot(X, t_estimator.evaluate(X))
plt.ylabel("Probability density")
plt.show()
And here is the plot I get:
Where the orange line is numpy's own t-distribution, and the blue line is the one estimated by sampling.
Your assumption that the mean of Standard Normals has a T distribution is incorrect. In fact, the mean of Standard Normals has a Normal distribution, which explains the shape of your blue graph. To generate one random variable T from a T distribution with k degrees of freedom, you first generate k+1 independent Standard Normals Z_i, i = 0, ..., k. You then compute
T = Z_0 / sqrt( sum(Z_i^2, i=1 to k)/k ).
The sum of squared Standard Normals sum(Z_i^2, i=1 to k) has a Chi-Squared distribution with k degrees of freedom, so if there is a pre-canned method to generate this, you should use it, since it's likely more efficient.
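A minimal sketch of that construction (my own illustration, using NumPy's chi-square generator as the pre-canned method mentioned above):
import numpy as np

k = 10      # degrees of freedom
n = 10000   # number of T variates to draw

z0 = np.random.normal(size=n)            # the standard normal in the numerator
chi2 = np.random.chisquare(k, size=n)    # sum of k squared standard normals
t_samples = z0 / np.sqrt(chi2 / k)       # T variates with k degrees of freedom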