numpy.fft.fft not computing dft at frequencies given by numpy.fft.fftfreq? - numpy

This is a mathematical question, but it is tied to the numpy implementation, so I decided to ask it at SO. Perhaps I'm hugely misunderstanding something, but if so I would like to be put straight.
numpy.fft.fft computes the DFT according to the equation:
A_k = \sum_{m=0}^{n-1} a_m \exp\{-2\pi i \, mk/n\}, \qquad k = 0, \ldots, n-1
numpy.fft.fftfreq is supposed to return the frequencies at which the DFT was computed.
Say we have:
x = [0, 0, 1, 0, 0]
X = np.fft.fft(x)
freq = np.fft.fftfreq(5)
Then for signal x, its DFT transformation is X, and frequencies at which X is computed are given by freq. For example X[0] is DFT of x at frequency freq[0], X[1] is DFT of x at frequency freq[1], and so on.
But when I compute DFT of a simple signal by hand with the formula quoted above, my results indicate that X[1] is DFT of x at frequency 1, not at freq[1], X[2] is DFT of x at frequency 2, etc, not at freq[2], etc.
As an example:
In [32]: x
Out[32]: [0, 0, 1, 0, 0]
In [33]: X
Out[33]:
array([
1.00000000+0.j,
-0.80901699-0.58778525j,
0.30901699+0.95105652j, 0.30901699-0.95105652j,
-0.80901699+0.58778525j])
In [34]: freq
Out[34]: array([ 0. , 0.2, 0.4, -0.4, -0.2])
If I compute DFT of above signal for k = 0.2 (or freq[1]), I get
X at freq = 0.2: 0.876 - 0.482j, which isn't X[1].
If however I compute for k = 1 I get the same results as are in X[1] or -0.809 - 0.588j.
So what am I misunderstanding? If numpy.fft.fft(x)[n] is the DFT of x at frequency n, not at frequency numpy.fft.fftfreq(len(x))[n], what is the purpose of numpy.fft.fftfreq?

I think this is because the values in the array returned by numpy.fft.fftfreq are equal to (k/n) * sampling frequency.
The frequencies of the DFT result are equal to k/n divided by the time spacing. You can consider the digital signal to be the analog signal multiplied by a periodic sampling function (an impulse train whose period is the time spacing). Multiplication in the time domain means convolution in the frequency domain, and the transform of an impulse train with period d is an impulse train with period 1/d, so the time spacing of the input data sets the frequency spacing of the DFT result. When the time spacing is equal to 1, the frequency spacing of the DFT result is 1/n. For a general time spacing, the frequency spacing becomes 1/n divided by the time spacing, which equals 1/n multiplied by the sampling frequency.
To calculate that, numpy.fft.fftfreq takes two arguments: the length of the input, n, and the time spacing, d, which is the inverse of the sampling rate (default 1). Each returned value is k/n divided by d.
I tried k = 2, and the result is equal to X[2] in your example. In this situation, (k/n)*1 is equal to freq[2].

The DFT is a dimensionless basis transform or matrix multiplication. The output or result of a DFT has nothing to do with frequencies unless you know the sampling rate represented by the input vector (samples per second, per meter, per radian, etc.)
You can compute a Goertzel filter of the same length N with k=0.2, but that result isn't contained in a DFT or FFT result of length N. A DFT only contains complex Goertzel filter results for integer k values. And to get from k to the frequency represented by X[k], you need to know the sample rate.
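To make the integer-k point concrete, here is a minimal sketch that evaluates the DFT sum directly. The helper name dft_at is made up for illustration (it is not a numpy function), and the sample rate fs is an assumed value matching fftfreq's default spacing of 1.

```python
import numpy as np

def dft_at(x, k):
    """Evaluate sum_m x[m] * exp(-2j*pi*m*k/n) at an arbitrary index k (hypothetical helper)."""
    x = np.asarray(x, dtype=complex)
    n = len(x)
    m = np.arange(n)
    return np.sum(x * np.exp(-2j * np.pi * m * k / n))

x = [0, 0, 1, 0, 0]
X = np.fft.fft(x)

# Integer k reproduces the FFT bins one by one.
by_hand = np.array([dft_at(x, k) for k in range(len(x))])

# A non-integer k (here 0.2) is a valid sum, but it is not one of the FFT bins.
off_grid = dft_at(x, 0.2)

# Turning a bin index k into a frequency needs the sample rate.
fs = 1.0                                      # assumed: one sample per unit time
bin_freqs = np.arange(len(x)) * fs / len(x)   # k * fs / n, before wrapping to negative values
```

Here by_hand agrees with X, while off_grid is the 0.876 - 0.482j value from the question, which is why it does not show up anywhere in X.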

Yours is not a SO question
You wrote
If I compute DFT of above signal for k = 0.2 .
and I reply "You shouldn't"... the DFT can be meaningfully computed only for integer values of k.
The relationship between an index k and a frequency is given by f_k = k Δf or, if you prefer circular frequencies, ω_k = k Δω where Δf = 1/T and Δω = 2πΔf, T being the period of the signal.
The arguments of fftfreq are a bit misleading... the required one is the number of samples n and the optional argument is the sampling interval, by default d=1.0, but at any rate T=n*d and Δf = 1/(n*d)
>>> fftfreq(5) # d=1
array([ 0. , 0.2, 0.4, -0.4, -0.2])
>>> fftfreq(5,2)
array([ 0. , 0.1, 0.2, -0.2, -0.1])
>>> fftfreq(5,10)
array([ 0. , 0.02, 0.04, -0.04, -0.02])
and the different T are 5, 10, 50 and the respective df are 0.2, 0.1, 0.02, as (I) expected.
Why doesn't fftfreq simply require the signal's period? Because it is mainly intended as a helper in untangling the Nyquist frequency issue.
As you know, the DFT is periodic, for a signal x of length N you have that
DFT(x,k) is equal to DFT(x,k+mN) where m is an integer.
This implies that there are only N/2 positive and N/2 negative distinct frequencies and that, when N/2<k<N, the frequency that must be associated with k in the most meaningful way is not k df but (k-N) df.
To perform this, fftfreq needs more information than the period T, hence the choice of requiring n and computing df from an assumption on the sampling interval.
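The wrapping rule can be sketched in a few lines. This is an illustrative re-implementation, not numpy's source; the name fftfreq_sketch is invented here. It follows numpy's convention of labelling the bin k = N/2 as negative when N is even.

```python
import numpy as np

def fftfreq_sketch(n, d=1.0):
    """Bin frequencies: k*df for the lower half, (k - n)*df for the upper half."""
    df = 1.0 / (n * d)                                # df = 1/T with T = n*d
    k = np.arange(n)
    k_signed = np.where(k < (n + 1) // 2, k, k - n)   # wrap the upper half to negative indices
    return k_signed * df

odd = fftfreq_sketch(5)            # same spacing as fftfreq(5)
even = fftfreq_sketch(10, d=2.0)   # same spacing as fftfreq(10, 2.0)
```

Knowing only T would give df, but not where to wrap, which is why n is the required argument.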

Related

How to perform McNemar (or Chi-square) test on multi-dimensional data?

Question:
Assume for this part that the total count for every sample is 5000 (i.e., sum of column = 5000).
Imagine there was a row (gene G) in this dataset for which the count is expected to be 1 in 10% of samples and 0 in the remaining 90% of samples. We are doing an experiment where we would like to know if the expression of gene G changes in experimental vs control conditions, and we will measure n samples (single cells) from each condition.
Plot the statistical power to detect a 10% increase in the expression of G in experimental vs control at Bonferroni-corrected p < 0.05 as a function of n, assuming that we will be performing a similar test for significance on 1000 genes total. How many samples from each condition do we need to measure to achieve a power of 95%?
First, for gene G, I created a pandas dataframe for control and experimental conditions, where the proportion of 1s is 10% and 20%, respectively. I extrapolated the conditions to 1000 genes and then performed a cross-tabulation analysis.
import numpy as np
import pandas as pd
from statsmodels.stats.contingency_tables import mcnemar
from statsmodels.stats.gof import chisquare_effectsize
from statsmodels.stats.power import GofChisquarePower

n = 5000 # no of records
nog = 1000 # no of genes
gene_list = ["gene_" + str(i) for i in range(0, nog)]

def generate_gene_df(gene, n):
    # One row per cell: a 0/1 count in each condition
    df = pd.DataFrame.from_dict(
        {"Gene": gene,
         "Cells": [f'Cell{x}' for x in range(1, n+1)],
         "Control": np.random.choice([1, 0], p=[0.1, 0.9], size=n),
         "Experimental": np.random.choice([1, 0], p=[0.1+0.1, 0.9-0.1], size=n)},
        orient='columns'
    )
    df = df.set_index(["Gene", "Cells"])
    return df

# List of simulated genes
gene_df_list = [generate_gene_df(gene, n) for gene in gene_list]
df = pd.concat(gene_df_list)
df = df.reset_index()
table = pd.crosstab([df["Gene"], df["Cells"]],
                    [df["Control"], df["Experimental"]]).to_numpy()
Table:
array([[1, 0, 0, 0],
[1, 0, 0, 0],
[0, 0, 1, 0],
...,
[0, 1, 0, 0],
[0, 1, 0, 0],
[0, 1, 0, 0]])
Now, I want to plot the statistical power at Bonferroni-corrected p < 0.05 as a function of n. I also want to find out how many samples from each condition do we need to measure to achieve a power of 95%.
My attempt:
McNemar's test
result = mcnemar(table, exact=True)
print('statistic=%.3f, p-value=%.3f' % (result.statistic, result.pvalue))
alpha=0.05
if result.pvalue > alpha:
    print('Same proportions of errors (fail to reject H0)')
else:
    print('Different proportions of errors (reject H0)')
Output:
statistic=0.000, p-value=1.000
Same proportions of errors (fail to reject H0)
I calculated the power analysis using:
nobs = 5000
alpha = 0.05
effect_size = chisquare_effectsize(0.5, 0.5*1.1, correction=None, cohen=True, axis=0)
analysis = GofChisquarePower()
power_chisquare = analysis.solve_power(effect_size=effect_size, nobs=nobs, alpha=alpha)
print('Based on Chi Square test, the minimum number of samples required to see an effect of the desired size: %.3f' % power_chisquare)
Based on Chi Square test, the minimum number of samples required to see an effect of the desired size: 0.050
Why does the power curve look atypical? Did I perform the analyses correctly? Is McNemar an appropriate statistical test method in this case?
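For the power-versus-n part, one possible sketch is below. It is not an answer taken from this thread. It assumes the control and experimental cells are independent samples (so an unpaired two-proportion z-test rather than McNemar), reads the "10% increase" as 0.10 going to 0.20 as in the simulation above, and uses the normal approximation with Cohen's effect size h. The function names are invented for illustration.

```python
from math import asin, sqrt
from statistics import NormalDist

def power_two_proportions(n, p1=0.10, p2=0.20, alpha=0.05 / 1000):
    """Approximate power of a two-sided two-proportion z-test with n samples per group."""
    norm = NormalDist()
    h = 2 * asin(sqrt(p2)) - 2 * asin(sqrt(p1))   # Cohen's effect size h
    z_crit = norm.inv_cdf(1 - alpha / 2)          # threshold at the Bonferroni-corrected alpha
    shift = abs(h) * sqrt(n / 2)                  # expected test statistic under the alternative
    return norm.cdf(shift - z_crit) + norm.cdf(-shift - z_crit)

def samples_needed(target=0.95, **kwargs):
    """Smallest n per group whose approximate power reaches the target."""
    n = 2
    while power_two_proportions(n, **kwargs) < target:
        n += 1
    return n

n_required = samples_needed()
# (n, power) pairs that could be passed to a plotting library for the power curve
curve = [(n, power_two_proportions(n)) for n in range(50, 1501, 50)]
```

Under different assumptions (a paired design, or 0.10 going to 0.11) the numbers change substantially, so treat n_required as a property of this particular reading of the question.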

How do I convert scale bars after a 2d FFT?

I'm currently writing something which will compute the 2d FFT of an image and pick out certain peaks in the magnitude spectrum. My images all have scales on the x and y axis in nm but I'm struggling to understand how I would convert from my length scales to frequencies.
I'm sure I'm just misunderstanding something simple here but I can't find anything on the subject that seems relevant.
Thanks in advance for any help.
Brief overview of FFT
Remember that the FFT is just a fast version of the Discrete Fourier Transform algorithm. It transforms data back and forth between the signal's original domain (which could be either time or space) and its corresponding frequency domain. In the case of images, you are talking about spatial frequency.
Spatial frequency
Spatial frequency is similar to regular temporal frequency. But it has units of m⁻¹ rather than s⁻¹.
Remember that, from the Nyquist-Shannon sampling theorem, the highest frequency component is equal to half of the sampling rate. In temporal frequency, this means that if you are sampling at 1000Hz the highest signal frequency you can capture is 500Hz. The spatial domain is equivalent. If you are sampling every millimetre (1000m⁻¹), your signal can only contain frequencies up to 500m⁻¹ (a wavelength of 2mm).
You can sort of picture it by imagining you sample 1, -1, 1, -1 ... at every 1mm, the wave clearly repeats every 2mm.
The smallest frequency component is going to depend on the length of the signal. Clearly if you have a 1s sample, the smallest frequency bin you can detect is 1Hz. As you have probably already noticed, same applies to spatial frequencies.
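As a quick numeric check of the two limits above, here is a small sketch. The sample spacing and length are assumed values chosen to match the 1 mm example.

```python
import numpy as np

d = 1e-3            # sample spacing: 1 mm, expressed in metres
n = 1000            # number of samples, so the signal is n*d = 1 m long

sampling_rate = 1 / d          # 1000 samples per metre
nyquist = sampling_rate / 2    # highest representable spatial frequency
bin_spacing = 1 / (n * d)      # lowest non-zero frequency, also the bin width

freqs = np.fft.fftfreq(n, d)   # frequency axis in m^-1
```

The largest absolute value in freqs equals nyquist, and the step between neighbouring entries equals bin_spacing.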
Now you could look at your sampling rate, and your signal length and work out the frequency spacing and align for the shifted FFT signal... But numpy already provides a very powerful method for generating your frequency axis numpy.fft.fftfreq. You give it the length of your signal n and sample spacing d and it will provide you with the correctly scaled and spaced frequency axis.
So in your case where you have an x x y image with pixels every nm you would generate your x and y spatial frequency axis like this:
d = 1e-9 # Sample spacing is 1nm
y, x = image.shape # Get the y and x size of your input image (assuming its just 2D)
y_freq = np.fft.fftfreq(y, d)
x_freq = np.fft.fftfreq(x, d)
Frequency order
Remember that by default, the output of an FFT is ordered so that the coefficients start at DC, continue up to the highest positive frequency and then wrap around to the negative frequencies:
[0, 1, 2, 3, 4, -5, -4, -3, -2, -1]
fftfreq outputs the frequency axis in the same order. In case you want to reorder the axis, use np.fft.fftshift. This will reorder the frequencies to a more logical order like this:
[-5, -4, -3, -2, -1, 0, 1, 2, 3, 4]
This also works for 2D FFT. Basically your code would look something like this:
import numpy as np
import matplotlib.pyplot as plt

# image is your 2D array of pixel values
image_f = np.fft.fft2(image) # 2D FFT of image
image_fs = np.fft.fftshift(image_f) # shift the FFT of the image so DC is in the centre
d = 1e-9 # Sample spacing is 1nm
y, x = image.shape # Get the y and x size of your input image (assuming it's just 2D)
# Compute the shifted spatial frequency axis with units m⁻¹
y_freq = np.fft.fftshift(np.fft.fftfreq(y, d))
x_freq = np.fft.fftshift(np.fft.fftfreq(x, d))
fig, ax = plt.subplots()
# Plot the absolute of the 2D FFT. Set the axis extent to the min/max values for x and y freq axis
# Note, in imshow, you actually don't really need the full freq axis, just the limits
ax.imshow(np.abs(image_fs), origin='lower', extent=[x_freq[0], x_freq[-1], y_freq[0], y_freq[-1]])
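For picking out a peak and converting it back to a length, a synthetic test image is a handy sanity check. The sketch below builds a made-up striped image (size and period are assumptions, not the asker's data), finds the strongest non-DC peak, and turns its spatial frequency back into a period.

```python
import numpy as np

d = 1e-9                       # pixel size: 1 nm, in metres
ny, nx = 64, 128               # hypothetical image size
period = 8e-9                  # stripes repeat every 8 nm along x

x = np.arange(nx) * d
image = np.tile(np.cos(2 * np.pi * x / period), (ny, 1))

magnitude = np.abs(np.fft.fft2(image))
magnitude[0, 0] = 0            # ignore the DC term when searching for peaks

# Index of the strongest peak, then look up its frequency on each axis
iy, ix = np.unravel_index(np.argmax(magnitude), magnitude.shape)
fy = np.fft.fftfreq(ny, d)[iy]
fx = np.fft.fftfreq(nx, d)[ix]

spatial_frequency = np.hypot(fx, fy)       # cycles per metre
recovered_period = 1 / spatial_frequency   # metres
```

If you would rather work in nm⁻¹, passing d=1.0 (one pixel per nm) to fftfreq gives the axis directly in those units.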

How does tf.multinomial work?

How does tf.multinomial work? The documentation states that it "Draws samples from a multinomial distribution". What does that mean?
If you perform an experiment n times that can have only two outcomes (either success or failure, head or tail, etc.), then the number of times you obtain one of the two outcomes (success) is a binomial random variable.
In the special case of a single trial (n = 1): if you perform an experiment that can have only two outcomes (either success or failure, head or tail, etc.), then a random variable that takes value 1 in case of success and value 0 in case of failure is a Bernoulli random variable.
If you perform an experiment n times that can have K outcomes (where K can be any natural number) and you denote by X_i the number of times that you obtain the i-th outcome, then the random vector X defined as
X = [X_1, X_2, X_3, ..., X_K]
is a multinomial random vector.
In the special case of a single trial, if you perform an experiment that can have K outcomes and you denote by X_i a random variable that takes value 1 if you obtain the i-th outcome and 0 otherwise, then the random vector X defined as
X = [X_1, X_2, X_3, ..., X_K]
is a Multinoulli random vector. In other words, when the i-th outcome is obtained, the i-th entry of the Multinoulli random vector X takes value 1, while all other entries take value 0.
So, a multinomial distribution can be seen as a sum of mutually independent Multinoulli random variables.
And the probabilities of the K possible outcomes will be denoted by
p_1, p_2, p_3, ..., p_K
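The "sum of Multinoulli draws" view can be sketched with numpy alone. The probabilities, trial count, and seed below are arbitrary illustration values.

```python
import numpy as np

rng = np.random.default_rng(0)     # seed chosen arbitrarily, for reproducibility
p = [0.1, 0.2, 0.7]                # probabilities of the K = 3 outcomes
n = 4                              # number of trials

# Each row is one Multinoulli draw: a one-hot vector marking the outcome of one trial.
one_hot_draws = rng.multinomial(1, p, size=n)   # shape (n, K)

# Summing the n one-hot vectors gives one multinomial count vector.
counts = one_hot_draws.sum(axis=0)
```

Each row of one_hot_draws contains a single 1, and counts adds up to n.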
An example in Tensorflow,
In [171]: isess = tf.InteractiveSession()
In [172]: prob = [[.1, .2, .7], [.3, .3, .4]] # Shape [2, 3]
...: dist = tf.distributions.Multinomial(total_count=[4., 5], probs=prob)
...:
...: counts = [[2., 1, 1], [3, 1, 1]]
...: isess.run(dist.prob(counts)) # Shape [2]
...:
Out[172]: array([ 0.0168 , 0.06479999], dtype=float32)
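The two probabilities returned above can be reproduced by hand from the multinomial probability mass function, n! / (x_1! ... x_K!) * p_1^x_1 ... p_K^x_K. A minimal sketch in plain Python (the function name is made up here):

```python
from math import factorial, prod

def multinomial_pmf(counts, probs):
    """Probability of observing exactly these counts in sum(counts) trials."""
    n = sum(counts)
    coef = factorial(n)
    for c in counts:
        coef //= factorial(c)          # multinomial coefficient n! / (x_1! ... x_K!)
    return coef * prod(p ** c for p, c in zip(probs, counts))

first = multinomial_pmf([2, 1, 1], [0.1, 0.2, 0.7])    # 12 * 0.1**2 * 0.2 * 0.7
second = multinomial_pmf([3, 1, 1], [0.3, 0.3, 0.4])   # 20 * 0.3**3 * 0.3 * 0.4
```

These come out as 0.0168 and 0.0648, matching the TensorFlow output up to float32 rounding.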
Note: The Multinomial is identical to the Binomial distribution when K = 2. For more detailed information, please refer to either tf.compat.v1.distributions.Multinomial or the latest docs of tensorflow_probability.distributions.Multinomial.

Bernoulli random number generator

I cannot understand how Bernoulli Random Number generator used in numpy is calculated and would like some explanation on it. For example:
np.random.binomial(size=3, n=1, p= 0.5)
Results:
[1 0 0]
n = number of trials
p = probability of occurrence
size = number of experiments
How do I determine the generated numbers/results of "0" or "1"?
=================================Update==================================
I created a Restricted Boltzmann Machine which always presents the same results despite being "random" on multiple code executions. The randomize is seeded using
np.random.seed(10)
import numpy as np
np.random.seed(10)

def sigmoid(u):
    return 1/(1+np.exp(-u))

def gibbs_vhv(W, hbias, vbias, x):
    f_s = sigmoid(np.dot(x, W) + hbias)
    h_sample = np.random.binomial(size=f_s.shape, n=1, p=f_s)
    f_u = sigmoid(np.dot(h_sample, W.transpose())+vbias)
    v_sample = np.random.binomial(size=f_u.shape, n=1, p=f_u)
    return [f_s, h_sample, f_u, v_sample]

def reconstruction_error(f_u, x):
    cross_entropy = -np.mean(
        np.sum(
            x * np.log(sigmoid(f_u)) + (1 - x) * np.log(1 - sigmoid(f_u)),
            axis=1))
    return cross_entropy

X = np.array([[1, 0, 0, 0]])
#Weight to hidden
W = np.array([[-3.85, 10.14, 1.16],
              [6.69, 2.84, -7.73],
              [1.37, 10.76, -3.98],
              [-6.18, -5.89, 8.29]])
hbias = np.array([1.04, -4.48, 2.50]) #<= 3 bias for 3 neuron in hidden
vbias = np.array([-6.33, -1.68, -1.25, 3.45]) #<= 4 bias for 4 neuron in input
k = 2
v_sample = X
for i in range(k):
    [f_s, h_sample, f_u, v_sample] = gibbs_vhv(W, hbias, vbias, v_sample)
    start = v_sample
    if i < 2:
        print('f_s:', f_s)
        print('h_sample:', h_sample)
        print('f_u:', f_u)
        print('v_sample:', v_sample)
    print(v_sample)
    print('iter:', i, ' h:', h_sample, ' x:', v_sample, ' entropy:%.3f'%reconstruction_error(f_u, v_sample))
Results:
[[1 0 0 0]]
f_s: [[ 0.05678618 0.99652957 0.97491304]]
h_sample: [[0 1 1]]
f_u: [[ 0.99310473 0.00139984 0.99604968 0.99712837]]
v_sample: [[1 0 1 1]]
[[1 0 1 1]]
iter: 0 h: [[0 1 1]] x: [[1 0 1 1]] entropy:1.637
f_s: [[ 4.90301318e-04 9.99973278e-01 9.99654440e-01]]
h_sample: [[0 1 1]]
f_u: [[ 0.99310473 0.00139984 0.99604968 0.99712837]]
v_sample: [[1 0 1 1]]
[[1 0 1 1]]
iter: 1 h: [[0 1 1]] x: [[1 0 1 1]] entropy:1.637
I am asking on how the algorithm works to produce the numbers. – WhiteSolstice 35 mins ago
Non-technical explanation
If you pass n=1 to the Binomial distribution it is equivalent to the Bernoulli distribution. In this case the function could be thought of as simulating coin flips. size=3 tells it to flip the coin three times and p=0.5 makes it a fair coin with equal probability of head (1) or tail (0).
The result of [1 0 0] means the coin came down once with head and twice with tail facing up. This is random, so running it again would result in a different sequence like [1 1 0], [0 1 0], or maybe even [1 1 1]. Although you cannot get the same number of 1s and 0s in three runs, on average you would get the same number.
Technical explanation
Numpy implements random number generation in C. The source code for the Binomial distribution can be found here. Actually two different algorithms are implemented.
If n * p <= 30 it uses inverse transform sampling.
If n * p > 30 the BTPE algorithm of (Kachitvichyanukul and Schmeiser 1988) is used. (The publication is not freely available.)
I think both methods, but certainly the inverse transform sampling, depend on a random number generator to produce uniformly distributed random numbers. The legacy np.random functions (including np.random.seed and np.random.binomial) internally use a Mersenne Twister (MT19937) pseudo-random number generator. The uniform random numbers are then transformed into the desired distribution.
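Conceptually, for n=1 the transformation boils down to comparing a uniform number with p. The sketch below illustrates that idea; it is not numpy's actual C code, the function name is invented, and it uses the newer Generator API (np.random.default_rng) rather than the legacy seeding from the question.

```python
import numpy as np

def bernoulli_from_uniform(rng, p, size):
    """Draw 0/1 values by thresholding uniform numbers in [0, 1) against p."""
    u = rng.random(size)
    return (u < p).astype(int)

# Same seed, same uniform numbers, same 0/1 sequence.
a = bernoulli_from_uniform(np.random.default_rng(10), 0.5, 3)
b = bernoulli_from_uniform(np.random.default_rng(10), 0.5, 3)

# p can also be an array, as in the RBM code (one probability per unit).
probs = np.array([0.057, 0.997, 0.975])
h = bernoulli_from_uniform(np.random.default_rng(10), probs, probs.shape)
```

This is also why the seeded RBM prints identical results on every run: a fixed seed fixes the stream of uniform numbers, and everything downstream is a deterministic function of that stream.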
A Binomially distributed random variable has two parameters n and p, and can be thought of as the distribution of the number of heads obtained when flipping a biased coin n times, where the probability of getting a head at each flip is p. (More formally it is a sum of independent Bernoulli random variables with parameter p).
For instance, if n=10 and p=0.5, one could simulate a draw from Bin(10, 0.5) by flipping a fair coin 10 times and summing the number of times that the coin lands heads.
In addition to the n and p parameters described above, np.random.binomial has an additional size parameter. If size=1, np.random.binomial computes a single draw from the Binomial distribution. If size=k for some integer k, k independent draws from the same Binomial distribution will be computed. size can also be a tuple of integers giving a shape, in which case a whole np.array of that shape will be filled with independent draws from the Binomial distribution.
Note that the Binomial distribution is a generalisation of the Bernoulli distribution - in the case that n=1, Bin(n,p) has the same distribution as Ber(p).
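A short sketch of the size parameter and of the "sum of Bernoulli draws" view, using arbitrary example values and an arbitrary seed:

```python
import numpy as np

np.random.seed(0)   # arbitrary seed, only to make the run repeatable

single = np.random.binomial(n=10, p=0.5)              # one draw: heads in 10 fair flips
several = np.random.binomial(n=10, p=0.5, size=4)     # four independent draws
grid = np.random.binomial(n=10, p=0.5, size=(2, 3))   # 2x3 array of independent draws

# A Bin(10, 0.5) draw simulated as the sum of ten Bernoulli(0.5) draws.
flips = np.random.binomial(n=1, p=0.5, size=10)
heads = flips.sum()
```

several has shape (4,), grid has shape (2, 3), and heads is distributed like single.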
For more information about the binomial distribution see: https://en.wikipedia.org/wiki/Binomial_distribution

Equivalent of R's cor.test in Python

Is there a way I can find the r confidence interval in Python?
In R I could do something like:
cor.test(m, h)
Pearson's product-moment correlation
data: m and h
t = 0.8974, df = 4, p-value = 0.4202
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.6022868 0.9164582
sample estimates:
cor
0.4093729
In Python I can calculate r (cor) using:
r,p = scipy.stats.pearsonr(df.age, df.pets)
But that doesn't return the r confidence interval.
Here's one way to calculate the confidence interval.
First get the correlation value (Pearson's):
In [85]: from scipy import stats
In [86]: corr = stats.pearsonr(df['col1'], df['col2'])
In [87]: corr
Out[87]: (0.551178607008175, 0.0)
Use the Fisher transformation to get z
In [88]: z = np.arctanh(corr[0])
In [89]: z
Out[89]: 0.62007264620685021
And, the sigma value i.e standard error
In [90]: sigma = (1/((len(df.index)-3)**0.5))
In [91]: sigma
Out[91]: 0.013840913308956662
Get the two-sided 95% interval in z-space, using the quantile function (ppf) of the standard normal distribution:
In [92]: cint = z + np.array([-1, 1]) * sigma * stats.norm.ppf((1+0.95)/2)
Finally take hyperbolic tangent to get interval values for 95%
In [93]: np.tanh(cint)
Out[93]: array([ 0.53201034, 0.56978224])
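The steps above can be wrapped into one small function. This is a sketch using only the standard library (statistics.NormalDist().inv_cdf plays the role of stats.norm.ppf); the function name pearson_ci is made up. As a check it is fed the r and sample size from the R output in the question, where df = 4 implies n = 6.

```python
from math import atanh, sqrt, tanh
from statistics import NormalDist

def pearson_ci(r, n, confidence=0.95):
    """Confidence interval for a Pearson correlation via the Fisher z-transformation."""
    z = atanh(r)                                      # Fisher transformation
    se = 1 / sqrt(n - 3)                              # standard error in z-space
    q = NormalDist().inv_cdf((1 + confidence) / 2)    # two-sided normal quantile
    return tanh(z - q * se), tanh(z + q * se)         # back-transform to r-space

low, high = pearson_ci(0.4093729, 6)
```

This reproduces the interval that cor.test printed, -0.6022868 to 0.9164582, to within rounding.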