What does the shape of a spectrogram really mean? - tensorflow

I have the following code taken from this tutorial.
def get_spectrogram(waveform):
  zero_padding = tf.zeros([4900] - tf.shape(waveform), dtype=tf.float32)
  waveform = tf.cast(waveform, tf.float32)
  equal_length = tf.concat([waveform, zero_padding], 0)
  spectrogram = tf.signal.stft(equal_length, frame_length=256, frame_step=128)
  spectrogram = tf.abs(spectrogram)
  return spectrogram
spectrogram = get_spectrogram(waveform)
print('Spectrogram shape:', spectrogram.shape)
And I have the following output for the spectrogram shape.
Spectrogram shape: (37, 129)
What does the first and second value mean?
If I have 4900 samples and a frame_step of 128, shouldn't the first value be 38?
4900/128 = 38.28125 -> 38 rounded
It also happens that with a Kotlin library I get a shape of (38, 127).
I need to understand this, since I am implementing a model on Android with TFLite and am therefore pre-processing the data on the mobile device.

I'm not exactly familiar with the Python API, but assuming it works similarly to WaveBeans, which I'm very familiar with, it looks like what you've got is a 2-dimensional matrix.
What you're doing is a Short-Time Fourier Transform, which is basically taking the FFT repeatedly over time. While a single FFT's magnitude or phase can be represented as a 1-dimensional vector over frequency, the STFT also has a time axis, which is why the result is a 2-dimensional matrix.
So it looks like the 38 side is the time index and the 127 side is the frequency index; the values are the FFT values for each specific time-frequency bin, though those are complex numbers. Thinking of them in polar coordinates, the phase is the angle and the magnitude is the length. In your code you're getting the magnitude by calling the .abs() function, so you've already got rid of the complex-number representation.
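For the TensorFlow numbers specifically, a quick sketch of the framing arithmetic (assuming tf.signal.stft's documented defaults: no end-padding, and a 256-point real FFT) accounts for both values:
n_samples = 4900
frame_length = 256
frame_step = 128
# Only frames that fit entirely in the signal are taken, so the count is
# (n_samples - frame_length) // frame_step + 1, not n_samples / frame_step.
n_frames = (n_samples - frame_length) // frame_step + 1  # 37
# A real FFT of length 256 yields 256 // 2 + 1 frequency bins.
n_bins = frame_length // 2 + 1  # 129
print((n_frames, n_bins))  # (37, 129)
A library with slightly different framing or bin conventions (e.g. padding the final partial frame, or excluding the DC/Nyquist bins) can plausibly land on (38, 127) instead.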
Within WaveBeans there is an API for working with the FFT specifically, to extract the phase and magnitude as well as the frequency and time values.
So, just to keep the answer complete, I'll provide a code snippet:
// let's take a simple sine as an example
val waveformAsAStream = 440.sine().trim(1000)
val fftStream = waveformAsAStream
    .window(256, 128)
    // zero padding is already done inside, but if window.size == fft.size it doesn't really do anything
    .fft(256)
// evaluate it, for example as a Kotlin sequence
val stft = fftStream.asSequence(44100.0f)
    .toList()
// get the specific sample for the sake of the demonstration
val fftSample = stft.drop(10).first()
// get time in nano seconds
fftSample.time()
// outputs the time of the taken sample:
// 29024943
// get frequencies values
fftSample.frequency().toList()
// outputs a list of size 128, each element is a frequency in Hz :
// [0.0, 172.265625, 344.53125, 516.796875, 689.0625, ..., 21360.9375, 21533.203125, 21705.46875, 21877.734375]
// get magnitude values
fftSample.magnitude().toList()
// outputs a list of size 128, each element is magnitude value for specific bin in dB:
// [29.629418039768613, 31.125367384785786, 38.077554502661705, 38.480916556622745, ..., -11.57802246867041]
// the index of the closest bin for the given frequency
fftSample.bin(440.0)
// outputs:
// 3
// get the magnitude in the FFT spectrogram of the specific frequency
fftSample.magnitude().toList()[fftSample.bin(440.0)]
// outputs:
// 38.480916556622745
Although, for a better FFT output I would recommend using window functions (Hamming, for example, is a popular one) and smaller windows (zero padding will do the aligning trick in that case, as the FFT requires a specific input length), i.e. something like this:
waveformAsAStream
    .window(101, 85)
    .hamming()
    .fft(256)
If you want to play around with the values, you can use a Kotlin Jupyter notebook with the WaveBeans library; check it out on GitHub.

Related

Getting frequency and amplitude from an audio file using FFT - so close but missing some vital insights, eli5?

tl;dr: I've got two audio recordings of the same song without timestamps, and I'd like to align them. I believe FFT is the way to go, but while I've come a long way, it feels like I'm right on the edge of understanding enough to make it work, and would greatly benefit from some "you got this part wrong" advice on FFT. (My education never got into this area.) So I came here seeking ELI5 help.
The journey:
1. Get two recordings at the same sample rate. (done!)
2. Transform them into a waveform (DoubleArray). This doesn't keep any of the meta info like "samples/second", but the FFT math doesn't care until later.
3. Run an FFT on them using a simplified implementation for beginners.
4. Get an Array<Frame>, where each Frame contains an Array<Bin>, and each Bin has (amplitude, frequency), because the older implementation hid all the details (like frame width, and number of Bins, and ... stuff?) and outputs words I'm familiar with like "amplitude" and "frequency".
5. Try moving to a more robust FFT (Apache Commons).
6. Get an output of 'real' and 'imaginary' (uh oh).
7. Make the totally incorrect assumption that those were the same thing (amplitude and frequency). Surprise, they aren't!
8. Apache's FFT returns an Array<Complex>, which means it... er... is just one frame's worth? And I should be chopping the song into 1-second chunks, passing each one into the FFT, and calling it multiple times? That seems strange - how does it get lower frequencies?
To the best of my understanding, the complex number is a way to convey the phase shift and amplitude in one neat container (and you need phase shift if you want to do the FFT in reverse). And the frequency is calculated from the index of the array.
Which works out to (pseudocode in Kotlin)
val audioFile = File("Dream_On.pcm")
val (phases, frequencies, amplitudes) = AudioInputStream(
    audioFile.inputStream(),
    AudioFormat(
        /* encoding = */ AudioFormat.Encoding.PCM_SIGNED,
        /* sampleRate = */ 44100f,
        /* sampleSizeInBits = */ 16,
        /* channels = */ 2,
        /* frameSize = */ 4,
        /* frameRate = */ 44100f,
        /* bigEndian = */ false
    ),
    (audioFile.length() / /* frameSize */ 4)
).use { ais ->
    val bytes = ais.readAllBytes()
    val shorts = ShortArray(bytes.size / 2)
    ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN).asShortBuffer().get(shorts)
    // convert the 16-bit samples to a DoubleArray waveform
    val allWaveform = DoubleArray(shorts.size)
    for (i in shorts.indices) {
        allWaveform[i] = shorts[i].toDouble()
    }
    // grab a power-of-2-sized chunk from the middle of the song
    val halfwayThroughSong = allWaveform.size / 2
    val moreThanOneSecond = allWaveform.copyOfRange(halfwayThroughSong, halfwayThroughSong + findNextPowerOf2(44100))
    val fft = FastFourierTransformer(DftNormalization.STANDARD)
    val fftResult: Array<Complex> = fft.transform(moreThanOneSecond, TransformType.FORWARD)
    println("fftResult size: ${fftResult.size}")
    // only the first half of the result holds unique (positive-frequency) bins
    val phases = DoubleArray(fftResult.size / 2)
    val amplitudes = DoubleArray(fftResult.size / 2)
    val frequencies = DoubleArray(fftResult.size / 2)
    fftResult.filterIndexed { index, _ -> index < fftResult.size / 2 }.forEachIndexed { idx, complex ->
        phases[idx] = atan2(complex.imaginary, complex.real)
        frequencies[idx] = idx * 44100.0 / fftResult.size
        amplitudes[idx] = hypot(complex.real, complex.imaginary)
    }
    Triple(phases, frequencies, amplitudes)
}
Is my step #8 at all close to the truth? Why would the FFT result return an array as big as my input number of samples? That makes me think I've got the "window" or "frame" part wrong.
I read up on
FFT real/imaginary/abs parts interpretation
Converting Real and Imaginary FFT output to Frequency and Amplitude
Java - Finding frequency and amplitude of audio signal using FFT
An audio recording in waveform is a series of sound energy levels, basically how much sound energy there should be at any one instant. Based on the sample rate, you can think of the whole recording as a graph of energy versus time.
Sound is made of waves, which have frequencies and amplitudes. Unless your recording is of a pure sine wave, it will have many different waves of sound coming and going, which summed together create the total sound that you experience over time. At any one instant of time, you have energy from many different waves added together. Some of those waves may be at their peaks, and some at their valleys, or anywhere in between.
An FFT is a way to convert energy-vs.-time data into amplitude-vs.-frequency data. The input to an FFT is a block of waveform. You can't just give it a single energy level from a one-dimensional point in time, because then there is no way to determine all the waves that add together to make up the amplitude at that point of time. So, you give it a series of amplitudes over some finite period of time.
The FFT then does its math and returns a range of complex numbers that represent the waves of sound over that chunk of time, which, when added together, would recreate the series of energy levels over that block of time. That's why the return value is an array. It represents a bunch of frequency ranges. Together, the total data of the array represents the same energy as the input array.
You can calculate from the complex numbers both phase shift and amplitude for each frequency range represented in the return array.
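To make that concrete, here is a minimal numpy sketch (the sample rate and tone are made-up illustrations, not from your recordings):
import numpy as np
fs = 44100                                 # assumed sample rate
t = np.arange(4096) / fs
x = np.sin(2 * np.pi * 440 * t)            # a block of waveform: a 440 Hz tone
spectrum = np.fft.rfft(x)                  # complex result, one value per frequency bin
freqs = np.fft.rfftfreq(len(x), d=1 / fs)  # the frequency each bin represents
amplitudes = np.abs(spectrum)              # amplitude per frequency range
phases = np.angle(spectrum)                # phase shift per frequency range
print(freqs[np.argmax(amplitudes)])        # ~441 Hz: the dominant component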
Ultimately, I don't see why performing an FFT would get you any closer to syncing your recordings. Admittedly it's not a task I've tried before. But I would think waveform data is already the perfect form for comparing the data and finding matching patterns. If you break your songs up into chunks to perform FFTs on, then you can try to find matching FFTs, but they will only match perfectly if your chunks are divided exactly along the same division points relative to the beginning of the original recording. And even if you could guarantee that and found matching FFTs, you will only have as much precision as the size of your chunks.
But when I think of apps like Shazam, I realize they must be doing some sort of manipulation of the audio that breaks it down into something simpler for rapid comparison. That possibly involves some FFT manipulation and filtering.
Maybe you could compare FFTs using some algorithm to just find ones that are pretty similar to narrow down to a time range and then compare wave form data in that range to find the exact point of synchronization.
I would imagine the approach that would work well is to find the offset with the maximum cross-correlation between the two recordings. This means calculating the cross-correlation between the two pieces at various offsets; you would expect the maximum cross-correlation to occur at the offset where the two pieces are best aligned.
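A rough numpy sketch of that idea (the function and variable names here are hypothetical):
import numpy as np
def best_offset(a, b):
    # Evaluate the cross-correlation at every possible overlap of the two signals.
    xcorr = np.correlate(a, b, mode="full")
    # Shift the peak index so that 0 means "already aligned".
    return int(np.argmax(xcorr)) - (len(b) - 1)
# e.g. lag = best_offset(waveform_a[:44100 * 10], waveform_b[:44100 * 10])
For long recordings you'd want an FFT-based correlation (e.g. scipy.signal.fftconvolve with one input reversed) rather than the direct method, since np.correlate is O(n*m).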

Limited range for TensorFlow Universal Sentence Encoder Lite embeddings?

Starting from the universal-sentence-encoder in TensorFlow.js, I noticed that the range of the numbers in the embeddings wasn't what I expected. I was expecting some distribution between [0, 1] or [-1, 1], but I don't see either of these.
For the sentence "cats are great!" here's a visualization, where each dimension is projected onto a scale from [-0.5, 0.5]:
[embedding visualization]
Here's the same kind of visualization for "i wonder what this sentence's embedding will be" (the pattern is similar for the first ~10 sentences I tried):
[embedding visualization]
To debug, I looked at whether the same kind of thing comes up in the demo Colab notebook, and it seems like it does. Here's what I see for the range of the embeddings for those two sentences:
# NEW: added this, with different messages
messages = ["cats are great!", "sometimes models are confusing"]
values, indices, dense_shape = process_to_IDs_in_sparse_format(sp, messages)
with tf.Session() as session:
  session.run([tf.global_variables_initializer(), tf.tables_initializer()])
  message_embeddings = session.run(
      encodings,
      feed_dict={input_placeholder.values: values,
                 input_placeholder.indices: indices,
                 input_placeholder.dense_shape: dense_shape})
  for i, message_embedding in enumerate(np.array(message_embeddings).tolist()):
    print("Message: {}".format(messages[i]))
    print("Embedding size: {}".format(len(message_embedding)))
    message_embedding_snippet = ", ".join(
        (str(x) for x in message_embedding[:3]))
    print("Embedding: [{}, ...]\n".format(message_embedding_snippet))
    # NEW: added this, to show the range of the embedding output
    print("Embedding range: [{}, {}]".format(min(message_embedding), max(message_embedding)))
And the output shows:
Message: cats are great!
Embedding range: [-0.05904272198677063, 0.05903803929686546]
Message: sometimes models are confusing
Embedding range: [-0.060731519013643265, 0.06075377017259598]
So this again isn't what I'm expecting - the range is narrower than I'd expect. I thought this might be a TF convention that I missed, but I couldn't see it in the TFHub page, the guide to text embeddings, or the paper, so I'm not sure where else to look without digging into the training code.
The colab notebook example code has an example sentence that says:
Universal Sentence Encoder embeddings also support short paragraphs.
There is no hard limit on how long the paragraph is. Roughly, the
longer the more 'diluted' the embedding will be.
But the range of the embedding is roughly the same for all the other examples in the colab, even one-word examples.
I'm assuming this range is not just arbitrary, and it does make sense to me that the range is centered on zero and small, but I'm trying to understand how this scale came to be.
The output of the universal sentence encoder is a vector of length 512, with an L2 norm of (approximately) 1.0. You can check this by calculating the inner product:
ip = 0
for i in range(512):
    ip += message_embeddings[0][i] * message_embeddings[0][i]
print(ip)
> 1.0000000807544893
The implications are that:
Most values are likely to be in a narrow range centered around zero
The largest possible single value in the vector is 1.0 - and this would only happen if all other values are exactly 0.
Similarly the smallest possible value is -1.
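A quick back-of-the-envelope check of why the observed range is so narrow: for a unit-norm vector of dimension 512, the root-mean-square entry is 1/sqrt(512):
import numpy as np
print(1 / np.sqrt(512))  # ~0.0442: the typical entry size, consistent with the ~0.06 ranges above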
If we take a random vector of length 512, with values distributed uniformly, and then normalize it to unit magnitude, we expect to see values in a range similar to what you see.
import numpy as np
import matplotlib.pyplot as plt

rand_uniform = np.random.uniform(-1, 1, 512)
l2 = np.linalg.norm(rand_uniform)
plt.plot(rand_uniform / l2, 'b.')
axes = plt.gca()
axes.set_ylim([-0.5, 0.5])
Judging visually, the distribution of excitations does not look uniform, but rather is biased toward extremes.

Bad result plotting windowing FFT

I'm playing with Python and SciPy to understand windowing. I made a plot to see how windowing behaves under the FFT, but the result is not what I was expecting.
The plot is:
[plot]
The middle plots are the pure FFT plots; this is where I get weird things.
Then I changed the trig function to get leakage, putting a straight 1 into the first 1000 items of the array. The result:
[plot]
The code:
import numpy as np
import matplotlib.pyplot as plt
from numpy import pi, sqrt

sign_freq = 80
sample_freq = 3000
num = np.linspace(0, 1, num=sample_freq)
i = 0
# wave data:
sin = np.sin(2*pi*num*sign_freq) + np.sin(2*pi*num*sign_freq*2)
while i < 1000:
    sin[i] = 1
    i = i + 1
# wave fft:
fft_sin = np.fft.fft(sin)
fft_freq_axis = np.fft.fftfreq(len(num), d=1/sample_freq)
# wave linear spectrum (RMS):
lin_spec = sqrt(2)*np.abs(np.fft.rfft(sin))/len(num)
lin_spec_freq_axis = np.fft.rfftfreq(len(num), d=1/sample_freq)
# window data:
hann = np.hanning(len(num))
# window fft:
fft_hann = np.fft.fft(hann)
# window fft linear spectrum:
wlin_spec = sqrt(2)*np.abs(np.fft.rfft(hann))/len(num)
# window * sin:
wsin = hann*sin
# window * sin fft:
wsin_spec = sqrt(2)*np.abs(np.fft.rfft(wsin))/len(num)
wsin_spec_freq_axis = np.fft.rfftfreq(len(num), d=1/sample_freq)
fig = plt.figure()
ax1 = fig.add_subplot(431)
ax2 = fig.add_subplot(432)
ax3 = fig.add_subplot(433)
ax4 = fig.add_subplot(434)
ax5 = fig.add_subplot(435)
ax6 = fig.add_subplot(436)
ax7 = fig.add_subplot(413)
ax8 = fig.add_subplot(414)
ax1.plot(num, sin, 'r')
ax2.plot(fft_freq_axis, abs(fft_sin), 'r')
ax3.plot(lin_spec_freq_axis, lin_spec, 'r')
ax4.plot(num, hann, 'b')
ax5.plot(fft_freq_axis, fft_hann)
ax6.plot(lin_spec_freq_axis, wlin_spec)
ax7.plot(num, wsin, 'c')
ax8.plot(wsin_spec_freq_axis, wsin_spec)
plt.show()
EDIT: As asked in the comments, I plotted the functions on a dB scale, obtaining much clearer plots. Thanks a lot @SleuthEye!
It appears the plot which is problematic is the one generated by:
ax5.plot(fft_freq_axis,fft_hann)
resulting in the graph:
[plot]
instead of the expected graph from Wikipedia.
There are a number of issues with the way the plot is constructed. The first is that this command essentially attempts to plot a complex-valued array (fft_hann). You may in fact be getting the warning ComplexWarning: Casting complex values to real discards the imaginary part as a result. To generate a graph which looks like the one from Wikipedia, you would have to take the magnitude (instead of the real part) with:
ax5.plot(fft_freq_axis,abs(fft_hann))
Then we notice that there is still a line striking through our plot. Looking at np.fft.fft's documentation:
The values in the result follow so-called “standard” order: If A = fft(a, n), then A[0] contains the zero-frequency term (the sum of the signal), which is always purely real for real inputs. Then A[1:n/2] contains the positive-frequency terms, and A[n/2+1:] contains the negative-frequency terms, in order of decreasingly negative frequency.
[...]
The routine np.fft.fftfreq(n) returns an array giving the frequencies of corresponding elements in the output.
Indeed, if we print the fft_freq_axis we can see that the result is:
[ 0. 1. 2. ..., -3. -2. -1.]
To get around this problem we simply need to swap the lower and upper parts of the arrays with np.fft.fftshift:
ax5.plot(np.fft.fftshift(fft_freq_axis),np.fft.fftshift(abs(fft_hann)))
Then you should note that the graph on Wikipedia is actually shown with amplitudes in decibels. You would then need to do the same with:
ax5.plot(np.fft.fftshift(fft_freq_axis),np.fft.fftshift(20*np.log10(abs(fft_hann))))
We should then be getting closer, but the result is not quite the same, as can be seen from the following figure:
[figure]
This is due to the fact that the plot on Wikipedia actually has a higher frequency resolution and captures the value of the frequency spectrum as it oscillates, whereas your plot samples the spectrum at fewer points, and a lot of those points have near-zero amplitudes. To resolve this problem, we need to get the frequency spectrum of the window at more frequency points.
This can be done by zero padding the input to the FFT, or more simply setting the parameter n (desired length of the output) to a value much larger than the input size:
N = 8*len(num)
fft_freq_axis=np.fft.fftfreq(N,d=1/sample_freq)
fft_hann=np.fft.fft(hann, N)
ax5.plot(np.fft.fftshift(fft_freq_axis),np.fft.fftshift(20*np.log10(abs(fft_hann))))
ax5.set_xlim([-40, 40])
ax5.set_ylim([-50, 80])

Zoom in on np.fft2 result

Is there a way to choose the x/y output axes range from np.fft.fft2?
I have a piece of code computing the diffraction pattern of an aperture. The aperture is defined in a 2k x 2k pixel array. The diffraction pattern is basically the inner part of the 2D FT of the aperture. np.fft.fft2 gives me an output array the same size as the input, but with some preset range of the x/y axes. Of course I can zoom in using the image viewer, but I have already lost detail. What is the solution?
Thanks,
Gert
import numpy as np
import matplotlib.pyplot as plt
r= 500
s= 1000
y,x = np.ogrid[-s:s+1, -s:s+1]
mask = x*x + y*y <= r*r
aperture = np.ones((2*s+1, 2*s+1))
aperture[mask] = 0
plt.imshow(aperture)
plt.show()
ffta= np.fft.fft2(aperture)
plt.imshow(np.log(np.abs(np.fft.fftshift(ffta))**2))
plt.show()
Unfortunately, much of the speed and accuracy of the FFT come from the outputs being the same size as the input.
The conventional way to increase the apparent resolution in the output Fourier domain is by zero-padding the input: np.fft.fft2(aperture, [4 * (2*s+1), 4 * (2*s+1)]) tells the FFT to pad your input to be 4 * (2*s+1) pixels tall and wide, i.e., make the input four times larger (sixteen times the number of pixels).
(Begin aside.) I say "apparent" resolution because the actual amount of data you have hasn't increased, but the Fourier transform will appear smoother because zero-padding in the input domain causes the Fourier transform to interpolate the output. In the example above, any feature that could be seen with one pixel will be shown with four pixels. Just to make this fully concrete, this example shows that every fourth pixel of the zero-padded FFT is numerically the same as every pixel of the original unpadded FFT:
# Generate your `ffta` as above, then
N = 2 * s + 1
Up = 4
fftup = np.fft.fft2(aperture, [Up * N, Up * N])
relerr = lambda dirt, gold: np.abs((dirt - gold) / gold)
print(np.max(relerr(fftup[::Up, ::Up] , ffta))) # ~6e-12.
(That relerr is just a simple relative error, which you want to be close to machine precision, around 2e-16. The largest error between every 4th sample of the zero-padded FFT and the unpadded FFT is 6e-12, which is quite close to machine precision, meaning these two arrays are nearly numerically equivalent.) (End aside.)
Zero-padding is the most straightforward way around your problem. But it does cost you a lot of memory. And it is frustrating because you might only care about a tiny, tiny part of the transform. There's an algorithm called the chirp z-transform (CZT, or colloquially the "zoom FFT") which can do this. If your input is N (for you 2*s+1) and you want just M samples of the FFT's output evaluated anywhere, it will compute three Fourier transforms of size N + M - 1 to obtain the desired M samples of the output. This would solve your problem too, since you can ask for M samples in the region of interest, and it wouldn't require prohibitively-much memory, though it would need at least 3x more CPU time. The downside is that a solid implementation of CZT isn't in Numpy/Scipy yet: see the scipy issue and the code it references. Matlab's CZT seems reliable, if that's an option; Octave-forge has one too and the Octave people usually try hard to match/exceed Matlab.
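For what it's worth, newer SciPy releases (1.8+, if I'm not mistaken) did eventually gain scipy.signal.czt and scipy.signal.zoom_fft. Assuming one of those is available, a per-axis sketch for the 2D case might look like this:
import numpy as np
from scipy.signal import zoom_fft  # available in SciPy >= 1.8
# aperture as defined in the question, shape (2*s+1, 2*s+1)
m = 512           # number of output samples wanted per axis
band = [0, 0.05]  # normalized frequency band of interest (with fs=1)
# Apply the 1D zoom FFT along each axis in turn; only the band of
# interest is evaluated, so no huge zero-padded array is needed.
zoomed = zoom_fft(aperture, band, m, fs=1, axis=0)
zoomed = zoom_fft(zoomed, band, m, fs=1, axis=1)
print(zoomed.shape)  # (512, 512)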
But if you have the memory, zero-padding the input is the way to go.

Fourier transform and filtering frequencies with negative fft values

I'm looking for the most abundant frequency in a periodic signal.
I'm trying to understand what I get if I perform a Fourier transform on a periodic signal and filter for frequencies which have negative FFT values.
In other words, what do the axes of plots 2 and 3 (see below) express? I'm plotting frequency (cycles/second) against the FFT-transformed signal - what do negative values on the y-axis mean, and would it make sense that I'd be interested in only those?
import numpy as np
import scipy
import matplotlib.pyplot as plt
from numpy import pi
# generate data
time = scipy.linspace(0,120,4000)
acc = lambda t: 10*scipy.sin(2*pi*2.0*t) + 5*scipy.sin(2*pi*8.0*t) + 2*scipy.random.random(len(t))
signal = acc(time)
# get frequencies from decomposed fft
W = np.fft.fftfreq(signal.size, d=time[1]-time[0])
f_signal = np.fft.fft(signal)
# filter signal
# I'm getting only the "negative" part!
cut_f_signal = f_signal.copy()
# filter noisy frequencies
cut_f_signal[(W < 8.0)] = 0
cut_f_signal[(W > 8.2)] = 0
# inverse fourier to get filtered frequency
cut_signal = np.fft.ifft(cut_f_signal)
# plot
plt.subplot(221)
plt.plot(time,signal)
plt.subplot(222)
plt.plot(W, f_signal)
plt.subplot(223)
plt.plot(W, cut_f_signal)
plt.subplot(224)
plt.plot(time, cut_signal)
plt.show()
The FFT of a real-valued input signal will produce a conjugate symmetric result. (That's just the way the math works best.) So, for FFT result magnitudes only of real data, the negative frequencies are just mirrored duplicates of the positive frequencies, and can thus be ignored when analyzing the result.
However, if you want to do the inverse and compute the IFFT, you will need to feed the IFFT a conjugate-symmetric negative half (or upper half, above Fs/2) of frequency data, or else your IFFT result will end up producing a complex result (i.e. with non-zero imaginary (sqrt(-1)) components, rarely what one wants when dealing with base-band real data).
If you want to filter the FFT data and end up with real results from an IFFT, you will need to filter the positive and negative frequencies identically, to maintain the needed symmetry, as sketched below.
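A minimal sketch of such a symmetric band-pass, reusing the question's W and f_signal:
import numpy as np
# Zero everything except the 8.0-8.2 Hz band on BOTH sides of the spectrum,
# so the spectrum stays conjugate symmetric and the IFFT comes out real.
cut_f_signal = f_signal.copy()
band = (np.abs(W) >= 8.0) & (np.abs(W) <= 8.2)
cut_f_signal[~band] = 0
cut_signal = np.fft.ifft(cut_f_signal)  # imaginary part is now ~0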
The FFT also produces a complex result, where the value and sign of the components (real and imaginary) of each result bin represent the phase as well as the magnitude of the component basis vector (complex sinusoid, or real cosine plus real sine components). Any negative value just represents a phase rotation relative to the same result being positive.
As @hotpaw2 already wrote above, the result of an FFT performed on a real signal in the time domain consists of complex values in the frequency domain.
The input value f_signal of your plot command is a vector of complex values.
plt.subplot(222)
plt.plot(W, f_signal)
This results in meaningless output.
You should plot the absolute values of f_signal.
If you are interested in the phase you should plot the angle, too.
In Matlab this would look like this:
% Plot the absolute values of f_signal
plot(W, abs(f_signal));
% Plot the phase of f_signal
plot(W, unwrap(angle(f_signal)));
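The numpy/matplotlib equivalent (a sketch reusing the question's variable names) would be:
import numpy as np
import matplotlib.pyplot as plt
# Plot the absolute values of f_signal
plt.plot(W, np.abs(f_signal))
# Plot the (unwrapped) phase of f_signal
plt.plot(W, np.unwrap(np.angle(f_signal)))
plt.show()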