Splitting audio segment by pitch - pydub

I'm trying to use Pydub to split an audio segment by pitch. My idea is to find the timestamps when the pitch changes and split according to those timestamps.
The part I am stuck on is how to find the time when the pitch changes.
Is there any way to do that?
Thanks

Related

How to detect silence and cut mp3 file without re-encoding using NAudio and .NET

I've been looking for an answer everywhere and I was only able to find some bits and pieces. What I want to do is to load multiple mp3 files (kind of temporarily merge them) and then cut them into pieces using silence detection.
My understanding is that I can use Mp3FileReader for this but the questions are:
1. How do I read say 20 seconds of audio from an mp3 file? Do I need to read 20 times reader.WaveFormat.AverageBytesPerSecond? Or maybe keep on reading frames until the sum of Mp3Frame.SampleCount / Mp3Frame.SampleRate exceeds 20 seconds?
2. How do I actually detect the silence? I would look at an appropriate number of the consecutive samples to check if they are all below some threshold. But how do I access the samples regardless of them being 8 or 16bit, mono or stereo etc.? Can I directly decode an MP3 frame?
3. After I have detected silence at say sample 10465, how do I map it back to the mp3 frame index to perform the cutting without re-encoding?
Here's the approach I'd recommend (which does involve re-encoding):
1. Use AudioFileReader to get your MP3 as floating-point samples directly in the Read method.
2. Find an open source noise gate algorithm, port it to C#, and use that to detect silence (i.e. when the noise gate is closed, you have silence). You'll want to tweak the threshold and the attack/release times.
3. Create a derived ISampleProvider that uses the noise gate and, in its Read method, does not return samples that are in silence.
4. Either pass the output into WaveFileWriter to create a WAV file and then encode the WAV file to MP3, or use NAudio.Lame to encode directly without a WAV step. You'll probably need to go from the sample provider back down to a 16-bit wave provider first.
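To make the gating idea concrete, here is a minimal sketch in Python rather than C#. It is not NAudio code and not a full noise gate: there is no attack stage, and the function name, the threshold and the hold time are all illustrative assumptions. It only shows the "do not return samples that are in silence" step on a list of floating-point samples.

```python
def strip_silence(samples, sample_rate, threshold=0.02, hold_seconds=0.05):
    """Drop quiet stretches from a list of float samples in the range -1..1.

    The gate opens on any sample whose magnitude reaches `threshold` and
    stays open for `hold_seconds` afterwards, so the zero crossings inside
    loud audio are not mistaken for silence.
    """
    hold = int(round(hold_seconds * sample_rate))
    kept = []
    remaining = 0  # quiet samples still allowed through before the gate closes
    for s in samples:
        if abs(s) >= threshold:
            remaining = hold      # loud sample: (re)open the gate
            kept.append(s)
        elif remaining > 0:
            remaining -= 1        # quiet sample inside the hold period
            kept.append(s)
    return kept
```

With a 1000 Hz sample rate and a 10 ms hold, a burst of 50 loud samples surrounded by silence comes out as 60 samples: the burst plus the hold tail.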
BEFORE READING BELOW: Mark's answer is far easier to implement, and you'll almost certainly be happy with the results. This answer is for those who are willing to spend an inordinate amount of time on it.
So with that said, cutting an MP3 file based on silence without re-encoding or full decoding is actually possible... Basically, you can look at each frame's side info and each granule's gain & huffman data to "estimate" the silence.
1. Find the silence.
2. Copy all the frames from before the silence to a new file.
Now it gets tricky...
3. Pull the audio data from the frames after the silence, keeping track of which frame header goes with what audio data.
4. Start writing the second new file, but as you write out the frames, update the main_data_begin field so the bit reservoir is in sync with where the audio data really is.
MP3 is a compressed audio format. You can't just cut bits out and expect the remainder to still be a valid MP3 file. In fact, since it's a DCT-based transform, the bits are in the frequency domain instead of the time domain. There simply are no bits for sample 10465. There's a frame which contains sample 10465, and there's a set of bits describing all frequencies in that frame.
Plain cutting the audio at sample 10465 and continuing with some random other sample probably causes a discontinuity, which means the number of frequencies present in the resulting frame skyrockets. So that definitely means a full recode. The better way is to smooth the transition, but that's not a trivial operation. And the result is of course slightly different than the input, so it still means a recode.
I don't understand why you'd want to read 20 seconds of audio anyway. Where's that number coming from? You usually want to read everything.
Sound is a wave; it's entirely expected that it crosses zero. So being close to zero isn't special. For a 20 Hz wave (the threshold of hearing), zero crossings happen 40 times per second, and each time you'll have multiple samples near zero. So you basically need a run of consecutive samples that are all close to zero, but on both sides. Sample values of 5, 6, 7 aren't much for 16-bit sound, but they might very well be part of a wave that will reach a maximum of 10000. You really should check at least 0.05 seconds (one full period at 20 Hz) to catch those 20 Hz sounds.
Since you detected silence over a 50-millisecond interval, you have a "position" that's a couple of thousand samples wide (about 2,205 samples at 44.1 kHz). With a bit of luck, there's a frame boundary in there. Cut there. Otherwise it's time for re-encoding.
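The last two points can be sketched in Python as follows. This is only an illustration under stated assumptions: the samples are already decoded floats, a frame holds 1152 samples (the MPEG-1 Layer III figure; other MPEG versions differ), and sample 0 lines up with the start of frame 0, which ignores encoder delay and padding. The function names are made up for this sketch.

```python
SAMPLES_PER_FRAME = 1152  # assumption: MPEG-1 Layer III frame size

def find_silent_run(samples, sample_rate, threshold=0.01, min_seconds=0.05):
    """Return (start, end) sample indices of the first run of at least
    `min_seconds` in which every |sample| is below `threshold`, else None."""
    needed = int(round(min_seconds * sample_rate))
    run_start = None
    for i, s in enumerate(samples):
        if abs(s) < threshold:
            if run_start is None:
                run_start = i          # a quiet run begins here
        else:
            if run_start is not None and i - run_start >= needed:
                return run_start, i    # the run was long enough
            run_start = None           # too short: keep looking
    if run_start is not None and len(samples) - run_start >= needed:
        return run_start, len(samples)
    return None

def frame_boundary_in(start, end, samples_per_frame=SAMPLES_PER_FRAME):
    """Return the first frame-aligned sample index in [start, end), or None."""
    boundary = -(-start // samples_per_frame) * samples_per_frame  # round up
    return boundary if boundary < end else None
```

If `frame_boundary_in` returns a value, dividing it by the frame size gives the index of the frame at which to cut; if it returns None, the silent run does not contain a boundary.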

What does the value at setVolume in AVAudioPlayer do/mean?

I am working in Xcode 4.5.1 and developing for the iPhone.
I am using AVAudioPlayer to play sound. I am playing two sounds and want to make a ratio of their average volume relative to each other.
I gather the relevant information using averagePowerForChannel: in combination with NSTimer, checking the volume of both sound files at an interval. However, I have discovered that, regardless of the value I pass to setVolume, a specific sound file will always return the same average volume, for example -20.0. I conclude that it does not take into account any volume changes you apply using setVolume.
You can enter values 0-1 at setVolume. Is there a way to convert these values to something meaningful that I can apply to the volume information which I have gathered using averagePowerForChannel:? I am assuming that I can't simply multiply my average volume value with the setVolume value. I have looked in the class reference, but I could not find anything useful.
Please point me in the right direction. Any input is appreciated.

FFT Pitch Detection for iOS using Accelerate Framework?

I have been reading up on FFT and Pitch Detection for a while now, but I'm having trouble piecing it all together.
I have worked out that the Accelerate framework is probably the best way to go with this, and I have read the example code from apple to see how to use it for FFTs. What is the input data for the FFT if I wanted to be running the pitch detection in real time? Do I just pass in the audio stream from the microphone? How would I do this?
Also, after I get the FFT output, how can I get the frequency from that? I have been reading everywhere, and can't find any examples or explanations of this.
Thanks for any help.
Frequency and pitch are not the same thing - frequency is a physical quantity, pitch is a psychological percept - they are similar, but there are important differences, which may or may not matter to you, depending on the type of instrument for which you are trying to measure pitch.
You need to read up a little on the various pitch detection algorithms (and on the meaning of pitch itself), decide what algorithm you want to use and only then set about implementing it. See this Wikipedia page for a good overview of pitch and pitch detection (note that you can use FFT for the autocorrelation-based and frequency domain methods).
As for using the FFT to identify peaks in a spectrum and their associated frequencies, there are many questions and answers related to this on SO already, see for example: How do I obtain the frequencies of each value in an FFT?
I have an example implementation of an autocorrelation function available online for iOS 5.1. Look at this post for a link to the implementation and for functions that find the nearest note and create a string representing the pitch (A, A#, B, C, etc.).
While the FFT is very useful in many applications, it might not be the most accurate if you're trying to do simple pitch detection. (It can be as accurate, but you have to deal with complex numbers to do a lot of phase calculations)
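For readers who want to see the autocorrelation idea in code, here is a minimal Python sketch. It is not the iOS implementation mentioned above; the function names, the 50 to 1000 Hz search range and the equal-tempered A4 = 440 Hz reference are assumptions made for illustration. It picks the lag at which the signal best matches a shifted copy of itself, then maps the resulting frequency to the nearest note name.

```python
import math

NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def autocorr_pitch(samples, sample_rate, fmin=50.0, fmax=1000.0):
    """Estimate the fundamental frequency in Hz, or None if nothing is found."""
    min_lag = max(1, int(sample_rate / fmax))
    max_lag = min(int(sample_rate / fmin), len(samples) - 1)
    best_lag, best_score = None, 0.0
    for lag in range(min_lag, max_lag + 1):
        # Correlate the signal with itself shifted by `lag` samples.
        score = sum(samples[i] * samples[i + lag]
                    for i in range(len(samples) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag if best_lag else None

def nearest_note(freq, a4=440.0):
    """Return the nearest equal-tempered note name, e.g. 'A4'."""
    midi = int(round(69 + 12 * math.log2(freq / a4)))
    return NOTE_NAMES[midi % 12] + str(midi // 12 - 1)
```

The estimate is quantised to whole-sample lags, so at low sample rates it can be off by a few hertz; interpolating around the best lag is a common refinement.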

Algorithm for reducing GPS track data to discard redundant data?

We're building a GIS interface to display GPS track data, e.g. imagine the raw data set from a guy wandering around a neighborhood on a bike for an hour. A set of data like this with perhaps a new point recorded every 5 seconds, will be large and displaying it in a browser or a handheld device will be challenging. Also, displaying every single point is usually not necessary since a user can't visually resolve that much data anyway.
So for performance reasons we are looking for algorithms that are good at 'reducing' data like this so that the number of points being displayed is reduced significantly but in such a way that it doesn't risk data mis-interpretation. For example, if our fictional bike rider stops for a drink, we certainly don't want to draw 100 lat/lon points in a cluster around the 7-Eleven.
We are aware of clustering, which is good for when looking at a bunch of disconnected points, however what we need is something that applies to tracks as described above. Thanks.
A more scientific and perhaps more math heavy solution is to use the Ramer-Douglas-Peucker algorithm to generalize your path. I used it when I studied for my Master of Surveying so it's a proven thing. :-)
Given your path and the maximum deviation (a distance tolerance) you can accept, it simplifies the path by reducing the number of points.
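A minimal Python sketch of Ramer-Douglas-Peucker follows. In its usual formulation the algorithm takes a distance tolerance (called `epsilon` here): it keeps the point farthest from the line between the two endpoints if that distance exceeds the tolerance, and recurses on both halves. The sketch assumes points are planar (x, y) tuples in a projected coordinate system rather than raw lat/lon, and it measures distance to the segment rather than to the infinite line, which is a common variant.

```python
import math

def _point_segment_distance(p, a, b):
    """Distance from point p to the segment a-b (all (x, y) tuples)."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    if dx == 0 and dy == 0:
        return math.hypot(px - ax, py - ay)
    # Project p onto the segment and clamp to its endpoints.
    t = ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))

def rdp(points, epsilon):
    """Simplify a polyline, keeping points that deviate more than epsilon."""
    if len(points) < 3:
        return list(points)
    first, last = points[0], points[-1]
    index, dmax = 0, 0.0
    for i in range(1, len(points) - 1):
        d = _point_segment_distance(points[i], first, last)
        if d > dmax:
            index, dmax = i, d
    if dmax > epsilon:
        left = rdp(points[:index + 1], epsilon)
        right = rdp(points[index:], epsilon)
        return left[:-1] + right  # drop the duplicated split point
    return [first, last]
```

For example, a track that wobbles slightly along a straight street and then turns a corner reduces to three points: the start, the corner and the end. For very long tracks an iterative version avoids hitting Python's recursion limit.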
Typically the best way of doing that is:
1. Determine the minimum number of screen pixels you want between GPS points displayed.
2. Determine the distance represented by each pixel in the current zoom level.
3. Multiply answer 1 by answer 2 to get the minimum distance between coordinates you want to display.
4. Starting from the first coordinate in the journey path, read each next coordinate until you've reached the required minimum distance from the current point. Repeat.
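The steps above can be sketched in Python as follows. The function and parameter names are illustrative, and the sketch assumes the coordinates are already planar (x, y) values in metres; raw lat/lon pairs would need a projection or a great-circle distance instead of `math.hypot`.

```python
import math

def min_distance_filter(points, min_pixels, metres_per_pixel):
    """Keep a point only once it lies at least min_pixels * metres_per_pixel
    away from the previously kept point."""
    if not points:
        return []
    min_dist = min_pixels * metres_per_pixel  # steps 1 to 3
    kept = [points[0]]
    for x, y in points[1:]:                   # step 4
        last_x, last_y = kept[-1]
        if math.hypot(x - last_x, y - last_y) >= min_dist:
            kept.append((x, y))
    return kept
```

A cluster of points recorded while the rider is stationary collapses to a single point, because none of them is far enough from the last kept one. As written, the final point of the track is dropped if it is too close to the previous kept point; append it explicitly if the endpoint matters.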

Getting Pitch with VB.net

I want to get the pitch of a song at any point. I plan on storing the pitches later. How can I read say... an mp3 file or wav file to get the pitch played at a certain point?
Here is a visual example:
Say I wanted to get the pitch that is here at ^this point of the song.
Thanks if you can!
The matter is a tad more complicated than you may be anticipating.
While time-domain approaches exist (that is, approaches which work with the PCM data directly), frequency-domain pitch detection is going to be more accurate. You can read a very simplified overview here.
What you probably want is a Fourier Transform, which can be used to transform blocks of your signal from time-domain to frequency-domain (that is, a distribution of frequency content over a given span of the signal). From there, you would need to analyze the frequency spectrum within that block. The problem becomes even harder still, because there is no best way to deduce pitch from a sampled frequency spectrum in the general case. The aforementioned Wikipedia article should give you a foundation for looking into those algorithms.
Finally, it's worth noting that this is really a language-agnostic question, unless your primary interest is in reading a WAV file specifically using VB.NET.
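Since the question is language-agnostic, here is a minimal Python sketch of the block-wise Fourier approach described above. It uses a naive DFT from the standard library to stay self-contained (an FFT library would be much faster), and it simply reports the strongest bin in each block. As noted above, the strongest spectral peak is not always the perceived pitch, so treat this as a starting point; the function names and the block size are assumptions.

```python
import cmath
import math

def dominant_frequency(block, sample_rate):
    """Return the frequency (Hz) of the strongest DFT bin in one block."""
    n = len(block)
    # Hann window to reduce spectral leakage at the block edges.
    windowed = [s * (0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)))
                for i, s in enumerate(block)]
    best_bin, best_mag = 0, 0.0
    for k in range(1, n // 2):  # skip DC, stop below Nyquist
        coeff = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(windowed))
        if abs(coeff) > best_mag:
            best_bin, best_mag = k, abs(coeff)
    return best_bin * sample_rate / n  # bin k sits at k * sample_rate / n

def track_frequency(samples, sample_rate, block_size=256):
    """Return (start_time_seconds, dominant_frequency) for each full block."""
    return [(start / sample_rate,
             dominant_frequency(samples[start:start + block_size], sample_rate))
            for start in range(0, len(samples) - block_size + 1, block_size)]
```

The frequency resolution is `sample_rate / block_size`, so longer blocks give finer frequency steps at the cost of coarser timing. Looking up the entry whose start time covers a given moment gives the dominant frequency at that point of the song.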