How to detect silence and cut mp3 file without re-encoding using NAudio and .NET - naudio

I've been looking for an answer everywhere and I was only able to find some bits and pieces. What I want to do is to load multiple mp3 files (kind of temporarily merge them) and then cut them into pieces using silence detection.
My understanding is that I can use Mp3FileReader for this but the questions are:
1. How do I read say 20 seconds of audio from an mp3 file? Do I need to read 20 times reader.WaveFormat.AverageBytesPerSecond? Or maybe keep on reading frames until the sum of Mp3Frame.SampleCount / Mp3Frame.SampleRate exceeds 20 seconds?
2. How do I actually detect the silence? I would look at an appropriate number of the consecutive samples to check if they are all below some threshold. But how do I access the samples regardless of them being 8 or 16bit, mono or stereo etc.? Can I directly decode an MP3 frame?
3. After I have detected silence at say sample 10465, how do I map it back to the mp3 frame index to perform the cutting without re-encoding?

Here's the approach I'd recommend (which does involve re-encoding):
1. Use AudioFileReader to get your MP3 as floating-point samples directly in the Read method.
2. Find an open source noise gate algorithm, port it to C#, and use that to detect silence (i.e. when the noise gate is closed, you have silence; you'll want to tweak the threshold and the attack/release times).
3. Create a derived ISampleProvider that uses the noise gate and, in its Read method, does not return samples that are in silence.
4. Either pass the output into WaveFileWriter to create a WAV file and then encode the WAV file to MP3, or use NAudio.Lame to encode directly without a WAV step. You'll probably need to go from the sample provider back down to a 16-bit wave provider first.
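To illustrate the "do not return samples that are in silence" step, here is a minimal, language-neutral sketch in Python. It is not NAudio code and not a real noise gate (there is no attack/release envelope); `drop_silence`, `threshold` and `hold_seconds` are hypothetical names chosen for illustration. In NAudio the same logic would sit inside the Read method of the derived ISampleProvider.

```python
from typing import Iterable, List


def drop_silence(samples: Iterable[float], sample_rate: int,
                 threshold: float = 0.01,
                 hold_seconds: float = 0.05) -> List[float]:
    """Return the samples with silent stretches removed.

    A quiet run only counts as silence once |sample| has stayed below
    `threshold` for at least `hold_seconds`; shorter quiet runs (for
    example around zero crossings) are kept.
    """
    hold = max(1, int(hold_seconds * sample_rate))
    out: List[float] = []
    quiet: List[float] = []  # quiet samples not yet known to be silence
    for s in samples:
        if abs(s) < threshold:
            quiet.append(s)
        else:
            if len(quiet) < hold:
                out.extend(quiet)  # too short to be silence: keep it
            quiet = []
            out.append(s)
    if len(quiet) < hold:
        out.extend(quiet)
    return out


# 100 quiet samples at 1 kHz (0.1 s) are dropped, 5 quiet samples are kept.
signal = [0.5] * 10 + [0.0] * 100 + [0.5] * 10 + [0.0] * 5 + [0.5]
trimmed = drop_silence(signal, sample_rate=1000)
```

The threshold and hold time here are guesses and would need tuning against real material, just like the gate parameters mentioned above.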

BEFORE READING BELOW: Mark's answer is far easier to implement, and you'll almost certainly be happy with the results. This answer is for those who are willing to spend an inordinate amount of time on it.
So with that said, cutting an MP3 file based on silence without re-encoding or full decoding is actually possible. Basically, you can look at each frame's side info and each granule's gain and Huffman data to "estimate" the silence.
1. Find the silence.
2. Copy all the frames from before the silence to a new file.
Now it gets tricky...
3. Pull the audio data from the frames after the silence, keeping track of which frame header goes with what audio data.
4. Start writing the second new file, but as you write out the frames, update the main_data_begin field so the bit reservoir is in sync with where the audio data really is.

MP3 is a compressed audio format. You can't just cut bits out and expect the remainder to still be a valid MP3 file. In fact, since it's built on a DCT-style transform (the MDCT), the bits are in the frequency domain instead of the time domain. There simply are no bits for sample 10465. There's a frame which contains sample 10465, and there's a set of bits describing all frequencies in that frame.
Plain cutting the audio at sample 10465 and continuing with some random other sample probably causes a discontinuity, which means the number of frequencies present in the resulting frame skyrockets. So that definitely means a full recode. The better way is to smooth the transition, but that's not a trivial operation. And the result is of course slightly different than the input, so it still means a recode.
I don't understand why you'd want to read 20 seconds of audio anyway. Where's that number coming from? You usually want to read everything.
Sound is a wave; it's entirely expected that it crosses zero. So being close to zero isn't special. For a 20 Hz wave (threshold of hearing), zero crossings happen 40 times per second, but each time you'll have multiple samples near zero. So you basically need multiple samples that are all close to zero, but on both sides. Sample values of 5, 6, 7 aren't much for 16-bit sound, but they might very well be part of a wave that will reach a maximum of 10000. You really should check for at least 0.05 seconds (one full period at 20 Hz) to catch those 20 Hz sounds.
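The check described above can be sketched like this in Python, working on already-decoded 16-bit samples. `find_silences`, the threshold of 200 and the sample data are assumptions made for the example, not values from any library.

```python
from typing import List, Sequence, Tuple


def find_silences(samples: Sequence[int], sample_rate: int,
                  threshold: int = 200,
                  min_seconds: float = 0.05) -> List[Tuple[int, int]]:
    """Return (start, end) sample ranges, end exclusive, where every
    sample stays within +/- threshold for at least min_seconds."""
    min_len = int(min_seconds * sample_rate)
    silences: List[Tuple[int, int]] = []
    start = None  # index where the current quiet run began
    for i, s in enumerate(samples):
        if abs(s) <= threshold:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_len:
                silences.append((start, i))
            start = None
    # A quiet run may extend to the end of the data.
    if start is not None and len(samples) - start >= min_len:
        silences.append((start, len(samples)))
    return silences


# The short "5 6 7" dip is ignored; only the 60-sample run (at 1 kHz) counts.
pcm = [10000] * 20 + [5, 6, 7] + [10000] * 20 + [0] * 60 + [9000] * 5
found = find_silences(pcm, sample_rate=1000)
```

For stereo you would probably want to test each channel (or the maximum of both) rather than the interleaved stream as-is.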
Since you detected silence in a 50 millisecond interval, you have a "position" that's a couple of thousand samples wide (about 2,205 samples at 44.1 kHz). An MPEG-1 Layer III frame holds 1,152 samples, so with any bit of luck there's a frame boundary in there. Cut there. Otherwise it's time for re-encoding.
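Mapping a silent sample range back to a frame index is simple arithmetic if you assume a fixed number of samples per frame. The sketch below assumes 1,152 samples per frame (MPEG-1 Layer III; MPEG-2 and MPEG-2.5 Layer III use 576) and ignores encoder/decoder delay and the bit reservoir, so treat the result as a candidate cut point only. `frame_boundary_in` is a hypothetical helper name.

```python
from typing import Optional

SAMPLES_PER_FRAME = 1152  # assumption: MPEG-1 Layer III


def frame_boundary_in(start_sample: int, end_sample: int,
                      samples_per_frame: int = SAMPLES_PER_FRAME
                      ) -> Optional[int]:
    """Return the index of the first frame that starts inside
    [start_sample, end_sample), or None if no frame starts there."""
    first = -(-start_sample // samples_per_frame)  # ceiling division
    if first * samples_per_frame < end_sample:
        return first
    return None


# A 50 ms window at 44.1 kHz starting at sample 10000 contains the
# start of frame 9 (sample 10368).
cut_frame = frame_boundary_in(10000, 10000 + 2205)
```

If this returns None, the silent stretch lies entirely inside one frame and a clean frame-level cut is not available.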

Related

Reading specific bytes of data from a large text file... quickly

For argument's sake, let's say you have a single, enormous file to hold your map save data. The game that comes to mind as a great example is Terraria. They save all MapWidth*MapHeight tile data within a single map file (Horrible idea, really) but they can render only what is visible within the camera (And some outer-lying tiles for smoothness sake) based on the camera position.
So my question is, "How can they search through all of that data in real time starting at the camera position?"
That would entail reading through potentially millions of tiles just to get to the screen coordinates. I understand you could skip bytes of data based on the x/y coordinates if the tile data had a consistent size (this is all I can find in my week or so of searching), but that is where my problem lies. The tile data is dynamic. If one tile is empty, the data beyond "isValid" is nonexistent, so that is fewer bytes to search through. If a tile has water, multiple states, a background, etc., it contains all the data and is the largest in terms of bytes. So the size is not constant at all. In that case we cannot just skip X bytes, as the amount changes (constantly, as tiles are modified).
My current solutions are: Read it line by line (Ugh), use chunk files, or ensure fixed line sizes (Padding? Data wasted... Ugh).
I know chunks would be the best option, but being able to reach that deep into text files quickly would still be a nice thing to know.
If you have chunk-based data, you need a chunk-based reader, simple as that.
Additionally, if you're only interested in certain parts of the data and you can process it first, another option is to build a second file/list that stores the offset of the start of every object in the first file.
In that case, whenever you need to reference an object, you look up the offset first and then jump straight to it in your original file. It still requires you to read through the whole file at least once.
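As a rough sketch of that offset list, assuming one variable-length record per line (the tile strings below are made up for the example): one pass records where every line starts, and after that any record is a single seek away.

```python
import io
from typing import BinaryIO, List


def build_line_index(f: BinaryIO) -> List[int]:
    """Read the file once and record the byte offset of each line."""
    offsets: List[int] = []
    f.seek(0)
    while True:
        pos = f.tell()
        line = f.readline()
        if not line:
            break
        offsets.append(pos)
    return offsets


def read_record(f: BinaryIO, offsets: List[int], n: int) -> bytes:
    """Jump straight to record n using the prebuilt index."""
    f.seek(offsets[n])
    return f.readline().rstrip(b"\n")


# In-memory stand-in for a map file with variable-length tile records.
data = io.BytesIO(b"empty\nwater,state=2,bg=stone\nempty\ndirt\n")
index = build_line_index(data)
tile = read_record(data, index, 3)
```

Note that the index goes stale as soon as a record changes size, so with tiles being modified constantly it would have to be rebuilt or patched on every write. That is one more reason chunked or fixed-size storage tends to be easier.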

MPEG-2 seeking, where to start?

I'd like to be able to seek to an arbitrary frame in an MPEG-2 file (from DVD; I guess it's called MPEG-2 Program Stream). So far I had been using OpenCV 2.1 for accessing those frames, but that would only work on a frame-after-frame basis (forward seeking only). Later, when I installed OpenCV 2.3.1, that possibility was lost, i.e. it is limited to AVI. Anyway, I'd like to do this without OpenCV. I've managed to seek to keyframes (I suppose), or to every so-and-so frame (e.g. every 12th frame). Now, looking at VirtualDub, frame-accurate seeking is possible. It says: "parsing interleaved MPEG-2 file". What exactly does that mean, and where would I have to start to do the same? Is it even legal? I remember reading something about that somewhere, but I can't really recall the details. I'm programming in C++ using DirectShow. As far as I know, DirectShow won't do it. I was then looking into CBaseFilter, the StreamTime method, etc., but before I dive into that complex topic I'd like to know if that's the right way to go. Looking forward to your answers, thanks!
# Geraint: code snippet of filter graph:
CoCreateInstance(CLSID_FilterGraph,NULL,CLSCTX_INPROC,IID_IGraphBuilder,(LPVOID *)&pGraphBuilder);
CoCreateInstance(CLSID_MPEG2Demultiplexer,NULL,CLSCTX_INPROC,IID_IBaseFilter,(LPVOID *)&pib);
CoCreateInstance(CLSID_CMPEG2VidDecoderDS,NULL,CLSCTX_INPROC,IID_IBaseFilter,(LPVOID *)&pib2);
pGraphBuilder->AddFilter(pib,L"Sample Splitter");
pGraphBuilder->AddFilter(pib2,L"Sample Decoder");
ZeroMemory(&am_media_type, sizeof(am_media_type));
am_media_type.majortype = MEDIATYPE_Video;
am_media_type.subtype = MEDIASUBTYPE_MPEG2_VIDEO;
am_media_type.formattype = FORMAT_MPEG2Video;
pGraphBuilder->QueryInterface(IID_IMediaControl,(LPVOID *)&pMediaControl);
pGraphBuilder->QueryInterface(IID_IMediaSeeking, (void**)(&pMediaSeeking));
pGraphBuilder->QueryInterface(__uuidof(IVideoFrameStep), (PVOID *)&fst);
pGraphBuilder->QueryInterface(IID_IMediaEvent, (void **)&imev);
pGraphBuilder->QueryInterface(IID_IBasicVideo,(LPVOID *)&ibv);
pGraphBuilder->RenderFile(FILENAME,0);
and then I use IMediaSeeking for seeking the vid. I've also tried frame stepping (hence the references above).
DirectShow is capable of delivering frame-accurate seeking. However, without an index, this is based on a time offset from file start, not a frame count.
Use IMediaSeeking to set the start time. The demux will begin delivery of compressed frames some way before that. The decoder will start decoding at the previous key frame but will discard any frames that are before your chosen start point.
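Since the seek is time-based, a frame number has to be converted to a time first. DirectShow reference times are expressed in 100-nanosecond units, so for a constant frame rate the conversion is plain arithmetic. The Python sketch below only shows that calculation; `frame_to_reference_time` is a hypothetical helper, and it assumes a constant frame rate and a stream whose timestamps start at zero, which a real DVD stream may not satisfy.

```python
UNITS_PER_SECOND = 10_000_000  # one reference-time unit is 100 ns


def frame_to_reference_time(frame_index: int, fps_num: int,
                            fps_den: int = 1) -> int:
    """Start time of a frame in 100 ns units, rounded down.

    The frame rate is given as a fraction fps_num / fps_den so that
    rates such as 30000/1001 stay exact.
    """
    return (frame_index * UNITS_PER_SECOND * fps_den) // fps_num


pal_time = frame_to_reference_time(100, 25)            # 4 seconds
ntsc_time = frame_to_reference_time(30, 30000, 1001)   # 1.001 seconds
```

The resulting value is what you would pass as the start position when seeking by time.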
G

NAudio - Create software beat machine / sampler - General Strategy

I'm new to NAudio, but the goal of my project is to let the user listen to an MP3 and then select a sample or "chunk" of that song, which could be saved to disk. These samples should be able to be replayed at the same time (i.e. not merged, but played simultaneously).
Could someone please let me know the overall strategy required to achieve this (not necessarily the specifics, almost like pseudocode)?
For example, would the samples / chunks of a song need to be saved as WAV files, and could these samples be played together in the WAV format, etc.?
I have seen a few small examples of implementations of some of the ideas I've mentioned above, but I don't have a good sense of the big picture just yet.
Thanks in advance,
Andrew
The chunks wouldn't need to be saved as WAV files unless you were keeping them for future use. You can store the PCM audio (Mp3FileReader automatically converts to PCM) in a byte array and use RawSourceWaveStream to play them.
As for mixing them, I'd recommend using the MixingSampleProvider. This does mean you need to convert your RawSourceWaveStream to IEEE float, but you can use Pcm16BitToSampleProvider to do this. This will give the advantage that you can adjust volumes (and do other DSP) easily on the samples you are mixing. MixingSampleProvider also auto-removes completed inputs, so you can just add new inputs whenever you want to trigger a sound.
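Conceptually, the conversion and the mixing boil down to the following. This is a Python sketch of the idea only, not of NAudio's classes; `pcm16_to_float` and `mix` are hypothetical names, and the final clamp to [-1, 1] is a choice made for this example.

```python
import struct
from typing import List, Sequence


def pcm16_to_float(raw: bytes) -> List[float]:
    """Convert little-endian 16-bit PCM bytes to floats in [-1, 1)."""
    count = len(raw) // 2
    values = struct.unpack("<%dh" % count, raw[:count * 2])
    return [v / 32768.0 for v in values]


def mix(inputs: Sequence[Sequence[float]],
        volumes: Sequence[float]) -> List[float]:
    """Sum the inputs sample by sample, each scaled by its volume.

    Shorter inputs simply stop contributing once they run out.
    """
    length = max((len(x) for x in inputs), default=0)
    out = [0.0] * length
    for samples, vol in zip(inputs, volumes):
        for i, s in enumerate(samples):
            out[i] += s * vol
    # Clamp so the result can be converted back to 16-bit PCM safely.
    return [max(-1.0, min(1.0, v)) for v in out]


kick = pcm16_to_float(struct.pack("<3h", 16384, -16384, 0))
mixed = mix([kick, [0.25]], volumes=[1.0, 2.0])
```

Working in floating point is what makes the per-input volume (and any other DSP) a simple multiplication.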

Performing buffering/windowing with overlap add CMSampleBufferRef

I'm trying to perform some basic DSP functions on PCM audio data which I retrieve from a video file using AVAssetReader on the iPhone.
I'm reading the buffers correctly, number of samples per buffer is 8192 (is that by default? can that be changed?).
However, I need to perform windowing, fft and various other manipulations on slices which aren't 8192 samples long. In fact I want to process 512 samples at a time with 50% overlap between each slice.
I've been digging deep in Apple's Accelerate/vDSP framework and I think I can handle the processing and such, just not sure how to actually split up my signal the way I want it.
I have a strong DSP background but unfortunately my DSP programming experience pretty much ends in MATLAB.
Any help will be appreciated.
After digging deeper I found CASpectralProcessor in the PublicUtility folder of the Core Audio developer tools, which from version 4.3 onwards are no longer bundled with Xcode. To download them, go to
https://developer.apple.com/downloads/index.action?name=for%20Xcode%20-
CASpectralProcessor is exactly what I need, a full blown spectral analyzer that includes customizing window length, window type, hop size. Even performs IFFT with overlap/add!
Hope this helps anyone.
You can chop 1 or 2 of those large buffers into a number of buffers of some shorter desired length and feed those shorter buffers or slices to your processing routine.
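One way to do that chopping so that slices can straddle two of the large buffers is to keep a small carry-over queue. The Python sketch below shows the idea with 512-sample slices and a 256-sample hop (50% overlap); `overlapping_frames` and `hann` are illustrative names, not part of AVFoundation or vDSP.

```python
import math
from typing import Iterable, Iterator, List, Sequence


def overlapping_frames(buffers: Iterable[Sequence[float]],
                       frame_size: int = 512,
                       hop: int = 256) -> Iterator[List[float]]:
    """Yield frame_size-sample slices, advancing by hop samples each
    time, from a stream of buffers of any length."""
    pending: List[float] = []  # samples carried over between buffers
    for buf in buffers:
        pending.extend(buf)
        while len(pending) >= frame_size:
            yield pending[:frame_size]
            del pending[:hop]  # keep the overlapping tail


def hann(n: int) -> List[float]:
    """Periodic Hann window; copies shifted by n/2 sum to 1."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / n) for i in range(n)]


# Two 8192-sample buffers whose values equal their sample index.
buffers = [list(range(8192)), list(range(8192, 16384))]
frames = list(overlapping_frames(buffers))
window = hann(512)
```

Each slice would then be multiplied by the window before the FFT. Any samples left in the queue at the end of the stream are shorter than one slice and are dropped here; zero-padding them is the usual alternative.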

Getting Pitch with VB.net

I want to get the pitch of a song at any point. I plan on storing the pitches later. How can I read say... an mp3 file or wav file to get the pitch played at a certain point?
Here is a visual example:
Say I wanted to get the pitch that is here at ^this point of the song.
Thanks if you can!
The matter is a tad more complicated than you may be anticipating.
While time-domain approaches exist (that is, approaches which work with the PCM data directly), frequency-domain pitch detection is going to be more accurate. You can read a very simplified overview here.
What you probably want is a Fourier Transform, which can be used to transform blocks of your signal from time-domain to frequency-domain (that is, a distribution of frequency content over a given span of the signal). From there, you would need to analyze the frequency spectrum within that block. The problem becomes even harder still, because there is no best way to deduce pitch from a sampled frequency spectrum in the general case. The aforementioned Wikipedia article should give you a foundation for looking into those algorithms.
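To make the "transform a block, then analyze the spectrum" step concrete, here is a deliberately naive Python sketch that computes a plain DFT of one block and reports the strongest bin. It is slow (a real program would use an FFT library), `dominant_frequency` is a made-up name, and the strongest spectral peak is only a crude stand-in for perceived pitch, since a harmonic can easily be louder than the fundamental.

```python
import cmath
import math
from typing import Sequence


def dominant_frequency(block: Sequence[float], sample_rate: int) -> float:
    """Return the frequency (Hz) of the strongest DFT bin in the block."""
    n = len(block)
    best_bin, best_mag = 0, 0.0
    for k in range(1, n // 2):  # skip DC, stay below Nyquist
        acc = 0j
        for i, x in enumerate(block):
            acc += x * cmath.exp(-2j * math.pi * k * i / n)
        mag = abs(acc)
        if mag > best_mag:
            best_bin, best_mag = k, mag
    return best_bin * sample_rate / n


# 0.1 s of a 440 Hz sine at 8 kHz: exactly 44 cycles in 800 samples.
rate = 8000
tone = [math.sin(2 * math.pi * 440 * i / rate) for i in range(800)]
estimate = dominant_frequency(tone, rate)
```

The frequency resolution is sample_rate / block length (10 Hz here), so longer blocks give finer pitch estimates at the cost of time resolution.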
Finally, it's worth noting that this is really a language-agnostic question, unless your primary interest is in reading a WAV file specifically using VB.NET.