I am using NVENC to encode a series of frames to an H.264 video on the GPU. My example is quite synthetic: I simply repeat the same frame. The encoding works fine, and I was trying to figure out what is going on:
One thing I notice is that the first frame is an IDR frame (NV_ENC_PIC_TYPE_IDR) and all the remaining frames are forward-predicted frames (NV_ENC_PIC_TYPE_P). I thought this was perhaps because every frame is the same, so I generated frames where each frame is completely random. The result was the same.
So, my question is whether there is something in the configuration that makes all the frames P-frames, with no B- or I-frames. Under what conditions would those frame types be generated?
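For reference, these are the NV_ENC_CONFIG fields that control the GOP pattern (a sketch based on the NVENC SDK headers; whether B frames actually appear also depends on the preset and on hardware support):
NV_ENC_CONFIG encodeConfig = { 0 };
encodeConfig.version = NV_ENC_CONFIG_VER;
// An I/IDR frame starts every gopLength frames; NVENC_INFINITE_GOPLENGTH
// gives a single IDR frame followed by nothing but predicted frames.
encodeConfig.gopLength = 30;
// frameIntervalP selects the GOP pattern: 1 = IPP (no B frames), 2 = IBP, 3 = IBBP.
encodeConfig.frameIntervalP = 3;
// idrPeriod controls how often an I frame is coded as an IDR frame.
encodeConfig.encodeCodecConfig.h264Config.idrPeriod = 30;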
I am currently implementing a few image style transfer algorithms in TensorFlow, but I would like to do it in tiles, so I don't have to run the entire image through the network. Everything works fine; however, each tile is normalized according to its own statistics, which results in tiles with slightly different characteristics.
I am certain that the only issue is instance normalization: if I feed the true values (obtained from the entire image) to each tile calculation, the result is perfect. However, I still have to run the entire image through the network to calculate those values. I also tried calculating them from a downsampled version of the image, but the resolution suffers a lot.
So my question is: is it possible to estimate the mean and variance for instance normalization without feeding the entire image through the network?
You can take a random sample of the pixels of the image and use the sample mean and sample variance to normalize the whole image. It will not be perfect, but the larger the sample, the better. A few hundred pixels will probably suffice, maybe even fewer, but you need to experiment.
Use tf.random_uniform() to get random X and Y coordinates, and then use tf.gather_nd() to get the pixel values at the given coordinates.
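A minimal sketch of that sampling (TF 1.x, matching tf.random_uniform(); the function name and sample count are illustrative):
import tensorflow as tf

def sampled_moments(image, n_samples=512):
    # image: [H, W, C] float tensor; estimate per-channel mean/variance
    # from n_samples randomly chosen pixels instead of the whole image.
    h = tf.shape(image)[0]
    w = tf.shape(image)[1]
    ys = tf.random_uniform([n_samples], maxval=h, dtype=tf.int32)
    xs = tf.random_uniform([n_samples], maxval=w, dtype=tf.int32)
    pixels = tf.gather_nd(image, tf.stack([ys, xs], axis=1))  # [n_samples, C]
    return tf.nn.moments(pixels, axes=[0])  # (mean, variance), each of shape [C]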
I am trying to analyze H.265 coding performance. Is there a way to export the predicted frames for H.265/HEVC encoding? Specifically, how should I obtain reconstructed frames after compensating with the motion vectors, but before applying the residual? Is there a way to do this with ffmpeg, or any other codec analysis tool?
Yes, you can do it with the HM reference decoder.
What you need to do is find the exact line of code in the TDecCu.cpp file where the two pointers piResi and piPred are added together to reconstruct the block. There, you can print piPred alone.
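The exact location varies between HM versions, but the instrumentation itself is small. A hedged sketch (the function name is mine; piPred points at the motion-compensated prediction before piResi is added):
#include <cstdio>

// Dump one predicted block, row by row, before the residual is added.
static void dumpPredictedBlock(const short *piPred, int stride,
                               int width, int height, FILE *out)
{
    for (int y = 0; y < height; y++)
    {
        for (int x = 0; x < width; x++)
            fprintf(out, "%d ", piPred[x]);
        fprintf(out, "\n");
        piPred += stride;  // advance to the next row of the block
    }
}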
I've been looking for an answer everywhere and was only able to find bits and pieces. What I want to do is load multiple MP3 files (kind of temporarily merge them) and then cut them into pieces using silence detection.
My understanding is that I can use Mp3FileReader for this, but the questions are:
1. How do I read, say, 20 seconds of audio from an MP3 file? Do I read 20 × reader.WaveFormat.AverageBytesPerSecond bytes? Or do I keep reading frames until the sum of Mp3Frame.SampleCount / Mp3Frame.SampleRate exceeds 20 seconds?
2. How do I actually detect the silence? I would look at an appropriate number of consecutive samples to check whether they are all below some threshold. But how do I access the samples regardless of whether they are 8- or 16-bit, mono or stereo, etc.? Can I decode an MP3 frame directly?
3. After I have detected silence at, say, sample 10465, how do I map that back to an MP3 frame index so I can cut without re-encoding?
Here's the approach I'd recommend (which does involve re-encoding):
Use AudioFileReader to get your MP3 as floating-point samples directly in the Read method.
Find an open-source noise gate algorithm, port it to C#, and use it to detect silence (i.e. when the noise gate is closed, you have silence; you'll want to tweak the threshold and attack/release times).
Create a derived ISampleProvider that uses the noise gate and, in its Read method, does not return samples that are in silence (see the sketch after this list).
Either: pass the output into WaveFileWriter to create a WAV file, and encode the WAV file to MP3.
Or: use NAudio.Lame to encode directly without a WAV step. You'll probably need to go from the sample provider back down to a 16-bit wave provider first.
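A hedged sketch of that derived provider (the class name is mine, and the per-sample threshold is a stand-in for a real noise gate with attack/release envelopes, so it will click at cut points):
using System;
using NAudio.Wave;

class SilenceSkippingProvider : ISampleProvider
{
    private readonly ISampleProvider source;
    private readonly float threshold;

    public SilenceSkippingProvider(ISampleProvider source, float threshold = 0.01f)
    {
        this.source = source;
        this.threshold = threshold;
    }

    public WaveFormat WaveFormat { get { return source.WaveFormat; } }

    public int Read(float[] buffer, int offset, int count)
    {
        var temp = new float[count];
        int written = 0;
        while (written < count)
        {
            int read = source.Read(temp, 0, count - written);
            if (read == 0) break;  // source exhausted
            for (int i = 0; i < read; i++)
                if (Math.Abs(temp[i]) >= threshold)  // keep only non-silent samples
                    buffer[offset + written++] = temp[i];
        }
        return written;
    }
}
For the final step, WaveFileWriter.CreateWaveFile16(path, provider) will take the provider's output straight to a 16-bit WAV file.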
BEFORE READING BELOW: Mark's answer is far easier to implement, and you'll almost certainly be happy with the results. This answer is for those who are willing to spend an inordinate amount of time on it.
So with that said, cutting an MP3 file based on silence without re-encoding or full decoding is actually possible... Basically, you can look at each frame's side info and each granule's gain and Huffman data to "estimate" the silence.
1. Find the silence.
2. Copy all the frames from before the silence to a new file.
Now it gets tricky...
3. Pull the audio data from the frames after the silence, keeping track of which frame header goes with which audio data.
4. Start writing the second new file, but as you write out the frames, update the main_data_begin field so the bit reservoir stays in sync with where the audio data really is.
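For that main_data_begin update, the field sits at the very start of the side info. A sketch for reading it from MPEG-1 Layer III frames (offsets and field width differ for MPEG-2):
// main_data_begin is the first 9 bits of the side info, which starts right
// after the 4-byte header plus an optional 2-byte CRC.
static int ReadMainDataBegin(byte[] frame)
{
    bool hasCrc = (frame[1] & 0x01) == 0;  // protection bit: 0 = CRC present
    int sideInfoOffset = hasCrc ? 6 : 4;
    return (frame[sideInfoOffset] << 1) | (frame[sideInfoOffset + 1] >> 7);
}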
MP3 is a compressed audio format. You can't just cut bits out and expect the remainder to still be a valid MP3 file. In fact, since MP3 uses an MDCT-based transform, the bits are in the frequency domain rather than the time domain. There simply are no bits for sample 10465; there's a frame that contains sample 10465, and a set of bits describing all the frequencies in that frame.
Cutting the audio at sample 10465 and continuing with some arbitrary other sample will almost certainly cause a discontinuity, which means the number of frequencies present in the resulting frame skyrockets. So that definitely means a full re-encode. The better way is to smooth the transition, but that's not a trivial operation, and the result is of course slightly different from the input, so it still means a re-encode.
I don't understand why you'd want to read 20 seconds of audio anyway. Where's that number coming from? You usually want to read everything.
Sound is a wave; it's entirely expected that it crosses zero, so a sample being close to zero isn't special by itself. For a 20 Hz wave (the threshold of hearing), zero crossings happen 40 times per second, and each time there are multiple samples near zero. So you need multiple samples that are all close to zero, on both sides. Sample values like 5, 6, 7 aren't much for 16-bit audio, but they might very well be part of a wave that peaks at 10000. You really should check at least 0.05 seconds to catch those 20 Hz sounds.
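In code, that check amounts to something like this (the threshold and window length are the knobs to experiment with):
// A window counts as silent only if every sample in ~50 ms stays under the threshold.
static bool IsSilent(float[] samples, int start, int sampleRate, float threshold)
{
    int window = sampleRate / 20;  // 0.05 s, one full period of a 20 Hz wave
    for (int i = start; i < start + window && i < samples.Length; i++)
        if (Math.Abs(samples[i]) >= threshold)
            return false;
    return true;
}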
Since you detected silence over a 50 millisecond interval, you have a "position" that is several hundred samples wide. With any bit of luck, there's a frame boundary in there; cut there. Otherwise, it's time for re-encoding.
Are there any suggestions for decoding video data from a webcam in raw format? I would like the per-pixel raster data, frame by frame.
Use libraw to load and decode the RAW data. If the html5-video tag was intended and means what I think it does, there is no way to do this in a browser, short of compiling libraw to JavaScript, if you feel really adventurous.
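If you go the libraw route, a minimal sketch looks like this (file-based; wiring it to a live webcam capture is up to your capture layer, and "frame.dng" is just a placeholder):
#include <libraw/libraw.h>

int main()
{
    LibRaw raw;
    raw.open_file("frame.dng");  // one captured RAW frame
    raw.unpack();                // decode the raw sensor data
    raw.dcraw_process();         // demosaic to RGB
    libraw_processed_image_t *img = raw.dcraw_make_mem_image();
    // img->data now holds width * height * 3 bytes of per-pixel raster data
    LibRaw::dcraw_clear_mem(img);
    return 0;
}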
I'd like to be able to seek to an arbitrary frame in an MPEG-2 file (from DVD; I guess it's called an MPEG-2 Program Stream). So far I had been using OpenCV 2.1 to access those frames, but that only worked on a frame-after-frame basis (forward seeking only). When I installed OpenCV 2.3.1, even that possibility was lost, i.e. it was limited to AVI. Anyway, I'd like to do this without OpenCV. I've managed to seek to keyframes (I suppose), or to every so-and-so frame (e.g. every 12th frame).
Now, looking at VirtualDub, frame-accurate seeking is possible. It says: "parsing interleaved MPEG-2 file". What exactly does that mean, and where would I have to start to do the same? Is it even legal? I remember reading something about that somewhere, but I can't really remember where.
I'm programming in C++ using DirectShow. As far as I know, DirectShow won't do it. I was also looking into CBaseFilter, the stream time method, etc., but before I dive into that complex topic I'd like to know if that's the right way to go. Looking forward to your answers, thanks!
@Geraint: code snippet of the filter graph:
// Build the graph: add an MPEG-2 demultiplexer and a video decoder.
CoCreateInstance(CLSID_FilterGraph, NULL, CLSCTX_INPROC, IID_IGraphBuilder, (LPVOID *)&pGraphBuilder);
CoCreateInstance(CLSID_MPEG2Demultiplexer, NULL, CLSCTX_INPROC, IID_IBaseFilter, (LPVOID *)&pib);
CoCreateInstance(CLSID_CMPEG2VidDecoderDS, NULL, CLSCTX_INPROC, IID_IBaseFilter, (LPVOID *)&pib2);
pGraphBuilder->AddFilter(pib, L"Sample Splitter");
pGraphBuilder->AddFilter(pib2, L"Sample Decoder");
// Media type for the demux's video output pin.
ZeroMemory(&am_media_type, sizeof(am_media_type));
am_media_type.majortype = MEDIATYPE_Video;
am_media_type.subtype = MEDIASUBTYPE_MPEG2_VIDEO;
am_media_type.formattype = FORMAT_MPEG2Video;
// Query the control, seeking, frame-stepping, event, and video interfaces.
pGraphBuilder->QueryInterface(IID_IMediaControl, (LPVOID *)&pMediaControl);
pGraphBuilder->QueryInterface(IID_IMediaSeeking, (void **)&pMediaSeeking);
pGraphBuilder->QueryInterface(__uuidof(IVideoFrameStep), (PVOID *)&fst);
pGraphBuilder->QueryInterface(IID_IMediaEvent, (void **)&imev);
pGraphBuilder->QueryInterface(IID_IBasicVideo, (LPVOID *)&ibv);
// Build the rest of the graph from the source file.
pGraphBuilder->RenderFile(FILENAME, 0);
and then I use IMediaSeeking to seek the video. I've also tried frame stepping (hence the references above).
DirectShow is capable of delivering frame-accurate seeking. However, without an index, this is based on a time offset from file start, not a frame count.
Use IMediaSeeking to set the start time. The demux will begin delivery of compressed frames some way before that. The decoder will start decoding at the previous key frame but will discard any frames that are before your chosen start point.
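With the graph from the question, that looks roughly like this (times are in 100 ns units):
// Seek to an absolute time; the decoder discards frames before this point.
REFERENCE_TIME start = 20 * 10000000LL;  // 20 seconds from file start
pMediaSeeking->SetTimeFormat(&TIME_FORMAT_MEDIA_TIME);
pMediaSeeking->SetPositions(&start, AM_SEEKING_AbsolutePositioning,
                            NULL, AM_SEEKING_NoPositioning);
pMediaControl->Run();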