Parsing HEVC for Motion Information

I parsed the HEVC stream by simply identifying start codes (00 00 01 or 00 00 00 01), and now I am looking for the motion information in the NAL payload. My goal is to calculate the percentage of the stream that motion information accounts for. Any ideas?
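For reference, a minimal Python sketch of the start-code scan I described (it assumes the whole file fits in memory; the function and variable names are just illustrative). The two-byte NAL unit header that follows each start code gives the NAL unit type, but the motion information itself sits inside the entropy-coded slice payload:

```python
def find_nal_units(data: bytes):
    """Return (payload_offset, nal_unit_type) for every Annex-B NAL unit.

    Scanning for the 3-byte start code 00 00 01 also catches the 4-byte
    form 00 00 00 01, since the latter contains the former.
    """
    units = []
    i = 0
    while i + 3 < len(data):
        if data[i] == 0 and data[i + 1] == 0 and data[i + 2] == 1:
            payload = i + 3
            # HEVC NAL unit header is two bytes; nal_unit_type is bits 1..6
            # of the first byte. Types 0..31 are VCL (slice) NAL units,
            # 32/33/34 are VPS/SPS/PPS.
            nal_unit_type = (data[payload] >> 1) & 0x3F
            units.append((payload, nal_unit_type))
            i = payload + 2
        else:
            i += 1
    return units

with open("stream.hevc", "rb") as f:
    nal_units = find_nal_units(f.read())
```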

Your best bet is to start with the HM reference software (get it here: https://hevc.hhi.fraunhofer.de/svn/svn_HEVCSoftware/trunk/) and add some debug info as the different kinds of data are read from the bitstream. This is likely much easier than writing a bitstream decoder from scratch.
Check out the debug support that is already built into the software, for example RExt__DECODER_DEBUG_BIT_STATISTICS or DEBUG_CABAC_BINS. This may already do what you want; if not, it will be pretty close. I think information about bit usage is best collected in source/Lib/TLibDecoder/TDecBinCoderCABAC.cpp during decoding.
If you need to speed this up, you can of course skip the actual decode steps :)

On the decoder side, the motion vector information is coded as motion vector differences (MVDs), so you have to follow the decoding process to reconstruct the actual motion vectors. This requires understanding how inter prediction works in HEVC.
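As a conceptual sketch only (merge mode, candidate derivation, MV scaling and clipping are all omitted), the AMVP-style reconstruction the decoder performs is just predictor plus difference, and the percentage you are after is a tally kept while the entropy decoder reads those syntax elements:

```python
def reconstruct_mv(mvp_candidates, mvp_idx, mvd):
    """mvp_candidates: (x, y) predictors from neighbouring blocks,
    mvp_idx: signalled candidate index, mvd: signalled difference."""
    px, py = mvp_candidates[mvp_idx]
    dx, dy = mvd
    return (px + dx, py + dy)

# Hypothetical tally you would keep while instrumenting a decoder:
bits = {"mvd": 0, "mvp_idx": 0, "merge_idx": 0, "ref_idx": 0, "other": 0}
# ... add the number of bits consumed by each syntax element as it is parsed ...
motion_bits = bits["mvd"] + bits["mvp_idx"] + bits["merge_idx"] + bits["ref_idx"]
motion_share = 100.0 * motion_bits / max(1, sum(bits.values()))
```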
Thank you!

Related

What are the inputs to a WaveNet?

I am trying to implement TTS. I have just read about WaveNet, but I am confused about local conditioning. The original paper (here) explains adding a time series for local conditioning, and this article explains that adding mel spectrogram features for local conditioning is fine. We know that WaveNet is a generative model that takes raw audio as input and generates high-quality audio output when conditioned;
my question is whether the mel spectrogram features are computed from that raw audio passed as input, or from some other audio.
Secondly, to implement TTS, will the audio input be generated by some other TTS system whose output quality WaveNet then improves? Am I correct to think of it this way?
Please help, it is direly needed.
Thanks
Mel features are created by the actual TTS module from the text (Tacotron 2, for example); then you run the vocoder module (WaveNet) to create speech.
It is better to try an existing implementation like NVIDIA/tacotron2 + NVIDIA/waveglow. WaveGlow is better than WaveNet, by the way, and much faster; WaveNet is very slow.
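To make that pipeline concrete, here is a rough inference sketch using the NVIDIA Tacotron 2 + WaveGlow checkpoints published on PyTorch Hub. Treat the hub entry points and helper names as assumptions (they may differ between releases), and note that the official models expect a CUDA device:

```python
import torch

# Assumed entry points from NVIDIA/DeepLearningExamples on PyTorch Hub;
# check the current repo README for the exact names.
tacotron2 = torch.hub.load('NVIDIA/DeepLearningExamples:torchhub', 'nvidia_tacotron2').eval().to('cuda')
waveglow = torch.hub.load('NVIDIA/DeepLearningExamples:torchhub', 'nvidia_waveglow').eval().to('cuda')
utils = torch.hub.load('NVIDIA/DeepLearningExamples:torchhub', 'nvidia_tts_utils')

text = "Hello, this sentence is synthesized from scratch."
sequences, lengths = utils.prepare_input_sequence([text])

with torch.no_grad():
    # Acoustic model: text -> mel spectrogram (this is the local conditioning)
    mel, _, _ = tacotron2.infer(sequences, lengths)
    # Vocoder: mel spectrogram -> raw waveform
    audio = waveglow.infer(mel)
```

WaveGlow plays the same vocoder role WaveNet would; it is used here only because, as noted above, it is much faster at inference.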

Kinect SDK skeletonization method

I was wondering if there's a way to modify the depth map prior to sending it to the skeletonization algorithm used by the Kinect, for example, if we want to run the skeletonization on the output of a segmented depth image. So far I have reviewed the methods in the SDK, but I haven't been able to find a skeletonization method exposed. It's like you can either turn the skeleton on or off, but you have no control over its inputs.
If anyone has any idea regarding this topic I will be much obliged.
Shamita: skeletonization means tracking the joints of the user in real time. I am editing because I can't comment (not enough reputation).
All the joints give a depth coordinate, and I don't think you can mess with the Kinect hardware input stream. But you can categorize the joints according to depth segments. For example, with the live stream you assign each joint to the corresponding category: if its depth is below 10 and above 5, it is in category A. This can be done on the live stream itself because it is just a simple calculation.
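A rough Python sketch of that per-joint bucketing (the thresholds follow the example above; the joint names, units, and category labels are made up for illustration):

```python
def categorize_depth(depth):
    """Bucket a joint's depth value into a named segment."""
    if 5 < depth < 10:
        return "A"
    if depth <= 5:
        return "B"
    return "C"

# Per frame, the skeleton stream gives you a depth value per tracked joint.
frame_joints = {"head": 7.2, "left_hand": 4.1, "right_hand": 11.5}
categories = {name: categorize_depth(d) for name, d in frame_joints.items()}
```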

NAudio - Create software beat machine / sampler - General Strategy

I'm new to NAudio, but the goal of my project is to give the user the ability to listen to an MP3 and then select a "chunk" of that song as a sample, which could be saved to disk. These samples would then be replayable at the same time (i.e. not merged, but played simultaneously).
Could someone please let me know the overall strategy required to achieve this (not necessarily the specifics; almost like pseudocode)?
For example, would the samples/chunks of a song need to be saved as WAV files, and could these samples then be played together in WAV format, etc.?
I have seen a few small examples of implementations of some of the ideas I've mentioned above, but I don't have a good sense of the big picture just yet.
Thanks in advance,
Andrew
The chunks wouldn't need to be saved as WAV files unless you were keeping them for future use. You can store the PCM audio (Mp3FileReader automatically converts to PCM) in a byte array and use RawSourceWaveStream to play them.
As for mixing them, I'd recommend using the MixingSampleProvider. This does mean you need to convert your RawSourceWaveStream to IEEE float, but you can use Pcm16BitToSampleProvider to do this. This will give the advantage that you can adjust volumes (and do other DSP) easily on the samples you are mixing. MixingSampleProvider also auto-removes completed inputs, so you can just add new inputs whenever you want to trigger a sound.
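The NAudio classes above are the idiomatic way to do this in .NET. Purely to illustrate what the conversion and mixing stages amount to conceptually (this is not NAudio code), a small numpy sketch:

```python
import numpy as np

def pcm16_to_float(raw: bytes) -> np.ndarray:
    """16-bit little-endian PCM -> float32 in [-1, 1]; the role that
    Pcm16BitToSampleProvider plays on the NAudio side."""
    return np.frombuffer(raw, dtype='<i2').astype(np.float32) / 32768.0

def mix(buffers):
    """Sum several float sample buffers of possibly different lengths,
    padding the shorter ones with silence, then clip to the valid range."""
    out = np.zeros(max(len(b) for b in buffers), dtype=np.float32)
    for b in buffers:
        out[:len(b)] += b
    return np.clip(out, -1.0, 1.0)
```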

Performing buffering/windowing with overlap add CMSampleBufferRef

I'm trying to perform some basic DSP functions on PCM audio data which I retrieve from a video file using AVAssetReader on the iPhone.
I'm reading the buffers correctly; the number of samples per buffer is 8192 (is that the default? Can it be changed?).
However, I need to perform windowing, fft and various other manipulations on slices which aren't 8192 samples long. In fact I want to process 512 samples at a time with 50% overlap between each slice.
I've been digging deep in Apple's Accelerate/vDSP framework and I think I can handle the processing and such, just not sure how to actually split up my signal the way I want it.
I have a strong DSP background but unfortunately my DSP programming experience pretty much ends in MATLAB.
Any help will be appreciated.
After digging deeper I found CASpectralProcessor in the PublicUtility folder of the CoreAudio developer tools, which from version 4.3 onwards are no longer bundled with Xcode. To download, go to
https://developer.apple.com/downloads/index.action?name=for%20Xcode%20-
CASpectralProcessor is exactly what I need: a full-blown spectral analyzer that lets you customize the window length, window type, and hop size. It even performs the IFFT with overlap-add!
Hope this helps someone.
You can chop 1 or 2 of those large buffers into a number of buffers of some shorter desired length and feed those shorter buffers or slices to your processing routine.
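If you end up slicing the buffers yourself instead of using CASpectralProcessor, the bookkeeping is just a 512-sample frame with a 256-sample hop. A small numpy sketch of the analysis side (the buffer name is illustrative; when streaming, carry the last FRAME - HOP samples over to the next 8192-sample buffer so no frame straddles the boundary):

```python
import numpy as np

FRAME = 512
HOP = FRAME // 2              # 50% overlap
window = np.hanning(FRAME)

def frames_with_overlap(signal: np.ndarray):
    """Yield Hann-windowed 512-sample slices with 50% overlap."""
    for start in range(0, len(signal) - FRAME + 1, HOP):
        yield window * signal[start:start + FRAME]

# e.g. one 8192-sample buffer from AVAssetReader, already converted to float:
# spectra = [np.fft.rfft(frame) for frame in frames_with_overlap(buffer_8192)]
```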

Getting Pitch with VB.net

I want to get the pitch of a song at any point. I plan on storing the pitches later. How can I read, say, an MP3 or WAV file to get the pitch played at a certain point?
Here is a visual example:
Say I wanted to get the pitch that is here at ^this point of the song.
Thanks if you can!
The matter is a tad more complicated than you may be anticipating.
While time-domain approaches exist (that is, approaches which work with the PCM data directly), frequency-domain pitch detection is going to be more accurate. You can read a very simplified overview here.
What you probably want is a Fourier Transform, which can be used to transform blocks of your signal from time-domain to frequency-domain (that is, a distribution of frequency content over a given span of the signal). From there, you would need to analyze the frequency spectrum within that block. The problem becomes even harder still, because there is no best way to deduce pitch from a sampled frequency spectrum in the general case. The aforementioned Wikipedia article should give you a foundation for looking into those algorithms.
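As a rough starting point only (picking the strongest FFT bin is a naive estimator that breaks on harmonically rich material, which is exactly why the pitch-detection algorithms mentioned above exist), the time-to-frequency step looks like this in numpy:

```python
import numpy as np

def naive_pitch_estimate(block: np.ndarray, sample_rate: float) -> float:
    """Return the frequency (Hz) of the strongest bin in one block of mono
    PCM samples. Real pitch detection needs much more than this."""
    windowed = block * np.hanning(len(block))
    spectrum = np.abs(np.fft.rfft(windowed))
    freqs = np.fft.rfftfreq(len(block), d=1.0 / sample_rate)
    return freqs[int(np.argmax(spectrum))]
```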
Finally, it's worth noting that this is really a language-agnostic question, unless your primary interest is in reading a WAV file specifically using VB.NET.