How to extract QP per macroblock or slice in HEVC? - hevc

How can I get the QP value per macroblock (CTU) or per slice from an encoded frame (encoded by an HEVC hardware encoder)? I have tried HEVC bitstream parsers like hevcesbrowser (https://github.com/virinext/hevcesbrowser), but it doesn't provide access to CTUs or parse the slice body.

You can decode the bitstream using an open-source decoder, then modify that decoder to dump the information you need during parsing. I'd recommend the HEVC Test Model (HM) reference software.
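As a rough sketch of that workflow (assumptions: you have built HM's TAppDecoder and added your own print statement in its CU/CTU decoding path; the "QP <poc> <ctu> <qp>" output format below is hypothetical, not something HM emits out of the box), you could drive the modified decoder from Python and collect the per-CTU QP values:

# Hedged sketch: run a modified HM decoder and gather per-CTU QP lines it prints.
import subprocess
from collections import defaultdict

result = subprocess.run(
    ["./TAppDecoder", "-b", "stream.hevc", "-o", "recon.yuv"],
    capture_output=True, text=True, check=True)

# Collect the hypothetical "QP <poc> <ctu_addr> <qp>" lines printed by your patch.
qp_per_frame = defaultdict(list)
for line in result.stdout.splitlines():
    if line.startswith("QP "):
        _, poc, ctu_addr, qp = line.split()
        qp_per_frame[int(poc)].append((int(ctu_addr), int(qp)))

for poc, ctus in sorted(qp_per_frame.items()):
    mean_qp = sum(q for _, q in ctus) / len(ctus)
    print(f"POC {poc}: {len(ctus)} CTUs, mean QP = {mean_qp:.1f}")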

Related

One-hot encoding gives different results for model and deployment

I am building an AI system and I use tensorflow.keras.preprocessing.text.one_hot to encode categorical data. I am working with text and sentence data.
vocab_length = 1000
encoded_text = one_hot(text, vocab_length)
After training, I deploy the model and it works on user-input text. I use the same one_hot method, but it generates a different encoding, so I get wrong predictions. I also tried dumping one_hot with joblib and loading it on the server, but it still gives the wrong result. Kindly suggest how I can get the same encoding in the model and in the server deployment.
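For context, here is a minimal, self-contained version of the encoding step described above (the sample sentence is a placeholder). Note that one_hot is hashing-based (it wraps hashing_trick with Python's built-in hash by default), so the integer ids are not guaranteed to be identical across separate Python processes:

# Minimal sketch of the encoding step from the question; the sample text is a
# placeholder. one_hot hashes each word into [1, vocab_length), and the default
# hash function is Python's built-in hash, which is not stable across runs.
from tensorflow.keras.preprocessing.text import one_hot

vocab_length = 1000
text = "the quick brown fox"                  # placeholder user/training text
encoded_text = one_hot(text, vocab_length)    # e.g. [523, 88, 914, 341]
print(encoded_text)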

Convert a .npy file to wav following tacotron2 training

I am training the Tacotron2 model using TensorflowTTS for a new language.
I managed to train the model (performed pre-processing, normalization, and decoded the few generated output files)
The files in the output directory are .npy files, which makes sense as they are mel-spectrograms.
I am trying to find a way to convert these files to .wav files in order to check whether my work has been fruitful.
I used this:
import librosa
import scipy.signal
import soundfile as sf

sample_rate = 22050
melspectrogram = librosa.feature.melspectrogram(
    "/content/prediction/tacotron2-0/paol_wavpaol_8-norm-feats.npy", sr=22050,
    window=scipy.signal.hanning, n_fft=1024, hop_length=256)
print('melspectrogram.shape', melspectrogram.shape)
print(melspectrogram)
audio_signal = librosa.feature.inverse.mel_to_audio(
    melspectrogram, sr=22050, n_fft=1024, hop_length=256, window=scipy.signal.hanning)
print(audio_signal, audio_signal.shape)
sf.write('test.wav', audio_signal, sample_rate)
But it gives me this error: Audio data must be of type numpy.ndarray.
Although I am already giving it a numpy.ndarray file.
Does anyone know where the issue might be, or a better way to do it?
I'm not sure what your error is, but the outputs of a Tacotron 2 system are log Mel spectral features, and you can't just apply the inverse Fourier transform to get a waveform, because you are missing the phase information and because the features are not invertible. You can learn about why this is at places like Speech.Zone (https://speech.zone/courses/).
Instead of using librosa as you are doing, you need to use a vocoder like HiFi-GAN (https://github.com/jik876/hifi-gan) that is trained to reconstruct a waveform from log Mel spectral features. You can use a pre-trained model from most off-the-shelf vocoders, but make sure that the sample rate, Mel range, FFT size, hop size and window size are all the same between your Tacotron2 feature prediction network and whatever vocoder you choose, otherwise you'll just get noise!
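As one concrete option (a sketch, not a drop-in solution): since you are already using TensorFlowTTS, you could try one of its pre-trained vocoders instead of the HiFi-GAN repository above. The model name, the 22050 Hz sample rate and the (frames, 80) mel shape below are assumptions based on the LJSpeech configs, so adjust them to your own training setup:

# Sketch: synthesize a waveform from a dumped mel-spectrogram (.npy) with a
# pre-trained TensorFlowTTS MB-MelGAN vocoder. Model id, sample rate and the
# expected mel shape are assumptions; they must match your Tacotron2 features.
import numpy as np
import soundfile as sf
from tensorflow_tts.inference import TFAutoModel

vocoder = TFAutoModel.from_pretrained("tensorspeech/tts-mb_melgan-ljspeech-en")

mel = np.load("/content/prediction/tacotron2-0/paol_wavpaol_8-norm-feats.npy")  # (frames, 80)
audio = vocoder.inference(mel[np.newaxis, ...])[0, :, 0]  # add batch dim, then squeeze

sf.write("test.wav", audio.numpy(), 22050)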

How to make an FSNS dataset with my own images for the attention OCR TensorFlow model

I want to apply attention-ocr to detect all digits on the number plates of cars.
I've read the README.md of attention_ocr on GitHub (https://github.com/tensorflow/models/tree/master/research/attention_ocr), and also the Stack Overflow answer on how to use my own image data to train the model (https://stackoverflow.com/a/44461910/743658).
However, I couldn't find any information on how to store the annotation or label of each picture, or the required format.
For an object detection model, I was able to make my dataset with LabelImg, convert it into a csv file, and finally make a .tfrecord file.
I want to make a .tfrecord file in the FSNS dataset format.
Can you give me advice on how to proceed with these training steps?
Please reread the mentioned answer; it has a section explaining how to store the annotation. The annotation is stored in the three features image/text, image/class and image/unpadded_class. The image/text field is used for visualization; some models support unpadded sequences and use image/unpadded_class, while the default version relies on the text, padded with null characters to a fixed length, stored in the feature image/class. Here is the excerpt that stores the text annotation:
char_ids_padded, char_ids_unpadded = encode_utf8_string(
    text, charset, length, null_char_id)
example = tf.train.Example(features=tf.train.Features(
    feature={
        'image/class': _int64_feature(char_ids_padded),
        'image/unpadded_class': _int64_feature(char_ids_unpadded),
        'image/text': _bytes_feature(text),
        ...
    }
))
If you have worked with TensorFlow object detection, then the approach should be much easier for you.
You can create the annotation file (in .csv format) using labelImg or any other annotation tool.
However, before converting it into TensorFlow format (.tfrecord), you should keep in mind the annotation format (the FSNS format in this case).
The format is: files text xmin ymin xmax ymax
So while annotating, don't bother much about the class (as you would have done in object detection); some random name should suffice.
Convert it into .tfrecord, as sketched below.
And finally, labelMap is the list of characters which you have annotated.
Hope it helps!
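As a rough illustration of the CSV-to-TFRecord step (this is not the official attention_ocr tooling; encode_utf8_string and charset come from the linked answer, and the CSV column names, max length and null-character id are placeholders for your own data):

# Hedged sketch: convert a labelImg-style CSV into FSNS-style tf.Examples.
# encode_utf8_string and charset are the ones from the linked answer; the CSV
# columns, max_seq_length and null_char_id are placeholders for your setup.
import csv
import tensorflow as tf

def _int64_feature(values):
    return tf.train.Feature(int64_list=tf.train.Int64List(value=values))

def _bytes_feature(value):
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

max_seq_length = 37   # pad every label to this length (placeholder)
null_char_id = 133    # id of the null/padding character in your charset (placeholder)

with tf.io.TFRecordWriter("train.tfrecord") as writer, open("annotations.csv") as f:
    for row in csv.DictReader(f):                 # expected columns: files, text, ...
        with open(row["files"], "rb") as img_file:
            image_bytes = img_file.read()
        char_ids_padded, char_ids_unpadded = encode_utf8_string(
            row["text"], charset, max_seq_length, null_char_id)
        example = tf.train.Example(features=tf.train.Features(feature={
            "image/encoded": _bytes_feature(image_bytes),
            "image/format": _bytes_feature(b"png"),
            "image/class": _int64_feature(char_ids_padded),
            "image/unpadded_class": _int64_feature(char_ids_unpadded),
            "image/text": _bytes_feature(row["text"].encode("utf-8")),
        }))
        writer.write(example.SerializeToString())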

Google Cloud ML Engine: how to create a task receiving an image in float32 format

I'm new to Google Cloud ML Engine. I'm deploying my first model, and I trained a model that receives images in float32 format. I'm following the ML Engine tutorial, but there they encode the image in base64. Is there a way to encode it as float32? Or can I create a task that is in float32?
python -c 'import base64, sys, json; img = base64.b64encode(open(sys.argv[1], "rb").read()); print json.dumps({"inputs": {"key":"0", "image_bytes": {"b64": img}}})' flower.jpg &> request.json
There are multiple ways to encode image data, some more efficient than others. These are outlined in this answer. You are looking for the "Raw Tensor Encoded as JSON" section, which shows how to export your model and also how to construct the JSON. Please also consider the tradeoff: sending raw floats in JSON is inefficient, so weigh the alternative approaches as well.
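For illustration (a sketch under assumptions, not the tutorial's exact workflow): if your exported model takes a float32 tensor input, the request JSON can carry the raw pixel values as nested lists. The input name "image" and the 32x32 size below are placeholders for whatever your SavedModel actually declares:

# Hedged sketch: build a JSON prediction request containing raw float32 pixel
# values instead of base64-encoded bytes. "image" and the 32x32 resize are
# placeholders; they must match the input signature of your exported model.
import json
import numpy as np
from PIL import Image

img = Image.open("flower.jpg").resize((32, 32))
pixels = (np.asarray(img, dtype=np.float32) / 255.0).tolist()  # nested lists of floats

request = {"instances": [{"key": "0", "image": pixels}]}
with open("request.json", "w") as f:
    json.dump(request, f)

Depending on whether you call the REST predict API or pass instances via gcloud, the exact wrapping ("instances" list versus one instance per line) differs, so check the linked answer for the variant you need.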

UIImage to YUV 422 Colorspace

I have been trying to figure this out for a while to no avail; I was wondering if someone could help or point me in the right direction.
I need to convert a UIImage, or a stored JPG, to its YUV422 data so I can apply some image enhancements, and then convert the result back to either a JPG or a UIImage.
I'm a bit stuck at the moment; at this point I am just trying to get it to YUV422.
Any help would be greatly appreciated.
Thanks in advance.
You must first read the JPEG markers to extract the metadata: the image size, the chroma subsampling (usually 4:2:2, but not always), the quantization tables, and the Huffman tables.
You must then Huffman-decode the entropy-coded data segment. This gives you the DC coefficient followed by any AC coefficients for each color channel, in zig-zag order. You must then de-zigzag the entries and multiply them by the corresponding quantization table. Finally, you must perform the inverse discrete cosine transform (IDCT) on each decoded macroblock.
This will then give you three channels in YCbCr (YUV is for analog) at the subsampling the JPEG was encoded with. If you need it to be 4:2:2, you will have to resample the chroma (see the sketch at the end of this answer).
Hopefully you have a library to do the actual JPEG decoding, since writing a compliant one is a non-trivial task.
Here is a very basic and flawed JPEG decoder I started writing, to give you more technical details: Ruby JPEG decoder. (It does not successfully implement the IDCT.)
For a correct implementation in C, I suggest IJG (the Independent JPEG Group's libjpeg).
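To make the color-space step concrete (illustrated in Python/NumPy rather than iOS code; the BT.601/JFIF YCbCr equations are standard, but the function name and the assumption of an already-decoded 8-bit RGB array are mine):

# Sketch: convert decoded 8-bit RGB pixels to YCbCr and horizontally subsample
# the chroma planes 2:1 to approximate 4:2:2. NumPy stands in for whatever
# pixel-buffer types you actually use on iOS.
import numpy as np

def rgb_to_ycbcr422(rgb):
    """rgb: uint8 array of shape (H, W, 3) with W even. Returns (Y, Cb, Cr) planes."""
    r, g, b = [rgb[..., i].astype(np.float32) for i in range(3)]
    y  =  0.299    * r + 0.587    * g + 0.114    * b          # luma, full resolution
    cb = -0.168736 * r - 0.331264 * g + 0.5      * b + 128.0  # blue-difference chroma
    cr =  0.5      * r - 0.418688 * g - 0.081312 * b + 128.0  # red-difference chroma
    # 4:2:2 keeps every luma sample but averages each horizontal pair of chroma samples.
    cb422 = cb.reshape(cb.shape[0], -1, 2).mean(axis=2)
    cr422 = cr.reshape(cr.shape[0], -1, 2).mean(axis=2)
    return (np.clip(y, 0, 255).astype(np.uint8),
            np.clip(cb422, 0, 255).astype(np.uint8),
            np.clip(cr422, 0, 255).astype(np.uint8))

# Example: a random 4x4 RGB image -> Y is 4x4, Cb and Cr are 4x2.
planes = rgb_to_ycbcr422(np.random.randint(0, 256, (4, 4, 3), dtype=np.uint8))
print([p.shape for p in planes])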