I have been trying to figure this out for a while to no avail, and I was wondering if someone could help or point me in the right direction.
I need to convert a UIImage or a stored JPG to its YUV422 data so I can apply some image enhancements, and then convert the result back to either a JPG or a UIImage.
I'm a bit stuck at the moment; at this point I am just trying to get it to YUV422.
Any help would be greatly appreciated.
Thanks in advance.
You must first read the JPEG markers to determine the metadata: the image size, the sampling rate (usually 4:2:2, but not always), the quantization tables, and the Huffman tables.
You must then Huffman-decode the entropy-coded data segment. This will give you the DC coefficient followed by any AC coefficients for each color channel, in zig-zag order. You must then de-zigzag the entries and multiply them by the corresponding quantization table. Finally, you must perform the Inverse Discrete Cosine Transform on the decoded macroblock.
This will then give you three channels in YCbCr (YUV is, strictly speaking, the analog variant) at the sampling rate the JPEG was encoded at. If you need it to be 4:2:2, you will have to resample.
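To make the dequantize / de-zigzag / IDCT step concrete, here is a minimal Python sketch (using numpy and scipy; the zig-zag table is computed on the fly, and the quantization values in the usage lines are made up for illustration, not taken from a real file):

import numpy as np
from scipy.fftpack import idct

def zigzag_order(n=8):
    # (row, col) pairs of an n x n block in JPEG zig-zag order.
    return sorted(((r, c) for r in range(n) for c in range(n)),
                  key=lambda rc: (rc[0] + rc[1],
                                  rc[1] if (rc[0] + rc[1]) % 2 == 0 else rc[0]))

def decode_block(zigzag_coeffs, quant_table):
    # De-zigzag and dequantize the 64 coefficients of one macroblock.
    block = np.zeros((8, 8))
    for coeff, (r, c) in zip(zigzag_coeffs, zigzag_order()):
        block[r, c] = coeff * quant_table[r, c]
    # 2-D inverse DCT (orthonormal scaling), then undo the level shift.
    spatial = idct(idct(block, axis=0, norm='ortho'), axis=1, norm='ortho')
    return np.clip(np.round(spatial) + 128, 0, 255).astype(np.uint8)

quant = np.full((8, 8), 16)         # illustrative quantization table
coeffs = [52] + [0] * 63            # a block with only a DC coefficient
print(decode_block(coeffs, quant))  # flat block: 52*16/8 + 128 = 232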
Hopefully you have a library to do the actual JPEG decoding, since writing a compliant one is a non-trivial task.
Here is a very basic and flawed JPEG decoder I started writing, to give you more technical details: Ruby JPEG decoder. It does not successfully implement the IDCT.
For a correct implementation in C, I suggest the IJG's libjpeg.
I have a tflite model trained on BGR data. How can I make it work properly with RGB images?
UPDATE
I want to use it with the material-showcase app: https://github.com/googlesamples/mlkit/tree/master/android/material-showcase
@Farmaker, @JaredJunyoungLim: thank you very much for your answers. I've updated the question. At first I was thinking about converting the model itself, so it wouldn't require any changes in the code. For example, the converter to the OpenVINO format has an option to reverse the input channels. I have also tried to set the BGR ColorSpace in the metadata, but have found out that it's most probably not possible.
I guess I'll go with your suggestion then. In the linked code, there is indeed a ByteBuffer (FrameProcessorBase.kt). I guess this is the place to change the order of the channels (after line 70):
val frame = processingFrame ?: return
However, how can I change the order of the channels if this is just a ByteBuffer? Do I need to figure out how the data is stored in it? For example, is it R,G,B,R,G,B,R,G,B,... for every pixel? Or maybe there is some more elegant way to do that?
I can see that the format is set to IMAGE_FORMAT_NV21, which is YCrCb.
UPDATE 2
From what I've tested (
Log.d("ByteBuffer", frame.toString())
), it seems that the ByteBuffer takes 1.5 bytes per pixel:
java.nio.HeapByteBuffer[pos=0 lim=3110401 cap=3110401]
(Resolution: 1920x1080; 3110400/1920/1080=1.5)
So it uses 12 bits per pixel, which means 4 bits per channel per pixel. That's a bit strange, because I would suspect at least 8 bits per channel per pixel (0-255).
So I guess that maybe it's compressed.
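For reference, the 1.5 bytes per pixel is what you would expect from the NV21 layout rather than from compression: a full-resolution 8-bit Y plane, followed by one V and one U byte shared by every 2x2 block of pixels (4:2:0 chroma subsampling). A rough numpy sketch of that layout (not code from the showcase app):

import numpy as np

def split_nv21(buf, width, height):
    # NV21: width*height Y bytes, then width*height/2 interleaved V,U bytes.
    y_size = width * height
    y = np.frombuffer(buf, dtype=np.uint8, count=y_size).reshape(height, width)
    vu = np.frombuffer(buf, dtype=np.uint8, count=y_size // 2,
                       offset=y_size).reshape(height // 2, width // 2, 2)
    return y, vu[..., 1], vu[..., 0]   # Y, U, V planes

# 1920 * 1080 * 3 / 2 = 3110400 bytes, matching the logged buffer size.

So there is no R,G,B interleaving in this buffer at all; the R/G/B (or B/G/R) order only becomes a question once the YUV data is converted to RGB, and at that point you control the channel order yourself.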
I am training the Tacotron2 model using TensorflowTTS for a new language.
I managed to train the model (performed pre-processing, normalization, and decoded the few generated output files).
The files in the output directory are .npy files, which makes sense as they are mel-spectrograms.
I am trying to find a way to convert said files to a .wav file in order to check if my work has been fruitful.
I used this:
melspectrogram = librosa.feature.melspectrogram(
    "/content/prediction/tacotron2-0/paol_wavpaol_8-norm-feats.npy", sr=22050,
    window=scipy.signal.hanning, n_fft=1024, hop_length=256)
print('melspectrogram.shape', melspectrogram.shape)
print(melspectrogram)
audio_signal = librosa.feature.inverse.mel_to_audio(
    melspectrogram, sr=22050, n_fft=1024, hop_length=256, window=scipy.signal.hanning)
print(audio_signal, audio_signal.shape)
sf.write('test.wav', audio_signal, sample_rate)
But it is giving me this error: Audio data must be of type numpy.ndarray.
Although I am already giving it a numpy.ndarray file.
Does anyone know where the issue might be, or if there is a better way to do it?
I'm not sure what your error is, but the output of a Tacotron 2 system is log Mel spectral features, and you can't just apply the inverse Fourier transform to get a waveform, because you are missing the phase information and because the features are not invertible. You can learn about why this is at places like Speech.Zone (https://speech.zone/courses/).
Instead of using librosa as you are doing, you need to use a vocoder like HiFi-GAN (https://github.com/jik876/hifi-gan) that is trained to reconstruct a waveform from log Mel spectral features. You can use a pre-trained model with most off-the-shelf vocoders, but make sure that the sample rate, Mel range, FFT size, hop size, and window size are all the same between your Tacotron2 feature prediction network and whatever vocoder you choose, otherwise you'll just get noise!
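As a rough sketch of that route, assuming you stay inside TensorFlowTTS: the project ships pre-trained vocoders that can be loaded through its inference API. The checkpoint name below is the English MB-MelGAN example from its README and is only a placeholder; you would need a vocoder trained for your language with the same feature settings as your Tacotron2.

import numpy as np
import soundfile as sf
from tensorflow_tts.inference import TFAutoModel

# Mel features predicted by Tacotron2, shape (time, n_mels).
mel = np.load("/content/prediction/tacotron2-0/paol_wavpaol_8-norm-feats.npy")

# Placeholder checkpoint; swap in a vocoder matching your configuration.
vocoder = TFAutoModel.from_pretrained("tensorspeech/tts-mb_melgan-ljspeech-en")

# The vocoder expects a batch dimension: (batch, time, n_mels).
audio = vocoder.inference(mel[np.newaxis, ...])[0, :, 0]

sf.write("test.wav", audio.numpy(), 22050)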
I've read in dozens of articles, scientific papers, and toy implementations that the steps in JPEG compression are roughly as follows:
Take 8x8 DCT
Divide by quantization matrix
Round to integers
Run-length & Huffman encode
And then the inverse is pretty much the same. What is left out in everything on the topic I've found so far is the magnitude of the data and the corresponding serialization.
It appears implicitly assumed that all the coefficients are stored as unsigned bytes. However, as I understand it, the DC coefficient is in the range 0-255, while the AC coefficients can be negative. Are the AC coefficients in the range ±255, or ±127, or something else?
What is the common way to store these coefficients in a compact way?
The first-hand source to read is of course the ITU-T T.81 standard document.
Looks like the first Google link leads to a paywall... it's on the W3C site, though: https://www.w3.org/Graphics/JPEG/itu-t81.pdf
Take 8-bit input samples (0..255)
Subtract 128 (-128..127)
Do N*N fDCT, where N=8
Output can have log2(N)+8 bits = 11 bits (-1024..1023)
DC coefficients are stored as a difference, so they can have 12 bits.
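As a quick sanity check of those ranges, a uniform block of 255 has the largest possible DC value after the level shift: (255 - 128) * 8 = 1016, which indeed fits in the signed 11-bit range -1024..1023. A small Python sketch (scipy's orthonormal DCT uses the same scaling these numbers assume):

import numpy as np
from scipy.fftpack import dct

block = np.full((8, 8), 255.0) - 128.0   # level-shifted input samples
coeffs = dct(dct(block, axis=0, norm='ortho'), axis=1, norm='ortho')
print(coeffs[0, 0])                      # 1016.0, needs 11 signed bits
print(coeffs.min(), coeffs.max())        # AC terms are ~0 for a flat block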
The encoding process depends upon whether you have a sequential scan or a progressive scan. The details of the encoding process are too complicated to fit within an answer here.
I highly recommend this book:
https://www.amazon.com/Compressed-Image-File-Formats-JPEG/dp/0201604434/ref=sr_1_2?ie=UTF8&qid=1531091178&sr=8-2&keywords=JPEG&dpID=5168QFRTslL&preST=_SX258_BO1,204,203,200_QL70_&dpSrc=srch
It is the only source I know of that explains JPEG end-to-end in plain English.
Can somebody please explain why rendering with premultiplied alpha (and a corrected blending function) looks different from "normal" alpha when, mathematically speaking, they are the same?
I've looked into this post for understanding of premultiplied alpha:
http://blogs.msdn.com/b/shawnhar/archive/2009/11/06/premultiplied-alpha.aspx
The author also said that the end computation is the same:
"Look at the blend equations for conventional vs. premultiplied alpha. If you substitute this color format conversion into the premultiplied blend function, you get the conventional blend function, so either way produces the same end result. The difference is that premultiplied alpha applies the (source.rgb * source.a) computation as a preprocess rather than inside the blending hardware."
Am I missing something? Why is the result different then?
The difference is in filtering.
Imagine that you have a texture with just two pixels and you are sampling it exactly in the middle between the two pixels. Also assume linear filtering.
Schematically:
R|G|B|A + R|G|B|A = R|G|B|A
non-premultiplied:
1|0|0|1 + 0|1|0|0 = 0.5|0.5|0|0.5
premultiplied:
1|0|0|1 + 0|0|0|0 = 0.5|0|0|0.5
Notice the difference in green channel.
Filtering premultiplied alpha produces correct results.
Note that all this has nothing to do with blending.
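A small numeric sketch of that two-texel example, carried through linear filtering and then an "over" blend against a blue background (plain Python, no graphics API; the background colour is just for illustration):

import numpy as np

red   = np.array([1.0, 0.0, 0.0, 1.0])   # opaque red texel (R, G, B, A)
green = np.array([0.0, 1.0, 0.0, 0.0])   # fully transparent green texel
bg    = np.array([0.0, 0.0, 1.0])        # blue background

def premultiply(c):
    # Multiply RGB by alpha, keep alpha unchanged.
    return np.append(c[:3] * c[3], c[3])

# Linear filtering halfway between the two texels = plain average.
straight = (red + green) / 2                           # [0.5, 0.5, 0.0, 0.5]
premult  = (premultiply(red) + premultiply(green)) / 2 # [0.5, 0.0, 0.0, 0.5]

# Conventional blend: src.rgb * src.a + dst.rgb * (1 - src.a)
out_straight = straight[:3] * straight[3] + bg * (1 - straight[3])
# Premultiplied blend: src.rgb + dst.rgb * (1 - src.a)
out_premult  = premult[:3] + bg * (1 - premult[3])

print(straight, premult)   # green channel after filtering: 0.5 vs 0.0
print(out_straight)        # [0.25 0.25 0.5] -- the A=0 texel's green leaks in
print(out_premult)         # [0.5  0.   0.5] -- no bleeding from the A=0 texel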
This is a guess, because there is not enough information yet to figure it out.
It should be the same. One common way of getting a different value is to use a different Gamma correction method between the premultiply and the rendering step.
I am going to guess that one of your stages, either the blending or the premultiplying stage, is being done with a different gamma value. If you generate your premultiplied textures with a tool like DirectXTex texconv and use the default sRGB option for premultiplying alpha, then your sampler needs to be an _SRGB format and your render target should be _SRGB as well. If you are treating them linearly, then you may not be able to render to an _SRGB target or sample the texture with gamma correction, even if you are doing the premultiply in the same shader that samples (depending on the 3D API and render-target setup differences). Doing so will cause the alpha to be significantly different between the two methods in the midtones.
See: The Importance of Being Linear.
If you are generating the alpha in Photoshop, then you should know a couple of things. Photoshop does not save alpha in linear OR sRGB format; it saves it with a gamma value about halfway between linear and sRGB. If you premultiply in Photoshop, it will compute the premultiply correctly but save the result with the wrong ramp. If you generate a normal alpha and then sample it as sRGB or LINEAR in your 3D API, it will be close but will not match the values Photoshop shows in either case.
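To put a number on that, here is a tiny sketch comparing a premultiply done on linear values (decoded from sRGB and re-encoded afterwards) with a premultiply applied directly to the sRGB-encoded value; the midtone colour and alpha are arbitrary examples:

def srgb_to_linear(c):
    return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4

def linear_to_srgb(c):
    return 12.92 * c if c <= 0.0031308 else 1.055 * c ** (1 / 2.4) - 0.055

srgb_value, alpha = 0.5, 0.5      # a midtone texel, half transparent

linear_premult = linear_to_srgb(srgb_to_linear(srgb_value) * alpha)
srgb_premult   = srgb_value * alpha   # premultiplied in gamma space

print(linear_premult, srgb_premult)   # ~0.36 vs 0.25: a visible midtone shift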
For a more in-depth reply, the information we would need is:
What 3D API you are using.
How your textures are generated and sampled.
When and how you are premultiplying the alpha.
And preferably a code or shader example that shows the error.
I was researching why one would use premultiplied vs. non-premultiplied alpha and found this interesting info from Nvidia:
https://developer.nvidia.com/content/alpha-blending-pre-or-not-pre
It seems that their specific case has more precision when using premultiplied alpha than with straight (non-premultiplied) alpha.
I also read (I believe on here, but I cannot find it) that doing pre-alpha (multiplying the alpha into each RGB value) saves time. I still need to find out whether that's true, but there seems to be a reason why premultiplied alpha is preferred.
So, I have the usual error message where the number of rows and columns of my images don't match (in cross ref).
I have generalised one of my images and then used expand to make the resolutions match again.
However, in the process I lost a few columns (which doesn't bother me), but now I don't know what to do to make both my images the same size again.
Can someone help me?
Thank you very much
L.
To make the columns and rows match:
Convert the layer from raster to vector (raster-to-vector conversion tool).
Convert the vector layer back to raster, resampling it to the layer that has the required number of columns and rows (a scripted sketch of the same idea follows below).
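If you prefer a scripted route instead of the raster-to-vector round trip, one alternative (this is just a sketch using the rasterio Python library, not the tool the steps above refer to, and the file names are placeholders) is to resample the mismatched raster directly onto the other raster's grid:

import rasterio
from rasterio.enums import Resampling

with rasterio.open("reference.tif") as ref, rasterio.open("to_fix.tif") as src:
    # Read the mismatched raster resampled to the reference grid.
    data = src.read(out_shape=(src.count, ref.height, ref.width),
                    resampling=Resampling.nearest)  # bilinear for continuous data
    profile = src.profile
    profile.update(height=ref.height, width=ref.width,
                   transform=ref.transform, crs=ref.crs)

with rasterio.open("resampled.tif", "w", **profile) as dst:
    dst.write(data)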