accelerate framework cepstrum peak find - objective-c

I'm trying to find peak values of a cepstrum analysis with the Accelerate framework. I always get peak values at the beginning or the end of the frame. I'm analyzing audio from the microphone in real time. What is wrong with my code? My code is below:
OSStatus microphoneInputCallback (void *inRefCon,
                                  AudioUnitRenderActionFlags *ioActionFlags,
                                  const AudioTimeStamp *inTimeStamp,
                                  UInt32 inBusNumber,
                                  UInt32 inNumberFrames,
                                  AudioBufferList *ioData) {
    // get reference of test app we need for test app attributes
    TestApp *this = (TestApp *)inRefCon;
    COMPLEX_SPLIT complexArray = this->fftA;
    void *dataBuffer = this->dataBuffer;
    float *outputBuffer = this->outputBuffer;
    FFTSetup fftSetup = this->fftSetup;
    uint32_t log2n = this->fftLog2n;
    uint32_t n = this->fftN; // 4096
    uint32_t nOver2 = this->fftNOver2;
    uint32_t stride = 1;
    int bufferCapacity = this->fftBufferCapacity; // 4096
    SInt16 index = this->fftIndex;
    OSStatus renderErr;
    // observation objects
    float *observerBufferRef = this->observerBuffer;
    int observationCountRef = this->observationCount;

    renderErr = AudioUnitRender(rioUnit, ioActionFlags,
                                inTimeStamp, bus1, inNumberFrames, this->bufferList);
    if (renderErr < 0) {
        return renderErr;
    }

    // Fill the buffer with our sampled data. If we fill our buffer, run the fft.
    int read = bufferCapacity - index;
    if (read > inNumberFrames) {
        memcpy((SInt16 *)dataBuffer + index, this->bufferList->mBuffers[0].mData, inNumberFrames * sizeof(SInt16));
        this->fftIndex += inNumberFrames;
    } else {
        // If we enter this conditional, our buffer will be filled and we should PERFORM FFT.
        memcpy((SInt16 *)dataBuffer + index, this->bufferList->mBuffers[0].mData, read * sizeof(SInt16));
        // Reset the index.
        this->fftIndex = 0;

        /*************** FFT ***************/
        // multiply by window
        vDSP_vmul((SInt16 *)dataBuffer, 1, this->window, 1, this->outputBuffer, 1, n);
        // We want to deal with only floating point values here.
        vDSP_vflt16((SInt16 *)dataBuffer, stride, (float *)outputBuffer, stride, bufferCapacity);
        /*
         Look at the real signal as an interleaved complex vector by casting it.
         Then call the transformation function vDSP_ctoz to get a split complex
         vector, which for a real signal, divides into an even-odd configuration.
         */
        vDSP_ctoz((COMPLEX *)outputBuffer, 2, &complexArray, 1, nOver2);
        // Carry out a Forward FFT transform.
        vDSP_fft_zrip(fftSetup, &complexArray, stride, log2n, FFT_FORWARD);
        vDSP_ztoc(&complexArray, 1, (COMPLEX *)outputBuffer, 2, nOver2);
        complexArray.imagp[0] = 0.0f;
        vDSP_zvmags(&complexArray, 1, complexArray.realp, 1, nOver2);
        bzero(complexArray.imagp, nOver2 * sizeof(float));
        // scale
        float scale = 1.0f / (2.0f * (float)n);
        vDSP_vsmul(complexArray.realp, 1, &scale, complexArray.realp, 1, nOver2);
        // step 2: get log for cepstrum
        float *logmag = malloc(sizeof(float) * nOver2);
        for (int i = 0; i < nOver2; i++)
            logmag[i] = logf(sqrtf(complexArray.realp[i]));
        // configure float array into acceptable input array format (interleaved)
        vDSP_ctoz((COMPLEX *)logmag, 2, &complexArray, 1, nOver2);
        // create cepstrum
        vDSP_fft_zrip(fftSetup, &complexArray, stride, log2n - 1, FFT_INVERSE);
        // convert interleaved to real
        float *displayData = malloc(sizeof(float) * n);
        vDSP_ztoc(&complexArray, 1, (COMPLEX *)displayData, 2, nOver2);

        float dominantFrequency = 0;
        int currentBin = 0;
        float dominantFrequencyAmp = 0;
        // find peak of cepstrum
        for (int i = 0; i < nOver2; i++) {
            // get current frequency magnitude
            if (displayData[i] > dominantFrequencyAmp) {
                // DLog("Bufferer filled %f", displayData[i]);
                dominantFrequencyAmp = displayData[i];
                currentBin = i;
            }
        }
        DLog("currentBin : %i amplitude: %f", currentBin, dominantFrequencyAmp);
    }
    return noErr;
}

I haven't worked with the Accelerate Framework, but your code appears to be taking the proper steps to calculate the Cepstrum.
The Cepstrum of real acoustic signals tends to have a very large DC component and a large peak at and near zero quefrency [sic]. Just ignore the near-DC portion of the Cepstrum and look for peaks corresponding to frequencies above 20 Hz (i.e. at quefrencies below Cepstrum_Width/20 Hz).
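Applied to the peak-search loop in the question, that advice amounts to bounding the scan. Here is a minimal sketch, assuming the Width/bin frequency mapping used in the examples below; the 20 Hz and 1000 Hz limits are illustrative choices, not values from the original code:

int cepstrumWidth = nOver2;            // length of the second FFT's output
int minBin = cepstrumWidth / 1000;     // skip the near-DC region (above ~1000 Hz)
int maxBin = cepstrumWidth / 20;       // skip the high-quefrency end (below ~20 Hz)
float peakAmp = 0.0f;
int peakBin = 0;
for (int i = minBin; i < maxBin; i++) {
    // displayData is the cepstrum buffer from the question's code
    if (displayData[i] > peakAmp) {
        peakAmp = displayData[i];
        peakBin = i;
    }
}
// estimated fundamental, in the same units as the examples below
float estimatedHz = (peakBin > 0) ? (float)cepstrumWidth / (float)peakBin : 0.0f;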
If the input signal contains a series of very closely spaced overtones, the Cepstrum will also have a large peak at the high quefrency end.
For example, the plot below shows the Cepstrum of a Dirichlet Kernel of N=128 and Width=4096, the spectrum of which is a series of very closely spaced overtones.
You may want to use a static synthetic signal to test and debug your code. A good choice for a test signal is any sinusoid with a fundamental F and several overtones at exact integer multiples of F.
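For instance, here is a sketch of such a generator in plain C; the function name, the 1/h amplitude rolloff, and the parameter values are illustrative assumptions, not part of the original code:

#include <math.h>
#include <stdlib.h>

// Fill a buffer with a fundamental f0 plus overtones at exact integer
// multiples of f0; the 1/h rolloff is an arbitrary but typical choice.
float *makeTestTone(float f0, int harmonics, float sampleRate, int n) {
    float *buf = (float *)calloc(n, sizeof(float));
    for (int h = 1; h <= harmonics; h++) {
        for (int i = 0; i < n; i++) {
            buf[i] += sinf(2.0f * (float)M_PI * f0 * h * (float)i / sampleRate) / (float)h;
        }
    }
    return buf;
}

Feeding, say, makeTestTone(82.4f, 8, 44100.0f, 4096) through the FFT/cepstrum path should produce a single clear non-DC peak, as in the synthetic example below.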
Your Cepstra should look something like the following examples.
First a synthetic signal.
The plot below shows the Cepstrum of a synthetic steady-state E2 note, synthesized using a typical near-DC component, a fundamental at 82.4 Hz, and 8 harmonics at integer multiples of 82.4 Hz. The synthetic sinusoid was programmed to generate 4096 samples.
Observe the prominent non-DC peak at 12.36. The Cepstrum width is 1024 (the output of the second FFT), therefore the peak corresponds to 1024/12.36 = 82.8 Hz which is very close to 82.4 Hz the true fundamental frequency.
Now a real acoustical signal.
The plot below shows the Cepstrum of a real acoustic guitar's E2 note. The signal was not windowed prior to the first FFT. Observe the prominent non-DC peak at 542.9. The Cepstrum width is 32768 (the output of the second FFT), therefore the peak corresponds to 32768/542.9 = 60.4 Hz which is fairly far from 82.4 Hz the true fundamental frequency.
The plot below shows the Cepstrum of the same real acoustic guitar's E2 note, but this time the signal was Hann windowed prior to the first FFT. Observe the prominent non-DC peak at 268.46. The Cepstrum width is 32768 (the output of the second FFT), therefore the peak corresponds to 32768/268.46 = 122.1 Hz which is even farther from 82.4 Hz the true fundamental frequency.
The acoustic guitar's E2 note used for this analysis was sampled at 44.1 kHz with a high-quality microphone under studio conditions; it contains essentially zero background noise, no other instruments or voices, and no post-processing.
References:
Real audio signal data, synthetic signal generation, plots, FFT, and Cepstral analysis were done here: Musical instrument cepstrum

Related

ceres solver analytical derivative doesn't work

template<typename ConcreteOccGridMapUtil>
class getResidual : public ceres::SizedCostFunction<1, 3>
{
public:
    ConcreteOccGridMapUtil* occ;
    DataContainer dataPoints;

    getResidual(ConcreteOccGridMapUtil* occ, const DataContainer& dataPoints)
    {
        this->occ = occ;
        this->dataPoints = dataPoints;
    }
    virtual ~getResidual() {}

    virtual bool Evaluate(double const* const* parameters,
                          double* residuals,
                          double** jacobians) const
    {
        Eigen::Matrix<double, 3, 1> pose1(parameters[0][0], parameters[0][1], parameters[0][2]);
        Eigen::Vector3f pose = pose1.cast<float>();
        Eigen::Affine2f transform(occ->getTransformForState(pose)); // transform: rotation->translation
        float sinRot = std::sin(pose[2]);
        float cosRot = std::cos(pose[2]);
        int size = dataPoints.getSize();
        residuals[0] = 0;
        jacobians[0][0] = 0;
        jacobians[0][1] = 0;
        jacobians[0][2] = 0;
        for (int i = 0; i < size; ++i)
        {
            const Eigen::Vector2f& currPoint(dataPoints.getVecEntry(i)); /// lidar point
            Eigen::Vector3f transformedPointData(occ->interpMapValueWithDerivatives(transform * currPoint)); /// {M, dM/dx, dM/dy}
            float funVal = 1.0f - transformedPointData[0];
            // float weight = util::WeightValue(funVal);
            float weight = 1.0;
            residuals[0] += static_cast<double>(funVal);
            jacobians[0][0] += static_cast<double>(transformedPointData[1]);
            jacobians[0][1] += static_cast<double>(transformedPointData[2]);
            double rotDeriv = ((-sinRot * currPoint.x() - cosRot * currPoint.y()) * transformedPointData[1]
                             + (cosRot * currPoint.x() - sinRot * currPoint.y()) * transformedPointData[2]);
            jacobians[0][2] += static_cast<double>(rotDeriv);
        }
        return true;
    }
};
My parameter to optimize is the pose = [x, y, theta].
My objective function minimizes the occupancy value over the pose and the laser points, and I add the per-point terms together manually into residuals[0].
Since I have 3 parameters [x, y, theta], my Jacobian has 3 entries in jacobians[0].
But when I run the program, the report looks like this:
Solver Summary (v 1.12.0-eigen-(3.2.0)-lapack-suitesparse-(4.2.1)-openmp)

                                   Original                Reduced
Parameter blocks                          1                      1
Parameters                                3                      3
Residual blocks                           1                      1
Residual                                  1                      1

Minimizer                      TRUST_REGION
Dense linear algebra library          EIGEN
Trust region strategy   LEVENBERG_MARQUARDT

                                      Given                   Used
Linear solver                      DENSE_QR               DENSE_QR
Threads                                   1                      1
Linear solver threads                     1                      1
Linear solver ordering            AUTOMATIC                      1

Cost:
Initial                        8.569800e+04
Final                          8.569800e+04
Change                         0.000000e+00

Minimizer iterations                      1
Successful steps                          1
Unsuccessful steps                        0

Time (in seconds):
Preprocessor                         0.0001
Residual evaluation                  0.0000
Jacobian evaluation                  0.0050
Linear solver                        0.0000
Minimizer                            0.0051
Postprocessor                        0.0000
Total                                0.0052

Termination: CONVERGENCE (Gradient tolerance reached. Gradient max norm: 0.000000e+00 <= 1.000000e-10)
Since I have set the jacobians, how can it say that the gradient norm is so small?
Two things.
1. You cannot unconditionally set the Jacobian; you need to check whether the solver is actually asking for it, i.e. whether the jacobians pointers are non-null.
2. There is something wrong with your Jacobian evaluation, because as far as Ceres can tell it is seeing a zero gradient. A simple thing to check would be to dump out the Jacobian and Jacobian' * residual from the CostFunction before returning.
For example, are you sure size != 0?
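To illustrate point 1, here is a minimal, hypothetical SizedCostFunction<1, 3> with the guard in place; the quadratic residual is only a stand-in for the question's map-matching objective:

#include "ceres/ceres.h"

// Hypothetical cost function: a simple quadratic in the pose, standing in
// for the occupancy-map term from the question.
class GuardedResidual : public ceres::SizedCostFunction<1, 3> {
 public:
  virtual ~GuardedResidual() {}
  virtual bool Evaluate(double const* const* parameters,
                        double* residuals,
                        double** jacobians) const {
    const double x = parameters[0][0];
    const double y = parameters[0][1];
    const double theta = parameters[0][2];
    residuals[0] = x * x + y * y + theta * theta;
    // Only touch the Jacobian when Ceres actually asks for it; on some
    // evaluations jacobians (or jacobians[0]) is NULL.
    if (jacobians != NULL && jacobians[0] != NULL) {
      jacobians[0][0] = 2.0 * x;
      jacobians[0][1] = 2.0 * y;
      jacobians[0][2] = 2.0 * theta;
    }
    return true;
  }
};

The same if (jacobians != NULL && jacobians[0] != NULL) guard, wrapped around the zeroing and the accumulation loop, is what the Evaluate in the question is missing.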

Faster way to structure operations on offset neighborhoods in OpenCL

How can an operation on many overlapping but offset blocks of a 2D array be structured for more efficient execution in OpenCL?
For example, I have the following OpenCL kernel:
__kernel void test_kernel(
    read_only image2d_t src,
    write_only image2d_t dest,
    const int width,
    const int height
)
{
    const sampler_t sampler = CLK_NORMALIZED_COORDS_FALSE | CLK_ADDRESS_CLAMP_TO_EDGE | CLK_FILTER_NEAREST;
    int2 pos = (int2)(get_global_id(0), get_global_id(1));
    int2 pos0 = (int2)(pos.x - pos.x % 16, pos.y - pos.y % 16);
    uint4 diff = (uint4)(0, 0, 0, 0);
    for (int i = 0; i < 16; i++)
    {
        for (int j = 0; j < 16; j++)
        {
            diff += read_imageui(src, sampler, (int2)(pos0.x + i, pos0.y + j)) -
                    read_imageui(src, sampler, (int2)(pos.x + i, pos.y + j));
        }
    }
    write_imageui(dest, pos, diff);
}
It produces correct results, but is slow... only ~25 GFLOPS on an NVS 4200M with 1k by 1k input (the hardware spec is 155 GFLOPS). I'm guessing this has to do with the memory access patterns: each work item reads one 16x16 block of data which is the same for all its neighbors in a 16x16 area, and also another offset block which mostly overlaps with those of its immediate neighbors. All reads are through samplers. The host program is PyOpenCL (I don't think that actually changes anything) and the work-group size is 16x16.
EDIT: New version of the kernel per the suggestion below, copying the work area to local variables:
__kernel __attribute__((reqd_work_group_size(16, 16, 1)))
void test_kernel(
    read_only image2d_t src,
    write_only image2d_t dest,
    const int width,
    const int height
)
{
    const sampler_t sampler = CLK_NORMALIZED_COORDS_FALSE | CLK_ADDRESS_CLAMP_TO_EDGE | CLK_FILTER_NEAREST;
    int2 pos = (int2)(get_global_id(0), get_global_id(1));
    int dx = pos.x % 16;
    int dy = pos.y % 16;
    __local uint4 local_src[16*16];
    __local uint4 local_src2[32*32];
    local_src[(pos.y % 16) * 16 + (pos.x % 16)] = read_imageui(src, sampler, pos);
    local_src2[(pos.y % 16) * 32 + (pos.x % 16)] = read_imageui(src, sampler, pos);
    local_src2[(pos.y % 16) * 32 + (pos.x % 16) + 16] = read_imageui(src, sampler, (int2)(pos.x + 16, pos.y));
    local_src2[(pos.y % 16 + 16) * 32 + (pos.x % 16)] = read_imageui(src, sampler, (int2)(pos.x, pos.y + 16));
    local_src2[(pos.y % 16 + 16) * 32 + (pos.x % 16) + 16] = read_imageui(src, sampler, (int2)(pos.x + 16, pos.y + 16));
    barrier(CLK_LOCAL_MEM_FENCE);
    uint4 diff = (uint4)(0, 0, 0, 0);
    for (int i = 0; i < 16; i++)
    {
        for (int j = 0; j < 16; j++)
        {
            diff += local_src[ j*16 + i ] - local_src2[ (j+dy)*32 + i+dx ];
        }
    }
    write_imageui(dest, pos, diff);
}
Result: the output is correct, but the running time is 56% slower. Using local_src only (not local_src2) is ~10% faster.
EDIT: Benchmarked on much more powerful hardware: an AMD Radeon HD 7850 gets 420 GFLOPS against a spec of 1751 GFLOPS. To be fair, the spec is for multiply-add, and there is no multiply here, so the expected figure is ~875 GFLOPS, but this is still off by quite a lot compared to the theoretical performance.
EDIT: To make it easy for anyone who would like to try this out, the host-side program in PyOpenCL is below:
import pyopencl as cl
import numpy
import numpy.random
from time import time

CL_SOURCE = '''
// kernel goes here
'''

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx, properties=cl.command_queue_properties.PROFILING_ENABLE)
prg = cl.Program(ctx, CL_SOURCE).build()

h, w = 1024, 1024
src = numpy.zeros((h, w, 4), dtype=numpy.uint8)
src[:,:,:] = numpy.random.rand(h, w, 4) * 255
mf = cl.mem_flags
src_buf = cl.image_from_array(ctx, src, 4)
fmt = cl.ImageFormat(cl.channel_order.RGBA, cl.channel_type.UNSIGNED_INT8)
dest_buf = cl.Image(ctx, mf.WRITE_ONLY, fmt, shape=(w, h))

# warmup
for n in range(10):
    event = prg.test_kernel(queue, (w, h), (16, 16), src_buf, dest_buf, numpy.int32(w), numpy.int32(h))
    event.wait()

# benchmark
t1 = time()
for n in range(100):
    event = prg.test_kernel(queue, (w, h), (16, 16), src_buf, dest_buf, numpy.int32(w), numpy.int32(h))
    event.wait()
t2 = time()

print "Duration (host): ", (t2 - t1) / 100
print "Duration (event): ", (event.profile.end - event.profile.start) * 1e-9
EDIT: Thinking about the memory access patterns, the original naive version may be pretty good; when calling read_imageui(src, sampler, (int2)(pos0.x + i, pos0.y + j)) all work-items in a work group are reading the same location (so this is just one read??), and when calling read_imageui(src, sampler, (int2)(pos.x + i, pos.y + j)) they are reading sequential locations (so the reads can be coalesced perfectly??).
This is definitely a memory access problem. Neighbouring work items' pixels can overlap by as much as 15x16, and worse yet, each work item will overlap at least 225 others.
I would use local memory and get work groups to cooperatively process many 16x16 blocks. I like to use a large, square block for each work group. Rectangular blocks are a bit more complicated, but can get better memory utilization for you.
If you read blocks of n by n pixels from your source image, the borders will overlap by nx15 (or 15xn). You need to calculate the largest possible value of n based on your available local memory size (LDS). If you are using OpenCL 1.1 or greater, the LDS is at least 32 KB; OpenCL 1.0 promises 16 KB per work group.
n <= sqrt(32kb / sizeof(uint4))
n <= sqrt(32768 / 16)
n ~ 45
Using n=45 will use 32400 out of 32768 bytes of the LDS, and lets you use 900 work items per group ((45-15)^2 = 900). Note: here is where a rectangular block would help out; for example, 64x32 would use all of the LDS, but with group size = (64-15)*(32-15) = 833.
Steps to use the LDS for your kernel:
1. Allocate a 1D or 2D local array for your cached block of the image. I use a #define constant, and it rarely has to change.
2. Read the uint values from your image, and store them locally.
3. Adjust 'pos' for each work item to refer to the local memory.
4. Execute the same i,j loops you have, but read the values from local memory. Remember that the i and j loops stop 15 short of n.
Each step can be searched for online if you are not sure how to implement it, or you can ask me if you need a hand.
Chances are good that the LDS on your device will outperform the texture read speed. This is counter-intuitive, but remember that you are reading tiny amounts of data at a time, so the GPU may not be able to cache the pixels effectively. Using the LDS guarantees that the pixels are available, and given the number of times each pixel is read, I expect this to make a huge difference.
Please let me know what kind of results you observe.
UPDATE: Here's my attempt to better explain my solution. I used graph paper for my drawings, because I'm not all that great with image manipulation software.
Above is a sketch of how the values were read from src in your first code snippet. The big problem is that the pos0 rectangle -- 16x16 uint4 values -- is being read in its entirety by each work item in the group (256 of them). My solution involves reading a large area and sharing the data among all 256 work items.
If you store a 31x31 region of your image in local memory, all 256 work items' data will be available.
Steps:
1. Use work group dimensions of (16,16).
2. Read the values of src into a large local buffer, i.e. uint4 buff[31][31]; the buffer needs to be translated such that 'pos0' is at buff[0][0].
3. barrier(CLK_LOCAL_MEM_FENCE) to wait for the memory copy operations.
4. Do the same i,j for loops you had originally, except leaving out the pos and pos0 values; use only i and j for the location. Accumulate 'diff' in the same way you did originally.
5. Write the solution to 'dest'.
This is the same as my first response to your question, except I use n=16. This value does not utilize the local memory fully, but will probably work well on most platforms; 256 tends to be a common maximum work group size.
I hope this clears things up for you.
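For concreteness, here is an untested sketch of those steps in OpenCL C; the kernel name and the strided cooperative-load loop are my own choices, not from the answer above:

__kernel __attribute__((reqd_work_group_size(16, 16, 1)))
void test_kernel_lds(
    read_only image2d_t src,
    write_only image2d_t dest,
    const int width,
    const int height
)
{
    const sampler_t sampler = CLK_NORMALIZED_COORDS_FALSE | CLK_ADDRESS_CLAMP_TO_EDGE | CLK_FILTER_NEAREST;
    int2 pos = (int2)(get_global_id(0), get_global_id(1));
    int lx = get_local_id(0); // equals pos.x % 16 for this launch configuration
    int ly = get_local_id(1);
    // 31x31 cache: the 16x16 pos0 block plus the 15-pixel apron to the
    // right and below; buff[0][0] holds the pixel at pos0.
    __local uint4 buff[31][31];
    // Cooperative load: the 256 work items stride across the 31x31 area.
    for (int y = ly; y < 31; y += 16)
        for (int x = lx; x < 31; x += 16)
            buff[y][x] = read_imageui(src, sampler, (int2)(pos.x - lx + x, pos.y - ly + y));
    barrier(CLK_LOCAL_MEM_FENCE);
    uint4 diff = (uint4)(0, 0, 0, 0);
    for (int i = 0; i < 16; i++)
        for (int j = 0; j < 16; j++)
            diff += buff[j][i] - buff[j + ly][i + lx];
    write_imageui(dest, pos, diff);
}

The 31x31 uint4 buffer uses 15376 bytes of local memory, which fits within even the OpenCL 1.0 minimum of 16 KB.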
Some suggestions:
Compute more than 1 output pixel in each work item. It will increase data reuse.
Benchmark different work-group sizes to maximize the usage of texture cache.
Maybe there is a way to separate the kernel into two passes (horizontal and vertical).
Update: more suggestions
Instead of loading everything in local memory, try loading only the local_src values, and use read_image for the other one.
Since you do almost no computation, you should measure the read speed in GB/s and compare it to the peak memory speed.

Working out Oscillator wave type code, and creating new wave types

I want to work out the bit of code that generates the oscillator wave-type in my tone generator app. The one in this example is a sine wave. Can someone tell me how the code works? In the future I want to make custom wave-types as well as square, sawtooth, and triangle types.
OSStatus RenderTone(
    void *inRefCon,
    AudioUnitRenderActionFlags *ioActionFlags,
    const AudioTimeStamp *inTimeStamp,
    UInt32 inBusNumber,
    UInt32 inNumberFrames,
    AudioBufferList *ioData)
{
    // Fixed amplitude is good enough for our purposes
    const double amplitude = 0.25;
    // Get the tone parameters out of the view controller
    ToneGeneratorViewController *viewController =
        (ToneGeneratorViewController *)inRefCon;
    double theta = viewController->theta;
    double theta_increment = 2.0 * M_PI * viewController->frequency / viewController->sampleRate;
    // This is a mono tone generator so we only need the first buffer
    const int channel = 0;
    Float32 *buffer = (Float32 *)ioData->mBuffers[channel].mData;
    // Generate the samples
    for (UInt32 frame = 0; frame < inNumberFrames; frame++)
    {
        buffer[frame] = sin(theta) * amplitude;
        theta += theta_increment;
        if (theta > 2.0 * M_PI)
        {
            theta -= 2.0 * M_PI;
        }
    }
    // Store the theta back in the view controller
    viewController->theta = theta;
    return noErr;
}
The actual sine wave samples are generated, and populate the buffer, in the snippet below:
for (UInt32 frame = 0; frame < inNumberFrames; frame++)
{
    buffer[frame] = sin(theta) * amplitude;
    theta += theta_increment;
    if (theta > 2.0 * M_PI)
    {
        theta -= 2.0 * M_PI;
    }
}
In the line where buffer[frame] is assigned, you are calling sin(theta) * amplitude, and on each iteration of the for loop you increment theta by a finite step size based on your frequency and sample rate, via

double theta_increment = 2.0 * M_PI * viewController->frequency / viewController->sampleRate;

which simply divides 2.0 * PI * frequency by your sample rate. For example, at 440 Hz and a 44.1 kHz sample rate, theta advances by about 0.0627 radians per sample.
Incrementing theta while looping through the for loop advances the time step one sample at a time until your buffer is full (i.e. frame == inNumberFrames).
If you wanted to generate something other than a sine wave, you would simply replace the following line with some other function:
buffer[frame] = sin(theta) * amplitude;
For example, say you wanted the first three terms of the infinite Fourier series that converges to a triangle wave; you might then have the following instead...
buffer[frame] = (8 / pow(M_PI,2)) * (sin(theta) - sin(3*theta)/9 + sin(5*theta)/25);
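As a hedged generalization of that line (my own sketch, not from the answer), summing the series in a loop lets you trade CPU time for accuracy:

#include <math.h>

// First `terms` partial sums of the triangle-wave Fourier series:
// (8/pi^2) * sum_k (-1)^k * sin((2k+1)*theta) / (2k+1)^2
double triangle_fourier(double theta, int terms)
{
    double sum = 0.0;
    double sign = 1.0;
    for (int k = 0; k < terms; k++) {
        int m = 2 * k + 1;
        sum += sign * sin(m * theta) / (double)(m * m);
        sign = -sign;
    }
    return (8.0 / (M_PI * M_PI)) * sum;
}

With terms = 3 this reproduces the sin(theta) - sin(3*theta)/9 + sin(5*theta)/25 expression above.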
To produce your desired waveform, you need to replace the sin() function with a function that produces your desired wave shape.
You might be able to find this function in a table of functions with graphical examples, or you might have to create your own. There are lots of ways to create a functional approximation, including polynomials, Fourier series, table lookup with or without interpolation, recursion, and so on. But that is a big subject on its own (many textbooks, etc.).
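As a further hedged illustration (not from either answer), the simplest direct formulas for the classic shapes can stand in for sin(); note that these naive versions alias audibly at higher frequencies:

#include <math.h>

// Hypothetical drop-in replacements for sin(theta): each maps a phase
// angle theta in [0, 2*PI) to one sample of the named naive shape.
double square_wave(double theta)   { return theta < M_PI ? 1.0 : -1.0; }
double sawtooth_wave(double theta) { return theta / M_PI - 1.0; }
double triangle_wave(double theta) { return 2.0 * fabs(theta / M_PI - 1.0) - 1.0; }

// Inside the render loop, for example:
//     buffer[frame] = square_wave(theta) * amplitude;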

Using CUDA to find the pixel-wise average value of a bunch of images

So I have a cube of images, 512x512x512. I want to sum the images pixel-wise and save the result to a final image, so if all the pixels had value 1, the final image would be all 512s. I am having trouble understanding the indexing to do this in CUDA. I figure one thread's job will be to sum all 512 values at its pixel, so the total thread count will be 512x512. I plan to do it with 512 blocks of 512 threads each. From here, I am having trouble coming up with the indexing to sum over the depth. Any help will be greatly appreciated.
One way to solve this problem is to picture the cube as a set of Z slices. The X and Y coordinates refer to the width and height of the image, and the Z coordinate to each slice in the Z dimension. Each thread iterates over the Z coordinate to accumulate the values.
With this in mind, configure the kernel to launch blocks of 16x16 threads and a grid with enough blocks to cover the width and height of the image (I'm assuming a grayscale image with 1 byte per pixel):
#define THREADS 16
// kernel configuration
dim3 dimBlock = dim3 ( THREADS, THREADS, 1 );
dim3 dimGrid = dim3 ( WIDTH / THREADS, HEIGHT / THREADS );
// call the kernel
kernel<<<dimGrid, dimBlock>>>(i_data, o_Data, WIDTH, HEIGHT, DEPTH);
If you are clear on how to index a 2D array, looping through the Z dimension is also straightforward:
__global__ void kernel(unsigned char* i_data, unsigned int* o_data, int WIDTH, int HEIGHT, int DEPTH)
{
    // in your kernel, map from threadIdx/blockIdx to pixel position
    int x = threadIdx.x + blockIdx.x * blockDim.x;
    int y = threadIdx.y + blockIdx.y * blockDim.y;
    // calculate the global index of a pixel into the image array;
    // this global index points into the first slice of the cube
    int idx = x + y * WIDTH;
    // partial result; the sum of up to 512 byte pixels can reach 512 * 255,
    // which would overflow unsigned char, so accumulate into a wider type
    unsigned int r = 0;
    // iterate in the Z dimension
    for (int z = 0; z < DEPTH; ++z)
    {
        // WIDTH * HEIGHT is the offset of one slice
        int idx_z = z * WIDTH * HEIGHT + idx;
        r += i_data[idx_z];
    }
    // o_data is a 2D array, so you can use the global index idx
    o_data[idx] = r;
}
This is a naive implementation. In order to maximize memory throughput, the data should be properly aligned.
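For completeness, here is a hedged host-side sketch for launching this kernel; the buffer names and the omitted cudaMemcpy calls (marked in comments) are assumptions, not part of the answer:

#include <cuda_runtime.h>

__global__ void kernel(unsigned char* i_data, unsigned int* o_data, int WIDTH, int HEIGHT, int DEPTH); // defined above

int main()
{
    const int W = 512, H = 512, D = 512;
    unsigned char* d_in;
    unsigned int* d_out;
    cudaMalloc(&d_in, (size_t)W * H * D);                     // input cube
    cudaMalloc(&d_out, (size_t)W * H * sizeof(unsigned int)); // summed image
    // ... cudaMemcpy the image cube from the host into d_in here ...
    dim3 dimBlock(16, 16, 1);
    dim3 dimGrid(W / 16, H / 16);
    kernel<<<dimGrid, dimBlock>>>(d_in, d_out, W, H, D);
    cudaDeviceSynchronize();
    // ... cudaMemcpy d_out back to the host here ...
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}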
This can also be done easily using the ArrayFire GPU library (free). In ArrayFire, you can construct 3D arrays like the following:
Two approaches:

// Method 1:
array data = rand(x, y, z);
// Just reshaping the array, this is a noop
data = newdims(data, x * y, z, 1);
// Sum of pixels
res = sum(data);

// Method 2: use ArrayFire "GFOR"
array data = rand(x, y, z);
array res = zeros(z, 1);
gfor (array i, z) {
    res(i) = sum(data(span, span, i));
}

Discrete Wavelet Transform on images and watermark embedding in LL band coefficients, data is lost when IDWT-DWT is performed again?

I'm writing an image watermarking system to hide a watermark in an image's low frequency band by transforming the image's luminance channel with a Discrete Wavelet Transform, then modifying coefficients in the LL band of the DWT output. I then do an Inverse DWT and rebuild my image.
The problem I'm having is that when I modify coefficients in the DWT output, then inverse-DWT, and then DWT again, the modified coefficients come out radically different.
For example, one of the output coefficients in the LL band of the 2-scale DWT was -0.10704, I modified this coefficient to be 16.89, then performed the IDWT on my data. I then took the output of the IDWT and performed a DWT on it again, and my coefficient which was modified to be 16.89 became 0.022.
I'm fairly certain that the DWT and IDWT code is correct because I've tested it against other libraries and the output from each transform matches when the filter coefficients and other parameters are the same. (Within what can be expected due to rounding error)
The main problem is that I perhaps don't understand the DWT all that well; I thought the DWT and IDWT were supposed to be reasonably lossless (aside from rounding error and such), yet this doesn't seem to be the case here.
I'm hoping someone more familiar with the transform can point me at a possible issue. Is it possible that I'm losing data because the coefficients in my other subbands (LH, HL, HH) at that position are insignificant? If so, how can I determine which coefficients this may happen to?
My embedding function is below. Coefficients are chosen in the LL band, and "strong" is true if the absolute value of the LH, HL, or HH band at the selected location is larger than the mean value of the corresponding subband.
// If this evaluates to true, then the texture is considered strong.
if ((Math.Abs(LH[i][w]) >= LHmean) || (Math.Abs(HL[i][w]) >= HLmean) || (Math.Abs(HH[i][w]) >= HHmean))

static double MarkCoeff(int index, double coeff, bool strong)
{
    int q1 = 16;
    int q2 = 8;
    int quantizestep = 0;
    byte watermarkbit = binaryWM[index];
    if (strong)
        quantizestep = q1;
    else
        quantizestep = q2;
    coeff /= (double)quantizestep;
    double coeffdiff = 0;
    if (coeff > 0.0)
        coeffdiff = coeff - (int)coeff;
    else
        coeffdiff = coeff + (int)coeff;
    if (1 == ((int)coeff % 2))
    {
        // odd
        if (watermarkbit == 0)
        {
            if (Math.Abs(coeffdiff) > 0.5)
                coeff += 1.0;
            else
                coeff -= 1.0;
        }
    }
    else
    {
        // even
        if (watermarkbit == 1)
        {
            if (Math.Abs(coeffdiff) > 0.5)
                coeff += 1.0;
            else
                coeff -= 1.0;
        }
    }
    coeff *= (double)quantizestep;
    return coeff;
}