Doing "uint8x8x4_t - 128" then divising this by 2 - neon

I'm a bit mixed up about how to achieve division by a scalar on NEON in a specific case.
In a C++ context, I'm applying a contrast effect with a very rudimentary algorithm:
if (currentEffect == "contrast_with_cpp")
{
r += ((r - 128) / 2);
g += ((g - 128) / 2);
b += ((b - 128) / 2);
}
I would like to port this algorithm to NEON intrinsics.
I've tried, but I'm a total newbie to this approach, and I cannot debug this code in Visual Studio. It is compiled at startup and integrated into a Windows Phone application.
if (currentEffect == "contrast_with_neon") /* Experimental, not working *
{
// To test
copy_rgb = rgb;
// Subtract 128 from the copy; presumably it should be a signed variable
?
// Get half value from copy and put it in another copy
uint8x8x4_t otherCopy = interleaved;
otherCopy.val[2] = vmul_n_f32(copy_rgb.val[2], 0.5);
otherCopy.val[1] = vmul_n_f32(copy_rgb.val[1], 0.5);
otherCopy.val[0] = vmul_n_f32(copy_rgb.val[0], 0.5);
// Add it to the first copy
copy_rgb.val[2] = vadd_u8(copy_rgb.val[2], otherCopy.val[2]);
copy_rgb.val[1] = vadd_u8(copy_rgb.val[2], otherCopy.val[1]);
copy_rgb.val[0] = vadd_u8(copy_rgb.val[2], otherCopy.val[0]);
rgb = copy_rgb;
}
Is this achievable using intrinsics?
[Edit] I guess the color data structure is similar to this

Stop wasting your time with intrinsics. It's a real pain, especially with gcc.
Try this in assembly :
vmov.i16 qMedian, #128 // put this line outside of the loop
// -----------------------------------------------
vmovl.u8 qRed, dRed
vmovl.u8 qGrn, dGrn
vmovl.u8 qBlu, dBlu
vsub.s16 qRedTemp, qRed, qMedian
vsub.s16 qGrnTemp, qGrn, qMedian
vsub.s16 qBluTemp, qBlu, qMedian
vshr.s16 qRedTemp, #1 // arithmetic shift right by 1 = divide by 2
vshr.s16 qGrnTemp, #1
vshr.s16 qBluTemp, #1
vadd.s16 qRed, qRedTemp
vadd.s16 qGrn, qGrnTemp
vadd.s16 qBlu, qBluTemp
vqmovun.s16 dRed, qRed
vqmovun.s16 dGrn, qGrn
vqmovun.s16 dBlu, qBlu
vqmovun.s16 does saturate at 255, and any negative values will become zeros, which I assume is intended.
PS : What are you doing with float?
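For completeness, the same pipeline is achievable with intrinsics; here is a minimal sketch for one channel, assuming the uint8x8x4_t layout from the question (the other channels work the same way, and everything stays in integer registers, no float needed):
// #include <arm_neon.h>
// Widen u8 to s16 so the subtraction can go negative.
int16x8_t wide = vreinterpretq_s16_u16(vmovl_u8(rgb.val[0]));
int16x8_t diff = vsubq_s16(wide, vdupq_n_s16(128));   // r - 128
// Arithmetic shift right by 1 divides by 2 (rounds toward -inf for odd negatives).
wide = vaddq_s16(wide, vshrq_n_s16(diff, 1));         // r += (r - 128) / 2
rgb.val[0] = vqmovun_s16(wide);                       // narrow back, saturating to 0..255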

Related

HDL - PC.hdl but starting off with two 8-bit registers

So, I basically need to create a PC.hdl, but starting off with two 8-bit registers. Here's the starting point:
// This file is BASED ON part of www.nand2tetris.org
// and the book "The Elements of Computing Systems"
// by Nisan and Schocken, MIT Press.
// File name: project03starter/a/PC.hdl
/**
* A 16-bit counter with load and reset control bits.
* if (reset[t] == 1) out[t+1] = 0
* else if (load[t] == 1) out[t+1] = in[t]
* else if (inc[t] == 1) out[t+1] = out[t] + 1 (integer addition)
* else out[t+1] = out[t]
*/
CHIP PC {
IN in[16],load,inc,reset;
OUT out[16];
PARTS:
// Something to start you off: you need *two* 8-bit registers
Register(in=nextLow, out=out[0..7], out=currentLow, load=true);
Register(in=nextHigh, out=out[8..15], out=currentHigh, load=true);
// Handling 'inc' to increment the 16-bit value also gets tricky
// ... this might be useful
And(a=inc, b=lowIsMax, out=incAndLowIsMax);
// ...
// The rest of your code goes here...
}
I know how to do this normally when just handling the 16 bits, but I'm not sure how to go about it with 8-bit registers.
Could anyone please help me with the correct solution?
Thanks.
Assuming you have an 8-bit adder, you can implement a 16-bit incrementer using two adders: one computes currentLow+1, and the other computes currentHigh+(carry output of the low-bits adder).
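As a behavioral sketch of that carry chain in plain C++ (not HDL, just to sanity-check the logic before wiring the chips):
#include <cstdint>

// Increment a 16-bit value stored as two 8-bit registers.
void inc16(uint8_t& low, uint8_t& high) {
    bool lowIsMax = (low == 0xFF);          // feeds And(a=inc, b=lowIsMax, ...) in the starter
    low = (uint8_t)(low + 1);               // low adder: currentLow + 1
    if (lowIsMax)
        high = (uint8_t)(high + 1);         // high adder: currentHigh + carry
}
The load and reset muxes then sit in front of each register exactly as in the 16-bit version, applied to each half.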

Raw Input mouse lastx, lasty with odd values while logged in through RDP

When I attempt to update my mouse position from the lLastX, and lLastY members of the RAWMOUSE structure while I'm logged in via RDP, I get some really odd numbers (like > 30,000 for both). I've noticed this behavior on Windows 7, 8, 8.1 and 10.
The usFlags member returns a value of MOUSE_MOVE_ABSOLUTE | MOUSE_VIRTUAL_DESKTOP. Regarding the MOUSE_MOVE_ABSOLUTE, I am handling absolute positioning as well as relative in my code. However, the virtual desktop flag has me a bit confused as I assumed that flag was for a multi-monitor setup. I've got a feeling that there's a connection to that flag and the weird numbers I'm getting. Unfortunately, I really don't know how to adjust the values without a point of reference, nor do I even know how to get a point of reference.
When I run my code locally, everything works as it should.
So does anyone have any idea why RDP + Raw Input would give me such messed up mouse lastx/lasty values? And if so, is there a way I can convert them to more sensible values?
It appears that when using WM_INPUT through remote desktop, the MOUSE_MOVE_ABSOLUTE and MOUSE_VIRTUAL_DESKTOP bits are set, and the values seem to range from 0 to USHRT_MAX.
I never really found clear documentation stating which coordinate system is used when the MOUSE_VIRTUAL_DESKTOP bit is set, but this seems to have worked well thus far:
case WM_INPUT: {
UINT buffer_size = 48;
BYTE buffer[48];
GetRawInputData((HRAWINPUT)lparam, RID_INPUT, buffer, &buffer_size, sizeof(RAWINPUTHEADER));
RAWINPUT* raw = (RAWINPUT*)buffer;
if (raw->header.dwType != RIM_TYPEMOUSE) {
break;
}
const RAWMOUSE& mouse = raw->data.mouse;
if ((mouse.usFlags & MOUSE_MOVE_ABSOLUTE) == MOUSE_MOVE_ABSOLUTE) {
static Vector3 last_pos = vector3(FLT_MAX, FLT_MAX, FLT_MAX);
const bool virtual_desktop = (mouse.usFlags & MOUSE_VIRTUAL_DESKTOP) == MOUSE_VIRTUAL_DESKTOP;
const int width = GetSystemMetrics(virtual_desktop ? SM_CXVIRTUALSCREEN : SM_CXSCREEN);
const int height = GetSystemMetrics(virtual_desktop ? SM_CYVIRTUALSCREEN : SM_CYSCREEN);
const Vector3 absolute_pos = vector3((mouse.lLastX / float(USHRT_MAX)) * width, (mouse.lLastY / float(USHRT_MAX)) * height, 0);
if (last_pos != vector3(FLT_MAX, FLT_MAX, FLT_MAX)) {
MouseMoveEvent(absolute_pos - last_pos);
}
last_pos = absolute_pos;
}
else {
MouseMoveEvent(vector3((float)mouse.lLastX, (float)mouse.lLastY, 0));
}
}
break;
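Note that on multi-monitor setups the virtual desktop can start at a negative origin, so it can be worth adding the SM_XVIRTUALSCREEN/SM_YVIRTUALSCREEN offsets as well. A hedged variant (the helper name is invented; MulDiv and the metrics are standard Win32):
// Map a MOUSE_MOVE_ABSOLUTE coordinate (0..USHRT_MAX) to virtual-desktop pixels.
POINT AbsoluteToVirtualDesktop(LONG lastX, LONG lastY) {
    POINT pt;
    pt.x = GetSystemMetrics(SM_XVIRTUALSCREEN) + MulDiv((int)lastX, GetSystemMetrics(SM_CXVIRTUALSCREEN), USHRT_MAX);
    pt.y = GetSystemMetrics(SM_YVIRTUALSCREEN) + MulDiv((int)lastY, GetSystemMetrics(SM_CYVIRTUALSCREEN), USHRT_MAX);
    return pt;
}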

Optimization of HLSL shaders

I'm trying to optimize my terrain shader for my XNA game, as it seems to consume a lot of resources. It costs around 10 to 20 FPS on my computer, and my terrain is 512*512 vertices, so the pixel shader is called a great many times.
I've read that branching uses up resources, and I have 3 or 4 conditions in my shaders.
What could I do to bypass them? Are ternary operators more efficient than conditions?
For instance:
float a = (b == x) ? c : d;
or
float a;
if(b == x)
a = c;
else
a = d;
I'm also using the lerp and clamp functions multiple times; would it be more efficient to use arithmetic operations instead?
Here's the least efficient part of my code:
float fog;
if(FogWaterActivated && input.WorldPosition.y-0.1 < FogWaterHeight)
{
if(!IsUnderWater)
fog = clamp(input.Depth*0.005*(FogWaterHeight - input.WorldPosition.y), 0, 1);
else
fog = clamp(input.Depth*0.02, 0, 1);
return float4(lerp(lerp( output * light, FogColorWater, fog), ShoreColor, shore), 1);
}
else
{
fog = clamp((input.Depth*0.01 - FogStart) / (FogEnd - FogStart), 0, 0.8);
return float4(lerp(lerp( output * light, FogColor, fog), ShoreColor, shore), 1);
}
Thanks!
Any time you can precalculate operations done on shader constants, the better. Removing division operations by passing the inverse into the shader is another useful tip, as division is typically slower than multiplication.
In your case, precalculate (1 / (FogEnd - FogStart)), and multiply by that on your second-last line of code.
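For example, a sketch assuming a new shader constant FogRangeInv (the name is invented here) that the CPU side sets once per frame:
// XNA side: effect.Parameters["FogRangeInv"].SetValue(1f / (fogEnd - fogStart));
float FogRangeInv;
// ...
fog = clamp((input.Depth * 0.01 - FogStart) * FogRangeInv, 0, 0.8);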

GluUnProject for iOS

To derive 3D world coordinates from the 2D screen coordinates on iOS, is there any other possible way besides the gluUnProject port?
I've been fiddling around with this for days on end now, and I can't seem to get the hang of it.
-(void)receivePoint:(CGPoint)loke
{
GLfloat projectionF[16];
GLfloat modelViewF[16];
GLint viewportI[4];
glGetFloatv(GL_MODELVIEW_MATRIX, modelViewF);
glGetFloatv(GL_PROJECTION_MATRIX, projectionF);
glGetIntegerv(GL_VIEWPORT, viewportI);
loke.y = (float) viewportI[3] - loke.y;
float nearPlanex, nearPlaney, nearPlanez, farPlanex, farPlaney, farPlanez;
gluUnProject(loke.x, loke.y, 0, modelViewF, projectionF, viewportI, &nearPlanex, &nearPlaney, &nearPlanez);
gluUnProject(loke.x, loke.y, 1, modelViewF, projectionF, viewportI, &farPlanex, &farPlaney, &farPlanez);
float rayx = farPlanex - nearPlanex;
float rayy = farPlaney - nearPlaney;
float rayz = farPlanez - nearPlanez;
float rayLength = sqrtf((rayx*rayx)+(rayy*rayy)+(rayz*rayz));
//normalizing rayVector
rayx /= rayLength;
rayy /= rayLength;
rayz /= rayLength;
float collisionPointx, collisionPointy, collisionPointz;
for (int i = 0; i < 50; i++)
{
collisionPointx = rayx * rayLength/i*50;
collisionPointy = rayy * rayLength/i*50;
collisionPointz = rayz * rayLength/i*50;
}
}
There's a good chunk of my code. Yeah, I could have easily used a struct, but I was too mentally fat to do it at the time. That's something I could go back and fix later.
Anywho, the point is that when I output to the debugger using NSLog after I use gluUnProject, the near-plane and far-plane values aren't even close to accurate. In fact, they both relay the exact same results; not to mention, the first click always produces x, y, and z all equal to "nan".
Am I skipping over something extraordinarily important here?
There is no gluUnProject function in ES 2.0; what is this port that you are using? Also, there is no GL_MODELVIEW_MATRIX or GL_PROJECTION_MATRIX, which is most likely your problem.
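Since ES 2.0 has no fixed-function matrix stack, you have to unproject with the matrices you already upload as uniforms. A minimal sketch using GLKit's GLKMathUnproject (available since iOS 5; modelView, projection, viewWidth, and viewHeight are assumed to come from your own code):
#import <GLKit/GLKMath.h>

int viewport[4] = { 0, 0, (int)viewWidth, (int)viewHeight };
bool success = false;
// z = 0 unprojects onto the near plane, z = 1 onto the far plane.
GLKVector3 nearPt = GLKMathUnproject(GLKVector3Make(loke.x, viewport[3] - loke.y, 0.0f),
                                     modelView, projection, viewport, &success);
GLKVector3 farPt = GLKMathUnproject(GLKVector3Make(loke.x, viewport[3] - loke.y, 1.0f),
                                     modelView, projection, viewport, &success);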

How to program smooth movement with the accelerometer like a labyrinth game on iPhone OS?

I want to be able to make an image move realistically with the accelerometer controlling it, like in any labyrinth game. Below is what I have so far, but it seems very jittery and isn't realistic at all. The ball image never seems to be able to stop, and it makes lots of jittery movements all over the place.
- (void)accelerometer:(UIAccelerometer *)accelerometer didAccelerate:(UIAcceleration *)acceleration {
deviceTilt.x = 0.01 * deviceTilt.x + (1.0 - 0.01) * acceleration.x;
deviceTilt.y = 0.01 * deviceTilt.y + (1.0 - 0.01) * acceleration.y;
}
-(void)onTimer {
ballImage.center = CGPointMake(ballImage.center.x + (deviceTilt.x * 50), ballImage.center.y + (deviceTilt.y * 50));
if (ballImage.center.x > 279) {
ballImage.center = CGPointMake(279, ballImage.center.y);
}
if (ballImage.center.x < 42) {
ballImage.center = CGPointMake(42, ballImage.center.y);
}
if (ballImage.center.y > 419) {
ballImage.center = CGPointMake(ballImage.center.x, 419);
}
if (ballImage.center.y < 181) {
ballImage.center = CGPointMake(ballImage.center.x, 181);
}
}
Is there some reason why you cannot use the smoothing filter provided in response to your previous question: How do you use a moving average to filter out accelerometer values in iPhone OS?
You need to calculate the running average of the values. To do this you need to store the last n values in an array, adding the newest reading and dropping the oldest whenever you read the accelerometer data. Here is some pseudocode:
const SIZE = 10;
float[] xVals = new float[SIZE];
float xAvg = 0;
function runAverage(float newX){
xAvg += newX/SIZE;
xVals.push(newX);
if(xVals.length > SIZE){
xAvg -= xVals.shift()/SIZE; // shift() removes the oldest value, not the newest
}
}
You need to do this for all three axis. Play around with the value of SIZE; the larger it is, the smoother the value, but the slower things will seem to respond. It really depends on how often you read the accelerometer value. If it is read 10 times per second, then SIZE = 10 might be too large.
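A concrete version of the same filter in C++ (a sketch; the class name is invented), using a ring buffer so nothing needs to be pushed or popped:
#include <array>
#include <cstddef>

// Moving average over the last N samples; keep one instance per axis.
template <std::size_t N>
class MovingAverage {
    std::array<float, N> samples{}; // last N readings, zero-initialized
    std::size_t index = 0;          // slot holding the oldest sample
    float sum = 0.0f;
public:
    float add(float value) {
        sum += value - samples[index]; // swap the oldest sample out of the sum
        samples[index] = value;
        index = (index + 1) % N;
        return sum / N;                // smoothed value
    }
};
Call add() from didAccelerate with each new reading, and move the ball in onTimer using the returned smoothed value.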