Working with unsigned char: how to replace elements without using a loop? - objective-c

I'm developing an application that should use very few resources and be very fast. In my app I use an unsigned char* rawData that contains the bytes of an image. In this rawData array I have to keep some bytes and set the others to zero. But I'm not permitted to use any loop (otherwise I could just run through each byte and set it to zero).
So here are questions.
Q1) Is there any method in Objective-C like ZeroMemory from the Windows API?
Q2) Are there any other ways to set the necessary bytes to zero without using any loop?
Thanks in advance...
P.S. I can provide some code if necessary...

If you don't know the size of the buffer, you can't do it without a loop. Even if you don't write the loop yourself, calling something like strlen will result in a loop. I'm counting recursion as a loop here too.
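For the case where the size is known and you simply want every byte cleared, the standard C answer to Q1 is memset from <string.h> (ZeroMemory on Windows is essentially a macro around it), and it works fine from Objective-C. A minimal sketch, where rawDataLength is a placeholder for however you track the buffer's size:
#include <string.h>

/* One call zeroes the whole buffer; the loop still happens, but inside
   memset, which is typically a highly optimized vectorized routine. */
memset(rawData, 0, rawDataLength);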
How do you know which bytes to keep and which to set to zero? If these bytes are in known positions, you can use vector operations to zero out some of the bytes and not others. The following example zeros out only the even bytes over the first 64 bytes of rawData:
#include <emmintrin.h> // SSE2 intrinsics

__m128i zeros = _mm_setzero_si128();
// A byte is copied from zeros only when the high bit (0x80) of the
// corresponding mask byte is set, so 0x80 selects the even positions.
uint8_t mask[16] = {0x80, 0, 0x80, 0, 0x80, 0, 0x80, 0,
                    0x80, 0, 0x80, 0, 0x80, 0, 0x80, 0};
__m128i sse_mask = _mm_loadu_si128((const __m128i *)mask);
_mm_maskmoveu_si128(zeros, sse_mask, (char *)&rawData[0]);
_mm_maskmoveu_si128(zeros, sse_mask, (char *)&rawData[16]);
_mm_maskmoveu_si128(zeros, sse_mask, (char *)&rawData[32]);
_mm_maskmoveu_si128(zeros, sse_mask, (char *)&rawData[48]);
For each byte of mask whose high bit is 1, the corresponding byte of zeros is copied to rawData. You can use a sequence of these masked copies to quickly replace some bytes and not others. The resulting machine code uses SSE instructions, so this is actually quite fast. It's not required, but SSE operations run much faster if rawData is 16-byte aligned.
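If you control the allocation of rawData, here is a sketch of obtaining a 16-byte-aligned buffer on POSIX systems (bufferSize is a placeholder for your image's byte count):
#include <stdlib.h>

unsigned char *rawData = NULL;
/* posix_memalign returns memory whose address is a multiple of the
   requested alignment (16 bytes here), which suits SSE loads/stores. */
if (posix_memalign((void **)&rawData, 16, bufferSize) != 0) {
    /* handle allocation failure */
}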
Sorry if you're targeting ARM. I believe the NEON intrinsics are similar, but not identical.

Related

How to Make a Uniform Random Integer Generator from a Random Boolean Generator?

I have a hardware-based boolean generator that generates either 1 or 0 uniformly. How can I use it to make a uniform 8-bit integer generator? I'm currently using the collected booleans to build the binary string for the 8-bit integer. The generated integers aren't uniformly distributed; they follow the distribution explained on this page. Integers with the same number of 1's and 0's such as 85 (01010101) and -86 (10101010) have the highest chance of being generated, and integers with a lot of repeating bits such as 0 (00000000) and -1 (11111111) have the lowest chance.
Here's the page that I've annotated with probabilities for each possible 4-bit integer. We can see that they're not uniform: 3, 5, 6, -7, -6, and -4, which have the same number of 1's and 0's, have a 6/16 probability, while 0 and -1, all of whose bits are the same, have only a 1/16 probability.
And here's my implementation in Kotlin.
Based on your edit, there appears to be a misunderstanding here. By "uniform 4-bit integers", you seem to have the following in mind:
Start at 0.
Generate a random bit. If it's 1, add 1, and otherwise subtract 1.
Repeat step 2 three more times.
Output the resulting number.
Although the random bit generator may generate bits where each outcome is as likely as the other, and each 4-bit chunk may be just as likely as any other to be generated, the number of 1's in each chunk (and thus the result of the walk) is not uniformly distributed.
What range of integers do you want? Say you're generating 4-bit integers. Do you want a range of [-4, 4], as in the 4-bit random walk in your question, or do you want a range of [-8, 7], which is what you get when you treat a 4-bit chunk of bits as a two's complement integer?
If the former, the random walk won't generate a uniform distribution, and you will need to tackle the problem in a different way.
In this case, to generate a uniform random number in the range [-4, 4], do the following (a C sketch follows the steps):
Take 4 bits from the random bit generator and treat them as an integer in [0, 15];
If the integer is greater than 8, go to step 1.
Subtract 4 from the integer and output it.
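A minimal sketch of this loop in C, assuming a bit() function (a placeholder name) that returns the hardware generator's next uniform bit:
/* Returns a uniform integer in [-4, 4] by rejection sampling. */
int uniform_minus4_to_4(void) {
    for (;;) {
        /* Concatenate 4 bits into an integer in [0, 15]. */
        int v = (bit() << 3) | (bit() << 2) | (bit() << 1) | bit();
        if (v <= 8)        /* accept only [0, 8]: 9 equally likely values */
            return v - 4;  /* shift into [-4, 4] */
        /* v in [9, 15]: reject and try again */
    }
}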
This algorithm uses rejection sampling, but is variable-time (thus is not appropriate whenever timing differences can be exploited in a security attack). Numbers in other ranges are similarly generated, but the details are too involved to describe in this answer. See my article on random number generation methods for details.
Based on the code you've shown me, your approach to building up bytes, ints, and longs is highly error-prone. For example, a better way to build up an 8-bit byte to achieve what you want is as follows (keeping in mind that I am not very familiar with Kotlin, so the syntax may be wrong):
var b = 0
for (i in 0 until 8) {
    b = b shl 1                      // Shift old bits
    if (bitStringBuilder[i] == '1') {
        b = b or 1                   // Set new bit
    }                                // else: leave the new bit as 0
}
value = b.toByte() as T
Also, if MediatorLiveData is not thread safe, then neither is your approach to gathering bits using a StringBuilder (especially because StringBuilder is not thread safe).
The approach you suggest, combining eight bits of the boolean generator to make one uniform integer, will work in theory. However, in practice there are several issues:
You don't mention what kind of hardware it is. In most cases, the hardware is unlikely to generate uniformly random Boolean bits unless it is a so-called true random number generator designed for this purpose. For example, the hardware might generate uniformly distributed bits but have periodic behavior.
Entropy means how hard it is to predict the values a generator produces, compared to ideal random values. For example, a 64-bit data block with 32 bits of entropy is as hard to predict as an ideal random 32-bit data block. Characterizing a hardware device's entropy (or ability to produce unpredictable values) is far from trivial. Among other things, this involves entropy tests that have to be done across the full range of operating conditions suitable for the hardware (e.g., temperature, voltage).
Most hardware cannot produce uniform random values, so usually an additional step, called randomness extraction, entropy extraction, unbiasing, whitening, or deskewing, is done to transform the values the hardware generates into uniformly distributed random numbers. However, it works best if the hardware's entropy is characterized first (see previous point).
Finally, you still have to test whether the whole process delivers numbers that are "adequately random" for your purposes. There are several statistical tests that attempt to do so, such as NIST's Statistical Test Suite or TestU01.
For more information, see "Nondeterministic Sources and Seed Generation".
After your edits to this page, it seems you're going about the problem the wrong way. To produce a uniform random number, you don't add uniformly distributed random bits (e.g., bit() + bit() + bit()), but concatenate them (e.g., (bit() << 2) | (bit() << 1) | bit()). However, again, this will work in theory, but not in practice, for the reasons I mention above.

Finding the correct alignment for host visible memory

ppData points to a pointer in which is returned a host-accessible pointer to the beginning of the mapped range. This pointer minus offset must be aligned to at least VkPhysicalDeviceLimits::minMemoryMapAlignment.
I want to allocate a Vec3 float in a uniform buffer. A Vec3 float is 12 bytes big.
VkMemoryRequirements { size: 16, alignment: 16, memory_type_bits: 15 }
Vulkan reports that it has to be aligned to 16 bytes, which means that the size of the allocation is now 16 instead of 12. So Vulkan already handled this for me.
minMemoryMapAlignment on my GPU is 64 bytes. What exactly does this mean for my allocation? Does this mean that I cannot use the size from VkMemoryRequirements for my allocation, and instead of allocating 16 bytes here, I would have to allocate 64 bytes?
Update:
For a 12-byte allocation with a 16-byte alignment and a minMemoryMapAlignment of 64 bytes, I would still allocate only 16 bytes and then call:
vkMapMemory(device, memory, 0, 16, 0, &mapped);
But the pointer returned from vkMapMemory actually points at a region that is 64 bytes wide rather than 16? And all the relevant data is in the first 12 bytes, and the rest is just "padded" memory? So in practice this basically means that I don't need to use minMemoryMapAlignment at all?
There is nothing in the spec that restricts the size of the allocation like that. The paragraph you quoted means that the mapping will be aligned to minMemoryMapAlignment, and you can then tell the compiler to use aligned memory accesses when accessing it. What will happen is that when the memory is mapped, the trailing 48 bytes are wasted space in the host's address space. That is unlikely to matter, though.
This is why people keep saying to allocate larger blocks and subdivide them as needed. That way you can put 4 of those VkBuffers into a single 64-byte allocation (which you will need if you want to pipeline the rendering).
It's highly unlikely that that single vec3 is the only thing you need memory for, so take a look at your other allocations and see which ones you can combine.
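A minimal sketch of that suballocation pattern, assuming the four buffers already exist and memoryTypeIndex was already chosen from the memory requirements (both names are placeholders):
// One 64-byte allocation backing four 16-byte-aligned sub-ranges.
VkMemoryAllocateInfo allocInfo = {
    .sType = VK_STRUCTURE_TYPE_MEMORY_ALLOCATE_INFO,
    .allocationSize = 64,
    .memoryTypeIndex = memoryTypeIndex,
};
VkDeviceMemory memory;
vkAllocateMemory(device, &allocInfo, NULL, &memory);

// Bind each buffer at an offset that satisfies its 16-byte alignment.
for (uint32_t i = 0; i < 4; i++)
    vkBindBufferMemory(device, buffers[i], memory, i * 16);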

SCrypt Lookup Gap Negative Effect

I'm developing a Litecoin miner for a processor that has only 32 KB of internal memory. So I was looking at the scrypt algorithm; Litecoin uses N = 1024, which gives approximately 2^10 * 1 * 128 = 128 KB of memory use.
So I was looking into GPU algorithms that have the lookup-gap parameter. For reading I'm using the Kepler kernel code from CudaMiner:
https://github.com/cbuchner1/CudaMiner/blob/master/kepler_kernel.cu (Line 535)
So I understand that the lookup gap is a tradeoff between CPU and memory: the higher it is, the higher my CPU use and the lower my memory use. What I didn't understand is how exactly it works.
In the code I have
int pos = c_N_1/LOOKUP_GAP, loop = 1 + (c_N_1-pos*LOOKUP_GAP);
That will make it look up the scratchpad only at every LOOKUP_GAP-th entry (if it's 2, entries 0, 2, 4, 6, 8, 10), but where is the extra CPU use of the algorithm?
My implementation will not be highly optimized; it's more of a "just get it running" attempt.
I also saw an FPGA implementation that uses interpolation (https://github.com/kramble/FPGA-Litecoin-Miner). This is stranger to me: I don't know how they could interpolate the values in the scratchpad.
Thanks!
The increased CPU usage comes when you do not hit a pre-calculated entry. With LOOKUP_GAP 2 you still calculate entries 0-1023, but only store 0, 2, 4, etc. So if you need the data for scratchpad entry 3, you have to calculate it on the fly from the data for entry 2. This is an extra calculation versus having them all stored permanently. As the lookup gap increases, the number of on-the-fly calculations you have to do increases.
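A rough sketch of that lookup path in C, where GAP, ENTRY_WORDS, scratchpad, and recompute_next() are all placeholder names standing in for the miner's real state and its scrypt round function:
#include <stdint.h>
#include <string.h>

/* Fetch scratchpad entry i when only every GAP-th entry is stored. */
void fetch_entry(uint32_t i, uint32_t *out) {
    uint32_t stored = i / GAP;  /* nearest stored entry at or below i */
    memcpy(out, &scratchpad[stored * ENTRY_WORDS],
           ENTRY_WORDS * sizeof(uint32_t));
    /* Recompute forward from the stored entry up to i; this loop is
       the extra CPU work that the lookup gap trades for memory. */
    for (uint32_t j = stored * GAP; j < i; j++)
        recompute_next(out);
}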

Strange bitwise operation with Bitmap row width, what does it mean? (And why)

Why is the developer adding the hex value 0x0000000F (15) and then ANDing with its bitwise NOT in this line:
size_t bytesPerRow = ((width * 4) + 0x0000000F) & ~0x0000000F;
He comments that "16 byte aligned is good"; what does he mean?
- (CGContextRef)createBitmapContext {
    CGRect boundingBox = CGPathGetBoundingBox(_mShape);
    size_t width = CGRectGetWidth(boundingBox);
    size_t height = CGRectGetHeight(boundingBox);
    size_t bitsPerComponent = 8;
    size_t bytesPerRow = ((width * 4) + 0x0000000F) & ~0x0000000F; // 16 byte aligned is good
ANDing with ~0x0000000F = 0xFFFFFFF0 (aka -16) rounds down to a multiple of 16, simply by resetting those bits that could make it anything other than a multiple of 16 (the 8's, 4's, 2's and 1's).
Adding 15 (0x0000000F) first makes it round up instead of down.
The purpose of size_t bytesPerRow = ((width * 4) + 0x0000000F) & ~0x0000000F; is to round the value up to a multiple of 16 bytes.
The goal is to set bytesPerRow to be the smallest multiple of 16 that is capable of holding a row of data. This is done so that a bitmap can be allocated where every row address is 16 byte aligned, i.e. a multiple of 16. There are many possible benefits to alignment, including optimizations that take advantage of it. Some APIs may also require alignment.
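A quick check of the rounding with a few illustrative widths:
#include <stdio.h>
#include <stddef.h>

int main(void) {
    size_t widths[] = {1, 3, 4, 5};
    for (int i = 0; i < 4; i++) {
        /* 4 bytes per pixel, rounded up to the next multiple of 16 */
        size_t bytesPerRow = ((widths[i] * 4) + 0x0F) & ~(size_t)0x0F;
        printf("width %zu -> bytesPerRow %zu\n", widths[i], bytesPerRow);
    }
    return 0;  /* prints 16, 16, 16, 32 */
}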
The code sets the 4 least significant bits to zero. If the value is an address, it will be on an even 16-byte boundary, "16 byte aligned".
This is a one's complement, so
~0x0000000F
becomes
0xFFFFFFF0
and ANDing it with another value clears the 4 least significant bits.
This is the kind of thing we used to do all the time "back in the day"!
He's adding 0xF and then masking out the lower 4 bits (& ~0xF) to make sure the value is rounded up. If he didn't add the 0xF, it would round down.

GLubyte / GLushort usage issue

I'm new to OpenGL ES. I'm trying to build a sphere without using any manuals or tutorials...
I have succeeded in achieving my goal. I can draw a sphere using TRIANGLE_STRIP, with the number of meridians/horizontals specified before drawing.
Everything works fine when I have fewer than 256 vertex indices. I tried to use GLushort instead of GLubyte, but the picture changed a lot.
GLubyte *Indices;
...
glBufferData(GL_ELEMENT_ARRAY_BUFFER, sizeof(GLubyte) * (meridians * (horizontals * 2 + 2)), Indices, GL_STATIC_DRAW);
...
Indices = malloc(sizeof(GLubyte) * (meridians * (horizontals * 2 + 2)));
That's where I change byte to short.
Current project on GitHub
What should I do?
Here are the pictures where I change byte to short
Looks like you forgot to change the following line:
glDrawElements(GL_TRIANGLE_STRIP, (meridians * (horizontals * 2 + 2)), GL_UNSIGNED_BYTE, 0);
This indicates that there are that many indices to render, and that each one is the size of an unsigned byte (most likely 8 bits; the actual size is platform specific, but it is very rarely anything other than 8 bits). However, you have filled the array with indices the size of unsigned shorts (probably 16 bits), so each of your numbers will be read twice: once as its "first" 8 bits and once as its "second" 8 bits (endianness determines whether the high or low half comes first). Since a lot of your indices (the majority?) are at most 255, there are going to be a lot of vertices that turn into 0, since the higher 8 bits are all 0. On top of that, you will only render half of your indices.
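A small illustration of that reinterpretation (assuming a little-endian machine, which covers typical OpenGL ES targets):
#include <stdio.h>
#include <stdint.h>

int main(void) {
    uint16_t indices[2] = {3, 258};       /* uploaded as GLushort data */
    uint8_t *bytes = (uint8_t *)indices;  /* how GL_UNSIGNED_BYTE reads it */
    for (int i = 0; i < 4; i++)
        printf("byte %d read as index: %u\n", i, bytes[i]);
    /* little-endian output: 3, 0, 2, 1 - each 16-bit index is split into
       two bogus 8-bit indices instead of being read as one value */
    return 0;
}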
So, you need to indicate to OpenGL that it should read these indices as unsigned shorts instead, by changing the above line to this:
glDrawElements(GL_TRIANGLE_STRIP, (meridians * (horizontals * 2 + 2)), GL_UNSIGNED_SHORT, 0);