How do I know in advance that the buffer size is enough in nanopb?

I'm trying to use nanopb, following the example:
https://github.com/nanopb/nanopb/blob/master/examples/simple/simple.c
the buffer size is initialized to 128:
uint8_t buffer[128];
My question is: how do I know (in advance) that this 128-byte buffer is enough to transmit my message? How do I decide a proper buffer size (large enough, but not wasting too much by being over-large) before declaring it?
Looks like a noob question :), but thanks for your quick suggestions.

When possible, nanopb adds a define in the generated .pb.h file that has the maximum encoded size of a message. In the file examples/simple/simple.pb.h you'll find:
/* Maximum encoded size of messages (where known) */
#define SimpleMessage_size 11
You could then declare uint8_t buffer[SimpleMessage_size];.
This define will only be generated if all repeated and string fields have the (nanopb).max_count and (nanopb).max_size options specified.
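For example, if a hypothetical message had a string and a repeated field, the bounds could be given in a .options file alongside the .proto (the message and field names here are made up):
# sample.options (hypothetical names)
MyMessage.name  max_size:40
MyMessage.items max_count:8
With those options set, the generator can compute and emit the MyMessage_size define.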
For many practical purposes, you can pick a buffer size that you estimate will be large enough and handle error conditions. It is also possible to use pb_get_encoded_size() to calculate the encoded size and dynamically allocate storage, but in general that is not a great solution in embedded applications. When total system memory is limited, it is often better to have a constant-sized buffer that you can test with, instead of having the available amount of dynamic memory vary at runtime.
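As a sketch (based on the fields in the simple.proto example; error handling trimmed), the generated define and pb_get_encoded_size() can be used like this:
#include <pb_encode.h>
#include "simple.pb.h"

bool encode_message(void)
{
    /* Sized from the generated define, so it can never be too small. */
    uint8_t buffer[SimpleMessage_size];

    SimpleMessage message = SimpleMessage_init_zero;
    message.lucky_number = 13;

    /* Optionally compute the exact encoded size up front. */
    size_t encoded_size;
    if (!pb_get_encoded_size(&encoded_size, SimpleMessage_fields, &message))
        return false;

    pb_ostream_t stream = pb_ostream_from_buffer(buffer, sizeof(buffer));
    return pb_encode(&stream, SimpleMessage_fields, &message);
}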


What is the use of .range field when updating a buffer descriptor?

When I went to update a UNIFORM_BUFFER descriptor, I set up the buffer info:
VkDescriptorBufferInfo buffer_info;
buffer_info.buffer = /* SOME BUFFER */;
buffer_info.offset = 0;
buffer_info.range = 0; // I assume this doesn't do anything for this use
And then vkUpdateDescriptorSets().
I get the validation layer error:
VkDescriptorBufferInfo range is not VK_WHOLE_SIZE and is zero, which is not allowed. The Vulkan spec states: If range is not equal to VK_WHOLE_SIZE, range must be greater than 0
My question is: isn't the job of the buffer info to tell the shaders which buffer, and at what offset, to read for a particular descriptor set and binding? I didn't think the size of the buffer mattered, because that's generally how these things work: you specify a buffer and an offset, and then you read beyond that buffer in the shader at your peril.
Let's say I just write the wrong range, what would that do? If I write 32 and in the shader I access 64 bytes in, what happens? Is this argument just there for validation warnings?
Edit: I just want to clarify that the range argument can't mean how much of the buffer I want to copy; what I'm writing to is essentially a pointer. The actual writing of the buffer data is done in a buffer-to-buffer copy transfer.
A descriptor describes a (usually memory) resource being used by a shader in some capacity. Buffers do have a size, but a shader can use a subset of a buffer's memory range. The descriptor describes which portion of the buffer is being used.
If a descriptor should use the whole size of the buffer assigned to it (starting at offset), that's what VK_WHOLE_SIZE is for.
This allows you to have multiple uniform buffers provided by the same VkBuffer resource. You can even use dynamic uniform block descriptors to change the offset/range without changing the buffer binding itself. This is faster than switching descriptor sets, thus making it easier to provide per-object data.
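For illustration, a minimal sketch of such an update, where uniform_buffer, descriptor_set and device are hypothetical handles created elsewhere:
VkDescriptorBufferInfo buffer_info{};
buffer_info.buffer = uniform_buffer;  // hypothetical VkBuffer
buffer_info.offset = 0;
buffer_info.range  = VK_WHOLE_SIZE;   // everything from offset to the buffer's end

VkWriteDescriptorSet write{};
write.sType           = VK_STRUCTURE_TYPE_WRITE_DESCRIPTOR_SET;
write.dstSet          = descriptor_set;  // hypothetical
write.dstBinding      = 0;
write.descriptorCount = 1;
write.descriptorType  = VK_DESCRIPTOR_TYPE_UNIFORM_BUFFER;
write.pBufferInfo     = &buffer_info;

vkUpdateDescriptorSets(device, 1, &write, 0, nullptr);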
Let's say I just write the wrong range, what would that do?
If the range is smaller than the size of the uniform block specified in the shader, then you'll get a validation failure/undefined behavior.

Deploying a TinyML model that I created via Google Colab

When I compile the code (on Arduino), I get the following error:
8 bytes lost due to alignment. To avoid this loss, please make sure the tensor_arena is 16 bytes aligned.
constexpr int tensorArenaSize = 8 * 1024;
byte tensorArena[tensorArenaSize];
Can someone help me fix this problem?
For reasons unbeknownst to me, the compiler wants to make sure your large byte array is 16-byte aligned. Because of variables already declared above the two lines you included, it needs to "move forward" the large array by 8 bytes to make it start at an address on a 16-byte boundary. To fix the error (to me this should just be a warning), either add a dummy 8-byte variable before your large array, or move 8 bytes' worth of variables from before your large array to after it. In the first case you just lose 8 bytes of variable space.
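If the toolchain supports C++11 (recent Arduino cores do), another option (my suggestion, not from the answer above) is to request the alignment explicitly with alignas, so the array starts on a 16-byte boundary no matter what is declared before it:
constexpr int tensorArenaSize = 8 * 1024;
alignas(16) byte tensorArena[tensorArenaSize];  // starts on a 16-byte boundary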

A general-purpose warp-level std::copy-like function - what should it account for?

A C++ standard library implements std::copy (ignoring all sorts of wrappers, concept checks, etc.) with the following simple loop:
for (; __first != __last; ++__result, ++__first)
    *__result = *__first;
Now, suppose I want a general-purpose std::copy-like function for warps (not blocks; not grids) to use for collaboratively copying data from one place to another. Let's even assume for simplicity that the function takes pointers rather than an arbitrary iterator.
Of course, writing general-purpose code in CUDA is often a useless pursuit - since we might be sacrificing a lot of the benefit of using a GPU in the first place in favor of generality - so I'll allow myself some boolean/enum template parameters to possibly select between frequently-occurring cases, avoiding runtime checks. So the signature might be, say:
template <typename T, bool SomeOption, my_enum_t AnotherOption>
T* copy(
    T* __restrict__ destination,
    const T* __restrict__ source,
    size_t length
);
but for each of these cases I'm aiming for optimal performance (or optimal expected performance given that we don't know what other warps are doing).
Which factors should I take into consideration when writing such a function? Or in other words: Which cases should I distinguish between in implementing this function?
Notes:
This should target Compute Capability 3.0 or better (i.e. Kepler or newer microarchitectures)
I don't want to make a Runtime API memcpy() call. At least, I don't think I do.
Factors I believe should be taken into consideration:
Coalescing memory writes - ensuring that consecutive lanes in a warp write to consecutive memory locations (no gaps).
Type size vs memory transaction size I - if sizeof(T) is 1 or 2, and we have each lane write a single element, the entire warp would write less than 128 B, wasting part of the memory transaction. Instead, we should have each thread pack 2 or 4 input elements into a register, and write that.
Type size vs Memory transaction size II - For type sizes such that lcm(4, sizeof(T)) > 4, it's not quite clear what to do. How well does the compiler/the GPU handle writes when each lane writes more than 4 bytes? I wonder.
Slack due to reading multiple elements at a time - if each thread wishes to read 2 or 4 elements for each write, and to write 4-byte integers, we might have 1 or 2 elements at the beginning and the end of the input which must be handled separately.
Slack due to input address misalignment - the input is read in 32 B transactions (under reasonable assumptions); we thus have to handle the first elements, up to the first multiple of 32 B, and the last elements (after the last such multiple) differently.
Slack due to output address misalignment - the output is written in transactions of up to 128 B (or is it just 32 B?); we thus have to handle the first elements, up to the first multiple of this number, and the last elements (after the last such multiple) differently.
Whether or not T is trivially copy-constructible. But let's assume that it is.
But it could be that I'm missing some considerations, or that some of the above are redundant.
Factors I've been wondering about:
The block size (i.e. how many other warps are there)
The compute capability (given that it's at least 3)
Whether the source/target is in shared memory / constant memory
Choice of caching mode
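To make the head/body/tail handling from the factor list above concrete, here is a minimal sketch of my own (not a tuned implementation) for the sizeof(T) == 1 case: unaligned head and tail bytes are copied one by one, and the aligned middle is copied as 4-byte words so consecutive lanes issue consecutive, coalesced 4-byte writes:
#include <cstdint>
#include <cstddef>

// Warp-collaborative byte copy sketch. Assumes blockDim.x is a
// multiple of 32, so threadIdx.x % 32 is the lane within the warp.
__device__ void warp_copy_bytes(uint8_t* __restrict__ dst,
                                const uint8_t* __restrict__ src,
                                std::size_t length)
{
    const unsigned lane = threadIdx.x % 32;

    // Head: copy bytes until dst reaches a 4-byte boundary.
    std::size_t head = (4 - (reinterpret_cast<uintptr_t>(dst) & 3)) & 3;
    if (head > length) head = length;
    if (lane < head)
        dst[lane] = src[lane];
    dst += head; src += head; length -= head;

    // Body: full 4-byte words, but only if src is now also 4-byte
    // aligned; otherwise the loads would need a byte loop or a
    // shuffle-based realignment scheme.
    if ((reinterpret_cast<uintptr_t>(src) & 3) == 0) {
        std::size_t n_words = length / 4;
        const uint32_t* src32 = reinterpret_cast<const uint32_t*>(src);
        uint32_t* dst32 = reinterpret_cast<uint32_t*>(dst);
        for (std::size_t i = lane; i < n_words; i += 32)
            dst32[i] = src32[i];
        dst += n_words * 4; src += n_words * 4; length -= n_words * 4;
    }

    // Tail (and fallback when src stayed misaligned): byte-by-byte.
    for (std::size_t i = lane; i < length; i += 32)
        dst[i] = src[i];
}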

How to improve an OpenCL kernel that reads __global char* data efficiently?

Motion compensation between two images (3840×2160), block size 16.
The kernel divides the work as 3840 × 135 (135 = 2160/16); the work-group size is 64×1 or 128×1 (basically no difference).
Right now my kernel accesses global char data, but imagepos = src + mv.xy is not aligned, so I must read the chars one by one. I think there is a latency cost here; CodeXL also shows the kernel is not limited by GPRs. So I need to find a method to speed up the data reads. I would also like to know how to use local memory, even though each datum is only needed once.
Any suggestions will be appreciated.
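One option (a sketch of my own, not from the question; identifiers like src_stride and mv are assumptions) is to replace 16 one-byte reads with a single vloadN. vloadn only requires the pointer to be aligned to the element type, which for char is 1 byte, so the unaligned imagepos is fine:
__kernel void copy_block_row(__global const char* src,
                             __global char* dst,
                             int src_stride,   /* bytes per image row */
                             int2 mv)          /* motion vector */
{
    int x = get_global_id(0);   /* block column */
    int y = get_global_id(1);   /* row within the 3840 x 135 layout */

    __global const char* imagepos =
        src + (y + mv.y) * src_stride + (x * 16 + mv.x);

    /* One wide load replaces 16 scalar loads, even when imagepos
       is not 16-byte aligned. */
    char16 row = vload16(0, imagepos);
    vstore16(row, 0, dst + y * src_stride + x * 16);
}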

What mechanism detects accesses of unallocated memory?

From time to time, I'll have an off-by-one error like the following:
unsigned int* x = calloc(2000, sizeof(unsigned int));
printf("%d", x[2000]);
I've gone beyond the end of the allocated region, so I get an EXC_BAD_ACCESS signal at runtime. My question is: how is this detected? It seems like this would just silently return garbage, since I'm only off by a single element and not, say, a full page. What part of the system prevents me from just getting back the garbage value at x + 2000?
The memory system has sentinel values at the beginning and end of its memory fields, beyond your allocated bytes. When you free the memory, it checks to see if those values are intact. If not, it tells you.
Perhaps you are just lucky because you are using 2000 as a size. Depending on the size of int, the total size is divisible by 32 or 64, so chances are high that the end of your array really coincides with the end of the underlying allocation. Try with some odd number of bytes (better, use a char array for that) and see if your system still detects it.
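A quick sketch of that experiment (allocation size chosen arbitrarily):
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    char* x = malloc(1001);   /* odd size: the allocator may round the block up */
    printf("%d\n", x[1001]);  /* one byte out of bounds; often no signal at all */
    free(x);                  /* a checking allocator might only complain here */
    return 0;
}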
In any case, you shouldn't rely on finding these bugs this way. Always use Valgrind or a similar tool to check your memory accesses.