Why am I getting such a large alignment memory requirement for an image? - vulkan

I create an image in Vulkan, and the memory requirements I get back report an alignment of 131072 (128 KiB). This seems like an enormous alignment, and I'm not sure why anything bigger than 128 or 256 would ever be needed. It's so big that my memory allocation algorithm can't even handle it, and will never be able to practically handle it given that each allocation of this strict an alignment will waste too much space. What's the reason behind this? Here is how I create the image:
VkImageCreateInfo image_create_info{};
image_create_info.sType = VK_STRUCTURE_TYPE_IMAGE_CREATE_INFO;
image_create_info.pNext = nullptr;
image_create_info.imageType = VK_IMAGE_TYPE_2D;
image_create_info.sharingMode = VK_SHARING_MODE_EXCLUSIVE;
image_create_info.samples = VkSampleCountFlagBits::VK_SAMPLE_COUNT_1_BIT;
image_create_info.queueFamilyIndexCount = 0;
image_create_info.extent.width = 1716;
image_create_info.extent.height = 1731;
image_create_info.extent.depth = 1;
image_create_info.usage = VkImageUsageFlagBits::VK_IMAGE_USAGE_SAMPLED_BIT;
image_create_info.tiling = VkImageTiling::VK_IMAGE_TILING_OPTIMAL;
image_create_info.initialLayout = VkImageLayout::VK_IMAGE_LAYOUT_UNDEFINED;
image_create_info.flags = 0;
image_create_info.mipLevels = 1;
image_create_info.format = VK_FORMAT_R8G8B8A8_UINT;
image_create_info.arrayLayers = 1;
VkImage vk_image;
VkResult result = vkCreateImage((VkDevice)VK::logicalDevice, &image_create_info, nullptr, &vk_image);
VkMemoryRequirements requirements;
vkGetImageMemoryRequirements(VK::logicalDevice, vk_image, &requirements);
Another interesting thing about the requirements returned by the function: with format VK_FORMAT_R8G8B8A8_UINT the memory size requirement is about 12 MB, which makes sense, but with VK_FORMAT_R8G8B8_UINT (so without the alpha channel) it reports only about 3 MB, a quarter of the size, rather than the three quarters I'd expect. Have I run into some sort of bug?
I know the dimensions of the image I created aren't powers of two, but surely that shouldn't lead to such strange behaviour, should it?

It's so big that my memory allocation algorithm can't even handle it and will never be able to practically handle it given that each allocation of this strict an alignment will waste too much space.
Then fix that.
Implementations are allowed to require all kinds of alignments, especially for optimally-tiled images. 128 KiB alignment is hardly unreasonable for an image. So your sub-allocator needs to be able to account for this.
As for "waste too much space," perhaps you should take another look at those numbers. The example texture must take up at least 1716 × 1731 × 4 = 11,881,584 bytes; 128 KiB is barely more than 1% of that storage. That's not a lot of waste.

Related

How do I know in advance that the buffer size is enough in nanopb?

I'm trying to use nanopb, following this example:
https://github.com/nanopb/nanopb/blob/master/examples/simple/simple.c
the buffer is declared with a size of 128:
uint8_t buffer[128];
My question is: how do I know (in advance) that this 128-byte buffer is enough to transmit my message? How do I decide on a proper buffer size (big enough, but not wasting too much by being over-large) before declaring it?
This probably looks like a noob question :), but thanks for your suggestions.
When possible, nanopb adds a define in the generated .pb.h file that has the maximum encoded size of a message. In the file examples/simple/simple.pb.h you'll find:
/* Maximum encoded size of messages (where known) */
#define SimpleMessage_size 11
You could then declare uint8_t buffer[SimpleMessage_size];.
This define is only generated if all repeated and string fields have the (nanopb).max_count and (nanopb).max_size options specified, so that the maximum size is known.
For many practical purposes, you can pick a buffer size that you estimate will be large enough and handle the error if it isn't. It is also possible to use pb_get_encoded_size() to calculate the encoded size and dynamically allocate storage, but in general that is not a great solution in embedded applications. When total system memory is limited, it is often better to have a constant-sized buffer that you can test with, instead of having the available amount of dynamic memory vary at runtime.
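For reference, a quick sketch of the pb_get_encoded_size() route, assuming the SimpleMessage type from the linked example (will_fit is a name invented here):
#include <pb_encode.h>
#include "simple.pb.h"

bool will_fit(const SimpleMessage *msg, size_t buffer_size)
{
    size_t size = 0;
    /* pb_get_encoded_size() computes exactly how many bytes
     * pb_encode() would produce for this particular message. */
    if (!pb_get_encoded_size(&size, SimpleMessage_fields, msg))
        return false; /* message cannot be encoded at all */
    return size <= buffer_size;
}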

Does vkCmdCopyImageToBuffer work when source image uses VK_IMAGE_TILING_OPTIMAL?

I have read (after running into the limitation myself) that for copying data from the host to a VK_IMAGE_TILING_OPTIMAL VkImage, you're better off using a VkBuffer rather than a VkImage as the staging resource, to avoid restrictions on mipmap and layer counts. (Here and Here)
So, when it came to implementing a glReadPixels-esque piece of functionality to read the results of a render-to-texture back to the host, I thought that reading to a staging VkBuffer with vkCmdCopyImageToBuffer instead of using a staging VkImage would be a good idea.
However, I haven't been able to get it to work yet: I'm seeing most of the intended image, but with rectangular blocks in incorrect locations and even some parts duplicated.
There is a good chance that I've messed up my synchronization or layout transitions somewhere and I'll continue to investigate that possibility.
However, I couldn't figure out from the spec whether using vkCmdCopyImageToBuffer with an image source using VK_IMAGE_TILING_OPTIMAL is actually supposed to 'un-tile' the image, or whether I should actually expect to receive a garbled implementation-defined image layout if I attempt such a thing.
So my question is: Does vkCmdCopyImageToBuffer with a VK_IMAGE_TILING_OPTIMAL source image fill the buffer with linearly tiled data or optimally (implementation defined) tiled data?
Section 18.4 of the specification describes the layout of the data in the source/destination buffers relative to the image being copied from/to; this is outlined in the description of the VkBufferImageCopy struct. There is no language in this section that permits different behavior for tiled images.
The specification even has pseudocode for how copies work (this is for non-block-compressed images):
rowLength = region->bufferRowLength;
if (rowLength == 0)
    rowLength = region->imageExtent.width;

imageHeight = region->bufferImageHeight;
if (imageHeight == 0)
    imageHeight = region->imageExtent.height;

texelSize = <texel size taken from the src/dstImage>;

address of (x,y,z) = region->bufferOffset
                   + (((z * imageHeight) + y) * rowLength + x) * texelSize;

where x, y, z range from (0,0,0) to (region->imageExtent.width, height, depth).
The (x,y,z) triple is the location of the texel in the image. Since this address computation does not depend on the tiling of the image (as evidenced by the absence of any language saying it would), buffer/image copies work identically for both kinds of tiling.
Also note that this specification is shared between vkCmdCopyImageToBuffer and vkCmdCopyBufferToImage. As such, if a copy works one way, it must by necessity work the other.
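For reference, a sketch of such a copy (cmd, srcImage, dstBuffer, width, and height are placeholders, and srcImage is assumed to already be in TRANSFER_SRC_OPTIMAL layout with proper synchronization):
VkBufferImageCopy region{};
region.bufferOffset      = 0;
region.bufferRowLength   = 0; // 0 means tightly packed: rowLength = imageExtent.width
region.bufferImageHeight = 0; // 0 means imageHeight = imageExtent.height
region.imageSubresource.aspectMask     = VK_IMAGE_ASPECT_COLOR_BIT;
region.imageSubresource.mipLevel       = 0;
region.imageSubresource.baseArrayLayer = 0;
region.imageSubresource.layerCount     = 1;
region.imageOffset = {0, 0, 0};
region.imageExtent = {width, height, 1};

// The driver linearizes the texels according to the addressing formula
// above, regardless of the image's tiling.
vkCmdCopyImageToBuffer(cmd, srcImage, VK_IMAGE_LAYOUT_TRANSFER_SRC_OPTIMAL,
                       dstBuffer, 1, &region);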

Storing bitmap data in an integer array for speed?

I am using Cocoa/Objective-C, and I am using NSBitmapImageRep's getPixel:atX:y: to test whether R is 0 or 255. That is the only piece of data I need (the bitmap is only black and white).
I am noticing that this one method is the biggest draw on CPU time in my application, accounting for something like 95% of the overhead. Would it be faster to preload the bitmap into a two-dimensional integer array
NSUInteger pixels[1280][1024];
and read the values like this instead?
if (pixels[x][y] != 0) {
    // ... do stuff
}
One thing that might help is converting the data into something more "dense". Since you're only interested in a single bit per pixel location, it doesn't make sense to store more than that. Storing more data than necessary means you get less use out of your cache, which can really slow things down if the image is big and/or the accesses are very random.
For instance, you could use the platform's largest "native" integer and pack the pixels in, one bit each. That makes access a bit more involved, since you need to do single-bit testing, but it might be a win.
You would do something like this:
uint32_t image[HEIGHT * ((WIDTH + 31) / 32)];
Then initialize this array by using the slow getter method, once per pixel. Then you can read out the value of a pixel using something like image[y * ((WIDTH + 31) / 32) + (x / 32)] & (1 << (x & 31)).
I'm being vague ("might", "can" and so on) since it really depends on your access pattern, the size of the image, and other things. You should probably test it.
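A minimal sketch of that packed layout (reusing the 1280x1024 dimensions from the question; the helper names are invented here):
#include <stdint.h>

#define WIDTH  1280
#define HEIGHT 1024
#define WORDS_PER_ROW ((WIDTH + 31) / 32)

static uint32_t image[HEIGHT * WORDS_PER_ROW]; // 1 bit per pixel

static inline void set_pixel(int x, int y)
{
    image[y * WORDS_PER_ROW + x / 32] |= 1u << (x & 31);
}

static inline int get_pixel(int x, int y)
{
    // shift-then-mask; equivalent to the masking expression above
    return (image[y * WORDS_PER_ROW + x / 32] >> (x & 31)) & 1;
}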
I'm not familiar with Objective-C or the NSBitmapImageRep class, but a reasonable guess is that the getPixel routine employs clipping to avoid reading outside of memory, which could be a possible slowdown (among other things).
Have a look inside it and see what it does.
(update)
Having learnt that this is Apple code, you probably can't take a look inside it.
However, the documentation for the NSBitmapImageRep class seems to indicate that getPixel:atX:y: performs at least some type magic. You could test whether the result is clipped by accessing a pixel outside the image boundary and observing the result.
The bitmapData method seems to be what you'd be interested in: get the pointer to the data, then read the array yourself, avoiding type conversion or clipping.

Read a series of images row by row or entire image for performance?

I'm writing an application that takes a series of exposures of a target and computes their average and saves the resultant image. This technique is used extensively in astrophotography to reduce noise in the final image. Basically, one computes the average at each pixel and writes out the value to the output file.
The number of exposures can be quite high, from 20 to 30 (sometimes even more), and with today's large CCD sensors the resolution, too, can be quite high. So the amount of data can be very very large.
My question is: when it comes to performance, should I read the images row by row (Method #1), or should I read every image in its entirety up front (Method #2)? Using the former method, I have to load the corresponding row from every image. So, if I have 10 images and I'm processing row #1, I read the first row from each image, compute their average, and write out the row.
With the latter method, I read all images in their entirety, compute the averages, and write out the entire image.
In theory, the latter method ought to be much faster but much more memory intensive. In practice, however, I've found that the difference in performance isn't great, and this was puzzling. Method #2 was at most only 2 to 3 seconds faster than Method #1, yet it used up to 1.3 GB of memory for 24 8-megapixel images, whereas Method #1 used at most 70 MB. On average, both methods take about 20 seconds to process 24 8-megapixel images.
I am writing this in Objective-C with a good amount of C thrown in when calling CFITSIO.
Here's Method #1:
pixelRows = (double**)malloc(self.numberOfImages * sizeof(double*)); // alloc. pixel array
for(i = 0; i < self.numberOfImages; i++)
{
    pixelRows[i] = (double*)malloc(width * sizeof(double));
}
apix = (double*)malloc(width * sizeof(double));
for(firstpix[1] = 1; firstpix[1] <= size[1]; firstpix[1]++)
{
    [self gatherRowsFromImages:firstpix[1] withRowWidth:theWidth thePixelMap:pixelRows];
    [self averageRows:pixelRows width:width theAveragedRow:apix];
    fits_write_pix(outfptr, TDOUBLE, firstpix, width, apix, &status);
    //NSLog(@"Row %ld written.", firstpix[1]);
}
fits_close_file(outfptr, &status);
NSLog(@"End");
if(!status)
{
    NSLog(@"File written successfully.");
}
for(i = 0; i < self.numberOfImages; i++)
{
    free(pixelRows[i]);
}
free(pixelRows);
free(apix);
Here's Method #2:
imageArray = (double**)malloc(files.count * sizeof(double*));
for(i = 0; i < files.count; i++)
{
    imageArray[i] = (double*)malloc(size[0] * size[1] * sizeof(double));
    fits_read_pix(fptr[i], TDOUBLE, firstpix, size[0] * size[1], NULL, imageArray[i], NULL, &status);
    //NSLog(@"%d", status);
}
int fileIndex;
NSLog(@"%lu", (unsigned long)files.count);
apix = (double*)malloc(size[0] * size[1] * sizeof(double));
for(i = 0; i < (size[0] * size[1]); i++)
{
    apix[i] = 0.0;
    for(fileIndex = 0; fileIndex < files.count; fileIndex++)
    {
        apix[i] = apix[i] + imageArray[fileIndex][i];
    }
    //NSLog(@"%f", apix[i]);
    apix[i] = apix[i] / files.count;
}
fits_create_file(&outfptr, [outPath UTF8String], &status);
fits_copy_header(fptr[0], outfptr, &status);
fits_write_pix(outfptr, TDOUBLE, firstpix, size[0] * size[1], apix, &status);
fits_close_file(outfptr, &status);
Any suggestions regarding this? Am I expecting too much of a gain by reading in every image in its entirety?
I would always go for the row-by-row approach, since it scales. It may also be faster because of the smaller memory footprint: nothing needs to be swapped out to disk just to make room for a memory-hungry tool.
Furthermore, to optimize the row-by-row approach, consider reading in 8 rows at a time (or some other number). E.g. JPEG is stored in 8x8 blocks, so reading fewer than 8 rows at once would be pointless. Of course this depends on the image format and the library you are using.
There are also other considerations regarding the CPU cache. Memory locations that are used frequently don't have to travel to "slow" main memory but can stay in cache, closer to the CPU. There are several levels of cache, and they vary in size per CPU type (the biggest is typically 8 or 16 MB at the time of writing).
Another thing to consider is the code that does the actual averaging. Tuning it can also gain a lot, especially for the kind of operation you're doing; look at SSE and related topics. Integer arithmetic will probably beat floating point, and using bit shifts for division might be faster than true division, though it only allows dividing by 2^n.
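A rough sketch of the chunked variant of Method #1 (CHUNK_ROWS, width, height, numberOfImages, and the already-open fptr[]/outfptr handles are assumptions, as are the usual <fitsio.h>/<stdlib.h>/<string.h> includes), reading several rows per fits_read_pix() call to amortize the per-call overhead:
enum { CHUNK_ROWS = 8 };
int status = 0;
long firstpix[2] = {1, 1};
double *chunk = (double *)malloc(CHUNK_ROWS * width * sizeof(double));
double *acc   = (double *)malloc(CHUNK_ROWS * width * sizeof(double));

for (long row = 1; row <= height; row += CHUNK_ROWS)
{
    long rows = (height - row + 1 < CHUNK_ROWS) ? height - row + 1 : CHUNK_ROWS;
    long n = rows * width;
    memset(acc, 0, n * sizeof(double));
    firstpix[1] = row;
    for (int img = 0; img < numberOfImages; img++)
    {
        // reads n consecutive pixels starting at this row, i.e. `rows` whole rows
        fits_read_pix(fptr[img], TDOUBLE, firstpix, n, NULL, chunk, NULL, &status);
        for (long j = 0; j < n; j++)
            acc[j] += chunk[j];
    }
    for (long j = 0; j < n; j++)
        acc[j] /= numberOfImages;
    fits_write_pix(outfptr, TDOUBLE, firstpix, n, acc, &status);
}
free(chunk);
free(acc);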

What are your favorite low-level code optimization tricks? [closed]

I know that you should only optimize when it is deemed necessary. But if it is deemed necessary, what are your favorite low-level (as opposed to algorithmic-level) optimization tricks?
For example: loop unrolling.
gcc -O2
Compilers do a much better job of it than you can.
Picking a power of two for filters, circular buffers, etc.
So very, very convenient.
-Adam
Why, bit twiddling hacks, of course!
One of the most useful in scientific code is replacing pow(x, 4) with x*x*x*x. pow() is almost always more expensive than multiplication. Another is hoisting loop-invariant work out of the loop, replacing
for (int i = 0; i < N; i++)
{
    z += x / y;
}
with
double denom = 1 / y;
for (int i = 0; i < N; i++)
{
    z += x * denom;
}
But my favorite low-level optimization is figuring out which calculations can be removed from a loop entirely. It's always faster to do a calculation once rather than N times. Depending on your compiler, some of these may be done for you automatically.
Inspect the compiler's output, then try to coerce it to do something faster.
I wouldn't necessarily call it a low-level optimization, but I have saved orders of magnitude more cycles through judicious application of caching than through all my applications of low-level tricks combined. Many of these methods are application-specific.
Having an LRU cache of database queries (or any other IPC based request).
Remembering the last failed database query and returning a failure if re-requested within a certain time frame.
Remembering your location in a large data structure to ensure that if the next request is for the same node, the search is free.
Caching calculation results to prevent duplicate work. In addition to more complex scenarios, this is often found in if or for statements.
CPUs and compilers are constantly changing. A low-level trick that made sense three CPU generations ago with a different compiler may actually be slower on the current architecture, and there is a good chance it will confuse whoever maintains the code in the future.
++i can be faster than i++, because it avoids creating a temporary.
Whether this still holds for modern C/C++/Java/C# compilers, I don't know. It might well be different for user-defined types with overloaded operators, whereas in the case of simple integers it probably doesn't matter.
But I've come to like the syntax... it reads like "increment i" which is a sensible order.
Using template metaprogramming to calculate things at compile time instead of at run-time.
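For instance, the classic compile-time factorial (any conforming C++ compiler folds this to a constant):
template <unsigned N>
struct Factorial
{
    static const unsigned long value = N * Factorial<N - 1>::value;
};

template <> // base case terminates the recursion
struct Factorial<0>
{
    static const unsigned long value = 1;
};

// Evaluated entirely at compile time; the binary just contains 3628800.
static const unsigned long f10 = Factorial<10>::value;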
Years ago with a not-so-smart compiler, I got great mileage from function inlining, walking pointers instead of indexing arrays, and iterating down to zero instead of up to a maximum.
When in doubt, a little knowledge of assembly will let you look at what the compiler is producing and attack the inefficient parts (in your source language, using structures friendlier to your compiler.)
Precalculating values.
For instance, instead of sin(a) or cos(a), if your application doesn't need angles to be very precise, maybe you represent angles as 1/256ths of a circle and create arrays of floats sine[] and cosine[], precalculating the sin and cos of those angles.
And, if you need a vector at some angle of a given length frequently, you might precalculate all those sines and cosines already multiplied by that length.
Or, to put it more generally, trade memory for speed.
Or, even more generally, "All programming is an exercise in caching" -- Terje Mathisen
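A minimal sketch of those tables, at the 1/256-of-a-circle resolution suggested above (names are invented here):
#include <math.h>

#define ANGLE_STEPS 256

static float sine_tab[ANGLE_STEPS], cosine_tab[ANGLE_STEPS];

void init_trig_tables(void)
{
    for (int i = 0; i < ANGLE_STEPS; i++)
    {
        double a = 2.0 * 3.14159265358979323846 * i / ANGLE_STEPS;
        sine_tab[i]   = (float)sin(a);
        cosine_tab[i] = (float)cos(a);
    }
}

// usage: an angle is an integer in [0, 255]; wrap-around is just a mask
// float s = sine_tab[angle & (ANGLE_STEPS - 1)];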
Some things are less obvious. For instance, when traversing a two-dimensional array, you might do something like
for (x = 0; x < maxx; x++)
    for (y = 0; y < maxy; y++)
        do_something(a[x][y]);
You might find the processor cache likes it better if you do:
for (y = 0; y < maxy; y++)
    for (x = 0; x < maxx; x++)
        do_something(a[x][y]);
or vice versa.
Don't do loop unrolling. Don't do Duff's device. Make your loops as small as possible, anything else inhibits x86 performance and gcc optimizer performance.
Getting rid of branches can be useful, though - so getting rid of loops completely is good, and those branchless math tricks really do work. Beyond that, try never to go out of the L2 cache - this means a lot of precalculation/caching should also be avoided if it wastes cache space.
And, especially for x86, try to keep the number of variables in use at any one time down. It's hard to tell what compilers will do with that kind of thing, but usually having less loop iteration variables/array indexes will end up with better asm output.
Of course, this is for desktop CPUs; a slow CPU with fast memory access could precalculate a lot more, but these days that might be an embedded system with little total memory anyway…
I've found that changing from pointer to indexed access can make a difference; the compiler has different instruction forms and register usages to choose from. And vice versa. This is extremely low-level and compiler-dependent, though, and only worthwhile when you need the last few percent.
E.g.
for (i = 0; i < n; ++i)
    *p++ = ...; // some complicated expression
vs.
for (i = 0; i < n; ++i)
    p[i] = ...; // some complicated expression
Optimizing cache locality - for example when multiplying two matrices that don't fit into cache.
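For instance, a minimal sketch of loop tiling for C = A * B (N, BLOCK, and the row-major layout are assumptions; C must be zero-initialized beforehand):
enum { N = 1024, BLOCK = 64 }; // BLOCK chosen so three tiles fit in cache

void matmul_tiled(const float *A, const float *B, float *C)
{
    for (int i0 = 0; i0 < N; i0 += BLOCK)
        for (int k0 = 0; k0 < N; k0 += BLOCK)
            for (int j0 = 0; j0 < N; j0 += BLOCK)
                // each pass works on BLOCK x BLOCK tiles that stay cache-resident
                for (int i = i0; i < i0 + BLOCK; i++)
                    for (int k = k0; k < k0 + BLOCK; k++)
                    {
                        float a = A[i * N + k];
                        for (int j = j0; j < j0 + BLOCK; j++)
                            C[i * N + j] += a * B[k * N + j];
                    }
}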
Allocating with new on a pre-allocated buffer using C++'s placement new.
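A minimal sketch (Widget is a stand-in type invented here):
#include <new>

struct Widget { int x; };

void demo()
{
    alignas(Widget) unsigned char storage[sizeof(Widget)];
    Widget *w = new (storage) Widget{42}; // constructs inside `storage`, no heap allocation
    // ... use *w ...
    w->~Widget(); // destroy explicitly; never `delete` a placement-new'd object
}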
Counting down a loop. It's cheaper to compare against 0 than N:
for (i = N; --i >= 0; ) ...
Shifting and masking by powers of two are cheaper than division and remainder (/ and %):
#define WORD_LOG 5
#define SIZE (1 << WORD_LOG)
#define MASK (SIZE - 1)

uint32_t bits[K]; // K 32-bit words of bitmap storage

void set_bit(unsigned i)
{
    bits[i >> WORD_LOG] |= (1u << (i & MASK));
}
Edit:
(i >> WORD_LOG) == (i / SIZE) and (i & MASK) == (i % SIZE), because SIZE is 32 = 2^5.
Jon Bentley's Writing Efficient Programs is a great source of low- and high-level techniques -- if you can find a copy.
Eliminating branches (if/elses) by using boolean math:
if(x == 0)
x = 5;
// becomes:
x += (x == 0) * 5;
// if '5' were instead a power of two, say 4:
x += (x == 0) << 2;
// divide by 2 if flag is set
sum >>= (blendMode == BLEND);
This REALLY speeds things up, especially when those ifs are in a loop or somewhere that is called a lot.
The one from Assembler:
xor ax, ax
instead of:
mov ax, 0
Classical optimization for program size and performance.
In SQL, if you only need to know whether any data exists or not, don't bother with COUNT(*):
SELECT 1 FROM table WHERE some_primary_key = some_value
If your WHERE clause is likely to return multiple rows, add a LIMIT 1 too.
(Remember that databases can't see what your code's doing with their results, so they can't optimise these things away on their own!)
Recycling the frame pointer as an ordinary register
The Pascal calling convention
Stack-frame-rewriting tail-call optimization (although it sometimes messes with the above)
Using vfork() instead of fork() before exec()
And one I am still looking for an excuse to use: data-driven code generation at runtime
Liberal use of __restrict to eliminate load-hit-store stalls.
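A minimal sketch of the idea (__restrict is a common compiler extension; C99 spells it restrict):
// Promising that dst and src never alias lets the compiler keep values
// in registers instead of reloading src after every store to dst.
void scale(float *__restrict dst, const float *__restrict src, float k, int n)
{
    for (int i = 0; i < n; i++)
        dst[i] = src[i] * k;
}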
Rolling up loops.
Seriously, the last time I needed to do anything like this was in a function that took 80% of the runtime, so it was worth trying to micro-optimize if I could get a noticeable performance increase.
The first thing I did was to roll up the loop. This gave me a very significant speed increase. I believe this was a matter of cache locality.
The next thing I did was add a layer of indirection, and put some more logic into the loop, which allowed me to only loop through the things I needed. This wasn't as much of a speed increase, but it was worth doing.
If you're going to micro-optimize, you need to have a reasonable idea of two things: the architecture you're actually using (which is vastly different from the systems I grew up with, at least for micro-optimization purposes), and what the compiler will do for you.
A lot of the traditional micro-optimizations trade space for time. Nowadays, using more space increases the chances of a cache miss, and there goes your performance. Moreover, a lot of them are now done by modern compilers, and typically better than you're likely to do them.
Currently, you should (a) profile to see if you need to micro-optimize, and then (b) try to trade computation for space, in the hope of keeping as much as possible in cache. Finally, run some tests, so you know if you've improved things or screwed them up. Modern compilers and chips are far too complex for you to keep a good mental model, and the only way you'll know if some optimization works or not is to test.
In addition to Joshua's comment about code generation (a big win), and other good suggestions, ...
I'm not sure if you would call it "low-level", but (and this is downvote-bait) 1) stay away from using any more levels of abstraction than absolutely necessary, and 2) stay away from event-driven notification-style programming, if possible.
If a computer executing a program is like a car running a race, a method call is like a detour. That's not necessarily bad, except there's a strong temptation to nest those things, because once you've written a method call, you tend to forget what that call could cost you.
If you're relying on events and notifications, it's because you have multiple data structures that need to be kept in agreement. This is costly, and should only be done if you can't avoid it.
In my experience, the biggest performance killers are too much data structure and too much abstraction.
I was amazed at the speedup I got from restructuring a for loop that adds numbers together in structs:
#include <stdlib.h>

const unsigned long SIZE = 100000000;

typedef struct {
    int a;
    int b;
    int result;
} addition;

addition *sum;

void start() {
    unsigned long byte_count = SIZE * sizeof(addition);
    sum = malloc(byte_count);
    unsigned int i = 0;
    if (i < SIZE) {
        do {
            sum[i].a = i;
            sum[i].b = i;
            i++;
        } while (i < SIZE);
    }
}

void test_func() {
    unsigned int i = 0;
    if (i < SIZE) { // this is about 30% faster than the more obvious for loop, even with -O3
        do {
            addition *s1 = &sum[i];
            s1->result = s1->b + s1->a;
            i++;
        } while (i < SIZE);
    }
}

void finish() {
    free(sum);
}
Why doesn't gcc optimise for loops into this? Or is there something I missed? Some cache effect?