Machine code alignment - optimization

I am trying to understand the principles of machine code alignment. I have an assembler implementation which can generate machine code at run-time. I use 16-byte alignment on every branch destination, but it looks like that is not the optimal choice, since I've noticed that if I remove the alignment, the same code sometimes runs faster. I think it has something to do with the cache line width, so that some instructions get split across a cache line boundary and the CPU stalls because of that. So if a few alignment bytes are inserted in one place, they will push later instructions across a cache line boundary...
I was hoping to implement an automatic alignment procedure which can process the code as a whole and insert alignment according to the specification of the CPU (cache line width, 32/64 bits and so on)...
Can someone give some hints about this procedure? As an example, the target CPU could be an Intel Core i7 on a 64-bit platform.
Thank you.

I'm not qualified to answer your question because this is such a vast and complicated topic. There are probably many more mechanisms in play here, other than cache line size.
However, I would like to point you to Agner Fog's site and the optimization manuals for compiler makers that you can find there. They contain a plethora of information on these kinds of subjects - cache lines, branch prediction and data/code alignment.

Paragraph (16-byte) alignment is usually the best. However, it can force some "local" JMP instructions to no longer be local (due to code size bloat), and it may also result in less code fitting in the cache. I would only align major segments of code; I would not align every tiny subroutine/JMP section.
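If you do keep alignment at code-generation time, the padding is usually emitted with the multi-byte NOP encodings recommended for modern Intel/AMD CPUs rather than runs of single-byte 0x90. Here is a minimal C sketch of such a padding routine for a run-time assembler; emit_align, buf and pos are hypothetical names for your code buffer and write position:

#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Intel-recommended multi-byte NOP encodings, lengths 1..8 bytes. */
static const uint8_t nops[8][8] = {
    {0x90},                                    /* 1: nop */
    {0x66,0x90},                               /* 2: 66 nop */
    {0x0F,0x1F,0x00},                          /* 3: nop dword [rax] */
    {0x0F,0x1F,0x40,0x00},                     /* 4: nop dword [rax+0] */
    {0x0F,0x1F,0x44,0x00,0x00},                /* 5: nop dword [rax+rax+0] */
    {0x66,0x0F,0x1F,0x44,0x00,0x00},           /* 6 */
    {0x0F,0x1F,0x80,0x00,0x00,0x00,0x00},      /* 7: disp32 form */
    {0x0F,0x1F,0x84,0x00,0x00,0x00,0x00,0x00}, /* 8: disp32 + SIB form */
};

/* Pad the code buffer so the next emitted instruction starts on an
   'align'-byte boundary (align must be a power of two, e.g. 16). */
static size_t emit_align(uint8_t *buf, size_t pos, size_t align)
{
    size_t pad = (align - (pos & (align - 1))) & (align - 1);
    while (pad) {
        size_t n = pad > 8 ? 8 : pad;   /* emit the longest NOP first */
        memcpy(buf + pos, nops[n - 1], n);
        pos += n;
        pad -= n;
    }
    return pos;
}

A fancier version would also check, as discussed above, whether the padding pushes a hot loop body or branch target across an extra cache line before deciding to insert it.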

Not an expert, however... Branches to places that are not already in the instruction cache should benefit from alignment the most, because you'll read a whole cache line of instructions to fill the pipeline. Given that, forward branches will benefit on the first run of a function. Backward branches ("for" and "while" loops, for example) will probably not benefit, because the branch target and the following instructions have already been read into the cache. Do follow the links in Martin's answer.

As mentioned previously, this is a very complex area, and Agner Fog seems like a good place to visit. As to the complexities: I ran across Torbjörn Granlund's article "Improved Division by Invariant Integers", and in the code he uses to illustrate his new algorithm, the first instruction at (I guess) the main label is nop - no operation. According to the commentary it improves performance significantly. Go figure.

When to use VK_IMAGE_LAYOUT_GENERAL

It isn't clear to me when it's a good idea to use VK_IMAGE_LAYOUT_GENERAL as opposed to transitioning to the optimal layout for whatever action I'm about to perform. Currently, my policy is to always transition to the optimal layout.
But VK_IMAGE_LAYOUT_GENERAL exists. Maybe I should be using it when I'm only going to use a given layout for a short period of time.
For example, right now, I'm writing code to generate mipmaps using vkCmdBlitImage. As I loop through the sub-resources performing the vkCmdBlitImage commands, should I transition to VK_IMAGE_LAYOUT_TRANSFER_DST_OPTIMAL as I scale down into a mip, then transition to VK_IMAGE_LAYOUT_TRANSFER_SRC_OPTIMAL when I'll be the source for the next mip before finally transitioning to VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL when I'm all done? It seems like a lot of transitioning, and maybe generating the mips in VK_IMAGE_LAYOUT_GENERAL is better.
I appreciate the answer might be to measure, but it's hard to measure on all my target GPUs (especially because I haven't got anything running on Android yet) so if anyone has any decent rule of thumb to apply it would be much appreciated.
FWIW, I'm writing Vulkan code that will run on desktop GPUs and Android, but I'm mainly concerned about performance on the latter.
You would use it when:
1. You are lazy.
2. You need to map the memory to the host (unless you can use PREINITIALIZED).
3. You use the image as multiple incompatible attachments and have no choice.
4. For storage images.
5. Other cases where you would otherwise switch layouts too often (and you don't even need barriers) relative to the work done on the images. Measurement is needed to confirm GENERAL is better in that case; most likely a premature optimization even then.
PS: You could transition all the mip-maps together to TRANSFER_DST with a single command beforehand, and then only the one you need to SRC. With a decent HDD, it may even be best to store the images with mip-maps already generated, if that's an option (and perhaps even get better quality using some sophisticated algorithm).
PS2: Too bad there's no mip-map creation command. vkCmdBlitImage most likely does something like it under the hood anyway for images smaller than half resolution...
If you read from the mipmap[n] image to create the mipmap[n+1] image, then you should use the transfer image layouts if you want your code to run on all Vulkan implementations and get the most performance across all of them, as the layouts may be used by the GPU to optimize the image for reads or writes.
So if you want to go cross-vendor, only use VK_IMAGE_LAYOUT_GENERAL for setting up the descriptor that uses the final image, and not for image reads or writes.
If you don't want to use that many transitions you may copy from a buffer instead of an image, though you obviously wouldn't get the format conversion, scaling and filtering that vkCmdBlitImage does for you for free.
Also don't forget to check if the target format actually supports the BLIT_SRC or BLIT_DST bits. This is independent of whether you use the transfer or general layout for copies.
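For reference, here is a hedged C sketch of the per-level flow discussed above, against the standard Vulkan API. generate_mips and its parameters are placeholder names; it assumes level 0 already contains the full-resolution data in TRANSFER_DST_OPTIMAL, that the image was created with TRANSFER_SRC and TRANSFER_DST usage, and it leaves out the final transitions to SHADER_READ_ONLY_OPTIMAL and the format-support checks mentioned above:

#include <vulkan/vulkan.h>

/* Hedged sketch: generate the mip chain by blitting level i-1 into level i. */
static void generate_mips(VkCommandBuffer cmd, VkImage image,
                          int32_t width, int32_t height, uint32_t mipLevels)
{
    VkImageMemoryBarrier barrier = {0};
    barrier.sType = VK_STRUCTURE_TYPE_IMAGE_MEMORY_BARRIER;
    barrier.image = image;
    barrier.srcQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
    barrier.dstQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
    barrier.subresourceRange.aspectMask = VK_IMAGE_ASPECT_COLOR_BIT;
    barrier.subresourceRange.levelCount = 1;
    barrier.subresourceRange.layerCount = 1;

    int32_t w = width, h = height;
    for (uint32_t i = 1; i < mipLevels; ++i) {
        /* Level i-1: TRANSFER_DST -> TRANSFER_SRC so the blit can read it. */
        barrier.subresourceRange.baseMipLevel = i - 1;
        barrier.oldLayout = VK_IMAGE_LAYOUT_TRANSFER_DST_OPTIMAL;
        barrier.newLayout = VK_IMAGE_LAYOUT_TRANSFER_SRC_OPTIMAL;
        barrier.srcAccessMask = VK_ACCESS_TRANSFER_WRITE_BIT;
        barrier.dstAccessMask = VK_ACCESS_TRANSFER_READ_BIT;
        vkCmdPipelineBarrier(cmd, VK_PIPELINE_STAGE_TRANSFER_BIT,
                             VK_PIPELINE_STAGE_TRANSFER_BIT, 0,
                             0, NULL, 0, NULL, 1, &barrier);

        /* Blit level i-1 (still in TRANSFER_DST from creation) into level i. */
        VkImageBlit blit = {0};
        blit.srcSubresource.aspectMask = VK_IMAGE_ASPECT_COLOR_BIT;
        blit.srcSubresource.mipLevel = i - 1;
        blit.srcSubresource.layerCount = 1;
        blit.srcOffsets[1] = (VkOffset3D){w, h, 1};
        blit.dstSubresource.aspectMask = VK_IMAGE_ASPECT_COLOR_BIT;
        blit.dstSubresource.mipLevel = i;
        blit.dstSubresource.layerCount = 1;
        blit.dstOffsets[1] = (VkOffset3D){w > 1 ? w / 2 : 1, h > 1 ? h / 2 : 1, 1};
        vkCmdBlitImage(cmd, image, VK_IMAGE_LAYOUT_TRANSFER_SRC_OPTIMAL,
                       image, VK_IMAGE_LAYOUT_TRANSFER_DST_OPTIMAL,
                       1, &blit, VK_FILTER_LINEAR);

        if (w > 1) w /= 2;
        if (h > 1) h /= 2;
    }
}

Whether this beats doing the whole chain in VK_IMAGE_LAYOUT_GENERAL is exactly the kind of thing you would have to measure per GPU, as noted above.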

Synopsys: Repeated compiles produce different results. How to automate iterated compile?

I'm new to using Design Compiler. In the past, I've done mostly FPGA work. Right now, I'm using Synopsys to determine the minimum area necessary to represent some circuits (using the Nangate 45nm library). I'm not doing P&R right now; I'm just trying to determine transistor area.
My only optimization constraint is to minimize area. I've noticed that if I tell DC to compile more than one time in a row, it produces different (and usually smaller) results each time.
I've looked and looked and failed to find this mentioned in a manual or anywhere in any discussion. Is it meant to work this way?
This suggests that optimization is stopping earlier than it could, so it's not REALLY minimizing area. Any idea why?
Is there a way I can tell it to increase the effort and/or tell it to automatically iterate compiles so that it will converge on the smallest design?
I'm guessing that DC is expecting to meet timing constraints, but I've given it a purely combinatorial block and no timing constraint. Did they never consider the usage scenario when all you want to do is work out the minimum gate area for a combinatorial circuit?
On a pure combinatorial circuit you can use a set_max_delay constraint and DC will attempt to meet that.
For reduced area you can use compile -map_effort high, or compile_ultra, to get it to work harder.
DC is a funny beast, and the algorithms it uses change as processes advance and make certain activities more or less useful. A lot of pre-layout optimization is less useful since the whole situation can change once the gates are actually placed and routed.
I filed a support ticket with Synopsys. I was using a 2010 version of design compiler. Apparently, area optimization has been improved since then, and the 2014 version will minimize area in one compiler pass.

Usage of frame pointer optimizations

This is related to, but NOT the same as, "frame pointer omitting? Any risk?"
I am trying to follow this old (but still relevant) article:
http://blogs.msdn.com/b/larryosterman/archive/2007/03/12/fpo.aspx
Larry (the author) writes:
machines got sufficiently faster since 1995 that the performance
improvements that were achieved by FPO weren't sufficient to counter
the pain in debugging and analysis that FPO caused
However, in the discussion further down the page, one user writes:
Disabling FPO can have both serious code size and performance impact.
Tail call optimizations have to be disabled when a frame pointer is
present, leading to much greater stack usage in affected paths. Small
functions are also disproportionately affected by prolog/epilog code.
Third, although there are still six registers available with a frame
pointer on X86, only three of them are nonvolatile with respect to
nested calls: EBX, ESI, and EDI. Opening up a fourth register can drop
out a bunch of spill code.
I have a couple of questions.
1. Spill code == register spillage?
2. Is the author correct that FPO is generally considered a pain and that its gains do not outweigh the problems it causes?
3. Is FPO still relevant today on the x64 architecture, since there are a LOT more registers to play with?
4. Do you use FPO? What for (if yes), and does it make a difference to you?
Finally, in this article:
http://www.altdevblogaday.com/2012/05/24/x64-abi-intro-to-the-windows-x64-calling-convention/
The author says
[with respect to the Windows x64 calling convention] ...
All parameters have space reserved on the stack, even the ones passed in registers. In fact, there’s stack space for 4 parameters
even if your function doesn’t have any params. Those parameters are 8
bytes so that’s at least 32 bytes on the stack for every function
(every function actually has at least 48 bytes on the stack…I’ll
explain that another time). This stack area is called the home space.
There are few reasons behind this home space:
If the registers need to be used for something else, the called function can store the data in the home space without moving the stack
pointer.
It keeps the stack structure easy to determine. That's very handy for debugging, and perhaps necessary for x64's stack metadata (another
point I’ll come back to another time). ...... The compiler can use it
for whatever it wants, and an optimized build will likely make great
use of it.
Wouldn't an optimized build optimize the excess allocation away?
1. Spill code == register spillage?
Almost. Strictly speaking, spill code is the code added by the compiler to implement a register spill. The spill itself is the decision to tag a live range as not able to be placed in a register.
2. Is the author correct that FPO is generally considered a pain and that its gains do not outweigh the problems it causes?
The author is probably correct that on modern processor architectures, the set of functions where FPO will generate a significant performance gain is smaller than in the past. Yet FPO does make code smaller, reducing cache pressure, and it does reduce register pressure. These can be important in some settings. It also speeds up prologue and epilogue code by a few instructions. The point about debuggers not working well without the FP is noteworthy: it means core dumps are less useful for post-mortems on production-optimized code. You'd never want to use FPO during development except for final testing.
3. Is FPO still relevant today on the x64 architecture, since there are a LOT more registers to play with?
Modern processors are so various and complex that you just about never know what's "relevant" until you try it and measure.
4. Do you use FPO? What for (if yes), and does it make a difference to you?
I have written a medium-size C library (20K SLOC) where it made a small (~5%) difference in run time overall under gcc. This was a native language extension to a scripting language that had to compile under both gcc and Visual C. Using it would have split the build path. I decided 5% was not worth it for the purpose the extension served. But if it had been a dynamic fluid simulation to predict the weather, 5% could have been worth many millions of dollars. The decision would have been different.
5. Wouldn't an optimized build optimize the excess allocation away?
That's entirely up to the compiler and optimizer designer. It looks from the MS documentation here that MS has defined the ABI to require home space for all data even if its whole lifetime is spent in a register.
1) When you need to use a register and don't have any unused ones, you need to write code to save some register value on the stack and later restore it.
2) FPO was a pain back when unwinding was primarily done by walking the stack. Nowadays standard unwind ABIs exist anyway (e.g. to enable exception handling), so the information already exists, and is organized more efficiently (away from the hot code), so there's no pain. Sure, there would be some pain if you wrote all your machine code by hand, but that's not the typical use case.
3) Typical x86_64 ABIs don't use frame pointers at all (except when absolutely necessary, like for variable-length arrays in C).
4) I'm not a compiler. My compiler doesn't generate frame pointers.
5) (Optimize the excess away) I'm not sure what your question is. The space consumption of the home area isn't a problem. Not having to adjust the stack pointer is a big advantage, since you need a lot less code. The same goes for the red zone just beyond the stack frame, which allows leaf code to use a lot of memory without ever needing any stack-pointer gymnastics.
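To make the home-space discussion above concrete, here is a small hedged C illustration of the Microsoft x64 convention; add2 and caller are made-up names, and the assembly mentioned in the comments is what compilers typically emit, not a guaranteed output:

/* Hedged illustration of the "home space" under the Microsoft x64 convention. */
int add2(int a, int b);   /* hypothetical external function */

int caller(void)
{
    /* Even though add2 takes only two arguments (passed in ECX and EDX),
       the caller must still reserve 32 bytes of home/shadow space before
       the CALL, typically folded into one "sub rsp, N" in the prologue.
       add2 may spill its register arguments into that space, e.g.
       "mov [rsp+8], ecx" / "mov [rsp+16], edx", without moving RSP. */
    return add2(1, 2);
}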

Is it possible to optimize a compiled binary?

This is more of a curiosity I suppose, but I was wondering whether it is possible to apply compiler optimizations post-compilation. Are most optimization techniques highly-dependent on the IR, or can assembly be translated back and forth fairly easily?
This has been done, though I don't know of many standard tools that do it.
This paper describes an optimizer for Compaq Alpha processors that works after linking has already been done and some of the challenges they faced in writing it.
If you strain the definition a bit, you can use profile-guided optimization to instrument a binary and then rewrite it based on its observable behaviors with regards to cache misses, page faults, etc.
There's also been some work in dynamic translation, in which you run an existing binary in an interpreter and use standard dynamic compilation techniques to try to speed this up. Here's one paper that details this.
Hope this helps!
There's been some recent research interest in this space. Alex Aiken's STOKE project is doing exactly this with some pretty impressive results. In one example, their optimizer found a function that is twice as fast as gcc -O3 for the Montgomery Multiplication step in OpenSSL's RSA library. It applies these optimizations to already-compiled ELF binaries.
Here is a link to the paper.
Some compiler backends have a peephole optimizer which basically does just that: before the backend commits to the assembly that represents the IR, it has a little opportunity to optimize.
Basically you would want to do the same thing from the binary, machine code to machine code. Not the same tool, but the same kind of process: examine a block of code of some size and optimize it.
The problem you will run into, though, is that, for example, you may have had some variables that were marked volatile in C, so they are used very inefficiently in the binary; the optimizer won't know the programmer's intent there and could end up optimizing that away.
You could certainly take this back to IR and forward again, nothing to stop you from that.
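As a toy illustration of that machine-code-to-machine-code peephole idea (not any particular tool's implementation), here is a hedged C sketch that walks a decoded instruction list and marks redundant push/pop pairs for deletion; the Insn type and the opcodes are invented for the example:

#include <stdbool.h>
#include <stddef.h>

/* Hypothetical decoded-instruction representation. */
typedef enum { OP_PUSH, OP_POP, OP_MOV, OP_OTHER } Op;
typedef struct { Op op; int reg; bool dead; } Insn;

/* Peephole pass: mark "push r; pop r" pairs, which are a no-op as long as
   nothing relies on the temporary stack slot. */
static void peephole(Insn *code, size_t n)
{
    for (size_t i = 0; i + 1 < n; ++i) {
        if (code[i].op == OP_PUSH &&
            code[i + 1].op == OP_POP &&
            code[i].reg == code[i + 1].reg) {
            code[i].dead = true;       /* mark both for removal */
            code[i + 1].dead = true;
            ++i;                       /* skip the pop we just handled */
        }
    }
    /* A real tool would then compact the list and re-encode the bytes,
       fixing up any branch offsets that the deletions invalidated. */
}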

Fast rgb565 to YUV (or even rgb565 to Y)

I'm working on a thing where I want to have the option of sending the output to a video overlay. Some overlays support rgb565; if so, sweet, just copy the data across.
If not, I have to copy the data across with a conversion, a frame buffer at a time. I'm going to try a few things, but I thought this might be one of those things that optimisers would be keen to have a go at for a bit of a challenge.
There are a variety of YUV formats that are commonly supported; the easiest would be a plane of Y followed by either interleaved or individual planes of U and V.
Using Linux / xv, but at the level I'm dealing with it's just bytes and an x86.
I'm going to focus on speed at the cost of quality, but there are potentially hundreds of different paths to try out. There's a balance in there somewhere.
I looked at mmx but I'm not sure if there is anything useful there. There's nothing that strikes me as particularly suited to the task and it's a lot of shuffling to get things into the right place in registers.
Trying a crude version with Y = Green*0.5 + R*0.25 + Blue*notmuch. The U and V are even less of a concern quality wise. You can get away with murder on those channels.
For a simple loop:
loop:
movzx eax,word ptr [esi]  ; load one rgb565 pixel
add esi,2
shr eax,3                 ; al = green (plus top blue bits), ah = red (5 bits)
shr al,1                  ; halve the green contribution
add ah,ah                 ; double the red contribution
add al,ah                 ; al = crude Y (max ~189, fits in a byte)
mov [edi],al              ; store one Y byte
add edi,1
dec count                 ; count = remaining pixels
jnz loop
Of course, every instruction depends on the one before, and word reads aren't the best, so interleaving two pixels might gain a bit:
loop:
mov eax,[esi]             ; load two rgb565 pixels at once
add esi,4
mov ebx,eax
shr eax,3                 ; first pixel: al = green bits, ah = red
shr ebx,19                ; second pixel: bl = green bits, bh = red
shr al,1                  ; halve green, as in the single-pixel loop
shr bl,1
add ah,ah                 ; double red
add bh,bh
add al,ah                 ; al = Y of first pixel
add bl,bh                 ; bl = Y of second pixel
mov ah,bl
mov [edi],ax              ; store two Y bytes
add edi,2
dec count
jnz loop
It would be quite easy to do that with 4 at a time, maybe for a benefit.
Can anyone come up with anything faster, better?
An interesting side point to this is whether or not a decent compiler can produce similar code.
A decent compiler, given the appropriate switches to tune for the CPU variants of most interest, almost certainly knows a lot more about good x86 instruction selection and scheduling than any mere mortal!
Take a look at the Intel(R) 64 and IA-32 Architectures Optimization Reference Manual...
If you want to get into hand-optimising code, a good strategy might be to get the compiler to generate assembly source for you as a starting point, and then tweak that; profile before and after every change to ensure that you're actually making things better.
What you really want to look at, I think, is using MMX or the integer SSE instructions for this. That will let you work with a few pixels at a time. I imagine your compiler will be able to generate such code if you specify the correct switches, especially if your code is written nicely enough.
Regarding your existing code, I wouldn't bother interleaving instructions from different iterations to gain performance. The out-of-order engine of all x86 processors (excluding Atom) and the caches should handle that pretty well.
Edit: If you need to do horizontal adds you might want to use the PHADDD and PHADDW instructions. In fact, if you have the Intel Software Developer's Manual, you should look for the PH* instructions. They might have what you need.
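To illustrate the SIMD suggestion, here is a hedged C sketch using SSE2 intrinsics (rather than MMX) that applies the same crude Y ≈ Green*0.5 + Red*0.25 weighting as the scalar loops, sixteen pixels per iteration; rgb565_to_y, src, dst and count are placeholder names, and it assumes the pixel count is a multiple of 16:

#include <emmintrin.h>   /* SSE2 */
#include <stdint.h>
#include <stddef.h>

/* rgb565 -> crude Y, sixteen pixels per iteration (two 8-pixel halves). */
static void rgb565_to_y(const uint16_t *src, uint8_t *dst, size_t count)
{
    for (size_t i = 0; i < count; i += 16) {
        __m128i p0 = _mm_loadu_si128((const __m128i *)(src + i));
        __m128i p1 = _mm_loadu_si128((const __m128i *)(src + i + 8));

        /* green term: (p >> 4) & 0x7F, same as the scalar shr/shr sequence */
        __m128i g0 = _mm_and_si128(_mm_srli_epi16(p0, 4), _mm_set1_epi16(0x7F));
        __m128i g1 = _mm_and_si128(_mm_srli_epi16(p1, 4), _mm_set1_epi16(0x7F));

        /* red term: ((p >> 11) & 0x1F) * 2; the shift already clears the high bits */
        __m128i r0 = _mm_slli_epi16(_mm_srli_epi16(p0, 11), 1);
        __m128i r1 = _mm_slli_epi16(_mm_srli_epi16(p1, 11), 1);

        __m128i y0 = _mm_add_epi16(g0, r0);   /* max ~189, no saturation needed */
        __m128i y1 = _mm_add_epi16(g1, r1);

        /* pack the sixteen 16-bit results into sixteen bytes and store */
        _mm_storeu_si128((__m128i *)(dst + i), _mm_packus_epi16(y0, y1));
    }
}

A decent compiler may well auto-vectorize a plain C loop into something similar with the right target switches, which would be an easy way to test the "can the compiler match it" question above.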