Fast rgb565 to YUV (or even rgb565 to Y) - optimization

I'm working on a thing where I want to have the option of sending the output to a video overlay. Some overlays support rgb565; if so, sweet, I just copy the data across.
If not, I have to copy the data across with a conversion, a frame buffer at a time. I'm going to try a few things, but I thought this might be one of those problems that optimisers would be keen to have a go at for a bit of a challenge.
There are a variety of YUV formats that are commonly supported; the easiest would be a plane of Y followed by either interleaved or separate planes of U and V.
Using Linux / xv, but at the level I'm dealing with it's just bytes and an x86.
I'm going to focus on speed at the cost of quality, but there are potentially hundreds of different paths to try out. There's a balance in there somewhere.
I looked at MMX, but I'm not sure there is anything useful there. Nothing strikes me as particularly suited to the task, and it's a lot of shuffling to get things into the right place in the registers.
I'm trying a crude version with Y = Green*0.5 + Red*0.25 + Blue*(not much). The U and V are even less of a concern quality-wise; you can get away with murder on those channels.
For a simple loop:
yloop:
movzx eax,word [esi]   ; load one rgb565 pixel (R in bits 15-11, G in 10-5, B in 4-0)
add esi,2
shr eax,3              ; al = G(5:0)B(4:3), ah = R(4:0)
shr al,1               ; al ~= 2*G (plus B's top bit)
add ah,ah              ; ah = 2*R
add al,ah              ; Y ~= 2*G + 2*R (plus a stray B bit)
mov [edi],al
add edi,1
dec count
jnz yloop
Of course, every instruction depends on the one before, and word reads aren't the best, so interleaving two pixels might gain a bit:
yloop2:
mov eax,[esi]          ; load two rgb565 pixels
add esi,4
mov ebx,eax
shr eax,3              ; pixel 0: al = G(5:0)B(4:3), ah = R(4:0) plus pixel 1's low bits
shr ebx,19             ; pixel 1: bl = G(5:0)B(4:3), bh = R(4:0)
and ah,31              ; drop pixel 1's bits that leaked into ah
shr al,1               ; al ~= 2*G for pixel 0
shr bl,1               ; bl ~= 2*G for pixel 1
add ah,ah              ; ah = 2*R for pixel 0
add bh,bh              ; bh = 2*R for pixel 1
add al,ah              ; Y for pixel 0
add bl,bh              ; Y for pixel 1
mov ah,bl              ; pack the two Y bytes
mov [edi],ax
add edi,2
dec count
jnz yloop2
It would be quite easy to do that with four pixels at a time, maybe for a further benefit.
Can anyone come up with anything faster, better?
An interesting side point to this is whether or not a decent compiler can produce similar code.

A decent compiler, given the appropriate switches to tune for the CPU variants of most interest, almost certainly knows a lot more about good x86 instruction selection and scheduling than any mere mortal!
Take a look at the Intel(R) 64 and IA-32 Architectures Optimization Reference Manual...
If you want to get into hand-optimising code, a good strategy might be to get the compiler to generate assembly source for you as a starting point, and then tweak that; profile before and after every change to ensure that you're actually making things better.
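For example, a scalar version of the questioner's approximation might look like the following (the function name and exact bit weights are my own, not from the question); compiling it with something like gcc -O3 -march=native -S gives you assembly to start tweaking from.

#include <stddef.h>
#include <stdint.h>

/* Y ~= 0.5*G + 0.25*R + a little B, all on an 8-bit scale */
void rgb565_to_y(const uint16_t *src, uint8_t *dst, size_t count)
{
    for (size_t i = 0; i < count; i++) {
        unsigned p = src[i];
        unsigned r = (p >> 11) & 0x1F;  /* 5-bit red   */
        unsigned g = (p >> 5)  & 0x3F;  /* 6-bit green */
        unsigned b =  p        & 0x1F;  /* 5-bit blue  */
        dst[i] = (uint8_t)(2 * g + 2 * r + (b >> 3)); /* max 191, no overflow */
    }
}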

What you really want to look at, I think, is using MMX or the integer SSE instructions for this. That will let you work with a few pixels at a time. I imagine your compiler will be able to generate such code if you specify the correct switches, especially if your code is written nicely enough.
Regarding your existing code, I wouldn't bother with interleaving instructions of different iterations to gain performance. The out-of-order engine of all x86 processors (excluding Atom) and the caches should handle that pretty well.
Edit: If you need to do horizontal adds you might want to use the PHADDD and PHADDW instructions. In fact, if you have the Intel Software Developer's Manual, you should look for the PH* instructions. They might have what you need.
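For what it's worth, a rough sketch of the "few pixels at a time" idea with integer SSE intrinsics might look like this. It uses the questioner's crude weights rather than PHADD; the function name is mine, and it assumes SSE2 and that count is a multiple of 16.

#include <emmintrin.h>
#include <stddef.h>
#include <stdint.h>

void rgb565_to_y_sse2(const uint16_t *src, uint8_t *dst, size_t count)
{
    const __m128i gmask = _mm_set1_epi16(0x007E); /* 2*G after >> 4  */
    const __m128i rmask = _mm_set1_epi16(0x003E); /* 2*R after >> 10 */
    for (size_t i = 0; i < count; i += 16) {
        __m128i p0 = _mm_loadu_si128((const __m128i *)(src + i));
        __m128i p1 = _mm_loadu_si128((const __m128i *)(src + i + 8));
        __m128i y0 = _mm_add_epi16(_mm_and_si128(_mm_srli_epi16(p0, 4), gmask),
                                   _mm_and_si128(_mm_srli_epi16(p0, 10), rmask));
        __m128i y1 = _mm_add_epi16(_mm_and_si128(_mm_srli_epi16(p1, 4), gmask),
                                   _mm_and_si128(_mm_srli_epi16(p1, 10), rmask));
        /* the 16-bit Y values are in 0..188, so packing to bytes never saturates */
        _mm_storeu_si128((__m128i *)(dst + i), _mm_packus_epi16(y0, y1));
    }
}

That processes 16 pixels per iteration with no lane shuffling at all, which is the main thing that makes this particular conversion SIMD-friendly.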

Related

What's the point of "make move" and "undo move" in chess engines?

I want to experiment with massively parallel chess computing. From what I've seen and understood in wikis and the source code of some engines, in most (all?) implementations of the min-max (negamax, alpha-beta, ...) algorithm there is one internal position that gets updated for every new branch and then gets undone after receiving the evaluation of that branch.
What are the benefits of that technique compared to just generating a new position object and passing that to the next branch?
This is what I have done in my previous engines and I believe this method is superior for the purpose of parallelism.
The answer to your question depends heavily on a few factors, such as how you are storing the chess board and what your make/unmake move functions look like. I'm sure there are ways of storing the board that would be better suited to your method, and indeed historically some top-tier chess engines (in particular Crafty) used that method, but you are correct in saying that modern engines no longer do it that way. Here is a link to a discussion about this very point:
http://www.talkchess.com/forum/viewtopic.php?t=50805
If you want to understand why this is, then you must understand how today's engines represent the board. The standard implementation revolves around bitboards, which require 12 64-bit integers per position, in addition to a redundant mailbox (an array, in non-computer-chess jargon) used in conjunction with them. Copying all that is usually more expensive than a good makeMove/unMakeMove function.
I also want to point out that a hybrid approach is also common. Usually make and unmake are used on the board itself, whereas other information like en passant squares and castling rights is copied and changed as you suggested.
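As a rough illustration of the sizes involved (the field names and layout below are my own invention, not taken from any particular engine), a bitboard position plus a mailbox comes to well under 300 bytes, which is why copying it per node is even in the running:

#include <cstdint>
#include <cstdio>

struct Position {
    uint64_t bitboards[12];  // one 64-bit bitboard per piece type and colour
    uint8_t  mailbox[64];    // redundant piece-on-square array
    uint8_t  castlingRights;
    int8_t   epSquare;       // en passant target square, -1 if none
    uint8_t  halfmoveClock;
    uint8_t  sideToMove;
};

int main() {
    std::printf("sizeof(Position) = %zu bytes\n", sizeof(Position)); // 168 here
    Position parent{};
    Position child = parent; // "copy-make": two or three cache lines per node
    (void)child;
    return 0;
}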
Interestingly, for board representations of < ~300 bytes, it IS cheaper to copy the board on each move (on modern x86), as opposed to making and unmaking the move.
As you suggest, the immutability you get from copying the board on each move is very attractive from a programming perspective, including for parallelising.
My board rep is 208 bytes. C++ compiled with g++ 7.4.0.
Empirically, board copy is 20% faster than move make/unmake. I presume the copy is using 32/64-byte-wide AVX instructions; in theory you can copy 32/64 bytes per cycle.
Just for you: https://github.com/rolandpj1968/oom-skaak/tree/init-impl-legal-move-gen-move-do-undo-2

Is SSE redundant or discouraged?

Looking around here and on the internet, I can find a lot of posts about modern compilers beating SSE in many real situations, and I have just found, in some code I inherited, that when I disable some SSE code written in 2006 for integer-based image processing and force the code down the standard C branch, it runs faster.
On modern processors with multiple cores and advanced pipelining, etc, does older SSE code underperform gcc -O2?
You have to be careful with microbenchmarks. It's really easy to measure something other than what you thought you were. Microbenchmarks also usually don't account for code size at all, in terms of pressure on the L1 I-cache / uop-cache and branch-predictor entries.
In most cases, microbenchmarks usually have all the branches predicted as well as they can be, while a routine that's called frequently but not in a tight loop might not do as well in practice.
There have been many additions to SSE over the years. A reasonable baseline for new code is SSSE3 (found in Intel Core2 and later, and AMD Bulldozer and later), as long as there is a scalar fallback. The addition of a fast byte-shuffle (pshufb) is a game-changer for some things. SSE4.1 adds quite a few nice things for integer code, too. If old code doesn't use it, compiler output, or new hand-written code, could do much better.
Currently we're up to AVX2, which handles two 128b lanes at once, in 256b registers. There are a few 256b shuffle instructions. AVX/AVX2 gives 3-operand (non-destructive dest, src1, src2) versions of all the previous SSE instructions, which helps improve code density even when the two-lane aspect of using 256b ops is a downside (or when targeting AVX1 without AVX2 for integer code).
In a year or two, the first AVX512 desktop hardware will probably be around. That adds a huge amount of powerful features (mask registers, and filling in more gaps in the highly non-orthogonal SSE / AVX instruction set), as well as just wider registers and execution units.
If the old SSE code only gave a marginal speedup over the scalar code back when it was written, or nobody ever benchmarked it, that might be the problem. Compiler advances may lead to the generated code for scalar C beating old SSE that takes a lot of shuffling. Sometimes the cost of shuffling data into vector registers eats up all the speedup of being fast once it's there.
Or depending on your compiler options, the compiler might even be auto-vectorizing. IIRC, gcc -O2 doesn't enable -ftree-vectorize, so you need -O3 for auto-vec.
Another thing that might hold back old SSE code is that it might assume unaligned loads/stores are slow, and used palignr or similar techniques to go between unaligned data in registers and aligned loads/stores. So old code might be tuned for an old microarch in a way that's actually slower on recent ones.
So even without using any instructions that weren't available previously, tuning for a different microarchitecture matters.
Compiler output is rarely optimal, esp. if you haven't told it about pointers not aliasing (restrict), or being aligned. But it often manages to run pretty fast. You can often improve it a bit (esp. for being more hyperthreading-friendly by having fewer uops/insns to do the same work), but you have to know the microarchitecture you're targeting. E.g. Intel Sandybridge and later can only micro-fuse memory operands that use a one-register addressing mode. There are other useful links in the x86 tag wiki.
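As a small illustration of the aliasing point (a sketch; the function name is mine, and note that restrict is C99, with g++/clang/MSVC accepting __restrict in C++):

// Promising the compiler that dst and src never overlap lets it vectorize the
// loop directly, instead of emitting a runtime overlap check plus a scalar fallback.
void scale(float *__restrict dst, const float *__restrict src, int n, float k)
{
    for (int i = 0; i < n; i++)
        dst[i] = src[i] * k;
}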
So to answer the title, no the SSE instruction set is in no way redundant or discouraged. Using it directly, with asm, is discouraged for casual use (use intrinsics instead). Using intrinsics is discouraged unless you can actually get a speedup over compiler output. If they're tied now, it will be easier for a future compiler to do even better with your scalar code than to do better with your vector intrinsics.
Just to add to Peter's already excellent answer, one fundamental point to consider is that the compiler does not know everything that the programmer knows about the problem domain, and there is in general no easy way for the programmer to express useful constraints and other relevant information that a truly smart compiler might be able to exploit in order to aid vectorization. This can give the programmer a huge advantage in many cases.
For example, for a simple case such as:
// add two arrays of floats
float a[N], b[N], c[N];
for (int i = 0; i < N; ++i)
a[i] = b[i] + c[i];
any decent compiler should be able to do a reasonably good job of vectorizing this with SSE/AVX/whatever, and there would be little point in implementing this with SIMD intrinsics. Apart from relatively minor concerns such as data alignment, or the likely range of values for N, the compiler-generated code should be close to optimal.
But if you have something less straightforward, e.g.
// map array of 4 bit values to 8 bit values using a LUT
const uint8_t LUT[16] = { 0, 1, 3, 7, 11, 15, 20, 27, ..., 255 };
uint8_t in[N]; // 4 bit input values
uint8_t out[N]; // 8 bit output values
for (int i = 0; i < N; ++i)
out[i] = LUT[in[i]];
you won't see any auto-vectorization from your compiler because (a) it doesn't know that you can use PSHUFB to implement a small LUT, and (b) even if it did, it has no way of knowing that the input data is constrained to a 4-bit range. So a programmer could write a simple SSE implementation which would most likely be an order of magnitude faster:
__m128i vLUT = _mm_loadu_si128((const __m128i *)LUT);
for (int i = 0; i < N; i += 16)
{
    __m128i vin  = _mm_loadu_si128((const __m128i *)&in[i]);
    __m128i vout = _mm_shuffle_epi8(vLUT, vin);   // 16 table lookups at once
    _mm_storeu_si128((__m128i *)&out[i], vout);
}
Maybe in another 10 years compilers will be smart enough to do this kind of thing, and programming languages will have methods to express everything the programmer knows about the problem, the data, and other relevant constraints, at which point it will probably be time for people like me to consider a new career. But until then there will continue to be a large problem space where a human can still easily beat a compiler with manual SIMD optimisation.
These were two separate and strictly speaking unrelated questions:
1) Did SSE in general and SSE-tuned codebases in particular become obsolete / "discouraged" / retired?
Answer in brief: not yet, and not really. The high-level reason: there is still enough hardware around (even in the HPC domain, where one can easily find Nehalem) which only has SSE* on board, with no AVX* available. If you look outside HPC, then consider for example the Intel Atom CPU, which currently supports only up to SSE4.
2) Why is gcc -O2 (i.e. auto-vectorized, running on SSE-only hardware) faster than some old (presumably intrinsics-based) SSE implementation written 9 years ago?
Answer: it depends, but first of all things are improving very actively on the compiler side. AFAIK the top four x86 compiler dev teams have made big to enormous investments in auto-vectorization and explicit vectorization over the course of the past 9 years. And the reason they did so is also clear: the SIMD "FLOPs" potential of x86 hardware has (formally) been increased by 8 times (i.e. 8x SSE4 peak FLOPs) over the same period.
Let me ask one more question myself:
3) OK, SSE is not obsolete. But will it be obsolete in X years from now?
Answer: who knows, but at least in HPC, with wider adoption of AVX2- and AVX-512-compatible hardware, SSE intrinsics codebases are highly likely to retire soon enough, although it again depends on what you develop. Some low-level optimized HPC/media libraries will likely keep highly tuned SSE code paths for a long time.
You might very well see modern compilers use SSE4. But even if they stick to the same ISA, they're often a lot better at scheduling. Keeping SSE units busy means careful management of data streaming.
Cores are irrelevant as each instruction stream (thread) runs on a single core.
Yes -- but mainly in the same sense that writing inline assembly is discouraged.
SSE instructions (and other vector instructions) have been around long enough that compilers now have a good understanding of how to use them to generate efficient code.
You won't do a better job than the compiler unless you have a good idea what you're doing. And even then it often won't be worth the effort spent trying to beat the compiler. And even then, your efforts at optimizing for one specific CPU might not result in good code for other CPUs.

How can I estimate if a feature is going to take up too many resources on an FPGA?

I'm starting on my first commercial sized application, and I often find myself making a design, but stopping myself from coding and implementing it, because it seems like a huge use of resources. This is especially true when it's on a piece that is peripheral (for example an enable for the output taps of a shift register). It gets even worse when I think about how large the generic implementation can get (4k bits for the taps example). The cleanest implementation would have these, but in my head it adds a great amount of overhead.
Is there any kind of rule I can use to make a quick decision on whether a design option is worth coding and evaluating? In general I worry less about the number of flip-flops, and more when it comes to the width of signals. This may just come from a CS background where all application boundaries should be kept as small as feasible to prevent overhead.
Point 1. We learn by playing, so play! Try a couple of things. See what the tools do. Get a feel for the problem. You won't get past this if you don't try something. Often the problems aren't where you think they're going to be.
Point 2. You need to get some context for these decisions. How big is adding an enable to a shift register compared to the capacity of the FPGA / your design?
Point 3. There are two major types of 'resource' to consider: cells and time.
Cells are relatively easy in broad terms. How many flops? How much logic in identifiable blocks (e.g. in an ALU: multipliers, adders, etc.)? Often this is defined by the design you're trying to do. You can't build an ALU without registers, a multiplier, an adder, etc.
Time is more subtle, and is invariably traded off against cells. You'll be trying to hit some performance target, and recognising the structures that will make that hard is where the experience from point 1 comes in.
Things to look out for include:
A single net driving a large number of things. Large fan-outs cause a heavy load on a single driver which slows it down. The tool will then have to use cells to buffer that signal. Classic time vs cells trade off.
Deep clumps of logic between register stages. Again, the tool will have to spend more cells to make the logic meet timing if it's close to the edge. Simple logic is fast and small. Sometimes introducing a pipeline stage can decrease the size of a design if it makes the logic on either side far simpler.
Don't worry so much about large buses, if each bit is low fanout and you've budgeted for the registers. Large buses are often inherent in fast designs because you need high bandwidth. It can be easier to go wide than to go to a higher clock speed. On the other hand, think about the control logic for a wide bus, because it's likely to have a large fan-out.
Different tools and target devices have different characteristics, so you have to play and learn the rules for your set-up. There's always a size vs speed (and these days 'vs power') compromise. You need to understand what moves you along that curve in each direction. That comes with experience.
Is there any kind of rule I can use to make a quick decision on whether a design option is worth coding and evaluating?
The only rule I can come up with is 'Have I got time or not?'
If I have, I'll explore. If not, I'd better just make something work.
Ahhh, the life of doing design to a deadline!
It's something that comes with experience. Here are some pointers:
adding numbers is fairly cheap
choosing between them (multiplexing) gets big quite quickly if you have a lot of inputs to the multiplexer (the width of each input is a secondary issue also).
Multiplications are free if you have spare multipliers in your chip; they suddenly become expensive when you run out of hard DSP blocks.
Memory is also cheap, until you run out. For example, your 4-Kbit shift register easily fits within a single Xilinx block RAM, which is fine if you have one to spare. If not, it'll take a large number of LUTs (depending on the device - an older Spartan 3 can fit 17 bits into a LUT, including the in-CLB register, so it would require ~235 LUTs). And not all LUTs can be shift registers. If you are only worried about the enable for the register, don't be. Unless you are pushing the performance of the device, routing that sort of signal to a few hundred LUTs is unlikely to cause major timing issues.

Machine code alignment

I am trying to understand the principles of machine code alignment. I have an assembler implementation which can generate machine code at run time. I use 16-byte alignment on every branch destination, but it looks like that is not the optimal choice, since I've noticed that if I remove the alignment, the same code sometimes runs faster. I think it has something to do with the cache line width, so that some instructions are cut by a cache line boundary and the CPU experiences stalls because of that. So if some bytes of alignment are inserted at one place, they will move instructions somewhere further past the cache line boundary...
I was hoping to implement an automatic alignment procedure which can process the code as a whole and insert alignment according to the specifics of the CPU (cache line width, 32/64 bits, and so on)...
Can someone give some hints about this procedure? As an example the target CPU could be Intel Core i7 CPU 64-bit platform.
Thank you.
I'm not qualified to answer your question because this is such a vast and complicated topic. There are probably many more mechanisms in play here, other than cache line size.
However, I would like to point you to Agner Fog's site and the optimization manuals for compiler makers that you can find there. They contain a plethora of information on these kinds of subjects - cache lines, branch prediction and data/code alignment.
Paragraph (16-byte) alignment is usually the best. However, it can force some "local" JMP instructions to no longer be local (due to code size bloat). It may also result in less code being cached. I would only align major segments of code; I would not align every tiny subroutine/JMP section.
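If you do keep 16-byte alignment in your run-time assembler, the padding step itself is simple; here is a minimal sketch (function name mine, assuming code is emitted into a byte buffer whose base address is itself 16-byte aligned). The multi-byte NOP encodings are the ones recommended in Intel's optimization manual, so the padding costs at most a couple of instructions rather than a string of single-byte NOPs.

#include <cstddef>
#include <cstdint>
#include <vector>

void align_to_16(std::vector<uint8_t> &code)
{
    static const uint8_t nops[][9] = {
        {0x90},                                                 // 1-byte NOP
        {0x66, 0x90},                                           // 2 bytes
        {0x0F, 0x1F, 0x00},                                     // 3 bytes
        {0x0F, 0x1F, 0x40, 0x00},                               // 4 bytes
        {0x0F, 0x1F, 0x44, 0x00, 0x00},                         // 5 bytes
        {0x66, 0x0F, 0x1F, 0x44, 0x00, 0x00},                   // 6 bytes
        {0x0F, 0x1F, 0x80, 0x00, 0x00, 0x00, 0x00},             // 7 bytes
        {0x0F, 0x1F, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00},       // 8 bytes
        {0x66, 0x0F, 0x1F, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00}, // 9 bytes
    };
    size_t pad = (16 - (code.size() & 15)) & 15;
    while (pad) {
        size_t n = pad > 9 ? 9 : pad;                 // longest NOP that fits
        code.insert(code.end(), nops[n - 1], nops[n - 1] + n);
        pad -= n;
    }
}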
Not an expert, however... Branches to places that are not already in the instruction cache should benefit from alignment the most, because you'll read a whole cache line of instructions to fill the pipeline. Given that statement, forward branches will benefit on the first run of a function. Backward branches ("for" and "while" loops, for example) will probably not benefit, because the branch target and the following instructions have already been read into the cache. Do follow the links in Martin's answer.
As mentioned previously, this is a very complex area, and Agner Fog seems like a good place to visit. As to the complexities, I ran across Torbjörn Granlund's article "Improved Division by Invariant Integers", and in the code he uses to illustrate his new algorithm, the first instruction at - I guess - the main label is nop (no operation). According to the commentary it improves performance significantly. Go figure.

How much speed-up from converting 3D maths to SSE or other SIMD?

I am using 3D maths in my application extensively. How much speed-up can I achieve by converting my vector/matrix library to SSE, AltiVec or a similar SIMD code?
In my experience I typically see about a 3x improvement in taking an algorithm from x87 to SSE, and a better than 5x improvement in going to VMX/Altivec (because of complicated issues having to do with pipeline depth, scheduling, etc). But I usually only do this in cases where I have hundreds or thousands of numbers to operate on, not for those where I'm doing one vector at a time ad hoc.
That's not the whole story, but it's possible to get further optimizations using SIMD; have a look at Miguel's PDC 2008 presentation about implementing SIMD instructions in Mono.
(Picture from Miguel's blog entry; source: tirania.org.)
For some very rough numbers: I've heard some people on ompf.org claim 10x speed ups for some hand-optimized ray tracing routines. I've also had some good speed ups. I estimate I got somewhere between 2x and 6x on my routines depending on the problem, and many of these had a couple of unnecessary stores and loads. If you have a huge amount of branching in your code, forget about it, but for problems that are naturally data-parallel you can do quite well.
However, I should add that your algorithms should be designed for data-parallel execution.
This means that if you have a generic math library as you've mentioned then it should take packed vectors rather than individual vectors or you'll just be wasting your time.
E.g. Something like
namespace SIMD {
class PackedVec4d
{
__m128 x;
__m128 y;
__m128 z;
__m128 w;
//...
};
}
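For instance, with that structure-of-arrays layout, one _mm_add_ps works on the x components of four vectors at once. A standalone sketch (struct and function names mine, members left public for brevity):

#include <xmmintrin.h>

namespace SIMD {
struct PackedVec4f {            // four vectors, stored component-wise
    __m128 x, y, z, w;
};

inline PackedVec4f add(const PackedVec4f &a, const PackedVec4f &b)
{
    PackedVec4f r;
    r.x = _mm_add_ps(a.x, b.x); // adds the x components of all four vectors
    r.y = _mm_add_ps(a.y, b.y);
    r.z = _mm_add_ps(a.z, b.z);
    r.w = _mm_add_ps(a.w, b.w);
    return r;
}
}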
Most problems where performance matters can be parallelized since you'll most likely be working with a large dataset. Your problem sounds like a case of premature optimization to me.
For 3D operations beware of un-initialized data in your W component. I've seen cases where SSE ops (_mm_add_ps) would take 10x normal time because of bad data in W.
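One way to avoid that (a sketch, names mine) is to always write the fourth lane explicitly when building a vector from three components, so it can never hold a stray NaN or denormal:

#include <xmmintrin.h>

static inline __m128 load_vec3(float x, float y, float z)
{
    return _mm_set_ps(0.0f, z, y, x); // lanes are (w, z, y, x); w is forced to 0
}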
The answer highly depends on what the library is doing and how it is used.
The gains can range from a few percentage points to "several times faster"; the areas most likely to see gains are those where you're not dealing with isolated vectors or values, but with multiple vectors or values that have to be processed in the same way.
Another area is when you're hitting cache or memory limits, which, again, requires a lot of values/vectors being processed.
The domains where gains can be the most drastic are probably those of image and signal processing, computational simulations, and general 3D maths operations on meshes (rather than isolated vectors).
These days all the good compilers for x86 generate SSE instructions for SP and DP float maths by default. It's nearly always faster to use these instructions than the native x87 ones, even for scalar operations, so long as you schedule them correctly. This will come as a surprise to many who in the past found SSE to be "slow" and thought compilers could not generate fast SSE scalar instructions. But now you have to use a switch to turn off SSE generation and use x87. Note that x87 is effectively deprecated at this point and may be removed from future processors entirely. The one downside of this is that we may lose the ability to do 80-bit DP float in a register. But the consensus seems to be that if you are depending on 80-bit instead of 64-bit DP floats for precision, you should look for a more precision-loss-tolerant algorithm.
Everything above came as a complete surprise to me. It's very counter intuitive. But data talks.
Most likely you will see only very small speedup, if any, and the process will be more complicated than expected. For more details see The Ubiquitous SSE vector class article by Fabian Giesen.
The Ubiquitous SSE vector class: Debunking a common myth
Not that important
First and foremost, your vector class is probably not as important for the performance of your program as you think (and if it is, it's more likely because you're doing something wrong than because the computations are inefficient). Don't get me wrong, it's probably going to be one of the most frequently used classes in your whole program, at least when doing 3D graphics. But just because vector operations will be common doesn't automatically mean that they'll dominate the execution time of your program.
The article continues under the headings "Not so hot", "Not easy", "Not now" and "Not ever".