I'm currently working on a project that must involve research of JIT techniques. I'm a complete beginner when it comes to anything related to compilers but I did some research and learned about Java's Hotspot VM. I was hoping to do an analysis on the benefits (or downsides) of using Hotspot versus traditional compilers (for example, g++).
My initial idea was to create some sort of simple program that can be run through both compilers in order to compare compilation times but this brought up a number of questions:
From my understanding, Java source code is initially turned into bytecode by the javac compiler (creating .class files) and then, in turn, this bytecode can be run through HotSpot at runtime to execute the program. Given this, would it even be relevant to compare results with a traditional compiler that converts sources directly to machine code?
Another concern I'm facing is that the programs would be in different languages (ex. C++ vs Java). Although the functionality would be identical, could this skew results when attempting to compare?
Moving on, if the above two points are not a problem, my main question is:
How can I actually go about benchmarking the speed-up in one method versus the other?
I did some brief research about this but all I was able to find were ways to measure the efficiency of the program itself, not the compilation technique used to run it. Is what I'm trying to do possible? Are there methods to actually analyze the speed up of one compiler over another?
Any help is appreciated!
How can I actually go about benchmarking the speed-up in one method versus the other?
You first need to consider what you actually intend to measure. In other words, saying "the speed-up" is not sufficiently rigorous.
Are we talking about CPU cycles spent compiling? Or walltime from source code to running program? Or peak performance of a few critical methods in a micro benchmark? Overall steady-state program performance? Speed of program initialization? ...
In the end you're comparing two systems that made quite different trade-offs. You can find a few roughly comparable benchmarks already mentioned in the comments, but in the end they mostly represent specific kinds of throughput-bound tasks, not large applications. It's not like you can find an application such as Firefox written in both C and Java with identical feature sets and comparable code quality. So any comparison you do will be incomplete, because you'll have to rely on some limited proxy measurement of how comparable the two code bases are.
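If you do go down the micro-benchmark route, the usual structure is to measure first-run cost separately from steady-state throughput, since a JIT pays most of its price up front while a native binary pays it at build time. Here is a minimal timing-harness sketch - C++ and std::chrono purely to show the shape, workload() is a stand-in for whatever you measure, and a Java harness would be organised the same way:

#include <chrono>
#include <cstdio>

// Stand-in for the code under test.
static long workload() {
    long sum = 0;
    for (int i = 0; i < 1000000; ++i) sum += i % 7;
    return sum;
}

int main() {
    using clock = std::chrono::steady_clock;
    using std::chrono::duration_cast;
    using std::chrono::microseconds;
    volatile long sink = 0;

    // First run: captures one-off costs (for a JIT: interpretation plus compilation).
    auto t0 = clock::now();
    sink = sink + workload();
    auto firstRun = clock::now() - t0;

    // Warm-up so later measurements see steady-state code.
    for (int i = 0; i < 10; ++i) sink = sink + workload();

    // Steady state: average over many runs.
    const int runs = 50;
    t0 = clock::now();
    for (int i = 0; i < runs; ++i) sink = sink + workload();
    auto steady = (clock::now() - t0) / runs;

    std::printf("first run: %lld us, steady state: %lld us\n",
                (long long)duration_cast<microseconds>(firstRun).count(),
                (long long)duration_cast<microseconds>(steady).count());
    return (int)(sink & 1);
}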
I apologize in advance if the question seems somewhat broad or strange; I don't mean to offend anyone, but maybe someone can actually make a recommendation. I tried looking for similar questions, but could not find any.
What are the best resources (books, blogs, etc.) for learning how to optimize code?
There are quite a few resources on making code more human-readable (Code Complete probably being the number one choice). But what about making it run faster and use memory more efficiently?
Of course there are lots of books on each particular language, but I wonder if there are some that cover memory use and speed of operations in a somewhat language-independent way?
Here are some links that might be helpful in general on the subject of memory optimization:
What Every Programmer Should Know About Memory by Ulrich Drepper
Herb Sutter: The Free Lunch Is Over: A Fundamental Turn Toward Concurrency in Software
Slides: Herb Sutter: Machine Architecture (Things Your Programming Language Never Told You)
Video: Herb Sutter @ NWCPP: Machine Architecture: Things Your Programming Language Never Told You
The microarchitecture of Intel, AMD and VIA CPUs: An optimization guide for assembly programmers and compiler makers, by Agner Fog
Read Structured Programming with go to Statements. While it's the source of the "premature optimisation is the root of all evil" quote that comes up the moment somebody wants to make anything faster or smaller - no matter how desperately important or late in the process they are - it's actually about the importance of making things efficient when you can.
Learn about time complexity, space complexity and the analysis of algorithms.
Come up with examples where you would want to sacrifice having worse space complexity for better time complexity, and vice versa.
Know the time and space complexities of the algorithms and data structures your languages and frameworks of choice offer, especially those you use most often.
Read the answers on this site on questions about creating a good hash code.
Study the approach HTTP took to having the advantage of caching, without the disadvantage of using stale data inappropriately. Consider how easy or difficult that is to apply to in-memory caches (there's a small cache sketch after this list). Consider when you would say "screw it, I can live with being stale for the speed boost it gives me". Consider when you would say "screw it, I can live with being slow for the guarantee of freshness it gives me".
Learn how to multithread. Learn when it improves performance. Learn why it often doesn't or even makes things worse.
Look at a lot of Joe Duffy's blog where performance is a regular concern of his writing.
Learn how to process items as streams or iterations rather than building and rebuilding data-structures full of each item, each time. Learn when you're actually better off not doing that.
Know what things cost. You can't reasonably decide "I'll work so this is in the CPU cache rather than main-memory/main-memory rather than disk/disk rather than over a network" unless you've a good idea what actually causes each to be hit, and what the cost differences are. Worse, you can't dismiss something as premature optimisation if you don't know what they cost - not bothering to optimise something is often the best choice, but if you don't even consider it in passing you aren't "avoiding premature optimisation", you're muddling through and hoping it works.
Learn a bit about what optimisations are done for you by the script engine/jitter/compiler/etc you use. Learn how to work with them rather than against them. Learn not to re-do work it'll do for you anyway. In one or two cases, you may also be able to apply the same general principle to your work.
Search for cases on this site where something is dismissed as an implementation detail - yes, all of those are cases where the detail in question isn't the most important thing at the time, but all of those implementation details were chosen for a reason. Learn what they were. Learn the counter-arguments.
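For the caching point above, here is a minimal sketch of the "stale for speed" trade-off: a tiny in-memory cache where every entry carries an expiry time, so the freshness policy is an explicit decision rather than an accident. C++ with invented names, illustrative only:

#include <chrono>
#include <optional>
#include <string>
#include <unordered_map>

// Hypothetical TTL cache: entries are served until they go stale.
class TtlCache {
public:
    explicit TtlCache(std::chrono::seconds ttl) : ttl_(ttl) {}

    void put(const std::string& key, std::string value) {
        cache_[key] = Entry{std::move(value), clock::now() + ttl_};
    }

    // Empty result means missing or stale - the caller decides whether to
    // recompute (pay for freshness) or tolerate a stale copy (pay with staleness).
    std::optional<std::string> get(const std::string& key) const {
        auto it = cache_.find(key);
        if (it == cache_.end() || clock::now() > it->second.expires) return std::nullopt;
        return it->second.value;
    }

private:
    using clock = std::chrono::steady_clock;
    struct Entry { std::string value; clock::time_point expires; };
    std::chrono::seconds ttl_;
    std::unordered_map<std::string, Entry> cache_;
};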
Edit (I'll keep adding a few more to this as I go):
Different books of course differ in the emphasis they put on efficiency concerns, but I remember Stroustrup's The C++ Programming Language as one where, a good few times, he explains a choice between different options in terms of efficiency, and also how to keep decisions made for efficiency's sake from hurting the usability of the classes "from the outside".
Which brings me to another point: concentrate on the efficiency of the library code you reuse in different projects. You don't ever want to be thinking "maybe I should hand-roll a new one here to be more efficient" unless it's a very specialised case; you want to be confident that a lot of work went into making that heavily-used class efficient over a lot of cases, and concentrate on identifying hot-spots.
As for specialised cases, some of the more obscure data structures are worth knowing for the cases they serve. For example, a DAWG is a very compact structure for storing strings with a lot of common prefixes and suffixes (which would be most words in most natural languages) where you just want to find those in the list that match a pattern. If you need a "payload" then a tree where each letter has a list of nodes for each subsequent letter (a generalisation of a DAWG but ending in that "payload" rather than the terminal node) has some but not all of the advantages. They also find the result in O(n) time where n is the length of the string sought.
How often will that come up? Not often. It came up for me once (a few times really, but they were variants of the same case), and as such it would not have been worth it for me to learn all there was to know about DAWGs before then. But I knew enough to know it was what I needed to research later, and it saved me gigabytes (really, from way too much for a machine with 16GB RAM to cope with, to less than 1.5GB). Going straight for a hand-rolled DAWG rather than putting the strings in a hashset would totally have been premature optimisation, but having flicked through the NIST data structures site meant I could reach for it when it came up.
Consider: "Finding a string in a DAWG is O(n)" "Finding a string in a Hashset is O(1)" Both of these statements is true, but the speed of the two tends to be comparable. Why? Because the DAWG is O(n) in terms of the length of the string, and effectively O(1) in terms of the size of the DAWG. The Hashset is O(1) in terms of the size of the hashset, but working out the hash is typically O(n) in terms of the length of the string, and equality checks are also O(n) in terms of that length. Both statements were correct, but they were thinking about a different n! You always need to know what n means in any discussion of time and space complexity - most often it'll be the size of the structure, but not always.
Don't forget constant effects: O(n²) is the same as O(1) for sufficiently low values of n! Remember that the likes of O(n²) translates to n²*k + n*k₁ + k₂, with the assumption that k₁ and k₂ are low enough, and that k and the corresponding constant of whatever algorithm or structure we are comparing against are close enough, that they don't really matter and it's only n² we care about. This isn't true all the time, and we can sometimes find that k, k₁ or k₂ are high enough to cause trouble. It's also not true when n is going to be so small that the difference in the constant costs of different approaches matters. Of course normally when n is small we don't have a big efficiency concern, but what if we are doing m operations on structures averaging n in size, and m is large? If we are choosing between an O(1) and an O(n²) approach, we are choosing between an O(m) and an O(n²m) approach overall. It still seems like a no-brainer in favour of the former, but with a low n it essentially becomes a choice between two different O(m) approaches, and the constant factors are much more important.
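A familiar concrete case of constants beating big-O: for a handful of elements, a linear scan of a small contiguous array is often as fast as, or faster than, a hash lookup, even though O(n) looks worse than O(1) on paper. C++ sketch; which one wins depends on the data and the platform, so measure:

#include <algorithm>
#include <string>
#include <unordered_set>
#include <vector>

// O(n) on paper, but for a handful of short keys the whole scan can cost
// less than one hash + bucket probe + equality check.
bool inSmallList(const std::vector<std::string>& items, const std::string& key) {
    return std::find(items.begin(), items.end(), key) != items.end();
}

// O(1) on paper, but the constant includes hashing the whole key.
bool inHashSet(const std::unordered_set<std::string>& items, const std::string& key) {
    return items.count(key) != 0;
}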
Learn about lock-free multi-threading. Or perhaps don't. Personally, I've two pieces of my own code I use professionally that use all but the simplest lock-free techniques. One is based on well-known approaches and I wouldn't bother now (it's .NET code first written for .NET 2.0, and the .NET 4.0 library supplies a class that does the same thing). The other I first wrote for fun, and only used after that just-for-fun period had given me something reliable (and it still gets beaten by something in the 4.0 library for a lot of cases, but not for some others that I care about). I would hate to have to write something like it with a deadline and a client in mind.
All that said, if you're coding out of interest, the challenges involved are interesting and it's an enjoyable thing to work with when you've the freedom to give up on a failed plan that you don't get when you're doing something for a paying client, and you'll certainly learn a lot about efficiency concerns generally. (Take a look at https://github.com/hackcraft/Ariadne if you want to see some of what I've done with this).
A Case Study
Actually, that contains a relatively good example of some of the above principles. Take a look at the method that's currently at line 511 at https://github.com/hackcraft/Ariadne/blob/master/Collections/ThreadSafeDictionary.cs (where I joke in the comments about it being flame-bait for people quoting Dijkstra). Let's use it as a case study:
This method was first written to use recursion, because it's a naturally recursive problem - after doing the operation on the current table, if there's a "next" table we want to do the exact same operation on that, and so on until there's no further table.
Recursion is almost always slower than iteration, for a few different reasons. Should we make all recursive calls iterative? No, it's often not worth it, and recursion is a wonderful way to write code that is clear about what it's doing. Here though, applying the principle above, since this is library code that might be called where performance is crucial, particular effort should be expended on it.
The decision to try to improve its speed being made, the next thing I did was make measurements. I don't depend on "I know that iteration is faster than recursion, so it must be faster when changed to avoid recursion". That's just not true - a poorly written iterative version may not be as good as a well-written recursive version.
The next question is, just how to re-write it. I've a tested method that I know works and I'm going to replace it with a different version. I don't want to replace it with a version that doesn't work, obviously, so how to re-write while taking the most advantage out of what's already there?
Well, I know about tail-call elimination; an optimisation normally done by compilers that changes the way the stack is managed so that recursive functions end up with properties closer to those of iterative (it's still recursive from the perspective of the source code, but it's iterative in terms of how the compiled code actually uses the stack).
This gives me two things to think about: 1. Maybe the compiler is already doing this, in which case my extra work isn't going to do anything to help. 2. If the compiler isn't already doing this, I can take the same basic approach manually.
That decision made, I replaced each point where the method called itself with a change to the one parameter that would differ on the next call, followed by a jump back to the beginning. I.e. instead of having:
CurrentMethod(param0.next, param1, param2, /*...*/);
We have:
param0 = param0.next;
goto startOfMethod;
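In a self-contained form, the transformation looks roughly like this - a C++ sketch with invented names (the real code is the C# in the repository linked above), showing a tail-recursive walk over a chain of tables rewritten as a loop:

struct Table {
    Table* next = nullptr;
    // ... buckets and so on
};

// Original shape: naturally recursive, one tail call per chained table.
void resizeRecursive(Table* table, int newSize) {
    // ... do the work on this table ...
    if (table->next != nullptr)
        resizeRecursive(table->next, newSize);   // tail call
}

// Hand-applied tail-call elimination: same work, no stack growth.
void resizeIterative(Table* table, int newSize) {
    for (;;) {
        // ... do the work on this table ...
        if (table->next == nullptr) return;
        table = table->next;                     // replaces the recursive call
    }
}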
That being done, I measure again. Running the entire unit-test suite for the class is now consistently 13% faster than before. If it were closer I'd have tried more detailed measurements, but a consistent 13% on runs that include code that doesn't even call this method is something I'm pretty happy with. (It also tells me the compiler wasn't doing the same optimisation, or I wouldn't have gained anything.)
Then I clean up the method to make more changes that make sense with the new code. Most of them let me take out the goto because goto is indeed nasty (and there's other places the same optimisation was done that aren't as obvious because the goto was refactored entirely). In some, I left it in, because 13% is worth breaking the no-goto rule to my mind!
So the above gives an example of:
Deciding where to concentrate optimisation effort (based on how often it might be hit and my inability to predict all uses of the library)
Using knowledge of general costs (recursion costs more than iteration, most of the time).
Measuring rather than depending on assuming the above always applies.
Learning from what compilers do.
Understanding that because of that I may not gain anything - maybe the compiler already did it for me.
Avoiding optimisations leading to unreadable code (refactoring out most of the gotos the first pass introduced).
Some of these are matters of opinion and style (the decision to leave in some goto would not be without controversy), and it's certainly okay to disagree with my decisions, but knowledge of the points raised so far in this post would make it an informed disagreement, rather than a knee-jerk one.
In addition to the resources mentioned in other answers, Michael Abrash's Graphics Programming Black Book is a great read for learning about optimization. While the specifics are a bit dated in places, it is still a great resource for learning about how to approach optimization.
Any time you want to optimize code it is absolutely essential to measure, measure, measure. One of the best ways to learn about optimization is by doing - take some code you want to optimize, learn how to use a profiler to measure its performance and then make changes and measure the results.
I am designing a TTL serial computer, and I am struggling to choose an architecture more suitable for an LLVM compiler backend (I want to be able to run any C++ software on it). There will be no MMU, no multiplication/division, no hardware stack, and no interrupts.
I have 2 main options:
1) 8-bit memory, 8-bit ALU, 8-bit registers (~12-16). Memory address width is 24 bits, so I would need to use 3 registers for the IP and 3 registers for any memory address.
Needless to say, any address calculation would be pure pain to implement in the compiler.
2) 24-bit memory, 24-bit ALU, 24-bit registers (~6-8). Flat memory, nice. The drawback is that, due to the serial nature of the design, each operation would take 3 times as many clocks, even when operating on booleans. A 24-bit memory data width is expensive, and it's harder to implement in hardware in general.
The question is: do you think implementing all C++ features on this 8-bit, stackless hardware is possible, or do I need more complex hardware to get generated code of reasonable quality and speed?
I second the suggestion to use LCC. I used it in this homebrew 16-bit RISC project: http://fpgacpu.org/xsoc/cc.html .
I don't think it should make much difference whether you build the 8-bit variant and use 3 add-with-carries to increment IP, or the 24-bit variant and do the whole thing in hardware. You can hide the difference in your assembler.
If you look at my article above, or an even simpler CPU here: http://fpgacpu.org/papers/soc-gr0040-paper.pdf you will see you really don't need that many operators / instructions to cover the integer C repertoire. In fact there is an lcc utility (ops) to print the minimum operator set for a given machine.
For more information see my article on porting lcc to a new machine here: http://www.fpgacpu.org/usenet/lcc.html
Once I had ported lcc, I wrote an assembler, and it synthesized a larger repertoire of instructions from the basic ones. For example, my machine had load-byte-unsigned but not load-byte-signed, so I emitted this sequence:
lbs rd,imm(rs) ->
lbu rd,imm(rs)
lea r1,0x80
xor rd,r1
sub rd,r1
(The xor/sub pair is the standard sign-extension idiom: (x ^ 0x80) - 0x80 turns a zero-extended byte into its signed value.)

So I think you can get by with this minimum cover of operations:
registers
load register with constant
load rd = *rs
store *rs1 = rs2
+ - (w/ w/o carry) // actually can do + with - and ^
>> 1 // << 1 is just +
& ^ // (synthesize ~ from ^, | from & and ^)
jump-and-link rd,rs // rd = pc, pc = rs
skip-z/nz/n/nn rs // skip next insn on rs==0, !=0, <0, >=0
Even simpler is to have no registers (or equivalently blur registers with memory -- all registers have a memory address).
Set aside a register for SP, and write the function prolog/epilog handler in the compiler and you won't have to worry about stack instructions. There's just code to store each of the callee save registers, adjust the SP by the frame size, and so forth.
Interrupts (and return from interrupt) are straightforward. All you need to do is force a jump-and-link instruction into the instruction register. If you choose the bit pattern for that to be something like 0, and put the right addresses into the source register rs (especially if it is r0), it can be done with a flip-flop reset input or an extra force-to-0 AND gate. I use a similar trick in the second paper above.
Interesting project. I see a TTL / 7400 contest is underway, and I was thinking myself about how simple a machine you could get away with, and whether it would be cheating to add a 32 KB or 128 KB async SRAM to the machine to hold the code and data.
Anyway, happy hacking!
p.s.
1) You will want to decide how large each integral type is. You can certainly make char, short, int, long, long long, etc. all the same size, one 24-bit word, if you wish, although it won't be compliant with the minimum representation ranges.
2) And although I focused on lcc here, you were asking about C++. I recommend pursuing C first. Once you have things figured out for C, including the *, /, % operators in software, etc., it should be more tractable to move to full-blown C++, whether in LLVM or GCC. The difference between C and C++ is "only" the extra vtables and RTTI tables and code sequences (entirely built up out of the primitive C integer operator repertoire) required to handle virtual function calls, pointer-to-member dereference, dynamic casts, static constructors, exception handling, etc.
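To illustrate that last point - the C++-only machinery lowering to plain C-level operations - here is a rough sketch of what a virtual call amounts to after lowering: an indirect call through a per-class table of function pointers. Invented names; real ABIs differ in the details:

#include <cstdio>

struct Shape;                          // object layout below
struct ShapeVTable {
    void (*draw)(Shape*);              // one slot per virtual function
};

struct Shape {
    const ShapeVTable* vptr;           // hidden pointer installed by the constructor
    int x, y;
};

static void drawCircle(Shape* s) { std::printf("circle at %d,%d\n", s->x, s->y); }
static const ShapeVTable circleVTable = { &drawCircle };

int main() {
    Shape c{ &circleVTable, 3, 4 };    // "constructing" a Circle
    c.vptr->draw(&c);                  // what "shape->draw()" lowers to
    return 0;
}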
IMHO, it is possible for a C compiler; I am not sure about C++, though.
LLVM/Clang could be a hard choice for an 8-bit computer.
Instead, try lcc first, then LLVM etc. second. HTH.
Bill Buzbee succeeded in retargeting the lcc compiler for his Magic-1 (known as homebrewcpu).
Although the hardware design and construction of Magic-1 usually gets the most attention, the largest part of the project (by far) has been developing/porting the software. To this end, I've had to write an assembler and linker from scratch, retarget a C compiler, write and port the standard C libraries, write a simplified operating system and then port a more sophisticated one. It's been a challenge, but a fun one. I suppose I'm somewhat twisted, but I happen to enjoy debugging difficult problems. And, when the bug you're trying to track down could involve one or more of: hardware design flaw, loose or broken wire, loose or bad TTL chip, assembler bug, linker bug, compiler bug, C runtime library bug, or finally a bug in the program in question there's lot of opportunity for fun. Oh, and I also don't have the luxury of blaming the bugs on anyone else.
I'm continually amazed that the damn thing runs at all, much less runs as well as it does.
In my opinion, stackless hardware is already poorly suited for C and C++ code. If you have nested function calls, you will need to emulate a stack in software anyway, which of course is much slower.
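For a feel of what "emulate a stack in software" means on such a machine, here is a minimal C-style sketch: a statically allocated array plus a stack-pointer variable, which is roughly what the compiler's prolog/epilog code would have to maintain on every call. Illustrative only:

#include <cstdint>

// The hardware has no stack, so the compiler reserves memory and a register for one.
static std::uint8_t softStack[4096];
static std::uint32_t sp = sizeof(softStack);   // grows downwards

// What a call's prolog boils down to: make room for locals and saved registers.
static std::uint8_t* enterFrame(std::uint32_t frameSize) {
    sp -= frameSize;
    return &softStack[sp];
}

// And the matching epilog: release the frame again.
static void leaveFrame(std::uint32_t frameSize) {
    sp += frameSize;
}

Every call and return pays for this bookkeeping in ordinary instructions, which is why it is so much slower than a hardware stack.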
When going the stackless route, you will probably allocate most of your variables as 'static', and have no re-entrant functions. In this case, 6502-style addressing modes can be effective. You could for example have these addressing modes:
Immediate address (24bit) as part of opcode
Immediate address (24bit) plus index register (8bit)
Indirect access: immediate 24bit address to memory, which contains the actual address
Indirect access: 24 bit address to memory, 8 bit index register added to value from memory.
The address modes outlined above would allow efficient access to arrays, structures and objects allocated at a constant address (static allocation). They would be less efficient (but still usable) for dynamically and stack-allocated objects.
You would also get some benefit from your serial design: the 24-bit + 8-bit addition does not usually need to take 24 cycles, because you can short-circuit the addition once the carry is 0.
Instead of mapping the IP as registers directly, you could allow changing it only through goto/branch instructions, using the same address modes as above. Jumps into dynamically computed addresses are quite rare so it makes more sense to give the whole 24-bit address directly in the opcode.
I think that if you design the CPU carefully, you can use many C++ features quite efficiently. However, do not expect that any random C++ code would run fast on such a limited CPU.
The implementation is certainly possible, but I doubt it will be usable (at least for C++ code). As was already noted, the first problem is the lack of a stack. Next, a lot of C++ relies heavily on dynamic memory allocation, and C++'s "internal" structures are quite big.
So, as it seems to me, it will be better, if you:
Get rid of C++ requirement (or at least, limit yourself to some subset)
Use 24 bits, not 8 bits for everything (for registers as well)
Add hardware stack
You are not going to be able to run "any" C++ code there; for example fork(), system(), etc., or anything that clearly relies on interrupts. You can get a long way there, sure.
Now do you mean any programs that can/have been written in C++ or are you limiting yourself to the language only and not the libraries that are commonly associated with C/C++? The language itself is a much easier rule to live with.
I think the easier question is: why not just try? What have you tried so far? It could be argued that the x86 is an 8-bit machine: no regard for alignment and many 8-bit instructions. The MSP430 was ported to LLVM to show how easily and quickly it could be done; I would like to see that platform with better support (not where my strengths lie, otherwise I would be doing it). It is a 16-bit platform with no MMU. It does have a stack and interrupts, sure, but you don't have to use them, and if you remove the library rules then what is left that needs an interrupt?
I would look at LLVM, but note that the documentation showing how easy it is to port is dated and wrong; you basically have to figure it out on your own from the compiler sources. lcc has a book and is known for that, but it is not optimizing. Its sources don't compile well on modern computers; you always have to go backwards in time to use it, and any time I go near it, after an evening just trying to build it as-is, I give up. vbcc is simple, clean, documented, and not unfriendly to smaller processors. Is it C++? I don't remember. Of all of them it is the easiest to get a compiler up and running with, though. Of all of them, LLVM is the most attractive and most useful when all is said and done. Don't go near gcc or even think of it; it's duct tape and baling wire inside holding it together.
Have you invented your instruction set yet? Do you have a simulator and assembler yet? Look up lsasim on GitHub to find my instruction set. You can write an LLVM backend for mine as practice for yours... grin... (my vbcc backend is horrible, I need to start over)...
You have to have some idea of how the high level will be implemented, but you really have to start with an instruction set, an instruction-set simulator, and an assembler of some sort. Then start hand-converting C/C++ code into assembly for your instruction set; that should pretty quickly get you through "can I do this without a stack", etc. In this process define your calling convention, and implement more C/C++ code by hand using it. THEN dig into a compiler and make a back end. I think you should consider vbcc as a stepping stone, then head for LLVM if it looks like it (the ISA) will work.
I'm starting to build a real-time raytracer for iOS. I'm new to this raytracing thing; all I've done so far is write a rudimentary one in ObjC. It seems to me that a C-based raytracer is going to be faster than one written in ObjC, but the ObjC one will be far simpler, as object hierarchies come in very handy. Speed is very important, though, as I want it to be real-time, say 30 fps.
What's your opinion on whether the speed-up of C would be worth the extra complexity? I can foresee the C code taking much longer and causing me headaches with lots of bugs (although I'm not new to C), but going for more speed is seductive initially.
Are there any examples out there of raytracers written in C? My google search for such things is contaminated with lots of results for C++ and C#.
If you want fast ray tracing, you can pretty much forget about using either C or Objective C. You almost certainly want to use OpenCL. It's still not going to be enough to get you (even very close to) 30 fps, but it'll probably be at least twice as fast as anything running on the CPU (and 5-10 times faster wouldn't be any real surprise).
As zneak stated, C++ is the best combination of speed and polymorphism.
However, you can accomplish something close by reducing the ObjC calls (read: reduce the polymorphic interface to the minimum set required, then put the parts that need to be fast in plain C or C++).
ObjC message dispatch is quite fast, and you can typically remove many of the virtual/dynamic methods from your interfaces (assume every ObjC instance method is virtual). C code in ObjC methods is still C code... From there, determine where your bottlenecks are -- it doesn't hurt to profile before changing working code, either ;)
Writing a "realtime Raytracer" is without the use of Hand-Optimized Assembly (or the use of the "cheap" Intel compiler ;) , but this is not possible for this platform), impossible because you need the speed.
Furthermore, you need a lot of processing power, and I guess even the OpenCL path is not powerful enough (in my opinion this is the case even for real desktop machines; the reason is the lack of a really big cache on the graphics processor).
Have a look through http://ofps.oreilly.com/titles/9780596804824/ ; that is as close as you'll get.
It isn't ray tracing - I have written a ray tracer, and it is a huge amount of work. GL uses a different technique for graphics, hence it will be unable, for example, to render the way a diamond captures light. That link contains sample code you can download and run. You will realise that even some of the moderately complex examples really chug on an actual device... we are talking < 1 fps.
How can programming in assembly help in achieving optimization?
The most likely way programming in assembly can improve your code is by improving you: teaching you more about what is happening at a low level. That knowledge, and the discipline of optimization, can help you make good decisions in higher-level languages.
As far as actually helping one program: as others have noted it's rarely worth it. It's just possible you can use it as a kind of advanced profile-driven optimization: try many variations until you find one that's best on your particular problem.
To start with this: write a program in C or C++ or whatever compiled language you normally use, fire up your debugger, and disassemble a small but nontrivial function, and have a think about why the compiler did what it did. Then try writing a small bit of inline assembler yourself. On modern systems assembly is mostly easily embedded within C rather than done from scratch.
Or alternatively, get a teeny machine like a PIC and make it flash a LED...
These days, you have to be very good at assembly to beat the compiler.
I can do it any day of the week, but only by viewing the compiler's output first.
And then, if it gains more than a couple of percentage points I'd be surprised.
These days, I only program in assembly when I'm doing something the compiler can't do.
In principle, you can write highly-optimized code in assembly because the compiler is limited to specific, general-purpose optimizations that should apply to many programs, while you can be creative and use your knowledge of this particular program.
To take a simple example, back when I was new to this business compilers were very limited in their ability to optimize register usage. You know that to perform any sort of arithmetic or logical operation, the CPU must generally load one of the values into a register, then perform the operation on the other, then save the result? Like to add two numbers together -- and I'll use a pseudo-assembler here because I don't know what assembly languages you know and I've forgotten most of the details myself -- you'd write something like this:
LOAD A,value1
ADD A,value2
STORE A,destination
Compilers used to generate the loads for every operation. So if your C program said:
x=x+y;
z=z+x;
The compiler would generate something like:
LOAD A,x
ADD A,y
STORE A,x
LOAD A,z
ADD A,x
STORE A,z
But a human could observe that by the time we get to the second statement, register A already contains x, and addition is commutative, so we could optimize this to:
LOAD A,x
ADD A,y
STORE A,x
ADD A,z
STORE A,z
Et cetera. One could go through all sorts of tiny micro-optimizations like this. I used to do that all the time back when I was young and the world was green.
But over the years compilers have gotten much smarter, and CPUs have gotten more powerful so the micro-optimizations don't matter as much.
Thus, I haven't written any assembly language code in, wow, probably 15 years. I used to read the assembly generated by the compiler when debugging, sometimes it would give a clue to a subtle problem, but I haven't done that in years now either.
I don't think compilers are even written in assembly any more. Instead, you write the first draft of the compiler in a high level language on some other computer, i.e. you write a cross-compiler to get yourself off the ground.
I suspect the only real use of assembly today is for extremely constrained environments, embedded systems and that sort of thing; and for programs that have to deal intimately with the hardware, like device drivers.
I'd be interested to hear if there are any assembly programmers on this forum who care to tell us why they are assembly programmers.
Programming in assembly won't, in and of itself, optimize your code. The main thing about assembly is that it allows you to have very low-level access and to choose exactly what instructions the processor executes.
Since you won't have some compiler generating the assembly for you, you can perform code optimizations when you write the program yourself, if you know how.
So, you think you are smarter than the gcc optimizing compiler?
If not, then fughed aboud it (learning assembly for the sake of getting better at optimization). That would be akin to learning Scheme language for the sake of getting better at recursion :)
In general, the compiler will do a fairly good job at generating optimal code. There are, however, cases where writing your own assembly can result in even more optimized (in terms of space and/or speed) code.
Typically, this happens when there is something that you know about the target system that the compiler doesn't. Compilers are designed to work on a variety of systems; if you want to take advantage of something unique to your target system, sometimes you have to go in and do it yourself. Here's an example. A few months ago, I was writing some code for a MIPS-based embedded system. There are many different types of MIPS CPUs, and some support certain opcodes that others do not. My compiler would generate MIPS code using the set of assembly operations that all MIPS architectures support. However, I knew that my chip could do more. I had a subroutine that needed to count the number of leading zeroes in a 32-bit number. The compiler synthesized this into a loop that took about 10 lines of assembly to do. I re-wrote it in one line by using the CLZ opcode that was designed to do just this. I knew that my chip supported the opcode but the compiler didn't. Admittedly, situations like this aren't very common; when they do pop up, however, it's nice to have enough of a background in assembly to take advantage of them.
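The same idea is reachable without hand-written assembly on compilers that expose intrinsics: GCC and Clang provide __builtin_clz, which maps to a CLZ-style instruction where the target has one. A sketch comparing it with a portable loop (note that __builtin_clz(0) is undefined, hence the guard):

#include <cstdint>

// Portable fallback: roughly the shape of what a compiler emits without CLZ.
int countLeadingZerosLoop(std::uint32_t x) {
    int n = 0;
    for (std::uint32_t mask = 0x80000000u; mask != 0 && (x & mask) == 0; mask >>= 1)
        ++n;
    return n;
}

// With GCC/Clang this compiles down to a single CLZ-style instruction on targets that have one.
int countLeadingZeros(std::uint32_t x) {
    return x == 0 ? 32 : __builtin_clz(x);
}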
Sometimes one will need to perform a task which maps particularly well onto some CPU instructions, but does not fit well into any high-level-language constructs. For example, on many processors one may easily perform extended-precision arithmetic using something like:
add r0,r4
addc r1,r5
addc r2,r6
addc r3,r7
This will regard r3:r2:r1:r0 and r7:r6:r5:r4 as numbers four words long, adding the second to the first. Four nice easy instructions, and anyone who understands assembly would know what they do. I know of no way to perform the same task in C without it not only generating bigger and slower object code, but also being an incomprehensible mess of source code.
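For comparison, a portable C++ version has to recover each carry by hand, which is exactly the point: it is clumsier to read and typically compiles to more code than the four add/add-with-carry instructions. A sketch, assuming the numbers are stored as four 32-bit words, least significant first:

#include <cstdint>

// dst += src, both 128-bit values stored as four 32-bit words, least significant first.
void add128(std::uint32_t dst[4], const std::uint32_t src[4]) {
    std::uint32_t carry = 0;
    for (int i = 0; i < 4; ++i) {
        std::uint32_t a = dst[i];
        dst[i] = a + src[i] + carry;
        // Detect the carry out without a carry flag.
        carry = (dst[i] < a || (dst[i] == a && carry)) ? 1u : 0u;
    }
}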
A somewhat more extreme but specialized real-world example: Given two arrays array1[0..63] and array2[0..63], compute array1[0]*array2[0] + array1[1]*array2[1] + array1[2]*array2[2] ... + array1[63]*array2[63]. On a DSP I used, the computation could be done in machine code in about 75 machine cycles (about 67 of which are a repeating MAC instruction). There's no way C code could come anywhere close.
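The equivalent in plain C++ is just the obvious loop below; the DSP wins not because the algorithm differs but because its repeated MAC instruction folds the multiply, accumulate, loads and pointer updates into one cycle per element. Sketch, assuming 16-bit samples and a wide accumulator:

#include <cstdint>

// Sum of element-wise products of two 64-element arrays.
std::int64_t dot64(const std::int16_t array1[64], const std::int16_t array2[64]) {
    std::int64_t acc = 0;
    for (int i = 0; i < 64; ++i)
        acc += static_cast<std::int32_t>(array1[i]) * array2[i];
    return acc;
}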
About the only time I can think of using assembly language for optimizing code is when you need something very specific, like a GPIO on a microcontroller toggling between high and low exactly every 9 clock cycles. That's too short a time to manage with an interrupt, and higher-level language compilers don't normally offer this kind of control over the instruction stream.
Typically you wouldn't program in assembly. You would program in C, and then look at the generated assembly to see what optimizations (or not) the C compiler made automatically. Adjusting your C code (to allow for better vectorization, for example) will let the compiler re-arrange code better, which will give you optimized assembly.
More likely than being able to beat the compiler at writing assembly code: knowing how typical tasks translate to assembly may help you write better high-level-language code.
Typically you do not resort to assembly for optimization purposes. If this is possible, usually someone will already have provided the essential code ready for you to call, for example in the form of a linear algebra library.
Likewise, assembly offers direct access to the processor (e.g. for atomicity, time measurement, I/O), but the important accesses will already have been made accessible from your high-level language.
Compilers do a good job of generating assembler.
However, there's a bad reason why hand-written assembler is faster. Since it's harder to write, you write less of it.
It would be nice if programmers could discipline themselves to get the same job done in minimal code, regardless of language.
When writing assembly, or even just the straight raw bytes the assembler outputs, you can write programs that use hardware-specific features or do something otherwise very carefully specified.
There might be really high benefits if your program does the optimized part far more often than it does anything else. Always set up benchmarks before attempting optimizations.
The downside is that your hand-written assembly works on less hardware. It may even end up being limited to a particular hardware model and revision!
It's rare you ever can or need to write assembly routines, because commonly written software must work on almost every piece of hardware you find (and your kitten).
There's one interesting application if you know assembly. You can then write programs that produce assembly routines. Though it's mostly only fun unless you keep it really small so you can port it easily.
Read the Graphics Programming Black Book by Michael Abrash
In most modern applications, it can't to any significant degree.
Inter-Process Communication Affects Application Response Time explains why algorithms are unlikely to be bottlenecks. (But always profile - never guess.)
In general, programming in assembly will increase time-to-market, bug density, and maintenance costs. Instead, strive for simplicity and readability in your code.
As poolie mentioned, the main benefit of learning assembly today is a deeper understanding of software and hardware. From that perspective, there's quite a bit of information on Steve Gibson's site.
If you understood why there is sometimes a need to do asm, you would appreciate the strengths and the costs (headaches for you).