Control a robotic arm - robot

I have a Cyber Robot CYBER 310 and a Sciento CS-113 robotic arm with no documentation. Both use a parallel port.
How could I program those?
For the Cyber one, I found this:
Nothing at all on the Sciento one.
Any pointers or examples in Python/Java/C/whatever appreciated.
[update] This page contains some information, but I'm still lost: http://www.anf.nildram.co.uk/beebcontrol/arms/cyber/software.html

I am not entirely sure I understand what the question is.
Are you unfamiliar with with programming the parallel port?
My memory on it is hazy, but iirc it's pretty simple. It's a "dumb" interface so you simply need to write to it.
If you are running under linux then there are some great resources on it:
Linux Device Drivers: Chapter 9: An Overview of the Parallel port - Talks a bit about parallel port programming and goes on to talk about writing device drivers for it. A bit overkill I think for your application, but the entire book is fascinating, and enlightening.
Linux I/O port programming - essentially you can write to /dev/port, or include asm/io.h and use inb() and outb() (I haven't done this in a while, but im sure if you run into a specific problem there will be a multitude of answers out there once you have it narrowed down to something specific)
If you are on windows or mac, then id still suggest reading the above so you know what you are trying to do, they are straightforward in my opinion, then search for the windows/mac equivalent.
Now for what I assume the crux of the question is, what do you write to the ports?
For the Cyber 310 you have the pin layouts, although there seems to be multiple different pin layouts if you browse the site you have listed, and if we follow anf.nildram.co.uk here we can find some PIC assembly that will show us how to rotate the base.
I have never touched PIC assembly before today, but with some help from the internet and the comments, I think we can translate what this is trying to do (snipped out the relevant portion, as most of it is timing and looping )
; 6: Symbol prf = PORTA.0
; The address of 'prf' is 0x5,0
; 7: Symbol strobe = PORTA.1
; The address of 'strobe' is 0x5,1
; 8: Symbol base = PORTB.0
; The address of 'base' is 0x6,0
; 9: Symbol shoulder = PORTB.1
; The address of 'shoulder' is 0x6,1
...
; 16: main:
L0001:
; 17: base = 1
BSF 0x06,0 // set bit 0 at 0x06 to 1 essentially set base bit to 1
; 18: strobe = 1
BSF 0x05,1 // set strobe bit to 1
; 19: strobe = 0
BCF 0x05,1 // set strobe bit to 0
; 20: While a <> 730 // now we loop 729 more times
So it appears, from my naive perspective, that to rotate the arm you need to set the motor bits (grabbed from your pinout) then set and clear strobe.
Let me know if I am completely off base, this is a fascinating project.

Chris is right about the parallel port being a dumb interface. The parallel port has an address that you can output an 8bit binary number to that match the Digital Output's positions.
I found this to be a really good example of programming the Parallel port using C#.
http://www.codeproject.com/Articles/4981/I-O-Ports-Uncensored-1-Controlling-LEDs-Light-Emit
To match your project to his example. C0 is strobe. Then your Digital Outputs from left to right match his D0-D6.
Seems like a really fun project. Have fun.

Related

Accessing a combination of ports by adding both their offsets to a base address. How would this work?

Context: I am following an embedded systems course https://www.edx.org/course/embedded-systems-shape-the-world-microcontroller-i
In the lecture on bit specific addressing they present the following example on a "peanut butter and jelly port".
Given you a port PB which has a base address of 0x40005000 and you wanted to access both port 4 and port 6 from PB which would be PB6 and PB4 respectively. One could add the offset of port 4(0x40) and port 6(0x100) to the base address(0x40005000) and define that as their new address 0x40005140.
Here is where I am confused. If I wanted to define the address for PB6 it would be base(0x40005000) + offset(0x100) = 0x40005100 and the address for PB4 would be base(0x40005000) + offset(0x40) = 0x40005040. So how is it that to access both of them I could use base(0x40005000) + offset(0x40) + offset(0x100) = 0x40005140? Is this is not an entirely different location in memory for them individually?
Also why is bit 0 represented as 0x004. In binary that would be 0000 0100. I suppose it would represent bit 0 if you disregard the first two binary bits but why are we disregarding them?
Lecture notes on bit specific addressing:
Your interpretation of how memory-mapped registers are addressed is quite reasonable for any normal peripheral on an ARM based microcontroller.
However, if you read the GPIODATA register definition on page 662 of the TM4C123GH6PM datasheet then you will see that this "register" behaves very differently.
They map a huge block of the address space (1024 bytes) to a single 32-bit register. This means that bits[9:2] of the the address bus are not needed, and are in fact overloaded with data. They contain the mask of the bits to be updated. This is what the "offset" calculation you have copied is trying to describe.
Personally I think this hardware interface could be a very clever way to let you set only some of the outputs within a bank using a single atomic write, but it makes this a very bad choice of device to use for teaching, because this isn't the way things normally work.

STM32CubeMX I2C code writing to reserved register bits

I'm developing an I2C driver on the STM32F74 family processors. I'm using the STM32CubeMX Low Level drivers and I can't make sense of the generated defines for I2C start and stop register values (CR2).
The code is generated in stm32f7xx_ll_i2c.h and is as follows.
/** #defgroup I2C_LL_EC_GENERATE Start And Stop Generation
* #{
*/
#define LL_I2C_GENERATE_NOSTARTSTOP 0x00000000U
/*!< Don't Generate Stop and Start condition. */
#define LL_I2C_GENERATE_STOP (uint32_t)(0x80000000U | I2C_CR2_STOP)
/*!< Generate Stop condition (Size should be set to 0). */
#define LL_I2C_GENERATE_START_READ (uint32_t)(0x80000000U | I2C_CR2_START | I2C_CR2_RD_WRN)
/*!< Generate Start for read request. */
My question is why is bit 31 included in these defines? (0x80000000U). The reference manual (RM0385) states "Bits 31:27 Reserved, must be kept at reset value.". I can't decide between modifying the generated code or keeping the 31 bit. I'll happily take recommendations simply whether its more likely that this is something needed or that I'm going to break things by writing to a reserved bit.
Thanks in advance!
I am guessing here because who knows what was on the minds of the library authors? (Not a lot if you look at the source code!). But I would guess that it is a "dirty-trick" to check that when calling LL functions you are using the specified macros.
However it is severely flawed because the "trick" is only valid for Cortex-M3/4 STM32 variants (e.g. F1xx, F2xx, F4xx) where the I2C peripheral is very different and registers such as I2C_CR2 are only 15 bits wide.
The trick is that the library functions have parameter checking asserts such as:
assert_param(IS_TRANSFER_REQUEST(Request));
Where the IS_TRANSFER_REQUEST is defined thus:
#define IS_TRANSFER_REQUEST(REQUEST) (((REQUEST) == I2C_GENERATE_STOP) || \
((REQUEST) == I2C_GENERATE_START_READ) || \
((REQUEST) == I2C_GENERATE_START_WRITE) || \
((REQUEST) == I2C_NO_STARTSTOP))
This forces you to use the LL defined macros as parameters and not some self-defined or calculated mask because they all have that "unused" check bit in them.
If that truly is the the reason, it is an ill-advised practice that did not envisage the newer I2C peripheral. You might think that the bit was stripped from the parameter before being written to the register. I have checked, it is not. And if did you would be paying for that overhead on every call, which is also undesirable.
As an error detection technique if that is what it is, it is not even applied consistently; for example all the GPIO_PIN_xx macros are 16 bits wide and since they are masks not pin numbers, using bit 31 could for example guard against passing a literal pin-number 10 where the mask 1<<10 is in fact required. Passing 10 would refer to pins 3 and 1 not 10. And to be honest that mistake is far more likely than, passing an incorrect I2C transfer request type.
In the end however "Reserved" generally means "unused but may be used in future implementations", and requiring you to use the "reset value" is a way of ensuring forward binary compatibility. If you had such a device no doubt there would be a corresponding library update to support it - but it would require re-compilation of the code. The risk is low and probably only a problem if you attempt to run this binary on a newer incompatible part that used this bits.
I agree with Clifford, the ST CubeMC / HAL / LL library code is, in places, some of the worst written code imaginable. I have a particular issue with lines such as "TIMx->CCER &= ~TIM_CCER_CC1E" where TIM_CCER_CC1e is defined as 0x0001 and the CCER register contains reserved bits that should remain at 0. There are hundreds of such examples all throughout the library code. ST remain silent to my request for advice.

Bits Are Scrambled

The Problem: I send one value into a UART and nulls emerge on the other UART.
--- Details ---
These are both PIC processors (PIC24 and PIC32)
They are both hard wired onto a printed circuit board.
They are communicating, each via one of the UART modules which reside within them.
They are (ostensibly; according to docs) both configured for 115200 bps, 8-N-1
No handshaking, no CTS enabled, no RTS enabled; I'm just putting bytes on the wire and out they go.
(These are short little 4-byte commands and responses which fits pretty neatly)
The PIC32 is going 80 MHz.
The PIC24 has F[cy] = 14745600
i.e., it is going 14.7456 MHz
The PIC24 sends four bytes (a specific command sequence)
When I set a breakpoint at the Interrupt Service Routine for the UART, The PIC32 shows nulls, then I am seeing repeated hits on the (PIC32 code) breakpoint after the first four, and I continue to see nulls (which makes sense since the PIC24 is not sending anything)
i.e., the UART appears to be repeatedly generating interrupts when there is no reason
I did not write the code on the PIC32 side, and I am learning daily how it works.
Then I let the code just run, and I inevitably wind up on a line that says
52570 1D01_335C 9D01_335C _general_execption_handler sdbbp 0x0
When I get there,
The cause register holds 0080181C
The EPC register holds 9D00F228
The SP register holds 9F8FFFA0
This happened like clockwork, so I got suspicious of the __ISR that would not stop. MpLab showed me this...
432:
433: //*********************************************************//
434: void __ISR(_UART1_VECTOR, ipl5) IntUart1Handler(void) //MCU communication port
435: {
9D00F204 415DE800 rdpgpr sp,sp
9D00F208 401A7000 mfc0 k0,EPC
9D00F20C 401B6000 mfc0 k1,Status
9D00F210 27BDFF88 addiu sp,sp,-120
9D00F214 AFBA0074 sw k0,116(sp)
9D00F218 AFBB0070 sw k1,112(sp)
9D00F21C 7C1B7844 ins k1,zero,1,15
9D00F220 377B1400 ori k1,k1,0x1400
9D00F224 409B6000 mtc0 k1,Status
9D00F228 AFBF0064 sw ra,100(sp) ;<<<-------EPC register always points here
9D00F22C AFBE0060 sw s8,96(sp)
9D00F230 AFB9005C sw t9,92(sp)
9D00F234 AFB80058 sw t8,88(sp)
9D00F238 AFAF0054 sw t7,84(sp)
9D00F23C AFAE0050 sw t6,80(sp)
9D00F240 AFAD004C sw t5,76(sp)
9D00F244 AFAC0048 sw t4,72(sp)
9D00F248 AFAB0044 sw t3,68(sp)
9D00F24C AFAA0040 sw t2,64(sp)
9D00F250 AFA9003C sw t1,60(sp)
9D00F254 AFA80038 sw t0,56(sp)
9D00F258 AFA70034 sw a3,52(sp)
9D00F25C AFA60030 sw a2,48(sp)
9D00F260 AFA5002C sw a1,44(sp)
9D00F264 AFA40028 sw a0,40(sp)
9D00F268 AFA30024 sw v1,36(sp)
9D00F26C AFA20020 sw v0,32(sp)
9D00F270 AFA1001C sw at,28(sp)
9D00F274 00001012 mflo v0
9D00F278 AFA2006C sw v0,108(sp)
9D00F27C 00001810 mfhi v1
9D00F280 AFA30068 sw v1,104(sp)
9D00F284 03A0F021 addu s8,sp,zero
I look a little more closely at the numbers, and I see that at that time, if we add 100 (0x64) to FFA0 (the bottom 16 bits of the SP) we get 0x10004, which I am guessing is 4 too much.
PIC Manual DS61143E-page 50 says that that nomenclature means, SW Store Word Mem[Rs+offset> = Rt and other experts have told me that the cause register is telling me that the EXCCODE bits are 7 which is the code for a bus exception on load or store.
Or, I'm totally guessing here (would love to get some experts' knowledge on this) something is not clearing something and I'm encountering infinite recursion on an int handler.
All of this is starting to make sense.
THE QUESTION
Could someone please suggest the most common reasons for an int like this to be repeatedly hitting me ?
Does anyone see any common relationship between the bogus nuls coming from the UART which could somehow be connected with this endlessly generated int ? Am I even on the right track ?
In your answer, please tell me how to acknowledge the Int from the UART. I know how I do that in the PIC24 (I wrote that code totally, in ASM) but I don't know how this is done in in C on the PIC32. Assembly will be fine. I'll inline it. I'm working with code I didn't write here, and I thank you for your answers
What is the most common reason that the UART (#1, in this case) would be repeatedly generating interrupts ?
The most common reason an interrupt subroutine is called over and over is that the interrupt request is never acknowledged in the routine.
Are you sure you clear the corresponding IRQ bit?
To ease UART debugging you should first connect the UART to a PC and make sure your target can communicate both ways with the PC. With two targets at the same time, you can't determine on which one the problem is apart from inspecting signals with a scope.
On many devices an interrupt must be explicitly cleared to prevent the ISR from simply re-entering when complete.
In most cases a UART will have status bits that indicate the source of the interrupt, knowing that might tell you something, but not telling us makes it difficult to help you. You can inspect the UART registers directly in the debugger, however in some devices the act of reading a bit may in fact clear a bit, and that is true in the debugger too, so be aware of that possibility (check the data sheet/user manual).
Some UARTS require their transmitter to be explicitly switched off to stop transmitting nulls, while others are triggered automatically when data is placed in the tx register and stop after the necessary number of bits are shifted out. Again check the data sheet/manual for the part. If the PIC32 code is known to be working, then since this possible error would be with the PIC24 code, it seems to fit. You can check this simply by using an oscilloscope on the Tx line from the PIC24, if it is transmitting, you will see at least start/stop bit transitions (framing). If there is nothing, then the problem is probably at the PIC32 end.
While you have the scope out, you can check that the bit timing is correct and that you are actually transmitting at 115200. It is easy to get the clocking wrong, and that should be your first check. If the baud rate is incorrect, the PIC32 will likely generate framing error interrupts, which if not handled may persist indefinitely.
Another possibility is that after transmission the PIC24 leaves the line in the "break" state, and that the PIC32 UART is generating "line-break" interrupts. That is why it is important to look at the UART status registers to determine the interrupt cause.
As you can see, there are many possibilities; I think I have covered the most likely ones, but more methodical debugging effort and information gathering on your part is required. I hope I have guided you in this too.
There were the three root causes which were in place...
The interrupt priority level was set at value 6 in the initialization code for UART1
The first line of the interrupt service routine was coded with an interrupt priority level of 5
The first three bytes of UART data were disappearing from the data stream (this is still unsolved)
Here's the not-so-obvious way they were causing the problem
First three bytes never appeared
Fourth byte did appear
Interrupt hit (as level 6) and invoked __ISR routine
__ISR was acting as ipl5 agent
First instruction executed (possibly more, I couldn't debug that accurately)
As soon as the first instruction finished, the "higher" priority 6 interrupt immediately kicked in
This resulted in the same interrupt again
The process repeated itself infinitely.
In short order, Stack Overflow resulted
The Fix
Make sure these two lines of code agree with each other...
The IPL line in the init code, wrong way then the right way
//IPC6bits.U1IP=6; //// Wrong !!! Uart 1 IPL should not be 6 !!!
IPC6bits.U1IP=5; //// Uart 1 IPL = 5 Correct way; matches __ISR
Interrupt Service Routine
void __ISR(_UART1_VECTOR, ipl5) IntUart1Handler(void) //// Operating as IPL 5
:
:
:
:
Poor design decision. If both are on same board SPI would have been more feasible and a lot faster.

Writing .hex file to Internal FlashROM of 8051 microcontroller using SPI bus

I am doing a firmware upgrade using SPI bus on EEPROM as well as Internal ROM of 8051, basically writing a .hex file on both these memory devices.I am able to see .hex file written there.I am able to see slave and master are communicating properly, but not able to write anything on my memory devices.
If you have suggestions and if you have faced similar problems, please let me know where is the actual problem.
Any inputs would be welcomed.
Regards,
Ravi
I think more information will likely be required. In any case, here a few pitfalls I could see:
Hex Files are not necessarily memory images. The 8051s I've worked with usually use Intel Hex which is an ASCII format that describes the memory. The format is well documented here.
If you're having trouble writing to the EEPROM, you may not be writing the proper instructions. Typically, SPI EEPROM will be Byte addressed, but still has paging internally. You should start your writes on a Page boundary and write the whole page, then issue another write command, etc. By convention if you overrun a page, or start in the middle of a page it will loop around. So if your page is 8 bytes long, and you start writing 0-7 starting at index 4, you'll get:
Page Start: Index 0 = 4
Index 1 = 5
Index 2 = 6
Index 3 = 7
Index 4 = 0
Index 5 = 1
Index 6 = 2
Index 7 = 3
Most EEPROMs have locking mechanisms to prevent accidental writes once they are finalized. If the lock has been set, you will need to write an unlocking method (this will be detailed in the data sheet if it has it)
To further help you, please reference part numbers and better yet Data Sheets if you can.

What are the most hardcore optimisations you've seen?

I'm not talking about algorithmic stuff (eg use quicksort instead of bubblesort), and I'm not talking about simple things like loop unrolling.
I'm talking about the hardcore stuff. Like Tiny Teensy ELF, The Story of Mel; practically everything in the demoscene, and so on.
I once wrote a brute force RC5 key search that processed two keys at a time, the first key used the integer pipeline, the second key used the SSE pipelines and the two were interleaved at the instruction level. This was then coupled with a supervisor program that ran an instance of the code on each core in the system. In total, the code ran about 25 times faster than a naive C version.
In one (here unnamed) video game engine I worked with, they had rewritten the model-export tool (the thing that turns a Maya mesh into something the game loads) so that instead of just emitting data, it would actually emit the exact stream of microinstructions that would be necessary to render that particular model. It used a genetic algorithm to find the one that would run in the minimum number of cycles. That is to say, the data format for a given model was actually a perfectly-optimized subroutine for rendering just that model. So, drawing a mesh to the screen meant loading it into memory and branching into it.
(This wasn't for a PC, but for a console that had a vector unit separate and parallel to the CPU.)
In the early days of DOS when we used floppy discs for all data transport there were viruses as well. One common way for viruses to infect different computers was to copy a virus bootloader into the bootsector of an inserted floppydisc. When the user inserted the floppydisc into another computer and rebooted without remembering to remove the floppy, the virus was run and infected the harddrive bootsector, thus permanently infecting the host PC. A particulary annoying virus I was infected by was called "Form", to battle this I wrote a custom floppy bootsector that had the following features:
Validate the bootsector of the host harddrive and make sure it was not infected.
Validate the floppy bootsector and
make sure that it was not infected.
Code to remove the virus from the
harddrive if it was infected.
Code to duplicate the antivirus
bootsector to another floppy if a
special key was pressed.
Code to boot the harddrive if all was
well, and no infections was found.
This was done in the program space of a bootsector, about 440 bytes :)
The biggest problem for my mates was the very cryptic messages displayed because I needed all the space for code. It was like "FFVD RM?", which meant "FindForm Virus Detected, Remove?"
I was quite happy with that piece of code. The optimization was program size, not speed. Two quite different optimizations in assembly.
My favorite is the floating point inverse square root via integer operations. This is a cool little hack on how floating point values are stored and can execute faster (even doing a 1/result is faster than the stock-standard square root function) or produce more accurate results than the standard methods.
In c/c++ the code is: (sourced from Wikipedia)
float InvSqrt (float x)
{
float xhalf = 0.5f*x;
int i = *(int*)&x;
i = 0x5f3759df - (i>>1); // Now this is what you call a real magic number
x = *(float*)&i;
x = x*(1.5f - xhalf*x*x);
return x;
}
A Very Biological Optimisation
Quick background: Triplets of DNA nucleotides (A, C, G and T) encode amino acids, which are joined into proteins, which are what make up most of most living things.
Ordinarily, each different protein requires a separate sequence of DNA triplets (its "gene") to encode its amino acids -- so e.g. 3 proteins of lengths 30, 40, and 50 would require 90 + 120 + 150 = 360 nucleotides in total. However, in viruses, space is at a premium -- so some viruses overlap the DNA sequences for different genes, using the fact that there are 6 possible "reading frames" to use for DNA-to-protein translation (namely starting from a position that is divisible by 3; from a position that divides 3 with remainder 1; or from a position that divides 3 with remainder 2; and the same again, but reading the sequence in reverse.)
For comparison: Try writing an x86 assembly language program where the 300-byte function doFoo() begins at offset 0x1000... and another 200-byte function doBar() starts at offset 0x1001! (I propose a name for this competition: Are you smarter than Hepatitis B?)
That's hardcore space optimisation!
UPDATE: Links to further info:
Reading Frames on Wikipedia suggests Hepatitis B and "Barley Yellow Dwarf" virus (a plant virus) both overlap reading frames.
Hepatitis B genome info on Wikipedia. Seems that different reading-frame subunits produce different variations of a surface protein.
Or you could google for "overlapping reading frames"
Seems this can even happen in mammals! Extensively overlapping reading frames in a second mammalian gene is a 2001 scientific paper by Marilyn Kozak that talks about a "second" gene in rat with "extensive overlapping reading frames". (This is quite surprising as mammals have a genome structure that provides ample room for separate genes for separate proteins.) Haven't read beyond the abstract myself.
I wrote a tile-based game engine for the Apple IIgs in 65816 assembly language a few years ago. This was a fairly slow machine and programming "on the metal" is a virtual requirement for coaxing out acceptable performance.
In order to quickly update the graphics screen one has to map the stack to the screen in order to use some special instructions that allow one to update 4 screen pixels in only 5 machine cycles. This is nothing particularly fantastic and is described in detail in IIgs Tech Note #70. The hard-core bit was how I had to organize the code to make it flexible enough to be a general-purpose library while still maintaining maximum speed.
I decomposed the graphics screen into scan lines and created a 246 byte code buffer to insert the specialized 65816 opcodes. The 246 bytes are needed because each scan line of the graphics screen is 80 words wide and 1 additional word is required on each end for smooth scrolling. The Push Effective Address (PEA) instruction takes up 3 bytes, so 3 * (80 + 1 + 1) = 246 bytes.
The graphics screen is rendered by jumping to an address within the 246 byte code buffer that corresponds to the right edge of the screen and patching in a BRanch Always (BRA) instruction into the code at the word immediately following the left-most word. The BRA instruction takes a signed 8-bit offset as its argument, so it just barely has the range to jump out of the code buffer.
Even this isn't too terribly difficult, but the real hard-core optimization comes in here. My graphics engine actually supported two independent background layers and animated tiles by using different 3-byte code sequences depending on the mode:
Background 1 uses a Push Effective Address (PEA) instruction
Background 2 uses a Load Indirect Indexed (LDA ($00),y) instruction followed by a push (PHA)
Animated tiles use a Load Direct Page Indexed (LDA $00,x) instruction followed by a push (PHA)
The critical restriction is that both of the 65816 registers (X and Y) are used to reference data and cannot be modified. Further the direct page register (D) is set based on the origin of the second background and cannot be changed; the data bank register is set to the data bank that holds pixel data for the second background and cannot be changed; the stack pointer (S) is mapped to graphics screen, so there is no possibility of jumping to a subroutine and returning.
Given these restrictions, I had the need to quickly handle cases where a word that is about to be pushed onto the stack is mixed, i.e. half comes from Background 1 and half from Background 2. My solution was to trade memory for speed. Because all of the normal registers were in use, I only had the Program Counter (PC) register to work with. My solution was the following:
Define a code fragment to do the blend in the same 64K program bank as the code buffer
Create a copy of this code for each of the 82 words
There is a 1-1 correspondence, so the return from the code fragment can be a hard-coded address
Done! We have a hard-coded subroutine that does not affect the CPU registers.
Here is the actual code fragments
code_buff: PEA $0000 ; rightmost word (16-bits = 4 pixels)
PEA $0000 ; background 1
PEA $0000 ; background 1
PEA $0000 ; background 1
LDA (72),y ; background 2
PHA
LDA (70),y ; background 2
PHA
JMP word_68 ; mix the data
word_68_rtn: PEA $0000 ; more background 1
...
PEA $0000
BRA *+40 ; patched exit code
...
word_68: LDA (68),y ; load data for background 2
AND #$00FF ; mask
ORA #$AB00 ; blend with data from background 1
PHA
JMP word_68_rtn ; jump back
word_66: LDA (66),y
...
The end result was a near-optimal blitter that has minimal overhead and cranks out more than 15 frames per second at 320x200 on a 2.5 MHz CPU with a 1 MB/s memory bus.
Michael Abrash's "Zen of Assembly Language" had some nifty stuff, though I admit I don't recall specifics off the top of my head.
Actually it seems like everything Abrash wrote had some nifty optimization stuff in it.
The Stalin Scheme compiler is pretty crazy in that aspect.
I once saw a switch statement with a lot of empty cases, a comment at the head of the switch said something along the lines of:
Added case statements that are never hit because the compiler only turns the switch into a jump-table if there are more than N cases
I forget what N was. This was in the source code for Windows that was leaked in 2004.
I've gone to the Intel (or AMD) architecture references to see what instructions there are. movsx - move with sign extension is awesome for moving little signed values into big spaces, for example, in one instruction.
Likewise, if you know you only use 16-bit values, but you can access all of EAX, EBX, ECX, EDX , etc- then you have 8 very fast locations for values - just rotate the registers by 16 bits to access the other values.
The EFF DES cracker, which used custom-built hardware to generate candidate keys (the hardware they made could prove a key isn't the solution, but could not prove a key was the solution) which were then tested with a more conventional code.
The FSG 2.0 packer made by a Polish team, specifically made for packing executables made with assembly. If packing assembly isn't impressive enough (what's supposed to be almost as low as possible) the loader it comes with is 158 bytes and fully functional. If you try packing any assembly made .exe with something like UPX, it will throw a NotCompressableException at you ;)