What does programming for PS3's Cell Processor entail?

How is programming for the Cell Processor on the PS3 different than programming for any other processor found on a normal desktop?
What kind of programming paradigms, techniques, and practices are used to fully utilize the Cell Processor's potential?
All the articles I hear concerning PS3 development discuss, "Learning how to program on the Cell Processor." What does this really mean beyond some hand waving?

In addition to everything George mentions, the SPUs are really better thought of as streaming vector processors. They work best when you have an algorithm that works on long sequences of numerical data, which can be fed through the SPU's limited memory via DMA, rather than having the SPU load a chunk of memory, try to operate on it, find that it needs to follow a pointer to somewhere outside its memory, load that, keep going, find another one, and so on.
So, programming for them isn't a simple model of concurrency and threads; it's more like high performance numerical or scientific computation. It is also non-uniform memory access taken to an extreme.
Furthermore, every processor is in-order with deep pipelines, so the programmer has to be much more aware of data hazards and instruction bubbles and all the numerous micro-optimizations that we are told the compiler "should" take care of for us (but it really doesn't). Things like mispredicted branches, load-hit-stores, cache misses, etc. hurt a lot more than they would on an out-of-order processor that could juggle the order of operations around to hide such latencies.
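To make the streaming model concrete, here is a rough sketch of the classic double-buffered SPU loop, assuming the Cell SDK's spu_mfcio.h DMA interface; the chunk size, effective address, and the process() routine are placeholders, and the total size is assumed to be a multiple of the chunk size:

    /* Hypothetical double-buffered streaming loop on an SPU. While one chunk is
     * being processed, the DMA engine fetches the next one into the other buffer. */
    #include <spu_mfcio.h>
    #include <stdint.h>

    #define CHUNK 4096  /* bytes per DMA: multiple of 16, at most 16 KB */

    static uint8_t buf[2][CHUNK] __attribute__((aligned(128)));

    static void process(uint8_t *data, unsigned n) { /* SIMD work here */ }

    void stream(uint64_t ea, uint64_t total)
    {
        unsigned cur = 0;
        mfc_get(buf[cur], ea, CHUNK, cur, 0, 0);           /* prefetch the first chunk */

        for (uint64_t off = 0; off < total; off += CHUNK) {
            unsigned next = cur ^ 1;
            if (off + CHUNK < total)                       /* start fetching the next chunk */
                mfc_get(buf[next], ea + off + CHUNK, CHUNK, next, 0, 0);

            mfc_write_tag_mask(1 << cur);                  /* wait for the current chunk */
            mfc_read_tag_status_all();

            process(buf[cur], CHUNK);                      /* work while the other DMA runs */
            cur = next;
        }
    }

Results go back with mfc_put in the same tag-and-wait style; the point is that the SPU never chases pointers into main memory, it only consumes what the DMA engine has already delivered.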
For concrete examples, check out Mike Acton's CellPerformance blog. Mike is my favorite old-school assembly-happy perf curmudgeon in the business, and he's really earned his chops on this issue.

The Cell part of the PS3 consists of 6 SPU processors. They each have 256 KB of non-shared memory and are connected via a high-speed ring that allows for DMA between each other and the PowerPC host processor. They have no cache and execute strictly in order. This makes them rather different from a multi-core x86 with shared memory, out-of-order execution and caching. Also, the SPU processors do not use the same instruction set as the PowerPC, so you've got some asymmetry there.
In short, your typical shared-memory, multithreaded program won't just drop onto the Cell without some work (with the caveat that computer science works hard at making different machines appear to be the same so some implementors try hard to automate the process).
At a high level the program will need to be broken up into tasks that fit within the Cell's hard memory limit. Those can run in parallel, and each sub-task can be sequenced to an available SPU. At a low level, the compiler (or assembly programmer) will need to work harder to generate code that runs quickly on a processor -- no run-time trickery to make things go faster is available. The theory is that those programmer/compiler-friendly features cost silicon and speed that can be better spent giving you more and faster SPUs. Of course, you're not getting any more SPUs on the PS3, but in the general case you'll get more SPUs per transistor available on chip.
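As a rough illustration of that task split (names and sizes are made up, not from any SDK), the kind of self-contained task descriptor this implies might look like:

    #include <stdint.h>

    /* The PPU fills one of these in; the SPU DMAs it over, pulls the input,
     * runs the selected kernel entirely inside its 256 KB local store, and
     * DMAs the result back out. */
    typedef struct {
        uint64_t input_ea;     /* main-memory address of this task's input       */
        uint64_t output_ea;    /* main-memory address for the result             */
        uint32_t input_size;   /* must fit in local store alongside code + stack */
        uint32_t output_size;
        uint32_t kernel_id;    /* which routine the SPU should apply             */
        uint32_t pad;          /* keep the struct a multiple of 16 bytes for DMA */
    } task_desc;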

Completely agree with George Philips and Crashworks. Only thing I'd add is that SPU programming is fundamentally about job management. To get the best out of the SPUs you need to keep them ticking over and feeding back results. There's no point in having one SPU chewing through some complex post-processing if you're having to sit and wait a frame for the results while the rest of your SPUs sit idle. So how you distribute your jobs requires a lot of thought, and this has a big impact on how you chunk up your data.

"All the articles I hear concerning PS3 development discuss, 'Learning how to program on the Cell Processor.' What does this really mean beyond some hand waving?"
Well, stuff you have to deal with on SPUs...
Atomic operations (lock-free try-discard style).
Strong distinction between memory areas. You have to know which pointer is pointing to which memory area or you'll screw everything up.
No enforced hardware distinction between data and code. This is actually a fun thing: you can set up dynamic code loading and essentially stream subroutines in and out. Self-modifying code is possible but not necessarily practical on SPU.
Lack of hardware debugging aids.
Limited memory size.
Fast memory access.
Instruction set balanced toward SIMD operations.
Floating point "gotchas".
You ideally want to keep the SPUs doing useful work all of the time, but it's really challenging. Not only are they not well suited for handling some types of problems, but often moving a system to be efficient on SPU can involve a complete redesign. Debugging problems that would be easy to catch on the PPU can sometimes take days on SPU.
I think when people use the phrase "learning how to program the cell" they are mostly hand waving. You can learn the basics in a week, the challenge comes in trying to apply that knowledge to real code... which often already exists and isn't in a form well-suited for use on SPU.

Related

Is it recommended to use SPI flash to run code instead of internal flash, due to the memory limitation of the internal flash?

We are using an LPC546xx family microcontroller in our project. Currently, at the initial stage, we are finalizing the software and hardware requirements. The basic firmware (which contains the RTOS, a 3rd-party stack, libraries, etc.) is currently 480 KB. Once the full application is developed, the size will exceed the internal flash size (512 KB), and we also need storage that can hold a firmware update image separately.
So we plan to use a 4 MB/8 MB SPI flash (S25LP064A-JBLE, http://www.issi.com/WW/pdf/IS25LP032-064-128.pdf, serial flash memory) to boot and run the firmware.
Is it recommended to run code from SPI flash? How can I map external flash memory directly into the CPU memory space? Can anyone give an example that contains this memory mapping (linker script etc.) or a demo application in which the LPC546xx uses SPI flash?
Generally speaking it's not recommended, or put differently: the closer to the CPU, the better. However, both the S25LP064A and the LPC546xx support XIP (execute in place), so it is viable.
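For the memory-mapping part of the question, a minimal GNU ld sketch could look roughly like the fragment below. The origins and lengths are placeholders to check against the LPC546xx user manual (the SPIFI window and the RAM size in particular), and the bootloader still has to initialise the SPIFI controller before any code in that region runs:

    /* Hypothetical memory map: internal flash, memory-mapped SPIFI (XIP), SRAM. */
    MEMORY
    {
      FLASH_INT (rx)  : ORIGIN = 0x00000000, LENGTH = 512K  /* internal flash            */
      FLASH_EXT (rx)  : ORIGIN = 0x10000000, LENGTH = 8M    /* SPIFI window, placeholder */
      SRAM      (rwx) : ORIGIN = 0x20000000, LENGTH = 160K  /* placeholder size          */
    }

    SECTIONS
    {
      /* Code tagged with __attribute__((section(".ext_text"))) in C executes
         in place from the external flash window. */
      .ext_text :
      {
        *(.ext_text*)
      } > FLASH_EXT
    }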
This is not a trivial issue, as many aspects come into play; i.e. the issue is best avoided and should really have been ironed out in the planning stage. Embedded systems are more about compromising than anything, and making the right/better choices takes skill and experience.
Same question with replies on the NXP forum: link
512 KB of NVRAM is huge. There is almost certainly room for optimisations even if 3rd-party libraries are used.
On a related note this discussion concerning XIP should give valuable insight: link.
I would strongly encourage the use of a file system if you don't have one already, for which external storage is much better suited; the further from the computational unit, the more relevant this becomes. That's not XIP, and the penalty is copy-to-RAM either way you do it, i.e. performance will be slower. But in my experience, the need for speed has often not been thoroughly considered and is at least partially overestimated.
Regarding your mentioning of RTOS and FW-upgrade:
Unless it's a poor RTOS, there's file-system awareness built in. Especially for FW upgrading (note: you'll need room for three images, factory reset included), unless it's already supported by the SoC vendor by some other means (OTA), a file system will make life much easier and less risky. If there's no FS awareness, it can be added.
FW upgrade requires a lot of extra storage; the simpler the scheme, the more storage it needs. Simpler is, however, also safer, which matters hugely for FW upgrades. In the simplest case (a flat binary image), you'll need at least twice the amount of memory you're already consuming.
All-in-all: I think the direction you're going is viable and depending on the actual situation perhaps your only choice.

Does laptop Embedded Controller have limited writes?

I am wondering if I should be worried about excessive writes to the embedded controller registers on my laptop. I am guessing that if they are true registers, they probably act more like RAM than flash memory, so this isn't a problem.
However, I have a script to modify the registers in my laptop's EC to better control the fan speed curve. It has to be re-applied after each power change event such as sleep/wake as well as power cable events, so it happens fairly often. I just want to make sure I am not burning out my chips in the process.
The script I am using to write to the EC is located here:
https://github.com/RayfenWindspear/perl-acpi-fanspeed
Well, it seems you're writing to ACPI registers. Registers here do not refer to any specific hardware; it just means it's a specific address that you can reach using a specific bus. It's however highly unlikely that something that you have to re-write after every power cycle is overwriting permanent storage, so for all practical aspects I'd assume that you can rely on this for as long as your laptop lives.
Hardware peripherals are almost universally implemented as SRAM cells. They will not wear out first. The fan you are controlling will have a limited number of start/stop cycles. So it is much more likely that the act of toggling these registers will wear something else out prematurely (than the SRAM type memory cell itself).
To your particular case, correctly driving a fan/motor can significantly improve its lifetime. Overdriving a fan/motor does not always make it go faster, but instead creates heat. The heat weakens the wiring and eventually the coils will short, reducing drive and eventually wearing out. That said, the element being cooled can be damaged by excess heat, so tuning things just to reduce noise may not make sense.
Background details
Generally, the storage element is called a flip-flop, which comes in various forms. SystemRDL is an example, as are SystemC and other languages in which digital engineers model these. In digital hardware, the flip-flops have default or reset values. These are fixed like ROM on each chip and are not normally re-programmable; they use EEPROM technology (see Note 1), or are often configured via input lines which the hardware designer can pull high/low with a resistor or connect to another element's GPIO.
It is analogous to 'initdata': program values that aren't zero get copied from flash, disk, etc. to memory at program startup. So the flip-flops normally do not hold state over a power cycle; something else does this.
'Flash' technology is based on a floating gate (invented in 1967) and uses Fowler-Nordheim 'quantum tunnelling' to program the floating gates; this process is slightly destructive. The electronics industry did not produce flash widely until the early 90s, with NOR flash followed by NAND flash and many variants, but the underlying physics is the same; just the digital connections are different. So, as well as the wear you are concerned about, flash technology actually came after many hardware chips such as the 68k, i386, etc. 'Flip-flops' were well established, the 'register' portion is not that great a fraction of a typical chip's logic, and a flip-flop uses the same logic (gates) as the rest of the chip. Meaning that using flash for registers would be extra overhead with little benefit.
An additional take-away is that the startup and shutdown of chips is usually the most destructive time. Often, poor hardware designs lack proper voltage supervision, and some lines may be floating with the expectation that system programs will set them immediately. Reset events, ESD, overheating, etc. will all be more harmful than the mere act of writing a peripheral register.
Note 1: EEPROM typically has 100,000+ cycles. These features are typically only used once at manufacture time to set a chip configuration for the system. These are actually quite rare, but possible.
The MLC (multi-level cell) NAND flash in SSDs has pathetically low endurance, around 8,000 cycles in some cases. SLC (single-level cell) old-school flash has 10,000+ cycles, but people demand large capacities.

Optimal way to move memory in x86 and ARM?

I am interested in knowing the best approach for bulk memory copies on an x86 architecture. I realize this depends on machine-specific characteristics. The main target is typical desktop machines made in the last 4-5 years.
I know that in the old days REP MOVSD was nominally the fastest approach because you could move 4 bytes at a time, but I have read that nowadays REP MOVSB is just as fast and is simpler to write, so you may as well do a byte move and just forget about the complexities of a 4-byte move.
A surrounding question is whether MOVxx instructions are worth it at all. If the CPU can run so much faster than the memory bus, then maybe it is pointless to use a CISC move and you may as well use a plain MOV loop. This would be most attractive because then I could use the same algorithms on other processor architectures like ARM. This brings up the analogous question of whether ARM's specialized instructions for bulk memory moves (which are totally different from Intel's) are worth it or not.
Note: I have read section 3.7.6 in the Intel Optimization Reference Manual so I am familiar with the basics. I am hoping someone can relate practical experience in the area beyond what is in this manual.
Modern Intel and AMD processors have optimisations on REP MOVSB that make it copy entire cache lines at a time if it can, making it the best (may not be fastest, but pretty close) method of copying bulk data.
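For illustration, a bulk copy via REP MOVSB can be wrapped in GCC-style inline assembly like this (a sketch for x86/x86-64; the function name is made up, and for small or unaligned copies the library memcpy may still win):

    #include <stddef.h>

    /* Copy n bytes with REP MOVSB. The direction flag is clear per the ABI,
     * so the copy ascends; the "memory" clobber tells the compiler the
     * buffers have changed. */
    static void copy_rep_movsb(void *dst, const void *src, size_t n)
    {
        __asm__ volatile ("rep movsb"
                          : "+D" (dst), "+S" (src), "+c" (n)
                          :
                          : "memory");
    }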
As for ARM, it depends on the architecture version, but in general using an unrolled loop would be the most efficient.
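A portable sketch of the unrolled-loop idea (head/tail and alignment handling omitted; on ARM a compiler will typically turn this into LDM/STM or LDP/STP pairs):

    #include <stdint.h>
    #include <stddef.h>

    /* Copy four 32-bit words per iteration. Assumes 4-byte-aligned pointers
     * and a byte count that is a multiple of 16. */
    static void copy_unrolled(uint32_t *dst, const uint32_t *src, size_t n_bytes)
    {
        for (size_t i = 0; i < n_bytes / 4; i += 4) {
            dst[i + 0] = src[i + 0];
            dst[i + 1] = src[i + 1];
            dst[i + 2] = src[i + 2];
            dst[i + 3] = src[i + 3];
        }
    }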

What Skill set should a low level programmer possess?

I am an embedded SW Engineer, with less than 3 yrs of experience. I aim to "sharpen the saw" continuously. I was wondering if there was anything specific to low level programming that C/C++ coders should be proficient with.
What comes to my mind is familiarity with the hardware's architecture and instruction set. Knowing how to fiddle with bits is also important. Resource management and performance have been part of my job; is there anything else?
EDIT: I work with an in-house customized RTOS, not embedded Linux.
I see a lot of high-level operating system answers here, but you specifically said low-level.
Some scattered thoughts:
Design for test. As you work through a problem only change one thing at a time per test.
You need to understand buses and interfaces: SPI, I2C, USB, Ethernet, etc. The number one interface, today, yesterday, and tomorrow, is the UART (serial).
The steps involved in programming a flash.
Tricks to avoid making the product easily brickable.
Bootloaders in general.
Bit-banging the above-mentioned interfaces on various families of parts (different chip vendors have different ideas about I/O pins, pull-ups, direction controls, etc.).
Board and chip bring-up: you certainly never want to boot a program of many tens of thousands of lines of code on the first power-up (think LED on, LED off).
How to debug a product without using too much test equipment (logic analyzers and scopes). At the same time, you have to learn to use a scope for debugging; you are far more valuable if you don't HAVE TO have a tech or engineer in the lab with you.
How would you reprogram the unit in the field? What would you do to minimize human error when allowing the user to field upgrade the unit? Remember field downgrades as well.
What would you do to discourage hacking (binaries, etc).
Efficient use of the flash/rom (don't wear out one bank or section, spread the wear around, or see if the flash is doing it for you).
How and when to use a watchdog timer.
State machines: very useful with byte streams (serial and Ethernet). Design packet structures that stream well, are tailored to a state machine, and have a header and checksum or other structure that ensures you do not interpret partial packets or random data as a good packet (see the sketch at the end of this answer).
Specific concepts like:
Endianness (this link is to an old but good linuxjournal article)
Effective use of multithreading architectures (the Embedded site is good in general)
Debugging embedded and multithreaded systems
Understand, Learn and Follow good programming techniques (the link is very old and the point very generic and subjective, but think about it)
Other things (this IBM page on embedded linux sums up most of the other points I want to make)
One more thing -- never underestimate testing! or, planning test cases!!
Use the reference links I give as concepts; please follow up further for deeper knowledge.
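To make the state-machine/packet item above concrete, here is a minimal sketch of a byte-at-a-time parser; the frame format (sync byte, length, XOR checksum) is made up purely for illustration:

    #include <stdint.h>

    enum state { WAIT_SYNC, GET_LEN, GET_PAYLOAD, GET_CSUM };

    #define SYNC_BYTE   0x7E
    #define MAX_PAYLOAD 64

    static enum state st = WAIT_SYNC;
    static uint8_t payload[MAX_PAYLOAD];
    static uint8_t len, idx, csum;

    void handle_packet(const uint8_t *p, uint8_t n);   /* your application code */

    /* Feed one received byte at a time, e.g. from the UART ISR. */
    void rx_byte(uint8_t b)
    {
        switch (st) {
        case WAIT_SYNC:
            if (b == SYNC_BYTE) { st = GET_LEN; csum = 0; }
            break;
        case GET_LEN:
            if (b == 0 || b > MAX_PAYLOAD) { st = WAIT_SYNC; break; }
            len = b; idx = 0; csum ^= b; st = GET_PAYLOAD;
            break;
        case GET_PAYLOAD:
            payload[idx++] = b; csum ^= b;
            if (idx == len) st = GET_CSUM;
            break;
        case GET_CSUM:
            if (b == csum)            /* checksum rejects partial or random data */
                handle_packet(payload, len);
            st = WAIT_SYNC;           /* resync on the next frame either way */
            break;
        }
    }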
I'd study the electronics of the actual chips. Learn how they work internally (such as architecture), interface with peripherals, electrical and timing characteristics, etc.
Basically, read the data sheet start to finish a few times and dig into anything you've not seen/used before.
By the way, what chips do you work with?
Similar to what Brian said, learn how to create unit tests and automated builds.
These skills are good for all levels of software engineers to be proficient in. They will help improve the quality of your code while also making it easier to refactor and improve the code base.
If you haven't yet I think every Software Engineer should read The Pragmatic Programmer and Code Complete. I know these are not specific to low level programming, but have a large wealth of knowledge in them that applies to all sub disciplines.
Great familiarity with pointers, the checks these languages don't do for you (like buffer overflows and the like), and digital electronics. Operating system internals might also help.
Get to know how stuff is represented internally, especially ready-made data structures (supposing you won't build your own).
Above all, practice a lot. Doing it brings much more to you than just reading about it ;)
bit operations
processor architectures (caches, etc)
WCET (worst-case execution time) analysis
scheduling
Edit: What I forgot to mention is model based development.
Today, the control algorithms are often implemented as some kind of automaton from which C code is generated afterwards.
Commercially available tools include MATLAB/Simulink, ASCET, and SCADE, for example.
Get yourself a copy of the MISRA-C book. It was originally written by members of the automotive industry, and attempts to make software written in C more robust by applying a number (quite a large number!) of rules and guidelines.
Then, buy PC-Lint (or another static analysis tool) to check your code for MISRA and other rules.
These are particularly relevant to low-level and embedded C, as between them they deal with the causes of a lot of bugs in such software, such as issues relating to pointers, memory leaks, integer promotion (there's a whole chapter on that in the MISRA book), endianness, and undefined behaviour.
Good question. Some that haven't been mentioned...
Learn your various options for achieving low-level multitasking. From basic round-robin (non-preemptive) schedulers, with timing ticks from a hardware timer, up to a preemptive RTOS. Learn why you might need an RTOS, and why you might not. If you use an RTOS, learn that beginners with a PC background probably tend to want to create too many tasks.
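A minimal sketch of the non-preemptive round-robin idea, assuming a hardware timer ISR that sets a tick flag (the task names are placeholders):

    #include <stdbool.h>

    static volatile bool tick;                 /* set by the timer ISR, e.g. every 1 ms */

    void timer_isr(void) { tick = true; }

    static void task_read_sensors(void)   { /* ... */ }
    static void task_update_control(void) { /* ... */ }
    static void task_service_uart(void)   { /* ... */ }

    static void (*const tasks[])(void) = {
        task_read_sensors, task_update_control, task_service_uart,
    };

    int main(void)
    {
        for (;;) {
            while (!tick) { /* could sleep/WFI here */ }
            tick = false;
            for (unsigned i = 0; i < sizeof tasks / sizeof tasks[0]; i++)
                tasks[i]();                    /* each task must run to completion quickly */
        }
    }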
Getting visibility into the internals for debugging can be a challenge. There's no screen typically, so no throwing in "printf" calls wherever you want. An emulator or JTAG interface is ideal--you can set breakpoints and step through your program (as long as halting the micro doesn't make hardware go crazy, like swinging a robot arm around at full speed!). If emulator/JTAG is not available, learn how to use a spare serial port (or maybe even bit-bash a pin to make a serial port) for a debug channel, with some simple memory peek/poke commands.
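And a sketch of the "simple memory peek/poke over a spare serial port" idea; getline_uart()/printf_uart() and the command syntax ("r <hex addr>" reads a word, "w <hex addr> <hex value>" writes one) are made-up stand-ins for whatever UART I/O you have:

    #include <stdint.h>
    #include <stdlib.h>

    extern char *getline_uart(void);                  /* blocking line input over UART */
    extern void  printf_uart(const char *fmt, ...);   /* formatted output over UART    */

    void debug_monitor(void)
    {
        for (;;) {
            char *line = getline_uart();
            char  cmd  = line[0];
            uint32_t addr = strtoul(line + 2, &line, 16);
            if (cmd == 'r')
                printf_uart("%08lX\r\n",
                            (unsigned long)*(volatile uint32_t *)(uintptr_t)addr);
            else if (cmd == 'w')
                *(volatile uint32_t *)(uintptr_t)addr = strtoul(line, NULL, 16);
        }
    }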

Is "the optimized delay" a myth or is it real?

From time to time you hear stories that are meant to illustrate how good someone is at something, and sometimes you hear about the guy who is so into code optimization that he optimizes his delay loop.
Since this really sounds like a strange thing to do (it's much better to use a timer interrupt instead of an optimized busy-wait), and since nobody ever tends to tell you the name of the optimizing hacker, I have been left wondering: is it an urban myth or is it real?
What do you say, reality or fiction?
Thanks
Johan
Update: It sounds like ShuggyCoUk was on to something; I wonder if we can find an example.
Update: Just a little clarification: this question is about the "delay" function itself and how it is implemented, not how and where you call it; and what the purpose was, and how that system became better.
Update: It's no myth; those guys seem to exist.
Thanks
ShuggyCoUk
This has more than a kernel of truth about it...
A spin wait can be much better than a signal-based interrupt or a yield.
You trade some throughput for much reduced latency.
Often this is vitally important within an OS itself.
You allow yourself the freedom to do operations not possible within an interrupt handler, memory allocation for example.
You can get considerably finer grained control of the interval waited since you can essentially measure the cycle count.
However, spin waits are tricky to get right. If you can, you should use proper idle instructions, which:
can power down parts of the core, improving power usage/heat dissipation and even allowing other cores to go faster.
In Hyper Thread based CPUs you allow the other logical thread to use the full CPU pipeline while you spin.
Otherwise, an instruction you might think is a no-op could cause the CPU to execute instructions out of order via the superscalar execution units, and the resulting code may get unforeseen out-of-order artefacts which force the CPU into unwanted stalls and memory barriers.
This is why you let someone else write the spin-wait loop for you in most cases.
In Linux there is the cpu_relax macro
on arm this is barrier()
on x86 this is rep_nop()
In Windows there is YieldProcessor
Accessible in .Net via Thread.SpinWait
OS X eschews providing a standard implementation unless you are in the kernel
see this document and note that it encourages the use only of lck_spin_t
As to some citations of using PAUSE for spin waits:
PostgreSQL
Linux
See also the note that this is better on non-P4 CPUs as well, due to reduced power consumption.
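For illustration, a polite spin-wait on x86 might look like the sketch below, assuming C11 atomics and the PAUSE intrinsic rather than any particular OS's implementation:

    #include <stdatomic.h>
    #include <immintrin.h>              /* _mm_pause(): the PAUSE / "rep; nop" hint */

    /* Spin until another thread sets *flag, being kind to the pipeline,
     * the power budget, and any sibling hyper-thread while we wait. */
    static void spin_until_set(atomic_int *flag)
    {
        while (atomic_load_explicit(flag, memory_order_acquire) == 0)
            _mm_pause();
    }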
The version I've always heard is of a group of hardware programmers who developed a special instruction that optimised the idle (not busy) loop of their operating system. This is mentioned in Kernighan & Pike's book The Practice Of Programming, but even there they admit it may be an Urban Myth.
I've heard stories of programmers who intentionally put in long delay loops early in projects and removed them later as "optimizations" to impress management. Never figured out if the stories were apocryphal or not.