Performance Differences between evaluation boards [closed] - embedded

Our company is the proud owner of an STM32F4 evaluation board (Cortex-M4F).
We have received another evaluation board, this one based on an ARM7TDMI.
Before starting the migration to the ARM7 evaluation board, we want to know whether the hardware is powerful enough for us, so we don't waste time discovering later that it isn't.
Our project uses many DSP algorithms (which take advantage of the FPU), makes heavy use of SDIO, and needs around 1 megabyte of memory.
So I was thinking of running the following tests on both evaluation boards and comparing the performance differences (a minimal sketch of such a timing loop is shown after the list):
Math: addition, subtraction, multiplication, division, abs and sqrtf, run in a loop (using only floating-point numbers).
SDIO: read/write a 2 kilobyte buffer in a loop.
Memory: read/write to the external and internal RAM in a loop.
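For illustration, here is a minimal sketch of the floating-point test, with read_timer_us() as a hypothetical per-board helper you would supply yourself (e.g. from SysTick or a vendor timer); the same source would be compiled unchanged for both boards:

```
#include <math.h>
#include <stdint.h>

/* Hypothetical helper: returns a free-running microsecond count on the board in use. */
extern uint32_t read_timer_us(void);

volatile float sink;   /* prevents the compiler from optimising the loop away */

uint32_t float_benchmark(uint32_t iterations)
{
    float a = 1.234f, b = 5.678f;
    uint32_t start = read_timer_us();

    for (uint32_t i = 0; i < iterations; ++i) {
        float x = a + b;      /* addition        */
        x = x - a;            /* subtraction     */
        x = x * b;            /* multiplication  */
        x = x / a;            /* division        */
        x = fabsf(x);         /* abs             */
        x = sqrtf(x);         /* sqrtf           */
        sink = x;
    }
    return read_timer_us() - start;   /* elapsed microseconds */
}
```

On the Cortex-M4F these operations map onto FPU instructions; on the ARM7 they go through a software floating-point library, which is where the answers below expect the big difference to show up.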
In your opinion, will these results give us any indication of the performance differences, and of what to expect from the "real" project?
Thanks
Michael

I would advise against any new design based on ARM7 - it is a legacy ARM architecture. You should check the vendor's part status and planned obsolescence for any part you intend to design in. No vendor is releasing new designs based on ARM7.
I would also suggest that for DSP algorithms, the DSP features of the Cortex-M4 are more important than its floating point. The ARM Cortex-M CMSIS includes a DSP library that takes advantage of this. Either way, fixed-point DSP algorithms will be far more efficient than using floating point.
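As a rough illustration of what using that library looks like (assuming the CMSIS-DSP sources are added to the project and ARM_MATH_CM4 is defined for the Cortex-M4 build), a block-based helper might be:

```
#include "arm_math.h"   /* CMSIS-DSP types and function prototypes */

#define BLOCK_SIZE 64

void scale_and_measure(const float32_t *in, float32_t *out, float32_t *rms)
{
    /* Element-wise multiply by a constant, vectorised by the library. */
    arm_scale_f32(in, 0.5f, out, BLOCK_SIZE);

    /* RMS of the block; on Cortex-M4 this benefits from the FPU and the
     * MAC-style DSP instructions referred to above. Fixed-point variants
     * (e.g. the q15 versions of the same functions) exist for integer-only parts. */
    arm_rms_f32(out, BLOCK_SIZE, rms);
}
```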
Cortex-M is a far more efficient design than ARM7, achieving 1.2 DMIPS per MHz compared to less than 1.0 DMIPS per MHz. That, coupled with DSP instructions, floating point, and separate buses for on-chip flash, RAM, and peripherals, makes most code significantly faster on Cortex-M.
The Cortex-M architecture defines the SysTick timer and the interrupt controller (NVIC), whereas on ARM7 these are defined by the chip vendor and vary between vendors, making porting of code between them more difficult.
STM32F4xx parts run at up to 180 MHz; most ARM7 parts run at 60 MHz or less.
Performing a comparison using floating point is almost pointless: floating-point hardware will easily outperform the software floating point necessary on ARM7 by a factor of 5 to 10 at least. Unless your application can cope with that drop in performance, it is unsuited to ARM7. However, most applications do not need floating point; integer or fixed-point algorithms can run around 5 times faster than software floating point, and so compete with hardware floating point. Remember also that the Cortex-M4 FPU is single precision only.
It would be more reasonable to compare Cortex-M3 with Cortex-M4 to test the sensitivity of your application to the lack of hardware FP and DSP support.
SDIO performance will be limited by the SDIO interface and by the SD card itself (cards vary widely in performance even at the same "speed rating"). The load imposed on the processor will be very low; alternatively, the processor will spend most of its time waiting for data if your application busy-waits rather than doing something useful while waiting on the SD card. The use of DMA transfers can make the CPU load more or less negligible.
Compared with ARM7, the Cortex-M4 offers both higher performance and greater capability. At the same clock frequency, Cortex-M4 sits between ARM9 and ARM11 on the performance scale.
I do not think that you need to perform any benchmark tests comparing ARM7 and Cortex-M4, since the broad performance figures are already available. What you could perhaps do is measure the CPU load of your existing application on its current platform. If it is low (perhaps < 20%) and the application spends most of its time idle, then ARM7 might be feasible. Of course, if your application is not running on an RTOS or a scheduler with an idle task, then measuring true CPU load might be difficult.
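One common way to get a rough CPU-load figure, sketched here under the assumption of a FreeRTOS build with configUSE_IDLE_HOOK enabled (the calibration count is something you would measure once with the application otherwise idle):

```
#include <stdint.h>

static volatile uint32_t idle_counter;

/* Called repeatedly by the FreeRTOS idle task when configUSE_IDLE_HOOK is 1. */
void vApplicationIdleHook(void)
{
    idle_counter++;
}

/* Call once per second, e.g. from a periodic task. 'counts_when_unloaded' is
 * the value idle_counter reaches in one second with no application load. */
uint32_t cpu_load_percent(uint32_t counts_when_unloaded)
{
    uint32_t counts = idle_counter;
    idle_counter = 0;
    if (counts > counts_when_unloaded)
        counts = counts_when_unloaded;
    return 100u - (100u * counts) / counts_when_unloaded;
}
```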

I would have thought that the M4F would be a far more capable part than the venerable ARM7TDMI processor. I have not used an ARM7 with a floating-point coprocessor, and since you want to do floating-point DSP, I would expect the M4F to be better suited to your application.
Having the floating point in hardware will speed up your processing and may allow power savings to be made by slowing the processor clock.
I would be reluctant to start a new design based on a version of the ARM architecture that is at least 10 years old.

Related

Do any microprocessors today use Scoreboarding or Tomasulo's algorithm? [closed]

I've researched a bit and found out about the Intel Pentium Pro, AMD K7, and IBM PowerPC, but these are pretty old. I'm not able to find any info about current-day processors that use these mechanisms for dynamic scheduling.
Every modern OoO exec CPU uses Tomasulo's algorithm for register renaming. The basic idea of renaming onto more physical registers in a kind of SSA dependency analysis hasn't changed.
Modern Intel CPUs like Skylake have evolved some since Pentium Pro (e.g. renaming onto a physical register file instead of holding data right in the ROB), but PPro and the rest of the P6 family are direct ancestors of the Sandybridge family. See https://www.realworldtech.com/sandy-bridge/ for some discussion of the first member of that new family. (And if you're curious about CPU internals, there are much more in-depth looks at it.) See also https://agner.org/optimize/, but Agner's microarch guide focuses more on how to optimize for it, e.g. the fact that register renaming isn't a bottleneck on modern CPUs: rename width matches issue width, and the same register can be renamed 4 times in an issue group of 4 instructions.
Advancements in managing the RAT include Nehalem introducing fast-recovery for branch misses: snapshot the RAT on branches so you can restore to there when you detect a branch miss, instead of draining earlier un-executed uops before starting recovery.
Also mov-elimination and xor-zeroing elimination: they're handled at register-rename time instead of needing a back-end uop to write the register. (For xor-zeroing, presumably there's a physical zero register and zeroing idioms point the architectural register at that physical zero. See What is the best way to set a register to zero in x86 assembly: xor, mov or and? and Can x86's MOV really be "free"? Why can't I reproduce this at all?)
If you're going to do OoO exec at all, you might as well go all-in, so AFAIK nothing modern does just scoreboarding instead of register renaming. (Except for in-order cores that scoreboard loads, so cache-miss latency doesn't stall until a later instruction actually reads the load's target register.)
There are still in-order execution cores that don't do either, leaving instruction scheduling / software-pipelining up to compilers / humans. aka statically scheduled. This is not rare; widely used budget smartphone chips use cores like ARM Cortex-A53. Most programs bottleneck on memory, and you can allow some memory-level parallelism in an in-order core, especially with a store buffer.
Sometimes energy per computation is more important than performance.
Tomasulo's algorithm dates back to 1967. It's quite old and several modifications and improvements have been made to it. Also, new dynamic scheduling methods have been developed.
Check out http://adusan.blogspot.com.au/2010/11/differences-between-tomasulos-algorithm.html
Likewise, pure Scoreboarding is not used anymore, at least not in mainstream architectures, but its core concept is used as a base element for modern dynamic scheduling techniques.
It is fair to say that although they're not used as is anymore, some of their features are still maintained in modern dynamic scheduling and out-of-order execution techniques.

Why do you need a Programmable Real Time Unit (PRU) while you can have an RTOS?

The BeagleBone Black processor includes two independent Programmable Real-Time Units (PRUs). Hobbyists and professionals are excited about the possible use of these units for real-time applications, which is understood. However, if you can have an RTOS (whether for the BeagleBone or the Raspberry Pi), why would you need the PRUs?
EDIT:
For information, the BBB has an ARM Cortex-A8 running at 1 GHz, with 1.9 DMIPS/MHz. The PRUs are simple RISC cores running at 200 MHz.
Linux, even with the real-time scheduler, is unsuited to many critical hard real-time tasks with response requirements at the microsecond level; on the other hand, it provides or enables a great deal of functionality in terms of UI, connectivity, and filesystem support. These things are either not available in an RTOS or are provided at significant cost in a high-end RTOS, and with much more limited hardware support.
So if you have a system that has hard real-time constraints but needs more general-purpose computing features such as networking, filesystems, and connection to commercial off-the-shelf (COTS) peripherals, then the PRU provides a solution to that.
On the other hand, I can't help but think that this is a marketing exercise on the part of TI to sell more chips. A similar solution has always been possible (and indeed common) using one or more processors to perform time-critical tasks, possibly running an RTOS, while UI and connectivity are handled by a single processor with the necessary hardware and memory resources but without the real-time constraints. The PRU device does have two 32-bit cores, but XMOS xCORE devices have as many as 16 cores - with 16 communicating cores, you may not even need an RTOS.
To answer the question...
[...] if you can have a RTOS [...], why would you need the PRUs?
... directly: you probably wouldn't need them in that case, but you would lose Linux - and your application may need that. It is just one of many solutions to real-time applications using Linux. You pays your money, and takes your choice.
Most likely the processor in the BeagleBone or Raspberry Pi is too "heavy" for real time - after all, you could run an RTOS on your PC, but it would not be entirely deterministic, even though it is faster than your typical microcontroller (I guess that these PRUs are some sort of microcontroller with a new fancy name). On such a high-level application processor as found on these boards, you rarely have direct access to hardware or interrupts, which are essential for real-time applications that actually do something time-critical.
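To make the jitter argument concrete, here is a minimal user-space sketch (plain Linux, no real-time patches; the 1 ms period and iteration count are arbitrary choices) that measures how far a "periodic" loop overshoots its deadline - typically tens to hundreds of microseconds or worse under load, which is exactly the region the PRUs are meant to handle deterministically:

```
/* Build with: gcc -O2 -std=c99 jitter.c -o jitter */
#define _POSIX_C_SOURCE 199309L
#include <stdio.h>
#include <time.h>

static long long ns_between(struct timespec a, struct timespec b)
{
    return (b.tv_sec - a.tv_sec) * 1000000000LL + (b.tv_nsec - a.tv_nsec);
}

int main(void)
{
    const long period_ns = 1000000;      /* ask for a 1 ms period */
    long long worst = 0;
    struct timespec prev, now;
    clock_gettime(CLOCK_MONOTONIC, &prev);

    for (int i = 0; i < 1000; ++i) {
        struct timespec delay = { 0, period_ns };
        nanosleep(&delay, NULL);
        clock_gettime(CLOCK_MONOTONIC, &now);
        long long jitter = ns_between(prev, now) - period_ns;
        if (jitter > worst)
            worst = jitter;              /* overshoot beyond the 1 ms we asked for */
        prev = now;
    }
    printf("worst-case overshoot: %lld us\n", worst / 1000);
    return 0;
}
```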

Advantages of atmega32 [closed]

What are the advantages of using the ATmega32 over other microcontrollers?
Is it better than PIC, ARM, and 8051?
Advantages
Still runs at 5 V, so legacy 5 V stuff interfaces more cleanly
Even though it's 5 V capable, newer parts can run down to 1.8 V. This wide range is very rare.
Nice instruction set, very good instruction throughput compared to other processors (HCS08, PIC12/16/18).
High quality GCC port (no proprietary crappy compilers!)
"PA" variants have good sleep mode capabilities, in micro-amperes.
Well rounded peripheral set
QTouch capability
Disadvantages
Still 8-bit. An ARM is a 16/32-bit workhorse, and will push a good amount more data around, at much higher clock speeds, than any 8-bit.
Cost. Can be expensive compared to HCS08 or other bargain 8-bit processors.
GCC toolchain has quirks, like the split memory model and limited 16-bit pointers.
Atmel is not the best supplier on the planet (at least they're not Maxim...)
In short, it is a very clean and easy-to-work-with 8-bit microcontroller.
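To give a feel for how little ceremony the avr-gcc toolchain needs, here is a minimal blink sketch (assumptions: an ATmega32 clocked at 8 MHz with an LED on PB0; build with avr-gcc -mmcu=atmega32 -Os):

```
#define F_CPU 8000000UL        /* assumed clock; must match the real crystal/fuse setting */
#include <avr/io.h>
#include <util/delay.h>

int main(void)
{
    DDRB |= (1 << PB0);        /* make PB0 an output */
    for (;;) {
        PORTB ^= (1 << PB0);   /* toggle the LED     */
        _delay_ms(500);
    }
}
```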
An 8051 is legacy: the tools are passable, the architecture is bizarre (idata? xdata? non-reentrant functions in most compilers by default?).
PIC before PIC24 is also bizarre (register banking) and has poor clock-to-instruction throughput. There is no first-class open-source C compiler for it either.
PIC32 is competing with ARM7TDMI and ARM Cortex-M3, based on an adapted MIPS core, and has a GCC port (not main-lined).
AVR32 is competing with Cortex-M3, and offers a pretty good value, especially in the low power area.
MSP430 is the king for ultra low-power devices, and has a passable GCC port (if you're not targeting 430X).
HCS08 is very inexpensive but has poor instruction throughput. Peripherals vary quite a bit.
ARM used to be a higher-cost entry point, but with the introduction of the Cortex-M3 architecture, the price has been dropping compared to an 8-bit. For example, the LPC13xx series is comparable to an ATmega32 in many ways. Luminary (TI) has quite an impressive peripheral set.
It depends. First you have to know what you want from the microcontroller.
In general:
PIC:
Old architecture. This means it's either expensive or slow
Targets only the low-end market (< a few MHz)
There's a lot of code written for it
ARM
Scalable
Fast/cheap
Atmega is somewhere in between
I find the PIC family (before the MIPS version) to have the most painful instruction set of all, which means assembler is the language of choice if you want to conserve space, get performance, have control, etc.
The 8051 is a little less painful - more registers - but it still takes a handful of instructions to do anything useful (meaning you cannot compare these to other chips from a MHz perspective). I like AVR in many ways: they embrace the homebrew and developer community, and even where they don't directly, there is a much better community of developers out there compared to the competition. I don't like the instruction set, but it is decades ahead of the PIC and 8051. I like the MSP430 instruction set quite a bit - it is one of the best instruction sets for teaching assembler - though TI is not as developer friendly, and it can be a struggle. The eZ430 was on the right path, but the GoodFET is better, as you don't have it failing to work every other kernel version.
MSP430 and ARM have the best instruction sets as far as I am concerned, which leads to both good assemblers and good compiler tools. You can find commercial tools for all of the above, and free tools for the 8051, MSP430, and ARM (MSP430 and ARM can use GCC; the 8051 cannot - look for SDCC). For now, mspgcc4.sf.net and CodeSourcery are the places for GCC-based tools for MSP430 and ARM. LLVM supports both; I was able to get LLVM 2.7 to beat the latest GCC in a Dhrystone test, but that is one test - LLVM trails behind in performance but is improving.
As far as finding and creating free cross compilers goes, I see LLVM as already the easiest to get and use, and going forward it will hopefully only get better. Sadly, the MSP430 port for LLVM was a "look what I could do in an afternoon" PowerPoint presentation and not a serious port.
My answer is that it depends on what you are doing, and I recommend you try all of them. These days the evaluation boards are in the sub-US$50 range, and some in the sub-US$30 range. Even within the ARM family (ST, Atmel, Stellaris, LPC, etc.) there is a wide variety of features and quirks that you will only find if you try them. Avoid the LPCXpresso, mbed2, and STM32 Primer boards; avoid LPC in general, and avoid Cortex-M3 in general until you have cut your teeth on an ARM7. Look at SparkFun for Olimex and other boards. Although it is probably LPC-based, the ARMmite PRO is a good choice, as is the Arduino Pro. The eZ430 is a good MSP430 start. I don't remember who is making 8051 stuff - Renesas? - and 8051s are not all created equal: the register space varies from one to another and you have to prepare for that. I would probably look for an 8051 simulator if you want to play with the 8051.
I see AVR and definitely ARM continuing to dominate; I would like to see the MSP430 used for things other than just super low power. With ARM, AVR, and MSP430 you can use and get used to GCC tools now and in the future, which has a lot of benefits: even if GCC isn't the best compiler in the world, it is by far the best-supported compiler. I would avoid proprietary compilers and tools. I would look for devices that have non-proprietary programming interfaces that are field programmable; JTAG is good, but, for example, the new SWD/JTAG on Cortex-M3 is bad. TI's MSP430 was hurt by this, but some hacking has resolved it, at least for now. I really don't have much good to say about PIC and won't try. A big thing to look for is glue logic: does the part or family have the SPI or I2C or whatever bus you want to use? Do you need an internal pull-up or a wired-OR input?
Some chips just don't have those options and you have to add external hardware. Do you need an interrupt, with conditioning? ARM tends to win here because its core is used by many vendors, each putting its own I/O around it, so you can stay in the ARM world and still have many choices; AVR and MSP430 are going to be very limited by comparison. With ARM the tools are going to be state of the art - ARM is the most-used processor right now - while AVR and MSP430 tools are special-project add-ons, less widely supported and more fragile. Although ARM is low power compared to Intel on an SBC or computer platform, it is likely not as low power as an AVR or MSP430. You really need to look at your project and pick the right processor for the job; I wouldn't, and don't, limit myself to one family. With evaluation boards as cheap as they are, and almost all usable with free tools, it is just a matter of putting in a few nights or weekends to learn each. I suggest learning more than one AVR, and learning more than one microcontroller family.
At this end of the spectrum, there are only really two factors that make much difference. First, in smaller quantities, the only thing that matters at all is which architecture suits your development needs best. If you are already familiar with PIC, there's not much point in learning AVR, or vice versa. Pick an architecture that you like, then sort through the options on that architecture to see which model is up to your particular needs.
In quantity (say, 20 or more units), you might benefit by choosing just the right platform that precisely matches your devices' needs, to keep costs as low as possible.
In general, the PIC and AVR platforms are good for simple, single-function devices, whereas ARM is used in cases where you need a full OS stack like QNX or Linux for things like TCP, or real time with OS services.
If you want the widest choice of peripherals, performance, price-point, software and tool support, and suppliers it would be hard to beat an ARM Cortex-M3 based part.
But addressing your question directly: the whole AVR range has a consistent architecture and common peripheral set from the Tiny to the Mega (not the AVR32, however, which is entirely different). This is the significant difference from PIC, where moving up the range (PIC10, 12, 16, 18, 24, 32) you get different peripheral designs, different instruction sets, and need to invest in different compilers and debug hardware.
The instruction set for AVR was designed for efficient C code compilation (again unlike PIC).
8051 is an architecture originally introduced by Intel decades ago, but now used as the core for 8 bit devices from a number of vendors. It has some clever tricks such as efficient multitasking context switches via its 8 duplicated register banks, and a block of bit addressable memory, but has a quirky memory architecture and limited address range (like most 8 bit devices). Great for small well targeted devices, but not truly general purpose.
ARM Cortex-M3 essentially replaces ARM7TDMI, and is a cleaner design with well thought out architecture. It requires minimal assembler start-up code and even ISRs and vector tables can be coded in C directly without any weird compiler extensions or assembler entry/exit code. Its bitbanding technique allows all memory and peripherals to be atomically bit addressable, which is useful for fast I/O and safe multithreading. Basically it is designed to allow C or C++ code at the system level without non-standard compiler extensions. It is of course a 32 bit architecture, so does not have the resource or arithmetic limitations of 8 bit devices. Prices for low-end parts compete with higher performance 8 bit devices, and blow most 16 bit devices out of the water (making 16 bit almost obsolete).
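As a sketch of the bit-banding point (assuming a Cortex-M3/M4 part that implements the optional bit-band regions, and a variable the linker places in the first 1 MB of SRAM), each bit gets its own word-sized alias address, so a single store sets or clears it atomically:

```
#include <stdint.h>

/* Alias word for bit 'bit' of an address in the SRAM bit-band region
 * (0x20000000..0x200FFFFF): 0x22000000 + byte_offset*32 + bit*4. */
#define SRAM_BITBAND(addr, bit) \
    (*(volatile uint32_t *)(0x22000000u + \
        (((uint32_t)(addr) - 0x20000000u) * 32u) + ((bit) * 4u)))

static volatile uint32_t flags;

void set_and_clear_flag(void)
{
    SRAM_BITBAND(&flags, 3) = 1;   /* atomically set bit 3 of 'flags' */
    SRAM_BITBAND(&flags, 3) = 0;   /* atomically clear it again       */
}
```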
One other key thing to remember is that PIC and AVR are from single vendors, while 8051 and ARM are licensed cores. Each licensee adds their own peripheral set, so there is no commonality between vendors on peripherals, so device driver code needs porting when switching vendors, and you need to ensure the part has the peripherals you need. If you design your device layer well, this is seldom much of a problem.
Well, it isn't easy to answer. It mostly depends on what you used before. If you are already AVR user, then it's good to use. On the other hand you can find PICs with similar capabilities, so I'd say it's mostly personal preference. I think that most ARMs are more capable than atmega32 series. If you want good advice, tell us what you plan to use it for.
AVRs have a flat memory model and free development tools, and cheap development hardware is available for them.
I don't know enough about 8051 to comment.
Oh, and if you're thinking about the original ATmega32, I'd say it's a bad idea. It's going to be deprecated soon, so you may want to consider newer models from the ATmega series.

CPU Cards for Parallel Computation?

I remember reading some time ago that there were CPU cards for systems to add additional processing power for mass parallelization. Does anyone have any experience with this, and any resources for looking into the hardware and software aspects of such a project? Is this technology inferior to a traditional cluster? Is it more power-conscious?
There are two cool options. One is the use of GPUs, as Mitch mentions. The other is to get a PS3, which has a multicore Cell processor.
You can also set up multiple inexpensive motherboard PCs and run Linux with Beowulf.
GPGPU is probably the most practical option for an enthusiast. However, DSPs are another option, such as those made by Texas Instruments, Freescale, Analog Devices, and NXP Semiconductors. Granted, most of those are probably targeted more towards industrial users, but you might look into the Storm-1 line of DSPs, some of which are supposed to go for as low as $60 a piece.
Another option for data parallelism is Physics Processing Units like the Nvidia (formerly Ageia) PhysX. The most obvious use of these coprocessors is for games, but they're also used for scientific modeling, cryptography, and other vector processing applications.
ClearSpeed Attached Processors are another possibility. These are basically SIMD co-processors designed for HPC applications, so they might be out of your price range, but I'm just guessing here.
All of these suggestions are based around data parallelism since I think that's the area with the most untapped potential. A lot of currently CPU-intensive applications could be performed much faster at much lower clock rates (and using less power) by simply taking advantage of vector processing and more specialized SIMD instruction sets.
Really, most computer users don't need more than an Intel Atom processor for the majority of their casual computing needs: e-mail, browsing the web, and playing music/video. And for the other 10% of computing tasks that actually do require lots of processing power, a general-purpose scalar processor typically isn't the best tool for the job anyway.
Even most people who do have serious processing needs only need it for a narrow range of applications; a physicist doesn't need a PC capable of playing the latest FPS; a sound engineer doesn't need to do scientific modeling or perform statistical analysis; and a graphic designer doesn't need to do digital signal processing. Domain-specific vector processors with highly specialized instruction sets (like modern GPUs for gaming) would be able to handle these tasks much more efficiently than a high power general-purpose CPU.
Cluster computing is no doubt very useful for a lot of high end industrial applications like nuclear research, but I think vector processing has much more practical uses for the average person.
Have you looked at the various GPU computing options? Nvidia (and probably others) are offering personal supercomputers based around utilising the power of graphics cards.
OpenCL is an industry-wide standard for doing HPC computing across different vendors and processor types: single-core, multi-core, graphics cards, Cell, etc. See http://en.wikipedia.org/wiki/OpenCL.
The idea is that using a simple code base you can use all spare processing capacity on the machine regardless of type of processor.
Apple has implemented this standard in its next version of Mac OS X. There will also be offerings from NVIDIA, ATI, Intel, etc.
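To show how vendor-neutral the API is, here is a minimal host-side sketch in C that just enumerates every OpenCL platform and device on the machine (assumes an OpenCL SDK is installed; link with -lOpenCL, and on Mac OS X include <OpenCL/opencl.h> instead):

```
#include <stdio.h>
#include <CL/cl.h>

int main(void)
{
    cl_platform_id platforms[8];
    cl_uint num_platforms = 0;
    clGetPlatformIDs(8, platforms, &num_platforms);

    for (cl_uint p = 0; p < num_platforms; ++p) {
        char name[256];
        clGetPlatformInfo(platforms[p], CL_PLATFORM_NAME, sizeof name, name, NULL);
        printf("Platform: %s\n", name);

        cl_device_id devices[16];
        cl_uint num_devices = 0;
        clGetDeviceIDs(platforms[p], CL_DEVICE_TYPE_ALL, 16, devices, &num_devices);

        for (cl_uint d = 0; d < num_devices; ++d) {
            char dev_name[256];
            cl_device_type type;
            clGetDeviceInfo(devices[d], CL_DEVICE_NAME, sizeof dev_name, dev_name, NULL);
            clGetDeviceInfo(devices[d], CL_DEVICE_TYPE, sizeof type, &type, NULL);
            printf("  Device: %s (%s)\n", dev_name,
                   (type & CL_DEVICE_TYPE_GPU) ? "GPU" :
                   (type & CL_DEVICE_TYPE_CPU) ? "CPU" : "other");
        }
    }
    return 0;
}
```

The same kernel source can then be compiled at runtime for whichever of those devices you pick, which is the "one code base, all spare processing capacity" idea described above.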
Mercury Computing offers a Cell Accelerator Board; it's a PCIe card that has a Cell processor and runs Yellow Dog Linux, or Mercury's flavor of YDL. Fixstars offers a more powerful Cell PCIe board called the GigaAccel. I called up Mercury, and they said their board is about US$5000 without software. I'd guess the GigaAccel is up to twice as expensive.
I found one of the Mercury boards used, but it didn't come with a power cable, so I haven't been able to use it yet, sadly.

CUDA or FPGA for special purpose 3D graphics computations? [closed]

I am developing a product with heavy 3D graphics computations, to a large extent closest-point and range searches. Some hardware optimization would be useful. While I know little about this, my boss (who has no software experience) advocates FPGAs (because they can be tailored), while our junior developer advocates GPGPU with CUDA, because it's cheap, hot, and open. While I feel I lack judgement on this question, I believe CUDA is the way to go, also because I am worried about flexibility; our product is still under heavy development.
So, rephrasing the question, are there any reasons to go for FPGA at all? Or is there a third option?
I investigated the same question a while back. After chatting to people who have worked on FPGAs, this is what I get:
FPGAs are great for realtime systems, where even 1ms of delay might be too long. This does not apply in your case;
FPGAs can be very fast, especially for well-defined digital signal processing usages (e.g. radar data), but the good ones are much more expensive and specialised than even professional GPGPUs;
FPGAs are quite cumbersome to program. Since there is a hardware-configuration component to compiling, it can take hours. They seem to be more suited to electronic engineers (who are generally the ones who work on FPGAs) than to software developers.
If you can make CUDA work for you, it's probably the best option at the moment. It will certainly be more flexible than an FPGA.
Other options include Brook from ATI, but until something big happens, it is simply not as widely adopted as CUDA. After that, there are still all the traditional HPC options (clusters of x86/PowerPC/Cell), but they are all quite expensive.
Hope that helps.
We did some comparisons between FPGA and CUDA. One area where CUDA shines is when you can really formulate your problem in a SIMD fashion AND access the memory in a coalesced way. If the memory accesses are not coalesced (1), or if you have different control flow in different threads, the GPU can lose most of its performance and the FPGA can outperform it. Another case is when your operation is relatively small but you have a huge amount of them, and you cannot (e.g. due to synchronisation) launch them in a loop within one kernel; then the invocation times for the GPU kernel exceed the computation time.
Also, the power consumption of the FPGA can be better (it depends on your application scenario; the GPU is only cheaper, in terms of Watts/Flop, when it is computing all the time).
Of course the FPGA also has some drawbacks: I/O can be one (we had an application here where we needed 70 GB/s - no problem for a GPU, but to get that amount of data into an FPGA a conventional design needs more pins than are available). Another drawback is time and money: an FPGA is much more expensive than the best GPU, and the development times are very high.
(1) Simultaneous accesses from different threads to memory have to be to sequential addresses. This is sometimes really hard to achieve.
I would go with CUDA.
I work in image processing and have been trying hardware add-ons for years. First we had the i860, then the Transputer, then DSPs, then FPGAs and direct compilation to hardware.
What inevitably happened was that by the time the hardware boards were really debugged and reliable and the code had been ported to them, regular CPUs had advanced to beat them, or the hosting machine architecture changed and we couldn't use the old boards, or the maker of the board went bust.
By sticking to something like CUDA, you aren't tied to one small specialist maker of FPGA boards. The performance of GPUs is improving faster than that of CPUs, and is funded by the gamers. It's a mainstream technology, so it will probably merge with multi-core CPUs in the future and thus protect your investment.
FPGAs
What you need:
Learn VHDL/Verilog (and trust me, you don't want to)
Buy hardware for testing and licences for the synthesis tools
If you already have the infrastructure and only need to develop your core:
Develop the design (and it can take years)
If you don't:
DMA, hardware drivers, ultra-expensive synthesis tools
tons of knowledge about buses, memory mapping, hardware synthesis
build the hardware, buy the IP cores
Develop the design
Not to mention board development
For example, an average FPGA PCIe card with a Xilinx Zynq UltraScale+ chip costs more than $3000
An FPGA cloud is also costly, at $2/h and up
Result:
This is something which requires the resources of a running company at least.
GPGPU (CUDA/OpenCL)
You already have hardware to test on.
Compared to the FPGA route:
Everything is well documented.
Everything is cheap
Everything works
Everything is well integrated with programming languages
There is a GPU cloud as well.
Result:
You just need to download the SDK and you can start.
This is an old thread, started in 2008, but it would be good to recount what has happened to FPGA programming since then:
1. C to gates in FPGA is the mainstream development route for many companies, with HUGE time savings vs. Verilog/SystemVerilog HDL. In C to gates, system-level design is the hard part.
2. OpenCL on FPGA has been around for 4+ years, including floating point and "cloud" deployment by Microsoft (Azure) and Amazon F1 (Ryft API). With OpenCL, system design is relatively easy because of the very well defined memory model and API between host and compute devices.
Software folks just need to learn a bit about FPGA architecture to be able to do things that are NOT EVEN POSSIBLE with GPUs and CPUs, for the reasons of both being fixed silicon and not having broadband (100 Gb+) interfaces to the outside world. Scaling down chip geometry is no longer possible, nor is extracting more heat from a single chip package without melting it, so this looks like the end of the road for single-package chips. My thesis here is that the future belongs to parallel programming of multi-chip systems, and FPGAs have a great chance to be ahead of the game. Check out http://isfpga.org/ if you have concerns about performance, etc.
An FPGA-based solution is likely to be far more expensive than CUDA.
Obviously this is a complex question. The question might also include the Cell processor.
And there is probably not a single answer which is correct for all the related questions.
In my experience, any implementation done in an abstract fashion, i.e. in a compiled high-level language vs. a machine-level implementation, will inevitably have a performance cost, especially in a complex algorithm implementation. This is true of both FPGAs and processors of any type. An FPGA designed specifically to implement a complex algorithm will perform better than an FPGA whose processing elements are generic, allowing it a degree of programmability from input control registers, data I/O, etc.
Another general example where an FPGA can deliver much higher performance is in cascaded processes, where one process's outputs become the inputs to another and they cannot be done concurrently. Cascading processes in an FPGA is simple and can dramatically lower memory I/O requirements, whereas on a processor, memory will be used to effectively cascade two or more processes where there are data dependencies.
The same can be said of a GPU and a CPU. Algorithms implemented in C executing on a CPU, developed without regard to the inherent performance characteristics of the cache or main memory system, will not perform as well as ones that do take them into account. Granted, not considering these performance characteristics simplifies implementation, but at a performance cost.
Having no direct experience with a GPU, but knowing its inherent memory system performance issues, it too will be subject to performance issues.
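A small illustration of the cache point above: the same O(N²) summation traversed in two orders. The size and the clock()-based timing are arbitrary choices for the sketch, but on most CPUs the strided version runs several times slower, purely because of memory-system behaviour.

```
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 2048

static double sum_row_major(const double *m)
{
    double s = 0.0;
    for (size_t i = 0; i < N; ++i)      /* sequential, cache-friendly accesses */
        for (size_t j = 0; j < N; ++j)
            s += m[i * N + j];
    return s;
}

static double sum_col_major(const double *m)
{
    double s = 0.0;
    for (size_t j = 0; j < N; ++j)      /* stride-N accesses, cache-hostile */
        for (size_t i = 0; i < N; ++i)
            s += m[i * N + j];
    return s;
}

int main(void)
{
    double *m = malloc((size_t)N * N * sizeof *m);
    if (!m) return 1;
    for (size_t i = 0; i < (size_t)N * N; ++i) m[i] = 1.0;

    clock_t t0 = clock();
    double a = sum_row_major(m);
    clock_t t1 = clock();
    double b = sum_col_major(m);
    clock_t t2 = clock();

    printf("row-major: %.0f in %.3fs, col-major: %.0f in %.3fs\n",
           a, (double)(t1 - t0) / CLOCKS_PER_SEC,
           b, (double)(t2 - t1) / CLOCKS_PER_SEC);
    free(m);
    return 0;
}
```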
CUDA has a fairly substantial code base of examples and an SDK, including a BLAS back-end. Try to find some examples similar to what you are doing, perhaps also looking at the GPU Gems series of books, to gauge how well CUDA will fit your applications. I'd say from a logistic point of view, CUDA is easier to work with and much, much cheaper than any professional FPGA development toolkit.
At one point I did look into CUDA for claim reserve simulation modelling. There is quite a good series of lectures linked from the website for learning. On Windows, you need to make sure CUDA is running on a card with no displays, as the graphics subsystem has a watchdog timer that will nuke any process running for more than 5 seconds. This does not occur on Linux.
Any machine with two PCIe x16 slots should support this. I used an HP xw9300, which you can pick up on eBay quite cheaply. If you do, make sure it has two CPUs (not one dual-core CPU), as the PCIe slots live on separate HyperTransport buses and you need two CPUs in the machine to have both buses active.
What are you deploying on? Who is your customer? Without even know the answers to these questions, I would not use an FPGA unless you are building a real-time system and have electrical/computer engineers on your team that have knowledge of hardware description languages such as VHDL and Verilog. There's a lot to it and it takes a different frame of mind than conventional programming.
I'm a CUDA developer with very little experience with FPGAs; however, I've been trying to find comparisons between the two.
What I've concluded so far:
The GPU has by far the higher (accessible) peak performance
It has a more favorable FLOP/watt ratio.
It is cheaper
It is developing faster (quite soon you will literally have a "real" TFLOP available).
It is easier to program (read articles on this; it is not just personal opinion)
Note that I'm saying real/accessible to distinguish from the numbers you will see in a GPGPU commercial.
BUT the GPU is not more favorable when you need to do random accesses to data. This will hopefully change with the new Nvidia Fermi architecture, which has an optional L1/L2 cache.
my 2 cents
Others have given good answers, just wanted to add a different perspective. Here is my survey paper published in ACM Computing Surveys 2015 (its permalink is here), which compares GPU with FPGA and CPU on energy efficiency metric. Most papers report: FPGA is more energy efficient than GPU, which, in turn, is more energy efficient than CPU. Since power budgets are fixed (depending on cooling capability), energy efficiency of FPGA means one can do more computations within same power budget with FPGA, and thus get better performance with FPGA than with GPU. Of course, also account for FPGA limitations, as mentioned by others.
FPGA will not be favoured by those with a software bias as they need to learn an HDL or at least understand systemC.
For those with a hardware bias FPGA will be the first option considered.
In reality a firm grasp of both is required & then an objective decision can be made.
OpenCL is designed to run on both FPGA & GPU, even CUDA can be ported to FPGA.
FPGA & GPU accelerators can be used together
So it's not a case of which one is better. There is also the debate about CUDA vs OpenCL.
Again, unless you have optimized and benchmarked both for your specific application, you cannot know with 100% certainty.
Many will simply go with CUDA because of its commercial nature and resources. Others will go with OpenCL because of its versatility.
FPGAs are more parallel than GPUs, by three orders of magnitude. While a good GPU features thousands of cores, an FPGA may have millions of programmable gates.
While CUDA cores must do highly similar computations to be productive, FPGA cells are truly independent of each other.
FPGAs can be very fast with some groups of tasks and are often used where a millisecond is already seen as a long duration.
A GPU core is far more powerful than an FPGA cell, and much easier to program. It is a core: it can divide and multiply with no problem, whereas an FPGA cell is only capable of rather simple boolean logic.
As a GPU core is a core, it is efficient to program it in C++. Even if it is also possible to program an FPGA in C++, it is inefficient (merely "productive"). Specialized languages like VHDL or Verilog must be used; they are difficult and challenging to master.
Most of the tried and true instincts of a software engineer are useless with an FPGA. You want a for loop with these gates? Which galaxy are you from? You need to switch to the mindset of an electronics engineer to understand this world.
At the latest GTC'13, many HPC people agreed that CUDA is here to stay. FPGAs are cumbersome, while CUDA is getting quite a bit more mature, supporting Python/C/C++/ARM. Either way, this was a dated question.
Programming a GPU in CUDA is definitely easier. If you don't have any experience with programming FPGAs in HDL it will almost surely be too much of a challenge for you, but you can still program them with OpenCL which is kinda similar to CUDA. However, it is harder to implement and probably a lot more expensive than programming GPUs.
Which one is Faster?
GPU runs faster, but FPGA can be more efficient.
The GPU has the potential of running at a speed higher than the FPGA can ever reach, but only for algorithms that are specially suited for that. If the algorithm is not optimal, the GPU will lose a lot of performance.
The FPGA, on the other hand, runs much slower, but you can implement problem-specific hardware that will be very efficient and get stuff done in less time.
It's kinda like eating your soup with a fork very fast vs. eating it with a spoon more slowly.
Both devices base their performance on parallelization, but each in a slightly different way. If the algorithm can be granulated into a lot of pieces that execute the same operations (keyword: SIMD), the GPU will be faster. If the algorithm can be implemented as a long pipeline, the FPGA will be faster. Also, if you want to use floating point, FPGA will not be very happy with it :)
I have dedicated my whole master's thesis to this topic.
Algorithm Acceleration on FPGA with OpenCL