What is the effect of IDXGIDevice::SetGPUThreadPriority? - gpu

The documentation for IDXGIDevice::SetGPUThreadPriority states
To use the SetGPUThreadPriority method, you should have a comprehensive understanding of GPU scheduling. You should profile your application to ensure that it behaves as intended. If used inappropriately, the SetGPUThreadPriority method can impede rendering speed and result in a poor user experience.
Yet, it does not elaborate on how to gain the so called "comprehensive understanding" of GPU scheduling, and how exactly does it affect each thread that communicates with the GPU.
Here are some other specific questions:
On what scenarios SetGPUThreadPriority might be beneficial?
Does it affect GPU scheduling in general for the invoking thread or is it specific to Direct3D 11 / 12?
(I mean, does it have any effect on other APIs that share the GPU such as Vulkan, OpenGL, CUDA, OpenCL?)
Which OS component / driver does it communicate with?
How it affects other processes?

Related

How the GPU process non-graphic data in parallel?

As the introduction of programmable shaders in graphic pipeline enabled GPGPU concept which makes use of GPU as a general processing engine suited for parallel data.
However, as far as I know, because GPU is still used for graphic processing a lot compared to GPGPU, it makes use of lots of fixed graphic pipeline stages that cannot be programmed.
If my understanding is correct, when one data is processed by the GPU regardless of the type of data (graphic or general), it should be processed through the fixed graphic pipeline which includes programmable stages and non-programmable fixed stages.
Does that mean non-graphical processing should go through graphical processing stages even though it doesn't make use of it? Or can it bypass those fixed stages used for graphics? If one can explain how the GPU pipeline works for GPGPU I would appreciate it.
TL;DR:
GPGPU completely bypasses the rendering pipeline, but the pipeline is still used today.
GPUs consist of two main parts (in relation to your question). The first one is the processing part, which consists of the memory, registers, warp units, dispatchers and streaming processors. The other part is a set of controllers, that are responsible for geometry processing and the graphics pipeline. Those controllers just issue commands for the Streaming Processors on how to process the data for each of the steps of the rendering pipeline, either hardwired or based on user supplied shaders. NVidia calls them "PolyMorph Engine", AMD "Geometric Processor".
Historically, some of those controllers were hardwired to do things a single way, so you could only programm the vertexshader, fragmentshader and pixelshader. The tesselation controller e.g. was hardwired on the GPU and not user programmable. As demands grew, more and more of those controllers became user-programmable and today most of them are completely programmable (Wikipedia).
In the beginning days of GPGPU, the only way to do computing was to hack the available shaders by using a texture with your input data on a full-screen face to calculate the result and then read the rendered image back (See slide 26 on this introduction).
With CUDA, NVidia allowed users not only to program the shaders/polymorph Engine, but also directly interact with the Streaming processors and execute code on those (See slide 31 & 32).
This does not mean, that the graphics pipeline became obsolete, but now there is a way to completely bypass it and directly run code on the GPU processors. Nvidia has a nice explanation on how the pipeline works today, where you can also see both the PolyMorph Engine and the Streaming Processors here.
The Graphics pipeline still helps the dev by offloading repetitive and more complicated parts of the process, like managing the memory, managing warps, passing data and all that stuff. Theoretically you could probably write your own pipeline directly on the StreamingProcessors using CUDA and then render the result, but it would be tedious. Just how writing a GPGPU-Code using Shaders would be tedious.
Although old GPUs have pipelines hardcoded in the chip, modern GPU itself is just a large ASIC that can compute vectorized data at stupid fast speed. It is human who defines what it can do. So the render pipeline is defined in the graphics library like OpenGL, not in GPU. Thus, GPU does not care what it is computing, as long as it is vectorized data, it can do all the computation needed and give you a result.

Why would I consider using an RTOS for my embedded project?

First the background, specifics of my question will follow:
At the company that I work at the platform we work on is currently the Microchip PIC32 family using the MPLAB IDE as our development environment. Previously we've also written firmware for the Microchip dsPIC and TI MSP families for this same application.
The firmware is pretty straightforward in that the code is split into three main modules: device control, data sampling, and user communication (usually a user PC). Device control is achieved via some combination of GPIO bus lines and at least one part needing SPI or I2C control. Data sampling is interrupt driven using a Timer module to maintain sample frequency and more SPI/I2C and GPIO bus lines to control the sampling hardware (ie. ADC). User communication is currently implemented via USB using the Microchip App Framework.
So now the question: given what I've described above, at what point would I consider employing an RTOS for my project? Currently I'm thinking of these possible trigger points as reasons to use an RTOS:
Code complexity? The code base architecture/organization is still small enough that I can keep all the details in my head.
Multitasking/Threading? Time-slicing the module execution via interrupts suffices for now for multitasking.
Testing? Currently we don't do much formal testing or verification past the HW smoke test (something I hope to rectify in the near future).
Communication? We currently use a custom packet format and a protocol that pretty much only does START, STOP, SEND DATA commands with data being a binary blob.
Project scope? There is a possibility in the near future that we'll be getting a project to integrate our device into a larger system with the goal of taking that system to mass production. Currently all our projects have been experimental prototypes with quick turn-around of about a month, producing one or two units at a time.
What other points do you think I should consider? In your experience what convinced (or forced) you to consider using an RTOS vs just running your code on the base runtime? Pointers to additional resources about designing/programming for an RTOS is also much appreciated.
There are many many reasons you might want to use an RTOS. They are varied & the degree to which they apply to your situation is hard to say. (Note: I tend to think this way: RTOS implies hard real time which implies preemptive kernel...)
Rate Monotonic Analysis (RMA) - if you want to use Rate Monotonic Analysis to ensure your timing deadlines will be met, you must use a pre-emptive scheduler
Meet real-time deadlines - even without using RMA, with a priority-based pre-emptive RTOS, your scheduler can help ensure deadlines are met. Paradoxically, an RTOS will typically increase interrupt latency due to critical sections in the kernel where interrupts are usually masked
Manage complexity -- definitely, an RTOS (or most OS flavors) can help with this. By allowing the project to be decomposed into independent threads or processes, and using OS services such as message queues, mutexes, semaphores, event flags, etc. to communicate & synchronize, your project (in my experience & opinion) becomes more manageable. I tend to work on larger projects, where most people understand the concept of protecting shared resources, so a lot of the rookie mistakes don't happen. But beware, once you go to a multi-threaded approach, things can become more complex until you wrap your head around the issues.
Use of 3rd-party packages - many RTOSs offer other software components, such as protocol stacks, file systems, device drivers, GUI packages, bootloaders, and other middleware that help you build an application faster by becoming almost more of an "integrator" than a DIY shop.
Testing - yes, definitely, you can think of each thread of control as a testable component with a well-defined interface, especially if a consistent approach is used (such as always blocking in a single place on a message queue). Of course, this is not a substitute for unit, integration, system, etc. testing.
Robustness / fault tolerance - an RTOS may also provide support for the processor's MMU (in your PIC case, I don't think that applies). This allows each thread (or process) to run in its own protected space; threads / processes cannot "dip into" each others' memory and stomp on it. Even device regions (MMIO) might be off limits to some (or all) threads. Strictly speaking, you don't need an RTOS to exploit a processor's MMU (or MPU), but the 2 work very well hand-in-hand.
Generally, when I can develop with an RTOS (or some type of preemptive multi-tasker), the result tends to be cleaner, more modular, more well-behaved and more maintainable. When I have the option, I use one.
Be aware that multi-threaded development has a bit of a learning curve. If you're new to RTOS/multithreaded development, you might be interested in some articles on Choosing an RTOS, The Perils of Preemption and An Introduction to Preemptive Multitasking.
Lastly, even though you didn't ask for recommendations... In addition to the many numerous commercial RTOSs, there are free offerings (FreeRTOS being one of the most popular), and the Quantum Platform is an event-driven framework based on the concept of active objects which includes a preemptive kernel. There are plenty of choices, but I've found that having the source code (even if the RTOS isn't free) is advantageous, esp. when debugging.
RTOS, first and foremost permits you to organize your parallel flows into the set of tasks with well-defined synchronization between them.
IMO, the non-RTOS design is suitable only for the single-flow architecture where all your program is one big endless loop. If you need the multi-flow - a number of tasks, running in parallel - you're better with RTOS. Without RTOS you'll be forced to implement this functionality in-house, re-inventing the wheel.
Code re-use -- if you code drivers/protocol-handlers using an RTOS API they may plug into future projects easier
Debugging -- some IDEs (such as IAR Embedded Workbench) have plugins that show nice live data about your running process such as task CPU utilization and stack utilization
Usually you want to use an RTOS if you have any real-time constraints. If you don’t have real-time constraints, a regular OS might suffice. RTOS’s/OS’s provide a run-time infrastructure like message queues and tasking. If you are just looking for code that can reduce complexity, provide low level support and help with testing, some of the following libraries might do:
The standard C/C++ libraries
Boost libraries
Libraries available through the manufacturer of the chip that can provide hardware specific support
Commercial libraries
Open source libraries
Additional to the points mentioned before, using an RTOS may also be useful if you need support for
standard storage devices (SD, Compact Flash, disk drives ...)
standard communication hardware (Ethernet, USB, Firewire, RS232, I2C, SPI, ...)
standard communication protocols (TCP-IP, ...)
Most RTOSes provide these features or are expandable to support them

Processor can support/require an RTOS?

I have few queries related with going in for an RTOS for the different processors in hand. These are generic questions. Maybe you can clarify with examples specific to any processor/rtos or even generally. How to determine if a processor can support a RTOS ? How to know if the processor requires a RTOS ?
Does a processor requires an RTOS?
No - you don't require an RTOS. You can have a sophisticated embedded application running without one. The applications that I am working on currently does not have an RTOS.
We have to think about scheduling various tasks in our application, and have to write code that schedules these tasks. We achieve most of it by simply using software timers and timeslicing different tasks as we deem appopriate. However, having an RTOS can make the process a lot easier by scheduling different parts of your code seamlessly, and you don't really have to worry about taking care of that then.
You have to consider a few things when you choose an RTOS. How much RAM does your processor have? How much FLASH do you have? You don't want to put an expensive chip on your board, and a heavy RTOS, if you don't need all the features of it.
For basic scheduling stuff, you can get relatively small RTOS's, that are not huge and that will do most things you want quite efficiently.
e.g. Free RTOS is open source and is roughly 9K's only
You can also choose to use RTOS' like VxWorks or Embedded Linux that do a whole lot more, but are either expensive or huge or both.
In the end, the RTOS you use really depends on what your application's needs are, and how much memory you have to spare for it.
This is another "how long is a piece of string" question, but I will give it +1 for being interesting.
Second point first. I don't think that a processor can require an RTOS; I would rather say that an application can.
As to whether a processor can support an RTOS, your principle questions are going to be how heavily you load it, how many events it must handle & how much processing they require, etc, and also the availability of interrupt handling mechanisms, etc.
Do you have a particular processor, ROTS, application in mind, or is this just a general question?
No processor REQUIRES an RTOS. RT is a feature of the programming, not something a processor can DEMAND.
EVERY processor that I know of supports RTOS - a hardware interrupt will interrupt at next instruction. It is basically the OS that stops that and handles things in a non-real-time fashion.
Why would a processor require and RTOS? After all an RTOS is just software running directly on the hardware, that software could equally be your application running directly on the hardware instead. That part of your question makes little sense. Now if you have a processor designed to run say Java code by executing bytecode in hardware, it would not make sense to use that processor with anything other than a JVM as the foundation for an application, but I cannot think of a processor that is so tailored to RTOS implementation that you could not use it without an RTOS.
Now with respect to whether a processor can support an RTOS the simplest way is to see if there is a commercial RTOS already implemented for it. Most processor vendors will ensure that such support is in place from one or more third-parties before a chip is generally available. Generally I would suggest that anything with an interrupt mechanism and timer hardware can support an RTOS or at least a scheduler of some sort given sufficient resources. However there are some very resource constrained microcontrollers where it would simply make no sense.

Multi core programming

I want to get into multi core programming (not language specific) and wondered what hardware could be recommended for exploring this field.
My aim is to upgrade my existing desktop.
If at all possible, I would suggest getting a dual-socket machine, preferably with quad-core chips. You can certainly get a single-socket machine, but dual-socket would let you start seeing some of the effects of NUMA memory that are going to be exacerbated as the core counts get higher and higher.
Why do you care? There are two huge problems facing multi-core developers right now:
The programming model Parallel programming is hard, and there is (currently) no getting around this. A quad-core system will let you start playing around with real concurrency and all of the popular paradigms (threads, UPC, MPI, OpenMP, etc).
Memory Whenever you start having multiple threads, there is going to be contention for resources, and the memory wall is growing larger and larger. A recent article at arstechnica outlines some (very preliminary) research at Sandia that shows just how bad this might become if current trends continue. Multicore machines are going to have to keep everything fed, and this will require that people be intimately familiar with their memory system. Dual-socket adds NUMA to the mix (at least on AMD machines), which should get you started down this difficult road.
If you're interested in more info on performance inconsistencies with multi-socket machines, you might also check out this technical report on the subject.
Also, others have suggested getting a system with a CUDA-capable GPU, which I think is also a great way to get into multithreaded programming. It's lower level than the stuff I mentioned above, but throw one of those on your machine if you can. The new Portland Group compilers have provisional support for optimizing loops with CUDA, so you could play around with your GPU even if you don't want to learn CUDA yourself.
Quad-core, because it'll permit you to do problems where the number of concurrent processes is > 2, which often non-trivializes problems.
I would also, for sheer geek squee, pick up a nice NVidia card and use the CUDA API. If you have the bucks, there's a stand-alone CUDA workstation that plugs into your main computer via a cable and an expansion slot.
It depends what you want to do.
If you want to learn the basics of multithreaded programming, then you can do that on your existing single-core PC. (If you have 2 threads, then the OS will switch between them on a single-core PC. Then when you move to a dual-core PC they should automatically run in parallel on separate cores, for a 2x speedup). This has the advantage of being free! The disadvantages are that you won't see a speedup (in fact a parallel implementation is probably slightly slower due to overheads), and that buggy code has a slightly higher chance of working.
However, although you can learn multithreaded programming on a single-core box, a dual-core (or even HyperThreading) CPU would be a great help.
If you want to really stress-test the code you're writing, then as "blue tuxedo" says, you should go for as many cores as you can easily afford, and if possible get hyperthreading too.
If you want to learn about algorithms for running on graphics cards - which is a very different area to x86 multicore - then get CUDA and buy a normal nVidia graphics card that supports it.
I'd recommend at least a quad-core processor.
You could try tinkering with CUDA. It's free, not that hard to use and will run on any recent NVIDIA card.
Alternatively, you could get a PlayStation 3 and the Linux SDK and work out how to program a Cell processor. Note that the next cheapest option for Cell BE development is an order of magnitude more expensive than a PS3.
Finally, any modern motherboard that will take a Core Quad or quad-core Opteron (get a good one from Asus or some other reputable manufacturer) will let you experiment with a multi-core PC system for a reasonable sum of money.
The difficult thing with multithreaded/core programming is that it opens a whole new can of worms. The bugs you'll be faced with are usually not the one you're used to. Race conditions can remain dormant for ages until they bite and your mainstream language compiler won't assist you in any way. You'll get random data and/or crashes that only happen once a day/week/month/year, usually under the most mysterious conditions...
One things remains true fortunately : the higher the concurrency exhibited by a computer, the more race conditions you'll unveil.
So if you're serious about multithreaded/core programming, then go for as many cpu cores as possible. Keep in mind that neither hyperthreading nor SMT allow for the level of concurrency that multiple cores provide.
I would agree that, depending on what you ultimately want to do, you can probably get by with just your current single-core system. Multi-core programming is basically multi-threaded programming, and you can certainly do that on a single-core chip.
When I was a student, one of our projects was to build a thread-safe implementation the malloc library for C. Even on a single core processor, that was more than enough to cure me of my desire to get into multi-threaded programming. I would try something small like that before you start thinking about spending lots of money.
I agree with the others where I would upgrade to a quad-core processor. I am also a BIG FAN of ASUS Motherboards (the P5Q Pro is excellent for Core2Quad and Core2Duo processors)!
The draw for multi-core programming is that you have more resources to get things done faster. If you are serious about multi-core programming, then I would absolutely get a quad-core processor. I don't believe that you should get the new i7 architecture from Intel to take advantage of multi-core processing because anything written to take advantage of the Core2Duo or Core2Quad will just run better on the newer architecture.
If you are going to dabble in multi-core programming, then I would get a good Core2Duo processor. Remember, it's not just how many cores you have, but also how FAST the cores are to process the jobs. My Core2Duo running at 4GHz routinely completes jobs faster than my Core2Quad running at 2.4GHz even with a multi-core program.
Let me know if this helps!
JFV

CUDA or FPGA for special purpose 3D graphics computations? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed 3 years ago.
Improve this question
I am developing a product with heavy 3D graphics computations, to a large extent closest point and range searches. Some hardware optimization would be useful. While I know little about this, my boss (who has no software experience) advocates FPGA (because it can be tailored), while our junior developer advocates GPGPU with CUDA, because its cheap, hot and open. While I feel I lack judgement in this question, I believe CUDA is the way to go also because I am worried about flexibility, our product is still under strong development.
So, rephrasing the question, are there any reasons to go for FPGA at all? Or is there a third option?
I investigated the same question a while back. After chatting to people who have worked on FPGAs, this is what I get:
FPGAs are great for realtime systems, where even 1ms of delay might be too long. This does not apply in your case;
FPGAs can be very fast, espeically for well-defined digital signal processing usages (e.g. radar data) but the good ones are much more expensive and specialised than even professional GPGPUs;
FPGAs are quite cumbersome to programme. Since there is a hardware configuration component to compiling, it could take hours. It seems to be more suited to electronic engineers (who are generally the ones who work on FPGAs) than software developers.
If you can make CUDA work for you, it's probably the best option at the moment. It will certainly be more flexible than a FPGA.
Other options include Brook from ATI, but until something big happens, it is simply not as well adopted as CUDA. After that, there's still all the traditional HPC options (clusters of x86/PowerPC/Cell), but they are all quite expensive.
Hope that helps.
We did some comparison between FPGA and CUDA. One thing where CUDA shines if you can realy formulate your problem in a SIMD fashion AND can access the memory coalesced. If the memory accesses are not coalesced(1) or if you have different control flow in different threads the GPU can lose drastically its performance and the FPGA can outperform it. Another thing is when your operation is realtive small, but you have a huge amount of it. But you cant (e.g. due to synchronisation) no start it in a loop in one kernel, then your invocation times for the GPU kernel exceeds the computation time.
Also the power of the FPGA could be better (depends on your application scenarion, ie. the GPU is only cheaper (in terms of Watts/Flop) when its computing all the time).
Offcourse the FPGA has also some drawbacks: IO can be one (we had here an application were we needed 70 GB/s, no problem for GPU, but to get this amount of data into a FPGA you need for conventional design more pins than available). Another drawback is the time and money. A FPGA is much more expensive than the best GPU and the development times are very high.
(1) Simultanously accesses from different thread to memory have to be to sequential addresses. This is sometimes really hard to achieve.
I would go with CUDA.
I work in image processing and have been trying hardware add-ons for years. First we had i860, then Transputer, then DSP, then the FPGA and direct-compiliation-to-hardware.
What innevitably happened was that by the time the hardware boards were really debugged and reliable and the code had been ported to them - regular CPUs had advanced to beat them, or the hosting machine architecture changed and we couldn't use the old boards, or the makers of the board went bust.
By sticking to something like CUDA you aren't tied to one small specialist maker of FPGA boards. The performence of GPUs is improving faster then CPUs and is funded by the gamers. It's a mainstream technology and so will probably merge with multi-core CPUs in the future and so protect your investment.
FPGAs
What you need:
Learn VHDL/Verilog (and trust me you don't want to)
Buy hw for testing, licences for synthesis tools
If you already have infrastructure and you need to develop only your core
Develop design ( and it can take years )
If you don't:
DMA, hw driver, ultra expensive synthesis tools
tons of knowledge about buses, memory mapping, hw synthesis
build the hw, buy the ip cores
Develop design
Not mentioning of board developement
For example average FPGA pcie card with chip Xilinx ZynqUS+ costs more than 3000$
FPGA cloud is also costly 2$/h+
Result:
This is something which requires resources of running company at least.
GPGPU (CUDA/OpenCL)
You already have hw to test on.
Compare to FPGA stuff:
Everything is well documented .
Everything is cheap
Everything works
Everything is well integrated to programming languages
There is GPU cloud as well.
Result:
You need to just download sdk and you can start.
This is an old thread started in 2008, but it would be good to recount what happened to FPGA programming since then:
1. C to gates in FPGA is the mainstream development for many companies with HUGE time saving vs. Verilog/SystemVerilog HDL. In C to gates System level design is the hard part.
2. OpenCL on FPGA is there for 4+ years including floating point and "cloud" deployment by Microsoft (Asure) and Amazon F1 (Ryft API). With OpenCL system design is relatively easy because of very well defined memory model and API between host and compute devices.
Software folks just need to learn a bit about FPGA architecture to be able to do things that are NOT EVEN POSSIBLE with GPUs and CPUs for the reasons of both being fixed silicon and not having broadband (100Gb+) interfaces to the outside world. Scaling down chip geometry is no longer possible, nor extracting more heat from the single chip package without melting it, so this looks like the end of the road for single package chips. My thesis here is that the future belongs to parallel programming of multi-chip systems, and FPGAs have a great chance to be ahead of the game. Check out http://isfpga.org/ if you have concerns about performance, etc.
FPGA-based solution is likely to be way more expensive than CUDA.
Obviously this is a complex question. The question might also include the cell processor.
And there is probably not a single answer which is correct for other related questions.
In my experience, any implementation done in abstract fashion, i.e. compiled high level language vs. machine level implementation, will inevitably have a performance cost, esp in a complex algorithm implementation. This is true of both FPGA's and processors of any type. An FPGA designed specifically to implement a complex algorithm will perform better than an FPGA whose processing elements are generic, allowing it a degree of programmability from input control registers, data i/o etc.
Another general example where an FPGA can be much higher performance is in cascaded processes where on process outputs become the inputs to another and they cannot be done concurrently. Cascading processes in an FPGA is simple, and can dramatically lower memory I/O requirements while processor memory will be used to effectively cascade two or more processes where there are data dependencies.
The same can be said of a GPU and CPU. Algorithms implemented in C executing on a CPU developed without regard to the inherent performance characteristics of the cache memory or main memory system will not perform as well as one implemented which does. Granted, not considering these performance characteristics simplifies implementation. But at a performance cost.
Having no direct experience with a GPU, but knowing its inherent memory system performance issues, it too will be subjected to performance issues.
CUDA has a fairly substantial code base of examples and a SDK, including a BLAS back-end. Try to find some examples similar to what you are doing, perhaps also looking at the GPU Gems series of books, to gauge how well CUDA will fit your applications. I'd say from a logistic point of view, CUDA is easier to work with and much, much cheaper than any professional FPGA development toolkit.
At one point I did look into CUDA for claim reserve simulation modelling. There is quite a good series of lectures linked off the web-site for learning. On Windows, you need to make sure CUDA is running on a card with no displays as the graphics subsystem has a watchdog timer that will nuke any process running for more than 5 seconds. This does not occur on Linux.
Any mahcine with two PCI-e x16 slots should support this. I used a HP XW9300, which you can pick up off ebay quite cheaply. If you do, make sure it has two CPU's (not one dual-core CPU) as the PCI-e slots live on separate Hypertransport buses and you need two CPU's in the machine to have both buses active.
What are you deploying on? Who is your customer? Without even know the answers to these questions, I would not use an FPGA unless you are building a real-time system and have electrical/computer engineers on your team that have knowledge of hardware description languages such as VHDL and Verilog. There's a lot to it and it takes a different frame of mind than conventional programming.
I'm a CUDA developer with very littel experience with FPGA:s, however I've been trying to find comparisons between the two.
What I've concluded so far:
The GPU has by far higher ( accessible ) peak performance
It has a more favorable FLOP/watt ratio.
It is cheaper
It is developing faster (quite soon you will literally have a "real" TFLOP available).
It is easier to program ( read article on this not personal opinion)
Note that I'm saying real/accessible to distinguish from the numbers you will see in a GPGPU commercial.
BUT the gpu is not more favorable when you need to do random accesses to data. This will hopefully change with the new Nvidia Fermi architecture which has an optional l1/l2 cache.
my 2 cents
Others have given good answers, just wanted to add a different perspective. Here is my survey paper published in ACM Computing Surveys 2015 (its permalink is here), which compares GPU with FPGA and CPU on energy efficiency metric. Most papers report: FPGA is more energy efficient than GPU, which, in turn, is more energy efficient than CPU. Since power budgets are fixed (depending on cooling capability), energy efficiency of FPGA means one can do more computations within same power budget with FPGA, and thus get better performance with FPGA than with GPU. Of course, also account for FPGA limitations, as mentioned by others.
FPGA will not be favoured by those with a software bias as they need to learn an HDL or at least understand systemC.
For those with a hardware bias FPGA will be the first option considered.
In reality a firm grasp of both is required & then an objective decision can be made.
OpenCL is designed to run on both FPGA & GPU, even CUDA can be ported to FPGA.
FPGA & GPU accelerators can be used together
So it's not a case of what is better one or the other. There is also the debate about CUDA vs OpenCL
Again unless you have optimized & benchmarked both to your specific application you can not know with 100% certainty.
Many will simply go with CUDA because of its commercial nature & resources. Others will go with openCL because of its versatility.
FPGAs are more parallel than GPUs, by three orders of magnitude. While good GPU features thousands of cores, FPGA may have millions of programmable gates.
While CUDA cores must do highly similar computations to be productive, FPGA cells are truly independent from each other.
FPGA can be very fast with some groups of tasks and are often used where a millisecond is already seen as a long duration.
GPU core is way more powerful than FPGA cell, and much easier to program. It is a core, can divide and multiply no problem when FPGA cell is only capable of rather simple boolean logic.
As GPU core is a core, it is efficient to program it in C++. Even it it is also possible to program FPGA in C++, it is inefficient (just "productive"). Specialized languages like VDHL or Verilog must be used - they are difficult and challenging to master.
Most of the true and tried instincts of a software engineer are useless with FPGA. You want a for loop with these gates? Which galaxy are you from? You need to change into the mindset of electronics engineer to understand this world.
at latest GTC'13 many HPC people agreed that CUDA is here to stay. FGPA's are cumbersome, CUDA is getting quite more mature supporting Python/C/C++/ARM.. either way, that was a dated question
Programming a GPU in CUDA is definitely easier. If you don't have any experience with programming FPGAs in HDL it will almost surely be too much of a challenge for you, but you can still program them with OpenCL which is kinda similar to CUDA. However, it is harder to implement and probably a lot more expensive than programming GPUs.
Which one is Faster?
GPU runs faster, but FPGA can be more efficient.
GPU has the potential of running at a speed higher than FPGA can ever reach. But only for algorithms that are specially suited for that. If the algorithm is not optimal, the GPU will loose a lot of performance.
FPGA on the other hand runs much slower, but you can implement problem-specific hardware that will be very efficient and get stuff done in less time.
It's kinda like eating your soup with a fork very fast vs. eating it with a spoon more slowly.
Both devices base their performance on parallelization, but each in a slightly different way. If the algorithm can be granulated into a lot of pieces that execute the same operations (keyword: SIMD), the GPU will be faster. If the algorithm can be implemented as a long pipeline, the FPGA will be faster. Also, if you want to use floating point, FPGA will not be very happy with it :)
I have dedicated my whole master's thesis to this topic.
Algorithm Acceleration on FPGA with OpenCL