How can I speed up deep learning on a non-NVIDIA setup? - tensorflow

Since I only have an AMD A10-7850 APU, and do not have the funds to spend on a $800-$1200 NVIDIA graphics card, I am trying to make due with the resources I have in order to speed up deep learning via tensorflow/keras.
Initially, I used a pre-compiled version of Tensorflow. InceptionV3 would take about 1000-1200 seconds to compute 1 epoch. It has been painfully slow.
To speed up calculations, I first self-compiled Tensorflow with optimizers (using AVX, and SSE4 instructions). This lead to a roughly 40% decrease in computation times. The same computations performed above now only take about 600 seconds to compute. It's almost bearable - kind of like you can watch paint dry.
I am looking for ways to further decrease computation times. I only have an integrated AMD graphics card that is part of the APU. (How) (C/c)an I make use of this resource to speed up computation even more?
More generally, let's say there are other people with similar monetary restrictions and Intel setups. How can anyone WITHOUT discrete NVIDIA cards make use of their integrated graphics chips or otherwise non-NVIDIA setup to achieve faster than CPU-only performance? Is that possible? Why/Why not? What needs to be done to achieve this goal? Or will this be possible in the near future (2-6 months)? How?

After researching this topic for a few months, I can see 3.5 possible paths forward:
1.) Tensorflow + OpenCl as mentioned in the comments above:
There seems to be some movement going on this field. Over at Codeplay, Lukasz Iwanski just posted a comprehensive answer on how to get tensorflow to run with opencl here (I will only provide a link as stated above because the information might change there): https://www.codeplay.com/portal/03-30-17-setting-up-tensorflow-with-opencl-using-sycl
The potential to use integrated graphics is alluring. It's also worth exploring the use of this combination with APUs. But I am not sure how well this will work since OpenCl support is still early in development, and hardware support is very limited. Furthermore, OpenCl is not the same as a handcrafted library of optimized code. (UPDATE 2017-04-24: I have gotten the code to compile after running into some issues here!) Unfortunately, the hoped for speed improvements ON MY SETUP (iGPU) did not materialize.
CIFAR 10 Dataset:
Tensorflow (via pip ak unoptimized): 1700sec/epoch at 390% CPU
utilization.
Tensorflow (SSE4, AVX): 1100sec/epoch at 390% CPU
utilization.
Tensorflow (opencl + iGPU): 5800sec/epoch at 150% CPU
and 100% GPU utilization.
Your mileage may vary significantly. So I am wondering what are other people getting relatively speaking (unoptimized vs optimized vs opencl) on your setups?
What should be noted: opencl implementation means that all the heavy computation should be done on the GPU. (Updated on 2017/4/29) But in reality this is not the case yet because some functions have not been implemented yet. This leads to unnecessary copying back and forth of data between CPU and GPU ram. Again, imminent changes should improve the situation. And furthermore, for those interested in helping out and those wanting to speed things up, we can do something that will have a measurable impact on the performance of tensorflow with opencl.
But as it stands for now: 1 iGPU << 4 CPUS with SSE+AVX. Perhaps beefier GPUs with larger RAM and/or opencl 2.0 implementation could have made a larger difference.
At this point, I should add that similar efforts have been going on with at least Caffe and/or Theano + OpenCl. The limiting step in all cases appears to be the manual porting of CUDA/cuDNN functionality to the openCl paradigm.
2.) RocM + MIOpen
RocM stands for Radeon Open Compute and seems to be a hodgepodge of initiatives that is/will make deep-learning possible on non-NVIDIA (mostly Radeon devices). It includes 3 major components:
HIP: A tool that converts CUDA code to code that can be consumed by AMD GPUs.
ROCk: a 64-bit linux kernel driver for AMD CPU+GPU devices.
HCC: A C/C++ compiler that compiles code into code for a heterogeneous system architecture environment (HSA).
Apparently, RocM is designed to play to AMDs strenghts of having both CPU and GPU technology. Their approach to speeding up deep-learning make use of both components. As an APU owner, I am particularly interested in this possibility. But as a cautionary note: Kaveri APUs have limited support (only integrated graphcs is supported). Future APUs have not been released yet. And it appears, there is still a lot of work that is being done here to bring this project to a mature state. A lot of work will hopefully make this approach viable within a year given that AMD has announced their Radeon Instinct cards will be released this year (2017).
The problem here for me is that that RocM is providing tools for building deep learning libraries. They do not themselves represent deep learning libraries. As a data scientist who is not focused on tools development, I just want something that works. and am not necessarily interested in building what I want to then do the learning. There are not enough hours in the day to do both well at the company I am at.
NVIDIA has of course CUDA and cuDNN which are libaries of hand-crafted assembler code optimized for NVIDIA GPUs. All major deep learning frameworks build on top of these proprietary libraries. AMD currently does not have anything like that at all.
I am uncertain how successfully AMD will get to where NVIDIA is in this regard. But there is some light being shone on what AMDs intentions are in an article posted by Carlos Perez on 4/3/2017 here. A recent lecture at Stanford also talks in general terms about Ryzen, Vega and deep learning fit together. In essence, the article states that MIOpen will represent this hand-crafted library of optimized deep learning functions for AMD devices. This library is set to be released in H1 of 2017. I am uncertain how soon these libraries would be incorporated into the major deep learning frameworks and what the scope of functional implementation will be then at this time.
But apparently, AMD has already worked with the developers of Caffe to "hippify" the code basis. Basically, CUDA code is converted automatically to C code via HIP. The automation takes care of the vast majority of the code basis, leaving only less than 0.5% of code had to be changed and required manual attention. Compare that to the manual translation into openCl code, and one starts getting the feeling that this approach might be more sustainable. What I am not clear about is where the lower-level assembler language optimization come in.
(Update 2017-05-19) But with the imminent release of AMD Vega cards (the professional Frontier Edition card not for consumers will be first), there are hints that major deep learning frameworks will be supported through the MIOpen framework. A Forbes article released today shows the progress MiOpen has taken over just the last couple of months in terms of performance: it appears significant.
(Update 2017-08-25) MiOpen has officially been released. We are no longer talking in hypotheticals here. Now we just need to try out how well this framework works.
3.) Neon
Neon is Nervana's (now acquired by Intel) open-source deep-learning framework. The reason I mention this framework is that it seems to be fairly straightforward to use. The syntax is about as easy and intuitive as Keras. More importantly though, this framework has achieved speeds up to 2x faster than Tensorflow on some benchmarks due to some hand-crafted assembler language optimization for those computations. Potentially, cutting computation times from 500 secs/epoch down to 300 secs/epoch is nothing to sneeze at. 300 secs = 5 minutes. So one could get 15 epochs in in an hour. and about 50 epochs in about 3.5 hours! But ideally, I want to do these kinds of calculations in under an hour. To get to those levels, I need to use a GPU, and at this point, only NVIDIA offers full support in this regard: Neon also uses CUDA and cuDNN when a GPU is available (and of course, it has to be an NVIDIA GPU). If you have access other Intel hardware this is of course a valid way to pursue. Afterall, Neon was developed out of a motivation to get things to work optimally also on non-NVIDIA setups (like Nervana's custom CPUs, and now Intel FPGAs or Xeon Phis).
3.5.) Intel Movidius
Update 2017-08-25: I came across this article. Intel has released a USB3.0-stick-based "deep learning" accelerator. Apparently, it works with Cafe and allows the user perform common Cafe-based fine-tuning of networks and inference. This is important stressing: If you want to train your own network from scratch, the wording is very ambiguous here. I will therefore assume, that apart from fine-tuning a network, training itself should still be done on something with more parallel compute. The real kicker though is this: When I checked for the pricing this stick costs a mere $79. That's nothing compared to the cost of your average NVIDIA 1070-80(ti) card. If you merely want to tackle some vision problems using common network topologies already available for some related tasks, you can use this stick to fine tune it to your own use, then compile the code and put it into this stick to do inference quickly. Many use cases can be covered with this stick, and for again $79 it could be worth it. This being Intel, they are proposing to go all out on Intel. Their model is to use the cloud (i.e. Nervana Cloud) for training. Then, use this chip for prototype inference or inference where energy consumption matters. Whether this is the right approach or not is left for the reader to answer.
At this time, it looks like deep learning without NVIDIA is still difficult to realize. Some limited speed gains are difficult but potentially possible through the use of opencl. Other initiatives sound promising but it will take time to sort out the real impact that these initiatives will have.

If your platform supports opencl you can look at using it with tensorflow. There is some experimental support for it on Linux at this github repository. Some preliminary instructions are at the documentation section of of this github repository.

Related

Recommended hardware to learn tensorflow?

I'm a software developer and I want to experiment with AI, machine learning etc. I want to learn about the different algorithms and techniques that are readily available, how to use them and what algos are appropriate for different types of challenges. TensorFlow looks like good software to start experimenting with, so I will start with TF.
I'm not interested in image processing. I'm mostly interested in making sense of patterns in data and making predictions.
Will I be able to experiment with all of the common examples and try out all the algorithms and features of TF with just a modern i7 with 8 threads or will I definitely need a GPU in order to not wait hours between each experiment?
If I do need a GPU, would an entry-level CUDA 3.0+ GPU suffice (example Geforce 730M with 2GB RAM, probably the cheapest compatible GPU)
Or would I need something with more punch and RAM like the 1050Ti/1080GTX/Ti etc?
Is it practical to learn on google or AWS or am I better off buying the hardware?
My fear is that I drop a lot of cash on a fancy graphics card and then don't really get into ML programming, and then it's a waste of money.
I have no idea if I'll even find it interesting/useful. So I'm not trying to conquer the world with ML just yet.
To sum up, my short term goals:
Get some experience with ML so that I know what techniques/algos are available/good for diff types of tasks.
See if I find ML interesting
Learn what kind of hardware I need if I want to invest further.
I can afford to buy a 1080Ti to experiment with, but I don't want to waste money without knowing more. If I buy a cheaper GPU like 1050Ti can I add a 1080Ti later or is it best if all my GPU's are the same?
Simply googling will give you most of your answers. But in short, yes you can use your CPU with TensorFlow.
You will need at least CUDA 7.5 and a compute capability of at least 3.0 on your GPU (http://www.nvidia.com/object/gpu-accelerated-applications-tensorflow-installation.html)
If you have no experience with machine learning yet I would suggest playing around with Scikit learn first (http://scikit-learn.org/). It's very easy to learn and playing around with something other than neural networks will give you a better understanding in the fundamentals of machine learning. Scikit learn does not require a GPU (GPU support is actually not provided).

Using Cuda optimization approaches for OpenCL

The more I learn about OpenCL, the more it seems that the right optimization of your kernel is the key to success. Furthermore I noticed, that the kernels for both languages seem very similar.
So how sensible would it be using Cuda optimization strategies learned from books and tutorials on OpenCL kernels? ... Considering that there is so much more (good) literature for Cuda than for OpenCL.
What is your opinion on that? What is your experience?
Thanks!
If you are working with just nvidia cards, you can use the same optimization approaches in both CUDA as well as OpenCL. A few things to keep in mind though is that OpenCL might have a larger start up time (This was a while ago when I was experimenting with both of them) compared to CUDA on nvidia cards.
However if you are going to work with different architectures, you will need to figure out a way to generalize your OpenCL program to be optimal across multiple platforms, which is not possible with CUDA.
But some of the few basic optimization approaches will remain the same.
For example, on any platform the following will be true.
Reading from and writing to memory
addresses that are aligned will have
higher performance (And sometimes
necessary on platforms like the Cell
Processor).
Knowing and understanding the limited resources
of each platform. (may it be called
constant memory, shared memory,
local memory or cache).
Understanding parallel programming.
For example, figuring out the trade
off between performance gains
(launching more threads) and
overhead costs (launching,
communication and synchronization).
That last part is useful in all kinds of parallel programming (be multi core, many core or grid computing).
While I'm still new at OpenCL (and barely glanced at CUDA), optimization at the developer level can be summarized as structuring your code so that it matches the hardware's (and compiler's) preferred way of doing things.
On GPUs, this can be anything from correctly ordering your data to take advantage of cache coherency (GPUs LOVE to work with cached data, from the top all the way down to the individual cores [there are several levels of cache]) to taking advantage of built-in operations like vector and matrix manipulation. I recently had to implement FDTD in OpenCL and found that by replacing the expanded dot/cross products in the popular implementations with matrix operations (which GPUs love!), reordering loops so that the X dimension (elements of which are stored sequentially) is handled in the innermost loop instead of the outer, avoiding branching (which GPUs hate), etc, I was able to increase the speed performance by about 20%. Those optimizations should work in CUDA, OpenCL or even GPU assembly, and I would expect that to be true of all of the most effective GPU optimizations.
Of course, most of this is application-dependent, so it may fall under the TIAS (try-it-and-see) category.
Here are a few links I found that look promising:
NVIDIA - Best Practices for OpenCL Programming
AMD - Porting CUDA to OpenCL
My research (and even NVIDIA's documentation) points to a nearly 1:1 correspondence between CUDA and OpenCL, so I would be very surprised if optimizations did not translate well between them. Most of what I have read focuses on cache coherency, avoiding branching, etc.
Also, note that in the case of OpenCL, the actual compilation process is handled by the vendor (I believe it happens in the video driver), so it may be worthwhile to have a look at the driver documentation and OpenCL kits from your vendor (NVIDIA, ATI, Intel(?), etc).

Optimisation , Compilers and Its Effects

(i) If a Program is optimised for one CPU class (e.g. Multi-Core Core i7)
by compiling the Code on the same , then will its performance
be at sub-optimal level on other CPUs from older generations (e.g. Pentium 4)
... Optimizing may prove harmful for performance on other CPUs..?
(ii)For optimization, compilers may use x86 extensions (like SSE 4) which are
not available in older CPUs.... so ,Is there a fall-back to some non-extensions
based routine on older CPUs..?
(iii)Is Intel C++ Compiler is more optimizing than Visual C++ Compiler or GCC..
(iv) Will a truly Multi-Core Threaded application will perform effeciently on a
older CPUs (like Pentium III or 4)..?
Compiling on a platform does not mean optimizing for this platform. (maybe it's just bad wording in your question.)
In all compilers I've used, optimizing for platform X does not affect the instruction set, only how it is used, e.g. optimizing for i7 does not enable SSE2 instructions.
Also, optimizers in most cases avoid "pessimizing" non-optimized platforms, e.g. when optimizing for i7, typically a small improvement on i7 will not not be chosen if it means a major hit for another common platform.
It also depends in the performance differences in the instruction sets - my impression is that they've become much less in the last decade (but I haven't delved to deep lately - might be wrong for the latest generations). Also consider that optimizations make a notable difference only in few places.
To illustrate possible options for an optimizer, consider the following methods to implement a switch statement:
sequence if (x==c) goto label
range check and jump table
binary search
combination of the above
the "best" algorithm depends on the relative cost of comparisons, jumps by fixed offsets and jumps to an address read from memory. They don't differ much on modern platforms, but even small differences can create a preference for one or other implementation.
It is probably true that optimising code for execution on CPU X will make that code less optimal on CPU Y than the same code optimised for execution on CPU Y. Probably.
Probably not.
Impossible to generalise. You have to test your code and come to your own conclusions.
Probably not.
For every argument about why X should be faster than Y under some set of conditions (choice of compiler, choice of CPU, choice of optimisation flags for compilation) some clever SOer will find a counter-argument, for every example a counter-example. When the rubber meets the road the only recourse you have is to test and measure. If you want to know whether compiler X is 'better' than compiler Y first define what you mean by better, then run a lot of experiments, then analyse the results.
I) If you did not tell the compiler which CPU type to favor, the odds are that it will be slightly sub-optimal on all CPUs. On the other hand, if you let the compiler know to optimize for your specific type of CPU, then it can definitely be sub-optimal on other CPU types.
II) No (for Intel and MS at least). If you tell the compiler to compile with SSE4, it will feel safe using SSE4 anywhere in the code without testing. It becomes your responsibility to ensure that your platform is capable of executing SSE4 instructions, otherwise your program will crash. You might want to compile two libraries and load the proper one. An alternative to compiling for SSE4 (or any other instruction set) is to use intrinsics, these will check internally for the best performing set of instructions (at the cost of a slight overhead). Note that I am not talking about instruction instrinsics here (those are specific to an instruction set), but intrinsic functions.
III) That is a whole other discussion in itself. It changes with every version, and may be different for different programs. So the only solution here is to test. Just a note though; Intel compilers are known not to compile well for running on anything other than Intel (e.g.: intrinsic functions may not recognize the instruction set of a AMD or Via CPU).
IV) If we ignore the on-die efficiencies of newer CPUs and the obvious architecture differences, then yes it may perform as well on older CPU. Multi-Core processing is not dependent per se on the CPU type. But the performance is VERY dependent on the machine architecture (e.g.: memory bandwidth, NUMA, chip-to-chip bus), and differences in the Multi-Core communication (e.g.: cache coherency, bus locking mechanism, shared cache). All this makes it impossible to compare newer and older CPU efficiencies in MP, but that is not what you are asking I believe. So on the whole, a MP program made for newer CPUs, should not be using less efficiently the MP aspects of older CPUs. Or in other words, just tweaking the MP aspects of a program specifically for an older CPU will not do much. Obviously you could rewrite your algorithm to more efficiently use a specific CPU (e.g.: A shared cache may permit you to use an algorithm that exchanges more data between working threads, but this algo will die on a system with no shared cache, full bus lock and low memory latency/bandwidth), but it involves a lot more than just MP related tweaks.
(1) Not only is it possible but it has been documented on pretty much every generation of x86 processor. Go back to the 8088 and work your way forward, every generation. Clock for clock the newer processor was slower for the current mainstream applications and operating systems (including Linux). The 32 to 64 bit transition is not helping, more cores and less clock speed is making it even worse. And this is true going backward as well for the same reason.
(2) Bank on your binaries failing or crashing. Sometimes you get lucky, most of the time you dont. There are new instructions yes, and to support them would probably mean trap for an undefined instruction and have a software emulation of that instruction which would be horribly slow and the lack of demand for it means it is probably not well done or just not there. Optimization can use new instructions but more than that the bulk of the optimization that I am guessing you are talking about has to do with reordering the instructions so that the various pipelines do not stall. So you arrange them to be fast on one generation processor they will be slower on another because in the x86 family the cores change too much. AMD had a good run there for a while as they would make the same code just run faster instead of trying to invent new processors that eventually would be faster when the software caught up. No longer true both amd and intel are struggling to just keep chips running without crashing.
(3) Generally, yes. For example gcc is a horrible compiler, one size fits all fits no one well, it can never and will never be any good at optimizing. For example gcc 4.x code is slower on gcc 3.x code for the same processor (yes all of this is subjective, it all depends on the specific application being compiled). The in house compilers I have used were leaps and bounds ahead of the cheap or free ones (I am not limiting myself to x86 here). Are they worth the price though? That is the question.
In general because of the horrible new programming languages and gobs of memory, storage, layers of caching, software engineering skills are at an all time low. Which means the pool of engineers capable of making a good compiler much less a good optimizing compiler decreases with time, this has been going on for at least 10 years. So even the in house compilers are degrading with time, or they just have their employees to work on and contribute to the open source tools instead having an in house tool. Also the tools the hardware engineers use are degrading for the same reason, so we now have processors that we hope to just run without crashing and not so much try to optimize for. There are so many bugs and chip variations that most of the compiler work is avoiding the bugs. Bottom line, gcc has singlehandedly destroyed the compiler world.
(4) See (2) above. Don't bank on it. Your operating system that you want to run this on will likely not install on the older processor anyway, saving you the pain. For the same reason that the binaries optimized for your pentium III ran slower on your Pentium 4 and vice versa. Code written to work well on multi core processors will run slower on single core processors than if you had optimized the same application for a single core processor.
The root of the problem is the x86 instruction set is dreadful. So many far superior instructions sets have come along that do not require hardware tricks to make them faster every generation. But the wintel machine created two monopolies and the others couldnt penetrate the market. My friends keep reminding me that these x86 machines are microcoded such that you really dont see the instruction set inside. Which angers me even more that the horrible isa is just an interpretation layer. It is kinda like using Java. The problems you have outlined in your questions will continue so long as intel stays on top, if the replacement does not become the monopoly then we will be stuck forever in the Java model where you are one side or the other of a common denominator, either you emulate the common platform on your specific hardware, or you are writing apps and compiling to the common platform.

Multi core programming

I want to get into multi core programming (not language specific) and wondered what hardware could be recommended for exploring this field.
My aim is to upgrade my existing desktop.
If at all possible, I would suggest getting a dual-socket machine, preferably with quad-core chips. You can certainly get a single-socket machine, but dual-socket would let you start seeing some of the effects of NUMA memory that are going to be exacerbated as the core counts get higher and higher.
Why do you care? There are two huge problems facing multi-core developers right now:
The programming model Parallel programming is hard, and there is (currently) no getting around this. A quad-core system will let you start playing around with real concurrency and all of the popular paradigms (threads, UPC, MPI, OpenMP, etc).
Memory Whenever you start having multiple threads, there is going to be contention for resources, and the memory wall is growing larger and larger. A recent article at arstechnica outlines some (very preliminary) research at Sandia that shows just how bad this might become if current trends continue. Multicore machines are going to have to keep everything fed, and this will require that people be intimately familiar with their memory system. Dual-socket adds NUMA to the mix (at least on AMD machines), which should get you started down this difficult road.
If you're interested in more info on performance inconsistencies with multi-socket machines, you might also check out this technical report on the subject.
Also, others have suggested getting a system with a CUDA-capable GPU, which I think is also a great way to get into multithreaded programming. It's lower level than the stuff I mentioned above, but throw one of those on your machine if you can. The new Portland Group compilers have provisional support for optimizing loops with CUDA, so you could play around with your GPU even if you don't want to learn CUDA yourself.
Quad-core, because it'll permit you to do problems where the number of concurrent processes is > 2, which often non-trivializes problems.
I would also, for sheer geek squee, pick up a nice NVidia card and use the CUDA API. If you have the bucks, there's a stand-alone CUDA workstation that plugs into your main computer via a cable and an expansion slot.
It depends what you want to do.
If you want to learn the basics of multithreaded programming, then you can do that on your existing single-core PC. (If you have 2 threads, then the OS will switch between them on a single-core PC. Then when you move to a dual-core PC they should automatically run in parallel on separate cores, for a 2x speedup). This has the advantage of being free! The disadvantages are that you won't see a speedup (in fact a parallel implementation is probably slightly slower due to overheads), and that buggy code has a slightly higher chance of working.
However, although you can learn multithreaded programming on a single-core box, a dual-core (or even HyperThreading) CPU would be a great help.
If you want to really stress-test the code you're writing, then as "blue tuxedo" says, you should go for as many cores as you can easily afford, and if possible get hyperthreading too.
If you want to learn about algorithms for running on graphics cards - which is a very different area to x86 multicore - then get CUDA and buy a normal nVidia graphics card that supports it.
I'd recommend at least a quad-core processor.
You could try tinkering with CUDA. It's free, not that hard to use and will run on any recent NVIDIA card.
Alternatively, you could get a PlayStation 3 and the Linux SDK and work out how to program a Cell processor. Note that the next cheapest option for Cell BE development is an order of magnitude more expensive than a PS3.
Finally, any modern motherboard that will take a Core Quad or quad-core Opteron (get a good one from Asus or some other reputable manufacturer) will let you experiment with a multi-core PC system for a reasonable sum of money.
The difficult thing with multithreaded/core programming is that it opens a whole new can of worms. The bugs you'll be faced with are usually not the one you're used to. Race conditions can remain dormant for ages until they bite and your mainstream language compiler won't assist you in any way. You'll get random data and/or crashes that only happen once a day/week/month/year, usually under the most mysterious conditions...
One things remains true fortunately : the higher the concurrency exhibited by a computer, the more race conditions you'll unveil.
So if you're serious about multithreaded/core programming, then go for as many cpu cores as possible. Keep in mind that neither hyperthreading nor SMT allow for the level of concurrency that multiple cores provide.
I would agree that, depending on what you ultimately want to do, you can probably get by with just your current single-core system. Multi-core programming is basically multi-threaded programming, and you can certainly do that on a single-core chip.
When I was a student, one of our projects was to build a thread-safe implementation the malloc library for C. Even on a single core processor, that was more than enough to cure me of my desire to get into multi-threaded programming. I would try something small like that before you start thinking about spending lots of money.
I agree with the others where I would upgrade to a quad-core processor. I am also a BIG FAN of ASUS Motherboards (the P5Q Pro is excellent for Core2Quad and Core2Duo processors)!
The draw for multi-core programming is that you have more resources to get things done faster. If you are serious about multi-core programming, then I would absolutely get a quad-core processor. I don't believe that you should get the new i7 architecture from Intel to take advantage of multi-core processing because anything written to take advantage of the Core2Duo or Core2Quad will just run better on the newer architecture.
If you are going to dabble in multi-core programming, then I would get a good Core2Duo processor. Remember, it's not just how many cores you have, but also how FAST the cores are to process the jobs. My Core2Duo running at 4GHz routinely completes jobs faster than my Core2Quad running at 2.4GHz even with a multi-core program.
Let me know if this helps!
JFV

CUDA or FPGA for special purpose 3D graphics computations? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed 3 years ago.
Improve this question
I am developing a product with heavy 3D graphics computations, to a large extent closest point and range searches. Some hardware optimization would be useful. While I know little about this, my boss (who has no software experience) advocates FPGA (because it can be tailored), while our junior developer advocates GPGPU with CUDA, because its cheap, hot and open. While I feel I lack judgement in this question, I believe CUDA is the way to go also because I am worried about flexibility, our product is still under strong development.
So, rephrasing the question, are there any reasons to go for FPGA at all? Or is there a third option?
I investigated the same question a while back. After chatting to people who have worked on FPGAs, this is what I get:
FPGAs are great for realtime systems, where even 1ms of delay might be too long. This does not apply in your case;
FPGAs can be very fast, espeically for well-defined digital signal processing usages (e.g. radar data) but the good ones are much more expensive and specialised than even professional GPGPUs;
FPGAs are quite cumbersome to programme. Since there is a hardware configuration component to compiling, it could take hours. It seems to be more suited to electronic engineers (who are generally the ones who work on FPGAs) than software developers.
If you can make CUDA work for you, it's probably the best option at the moment. It will certainly be more flexible than a FPGA.
Other options include Brook from ATI, but until something big happens, it is simply not as well adopted as CUDA. After that, there's still all the traditional HPC options (clusters of x86/PowerPC/Cell), but they are all quite expensive.
Hope that helps.
We did some comparison between FPGA and CUDA. One thing where CUDA shines if you can realy formulate your problem in a SIMD fashion AND can access the memory coalesced. If the memory accesses are not coalesced(1) or if you have different control flow in different threads the GPU can lose drastically its performance and the FPGA can outperform it. Another thing is when your operation is realtive small, but you have a huge amount of it. But you cant (e.g. due to synchronisation) no start it in a loop in one kernel, then your invocation times for the GPU kernel exceeds the computation time.
Also the power of the FPGA could be better (depends on your application scenarion, ie. the GPU is only cheaper (in terms of Watts/Flop) when its computing all the time).
Offcourse the FPGA has also some drawbacks: IO can be one (we had here an application were we needed 70 GB/s, no problem for GPU, but to get this amount of data into a FPGA you need for conventional design more pins than available). Another drawback is the time and money. A FPGA is much more expensive than the best GPU and the development times are very high.
(1) Simultanously accesses from different thread to memory have to be to sequential addresses. This is sometimes really hard to achieve.
I would go with CUDA.
I work in image processing and have been trying hardware add-ons for years. First we had i860, then Transputer, then DSP, then the FPGA and direct-compiliation-to-hardware.
What innevitably happened was that by the time the hardware boards were really debugged and reliable and the code had been ported to them - regular CPUs had advanced to beat them, or the hosting machine architecture changed and we couldn't use the old boards, or the makers of the board went bust.
By sticking to something like CUDA you aren't tied to one small specialist maker of FPGA boards. The performence of GPUs is improving faster then CPUs and is funded by the gamers. It's a mainstream technology and so will probably merge with multi-core CPUs in the future and so protect your investment.
FPGAs
What you need:
Learn VHDL/Verilog (and trust me you don't want to)
Buy hw for testing, licences for synthesis tools
If you already have infrastructure and you need to develop only your core
Develop design ( and it can take years )
If you don't:
DMA, hw driver, ultra expensive synthesis tools
tons of knowledge about buses, memory mapping, hw synthesis
build the hw, buy the ip cores
Develop design
Not mentioning of board developement
For example average FPGA pcie card with chip Xilinx ZynqUS+ costs more than 3000$
FPGA cloud is also costly 2$/h+
Result:
This is something which requires resources of running company at least.
GPGPU (CUDA/OpenCL)
You already have hw to test on.
Compare to FPGA stuff:
Everything is well documented .
Everything is cheap
Everything works
Everything is well integrated to programming languages
There is GPU cloud as well.
Result:
You need to just download sdk and you can start.
This is an old thread started in 2008, but it would be good to recount what happened to FPGA programming since then:
1. C to gates in FPGA is the mainstream development for many companies with HUGE time saving vs. Verilog/SystemVerilog HDL. In C to gates System level design is the hard part.
2. OpenCL on FPGA is there for 4+ years including floating point and "cloud" deployment by Microsoft (Asure) and Amazon F1 (Ryft API). With OpenCL system design is relatively easy because of very well defined memory model and API between host and compute devices.
Software folks just need to learn a bit about FPGA architecture to be able to do things that are NOT EVEN POSSIBLE with GPUs and CPUs for the reasons of both being fixed silicon and not having broadband (100Gb+) interfaces to the outside world. Scaling down chip geometry is no longer possible, nor extracting more heat from the single chip package without melting it, so this looks like the end of the road for single package chips. My thesis here is that the future belongs to parallel programming of multi-chip systems, and FPGAs have a great chance to be ahead of the game. Check out http://isfpga.org/ if you have concerns about performance, etc.
FPGA-based solution is likely to be way more expensive than CUDA.
Obviously this is a complex question. The question might also include the cell processor.
And there is probably not a single answer which is correct for other related questions.
In my experience, any implementation done in abstract fashion, i.e. compiled high level language vs. machine level implementation, will inevitably have a performance cost, esp in a complex algorithm implementation. This is true of both FPGA's and processors of any type. An FPGA designed specifically to implement a complex algorithm will perform better than an FPGA whose processing elements are generic, allowing it a degree of programmability from input control registers, data i/o etc.
Another general example where an FPGA can be much higher performance is in cascaded processes where on process outputs become the inputs to another and they cannot be done concurrently. Cascading processes in an FPGA is simple, and can dramatically lower memory I/O requirements while processor memory will be used to effectively cascade two or more processes where there are data dependencies.
The same can be said of a GPU and CPU. Algorithms implemented in C executing on a CPU developed without regard to the inherent performance characteristics of the cache memory or main memory system will not perform as well as one implemented which does. Granted, not considering these performance characteristics simplifies implementation. But at a performance cost.
Having no direct experience with a GPU, but knowing its inherent memory system performance issues, it too will be subjected to performance issues.
CUDA has a fairly substantial code base of examples and a SDK, including a BLAS back-end. Try to find some examples similar to what you are doing, perhaps also looking at the GPU Gems series of books, to gauge how well CUDA will fit your applications. I'd say from a logistic point of view, CUDA is easier to work with and much, much cheaper than any professional FPGA development toolkit.
At one point I did look into CUDA for claim reserve simulation modelling. There is quite a good series of lectures linked off the web-site for learning. On Windows, you need to make sure CUDA is running on a card with no displays as the graphics subsystem has a watchdog timer that will nuke any process running for more than 5 seconds. This does not occur on Linux.
Any mahcine with two PCI-e x16 slots should support this. I used a HP XW9300, which you can pick up off ebay quite cheaply. If you do, make sure it has two CPU's (not one dual-core CPU) as the PCI-e slots live on separate Hypertransport buses and you need two CPU's in the machine to have both buses active.
What are you deploying on? Who is your customer? Without even know the answers to these questions, I would not use an FPGA unless you are building a real-time system and have electrical/computer engineers on your team that have knowledge of hardware description languages such as VHDL and Verilog. There's a lot to it and it takes a different frame of mind than conventional programming.
I'm a CUDA developer with very littel experience with FPGA:s, however I've been trying to find comparisons between the two.
What I've concluded so far:
The GPU has by far higher ( accessible ) peak performance
It has a more favorable FLOP/watt ratio.
It is cheaper
It is developing faster (quite soon you will literally have a "real" TFLOP available).
It is easier to program ( read article on this not personal opinion)
Note that I'm saying real/accessible to distinguish from the numbers you will see in a GPGPU commercial.
BUT the gpu is not more favorable when you need to do random accesses to data. This will hopefully change with the new Nvidia Fermi architecture which has an optional l1/l2 cache.
my 2 cents
Others have given good answers, just wanted to add a different perspective. Here is my survey paper published in ACM Computing Surveys 2015 (its permalink is here), which compares GPU with FPGA and CPU on energy efficiency metric. Most papers report: FPGA is more energy efficient than GPU, which, in turn, is more energy efficient than CPU. Since power budgets are fixed (depending on cooling capability), energy efficiency of FPGA means one can do more computations within same power budget with FPGA, and thus get better performance with FPGA than with GPU. Of course, also account for FPGA limitations, as mentioned by others.
FPGA will not be favoured by those with a software bias as they need to learn an HDL or at least understand systemC.
For those with a hardware bias FPGA will be the first option considered.
In reality a firm grasp of both is required & then an objective decision can be made.
OpenCL is designed to run on both FPGA & GPU, even CUDA can be ported to FPGA.
FPGA & GPU accelerators can be used together
So it's not a case of what is better one or the other. There is also the debate about CUDA vs OpenCL
Again unless you have optimized & benchmarked both to your specific application you can not know with 100% certainty.
Many will simply go with CUDA because of its commercial nature & resources. Others will go with openCL because of its versatility.
FPGAs are more parallel than GPUs, by three orders of magnitude. While good GPU features thousands of cores, FPGA may have millions of programmable gates.
While CUDA cores must do highly similar computations to be productive, FPGA cells are truly independent from each other.
FPGA can be very fast with some groups of tasks and are often used where a millisecond is already seen as a long duration.
GPU core is way more powerful than FPGA cell, and much easier to program. It is a core, can divide and multiply no problem when FPGA cell is only capable of rather simple boolean logic.
As GPU core is a core, it is efficient to program it in C++. Even it it is also possible to program FPGA in C++, it is inefficient (just "productive"). Specialized languages like VDHL or Verilog must be used - they are difficult and challenging to master.
Most of the true and tried instincts of a software engineer are useless with FPGA. You want a for loop with these gates? Which galaxy are you from? You need to change into the mindset of electronics engineer to understand this world.
at latest GTC'13 many HPC people agreed that CUDA is here to stay. FGPA's are cumbersome, CUDA is getting quite more mature supporting Python/C/C++/ARM.. either way, that was a dated question
Programming a GPU in CUDA is definitely easier. If you don't have any experience with programming FPGAs in HDL it will almost surely be too much of a challenge for you, but you can still program them with OpenCL which is kinda similar to CUDA. However, it is harder to implement and probably a lot more expensive than programming GPUs.
Which one is Faster?
GPU runs faster, but FPGA can be more efficient.
GPU has the potential of running at a speed higher than FPGA can ever reach. But only for algorithms that are specially suited for that. If the algorithm is not optimal, the GPU will loose a lot of performance.
FPGA on the other hand runs much slower, but you can implement problem-specific hardware that will be very efficient and get stuff done in less time.
It's kinda like eating your soup with a fork very fast vs. eating it with a spoon more slowly.
Both devices base their performance on parallelization, but each in a slightly different way. If the algorithm can be granulated into a lot of pieces that execute the same operations (keyword: SIMD), the GPU will be faster. If the algorithm can be implemented as a long pipeline, the FPGA will be faster. Also, if you want to use floating point, FPGA will not be very happy with it :)
I have dedicated my whole master's thesis to this topic.
Algorithm Acceleration on FPGA with OpenCL