I've been Googling all morning trying to find a way to use VB.Net code to detect the platform and family of my CPU.
For instance:
Intel Xeon (Haswell) OR Intel Xeon (Sandy Bridge)
I have a program I’ve been working on that has several different versions specifically optimized for certain CPU platforms/families/generations.
So it's important that my software is able to detect the CPU platform/family/generation (e.g. Haswell, Broadwell, Ivy Bridge, Sandy Bridge, Coffee Lake...) so that it can run the version optimized for that CPU.
Everything I've found so far can give me EVERY bit of info about my CPU and/or system EXCEPT the CPU platform/family/generation.
Can any of you guys help me out with this, or provide some kind of workaround to help?
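One language-neutral workaround is to read the CPU's numeric family and model (via CPUID, or in VB.Net for example by parsing the Description string of WMI's Win32_Processor class, which contains text like "Intel64 Family 6 Model 60 Stepping 3") and map those numbers to microarchitecture code names yourself. Below is a minimal sketch of that idea in Python using the py-cpuinfo package; the code-name table contains only a few illustrative Intel entries and would need to be extended for real use:

```python
# Minimal sketch: map CPUID family/model numbers to microarchitecture names.
# Uses the py-cpuinfo package ("pip install py-cpuinfo"); the table below is
# deliberately incomplete and only lists a few Intel family-6 models as examples.
import cpuinfo

CODENAMES = {
    (6, 42):  "Sandy Bridge",
    (6, 58):  "Ivy Bridge",
    (6, 60):  "Haswell",
    (6, 61):  "Broadwell",
    (6, 94):  "Skylake",
    (6, 158): "Kaby Lake / Coffee Lake",
}

info = cpuinfo.get_cpu_info()
key = (info.get("family"), info.get("model"))   # key names may vary by version
print(info.get("brand_raw") or info.get("brand"))
print("Microarchitecture:", CODENAMES.get(key, "unknown (extend the table)"))
```

The same lookup-table approach works in VB.Net once the family/model numbers have been parsed out of WMI or CPUID.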
Related
Since I only have an AMD A10-7850 APU and do not have the funds to spend on an $800-$1200 NVIDIA graphics card, I am trying to make do with the resources I have in order to speed up deep learning via Tensorflow/Keras.
Initially, I used a pre-compiled version of Tensorflow. InceptionV3 would take about 1000-1200 seconds to compute 1 epoch. It has been painfully slow.
To speed up calculations, I first self-compiled Tensorflow with optimization flags (AVX and SSE4 instructions). This led to a roughly 40% decrease in computation times: the same computations performed above now only take about 600 seconds. It's almost bearable - kind of like watching paint dry.
I am looking for ways to further decrease computation times. I only have the integrated AMD graphics that is part of the APU. How (if at all) can I make use of this resource to speed up computation even more?
More generally, let's say there are other people with similar monetary restrictions and Intel setups. How can anyone WITHOUT a discrete NVIDIA card make use of their integrated graphics chip, or other non-NVIDIA hardware, to achieve faster-than-CPU-only performance? Is that possible? Why or why not? What needs to be done to achieve this goal? Or will this be possible in the near future (2-6 months)? How?
After researching this topic for a few months, I can see 3.5 possible paths forward:
1.) Tensorflow + OpenCL, as mentioned in the comments above:
There seems to be some movement going on in this field. Over at Codeplay, Lukasz Iwanski just posted a comprehensive answer on how to get Tensorflow to run with OpenCL (I will only provide a link, as stated above, because the information there might change): https://www.codeplay.com/portal/03-30-17-setting-up-tensorflow-with-opencl-using-sycl
The potential to use integrated graphics is alluring, and it is also worth exploring the use of this combination with APUs. But I am not sure how well this will work, since OpenCL support is still early in development and hardware support is very limited. Furthermore, OpenCL is not the same as a handcrafted library of optimized code. (UPDATE 2017-04-24: I have gotten the code to compile after running into some issues here!) Unfortunately, the hoped-for speed improvements ON MY SETUP (iGPU) did not materialize.
CIFAR-10 dataset:
Tensorflow (via pip, aka unoptimized): 1700 sec/epoch at 390% CPU utilization.
Tensorflow (SSE4, AVX): 1100 sec/epoch at 390% CPU utilization.
Tensorflow (OpenCL + iGPU): 5800 sec/epoch at 150% CPU and 100% GPU utilization.
Your mileage may vary significantly, so I am wondering: what are other people getting, relatively speaking (unoptimized vs. optimized vs. OpenCL), on their setups?
What should be noted: the OpenCL implementation means that all the heavy computation should be done on the GPU. (Updated on 2017-04-29) In reality this is not yet the case, because some functions have not been implemented, which leads to unnecessary copying of data back and forth between CPU and GPU RAM. Again, imminent changes should improve the situation. And for those interested in helping out and wanting to speed things up, contributing is something that can have a measurable impact on the performance of Tensorflow with OpenCL.
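One practical way to see this on your own setup: TensorFlow 1.x can log where each op is placed, so you can spot which ops fall back to the CPU (and therefore force the copies described above). A minimal sketch, assuming a TF 1.x build with the SYCL/OpenCL support discussed here; the exact device names printed in the log depend on your build:

```python
import tensorflow as tf  # TensorFlow 1.x API

# Ask TensorFlow to print, for every op, which device it was placed on.
config = tf.ConfigProto(log_device_placement=True)

a = tf.random_normal([1024, 1024])
b = tf.random_normal([1024, 1024])
c = tf.matmul(a, b)

with tf.Session(config=config) as sess:
    sess.run(c)
# Ops that show up as CPU placements in the log are the ones forcing
# the CPU <-> GPU copies mentioned above.
```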
But as it stands now: 1 iGPU << 4 CPUs with SSE+AVX. Perhaps beefier GPUs with more RAM and/or an OpenCL 2.0 implementation could make a larger difference.
At this point, I should add that similar efforts have been going on with at least Caffe and/or Theano + OpenCL. The limiting step in all cases appears to be the manual porting of CUDA/cuDNN functionality to the OpenCL paradigm.
2.) ROCm + MIOpen
ROCm stands for Radeon Open Compute and seems to be a hodgepodge of initiatives that makes (or will make) deep learning possible on non-NVIDIA, mostly Radeon, devices. It includes three major components:
HIP: A tool that converts CUDA code to code that can be consumed by AMD GPUs.
ROCk: a 64-bit Linux kernel driver for AMD CPU+GPU devices.
HCC: a C/C++ compiler that compiles code for a Heterogeneous System Architecture (HSA) environment.
Apparently, ROCm is designed to play to AMD's strength of having both CPU and GPU technology, and its approach to speeding up deep learning makes use of both. As an APU owner, I am particularly interested in this possibility. But a cautionary note: Kaveri APUs have limited support (only the integrated graphics is supported), and future APUs have not been released yet. It appears there is still a lot of work being done to bring this project to a mature state; hopefully that work will make this approach viable within a year, given that AMD has announced its Radeon Instinct cards will be released this year (2017).
The problem here, for me, is that ROCm provides tools for building deep-learning libraries; it does not itself represent a deep-learning library. As a data scientist who is not focused on tools development, I just want something that works, and I am not necessarily interested in building the tools before I can do the actual learning. There are not enough hours in the day to do both well at the company I work for.
NVIDIA, of course, has CUDA and cuDNN, which are libraries of hand-crafted assembler code optimized for NVIDIA GPUs. All major deep-learning frameworks build on top of these proprietary libraries. AMD currently has nothing comparable.
I am uncertain how successfully AMD will get to where NVIDIA is in this regard. But some light is shed on AMD's intentions in an article posted by Carlos Perez on 4/3/2017 here. A recent lecture at Stanford also talks in general terms about how Ryzen, Vega and deep learning fit together. In essence, the article states that MIOpen will be this hand-crafted library of optimized deep-learning functions for AMD devices. The library is set to be released in H1 2017. I am uncertain how soon it will be incorporated into the major deep-learning frameworks, and what the scope of the functional implementation will be at that time.
But apparently, AMD has already worked with the developers of Caffe to "hipify" the code base. Basically, the CUDA code is converted automatically to C++ code via HIP. The automation takes care of the vast majority of the code base; less than 0.5% of the code had to be changed and required manual attention. Compare that to the manual translation into OpenCL code, and one starts to get the feeling that this approach might be more sustainable. What I am not clear about is where the lower-level assembly-language optimizations come in.
(Update 2017-05-19) With the imminent release of the AMD Vega cards (the professional Frontier Edition, not aimed at consumers, will be first), there are hints that major deep-learning frameworks will be supported through the MIOpen framework. A Forbes article released today shows the progress MIOpen has made over just the last couple of months in terms of performance, and it appears significant.
(Update 2017-08-25) MIOpen has officially been released. We are no longer talking in hypotheticals here; now we just need to try out how well this framework works.
3.) Neon
Neon is Nervana's (now acquired by Intel) open-source deep-learning framework. The reason I mention this framework is that it seems fairly straightforward to use; the syntax is about as easy and intuitive as Keras. More importantly, this framework has achieved speeds up to 2x faster than Tensorflow on some benchmarks, thanks to hand-crafted assembly-level optimization of those computations. Cutting computation times from 500 sec/epoch down to 300 sec/epoch is nothing to sneeze at: 300 sec = 5 minutes, so one could get about 12 epochs in an hour and 50 epochs in roughly four hours. But ideally, I want to do these kinds of calculations in under an hour. To get to those levels, I need to use a GPU, and at this point only NVIDIA offers full support in this regard: Neon also uses CUDA and cuDNN when a GPU is available (and of course, it has to be an NVIDIA GPU). If you have access to other Intel hardware, this is of course a valid path to pursue. After all, Neon was developed out of a motivation to get things to work optimally on non-NVIDIA setups too (like Nervana's custom chips, and now Intel FPGAs or Xeon Phis).
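For a flavor of how Neon selects its compute back-end (the Keras-like model definition sits on top of this), here is a minimal sketch based on Neon's gen_backend call; which back-ends are actually available depends on how Neon was installed, and the GPU back-end requires a CUDA-capable NVIDIA card:

```python
# Minimal sketch: choosing Neon's compute back-end.
# Assumes the neon package (Nervana/Intel) is installed; the 'gpu' back-end
# needs an NVIDIA GPU with CUDA/cuDNN available.
from neon.backends import gen_backend

try:
    be = gen_backend(backend='gpu', batch_size=128)   # CUDA/cuDNN kernels
except Exception:
    be = gen_backend(backend='cpu', batch_size=128)   # fallback: CPU back-end

print(be)  # the model definition built on top of this looks broadly Keras-like
```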
3.5.) Intel Movidius
Update 2017-08-25: I came across this article. Intel has released a USB 3.0 stick-based "deep learning" accelerator. Apparently, it works with Caffe and allows the user to perform common Caffe-based fine-tuning of networks and inference. This is important to stress: if you want to train your own network from scratch, the wording here is very ambiguous. I will therefore assume that, apart from fine-tuning a network, training itself should still be done on something with more parallel compute. The real kicker, though, is this: when I checked the pricing, this stick costs a mere $79. That's nothing compared to the cost of your average NVIDIA 1070-80(ti) card. If you merely want to tackle some vision problems using common network topologies already available for related tasks, you can use this stick to fine-tune them to your own use case, then compile the code and deploy it to the stick for fast inference. Many use cases can be covered with this stick, and at $79 it could be worth it. This being Intel, they propose going all-in on Intel: use the cloud (i.e. Nervana Cloud) for training, then use this chip for prototype inference, or for inference where energy consumption matters. Whether this is the right approach or not is left for the reader to answer.
At this time, it looks like deep learning without NVIDIA is still difficult to realize. Some limited speed gains are possible, with difficulty, through the use of OpenCL. Other initiatives sound promising, but it will take time to sort out the real impact they will have.
If your platform supports OpenCL, you can look at using it with Tensorflow. There is some experimental support for it on Linux at this GitHub repository, and some preliminary instructions are in the documentation section of that repository.
I am interested in trying out GPU programming. One thing that is not clear to me is what hardware I need. Is it true that any PC with a graphics card will do? I have very little knowledge of GPU programming, so a gentle learning curve is best. If I would have to make a lot of hacks just to run a tutorial because my hardware is not good enough, I'd rather buy new hardware.
I have a retired PC (~10 years old) running Ubuntu Linux. I am not sure what graphics card it has; it must be an old one.
I am also planning to buy a new sub-$500 desktop, which from my casual research normally comes with an AMD Radeon 7x or NVIDIA GT 6x series graphics card. I assume any new PC is good enough for learning GPU programming.
Anyway, any suggestions are appreciated.
If you want to use CUDA, you'll need a GPU from NVIDIA, and their site explains the compute capabilities of their different products.
If you want to learn OpenCL, you can start right now with an OpenCL implementation that has a CPU back-end. The basics of writing OpenCL code targeting CPUs or GPUs are the same; they differ mainly in performance tuning.
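To illustrate what getting started looks like, here is a minimal sketch using the pyopencl bindings; the kernel source is identical whether create_some_context() ends up picking a CPU or a GPU device:

```python
# Minimal OpenCL vector addition via the pyopencl bindings.
# Runs on any OpenCL device (CPU back-end or GPU); only the device chosen
# by create_some_context() changes, not the kernel code.
import numpy as np
import pyopencl as cl

a = np.random.rand(50000).astype(np.float32)
b = np.random.rand(50000).astype(np.float32)

ctx = cl.create_some_context()      # pick a platform/device (may prompt)
queue = cl.CommandQueue(ctx)

mf = cl.mem_flags
a_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=a)
b_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=b)
out_buf = cl.Buffer(ctx, mf.WRITE_ONLY, a.nbytes)

program = cl.Program(ctx, """
__kernel void vec_add(__global const float *a,
                      __global const float *b,
                      __global float *out) {
    int gid = get_global_id(0);
    out[gid] = a[gid] + b[gid];
}
""").build()

program.vec_add(queue, a.shape, None, a_buf, b_buf, out_buf)

result = np.empty_like(a)
cl.enqueue_copy(queue, result, out_buf)
print(np.allclose(result, a + b))   # True
```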
For GPU programming, any AMD or NVidia GPU made in the past several years will have some degree of OpenCL support, though there have been some new features introduced with newer generations that can't be easily emulated for older generations.
Intel's integrated GPUs in Ivy Bridge and later support OpenCL, but Intel only provides a GPU-capable OpenCL implementation for Windows, not Linux.
Also be aware that there is a huge difference between a mid-range and a high-end GPU in terms of compute capabilities, especially when it comes to double-precision arithmetic. Some low-end GPUs don't support double precision at all, and mid-range GPUs often perform double-precision arithmetic 24 times slower than single-precision. When you want to do a lot of double-precision calculations, it's absolutely worth it to get a compute-oriented GPU (Radeon 7900 series or GeForce Titan and up).
If you want a low-end system with non-trivial GPU power, your best bet at the moment is probably a system built around an AMD APU.
I am profiling a C++ application with Intel VTune Amplifier. Most of the time seems to be spent in nvoglv64.dll, more precisely in DrvPresentBuffers and/or KeSynchronizeExecution. Note that I have an NVIDIA GeForce graphics card.
I am new to the application I am profiling and am looking for bottlenecks and low-hanging optimization fruit. Since most of the time seems to be spent in this NVIDIA DLL, I do not know how to decode the profiling results.
I would like to know where those calls come from on my application's side, in order to build up knowledge of the application. Can someone give me some hints to get started:
When exactly does an application call DrvPresentBuffers, and what kind of calls should I look for (on my application's side)?
Where can I get more info about how to profile, understand and optimize applications whose bottlenecks are in the graphics card DLLs?
DrvPresentBuffers is part of the draw/present path for OpenGL. nvoglv64.dll is the 64-bit OpenGL driver for your NVIDIA card. There is a known performance issue with this function on 64-bit Windows 7 with many drivers; I couldn't find a link, but you can search the NVIDIA forum if you are experiencing problems. If nothing is wrong and nothing is going horribly slow, then I'm not sure optimization is where I would start when familiarizing myself with a new application.
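To answer the first question concretely: time in the driver's present path typically accumulates wherever your application swaps buffers at the end of a frame (and with vsync on, the driver may simply be waiting there for the next refresh). A minimal sketch of such a loop, using the glfw and PyOpenGL Python bindings purely for illustration; in a Win32 C++ application the equivalent call to look for would be SwapBuffers/wglSwapBuffers:

```python
# Minimal render loop: the buffer swap at the end of each frame is the call
# that ends up in the driver's present path (e.g. DrvPresentBuffers).
import glfw
from OpenGL.GL import glClear, glClearColor, GL_COLOR_BUFFER_BIT

if not glfw.init():
    raise RuntimeError("GLFW init failed")

window = glfw.create_window(640, 480, "present example", None, None)
if not window:
    glfw.terminate()
    raise RuntimeError("window creation failed")

glfw.make_context_current(window)
glfw.swap_interval(1)   # vsync on: the swap may block until the next refresh

while not glfw.window_should_close(window):
    glClearColor(0.1, 0.1, 0.1, 1.0)
    glClear(GL_COLOR_BUFFER_BIT)
    # ... draw calls go here ...

    glfw.swap_buffers(window)   # profiler time in the driver shows up here
    glfw.poll_events()

glfw.terminate()
```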
AMD announced its Fusion platform some time ago. Having read a bit about it, I'm both excited and sceptical. For example, it should make it possible for the GPU and CPU to share the same memory (with both in the same package). Now, since GPUs have much higher memory bandwidth (around 10x what a CPU has), and since the way CPUs and GPUs use their caches is fundamentally different, the question arises: how the heck can they do this? I wonder if any details are known.
I have also searched for some detailed info on how this APU technically works, but haven't found anything better than AMD's whitepaper on the subject, which, in a slightly marketing-heavy tone, does present a lot of good info.
By using high-bandwidth, dual-ported RAM. AMD is happy to explain.
Here is a good paper (from AMD) explaining the different memory access patterns in an APU
http://amddevcentral.com/afds/assets/presentations/1004_final.pdf
CPUs and GPUs have been sharing memory for years. Fusion is no different; in fact, it's far from the first instance of a GPU being integrated alongside a general-purpose CPU core.
Like all other such solutions, it'll be fine for casual use, but it'll be far from cutting-edge 3D acceleration.
I want to get into multi core programming (not language specific) and wondered what hardware could be recommended for exploring this field.
My aim is to upgrade my existing desktop.
If at all possible, I would suggest getting a dual-socket machine, preferably with quad-core chips. You can certainly get a single-socket machine, but dual-socket would let you start seeing some of the effects of NUMA memory that are going to be exacerbated as the core counts get higher and higher.
Why do you care? There are two huge problems facing multi-core developers right now:
The programming model: Parallel programming is hard, and there is (currently) no getting around this. A quad-core system will let you start playing around with real concurrency and all of the popular paradigms (threads, UPC, MPI, OpenMP, etc.).
Memory: Whenever you start having multiple threads, there is going to be contention for resources, and the memory wall is growing larger and larger. A recent article at Ars Technica outlines some (very preliminary) research at Sandia that shows just how bad this might become if current trends continue. Multicore machines are going to have to keep everything fed, and this will require that people become intimately familiar with their memory systems. Dual-socket adds NUMA to the mix (at least on AMD machines), which should get you started down this difficult road.
If you're interested in more info on performance inconsistencies with multi-socket machines, you might also check out this technical report on the subject.
Also, others have suggested getting a system with a CUDA-capable GPU, which I think is also a great way to get into multithreaded programming. It's lower level than the stuff I mentioned above, but throw one of those on your machine if you can. The new Portland Group compilers have provisional support for optimizing loops with CUDA, so you could play around with your GPU even if you don't want to learn CUDA yourself.
Quad-core, because it'll let you tackle problems where the number of concurrent processes is greater than 2, which is often what makes a problem non-trivial.
I would also, for sheer geek squee, pick up a nice NVidia card and use the CUDA API. If you have the bucks, there's a stand-alone CUDA workstation that plugs into your main computer via a cable and an expansion slot.
It depends what you want to do.
If you want to learn the basics of multithreaded programming, then you can do that on your existing single-core PC. (If you have 2 threads, the OS will switch between them on a single-core PC; when you move to a dual-core PC, they should automatically run in parallel on separate cores, for a 2x speedup.) This has the advantage of being free! The disadvantages are that you won't see a speedup (in fact a parallel implementation will probably be slightly slower due to overheads), and that buggy code has a slightly higher chance of appearing to work.
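To make that concrete, here is a minimal Python sketch of splitting a CPU-bound job across workers (multiprocessing is used rather than threads, since CPython's GIL prevents pure-Python threads from running CPU-bound code in parallel): on a single-core machine the workers are simply time-sliced, while on a multi-core machine they genuinely run side by side.

```python
# Minimal sketch: split a CPU-bound task across worker processes.
# On one core the OS time-slices the workers; on several cores they run
# truly in parallel.
import time
from multiprocessing import Pool

def count_primes(limit):
    """Deliberately naive CPU-bound work."""
    count = 0
    for n in range(2, limit):
        if all(n % d for d in range(2, int(n ** 0.5) + 1)):
            count += 1
    return count

if __name__ == "__main__":
    chunks = [50_000] * 4

    start = time.time()
    serial = [count_primes(c) for c in chunks]
    print("serial:  ", time.time() - start)

    start = time.time()
    with Pool(processes=4) as pool:
        parallel = pool.map(count_primes, chunks)
    print("parallel:", time.time() - start)

    assert serial == parallel
```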
However, although you can learn multithreaded programming on a single-core box, a dual-core (or even HyperThreading) CPU would be a great help.
If you want to really stress-test the code you're writing, then as "blue tuxedo" says, you should go for as many cores as you can easily afford, and if possible get hyperthreading too.
If you want to learn about algorithms for running on graphics cards, which is a very different area from x86 multicore, then get CUDA and buy a normal NVIDIA graphics card that supports it.
I'd recommend at least a quad-core processor.
You could try tinkering with CUDA. It's free, not that hard to use and will run on any recent NVIDIA card.
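For instance, here is a minimal sketch of a CUDA kernel written through the Numba Python bindings (assuming a CUDA-capable NVIDIA card and the numba package; the same idea in CUDA C looks very similar):

```python
# Minimal CUDA kernel via Numba's CUDA bindings: element-wise vector add.
# Requires an NVIDIA GPU with CUDA installed and the numba package.
import numpy as np
from numba import cuda

@cuda.jit
def vector_add(a, b, out):
    i = cuda.grid(1)            # global thread index
    if i < out.size:
        out[i] = a[i] + b[i]

n = 1_000_000
a = np.random.rand(n).astype(np.float32)
b = np.random.rand(n).astype(np.float32)
out = np.zeros_like(a)

threads_per_block = 256
blocks = (n + threads_per_block - 1) // threads_per_block
# Passing NumPy arrays lets Numba handle the host<->device copies implicitly.
vector_add[blocks, threads_per_block](a, b, out)

print(np.allclose(out, a + b))  # True
```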
Alternatively, you could get a PlayStation 3 and the Linux SDK and work out how to program a Cell processor. Note that the next cheapest option for Cell BE development is an order of magnitude more expensive than a PS3.
Finally, any modern motherboard that will take a Core 2 Quad or quad-core Opteron (get a good one from Asus or some other reputable manufacturer) will let you experiment with a multi-core PC system for a reasonable sum of money.
The difficult thing with multithreaded/multicore programming is that it opens a whole new can of worms. The bugs you'll be faced with are usually not the ones you're used to. Race conditions can remain dormant for ages until they bite, and your mainstream language compiler won't assist you in any way. You'll get random data and/or crashes that only happen once a day/week/month/year, usually under the most mysterious conditions...
One thing fortunately remains true: the higher the concurrency exhibited by a computer, the more race conditions you'll unveil.
So if you're serious about multithreaded/multicore programming, go for as many CPU cores as possible. Keep in mind that SMT (e.g. Intel's Hyper-Threading) does not provide the same level of concurrency that multiple physical cores do.
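To illustrate the kind of bug being described, here is a minimal Python sketch of a classic lost-update race: several processes increment a shared counter without holding a lock, so the read-modify-write interleavings silently drop increments, and the final value varies from run to run.

```python
# Minimal sketch of a lost-update race: "counter.value += 1" is a
# read-modify-write that is NOT atomic, so concurrent workers overwrite
# each other's updates and the final count comes out short.
from multiprocessing import Process, Value

def worker(counter, n):
    for _ in range(n):
        counter.value += 1   # racy: counter.get_lock() is never taken

if __name__ == "__main__":
    counter = Value('i', 0)  # shared 32-bit int
    procs = [Process(target=worker, args=(counter, 100_000)) for _ in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()

    print("expected 400000, got", counter.value)  # usually well below 400000
```

On a machine with little real concurrency the same code can appear to work for much longer before the bug shows up, which is exactly the point made above.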
I would agree that, depending on what you ultimately want to do, you can probably get by with just your current single-core system. Multi-core programming is basically multi-threaded programming, and you can certainly do that on a single-core chip.
When I was a student, one of our projects was to build a thread-safe implementation of the malloc library for C. Even on a single-core processor, that was more than enough to cure me of my desire to get into multi-threaded programming. I would try something small like that before you start thinking about spending lots of money.
I agree with the others: I would upgrade to a quad-core processor. I am also a BIG FAN of ASUS motherboards (the P5Q Pro is excellent for Core2Quad and Core2Duo processors)!
The draw of multi-core programming is that you have more resources to get things done faster. If you are serious about multi-core programming, then I would absolutely get a quad-core processor. I don't believe you need the new i7 architecture from Intel to take advantage of multi-core processing, because anything written to take advantage of the Core2Duo or Core2Quad will simply run better on the newer architecture.
If you are going to dabble in multi-core programming, then I would get a good Core2Duo processor. Remember, it's not just how many cores you have, but also how FAST those cores are at processing jobs. My Core2Duo running at 4 GHz routinely completes jobs faster than my Core2Quad running at 2.4 GHz, even with a multi-core program.
Let me know if this helps!
JFV