I have looked at lots of HSA introductions and find that a HSA-compliant GPU should be preemptible and support context switch.
But the Wikipedia article "AMD Accelerated Processing Unit" says GPU compute context switch, GPU graphics preemption will have support in Carizzo APU (2015).
So I wonder whether Kaveri is a HSA-compliant processor?
Thanks!
Kaveri is a 1st generation HSA-compliant APU.
As a 1st generation, it is still missing some features of the HSA specification. One of those features is Mid-wave preemption, which means the ability to preempt a graphic/compute work in the middle, context-switch to a different wave (work) and then resume the original wave.
Without this feature, Kaveri needs to finish the wave and only then can it move to a different wave.
Having said that, there is already an infrastructure for running HSA applications on Kaveri in Linux (Ubuntu 13/14). See https://github.com/HSAFoundation/Linux-HSA-Drivers-And-Images-AMD for kernel bits and https://github.com/HSAFoundation/Okra-Interface-to-HSA-Device for userspace bits.
This infrastructure also supports the Aparapi and Sumatra projects on Kaveri - running Java code on the GPU.
Hope this helps.
Related
I'm just getting into the massive topic of learning UEFI driver development and from what I understand so far, hardware peripherals are controlled using specific addresses mapped to memory. Well, the memory is hardware too. Is it not controlled by drivers?
I assume the CPU and motherboard have built-in circuits that handle this, but my curiosity is whether drivers have any hardware level control to this handling. I'd just prefer to know for sure and I'm not sure what manual would explain this.
[kernel/UEFI] driver <-> memory mapped address <-> firmware [hardware:keyboard]
[kernel/UEFI] driver <-> ? <-> firmware [hardware:RAM]
guess:
spec
driver <-> CPU microcode <-> motherboard circuit <-> firmware
I just think assumptions are bad and can't find a citation confirming the probable answer. The answer is relevant to security and which supply chain / standard we're trusting. Like PCIe or NVMe are standard specs, perhaps there's a standard for RAM <-> CPU communication?
Maybe this question is a better fit for an Engineering SE site?
From software development perspective, there isn't a driver for RAM control ‐ it's exposed as an indistinguishable part of the hardware via the Instruction Set for a given Architecture (ISA). Just like CPU is hardware, but without a driver to control it. Reason for drivers is access to hardware unknown to the ISA, such as a USB device, a particular manufacturer's SSD (which may or may not be present at power up time), graphics hardware, etc. Just like CPU, RAM must be present at power up ‐ you won't get much further than an error code via lights and beeps without RAM in your system at power up time. This isn't the case for most, if not all, other hardware; such other hardware is therefore optional, and isn't part of the ISA definition; drivers (software) are needed to access such optional hardware.
RAM is managed by both hardware (CPU; see Intel manual, volume 3), especially modern hardware, which provides virtualization support for a modern OS, including paging support, and software (the OS memory allocator), for purposes of virtualizing RAM for the running processes. No drivers though ‐ just addressing via ISA instructions.
If you're looking for an answer from a hardware perspective, such as the details of the bus circuitry, which CPU pins are involved, exact protocol, etc., then this question is a better candidate for Engineering StackExchange site.
Is there a comprehensive list, where i can check the supported Vulkan extensions for all AMD GPUs? I have been looking all over the internet, but can't find any information on this.
I currently have a RX570, but I thought the Vulkan API would feature a fallback mode for cards lacking hardware acceleration.
I think i installed the amdgpu-driver correctly, but when i try to run the raytracing_simple example, it says that the RX570 is lacking the requested extensions.
AMD introduced ray tracing support with the RX 6x00 series. A fallback mode for older hardware would have to be implemented by the vendor, which is not the case on AMD. So you need a RX 6x00 GPU for doing hardware accelerated ray tracing on Linux.
You can check VK_KHR_ray_tracing_pipeline support on Linux here: https://vulkan.gpuinfo.org/listdevicescoverage.php?extension=VK_KHR_ray_tracing_pipeline&platform=linux
That's a Vulkan hardware database I'm maintaining, which also has listings for extension support on different platforms. The data provided there is from user-uploaded reports. While not an official Vulkan database, thanks to regular contributions it's mostly complete and gives a good overview on Vulkan support for different hardware.
Note: As mentioned above, reports are submitted by users, so the list may not be 100% complete.
I have an NVIDIA RTX 2070 GPU and CUDA installed, I have WebGL support, but when I run the various TFJS examples, such as the Addition RNN Example or the Visualizing Training Example, I see my CPU usage go to 100% but the GPU (as metered via nvidia-smi) never gets used.
How can I troubleshoot this? I don't see any console messages about not finding the GPU. The TFJS docs are really vague about this, only saying that it uses the GPU if WebGL is supported and otherwise falls back to CPU if it can't find the WebGL. But again, WebGL is working. So...how to help it find my GPU?
Other related SO questions seem to be about tfjs-node-gpu, e.g., getting one's own tfjs-node-gpu installation working. This is not about that.
I'm talking about running the main TFJS examples on the official TFJS pages from my browser.
Browser is the latest Chrome for Linux. Running Ubuntu 18.04.
EDIT: Since someone will ask, chrome://gpu shows that hardware acceleration is enabled. The output log is rather long, but here's the top:
Graphics Feature Status
Canvas: Hardware accelerated
Flash: Hardware accelerated
Flash Stage3D: Hardware accelerated
Flash Stage3D Baseline profile: Hardware accelerated
Compositing: Hardware accelerated
Multiple Raster Threads: Enabled
Out-of-process Rasterization: Disabled
OpenGL: Enabled
Hardware Protected Video Decode: Unavailable
Rasterization: Software only. Hardware acceleration disabled
Skia Renderer: Enabled
Video Decode: Unavailable
Vulkan: Disabled
WebGL: Hardware accelerated
WebGL2: Hardware accelerated
Got it essentially solved. I found this older post, that one needs to check whether WebGL is using the "real" GPU or just some Intel-integrated-graphics offshoot of the CPU.
To do this, go to https://alteredqualia.com/tmp/webgl-maxparams-test/ and scroll down to the very bottom and look at the Unmasked Renderer and Unmasked Vendor tag.
In my case, these were showing Intel, not my NVIDIA GPU.
My System76 laptop has the capacity to run in "Hybrid Graphics" mode in which big computations are performed on the GPU but smaller things like GUI elements run on the integrated graphics. (This saves battery life.) But while some applications are able to take advantage of the GPU when in Hybrid Graphics mode -- I just ran a great Adversarial Latent AutoEncoder demo that maxed out my GPU while in Hybrid Graphics mode -- not all are. Chrome is one example of the latter, apparently.
To get WebGL to see my NVIDIA GPU, I needed to reboot my system in "full NVIDIA Graphics" mode.
After this reboot, some of the TFJS examples will use the GPU, such as the Visualizing Training example, which now trains almost instantly instead of taking a few minutes to train. But the Addition RNN example still only uses the CPU. This may be because of a missing backend declaration that #edkeveked pointed out.
When running a simulation in gem5, I can select a CPU with fs.py --cpu-type.
This option can also show a list of all CPU types if I use an invalid CPU type such as fs.py --cpu-type.
What is the difference between those CPU types and which one should I choose for my experiment?
Question inspired by: https://www.mail-archive.com/gem5-users#gem5.org/msg16976.html
An overview of the CPU types can be found at: https://cirosantilli.com/linux-kernel-module-cheat/#gem5-cpu-types
In summary:
simplistic CPUs (derived from BaseSimpleCPU): for example AtomicSimpleCPU (the default one). They have no CPU pipeline, and therefor are completely unrealistic. However, they also run much faster. Therefore,they are mostly useful to boot Linux fast and then checkpoint and switch to a more detailed CPU.
Within the simple CPUs we can notably distinguish:
AtomicSimpleCPU: memory requests finish immediately
TimingSimpleCPU: memory requests actually take time to go through to the memory system and return. Since there is no CPU pipeline however, the simulated CPU stalls on every memory request waiting for a response.
An alternative to those is to use KVM CPUs to speed up boot if host and guest ISA are the same, although as of 2019, KVM is less stable as it is harder to implement and debug.
in-order CPUs: derived from the generic MinorCPU by parametrization, Minor stands for In Order:
for ARM: HPI is made by ARM and models a "(2017) modern in-order Armv8-A implementation". This is your best in-order ARM bet.
out-of-order CPUs, derived from the generic DerivO3CPU by parametrization, O3 stands for Out Of Order:
for ARM: there are no models specifically published by ARM as of 2019. The only specific O3 model available is ex5_big for an A15, but you would have to verify its authors claims on how well it models the real core A15 core.
If none of those are accurate enough for your purposes, you could try to create your own in-order/out-of-order models by parametrizing MinorCPU / DerivO3CPU like HPI and ex5_big do, although this could be hard to get right, as there isn't generally enough public information on non-free CPUs to do this without experiments or reverse engineering.
The other thing you will want to think about is the memory system model. There are basically two choices: classical vs Ruby, and within Ruby, several options are available, see also: https://cirosantilli.com/linux-kernel-module-cheat/#gem5-ruby-build
Since I only have an AMD A10-7850 APU, and do not have the funds to spend on a $800-$1200 NVIDIA graphics card, I am trying to make due with the resources I have in order to speed up deep learning via tensorflow/keras.
Initially, I used a pre-compiled version of Tensorflow. InceptionV3 would take about 1000-1200 seconds to compute 1 epoch. It has been painfully slow.
To speed up calculations, I first self-compiled Tensorflow with optimizers (using AVX, and SSE4 instructions). This lead to a roughly 40% decrease in computation times. The same computations performed above now only take about 600 seconds to compute. It's almost bearable - kind of like you can watch paint dry.
I am looking for ways to further decrease computation times. I only have an integrated AMD graphics card that is part of the APU. (How) (C/c)an I make use of this resource to speed up computation even more?
More generally, let's say there are other people with similar monetary restrictions and Intel setups. How can anyone WITHOUT discrete NVIDIA cards make use of their integrated graphics chips or otherwise non-NVIDIA setup to achieve faster than CPU-only performance? Is that possible? Why/Why not? What needs to be done to achieve this goal? Or will this be possible in the near future (2-6 months)? How?
After researching this topic for a few months, I can see 3.5 possible paths forward:
1.) Tensorflow + OpenCl as mentioned in the comments above:
There seems to be some movement going on this field. Over at Codeplay, Lukasz Iwanski just posted a comprehensive answer on how to get tensorflow to run with opencl here (I will only provide a link as stated above because the information might change there): https://www.codeplay.com/portal/03-30-17-setting-up-tensorflow-with-opencl-using-sycl
The potential to use integrated graphics is alluring. It's also worth exploring the use of this combination with APUs. But I am not sure how well this will work since OpenCl support is still early in development, and hardware support is very limited. Furthermore, OpenCl is not the same as a handcrafted library of optimized code. (UPDATE 2017-04-24: I have gotten the code to compile after running into some issues here!) Unfortunately, the hoped for speed improvements ON MY SETUP (iGPU) did not materialize.
CIFAR 10 Dataset:
Tensorflow (via pip ak unoptimized): 1700sec/epoch at 390% CPU
utilization.
Tensorflow (SSE4, AVX): 1100sec/epoch at 390% CPU
utilization.
Tensorflow (opencl + iGPU): 5800sec/epoch at 150% CPU
and 100% GPU utilization.
Your mileage may vary significantly. So I am wondering what are other people getting relatively speaking (unoptimized vs optimized vs opencl) on your setups?
What should be noted: opencl implementation means that all the heavy computation should be done on the GPU. (Updated on 2017/4/29) But in reality this is not the case yet because some functions have not been implemented yet. This leads to unnecessary copying back and forth of data between CPU and GPU ram. Again, imminent changes should improve the situation. And furthermore, for those interested in helping out and those wanting to speed things up, we can do something that will have a measurable impact on the performance of tensorflow with opencl.
But as it stands for now: 1 iGPU << 4 CPUS with SSE+AVX. Perhaps beefier GPUs with larger RAM and/or opencl 2.0 implementation could have made a larger difference.
At this point, I should add that similar efforts have been going on with at least Caffe and/or Theano + OpenCl. The limiting step in all cases appears to be the manual porting of CUDA/cuDNN functionality to the openCl paradigm.
2.) RocM + MIOpen
RocM stands for Radeon Open Compute and seems to be a hodgepodge of initiatives that is/will make deep-learning possible on non-NVIDIA (mostly Radeon devices). It includes 3 major components:
HIP: A tool that converts CUDA code to code that can be consumed by AMD GPUs.
ROCk: a 64-bit linux kernel driver for AMD CPU+GPU devices.
HCC: A C/C++ compiler that compiles code into code for a heterogeneous system architecture environment (HSA).
Apparently, RocM is designed to play to AMDs strenghts of having both CPU and GPU technology. Their approach to speeding up deep-learning make use of both components. As an APU owner, I am particularly interested in this possibility. But as a cautionary note: Kaveri APUs have limited support (only integrated graphcs is supported). Future APUs have not been released yet. And it appears, there is still a lot of work that is being done here to bring this project to a mature state. A lot of work will hopefully make this approach viable within a year given that AMD has announced their Radeon Instinct cards will be released this year (2017).
The problem here for me is that that RocM is providing tools for building deep learning libraries. They do not themselves represent deep learning libraries. As a data scientist who is not focused on tools development, I just want something that works. and am not necessarily interested in building what I want to then do the learning. There are not enough hours in the day to do both well at the company I am at.
NVIDIA has of course CUDA and cuDNN which are libaries of hand-crafted assembler code optimized for NVIDIA GPUs. All major deep learning frameworks build on top of these proprietary libraries. AMD currently does not have anything like that at all.
I am uncertain how successfully AMD will get to where NVIDIA is in this regard. But there is some light being shone on what AMDs intentions are in an article posted by Carlos Perez on 4/3/2017 here. A recent lecture at Stanford also talks in general terms about Ryzen, Vega and deep learning fit together. In essence, the article states that MIOpen will represent this hand-crafted library of optimized deep learning functions for AMD devices. This library is set to be released in H1 of 2017. I am uncertain how soon these libraries would be incorporated into the major deep learning frameworks and what the scope of functional implementation will be then at this time.
But apparently, AMD has already worked with the developers of Caffe to "hippify" the code basis. Basically, CUDA code is converted automatically to C code via HIP. The automation takes care of the vast majority of the code basis, leaving only less than 0.5% of code had to be changed and required manual attention. Compare that to the manual translation into openCl code, and one starts getting the feeling that this approach might be more sustainable. What I am not clear about is where the lower-level assembler language optimization come in.
(Update 2017-05-19) But with the imminent release of AMD Vega cards (the professional Frontier Edition card not for consumers will be first), there are hints that major deep learning frameworks will be supported through the MIOpen framework. A Forbes article released today shows the progress MiOpen has taken over just the last couple of months in terms of performance: it appears significant.
(Update 2017-08-25) MiOpen has officially been released. We are no longer talking in hypotheticals here. Now we just need to try out how well this framework works.
3.) Neon
Neon is Nervana's (now acquired by Intel) open-source deep-learning framework. The reason I mention this framework is that it seems to be fairly straightforward to use. The syntax is about as easy and intuitive as Keras. More importantly though, this framework has achieved speeds up to 2x faster than Tensorflow on some benchmarks due to some hand-crafted assembler language optimization for those computations. Potentially, cutting computation times from 500 secs/epoch down to 300 secs/epoch is nothing to sneeze at. 300 secs = 5 minutes. So one could get 15 epochs in in an hour. and about 50 epochs in about 3.5 hours! But ideally, I want to do these kinds of calculations in under an hour. To get to those levels, I need to use a GPU, and at this point, only NVIDIA offers full support in this regard: Neon also uses CUDA and cuDNN when a GPU is available (and of course, it has to be an NVIDIA GPU). If you have access other Intel hardware this is of course a valid way to pursue. Afterall, Neon was developed out of a motivation to get things to work optimally also on non-NVIDIA setups (like Nervana's custom CPUs, and now Intel FPGAs or Xeon Phis).
3.5.) Intel Movidius
Update 2017-08-25: I came across this article. Intel has released a USB3.0-stick-based "deep learning" accelerator. Apparently, it works with Cafe and allows the user perform common Cafe-based fine-tuning of networks and inference. This is important stressing: If you want to train your own network from scratch, the wording is very ambiguous here. I will therefore assume, that apart from fine-tuning a network, training itself should still be done on something with more parallel compute. The real kicker though is this: When I checked for the pricing this stick costs a mere $79. That's nothing compared to the cost of your average NVIDIA 1070-80(ti) card. If you merely want to tackle some vision problems using common network topologies already available for some related tasks, you can use this stick to fine tune it to your own use, then compile the code and put it into this stick to do inference quickly. Many use cases can be covered with this stick, and for again $79 it could be worth it. This being Intel, they are proposing to go all out on Intel. Their model is to use the cloud (i.e. Nervana Cloud) for training. Then, use this chip for prototype inference or inference where energy consumption matters. Whether this is the right approach or not is left for the reader to answer.
At this time, it looks like deep learning without NVIDIA is still difficult to realize. Some limited speed gains are difficult but potentially possible through the use of opencl. Other initiatives sound promising but it will take time to sort out the real impact that these initiatives will have.
If your platform supports opencl you can look at using it with tensorflow. There is some experimental support for it on Linux at this github repository. Some preliminary instructions are at the documentation section of of this github repository.