Tensorflow.js examples not using GPU - tensorflow

I have an NVIDIA RTX 2070 GPU with CUDA installed and working WebGL support, but when I run the various TFJS examples, such as the Addition RNN Example or the Visualizing Training Example, I see my CPU usage go to 100% while the GPU (as metered via nvidia-smi) never gets used.
How can I troubleshoot this? I don't see any console messages about not finding the GPU. The TFJS docs are really vague about this, only saying that it uses the GPU if WebGL is supported and otherwise falls back to the CPU. But again, WebGL is working. So...how to help it find my GPU?
Other related SO questions seem to be about tfjs-node-gpu, e.g., getting one's own tfjs-node-gpu installation working. This is not about that.
I'm talking about running the main TFJS examples on the official TFJS pages from my browser.
Browser is the latest Chrome for Linux. Running Ubuntu 18.04.
EDIT: Since someone will ask, chrome://gpu shows that hardware acceleration is enabled. The output log is rather long, but here's the top:
Graphics Feature Status
Canvas: Hardware accelerated
Flash: Hardware accelerated
Flash Stage3D: Hardware accelerated
Flash Stage3D Baseline profile: Hardware accelerated
Compositing: Hardware accelerated
Multiple Raster Threads: Enabled
Out-of-process Rasterization: Disabled
OpenGL: Enabled
Hardware Protected Video Decode: Unavailable
Rasterization: Software only. Hardware acceleration disabled
Skia Renderer: Enabled
Video Decode: Unavailable
Vulkan: Disabled
WebGL: Hardware accelerated
WebGL2: Hardware accelerated

Got it essentially solved. I found this older post, which says that one needs to check whether WebGL is using the "real" GPU or just the Intel integrated graphics built into the CPU.
To do this, go to https://alteredqualia.com/tmp/webgl-maxparams-test/ and scroll down to the very bottom and look at the Unmasked Renderer and Unmasked Vendor tag.
In my case, these were showing Intel, not my NVIDIA GPU.
My System76 laptop has the capacity to run in "Hybrid Graphics" mode in which big computations are performed on the GPU but smaller things like GUI elements run on the integrated graphics. (This saves battery life.) But while some applications are able to take advantage of the GPU when in Hybrid Graphics mode -- I just ran a great Adversarial Latent AutoEncoder demo that maxed out my GPU while in Hybrid Graphics mode -- not all are. Chrome is one example of the latter, apparently.
To get WebGL to see my NVIDIA GPU, I needed to reboot my system in "full NVIDIA Graphics" mode.
After this reboot, some of the TFJS examples use the GPU, such as the Visualizing Training example, which now trains almost instantly instead of taking a few minutes. But the Addition RNN example still only uses the CPU. This may be because of a missing backend declaration that @edkeveked pointed out.
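The renderer check described above can also be scripted instead of read off a test page. The following is a minimal sketch, assuming the browser exposes the `WEBGL_debug_renderer_info` extension (it can be absent or blocked). `GlLike`, `readUnmaskedGpu`, and `mentionsNvidia` are hypothetical names introduced here for illustration, and the string match on "nvidia" is only a rough heuristic.

```typescript
// Minimal structural types so the sketch type-checks outside a browser.
// A real WebGLRenderingContext has these two methods.
interface DebugRendererInfo {
  UNMASKED_VENDOR_WEBGL: number;
  UNMASKED_RENDERER_WEBGL: number;
}

interface GlLike {
  getExtension(name: string): unknown;
  getParameter(pname: number): unknown;
}

interface GpuIdentity {
  vendor: string;
  renderer: string;
}

// Returns the unmasked vendor/renderer strings, or null when the
// WEBGL_debug_renderer_info extension is not available.
function readUnmaskedGpu(gl: GlLike): GpuIdentity | null {
  const ext = gl.getExtension("WEBGL_debug_renderer_info") as DebugRendererInfo | null;
  if (!ext) {
    return null;
  }
  return {
    vendor: String(gl.getParameter(ext.UNMASKED_VENDOR_WEBGL)),
    renderer: String(gl.getParameter(ext.UNMASKED_RENDERER_WEBGL)),
  };
}

// Hypothetical heuristic (an assumption, not an official test):
// treat the context as NVIDIA-backed if either string mentions NVIDIA.
function mentionsNvidia(id: GpuIdentity): boolean {
  return /nvidia/i.test(`${id.vendor} ${id.renderer}`);
}

// Browser usage (not executed here):
//   const gl = document.createElement("canvas").getContext("webgl");
//   const id = gl ? readUnmaskedGpu(gl) : null;
```

If the returned strings name Intel rather than NVIDIA, WebGL is running on the integrated graphics, which matches the Hybrid Graphics situation described above.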

Related

Does TensorFlow use all of the hardware on the GPU?

The NVidia GP100 has 30 TPC circuits and 240 "texture units". Do the TPCs and texture units get used by TensorFlow, or are these disposable bits of silicon for machine learning?
I am looking at GPU-Z and Windows 10's built-in GPU performance monitor on a running neural net training session and I see various hardware functions are underutilized. Tensorflow uses CUDA. CUDA has access, I presume, to all hardware components. If I know where the gap is (between Tensorflow and underlying CUDA) and whether it is material (how much silicon is wasted) I can, for example, remediate by making a clone of TensorFlow, modifying it, and then submitting a pull request.
For example, the answer below discusses texture objects, accessible from CUDA. NVidia notes that these can be used to speed up latency-sensitive, short-running kernels. If I google "TextureObject tensorflow" I don't get any hits. So I can sort of assume, barring evidence to the contrary, that TensorFlow is not taking advantage of TextureObjects.
NVidia markets GPGPUs for neural net training. So far it seems they have adopted a dual-use strategy for their circuits, so they are leaving in circuits not used for machine learning. This raises the question of whether a pure TensorFlow circuit would be more efficient. Google is now promoting TPUs for this reason. The jury is out on whether TPUs are actually cheaper for TensorFlow than NVidia GPUs. NVidia is challenging Google's price/performance claims.
None of those things are separate pieces of individual hardware that can be addressed separately in CUDA. Read this passage on page 10 of your document:
Each GPC inside GP100 has ten SMs. Each SM has 64 CUDA Cores and four texture units. With 60 SMs,
GP100 has a total of 3840 single precision CUDA Cores and 240 texture units. Each memory controller is
attached to 512 KB of L2 cache, and each HBM2 DRAM stack is controlled by a pair of memory
controllers. The full GPU includes a total of 4096 KB of L2 cache.
And if we read just above that:
GP100 was built to be the highest performing parallel computing processor in the world to address the
needs of the GPU accelerated computing markets serviced by our Tesla P100 accelerator platform. Like
previous Tesla-class GPUs, GP100 is composed of an array of Graphics Processing Clusters (GPCs), Texture
Processing Clusters (TPCs), Streaming Multiprocessors (SMs), and memory controllers. A full GP100
consists of six GPCs, 60 Pascal SMs, 30 TPCs (each including two SMs), and eight 512-bit memory
controllers (4096 bits total).
and take a look at the diagram we see the following:
So not only are the GPCs and SMs not separate pieces of hardware, but even the TPCs are just another way to reorganize the hardware architecture and come up with a fancy marketing name. You can clearly see the TPC doesn't add anything new in the diagram; it just looks like a container for the SMs. The ratio is [1 GPC]:[5 TPCs]:[10 SMs].
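The counts in the quoted passages can be cross-checked with a little arithmetic. This is only a sketch: the input figures are taken from the quotes above, and the variable names are illustrative.

```typescript
// Figures quoted from the GP100 passages above.
const gpcs = 6;
const smsPerGpc = 10;
const smsPerTpc = 2;
const cudaCoresPerSm = 64;
const textureUnitsPerSm = 4;
const memoryControllers = 8;
const l2KbPerController = 512;
const busBitsPerController = 512;

// Derived totals; each should match a number stated in the quotes.
const sms = gpcs * smsPerGpc;                  // 60 SMs
const gp100 = {
  sms,
  tpcs: sms / smsPerTpc,                       // 30 TPCs
  tpcsPerGpc: sms / smsPerTpc / gpcs,          // 5 TPCs per GPC
  cudaCores: sms * cudaCoresPerSm,             // 3840 CUDA cores
  textureUnits: sms * textureUnitsPerSm,       // 240 texture units
  l2Kb: memoryControllers * l2KbPerController, // 4096 KB of L2 cache
  memoryBusBits: memoryControllers * busBitsPerController, // 4096 bits
};
```

Every headline number (30 TPCs, 240 texture units, 3840 cores) falls out of the SM count alone, which is consistent with the point that TPCs and GPCs are groupings of SMs rather than additional hardware.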
The memory controllers are something all hardware is going to have in order to interface with RAM, it happens that more memory controllers can enable higher bandwidth, see this diagram:
where "High bandwidth memory" refers to HBM2, a type of video memory like GDDR5; in other words, video RAM. This isn't something you would directly address in software with CUDA any more than you would do so on x86 desktop machines.
So in reality, we only have SMs here, not TPCs and GPCs. So to answer your question: since TensorFlow takes advantage of CUDA, presumably it's going to use all the available hardware it can.
EDIT: The poster edited their question to an entirely different question, and has new misconceptions there so here is the answer to that:
Texture Processing Clusters (TPCs) and Texture units are not the same thing. TPCs appear to be merely an organization of Streaming Multiprocessors (SM) with a bit of marketing magic thrown in.
Texture units are not a concrete term, and features differ from GPU to GPU, but basically you can think of them as texture memory (or ready access to texture memory), which exploits spatial coherence, as opposed to the L1, L2, L3... caches, which exploit temporal coherence, combined with some fixed-function hardware. The fixed functionality may include interpolating access filters (often at least linear interpolation), different coordinate modes, mipmapping control and anisotropic texture filtering. See the CUDA 9.0 Guide on this topic to get an idea of texture unit functionality and what you can control with CUDA. On the diagram we can see the texture units at the bottom.
Clearly these are completely different from the TPCs shown in the first picture I posted, which at least according to the diagram have no extra functionality associated with them and are merely a container for two SMs.
Now, despite the fact that you can address texture functionality within CUDA, you often don't need to. The texture units' fixed-function hardware is not all that useful to neural nets; however, the spatially coherent texture memory is often used automatically by CUDA as an optimization even if you don't explicitly try to access it. In this way, TensorFlow still would not be "wasting" silicon.

Computer restarts with large mini batches in TensorFlow

I am running TensorFlow for Windows with a Titan X GPU (12 GB memory). When I try to train a network for images of 256X256X1 with mini-batches larger than 50 images, my computer just crashes and restarts automatically. With smaller mini-batches it runs just fine.
Any clues on what might be causing this?
I've seen similar problems being discussed in some gaming forums, where the PC would just shut down when the GPU was under heavy load. The reason was usually that the GPU was drawing more power than the power supply unit could handle. Check e.g. here or here. So maybe it's worth investigating whether your PSU is the culprit.
Edit: Maybe the program SpeedFan can help you debug this. It is able to show both voltages and temperature sensor readings, which would also tell you if your PC is overheating (I've never used the tool myself, and I'm not affiliated with it either, just found it online).

Is Kaveri a HSA-compliant processor?

I have looked at lots of HSA introductions and find that an HSA-compliant GPU should be preemptible and support context switching.
But the Wikipedia article "AMD Accelerated Processing Unit" says that GPU compute context switching and GPU graphics preemption will be supported in the Carrizo APU (2015).
So I wonder whether Kaveri is a HSA-compliant processor?
Thanks!
Kaveri is a 1st generation HSA-compliant APU.
As a 1st generation part, it is still missing some features of the HSA specification. One of those features is mid-wave preemption, which means the ability to preempt graphics/compute work in the middle, context-switch to a different wave (work) and then resume the original wave.
Without this feature, Kaveri needs to finish the wave and only then can it move to a different wave.
Having said that, there is already an infrastructure for running HSA applications on Kaveri in Linux (Ubuntu 13/14). See https://github.com/HSAFoundation/Linux-HSA-Drivers-And-Images-AMD for kernel bits and https://github.com/HSAFoundation/Okra-Interface-to-HSA-Device for userspace bits.
This infrastructure also supports the Aparapi and Sumatra projects on Kaveri - running Java code on the GPU.
Hope this helps.

ADL only works if a monitor is connected to the GPU

I have a system with a discrete GPU, AMD Radeon HD7850, for computations only. The GPU has no monitor connected to it.
I would like to read fan speed and temperature from the GPU. This can normally be done with the ADL (AMD Display Library) API.
E.g. ADL_Overdrive6_FanSpeed_Get and ADL_Overdrive6_Temperature_Get. However, all ADL API calls return an error when no displays are active, i.e. no monitor is connected.
How do I read these values when the GPU has no monitor connected to it?
The AMD Catalyst Control Center has the same problem: it too can't read the values when the display is inactive.
I know the values are accessible because I can find them with HWiNFO64.
After consulting AMD and the guys behind HWiNFO64 I have learned that the only way to get these values from a headless GPU is to read them directly from the GPU registers.
To do this you need to write your own driver, since AMD doesn't make an API available.

Is it possible to do GPU programming if I have an integrated graphics card?

I have an HP Pavilion laptop; its so-called graphics card is some sort of integrated NVIDIA chip running on shared memory. To give you an idea of its capabilities, if a videogame was made in the last 5 years at a cost of more than a couple million dollars, it just won't be playable on my computer.
Anyways, I was wondering if I could do GPU programming, like CUDA, on this thing. I don't expect it to be fast, I'd just like to get the experience and not make my laptop catch fire in the meanwhile.
Find out what GPU your laptop has, and compare it against this list: http://en.wikipedia.org/wiki/CUDA#Supported_GPUs. Most likely, CUDA will not be supported.
This doesn't necessarily prevent you from doing "GPU programming", however. If the GPU supports fragment and vertex shaders, you can use the graphics pipeline to send data to the card (for example, as texture data) and do your processing in a fragment shader. You then read from the pixel buffer to get the data back into system memory. Though hackish, this approach was quite popular until CUDA and other frameworks like OpenCL were introduced.
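To make the data flow concrete, here is a CPU-side emulation of that pattern, not actual GPU code: the input array stands in for the uploaded texture, the callback stands in for the fragment shader that runs once per output pixel, and the returned array stands in for the pixel-buffer read-back. The names `FragmentShader` and `runFragmentPass` are hypothetical, and clamp-to-edge sampling is an assumption made for the sketch.

```typescript
// A "shader" computes one output value per pixel and may sample the
// input texture at any coordinate.
type FragmentShader = (
  x: number,
  y: number,
  sample: (sx: number, sy: number) => number
) => number;

function runFragmentPass(
  input: Float32Array,
  width: number,
  height: number,
  shader: FragmentShader
): Float32Array {
  if (input.length !== width * height) {
    throw new Error("input length must equal width * height");
  }
  // Clamp-to-edge sampling: out-of-range coordinates read the nearest texel.
  const sample = (sx: number, sy: number): number => {
    const cx = Math.min(width - 1, Math.max(0, sx));
    const cy = Math.min(height - 1, Math.max(0, sy));
    return input[cy * width + cx];
  };
  // The "render": invoke the shader once per output pixel.
  const output = new Float32Array(width * height);
  for (let y = 0; y < height; y++) {
    for (let x = 0; x < width; x++) {
      output[y * width + x] = shader(x, y, sample);
    }
  }
  // The "read-back": results return to system memory as a plain array.
  return output;
}

// Example pass: double every element of a 2x2 "texture".
const doubled = runFragmentPass(
  new Float32Array([1, 2, 3, 4]),
  2,
  2,
  (x, y, sample) => sample(x, y) * 2
);
```

On real hardware the loop would be replaced by drawing a full-screen quad, with each fragment shader invocation running in parallel on the GPU.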