How did the first GPUs get support from CPUs? - gpu

I imagine CPUs have to have features that allow it to communicate and work with the GPU, and I can imagine this exists today, but in the early days of GPUs, how did companies get support from large CPU companies to have their devices be supported, and what features did CPU companies add to enable this?

You mean special support beyond just being devices on a bus like PCI? (Or even older, ISA or VLB.)
TL:DR: All the special features CPUs have which are useful for improved bandwidth to write (and sometimes read) video memory came after 3D graphics cards were commercially successful. They weren't necessary, just a performance boost.
Once GPUs were commercially successful and popular, and a necessary part of a gaming PC, it made obvious sense for CPU vendors to add features to make things better.
The same IO busses that let you plug in a sound card or network card already have the capabilities to access device memory and MMIO, and device IO ports, which is all that's necessary for video drivers to make a graphics card do things.
Modern GPUs are often the highest-bandwidth devices in a system (especially non-servers), so they benefit from fast buses, hence AGP for a while, until PCI Express (PCIe) unified everything again.
Anyway, graphics cards could work on standard busses; it was only once 3D graphics became popular and commercially important (and fast enough for the PCI bus to be a bottleneck), that things needed to change. At that point, CPU / motherboard companies were fully aware that consumers cared about 3D games, and thus it would make sense to develop a new bus specifically for graphics cards.
(Along with a GART, graphics address/aperture remapping table, an IOMMU that made it much easier / safer for drivers to let an AGP or PCIe video card read directly from system memory. Including I think with addresses under control of user-space, without letting user-space read arbitrary system memory, thanks to it being an IOMMU that only allows a certain address range.)
Before the GART was a thing, I assume drivers for PCI GPUs needed to have the host CPU initiate DMA to the device. Or if bus-master DMA by the GPU did happen, it could read any byte of physical memory in the system if it wanted, so drivers would have to be careful not to let programs pass arbitrary pointers.
Anyway, having a GART was new with AGP, which post-dates early 3D graphics cards like 3dfx's Voodoo and ATI 3D Rage. I don't know enough details to be sure I'm accurately describing the functionality a GART enables.
So most of the support for GPUs was in terms of busses, and thus a chipset thing, not CPUs proper. (Back then, CPUs didn't have integrated memory controllers, instead just talking to the chipset northbridge over a frontside bus.)
Relevant CPU instructions included Intel's SSE and SSE2 instruction sets, which had streaming (NT = non-temporal) stores which are good for storing large amounts of data that won't be re-read by the CPU any time soon, if at all.
SSE4.1 in 2nd-gen Core2 (2008 ish) added a streaming load instruction (movntdqa) which (still) only does anything special if used on memory regions marked in the CPU's page tables or MTRR as WC (aka USWC: uncacheable, write-combining). Copying back from GPU memory to the host was the intended use-case. (Non-temporal loads and the hardware prefetcher, do they work together?)
x86 CPUs introducing the MTRR (Memory Type Range Register) is another feature that improved CPU -> GPU write bandwidth. Again, this came after 3D graphics were commercially successful for gaming.

Related

Is it possible to use system memory instead of GPU memory for processing Dask tasks

We have been running DASK clusters on Kubernetes for some time. Up to now, we have been using CPUs for processing and, of course, system memory for storing our Dataframe of around 1,5 TB (per DASK cluster, split onto 960 workers). Now we want to update our algorithm to take advantage of GPUs. But it seems like the available memory on GPUs is not going to be enough for our needs, it will be a limiting factor(with our current setup, we are using more than 1GB of memory per virtual core).
I was wondering if it is possible to use GPUs (thinking about NVDIA, AMD cards with PCIe connections and their own VRAMS, not integrated GPUs that use system memory) for processing and system memory (not GPU memory/VRAM) for storing DASK Dataframes. I mean, is it technically possible? Have you ever tried something like this? Can I schedule a kubernetes pod such that it uses GPU cores and system memory together?
Another thing is, even if it was possible to allocate the system RAM as VRAM of GPU, is there a limitation to the size of this allocatable system RAM?
Note 1. I know that using system RAM with GPU (if it was possible) will create an unnecessary traffic through PCIe bus, and will result in a degraded performance, but I would still need to test this configuration with real data.
Note 2. GPUs are fast because they have many simple cores to perform simple tasks at the same time/in parallel. If an individual GPU core is not superior to an individual CPU core then may be I am chasing the wrong dream? I am already running dask workers on kubernetes which already have access to hundreds of CPU cores. In the end, having a huge number of workers with a part of my data won't mean better performance (increased shuffling). No use infinitely increasing the number of cores.
Note 3. We are mostly manipulating python objects and doing math calculations using calls to .so libraries implemented in C++.
Edit1: DASK-CUDA library seems to support spilling from GPU memory to host memory but spilling is not what I am after.
Edit2: It is very frustrating that most of the components needed to utilize GPUs on Kubernetes are still experimental/beta.
Dask-CUDA: This library is experimental...
NVIDIA device plugin: The NVIDIA device plugin is still considered beta and...
Kubernetes: Kubernetes includes experimental support for managing AMD and NVIDIA GPUs...
I don't think this is possible directly as of today, but it's useful to mention why and reply to some of the points you've raised:
Yes, dask-cuda is what comes to mind first when I think of your use-case. The docs do say it's experimental, but from what I gather, the team has plans to continue to support and improve it. :)
Next, dask-cuda's spilling mechanism was designed that way for a reason -- while doing GPU compute, your biggest bottleneck is data-transfer (as you have also noted), so we want to keep as much data on GPU-memory as possible by design.
I'd encourage you to open a topic on Dask's Discourse forum, where we can reach out to some NVIDIA developers who can help confirm. :)
A sidenote, there are some ongoing discussion around improving how Dask manages GPU resources. That's in its early stages, but we may see cool new features in the coming months!

Most hardware is controlled at the driver level through memory mapped handles. How is RAM controlled? Is there a spec for it?

I'm just getting into the massive topic of learning UEFI driver development and from what I understand so far, hardware peripherals are controlled using specific addresses mapped to memory. Well, the memory is hardware too. Is it not controlled by drivers?
I assume the CPU and motherboard have built-in circuits that handle this, but my curiosity is whether drivers have any hardware level control to this handling. I'd just prefer to know for sure and I'm not sure what manual would explain this.
[kernel/UEFI] driver <-> memory mapped address <-> firmware [hardware:keyboard]
[kernel/UEFI] driver <-> ? <-> firmware [hardware:RAM]
guess:
spec
driver <-> CPU microcode <-> motherboard circuit <-> firmware
I just think assumptions are bad and can't find a citation confirming the probable answer. The answer is relevant to security and which supply chain / standard we're trusting. Like PCIe or NVMe are standard specs, perhaps there's a standard for RAM <-> CPU communication?
Maybe this question is a better fit for an Engineering SE site?
From software development perspective, there isn't a driver for RAM control ‐ it's exposed as an indistinguishable part of the hardware via the Instruction Set for a given Architecture (ISA). Just like CPU is hardware, but without a driver to control it. Reason for drivers is access to hardware unknown to the ISA, such as a USB device, a particular manufacturer's SSD (which may or may not be present at power up time), graphics hardware, etc. Just like CPU, RAM must be present at power up ‐ you won't get much further than an error code via lights and beeps without RAM in your system at power up time. This isn't the case for most, if not all, other hardware; such other hardware is therefore optional, and isn't part of the ISA definition; drivers (software) are needed to access such optional hardware.
RAM is managed by both hardware (CPU; see Intel manual, volume 3), especially modern hardware, which provides virtualization support for a modern OS, including paging support, and software (the OS memory allocator), for purposes of virtualizing RAM for the running processes. No drivers though ‐ just addressing via ISA instructions.
If you're looking for an answer from a hardware perspective, such as the details of the bus circuitry, which CPU pins are involved, exact protocol, etc., then this question is a better candidate for Engineering StackExchange site.

Does TensorFlow use all of the hardware on the GPU?

The NVidia GP100 has 30 TPC circuits and 240 "texture units". Do the TPCs and texture units get used by TensorFlow, or are these disposable bits of silicon for machine learning?
I am looking at GPU-Z and Windows 10's built-in GPU performance monitor on a running neural net training session and I see various hardware functions are underutilized. Tensorflow uses CUDA. CUDA has access, I presume, to all hardware components. If I know where the gap is (between Tensorflow and underlying CUDA) and whether it is material (how much silicon is wasted) I can, for example, remediate by making a clone of TensorFlow, modifying it, and then submitting a pull request.
For example, answer below discusses texture objects, accessible from CUDA. NVidia notes that these can be used to speed up latency-sensitive, short-running kernels. If I google "TextureObject tensorflow" I don't get any hits. So I can sort of assume, barring evidence to the contrary, that TensorFlow is not taking advantage of TextureObjects.
NVidia markets GPGPUs for neural net training. So far it seems they have adopted a dual-use strategy for their circuits, so they are leaving in circuits not used for machine learning. This begs the question of whether a pure TensorFlow circuit would be more efficient. Google is now promoting TPUs for this reason. The jury is out on whether TPUs are actually cheaper for TensorFlow than NVidia GPUs. NVidia is challenging Google price/performance claims.
None of those things are separate pieces of individual hardware that can be addressed separately in CUDA. Read this passage on page 10 of your document:
Each GPC inside GP100 has ten SMs. Each SM has 64 CUDA Cores and four texture units. With 60 SMs,
GP100 has a total of 3840 single precision CUDA Cores and 240 texture units. Each memory controller is
attached to 512 KB of L2 cache, and each HBM2 DRAM stack is controlled by a pair of memory
controllers. The full GPU includes a total of 4096 KB of L2 cache.
And if we read just above that:
GP100 was built to be the highest performing parallel computing processor in the world to address the
needs of the GPU accelerated computing markets serviced by our Tesla P100 accelerator platform. Like
previous Tesla-class GPUs, GP100 is composed of an array of Graphics Processing Clusters (GPCs), Texture
Processing Clusters (TPCs), Streaming Multiprocessors (SMs), and memory controllers. A full GP100
consists of six GPCs, 60 Pascal SMs, 30 TPCs (each including two SMs), and eight 512-bit memory
controllers (4096 bits total).
and take a look at the diagram we see the following:
So not only are the GPCs and SMS not seperate pieces of hardware, but even the TPCs are just another way to reorganize the hardware architecture and come up with a fancy marketing name. You can clearly see TPC doesn't add anything new in the diagram, it just looks like a container for the SMs. Its [1 GPC]:[5 TPCs]:[10 SMs]
The memory controllers are something all hardware is going to have in order to interface with RAM, it happens that more memory controllers can enable higher bandwidth, see this diagram:
where "High bandwidth memory" refers to HBM2 a type of video memory like GDDR5, in other words, video RAM. This isn't something you would directly address in software with CUDA any more than you would do so with X86 desktop machines.
So in reality, we only have SMs here, not TPCs an GPCs. So to answer your question, since Tensor flow takes advantage of cuda, presumably its going to use all the available hardware it can.
EDIT: The poster edited their question to an entirely different question, and has new misconceptions there so here is the answer to that:
Texture Processing Clusters (TPCs) and Texture units are not the same thing. TPCs appear to be merely an organization of Streaming Multiprocessors (SM) with a bit of marketing magic thrown in.
Texture units are not a concrete term, and features differ from GPU to GPU, but basically you can think of them as the combination of texture memory or ready access to texture memory, which employs spatial coherence, versus L1,L2,L3... cache which employ temporal coherence, in combination of some fixed function functionality. Fixed functionality may include interpolation access filter (often at least linear interpolation), different coordinate modes, mipmapping control and ansiotropic texture filtering. See the Cuda 9.0 Guide on this topic to get an idea of texture unit functionality and what you can control with CUDA. On the diagram we can see the texture units at the bottom.
Clearly these are completely different from the TPCs shown in the first picture I posted, which at least according to the diagram have no extra functionality associated with them and are merely a container for two SMs.
Now, despite the fact that you can address texture functionality within cuda, you often don't need to. The texture units fixed function functionality is not all that useful to Neural nets, however, the spatially coherent texture memory is often automatically used by CUDA as an optimization even if you don't explicitly try to access it. In this way, TensorFlow still would not be "wasting" silicon.

Using a GPU both as video card and GPGPU

Where I work, we do a lot of numerical computations and we are considering buying workstations with NVIDIA video cards because of CUDA (to work with TensorFlow and Theano).
My question is: should these computers come with another video card to handle the display and free the NVIDIA for the GPGPU?
I would appreciate if anyone knows of hard data on using a video card for display and GPGPU at the same time.
Having been through this, I'll add my two cents.
It is helpful to have a dedicated card for computations, but it is definitely not necessary.
I have used a development workstation with a single high-end GPU for both display and compute. I have also used workstations with multiple GPUs, as well as headless compute servers.
My experience is that doing compute on the display GPU is fine as long as demands on the display are typical for software engineering. In a Linux setup with a couple monitors, web browsers, text editors, etc., I use about 200MB for display out of the 6GB of the card -- so only about 3% overhead. You might see the display stutter a bit during a web page refresh or something like that, but the throughput demands of the display are very small.
One technical issue worth noting for completeness is that the NVIDIA driver, GPU firmware, or OS may have a timeout for kernel completion on the display GPU (run NVIDIA's 'deviceQueryDrv' to see the driver's "run time limit on kernels" setting). In my experience (on Linux), with machine learning, this has never been a problem since the timeout is several seconds and, even with custom kernels, synchronization across multiprocessors constrains how much you can stuff into a single kernel launch. I would expect the typical runs of the pre-baked ops in TensorFlow to be two or more orders of magnitude below this limit.
That said, there are some big advantages of having multiple compute-capable cards in a workstation (whether or not one is used for display). Of course there is the potential for more throughput (if your software can use it). However, the main advantage in my experience, is being able to run long experiments while concurrently developing new experiments.
It is of course feasible to start with one card and then add one later, but make sure your motherboard has lots of room and your power supply can handle the load. If you decide to have two cards, with one being a low-end card dedicated to display, I would specifically advise against having the low-end card be a CUDA-capable card lest it get selected as a default for computation.
Hope that helps.
In my experience it is awkward to share a GPU card between numerical computation tasks and driving a video monitor. For example, there is limited memory available on any GPU, which is often the limiting factor in the size of a model you can train. Unless you're doing gaming, a fairly modest GPU is probably adequate to drive the video. But for serious ML work you will probably want a high-performance card. Where I work (Google) we typically put two GPUs in desk-side machines when one is to be used for numerical computation.

Is it possible to do GPU programming if I have an integrated graphics card?

I have an HP Pavilion Laptop, it's so-called graphics card is some sort of integrated NVIDIA driver running on shared memory. To give you an idea of its capabilities, if a videogame was made in the last 5 years at a cost of more than a couple million dollars, it just won't be playable on my computer.
Anyways, I was wondering if I could do GPU programming, like CUDA, on this thing. I don't expect it to be fast, I'd just like to get the experience and not make my laptop catch fire in the meanwhile.
Find out what GPU your laptop is, and compare it against this list: http://en.wikipedia.org/wiki/CUDA#Supported_GPUs. Most likely, CUDA will not be supported.
This doesn't necessarily prevent you from doing "GPU programming", however. If the GPU supports fragment and vertex shaders, you can use the fixed pipeline to send data to the card (for example, through texture data) and do your processing in a fragment shader. You will then do a read from the pixel buffer to get the data back into system memory. Though hackish, this approach was quite popular until CUDA and other frameworks like OpenCL were introduced.