Why do desktop GPUs typically use immediate mode rendering instead of tile based deferred rendering?

In other words, what are the advantages of immediate mode rendering vs. TBDR, assuming you have ample memory, bandwidth, and power (as found on a desktop GPU)?

The main drawback of TBDRs is that they struggle with large amounts of geometry, because they sort it before rendering in order to achieve zero overdraw. This is not a huge deal on low-power GPUs because they deal with simpler scenes anyway.
Modern desktop GPUs do have early-Z tests, so if you sort the geometry yourself and draw it front-to-back you can still get most of the bandwidth savings of a TBDR. Many non-deferred mobile GPUs also do tiling even though they don't sort the geometry.

Related

DirectX - when to use instancing and when not?

I'm making an application using DirectX 11. I wanted to make use of instancing from the start, so I've organized my whole pipeline to always work with instancing for simplicity. This means that currently, even if I want to draw a single occurrence of a piece of geometry in my scene, it still goes through instanced rendering.
My questions are: what overhead does instancing introduce? Is this approach bad practice in general? If so, is there a rule for deciding when instancing is beneficial and when it is not?
One similar question that did not help me: What overhead is associated with instanced rendering?
Over the years the "knee" in the performance graph, the point at which hardware instancing is a win versus just drawing multiple times, has changed. In the early days of hardware instancing, it was almost never a win on those first-generation cards. As GPUs have evolved and gained more hardware support to make it faster, it's improved significantly.
For DirectX 12 class hardware (minimum Direct3D hardware feature level 11.0), you can count on instancing being a good win for drawing the same object thousands of times. If you are talking about tens of times, then it's going to depend on many factors that you can really only account for by running performance trials.
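For illustration, here is a minimal sketch of the two paths in Direct3D 11 (the helper names, counts, and the per-instance-buffer setup are assumptions for the example, not code from the question); the instanced path collapses the loop into one call, and a single object is simply an instance count of 1:

    #include <d3d11.h>

    // Hypothetical sketch: 'indexCount', 'objectCount' and the commented-out
    // constant-buffer update are placeholders.
    void DrawObjects(ID3D11DeviceContext* ctx, UINT indexCount, UINT objectCount)
    {
        // Path 1: one draw call per object (no instancing). The per-object
        // transform would be uploaded to a constant buffer before each call.
        for (UINT i = 0; i < objectCount; ++i)
        {
            // UpdatePerObjectConstants(ctx, i);  // placeholder upload
            ctx->DrawIndexed(indexCount, 0, 0);
        }

        // Path 2: a single instanced draw call. Per-instance data (e.g. world
        // matrices) comes from a second vertex buffer whose input-layout
        // elements are declared with D3D11_INPUT_PER_INSTANCE_DATA.
        ctx->DrawIndexedInstanced(indexCount, objectCount, 0, 0, 0);

        // A single object through the instanced path is just an instance count of 1:
        ctx->DrawIndexedInstanced(indexCount, 1, 0, 0, 0);
    }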
Specifically for 'point sprites', i.e. particle systems for special effects, there are a number of different approaches. For Xbox One class GPU hardware, just drawing a bunch of vertices is very fast, and vertex hardware instancing only wins over it for huge numbers of particles (although instancing will use less VRAM, so that may be a consideration). For "DirectX 12 Ultimate" class GPU hardware, a Mesh Shader is actually the fastest way to draw particle systems. See the PointSprites sample on GitHub.

How the GPU process non-graphic data in parallel?

The introduction of programmable shaders in the graphics pipeline enabled the GPGPU concept, which uses the GPU as a general-purpose engine for processing parallel data.
However, as far as I know, because the GPU is still used far more for graphics than for GPGPU, it contains many fixed-function pipeline stages that cannot be programmed.
If my understanding is correct, any data processed by the GPU, whether graphics or general-purpose, should pass through the graphics pipeline, which includes both programmable stages and non-programmable fixed-function stages.
Does that mean non-graphics processing has to go through the graphics stages even though it doesn't use them, or can it bypass those fixed stages? If someone can explain how the GPU pipeline works for GPGPU, I would appreciate it.
TL;DR:
GPGPU completely bypasses the rendering pipeline, but the pipeline is still used today.
GPUs consist of two main parts (in relation to your question). The first is the processing part, which consists of the memory, registers, warp units, dispatchers and streaming processors. The other part is a set of controllers that are responsible for geometry processing and the graphics pipeline. Those controllers issue commands telling the streaming processors how to process the data for each step of the rendering pipeline, either hardwired or based on user-supplied shaders. NVIDIA calls this part the "PolyMorph Engine", AMD the "Geometry Processor".
Historically, some of those controllers were hardwired to do things a single way, so you could only program the vertex and fragment/pixel shaders. The tessellation controller, for example, was hardwired on the GPU and not user-programmable. As demands grew, more and more of those controllers became user-programmable, and today most of them are completely programmable (Wikipedia).
In the early days of GPGPU, the only way to do computation was to hack the available shaders: put your input data in a texture, draw it on a full-screen quad to calculate the result, and then read the rendered image back (see slide 26 of this introduction).
With CUDA, NVIDIA allowed users not only to program the shaders/PolyMorph Engine, but also to interact directly with the streaming processors and execute code on them (see slides 31 & 32).
This does not mean that the graphics pipeline became obsolete, but there is now a way to bypass it completely and run code directly on the GPU's processors. NVIDIA has a nice explanation of how the pipeline works today, where you can also see both the PolyMorph Engine and the streaming processors, here.
The graphics pipeline still helps the developer by offloading repetitive and more complicated parts of the process, like managing memory, managing warps, passing data around, and so on. Theoretically you could probably write your own pipeline directly on the streaming processors using CUDA and then render the result, but it would be tedious, just as writing GPGPU code using shaders is tedious.
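To make "bypassing the pipeline" concrete: even within a graphics API, a compute dispatch runs a kernel on the streaming processors without touching the vertex, rasterizer, or output-merger stages. Below is a rough, error-handling-free sketch in Direct3D 11 (chosen only because it is self-contained in C++; CUDA exposes the same idea with its own toolchain). None of this is code from the answer above.

    #include <d3d11.h>
    #include <d3dcompiler.h>
    #include <cstdio>
    #include <cstring>
    #pragma comment(lib, "d3d11.lib")
    #pragma comment(lib, "d3dcompiler.lib")

    // A tiny compute kernel: no vertices, no rasterizer, no render target.
    static const char* kShader =
        "RWStructuredBuffer<float> data : register(u0);\n"
        "[numthreads(64, 1, 1)]\n"
        "void main(uint3 id : SV_DispatchThreadID)\n"
        "{\n"
        "    data[id.x] = (float)id.x * 2.0f;\n"
        "}\n";

    int main()
    {
        ID3D11Device* dev = nullptr;
        ID3D11DeviceContext* ctx = nullptr;
        D3D11CreateDevice(nullptr, D3D_DRIVER_TYPE_HARDWARE, nullptr, 0,
                          nullptr, 0, D3D11_SDK_VERSION, &dev, nullptr, &ctx);

        // Compile and create the compute shader.
        ID3DBlob* blob = nullptr;
        D3DCompile(kShader, strlen(kShader), nullptr, nullptr, nullptr,
                   "main", "cs_5_0", 0, 0, &blob, nullptr);
        ID3D11ComputeShader* cs = nullptr;
        dev->CreateComputeShader(blob->GetBufferPointer(), blob->GetBufferSize(),
                                 nullptr, &cs);

        // A 64-element read/write buffer the shader scatters its results into.
        D3D11_BUFFER_DESC bd = {};
        bd.ByteWidth = 64 * sizeof(float);
        bd.Usage = D3D11_USAGE_DEFAULT;
        bd.BindFlags = D3D11_BIND_UNORDERED_ACCESS;
        bd.MiscFlags = D3D11_RESOURCE_MISC_BUFFER_STRUCTURED;
        bd.StructureByteStride = sizeof(float);
        ID3D11Buffer* buf = nullptr;
        dev->CreateBuffer(&bd, nullptr, &buf);
        ID3D11UnorderedAccessView* uav = nullptr;
        dev->CreateUnorderedAccessView(buf, nullptr, &uav);

        // Dispatch one group of 64 threads: general-purpose work, no drawing.
        ctx->CSSetShader(cs, nullptr, 0);
        ctx->CSSetUnorderedAccessViews(0, 1, &uav, nullptr);
        ctx->Dispatch(1, 1, 1);

        // Copy the result to a CPU-readable staging buffer and read one value.
        bd.Usage = D3D11_USAGE_STAGING;
        bd.BindFlags = 0;
        bd.CPUAccessFlags = D3D11_CPU_ACCESS_READ;
        bd.MiscFlags = 0;
        ID3D11Buffer* staging = nullptr;
        dev->CreateBuffer(&bd, nullptr, &staging);
        ctx->CopyResource(staging, buf);
        D3D11_MAPPED_SUBRESOURCE mapped = {};
        ctx->Map(staging, 0, D3D11_MAP_READ, 0, &mapped);
        printf("data[10] = %f\n", ((float*)mapped.pData)[10]);
        ctx->Unmap(staging, 0);
        return 0;
    }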
Although old GPUs had the pipeline hardcoded into the chip, a modern GPU is essentially a large ASIC that can crunch vectorized data at very high speed; it is software that defines what it does. The render pipeline is therefore largely defined by the graphics library and driver (such as OpenGL), not fixed in the GPU. So the GPU does not care what it is computing: as long as the work is vectorized data, it can do all the computation needed and give you a result.

Using a GPU both as video card and GPGPU

Where I work, we do a lot of numerical computations and we are considering buying workstations with NVIDIA video cards because of CUDA (to work with TensorFlow and Theano).
My question is: should these computers come with another video card to handle the display and free the NVIDIA for the GPGPU?
I would appreciate if anyone knows of hard data on using a video card for display and GPGPU at the same time.
Having been through this, I'll add my two cents.
It is helpful to have a dedicated card for computations, but it is definitely not necessary.
I have used a development workstation with a single high-end GPU for both display and compute. I have also used workstations with multiple GPUs, as well as headless compute servers.
My experience is that doing compute on the display GPU is fine as long as demands on the display are typical for software engineering. In a Linux setup with a couple monitors, web browsers, text editors, etc., I use about 200MB for display out of the 6GB of the card -- so only about 3% overhead. You might see the display stutter a bit during a web page refresh or something like that, but the throughput demands of the display are very small.
One technical issue worth noting for completeness is that the NVIDIA driver, GPU firmware, or OS may have a timeout for kernel completion on the display GPU (run NVIDIA's 'deviceQueryDrv' to see the driver's "run time limit on kernels" setting). In my experience (on Linux), with machine learning, this has never been a problem since the timeout is several seconds and, even with custom kernels, synchronization across multiprocessors constrains how much you can stuff into a single kernel launch. I would expect the typical runs of the pre-baked ops in TensorFlow to be two or more orders of magnitude below this limit.
That said, there are some big advantages to having multiple compute-capable cards in a workstation (whether or not one is used for display). Of course there is the potential for more throughput (if your software can use it). However, the main advantage, in my experience, is being able to run long experiments while concurrently developing new ones.
It is of course feasible to start with one card and then add one later, but make sure your motherboard has lots of room and your power supply can handle the load. If you decide to have two cards, with one being a low-end card dedicated to display, I would specifically advise against having the low-end card be a CUDA-capable card lest it get selected as a default for computation.
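If you do end up with more than one CUDA-capable card, here is a small sketch of how you might enumerate them and pin computation to the non-display card with the CUDA runtime API (the "pick the card with the most memory" heuristic is purely illustrative):

    #include <cuda_runtime.h>
    #include <cstdio>

    int main()
    {
        int count = 0;
        cudaGetDeviceCount(&count);

        // Illustrative heuristic: pick the device with the most memory, which on a
        // two-card box is typically the compute card rather than the display card.
        int best = 0;
        size_t bestMem = 0;
        for (int i = 0; i < count; ++i)
        {
            cudaDeviceProp prop;
            cudaGetDeviceProperties(&prop, i);
            printf("device %d: %s, %zu MB, kernel timeout %s\n",
                   i, prop.name, prop.totalGlobalMem >> 20,
                   prop.kernelExecTimeoutEnabled ? "enabled" : "disabled");
            if (prop.totalGlobalMem > bestMem)
            {
                bestMem = prop.totalGlobalMem;
                best = i;
            }
        }
        cudaSetDevice(best);  // subsequent CUDA work in this thread targets this card
        return 0;
    }

Setting the CUDA_VISIBLE_DEVICES environment variable is another way to hide the display card from CUDA entirely, and frameworks like TensorFlow and Theano have their own device-selection settings on top of that.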
Hope that helps.
In my experience it is awkward to share a GPU card between numerical computation tasks and driving a video monitor. For example, there is limited memory available on any GPU, which is often the limiting factor in the size of a model you can train. Unless you're doing gaming, a fairly modest GPU is probably adequate to drive the video. But for serious ML work you will probably want a high-performance card. Where I work (Google) we typically put two GPUs in desk-side machines when one is to be used for numerical computation.

Challenges in using flat memory model

The flat memory model (linear memory model) provides maximum execution speed, occupies minimal CPU real estate, and gives direct access to memory without any segmentation or paging. It seems that the flat memory model is ideal for small or single-threaded real-time applications.
However, is it possible to run a multi-threaded/multi-tasking real-time application, with requirements for substantial resource allocation and protection, under a flat memory model?
Thanks
I don't think the memory model has much to do with it, apart from the (RT)OS itself, which is what you use to get multi-threading / multi-tasking done.
Paging or segmentation, if available, is useful to the OS primarily for implementing memory protection. Only with such hardware can the OS protect itself and the running user-mode tasks from improperly written code in other tasks that would accidentally write to memory outside its intended domain. (You can't get memory protection without some kind of paging or segmentation, since you can't guard every single memory access in software.)
In 32-bit AVR processors there is even a distinction between a memory management unit (MMU) and a memory protection unit (MPU). The former is the more complex unit, supporting the kinds of paging features found in modern PC processors (even making virtual memory possible, for example), while the latter is a simpler subset that only gives you the tools for memory protection (for example, so the OS can protect itself and the tasks from each other) and has no remapping capability (a given address always accesses the same memory cell), unlike the MMU. (Why the distinction? Because some cheaper AVR32s, where an MPU is sufficient, only have an MPU.)
So the important thing you won't get with a simple flat memory model is the protection features. If you can get by without them, it should work just fine.

How To Simulate Lower CPU Processor Machines For Browser Testing

We have some users on machines with lower-powered CPUs, and they're encountering slow response times using our web application. Is there any way for me to do testing so that I can simulate lower CPU speeds?
For example, my machine runs at 2.3 GHz; can I lower it to 1.6 GHz or less so that I can test with it?
BTW, our customers are using Windows, and I have to simulate low computing power with Internet Explorer as the browser.
The multiplier on most newer CPUs can easily be lowered (Intel: SpeedStep, AMD: PowerNow!); this is normally used to save power. With RMClock you can manually adjust your multiplier and thus lower your frequency and make your PC slower. I use this tool myself, so I can tell you that it works.
http://cpu.rightmark.org/products/rmclock.shtml
The virtual machine Bochs (pronounced "boxes") allows you to set an instructions-per-second directive. It's probably the slowest emulator out there even without that, though...
Create some virtual machines.
You can use Virtual PC or VirtualBox; both are free.
I would recommend starting something in the background that eats up all your processor cycles: a program that searches for prime numbers, or something similar.
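For illustration, a minimal sketch of such a background burner, with one busy thread per core (the trial-division prime search is deliberately naive so it wastes cycles):

    #include <cstdint>
    #include <thread>
    #include <vector>

    // Deliberately slow primality test by trial division.
    static bool isPrime(uint64_t n)
    {
        if (n < 2) return false;
        for (uint64_t d = 2; d * d <= n; ++d)
            if (n % d == 0) return false;
        return true;
    }

    int main()
    {
        unsigned cores = std::thread::hardware_concurrency();
        if (cores == 0) cores = 1;
        std::vector<std::thread> workers;
        for (unsigned i = 0; i < cores; ++i)
            workers.emplace_back([] {
                volatile uint64_t latest = 0;      // volatile: keep the loop from being optimized away
                for (uint64_t n = 2; ; ++n)
                    if (isPrime(n)) latest = n;    // runs forever; kill the process to stop
            });
        for (auto& t : workers) t.join();          // never returns
        return 0;
    }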
Another small option, in addition to those above, is to boot Windows in a lower-resource configuration. Go to the Start menu, select Run, and type MSCONFIG. On the Boot tab, click Advanced Options and limit the memory and the number of processors. It's not as robust as the approaches above, but it does give you another option.
Lowering the CPU clock doesn't always give the expected results.
Newer CPUs feature architectural improvements that make them more efficient at an equivalent clock than older chips. Incidentally, because of this, virtual machines are also a bad way of testing performance for "older" tech.
Your best bet is to simply buy a couple of older machines, with RAM (type and amount), processor, motherboard chipset, hard drive, and video card similar to what your users have; all of these feed into the total performance of the machine itself.
I bring the other components up because changing just one of them can have an impact even on browser performance. A prime example is memory. If your clients are constrained to something like 512MB of RAM, their machines could be doing a lot of hard-drive access for virtual memory swapping, even just running the browser. In that situation, downgrading your processor's clock speed while still keeping your (say) 2GB of RAM would not perform anywhere near the same, even if everything else were equal.
Isak Savo's answer works, but it can be a bit finicky, as modern power management is going to try to limit CPU load as much as possible. When I tested it out, it was hard (though possible with some experimentation) to consistently get the kinds of CPU usage I wanted.
Then I remembered http://www.cpukiller.com/, which does this already. Highly recommended. As an aside, I found this utility while playing old '90s games on modern machines; back then, frame rate was pegged to the CPU clock, making those games way too fast on modern computers. Great utility.
Another big difference between high-performance and low-performance CPUs is the number of cores available. This can realistically differ by a factor of 4, way more than the difference in clock frequency you're likely to encounter.
You can solve this by setting the thread affinity. Even IE6 will use 13 threads just to show google.com. That means it will benefit from a multi-core CPU. But if you set the thread affinity to one core only, all 13 IE threads will have to share that one core.
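As a sketch, the same thing can be done programmatically with the Win32 API (the PID and mask below are placeholders); running "start /affinity 1 iexplore.exe" from a command prompt is another way to launch a process pinned to core 0.

    #include <windows.h>
    #include <cstdio>

    int main()
    {
        DWORD pid = 1234;  // placeholder: the browser's process ID from Task Manager
        HANDLE h = OpenProcess(PROCESS_SET_INFORMATION | PROCESS_QUERY_INFORMATION,
                               FALSE, pid);
        if (!h) { printf("OpenProcess failed: %lu\n", GetLastError()); return 1; }

        // Mask bit 0 set => the process may only be scheduled on CPU core 0.
        if (!SetProcessAffinityMask(h, 0x1))
            printf("SetProcessAffinityMask failed: %lu\n", GetLastError());

        CloseHandle(h);
        return 0;
    }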
I understand that this question is pretty old, but here are some recipes I personally use (not only for web development):
BES. I'm getting some weird results while using it.
Go to Control Panel\All Control Panel Items\Power Options\Edit Plan Settings\Change Advanced Power Settings, then go to the "Processor" section and set its maximum state to 5% (or something else). This works only if your processor supports dynamic multiplier changes and the ACPI driver is installed correctly.
Run Task Manager and set the processor affinity to a single core (or whatever number of cores you want) for your browser's (or any other) process. This isn't ideal for browsers, because JavaScript implementations are usually single-threaded, but, as far as I can see, modern browsers actually DO use multiple cores.
There are a few different methods to accomplish this.
If you're using VirtualBox, go into the Settings for the VM you want to slow the CPU speed for. Go to System > Processor, then set the Execution Cap. The percentage controls how slow it will go: lower values are slower relative to the regular speed. In practice, I've noticed the results to be choppy, although it does technically work.
It is also possible to set the CPU speed for the whole system. In the Windows 10 Settings app, go to System > Power & Sleep. Then click Additional Power Settings on the right hand side. Go to Change Plan Settings for the currently selected plan, then click Change Advanced Power Plan Settings. Scroll down to Processor Power Management and set the Maximum Processor State. Again, this is a percentage. Although this does work, I find that in practice, it doesn't have a big impact even when the percentage is set very low.
If you're dealing with a videogame that uses DirectX or OpenGL and doesn't have a framerate cap, another common method is to force Vsync on in your graphics driver settings. This will usually slow the rendering to about 60 FPS which may be enough to play at a reasonable rate. However, it will only work for applications using 3D hardware rendering specifically.
Finally: if you'd rather not use a VM, and don't want to change a system global setting, but would rather simulate an old CPU for one specific process only, then I have my own program to do that called Old CPU Simulator.
The main brain of the operation is a command line tool written in C++, but there is also a GUI wrapper written in C#. The GUI requires .NET Framework 4.0. The default settings should be fine in most cases - just select the CPU you'd like to simulate under Target Rate, then hit New and browse for the program you'd like to run.
https://github.com/tomysshadow/OldCPUSimulator (click the Releases tab on the right for binaries.)
The concept is to suspend and resume the process at a precise rate; because this happens so quickly, the process just appears to be running slowly. For example, by suspending a process for 3 milliseconds and then resuming it for 1 millisecond, it will appear to be running at 25% speed. By controlling the ratio of time suspended to time resumed, it is possible to simulate different speeds. This is completely API agnostic (it doesn't hook DirectX, OpenGL, etc.; it'll work with a command-line program if you want).
Old CPU Simulator does not ask for a percentage but rather for the clock speed to simulate (which it calls the Target Rate). It then automatically determines, based on your CPU's real clock speed, the percentage to use. Although clock speed is not the only factor that has improved computer performance over time (there are also SSDs, faster GPUs, more RAM, multithreaded performance, etc.), it's a good enough approximation to get fairly consistent results across machines given the same Target Rate. It also supports other options that may help with consistency, such as setting the process affinity to a single core.
It implements three different methods of suspending and resuming a process and will use the best available: NtSuspendProcess, NtQuerySystemInformation, or Toolhelp Snapshots. It also uses timeBeginPeriod and timeEndPeriod to achieve high precision timing without busy looping. Note that this is not an emulator; the binary still runs natively. If you like, you can view the source to see how it's implemented - it's not a large project. On my machine, Old CPU Simulator uses less than 1% CPU and less than 1 MB of memory, so the program itself is quite efficient (unlike running intensive programs to intentionally slow the CPU.)
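For the curious, the suspend/resume trick described above can be sketched in a few lines of Win32 code. The PID is a placeholder, error handling is minimal, and NtSuspendProcess/NtResumeProcess are the undocumented ntdll exports the answer mentions, so this is only an illustration of the idea, not how Old CPU Simulator is actually implemented:

    #include <windows.h>
    #include <cstdio>
    #pragma comment(lib, "winmm.lib")

    typedef LONG (NTAPI *NtProcessFn)(HANDLE);

    int main()
    {
        DWORD pid = 1234;  // placeholder: the target process ID
        HMODULE ntdll = GetModuleHandleA("ntdll.dll");
        NtProcessFn pSuspend = (NtProcessFn)GetProcAddress(ntdll, "NtSuspendProcess");
        NtProcessFn pResume  = (NtProcessFn)GetProcAddress(ntdll, "NtResumeProcess");

        HANDLE h = OpenProcess(PROCESS_SUSPEND_RESUME, FALSE, pid);
        if (!h || !pSuspend || !pResume) { printf("setup failed\n"); return 1; }

        timeBeginPeriod(1);       // 1 ms timer resolution so the Sleep ratio is meaningful
        for (;;)
        {
            pSuspend(h);          // frozen for ~3 ms ...
            Sleep(3);
            pResume(h);           // ... then running for ~1 ms => roughly 25% speed
            Sleep(1);
        }
    }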