Why do we use CPUs for ray tracing instead of GPUs?

After doing some research on rasterisation and ray tracing, I have discovered that there is not much information available on the internet about how CPUs are used for ray tracing. I came across an article about Pixar and how they rendered Cars 2 on CPUs; that took them 11.5 hours per frame. Would a GPU not have rendered this faster with the same image quality?
http://gizmodo.com/5813587/12500-cpu-cores-were-required-to-render-cars-2
https://www.engadget.com/2014/10/18/disney-big-hero-6/
http://www.firstshowing.net/2009/michael-bay-presents-transformers-2-facts-and-figures/

I'm one of the rendering software architects at a large VFX and animated feature studio with a proprietary renderer (not Pixar, though I was once the rendering software architect there as well, long, long ago).
Almost all high-quality rendering for film (at all the big studios, with all the major renderers) is CPU only. There are a bunch of reasons why this is the case. In no particular order, some of the really compelling ones to give you the flavor of the issues:
GPUs only go fast when everything is in memory. The biggest GPU cards have, what, 12GB or so, and it has to hold everything. Well, we routinely render scenes with 30GB of geometry that reference 1TB or more of texture. Can't load that into GPU memory, it's literally two orders of magnitude too big. So GPUs are simply unable to deal with our biggest (or even average) scenes. (With CPU renderers, we can page stuff from disk whenever we need. GPUs aren't good at that.)
Don't believe the hype: ray tracing with GPUs is not an obvious win over CPUs. GPUs are great at highly coherent work (doing the same things to lots of data at once). Ray tracing is very incoherent (each ray can go a different direction, intersect different objects, shade different materials, access different textures), and this access pattern degrades GPU performance very severely. It's only very recently that GPU ray tracing could match the best CPU-based ray tracing code, and even where it has surpassed it, it's not by much, certainly not enough to throw out all the old code and start fresh with buggy, fragile code for GPUs. And the biggest, most expensive scenes are the ones where GPUs are only marginally faster. Being lots faster on the easy scenes is not really important to us.
If you have 50 or 100 man years of production-hardened code in your CPU-based renderer, you just don't throw it out and start over in order to get a 2x speedup. Software engineering effort, stability, and so on, is more important and a bigger cost factor.
Similarly, if your studio has an investment in a data center holding 20,000 CPU cores, all in the smallest, most power and heat-efficient form factor you can, that's also a sunk cost investment you don't just throw away. Replacing them with new machines containing top of the line GPUs vastly increases the cost of your render farm, and they are bigger and produce more heat, so it literally might not fit in your building.
Amdahl's Law: The actual "rendering" per se is only one stage in generating the scenes, and GPUs don't help with it. Let's say it takes 1 hour to fully generate and export the scene to the renderer, and 9 hours to "render", and of those 9 hours, an hour is spent reading textures, volumes, and other data from disk. So of the total 10 hours the user experiences as rendering (push button until final image is ready), only 8 hours can potentially be sped up with GPUs. So, even if the GPU were 10x as fast as the CPU for that part, you go from 10 hours to 1 + 1 + 0.8 = nearly 3 hours, and the 10x GPU speedup translates to only about a 3.5x actual gain. If the GPU were 1,000,000x faster than the CPU for ray tracing, you'd still have 1 + 1 + tiny, which is only a 5x speedup.
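In formula form, that arithmetic is just Amdahl's Law. Writing the fixed time as t_f, the GPU-acceleratable time as t_a, and the speedup on that part as s:

    % Amdahl's Law applied to the example above: t_f = 2 hours of fixed
    % work, t_a = 8 hours of GPU-acceleratable work, sped up by a factor s.
    \text{overall speedup} \;=\; \frac{t_f + t_a}{t_f + t_a/s},
    \qquad
    \frac{2+8}{2+8/10} \approx 3.6,
    \qquad
    \lim_{s \to \infty} \frac{2+8}{2+8/s} \;=\; 1 + \frac{8}{2} \;=\; 5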
But what's different about games? Why are GPUs good for games but not film?
First of all, when you make a game, remember that it's got to render in real time -- that means your most important constraint is the 60Hz (or whatever) frame rate, and you sacrifice quality or features where necessary to achieve that. In contrast, with film, the unbreakable constraint is making the director and VFX supervisor happy with the quality and look he or she wants, and how long it takes you to get that is (to a degree) secondary.
Also, with a game, you render frame after frame after frame, live in front of every user. But with film, you effectively are rendering ONCE, and what's delivered to theaters is a movie file -- so moviegoers will never know or care if it took you 10 hours per frame, but they will notice if it doesn't look good. So again, there is less of a penalty placed on those renders taking a long time, as long as they look fabulous.
With a game, you don't really know what frames you are going to render, since the player may wander all around the world, view from just about anywhere. You can't and shouldn't try to make it all perfect, you just want it to be good enough all the time. But for a film, the shots are all hand-crafted! A tremendous amount of human time goes into composing, animating, lighting, and compositing every shot, and then you only need to render it once. Think about the economics -- once 10 days of calendar (and salary) has gone into lighting and compositing the shot just right, the advantage of rendering it in an hour (or even a minute) versus overnight, is pretty small, and not worth any sacrifice of quality or achievable complexity of the image.
ADDENDUM (2022):
The world has changed a lot since I wrote this answer in 2016! Once ray tracing acceleration was added to hardware (with NVIDIA RTX cards) ray tracing on GPUs was finally, definitively faster than ray tracing the same scene on a CPU -- for scenes that are of a size that can fit on the GPUs. And GPUs have a lot more memory than they did in 2016, so that includes a much wider range of scenes. Lots of games in 2022 use a combination of rasterization and ray tracing (when available) and probably within a couple years there may be games that are ray traced only. And in the film world, we are all racing to get our renderers ray tracing on GPUs with full feature parity with the CPU ray tracers. But we're not quite there yet. We use the GPUs more and more for various interactive uses during production, but final frames are still CPU rendered for full-complexity frames. But I think we're within a year or two of some portion of final frames being rendered strictly with GPU ray tracing, and probably within 5 years of nearly all final film frames being GPU ray traced (though not anywhere near at realtime rates).

Related

DirectX - when to use instancing and when not?

I'm making an application using DirectX 11. I wanted to make use of instancing from the start, so I've organized my whole pipeline to always work with instancing, for simplicity. This means that currently, even drawing a single occurrence of a piece of geometry in my scene goes through instanced rendering.
My questions are: what overhead does instancing introduce? Is this approach bad practice in general? If so, is there a rule for deciding when it's beneficial to use instancing and when it isn't?
One similar question that did not help me: What overhead is associated with instanced rendering?
Over the years, the "knee" in the performance graph (the point where hardware instancing becomes a win vs. just drawing multiple times) has moved. In the early days of hardware instancing, it was almost never a win on those first-generation cards. As GPUs have evolved and put in more hardware support to make this faster, it's improved significantly.
For DirectX 12 class hardware (minimum Direct3D hardware feature level 11.0), you can count on instancing being a good win when drawing the same object thousands of times. If you are talking about tens of times, then it's going to depend on many factors that you can really only account for by running performance trials.
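As a rough sketch of the two call paths being weighed here (Direct3D 11; buffer and shader setup are assumed already done, and the function and parameter names are hypothetical):

    #include <d3d11.h>

    // Hedged sketch only: assumes vertex/index/instance buffers, the input
    // layout, and shaders are already created and bound, and that the
    // vertex shader tolerates a single-instance draw.
    void DrawGeometry(ID3D11DeviceContext* context,
                      UINT indexCount, UINT instanceCount)
    {
        if (instanceCount > 1)
        {
            // One call covers all instances; per-instance data comes from
            // a second vertex stream or a buffer indexed by SV_InstanceID.
            context->DrawIndexedInstanced(indexCount, instanceCount, 0, 0, 0);
        }
        else
        {
            // For a handful of occurrences, plain draws can match or beat
            // the instanced path; profiling tells you where the knee is.
            context->DrawIndexed(indexCount, 0, 0);
        }
    }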
Specifically for 'point sprites', i.e. particle systems for special effects, there are a number of different approaches. For Xbox One class GPU hardware, just drawing a bunch of vertices is very fast, and vertex hardware instancing only wins over it for huge numbers of particles (although it will use less VRAM, so that may be a consideration). For "DirectX 12 Ultimate" class GPU hardware, a mesh shader is actually the fastest way to draw particle systems. See the PointSprites sample on GitHub.

Do GPU cores switch tasks when they're done with one?

I'm experimenting with C++ AMP. One thing that's unclear from the MS documentation is this:
If I dispatch a parallel_for_each with an extent of, say, 1000, then that would mean that it spawns 1000 threads. If the GPU is unable to take on those 1000 threads at the same time, it completes them 300 at a time, or 400, or whatever number it can do. Then there was some vague stuff on warps and tiles, out of which I got this impression:
Regardless of how the threads are tiled together (or not at all), the whole group must finish before taking on new tasks. So if the internally assigned group has a size of 128 and 30 of them finish, those 30 cores will idle until the other 98 are done too. Is that true? Also, how do I find out what this internal group's size is?
During my experimentation, it certainly appears to have some truth to it, because assigning more evenly sized amounts of work to the threads seems to speed things up, even if there is slightly more work overall.
The reason I'm trying to figure this out is that I'm deciding whether or not to engage in another lengthy experiment that would be based on threads getting uneven amounts of work (sometimes by a factor of 10x), but where all the threads would be independent, so data-wise the cores would be free to pick up another thread.
In practice, the underlying execution model of AMP on GPU is the same as CUDA, OpenCL, Compute Shaders, etc. The only thing that changes is the naming of each concept. So if you feel that the AMP documentation is lacking, consider reading up on CUDA or OpenCL. Those are significantly more mature APIs and the knowledge you gain from them applies as well to AMP.
If I dispatch a parallel_for_each with an extent of say 1000, then that would mean that it spawns 1000 threads. If the gpu is unable to take on those 1000 threads at the same time, it completes them 300 at a time or 400 or whatever number it can do.
Maybe. From the high-level view of parallel_for_each, you don't have to care about this. The threads may as well be executed sequentially, one at a time.
If you launch 1000 threads without specifying a tile size, the AMP runtime will choose a tile size for you, based on the underlying hardware. If you specify a tile size, then AMP will use that one.
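For concreteness, here is a minimal sketch of an explicitly tiled dispatch in C++ AMP ('square_all' is a hypothetical kernel, the tile size of 250 is arbitrary, and the extent is assumed to be a multiple of it):

    #include <amp.h>
    #include <vector>
    using namespace concurrency;

    void square_all(std::vector<float>& data)
    {
        array_view<float, 1> av(static_cast<int>(data.size()), data);

        // The untiled form would be parallel_for_each(av.extent, ...),
        // letting the runtime pick a tile size. Explicitly tiled: every
        // 250 consecutive threads form one tile, and each tile runs to
        // completion on a single multiprocessor.
        parallel_for_each(av.extent.tile<250>(),
                          [=](tiled_index<250> idx) restrict(amp)
        {
            av[idx.global] = av[idx.global] * av[idx.global];
        });

        av.synchronize();   // copy results back into 'data'
    }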
GPUs are made of multiprocessors (in CUDA parlance, or compute units in OpenCL), each composed of a number of cores.
Tiles are assigned per multiprocessor: all threads within the same tile will be run by the same multiprocessor, until all threads within that tile run to completion. Then, the multiprocessor will pick another available tile (if any) and run it, until all tiles are executed. Multiprocessors can execute multiple tiles simultaneously.
if the internally assigned group has the size of 128 and 30 of them finish, the 30 cores will idle until the other 98 are done too. Is that true?
Not necessarily. As mentioned earlier, a multiprocessor may have multiple active tiles. It may therefore schedule threads from other tiles to remain busy.
Important note: on GPUs, threads are not executed at a granularity of 1. For example, NVIDIA hardware executes 32 threads at once.
To not make this answer needlessly lengthy, I encourage you to read up on the concept of a warp.
The GPU certainly won't run 1000 threads at the same time, but it also won't complete them 300 at a time.
It uses multithreading, which means that, just like a CPU, it will share run time among the 1000 threads, allowing them to complete seemingly at the same time.
Keep in mind that creating a lot of threads may not be worthwhile, for several reasons. For instance, if you must complete all 1000 tasks in step 1 before doing step 2, you might as well distribute them across a number of threads equal to the number of cores in your GPU and no more than that.
Using more threads than the number of cores only makes sense if you want to dispatch tasks that are not being waited on, or because writing your code this way is easier. But keep in mind that thread management costs time too and may drag down your performance.

How fast is PhysX on GPU compared to physics engines on CPU?

I have an application that is written to use the Bullet physics engine. I am running it on an Intel i7 2600K CPU (4 cores, 8 hardware threads). The application has to process millions of chunks of physics work, each of which can be done independently. It currently runs with 8 processes, each working through its quota of the total independently. In summary, this work has a lot of easy parallelism.
Assuming that I can acquire the best NVIDIA consumer graphics card (say, a Titan), what is the ballpark improvement in physics engine performance I can expect by switching from Bullet on the CPU to PhysX on the GPU? That is, approximately how much faster will this application run if rewritten for PhysX?
I found a few papers that compare the result quality between Bullet and Physx, but could not find anything about the performance comparison.
Pierre Terdiman has done an extensive series of performance comparisons between Bullet 2.81 and PhysX 2.8.4, 3.2, and 3.3 here. These are comparisons between Bullet and PhysX, both running on the CPU. It can be seen that the performance difference between the two depends on which features of the engine are being used. For a few features the performance is about the same, while for most others there is a 3-5x speedup.
He also mentions in the addendum that not all physics features have been ported to PhysX on the GPU. Cloth and particles can be accelerated on the GPU, while rigid body simulation is currently being ported to the GPU, in a feature called GPU Rigid Bodies (GRB). If a feature is GPU accelerated, you can expect it to be faster than on the CPU, but by how much is not clear.
I found this; it's not a comparison against any specific CPU physics engine, but one hopes they are comparing like with like and running PhysX on the CPU.
So it's rather unspecific, and it comes from a FAQ by the makers of PhysX, so take it with a pinch of salt.
From here:
Running PhysX on a mid-to-high-end GeForce GPU will enable 10-20 times more effects and visual fidelity than physics running on a high-end CPU.
Let's say PhysX is doing particle interactions, such as gravity or fluid movement. These are embarrassingly parallel, so cache control is very important. You cannot directly control your CPU's cache, but you can explicitly manage the on-chip memory of a Titan, which can make it maybe 100x faster than an 8-thread CPU.
If the work is not so parallel, has a lot of branching, and doesn't involve heavy computation, then it is more like a 5-10x speedup (or roughly whatever the bandwidth ratio of graphics RAM to main RAM is).
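The "cache" referred to here is the GPU's programmer-managed on-chip memory. No API is named in the answer, but a hedged sketch of the idea in C++ AMP terms (all names and sizes illustrative) looks like this:

    #include <amp.h>
    using namespace concurrency;

    static const int TS = 256;   // illustrative tile size

    // Average each particle value with its two neighbours, reading from
    // fast on-chip storage instead of global memory for every access
    // (clamping at tile edges for simplicity). Assumes the extent is a
    // multiple of TS.
    void smooth(array_view<const float, 1> in, array_view<float, 1> out)
    {
        parallel_for_each(in.extent.tile<TS>(),
                          [=](tiled_index<TS> idx) restrict(amp)
        {
            tile_static float cache[TS];   // the "controllable cache"
            const int l = idx.local[0];
            cache[l] = in[idx.global];
            idx.barrier.wait();            // wait until the tile is staged

            float left  = (l > 0)      ? cache[l - 1] : cache[l];
            float right = (l < TS - 1) ? cache[l + 1] : cache[l];
            out[idx.global] = (left + cache[l] + right) / 3.0f;
        });
    }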

How to decide system requirements for embedded systems application/software

How should I decide system requirements like:
RAM capacity
FLASH memory capacity
Processor frequency
etc
I am building an application to control a NAND flash, an LCD driver, a UART, and a keypad, using a 16-bit microcontroller.
This has to be estimated from previous projects with similar functionality, or even other people's products. But it is best to develop with a larger capacity and decide on the final parts when your software nears completion, because it's easier to omit components than to try to find room for them later. This kind of design can be an iterative process: start with one estimate and see if a prototype works, and don't commit to volumes until you are nearly at the end.
In the case of an LCD-based product, you will have two major components using up flash memory: the code, and the LCD data (character strings, bitmaps, etc). It's certainly easier to estimate the LCD data than the code, which depends on functionality, compiler optimisations, and so on. If you are bringing in external libraries, then at least you already have code for them.
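As a toy illustration of budgeting the LCD-data side up front (every number below is hypothetical; substitute your own asset counts and flash region size):

    // Hypothetical back-of-envelope flash budget for the LCD assets
    // alone; all the numbers are made up for illustration.
    constexpr int kNumStrings     = 200;    // UI strings
    constexpr int kAvgStringLen   = 24;     // bytes each, incl. terminator
    constexpr int kNumBitmaps     = 12;     // icons, splash screen, ...
    constexpr int kAvgBitmapBytes = 2048;   // 128x128 monochrome, packed

    constexpr int kLcdDataBytes = kNumStrings * kAvgStringLen
                                + kNumBitmaps * kAvgBitmapBytes;  // ~29 KB

    static_assert(kLcdDataBytes <= 48 * 1024,
                  "LCD assets must fit the flash region reserved for them");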
In any case, have an upgrade plan. The worst thing is to run out of capacity at the end of the project and be struggling to squeeze the last feature or bug fix in without creating another problem. Make sure you know what the next size up of chip is and how you can make it fit; sometimes a PCB can be designed to take several different chips in the same position. Or have an expandable system, where you can plug things into a memory bus.
How many units will you be making?
If your volumes are low (<1e3) but per-unit profits are high and time to market matters, more generous hardware will get the developers done sooner.
If the volumes are huge (>1e6) and profits per unit are low, then you penny-pinch the hardware, but time to develop will go up. If time to market matters, that's a tradeoff.
Design the board with 2x the capacity (RAM/flash), but don't load the parts, other than to check it works.
Then if you run out of room, there is somewhere to go.
Will customers expect to get firmware updates? Or is this a drop-ship product with no support? A supportable product is harder and needs more resources.
You'll need to pad resources to have room to expand into if the product needs support for a long time.
For CPU frequency estimates: how much work is required to be done?
Get an Eval board for a likely MCU and prove out the core function.
Let us say it's a display for a piece of exercise equipment. Can it keep up with the sensors on the device at 2-3x the designed pace? That means reading the sensors and updating the display. If cost is required to be low, you can then underclock the eval board and see what trades can be made.

Hardware requirements for development machines

Given that:
SSDs are now [high-end] mainstream
Two+ cores are not hard to come across
24+ Inch monitors are plentiful
Dual Video Outputs are the norm.
64-bit OSes complement very cheap memory
Can I ask two questions of hardware-enthused developers [not the gamers!]?
What high-end hardware item could you not develop without - [what is your hardware crutch]?
What should a baseline [no frills] dev machine look like and what basic specs should it have to ensure that any dev can still be productive?
Note: It might be worth mentioning what platform and dev-env your base line is for?
The most important hardware update (and most underrated) is the monitor.
If you're coding 8+ hours a day, don't hesitate on cost: get a nice high-end 24" at least, or even a pair of them.
An absolute must-have is a good monitor which is easy on the eyes; after all, you stare at it all day. I go with a 24" Samsung (I forget the model). I used to go with two monitors but prefer the one wide screen now. You need to be able to get docs and code on the same screen.
Secondly is a good chair and desk (sorry not very technical).
Followed lastly by plenty of RAM (2GB minimum). Once you get past any thrashing due to paging, you are fine. Anything with a dual core has enough processing power.
This is entirely dependent upon what you are developing for. Take your target system requirements, double them, and use that as the minimum spec for the dev machines. That may seem odd, but it is about what I've found I've needed when developing various projects.
As others have mentioned the importance of getting good monitors, keyboard, and chairs is underrated. If you are going to spend a lot of time at this PC, those are very important.
RAM is cheap, and you'll likely never have enough. If you are running 32-bit Windows, max it out at 4GB of RAM. If you are using another OS that supports more than 4GB of RAM (Linux, or 64-bit Windows, for example), start at 8GB minimum, and if you are working on multimedia projects, be ready to upgrade from there.
The best bang for the buck on CPUs seems to be quad cores right now, so I would say that at least a quad core (2.4GHz or so) should be the minimum. You may not see much difference going up beyond there, until you get to dual quad cores, which is a large price jump.
Find a reliable hard drive or two. Reliability and speed are going to be more important than size. Personally, I currently go for a pair of 640GB Western Digital drives in all the machines I build.
24 inch or larger monitor
Baseline dev machine would be a 15 inch MacBook Pro with 4GB of RAM. (For web development)
A pair of the fastest hard drives available. I never realized how much difference separate, fast system and data drives can make.
(And please, none of those slow SSDs that you usually get nowadays in <$2000 laptops. If you really want to hop on the SSD train, get a proper one; otherwise you might as well use a 32GB SDHC card.)
There's been a study on the optimum size of computer monitors by the University of Utah (see the Wall Street Journal article). Not surprisingly, bigger monitors boost the speed of work. What is surprising is that there seems to be an optimum size of 26". There's no explanation why, though.
I am not a developer, but do sit at the computer all day.
For me the must-have is a desk that is a good height or easily adjusted. I prefer dual monitors: a 26" and a second wide screen that can turn sideways, to view documents at full length without a lot of scrolling. Add a computer with a dual core (I'd prefer quad) and at least 4GB of RAM (I tend to do a lot of VM work), and, as stated above, a good chair with lumbar support that lets me lean back when I am reading or pondering a situation. The last one is specific to me: since I have glasses and tend to hear high frequencies, I prefer incandescent lighting with a slightly warm spectrum. I can hear a fluorescent ballast over someone playing loud speakers, and I also find I get less glare and can focus my eyes for longer periods with incandescent light.
RAM, lots and lots of RAM. RAM compensates for many performance bottlenecks.
But do make sure you keep an eye on the memory usage of whatever you're building. When you're building a 60 MB footprint app on a system with 2 gigs of developer tools loaded at run-time, it's easy to lose that footprint in the noise, even when it doubles.
Don't bother shelling out for a high-end CPU. The CPU is the most overpowered component in modern systems. A standard cheap dual core should be more than enough. Compiles tend to be disk-bound, not CPU-bound, so that money is better invested in a faster drive.
Dell Outlet sells 30" LCD monitors for about $800.00.
That is a good place to start.
Besides that, invest time into tweaking your OS to your needs and automate as much as possible.
It's like I keep telling people, "I'll upgrade to the latest Mac when it somehow manages to help me run more Terminal windows and Text Editors." Until then, you're better off saving the money for a new machine and investing it into a decent monitor and keyboard.
It depends on the project.
For large imaging applications, like medical imaging applications, you may require: large monitors (you have to view the images properly and in detail), powerful graphics, lots of RAM, and a good processor (imaging applications usually need lots of power).
I'm going to echo most people on the large monitors part, and you can always make good use of a pair.
Second to that is a good keyboard. What that means varies depending on which school of keyboard design you subscribe to. I'm with the ergonomic camp.
Following that is 2GB+ of RAM and a recent desktop CPU (anything released in the past 2-3 years, really).
As has been previously said, large monitors are essential. These days it's not that expensive to have two hooked up to a machine. At work I'm lucky enough to have three hooked up to one PC, and it makes a huge difference to how I work.
A decent keyboard and mouse are essential. For the last 10 or so years I've always taken my own mouse and keyboard to work, as you typically end up with whatever comes from the PC manufacturer. I use a Microsoft ergonomic keyboard, and it's very hard to find these in the workplace, or to get your employer to stump up for one, but I've never worked anywhere where the employer had an issue with me bringing my own.
High-end hardware I cannot do without:
Kinesis contoured ergonomic keyboard ($300)
Fast twin SATA drives, striped for speed ($150)
Affordable luxuries I could do without:
Dell 30" widescreen monitor ($900)
Twin Velociraptor hard drives ($600)