DirectX - when to use instancing and when not? - rendering

I'm making an application using DirectX 11. I wanted to make use of instancing in the first place, so I've organized my whole pipeline to always work with instancing for simplicity. This means that currently, if I want to draw a single occurrence of a geometry in my scene, it still goes through instanced rendering.
My questions are: what overhead does instancing introduce? Is this approach bad practice in general? If so, is there a rule for deciding when it's beneficial to use instancing and when not?
One similar question that did not help me: What overhead is associated with instanced rendering?

Over the years, the "knee" in the performance graph of when hardware instancing is a win vs. just drawing multiple times has changed. In the early days of hardware instancing, it was almost never a win on those first-generation cards. As GPUs have evolved and added more hardware support to make this faster, it has improved significantly.
For DirectX 12 class hardware (minimum Direct3D hardware feature level 11.0), you can count on instancing being a good win when drawing the same object thousands of times. If you are talking about tens of times, it depends on many factors that you can really only account for by running performance trials.
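To make the "thousands vs. tens" trade-off concrete, here is a minimal sketch that only counts draw calls; it says nothing about actual GPU time, which still has to be measured. The scene contents and the `min_instances_for_instancing` threshold are hypothetical and exist purely for illustration.

```python
from collections import defaultdict

def build_batches(draw_items):
    """Group (mesh_id, transform) pairs into one batch per mesh."""
    batches = defaultdict(list)
    for mesh_id, transform in draw_items:
        batches[mesh_id].append(transform)
    return dict(batches)

def count_draw_calls(batches, min_instances_for_instancing):
    """Count draw calls when only batches at or above the threshold are instanced."""
    calls = 0
    for transforms in batches.values():
        if len(transforms) >= min_instances_for_instancing:
            calls += 1                  # one instanced call covers every copy
        else:
            calls += len(transforms)    # one plain draw per copy
    return calls

# Hypothetical scene: many trees, a few rocks, a single hero mesh.
scene = ([("tree", i) for i in range(1000)]
         + [("rock", i) for i in range(12)]
         + [("hero", 0)])
batches = build_batches(scene)

always_instanced = count_draw_calls(batches, 1)   # the "everything is instanced" pipeline
mixed = count_draw_calls(batches, 50)             # instancing only for large batches
print(always_instanced, mixed)
```

Whether the mixed path is ever faster than the always-instanced path is exactly what the performance trials mentioned above would have to show; a single code path is often worth more than a handful of saved per-instance buffer updates.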
Specifically for 'point-sprites', i.e. particle systems for special effects, there are a number of different approaches. For Xbox One class GPU hardware, just drawing a bunch of vertices is very fast, and vertex hardware instancing only wins over it for huge numbers of particles (although it uses less VRAM, so that may be a consideration). For "DirectX 12 Ultimate" class GPU hardware, a Mesh Shader is actually the fastest way to draw particle systems. See the PointSprites sample on GitHub.

Why do desktop GPUs typically use immediate mode rendering instead of tile based deferred rendering?

In other words, what are the advantages of immediate mode rendering vs. TBDR, assuming you have ample memory, bandwidth, and power (as found on a desktop GPU)?
The main drawback of TBDRs is that they struggle with large amounts of geometry, because they sort it before rendering in order to achieve zero overdraw. This is not a huge deal on low-power GPUs because they deal with simpler scenes anyway.
Modern desktop GPUs do have early-z tests, so if you sort the geometry and draw it front-to-back you can still get most of the bandwidth minimization of a TBDR, and many non-deferred mobile GPUs still do tiling even if they don't sort the geometry.
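The effect of submission order on an early-z GPU can be illustrated with a toy one-dimensional "screen". This is only a sketch under simplifying assumptions: each draw covers a pixel range at a constant depth, and every fragment that passes the depth test is counted as shaded.

```python
def shaded_fragments(draws, width):
    """Count fragments that pass a less-than depth test.

    Each draw is (depth, first_pixel, last_pixel) on a 1-D screen.
    """
    depth_buffer = [float("inf")] * width
    shaded = 0
    for depth, first, last in draws:
        for x in range(first, last + 1):
            if depth < depth_buffer[x]:   # early-z: hidden fragments are rejected before shading
                depth_buffer[x] = depth
                shaded += 1
    return shaded

# Three full-width layers, submitted back to front.
draws = [(9.0, 0, 7), (5.0, 0, 7), (1.0, 0, 7)]

back_to_front = shaded_fragments(draws, 8)
front_to_back = shaded_fragments(sorted(draws), 8)  # tuples sort by depth first
print(back_to_front, front_to_back)
```

Submitted back to front, every layer gets shaded; submitted front to back, only the nearest layer does, which is the overdraw saving the answer refers to.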

Does laptop Embedded Controller have limited writes?

I am wondering if I should be worried about excessive writes to the embedded controller registers on my laptop. I am guessing that if they are true registers, they probably act more like RAM rather than flash memory so this isn't a problem.
However, I have a script to modify the registers in my laptop's EC to better control the fan speed curve. It has to be re-applied after each power change event such as sleep/wake as well as power cable events, so it happens fairly often. I just want to make sure I am not burning out my chips in the process.
The script I am using to write to the EC is located here:
https://github.com/RayfenWindspear/perl-acpi-fanspeed
Well, it seems you're writing to ACPI registers. Registers here do not refer to any specific hardware; it just means it's a specific address that you can reach using a specific bus. It is, however, highly unlikely that something you have to rewrite after every power cycle is overwriting permanent storage, so for all practical purposes I'd assume that you can rely on this for as long as your laptop lives.
Hardware peripherals are almost universally implemented as SRAM cells. They will not wear out first. The fan you are controlling will have a limited number of start/stop cycles. So it is much more likely that the act of toggling these registers will wear something else out prematurely (than the SRAM type memory cell itself).
As for your particular case, correctly driving a fan/motor can significantly improve its lifetime. Overdriving a fan/motor does not always make it go faster, but instead creates heat. The heat weakens the wiring, and eventually the coils will short, reducing drive and finally wearing the motor out. That said, the element being cooled can be damaged by excess heat, so tuning things just to reduce noise may not make sense.
Background details
Generally, the element is called a flip-flop, and it comes in various forms. SystemRDL, SystemC and other languages are what digital engineers use to model them. In digital hardware, the flip-flops have default or reset values. These values are either fixed like ROM on each chip and not normally reprogrammable, stored using EEPROM technology (see Note 1), or, often, configured via input lines that the hardware designer can pull high or low with a resistor or connect to another element's GPIO.
It is analogous to 'initdata'. Program values that aren't zero get copied from flash, disk, etc to memory at program startup. So the flip-flops normally do not hold state over a power cycle; something else does this.
The 'Flash' technology is based on a floating gate and uses quantum tunnelling to program the floating gates. This process is slightly destructive. The tunnelling mechanism is named after Fowler and Nordheim, who described it in 1928, and the floating-gate transistor was invented by Kahng and Sze in 1967, but the wider electronics industry did not start to produce flash parts in volume until the early 90s, with NOR flash followed by NAND flash and many variants. The underlying physics is the same; just the digital connections are different. So, apart from the wear you are concerned about, flash technology actually arrived after many hardware chips such as the 68k, i386, etc. Flip-flops were well established by then, the register part is typically not a large share of a chip's logic, and a flip-flop uses the same logic (gates) as the rest of the chip. Using flash for registers would therefore be extra overhead with little benefit.
An additional take-away is that the startup and shutdown of chips is usually the most destructive time. Poor hardware designers often do not add proper voltage supervision, and some lines may be left floating with the expectation that system programs will set them immediately. Reset events, ESD, overheating, etc. will all be more harmful than just the act of writing a peripheral register.
Note 1: EEPROM typically has 100,000+ cycles. These features are typically only used once at manufacture time to set a chip configuration for the system. These are actually quite rare, but possible.
The MLC (multi-level cell) NAND flash in SSDs has pathetically low endurance, as low as 8,000 cycles in some cases. Old-school SLC (single-level cell) flash has 10,000+ cycles, but people demand large capacities.
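For a sense of scale, the endurance figures above can be turned into a lifetime estimate. This is a hypothetical worst case only: the answers argue that EC registers are not backed by wearable storage at all, and the rate of 50 writes per day is an assumption made up for this sketch.

```python
def years_until_worn(endurance_cycles, writes_per_day):
    """Years until a single cell reaches its rated write/erase endurance."""
    return endurance_cycles / writes_per_day / 365.0

ASSUMED_WRITES_PER_DAY = 50   # hypothetical: sleep/wake and power-cable events per day

eeprom_years = years_until_worn(100_000, ASSUMED_WRITES_PER_DAY)  # Note 1 figure
mlc_years = years_until_worn(8_000, ASSUMED_WRITES_PER_DAY)       # MLC NAND figure
print(round(eeprom_years, 2), round(mlc_years, 2))
```

Even under this pessimistic assumption an EEPROM-class cell would last for years, which is why the SRAM-like registers themselves are not the part to worry about.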

Embedded System and Serial Flash wear issues

I am using a serial NOR flash (SPI-based) for my embedded application, and I have to implement a file system on top of it. That exposes the NOR flash to frequent erase and write cycles, which is where a wear-levelling algorithm comes into the picture. I want to ask a few questions about this:
First, is it possible to implement a wear-levelling algorithm for NOR flash? If yes, why do most of the solutions I find target NAND flash and not NOR flash?
Second, are serial SPI based low cost NAND flashes available, if yes then kindly share the part number for the same.
Third, how difficult is it to implement our own Wear Level algorithm?
Fourth, I have also read/heard that industrial-grade NOR flashes have higher erase/write cycle counts (in the millions!). Is this understanding correct? If yes, kindly let me know the details of such an SPI NOR flash. It might let me avoid implementing a wear-levelling algorithm altogether or, since I'm planning to implement my own, at least give me some room to simplify it in certain areas.
The constraint on all these points is cost; I want a low-cost solution to these issues.
Thanks in Advance
Regards
Aditya Mittal
(mittal.aditya12#gmail.com)
Implementing a wear-levelling algorithm is not trivial, but not impossible either:
Your wear-levelling driver needs to know when disk blocks are no longer used by the filing system (this is known as TRIM support on modern SSDs). In practice, this means you need to modify your block driver API and filing systems above it, or have the wear-levelling driver aware of the filing system's free-space map. This second option is easy for FAT, but probably patented.
You need to reserve at least one erase unit plus a few allocation units to allow erase-unit recycling. Reserving more blocks will increase performance.
You'll want a background thread to perform asynchronous erase-unit recycling
You'll need to test, test and test again. When I last built one of these, we built a simulation of the flash, ran the real filing system on top of it, and tortured the system for weeks.
There are lots and lots of patents covering aspects of wear-levelling. By the same token, there are at least two wear-levelling layers in the Linux kernel.
Given all of this, licensing a third-party library is probably cost-effective.
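In the spirit of the flash simulation mentioned above, here is a minimal sketch of dynamic wear levelling over a simulated flash. The class names, block counts and write pattern are all hypothetical; the mapping lives only in RAM, and power-fail safety, TRIM handling and background recycling are deliberately left out.

```python
class SimFlash:
    """Toy flash model: tracks only per-block erase counts and contents."""
    def __init__(self, num_blocks):
        self.erase_counts = [0] * num_blocks
        self.data = [None] * num_blocks

    def erase(self, block):
        self.erase_counts[block] += 1
        self.data[block] = None

    def program(self, block, payload):
        assert self.data[block] is None, "block must be erased before programming"
        self.data[block] = payload


class WearLeveller:
    """Maps logical blocks to physical ones, reusing the least-worn free block."""
    def __init__(self, flash, num_logical):
        # At least one spare erase unit is needed so a new copy can be written
        # before the old one is released.
        assert num_logical < len(flash.erase_counts), "need at least one spare block"
        self.flash = flash
        self.mapping = {}                                  # logical -> physical
        self.free = set(range(len(flash.erase_counts)))

    def write(self, logical, payload):
        target = min(self.free, key=lambda b: (self.flash.erase_counts[b], b))
        self.free.remove(target)
        self.flash.erase(target)
        self.flash.program(target, payload)
        old = self.mapping.get(logical)
        self.mapping[logical] = target
        if old is not None:
            self.free.add(old)          # the stale copy becomes reusable

    def read(self, logical):
        return self.flash.data[self.mapping[logical]]


flash = SimFlash(num_blocks=8)
leveller = WearLeveller(flash, num_logical=4)

for cold in (1, 2, 3):                  # written once, never touched again
    leveller.write(cold, f"cold-{cold}")
for i in range(800):                    # one hot logical block, rewritten constantly
    leveller.write(0, f"hot-{i}")

print(flash.erase_counts, leveller.read(0))
```

Rewriting the hot block in place would put all 800 erases on one physical block; here they are spread over the five blocks not pinned by cold data. The cold blocks never move, which is the gap that static wear levelling (relocating long-lived data) is meant to close.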
Atmel/Adesto etc. make those little serial flash chips by the billion. They also have loads of online docs. I suspect that the serial flash beetles don't implement wear-levelling because of cost: the devices they are typically used in are very cheap and tend to have a limited lifetime anyway. Bulk, 4-line NAND flash that is expected to see heavier and lengthier use (e.g. SD cards) has relatively complex built-in controllers that can implement wear-levelling in a transparent manner.
I no longer use one-pin interface serial flash, partly due to the wear issue. An SD-card is cheap enough for me to use and, even if one does break, an on-site technician, (or even the customer), can easily swap it out.
Implementing a wear-levelling algo. is too expensive, both in terms of development time, (especially testing if the device has to support a file system that must not corrupt on power fail etc), and CPU/RAM for me to bother.
If your product is so cost-sensitive that you have to use serial NOR flash, I suggest that you ignore the issue.

Windows Low-Level Graphics [closed]

Closed 10 years ago.
I'm new to programming. I do know C/C++ and the basics of Win32. I am now trying to do graphics, but I want the fastest connection to the screen. I realize most are going with Opengl or DirectX. But, I don't want the overhead. I want to start from scratch and control the pixel data. I know about GDI bitmap, but I'm not sure if this is best access to the data. I know that I have to talk through windows, which is the trouble. Do Opengl and DirectX compile down to the level of GDI or is there a special way they do it, do they bypass or use similar code? Please, Don't ask why I want to do this. Maybe an explanation of how this is done might help. Like how windows combines all windows to create the final image.
The most direct access to pixel data is via shaders, which are supported by both OpenGL and Direct3D. They are cross-compiled and run directly on the video card. They do not use OpenGL, they do not have OpenGL overhead. OpenGL is just used to get them to the graphics card's own processor in the first place.
Anything you do on the CPU has to first be copied across the bus (typically PCI-express) to the video card. GDI is actually many levels removed from the graphics memory.
OpenGL, Direct3D, Direct2D, GDI, and GDI+ are all abstraction layers. The GPU vendor writes a driver that accepts these standard command functions, re-encodes the data in the card-specific format, then sends it to the card. Typically OpenGL and Direct3D are the most heavily optimized and also require the least amount of re-encoding.
How Windows combines the various on-screen windows to create the full-screen image depends heavily on what version of Windows you are talking about. DWM changed everything. Since DWM was introduced in Vista, programs render to their personal areas of GPU memory, then the window manager uses the texture lookup units of the video card to efficiently layer each of the programs' individual areas onto the screen primary buffer. When a program (usually a game) requests full-screen exclusive access, this step is skipped and the driver causes rendering commands from that application to affect the primary screen buffer directly.
Assuming that the CPU is generating the data which needs to be displayed, the fastest and most efficient approach is likely to be block-copying that data into a vertex buffer object and using OpenGL commands to rasterize it as lines or polygons or whatever (or the Direct3D equivalent). If you previously thought that GDI was the low-level interface, you've got some reading ahead of you to make this work. But it will run several orders of magnitude faster than pure GDI. So much faster, in fact, that the new architecture is that GDI (and WPF) is built on top of Direct2D and/or Direct3D.
but I want the fastest connection to the screen
I want to start from scratch and control the pixel data
You're asking for the impossible. You get the best performance when you use GPU-accelerated functions. However, in this case you don't get direct access to pixel data, and trying to access it (reading it back or writing it) will hurt performance, because you'll have to transfer data between system memory and video memory. As a result, anything that is streamed from system memory to video memory should be handled with care. Plus, you'll have to study the API.
If you "start from scratch" and do rendering on the CPU, you'll get easy access to pixel data and full control over the rendering, but performance will be inferior to the GPU (the CPU is less suitable for parallel processing, and system memory can be slower than video memory by an order of magnitude), plus you'll spend a significant amount of time reinventing the wheel.
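For reference, "controlling the pixel data" on the CPU amounts to writing bytes into a buffer like the one below. This is a minimal sketch: the tiny 4x3 size and the BGRA byte order are assumptions for illustration, and the step that actually gets the buffer to the screen (a bitmap blit or a texture upload) is left out because it is the part that crosses the bus.

```python
WIDTH, HEIGHT = 4, 3
BYTES_PER_PIXEL = 4   # assumed layout: blue, green, red, alpha (8 bits each)

# One contiguous block of system memory holding every pixel, row by row.
framebuffer = bytearray(WIDTH * HEIGHT * BYTES_PER_PIXEL)

def put_pixel(buf, x, y, r, g, b, a=255):
    """Write one pixel at (x, y) into the row-major buffer."""
    offset = (y * WIDTH + x) * BYTES_PER_PIXEL
    buf[offset:offset + BYTES_PER_PIXEL] = bytes((b, g, r, a))

def fill_row(buf, y, r, g, b):
    """Fill a whole scanline, one pixel at a time."""
    for x in range(WIDTH):
        put_pixel(buf, x, y, r, g, b)

fill_row(framebuffer, 0, 255, 255, 255)       # white top row
put_pixel(framebuffer, 1, 2, 255, 128, 0)     # one orange pixel
```

Every pixel here costs CPU work, and the finished buffer still has to be copied to video memory each frame, which is the transfer cost described above.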
Do Opengl and DirectX compile down to the level of GDI or is there a special way they do it, do they bypass or use similar code?
No. They communicate with the graphics hardware nearly directly, using drivers provided by the hardware manufacturer. And those "direct hardware access" interfaces used by DirectX/OpenGL won't be available to you: they're hardware-specific and manufacturer-specific, can be internal, and are possibly even protected by patents.
There are, of course, a few legacy hardware interfaces which ARE available to you (namely VESA or VGA mode 13h). However, their direct use is normally forbidden by the operating system (you can't easily access VESA on Windows), so to access them you'll have to either boot MS-DOS, use a custom operating system, or use helper libraries (such as SVGAlib on Linux) which might only function with root privileges. And of course, even if you actually use VESA/VGA to render something yourself, on any hardware newer than a RivaTNT 2 Pro performance will be horrible compared to hardware-accelerated rendering done by OpenGL/DirectX. Have you ever seen how fast Windows XP works when it doesn't have a proper GPU driver (it takes a second to redraw a window)? That's how fast it is going to work with direct VESA/VGA access.
Please, Don't ask why I want to do this.
It makes sense to ask why you would want to do that. Your "I want direct low-level access" approach was suitable maybe 15 to 20 years ago, or in the DOS era. Right now, the reasonable solution is to use an existing API (one that is maintained by somebody who isn't you) and search for a way to fully utilize it. Of course, if you wanted to develop drivers, that would be another story.
Do Opengl and DirectX compile down to the level of GDI or is there a special way they do it
and
I realize most are going with Opengl or DirectX. But, I don't want the overhead
So what you're saying is that you have absolutely no clue what OpenGL or DirectX actually do, and yet you've decided that they are not efficient enough for your needs.
I'm sorry, but this is nonsense. It is impossible to answer a question like that.
In the real world, you have a small supercomputer dedicated to doing graphics. And you get access to it through OpenGL and DirectX.
And the reason they are fast is that they do NOT just "start from scratch and control the pixel data".
So please, if you want serious answers, try letting those with the knowledge to answer your questions decide which question is best.
The correct answer, if you want efficient graphics, is to use DirectX or OpenGL.

What does programming for PS3's Cell Processor entail?

How is programming for the Cell Processor on the PS3 different than programming for any other processor found on a normal desktop?
What kind of programming paradigms, techniques, and practices are used to fully utilize the Cell Processor's potential?
All the articles I hear concerning PS3 development discuss, "Learning how to program on the Cell Processor." What does this really mean beyond some hand waving?
In addition to everything George mentions, the SPUs are really better thought of as streaming vector processors. They work best when you have an algorithm that works on long sequences of numerical data, which can be fed through the SPU's limited memory via DMA, rather than having the SPU load a chunk of memory, try to operate on it, find that it needs to follow a pointer to somewhere outside its memory, load that, keep going, find another one, and so on.
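The streaming pattern described here can be sketched in plain Python. It is an illustration only: slicing stands in for DMA transfers, the 256 KB local store size is the figure quoted elsewhere in this thread, and the 64 KB data budget split into two buffers is an assumption chosen for the example.

```python
LOCAL_STORE_BYTES = 256 * 1024     # per-SPU local store (figure quoted in this thread)
DATA_BUDGET_BYTES = 64 * 1024      # assumed share left for data after code and stack
BYTES_PER_FLOAT = 4

# Two buffers (one being processed, one being filled) share the data budget.
CHUNK_LEN = DATA_BUDGET_BYTES // 2 // BYTES_PER_FLOAT   # elements per chunk

def stream_chunks(data, chunk_len):
    """Yield consecutive slices, standing in for DMA transfers into local store."""
    for start in range(0, len(data), chunk_len):
        yield data[start:start + chunk_len]

def scaled_sum(data, scale, chunk_len=CHUNK_LEN):
    """A streaming kernel: it only ever touches the chunk that is resident."""
    total = 0.0
    for chunk in stream_chunks(data, chunk_len):
        total += sum(x * scale for x in chunk)   # no pointer chasing outside the chunk
    return total

samples = list(range(20000))
result = scaled_sum(samples, 2.0)
num_transfers = len(list(stream_chunks(samples, CHUNK_LEN)))
print(CHUNK_LEN, num_transfers, result)
```

The kernel never needs anything outside the chunk it was handed, which is what makes the data easy to feed through a small memory; a linked structure that forces a fetch per pointer is the opposite case.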
So, programming for them isn't a simple model of concurrency and threads; it's more like high performance numerical or scientific computation. It is also non-uniform memory access taken to an extreme.
Furthermore, every processor is in-order with deep pipelines, so the programmer has to be much more aware of data hazards and instruction bubbles and all the numerous micro-optimizations that we are told the compiler "should" take care of for us (but it really doesn't). Things like mispredicted branches, load-hit-stores, cache misses, etc. hurt a lot more than they would on an out-of-order processor that could juggle the order of operations around to hide such latencies.
For concrete examples, check out Mike Acton's CellPerformance blog. Mike is my favorite old-school assembly-happy perf curmudgeon in the business, and he's really earned his chops on this issue.
The Cell part of the PS3 consists of 6 SPU processors. They each have 256 KB of non-shared memory and are connected via a high-speed ring that allows for DMA between each other and the PowerPC host processor. They execute in order and have no cache. This makes it rather different from a multi-core x86 with shared memory, out-of-order execution and caching. Also, the SPU processors do not use the same instruction set as the PowerPC, so you've got some asymmetry there.
In short, your typical shared-memory, multithreaded program won't just drop onto the Cell without some work (with the caveat that computer science works hard at making different machines appear to be the same so some implementors try hard to automate the process).
At a high level, the program will need to be broken up into tasks that fit within the Cell's hard memory limit. Those can run in parallel, and each sub-task can be sequenced to an available Cell processor. At a low level, the compiler (or assembly programmer) will need to work harder to generate code that runs quickly on a processor -- no run-time trickery to make things go faster is available. The theory is that those programmer/compiler-friendly features cost silicon and speed that can be better spent giving you more and faster SPUs. Of course, you're not getting any more SPUs on the PS3, but in the general case you'll get more SPUs for a given transistor budget on the chip.
Completely agree with George Philips and Crashworks. The only thing I'd add is that SPU programming is fundamentally about job management. To get the best out of the SPUs you need to keep them ticking over and feeding back results. There's no point in having one SPU chewing through some complex post-processing if you're having to sit and wait for the results for a frame while the rest of your SPUs sit idle. So how you distribute your jobs requires a lot of thought, and this has a big impact on how you chunk up your data.
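The point about one busy SPU and five idle ones can be shown with a toy scheduler. This is a sketch under stated assumptions: job costs are made-up milliseconds, jobs are independent, and the greedy "largest job to the least-loaded worker" rule is just one simple policy, not how any particular PS3 job manager works.

```python
import heapq

def schedule(job_costs, num_workers):
    """Greedy: hand each job (largest first) to the currently least-loaded worker."""
    heap = [(0.0, w) for w in range(num_workers)]
    heapq.heapify(heap)
    assignment = {w: [] for w in range(num_workers)}
    for cost in sorted(job_costs, reverse=True):
        load, worker = heapq.heappop(heap)
        assignment[worker].append(cost)
        heapq.heappush(heap, (load + cost, worker))
    frame_time = max(load for load, _ in heap)   # the frame waits for the slowest worker
    return assignment, frame_time

NUM_SPUS = 6

# One monolithic post-processing job vs. the same work cut into six pieces.
big_assignment, big_time = schedule([6.0], NUM_SPUS)
split_assignment, split_time = schedule([1.0] * 6, NUM_SPUS)

idle_with_big_job = sum(1 for jobs in big_assignment.values() if not jobs)
print(big_time, idle_with_big_job, split_time)
```

The total work is identical in both cases; only the chunking differs, and that alone decides whether five SPUs sit idle while the frame waits.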
"All the articles I hear concerning PS3 development discuss, 'Learning how to program on the Cell Processor.' What does this really mean beyond some hand waving?"
Well, stuff you have to deal with on SPUs...
Atomic operations (lock-free try-discard style).
Strong distinction between memory areas. You have to know which pointer is pointing to which memory area or you'll screw everything up.
No enforced hardware distinction between data and code. This is actually a fun thing: you can set up dynamic code loading and essentially stream subroutines in and out. Self-modifying code is possible but not necessarily practical on SPU.
Lack of hardware debugging aids.
Limited memory size.
Fast memory access.
Instruction set balanced toward SIMD operations.
Floating point "gotchas".
You ideally want to keep the SPUs doing useful work all of the time, but it's really challenging. Not only are they not well suited for handling some types of problems, but often moving a system to be efficient on SPU can involve a complete redesign. Debugging problems that would be easy to catch on the PPU can sometimes take days on SPU.
I think when people use the phrase "learning how to program the cell" they are mostly hand waving. You can learn the basics in a week, the challenge comes in trying to apply that knowledge to real code... which often already exists and isn't in a form well-suited for use on SPU.