When profiling, most of the time is spent in nvoglv64.dll. What should I deduce? - optimization

I am profiling a C++ application with Intel VTune Amplifier. Most of the time seems to be spent in nvoglv64.dll more precisely in DrvPresentBuffers and/or KeSynchoronizeExecution. Note that I have a NVIDA GeoForce graphic card.
I am new to the application I am profiling and looking for bottleneck and low hanging fruits of optimization. Since most of the time seems to be spent in this NVIDIA dll, I do not know how decode the profiling results.
I would like to know where are those call from my application side in order to build a knowledge of my application. Can someone give me some hint to start :
When exactly do an application call DrvPresentBuffers, what kind of call should I look at (on my application side)
Where can I get more info about how to profile, understand and optimize applications where bottlenecks are in the graphic card dll's.

DrvPresentBuffers is part of the draw code for openGL. That nvoglv64.dll is the 64bit openGL driver for your nVidia card. There is a known performance issue for 64bit Windows 7 and this function on many drivers. I couldn't find a link but you can search the nVidia forum if you are experiencing problems. If there is nothing wrong or nothing going horribly slow then I'm not sure optimization is where I would start when familiarizing myself with a new application.

Related

UE4's editor is not loading beyond 18% on an Intel Macbook Air (BigSur)

I have recently started exploring game dev and wanted to do it using the Unreal Engine(4). However, the editor doesn't seem to be loading on my computer. I have displayed the UE4 window I keep having to look at upon launching the engine.
Could anyone please help me with this problem?
Thanks.
I'm no expert in Unreal Engine, and I don't know how long you've waited previously (do tell!), but UE4 is a massive, memory intensive engine that would struggle on any laptop (especially a Macbook). I personally use it on my laptop (an HP Pavillion) so it is possible, but there have been times when I am trying to open a project and I've had to wait for around 30 minutes for the percent to change (it was compiling shaders silently). Although, I've never seen it take too terribly long on startup (not opening a project). Try waiting for 30 minutes and see what happens (if you haven't already), otherwise I'm sure you know Unity is a lighter-weight option with comparable features.

Kinect hangs up suddenly after working pretty well a few seconds. How can I fix it?

I tried using "Kinect for Windows" on my Mac. Environment set-up seems to have gone well, but something seems being wrong. When I start some samples such as
OpenNI-Bin-Dev-MacOSX-v1.5.4.0/Samples/Bin/x64-Release/Sample-NiSimpleViewer
or others, the sample application start and seems working quite well at the beginning but after a few seconds (10 to 20 seconds), the move seen in screen of the application halts and never work again. It seems that the application get to be unable to fetch data from Kinect from certain point where some seconds passed.
I don't know whether the libraries or their dependency, or Kinect's hardware itself is going wrong (as for hardware, invisibly broken or something), and I really want to know how to detect which is it.
Could anybody tell me how can I fix the issue please?
My environment is shown below:
Mac OS X v10.7.4 (MacBook Air, core i5 1.6Ghz, 4GB of memory)
Xcode 4.4.1
Kinect for Windows
OpenNI-Bin-Dev-MacOSX-v1.5.4.0
Sensor-Bin-MacOSX-v5.1.2.1
I followed instruction here about libusb: http://openkinect.org/wiki/Getting_Started#Homebrew
and when I try using libfreenect(I know it's separate from OpenNI+SensorKinect), its sample applications say "Number of devices found: 0", which makes no sense to me since I certainly connected my Kinect to MBA...)
Unless you're booting to Windows forget about Kinect for Windows.
Regarding libfreenect and OpenNI in most cases you'll use one or the other, so think of what functionalities you need.
If it's basic RGB+Depth image (and possibly motor and accelerometer ) access libfreenect is your choice.
If you need RGB+Depth image and skeleton tracking and (hand) gestures (but no motor, accelerometer access) use OpenNI. Note that if you use the unstable(dev) versions, you should use Avin's SensorKinect Driver.
Easiest thing to do a nice clean install of OpenNI.
Also, if it helps, you can a creative coding framework like Processing or OpenFrameworks.
For Processing I recommend SimpleOpenNI
For OpenFrameworks you can use ofxKinect which ties to libfreenect or ofxOpenNI. Download the OpenFrameworks packaged on the FutureTheatre Kinect Workshop wiki as it includes both addons and some really nice examples.
When you are connecting the Kinect device to the machine, have you provided external power to it? The device will appear connected to a computer by USB only power but will not be able to tranfer data as it needs the external power supply.
Also what Kinect sensor are you using? If it is a new Kinect device (designed for Windows) they may have a different device signature which may cause the OpenNI drivers to play-up. I'm not a 100% on this one, but I've only ever tried OpenNI with an XBox 360 sensor.

How Do You Profile & Optimize CUDA Kernels?

I am somewhat familiar with the CUDA visual profiler and the occupancy spreadsheet, although I am probably not leveraging them as well as I could. Profiling & optimizing CUDA code is not like profiling & optimizing code that runs on a CPU. So I am hoping to learn from your experiences about how to get the most out of my code.
There was a post recently looking for the fastest possible code to identify self numbers, and I provided a CUDA implementation. I'm not satisfied that this code is as fast as it can be, but I'm at a loss as to figure out both what the right questions are and what tool I can get the answers from.
How do you identify ways to make your CUDA kernels perform faster?
If you're developing on Linux then the CUDA Visual Profiler gives you a whole load of information, knowing what to do with it can be a little tricky. On Windows you can also use the CUDA Visual Profiler, or (on Vista/7/2008) you can use Nexus which integrates nicely with Visual Studio and gives you combined host and GPU profile information.
Once you've got the data, you need to know how to interpret it. The Advanced CUDA C presentation from GTC has some useful tips. The main things to look out for are:
Optimal memory accesses: you need to know what you expect your code to do and then look for exceptions. So if you are always loading floats, and each thread loads a different float from an array, then you would expect to see only 64-byte loads (on current h/w). Any other loads are inefficient. The profiling information will probably improve in future h/w.
Minimise serialization: the "warp serialize" counter indicates that you have shared memory bank conflicts or constant serialization, the presentation goes into more detail and what to do about this as does the SDK (e.g. the reduction sample)
Overlap I/O and compute: this is where Nexus really shines (you can get the same info manually using cudaEvents), if you have a large amount of data transfer you want to overlap the compute and the I/O
Execution configuration: the occupancy calculator can help with this, but simple methods like commenting the compute to measure expected vs. measured bandwidth is really useful (and vice versa for compute throughput)
This is just a start, check out the GTC presentation and the other webinars on the NVIDIA website.
If you are using Windows... Check Nexus:
http://developer.nvidia.com/object/nexus.html
The CUDA profiler is rather crude and doesn't provide a lot of useful information. The only way to seriously micro-optimize your code (assuming you have already chosen the best possible algorithm) is to have a deep understanding of the GPU architecture, particularly with regard to using shared memory, external memory access patterns, register usage, thread occupancy, warps, etc.
Maybe you could post your kernel code here and get some feedback ?
The nVidia CUDA developer forum forum is also a good place to go for help with this kind of problem.
I hung back because I'm no CUDA expert, and the other answers are pretty good IF the code is already pretty near optimal. In my experience, that's a big IF, and there's no harm in verifying it.
To verify it, you need to find out if the code is for sure not doing anything it doesn't really have to do. Here are ways I can see to verify that:
Run the same code on the vanilla processor, and either take stackshots of it, or use a profiler such as Oprofile or RotateRight/Zoom that can give you equivalent information.
Running it on a CUDA processor, and doing the same thing, if possible.
What you're looking for are lines of code that have high occupancy on the call stack, as shown by the fraction of stack samples containing them. Those are your "bottlenecks". It does not take a very large number of samples to locate them.

embedded application

In the last two months I've worked as a simple application using a computer vision library(OpenCV).
I wish to run that application directly from the webcam without the need of an OS. I'm curious to know if that my application can be burned into a chip in order to not have the OS to run it.
Ofcorse the process can be expensive, but I'm just curious. Do you have any links about that?
ps: the application is written in C.
I'd use something bigger than a PIC, for example a small 32 bit ARM processor.
Yes. It is theoretically possible to port your app to PIC chips.
But...
There are C compilers for the PIC chip, however, due to the limitations of a microcontroller, you might find that the compiler, and the microcontroller itself is far too limited for computer vision work, especially if your initial implementation of the app was done on a full-blown PC:
You'll only have integer math available to you, in most cases, if not all (can't quote me on that, but our devs at work don't have floating point math for their PIC apps and it causes many foul words to emanate from their cubes). Either that, or you'll need to hook to an external math coprocessor.
You'll have to figure out how to get the PIC chip to talk USB to the camera. I know this is possible, but it will require additional hardware, and R&D time.
If you need strict timing control,
you might even have to program the
app in assembler.
You'd have to port portions of OpenCV to the PIC chip, if it hasn't been already. My guess is not.
If your'e not already familiar with microcontroller programming, you'll need some time to get up to speed on the differences between desktop PC programming and microcontroller programming, and you'll have to gain some experience in that. This may not be an issue for you.
Basically, it would probably be best to re-write the whole program from scratch given a PIC chip constraint. Good thing is though, you've done a lot of design work already. It would mainly be hardware/porting work.
OR...
You could try using a small embedded x86 single-board PC, perhaps in the PC/104 form factor, with your OS/app on a CF card. It's a real bone fide PC, you just add your software. Good thing is, you probably wouldn't have to re-write your app, unless it had ridiculous memory footprint. Embedded PC vendors are starting to ship boards based on 1 GHz Intel Atoms, and if you needed more help you could perhaps hook a daughterboard onto the PC-104 bus. You'll work around all of the limitations listed above, as your using an equivalent platform to the PC you developed your app on. And it has USB ports! If you do a thorough cost analysis and if your'e cool with a larger form factor, you might find it to be cheaper/quicker to use a system based on a SBC than rolling a solution using PIC chips/microcontrollers.
A quick search of PC-104 on Google would reveal many vendors of SBCs.
OR...
And this would be really cheap - just get a off-the-shelf cheap Netbook, overwrite the OEM OS, and run the code on there. Hackish, but cheap, and really easy - your hardware issues would be resolved within a week.
Just some ideas.
I think you'll find this might grow into pretty large project.
It's obviously possible to implement a stand-alone hardware solution to do something like this. Off the top of my head, Rabbit's solutions might get you to the finish-line faster. But you might be able to find some home-grown Beagle Board or Gumstix projects as well.
Two Google links I wanted to emphasize:
Rabbit: "Camera Interface Application Kit"
Gumstix: "Connecting a CMOS camera to a Gumstix Connex motherboard"
I would second Nate's recommendation to take a look at Rabbit's core modules.
Also, GHIElectronics has a product called the Embedded Master that runs .Net MicroFramework and has USB host/device capabilities built-in as well as a rich library that is a subset of the .Net framework. It runs on an Arm processor and is fairly inexpensive (> $85). Though not nearly as cheap as a single PIC chip it does come with a lot of glue logic pre-built onto the module.
CMUCam
I think you should have a look at the CMUcam project, which offers affordable hardware and an image processing library which runs on their hardware.

Multi core programming

I want to get into multi core programming (not language specific) and wondered what hardware could be recommended for exploring this field.
My aim is to upgrade my existing desktop.
If at all possible, I would suggest getting a dual-socket machine, preferably with quad-core chips. You can certainly get a single-socket machine, but dual-socket would let you start seeing some of the effects of NUMA memory that are going to be exacerbated as the core counts get higher and higher.
Why do you care? There are two huge problems facing multi-core developers right now:
The programming model Parallel programming is hard, and there is (currently) no getting around this. A quad-core system will let you start playing around with real concurrency and all of the popular paradigms (threads, UPC, MPI, OpenMP, etc).
Memory Whenever you start having multiple threads, there is going to be contention for resources, and the memory wall is growing larger and larger. A recent article at arstechnica outlines some (very preliminary) research at Sandia that shows just how bad this might become if current trends continue. Multicore machines are going to have to keep everything fed, and this will require that people be intimately familiar with their memory system. Dual-socket adds NUMA to the mix (at least on AMD machines), which should get you started down this difficult road.
If you're interested in more info on performance inconsistencies with multi-socket machines, you might also check out this technical report on the subject.
Also, others have suggested getting a system with a CUDA-capable GPU, which I think is also a great way to get into multithreaded programming. It's lower level than the stuff I mentioned above, but throw one of those on your machine if you can. The new Portland Group compilers have provisional support for optimizing loops with CUDA, so you could play around with your GPU even if you don't want to learn CUDA yourself.
Quad-core, because it'll permit you to do problems where the number of concurrent processes is > 2, which often non-trivializes problems.
I would also, for sheer geek squee, pick up a nice NVidia card and use the CUDA API. If you have the bucks, there's a stand-alone CUDA workstation that plugs into your main computer via a cable and an expansion slot.
It depends what you want to do.
If you want to learn the basics of multithreaded programming, then you can do that on your existing single-core PC. (If you have 2 threads, then the OS will switch between them on a single-core PC. Then when you move to a dual-core PC they should automatically run in parallel on separate cores, for a 2x speedup). This has the advantage of being free! The disadvantages are that you won't see a speedup (in fact a parallel implementation is probably slightly slower due to overheads), and that buggy code has a slightly higher chance of working.
However, although you can learn multithreaded programming on a single-core box, a dual-core (or even HyperThreading) CPU would be a great help.
If you want to really stress-test the code you're writing, then as "blue tuxedo" says, you should go for as many cores as you can easily afford, and if possible get hyperthreading too.
If you want to learn about algorithms for running on graphics cards - which is a very different area to x86 multicore - then get CUDA and buy a normal nVidia graphics card that supports it.
I'd recommend at least a quad-core processor.
You could try tinkering with CUDA. It's free, not that hard to use and will run on any recent NVIDIA card.
Alternatively, you could get a PlayStation 3 and the Linux SDK and work out how to program a Cell processor. Note that the next cheapest option for Cell BE development is an order of magnitude more expensive than a PS3.
Finally, any modern motherboard that will take a Core Quad or quad-core Opteron (get a good one from Asus or some other reputable manufacturer) will let you experiment with a multi-core PC system for a reasonable sum of money.
The difficult thing with multithreaded/core programming is that it opens a whole new can of worms. The bugs you'll be faced with are usually not the one you're used to. Race conditions can remain dormant for ages until they bite and your mainstream language compiler won't assist you in any way. You'll get random data and/or crashes that only happen once a day/week/month/year, usually under the most mysterious conditions...
One things remains true fortunately : the higher the concurrency exhibited by a computer, the more race conditions you'll unveil.
So if you're serious about multithreaded/core programming, then go for as many cpu cores as possible. Keep in mind that neither hyperthreading nor SMT allow for the level of concurrency that multiple cores provide.
I would agree that, depending on what you ultimately want to do, you can probably get by with just your current single-core system. Multi-core programming is basically multi-threaded programming, and you can certainly do that on a single-core chip.
When I was a student, one of our projects was to build a thread-safe implementation the malloc library for C. Even on a single core processor, that was more than enough to cure me of my desire to get into multi-threaded programming. I would try something small like that before you start thinking about spending lots of money.
I agree with the others where I would upgrade to a quad-core processor. I am also a BIG FAN of ASUS Motherboards (the P5Q Pro is excellent for Core2Quad and Core2Duo processors)!
The draw for multi-core programming is that you have more resources to get things done faster. If you are serious about multi-core programming, then I would absolutely get a quad-core processor. I don't believe that you should get the new i7 architecture from Intel to take advantage of multi-core processing because anything written to take advantage of the Core2Duo or Core2Quad will just run better on the newer architecture.
If you are going to dabble in multi-core programming, then I would get a good Core2Duo processor. Remember, it's not just how many cores you have, but also how FAST the cores are to process the jobs. My Core2Duo running at 4GHz routinely completes jobs faster than my Core2Quad running at 2.4GHz even with a multi-core program.
Let me know if this helps!
JFV