Processor Profiler - Adobe AIR

I have an application written in Adobe AIR that is constantly consuming at least 20% of my CPU. Does anyone know of a tool, perhaps a CPU profiler, that can show me more about this issue?

High CPU with ImageResizer DiskCache plugin

We are noticing occasional periods of high CPU on a web server that happens to use ImageResizer. A trace performed with New Relic's thread profiler during one of these spikes produced surprising results.
It would appear that the cleanup routine associated with ImageResizer's DiskCache plugin is responsible for a significant percentage of the high CPU consumption associated with this application. We have autoClean on, but otherwise we're configured to use the defaults, which I understand are optimal for most typical situations:
<diskCache autoClean="true" />
Armed with this information, is there anything I can do to relieve the CPU spikes? I'm open to disabling autoClean and setting up a simple nightly cleanup routine, but my understanding is that this plugin is built to be smart about how it uses resources. Has anyone experienced this and had any luck simply changing the default configuration?
This is an ASP.NET MVC application running on Windows Server 2008 R2 with ImageResizer.Plugins.DiskCache 3.4.3.
Sampling, or why the profiling is unhelpful
New Relic's thread profiler uses a technique called sampling - it does not instrument the calls - and therefore cannot know whether CPU time is actually being consumed.
Looking at the provided screenshot, we can see that the backtrace of the cleanup thread (there is only ever one) is frequently found at the WaitHandle.WaitAny and WaitHandle.WaitOne calls. These methods are low-level synchronization constructs that do not spin or consume CPU resources, but rather efficiently return CPU time back to other threads, and resume on a signal.
A correct profiler should be able to detect idle or waiting threads and exclude them from its statistical analysis. Because New Relic's profiler fails to do that, there is no useful way to interpret the data it is giving you.
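To see why a blocking wait shows up in samples without costing CPU, here is a small C++ analogue (an illustrative sketch, not .NET's WaitHandle internals): the waiting thread is parked by the kernel until it is signaled, yet a sampling profiler will catch its backtrace inside the wait on every sample.

#include <chrono>
#include <condition_variable>
#include <cstdio>
#include <mutex>
#include <thread>

std::mutex m;
std::condition_variable cv;
bool signaled = false;

void waiter() {
    std::unique_lock<std::mutex> lock(m);
    // Sleeps in the kernel until notified; consumes essentially no CPU,
    // but every profiler sample of this thread lands on this line.
    cv.wait(lock, [] { return signaled; });
    std::printf("resumed after signal\n");
}

int main() {
    std::thread t(waiter);
    std::this_thread::sleep_for(std::chrono::seconds(1));
    {
        std::lock_guard<std::mutex> lk(m);
        signaled = true;
    }
    cv.notify_one();  // the signal that wakes the waiter
    t.join();
}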
If you have more than 7,000 files in /imagecache, here is one way to improve performance
By default, in V3, DiskCache uses 32 subfolders with 400 items per folder (1000 hard limit). Due to imperfect hash distribution, this means that you may start seeing cleanup occur at as few as 7,000 images, and you will start thrashing the disk at ~12,000 active cache files.
This is explained in the DiskCache documentation - see the subfolders section.
I would suggest setting subfolders="8192" if you have a larger volume of images. A higher subfolder count increases overhead slightly, but also increases scalability.
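For example, assuming the rest of your configuration stays at the defaults shown above, the change is a single attribute (8192 is the value suggested here, not a universal optimum):

<diskCache autoClean="true" subfolders="8192" />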

How would I simulate running code on different hardware while using only one machine?

As the title says, I'm looking to approximate the performance of a piece of code on different hardware setups. Are there any tools out there to do this?
I'm looking to run my code and perform measurements by limiting the resources available to the process. I would like to control things such as total memory available as well as cpu usage, but it would be better if I had more granularity. Are there any tools out there that would allow me to emulate different speeds of RAM, rate limit the cpu (to say X gigaflops), slow down disk reads, etc?
I've already been looking at the setrlimit call on Linux, but I don't think it will let me emulate things like latency. I considered using VMs to run the code and just tweaking the memory and CPU, but I'm not sure that's granular enough. I also considered hooking some of the syscalls and just spinning for X nanoseconds before allowing a read/write syscall, but it feels kind of clunky. The other issue is that this code primarily runs on Windows, so if possible it would be preferable to do this on Windows.
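(For reference, the syscall-hooking idea can be sketched on Linux with an LD_PRELOAD shim; the following is a rough illustration of the "delay before the real call" approach, with an arbitrary 100 microsecond latency, and has no direct Windows equivalent.)

// slow_read.cpp - LD_PRELOAD shim that adds latency to every read() call.
// Build: g++ -shared -fPIC -o slow_read.so slow_read.cpp -ldl
// Run:   LD_PRELOAD=./slow_read.so ./your_program
#include <dlfcn.h>
#include <ctime>
#include <unistd.h>

extern "C" ssize_t read(int fd, void *buf, size_t count) {
    using read_fn = ssize_t (*)(int, void *, size_t);
    static read_fn real_read =
        reinterpret_cast<read_fn>(dlsym(RTLD_NEXT, "read"));
    timespec delay = {0, 100000};  // add ~100 microseconds per read
    nanosleep(&delay, nullptr);
    return real_read(fd, buf, count);  // forward to the real libc read()
}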
Just for some background, I'm trying to provide some reasonably accurate estimates of things like runtime and resource utilization on different hardware setups without having to actually buy, assemble, and test said hardware.
Thanks for any help you can provide.
If you wish to get very detailed control of every possible part of a machine, use a software emulated machine such as Bochs. Bochs will emulate, in software, an x86 CPU, hard drive, video card, network card, everything.
To do what you describe, you would need to build your own version of Bochs, modifying the emulator to control the speed of the different pieces.

Why is my ARM chip/Surface processing WCF calls a lot slower than my i7 laptop, and is there anything I can do to speed it up?

I am running diagnostics on my WinRT Store application, and I am noticing considerable performance differences between my Surface RT device and my i7 laptop.
Now, I know there is a big difference in expected performance between an ARM CPU and an i7, but when my average WCF web call takes ~0.2s on the i7 and ~1.2s on the Surface, I am forced to start looking at optimizations and improvements. If the performance difference between the two were only a few hundred milliseconds I wouldn't mind so much, but the Surface device does feel a little clunky - and the only bottleneck seems to be the services!
Does anyone have an explanation, or even some performance improvement tips? I should mention that I am running the services over basicHttpBinding with binary serialization.
WCF is a heavyweight stack, so I wouldn't be surprised if it simply performs that much slower on a correspondingly slower CPU. Make sure there is no other CPU load at the same time, start your requests from a background thread and display a progress indicator, or try switching to a lighter technology such as REST/JSON.
I have switched to an OData stream to improve performance.
Aside from being fast, this lets me select only the data I need from the service, which both reduces my application's bandwidth consumption and speeds up the service calls.
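As a purely hypothetical illustration (the service URL and entity names here are made up), an OData query can project only the properties you need using the standard $select option, so less data crosses the wire:

https://example.com/MyService.svc/Orders?$select=Id,Total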

When profiling, most of the time is spent in nvoglv64.dll. What should I deduce?

I am profiling a C++ application with Intel VTune Amplifier. Most of the time seems to be spent in nvoglv64.dll, more precisely in DrvPresentBuffers and/or KeSynchronizeExecution. Note that I have an NVIDIA GeForce graphics card.
I am new to the application I am profiling, and I am looking for bottlenecks and low-hanging optimization fruit. Since most of the time seems to be spent in this NVIDIA DLL, I do not know how to decode the profiling results.
I would like to know where those calls originate on my application's side, in order to build up my knowledge of the application. Can someone give me some hints to get started:
When exactly does an application call DrvPresentBuffers, and what kinds of calls should I look for (on my application's side)?
Where can I get more information about how to profile, understand, and optimize applications whose bottlenecks lie in the graphics card's DLLs?
DrvPresentBuffers is part of the draw code for OpenGL. That nvoglv64.dll is the 64-bit OpenGL driver for your NVIDIA card. There is a known performance issue involving this function on many drivers under 64-bit Windows 7. I couldn't find a link, but you can search the NVIDIA forums if you are experiencing problems. If nothing is wrong and nothing is running horribly slowly, then I'm not sure optimization is where I would start when familiarizing myself with a new application.
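For orientation, here is a generic Win32/OpenGL frame fragment (an illustrative sketch with window and context setup omitted, not code from the profiled application). The time a sampling profiler attributes to the driver DLL typically accumulates under the buffer swap, especially when vsync makes the call block until the next display refresh:

#include <windows.h>
#include <GL/gl.h>

void renderFrame(HDC hdc) {
    glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT);
    // ... issue draw calls here (glDrawArrays, glDrawElements, ...) ...
    SwapBuffers(hdc);  // present: driver code such as DrvPresentBuffers runs
                       // here, and any vsync wait is attributed to the driver
}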

Multi-core programming

I want to get into multi core programming (not language specific) and wondered what hardware could be recommended for exploring this field.
My aim is to upgrade my existing desktop.
If at all possible, I would suggest getting a dual-socket machine, preferably with quad-core chips. You can certainly get a single-socket machine, but dual-socket would let you start seeing some of the effects of NUMA memory that are going to be exacerbated as the core counts get higher and higher.
Why do you care? There are two huge problems facing multi-core developers right now:
The programming model: Parallel programming is hard, and there is (currently) no getting around this. A quad-core system will let you start playing around with real concurrency and all of the popular paradigms (threads, UPC, MPI, OpenMP, etc.); see the sketch after these two points.
Memory: Whenever you have multiple threads, there is going to be contention for resources, and the memory wall is growing larger and larger. A recent article at Ars Technica outlines some (very preliminary) research at Sandia that shows just how bad this might become if current trends continue. Multi-core machines will have to keep every core fed, and this will require that people be intimately familiar with their memory system. Dual-socket adds NUMA to the mix (at least on AMD machines), which should get you started down this difficult road.
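Here is the minimal OpenMP sketch promised above (it assumes a compiler with OpenMP support, e.g. g++ -fopenmp; the array size is arbitrary). Each core sums a slice of the array, and the reduction clause gives every thread a private partial sum so there is no race on the shared total:

#include <omp.h>
#include <cstdio>
#include <vector>

int main() {
    std::vector<double> data(1000000, 1.0);
    double sum = 0.0;
    // reduction(+ : sum) gives each thread a private partial sum and
    // combines them at the end, avoiding a race on the shared variable.
    #pragma omp parallel for reduction(+ : sum)
    for (long i = 0; i < static_cast<long>(data.size()); ++i)
        sum += data[i];
    std::printf("sum = %.0f using up to %d threads\n",
                sum, omp_get_max_threads());
}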
If you're interested in more info on performance inconsistencies with multi-socket machines, you might also check out this technical report on the subject.
Also, others have suggested getting a system with a CUDA-capable GPU, which I think is also a great way to get into multithreaded programming. It's lower level than the stuff I mentioned above, but throw one of those on your machine if you can. The new Portland Group compilers have provisional support for optimizing loops with CUDA, so you could play around with your GPU even if you don't want to learn CUDA yourself.
Quad-core, because it will let you tackle problems where the number of concurrent processes is greater than 2, which is often what makes a problem non-trivial.
I would also, for sheer geek squee, pick up a nice NVIDIA card and use the CUDA API. If you have the bucks, there's a stand-alone CUDA workstation that plugs into your main computer via a cable and an expansion slot.
It depends on what you want to do.
If you want to learn the basics of multithreaded programming, then you can do that on your existing single-core PC. (If you have 2 threads, the OS will switch between them on a single-core PC; when you move to a dual-core PC, they should automatically run in parallel on separate cores, for a 2x speedup.) This has the advantage of being free! The disadvantages are that you won't see a speedup (in fact, a parallel implementation is probably slightly slower due to overheads), and that buggy code has a slightly higher chance of appearing to work.
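For instance, a toy like the following (a C++11 sketch; the busy-work loop is arbitrary) is time-sliced by the OS on a single core and runs genuinely in parallel on two:

#include <cstdio>
#include <thread>

void worker(int id) {
    volatile long long sum = 0;  // volatile keeps the busy loop from
    for (long long i = 0; i < 100000000; ++i)  // being optimized away
        sum += i;
    std::printf("thread %d finished\n", id);
}

int main() {
    std::thread a(worker, 1);
    std::thread b(worker, 2);  // on a dual-core machine, runs concurrently with a
    a.join();
    b.join();
}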
However, although you can learn multithreaded programming on a single-core box, a dual-core (or even HyperThreading) CPU would be a great help.
If you want to really stress-test the code you're writing, then as "blue tuxedo" says, you should go for as many cores as you can easily afford, and if possible get hyperthreading too.
If you want to learn about algorithms for running on graphics cards - which is a very different area from x86 multi-core - then get CUDA and buy a normal NVIDIA graphics card that supports it.
I'd recommend at least a quad-core processor.
You could try tinkering with CUDA. It's free, not that hard to use and will run on any recent NVIDIA card.
Alternatively, you could get a PlayStation 3 and the Linux SDK and work out how to program a Cell processor. Note that the next cheapest option for Cell BE development is an order of magnitude more expensive than a PS3.
Finally, any modern motherboard that will take a Core 2 Quad or quad-core Opteron (get a good one from Asus or another reputable manufacturer) will let you experiment with a multi-core PC system for a reasonable sum of money.
The difficult thing about multithreaded/multi-core programming is that it opens a whole new can of worms. The bugs you'll be faced with are usually not the ones you're used to. Race conditions can remain dormant for ages until they bite, and your mainstream language's compiler won't assist you in any way. You'll get random data and/or crashes that only happen once a day/week/month/year, usually under the most mysterious conditions...
One thing remains true, fortunately: the higher the concurrency exhibited by a computer, the more race conditions you'll unveil.
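A classic minimal example (hypothetical, for illustration): two threads incrementing an unprotected counter will often appear to work in casual testing, yet silently lose updates under real concurrency:

#include <cstdio>
#include <thread>

int counter = 0;  // shared and unprotected: this is the bug

void bump() {
    for (int i = 0; i < 1000000; ++i)
        ++counter;  // non-atomic read-modify-write
}

int main() {
    std::thread a(bump);
    std::thread b(bump);
    a.join();
    b.join();
    // Correct answer is 2000000; with the race, the result is usually lower,
    // and more cores make the loss show up more reliably.
    std::printf("counter = %d\n", counter);
}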
So if you're serious about multithreaded/multi-core programming, go for as many CPU cores as possible. Keep in mind that neither Hyper-Threading nor other SMT implementations allow for the level of concurrency that multiple cores provide.
I would agree that, depending on what you ultimately want to do, you can probably get by with just your current single-core system. Multi-core programming is basically multi-threaded programming, and you can certainly do that on a single-core chip.
When I was a student, one of our projects was to build a thread-safe implementation of the malloc library for C. Even on a single-core processor, that was more than enough to cure me of my desire to get into multithreaded programming. I would try something small like that before you start thinking about spending lots of money.
I agree with the others that I would upgrade to a quad-core processor. I am also a BIG FAN of ASUS motherboards (the P5Q Pro is excellent for Core 2 Quad and Core 2 Duo processors)!
The draw of multi-core programming is that you have more resources to get things done faster. If you are serious about multi-core programming, then I would absolutely get a quad-core processor. I don't believe you need Intel's new i7 architecture to take advantage of multi-core processing, because anything written to take advantage of a Core 2 Duo or Core 2 Quad will simply run better on the newer architecture.
If you are going to dabble in multi-core programming, then I would get a good Core 2 Duo processor. Remember, it's not just how many cores you have, but also how FAST those cores are at processing jobs. My Core 2 Duo running at 4GHz routinely completes jobs faster than my Core 2 Quad running at 2.4GHz, even with a multi-core program.
Let me know if this helps!
JFV