When examining a process in Process Explorer, what does it mean when there are several page faults? The application is processing quite a bit of data and the UI is not very responsive. Are there optimizations to the code that could reduce or eliminate page faults? Would increasing the physical RAM of the system make a difference?
http://en.wikipedia.org/wiki/Page_fault
Increasing the physical RAM on your machine could result in fewer page faults, although design changes to your application will do much better than adding RAM. In general, keeping a smaller memory footprint, and keeping data that is often accessed around the same time on the same pages, will decrease the number of page faults. It can also help to do everything you need with a piece of data while it is in memory, rather than touching it repeatedly at different times, which can cause repeated page faults (thrashing).
It can also help to make sure that memory accessed in sequence is laid out close together (e.g. if you have many objects, store them in an array). If those objects carry a lot of data that is used only infrequently, move that data into a separate class and give the first class a reference to the second. That way the memory you touch most of the time stays small.
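For example, a minimal Java sketch of the hot/cold split idea (all class and field names are made up for illustration):

    /** Minimal sketch (hypothetical names): rarely-used fields live in a
     *  separate object so the frequently-touched data stays small and dense. */
    public class HotColdSplit {

        static final class Particle {
            double x, y, z;     // "hot": read on every update
            ColdData cold;      // reference to the rarely-used details
        }

        static final class ColdData {
            String debugLabel;  // only read when inspecting a particle
            byte[] history;     // only read when exporting
        }

        public static void main(String[] args) {
            // Objects allocated together in a loop usually end up close together
            // on the heap (the array itself only holds references in Java), so
            // the hot fields touched by this pass tend to share pages.
            Particle[] particles = new Particle[100_000];
            for (int i = 0; i < particles.length; i++) {
                particles[i] = new Particle();
            }
            double sum = 0;
            for (Particle p : particles) {
                sum += p.x + p.y + p.z;   // never touches the cold data
            }
            System.out.println(sum);
        }
    }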
Another design option is to write a memory cache that creates memory lazily (on demand). Such a cache holds a collection of pre-allocated memory chunks, looked up by size: for example, an array of N lists, each list holding M buffers, where list i hands out buffers of size 2^i (i = 0..N-1). Even if you need less than 2^i bytes, you simply leave the extra space in the buffer unused.
This trades a small amount of wasted memory for reuse of already-resident buffers and fewer page faults.
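A rough Java sketch of such a pool (the names and the power-of-two sizing policy are purely illustrative, not a drop-in implementation):

    import java.util.ArrayDeque;

    /** Hypothetical buffer pool: free list i hands out byte[] buffers of size 2^i.
     *  Callers may use less than a full buffer; the remainder is simply unused. */
    public class BufferPool {
        private final ArrayDeque<byte[]>[] freeLists;

        @SuppressWarnings("unchecked")
        public BufferPool(int maxPowerOfTwo) {
            freeLists = new ArrayDeque[maxPowerOfTwo + 1];
            for (int i = 0; i < freeLists.length; i++) {
                freeLists[i] = new ArrayDeque<>();
            }
        }

        /** Returns a buffer of at least the requested size (created on demand). */
        public byte[] acquire(int size) {
            int i = sizeClass(size);
            if (i >= freeLists.length) {
                return new byte[size];          // too big to pool; allocate directly
            }
            byte[] buf = freeLists[i].pollFirst();
            return (buf != null) ? buf : new byte[1 << i];
        }

        /** Hands a buffer back so later requests can reuse it. */
        public void release(byte[] buf) {
            int i = sizeClass(buf.length);
            if (i < freeLists.length && (1 << i) == buf.length) {
                freeLists[i].addFirst(buf);     // only pool exact power-of-two buffers
            }
        }

        private int sizeClass(int size) {
            int i = 0;
            while (i < 31 && (1 << i) < size) i++;
            return i;
        }
    }

The tradeoff is exactly as described: a buffer can be up to twice as large as needed, but repeated allocations (and any paging they trigger) are avoided.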
Another option is to use nedmalloc.
Good luck,
Lior
I am writing a program in Java and I am concerned about its speed. I have done some benchmarking tests and the speed does not seem good enough. I think the problem is with the add and get methods of ArrayList, because when I take a snapshot in the JVM profiler it tells me that most of the time is spent in ArrayList's add and get methods.
I read some years ago, when I took the OCPJP exam, that if you do a lot of adds and deletes you should use LinkedList, but if you want fast iteration you should use ArrayList. In other words, use ArrayList when you will mostly call get, and LinkedList when you will mostly call add, and that is what I have done.
I am no longer sure whether this is right.
Can anyone advise me whether I should stick with that rule, or whether there is some other way I can improve the performance?
I think the problem is with the add and get methods of ArrayList, because when I take a snapshot in the JVM profiler it tells me that most of the time is spent in ArrayList's add and get methods
It sounds like you have used a profiler to check where the actual issues are -- that's the right place to start! Can you post the results of the analysis, which might hint at the calling context? The speed of some operations differs between the two implementations, as summarized in other questions. If the calls you see are really being made from another method inside the List implementation, you might be chasing the wrong thing (e.g. frequently inserting near the front of an ArrayList, which can cause terrible performance).
In general, performance will depend on the implementation, but when running benchmarks myself under real-world conditions I have found that ArrayLists generally fit my use cases better, provided I can size them appropriately at creation.
A LinkedList may or may not keep a pool of pre-allocated memory for new nodes, but once any such pool is empty it has to allocate more -- an expensive operation relative to CPU speed. That said, it only has to allocate enough space for one node and link it onto the tail; no copies of the existing data are made.
An ArrayList pre-allocates more space than is strictly required for its underlying array and grows it as elements are added. If you create an ArrayList with the default constructor, its internal array starts with room for 10 elements. The catch is that when the list outgrows the currently allocated size, it must allocate a new contiguous block of memory large enough for the old and new elements and then copy the elements from the old array into the new one.
In short, if you:
use ArrayList
do not specify an initial capacity that guarantees all items fit
proceed to grow the list far beyond its original capacity
you will incur a lot of overhead copying items. If that is the problem, over the long run the cost should be amortized by the resizes you no longer need ... unless, of course, you repeat the whole process with a new list rather than re-using the original, which has already grown.
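For instance, a small illustration of the difference (the element count is arbitrary):

    import java.util.ArrayList;
    import java.util.List;

    public class PreSizeExample {
        public static void main(String[] args) {
            int n = 1_000_000;

            // Default constructor: starts with a small internal array and must
            // re-allocate and copy repeatedly as the list grows.
            List<Integer> grown = new ArrayList<>();
            for (int i = 0; i < n; i++) {
                grown.add(i);
            }

            // Initial capacity known up front: the backing array is allocated
            // once, so add() never triggers a resize/copy.
            List<Integer> preSized = new ArrayList<>(n);
            for (int i = 0; i < n; i++) {
                preSized.add(i);
            }
        }
    }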
As for iteration, an array is one contiguous chunk of memory. Since adjacent items sit next to each other, fetches from main memory can end up being much faster than for the nodes of a LinkedList, which may be scattered all over the heap depending on how things get laid out. I'd strongly suggest trusting the profiler's numbers with each implementation and tracking down what is actually going on.
I've read some articles about using NSCache, and many of them recommend storing NSPurgeableData objects in an NSCache.
However, I don't quite get the point: since NSCache can already evict its contents when memory is tight or when it reaches its count/cost limit, why do we still need NSPurgeableData here? Isn't that just potentially slower than using the data object we already have? What advantage do we gain?
The count limit and the total-cost limit are not strictly enforced. That is, when the cache goes over one of its limits, some of its objects might get evicted immediately, later, or never, all depending on the implementation details of the cache.
So the advantages of using NSPurgeableData here are:
By using purgeable memory, you allow the system to quickly recover memory if it needs to, thereby increasing performance. Memory that is marked as purgeable is not paged to disk when it is reclaimed by the virtual memory system because paging is a time-consuming process. Instead, the data is discarded, and if needed later, it will have to be recomputed.
It also works somewhat like a locking mechanism: access to the data is bracketed by begin/end access calls, and while the data is marked as being accessed it will not be discarded; it only becomes purgeable again once that access has completed.
How would one optimize a queue for the typical:
access / store
memory usage
I'm not sure of any way to reduce memory besides running a compression algorithm on it, but that would cost a great deal of store time as a tradeoff - one would have to recompress everything, I think.
As such I'm thinking of the typical linked list with pointers... or a circular queue?
Any ideas?
Thanks
Edit: regardless of the above, how does one build the fastest / least memory-intensive basic queue structure?
Linked lists are actually not very typical (except in functional languages or when newbies mistakenly think that a linked list is faster than a dynamic array). A dynamic circular buffer is more typical. The growing (and, optionally, shrinking) works slightly differently than in a dynamic array: if the "data holding part" crosses the end of the array, the data should be copied to the new space in such a way that it remains contiguous (simply extending the array would create a gap in the middle of the data).
As usual, it has some advantages and some drawbacks.
Drawbacks:
slightly more complicated implementation
not suitable for lock-free synchronization
Advantages:
more compact: in the worst case (just after growing, or just before shrinking but not yet shrunk) it has a space overhead of about 100%; a singly linked list almost always has an overhead of 100% or more (unless the data elements are larger than a pointer), and a doubly linked list is even worse.
cache efficient: reading happens close to the previous read, writing happens close to the previous write. So cache misses are rare, and when they do occur, the fetched data is mostly relevant (or in the case of writing: the cache line will probably be written to again soon). In a linked list, locality is poor and roughly half of every cache line fetched is wasted on overhead (pointers to other nodes).
Usually these advantages outweigh the drawbacks.
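A minimal Java sketch of a growable ring-buffer queue along these lines (the doubling policy and the names are just for illustration):

    /** Growable ring-buffer FIFO: head and size wrap around the array, and
     *  growing copies the live elements so they become contiguous again. */
    public class RingQueue<T> {
        private Object[] buf = new Object[8];
        private int head = 0;   // index of the oldest element
        private int size = 0;   // number of stored elements

        public void enqueue(T item) {
            if (size == buf.length) {
                grow();
            }
            buf[(head + size) % buf.length] = item;
            size++;
        }

        @SuppressWarnings("unchecked")
        public T dequeue() {
            if (size == 0) throw new IllegalStateException("empty");
            T item = (T) buf[head];
            buf[head] = null;                  // let GC reclaim the slot
            head = (head + 1) % buf.length;
            size--;
            return item;
        }

        public int size() {
            return size;
        }

        /** Doubles the array and copies elements so they start at index 0,
         *  avoiding the gap that simply extending the array would create
         *  when the data has wrapped past the end. */
        private void grow() {
            Object[] bigger = new Object[buf.length * 2];
            for (int i = 0; i < size; i++) {
                bigger[i] = buf[(head + i) % buf.length];
            }
            buf = bigger;
            head = 0;
        }
    }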
I'm thinking of optimizing a program by taking a linear array and writing each element to an arbitrary location (random-looking from the CPU's perspective) in another array. I am only doing simple writes and am not reading the elements back.
I understand that a scattered read on a conventional CPU can be quite slow, since each access causes a cache miss and thus a processor stall. But I was thinking that a scattered write could in principle be fast, because the processor isn't waiting for a result and so may not have to wait for the transaction to complete.
Unfortunately I'm unfamiliar with all the details of a conventional CPU's memory architecture, so there may be complications that make this quite slow as well.
Has anyone tried this?
(I should say that I am trying to invert a problem I have. I currently have a linear array from which I read arbitrary values -- a scattered read -- and it is incredibly slow because of all the cache misses. My thought is that I can invert this operation into a scattered write for a significant speed benefit.)
In general you pay a high penalty for scattered writes to addresses which are not already in cache, since you have to load and store an entire cache line for each write, hence FSB and DRAM bandwidth requirements will be much higher than for sequential writes. And of course you'll incur a cache miss on every write (a couple of hundred cycles typically on modern CPUs), and there will be no help from any automatic prefetch mechanism.
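If you want to see the cost on your own hardware, a crude Java microbenchmark along these lines (array size chosen arbitrarily, no serious benchmarking rigor, so treat the numbers only as a rough indication) can compare the two patterns:

    import java.util.Random;

    /** Crude comparison of sequential vs. scattered writes. Results depend
     *  heavily on array size relative to cache and on JIT warm-up. */
    public class ScatterWriteBench {
        public static void main(String[] args) {
            int n = 1 << 24;                       // 16M ints, far larger than cache
            int[] dst = new int[n];
            int[] idx = new int[n];
            Random rnd = new Random(42);
            for (int i = 0; i < n; i++) {
                idx[i] = rnd.nextInt(n);           // random destination per element
            }

            long t0 = System.nanoTime();
            for (int i = 0; i < n; i++) {
                dst[i] = i;                        // sequential writes
            }
            long t1 = System.nanoTime();
            for (int i = 0; i < n; i++) {
                dst[idx[i]] = i;                   // scattered writes
            }
            long t2 = System.nanoTime();

            System.out.printf("sequential: %d ms, scattered: %d ms%n",
                    (t1 - t0) / 1_000_000, (t2 - t1) / 1_000_000);
        }
    }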
I must admit, this sounds kind of hardcore. But I'll take the risk and answer anyway.
Could you divide the input array into pages and read/scan each page multiple times? On each pass through a page, you only process (or output) the data whose destination falls within a limited set of output pages. That way you only get cache misses at the start of each pass over an input page.
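A sketch of that idea in Java (the tile size and names are arbitrary): the destination is processed in ranges, and each pass over the input only performs the writes that land in the current range.

    /** Scatters src[i] to dst[index[i]], processing the destination in tiles so
     *  each pass only touches a cache-friendly slice of dst. The repeated reads
     *  of src are sequential and therefore cheap compared to random writes. */
    public class TiledScatter {
        public static void scatter(int[] src, int[] index, int[] dst, int tileSize) {
            for (int tileStart = 0; tileStart < dst.length; tileStart += tileSize) {
                int tileEnd = Math.min(tileStart + tileSize, dst.length);
                for (int i = 0; i < src.length; i++) {
                    int d = index[i];
                    if (d >= tileStart && d < tileEnd) {   // only writes into this tile
                        dst[d] = src[i];
                    }
                }
            }
        }
    }

The writes in a given pass all land within one tile of dst, which can stay cache-resident for the duration of that pass.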
My Compact Framework application creates a smooth-scrolling list by rendering all the items to one large bitmap surface, then copying that bitmap to an offset position on the screen so that only the appropriate items show. Older versions rendered only the items that should appear on screen at the time, but that approach was too slow for a smooth-scrolling interface.
It occasionally generates an OutOfMemoryException when initially creating the large bitmap. If the user performs a soft-reset of the device and runs the application again, it is able to perform the creation without issue.
It doesn't look like this bitmap is being generated in program memory, since the application uses approximately the same amount of program memory as it did before the new smooth-scrolling methods.
Is there some way I can prevent this exception? Is there any way I can free up the memory I need (wherever it is) before the exception is thrown?
I'd suggest going back to the old mechanism of rendering only part of the data, as the size of the fully-rendered data is obviously an issue. To help prevent rendering problems I would probably pre-render a few rows above and below the current view so they can be "scrolled" in with limited impact.
Just as soon as I posted, I thought of something you can do to fix the problem with the new version. The issue is that the CF has to find one contiguous block of memory large enough for the huge bitmap, and occasionally it can't.
Instead of creating one big bitmap, create a collection of smaller bitmaps, one per item, and render each item onto its own little bitmap. During display, you then just copy over the bitmaps you need. The CF will have a much easier time creating a bunch of little bitmaps than one big one, and you shouldn't have any memory problems unless this is a truly enormous number of items.
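To show the shape of the idea only -- the sketch below uses desktop Java's BufferedImage rather than Compact Framework bitmaps, and all the names are hypothetical -- render each item to its own small tile and copy just the visible tiles at paint time:

    import java.awt.Graphics;
    import java.awt.image.BufferedImage;
    import java.util.ArrayList;
    import java.util.List;

    /** Per-item tiles instead of one huge off-screen surface. */
    class TiledListRenderer {
        private final List<BufferedImage> tiles = new ArrayList<>();
        private final int itemHeight;

        TiledListRenderer(int itemCount, int width, int itemHeight) {
            this.itemHeight = itemHeight;
            for (int i = 0; i < itemCount; i++) {
                BufferedImage tile =
                        new BufferedImage(width, itemHeight, BufferedImage.TYPE_INT_RGB);
                Graphics g = tile.getGraphics();
                // ... draw item i onto its own small tile here ...
                g.dispose();
                tiles.add(tile);
            }
        }

        /** Copies only the tiles that intersect the visible region. */
        void paint(Graphics screen, int scrollOffset, int viewHeight) {
            int first = scrollOffset / itemHeight;
            int last = (scrollOffset + viewHeight) / itemHeight;
            for (int i = first; i <= last && i < tiles.size(); i++) {
                screen.drawImage(tiles.get(i), 0, i * itemHeight - scrollOffset, null);
            }
        }
    }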
I should avoid expressions like "there is no fix".
One other important point: make sure you call Dispose() on each bitmap when you're finished with it.
Your bitmap definitely is being created in program memory. How much memory the bitmap needs depends on how big it is, and whether that required size will generate the OutOfMemoryException depends on how much is available to the PDA (which makes this a randomly occurring error).
Sorry, but this is generally an inadvisable control rendering technique (especially on the Compact Framework) for which there is no fix short of increasing the physical memory on the PDA, which isn't usually possible (and often won't fix the problem anyway, since a CF process is limited to 32MB no matter how much the device has available).
Your best bet is to go back to the old version and improve its rendering speed. There is also a simple technique available on CF for making a control double-buffered to eliminate flicker.
Since it appears you've run into a device limitation that is restricting the total size of Bitmap space you can create (these are apparently created in video RAM rather than general program memory), one alternative is to replace the big Bitmap object used here with a plain-old block of Windows memory, accessing it for reading and writing by PInvoking the BitBlt API function.
Initially creating the memory block is tricky, and you'd probably want to ask another SO question about that (GCHandle.Alloc can be used here to create a "pinned" object, which means .NET isn't allowed to move it around in memory, which is critical here). I know how to do it, but I'm not sure I do it correctly and I'd rather have an expert's input.
Once you've created the big block, you'd iterate through your items, render each to one small bitmap that you keep re-using (using your existing .NET code), and BitBlt it to the appropriate spot in your memory block.
After creating the entire cache, your rendering code should work just like before, with the difference that instead of copying from the big bitmap to your rendering surface, you BitBlt from your cache block. The arguments for BitBlt are essentially the same as for DrawImage (destination, source, coordinates and sizes etc.).
Since you're creating the cache out of regular memory this way instead of specialized video RAM, I don't think you'll run into the same problem. However, I would definitely get the block creation code working first and test to make sure it can create a big enough block every time.
Update: actually, the ideal approach would be to have a collection of smaller memory blocks rather than one big one (like I thought was the problem with the Bitmap approach), but you already have enough to do. I've worked with CF apps that deal with 5 and 10MB objects and it's not a huge problem anyway (although it might be a bigger problem when that chunk is pinned - I dunno). BTW, I've always been surprised by the OOMEs on Bitmap creation because I knew the bitmaps were much smaller than the available memory, as did you - now I know why. Sorry I thought this was an easy fix at first.