Degraded performance when upgrading to CGAL-5.3 - cgal

We have recently moved from CGAL-4.10 and Boost 1.64 to CGAL-5.3 and Boost 1.77, and have also switched from 32-bit to 64-bit builds. We are only using the 2D geometry.
We have noticed a significant degradation in performance, possibly related to the use of the Boost pool allocator.
Any helpful suggestions on where we should look to get the performance back to where it was?

Related

Does high APC fragmentation matter?

I'm seeing a high amount of fragmentation on APC (>80%) but performance actually seems pretty good. I've read another post that advises disabling object caching in wordpress / w3tc, but I wonder if the reduction in fragmentation is better than the performance boost of caching the objects in the first place.
A fragmented APC is still several times better than no APC, so please don't deactivate it.
Increase your memory instead. With more memory APC will fragment a lot less, which is healthier for APC itself.
APC itself has no "defragmentation" process. You could restart your HTTP service or call apc_clear_cache() from a PHP script, but beware of the performance impact for the next few minutes while your cache is rebuilt.
Fragmentation on disk-based systems matters because the head physically has to move to each location to read it. The APC cache, though, is by definition in random-access memory, so the penalty for reading a different location is on the order of a couple of CPU cycles, i.e. negligible unless you're seriously loading the CPU. And if you're doing that, then you have bigger problems.
Also don't assign too much RAM to APC. You really want 5-10% more than the maximum possible cache. Any more is a waste of precious RAM.
I think it is misleading to put fragmentation as a metric on the APC monitor page as it's just not that important and people worry unduly. Running with highly fragmented APC is orders of magnitude better than running without it at all.

Does CGBitmapContextCreate() have a size limit?

I want to make an extremely large bitmap (250,000 pixels on each side, to be eventually written out as BigTIFF). I don't see a memory-size or dimension limit anywhere in the docs; can Core Graphics handle it?
CG is not designed for that kind of workload.
(I'd be surprised if you found any general-purpose graphics framework that is, frankly. If you're pushing images that big, you're going to have to write your own code to get anything done in a reasonable amount of time.)
In my experience, images started to fail once dimensions got over 32767 or so. Not in any organized way, just crashes and hard-to-repro failures; certain parts of the API would work, others wouldn't. Things may be better in 64-bit but I wouldn't count on it.

Does VB.NET offer any performance improvement over VB6 for CPU-bound processes?

I'm working on a mathematical model written in VB6. The amount of CPU time this model is consuming is becoming a concern to some of our customers and the notion has been floated that porting it to VB.NET will improve its performance.
The model is performing a lot of single-precision arithmetic (a finite-difference scheme over a large grid) with small bursts of database access every five seconds or so (not enough to be important). Only basic arithmetic functions with occasional use of the ^ 4 operator are involved.
Does anyone think porting to VB.NET is likely to improve matters (or not)? Does anyone know any reliable articles or papers I can check over to help with this decision?
My opinion is that VB.NET won't improve performance much by itself. The improvement comes from your ability to write an optimized algorithm.
Probably the best performance boost you can get is eliminating the DB access (even if it doesn't look important, I/O is usually the bottleneck, not the language itself). If possible, get the data upfront and save it at the end instead of accessing it every five seconds.
Also, as others pointed out, change the algorithm if possible, since porting the code to .NET will probably only get you small performance benefits.
But if you move to .NET 4.0 you may be able to use the Parallel Extensions and get a real boost from multiple cores: http://msdn.microsoft.com/en-us/library/dd460693.aspx (though that also means changing the algorithm).
Hope it helps. ;-)
I think improvements in memory management also help performance in VB.NET.
To give you a definitive answer we would have to check your code...
But in theory VB.NET should be more performant:
the possibility to compile for 64-bit machines (without too much effort)
VB6 was often run as interpreted p-code, while VB.NET is JIT-compiled to native code
you can use threads (depending on your algorithm) and other "tricks", so you can use more CPUs to run calculations in parallel
Best thing to try: port the most CPU consuming part of your application to VB.NET and compare.
The same algorithm will perform faster in VB6, because it is compiled to native code.
If the program does extensive memory allocation, though, it may perform faster in .NET when running in a 64-bit environment.

TPC or other DB benchmarks for SSD drives

I have been interested in SSD drives for quite some time. I do a lot of work with databases, and I've been keen to find benchmarks such as TPC-H run with and without SSD drives.
On the face of it, it sounds like there should be one, but unfortunately I have not been able to find one. The closest I've found to an answer was the first comment on this blog post:
http://dcsblog.burtongroup.com/data_center_strategies/2008/11/intels-enterprise-ssd-performance.html
The fellow who wrote it seemed to be a pretty big naysayer when it came to SSD technology in the enterprise, due to a claim of lack of performance with mixed read/write workloads.
There have been other benchmarks, such as this and this, that show absolutely ridiculous numbers. While I don't doubt them, I am curious whether what the commenter in the first link said was in fact true.
Anyways, if anybody can find benchmarks done with DBs on SSDs that would be excellent.
I've been testing and using them for a while, and whilst I have my own opinions (which are very positive), I think Anandtech.com's testing document is far better than anything I could have written; see what you think:
http://www.anandtech.com/show/2739
Regards,
Phil.
The issue with SSDs is that they only make real sense when the schema is normalized to 3NF or 5NF, removing "all" redundant data. Moving a "denormalized for speed" mess to SSD will not be fruitful; the mass of redundant data will make SSD too cost-prohibitive.
Doing that for an existing application means redefining the existing table (references) as views, encapsulating the normalized tables behind the curtain. There is a time penalty on the engine's CPU to synthesize rows. The more denormalized the original schema, the greater the benefit of refactoring before moving to SSD. Even on SSD, these denormalized schemas will likely run slower, due to the mass of data which must be retrieved and written.
Putting logs on SSD is not indicated; logging is a sequential, write-mostly (write-only under normal circumstances) operation, and the physics of flash-based SSDs make it a poor fit (a company named Texas Memory Systems has long built RAM-based sub-systems, which are a different matter). Conventional spinning-rust drives, duly buffered, will do fine.
Note the anandtech articles; the Intel drive was the only one which worked right. That will likely change by the end of 2009, but as of now only the Intel drives qualify for serious use.
I've been running a fairly large SQL2008 database on SSDs for 9 months now. (600GB, over 1 billion rows, 500 transactions per second). I would say that most SSD drives that I tested are too slow for this kind of use. But if you go with the upper end Intels, and carefully pick your RAID configuration, the results will be awesome. We're talking 20,000+ random read/writes per second. In my experience, you get the best results if you stick with RAID1.
I can't wait for Intel to ship the 320GB SSDs! They are expected to hit the market in September 2009...
The formal TPC benchmarks will probably take a while to appear using SSDs, because there are two parts to a TPC result: the speed (transactions per unit time) and the cost per transaction per unit time. With the high speed of SSDs you have to scale the size of the DB even larger, thus using more SSD and thus costing more. So even though you might get superb speed, the cost is still prohibitive for a fully-scaled (auditable, publishable) TPC benchmark, and that will remain true for a while yet (as in, a few years), while SSD stays more expensive than the corresponding quantity of spinning disk.
Commenting on:
"...quite interested to find benchmarks such as TPC-H performed with and without SSD drives."
(FYI and full disclosure, I am pseudonymously "J Scouter", the "pretty big naysayer when it came to SSD technology in the enterprise" referred to and linked above.)
So....here's the first clue to emerge.
Dell and Fusion-IO have published the first EVER audited benchmark using a Flash-memory device for storage.
The benchmark is the TPC-H, which is a "decision support" benchmark. This is important because TPC-H entails an almost exclusively "read-only" workload pattern -- perfect context for SSD as it completely avoids the write performance problem.
In the scenarios painted for us by the Flash SSD hypesters, this application represents a soft-pitch, a gentle lob right over the plate and an easy "home run" for a Flash-SSD database application.
The results? The very first audited benchmark for a flash SSD based database application, and a READ ONLY one at that resulted in (drum roll here)....a fifth place finish among comparable (100GB) systems tested.
This Flash SSD system produced about 30% as many Queries-per-hour as a disk-based system result published by Sun...in 2007.
Surely though it will be in price/performance that this Flash-based system will win, right?
At $1.46 per Query-per-hour, the Dell/Fusion-IO system finishes in third place. More than twice the cost-per-query-per-hour of the best cost/performance disk-based system.
And again, remember this is for TPC-H, a virtually "read-only" application.
This is pretty much exactly in line with what the MS Cambridge Research team discovered over a year ago: that there are no enterprise workloads where Flash makes ROI sense from economic or energy standpoints.
I can't wait to see TPC-C, TPC-E, or SPC-1, but according to the research paper linked above, SSDs will need to become orders of magnitude cheaper before they ever make sense in enterprise apps.

How much speed-up from converting 3D maths to SSE or other SIMD?

I am using 3D maths in my application extensively. How much speed-up can I achieve by converting my vector/matrix library to SSE, AltiVec or a similar SIMD code?
In my experience I typically see about a 3x improvement in taking an algorithm from x87 to SSE, and a better than 5x improvement in going to VMX/Altivec (because of complicated issues having to do with pipeline depth, scheduling, etc). But I usually only do this in cases where I have hundreds or thousands of numbers to operate on, not for those where I'm doing one vector at a time ad hoc.
That's not the whole story, but it's possible to get further optimizations using SIMD. Have a look at Miguel's presentation on implementing SIMD instructions in Mono, which he gave at PDC 2008:
(Picture from Miguel's blog entry; source: tirania.org)
For some very rough numbers: I've heard some people on ompf.org claim 10x speed ups for some hand-optimized ray tracing routines. I've also had some good speed ups. I estimate I got somewhere between 2x and 6x on my routines depending on the problem, and many of these had a couple of unnecessary stores and loads. If you have a huge amount of branching in your code, forget about it, but for problems that are naturally data-parallel you can do quite well.
However, I should add that your algorithms should be designed for data-parallel execution.
This means that if you have a generic math library as you've mentioned then it should take packed vectors rather than individual vectors or you'll just be wasting your time.
E.g. something like:
#include <xmmintrin.h>  // SSE intrinsics (x86/x86-64 only)

namespace SIMD {
class PackedVec4d
{
    // Structure-of-arrays: each __m128 holds one component of four vectors,
    // so a single SSE instruction operates on four vectors at once.
    __m128 x;
    __m128 y;
    __m128 z;
    __m128 w;
    //...
};
}
Most problems where performance matters can be parallelized since you'll most likely be working with a large dataset. Your problem sounds like a case of premature optimization to me.
For 3D operations, beware of uninitialized data in your W component. I've seen cases where SSE ops (_mm_add_ps) would take 10x the normal time because of bad data (e.g. NaNs or denormals) in W.
The answer highly depends on what the library is doing and how it is used.
The gains can range from a few percentage points to "several times faster"; the areas most likely to see gains are those where you're dealing not with isolated vectors or values, but with many vectors or values that have to be processed in the same way.
Another area is when you're hitting cache or memory limits, which, again, requires a lot of values/vectors to be processed.
The domains where the gains can be most drastic are probably image and signal processing, computational simulations, and general 3D maths operations on meshes (rather than isolated vectors).
These days all the good compilers for x86 generate SSE instructions for SP and DP float math by default. It's nearly always faster to use these instructions than the native ones, even for scalar operations, so long as you schedule them correctly. This will come as a surprise to many who in the past found SSE to be "slow" and thought compilers could not generate fast SSE scalar instructions. But now you have to use a switch to turn off SSE generation and use x87. Note that x87 is effectively deprecated at this point and may be removed from future processors entirely. The one downside is that we may lose the ability to do 80-bit DP float in registers. But the consensus seems to be that if you are depending on 80-bit instead of 64-bit DP floats for precision, you should look for a more precision-loss-tolerant algorithm.
Everything above came as a complete surprise to me. It's very counter intuitive. But data talks.
Most likely you will see only very small speedup, if any, and the process will be more complicated than expected. For more details see The Ubiquitous SSE vector class article by Fabian Giesen.
The Ubiquitous SSE vector class: Debunking a common myth
Not that important
First and foremost, your vector class is probably not as important for the performance of your program as you think (and if it is, it's more likely because you're doing something wrong than because the computations are inefficient). Don't get me wrong, it's probably going to be one of the most frequently used classes in your whole program, at least when doing 3D graphics. But just because vector operations will be common doesn't automatically mean that they'll dominate the execution time of your program.
(The article's remaining section headings: Not so hot, Not easy, Not now, Not ever.)