What would be the optimal number of buckets in a fixed-size hash table using separate chaining, initialized with a known number N of entries? - optimization

The HT does not rehash.
We use the simple division method as the hash function.
We assume the hash function distributes the entries evenly.
The goal is to have O(1) insertion, deletion and find.

The optimal number of buckets is a compromise between memory consumption and hash collisions, for intended usage patterns.
For example, if something is very frequently used, you might limit the size of the hash table to half the size of a CPU's cache to reduce the chance of a cache miss when accessing the hash table; this can be faster than using a larger hash table (which has worse cache misses but a lower chance of hash collisions). Alternatively, if it's used infrequently (and you therefore expect cache misses regardless of hash table size), then a larger size is more likely to be optimal.
Of course real systems have multiple caches (L1, L2, L3) plus virtual memory translation caches (TLBs) plus RAM limits (plus swap space limits); real software has more than just one hash table competing for resources in the memory hierarchy; and often the software developers have no idea what other processes might be running (competing for physical RAM, polluting caches, etc) or what any end user's hardware is (sizes of caches, etc). All of this makes it virtually impossible to determine "optimal" with any method (including extensive benchmarking).
The only practical option is to take an educated guess based on various assumptions (about usage, the amount of data and how good the hashing function will be in practice, the CPU, the other things that might be using CPUs and memory, ...); and make the source code configurable (e.g. #define HASH_TABLE_SIZE ..) so you can easily re-assess the guess later.
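To make that educated guess concrete, here is a minimal Java sketch (the class name, the 0.75 target load factor and the use of Object.hashCode() are illustrative assumptions, not a prescription): it picks the bucket count once, up front, from the known N and a target load factor, and uses the division method for indexing.

import java.util.LinkedList;

/** Minimal fixed-size separate-chaining hash table (illustrative sketch, no rehashing). */
public class FixedChainedTable<K, V> {
    // Target average chain length; ~0.75 trades memory consumption against collisions.
    private static final double TARGET_LOAD_FACTOR = 0.75;

    private final LinkedList<Entry<K, V>>[] buckets;

    @SuppressWarnings("unchecked")
    public FixedChainedTable(int expectedEntries) {
        // Choose the bucket count once: N divided by the target load factor.
        int bucketCount = Math.max(1, (int) Math.ceil(expectedEntries / TARGET_LOAD_FACTOR));
        buckets = new LinkedList[bucketCount];
    }

    // Division method: index = |hash| mod number_of_buckets.
    private int indexFor(K key) {
        return Math.floorMod(key.hashCode(), buckets.length);
    }

    public void put(K key, V value) {
        int i = indexFor(key);
        if (buckets[i] == null) {
            buckets[i] = new LinkedList<>();
        }
        for (Entry<K, V> e : buckets[i]) {
            if (e.key.equals(key)) {
                e.value = value; // overwrite existing key
                return;
            }
        }
        buckets[i].add(new Entry<>(key, value));
    }

    public V get(K key) {
        int i = indexFor(key);
        if (buckets[i] == null) {
            return null;
        }
        for (Entry<K, V> e : buckets[i]) {
            if (e.key.equals(key)) {
                return e.value;
            }
        }
        return null;
    }

    private static final class Entry<K, V> {
        final K key;
        V value;
        Entry(K key, V value) { this.key = key; this.value = value; }
    }
}

With an evenly distributing hash function and a load factor around 1 or below, the average chain length stays constant, so insertion, deletion and find remain expected O(1); the 0.75 constant plays the same role as the configurable HASH_TABLE_SIZE above and is exactly the guess you would re-assess later.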

Related

Optaplanner - large datasets with millions of rows

There are a couple of threads discussing the scalability of OptaPlanner, and I am wondering what the recommended approach is for dealing with very large datasets, i.e. millions of rows?
As the blog post below discusses, I am already using heuristics (Simulated Annealing + Tabu Search). The search space of the cloud balancing problem is c^p, but the feasible space is unknown/NP-complete.
http://www.optaplanner.org/blog/2014/03/27/IsTheSearchSpaceOfAnOptimizationProblemReallyThatBig.html
The problem I am trying to solve is similar to cloud balancing. But the main difference is in the input data: besides a list of computers and a list of processes, there is also a big two-dimensional 'score list/table' which holds the score for each possible combination and needs to be loaded into memory.
In other words, besides the constraints between computers and processes that the planning needs to satisfy, different valid combinations yield different scores, and the higher the score the better.
It's a simple problem, but when it comes to hundreds of computers, 100k+ processes and a score table with a million+ combinations, it needs a lot of memory. Even though I could allocate more memory to increase the heap size, the planning could become very slow and struggle, as the steps are sorted with custom planning variable/entity comparator classes.
A straightforward solution is to divide the dataset into smaller subsets, run each of them individually and then combine the results, so that I could have multiple machines running at the same time, each on multiple threads. The biggest drawback of this approach is that the result produced is far from optimal.
I am wondering whether there are any other, better solutions?
The MachineReassignment example also has a big "score combination" matrix. OptaPlanner doesn't care about that - those are just problem facts and the DRL quickly matches the combination(s) that is picked for an assignment. The Solver.solve() causes no big memory consumption or performance impact.
However, loading the problem in your code (before calling Solver.solve()) does cause a huge memory consumption: understand that if n = 20k, then n² = 400m, and an int takes up 4 bytes, so for 20,000 elements that matrix is 1.6 GB in its most efficient uncompressed form, int[][] (both in Java and C++!). So for 20k reserve 2 GB RAM, for 40k reserve 8 GB RAM, for 80k reserve 32 GB RAM. That scales badly.
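As a back-of-the-envelope check of those numbers (plain arithmetic, no OptaPlanner API involved), a small sketch:

/** Back-of-the-envelope memory estimate for an n-by-n int score matrix. */
public class ScoreMatrixMemory {
    public static void main(String[] args) {
        long[] sizes = {20_000L, 40_000L, 80_000L};
        for (long n : sizes) {
            long bytes = n * n * 4L; // n^2 cells, 4 bytes per int
            System.out.printf("n = %,d -> %,d bytes (~%.1f GB) for int[n][n]%n",
                    n, bytes, bytes / 1_000_000_000.0);
        }
        // Raw data alone: 20k -> 1.6 GB, 40k -> 6.4 GB, 80k -> 25.6 GB.
        // Doubling n quadruples the memory; the "reserve 2/8/32 GB" figures above
        // add headroom for array overhead and the rest of the heap.
    }
}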
As for dealing with these big problems, I use combinations of techniques such as Nearby Selection (see my blog article on that), Partitioned Search (what you described; it will be supported out of the box in 7, but I've implemented it for customers in a CustomPhase), Limited Selection Construction Heuristics (need to research that further, the plumbing is there, usually overkill), ... Partitioned Search does indeed exclude optimal solutions, but above 10k planning entities the quality vs. time trade-off clearly favors Partitioned Search given a reasonable solving time (minutes/hours/days instead of millennia). The trick is to keep the size of each partition big enough, above 1k entities (hence the use of Nearby Selection). Score calculation speed also matters a lot, of course.

Optimal hash load factor for optimizing memory usage

I'm having a hard time reasoning about persistence in relation to hash table memory optimizations.
From what I understand, the way hashes are stored can be fine-tuned to allow for efficient memory usage using the hash-max-zipmap-entries and hash-max-zipmap-value configs. Below the threshold, values are basically "serialized" and scanned linearly. Above the threshold, they get converted into real hash tables.
So with this in mind, what we really want is to tune those two settings so that we trade a little space for a little more CPU time, while still maintaining O(1) access times without incurring noticeably high CPU usage.
Is there a rule of thumb for the number of "buckets" to choose if growth is unbounded? Is there an optimum load factor that we should be shooting for to take advantage of the memory optimizations? If the load factor becomes too high, would it be advisable to create more buckets and manually rehash all the hashtable entries to maintain an optimal load factor?
Any insight/advice would be appreciated!
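Not a Redis-specific answer, but to illustrate the last idea in the question (creating more buckets and manually rehashing once the load factor gets too high), here is a generic Java sketch; the 0.75 threshold, the doubling policy and all names are assumptions made purely for illustration.

import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch: grow the bucket array and rehash when the load factor exceeds a threshold. */
public class RehashingTable<K, V> {
    private static final double MAX_LOAD_FACTOR = 0.75; // assumed threshold, tune for your workload

    private List<Entry<K, V>>[] buckets = newBucketArray(16);
    private int size = 0;

    public void put(K key, V value) {
        if ((double) (size + 1) / buckets.length > MAX_LOAD_FACTOR) {
            rehash(buckets.length * 2); // more buckets -> shorter chains -> O(1) access again
        }
        insert(buckets, key, value); // duplicate-key handling omitted for brevity
        size++;
    }

    private void insert(List<Entry<K, V>>[] target, K key, V value) {
        int i = Math.floorMod(key.hashCode(), target.length);
        if (target[i] == null) {
            target[i] = new ArrayList<>();
        }
        target[i].add(new Entry<>(key, value));
    }

    private void rehash(int newBucketCount) {
        List<Entry<K, V>>[] bigger = newBucketArray(newBucketCount);
        for (List<Entry<K, V>> chain : buckets) {
            if (chain == null) {
                continue;
            }
            for (Entry<K, V> e : chain) {
                insert(bigger, e.key, e.value); // every entry is re-bucketed
            }
        }
        buckets = bigger;
    }

    @SuppressWarnings("unchecked")
    private static <K, V> List<Entry<K, V>>[] newBucketArray(int n) {
        return new List[n];
    }

    private static final class Entry<K, V> {
        final K key;
        final V value;
        Entry(K key, V value) { this.key = key; this.value = value; }
    }
}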

SFV/CRC32 checksum good and fast enough to check for common backup files?

I have 3 terabytes, more than 300,000 reference files of all sizes (20, 30, 40, 200 megabytes each), and I back them up regularly (not zipped). A few months ago, I lost some files, probably due to data degradation (I had "backed up" damaged files without noticing).
I do not care about security, so do not need MD5, SHA, etc. I just want to be assured that the files I'm copying are good (the same bits and bytes) and verify that backups are intact after a few months before making backups again.
Therefore, my needs are basic because the files are not very important and there is no need for security (no sensitive information).
My question: is the SFV/CRC32 format/method good and fast enough for my needs? Is there something better and faster than that? I'm using the program ExactFile.
Is there any checksum faster than SFV/CRC32 that is not flawed? I tried using MD5, but it is slow, and since I do not need data security, I preferred SFV/CRC32. Still, it's painful, because there are more than 300,000 files and it takes hours to checksum all of them, even with an 8-core (HT) Xeon CPU and a fast HDD.
From the point of view of data integrity, is there any advantage in joining all the files into one .ZIP or .RAR instead of leaving them "loose" in folders and files?
Some tips?
Thanks!
If you could quantify "few" and "some" in "A few months ago, I lost some files" (where "few" would be considered to be replaced with "every few" in order to get a rate), then you could calculate the probability of a false positive. However just from those words, I would say, yes, a 32-bit CRC should be fine for your application.
As for speed, if you have a recent Intel processor, you likely have a CRC-32C instruction, which can make the calculation much faster, by about a factor of 15. (See this answer for some code.) That could be made faster still by running it over multiple cores. If done right, you should be limited by the I/O, not the calculation.
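For illustration, here is a minimal Java sketch of a streamed file checksum using java.util.zip.CRC32C (Java 9+), which the JVM can map to the hardware CRC-32C instruction where it is available; the 1 MiB buffer size is an arbitrary choice.

import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.CRC32C;

/** Sketch: CRC-32C of a file, streamed in chunks so large files never need to fit in RAM. */
public class FileCrc32c {
    public static long checksum(Path file) throws IOException {
        CRC32C crc = new CRC32C();
        byte[] buffer = new byte[1 << 20]; // 1 MiB read buffer (arbitrary)
        try (InputStream in = Files.newInputStream(file)) {
            int n;
            while ((n = in.read(buffer)) != -1) {
                crc.update(buffer, 0, n);
            }
        }
        return crc.getValue();
    }

    public static void main(String[] args) throws IOException {
        for (String arg : args) {
            System.out.printf("%08x  %s%n", checksum(Path.of(arg)), arg);
        }
    }
}

Hashing several files on different cores only helps as long as the drive keeps up, which is exactly the "limited by the I/O" point above.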
There is no advantage in this case to bundling them in a zip or rar. In fact it may be worse, if a corruption of that one file causes you to lose everything.
If you aren't getting a throughput of at least 250 MB per second per core then you're probably I/O or memory-speed bound. The raw hashing speed of CRC32 and MD5 is higher than that, even on decades-old hardware, assuming a non-sucky reasonably optimised implementation.
Have a look at the Crypto++ benchmark, which includes a wealth of other hash algorithms as well.
The Castagnoli CRC32 (CRC-32C) can be faster than standard CRC32 or MD5 because newer CPUs have a special instruction for it; with that instruction and oodles of supporting code (for hashing three streams in parallel, stitching together partial results with a bit of linear algebra, etc.) you can speed up the hashing to about 1 cycle/dword. AES-based hashes are also lightning fast on recent CPUs, thanks to the special AES instructions.
However, in the end it doesn't matter how fast the hash function is if it's just waiting for data to be read; especially on a multicore machine you're almost always I/O bound in applications like this, unless you're getting sabotaged by small caches and the latencies of deep memory-cache hierarchies.
I'd stick with MD5, which is no slower than CRC32 and universally available, even on the oldest of machines, in pretty much every programming system/language ever invented. Don't think of it as a 'cryptographically secure hash' (which it isn't, not anymore) but as some kind of CRC128 that's just as fast as CRC32 but requires some 2^64 hashings for a collision to become likely, instead of only a few tens of thousands as in the case of CRC32.
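To show how little code the MD5 route takes, a sketch using the standard java.security.MessageDigest API, again treating MD5 purely as a long checksum rather than anything security-related; the buffer size is arbitrary.

import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Sketch: MD5 digest of a file, streamed in chunks. */
public class FileMd5 {
    public static String md5Hex(Path file) throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("MD5");
        byte[] buffer = new byte[1 << 20]; // 1 MiB read buffer (arbitrary)
        try (InputStream in = Files.newInputStream(file)) {
            int n;
            while ((n = in.read(buffer)) != -1) {
                md.update(buffer, 0, n);
            }
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : md.digest()) {
            hex.append(String.format("%02x", b)); // 128-bit digest as 32 hex characters
        }
        return hex.toString();
    }

    public static void main(String[] args) throws Exception {
        for (String arg : args) {
            System.out.println(md5Hex(Path.of(arg)) + "  " + arg);
        }
    }
}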
If you want to roll some custom code then CRCs do have some merit: the CRC of a file can be computed by combining the CRCs of sub blocks with a bit of linear algebra. With general hashes like MD5 that's not possible (but you can always process multiple files in parallel instead).
There are oodles of ready-made programs for computing MD5 hashes for files and directories fast. I'd recommend the 'deep' versions of md5sum + cousins: md5deep and hashdeep which you can find on SourceForge and on GitHub.
Darth Gizka, thanks for the tips. I'm now using the 64-bit md5deep you pointed to. It's very good. I used to use ExactFile, which stopped being updated in 2010 and is still 32-bit (no 64-bit version). I did a quick comparison between the two. ExactFile was faster at creating the MD5 digest, but at verifying against the digest, md5deep64 was much faster.
My problem is the HDD, as you said. For backup and storage, I use three Seagates with 2 TB each (7200 rpm, 64 MB cache). With an SSD the procedure would be much faster, but with terabytes of files it is very difficult to use SSDs.
A few days ago, I ran the procedure on part of the archive: 1 TB (about 170,000 files). ExactFile took about six hours to create the SFV/CRC32 digest. I used one of my newer machines, equipped with an i7 4770K (which has the CRC32 instruction built in, 8 cores - four real and four virtual), a Gigabyte Z87X-UD4H motherboard and 16 GB of RAM.
Throughout the file calculations, the CPU cores were almost idle (3% to 4%, maximum 20%). The HDD was at 100% utilization, yet only a fraction of its (SATA 3) speed was reached: most of the time 70 MB/s, sometimes dropping to 30 MB/s, depending on the number of files being processed and the antivirus in the background (which I disabled later, as I often do when copying large numbers of files).
Now I am testing a copy program that uses binary file comparison. Anyway, I will keep using MD5 digests. Grateful for the information; any tip is welcome.

Why is Sesame limited to, let's say, 150m triples?

I wouldn't exactly say it is limited, but as far as I can see the recommendations given are of the sort "if you need to go beyond that you can change the backend store...". Why? Why is Sesame not as efficient as, let's say, OWLIM or AllegroGraph when it goes beyond 150-200m triples? What optimizations are implemented in order to go that big? Are the underlying data structures different?
Answered here by Jeen Broekstra:
http://answers.semanticweb.com/questions/21881/why-is-sesame-limited-to-lets-say-150m-triples
The actual values that make up RDF statements (that is, the subjects, predicates, and objects) are indexed in a relatively simple hash, mapping integer ids to actual data values. This index does a lot of in-memory caching to speed up lookups, but as the size of the store increases, the probability (during insertion or lookup) that a value is not present in the cache and needs to be retrieved from disk increases, and in addition the on-disk lookup itself becomes more expensive as the size of the hash increases.
Data retrieval in the native store has been balanced to make optimal use of the file system page size, to maximize retrieval speed of B-tree nodes. This optimization relies on consecutive lookups reusing the same data block, so that the OS-level page cache can be reused. This heuristic starts failing more often as transaction sizes (and therefore B-trees) grow, however.
As B-trees grow in size, the chances of large cascading splits increase.
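The first point above (an integer-id-to-value dictionary with an in-memory cache in front of on-disk storage, which degrades once the store outgrows the cache) can be illustrated with a generic Java sketch; this is not Sesame's actual code, and the cache capacity and the readFromDisk placeholder are invented for illustration.

import java.util.LinkedHashMap;
import java.util.Map;

/** Generic sketch: int id -> value dictionary with a bounded LRU cache in front of disk lookups. */
public class ValueDictionarySketch {
    private static final int CACHE_CAPACITY = 100_000; // assumed cache size

    // LinkedHashMap in access order evicts the least recently used entry once full.
    private final Map<Integer, String> cache =
            new LinkedHashMap<>(CACHE_CAPACITY, 0.75f, true) {
                @Override
                protected boolean removeEldestEntry(Map.Entry<Integer, String> eldest) {
                    return size() > CACHE_CAPACITY;
                }
            };

    public String lookup(int id) {
        String value = cache.get(id);
        if (value == null) {
            // Cache miss: the larger the store grows relative to the cache,
            // the more often lookups take this slow, disk-bound path.
            value = readFromDisk(id);
            cache.put(id, value);
        }
        return value;
    }

    // Placeholder for the expensive on-disk lookup (hypothetical).
    private String readFromDisk(int id) {
        return "value-" + id;
    }
}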

What happens when maxing out Postgres' work_mem?

How does the work_mem option in Postgres work? Here's the description from http://www.postgresql.org/docs/8.4/static/runtime-config-resource.html:
Specifies the amount of memory to be used by internal sort operations and hash tables before switching to temporary disk files. The value defaults to one megabyte (1MB). Note that for a complex query, several sort or hash operations might be running in parallel; each one will be allowed to use as much memory as this value specifies before it starts to put data into temporary files. Also, several running sessions could be doing such operations concurrently. So the total memory used could be many times the value of work_mem; it is necessary to keep this fact in mind when choosing the value. Sort operations are used for ORDER BY, DISTINCT, and merge joins. Hash tables are used in hash joins, hash-based aggregation, and hash-based processing of IN subqueries.
I'm probably totally wrong here, but... isn't "switching to temporary disk files" essentially the same thing as "virtual memory" in the operating system? Wouldn't the OS just create a swap file once the RAM is gone? Wouldn't it be better to set this to something like 100TB and let the OS figure it out? Before I potentially mess up my system, I want to check whether anyone has actually tried this approach.
PostgreSQL will, for example, switch to a sorting method better suited to on-disk sorting than to in-memory sorting if it knows the sort will happen on disk - which it cannot know if the sorting happens in swap.
Also, PostgreSQL can switch to a completely different plan (for example, using a different JOIN method) if it figures out the data does not fit in RAM.
Setting work_mem too high will get you a very slow database as soon as you have enough data so that everything doesn't always fit in RAM anymore.
Keep in mind that work_mem is the maximum amount of RAM that can be used by each single sort operation. For a single query, multiple sort operations might run in parallel, and there might be multiple connections querying the database at once. For that reason, all sort operations combined may use many times the amount of work_mem in RAM (which is why a conservative value is recommended).
Now back to your question: if you set work_mem to such a high value, sort operations might use up most of your RAM, which leads to paging in and out of swap (keep in mind that there are lots of other processes and PostgreSQL parts that need some, or even lots of, RAM). Disk-based sort operations are by factors more efficient than page swaps done by the OS. As some of the other replies pointed out, a database server that constantly swaps in and out will perform extremely slowly.
Another point is that with such a high work_mem value, a single query (on purpose or by accident) might more or less make the whole database server unresponsive.
A database server that swaps is a dead database server.
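One common middle ground, not taken from the answers above but consistent with them: keep the server-wide work_mem conservative and raise it only for the session (or transaction) that runs a known big sort. A sketch over JDBC; the connection details, the 256MB value and the orders query are placeholders.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

/** Sketch: raise work_mem for a single session instead of globally. */
public class PerSessionWorkMem {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:postgresql://localhost:5432/mydb"; // placeholder connection details
        try (Connection conn = DriverManager.getConnection(url, "user", "password");
             Statement st = conn.createStatement()) {

            // Applies only to this session; the server-wide default stays conservative.
            st.execute("SET work_mem = '256MB'");

            // Hypothetical big sort/aggregation that benefits from the larger in-memory work area.
            try (ResultSet rs = st.executeQuery(
                    "SELECT customer_id, SUM(amount) FROM orders GROUP BY customer_id ORDER BY 2 DESC")) {
                while (rs.next()) {
                    System.out.println(rs.getLong(1) + " " + rs.getBigDecimal(2));
                }
            }

            // Back to the default for anything else this session runs.
            st.execute("RESET work_mem");
        }
    }
}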
In RAM Postgres uses quicksort; on disk it uses another algorithm that is much better suited to hard disks. Using quicksort on swapped-out memory would be incredibly slow.
The OS is generic in how it handles swap; besides, there's a finite amount of address space a process can use, which isn't that big on 32-bit systems (2 GB on a 32-bit Windows platform, extendable to 3 GB). But you're right, you could let the OS handle this through virtual memory.
PostgreSQL is not 'generic': it knows much better than the OS how to structure data once disk access is involved, so letting the database switch over to explicit file handling once memory is exhausted has benefits over letting the OS handle it.