SFV/CRC32 checksum good and fast enough to check for common backup files? - backup

I have 3 terabytes, more than 300,000 reference files of all sizes (20, 30, 40, 200 MB each), and I back them up regularly (not zipped). A few months ago I lost some files, probably due to data degradation: I had backed up already-damaged files without noticing.
I do not care about security, so I do not need MD5, SHA, etc. I just want to be assured that the files I'm copying are good (the same bits and bytes), and to verify that the backups are still intact after a few months, before making backups again.
So my needs are basic: the files are not very important and there is no need for security (no sensitive information).
My question: is the SFV/CRC32 format/method good and fast enough for my needs? Is there something better and faster? I'm using the program ExactFile.
Is there any checksum faster than SFV/CRC32 that is not flawed? I tried MD5 but it is slow, and since I do not need data security I preferred SFV/CRC32. Still, it's painful: there are more than 300,000 files and it takes hours to checksum all of them, even with an 8-core Xeon CPU with HT and a fast HDD.
From the point of view of data integrity, is there any advantage in joining all the files into one .ZIP or .RAR instead of leaving them "loose" in folders?
Any tips?
Thanks!

If you could quantify "few" and "some" in "A few months ago, I lost some files" (where "few" would be considered to be replaced with "every few" in order to get a rate), then you could calculate the probability of a false positive. However just from those words, I would say, yes, a 32-bit CRC should be fine for your application.
As for speed, if you have a recent Intel processor, you likely have a CRC-32C instruction, which can make the calculation much faster, by about a factor of 15. (See this answer for some code.) That could be made faster still by running it over multiple cores. If done right, you should be limited by the I/O, not the calculation.
There is no advantage in this case to bundling them in a zip or rar. In fact it may be worse, if a corruption of that one file causes you to lose everything.
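As a rough illustration of per-file checksumming, here is a minimal Python sketch that reads a file in chunks and keeps a running CRC. Note the assumptions: it uses the standard CRC-32 from Python's zlib module (the same polynomial SFV files use), not the hardware-accelerated CRC-32C mentioned above, and the function name and 1 MiB chunk size are just illustrative choices.

```python
import zlib

def file_crc32(path, chunk_size=1 << 20):
    """Return the standard CRC-32 of a file as an 8-digit hex string.

    The file is read in chunks so memory use stays constant
    regardless of file size.
    """
    crc = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            # zlib.crc32 accepts the previous value to continue a running CRC.
            crc = zlib.crc32(chunk, crc)
    return format(crc & 0xFFFFFFFF, "08X")
```

On a spinning disk a loop like this will most likely spend nearly all of its time waiting in f.read, which matches the point about being limited by I/O.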

If you aren't getting a throughput of at least 250 MB per second per core then you're probably I/O or memory-speed bound. The raw hashing speed of CRC32 and MD5 is higher than that, even on decades-old hardware, assuming a non-sucky reasonably optimised implementation.
Have a look at the Crypto++ benchmark, which includes a wealth of other hash algorithms as well.
The Castagnoli CRC32 can be faster than standard CRC32 or MD5 because newer CPUs have a special instruction for it; with that instruction and oodles of supporting code (for hashing three streams in parallel, stitching together partial results with a bit of linear algebra, etc. pp.) you can speed up the hashing to about 1 cycle/dword. AES-based hashes are also lightning fast on recent CPUs, due to the special AES instructions.
However, in the end it doesn't matter how fast the hash function waits for data to be read; especially on a multicore machine you're almost always I/O bound in applications like this, unless you're getting sabotaged by small caches and the latencies of deep memory cache hierarchies.
I'd stick with MD5 which is no slower than CRC32 and universally available, even on the oldest of machines, in pretty much every programming system/language ever invented. Don't think of it as a 'cryptographically secure hash' (which it isn't, not anymore) but as some kind of CRC128 that's just as fast as CRC32 but that requires some 2^64 hashings for a collision to become likely, instead of only a few ten thousand as in the case of CRC32.
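The collision figures can be checked with the usual birthday approximation. The sketch below is only that approximation (it assumes hash values are uniformly distributed), and the function names are made up for illustration. Keep in mind that it measures the chance of two different files sharing a checksum; the chance that one randomly corrupted file still matches its own stored 32-bit checksum is a separate question, roughly 1 in 2^32.

```python
import math

def collision_probability(num_files, hash_bits):
    """Birthday approximation: chance that at least two of num_files
    distinct inputs end up with the same hash_bits-bit hash."""
    space = 2.0 ** hash_bits
    # 1 - exp(-n(n-1)/(2N)), written with expm1 to stay accurate for tiny values.
    return -math.expm1(-num_files * (num_files - 1) / (2.0 * space))

def files_for_even_odds(hash_bits):
    """Approximate number of inputs at which a collision is about 50% likely."""
    return math.sqrt(2.0 * math.log(2.0) * 2.0 ** hash_bits)

if __name__ == "__main__":
    print(round(files_for_even_odds(32)))       # tens of thousands for a 32-bit CRC
    print(files_for_even_odds(128))             # on the order of 2**64 for 128 bits
    print(collision_probability(300_000, 32))   # close to 1 for 300,000 files
    print(collision_probability(300_000, 128))  # vanishingly small
```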
If you want to roll some custom code then CRCs do have some merit: the CRC of a file can be computed by combining the CRCs of sub blocks with a bit of linear algebra. With general hashes like MD5 that's not possible (but you can always process multiple files in parallel instead).
There are oodles of ready-made programs for computing MD5 hashes for files and directories fast. I'd recommend the 'deep' versions of md5sum + cousins: md5deep and hashdeep which you can find on SourceForge and on GitHub.
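To show the general idea behind such tools (this is not how md5deep or hashdeep are implemented, just a sketch with hypothetical function names), a digest manifest can be built with Python's hashlib and os.walk and compared against a later run:

```python
import hashlib
import os

def md5_of_file(path, chunk_size=1 << 20):
    """Hash one file in chunks and return its MD5 hex digest."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def build_manifest(root):
    """Map each file's path (relative to root) to its MD5 hex digest."""
    manifest = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            manifest[os.path.relpath(full, root)] = md5_of_file(full)
    return manifest

def changed_files(old_manifest, new_manifest):
    """Return paths from the old manifest that are now missing or differ."""
    return sorted(p for p, d in old_manifest.items()
                  if new_manifest.get(p) != d)
```

Saving the first manifest (for example as JSON) and rebuilding it a few months later gives the list of files that silently changed in between.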

Darth Gizka, thanks for the tips. I'm now using the 64-bit md5deep you pointed to. It's very good. I used to use ExactFile, which stopped being updated in 2010 and is still 32-bit only. I did a quick comparison between the two: ExactFile was faster at creating the MD5 digest, but md5deep64 was much faster at comparing digests.
My problem is the HDD, as you said. For backup and storage I use three Seagate drives of 2 TB each (7200 rpm, 64 MB cache). With an SSD the procedure would be much faster, but with terabytes of files it is very difficult to use SSDs.
A few days ago I ran the procedure on part of the archive: 1 TB (about 170,000 files). ExactFile took about six hours to create the SFV/CRC32 digest. I used one of my newer machines, equipped with an i7 4770K (with built-in CRC32 instructions; 8 logical cores, four physical and four virtual), a Gigabyte Z87X-UD4H motherboard and 16 GB of RAM.
Throughout the calculation the CPU cores were almost unused (3% to 4%, 20% at most). The HDD was 100% busy, yet only a fraction of its SATA 3 speed was reached: most of the time 70 MB/s, sometimes dropping to 30 MB/s depending on the number of files being processed and on the antivirus running in the background (which I later disabled, as I often do when copying large numbers of files).
Now I am testing a copy program that uses binary file comparison. Anyway, I will keep using MD5 digests. Thanks for the information; any further tips are welcome.

Related

What would be the optimal amount of buckets in a fixed sized Hash Table using separate chaining and initialized with a known number N of entries?

The HT does not rehash.
We use a simple division method as Hash-function.
We assume the Hash-function is efficient at equally distributing the entries.
The goal is to have O(1) insertion, deletion and find.
The optimal number of buckets is a compromise between memory consumption and hash collisions, for intended usage patterns.
For example, if something is very frequently used you might limit the size of the hash table to half the size of a CPU's cache to reduce the chance of a cache miss when accessing the hash table; this can be faster than using a larger hash table (which has more cache misses but a lower chance of hash collisions). Alternatively, if it's used infrequently (and you therefore expect cache misses regardless of hash table size), then a larger size is more likely to be optimal.
Of course real systems have multiple caches (L1, L2, L3) plus virtual memory translation caches (TLBs) plus RAM limits (plus swap space limits); real software has more than just one hash table competing for resources in the memory hierarchy; and often the software developers have no idea what other processes might be running (competing for physical RAM, polluting caches, etc) or what any end user's hardware is (sizes of caches, etc). All of this makes it virtually impossible to determine "optimal" with any method (including extensive benchmarking).
The only practical option is to take an educated guess based on various assumptions (about usage, the amount of data and how good the hashing function will be in practice, the CPU, the other things that might be using CPUs and memory, ...); and make the source code configurable (e.g. #define HASH_TABLE_SIZE ..) so you can easily re-assess the guess later.
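A minimal sketch of that idea in Python: a separate-chaining table whose bucket count is a single configurable constant, plus a load-factor helper for re-assessing the guess. The class name, method names and the value 1024 are hypothetical; the division method here is simply hash(key) modulo the bucket count.

```python
HASH_TABLE_SIZE = 1024  # educated guess; change it and re-measure

class ChainedHashTable:
    """Fixed-size hash table with separate chaining (never rehashes)."""

    def __init__(self, num_buckets=HASH_TABLE_SIZE):
        self.buckets = [[] for _ in range(num_buckets)]
        self.count = 0

    def _bucket(self, key):
        # Division method: bucket index is the hash modulo the bucket count.
        return self.buckets[hash(key) % len(self.buckets)]

    def insert(self, key, value):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)  # overwrite an existing key
                return
        bucket.append((key, value))
        self.count += 1

    def find(self, key, default=None):
        for k, v in self._bucket(key):
            if k == key:
                return v
        return default

    def delete(self, key):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:
                del bucket[i]
                self.count -= 1
                return True
        return False

    def load_factor(self):
        """Average chain length: entries divided by buckets."""
        return self.count / len(self.buckets)
```

With N known in advance, the load factor is N divided by the bucket count, so the average chain length (and hence the cost of each operation) follows directly from the chosen constant.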

gfortran change/find out write buffer size

I have a molecular dynamics program that writes atom positions and velocities to a file every n simulation steps. The actual writing takes about 90% of the running time (checked by eliminating the writes)! So I desperately need to optimize that.
I see that some Fortran compilers have an extension to change the write buffer size (called the I/O block size) and the "number of blocks" in the OPEN statement, but it appears that gfortran doesn't. I also read somewhere that gfortran uses an 8192-byte write buffer.
I even tried calling FSTAT (right after opening, is that right?) to see what block size and number of blocks it is using, but it returns -1 for both (compiling for 64-bit Windows).
Is there a way to enlarge the write buffer for a file in gfortran? Will it be different when compiling for Linux rather than Windows?
I'd really, really rather stay in Fortran, but as a desperate measure, is there a way to do this by adding some C routine?
thanks!
IanH's question is key. Unformatted I/O is MUCH faster than formatted. The conversion from base 2 to base 10 is very CPU intensive. If you don't need the values to be human readable, then use unformatted I/O. If you want to be able to read the values in another language, then use access='stream'.
Another approach would be to add your own buffering. Replace the write statement with a call to a subroutine. Have that subroutine store values and write only when it has received M values. You'll also need a "flush" call to the subroutine to make it write the last values if there are fewer than M.
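The buffering idea can be sketched as follows. The sketch is in Python purely to illustrate the logic (in the actual program it would be a Fortran subroutine holding an array and doing one unformatted write per batch); the class name and the choice of little-endian 8-byte floats are assumptions.

```python
import struct

class BufferedRecordWriter:
    """Collect values and write them as raw binary in batches of `capacity`."""

    def __init__(self, fileobj, capacity):
        self.fileobj = fileobj    # any binary file-like object
        self.capacity = capacity  # M: number of values per physical write
        self.pending = []

    def write(self, value):
        """Store one value; hit the file only when the buffer is full."""
        self.pending.append(value)
        if len(self.pending) >= self.capacity:
            self.flush()

    def flush(self):
        """Write whatever is buffered (call once at the end for the remainder)."""
        if self.pending:
            fmt = "<%dd" % len(self.pending)  # little-endian doubles, no text conversion
            self.fileobj.write(struct.pack(fmt, *self.pending))
            self.pending.clear()
```

Usage is one write call per value inside the simulation loop and a single flush after the loop ends.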
If gcc C is faster at IO, you could mix Fortran and C with Fortran's ISO_C_Binding: https://stackoverflow.com/questions/tagged/fortran-iso-c-binding. There are examples of the use of the ISO C Binding in the gfortran manual under "Mixed Language Programming".
If you spend 90% of your runtime writing coords/vels every n timesteps, the obvious quick fix would be to write data less often instead, say every 100*n timesteps. But I'm sure you already thought of that yourself.
But yes, gfortran has a fixed 8k buffer, whose size cannot be changed except by modifying the libgfortran source and rebuilding it. The reason for the buffering is to amortize the syscall overhead; (simplistic) tests on Linux showed that 8k is sufficient and more than that goes far into diminishing returns territory. That being said, if you have some substantiated claims that bigger buffers are useful on some I/O patterns and/or OS, there's no reason why the buffer can't be made larger in a future release.
As for your performance issues, as already mentioned, unformatted I/O is a lot faster than formatted I/O. Additionally, gfortran has a rather high per-I/O-statement overhead. You can amortize that by writing arrays (or array sections) rather than individual elements (this matters mostly for unformatted I/O; for formatted I/O there is so much to do that it doesn't help that much).
I am thinking that if the cost of I/O is comparable to or even larger than the cost of the simulation itself, then it probably isn't such a good idea to store all this data to disk in the first place. It is better to do whatever processing you intend to do directly during the simulation, instead of saving lots of intermediate data and then reading it back in later to do the processing.
Moreover, MD is an inherently highly parallelizable problem, and with IO you will severely cripple the efficiency of parallelization! I would avoid IO whenever possible.
For individual trajectories, normally you just need to store the initial condition of each trajectory, along with its key statistics, or important snapshots at a small number of time values. When you need one specific trajectory plotted you can regenerate the exact same trajectory or section of trajectory from the initial condition or the closest snapshot, and with similar cost as reading it from the disk.

How to store 15 x 100 million 32-byte records for sequential access?

I have 15 x 100 million 32-byte records. Only sequential access and appends are needed. The key is a Long. The value is a tuple (Date, Double, Double). Is there something in this universe that can do this? I am willing to have 15 separate databases (SQL/NoSQL) or files, one for each set of 100 million records. I only have an i7 CPU, 8 GB of RAM and a 2 TB hard disk.
I have tried PostgreSQL, MySQL and Kyoto Cabinet (with fine tuning), with Protostuff encoding.
SQL DBs (with indices) take forever to do the silliest query.
Kyoto Cabinet's B-tree can handle up to 15-18 million records, beyond which appends take forever.
I am so fed up that I am thinking of falling back on awk + CSV, which I remember used to work for this type of data.
If your scenario means always going through all records in sequence, then using a database may be overkill. If you start to need random lookups, replacing/deleting records, or checking whether a new record duplicates an older one, a database engine would make more sense.
For sequential access, a couple of text files or hand-crafted binary files will be easier to handle. You sound like a developer, so I would probably go for a custom binary format and access it with memory-mapped files to improve the sequential read/append speed. No caching, just a sliding window to read the data. I think it would perform better than any DB would, even on ordinary hardware; I did such data analysis once. It would also be faster than awking CSV files; however, I am not sure by how much, or whether the gain would justify the effort of developing the binary storage in the first place.
As soon as a database becomes interesting, you can have a look at MongoDB and CouchDB. They are used for storing and serving very large amounts of data. (There is a flattering evaluation that compares one of them to traditional DBs.) Databases usually need reasonably powerful hardware to perform well; maybe you could check how those two would do with your data.
--- Ferda
Ferdinand Prantl's answer is very good. Two points:
By your requirements I recommend that you create a very tight binary format. This will be easy to do because your records are fixed size.
If you understand your data well you might be able to compress it. For example, if your key is an increasing long value you don't need to store it in full. Instead, store the difference from the previous value (which is almost always going to be one). Then use a standard compression algorithm/library to save big on data size.
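A minimal sketch of both points in Python, under stated assumptions: the Date is stored as a 64-bit integer (for example epoch milliseconds), so that key, date and two doubles come to exactly 32 bytes, and all names below are hypothetical.

```python
import struct

# Assumed layout: int64 key, int64 timestamp, two float64 values = 32 bytes.
RECORD = struct.Struct("<qqdd")

def append_records(path, records):
    """Append (key, timestamp, a, b) tuples to a flat binary file."""
    with open(path, "ab") as f:
        for rec in records:
            f.write(RECORD.pack(*rec))

def read_records(path, batch=65536):
    """Yield records sequentially, reading `batch` records per disk read."""
    with open(path, "rb") as f:
        while True:
            block = f.read(RECORD.size * batch)
            if not block:
                break
            yield from RECORD.iter_unpack(block)

def delta_encode(keys):
    """Replace each key with its difference from the previous key."""
    prev = 0
    deltas = []
    for k in keys:
        deltas.append(k - prev)
        prev = k
    return deltas

def delta_decode(deltas):
    """Inverse of delta_encode: running sum of the differences."""
    total = 0
    keys = []
    for d in deltas:
        total += d
        keys.append(total)
    return keys
```

With mostly consecutive keys, the delta stream is a long run of ones, which a general-purpose compressor should shrink very effectively.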
For sequential reads and writes, leveldb will handle your dataset pretty well.
I think that's about 48 gigs of data in one table.
When you get into large databases, you have to look at things a little differently. With an ordinary database (say, tables less than a couple million rows), you can do just about anything as a proof of concept. Even if you're stone ignorant about SQL databases, server tuning, and hardware tuning, the answer you come up with will probably be right. (Although sometimes you might be right for the wrong reason.)
That's not usually the case for large databases.
Unfortunately, you can't just throw 1.5 billion rows straight at an untuned PostgreSQL server, run a couple of queries, and say, "PostgreSQL can't handle this." Most SQL dbms have ways of dealing with lots of data, and most people don't know that much about them.
Here are some of the things that I have to think about when I have to process a lot of data over the long term. (Short-term or one-off processing, it's usually not worth caring a lot about speed. A lot of companies won't invest in more RAM or a dozen high-speed disks--or even a couple of SSDs--for even a long-term solution, let alone a one-time job.)
Server CPU.
Server RAM.
Server disks.
RAID configuration. (RAID 3 might be worth looking at for you.)
Choice of operating system. (64-bit vs 32-bit, BSD v. AT&T derivatives)
Choice of DBMS. (Oracle will usually outperform PostgreSQL, but it costs.)
DBMS tuning. (Shared buffers, sort memory, cache size, etc.)
Choice of index and clustering. (Lots of different kinds nowadays.)
Normalization. (You'd be surprised how often 5NF outperforms lower NFs. Ditto for natural keys.)
Tablespaces. (Maybe putting an index on its own SSD.)
Partitioning.
I'm sure there are others, but I haven't had coffee yet.
But the point is that you can't determine whether, say, PostgreSQL can handle a 48 gig table unless you've accounted for the effect of all those optimizations. With large databases, you come to rely on the cumulative effect of small improvements. You have to do a lot of testing before you can defensibly conclude that a given dbms can't handle a 48 gig table.
Now, whether you can implement those optimizations is a different question--most companies won't invest in a new 64-bit server running Oracle and a dozen of the newest "I'm the fastest hard disk" hard drives to solve your problem.
But someone is going to pay either for optimal hardware and software, for dba tuning expertise, or for programmer time and waiting on suboptimal hardware. I've seen problems like this take months to solve. If it's going to take months, money on hardware is probably a wise investment.

How can I accelerate the generation of an MD5 checksum within vb.net?

I'm working with some very large files residing on P2 (Panasonic) cards. Part of the process we employ is to first generate a checksum of the file we are going to copy, then copy the file, then run a checksum on the copy to confirm that it copied OK. The problem is that the files are large (70 GB+) and take a long time to complete. It's an issue since we will eventually be dealing with thousands of these files.
I would like to find a faster way to generate the checksum than using System.Security.Cryptography.MD5CryptoServiceProvider.
I don't care if this means using a specialized hardware card, provided it works and is not too ungodly expensive. I would prefer a method that provides some feedback on how far the process has progressed, so I can display it like I do now.
The application is written in vb.net. I would prefer to be able to use it as component, library, reference within my application, but I'm willing to call an outside application if there is enough improvement in the speed of generating the checksum.
Needless to say, the checksum must be consistent and correct. :-)
Thank you in advance for your time and efforts,
Richard
I see one potential way to speed up this process: calculate the MD5 of the source file while performing the copy, not prior to it. This will reduce the number of times you'll need to read the entire file from 3 (source hash, copy, destination hash) to 2 (copy, destination hash).
The downside of this all is that you'll have to write your own copying code (as opposed to just relying on System.IO.File.Copy), and there's a non-zero chance that this will turn out to be slower in the end anyway than the 3-step process.
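The hash-while-copying idea looks roughly like this. The sketch is in Python rather than VB.NET and only illustrates the loop structure; the function name, the 4 MiB chunk size and the progress callback are assumptions, and the callback is where progress feedback for the UI would hook in.

```python
import hashlib

def copy_with_md5(src_path, dst_path, chunk_size=4 << 20, progress=None):
    """Copy src to dst while hashing the source bytes in the same pass.

    Returns the MD5 hex digest of the data that was read from the source.
    `progress`, if given, is called with the number of bytes copied so far.
    """
    digest = hashlib.md5()
    copied = 0
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:
                break
            digest.update(chunk)   # hash the chunk we just read...
            dst.write(chunk)       # ...and write the very same bytes
            copied += len(chunk)
            if progress is not None:
                progress(copied)
    return digest.hexdigest()
```

The destination still has to be read back and hashed separately afterwards to confirm the copy, which is the second of the two remaining passes.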
Other than that, I don't think there's much you can do here, as the entire process is I/O bound by design. You're spending most of your time reading/writing the file, and even at 100MB/s (a respectable I/O speed for your typical SATA drive), you'll do about 5.8GB/min at best.
With a modern processor, the overhead of calculating the MD5 (or anything else) doesn't factor into things very much, so speeding it up won't improve your overall throughput. Crypto accelerators in particular won't help you here, as unless the driver implementation is very efficient, they'll add more overhead due to context switches required to feed the data to the external card than they'll save.
What you do want to improve is the I/O speed. The .NET framework is already pretty efficient when it comes to this (using nicely-sized buffers, overlapped I/O and such), but it's possible an optimized native Windows application will perform better here. My advice: Google around for a few native MD5 calculators, and see how they compare to your current .NET implementation. If the difference in hash calculation speed is >10%, it's worth switching to using said external app.
The correct answer is to avoid using MD5. MD5 is a cryptographic hash function, designed to provide certain cryptographic features. For merely detecting accidental corruption, it is way over-engineered and slow. There are many faster checksums, the design of which can be understood by examining the literature of error detection and correction. Some common examples are the CRC checksums, of which CRC32 is very common, but you can also relatively easily compute 64 or 128 bit or even larger CRCs much much faster than an MD5 hash.

What would be a good (de)compression routine for this scenario

I need a FAST decompression routine optimized for a restricted-resource environment such as an embedded system, operating on binary (hex) data with the following characteristics:
Data is 8-bit (byte) oriented (the data bus is 8 bits wide).
Byte values do NOT range uniformly from 0 to 0xFF, but have a Poisson distribution (bell curve) in each dataset.
Each dataset is fixed in advance (to be burnt into flash) and is rarely larger than 1 to 2 MB.
Compression can take as much time as required, but decompressing a byte should take at most 23 µs in the worst case, with a minimal memory footprint, since it will run on a restricted-resource embedded system (3 MHz to 12 MHz core, 2 KB of RAM).
What would be a good decompression routine?
Basic run-length encoding seems too wasteful. I can immediately see that adding a header section to the compressed data, so that unused byte values can represent often-repeated patterns, would give phenomenal performance!
If I came up with that after investing only a few minutes, surely much better algorithms must already exist from people who love this stuff?
I would like to have some "ready to go" examples to try out on a PC so that I can compare the performance vis-a-vis a basic RLE.
The two solutions I use when performance is the only concern:
LZO Has a GPL License.
liblzf Has a BSD License.
miniLZO.tar.gz This is LZO, just repacked in to a 'minified' version that is better suited to embedded development.
Both are extremely fast when decompressing. I've found that LZO will create slightly smaller compressed data than liblzf in most cases. You'll need to do your own benchmarks for speeds, but I consider them to be "essentially equal". Both are light-years faster than zlib, though neither compresses as well (as you would expect).
LZO, in particular miniLZO, and liblzf are both excellent for embedded targets.
If you have a preset distribution of values, meaning the probability of each value is fixed across all datasets, you can create a Huffman encoding with fixed codes (the code tree does not have to be embedded in the data).
Depending on the data, I'd try huffman with fixed codes or lz77 (see links of Brian).
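To make the fixed-code Huffman idea concrete, here is a small PC-side sketch in Python for experimenting with compression ratios. It builds one code table from a known frequency table and then encodes and decodes with it. The names are hypothetical, bits are kept as a text string for readability, and a real embedded decoder would walk a table stored in flash instead.

```python
import heapq

def build_huffman_codes(frequencies):
    """Build a {symbol: bit-string} table from a {symbol: frequency} map."""
    # Heap entries: (weight, tie-breaker, partial code table for that subtree).
    heap = [(freq, i, {sym: ""})
            for i, (sym, freq) in enumerate(sorted(frequencies.items()))]
    heapq.heapify(heap)
    if len(heap) == 1:
        return {sym: "0" for sym in heap[0][2]}
    tie = len(heap)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        # Merging two subtrees prepends one more bit to every code in them.
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (w1 + w2, tie, merged))
        tie += 1
    return heap[0][2]

def huffman_encode(data, codes):
    """Encode a bytes object as a string of '0'/'1' characters."""
    return "".join(codes[b] for b in data)

def huffman_decode(bits, codes):
    """Decode a bit string; works because Huffman codes are prefix-free."""
    inverse = {code: sym for sym, code in codes.items()}
    out = bytearray()
    current = ""
    for bit in bits:
        current += bit
        if current in inverse:
            out.append(inverse[current])
            current = ""
    return bytes(out)
```

Because the table is built once from the preset distribution, only the encoded bits need to be stored with each dataset.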
Well, the main two algorithms that come to mind are Huffman and LZ.
The first basically just creates a dictionary. If you restrict the dictionary's size sufficiently, it should be pretty fast...but don't expect very good compression.
The latter works by adding back-references to repeating portions of output file. This probably would take very little memory to run, except that you would need to either use file i/o to read the back-references or store a chunk of the recently read data in RAM.
I suspect LZ is your best option, if the repeated sections tend to be close to one another. Huffman works by having a dictionary of often repeated elements, as you mentioned.
Since this seems to be audio, I'd look at either differential PCM or ADPCM, or something similar, which will reduce it to 4 bits/sample without much loss in quality.
With the most basic differential PCM implementation, you just store a 4-bit signed difference between the current sample and an accumulator, add that difference to the accumulator, and move on to the next sample. If the difference is outside of [-8, 7], you have to clamp the value, and it may take several samples for the accumulator to catch up. Decoding is very fast and uses almost no memory: just add each value to the accumulator and output the accumulator as the next sample.
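The basic scheme just described can be sketched in a few lines of Python for trying it out on a PC. Function names are hypothetical, the accumulator is assumed to start at 0, and the 4-bit codes are left as plain integers rather than packed two per byte.

```python
def dpcm_encode(samples):
    """Encode samples as 4-bit signed differences in the range [-8, 7]."""
    acc = 0
    codes = []
    for s in samples:
        diff = max(-8, min(7, s - acc))  # clamp to what 4 bits can hold
        codes.append(diff)
        acc += diff                      # track what the decoder will see
    return codes

def dpcm_decode(codes):
    """Rebuild samples by adding each difference to an accumulator."""
    acc = 0
    samples = []
    for diff in codes:
        acc += diff
        samples.append(acc)
    return samples

if __name__ == "__main__":
    original = [0, 3, 10, 30, 28, 20]
    codes = dpcm_encode(original)
    print(codes)               # [0, 3, 7, 7, 7, -4]
    print(dpcm_decode(codes))  # [0, 3, 10, 17, 24, 20]
```

The jump from 10 to 30 shows the clamping effect: the decoded signal needs a few samples to catch up with the original.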
A small improvement over basic DPCM, to help the accumulator catch up faster when the signal gets louder and higher pitched, is to use a lookup table that decodes the 4-bit values to a larger non-linear range, where the steps are still 1 apart near zero but grow larger toward the limits. And/or you could reserve one of the values to toggle a multiplier; deciding when to use it is up to the encoder. With these improvements you can either achieve better quality or get away with 3 bits per sample instead of 4.
If your device has a non-linear μ-law or A-law ADC, you can get quality comparable to 11-12 bit with 8 bit samples. Or you can probably do it yourself in your decoder. http://en.wikipedia.org/wiki/M-law_algorithm
There might be inexpensive chips out there that already do all this for you, depending on what you're making. I haven't looked into any.
You should try different compression algorithms with either a compression software tool with command line switches or a compression library where you can try out different algorithms.
Use typical data for your application.
Then you know which algorithm is best-fitting for your needs.
I have used zlib in embedded systems for a bootloader that decompresses the application image to RAM on start-up. The licence is nicely permissive, no GPL nonsense. It does make a single malloc call, but in my case I simply replaced this with a stub that returned a pointer to a static block, and a corresponding free() stub. I did this by monitoring its memory allocation usage to get the size right. If your system can support dynamic memory allocation, then it is much simpler.
http://www.zlib.net/