Backing up my database is taking too long - backup

On a windows mobile unit, the software I'm working on relies on a sdf file as it's database.
The platform that the software is targeted towards is "less than optimal" and hard resets every once and a while. In the far distant past we lost data. Now we close the database, and copy the SDF file to the SD card. If the unit gets hard reset, we restore the app (also on the sd card) and the database.
I'm not concerned about the restore (just yet). The problem we have now is that doing a "backup" takes a crazy amount of time because the SDF is 7+ megs and writing to the SD card is slow slow slow.
My boss suggested we create hashes of "chunks" of the file and then write to the destination file only when a compare of the hashes is !=.
So here's the question.
How would you test if a file is changed if you can only have one copy of the file and thus can't compare it with it's original.
I'm just shooting for a bit of brain storming.

Just store your hashes of your chunks somewhere. You don't need the "backup" copy to compare to if you know what your hashes are. Obviously this creates a chicken and egg problem for at least one hash, but copying a single "chunk" is a much smaller problem.
Your proposed approach will still have performance problems though, as hashing a large file isn't going to be a pretty operation on a slow CPU powered by a battery.
I assume you don't have the granular control to keep track of the parts of the file you modify, and then update just those sections when you need to do backup?

Related

Is it wrong to write byte of images in the database?

When should I make this direct recording at the bank?
What are the situations?
I know I can record the path of the image in the bank.
In addition to the cost being higher as mentioned, one must take into account several factors:
Data Volume: For a low volume of data there may be no problem. On the other hand, for mass storage of data the database is practically unfeasible.
Clustering: One advantage of the database is if your system runs on multiple servers, everyone will have uniform access to the files.
Scalability: If demand for volume or availability increases, can you add more capacity to the system? It is much easier to split files between different servers than to distribute records from one table to more servers.
Flexibility: Backing up, moving files from one server to another, doing some processing on the stored files, all this is easier if the files are in a directory.
There are several strategies for scaling a system in terms of both availability and volume. Basically these strategies consist of distributing them on several different servers and redirecting the user to each of them according to some criteria. The details vary of implementation, such as: data update strategy, redundancy, distribution criteria, etc.
One of the great difficulties in managing files outside BD is that we now have two distinct data sources that need to be always in sync.
From the safety point of view, there is actually little difference. If a hacker can compromise a server, it can read both the files written to disk of your system and the files of the database system. If this question is critical, an alternative is to store the encrypted data.
I also convert my images into byte array and store them in an sql server database but in the long run, I am sure that someone will ask you and tell you that you should only save the (server) path of the image.
The biggest disadvantage of storing as binary I think is
Retrieving images from database is significantly more expensive compared to using the file system

How to deal with thousands of small audio files?

Need to implement an app that has a feature to play sounds. Each sound will be some word sound, number of expected sounds is about one thousand. So, the most simple solution would be to store those sounds as sound files, each word sound in separate sound file, and to play them on demand. Would there be any potential problems with such a large number of files?
No problem with that many files, but they will take up more space than just the total of their sizes. Each file will fill up a whole # of space blocks on the device. On average you will then waste half a block (as a rule of thumb) unless all your files are significantly smaller than one block, in which case you will always use 1.000 blocks (one pr. file) and waste 1000 * (blocksize - average file size).
Things you could do:
Concatenate the files into one big file, store the start and length of each subfile, either read the chunk into memory or copy to a temporary file.
Drop the files in a database as BLOB fields for easier retrieval. This won't save space, but may make your code simpler or more reliable.
I don't think you need to make your own caching mechanism. Most likely iOS has a system-wide cache that does a far better job. That should only be relevant if you experience performance issues and need to get much shorter load times. In that case prhaps consider using bolcks for loading and dispatching the playing, as that's an easier way to hide the load latency and avoid UI freezes.
If your audio is uncompressed, the App Store will report the compressed size. If that differs a lot from the unpacked size, some (nitpicking) customers will definitely notice ald complain, as they think the advertised size is the install size. I know from personal experience. They wil generally not take a technical answer for an answer, any may even bypass talking to you, and just downvote you based on this. I s#it you not.
You should be fine storing 1000 audio clip files within the IPA but it is important to take note about the space requirements and organisation.
Also to take into consideration is the fact that accessing the disk is slower than memory and it also takes up battery space so it my be ideal to load up the most frequently used audio clips into memory.
If you can afford it, use FMOD which I believe can extract audio from various compressed schemes. If you just want to handle all those files yourself create a .zip file and extract them on the fly using libz (iOS library libs.dylib).

Is there any performance difference between creating an NSFileHandle for a large versus a small file?

This question strikes me as almost silly, but I just want to sanity check myself. For a variety of reasons, I'm welding together a bunch of files into a single megafile before packing this as a resource in my iOS app. I'm then using NSFileHandle to open the file, seek to the right place, and read out just the bytes I want.
Is there any performance difference between doing it this way and reading loose files? Or, supposing I could choose to use just one monolithic megafile, versus, say, 10 medium-sized (but still joined) files, is there any performance difference between "opening" the large versus a smaller file?
Since I know exactly where to seek to, and I'm reading just the bytes I want, I don't see how there could be a difference. But, hey -- Stranger things have proved to be. Thanks in advance!
There could be a difference if it was an extremely large number of files. Every open file uses up resources in memory (file handles, and the like), and on some storage devices, a file will take up an entire block even if it doesn't fill it. That can lead to wasted space in extreme cases. But in practice, it probably won't be a problem. To know for sure, you can profile your code and see if it's faster one way vs. the other, and see what sort of space it takes up on a typical device.

Is there a reverse-incremental backup solution with built-in redundancy (e.g. par2)?

I'm setting a home server primarily for backup use. I have about 90GB of personal data that must be backed up in the most reliable manner, while still preserving disk space. I want to have full file history so I can go back to any file at any particular date.
Full weekly backups are not an option because of the size of the data. Instead, I'm looking along the lines of an incremental backup solution. However, I'm aware that a single corruption in a set of incremental backups makes the entire series (beyond a point) unrecoverable. Thus simple incremental backups are not an option.
I've researched a number of solutions to the problem. First, I would use reverse-incremental backups so that the latest version of the files would have the least chance of loss (older files are not as important). Second, I want to protect both the increments and backup with some sort of redundancy. Par2 parity data seems perfect for the job. In short, I'm looking for a backup solution with the following requirements:
Reverse incremental (to save on disk space and prioritize the most recent backup)
File history (kind of a broader category including reverse incremental)
Par2 parity data on increments and backup data
Preserve metadata
Efficient with bandwidth (bandwidth saving; no copying the entire directory over for each increment). Most incremental backup solutions should work this way.
This would (I believe) ensure file integrity and relatively small backup sizes. I've looked at a number of backup solutions already but they have a number of problems:
Bacula - Simple normal incremental backups
bup - incremental and implements par2 but isn't reverse incremental and doesn't preserve metadata
duplicity - incremental, compressed, and encrypted but isn't reverse incremental
dar - incremental and par2 is easy to add, but isn't reverse incremental and no file history?
rdiff-backup - almost perfect for what I need but it doesn't have par2 support
So far I think that rdiff-backup seems like the best compromise but it doesn't support par2. I think I can add par2 support to backup increments easily enough since they aren't modified each backup but what about the rest of the files? I could generate par2 files recursively for all files in the backup but this would be slow and inefficient, and I'd have to worry about corruption during a backup and old par2 files. In particular, I couldn't tell the difference between a changed file and a corrupt file, and I don't know how to check for such errors or how they would affect the backup history. Does anyone know of any better solution? Is there a better approach to the issue?
Thanks for reading through my difficulties and for any input you can give me. Any help would be greatly appreciated.
http://www.timedicer.co.uk/index
Uses rdiff-backup as the engine. I've been looking at it, but that requires me to set up a "server" using linux or a virtual machine.
Personally, I use WinRAR to make pseudo-incremental backups (it actually makes a full backup of recent files) run daily by a scheduled task. It is similarly a "push" backup.
It's not a true incremental (or reverse-incremental) but it saves different versions of files based on when it was last updated. I mean, it saves the version for today, yesterday and the previous days, even if the file is identical. You can set the archive bit to save space, but I don't bother anymore as all I backup are small spreadsheets and documents.
RAR has it's own parity or recovery record that you can set in size or percentage. I use 1% (one percent).
It can preserve metadata, I personally skip the high resolution times.
It can be efficient since it compresses the files.
Then all I have to do is send the file to my backup. I have it copied to a different drive and to another computer in the network. No need for a true server, just a share. You can't do this for too many computers though as Windows workstations have a 10 connection limit.
So for my purpose, which may fit yours, backs up my files daily for files that have been updated in the last 7 days. Then I have another scheduled backup that backups files that have been updated in the last 90 days run once a month or every 30 days.
But I use Windows, so if you're actually setting up a Linux server, you might check out the Time Dicer.
Since nobody was able to answer my question, I'll write a few possible solutions I found while researching the topic. In short, I believe the best solution is rdiff-backup to a ZFS filesystem. Here's why:
ZFS checksums all blocks stored and can easily detect errors.
If you have ZFS set to mirror your data, it can recover the errors by copying from the good copy.
This takes up less space than full backups, even though the data is copied twice.
The odds of an error in both the original and mirror is tiny.
Personally I am not using this solution as ZFS is a little tricky to get working on Linux. Btrfs looks promising but hasn't been proven stable from years of use. Instead, I'm going with a cheaper option of simply checking hard drive SMART data. Hard drives should do some error checking/correcting themselves and by monitoring this data I can see if this process is working properly. It's not as good as additional filesystem parity but better than nothing.
A few more notes that might be interesting to people looking into reliable backup development:
par2 seems to be dated and buggy software. zfec seems like a much faster modern alternative. Discussion in bup occurred a while ago: https://groups.google.com/group/bup-list/browse_thread/thread/a61748557087ca07
It's safer to calculate parity data before even writing to disk. i.e. don't write to disk, read it, and then calculate parity data. Do it from ram, and check against the original for additional reliability. This might only be possible with zfec, since par2 is too slow.

considerations for saving data to ONE file or MULTIPLE?

i am going to be saving data with DPAPI encryption. i am not sure whether i should just have one big file with all the data or should i break up the data into separate files, where every file is its own record. i suspect the entire dataset would be less than 10mb, so i am not sure whether it's worth it to break it down into about a few hundred separate files or should i just keep it one file?
will it take a long time to decrypt 10mb of data?
For 10 megabytes, I wouldn't worry about splitting it up. The cost of encrypting/decrypting
a given volume of data will be pretty much the same, whether it's one big file or a
group of small files. If you needed the ability to selectively decrypt individual records,
as opposed to all at once, splitting the file might be useful.
If you can never think of the hardware your app is going to run on, make it scaleable. It can then run from 10 parallel floppy drives if it's too slow reading from 1.
If your scope is limited to high-perfo computers, and the file size is not likely to rise within the coming next 10 years, put it in 1 file.