considerations for saving data to ONE file or MULTIPLE? - vb.net

I am going to be saving data with DPAPI encryption. I am not sure whether I should keep everything in one big file or break the data up into separate files, with every file holding its own record. I expect the entire dataset to be less than 10 MB, so I'm not sure whether it's worth splitting it into a few hundred separate files or whether I should just keep it as one file.
Will it take a long time to decrypt 10 MB of data?

For 10 megabytes, I wouldn't worry about splitting it up. The cost of encrypting/decrypting a given volume of data will be pretty much the same whether it's one big file or a group of small files. If you needed the ability to selectively decrypt individual records, as opposed to all at once, splitting the file might be useful.
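For reference, protecting and unprotecting a whole file with DPAPI takes only a couple of calls. A minimal sketch, assuming the .NET ProtectedData class (add a reference to System.Security); the module name and file paths are placeholders:

Imports System.IO
Imports System.Security.Cryptography

' Minimal sketch: protect and unprotect an entire data file with DPAPI.
Module DpapiFileSketch

    Sub SaveProtected(plainBytes As Byte(), protectedPath As String)
        ' CurrentUser scope: only the same Windows account can decrypt.
        Dim cipherBytes As Byte() = ProtectedData.Protect(
            plainBytes, Nothing, DataProtectionScope.CurrentUser)
        File.WriteAllBytes(protectedPath, cipherBytes)
    End Sub

    Function LoadProtected(protectedPath As String) As Byte()
        Dim cipherBytes As Byte() = File.ReadAllBytes(protectedPath)
        Return ProtectedData.Unprotect(
            cipherBytes, Nothing, DataProtectionScope.CurrentUser)
    End Function

End Module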

If you can't know what hardware your app is going to run on, make it scalable. It can then run from 10 parallel floppy drives if it's too slow reading from one.
If your scope is limited to high-performance computers, and the file size is not likely to grow over the next 10 years, put it in one file.

Is it wrong to store image bytes in the database?

When should I store the image data directly in the database?
In what situations?
I know I can store just the path of the image in the database instead.
In addition to the cost being higher as mentioned, one must take into account several factors:
Data Volume: For a low volume of data there may be no problem. For mass storage of data, on the other hand, the database is practically unworkable.
Clustering: One advantage of the database is that if your system runs on multiple servers, all of them have uniform access to the files.
Scalability: If demand for volume or availability increases, can you add more capacity to the system? It is much easier to split files between different servers than to distribute records from one table across more servers.
Flexibility: Backing up, moving files from one server to another, doing some processing on the stored files; all of this is easier if the files are in a directory.
There are several strategies for scaling a system in terms of both availability and volume. Basically they consist of distributing the data across several different servers and redirecting each user to one of them according to some criterion. The implementation details vary: data update strategy, redundancy, distribution criteria, and so on.
One of the great difficulties in managing files outside the database is that you now have two distinct data sources that need to be kept in sync at all times.
From a security point of view, there is actually little difference. If an attacker can compromise a server, they can read both the files your system writes to disk and the files of the database system. If this concern is critical, an alternative is to store the data encrypted.
I also convert my images into a byte array and store them in a SQL Server database, but in the long run I'm sure someone will tell you that you should only save the (server) path of the image.
The biggest disadvantage of storing images as binary, I think, is this:
Retrieving images from the database is significantly more expensive than using the file system.
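To make the trade-off concrete, here is a rough VB.NET sketch of both approaches, assuming a hypothetical Images table (with either a Data VARBINARY(MAX) column or a FilePath column), an already-open SqlConnection, and a made-up destination folder:

Imports System.Data.SqlClient
Imports System.IO

Module ImageStorageSketch

    ' Option A: store the image bytes themselves in the database.
    Sub SaveImageAsBlob(conn As SqlConnection, id As Integer, sourcePath As String)
        Dim bytes As Byte() = File.ReadAllBytes(sourcePath)
        Using cmd As New SqlCommand(
            "INSERT INTO Images (Id, Data) VALUES (@id, @data)", conn)
            cmd.Parameters.AddWithValue("@id", id)
            cmd.Parameters.AddWithValue("@data", bytes)
            cmd.ExecuteNonQuery()
        End Using
    End Sub

    ' Option B: copy the file to a known folder and store only its path.
    Sub SaveImageAsPath(conn As SqlConnection, id As Integer, sourcePath As String)
        Dim destination As String = Path.Combine("C:\images", id.ToString() & ".jpg")
        File.Copy(sourcePath, destination, overwrite:=True)
        Using cmd As New SqlCommand(
            "INSERT INTO Images (Id, FilePath) VALUES (@id, @path)", conn)
            cmd.Parameters.AddWithValue("@id", id)
            cmd.Parameters.AddWithValue("@path", destination)
            cmd.ExecuteNonQuery()
        End Using
    End Sub

End Module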

Writing small data to a file and reading from it, or querying the database?

I have a situation where I would have to query the database for some data which is the same for all users and changes daily, so I figured I could make a file in which I would save this data (once per day) and then load it from that file each time a user visits my site.
Now, I know that this is a common practice (caching) when the requests to the database are heavy, but the data I'm about to write to the file is a simple 3-digit number, so my question is: would this still be faster, or is it just overkill and I should stick with the database query?
Caching, when done right, is always faster.
It depends on how long storing and retrieving data from the file takes and how long requests to the database take.
If the database query to get the number takes long, then caching may be a good idea, since the data is small.
If you were to do a search (e.g. sequential) in a file with lots of cached data (which doesn't seem to be the case), it would take long.
Disk I/O could be slower than database I/O (which is unlikely, unless it's a local DB).
Bottom line - benchmark.
For your scenario, caching is probably a good idea, but if it's only a single 3-digit number for all users then I'd just try to stick it in RAM rather than in a file.
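To illustrate the in-RAM option, here is a minimal VB.NET-style sketch of a once-per-day cache; GetNumberFromDatabase is a hypothetical stand-in for the real query:

Module DailyValueCacheSketch

    Private cachedValue As Integer
    Private cachedOn As Date = Date.MinValue

    Function GetDailyNumber() As Integer
        ' Hit the database at most once per calendar day.
        If cachedOn <> Date.Today Then
            cachedValue = GetNumberFromDatabase()
            cachedOn = Date.Today
        End If
        Return cachedValue
    End Function

    Private Function GetNumberFromDatabase() As Integer
        ' Placeholder for the real database query.
        Return 123
    End Function

End Module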

How to deal with thousands of small audio files?

I need to implement an app that has a feature to play sounds. Each sound will be a spoken word, and the number of expected sounds is about one thousand. So the simplest solution would be to store those sounds as sound files, each word in its own file, and to play them on demand. Would there be any potential problems with such a large number of files?
No problem with that many files, but they will take up more space than just the total of their sizes. Each file will occupy a whole number of storage blocks on the device, so on average you will waste half a block per file (as a rule of thumb); and if all your files are significantly smaller than one block, you will still use 1,000 blocks (one per file) and waste 1000 * (blocksize - average file size). For example, with 4 KB blocks and files averaging 1 KB, 1,000 files would occupy about 4 MB of disk for roughly 1 MB of audio.
Things you could do:
Concatenate the files into one big file, store the start and length of each subfile, and either read the chunk into memory or copy it to a temporary file (see the sketch after this list).
Drop the files in a database as BLOB fields for easier retrieval. This won't save space, but may make your code simpler or more reliable.
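As a rough illustration of the concatenation idea (sketched in VB.NET rather than Objective-C; the file names and the simple text index are made up):

Imports System.Collections.Generic
Imports System.IO

Module SoundPackSketch

    ' Pack: concatenate the files and record name, offset and length for each.
    Sub Pack(sourceFiles As IEnumerable(Of String), packPath As String, indexPath As String)
        Using pack As New FileStream(packPath, FileMode.Create),
              index As New StreamWriter(indexPath)
            For Each f As String In sourceFiles
                Dim bytes As Byte() = File.ReadAllBytes(f)
                index.WriteLine(Path.GetFileName(f) & ";" & pack.Position.ToString() & ";" & bytes.Length.ToString())
                pack.Write(bytes, 0, bytes.Length)
            Next
        End Using
    End Sub

    ' Read one entry back by seeking to its recorded offset.
    Function ReadEntry(packPath As String, offset As Long, length As Integer) As Byte()
        Using pack As New FileStream(packPath, FileMode.Open, FileAccess.Read)
            pack.Seek(offset, SeekOrigin.Begin)
            Using reader As New BinaryReader(pack)
                Return reader.ReadBytes(length)
            End Using
        End Using
    End Function

End Module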
I don't think you need to make your own caching mechanism. Most likely iOS has a system-wide cache that does a far better job. That should only be relevant if you experience performance issues and need much shorter load times. In that case, perhaps consider using blocks for loading and dispatching the playback, as that's an easier way to hide the load latency and avoid UI freezes.
If your audio is uncompressed, the App Store will report the compressed size. If that differs a lot from the unpacked size, some (nitpicking) customers will definitely notice and complain, as they think the advertised size is the install size. I know this from personal experience. They will generally not take a technical answer for an answer, and may even bypass talking to you and just downvote you based on this. I s#it you not.
You should be fine storing 1000 audio clip files within the IPA, but it is important to take note of the space requirements and organisation.
Also take into consideration that accessing the disk is slower than memory and uses more battery, so it may be ideal to load the most frequently used audio clips into memory.
If you can afford it, use FMOD, which I believe can extract audio from various compression schemes. If you just want to handle all those files yourself, create a .zip file and extract them on the fly using libz (the iOS library libz.dylib).

Does storing a lot of images in a single directory slow down image retrieval?

If I have a site where users can upload as many images as they want (think Photobucket-like), what is the best way to set up file storage (also, all uploads get a unique random timestamp)?
site root
--username
----image1.jpg
----image2.jpg
----image3.jpg
--anotheruser
----image1.jpg
----image2.jpg
----image3.jpg
...
or
siteroot
--uploads
----image1.jpg
----image2.jpg
----image3.jpg
----image4.jpg
----image6.jpg
...
----image50000.jpg
I think the first method is more organized, but the second method (keeping all uploads in the same dir) seems to be standard. I wonder, though, whether it would be slower to retrieve an image when there are thousands of images in the same directory.
--- edit ---
Thanks for the great answers so far.
Also, I will be creating thumbnails, so I would have to fit that directory in somewhere too... or create a naming convention such as thumb_whatever.jpg.
So many different ways to do this.
Yes, disk space will be a problem, but for now I am concerned with retrieval time. When I have to output an image to the browser, if that image is in a directory with 10,000 other images, I am worried about how slow that could get.
The number of files in a directory should have no effect at all on the time required to read a file's data - but it can massively affect the amount of time needed to find the file before you can start to read it.
The exact breakpoints where the major issues start will vary from filesystem to filesystem, but, in general, if you're talking about a few hundred files, you don't much need to worry about it. If you're talking about a few thousand, it's worth thinking about and maybe doing a little benchmarking to see how your filesystem and hardware handle it. If you're talking about tens of thousands of files, then you really need to start breaking things up. (I once had a Linux/e2fs print server where CUPS wasn't deleting its job control files after it finished printing, and it got up to around 100,000 files in one directory. Just getting a directory listing took over half an hour before it even started to display any filenames.)
Separating them by user name may not be the best choice, though, since you'll likely have a lot of users uploading very few images and perhaps a couple who upload hundreds or thousands of images, potentially creating access time issues in those users' storage directories. The bigger problem in that scenario is that you'd likely end up (assuming a successful site) with thousands or tens of thousands of users and a large number of subdirectories is just as bad as a large number of files for slowing down access to your data.
Since you're going to have a timestamp on them, what I would probably do is put them into subdirectories based on the last three digits of the timestamp. That will distribute the files relatively evenly across 1000 subdirectories and should keep the number of files in each directory reasonably small. (Using the first three digits would cause one directory to be filled before moving to the next instead of distributing them evenly.) If you're still ending up with too many files in each subdirectory (which would likely mean you're dealing with several million uploaded images), you could add a second level for the previous three digits, so upload-1234567890.jpg would end up at /567/890/upload-1234567890.jpg.
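A small VB.NET-style sketch of that path scheme, assuming a ten-digit Unix-style timestamp and a made-up base folder:

Imports System.IO

Module UploadPathSketch

    Function BuildUploadPath(timestamp As Long) As String
        Dim digits As String = timestamp.ToString()
        Dim last3 As String = digits.Substring(digits.Length - 3)     ' e.g. "890"
        Dim prev3 As String = digits.Substring(digits.Length - 6, 3)  ' e.g. "567"
        Dim name As String = "upload-" & digits & ".jpg"
        ' One level: \890\upload-1234567890.jpg
        ' Two levels for very large sites: \567\890\upload-1234567890.jpg
        Return Path.Combine("C:\uploads", prev3, last3, name)
    End Function

End Module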
The answer to that is "maybe". It's possible the file retrieval may be fine, but if you need to do any maintenance on the folder, it would be a huge headache as processes attempt to enumerate the directory listings.
What would improve the situation is a number of subdirectories under the images folder (or two levels, depending on how many images you're looking at storing), so you have a hierarchy like this:
siteroot
-- uploads
---- a
---- b
---- c
:
---- z
...and then store files based on their first letter (so all images with names starting with 'a' go into the folder 'a'). You could use a two or three letter prefix (aa, ab, ac, ad ..., ba, bb, bc ..., zx, zy, zz) and possibly have a hierarchy under that as well, so you split files across a number of folders depending on the first four characters of the name.
If files are then assigned a random alpha-numeric name then this would ensure files are spread evenly across all the folders (given a large enough sample size).
You might want to consider a mix of your option (1) and splitting images over a hierarchy as I've described above. That would ensure that if a single user does upload lots of files, then you're covered. Similarly, if you're looking at a lot of user directories, the same principle applies to ensure you don't have 1,000,000 user directories under a single parent.
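For illustration, a minimal sketch of that prefix-based layout (the root folder, extension, and two-character bucket are assumptions):

Imports System.IO

Module PrefixShardSketch

    Function BuildImagePath(randomName As String) As String
        ' A two-character prefix gives 36 * 36 = 1296 buckets for
        ' lowercase alphanumeric names.
        Dim bucket As String = randomName.Substring(0, 2).ToLowerInvariant()
        Return Path.Combine("C:\uploads", bucket, randomName & ".jpg")
    End Function

End Module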
Try using MongoDB... it is a document database which also allows you to store binary data. It's very fast and efficient and supports sharding (spreading data over multiple machines) out of the box.
You really don't want to have folders and folders full of files. Managing these folders takes forever, and changing the naming/dividing scheme later is a nightmare. Furthermore, if you run out of disk space you have a problem. For load balancing, too, having one hard disk full of files is not efficient.
I often use a schema like this:
uploads/(#id%1000)/img_#id.jpg
Where #id is, of course, the integer id of the photo stored in the database. That provides a simple scheme based only on the photo's id.
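As a sketch of that scheme in VB.NET (keeping the names from the pattern above):

Imports System.IO

Module IdShardSketch

    Function BuildPhotoPath(photoId As Integer) As String
        Dim bucket As Integer = photoId Mod 1000   ' 0..999 subdirectories
        Return Path.Combine("uploads", bucket.ToString(), "img_" & photoId.ToString() & ".jpg")
    End Function

End Module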
It depends on the file system. For example, FAT16 tends to be quite slow if you have more than 512 files in a directory. FAT32 and NTFS do not have the same limitation, but they also slow down considerably with an extremely large number of files. Even if you're running one of the more robust Linux file systems, you're still going to be able to scan directories more quickly if they're smaller.
I would definitely split the images into directories by user.
I think that subdirectories under the uploads directory would be the best.
site root
--uploads
----username
------image1.jpg
------image2.jpg
------image3.jpg
----anotheruser
------image1.jpg
------image2.jpg
------image3.jpg
...
Depending on the host OS, having too many files in one directory could cause some headaches and compatibility problems. Also, depending on how you are getting the image list, it could cause performance issues.
Plus, option 2 would be a mess. :)

Backing up my database is taking too long

On a Windows Mobile unit, the software I'm working on relies on an SDF file as its database.
The platform the software targets is "less than optimal" and hard resets every once in a while. In the distant past we lost data. Now we close the database and copy the SDF file to the SD card. If the unit gets hard reset, we restore the app (also on the SD card) and the database.
I'm not concerned about the restore (just yet). The problem we have now is that doing a "backup" takes a crazy amount of time, because the SDF is 7+ megs and writing to the SD card is slow, slow, slow.
My boss suggested we create hashes of "chunks" of the file and then write a chunk to the destination file only when a comparison of its hashes shows it has changed.
So here's the question.
How would you test whether a file has changed if you can only have one copy of the file and thus can't compare it with its original?
I'm just shooting for a bit of brainstorming.
Just store your hashes of your chunks somewhere. You don't need the "backup" copy to compare to if you know what your hashes are. Obviously this creates a chicken and egg problem for at least one hash, but copying a single "chunk" is a much smaller problem.
Your proposed approach will still have performance problems though, as hashing a large file isn't going to be a pretty operation on a slow CPU powered by a battery.
I assume you don't have granular enough control to keep track of which parts of the file you modify, and then update just those sections when you need to do a backup?
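To make the chunk-hash idea concrete, here is a rough VB.NET sketch; the 64 KB chunk size, MD5 as the hash, and the in-memory hash lists are assumptions rather than the poster's actual design, and truncating a backup when the source file shrinks is left out:

Imports System.Collections.Generic
Imports System.IO
Imports System.Security.Cryptography

Module ChunkedBackupSketch

    Private Const ChunkSize As Integer = 64 * 1024   ' 64 KB chunks

    ' Hash each chunk of the source file; rewrite only the chunks whose
    ' hashes differ from those recorded during the previous backup.
    Sub BackupChangedChunks(sourcePath As String, backupPath As String,
                            oldHashes As List(Of String), newHashes As List(Of String))
        Using md5 As MD5 = MD5.Create(),
              source As New FileStream(sourcePath, FileMode.Open, FileAccess.Read),
              backup As New FileStream(backupPath, FileMode.OpenOrCreate, FileAccess.Write)

            Dim buffer(ChunkSize - 1) As Byte
            Dim index As Integer = 0
            Dim bytesRead As Integer = source.Read(buffer, 0, ChunkSize)

            While bytesRead > 0
                Dim hash As String = Convert.ToBase64String(md5.ComputeHash(buffer, 0, bytesRead))
                newHashes.Add(hash)

                ' Write the chunk only if it is new or has changed.
                If index >= oldHashes.Count OrElse oldHashes(index) <> hash Then
                    backup.Seek(CLng(index) * ChunkSize, SeekOrigin.Begin)
                    backup.Write(buffer, 0, bytesRead)
                End If

                index += 1
                bytesRead = source.Read(buffer, 0, ChunkSize)
            End While
        End Using
    End Sub

End Module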