Does using an archive alleviate the overhead of loading many small files?

I have a lot of small files to load, and I'm concerned about file IO performance. I'm debating whether I should use files that aggregate related data; for example, whether I should have "item.data" and "item.png", or just "item.data" (where the latter file contains the .png image data that would have been in "item.png").
The kicker is that I plan to load these files from an archive (either .7z or .zip), and I'm not sure whether I'm wasting my time. I'm not really worried about the absolute data bandwidth between the disk and memory; I'm really just out to minimize seeks. If these two files are stored in the same folder within the archive, will their data be contiguous? If not, will the performance be improved for some other reason?
I'm not overly concerned about compression rates for the small files; despite there being many (thousands) of these files, they'll pale in comparison to the other, larger files I'm working with. I'm really just worried about seek times.
Will storing and loading the files with an archive solve the problems I'm worried about? If it won't, what are some other approaches to help reduce seek times and improve file IO performance?

Related

Do files on your server influence website speed

If you have a webserver for your website, does it make a difference if there are a lot of other files on the server, even if they aren't used?
Example
An average webserver has an SSD with 500 GB of space. It's hosting a single website, but also holds a ton of other websites that are inactive. Though that single website is only 1 GB in size, the drive is 50% full. Will that influence site speed?
And does SSD vs HDD make a difference there, apart from the inherent speed difference between the two types?
Edit: I've read somewhere that the number of files on your server influences its speed, and it sounds logical given Andrei's answer about having to search through more files. I've had a discussion about it with someone, however, and he firmly states that it makes no difference.
Having other/unused files always has an impact on performance, but the question is how big that impact is. Usually it's not much, and you will not notice it at all.
But think about how files are read from disk. First, you need to locate the file's record in the filesystem's index (the file allocation table, or FAT, on FAT filesystems). Searching that index is similar to searching a tree-like data structure, since we have to deal with folders that contain other folders, and so on.
The more files you have, the bigger that index gets, and the search becomes correspondingly slower.
All in all, with memory caching and other tricks, this is not an issue.
You will notice the impact when you have thousands of files in one folder. That's why picture-related services that host large numbers of images usually store them in a folder structure that holds only a limited number of files per folder. For example, a file named '12345678.jpg' would be stored at the path '/1/2/3/4/5/12345678.jpg', alongside the other files named '12345000' through '12345999'. Thus only 1000 files would be saved per folder.
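A minimal Swift sketch of that sharding scheme, assuming the numeric file naming from the example (the function name is illustrative):

import Foundation

// Build a sharded path from a numeric filename so that no single folder
// ends up holding more than ~1000 files: the first `depth` characters of
// the name become nested folder names.
func shardedPath(for filename: String, depth: Int = 5) -> String {
    let stem = (filename as NSString).deletingPathExtension
    let folders = stem.prefix(depth).map { String($0) }
    return (folders + [filename]).joined(separator: "/")
}

// shardedPath(for: "12345678.jpg") == "1/2/3/4/5/12345678.jpg"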

Thumbnail storage strategy

I am working on a portion of an app that requires a "Photos" type presentation of multiple thumbnail images. The full size images are quite large, and generating the thumbnails every time is taking too long, so I am going to cache the thumbnails.
I am having a hard time determining how best to store the thumbnails on the filesystem once I create them. I can think of a few possibilities but I don't like any of them:
Save the thumbnail in the same directory as the original file, with _Thumb added to the filename (image.png and image_Thumb.png). This makes for a messy directory, and I would think performance would become a problem because so many different files have to be read at once.
Save the thumbnails in their own sub-directory, with the same filename as the original. I think that this is slightly cleaner, but I'm still opening lots of different files.
Save all of the thumbnails to a Thumbnails file. I think that this is commonly done in Windows and OS X? I like the idea because I can open one file and read multiple thumbnails from it, but I'm not sure how to store all of them in the same file and associate them with the original files. EDIT: I thought of using NSKeyedArchiver/unArchiver but from what I can find, anytime a thumbnail is added/removed, I would have to re-create the entire archive. Perhaps there is something that I am overlooking?
EDIT Store the thumbnails in a core data/sqlite database file. I have heard over the years that it is a bad idea to store images in a database file due to slow performance and the possibility of database corruption on writes that take a (relatively) long time to complete. Does anyone have experience using either one this way?
Any suggestions on the best approach to take?
I would go for the second option. On iDevices you use flash memory, so the performance penalty for accessing many files is very low compared to HDDs. You can also cache some thumbnails in memory to prevent reading the same file too often; SDWebImage's caching mechanism is a great example of how to do it.
The third option, using one file for everything, would probably mean using a database file. You could get some performance improvement there if you store uncompressed data, but you'll need to do performance tests: loading more data (the uncompressed form of the thumbs) might slow things down, trading CPU time for more storage access.
A combined approach would be to store the thumbnails as files, but in an uncompressed format (not .jpg, .png, etc.).
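A minimal sketch of the second option with the suggested in-memory layer, assuming PNG thumbnails kept in a Caches sub-directory (the class and method names are illustrative, not from any framework):

import UIKit

// Thumbnails live in their own "Thumbnails" sub-directory under Caches,
// and NSCache keeps recently used ones in memory so the same file isn't
// re-read from flash repeatedly.
final class ThumbnailStore {
    private let cache = NSCache<NSString, UIImage>()
    private let directory: URL

    init() throws {
        let caches = try FileManager.default.url(for: .cachesDirectory,
                                                 in: .userDomainMask,
                                                 appropriateFor: nil,
                                                 create: true)
        directory = caches.appendingPathComponent("Thumbnails", isDirectory: true)
        try FileManager.default.createDirectory(at: directory,
                                                withIntermediateDirectories: true)
    }

    // Returns a cached thumbnail, falling back to disk; nil if absent.
    func thumbnail(for name: String) -> UIImage? {
        if let hit = cache.object(forKey: name as NSString) { return hit }
        let url = directory.appendingPathComponent(name)
        guard let image = UIImage(contentsOfFile: url.path) else { return nil }
        cache.setObject(image, forKey: name as NSString)
        return image
    }

    // Stores a freshly generated thumbnail both on disk and in memory.
    func store(_ image: UIImage, name: String) throws {
        guard let data = image.pngData() else { return }
        try data.write(to: directory.appendingPathComponent(name), options: .atomic)
        cache.setObject(image, forKey: name as NSString)
    }
}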
A fourth option worth considering, as long as the thumbnails are reasonably small: save them in CoreData.

How to deal with thousands of small audio files?

I need to implement an app that has a feature to play sounds. Each sound will be a single spoken word, and the expected number of sounds is about one thousand. So the simplest solution would be to store those sounds as sound files, each word in a separate file, and to play them on demand. Would there be any potential problems with such a large number of files?
No problem with that many files, but they will take up more space than just the total of their sizes. Each file will fill up a whole number of allocation blocks on the device. On average you will waste half a block per file (as a rule of thumb), unless all your files are significantly smaller than one block, in which case you will always use 1,000 blocks (one per file) and waste 1000 * (blocksize - average file size).
Things you could do:
Concatenate the files into one big file, store the start and length of each subfile, and either read the chunk into memory or copy it to a temporary file (see the sketch after this list).
Drop the files in a database as BLOB fields for easier retrieval. This won't save space, but may make your code simpler or more reliable.
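A minimal Swift sketch of the first option; buildPack, PackEntry, and the in-memory index are illustrative, and a real app would persist the index (e.g. as a property list) next to the pack file:

import Foundation

// Concatenate the small files into one pack file, recording each
// subfile's offset and length as we go.
struct PackEntry { let offset: UInt64; let length: Int }

func buildPack(from fileURLs: [URL], to packURL: URL) throws -> [String: PackEntry] {
    try Data().write(to: packURL)                  // create/truncate the pack file
    let out = try FileHandle(forWritingTo: packURL)
    defer { out.closeFile() }

    var index: [String: PackEntry] = [:]
    for url in fileURLs {
        let data = try Data(contentsOf: url)
        index[url.lastPathComponent] = PackEntry(offset: out.offsetInFile,
                                                 length: data.count)
        out.write(data)
    }
    return index
}

// Loading one subfile back is then a single seek plus a single read.
func loadSubfile(named name: String, from packURL: URL,
                 index: [String: PackEntry]) throws -> Data? {
    guard let entry = index[name] else { return nil }
    let handle = try FileHandle(forReadingFrom: packURL)
    defer { handle.closeFile() }
    handle.seek(toFileOffset: entry.offset)
    return handle.readData(ofLength: entry.length)
}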
I don't think you need to make your own caching mechanism. Most likely iOS has a system-wide cache that does a far better job. That should only be relevant if you experience performance issues and need to get much shorter load times. In that case, perhaps consider using blocks for loading and dispatching the playing, as that's an easier way to hide the load latency and avoid UI freezes (see the sketch below).
If your audio is uncompressed, the App Store will report the compressed size. If that differs a lot from the unpacked size, some (nitpicking) customers will definitely notice and complain, as they think the advertised size is the install size. I know from personal experience. They will generally not take a technical answer for an answer, and may even bypass talking to you and just downvote you based on this. I s#it you not.
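A rough sketch of that blocks-based idea using GCD, assuming AVAudioPlayer and a caller-supplied URL (the player must be retained, e.g. in a property, or playback stops):

import AVFoundation

// Read the clip's bytes off the main queue, then hop back to the main
// queue to start playback, so the UI never blocks on disk access.
var player: AVAudioPlayer?   // retained so the player isn't deallocated mid-clip

func playClip(at url: URL) {
    DispatchQueue.global(qos: .userInitiated).async {
        guard let data = try? Data(contentsOf: url) else { return }
        DispatchQueue.main.async {
            player = try? AVAudioPlayer(data: data)
            player?.play()
        }
    }
}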
You should be fine storing 1000 audio clip files within the IPA, but it is important to take note of the space requirements and organisation.
Also take into consideration that accessing the disk is slower than memory and costs battery power, so it may be ideal to load the most frequently used audio clips into memory.
If you can afford it, use FMOD, which I believe can extract audio from various compression schemes. If you just want to handle all those files yourself, create a .zip file and extract them on the fly using libz (the iOS library libz.dylib).
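A minimal sketch of that preloading idea, assuming .caf clips bundled with the app (ClipCache, the method names, and the file extension are assumptions):

import AVFoundation

// Keep the most frequently used clips' raw bytes in memory so playback
// needs no disk access at all; the list of "frequent" names is supplied
// by the caller.
final class ClipCache {
    private var clips: [String: Data] = [:]

    func preload(names: [String], in bundle: Bundle = .main) {
        for name in names {
            guard let url = bundle.url(forResource: name, withExtension: "caf"),
                  let data = try? Data(contentsOf: url) else { continue }
            clips[name] = data
        }
    }

    // Returns a ready-to-play player backed purely by memory.
    func player(for name: String) -> AVAudioPlayer? {
        guard let data = clips[name] else { return nil }
        return try? AVAudioPlayer(data: data)
    }
}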

Is there any performance difference between creating an NSFileHandle for a large versus a small file?

This question strikes me as almost silly, but I just want to sanity check myself. For a variety of reasons, I'm welding together a bunch of files into a single megafile before packing this as a resource in my iOS app. I'm then using NSFileHandle to open the file, seek to the right place, and read out just the bytes I want.
Is there any performance difference between doing it this way and reading loose files? Or, supposing I could choose to use just one monolithic megafile, versus, say, 10 medium-sized (but still joined) files, is there any performance difference between "opening" the large versus a smaller file?
Since I know exactly where to seek to, and I'm reading just the bytes I want, I don't see how there could be a difference. But, hey, stranger things have turned out to be true. Thanks in advance!
There could be a difference if it was an extremely large number of files. Every open file uses up resources in memory (file handles, and the like), and on some storage devices, a file will take up an entire block even if it doesn't fill it. That can lead to wasted space in extreme cases. But in practice, it probably won't be a problem. To know for sure, you can profile your code and see if it's faster one way vs. the other, and see what sort of space it takes up on a typical device.
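A tiny timing harness for that profiling advice, using FileHandle's classic non-throwing read API; the paths, offset, and length are placeholders to swap for real values:

import Foundation

// Measure reading one slice out of the megafile versus reading an
// equivalent loose file, and print the elapsed wall-clock time.
func measure(_ label: String, _ block: () throws -> Void) rethrows {
    let start = CFAbsoluteTimeGetCurrent()
    try block()
    print(label, CFAbsoluteTimeGetCurrent() - start, "seconds")
}

try measure("slice of megafile") {
    let handle = try FileHandle(forReadingFrom: URL(fileURLWithPath: "/path/to/megafile"))
    defer { handle.closeFile() }
    handle.seek(toFileOffset: 1_048_576)    // hypothetical offset of the wanted bytes
    _ = handle.readData(ofLength: 65_536)   // hypothetical length
}

try measure("equivalent loose file") {
    _ = try Data(contentsOf: URL(fileURLWithPath: "/path/to/loose.bin"))
}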

Considerations for saving data to ONE file or MULTIPLE?

I am going to be saving data with DPAPI encryption. I am not sure whether I should just have one big file with all the data, or whether I should break the data up into separate files where every file is its own record. I suspect the entire dataset will be less than 10 MB, so I am not sure whether it's worth breaking it down into a few hundred separate files, or whether I should just keep it as one file.
Will it take a long time to decrypt 10 MB of data?
For 10 megabytes, I wouldn't worry about splitting it up. The cost of encrypting/decrypting a given volume of data will be pretty much the same whether it's one big file or a group of small files. If you needed the ability to selectively decrypt individual records, as opposed to all at once, splitting the file might be useful.
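DPAPI is a Windows API, so as a rough stand-in, here is a Swift/CryptoKit timing sketch of that same-total-cost point, comparing one 10 MB blob against 200 small records (AES-GCM is substituted for DPAPI; all sizes are placeholders):

import CryptoKit
import Foundation

// Encrypt and decrypt the same total volume two ways and compare times.
let key = SymmetricKey(size: .bits256)
let oneBlob = Data(count: 10 * 1024 * 1024)                        // one 10 MB record
let records = Array(repeating: Data(count: 50 * 1024), count: 200) // 200 x 50 KB

func time(_ label: String, _ block: () throws -> Void) rethrows {
    let start = CFAbsoluteTimeGetCurrent()
    try block()
    print(label, CFAbsoluteTimeGetCurrent() - start, "seconds")
}

try time("one 10 MB file") {
    let sealed = try AES.GCM.seal(oneBlob, using: key)
    _ = try AES.GCM.open(sealed, using: key)
}

try time("200 small records") {
    for record in records {
        let sealed = try AES.GCM.seal(record, using: key)
        _ = try AES.GCM.open(sealed, using: key)
    }
}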
If you can't know in advance what hardware your app is going to run on, make it scalable. It can then run from 10 parallel floppy drives if it's too slow reading from 1.
If your scope is limited to high-performance computers, and the file size is not likely to grow within the next 10 years, put it in one file.