Thumbnail storage strategy - Objective-C

I am working on a portion of an app that requires a "Photos" type presentation of multiple thumbnail images. The full size images are quite large, and generating the thumbnails every time is taking too long, so I am going to cache the thumbnails.
I am having a hard time determining how best to store the thumbnails on the filesystem once I create them. I can think of a few possibilities but I don't like any of them:
Save the thumbnail in the same directory as the original file, with _Thumb added to the filename (image.png and image_Thumb.png). This makes for a messy directory, and I would expect performance to suffer from reading so many separate files at once.
Save the thumbnails in their own sub-directory, with the same filename as the original. I think that this is slightly cleaner, but I'm still opening lots of different files.
Save all of the thumbnails to a Thumbnails file. I think that this is commonly done in Windows and OS X? I like the idea because I can open one file and read multiple thumbnails from it, but I'm not sure how to store all of them in the same file and associate them with the original files. EDIT: I thought of using NSKeyedArchiver/unArchiver but from what I can find, anytime a thumbnail is added/removed, I would have to re-create the entire archive. Perhaps there is something that I am overlooking?
EDIT Store the thumbnails in a core data/sqlite database file. I have heard over the years that it is a bad idea to store images in a database file due to slow performance and the possibility of database corruption on writes that take a (relatively) long time to complete. Does anyone have experience using either one this way?
Any suggestions on the best approach to take?

I would go for the second option. iDevices use flash storage, so the performance penalty for accessing many files is very low compared to HDDs. You can also cache some thumbnails in memory to avoid reading the same file over and over; SDWebImage's caching mechanism is a great example of how to do it.
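A minimal sketch of that approach, assuming PNG thumbnails in a Thumbnails sub-directory under Caches (the ThumbnailStore class and its method are made up for illustration; SDWebImage does all of this far more robustly):

#import <UIKit/UIKit.h>

// Two-level thumbnail cache: NSCache in memory, PNG files on disk.
@interface ThumbnailStore : NSObject
- (UIImage *)thumbnailForImageNamed:(NSString *)name
                          generator:(UIImage *(^)(void))generate;
@end

@implementation ThumbnailStore {
    NSCache *_memoryCache;
    NSString *_directory;
}

- (instancetype)init {
    if ((self = [super init])) {
        _memoryCache = [NSCache new];
        NSString *caches = NSSearchPathForDirectoriesInDomains(
            NSCachesDirectory, NSUserDomainMask, YES).firstObject;
        _directory = [caches stringByAppendingPathComponent:@"Thumbnails"];
        [[NSFileManager defaultManager] createDirectoryAtPath:_directory
                                  withIntermediateDirectories:YES
                                                   attributes:nil
                                                        error:NULL];
    }
    return self;
}

- (UIImage *)thumbnailForImageNamed:(NSString *)name
                          generator:(UIImage *(^)(void))generate {
    // 1. Memory cache: cheapest, avoids touching the file system at all.
    UIImage *cached = [_memoryCache objectForKey:name];
    if (cached) return cached;

    // 2. Disk cache: one small file per thumbnail in the sub-directory.
    NSString *path = [_directory stringByAppendingPathComponent:name];
    UIImage *onDisk = [UIImage imageWithContentsOfFile:path];
    if (onDisk) {
        [_memoryCache setObject:onDisk forKey:name];
        return onDisk;
    }

    // 3. Cache miss: run the expensive resize once, then persist it.
    UIImage *fresh = generate();
    [UIImagePNGRepresentation(fresh) writeToFile:path atomically:YES];
    [_memoryCache setObject:fresh forKey:name];
    return fresh;
}
@end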
The third option - using one file - would probably mean using a database file. You could see some performance improvement there if you store uncompressed data, but you'll need to run performance tests: loading more data (the uncompressed form of the thumbs) might slow things down, since you're saving CPU at the cost of more storage access.
A combined approach would be to store the thumbnails as individual files but in an uncompressed format (not .jpg, .png, etc.).
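As a sketch of what "uncompressed" could mean in practice (dumping BGRA pixels is an assumption here, and you would need to store the width and height alongside the data to read it back):

#import <UIKit/UIKit.h>

// Render a thumbnail into a raw BGRA buffer; loading this back later
// skips PNG/JPEG decoding at the cost of larger files.
NSData *RawBGRAData(UIImage *thumb) {
    CGImageRef img = thumb.CGImage;
    size_t w = CGImageGetWidth(img), h = CGImageGetHeight(img);
    size_t bytesPerRow = w * 4;  // 4 bytes per pixel
    NSMutableData *pixels = [NSMutableData dataWithLength:bytesPerRow * h];
    CGColorSpaceRef space = CGColorSpaceCreateDeviceRGB();
    CGContextRef ctx = CGBitmapContextCreate(pixels.mutableBytes, w, h, 8,
        bytesPerRow, space,
        kCGImageAlphaPremultipliedFirst | kCGBitmapByteOrder32Little);
    CGContextDrawImage(ctx, CGRectMake(0, 0, w, h), img);
    CGContextRelease(ctx);
    CGColorSpaceRelease(space);
    return pixels;
}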
A fourth option worth considering, as long as the thumbnails are reasonably small: save them in Core Data.

Related

Does using an archive alleviate the overhead of loading many small files?

I have a lot of small files to load, and I'm concerned about file IO performance. I'm debating whether or not I should be using files that aggregate the data that will be related; for example, whether I should have "item.data" and "item.png", or just "item.data" (where the latter file contains the .png image data that would have been in "item.png").
The kicker is, I plan to load these files from an archive (either .7z or .zip), and I'm not sure whether or not I'm wasting my time. I'm not really worried about the absolute data bandwidth between the disc and memory, I'm really just out to minimize seeks. If these two files are stored in the same folder within the archive, will their data be contiguous? If not, will the performance be improved for some other reason?
I'm not overly concerned about compression rates for the small files; despite having many (thousands) of these files, they'll pale in comparison to the other, larger files I'm working with. I'm really just worried about seek times.
Will storing and loading the files from an archive solve the problems I'm worried about? If it won't, what are some other approaches to alleviate seek times and improve file IO performance?

How to deal with thousands of small audio files?

I need to implement an app that has a feature to play sounds. Each sound will be a spoken word, and the number of expected sounds is about one thousand. The simplest solution would be to store each word as a separate sound file and play them on demand. Would there be any potential problems with such a large number of files?
No problem with that many files, but they will take up more space than just the total of their sizes. Each file occupies a whole number of allocation blocks on the device, so on average you waste half a block per file (as a rule of thumb); if all your files are significantly smaller than one block, you will use exactly 1,000 blocks (one per file) and waste 1000 * (block size - average file size). For example, 1,000 files averaging 2 KB on a filesystem with 4 KB blocks occupy 4 MB on disk, roughly half of it wasted.
Things you could do:
Concatenate the files into one big file, store the start and length of each subfile, and either read the chunk into memory or copy it to a temporary file (see the sketch after this list).
Drop the files in a database as BLOB fields for easier retrieval. This won't save space, but may make your code simpler or more reliable.
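A minimal sketch of the concatenation idea, assuming an {offset, length} index per word built offline (the function name and the plist-style index format are illustrative assumptions):

#import <Foundation/Foundation.h>

// Pull one word's sound data out of a single concatenated file, using a
// pre-built index of byte offsets and lengths.
NSData *SoundDataForWord(NSString *word,
                         NSString *bigFilePath,
                         NSDictionary *index) {
    // index is assumed to look like:
    //   @{ @"hello" : @{ @"offset" : @1024, @"length" : @4096 }, ... }
    NSDictionary *entry = index[word];
    if (!entry) return nil;

    NSFileHandle *fh = [NSFileHandle fileHandleForReadingAtPath:bigFilePath];
    [fh seekToFileOffset:[entry[@"offset"] unsignedLongLongValue]];
    NSData *chunk = [fh readDataOfLength:[entry[@"length"] unsignedIntegerValue]];
    [fh closeFile];
    return chunk;
}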
I don't think you need to make your own caching mechanism. Most likely iOS has a system-wide cache that does a far better job. Caching should only become relevant if you experience performance issues and need much shorter load times. In that case, perhaps consider using blocks for loading and for dispatching playback, as that's an easier way to hide the load latency and avoid UI freezes.
If your audio is uncompressed, the App Store will report the compressed size. If that differs a lot from the unpacked size, some (nitpicking) customers will definitely notice and complain, as they think the advertised size is the install size. I know this from personal experience. They will generally not take a technical answer for an answer, and may even bypass talking to you and just downvote you based on this. I s#it you not.
You should be fine storing 1000 audio clip files within the IPA, but it is important to pay attention to the space requirements and organisation.
Also take into consideration that accessing the disk is slower than memory and costs battery power, so it may be ideal to load the most frequently used audio clips into memory.
If you can afford it, use FMOD, which I believe can extract audio from various compression schemes. If you just want to handle all those files yourself, create a .zip file and extract them on the fly using libz (the iOS library libz.dylib).

Store images in sqlite or just a reference to it?

I have made a couple of apps using Core Data, and I was storing images in SQLite, but somewhere I read that this is bad. I've searched the net, but all I've found is this suggestion:
image size < 100 KB: store in the same table as the relevant data
image size < 1 MB: store in a separate table attached via a relationship, to avoid loading it unnecessarily
image size > 1 MB: store on disk and reference it inside of Core Data
So my question is: what are the pros and cons of saving an image in the SQLite DB as NSData versus saving the image in the file system and storing just a reference to it?
Apple provide some guidance on this topic in their guide on Core Data Performance. In general, although SQLite scales pretty well and can handle databases that are many gigabytes in size with ease, large binary blobs are not queryable or indexable, and inflate the size of the database with little return.
If you're targeting iOS 5 and above, you can set the "Allows External Binary Data Storage" flag on your attributes that contain such data, and Core Data will automatically store them separately on the file system (if it deems appropriate) and automatically manage the link to that data in your data store.
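Normally you just tick that flag on the attribute in the Xcode model editor; as a sketch, the same setting can be applied when building a managed object model in code (the imageData attribute name here is hypothetical):

#import <CoreData/CoreData.h>

// Binary attribute whose large values Core Data may store as separate
// files on disk rather than inside the SQLite store itself.
NSAttributeDescription *MakeImageAttribute(void) {
    NSAttributeDescription *attr = [NSAttributeDescription new];
    attr.name = @"imageData";                    // hypothetical attribute name
    attr.attributeType = NSBinaryDataAttributeType;
    attr.allowsExternalBinaryDataStorage = YES;  // Core Data decides per value
    return attr;
}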
Benefits: I'm not so sure, but I can think of a couple of benefits of storing just links in the database.
Native-code interaction with the file system would be faster than fetching images through SQLite (overall faster performance).
A clean and scalable database (with size being the concern, migration would be easier).
You may want to check the answers I got on a similar, if not identical, topic, because like you I only found people giving advice; no one really provided benchmarks or a real technical answer.
Provide example for why it is not advisable to store images in CoreData?
Beside that, after building my app with all images in the DB and shipping it to the App Store,
I can tell you that things are easier if you use iCloud. If you use small images in a UITableView with thumbnail icons, you can completely avoid asynchronous image loading.
One piece of advice: provide an entity for each image size, rather than storing them all in a set attached to the main entity.
The only downside I found with iCloud is the larger transaction log generated each time I change an image. But in my case the images are small and the need to update them is rare. Also, iCloud+Core Data is quite buggy at the moment, so I removed it before shipping; for now it is really not a problem for me.

Performance implications of storing 600,000+ images in the same folder (NTFS)

I need to store about 600,000 images on a web server that uses NTFS. Am I better off storing images in 20,000-image chunks in subfolders? (Windows Server 2008)
I'm concerned about incurring operating system overhead during image retrieval
Go for it. As long as you have an external index and a direct file path to each file, without listing the contents of the directory, you are fine.
I have a folder that is over 500 GB in size with over 4 million folders (which hold more folders and files). I have somewhere on the order of 10 million files in total.
If I accidentally open this folder in Windows Explorer, it gets stuck at 100% CPU usage (for one core) until I kill the process. But as long as you refer to a file/folder directly, performance is great (meaning I can access any of those 10 million files with no overhead).
NTFS does keep directory indexes, so it should be alright from the application level:
opening files by name, deleting, renaming, etc. programmatically should work nicely.
But the problem is always tools. Third party tools (such as MS explorer, your backup tool, etc) are likely to suck or at least be extremely unusable with large numbers of files per directory.
Anything that does a directory scan is likely to be quite slow, and worse, some of these tools have poor algorithms which don't scale to even modest (10k+) numbers of files per directory.
NTFS folders store an index file with links to all their contents. With a large number of images, that file is going to grow a lot and impact your performance negatively. So yes, on that argument alone you are better off storing chunks in subfolders. Fragmentation inside indexes is a pain.

Does storing a lot of images in a single directory slow down image retrieval?

If I have a site where users can upload as many images as they want (think Photobucket-like), what is the best way to set up file storage? (Also, all uploads get a unique random timestamp.)
site root
--username
----image1.jpg
----image2.jpg
----image3.jpg
--anotheruser
----image1.jpg
----image2.jpg
----image3.jpg
...
or
siteroot
--uploads
----image1.jpg
----image2.jpg
----image3.jpg
----image4.jpg
----image6.jpg
...
----image50000.jpg
I think the first method is more organized, but the second method seems to be standard (keeping all uploads in the same dir). I wonder whether retrieving an image would be slower if there are thousands of images in the same directory.
--- edit ---
Thanks for the great answers so far.
Also, I will be creating thumbnails, so I would have to insert that directory somewhere too... or create a naming convention such as thumb_whatever.jpg. So many different ways to do this.
Yes, disk space will be a problem, but for now I am concerned with retrieval time. When I have to output an image to the browser, if that image is in a directory with 10,000 other images, I am worried about how slow that could get.
The number of files in a directory should have no effect at all on the time required to read a file's data - but it can massively affect the amount of time needed to find the file before you can start to read it.
The exact breakpoints where the major issues start up will vary from filesystem type to filesystem type, but, in general, if you're talking about a few hundred files, you don't much need to worry about it. If you're talking about a few thousand, it's worth thinking about and maybe doing a little benchmarking to see how your filesystem and hardware handle it. If you're talking about tens of thousands of files, then you really need to start breaking things up. (I once had a Linux/e2fs print server where CUPS wasn't deleting its job control files after it finished printing and it got up around 100,000 files in one directory. Just getting a directory listing took over half an hour before it even started to display any filenames.)
Separating them by user name may not be the best choice, though, since you'll likely have a lot of users uploading very few images and perhaps a couple who upload hundreds or thousands of images, potentially creating access time issues in those users' storage directories. The bigger problem in that scenario is that you'd likely end up (assuming a successful site) with thousands or tens of thousands of users and a large number of subdirectories is just as bad as a large number of files for slowing down access to your data.
Since you're going to have a timestamp on them, what I would probably do is put them into subdirectories based on the last three digits of the timestamp. That will distribute the files relatively evenly across 1000 subdirectories and should keep the number of files in each directory reasonably small. (Using the first three digits would cause one directory to be filled before moving to the next instead of distributing them evenly.) If you're still ending up with too many files in each subdirectory (which would likely mean you're dealing with several million uploaded images), you could add a second level for the previous three digits, so upload-1234567890.jpg would end up at /567/890/upload-1234567890.jpg.
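A minimal sketch of that scheme, assuming an integer timestamp (the helper name is made up):

#import <Foundation/Foundation.h>

// upload-1234567890.jpg -> "890/upload-1234567890.jpg" (one level) or
// "567/890/upload-1234567890.jpg" (two levels), per the scheme above.
NSString *ShardedPathForTimestamp(long long ts, BOOL twoLevels) {
    NSString *name = [NSString stringWithFormat:@"upload-%lld.jpg", ts];
    NSString *last3 = [NSString stringWithFormat:@"%03lld", ts % 1000];
    if (!twoLevels) return [last3 stringByAppendingPathComponent:name];
    NSString *prev3 = [NSString stringWithFormat:@"%03lld", (ts / 1000) % 1000];
    return [[prev3 stringByAppendingPathComponent:last3]
                   stringByAppendingPathComponent:name];
}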
The answer to that is "maybe". The file retrieval itself may be fine, but if you need to do any maintenance on the folder, it would be a huge headache as processes attempt to enumerate the directory listings.
What would improve the situation would be a number of sub directories under the images folder (or two levels, depending on how many images you're looking at storing), so you have a hierarchy like this:
siteroot
-- uploads
---- a
---- b
---- c
:
---- z
...and then store files based on their first letter (so all images with names starting with 'a' go into the folder 'a'). You could use a two- or three-letter prefix (aa, ab, ac, ad ..., ba, bb, bc ..., zx, zy, zz) and possibly have a hierarchy under that as well, so you split files across a number of folders based on the first four characters of the name.
If files are then assigned a random alphanumeric name, this will ensure files are spread evenly across all the folders (given a large enough sample size).
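A sketch of that prefix scheme, assuming random alphanumeric filenames (the helper name and its depth parameter are illustrative):

#import <Foundation/Foundation.h>

// Shard by the first `depth` characters of the filename, one directory
// level per character: PrefixShardedPath(@"abc123.jpg", 2) -> "a/b/abc123.jpg"
NSString *PrefixShardedPath(NSString *filename, NSUInteger depth) {
    NSMutableString *path = [NSMutableString string];
    for (NSUInteger i = 0; i < depth && i < filename.length; i++) {
        [path appendFormat:@"%C/", [filename characterAtIndex:i]];
    }
    [path appendString:filename];
    return path;
}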
You might want to consider a mix of your option (1) and splitting images over a hierarchy as I've described above. That would ensure that if a single user does upload lots of files, then you're covered. Similarly, if you're looking at a lot of user directories, the same principle applies to ensure you don't have 1,000,000 user directories under a single parent.
Try using MongoDB. It is a document database which also allows you to store binary data. It's very fast and efficient, and supports sharding (spreading data over multiple machines) out of the box.
You really don't want to have folders and folders full of files. Managing these folders takes forever, and changing the naming/dividing scheme later is a nightmare. Furthermore, if you run out of disk space you have a problem, and for load balancing, having one hard disk full of files is not efficient.
I often use a schema like this:
uploads/(#id%1000)/img_#id.jpg
where #id is, of course, the photo's id number (an integer) as stored in the database. That provides a simple schema based only on the photo's id.
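A sketch of that schema (the function name is made up): id 123456 maps to uploads/456/img_123456.jpg.

#import <Foundation/Foundation.h>

// Map a database photo id to its on-disk path using id % 1000 as the shard.
NSString *PathForPhotoID(NSInteger photoID) {
    return [NSString stringWithFormat:@"uploads/%ld/img_%ld.jpg",
            (long)(photoID % 1000), (long)photoID];
}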
It depends on the file system. For example, FAT16 tends to be quite slow if you have more than 512 files in a directory. FAT32 and NTFS don't have the same limitation but also run much more slowly if you have an extremely large number of files. Even if you're running one of the more robust Linux file systems, you're still going to be able to parse directories more quickly if they're smaller.
I would definitely go with the first option - splitting the images into directories by user.
I think that subdirectories under the uploads directory would be the best.
site root
--uploads
----username
------image1.jpg
------image2.jpg
------image3.jpg
----anotheruser
------image1.jpg
------image2.jpg
------image3.jpg
...
Depending on the host OS, having too many files in one directory could cause some headaches and compatibility problems. Also, depending on how you are getting the image list, it could cause performance issues.
Plus, option 2 would be a mess. :)