Saving large objects to file - objective-c

I'm working on a project in Objective-C where I need to work with a large quantity of data stored in an NSDictionary (at its largest it's around ~2 GB in RAM). After all the computations that I perform on it, it seems like it would be quicker to save/load the data when needed (versus re-parsing the original file).
So I started to look into saving large amounts of data. I've tried using NSKeyedArchiver and [NSDictionary writeToFile:atomically:], but both failed with malloc errors (Can not allocate ____ bytes).
I've looked around SO, Apple's dev forums, and Google, but was unable to find anything. I'm wondering if it might be better to create the file bit by bit instead of all at once, but I can't find any way to add to an existing file. I'm not completely opposed to saving to a bunch of small files, but I would much rather use one big file.
Thanks!
Edited to include more information: I'm not sure how much overhead NSDictionary adds, since I don't keep all of the information from the text files. I have a 1.5 GB file (of which I keep about half), and it turns out to be around 900 MB to 1 GB in RAM. There is more data that I'll need to add eventually, but it will be constructed with references to what's already loaded into memory; it shouldn't double the size, but it may come close.
The data is all serial, and could be separated in storage, but it all needs to be in memory for execution. I currently have integer/string pairs, and will eventually end up with string/string pairs (with every value also being a key for a different set of strings, so the final storage requirement will be the same strings that I currently have, plus a bunch of references).
In the end, I will need to associate ~3 million strings with some other set of strings. However, the only important thing is the relationship between those strings; I could hash all of them, but NSNumber (as NSDictionary needs objects) might give me just as much overhead.

NSDictionary isn't going to give you the scalable storage that you're looking for, at least not for persistence. You should implement your own type of data structure/serialisation process.
Have you considered using an embedded SQLite database? That way you can process the data while only loading a fragment of the data structure at a time.
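For illustration, a minimal sketch of what that could look like with the SQLite C API from Objective-C: the pairs are written row by row inside a transaction, so nothing close to 2 GB has to be serialized in one shot. The table name, schema, and lack of error handling are simplifications for the example.

```objectivec
#import <Foundation/Foundation.h>
#import <sqlite3.h>

// Sketch: persist key/value pairs into an embedded SQLite database so they
// can later be queried back selectively instead of loading one giant archive.
// Table and column names here are made up for illustration.
static void savePairs(NSDictionary *pairs, NSString *path) {
    sqlite3 *db = NULL;
    sqlite3_open([path fileSystemRepresentation], &db);
    sqlite3_exec(db, "CREATE TABLE IF NOT EXISTS pairs (k TEXT PRIMARY KEY, v TEXT)",
                 NULL, NULL, NULL);
    sqlite3_exec(db, "BEGIN TRANSACTION", NULL, NULL, NULL);

    sqlite3_stmt *stmt = NULL;
    sqlite3_prepare_v2(db, "INSERT OR REPLACE INTO pairs (k, v) VALUES (?, ?)",
                       -1, &stmt, NULL);
    for (id key in pairs) {
        // -description covers both NSString and NSNumber keys/values.
        sqlite3_bind_text(stmt, 1, [[key description] UTF8String], -1, SQLITE_TRANSIENT);
        sqlite3_bind_text(stmt, 2, [[pairs[key] description] UTF8String], -1, SQLITE_TRANSIENT);
        sqlite3_step(stmt);
        sqlite3_reset(stmt);
    }
    sqlite3_finalize(stmt);

    sqlite3_exec(db, "COMMIT", NULL, NULL, NULL);
    sqlite3_close(db);
}
```

Loading back is then just a SELECT (optionally with a WHERE clause), so only the fragment you actually need has to be materialized in memory.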

If you can, rebuilding your application in 64-bit mode will give you a much larger heap space.
If that's not an option for you, you'll need to create your own data structure and define your own load/save routines that don't allocate as much memory.
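If you do go the roll-your-own route, one hedged sketch of what an incremental save could look like: stream one length-prefixed record per pair through NSFileHandle, so the full data set never has to exist as a single archive in memory. The record format and file handling here are invented for the example (no error handling, and the length prefixes are written in native byte order, which is only safe if the file never leaves the device).

```objectivec
#import <Foundation/Foundation.h>

// Sketch: append one length-prefixed UTF-8 record per key/value pair instead
// of archiving the entire dictionary at once. The format is made up here.
static void appendPair(NSFileHandle *handle, NSString *key, NSString *value) {
    NSData *keyData = [key dataUsingEncoding:NSUTF8StringEncoding];
    NSData *valueData = [value dataUsingEncoding:NSUTF8StringEncoding];
    uint32_t keyLength = (uint32_t)keyData.length;
    uint32_t valueLength = (uint32_t)valueData.length;

    [handle writeData:[NSData dataWithBytes:&keyLength length:sizeof(keyLength)]];
    [handle writeData:keyData];
    [handle writeData:[NSData dataWithBytes:&valueLength length:sizeof(valueLength)]];
    [handle writeData:valueData];
}

static void saveIncrementally(NSDictionary *pairs, NSString *path) {
    [[NSFileManager defaultManager] createFileAtPath:path contents:nil attributes:nil];
    NSFileHandle *handle = [NSFileHandle fileHandleForWritingAtPath:path];
    for (NSString *key in pairs) {
        appendPair(handle, key, pairs[key]);
    }
    [handle closeFile];
}
```

Because NSFileHandle also supports -seekToEndOfFile, the same approach covers the "add to an existing file bit by bit" part of the original question: open the handle, seek to the end, and append more records later.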

Related

Why can't I store un-serialized data structures on disk the same way I can store them in memory?

Firstly, I am assuming that data structures, like a hash map for example, can only be stored in memory but not on disk unless they are serialized. I want to understand why not.
What is holding us back from dumping a block of memory which stores the data structure directly into disk without any modifications?
Something like JSON could be thought of as a "serialized" Python dictionary. We can very well store JSON in files, so why not a dict?
You may ask how you would represent non-string values like booleans or objects on disk. I can argue "the same way you store them in memory". Am I missing something here?
Naming a few problems:
Big-endian vs. little-endian byte order makes reading data from disk depend on the CPU architecture, so if you just dumped it you wouldn't be able to read it back on a different device.
Items are not contiguous in memory: a list (or dictionary), for example, only contains pointers to things that exist "somewhere" else in memory. You can only dump contiguous memory; otherwise you are only storing the memory locations the data happened to occupy, which won't be the same when you load the program again (see the sketch after this list).
The way structures are laid out in memory can change between two compiled versions of the same program, so if you just recompile your application you may get different layouts for structures in memory, and you've just lost your data.
Different versions of the same application may wish to update the shape of the structures to allow extra functionality; this won't be possible if the data's shape on disk is the same as in memory. (Which is one of the reasons why you shouldn't be using pickle for portable data storage, despite it serializing in-memory objects.)
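A rough C/Objective-C illustration of the pointer problem above (the struct, the values, and the file paths are all made up for the example): dumping the raw bytes of a structure that contains a pointer stores the pointer's address, not the data it points to, which is exactly why a serializer has to walk the structure and write out values instead.

```objectivec
#import <Foundation/Foundation.h>

typedef struct {
    int count;
    char *name;   // points at memory that lives "somewhere else"
} Record;

int main(void) {
    Record record = { 42, "example" };

    // A raw dump writes the integer plus the *address* held in `name`,
    // not the characters "example". On the next run (or another machine)
    // that address is meaningless, so the data cannot be recovered.
    NSData *rawDump = [NSData dataWithBytes:&record length:sizeof(record)];
    [rawDump writeToFile:@"/tmp/raw-dump.bin" atomically:YES];

    // A serializer walks the structure and writes the values themselves in a
    // defined, architecture-independent format, so they can be rebuilt later.
    NSDictionary *serialized = @{ @"count": @(record.count), @"name": @(record.name) };
    NSData *json = [NSJSONSerialization dataWithJSONObject:serialized options:0 error:NULL];
    [json writeToFile:@"/tmp/serialized.json" atomically:YES];
    return 0;
}
```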

Is there any performance difference between creating an NSFileHandle for a large versus a small file?

This question strikes me as almost silly, but I just want to sanity check myself. For a variety of reasons, I'm welding together a bunch of files into a single megafile before packing this as a resource in my iOS app. I'm then using NSFileHandle to open the file, seek to the right place, and read out just the bytes I want.
Is there any performance difference between doing it this way and reading loose files? Or, supposing I could choose to use just one monolithic megafile, versus, say, 10 medium-sized (but still joined) files, is there any performance difference between "opening" the large versus a smaller file?
Since I know exactly where to seek to, and I'm reading just the bytes I want, I don't see how there could be a difference. But, hey -- stranger things have turned out to be true. Thanks in advance!
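For reference, a hedged sketch of the seek-and-read approach being described (the resource name, offset, and length are placeholders, and there is no error handling):

```objectivec
#import <Foundation/Foundation.h>

// Sketch: open the packed megafile from the app bundle, seek to a known
// offset, and read out only the bytes belonging to one embedded sub-file.
static NSData *readChunkAtOffset(unsigned long long offset, NSUInteger length) {
    NSString *path = [[NSBundle mainBundle] pathForResource:@"megafile" ofType:@"bin"];
    NSFileHandle *handle = [NSFileHandle fileHandleForReadingAtPath:path];
    [handle seekToFileOffset:offset];
    NSData *chunk = [handle readDataOfLength:length];
    [handle closeFile];
    return chunk;
}
```

Whether the bytes come from one megafile or ten medium-sized ones, each read is the same seek plus the same number of bytes, so any measurable difference would come from filesystem-level overhead (open handles, block allocation) rather than from the reads themselves, as the answer below notes.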
There could be a difference if it was an extremely large number of files. Every open file uses up resources in memory (file handles, and the like), and on some storage devices, a file will take up an entire block even if it doesn't fill it. That can lead to wasted space in extreme cases. But in practice, it probably won't be a problem. To know for sure, you can profile your code and see if it's faster one way vs. the other, and see what sort of space it takes up on a typical device.

RavenDB : Storage Size Problems

I'm doing some testing with RavenDB to store data based on an iPhone application. The application is going to send up a string of 5 GPS coordinates with a GUID for the key. I'm seeing in RavenDB that each document is around 664-668 bytes. That's HUGE for 10 decimals and a GUID. Can someone help me understand what I'm doing wrong? I noticed the size was extraordinarily large when a million records took over a gig on disk. By my calculations it should be much smaller. Purely based on the data sizes, shouldn't the document be around 100 bytes? And given that the document database has the object schema built in, let's say double that to 200 bytes. Given that calculation, the database should be about two hundred megs with 1 million records. But it's ten times larger. Can someone help me figure out where I've gone wrong with the math here?
(Got a friend to check my math and I was off by a bit - numbers updated)
As a general principle, NoSQL databases aren't optimized for disk space. That's the kind of requirement traditionally handled by an RDBMS. Often with NoSQL, you will choose to store the data in duplicate or triplicate for various reasons.
Specifically with RavenDB, each document is in JSON format, so you have some overhead there. However, it is actually persisted on disk in BSON format, saving you some bytes. This implementation detail is obscured from the client. Also, every document has two streams - the main document content, and the associated metadata. This is very powerful, but does take up additional disk space. Both the document and the metadata are kept in BSON format in the ESENT backed document store.
Then you need to consider how you will access the data. Any static indexes you create, and any dynamic indexes you ask Raven to create for you via its LINQ API will have the data copied into the index store. This is a separate store implemented with Lucene.net using their proprietary index file format. You need to take this into consideration if you are estimating disk space requirements. (BTW - you would also have this concern with indexes in an RDBMS solution)
If you are super concerned about optimizing every byte of disk space, perhaps NoSQL solutions aren't for you. Just about every product on the market has these types of overhead. But keep in mind that disk space is cheap today. Relational databases optimized for disk space because storage was very expensive when they were invented. The world has changed, and NoSQL solutions embrace that.
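For a rough sense of where the space goes, a back-of-the-envelope reading of the numbers in the question (the raw payload sizes are assumptions, not measurements):

```
Raw payload (assumed):     10 doubles x 8 bytes   =  80 bytes
                           1 GUID                 =  16 bytes
                                                 ~=  96 bytes per document

Observed document size:    ~664-668 bytes per document
                           (JSON property names, coordinates stored as text,
                            the separate metadata stream, ESENT overhead)

1,000,000 documents x ~664 bytes                 ~= 664 MB in the document store
+ indexed fields copied into the Lucene-based index store
                                                 => comfortably over a gig on disk
```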

Options for storing large text blobs in/with an SQL database?

I have some large volumes of text (log files) which may be very large (up to gigabytes). They are associated with entities which I'm storing in a database, and I'm trying to figure out whether I should store them within the SQL database, or in external files.
It seems like in-database storage may be limited to 4GB for LONGTEXT fields in MySQL, and presumably other DBs have similar limits. Also, storing in the database presumably precludes any kind of seeking when viewing this data -- I'd have to load the full length of the data to render any part of it, right?
So it seems like I'm leaning towards storing this data out-of-DB: are my misgivings about storing large blobs in the database valid, and if I'm going to store them out of the database then are there any frameworks/libraries to help with that?
(I'm working in python but am interested in technologies in other languages too)
Your misgivings are valid.
DBs gained the ability to handle large binary and text fields some years ago, and after everybody tried it, we gave up.
The problem stems from the fact that your operations on large objects tend to be very different from your operations on the atomic values. So the code gets difficult and inconsistent.
So most veterans just go with storing them on the filesystem with a pointer in the db.
I know PHP/MySQL/Oracle (and probably others) let you work with large database objects as if you had a file pointer, which gets around memory issues.
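A hedged sketch of that "file on the filesystem, pointer in the DB" pattern, using SQLite and Objective-C here purely for illustration (the same idea applies with Python and MySQL); the directory, table, and column names are invented, and the table is assumed to already exist:

```objectivec
#import <Foundation/Foundation.h>
#import <sqlite3.h>

// Sketch: keep the large log as a plain file and store only its path in the
// database row. The destination directory and schema are placeholders.
static void storeLogForEntity(sqlite3 *db, NSString *entityId, NSString *sourceLogPath) {
    // Keep the (potentially multi-gigabyte) log as a file; never load it whole.
    NSString *managedPath = [NSTemporaryDirectory() stringByAppendingPathComponent:
                                [NSString stringWithFormat:@"%@.log", entityId]];
    [[NSFileManager defaultManager] copyItemAtPath:sourceLogPath toPath:managedPath error:NULL];

    // Only the "pointer" (the path) goes into the database.
    sqlite3_stmt *stmt = NULL;
    sqlite3_prepare_v2(db, "INSERT INTO entity_logs (entity_id, log_path) VALUES (?, ?)",
                       -1, &stmt, NULL);
    sqlite3_bind_text(stmt, 1, [entityId UTF8String], -1, SQLITE_TRANSIENT);
    sqlite3_bind_text(stmt, 2, [managedPath UTF8String], -1, SQLITE_TRANSIENT);
    sqlite3_step(stmt);
    sqlite3_finalize(stmt);
}
```

Viewing or seeking within a log then goes through ordinary file I/O, so you can read a window of a multi-gigabyte file without ever pulling the whole blob out of the database.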

When should we store images in the database?

I have a productList table with 4 columns, and now I have to store an image for each row, so I have two options for this:
Store the image in the database.
Save images in a folder and store only the path in the table.
So my question is: which one is better in this situation, and why?
Microsoft Research published quite an extensive paper on the subject, called To Blob Or Not To Blob.
Their synopsis is:
Application designers often face the question of whether to store large objects in a filesystem or in a database. Often this decision is made for application design simplicity. Sometimes, performance measurements are also used. This paper looks at the question of fragmentation – one of the operational issues that can affect the performance and/or manageability of the system as deployed long term. As expected from the common wisdom, objects smaller than 256K are best stored in a database while objects larger than 1M are best stored in the filesystem. Between 256K and 1M, the read:write ratio and rate of object overwrite or replacement are important factors. We used the notion of “storage age” or number of object overwrites as a way of normalizing wall clock time. Storage age allows our results or similar such results to be applied across a number of read:write ratios and object replacement rates.
It depends -
You can store images in the DB if you know that they won't increase in size very often. This has its advantage when you are deploying your systems or migrating to new servers: you don't have to worry about copying images separately.
If the number of rows increases very frequently on that system, and the images get bulkier, then it's better to store them on the file system and keep the path in the database for later retrieval. This will also keep you on your toes when migrating your servers, where you have to take care of copying the images from the file path separately.