Why can't I store un-serialized data structures on disk the same way I can store them in memory?

Firstly, I am assuming that data structures, like a hash map for example, can only be stored in memory but not on disk unless they are serialized. I want to understand why not.
What is holding us back from dumping a block of memory which stores the data structure directly into disk without any modifications?
Something like JSON could be thought of as a "serialized" Python dictionary. We can very well store JSON in files, so why not a dict?
You may ask how I would represent non-string values like bools/objects on disk. I can argue "the same way you store them in memory". Am I missing something here?

Naming a few problems:
Big-endian vs. little-endian byte order makes reading data from disk depend on the CPU architecture, so if you just dumped memory you wouldn't be able to read it back on a device with a different architecture.
Items are not contiguous in memory: a list (or dictionary), for example, only contains pointers to things that exist "somewhere" in memory. You can only dump contiguous memory; otherwise you are just storing the locations the data happened to occupy, which won't be the same when you run the program again.
The way structures are laid out in memory can change between two compiled versions of the same program, so if you just recompile your application you may get different layouts for structures in memory, and your data is lost.
Different versions of the same application may wish to change the shape of the structures to allow extra functionality; this won't be possible if the data shape on disk has to match the shape in memory. (This is one of the reasons you shouldn't use pickle for portable data storage, despite it using a memory serializer.)
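The first two points can be demonstrated in a few lines of Python. The `struct` module makes the byte-order problem explicit, and `id()` shows that a container holds process-local addresses, not its items:

```python
import struct

# The same 32-bit integer has a different byte layout depending on endianness;
# a raw memory dump bakes in whichever order the writing CPU happens to use.
n = 0x01020304
print(struct.pack("<I", n))   # little-endian: b'\x04\x03\x02\x01'
print(struct.pack(">I", n))   # big-endian:    b'\x01\x02\x03\x04'

# A Python list doesn't hold its items contiguously either: it holds pointers.
# These addresses are only meaningful inside the current process run, so
# dumping them to disk would store nothing useful.
xs = ["a", "b"]
print([hex(id(x)) for x in xs])  # addresses valid only in this run
```

Reading the little-endian bytes back on a big-endian machine (or vice versa) without converting would silently give you a different number, which is exactly why serialization formats pin down a byte order.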

Related

How are indices kept on disc

Ok, so the index is a binary tree (for example) that can be searched efficiently to find a specific value. A binary tree is represented in memory as a structure with pointers to its root and children. When I add some data to my table/file, I also add this data to the tree structure.
Ok, great, but if the table/structure is big, and exceeds memory limits, it should be kept in file. How do I keep such structure in a file? How do I modify it?
Good question. Databases generally use B-tree structures for indexing data, because those types of data structures allow you to reference larger blocks of data.
You could technically serialize any binary tree to disk and then load it into memory, or partially load it into memory as you traverse it. But if the index grows so large that it no longer fits into memory, or takes up too much of the available memory, it becomes inefficient to keep paging it in and out.
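The core trick behind keeping a tree on disk is to replace in-memory pointers with file offsets, so only the nodes on the search path need to be read. Here is a deliberately tiny sketch of that idea — a plain binary tree with fixed-size nodes in a byte buffer, not a real multi-way B-tree (which would store many keys per node to match disk block sizes):

```python
import struct

# Each node is a fixed-size record: a key plus the *offsets* of its two
# children in the same buffer/file (-1 means "no child"). Offsets play the
# role that pointers play in memory.
NODE_FMT = "<iqq"   # key (int32), left child offset, right child offset
NODE_SIZE = struct.calcsize(NODE_FMT)

def write_node(buf, key, left=-1, right=-1):
    """Append a node record; return its offset so parents can refer to it."""
    off = len(buf)
    buf += struct.pack(NODE_FMT, key, left, right)
    return off

def search(buf, root_off, key):
    """Follow offsets from the root, reading one node record at a time."""
    off = root_off
    while off != -1:
        k, left, right = struct.unpack_from(NODE_FMT, buf, off)
        if k == key:
            return off
        off = left if key < k else right
    return None
```

With a real file you would `seek()` to each offset and read `NODE_SIZE` bytes instead of slicing a buffer, which is exactly how only a handful of nodes end up in memory per lookup.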

Why a 500MB Redis dump.rdb file takes about 5.0GB memory?

Actually, I have 3 Redis instances and I merged them into this 500MB+ dump.rdb. The Redis server can read this dump.rdb and everything seems to be ok. But then I noticed that redis-server uses more than 5.0GB of memory, and I don't know why.
Is there anything wrong with my file? My db has about 3 million keys, and the value for each key is a list containing about 80 integers.
I used this METHOD to merge the 3 instances.
PS: Another dump.rdb with the same size and the same key-value structure uses only 1GB of memory.
My data looks like keyNum -> {num1, num2, num3, ...}. All numbers are between 1 and 4,000,000. Should I use a List to store them? For now, I use lpush(k, v). Does this cost too much?
The ratio of memory to dump size depends on the data types Redis uses internally.
For small objects (hashes, lists and sorted sets), Redis uses ziplists to encode data. For small sets made of integers, Redis uses intsets. Ziplists and intsets are stored on disk in the same format as they are stored in memory, so you'd expect a 1:1 ratio if your data uses these encodings.
For larger objects, the in-memory representation is completely different from the on-disk representation. The on-disk format is compressed, doesn't have pointers, doesn't have to deal with memory fragmentation. So, if your objects are large, a 10:1 memory to disk ratio is normal and expected.
If you want to know which objects eat up memory, use redis-rdb-tools to profile your data (disclaimer: I am the author of this tool). From there, follow the memory optimization notes on redis.io, as well as the memory optimization wiki entry on redis-rdb-tools.
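To get a feel for why a pointer-based in-memory representation can dwarf the packed on-disk one, here is a Python analogy (not Redis internals, but the same per-object and pointer overhead applies): a list of 80 small integers, like one of the values in the question, versus the same values packed as raw 4-byte integers.

```python
import array
import sys

nums = list(range(80))                      # one value list, like the OP's
packed = array.array("i", nums).tobytes()   # raw 4-byte ints, dump-style

# The list pays for 80 pointer slots plus a full object per integer;
# the packed form is just 80 * 4 = 320 bytes.
list_size = sys.getsizeof(nums) + sum(sys.getsizeof(n) for n in nums)
print(list_size)      # several kilobytes
print(len(packed))    # 320
```

Redis's non-ziplist encodings pay a similar tax (pointers, allocator headers, fragmentation), which is why forcing small collections into ziplist/intset encodings is the standard memory optimization.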
There may be more to it, but I believe Redis compresses the dump files.

Is it possible to memory map a compressed file?

We have large files with zlib-compressed binary data that we would like to memory map.
Is it even possible to memory map such a compressed binary file and access those bytes in an effective manner?
Are we better off just decompressing the data, memory mapping it, then after we're done with our operations compress it again?
EDIT
I think I should probably mention that these files can be appended to at regular intervals.
Currently, this data on disk gets loaded via NSMutableData and decompressed. We then have some arbitrary read/write operations on this data. Finally, at some point we compress and write the data back to disk.
Memory mapping is all about the 1:1 mapping of memory to disk. That's not compatible with automatic decompression, since it breaks the 1:1 mapping.
I assume these files are read-only, since random-access writing to a compressed file is generally impractical. I would therefore assume that the files are somewhat static.
I believe this is a solvable problem, but it's not trivial, and you will need to understand the compression format. I don't know of any easily reusable software to solve it (though I'm sure many people have solved something like it in the past).
You could memory map the file and then provide a front-end adapter interface to fetch bytes at a given offset and length. You would scan the file once, decompressing as you went, and create a "table of contents" file that mapped periodic nominal offsets to real offset (this is just an optimization, you could "discover" this table of contents as you fetched data). Then the algorithm would look something like:
Given nominal offset n, look up greatest real offset m that maps to less than n.
Read from m-32k into your buffer (32k is the largest back-reference distance allowed in DEFLATE).
Begin DEFLATE algorithm at m. Count decompressed bytes until you get to n.
Obviously you'd want to cache your solutions. NSCache and NSPurgeableData are ideal for this. Doing this really well and maintaining good performance would be challenging, but if it's a key part of your application it could be very valuable.
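The "table of contents" idea above can be sketched with Python's `zlib`, which conveniently lets you snapshot a decompressor's state with `copy()`. This is a simplified illustration (checkpoints held in memory, whole compressed buffer available as bytes), not production code:

```python
import zlib

def build_index(comp, every=1 << 16):
    """Scan the compressed data once, saving a decompressor checkpoint
    roughly every `every` bytes of decompressed output."""
    d = zlib.decompressobj()
    index = []                # (decompressed_offset, compressed_offset, state)
    dec_off = comp_off = 0
    next_mark = 0
    while comp_off < len(comp):
        if dec_off >= next_mark:
            index.append((dec_off, comp_off, d.copy()))
            next_mark += every
        dec_off += len(d.decompress(comp[comp_off:comp_off + 4096]))
        comp_off += 4096
    return index

def read_at(comp, index, start, length):
    """Random-access read: resume from the nearest checkpoint <= start."""
    dec_off, comp_off, saved = max(
        (c for c in index if c[0] <= start), key=lambda c: c[0])
    d = saved.copy()
    buf = b""
    pos = comp_off
    while len(buf) < (start - dec_off) + length and pos < len(comp):
        buf += d.decompress(comp[pos:pos + 4096])
        pos += 4096
    buf += d.flush()
    return buf[start - dec_off : start - dec_off + length]
```

A real implementation would persist the checkpoints to a sidecar file and cap how much state it keeps, but the shape of the algorithm — find the checkpoint, decompress forward, discard the prefix — is the one described above.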

Options for storing large text blobs in/with an SQL database?

I have some large volumes of text (log files) which may be very large (up to gigabytes). They are associated with entities which I'm storing in a database, and I'm trying to figure out whether I should store them within the SQL database, or in external files.
It seems like in-database storage may be limited to 4GB for LONGTEXT fields in MySQL, and presumably other DBs have similar limits. Also, storing in the database presumably precludes any kind of seeking when viewing this data -- I'd have to load the full length of the data to render any part of it, right?
So it seems like I'm leaning towards storing this data out-of-DB: are my misgivings about storing large blobs in the database valid, and if I'm going to store them out of the database then are there any frameworks/libraries to help with that?
(I'm working in python but am interested in technologies in other languages too)
Your misgivings are valid.
DBs gained the ability to handle large binary and text fields some years ago, and after everybody tried it, most of us gave up.
The problem stems from the fact that your operations on large objects tend to be very different from your operations on atomic values, so the code gets difficult and inconsistent.
So most veterans just store the blobs on the filesystem, with a pointer to them in the db.
I know PHP/MySQL/Oracle (and probably others) let you work with large database objects as if you had a file pointer, which gets around the memory issues.
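The "file on disk, pointer in the db" pattern is short enough to show end to end. This sketch uses SQLite and a temp directory as stand-ins (the schema and names are made up for illustration); note that the seek in `read_log_range` is exactly what a LONGTEXT column wouldn't give you:

```python
import os
import sqlite3
import tempfile

# The log text lives on the filesystem; the DB row keeps only metadata
# and the path to the file.
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE logs (id INTEGER PRIMARY KEY, entity TEXT, path TEXT, size INTEGER)")

log_dir = tempfile.mkdtemp()

def store_log(entity, text):
    path = os.path.join(log_dir, f"{entity}.log")
    with open(path, "w") as f:
        f.write(text)
    db.execute("INSERT INTO logs (entity, path, size) VALUES (?, ?, ?)",
               (entity, path, len(text)))
    return path

def read_log_range(entity, offset, length):
    (path,) = db.execute(
        "SELECT path FROM logs WHERE entity = ?", (entity,)).fetchone()
    with open(path) as f:
        f.seek(offset)          # random access: no need to load the whole log
        return f.read(length)
```

The main costs of this approach are the ones the answer hints at: the files fall outside the database's transactions and backups, so you have to keep the two stores consistent yourself.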

Saving large objects to file

I'm working on a project in Objective-C where I need to work with large quantities of data stored in an NSDictionary (it's around ~2 gigs in RAM at most). After all the computations that I perform on it, it seems like it would be quicker to save/load the data when needed (versus re-parsing the original file).
So I started to look into saving large amounts of data. I've tried using NSKeyedArchiver and [NSDictionary writeToFile:atomically:], but both failed with malloc errors (Can not allocate ____ bytes).
I've looked around SO, Apple's dev forums and Google, but was unable to find anything. I'm wondering if it might be better to create the file bit by bit instead of all at once, but I can't find any way to append to an existing file. I'm not completely opposed to saving with a bunch of small files, but I would much rather use one big file.
Thanks!
Edited to include more information: I'm not sure how much overhead NSDictionary gives me, as I don't take all the information from the text files. I have a 1.5 gig file (of which I keep ~1/2), and it turns out to be around 900 megs through 1 gig in ram. There will be some more data that I need to add eventually, but it will be constructed with references to what's already loaded into memory - it shouldn't double the size, but it may come close.
The data is all serial, and could be separated in storage, but needs to all be in memory for execution. I currently have integer/string pairs, and will eventually end up with string/strings pairs (with all the values also being a key for a different set of strings, so the final storage requirements will be the same strings that I currently have, plus a bunch of references).
In the end, I will need to associate ~3 million strings with some other set of strings. However, the only important thing is the relationship between those strings - I could hash all of them, but NSNumber (as NSDictionary needs objects) might give me just as much overhead.
NSDictionary isn't going to give you the scalable storage that you're looking for, at least not for persistence. You should implement your own type of data structure/serialisation process.
Have you considered using an embedded SQLite database? Then you could process the data while loading only a fragment of the data structure at a time.
If you can, rebuilding your application in 64-bit mode will give you a much larger heap space.
If that's not an option for you, you'll need to create your own data structure and define your own load/save routines that don't allocate as much memory.