Ok, so the index is a binary tree (for example) that can be searched efficiently to find a specific value. A binary tree is represented in memory as a structure with a root and pointers to child nodes. When I add data to my table/file, I also add it to the tree structure.
Ok, great, but if the table/structure is big and exceeds memory limits, it has to be kept in a file. How do I keep such a structure in a file? How do I modify it?
Good question. Databases generally use B-tree structures for indexing because each B-tree node references a large block of data (many keys per node), which maps well onto fixed-size disk pages.
You could technically serialize any binary tree to disk and then load it into memory, or partially load it as you traverse it. But if the index grows so large that it no longer fits in memory, or takes up too much of the available memory, it becomes inefficient to keep paging it in and out.
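To make that concrete, here is a minimal sketch, assuming Python and a made-up fixed-size record layout, of keeping a binary tree in a file rather than in RAM: "pointers" become record numbers, so following a child pointer is just a seek, and modifying the tree means rewriting (or appending) fixed-size records and updating the parent's child field.

```python
import struct

# Hypothetical fixed-size node record: <key, left-child record no., right-child record no.>,
# with -1 meaning "no child". Real databases use B-tree pages that pack many keys,
# but the principle is the same: a pointer on disk is a record/page number.
NODE = struct.Struct("<qqq")   # three signed 64-bit integers = 24 bytes per node

def write_node(f, record_no, key, left=-1, right=-1):
    f.seek(record_no * NODE.size)
    f.write(NODE.pack(key, left, right))

def read_node(f, record_no):
    f.seek(record_no * NODE.size)
    return NODE.unpack(f.read(NODE.size))

def search(f, root_record, wanted_key):
    """Walk the on-disk tree, reading one fixed-size node per step."""
    current = root_record
    while current != -1:
        key, left, right = read_node(f, current)
        if wanted_key == key:
            return True
        current = left if wanted_key < key else right
    return False

with open("index.bin", "w+b") as f:
    write_node(f, 0, 50, left=1, right=2)   # record 0 is the root
    write_node(f, 1, 20)
    write_node(f, 2, 80)
    print(search(f, 0, 80))   # True
    print(search(f, 0, 99))   # False
```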
Related
Firstly, I am assuming that data structures, like a hash map for example, can only be stored in memory, not on disk, unless they are serialized. I want to understand why not.
What is holding us back from dumping the block of memory that stores the data structure directly to disk, without any modifications?
Something like JSON could be thought of as a "serialized" Python dictionary. We can very well store JSON in files, so why not a dict?
You may ask how you would represent non-string values like bools/objects on disk. I can argue "the same way you store them in memory". Am I missing something here?
Naming a few problems (a short sketch follows this list):
Big endian vs. little endian makes reading data from disk depend on the CPU architecture, so if you just dumped memory you wouldn't be able to read it back on a different device.
Items are not contiguous in memory; a list (or dictionary), for example, only contains pointers to things that exist "somewhere" in memory. You can only dump contiguous memory, otherwise you are only storing the locations the data happened to occupy, which won't be the same when you load the program again.
The way structures are laid out in memory can change between two compiled versions of the same program, so if you just recompile your application you may get different struct layouts and lose access to your data.
Different versions of the same application may wish to change the shape of the structures to allow extra functionality; that isn't possible if the data shape on disk has to match the shape in memory. (This is one of the reasons you shouldn't be using pickle for portable data storage, despite it serializing objects straight from memory.)
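A small illustration of the first two points, assuming CPython and the standard struct module: a raw memory image bakes in the writer's byte order and contains process-local references, whereas explicit serialization pins the byte layout down.

```python
import struct
import sys

value = 1_000_000_007

# Point 1 (endianness): a raw dump uses whatever byte order this CPU happens to use,
# so the same integer turns into different bytes on little- vs big-endian machines.
native_bytes = value.to_bytes(8, sys.byteorder)

# Explicit serialization pins the layout ("<q" = little-endian signed 64-bit),
# so any machine can read it back, which a memory dump cannot guarantee.
portable_bytes = struct.pack("<q", value)
print(native_bytes == portable_bytes)   # True only on little-endian machines

# Point 2 (pointers): a dict does not hold its values inline, it holds references.
# Dumping the dict's memory would store this process-local address, which is
# meaningless the next time the program runs.
d = {"answer": value}
print(hex(id(d["answer"])))   # address-like identity, different on every run
```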
As far as I know, a wide-column approach is not applicable in my case.
But is there a difference in efficiency when putting big data into a node?
I'd like to add an index to distinguish the values, and I want to know how efficient that is.
"There are few efficiency considerations when putting big data into nodes (accurately property). Search filtering may become slower due to search scope and size increases. Changes may result in overhead such as wal log.
Although it's difficult to determine how much big data you have, but I think you should save it as a file and save its description as property type. Information saved in the property can reduce access expense by the creation of a seperate property index. "
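A minimal sketch of that suggestion, assuming nothing about the particular graph database (the directory and field names below are made up): the large payload goes to its own file, and only a small descriptive record, which a property index can cover, is stored on the node.

```python
import hashlib
import json
import pathlib

BLOB_DIR = pathlib.Path("blobs")          # assumed location for the large payloads
BLOB_DIR.mkdir(exist_ok=True)

def externalize(big_bytes: bytes, description: str) -> dict:
    """Write the large payload to its own file and return the small, indexable
    metadata that would be stored on the node instead of the blob itself."""
    digest = hashlib.sha256(big_bytes).hexdigest()
    path = BLOB_DIR / f"{digest}.bin"
    path.write_bytes(big_bytes)
    # Only this small record goes into the node property; it stays cheap to index
    # and keeps the node (and the WAL entries for updates) small.
    return {"description": description, "file": str(path), "sha256": digest}

node_property = externalize(b"...gigabytes in real life...", "sensor dump 2021-06-01")
print(json.dumps(node_property, indent=2))
```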
I wouldn't exactly say it is limited, but as far as I can see the recommendations given are of the sort "If you need to go beyond that, you can change the backend store...". Why? Why is Sesame not as efficient as, let's say, OWLIM or AllegroGraph once it goes beyond 150-200M triples? What optimizations are implemented in order to go that big? Are the underlying data structures different?
Answered here by Jeen Broekstra:
http://answers.semanticweb.com/questions/21881/why-is-sesame-limited-to-lets-say-150m-triples
the actual values that make up RDF statements (that is, the subjects, predicates, and objects) are indexed in a relatively simple hash, mapping integer ids to actual data values. This index does a lot of in-memory caching to speed up lookups, but as the size of the store increases, the probability (during insertion or lookup) that a value is not present in the cache and needs to be retrieved from disk increases, and in addition the on-disk lookup itself becomes more expensive as the size of the hash increases.
data retrieval in the native store has been balanced to make optimal use of the file system page size, for maximizing retrieval speed of B-tree nodes. This optimization relies on consecutive lookups reusing the same data block so that the OS-level page cache can be reused. This heuristic starts failing more often as transaction sizes (and therefore B-trees) grow, however.
as B-trees grow in size, the chances of large cascading splits increase.
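To make the first point above concrete, here is a toy, in-memory version of such an id-to-value dictionary (illustrative only, not Sesame's actual on-disk format): every distinct RDF term is assigned an integer id once, and statements are then stored as small triples of ids instead of full strings. The cost described above comes from exactly this lookup: once the value-to-id side no longer fits in memory, every cache miss turns into a disk seek.

```python
class ValueDictionary:
    """Toy id <-> value mapping; a real store persists both directions on disk."""

    def __init__(self):
        self._value_to_id = {}      # in-memory cache; the expensive part at scale
        self._id_to_value = []

    def get_id(self, value: str) -> int:
        """Return the existing id for a term, assigning a new one if needed."""
        if value not in self._value_to_id:
            self._value_to_id[value] = len(self._id_to_value)
            self._id_to_value.append(value)
        return self._value_to_id[value]

    def get_value(self, term_id: int) -> str:
        return self._id_to_value[term_id]

d = ValueDictionary()
triple = ("ex:alice", "foaf:knows", "ex:bob")
encoded = tuple(d.get_id(t) for t in triple)                 # e.g. (0, 1, 2)
print(encoded, tuple(d.get_value(i) for i in encoded))
```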
We have large files with zlib-compressed binary data that we would like to memory map.
Is it even possible to memory map such a compressed binary file and access those bytes in an effective manner?
Are we better off just decompressing the data, memory mapping it, and then compressing it again once we're done with our operations?
EDIT
I think I should probably mention that these files can be appended to at regular intervals.
Currently, this data on disk gets loaded via NSMutableData and decompressed. We then have some arbitrary read/write operations on this data. Finally, at some point we compress and write the data back to disk.
Memory mapping is all about the 1:1 mapping of memory to disk. That's not compatible with automatic decompression, since it breaks the 1:1 mapping.
I assume these files are read-only, since random-access writing to a compressed file is generally impractical. I would therefore assume that the files are somewhat static.
I believe this is a solvable problem, but it's not trivial, and you will need to understand the compression format. I don't know of any easily reusable software to solve it (though I'm sure many people have solved something like it in the past).
You could memory map the file and then provide a front-end adapter interface to fetch bytes at a given offset and length. You would scan the file once, decompressing as you went, and create a "table of contents" file that maps periodic nominal offsets to real offsets (this is just an optimization; you could also "discover" the table of contents as you fetch data). Then the algorithm would look something like:
Given nominal offset n, look up the greatest real offset m that maps to less than n.
Read m-32k into buffer (32k is the largest allowed distance in DEFLATE).
Begin DEFLATE algorithm at m. Count decompressed bytes until you get to n.
Obviously you'd want to cache your solutions. NSCache and NSPurgeableData are ideal for this. Doing this really well and maintaining good performance would be challenging, but if it's a key part of your application it could be very valuable.
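Here is a minimal sketch of that table-of-contents approach, assuming Python and a plain zlib-wrapped stream (the file name, checkpoint spacing, and chunk size are made up): zlib.decompressobj().copy() stands in for the saved decompressor state and 32k window, and a Cocoa implementation would mirror the same structure around the raw zlib C API.

```python
import mmap
import zlib

CHUNK = 64 * 1024               # compressed bytes fed to the decompressor per step
CHECKPOINT_EVERY = 1 << 20      # record a checkpoint roughly every 1 MiB of output

def build_toc(compressed):
    """Scan the stream once, saving (decompressed_offset, compressed_offset,
    decompressor_state) checkpoints that later reads can resume from."""
    toc, d, out_pos, next_mark = [], zlib.decompressobj(), 0, 0
    for in_pos in range(0, len(compressed), CHUNK):
        if out_pos >= next_mark:
            toc.append((out_pos, in_pos, d.copy()))
            next_mark = out_pos + CHECKPOINT_EVERY
        out_pos += len(d.decompress(compressed[in_pos:in_pos + CHUNK]))
    return toc

def read_range(compressed, toc, offset, length):
    """Return `length` decompressed bytes starting at decompressed `offset`."""
    out_pos, in_pos, state = [c for c in toc if c[0] <= offset][-1]
    d = state.copy()                          # keep the checkpoint reusable
    pieces, have = [], 0
    while have < offset + length - out_pos and in_pos < len(compressed):
        chunk = d.decompress(compressed[in_pos:in_pos + CHUNK])
        in_pos += CHUNK
        pieces.append(chunk)
        have += len(chunk)
    data = b"".join(pieces)
    return data[offset - out_pos : offset - out_pos + length]

# Demo: write a zlib-compressed file, mmap it, and serve a random read.
payload = b"".join(b"record-%08d\n" % i for i in range(200_000))
with open("big.z", "wb") as f:
    f.write(zlib.compress(payload))

with open("big.z", "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    toc = build_toc(mm)
    print(read_range(mm, toc, 1_000_000, 30))
    print(payload[1_000_000:1_000_030])       # same bytes, for comparison
```

Each computed answer, or each checkpoint, could then be kept behind NSCache/NSPurgeableData exactly as suggested above.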
I'm working on a project in Objective-C where I need to work with large quantities of data stored in an NSDictionary (it's around 2 GB in RAM at most). After all the computations that I perform on it, it seems like it would be quicker to save/load the data when needed (versus re-parsing the original file).
So I started to look into saving large amounts of data. I've tried using NSKeyedUnarchiver and [NSDictionary writeToFile:atomically:], but both failed with malloc errors (Can not allocate ____ bytes).
I've looked around SO, Apple's dev forums and Google, but was unable to find anything. I'm wondering if it might be better to create the file bit by bit instead of all at once, but I can't find any way to add to an existing file. I'm not completely opposed to saving with a bunch of small files, but I would much rather use one big file.
Thanks!
Edited to include more information: I'm not sure how much overhead NSDictionary gives me, as I don't take all the information from the text files. I have a 1.5 GB file (of which I keep about half), and it turns out to be around 900 MB to 1 GB in RAM. There will be some more data that I need to add eventually, but it will be constructed with references to what's already loaded into memory - it shouldn't double the size, but it may come close.
The data is all serial, and could be separated in storage, but needs to all be in memory for execution. I currently have integer/string pairs, and will eventually end up with string/string pairs (with all the values also being keys for a different set of strings, so the final storage requirements will be the same strings that I currently have, plus a bunch of references).
In the end, I will need to associate ~3 million strings with some other set of strings. However, the only important thing is the relationship between those strings - I could hash all of them, but NSNumber (as NSDictionary needs objects) might give me just as much overhead.
NSDictionary isn't going to give you the scalable storage that you're looking for, at least not for persistence. You should implement your own type of data structure/serialisation process.
Have you considered using an embedded SQLite database? Then you could process the data while only loading a fragment of the data structure at a time.
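As a toy illustration of that idea, assuming Python's built-in sqlite3 module (the table and column names are made up; an Objective-C version would use the sqlite3 C API or a wrapper such as FMDB): the pairs live in an indexed on-disk table, so you query only what you need instead of materialising one giant dictionary and serialising it in a single allocation.

```python
import sqlite3

conn = sqlite3.connect("pairs.db")   # hypothetical on-disk store
conn.execute("CREATE TABLE IF NOT EXISTS pairs (key TEXT PRIMARY KEY, value TEXT)")

def store_many(items):
    """Insert key/value pairs in one transaction per batch."""
    with conn:
        conn.executemany("INSERT OR REPLACE INTO pairs VALUES (?, ?)", items)

def lookup(key):
    """Fetch a single value by key; only one row is ever loaded into memory."""
    row = conn.execute("SELECT value FROM pairs WHERE key = ?", (key,)).fetchone()
    return row[0] if row else None

store_many((f"key-{i}", f"value-{i}") for i in range(100_000))
print(lookup("key-42"))
```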
If you can, rebuilding your application in 64-bit mode will give you a much larger heap space.
If that's not an option for you, you'll need to create your own data structure and define your own load/save routines that don't allocate as much memory.
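If you do roll your own routines, one shape that idea can take (a Python sketch, purely illustrative; the record layout and file name are made up) is a length-prefixed record stream: pairs are written one at a time, so the full structure never has to be serialised in one buffer, and new pairs can simply be appended to the existing file later.

```python
import struct

HEADER = struct.Struct("<II")   # lengths of the key and value that follow

def append_pairs(path, pairs):
    """Append key/value records to the file without rewriting what's there."""
    with open(path, "ab") as f:
        for key, value in pairs:
            k, v = key.encode(), value.encode()
            f.write(HEADER.pack(len(k), len(v)) + k + v)

def iter_pairs(path):
    """Stream the records back one at a time, never holding the whole file."""
    with open(path, "rb") as f:
        while header := f.read(HEADER.size):
            klen, vlen = HEADER.unpack(header)
            yield f.read(klen).decode(), f.read(vlen).decode()

append_pairs("pairs.bin", [("alpha", "beta"), ("gamma", "delta")])
append_pairs("pairs.bin", [("later", "appended")])   # incremental additions work
print(dict(iter_pairs("pairs.bin")))
```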