I'd like to know if there is a guideline for the maximum number of attributes any one NSManagedObject subclass should have. I've read Apple's docs carefully, but find no mention of a limit where performance starts to erode.
I DO see a compiler flag option in Xcode that provides a warning when an NSManagedObject has more than 100 attributes, but I can't find any documentation on it. Does anyone here have experience with Core Data MOs that have a large number of attributes?
I'm focused on performance, not memory usage. In my app, there will only ever be about 10-20 instances of this MO that has a large number of attributes and I'm developing on OS X rather than iOS, so memory usage isn't a factor. But if there is a point where performance (especially when faulting) starts to die, I'd like to know now so I can change the structure of my data model accordingly.
Thanks!
Each attribute gets mapped to a table column if you're using the SQLite backing store. SQLite has a hard limit on the number of columns you can use (2000 by default, though it's compile-time configurable, so Apple's build could differ), and they recommend not using more than one hundred. That could well be why the Xcode warning sets its threshold at 100.
That same linked page on limits also notes that there are some O(N^2) algorithms where N is the number of columns, so it sounds like you should generally avoid large column counts.
For other store formats, I don't know of any limits or recommendations, but I'd expect similar behavior - i.e. there's probably some algorithm in there that's O(N^2) or worse, so you want to avoid becoming an uncommon edge case.
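If you want to check what column limit the SQLite build on your machine actually enforces, the SQLite C API exposes it via sqlite3_limit. A minimal sketch (plain SQLite, not Core Data; the in-memory database is just for illustration):

```cpp
// Sketch: query the per-connection column limit of the linked SQLite library.
// This talks to the SQLite C API directly, not to Core Data.
#include <sqlite3.h>
#include <cstdio>

int main() {
    sqlite3 *db = nullptr;
    if (sqlite3_open(":memory:", &db) != SQLITE_OK) {
        std::fprintf(stderr, "open failed: %s\n", sqlite3_errmsg(db));
        return 1;
    }
    // Passing -1 as the new value reads the current limit without changing it.
    int maxColumns = sqlite3_limit(db, SQLITE_LIMIT_COLUMN, -1);
    std::printf("This SQLite build allows up to %d columns per table\n", maxColumns);
    sqlite3_close(db);
    return 0;
}
```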
Not a limit that I have run across, even on iOS. The biggest limiting factor on performance is the cache size in the NSPersistentStoreCoordinator, which on Mac OS X is pretty big.
If your attributes are strings, numbers, dates, etc. (i.e. not binary data), then you can probably have a huge number of attributes before you start to see a performance hit. If you are working with binary data, then I would caution you against blowing the cache, and I'd suggest storing the binary data outside of SQLite. More recent versions of the OS can even do this for you automatically.
However, I would question why you would want to do this. Surely there are attributes that are going to be less vital than others and can be abstracted away into child entities on the other side of a one-to-one relationship?
I was reading the book "API Design Patterns" by JJ Geewax, and there is a section that talks about returning the count of items in a listing; he says it's not a good idea, especially in distributed storage systems.
From page 102:
Next, there is often the temptation to include a count of the items along with the listing. While this might be nice for user-interface consumers to show a total number of matching results, it often adds far more headache as time goes on and the number of items in the list grows beyond what was originally projected. This is particularly complicated for distributed storage systems that are not designed to provide quick access to counts matching specific queries. In short, it's generally a bad idea to include item counts in the responses to a standard List method.
Does anyone have a clue why that is, or can at least give me keywords to search for?
In a typical database (e.g., a MySQL db with a few gigs of data in there), counting the number of rows is pretty easy. If that's all you'll ever deal with, then providing a count of matching results isn't a huge deal -- the concern comes up when things get bigger.
As the amount of data starts growing (e.g., say... 10 TB?), dynamically computing an accurate count of matching rows can start to get pretty expensive (you have to scan and keep a running count of all the matching data). Even with a distributed storage system this can be fast, but it's still expensive. This means your API will be spending a lot of computing resources calculating the total number of results when it could be doing other important things. In my opinion, this is wasteful (a large expense for a "nice-to-have" on the API). If counts are critical to the API, then that changes the calculation.
Further, as changes to the data become more frequent (more creates, updates, and deletes), a count becomes less and less accurate as it might change drastically from one second to the next. In that case, not only is there more work being done to come up with a number, but that number isn't even all that accurate (and presumably, not super useful at that point).
So overall... result counts on larger data sets tend to be:
Expensive
More nice-to-have than business critical
Inaccurate
And since APIs tend to live much longer than we ever predict and can grow to a size far larger than we imagine, I discourage including result counts in API responses.
Every API is different though, so maybe it makes sense to have counts in your API; even then, I'd still suggest using rough estimates rather than exact counts to future-proof the API (a sketch of that idea follows the list below).
Some good reasons to include a count:
Your data size will stay reasonably small (i.e., able to be served by a single MySQL database).
Result counts are critical to your API (not just "nice-to-have").
Whatever numbers you come up with are accurate enough for your use cases (i.e., exact numbers for small data sets or "good estimates", not useless estimates).
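To make the "rough estimate" suggestion concrete, here is a small sketch in C++. All of the type and function names (ListWidgetsResponse, roundToEstimate) are hypothetical, and where the raw estimate comes from (index statistics, a periodically refreshed counter, etc.) is deliberately left out:

```cpp
// Sketch: a List response that carries a page token plus a cheap,
// clearly-approximate count instead of an exact, expensive one.
#include <cmath>
#include <cstdint>
#include <cstdio>
#include <string>
#include <vector>

struct Widget { std::string id; };

struct ListWidgetsResponse {
    std::vector<Widget> widgets;       // the current page of results
    std::string nextPageToken;         // empty when there are no more pages
    int64_t approximateTotalCount = 0; // explicitly approximate, not exact
};

// Round an estimate to about two significant digits so clients cannot
// mistake it for an exact figure (e.g. 1'523'847 -> 1'500'000).
int64_t roundToEstimate(int64_t rawEstimate) {
    if (rawEstimate < 100) return rawEstimate;
    double magnitude = std::pow(10.0, std::floor(std::log10((double)rawEstimate)) - 1);
    return (int64_t)(std::llround(rawEstimate / magnitude) * magnitude);
}

int main() {
    ListWidgetsResponse page;
    page.nextPageToken = "token-2";    // hypothetical pagination token
    page.approximateTotalCount = roundToEstimate(1523847);
    std::printf("~%lld results\n", (long long)page.approximateTotalCount);
    return 0;
}
```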
In C++, the map class is very convenient. Instead of going for a separate database, I want to store all the rows as objects and create map objects for the columns I need to search. I am concerned about the maximum number of objects a process can handle. Also, is using a map to retrieve an object among, say, 10 million objects (if Linux permits it) a good choice? I'm not worried about persisting the data.
What you are looking for is std::map::max_size, quoting from the reference:
...reflects the theoretical limit on the size of the container. At runtime, the size of the container may be limited to a value smaller than max_size() by the amount of RAM available.
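A quick way to see what that theoretical limit is on your toolchain (the value depends on the element type, the allocator, and the standard library implementation, so treat it as informational only):

```cpp
// Sketch: print the theoretical limit reported by std::map::max_size().
// The practical limit is the amount of RAM available.
#include <iostream>
#include <map>
#include <string>

int main() {
    std::map<int, std::string> columns;
    std::cout << "max_size(): " << columns.max_size() << "\n";
    std::cout << "current size(): " << columns.size() << "\n";
    return 0;
}
```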
No, there is no maximum number of objects per process. Objects (as in, C++ objects) are an abstraction which the OS is unaware of. The only meaningful limit in this regard is the amount of memory used.
You can completely fill your RAM using as much map as it takes, I promise.
As you can see in the reference documentation, map::max_size() will tell you that number.
This should be roughly 2^31-1 on x86 hardware with a 32-bit OS and 2^64-1 on amd64 hardware with a 64-bit OS.
Possibly some additional information here.
An object is a programming-language concept; the process itself is not aware of objects. With enough RAM, you can allocate as many objects as you need in your program.
As for your second question: which data structure you choose depends on the problem you want to solve. A map is a suitable data structure for quickly accessing objects, testing existence, and so on, but note that it keeps its keys in sorted order rather than preserving insertion order.
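As a concrete illustration of the lookup side (not the memory side): a keyed container handles millions of entries without any per-process object limit. std::map gives O(log n) lookups with keys kept sorted; std::unordered_map gives amortised O(1) lookups when you don't need ordering. A minimal sketch, scaled down to one million entries so it runs quickly:

```cpp
// Sketch: keyed lookup among a million objects held entirely in memory.
#include <cstdio>
#include <string>
#include <unordered_map>

int main() {
    std::unordered_map<long, std::string> rowsById;
    rowsById.reserve(1'000'000);
    for (long id = 0; id < 1'000'000; ++id) {
        rowsById.emplace(id, "row-" + std::to_string(id));
    }

    // Constant-time (amortised) retrieval by key.
    auto it = rowsById.find(123456);
    if (it != rowsById.end()) {
        std::printf("found: %s\n", it->second.c_str());
    }
    return 0;
}
```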
How would one optimize a queue for the typical:
access / store
memory usage
I'm not sure of any way to reduce memory besides running a compression algorithm on it, but that would cost quite a bit of store time as a tradeoff - one would have to recompress everything, I think.
As such, I'm thinking of the typical linked list with pointers... or a circular queue?
Any ideas?
Thanks
Edit: regardless of what is above, how does one build the fastest / least memory-intensive basic queue structure, essentially?
Linked lists are actually not very typical (except in functional languages or when newbies mistakenly think that a linked list is faster than a dynamic array). A dynamic circular buffer is more typical. The growing (and, optionally, shrinking) works slightly differently than in a dynamic array: if the "data holding part" crosses the end of the array, the data should be copied to the new space in such a way that it remains contiguous (simply extending the array would create a gap in the middle of the data).
As usual, it has some advantages and some drawbacks.
Drawbacks:
slightly more complicated implementation
not suitable for lock-free synchronization
Advantages:
more compact: in the worst case (when it just grew, or is just about to shrink but hasn't yet) it has a space overhead of about 100%; a singly linked list almost always has an overhead of 100% or more (unless the data elements are larger than a pointer), and a doubly linked list is even worse.
cache efficient: reading happens close to previous reading, writing happens close to previous writing. So cache misses are rare, and when they do occur, they read data that is mostly relevant (or in the case of writing: they get a cache line that will probably be written to again soon). In a linked list, locality is poor and about half of every cache miss is wasted on the overhead (pointers to other nodes).
Usually these advantages outweigh the drawbacks.
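For illustration, here is a minimal sketch of such a growable circular-buffer queue. It is not production code: no iterators, no thread safety, and the doubling growth policy is just one common choice.

```cpp
// Minimal growable ring-buffer queue, as described above.
#include <cstddef>
#include <cstdio>
#include <utility>
#include <vector>

template <typename T>
class RingQueue {
public:
    bool empty() const { return count_ == 0; }
    std::size_t size() const { return count_; }

    void push(T value) {
        if (count_ == buf_.size()) grow();
        buf_[(head_ + count_) % buf_.size()] = std::move(value);
        ++count_;
    }

    // Precondition: !empty()
    T pop() {
        T value = std::move(buf_[head_]);
        head_ = (head_ + 1) % buf_.size();
        --count_;
        return value;
    }

private:
    void grow() {
        std::size_t newCap = buf_.empty() ? 8 : buf_.size() * 2;
        std::vector<T> next(newCap);
        // Move the live elements so the data stays contiguous in the new
        // buffer - the "avoid a gap in the middle" step described above.
        for (std::size_t i = 0; i < count_; ++i) {
            next[i] = std::move(buf_[(head_ + i) % buf_.size()]);
        }
        buf_ = std::move(next);
        head_ = 0;
    }

    std::vector<T> buf_;
    std::size_t head_ = 0;
    std::size_t count_ = 0;
};

int main() {
    RingQueue<int> q;
    for (int i = 0; i < 20; ++i) q.push(i);
    while (!q.empty()) std::printf("%d ", q.pop());
    std::printf("\n");
    return 0;
}
```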
Can optimizers get rid of bad uses of spatial locality? I'm maintaining some code written by somebody else, and many of their arrays are declared in haphazard orders and iterated over differently each place they are used.
Because of the complexity of the code, it would take quite a block of time to try to rearrange every place the arrays are cycled through. I'm not skilled enough at reading assembly to tell exactly what's different at the various optimization levels, but my question is:
Is locality important when writing programs, or does it get optimized away so that I don't need to worry about it?
Getting locality right is important, because it can make two orders of magnitude of difference in runtime (5-6 orders of magnitude if you have page faults).
Apart from the fact that real compilers usually don't handle this automatically (as Joel Falcou said), even a hypothetical compiler would have a very hard time doing such a thing. In many cases, it may not even be valid for the compiler to do such a thing, and it is very hard to predict when it is or when it is not.
Say, for example, you have vertex data that you calculate on the CPU, and which you upload to a graphics API such as OpenGL or DirectX. You've agreed with that API a certain vertex data layout. Now the compiler figures that it is more efficient to rearrange the layout in some way. Bang, you're dead.
How was the compiler supposed to know?
Say you have a few arrays and a few pointers, and some pointers alias others, or some point into the middle of an array for some reason, others point at the beginning. The compiler figures that it's more efficient to do certain operations in a different order, overwriting one result with another.
The data corruption issue left aside, let's say those arrays are "somewhat big", so they're most certainly going to be dynamically allocated rather than being on the stack. Which means their start addresses are "non-deterministic" or even "random" from the compiler's point of view. How is the compiler going to make decisions -- at compile time -- not knowing half of the details?
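As a concrete (if deliberately simple) example of the access-pattern side of locality: both loops below compute the same sum over the same data, but one walks memory contiguously and the other strides through it. A compiler may manage loop interchange in a toy case like this, but in real code with aliasing and externally fixed layouts it usually cannot, which is the point above.

```cpp
// Sketch: row-major vs column-major traversal of the same row-major array.
#include <cstddef>
#include <cstdio>
#include <vector>

int main() {
    const std::size_t rows = 2048, cols = 2048;
    std::vector<double> grid(rows * cols, 1.0); // row-major: grid[r * cols + c]

    double sumRowMajor = 0.0;
    for (std::size_t r = 0; r < rows; ++r)      // cache-friendly: consecutive
        for (std::size_t c = 0; c < cols; ++c)  // elements, few cache misses
            sumRowMajor += grid[r * cols + c];

    double sumColMajor = 0.0;
    for (std::size_t c = 0; c < cols; ++c)      // cache-hostile: each access
        for (std::size_t r = 0; r < rows; ++r)  // jumps cols * sizeof(double)
            sumColMajor += grid[r * cols + c];  // bytes ahead

    std::printf("%f %f\n", sumRowMajor, sumColMajor);
    return 0;
}
```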
Few to no compilers handle data layout for locality. It's still an active research area.
I'm working on a project in Objective-C where I need to work with a large quantity of data stored in an NSDictionary (around 2 GB in RAM at most). After all the computations that I perform on it, it seems like it would be quicker to save/load the data when needed (versus re-parsing the original file).
So I started to look into saving large amounts of data. I've tried using NSKeyedArchiver and [NSDictionary writeToFile:atomically:], but both failed with malloc errors (Can not allocate ____ bytes).
I've looked around SO, Apple's dev forums, and Google, but was unable to find anything. I'm wondering if it might be better to create the file bit by bit instead of all at once, but I can't find any way to append to an existing file. I'm not completely opposed to saving a bunch of small files, but I would much rather use one big file.
Thanks!
Edited to include more information: I'm not sure how much overhead NSDictionary gives me, since I don't keep all the information from the text files. I have a 1.5 GB file (of which I keep about half), and it turns out to be around 900 MB to 1 GB in RAM. There will be some more data that I need to add eventually, but it will be constructed with references to what's already loaded into memory - it shouldn't double the size, but it may come close.
The data is all serial and could be separated in storage, but it needs to all be in memory for execution. I currently have integer/string pairs, and will eventually end up with string/string pairs (with every value also being a key for a different set of strings, so the final storage requirements will be the same strings I currently have, plus a bunch of references).
In the end, I will need to associate ~3 million strings with some other set of strings. However, the only important thing is the relationship between those strings - I could hash all of them, but NSNumber (as NSDictionary needs objects) might give me just as much overhead.
NSDictionary isn't going to give you the scalable storage that you're looking for, at least not for persistence. You should implement your own type of data structure/serialisation process.
Have you considered using an embedded SQLite database? Then you can process the data but perhaps load only a fragment of the data structure at a time.
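To make the embedded-SQLite route concrete, here's a minimal sketch using the SQLite C API (which is callable from Objective-C / Objective-C++). The table name, file name, and helper function are placeholders, and most error handling is omitted:

```cpp
// Sketch: stream key/value pairs to an SQLite file instead of materialising
// one giant in-memory archive.
#include <sqlite3.h>
#include <cstdio>
#include <string>

bool storePair(sqlite3 *db, sqlite3_stmt *insert,
               const std::string &key, const std::string &value) {
    sqlite3_reset(insert);
    sqlite3_bind_text(insert, 1, key.c_str(), -1, SQLITE_TRANSIENT);
    sqlite3_bind_text(insert, 2, value.c_str(), -1, SQLITE_TRANSIENT);
    return sqlite3_step(insert) == SQLITE_DONE;
}

int main() {
    sqlite3 *db = nullptr;
    if (sqlite3_open("pairs.db", &db) != SQLITE_OK) return 1;

    sqlite3_exec(db,
        "CREATE TABLE IF NOT EXISTS pairs (key TEXT PRIMARY KEY, value TEXT);",
        nullptr, nullptr, nullptr);

    sqlite3_stmt *insert = nullptr;
    sqlite3_prepare_v2(db,
        "INSERT OR REPLACE INTO pairs (key, value) VALUES (?1, ?2);",
        -1, &insert, nullptr);

    // Wrap many inserts in one transaction; committing per row is far slower.
    sqlite3_exec(db, "BEGIN;", nullptr, nullptr, nullptr);
    for (int i = 0; i < 1000; ++i) {
        storePair(db, insert, std::to_string(i), "value-" + std::to_string(i));
    }
    sqlite3_exec(db, "COMMIT;", nullptr, nullptr, nullptr);

    sqlite3_finalize(insert);
    sqlite3_close(db);
    return 0;
}
```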
If you can, rebuilding your application in 64-bit mode will give you a much larger heap space.
If that's not an option for you, you'll need to create your own data structure and define your own load/save routines that don't allocate as much memory.