I'm using a NSKeyedArchiver to encode a big object graph (76295 objects.)
It takes a lot of time but even worse the NSKeyedArchiver doesn't give back all its memory.
After using the Leak check, the code doesn't appear to leak at all but for some reason the encoding doesn't give back all memory after it's done.
Calling the encode method several times does make it worse, more and more memory is being eaten away.
You got any suggestions I could like at?
P.S. A database (sqlite) or CoreData aren't alternatives because they seem to scale very poorly with an big object graph like the one mentioned above.
I would prefer a solution using NSKeyedArchiver
NSKeyedArchiver doesn't actually encode objects. It simply walks the graph and calls the encode methods of each instances in the graph. Therefore, the most likely source of a leak is inside a custom coder method that you wrote to archive one of your custom classes. You might want to do test archives of individual classes to see if one of them leaks.
The problem with using NSKeyedArchiver to archive a large complex graph is that the entire graph is piled into memory at once. One leak in one classes coder can explode if a lot of instances of that class are archived. If you have 76,000+ objects and just a couple of thousand leak a few bytes each, that will add up in a hurry.
I must add that I have never encountered or even read of a situation in which Core Data did not perform better than an archive for complex graphs regardless of size. Core Data was created specifically to handle just this sort of problem.
If you've tried Core Data and it bogged down, it is probably because you used to much entity inheritance. Since Core Data uses single table inheritance on the store side in SQL, all the descendants of an entity end up in the same table and that bogs things down. Remember, entities are separate from the classes they model so you can have class inheritance without having entity inheritance. That gives you the coding advantages of inheritance without the speed penalty of entity inheritance.
Seems the memory is given back to the system slowly at a later interval.
So no real memory leak is there.
For anyone else: try look at CoreData or sqlite3 directly.
If you use sqlite3 directly make sure you encapsulate the complete set of queries in an transaction; it will dramatically increase data throughput.
More info regarding sqlite speed optimization:
http://web.utk.edu/~jplyon/sqlite/SQLite_optimization_FAQ.html
Related
I'm trying to gain insight into the least overhead solution to tracking model object changes in Cocoa.
As I see it there are 3 options:
Use Core Data – lot's of functionality exists for monitoring model object changes (Core Data NSManagedObject - tracking if attribute was changed). I don't know what the overhead of Core Data's management infrastructure is compared to other approaches but it's well established architecture for multi-threading support is a plus. For cross-platform devs there is some downside in not having a readily accessible schema but there are ways around that issue.
Write custom accessors that mark the object as dirty when updating a field with a new value. I've been using this technique with mixed success for quite some time. There are some sticky issues to deal with when sharing objects across threads. You also don't get the benefits of enhancements to automatic synthesis of attributes, etc. You do, however, have greater control of your data store than when using Core Data which can be of benefit (eg. certain operations can be done in a SQL store across many objects in a much more efficient way). Note: There could be a lot of variation here depending on how you write the accessors. For the sake of conversation let's assume setters make a check of the new value against the old one, make appropriate calls to KVO (willChange / didChange), and set a boolean flag (all within synchronization of course).
Use KVO to monitor object fields (ala keyPathsForValuesAffectingValueForKey:) and mark the object as dirty in the KVO callout. I have yet to use this method but it seems like a decent approach. The obvious downside would be the callout every time a setter is called.
I am inclined to think that option 2 has the lowest overhead (in terms of raw processing requirements) given that Core Data and KVO both have some additional overhead either in the generated accessors or in the KVO callouts. The question is, how substantial is the overhead?
And lastly, did I miss an option?
Thanks.
TLDR summary: (a) Should I include (lengthy) method code in classes which may spawn multiple objects at runtime, (b) does doing so cause memory usage bloat, (c) if so should I "outsource" the code to a class that is loaded only once and have the class methods call that, or alternatively (d) does the code get loaded only once with the object definition anyway and I'm worrying about nothing?
........
I don't know whether there's a good answer to this but if there is I haven't found it yet by searching in the usual places.
In my VB.Net (2010 if it matters) WinForms project I have about a dozen or so class objects in an object model. Some of these are pretty simple and do little more than act as data storage repositories. The ones further up the object model, however, have an increasing number of methods. There can be a significant number of higher level objects in use though the exact number will be runtime dependent so I can't be more precise than that.
As I was writing the method code for one of the top level ones I noticed that it was starting to get quite lengthy.
Memory optimisation is something of a lost art given how much memory the average PC has these days but I don't want to make my application a resource hog. So my questions for anyone who knows .Net way better than I do (of which there will be many) are:
Is the code loaded into memory with each instance of the class that's created?
Alternatively is it loaded only once with the definition of the class, and all derived objects just refer to that definition? (I'm not really sure how that could be possible given that, for example, event handlers can be assigned dynamically, but no harm asking.)
If the answer to the first one is yes, would it be more efficient to write the code in a "utility" object which is loaded only once and called from the real class' methods?
Any thoughts appreciated.
Go with whichever is going to be the easier to maintain codebase (shorter methods, etc). That is the more important cost with anything that has increasing complexity.
Memory optimization is only a problem if its a problem. 12 classes is really nothing, when you have hundreds of instances of hundreds of classes, then it may become a problem.
The short answer, it doesn't matter. Your data is stored in memory but your code is loaded only once.
EDIT: I guess I need a longer answer.
If you have 10 instances of a class, the variables that are part of that instance all take up thier own memory space. So if you have 10 properties, variables, etc, that means you have 100(ish) items in your memory. As for your code, it was loaded just once with your assembly. If you create 10 instances of your class, your code is not in memory 10 times.
This is similar to a question I asked before, but now that I've come much further along I still have a question about "proper" subclassing of NSManagedObject as I was told last night that it's a "bad idea" to put lots of non-persisted properties and ivars inside one. Currently I have TONS of code inside my NSManagedObject, and Apple's docs don't really address the "rightness" of that. FYI: the code works, but I'm asking if there are pitfalls ahead, or if there are obvious improvements to doing it another way.
My "object" is a continuously growing array of incoming data, the properties/ivars that track the progress of the analysis of that data, and the processed data (output). All of this is stored in memory because it grows huge, very quickly, and would not be possible to re-generate/re-analyze continuously. The NSManagedObject properties that are actually persisted are just the raw data (regularly saved, as Core Data doesn't support NSMutableData), a few basic properties and 2 relationships to other NSManagedObjects (1 being a user, the other being a set of snapshots of the data). Only one object is being recorded to at any one time, although dozens can be opened for viewing (which may involve further processing at any time).
It's not possible to have the object that inserts the entity (data manager that manages Core Data) have all of the processing logic/variables inside it, as each object necessitates at least a handful of arrays/properties that are used as intermediaries and tracking values for the analysis. And I, personally, think that it sounds silly to create two objects for each object that is being used (the NSManagedObject that is the store, and another object that is the processing/temp store).
Basically, all of the examples I can find using NSManagedObjects have super simple objects that are things like coordinates, address book entries, pictures: stuff that is basically static. In that case I can see having all of the logic that creates/modifies them outside the object. However, my case is not that simple and I have yet to come up with an alternative that doesn't involve duplication.
Any suggestions would be appreciated.
You might use a 'wrapper', that is to say a class with a reference to one of your managed object instances, this wrapper would contain your algorithms and your non persisted algorithms.
I used MATLAB to write a simulation engine for the simulation of product flows in a production environment. I inherited all used class from handle and used these handles (quite excessively, I guess) to link between e.g. products and work systems, orders, etc.
Now, to run multiple instances of my model, I create a simulation object that contains all other objects and their relations, run the model and free the simulation variable.
Creating and running the model takes ~50 seconds (this including the generation of all objects, their relations and of course the calculation over the course of the simulation run). Freeing the variable before the next run, currently takes ~3-4 minutes!
I tried clear, delete and plain overwriting of the old simulation object, without notifying significant differences in performance.
Is there a way to improve the performance without rewriting the code?
It is hard to say anything particular about your code without seeing it, or at least some high level design.
A short advice before optimizing the OO aspects :
Are you sure that the bottleneck is in the objects creation? Verify it with the profiler.
If the OO is indeed the bottleneck, here are some guesses:
You have used circular references. Matlab does not use garbage collector, but rather a smart reference counting mechanism, which can be quite slow in this case. Change the references between the objects to be tree-like instead.
You have created an enormous amount of objects. Matlab has a significant overhead for each object, much more than the traditional languages (c++, java). Re-design the system to have a smaller amount of objects.
Do you happen to use cell arrays to store other handle objects from within a handle object? This can cause serious slowdowns prior to Matlab R2011A. See http://www.mathworks.com/support/solutions/en/data/1-6VVMS0/index.html?product=ML
A workaround is to use a temp local variable to manipulate cell array, then assign this tmp variable back to your handle object property. I saw ~ 100X improvement in performance after doing this in one case.
Why do we need to use serialization?
If we want to send an object or piece of data through a network we can use streams of bytes. If we want to save some data to the disk, we can again use the binary mode along with the byte streams and save it.
So what's the advantage of using serialization?
Technically on the low-level, your serialized object will also end up as a stream of bytes on your cable or your filesystem...
So you can also think of it as a standardized and already available way of converting your objects to a stream of bytes. Storing/transferring object is a very common requirement, and it has less or little meaning to reinvent this wheel in every application.
As other have mentioned, you also know that this object->stream_of_bytes implementation is quite robust, tested, and generally architecture-independent.
This does not mean it is the only acceptable way to save or transfer an object: in some cases, you'll have to implement your own methods, for example to avoid saving unnecessary/private members (for example for security or performance reasons). But if you are in a simple case, you can make your life easier by using the serialization/deserialization of your framework, language or VM instead of having to implement it by yourself.
Hope this helps.
Quoting from Designing Data Intensive Applications book:
Programs usually work with data in (at least) two different
representations:
In memory, data is kept in objects, structs, lists, arrays, hash tables, trees, and so on. These data structures are optimized for
efficient access and manipulation by the CPU (typically using
pointers).
When you want to write data to a file or send it over the network, you have to encode it as some kind of self-contained sequence of bytes
(for example, a JSON document). Since a pointer wouldn’t make sense to
any other process, this sequence-of-bytes representation looks quite
different from the data structures that are normally used in memory.
Thus, we need some kind of translation between the two
representations. The translation from the in-memory representation to
a byte sequence is called encoding (also known as serialization or
marshalling), and the reverse is called decoding (parsing,
deserialization, unmarshalling).
Among other reasons to be compatible between architecture. An integer doesn't have the same number of bytes from one architecture to another, and sometimes from one compiler to another.
Plus what you're talking about is still serialization. Binary Serialization. You're putting all the bytes of your object together in order to store them and be able to reconvert them as an object later. This is serializing.
More info on wikipedia
Serialization is the process of converting an object into a stream so that it can be saved in any physical file like (XML) or can be saved in Database. The main purpose of Serialization in C# is to persist an object and save it in any specified storage medium like stream, physical file or DataBase.
In General, serialization is a method to persist an object's state, but i suggest you to read this wiki page, it is pretty detailed and correct in my opinion:
http://en.wikipedia.org/wiki/Serialization
In serialization, the point is not turning an object into bits and bytes, objects ARE bits and bytes already. Serialization is the process of making the object's "state" persistent. Notice the word "state", which means the values of the instance variables of the entire object graph (the target object and all the objects it references either directly or indirectly) WITHOUT methods and other extra runtime stuff stuck to them (and of course plus a little more info that JVM needs for restoration of these objects, such as their class types).
So this is the main reason of its necessity: Storing the whole bytes of objects would be expensive, and for all intents and purposes, unnecessary.