Performance implications of changing NSManagedObject instances that I never intend to save - objective-c

I have a CoreData-based application that retrieves data about past events from an SQLite persistence store. Once I have the past events my application does some statistical analysis to predict future events based on the data it has about past events. Once my application has made a prediction about future events I want to run another algorithm that does some evaluation of that prediction. I'm expecting to do a lot of these evaluations, so performance optimization for each evaluation is likely to be critical.
Now, all of the classes I need to represent my future event predictions exist in my data model, and I have NSManagedObject subclasses for most of the important entities. The easiest way for me to implement my algorithms is to "fill in" the results for future events based on the prediction, and then run my evaluation using NSManagedObject instances for both the past events and the predictions for future events. However, I typically don't want to save these future event predictions in my persistent store: once I have performed my evaluation of the prediction I want to throw away the predictions and just keep the evaluation results. I can do this pretty easily, I think, by just sending the rollback message to my managed object context once my evaluation is complete.
That will all work fine, and from a coding perspective it seems like it will be quite easy to implement. However, I am wondering if I should expect performance concerns making such heavy use of managed objects when I have no intention of ever saving the changes I'm making. Given that performance is likely to be a factor, does using NSManagedObject instances for this make sense? Surely all the things it's doing to keep track of changes and support things like undo and complex entity relationships come with some amount of overhead. Should I be concerned about this overhead?
I could of course create non-NSManagedObject classes that implement an optimized version of my model classes for use when making predictions and evaluating them. That would involve a lot of additional work, including the work necessary to copy data back and forth between the NSManagedObject instances for past events and the optimized class instances for future events: I'd rather not create that code if it is not needed.

Surely all the things it's doing to keep track of changes and support things like undo and complex entity relationships come with some amount of overhead.
Core Data doesn't have the overhead that people expect, owing to its optimizations. In general, using managed objects in memory is as fast as or faster than any custom objects and management code you write yourself.
Should I be concerned about this overhead?
Can't really say without implementation details, but most likely not. You can hand-tweak Core Data for specific circumstances to get better performance.
The best approach is always to start with the simplest solution and then move to a more complex one only when testing reveals that the simple solution does not perform well.
Premature optimization is the root of all evil.

Related

Is there a standard way to breakdown huge optimization models to ensure the submodules are working correctly?

Apologies as this may be a general question for optimization:
For truly large-scale optimization models, it feels as if the model becomes quite complex and cumbersome before it is even testable. For small-scale optimization problems, with up to 10-20 constraints, it's feasible to code the entire program and just stare at it and debug.
However, for large scale models with potentially 10s-100 of constraint equations it feels as if there should be a way to test subsections of the optimization model before putting the entire thing together.
Imagine you are writing an optimization for a rocket that needs to land on the moon: the model tells the rocket how to fly itself and land on the moon safely. There might be one piece of the model that dictates the gravitational effects of the earth and moon and their orbits, which influence how the rocket should position itself; another module that dictates how the thrusters should fire in order to maneuver the rocket into the correct positions; and perhaps a final module that dictates how to optimally use the various fuel sources.
Is there a good practice to ensure that one small section (e.g. the gravitational module) works well independently of the whole model, and then to iteratively test the rocket thruster piece, then the optimal fuel use, etc.? Once you put all the pieces together and the model doesn't solve (perhaps due to missing constraints or variables), it quickly becomes a nightmare to debug.
What are the best practices, if any, for iteratively building and testing large-scale optimization models?
I regularly work on models with millions of variables and equations. ("10s-100 of constraint equations" is considered small-scale.) Luckily they all have fewer than, say, 50 blocks of similar equations (indexed equations). Obviously just eyeballing solutions is impossible. So we add a lot of checks (also on the data, which can contain errors). For debugging, it is a very good idea to have a very small data set around. Finally, it helps to have good tools, such as modeling systems with type/domain checking, automatic differentiation, etc.
Often we cannot really check equations in isolation because we are dealing with simultaneous equations. The model only makes sense when all equations are present. So "iterative building and testing" is usually not possible for me. Sometimes we keep small stylized models around for documentation and education.

Solution Cloning Performance Tips

We are currently trying to improve the performance of a planning problem we've implemented in OptaPlanner. Our model has ~45,000 chained variables and after profiling the application it seems like the main bottleneck is around the cloning. Approximately 90% of the CPU run-time is consumed by the FieldAccessingSolutionCloner method calls.
We've already tried to make our object model more lightweight by reducing the number of Maps and Sets within the PlanningEntities and changing fields to primitives where possible, but from your own OptaPlanner experience, do you have any advice about how to speed up cloning performance?
Have you tried writing a custom cloner? See docs.
The default one needs to rely on reflection, so it's slower.
Also, the structure of your domain model influences how much you need to clone (regardless if you go custom or not):
If you delete your Solution and Planning Entities classes, do your other domain classes still compile?
If yes, then the clone is minimal. If no, it's not.

First Time Architecturing?

I was recently given the task of rebuilding an existing RIA. The new RIA that I've designed is based on Silverlight, with a WCF service to connect to MS SQL Server. This is my first time doing something like this, so I'm not sure how to design the entire thing.
Basically, the client can look through graphs of "stocks" (allowing the client to choose different time periods, settings, etc). I've written the whole application essentially, but I'm not sure how to put it together.
The graphs are supposed to be directly based on the database, and to create the datapoints on the graph, some calculations need to be done (not very expensive ones).
The problem I'm having is to decide where to put the calculations (client or serverside? Or half and half?)
What factors should I look for to help me decide where the calculations should be done? And how can I go about optimizing this (caching, etc)?
Obviously this is a very broad subject, so I'm not expecting an immediate answer, but any help/pointing in the right direction/resources would be appreciated.
A few tips for this kind of app.
Put as much logic as possible on the client.
Make the client responsible for session data, making all your server code stateless.
Try to minimize traffic to and from the server (Bigger requests are more efficient than multiple smaller ones) so consolidate requests when possible.
If this project is likely to grow beyond its current feature set, I think it's probably a good idea to perform the calculations client side. This can avoid scaling issues, because you're using all the client-side CPUs rather than your single, precious server CPU. This does, however, rely on being able to transfer the required data to the client in an efficient way; otherwise you replace a processor bottleneck with a network bottleneck.
As for caching, it depends on your inputs: what variables can users of the client affect? If any of the variables they can alter are discrete (i.e. limited to a fixed set of values) then they're candidates for caching. For example, if a user can select an arbitrary date range of stock variations to view, then caching is probably not so useful; if, however, they can only select a year, then you could cache your data sets by year (download each data set to the client and perform your calculation). I wouldn't worry about caching too much unless you find there's a real performance problem. It will only make your code more complex, so don't add it until you have proven you need it.
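To make the "cache by year" idea concrete, here is a minimal sketch in Python (not the Silverlight/WCF stack from the question). The function `load_prices_for_year` and its fake data are hypothetical stand-ins for whatever call fetches a year of stock data; only the caching pattern is the point.

```python
from functools import lru_cache

# Records every simulated fetch so the effect of the cache is visible.
LOAD_CALLS = []

def load_prices_for_year(year):
    """Hypothetical stand-in for the expensive service/database call."""
    LOAD_CALLS.append(year)
    # Fake, deterministic data: five prices derived from the year.
    return tuple(100.0 + year % 100 + day * 0.5 for day in range(5))

@lru_cache(maxsize=32)
def yearly_average(year):
    """Cached per year: a discrete input, so repeated requests hit the cache."""
    prices = load_prices_for_year(year)
    return sum(prices) / len(prices)

if __name__ == "__main__":
    yearly_average(2008)   # fetches and computes
    yearly_average(2008)   # served from the cache, no second fetch
    yearly_average(2009)   # a different key, so another fetch
    print(LOAD_CALLS)
```

An arbitrary start/end date pair would produce far too many distinct keys for this to pay off, which is the distinction the answer draws.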
One other thing, if this project is unlikely to be a long term concern then implement the calculations wherever is easiest and fastest, you can revisit if the project becomes more important later on.
Be REALLY REALLY careful about implementing client-side caching. Caching is INSANELY hard to do right while maintaining performance, security and correctness. Note that your DB Server's caching mechanism is already likely to be way better than any local caching mechanism you're likely to implement in less than 2 weeks' effort!
I would urge you to do as much work on the back-end as possible and to limit your client to render the data in a manner that is appropriate for your users. While many may balk at this suggestion, it's based on a number of observations from building many such systems in the past:
If you're going to filter some of the data returned by your service, you've just wasted thousands of clock cycles shipping data that need never have left your server.
If you're going to sort your data, your DB could have done the sorting for you (often using otherwise idle CPU ticks) while waiting for the data to be read from its disks.
Your server most likely has more CPU and RAM available than your clients and has a surprising amount of "free time" to use for sorting, filtering, running inline calculations, etc., while it's waiting for disks to read sectors etc.
As Roman suggested: Minimize your round-trips between your client and your server as much as possible.
But perhaps most importantly:
1. BEFORE YOU START DESIGNING YOUR SYSTEM, state your performance goals.
2. Design what you think will achieve those goals. Try to find bottlenecks in your design, particularly areas where you make blocking calls. Re-design those areas to use async patterns wherever you can.
3. Build your intended solution.
4. Measure your actual performance under actual real-world load.
5. If you're within your expected performance goals, then you're done.
6. If not, work out where you're spending too long and tune the design of that portion of the system. Goto 3.
Don't try to build the perfect system in one try. Chances are that you won't manage it, no matter how hard you try, for a variety of reasons including user expectations, your server's ability to process the required load, your clients' ability to handle the returned data, your network's ability to carry the traffic, etc.
They're a little old now, but I suggest you read through some of the earlier posts at http://blogs.msdn.com/richardt for more thoughts around designing and constructing Service Oriented and distributed systems.

Is ORM slow? Does it matter? [closed]

I really like ORM as compared to stored procedures, but one thing I'm afraid of is that ORM could be slow because of the layers and layers of abstraction. Will using ORM slow down my application? Or does it matter?
Yes, it matters. It is using more CPU cycles and consequently slowing your application down. Hear me out though...
But, consider this: what is more expensive? Server hardware or another programmer? Server hardware, generally, is cheaper than hiring another team of programmers. So, while ORM may be costing you CPU cycles, you need one less programmer to manage your SQL queries, often resulting in a lower net cost.
To determine if it's worth it for you, calculate or determine how many hours you saved by using an ORM. Then, figure out how much money you spent on the server to support ORM. Multiply the hours you saved by your hourly rate and compare to the server cost.
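The comparison described above is simple arithmetic; as a rough sketch in Python, with a hypothetical function name and made-up numbers (none of these figures come from a real project):

```python
def orm_net_saving(hours_saved, hourly_rate, extra_server_cost):
    """Developer time saved (in money) minus the extra hardware spend.

    A positive result means the ORM paid for itself under these assumptions.
    """
    return hours_saved * hourly_rate - extra_server_cost

if __name__ == "__main__":
    # Illustrative inputs only: 40 hours saved at 75/hour vs. 1200 of extra server cost.
    print(orm_net_saving(hours_saved=40, hourly_rate=75, extra_server_cost=1200))
```

The hard part is estimating the inputs honestly, not the calculation itself.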
Of course, whether an ORM actually saves you time is a whole other debate...
Is ORM slow?
Not inherently. Some heavyweight ORMs can add a general drag to things but we're not talking orders of magnitude slowdown.
What does make ORM slow is naïve usage. If you're using an ORM because it looks easy and you don't know how the underlying relational data model works, you can easily write code that seems reasonable to an OO programmer, but will murder performance.
ORM is a handy tool, but you need the lower-level understanding (that usually comes from writing SQL queries) to go with it.
Does it matter?
If you end up performing a looped query for each of thousands of entities at once, instead of a single fast join, then certainly it can.
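To illustrate the difference, here is a small sketch using Python's built-in sqlite3 module rather than any particular ORM. The `author`/`post` schema and the function names are invented for the example; the looped version is the access pattern a naively used ORM can generate behind the scenes.

```python
import sqlite3

def build_db():
    """Create an in-memory database with 100 authors and 3 posts each."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE author (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE post (
            id INTEGER PRIMARY KEY,
            author_id INTEGER REFERENCES author(id),
            title TEXT
        );
    """)
    conn.executemany("INSERT INTO author (id, name) VALUES (?, ?)",
                     [(i, f"author{i}") for i in range(1, 101)])
    conn.executemany("INSERT INTO post (author_id, title) VALUES (?, ?)",
                     [(i, f"post{i}-{j}") for i in range(1, 101) for j in range(3)])
    return conn

def titles_looped(conn):
    """One query for the authors, then one more query per author (N+1)."""
    queries = 1
    authors = conn.execute("SELECT id, name FROM author").fetchall()
    result = {}
    for author_id, name in authors:
        rows = conn.execute(
            "SELECT title FROM post WHERE author_id = ? ORDER BY id",
            (author_id,)).fetchall()
        queries += 1
        result[name] = [title for (title,) in rows]
    return result, queries

def titles_joined(conn):
    """The same data fetched with a single join."""
    rows = conn.execute("""
        SELECT a.name, p.title
        FROM author a JOIN post p ON p.author_id = a.id
        ORDER BY a.id, p.id
    """)
    result = {}
    for name, title in rows:
        result.setdefault(name, []).append(title)
    return result, 1

if __name__ == "__main__":
    conn = build_db()
    looped, n_looped = titles_looped(conn)
    joined, n_joined = titles_joined(conn)
    print(looped == joined, n_looped, n_joined)
```

Both functions return the same data; the looped one issues one query per author plus one. Against a real database server each of those is a network round trip, which is where the time goes.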
ORMs are slower and do add overhead to applications (unless you specifically know how to get around these problems, which is not very common). The database is the most critical element, and Web applications should be designed around it.
Many OOP frameworks that use Active Record or ORMs, and developers in general, treat the database as an unimportant afterthought and tend to look at it as something they don't really need to learn. But performance and scalability usually suffer as the db is heavily taxed!
Many large scale web apps have fallen flat, wasting millions and months to years of time because they didn't recognize the importance of the database. Hundreds of concurrent users and tables with millions of records require database tuning and optimization. But I believe the problem is noticeable with a few users and less data.
Why are developers so afraid to learn proper SQL and tuning measures when it's the key to performance?
In a Windows Mobile 5 project against using SqlCe, I went from using hand-coded objects to code generated (CodeSmith) objects using an ORM template. In the process all my data access used CSLA as a base layer.
The straight conversion improved my performance by 32% in local testing, almost all of it a result of better access methods.
After that change, we adjusted the templates (after seeing some SqlCe performance stuff at PDC by Steve Lasker) and in less than 20 minutes our entire data layer was greatly improved: our average 'slow' calls went from 460ms to ~20ms. The cool part about the ORM stuff is that we only had to implement (and unit test) these changes once and all the data access code got changed. It was an amazing time saver; we saved maybe 40 hours or more.
The above being said, we did lose some time by taking out a bunch of 'waiting' and 'progress' dialogs that were no longer needed.
I have used a few of the ORM tools, and I can recommend two of them:
.NET Tiers
CSLA codegen templates
Both of them have performed quite nicely and any performance loss has not been noticeable.
I've always found it doesn't matter. You should use whatever will make you the most productive, responsive to changes, and whatever is easiest to debug and maintain.
Most applications never see enough load for the difference between ORM and SPs to be noticeable. And there are optimizations to make ORM faster.
Finally, a well-written app will have its data access separated from everything else, so that switching from ORM to something else in the future remains possible.
Is ORM slow?
Yes (compared with stored procedures)
Does it matter?
No (unless your concern is speed)
I think the problem is that many people think of ORM as an object "trick" for databases, a way to code less or simplify SQL usage, while in reality it is, well, an Object-to-Relational (DB) Mapping.
ORM is used to persist your objects to a relational database management system, and not (just) to substitute for SQL or make it easier (although it does a good job at that too).
If you don't have a good object model, or you're using it to make reports, or even if you're just trying to get some information, ORM is not worth it.
If, on the other hand, you have a complex system modeled through objects, where each one has different rules and they interact dynamically, and your concern is persisting that information into the database rather than replacing some existing SQL scripts, then go for ORM.
Yes, ORM will slow down your application. By how much depends on how far the abstraction goes, how well your object model maps to the database, and other factors. The question should be, are you willing to spend more developer time and use straight data access or trade less dev time for slower runtime performance.
Overall, the good ORMs have little overhead and, by and large, are considered well worth the trade off.
Yes, ORMs affect performance, whether that matters ultimately depends on the specifics of your project.
Programmers often love ORM because they like the nice front-end coding environments like Visual Studio and dislike coding raw SQL with no intellisense, etc.
ORMs have other limitations besides a performance hit. They often do not do what you need 100% of the time, and they add the complexity of an additional abstraction layer that must be maintained and re-established every time changes are made. There are also caching issues to be dealt with.
Just a thought -- if the database vendors would make the SQL programming environment as nice as Visual Studio, and provide a more natural linkage between the db code and front-end code, we wouldn't need the ORMs...I guess things may go in that direction eventually.
Obvious answer: It depends
ORM does a good job of insulating a programmer from SQL. This in effect substitutes mediocre, computer generated queries for the catastrophically bad queries a programmer might give.
Even in the best case, an ORM is going to do some extra work, loading fields it doesn't need to, explicitly checking constraints, and so forth.
When these become a bottle-neck, most ORM's let you side-step them and inject raw SQL.
If your application fits well with objects, but not quite so easily with relations, then this can still be a win. If instead your app fits nicely around a relational model, then the ORM represents a coding bottleneck on top of a possible performance bottleneck.
One thing I've found to be particularly offensive about most ORMs is their handling of primary keys. Most ORMs require pk's for everything they touch, even if there is no conceivable use for them. Example: Authors should have pk's, Blog posts SHOULD have pk's, but the links (join table) between authors and posts should not.
I have found that the difference between "too slow" and "not too much slower" depends on if you have your ORM's 2nd level (SessionFactory) cache enabled. With it off it handles fine under development load, but will crush your system under mild production load. After turning on the 2nd Level cache the server handled the expected load and scaled nicely.
ORM can get an order of magnitude slower, not just on the grounds of wasting a lot of CPU cycles on its own but also by using much more memory, which then has to be GC-d.
Much worse than that, however, is that there is no standard for ORM (unlike SQL) and that, by and large, ORMs use SQL very inefficiently, so at the end of the day you still have to dig into SQL to fix perf issues every time an ORM makes a mess and you have to debug it. Meaning that you haven't gained anything at all.
It's terribly immature technology for real production-level applications. Particularly problematic are the handling of indexes and foreign keys, tweaking tables to fit object hierarchies, and terribly long transactions, which mean many more deadlocks and repeats, if an ORM knows how to handle that at all.
It actually makes servers less scalable, which multiplies costs, but these costs don't get mentioned at the beginning - a little inconvenient truth :-) When something uses transactions 10-100 times bigger than optimal it becomes impossible to scale the SQL side at all. I'm talking about serious systems again, not home/toy/academic stuff.
An ORM will always add some overhead because of the layers of abstraction but unless it is a poorly designed ORM that should be minimal. The time to actually query the database will be many times more than the additional overhead of the ORM infrastructure if you are doing it correctly, for example not loading the full object graph when not required. A good ORM (nHibernate) will also give you many options for the queries run against the database so you can optimise as required as well.
Using an ORM is generally slower. But the boost in productivity you get will get your application up and running much faster. And the time you save can later be spent finding the portions of your application that are causing the biggest slow down - you can then spend time optimizing the areas where you get the best return on your development effort. Just because you've decided to use an ORM doesn't mean you can't use other techniques in the sections of code that can really benefit from it.
An ORM can be slower, but this is offset by their ability to cache data, therefore however fast the alternative, you can't get much faster than reading from memory.
I never really understood why people think that this is slower or that is slower... get a real machine, I say. I have had mixed results... I've seen cases where execution time for a stored procedure is much slower than ORM and vice versa. But in both cases the performance was due to differences in hardware.

When do transactions become more of a burden than a benefit?

Transactional programming is, in this day and age, a staple in modern development. Concurrency and fault-tolerance are critical to an application's longevity and, rightly so, transactional logic has become easy to implement. As applications grow, though, it seems that transactional code tends to become more and more burdensome on the scalability of the application, and when you bridge into distributed transactions and mirrored data sets the issues start to become very complicated. I'm curious what seems to be the point, in data size or application complexity, at which transactions frequently start becoming the source of issues (causing timeouts, deadlocks, performance issues in mission-critical code, etc.) that are more bothersome to fix, troubleshoot or work around than designing a data model that is more fault-tolerant in itself, or using other means to ensure data integrity. Also, what design patterns serve to minimize these impacts or make standard transactional logic obsolete or a non-issue?
--
EDIT: We've got some answers of reasonable quality so far, but I think I'll post an answer myself to bring up some of the things I've heard about to try to inspire some additional creativity; most of the responses I'm getting are pessimistic views of the problem.
Another important note is that not all dead-locks are a result of poorly coded procedures; sometimes there are mission critical operations that depend on similar resources in different orders, or complex joins in different queries that step on each other; this is an issue that can sometimes seem unavoidable, but I've been a part of reworking workflows to facilitate an execution order that is less likely to cause one.
I think no design pattern can solve this issue in itself. Good database design, good stored procedure programming and especially learning how to keep your transactions short will ease most of the problems.
There is no 100% guaranteed method of not having problems though.
In basically every case I've seen in my career though, deadlocks and slowdowns were solved by fixing the stored procedures:
making sure all tables are accessed in the same order prevents deadlocks
fixing indexes and statistics makes everything faster (and hence diminishes the chance of deadlock)
sometimes there was no real need for transactions, it just "looked" like it
sometimes transactions could be eliminated by turning multi-statement stored procedures into single-statement ones.
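The first point, consistent access order, can be sketched outside a database with plain Python locks. This is an in-process analogy, not stored procedure code: the table names and the helper `run_with_tables` are invented, and each `threading.Lock` stands in for a lock the database would take on that table.

```python
import threading

# One lock per "table"; names are purely illustrative.
TABLE_LOCKS = {name: threading.Lock()
               for name in ("accounts", "inventory", "orders")}

def run_with_tables(tables, work):
    """Acquire the locks for `tables` in one global (sorted) order, run `work`, release."""
    ordered = sorted(set(tables))
    acquired = []
    try:
        for name in ordered:
            TABLE_LOCKS[name].acquire()
            acquired.append(name)
        return work(ordered)
    finally:
        # Release in reverse order of acquisition.
        for name in reversed(acquired):
            TABLE_LOCKS[name].release()

def demo(iterations=200):
    """Two workers ask for the same tables in opposite orders."""
    log = []

    def worker(tables):
        for _ in range(iterations):
            run_with_tables(tables, lambda order: log.append(tuple(order)))

    t1 = threading.Thread(target=worker, args=(["orders", "accounts"],))
    t2 = threading.Thread(target=worker, args=(["accounts", "orders"],))
    t1.start()
    t2.start()
    t1.join(timeout=10)
    t2.join(timeout=10)
    stuck = t1.is_alive() or t2.is_alive()
    return log, stuck

if __name__ == "__main__":
    log, stuck = demo()
    print(len(log), stuck)
```

Because every caller ends up taking "accounts" before "orders" regardless of how it listed them, neither worker can hold one lock while waiting on the other in the opposite order, which is the cycle a deadlock needs.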
The use of shared resources is wrong in the long run. Because by reusing an existing environment you are creating more and more possibilities. Just review the busy beavers :) The way Erlang goes is the right way to produce fault-tolerant and easily verifiable systems.
But transactional memory is essential for many applications in widespread use. If you consult a bank with its millions of customers for example you can't just copy the data for the sake of efficiency.
I think monads are a cool concept to handle the difficult concept of changing state.
One approach I've heard of is to make a versioned insert only model where no updates ever occur. During selects the version is used to select only the latest rows. One downside I know of with this approach is that the database can get rather large very quickly.
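As I understand that approach, it could look something like the following sketch, using Python's sqlite3 module. The `customer_version` table, its columns, and the helper names are all made up for illustration; I haven't used this in production.

```python
import sqlite3

def make_store():
    """Create the insert-only table: one row per (customer, version)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("""
        CREATE TABLE customer_version (
            customer_id INTEGER NOT NULL,
            version     INTEGER NOT NULL,
            email       TEXT    NOT NULL,
            PRIMARY KEY (customer_id, version)
        )
    """)
    return conn

def write(conn, customer_id, email):
    """'Update' by inserting a new version; existing rows are never modified."""
    with conn:  # commits on success, rolls back on error
        (latest,) = conn.execute(
            "SELECT COALESCE(MAX(version), 0) FROM customer_version "
            "WHERE customer_id = ?", (customer_id,)).fetchone()
        conn.execute(
            "INSERT INTO customer_version (customer_id, version, email) "
            "VALUES (?, ?, ?)", (customer_id, latest + 1, email))
    return latest + 1

def read_latest(conn):
    """Select only the highest version of each customer."""
    rows = conn.execute("""
        SELECT cv.customer_id, cv.email
        FROM customer_version cv
        WHERE cv.version = (SELECT MAX(version)
                            FROM customer_version
                            WHERE customer_id = cv.customer_id)
        ORDER BY cv.customer_id
    """).fetchall()
    return dict(rows)

if __name__ == "__main__":
    conn = make_store()
    write(conn, 1, "a@example.com")
    write(conn, 2, "b@example.com")
    write(conn, 1, "a2@example.com")   # new version, old row is kept
    print(read_latest(conn))
```

The old row for customer 1 is still in the table after the second write, which is exactly the growth problem mentioned above.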
I also know that some solutions, such as FogBugz, don't use enforced foreign keys, which I believe would also help mitigate some of these problems because the SQL query plan can lock linked tables during selects or updates even if no data is changing in them, and if it's a highly contended table that gets locked it can increase the chance of DeadLock or Timeout.
I don't know much about these approaches though since I've never used them, so I assume there are pros and cons to each that I'm not aware of, as well as some other techniques I've never heard about.
I've also been looking into some of the material from Carlo Pescio's recent post. Unfortunately I've not had enough time to do it justice, but the material seems very interesting.
If you are talking 'cloud computing' here, the answer would be to localize each transaction to the place where it happens in the cloud.
There is no need for the entire cloud to be consistent, as that would kill performance (as you noted). Simply, keep track of what is changed and where and handle multiple small transactions as changes propagate through the system.
The situation where user A updates record R and user B at the other end of the cloud does not see it (yet) is the same as the one where user A hasn't made the change yet in the current strict-transactional environment. This could lead to a discrepancy in an update-heavy system, so systems should be designed to rely on updates as little as possible, moving things to aggregation of data and pulling out the aggregates once the exact figure is critical (i.e. moving the requirement for consistency from write-time to critical-read-time).
Well, just my POV. It's hard to conceive a system that is application agnostic in this case.
Try to make changes at the database level in the least number of possible instructions.
The general rule is to lock a resource for the least possible time. Using T-SQL, PL/SQL, Java on Oracle or any similar approach, you can reduce the time that each transaction locks a shared resource. In fact, transactions in the database are optimized with row-level locks, multi-versioning, and other kinds of intelligent techniques. If you can perform the transaction in the database, you save the network latency, as well as the overhead of other layers like ODBC/JDBC/OLE DB.
Sometimes the programmer tries to get the good things a database offers (it is transactional, parallel, distributed) but keeps a cache of the data. Then they need to add some of the database features manually.