What is the expected performance gap switching from SQL to TSDB for handling time series?

We are using a SQL database as single-node storage for roughly one hour of high-frequency metrics (several thousand inserts a second). We quickly ran into I/O issues that proper buffering alone did not solve, and we are willing to put time into fixing the performance problem.
I suggested switching to a specialised database for handling time series, but my colleague remained pretty skeptical. His argument is that the gain "out of the box" is not guaranteed: he knows SQL well and has already spent time optimizing the storage, whereas we have no TSDB experience with which to optimize it properly.
My intuition is that a TSDB would be much more efficient even with an out-of-the-box configuration, but I don't have any data to back that up, and internet benchmarks such as InfluxDB's are nowhere near trustworthy. We should run our own, except we can't afford to lose time on a dead end or a mediocre improvement.
What would be, in my use case but very roughly, the performance gap between relational storage and a TSDB when it comes to single-node throughput?

This question may be bordering on a software recommendation. I just want to point one important thing out: You have an existing code base so switching to another data store is expensive in terms of development costs and time. If you have someone experienced with the current technology, you are probably better off with a good-faith effort to make that technology work.
Whether you switch or not depends on the actual requirements of your application. For instance, if you don't need the data immediately, perhaps writing batches to a file is the most efficient mechanism.
Your infrastructure has ample opportunity for in-place growth -- more memory, more processors, solid-state disk (for example). These might meet your performance needs with a minimal amount of effort.
If you cannot make the current solution work (and 10k inserts per second should be quite feasible), then there are numerous alternatives. Some NoSQL databases relax some of the strict ACID requirements of traditional RDBMSs, providing faster throughput.
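As a concrete illustration of the buffering/batching idea above, here is a minimal sketch, assuming Python with the standard-library sqlite3 module as a stand-in for your actual RDBMS; the table and column names are placeholders. Accumulating rows in memory and flushing them in a single transaction is usually what separates a few hundred inserts per second from tens of thousands.

```python
import sqlite3
import time

# Hypothetical schema; names are illustrative, not taken from the question.
conn = sqlite3.connect("metrics.db")
conn.execute("CREATE TABLE IF NOT EXISTS metrics (ts REAL, name TEXT, value REAL)")

BATCH_SIZE = 5000
buffer = []

def record(ts, name, value):
    """Queue one data point; flush to disk only when the batch is full."""
    buffer.append((ts, name, value))
    if len(buffer) >= BATCH_SIZE:
        flush()

def flush():
    """Write the whole batch in one transaction instead of one commit per row."""
    if not buffer:
        return
    with conn:  # a single transaction means one fsync instead of thousands
        conn.executemany("INSERT INTO metrics VALUES (?, ?, ?)", buffer)
    buffer.clear()

# Example usage: thousands of inserts per second become ~1 commit per 5000 rows.
for i in range(20000):
    record(time.time(), "cpu.load", i * 0.001)
flush()
```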

Related

How to store 15 x 100 million 32-byte records for sequential access?

I've got 15 x 100 million 32-byte records. Only sequential access and appends are needed. The key is a Long. The value is a tuple: (Date, Double, Double). Is there something in this universe which can do this? I am willing to have 15 separate databases (SQL/NoSQL) or files, one for each of those sets of 100 million records. I only have an i7 core, 8 GB of RAM and a 2 TB hard disk.
I have tried PostgreSQL, MySQL, Kyoto Cabinet (with fine tuning) with Protostuff encoding.
SQL DBs (with indices) take forever to do the silliest query.
Kyoto Cabinet's B-Tree can handle up to 15-18 million records, beyond which appends take forever.
I am fed up so much that I am thinking of falling back on awk + CSV which I remember used to work for this type of data.
If your scenario means always going through all records in sequence, then using a database may be overkill. If you start to need random lookups, replacing/deleting records, or checking whether a new record duplicates an older one, a database engine would make more sense.
For the sequential access, a couple of text files or hand-crafted binary files will be easier to handle. You sound like a developer, so I would probably go for my own binary format and access it with memory-mapped files to improve sequential read/append speed. No caching, just a sliding window to read the data. I think it would perform better than any DB would, even on ordinary hardware; I did such a data analysis once. It would also be faster than awking CSV files; however, I am not sure by how much, or whether it would justify the effort of developing the binary storage in the first place.
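A minimal sketch of that idea, assuming Python: each record is a fixed 32-byte struct (8-byte long key plus an epoch-seconds date and two doubles, which is an assumed layout for the (Date, Double, Double) tuple above), appended to a plain file and read back sequentially through a memory map.

```python
import mmap
import struct

# Fixed 32-byte record: long key + epoch-seconds date + two doubles (assumed layout).
RECORD = struct.Struct("<qqdd")   # 8 + 8 + 8 + 8 = 32 bytes

def append(path, records):
    """Append records sequentially; no index, no hand-rolled caching."""
    with open(path, "ab") as f:
        for key, date, v1, v2 in records:
            f.write(RECORD.pack(key, date, v1, v2))

def scan(path):
    """Yield every record via a read-only memory map; the OS handles the sliding window."""
    with open(path, "rb") as f:
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        for offset in range(0, len(mm), RECORD.size):
            yield RECORD.unpack_from(mm, offset)
        mm.close()

append("part01.bin", [(i, 1700000000 + i, i * 0.5, i * 0.25) for i in range(1000)])
print(sum(1 for _ in scan("part01.bin")))  # -> 1000
```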
As soon as a database becomes interesting, you can have a look at MongoDB and CouchDB. They are used for storing and serving very large amounts of data. (There is a flattering evaluation that compares one of them to traditional DBs.) Databases usually need reasonably powerful hardware to perform well; maybe you could check how those two do with your data.
--- Ferda
Ferdinand Prantl's answer is very good. Two points:
By your requirements I recommend that you create a very tight binary format. This will be easy to do because your records are fixed size.
If you understand your data well you might be able to compress it. For example, if your key is an increasing log value you don't need to store it entirely. Instead, store the difference to the previous value (which is almost always going to be one). Then, use a standard compression algorithm/library to save on data size big time.
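A rough sketch of that delta-plus-compression idea, assuming Python and a 32-byte fixed record (long key, date, two doubles, as described in the question); the printed size is simply whatever zlib achieves on toy data, not a measured claim about the asker's data set.

```python
import struct
import zlib

def encode_block(records):
    """Delta-encode the monotonically increasing key, then compress the whole block."""
    out = bytearray()
    prev_key = 0
    for key, date, v1, v2 in records:
        out += struct.pack("<qqdd", key - prev_key, date, v1, v2)  # delta is usually 1
        prev_key = key
    return zlib.compress(bytes(out))

def decode_block(blob):
    """Reverse the delta encoding after decompression."""
    raw = zlib.decompress(blob)
    records, prev_key = [], 0
    for off in range(0, len(raw), 32):
        dkey, date, v1, v2 = struct.unpack_from("<qqdd", raw, off)
        prev_key += dkey
        records.append((prev_key, date, v1, v2))
    return records

block = [(i, 1700000000, 1.5, 2.5) for i in range(100000)]
blob = encode_block(block)
print(len(blob), "compressed bytes for", 32 * len(block), "raw bytes")
assert decode_block(blob) == block
```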
For sequential reads and writes, leveldb will handle your dataset pretty well.
I think that's about 48 gigs of data in one table.
When you get into large databases, you have to look at things a little differently. With an ordinary database (say, tables less than a couple million rows), you can do just about anything as a proof of concept. Even if you're stone ignorant about SQL databases, server tuning, and hardware tuning, the answer you come up with will probably be right. (Although sometimes you might be right for the wrong reason.)
That's not usually the case for large databases.
Unfortunately, you can't just throw 1.5 billion rows straight at an untuned PostgreSQL server, run a couple of queries, and say, "PostgreSQL can't handle this." Most SQL dbms have ways of dealing with lots of data, and most people don't know that much about them.
Here are some of the things that I have to think about when I have to process a lot of data over the long term. (Short-term or one-off processing, it's usually not worth caring a lot about speed. A lot of companies won't invest in more RAM or a dozen high-speed disks--or even a couple of SSDs--for even a long-term solution, let alone a one-time job.)
Server CPU.
Server RAM.
Server disks.
RAID configuration. (RAID 3 might be worth looking at for you.)
Choice of operating system. (64-bit vs 32-bit, BSD v. AT&T derivatives)
Choice of DBMS. (Oracle will usually outperform PostgreSQL, but it costs.)
DBMS tuning. (Shared buffers, sort memory, cache size, etc.)
Choice of index and clustering. (Lots of different kinds nowadays.)
Normalization. (You'd be surprised how often 5NF outperforms lower NFs. Ditto for natural keys.)
Tablespaces. (Maybe putting an index on its own SSD.)
Partitioning.
I'm sure there are others, but I haven't had coffee yet.
But the point is that you can't determine whether, say, PostgreSQL can handle a 48 gig table unless you've accounted for the effect of all those optimizations. With large databases, you come to rely on the cumulative effect of small improvements. You have to do a lot of testing before you can defensibly conclude that a given dbms can't handle a 48 gig table.
Now, whether you can implement those optimizations is a different question--most companies won't invest in a new 64-bit server running Oracle and a dozen of the newest "I'm the fastest hard disk" hard drives to solve your problem.
But someone is going to pay either for optimal hardware and software, for dba tuning expertise, or for programmer time and waiting on suboptimal hardware. I've seen problems like this take months to solve. If it's going to take months, money on hardware is probably a wise investment.

Design a database with a lot of new data

I'm new to database design and need some guidance.
A lot of new data is inserted to my database throughout the day. (100k rows per day)
The data is never modified or deleted once it has been inserted.
How can I optimize this database for retrieval speed?
My ideas
Create two databases (possibly on different hard drives) and merge the two at night when traffic is low
Create some special indexes...
Your recommendation is highly appreciated.
UPDATE:
My database only has a single table.
100k/day is actually fairly low. 3M/month, 40M/year. You can store 10 years archive and not reach 1B rows.
The most important thing to choose in your design will be the clustered key(s). You need to make sure that they are narrow and can serve all the queries your application will normally use. Any query that ends up in a table scan will completely trash your memory by pulling in the entire table. So, no surprises there: the driving factor in your design is the actual load you'll have, namely exactly which queries you will be running.
A common problem (more often neglected than not) with any high insert rate is that eventually every row inserted will have to be deleted. Pretending otherwise is a pipe dream. The proper strategy depends on many factors, but probably the best bet is a sliding window partitioning scheme. See How to Implement an Automatic Sliding Window in a Partitioned Table. This cannot be an afterthought; the choice of how to remove data will permeate every aspect of your design, and you had better start working out that strategy now.
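The sliding-window idea can be sketched independently of the engine. The linked article covers SQL Server partition switching; the rough shape below is only an illustration of the same principle (keep data bucketed by date and drop the oldest bucket as a cheap metadata operation instead of deleting millions of rows), written as a hypothetical per-day-table scheme in Python with sqlite3 as a stand-in.

```python
import sqlite3
from datetime import date, timedelta

conn = sqlite3.connect("events.db")
RETENTION_DAYS = 90  # assumed retention window

def table_for(day):
    return "events_" + day.strftime("%Y%m%d")

def ensure_partition(day):
    """Create the bucket (one table per day) that new rows go into."""
    conn.execute(f"CREATE TABLE IF NOT EXISTS {table_for(day)} (id INTEGER, payload TEXT)")

def drop_expired(today):
    """Sliding window: dropping a whole table is cheap, unlike DELETEing millions of rows."""
    cutoff = table_for(today - timedelta(days=RETENTION_DAYS))
    rows = conn.execute("SELECT name FROM sqlite_master WHERE type='table' "
                        "AND name LIKE 'events_%'").fetchall()
    for (name,) in rows:
        if name < cutoff:           # YYYYMMDD suffix compares correctly as text
            conn.execute(f"DROP TABLE {name}")

today = date.today()
ensure_partition(today)
drop_expired(today)
```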
The best tip I can give, which all big sites use to speed up their websites, is:
CACHE CACHE CACHE
use redis/memcached to cache your data! Because memory is (blazingly) fast and disk I/O is expensive.
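A minimal cache-aside sketch of what that means in practice (a plain in-process dict with a time-to-live standing in for redis/memcached, and a made-up metrics query; in a multi-process deployment the dict would be replaced by the shared cache):

```python
import time

_cache = {}          # in production this would be redis/memcached, shared across processes
TTL_SECONDS = 30

def get_latest_metrics(db, sensor_id):
    """Cache-aside: answer from memory when possible, hit the disk-backed DB only on a miss."""
    key = f"latest:{sensor_id}"
    hit = _cache.get(key)
    if hit and hit[0] > time.time():
        return hit[1]                       # cache hit, no disk I/O
    rows = db.execute(                      # db is an sqlite3-style connection (stand-in)
        "SELECT ts, value FROM metrics WHERE sensor_id = ? "
        "ORDER BY ts DESC LIMIT 100", (sensor_id,)
    ).fetchall()
    _cache[key] = (time.time() + TTL_SECONDS, rows)
    return rows
```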
Queue writes
Also, for extra performance you could queue up the writes in memory for a little while before flushing them to disk (writing them to the SQL database). Of course, you then run the risk of losing data if you keep it in memory and your computer crashes or loses power.
Context missing
Also I don't think you gave us much context!
What I think is missing is:
architecture.
What kind of server do you have: VPS or shared hosting?
What operating system does it run: Linux, Windows, macOS?
Computer specifics, like how much memory is available, CPU, etc.
I also find your definition of the data a bit vague. Could you attach a diagram or something that explains your domain a little? For example, something like
this using http://yuml.me/
Your requirements are way too general. For MS SQL Server, 100k (more or less "normal") records per day should not be a problem if you have decent hardware. Obviously you want to write fast to the database, but you ask for optimization of retrieval performance. That does not match very well! ;-) Tuning a database is a special skill in its own right, so you will never get the general answer you would like to have.

How important is optimization?

I got into a bit of a debate yesterday with my boss about the proper role of optimization when building software. Essentially, his position was that optimization needs to be a primary concern during the entire process of development.
My opinion is that you need to make the right algorithmic decisions during development, but you should never be counting cycles during development. In fact, I feel so strongly about this I had to walk away from the conversation. I've seen too many bad programming decisions in the name of "optimization", and too much bad code defended with the excuse "this way is faster".
What does the StackOverflow.com community think?
"We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%."
- Donald Knuth
I think the premature optimization quote is used by too many to avoid thinking about the hard stuff concerning how well the application will run. I guarantee the users want you to think about how to design it so it will run as fast as possible.
This is not to say you should be timing everything, but the design phase is the easiest place to optimize and not cost lots of time later.
There are often several ways to do anything; in the design phase you should pick the one that is most likely to perform best (if it turns out to be one of the times when it isn't the best, then optimize later). This should trump the need to have easy-to-read code.
If you aren't considering performance in the design phase, you aren't going to have a well-designed system. That doesn't mean it should be the only concern (although in a database I'd rate it third in importance, right after data integrity and security), but trying to fix a system where poorly performing techniques were used throughout because the developers thought they were easier to understand is a nightmare. Being a user of such a system, where you have to wait for minutes every time you want to move from one screen to another, is a nightmare for everyone stuck with the badly designed system (developers really should spend all day, every day, for at least a week, using their own systems!). It costs less to design properly than to fix later, and considering performance is critical to designing properly.
I work somewhere where the original developers drank the Kool-Aid about premature optimization and did everything the way they thought was simplest (but which in almost every case was the wrong choice from a performance perspective). Now we are at 10 times the size we were three years ago, every screen on every website takes 30 seconds or so to load (or worse, times out), and we are losing customers because of it. But changing it will be too hard, because at the base they designed the database without considering how it would perform, and redesigning a database with many, many gigabytes of data into a new structure is way too time-consuming and costly. If it had been designed to perform from the start it would be both easier to maintain and faster for the clients. We aren't talking about the need to performance-tune the top 10 slowest queries here; we are talking about the fact that the overall structure requires a drastic change (one that would affect virtually every query against the system) to perform well.
Yes, don't do micro-optimization until you need to, but please do the macro stuff. Consider whether this is the best way before you commit to the path. Don't write cursors to hit tables with millions of records when a set-based statement will do. Don't try to have as few tables as possible because that seems to be a more elegant solution when the tables are storing disparate items (such as people, places, and vehicles), causing every single query to hit the same table and causing every delete to check all sorts of foreign key tables that will never have a record for that type of entity (it takes minutes to delete one record from the main table in our database; it's a real joy when something goes wrong in an import, usually bad data from a client, and we have to delete 200,000 of them, let me tell you).
Optimization is almost tautologically a tradeoff: you gain runtime efficiency at the cost of other things (readability, maintainability, flexibility, compile times, etc.). As such, it's really never a good idea to do unless you know what the tradeoff is and why it's worthwhile.
Even worse, thinking about "how do I do X fast" can be very distracting. Too much energy in that direction can easily lead to you missing out on method Y, which is much better (and often faster; this is how "optimization" can make your code slower). Particularly if you do too much of this on a big project from the beginning, it represents a lot of momentum. If you can't afford to overcome that momentum, you can easily become locked into a bad design because you can't afford the time to restructure it. This way lie dragons.
What your boss may be thinking of is more of an issue of writing bad code via inappropriate representations and algorithms. It's not really the same thing as optimizing, but an approach where you pay no attention whatsoever to appropriate data structures etc. can result in a codebase that is slow everywhere, and (much like the above "lock in") requires heroic effort to fix.
In general though, premature optimization really honestly is a terrible idea. Particularly when you end up with a complex, finely tuned, well documented (because that's the only way you can understand it) piece of code you end up not using anyway. And that's not even getting into the issue of subtle bugs that are often introduced when "optimizing"
[edit: pshaw, of course a Knuth quote encapsulates this well. That's what I get for typing too much]
Engineer throughout, optimize at the end.
Since we're going with pithy, I'll say that optimization is as important as the impact of not doing it.
I think the "premature optimization is root of all evil" has to be understood literally - it does not say when is premature, and does not say you should optimize only at the end. Just not too early. Also, the "use the right algorithm, O(n^2) vs O(N)" is a bit dangerous if taken literally - because for many problems, the N is actually small, etc...
I think it depends a lot on the type of software you are building: some software is structured so that every part is very independent and can be optimized separately. But that's not always the case. For many (most?) applications, speed just does not matter at all; the brute-force but obviously correct way is the best one. But for projects where speed matters, it often has to be taken into account early. Maybe that's another possible interpretation of Knuth's saying: many applications don't need to be optimized at all; just know which ones do and plan ahead.
Optimization is a primary concern through development when you have a good reason to expect that performance will be unfixably bad if optimization is a secondary concern.
This depends a lot on what kind of code you're writing, but there are often better reasons to believe that your code will be unfixably difficult to use, or maintain, or full of bugs, or late, if all those things become secondary to tweaking performance.
Bad managers say, "all of those things are our primary concerns". Good managers work to find out which are dangers for this project.
Of course, good design does have to consider all these things, and the earlier you have a back-of-the-envelope estimate of any of them, the better. If all your manager is saying, is that if you never think about how fast your code will run then too much of your code will be dog-slow, then he's right. I just wouldn't say that makes optimization your "primary" concern.
If the USP of your software is that it's faster than your competitors', then optimization is a primary concern. With experience, you can often predict what sorts of operations will be the bottlenecks, design those with optimization in mind right from the start, and more-or-less ignore optimization elsewhere. A lot of projects won't even need this: they'll be fast enough without much effort, provided you use sensible algorithms and don't do anything stupid. "Don't do anything stupid" is always a primary concern, with no need to mention performance in particular.
I think that code needs to be, first and foremost, readable and understandable. So, optimisations that are done, should not be at the expense of readability. However, optimisation is often a trade-off.
Whether or not you should optimise your code depends on your application domain. If you are working on an embedded processor with only 8 MB of memory, then optimisation is probably something that every team member needs to keep in mind when writing code, optimising for space vs speed.
However, premature optimisation is not useful unless your system has been clearly spec'ed and understood. This is because most programmers do not make good optimisation decisions unless they can factor in the influence of the overall system, including processor architectural factors such as cache memory, hardware threads, pipelines, etc.
From 2 years of building highly optimized Java code (and that needed to be optimized that way) I would say that there is a time-spent rule that governs optimization:
optimizing on the spot: 5%-10% of your development time, because you have to do it countless times (every single time you have to amend your design)
optimizing just when you have had it working: 2% of your development time (you do it only once)
going back to it and optimizing when it's too slow: 30% of your development time, because you have to plunge yourself back into the system
So I would come to the conclusion that there is a right time and a right way to optimize: do it entity by entity (class by class, if you have classes that each have a single, well-defined job and can be tested and optimized), test well, make sure the logic works, optimize just afterward, and then forget about that entity's implementation details forever.
When developing, just keep it simple. IMHO, most performance problems are caused by over-engineering - making mountains out of molehills, often because of wanting "the right algorithm".
Periodically, stress test with a big data set, profiling or (my favorite technique) manual random sampling. You find a problem, you fix it. You find another, you fix it.
That way you avoid creating slugs (slowness bugs), and when they do arise, you kill them.
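For the profiling option mentioned above, a minimal sketch using Python's standard-library cProfile; the stress-test workload is a placeholder for running your real code path on a big data set.

```python
import cProfile
import pstats

def stress_test():
    """Placeholder workload; in practice, run the real code path on a big data set."""
    data = [i % 1000 for i in range(200000)]
    total = 0
    for x in data:
        total += sum(range(x))      # deliberately wasteful inner loop
    return total

profiler = cProfile.Profile()
profiler.enable()
stress_test()
profiler.disable()

# Print the few call sites that dominate the run time; fix the worst, then repeat.
pstats.Stats(profiler).sort_stats("cumulative").print_stats(10)
```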
Added: If I can just elaborate on the first point (keeping it simple). OO is seemingly the law of the land, and it certainly has good reasons behind it. Unfortunately it causes many young programmers to feel that programming is all about having lots of data structure, with layers upon layers of abstraction. Not that those are inherently bad, but combine that with the natural tendency to assume that the time something takes is roughly proportional to the number of characters you have to type to invoke it, add that this tendency multiplies over the layers (and besides, the machines are really fast), and it's easy to create a perfect storm of cycle-waste.
Quote from a friend: "It's easier to make a working system efficient than to make an efficient system work".
I think it is important to use smart practices and patterns from the start, but get the system actually running for small test cases then do performance analysis. Frequently the areas of code that have poor performance aren't anticipated at the beginning, so get some real data and then optimize the bottlenecking 3% (or 20%, or whatever it is).
I think your boss is a lot more right than you are.
All too often the user experience is lost, only to be "rediscovered" at the last possible moment, when performance activities are prohibitively costly and inefficient. Or when it is discovered that the batch program that will process today's transactions requires forty hours to run.
Things such as database organization, and when and when not to do which SELECTs, are examples of design decisions that can make or break an application. Still, you run the risk of a single programmer deciding to do otherwise, misinterpreting, or simply not understanding what to do. Following up on performance during an entire project decreases the risk that such things will happen. It also allows design decisions to be changed when there is a need for that, without putting the entire project at risk.
"You need to make the right algorithmic decisions during development" is certainly true yet how many mainstream programmers are able to do that? Browsing the net for information does not guarantee finding a high quality solution. "Right" could be interpreted as meaning that it is ok to choose a poor algorithm because it is easy to understand and implement (= less development time, lower cost) than a more complicated one (= more development time, higher cost).
The pendulum of quantity vs quality is almost always on the quantity side because more code per hour or faster development time means money in the short term. The quality side means money in the long term.
EDIT
This article discusses performance and optimization quite thoroughly.
Performance preacher Rico Mariani sums it up in the short statement "never give up your performance accidentally."
Premature optimization is the root of all evil... There is a fine balance, but I would say 95% of the time you need to optimize at the end; however, there are decisions you can make early on to help prevent issues. For example, assume we are talking about an e-commerce web site. You have a requirement to display the catalog. Now you can grab all 100,000 items and display 50 of them, or you can grab just 50 from the database. These types of decisions should be made up front.
Cycle counting should only be done when a problem has been identified.
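A sketch of the catalog example above (hypothetical table and column names; Python with an sqlite3-style cursor whose execute returns the cursor):

```python
PAGE_SIZE = 50

def get_catalog_page_bad(cur, page):
    """Fetches the whole catalog and slices in application code: 100,000 rows over the wire."""
    rows = cur.execute("SELECT id, name, price FROM catalog ORDER BY name").fetchall()
    return rows[page * PAGE_SIZE:(page + 1) * PAGE_SIZE]

def get_catalog_page_good(cur, page):
    """Lets the database return only the 50 rows the page actually displays."""
    return cur.execute(
        "SELECT id, name, price FROM catalog ORDER BY name LIMIT ? OFFSET ?",
        (PAGE_SIZE, page * PAGE_SIZE),
    ).fetchall()
```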
Your boss is partly right: optimisation does need to be considered throughout the development lifecycle, but it is rarely the primary concern. Also, the term "optimisation" on its own is vague; it needs a qualifier, "optimise for ...", which could be "memory", "speed", "usability", "maintainability" and so on.
However, the OP is right that counting cycles is pointless for many projects. For most PC applications the CPU is never the bottleneck. Also, IA-32 is not consistent: what works well on one architecture performs poorly on another. Cycle counting should only ever be done when it will actually make a difference, usually in CPU-limited code or code with very specific timing needs.
Optimisation, of any kind, must always be driven by hard evidence. Never assume anything about the system or how the code is behaving. In an ideal world, application performance / constraints will be specified in the initial product design and tools to monitor the application's performance during development will be added early on in the development phase to guide the programmers as the product is made.

How to prove that code isn't broken, but the hardware is?

I'm sure this happens everywhere. You can "feel" the network is slow, or the machine is slow, or something. But the server/chassis logs are not showing anything, so IT doesn't believe you. What do you do?
Your regressions are taking twice the time ... but that's not enough.
Okay, you transfer 100 GB using dd etc., but ... that's not enough.
Okay, you get the server placed in a different chassis for 2 weeks, and it works fine ... but ... that's not enough ...
So HOW do you get IT to replace the chassis?
More specifically:
Is there any suite which I can run on two setups (supposed to be identical) that can show up differences in network/CPU/disk access, which IT will believe?
Computers don't age and slow down the same way we do. If your server is getting slower -- actually slower, not just feels slower because every other computer you use is getting faster -- then there is a reason and it is possible that you may be able to fix it. I'd try cleaning up some disk space, de-fragmenting the disk, and checking what other processes are running (perhaps someone's added more apps to the system and you're just not getting as many cycles).
If your app uses a database, you may want to analyze your query performance and see if some indices are in order. Queries that perform well when you have little data can start taking a long time as the amount of data grows if they have to use table scans. As a former "IT" guy, I'd also be reluctant to throw hardware at a problem because someone tells me the system is slowing down. I'd want to know what has changed and see if I could get the system running the way it should be. If the app has simply outgrown the hardware, after you've made suitable optimizations, then upgrading is a reasonable choice.
Run a standard benchmark suite. See if it pinpoints memory, cpu, bus or disk, when compared to a "working" similar computer.
See http://en.wikipedia.org/wiki/Benchmark_(computing)#Common_benchmarks for some tips.
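If you want a quick self-made comparison before reaching for a full benchmark suite, a rough sketch like the following (Python; the sizes and workloads are purely illustrative) run on both the suspect machine and the "working" one at least shows which ratio is off:

```python
import os
import time

def time_it(label, fn):
    start = time.perf_counter()
    fn()
    print(f"{label}: {time.perf_counter() - start:.2f}s")

def cpu_test():
    # Pure CPU work: should take almost identical time on identical machines.
    sum(i * i for i in range(10_000_000))

def disk_test(path="bench.tmp", size_mb=512):
    # Sequential write + fsync; a failing disk or controller shows up here.
    block = os.urandom(1024 * 1024)
    with open(path, "wb") as f:
        for _ in range(size_mb):
            f.write(block)
        f.flush()
        os.fsync(f.fileno())
    os.remove(path)

time_it("cpu ", cpu_test)
time_it("disk", disk_test)
```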
The only way to prove something is to do a stringent audit.
Now traditionally, we should keep the system constant between two different sets while altering the variable we are interested in. In this case the variable is the hardware your code is running on. So in simple terms, you should audit the running of your software on two different sets of hardware, one being the hardware you are unhappy about, and see the difference.
Now if you are to do this properly, which I am sure you are, you will first need to come up with a null hypothesis, something like:
"The slowness of the application is
unrelated to the specific hardware we
are using"
And now you set about disproving that hypothesis in favour of an alternative hypothesis. Once you have collected enough results, you can apply statistical analyses on them, to decide whether any differences are statistically significant. There are analyses to find out how much data you need, and then compare the two sets to decide if the differences are random, or not random (which would disprove your null hypothesis). The type of tests you do will mostly depend on your data, but clever people have made checklists to help us decide.
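As a sketch of that last step (assuming Python with SciPy; the timing numbers are invented): collect repeated runs of the same job on both chassis and test whether the difference could plausibly be chance.

```python
from scipy import stats

# Wall-clock seconds for the same regression suite, repeated on each chassis (invented data).
old_chassis = [412, 398, 425, 417, 409, 431, 405, 419]
new_chassis = [233, 241, 228, 236, 244, 230, 239, 235]

t_stat, p_value = stats.ttest_ind(old_chassis, new_chassis, equal_var=False)
print(f"t = {t_stat:.1f}, p = {p_value:.2g}")

# A tiny p-value rejects the null hypothesis that the hardware makes no difference.
if p_value < 0.01:
    print("Difference is statistically significant; the chassis is a real factor.")
```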
It sounds like your main problem is being listened to by IT, but raw technical data may not be persuasive to the right people. Getting backup from the business may help you and that means talking about money.
Luckily, both platforms already contain a common piece of software - the application itself - designed to make or save money for someone. Why not measure how quickly it can do that e.g. how long does it take to process an order?
By measuring how long your application spends dealing with each subtask or data source, you can get a rough idea of which underlying hardware is underperforming. Writing to a local database, or handling a data structure larger than RAM, will impact the disk; making network calls will impact the network hardware; CPU-bound calculations will show up in the CPU.
This data will never be as precise as a benchmark, and it may require expensive coding, but it's easier to translate what it finds into money terms. Log4j's NDC and MDC features, and Spring's AOP, might be good enabling tools for you.
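A minimal sketch of that per-subtask timing idea (Python; the stage names and sleeps are placeholders for whatever your order-processing pipeline actually does):

```python
import time
from contextlib import contextmanager

timings = {}

@contextmanager
def timed(stage):
    """Accumulate wall-clock time per pipeline stage so slow hardware shows up by category."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage] = timings.get(stage, 0.0) + time.perf_counter() - start

def process_order(order):
    with timed("db_write"):
        time.sleep(0.02)        # stand-in for the local database insert
    with timed("network_call"):
        time.sleep(0.05)        # stand-in for the remote pricing service
    with timed("cpu"):
        sum(i * i for i in range(100000))

for order in range(20):
    process_order(order)

for stage, seconds in sorted(timings.items(), key=lambda kv: -kv[1]):
    print(f"{stage:>14}: {seconds:.2f}s")
```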
Run perfmon.msc from Start / Run in Windows 2000 through to Vista. Then just add counters for CPU, disk etc..
For SQL queries you should capture the actual queries then run them manually to see if they are slow.
For instance, if using SQL Server, run the profiler from Tools, SQL Server Profiler. Then perform some operations in your program and look at the capture for any suspicious database calls. Copy and paste one of the queries into a new query window in Management Studio and run it.
For networking you should try artificially limiting your network speed to see how it affects your code (e.g. Traffic Shaper XP is a simple freeware limiter).

How to avoid the dangers of optimisation when designing for the unknown?

A two parter:
1) Say you're designing a new type of application and you're in the process of coming up with new algorithms to express the concepts and content -- does it make sense to attempt to actively not consider optimisation techniques at that stage, even if in the back of your mind you fear it might end up as O(N!) over millions of elements?
2) If so, say to avoid limiting cool functionality which you might be able to optimise once the proof-of-concept is running -- how do you stop yourself from this programmer's habit of a lifetime? I've been trying mental exercises, paper notes, but I grew up essentially counting clock cycles in assembler and I continually find myself vetoing potential solutions for being too wasteful before fully considering their functional value.
Edit: This is about designing something which hasn't been done before (the unknown), when you're not even sure if it can be done in theory, never mind with unlimited computing power at hand. So answers along the line of "of course you have to optimise before you have a prototype because it's an established computing principle," aren't particularly useful.
I say all the following not because I think you don't already know it, but to provide moral support while you suppress your inner critic :-)
The key is to retain sanity.
If you find yourself writing a Theta(N!) algorithm which is expected to scale, then you're crazy. You'll have to throw it away, so you might as well start now finding a better algorithm that you might actually use.
If you find yourself worrying about whether a bit of Pentium code, that executes precisely once per user keypress, will take 10 cycles or 10K cycles, then you're crazy. The CPU is 95% idle. Give it ten thousand measly cycles. Raise an enhancement ticket if you must, but step slowly away from the assembler.
One thing to decide is whether the project is "write a research prototype and then evolve it into a real product", or "write a research prototype", with obviously an expectation that if the research succeeds, there will be another related project down the line.
In the latter case (which from comments sounds like what you have), you can afford to write something that only works for N<=7 and even then causes brownouts from here to Cincinnati. That's still something you weren't sure you could do. Once you have a feel for the problem, then you'll have a better idea what the performance issues are.
What you're doing is striking a balance between wasting time now (on considerations that your research proves irrelevant) and wasting time later (because you didn't consider something now that turns out to be important). The more risky your research is, the happier you should be just to do something, and worry about what you've done later.
My big answer is Test Driven Development. By writing all your tests up front then you force yourself to only write enough code to implement the behavior you are looking for. If timing and clock cycles becomes a requirement then you can write tests to cover that scenario and then refactor your code to meet those requirements.
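As a sketch of how a timing requirement can be expressed as a test (a hypothetical function and a made-up 50 ms budget; wall-clock assertions like this are machine-dependent, so treat it as an illustration of the idea rather than a robust practice):

```python
import time
import unittest

def find_nearest(items, target):
    """Placeholder for the algorithm under development."""
    return min(items, key=lambda x: abs(x - target))

class PerformanceRequirementTest(unittest.TestCase):
    def test_lookup_stays_within_budget(self):
        items = list(range(200000))
        start = time.perf_counter()
        find_nearest(items, 123456)
        elapsed = time.perf_counter() - start
        # Refactor freely; this test only fails if the behaviour gets slower than the budget.
        self.assertLess(elapsed, 0.05, "lookup exceeded its 50 ms budget")

if __name__ == "__main__":
    unittest.main()
```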
Like security and usability, performance is something that has to be considered from the beginning of the project. As such, you should definitely be designing with good performance in mind.
The old Knuth line is "We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil." O(N!) to O(poly(N)) is not a "small efficiency"!
The best way to handle type 1 is to start with the simplest thing that could possibly work (O(N!) cannot possibly work unless you're not scaling past a couple dozen elements!) and encapsulate it from the rest of the application so you could rewrite it to a better approach assuming that there is going to be a performance issue.
Optimization isn't exactly a danger; its good to think about speed to some extent when writing code, because it stops you from implementing slow and messy solutions when something simpler and faster would do. It also gives you a check in your mind on whether something is going to be practical or not.
The worst thing that can happen is you design a large program explicitly ignoring optimization, only to go back and find that your entire design is completely useless because it cannot be optimized without completely rewriting it. This never happens if you consider everything when writing it--and part of that "everything" is potential performance issues.
"Premature optimization is the root of all evil" is the root of all evil. I've seen projects crippled by overuse of this concept. At my company we have a software program that broadcasts transport streams from disk on the network. It was originally created for testing purposes (so we would just need a few streams at once), but it was always in the program's spec requirements that it work for larger numbers of streams so it could later be used for video on demand.
Because it was written completely ignoring speed, it was a mess; it had tons of memcpys despite the fact that they should never be necessary, its TS processing code was absurdly slow (it actually parsed every single TS packet multiple times), and so forth. It handled a mere 40 streams at a time instead of the thousands it was supposed to, and when it actually came time to use it for VOD, we had to go back and spend a huge amount of time cleaning it up and rewriting large parts of it.
"First, make it run. Then make it run fast."
or
"To finish first, first you have to finish."
Slow existing app is usually better than ultra-fast non-existing app.
First of all, people claim that finishing is the only thing that matters (or almost).
But if you finish a product whose main algorithm has O(N!) complexity, then as a rule of thumb you did not finish it! You have an incomplete product that is unacceptable in 99% of cases.
Reasonable performance is part of a working product. Perfect performance might not be. If you finish a text editor that needs 6 GB of memory to write a short note, then you have not finished a product at all; you only have a waste of time on your hands. You must always remember that it is not just delivering code that makes a product complete, it is making it capable of meeting the customers'/users' needs. If you fail at that, it matters nothing that you finished writing the code on schedule.
So all optimizations that prevent an otherwise useless product should be considered and applied as soon as they do not compromise the rest of the design and implementation process.
"actively not consider optimisation" sounds really weird to me. Usually 80/20 rule works quite good. If you spend 80% of your time to optimize program for less than 20% of use cases, it might be better to not waste time unless those 20% of use-cases really matter.
As for perfectionism, there is nothing wrong with it unless it starts to slow you down and makes you miss time-frames. The art of computer programming is an act of balancing the beauty and the functionality of your applications. To help yourself, consider learning time management. When you learn how to split and measure your work, it becomes easy to decide whether to optimize right now or to create a working version first.
I think it is quite reasonable to forget about O(N!) worst case for an algorithm. First you need to determine that a given process is possible at all. Keep in mind that Moore's law is still in effect, so even bad algorithms will take less time in 10 or 20 years!
First optimize for Design -- e.g. get it to work first :-) Then optimize for performance. This is the kind of tradeoff python programmers do inherently. By programming in a language that is typically slower at run-time, but is higher level (e.g. compared to C/C++) and thus faster to develop, python programmers are able to accomplish quite a bit. Then they focus on optimization.
One caveat: if the time it takes to finish is so long that you can't determine whether your algorithm is right, then it is a very good time to worry about optimization earlier upstream. I've encountered this scenario only a few times, but it's good to be aware of it.
Following on from onebyone's answer there's a big difference between optimising the code and optimising the algorithm.
Yes, at this stage optimising the code is going to be of questionable benefit. You don't know where the real bottlenecks are, you don't know if there is going to be a speed problem in the first place.
But being mindful of scaling issues even at this stage of the development of your algorithm/data structures etc. is not only reasonable but I suspect essential. After all there's not going to be a lot of point continuing if your back-of-the-envelope analysis says that you won't be able to run your shiny new application once to completion before the heat death of the universe happens. ;-)
I like this question, so I'm giving an answer, even though others have already answered it.
When I was in grad school, in the MIT AI Lab, we faced this situation all the time, where we were trying to write programs to gain understanding into language, vision, learning, reasoning, etc.
My impression was that those who made progress were more interested in writing programs that would do something interesting than do something fast. In fact, time spent worrying about performance was basically subtracted from time spent conceiving interesting behavior.
Now I work on more prosaic stuff, but the same principle applies. If I get something working I can always make it work faster.
I would caution however that the way software engineering is now taught strongly encourages making mountains out of molehills. Rather than just getting it done, folks are taught to create a class hierarchy, with as many layers of abstraction as they can make, with services, interface specifications, plugins, and everything under the sun. They are not taught to use these things as sparingly as possible.
The result is monstrously overcomplicated software that is much harder to optimize because it is much more complicated to change.
I think the only way to avoid this is to get a lot of experience doing performance tuning and in that way come to recognize the design approaches that lead to this overcomplication. (Such as: an over-emphasis on classes and data structure.)
Here is an example of tuning an application that has been written in the way that is generally taught.
I will give a little story about something that happened to me, but not really an answer.
I am developing a project for a client where one part of it is processing very large scans (images) on the server. When I wrote it I was looking for functionality, but I thought of several ways to optimize the code so it would be faster and use less memory.
Now an issue has arisen. During demos to potential clients for this software and during beta testing, it fails on the demo unit (a self-contained laptop) due to too much memory being used. It also fails on the dev server with really large files.
So was it an optimization, or was it a known future bug? Do I fix it or optimize it now? Well, that is to be determined, as there are other priorities as well.
It just makes me wish I had spent the time to optimize the code earlier on.
Think about the operational scenarios (use cases).
Say that we're making a pizza-shop finder gizmo.
The user turns on the machine. It has to show him the nearest pizza shop in a meaningful time. It turns out our users want to know fast: in under 15 seconds.
So now, for any idea you have, you think: is this ever, realistically, going to run in less than 15 seconds, minus all the other time spent doing important stuff?
Or you're a trading system: accurate sums. Less than a millisecond per trade if you can, please (they'd probably accept 10 ms). So, again: you look at every idea from the relevant scenario's point of view.
Say it's a phone app: has to start in under (how many seconds)
Demonstrations to customers from laptops are ALWAYS a scenario. We've got to sell the product.
Maintenance, where some person upgrades the thing, is ALWAYS a scenario.
So now, as an example: all the hard, AI heavy, lisp-customized approaches are not suitable.
Or for different strokes, the XML server configuration file is not user friendly enough.
See how that helps.
If I'm concerned about the code's ability to handle data growth, before I get too far along I try to set up sample data sets in large chunk increments to test it with, like:
1000 records
10000 records
100000 records
1000000 records
and see where it breaks or becomes unusable. Then you can decide, based on real data, whether you need to optimize or redesign the core algorithms.
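A rough sketch of that approach (Python; generate_records and the algorithm under test are placeholders for your own code):

```python
import random
import time

def generate_records(n):
    """Placeholder synthetic data set of n records."""
    return [(i, random.random()) for i in range(n)]

def algorithm_under_test(records):
    """Placeholder for the core algorithm whose scaling you are worried about."""
    return sorted(records, key=lambda r: r[1])

for n in (1_000, 10_000, 100_000, 1_000_000):
    data = generate_records(n)
    start = time.perf_counter()
    algorithm_under_test(data)
    elapsed = time.perf_counter() - start
    print(f"{n:>9} records: {elapsed:8.3f}s")
    # If the time grows much faster than the record count, redesign before going further.
```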