How far can you really go with "eventual" consistency and no transactions (aka SimpleDB)?

I really want to use SimpleDB, but I worry that without real locking and transactions the entire system is fatally flawed. I understand that for high-read/low-write apps it makes sense, since eventually the system becomes consistent, but what about the time in between? It seems like the wrong query against an inconsistent DB could propagate havoc throughout the entire database in a way that's very hard to track down. Hopefully I'm just being a worrywart...

This is the pretty classic battle between consistency and scalability and - to some extent - availability. Some data doesn't always need to be that consistent. For instance, look at digg.com and the number of diggs against a story. There's a good chance that value is duplicated in the "digg" record rather than forcing the DB to do a join against the "user_digg" table. Does it matter if that number isn't perfectly accurate? Probably not. Then using something like SimpleDB might be a good fit. However if you are writing a banking system, you should probably value consistency above all else. :)
Unless you know from day 1 that you have to deal with massive scale, I would stick to simpler, more conventional systems like an RDBMS. If you are working somewhere with a reasonable business model, you will hopefully see a big spike in revenue if there's a big spike in traffic. Then you can use that money to help solve the scaling problems. Scaling is hard and scaling is hard to predict. Most of the scaling problems that hurt you will be ones that you never expect.
I would much rather get a site off the ground and spend a few weeks fixing scale issues when traffic picks up than spend so much time worrying about scale that we never make it to production because we run out of money. :)

Assuming you're talking about this SimpleDB, you're not being a worrywart; there are real reasons not to use it as a real world DBMS.
The properties that you get from transaction support in a DBMS can be abbreviated by the acronym "A.C.I.D.": Atomicity, Consistency, Isolation, and Durability. The A and D have mostly to do with system crashes, and the C and I have to do with regular operation. They're all things people totally take for granted when working with commercial databases, so if you work with a database that doesn't have one or more of them, you might be in for any number of nasty surprises.
Atomicity: Any transaction will either complete fully or not at all (i.e. it will either commit or abort cleanly). This applies to single statements (like "UPDATE table ...") as well as longer, more complicated transactions. If you don't have this, then anything that goes wrong (like, the disk getting full, the computer crashing, etc.) might leave something half-done. In other words, you can't ever rely on the DBMS to really do the things you tell it to, because any number of real-world problems can get in the way, and even a simple UPDATE statement might get partially completed.
Consistency: Any rules you've set up about the database will always be enforced. Like, if you have a rule that says A always equals B, then nothing anybody does to the database system can break that rule - it'll fail any operation that tries. This isn't quite as important if all your code is perfect ... but really, when is that ever the case? Plus, if you're missing this safety net, things get really yucky when you lose ...
Isolation: Any actions taken on the database will execute as if they happened serially (one at a time), even if in reality they're happening concurrently (interleaved with each other). If more than one user is going to hit this database at the same time, and you don't have this, then things you can't even dream up will go wrong; even atomic statements can interact with each other in unforeseen ways and screw things up.
Durability: If you lose power or the software crashes, what happens to database transactions that were in progress? If you have durability, the answer is "nothing - they're all safe". Databases do this by using something called "Undo / Redo Logging", where every little thing you do to the database is first logged (typically on a separate disk for safety) in a way such that you can reconstruct the current state after a failure. Without that, the other properties above are sort of useless, because you can never be 100% sure that things will stay consistent after a crash.
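To make those four properties concrete, here is a minimal transaction sketch in SQL (the Accounts table and its columns are invented for the example):

-- Classic transfer: with ACID, either both updates happen or neither
-- (atomicity), concurrent sessions never see the money "in flight"
-- (isolation), and once COMMIT returns the change survives a crash
-- (durability).
BEGIN TRANSACTION;
UPDATE Accounts SET Balance = Balance - 100 WHERE AccountID = 1;
UPDATE Accounts SET Balance = Balance + 100 WHERE AccountID = 2;
COMMIT TRANSACTION;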
Do any of these things matter to you? The answer has everything to do with the types of transactions you're doing, and what guarantees you want in a failure situation. There may well be cases (like a read-only database) where you don't need these, but as soon as you start doing anything non-trivial, and something bad happens, you'll wish you had 'em. Maybe it's OK for you to just revert to a backup anytime something unexpected happens, but my guess is that it isn't.
Also note that dropping all of these protections doesn't make it a given that your database will perform better; in fact, it's probably the opposite. That's because real-world DBMS software also has tons of code to optimize query performance. So, if you write a query that joins 6 tables on SimpleDB, don't assume that it'll figure out the optimal way to run that query - you might end up waiting hours for it to complete, when a commercial DBMS could use an indexed hash join and get it in 0.5 seconds. There are a zillion little tricks that you can do to optimize query performance, and believe me, you'll really miss them when they're gone.
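For instance, in a conventional RDBMS a single index is often the difference between a full table scan and a near-instant seek. A minimal sketch of one such trick (the Orders table and its columns are made up for the example):

CREATE INDEX IX_Orders_CustomerID ON Orders (CustomerID);

-- With the index in place, the optimizer can turn this into an index
-- seek instead of scanning every row in the table:
SELECT OrderID, OrderTotal FROM Orders WHERE CustomerID = 42;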
None of this is meant as a knock on SimpleDB; take it from the author of the software: "Although it is a great teaching tool, I can't imagine that anyone would want to use it for anything else."

SQL generalization/specialization, data redundancy

I have three tables: actions, messages, likes. They define an inheritance: messages and likes are actions' children (specialization).
Message and Like both have the columns userId and createdAt. Those should of course be moved to the parent table Action and removed from Message and Like. But there's only one case where I need to select both messages and likes from the database; in all other cases I select only one of them, either messages or likes.
Is it OK to duplicate userId and createdAt in the child and parent tables? It costs disk space but saves one join - otherwise I would have to join messages or likes with actions every time I needed userId and createdAt. What's more, I would need to change my current code...
What would you suggest?
In my opinion this is a case of premature optimization (or premature denormalization, if you prefer). You're guessing that the join overhead will cause significant problems, so you're guessing that duplicating the userId and createdAt columns in the dependent tables will improve performance significantly.
I suggest that you should not duplicate columns until you know there's a real problem. I keep a few observations on performance optimization tacked up on the wall to remind myself of what I should do in similar cases:
It ain’t broke ‘til it’s broke.
You can’t improve what you haven’t measured.
Programs spend surprising amounts of time in the damnedest places.
Make it run. Make it run right. Make it run right fast.
Optimization is literally the last thing you should be doing.
Doing things wrong faster is no great benefit.
Also a few comments on denormalization:
You can’t denormalize that which is not normalized.
Most developers wouldn’t know third-normal form if it leapt out from behind their screen, screamed like a banshee, and cracked a baseball bat over their heads.
Denormalization is suggested as a panacea for database performance issues. The problem is that too often those recommending denormalization have never normalized anything.
“Denormalization for performance reasons” is an excuse for sloppy, “do what we’ve always done” thinking, especially when such denormalization is enshrined in the design.
In my experience, I am not able to identify where performance problems will occur before writing code. Problems always seem to occur in places where I would never have thought to look. Thus, I've found that my best choice is always to write the simplest, clearest code that I can and to design the database as simply as I can, following the normalization rules to the best of my ability, and then to deal with what turns up. There may still be performance issues which need attention (but, surprisingly, not really all that often), but in the end I'll end up with simple, clear, and easily understood/maintained code, running on a simple, well-designed database.
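To make that concrete for the asker's schema, here is a minimal sketch of the normalized design and the join being worried about (all types and the Body column are assumptions; only userId and createdAt come from the question):

CREATE TABLE Actions (
    ActionID  int PRIMARY KEY,
    UserID    int NOT NULL,
    CreatedAt datetime NOT NULL
);

CREATE TABLE Messages (
    ActionID int PRIMARY KEY REFERENCES Actions (ActionID),
    Body     nvarchar(1000) NOT NULL  -- hypothetical payload column
);

CREATE TABLE Likes (
    ActionID int PRIMARY KEY REFERENCES Actions (ActionID)
);

-- The dreaded join: with both sides keyed on ActionID it is a cheap,
-- index-backed single-row lookup per message, not something to fear:
SELECT m.ActionID, a.UserID, a.CreatedAt, m.Body
FROM Messages AS m
JOIN Actions AS a ON a.ActionID = m.ActionID;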
Share and enjoy.

Are database deadlocks a fact of life?

We all know about techniques to prevent db deadlocks - acquire locks in the same order, etc. But at some point, systems under pressure may simply suffer from deadlocks here and there. Should we simply accept that and always be prepared to retry when a deadlock occurs or should deadlocks be considered absolutely verboten and should we do everything in our power to prevent them?
The answer is yes.
You should do everything in your power to prevent them, but are you ever going to be satisfied that you've made them impossible?
Do everything in your power to prevent them, and be prepared to retry when they occur. :)
Keep in mind that "doing everything in your power" can mean things like queueing batch updates, making inserts into temp tables and then merging those into the main tables later, and other non-trivial techniques. Be sure to check your transaction isolation level and your lock escalation policy.
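Here's a hedged sketch of what "be prepared to retry" can look like in T-SQL (SQL Server reports deadlock victims with error 1205; the tables and the SQL Server 2012+ THROW syntax are assumptions - use RAISERROR on older versions):

DECLARE @retries int = 3;
WHILE @retries > 0
BEGIN
    BEGIN TRY
        BEGIN TRANSACTION;
        -- hypothetical unit of work that might deadlock:
        UPDATE Inventory SET Quantity = Quantity - 1 WHERE ProductID = 7;
        UPDATE Orders SET Status = 'Placed' WHERE OrderID = 42;
        COMMIT TRANSACTION;
        BREAK;  -- success, stop retrying
    END TRY
    BEGIN CATCH
        IF @@TRANCOUNT > 0 ROLLBACK TRANSACTION;
        IF ERROR_NUMBER() = 1205
            SET @retries = @retries - 1;  -- deadlock victim: try again
        ELSE
            THROW;  -- any other error: re-raise it
    END CATCH
END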
This will probably be closed, but the world is trending toward NoSQL solutions to this problem: breaking problems up so that guaranteed consistency isn't required from the datasource, meaning that locks aren't required.
Facebook is a good example of this: it doesn't matter when everyone sees your update, or if different users around the world see different versions of your profile. As long as the update works or eventually fails, that is good enough.

Storing multiple choice values in database

Say I offer a user the option to check off the languages she speaks and store that in a DB. An important side note: I will not search the DB for any of those values, as I will have a separate search engine for search.
Now, the obvious way of storing these values is to create a table like
CREATE TABLE UserLanguages
(
    UserID nvarchar(50),
    LookupLanguageID int
)
but the site will be under high load and we are trying to eliminate any overhead where possible, so in order to avoid joins with the main member table when showing results in the UI, I was thinking of storing a user's languages in the main table, comma-separated, like "12,34,65".
Again, I don't search on them, so I'm not worried about having a full-text index on that column.
I don't really see any problems with this solution, but am I overlooking anything?
Thanks,
Andrey
Don't.
You don't search for them now
Data is useless to anything but this one situation
No data integrity (e.g. no FK)
You still have to translate "12,34" to "English, German" etc. for display
"Give me all users who speak x" = FAIL
The list is actually a presentation issue
It's your system, though, and I look forward to answering the inevitable "help" questions later...
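For contrast, the "Give me all users who speak x" query is a one-liner under the normalized design. A sketch, assuming a Languages lookup table with LanguageID and Name columns (which the question implies but doesn't show):

SELECT ul.UserID
FROM UserLanguages AS ul
JOIN Languages AS l ON l.LanguageID = ul.LookupLanguageID
WHERE l.Name = 'German';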
You might not be missing anything now, but when your requirements change you might regret that decision. You should store it normalized like your first instinct suggested. That's the correct approach.
What you're suggesting is a classic premature optimization. You don't know yet whether that join will be a bottleneck, and so you don't know whether you're actually buying any performance improvement. Wait until you can profile the thing, and then you'll know whether that piece needs to be optimized.
If it does, I would consider a materialized view, or some other approach that pre-computes the answer using the normalized data to a cache that is not considered the book of record.
More generally, there are a lot of possible optimizations that could be done, if necessary, without compromising your design in the way you suggest.
This type of storage has almost ALWAYS come back to haunt me. For one, you are not even in first normal form. For another, some manager or other will definitely come back and say, "Hey, now that we store this, can you write me a report on..."
I would suggest going with a normalized design. Put it in a separate table.
Problems:
You lose join capability (obviously).
You have to reparse the list on each page load / postback, which results in more client-side code.
You lose all pretense of trying to keep database integrity. Just imagine if you decide to REMOVE a language later on... What's the SQL going to be to fix all of your user profiles?
Assuming your various profile options are stored in a lookup table in the DB, you still have to run "30 queries" per profile page. If they aren't, then you have to do a code deploy for each little change. Bad, very bad.
Basing a design decision on something that "won't happen" is an absolute recipe for failure. Sure, the business people said they won't ever do that... Until they think of a reason they absolutely must do it. Today. Which will be promptly after you finish coding this.
As I stated in a comment, 30 queries for a low-use page is nothing. Don't sweat it, and definitely don't optimize unless you know for darn sure it's necessary. Guess how many queries SO does for its profile page?
I generally stay away from the solution you described; you're asking for trouble when you store relational data in such a fashion.
As an alternative solution:
You could store as one bitmasked integer, for example:
0 - No selection
1 - English
2 - Spanish
4 - German
8 - French
16 - Russian
-- and so on, in powers of 2
So if someone selected English and Russian the value would be 17, and you could easily query the values with Bitwise operators.
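Querying the mask is straightforward with the bitwise AND operator. A minimal sketch, assuming a hypothetical Users table with an int Languages column holding the mask:

-- Everyone who speaks Spanish (flag 2):
SELECT UserID FROM Users WHERE Languages & 2 = 2;

-- Everyone who speaks both English and Russian (1 + 16 = 17):
SELECT UserID FROM Users WHERE Languages & 17 = 17;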
Premature optimization is the root of all evil.
EDIT: Apparently the context of my observation has been misconstrued by some - and hence the downvotes. So I will clarify.
Denormalizing your model to make things easier and/or 'more performant' - such as creating concatenated columns to represent business information (as in the OP case) - is what I refer to as a "premature optimization".
While there may be some extreme edge cases where there is no other way to get the performance necessary for a particular problem domain, one should rarely assume this is the case. In general, such premature optimizations cause long-term grief because they are hard to undo - changing your data model once it is in production takes a lot more effort than when it was initially deployed.
When designing a database, developers (and DBAs) should apply standard practices like normalization to ensure that their data model expresses the business information being collected and managed. I don't believe that proper use of data normalization is an "optimization" - it is a necessary practice. In my opinion, data modelers should always be on the lookout for models that could be restructured to (at least) third normal form (3NF).
If you're not querying against them, you don't lose anything by storing them in a form like your initial plan.
If you are, then storing them in the comma-delimited format will come back to haunt you, and I doubt that any speed savings would be significant, especially when you factor in the work required to translate them back.
You seem to be extremely worried about adding in a few extra lookup table joins. In my experience, the time it takes to actually transmit the HTML response and have the browser render it far exceed a few extra table joins. Especially if you are using indexes for your primary and foreign keys (as you should be). It's like you are planning a multi-day cross-country trip and you are worried about 1 extra 10 minute bathroom stop.
The lack of long-term flexibility and data integrity are not worth it for such a small optimization (which may not be necessary or even noticeable).
Nooooooooooooooooo!!!!!!!!
As stated very well in the above few posts.
If you want a contrary view in this debate, look at WordPress. Its tables are chock-full of delimited data, and it's a great, simple platform.

SQL Query with Table Locking

I am having an argument with a developer colleague on the team.
Problem: SQL query X runs for 1 second on the test system, but for an unknown amount of time on the live system (150 users can run queries at the same time).
The query causes locks on 8 tables, of which 7 are useless.
His idea is to put a WITH (NOLOCK) on the 7 tables so there aren't any more locks.
My argument:
My suggestion is that with NOLOCK there is a chance that user 1 runs the SELECT query, which takes 10 seconds to complete because server performance is low at that moment, while user 2 changes a value in one of the 7 tables, e.g. a customer.
Then the query result would be completely wrong, or maybe the expected dataset can't be filled and it crashes and throws an error. So use a ROWLOCK instead.
His argument:
We don't need a rowlock, the chances of getting such a scenario are too low. We don't need to be perfect. Do what is asked of you and don't think.
What can I say to win against people like this, who don't value getting things right?
I believe, based on what you have said, that you are correct in your reasoning.
If there is ANY chance, no matter how small, that an operation could cause the database to lose integrity, it MUST be fixed.
Integrity is one of the basic premises of database design; your co-worker sounds like he is not being rigorous in his work.
If you are trying to construct a technical argument to "beat" your co worker, note that it may not give you the desired outcome you imagine.
If your co-worker is not amenable to what you are saying, AND if you are REALLY sure that you are correct in your reasoning, then I would inform your team leader why you think this is important and show him your solution. If he agrees with your co-worker because he believes that database integrity is not important, then perhaps you should look at working somewhere else.
Don't get me wrong, I realise that in the real world software cannot be 'perfect' otherwise it would never be released. But something as fundamental as data input checking should not be skipped over, and it isn't difficult to do. It's basically the same as saying, "well, let's not bother to validate user input". This is something you learn how to do in a first-year Computer Science class!
We have enough crappy software on this planet and this is the age where we are capable of AMAZING THINGS. Sloppiness in Software Engineering doesn't have a place anymore, and I hope that you do not let your co-worker lower your standards. Keep your standards high and you will learn more than he does and eventually do better in the long run.
Locking hints in SQL Server 2000 (SS2k) were useful because SS2k was greedy about locking on UPDATE statements and would default to TABLELOCK and narrow it as it progressed. If you knew your UPDATE statement's pattern you could use locking hints to increase performance and SS2k would escalate the lock if needed.
NOLOCK was introduced for dirty reads of locked data. If a table is frequently updated and queries that don't rely on the validity of the underlying data are being blocked, you could use NOLOCK to read the data in whatever state it was in. If you need to read records to generate a search results page you might choose to specify the NOLOCK hint to ensure your query isn't blocked by any update statements.
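For illustration, the hint is applied per table in the FROM clause. A sketch with hypothetical names:

-- Dirty read for a non-critical search results page; rows may reflect
-- uncommitted changes:
SELECT ProductID, Name, Price
FROM Products WITH (NOLOCK)
WHERE Name LIKE '%widget%';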
I believe lock escalation was reworked in SQL Server 2005 and locking hints are no longer respected.
If you are using SQL Server, which it sounds like you are, then instead of worrying about using NOLOCK to stop readers blocking writers (a common issue on high-use SQL Server DBs doing lots of reads and writes), you should consider using SQL Server row-versioning transaction isolation. This works with SQL Server 2005 and above.
This makes SQL Server work much more like Oracle does and eliminates the issues caused by readers blocking writers. Please read into the disadvantages too before you make the decision to use it.
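As a rough sketch, enabling it looks like this (the database name MyShop is hypothetical, and note that switching READ_COMMITTED_SNAPSHOT on requires no other active connections):

-- Allow snapshot isolation, and make the default READ COMMITTED level
-- use row versions instead of shared locks:
ALTER DATABASE MyShop SET ALLOW_SNAPSHOT_ISOLATION ON;
ALTER DATABASE MyShop SET READ_COMMITTED_SNAPSHOT ON;

-- A session can also opt in to full snapshot isolation explicitly:
SET TRANSACTION ISOLATION LEVEL SNAPSHOT;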
ACID: atomicity, consistency, isolation and durability. These are the basic tenets of databases that you ignore at your peril.
What your colleague is stating is that it's okay to ignore isolation, the property that you don't get to see half-done transactions. That's okay in some situations.
For example, we have a set of reports that are not used for critical business purposes but merely to give an indication as to the general health of the system. For that, 95% accuracy is good enough and we don't want the reporting to get in the way of the real work.
But, for a statement from a bank to one of its customers, 100% is the absolute minimum accuracy. In situations where you will rely on the data, isolation must be adhered to.
You need to decide which bucket your particular system falls into. I'd be willing to bet good money that the number of situations in which you can ignore any of the ACID principles is minimal.
From my experience, Murphy's law is true: If anything can go wrong, it will.
"We don't need to be perfect" is not an argument. You and your colleague certainly have requirements to conform to.
"Do what is want from you and don't
think."
Remember that you're always the person responsible for your own code; if something goes wrong, you can't say "he told me to do it, bla bla bla"...
Your colleague is wrong: you always have to think. They pay you to use your brain; you're not an aerobics teacher (only a joke - sorry to all those aerobics teachers who love programming).

Do I really need to use transactions in stored procedures? [MSSQL 2005]

I'm writing a pretty straightforward e-commerce app in asp.net, do I need to use transactions in my stored procedures?
Read/Write ratio is about 9:1
Many people ask - do I need transactions? Why do I need them? When to use them?
The answer is simple: use them all the time, unless you have a very good reason not to (for instance, don't use atomic transactions for "long running activities" between businesses). The default should always be yes. In doubt? Use transactions.
Why are transactions beneficial? They help you deal with crashes, failures, data consistency, error handling, they help you write simpler code etc. And the list of benefits will continue to grow with time.
Here is some more info from http://blogs.msdn.com/florinlazar/
Remember that in SQL Server all single-statement CRUD operations run in an implicit transaction by default. You just need to turn on explicit transactions (BEGIN TRAN) if you need to make multiple statements act as an atomic unit.
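Here is a minimal sketch of that pattern in a stored procedure, using SQL Server 2005's TRY/CATCH (all object names are hypothetical, and Orders.OrderID is assumed to be an identity column):

CREATE PROCEDURE dbo.PlaceOrder
    @CustomerID int,
    @ProductID int,
    @Quantity int
AS
BEGIN
    SET NOCOUNT ON;
    BEGIN TRY
        BEGIN TRANSACTION;

        -- Order header and order line succeed or fail as one unit:
        INSERT INTO Orders (CustomerID, CreatedAt)
        VALUES (@CustomerID, GETDATE());

        DECLARE @OrderID int;
        SET @OrderID = SCOPE_IDENTITY();

        INSERT INTO OrderItems (OrderID, ProductID, Quantity)
        VALUES (@OrderID, @ProductID, @Quantity);

        COMMIT TRANSACTION;
    END TRY
    BEGIN CATCH
        IF @@TRANCOUNT > 0 ROLLBACK TRANSACTION;
        -- Re-raise in a 2005-compatible way:
        DECLARE @msg nvarchar(2048);
        SET @msg = ERROR_MESSAGE();
        RAISERROR(@msg, 16, 1);
    END CATCH
END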
The answer is, it depends. You do not always need transaction safety. Sometimes it's overkill. Sometimes it's not.
I can see that, for example, when you implement a checkout process you only want to finalize it once you have gathered all the data, etc. Think about a payment going wrong that you can roll back - that's an example of when you need a transaction, or when it's wise to use one.
Do you need a transaction when you create a new user account? Maybe, if it's across 10 tables (for whatever reason), if it's just a single table then probably not.
It also depends on what you sold your client on, who they are, whether they requested it, etc. But if making the decision is up to you, then I'd say: choose wisely.
My bottom line is, avoid premature optimization. Build your application, keeping in mind that you may want to go back and refactor/optimize later when you need to. Look at a couple of open-source projects and see how they implemented different parts of their apps, and learn from that. You'll see that most of them don't use transactions at all, yet there are huge online stores that do use them.
Of course, it depends.
It depends upon the work that the particular stored procedure performs and, perhaps, not so much the "read/write ratio" that you suggest. In general, you should consider enclosing a unit of work within a transaction if it is a query that could be impacted by some other, simultaneously running query. If this sounds nondeterministic, it is. It is often difficult to predict under what circumstances a particular unit of work qualifies as a candidate for this.
A good place to start is to review the precise CRUD being performed within the unit of work, in this case within your stored procedure, and decide a) whether it could be affected by some other, simultaneous operation, and b) whether that other work matters to the end result of this work (or, even, vice versa). If the answer is "Yes" to both of these, then consider wrapping the unit of work within a transaction.
What this is suggesting is that you can't always simply decide to either use or not use transactions, rather you should apply them when it makes sense. Use the properties defined by ACID (Atomicity, Consistency, Isolation, and Durability) to help decide when this might be the case.
One other thing to consider is that in some circumstances, particularly if the system must perform many operations in quick succession, e.g., a high-volume transaction processing application, you might need to weigh the relative performance cost of the transaction. Depending upon the size of the unit of work, a commit (or rollback) of a transaction can be resource expensive, perhaps negatively impacting the performance of your system unnecessarily or, at least, with limited benefit.
Unfortunately, this is not an easy question to precisely answer: "It depends."
Use them if:
There are some errors that you may want to test for and catch which won't be caught except by you going out and doing the work (looking things up, testing values, etc.), usually from within a transaction so that you can roll back the whole operation.
There are multi-step operations of any sort, which should, logically, be rolled back as a group if they fail.