SQL Server NOLOCK vs READPAST in a Queue system

I am developing a brand new product at work that is supposed to go live soon. It's predominantly an ETL product that will be dealing with enormous volumes of data in a queue type operation. So records come in, we do work on them, and then they are picked up and sent back out. So everything that's being done in this system is done over and over again until all the records have completed processing.
I'm being told by my boss (who is open and reasonable, this isn't a demand) that I should add NOLOCK hints to all the queries.
I'm torn on this because I've always read that it's bad practice. I've also read about the READPAST hint and I'm thinking that might be a good alternative, but I want to get others' opinions as no one in my organization has used it before.
My understanding is that READPAST would pick up any records that aren't locked, and would just ignore locked records, which in some systems I could see being a problem. In this system where the job will run again a few minutes later and will likely pick up the previously locked record, I don't see that it would matter. It's not super time-sensitive, so if a record takes a few minutes longer because it's locked, that's acceptable.
Wondering what others' thoughts on this are? I realize this is not a replacement for proper indexing, and I'm working on that as well.
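For reference, a minimal sketch of the dequeue pattern being discussed, assuming a hypothetical dbo.WorkQueue table with an IsProcessed flag (all names here are made up for illustration). READPAST skips rows that other workers currently hold locks on instead of blocking on them, while UPDLOCK and ROWLOCK keep the rows this worker claims locked until the transaction commits:

    BEGIN TRANSACTION;

    SELECT TOP (10) q.QueueId, q.Payload
    FROM dbo.WorkQueue AS q WITH (UPDLOCK, ROWLOCK, READPAST)
    WHERE q.IsProcessed = 0
    ORDER BY q.QueueId;

    -- ... process the claimed rows and mark them as processed ...

    COMMIT TRANSACTION;

Rows another worker has locked simply don't appear in the result set, which matches the "it will get picked up on the next run" behaviour described above.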

Related

Slow Simultaneous Writes to Same Table in PostgreSQL Database

I suspect this question may be better suited for the Database Administrators site, so LMK if it is and I'll move it. :)
I'm something of a database/Postgres beginner here so help me out. I have a system set up to process 10 things in parallel and write output of those things to the same table in the same Postgres database. The writes happen ok but they take forever. My log files show that I'll have results for 30,000 of these things, but only 7,000 of them are reflected in the database.
I suspect Postgres is queueing up the writes for some reason, and my guess is that this happens because that table has an auto-incrementing primary key. If I'm trying to write 10 records to the same table simultaneously, I would assume they'd have to be queued, because otherwise how is the primary key going to be set?
Do I have this right, or is my database horribly misconfigured? My sysadmin doesn't typically do databases, so if you have any tuning suggestions, even basic stuff, I'd be glad to hear them. :)
> I suspect Postgres is queueing up the writes for some reason, and my guess is that this happens because that table has an auto-incrementing primary key. If I'm trying to write 10 records to the same table simultaneously, I would assume they'd have to be queued, because otherwise how is the primary key going to be set?
Nope, that's not it.
If you read the documentation on sequences you'll see that they're exempt from transactional visibility and rollback specifically for this reason. An ID generated with nextval is not re-used on rollback.
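A quick illustration of that, assuming a hypothetical items table with a serial primary key: the ID consumed inside the rolled-back transaction is simply skipped, so concurrent inserts never have to wait on each other for key generation.

    BEGIN;
    INSERT INTO items (payload) VALUES ('a');  -- consumes, say, id 41 via nextval
    ROLLBACK;                                  -- the row is gone, but 41 is not handed back

    INSERT INTO items (payload) VALUES ('b');  -- gets id 42; the gap at 41 stays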
> Do I have this right, or is my database horribly misconfigured? My sysadmin doesn't typically do databases, so if you have any tuning suggestions, even basic stuff, I'd be glad to hear them. :)
It's more likely that you're doing individual commits, one per insert, on a system with really slow fsync()s like a single magnetic hard drive. You might also have your checkpoint intervals too low (see the PostgreSQL logs where warnings about this will appear if so), might have too many indexes causing a slowdown, etc.
You should look at the PostgreSQL logs.
Also, please see the primer I wrote on the topic of improving insert performance.
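For context, a rough sketch of the difference (the results table and its columns are hypothetical). With autocommit, every single-row INSERT pays for its own commit and fsync(); wrapping a batch in one transaction pays that cost once:

    -- One transaction per row (autocommit): each INSERT waits for its own fsync().
    INSERT INTO results (thing_id, output) VALUES (1, 'foo');
    INSERT INTO results (thing_id, output) VALUES (2, 'bar');

    -- Many rows per transaction: one commit, one fsync() for the whole batch.
    BEGIN;
    INSERT INTO results (thing_id, output) VALUES
        (1, 'foo'),
        (2, 'bar'),
        (3, 'baz');
    COMMIT;

If batching like this is still too slow, COPY is usually the next step.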

Improve Log Exceptions

I am planning to use log4net in a new web project. From experience I know how big a log table can get, and I have also noticed that errors and exceptions are repeated. For instance, I just queried a log table that has more than 132,000 records; using DISTINCT I found that only about 2,500 of them (~2%) are unique, and the rest (~98%) are duplicates. So I came up with this idea to improve logging.
Add a couple of new columns, counter and updated_dt, that are updated every time we try to insert the same record again.
If we want to track the user that caused the exception, we need to create a user_log or log_user table to map the N-N relationship.
Creating this model may make the system slow and inefficient if it has to compare all that long text... Here's the trick: we should also have a binary hash column of 16 or 32 bytes that hashes the message and the exception, and configure an index on it (a sketch of such a schema is at the end of this question). We can use HASHBYTES to help us.
I am not a DB expert, but I think that would be the fastest way to locate a similar record. And because hashing doesn't guarantee uniqueness, it only helps us locate candidate records much faster; we would then compare the message or exception directly to make sure they really are the same.
This is a theoretical/practical solution, but will it work, or will it just bring more complexity? What aspects am I leaving out, or what other considerations do I need to take into account? A trigger would do the insert-or-update work, but is a trigger the best way to do it?
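A rough sketch of what that schema might look like (all names and column sizes here are invented for illustration; note that before SQL Server 2016, HASHBYTES only accepts up to 8000 bytes of input):

    CREATE TABLE dbo.ErrorLog (
        ErrorLogId  int IDENTITY(1,1) PRIMARY KEY,
        Message     nvarchar(1000) NOT NULL,
        Exception   nvarchar(2000) NULL,
        Counter     int            NOT NULL DEFAULT (1),
        UpdatedDt   datetime2      NOT NULL DEFAULT (SYSUTCDATETIME()),
        -- 32-byte SHA-256 hash of message + exception, persisted so it can be indexed
        MessageHash AS CAST(HASHBYTES('SHA2_256', CONCAT(Message, N'|', Exception)) AS binary(32)) PERSISTED
    );

    CREATE INDEX IX_ErrorLog_MessageHash ON dbo.ErrorLog (MessageHash);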
I wouldn't be too concerned about a log table of 132,000 records, to be honest; I have seen millions, if not billions, of records in a log table. If you are logging 132,000 records every few minutes, then you might want to tone it down a bit.
I think the idea is interesting, but here are my major concerns:
You could actually hurt the performance of your application by doing this. The Log4Net ADO.NET appender is synchronous, which means that if you make your INSERT any more complicated than it needs to be (i.e. looking up whether the data already exists, calculating hash codes, etc.), you will block the thread that is doing the logging. That's not good! You could fix this by writing to some sort of staging table and de-duplicating out of band with a job or something, but now you've created a bunch of moving parts for something that could be much simpler.
Time could probably be better spent doing other things. Storage is cheap, developer hours aren't, and logs don't need to be extremely fast to access, so a denormalized model should be fine.
Thoughts?
Yes you can do that. It is a good idea and it will work. Watch out for concurrency issues when inserting from multiple threads or processes. You probably need to investigate locking in detail. You should look into locking hints (in your case UPDLOCK, HOLDLOCK, ROWLOCK) and the MERGE statement. They can be used to maintain the dimension table.
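For example, a hedged sketch of such an upsert against the hypothetical dbo.ErrorLog table sketched above (@Message and @Exception are the values being logged); HOLDLOCK and UPDLOCK on the target close the race where two sessions both decide the row doesn't exist yet:

    MERGE dbo.ErrorLog WITH (HOLDLOCK, UPDLOCK) AS target
    USING (SELECT CAST(HASHBYTES('SHA2_256',
                   CONCAT(@Message, N'|', @Exception)) AS binary(32)) AS MessageHash) AS src
        ON target.MessageHash = src.MessageHash
    WHEN MATCHED THEN
        UPDATE SET Counter = target.Counter + 1,
                   UpdatedDt = SYSUTCDATETIME()
    WHEN NOT MATCHED THEN
        INSERT (Message, Exception)
        VALUES (@Message, @Exception);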
As an alternative you could log to a file and compress it. Typical compression algorithms are very good at eliminating this type of exact redundancy.

Are database deadlocks a fact of life?

We all know about techniques to prevent db deadlocks - acquire locks in the same order, etc. But at some point, systems under pressure may simply suffer from deadlocks here and there. Should we simply accept that and always be prepared to retry when a deadlock occurs or should deadlocks be considered absolutely verboten and should we do everything in our power to prevent them?
The answer is yes.
You should do everything in your power to prevent them, but are you ever going to be satisfied that you've made them impossible?
Do everything in your power to prevent them, and be prepared to retry when they occur. :)
Keep in mind that "doing everything in your power" can mean things like queueing batch updates, making inserts into temp tables and then merging those into the main tables later, and other non-trivial techniques. Be sure to check your transaction isolation level and your lock escalation policy.
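Being "prepared to retry" can be as simple as catching the deadlock-victim error and re-running the transaction. A minimal T-SQL sketch (the actual work is elided; THROW requires SQL Server 2012 or later):

    DECLARE @retries int = 0;

    WHILE @retries < 3
    BEGIN
        BEGIN TRY
            BEGIN TRANSACTION;
                -- ... the statements that occasionally deadlock go here ...
            COMMIT TRANSACTION;
            BREAK;  -- success: leave the retry loop
        END TRY
        BEGIN CATCH
            IF XACT_STATE() <> 0 ROLLBACK TRANSACTION;

            IF ERROR_NUMBER() <> 1205 OR @retries >= 2
                THROW;          -- not a deadlock, or out of retries: re-raise

            SET @retries += 1;  -- we were the deadlock victim: go around again
        END CATCH;
    END;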
This will probably be closed, but the world is trending toward NoSQL solutions to this problem: breaking problems up so that guaranteed consistency isn't required from the data source, which means locks aren't required either.
Facebook would be a good example of this: it doesn't matter when everyone sees your update, or if different users around the world see different versions of your profile. As long as the update works, or eventually fails, that is good enough.

SQL Query with Table Locking

I am having an argument with a developer colleague on the team.
Problem: SQL query X runs for 1 second on the test system, but for an unknown amount of time on live system (150 users can run queries at the same time).
The query takes locks on 8 tables, and 7 of those locks are unnecessary.
His idea is to put a WITH (NOLOCK) on the 7 tables so there aren't any more locks.
My argument:
My suggestion is that with NOLOCK there is a chance that user 1 runs the SELECT query, which takes 10 seconds to complete because server performance is low at the moment, while user 2 changes a value in one of those 7 tables, e.g. a customer record.
Then the query result would be completely wrong, or maybe the expected dataset can't be filled at all and the query crashes with an error. So use a ROWLOCK instead.
His argument:
We don't need a ROWLOCK; the chances of hitting such a scenario are too low. We don't need to be perfect. Do what is asked of you and don't think.
What can I say to win an argument against people who don't care about getting it right?
I believe, based on what you have said, that you are correct in your reasoning.
If there is ANY chance, no matter how small, that an operation could cause the database to lose integrity, it MUST be fixed.
Integrity is one of the basic premises of database design; your co-worker sounds like he is not being rigorous in his work.
If you are trying to construct a technical argument to "beat" your co-worker, note that it may not give you the outcome you imagine.
If your co-worker is not amenable to what you are saying AND you are REALLY sure that you are correct in your reasoning, then I would inform your team leader why you think this is important and show him your solution. If he agrees with your co-worker because he believes that database integrity is not important, then perhaps you should look at working somewhere else.
Don't get me wrong, I realise that in the real world software cannot be 'perfect', otherwise it would never be released. But something as fundamental as data input checking should not be skipped over, and it isn't difficult to do. It's basically the same as saying, "well, let's not bother to validate user input". This is something you learn how to do in a first-year Computer Science class!
We have enough crappy software on this planet, and this is an age when we are capable of AMAZING THINGS. Sloppiness in software engineering doesn't have a place anymore, and I hope that you do not let your co-worker lower your standards. Keep your standards high and you will learn more than he does and do better in the long run.
Locking hints in SQL Server 2000 (SS2k) were useful because SS2k was greedy about locking on UPDATE statements and would default to TABLELOCK and narrow it as it progressed. If you knew your UPDATE statement's pattern you could use locking hints to increase performance and SS2k would escalate the lock if needed.
NOLOCK was introduced for dirty reads of locked data. If a table is frequently updated and queries that don't rely on the validity of the underlying data are being blocked, you could use NOLOCK to read the data in whatever state it was in. If you need to read records to generate a search results page you might choose to specify the NOLOCK hint to ensure your query isn't blocked by any update statements.
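For instance (a minimal sketch; the table and parameter names are hypothetical), a non-critical search page might tolerate dirty reads like this:

    SELECT p.ProductId, p.Name, p.Price
    FROM dbo.Products AS p WITH (NOLOCK)
    WHERE p.Name LIKE @SearchTerm + N'%';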
I believe lock escalation was reworked in SQL Server 2005 and locking hints are no longer respected.
If you are using SQL Server, which it sounds like you are, then instead of worrying about NOLOCK to work around readers and writers blocking each other (a common issue on heavily used SQL Server databases doing lots of reads and writes), you should consider using SQL Server's row versioning-based transaction isolation. This works with SQL Server 2005 and above.
This makes SQL Server work much more like Oracle does and eliminates the issues caused by readers blocking writers. Please read into the disadvantages too before you make the decision to use it.
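Turning it on is a database-level setting (the database name below is a placeholder): READ_COMMITTED_SNAPSHOT makes the default READ COMMITTED level use row versions, while ALLOW_SNAPSHOT_ISOLATION lets transactions request the explicit SNAPSHOT level.

    ALTER DATABASE YourDatabase SET READ_COMMITTED_SNAPSHOT ON;

    -- optional: also allow transactions to request full snapshot isolation
    ALTER DATABASE YourDatabase SET ALLOW_SNAPSHOT_ISOLATION ON;

The main cost is the extra tempdb usage for the version store, which is one of the disadvantages worth reading up on first.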
ACID: atomicity, consistency, isolation, and durability. These are the basic tenets of databases that you ignore at your peril.
What your colleague is saying is that it's okay to ignore isolation, the property that guarantees you don't get to see half-done transactions. That's okay in some situations.
For example, we have a set of reports that are not used for critical business purposes but merely to give an indication as to the general health of the system. For that, 95% accuracy is good enough and we don't want the reporting to get in the way of the real work.
But for a statement from a bank to one of its customers, 100% is the absolute minimum accuracy. In situations where you will rely on the data, isolation must be adhered to.
You need to decide which bucket your particular system falls into. I'd be willing to bet good money that the number of situations in which you can ignore any of the ACID principles is minimal.
From my experience, Murphy's law is true: If anything can go wrong, it will.
"We don't need to be perfect" is not an argument. You and your colleague certainly have requirements to conform to.
"Do what is want from you and don't
think."
Remember that you're always the person responsible for your own code; if something goes wrong, you can't say "he told me to do it, blah blah blah"...
Your colleague is wrong: you always have to think. They pay you to use your brain; you're not an aerobics instructor (only a joke, sorry to all the aerobics instructors who love programming).

How far can you really go with "eventual" consistency and no transactions (aka SimpleDB)?

I really want to use SimpleDB, but I worry that without real locking and transactions the entire system is fatally flawed. I understand that for high-read/low-write apps it makes sense, since eventually the system becomes consistent, but what about the time in between? It seems like the right query against an inconsistent DB could spread havoc throughout the entire database in a way that's very hard to track down. Hopefully I'm just being a worrywart...
This is the pretty classic battle between consistency and scalability and - to some extent - availability. Some data doesn't always need to be that consistent. For instance, look at digg.com and the number of diggs against a story. There's a good chance that value is duplicated in the "digg" record rather than forcing the DB to do a join against the "user_digg" table. Does it matter if that number isn't perfectly accurate? Probably not. Then using something like SimpleDB might be a good fit. However if you are writing a banking system, you should probably value consistency above all else. :)
Unless you know from day 1 that you have to deal with massive scale, I would stick to simple, more conventional systems like an RDBMS. If you are working somewhere with a reasonable business model, you will hopefully see a big spike in revenue if there's a big spike in traffic. Then you can use that money to help solve the scaling problems. Scaling is hard, and scaling is hard to predict. Most of the scaling problems that hurt you will be ones you never expected.
I would much rather get a site off the ground and spend a few weeks fixing scale issues when traffic picks up than spend so much time worrying about scale that we never make it to production because we ran out of money. :)
Assuming you're talking about this SimpleDB, you're not being a worrywart; there are real reasons not to use it as a real world DBMS.
The properties that you get from transaction support in a DBMS can be abbreviated by the acronym "A.C.I.D.": Atomicity, Consistency, Isolation, and Durability. The A and D have mostly to do with system crashes, and the C and I have to do with regular operation. They're all things people totally take for granted when working with commercial databases, so if you work with a database that doesn't have one or more of them, you might be in for any number of nasty surprises.
Atomicity: Any transaction will either complete fully or not at all (i.e. it will either commit or abort cleanly). This applies to single statements (like "UPDATE table ...") as well as longer, more complicated transactions. If you don't have this, then anything that goes wrong (like, the disk getting full, the computer crashing, etc.) might leave something half-done. In other words, you can't ever rely on the DBMS to really do the things you tell it to, because any number of real-world problems can get in the way, and even a simple UPDATE statement might get partially completed.
Consistency: Any rules you've set up about the database will always be enforced. Like, if you have a rule that says A always equals B, then nothing anybody does to the database system can break that rule - it'll fail any operation that tries. This isn't quite as important if all your code is perfect ... but really, when is that ever the case? Plus, if you're missing this safety net, things get really yucky when you lose ...
Isolation: Any actions taken on the database will execute as if they happened serially (one at a time), even if in reality they're happening concurrently (interleaved with each other). If more than one user is going to hit this database at the same time, and you don't have this, then things you can't even dream up will go wrong; even atomic statements can interact with each other in unforeseen ways and screw things up.
Durability: If you lose power or the software crashes, what happens to database transactions that were in progress? If you have durability, the answer is "nothing - they're all safe". Databases do this by using something called "Undo / Redo Logging", where every little thing you do to the database is first logged (typically on a separate disk for safety) in a way such that you can reconstruct the current state after a failure. Without that, the other properties above are sort of useless, because you can never be 100% sure that things will stay consistent after a crash.
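As a concrete, entirely hypothetical illustration of the first two properties, consider a simple transfer between two accounts. With atomicity, either both updates commit or neither does; with consistency, nothing that commits can ever violate the CHECK constraint:

    CREATE TABLE accounts (
        id      int PRIMARY KEY,
        balance decimal(12,2) NOT NULL CHECK (balance >= 0)
    );

    BEGIN;
    UPDATE accounts SET balance = balance - 100 WHERE id = 1;
    UPDATE accounts SET balance = balance + 100 WHERE id = 2;
    COMMIT;
    -- If the first UPDATE would drive the balance negative, or the server dies
    -- before COMMIT, neither change ever becomes visible to anyone else.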
Do any of these things matter to you? The answer has everything to do with the types of transactions you're doing, and what guarantees you want in a failure situation. There may well be cases (like a read-only database) where you don't need these, but as soon as you start doing anything non-trivial, and something bad happens, you'll wish you had 'em. Maybe it's OK for you to just revert to a backup anytime something unexpected happens, but my guess is that it isn't.
Also note that dropping all of these protections doesn't make it a given that your database will perform better; in fact, it's probably the opposite. That's because real-world DBMS software also has tons of code to optimize query performance. So, if you write a query that joins 6 tables on SimpleDB, don't assume that it'll figure out the optimal way to run that query - you might end up waiting hours for it to complete, when a commercial DBMS could use an indexed hash join and get it in .5 seconds. There are a zillion little tricks that you can do to optimize query performance, and believe me, you'll really miss them when they're gone.
None of this is meant as a knock on SimpleDB; take it from the author of the software: "Although it is a great teaching tool, I can't imagine that anyone would want to use it for anything else."