I suspect this question may be better suited for the Database Administrators site, so LMK if it is and I'll move it. :)
I'm something of a database/Postgres beginner here so help me out. I have a system set up to process 10 things in parallel and write output of those things to the same table in the same Postgres database. The writes happen ok but they take forever. My log files show that I'll have results for 30,000 of these things, but only 7,000 of them are reflected in the database.
I suspect Postgres is queueing up the writes for some reason, and my guess is that this happens because that table has an auto-incrementing primary key. If I'm trying to write 10 records to the same table simultaneously, I would assume they'd have to be queued, because otherwise how is the primary key going to be set?
Do I have this right, or is my database horribly misconfigured? My sysadmin doesn't typically do databases, so if you have any tuning suggestions, even basic stuff, I'd be glad to hear them. :)
I suspect Postgres is queueing up the writes for some reason, and my guess is that this happens because that table has an auto-incrementing primary key. If I'm trying to write 10 records to the same table simultaneously, I would assume they'd have to be queued, because otherwise how is the primary key going to be set?
Nope, that's not it.
If you read the documentation on sequences you'll see that they're exempt from transactional visibility and rollback specifically for this reason. An ID generated with nextval is not re-used on rollback.
Do I have this right, or is my database horribly misconfigured? My sysadmin doesn't typically do databases, so if you have any tuning suggestions, even basic stuff, I'd be glad to hear them. :)
It's more likely that you're doing individual commits, one per insert, on a system with really slow fsync()s like a single magnetic hard drive. You might also have your checkpoint intervals too low (see the PostgreSQL logs where warnings about this will appear if so), might have too many indexes causing a slowdown, etc.
You should look at the PostgreSQL logs.
Also, please see the primer I wrote on the topic of improving insert performance.
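To make the "one commit per insert" point concrete, here is a minimal sketch that loads the same rows twice, once committing after every row and once committing a single time at the end. It uses Python's built-in sqlite3 module as a stand-in for PostgreSQL (an assumption made only so the example runs without a server), and the table name `results` and the helper names are hypothetical. The absolute timings will not match PostgreSQL, but an on-disk SQLite file also has to sync on every commit, so the shape of the difference is similar.

```python
import os
import sqlite3
import tempfile
import time


def insert_rows(conn, rows, commit_every):
    """Insert rows, committing after every `commit_every` rows."""
    cur = conn.cursor()
    for i, row in enumerate(rows, start=1):
        cur.execute("INSERT INTO results (payload) VALUES (?)", (row,))
        if i % commit_every == 0:
            conn.commit()
    conn.commit()  # commit any remainder


def timed_load(path, rows, commit_every):
    """Create the (hypothetical) results table, load rows, return (row count, seconds)."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS results ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, payload TEXT)"
    )
    start = time.perf_counter()
    insert_rows(conn, rows, commit_every)
    elapsed = time.perf_counter() - start
    count = conn.execute("SELECT COUNT(*) FROM results").fetchone()[0]
    conn.close()
    return count, elapsed


if __name__ == "__main__":
    rows = [f"result-{n}" for n in range(200)]
    with tempfile.TemporaryDirectory() as tmp:
        # commit_every=1 is "one commit per insert"; 200 is one commit for the batch
        for commit_every in (1, 200):
            path = os.path.join(tmp, f"demo_{commit_every}.db")
            count, elapsed = timed_load(path, rows, commit_every)
            print(f"commit every {commit_every:>3} rows: {count} rows in {elapsed:.4f}s")
```

On a slow disk the per-row variant should be noticeably slower; if it is not, the bottleneck is probably elsewhere (checkpoints, indexes), which is what the PostgreSQL logs would help confirm.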
I inherited a database (don't we all just love that) and a table has grown out of control.
The database is over 50 GB.
I did a defrag on it as well as re-indexing but that did not help.
The problem is that the table has 236 columns and about 23000 rows.
And no, that is not a typo!
The only solution I can see is to break up the table.
The .Net app is on our intranet and I have optimized every piece of code it contains as well as the stored procedures.
The table contains information such as TempHigh, TempLow, and TempMed. The High, Low, and Med repeat throughout the table for other factors. So each High, Low, and Med will become its own table with a foreign key pointing to the parent table.
This will create a lot of JOINs when accessing the data and updating.
This is the only way I can see that may fix the problem.
My question is, am I overlooking a better way to fix this problem?
Any and all suggestions are welcome.
Thanks!!!
EDIT
Just to clarify, I have run a defrag on the database as well as re-indexing.
I have had performance monitor open on the web and db servers as well.
Thanks for the comments. I will try to answer your questions. First, I have copied the database to my .Net development environment and it's slow, even with just me logging in. I also moved the .Net app to the same server as the SQL Server to rule out connection issues. Same problems. This is a transaction system (OLTP). Some columns I have already moved to their own tables, as they were repeating themselves, and in their place is a new column with a foreign key (without the constraint). There are no images, just data.
Here are the specs of the table:
TableName  SchemaName  RowCounts  TotalSpaceKB  UsedSpaceKB  UnusedSpaceKB
MyTable    dbo         22904      45192         45160        32
am I overlooking a better way to fix this problem?
Yes:
Profile your application (source code, not SQL) to ensure that the DB access is the slowest part
Run SQL traces to identify the worst-performing queries and work with a good DBA to identify possible optimizations
Make sure all statistics are up-to-date (at least daily).
Don't guess - use hard facts to identify performance bottlenecks
Make sure you have the appropriate indexes for the most common queries.
Don't assume that breaking up the table will speed things up. It can actually make things slower if you're constantly joining across multiple tables that are 1:1.
I've seen databases that are terabytes in size that still perform well (adding/updating thousands of records daily). The key is determining the slowest operations and adding/updating indexes to optimize those.
I need to synchronize tables between 2 databases daily, the source is MSSQL 2008, the target is MSSQL 2005. If I use UPDATE, INSERT, and DELETE statements (i.e. UPDATE rows that changed, INSERT new rows, DELETE rows no longer present), will there be performance improvements if I perform the DELETE statement first? i.e. so that the UPDATE statement doesn't look at rows that don't need to be updated, because they will be deleted.
Here are some other things I need to consider. The tables have 1-3 million+ rows, and because of the amount of transactions and business requirements, the source DB needs to remain online, and the query needs to be as efficient as possible. The job will be run daily in a SQL server agent job on the target DB. On top of that, I am a DB rookie!
Thanks StackOverflow community, you are awesome!
I'd say, first you do delete, then update then insert, so you don't have to update rows which will be deleted anyway and you'll not update rows which are just inserted.
But actually, have you seen SQL Server merge syntax? It could save you a great amount of code.
Update: I have not checked the performance of the MERGE statement against INSERT/UPDATE/DELETE; here's a related link given by Aaron Bertrand for more details.
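The DELETE, then UPDATE, then INSERT order can be sketched as follows. This is only an illustration under stated assumptions: it uses Python's sqlite3 module instead of SQL Server, both tables live in one database, and the table names `source` and `target`, the columns `id` and `value`, and the helper names are all hypothetical. It also assumes `value` is never NULL, since `<>` does not detect changes to or from NULL.

```python
import sqlite3


def make_demo():
    """Build a tiny in-memory source/target pair (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE source (id INTEGER PRIMARY KEY, value TEXT NOT NULL);
        CREATE TABLE target (id INTEGER PRIMARY KEY, value TEXT NOT NULL);
        INSERT INTO source VALUES (1, 'same'), (2, 'new value'), (4, 'added');
        INSERT INTO target VALUES (1, 'same'), (2, 'old value'), (3, 'gone');
        """
    )
    return conn


def sync_target(conn):
    """Make target match source; returns (deleted, updated, inserted)."""
    cur = conn.cursor()

    # 1. DELETE rows that no longer exist in the source,
    #    so the UPDATE below never touches them.
    cur.execute("DELETE FROM target WHERE id NOT IN (SELECT id FROM source)")
    deleted = cur.rowcount

    # 2. UPDATE only the rows whose value actually changed.
    cur.execute(
        """
        UPDATE target
        SET value = (SELECT s.value FROM source AS s WHERE s.id = target.id)
        WHERE EXISTS (
            SELECT 1 FROM source AS s
            WHERE s.id = target.id AND s.value <> target.value
        )
        """
    )
    updated = cur.rowcount

    # 3. INSERT rows that are new in the source; doing this last means
    #    freshly inserted rows are never examined by the UPDATE.
    cur.execute(
        """
        INSERT INTO target (id, value)
        SELECT id, value FROM source
        WHERE id NOT IN (SELECT id FROM target)
        """
    )
    inserted = cur.rowcount

    conn.commit()  # one transaction for all three steps
    return deleted, updated, inserted


if __name__ == "__main__":
    conn = make_demo()
    print(sync_target(conn))
    print(conn.execute("SELECT id, value FROM target ORDER BY id").fetchall())
```

On tables with millions of rows the same statements would normally be driven by indexed keys and possibly run in batches; the sketch only shows the ordering.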
Rule of Thumb: DELETE, then UPDATE, then INSERT.
Performance aside, my main concern is to avoid any potential Deadlocks when:
Updating something you will immediately Delete.
Inserting something you may immediately try to Update.
If you only modify what is necessary and use transactions correctly, then you may use any order.
P.S. Someone suggested using MERGE - I've tried it a few times and my preference is to never use it.
I think Roman's answer is what you were looking for in your current situation: DELETE, UPDATE, INSERT (or MERGE.)
Now there are other possible routes which can make things even faster, but with a rather different process:
1. Consider saving all statements in a file that you, once in a while, run against the target
Assuming both databases are exactly the same, for each SQL statement that modifies the 2008 database, save that statement in a .sql file which you later execute against the 2005 database. You have to consider locking the file while writing to it, and maybe have some kind of redundancy. However, this means you need no access to the 2008 database at all while doing the work on the 2005 database. In other words, no side effects on the 2008 database speed.
Pitfall: you may miss a statement and the destination will not be an exact equivalent...
2. Ongoing replication
I do not know about MSSQL enough to tell you of a good tool to do automatic replication (see here: http://technet.microsoft.com/en-us/library/ms151198.aspx), but I'd bet you can find a good tool. MySQL (http://dev.mysql.com/doc/refman/5.0/en/replication.html) and PostgreSQL (http://wiki.postgresql.org/wiki/Streaming_Replication) have such tools and those are all free.
This would be the solution I would choose. Depending on the tool you use, it can be really very well optimized meaning that the impact on the live system will be minimal and the 2005 duplicate will be up to date within seconds (depending on whether it is a long distance remote connection or not, the amount of work, the setup of each server, the Internet connections, etc.)
The pitfall is obviously that it adds an ongoing process on the database, but if you find an MSSQL tool that works like the streaming replication of PostgreSQL, it makes use of a copy of the journal which means it is dead fast (no heavy use of disk I/O.)
3. Cluster Database (like Cassandra)
This would involve a change of database which I'm totally sure you're not ready to do (especially because most of those systems do not offer SQL,) but I thought that it would be a good thing to talk about in your situation.
A system like Cassandra (http://cassandra.apache.org/) automatically replicates its data on many computers. It can actually be set up to replicate all the data 100% or X% of data per computer with redundancy in case of failure (a computer that breaks down). This alleviates the need for a specific copy on a separate computer because the performance can be increased simply by adding a few nodes to your system. (At less than $1,000 a computer, it is worth it! Frankly you could create a petabyte system for $50k or less and end up with something a lot faster than any SQL database...)
The main problem is that the use of those clusters is completely different from SQL. But that could be a solution for big businesses that have large databases which need to be really fast and that do not want to invest in a mini-computer (think Cobol and $250k computers that manage 100 million rows in a few milliseconds...)
With Cassandra you can run extremely heavy batch processes on back end computers that do not make a dent to the front end system!
I am planning to use log4net in a new web project. In my experience, the log table can get very big, and I notice that errors or exceptions are repeated. For instance, I just queried a log table that has more than 132,000 records, and using DISTINCT I found that only 2,500 records are unique (~2%); the others (~98%) are just duplicates. So I came up with this idea to improve logging.
Have a couple of new columns, counter and updated_dt, that are updated every time we try to insert the same record.
If we want to track the user that caused the exception, we need to create a user_log or log_user table to map the N-N relationship.
Creating this model may make the system slow and inefficient, since it has to compare all these long texts... Here is the trick: we should also have a hash column, binary of 16 or 32 bytes, that hashes the message and the exception, and configure an index on it. We can use HASHBYTES to help us.
I am not an expert in DB, but I think that will be the fastest way to locate a similar record. Because hashing doesn't guarantee uniqueness, the hash will help to locate those similar records much faster, and a later direct comparison by message or exception makes sure they really are the same.
This is a theoretical/practical solution, but will it work, or will it bring more complexity? What aspects am I leaving out, and what other considerations do I need to have? A trigger will do the job of insert or update, but is a trigger the best way to do it?
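The hash-then-compare idea described above can be sketched like this. It is a minimal illustration, not a log4net appender: it uses Python's sqlite3 and hashlib in place of SQL Server and HASHBYTES, the table and column names (`log`, `msg_hash`, `counter`, `updated_dt`) follow the question but the exact schema is an assumption, and it ignores concurrent writers entirely.

```python
import hashlib
import sqlite3
from datetime import datetime, timezone


def open_log_db():
    """Create an in-memory log table with an index on the hash column."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE log (
            id         INTEGER PRIMARY KEY,
            message    TEXT NOT NULL,
            exception  TEXT NOT NULL,
            msg_hash   BLOB NOT NULL,              -- 32-byte SHA-256 digest
            counter    INTEGER NOT NULL DEFAULT 1,
            updated_dt TEXT NOT NULL
        );
        CREATE INDEX ix_log_msg_hash ON log (msg_hash);
        """
    )
    return conn


def log_event(conn, message, exception):
    """Insert a new log row, or bump the counter of an identical one."""
    # Hash message and exception together; the NUL separator keeps
    # ("ab", "c") and ("a", "bc") from producing the same input.
    digest = hashlib.sha256(
        (message + "\x00" + exception).encode("utf-8")
    ).digest()
    now = datetime.now(timezone.utc).isoformat()

    # The index narrows the search by hash; the text comparison then
    # guards against hash collisions.
    row = conn.execute(
        "SELECT id FROM log "
        "WHERE msg_hash = ? AND message = ? AND exception = ?",
        (digest, message, exception),
    ).fetchone()

    if row is None:
        conn.execute(
            "INSERT INTO log (message, exception, msg_hash, updated_dt) "
            "VALUES (?, ?, ?, ?)",
            (message, exception, digest, now),
        )
    else:
        conn.execute(
            "UPDATE log SET counter = counter + 1, updated_dt = ? WHERE id = ?",
            (now, row[0]),
        )
    conn.commit()


if __name__ == "__main__":
    conn = open_log_db()
    log_event(conn, "timeout", "SqlException")
    log_event(conn, "timeout", "SqlException")
    log_event(conn, "null ref", "NullReferenceException")
    print(conn.execute("SELECT message, counter FROM log ORDER BY id").fetchall())
```

The select-then-write pair is not atomic, so with several writers two identical rows could still be inserted; a real implementation would need locking or an atomic upsert.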
I wouldn't be too concerned with a log table of 132,000 records to be honest, I have seen millions, if not billions of records in a log table. If you are logging out 132,000 records every few minutes then you might want to tone it down a bit.
I think the idea is interesting but here are my major concerns:
You could actually hurt the performance of your application by doing this. The Log4Net ADO.NET appender is synchronous. This means if you make your INSERT any more complicated than it needs to be (i.e. looking up whether the data already exists, calculating hash codes, etc.) you will block the thread that is calling the logger. That's not good! You could fix this by writing to some sort of staging table and doing the work out of band with a job or something, but now you've created a bunch of moving parts for something that could be much simpler.
Time could probably be better spent doing other things. Storage is cheap, developer hours aren't and logs don't need to be extremely fast to access so a denormalized model should be fine.
Thoughts?
Yes you can do that. It is a good idea and it will work. Watch out for concurrency issues when inserting from multiple threads or processes. You probably need to investigate locking in detail. You should look into locking hints (in your case UPDLOCK, HOLDLOCK, ROWLOCK) and the MERGE statement. They can be used to maintain the dimension table.
As an alternative you could log to a file and compress it. Typical compression algorithms are very good at eliminating this type of exact redundancy.
I'm working on a database with the following characteristics:
Many inserts (in the range of 1k/second)
Lots of indices on the data, complex joins
NO Deletes or updates, only inserts, read and table drops
I don't care if the reads to the database reflect accurate state
Data isn't critical, I'm already running fsync=off
I already know a fair bit about postgres optimization, but I was hoping there might be some additional tricks that are more suited to my particular use case.
You can disable the WAL, perhaps by pointing it to /dev/null or RAMDISK. Note there is some speculation that you may not be able to restart the DB after even a clean stop, so I advise caution and testing.
Make sure you cluster your tables. Partitioning might help as well.
Certainly disable synchronous_commit.
http://wiki.postgresql.org/wiki/FAQ#How_do_I_tune_the_database_engine_for_better_performance.3F
You might want to use unlogged tables (available as of 9.1) for that. It's basically a table for which the WAL is disabled.
I'm trying my best to persuade my boss into letting us use foreign keys in our databases - so far without luck.
He claims it costs a significant amount of performance, and says we'll just have jobs to cleanup the invalid references now and then.
Obviously this doesn't work in practice, and the database is flooded with invalid references.
Does anyone know of a comparison, benchmark or similar which proves there's no significant performance hit to using foreign keys? (Which I hope will convince him)
There is a tiny performance hit on inserts, updates and deletes because the FK has to be checked. For an individual record this would normally be so slight as to be unnoticeable unless you start having a ridiculous number of FKs associated to the table (Clearly it takes longer to check 100 other tables than 2). This is a good thing not a bad thing as databases without integrity are untrustworthy and thus useless. You should not trade integrity for speed. That performance hit is usually offset by the better ability to optimize execution plans.
We have a medium sized database with around 9 million records and FKs everywhere they should be and rarely notice a performance hit (except on one badly designed table that has well over 100 foreign keys, it is a bit slow to delete records from this as all must be checked). Almost every dba I know of who deals with large, terabyte sized databases and a true need for high performance on large data sets insists on foreign key constraints because integrity is key to any database. If the people with terabyte-sized databases can afford the very small performance hit, then so can you.
FKs are not automatically indexed and if they are not indexed this can cause performance problems.
Honestly, I'd take a copy of your database, add properly indexed FKs and show the time difference to insert, delete, update and select from those tables in comparison with the same from your database without the FKs. Show that you won't be causing a performance hit. Then show the results of queries that show orphaned records that no longer have meaning because the PK they are related to no longer exists. It is especially effective to show this for tables which contain financial information ("We have 2700 orders that we can't associate with a customer" will make management sit up and take notice).
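A query for orphaned records of the kind described above might look like the following sketch. It uses Python's sqlite3 module so that it runs anywhere; the `customers` and `orders` tables and their columns are hypothetical, and the same LEFT JOIN pattern should carry over to other SQL databases.

```python
import sqlite3


def make_demo_db():
    """Two tables with no foreign key constraint between them (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
        CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER);
        INSERT INTO customers VALUES (1, 'Alice'), (2, 'Bob');
        -- order 11 points at a customer that does not exist
        INSERT INTO orders VALUES (10, 1), (11, 99), (12, 2);
        """
    )
    return conn


def find_orphans(conn):
    """Return (order id, customer_id) for orders whose customer is missing."""
    return conn.execute(
        """
        SELECT o.id, o.customer_id
        FROM orders AS o
        LEFT JOIN customers AS c ON c.id = o.customer_id
        WHERE o.customer_id IS NOT NULL   -- a NULL reference is not an orphan
          AND c.id IS NULL                -- no matching parent row
        ORDER BY o.id
        """
    ).fetchall()


if __name__ == "__main__":
    print(find_orphans(make_demo_db()))
```

Wrapping the same query in `SELECT COUNT(*)` gives the single headline number to put in front of management.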
From Microsoft Patterns and Practices: Chapter 14 Improving SQL Server Performance:
When primary and foreign keys are defined as constraints in the database schema, the server can use that information to create optimal execution plans.
This is more of a political issue than a technical one. If your project management doesn't see any value in maintaining the integrity of your data, you need to be on a different project.
If your boss doesn't already know or care that you have thousands of invalid references, he isn't going to start caring just because you tell him about it. I sympathize with the other posters here who are trying to urge you to do the "right thing" by fighting the good fight, but I've tried it many times before and in actual practice it doesn't work. The story of David and Goliath makes good reading, but in real life it's a losing proposition.
It is OK to be concerned about performance, but making paranoid decisions is not.
You can easily write benchmark code to show results yourself, but first you'll need to find out what performance your boss is concerned about and detail exactly those metrics.
As far as the invalid references are concerned, if you don't allow nulls on your foreign keys, you won't get invalid references. The database will raise an exception if you try to assign a foreign key value that does not exist. If you need "nulls", assign a key to be "UNDEFINED" or something like that, and make that the default key.
Finally, explain database normalisation issues to your boss, because I think you will quickly find that this issue will be more of a problem than foreign key performance ever will.
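The claim that the database rejects an invalid foreign key can be checked with a small sketch. It uses Python's sqlite3 module as a stand-in, with hypothetical `customers` and `orders` tables. Note that SQLite only enforces foreign keys after `PRAGMA foreign_keys = ON`, which is a SQLite-specific detail; other databases enforce declared constraints by default.

```python
import sqlite3


def make_db():
    """Create parent/child tables with a NOT NULL foreign key (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite-specific: enable FK enforcement
    conn.executescript(
        """
        CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
        CREATE TABLE orders (
            id          INTEGER PRIMARY KEY,
            customer_id INTEGER NOT NULL REFERENCES customers (id)
        );
        INSERT INTO customers VALUES (1, 'Alice');
        """
    )
    return conn


def try_insert_order(conn, order_id, customer_id):
    """Return True if the order was stored, False if the constraint rejected it."""
    try:
        conn.execute(
            "INSERT INTO orders (id, customer_id) VALUES (?, ?)",
            (order_id, customer_id),
        )
        conn.commit()
        return True
    except sqlite3.IntegrityError:
        conn.rollback()  # discard the failed statement's transaction
        return False


if __name__ == "__main__":
    conn = make_db()
    print(try_insert_order(conn, 10, 1))   # existing customer
    print(try_insert_order(conn, 11, 99))  # no such customer
```

With the constraint in place the invalid row never reaches the table, so there is nothing left for a cleanup job to find.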
Does anyone know of a comparison, benchmark or similar which proves there's no significant performance hit to using foreign keys ? (Which I hope will convince him)
I think you're going about this the wrong way. Benchmarks never convince anyone.
What you should do, is first uncover the problems that result from not using foreign key constraints. Try to quantify how much work it costs to "clean out invalid references". In addition, try and gauge how many errors result in the business process because of these errors. If you can attach a dollar amount to that - even better.
Now for a benchmark - you should try and get insight into your workload, identify which type of operations are done most often. Then set up a testing environment, and replay those operations with foreign keys in place. Then compare.
Personally I would not claim right away without knowledge of the applications that are running on the database that foreign keys don't cost performance. Especially if you have cascading deletes and/or updates in combination with composite natural primary keys, then I personally would have some fear of performance issues, especially timed-out or deadlocked transactions due to side-effects of cascading operations.
But no one can tell you; you have to test it yourself, with your data, your workload, your number of concurrent users, your hardware, your applications.
A significant factor in the cost would be the size of the index the foreign key references. If it's small and frequently used, the performance impact will be negligible; large and less frequently used indexes will have more impact. If your foreign key is against a clustered index, it still shouldn't be a huge hit, but #Ronald Bouman is right: you need to test to be sure.
I know this post is a decade old.
But database primitives are always in demand.
I will refer to my own experience.
One of the projects I have worked on had to deal with a telecommunication switch database. They had developed a database with no FKs, because they wanted the fastest inserts they could get. Since the system itself has to deal with calls, that makes some sense.
Before, there was no need for any intensive queries, and if you wanted any report, you could use the GUI software of the switch. After some time you could have some basic reports.
But when I got involved they wanted to develop an AI in order to create smart reports and have something like automatic troubleshooting.
It was a complete nightmare. With millions of records, you couldn't execute any long query, and many times we faced SQL Server timeouts. And don't even think about using Entity Framework.
Facing a situation like this is very different from just describing it.
My advice is that you have to be very specific in your design and have a very good reason for not using FKs.