I inherited a database (don't we all just love that) and a table has grown out of control.
The database is over 50 GB.
I did a defrag on it as well as re-indexing but that did not help.
The problem is that the table has 236 columns and about 23000 rows.
And no, that is not a typo!
The only solution I can see is to break up the table.
The .Net app is on our intranet and I have optimized every piece of code it contains as well as the stored procedures.
The table contains information such as TempHigh, TempLow, and TempMed. The High, Low, and Med repeat throughout the table for other factors. So each High, Low, and Med will become its own table with a foreign key pointing to the parent table.
This will create a lot of JOINs when accessing the data and updating.
This is the only way I can see that may fix the problem.
My question is, am I overlooking a better way to fix this problem?
Any and all suggestions are welcome.
Thanks!!!
EDIT
Just to clarify, I have run a defrag on the database as well as re-indexing.
I have had performance monitor open on the web and db servers as well.
Thanks for the comments. I will try to answer your questions. First, I have copied the database to my .Net development environment and it's slow, even with just me logging in. I also moved the .Net app to the same server as the SQL Server to rule out connection issues, with the same result. This is a transaction system (OLTP). I have already moved some columns to their own tables because they were repeating themselves, and replaced each of them with a new column holding a foreign key (without the constraint). There are no images, just data.
Here are the specs of the table:
TableName  SchemaName  RowCounts  TotalSpaceKB  UsedSpaceKB  UnusedSpaceKB
MyTable    dbo         22904      45192         45160        32
am I overlooking a better way to fix this problem?
Yes:
Profile your application (source code, not SQL) to ensure that the DB access is the slowest part
Run SQL traces to identify the worst-performing queries and work with a good DBA to identify possible optimizations
Make sure all statistics are up-to-date (at least daily).
Don't guess - use hard facts to identify performance bottlenecks
Make sure you have the appropriate indexes for the most common queries.
Don't assume that breaking up the table will speed things up. It can actually make things slower if you're constantly joining across multiple tables that are 1:1.
I've seen databases that are terabytes in size that still perform well (adding/updating thousands of records daily). The key is determining the slowest operations and adding/updating indexes to optimize those.
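To make "use hard facts" concrete, here is a minimal sketch of checking a query plan before and after adding an index. It uses Python's built-in sqlite3 purely as a stand-in (the question is about SQL Server, where you would look at the execution plan instead), and the table, column and index names are made up for illustration.

```python
import sqlite3

# Hypothetical table and column names; SQLite stands in for the real server.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE readings (id INTEGER PRIMARY KEY, site TEXT, temp_high REAL)"
)
conn.executemany(
    "INSERT INTO readings (site, temp_high) VALUES (?, ?)",
    [(f"site{i % 100}", float(i)) for i in range(23000)],
)

def plan(sql):
    """Return the plan detail strings SQLite reports for a statement."""
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]

query = "SELECT temp_high FROM readings WHERE site = 'site42'"

before = plan(query)  # no usable index yet, so the whole table is scanned
conn.execute("CREATE INDEX ix_readings_site ON readings (site)")
after = plan(query)   # the new index can now be used for the lookup

print("before:", before)
print("after: ", after)
```

The point is the workflow rather than the engine: measure, change one thing, measure again.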
Related
What happens if I restart a MariaDB server while it is Repairing or Optimizing a very large table (like at least 20GB)? Probably because I need to use the table for other stuff and I'm just getting plain bored.
REPAIR and OPTIMIZE are designed to be crash-safe. (Or at least to a large extent.)
OPTIMIZE, for example, copies the table over to a tmp table name. When finished, it internally does a RENAME TABLE, which is fast.
OPTIMIZE is needed in only very rare cases for MyISAM. It is even less needed for InnoDB. What is your use case? I will probably counter that it is 'futile' or 'not worth the effort'.
Repair is needed only for MyISAM. I hope you are not using that antiquated engine.
More
Consider switching to InnoDB; we can discuss this further. With InnoDB, REPAIR is needed much less often, and recovery is automated.
What is the schema? What is the data flow like? I may have tips on avoiding the indexes getting out of date. (The way the .MYD file is laid out is problematic for certain data flows.)
Use ANALYZE TABLE (instead) when there is no complaint about index corruption yet queries are suddenly slow.
I suspect this question may be better suited for the Database Administrators site, so LMK if it is and I'll move it. :)
I'm something of a database/Postgres beginner here so help me out. I have a system set up to process 10 things in parallel and write output of those things to the same table in the same Postgres database. The writes happen ok but they take forever. My log files show that I'll have results for 30,000 of these things, but only 7,000 of them are reflected in the database.
I suspect Postgres is queueing up the writes for some reason, and my guess is that this happens because that table has an auto-incrementing primary key. If I'm trying to write 10 records to the same table simultaneously, I would assume they'd have to be queued, because otherwise how is the primary key going to be set?
Do I have this right, or is my database horribly misconfigured? My sysadmin doesn't typically do databases, so if you have any tuning suggestions, even basic stuff, I'd be glad to hear them. :)
I suspect Postgres is queueing up the writes for some reason, and my guess is that this happens because that table has an auto-incrementing primary key. If I'm trying to write 10 records to the same table simultaneously, I would assume they'd have to be queued, because otherwise how is the primary key going to be set?
Nope, that's not it.
If you read the documentation on sequences you'll see that they're exempt from transactional visibility and rollback specifically for this reason. An ID generated with nextval is not re-used on rollback.
Do I have this right, or is my database horribly misconfigured? My sysadmin doesn't typically do databases, so if you have any tuning suggestions, even basic stuff, I'd be glad to hear them. :)
It's more likely that you're doing individual commits, one per insert, on a system with really slow fsync()s like a single magnetic hard drive. You might also have your checkpoint intervals too low (see the PostgreSQL logs where warnings about this will appear if so), might have too many indexes causing a slowdown, etc.
You should look at the PostgreSQL logs.
Also, please see the primer I wrote on the topic of improving insert performance.
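To illustrate the "one commit per insert" point, here is a rough sketch that writes the same rows two ways: committing after every row, and committing once for the whole batch. It uses Python's built-in sqlite3 on a temporary file as a stand-in for PostgreSQL, so the absolute timings mean nothing for your server; the function and table names are invented for the example.

```python
import os
import sqlite3
import tempfile
import time

def insert_rows(path, rows, commit_each):
    """Insert rows into the database at `path`; return (row count, seconds)."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS results "
        "(id INTEGER PRIMARY KEY AUTOINCREMENT, payload TEXT)"
    )
    conn.commit()
    start = time.perf_counter()
    if commit_each:
        for row in rows:
            conn.execute("INSERT INTO results (payload) VALUES (?)", row)
            conn.commit()  # one durable commit (and disk sync) per row
    else:
        with conn:  # one transaction, one commit for the whole batch
            conn.executemany("INSERT INTO results (payload) VALUES (?)", rows)
    elapsed = time.perf_counter() - start
    count = conn.execute("SELECT COUNT(*) FROM results").fetchone()[0]
    conn.close()
    return count, elapsed

if __name__ == "__main__":
    rows = [(f"result {i}",) for i in range(200)]
    with tempfile.TemporaryDirectory() as tmp:
        slow = insert_rows(os.path.join(tmp, "each.db"), rows, commit_each=True)
        fast = insert_rows(os.path.join(tmp, "batch.db"), rows, commit_each=False)
    print(f"commit per row: {slow[1]:.3f}s, single commit: {fast[1]:.3f}s")
```

On storage with slow syncs the per-row variant is usually the slower one, but measure on your own hardware before drawing conclusions.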
I need to synchronize tables between 2 databases daily, the source is MSSQL 2008, the target is MSSQL 2005. If I use UPDATE, INSERT, and DELETE statements (i.e. UPDATE rows that changed, INSERT new rows, DELETE rows no longer present), will there be performance improvements if I perform the DELETE statement first? i.e. so that the UPDATE statement doesn't look at rows that don't need to be updated, because they will be deleted.
Here are some other things I need to consider. The tables have 1-3 million+ rows, and because of the amount of transactions and business requirements, the source DB needs to remain online, and the query needs to be as efficient as possible. The job will be run daily in a SQL server agent job on the target DB. On top of that, I am a DB rookie!
Thanks StackOverflow community, you are awesome!
I'd say do the DELETE first, then the UPDATE, then the INSERT, so you don't update rows that will be deleted anyway and you don't update rows that were just inserted.
But actually, have you seen SQL Server merge syntax? It could save you a great amount of code.
Update: I have not compared the performance of the MERGE statement against separate INSERT/UPDATE/DELETE statements; see the related link given by Aaron Bertrand for more details.
Rule of Thumb: DELETE, then UPDATE, then INSERT.
Performance aside, my main concern is to avoid any potential Deadlocks when:
Updating something you will immediately Delete.
Inserting something you may immediately try to Update.
If you only modify what is necessary and use transactions correctly, then you may use any order.
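A minimal sketch of that order, run inside a single transaction. It uses Python's built-in sqlite3 as a stand-in and assumes, for illustration only, that both tables are reachable from one connection, are keyed on `id`, and have a single NOT NULL `val` column; all names are hypothetical.

```python
import sqlite3

def sync(conn):
    """Bring `target` in line with `source`: DELETE, then UPDATE, then INSERT."""
    with conn:  # one transaction for all three steps
        # 1. Remove rows that no longer exist in the source.
        conn.execute("DELETE FROM target WHERE id NOT IN (SELECT id FROM source)")
        # 2. Update only rows whose value actually changed.
        conn.execute("""
            UPDATE target
               SET val = (SELECT s.val FROM source s WHERE s.id = target.id)
             WHERE EXISTS (SELECT 1 FROM source s
                            WHERE s.id = target.id AND s.val <> target.val)
        """)
        # 3. Add rows that are new in the source.
        conn.execute("""
            INSERT INTO target (id, val)
            SELECT s.id, s.val FROM source s
             WHERE NOT EXISTS (SELECT 1 FROM target t WHERE t.id = s.id)
        """)

def demo():
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE source (id INTEGER PRIMARY KEY, val TEXT NOT NULL);
        CREATE TABLE target (id INTEGER PRIMARY KEY, val TEXT NOT NULL);
        INSERT INTO source VALUES (1, 'a'), (2, 'B'), (4, 'd');
        INSERT INTO target VALUES (1, 'a'), (2, 'b'), (3, 'c');
    """)
    sync(conn)
    return conn.execute("SELECT id, val FROM target ORDER BY id").fetchall()

if __name__ == "__main__":
    print(demo())
```

Because each statement touches only the rows it needs to, the UPDATE never sees rows that were just deleted or are about to be inserted.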
P.S. Someone suggested using MERGE - I've tried it a few times and my preference is to never use it.
I think Roman's answer is what you were looking for in your current situation: DELETE, UPDATE, INSERT (or MERGE.)
Now there are other possible routes which can make things even faster, but with a rather different process:
1. Consider saving all statements in a file that you, once in a while, run against the target
Assuming both databases are exactly the same, for each SQL statement that modifies the 2008 database, save that statement in a .sql file which you later execute against the 2005 database. You have to consider locking the file while writing to it, and maybe have some kind of redundancy. However, this means you need no access to the 2008 database at all while doing the work on the 2005 database. In other words, no side effects on the speed of the 2008 database.
Pitfall: you may miss a statement and the destination will not be an exact equivalent...
2. Ongoing replication
I do not know MSSQL well enough to recommend a good tool for automatic replication (see here: http://technet.microsoft.com/en-us/library/ms151198.aspx), but I'd bet you can find one. MySQL (http://dev.mysql.com/doc/refman/5.0/en/replication.html) and PostgreSQL (http://wiki.postgresql.org/wiki/Streaming_Replication) have such tools and those are all free.
This would be the solution I would choose. Depending on the tool you use, it can be really very well optimized meaning that the impact on the live system will be minimal and the 2005 duplicate will be up to date within seconds (depending on whether it is a long distance remote connection or not, the amount of work, the setup of each server, the Internet connections, etc.)
The pitfall is obviously that it adds an ongoing process on the database, but if you find an MSSQL tool that works like the streaming replication of PostgreSQL, it makes use of a copy of the journal which means it is dead fast (no heavy use of disk I/O.)
3. Cluster Database (like Cassandra)
This would involve a change of database which I'm totally sure you're not ready to do (especially because most of those systems do not offer SQL,) but I thought that it would be a good thing to talk about in your situation.
A system like Cassandra (http://cassandra.apache.org/) automatically replicates its data across many computers. It can be set up to replicate 100% of the data, or X% of the data per computer, with redundancy in case of failure (a computer that breaks down). This alleviates the need for a specific copy on a separate computer, because performance can be increased simply by adding a few nodes to your system. (At less than $1,000 a computer, it is worth it! Frankly, you could create a petabyte system for $50k or less and end up with something a lot faster than any SQL database...)
The main problem is that using those clusters is completely different from using SQL. But that could be a solution for big businesses that have large databases which need to be really fast and that do not want to invest in a mini-computer (think Cobol and $250k computers that manage 100 million rows in a few milliseconds...)
With Cassandra you can run extremely heavy batch processes on back end computers that do not make a dent in the front end system!
I am planning to use log4net in a new web project. In my experience, the log table can get very big, and I also notice that errors or exceptions are repeated. For instance, I just queried a log table that has more than 132,000 records, and using DISTINCT I found that only 2,500 records are unique (~2%); the others (~98%) are just duplicates. So I came up with this idea to improve logging.
Have a couple of new columns, counter and updated_dt, that are updated every time we try to insert the same record.
If we want to track the user that caused the exception, we need to create a user_log or log_user table to map the N-N relationship.
Creating this model may make the system slow and inefficient, since it has to compare all these long texts... Here is the trick: we should also have a hash column of type BINARY(16) or BINARY(32) that holds a hash of the message and the exception, with an index on it. We can use HASHBYTES to help us.
I am not an expert in DB, but I think that will be the fastest way to locate a similar record. Because hashing doesn't guarantee uniqueness, the hash only helps to locate candidate records much faster; we then compare the message or exception directly to make sure they really match.
This is a theoretical/practical solution, but will it work, or will it bring more complexity? What aspects am I leaving out, and what other considerations are there? A trigger would do the job of insert or update, but is a trigger the best way to do it?
I wouldn't be too concerned with a log table of 132,000 records to be honest, I have seen millions, if not billions of records in a log table. If you are logging out 132,000 records every few minutes then you might want to tone it down a bit.
I think the idea is interesting, but here are my major concerns:
You could actually hurt the performance of your application by doing this. The Log4Net ADO.NET appender is synchronous. This means that if you make your INSERT any more complicated than it needs to be (i.e. looking up whether the data already exists, calculating hash codes etc.) you will block the thread that calls the logger. That's not good! You could fix this by writing to some sort of staging table and doing the work out of band with a job or something, but now you've created a bunch of moving parts for something that could be much simpler.
Time could probably be better spent doing other things. Storage is cheap, developer hours aren't and logs don't need to be extremely fast to access so a denormalized model should be fine.
Thoughts?
Yes you can do that. It is a good idea and it will work. Watch out for concurrency issues when inserting from multiple threads or processes. You probably need to investigate locking in detail. You should look into locking hints (in your case UPDLOCK, HOLDLOCK, ROWLOCK) and the MERGE statement. They can be used to maintain the dimension table.
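For what it's worth, the hash-plus-counter idea from the question could look roughly like the sketch below. It uses Python's built-in sqlite3 and hashlib as stand-ins for SQL Server and HASHBYTES, SHA-256 as the 32-byte hash, and invented table and function names. It runs on a single connection, so it does not address the concurrency and locking concerns mentioned above.

```python
import hashlib
import sqlite3

def make_db():
    """Create an in-memory log table with an indexed hash column."""
    conn = sqlite3.connect(":memory:")
    conn.execute("""
        CREATE TABLE log (
            id         INTEGER PRIMARY KEY,
            msg_hash   BLOB NOT NULL,
            message    TEXT NOT NULL,
            exception  TEXT NOT NULL,
            counter    INTEGER NOT NULL DEFAULT 1,
            updated_dt TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
        )""")
    conn.execute("CREATE INDEX ix_log_hash ON log (msg_hash)")
    return conn

def log_event(conn, message, exception):
    """Bump the counter of an identical entry, or insert a new one."""
    # 32-byte digest of message + exception, used only to narrow the search.
    digest = hashlib.sha256((message + "\x00" + exception).encode("utf-8")).digest()
    with conn:
        # The index finds candidates by hash; comparing the text as well
        # guards against hash collisions.
        cur = conn.execute(
            "UPDATE log SET counter = counter + 1, updated_dt = CURRENT_TIMESTAMP "
            "WHERE msg_hash = ? AND message = ? AND exception = ?",
            (digest, message, exception),
        )
        if cur.rowcount == 0:
            conn.execute(
                "INSERT INTO log (msg_hash, message, exception) VALUES (?, ?, ?)",
                (digest, message, exception),
            )

if __name__ == "__main__":
    db = make_db()
    for _ in range(3):
        log_event(db, "Page failed to load", "NullReferenceException")
    log_event(db, "Login failed", "SqlException")
    print(db.execute("SELECT message, counter FROM log ORDER BY id").fetchall())
```

With several writers, the UPDATE-then-INSERT pair is exactly where two sessions can both decide to insert, which is why the locking hints or MERGE come into play on SQL Server.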
As an alternative you could log to a file and compress it. Typical compression algorithms are very good at eliminating this type of exact redundancy.
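A quick way to see how well compression handles this kind of repetition is to try it on some sample text. The sketch below uses Python's zlib on made-up log lines (25 distinct messages repeated many times); real logs with varying timestamps and user names will compress less well.

```python
import zlib

def compression_ratio(text):
    """Compressed size divided by original size, using zlib at level 9."""
    raw = text.encode("utf-8")
    return len(zlib.compress(raw, 9)) / len(raw)

# Hypothetical log: only 25 distinct messages across 10,000 lines.
messages = [
    f"ERROR NullReferenceException in Page{i % 25}.Load" for i in range(10000)
]
log_text = "\n".join(messages)

ratio = compression_ratio(log_text)
print(f"compressed size is {ratio:.1%} of the original")
```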
We have a bit of a messy database situation.
Our main back-office system is written in Visual Fox Pro with local data (yes, I know!)
In order to effectively work with the data in our websites, we have chosen to regularly export data to a SQL database. However the process that does this basically clears out the tables each time and does a re-insert.
This means we have two SQL databases - one that our FoxPro export process writes to, and another that our websites read from.
This question is concerned with the transform from one SQL database to the other (SqlFoxProData -> SqlWebData).
For a particular table (one of our main application tables), because various data transformations take place during this process, it's not a straightforward set of UPDATE, INSERT and DELETE statements using self-joins; we're having to use cursors instead (I know!)
This has been working fine for many months but now we are starting to hit upon performance problems when an update is taking place (this can happen regularly during the day)
Basically when we are updating SqlWebData.ImportantTable from SqlFoxProData.ImportantTable, it's causing occasional connection timeouts/deadlocks/other problems on the live websites.
I've worked hard at optimising queries, caching etc etc, but it's come to a point where I'm looking for another strategy to update the data.
One idea that has come to mind is to have two copies of ImportantTable (A and B) and some concept of which table is currently 'active', updating the non-active table, then switching the currently active table
i.e. websites read from ImportantTableA whilst we're updating ImportantTableB, then we switch websites to read from ImportantTableB.
Question is, is this feasible and a good idea? I have done something like it before but I'm not convinced it's necessarily good for optimisation/indexing etc.
Any suggestions welcome, I know this is a messy situation... and the long term goal would be to get our FoxPro application pointing to SQL.
(We're using SQL 2005 if it helps)
I should add that data consistency isn't particularly important in this instance, seeing as the data is always slightly out of date
There are a lot of ways to skin this cat.
I would attack the locking issues first. It is extremely rare that I would use CURSORS, and I think improving the performance and locking behavior there might resolve a lot of your issues.
I expect that I would solve it by using two separate staging tables. One for the FoxPro export in SQL and one transformed into the final format in SQL side-by-side. Then either swapping the final for production using sp_rename, or simply using 3 INSERT/UPDATE/DELETE transactions to apply all changes from the final table to production. Either way, there is going to be some locking there, but how big are we talking about?
You should be able to maintain one db for the website and just replicate to that table from the other sql db table.
This is assuming that you do not update any data from the website itself.
"For a particular table (one of our main application tables), because various data transformations take places during this process, it's not a straightforward UPDATE, INSERT and DELETE statements using self-joins, but we're having to use cursors instead (I know!)"
I cannot think of a case where I would ever need to perform an insert, update or delete using a cursor. If you can write the select for the cursor, you can convert it into an insert, update or delete. You can join to other tables in these statements and use the CASE statement for conditional processing. Taking the time to do this in a set-based fashion may solve your problem.
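As a rough illustration of replacing a row-by-row cursor with set-based statements, here is a sketch where the per-row "transformation" is expressed as a CASE expression. It uses Python's built-in sqlite3 as a stand-in for SQL Server, and the tables, columns and status codes are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE staging (id INTEGER PRIMARY KEY, status_code TEXT);
    CREATE TABLE web     (id INTEGER PRIMARY KEY, status TEXT);
    INSERT INTO staging VALUES (1, 'A'), (2, 'D'), (3, 'X');
    INSERT INTO web     VALUES (1, 'Unknown');
""")

# The conditional logic a cursor loop would apply row by row.
STATUS_CASE = """
    CASE s.status_code
        WHEN 'A' THEN 'Active'
        WHEN 'D' THEN 'Discontinued'
        ELSE 'Unknown'
    END
"""

with conn:
    # One statement updates every existing row.
    conn.execute(f"""
        UPDATE web
           SET status = (SELECT {STATUS_CASE} FROM staging s WHERE s.id = web.id)
         WHERE id IN (SELECT id FROM staging)
    """)
    # One statement inserts every missing row.
    conn.execute(f"""
        INSERT INTO web (id, status)
        SELECT s.id, {STATUS_CASE}
          FROM staging s
         WHERE s.id NOT IN (SELECT id FROM web)
    """)

result = conn.execute("SELECT id, status FROM web ORDER BY id").fetchall()
print(result)
```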
Here is one thing you may consider if you have lots of data to move. We occasionally create a view over the data we want and then have two tables: one active and one that data will be loaded into. When the data has finished loading, as part of your process run a simple command to switch the table the view uses to the one you just finished loading. That way the users are only down for a couple of seconds at most. You won't create locking issues where they are trying to access data as you are loading.
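The view-switching approach can be sketched like this. Python's built-in sqlite3 stands in for SQL Server here (where you would typically use ALTER VIEW rather than dropping and recreating the view), and the view and table names are hypothetical.

```python
import sqlite3

ALLOWED_TABLES = ("important_a", "important_b")

def swap_view(conn, new_table):
    """Repoint the `important` view at new_table inside one transaction."""
    if new_table not in ALLOWED_TABLES:  # never interpolate untrusted names
        raise ValueError(f"unexpected table name: {new_table}")
    conn.execute("BEGIN")
    try:
        conn.execute("DROP VIEW IF EXISTS important")
        conn.execute(f"CREATE VIEW important AS SELECT * FROM {new_table}")
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")
        raise

# isolation_level=None: we issue BEGIN/COMMIT ourselves.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE important_a (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE important_b (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO important_a VALUES (1, 'old data')")
swap_view(conn, "important_a")           # readers currently see table A

# Load the inactive table while readers keep using the view over A ...
conn.execute("INSERT INTO important_b VALUES (1, 'new data')")
# ... then switch in one short step.
swap_view(conn, "important_b")

visible = conn.execute("SELECT id, name FROM important").fetchall()
print(visible)
```

The websites only ever query the view, so the switch is the only moment they can be blocked.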
You might also look at using SSIS to move the data.
Do you have the option of making the updates more atomic, rather than the stated 'clear out and re-insert'? I think Visual Fox Pro supports triggers, right? For your key tables, can you add a trigger to the update/insert/delete to capture the ID of records that change, then move (or delete) just those records?
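To show the shape of the trigger idea, here is a sketch in which triggers record the ID of every inserted, updated or deleted row in a change table that a sync job could read later. It uses Python's built-in sqlite3 only because it is easy to run; the FoxPro trigger syntax will differ, and all table and trigger names are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, total INTEGER);
    CREATE TABLE changed_ids (id INTEGER, op TEXT);

    -- Each trigger records which row changed and how.
    CREATE TRIGGER trg_orders_ins AFTER INSERT ON orders
    BEGIN INSERT INTO changed_ids VALUES (NEW.id, 'I'); END;

    CREATE TRIGGER trg_orders_upd AFTER UPDATE ON orders
    BEGIN INSERT INTO changed_ids VALUES (NEW.id, 'U'); END;

    CREATE TRIGGER trg_orders_del AFTER DELETE ON orders
    BEGIN INSERT INTO changed_ids VALUES (OLD.id, 'D'); END;
""")

conn.execute("INSERT INTO orders VALUES (1, 100)")
conn.execute("INSERT INTO orders VALUES (2, 250)")
conn.execute("UPDATE orders SET total = 120 WHERE id = 1")
conn.execute("DELETE FROM orders WHERE id = 2")
conn.commit()

# The sync job would move or delete only these rows, then clear the table.
changes = conn.execute("SELECT id, op FROM changed_ids ORDER BY rowid").fetchall()
print(changes)
```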
Or how about writing all changes to an offline database, and letting SQL Server replication take care of the sync?
[sorry, this would have been a comment, if I had enough reputation!]
Based on your response to Ernie above, you asked how you replicate databases. Here is Microsoft's how-to about replication in SQL2005.
However, if you're asking about replication and how to do it, it indicates to me that you are a little light in experience for SQL server. That being said, it's fairly easy to muck things up and while I'm all for learning by experience, if this is mission critical data, you might be better off hiring a DBA or at the very least, testing the #$##$% out of this before you actually implement it.