RENAME faster than DROP+ADD in MySQL alter table - sql

I'm performing some MySQL table maintenance that will mean removing some redundant columns and adding some new ones.
Some of the columns to drop are of the same type as ones to add. Would the procedure be faster if I took advantage of this and reused some of the existing columns?
My rationale is that changing column names should be a simple table metadata change, whereas removing and adding columns means either finding room at the end of the file (fragmenting data) or rebuilding every row with the correct columns so that they're at the same place on the disk.
The engine in question is MyISAM and I'm not up to scratch on how exactly it'll treat this so I'd like to hear from anyone who has been in the same situation before!

Unless you have a serious issue with performance, I wouldn't take the renaming approach - because of all the dirty data you're going to leave lying around.
Also, by dropping the table, you will cause any indexes to get re-built - which is a good idea every once in a while...
Martin

I would drop the columns. You will have fragmentation either way. That should be handled in your regular maintenance plans. You could accelerate those after a large number of modification operations.

If you don't know, in Myisam table, every ALTER TABLE operation will do a copy of entire table, thus the table will be locked for the time your server needs to copy the table.

I've used that same logic, and got stung because even with changes that are supposed to not require rewriting the table (i.e. a table rename), a MySQL bug caused it to think it was a change that required rewriting the table.
If the fields you are dealing with are date, datetime or timestamp fields, you are likely to be hit by this, which means that you should just assume it has to do a full rewrite and plan that way.

Related

Speeding up deletes that have joins

i am running a stored procedure to delete data from two tables:
delete from TESTING_testresults
from TESTING_testresults
inner join TESTING_QuickLabDump
on TESTING_QuickLabDump.quicklabdumpid = TESTING_TestResults.quicklabdumpid
where TESTING_quicklabdump.[Specimen ID]=#specimen
delete from TESTING_QuickLabDump
from TESTING_Quicklabdump
where [specimen id]=#specimen
one table is 60m rows and the other is about 2m rows
the procedure takes about 3 seconds to run.
is there any way i can speed this up? perhaps using EXISTS?
meaning IF EXISTS...THEN DELETE - because the delete should not be occurring every single time
something like this
if #specimen exists in TESTING_QuickLabDump then do the procedure with the two deletes
thank you !!!
Rewriting the query probably wont help speeding this up. Use the profiler to find out which parts of the query are slow. For this, make it profiler output the execution plan. Then, try adding appropriate indexes. Perhaps one or both tables could use an index over [specimen id].
For a table with 60 mil rows I would definitely look into partitioning the data horizontally and/or vertically. If it's time-sensitive data then you ought to be able to move old data into a history table. That's usually the first and most obvious thing people do so I would imagine if that were a possibility you would have already done it.
If there are many columns then it would definitely benefit you to denormalize the data into multiple tables. If you did this, I would suggest renaming the tables and creating a view of all the partitioned tables named after the original table. Doing that should ensure existing code isn't broken.
If you 'really' want to fine tune the speed then you should look into getting a faster hard drive and learn a little about hard drives work. Whether the data is stored towards the inner or outer section of the hd will affect speed of access slightly for example. And solid state hard drives have come a long way so you might look into getting one of those.
Beside indexing "obvious" fields, also look in your database schema and check if you have any FOREIGN KEYs whose ON DELETE CASCADE or SET NULL might be triggered by your delete (unlike Oracle, MS SQL Server will tend to show these in the execution plan). Fortunately, this is usually fairly easy to fix by indexing the child endpoint of the FOREIGN KEY.
Also check if you have any expensive triggers.

Implementing soft delete with minimal impact on performance and code

There are some similar questions on the topic, but they are not really helping me.
I want to implement a soft delete feature like on StackOverflow, where items are not really deleted, but just hidden. I am using a SQL database. Here are 3 options:
Add a is_deleted boolean field.
Advantages: Simple.
Disadvantages: No date record. Forces me to add a is_deleted = 0 in every query.
Add a deleted_date date field. This is set to NULL if it's not deleted.
Advantages: Has date.
Disadvantages: Still cluttering my queries.
For both of the above
It will also impact performance because there are all these useless rows. They still have to be maintained in indexes. Also an index on the deleted column won't help when fetching non-deleted (the majority) of the rows. Full table scan is needed.
Another option is to create a separate table to hold deleted items:
Advantages: Improved performance when querying non-deleted rows. No need to add conditions to my queries on non-deleted rows. Easier on index maintenance.
Disadvantages: Complexity: Requires data migration for both deletion and undeletion. Need for new tables. Referential integrity is harder to handle.
Is there a better option?
I personally would base my answer off of how often you anticipate your users wanting to access that deleted data or "restore" that deleted data.
If it's often, then I would go with a "Date_Deleted" field and put a calculated "IsDeleted" in my poco in the code.
If it's never (or almost never) then a history table or deleted table is good for the benefits you explained.
I personally almost never use deleted tables (and opt for isDeleted or date_deleted) because of the potential risk to referencial integrity. You have A -> B and you remove the record from B database... You now have to manage referencial integrity because of your design choice.
If the key is numeric, I handle a "soft-delete" by negating the key. (Of course, won't work for identity keys). You don't need to change your code at all, and can easily restore the record by multiplying by -1.
Just another approach to give some thought to... If the key is alphanumeric, you can do something similar by prepending a unique "marker" characters. Since deleted records will all begin with this marker, then will end up off by themselves in the index.
In my opinion, the best way forward, when thinking about scaling and eventual table/database sizes is your third option - a separate table for deleted items. Such a table can eventually be moved to a different database to support scaling.
I believe you have listed the three most common options. As you have seen, each has advantages and disadvantages. Personally, I like taking the longer view on things.
I think your analysis of the options is good but you missed a few relevant points which I list below. Almost all implementations that I have seen use some sort of deleted or versioning field on the row as you suggest in your first two options.
Using one table with deleted flag:
If your indexes all contain the deleted flag field first and your query's mostly contain a where isdeleted=false type structure then it DOES solve you performance problems and the indexes very efficiently exclude the deleted rows. Similar logic could be used for the deleted date option.
Using two Tables
In general you need to make massive changes to reports because some reports may refer to deleted data (like old sales figures might refer to a deleted sales category). One can overcome this by creating a view which is a union of the two tables to read from and only write to the active records table.
Let's suppose we create a field called dead to mark deleted rows. We can create a index where field dead is false.
In this way, we only search non-deleted rows using the hint use index.

sql: DELETE + INSERT vs UPDATE + INSERT

A similar question has been asked, but since it always depends, I'm asking for my specific situation separately.
I have a web-site page that shows some data that comes from a database, and to generate the data from that database I have to do some fairly complex multiple joins queries.
The data is being updated once a day (nightly).
I would like to pre-generate the data for the said view to speed up the page access.
For that I am creating a table that contains exact data I need.
Question: for my situation, is it reasonable to do complete table wipe followed by insert? or should I do update,insert?
SQL wise seems like DELETE + INSERT will be easier (INSERT part is a single SQL expression).
EDIT: RDBMS: MS SQL Server 2008 Ent
TRUNCATE will be faster than delete, so if you need to empty a table do that instead
You didn't specify your RDBMS vendor but some of them also have MERGE/UPSERT commands This enables you do update the table if the data exists and insert if it doesn't
It partly depends on how the data is accessed. If you have a period of time with no (or very few) users accessing it, then there won't be much impact on the data disappearing (between the DELETE and the completion of the INSERT) for a short while.
Have you considered using a materialized view (MSSQL calls them indexed views) instead of doing it manually? This could also have other performance benefits as an indexed view gives the query optimizer more choices when its constructing execution plans for other queries that reference the table(s) in the view.
It depends on the size of the table and the recovery model on the database. If you are deleting many hundreds of thousands of records and reinstating them vs updating a small batch of a few hundred and inserting tens of rows, it will add an unnecessary size to your transaction logs. However you could use TRUNCATE to get around this as it won't affect the transaction log.
Do you have the option of a MERGE/UPSERT? If you're using MS-SQL you can use CROSS APPLY to do something similar if you don't.
One approach to handling this type of problem is to insert into new table, then do a table Rename. This will insure that all new data is present at the same time.
What if some data that was present yesterdays is not anymore? Delete may be safer or you could end up deleting some records anyway.
And in the end it doesnt really matter which way you go.
Unless on the case #kevinw mentioned
Although I fully agree with SQLMenace's answer I do would like to point out that MERGE does NOT remove unneeded records ! If you're sure that your new data will be a super-set of the existing data, then MERGE is great, otherwise you'll either need to make sure that you delete any superfluous records later on, or use the TRUNCATE + INSERT method ...
(Personally I'm still a fan of the latter as it usually is quite fast, just make sure to drop all indexes/unique constraints upfront and rebuild them one by one. This has the benefit of the INSERT transaction being smaller and the index-adding being done in (smaller) transactions again later on). (**)
(**: yes, this might be tricky on live system, but then again he already mentioned this was done during some kind of overnight anyway, I'm extrapolating there is no user-access at that time)

Do MySQL indexes still work after inserting new rows?

I have a MySQL table with an index covering two fields. If I insert new rows into the table, will my index still work?
Yes, the index is automatically updated; therefore indices (aka indexes) make inserts, updates and deletes slower.
I have a MySQL table with an index covering two fields. If insert new rows into the table, will my index still work?
Yes, the index is updated every time there is an insert/update/delete executed for that table.
Yep, it will still work. Indexes would be pretty useless if you couldn't use them after you had inserted new data. Be aware that the more data you add to your tables, the bigger your indexes will grow, so you want to be sure you create them properly and limit them to what you need them for.
Yes. Could you explain more why you thought it might not?
Note that if you also have individual indexes on one or both of the fields, mysql (at least some 5.0 versions I've used) won't necessarily use the combined index by default; when in doubt, use EXPLAIN to verify that your indexes are used as you expect.
As others have stated, indices are updated with each insert. It's part of the performance penalty for inserts and updates.
Actually this isn't as odd a question as some answerers allude. In the 1980s I engineered a system using an RDBMS which deferred updates "for performance" (according to the documentation). Ironically it was named RDM, originally for Realtime Data Manager (or something very close to that), but later backronymed Responsive Data Manager.
RDM seemed intended for humans speed updates, so deferring database writes until several rows were complete was probably acceptable. The system I engineered used the database backend wrapped by a database server connected to automatic data sources. AFAIK, this was the first application of RDM expecting rapid row inserts. One of the problem solutions was engineering a timestamp field so all the nodes in the system could say "update me with everything which has changed since yyyy-mm-dd hh:mm:ss". Turns out that was a worst-case index update/refresh scenario, and hell to diagnose. Once undocumented calls meaning trust me, I really want the data flushed and the indexes updated now were added, it worked pretty well.

`active' flag or not?

OK, so practically every database based application has to deal with "non-active" records. Either, soft-deletions or marking something as "to be ignored". I'm curious as to whether there are any radical alternatives thoughts on an `active' column (or a status column).
For example, if I had a list of people
CREATE TABLE people (
id INTEGER PRIMARY KEY,
name VARCHAR(100),
active BOOLEAN,
...
);
That means to get a list of active people, you need to use
SELECT * FROM people WHERE active=True;
Does anyone suggest that non active records would be moved off to a separate table and where appropiate a UNION is done to join the two?
Curiosity striking...
EDIT: I should make clear, I'm coming at this from a purist perspective. I can see how data archiving might be necessary for large amounts of data, but that is not where I'm coming from. If you do a SELECT * FROM people it would make sense to me that those entries are in a sense "active"
Thanks
You partition the table on the active flag, so that active records are in one partition, and inactive records are in the other partition. Then you create an active view for each table which automatically has the active filter on it. The database query engine automatically restricts the query to the partition that has the active records in it, which is much faster than even using an index on that flag.
Here is an example of how to create a partitioned table in Oracle. Oracle doesn't have boolean column types, so I've modified your table structure for Oracle purposes.
CREATE TABLE people
(
id NUMBER(10),
name VARCHAR2(100),
active NUMBER(1)
)
PARTITION BY LIST(active)
(
PARTITION active_records VALUES (0)
PARTITION inactive_records VALUES (1)
);
If you wanted to you could put each partition in different tablespaces. You can also partition your indexes as well.
Incidentally, this seems a repeat of this question, as a newbie I need to ask, what's the procedure on dealing with unintended duplicates?
Edit: As requested in comments, provided an example for creating a partitioned table in Oracle
Well, to ensure that you only draw active records in most situations, you could create views that only contain the active records. That way it's much easier to not leave out the active part.
We use an enum('ACTIVE','INACTIVE','DELETED') in most tables so we actually have a 3-way flag. I find it works well for us in different situations. Your mileage may vary.
Moving inactive stuff is usually a stupid idea. It's a lot of overhead with lots of potential for bugs, everything becomes more complicated, like unarchiving the stuff etc. What do you do with related data? If you move all that, too, you have to modify every single query. If you don't move it, what advantage were you hoping to get?
That leads to the next point: WHY would you move it? A properly indexed table requires one additional lookup when the size doubles. Any performance improvement is bound to be negligible. And why would you even think about it until the distant future time when you actually have performance problems?
I think looking at it strictly as a piece of data then the way that is shown in the original post is proper. The active flag piece of data is directly dependent upon the primary key and should be in the table.
That table holds data on people, irrespective of the current status of their data.
The active flag is sort of ugly, but it is simple and works well.
You could move them to another table as you suggested. I'd suggest looking at the percentage of active / inactive records. If you have over 20 or 30 % inactive records, then you might consider moving them elsewhere. Otherwise, it's not a big deal.
Yes, we would. We currently have the "active='T/F'" column in many of our tables, mainly to show the 'latest' row. When a new row is inserted, the previous T row is marked F to keep it for audit purposes.
Now, we're moving to a 2-table approach, when a new row is inserted, the previous row is moved to an history table. This give us better performance for the majority of cases - looking at the current data.
The cost is slightly more than the old method, previously you had to update and insert, now you have to insert and update (ie instead of inserting a new T row, you modify the existing row with all the new data), so the cost is just that of passing in a whole row of data instead of passing in just the changes. That's hardly going to make any effect.
The performance benefit is that your main table's index is significantly smaller, and you can optimise your tablespaces better (they won't grow quite so much!)
Binary flags like this in your schema are a BAD idea. Consider the query
SELECT count(*) FROM users WHERE active=1
Looks simple enough. But what happens when you have a large number of users, so many that adding an index to this table would be required. Again, it looks straight forward
ALTER TABLE users ADD INDEX index_users_on_active (active)
EXCEPT!! This index is useless because the cardinality on this column is exactly two! Any database query optimiser will ignore this index because of it's low cardinality and do a table scan.
Before filling up your schema with helpful flags consider how you are going to access that data.
https://stackoverflow.com/questions/108503/mysql-advisable-number-of-rows
We use active flags quite often. If your database is going to be very large, I could see the value in migrating inactive values to a separate table, though.
You would then only require a union of the tables when someone wants to see all records, active or inactive.
In most cases a binary field indicating deletion is sufficient. Often there is a clean up mechanism that will remove those deleted records after a certain amount of time, so you may wish to start the schema with a deleted timestamp.
Moving off to a separate table and bringing them back up takes time. Depending on how many records go offline and how often you need to bring them back, it might or might not be a good idea.
If the mostly dont come back once they are buried, and are only used for summaries/reports/whatever, then it will make your main table smaller, queries simpler and probably faster.
We use both methods for dealing with inactive records. The method we use is dependent upon the situation. For records that are essentially lookup values, we use the Active bit field. This allows us to deactivate entries so they wont be used, but also allows us to maintain data integrity with relations.
We use the "move to separation table" method where the data is no longer needed and the data is not part of a relation.
The situation really dictates the solution, methinks:
If the table contains users, then several "flag" fields could be used. One for Deleted, Disabled etc. Or if space is an issue, then a flag for disabled would suffice, and then actually deleting the row if they have been deleted.
It also depends on policies for storing data. If there are policies for keeping data archived, then a separate table would most likely be necessary after any great length of time.
No - this is a pretty common thing - couple of variations depending on specific requirements (but you already covered them):
1) If you expect to have a whole BUNCH of data - like multiple terabytes or more - not a bad idea to archive deleted records immediately - though you might use a combination approach of marking as deleted then copying to archive tables.
2) Of course the option to hard delete a record still exists - though us developers tend to be data pack-rats - I suggest that you should look at the business process and decide if there is now any need to even keep the data - if there is - do so... if there isn't - you should probably feel free just to throw the stuff away.....again, according to the specific business scenario.
From a 'purist perspective' the realtional model doesn't differentiate between a view and a table - both are relations. So that use of a view that uses the discriminator is perfectly meaningful and valid provided the entities are correctly named e.g. Person/ActivePerson.
Also, from a 'purist perspective' the table should be named person, not people as the name of the relation reflects a tuple, not the entire set.
Regarding indexing the boolean, why not:
ALTER TABLE users ADD INDEX index_users_on_active (id, active) ;
Would that not improve the search?
However I don't know how much of that answer depends on the platform.
This is an old question but for those search for low cardinality/selectivity indexes, I'd like to propose the following approach that avoids partitioning, secondary tables, etc.:
The trick is to use "dateInactivated" column that stores the timestamp of when the record is inactivated/deleted. As the name implies, the value is NULL while the record is active, but once inactivated, write in the system datetime. Thus, an index on that column ends up having high selectivity as the number of "deleted" records grows since each record will have a unique (not strictly speaking) value.
Then your query becomes:
SELECT * FROM people WHERE dateInactivated is NULL;
The index will pull in just the right set of rows that you care about.
Filtering data on a bit flag for big tables is not really good in terms of performance. In case when 'active' determinate virtual deletion you can create 'TableName_delted' table with the same structure and move deleted data there using delete trigger.
That solution will help with performance and simplifies data queries.