Do I need to issue CREATE INDEX everytime I update the database? - sql

I have a geo-map database with columns of x,y,z,zoom and type. Initially the read speed is very slow when I use the call
SELECT image WHERE x = ... AND y=... AND zoom=... AND type =...
Thanks to the kind help from stack overflow, I found indexing of (x,y,z,zoom) has helped improved the read speed impressively.
However, I have a question this CREATE INDEX command only need to be issue once when the database initialize at the first time? And even the database grow up gradually, it will still enjoy the read speed improvement brought by indexing?
Or do I need to issue CREATE INDEX command every time before I close my application(during the application, the database will grow)?

You will only need to create an index once.
The database will remember the columns with index and will keep changing the index along with your table.
If you insert an entry to the table, it will be added to the index. If you change an entry - it will be modified in the index. Finally, if you delete an entry - it will be removed from the index.
Note, the index will speed up your search operation - SELECT on the indexed columns, but will downgrade INSERT, UPDATE, DELETE.

Related

Postgres extracting data from a huge table based on non-indexed column

We have a table on production which has been there for quite some time and the volume of that table is huge(close to 3 TB), since most of the data in this table is stale and unused we are planning to get rid of historical data which does not have any references.
There is a column "active" with type boolean which we can use to get rid of this data, however this column is not indexed.
Considering the volume of the table i am not too sure whether creation of a new index is going to help, i tried to incrementally delete the inactive rows 100K at a time but still the volume is so huge that this is going to take months to clear up.
The primary key of the table is of type UUID, i thought of creating a new table and inserting only the valued with active="true" as
insert
into
mytable_active
select
*
from
mytable
where
is_active = true;
But as expected this approach also fails because of the volume and keeps running like forever.
Any suggestions approaches would be most welcome.
When you need to delete a lot of rows quickly, partitioning is great......... when the table is already partitioned.
If there is no index on the column you need, then at least one full table scan will be required, unless you can use another index like "date" or something to narrow it down.
I mean, you could create an index "WHERE active" but that would also require the full table scan you're trying to avoid, so... meh.
First, DELETE. Just don't, not even in small bits with LIMIT. Not only will it write most of the table (3TB writes) but it will also write it to the WAL (3 more TB) and it will also update the indexes, and write that to the WAL too. This will take forever, and the random IO from index updates will nuke your performance. And if it ever finishes, you will still have a 3TB file, with most of it unallocated. Plus indexes.
So, no DELETE. Uh, wait.
Scenario with DELETE:
Swap the table with a view "SELECT * FROM humongous WHERE active=true" and add triggers or rules on the view to redirect updates/inserts/delete to the underlying table. Make sure triggers set all new rows with active=true.
Re-create each index (concurrently) except the primary key, adding "WHERE active=true". This will require a full table scan for the first index, even if you create the index on "active", because CREATE INDEX WHERE doesn't seem to be able to use another index to speed up when a WHERE is specified.
Drop the old indices
Note the purpose of the view is only to ensure absolutely all queries have "active=true" in the WHERE, because otherwise, they wouldn't be able to use the conditional indices we just created, so each query would be a full table scan, and that would be undesirable.
And now, you can DELETE, bit by bit, with your delete from mytable where id in ( select id from mytable where active = false limit 100000);
It's a tradeoff, you'll have a large number of table scans to recreate indices, but you'll avoid the random IO from index update due to a huge delete, which is the real reason why you say it will take months.
Scenario with INSERT INTO new_table SELECT...
If you have inserts and updates running on this huge table, then you have a problem, because these will not be transferred to the new table during the operation. So a solution would be to:
turn off all the scripts and services that run long queries
lock everything
create new_table
rename huge_table to huge_old
create a view that is a UNION ALL of huge_table and huge_old. From the application point of view, this view replaces huge_table. It must handle priority, ie if a row is present in the new table, a row with the same id present in the old table should be ignored... so it will have to have a JOIN. This step should be tested carefully beforehand.
unlock
Then, let it run for a while, see if the view does not destroy your performance. At this point, if it breaks, you can easily go back by dropping the view and renaming the table back to its old self. I said to turn off all the scripts and services that run long queries because these might fail with the view, and you don't want to take a big lock while one long query is running, because that will halt everything until it's done.
add insert/update/delete triggers on the view to redirect the writes to new_table. Inserts go directly to the new table, updates will have to transfer the row, deletes will have to hit both tables, and UNIQUE constraints will be... interesting. This will be a bit complicated.
Now to transfer the data.
Even if it takes a while, who cares? It will finish eventually. I suppose if you have a 3TB table, you must have some decent storage, even if that's these old spinning things that we used to put data on, it shouldn't take more than a few hours if the IO is not random. So the idea is to only use linear IO.
Fingers crossed hoping the table does not have a big text column that is stored in separate TOAST table that is going to require one random access per row. Did you check?
Now, you might actually want it to run for longer so it uses less IO bandwidth, both for reads and writes, and especially WAL writes. It doesn't matter how long the query runs as long as it doesn't degrade performance for the rest of the users.
Postgres will probably go for a parallel table scan to use all the cores and all the IO in the box, so maybe disable that first.
Then I think you should try to avoid the hilarious (for onlookers) scenario where it reads from the table for half a day, not finding any rows that match, so the disks handle the reads just fine, then it finds all the rows that match at the end and proceeds to write 300GB to the WAL and the destination table, causing huge write contention, and you have to Ctrl-C it when you know, you just know it in your gut that it was THIS CLOSE to finishing.
So:
create bogus_table just like mytable but without indices;
insert into bogus_table select * from mytable;
10% of "active" rows is still 300GB so better check the server can handle writing a 300GB table without slowing down. Watch vmstat and check if iowait goes crazy, watch number of transactions per second, query latency, web server responsiveness, the usual database health stuff. If the phone rings, hit Ctrl-C and say "Fixed!"
After it's done a few checkpoints, Ctrl-C. Time to do the real thing.
Now to make this query take much longer (and therefore destroy much less IO bandwidth) you can add this to the columns in your select:
pg_sleep((random()<0.000001)::INTEGER * 0.1)
That will make it sleep for 0.1s every million rows on average. Adjust to taste while looking at vmstat.
You can also monitor query progress using hacks.
It should work fine.
Once the interesting rows have been extracted from the accursed table, you could move the old data to a data warehouse or something, or to cold storage, or have fun loading it into clickhouse if you want to run some analytics.
Maybe partitioning the new table would also be a good idea, before it grows back to 3TB. Or periodically moving old rows.
Now, I wonder how you backup this thing...
-- EDIT
OK, I have another idea, maybe simpler, but you'll need a box.
Get a second server with fast storage and setup logical replication. On this replica server, create an empty UNLOGGED replica of the huge table with only one index on the primary key. Logical replication will copy the entire table, so it will take a while. A second network card in the original server or some QoS tuning would help not blowing up the ethernet connection you actually use to serve queries.
Logical replication is row based and identifies rows by primary key, so you absolutely need to manually create that PK index on the slave.
I've tested it on my home box right now and it works very well. The initial data transfer was a bit slow, but that may be my network. Pausing then resuming replication transferred rows inserted or updated on the master during the pause. However, renaming the table seems to break it, so you won't be able to do INSERT INTO SELECT, you'll have to DELETE on the replica. With SSDs, only one PK index, the table set to UNLOGGED, it should not take forever. Maybe using btrfs would turn the random index write IO into linear IO due to its copy on write nature. Or, if the PK index fits in shared_buffers, just YOLO it and set checkpoint_timeout to "7 days" so it doesn't actually write anything. You'll probably need to do the delete in chunks so the replicated updates keep up.
When I dropped the PK index to speed up the deletion, then recreated it before re-enabling replication, it didn't catch up on the updates. So you can't drop the index.
But is there a way to only transfer the rows you want to keep instead of transferring everything and deleting, while also having the replica keep up with the master's updates?... It's possible to do it for inserts (just disable the initial data copy) but not for updates unfortunately. You'd need an integer primary key so you could generate bogus rows on the replica that would then be updated during replication... but you can't do that with your UUID PK.
Anyway. Once this is done, set the number of WAL segments to be kept on the master server to a very high value, to resume replication later without missing updates.
And now you can run your big DELETE on the replica. When it's done, vacuum, maybe CLUSTER, re-create all indexes, etc, and set the table to LOGGED.
Then you can failover to the new server. Or if you're feeling adventurous, you could replicate the replica's table back on the master, since it will have the same name it should be in another schema.
That should allow for very little downtime since all updates are replicated, the replica will always be up to date.
I would suggest:
Copy the active records to a temporary table
Drop the main table
Rename the temporary table to the main table name

Am I supposed to drop and recreate indexes on tables every time I add data to it?

I'm currently using MSSQL Server, I've created a table with indexes on 4 columns. I plan on appending 1mm rows every month end. Is it customary to drop the indexes, and recreate them every time you add data to the table?
Don't recreate the index. Instead, you can use update statistics to compute the statistics for the given index or for the whole table:
UPDATE STATISTICS mytable myindex; -- statistics for the table index
UPDATE STATISTICS mytable; -- statistics for the whole table
I don't think it is customary, but it is not uncommon. Presumably the database would not be used for other tasks during the data load, otherwise, well, you'll have other problems.
It could save time and effort if you just disabled the indexes:
ALTER INDEX IX_MyIndex ON dbo.MyTable DISABLE
More info on this non-trivial topic can be found here. Note especially that disabling the clustered index will block all access to the table (i.e. don't do that). If the data being loaded is ordered in [clustered index] order, that can help some.
A last note, do some testing. 1MM rows doesn't seem like that much; the time you save may get used up by recreating the indexes.

Two SQL non-clustered index creation

I've got a fairly large table that I'm trying to add two non-clustered indexes to at the same time and it is taking a very long time to go through.
Both are set with ONLINE=ON so that the table data could still be updated and more rows added to it.
Two quick questions:
-I'm wondering if updating data in such a large table is causing the index to never fully be created?
-Now that I've stopped the table from being updated, should both indexes now be created or could they be interfering with each other? (ie. should I cancel one of them?)
TIA
FYI, neither of them finished. I cancelled one and still the other didn't finish. ONLINE did not work at all, I had to revert to ONLINE=OFF and the indexes created fine.

Altering 2 columns in table - risky operation?

I have a table with ~100 columns, about ~30M rows, on MSSQL server 2005.
I need to alter 2 columns - change their types from VARCHAR(1024) to VARCHAR(max). These columns does not have index on them.
I'm worried that doing so will fill up the log, and cause the operation to fail. How can I estimate the needed free disk space, both of the data and the log, needed for such operation to ensure it will not fail?
You are right, increasing the column size (including to MAX) will generate a huge log for a large table, because every row will be updated (behind the scenens the old column gets dropped and a new column gets added and data is copied).
Add a new column of type VARCHAR(MAX) NULL. As a nullable column, will be added as metadata only (no data update)
Copy the data from the old column to new column. This can be done in batches to alleviate the log pressure.
Drop the old column. This will be a metadata only operation.
Use sp_rename to rename the new column to the old column name.
Later, at your convenience, rebuild the clustered index (online if needed) to get rid of the space occupied by the old column
This way you get control over the log by controlling the batches at step 2). You also minimize the disruption on permissions, constraints and relations by not copying the entire table into a new one (as SSMS so poorly does...).
You can do this sequence for both columns at once.
I would recommend that you consider, instead:
Create a new table with the new schema
Copy data from old table to new table
Drop old table
Rename new table to name of old table
This might be a far less costly operation and could possibly be done with minimal logging using INSERT/SELECT (if this were SQL Server 2008 or higher).
Why would increasing the VARCHAR limit fill up the log?
Try to do some test in smaller pieces. I mean, you could create the same structure locally with few thousand rows, and see the difference before and after. I think the change will be linear. The real question is about redo log, if it will fit into it or not, since you can do it at once. Must you do it online, or you can stop production for a while? If you can stop, maybe there is a way to stop redo log in MSSQL like in Oracle. It could make it a lot faster. If you need to do it online, you could try to make a new column, copy the value into it by a cycle for example 100000 rows at once, commit, continue. After completing maybe to drop original column and rename new one is faster than altering.

Slow bulk insert for table with many indexes

I try to insert millions of records into a table that has more than 20 indexes.
In the last run it took more than 4 hours per 100.000 rows, and the query was cancelled after 3½ days...
Do you have any suggestions about how to speed this up.
(I suspect the many indexes to be the cause. If you also think so, how can I automatically drop indexes before the operation, and then create the same indexes afterwards again?)
Extra info:
The space used by the indexes is about 4 times the space used by the data alone
The inserts are wrapped in a transaction per 100.000 rows.
Update on status:
The accepted answer helped me make it much faster.
You can disable and enable the indexes. Note that disabling them can have unwanted side-effects (such as having duplicate primary keys or unique indices etc.) which will only be found when re-enabling the indexes.
--Disable Index
ALTER INDEX [IXYourIndex] ON YourTable DISABLE
GO
--Enable Index
ALTER INDEX [IXYourIndex] ON YourTable REBUILD
GO
This sounds like a data warehouse operation.
It would be normal to drop the indexes before the insert and rebuild them afterwards.
When you rebuild the indexes, build the clustered index first, and conversely drop it last. They should all have fillfactor 100%.
Code should be something like this
if object_id('Index') is not null drop table IndexList
select name into Index from dbo.sysindexes where id = object_id('Fact')
if exists (select name from Index where name = 'id1') drop index Fact.id1
if exists (select name from Index where name = 'id2') drop index Fact.id2
if exists (select name from Index where name = 'id3') drop index Fact.id3
.
.
BIG INSERT
RECREATE THE INDEXES
As noted by another answer disabling indexes will be a very good start.
4 hours per 100.000 rows
[...]
The inserts are wrapped in a transaction per 100.000 rows.
You should look at reducing the number, the server has to maintain a huge amount of state while in a transaction (so it can be rolled back), this (along with the indexes) means adding data is very hard work.
Why not wrap each insert statement in its own transaction?
Also look at the nature of the SQL you are using, are you adding one row per statement (and network roundtrip), or adding many?
Disabling and then re-enabling indices is frequently suggested in those cases. I have my doubts about this approach though, because:
(1) The application's DB user needs schema alteration privileges, which it normally should not possess.
(2) The chosen insert approach and/or index schema might be less then optimal in the first place, otherwise rebuilding complete index trees should not be faster then some decent batch-inserting (e.g. the client issuing one insert statement at a time, causing thousands of server-roundtrips; or a poor choice on the clustered index, leading to constant index node splits).
That's why my suggestions look a little bit different:
Increase ADO.NET BatchSize
Choose the target table's clustered index wisely, so that inserts won't lead to clustered index node splits. Usually an identity column is a good choice
Let the client insert into a temporary heap table first (heap tables don't have any clustered index); then, issue one big "insert-into-select" statement to push all that staging table data into the actual target table
Apply SqlBulkCopy
Decrease transaction logging by choosing bulk-logged recovery model
You might find more detailled information in this article.