I am using Firebird 2.5.1 Embedded. I have done the usual to empty the table with nearly 200k rows:
delete from SZAFKI
Here's the output; as you can see, it takes about 16 seconds, which is, well, unacceptable.
Preparing query: delete from SZAFKI
Prepare time: 0.010s
PLAN (SZAFKI NATURAL)
Executing...
Done.
3973416 fetches, 1030917 marks, 116515 reads, 116434 writes.
0 inserts, 0 updates, 182658 deletes, 27 index, 182658 seq.
Delta memory: -19688 bytes.
SZAFKI: 182658 deletes.
182658 rows affected directly.
Total execution time: 16.729s
Script execution finished.
Firebird has no TRUNCATE keyword. As the query uses PLAN NATURAL, I tried to PLAN the query by hand, like so:
delete from szafki PLAN (SZAFKI INDEX (SZAFKI_PK))
but Firebird says "SZAFKI_PK cannot be used in the specified plan" (it is the primary key).
The question is: how do I empty the table efficiently? Dropping and recreating the table is not possible.
Answer based on my comment
A trick you could try is to use DELETE FROM SZAFKI WHERE ID > 0 (assuming the ID is 1 or higher). This will force Firebird to look up the rows using the primary key index.
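A minimal sketch of that workaround (assuming the primary-key column is named ID and every value is positive):
DELETE FROM SZAFKI WHERE ID > 0;
-- should produce PLAN (SZAFKI INDEX (SZAFKI_PK)) instead of PLAN (SZAFKI NATURAL)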
My initial assumption was that this would be worse than an unindexed delete. An unindexed delete does a sequential scan over all datapages of the table and deletes rows (that is: it creates a new record version that is a deleted stub record). When you use the index, rows are looked up in index order, which results in a random walk through the datapages (assuming a high level of fragmentation caused by the many record versions left behind by inserts, updates and deletes). I had expected this to be slower, but it probably means Firebird only has to read the datapages holding record versions relevant to the transaction, instead of every datapage of the table.
Unfortunately, there is no fast way to do a massive delete on an entire (big) table with current Firebird versions. You can expect even higher delays when the deleted records are garbage collected (run SELECT * on the table after the delete is committed and you will see). You can try to deactivate the indexes on that table before doing the delete and see if it helps.
If you are using the table as some kind of temporary storage, I suggest you use the GTT (global temporary table) feature.
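A minimal sketch, assuming the table really is just scratch storage (the table and column names below are placeholders):
CREATE GLOBAL TEMPORARY TABLE SZAFKI_TMP (
    ID   INTEGER NOT NULL,
    NAME VARCHAR(20)
)
ON COMMIT DELETE ROWS; -- contents are discarded automatically at commit, no mass DELETE needed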
The fastest (and only) way to get rid of all data in a Firebird table quickly is to drop and create the table again, at least for the current official version 2.5.x. There is no TRUNCATE operator on the roadmap for Firebird 3.0 either; the beta is out, so most probably there will be no TRUNCATE in 3.0 as well.
Also, you can use the RECREATE statement, which has the same syntax as CREATE. If the table exists, RECREATE drops it and then creates it anew; if the table doesn't exist, RECREATE just creates it.
RECREATE TABLE Table1 (
    ID INTEGER,
    NAME VARCHAR(20),
    "DATE" DATE, -- DATE is a reserved word, so the column name has to be quoted
    T TIME
);
Related
I have a table like following:
CREATE TABLE Txn_History (
    ID NUMBER,
    "COMMENT" VARCHAR2(300), -- COMMENT is a reserved word, so it has to be quoted
    -- ... (another 20 columns),
    Std_hash RAW(1000)
) NOLOGGING;
This table is 8 GB with 19 million rows and grows by around 50,000 rows daily.
I need to delete 300,000 rows and update 100,000 rows. I know that delete and update statements normally cause Oracle to generate redo. The only way I know to avoid this is to create a new table containing the updated result.
However, considering that the delete and update only touch about 2% of the entire table, it hardly seems worth creating a new table followed by all the corresponding indexes.
Do you have any new ideas?
To be honest, I don't think redo generation is a big problem here: it's just 300k rows to delete and 100k rows to update. For such batch operations Oracle uses a fast "array update" redo operation. You probably need to trace your operation to find the real bottlenecks and the load profile (I/O vs. CPU, access paths, triggers, indexes, etc.).
Basically, it's better to use the partitioning option properly so you can update/delete (or truncate) whole partitions at a time.
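For illustration, a rough sketch, assuming Txn_History were range-partitioned on a date column and the rows to remove all fell into one partition (the partition name is a placeholder):
ALTER TABLE txn_history TRUNCATE PARTITION p_2019_01 UPDATE INDEXES;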
There is also the new ALTER TABLE ... MOVE ... INCLUDING ROWS WHERE ... feature, starting from Oracle 12.2:
https://blogs.oracle.com/sql/how-to-delete-millions-of-rows-fast-with-sql
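A hedged sketch of that filtered move, where keep_flag is a hypothetical column marking the rows to retain:
ALTER TABLE txn_history
  MOVE INCLUDING ROWS WHERE keep_flag = 'Y';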
In SQL Server 2008 I have a few million rows of data which need to be deleted. They are scattered across a handful of tables. Deletion takes up to 20 seconds, which I think is way too slow! The data to be deleted is identified by a timestamp column. Here is what I have done so far in order to optimize:
Using isolation level READ UNCOMMITTED. I don't care about transactions; if we fail, the user will issue the delete operation again, and new data is guaranteed not to have the timestamp we are deleting.
Deleting leaf tables before parent tables.
The timestamp column is part of the clustered PK index; in fact it is in the first position of the PK/index.
Each table is emptied using a loop which deletes the top 200,000 entries, in order to reduce transaction-log overhead (a sketch follows this list).
Neither I/O nor CPU is maxed out on the server.
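A minimal sketch of that batched delete (table, column and cutoff value are placeholders):
DECLARE @cutoff datetime = '2012-01-01';

WHILE 1 = 1
BEGIN
    -- delete in chunks of 200,000 to keep the transaction log small
    DELETE TOP (200000) FROM dbo.LeafTable
    WHERE ts = @cutoff;

    IF @@ROWCOUNT = 0 BREAK;
END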
What have I overlooked?
I am also in doubt about the effect of moving the timestamp column to the first position in the PK. After doing so, must I reorganize the tables, or is SQL Server smart enough to do this itself? My understanding of a clustered index is that since it defines the physical layout of the rows, changing it forces the data to be reorganized. But we have had no complaints from the customer that changing the clustered index took a long time to perform.
Please make sure the tables you want to delete data from have a primary key explicitly declared.
Wrong: create table myTable (ID int)
Right: create table myTable (ID int PRIMARY KEY)
In addition to that, try adding OPTION (RECOMPILE), which can help performance:
DELETE FROM myTable
WHERE timestamp in (select timestamp from other_table)
OPTION (RECOMPILE)
I have the following UPDATE query:
UPDATE Indexer.Pages SET LastError=NULL where LastError is not null;
Right now, this query takes about 93 minutes to complete. I'd like to find ways to make this a bit faster.
The Indexer.Pages table has around 506,000 rows, and about 490,000 of them contain a value for LastError, so I doubt I can take advantage of any indexes here.
The table (when uncompressed) has about 46 gigs of data in it, however the majority of that data is in a text field called html. I believe simply loading and unloading that many pages is causing the slowdown. One idea would be to make a new table with just the Id and the html field, and keep Indexer.Pages as small as possible. However, testing this theory would be a decent amount of work since I actually don't have the hard disk space to create a copy of the table. I'd have to copy it over to another machine, drop the table, then copy the data back which would probably take all evening.
Ideas? I'm using Postgres 9.0.0.
UPDATE:
Here's the schema:
CREATE TABLE indexer.pages
(
id uuid NOT NULL,
url character varying(1024) NOT NULL,
firstcrawled timestamp with time zone NOT NULL,
lastcrawled timestamp with time zone NOT NULL,
recipeid uuid,
html text NOT NULL,
lasterror character varying(1024),
missingings smallint,
CONSTRAINT pages_pkey PRIMARY KEY (id ),
CONSTRAINT indexer_pages_uniqueurl UNIQUE (url )
);
I also have two indexes:
CREATE INDEX idx_indexer_pages_missingings
ON indexer.pages
USING btree
(missingings )
WHERE missingings > 0;
and
CREATE INDEX idx_indexer_pages_null
ON indexer.pages
USING btree
(recipeid )
WHERE NULL::boolean;
There are no triggers on this table, and there is one other table that has a FK constraint on Pages.PageId.
What @kgrittn posted as a comment is the best answer so far. I am merely filling in the details.
Before you do anything else, you should upgrade PostgreSQL to a current version, or at least to the latest security release of your major version. See the versioning guidelines of the project.
I also want to stress what Kevin mentioned about indexes involving the column LastError. Normally, HOT updates can recycle dead rows on a data page and make UPDATEs a lot faster - effectively removing (most of) the need for vacuuming. Related:
Redundant data in update statements
If your column is used in any index in any way, HOT UPDATEs are disabled, because it would break the index(es). If that is the case, you should be able to speed up the query a lot by deleting all of these indexes before you UPDATE and recreate them later.
In this context it would help to run multiple smaller UPDATEs:
If ...
... the updated column is not involved in any indexes (enabling HOT updates).
... the UPDATE is easily divided into multiple patches in multiple transactions.
... the rows in those patches are spread out over the table (physically, not logically).
... there are no other concurrent transactions keeping dead tuples from being reused.
Then you would not need to VACUUM in between multiple patches, because HOT updates can reuse dead tuples directly - only dead tuples from previous transactions, not from the same or concurrent ones. You may want to schedule a VACUUM at the end of the operation, or just let auto-vacuuming do its job.
The same could be done with any other index that is not needed for the UPDATE - and judging from your numbers the UPDATE is not going to use an index anyway. If you update large parts of your table, building new indexes from scratch is much faster than incrementally updating indexes with every changed row.
Also, your update is not likely to break any foreign key constraints. You could try to delete & recreate those, too. This does open a time slot where referential integrity would not be enforced. If the integrity is violated during the UPDATE, you get an error when trying to recreate the FK. If you do it all within one transaction, concurrent transactions never get to see the dropped FK, but you take a write lock on the table (same as with dropping / recreating indexes or triggers).
Lastly, disable & enable triggers that are not needed for the update.
Be sure to do all of this in one transaction. Maybe do it in a number of smaller patches, so it does not block concurrent operations for too long.
So:
BEGIN;
ALTER TABLE tbl DISABLE TRIGGER user; -- disable all self-made triggers
-- DROP indexes (& fk constraints ?)
-- UPDATE ...
-- RECREATE indexes (& fk constraints ?)
ALTER TABLE tbl ENABLE TRIGGER user;
COMMIT;
You cannot run VACUUM inside a transaction block. Per documentation:
VACUUM cannot be executed inside a transaction block.
You could split your operation into a few big chunks and run in between:
VACUUM ANALYZE tbl;
If you don't have to deal with concurrent transactions you could (even more effectively):
ALTER TABLE tbl DISABLE TRIGGER user; -- disable all self-made triggers
-- DROP indexes (& fk constraints ?)
-- Multiple UPDATEs with logical slices of the table
-- each slice in its own transaction.
-- VACUUM ANALYZE tbl; -- optionally in between, or autovacuum kicks in
-- RECREATE indexes (& fk constraints ?)
ALTER TABLE tbl ENABLE TRIGGER user;
UPDATE Indexer.Pages
SET LastError = NULL;
The WHERE clause is not needed since the NULL fields are already NULL, so it does no harm to set them to NULL again (I don't think this would affect performance significantly).
Given your row count of about 500K and your table size of 46 GB, I conclude that your average row size is roughly 90 KB. That is huge. Maybe you could move unused or sparse columns of your table into other tables?
Your theory is probably correct. Reading the full table (and then doing anything) is probably causing the slow-down.
Why don't you just create another table that has PageId and LastError? Initialize this with the data in the table you have now (which should take less than 93 minutes). Then, use the LastError from the new table.
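A minimal sketch of that side table, assuming the pages' id column is the key (the new table name is a placeholder):
-- copy the two columns into a narrow table
CREATE TABLE indexer.page_errors AS
SELECT id, lasterror
FROM indexer.pages;

ALTER TABLE indexer.page_errors ADD PRIMARY KEY (id);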
At your leisure, you can remove LastError from your existing table.
By the way, I don't normally recommend keeping two copies of a column in two separate tables. In this case, though, you sound like you are stuck and need a way to proceed.
I try to insert millions of records into a table that has more than 20 indexes.
In the last run it took more than 4 hours per 100.000 rows, and the query was cancelled after 3½ days...
Do you have any suggestions about how to speed this up?
(I suspect the many indexes are the cause. If you think so too, how can I automatically drop the indexes before the operation and then recreate the same indexes afterwards?)
Extra info:
The space used by the indexes is about 4 times the space used by the data alone
The inserts are wrapped in a transaction per 100.000 rows.
Update on status:
The accepted answer helped me make it much faster.
You can disable and re-enable the indexes. Note that disabling them can have unwanted side effects (such as ending up with duplicate primary key or unique index values), which will only be discovered when re-enabling the indexes.
--Disable Index
ALTER INDEX [IXYourIndex] ON YourTable DISABLE
GO
--Enable Index
ALTER INDEX [IXYourIndex] ON YourTable REBUILD
GO
This sounds like a data warehouse operation.
It would be normal to drop the indexes before the insert and rebuild them afterwards.
When you rebuild the indexes, build the clustered index first, and conversely drop it last. They should all have fillfactor 100%.
Code should be something like this
if object_id('IndexList') is not null drop table IndexList
select name into IndexList from dbo.sysindexes where id = object_id('Fact')
if exists (select name from IndexList where name = 'id1') drop index Fact.id1
if exists (select name from IndexList where name = 'id2') drop index Fact.id2
if exists (select name from IndexList where name = 'id3') drop index Fact.id3
.
.
BIG INSERT
RECREATE THE INDEXES
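A rough sketch of that rebuild step (index and column names are placeholders; clustered index first, fillfactor 100%):
CREATE CLUSTERED INDEX id0 ON dbo.Fact (FactDate) WITH (FILLFACTOR = 100);
CREATE INDEX id1 ON dbo.Fact (Col1) WITH (FILLFACTOR = 100);
CREATE INDEX id2 ON dbo.Fact (Col2) WITH (FILLFACTOR = 100);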
As noted in another answer, disabling the indexes will be a very good start.
4 hours per 100.000 rows
[...]
The inserts are wrapped in a transaction per 100.000 rows.
You should look at reducing that number: the server has to maintain a huge amount of state while in a transaction (so that it can be rolled back), which (along with the indexes) means adding data is very hard work.
Why not wrap each insert statement in its own transaction?
Also look at the nature of the SQL you are using: are you adding one row per statement (and network round trip), or many rows at once?
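For illustration, a multi-row insert (table and column names are placeholders) sends many rows per statement and round trip:
INSERT INTO dbo.Fact (ID, Payload)
VALUES (1, 'a'),
       (2, 'b'),
       (3, 'c');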
Disabling and then re-enabling indices is frequently suggested in those cases. I have my doubts about this approach though, because:
(1) The application's DB user needs schema alteration privileges, which it normally should not possess.
(2) The chosen insert approach and/or index schema might be less than optimal in the first place (e.g. the client issuing one insert statement at a time, causing thousands of server round trips; or a poor choice of clustered index, leading to constant index node splits); otherwise rebuilding complete index trees should not be faster than some decent batch inserting.
That's why my suggestions look a little bit different:
Increase ADO.NET BatchSize
Choose the target table's clustered index wisely, so that inserts won't lead to clustered index node splits. Usually an identity column is a good choice
Let the client insert into a temporary heap table first (heap tables don't have any clustered index); then issue one big "insert-into-select" statement to push all of that staging-table data into the actual target table (see the sketch after this list)
Apply SqlBulkCopy
Decrease transaction logging by choosing bulk-logged recovery model
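A minimal sketch of the staging-heap idea from the third point (all object names, and the TABLOCK hint, are illustrative assumptions):
-- staging table is a heap: no clustered index
CREATE TABLE dbo.Fact_Staging (
    ID int NOT NULL,
    Payload nvarchar(200) NOT NULL
);

-- bulk-load into the heap (e.g. via SqlBulkCopy), then push it all across in one statement
INSERT INTO dbo.Fact WITH (TABLOCK) (ID, Payload)
SELECT ID, Payload
FROM dbo.Fact_Staging;

DROP TABLE dbo.Fact_Staging;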
You might find more detailed information in this article.
Newish to Oracle programming (from Sybase and MS SQL Server). What is the "Oracle way" to avoid filling the trans log with large updates?
In my specific case, I'm doing an update of potentially a very large number of rows. Here's my approach:
UPDATE my_table
SET a_col = null
WHERE my_table_id IN
(SELECT my_table_id FROM my_table WHERE some_col < some_val and rownum < 1000)
...where I execute this inside a loop until the updated row count is zero.
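A rough PL/SQL sketch of that loop; note that for it to ever reach a row count of zero, the predicate also has to exclude rows that were already updated (assumed here via a_col IS NOT NULL):
BEGIN
  LOOP
    UPDATE my_table
       SET a_col = NULL
     WHERE my_table_id IN (SELECT my_table_id
                             FROM my_table
                            WHERE some_col < some_val
                              AND a_col IS NOT NULL
                              AND ROWNUM < 1000);
    EXIT WHEN SQL%ROWCOUNT = 0;
    COMMIT;
  END LOOP;
  COMMIT;
END;
/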
Is this the best approach?
The amount of redo and undo generated will not be reduced at all if you break up the UPDATE into multiple runs of, say, 1000 records. On top of that, the total execution time will most likely be higher than running a single large SQL statement.
There's no real way to address the undo/redo issue for UPDATEs. With INSERTs and CREATE TABLE ... AS SELECT you can use direct-path (APPEND) operations, but I guess that doesn't easily work for you.
It depends on the percentage of rows almost as much as on the number. It also depends on whether the update makes the rows longer than before, e.g. going from NULL to 200 bytes in every row; this could have an effect on your performance - chained rows.
Either way, you might want to try this.
Build a new table with the column corrected as part of the select instead of an update. You can build that new table via CTAS (Create Table as Select) which can avoid logging.
Drop the original table.
Rename the new table.
Reindex, repoint constraints, rebuild triggers, recompile packages, etc.
You can avoid a lot of logging this way; a rough sketch follows.
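A hedged sketch of that CTAS approach, reusing the column names from the question plus a hypothetical other_col standing in for the remaining columns:
CREATE TABLE my_table_new NOLOGGING AS
SELECT my_table_id,
       CASE WHEN some_col < some_val THEN NULL ELSE a_col END AS a_col,
       other_col
FROM my_table;

DROP TABLE my_table;
RENAME my_table_new TO my_table;
-- then recreate indexes, constraints, grants, triggers, etc.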
Any UPDATE is going to generate redo. Realistically, a single UPDATE that updates all the rows is going to generate the smallest total amount of redo and run for the shortest period of time.
Assuming you are updating the vast majority of the rows in the table, if there are any indexes that use A_COL, you may be better off disabling those indexes before the update and then doing a rebuild of those indexes with NOLOGGING specified after the massive UPDATE statement. In addition, if there are any triggers or foreign keys that would need to be fired/ validated as a result of the update, getting rid of those temporarily might be helpful.