I have two tables with exactly the same columns. The first one is used in production: a web application (Django) retrieves objects from it to show on a web page.
I use a Python script to add objects to the second one.
When the script is done I need to replace all rows in table-1 with the rows from table-2. Right now I am using something like this:
TRUNCATE table-1;
INSERT INTO table-1 (columns) SELECT columns FROM table-2;
TRUNCATE table-2;
VACUUM FULL;
The problem is that it takes too long, and after TRUNCATE table-1 the website is useless until the INSERT is done. What would be the best way to approach this?
You can create a view and code Django against that. When it's time to cut over between table1 and table2, just run CREATE OR REPLACE VIEW to point the view at the other table instead.
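A minimal sketch of the view cut-over, assuming Python's built-in sqlite3 module as a stand-in for the real database. The view name live_items and the helper point_view_at are hypothetical. SQLite has no CREATE OR REPLACE VIEW, so the sketch drops and recreates the view inside one transaction; in PostgreSQL a single CREATE OR REPLACE VIEW statement would do the same job.

```python
import sqlite3

# In-memory SQLite stands in for the production database (assumption).
# isolation_level=None gives autocommit, so transactions are managed explicitly.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.executescript("""
    CREATE TABLE table1 (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE table2 (id INTEGER PRIMARY KEY, name TEXT);
    INSERT INTO table1 VALUES (1, 'old');
    INSERT INTO table2 VALUES (1, 'new'), (2, 'newer');
    CREATE VIEW live_items AS SELECT * FROM table1;
""")


def point_view_at(conn, table):
    """Repoint the hypothetical view live_items at `table` in one transaction."""
    if table not in ("table1", "table2"):  # identifiers cannot be bound as parameters
        raise ValueError(f"unexpected table name: {table}")
    conn.execute("BEGIN")
    try:
        conn.execute("DROP VIEW IF EXISTS live_items")
        conn.execute(f"CREATE VIEW live_items AS SELECT * FROM {table}")
        conn.execute("COMMIT")
    except sqlite3.Error:
        conn.execute("ROLLBACK")
        raise


def live_names(conn):
    """What the web application would read through the view."""
    return [row[0] for row in conn.execute("SELECT name FROM live_items ORDER BY id")]


before = live_names(conn)
point_view_at(conn, "table2")  # cut over once the staging table is ready
after = live_names(conn)
```

The application only ever queries live_items, so the cut-over never leaves it looking at an empty table.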
You don't need the expensive VACUUM FULL at all. Right after your big INSERT there is nothing to clean up. No dead tuples, and all indexes are in pristine condition.
It can make sense to run ANALYZE though to update completely changed table statistics right away.
For small tables DELETE can be faster than TRUNCATE, but for medium to large tables TRUNCATE is typically faster.
And do it all in a single transaction. With wal_level set to minimal, an INSERT after TRUNCATE in the same transaction does not have to write to WAL and is much faster:
BEGIN;
TRUNCATE table1;
INSERT INTO table1 TABLE table2;
TRUNCATE table2; -- do you need this?
ANALYZE table1; -- might be useful
COMMIT;
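The main benefit of the single transaction is that the replacement is all-or-nothing. Here is a rough sketch of that behaviour, assuming Python's sqlite3 as a stand-in (SQLite has no TRUNCATE, so DELETE is used, and ANALYZE is left out). The NOT NULL constraint and the helper names are hypothetical and exist only to force a failure.

```python
import sqlite3

# In-memory SQLite stands in for the real database (assumption).
# isolation_level=None gives autocommit, so BEGIN/COMMIT below are explicit.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.executescript("""
    CREATE TABLE table1 (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE table2 (id INTEGER PRIMARY KEY, name TEXT);
    INSERT INTO table1 VALUES (1, 'old');
    INSERT INTO table2 VALUES (1, 'new'), (2, NULL);
""")


def replace_rows(conn):
    """Replace table1's rows with table2's rows; return True on success."""
    conn.execute("BEGIN")
    try:
        conn.execute("DELETE FROM table1")  # SQLite has no TRUNCATE
        conn.execute("INSERT INTO table1 SELECT * FROM table2")
        conn.execute("DELETE FROM table2")
        conn.execute("COMMIT")
        return True
    except sqlite3.Error:
        conn.execute("ROLLBACK")  # both tables return to their previous contents
        return False


def table1_rows(conn):
    return conn.execute("SELECT id, name FROM table1 ORDER BY id").fetchall()


first_try = replace_rows(conn)  # fails: the NULL name violates NOT NULL on table1
rows_after_failure = table1_rows(conn)

conn.execute("UPDATE table2 SET name = 'newer' WHERE id = 2")
second_try = replace_rows(conn)
rows_after_success = table1_rows(conn)
staging_left = conn.execute("SELECT COUNT(*) FROM table2").fetchone()[0]
```

If anything fails halfway, the rollback leaves the old rows in place instead of an empty production table.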
Details: see "What causes large INSERT to slow down and disk usage to explode?"
If you have any indexes on table1 it pays to drop them first and recreate them after the INSERT for big tables.
If you don't have depending objects, you might also just drop table1 and rename table2, which would be much faster.
See "How do I replace a table in postgres?"
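The drop-and-rename variant might look like the following sketch, again assuming Python's sqlite3 as a stand-in and a table with no dependent objects. The helper name swap_in_staging is hypothetical, and recreating an empty table2 afterwards is an assumption about how the loading script wants to find its staging table.

```python
import sqlite3

# In-memory SQLite stands in for the real database (assumption).
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.executescript("""
    CREATE TABLE table1 (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE table2 (id INTEGER PRIMARY KEY, name TEXT);
    INSERT INTO table1 VALUES (1, 'old');
    INSERT INTO table2 VALUES (1, 'new'), (2, 'newer');
""")


def swap_in_staging(conn):
    """Drop table1, promote table2 under its name, recreate an empty table2."""
    conn.execute("BEGIN")
    try:
        conn.execute("DROP TABLE table1")
        conn.execute("ALTER TABLE table2 RENAME TO table1")
        # Fresh staging table for the next load (assumed to be wanted).
        conn.execute("CREATE TABLE table2 (id INTEGER PRIMARY KEY, name TEXT)")
        conn.execute("COMMIT")
    except sqlite3.Error:
        conn.execute("ROLLBACK")
        raise


swap_in_staging(conn)
live = conn.execute("SELECT name FROM table1 ORDER BY id").fetchall()
staged = conn.execute("SELECT COUNT(*) FROM table2").fetchone()[0]
```

No rows are copied at all here, which is why this route can be much faster for big tables.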
Either method requires an exclusive lock on the table. But you have been using VACUUM FULL previously which takes an exclusive lock as well:
VACUUM FULL rewrites the entire contents of the table into a new disk
file with no extra space, allowing unused space to be returned to the
operating system. This form is much slower and requires an exclusive
lock on each table while it is being processed.
So it's safe to assume an exclusive lock is ok for you.
In Oracle PL/SQL, I join a few tables and insert the result into another table. The result could be thousands, lakhs, or even millions of rows. Can I insert it as
insert into tableA
select * from tableB;
Is there any chance of failure because of the number of rows?
Or is there a better way to insert when there are many records?
Thanks in Advance
Well, everything is finite inside the machine, so if that select returns too many rows, it will certainly fail (although it would take a great many rows; the limit depends on your storage and memory size, your OS, and maybe other things).
If you think your query can surpass the limit, then do the insertion in batches, and commit after each batch. Of course you need to be aware you must do something if at 50% of the inserts you decide you need to cancel the process (as a rollback will not be useful here).
My recommended steps are different because performance typically increases when you load more data in one SQL statement using SQL or PL/SQL:
I would recommend checking the size of your rollback segment (RBS) and possibly bringing a larger dedicated one online for such a transaction.
For inserts, you can say something like 'rollback consumed' = 'amount of data inserted'. You know the typical row width from the database statistics (see user_tables after analyze table tableB compute statistics for table for all columns for all indexes).
Determine how many rows you can insert per iteration.
Insert this amount of data in big insert and commit.
Repeat.
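The steps above can be sketched roughly as follows, assuming Python's sqlite3 as a stand-in for Oracle. The helper names, the payload column, the byte figures and the idea of slicing by a positive, increasing id column are all assumptions made for illustration; the real row width would come from the database statistics.

```python
import sqlite3


def rows_per_batch(rollback_budget_bytes, avg_row_bytes):
    """Rule of thumb from the text: rollback consumed ~ amount of data inserted."""
    return max(1, rollback_budget_bytes // avg_row_bytes)


def copy_in_batches(conn, batch_rows):
    """Copy tableB into tableA in id order, committing after every batch.

    Assumes ids are positive. Returns the number of batches committed.
    """
    last_id, batches = 0, 0
    while True:
        # Highest id of the next slice of at most batch_rows rows.
        upper = conn.execute(
            "SELECT MAX(id) FROM "
            "(SELECT id FROM tableB WHERE id > ? ORDER BY id LIMIT ?)",
            (last_id, batch_rows),
        ).fetchone()[0]
        if upper is None:  # nothing left to copy
            return batches
        conn.execute("BEGIN")
        conn.execute(
            "INSERT INTO tableA (id, payload) "
            "SELECT id, payload FROM tableB WHERE id > ? AND id <= ?",
            (last_id, upper),
        )
        conn.execute("COMMIT")
        last_id = upper
        batches += 1


conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE tableA (id INTEGER PRIMARY KEY, payload TEXT)")
conn.execute("CREATE TABLE tableB (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany(
    "INSERT INTO tableB VALUES (?, ?)", [(i, f"row {i}") for i in range(1, 11)]
)

per_batch = rows_per_batch(rollback_budget_bytes=400, avg_row_bytes=100)
batches = copy_in_batches(conn, per_batch)
copied = conn.execute("SELECT COUNT(*) FROM tableA").fetchone()[0]
```

Because every batch is committed, a cancelled run leaves the already copied rows in place, so the process has to be restartable from the last copied id.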
Locking normally is not an issue with insert, since what does not yet exist can't be locked :-)
With partitioned tables, you might want to consider scenarios that let the (sub)partitions share the work. When loading from text files with SQL*Loader, you might use a different approach too, such as direct path, which adds preformatted data blocks to the database directly instead of going through the SQL engine.
To limit the number of rows you can use ROWNUM, which is a pseudocolumn.
For example, to fill a table with 10,000 rows from another table that has 50,000 rows you can use:
insert into new_table_name select * from old_table_name where rownum <= 10000;
I have tables with more than 70 million records each. I just found that the developers drop the indexes before a bulk insert and create them again after the bulk insert is over. Execution time for the stored procedure is nearly 30 minutes (drop index, bulk insert, then recreate index from scratch).
Advice: Is it good practice to drop indexes from a table that has 70+ million records and grows by 3-4 million every day?
Would it help performance not to drop the indexes before the bulk insert?
What is the best practice to follow for a bulk insert into a big table?
Thanks and Regards
Like everything in SQL Server, "It Depends"
There is overhead in maintaining indexes during the insert and there is overhead in rebuilding the indexes after the insert. The only way to definitively determine which method incurs less overhead is to try them both and benchmark them.
If I were a betting man I would put my wager that leaving the indexes in place would edge out the full rebuild but I don't have the full picture to make an educated guess. Again, the only way to know for sure is to try both options.
One key optimization is to make sure your bulk insert is in clustered key order.
If I'm reading your question correctly, that table is pretty much off limits (locked) for the duration of the load and this is a problem.
If your primary goal is to increase availability/decrease blocking, try taking the A/B table approach.
The A/B approach breaks down as follows:
Given a table called "MyTable" you would actually have two physical tables (MyTable_A and MyTable_B) and one view (MyTable).
If MyTable_A contains the current "active" dataset, your view (MyTable) is selecting all columns from MyTable_A. Meanwhile you can have carte blanche on MyTable_B (which contains a copy of MyTable_A's data and the new data you're writing.) Once MyTable_B is loaded, indexed and ready to go, update your "MyTable" view to point to MyTable_B and truncate MyTable_A.
This approach assumes that you're willing to increase I/O and storage costs (dramatically, in your case) to maintain availability. It also assumes that your big table is relatively static. If you do follow this approach, I would recommend a second view, something like MyTable_old, which points to the non-live table (i.e. if MyTable_A is the current presentation table and is referenced by the MyTable view, MyTable_old will reference MyTable_B). You would update the MyTable_old view at the same time you update the MyTable view.
Depending on the nature of the data you're inserting (and your SQL Server version/edition), you may also be able to take advantage of partitioning (MSDN blog on this topic.)
I have a very large table (more than 300 million records) that needs to be cleaned up. Roughly 80% of it needs to be deleted. The database software is MS SQL 2005. There are several indexes and statistics on the table but no external relationships.
The best solution I came up with, so far, is to put the database into "simple" recovery mode, copy all the records I want to keep to a temporary table, truncate the original table, set identity insert to on and copy back the data from the temp table.
It works but it's still taking several hours to complete. Is there a faster way to do this ?
As per the comments my suggestion would be to simply dispense with the copy back step and promote the table containing records to be kept to become the new main table by renaming it.
It should be quite straightforward to script out the index/statistics creation to be applied to the new table before it gets swapped in.
The clustered index should be created before the non clustered indexes.
A couple of points I'm not sure about though.
Whether it would be quicker to insert into a heap then create the clustered index afterwards. (I guess no if the insert can be done in clustered index order)
Whether the original table should be truncated before being dropped (I guess yes)
@uriDium -- Chunking using batches of 50,000 will escalate to a table lock, unless you have disabled lock escalation via alter table (sql2k8) or other various locking tricks.
I am not sure what the structure of your data is. When does a row become eligible for deletion? If it is purely ID-based or date-based, then you can create a new table for each day, insert your new data into the new tables and, when it comes to cleaning, simply drop the tables you no longer need. Then for any selects construct a view over all the tables. Just an idea.
EDIT: (In response to comments)
If you are maintaining a view over all the tables then no it won't be complicated at all. The complex part is coding the dropping and recreating of the view.
I am assuming that you don't want your data to be locked down too much during deletes. Why not chunk the delete operations? Create a stored procedure that deletes the data in chunks, 50,000 rows at a time. This should make sure that SQL Server keeps row locks instead of a table lock. Use
WAITFOR DELAY 'x'
in your WHILE loop so that you give other queries a bit of breathing room. Your problem is the age-old computer science trade-off: space vs. time.
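A rough sketch of such a chunked delete loop, assuming Python's sqlite3 as a stand-in for SQL Server. The events table, its created column and the helper name are hypothetical, and time.sleep plays the role of WAITFOR DELAY. Locking behaviour is not modelled here; the sketch only shows the loop structure.

```python
import sqlite3
import time


def delete_in_chunks(conn, cutoff, chunk_rows=50_000, pause_seconds=0.0):
    """Delete rows with created < cutoff in chunks, one transaction per chunk.

    Returns the total number of rows deleted.
    """
    total = 0
    while True:
        conn.execute("BEGIN")
        cur = conn.execute(
            "DELETE FROM events WHERE rowid IN "
            "(SELECT rowid FROM events WHERE created < ? LIMIT ?)",
            (cutoff, chunk_rows),
        )
        deleted = cur.rowcount
        conn.execute("COMMIT")
        if deleted == 0:  # nothing left that matches
            return total
        total += deleted
        time.sleep(pause_seconds)  # breathing room for other queries


conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, created INTEGER)")
conn.executemany(
    "INSERT INTO events (created) VALUES (?)", [(i,) for i in range(25)]
)

removed = delete_in_chunks(conn, cutoff=20, chunk_rows=8)
remaining = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
```

The chunk size and the pause are the two knobs: smaller chunks and longer pauses are gentler on other queries but stretch the total run time.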
We've run across a slightly odd situation. Basically there are two tables in one of our databases that are fed tons and tons of logging info we don't need or care about. Partially because of this we're running out of disk space.
I'm trying to clean out the tables, but it's taking forever (there are still 57,000,000+ records after letting this run through the weekend... and that's just the first table!)
Just using delete table is taking forever and eats up drive space (I believe because of the transaction log). Right now I'm using a while loop to delete records X at a time, while playing around with X to determine what's actually fastest. For instance X=1000 takes 3 seconds, while X=100,000 takes 26 seconds... which, doing the math, is roughly 11 times faster per row.
But the question is whether or not there is a better way?
(Once this is done, I'm going to run a SQL Agent job to clean the table out once a day... but I need it cleared out first.)
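The batch timings quoted in the question can be turned into throughput figures with a little arithmetic. This is only a back-of-the-envelope sketch: it assumes the two measurements are representative and that the time per batch stays constant over the whole run, which may well not hold in practice. The function names are made up for illustration.

```python
def rows_per_second(batch_rows, batch_seconds):
    """Throughput of one delete batch."""
    return batch_rows / batch_seconds


def hours_to_delete(total_rows, batch_rows, batch_seconds):
    """Projected run time if every batch took the same time (assumption)."""
    return total_rows / rows_per_second(batch_rows, batch_seconds) / 3600


TOTAL_ROWS = 57_000_000  # rows still left in the first table

small_rate = rows_per_second(1_000, 3)     # about 333 rows/s
large_rate = rows_per_second(100_000, 26)  # about 3846 rows/s
speedup = large_rate / small_rate          # about 11.5x

hours_small = hours_to_delete(TOTAL_ROWS, 1_000, 3)     # 47.5 hours
hours_large = hours_to_delete(TOTAL_ROWS, 100_000, 26)  # about 4.1 hours
```

On these numbers the larger batch is far ahead, though even four hours of row-by-row deletes is slow compared with not logging individual deletions at all.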
TRUNCATE the table or disable indexes before deleting
TRUNCATE TABLE [tablename]
Truncating will remove all records from the table without logging each deletion separately.
To add to the other responses, if you want to hold onto the past day's data (or past month or year or whatever), then save that off, do the TRUNCATE TABLE, then insert it back into the original table:
SELECT
*
INTO
tmp_My_Table
FROM
My_Table
WHERE
<Some_Criteria>
TRUNCATE TABLE My_Table
INSERT INTO My_Table SELECT * FROM tmp_My_Table
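The same save, truncate and re-insert sequence can be sketched as below, assuming Python's sqlite3 as a stand-in for SQL Server. SQLite has neither SELECT ... INTO nor TRUNCATE, so a temp table and DELETE are used instead. The app_log table, its day column and the helper name are hypothetical, and wrapping everything in one transaction is a choice made for this sketch, not part of the answer.

```python
import sqlite3

# In-memory SQLite stands in for the real database (assumption).
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE app_log (id INTEGER PRIMARY KEY, day INTEGER, message TEXT)")
conn.executemany(
    "INSERT INTO app_log (day, message) VALUES (?, ?)",
    [(d, f"event on day {d}") for d in range(1, 11)],
)


def purge_keeping(conn, min_day):
    """Keep only rows with day >= min_day; everything else is removed."""
    conn.execute("BEGIN")
    try:
        # Empty temp table with the same columns, then save the rows to keep.
        conn.execute("CREATE TEMP TABLE tmp_app_log AS SELECT * FROM app_log WHERE 0")
        conn.execute(
            "INSERT INTO tmp_app_log SELECT * FROM app_log WHERE day >= ?",
            (min_day,),
        )
        conn.execute("DELETE FROM app_log")  # TRUNCATE TABLE in SQL Server
        conn.execute("INSERT INTO app_log SELECT * FROM tmp_app_log")
        conn.execute("DROP TABLE tmp_app_log")
        conn.execute("COMMIT")
    except sqlite3.Error:
        conn.execute("ROLLBACK")
        raise


purge_keeping(conn, min_day=9)
kept_days = [row[0] for row in conn.execute("SELECT day FROM app_log ORDER BY day")]
```

This pays off when the rows to keep are a small fraction of the table, since only those rows are ever copied.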
The next thing to do is ask yourself why you're inserting all of this information into a log if no one cares about it. If you really don't need it at all then turn off the logging at the source.
1) Truncate table
2) script out the table, drop and recreate the table
TRUNCATE TABLE [tablename]
will delete all the records without logging each row deletion individually.
Depending on how much you want to keep, you could just copy the records you want to a temp table, truncate the log table, and copy the temp table records back to the log table.
If you can work out the optimum x, this will loop around the delete at the quickest rate. Setting the rowcount limits the number of records that get deleted in each step of the loop. If the log file is getting too big, stick a counter in the loop and truncate the log every million rows or so.
set rowcount x  -- x = number of rows to delete per pass
while 1=1
begin
    delete from table
    if @@rowcount = 0 break
end
set rowcount 0  -- restore the default (no limit)
Changing the recovery model of the database to simple or bulk-logged can reduce some of the delete overhead.
See also the MSDN article Delete_a_Huge_Amount_of_Data_from, Information on Recovery Models, and View or Change the Recovery Model of a Database.
I have a very large database table in PostgreSQL and a column like "copied". Every new row starts uncopied and will later be replicated elsewhere by a background program. There is a partial index on that table, "btree(ID) WHERE replicated=0". The background program does a select for at most 2000 entries (LIMIT 2000), works on them and then commits the changes in one transaction using 2000 prepared SQL commands.
Now the problem is that I want to give the user an option to reset this replicated value, i.e. make it all zero again.
An update table set replicated=0;
is not possible:
It takes very much time
It doubles the size of the table because of MVCC.
It is done in one transaction: it either fails or goes through.
I actually don't need transaction features for this case: if the system goes down, it is fine if only part of the table has been processed.
Several other problems:
Doing an
update set replicated=0 where id >10000 and id<20000
is also bad: It does a sequential scan all over the whole table which is too slow.
If it weren't doing that, it would still be slow because it would be too many seeks.
What I really need is a way of going through all rows, changing them and not being bound to a giant transaction.
Strangely, an
UPDATE table
SET replicated=0
WHERE ID in (SELECT id from table WHERE replicated <> 0 LIMIT 10000)
is also slow, although it should be a good thing: Go through the table in DISK-order...
(Note that in that case there was also an index that covered this)
(An UPDATE with LIMIT, as in MySQL, is unavailable in PostgreSQL.)
BTW: The real problem is more complicated and we're talking about an embedded system here that is already deployed, so remote schema changes are difficult, but possible
It's PostgreSQL 7.4, unfortunately.
The number of rows I'm talking about is e.g. 90000000. The size of the database can be several dozen gigabytes.
The database itself only contains 5 tables, one is a very large one.
But that is not bad design, because these embedded boxes only operate with one kind of entity, it's not an ERP-system or something like that!
Any ideas?
How about adding a new table to store this replicated value (and a primary key to link each record to the main table). Then you simply add a record for every replicated item, and delete records to remove the replicated flag. (Or maybe the other way around - a record for every non-replicated record, depending on which is the common case).
That would also simplify the case when you want to set them all back to 0, as you can just truncate the table (which zeroes the table size on disk, you don't even have to vacuum to free up the space)
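A minimal sketch of the side-table idea, assuming Python's sqlite3 as a stand-in for PostgreSQL. The table names items and replicated_items and the helper functions are hypothetical. The sketch stores one row per replicated item, so "not yet replicated" means "no row in the side table", and the global reset uses DELETE where PostgreSQL would use TRUNCATE.

```python
import sqlite3

# In-memory SQLite stands in for the real database (assumption).
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.executescript("""
    CREATE TABLE items (id INTEGER PRIMARY KEY, payload TEXT);
    -- One row per item that has already been replicated.
    CREATE TABLE replicated_items (
        item_id INTEGER PRIMARY KEY REFERENCES items(id)
    );
""")
conn.executemany(
    "INSERT INTO items VALUES (?, ?)", [(i, f"item {i}") for i in range(1, 6)]
)


def mark_replicated(conn, ids):
    """Record that the given items have been replicated."""
    conn.executemany(
        "INSERT OR IGNORE INTO replicated_items VALUES (?)", [(i,) for i in ids]
    )


def pending(conn, limit):
    """Ids of items with no flag row yet, oldest first."""
    rows = conn.execute(
        "SELECT i.id FROM items i "
        "LEFT JOIN replicated_items r ON r.item_id = i.id "
        "WHERE r.item_id IS NULL ORDER BY i.id LIMIT ?",
        (limit,),
    )
    return [row[0] for row in rows]


def reset_all(conn):
    """Mark everything as not replicated; the big table is never touched."""
    conn.execute("DELETE FROM replicated_items")  # TRUNCATE in PostgreSQL


mark_replicated(conn, [1, 2, 3])
pending_before_reset = pending(conn, 2000)
reset_all(conn)
pending_after_reset = pending(conn, 2000)
```

The reset then costs as much as emptying the small side table, regardless of how large the main table is.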
If you are trying to reset the whole table, not just a few rows, it is usually faster (on extremely large datasets, not on regular tables) to simply CREATE TABLE bar AS SELECT every column except copied, with a constant 0 in its place, FROM foo, and then swap the tables and drop the old one. Obviously you would need to ensure nothing gets inserted into the original table while you are doing that. You'll need to recreate that index, too.
Edit: A simple improvement in order to avoid locking the table while you copy 14 Gigabytes:
lock ;
create a new table, bar;
swap tables so that all writes go to bar;
unlock;
create table baz as select from foo;
drop foo;
create the index on baz;
lock;
insert into baz from bar;
swap tables;
unlock;
drop bar;
(let writes happen while you are doing the copy, and insert them post-factum).
While you cannot likely fix the problem of space usage (it is temporary, just until a vacuum) you can probably really speed up the process in terms of clock time. The fact that PostgreSQL uses MVCC means that you should be able to do this without any issues related to newly inserted rows. The create table as select will get around some of the performance issues, but will not allow for continued use of the table, and takes just as much space. Just ditch the index, and rebuild it, then do a vacuum.
drop index replication_flag;
update big_table set replicated=0;
create index replication_flag on big_table using btree (ID) WHERE replicated=0;
vacuum full analyze big_table;
This is pseudocode. You'll need a 400MB (for ints) or 800MB (for bigints) temporary file (you can compress it with zlib if that is a problem). It will need about 100 scans of the table for the vacuums. But it will not bloat the table by more than 1% (at most 1000000 dead rows at any time). You can also trade fewer scans for more table bloat.
// write all ids to temporary file in disk order
// no where clause will ensure disk order
$file = tmpfile();
for $id, $replicated in query("select id, replicated from table") {
if ( $replicated<>0 ) {
write($file,&$id,sizeof($id));
}
}
// prepare an update query
query("prepare set_replicated_0(bigint) as
update table set replicated=0 where id=$1");
// reread this file, launch prepared query and every 1000000 updates commit
// and vacuum a table
rewind($file);
$counter = 0;
query("start transaction");
while read($file,&$id,sizeof($id)) {
query("execute set_replicated_0($id)");
$counter++;
if ( $counter % 1000000 == 0 ) {
query("commit");
query("vacuum table");
query("start transaction");
}
}
query("commit");
query("vacuum table");
close($file);
I guess what you need to do is
a. copy the 2000 records PK value into a temporary table with the same standard limit, etc.
b. select the same 2000 records and perform the necessary operations in the cursor as it is.
c. If successful, run a single update query against the records in the temp table. Clear the temp table and run step a again.
d. If unsuccessful, clear the temp table without running the update query.
Simple, efficient and reliable.
Regards,
KT
I think it's better to upgrade your Postgres to version 8.x; the old version is probably the cause. Also try the query below. I hope this helps.
UPDATE table1 SET name = table2.value
FROM table2
WHERE table1.id = table2.id;