Add column to huge table in Postgresql without downtime - sql

I have a very small table (about 1 million rows) and I'm going to drop constraints and add a new column. The query below hung for about 5 minutes, so I had to roll back.
BEGIN WORK;
LOCK TABLE usertable IN SHARE MODE;
ALTER TABLE usertable ALTER COLUMN email DROP NOT NULL;
COMMIT WORK;
Another approach, suggested in similar questions on the internet:
CREATE TABLE new_tbl
(
field1 int,
field2 int,
...
);
INSERT INTO new_tbl (field1, field2, ...)
(
SELECT ... FROM tbl -- use your new logic here to insert the updated data
);
CREATE INDEX ...; -- add your constraints and indexes to new_tbl
DROP TABLE tbl;
ALTER TABLE new_tbl RENAME TO tbl;
Create the new table
Insert records from the old table into the new one (takes less than a second)
Drop the old table - this query hangs for about 5 minutes. Had to roll back; this does not work for me.
Rename the newly created table to the old name
Dropping the old table simply hangs. However, when I try to drop the newly created table with 1 million rows, it works instantly. Why does dropping the old table take so much time? Here is the query I used to inspect the blocking locks:
SELECT blocked_locks.pid AS blocked_pid,
blocked_activity.usename AS blocked_user,
blocking_locks.pid AS blocking_pid,
blocking_activity.usename AS blocking_user,
blocked_activity.query AS blocked_statement,
blocking_activity.query AS blocking_statement
FROM pg_catalog.pg_locks blocked_locks
JOIN pg_catalog.pg_stat_activity blocked_activity ON blocked_activity.pid = blocked_locks.pid
JOIN pg_catalog.pg_locks blocking_locks
ON blocking_locks.locktype = blocked_locks.locktype
AND blocking_locks.DATABASE IS NOT DISTINCT FROM blocked_locks.DATABASE
AND blocking_locks.relation IS NOT DISTINCT FROM blocked_locks.relation
AND blocking_locks.page IS NOT DISTINCT FROM blocked_locks.page
AND blocking_locks.tuple IS NOT DISTINCT FROM blocked_locks.tuple
AND blocking_locks.virtualxid IS NOT DISTINCT FROM blocked_locks.virtualxid
AND blocking_locks.transactionid IS NOT DISTINCT FROM blocked_locks.transactionid
AND blocking_locks.classid IS NOT DISTINCT FROM blocked_locks.classid
AND blocking_locks.objid IS NOT DISTINCT FROM blocked_locks.objid
AND blocking_locks.objsubid IS NOT DISTINCT FROM blocked_locks.objsubid
AND blocking_locks.pid != blocked_locks.pid
JOIN pg_catalog.pg_stat_activity blocking_activity ON blocking_activity.pid = blocking_locks.pid
WHERE NOT blocked_locks.granted;
I can see a lot of concurrent writes/reads waiting on my operation. Since I took a lock on the table, I don't really think that's the reason I can't drop the old table.
I just ran VACUUM on the old table; it did not help.
Why can't I drop the old table, and why does it take so much time compared to dropping a recently created table?

I don't have a lot of experience with PostgreSQL, but my guess is that it keeps a bit to signify when a NULLable column is NULL (as opposed to empty) and when that column is marked as NOT NULL it no longer needs that bit. So, when you change that attribute on a column the system needs to go through the whole table and rearrange the data, moving lots of bits around so that the rows are all correctly structured.
This is much different from a DROP TABLE, which merely needs to mark the disk space as no longer in use and perhaps update a few metadata values.
In short, they're very different actions, so of course they take different amounts of time.

I was not able to drop/rename the original table because of foreign keys from other tables. Once I dropped those, the approach of renaming the table worked great.
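For anyone hitting the same wall, a minimal sketch of temporarily dropping and later restoring such a referencing foreign key in PostgreSQL; the orders table, the user_id column, and the constraint name here are hypothetical:
ALTER TABLE orders DROP CONSTRAINT orders_usertable_fk; -- hypothetical referencing table and constraint
-- ... swap/rename the tables as described above ...
ALTER TABLE orders ADD CONSTRAINT orders_usertable_fk
FOREIGN KEY (user_id) REFERENCES usertable (id); -- user_id/id are assumed column names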

Related

Improve insert performance when checking existing rows

I have this simple query that inserts rows from one table (sn_users_main) into another (sn_users_history).
To make sure sn_users_history only has unique rows it checks if the column query_time already exists and if it does then don't insert. query_time is kind of a session identifier that is the same for every row in sn_users_main.
This works fine, but since sn_users_history is reaching 50k rows, running this query takes more than 2 minutes, which is too much. Is there anything I can do to improve performance and get the same result?
INSERT INTO sn_users_history(query_time,user_id,sn_name,sn_email,sn_manager,sn_active,sn_updated_on,sn_last_Login_time,sn_is_vip,sn_created_on,sn_is_team_lead,sn_company,sn_department,sn_division,sn_role,sn_employee_profile,sn_location,sn_employee_type,sn_workstation) --- Columns of history table
SELECT snm.query_time,
snm.user_id,
snm.sn_name,
snm.sn_email,
snm.sn_manager,
snm.sn_active,
snm.sn_updated_on,
snm.sn_last_Login_time,
snm.sn_is_vip,
snm.sn_created_on,
snm.sn_is_team_lead,
snm.sn_company,
snm.sn_department,
snm.sn_division,
snm.sn_role,
snm.sn_employee_profile,
snm.sn_location,
snm.sn_employee_type,
snm.sn_workstation
---Columns of main table
FROM sn_users_main snm
WHERE NOT EXISTS(SELECT snh.query_time
FROM sn_users_history snh
WHERE snh.query_time = snm.query_time) -- Don't insert items into the history table if they already exist
I think you are missing an extra condition on user_id when you are inserting into the history table. You have to check the combination of user_id and query_time, as shown in the sketch below.
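A rough sketch of that revised duplicate check; this is only the tail of the INSERT ... SELECT above, with the extra predicate added, and it assumes (user_id, query_time) uniquely identifies a history row:
FROM sn_users_main snm
WHERE NOT EXISTS(SELECT 1
                 FROM sn_users_history snh
                 WHERE snh.query_time = snm.query_time
                   AND snh.user_id = snm.user_id) -- don't insert rows that already exist in the history table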
For your question, I think you are trying to reinvent the wheel. SQL Server already has temporal tables to support holding this kind of historical data. Read about SQL Server temporal tables.
If you still want to continue with this approach, I would suggest you do it in batches:
Create a configuration table to hold the last processed query_time:
CREATE TABLE HistoryConfig(HistoryConfigId int, HistoryTableName SYSNAME,
lastProcessedQueryTime DATETIME)
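You would also want to seed this table once so the first incremental run has a starting point; the floor date below is an assumption:
INSERT INTO HistoryConfig (HistoryConfigId, HistoryTableName, lastProcessedQueryTime)
VALUES (1, 'sn_users_history', '19000101') -- '19000101' is an arbitrary floor value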
Then you can do incremental historical inserts:
DECLARE @lastProcessedQueryTime DATETIME = (SELECT MAX(lastProcessedQueryTime) FROM HistoryConfig)
INSERT INTO sn_users_history(query_time,user_id,sn_name,sn_email,sn_manager,sn_active,sn_updated_on,sn_last_Login_time,sn_is_vip,sn_created_on,sn_is_team_lead,sn_company,sn_department,sn_division,sn_role,sn_employee_profile,sn_location,sn_employee_type,sn_workstation) --- Columns of history table
SELECT snm.query_time,
snm.user_id,
snm.sn_name,
snm.sn_email,
snm.sn_manager,
snm.sn_active,
snm.sn_updated_on,
snm.sn_last_Login_time,
snm.sn_is_vip,
snm.sn_created_on,
snm.sn_is_team_lead,
snm.sn_company,
snm.sn_department,
snm.sn_division,
snm.sn_role,
snm.sn_employee_profile,
snm.sn_location,
snm.sn_employee_type,
snm.sn_workstation
---Columns of main table
FROM sn_users_main snm
WHERE query_time > @lastProcessedQueryTime
Now you can update the configuration again:
UPDATE HistoryConfig
SET lastProcessedQueryTime = (SELECT MAX(query_time) FROM sn_users_history)
WHERE HistoryTableName = 'sn_users_history'
I would suggest you create a clustered index on (user_id, query_time) if possible (otherwise a non-clustered index), which will improve performance.
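That index might look roughly like this, assuming the history table does not already have a clustered index (otherwise make it non-clustered):
CREATE CLUSTERED INDEX IX_sn_users_history_user_query
ON sn_users_history (user_id, query_time)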
Other approaches you can think of:
Create a clustered index on (user_id, query_time) in the historical table, have (user_id, query_time) as the clustered index on the main table as well, and perform a MERGE operation, as sketched below.
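A minimal sketch of that MERGE alternative, assuming (user_id, query_time) identifies a history row; only a few of the columns are shown to keep it short:
MERGE sn_users_history AS tgt
USING sn_users_main AS src
ON tgt.user_id = src.user_id AND tgt.query_time = src.query_time
WHEN NOT MATCHED BY TARGET THEN
INSERT (query_time, user_id, sn_name, sn_email) -- list the remaining history columns the same way
VALUES (src.query_time, src.user_id, src.sn_name, src.sn_email);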

Fastest options for merging two tables in SQL Server

Consider two very large tables: Table A with 20 million rows, and Table B, which has a large overlap with Table A, with 10 million rows. Both have an identifier column and a bunch of other data. I need to move all items from Table B into Table A, updating them where they already exist.
Both tables have this structure:
- Identifier int
- Date DateTime
- Identifier A
- Identifier B
- General decimal data... (maybe 10 columns)
I can get the items in Table B that are new, and the items in Table B that need to be updated in Table A, very quickly, but I can't get an update or a delete/insert to work quickly. What options are available to merge the contents of Table B into Table A (i.e. updating existing records instead of inserting) in the shortest time?
I've tried pulling out existing records from Table B and running a large update on Table A to update just those rows (i.e. an update statement per row), and performance is pretty bad, even with a good index on it.
I've also tried doing a one-shot delete of the differing values out of Table A that exist in Table B, and performance of the delete is also poor, even with the indexes dropped.
I appreciate that this may be difficult to perform quickly, but I'm looking for other options that are available to achieve this.
Since you are dealing with two large tables, in-place updates/inserts/merges can be time-consuming operations. I would recommend using some bulk-logging technique just to load the desired content into a new table and then perform a table swap:
Example using SELECT INTO:
SELECT *
INTO NewTableA
FROM (
SELECT * FROM dbo.TableB b WHERE NOT EXISTS (SELECT * FROM dbo.TableA a WHERE a.id = b.id)
UNION ALL
SELECT * FROM dbo.TableA a
) d
exec sp_rename 'TableA', 'BackupTableA'
exec sp_rename 'NewTableA', 'TableA'
The SIMPLE or at least BULK_LOGGED recovery model is highly recommended for such an approach. Also, I assume it has to be done outside business hours, since plenty of missing objects have to be recreated on the new table: indexes, default constraints, the primary key, etc.
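As a rough illustration of that last point, the new table would need its constraints and indexes recreated after the rename; the constraint/index names and the columns below are assumptions based on the structure in the question:
ALTER TABLE dbo.TableA ADD CONSTRAINT PK_TableA PRIMARY KEY CLUSTERED (Identifier);
CREATE NONCLUSTERED INDEX IX_TableA_Date ON dbo.TableA ([Date]);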
A MERGE is probably your best bet if you want to do both inserts and updates.
MERGE TableA AS Tgt
USING (SELECT * FROM TableB) Src
ON (Tgt.Identifier = Src.Identifier)
WHEN MATCHED THEN
UPDATE SET Date = Src.Date, ...
WHEN NOT MATCHED THEN
INSERT (Identifier, Date, ...)
VALUES (Src.Identifier, Src.Date, ...);
Note that the merge statement must be terminated with a ;

Is high read but absolutely no writing on SQLite query normal?

I've been running a query to add a column to a table by matching IDs from another table. Both have about 600 million rows, so it's reasonable that this would take a while, but what is worrying me is that there are high read speeds (~500 MB/s) on the disk while SQLite is writing 0 B/s according to iotop. The file size of my .db file has not changed in hours, but adding a column to a 600-million-row table should change AT LEAST one byte, no?
Is this normal behavior for SQLite? The machine is pretty powerful: Ubuntu 16 on a quad-core i7 with 64 GB RAM and an NVMe SSD. The queries and table schema are below.
ALTER TABLE tableA ADD address TEXT;
UPDATE tableA SET address = (SELECT address FROM tx_out WHERE tx_out.ID = tableA.ID);
Table schema:
CREATE TABLE tableA (
ID TEXT,
column1 INT,
column2 TEXT
);
CREATE TABLE tx_out (
ID TEXT,
sequence INT,
address TEXT
);
Adding a column changes almost nothing on disk; a row with fewer values than the table has columns is assumed to have NULLs in the missing columns.
The UPDATE is extremely slow because the subquery has to scan through the entire tx_out table for every row in tableA.
You could speed it up greatly with an index on the tx_out.ID column.
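A sketch of that index (the index name is arbitrary):
CREATE INDEX idx_tx_out_id ON tx_out(ID);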
When the database has to rewrite all rows anyway, and you have the disk space, it might be a better idea to create a new table:
INSERT INTO NewTable(ID, column1, column2, address)
SELECT ID, column1, column2, address
FROM tableA
JOIN tx_out USING (ID); -- also needs an index to be fast
Too big for a comment
I've had this run for days with no change... I think it is prone to locking itself up in one manner or another; I killed it after its 3rd day of what appeared to be no changes. I've encountered very similar issues when attempting to add a new index as well, but that one successfully completed in 2 days before I hit my 3-day kill threshold ;) Possibly 3 days just wasn't enough?
My preference now is to create a second table that has the new column, load it with the data from the old table plus the new column, rename the old table to x_oldtablename, and rename the new table to the original table name. Run tests and drop x_oldtablename once you are confident the new table is working. A sketch of that approach follows.
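A minimal sketch of that swap, using the table and column names from the question above; the LEFT JOIN and the deferred DROP are assumptions:
CREATE TABLE tableA_new (
ID TEXT,
column1 INT,
column2 TEXT,
address TEXT
);
INSERT INTO tableA_new (ID, column1, column2, address)
SELECT a.ID, a.column1, a.column2, b.address
FROM tableA a
LEFT JOIN tx_out b ON b.ID = a.ID; -- the index on tx_out(ID) above keeps this fast
ALTER TABLE tableA RENAME TO x_tableA_old;
ALTER TABLE tableA_new RENAME TO tableA;
-- DROP TABLE x_tableA_old; -- only after you are confident the new table is correct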

updating a very large oracle table

I have a very large table, people, with 60M rows, indexed on id. I wish to populate a field newid for every record based on a lookup table id_conversion (1M rows), which contains id and newid and is indexed on id.
When I run
update people p set p.newid=(select l.newid from id_conversion l where l.id=p.id)
it runs for an hour or so and then I get an archiver error, ORA-00257.
Any suggestions for either running the update in sections or a better SQL command?
If your UPDATE statement hits every single row of the table, you are likely better off running a CREATE TABLE AS SELECT query, which avoids writing to Oracle's undo log; that logging is likely the issue you're running into, as the update is logging the impact across 60 million rows. You can then drop the old table and rename the new table to the old table's name.
Something like:
create table new_people as
select l.newid,
p.col2,
p.col3,
p.col4,
p.col5
from people p
join id_conversion l
on p.id = l.id;
drop table people;
-- rebuild any constraints and indexes
-- from old people table to new people table
alter table new_people rename to people;
For reference, read some of the tips here: http://www.dba-oracle.com/t_efficient_update_sql_dml_tips.htm
If you are basically creating a new table rather than just updating some of the rows of an existing one, this will likely prove the faster method.
I doubt you will be able to get this to run in seconds. Your query, as written, needs to update all 60 million rows.
My first advice is to add an index on id_conversion(id, newid) to make the subquery more efficient. If that doesn't help, then doing the update in batches might be the best way to go; a rough sketch of both follows.
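A sketch under stated assumptions: the batched loop assumes newid starts out NULL (so it can double as a "not yet processed" marker) and that id_conversion.newid is never NULL; the batch size of 100,000 is arbitrary.
create index ix_id_conversion_id_newid on id_conversion (id, newid);

begin
  loop
    update people p
       set p.newid = (select l.newid from id_conversion l where l.id = p.id)
     where p.newid is null
       and exists (select 1 from id_conversion l where l.id = p.id)
       and rownum <= 100000;  -- commit in chunks to keep undo/redo small
    exit when sql%rowcount = 0;
    commit;
  end loop;
end;
/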
I should add: because you are updating all the rows, it might be faster to take the following approach:
Copy the data into a new table with the new values.
Truncate the original table.
Insert the new data into the old table.
Inserts are faster than updates; a sketch follows.
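A rough sketch of that copy/truncate/insert-back sequence; the extra column names (col2, col3) and the LEFT JOIN are assumptions, and the APPEND hint just requests a direct-path insert:
create table people_stage as
select p.id, l.newid, p.col2, p.col3
from people p
left join id_conversion l on l.id = p.id;

truncate table people;

insert /*+ append */ into people (id, newid, col2, col3)
select id, newid, col2, col3 from people_stage;
commit;

drop table people_stage;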
In addition to the answers above, which will probably work better in this case, you should know about the MERGE statement:
http://docs.oracle.com/cd/B28359_01/server.111/b28286/statements_9016.htm
It is used for updating one table according to another table and is far faster than an update based on a select statement. A sketch for this case follows.
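For this particular case, an update-only MERGE would look roughly like the sketch below, assuming every id in people that needs a newid has a match in id_conversion:
merge into people p
using id_conversion l
on (p.id = l.id)
when matched then
  update set p.newid = l.newid;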

Caching joined tables in SQL Server

My website has a search procedure that runs very slowly. One thing that slows it down is the 8-table join it has to do (it also has a WHERE clause on ~6 search parameters). I've tried to make the query faster using various methods such as adding indexes, but these have not helped.
One idea I have is to cache the result of the 8-table join. I could create a temporary table of the join and make the search procedure query this table. I could update the table every 10 minutes or so.
Using pseudo code, I would change my procedure to look like this:
IF CachedTable is NULL or CachedTable is older than 10 minutes
DROP TABLE CachedTable
CREATE TABLE CachedTable as (select * from .....)
ENDIF
Select * from CachedTable Where Name = @SearchName
AND EmailAddress = @SearchEmailAddress
Is this a working strategy? I don't really know what syntax I would need to pull this off, or what I would need to lock to stop things from breaking if two queries happen at the same time.
Also, it might take quite a long time to make a new CachedTable each time, so I thought of trying something like double buffering in computer graphics:
IF CachedTable is NULL
CREATE TABLE CachedTable as (select * from ...)
ELSE IF CachedTable is older than 10 minutes
-- Somehow do this asynchronously, so that the next time a search comes
-- through the new table is used?
ASYNCHRONOUS (
CREATE TABLE BufferedCachedTable as (select * from ...)
DROP TABLE CachedTable
RENAME TABLE BufferedCachedTable as CachedTable
)
Select * from CachedTable Where Name = @SearchName
AND EmailAddress = @SearchEmailAddress
Does this make any sense? If so, how would I achieve it? If not, what should I do instead? I tried using indexed views, but this resulted in weird errors, so I want something like this that I can have more control over (Also, something I can potentially spin off onto a different server in the future.)
Also, what about indexes and so on for tables created like this?
This may be obvious from the question, but I don't know that much about SQL or the options I have available.
You can use multiple schemas (you should always specify the schema!) and play switch-a-roo, as I demonstrated in this question. Basically you need two additional schemas: one to hold a copy of the table temporarily, and one to hold the cached copy.
CREATE SCHEMA cache AUTHORIZATION dbo;
CREATE SCHEMA hold AUTHORIZATION dbo;
Now, create a mimic of the table in the cache schema:
SELECT * INTO cache.CachedTable FROM dbo.CachedTable WHERE 1 = 0;
-- then create any indexes etc.
Now when it comes time to refresh the data:
-- step 1:
TRUNCATE TABLE cache.CachedTable;
-- (if you need to maintain FKs you may need to delete)
INSERT INTO cache.CachedTable SELECT ...
-- step 2:
-- this transaction will be almost instantaneous,
-- since it is a metadata operation only:
BEGIN TRANSACTION;
ALTER SCHEMA hold TRANSFER dbo.CachedTable;
ALTER SCHEMA dbo TRANSFER cache.CachedTable;
ALTER SCHEMA cache TRANSFER hold.CachedTable;
COMMIT TRANSACTION;