Add new column without table lock? - sql

My project has a table with 23 million records, and around 6 fields of that table are indexed.
Earlier I tried to add a delta column for Thinking Sphinx search, but it ended up holding a lock on the whole database for an hour. Afterwards, once the column was added, rebuilding the indexes runs this query, which holds the database lock for around 4 hours:
"update user_messages set delta = false where delta = true"
To get the server back up I created a new database from a db dump and promoted it as the database so the server could go live again.
What I am looking for now is a way to add the delta column to my table without a table lock. Is that possible? And once the delta column is added, why is the above query executed when I run the index rebuild command, and why does it block the server for so long?
PS: I am on Heroku, using Postgres with the Ika db plan.

Postgres 11 or later
Since Postgres 11, only volatile default values still require a table rewrite. The manual:
Adding a column with a volatile DEFAULT or changing the type of an existing column will require the entire table and its indexes to be rewritten.
The key word is volatile, and false is a constant, not volatile. So just add the column with DEFAULT false. Super fast, job done:
ALTER TABLE tbl ADD COLUMN delta boolean DEFAULT false;
Postgres 10 or older, or for volatile DEFAULT
Adding a new column without a DEFAULT, or with DEFAULT NULL, will not normally force a table rewrite and is very cheap. Only writing actual values to it creates new row versions. But, quoting the manual:
Adding a column with a DEFAULT clause or changing the type of an existing column will require the entire table and its indexes to be rewritten.
UPDATE in PostgreSQL writes a new version of the row. Your question does not provide all the information, but that probably means writing millions of new row versions.
If you do the UPDATE in place, a major portion of the table is affected, and you are free to lock the table exclusively, remove all indexes before the mass UPDATE and recreate them afterwards. It's faster this way. There is related advice in the manual.
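A minimal sketch of that pattern, assuming a single hypothetical index on user_id (repeat for every index on the table):
DROP INDEX user_messages_user_id_idx;   -- hypothetical index name

UPDATE user_messages SET delta = false; -- one mass UPDATE instead of many small ones

CREATE INDEX user_messages_user_id_idx ON user_messages (user_id);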
If your data model and available disk space allow for it, CREATE a new table in the background and then, in one transaction: DROP the old table, and RENAME the new one. Related:
Best way to populate a new column in a large table?
While creating the new table in the background: Apply all changes to the same row at once. Repeated updates create new row versions and leave dead tuples behind.
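For example, the swap could look like this, assuming the table is called tbl and nothing else references it through foreign keys or views:
CREATE TABLE tbl_new AS
SELECT *, false AS delta
FROM tbl;

-- create indexes / constraints on tbl_new here, then swap in one transaction:
BEGIN;
DROP TABLE tbl;
ALTER TABLE tbl_new RENAME TO tbl;
COMMIT;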
If you cannot remove the original table because of constraints, another fast way is to build a temporary table, TRUNCATE the original one and mass INSERT the new rows - sorted, if that helps performance. All in one transaction. Something like this:
BEGIN;
SET temp_buffers = '1000MB'; -- or whatever you can spare temporarily
-- write-lock table here to prevent concurrent writes - if needed
LOCK TABLE tbl IN SHARE MODE;
CREATE TEMP TABLE tmp AS
SELECT *, false AS delta
FROM tbl; -- copy existing rows plus new value
-- ORDER BY ??? -- opportune moment to cluster rows
-- DROP all indexes here
TRUNCATE tbl; -- empty table - truncate is super fast
ALTER TABLE tbl ADD COLUMN delta boolean DEFAULT FALSE; -- NOT NULL?
INSERT INTO tbl
TABLE tmp; -- insert back surviving rows.
-- recreate all indexes here
COMMIT;

You could add another table holding just the one delta column; adding it won't take any such long locks. Of course that table also needs a second column, a foreign key referencing the first table.
For the indexes, you could use CREATE INDEX CONCURRENTLY, which doesn't take such heavy locks on the table: http://www.postgresql.org/docs/9.1/static/sql-createindex.html
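A rough sketch of that idea; the table and column names here are made up and assume user_messages has a primary key column id:
CREATE TABLE user_message_deltas (
    user_message_id integer PRIMARY KEY REFERENCES user_messages (id),
    delta           boolean NOT NULL DEFAULT false
);

CREATE INDEX CONCURRENTLY user_message_deltas_delta_idx
    ON user_message_deltas (delta);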

Related

Most efficient way of updating ~100 million rows in Postgresql database?

I have a database with a single table. This table will need to be updated every few weeks. We need to ingest third-party data into it and it will contain 100-120 million rows. So the flow is basically:
Get the raw data from the source
Detect inserts, updates & deletes
Make updates and ingest into the database
What's the best way of detecting and performing updates?
Some options are:
Compare incoming data with current database one by one and make single updates. This seems very slow and not feasible.
Ingest incoming data into a new table, then switch out old table with the new table
Bulk updates in-place in the current table. Not sure how to do this.
What do you suggest is the best option, or if there's a different option out there?
Postgres has a helpful guide for improving the performance of bulk loads. From your description, you need to perform a bulk INSERT in addition to a bulk UPDATE and DELETE. Below is a rough step-by-step guide for making this efficient:
Configure Global Database Configuration Variables Before the Operation
ALTER SYSTEM SET max_wal_size = <size>;
You can additionally reduce WAL logging to the minimum (note that this disables WAL archiving and replication):
ALTER SYSTEM SET wal_level = 'minimal';
ALTER SYSTEM SET archive_mode = 'off';
ALTER SYSTEM SET max_wal_senders = 0;
Note that these changes will require a database restart to take effect.
Start a Transaction
You want all work to be done in a single transaction in case anything goes wrong. Running COPY in parallel across multiple connections does not usually increase performance as disk is usually the limiting factor.
Optimize Other Configuration Variables at the Transaction level
SET LOCAL maintenance_work_mem = <size>
...
You may need to set other configuration parameters if you are doing any additional special processing of the data inside Postgres (work_mem is usually the most important one there, especially if you are using the PostGIS extension). See this guide for the most important configuration variables for performance.
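For example (the value here is only an illustration; size it to your workload):
SET LOCAL work_mem = '256MB';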
CREATE a TEMPORARY table with no constraints.
CREATE TEMPORARY TABLE changes(
    id bigint,
    data text
) ON COMMIT DROP; -- ensures this table will be dropped at the end of the transaction
Bulk Insert Into changes using COPY FROM
Use the COPY FROM Command to bulk insert the raw data into the temporary table.
COPY changes(id,data) FROM ..
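For illustration, with a made-up file path and CSV options (adjust to match your raw data):
COPY changes(id, data) FROM '/path/to/raw_data.csv' WITH (FORMAT csv, HEADER true);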
DROP Relations That Can Slow Processing
On the target table, DROP all foreign key constraints, indexes and triggers (where possible). Don't drop your PRIMARY KEY, as you'll want that for the INSERT.
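A sketch of what that could look like; the constraint, index, and trigger names below are placeholders for whatever actually exists on your table:
ALTER TABLE target DROP CONSTRAINT target_some_fkey;  -- foreign keys
DROP INDEX IF EXISTS target_data_idx;                 -- secondary indexes
DROP TRIGGER IF EXISTS target_audit_trg ON target;    -- triggers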
Add a Tracking Column to target Table
Add a column to the target table to record whether each row was present in the changes table:
ALTER TABLE target ADD COLUMN seen boolean;
UPSERT from the changes table into the target table:
UPSERTs are performed by adding an ON CONFLICT clause to a standard INSERT statement. This avoids the need to perform two separate operations.
INSERT INTO target (id, data, seen)
SELECT id, data, true
FROM changes
ON CONFLICT (id) DO UPDATE SET data = EXCLUDED.data, seen = true;
DELETE Rows Not In changes Table
DELETE FROM target WHERE seen IS NOT TRUE;
DROP Tracking Column and Temporary changes Table
DROP TABLE changes;
ALTER TABLE target DROP COLUMN seen;
Add Back Relations You Dropped For Performance
Add back all constraints, triggers and indexes that were dropped to improve bulk upsert performance.
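Continuing with the placeholder names from the DROP step above, that could look like:
ALTER TABLE target ADD CONSTRAINT target_some_fkey
    FOREIGN KEY (some_id) REFERENCES some_table (id);
CREATE INDEX target_data_idx ON target (data);
CREATE TRIGGER target_audit_trg AFTER INSERT OR UPDATE ON target
    FOR EACH ROW EXECUTE PROCEDURE target_audit_fn();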
Commit Transaction
The bulk upsert/delete is complete and the following commands should be performed outside of a transaction.
Run VACUUM ANALYZE on the target Table.
This will allow the query planner to make appropriate inferences about the table and reclaim space taken up by dead tuples.
SET maintenance_work_mem = <size>
VACUUM ANALYZE target;
SET maintenance_work_mem = <original size>
Restore Original Values of Database Configuration Variables
ALTER SYSTEM SET max_wal_size = <size>;
...
You may need to restart your database again for these settings to take effect.

Sybase ASE: Add NOT NULL column without a DEFAULT fails. Why?

Consider the following empty (as in without rows) table:
CREATE TABLE my_table(
my_column CHAR(10) NOT NULL
);
Trying to add a NOT NULL column without a DEFAULT will fail:
ALTER TABLE my_table ADD my_new_column CHAR(10) NOT NULL;
Error:
*[Code: 4997, SQL State: S1000]
ALTER TABLE my_table failed.
Default clause is required in order to add non-NULL column 'my_new_column'.
But adding the column as NULL and then changing it to NOT NULL will work:
ALTER TABLE my_table ADD my_new_column CHAR(10) NULL;
ALTER TABLE my_table MODIFY my_new_column CHAR(10) NOT NULL;
Setting a default and then removing the default will work too:
ALTER TABLE my_table ADD my_new_column CHAR(10) DEFAULT '' NOT NULL;
ALTER TABLE my_table REPLACE my_new_column DEFAULT NULL;
What's the justification for this behavior? What is the database trying to do internally such that adding the column directly fails? I have a feeling it might have something to do with internal versioning, but I can't find anything in this regard.
This is speculation. I am guessing that Sybase is being overly conservative. In general, you cannot add a new not null column with no default value to a table that has rows. This is true in all databases, because there is no way to populate the existing rows for the new column.
I am guessing that Sybase simply doesn't check whether the table has rows, only whether it exists. Clearly it is not doing that check for the ALTER.
This is only speculation, but I suspect it has to do with the combination of needing both to acquire a lock on the whole table to guarantee continued compliance with the schema, and to re-allocate space for the records.
Allowing a direct add of a NOT NULL column would compromise any existing records if there's no default value. Yes, we know the table is empty. And the database can (eventually) know the table is empty at execution time... but it can't really know the table is empty at execution plan compile time, because a row could be added while the execution plan is determined.
This means the database would need to generate the worst-possible execution plan, involving a lock on the entire table, for the query to run in a transactionally-safe way. Additionally, adding (or removing) a column causes extra work for the database because it needs to re-allocate any pages and rebuild indexes in order to account for the changed size of individual records.
Put the two together, and it becomes difficult to simply roll back a failed query, because you may have actual pages in different states. For whatever reason, the developers chose not to allow this.
The other options allow you to simply fail the query if a bad row gets in the way and would violate the schema, because you're not re-sizing records within pages. It might even allow you to get away with some page and row locks, rather than full table locks.

DB2: How to add new column between existing columns?

I have an existing DB2 database and a table named employee with columns id, e_name, e_mobile_no, e_dob, e_address.
How can I add a new column e_father_name before e_mobile_no?
You should try the ADMIN_MOVE_TABLE procedure, which allows you to change the table structure.
ALTER TABLE only allows adding columns at the end of the table. The reason is that inserting a column in the middle would change the physical structure of the table, i.e., each row would need to be adapted to the new format. This would be quite expensive.
Using the mentioned procedure ADMIN_MOVE_TABLE you would copy the entire table and during that process change the table structure. It requires a significant amount of space and time.
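A hedged sketch of such a call; the schema name and column definitions are assumptions to be replaced with your actual table definition, with e_father_name placed where you want it:
CALL SYSPROC.ADMIN_MOVE_TABLE(
    'MYSCHEMA', 'EMPLOYEE',
    '', '', '',   -- keep current data, index and LOB table spaces
    '', '', '',   -- no MDC, partitioning key, or range partitioning changes
    'ID INT, E_NAME VARCHAR(100), E_FATHER_NAME VARCHAR(100), E_MOBILE_NO VARCHAR(20), E_DOB DATE, E_ADDRESS VARCHAR(200)',
    '', 'MOVE');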
In DB2 for IBM i v7r1 you can do it; try it on your DB2 version:
alter table yourtable
add column e_father_name varchar(10) before e_mobile_no
I always do the following --
Take a backup/dump of the table data and the DDL (db2look).
(If you dump to a CSV file, as I do, I suggest dumping in the new format, i.e. with a null placeholder for the new column in the right place.)
Drop the table and indexes.
Create the table with the new column.
Load the data with the old values.
Recreate all indexes and run runstats.
Once you have done it a few times it becomes old hat.
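Roughly, and with placeholder column types (db2look, EXPORT and LOAD here are DB2 command-line steps rather than plain SQL):
-- db2look -d MYDB -e -t EMPLOYEE -o employee_ddl.sql   (capture DDL and indexes)
EXPORT TO employee.del OF DEL
    SELECT id, e_name, CAST(NULL AS VARCHAR(100)), e_mobile_no, e_dob, e_address
    FROM employee;   -- null slot where e_father_name will go
DROP TABLE employee;
CREATE TABLE employee (
    id INT, e_name VARCHAR(100), e_father_name VARCHAR(100),
    e_mobile_no VARCHAR(20), e_dob DATE, e_address VARCHAR(200));
LOAD FROM employee.del OF DEL INSERT INTO employee;
-- recreate indexes from the db2look output, then run RUNSTATS on the table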

Populating a table from a view in Oracle with "locked" truncate/populate

I would like to populate a table from a (potentially large) view on a scheduled basis.
My process would be:
Disable indexes on table
Truncate table
Copy data from view to table
Enable indexes on table
In SQL Server, I can wrap the process in a transaction such that when I truncate the table a schema modification lock will be held until I commit. This effectively means that no other process can insert/update/whatever until the entire process is complete.
However I am aware that in Oracle the truncate table statement is considered DDL and will thus issue an implicit commit.
So my question is how can I mimic the behaviour of SQL Server here? I don't want any other process trying to insert/update/whatever whilst I am truncating and (re)populating the table. I would also prefer my other process to be unaware of any locks.
Thanks in advance.
Make your table a partitioned table with a single partition and local indexes only. Then whenever you need to refresh:
Copy data from view into a new temporary table
CREATE TABLE tmp AS SELECT ... FROM some_view;
Exchange the partition with the temporary table:
ALTER TABLE some_table
EXCHANGE PARTITION part WITH TABLE tmp
WITHOUT VALIDATION;
The table is only locked for the duration of the partition exchange, which, without validation and global index update, should be instant.
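The one-time setup could look roughly like this (table name, columns, and partition key are placeholders):
CREATE TABLE some_table (
    id   NUMBER,
    data VARCHAR2(100)
)
PARTITION BY RANGE (id) (
    PARTITION part VALUES LESS THAN (MAXVALUE)
);

CREATE INDEX some_table_data_ix ON some_table (data) LOCAL;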

Does Adding a Column Lock a Table in SQL Server 2008?

I want to run the following on a table of about 12 million records.
ALTER TABLE t1
ADD c1 int NULL;
ALTER TABLE t2
ADD c2 bit NOT NULL
DEFAULT(0);
I've done it in staging and the timing seemed fine, but before I do it in production, I wanted to know how locking works on the table during new column creation (especially when a default value is specified). So, does anyone know? Does the whole table get locked, or do the rows get locked one by one during default value insertion? Or does something different altogether happen?
Prior to SQL Server 11 (Denali), adding a non-null column with a default will run an update behind the scenes to populate the new default values. Thus it will lock the table for the duration of the 12 million row update. In SQL Server 11 this is no longer the case; the column is added online and no update occurs, see Online non-NULL with values column add in SQL Server 11.
Both in SQL Server 11 and prior, a Sch-M lock is acquired on the table to modify the definition (add the new column metadata). This lock is incompatible with any other possible access (including dirty reads). The difference is in the duration: prior to SQL Server 11 this lock is held for a size-of-data operation (the update of 12M rows); in SQL Server 11 the lock is held only briefly. In the pre-SQL Server 11 update of the rows, no row locks need to be acquired, because the Sch-M lock on the table guarantees that there cannot be any conflict on any individual row.
Yes, it will lock the table.
A table, as a whole, has a single schema (set of columns, with associated types). So, at a minimum, a schema lock would be required to update the definition of the table.
Try to think about how things would work contrariwise - if each row was updated individually, how would any parallel queries work (especially if they involved the new columns)?
And default values are only useful during INSERT and DDL statements - so if you specify a new default for 10,000,000 rows, that default value has to be applied to all of those rows.
Yes, it will lock.
DDL statements take a schema modification lock, which will prevent access to the table until the operation completes.
There's not really a way around this, and it makes sense if you think about it. SQL Server needs to know how many fields are in a table, and during this operation some rows would have more fields than others.
The alternative is to make a new table with the correct fields, insert into, then rename the tables to swap them out.
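A rough sketch of that alternative (table and column names are placeholders; indexes and constraints still need to be recreated on the new table):
SELECT *, CAST(NULL AS int) AS c1
INTO dbo.t1_new
FROM dbo.t1;

BEGIN TRAN;
EXEC sp_rename 'dbo.t1', 't1_old';
EXEC sp_rename 'dbo.t1_new', 't1';
COMMIT;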
I have not read how the lock mechanism works when adding a column, but I am almost 100% sure row by row is impossible.
Be careful when you do these kinds of changes through drag and drop in SQL Server Management Studio (I know you are not doing that here, but this is a public forum), as some changes are destructive (fortunately SQL Server 2008, at least R2, is safer here, as it tells you "no can do" rather than just doing it).
You can run both column additions in a single statement, however, and reduce the churn.
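For example, assuming both new columns actually target the same table (the snippets in the question show t1 and t2), a single statement adds them in one schema modification:
ALTER TABLE t1
    ADD c1 int NULL,
        c2 bit NOT NULL DEFAULT(0);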