Add a new column to a big database table - SQL

I need to add a new column to a table in my database. The table contains around 140 million rows and I'm not sure how to proceed without locking the database.
The database is in production and that's why this has to be as smooth as it can get.
I have read a lot but never really found a clear answer as to whether this is a risky operation or not.
The new column is nullable and the default can be NULL. As I understand it, there is a bigger issue if the new column needs a default value.
I'd really appreciate some straightforward answers on this matter. Is this doable or not?

Yes, it is eminently doable.
Adding a column that is nullable and has no default value does not require a long-running lock on the table.
If you supply a default value, then SQL Server has to go and update each record in order to write that new column value into the row.
How it works in general:
+---------------------+------------------------+--------------------+
| Column is Nullable? | Default Value Supplied | Result             |
+---------------------+------------------------+--------------------+
| Yes                 | No                     | Quick add (caveat) |
| Yes                 | Yes                    | Long-running lock  |
| No                  | No                     | Error              |
| No                  | Yes                    | Long-running lock  |
+---------------------+------------------------+--------------------+
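For example, on the SQL Server versions discussed here (a minimal sketch; the table and column names are hypothetical):
-- Metadata-only change: quick, regardless of row count
ALTER TABLE dbo.BigTable ADD NewColA int NULL

-- Size-of-data operation: every row must be touched to write the default
ALTER TABLE dbo.BigTable ADD NewColB int NOT NULL DEFAULT 0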
The caveat bit:
I can't remember off the top of my head what happens when you add a column that causes the size of the NULL bitmap to be expanded. I'd like to say that the NULL bitmap represents the nullability of all the columns currently in the row, but I can't put my hand on my heart and say that's definitely true.
Edit -> @MartinSmith pointed out that the NULL bitmap will only expand when the row is changed, many thanks. However, as he also points out, if the size of the row expands past the 8060-byte limit in SQL Server 2012 then a long-running lock may still be required. Many thanks * 2.
Second caveat:
Test it.
Third and final caveat:
No really, test it.

My example shows how to add a new column to a table with tens of millions of rows and fill it with a default value without a long-running lock:
USE [MyDB]
GO
ALTER TABLE [dbo].[Customer] ADD [CustomerTypeId] TINYINT NULL
GO
ALTER TABLE [dbo].[Customer] ADD CONSTRAINT [DF_Customer_CustomerTypeId] DEFAULT 1 FOR [CustomerTypeId]
GO
DECLARE @batchSize bigint = 5000
      , @rowcount int
      , @MaxID int;
SET @rowcount = 1
SET @MaxID = 0
WHILE @rowcount > 0
BEGIN
    ;WITH upd AS (
        SELECT TOP (@batchSize)
               [ID]
             , [CustomerTypeId]
        FROM [dbo].[Customer]
        WHERE [CustomerTypeId] IS NULL
          AND [ID] > @MaxID
        ORDER BY [ID])
    UPDATE upd
    SET [CustomerTypeId] = 1
      , @MaxID = CASE WHEN [ID] > @MaxID THEN [ID] ELSE @MaxID END
    SET @rowcount = @@ROWCOUNT
    WAITFOR DELAY '00:00:01'
END;
ALTER TABLE [dbo].[Customer] ALTER COLUMN [CustomerTypeId] TINYINT NOT NULL;
GO
ALTER TABLE [dbo].[Customer] ADD [CustomerTypeId] TINYINT NULL changes only the metadata (taking a short Sch-M lock), and the lock time does not depend on the number of rows in the table.
After that, I fill the new column with the default value in small batches (5,000 rows). I wait one second after each cycle so as not to block the table too aggressively. I have an int column [ID] as the primary clustered key.
Finally, when the new column is completely filled, I change it to NOT NULL.

No one can tell how much time the operation will take, as this depends on many other factors.
You should not be worried about the operation itself, because SQL Server does everything right:
The Database Engine uses schema modification (Sch-M) locks during a
table data definition language (DDL) operation, such as adding a
column or dropping a table. During the time that it is held, the Sch-M
lock prevents concurrent access to the table. This means the Sch-M
lock blocks all outside operations until the lock is released.
I have never done an ALTER operation on such an amount of data, and the only advice that I can give is to do it when there are not so many connections to the database (during the night).
EDIT:
Here you can find more information about your question. Generally, Matt Whitfield is right and
The only time that adding a column to a table results in a
size-of-data operation (i.e. an operation that modifies every row in a
table) is when the new column has a non-null default.
and when
New column is nullable, with a NULL default. The table's metadata
records the fact that the new column exists but may not be in the
record. This is why the null bitmap also has a count of the number of
columns in that particular record. SQL Server can work out whether a
column is present in the record or not. So – this is NOT a
size-of-data operation – the existing table records are not updated
when the new column is added. The records will be updated only when
they are updated for some other operation.

There is one way that I usually do it: export the table, create the new column on a local copy under a different name, then import the data into that new table; finally, rename the existing table away and rename the new table to the original name.

Related

Adding a computed column that uses MAX

I need to create a sequential number column for record-numbering purposes.
I am OK with losing the sequence if I delete a row from the middle of the table.
For example
1
2
3
If I delete 2, I am OK with the new column value being 4.
I tried to alter my table to
alter table [dbo].[mytable]
add [record_seq] as (MAX(record_seq) + 1)
but I am getting: An aggregate may not appear in a computed column expression or check constraint.
This is a bit confusing. Do I need to specify an initial value? Is there a better way?
If you're looking to allocate a sequence number even in cases where the table doesn't get a record inserted, I would handle it in the process responsible for performing those inserts. Create another table, in this table keep track of the max identity value of that sequence. Each time you want to perform an insert, reserve the sequence number you want by updating that table first. If you rely on selecting the max existing value, you could be at risk of multiple sessions getting the same "new" sequence number before inserting. Even if the insert fails, you will have incremented that control table so nothing else uses that value that has been reserved.
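A minimal sketch of that control-table pattern (SequenceControl, NextSeq, and record_seq as a plain int column are my own names, not from the question):
-- One-row table holding the next sequence value to hand out
CREATE TABLE dbo.SequenceControl (NextSeq int NOT NULL);
INSERT INTO dbo.SequenceControl (NextSeq) VALUES (1);

-- Reserve a number atomically, then use it for the insert
DECLARE @seq int;
UPDATE dbo.SequenceControl
SET @seq = NextSeq
  , NextSeq = NextSeq + 1;
-- @seq stays consumed even if the insert below fails
INSERT INTO dbo.mytable (record_seq) VALUES (@seq);  -- plus the row's other columns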
It's not supported in MS SQL. You can use an identity column:
ALTER TABLE [dbo].[mytable]
ADD [record_seq] INT IDENTITY
Or use a trigger to update your seq column after insert and/or delete, as in the sketch below.
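A rough sketch of the trigger route (it assumes a key column id and a plain int record_seq column, both hypothetical here, and it inherits the MAX() concurrency caveat described above):
CREATE TRIGGER trg_mytable_seq ON [dbo].[mytable]
AFTER INSERT
AS
BEGIN
    SET NOCOUNT ON;
    -- Number the newly inserted rows after the current maximum
    UPDATE t
    SET record_seq = m.base + s.rn
    FROM [dbo].[mytable] t
    JOIN (SELECT id, ROW_NUMBER() OVER (ORDER BY id) AS rn
          FROM inserted) s ON t.id = s.id
    CROSS JOIN (SELECT ISNULL(MAX(record_seq), 0) AS base
                FROM [dbo].[mytable]) m;
END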

Column Copy and Update vs. Column Create and Insert

I have a table with 32 Million rows and 31 columns in PostgreSQL 9.2.10. I am altering the table by adding columns with updated values.
For example, if the initial table is:
id  initial_color
--  -------------
1   blue
2   red
3   yellow
I am modifying the table so that the result is:
id  initial_color  modified_color
--  -------------  --------------
1   blue           blue_green
2   red            red_orange
3   yellow         yellow_brown
I have code that will read the initial_color column and update the value.
Given that my table has 32 million rows and that I have to apply this procedure on five of the 31 columns, what is the most efficient way to do this? My present choices are:
Copy the column and update the rows in the new column
Create an empty column and insert new values
I could do either option with one column at a time or with all five at once. The column types are either character varying or character.
The column types are either character varying or character.
Don't use character; that's a misunderstanding. varchar is OK, but I would suggest just text for arbitrary character data.
Any downsides of using data type "text" for storing strings?
Given that my table has 32 million rows and that I have to apply this
procedure on five of the 31 columns, what is the most efficient way to do this?
If you don't have objects (views, foreign keys, functions) depending on the existing table, the most efficient way is to create a new table. Something like this (details depend on your installation):
BEGIN;
LOCK TABLE tbl_org IN SHARE MODE;  -- to prevent concurrent writes
CREATE TABLE tbl_new (LIKE tbl_org INCLUDING STORAGE INCLUDING COMMENTS);
ALTER TABLE tbl_new ADD COLUMN modified_color text
                  , ADD COLUMN modified_something text;
                  -- , etc.
INSERT INTO tbl_new (<all columns in order here>)
SELECT <all columns in order here>
     , myfunction(initial_color) AS modified_color  -- etc.
FROM tbl_org;
-- ORDER BY tbl_id;  -- optionally order rows while being at it
-- Add constraints and indexes like in the original table here
DROP TABLE tbl_org;
ALTER TABLE tbl_new RENAME TO tbl_org;
COMMIT;
If you have depending objects, you need to do more.
Either way, be sure to add all five at once, as in the sketch below. If you update each in a separate query, you write another row version each time due to the MVCC model of Postgres.
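For example (a sketch; myfunction and the second column are placeholders, as above):
-- One pass: each row is rewritten exactly once;
-- five separate UPDATEs would produce five dead row versions per row
UPDATE tbl_org
SET modified_color     = myfunction(initial_color)
  , modified_something = myfunction(initial_color);  -- etc. for the other columns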
Related cases with more details, links and explanation:
Updating database rows without locking the table in PostgreSQL 9.2
Best way to populate a new column in a large table?
Optimizing bulk update performance in PostgreSQL
While creating a new table you might also order columns in an optimized fashion:
Calculating and saving space in PostgreSQL
Maybe I'm misreading the question, but as far as I know, you have 2 possibilities for creating a table with the extra columns:
CREATE TABLE
This would create a new table, and filling could be done using
CREATE TABLE ... AS SELECT ... to fill it at creation, or
using a separate INSERT ... SELECT ... later on.
Neither variant is what you seem to want, since you asked for a solution without listing all the fields.
Also, this would require all the data (plus the new fields) to be copied.
ALTER TABLE...ADD ...
This creates the new columns. As I'm not aware of any way to reference existing column values here, you will need an additional UPDATE ... SET ... to fill in the values.
So, I'm not seeing any way to realize a procedure that follows your choice 1.
Nevertheless, copying the (column) data just to overwrite it in a second step would be suboptimal in any case. Altering a table to add new columns does minimal I/O. So even if there were a way to execute your choice 1, choice 2 promises better performance by factors.
Thus, two statements, one ALTER TABLE adding all your new columns in one go and then one UPDATE providing the new values for these columns, will achieve what you want, as sketched below.
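A sketch of those two statements (the table name tbl_org is borrowed from the answer above; the second column and the derivation expressions are placeholders):
ALTER TABLE tbl_org
    ADD COLUMN modified_color text
  , ADD COLUMN modified_something text;

UPDATE tbl_org
SET modified_color     = initial_color || '_x'
  , modified_something = initial_color || '_y';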
Create the new column (modified_color); it will have a value of NULL on all records.
Then run an update statement, assuming your table name is mytable:
update mytable
set modified_color = 'blue_green'
where initial_color = 'blue';
If I am correct, this can also work like this:
update mytable set modified_color = 'blue_green' where initial_color = 'blue';
update mytable set modified_color = 'red_orange' where initial_color = 'red';
update mytable set modified_color = 'yellow_brown' where initial_color = 'yellow';
Once you have done this, you can do another update (assuming you have another column that I will call modified_color1):
update mytable set modified_color1 = modified_color;

T-SQL (MSSQL 2005) Reorder Scope_identity

I have a small table of 2 columns on MSSQL Server 2005 which contains a lot of information, let's say about 1 billion records, and it is constantly being written to.
The definition of the table is:
Create table Test(
    id int identity(1,1) primary key,
    name varchar(30) )
The PK is an int, which I chose over uniqueidentifier for a number of reasons. The problem comes with the auto-increment: I want to reorganize the id every time a row is deleted. The objective is to leave no gaps. The table is active and a lot of rows are written into it, so dropping a column is not an option, and neither is locking the table for a long time.
Quick example of what I want to accomplish:
I have this:
 id | name
----+-------
  1 | Roy
  2 | Boss
  5 | Jane
  7 | Janet
I want to reorganize it so it will look like this:
 id | name
----+-------
  1 | Roy
  2 | Boss
  3 | Jane
  4 | Janet
I am aware of DBCC CHECKIDENT (TableName, RESEED, position), but I am not sure it will help in my case, because my table is big and it will take a lot of time to reposition; also, if I am not mistaken, it will lock the table for a very long time. This table is not used by any other table. But if you like, you can submit a suggestion for the same problem assuming the table is used by other tables.
EDIT 1:
The objective is to prove that the rows follow each other in case a row is deleted, so I can see it was deleted and reinstate it. I was thinking of adding a third column that will contain a hash value from the row above; if the row above is deleted, I would know that I have a gap and need to restore it. In that case the order would not matter, because I can compare the hash codes and see whether they match, so I can tell which row follows which. But still I wonder: is there a more clever and safer way of doing this? Maybe something other than hash codes, some other way of proving that the rows follow each other, or that the new row contains parts of the previous row?
EDIT 2:
I'll try to explain it one more time; if I can't, well, then I don't want to waste anyone's time.
In the perfect scenario nothing would be missing from this table, but due to server errors some data may be deleted, or some of my associates might be careless and delete it by mistake.
I have logs and can recover that data, but I want to prove that the records are sequenced, that they follow each other, even if there is a server error and some of them are deleted but later on reinstated.
Is there a way to do this?
Example:
Let's say that 7 is deleted and after that reinstated as 23. How would you prove that 23 is 7, meaning that 23 came after 6 and before 8?
I would suggest not worrying about trying to reseed your identity column; let SQL Server maintain its uniqueness for each row.
Generally this is wanted for presentation logic instead, in which case, you could use the ROW_NUMBER() analytic function:
SELECT Row_Number() Over (Order By Id) NewId,
Id, Name
FROM YourTable
I agree with others that this shouldn't typically be done, but if you absolutely want to do it you can utilize the quirky update to get it done quickly. It should be something like this:
DECLARE @prev_id INT = 0

UPDATE Test
SET id = CASE WHEN id - @prev_id = 1 THEN id
              ELSE @prev_id + 1
         END
  , @prev_id = id
FROM Test
You should read about the limitations of the quirky update, primarily the conditions that must be met to ensure consistent output. This is a good article, though they annoyingly have you sign in; you can find other resources too: http://www.sqlservercentral.com/articles/T-SQL/68467/
Edit: Actually, in this case I think you could just use:
DECLARE @prev_id INT = 0

UPDATE Test
SET id = @prev_id + 1
  , @prev_id = id
FROM Test
The way to do it is to not implement your proposed fix.
Leave the identity alone.
If identity 7 is deleted, you know it was just after 6 and just before 8.
If you need them to stay in the same order, then it's simple:
Place a unique constraint on name.
Don't delete the record.
Just add a bool column for active, as sketched below.
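A minimal sketch of the active-flag idea (the column name is my own):
ALTER TABLE Test ADD active bit NOT NULL DEFAULT 1

-- instead of DELETE FROM Test WHERE id = 7:
UPDATE Test SET active = 0 WHERE id = 7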

Alternatives to Identity Column for Table With Frequent Inserts & Deletes?

Let's say I have a session table like this:
[Session]
-------
Id: int
UserId: int
Imagine this is used on an extremely high-traffic site where sessions are very frequently added and deleted. If I were to make the Id column of each table an identity column, how could I easily maintain the seeding of the Ids so that they don't hit the limits of the int data type? Is there an alternative way of ensuring unique Ids that I'm not thinking of? Thanks in advance.
Instead of int, make it bigint; this will go up to 9,223,372,036,854,775,807.
You can of course start at -9,223,372,036,854,775,808 as well.
see also What To Do When Your Identity Column Maxes Out
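For the Session table from the question, the bigint version might look like this (a sketch):
CREATE TABLE [Session] (
    Id bigint IDENTITY(1,1) PRIMARY KEY,
    UserId int NOT NULL
)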
Make the id a guid instead of an int.
You get unique session ids that are not guessable and are easy to implement with Guid.NewGuid().
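A sketch of the guid variant; NEWID() on the SQL side is the counterpart of Guid.NewGuid():
CREATE TABLE [Session] (
    Id uniqueidentifier NOT NULL DEFAULT NEWID() PRIMARY KEY,
    UserId int NOT NULL
)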
If you have a site maintenance period you could just reseed the identity column. Naff but simple.
How long can a given session exist? If no session will last more than X period of time, and you know that you will never have more than N sessions present in the table at any given time, and you know the maximum rate at which new sessions will be added, then you could implement some form of circular queue system, cycling over a fixed set of numbers.
For example, if you never have more than 1000 rows in the table at any given point in time, no more than 1000 rows will be added in any given 5 minute period, and no row will persist for more than 2 days (nightly clean-up routine?), then you would go through 1000 * 2 * 24 * 12 = 576,000 Ids every two days... where every id gets added, used, and removed from the system every two days. Build circular queue logic around a large safety factor of that number (5,000,000, maybe), and you could be covered.
The hard part, of course, is generating the Id. I've done that in the past with a one-row "NextId" table which was defined and called like so:
-- Create table
CREATE TABLE NextId
(NextId int not null)

-- Add the one row to the table
INSERT NextId (NextId) VALUES (1)
Optionally, put an INSERT/DELETE trigger on here to prevent the addition or deletion of rows
This procedure would be used to get the NextId to use. The single transaction is of course atomic, so you don't have to worry about locking. I used 10 for testing purposes. You will end up with an Id value of 0 every now and then, but it's a surrogate key so the actual value used should not matter.
CREATE PROCEDURE GetNextId
    @NextId int OUTPUT
AS
SET NOCOUNT ON

UPDATE NextId
SET @NextId = NextId
  , NextId = (NextId + 1) % 10  -- 5000000
RETURN
Here's how the procedure would be called:
DECLARE @NextId int
EXECUTE GetNextId @NextId OUTPUT
PRINT @NextId
I don't know how well this would work in excessively high-volume situations, but it does work well under fair-size workloads.

SQL Server concurrency

I asked two questions at once in my last thread, and the first has been answered. I decided to mark the original thread as answered and repost the second question here. Link to original thread if anyone wants it:
Handling SQL Server concurrency issues
Suppose I have a table with a field which holds foreign keys for a second table. Initially records in the first table do not have a corresponding record in the second, so I store NULL in that field. Now at some point a user runs an operation which will generate a record in the second table and have the first table link to it. If two users simultaneously try to generate the record, a single record should be created and linked to, and the other user receives a message saying the record already exists. How do I ensure that duplicates are not created in a concurrent environment?
The steps I need to carry out are:
1) Look up x number of records in table A
2) Perform some business logic that prepares a single row which is inserted into table B
3) Update the records selected in step 1) to point to the newly created record in table B
I can use scope_identity() to retrieve the primary key of the newly created record in table B, so I don't need to worry about the new record being lost due to simultaneous transactions. However I need to eliminate the possibility of concurrently executing processes resulting in a duplicate record in table B being created.
In SQL Server 2008, this can be handled with a filtered unique index:
CREATE UNIQUE INDEX ix_MyIndexName ON MyTable (FkField) WHERE FkField IS NOT NULL
This will require all non-null values be unique, and the database will enforce it for you.
The 2005 way of simulating a unique filtered index for constraint purposes is
CREATE VIEW dbo.EnforceUnique
WITH SCHEMABINDING
AS
SELECT FkField
FROM dbo.TableB
WHERE FkField IS NOT NULL
GO
CREATE UNIQUE CLUSTERED INDEX ix ON dbo.EnforceUnique(FkField)
Connections that update the base table will need to have the correct SET options, but unless you are using non-default options this will be the case anyway in SQL Server 2005 (ARITHABORT used to be the problem one in 2000).
Using a computed column
ALTER TABLE MyTable ADD
OneNonNullOnly AS ISNULL(FkField, -PkField)
CREATE UNIQUE INDEX ix_OneNullOnly ON MyTable (OneNonNullOnly);
Assumes:
FkField is numeric
no clash of FkField and -PkField values
Decided to go with the following:
1) Begin transaction
2) UPDATE tableA SET foreignKey = -1 OUTPUT inserted.id INTO #tempTable
FROM (business logic)
WHERE foreignKey is null
3) If @@ROWCOUNT > 0 Then
3a) Create record in table 2.
3b) Capture ID of newly created record using scope_identity()
3c) UPDATE tableA SET foreignKey = IdOfNewRecord FROM tableA INNER JOIN #tempTable t ON tableA.id = t.id
Since I write junk into the foreign key field in step 2), those rows are locked and no concurrent transactions will touch them. The first transaction is free to create the record. After the transaction is committed, the blocked transaction will execute its update query, but won't capture any of the original rows, because the WHERE clause only considers NULL foreignKey fields. If no rows are returned (@@ROWCOUNT = 0), the current transaction exits without creating the record in table B and returns some sort of error message to the client (e.g. Error: Record already exists).
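Fleshed out as runnable T-SQL, those steps might look like this (a sketch; tableB's column list and the business-logic filter are placeholders):
BEGIN TRANSACTION;

DECLARE @claimed TABLE (id int);
DECLARE @newId int;

-- Step 2: claim the candidate rows by writing the junk FK value
UPDATE tableA
SET foreignKey = -1
OUTPUT inserted.id INTO @claimed
WHERE foreignKey IS NULL;  -- plus the business-logic filter

IF @@ROWCOUNT > 0
BEGIN
    -- Steps 3a/3b: create the tableB record and capture its identity
    INSERT INTO tableB (someColumn) VALUES ('some value');
    SET @newId = SCOPE_IDENTITY();

    -- Step 3c: point the claimed rows at the new record
    UPDATE a
    SET a.foreignKey = @newId
    FROM tableA a
    INNER JOIN @claimed c ON a.id = c.id;
END

COMMIT TRANSACTION;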
Since I write junk into the foreign key field in step 2), those rows are locked and no concurrent transactions will touch them. The first transaction is free to create the record. After the transaction is committed, the blocked transaction will execute the update query, but won't capture any of the original rows due to the WHERE clause only considering NULL foreignKey fields. If no rows are returned (##rowcount = 0), the current transaction exits without creating the record in table B, and returns some sort of error message to the client. (e.g. Error: Record already exists)