I have a SQL Server 2008 table with 773,705,261 rows. I want to create an archive table to archive off the data, but I want to reduce the amount of space required to store it. Speed of accessing the archived data is not the primary concern, but it is always desired.
The current table definition is something like this:
TableID (PK) BIGINT NOT NULL
DocumentID (FK) BIGINT NOT NULL
StatusID (FK) INT NOT NULL
RowCreateDate DATETIME NOT NULL
By my calculation, the current table uses 28 bytes per row. The problem is that each DocumentID could have 6–10 rows in this table (the number of rows per DocumentID could grow in the future too), depending on the number of Statuses the system processed.
My first thought for reducing the space required to store this data is to have one row per DocumentID with an XML field containing all of the StatusIDs and the times they occurred. Something like this:
TableID (PK) BIGINT NOT NULL
DocumentID (FK) BIGINT NOT NULL
Statuses XML NOT NULL
Does anyone have any recommendations for me? Any methods I can research?
Set your archive table to use page compression.
From BOL
CREATE TABLE dbo.T1
(c1 int, c2 nvarchar(200) )
WITH (DATA_COMPRESSION = PAGE);
If you do not expect to be doing any updates or deletes on your archive table (well, deletes that aren't off either end of the table), then I would also create a clustered index with a fillfactor of 100%. That way there will not be any free space left in each page.
Of course I would look at both in BOL before actually applying anything.
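For the table in the question, a combined sketch might look like this (the archive table and constraint names are assumptions, and both options are worth checking in BOL first as noted):
CREATE TABLE dbo.DocumentStatusArchive
(
    TableID       BIGINT   NOT NULL,
    DocumentID    BIGINT   NOT NULL,
    StatusID      INT      NOT NULL,
    RowCreateDate DATETIME NOT NULL,
    CONSTRAINT PK_DocumentStatusArchive
        PRIMARY KEY CLUSTERED (TableID)
        WITH (FILLFACTOR = 100, DATA_COMPRESSION = PAGE)
);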
You may be able to use INT data type for TableID and DocumentID, and SMALLINT or TINYINT for StatusID. Depending on the precision you need from the RowCreateDate column, you may be able to use SMALLDATETIME or DATE. These data types use less disk space and will save you several GB over your 775,000,000 rows.
Kenneth's suggestions of using page compression and FILLFACTOR = 100 are definitely worth considering.
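As a rough sketch of that idea (assuming INT is large enough for both IDs, SMALLINT is enough for StatusID, and minute precision is acceptable for RowCreateDate), the fixed-width columns shrink from 28 to 14 bytes per row, which works out to roughly 10 GB less column data over ~774 million rows, before row and index overhead:
CREATE TABLE dbo.DocumentStatusArchiveNarrow
(
    TableID       INT           NOT NULL,  -- was BIGINT (8 bytes), now 4
    DocumentID    INT           NOT NULL,  -- was BIGINT (8 bytes), now 4
    StatusID      SMALLINT      NOT NULL,  -- was INT (4 bytes), now 2
    RowCreateDate SMALLDATETIME NOT NULL   -- was DATETIME (8 bytes), now 4
);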
Consider having a table like this:
CREATE TABLE Product (
Id int PRIMARY KEY CLUSTERED,
InvoicesStr varchar(max)
)
where InvoicesStr is the concatenated Ids of the invoices containing this Product.
I know it is not a proper design, but I just brought it up to demonstrate the problem I want to describe.
So the table data would be something like this:
Product
Id | InvoicesStr
----|-------------------------------------
1 | 4,5,6,7,34,6,78,967,3,534,
2 | 454,767,344,567,89676,4435,3,434,
After selling millions of products the InvoicesStr would contain a very large string.
Consider a situation in which for a row, this column contains a very big string, say a 1GB string.
I want to know about the performance for such an update query:
UPDATE Product
SET InvoicesStr = InvoicesStr + '584,'
WHERE Id = 100
Is the performance of this query dependent on the size of InvoicesStr? Or is SQL Server smart enough to just append the new string and not replace it completely?
You can use the little-known .WRITE syntax to append or modify text/data in a max column.
This does an efficient append or replace (minimally logged if possible), and is useful for modifying large values. Note that SQL Server modifies only whole 8k pages, so the minimum amount of modified data would be 8k (unless the existing data exactly filled a page).
For example
UPDATE Product
SET InvoicesStr.WRITE('100,', NULL, NULL)
WHERE Id = 2;
In reality, there is usually little reason to actually use this syntax, because you would not normally have such a denormalized design. And if you were storing something like pictures or audio, you would just replace the whole value.
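If the goal is only to record which invoices contain which products, the usual normalized alternative is a junction table; a sketch with illustrative names is below. "Appending" a sale is then a plain INSERT whose cost does not depend on how much data already exists for the product:
CREATE TABLE InvoiceProduct (
    InvoiceId int NOT NULL,
    ProductId int NOT NULL,
    PRIMARY KEY (ProductId, InvoiceId)
);

-- recording that invoice 584 contains product 100 is just an insert
INSERT INTO InvoiceProduct (ProductId, InvoiceId) VALUES (100, 584);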
With the code below, creating a wide table in SQL Server keeps giving me this error:
Msg 1702, Level 16, State 1, Line 11
CREATE TABLE failed because column '2010/12/01' in table 'PriceToBookFinalI' exceeds the maximum of 1024 columns.
USE [Style]
GO
CREATE TABLE [dbo].[PriceToBookFinalI]
(DocID int PRIMARY KEY,
[2006/12/29][Money],
[2007/01/01][Money],
...
SpecialPurposeColumns XML COLUMN_SET FOR ALL_SPARSE_COLUMNS);
GO
(2614 columns)
Looking for a good hint!
Here is the background set of data I want to import to my wide table
The solution for this is to normalize your design. Even if you could fit it into the 1,024-column limit, your design is not a good idea. For example, what if you wanted to know the average amount a DocID changed each month? That would be a nightmare to write in this model.
Try this instead.
CREATE TABLE dbo.PriceToBookFinalI (
DocID INT PRIMARY KEY,
SpecialPurposeColumns XML COLUMN_SET FOR ALL_SPARSE_COLUMNS
);
CREATE TABLE dbo.PriceToBookFinalMoney (
DocID INT,
DocDate DATE,
DocAmount MONEY,
CONSTRAINT PK_PriceToBookFinalMoney
PRIMARY KEY CLUSTERED
(
DocID,
DocDate
)
);
You can easily join the table with the SpecialPurposeColumns to the table with the dates and amounts for each DocID. You can still pivot the dates into the format you provided above if desired. Having the date as a value in a column gives you much more flexibility in how you use the data, better performance, and naturally handles more dates.
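For example, a per-DocID, per-month aggregate becomes a straightforward query in this design (a sketch against the tables defined above):
SELECT  m.DocID,
        YEAR(m.DocDate)  AS DocYear,
        MONTH(m.DocDate) AS DocMonth,
        AVG(m.DocAmount) AS AvgAmount
FROM    dbo.PriceToBookFinalMoney AS m
GROUP BY m.DocID, YEAR(m.DocDate), MONTH(m.DocDate);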
Normalise it, and do the pivoting into columns as part of your query:
Create table Price (DocID INT primary key,
DocRef Varchar(30), -- the values from your [DATES] column
DocDate DATE,
DocValue MONEY);
Create your table with three columns: ID, Date, Amount. Each ID will have multiple rows in the table (for each date there's an amount value for).
There is a column count limitation in SQL server:
https://msdn.microsoft.com/en-us/library/ms143432.aspx
Columns per nonwide table 1,024
Columns per wide table 30,000
You can use "Wide table", where is Sparse columns - column sets. https://msdn.microsoft.com/en-us/library/cc280521.aspx
BUT - table will have limitation - 8,060 bytes per row. So, most of your columns should have no data.
So - the problem is in your design. Looks like, months should be as rows, not columns. Or maybe better would be some other structure of table. It can not be guessed without seeing the data structure in application.
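Note that the 30,000-column limit only counts sparse columns; nonsparse columns are still capped at 1,024, so the money columns would have to be declared SPARSE (and therefore nullable). A minimal sketch of the syntax, though the normalized design above remains the better option:
CREATE TABLE dbo.PriceToBookWide
(
    DocID int PRIMARY KEY,
    [2006/12/29] MONEY SPARSE NULL,
    [2007/01/01] MONEY SPARSE NULL,
    -- ... remaining date columns, all declared SPARSE ...
    SpecialPurposeColumns XML COLUMN_SET FOR ALL_SPARSE_COLUMNS
);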
I need to add a new column to a table in my database. The table contains around 140 million rows and I'm not sure how to proceed without locking the database.
The database is in production and that's why this has to be as smooth as it can get.
I have read a lot but never really got the answer if this is a risky operation or not.
The new column is nullable and the default can be NULL. As I understand it, there is a bigger issue if the new column needs a default value.
I'd really appreciate some straightforward answers on this matter. Is this doable or not?
Yes, it is eminently doable.
Adding a column where NULL is acceptable and has no default value does not require a long-running lock to add data to the table.
If you supply a default value, then SQL Server has to go and update each record in order to write that new column value into the row.
How it works in general:
+---------------------+------------------------+-----------------------+
| Column is Nullable? | Default Value Supplied | Result                |
+---------------------+------------------------+-----------------------+
| Yes                 | No                     | Quick Add (caveat)    |
| Yes                 | Yes                    | Long running lock     |
| No                  | No                     | Error                 |
| No                  | Yes                    | Long running lock     |
+---------------------+------------------------+-----------------------+
The caveat bit:
I can't remember off the top of my head what happens when you add a column that causes the size of the NULL bitmap to be expanded. I'd like to say that the NULL bitmap represents the nullability of all the columns currently in the row, but I can't put my hand on my heart and say that's definitely true.
Edit -> #MartinSmith pointed out that the NULL bitmap will only expand when the row is changed, many thanks. However, as he also points out, if the size of the row expands past the 8060 byte limit in SQL Server 2012 then a long running lock may still be required. Many thanks * 2.
Second caveat:
Test it.
Third and final caveat:
No really, test it.
My example shows how to add a new column to a table with tens of millions of rows and fill it with a default value, without a long-running lock.
USE [MyDB]
GO
ALTER TABLE [dbo].[Customer] ADD [CustomerTypeId] TINYINT NULL
GO
ALTER TABLE [dbo].[Customer] ADD CONSTRAINT [DF_Customer_CustomerTypeId] DEFAULT 1 FOR [CustomerTypeId]
GO
DECLARE @batchSize bigint = 5000
,@rowcount int
,@MaxID int;
SET @rowcount = 1
SET @MaxID = 0
WHILE @rowcount > 0
BEGIN
;WITH upd as (
SELECT TOP (@batchSize)
[ID]
,[CustomerTypeId]
FROM [dbo].[Customer] WITH (NOLOCK)
WHERE [CustomerTypeId] IS NULL
AND [ID] > @MaxID
ORDER BY [ID])
UPDATE upd
SET [CustomerTypeId] = 1
,@MaxID = CASE WHEN [ID] > @MaxID THEN [ID] ELSE @MaxID END
SET @rowcount = @@ROWCOUNT
WAITFOR DELAY '00:00:01'
END;
ALTER TABLE [dbo].[Customer] ALTER COLUMN [CustomerTypeId] TINYINT NOT NULL;
GO
ALTER TABLE [dbo].[Customer] ADD [CustomerTypeId] TINYINT NULL changes only the metadata (a Sch-M lock is taken), and the lock time does not depend on the number of rows in the table.
After that, I fill the new column with the default value in small batches (5,000 rows). I wait one second after each cycle so as not to block the table too aggressively. I have an int column "ID" as the clustered primary key.
Finally, once the new column is fully filled, I change it to NOT NULL.
No one can tell how much time the operation will take, as this depends on many other factors after all.
You should not be worried about the operation itself, because SQL Server does everything right:
The Database Engine uses schema modification (Sch-M) locks during a
table data definition language (DDL) operation, such as adding a
column or dropping a table. During the time that it is held, the Sch-M
lock prevents concurrent access to the table. This means the Sch-M
lock blocks all outside operations until the lock is released.
I have never done an ALTER operation on such an amount of data, and the only advice I can give is to do it when there are not so many connections to the database (during the night).
EDIT:
Here you can find more information about your question. Generally, Matt Whitfield is right and
The only time that adding a column to a table results in a
size-of-data operation (i.e. an operation that modifies every row in a
table) is when the new column has a non-null default.
and when
New column is nullable, with a NULL default. The table's metadata
records the fact that the new column exists but may not be in the
record. This is why the null bitmap also has a count of the number of
columns in that particular record. SQL Server can work out whether a
column is present in the record or not. So – this is NOT a
size-of-data operation – the existing table records are not updated
when the new column is added. The records will be updated only when
they are updated for some other operation.
There is one approach I usually take: export the table, create the new column on a local copy, and rename the copy; then import the data into it, rename the existing table out of the way, and give the new table the original name.
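A minimal sketch of that approach, reusing the [dbo].[Customer] example from above and ignoring indexes, constraints, and permissions (all of which would need to be recreated on the copy):
-- copy the data into a new table that already has the extra column
SELECT c.*,
       CAST(NULL AS TINYINT) AS [CustomerTypeId]
INTO   [dbo].[Customer_New]
FROM   [dbo].[Customer] AS c;

-- swap the tables by renaming
EXEC sp_rename 'dbo.Customer', 'Customer_Old';
EXEC sp_rename 'dbo.Customer_New', 'Customer';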
Can I select rows on row version?
I am querying a database table periodically for new rows.
I want to store the last row version and then read all rows from the previously stored row version.
I cannot add anything to the table, the PK is not generated sequentially, and there is no date field.
Is there any other way to get all the rows that are new since the last query?
I am creating a new table that contains all the primary keys of the rows that have been processed and will join on that table to get new rows, but I would like to know if there is a better way.
EDIT
This is the table structure:
Everything except product_id and stock_code is a field describing the product.
You can cast the rowversion to a bigint, then when you read the rows again you cast the column to bigint and compare against your previous stored value. The problem with this approach is the table scan each time you select based on the cast of the rowversion - This could be slow if your source table is large.
I haven't tried a persisted computed column of this; I'd be interested to know if it works well.
Sample code (Tested in SQL Server 2008R2):
DECLARE @TABLE TABLE
(
Id INT IDENTITY(1,1) NOT NULL PRIMARY KEY,
Data VARCHAR(10) NOT NULL,
LastChanged ROWVERSION NOT NULL
)
INSERT INTO @TABLE(Data)
VALUES('Hello'), ('World')
SELECT
Id,
Data,
LastChanged,
CAST(LastChanged AS BIGINT)
FROM
@TABLE
DECLARE @Latest BIGINT = (SELECT MAX(CAST(LastChanged AS BIGINT)) FROM @TABLE)
SELECT * FROM @TABLE WHERE CAST(LastChanged AS BIGINT) >= @Latest
EDIT: It seems I've misunderstood, and you don't actually have a ROWVERSION column, you just mentioned row version as a concept. In that case, SQL Server Change Data Capture would be the only thing left I could think of that fits the bill: http://technet.microsoft.com/en-us/library/bb500353(v=sql.105).aspx
Not sure if that fits your needs, as you'd need to be able to store the LSN of "the last time you looked" so you can query the CDC tables properly. It lends itself more to data loads than to typical queries.
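A rough sketch of what that looks like, assuming a dbo.products source table (the change-table function name is generated from the capture instance, here dbo_products):
-- enable CDC for the database and the table (requires SQL Server Agent)
EXEC sys.sp_cdc_enable_db;
EXEC sys.sp_cdc_enable_table
     @source_schema = N'dbo',
     @source_name   = N'products',
     @role_name     = NULL;

-- later: read the changes (use your stored LSN instead of the minimum on subsequent runs)
DECLARE @from_lsn binary(10) = sys.fn_cdc_get_min_lsn('dbo_products'),
        @to_lsn   binary(10) = sys.fn_cdc_get_max_lsn();

SELECT *
FROM   cdc.fn_cdc_get_all_changes_dbo_products(@from_lsn, @to_lsn, N'all');
-- persist @to_lsn as "the last time you looked" for the next run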
Assuming you can create a temporary table, the EXCEPT command seems to be what you need:
Copy your table into a temporary table.
The next time you look, select everything from your table EXCEPT everything from the temporary table, and extract the keys you need from this (see the sketch after this list).
Make sure your temporary table is up to date again.
Note that your temporary table only needs to contain the keys you need. If this is just one column, you can go for a NOT IN rather than EXCEPT.
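A sketch of those steps, assuming the key is product_id and the snapshot table is called products_seen:
-- 1. new rows since the last look
SELECT product_id FROM products
EXCEPT
SELECT product_id FROM products_seen;

-- 2. bring the snapshot up to date again
TRUNCATE TABLE products_seen;
INSERT INTO products_seen (product_id)
SELECT product_id FROM products;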
What's the best way to store a linked list in a MySQL database so that inserts are simple (i.e. you don't have to re-index a bunch of stuff every time) and such that the list can easily be pulled out in order?
Use Adrian's solution, but instead of incrementing by 1, increment by 10 or even 100. Then an inserted row's position can be set to the midpoint of the two positions you're inserting between, without having to update everything below the insertion. Pick a gap large enough to handle your average number of insertions - if it's too small then you'll have to fall back to updating all rows with a higher position during an insertion.
Create a table with two self-referencing columns, PreviousID and NextID. If the item is the first thing in the list PreviousID will be null; if it is the last, NextID will be null. The SQL will look something like this:
create table tblDummy
(
PKColumn int not null,
PreviousID int null,
DataColumn1 varchar(50) not null,
DataColumn2 varchar(50) not null,
DataColumn3 varchar(50) not null,
DataColumn4 varchar(50) not null,
DataColumn5 varchar(50) not null,
DataColumn6 varchar(50) not null,
DataColumn7 varchar(50) not null,
NextID int null
)
Store an integer column in your table called 'position'. Record a 0 for the first item in your list, a 1 for the second item, etc. Index that column in your database, and when you want to pull your values out, sort by that column.
alter table linked_list add column position integer not null default 0;
alter table linked_list add index position_index (position);
select * from linked_list order by position;
To insert a value at index 3, modify the positions of rows 3 and above, and then insert:
update linked_list set position = position + 1 where position >= 3;
insert into linked_list (my_value, position) values ("new value", 3);
A linked list can be stored using recursive pointers in the table. This is very much the same way hierarchies are stored in SQL, and it uses the recursive association pattern.
You can learn more about it here (Wayback Machine link).
I hope this helps.
The simplest option would be creating a table with a row per list item, a column for the item position, and columns for other data in the item. Then you can use ORDER BY on the position column to retrieve in the desired order.
create table linked_list
( list_id integer not null
, position integer not null
, data varchar(100) not null
);
alter table linked_list add primary key ( list_id, position );
To manipulate the list just update the position and then insert/delete records as needed. So to insert an item into list 1 at index 3:
begin transaction;
update linked_list set position = position + 1 where position >= 3 and list_id = 1;
insert into linked_list (list_id, position, data)
values (1, 3, "some data");
commit;
Since operations on the list can require multiple commands (eg an insert will require an INSERT and an UPDATE), ensure you always perform the commands within a transaction.
A variation of this simple option is to have position incrementing by some factor for each item, say 100, so that when you perform an INSERT you don't always need to renumber the position of the following elements. However, this requires a little more effort to work out when to increment the following elements, so you lose simplicity but gain performance if you will have many inserts.
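A small sketch of that variation against the linked_list table above, with positions spaced by 100 (the neighbour positions 200 and 300 are illustrative):
-- insert between the items at positions 200 and 300; no other rows need renumbering
insert into linked_list (list_id, position, data)
values (1, (200 + 300) DIV 2, 'between the 2nd and 3rd items');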
Depending on your requirements other options might appeal, such as:
If you want to perform lots of manipulations on the list and not many retrievals, you may prefer to have an ID column pointing to the next item in the list instead of using a position column. Then you need iterative logic when retrieving the list in order to get the items in order. This can be implemented relatively easily in a stored proc (see the sketch after this list).
If you have many lists, a quick way to serialise and deserialise your list to text/binary, and you only ever want to store and retrieve the entire list, then store the entire list as a single value in a single column. Probably not what you're asking for here though.
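As a sketch of that pointer-based retrieval, assuming MySQL 8.0+ (for recursive CTEs) and a hypothetical table linked_list_ptr (id, data, next_id):
with recursive ordered as (
    -- anchor: the head of the list, i.e. the row no other row points to
    select id, data, next_id, 1 as position
    from linked_list_ptr
    where id not in (select next_id from linked_list_ptr where next_id is not null)
    union all
    -- follow the next_id pointer one hop at a time
    select l.id, l.data, l.next_id, o.position + 1
    from linked_list_ptr l
    join ordered o on l.id = o.next_id
)
select id, data from ordered order by position;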
This is something I've been trying to figure out for a while myself. The best way I've found so far is to create a single table for the linked list using the following format (this is pseudo code):
LinkedList(
key1,
information,
key2
)
key1 is the starting point. key2 is a foreign key linking to the key1 of the next row. So your rows will link to each other something like this
col1
key1 = 0,
information= 'hello'
key2 = 1
Key1 is primary key of col1. key2 is a foreign key leading to the key1 of col2
col2
key1 = 1,
information= 'wassup'
key2 = null
key2 from col2 is set to null because it doesn't point to anything
When you first enter a row into the table, you'll need to make sure key2 is set to null or you'll get an error. After you enter the second row, you can go back and set key2 of the first row to the primary key of the second row.
This makes it best to enter many entries at a time, then go back and set the foreign keys accordingly (or build a GUI that just does that for you).
Here's some actual code I've prepared (all actual code worked on MSSQL. You may want to do some research for the version of SQL you are using!):
createtable.sql
create table linkedlist00 (
key1 int primary key not null identity(1,1),
info varchar(10),
key2 int
)
register_foreign_key.sql
alter table dbo.linkedlist00
add foreign key (key2) references dbo.linkedlist00(key1)
*I put them into two separate files, because it has to be done in two steps. MSSQL won't let you do it in one step, because the table doesn't exist yet for the foreign key to reference.
A linked list is especially powerful in one-to-many relationships. If you've ever wanted to make an array of foreign keys, this is one way to do it! You can make a primary table that points to the first row in the linked-list table, and then instead of the "information" field, you can use a foreign key to the desired information table.
Example:
Let's say you have a Bureaucracy that keeps forms.
Let's say they have a table called file cabinet
FileCabinet(
Cabinet ID (pk)
Files ID (fk)
)
Each row contains a primary key for the cabinet and a foreign key for the files. These files could be tax forms, health insurance papers, field trip permission slips, etc.
Files(
Files ID (pk)
File ID (fk)
Next File ID (fk)
)
this serves as a container for the Files
File(
File ID (pk)
Information on the file
)
this is the specific file
There may be better ways to do this and there are, depending on your specific needs. The example just illustrates possible usage.
There are a few approaches I can think of right off, each with differing levels of complexity and flexibility. I'm assuming your goal is to preserve an order in retrieval, rather than requiring storage as an actual linked list.
The simplest method would be to assign an ordinal value to each record in the table (e.g. 1, 2, 3, ...). Then, when you retrieve the records, specify an order-by on the ordinal column to get them back in order.
This approach also allows you to retrieve the records without regard to membership in a list, but allows for membership in only one list, and may require an additional "list id" column to indicate to which list the record belongs.
A slightly more elaborate, but also more flexible, approach would be to store information about membership in a list or lists in a separate table. The table would need 3 columns: the list id, the ordinal value, and a foreign key pointer to the data record. Under this approach, the underlying data knows nothing about its membership in lists, and can easily be included in multiple lists.
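A sketch of that membership table with illustrative names (items is assumed to be the existing data table, keyed on id):
create table list_membership (
    list_id integer not null,
    ordinal integer not null,
    item_id integer not null,   -- foreign key pointer to the data record
    primary key (list_id, ordinal)
);

-- retrieve list 1 in order; the items table knows nothing about lists
select i.*
from list_membership m
join items i on i.id = m.item_id
where m.list_id = 1
order by m.ordinal;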
This post is old, but I'll still give my $.02. Updating every record in a table or record set to solve ordering sounds crazy. The amount of indexing is also crazy, but it sounds like most have accepted it.
The crazy solution I came up with to reduce updates and indexing is to create two tables (and in most use cases you don't sort all records in just one table anyway): table A to hold the records of the list being sorted, and table B to group them and hold a record of the order as a string. The order string represents an array that can be used to order the selected records either on the web server or browser layer of a web application.
Create Table A (
Id int primary key identity(1,1),
Data varchar(10) not null,
B_Id int
)
Create Table B (
Id int primary key Identity(1,1),
GroupName varchar(10) not null,
[Order] varchar(max) null
)
The format of the order string should be id, position, and some separator to split() your string by. In the case of jQuery UI, the .sortable('serialize') function outputs a POST-friendly order string for you that includes the id and position of each record in the list.
The real magic is the way you choose to reorder the selected list using the saved ordering string. This will depend on the application you are building. Here is an example, again from jQuery, of reordering the list of items: http://ovisdevelopment.com/oramincite/?p=155
https://dba.stackexchange.com/questions/46238/linked-list-in-sql-and-trees suggests a trick of using floating-point position column for fast inserts and ordering.
It also mentions specialized SQL Server 2014 hierarchyid feature.
I think it's much simpler to add a created column of DATETIME type and a position column of INT, so now you can have duplicate positions. In the SELECT statement, use ORDER BY position, created DESC and your list will be fetched in order.
Increment the SERIAL 'index' by 100, but manually add intermediate values with an 'index' equal to (Prev + Next) / 2. If you ever saturate a gap of 100, renumber the indexes back to multiples of 100.
This should maintain sequence with primary index.
A list can be stored by having a column contain the offset (list index position) -- an insert in the middle then means incrementing all offsets above the insertion point and then inserting the new row.
You could implement it like a double-ended queue (deque) to support fast push/pop/delete (if the ordinal is known) and retrieval. You would have two data structures: one with the actual data, and another with the number of elements ever added over the history of the key. Tradeoff: this method would be slower for any insert into the middle of the linked list, O(n).
create table queue (
primary_key,
queue_key,
ordinal,
data
)
You would have an index on queue_key+ordinal
You would also have another table which stores the number of rows EVER added to the queue...
create table queue_addcount (
primary_key,
add_count
)
When pushing a new item to either end of the queue (left or right) you would always increment the add_count.
If you push to the back you could set the ordinal...
ordinal = add_count + 1
If you push to the front you could set the ordinal...
ordinal = -(add_count + 1)
update
add_count = add_count + 1
This way you can delete anywhere in the queue/list and it would still return in order and you could also continue to push new items maintaining the order.
You could optionally rewrite the ordinal to avoid overflow if a lot of deletes have occurred.
You could also have an index on the ordinal to support fast ordered retrieval of the list.
If you want to support inserts into the middle, you would need to find the ordinal the item needs to be inserted at, then insert with that ordinal. Then increment every ordinal following that insertion point by one. Also, increment the add_count as usual. If the ordinal is negative you could instead decrement all of the earlier ordinals to do fewer updates. This would be O(n).
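A sketch of a push to the back under this scheme, in MySQL-flavoured syntax with the schematic table names from above (types, keys, and the counter row are assumptions; a push to the front would use -(add_count + 1) instead):
start transaction;

-- read and lock the running counter for this queue (row 1 of queue_addcount is assumed to track queue_key 1)
select add_count into @c from queue_addcount where primary_key = 1 for update;

-- push to the back with ordinal = add_count + 1
insert into queue (queue_key, ordinal, data) values (1, @c + 1, 'new item');

-- record that one more item has ever been added
update queue_addcount set add_count = @c + 1 where primary_key = 1;

commit;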