Yes, fillfactor again. I've spent many hours reading and I can't decide what's best for each case. I don't understand when and how fragmentation happens. I'm migrating a database from MS SQL Server to PostgreSQL 9.2.
Case 1
10-50 inserts / minute in a sequential (serial) PK, 20-50 reads / hour.
CREATE TABLE dev_transactions (
transaction_id serial NOT NULL,
transaction_type smallint NOT NULL,
moment timestamp without time zone NOT NULL,
gateway integer NOT NULL,
device integer NOT NULL,
controler smallint NOT NULL,
token integer,
et_mode character(1),
status smallint NOT NULL,
CONSTRAINT pk_dev_transactions PRIMARY KEY (transaction_id)
);
Case 2
Similar structure, indexed serial PK, writes in blocks (one shot) of ~50,000 rows every 2 months, reads 10-50 / minute.
Does a 50% fillfactor mean that each insert generates a new page and moves 50% of existing rows to a newly generated page?
Does a 50% fillfactor mean free space is left between physical rows in new data pages?
Is a new page generated only when there is no free space left in existing pages?
As you can see I'm very confused; I would appreciate some help — maybe a good link to read about PostgreSQL and index fillfactor.
FILLFACTOR
With only INSERT and SELECT you should use a FILLFACTOR of 100 for tables (which is the default anyway). There is no point in leaving wiggle room per data page if you are not going to "wiggle" with UPDATEs.
The mechanism behind FILLFACTOR is simple. INSERTs only fill data pages (usually 8 kB blocks) up to the percentage declared by the FILLFACTOR setting. Also, whenever you run VACUUM FULL or CLUSTER on the table, the same wiggle room per block is re-established. Ideally, this allows UPDATE to store new row versions in the same data page, which can provide a substantial performance boost when dealing with lots of UPDATEs. Also beneficial in combination with H.O.T. updates. See:
Redundant data in update statements
Indexes need more wiggle room by design. They have to store new entries at the right position in leaf pages. Once a page is full, a relatively costly "page split" is needed. So indexes tend to bloat more than tables. The default FILLFACTOR for a (default) B-Tree index is 90 (varies per index type). And wiggle room makes sense for just INSERTs, too. The best strategy heavily depends on write patterns.
Example: If new inserts have steadily growing values (typical case for a serial or timestamp column), then there are basically no page-splits, and you might go with FILLFACTOR = 100 (or a bit lower to allow for some noise).
For a random distribution of new values, you might go below the default 90 ...
Basic source of information: the manual for CREATE TABLE and CREATE INDEX.
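For reference, here is how FILLFACTOR is actually set (a minimal sketch; the index on moment is just an example, not part of your schema):
-- table-level FILLFACTOR (100 is the default for tables anyway)
ALTER TABLE dev_transactions SET (fillfactor = 100);
-- index-level FILLFACTOR, set at creation time ...
CREATE INDEX idx_dev_transactions_moment ON dev_transactions (moment) WITH (fillfactor = 90);
-- ... or changed later; the new setting only applies to pages written after a rebuild
ALTER INDEX idx_dev_transactions_moment SET (fillfactor = 70);
REINDEX INDEX idx_dev_transactions_moment;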
Other optimization
But you can do something else - since you seem to be a sucker for optimization ... :)
CREATE TABLE dev_transactions(
transaction_id serial PRIMARY KEY
, gateway integer NOT NULL
, moment timestamp NOT NULL
, device integer NOT NULL
, transaction_type smallint NOT NULL
, status smallint NOT NULL
, controller smallint NOT NULL
, token integer
, et_mode character(1)
);
This optimizes your table with regard to data alignment and avoids padding on a typical 64-bit server. It saves a few bytes, probably just 8 bytes per row on average - you typically can't squeeze out much with "column tetris":
Calculating and saving space in PostgreSQL
Keep NOT NULL columns at the start of the table for a very small performance bonus.
Your table has 9 columns. The initial ("cost-free") 1-byte NULL bitmap covers 8 columns. The 9th column triggers an additional 8 bytes for the extended NULL bitmap - if there are any NULL values in the row.
If you make et_mode and token NOT NULL, all columns are NOT NULL and there is no NULL bitmap, freeing up 8 bytes per row.
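If that fits your data, the change itself is trivial (a sketch; it will fail if existing rows contain NULL in either column):
ALTER TABLE dev_transactions
  ALTER COLUMN token SET NOT NULL
, ALTER COLUMN et_mode SET NOT NULL;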
This even works per row if some columns can be NULL. If all fields of the same row have values, there is no NULL bitmap for the row. In your special case, this leads to the paradox that filling in values for et_mode and token can make your storage size smaller or at least stay the same:
Do nullable columns occupy additional space in PostgreSQL?
Basic source of information: the manual on Database Physical Storage.
Compare the size of rows (filled with values) with your original table to get definitive proof:
SELECT pg_column_size(t) FROM dev_transactions t;
(Plus maybe padding between rows, as the next row starts at a multiple of 8 bytes.)
Related
I have some business logic that needs to run on a daily basis and will affect all the rows in the table. Once a record is finalized and can't change again, it gets moved to an on-disk table. At its peak there will be approximately 30 million rows in the table. It's very skinny: just the linkage columns to a main table and a key to a flag table. The flag key is what will be updated.
My question is: when preparing a table of this size, what bucket count should I be looking to use on the index?
The table will start off small, with likely only a few hundred thousand rows in April, but by the end of the financial year it will have grown to the maximum mentioned (as previous years have indicated). I'm not sure whether the practically empty buckets at the start will cause any issues, or whether it is OK to set the count at the 30 million mark.
Thanks in advance for your comments, suggestions and help.
I've provided the code below. I've tried googling what happens if the bucket count is high but the initial number of rows is low while the table grows over time, but found nothing to help me understand whether there will be a performance issue because of this.
CREATE TABLE [PRD].[CTRL_IN_MEM]
(
[FILE_LOAD_ID] INT NOT NULL,
[RECORD_IDENTIFIER] BIGINT NOT NULL,
[FLAG_KEY] SMALLINT NOT NULL,
[APP_LEVEL_PART] BIT NOT NULL,
--Line I'm not sure about
CONSTRAINT [pk_CTRL_IN_MEM] PRIMARY KEY NONCLUSTERED HASH ([FILE_LOAD_ID], [RECORD_IDENTIFIER]) WITH (BUCKET_COUNT = 30000000),
INDEX cci_CTRL_IN_MEM CLUSTERED COLUMNSTORE
) WITH (MEMORY_OPTIMIZED = ON, DURABILITY=SCHEMA_AND_DATA)
Say TableA has 1,000,000 billion rows (although that is impossible in a real environment) and the primary key is an identity column, so the max key value is 1 million billion. Now I use a random number to match in its WHERE clause, like below:
Select *
from TableA a
where a.PrimaryKey = [Random Number from 1 to 1 million Billion]
Even if the SELECT statement executes 100 times, I found it is still very fast in SQL Server.
I mean, if the random number is a big number, then for Select * from Table where Pk=1000000 I would expect it to compare against all the previous records to see whether THIS number matches the primary key column, so the performance of the SQL would be very low.
It's fast because the primary key is indexed, which means the lookup of a row based on the primary key takes O(log N) time ( https://en.wikipedia.org/wiki/Big_O_notation ) if you're using a B-tree based index, which is the default in most database systems.
It's further helped by the fact that the primary key is usually the clustered index too, which is the index that defines the physical order of rows on disk (this is a gross over-simplification, but you get the idea).
The primary key is a key. That is, the database maintains a unique index that locates every record by its key value.
The index is usually some variation on a B-tree, which would probably have a depth of about 10 to 20 for such a large dataset. So accessing any record via the key involves at most that many IOs, and usually far fewer, as large parts of the B-tree will be cached in memory.
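If you are curious, you can check the actual depth of the index yourself (a sketch; 'PK_TableA' is a hypothetical index name, substitute the real one):
SELECT INDEXPROPERTY(OBJECT_ID('TableA'), 'PK_TableA', 'IndexDepth') AS index_depth;
-- typically a single-digit number, even for very large tables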
I need to do a fast search on a column with floating point numbers in a table in SQL Server 2008 R2 on Windows 7.
The table has 10 million records.
e.g.
Id value
532 937598.32421
873 501223.3452
741 9797327.231
ID is the primary key. I need to do a search on the "value" column for a given value, such that I can find the 5 closest values to the given value in the table.
The closeness is defined as the absolute value of the difference between the given value and column value.
The smaller the difference, the closer.
I would like to use binary search.
I want to set a unique index on the value column.
But I am not sure whether the table will be sorted every time I search for the given value in the column,
or whether it only sorts the table once because I have set a unique index on the value column?
Are there better ways to do this search?
Will a sort have to be done whenever I do a search? I need to run this search many times against the table. I know the sorting time is O(n lg n). Does the index really do the sort for me, or is the index backed by a sorted tree that holds the column values?
Once an index is set up, are the values already sorted, so that I do not need to sort them every time I do a search?
Any help would be appreciated.
Thanks.
Sorry for my initial response. No, I would not even create an index; the query won't be able to use it, because you're searching not on a given value but on the difference between that given value and the value column of the table. You could create a function-based index, but you would have to hard-code the number you're searching on, which is not constant.
Given that, I would look at getting enough RAM to swallow the whole table. I.e. if the table is 10 GB, try to get 10 GB of RAM allocated for caching. And if possible do it on a machine with an SSD, or get an SSD.
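To see how much RAM that actually means, you can check the table's footprint first (sp_spaceused is a standard system procedure; tbl is the placeholder table name used below):
EXEC sp_spaceused N'tbl';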
The SQL itself is not complicated; it's really just an issue of performance.
select top 5 id, abs(99 - val) as diff
from tbl
order by 2
If you don't mind some trial and error, you could create an index on the value column, and then search as follows -
select top 5 id, abs(99 - val) as diff
from tbl
where val between 99-30 and 99+30
order by 2
The above query WOULD utilize the index on the value column, because it is searching on a range of values in the value column, not the differences between the values in that column and X (2 very different things)
However, there is no guarantee it would return 5 rows; it would only return 5 rows if there actually existed 5 rows within 30 of 99 (69 to 129). If it returned 2, 3, etc. but not 5, you would have to run the query again with an expanded range, and keep doing so until you get your top 5. However, these queries should run quite a bit faster than having no index and firing against the table blind, so you could give it a shot. The index may take a while to create though, so you might want to do that part overnight.
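If you don't want to babysit the retries, the widening can be scripted along these lines (a rough, untested sketch using the same placeholder table tbl, column val and search value 99 as above; it assumes the table has at least 5 rows):
DECLARE @range float = 30;
DECLARE @hits int = 0;
WHILE @hits < 5
BEGIN
    -- count how many rows fall inside the current window
    SELECT @hits = COUNT(*) FROM tbl WHERE val BETWEEN 99 - @range AND 99 + @range;
    -- not enough candidates yet: double the window and try again
    IF @hits < 5 SET @range = @range * 2;
END
SELECT TOP 5 id, ABS(99 - val) AS diff
FROM tbl
WHERE val BETWEEN 99 - @range AND 99 + @range
ORDER BY diff;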
You mention SQL Server and binary search. SQL Server does not work that way, but SQL Server (or any other database) is a good solution for this problem.
Just to be concrete, I will assume:
create table mytable
(
id int not null
, value float not null
, constraint mytable_pk primary key(id)
)
And you need an index on the value field.
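For example (the index name is arbitrary):
CREATE INDEX ix_mytable_value ON mytable (value);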
Now get ten rows, 5 at or above and 5 below the search value, with these 2 selects:
SELECT TOP 5 id, value, abs(value - @searchval) as diff
FROM mytable
WHERE value >= @searchval
ORDER BY value asc
-- and
SELECT TOP 5 id, value, abs(value - @searchval) as diff
FROM mytable
WHERE value < @searchval
ORDER BY value desc
To combine the 2 selects into 1 result set you need:
SELECT *
FROM (SELECT TOP 5 id, value, abs(value - @searchval) as diff
      FROM mytable
      WHERE value >= @searchval
      ORDER BY value asc) as bigger
UNION ALL
SELECT *
FROM (SELECT TOP 5 id, value, abs(value - @searchval) as diff
      FROM mytable
      WHERE value < @searchval
      ORDER BY value desc) as smaller
But since you only want the smallest 5 differences, wrap it with one more layer:
SELECT TOP 5 *
FROM
(
SELECT *
FROM (SELECT TOP 5 id, value, abs(value - @searchval) as diff
      FROM mytable
      WHERE value >= @searchval
      ORDER BY value asc) as bigger
UNION ALL
SELECT *
FROM (SELECT TOP 5 id, value, abs(value - @searchval) as diff
      FROM mytable
      WHERE value < @searchval
      ORDER BY value desc) as smaller
) as candidates
ORDER BY diff ASC
I have not tested any of this.
Creating the table's clustered index upon [value] will cause [value]'s values to be stored on disk in sorted order. The table's primary key (perhaps already defined on [Id]) might already be defined as the table's clustered index. There can only be one clustered index on a table. If a primary key on [Id] is already clustered, the primary key will need to be dropped, the clustered index on [value] will need to be created, and then the primary key on [Id] can be recreated (as a nonclustered primary key). A clustered index upon [value] should improve performance of this specific statement, but you must ultimately test all variety of T-SQL that will reference this table before making the final choice about this table's most useful clustered index column(s).
Because the FLOAT data type is imprecise (subject to your system's FPU and its floating point rounding and truncation errors, while still in accordance with IEEE 754's specifications), it can be a fatal mistake to assume every [value] will be unique, even when the decimal number (being inserted into FLOAT) appears (in decimal) to be unique. Irrational numbers must always be truncated and rounded. In decimal, PI is an example of an irrational value, which can be truncated and rounded to an imprecise value of 3.142. Similarly, the decimal number 0.1 has no finite representation in binary, which means FLOAT will not store decimal 0.1 as a precise binary value.... You might want to consider whether the domain of acceptable values offered by the NUMERIC data type can accommodate [value] (thus gaining more precise answers when compared to a use of FLOAT).
While a NUMERIC data type might require more storage space than FLOAT, the performance of a given query is often controlled by the number of levels in the (perhaps clustered) index's B-Tree (assuming an index seek can be harnessed by the query, which for your specific need is a safe assumption). A NUMERIC data type with a precision greater than 28 requires 17 bytes of storage per value. The payload of SQL Server's 8KB page is approximately 8,000 bytes, so such a NUMERIC data type will store approximately 470 values per page. Each index level therefore fans out by a factor of roughly 470, so the depth needed for 10,000,000 rows is about log(base 470) of 10,000,000, i.e. roughly 3 levels (albeit this is back of napkin math, I think it close enough). Thus searching for a specific value in a 10,000,000 row table requires reading only ~3-4 pages, i.e. ~32 KB. If a clustered index is created, the leaf level of the clustered index will contain the other NUMERIC values that are "close" to the one being sought. Since that leaf level page (and the few index pages above it) are now cached in SQL Server's buffer pool and are "hot", the next search (for values that are "close" to the value being sought) is likely to be constrained by memory access speeds (as opposed to disk access speeds). This is why a clustered index can enhance performance for your desired statement.
If the [value]'s values are not unique (perhaps due to floating point truncation and rounding errors), and if [value] has been defined as the table's clustered index, SQL Server will (under the covers) add a 4-byte "uniqueifier" to each duplicate value. A uniqueifier adds overhead (per the above math, it is less overhead than might be thought when an index can be harnessed). That overhead is another (albeit less important) reason to test. If values can instead be stored as NUMERIC, and if a use of NUMERIC would more precisely ensure persisted decimal values are indeed unique (just the way they look, in decimal), that 4-byte overhead can be eliminated by also declaring the clustered index as being unique (assuming value uniqueness is a business need). Using similar math, I am certain you will discover the index levels for a FLOAT data type are not all that different from NUMERIC.... An index B-Tree's exponential behavior is "the great leveler" :). Choosing FLOAT because it has smaller storage space than NUMERIC may not be as useful as initially thought (even if considerably more storage space is needed for the table as a whole).
You should also consider/test whether a Columnstore index would enhance performance and suit your business needs.
This is a common request coming from my clients.
It's better if you transform your float column into two integer columns (one for each part of the floating point number), and put the appropriate index on them for fast searching. For example: 12345.678 will become two columns 12345 and 678.
I've bumped into a lot of VARCHAR(1) fields in a database I've recently had to work with. I rolled my eyes: obviously the designer didn't have a clue. But maybe I'm the one who needs to learn something. Is there any conceivable reason to use a VARCHAR(1) data type rather than CHAR(1)? I would think that the RDBMS would convert the one to the other automatically.
The database is MS SQL 2K5, but evolved from Access back in the day.
Yes there is sense to it.
It is easier to define in the language: it is more consistent to let varchar allow lengths of 1-8000 than to say it needs to be 2+ or 3+ to 8000.
The VARying CHARacter aspect of VARCHAR(1) is exactly that. It may not be optimal for storage but conveys a specific meaning, that the data is either 1 char (classroom code) or blank (outside activity) instead of NULL (unknown/not-yet-classified).
Storage plays very little part in this - looking at a database schema for CHAR(1), you would almost expect that it must always have a 1-char value, the way credit cards must have 16 digits. That is simply not the case with some data, where the value can be one character or, optionally, none.
There are also differences to using VARCHAR(1) vs CHAR(1)+NULL combination for those who say tri-state [ 1-char | 0-char | NULL ] is completely useless. It allows for SQL statements like:
select activity + '-' + classroom
from ...
which would otherwise be more difficult if you use char(1)+NULL, which can convey the same information but has subtle differences.
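For example, with the default CONCAT_NULL_YIELDS_NULL behaviour a NULL classroom turns the whole concatenation into NULL, so the char(1)+NULL variant typically needs an extra wrapper (a sketch using the same hypothetical columns):
select activity + '-' + coalesce(classroom, '')
from ...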
AFAIK, No.
A VARCHAR(1) requires 3 bytes of storage: the storage size is the actual length of data entered + 2 bytes (ref.).
A CHAR(1) requires 1 byte.
From a storage perspective: A rule of thumb is, if it's less than or equal to 5 chars, consider using a fixed length char column.
A reason to avoid varchar(1) (aside from the fact that they convey poor design reasoning, IMO) is when using Linq2SQL: LINQ to SQL and varchar(1) fields
A varchar(1) can store a zero length ("empty") string. A char(1) can't as it will get padded out to a single space. If this distinction is important to you you may favour the varchar.
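A quick way to see that difference (a throwaway example; the brackets are only there to make the padding visible):
SELECT '[' + CAST('' AS varchar(1)) + ']' AS varchar_result,  -- gives []
       '[' + CAST('' AS char(1)) + ']' AS char_result;        -- gives [ ]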
Apart from that, one use case for this may be if the designer wants to allow for the possibility that a greater number of characters may be required in the future.
Altering a fixed length datatype from char(1) to char(2) means that all the table rows need to be updated and any indexes or constraints that access this column dropped first.
Making these changes to a large table in production can be an extremely time consuming operation that requires down time.
Altering a column from varchar(1) to varchar(2) is much easier as it is a metadata only change (FK constraints that reference the column will need to be dropped and recreated but no need to rebuild the indexes or update the data pages).
Moreover, the 2-bytes-per-row saving might not always materialize anyway. If the row definition is already quite long, it won't always affect the number of rows that can fit on a data page. Another case is when the compression feature in Enterprise Edition is used: the data is then stored in an entirely different way from that described in Mitch's answer, and both varchar(1) and char(1) end up stored the same way in the short data region.
@Thomas - e.g. try this table definition.
CREATE TABLE T2
(
Code VARCHAR(1),
Foo datetime2,
Bar int,
Filler CHAR(4000),
PRIMARY KEY CLUSTERED (Code, Foo, Bar)
)
INSERT INTO T2
SELECT TOP 100000 'A',
GETDATE(),
ROW_NUMBER() OVER (ORDER BY (SELECT 0)),
NULL
FROM master..spt_values v1, master..spt_values v2
CREATE NONCLUSTERED INDEX IX_T2_Foo ON T2(Foo) INCLUDE (Filler);
CREATE NONCLUSTERED INDEX IX_T2_Bar ON T2(Bar) INCLUDE (Filler);
For a varchar it is trivial to change the column definition from varchar(1) to varchar(2). This is a metadata only change.
ALTER TABLE T2 ALTER COLUMN Code VARCHAR(2) NOT NULL
If the change is from char(1) to char(2) the following steps must happen.
Drop the PK from the table. This converts the table into a heap and means all non clustered indexes need to be updated with the RID rather than the clustered index key.
Alter the column definition. This means all rows are updated in the table so that Code now is stored as char(2).
Add back the clustered PK constraint. As well as rebuilding the CI itself this means all non clustered indexes need to be updated again with the CI key as a row pointer rather than the RID.
I have a table storing millions of rows. It looks something like this:
Table_Docs
ID, Bigint (Identity col)
OutputFileID, int
Sequence, int
…(many other fields)
We find ourselves in a situation where the developer who designed it made the OutputFileID the clustered index. It is not unique. There can be thousands of records with this ID. It has no benefit to any processes using this table, so we plan to remove it.
The question is what to change it to… I have two candidates. The ID identity column is a natural choice. However, we have a process which runs a lot of UPDATE commands on this table, and it uses the Sequence to do so. The Sequence is non-unique: most Sequence values have only one record, but about 20% can have two or more records with the same Sequence.
The INSERT app is a VB6 piece of crud throwing thousands of insert commands at the table. The inserted values are never in any particular order, so the Sequence of one insert may be 12345 and the next could be 12245. I know that this could cause SQL to move a lot of data to keep the clustered index in order. However, the Sequence values of the inserts are generally close to being in order, and all inserts would take place at the end of the clustered table. E.g.: I have 5 million records with Sequence spanning 1 to 5 million. The INSERT app will be inserting sequences at the end of that range at any given time. Reordering of the data should be minimal (tens of thousands of records at most).
Now, the UPDATE app is our .NET star. It does all UPDATEs on the Sequence column. “Update Table_Docs Set Field1=This, Field2=That…WHERE Sequence =12345” – hundreds of thousands of these a day. The UPDATEs are completely and totally random, touching all points of the table.
All other processes are simply doing SELECTs on this table (web pages). Regular indexes cover those.
So my question is, what's better: a unique clustered index on the ID column, benefiting the INSERT app, or a non-unique clustered index on the Sequence, benefiting the UPDATE app?
First off, I would definitely recommend having a clustered index!
Secondly, your clustered index should be:
narrow
static (never or hardly ever change)
unique
ever-increasing
so an INT IDENTITY is a very well thought out choice.
When your clustering key is not unique, SQL Server will add a 4-byte uniqueifier to those column values - thus making your clustering key and with it all non-clustered indices on that table larger and less optimal.
So in your case, I would pick the ID - it's narrow, static, unique and ever-increasing - can't be more optimal than that! Since the Sequence is used heavily in UPDATE statements, definitely put a non-clustered index on it, too!
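In DDL terms that would look something like this (a sketch only; constraint and index names are made up, and it assumes the existing clustered index on OutputFileID has already been dropped):
-- clustered PK on the identity column, then the supporting nonclustered index
ALTER TABLE Table_Docs ADD CONSTRAINT PK_Table_Docs PRIMARY KEY CLUSTERED (ID);
CREATE NONCLUSTERED INDEX IX_Table_Docs_Sequence ON Table_Docs (Sequence);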
See Kimberly Tripp's excellent blog posts on choosing the right clustering key for great background info on the topic.
As a general rule, you want your clustered index to be unique. If it is not, SQL Server will in fact add a hidden "uniqueifier" to it to force it to be unique, and this adds overhead.
So, you are probably best using the ID column as your index.
Just as a side note, using an identity column as your primary key is normally referred to as a surrogate key, since it is not inherent in your data. When you have a unique natural key available, that is probably a better choice. In this case it looks like you do not, so using the unique surrogate key makes sense.
The worst thing about the inserts out of order is page splits.
When SQL Server needs to insert a new record into an existing index page and finds no place there, it takes half the records from the page and moves them into a new one.
Say, you have these records filling the whole page:
1 2 3 4 5 6 7 8 9
and need to insert 10. In this case, SQL Server will just start a new page.
However, if you have this:
1 2 3 4 5 6 7 8 11
then 10 should go before 11. In this case, SQL Server will move the records from 6 onwards (6, 7, 8 and 11) into the new page, together with the newly inserted 10:
6 7 8 10 11
The old page, as can easily be seen, will remain only about half filled (only records 1 to 5 stay there).
This will increase the index size.
Let's create two sample tables:
CREATE TABLE perfect (id INT NOT NULL PRIMARY KEY, stuffing VARCHAR(300))
CREATE TABLE almost_perfect (id INT NOT NULL PRIMARY KEY, stuffing VARCHAR(300))
;
WITH q(num) AS
(
SELECT 1
UNION ALL
SELECT num + 1
FROM q
WHERE num < 200000
)
INSERT
INTO perfect
SELECT num, REPLICATE('*', 300)
FROM q
OPTION (MAXRECURSION 0)
;
WITH q(num) AS
(
SELECT 1
UNION ALL
SELECT num + 1
FROM q
WHERE num < 200000
)
INSERT
INTO almost_perfect
SELECT num + CASE num % 5 WHEN 0 THEN 2 WHEN 1 THEN 0 ELSE 1 END, REPLICATE('*', 300)
FROM q
OPTION (MAXRECURSION 0)
EXEC sp_spaceused N'perfect'
EXEC sp_spaceused N'almost_perfect'
name            rows     reserved    data        index_size  unused
perfect         200000   66960 KB    66672 KB    264 KB      24 KB
almost_perfect  200000   128528 KB   128000 KB   496 KB      32 KB
Even with only 20% probability of the records being out of order, the table becomes twice as large.
On the other hand, having a clustered key on Sequence will cut the I/O for the updates in half (since each update can be done with a single clustered index seek rather than a nonclustered index seek followed by a lookup).
So I'd take a sample subset of your data, insert it into the test table with a clustered index on Sequence and measure the resulting table size.
If it is less than twice the size of the same table with an index on ID, I'd go for the clustered index on Sequence (since the total resulting I/O will be less).
If you decide to create a clustered index on Sequence, make ID a nonclustered PRIMARY KEY and make the clustered index UNIQUE on (Sequence, ID). This will use the meaningful ID instead of an opaque uniqueifier.
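In DDL terms, something like this (a sketch only; constraint and index names are made up, and it assumes the existing clustered index on OutputFileID has already been dropped):
-- clustered index first, so the nonclustered PK is built only once
CREATE UNIQUE CLUSTERED INDEX IX_Table_Docs_Sequence_ID ON Table_Docs (Sequence, ID);
ALTER TABLE Table_Docs ADD CONSTRAINT PK_Table_Docs PRIMARY KEY NONCLUSTERED (ID);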