Does it make a difference in SQL Server whether to use a TinyInt or Bit? Both in size and query performance

Does it make a difference in SQL Server whether to use a TinyInt or Bit? Both in size and query performance - sql

I have a table that has 124,387,133 rows each row has 59 columns and of those 59, 18 of the columns are TinyInt data type and all row values are either 0 or 1. Some of the TinyInt columns are used in indexes and some are not.
My question will it make a difference on query performance and table size if I change the tinyint to a bit?

In case you don't know, a bit uses less space to store information than a TinyInt (1 bit against 8 bits). So you would save space changing to bit, and in theory the performance should be better. Generally is hard to notice such performance improvement but with the amount of data you have, it might actually make a difference, I would test it in a backup copy.

Actually,it's good to use the right data type..below are the benefits i could see when you use bit data type
1.Buffer pool savings,page is read into memory from storage and less memory can be allocated
2.Index key size will be less,so more rows can fit into one page and there by less traversing
Also you can see storage space savings as immediate benefit

In theory yes, in practise the difference is going to be subtle, the 18 bit fields get byte packed and rounded up, so it changes to 3 bytes. Depending on nullability / any nullability change the storage cost again will change. Both types are held within the fixed width portion of the row. So you will drop from 18 bytes to 3 bytes for those fields - depending on the overall size of the row versus the page size you may squeeze an extra row onto the page. (The rows/page density is where the performance gain will show up primarily, if you are to gain)
This seems like a premature micro-optimization however, if you are suffering from bad performance, investigate that and gather the evidence supporting any changes. Making type changes on existing systems should be carefully considered, if you cause the need for a code change, which prompts a full regression test etc, the cost of the change rises dramatically - for very little end result. (The production change on a large dataset will also not be rapid, so you can factor in some downtime in the cost as well to make this change)

You would be saving about 15 bytes per record, for a total of 1.8 Gbytes.
You have 41 remaining fields. If I assume that those are 4-byte integers, then your current overall size is about 22 Gbytes. The overall savings is less than 10% -- and could be much less if the other fields are larger.
This does mean that a full table scan would be about 10% faster, so that gives you a sense of the performance gain and magnitude.
I believe that bit fields require an extra operation or two to mask the bits and read -- trivial overhead that is measured in nanoseconds these days -- but something to keep in mind.
The benefits of a smaller page size are that more records fit on a single page, so the table occupies less space in memory (assuming all is read in at once) and less space on disk. Smaller data does not always mean improved query performance. Here are two caveats:
If you are reading a single record, then the entire page needs to be read into cache. It is true that you are a bit less likely to get a cache miss with a warm cache, but overall reading a single record from a cold cache would be the same.
If you are reading the entire table, SQL Server actually reads pages in blocks and implements some look-ahead (also called read-ahead or prefetching). If you are doing complex processing, you might not even notice the additional I/O time, because I/O operations can run in parallel with computing.
For other operations such as deletes and updates, locking is sometimes done at the page level. In these cases, sparser pages can be associated with better performance.

Related

SQL Query Performance with an nvarchar(500) where the MAX(LEN(column)) < 30 [duplicate]

I've read up on this on MSDN forums and here and I'm still not clear. I think this is correct: Varchar(max) will be stored as a text datatype, so that has drawbacks. So lets say your field will reliably be under 8000 characters. Like a BusinessName field in my database table. In reality, a business name will probably always be under (pulling a number outta my hat) 500 characters. It seems like plenty of varchar fields that I run across fall well under the 8k character count.
So should I make that field a varchar(500) instead of varchar(8000)? From what I understand of SQL there's no difference between those two. So, to make life easy, I'd want to define all my varchar fields as varchar(8000). Does that have any drawbacks?
Related: Size of varchar columns (I didn't feel like this one answered my question).

One example where this can make a difference is that it can prevent a performance optimization that avoids adding row versioning information to tables with after triggers.
This is covered by Paul White here
The actual size of the data stored is immaterial – it is the potential
size that matters.
Similarly if using memory optimised tables since 2016 it has been possible to use LOB columns or combinations of column widths that could potentially exceed the inrow limit but with a penalty.
(Max) columns are always stored off-row. For other columns, if the data row size in the table definition can exceed 8,060 bytes, SQL Server pushes largest variable-length column(s) off-row. Again, it does not depend on amount of the data you store there.
This can have a large negative effect on memory consumption and performance
Another case where over declaring column widths can make a big difference is if the table will ever be processed using SSIS. The memory allocated for variable length (non BLOB) columns is fixed for each row in an execution tree and is per the columns' declared maximum length which can lead to inefficient usage of memory buffers (example). Whilst the SSIS package developer can declare a smaller column size than the source this analysis is best done up front and enforced there.
Back in the SQL Server engine itself a similar case is that when calculating the memory grant to allocate for SORT operations SQL Server assumes that varchar(x) columns will on average consume x/2 bytes.
If most of your varchar columns are fuller than that this can lead to the sort operations spilling to tempdb.
In your case if your varchar columns are declared as 8000 bytes but actually have contents much less than that your query will be allocated memory that it doesn't require which is obviously inefficient and can lead to waits for memory grants.
This is covered in Part 2 of SQL Workshops Webcast 1 downloadable from here or see below.
use tempdb;
CREATE TABLE T(
id INT IDENTITY(1,1) PRIMARY KEY,
number int,
name8000 VARCHAR(8000),
name500 VARCHAR(500))
INSERT INTO T
(number,name8000,name500)
SELECT number, name, name /*<--Same contents in both cols*/
FROM master..spt_values
SELECT id,name500
FROM T
ORDER BY number
SELECT id,name8000
FROM T
ORDER BY number

From a processing standpoint, it will not make a difference to use varchar(8000) vs varchar(500). It's more of a "good practice" kind of thing to define a maximum length that a field should hold and make your varchar that length. It's something that can be used to assist with data validation. For instance, making a state abbreviation be 2 characters or a postal/zip code as 5 or 9 characters. This used to be a more important distinction for when your data interacted with other systems or user interfaces where field length was critical (e.g. a mainframe flat file dataset), but nowadays I think it's more habit than anything else.

There are some disadvantages to large columns that are a bit less obvious and might catch you a little later:
All columns you use in an INDEX - must not exceed 900 bytes
All the columns in an ORDER BY clause may not exceed 8060 bytes. This is a bit difficult to grasp since this only applies to some columns. See SQL 2008 R2 Row size limit exceeded for details)
If the total row size exceeds 8060 bytes, you get a "page spill" for that row. This might affect performance (A page is an allocation unit in SQLServer and is fixed at 8000 bytes+some overhead. Exceeding this will not be severe, but it's noticable and you should try to avoid it if you easily can)
Many other internal datastructures, buffers and last-not-least your own varaibles and table-variables all need to mirror these sizes. With excessive sizes, excessive memory allocation can affect performance
As a general rule, try to be conservative with the column width. If it becomes a problem, you can easily expand it to fit the needs. If you notice memory issues later, shrinking a wide column later may become impossible without losing data and you won't know where to begin.
In your example of the business names, think about where you get to display them. Is there really space for 500 characters?? If not, there is little point in storing them as such. http://en.wikipedia.org/wiki/List_of_companies_of_the_United_States lists some company names and the max is about 50 characters. So I'd use 100 for the column max. Maybe more like 80.

Apart from best practices (BBlake's answer)
You get warnings about maximum row size (8060) bytes and index width (900 bytes) with DDL
DML will die if you exceed these limits
ANSI PADDING ON is the default so you could end up storing a wholeload of whitespace

Ideally you'd want to go smaller than that, down to a reasonably sized length (500 isn't reasonably sized) and make sure the client validation catches when the data is going to be too large and send a useful error.
While the varchar isn't actually going to reserve space in the database for the unused space, I recall versions of SQL Server having a snit about database rows being wider than some number of bytes (do not recall the exact count) and actually throwing out whatever data didn't fit. A certain number of those bytes were reserved for things internal to SQL Server.

Optimal fill factor to prevent fragmentation

Currently I use a daily job to REORGANIZE 1000+ indexes with > 5% and < = 30% fragmentation and REBUILD indexes with > 30% fragmentation:
https://msdn.microsoft.com/en-us/library/ms189858.aspx
All indexes are rebuild with a fill factor of 80%, but based on my latest check, the fragmentation levels of 100+ indexes are unchanged. Most of them with a high fragmentation. I tried to play with the fill factor values in a test environment, but unfortunately can't simulate the production environment.
I'm wondering if finding the 'best' fill-factor for each individual index is a good idea?

[is] finding the 'best' fill-factor for each individual index is a good idea?
If the options are:
Keep the current global 80% FILLFACTOR
or
Find the best FILLFACTOR for each table
then absolutely YES, find the most appropriate value for each table. Of course, had there been an option for:
Put everything back to the default FILLFACTOR of 0 (same thing as 100) and apply a lower value--determined per table--to only those tables that should benefit from it
then I would have chosen #3 :-). Why? Because fragmentation and fillfactor can both be a bit complicated and tricky. And setting a globally low (80 is "low" given that the default is 100) value probably has a negative impact on a larger group of tables than the benefit you might be getting on the tables where it makes sense to have it.
Consider:
Fragmentation is one of several factors that influence performance: And this particular factor is a trade-off with the size of the table since it affects how many rows fit on a data page. The fewer rows on a data page means more pages need to be read from disk (not quick) to satisfy queries, and those pages will take up more memory (i.e. the Buffer Pool). In fact, there are a great many negative effects resulting from tables being larger than they should be, such as index maintenance / backup / restore / update stats / etc operations taking longer than they should.
Setting the fill factor too low on large tables means that the tables will be even larger. The increase in disk reads and size required in the Buffer Pool needs to be balanced with the types of operations against the table. Singleton operations aren't affected so much by fragmentation, so if that is by far the majority use case, then you can err on the side of reducing the number of data pages required by the table. If you have a lot of range operations then you might need to err on the side of less fragmentation.
Data access patterns: Is the table being mostly appended to? If INSERTs are happening at the end of the table only, then fragmentation can only really occur if updates are occurring that either increase the size of rows with variable-length datatypes, or if the row moves position due to a change in value of 1 or more key fields.
Also, deleting large amounts of rows can cause fragmentation. This happens when no rows are left on the data page. This is a situation where fragmentation not only cannot be mitigated by lowering the FILLFACTOR (even if all other conditions are favorable for lowering it), but would seem to actually be made worse by lowering it. If deletes occur frequently enough to leave empty data pages, then reducing the number of rows on those pages would increase the rate at which they become empty (i.e. between 3 data pages mostly filled with 500 rows each, and 5 data pages--with a lower FILLFACTOR--filled with only 300 rows each, deleting 700 rows will leave 1 empty data page in the first scenario but 2 empty data pages in the second scenario). And more empty data pages means more "unused" space.
Row size: A table with a row size of 100 bytes will have little "wasted" space due to trying to maintain a particular fillfactor. Meaning, if wanting to fill a page 80%, then a small row size will probably lead to actually filling the page 78% (as an example). But a row size of 3500 bytes will lead to only 1 row per page, which is really just under 50% used. And in the end, how many rows do you think need to be "reserved" for out of sequence inserts or rows that expand in size? A row size of 3500 bytes would only fit 1 more row on the page anyway so not much was really saved. A row size of 100 bytes on the other hand would reserve space for quite a few rows, and this is good, but only if it will be used.
Data distribution across the entire table: Meaning, let's say you have a table with 100 million rows. And let's also say that this table does allow for non-sequential inserts and/or updates that expand the size of the row. If the locations of the inserts or updates that could cause fragmentation are evenly distributed (or at least cover 50% of the table), then a lower FILLFACTOR could be useful. But, if the inserts and/or updates are confined to the most recent 5 million rows, then why reserve free space across the first 95 million rows when it will never be used? For example, if you have a table that is ordered on a DATETIME field, holds data for several years, and changes only occur in the most recent 2 months, then you might as well use 100%.
FILLFACTOR only applies when creating or rebuilding indexes: Newly created data pages (including those created from pages splits) will fill to 100% (or as close as it can get). Meaning, if you insert a lot of data such that several (or many) new datapages are created, and the inserts are done sequentially such that there is no fragmentation at the end of the inserts, but then somehow the rows are updated in such a way as to cause fragmentation, or maybe new inserts happen that are spread among the rows inserted a moment ago, then there is no way to prevent that fragmentation anyway (at least not without doing a REBUILD after every group of inserts, and that is just silly).
Hence, the situations that truly benefit from a lower (than the default 100%--expressed as 0) FILLFACTOR are far fewer than those that benefit from the default. So set them all back to 100 (or 0) and look for tables that fit the following profile:
Not small. This is very subjective, but I would think anything under 10,000 rows can be ignored (i.e. get the default)
Row size is under 1000 bytes (maybe even less than 1000?). If you are only reserving space for 1 or 2 rows then you are doing more harm than good.
Data access patterns that can cause fragmentation: non-sequential inserts, and updates that expand the size of the row or cause its location to move.
Be careful to consider how much of the fragmentation is being caused by deletes that leave empty data pages. This type of fragmentation is adversely affected by lowering the FILLFACTOR, so deletes should make up, at most, a small proportion of the fragmentation.
Data distribution that results in the fragmentation getting distributed somewhat evenly across the index instead of being confined to 40% or less of it
Keep in mind:
Like many (or most?) other optimizations, the effects are proportional to the scale of the system. Small systems won't see much of an effect, but the larger the tables get, the more noticeable proper vs improper settings become.
It is certainly possible that a system naturally behaves in such a way that the "optimal" FILL FACTOR for all tables somehow does end up being the same--whether 80% or some other value. I am not sure how probable it is that such a system exists, but it is certainly within the realm of possibilities.

Postgres performance improvement and checklist

I'm studing a series of issues related to performance of my application written in Java, which has about 100,000 hits per day and each visit on average from 5 to 10 readings/writings on the 2 principale database tables (divided equally) whose cardinality is for both between 1 and 3 million records (i access to DB via hibernate).
My two main tables store user information (about 60 columns of type varchar, integer and timestamptz) and another linked to the data to be displayed (with about 30 columns here mainly varchar, integer, timestamptz).
The main problem I encountered may have had a drop in performance of my site (let's talk about time loads over 5 seconds which obviously does not depend only on the database performance), is the use of FillFactor which is currently the default value of 100 (that it's used always when data not changing..).
Obviously fill factor it's same on index (there are 10 for each 2 tables of type btree)
Currently on my main tables I make
40% select operations
30% update operations
20% operations insert
10% delete operations.
My database is also made up of 40 other tables of minor importance (there is just others 3 with same cardinality of user).
My questions are:
How do you find the right value of the fill factor to be set ?
Which can be a checklist of tasks to be checked to improve the performance
of a database of this kind?
Database is on server dedicated (16GB Ram, 8 Core) and storage it's on SSD disk (data are backupped all days and moved on another storage)

You have likely hit the "knee" of your memory usage where the entire index of the heavily used tables no longer fits in shared memory, so disk I/O is slowing it down. Confirm by checking if disk I/O is higher than normal. If so, try increasing shared memory (shared_buffers), or if that's already maxed, adjust the system shared memory size or add more system memory so you can bump it higher. You'll also probably have to start adjusting temp buffers, work memory and maintenance memory, and WAL parameters like checkpoint_segments, etc.
There are some perf tuning hints on PostgreSQL.org, and Google is your friend.
Edit: (to address the first comment) The first symptom of not-enough-memory is a big drop in performance, everything else being the same. Changing the table fill factor is not going to make a difference if you hit a knee in memory usage, if anything it will make it worse w.r.t. load times (which I assume means "db reads") because row information will be expanded across more pages on disk with blank space in each page thus more disk I/O is needed for table scans. But fill factor less than 100% can help with UPDATE operations, but I've found adjusting WAL parameters can compensate most of the time when using indexes (unless you've already optimized those). Bottom line, you need to profile all the heavy queries using EXPLAIN to see what will help. But at first glance, I'm pretty certain this is a memory issue even with the database on an SSD. We're talking a lot of random reads and random writes and a lot of SSDs actually get worse than HDDs after a lot of random small writes.

Is there an advantage on setting tinyint fields when I know that the value will not exceed 255?

Should I choose the smallest datatype possible, or if I am storing the value 1 for example, it doesn't matter what is the col datatype and the value will occupy the same memory size?
The question is also, cuz I will always have to convert it and play around in the application.
UPDATE
I think that varchar(1) and varchar(50) is the same memory size if value is "a", I thought it's the same with int and tinyint, according to the answers I understand it's not, is it?

Always choose the smallest data type possible. SQL can't guess what you want the maximum value to be, but it can optimize storage and performance once you tell it the data type.
To answer your update:
varchar does take up only as much space as you use and so you're right when you say that the character "a" will take up 1 byte (in latin encoding) no matter how large a varchar field you choose. That is not the case with any other type of field in SQL.
However, you will likely be sacrificing efficiency for space if you make everything a varchar field. If everything is a fixed-size field then SQL can do a simple constant-time multiplication to find your value (like an array). If you have varchar fields in there, then the only way to find out where you data is stored it to go through all the previous fields (like a linked list).
If you're beginning SQL then I advise just to stay away from varchar fields unless you expect to have fields that sometimes have very small amounts of text and sometimes very large amounts of text (like blog posts). It takes experience to know when to use variable length fields to the best effect and even I don't know most of the time.

It's a performance consideration particular to the design of your system. In general, the more data you can fit into a page of Sql Server data, the better the performance.
One page in Sql Server is 8k. Using tiny ints instead of ints will enable you to put more data into a single page but you have to consider whether or not it's worth it. If you're going to be serving up thousands of hits a minute, then yes. If this is a hobby project or something that just a few dozen users will ever see, then it doesn't matter.

The advantage is there but might not be significant unless you have lots of rows and performs los of operation. There'll be performance improvement and smaller storage.

Traditionally every bit saved on the page size would mean a little bit of speed improvement: narrower rows means more rows per page, which means less memory consumed and fewer IO requests, resulting in better speed. However, with SQL Server 2008 Page compression things start to get fuzzy. The compression algorithm may compress 4 byte ints with values under 255 on even less than a byte.
Row compression algorithms will store a 4 byte int on a single byte for values under 127 (int is signed), 2 bytes for values under 32768 and so on and so forth.
However, given that the nice compression features are only available on Enterprise Edition servers, it makes sense to keep the habit of using the smallest possible data type.

is there an advantage to varchar(500) over varchar(8000)?

I've read up on this on MSDN forums and here and I'm still not clear. I think this is correct: Varchar(max) will be stored as a text datatype, so that has drawbacks. So lets say your field will reliably be under 8000 characters. Like a BusinessName field in my database table. In reality, a business name will probably always be under (pulling a number outta my hat) 500 characters. It seems like plenty of varchar fields that I run across fall well under the 8k character count.
So should I make that field a varchar(500) instead of varchar(8000)? From what I understand of SQL there's no difference between those two. So, to make life easy, I'd want to define all my varchar fields as varchar(8000). Does that have any drawbacks?
Related: Size of varchar columns (I didn't feel like this one answered my question).

One example where this can make a difference is that it can prevent a performance optimization that avoids adding row versioning information to tables with after triggers.
This is covered by Paul White here
The actual size of the data stored is immaterial – it is the potential
size that matters.
Similarly if using memory optimised tables since 2016 it has been possible to use LOB columns or combinations of column widths that could potentially exceed the inrow limit but with a penalty.
(Max) columns are always stored off-row. For other columns, if the data row size in the table definition can exceed 8,060 bytes, SQL Server pushes largest variable-length column(s) off-row. Again, it does not depend on amount of the data you store there.
This can have a large negative effect on memory consumption and performance
Another case where over declaring column widths can make a big difference is if the table will ever be processed using SSIS. The memory allocated for variable length (non BLOB) columns is fixed for each row in an execution tree and is per the columns' declared maximum length which can lead to inefficient usage of memory buffers (example). Whilst the SSIS package developer can declare a smaller column size than the source this analysis is best done up front and enforced there.
Back in the SQL Server engine itself a similar case is that when calculating the memory grant to allocate for SORT operations SQL Server assumes that varchar(x) columns will on average consume x/2 bytes.
If most of your varchar columns are fuller than that this can lead to the sort operations spilling to tempdb.
In your case if your varchar columns are declared as 8000 bytes but actually have contents much less than that your query will be allocated memory that it doesn't require which is obviously inefficient and can lead to waits for memory grants.
This is covered in Part 2 of SQL Workshops Webcast 1 downloadable from here or see below.
use tempdb;
CREATE TABLE T(
id INT IDENTITY(1,1) PRIMARY KEY,
number int,
name8000 VARCHAR(8000),
name500 VARCHAR(500))
INSERT INTO T
(number,name8000,name500)
SELECT number, name, name /*<--Same contents in both cols*/
FROM master..spt_values
SELECT id,name500
FROM T
ORDER BY number
SELECT id,name8000
FROM T
ORDER BY number

From a processing standpoint, it will not make a difference to use varchar(8000) vs varchar(500). It's more of a "good practice" kind of thing to define a maximum length that a field should hold and make your varchar that length. It's something that can be used to assist with data validation. For instance, making a state abbreviation be 2 characters or a postal/zip code as 5 or 9 characters. This used to be a more important distinction for when your data interacted with other systems or user interfaces where field length was critical (e.g. a mainframe flat file dataset), but nowadays I think it's more habit than anything else.

Apart from best practices (BBlake's answer)
You get warnings about maximum row size (8060) bytes and index width (900 bytes) with DDL
DML will die if you exceed these limits
ANSI PADDING ON is the default so you could end up storing a wholeload of whitespace

Ideally you'd want to go smaller than that, down to a reasonably sized length (500 isn't reasonably sized) and make sure the client validation catches when the data is going to be too large and send a useful error.
While the varchar isn't actually going to reserve space in the database for the unused space, I recall versions of SQL Server having a snit about database rows being wider than some number of bytes (do not recall the exact count) and actually throwing out whatever data didn't fit. A certain number of those bytes were reserved for things internal to SQL Server.

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas