Oracle SQL VARCHAR2 datatype storage

From the Oracle docs:
If you give every column the maximum length or precision for its data
type, then your application needlessly allocates many megabytes of
RAM. For example, suppose that a query selects 10 VARCHAR2(4000)
columns and a bulk fetch operation returns 100 rows. The RAM that your
application must allocate is 10 x 4,000 x 100—almost 4 MB. In
contrast, if the column length is 80, the RAM that your application
must allocate is 10 x 80 x 100—about 78 KB. This difference is
significant for a single query, and your application will process many
queries concurrently. Therefore, your application must allocate the 4
MB or 78 KB of RAM for each connection.
As I understand it, VARCHAR2 is a variable-length datatype, so the DB will only allocate the space actually used by the column value; i.e. if a value is only 10 characters it will take roughly 10 bytes (more with multi-byte Unicode characters). But according to the statement above, even if the column's values are at most 10 characters, if the datatype is declared as VARCHAR2(4000), will it still occupy 4000 bytes?

The space allocated on disk will only be as long as required to store the actual data for each row.
The space allocated in memory will (in some cases) be the maximum required based on the datatype.
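To see the on-disk point concretely, here is a minimal Oracle sketch (the table and column names are made up) using VSIZE, which reports the number of bytes a stored value actually occupies:
CREATE TABLE width_demo (
  v_short VARCHAR2(10),
  v_wide  VARCHAR2(4000)
);
INSERT INTO width_demo VALUES ('abcdefghij', 'abcdefghij');
-- Both columns store the same 10-byte value (single-byte character set assumed).
SELECT VSIZE(v_short) AS short_bytes,  -- 10
       VSIZE(v_wide)  AS wide_bytes    -- also 10: the declared 4000 is irrelevant on disk
FROM width_demo;
The declared length only comes into play when something (typically a client or middle tier) sizes a buffer from the data dictionary rather than from the actual data.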

The documentation itself is wrong/misleading in several ways. The sentence right before the quoted paragraph says "...length and precision affect storage requirements." And yet, right after that, the dufus who wrote the documentation article goes on to refer to RAM. Storage means on disk; RAM is memory. Unless we are talking about an in-memory database (which that documentation article does not), it makes no sense to talk about RAM after saying something "affects storage requirements." The declared length does NOT affect storage, but it MAY affect memory allocation.
Specifically, it MAY affect memory allocation when an application (often written in general-purpose languages like Java, C#, etc.) needs to allocate memory ahead of time, when the only info it has is what's in the data dictionary. Memory can be allocated statically (at compilation time), but that means you can't use the extra info from the actual data - say, that all your strings are 100 bytes at most; all that is known AT THAT STAGE is the 4000-byte maximum. Memory can also be allocated DYNAMICALLY, and that can use the extra info - but it is MUCH, MUCH slower!
In many "interactions" between the DB and applications written in other languages, you don't even have the option of dynamic memory allocation; in the present world, the assumption is that "time" is worth much, much more than RAM, so if you find that your code runs out of memory, buy more RAM and don't worry about dynamic memory allocation. Which means that if you declare VARCHAR2(4000), you should expect that a lot of RAM will be allocated, potentially, in a wasteful way. Just declare VARCHAR2(100) if that's all you need.

The source for that interesting question is here.
The article is very clear about VARCHAR2 storage:
Oracle Database blank-pads values stored in CHAR columns but not
values stored in VARCHAR2 columns. Therefore, VARCHAR2 columns use
space more efficiently than CHAR columns.
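A tiny illustration of that quote (the table name is hypothetical): the CHAR value is blank-padded to its declared width, while the VARCHAR2 value keeps only what was inserted:
CREATE TABLE pad_demo (c CHAR(10), v VARCHAR2(10));
INSERT INTO pad_demo VALUES ('abc', 'abc');
SELECT LENGTH(c) AS char_length,      -- 10: blank-padded to the declared width
       LENGTH(v) AS varchar2_length   -- 3: only the actual characters are stored
FROM pad_demo;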
What they are saying about the RAM allocation is that your application would not know how much RAM to allocate if you had NOT defined a limit for your VARCHAR2 column. Also, if the limit is too high, it would allocate too much RAM to start with, so always choose the most efficient limit.
There is also a comprehensive article about the OCI usage of data types here.

Related

Unable to upload data even after partitioning in VoltDB

We are trying to upload 80 GB of data into 2 host servers, each with 48 GB RAM (96 GB in total). We have partitioned the table too, but even after partitioning we are able to upload only about 10 GB of data. In the VMC interface we checked the size worksheet. The number of rows in the table is 40,00,00,000 (400 million), the table's maximum size is 1,053,200,000 KB and its minimum size is 98,000,000 KB. So what is the issue in uploading 80 GB even after partitioning, and what is this table size?
The size worksheet provides minimum and maximum size in memory that the number of rows would take, based on the schema of the table. If you have VARCHAR or VARBINARY columns, then the difference between min and max can be quite substantial, and your actual memory use is usually somewhere in between, but can be difficult to predict because it depends on the actual size of the strings that you load.
But I think the issue is that the minimum size is 98 GB according to the worksheet, and that minimum corresponds to the case where any nullable strings are null and any not-null strings are empty. Even without taking into account the heap size and any overhead, this is higher than your 96 GB of capacity.
What is your kfactor setting? If it is 0, there will be only one copy of each record. If it is 1, there will be two copies of each record, so you would really need 196 GB minimum in that configuration.
The size per record in RAM depends on the datatypes chosen and if there are any indexes. Also, VARCHAR values longer than 15 characters or 63 bytes are stored in pooled memory which carries more overhead than fixed-width storage, although it can reduce the wasted space if the values are smaller than the maximum size.
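For example, here is a minimal sketch of a partitioned VoltDB table (the names and sizes are hypothetical) where the VARCHAR declarations are kept close to the real data, so that short values can stay inline rather than in pooled memory, per the note above:
CREATE TABLE events (
  event_id BIGINT       NOT NULL,
  status   VARCHAR(15),          -- values of 15 characters or fewer can stay inline
  message  VARCHAR(256),         -- longer values go to pooled memory, with extra overhead
  PRIMARY KEY (event_id)
);
PARTITION TABLE events ON COLUMN event_id;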
If you want some advice on how to minimize the per-record size in memory, please share the definition of your table and any indexes, and I might be able to suggest adjustments that could reduce the size.
You can add more nodes to the cluster, or use servers with more RAM to add capacity.
Disclaimer: I work for VoltDB.

SQL Query Performance with an nvarchar(500) where the MAX(LEN(column)) < 30 [duplicate]

I've read up on this on MSDN forums and here and I'm still not clear. I think this is correct: Varchar(max) will be stored as a text datatype, so that has drawbacks. So let's say your field will reliably be under 8000 characters. Like a BusinessName field in my database table. In reality, a business name will probably always be under (pulling a number outta my hat) 500 characters. It seems like plenty of varchar fields that I run across fall well under the 8k character count.
So should I make that field a varchar(500) instead of varchar(8000)? From what I understand of SQL there's no difference between those two. So, to make life easy, I'd want to define all my varchar fields as varchar(8000). Does that have any drawbacks?
Related: Size of varchar columns (I didn't feel like this one answered my question).
One example where this can make a difference is that it can prevent a performance optimization that avoids adding row versioning information to tables with after triggers.
This is covered by Paul White here
The actual size of the data stored is immaterial – it is the potential
size that matters.
Similarly, if you are using memory-optimised tables: since SQL Server 2016 it has been possible to use LOB columns, or combinations of column widths that could potentially exceed the in-row limit, but with a penalty.
(Max) columns are always stored off-row. For other columns, if the data row size in the table definition can exceed 8,060 bytes, SQL Server pushes largest variable-length column(s) off-row. Again, it does not depend on amount of the data you store there.
This can have a large negative effect on memory consumption and performance.
Another case where over-declaring column widths can make a big difference is if the table will ever be processed using SSIS. The memory allocated for variable-length (non-BLOB) columns is fixed for each row in an execution tree and is based on the columns' declared maximum length, which can lead to inefficient usage of memory buffers (example). Whilst the SSIS package developer can declare a smaller column size than the source, this analysis is best done up front and enforced there.
Back in the SQL Server engine itself, a similar case is that when calculating the memory grant to allocate for SORT operations, SQL Server assumes that varchar(x) columns will on average consume x/2 bytes.
If most of your varchar columns are fuller than that, this can lead to the sort operations spilling to tempdb.
In your case, if your varchar columns are declared as 8000 bytes but actually hold contents much smaller than that, your query will be allocated memory that it doesn't require, which is obviously inefficient and can lead to waits for memory grants.
This is covered in Part 2 of SQL Workshops Webcast 1 downloadable from here or see below.
use tempdb;

-- Both name columns receive identical contents; only the declared lengths differ.
CREATE TABLE T(
    id       INT IDENTITY(1,1) PRIMARY KEY,
    number   INT,
    name8000 VARCHAR(8000),
    name500  VARCHAR(500));

INSERT INTO T (number, name8000, name500)
SELECT number, name, name /*<--Same contents in both cols*/
FROM master..spt_values;

-- Compare the memory grants for these two sorts: the name8000 version is granted
-- far more memory, because the estimate is based on the declared length.
SELECT id, name500
FROM T
ORDER BY number;

SELECT id, name8000
FROM T
ORDER BY number;
From a processing standpoint, it will not make a difference to use varchar(8000) vs varchar(500). It's more of a "good practice" kind of thing to define a maximum length that a field should hold and make your varchar that length. It's something that can be used to assist with data validation. For instance, making a state abbreviation be 2 characters or a postal/zip code as 5 or 9 characters. This used to be a more important distinction for when your data interacted with other systems or user interfaces where field length was critical (e.g. a mainframe flat file dataset), but nowadays I think it's more habit than anything else.
There are some disadvantages to large columns that are a bit less obvious and might catch you a little later:
All the columns you use in an index key must not exceed 900 bytes (see the sketch after this list)
All the columns in an ORDER BY clause may not exceed 8060 bytes. This is a bit difficult to grasp since it only applies to some columns. See SQL 2008 R2 Row size limit exceeded for details.
If the total row size exceeds 8060 bytes, you get a "page spill" for that row. This might affect performance (a page is an allocation unit in SQL Server and is 8 KB, of which roughly 8060 bytes are usable per row; exceeding this is not severe, but it's noticeable and you should try to avoid it if you easily can)
Many other internal data structures, buffers and, last but not least, your own variables and table variables all need to mirror these sizes. With excessive declared sizes, excessive memory allocation can affect performance
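To make the first point concrete, here is a small T-SQL sketch (the table and index names are made up). On versions where the nonclustered index key limit is 900 bytes (newer versions raised the nonclustered limit), the index is created with a warning and an oversized key is rejected at DML time:
CREATE TABLE WideKey (val VARCHAR(1000));
CREATE INDEX IX_WideKey ON WideKey (val);
-- Warning at DDL time: the maximum key length is 900 bytes; the index is created anyway.
INSERT INTO WideKey VALUES (REPLICATE('x', 100));  -- succeeds: the actual key fits
INSERT INTO WideKey VALUES (REPLICATE('x', 901));  -- fails: the index entry exceeds 900 bytes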
As a general rule, try to be conservative with the column width. If it becomes a problem, you can easily expand the column later to fit the need; but if you notice memory issues later, shrinking a wide column may be impossible without losing data, and you won't know where to begin.
In your example of the business names, think about where you will display them. Is there really space for 500 characters? If not, there is little point in storing them as such. http://en.wikipedia.org/wiki/List_of_companies_of_the_United_States lists some company names and the max is about 50 characters, so I'd use 100 for the column max. Maybe more like 80.
Apart from best practices (BBlake's answer):
You get warnings about the maximum row size (8060 bytes) and index key width (900 bytes) at DDL time
DML will die if you exceed these limits
ANSI_PADDING ON is the default, so trailing whitespace is not trimmed and you could end up storing a whole load of it
Ideally you'd want to go smaller than that, down to a reasonably sized length (500 isn't reasonably sized), and make sure client-side validation catches data that is going to be too large and sends a useful error.
While the varchar isn't actually going to reserve space in the database for the unused length, I recall versions of SQL Server having a snit about database rows being wider than some number of bytes (I do not recall the exact count) and actually throwing out whatever data didn't fit. A certain number of those bytes were reserved for things internal to SQL Server.

Reasonable message length to be stored in Oracle Database

I have a complex process that interacts with multiple systems.
Each of these systems may produce error messages that I would like to store in a table of my Oracle database (note that I have statuses but the nature of the process is such that the errors may not always be predefined).
We are talking about hundreds of thousands of transactions each day, where 1% may result in various errors.
1) I wanted to know what a reasonable/acceptable length for the database field is, and how big a message I should be storing.
2) Memory-wise, does it really matter how large the field is defined in the database?
"Reasonable" and "acceptable" depends on the application. Assuming that you want to define the database column as a VARCHAR2 rather than a CLOB, and assuming that you aren't using 12.1 or later, you can declare the column to hold up to 4000 bytes. Is that enough for whatever error messages you need to support? Is there a lower limit on the length of an error message that you can establish? If you're producing error messages that are designed to be shown to a user, you're probably going to be generating shorter messages. If you're producing and storing stack traces, you may need to declare the column as a CLOB because 4000 bytes may not be sufficient.
What sort of memory are we talking about? On disk, a VARCHAR2 will only allocate the space that is actually required to store the data. When the block is read into the buffer cache, it will also only use the space required to store the data. If you start allocating local variables in PL/SQL, depending on the size of the field, Oracle may allocate more space than is required to store the particular data for that local variable in order to try to avoid the cost of growing and shrinking the allocation when you modify the string. If you return the data to a client application (including a middle tier application server), that client may allocate a buffer in memory based on the maximum size of the column rather than based on the actual size of the data.

Disadvantage of choosing large MAX value for varchar or varbinary

What's the disadvantage of choosing a large value for max when creating a varchar or varbinary column?
I'm using MS SQL but I assume this would be relevant to other dbs as well.
Thanks
That depends on whether it is ever reasonable to store a large amount of data in the particular column.
If you declare a column that would never properly store much data (e.g. an employee first name as a VARCHAR(1000)), you end up with a variety of problems:
Many if not most client APIs (e.g. ODBC drivers, JDBC drivers, etc.) allocate memory buffers on the client that are large enough to store the maximum size of a particular column. So even though the database only has to store the actual data, you may substantially increase the amount of memory the client application uses.
You lose the ability to drive data validation rules (or impart information about the data) from the table definition (see the sketch after this list). If the database allows 1000-character first names, every application that interacts with the database will probably end up having its own rules for how large an employee name can be. Unless this is mitigated by putting a stored procedure layer between all applications and the tables, it generally leads to various applications having various rules.
Murphy's Law states that if you allow 1000 characters, someone will eventually store 1000 characters in the column, or at least a value large enough to cause errors in one or more applications (e.g. no one checked to see whether every application's employee name field could display 1000 characters).
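To illustrate the second point, here is a hypothetical T-SQL sketch where the table definition itself carries realistic limits and a simple validation rule, instead of leaving every application to invent its own:
CREATE TABLE Employee (
  EmployeeId INT         NOT NULL PRIMARY KEY,
  FirstName  VARCHAR(50) NOT NULL,  -- a realistic cap instead of VARCHAR(1000)
  StateCode  CHAR(2)     NOT NULL,
  CONSTRAINT CK_Employee_StateCode CHECK (StateCode LIKE '[A-Z][A-Z]')  -- exactly two letters
);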
Depends on the RDBMS. IIRC, MySQL allocates a 2-byte overhead for varchars > 255 characters (to track the varchar length). MSSQL <= 2000 would allow you to declare a row size > 8060 bytes, but would fail if you tried to INSERT or UPDATE a row that actually exceeded 8060 bytes. SQL 2005[1] allows the insert, but will allocate a new page for the overflow and leave a pointer behind. This, obviously, impacts performance.
[1] varchar(max) is somewhat of a special case, but it will also allocate an overflow page if the length of the field is > 8000 bytes or the row exceeds 8060. This is with MSSQL defaults, and the behavior can change with the 'large value types out of row' table option.
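For reference, that option is set per table with sp_tableoption; a minimal example (the table name is hypothetical):
-- Push varchar(max)/varbinary(max) values off-row even when they would otherwise fit in-row.
EXEC sp_tableoption 'dbo.Documents', 'large value types out of row', 1;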
You could be adding a risk of breaking your application if large data gets in somehow (for example from an external interface) and your app isn't designed to handle it.
As a good design, you should always limit the size of the fields to a realistic value.