Best practice for creating a table from a CSV file & using varchar(max) - azure-sql-database

I have a question regarding creating a table in Azure SQL Database by importing a flat file (CSV).
I am having an issue with the column lengths in the table I created (if the data exceeds the defined length, I get an error), so I decided to go with varchar(max) for all columns and allow NULL on all of them.
I don't think allowing NULL is an issue, but I am curious about the implications of using varchar(max) everywhere.
I also noticed that when I create the table with a fixed length (for example, varchar(500)), the overall size of the table gets quite big.
But when I tried varchar(max), the size of the table was much smaller.
Why does using varchar(max) create a smaller storage size?
Does using (max) truncate unused space in SQL?

Does using varchar(max) truncate unused space in SQL?
The space used by either varchar(n) or varchar(max) depends on the length of the input data. Referring to the official Microsoft documentation, varchar is intended for input string data that differs in size, and the amount of space taken to store this data is given in the document.
So both varchar(n) and varchar(max), by definition, allocate space based on how much the input data requires.
It is recommended to use varchar(n) unless the stored values can exceed 8,000 bytes.
Why does using varchar(max) create a smaller storage size?
The following is a demonstration I did to verify whether using varchar(500) versus varchar(max) is the reason for the different storage sizes of the tables.
I also imported a flat file to create two different tables: country_data_500, where all columns are varchar(500) allowing NULLs, and country_data_max, where all columns are varchar(max) allowing NULLs.
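As a rough sketch of that setup (the column names here are assumptions, since the original flat file layout isn't shown), the two tables look something like this:

CREATE TABLE country_data_500 (
    country_name VARCHAR(500) NULL,   -- column names are placeholders
    capital      VARCHAR(500) NULL,
    population   VARCHAR(500) NULL
);

CREATE TABLE country_data_max (
    country_name VARCHAR(MAX) NULL,
    capital      VARCHAR(MAX) NULL,
    population   VARCHAR(MAX) NULL
);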
Now when I look at the size of these tables (the size reports for country_data_500 and for country_data_max), they both use the same amount of space.
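sp_spaceused is one way to get the table sizes directly; a minimal sketch, assuming the two tables above:

EXEC sp_spaceused 'country_data_500';  -- reserved/data/index sizes for the varchar(500) table
EXEC sp_spaceused 'country_data_max';  -- reports the same figures for the varchar(max) table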
I used sys.dm_db_index_physical_stats to get details about the physical statistics of these tables. I ran the following query in my Azure SQL database:
SELECT OBJECT_NAME([object_id]) AS TableName,
       alloc_unit_type_desc,
       record_count,
       min_record_size_in_bytes,
       max_record_size_in_bytes
FROM sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'DETAILED')
WHERE OBJECT_NAME([object_id]) LIKE 'country_data%';
From all the above results, it can be concluded that the choice between varchar(500) and varchar(max) is not the reason for the different table storage sizes.
Please recheck your procedure, because there might be some other underlying issue causing your tables to take up different amounts of space (more for varchar(500) and less for varchar(max)).

Related

Does changing varbinary(MAX) to varbinary(300) make any difference on the physical disk space?

First of all, excuse me for the grammar, as English is not my primary key :)
I am trying to find out if changing varbinary(max) to varbinary(300) reduces the physical disk space used by the table.
We have very limited physical disk space and are trying to optimize everywhere, including the columns in the database.
We have >100 columns (in different tables with millions of rows) with the varbinary(max) data type used for storing encrypted values, and we don't need the max length because the values fit within 300 bytes.
Is there any gain in disk space if we switch to varbinary(300)?
Does varbinary(max) preallocate all of its required disk space when creating the table or inserting data into that column?
Does a varbinary(max) column take all of its disk space even if it holds data shorter than 300 bytes?
I haven't been able to find anything anywhere except the following line:
"The storage size is the actual length of the data entered + 2 bytes."
https://learn.microsoft.com/en-us/sql/t-sql/data-types/binary-and-varbinary-transact-sql
Any help in answering the above 3 questions would be appreciated. Thanks.
Is there any gain in disk space if we switch to varbinary(300)?
If all of your binary values fit into 300 bytes, there will be no change at all in terms of space used. That's because these values are already stored in-row.
The format in which SQL Server stores the data from the (MAX) columns, such as varchar(max), nvarchar(max), and varbinary(max), depends on the actual data size. SQL Server stores it in-row when possible. When in-row allocation is impossible, and the data size is less than or equal to 8,000 bytes, it is stored as row-overflow data. The data that exceeds 8,000 bytes is stored as LOB data.
So when you change it to varbinary(300), it will be an instantaneous operation with only a metadata change.
Does varbinary(max) preallocate all its required disk space when creating the table or inserting data into that column?
No, it doesn't.
Variable-length data types, such as varchar, varbinary, and a few others, use as much storage space as is required to store the data, plus two extra bytes.
Does a varbinary(max) column take all its disk space even if it has data with length < 300?
No. As said above, it will take exactly the size needed to store the actual value, so if you put a 3-byte value into a varbinary(max) column it will use only 5 bytes, and if you put a NULL value it will use only 2 bytes (and if the NULL value is in the last column it will not take any space at all).
When you turn varbinary(max) into varbinary(300), as I said, no data changes (metadata only). But if you then update the row with a new value, the old column value is dropped and a new one is created, so not only will you not gain space, you will waste space, because the space used by the old value is not released; it is only marked as dropped.
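If you do make the change and want to reclaim that leftover space afterwards, DBCC CLEANTABLE or an index rebuild is the usual route. A minimal sketch, where the database, table, and column names are placeholders:

ALTER TABLE dbo.encrypted_values ALTER COLUMN payload VARBINARY(300) NULL;  -- metadata-only change

-- Reclaim space left behind by dropped variable-length column data:
DBCC CLEANTABLE ('MyDatabase', 'dbo.encrypted_values');

-- Alternatively, rebuilding the indexes compacts the table as well:
ALTER INDEX ALL ON dbo.encrypted_values REBUILD;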
Literature:
Microsoft SQL Server 2012 Internals (Developer Reference) by Kalen Delaney
Pro SQL Server Internals by Dmitri Korotkevitch

SQL Query Performance with an nvarchar(500) where the MAX(LEN(column)) < 30 [duplicate]

I've read up on this on the MSDN forums and here, and I'm still not clear. I think this is correct: varchar(max) will be stored like a text data type, so that has drawbacks. So let's say your field will reliably be under 8,000 characters, like a BusinessName field in my database table. In reality, a business name will probably always be under (pulling a number outta my hat) 500 characters. It seems like plenty of the varchar fields that I run across fall well under the 8k character count.
So should I make that field a varchar(500) instead of varchar(8000)? From what I understand of SQL there's no difference between those two. So, to make life easy, I'd want to define all my varchar fields as varchar(8000). Does that have any drawbacks?
Related: Size of varchar columns (I didn't feel like this one answered my question).
One example where this can make a difference is that it can prevent a performance optimization that avoids adding row versioning information to tables with AFTER triggers.
This is covered by Paul White here:
The actual size of the data stored is immaterial – it is the potential size that matters.
Similarly, if you are using memory-optimised tables, since 2016 it has been possible to use LOB columns or combinations of column widths that could potentially exceed the in-row limit, but with a penalty.
(Max) columns are always stored off-row. For other columns, if the data row size in the table definition can exceed 8,060 bytes, SQL Server pushes largest variable-length column(s) off-row. Again, it does not depend on amount of the data you store there.
This can have a large negative effect on memory consumption and performance
Another case where over-declaring column widths can make a big difference is if the table will ever be processed using SSIS. The memory allocated for variable-length (non-BLOB) columns is fixed for each row in an execution tree and is based on the columns' declared maximum length, which can lead to inefficient usage of memory buffers (example). Whilst the SSIS package developer can declare a smaller column size than the source, this analysis is best done up front and enforced there.
Back in the SQL Server engine itself, a similar case is that when calculating the memory grant to allocate for SORT operations, SQL Server assumes that varchar(x) columns will on average consume x/2 bytes.
If most of your varchar columns are fuller than that, this can lead to the sort operations spilling to tempdb.
In your case, if your varchar columns are declared as 8,000 bytes but actually have contents much shorter than that, your query will be allocated memory that it doesn't require, which is obviously inefficient and can lead to waits for memory grants.
This is covered in Part 2 of SQL Workshops Webcast 1, downloadable from here, or see the demonstration below.
use tempdb;

CREATE TABLE T (
    id INT IDENTITY(1,1) PRIMARY KEY,
    number INT,
    name8000 VARCHAR(8000),
    name500 VARCHAR(500));

INSERT INTO T (number, name8000, name500)
SELECT number, name, name /*<-- same contents in both columns*/
FROM master..spt_values;

-- Same data and same sort, but the query returning the varchar(8000) column
-- is granted far more memory than the one returning the varchar(500) column.
SELECT id, name500
FROM T
ORDER BY number;

SELECT id, name8000
FROM T
ORDER BY number;
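To actually see the difference, compare the memory grants of those two SELECTs, either in the "Memory Grant" property of the actual execution plans or, as a rough sketch, by querying the memory grants DMV from another session while they run:

SELECT session_id, requested_memory_kb, granted_memory_kb, used_memory_kb
FROM sys.dm_exec_query_memory_grants;  -- the varchar(8000) sort requests far more memory than it actually uses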
From a processing standpoint, it will not make a difference to use varchar(8000) vs varchar(500). It's more of a "good practice" kind of thing to define a maximum length that a field should hold and make your varchar that length. It's something that can be used to assist with data validation. For instance, making a state abbreviation be 2 characters or a postal/zip code as 5 or 9 characters. This used to be a more important distinction for when your data interacted with other systems or user interfaces where field length was critical (e.g. a mainframe flat file dataset), but nowadays I think it's more habit than anything else.
There are some disadvantages to large columns that are a bit less obvious and might catch you a little later:
All the columns you use in an index key together must not exceed 900 bytes.
All the columns in an ORDER BY clause may not exceed 8,060 bytes. This is a bit difficult to grasp since it only applies to some columns (see SQL 2008 R2 Row size limit exceeded for details).
If the total row size exceeds 8,060 bytes, you get a "page spill" (row overflow) for that row. This might affect performance (a page is an allocation unit in SQL Server and is fixed at roughly 8,000 bytes plus some overhead; exceeding this will not be severe, but it's noticeable and you should try to avoid it if you easily can).
Many other internal data structures, buffers and, last but not least, your own variables and table variables all need to mirror these sizes. With excessive sizes, excessive memory allocation can affect performance.
As a general rule, try to be conservative with the column width. If it becomes a problem, you can easily expand it to fit the needs. If you notice memory issues later, shrinking a wide column may become impossible without losing data, and you won't know where to begin.
In your example of the business names, think about where you get to display them. Is there really space for 500 characters? If not, there is little point in storing them as such. http://en.wikipedia.org/wiki/List_of_companies_of_the_United_States lists some company names and the maximum is about 50 characters. So I'd use 100 for the column max, maybe more like 80.
Apart from best practices (BBlake's answer):
You get warnings about maximum row size (8,060 bytes) and index width (900 bytes) from DDL (see the sketch after this list)
DML will die if you exceed these limits
ANSI_PADDING ON is the default, so you could end up storing a whole load of whitespace
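A minimal sketch of that DDL warning; the table and column names are made up. The CREATE INDEX succeeds but SQL Server warns that the key could exceed the maximum index key length, and an insert of a value that actually exceeds the limit then fails:

CREATE TABLE wide_demo (big_name VARCHAR(2000));
CREATE INDEX IX_wide_demo ON wide_demo (big_name);
-- Warning: the index key could exceed the maximum key length
-- (900 bytes for clustered keys, 1700 bytes for nonclustered keys on recent versions);
-- inserting a big_name value longer than that limit would then fail at runtime.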
Ideally you'd want to go smaller than that, down to a reasonably sized length (500 isn't reasonably sized), and make sure the client validation catches when the data is going to be too large and sends a useful error.
While a varchar isn't actually going to reserve space in the database for the unused length, I recall versions of SQL Server having a snit about database rows being wider than some number of bytes (I do not recall the exact count) and actually throwing out whatever data didn't fit. A certain number of those bytes were reserved for things internal to SQL Server.

Impact of altering table column size/length in SQL Server

Consider that I have a VARCHAR(MAX) column. If I change it to VARCHAR(500), will Microsoft SQL Server decrease the size claimed by the table?
If you have any link, just comment it; I'll check it out.
Update:
I've tested the following two cases with the table.
ALTER the column size
Create a new table and import data from the old table
Initial table size:
ALTER TABLE table_transaction ALTER COLUMN column_name VARCHAR(500)
After the ALTER COLUMN, the table size is increased:
Create a new table with the new column size and import data from the old table:
I've taken care of the indexes in the new table.
Why is the table size increased in the case of ALTER COLUMN? Ideally, the table size should decrease.
After performing defragmentation on the PK in the original table, a few MB were freed. However, it's not as effective as creating a new table.
When you change a varchar(n) column to varchar(MAX), or vice versa, SQL Server will update every row in the table. This will temporarily increase the table size until you rebuild the clustered index or execute DBCC CLEANTABLE.
For the ongoing space requirements of a varchar(MAX) column, the space will be the same as varchar(n) as long as the value stays in-row. However, if the value exceeds 8,000 bytes, it will be stored on separate LOB page(s) dedicated to that value. This increases space requirements and requires extra I/O when a query needs the value.
A good rule of thumb is to use MAX types only if the value may exceed 8,000 bytes, and otherwise specify a maximum length appropriate to the domain of the data being stored.
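A sketch of the whole sequence using the table and column names from the question ('YourDb' is a placeholder for the database name); it is the rebuild, not the ALTER itself, that releases the space:

EXEC sp_spaceused 'table_transaction';                                 -- size before
ALTER TABLE table_transaction ALTER COLUMN column_name VARCHAR(500);  -- touches every row
ALTER INDEX ALL ON table_transaction REBUILD;                         -- or: DBCC CLEANTABLE ('YourDb', 'table_transaction');
EXEC sp_spaceused 'table_transaction';                                 -- size after the rebuild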
According to the documentation, there is no difference in the storage of strings:
varchar [ ( n | max ) ]
Variable-length, non-Unicode string data. n defines the string length and can be a value from 1 through 8,000. max indicates that the maximum storage size is 2^31-1 bytes (2 GB). The storage size is the actual length of the data entered + 2 bytes.
As I read this, the storage size is the actual length plus two bytes regardless of whether you use n or max.
I am suspicious about this. I would expect the length of varchar(max) to occupy four bytes. And there might be additional overhead for storing off-page references (if they exist). However, the documentation is pretty clear on this point.
Whether changing the data type changes the size of the data depends on what is already stored. You can have several situations:
If all the values are NULL, then there will be no change at all; the values are not being stored.
If all the values are less than 20 bytes, then, according to the documentation, there would be no change. I have a nagging suspicion that you might save 2 bytes per value, but I can't find a reference for it and don't have SQL Server on hand today to check.
If values exceed 20 bytes but remain on the page, then you will save space because the stored values will change.
If the values go off-page, then you will save the header information as well as truncating the data (thank you Dan for pointing this out).

Methods to Accelerate Reads From a Large Table

I try to log everything in SQL, so I thought to add a table named log and put everything in it. The log table is:
ID UNIQUEIDENTIFIER -- PK
LogDate DATETIME PK
IP NVARCHAR
Action NVARCHAR
Info XML
UniqueID BIGINT
I log everything like login, check permission, see pages, access object and so on to this table.
Then I figured I also need some log-restore implementation, so some log records are restorable and some are not. The log table has about 8 million records, but the restorable records are about 200 thousand, so every time we need to restore, we have to select over 8 million rows. So I decided to add a new table, log_restore, and put the restorable logs in it:
ID UNIQUEIDENTIFIER
LogDate DATETIME
IP NVARCHAR
Action NVARCHAR
Info XML
UniqueID BIGINT -- PK
OK, when I need to log, everything is fine.
But when I need to see the logs, the procedure gets all records from the log table and merges (UNIONs) them with the log_restore table.
So I need to accelerate this procedure with no effect on inserts (meaning: do not make them slower). These are my ideas:
When adding a record to log_restore, add it to the log table as well (so the select needs no UNION)
Create a view with this select command
Add simple data type columns instead of XML
Add a clustered PK on a simple data type column like BIGINT
What are your ideas? Any suggestions?
In general, one should try to use as little space as possible; it greatly helps reduce disk seeks when executing a query, and comparing smaller data types always requires less time!
The following tunings can be made on the columns:
use non-nullable columns (decreases storage space, diminishes the number of tests)
store LogDate in the form of a timestamp (UNSIGNED INT, 4 bytes) instead of DATETIME (8 bytes)
IP addresses shouldn't be stored as NVARCHAR; if you are storing IPv4 addresses, 4 bytes would be enough (BINARY(4)). IPv6 support requires 18 bytes (VARBINARY(16)). Meanwhile, NVARCHAR would require 30 bytes for IPv4 and 78 bytes for IPv6. (Search the web for inet_ntoa, inet_aton, inet_ntop, inet_pton to learn how to switch between binary and string representations of the addresses.)
instead of storing similar data in two separate tables, add a Restorable flag column of type BIT indicating whether a log entry can or cannot be restored
your idea about the Info column is right: it would be better to use a TEXT or NTEXT data type
instead of using a NVARCHAR type for Action, you could consider having an Action table containing all the possible actions (assuming they are in finite number), and referencing them with an integer foreign key (the smaller the int, the better)
Index optimization is very important too. Use an index on multiple columns if your query tests multiple columns at the same time. For example, if you select all the restorable rows corresponding to a specific IP over a certain range of time, this would greatly enhance the speed of the query:
CREATE NONCLUSTERED INDEX IX_IndexName ON log (Restorable ASC, IP ASC, LogDate ASC)
If you need to retrieve all the restorable rows from an IP address corresponding to a specific action, over a given range of time, the index should be chosen as such:
CREATE NONCLUSTERED INDEX IX_IndexName ON log (IP ASC, Action ASC, LogDate ASC)
Etc.
To be honest, I would actually need to see your full SQL query in order to do proper optimization...
Options for table enhancements:
Add a Restorable BIT NULL column and create a filtered index on it (see the sketch after this list).
The XML data type is a LOB data type and is stored outside the row. If you are not using any of the XML data type methods, then you do not need it, and it hampers your performance a lot. Add an XML_code varchar() NULL column and copy all the data from your XML column into it.
Choose the column lengths so as to keep the maximum row size (the total maximum size of all columns) below 8 KB. A varchar(MAX) column may be stored in-row if the row fits in 8 KB, so if you have a significant number of short XML values, varchar(MAX) can help.
If you are not working with Unicode data, then change all NVARCHAR columns to VARCHAR.
Use UNION ALL with a WHERE clause to filter duplicates instead of just UNION.
A UNIQUEIDENTIFIER column does not help you here. If two records cannot have the same datetime (or maybe datetime2) value, then that can be your unique ID on its own. Alternatively, consider changing the ID column to int, as you can order by an int in a sensible manner.
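A minimal sketch of the first suggestion (the Restorable flag plus filtered index), assuming the flag is added to the existing log table; the index name and included columns are illustrative:

ALTER TABLE log ADD Restorable BIT NULL;

-- Filtered index: only the ~200k restorable rows are indexed, not all 8 million.
CREATE NONCLUSTERED INDEX IX_log_restorable
ON log (LogDate)
INCLUDE (IP, Action, UniqueID)
WHERE Restorable = 1;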
Regarding your thoughts: number (4) will not help. Create indexes on both tables that follow your WHERE clauses and JOIN columns.
Make several iterations: simplify data types, check performance; create index(es), check again; and so on. There should be a balance between minimizing the space used and usability. You may want to keep some data as text rather than encoding it to int or binary.
Use Profiler or the Database Engine Tuning Advisor to determine bottlenecks and improvement opportunities.
I'm going to make a wild guess that the property that identifies whether something is "restorable" or not is the Action column. If that is the case, then partition the table by that column and forget about the log_restore table.
MSDN - Partitioned Table and Index Concepts
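A rough sketch of that partitioning idea; the partition function, scheme, and the IsRestorable flag (derived from Action at insert time) are all made-up names, and the column lengths are assumptions:

CREATE PARTITION FUNCTION pf_log_restorable (TINYINT)
AS RANGE LEFT FOR VALUES (0);   -- values <= 0 go to partition 1, values >= 1 to partition 2

CREATE PARTITION SCHEME ps_log_restorable
AS PARTITION pf_log_restorable ALL TO ([PRIMARY]);

CREATE TABLE log_partitioned (
    ID           UNIQUEIDENTIFIER NOT NULL,
    LogDate      DATETIME         NOT NULL,
    IP           NVARCHAR(45)     NULL,
    Action       NVARCHAR(100)    NULL,
    Info         XML              NULL,
    UniqueID     BIGINT           NOT NULL,
    IsRestorable TINYINT          NOT NULL,   -- derived from Action when inserting
    CONSTRAINT PK_log_partitioned PRIMARY KEY (UniqueID, IsRestorable)  -- the partition column must be part of the unique key
) ON ps_log_restorable (IsRestorable);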
The first thing you have to worry about is the memory of the machine: how much does the server have? Then you should compare that with your database size, or maybe just the size of the table you are working on. If the memory is too low compared with the size of the table, then you have to add more memory to the server. That's the first thing you have to do.
A Sysadmin’s Guide to Microsoft SQL Server Memory

SQL Server varchar(50) and varchar(128) performance difference [duplicate]

Possible Duplicate:
is there an advantage to varchar(500) over varchar(8000)?
I am currently working on a table which has lots of columns with varchar(50). The data we now have to insert in some columns is above 50 characters, so we have to change the column size from 50 to 128 for those columns. Since we have a lot of columns, it's a waste of time to change the columns individually.
So I proposed to my team: why don't we change all the columns to varchar(128)? Some of my teammates argued that this will cause a performance hit during select and join operations.
Now, I am not an expert on databases, but I don't think moving from varchar(50) to varchar(128) will cause any significant performance hit.
P.S. We don't have any name, surname, or address kind of data in those columns.
varchar(50) and varchar(128) will behave pretty much identically from every point of view. The storage size is identical for values under 50 characters. They can be joined interchangeably (varchar(50) joined with varchar(128)) without type conversion issues (i.e. an index on varchar(50) can seek a varchar(128) column in a join), and the same applies to WHERE predicates. Prior to SQL Server 2012, increasing the size of a varchar column was a very fast metadata-only operation; after SQL Server 2012 this operation may be a slow size-of-data, update-each-record operation under certain conditions, similar to those described in Adding a nullable column can update the entire table.
Some issues can arise from any column length change:
application issues from handling unexpected value sizes. Native apps may run into buffer size issues if improperly coded (i.e. a larger size can cause a buffer overflow). Managed apps are unlikely to have serious issues, but minor issues like values not fitting in column widths on screen or in reports may occur.
T-SQL errors from truncating values on insert or update
T-SQL silent truncation occurring and resulting in incorrect values (e.g. @variables declared as varchar(50) in a stored proc; see the sketch after this list)
Limits like max row size or max index size may be reached. E.g. if you have today a composite index on 8 columns of type varchar(50), extending them to varchar(128) will exceed the max index key size of 900 bytes and trigger warnings.
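A minimal sketch of the silent-truncation pitfall (the variable and value here are made up):

DECLARE @name varchar(50);
SET @name = REPLICATE('x', 60);  -- a 60-character value assigned to a varchar(50) variable
SELECT LEN(@name);               -- returns 50: the value was silently cut off, with no error raised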
Martin's warning about memory grants increasing is a very valid concern. I would just buy more RAM if that would indeed turn out to be an issue.
Please check here:
Performance Comparison of varchar(max) vs varchar(n)
How many such columns are there? The best-practice rule says you need to plan each column carefully and define its size accordingly. You should identify the columns suitable for varchar(128) and not just increase the size of all columns blindly.
I would suggest changing only the columns which need to be changed. If you do not need more than 50 characters in a column, do not change it.
Why are you interested in making the length the same for all columns? I doubt all columns have the same length requirements.