Pre-allocate varbinary(max) without actually sending null data to the SQL Server? - sql-server-2005

I'm storing data in a varbinary(max) column and, for client performance reasons, chunking writes through the ".WRITE()" function using SQL Server 2005. This works great but, due to the side effects, I want to avoid the varbinary column dynamically sizing during each append.
What I'd like to do is optimize this by pre-allocating the varbinary column to the size I want. For example if I'm going to drop 2MB into the column I would like to 'allocate' the column first, then .WRITE the real data using offset/length parameters.
Is there anything in SQL that can help me here? Obviously I don't want to send a null byte array to the SQL server, as this would partially defeat the purpose of the .WRITE optimization.

If you're using a (MAX) data type, then anything above 8K goes into row overflow storage, not the in-page storage. So you just need to put in enough data to get it up to the 8K for the row, making that take up the in-page allocation for the row, and the rest goes into row-overflow storage anyway. There's some more here.
If you want to pre-allocate everything, including the row overflow data, you can use something akin to (example does 10000 bytes):
SELECT CONVERT([varbinary](MAX), REPLICATE(CONVERT(varchar(MAX), '0'), 10000))

First of all kudos to the answer provided - this was a great help! However, there is one slight change that you may want to consider. The code above actually allocates the varbinary field with a converted zero character (hex code 0x30). This may not be what you actually want, particularly if you want to perform binary operations on the field later. What I think is more useful is to allocate the field with a NUL value (hex code 0x00) so that all the bits are turned off by default. To do this, simply make the following correction:
SELECT CONVERT([varbinary](MAX), REPLICATE(CONVERT(varchar(MAX), CHAR(0)), 10000))

Related

nvarchar(max) - how to speed up getting only meaningful string in SQL

I have a table with a column with Nvarchar(Max). The Column is 90% percent of the time having a string length between 255 and 500. Some go well over 22000 which aren't required as its XML of something that the business wont ever use for reporting purpose. Anyways to cut a long story short was the best way to trim out all the excess bulk. I have tried the usual
left(column,500)
and
substring(column,1,500)
I have set the destination column to be 500 length.
However loading the table from source to target destination takes a while just because of that column alone. I am doing the in SSIS in the Source. I also gone to the output column and ignored truncation. Is there anyway I can reduce the time take loading this column. These methods seem to take as much as loading the full length. Any suggestion will be greatly appreciated.
NVARCHAR(MAX) (even when using a function like SUBSTRING or LEFT) will cost a lot of memory and will fill up your buffers quickly. Check the DefaultBufferMaxSize and also the properties BLOBTempStoragePath and BufferTempStoragePath setting them to an optimal value might increase the performance but note that you have so configure them accordingly because it is like a double edged sword.
Also If Source and Destination are on differents servers, the network could also be an issue because all data has to to from your SQL server via the network to your SSIS server. You could try changing the Network Packet Size
More info are provided in these links
Set BLOBTempStoragePath and BufferTempStoragePath to Fast Drives
Troubleshooting Package Performance
Perfomance Issue with NVarchar(MAX) in SSIS

Is varchar(128) better than varchar(100)

Quick question. Does it matter from the point of storing data if I will use decimal field limits or hexadecimal (say 16,32,64 instead of 10,20,50)?
I ask because I wonder if this will have anything to do with clusters on HDD?
Thanks!
VARCHAR(128) is better than VARCHAR(100) if you need to store strings longer than 100 bytes.
Otherwise, there is very little to choose between them; you should choose the one that better fits the maximum length of the data you might need to store. You won't be able to measure the performance difference between them. All else apart, the DBMS probably only stores the data you send, so if your average string is, say, 16 bytes, it will only use 16 (or, more likely, 17 - allowing 1 byte for storing the length) bytes on disk. The bigger size might affect the calculation of how many rows can fit on a page - detrimentally. So choosing the smallest size that is adequate makes sense - waste not, want not.
So, in summary, there is precious little difference between the two in terms of performance or disk usage, and aligning to convenient binary boundaries doesn't really make a difference.
If it would be a C-Program I'd spend some time to think about that, too. But with a database I'd leave it to the DB engine.
DB programmers spent a lot of time in thinking about the best memory layout, so just tell the database what you need and it will store the data in a way that suits the DB engine best (usually).
If you want to align your data, you'll need exact knowledge of the internal data organization: How is the string stored? One, two or 4 bytes to store the length? Is it stored as plain byte sequence or encoded in UTF-8 UTF-16 UTF-32? Does the DB need extra bytes to identify NULL or > MAXINT values? Maybe the string is stored as a NUL-terminated byte sequence - then one byte more is needed internally.
Also with VARCHAR it is not neccessary true, that the DB will always allocate 100 (128) bytes for your string. Maybe it stores just a pointer to where space for the actual data is.
So I'd strongly suggest to use VARCHAR(100) if that is your requirement. If the DB decides to align it somehow there's room for extra internal data, too.
Other way around: Let's assume you use VARCHAR(128) and all things come together: The DB allocates 128 bytes for your data. Additionally it needs 2 bytes more to store the actual string length - makes 130 bytes - and then it could be that the DB aligns the data to the next (let's say 32 byte) boundary: The actual data needed on the disk is now 160 bytes 8-}
Yes but it's not that simple. Sometimes 128 can be better than 100 and sometimes, it's the other way around.
So what is going on? varchar only allocates space as necessary so if you store hello world in a varchar(100) it will take exactly the same amount of space as in a varchar(128).
The question is: If you fill up the rows, will you hit a "block" limit/boundary or not?
Databases store their data in blocks. These have a fixed size, for example 512 (this value can be configured for some databases). So the question is: How many blocks does the DB have to read to fetch each row? Rows that span several block will need more I/O, so this will slow you down.
But again: This doesn't depend on the theoretical maximum size of the columns but on a) how many columns you have (each column needs a little bit of space even when it's empty or null), b) how many fixed width columns you have (number/decimal, char), and finally c) how much data you have in variable columns.

[My]SQL VARCHAR Size and Null-Termination

Disclaimer: I'm very new to SQL and databases in general.
I need to create a field that will store a maximum of 32 characters of text data. Does "VARCHAR(32)" mean that I have exactly 32 characters for my data? Do I need to reserve an extra character for null-termination?
I conducted a simple test and it seems that this is a WYSIWYG buffer. However, I wanted to get a concrete answer from people who actually know what they're doing.
I have a C[++] background, so this question is raising alarm bells in my head.
Yes, you have 32 characters at your disposal. SQL does not concern itself with nul terminated strings like some programming languages do.
Your VARCHAR specification size is the max size of your data, so in this case, 32 characters. However, VARCHARS are a dynamic field, so the actual physical storage used is only the size of your data, plus one or two bytes.
If you put a 10-character string into a VARCHAR(32), the physical storage will be 11 or 12 bytes (the manual will tell you the exact formula).
However, when MySQL is dealing with result sets (ie. after a SELECT), 32 bytes will be allocated in memory for that field for every record.

Database vs. Front-End for Output Formatting

I've read that (all things equal) PHP is typically faster than MySQL at arithmetic and string manipulation operations. This being the case, where does one draw the line between what one asks the database to do versus what is done by the web server(s)? We use stored procedures exclusively as our data-access layer. My unwritten rule has always been to leave output formatting (including string manipulation and arithmetic) to the web server. So our queries return:
unformatted dates
null values
no calculated values (i.e. return values for columns "foo" and "bar" and let the web server calculate foo*bar if it needs to display value foobar)
no substring-reduced fields (except when shortened field is so significantly shorter that we want to do it at database level to reduce result set size)
two separate columns to let front-end case the output as required
What I'm interested in is feedback about whether this is generally an appropriate approach or whether others know of compelling performance/maintainability considerations that justify pushing these activities to the database.
Note: I'm intentionally tagging this question to be dbms-agnostic, as I believe this is an architectural consideration that comes into play regardless of one's specific dbms.
I would draw the line on how certain layers could rotate out in place for other implementations. It's very likely that you will never use a different RDBMS or have a mobile version of your site, but you never know.
The more orthogonal a data point is, the closer it should be to being released from the database in that form. If on every theoretical version of your site your values A and B are rendered A * B, that should be returned by your database as A * B and never calculated client side.
Let's say you have something that's format heavy like a date. Sometimes you have short dates, long dates, English dates... One pure form should be returned from the database and then that should be formatted in PHP.
So the orthogonality point works in reverse as well. The more dynamic a data point is in its representation/display, the more it should be handled client side. If a string A is always taken as a substring of the first six characters, then have that be returned from the database as pre-substring'ed. If the length of the substring depends on some factor, like six for mobile and ten for your web app, then return the larger string from the database and format it at run time using PHP.
Usually, data formatting is better done on client side, especially culture-specific formatting.
Dynamic pivoting (i. e. variable columns) is also an example of what is better done on client side
When it comes to string manipulation and dynamic arrays, PHP is far more powerful than any RDBMS I'm aware of.
However, data formatting can use additional data which is also kept in the database. Like, the coloring info for each row can be stored in additional table.
You should then correspond the color to each row on database side, but wrap it into the tags on PHP side.
The rule of thumb is: retrieve everything you need for formatting in as few database round-trips as possible, then do the formatting itself on the client side.
I believe in returning the data pretty much as-is from the database and letting it be formatted on the front-end instead. I don't stick to it religously, but in general I think it's better as it provides greater flexibility - e.g. 1 sproc can service n different requirements for data, each of which can format the data as each individually needs. Otherwise, you end up either with multiple queries returning the same data with slightly different formatting from the DB (from a SQL Server point of view, thus reducing execution plan caching benefits - therefore negative impact on performance).
Leave output formatting to the web server

How do you handle BLOB and numerical data efficiently in database communication?

SQL databases seem to be the cornerstone of most software. However, it seems optimized for textual data. In fact when doing any queries involving numerical data, integers specifically, it seems inefficient that the numbers are getting converted to text and then back to native formats both ways between the application and the database. This same inefficiency seems to apply to BLOB data as well. My understanding is that even with something like Linq to SQL, this two way conversion is occuring in the background.
Are there general ways to bypass this overhead with SQL? Are there certain database management systems that handle this more efficiently than others (ie, with non-standard extensions/API's)?
Clarification. In the following select statement, the list of numbers after IN could be more easily passed as a raw array of int, but there seems to be no way of achieving that optimization level.
SELECT foo FROM bar WHERE baz IN (23, 34, 45, 9854004, ...)
Don't suppose. Measure.
Format conversion is not likely to be a measurable cost for database work, unless you are misusing the database as an arithmetic engine.
The IO cost for LOBs, especially for CLOBS with character conversion, can become significant; the remedy here, once you know that the simplest thing that might work actually has a noticeable performance impact, is to minimize the number of times you copy the LOB data. Use whatever SQL parameter binding style allows you to transfer the data directly between its point of creation or use, and the database -- often this is binding the LOB to a stream or I/O channel.
But don't do this until you have a way to measure the impact, and have measurements showing that this is your bottleneck.
Numerical data in a database is not stored as text. I guess it depends on the database, but it certainly doesn't have to be and isn't.
BLOBs are stored exactly how you set them -- by definition, the DB has no way to interpret the information -- I guess it could compress if it found that to be useful. BLOBs are not translated into text.
Here's how Oracle stores numbers:
http://download.oracle.com/docs/cd/B28359_01/server.111/b28318/datatype.htm#i16209
Internal Numeric Format
Oracle Database stores numeric data in variable-length format. Each value is stored in scientific notation, with 1 byte used to store the exponent and up to 20 bytes to store the mantissa. The resulting value is limited to 38 digits of precision. Oracle Database does not store leading and trailing zeros. For example, the number 412 is stored in a format similar to 4.12 x 102, with 1 byte used to store the exponent(2) and 2 bytes used to store the three significant digits of the mantissa(4,1,2). Negative numbers include the sign in their length.
MySQL info here:
http://dev.mysql.com/doc/refman/5.0/en/numeric-types.html
Look at the table -- a TINYINT is represented in 1 byte (range -128 - 127), not possible if stored as text.
EDIT: With the clarification -- I would say use the API in your language that looks something like this (pseudocode)
stmt = conn.Prepare("SELECT * FROM TABLE where x in (?, ?, ?)");
stmt.SetInt(0, x);
stmt.SetInt(1, y);
stmt.SetInt(2, z);
I don't believe that the underlying protocols use text for the transport of parameters.