Quick question: does it matter, from a data-storage point of view, whether I use round decimal field limits or powers of two (say 16, 32, 64 instead of 10, 20, 50)?
I ask because I wonder whether this has anything to do with clusters on the HDD.
Thanks!
VARCHAR(128) is better than VARCHAR(100) if you need to store strings longer than 100 bytes.
Otherwise, there is very little to choose between them; you should pick the one that better fits the maximum length of the data you might need to store. You won't be able to measure a performance difference between them. All else aside, the DBMS probably stores only the data you send it, so if your average string is, say, 16 bytes, it will use only 16 bytes on disk (or, more likely, 17, allowing 1 byte for storing the length). The bigger declared size might affect the calculation of how many rows fit on a page - detrimentally. So choosing the smallest size that is adequate makes sense - waste not, want not.
So, in summary, there is precious little difference between the two in terms of performance or disk usage, and aligning to convenient binary boundaries doesn't really make a difference.
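To illustrate (a minimal sketch in MySQL syntax; the table and column names are made up): the declared maximum doesn't change what gets stored for any given value.
-- Two tables differing only in the declared maximum
CREATE TABLE t100 (s VARCHAR(100));
CREATE TABLE t128 (s VARCHAR(128));
INSERT INTO t100 VALUES ('hello world');
INSERT INTO t128 VALUES ('hello world');
-- Both report 11 bytes of actual data; on disk each value also
-- carries a small length prefix (1 byte for values under 256 bytes)
SELECT LENGTH(s) FROM t100; -- 11
SELECT LENGTH(s) FROM t128; -- 11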
If this were a C program, I'd spend some time thinking about it too. But with a database, I'd leave it to the DB engine.
Database developers have spent a lot of time thinking about the best memory layout, so just tell the database what you need and it will store the data in whatever way suits the engine best (usually).
If you want to align your data, you'd need exact knowledge of the internal data organization: How is the string stored? One, two, or four bytes to store the length? Is it stored as a plain byte sequence, or encoded as UTF-8, UTF-16, or UTF-32? Does the DB need extra bytes to identify NULL or > MAXINT values? Maybe the string is stored as a NUL-terminated byte sequence - then one extra byte is needed internally.
Also, with VARCHAR it is not necessarily true that the DB will always allocate 100 (or 128) bytes for your string. It may store just a pointer to where the space for the actual data is.
So I'd strongly suggest using VARCHAR(100) if that is your requirement. If the DB decides to align it somehow, there's room for extra internal data too.
The other way around: let's assume you use VARCHAR(128) and everything comes together badly: the DB allocates 128 bytes for your data, needs 2 more bytes to store the actual string length - that makes 130 bytes - and then it might align the data to the next (let's say 32-byte) boundary: the data now takes 160 bytes on disk. 8-}
Yes, but it's not that simple. Sometimes 128 can be better than 100, and sometimes it's the other way around.
So what is going on? varchar only allocates space as necessary, so if you store 'hello world' in a varchar(100) it will take exactly the same amount of space as in a varchar(128).
The question is: If you fill up the rows, will you hit a "block" limit/boundary or not?
Databases store their data in blocks. These have a fixed size, for example 512 bytes (this value can be configured for some databases). So the question is: how many blocks does the DB have to read to fetch each row? Rows that span several blocks will need more I/O, so this will slow you down.
But again: This doesn't depend on the theoretical maximum size of the columns but on a) how many columns you have (each column needs a little bit of space even when it's empty or null), b) how many fixed width columns you have (number/decimal, char), and finally c) how much data you have in variable columns.
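If you want a sanity check on a real table, one rough approach (MySQL syntax assumed; 'your_schema' and 'your_table' are placeholders) is to ask the data dictionary for the average row length:
-- Average row width and total data size as the engine reports them
-- (for InnoDB these are estimates, but good enough to spot rows
-- that won't fit comfortably in a block/page)
SELECT table_name, avg_row_length, data_length
FROM information_schema.TABLES
WHERE table_schema = 'your_schema' AND table_name = 'your_table';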
Char and varchar are datatypes in SQL, as they are in many other languages (so this question could be multi-language).
From what I understand, the difference is that if I declared a char as CHAR(20), it would allocate 20 bytes (or is it bits? Could someone clarify that too? For now, I'll use bytes). Then if I only used 16 bytes, I would still have four allocated to that field - a waste of 4 bytes of memory.
However, if I declared a varchar as varchar(20) and only used 16 bytes, it would only allocate 16 bytes.
Surely this is better? Why would anyone choose char? Is it for legacy reasons, or is there something I'm missing?
Prefer VARCHAR.
In olden days of tight storage, it mattered for space. Nowadays, disk storage is cheap, but RAM and IO are still precious. VARCHAR is IO and cache friendly; it allows you to more densely pack the db buffer cache with data rather than wasted literal "space" space, and for the same reason, space padding imposes an IO overhead.
The upside to CHAR() used to be reduced row chaining on frequently updated records. When you update a field and the value is larger than previously allocated, the record may chain. This is manageable, however; databases often support a "percent free" setting on your table storage attributes that tells the DB how much extra space to preallocate per row for growth.
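As a concrete example, Oracle exposes this knob as PCTFREE (a sketch only; the table definition is illustrative):
-- Reserve 20% of each block for rows that grow on UPDATE,
-- reducing the chance of row chaining/migration
CREATE TABLE customers (
  id NUMBER PRIMARY KEY,
  name VARCHAR2(100)
) PCTFREE 20;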
VARCHAR is almost always preferable because space padding requires you to be aware of it and code differently. Different databases handle it differently. With VARCHAR you know your field holds only exactly what you store in it.
I haven't designed a schema in over a decade with CHAR.
From the specification:
char[(n)]
Fixed-length non-Unicode character data with length of n bytes. n must
be a value from 1 through 8,000. Storage size is n bytes. The SQL-92
synonym for char is character.
So CHAR(20) will allocate a fixed 20 bytes of space to hold the data.
Usage:
For example, suppose you have a column named Gender that should hold only 'M' for male or 'F' for female, and you are sure the column is non-null. In that case, it's much better to define it as CHAR(1), like so:
Gender CHAR(1) not null
Also, varchar types carry an extra overhead of 2 bytes, as stated in the documentation: the storage size is the actual length of the data entered + 2 bytes.
With char, that's not the case.
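A quick way to see the padding difference (SQL Server syntax; the demo table is made up):
CREATE TABLE demo (c CHAR(10), v VARCHAR(10));
INSERT INTO demo VALUES ('abc', 'abc');
-- CHAR pads to its full declared width; VARCHAR stores only the data
-- (its 2-byte length prefix is row overhead, not counted by DATALENGTH)
SELECT DATALENGTH(c) AS char_bytes, -- 10
       DATALENGTH(v) AS varchar_bytes -- 3
FROM demo;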
At university I was told to use VARCHAR(20) to store a first name. The VARCHAR type takes space depending on the string length, so is it necessary to specify a smaller length range? I'm asking because the RedBean ORM by default creates a VARCHAR(255) field for strings whose length is <= 255 chars.
Is there any difference between using VARCHAR(20) and VARCHAR(255)? Except the maximum string length that can be stored :)
I know that such questions have already been asked, but all I understand from them is that using VARCHAR(255) where it isn't necessary could cause excessive memory consumption in DB applications.
What about real-life programming? Should I use VARCHAR(255) for all short text inputs, or try to limit the length whenever possible?
Because 255 is now just an arbitrary choice for a VARCHAR length.
Explanation: Prior to MySQL 5.0.3 (give or take a few point releases - I forget) a VARCHAR column could be 255 characters in length maximum, so VARCHAR(255) was often used as a default. Now, however, you can go up to 65,535 characters on VARCHAR, so if you're still using "255" then that seems arbitrary and not well thought out (or your schema is just old).
It is better to define a realistic limit: it reduces the default database size and also helps with validation.
As far as I know, with varchar(20) you are saying that the field will contain no more than 20 characters.
First of all, determine an ideal range for the field depending on what it will hold (a name, an address, etc.). It's always more efficient to use as small a length as required.
The decision to use 20 characters is just as arbitrary as 255 unless you know your data.
You could use varchar(max) but that changes the way your data is stored and could impact performance; but without knowing more about your application and the size/volume of the data it's hard to give advice.
VARCHAR(255) is generally a bad choice unless you need that much space. Always make fields the size they need to be and no more. There are several reasons for this. One is that the row size of a record has limits. Often you can create a row definition larger than those limits, but the first time you try to enter data that exceeds them, it will fail. This is a bad thing and should be avoided by using smaller field sizes. Larger field sizes also encourage the entry of bad information. If users know they have a lot of room in a field, they will enter notes in that field instead of the data; I have seen such gems as (and this is a genuine example from a past job) "Talk to the fat girl as the blonde is useless." in fields where the length was too long. You don't want to give room for junk to be put into a field if you can help it. Bad data in means bad data out.
Wider pages can also be slower to access, so it is in the database's best interest to limit field size.
Under no circumstances should you use nvarchar(max) or varchar(max) for any string field unless you intend for some records to contain more than 4,000 characters. These fields cannot be indexed (in SQL Server; know your own database's limits when doing design), and using them indiscriminately is very bad and will make for a slower-than-slow database.
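For instance, in SQL Server this fails outright (a sketch with made-up names):
CREATE TABLE notes (id INT, body VARCHAR(MAX));
-- Fails: the column is of a type that is invalid for use
-- as a key column in an index
CREATE INDEX ix_notes_body ON notes (body);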
Names are tricky, and it can be hard to determine the right size, so some people go big. But 25-50 is more reasonable than 255. It may vary depending on the kind of names you are storing; for instance, if corporations are mixed in with people's names, the field will need to be wider. If you have a lot of foreign names to store, you need to know what the norm is for name length in those countries, as some countries have typically longer names than others. And remember, as far as first names are concerned, it makes a difference whether a person uses their middle names as well, and whether there is anywhere to store them. This is especially true for people who have more than one middle name, or who use their maiden name as a middle name but still go by their other names, such as Mary Elizabeth Annette Von Middlesworth Jamison - you can see how hard a name like that is to break up into first, middle, and last, and how the majority of it might end up in the first-name column.
I wrote a program to compute similarities among a set of 2 million documents. The program works, but I'm having trouble storing the results. I won't need to access the results often, but will occasionally need to query them and pull out subsets for analysis. The output basically looks like this:
1,2,0.35
1,3,0.42
1,4,0.99
1,5,0.04
1,6,0.45
1,7,0.38
1,8,0.22
1,9,0.76
.
.
.
Columns 1 and 2 are document ids, and column 3 is the similarity score. Since the similarity scores are symmetric I don't need to compute them all, but that still leaves me with 2000000*(2000000-1)/2 ≈ 2,000,000,000,000 lines of records.
A text file with 1 million lines of records is already 9MB. Extrapolating, that means I'd need 17 TB to store the results like this (in flat text files).
Are there more efficient ways to store these sorts of data? I could have one row for each document and get rid of the repeated document ids in the first column. But that'd only go so far. What about file formats, or special database systems? This must be a common problem in "big data"; I've seen papers/blogs reporting similar analyses, but none discuss practical dimensions like storage.
DISCLAIMER: I don't have any practical experience with this, but it's a fun exercise and after some thinking this is what I came up with:
Since you have 2,000,000 documents, you're kind of stuck with an integer for the document ids; that makes 4 bytes + 4 bytes. The similarity score seems to be between 0.00 and 1.00, and I guess a byte would do by encoding 0.00-1.00 as 0..100.
So your table would be: id1, id2, relationship_value
That brings it to exactly 9 bytes per record. Thus (without any overhead) ((2 × 10^6)^2) × 9 / 2 bytes are needed; that's about 17 TB.
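As a sketch, that basic table could look like this (TINYINT as in MySQL/SQL Server; enforcing id1 < id2 keeps only one half of the symmetric matrix):
-- One row per unordered pair of documents: 4 + 4 + 1 = 9 bytes of payload
CREATE TABLE similarity (
  id1 INT NOT NULL, -- lower document id
  id2 INT NOT NULL, -- higher document id
  score TINYINT NOT NULL, -- 0..100 encodes 0.00..1.00
  PRIMARY KEY (id1, id2),
  CHECK (id1 < id2) -- store each pair only once
);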
Of course, that's if you have just a basic table. Since you don't plan on querying it very often, I guess performance isn't that much of an issue. So you could get 'creative' by storing the values 'horizontally'.
Simplifying things, you would store the values in a 2 million by 2 million square, and each 'intersection' would be a byte representing the relationship between its coordinates. This would "only" require about 3.6 TB, but it would be a pain to maintain, and it also doesn't make use of the fact that the relations are symmetrical.
So I'd suggest a hybrid approach: a table with 2 columns. The first column would hold the 'left' document id (4 bytes); the 2nd column would hold, as a varbinary, a string of all the values for documents whose id is above the id in the first column. Since a varbinary only takes the space it needs, this helps us win back some of the space offered by the symmetry of the relationship.
In other words,
record 1 would have a string of (2,000,000 - 1) bytes as the value of the 2nd column
record 2 would have a string of (2,000,000 - 2) bytes as the value of the 2nd column
record 3 would have a string of (2,000,000 - 3) bytes as the value of the 2nd column
etc
That way you should be able to get away with something like 2 TB (including overhead) to store the information. Add compression and I'm pretty sure you could store it on a modern disk.
Of course, the system is far from optimal. In fact, querying the information will require some patience, as you can't approach things set-based and you'll pretty much have to scan things byte by byte. A nice 'benefit' of this approach is that you can easily add new documents by appending a byte to the string of EACH record, plus 1 extra record at the end. Operations like that will be costly, though, as they will result in page splits; but at least it will be possible without having to completely rewrite the table. It will cause quite a bit of fragmentation over time, and you might want to rebuild the table once in a while to make it more 'aligned' again. Ah... technicalities.
Selecting and updating will require some creative use of SubString() operations, but nothing too complex.
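For example, looking up the score between documents @a and @b (with @a < @b) might look like this (SQL Server syntax; the table and column names are just a sketch of the layout described above):
-- The score for pair (@a, @b) lives at offset (@b - @a) in the row for @a:
-- byte 1 is the score against document @a+1, byte 2 against @a+2, etc.
DECLARE @a INT = 1, @b INT = 42;
SELECT CAST(SUBSTRING(scores, @b - @a, 1) AS TINYINT) AS score
FROM doc_similarity
WHERE left_id = @a;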
PS: Strictly speaking, for 0..100 you only need 7 bits, so if you really want to squeeze the last bit out of it you could store 8 values in 7 bytes and save roughly another 250 GB, but it would make things quite a bit more complex... then again, it's not like the data is going to be human-readable anyway =)
PS: this line of thinking is completely geared towards reducing the amount of space needed while remaining practical in terms of updating the data. I'm not saying it's going to be fast; in fact, if you'd go searching for all documents that have a relation-value of 0.89 or above the system will have to scan the entire table and even with modern disks that IS going to take a while.
Mind you, all of this is the result of half an hour of brainstorming; I'm actually hoping someone will chime in with a neater approach =)
I'm storing data in a varbinary(max) column and, for client performance reasons, chunking writes through the ".WRITE()" function using SQL Server 2005. This works great but, due to the side effects, I want to avoid the varbinary column dynamically sizing during each append.
What I'd like to do is optimize this by pre-allocating the varbinary column to the size I want. For example if I'm going to drop 2MB into the column I would like to 'allocate' the column first, then .WRITE the real data using offset/length parameters.
Is there anything in SQL that can help me here? Obviously I don't want to send a null byte array to the SQL server, as this would partially defeat the purpose of the .WRITE optimization.
If you're using a (MAX) data type, then anything above 8K goes into row overflow storage, not the in-page storage. So you just need to put in enough data to get it up to the 8K for the row, making that take up the in-page allocation for the row, and the rest goes into row-overflow storage anyway. There's some more here.
If you want to pre-allocate everything, including the row overflow data, you can use something akin to (example does 10000 bytes):
SELECT CONVERT([varbinary](MAX), REPLICATE(CONVERT(varchar(MAX), '0'), 10000))
First of all kudos to the answer provided - this was a great help! However, there is one slight change that you may want to consider. The code above actually allocates the varbinary field with a converted zero character (hex code 0x30). This may not be what you actually want, particularly if you want to perform binary operations on the field later. What I think is more useful is to allocate the field with a NUL value (hex code 0x00) so that all the bits are turned off by default. To do this, simply make the following correction:
SELECT CONVERT([varbinary](MAX), REPLICATE(CONVERT(varchar(MAX), CHAR(0)), 10000))
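Putting it together, the pre-allocate-then-patch flow might look like this (SQL Server syntax; the table, column, and variable names are made up):
-- 1) Pre-allocate 2MB of NUL bytes so later writes don't regrow the value
UPDATE dbo.Blobs
SET Data = CONVERT(varbinary(MAX),
           REPLICATE(CONVERT(varchar(MAX), CHAR(0)), 2097152))
WHERE Id = @Id;
-- 2) Patch each chunk in place; .WRITE replaces DATALENGTH(@Chunk)
-- bytes starting at the zero-based @Offset
UPDATE dbo.Blobs
SET Data.WRITE(@Chunk, @Offset, DATALENGTH(@Chunk))
WHERE Id = @Id;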
Disclaimer: I'm very new to SQL and databases in general.
I need to create a field that will store a maximum of 32 characters of text data. Does "VARCHAR(32)" mean that I have exactly 32 characters for my data? Do I need to reserve an extra character for null-termination?
I conducted a simple test and it seems that this is a WYSIWYG buffer. However, I wanted to get a concrete answer from people who actually know what they're doing.
I have a C[++] background, so this question is raising alarm bells in my head.
Yes, you have all 32 characters at your disposal. SQL does not concern itself with NUL-terminated strings the way some programming languages do.
Your VARCHAR specification is the maximum size of your data, so in this case 32 characters. However, VARCHAR is a dynamic field, so the actual physical storage used is only the size of your data, plus one or two bytes.
If you put a 10-character string into a VARCHAR(32), the physical storage will be 11 or 12 bytes (the manual will tell you the exact formula).
However, when MySQL is dealing with result sets (i.e. after a SELECT), 32 bytes will be allocated in memory for that field for every record.
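A quick sanity check of the "all 32 characters are yours" claim (MySQL syntax; the value is chosen to be exactly 32 characters long):
CREATE TABLE names_demo (name VARCHAR(32));
-- All 32 characters fit; no slot for a terminator is needed
INSERT INTO names_demo VALUES ('exactly-32-characters-long-value');
SELECT CHAR_LENGTH(name) AS chars, -- 32
       LENGTH(name) AS bytes -- 32 (in a single-byte charset)
FROM names_demo;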