Reducing file size via keys/indices - vba

I'm using MS Access, so file size is a real constraint (2 gigs I think). Am I saving space in the most efficient way?
tbl1: tbl_NamesDescs
pid_NamesDescs <-autonumber
ColName <-text field, Indexed: Yes (No Duplicates)
Descs <- text field
tbl2: tbl_HistStatsSettings
pid_HistStatsSettings <-autonumber
Factor <-text field
etc... (other fields)
So using the two tables above, tbl2 has ~800k records and all of Factor's unique possibilities are listed in ColName (i.e. there is a one to many relationship relationship between ColName and Factor receptively). When I look at the tables in Datasheetview I see all of the names listed (full text) in both Factor and ColName.
Question:
Is that the best way to save space? I would think that Factor should be a list of indices (numbers, not text) corresponding to ColName.
In other words, wouldn't it be more file-space efficient to populate Factor with the pid_NamesDescs autonumers since numbers are smaller than text? If that is true, what is the best way to make this happen (either steps in MS Access or VBA is what I am after here)?
EDIT: added table names and pid names as they really exist

Yes, putting the FactorID as a number instead of text will save space. I can't really answer whether it's the "best" way, but it will definitely save space.
The easiest way to do this is to run the following query:
Update tbl2 LEFT JOIN tbl1 ON tbl2.Factor = tbl1.ColName
SET tbl2.Factor = CStr(tbl1.PID_tbl1)
WHERE Not IsNull(tbl1.ColName)
Then, in design view change the datatype of "factor" to Long. I'd also then change the name of the Field to "FactorID" and change the name of "ColName" To "Factor." I'd make some other changes to the column/table (although you may be giving fake names) names for clarity.
OR make a helper column (as a long int as you suggested in comments) and update the helper field, and then delete the original field.
Then, go into the relationships table and add a relationship between tbl1.PID_tbl1 and tbl2.FactorID
After this, Compact and Repair the database to reduce the size.
*EDIT to add portion about adding the relationship between the tables.

In addition to normalization, also check all your text fields. The default is 255 characters for short text. When you are storing less than 255 characters in a text field, make sure the field size is set to no more than what you typically store. Upon changing, perform compact and repair to reduce file size. Where possible, use the short text over long text.
Also consider a split database approach where data is on the back end and your UI and VBA in the front end.

Related

Fastest way to find string by substring in SQL?

I have huge table with 2 columns: Id and Title. Id is bigint and I'm free to choose type of Title column: varchar, char, text, whatever. Column Title contains random text strings like "abcdefg", "q", "allyourbasebelongtous" with maximum of 255 chars.
My task is to get strings by given substring. Substrings also have random length and can be start, middle or end of strings. The most obvious way to perform it:
SELECT * FROM t LIKE '%abc%'
I don't care about INSERT, I need only to do fast selects. What can I do to perform search as fast as possible?
I use MS SQL Server 2008 R2, full text search will be useless, as far as I see.
if you dont care about storage, then you can create another table with partial Title entries, beginning with each substring (up to 255 entries per normal title ).
in this way, you can index these substrings, and match only to the beginning of the string, should greatly improve performance.
If you want to use less space than Randy's answer and there is considerable repetition in your data, you can create an N-Ary tree data structure where each edge is the next character and hang each string and trailing substring in your data on it.
You number the nodes in depth first order. Then you can create a table with up to 255 rows for each of your records, with the Id of your record, and the node id in your tree that matches the string or trailing substring. Then when you do a search, you find the node id that represents the string you are searching for (and all trailing substrings) and do a range search.
Sounds like you've ruled out all good alternatives.
You already know that your query
SELECT * FROM t WHERE TITLE LIKE '%abc%'
won't use an index, it will do a full table scan every time.
If you were sure that the string was at the beginning of the field, you could do
SELECT * FROM t WHERE TITLE LIKE 'abc%'
which would use an index on Title.
Are you sure full text search wouldn't help you here?
Depending on your business requirements, I've sometimes used the following logic:
Do a "begins with" query (LIKE 'abc%') first, which will use an index.
Depending on if any rows are returned (or how many), conditionally move on to the "harder" search that will do the full scan (LIKE '%abc%')
Depends on what you need, of course, but I've used this in situations where I can show the easiest and most common results first, and only move on to the more difficult query when necessary.
You can add another calculated column on the table: titleLength as len(title) PERSISTED. This would store the length of the "title" column. Create an index on this.
Also, add another calculated column called: ReverseTitle as Reverse(title) PERSISTED.
Now when someone searches for a keyword, check if the length of keyword is same as titlelength. If so, do a "=" search. If length of keyword is less than the length of the titleLength, then do a LIKE. But first do a title LIKE 'abc%', then do a reverseTitle LIKE 'cba%'. Similar to Brad's approach - ie you do the next difficult query only if required.
Also, if the 80-20 rules applies to your keywords/ substrings (ie if most of the searches are on a minority of the keywords), then you can also consider doing some sort of caching. For eg: say you find that many users search for the keyword "abc" and this keyword search returns records with ids 20, 22, 24, 25 - you can store this in a separate table and have this indexed.
And now when someone searches for a new keyword, first look in this "cache" table to see if the search was already performed by an earlier user. If so, no need to look again in main table. Simply return results from "cache" table.
You can also combine the above with SQL Server TextSearch. (assuming you have a valid reason not to use it). But you could nevertheless use Text search first to shortlist the result set. and then run a SQL query against your table to get exact results using the Ids returned by the TExt Search as a parameter along with your keyword.
All this is obviously assuming you have to use SQL. If not, you can explore something like Apache Solr.
Create index view there is new feature in sql create index on the column that you need to search and use that view after in your search that will give your more faster result.
Use ASCII charset with clustered indexing the char column.
The charset influences the search performance because of the data
size on both ram and disk. The bottleneck is often I/O.
Your column is 255 characters long so you can use normal index on
your char field rather than full text, which is faster. Do not
select unnecessary columns in your select statement.
Lastly, add more RAM to the server and Increase cache size.
Do one thing, use primary key on specific column & index it in cluster form.
Then search using any method (wild card or = or any), it will search optimally because the table is already in clustered form, so it knows where he can find (because column is already in sorted form)

How do you know when to use varchar and when to use text in sql?

It seems like a very arbitrary decision.
Both can accomplish the same thing in most cases.
By limiting the varchar length seems to me like you're shooting yourself in the foot cause you never know how long of a field you will need.
Is there any specific guideline for choosing VARCHAR or TEXT for your string fields?
I will be using postgresql with the sqlalchemy orm framework for python.
In PostgreSQL there is no technical difference between varchar and text
You can see a varchar(nnn) as a text column with a check constraint that prohibits storing larger values.
So each time you want to have a length constraint, use varchar(nnn).
If you don't want to restrict the length of the data use text
This sentence is wrong:
By limiting the varchar length seems to me like you're shooting yourself in the foot cause you never know how long of a field you will need.
If you are saving, for example, MD5 hashes you do know how large the field is your storing and your storage becomes more efficient. Other examples are:
Usernames (64 max)
Passwords (128 max)
Zip codes
Addresses
Tags
Many more!
In brief:
Variable length fields save space, but because each field can have different length, it makes table operations slower
Fixed length fields make table operations fast, although must be large enough for the maximum expected input, so can use more space
Think of an analogy to arrays and linked lists, where arrays are fixed length fields, and linked lists are like varchars. Which is better, arrays or linked lists? Lucky we have both, because they are both useful in different situations, so too here.
In the most cases you do know what the max length of a string in a field is. In case of a first of lastname you don't need more then 255 characters for example. So by design you choose wich type to use, if you always use text you're wasting resources
Check this article on PostgresOnline, it also links to two other usefull articles.
Most problems with TEXT in PostgreSQL occur when you're using tools, applications and drivers that treat TEXT very different from VARCHAR because other databases behave very different with these two datatypes.
Database designers almost always know how many characters a column needs to hold. US delivery addresses need to hold up to 64 characters. (The US Postal Service publishes addressing guidelines that say so.) US ZIP codes are 5 characters long.
A database designer will look at representative sample data from her clients when she's specifying columns. She'll ask herself, questions like "What's the longest product name?" And when the answer is "70 characters", she won't make the column 3000 characters wide.
VARCHAR has a limit of 8k in SQL Server (I think). Most applications don't require nearly that much storage for a single column.

Full Text Searching for single characters

I have a table with a TEXT column where the contents is just strings of CSV numbers. Example ",1,76,77,115," Each string can have an arbitrary number of numbers.
I am trying to set up Full Text Indexing so that I can search this column rapidly. This works great. Instead of running queries with
where MY_COL LIKE '%,77,%' and MY_COL LIKE '%,115,%'
I can do
where CONTAINS(MY_COL,'77 and 115')
However, when I try to search for a single character it doesn't work.
where CONTAINS(MY_COL,'1')
But I know that there should be records returned! I quickly found that I need to edit the Noise file and rebuild the index. But even after doing that it still doesn't work.
Working with relational databases that way is going to hurt.
Use a proper schema. Either store the values in different rows or use an array datatype for the column.
That will make solving the problem trivial.
I fixed my own problem, although I'm not exactly sure what fixed it.
I dropped my table and populated a new one (my program does batch processing) and created a new Full Text Index. Maybe I wasn't being patient enough to allow the indexing to fully rebuild.
Agreed. How does 12,15,33 not return that record for a search for 1 with fulltext? Use an actual table schema to accomplish this.

Theory of storing a number and text in same SQL field

I have a three tables
Results:
TestID
TestCode
Value
Tests:
TestID
TestType
SysCodeID
SystemCodes
SysCodeID
ParentSysCodeID
Description
The question I have is for when the user is entering data into the results table.
The formatting code when the row gets the focus changes the value field to a dropdown combobox if the testCode is of type SystemList. The drop down has a list of all the system codes that have a parentsyscodeID of the test.SysCodeID. When the user chooses a value in the list it translates into a number which goes into the value field.
The datatype of the Results.Value field is integer. I made it an integer instead of a string because when reporting it is easier to do calculations and sorting if it is a number. There are issues if you are putting integer/decimal value into a string field. As well, when the system was being designed they only wanted numbers in there.
The users now want to put strings into the value field as well as numbers/values from a list and I'm wondering what the best way of doing that would be.
Would it be bad practice to convert the field over to a string and then store both strings and integers in the same field? There are different issues related to this one but i'm not sure if any are a really big deal.
Should I add another column into the table of string datatype and if the test is a string type then put the data the user enters into the different field.
Another option would be to create a 1-1 relationship to another table and if the user types in a string into the value field it adds it into the new table with a key of a number.
Anyone have any interesting ideas?
What about treating Results.Value as if it were a numeric ValueCode that becomes an foreign key referencing another table that contains a ValueCode and a string that matches it.
CREATE TABLE ValueCodes
(
Value INTEGER NOT NULL PRIMARY KEY,
Meaning VARCHAR(32) NOT NULL UNIQUE
);
CREATE TABLE Results
(
TestID ...,
TestCode ...,
Value INTEGER NOT NULL FOREIGN KEY REFERENCES ValueCodes
);
You continue storing integers as now, but they are references to a limited set of values in the ValueCodes table. Most of the existing values appear as an integer such as 100 with a string representing the same value "100". New codes can be added as needed.
Are you saying that they want to do free-form text entry? If that's the case, they will ruin the ability to do meaningful reporting on the field, because I can guarantee that they will not consistently enter the strings.
If they are going to be entering one of several preset strings (for example, grades of A, B, C, etc.) then make a lookup table for those strings which maps to numeric values for sorting, evaluating, averaging, etc.
If they really want to be able to start entering in free-form text and you can't dissuade them from it, add another column along the lines of other_entry. Have a predefined value that means "other" to put in your value column. That way, when you're doing reporting you can either roll up all of those random "other" values or you can simply ignore them. Make sure that you add the "other" into your SystemCodes table so that you can keep a foreign key between that and the Results table. If you don't already have one, then you should definitely consider adding one.
Good luck!
The users now want to put strings into
the value field as well as
numbers/values from a list and I'm
wondering what the best way of doing
that would be.
It sounds like the users want to add new 'testCodes'. If that is the case why not just add them to your existing testcode table and keep your existing format.
Would it be bad practice to convert
the field over to a string and then
store both strings and integers in
the same field? There are different
issues related to this one but i'm not
sure if any are a really big deal.
No it's not a big deal. Often PO numbers or Invoice numbers have numbers or a combination of letters and numbers. You are right however about the performance of the database on a number field as opposed to a string, but if you index the string field you end up with the database doing it's scans on numeric indexes anyway.
The problems you may have had with your decimals as strings probably have to do with the floating point data types in which the server essentially estimates the value of the field and only retains accuracy to a certain number of digits. This can lead to a whole host of rounding errors if you are concerned about the digits. You can avoid that issue by using currency fields or the like that have static accuracy of the decimals. lol I learned this the hard way.
Tom H. did a great job addressing everything else.
I think the easiest way to do it would be to convert Results.Value to a "string" (char, varchar, whatever). Yes, this ruins the ability to do numeric sorting (and you won't be able to do a cast or convert on the column any longer since text will be intermingled with integer values), but I think any other method would be too complex to maintain properly. (For example, in the 1-1 case you mentioned, is that integer value the actual value or a foreign key to the string table? Now we need another column to determine that.)
I would create the extra column for string values. It's not true normalization but it's the easiest to implement and to work with.
Using the same field for both numbers and strings would work to as long as you don't plan on doing anything with the numbers like summing or sorting.
The extra table approach while good from a normalization standpoint is probably overly complex.
I'd convert the value field to string and add a column indicating what the datatype should be treated as for post processing and reporting.
Sql Server at least has an IsNumeric function you can use:
ORDER BY IsNumeric(Results.Value) DESC,
CASE WHEN IsNumeric(Results.Value) = 1 THEN Len(Results.Value) ELSE 99 END,
Results.Value
One of two solutions comes to mind. It kind of depends on what you're doing with the numbers. If they just represent a choice of some kind, then pick one. If you need to do math on it (sorting, conversion, etc..) then pick another.
Change the column to be a varchar, and then either put numbers or text in it. Sorting numerically will suck, but hey, it's one column.
Have both a varchar column for the text, and an int column for the number. Use a view to hide the differences, and to control the sorting if necessary. You can coalesce the two columns together if you don't care about whether you're looking at numbers or text.

What are the use cases for selecting CHAR over VARCHAR in SQL?

I realize that CHAR is recommended if all my values are fixed-width. But, so what? Why not just pick VARCHAR for all text fields just to be safe.
The general rule is to pick CHAR if all rows will have close to the same length. Pick VARCHAR (or NVARCHAR) when the length varies significantly. CHAR may also be a bit faster because all the rows are of the same length.
It varies by DB implementation, but generally, VARCHAR (or NVARCHAR) uses one or two more bytes of storage (for length or termination) in addition to the actual data. So (assuming you are using a one-byte character set) storing the word "FooBar"
CHAR(6) = 6 bytes (no overhead)
VARCHAR(100) = 8 bytes (2 bytes of overhead)
CHAR(10) = 10 bytes (4 bytes of waste)
The bottom line is CHAR can be faster and more space-efficient for data of relatively the same length (within two characters length difference).
Note: Microsoft SQL has 2 bytes of overhead for a VARCHAR. This may vary from DB to DB, but generally, there is at least 1 byte of overhead needed to indicate length or EOL on a VARCHAR.
As was pointed out by Gaven in the comments: Things change when it comes to multi-byte characters sets, and is a is case where VARCHAR becomes a much better choice.
A note about the declared length of the VARCHAR: Because it stores the length of the actual content, then you don't waste unused length. So storing 6 characters in VARCHAR(6), VARCHAR(100), or VARCHAR(MAX) uses the same amount of storage. Read more about the differences when using VARCHAR(MAX). You declare a maximum size in VARCHAR to limit how much is stored.
In the comments AlwaysLearning pointed out that the Microsoft Transact-SQL docs seem to say the opposite. I would suggest that is an error or at least the docs are unclear.
If you're working with me and you're working with Oracle, I would probably make you use varchar in almost every circumstance. The assumption that char uses less processing power than varchar may be true...for now...but database engines get better over time and this sort of general rule has the making of a future "myth".
Another thing: I have never seen a performance problem because someone decided to go with varchar. You will make much better use of your time writing good code (fewer calls to the database) and efficient SQL (how do indexes work, how does the optimizer make decisions, why is exists faster than in usually...).
Final thought: I have seen all sorts of problems with use of CHAR, people looking for '' when they should be looking for ' ', or people looking for 'FOO' when they should be looking for 'FOO (bunch of spaces here)', or people not trimming the trailing blanks, or bugs with Powerbuilder adding up to 2000 blanks to the value it returns from an Oracle procedure.
In addition to performance benefits, CHAR can be used to indicate that all values should be the same length, e.g., a column for U.S. state abbreviations.
Char is a little bit faster, so if you have a column that you KNOW will be a certain length, use char. For example, storing (M)ale/(F)emale/(U)nknown for gender, or 2 characters for a US state.
Does NChar or Char perform better that their var alternatives?
Great question. The simple answer is yes in certain situations. Let's see if this can be explained.
Obviously we all know that if I create a table with a column of varchar(255) (let's call this column myColumn) and insert a million rows but put only a few characters into myColumn for each row, the table will be much smaller (overall number of data pages needed by the storage engine) than if I had created myColumn as char(255). Anytime I do an operation (DML) on that table and request alot of rows, it will be faster when myColumn is varchar because I don't have to move around all those "extra" spaces at the end. Move, as in when SQL Server does internal sorts such as during a distinct or union operation, or if it chooses a merge during it's query plan, etc. Move could also mean the time it takes to get the data from the server to my local pc or to another computer or wherever it is going to be consumed.
But there is some overhead in using varchar. SQL Server has to use a two byte indicator (overhead) to, on each row, to know how many bytes that particular row's myColumn has in it. It's not the extra 2 bytes that presents the problem, it's the having to "decode" the length of the data in myColumn on every row.
In my experiences it makes the most sense to use char instead of varchar on columns that will be joined to in queries. For example the primary key of a table, or some other column that will be indexed. CustomerNumber on a demographic table, or CodeID on a decode table, or perhaps OrderNumber on an order table. By using char, the query engine can more quickly perform the join because it can do straight pointer arithmetic (deterministically) rather than having to move it's pointers a variable amount of bytes as it reads the pages. I know I might have lost you on that last sentence. Joins in SQL Server are based around the idea of "predicates." A predicate is a condition. For example myColumn = 1, or OrderNumber < 500.
So if SQL Server is performing a DML statement, and the predicates, or "keys" being joined on are a fixed length (char), the query engine doesn't have to do as much work to match rows from one table to rows from another table. It won't have to find out how long the data is in the row and then walk down the string to find the end. All that takes time.
Now bear in mind this can easily be poorly implemented. I have seen char used for primary key fields in online systems. The width must be kept small i.e. char(15) or something reasonable. And it works best in online systems because you are usually only retrieving or upserting a small number of rows, so having to "rtrim" those trailing spaces you'll get in the result set is a trivial task as opposed to having to join millions of rows from one table to millions of rows on another table.
Another reason CHAR makes sense over varchar on online systems is that it reduces page splits. By using char, you are essentially "reserving" (and wasting) that space so if a user comes along later and puts more data into that column SQL has already allocated space for it and in it goes.
Another reason to use CHAR is similar to the second reason. If a programmer or user does a "batch" update to millions of rows, adding some sentence to a note field for example, you won't get a call from your DBA in the middle of the night wondering why their drives are full. In other words, it leads to more predictable growth of the size of a database.
So those are 3 ways an online (OLTP) system can benefit from char over varchar. I hardly ever use char in a warehouse/analysis/OLAP scenario because usually you have SO much data that all those char columns can add up to lots of wasted space.
Keep in mind that char can make your database much larger but most backup tools have data compression so your backups tend to be about the same size as if you had used varchar. For example LiteSpeed or RedGate SQL Backup.
Another use is in views created for exporting data to a fixed width file. Let's say I have to export some data to a flat file to be read by a mainframe. It is fixed width (not delimited). I like to store the data in my "staging" table as varchar (thus consuming less space on my database) and then use a view to CAST everything to it's char equivalent, with the length corresponding to the width of the fixed width for that column. For example:
create table tblStagingTable (
pkID BIGINT (IDENTITY,1,1),
CustomerFirstName varchar(30),
CustomerLastName varchar(30),
CustomerCityStateZip varchar(100),
CustomerCurrentBalance money )
insert into tblStagingTable
(CustomerFirstName,CustomerLastName, CustomerCityStateZip) ('Joe','Blow','123 Main St Washington, MD 12345', 123.45)
create view vwStagingTable AS
SELECT CustomerFirstName = CAST(CustomerFirstName as CHAR(30)),
CustomerLastName = CAST(CustomerLastName as CHAR(30)),
CustomerCityStateZip = CAST(CustomerCityStateZip as CHAR(100)),
CustomerCurrentBalance = CAST(CAST(CustomerCurrentBalance as NUMERIC(9,2)) AS CHAR(10))
SELECT * from vwStagingTable
This is cool because internally my data takes up less space because it's using varchar. But when I use DTS or SSIS or even just a cut and paste from SSMS to Notepad, I can use the view and get the right number of trailing spaces. In DTS we used to have a feature called, damn I forget I think it was called "suggest columns" or something. In SSIS you can't do that anymore, you have to tediously define the flat file connection manager. But since you have your view setup, SSIS can know the width of each column and it can save alot of time when building your data flow tasks.
So bottom line... use varchar. There are a very small number of reasons to use char and it's only for performance reasons. If you have a system with hundrends of millions of rows you will see a noticeable difference if the predicates are deterministic (char) but for most systems using char is simply wasting space.
Hope that helps.
Jeff
There are performance benefits, but here is one that has not been mentioned: row migration. With char, you reserve the entire space in advance.So let's says you have a char(1000), and you store 10 characters, you will use up all 1000 charaters of space. In a varchar2(1000), you will only use 10 characters. The problem comes when you modify the data. Let's say you update the column to now contain 900 characters. It is possible that the space to expand the varchar is not available in the current block. In that case, the DB engine must migrate the row to another block, and make a pointer in the original block to the new row in the new block. To read this data, the DB engine will now have to read 2 blocks.
No one can equivocally say that varchar or char are better. There is a space for time tradeoff, and consideration of whether the data will be updated, especially if there is a good chance that it will grow.
There is a difference between early performance optimization and using a best practice type of rule. If you are creating new tables where you will always have a fixed length field, it makes sense to use CHAR, you should be using it in that case. This isn't early optimization, but rather implementing a rule of thumb (or best practice).
i.e. - If you have a 2 letter state field, use CHAR(2). If you have a field with the actual state names, use VARCHAR.
I would choose varchar unless the column stores fixed value like US state code -- which is always 2 chars long and the list of valid US states code doesn't change often :).
In every other case, even like storing hashed password (which is fixed length), I would choose varchar.
Why -- char type column is always fulfilled with spaces, which makes for column my_column defined as char(5) with value 'ABC' inside comparation:
my_column = 'ABC' -- my_column stores 'ABC ' value which is different then 'ABC'
false.
This feature could lead to many irritating bugs during development and makes testing harder.
CHAR takes up less storage space than VARCHAR if all your data values in that field are the same length. Now perhaps in 2009 a 800GB database is the same for all intents and purposes as a 810GB if you converted the VARCHARs to CHARs, but for short strings (1 or 2 characters), CHAR is still a industry "best practice" I would say.
Now if you look at the wide variety of data types most databases provide even for integers alone (bit, tiny, int, bigint), there ARE reasons to choose one over the other. Simply choosing bigint every time is actually being a bit ignorant of the purposes and uses of the field. If a field simply represents a persons age in years, a bigint is overkill. Now it's not necessarily "wrong", but it's not efficient.
But its an interesting argument, and as databases improve over time, it could be argued CHAR vs VARCHAR does get less relevant.
I would NEVER use chars. I’ve had this debate with many people and they always bring up the tired cliché that char is faster. Well I say, how much faster? What are we talking about here, milliseconds, seconds and if so how many? You’re telling me because someone claims its a few milliseconds faster, we should introduce tons of hard to fix bugs into the system?
So here are some issues you will run into:
Every field will be padded, so you end up with code forever that has RTRIMS everywhere. This is also a huge disk space waste for the longer fields.
Now let’s say you have the quintessential example of a char field of just one character but the field is optional. If somebody passes an empty string to that field it becomes one space. So when another application/process queries it, they get one single space, if they don’t use rtrim. We’ve had xml documents, files and other programs, display just one space, in optional fields and break things.
So now you have to ensure that you’re passing nulls and not empty string, to the char field. But that’s NOT the correct use of null. Here is the use of null. Lets say you get a file from a vendor
Name|Gender|City
Bob||Los Angeles
If gender is not specified than you enter Bob, empty string and Los Angeles into the table. Now lets say you get the file and its format changes and gender is no longer included but was in the past.
Name|City
Bob|Seattle
Well now since gender is not included, I would use null. Varchars support this without issues.
Char on the other hand is different. You always have to send null. If you ever send empty string, you will end up with a field that has spaces in it.
I could go on and on with all the bugs I’ve had to fix from chars and in about 20 years of development.
I stand by Jim McKeeth's comment.
Also, indexing and full table scans are faster if your table has only CHAR columns. Basically the optimizer will be able to predict how big each record is if it only has CHAR columns, while it needs to check the size value of every VARCHAR column.
Besides if you update a VARCHAR column to a size larger than its previous content you may force the database to rebuild its indexes (because you forced the database to physically move the record on disk). While with CHAR columns that'll never happen.
But you probably won't care about the performance hit unless your table is huge.
Remember Djikstra's wise words. Early performance optimization is the root of all evil.
Many people have pointed out that if you know the exact length of the value using CHAR has some benefits. But while storing US states as CHAR(2) is great today, when you get the message from sales that 'We have just made our first sale to Australia', you are in a world of pain. I always send to overestimate how long I think fields will need to be rather than making an 'exact' guess to cover for future events. VARCHAR will give me more flexibility in this area.
I think in your case there is probably no reason to not pick Varchar. It gives you flexibility and as has been mentioned by a number of respondants, performance is such now that except in very specific circumstances us meer mortals (as opposed to Google DBA's) will not notice the difference.
An interesting thing worth noting when it comes to DB Types is the sqlite (a popular mini database with pretty impressive performance) puts everything into the database as a string and types on the fly.
I always use VarChar and usually make it much bigger than I might strickly need. Eg. 50 for Firstname, as you say why not just to be safe.
It's the classic space versus performance tradeoff.
In MS SQL 2005, Varchar (or NVarchar for lanuagues requiring two bytes per character ie Chinese) are variable length. If you add to the row after it has been written to the hard disk it will locate the data in a non-contigious location to the original row and lead to fragmentation of your data files. This will affect performance.
So, if space is not an issue then Char are better for performance but if you want to keep the database size down then varchars are better.
Fragmentation. Char reserves space and VarChar does not. Page split can be required to accommodate update to varchar.
There is some small processing overhead in calculating the actual needed size for a column value and allocating the space for a Varchar, so if you are definitely sure how long the value will always be, it is better to use Char and avoid the hit.
when using varchar values SQL Server needs an additional 2 bytes per row to store some info about that column whereas if you use char it doesn't need that
so unless you
Using CHAR (NCHAR) and VARCHAR (NVARCHAR) brings differences in the ways the database server stores the data. The first one introduces trailing blanks; I have encountered problem when using it with LIKE operator in SQL SERVER functions. So I have to make it safe by using VARCHAR (NVARCHAR) all the times.
For example, if we have a table TEST(ID INT, Status CHAR(1)), and you write a function to list all the records with some specific value like the following:
CREATE FUNCTION List(#Status AS CHAR(1) = '')
RETURNS TABLE
AS
RETURN
SELECT * FROM TEST
WHERE Status LIKE '%' + #Status '%'
In this function we expect that when we put the default parameter the function will return all the rows, but in fact it does not. Change the #Status data type to VARCHAR will fix the issue.
In some SQL databases, VARCHAR will be padded out to its maximum size in order to optimize the offsets, This is to speed up full table scans and indexes.
Because of this, you do not have any space savings by using a VARCHAR(200) compared to a CHAR(200)