Compare queries on converted columns - sql

Certain parts of my database are required to be extremely flexible, to the point that the user might decide to manipulate the number and/or data types of columns in a table. The data that is already in the table, though, should be preserved.
That leaves me with the only option of using nvarchar(max) as the data type for any column in any of those tables.
Suppose the user chooses to store integers in a certain column and then wants to get all rows with that field in a certain range. I would then have to run a comparison query over the values of that column converted to int.
I am afraid that would be a performance disaster. Assuming that I am left with no other design alternatives, what can I do to improve performance in this scenario?

I can relate to this problem. An application, for instance, might take user input from an Excel spreadsheet and need to store it in the format the user sees. Once in the database, though, you might have other requirements for filtering and combining data.
You've solved half the problem. By storing the value in a character field, you can store what the user wants.
The second half is to store the value in a form the database can reasonably manipulate. I would decide on a set of base types, perhaps just float and datetime, depending on the application. Then, when a user inserts a value, you can do the conversion and set the values in separate columns. Your table might have columns like this:
ColumnX_WhatTheUserSees nvarchar(max),
ColumnX_Type char(1) not null default 'C', -- 'C'haracter, 'F'loat, 'D'atetime
ColumnX_Float float,
ColumnX_Datetime datetime
The insertion logic then goes something like this:
insert into t(ColumnX_WhatTheUserSees, ColumnX_Type, ColumnX_Float, ColumnX_Datetime)
select @ColX,
(case when isnumeric(@ColX) = 1 then 'F'
when isdate(@ColX) = 1 then 'D'
else 'C'
end),
(case when isnumeric(@ColX) = 1 then cast(@ColX as float) end),
(case when isdate(@ColX) = 1 then cast(@ColX as datetime) end)
The above code is meant for illustrative purposes only. You may need to handle special cases where the defaults are not what you want (perhaps you think '1e5' should be treated as a string, or you might want to handle numbers in parentheses as negative numbers).
You can handle the extra columns through an insert/update trigger (in SQL Server that means an INSTEAD OF trigger, since it has no BEFORE triggers), so the user never sees the extra complexity. You can provide a view so the user sees only the "WhatTheUserSees" columns.
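A minimal sketch of that arrangement, assuming the table is dbo.t with the columns above (the view and trigger names are made up; ISDATE and ISNUMERIC will not accept (max) types, hence the trim to nvarchar(4000)):
create view dbo.t_ForUsers as
select ColumnX_WhatTheUserSees from dbo.t
go
create trigger dbo.trg_t_ForUsers_ins on dbo.t_ForUsers
instead of insert as
begin
    insert into dbo.t (ColumnX_WhatTheUserSees, ColumnX_Type, ColumnX_Float, ColumnX_Datetime)
    select i.ColumnX_WhatTheUserSees,
           (case when isnumeric(x.v) = 1 then 'F'
                 when isdate(x.v) = 1 then 'D'
                 else 'C'
            end),
           (case when isnumeric(x.v) = 1 then cast(x.v as float) end),
           (case when isdate(x.v) = 1 then cast(x.v as datetime) end)
    from inserted as i
    cross apply (select cast(left(i.ColumnX_WhatTheUserSees, 4000) as nvarchar(4000))) as x(v)
end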
Finally, SQL Server does offer the sql_variant data type. This provides an alternative route to what you want. However, it loses the initial user formatting (which has been important when I've encountered similar problems).
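For reference, a small sql_variant sketch (the table name is made up); SQL_VARIANT_PROPERTY reports the underlying type of each value, but note that the original text the user typed is gone:
create table dbo.FlexCells (Id int identity primary key, Value sql_variant)
insert into dbo.FlexCells (Value) values (cast(42 as int)), (cast('20010104' as datetime)), (N'free text')
select Id, Value, sql_variant_property(Value, 'BaseType') as BaseType from dbo.FlexCells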

Given what you said, perhaps you could add an additional int column for each column, plus a trigger that populates it whenever the user puts an int into the nvarchar(max) column. Then at least you would only have to convert the data once, rather than each time you query it. Otherwise, yes, you are stuck with the poorly performing conversion to an integer (which is problematic since you have to preserve earlier data that may not be int) in order to do any kind of ordering or mathematical calculation.
Another possibility is to have a string column and an int column (and a trigger to make sure only one of the two is populated), and then a view that coalesces them for display when you need to show all records. A meta table telling you which column each client is using could help you in writing queries.
No matter what, this is a mess. Have you considered that a NoSQL solution might be better for your requirement? Unstructured data is the use case for NoSQL. If we knew the real use for this data, it is possible we could suggest a better design alternative.
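A rough sketch of that second idea (all names hypothetical); here a CHECK constraint stands in for the trigger that keeps only one column populated:
create table dbo.FlexValues
(
    Id int primary key,
    TextValue nvarchar(max) null,
    IntValue int null,
    check (TextValue is null or IntValue is null) -- at most one of the two is populated
)
go
create view dbo.FlexValuesDisplay as
select Id, coalesce(cast(IntValue as nvarchar(12)), TextValue) as DisplayValue
from dbo.FlexValues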
(Turn Rant on - Personally, without knowing more, I would question the need for any application to be that flexible. Requirements often add more flexibility than users actually require or will use, and developers dutifully build it. I have seen this in every single COTS program I have had to support. Users in general think they want flexibility - making it a sales point - but find it so hard to use that they will not use it in practice. Sometimes we need to do a better job of pushing back when a requirement will make the software run slowly or be virtually unusable. Turn Rant off.)

Related

When would combining columns into a single, delimited column be better in an RDB schema?

Consider, for example, the case where you have two pieces of data, where one value is rarely used without the other. As one example, here is a table holding user authentication data:
CREATE TABLE users
(
id INT PRIMARY KEY,
auth_name STRING,
auth_password STRING,
auth_password_salt STRING
)
I think that the password is meaningless without the salt, and vice versa. I also have the option of representing the data this way:
CREATE TABLE users
(
id INT PRIMARY KEY,
auth_name STRING,
auth_secret STRING
)
And in auth_secret, store strings such as D5SDfsuuAedW:unguessable42
In general, are there any situations where combining columns into one, delimited column would be a better choice?
Even if it is never a "better choice" overall, are there any costs (performance, space, anything) to having more columns vs fewer columns (for the same data)? My motivation is better understanding and to be able to more competently argue against it when someone suggests this sort of thing.
-- edited: I changed the example; the original example was as follows:
CREATE TABLE points
(
id INT PRIMARY KEY,
x_coordinate INT,
y_coordinate INT,
z_coordinate INT
)
vs
CREATE TABLE points
(
id INT PRIMARY KEY,
position STRING
)
In position, storing strings such as 7:3:15
You do that when there is no chance of needing to join, query, report or aggregate the data.
In other words - never. It is bad database design.
First Normal Form (1NF) states that attribute values should be atomic - it is the most basic requirement.
The only possible answer to this question is never. Never, ever store delimited data in a column. It defeats the entire point of columns, which are there to delimit your data, and it makes it inordinately difficult to do anything that a database has been designed to do. It's a violation of normalisation so huge that you'll spend hours on Stack Overflow trying to correct it in a month's time.
Never do this.
However, "never say never".
In certain, extremely limited, circumstances it's okay. Never assume it's okay but it can be.
A good example is Stack Overflow's own Posts table, which stores the tags in a delimited format for quick reading. The tags a question has are read from the database far more often than they are edited. The tags are stored in a separate table, PostTags, and then denormalised to Posts when they are updated.
In short, even though you can denormalise your data in this way, don't. Try everything possible to avoid it. If you come across a situation where you've been optimizing for days and the only way to get something quicker is to denormalize, then it's okay. Just ensure that you are only ever going to read data from that column and you have a secondary process in place to ensure that it is kept up-to-date. If the update of the denormalised data fails, roll everything back to ensure that your data is consistent.
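A sketch of how a denormalisation like that might be kept up to date, assuming hypothetical Posts, PostTags, and Tags tables and SQL Server 2017's STRING_AGG; the normalised PostTags rows remain the source of truth, and the delimited column is rebuilt whenever the tags change:
DECLARE @PostId int = 42  -- the post whose tags were just edited
UPDATE p
SET Tags = (SELECT STRING_AGG(t.TagName, ' ')
            FROM PostTags pt
            JOIN Tags t ON t.Id = pt.TagId
            WHERE pt.PostId = p.Id)
FROM Posts p
WHERE p.Id = @PostId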
You left out a significant option: create an appropriate user-defined data type. (PostgreSQL has long had an intrinsic data type for 2-space.)
PostgreSQL
Oracle
SQL Server
DB2
These implementations differ quite a lot.
But you might not have the luxury of using one of those platforms. You might have to use MySQL, for example, which doesn't support user-defined data types.
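For instance, PostgreSQL's built-in point type covers 2-space out of the box (the question's example is 3-space, so this is only an analogue):
CREATE TABLE points
(
    id INT PRIMARY KEY,
    position POINT -- built-in 2D coordinate type
);
INSERT INTO points VALUES (1, '(7,3)');
SELECT id, position[0] AS x, position[1] AS y FROM points;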
Relational theory says that data types can be arbitrarily complex; they can have internal structure. The most common data type that has internal structure is the type "date". Relational theory specifies what the dbms is supposed to do with data types like that. The dbms must either
ignore the internal structure entirely, or
provide functions to manipulate the parts.
In the case of dates, every SQL dbms provides functions to manipulate the parts.
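For example, in SQL Server (other dialects offer EXTRACT or similar):
SELECT DATEPART(year, '18930104') AS y,  -- 1893
       DATEPART(month, '18930104') AS m, -- 1
       DATEPART(day, '18930104') AS d    -- 4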
You can make a good argument for a single column that stores 3-space coordinates like "7:3:15" in MySQL. To keep in line with relational theory, you'd want the dbms to ignore the structure, and return only the single value "7:3:15"; manipulation of parts is left to application code.
One problem with implementing something like that in MySQL is that MySQL (before version 8.0.16) parses but does not enforce CHECK constraints. So it's a lot harder to prevent values like "wibble:frog:foo" from finding their way into the database.
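In a DBMS that does enforce CHECK constraints (SQL Server syntax below; names made up), a rough pattern check might look like this - not airtight, but enough to stop "wibble:frog:foo":
CREATE TABLE points
(
    id INT PRIMARY KEY,
    position VARCHAR(20) NOT NULL,
    CONSTRAINT chk_position CHECK (position LIKE '%:%:%'              -- at least two separators
                                   AND position NOT LIKE '%[^0-9:]%') -- digits and colons only
);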

Are there any advantages to using varchar over decimal for Price and Value

I was arguing with my friend against his suggestion to store price, value, and other similar information in varchar columns.
My point of view was based on the following:
Calculations will become difficult as we need to cast back and forth.
Integrity of the data will be lost.
Poor performance of Indexes
Sorting and aggregate functions will also need casting
etc. etc.
But he was saying that at his previous employment everybody used to store such values in varchar, because it made the communication between the DB and the app very effective. (I still can't accept this.)
Are there really some advantages in storing such values in varchar ?
Note: I'm not talking about columns like PhoneNo, ID, ZIP Code, SSN, etc. I know varchar is best suited for those. The columns in question are value-based and will certainly be involved in calculations one way or another.
None at all.
Try casting values back and forth and see how much data you lose.
DECLARE @foo TABLE (bar varchar(30))
INSERT @foo VALUES (11.2222222222)
INSERT @foo VALUES (22.3333333333)
INSERT @foo VALUES (33.1111111111)
SELECT CAST(CAST(bar AS float) AS varchar(30)) FROM @foo
-- the default float-to-varchar conversion keeps at most six significant digits,
-- so this returns 11.2222, 22.3333 and 33.1111
I would also mention that his current employer does things differently... he isn't at his previous employment any more...
I think a big part of the reason to use the APPROPRIATE (in this case decimal) data type is to prevent invalid data. There's nothing to stop someone entering "The King" as a price in a varchar field.
I can see no advantages, and a whole heap of very severe disadvantages - the most pressing of which is performance (particularly when sorting).
Consider if you want to get a list of the N most expensive products, and you are storing your price as a VARCHAR. Here are some sample values (sorted in descending order)
SELECT Price FROM Table ORDER BY Price DESC
Price
-----
90
600
50
1000
Whoops! The sort order is, well, wrong! (Alphanumerical sorting, rather than value sorting).
If we want the sort to come out properly, then we either need to pad values with zeroes at the start, or convert each value to a number before sorting - but if we have to run a conversion on every row, SQL Server has no way of using statistics to predict the results! That in turn means extremely poor performance, probably a table scan.
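The conversion workaround looks something like this (hypothetical table name; it also assumes every stored value really does parse as a number, otherwise the CAST itself fails):
SELECT Price FROM Products ORDER BY CAST(Price AS decimal(10,2)) DESC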
As Kragen notes, sorts will not necessarily come out in the right order.
Comparisons won't necessarily work either. If a field is defined as, say, decimal(8,2) and I give it the value "37.20", and later I write "select ... where price = 37.2", the comparison will be true. But if I store a varchar "37.20" and compare it to "37.2", they will not be equal. Similarly if one or the other has leading zeros.
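A quick way to see this (plain T-SQL, no table needed):
SELECT CASE WHEN CAST('37.20' AS decimal(8,2)) = 37.2 THEN 'equal' ELSE 'not equal' END -- equal
SELECT CASE WHEN '37.20' = '37.2' THEN 'equal' ELSE 'not equal' END                     -- not equal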
You could solve these problems by having the application ensure that you always store numbers with a fixed number of decimal places, padded with leading zeros. Oh, and make sure you have a consistent convention for storing minus signs. But then every place in the app that writes to this field must follow exactly the same rules. We could do that, of course, but why? The database engine will do it for us if we just declare the field numeric. Yes, I COULD mow my lawn with a pair of scissors, but why would I want to?
I don't understand what your friend is saying the advantage is supposed to be. Easier communication between app and database? How? Maybe he was using some unconventional language or database interface that couldn't read numeric values from the DB. I've never had an issue with this. Actually, just saying that gets me wondering if that isn't what happened: at his previous company they were using some language or tool that couldn't read decimals from the database because of an implementation problem, the only way they could get it to work was to declare all the numbers as varchar, and now he walks away thinking that's a generally good idea.
OK - one-word answer: don't.
You are right that correct data types have an impact on performance (the SQL optimizer works differently for INT vs. VARCHAR), data consistency, data integrity, etc.
If all we needed was VARCHAR, I don't think we would ever have invented the other types.
SQL is not dynamically typed. Static typing makes optimization better, index pages smaller, and query operators more efficient.
It is not the source's problem that a consumer wants all its input as strings; it is up to the consumer to do type checking when consuming the data. A DB should always use correct types.
(And forget about choosing between INT and VARCHAR - you should also think about whether you need INT or TINYINT. These considerations make a lot of difference.)
Data is best stored in fields whose types match between the two systems involved - in this case, your .NET objects and MS SQL Server. You are correct about the loss of data integrity and about the need to cast/convert values into usable forms. As for other types such as phone number, ZIP code, SSN and so on: they too would benefit from dedicated data types. The main reason these end up in VARCHAR/NVARCHAR is the number of different formats, not all of which are needed in every system. But if you have a type that is commonly used and you want to constrain it, you can build custom User-Defined Types to store that data in SQL Server. (Even more fun is CLR-defined types; see the example on Code Project.)
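A minimal sketch of such an alias type in T-SQL (the names are made up):
CREATE TYPE dbo.Price FROM decimal(10,2) NOT NULL
GO
CREATE TABLE dbo.Products
(
    Id int PRIMARY KEY,
    UnitPrice dbo.Price -- every price column in the schema now shares one definition
)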
The only advantage I can see with using any sort of variable-sized string-ish format would be if the field would have to accommodate an unknown amount of additional information. For example, "49.95#1/39.95#5/29.95#20/14.95#100,match=true/24.95#100" to indicate that this particular product has price points at 1, 5, 20, and 100 units, and the best 100-unit price is only available when all items are identical. Using strings to store such things is icky, but if the number of price-points is open-ended, using a variable-sized field might be better than having to create another table with one row per product/price-point combination. If you do go that route, it may be good to use XML serialization for the data, rather than an ad-hoc thing as shown above. An ad-hoc approach might allow faster parsing in some cases, but if things really are open-ended it could become a real pain to maintain.
Addendum: If you want to be able to do any type of sorting or searching based on price, you'll need separate columns for that. If you want to allow users to e.g. find the ten cheapest items at 100-piece mix/match quantity, and the database holds 10,000 possible items, the only way to satisfy the query with varchar-stored data would be to read all 10,000 items and evaluate the best price given the restrictions. If users can only query based upon a small number of price/restriction combinations, it may be helpful to have a column for each one to allow direct queries.

What is wrong with using SELECT * FROM sometable [duplicate]

I've heard that SELECT * is generally bad practice to use when writing SQL commands because it is more efficient to SELECT columns you specifically need.
If I need to SELECT every column in a table, should I use
SELECT * FROM TABLE
or
SELECT column1, colum2, column3, etc. FROM TABLE
Does the efficiency really matter in this case? I'd think SELECT * would be more optimal internally if you really need all of the data, but I'm saying this with no real understanding of databases.
I'm curious to know what the best practice is in this case.
UPDATE: I probably should specify that the only situation where I would really want to do a SELECT * is when I'm selecting data from one table where I know all columns will always need to be retrieved, even when new columns are added.
Given the responses I've seen, however, this still seems like a bad idea, and SELECT * should never be used, for far more technical reasons than I ever thought of.
One reason that selecting specific columns is better is that it raises the probability that SQL Server can access the data from indexes rather than querying the table data.
Here's a post I wrote about it: The real reason SELECT queries are bad: index coverage.
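The idea, as a sketch with a hypothetical table: an index that "covers" a query can be read instead of the table, but SELECT * almost never fits inside one.
CREATE INDEX IX_Orders_CustomerId ON dbo.Orders (CustomerId) INCLUDE (OrderDate, Total)
-- served entirely from the index ("covered"):
SELECT CustomerId, OrderDate, Total FROM dbo.Orders WHERE CustomerId = 42
-- must visit the table as well, because the index lacks the other columns:
SELECT * FROM dbo.Orders WHERE CustomerId = 42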
It's also less fragile to change, since any code that consumes the data will be getting the same data structure regardless of changes you make to the table schema in the future.
Given your specification that you are selecting all columns, there is little difference at this time. Realize, however, that database schemas do change. If you use SELECT * you are going to get any new columns added to the table, even though in all likelihood, your code is not prepared to use or present that new data. This means that you are exposing your system to unexpected performance and functionality changes.
You may be willing to dismiss this as a minor cost, but realize that columns that you don't need still must be:
Read from database
Sent across the network
Marshalled into your process
(for ADO-type technologies) Saved in a data-table in-memory
Ignored and discarded / garbage-collected
Item #1 has many hidden costs, including eliminating a potential covering index, causing data-page loads (and server cache thrashing), and incurring row/page/table locks that might otherwise be avoided.
Balance this against the potential savings of using an * instead of specifying the columns - the only such savings are:
Programmer doesn't need to revisit the SQL to add columns
The network-transport of the SQL is smaller / faster
SQL Server query parse / validation time
SQL Server query plan cache
For item 1, the reality is that you're going to add / change code to use any new column you might add anyway, so it is a wash.
For item 2, the difference is rarely enough to push you into a different packet-size or number of network packets. If you get to the point where SQL statement transmission time is the predominant issue, you probably need to reduce the rate of statements first.
For item 3, there are NO savings, as the expansion of the * has to happen anyway, which means consulting the table schema anyway. Realistically, listing the columns incurs the same cost, because they have to be validated against the schema. In other words, this is a complete wash.
For item 4, when you specify specific columns, your query plan cache could get larger but only if you are dealing with different sets of columns (which is not what you've specified). In this case, you do want different cache entries because you want different plans as needed.
So, because of the way you specified the question, this all comes down to the issue of resiliency in the face of eventual schema modifications. If you're burning this schema into ROM (it happens), then an * is perfectly acceptable.
However, my general guideline is that you should only select the columns you need, which means that sometimes it will look like you are asking for all of them, but DBAs and schema evolution mean that some new columns might appear that could greatly affect the query.
My advice is that you should ALWAYS SELECT specific columns. Remember that you get good at what you do over and over, so just get in the habit of doing it right.
If you are wondering why a schema might change without the code changing, think in terms of audit logging, effective/expiration dates, and similar columns added systematically by DBAs for compliance issues. Another source of underhanded changes is denormalizations for performance elsewhere in the system, or user-defined fields.
You should only select the columns that you need. Even if you need all the columns, it's still better to list the column names so that the SQL server does not have to query the system tables for them.
Also, your application might break if someone adds columns to the table. Your program would get columns it didn't expect, and it might not know how to process them.
Apart from this, if the table has a binary column, the query will be much slower and use more network resources.
There are four big reasons that select * is a bad thing:
The most significant practical reason is that it forces the user to magically know the order in which columns will be returned. It's better to be explicit, which also protects you against the table changing, which segues nicely into...
If a column name you're using changes, it's better to catch it early (at the point of the SQL call) rather than when you're trying to use the column that no longer exists (or has had its name changed, etc.)
Listing the column names makes your code far more self-documented, and so probably more readable.
If you're transferring over a network (or even if you aren't), columns you don't need are just waste.
Specifying the column list is usually the best option because your application won't be affected if someone adds/inserts a column to the table.
Specifying column names is definitely faster - for the server. But if
performance is not a big issue (for example, this is a website content database with hundreds, maybe thousands - but not millions - of rows in each table); AND
your job is to create many small, similar applications (e.g. public-facing content-managed websites) using a common framework, rather than creating a complex one-off application; AND
flexibility is important (lots of customization of the db schema for each site);
then you're better off sticking with SELECT *. In our framework, heavy use of SELECT * allows us to introduce a new website managed content field to a table, giving it all of the benefits of the CMS (versioning, workflow/approvals, etc.), while only touching the code at a couple of points, instead of a couple dozen points.
I know the DB gurus are going to hate me for this - go ahead, vote me down - but in my world, developer time is scarce and CPU cycles are abundant, so I adjust accordingly what I conserve and what I waste.
SELECT * is a bad practice even if the query is not sent over a network.
Selecting more data than you need makes the query less efficient - the server has to read and transfer extra data, so it takes time and creates unnecessary load on the system (not only the network, as others mentioned, but also disk, CPU etc.). Additionally, the server is unable to optimize the query as well as it might (for example, use covering index for the query).
After some time your table structure might change, so SELECT * will return a different set of columns. So, your application might get a dataset of unexpected structure and break somewhere downstream. Explicitly stating the columns guarantees that you either get a dataset of known structure, or get a clear error on the database level (like 'column not found').
Of course, all this doesn't matter much for a small and simple system.
Lots of good reasons answered here so far, here's another one that hasn't been mentioned.
Explicitly naming the columns will help you with maintenance down the road. At some point you're going to be making changes or troubleshooting, and find yourself asking "where the heck is that column used".
If you've got the names listed explicitly, then finding every reference to that column -- through all your stored procedures, views, etc -- is simple. Just dump a CREATE script for your DB schema, and text search through it.
Performance-wise, SELECT with specific columns can be faster (no need to read in all the data). If your query really does use ALL the columns, SELECT with explicit column names is still preferred. Any speed difference will be basically unnoticeable and near constant-time. One day your schema will change, and this is good insurance against the problems that causes.
Definitely define the columns, because SQL Server will not have to do a lookup on the columns to pull them. If you define the columns, SQL can skip that step.
It's always better to specify the columns you need; if you think about it, SQL doesn't have to think "wtf is *" every time you query. On top of that, someone may later add columns to the table that you don't actually need in your query, and you'll be better off in that case by having specified all of your columns.
The problem with "select *" is the possibility of bringing back data you don't really need. During the actual database query, the selected columns don't really add to the computation. What's really "heavy" is the data transport back to your client, and any column that you don't really need is just wasting network bandwidth and adding to the time you're waiting for your query to return.
Even if you do use all the columns brought back by "select *...", that's just for now. If in the future you change the table/view layout and add more columns, you'll start bringing those back in your selects even if you don't need them.
Another point where a "select *" statement is bad is view creation. If you create a view using "select *" and later add columns to the table, the view definition and the data returned won't match, and you'll need to recompile your views in order for them to work correctly again.
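A short demonstration of that view problem in SQL Server (names made up):
CREATE TABLE dbo.Widgets (Id int, Name varchar(50))
GO
CREATE VIEW dbo.AllWidgets AS SELECT * FROM dbo.Widgets
GO
ALTER TABLE dbo.Widgets ADD Color varchar(20)
GO
SELECT * FROM dbo.AllWidgets          -- still returns only Id and Name
EXEC sp_refreshview 'dbo.AllWidgets'  -- recompiles the view definition
SELECT * FROM dbo.AllWidgets          -- now returns Color too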
I know that writing a "select *" is tempting, 'cause I really don't like to manually specify all the fields in my queries, but when your system starts to evolve, you'll see that it's worth spending this extra time/effort specifying the fields rather than spending much more time and effort removing bugs from your views or optimizing your app.
While explicitly listing columns is good for performance, don't get crazy.
So if you use all the data, try SELECT * for simplicity (imagine having many columns and doing a JOIN... the query may get awful). Then measure. Compare it with the query with the column names listed explicitly.
Don't speculate about performance, measure it!
Explicit listing helps most when you have some column containing big data (like body of a post or article), and don't need it in given query. Then by not returning it in your answer DB server can save time, bandwidth, and disk throughput. Your query result will also be smaller, which is good for any query cache.
You should really be selecting only the fields you need, and only the required number, i.e.
SELECT Field1, Field2 FROM SomeTable WHERE --(constraints)
Outside of the database, dynamic queries run the risk of injection attacks and malformed data. Typically you get round this using stored procedures or parameterised queries. Also (although not really that much of a problem) the server has to generate an execution plan each time a dynamic query is executed.
It is NOT faster to use explicit field names than *, if (and only if) you need to get the data for all fields.
Your client software shouldn't depend on the order of the fields returned, so that argument is nonsense too.
And it's possible (though unlikely) that you need to get all fields using * because you don't yet know what fields exist (think very dynamic database structure).
Another disadvantage of using explicit field names is that if there are many of them and they're long then it makes reading the code and/or the query log more difficult.
So the rule should be: if you need all the fields, use *, if you need only a subset, name them explicitly.
The result is too huge, and it is slow to generate and send from the SQL engine to the client.
The client side, being a generic programming environment, is not and should not be designed to filter and process the results (that is the job of the WHERE and ORDER BY clauses), as the number of rows can be huge (e.g. tens of millions of rows).
Naming each column you expect to get in your application also ensures your application won't break if someone alters the table, as long as your columns are still present (in any order).
Performance-wise, I have seen comments that both are equal, but from a usability aspect there are some pluses and minuses.
When you use "select *" in a query, and someone alters the table and adds new fields that the previous query does not need, that is unnecessary overhead. And what if a newly added field is a blob or an image field? Your query response time would become really slow then.
On the other hand, if you use "select col1, col2, ..." and the table gets altered with new fields that are needed in the result set, you always need to edit your select query after the alteration.
But I suggest always using "select col1, col2, ..." in your queries, and altering the query if the table gets altered later.
This is an old post, but still valid. For reference, I have a very complicated query consisting of:
12 tables
6 Left joins
9 inner joins
108 total columns on all 12 tables
I only need 54 columns
A 4 column Order By clause
When I execute the query using Select *, it takes an average of 2869ms.
When I execute the query using SELECT with the 54 column names listed explicitly, it takes an average of 1513ms.
Total rows returned is 13,949.
There is no doubt that selecting column names means faster performance than SELECT *.
SELECT is equally efficient (in terms of speed) whether you use * or name the columns.
The difference is about memory, not speed. When you select several columns, SQL Server must allocate memory space to serve you the query, including all the data for all the columns that you've requested, even if you're only using one of them.
What does matter in terms of performance is the execution plan, which in turn depends heavily on your WHERE clause and the number of JOINs, OUTER JOINs, etc.
For your question, just use SELECT *. If you need all the columns, there's no performance difference.
It depends on the version of your DB server, but modern versions of SQL can cache the plan either way. I'd say go with whatever is most maintainable with your data access code.
One reason it's better practice to spell out exactly which columns you want is because of possible future changes in the table structure.
If you are reading in data manually using an index based approach to populate a data structure with the results of your query, then in the future when you add/remove a column you will have headaches trying to figure out what went wrong.
As to what is faster, I'll defer to others for their expertise.
As with most problems, it depends on what you want to achieve. If you want to create a db grid that will allow all columns in any table, then "Select *" is the answer. However, if you will only need certain columns and adding or deleting columns from the query is done infrequently, then specify them individually.
It also depends on the amount of data you want to transfer from the server. If one of the columns is a defined as memo, graphic, blob, etc. and you don't need that column, you'd better not use "Select *" or you'll get a whole bunch of data you don't want and your performance could suffer.
To add on to what everyone else has said, if all of your columns that you are selecting are included in an index, your result set will be pulled from the index instead of looking up additional data from SQL.
SELECT * is necessary if one wants to obtain metadata such as the number of columns.
Gonna get slammed for this, but I do a select * because almost all my data is retrieved from SQL Server views that pre-combine the needed values from multiple tables into a single, easy-to-access view.
I do then want all the columns from the view which won't change when new fields are added to underlying tables. This has the added benefit of allowing me to change where data comes from. FieldA in the View may at one time be calculated and then I may change it to be static. Either way the View supplies FieldA to me.
The beauty of this is that it allows my data layer to get datasets. It then passes them to my BL which can then create objects from them. My main app only knows and interacts with the objects. I even allow my objects to self-create when passed a datarow.
Of course, I'm the only developer, so that helps too :)
What everyone above said, plus:
If you're striving for readable maintainable code, doing something like:
SELECT foo, bar FROM widgets;
is instantly readable and shows intent. If you make that call you know what you're getting back. If widgets only has foo and bar columns, then selecting * means you still have to think about what you're getting back, confirm the order is mapped correctly, etc. However, if widgets has more columns but you're only interested in foo and bar, then your code gets messy when you query for a wildcard and then only use some of what's returned.
And remember if you have an inner join by definition you do not need all the columns as the data in the join columns is repeated.
It's not like listing columns in SQL Server is hard or even time-consuming. You just drag them over from the object browser (you can get them all in one go by dragging the word "columns"). Putting a permanent performance hit on your system (because this can reduce the use of indexes, and because sending unneeded data over the network is costly) and making it more likely that you will have unexpected problems as the database changes (sometimes columns get added that you do not want the user to see, for instance), just to save less than a minute of development time, is short-sighted and unprofessional.
Absolutely define the columns you want to SELECT every time. There is no reason not to and the performance improvement is well worth it.
They should never have given the option to "SELECT *"
If you need every column then just use SELECT * but remember that the order could potentially change so when you are consuming the results access them by name and not by index.
I would ignore comments about how * needs to go get the list - chances are that parsing and validating named columns takes as much processing time, if not more. Don't prematurely optimize ;-)

Why do we care about data types?

Specifically, in relational database management systems, why do we need to know the data type of a column (more likely, the attribute of an object) at creation time?
To me, data types feel like an optimization, because one data point can be implemented in any number of ways. Wouldn't it be better to assign semantic roles and constraints to a data point and then have the engine internally examine and optimize which data type best serves the user?
I suspect this is where the heavy lifting is and why it's easier to just ask the user rather than to do the work.
What do you think? Where are we headed? Is this a realistic expectation? Or do I have a misguided assumption?
The type expresses a desired constraint on the values of the column.
The answer is storage space and fixed size rows.
Fixed-size rows are much, MUCH faster to search than variable length rows, because you can seek directly to the correct byte if you know which record number and field you want.
Edit: Having said that, if you use proper indexing in your database tables, the fixed-size rows thing isn't as important as it used to be.
SQLite does not care.
Other RDBMSs use principles that were designed in the early '80s, when it was vital for performance.
Oracle, for instance, does not distinguish between a NULL and an empty string, and keeps its NUMBERs as sets of base-100 (centesimal) digits.
That hardly makes sense today, but these were very clever solutions when Oracle was being developed.
In one of the databases I developed, though, non-indexed values were stored as VARCHAR2s and cast dynamically into appropriate data types depending on several conditions.
That was quite a special case, though: it was used for bulk-loading key-value pairs in one call to the database, using collections.
Dynamic SQL statements were used for parsing the data and putting it into the appropriate tables based on the key name.
All values were loaded into a temporary VARCHAR2 column as-is and then converted into NUMBERs and DATETIMEs to be put into their columns.
Explicit data types are huge for efficiency and storage. If types are implicit, they have to be figured out, and that incurs a speed cost. Indexes would be hard to implement as well.
I would suspect, although I'm not positive, that explicit types also use less storage space on average. For numbers especially, there is no comparison between a binary int and a string of digit characters.
Hm... your question is sort of confusing.
If I understand it correctly, you're asking why it is that we specify data types for table columns, rather than having the "engine" automatically determine what is needed.
Data types act as a constraint - they secure the data's integrity. An int column will never have letters in it, which is a good thing. The data type isn't automatically decided for you; you specify it when you create the table - almost always using SQL.
You're right: assigning a data type to a column is an implementation detail and has nothing to do with the set theory or calculus behind a database engine. As a theoretical model, a database ought to be "typeless" and able to store whatever we throw at it.
But we have to implement the database on a real computer with real constraints. It's not practical, from a performance standpoint, to have the computer dynamically try to figure out how to best store the data.
For example, let's say you have a table in which you store a few million integers. The computer could -- correctly -- figure out that it should store each datum as an integral value. But if you were to one day suddenly try to store a string in that table, should the database engine stop everything until it converts all the data to a more general string format?
Unfortunately, specifying a data type is a necessary evil.
If you know that some data item is supposed to be a numeric integer, and you deliberately choose NOT to let the DBMS take care of enforcing this, then it becomes YOUR responsibility to ensure all sorts of things: data integrity (that no value 'A' can be entered in the column, that no value 1.5 can be entered in the column), consistency of system behaviour (that the value '01' is considered equal to the value '1', which is not the behaviour you get from the String type), and so on.
Types take care of all those sorts of things for you.
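Both points are easy to demonstrate (T-SQL; the table name is made up):
CREATE TABLE dbo.Demo (Qty int)
INSERT INTO dbo.Demo VALUES ('A') -- rejected with a conversion error: the type does the checking
SELECT CASE WHEN '01' = '1' THEN 'equal' ELSE 'not equal' END -- not equal: string semantics
SELECT CASE WHEN  01  =  1  THEN 'equal' ELSE 'not equal' END -- equal: integer semantics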
I'm not sure of the history of datatypes in databases, but to me it makes sense to know the datatype of a field.
When would you want to do a sum of some fields which are entirely varchar?
If I know that a field is an integer, it makes perfect sense to do a sum, avg, max, etc.
Not all databases work this way. SQLite was mentioned earlier, but a much older family of databases also does this: multivalued databases.
Consider UniVerse (now an IBM property). It does not do any data validation, nor does it require you to specify what type a field is. Searches are still (relatively) fast, and it takes up less space (due to the way it stores data dynamically).
You can describe what the data may look like using meta-data (dictionary items), but that is the limit of how you restrict the data.
See the wikipedia article on UniVerse
When you're pushing half a billion rows in 5 months after go live, every byte counts (in our system)
There is no such anti-pattern as "premature optimisation" in database design.
Disk space is cheap, of course, but you use the data in memory.
You should care about data types when it comes to filtering (the WHERE clause) or sorting (ORDER BY). For example, "200" is LOWER than "3" if those values are strings, and the opposite when they are integers.
I believe sooner or later you will have to sort or filter your data ("200" > "3"?) or use some aggregate functions in reports (like sum() or avg()). Until then, you are fine with a text data type :)
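A quick check of that claim (T-SQL syntax):
SELECT v FROM (VALUES ('3'), ('200')) AS t(v) ORDER BY v              -- '200' before '3'
SELECT v FROM (VALUES ('3'), ('200')) AS t(v) ORDER BY CAST(v AS int) -- '3' before '200'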
A book I've been reading on database theory tells me that the SQL standard defines a concept of a domain. For instance, height and width could be two different domains. Although both might be stored as numeric(10,2), a height and a width column could not be compared without casting. This allows for a "type" constraint that is not related to implementation.
I like this idea in general, though, since I've never seen it implemented, I don't know what it would be like to use. I can see that it would reduce the chance of errors when using values whose implementations happen to be the same but whose conceptual domains are quite different. It might also help keep people from comparing cm and inches, for instance.
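PostgreSQL does implement the standard's CREATE DOMAIN, which gives part of this - named, constrained types - although it will still happily compare two domains that share a base type (the names here are made up):
CREATE DOMAIN height_cm AS numeric(10,2) CHECK (VALUE > 0);
CREATE DOMAIN width_cm AS numeric(10,2) CHECK (VALUE > 0);
CREATE TABLE boxes
(
    id INT PRIMARY KEY,
    height height_cm,
    width width_cm
);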
Constraints are perhaps the most important thing mentioned here. Data types exist to ensure the correctness of your data, so you can be sure you manipulate it correctly. There are two ways we can store a date: as a date type, or as a string such as "4th of January 1893". But the string could also have been "4/1 1893", "1/4 1893", or similar. Data types constrain that, and they define a canonical form for a date.
Furthermore, a data type has the advantage that it can undergo checks. The string "0th of February 1975" is accepted as a string, but should not be accepted as a date. How about "30th of February 1983"? Poor databases, like MySQL, do not make these checks by default (although you can configure MySQL to do so - and you should!).
Data types help ensure the consistency of your data. This is one of the most important concepts, as keeping your data sane will spare your head from insanity.
RDBMSs generally require the definition of column types so they can perform lookups fast. If you want to get the 5th column of every row in a huge dataset, having the column types defined is a huge optimisation.
Instead of scanning each row for some form of delimiter to find the 5th column (as it would have to if column widths were not fixed), the RDBMS can seek directly to the offset given by the combined size of columns 1 through 4 and read column 5 from there. Imagine how much quicker this would be on a table of, say, 10,000,000 rows.
Alternatively, if you don't want to specify the type of each column, you have two options that I'm aware of: specify each column as varchar(255) and decide what to do with it in the calling program, or use a different database system built on key-value pairs, such as Redis.
A database is all about physical storage, and data types define exactly that!

Char(4) versus int as StatusID/StatusCode column in a table

I need a status column that will have about a dozen possible values.
Is there any reason why I should choose int (StatusID) over char(4) (StatusCode)?
Since SQL Server doesn't support named constants, char is far more descriptive than int when used as a constant in stored procedures and views.
To clarify, I would still use a lookup table either way, since I will need more descriptive text for the UI. This decision is only to help me, as the developer, when I'm maintaining the stored procedures and views.
Right now I'm leaning toward char(4), especially since designing views in SQL Server Management Studio prevents me from adding comments (I know it's possible to add them in the script editor, but realistically I will use the View Designer far more often, especially for trivial views). StatusCode = 'NEW' is much more readable than StatusID = 1000.
I guess the question is: will there be cases where char(4) is problematic? Since the database is pretty small, I'm not too concerned about a slight performance hit (like using tinyint versus int); I'm more afraid of code-maintenance problems.
Database purists will say a key should have no meaning in the business domain, and that you should create a status table where you look up the description and other meanings of the status.
But for operators and end users, having a descriptive status code can be a blessing. And it doesn't even have to be char(4), you can make it varchar(20). This allows them to query without joins, and inspect the database in an easier way.
In the end, I think the varchar(20) organization will run more smoothly and go home earlier on Friday. But the int organization has a better abstraction of the database, and they can enjoy metaprogramming on Friday evening (or boasting on forums).
(All of this assumes that you're writing business support software. One of the more successful business support systems, SAP, makes heavy use of meaningful keys.)
There are many pros and cons to each method, and I'm sure other arguments will come up in favour of using char(4). My reasons for choosing an int over a char include:
I always use lookup tables. They allow for an audit trail of the value to be retained and easily examined. For example, if one of your status codes is 'MING' and a business decision is made to change it from 'MING' to 'MONG' from a certain date, my lookup table handles this.
Smaller index - if you need to index this column, it will be thinner.
Extendability - OK, I made that word up, but if you need to go from 4 chars to 5 chars for example, a lookup table would be a blessing.
Descriptions: We use a lot of TLA's here which once you know what they are is great but if I gave a business user a report that said "GDA's 2007 1001", they wouldn't necessarily twig that GDA = Good Dead on Arrival. With a lookup table, I can add this description.
Best practice: I can't find the link to hand, but it might be something I read in a Kimberly Tripp article. Aim to make your clustered primary key an incrementing integer to optimise the index.
Of course, if you are absolutely positive that you will never need any more than a handful of 4-character codes, there is no reason not to bang them straight into the table.
The best thing would be a lookup table with the defined values, related to the original table that uses the enumeration.
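A minimal sketch of that layout (the names are made up):
CREATE TABLE dbo.Status
(
    StatusID tinyint PRIMARY KEY,
    StatusCode char(4) NOT NULL UNIQUE,  -- for developers reading procs and views
    Description varchar(50) NOT NULL     -- for the UI
)
CREATE TABLE dbo.Orders
(
    OrderID int PRIMARY KEY,
    StatusID tinyint NOT NULL REFERENCES dbo.Status (StatusID)
)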
Collation ambiguities are one reason to say no to char(4): does ABcD = abCD = äBCd?
If you have 12 possible values, why not a tinyint/byte and a Status table?
If you have to store the status for 10 million rows, the 3-byte difference and the collation/string comparisons add up.
The place where I've run into this use case is columns that would map onto things that I would typically use an Enum for when programming. Do you store the integer value of the Enum or the name of the Enum in the database column? Honestly, I've done it both ways. Usually, I ask myself if the database will be used outside the application I'm building. If so, I will choose the human readable format to store in the database. If not, then I'll choose the integer value as it saves a little time when reconstituting (it's just a cast instead of a parse operation) the Enum in code.
You could also use a tinyint over an int
I always choose ints, simply because they are easier to map to enums in code.
If you're dealing with huge amounts of data and high throughput then a smallint or tinyint can give better performance and a smaller footprint on the hard disk. If the data in your application is often viewed directly through applications like Access or Cognos then your business people will probably appreciate the descriptive values. I know that when I'm analyzing data as part of my Database Developer role I get tired of joining a lot of lookup tables because I can't remember if 1 = Foo and 2 = Bar or 1 = Bar and 2 = Foo.
Also, although performance will be enhanced if you have to look up rows by these codes (which can have smaller indexes), it can also be hurt (in a minor way) by having to do joins when you look up rows regardless of the code but need to include the text value. In most applications that's not an issue, though, and it would probably only come into play in large data-warehousing/reporting environments.