SQL Select help - removing white space - sql

I'm running into a problem when trying to select records from my 2005 MS-SQL database (I'm still very new to SQL, but I learned and use the basic commands from w3schools already). In theory, all my manufacturer records should be unique. At least that is how I intended it to be when I did my first massive data dump into it. Unfortunately, that is not the case and now I need to fix it! Here is my scenario:
Table name = ItemCatalog
Relevant columns = Partnumber,Manufacturer,Category
When I did a SELECT DISTINCT Manufacturer FROM ItemCatalog this little problem is what turned up:
Cables2Go
CablesToGo
Cables To Go
CableToGo Inc
CablesToGo Inc
All 5 of those showed up as distinct, which they are. Can't fault my SELECT statement for returning it, but from my human perspective they are all the same manufacturer! One method I see working is doing an UPDATE command and fixing all the permutations that show up, but I have a LOT of manufacturers and this would be very time consuming.
Is there a way when I punch in a SELECT statement, that I can find all the likely permutations of a manufacturer name (or any field really)? I attempted the LIKE operator, so my statement would read
SELECT Manufacturer FROM ItemCatalog WHERE Manufacturer LIKE '%CablesToGo%'
but that didn't turn out as well as I had hoped. Here's the nasty bit: my other program that I'm putting together absolutely requires that I only ask for a single manufacturer name, not all 5 variations. Maybe I'm talking in circles here, but is there a simple way in one statement for me to find a similar string?

If you are doing some data mining, you could also try the SOUNDEX and DIFFERENCE functions in SQL Server.
While they are both outdated (they don't handle foreign characters very well), they could yield some interesting results for you:
SELECT * FROM ItemCatalog WHERE SOUNDEX(Manufacturer) = SOUNDEX('Cables To Go');
and
SELECT * FROM ItemCatalog WHERE DIFFERENCE(Manufacturer, 'Cables To Go') >= 3;
The number 3 means likely similar (0 means not similar and 4 means very similar).
There are a number of better SOUNDEX functions available on the internet. See Tek-Tips for an example.
Here is another example at SQL Team.

Standard SQL has a SIMILAR TO predicate, which is a bit more powerful than LIKE (SQL Server does not implement it, though).
However, you could use LIKE to good effect with:
Manufacturer LIKE 'Cable%Go%'
This would work in this specific case, finding all the variants listed. However, it would also find 'Cable TV Gorgons' and you probably don't need them included. Your version would also find 'We Hate CablesToGo With Ferocity Inc', which you probably didn't want either.
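To see the two patterns side by side, here is a minimal sketch. It assumes SQLite through Python's sqlite3 module purely so it can be run anywhere (the question is about SQL Server, where LIKE behaves the same for these patterns under a case-insensitive collation). The two decoy names are the made-up ones from the paragraph above.

```python
import sqlite3

# Sample data: the five variants from the question plus the two decoys.
names = [
    "Cables2Go", "CablesToGo", "Cables To Go", "CableToGo Inc", "CablesToGo Inc",
    "Cable TV Gorgons", "We Hate CablesToGo With Ferocity Inc",
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ItemCatalog (Manufacturer TEXT)")
conn.executemany("INSERT INTO ItemCatalog VALUES (?)", [(n,) for n in names])


def matches(pattern):
    """Return the manufacturers matched by a LIKE pattern, sorted by name."""
    rows = conn.execute(
        "SELECT Manufacturer FROM ItemCatalog "
        "WHERE Manufacturer LIKE ? ORDER BY Manufacturer",
        (pattern,),
    )
    return [r[0] for r in rows]


loose = matches("Cable%Go%")      # all five variants, plus 'Cable TV Gorgons'
strict = matches("%CablesToGo%")  # misses three variants, catches the 'We Hate' row
```

Neither pattern returns exactly the five rows you want, which is why the cleanup problem cannot be solved by the pattern alone.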
However, data cleansing is a major problem, and there are companies that make a living out of providing data cleansing. You often end up making a dictionary or thesaurus of terms (company names here) mapping all the variants encountered to the canonical form. The problem is that sometimes you find the same variant spelling is used for two separate canonical forms. For example, a pair of bright sparks might both decide to use 'C2G' as an abbreviation, but one uses it for 'Cables To Go Inc' and the other uses it for 'Computers To Gamers Inc'. You have to use some other information to determine whether a particular instance of 'C2G' means 'Cables' or 'Computers'.

'Cable%Go%' might work for that one case, but if you have other variations for other strings, you'll probably have to do a lot of manual data cleanup.

I suggest you use an object-relational mapping tool to map your table into objects and add the filtering logic there.

One option you have is to loosen your wildcard search to something like 'Cables%Go%'. This might be good in the short term, but with this approach you run the risk of matching more manufacturers than you want (e.g. Cables on the Go).
You could also put together a mapping table, which would put all of the variants of Cables To Go into a single group, which your app can query and normalize for your ItemCatalog query.
Another option you have is to introduce a Manufacturers table. Your ItemCatalog table would then have a foreign key to this table and only allow manufacturers that are in the Manufacturers table. This would require some cleanup of your ItemCatalog table to get it working, assuming that you want all of the variants of Cables To Go to be the same.

I know others are suggesting query fixes - I thought I'd elaborate on my long-term fix for kicks.
You could create another table relating each of the variations to a single manufacturer entity. If I encountered this situation at work (and I have), I would be enticed to fix it.
Create a manufacturer table with a primary key, name, etc.
Create a table with aliases - these will only be needed when you are presented with data that doesn't have the manufacturer's ID (like an import file).
Modify ItemCatalog such that it references the primary key from the manufacturer table (i.e. a ManufacturerID foreign key).
When importing data to ItemCatalog, assign the ManufacturerID foreign key based on matches to the alias table. If you have a name that matches 2+ records then you flag them for manual review or you try to match on more than manufacturer name.
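The four steps above could be sketched roughly like this. It is only an illustration under assumptions: SQLite via Python's sqlite3 stands in for SQL Server, and the table name ManufacturerAlias, the sample part numbers, and the second 'C2G' company are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE Manufacturer (
    ManufacturerID INTEGER PRIMARY KEY,
    Name TEXT NOT NULL UNIQUE
);
-- Hypothetical alias table: one row per spelling seen in import files.
CREATE TABLE ManufacturerAlias (
    Alias TEXT NOT NULL,
    ManufacturerID INTEGER NOT NULL REFERENCES Manufacturer(ManufacturerID),
    PRIMARY KEY (Alias, ManufacturerID)
);
CREATE TABLE ItemCatalog (
    Partnumber TEXT PRIMARY KEY,
    ManufacturerID INTEGER NOT NULL REFERENCES Manufacturer(ManufacturerID),
    Category TEXT
);
INSERT INTO Manufacturer VALUES (1, 'Cables To Go Inc'), (2, 'Computers To Gamers Inc');
INSERT INTO ManufacturerAlias VALUES
    ('Cables2Go', 1), ('CablesToGo', 1), ('Cables To Go', 1),
    ('C2G', 1), ('C2G', 2);
""")


def resolve(alias):
    """Return the ManufacturerID for an alias, or None if unknown or ambiguous."""
    ids = [r[0] for r in conn.execute(
        "SELECT ManufacturerID FROM ManufacturerAlias WHERE Alias = ?", (alias,))]
    return ids[0] if len(ids) == 1 else None


# Made-up import rows: (part number, manufacturer name as typed, category).
import_rows = [
    ("P-1", "Cables2Go", "Cables"),
    ("P-2", "C2G", "Cables"),        # matches two manufacturers
    ("P-3", "CablesToGo", "Adapters"),
]

needs_review = []
for part, name, category in import_rows:
    manufacturer_id = resolve(name)
    if manufacturer_id is None:
        needs_review.append(part)    # flag for manual review
    else:
        conn.execute("INSERT INTO ItemCatalog VALUES (?, ?, ?)",
                     (part, manufacturer_id, category))

canonical = [r[0] for r in conn.execute(
    "SELECT DISTINCT m.Name FROM ItemCatalog i "
    "JOIN Manufacturer m ON m.ManufacturerID = i.ManufacturerID")]
```

Once the catalog references ManufacturerID, the other program only ever sees the one canonical name, and the ambiguous row waits for a human rather than being guessed at.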

Related

What is the most correct way to store a "list" in a SQL Database?

So, I've read a lot about how stashing multiple values into one column is a bad idea and violates the first rule of data normalisation (which, surprisingly, is not "Do Not Talk About Data Normalisation") so I need some help.
At the moment I'm designing an ASP .NET webpage for the place I work for. I want to display data on a web page depending on what Active Directory groups the person belongs to. The first way of doing this that comes to mind is to have a table with, essentially, a column containing the AD group and the second column containing what list of computers belong to that list.
I've learnt that this is showing great disregard for relational databases, so what is a better way to do it? I want to control this access by SQL tables, so I can add/remove from these tables and change end users access accordingly.
Thanks for the help! :)
EDIT: To describe exactly what I want to do is this:
We have a certain group of computers that need to be checked up on, however these computers are in physically difficult to reach locations. The organisation I belong to has remote control enabled for these computers, however they're not in the business of giving out the remote control password (understandable).
The added layer of complexity is that, depending on who you are, our clients should only be able to see a certain group of computers (that is, the group of computers that their area owns). So, if Group A has Thomas in it, and Group B has Jones in it, if you belong to either group then you would just see one entry. However, if you belong to both groups you should see both Thomas and Jones computers in it.
The reason why I think that storing this data in a SQL cell is the way to go is because, to store them in tables would require (in my mind) a new table for each new "group" of computers. I don't want to crank out SQL tables for every new group, I'd much rather just have an added row in a SQL table somewhere.
Does this make any sense?
You basically have three options in SQL Server:
Storing the values in a single column.
Storing the values in a junction table.
Storing the values as XML (or as some other structured data format).
(Other databases have other options, such as arrays, nested tables, and JSON.)
In almost all cases, using a junction table is the correct approach. Why? Here are some reasons:
SQL Server has (relatively) lousy string manipulation, so doing something as simple as ensuring a unique list is really, really hard.
A junction table allows you to store lots of other information (When was a machine added? What is the full description of the machine? etc. etc.).
Most queries that you want are pretty easy with a junction table (with the one exception of getting a comma-delimited list, alas -- which is just counterintuitive rather than "hard").
All the types are stored natively.
A junction table allows you to enforce constraints (both check and foreign key) on the elements of the list.
Although a delimited list is almost never the right solution, it is possible to think of cases where it might be useful:
The list doesn't change and presentation of the list is very important.
Space usage is an issue (alas, denormalization often results in fewer pages).
Queries do not really access elements of the list, just the entire thing.
XML is also a reasonable choice under some circumstances. In the most recent versions of SQL Server, this can be made pretty efficient. However, it incurs the overhead of reading and parsing XML -- and things like duplicate elimination are still not obvious.
So, you do have options. In almost all cases, the junction table is the right approach.
There is an "it depends" that you should consider. If the data is never going to be queried (or queried very rarely) storing it as XML or JSON would be perfectly acceptable. Many DBAs would freak out but it is much faster to get the blob of data that you are going to send to the client than to recompose and decompose a set of columns from a secondary table. (There is a reason document and object databases are becoming so popular.)
... though I would ask why are you replicating active directory to your database and how are you planning on keeping these in sync.
It's not really a bad idea to store multiple values in one column, but it depends on the kind of search you want to run.
If you only want to know the persons that are part of a group, then you can store the persons in one column with a group id as the key. For an update, you just update the entire list for a group.
But if you want to search for a specific person that belongs to a group, then it's not recommended to store multiple persons in one column. In this case it's better to use an intermediate table that stores the person id and the group id.
Sounds like you want a table that maps users to group IDs and a second table that maps group IDs to which computers are in that group. I'm not sure, your language describing the problem was a bit confusing to me.
A list has some columns, like name, family name, phone number, etc.,
and rows like name=john familyName=lee number=12321321
name=... familyname=... number=...
An SQL database works the same way. Every row in an SQL database is a record, so you just add the records of your list into your database using an INSERT query.
complete explanation in here:
http://www.w3schools.com/sql/sql_insert.asp
This sounds like a typical many-to-many problem. You have many groups and many computers and they are related to each other. In this situation, it is often recommended to use a mapping table, a.k.a. a "junction table" or "cross-reference" table. This table consists solely of the two foreign keys to your other tables.
If your tables look like this:
Computer
- computerId
- otherComputerColumns
Group
- groupId
- othergroupColumns
Then your mapping table would look like this:
GroupComputer
- groupId
- computerId
And you would insert a single record for every relationship between a group and computer. This is in compliance with the rules for third normal form in regards to database normalization.
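As a rough sketch of the layout above (assumptions: SQLite via Python's sqlite3 so it is runnable; the group table is called AdGroup here because GROUP is a reserved word; the group and computer names are taken from the question's Thomas/Jones example):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE Computer (computerId INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE AdGroup  (groupId INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
-- The junction table: one row per (group, computer) relationship.
CREATE TABLE GroupComputer (
    groupId    INTEGER NOT NULL REFERENCES AdGroup(groupId),
    computerId INTEGER NOT NULL REFERENCES Computer(computerId),
    PRIMARY KEY (groupId, computerId)
);
INSERT INTO Computer VALUES (1, 'Thomas'), (2, 'Jones');
INSERT INTO AdGroup  VALUES (1, 'Group A'), (2, 'Group B');
INSERT INTO GroupComputer VALUES (1, 1), (2, 2);
""")


def visible_computers(group_names):
    """Computers visible to someone who belongs to the given AD groups."""
    if not group_names:
        return []
    placeholders = ",".join("?" * len(group_names))
    rows = conn.execute(
        f"""SELECT DISTINCT c.name
            FROM Computer c
            JOIN GroupComputer gc ON gc.computerId = c.computerId
            JOIN AdGroup g ON g.groupId = gc.groupId
            WHERE g.name IN ({placeholders})
            ORDER BY c.name""",
        list(group_names),
    )
    return [r[0] for r in rows]


only_a = visible_computers(["Group A"])
both = visible_computers(["Group A", "Group B"])
```

Adding a new group of computers is then just new rows in AdGroup and GroupComputer, not a new table per group.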
You can have a table with the group and group id, another table with the computer and computer id and a third table with the relation of group id and computer id.

Is it better to name the primary key column id or *_id?

I've been using Rails for a few years and I've grown used to the convention of naming the primary key column id. But I've run across lots of examples in SQL books that name the primary key column something like employee_id for an employees table or feed_id for a feeds table.
One advantage of the 2nd system seems to be that you can use USING() more to produce more concise SQL queries:
select feeds.title, items.title from items inner join feeds USING(feed_id);
As opposed to
select feeds.title, items.title from items inner join feeds on feeds.id = items.feed_id;
Which naming convention is better? Which is favored by experienced database administrators?
Also, is it better to pluralize the name of the table?
I always use the verbose form (i.e. 'employee_id' rather than 'id') as it is more descriptive. If you are joining more than one table and both have 'id' column you will have to alias 'id' if you need to SELECT both of the ids. Also, as you mentioned, you get the advantage of USING clause. In the grand scheme of things it isn't a huge factor one way or the other but the more verbose form gives you advantages.
Both options are valid but the purists will say use id as its name is specified by the table.
I use table_id because I find it to be more descriptive and makes debugging easier. It's more practical.
Re: Tablenames. Another hotly debated topic among database nerds but I say Singular.
Tablename_Id is my strong preference. When you do joins to FKs, you know exactly what to join to what, and you don't make mistakes where you join to ID in table a when you meant table b. Below is an example of how easy this is to do, especially if you copy the ON clause from somewhere else:
FROM tablea a
JOIN tableb b
ON a.ID = b.tableaid
JOIN tablec c
ON a.ID = c.tablebid
In the case above, you really wanted to join to B.Id but forgot to change it from a when you copied. It will work and give you a resultset that isn't correct. If you use table_id instead, the query would fail the syntax check.
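Here is a small sketch of that difference. The assumptions: SQLite via Python's sqlite3 is used only to make it runnable, and the second set of tables (a2, b2, c2) and their column names are invented to show the table_id style. Strictly speaking the failure is a "no such column" error when the statement is prepared rather than a syntax error, but the effect is the same: the bad query never runs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Generic naming: every table's key is called ID.
CREATE TABLE tablea (ID INTEGER PRIMARY KEY);
CREATE TABLE tableb (ID INTEGER PRIMARY KEY, tableaid INTEGER);
CREATE TABLE tablec (ID INTEGER PRIMARY KEY, tablebid INTEGER);
-- Table-specific naming for the same three tables (hypothetical names).
CREATE TABLE a2 (tablea_id INTEGER PRIMARY KEY);
CREATE TABLE b2 (tableb_id INTEGER PRIMARY KEY, tablea_id INTEGER);
CREATE TABLE c2 (tablec_id INTEGER PRIMARY KEY, tableb_id INTEGER);
""")

# The copy-paste mistake: a.ID where b.ID was intended. It runs without complaint.
conn.execute("""
    SELECT * FROM tablea a
    JOIN tableb b ON a.ID = b.tableaid
    JOIN tablec c ON a.ID = c.tablebid
""").fetchall()
generic_mistake_ran = True

# The same mistake with table-specific names refers to a column that does not exist.
try:
    conn.execute("""
        SELECT * FROM a2 a
        JOIN b2 b ON a.tablea_id = b.tablea_id
        JOIN c2 c ON a.tableb_id = c.tableb_id
    """).fetchall()
    specific_mistake_ran = True
except sqlite3.OperationalError:
    specific_mistake_ran = False
```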
Another problem with using Id is when you are doing complex reports. Since the report queries have to have fields with individual names, you can end up wasting time writing a bunch of aliases you wouldn't need if you had named the id with the table name.
Now, people who use ORMs don't write a lot of SQL, but what they do write and what report writers write are generally complex, complicated statements. You need to design your database to make it easier to do those things than simple queries.
The use of ID as the name for the identifying field is considered a SQL antipattern. http://www.amazon.com/SQL-Antipatterns-Programming-Pragmatic-Programmers/dp/1934356557/ref=sr_1_1?s=books&ie=UTF8&qid=1308929815&sr=1-1
This is user preference, but I always name the primary keys of my tables Id. I always name references of that Id in other tables as [SingularEntityName][Id] e.g.
Credentials
Id Password
Users
Id Name CredentialId
Descriptions
Id UserId
Keeps my references clean. However, just be consistent in your naming and it really shouldn't matter how you set up your schemas.
To open the can of worms again,
I'm willing to bet those who select tablename_id are older, more experienced programmers.
Those who use just id are younger.
Why? Because you learn that redundancy and consistency are not always a bad thing.
the one thing I would add to the #1 answer, use the "_" helps make it easier to pick out the variable in code, in the table, etc... I do the same for foreign keys. TableName_FK Some will argue over that but it works for me and it's obvious what it is.
I have had to work on other's code many times over the years. Consistency is critical, obfuscation is worthless and meaningful variable names very helpful.
There are those who argue that verbosity makes code harder to read. I don't think that argument flies in today's world of objects.that.derive.from.some.microsoft.class.twenty.layers.deep.that.you.have.to.fully.reference.
BTW - as so many have said, it's your choice. Those folks who spend time arguing over coding syntax don't have enough work to do. Learn to be flexible and to use the standards of the workplace where you are employed. If you are lucky enough to set your own standards, then have at it. The fact that you are wondering is great. But choose one and then be consistent (until you change jobs or have a paradigm shift that means you want to change your style).
You can often pick out what era someone started learning to code by their personal preferences and styles. Guys that write very tight, minimal, hard-to-read code started back when memory was very limited (DOS) and probably wrote a lot of assembler; those that use Hungarian started back with the Win SDK, etc...
This discussion has been evolving for decades. The older I get, the more I document my code, the more meaningful I make my variable names, etc... because in a week I will have forgotten what I wrote and I need the road maps to make sense of it. Not so much that I'm forgetful, although that's part of the equation, but more so because I'm writing code on so many different projects.
it's entirely your choice. But personally I prefer the second one as I wouldn't need to look for table names in my code when I come across an id. I think tablename_id is better.
Another advantage to giving your primary keys names that are unique to that table is that it makes it easier to have a naming convention, when referring to those keys in different tables, that indicates the corresponding key.
For example, suppose everything in your alpha table begins alpha_, so that you have alpha_id as your primary key. In your beta table - where everything would begin beta_ - you would use beta_alpha_id to have a reference in that table to the keys in the alpha table.

How important are lookup tables?

A lot of the applications I write make use of lookup tables, since that was just the way I was taught (normalization and such). The problem is that the queries I make are often more complicated because of this. They often look like this
get all posts that are still open
"SELECT * FROM posts WHERE status_id = (SELECT id FROM statuses WHERE name = 'open')"
Often times, the lookup tables themselves are very short. For instance, there may only be 3 or so different statuses. In this case, would it be okay to search for a certain type by using a constant or so in the application? Something like
get all posts that are still open
"SELECT * FROM posts WHERE status_id = ".Status::OPEN
Or, what if instead of using a foreign id, I set it as an enum and queried off of that?
Thanks.
The answer depends a little on whether you are limited to freeware such as PostgreSQL (not fully SQL compliant), or whether you are thinking about SQL (i.e. SQL compliant) and large databases.
In SQL compliant, Open Architecture databases, where there are many apps using one database, and many users using different report tools (not just the apps) to access the data, standards, normalisation, and open architecture requirements are important.
Despite the people who attempt to change the definition of "normalisation", etc. to suit their ever-changing purpose, Normalisation (the science) has not changed.
if you have data values such as {Open; Closed; etc} repeated in data tables, that is data duplication, a simple Normalisation error: if those values change, you may have to update millions of rows, which is a very limited design.
Such values should be Normalised into a Reference or Lookup table, with a short CHAR(2) PK:
O Open
C Closed
U [NotKnown]
The data values {Open;Closed;etc} are no longer duplicated in the millions of rows. It also saves space.
the second point is ease of change, if Closed were changed to Expired, again, one row needs to be changed, and that is reflected in the entire database; whereas in the un-normalised files, millions of rows need to be changed.
Adding new data values, eg. (H,HalfOpen) is then simply a matter of inserting one row.
in Open Architecture terms, the Lookup table is an ordinary table. It exists in the [SQL compliant] catalogue; as long as the FOREIGN KEY relation has been defined, the report tool can find that as well.
ENUM is non-SQL; do not use it. In SQL the "enum" is a Lookup table.
The next point relates to the meaningfulness of the key.
If the Key is meaningless to the user, fine, use an {INT;BIGINT;GUID;etc} or whatever is suitable; do not number them incrementally; allow "gaps".
But if the Key is meaningful to the user, do not use a meaningless number, use a meaningful Relational Key.
Now some people will get in to tangents regarding the permanence of PKs. That is a separate point. Yes, of course, always use a stable value for a PK (not "immutable", because no such thing exists, and a system-generated key does not provide row uniqueness).
{M,F} are unlikely to change
if you have used {0,1,2,4,6}, well, don't change it; why would you want to? Those values were supposed to be meaningless, remember; only a meaningful Key ever needs to be changed.
if you do use meaningful keys, use short alphabetic codes, that developers can readily understand (and infer the long description from). You will appreciate this only when you code SELECT and realise you do not have to JOIN every Lookup table. Power users too, appreciate it.
Since PKs are stable, particularly in Lookup tables, you can safely code:
WHERE status_code = 'O' -- Open
You do not have to JOIN the Lookup table and obtain the data value Open, as a developer, you are supposed to know what the Lookup PKs mean.
Last, if the database were large, and supported BI or DSS or OLAP functions in addition to OLTP (as properly Normalised databases can), then the Lookup table is actually a Dimension or Vector, in Dimension-Fact analyses. If it was not there, then it would have to be added in, to satisfy the requirements of that software, before such analyses can be mounted.
If you do that to your database from the outset, you will not have to upgrade it (and the code) later.
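A minimal sketch of the points above, assuming SQLite via Python's sqlite3 just to make it runnable. The table and column names (status, status_code, posts) and the sample posts are illustrative, not from the question's schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
-- Lookup table keyed by a short, meaningful code.
CREATE TABLE status (
    status_code TEXT PRIMARY KEY,
    name        TEXT NOT NULL
);
CREATE TABLE posts (
    post_id     INTEGER PRIMARY KEY,
    title       TEXT NOT NULL,
    status_code TEXT NOT NULL REFERENCES status(status_code)
);
INSERT INTO status VALUES ('O', 'Open'), ('C', 'Closed'), ('U', '[NotKnown]');
INSERT INTO posts VALUES (1, 'first', 'O'), (2, 'second', 'C'), (3, 'third', 'O');
""")

# Filtering needs no JOIN because the key itself is meaningful.
open_ids = [r[0] for r in conn.execute(
    "SELECT post_id FROM posts WHERE status_code = 'O' ORDER BY post_id")]

# Renaming Closed to Expired touches one lookup row, not every post.
rows_changed = conn.execute(
    "UPDATE status SET name = 'Expired' WHERE status_code = 'C'").rowcount

# The JOIN is only needed when the descriptive name must be displayed.
labels = conn.execute("""
    SELECT p.post_id, s.name
    FROM posts p JOIN status s ON s.status_code = p.status_code
    ORDER BY p.post_id
""").fetchall()
```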
Your Example
SQL is a low-level language, thus it is cumbersome, especially when it comes to JOINs. That is what we have, so we need to just accept the encumbrance and deal with it. Your example code is fine. But simpler forms can do the same thing.
A report tool would generate:
SELECT p.*,
s.name
FROM posts p,
status s
WHERE p.status_id = s.status_id
AND p.status_id = 'O'
Another Example
For banking systems, where we use short codes which are meaningful (since they are meaningful, we do not change them with the seasons, we just add to them), given a Lookup table such as (carefully chosen, similar to ISO Country Codes):
Eq Equity
EqCS Equity/Common Share
OTC OverTheCounter
OF OTC/Future
Code such as this is common:
WHERE InstrumentTypeCode LIKE 'Eq%'
And the users of the GUI would choose the value from a drop-down that displays
{Equity/Common Share;Over The Counter},
not {Eq;OTC;OF}, not {M;F;U}.
Without a lookup table, you can't do that, either in the apps, or in the report tool.
For look-up tables I use a sensible primary key -- usually just a CHAR(1) that makes sense in the domain with an additional Title (VARCHAR) field. This can maintain relationship enforcement while "keeping the SQL simple". The key to remember here is the look-up table does not "contain data". It contains identities. Some other identities might be time-zone names or assigned IOC country codes.
For instance gender:
ID Label
M Male
F Female
N Neutral
select * from people where gender = 'M'
Alternatively, an ORM could be used and manual SQL generation might never have to be done -- in this case the standard "int" surrogate key approach is fine because something else deals with it :-)
Happy coding.
Create a function for each lookup.
There is no easy way. You want performance and query simplicity. Ensure the following is maintained. You could create a SP_TestAppEnums to compare existing lookup values against the function and look for out of sync/zero returned.
CREATE FUNCTION [Enum_Post](@postname varchar(10))
RETURNS int
AS
BEGIN
DECLARE @postId int
SET @postId =
CASE @postname
WHEN 'Open' THEN 1
WHEN 'Closed' THEN 2
END
RETURN @postId
END
GO
/* Calling the function */
SELECT dbo.Enum_Post('Open')
SELECT dbo.Enum_Post('Closed')
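The out-of-sync check mentioned above could look something like this when the enum lives in application code instead of a SQL function. This is only a sketch under assumptions: Python with sqlite3 is used so it can be run as-is, and the PostStatus enum, the statuses table contents, and the out_of_sync helper are all made up for illustration.

```python
import sqlite3
from enum import IntEnum


class PostStatus(IntEnum):
    """Hypothetical application-side copy of the lookup values."""
    OPEN = 1
    CLOSED = 2


conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE statuses (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
-- 'HalfOpen' was added in the database but never in the application.
INSERT INTO statuses VALUES (1, 'Open'), (2, 'Closed'), (3, 'HalfOpen');
""")


def out_of_sync():
    """Names whose id differs between the lookup table and the enum, or is missing."""
    in_db = {name.upper(): id_
             for id_, name in conn.execute("SELECT id, name FROM statuses")}
    in_app = {member.name: int(member) for member in PostStatus}
    return sorted(key for key in in_db.keys() | in_app.keys()
                  if in_db.get(key) != in_app.get(key))


drift = out_of_sync()
```

Running a check like this at startup or in a test suite catches the case where someone edits one side and forgets the other.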
Question is: do you need to include the lookup tables (domain tables 'round my neck of the woods) in your queries? Presumably, these sorts of tables are usually
pretty static in nature — the domain might get extended, but it probably won't get shortened.
their primary key values are pretty unlikely to change as well (e.g., the status_id for a status of 'open' is unlikely to suddenly get changed to something other than what it was created as).
If the above assumptions are correct, there's no real need to add all those extra tables to your joins just so your where clause can use a friendly name instead of an id value. Just filter on status_id directly where you need to. I'd suspect the non-key attribute in the where clause ('name' in your example above) is more likely to change than the key attribute ('id' in your example above): you're more protected by referencing the desired key value(s) of the domain table in your join.
Domain tables serve
to limit the domain of the variable via a foreign key relationship,
to allow the domain to be expanded by adding data to the domain table, and
to populate UI controls and the like with user-friendly information.
Naturally, you'd need to suck domain tables into your queries where you actually required the non-key attributes from the domain table (e.g., descriptive name of the value).
YMMV: a lot depends on context and the nature of the problem space.
The answer is "whatever makes sense".
Lookup tables involve joins or subqueries, which are not always efficient. I make use of enums a lot to do this job. It's efficient and fast.
Where possible (and it is not always . . .), I use this rule of thumb: If I need to hard-code a value into my application (vs. let it remain a record in the database), and also store that value in my database, then something is amiss with my design. It's not ALWAYS true, but basically, whatever the value in question is, it either represents a piece of DATA, or a piece of PROGRAM LOGIC. It is a rare case that it is both.
NOT that you won't find yourself discovering which one it is halfway into the project. But as the others said above, there can be trade-offs either way. Just as we don't always achieve "perfect" normalization in a database design (for reasons of performance, or simply because you CAN take things too far in pursuit of academic perfection . . .), we may make some conscious choices about where we locate our "look-up" values.
Personally, though, I try to stand on my rule above. It is either DATA, or PROGRAM LOGIC, and rarely both. If it ends up as (or IN) a record in the database, I try to keep it out of the Application code (except, of course, to retrieve it from the database . . .). If it is hardcoded in my application, I try to keep it out of my database.
In cases where I can't observe this rule, I DOCUMENT THE CODE with my reasoning, so three years later, some poor soul will be able to figure out how it broke, if that happens.
The commenters have convinced me of the error of my ways. This answer and the discussion that went along with it, however, remain here for reference.
I think a constant is appropriate here, and a database table is not. As you design your application, you expect that table of statuses to never, ever change, since your application has hard-coded into it what those statuses mean, anyway. The point of a database is that the data within it will change. There are cases where the lines are fuzzy (e.g. "this data might change every few months or so…"), but this is not one of the fuzzy cases.
Statuses are a part of your application's logic; use constants to define them within the application. It's not only more strictly organized that way, but it will also allow your database interactions to be significantly speedier.

SQL: Best practice to store various fields in one table

I want to design a table for items.
There are many types of items, all share several fields.
Each type of item has its own fields.
I want to store the uncommon fields in a separate table.
I thought of something like :
----Items
+Item_id
+Item_Type_Id
+Item_Serial
...
----Item_types
+Item_Type_Id
+Item_Name
...
----Item_Fields
+Item_Field_Id
+Item_Type_Id
+Field_Name
...
----Field_Values
+Field_Value_Id
+Item_Field_Id
+Item_Id
+Value
...
The pro is having the ability to add fields and values without changing the tables.
The con is that I have to transpose the field names and values in order to see all info for an item.
Any better suggestions? Or perhaps a simple (not stored procedure) way to join the tables to get a flat info?
I tried to use PIVOT (I'm using SQL 2005) but with no luck.
Thanks.
I wrote a stored proc to make PIVOT more useful. Here is the source:
http://dot-dash-dot.com/files/pivot_query.sql
and some examples how to use it:
http://dot-dash-dot.com/files/pivot_query_examples.sql
For your data, the query would just be the raw data joining those tables above to produce a raw listing of:
set @myQuery = '
Select Item_Id, Item_Name, Field_Name, Value From ...
';
Then your call to pivot_query would be:
exec pivot_query @myQuery, 'Item_Id, Item_Name', 'Field_Name', 'max(Value)'
like that.
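If the set of field names is known in advance, a plain query with one MAX(CASE ...) per field also flattens the data without a stored procedure. To be clear, this is a different technique from pivot_query, not what that procedure does internally. The sketch below assumes SQLite via Python's sqlite3 so it is runnable, and the 'Book' type with 'Pages' and 'Published' fields is made-up sample data on the question's four tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Item_Types   (Item_Type_Id INTEGER PRIMARY KEY, Item_Name TEXT);
CREATE TABLE Items        (Item_Id INTEGER PRIMARY KEY, Item_Type_Id INTEGER, Item_Serial TEXT);
CREATE TABLE Item_Fields  (Item_Field_Id INTEGER PRIMARY KEY, Item_Type_Id INTEGER, Field_Name TEXT);
CREATE TABLE Field_Values (Field_Value_Id INTEGER PRIMARY KEY, Item_Field_Id INTEGER,
                           Item_Id INTEGER, Value TEXT);
INSERT INTO Item_Types   VALUES (1, 'Book');
INSERT INTO Items        VALUES (10, 1, 'SN-10'), (11, 1, 'SN-11');
INSERT INTO Item_Fields  VALUES (100, 1, 'Pages'), (101, 1, 'Published');
INSERT INTO Field_Values VALUES (1, 100, 10, '320'), (2, 101, 10, '2005'), (3, 100, 11, '128');
""")

# One output column per known field name; items missing a field get NULL.
flat = conn.execute("""
    SELECT i.Item_Id,
           t.Item_Name,
           MAX(CASE WHEN f.Field_Name = 'Pages'     THEN v.Value END) AS Pages,
           MAX(CASE WHEN f.Field_Name = 'Published' THEN v.Value END) AS Published
    FROM Items i
    JOIN Item_Types t        ON t.Item_Type_Id  = i.Item_Type_Id
    LEFT JOIN Field_Values v ON v.Item_Id       = i.Item_Id
    LEFT JOIN Item_Fields f  ON f.Item_Field_Id = v.Item_Field_Id
    GROUP BY i.Item_Id, t.Item_Name
    ORDER BY i.Item_Id
""").fetchall()
```

The trade-off is that every new field name means editing the query, which is exactly what a dynamic pivot procedure avoids.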
One other option is to store items in XML format in one single field. Depending on your usage scenario, it may work well. Or it may not.
I believe there has to be some grouping of values.
For example lets say your items are objects in a room. Then different types of objects have different attributes. For example books have publication date and number of pages, chairs have color pattern and height, etc.
In this example, you make an item table, a book table and a chair table.
You could make an "additional values" table that holds generic information as above, but what you really want to do is figure out the "types" of the different groups of attributes and then make every one of those types its own table.
Is there a set of values that all items have? There has to be at least one, which is a type field (this describes where the other information is stored). I expect every item will also have a name and a description. This is the information to go in the item table.
Then you make additional tables for the different types itembook, itemchair etc. There may even be some overlap. For example itembook, itemhardback, itempaperback would be 3 tables used to describe books.
I believe this is the best solution to your problem. It will still allow you to extend, but it does put a framework around your data.
Of course there are systems that do it the way you describe, but unless you are building a tool that others are going to reuse for many different projects, it makes sense to design the system for the task at hand. You end up falling into the over designing trap otherwise. (IMHO)
On the other hand, if you are going to go the totally generic direction, I suggest you use one of the systems that already exist and work in this way (entity framework, app framework, etc). Use someone else's; don't start from scratch.
I'm not too sure how you want to retrieve the info, but something like the below may work. (It's probably close to what Hogan mentioned.)
If you want to retrieve data for a type, you can just JOIN two tables.
If you want to retrieve data for all types (with all fields), you can LEFT JOIN all tables.
----Items
+Item_id
+Item_Type_Id
+Item_Common_Field1
+Item_Common_Field2
...
----Item_Type_A
+Item_id
+Item_Type_A_Specific_Field1
+Item_Type_A_Specific_Field2
...
----Item_Type_B
+Item_id
+Item_Type_B_Specific_Field1
...
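A small sketch of the two retrieval cases, assuming SQLite via Python's sqlite3 so it runs anywhere. The concrete types (book, chair) borrow from the earlier answer's example, and the column names are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
-- Common fields live in Items; each type gets its own table keyed by Item_Id.
CREATE TABLE Items (
    Item_Id   INTEGER PRIMARY KEY,
    Item_Type TEXT NOT NULL,
    Name      TEXT NOT NULL
);
CREATE TABLE Item_Book (
    Item_Id INTEGER PRIMARY KEY REFERENCES Items(Item_Id),
    Pages   INTEGER
);
CREATE TABLE Item_Chair (
    Item_Id   INTEGER PRIMARY KEY REFERENCES Items(Item_Id),
    Height_Cm INTEGER
);
INSERT INTO Items      VALUES (1, 'book', 'SQL Primer'), (2, 'chair', 'Desk chair');
INSERT INTO Item_Book  VALUES (1, 320);
INSERT INTO Item_Chair VALUES (2, 95);
""")

# One type: an inner join of two tables.
books = conn.execute("""
    SELECT i.Name, b.Pages
    FROM Items i JOIN Item_Book b ON b.Item_Id = i.Item_Id
""").fetchall()

# All types with all fields: LEFT JOIN every type table; unrelated columns are NULL.
everything = conn.execute("""
    SELECT i.Name, b.Pages, c.Height_Cm
    FROM Items i
    LEFT JOIN Item_Book  b ON b.Item_Id = i.Item_Id
    LEFT JOIN Item_Chair c ON c.Item_Id = i.Item_Id
    ORDER BY i.Item_Id
""").fetchall()
```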
If you add these columns to the table, you can make them sparse columns to avoid the space taken by unspecified uncommon fields.
But I would not call this a best practice. (see comments under your question)
I don't want to be accused of being the always-uses-the-latest-useless-technology guy, but depending on your use case, this might be a good case for a nosql database - Tokyo, Mongo, SimpleDB, etc. Or as Developer Art suggested, you could just serialize the different fields into a single column. It's not the worst thing in the world.

What are some of your most useful database standards?

I have some ideas, some that I have accumulated over time, but I really want to know what makes things go smoothly for you when modeling a database:
Table name matches Primary Key name and description key
Schemas are by functional area
Avoid composite primary keys where possible (use unique constraints)
Camel Case table names and field names
Do not prefix tables with tbl_, or procs with SP_ (no hungarian notation)
OLTP databases should be at least in BCNF / 4NF
Name similarly targeted stored procs with the same prefix, for instance if you've got 3 stored procedures for Person, name them as below. That way everything for Person is grouped in one place and you can find them easily without having to look through all your procs.
PersonUpdate
PersonDelete
PersonCreate
Do similar things for tables when you have groups of tables with related data. For instance:
InvoiceHeaders
InvoiceLines
InvoiceLineDetails
If you have the option of schemas within your database, use them. It's much nicer to see:
Invoice.Header
Invoice.Line.Items
Invoice.Line.Item.Details
Person.Update
Person.Delete
Person.Create
Don't use triggers unless there's no other reasonable approach to achieve that goal.
Give field names a meaningful prefix so you can tell what table they come from without someone needing to explain. That way when you see a field name referenced, you can easily tell which table it's from.
Use consistent data types for fields containing similar data, i.e. don't store a phone number as numeric in one table and varchar in another. In fact, don't store it as numeric at all; if I come across a negative phone number I'll be mad.
Don't use spaces or other obscure characters in table/field names. They should be entirely alphanumeric - or if I had my druthers, entirely alphabetic with the exception of the underscore. I'm currently working on an inherited system where table and field names contain spaces, question marks and exclamation marks. Makes me want to kill the designer on a daily basis!
Don't use syntax keywords as object names; it'll cause headaches trying to retrieve data from them. I hate having to wrap object names as [index]. That's two needless chars I didn't need to type, damn you!
One thing I haven't seen mentioned yet:
Never use database keywords as object names. You do not want to have to qualify them every time you use them.
If you misspell something when you create it, fix it as soon as you notice it. Don't spend years having to remember that in this table UserName is really Usernmae. It's a whole lot easier to fix when there isn't much code written against it.
Never use implied joins (the comma syntax), always specify the joins.
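To illustrate the difference, here is a small sketch using Python's sqlite3 module (a stand-in for whatever engine you use; the Customers and Orders tables are made up for the example). Both forms return the same rows, but with the comma syntax a forgotten WHERE condition silently becomes a cross join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Customers (CustomerId INTEGER PRIMARY KEY, Name TEXT);
CREATE TABLE Orders (OrderId INTEGER PRIMARY KEY, CustomerId INTEGER, Total INTEGER);
INSERT INTO Customers VALUES (1, 'Ann'), (2, 'Bob');
INSERT INTO Orders VALUES (10, 1, 25), (11, 1, 40), (12, 2, 15);
""")

# Implied join: the join condition hides among the filters in WHERE.
implied_rows = conn.execute("""
    SELECT c.Name, o.Total
    FROM Customers c, Orders o
    WHERE c.CustomerId = o.CustomerId
    ORDER BY o.OrderId
""").fetchall()

# Explicit join: the join condition sits next to the table it joins.
explicit_rows = conn.execute("""
    SELECT c.Name, o.Total
    FROM Customers c
    INNER JOIN Orders o ON o.CustomerId = c.CustomerId
    ORDER BY o.OrderId
""").fetchall()

# Comma syntax with the condition left out: every customer pairs with every order.
cross_join_count = conn.execute(
    "SELECT COUNT(*) FROM Customers c, Orders o"
).fetchone()[0]

print(implied_rows, explicit_rows, cross_join_count)
```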
Putting everybody's input together into one list.
Naming Standards
Schemas are named by functional area (Products, Orders, Shipping)
No Hungarian Notation: No type names in object names (no strFirstName)
Do not use registered keywords for object names
No spaces or any special characters in object names (Alphanumeric + Underscore are the only things allowed)
Name objects in a natural way (FirstName instead of NameFirst)
Table name should match Primary Key Name and Description field (SalesType – SalesTypeId, SalesTypeDescription)
Do not prefix with tbl_ or sp_
Name code by object name (CustomerSearch, CustomerGetBalance)
CamelCase database object names
Column names should be singular
Table names may be plural
Give business names to all constraints (MustEnterFirstName)
Data Types
Use same variable type across tables (Zip code – numeric in one table and varchar in another is not a good idea)
Use NVarChar for customer information (name, address(es), etc.); you never know when you may go multinational
In code
Keywords always in UPPERCASE
Never use implied joins (Comma syntax) - always use explicit INNER JOIN / OUTER JOIN
One JOIN per line
One WHERE clause per line
No loops – replace with set based logic
Use short forms of table names for aliases rather than A, B, C
Avoid triggers unless there is no recourse
Avoid cursors like the plague (read http://www.sqlservercentral.com/articles/T-SQL/66097/)
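As a rough illustration of "replace loops with set-based logic", the sketch below applies the same price change twice: once row by row (the cursor-style approach) and once as a single UPDATE. It uses Python's sqlite3 module and a made-up Products table; prices are stored as integers to keep the comparison exact.

```python
import sqlite3

def make_db():
    """Build a throwaway in-memory database with a hypothetical Products table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE Products (ProductId INTEGER PRIMARY KEY, Category TEXT, Price INTEGER)"
    )
    conn.executemany(
        "INSERT INTO Products VALUES (?, ?, ?)",
        [(1, "cable", 100), (2, "cable", 200), (3, "chair", 500)],
    )
    return conn

# Row-by-row: fetch each matching row, then issue one UPDATE per row.
loop_db = make_db()
rows = loop_db.execute(
    "SELECT ProductId, Price FROM Products WHERE Category = 'cable'"
).fetchall()
for product_id, price in rows:
    loop_db.execute(
        "UPDATE Products SET Price = ? WHERE ProductId = ?", (price + 10, product_id)
    )

# Set-based: one statement describes the whole change.
set_db = make_db()
set_db.execute("UPDATE Products SET Price = Price + 10 WHERE Category = 'cable'")

query = "SELECT ProductId, Price FROM Products ORDER BY ProductId"
loop_result = loop_db.execute(query).fetchall()
set_result = set_db.execute(query).fetchall()
print(loop_result == set_result)
```

Both versions end in the same state; the set-based one leaves the engine free to decide how to visit the rows.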
Documentation
Create database diagrams
Create a data dictionary
Normalization and Referential Integrity
Use single column primary keys as much as possible. Use unique constraints where required.
Referential integrity will be always enforced
Avoid ON DELETE CASCADE
OLTP must be at least 4NF
Evaluate every one-to-many relationship as a potential many-to-many relationship
Non user generated Primary Keys
Build Insert based models instead of update based
PK to FK must be same name (Employee.EmployeeId is the same field as EmployeeSalary.EmployeeId)
Except when there is a double join (Person.PersonId joins to PersonRelation.PersonId_Parent and PersonRelation.PersonId_Child)
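A minimal sketch of a few of these rules together, assuming SQLite (via Python's sqlite3) as a stand-in engine: a single-column, non-user-generated primary key, a named unique constraint in place of a composite key, and an enforced foreign key whose column name matches the primary key. The BadgeNumber and Site columns and the constraint name are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce FKs unless asked
conn.executescript("""
CREATE TABLE Employee (
    EmployeeId  INTEGER PRIMARY KEY,       -- single-column surrogate key
    BadgeNumber TEXT NOT NULL,
    Site        TEXT NOT NULL,
    CONSTRAINT EmployeeBadgeUniquePerSite UNIQUE (Site, BadgeNumber)
);
CREATE TABLE EmployeeSalary (
    EmployeeSalaryId INTEGER PRIMARY KEY,
    EmployeeId       INTEGER NOT NULL REFERENCES Employee(EmployeeId),
    Amount           INTEGER NOT NULL
);
INSERT INTO Employee (EmployeeId, BadgeNumber, Site) VALUES (1, 'B-001', 'HQ');
INSERT INTO EmployeeSalary (EmployeeId, Amount) VALUES (1, 50000);
""")

def try_insert(sql):
    """Run an INSERT and return 'ok' or the integrity error text."""
    try:
        conn.execute(sql)
        return "ok"
    except sqlite3.IntegrityError as exc:
        return str(exc)

# Same badge at the same site: rejected by the unique constraint.
duplicate_badge = try_insert(
    "INSERT INTO Employee (BadgeNumber, Site) VALUES ('B-001', 'HQ')"
)
# Salary for an employee that does not exist: rejected by the foreign key.
orphan_salary = try_insert(
    "INSERT INTO EmployeeSalary (EmployeeId, Amount) VALUES (99, 1)"
)
print(duplicate_badge)
print(orphan_salary)
```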
Maintenance : run periodic scripts to find
Schema without table
Orphaned records
Tables without primary keys
Tables without indexes
Non-deterministic UDF
Backup, Backup, Backup
Be good
Be Consistent
Fix errors now
Read Joe Celko's SQL Programming Style (ISBN 978-0120887972)
My standards for Oracle are:
Keywords are always in UPPERCASE;
Database object names are always in lowercase;
Underscores will replace spaces (ie there won't be any camel case conventions that are common on, say, SQL Server);
Primary keys will pretty much always be named 'id';
Referential integrity will be enforced;
Integer values (including table ids) will generally always be NUMBER(19,0). The reason for this is that this will fit in a 64-bit signed integer thus allowing the Java long type to be used instead of the more awkward BigInteger;
Despite the misnomer of appending "_number" to some column names, the type of such columns will be VARCHAR2 not a number type. Number types are reserved for primary keys and columns you do arithmetic on;
I always use technical primary keys; and
Each table will have its own sequence for key generation. The name of that sequence will be <table_name>_seq.
With SQL Server, the only modification is to use camel case for database object names (ie PartyName instead of party_name).
Queries will tend to be written multi-line with one clause or condition per line:
SELECT field1, field2, field3
FROM tablename t1
JOIN tablename2 t2 ON t1.id = t2.tablename_id
WHERE t1.field1 = 'blah'
AND t2.field2 = 'foo'
If the SELECT clause is sufficiently long I'll split it out one field per line.
Name all constraints
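One practical payoff of naming constraints is that the name shows up when the constraint is violated. A small sketch, assuming SQLite through Python's sqlite3 module (the table and the constraint name MustEnterFirstName are illustrative; other engines word their errors differently):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE Customers (
    CustomerId INTEGER PRIMARY KEY,
    FirstName  TEXT,
    CONSTRAINT MustEnterFirstName
        CHECK (FirstName IS NOT NULL AND FirstName <> '')
)
""")

try:
    # An empty first name violates the named CHECK constraint.
    conn.execute("INSERT INTO Customers (FirstName) VALUES ('')")
    error_message = None
except sqlite3.IntegrityError as exc:
    error_message = str(exc)

print(error_message)
```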
Don't forget to back up your databases on a regular basis.
Don't use type names in the field names. The older guys will remember the old MS standard of lpszFieldName and the stupidity that ensued.
Use descriptive field names that follow normal language conventions. For example, "FirstName" instead of "NameFirst".
Each word in the field name is capitalized
No underscores
Do not use normal keywords such as "Index"
Do not prefix ANYTHING with the object type. For example we do NOT use tblCustomers or spCustomersGet. These don't allow for good sorting and provide zero value.
Use schemas to define separate areas of the database. Such as sales.Customers and hr.Employees. This will get rid of most of the prefixes people use.
Loops of any kind should be viewed with suspicion. There's usually a better set based way.
Use views for complicated joins.
Avoid complicated joins when possible. It may be more aesthetically pleasing to have a CustomerPhoneNumbers table; but honestly, how many phone numbers do we really need to store? Just add the fields to the Customers table. Your DB queries will be faster and it's much easier to understand.
If one table calls a field "EmployeeId" then EVERY SINGLE TABLE that references it should use that name. It doesn't need to be called CustomerServiceRepId just because it's in an extension table.
Almost all tables have the "s" ending. For example: Customers, Orders, etc. After all the table holds many records...
Evaluate your queries, indexes and foreign key relationships with an analysis tool. Even those that may be generated for you. You might be surprised.
Linking tables which support many to many relationships have both linked tables in the name. For example, SchoolsGrades. It's very easy to tell by the table name what it does.
Be CONSISTENT. If you start down one path with your conventions, don't change horses halfway unless you are willing to refactor all of the previous work. This should put the brakes on any "wouldn't it be great if.." ideas that end up causing confusion and vast amounts of rework.
Think before you type. Do you really need that table, field, sproc, or view? Are you sure it isn't covered somewhere else? Get consensus before adding it. And if for some reason you have to take it out, talk to your team first. I've been at places where the DBAs make daily breaking changes without regard for the devs. This isn't fun.
If a database is for a particular application, have a version table so that the database releases can be checked against the code releases (amongst other reasons).
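A version table can be as simple as the sketch below. The table name SchemaVersion, its columns, and the helper schema_is_compatible are all hypothetical; the idea is only that the application checks the database release before it starts working. It uses Python's sqlite3 module for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE SchemaVersion (
    Version   INTEGER NOT NULL,
    AppliedAt TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);
-- Each release script appends its own row.
INSERT INTO SchemaVersion (Version) VALUES (1), (2), (3);
""")

def schema_is_compatible(connection, expected_version):
    """Return True when the newest applied schema version matches the code's expectation."""
    (current,) = connection.execute(
        "SELECT MAX(Version) FROM SchemaVersion"
    ).fetchone()
    return current == expected_version

print(schema_is_compatible(conn, 3))
print(schema_is_compatible(conn, 4))
```

Keeping one row per release, rather than a single overwritten value, also leaves a history of when each change was applied.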
I always try not to use the type in the field name - "sFirstName", "sLastName", or "iEmployeeID". While they match at first, if something changes, they'll be out of sync, and it's a huge headache to change those names later, since you have to change the dependent objects as well.
Intellisense and the GUI tools make it trivial to find out what type a column is, so I don't feel this is necessary.
The WITH clause really helps break queries down into manageable parts.
It also really helps for efficiency on the execution plans of the queries.
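Here is a sketch of how the WITH clause splits a query into named steps, run through Python's sqlite3 module with a made-up Orders table. It only demonstrates the readability side; whether a common table expression changes the execution plan depends on the engine and is not shown here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Orders (OrderId INTEGER PRIMARY KEY, CustomerId INTEGER, Total INTEGER);
INSERT INTO Orders VALUES (1, 1, 50), (2, 1, 70), (3, 2, 20), (4, 3, 200);
""")

big_spenders = conn.execute("""
    WITH CustomerTotals AS (
        -- Step 1: total spent per customer
        SELECT CustomerId, SUM(Total) AS TotalSpent
        FROM Orders
        GROUP BY CustomerId
    ),
    BigSpenders AS (
        -- Step 2: keep only customers at or above a threshold
        SELECT CustomerId, TotalSpent
        FROM CustomerTotals
        WHERE TotalSpent >= 100
    )
    SELECT CustomerId, TotalSpent
    FROM BigSpenders
    ORDER BY CustomerId
""").fetchall()

print(big_spenders)
```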
Ensure that every varchar/nvarchar choice is appropriate.
Ensure that every NULLable column choice is appropriate - avoid NULLable columns where possible - allowing NULL should be the justifiable position.
Regardless of any other rules you might use from the suggestions here, I would create a stored procedure in the database that can be run on a regular basis to determine system health for any rules or standards you do have (some of this is a little SQL-Server specific):
Look for orphaned records in any cases where the DBMS system's referential integrity cannot be used for some reason (in my system I have a table of processes and a table of tests - so my system_health SP looks for processes without tests, since I only have a one-way FK relationship)
Look for empty schemas
Look for tables without primary keys
Look for tables without any indexes
Look for database objects without documentation (we use SQL Server Extended properties to put the documentation in the database - this documentation can be as granular as the column).
Look for system-specific issues - tables which need to be archived, exceptions which are not part of normal monthly or daily processing, certain common column names with or without defaults (CreateDate, say).
Look for non-deterministic UDFs
Look for TODO comments to ensure that code in the DB does not somehow have untested or pre-release code.
All this can be automated to give you an overall picture of system health.
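Two of the checks above, sketched against SQLite using Python's sqlite3 module rather than a SQL Server stored procedure. The Process and Test tables mirror the example in the answer; ImportStaging and the helper name tables_without_primary_key are invented. On SQL Server you would query the system catalog views instead of sqlite_master and PRAGMA table_info.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Process (ProcessId INTEGER PRIMARY KEY, Name TEXT);
CREATE TABLE Test (TestId INTEGER PRIMARY KEY, ProcessId INTEGER, Name TEXT);
CREATE TABLE ImportStaging (RawLine TEXT);      -- deliberately has no primary key
INSERT INTO Process VALUES (1, 'Billing'), (2, 'Shipping');
INSERT INTO Test VALUES (1, 1, 'Billing smoke test');
""")

# Check 1: processes that have no tests (an anti-join via LEFT JOIN ... IS NULL).
processes_without_tests = [
    name
    for (name,) in conn.execute("""
        SELECT p.Name
        FROM Process p
        LEFT JOIN Test t ON t.ProcessId = p.ProcessId
        WHERE t.TestId IS NULL
        ORDER BY p.Name
    """)
]

def tables_without_primary_key(connection):
    """List user tables in which no column is part of a primary key."""
    tables = [
        name
        for (name,) in connection.execute(
            "SELECT name FROM sqlite_master "
            "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' ORDER BY name"
        )
    ]
    missing = []
    for table in tables:
        columns = connection.execute(f'PRAGMA table_info("{table}")').fetchall()
        # Each row is (cid, name, type, notnull, dflt_value, pk); pk > 0 marks key columns.
        if not any(column[5] for column in columns):
            missing.append(table)
    return missing

print(processes_without_tests)
print(tables_without_primary_key(conn))
```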
Everyone writes SQL queries (views, stored procedures, etc) in the same basic format. It really helps development/maintenance efforts down the road.
Consistent naming standards. Having everyone on the same page, using the same format (whether it be Camel Case, specific prefixes, etc.) helps in being able to maintain a system accurately.
A few likes and dislikes.
In my opinion, prefixes are horrible in every respect. I currently work on a system where the tables are prefixed and the columns within the tables are prefixed with 2-letter table name acronyms. I waste at least 30 minutes each day working on this database because the acronyms aren't logical. If you want to denote something with a prefix, use a schema owner instead.
Using NVarchar from the start of a project if there is even a slight hint that down the line the text data will need to support multi lingual chars. Upgrading large databases because of lack of forward planning and thinking is a pain and wastes time.
Splitting each condition within a WHERE clause onto a new line for readability (IN and NOT IN statements wrapped in brackets and tabbed in). I think this is the most important standard for me.
I worked at one company where a standard was that commas must always be placed at the start of a line when performing parameter or variable declarations. This apparently made it more readable; however, I found it a complete nightmare.
In addition to normalization to 3NF or BCNF (more about that in this question), I have found the following to be useful:
Name tables as plural nouns
Name columns as singular
So a "People" table has a "PersonID" column.
There is nothing wrong with composite keys, so long as the rules of 3NF or BCNF still hold. In many cases (such as the "many-to-many" case) this is entirely desirable.
Avoid repeating the table name in the column names. peoplePersonID is better written as table.column anyway, and much more readable and therefore self-documenting. People.PersonID is better, to me at least.
ON DELETE CASCADE should be used very carefully.
Remember that NULL means one of two things: Either it's unknown or it's not applicable.
Remember also that NULLs have interesting effects on joins, so practice your LEFT, RIGHT, and FULL outer joins.
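A short sketch of that behaviour, using Python's sqlite3 module and a hypothetical ManagerID column on the People table: rows whose join column is NULL vanish from an inner join and reappear, padded with NULLs, in a LEFT join, because NULL never compares equal to anything.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE People (PersonID INTEGER PRIMARY KEY, Name TEXT, ManagerID INTEGER);
INSERT INTO People VALUES (1, 'Ann', NULL), (2, 'Bob', 1), (3, 'Cy', NULL);
""")

# Inner join: people with a NULL ManagerID match nothing and drop out.
inner_rows = conn.execute("""
    SELECT p.Name, m.Name
    FROM People p
    JOIN People m ON m.PersonID = p.ManagerID
    ORDER BY p.PersonID
""").fetchall()

# Left join: every person is kept; a missing manager shows up as NULL (None in Python).
left_rows = conn.execute("""
    SELECT p.Name, m.Name
    FROM People p
    LEFT JOIN People m ON m.PersonID = p.ManagerID
    ORDER BY p.PersonID
""").fetchall()

# NULL = NULL is neither true nor false; it evaluates to NULL.
null_equals_null = conn.execute("SELECT NULL = NULL").fetchone()[0]

print(inner_rows, left_rows, null_equals_null)
```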
Some other (albeit small) comments to throw against the wall...
SQL Server database schemas can be useful for both organizing tables and stored procedures as well as controlling security.
Every transactional table should always track both who created the record and when, as well as who updated it and when, in separate columns. I've seen implementations that simply used "update date", which can lead to auditing challenges in the future.
Use GUIDs for row identifiers for all rows for projects with offline/synchronization requirements.
Good database design and normalization.
Tables are named in the singular, lowercase, no underscores, no prefix
Fields also lowercase, no underscores, no prefix
Stored procedures prefixed with "st_" (sorts nicely)
Views that are treated like tables have no prefix
Views created for special reports, etc. have a "v" prefix
Indexed views created for performance have an "ixv" prefix
All indexes have purposeful names (no auto-naming)
Strongly prefer uniqueidentifier (with sequential increment) over int IDENTITY for surrogate keys
Don't artificially limit VARCHAR/NVARCHAR fields to 100 or 255. Give them room to breathe. This isn't the 1980s; fields are not stored padded to their max length.
3NF minimum standard
Prefer joining tables to column-level foreign keys: many 1:m assumptions are challenged as a system grows over time.
Always use surrogate keys, not natural keys, as the primary key. All assumptions about "natural" keys (SSNs, usernames, phone numbers, internal codes, etc.) will eventually be challenged.
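As a sketch of "prefer joining tables to column-level foreign keys", the example below links people to addresses through a separate table instead of putting an addressid column on person, so a second address later needs no schema change. It follows the naming style of this list (singular, lowercase, no underscores), but the tables themselves are invented, and an integer surrogate key is used only because the example runs on SQLite through Python's sqlite3 module.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce FKs unless asked
conn.executescript("""
CREATE TABLE person  (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE address (id INTEGER PRIMARY KEY, city TEXT NOT NULL);
CREATE TABLE personaddress (
    id        INTEGER PRIMARY KEY,                    -- surrogate key
    personid  INTEGER NOT NULL REFERENCES person(id),
    addressid INTEGER NOT NULL REFERENCES address(id),
    UNIQUE (personid, addressid)                      -- natural pairing kept unique
);
INSERT INTO person  VALUES (1, 'Ann');
INSERT INTO address VALUES (1, 'Austin'), (2, 'Boston');
INSERT INTO personaddress (personid, addressid) VALUES (1, 1), (1, 2);
""")

# The 1:m assumption ("a person has one address") was never baked into person.
cities_for_ann = [
    city
    for (city,) in conn.execute("""
        SELECT a.city
        FROM person p
        INNER JOIN personaddress pa ON pa.personid = p.id
        INNER JOIN address a ON a.id = pa.addressid
        WHERE p.name = 'Ann'
        ORDER BY a.city
    """)
]
print(cities_for_ann)
```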
Tabular formatted SQL.
select a.field1, b.field2
from       any_table   a
inner join blah        b on b.a_id       = a.a_id
inner join yet_another y on y.longer_key = b.b_id
where      a.field_3         > 7
and        b.long_field_name < 2;
Part of this is to use uniformly long alias names (in the example, here, a, b, and y are all length 1).
With this kind of formatting, I can more quickly answer common questions like, "what table is aliased by 'a'?" and "which fields join table T into the query?" The structure doesn't take long to apply or to update, and I find that it saves a lot of time. We spend more time reading code than writing it.
Document everything; wiki type documentation is easy to setup and the software is free.
Make sure you understand the interface first and design the database second. Most of the time it's a lot better to know how the data you are going to use needs to work and then engineer the database. Most bad DB design happens as things evolve, not upfront.
Then define the database standard and version you are going to work to. Define standards for the code elements (views, functions etc), database naming; naming conventions for columns, tables; type conventions for columns; coding templates.
Spend time considering how you define types; whether to use standard database types for fields or bespoke types is a good thing to sort out upfront.
As part of your documentation, include a list of don'ts as well as dos for the application, which includes your preferred hated functionality, such as cursors and triggers.
Review it regularly.
13- Evaluate your queries
That's true. Sometimes you don't get what you wanted.
For me, it's always useful to name the tables and fields with their exact content and (for us) in clear Spanish, using Upper Camel Case, with no whitespace:
User Name: NombreUsuario
First Last Name: ApellidoPaterno
Second Last Name: ApellidoMaterno
etc etc
Taking "database" to mean "SQL product", my answer is, "Too many to mention. You could write a whole book on the subject." Happily, someone has.
We use Joe Celko's SQL Programming Style (ISBN 978-0120887972): "this book is a collection of heuristics and rules, tips, and tricks that will help you improve SQL programming style and proficiency, and for formatting and writing portable, readable, maintainable SQL code."
Advantages of this approach include:
the guy knows more about this kind of thing than me (is there another book on SQL heuristics?!);
the work has already been done e.g. I can give the book to someone on the team to read and refer to;
if someone doesn't like my coding style I can blame someone else;
I recently got a load of rep on SO by recommending another Celko book :)
In practice we do deviate from the prescriptions of The Book but surprisingly rarely.
In MS-SQL, I've always had objects owned by dbo., and I prefix calls to those objects with dbo.
Too many times I've seen our devs wonder why they can't call objects that they inadvertently owned.
Avoid silly abbreviation conventions, such as comprehensive dictionaries of abbreviations that actively encourage monstrosities like EMP_ID_CONV_FCTR_WTF_LOL_WAK_A_WAK_HU_HU. This rule is inspired by a real set of guidelines I've seen before.
MVP Aaron Bertrand's
"My stored procedure "best practices" checklist"
Table name matches Primary Key name and description key
I have just recently, after years of agreeing with this, jumped ship, and now have an "ID" column on every table.
Yes, I know, when linking tables it's ambiguous! But so is linking ProductID to ProductID, so uhh, why the extra typing?
This:
SELECT p.Name, o.Quantity FROM Products p, Orders o WHERE o.ProductID = p.ID
Is slightly better than this:
SELECT p.Name, o.Quantity FROM Products p, Orders o WHERE o.ProductID = p.ProductID
Note that both will require table or alias prefixes. But not only am I typing slightly less (multiply that across dozens of tables with long descriptive names and it adds up fast in a data intensive application) but it also makes it easier to know which table is the parent table in every join, which, when joining 8-10 tables in a query, can help quite a bit.
I follow a lot of the same conventions as others here, but I wanted to say a few things that haven't been said yet.
Regardless of whether you like plural names or singular names for your tables, be consistent. Choose one or the other, but don't use both.
The primary key in a table has the same name as the table, with the suffix _PK. Foreign keys have the same name as their corresponding primary key, but with a suffix of _FK. For example, the Product table's primary key is called Product_PK; in the Order table the corresponding foreign key is Product_FK. I picked this habit up from another DBA friend of mine and so far I'm liking it.
Whenever I do an INSERT INTO...SELECT, I alias all the columns in the SELECT portion to match the names of the columns from the INSERT INTO portion to make it easier to maintain and see how things match up.
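The INSERT INTO...SELECT habit looks roughly like this. The sketch runs on SQLite through Python's sqlite3 module, and the StagingCustomers and Customers tables are made up. Note that the aliases are documentation only: the engine still matches columns by position, not by alias name.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE StagingCustomers (cust_no INTEGER, fname TEXT, lname TEXT);
CREATE TABLE Customers (CustomerId INTEGER PRIMARY KEY, FirstName TEXT, LastName TEXT);
INSERT INTO StagingCustomers VALUES (7, 'Ada', 'Lovelace');

-- Each SELECT column is aliased to the target column it feeds.
INSERT INTO Customers (CustomerId, FirstName, LastName)
SELECT s.cust_no AS CustomerId,
       s.fname   AS FirstName,
       s.lname   AS LastName
FROM StagingCustomers s;
""")

customers = conn.execute(
    "SELECT CustomerId, FirstName, LastName FROM Customers"
).fetchall()
print(customers)
```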
The most important standard is: don't have a database by default. I find too many developers grabbing a database for projects where life would have been much easier without one (at least yet). It is just a tool in the toolbox, and not every problem is a nail.
Inappropriate use of a database leads to anemic domain models, badly testable code and unneeded performance problems.
I agree with just about everything you have put there except for #5. I often use prefixes for tables and stored procedures because the systems that we develop have lots of different functional areas, so I will tend to prefix the tables and sprocs with an identifier that will allow for them to group nicely in Management Studio based on what area they belong to.
Example: cjso_Users, cjso_Roles, and then you have routing_Users, routing_Roles. This may sound like replication of data, but in reality the two different user/roles tables are for completely separate functions of the system (cjso would be for the customer-based ecommerce app while the routing would stand for employees and distributors who use the routing system).
I like our table naming convention:
People Table
PEO_PersonID
PEO_FirstName
...
This helps make larger queries a bit more readable, and joins make a bit more sense:
Select * -- naughty!
From People
Join Orders on PEO_PersonID = ORD_PersonID
--...
I guess what matters, rather than what the naming convention is, is the consistency of the naming.