Denormalizing for sanity or performance? - sql

I've started a new project and they have a very normalized database. Everything that can be a lookup is stored as a foreign key to the lookup table. That's normalized and fine, but I end up doing 5-table joins for the simplest queries.
from va in VehicleActions
join vat in VehicleActionTypes on va.VehicleActionTypeId equals vat.VehicleActionTypeId
join ai in ActivityInvolvements on va.VehicleActionId equals ai.VehicleActionId
join a in Agencies on va.AgencyId equals a.AgencyId
join vd in VehicleDescriptions on ai.VehicleDescriptionId equals vd.VehicleDescriptionId
join s in States on vd.LicensePlateStateId equals s.StateId
where va.CreatedDate > DateTime.Now.AddHours(-DateTime.Now.Hour)
select new {va.VehicleActionId,a.AgencyCode,vat.Description,vat.Code,
vd.LicensePlateNumber,LPNState = s.Code,va.LatestDateTime,va.CreatedDate}
I'd like to recommend that we denormalize some stuff, like the state code. I don't see the state codes changing in my lifetime. Similar story with the 3-letter agency code: these are handed out by the agency of agencies and will never change.
When I approached the DBA with the state code issue and the 5-table joins, I got the response that "we are normalized" and that "joins are fast".
Is there a compelling argument to denormalize? I'd do it for sanity if nothing else.
The same query in T-SQL:
SELECT VehicleAction.VehicleActionID
     , Agency.AgencyCode AS ActionAgency
     , VehicleActionType.Description
     , VehicleDescription.LicensePlateNumber
     , State.Code AS LPNState
     , VehicleAction.LatestDateTime AS ActionLatestDateTime
     , VehicleAction.CreatedDate
FROM VehicleAction
INNER JOIN VehicleActionType ON VehicleAction.VehicleActionTypeId = VehicleActionType.VehicleActionTypeId
INNER JOIN ActivityInvolvement ON VehicleAction.VehicleActionId = ActivityInvolvement.VehicleActionId
INNER JOIN Agency ON VehicleAction.AgencyId = Agency.AgencyId
INNER JOIN VehicleDescription ON ActivityInvolvement.VehicleDescriptionId = VehicleDescription.VehicleDescriptionId
INNER JOIN State ON VehicleDescription.LicensePlateStateId = State.StateId
WHERE VehicleAction.CreatedDate >= FLOOR(CAST(GETDATE() AS float))

I don't know if I would even call what you want to do denormalization -- it looks more like you just want to replace artificial foreign keys (StateId, AgencyId) with natural foreign keys (State Abbreviation, Agency Code). Using varchar fields instead of integer fields will slow down join/query performance, but (a) if most of the time you don't even need to join to the lookup table, because the natural key is what you wanted anyway, it's not a big deal, and (b) your database would need to be pretty big, or under a pretty high load, for the difference to be noticeable.
But djna is correct in that you need a complete understanding of current and future needs before making a change like this. Are you SURE the three letter agency codes will never change, even five years from now? Really, really sure?

Some denormalization can be needed for performance (and sanity) reasons at times. Hard to tell without seeing all your tables, needs, etc.
But why not just build a few convenience views (to do those joins once) and then write your simpler queries against them?
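For example, something along these lines (just a sketch, using the table and column names from the question's T-SQL; the view name is made up):

CREATE VIEW dbo.vw_VehicleActionSummary AS
SELECT va.VehicleActionID
     , a.AgencyCode AS ActionAgency
     , vat.Description
     , vat.Code AS ActionTypeCode
     , vd.LicensePlateNumber
     , s.Code AS LPNState
     , va.LatestDateTime AS ActionLatestDateTime
     , va.CreatedDate
FROM VehicleAction va
INNER JOIN VehicleActionType vat ON va.VehicleActionTypeId = vat.VehicleActionTypeId
INNER JOIN ActivityInvolvement ai ON va.VehicleActionId = ai.VehicleActionId
INNER JOIN Agency a ON va.AgencyId = a.AgencyId
INNER JOIN VehicleDescription vd ON ai.VehicleDescriptionId = vd.VehicleDescriptionId
INNER JOIN State s ON vd.LicensePlateStateId = s.StateId

Then the daily query shrinks to:

SELECT * FROM dbo.vw_VehicleActionSummary
WHERE CreatedDate >= FLOOR(CAST(GETDATE() AS float))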

Beware of wanting to reshape things to your current idioms. Right now the unfamiliar code seems unwieldy and an obstruction to your understanding. In time, it's possible that you will become acclimatised.
If current (or known future) requirements, such as performance are not being met then that's a whole different issue. But remember anything can be performance tuned, the objective is not to make things as fast as possible, but to make them fast enough.

This previous post dealt with a similar issue to the one you're having. Hopefully it will be helpful to you.
Dealing with "hypernormalized" data
My own personal take on normalization is to normalize as much as possible and denormalize only for performance. And even denormalizing for performance is something to avoid: I'd go the route of profiling, setting correct indexes, etc. before I'd denormalize.
Sanity... That's overrated. Especially in our profession.

Well, what about the performance? If the performance is okay, just make the five table JOIN into a view and, for sanity, SELECT from the view when you need the data.
State abbreviations are one of the cases in which I think meaningful keys are okay. For very simple lookup tables with a limited number of rows and where I'm in complete control of the data (meaning it's not populated from some outside source) I'll sometimes create meaningful four or five character keys so that the key value can proxy for the fully descriptive lookup value in some queries.
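For instance (a sketch of the idea only, not the poster's actual schema; names invented):

-- The two-letter code itself is the key, so it can stand in for the lookup row in queries.
CREATE TABLE State
(
    StateCode char(2) NOT NULL PRIMARY KEY,  -- 'MO', 'IL', ...
    StateName varchar(50) NOT NULL
)

-- Referencing tables then carry the readable code directly:
-- LicensePlateState char(2) NOT NULL REFERENCES State (StateCode)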

Create a view (or an inline table-valued function, if you want parameterization). In any case, I usually put all my code into SPs (some code-generated) whether they use views or not, and that's that: you pretty much only ever write the join once.
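For instance, an inline TVF over the tables from the question (a sketch only; the function name and @SinceDate parameter are made up):

CREATE FUNCTION dbo.fn_VehicleActionSummary (@SinceDate datetime)
RETURNS TABLE
AS
RETURN
(
    SELECT va.VehicleActionID, a.AgencyCode, vat.Description, vat.Code,
           vd.LicensePlateNumber, s.Code AS LPNState, va.LatestDateTime, va.CreatedDate
    FROM VehicleAction va
    INNER JOIN VehicleActionType vat ON va.VehicleActionTypeId = vat.VehicleActionTypeId
    INNER JOIN ActivityInvolvement ai ON va.VehicleActionId = ai.VehicleActionId
    INNER JOIN Agency a ON va.AgencyId = a.AgencyId
    INNER JOIN VehicleDescription vd ON ai.VehicleDescriptionId = vd.VehicleDescriptionId
    INNER JOIN State s ON vd.LicensePlateStateId = s.StateId
    WHERE va.CreatedDate >= @SinceDate
)

-- Called from an SP, e.g. "everything since midnight":
DECLARE @Midnight datetime = DATEADD(day, DATEDIFF(day, 0, GETDATE()), 0)
SELECT * FROM dbo.fn_VehicleActionSummary(@Midnight)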

An argument (for this "normalization") that the three-letter codes might change isn't very compelling without a plan for what you will do if the codes do change, and how your artificial-key scenario will address this eventuality better than using the codes as keys. Unless you've implemented a fully temporal schema (which is horribly difficult to do and not suggested by your example), it's not obvious to me how your normalization benefits you at all. Now if you work with agencies from multiple sources and standards that might have colliding code names, or if "state" might eventually mean a two-letter code for state, province, department, canton, or estado, that's another matter. You then need your own keys or you need a two-column key with more information than that code.
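To sketch that last point (names invented for illustration): if the same three-letter code can be issued under more than one standard, the code alone no longer identifies an agency, and the issuing standard becomes part of the key.

CREATE TABLE Agency
(
    CodeStandard varchar(20) NOT NULL,   -- hypothetical: which body issued the code
    AgencyCode   char(3) NOT NULL,
    AgencyName   varchar(100) NOT NULL,
    CONSTRAINT PK_Agency PRIMARY KEY (CodeStandard, AgencyCode)
)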

Related

Managing / Using Constant Values in SQL Server

Fixed or Seldom-changing values in SQL - How to best code them?
Writing some SQL code yesterday, I was about to type some fixed constant values (sarcastically often called "magic numbers") into my code. Every time I do that, I feel a pang of guilt. It's like knowingly creating technical debt, as you just know that you're feeding the future questions of "How many places does StatusX = 6 matter in our system?" / "Where might the system break or need to change if we added another status value?".
Answering this "where-used" question is always a scavenger hunt, as sometimes fixed values are used directly (StatusX = 6), sometimes in IN clauses (StatusX IN (2,6,7)), and sometimes in ranges (StatusX < 7). You can do a pretty good job finding them, but it's time-consuming. There HAS to be a better way...
Solution: Fixed Value Views
I went in search of this and found two articles in particular (see the links below -- one of them is here, on StackOverflow) proposing SQL views with fixed values for this. At first I passed the idea by; it didn't click with me right away. But after looking at the other approaches on offer and their drawbacks, and then running into this concept a few more times, I revisited it. On closer review, I find this approach makes a LOT of sense and has a ton of benefits, and it costs next to nothing performance-wise.
The scenario where this approach applies best is where one has a fixed set of values, usually stored in a codes table. These values often have foreign key constraints on them to make sure that dependent data structures are only using valid and/or active code values. Traditional RI has the validity of the data covered, but it doesn't make it any easier to manage or write SQL code that uses the code values sitting in these mostly static tables. That's where these fixed-value views come in!
I'm reposting this idea here, as I think it has a LOT of merit, and could or should be used by many. Glad to share and hear of any improvements or variations on it. I also added a bit (literally - see below) to the idea that improves on its code-friendliness.
Benefits
Using this method, you can use SQL server's dependency information to detect where your fixed values exposed by these fixed value views have been utilized. If you ever actually needed to add a value, it's easy and you quickly know where you should check for usage. If you wanted to gasp remove a value, you could tell what code it would break, as the view would be modified. All in all, this idea is very appealing to me. My brain isn't left feeling a bit dirty inside every time I code a fixed value using this method.
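For instance, something like this (a sketch; it assumes the cvw_OrderStatusID view defined below):

-- Every proc, view, or function that references the constant view:
SELECT referencing_schema_name, referencing_entity_name
FROM sys.dm_sql_referencing_entities('dbo.cvw_OrderStatusID', 'OBJECT')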
This approach does not suggest that you not have actual codes tables, or use these views instead of them. As much as possible, it is best to have defined codes tables, with referential integrity enforced to them from tables that use them. Where the constant views discussed below really shine is in SQL coding: in functions, stored procedures, etc.
Example View
CREATE VIEW dbo.cvw_OrderStatusID AS
SELECT
 CAST(1 AS bit) AS [Match Exists] -- Helper column for Left Outer Joins, to check for record presence
,CAST(1 AS INT) AS [Draft]
,CAST(2 AS INT) AS [New]
,CAST(3 AS INT) AS [Open]
,CAST(4 AS INT) AS [Delivered]
Basic Usage
Here's an example query that looks for Orders that are in a status that should still allow them to be edited:
-- Use Inner Join to select 1 to n values
SELECT
O.OrderNumber,
O.CustomerNumber
FROM
dbo.[Order] O
INNER JOIN dbo.cvw_OrderStatusID OS_AllowEdit ON O.OrderStatusID
IN (OS_AllowEdit.[Draft], OS_AllowEdit.[New], OS_AllowEdit.[Open])
-- Use CROSS APPLY to pass a fixed value to a TVF, for example:
SELECT orders.* FROM dbo.cvw_OrderTypeID OrderTypes
CROSS APPLY dbo.fn_GetOrders(OrderTypes.[Express], @this, @that, @theother) orders
ORDER BY orders.DateSold
Small improvement on the idea - [Match Exists]
To make it more evident and easy to test when you are checking against multiple values for a match, add a field called [Match Exists] to your Constant Views, of type Bit, defaulting to 1 (true).
Then, when you use your SQL Constant Views, use an INNER JOIN where the data must match, or a LEFT OUTER JOIN where the data may or may not match the different values. Structure your SQL like this:
SELECT
IIF(OS_OrderDelivered.[Match Exists] = 1, O.DeliveredDate, NULL) AS OrderDeliveryDate,
ISNULL(OS_AllowEdit.[Match Exists], 0) AS AllowEditing
FROM
dbo.[Order] O
LEFT OUTER JOIN dbo.cvw_OrderStatusID OS_OrderDelivered ON O.OrderStatusID = OS_OrderDelivered.[Delivered]
LEFT OUTER JOIN dbo.cvw_OrderStatusID OS_AllowEdit ON O.OrderStatusID
IN (OS_AllowEdit.[Draft], OS_AllowEdit.[New], OS_AllowEdit.[Open])
Giving credit where it is due:
For the full treatment, read these blogs, as they went deeper into performance considerations.
https://learn.microsoft.com/en-us/archive/blogs/sql_server_appendix_z/sql-server-variables-parameters-or-literals-or-constants
https://www.red-gate.com/hub/product-learning/sql-prompt/misuse-scalar-user-defined-function-constant-pe017

Best Practice For Lookup Tables In SQL

I am fairly new to SQL, so apologies if this question sounds odd.
I keep coming across the issue of poor data. As an example, London could be stored as LON, London UK, London England, etc. Prior to using SQL, I had lots of Excel lookup tables where the first column held the original value and a second helper column held the corrected version. As an example:
Name             Name_1
London, UK       London
Lon              London
LON              London
London           London
London, England  London
LND              London
Is there a straightforward way of doing this in SQL? I am currently trying to create lookup tables and then use joins. This is becoming tricky as I do not always have corrections for every instance, so in most scenarios my lookup tables have fewer items than the tables I'm joining them to.
I've been teaching myself about stored procedures and I wondered if this could solve the problem. Trouble is my searching on the subject of lookup tables is coming up empty.
Any advice or pointers would be gratefully received, even if it is just to say that this cannot be done.
Thanks as always for your help, and apologies for the long post.
You can join to the lookup table and preferably use the value given there; if no match is found, use the original:
SELECT t1.FirstName, LookupField = ISNULL(t2.Name_1, t1.LookupField)
FROM People AS t1
LEFT JOIN TableLookupCities AS t2 ON t1.LookupField = t2.Name
Make sure that for each name there is at most one match in TableLookupCities, or else the join will produce multiple results. Create a unique index on TableLookupCities.Name:
CREATE UNIQUE NONCLUSTERED INDEX UX_TableLookupCities_Name ON TableLookupCities (Name) INCLUDE (Name_1)
You don't have to do any of the other things, just return the original if you don't have a translation for it.
SELECT
    t1.FirstName,
    t1.LookupField,
    CASE WHEN t2.Name_1 IS NULL
         THEN t1.LookupField
         ELSE t2.Name_1 END AS Name_1
FROM People AS t1
LEFT JOIN TableLookupCities AS t2
    ON t1.LookupField = t2.Name
As mentioned above, bad data is its own problem. Data cleaning is an industry in itself, so you've got a huge range of options for this kind of problem, from the simple and straightforward to the elaborate, tries-to-fix-everything bells and whistles. What's "best" depends on your situation and needs.
Certainly it's possible to keep expanding this lookup table to cater for a growing number of standard errors/variations, but if there's a constant flow of new data there's a maintenance overhead. It may well be adequate for your needs, so don't shy away from it just because there are fancier alternatives.
It's quite commonplace to trade the reliability of manual human intervention for the scalability of automated approaches; the latter is much easier to maintain and grow, but (depending on the nature of your problem) may make mistakes.
Eg 1. Use a pattern-based approach (CONTAINS, LIKE, RegEx) to find something that seems like a reasonable fit (see the sketch after this list). That may be fine in some situations, such as when Name_1 is a static, well-understood list, so you can make sure the results are usually going to be good enough.
+ easy to setup/understand
+ more flexible than full list
- still takes a bit of maintenance
- hopeless in complex/poorly understood situations
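A sketch of Eg 1, reusing the People/TableLookupCities names from the other answers (the patterns themselves are invented):

SELECT t1.FirstName,
       CASE WHEN t1.LookupField LIKE 'Lon%' OR t1.LookupField = 'LND' THEN 'London'
            ELSE t1.LookupField
       END AS Name_1
FROM People AS t1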
Eg 2. In the more general case, you can use the text-search capabilities offered by the database to "score" how good a match one value is against the other and pick the best-matching option (a sketch follows this list). Again, this isn't foolproof or safe in all contexts, and it's a bit more work to set up, but it's a lot more robust. It's also more performance-intensive, so the size of the datasets involved, the timescales you're working to, and the infrastructure available are also considerations.
+ fairly good success rates
- slower setup
- larger performance overheads
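A sketch of Eg 2 using only built-in T-SQL functions (proper full-text or fuzzy-matching tooling would do better than SOUNDEX-style scoring):

SELECT t1.LookupField, best.Name_1
FROM People AS t1
OUTER APPLY
(
    SELECT TOP (1) t2.Name_1
    FROM TableLookupCities AS t2
    ORDER BY DIFFERENCE(t1.LookupField, t2.Name) DESC  -- 4 = closest phonetic match, 0 = worst
) AS best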
Eg 3. Another option would be something more domain-specific. In this case it's spatial data, so you could use a 3rd-party geocoding service as a means of validation.
+ high success rate
+ capable of dealing with huge ranges of values
- may incur additional costs
- hardest/slowest to setup
The bottom line... bad data is bad data and it takes a lot of work to either use bad data or clean bad data or both.
UPDATE AFTER CLARIFICATION
Build your own ETL (Extract, Transform, Load) process to handle all the variant incoming data. Your ETL process will most likely get modified with each new batch of data you receive, because you will have to trap for new "bad data" variants. (A minimal sketch of the load step follows the steps below.)
Import the data into an ALL VARCHAR table
Run the ETL Process
Good Data goes into real data tables
Bad data goes into an exception table
Repeat
Modify ETL Process
Run ETL Process
Until No More Exceptions
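A minimal sketch of the load step (the staging table, exception table, and City column names are invented):

-- Good rows: the raw city matches a known variant, so load the corrected value.
INSERT INTO People (FirstName, City)
SELECT s.FirstName, t2.Name_1
FROM Staging_People AS s
INNER JOIN TableLookupCities AS t2 ON s.RawCity = t2.Name

-- Bad rows: no match yet; park them for review and grow the lookup table.
INSERT INTO People_Exceptions (FirstName, RawCity)
SELECT s.FirstName, s.RawCity
FROM Staging_People AS s
LEFT JOIN TableLookupCities AS t2 ON s.RawCity = t2.Name
WHERE t2.Name IS NULL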
-- End Update
If you use LEFT JOIN you can identify the missing values fairly easily.
SELECT
    t1.FirstName,
    t1.LookupField,
    t2.Name_1
FROM People AS t1
LEFT JOIN TableLookupCities AS t2
    ON t1.LookupField = t2.Name
Anywhere t2.Name_1 returns a NULL, you know you need to add that "LookupField" value to your lookup table. Here is a good book for learning database design: Database Design for Mere Mortals
-- Group By to find missing unique values
SELECT
    t1.LookupField,
    t2.Name_1
FROM People AS t1
LEFT JOIN TableLookupCities AS t2
    ON t1.LookupField = t2.Name
GROUP BY
    t1.LookupField,
    t2.Name_1

Implementing a Flexible Relationship in an RDBMS -- What really are the tradeoffs?

I have a bunch of products with a bunch of different possible attributes for each product. E.g. Product A has a name, size, color, shape. Product B has a name, calories, sugar, etc. One way to solve this is like:
1) Create tables
Products (id, name)
Attributes (id, name)
Product_Attributes (product_id, attribute_id, value as string)
This allows for maximum flexibility, but I have heard a lot of people recommend against this although I am not sure why. I mean, if those tables were called Teams, Players, Team_Players we would all agree that this is proper relational design.
Everyone who explains to me why this is bad does so in the context of a completely flexible relational design where you don't ever create real tables past a few basic initial tables (e.g. object, attribute, object_attribute) -- which I think we all can agree is bad. But this is a much more limited and contained version of that (only Products, not every object in the system), so I don't think it is fair to group the two architectures together.
What issues have you encountered (experience or theoretical) that makes this design so bad?
2) Another way to solve this is to create a Product table with a bunch of columns like Size, Color, Shape, Weight, Sugar, etc and then include some extra columns at the end to give us some flexibility. This will create generally sparse rows filled mostly with NULLs. People tend to like this approach, but my question is how many columns can you have before this approach loses its performance benefits? If you have 200 columns, I imagine this is no longer a smart move, but what about 100 columns? 50 columns? 25 columns?
3) The final approach I know about is to store all of the attributes as a blob (JSON perhaps) in a single column of the Products table. I like this approach but it doesn't feel right. Queries are hard. And if you want to be able to easily change the name of an attribute later, you either have to parse every record individually or have them keyed in your blob by some id. If you go the id path then you will need another table Attributes and things start to look like approach #1 from above except you won't be able to join the attribute_id with your blob, so I hope you didn't want to query anything by attribute name.
What I like about this approach, though, is that you can query one product and easily access all of its properties in your code -- fast. And if you delete a product, you won't have to clean up other tables -- it's easy to stay consistent.
4) I have read some things about being able to index strongly typed xml formats in some RDBMSs, but I honestly don't know much about this approach.
I am stuck. I feel like approach #1 is the best bet, but everything I read says that way stinks. What is the right way to think about this problem to be able to decide what is the best method for a given situation? More ideas than what I have listed are obviously welcomed!
You can probably find a great deal about this topic by doing a Google search on "entity attribute value antipattern".
One of the issues with this approach is that you end up mixing meta-data with actual data. Your "attribute" has to now tell the database what exactly is held in the "value" column. This can make it very difficult to handle this data in front-ends, reporting software, etc.
Second, you're going to have a very hard time actually enforcing any data integrity in the database. When your product has an attribute of "weight" what's to stop someone from putting "22 inches" in the value? Or a non-numeric value completely. You might say, "Well, my application will handle that." Then you need to change your application every time that you want to add a new attribute because the application needs to know how to handle it. If you're going to go through all of that work, just add a new column.
Third, how do you enforce that a given product has all of the attributes that it needs? In a row, you can make a column NOT NULL so that it is required before the row can get into the database. You can't enforce that in the EAV model.
Fourth, this kind of a model usually leads to a lot of confusion. People aren't sure what "attributes" are supported, or they duplicate an attribute, or they forget to handle an attribute when creating a report. As an example, if I have an attribute for "Weight(kg)" and another attribute for "Weight(lbs)" and someone asks me, "What's the heaviest product in your database?" I'd better remember that I need to check both attributes.
Fifth, this model usually also leads to laziness. Hey, there's no reason to actually do any analysis of the products that our system can handle, because whatever comes along we'll just add some attributes. In my experience, companies are much better off doing the analysis required to create a good database design rather than fall back on an antipattern like this. You'll learn things about the database, the application, and likely the business as well.
Sixth, it might take a LOT of joins to get a single row of data for a given product. You can return the attributes as separate rows, but now you have to come up with customized list boxes to list those products, etc. Similarly, writing search queries against this model can be very difficult and in both of these situations you're likely to have performance issues.
These are just a few of the problems which I've encountered over the years. I'm sure that there are others.
What the correct solution is for your system depends a lot on the specifics of your business and application. Rather than a sparse row, you might consider using subtype tables if your products fall into a few categories that share common attributes.
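A sketch of the subtype-table idea (all names invented): shared attributes live on the supertype, category-specific ones live on subtype tables keyed by the same ProductId.

CREATE TABLE Product
(
    ProductId int NOT NULL PRIMARY KEY,
    Name      varchar(100) NOT NULL,
    Price     decimal(10,2) NULL
)

CREATE TABLE ProductFood
(
    ProductId  int NOT NULL PRIMARY KEY REFERENCES Product (ProductId),
    Calories   int NULL,
    SugarGrams decimal(6,2) NULL
)

CREATE TABLE ProductApparel
(
    ProductId int NOT NULL PRIMARY KEY REFERENCES Product (ProductId),
    Size      varchar(10) NULL,
    Color     varchar(30) NULL
)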
There are many problems with flexible data models but the first one that is likely to bite you is the fact that queries get unwieldy very quickly. For example, if you wanted to get the Size attribute for every product, the query is relatively easy.
SELECT p.name product_name,
pa.value product_size
FROM product p
left outer join product_attribute pa on (p.product_id = pa.product_id)
left outer join attribute a on (pa.attribute_id = a.attribute_id and
a.name = 'size')
If you want to get the size and some other attribute like color, things get trickier:
SELECT p.name product_name,
pa_size.value product_size,
pa_color.value product_color
FROM product p
left outer join product_attribute pa_size on (p.product_id = pa_size.product_id)
left outer join product_attribute pa_color on (p.product_id = pa_color.product_id)
left outer join attribute a_size on (pa_size.attribute_id = a_size.attribute_id and
a_size.name = 'size')
left outer join attribute a_color on (pa_color.attribute_id = a_color.attribute_id and
a_color.name = 'color')
Very quickly, when you start wanting to grab 10 attributes or write complex searches (show me products where the color is blue and the size is medium), the queries start to get very complicated both for developers to write and maintain and for the database optimizer to generate the query plan for. If you're joining 30 tables together, the optimizer would have to prune the tree of plans it considers very, very quickly to be able to generate a query plan in a reasonable time frame. That tends to lead the optimizer to discard promising paths too early and to generate less than optimal paths for many of your queries.
This, in turn, means that you very quickly get to a point where new development is bottlenecked because developers can't get their queries right or developers can't get their queries to return quickly enough. Whatever time you saved up front by not gathering the requirements to determine what the valid attributes are quickly gets used up with the 47th iteration of "Why can't I get the data I want out of this putrid data model?"
Beyond this cost to developers, you end up creating a lot of costs for the organization as a whole.
No query tool is going to handle this sort of data model well. So all the users that can currently fire up their favorite query tool and run some reports out of your database are now stuck waiting for developers to write their reports and do their extracts for them.
Data quality becomes very hard to enforce. It becomes very hard to check conditions that involve multiple attributes (i.e. if a product's size is Medium then the weight must be between 1 and 10 pounds; if a product's height is specified then a width is required as well), so people don't make those checks. They don't write the reports to identify where these sorts of rules are violated. So the data ends up being a bit bucket of data that downstream processes decide they can't use because it isn't sufficiently complete.
You're moving too much of the initial requirements discussion off into the future when understanding the core entities will likely lead to a much better design overall. If you can't agree on a set of attributes that the first version of the product needs to support, you don't really understand what that version is supposed to do. Even if you successfully code a very generic application, that means that it is going to require a lot of time to configure once you've built it (because someone will have to figure out what attributes it supports at that point). And then you'll discover when the application is being configured that you missed a ton of requirements that only became clear when the attributes were defined-- you can't know that width is required if height is specified if you don't know whether they're going to store height or width in the first place.
In the worst case, the response to this problem during configuration is to immediately determine that you need to provide a flexible way to specify business rules and workflows, so that the people configuring the application can quickly code their business rules when they add new attributes, and so that they can control the flow of the application by grouping attributes together or skipping certain pages (i.e. have a page where make & model are required if the product type is car, skip that page if not). But in order to do that, you're going to end up building an entire development environment. And you're going to push the job of actually coding the application to the folks that are configuring the product. Unless you happen to be really good at building development environments, and unless the people configuring the product are really developers, this doesn't end well.
I mean, if those tables were called Teams, Players, Team_Players we would all agree that this is proper relational design.
No, we wouldn't. Here's why.
You started with this.
Products (id, name)
Attributes (id, name)
Product_Attributes (product_id, attribute_id, value as string)
Let's drop the id numbers, so we can see what's really going on. (Longer column names for clarity.)
Products (product_name)
Attributes (attribute_name)
Product_Attributes (product_name, attribute_name, value as string)
And translating that to teams and players . . .
Teams (team_name)
Players (player_name)
Team_Players (team_name, player_name, value as string)
So for sample data we might have
Team                 Player             Value
--
St. Louis Cardinals  Boggs, Mitchell    ?
St. Louis Cardinals  Carpenter, Chris   ?
St. Louis Cardinals  Franklin, Ryan     ?
St. Louis Cardinals  Garcia, Jaime      ?
What on earth belongs in place of the question marks? Let's say we want to record number of games played. Now the sample data looks like this.
Team                 Player             Value
--
St. Louis Cardinals  Boggs, Mitchell    23
St. Louis Cardinals  Carpenter, Chris   15
St. Louis Cardinals  Franklin, Ryan     19
St. Louis Cardinals  Garcia, Jaime      14
Want to store batting average, too? You can't. Not only can you not store batting average along with games played, you can't tell by looking at the database whether Mitch Boggs played in 23 games, had 23 hits, scored 23 runs, had 23 "at bats", had 23 singles, or struck out 23 times.
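Compare that with a junction table whose columns actually name the facts (a sketch, with made-up types):

CREATE TABLE Team_Players
(
    team_name       varchar(50) NOT NULL,
    player_name     varchar(50) NOT NULL,
    games_played    int NULL,
    batting_average decimal(4,3) NULL,
    PRIMARY KEY (team_name, player_name)
)

Now 23 can only ever mean games played, and batting average has a home of its own.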
The reason why this approach is so bad is that you don't know how many times you have to join to the table to get all the attributes. Plus, joining to the same table 20 times tends to create a performance block of massive proportions. I am assuming that Products will be at the heart of your system and thus a critical place for performance.
Now you say that the product attributes will be drastically different. I disagree. There will be many attributes that are common to a large number of your products: things like price, units, size, color, dimensions, weight. Those should be in the product table as common properties. These are also the ones the user is most likely to be searching on when picking a product.
Other properties are useful as a description but not for much else (they won't be searched on or put into the order details). Put those in a description or notes field.
Finally, you are left with the few attributes which might genuinely differ. But how different are they? Are they common to a particular type of product (books have these attributes, cameras have those)? Then a related table for that type of product might work well.
Once you have done your job and figured all this out, then add the flexibility of an EAV table if you still need one. The steps above should cover 98+% of the real requirements.
(Also it's kind of hard to design the order details table if you don't know the attribute fields you need to record for the order - you can't rely on the products table for that)
(Oh, and I agree wholeheartedly with what @Tom H is saying as well.)

What is so bad about using SQL INNER JOIN

Every time a database diagram gets looked at, one area people are critical of is inner joins. They look at them hard and ask questions to see if an inner join really needs to be there.
Simple Library Example:
A many-to-many relationship is normally defined in SQL with three tables: Book, Category, BookCategory.
In this situation, Category is a table that contains two columns: ID, CategoryName.
In this situation, I have gotten questions about the Category table: is it needed? Could it be treated as just a lookup list, with the BookCategory table storing the CategoryName instead of the CategoryID, to avoid having to do an additional INNER JOIN? (For this question, we are going to ignore the changing and deleting of any CategoryNames.)
The question is, what is so bad about inner joins? At what point is doing them a negative thing (general guidelines like # of transactions, # of records, # of joins in a statement, etc)?
Your example is a good counterexample. How do you rename categories if they're spread throughout the various rows of the BookCategory table? Your UPDATE to do the rename would touch all the rows in the same category.
With the separate table, you only have to update one row. There is no duplicate information.
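For example (a sketch; the particular names and the ID value are made up):

-- Normalized: the rename is one row.
UPDATE Category SET CategoryName = 'Sci-Fi' WHERE ID = 42

-- Denormalized into BookCategory: the rename touches every row in that category.
UPDATE BookCategory SET CategoryName = 'Sci-Fi' WHERE CategoryName = 'Science Fiction'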
I would be more concerned about OUTER joins, and the potential to pick up info that wasn't intended.
In your example, having the Category table means that a book is limited to being filed under a preset Category (via a foreign key relationship); if you just shoved multiple entries into the BookCategory table, it would be harder to limit what is selected for the Category.
Doing an INNER join is not so bad, it is what databases are made for. The only time it is bad is when you are doing it on a table or column that is inadequately indexed.
I am not sure there is anything wrong with an inner join per se. It is like saying each IF you add to your code impacts performance (or should I say every line...); you still need a minimum number of them to make your system work (yes, yes, I know about Turing machines).
So if you have something that is not needed, it will be frowned upon.
When you map your domain model onto the relational model you have to split the information across multiple relations in order to get a normalized model - there is no way around that. And then you have to use joins to combine the relations again and get your information back. The only bad thing about this is that joins are relative expensive.
The other option would be not to normalize your relational model. This will fill your database with much redundant data, give you many opportunities to turn your data inconsistent and make updates a nightmare.
The only reason not to normalize a relational model (I can think of at the moment) is that reading performance is extremely - and I mean extremely - critical.
By the way, why do you (they) only mention inner joins? How are left, right, and full outer joins significantly different from inner joins?
Nobody can offer much about general guidelines - they'd be specific to the server, hardware, database design, and expectations... way too many variables.
Specifically about INNER JOINs being inefficient or bad... JOINs are the center of relational DBs, and they've been around for decades. It's only wrong when you use it wrong; obviously someone's doing it right, since it's not extinct yet. Personally, I'd assume anyone throwing out blanket statements like that either doesn't know SQL or knows just enough to get in trouble. Next time it comes up, teach them how to use the query cache.
(You said to ignore updates and deletes, but you didn't say to ignore inserts: the increased maintainability through avoiding humans and their typos can easily be worth at least 10x the time a join will take.)

Strategy for avoiding a common sql development error (misleading result on join bug)

Sometimes when I'm writing moderately complex SELECT statements with a few JOINs, the wrong key columns get used in a JOIN condition yet the query still returns valid-looking results.
Because the auto-numbering values (especially early in development) all tend to fall in similar ranges (sub-100s or so), the SELECT still produces some results. These results often look valid at first glance, and the problem is not detected until much, much later, which makes debugging much more difficult because familiarity with the data structures and code has gone stale in the dev's mind.
I just spent several hours tracking down yet another instance of this issue, which I've run into too many times before. I name my tables and columns carefully and write my SQL statements methodically, but this is an issue I can't seem to completely avoid. It comes back and bites me for hours of productivity about twice a year on average.
My question is: has anyone come up with a clever method for avoiding what I assume is probably a common SQL bug/mistake?
I have thought of auto-numbering each table starting from a different seed value, but that feels kludgy and would get ugly to keep straight for data models with dozens of tables... Any better ideas?
P.S.
I am very careful and methodical in naming my tables and columns: the Patient table gets a PatientId column, Facility gets a FacilityId, etc. The issue tends to arise when there are join tables involved, where the linkage takes on extra meaning such as RelatedPatientId, ReferingPatientId, FavoriteItemId, etc.
When writing long complex SELECT statements try to limit the result to one record.
For instance, assume you have this gigantic enormous awesome CMS system and you have to write internal reports because the reports that come with it are horrendous. You notice that there are about 500 tables. Your select statement joins 30 of these tables. Your result should limit your row count by using a WHERE clause.
My advice is, rather than getting all this code written and generalized for all cases, to break the problem up: use WHERE to limit the row count to only, say, one record. Check all the fields; if they look OK, relax the filter and let your code return more rows. Only after further checking should you generalize.
It bites a lot of us who keep adding more and more joins until it seems to look ok, but only after Joe Blow the accountant runs the report does he realize that the PO for 4 million was really the telephone bill for the entire year. Somehow that join got messed up!
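A sketch of that spot-check habit (the table and key names are invented):

-- Pin the query to one record whose correct values you already know,
-- and eyeball every column before widening the WHERE clause.
SELECT TOP (1) po.PONumber, po.Amount, v.VendorName
FROM PurchaseOrder AS po
INNER JOIN Vendor AS v ON v.VendorId = po.VendorId
WHERE po.PurchaseOrderId = 12345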
One option would be to use your natural keys.
More practically, Red Gate SQL Prompt picks the FK columns for me.
I also tend to build up one JOIN at a time to see how things look.
If you have a visualization or diagramming tool for your SQL statements, you can follow the joins visually, and any errors will become immediately apparent, provided you have followed a sensible naming scheme for your primary and foreign keys.
Your column names should take care of this, unless you named them all "ID". Are you writing multiple SELECT statements using the same tables? You may want to create views for the more common ones.
If you're using SQL Server, you can use GUID columns as primary keys (that's what we do). You won't have problems with collisions again.
You could use GUIDs as your primary keys, but it has its pros and cons.
This pro is actually not mentioned on that page.
I have never tried doing this myself - I use a tool on top of SQL that makes incorrect joins very unlikely, so I don't have this problem. I just thought I'd mention it as another option though!
For IDs use TableNameID, for example for table Person, use PersonID
Use a DB model and look at the diagram when writing queries.
This way join looks like:
... ON p.PersonID = d.PersonID
as opposed to:
... ON p.ID = d.ID
Auto-increment integer PKs are among your best friends.