How to model table hierarchy - sql

I'm trying to make an application about formula 1. I have three tables: Team, Driver, And Race Results. I'm thinking about three options (and maybe I'm missing more):
Have a derived table Driver_Team. Have a Driver_TeamId in that table. Use that Driver_TeamId in the Race Results table. This seems to solve most of the queries I think I am going to use, but feels awkward and I haven't seen it anywhere.
Have Driver.DriverId and Team.TeamId in the Race Results table. This has the problem of not being able to add extra information. I don't know yet what information, maybe the date of the start of joining a new team. Then I would need a junction table (because that information is not Race Result related).
The last one: Have a junction table Driver_Team, but have only the Driver.DriverId as Foreign Key in the Race Results table. Problem is, queries like "How much points did team x get in season y/several seasons" really really horrible.
Am I missing another solution? If yes, please tell me! :-) Otherwise, which of these solutions seems the best?
Thanks!

Your first option gets my vote. I'd also suggest adding a Race table (to hold data such as track, date, conditions, etc.), and make Race_Results the combination of Driver_Team and Race.

I suggest the following:
RaceResult - Driver - DriverTeam - Team
Where RaceResult contains race_date, DriverTeam contains ( driver_id, team_id, team_join_date and team_leave_date ). Then you would be able to get all the info you're asking about in your question, even though the queries may be complicated.

Just brainstorming, one object model may look like this. Note the conspicuous lack of an "id" field on RaceResult, as the finishing position acts perfectly as a natural key (one driver per finishing position). Of course, there may be lots of other options as well.
Team:
id
name
Driver:
id
name
team_id
Race:
id
venue
date
RaceResults:
position
driver_id
race_id

For the kind of queries you're talking about, I think DriverId and TeamId should both be in RaceResults. If you want to store additional information about an association between a driver and a team, then that should be placed in a separate table. This appears to create a little bit of redundancy, since the driver/team pair in the race table will be limited by the employment dates in the DriverTeam table, but given the complexities of contracts and schedules, I think it may end up being not especially redundant.
I like the way you are planning the DB to support your queries. I have run into way too much OOP thinking in DB design over the years!

If you only store DriverId and TeamId in the RaceResults table, then you cannot associate a driver to a team without a RaceResult.

Related

Redundant field in SQL for Performance

Let's say I have two Tables, called Person, and Couple, where each Couple record stores a pair of Person id's (also assume that each person is bound to at most another different person).
I am planning to support a lot of queries where I will ask for Person records that are not married yet. Do you guys think it's worthwhile to add a 'partnerId' field to Person? (It would be set to null if that person is not married yet)
I am hesitant to do this because the partnerId field is something that is computable - just go through the Couple table to find out. The performance cost for creating new couple will also increase because I have to do this extra book keeping.
I hope that it doesn't sound like I am asking two different questions here, but I felt that this is relevant. Is it a good/common idea to include extra fields that are redundant (computable/inferable by joining with other tables), but will make your query a lot easier to write and faster?
Thanks!
A better option is to keep the data normalized, and utilize a view (indexed, if supported by your rdbms). This gets you the convenience of dealing with all the relevant fields in one place, without denormalizing your data.
Note: Even if a database doesn't support indexed views, you'll likely still be better off with a view as the indexes on the underlying tables can be utilized.
Is there always a zero to one relationship between Person and Couples? i.e. a person can have zero or one partner? If so then your Couple table is actually redundant, and your new field is a better approach.
The only reason to split Couple off to another table is if one Person can have many partners.
When someone gets a partner you either write one record to the Couple table or update one record in the Person table. I argue that your Couple table is redundant here. You haven't indicated that there is any extra info on the Couple record besides the link, and it appears that there is only ever zero or one Couple record for every Person record.
How about one table?
-- This is psuedo-code, the syntax is not correct, but it should
-- be clear what it's doing
CREATE TABLE Person
(
PersonId int not null
primary key
,PartnerId int null
foreign key references Person (PersonId)
)
With this,
Everyone on the system has a row and a PersonId
If you have a partner, they are listed in the PartnerId column
Unnormalized data is always bad. Denormalized data, now, that can be beneficial under very specific circumstances. The best advice I ever heard on this subject it to first fully normalize your data, assess performance/goals/objectives, and then carefully denormalize only if it's demonstrably worth the extra overhead.
I agree with Nick. Also consider the need for history of the couples. You could use row versioning in the same table, but this doesn't work very well for application databases, works best in a in a DW scenario. A history table in theory would duplicate all the data in the table, not just the relationship. A secondary table would give you this flexibility to add additional information about the relationship including StartDate and EndDate.

Table structure for Scheduling App in SQL DB

I'm working on a database to hold information for an on-call schedule. Currently I have a structure that looks about like this:
Table - Person: (key)ID, LName, FName, Phone, Email
Table - PersonTeam: (from Person)ID, (from Team)ID
Table - Team: (key)ID, TeamName
Table - Calendar: (key dateTime)dt, year, month, day, etc...
Table - Schedule: (from Calendar)dt, (id of Person)OnCall_NY, (id of Person)OnCall_MA, (id of Person)OnCall_CA
My question is: With the Schedule table, should I leave it structured as is, where the dt is a unique key, or should I rearrange it so that dt is non-unique and the table looks like this:
Table - Schedule: (from Calendar)dt, (from Team)ID, (from Person)ID
and have multiple entries for each day, OR would it make sense to just use:
Table - Schedule: (from Calendar)dt, (from PersonTeam)KeyID - [make a key ID on each of the person/team pairings]
A team will always have someone on call, but a person can be on call for more than one team at a time (if they are on multiple teams).
If a completely different setup would work better let me know too!
Thanks for any help! I apologize if my question is unclear. I'm learning fast but nevertheless still fairly new to using SQL daily, so I want to make sure I'm using best practices when I learn so I don't develop bad habits.
The current version, one column per team, is probably not a good idea. Since you're representing teams as a table (and not as an enum or equivalent), it means you expect to add/remove teams over time. That would force you to add/remove columns to the table, which is always a much larger task than adding/removing a few rows.
The 2nd option is the usual solution to a problem like this. A safe choice. You can always define an additional foreign key constraint from Schedule(teamID, personID) to PersonTeam to ensure you don't mistakenly assign schedule duty to a person not belonging to the team.
The 3rd option is pretty much equivalent to the 2nd, only you're swapping a composite natural key for PersonTeam for a surrogate simple key. Since the two components of said composite key are already surrogate, there is no advantage (in terms of immutability, etc.) to adding this additional one. Plus it would turn a very simple N-M relationship (PersonTeam) which most DB managers / ORMs will handle nicely into a more complex object which will need management on its own.
By Occam's razor, I'd do away with the additional surrogate key and use your 2nd option.
In my view, the answer may depend on whether the number of teams is fixed and fairly small. Of course, whether the names of the teams are fixed or not, may also matter, but that would probably have more to do with column naming.
More specifically, my view is this:
If the business requirement is to always have a small and fixed number of people (say, three) on call, then it may well be more convenient to allocate three columns in Schedule, one for every team to hold the ID of the appointed person, i.e. like your current structure:
dt OnCall_NY OnCall_MA OnCall_CA
--- --------- --------- ---------
with dt as the primary key.
If the number of teams (in the Team table) is fixed too, you could include teams' names/designators in the column names like you are doing now, but if the number of teams is more than three and it's just the number of teams in Schedule that is limited to three, then you could just use names like OnCallID1, OnCallID2, OnCallID3.
But even if that requirement is fixed, it may only turn out fixed today, and tomorrow your boss says, "We no longer work with a fixed number of teams (on call)", or "We need to extend the number of teams supported to four, and we may need to extend it further in the future". So, a more universal approach would be the one you are considering switching to in your question, that is
dt Team Person
--- ---- ------
where the primary key would now be dt, Team.
That way you could easily extend/reduce the number of people on call on the database level without having to change anything in the schema.
UPDATE
I forgot to address your third option in my original answer (sorry). Here goes.
Your first option (the one actually implemented at the moment) seems to imply that every team can be presented by (no more than) one person only. If you assign surrogate IDs to the Person/Team pairs and use those keys in Schedule instead of separate IDs for Person and Team, you will probably be unable to enforce the mentioned "one person per team in Schedule" requirement (or, at least, that might prove somewhat troublesome) at the database level, while, using separate keys, it would be just enough to set Team to be part of a composite key (dt, Team) and you are done, no more than one team per day now.
Also, you may have difficulties letting a person change the team over time if their presence in the team was fixated in this way, i.e. with a Schedule reference to the Person/Team pair. You would probably have to change the Team reference in the PersonTeam table, which would result in misrepresentation of historical info: when looking at the people on call back on certain day, the person's Team shown would be the one they belong to now, not the one they did then.
Using separate IDs for people and teams in Schedule, on the other hand, would allow you to let people change teams freely, provided you do not make (Schedule.Team, Schedule.Person) a reference to (PersonTeam.Team, PersonTeam.Person), of course.

Database Modeling of a Softball League

I am modeling a database for use in a softball league website. I'm not that experienced in DB Modeling, and I'm having a hard time with a question about the future.
Right now I have the following tables:
players table (player_id, name, gender, email, team_id)
teams table (team_id, name, captain[player_id], logo, wins_Regular_season, losses_regular_season)
regular_season table (game_id, week, date, home[team_id], away[team_id], home_score, away_score, rain_date)
playoff table (pgame_id, date, home[team_id], away[team_id], home_score, away_score, winnerTo[pgame_id], loserTo[pgame_id])
To make the data persist from season to season, but to also have an easy way to access the data should I:
A) include a year column in the tables and then filter my queries by year?
B) create new tables every year?
C) Do something else that makes more sense but that I can't think of.
This design is not only bad about the future. It is also wrong regarding the past. You're not keeping history in a proper way.
Let's suppose a player changes team: how would that fit into this design? It should be easy to get that kind of information...
And the best way of doing that (IMHO) would be also representing the season as an entity, as a concrete table. Then you should replicate this information in each relationship. Meaning, for instance, that a player does not simply belong to a team: he belongs to a team in a specific season, and may belong to another team when the season changes.
OTOH, I don't think it's wise to keep regular_season and playoff as distinct tables: they could be easily merged into one, by adding some sort of flag in order to keep that information.
Edit This is what I'm meaning:
Notice that there is a Season table.
A Player belongs to a Team in a Season.
NO NEED TO DUPLICATE ANYTHING. A team has only ONE record in the DB; a player will be associated to only ONE record.
I did NOT design the Playoff table, because I believe it should not exist. If the OP disagrees, just add it.
That way you can keep track of all seasons, without needing to replicate the whole DB. I think this is also better than using a year column, which is not meaningful, and cannot be easily constrained (like a foreign key can).
But, please, feel free to disagree.
The standard way would be to add year columns to your tables. That way you can easily call up the past with a select query, or view. SQL Server has good support for this. I've dealt with cleanup of the other route, and it isn't pretty after a few years of data have accumulated.
I would go with option A and have a year column in the seasons table and playoff table.
You already have a date column, you can use that to find the year
SELECT * FROM regular_season WHERE YEAR(date) = 2011

Database Design: Storing the primary key as a separate field in the same table

I have a table that must reference another record, but of the same table. Here's an example:
Customer
********
ID
ManagerID (the ID of another customer)
...
I have a bad feeling about doing this. My other idea was to just have a separate table that just stored the relationship.
CustomerRelationship
***************
ID
CustomerID
ManagerID
I feel I may be over complicating such a trivial thing however, I would like to get some idea's on the best approach for this particular scenario?
Thanks.
There's nothing wrong about the first design. The second one, where you have an 'intermediate' table, is used for many-to-many relationships, which i don't think is yours.
BTW, that intermediate table wouldn't have and ID of its own.
Why do you have a "bad feeling" about this? It's perfectly acceptable for a table to reference its own primary key. Introducing a secondary table only increases the complexity of your queries and negatively impacts performance.
Can a Customer have multiple managers? If so, then you need a separate table.
Otherwise, a single table is fine.
You can use the first approach. See also Using Self-Joins
There's absolutely nothing wrong with the first approach, in fact Oracle has included the 'CONNECT BY' extension to SQL since at least version 6 which is intended to directly support this type of hierarchical structure (and possibly makes Oracle worth considering as your database if you are going to be doing a lot of this).
You'll need self-joins in databases which don't have something analogous, but that's also a perfectly fine and standard solution.
As a programmer I like the first approach. I like to have less number of tables. Here we are not even talking of normalization and why do we need more tables? That is just me.
Follow the KISS principle here: Keep it simple, (silly | stupid | stud | [whatever epithet starting with S you prefer]). Go with one table, unless you have a reason to need more.
Note that if the one-to-many/many-to-many relationship ends up being the case, you can extract the existing column into a table of its own, and fill in the new entries at that time.
The only reason I would ever recommend avoiding such self-referecing tables is that SQL Server does have a few spots where there are limitations with self-referencing tables.
For one, if you ever happen to come across the need for an indexed view, then you'd find out that if one of the tables used in a view definition is indeed self-referencing, you won't be able to create a clustered index on your view :-(
But apart from that - the design per se is sound and absolutely valid - go for it! I always like to keep things as simple as possible (but no simpler than that).
Marc

SQL Select help - removing white space

I'm running into a problem when trying to select records from my 2005 MS-SQL database (I'm still very new to SQL, but I learned and use the basic commands from w3schools already). In theory, all my manufacturer records should be unique. At least that is how I intended it to be when I did my first massive data dump into it. Unfortunately, that is not the case and now I need to fix it! Here is my scenario:
Table name = ItemCatalog
Relevant columns = Partnumber,Manufacturer,Category
When I did a SELECT DISTINCT Manufacturer FROM ItemCatalog this little problem is what turned up:
Cables2Go
CablesToGo
Cables To Go
CableToGo Inc
CablesToGo Inc
All 5 of those showed up as distinct, which they are. Can't fault my SELECT statement for returning it, but from my human perspective they are all the same manufacturer! One method I see working is doing an UPDATE command and fixing all the permutations that show up, but I have a LOT of manufacturers and this would be very time consuming.
Is there a way when I punch in a SELECT statement, that I can find all the likely permutations of a manufacturer name (or any field really)? I attempted the LIKE operator, so my statement would read
SELECT Manufacturer FROM ItemCatalog WHERE Manufacturer LIKE '%CablesToGo%'
but that didn't turn out as well as I had hoped. Here's the nasty bit, my other program that I'm putting together absolutely requires that I only ask for a single manufacturer name, not all 5 variations. Maybe I'm talking in circles here, but is there is a simple way in one statement for me to find a similar string?
If you are doing some data mining, you could also try the SOUNDEX and DIFFERENCE function in SQL Server.
While they are both outdated (they don't handle foreign character very well), they could yield some interesting result for you:
SELECT * FROM ItemCatalog WHERE SOUNDEX(Manufacturer) = SOUNDEX('Cables To Go');
and
SELECT * FROM ItemCatalog WHERE DIFFERENCE(Name, 'Cables To Go') >= 3;
The number 3 means likely similar (0 mean not similar and 4 is very similar)
There are a few number of better SOUNDEX function available on the internet. See Tek-Tips for an example.
Here is another example at SQL Team.
Standard SQL has a SIMILAR statement, which is a bit more powerful than LIKE.
However, you could use LIKE to good effect with:
Manufacturer LIKE 'Cable%Go%'
This would work in this specific case, finding all the variants listed. However, it would also find 'Cable TV Gorgons' and you probably don't need them included. Your version would also find 'We Hate CablesToGo With Ferocity Inc', which you probably didn't want either.
However, data cleansing is a major problem, and there are companies that make a living out of providing data cleansing. You often end up making a dictionary or thesaurus of terms (company names here) mapping all the variants encountered to the canonical form. The problem is that sometimes you find the same variant spelling is used for two separate canonical forms. For example, a pair of bright sparks might both decide to use 'C2G' as an abbreviation, but one uses it for 'Cables To Go Inc' and the other uses it for 'Computers To Gamers Inc'. You have to use some other information to determine whether a particular instance of 'C2G' means 'Cables' or 'Computers'.
'Cable%Go%' might work for that one case, but if you have other variations for other strings, you'll probably have to do a lot of manual data cleanup.
I suggest you to use object relational mapping tool to map your table into object and add filtering logic there.
One option you have is to loosen your wildcard search to something like 'Cables%Go%'. This might be good in the short term, but with this approach you run the risk of matching more manufacturers than you want (ie , Cables on the Go, etc).
You could also put together a mapping table, which would put all of the variants of Cables To Go into a single group, which your app can query and normalize for your ItemCatalog query.
Another option you have is to introduce a Manufacturers table. This your ItemCatalog table would then have a foreign key to this table and only allow manufacturers that are in the Manufacturer table. This would require some cleanup of your ItemCatalog table to get it working, assuming that you want all of the variants of Cables to Go to be the same.
I know others are suggesting query fixes - I thought I'd elaborate on my long-term fix for kicks.
You could create another table relating each of the variations to a single manufacturer entity. If I encountered this situation at work (and I have), I would be enticed to fix it.
Create a manufacturer's table with a primary key, name, etc..
Create a table with aliases - these will only be needed when you are presented with data that doesn't have the manufacturer's ID (like an import file).
Modify ItemCatalog such that it references the primary key from the manufacturer table (i.e. a ManufacturerID foreign key).
When importing data to ItemCatalog, assign the ManufacturerID foreign key based on matches to the alias table. If you have a name that matches 2+ records then you flag them for manual review or you try to match on more than manufacturer name.