When to split up tables in SQL?

When to split up tables in SQL? - sql

I'm relatively new to SQL and trying to teach myself and I'm having a hard time understanding when to keep a column and when to separate it into a new table.
I was watching a lecture where the instructor had a 'Customer' table and one of the columns was 'City' and a lot of the customers were from the same city so the data was redundant. He then broke off 'City' into its own table but that didn't quite make sense to me.
For example, I'm creating a College Course DB and I noticed that certain columns in the 'Course' dimension are repeating very often (like credit hours). Should I break credit hours off into its own table where that table would only have a couple of rows? What does this accomplish? I would still have to use a foreign key to reference the same value for every new data entry so would it even save on storage or would it just be an unnecessary join?
I have other columns as well like 'Days of Week', 'Location', 'Class Time' which also only have a few values that repeat often. Should those be broken off into their own separate tables or be left part of the Course table?

This is always tricky when you are learning databases. The rules of normalization can help, but they can be unclear on when to apply them.
The idea is that (some) database tables represent "entities". These are things you want to store information about. Other tables represent relationships between/among entities, but let's not worry about those for now.
For your specific questions:
"credit hour seem more like an attribute of the course entity. When would they be their own entity? Well, if they had other information specific to being credit hours. It is hard to come up with examples, but for instance: cost, range of effective dates, departments where the credits apply.
"days-of-weeks". If this is for a date, then just use a date and derive the day of the week using database functions for the date or a calendar table.
"days-of-weeks" for scheduling. This one is trickier. There are multiple ways to represent this; the best representation depends on how it is being used.
"location". This sounds like an entity. It could have a name, address, contact, directions and other information. In fact, there could be more than one entity to support this.
"class time". This is probably an attribute of the course, with a start time and end time.

Think about using a Course table and a schedule table. This way you can have one course with many different schedules having different times and days of the week. If there are different locations I would move the location into the schedule table. Format the times correctly so you can calculate duration and cast them if needed. It depends on what the data looks like.
course
PK course_ID (int)
credit_hours (int)
location (varchar)
|
(one to many relationship)
|
schedule
PK schedule_ID (int)
FK course_ID (int)
day_of_week (varchar)
start_time (varchar)
end_time (varchar)

Related

Numeric IDs vs. String IDs

I'm using a very stripped down example here so please ask if you need more context.
I'm in the process of restructuring/normalising a database where the ID fields in the majority of the tables have primary key fields which are auto-incremented numerical ID's (1,2,3 etc.) and I'm thinking I need to change the ID field from a numerical value to a string value generated from data in the row.
My reasoning for this is as follows:
I have 5 tables; Staff, Members, Volunteers, Interns and Students; all of these have numeric ID's.
I have another table called BuildingAttendance which logs when people visited the premises and for what reason which has the following relevant fields:
ID Type Premises Attended
To differentiate between staff and members. I use the type field, using MEM for member and STA for staff, etc. So as an example:
ID Type Premises Attended
1 MEM Building A 27/6/15
1 STA Building A 27/6/15
2 STU Building B 27/6/15
I'm thinking it might be a better design design to use an ID similar to the following:
ID Premises Attended
MEM1 Building A 27/6/15
STA1 Building A 27/6/15
STU2 Building B 27/6/15
What would be the best way to deal with this? I know that if my primary key is a string my query performance may take a hit, but is this easier than having 2 columns?
tl;dr - How should I deal a table that references records from other tables with the same ID system?

Auto-incremented numeric ids have several advantages over strings:
They are easier to implement. In order to generate the strings (as you want them), you would need to implement a trigger or computed column.
They occupy a fixed amount of storage (probably 4 bytes), so they are more efficient in the data record and in indexes.
They allow members to change between types, without affecting the key.
The problem that you are facing is that you have subtypes of a supertype. This information should be stored with the person, not in the attendance record (unless a person could change their type with each visit). There are several ways to approach this in SQL, none as clean as simple class inheritance in a programming language.
One technique is to put all the data in a single table called something like Persons. This would have a unique id, a type, and all the columns from your five tables. The problem is when the columns from your subtables are very different.
In that case, have a table called persons with a unique primary key and the common columns. Then have separate tables for each one and use the PersonId as the primary key for these tables.
The advantage to this approach is that you can have a foreign key reference to Persons for something like BuildingAttendance. And, you can also have foreign key references to each of the subtypes, for other tables where appropriate.

Gordon Linoff already provided an answer that points out the type/supertype issue. I refer to this a class/subclass, but that's just a difference in terminology.
There are two tags in this area that collect questions that relate to class/subclass. Here they are:
class-table-inheritance
shared-primary-key
If you will look over the info tab for each of these tags, you'll see a brief outline. Plus the answers to the questions will help you with your case.
By creating a single table called Person, with an autonumber ID, you provide a handy way of referencing a person, regardless of that person's type. By making the staff, member, volunteer, student, and intern tables use a copy of this ID as their own ID you will facilitate whatever joins you need to perform.
The decision about whether to include type in attendance depends on whether you want to retrieve the data with the person's current type, or with the type the person had at the time of the attendance.

I dont get DB normalization - isn't repeating FKs also repeating?

Well, I have just heard about that today but I do not get it. So I should not have Transaction table with Date column (because more transactions can occur at the same day) but I should have a Transaction and a Date column, where a Date would have a FK to a transaction. What is the point then, instead of a date I will repeat FK.
an example: A broker can make a transaction at any date. (transaction then needs to hold broker and date information).

Check out:
http://en.wikipedia.org/wiki/Database_normalization#Normal_forms
Transaction date does not need to be normalized.
But, imagine that Transaction is tied to customers, and customer details also have to be kept - this is a case where normalization helps to reduce data redundancy.

Assuming your date table is like the period table in our data warehouse, it is probably structured something like this:
Field date, datatype date (not datetime) primary key
other fields include fiscal year and holiday information
Then your transaction table might resemble something like this:
broker_id, foreign key to broker
date, foreign key to date
transaction time
other fields as necessary
Your question was, "what's the point?". This sort of database design allows you to easily answer questions like, "give me broker x's stats for the past 5 fiscal years, broken down by fiscal period"

Normalizing scalar values (dates, numbers, etc.) is generally overkill. Just because values repeat doesn't mean they should be normalized out. Only repeating values that aren't directly related to the row's primary key (e.g. an Address) should be candidates for normalization.
The only benefit I can see to normalizing dates if you want to add different representations of each date (e.g. Month, Quarter, etc.) without having to do the math each time. Otherwise the drwabacks outweigh the advantages in my opinion.

Moving a date attribute into another table and making it a foreign key in the Transaction table has nothing to do with "normalization".
Consider the example relation:
T{TransactionId,Date}
and dependency
{TransactionId}->{Date}
If TransactionId is a key then T already satisifies 6th Normal Form. Moving Date into another table, replacing it with another attribute and/or making it a foreign key will not make T any "more normalized" than it is already.
Whether or not attribute values "repeat" is irrelevant in normalization. What matter are the functional dependencies and join dependencies you mean to satisfy in your database schema. "Repeating data" is a phrase sometimes used informally to describe what functional dependencies are about but it is an oversimplification to say that decomposition is required simply because data repeats.

Using the date is not an ideal example. Think instead of customer records, tied to each transaction. You want to store the FK of a customer within each transaction row. You don't want to store the customer's name, address, password repeatedly though!

Database Modeling of a Softball League

I am modeling a database for use in a softball league website. I'm not that experienced in DB Modeling, and I'm having a hard time with a question about the future.
Right now I have the following tables:
players table (player_id, name, gender, email, team_id)
teams table (team_id, name, captain[player_id], logo, wins_Regular_season, losses_regular_season)
regular_season table (game_id, week, date, home[team_id], away[team_id], home_score, away_score, rain_date)
playoff table (pgame_id, date, home[team_id], away[team_id], home_score, away_score, winnerTo[pgame_id], loserTo[pgame_id])
To make the data persist from season to season, but to also have an easy way to access the data should I:
A) include a year column in the tables and then filter my queries by year?
B) create new tables every year?
C) Do something else that makes more sense but that I can't think of.

This design is not only bad about the future. It is also wrong regarding the past. You're not keeping history in a proper way.
Let's suppose a player changes team: how would that fit into this design? It should be easy to get that kind of information...
And the best way of doing that (IMHO) would be also representing the season as an entity, as a concrete table. Then you should replicate this information in each relationship. Meaning, for instance, that a player does not simply belong to a team: he belongs to a team in a specific season, and may belong to another team when the season changes.
OTOH, I don't think it's wise to keep regular_season and playoff as distinct tables: they could be easily merged into one, by adding some sort of flag in order to keep that information.
Edit This is what I'm meaning:
Notice that there is a Season table.
A Player belongs to a Team in a Season.
NO NEED TO DUPLICATE ANYTHING. A team has only ONE record in the DB; a player will be associated to only ONE record.
I did NOT design the Playoff table, because I believe it should not exist. If the OP disagrees, just add it.
That way you can keep track of all seasons, without needing to replicate the whole DB. I think this is also better than using a year column, which is not meaningful, and cannot be easily constrained (like a foreign key can).
But, please, feel free to disagree.

The standard way would be to add year columns to your tables. That way you can easily call up the past with a select query, or view. SQL Server has good support for this. I've dealt with cleanup of the other route, and it isn't pretty after a few years of data have accumulated.

I would go with option A and have a year column in the seasons table and playoff table.

You already have a date column, you can use that to find the year
SELECT * FROM regular_season WHERE YEAR(date) = 2011

database: summarizing data which expires

I'm struggling to find an efficient and flexible representation for my data. We have a many-to-many relationship between two entities which have arbitrary lifetimes. Let's call these Voter and Candidate. Each relationship has a measurement which we'd like to summarize in various ways. These are timestamped and are guaranteed to be within the lifetime of the two related entities. Let's say the measure is approval rating, or just Rating.
One unusual requirement is that if I'm summarizing a period which has no measurement, I should substitute the latest valid measurement, rather than giving NULL.
Our current solution is to compile a list of valid voters and candidates for each day, then formulate a many-to-many table which records the latest valid measure.
What would your solution be?
This allows me to do a single query to get a daily summary:
select
avg(rating), valid_date, candidate_SSN, candidate_DOB
from
daily_rating natural join rating
group by
valid_date, candidate_SSN, candidate_DOB
This might work ok, but It seems inefficient to me. We're repeating a lot of data, especially if nothing happens for a given day. It also is unclear how to do weekly/monthly summaries without compiling even more tables. Since we're dealing with millions of rows (we're not really talking about voter polls...) I'm looking for a more efficient solution.

I have used data-warehousing technique here, hence the dim and fact table names.
dimDate is so-called date dimension, one row per a date.
dimCandidate has all candidate data, new and old records. In data-warehousing terms this is called type 2 dimension. One candidate can have several rows in this table, only one of them having r_status = 'current'.
Fields
, r_valid_from date
, r_valid_to date
, r_version integer -- (1, 2, 3,..)
, r_status varchar(10) -- (expired, current)
describe a record (row) status. Each time a candidate status changes, a new row is inserted and the pervious row's r_valid_to and r_status are modified.
CandidateFullName is a business (natural) key and has to uniquely identify a candidate. No two candidates can have the same CandidateFullName. Note that the CandidateKey uniquely identifies a row in the table, while CandidateFullName uniquely identifies a candidate.
dimVoter has voter data, new and old records -- just like the dimCandidate.
dimCampaign describes campaign details, this is so-called type one dimension, does not hold historical data.
factRating has the Rating measure.
Normaly this would be enough, but there is the reqirement to interpolate the missing data for a day; for that, an aggregate table aggDailyRating is introduced. At the end of a day, a scheduled job aggregates ratings for the day. This job takes care of the data-interpolation requirement.
This way the aggregate table has one row for each date-(valid) candidate-campaign combination. Note that voter is not included in the combination, data is aggregated over all voters.
Any reporting is done on the aggregate table, for example
--
-- monthy rating for years 2009-2010
-- for candidate john_smith_256
--
select
CalendarYear
, MonthNumber
, avg(DailyRating) as AverageRating
from aggDailyRating as f
join dimDate as d on d.DateKey = f.DateKey
join dimCandidate as c on c.CandidateKey = f.CandidateKey
where CandidateFullName = 'john_smith_256'
and CalendarYear between 2009 and 2010
group by CalendarYear, MonthNumber
order by CalendarYear desc, MonthNumber desc ;

Yes, that is very inefficient and wasteful. It is merely a set of files, not reasonably comparable to a set of "tables" or a "database"; extensions and enhancements to it will compound the duplication and inefficiency. Duplication is the antithesis of a database. In database terms, there are far more efficient and easier ways to implement that.
Assumption
Your post does not provide much info, so I have had to make some assumptions, but I think you can correct my submission quite easily if any of them are incorrect. Otherwise comment, and I will correct my submission.
A Voter is a Person; a Candidate is a Voter; (Candidate = subset of Voter)
A Campaign is related to Candidate (not to a Polling Campaign).
A Poll is a survey of the Voters response to a Candidate's performance, staring on a set date, running over a few days, and completing on an set date.
There are many Measures, such as ApprovalRating, that are surveyed in each Poll.
The Measures of such surveys across all Voters are aggregated at the Poll level.
Limitation
The expiry requirement is unclear, so I am not suggesting I have implemented that. If the model does not provide that for you (if it is not immediately obvious), supply details and I will add to the model. The current model provides exclusion/inclusion capability for what I understand the expiry requirement to be.
The Poll::Measure does not have enough info to be implemented fully; I need further details. The submission is primitive and unconstrained in that area.
Likewise, any Poll::Campaign relation or constraint ("there are many Polls per Campaign, and they are always related to Campaign") has not been implemented.
The arrangement of the key in the child tables is arbitrary for now: if you identify the most common queries, it can be re-arranged, so that the most those obtain the best speed.
Submission
Campaign Poll Data Model
This is just a Relational (Normalised; zero duplication) Database, pure IDEF1X, including provision for the consideration that the child tables will be huge: migration of narrow surrogate keys into the child tables, avoiding migration of wide keys.
It provides "data warehouse" capability as is. In fact, if it does not provide any BI or DSS requirement in a single query, that is only due to lack of detail from you; please provide, and I will happily change it. (Note, your item re "single query" is actually "single file"; joins are pedestrian in a Relational database.)
Keys such as %Code are 2-, 3-, and at most 4-characters. Such keys are just as fast as Integer keys, and very helpful (makes sense) when perusing the tables (without having to join the parent).
Any and all aggregation, either to load the historic rows, or to produce aggregates for the current values, should be possible in a single Relational (set-oriented) command; you should not need to resort to serial (cursor) processing. Again, if you think you need to, please comment and I will provide the set-oriented method.
We implement Versioning in DBs quite differently to the way it is done in DWs, and without limitations. Please identify if you require versioning of (eg) Candidate, and I will provide.
Last, the Null requirement is not unusual. It is catered for here. Again, if you think it isn't ...

Adding new fields vs creating separate table

I am working on a project where there are several types of users (students and teachers). Currently to store the user's information, two tables are used. The users table stores the information that all users have in common. The teachers table stores information that only teachers have with a foreign key relating it to the users table.
users table
id
name
email
34 other fields
teachers table
id
user_id
subject
17 other fields
In the rest of the database, there are no references to teachers.id. All other tables who need to relate to a user use users.id. Since a user will only have one corresponding entry in the teachers table, should I just move the fields from the teachers table into the users table and leave them blank for users who aren't teachers?
e.g.
users
id
name
email
subject
51 other fields
Is this too many fields for one table? Will this impede performance?

I think this design is fine, assuming that most of the time you only need the user data, and that you know when you need to show the teacher-specific fields.
In addition, you get only teachers just by doing a JOIN, which might come in handy.
Tomorrow you might have another kind of user who is not a teacher, and you'll be glad of the separation.
Edited to add: yes, this is an inheritance pattern, but since he didn't say what language he was using I didn't want to muddy the waters...

In the rest of the database, there are no references to teachers.id. All other tables who need to relate to a user
use users.id.
I would expect relating to the teacher_id for classes/sections...
Since a user will only have one corresponding entry in the teachers table, should I just move the fields from the teachers table into the users table and leave them blank for users who aren't teachers?
Are you modelling a system for a high school, or post-secondary? Reason I ask is because in post-secondary, a user can be both a teacher and a student... in numerous subjects.

I would think it fine provided neither you or anyone else succumbs to the temptation to reuse 'empty' columns for other purposes.
By this I mean, there will in your new table be columns that are only populated for teachers. Someone may decide that there is another value they need to store for non-teachers, and use one of the teacher's columns to hold it, because after all it'll never be needed for this non-teacher, and that way we don't need to change the table, and pretty soon your code fills up with things testing row types to find what each column holds.
I've seen this done on several systems (for instance, when loaning a library book, if the loan is a long loan the due date holds the date the book is expected back. but if it's a short loan the due date holds the time it's expected back, and woe betide anyone who doesn't somehow know that).

It's not too many fields for one table (although without any details it does seem kind of suspicious). And worrying about performance at this stage is premature.
You're probably dealing with very few rows and a very small amount of data. You concerns should be 1) getting the job done 2) designing it correctly 3) performance, in that order.
It's really not that big of a deal (at this stage/scale).

I would not stuff all fields in one table. Student to teacher ratio is high, so for 100 teachers there may be 10000 students with NULLs in those 17 fields.
Usually, a model would look close to this:
I your case, there are no specific fields for students, so you can omit the Student table, so the model would look like this
Note that for inheritance modeling, the Teacher table has UserID, same as the User table; contrast that to your example which has an Id for the Teacher table and then a separate user_id.

it won't really hurt the performance, but the other programmers might hurt you if you won't redisign it :) (55 fielded tables ??)

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas