I've been looking at Microsoft's Adventure Works 2012 database. I'd be very interested if there's any information explaining why the tables were created as they are. Some sort of schema overview I guess.
For example:
Why they chose to create a BusinessEntity table as a sort of base class for Person, Employee, etc.
Most of the data is normalized so why they chose to put the CountryRegionCode field into the StateProvince table instead of an ID to a separate table.
Anyway, I'm very interested in learning more about the decisions that went into the database's design. Does anyone know a resource that goes into this sort of thing?
I am not aware of any official design documentation for AdventureWorks, but I used to be a trainer and used the AdventureWorks databases extensively for demos and labs, so I am pretty familiar with them.
The BusinessEntity table is a classic case of a SuperType/SubType design, which reduces data redundancy, because customers can also become vendors, employees can become customers, and so on for every other combination. It also means that you are not storing details common to all entity types repeatedly in separate tables, which minimises effort when things change.
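As a rough illustration of the pattern (heavily simplified, not the actual AdventureWorks DDL):

-- Simplified sketch of the supertype/subtype pattern
CREATE TABLE BusinessEntity (
    BusinessEntityID INT IDENTITY(1,1) PRIMARY KEY
);

CREATE TABLE Person (
    BusinessEntityID INT PRIMARY KEY
        REFERENCES BusinessEntity (BusinessEntityID),
    FirstName NVARCHAR(50) NOT NULL,
    LastName  NVARCHAR(50) NOT NULL
);

CREATE TABLE Vendor (
    BusinessEntityID INT PRIMARY KEY
        REFERENCES BusinessEntity (BusinessEntityID),
    AccountNumber NVARCHAR(15) NOT NULL
);

-- The same BusinessEntityID can appear in several subtype tables,
-- so an entity that is both a person and a vendor is stored only once at the supertype level.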
The CountryRegionCode I am not positive about, but I would suspect one of three reasons:
There are not enough distinct combinations to warrant an additional table at the expense of reporting performance (this could be validated with some simple COUNT(*) GROUP BY statements; see the sketch after this list)
They wanted it in the same table so that in future they had the flexibility to model a hierarchy structure with hierarchyID (this is the least likely option)
It was a normalisation error! (My money is on this option!)
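For the first point, a quick check along these lines would tell you how many distinct values are actually involved (the table lives in the Person schema in AdventureWorks 2012, if I remember correctly):

-- How many StateProvince rows share each country/region code?
SELECT CountryRegionCode, COUNT(*) AS StateProvinceCount
FROM Person.StateProvince
GROUP BY CountryRegionCode
ORDER BY StateProvinceCount DESC;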
This image helped me, even though it's technically for 2008.
Here is a link that fills in a bit of the background:
https://technet.microsoft.com/en-us/library/ms124825(v=sql.100).aspx
I am trying to design a database for a garage management system, and I am struggling with over-normalising. Would it be best to have User_ID as a foreign key in all tables to make querying faster, or does this design make sense?
Any help is appreciated
As for the ORM, your design is absolutely correct: there is no redundancy, and data is easily reached by transitivity. Nevertheless, development sometimes calls for simplification and better performance, and performance must prevail over design.
I would not recommend adding user_id to the other tables. That just seems like a maintenance nightmare -- ensuring that the user_id is consistent across the tables.
Instead, you might want to borrow a technique associated with dimensional modeling (described on Wikipedia): flattening dimensions.
This means that you would have one table, apart from users, at the most detailed level. All the information about spaces, zones, boxes, and items would be in the same table.
This works assuming that the dimension values do not really change over time -- think time dimensions or geographic dimensions.
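A hypothetical sketch of what that flattened dimension could look like for the garage system (all names and types are made up):

-- One row per item, with its space/zone/box repeated on each row.
CREATE TABLE location_dim (
    location_id INT PRIMARY KEY,
    space_name  VARCHAR(50) NOT NULL,
    zone_name   VARCHAR(50) NOT NULL,
    box_name    VARCHAR(50) NOT NULL,
    item_name   VARCHAR(50) NOT NULL
);

-- A fact-style table then needs only one foreign key instead of a chain of joins.
CREATE TABLE storage_event (
    event_id    INT PRIMARY KEY,
    user_id     INT NOT NULL,
    location_id INT NOT NULL REFERENCES location_dim (location_id),
    event_date  DATE NOT NULL
);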
I started a new application and now I am looking at two paths and don't know which is the right way to continue.
I am building something like an eCommerce site. I have categories and subcategories.
The problem is that there are different types of products on the site and each has different properties. And the site must be filterable by those product properties.
This is my initial database design:
Products{ProductId, Name, ProductCategoryId}
ProductCategories{ProductCategoryId, Name, ParentId}
CategoryProperties{CategoryPropertyId, ProductCategoryId, Name}
ProductPropertyValues{ProductId, CategoryPropertyId, Value}
Now after some analysis I see that this design is actually EAV model and I read that people usually don't recommend this design.
It seems that dynamic sql queries are required for everything.
That's one way and I am looking at it right now.
Another way that I see would probably be called the A LOT OF WORK way, but if it's better I want to go there.
Make a table
Product{ProductId, CategoryId, Name, ManufacturerId}
and implement table inheritance in the database, which means making tables like
Cpus{ProductId ....}
HardDisks{ProductId ....}
MotherBoards{ProductId ....}
etc., one for each product type (a 1:1 relation).
I understand that this will be a very large database and a very large application domain, but is it better, easier, and faster than option one with the EAV design?
EAV is rarely a win. In your case I can see the appeal of EAV given that different categories will have different attributes and this will be hard to manage otherwise. However, suppose someone wants to search for "all hard drives with more than 3 platters, using a SATA interface, spinning at 10k rpm?" Your query in EAV will be painful. If you ever want to support a query like that, EAV is out.
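To illustrate, here is roughly what that search looks like against the ProductPropertyValues design from the question (the property IDs are hypothetical, and Value has to be cast because it is stored as text):

-- "Hard drives with more than 3 platters, SATA interface, 10k rpm" in the EAV design:
-- one join per attribute.
SELECT p.ProductId, p.Name
FROM Products p
JOIN ProductPropertyValues platters
  ON platters.ProductId = p.ProductId
 AND platters.CategoryPropertyId = 101          -- hypothetical id for "Platters"
 AND CAST(platters.Value AS INT) > 3
JOIN ProductPropertyValues iface
  ON iface.ProductId = p.ProductId
 AND iface.CategoryPropertyId = 102             -- hypothetical id for "Interface"
 AND iface.Value = 'SATA'
JOIN ProductPropertyValues rpm
  ON rpm.ProductId = p.ProductId
 AND rpm.CategoryPropertyId = 103               -- hypothetical id for "RPM"
 AND CAST(rpm.Value AS INT) = 10000;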
There are other approaches however. You could consider an XML field with extended data or, if you are on PostgreSQL 9.2, a JSON field (XML is easier to search though). This would give you a significantly larger range of possible searches without the headaches of EAV. The tradeoff would be that schema enforcement would be harder.
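A sketch of the extended-attributes idea; note this assumes a newer PostgreSQL with jsonb and its operators, which the plain json type in 9.2 does not have:

-- Products keep their fixed columns; category-specific attributes live in one document column.
CREATE TABLE products (
    product_id  SERIAL PRIMARY KEY,
    category_id INT NOT NULL,
    name        TEXT NOT NULL,
    attributes  JSONB NOT NULL DEFAULT '{}'
);

INSERT INTO products (category_id, name, attributes)
VALUES (3, 'Example 10k SATA drive',
        '{"platters": 4, "interface": "SATA", "rpm": 10000}');

-- The same filter as before, without one join per attribute.
SELECT product_id, name
FROM products
WHERE (attributes->>'platters')::int > 3
  AND attributes->>'interface' = 'SATA'
  AND (attributes->>'rpm')::int = 10000;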
This question seems to discuss the issue in greater detail.
Apart from performance, extensibility and complexity discussed there, also take into account:
SQL databases such as SQL Server have full-text search features, so if you have a single field describing the product, full-text search will index it and will be able to provide advanced semantic searches (see the example after this list)
take a look at the NoSQL systems that are all the rage right now; scalability should be quite good with them, and they provide support for non-structured data such as yours. Hadoop and Cassandra are good starting points.
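Regarding the full-text point above, a hypothetical SQL Server setup could look like this (it assumes a unique key index named PK_Products on dbo.Products and a Description column):

CREATE FULLTEXT CATALOG ProductCatalog AS DEFAULT;

CREATE FULLTEXT INDEX ON dbo.Products (Description)
    KEY INDEX PK_Products;

-- Free-form search over the description text.
SELECT ProductId, Name
FROM dbo.Products
WHERE CONTAINS(Description, N'"SATA" AND "10000 rpm"');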
You could very well work with the EAV model.
We do something similar with a Logistics application. It is built on .net though.
Apart from the tables, your application code has to handle the objects correctly.
See if you can add a generic table for each object. It works for us.
My professor (who claims to have had a firm understanding of systems development for many years) and I are arguing about the design of our database.
As an example:
My professor insists this design is right:
(list of columns)
Subject_ID
Description
Units_Lec
Units_Lab
Total_Units
etc...
Notice the total units column. He said that this column must be included.
I tried to explain that it is unnecessary, because if you want it, you can just write a query that adds the two.
I showed him an example I found in a book, but he insists that I don't have to rely on books too much in making our system.
The same thing applies to similar cases as in this one:
student_ID
prelim_grade
midterm_grade
prefinal_grade
average
He wanted me to include the average! Everywhere I look, I find articles convincing me that this is a violation of normalization. If I need the average, I can easily compute it from the three grades. He enumerated some scenarios, including: "Hey! What if the query gets accidentally deleted? What will you do? That is why you need to include it in your table!"
Do I need to restructure my database (which consists of more than 40 tables) to comply with what he wants? Am I wrong and have just overlooked these things?
Another thing is that he wanted to include the total amount in the payments table, which I believe is unnecessary (just multiply the unit price of the product by the quantity). He pointed out that we need that column for computing debits and/or credits that are critical for the overall system management, and that it is needed for balancing transactions. Please tell me what you think.
You are absolutely correct! One of the rules of normalization is to eliminate attributes that can be easily derived from other attributes' values, i.e., by performing some calculation. In your case, the total units column can be obtained by simply adding the two.
Tell your professor that having that particular column shows clear signs of a transitive dependency, and that according to the third normal form it is recommended to remove such columns.
You are right when you say your solution is more normalized.
However, there is a thing called denormalization (google for it) which is about deliberately violating normalization rules to increase queries performance.
For instance, suppose you want to retrieve the first five subjects (whatever the entity is) ordered by decreasing total units.
Your solution would require a full scan of two tables (subject and unit), joining the result sets and sorting the output.
Your professor's solution would require just taking the first five records from an index on total_units.
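To make that concrete, here is a rough sketch under the assumptions above (the Subjects and Units tables are hypothetical; the question only shows a single table):

-- Normalized: totals are computed at query time.
SELECT TOP 5 s.Subject_ID, s.Description, SUM(u.Units) AS Total_Units
FROM Subjects s
INNER JOIN Units u ON u.Subject_ID = s.Subject_ID
GROUP BY s.Subject_ID, s.Description
ORDER BY Total_Units DESC;

-- Denormalized: a stored Total_Units column can be served straight from an index.
CREATE INDEX IX_Subjects_TotalUnits ON Subjects (Total_Units DESC);

SELECT TOP 5 Subject_ID, Description, Total_Units
FROM Subjects
ORDER BY Total_Units DESC;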
This of course comes at the price of increased maintenance cost (both in terms of computational resources and development).
I can't tell you who is "right" here: we know nothing about the project itself, data volumes, queries to be made etc. This is a decision which needs to be made for every project (and for some projects it may be a core decision).
The thing is that the professor does have a rationale for this requirement, which may or may not be justified.
Why he hasn't explained everything above to you himself, is another question.
In addition to redskins80's great answer, I want to point out why this is a bad idea: every time you need to update one of the source columns, you need to update the calculated column as well. That is extra work that can easily harbor bugs (perhaps a year later, when a different programmer is altering the system).
Maybe you can use a computed column instead? That would be a workable middle-ground.
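For the table from the question, a minimal SQL Server sketch would look like this (the column types are assumptions):

-- Total_Units is always derived from the two source columns, so it can never drift.
CREATE TABLE Subjects (
    Subject_ID  INT PRIMARY KEY,
    Description VARCHAR(100) NOT NULL,
    Units_Lec   INT NOT NULL,
    Units_Lab   INT NOT NULL,
    Total_Units AS (Units_Lec + Units_Lab) PERSISTED
);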
Edit: Denormalization has its place, but it is the last measure to take. It is like chemotherapy: the doctor injects you with poison only to cure an even greater threat to your health. It is the last possible step.
I think it is important to add this because, in my opinion, the existing answers are not complete. The original question has been answered well, but there is a gap, so I will only address the added part of the question, quoted below:
"Another thing is that he wanted to include the total amount in the payments table, which I believe is unnecessary (just compute it from the unit price of the product and the quantity). He pointed out that we need that column for computing debits and/or credits that are critical for the overall system management, and that it is needed for balancing transactions. Please tell me what you think."
This edit is interesting. Given that this is a transactional system handling money, it has to be accountable. I will use some basic terms: transaction, product, price, amount.
In that sense it is very common, or even required, to denormalize. Why? Because it needs to be accountable: once the transaction is registered, that's it; it may never be modified. If you need to correct it, you make another transaction.
Now yes, you can of course calculate, for example, product price * amount * taxes, etc. That makes sense from a normalization standpoint. But then you would need a complete lockdown of all related records. Take the products table, for example: if the price changes before the transaction, the new price should be taken into account when the transaction happens; but if the price changes afterwards, it must not affect the transaction.
So it is not acceptable to just join on transaction.product_id = products.id, since that product might change. Example:
2012-01-01 price = 10
2012-01-05 price = 20
Transaction happens here, we sell 10 items so 10 * 20 = 200
2012-01-06 price = 22
Now we look up the transaction on 2012-01-10, so we do:
SELECT
transactions.amount * products.price AS totalAmount
FROM transactions
INNER JOIN products on products.id=transactions.product_id
That would give 10 * 22 = 220, which is not correct.
So you have 2 options:
Do not allow updates on the products table. Instead, make the table versioned: for every change you add a new row (an INSERT) instead of an UPDATE, so the transaction keeps pointing at the right version of the product.
Or you just add the fields to the transactions table: add totalAmount to the transactions table and calculate it (inside a database transaction) when the transaction row is inserted, then save it (see the sketch below).
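A rough sketch of option 2, with hypothetical column names and types:

-- Capture price and total at the moment of sale, inside one database transaction.
ALTER TABLE transactions ADD unitPrice DECIMAL(10,2), totalAmount DECIMAL(12,2);

BEGIN TRANSACTION;

INSERT INTO transactions (product_id, amount, unitPrice, totalAmount)
SELECT p.id, 10, p.price, 10 * p.price
FROM products p
WHERE p.id = 42;            -- the price is read and frozen inside the same transaction

COMMIT TRANSACTION;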
Yes, it is denormalized, but for a good reason: it makes the data accountable. Combined with database transactions, locks, etc., you know and can verify that at the moment the transaction happened it related to the product as it was then, with price = 20.
Next to that (and this is just a nice side effect of denormalization when you have to do it anyway), it makes reporting very easy: the total transaction amount for the month, the year, etc. is all trivial to calculate.
Normalization has its benefits, for example no double storage, a single point of edit, etc. But in this case you simply don't want that, because it is neither allowed nor preferred for a transaction-log style database.
Think of a transaction as a record of something that happened in the real world. It happened, and you wrote it down. You cannot change history; it was written as it was, and the future won't change it.
If you want to implement the good, old, classic relational model, I think what you're doing is right.
In general, it's actually a matter of philosophy. Some systems, Oracle being an example, even allow you to give up the traditional, relational model in favor of objects, which (by being complex structures kept in tables) violate the 1st NF but give you the power of object-oriented model (you can use inheritance, override methods, etc.), which is pretty damn awesome in some cases. The language used is still SQL, only extended.
I know my answer drifts away from the subject (as we take into consideration a whole new database type) but I thought it's an interesting thing to share on the occasion of a pretty general question.
Database design for actual applications is hardly just a question of which tables to create. Currently, there are countless possibilities when it comes to keeping and processing your data. There are the relational systems we all know and love, object databases (like db4o), object-relational databases (not to be confused with object-relational mapping; what I mean is tools like Oracle 11g with its objects), XML databases (take eXist), stream databases (like Esper), and the currently thriving NoSQL databases (some insist they shouldn't be called databases at all) like MongoDB, Cassandra, CouchDB, or Oracle NoSQL.
In the case of some of these, normalization loses its sense. Each model serves a completely different purpose. I think the term "database" has a much wider meaning than it used to.
When it comes to relational databases, I agree with you and not the professor (although I'm not sure it's a good idea to oppose him too strongly).
Now, to the point. I think you might win him over by showing that you are open-minded and that you understand that there are many options to take into consideration (including his views) but that the situation requires you to normalize the data.
I know my answer is quite a stream of consciousness for a Stack Overflow post, but I hope it's not received as lunatic babbling.
Good luck in the relational tug of war
You are talking about historical and financial data here. It is common to store some computed values that will never change, because they represent the cost that was charged at the time. If you do the calculation from product * price and the price changed 6 months after the transaction, then you have the incorrect value. Your professor is smart; listen to him. Further, if you do a lot of reporting off the database, you don't want to repeatedly calculate values that are not allowed to change without another record of data entry. Why perform a calculation many times over the history of the application when you only need to do it once? That is wasteful of precious server resources.
The purpose of normalization is to eliminate redundancies so as to eliminate update anomalies, predominantly in transactional systems. Relational is still the best solution by far for transaction processing, DW, master data, and many BI solutions. Most NoSQL systems have low integrity requirements. So you lose my tweet -- annoying but not catastrophic. But losing my million-dollar stock trade is a big problem. The choice is not NoSQL vs. relational. NoSQL does certain things very well, but relational is not going anywhere. It is still the best choice for transactional, update-oriented solutions. The requirements for normalization can be loosened when the data is read-only or read-mostly. That's why redundancy is not such a huge problem in DW; there are no updates.
I'd like to hear some opinions or discussion on a matter of database design. My colleagues and I are developing a complex application in the finance industry that is being installed in several countries.
Our contractors wanted us to keep a single application for all the countries, so naturally we face difficulties with the different workflows in each of them and try to make the application adjustable enough to satisfy the various needs.
The issue I encountered today was a request from the head of the IT department on the contractor's side that we keep the database model fixed in terms of the tables and the columns they consist of.
For example, we have a table with different risks and we needed to add a flag column IsSomething (BIT NOT NULL ...). It fully qualifies to exist within the risk table according to the third normal form: no transitive dependency to the key, a non-key value, ...
BUT, the guy said that he wants to keep the tables as they are, so we had to create a new table "riskinfo" and link the new column's data 1:1 to the risk table.
What is your opinion?
We add columns to our tables that are referenced by a variety of apps all the time.
As long as the applications specifically reference the columns they want to use, and you make sure the new fields are either nullable or have a sensible default defined so they don't interfere with inserts, I don't see any real problem.
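For example, something along these lines (the table and constraint names are made up to match the question):

-- The new flag is NOT NULL but has a default, so existing INSERT statements
-- that do not mention it keep working, and existing rows get the default value.
ALTER TABLE dbo.Risk
    ADD IsSomething BIT NOT NULL
    CONSTRAINT DF_Risk_IsSomething DEFAULT (0);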
That said, if an app does a SELECT * and then proceeds to reference the columns by index rather than by name, you could break existing code. Personally, I am confident that nothing referencing our database does this because of our coding conventions (that, and I suspect the code review process would lynch anyone who tried it :P), but if you're not certain then there is at least some small risk in such a change.
In your actual scenario, I'd go back to the contractor, give your reasons why you don't think the change will cause any problems, and ask the rationale behind their choice. Maybe they have some application-specific wisdom behind the suggestion, maybe it's just paranoia from dealing with other companies that change the database structure in ways that aren't backwards-compatible, or maybe it's a policy at their company that got rubber-stamped long ago and nobody has challenged since. Until you ask, you never know.
This question is indeed subjective, as Binary Worrier commented. I do not have an answer nor any suggestion; I'm just sharing my 2 cents.
Do you know the rationale for those decisions? Sometimes good designs are compromised for the sake of not breaking currently working applications, or simply because too much has already been built on top of the previous one. There could also be many other non-technical reasons.
Very often, the programming community is unreasonably concerned about the ripple effect that results from redefining tables. Usually, this is a result of failure to understand data independence, and failure to guard the data independence of their operations on the data. Occasionally, the original database designer is at fault.
Most object oriented programmers understand encapsulation better than I do. But these same experts typically don't understand squat about data independence. And anyone who has learned how to operate on an SQL database, but never learned the concept of data independence is dangerously ignorant. The superficial aspects of data independence can be learned in about five minutes. But to really learn it takes time and effort.
Other responders have mentioned queries that use "select *". A select with a wildcard is more data dependent than the same select that lists the names of all the columns in the table. This is just one example among dozens.
The thing is, both data independence and encapsulation pursue the same goal: containing the unintended consequences of a change in the model.
Here's how to keep your IT chief happy. Define a new table, with a new name, that contains all the columns from the old table plus the additional columns that are now necessary. Create a view, with the same name as the old table, that contains precisely the same columns, in the same order, as the old table had. Typically, this view will show all the rows of the old table, and the old PK will still guarantee uniqueness.
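A minimal sketch of that approach, with hypothetical names:

-- The widened table gets a new name; a view with the old name preserves the old column list and order.
CREATE TABLE dbo.RiskBase (
    RiskId      INT PRIMARY KEY,
    Name        NVARCHAR(100) NOT NULL,
    IsSomething BIT NOT NULL CONSTRAINT DF_RiskBase_IsSomething DEFAULT (0)  -- the new column
);

-- ... copy the rows across from the old dbo.Risk table, then drop or rename it ...
GO

CREATE VIEW dbo.Risk
AS
SELECT RiskId, Name   -- exactly the old columns, in the old order
FROM dbo.RiskBase;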
Once in a while, this will fail to meet all of the IT chief's needs. And if the IT chief is really saying "I don't understand databases; so don't change anything" then you are up the creek until the IT chief changes or gets changed.
I have a bit of an architecture problem here. Say I have two tables, Teacher and Student, both of them on separate servers. Since these tables share a lot of data and functionality, I would like to use this inheritance scheme and create a People table; however, I would need to keep the Teacher table and the People records relating to Teacher on one server, and the Student table and the People records relating to Student on another server. This was a requirement made by the lead developer, since we have too many (and I mean too many) records for Teacher and Student, and a single database containing all of the People would collapse. Moreover, the clients NEED to have them on separate servers (sigh).
I would really like to implement the inheritance scheme, since a lot of the functionality could be shared among the databases. Is there any possible way to do this? Any other architecture that may suit this type of problem? Am I just crazy?
--- EDIT ---
Ok, I don't really have Teachers and Students per se; I just used those names to simplify my explanation. In truth, there are about 9 sub-tables that would inherit from the super table, all of them on separate servers for separate applications. And no, I don't have that type of database, but we have pretty low-end servers for the amount of transactions we handle ;). You're right, my statements are a bit exaggerated, and I apologize for that; it was just to make you guys answer faster (sorry :P). The different servers are more of a business restriction than anything else (although the lead developer DID say that a common database storing the SuperTable would collapse under its own weight -- his words, not mine :S). Our clients don't like their information mixed with other clients' information, so we must keep it on different servers -- pretty stupid, but the decision-makers have spoken :(.
Under what assumption did you determine that you have too much data? I'm pretty sure you could list every teacher and student in the world, and not cause SQL Server any grief.
This seems like an arbitrary decision that is going to have significant impact on the complexity of any solution you design.
Take a look here - I'm sure you don't measure your database in anything close to the scale represented on this page, and many of these db's are running on SQL Server.
I don't know for sure if this is possible with SQL Server specifically, but it smells like something that could be solved with clustering and tablespace partitioning.
What I wonder about is whether this is really a good requirement; it introduces a lot of technical complexity based on a pretty simple assertion that there's just too much data. Have you attempted to verify this? A simple test would be to create a simple schema and populate it with dummy data for the number of rows you expect in production. It would probably be in your best interest to perform this test before you go too far down the road to implement this 'requirement'.
By the way, the type of schema you linked to is an example of the class table inheritance pattern.
It would be possible for you to implement a domain model for this project where the common attributes of Teacher and Student are described by a Person interface or base class which the common operations are written against. If you plan to use stored procedures extensively, this might not be a useful option, but it's something to consider.
I think Paul is correct - perhaps look at your hardware infrastructure rather than your DB schema.
Using clustering, proper indexing, and possibly a data archive scheme should solve any performance problems. The inheritance scheme seems to be the best data model.
It is possible to split the data over multiple servers and keep the scheme, but I think you'd definitely have more performance problems than if you looked at clustering/proper indexing. By setting up linked servers you can do cross-server queries.
e.g. Students query
SELECT *
FROM SERVER_A.People.dbo.Persons P
INNER JOIN SERVER_B.People.dbo.Students S
ON P.PersonID = S.PersonID
--EDIT-- As Paul said, you could perform your database separation in your abstraction layer.
E.g. have your Student class extend your Person class. In your Person class constructor, have it connect to Server A to populate whichever fields are available. In your student class constructor, have it connect to Server B (the Person attributes will already be populated by the Person constructor).
I'm with Aaron here (sup Aaron). Move the tables into a single database. SQL Server can easily handle billions of rows per table (I did it on SQL 2000 six or seven years ago, so modern versions and modern hardware are no problem), as long as your tables are indexed correctly. There probably haven't been enough students in all of time at every school in the world to overload SQL Server, much less at a single school.
In this case your best practice would be to put the tables in the same database, on the same server and index them for better performance.
Too many records cause 'database collapse'? What kind of pot is that lead developer smoking? Potent stuff!
I would recommend you guys study partitioned tables first. Making an application distributed (which is what the two-server approach really implies) is much, much harder than you think, and it does not provide scalability.
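As a very rough illustration of what table partitioning looks like in SQL Server (boundary values and names are made up, and on older versions this is an Enterprise-edition feature):

-- Partition function and scheme splitting rows into ranges by PersonID.
CREATE PARTITION FUNCTION pf_PersonId (INT)
    AS RANGE RIGHT FOR VALUES (1000000, 2000000, 3000000);

CREATE PARTITION SCHEME ps_PersonId
    AS PARTITION pf_PersonId ALL TO ([PRIMARY]);

-- One logical table, physically spread over partitions.
CREATE TABLE dbo.Person (
    PersonID  INT NOT NULL,
    FirstName NVARCHAR(50) NOT NULL,
    LastName  NVARCHAR(50) NOT NULL,
    CONSTRAINT PK_Person PRIMARY KEY (PersonID)
) ON ps_PersonId (PersonID);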
Yep, I'd have to agree with the others here; a single database on a single server is just fine. It is far easier and cheaper to scale up your hardware to support the workload than it will be to scale out to federated servers. I only know of one place that runs federated servers, and their workload is phenomenal.
Link the servers and create a view:
SELECT
FirstName
,LastName
....
FROM server.database.owner.Teachers
UNION
SELECT
FirstName
,LastName
....
FROM server.database.owner.Students
What kind of client are you using? If you're using a Java client, and are using ORM, you may want to look into Hibernate Shards.
Besides all the good answers here pointing out that the assumptions behind the question are highly questionable, if I needed to do this seriously (and if I took the assumptions as true), I would look at what Oracle has to offer, because it is in this type of scenario that it shows a benefit (I say this from experience).
But on the core question, assuming that the assumptions you outline are true, I would not try to have a combined table. If teachers and students can't be in the same database, it is unlikely that their identifying information can, and if the amount of data is overwhelming, then putting it all in one table is worse.
What I suspect is that if the underlying assumptions are true it is because there is an anticipation of a lot of contention on the tables and a lot of connections and activity on the tables, causing a lot of locks. In that case, adding a Person table will make things worse.
All that being said, if you still really wanted to do it, then you can reference one database from another in queries, via linked databases.
But if the real issue is the number of connections, contention, and deadlocks around the tables, such a solution would make things worse.
EDIT: In response to those who question what advantage Oracle would bring to such a situation: one would be in the federated database area, where it is much more mature. Another would be with tables that have a high amount of contention; it makes copies of the data in certain situations, and in general its model is more sophisticated when it comes to handling contention -- for example, scenarios where tables are read by longer-running queries, causing a lot of potential read locks. Oracle helps you keep transactional integrity without having to lock on read. In MS-SQL, you have to resort to dirty reads.
MS-SQL is a fine database, but it has its limits (raw volume of data, without any particular assumptions about the volume of reads and writes, is not really one of them, though, which makes the question strange). And given the stiff competition, the non-Enterprise version of Oracle is really close enough in price to be worth a look. Not looking could end up costing you a lot later.
Of course, if you have already purchased an MS-SQL license, the cost factor is larger for Oracle, so the benefits have to be more obvious.