I am working for a K-12 school and I am developing a gradebook system. The existing gradebook system we use, which I also developed, is based on Excel with VBA. Consolidating 400+ Excel workbooks at the end of every term is always a nightmare for me, so during the summer break I'm planning to put all the data in a database for easier management.
My problem is this:
For a computation-intensive application like a gradebook, is it good to store the computations in a table field or is it better to ONLY store the raw data and do the computations only on the frontend?
The way the Excel gradebook system works is like this...
Teacher records each score for each assessment of each student in the form of score / highest possible score. (e.g. Quiz 1 = 5/10, Homework 1 = 20/25, etc.)
The scores will be calculated as percentages and summarized per component. "Component" means Quiz, Homework, etc. So there will be something like "Quizzes Average = 90%, Homework Average = 80%, etc."
Different subjects will have different final grade breakdowns, like "Science = 50% Quizzes + 50% Homework, Math = 60% Quizzes + 40% Homework".
Then, the general average grade of each student is computed by getting the average of all subjects.
Everything above is very easy to make in a spreadsheet but I don't know how to implement it in a database.
At the moment, I'm thinking of something like having a table where all assessments are recorded like this:
tbl_scores
id
student_id
term_id
subject_id
component_id
assessment_id
raw_score
highest_possible_score # not sure about this, because it can be derived from assessment_id
Would it be useful to store the computed values (percentage for each score entry, component average, subject average, general average, etc.) in the database and use stored procedures and triggers to update them whenever a new score comes in?
Or is it better to store only the raw scores and calculate everything ONLY on the frontend? Would that option be faster than the first one, given that the SELECT queries will be more complex because of subqueries?
By the way, I'm planning to use PostgreSQL for this.
Thanks in advance for any advice.
DashMug
400 students, that is peanuts for an RDBMS that is designed to handle millions of records. You need not worry about performance.
I have seen nothing in your description that PostgreSQL could not do in simple queries. Store only the necessary information and calculate the rest. Also, do not add a redundant column highest_possible_score to your table tbl_scores if that information is in the linked table assessment already. That would do more harm than good.
If (and only if) you should later find that a query is too slow, you can always use materialized views (write the output of a query to a table, recalculate that table after changes to the base data.)
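To make that concrete, here is a rough sketch in PostgreSQL. The names beyond tbl_scores are assumptions of mine: a hypothetical tbl_assessments table holding the highest possible score and a tbl_component_weights table holding each subject's grade breakdown.

-- Per-student, per-component percentage averages, computed on demand.
CREATE VIEW v_component_averages AS
SELECT s.student_id,
       s.term_id,
       s.subject_id,
       s.component_id,
       AVG(s.raw_score::numeric / a.highest_possible_score) * 100 AS component_pct
FROM tbl_scores s
JOIN tbl_assessments a ON a.id = s.assessment_id   -- highest_possible_score lives here
GROUP BY s.student_id, s.term_id, s.subject_id, s.component_id;

-- Weighted subject grade, e.g. Math = 0.60 * Quizzes + 0.40 * Homework.
CREATE VIEW v_subject_grades AS
SELECT c.student_id,
       c.term_id,
       c.subject_id,
       SUM(c.component_pct * w.weight) AS subject_grade
FROM v_component_averages c
JOIN tbl_component_weights w
  ON w.subject_id = c.subject_id AND w.component_id = c.component_id
GROUP BY c.student_id, c.term_id, c.subject_id;

-- Only if the plain views ever prove too slow:
-- CREATE MATERIALIZED VIEW mv_subject_grades AS SELECT * FROM v_subject_grades;

The general average is then just AVG(subject_grade) per student and term.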
Whether to denormalize depends on the total number of records. For a school the volume will not be big enough to justify it; the better decision is to consolidate the data with queries (yes, they will be a bit more complex and some joins will appear, but that is normal).
I am trying to figure out how I can relate records from one table to each other.
I have a table with individual cases (e.g. disciplinary, grievance, etc.).
However, multiple of these cases could relate to each other.
E.g. if a group of people get into a fight, each person involved would get an individual case, and then all of those cases would need to be related/linked.
I'm struggling to figure out how best to store this data, whether in the same table or in a new table.
This is the backend to a WinForms application. I need to be able to link records; in the example data, the user would select that case IDs 1-4 are linked.
So the question is: how do I store the fact that these cases are related, given that other cases might be linked at a later date?
It sounds like you need some sort of "master case" or "incident".
I think I would say that an "incident" comprises one or more "cases" and keep both an incidents and cases tables. In fact cases might be a bad name. Instead, it might be more like IncidentPerson.
An alternative approach would be to have a "master case" id on each case. This could be NULL or the same case. I'm not as fond of this approach because it will likely lead to confusion down the road. One analyst will count cases per month using "cases" and another using "master cases" and you'll spend a lot of time trying to figure out why the numbers are different.
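A minimal sketch of the incident/case idea, with invented table and column names (SQL Server style, since this backs a WinForms app):

-- The "master" event, e.g. the fight itself.
CREATE TABLE Incident (
    IncidentID   int IDENTITY(1,1) PRIMARY KEY,
    IncidentDate date NOT NULL,
    Description  varchar(500) NULL
);

-- One row per person/case involved in an incident.
CREATE TABLE IncidentPerson (
    CaseID     int IDENTITY(1,1) PRIMARY KEY,
    IncidentID int NOT NULL REFERENCES Incident (IncidentID),
    PersonID   int NOT NULL,
    CaseType   varchar(50) NOT NULL   -- e.g. 'Disciplinary', 'Grievance'
);

-- All cases linked to the same event:
-- SELECT * FROM IncidentPerson WHERE IncidentID = 1;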
I need to render a school schedule in a very detailed way (see screenshot). Every Day, Hour of the day, Room, Group and Teacher are separate entities that are related one to another in a certain way.
My issue is that I have to pull data into every single cell via a separate query, which means over 700 queries to get a week's schedule.
The question is: what is the best approach to store, manipulate and pull data for such demands?
I was thinking about making a separate 'static' table to store the actual values rather than related IDs, but then I lose flexibility.
Here's my best try.
id
date
room_id
group_id
teacher_id
If you're sure that the data is very static, it might be more manageable to put teacher_9, group_9 etc. as columns in your Schedule Table. Tradeoffs are yours.
For Teacher data, you most likely want it in a different table for future changes. Imagine a Name change. Same with Group and Room.
If you're concerned about query performance, note that the database will cache data pages and execution plans, so repeated reads like this are usually cheap.
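On the 700-query problem itself: a single joined query can bring back the whole week, and the front end then places each row into its cell. A rough sketch against the schema above (the hour column and the lookup-table names are assumptions):

-- One round trip for the whole week instead of one query per cell.
SELECT s.date,
       s.hour,                 -- assumed slot/hour column
       r.room_name,
       g.group_name,
       t.teacher_name
FROM schedule s
JOIN room r    ON r.id = s.room_id
JOIN "group" g ON g.id = s.group_id   -- quoted because GROUP is a reserved word
JOIN teacher t ON t.id = s.teacher_id
WHERE s.date BETWEEN '2015-09-07' AND '2015-09-11'   -- the week being rendered
ORDER BY s.date, s.hour, r.room_name;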
I am building software for academic institutions, so I just wanted answers to a few questions:
New data will be generated each year (for new admissions) and some existing data will be updated. Should I store all the data in one single table separated by academic year (as a column like ac_year), or should I make separate tables for each year? There are also different tables to store information about the students, like classes, marks, fees, hostel, etc. So each kind of information, like Fee, would be stored in different tables like
Fee-2010
Fee-2011
Fee-2012...
Or in one single Fee table with the year as a column.
One more point: after 1-2 years the database will get heavier, so would backing up the data for a single year still be possible with a single table (like Fee with the year as a column)?
Please answer keeping SQL Server 2005 in mind.
Thanks
As you phrase the question, the answer is clearly to store the data in one table, with the year (or date or other information) as a column. This is simply the right thing to do. You are dealing with the same entity over time.
The one exception would be when the fields are changing significantly from one year to the next. I doubt that is the case for your table.
If your database is really getting big, then you can look into partitioning the table by time. Each partition would be one year's worth of data. This would speed up queries that only need to access one year's worth. It also helps with backing up and restoring data. This may be the solution you are ultimately looking for.
"Really getting big" means at least millions of rows in the table. Even with a couple million rows in the table, most queries will probably run fine on decent hardware with appropriate indexes.
It's not typical to store the data in multiple tables based on time. I would prefer to store it all in one table. In the future you may look into archiving old data, but it will still be a long time before performance becomes an issue.
It is always a better option to add a new property (column) to an entity than to create a new entity for every different property. That way, maintenance and querying will be much easier for you.
On the query-performance side, you don't have to worry about the internal workings of the data and the database. If a real performance issue ever appears, there are many solutions, such as creating an index on the year column in your situation.
I have a bunch of products with a bunch of different possible attributes for each product. E.g. Product A has a name, size, color, shape. Product B has a name, calories, sugar, etc. One way to solve this is like:
1) Create tables
Products (id, name)
Attributes (id, name)
Product_Attributes (product_id, attribute_id, value as string)
This allows for maximum flexibility, but I have heard a lot of people recommend against this although I am not sure why. I mean, if those tables were called Teams, Players, Team_Players we would all agree that this is proper relational design.
Everyone who explains to me why this is bad does so in the context of a completely flexible relational design where you don't ever create real tables past a few basic initial tables (e.g. object, attribute, object_attribute), which I think we can all agree is bad. But this is a much more limited and contained version of that (only Products, not every object in the system), so I don't think it is fair to group these two architectures together.
What issues have you encountered (experience or theoretical) that makes this design so bad?
2) Another way to solve this is to create a Product table with a bunch of columns like Size, Color, Shape, Weight, Sugar, etc and then include some extra columns at the end to give us some flexibility. This will create generally sparse rows filled mostly with NULLs. People tend to like this approach, but my question is how many columns can you have before this approach loses its performance benefits? If you have 200 columns, I imagine this is no longer a smart move, but what about 100 columns? 50 columns? 25 columns?
3) The final approach I know about is to store all of the attributes as a blob (JSON perhaps) in a single column of the Products table. I like this approach but it doesn't feel right. Queries are hard. And if you want to be able to easily change the name of an attribute later, you either have to parse every record individually or have them keyed in your blob by some id. If you go the id path then you will need another table Attributes and things start to look like approach #1 from above except you won't be able to join the attribute_id with your blob, so I hope you didn't want to query anything by attribute name.
What I like about this approach though is you can query one product and in your code you can easily access all the properties it has -- fast. And if you delete a product, you won't have to cleanup other tables -- easy to stay consistent.
4) I have read some things about being able to index strongly typed xml formats in some RDBMSs, but I honestly don't know much about this approach.
I am stuck. I feel like approach #1 is the best bet, but everything I read says that way stinks. What is the right way to think about this problem to be able to decide what is the best method for a given situation? More ideas than what I have listed are obviously welcomed!
You can probably find a great deal about this topic by doing a Google search on "entity attribute value antipattern".
One of the issues with this approach is that you end up mixing meta-data with actual data. Your "attribute" has to now tell the database what exactly is held in the "value" column. This can make it very difficult to handle this data in front-ends, reporting software, etc.
Second, you're going to have a very hard time actually enforcing any data integrity in the database. When your product has an attribute of "weight" what's to stop someone from putting "22 inches" in the value? Or a non-numeric value completely. You might say, "Well, my application will handle that." Then you need to change your application every time that you want to add a new attribute because the application needs to know how to handle it. If you're going to go through all of that work, just add a new column.
Third, how do you enforce that a given product has all of the attributes that it needs? In a normal table you can make a column NOT NULL, and it is then required in order to get that row into the database. You can't enforce that in the EAV model.
Fourth, this kind of a model usually leads to a lot of confusion. People aren't sure what "attributes" are supported, or they duplicate an attribute, or they forget to handle an attribute when creating a report. As an example, if I have an attribute for "Weight(kg)" and another attribute for "Weight(lbs)" and someone asks me, "What's the heaviest product in your database?" I'd better remember that I need to check both attributes.
Fifth, this model usually also leads to laziness. Hey, there's no reason to actually do any analysis of the products that our system can handle, because whatever comes along we'll just add some attributes. In my experience, companies are much better off doing the analysis required to create a good database design rather than fall back on an antipattern like this. You'll learn things about the database, the application, and likely the business as well.
Sixth, it might take a LOT of joins to get a single row of data for a given product. You can return the attributes as separate rows, but now you have to come up with customized list boxes to list those products, etc. Similarly, writing search queries against this model can be very difficult and in both of these situations you're likely to have performance issues.
These are just a few of the problems which I've encountered over the years. I'm sure that there are others.
What the correct solution is for your system depends a lot on the specifics of your business and application. Rather than a sparse row, you might consider using subtype tables if your products fall into a few categories that share common attributes.
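To sketch the subtype idea with made-up columns: the attributes every product shares stay on Products, and each category gets its own child table keyed by the product id.

-- Attributes shared by every product.
CREATE TABLE Products (
    product_id int PRIMARY KEY,
    name       varchar(100) NOT NULL,
    price      decimal(10,2) NOT NULL
);

-- Category-specific attributes, one subtype table per category.
CREATE TABLE FoodProducts (
    product_id int PRIMARY KEY REFERENCES Products (product_id),
    calories   int NOT NULL,
    sugar_g    decimal(6,2) NOT NULL
);

CREATE TABLE ApparelProducts (
    product_id int PRIMARY KEY REFERENCES Products (product_id),
    size       varchar(10) NOT NULL,
    color      varchar(30) NOT NULL
);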
There are many problems with flexible data models but the first one that is likely to bite you is the fact that queries get unwieldy very quickly. For example, if you wanted to get the Size attribute for every product, the query is relatively easy.
SELECT p.name product_name,
       pa.value product_size
FROM product p
left outer join product_attribute pa on (p.product_id = pa.product_id)
left outer join attribute a on (pa.attribute_id = a.attribute_id and
                                a.name = 'size')
If you want to get the size and some other attribute like color, things get trickier
SELECT p.name product_name,
       pa_size.value product_size,
       pa_color.value product_color
FROM product p
left outer join product_attribute pa_size on (p.product_id = pa_size.product_id)
left outer join product_attribute pa_color on (p.product_id = pa_color.product_id)
left outer join attribute a_size on (pa_size.attribute_id = a_size.attribute_id and
                                     a_size.name = 'size')
left outer join attribute a_color on (pa_color.attribute_id = a_color.attribute_id and
                                      a_color.name = 'color')
Very quickly, when you start wanting to grab 10 attributes or write complex searches (show me products where the color is blue and the size is medium), the queries start to get very complicated both for developers to write and maintain and for the database optimizer to generate the query plan for. If you're joining 30 tables together, the optimizer would have to prune the tree of plans it considers very, very quickly to be able to generate a query plan in a reasonable time frame. That tends to lead the optimizer to discard promising paths too early and to generate less than optimal paths for many of your queries.
This, in turn, means that you very quickly get to a point where new development is bottlenecked because developers can't get their queries right or developers can't get their queries to return quickly enough. Whatever time you saved up front by not gathering the requirements to determine what the valid attributes are quickly gets used up with the 47th iteration of "Why can't I get the data I want out of this putrid data model?"
Beyond this cost to developers, you end up creating a lot of costs for the organization as a whole.
No query tool is going to handle this sort of data model well. So all the users that can currently fire up their favorite query tool and run some reports out of your database are now stuck waiting for developers to write their reports and do their extracts for them.
Data quality becomes very hard to enforce. It becomes very hard to check conditions that involve multiple attributes (i.e. if a product's size Medium then the weight must be between 1 and 10 pounds, if a product's height is specified then a width is required as well) so people don't make those checks. They don't write the reports to identify where these sorts of rules are violated. So the data ends up being a bit bucket of data that downstream processes decide they can't use because it isn't sufficiently complete.
You're moving too much of the initial requirements discussion off into the future when understanding the core entities will likely lead to a much better design overall. If you can't agree on a set of attributes that the first version of the product needs to support, you don't really understand what that version is supposed to do. Even if you successfully code a very generic application, that means that it is going to require a lot of time to configure once you've built it (because someone will have to figure out what attributes it supports at that point). And then you'll discover when the application is being configured that you missed a ton of requirements that only became clear when the attributes were defined-- you can't know that width is required if height is specified if you don't know whether they're going to store height or width in the first place.
In the worst case, the response to this problem during configuration is to immediately determine that you need to provide a flexible way to specify business rules and to specify workflows so that the people configuring the application can quickly code their business rules when they add new attributes and so that they can control the flow of the application by grouping attributes together or skipping certain pages (i.e. have a page where make & model are required if the product type is car, skip that page if not). But in order to do that, you're going to end up building an entire development environment. And you're going to push the job of actually coding the application to the folks that are configuring the product. Unless you happen to be really good at building development environments, and unless the people configuring the product are really developers, this doesn't end well.
"I mean, if those tables were called Teams, Players, Team_Players we would all agree that this is proper relational design."
No, we wouldn't. Here's why.
You started with this.
Products (id, name)
Attributes (id, name)
Product_Attributes (product_id, attribute_id, value as string)
Let's drop the id numbers, so we can see what's really going on. (Longer column names for clarity.)
Products (product_name)
Attributes (attribute_name)
Product_Attributes (product_name, attribute_name, value as string)
And translating that to teams and players . . .
Teams (team_name)
Players (player_name)
Team_Players (team_name, player_name, value as string)
So for sample data we might have
Team                 Player            Value
--                   --                --
St. Louis Cardinals  Boggs, Mitchell   ?
St. Louis Cardinals  Carpenter, Chris  ?
St. Louis Cardinals  Franklin, Ryan    ?
St. Louis Cardinals  Garcia, Jaime     ?
What on earth belongs in place of the question marks? Let's say we want to record number of games played. Now the sample data looks like this.
Team                 Player            Value
--                   --                --
St. Louis Cardinals  Boggs, Mitchell   23
St. Louis Cardinals  Carpenter, Chris  15
St. Louis Cardinals  Franklin, Ryan    19
St. Louis Cardinals  Garcia, Jaime     14
Want to store batting average, too? You can't. Not only can you not store batting average along with games played, you can't tell by looking at the database whether Mitch Boggs played in 23 games, had 23 hits, scored 23 runs, had 23 "at bats", had 23 singles, or struck out 23 times.
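For contrast, a sketch of what a conventional design for the same bridge table might look like, where every statistic is a named, typed column rather than an anonymous value:

-- 23 can now only ever mean games played; batting average gets its own column.
CREATE TABLE Team_Players (
    team_name       varchar(50)  NOT NULL,
    player_name     varchar(50)  NOT NULL,
    games_played    int          NOT NULL,
    batting_average decimal(4,3) NULL,    -- e.g. 0.285
    PRIMARY KEY (team_name, player_name)
);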
The reason this approach is so bad is that you don't know how many times you have to join to the table to get all the attributes. Plus, joining to the same table 20 times tends to create a performance block of massive proportions. I am assuming that Products will be at the heart of your system and thus a critical place for performance.
Now you say that the product attributes will be drastically different. I disagree. There will be many attributes that are common to a large number of your products: things like price, units, size, color, dimensions, and weight. Those should be in the product table as common properties. These are also the ones the user is most likely to be searching on when picking a product.
Other properties are useful as a description but not for much else (they won't be searched on or put into the order details). Put those in a description or notes field.
Finally, you are left with the few attributes which might genuinely differ. But how different are they? If they are common to a particular type of product (books have these attributes, cameras have those), then a related table for that type of product might work well.
Once you have done your job and figured all this out, then add the flexibility of an EAV table if you still need one. The steps above should cover 98+% of the real requirements.
(Also it's kind of hard to design the order details table if you don't know the attribute fields you need to record for the order - you can't rely on the products table for that)
(Oh, and I agree wholeheartedly with what @Tom H is saying as well.)
I'm attempting to understand denormalization in databases, but almost all the articles Google has spat out are aimed at advanced DB administrators. I have a fair amount of knowledge about MySQL and MSSQL, but I can't really grasp this.
The only example I can think of where speed was an issue was when doing calculations on about 2,500,000 rows across two tables at a place where I used to intern. As you can guess, calculating that much on demand took forever and froze the dev server I was on for a few minutes. So right before I left, my supervisor wanted me to write a calculation table that would hold all the precalculated values and would be updated about every hour or so (this was an internal site that wasn't used often). However, I never got to finish it because I left.
Would this be an example of denormalization? If so, is this a good example of it or does it go much farther? If not, then what is it in simple terms?
Say you had an Excel file with 2 worksheets you want to use to store family contact details. On the first worksheet, you have names of your contacts with their cell phone numbers. On the second worksheet, you have mailing addresses for each family with their landline phone numbers.
Now you want to print Christmas card labels to all of your family contacts listing all of the names but only one label per mailing address.
You need a way to link the two normalized sets. All the data in the 2 sets you have is normalized. It's 'atomic,' representing one 'atom,' or piece of information that can't be broken down. None of it is repeated.
In a denormalized view of the 2 sets, you'd have one list of all contacts with the mailing addresses repeated multiple times (cousin Alan lives with Uncle Bob at the same address, so it's listed on both Alan and Bob's rows.)
At this point, you want to introduce a Household ID in both sets to link them. Each mailing address has one householdID, each contact has a householdID value that can be repeated (cousin Alan and Uncle Bob, living in the same household, have the same householdID.)
Now say we're at work and we need to track zillions of contacts and households. Keeping the data normalized is great for maintenance purposes, because we only want to store contact and household details in one place. When we update an address, we're updating it for all the related contacts. Unfortunately, for performance reasons, when we ask the server to join the two related sets, it takes forever.
Therefore, some developer comes along and creates one denormalized table with all the zillions of rows, one for each contact with the household details repeated. Performance improves, and space considerations are tossed right out the window, as we now need space for 3 zillion rows instead of just 2.
Make sense?
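In table form, the same example might look roughly like this (names invented):

-- Normalized: household details stored once, referenced by each contact.
CREATE TABLE Household (
    household_id    int PRIMARY KEY,
    mailing_address varchar(200) NOT NULL,
    landline_phone  varchar(20)  NULL
);

CREATE TABLE Contact (
    contact_id   int PRIMARY KEY,
    name         varchar(100) NOT NULL,
    cell_phone   varchar(20)  NULL,
    household_id int NOT NULL REFERENCES Household (household_id)
);

-- Denormalized: one wide row per contact, with the address repeated
-- on both Alan's and Bob's rows.
CREATE TABLE ContactDenormalized (
    contact_id      int PRIMARY KEY,
    name            varchar(100) NOT NULL,
    cell_phone      varchar(20)  NULL,
    mailing_address varchar(200) NOT NULL,
    landline_phone  varchar(20)  NULL
);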
I would call that aggregation, not denormalization (if it is, for example, the quantity of orders: SUM(Orders) per day...). This is what OLAP is used for. Denormalization would be, for example, storing the PhoneType text directly in the Contact table instead of having a separate PhoneType table referenced by a PhoneTypeID, thus eliminating one join.
You could also, of course, use indexed/materialized views to hold the aggregation values... but then you will slow down your updates, deletes, and inserts.
Triggers are another way to accomplish this.
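To make the join-elimination example concrete (column names assumed):

-- Normalized: the lookup table has to be joined just to show the phone type.
SELECT c.name, pt.phone_type
FROM Contact c
JOIN PhoneType pt ON pt.phone_type_id = c.phone_type_id;

-- Denormalized: the phone type text is carried on the Contact row itself,
-- so the same report needs no join (at the cost of repeating the text).
SELECT c.name, c.phone_type
FROM Contact c;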
In an overly simplified form I would describe de-normalisation as reducing the number of tables used to represent the same data.
Customers and addresses are often kept in different tables to allow the concept of one customer having multiple addresses. (Work, Home, Current Address, Previous Address, etc)
The same could be said to apply to surnames and other properties, but only the current surname is ever of concern. As such, one might normalise all the way to having a Customer table and a Surname table, with foreign key relationships, etc. But then denormalise this by merging the two tables together.
The benefit of "normalise until it hurts" is that it forces one to consider a pure and (hopefully) complete representation of the data and possible behaviours and relationships.
The benefit of "de-normalise until it works" is to reduce certain maintenance and/or processing overheads, but sticking to the same basic model as derived by working out a normalised model.
In the "Surname" example, by denormalising one is able to add an index to the customers based on their Surname and Date of Birth. Without de-normalising the Surname and DoB are in different tables and the composite index is not possible.
Denormalizing can be beneficial; the example you provided is an instance of this. It is not ideal to calculate these values dynamically, as the cost is expensive, so instead you create a table with an id referencing the other table along with the calculated value.
The data is redundant, since it can be derived from another table, but due to production requirements this is the better design in a functional sense.
I'm curious to see what others have to say on this topic, because I know my SQL professor would cringe at the term "denormalize", but it has practical value.
Normal form would reject this table, as it is fully derivable from existing data. However, for performance reasons data of this type is commonly found. For example inventory counts are typically carried, but are derivable from the transactions that created them.
For smaller, faster data sets a view can be used to derive the aggregate. This provides users with the data they need (the aggregated value) rather than forcing them to aggregate it themselves. Oracle (and others?) have introduced materialized views to do what your supervisor was suggesting. These can be refreshed on various schedules.
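In Oracle, for instance, such a materialized view might look roughly like this (table names invented), refreshed on a schedule rather than recomputed per query:

-- Precomputed inventory counts, rebuilt about once an hour.
CREATE MATERIALIZED VIEW mv_inventory_counts
    BUILD IMMEDIATE
    REFRESH COMPLETE
    START WITH SYSDATE NEXT SYSDATE + 1/24
AS
SELECT item_id,
       SUM(quantity_delta) AS on_hand_qty
FROM inventory_transactions
GROUP BY item_id;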
If update volumes permit, triggers could be used to emulate a materialized view using a table. This may reduce the cost of maintaining the aggregated value. If not it would spread the overhead over a greater period of time. It does however, add the risk of creating a deadlock condition.
OLAP takes this simple case to an extreme, with its focus on aggregates. Analysts are interested in aggregated values, not the details. However, if an aggregated value looks interesting, they may drill down into the details. Starting from normal form is still good practice.