Storing data values in multiple tables vs accessing the same value via query - sql

I have designed my database tables so that multiple tables store a value which could instead be obtained by querying a single table.
My question is would it be considered better practice to never store duplicate data and always query, or to store small values multiple times to reduce the number of queries required?
For context, I am building a Python app that quizzes Korean language questions using SQLAlchemy and SQLite.
I have User, Quiz, and Question classes.
The values in question are num_correct and num_wrong, with regard to quiz questions.
Basically I have a question table that stores all questions related to a quiz by quiz_id. Each question has a column "correct" that stores a boolean telling whether or not that question was answered correctly.
In my "quiz" table, I have columns for num_correct / num_wrong regarding questions answered for that quiz.
In my "user" table, I also have columns for num_correct / num_wrong regarding their total answers correct and wrong for all time.
I realize that to get the values in "quiz" I could query the "questions" table, and to get the values in "user" I could do the same.
In this case (and in general) which would be the preferred strategy considering best practices?
I've tried googling quite a bit, but wording the question is a bit tricky.

The issue of duplicated data is a complicated one in relational databases. If your application is doing data modifications, then duplicated data incurs synchronization issues -- the data needs to be updated in multiple places.
That is bad for a variety of reasons:
Updating a single item of information requires multiple changes.
The multiple changes can get out-of-sync, meaning that queries will not see consistent data.
Changes to the database structure (such as adding new tables) can be rather cumbersome.
Databases do support this capability, via ACID properties, transactions, and triggers. However, they add overhead. In general, such duplication is added out of necessity (i.e. performance) rather than up-front. Hence, there is a strong preference for normalized data models where information is stored only once when updates frequently occur.
On the other hand, some databases are used primarily for querying purposes. These databases are often denormalized -- and quite so. For instance, a customer table might contain summaries along many different dimensions, gathering information from dozens of underlying tables.
This not only simplifies queries but it encodes business logic. One major issue with using data is that different people have slightly different definitions of things -- is a one-year customer someone who started 365 days ago? Someone who started on the same day of the year last year? Someone who has been around for 12 months? Standardized analysis tables provide the answer.
Your case seems to fall more into the first situation. You are doing updates and thinking about storing summaries up front. I would discourage you from doing this. Just write the queries you need to summarize the data. In all likelihood, indexes and partitioning will provide all the performance you need.
If you know up front that you will have millions of users taking hundreds of quizzes with dozens of questions, then you might want to think about performance optimizations up front. But for thousands of users taking a handful of quizzes with a few dozen questions, start with a simple data model and make it more complicated after you have demonstrated that it works.
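To make "just write the queries" concrete, here is a minimal sketch of deriving the per-quiz counts on demand with SQLAlchemy (1.4+), assuming a Question model along the lines described in the question, with an integer primary key, a quiz_id foreign key and a boolean correct column:
from sqlalchemy import Boolean, Column, ForeignKey, Integer, func
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class Quiz(Base):
    __tablename__ = "quiz"
    id = Column(Integer, primary_key=True)

class Question(Base):
    __tablename__ = "question"
    id = Column(Integer, primary_key=True)
    quiz_id = Column(Integer, ForeignKey("quiz.id"))
    correct = Column(Boolean)

def quiz_score(session, quiz_id):
    """Derive num_correct / num_wrong for one quiz instead of storing them."""
    counts = dict(
        session.query(Question.correct, func.count(Question.id))
        .filter(Question.quiz_id == quiz_id)
        .group_by(Question.correct)
        .all()
    )
    return counts.get(True, 0), counts.get(False, 0)
The all-time totals kept on the user table can be derived the same way, presumably by joining Question to Quiz and filtering on the quiz's user id, so neither table needs its own num_correct / num_wrong columns until profiling shows they are needed.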

My question is would it be considered better practice to never store duplicate data and always query, or to store small values multiple times to reduce the number of queries required?
I don't see how this reduces the number of queries.
It may affect the complexity of a query, i.e. you'll need to join a few tables together instead of a simple query on one table, but these operations are very fast. I would not worry about speed.
If you duplicate your data it will eventually get out of sync, and then you're in big trouble.
In short, don't duplicate.
Also, this question doesn't really have anything to do with Python.

Related

should this be two database tables or one?

I have a database table called interviews and the interviewer and the interviewee will both have to review how the interview went. The review will have similar fields (rating on a scale) but different questions.
Option 1 is to have them both in the same table, with a 1..N relationship back to the interview table (also storing the ID of the writer and of the one being reviewed), and only limiting which fields can be input at the application level.
Option 2 is to have two tables (one specifically for interviewer reviews and one specifically for interviewee reviews).
What is your opinion of the best way to model this?
Although this is dangerously close to being opinion-based, I have a comment that is too long for comments.
Handling surveys is rather complicated. Surveys change over time because questions are added, removed, and modified and answers are added, removed, and modified. And yet, people often want to use survey questions and track the results over time.
So, the data model for a survey is much more complicated than "one table" or "two tables". There are tables for surveys, questions, answers, and the relationships and values can change over time.
One big table is often a poor choice, though if you index properly and write fine-tuned queries it will perform fine. Having multiple tables can help you in multiple ways, like:
easier access to particular data,
simpler queries, etc.
Two review tables. Those are TWO bona fide separate entities.
Here's the deal:
Designing a single table that "works" for two different purposes can be done, but it's challenging, both at the database level and in your application.
But then... a few months later new requirements come in that make it more challenging. You'll need to implement weird logic to keep using one table. Code becomes convoluted, and testing becomes a nightmare.
And then, more changes come in. It becomes unmanageable. At some point you'll realise they were different things from the start that will EVOLVE differently.
Bottom line, it's better to keep them separate from the start to avoid huge cost in the future. Even if they have near-identical columns in the beginning.
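As a rough sketch of what keeping them separate might look like (all table, column and rating names here are illustrative, SQLite syntax used for brevity):
import sqlite3

ddl = """
CREATE TABLE interviews (
    id INTEGER PRIMARY KEY
);

-- review written by the interviewer about the candidate
CREATE TABLE interviewer_reviews (
    id             INTEGER PRIMARY KEY,
    interview_id   INTEGER NOT NULL REFERENCES interviews(id),
    overall_rating INTEGER,   -- interviewer-specific questions live here
    notes          TEXT
);

-- review written by the candidate about the interview
CREATE TABLE interviewee_reviews (
    id                INTEGER PRIMARY KEY,
    interview_id      INTEGER NOT NULL REFERENCES interviews(id),
    experience_rating INTEGER,   -- candidate-specific questions live here
    notes             TEXT
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(ddl)  # each review type can now grow its own columns independently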

Retrieve and interpret large amounts of data from a SQL Server database

I'm an Electrical Test Engineer. I have programming experience with C, mostly for devices with 256 B of RAM or less, and not a lot of experience with SQL databases...
We've got a database with production data, serial numbers and testing results.
When the database was created, no tools were made to retrieve the data.
If we can't retrieve the data, the database may as well not exist.
We have the data, the database exists. I want to create tools to retrieve and interpret data. And in the future do statistical analysis on the data.
The database has over 500k unique devices, with over 10 million measurements.
My question is: what's the most sensical way to retrieve and display the data?
For instance: a program that loops through every entry and records the data will be complicated to write and will take days to complete.
The program and queries get complicated very fast.
We have Device types, Batch numbers, Serial numbers.
For every DISTINCT (DeviceType)
    For every DISTINCT (BatchNumber)
        COUNT(DISTINCT SerialNumber) WHERE ...
            User <> 'development' ...
            AND TestingResult <> 'FAIL' ...
            AND Date BETWEEN ... AND ...
Not to mention the measurement data, as each device may be tested multiple times. It seemed a trivial task; I'm now overwhelmed by the complexity.
I will create the code and queries myself. What I ask for is help finding a strategy.
Ask yourself what questions you want to answer by recourse to the data. If the data is recorded in the most granular way possible, then it may be appropriate to consider common grouping or aggregation methods. These might include grouping by device, or location, or something else - each of these dimensions will have a business interpretation.
Writing down 3 top business requests should give you a starting point for constructing your extract/analysis strategy.
Next, try to draw together a data model: find out what tables exist and what references and relations they have to one another.
Together, between the questions you want to answer and the table-design, you should then be in a position to start constructing sensible general use queries.
Sometimes you may find different business questions can be answered using a common view of the data - when you're happy with an extract path, you can write this out using a common query language called SQL and - if appropriate, create a VIEW using that language. This abstracts the problem and makes it more convenient for users to actually get the answers they're looking for.
Your database will provide tools to write and run SQL statements, and you will need to refer to the documentation for your database to figure out how that happens - it's usually similar, but implementations differ across databases.
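For what it's worth, the nested per-device, per-batch counting sketched in the question usually collapses into a single grouped query that the server loops over for you. A hedged sketch from Python with pyodbc, where the table and column names are guesses and the connection string and date range are placeholders:
import pyodbc

CONN_STR = "DRIVER={ODBC Driver 17 for SQL Server};SERVER=...;DATABASE=..."  # placeholder

SQL = """
SELECT DeviceType,
       BatchNumber,
       COUNT(DISTINCT SerialNumber) AS passed_devices
FROM   TestResults                       -- assumed table name
WHERE  [User] <> 'development'
  AND  TestingResult <> 'FAIL'
  AND  TestDate BETWEEN ? AND ?
GROUP BY DeviceType, BatchNumber
ORDER BY DeviceType, BatchNumber;
"""

conn = pyodbc.connect(CONN_STR)
for device_type, batch, passed in conn.execute(SQL, ("2024-01-01", "2024-12-31")):  # placeholder dates
    print(device_type, batch, passed)
Once a query like this answers one of your top business questions, wrapping it in a VIEW (as suggested above) is what makes it reusable for everyone else.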

Performance gains vs Normalizing your tables? [closed]

OK, I know you're probably all going to kill me for asking this, but I got into a friendly programmer argument with a co-worker about one of our database tables, and he asked a question which I know the answer to but couldn't explain why it is the better way.
I will simplify the situation for the sake of the question. We have a fairly large table of people/users. Now, amongst other data being stored, the data in question is as follows: we have a simNumber, cellNumber and the ipAddress of that sim.
Now I am saying that we should make a table, let's call it SimTable, put those 3 entries in the sim table, and then put an FK in the UsersTable linking the two. Why? Because that's what I have always been taught: NORMALISE your tables!!! OK, so all is good in that regard.
But now my friend says to me: yes, but now when you want to query a user's phone number, SQL has to go and:
search for the user
search for the sim fk
search for the correct sim row in the sim database
get the phone number
Now when I go and request 10000 users' phone numbers, the number of operations performed grows seriously.
Vs the other approach
search for the user
find the phone number
Now the argument is purely performance based. As much as I understand why we normalize the data (to remove redundant data, for maintainability, to make changes to data in one table which propagate up, etc.), it does appear to me that the approach with the data in one table will be faster, or will at least require fewer tasks/operations to give me the data I want.
So what is the case in this situation? I do hope that I have not asked anything insanely silly; it is early in the morning, so do forgive me if I'm not thinking clearly.
The technology involved is MS SQL Server 2012.
[EDIT]
This article below also touches on some of the concepts I have mentioned above
http://databases.about.com/od/specificproducts/a/Should-I-Normalize-My-Database.htm
The goal of normalization is not performance. The goal is to model your data correctly with minimum redundancy so you avoid data anomalies.
Say for example two users share the same phone. If you store the phones in the user table, you'd have the sim number, IP address, and cell number stored on each user's row.
Then you change the IP address on one row but not the other. How can one sim number have two IP addresses? Is that even valid? Which one is correct? How would you fix such discrepancies? How would you even detect them?
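For reference, the normalized layout the questioner proposes keeps each of those facts in exactly one place, so the discrepancy above can't arise. A sketch with assumed names, in SQLite syntax for brevity even though the real system is SQL Server 2012:
import sqlite3

ddl = """
CREATE TABLE sims (
    id          INTEGER PRIMARY KEY,
    sim_number  TEXT UNIQUE,   -- each physical sim appears exactly once
    cell_number TEXT,
    ip_address  TEXT           -- change the IP here and every user sees the change
);

CREATE TABLE users (
    id     INTEGER PRIMARY KEY,
    name   TEXT,
    sim_id INTEGER REFERENCES sims(id)   -- FK instead of copying the sim data per user
);
"""

sqlite3.connect(":memory:").executescript(ddl)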
There are times when denormalization is worthwhile, if you really need to optimize data access for one query that you run very frequently. But denormalization comes at a cost, so be prepared to commit yourself to a lot more manual work to take responsibility for data integrity. More code, more testing, more cleanup tasks. Do those count when considering "performance" of the project overall?
Re comments:
I agree with @JoelBrown; as soon as you implement your first case of denormalization, you compromise on data integrity.
I'll expand on what Joel mentions as "well-considered." Denormalization benefits specific queries. So you need to know which queries you have in your app, and which ones you need to optimize for. Do this conservatively, because while denormalization can help a specific query, it harms performance for all other uses of the same data. So you need to know whether you need to query the data in different ways.
Example: suppose you are designing a database for StackOverflow, and you want to support tags for questions. Each question can have a number of tags, and each tag can apply to many questions. The normalized way to design this is to create a third table, pairing questions with tags. That's the physical data model for a many-to-many relationship:
Questions ----<- QuestionsTagged ->---- Tags
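A minimal sketch of that design in SQLAlchemy terms (names assumed, following the diagram above):
from sqlalchemy import Column, ForeignKey, Integer, String, Table
from sqlalchemy.orm import declarative_base, relationship

Base = declarative_base()

# the QuestionsTagged box above: one row per (question, tag) pair
questions_tagged = Table(
    "questions_tagged", Base.metadata,
    Column("question_id", ForeignKey("questions.id"), primary_key=True),
    Column("tag_id", ForeignKey("tags.id"), primary_key=True),
)

class Question(Base):
    __tablename__ = "questions"
    id = Column(Integer, primary_key=True)
    title = Column(String)
    tags = relationship("Tag", secondary=questions_tagged, back_populates="questions")

class Tag(Base):
    __tablename__ = "tags"
    id = Column(Integer, primary_key=True)
    name = Column(String, unique=True)
    questions = relationship("Question", secondary=questions_tagged, back_populates="tags")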
But you figure you don't want to do the join to get tags for a given question, so you put tags into a comma-separated string in the questions table. This makes it quicker to query a given question and its associated tags.
But what if you also want to query for one specific tag and find its related questions? If you use the normalized design, it's simply a query against the many-to-many table, but on the tag column.
But if you denormalize by storing tags as a comma-separated list in the Questions table, you'd have to search for tags as substrings within that comma-separated list. Searching for substrings can't be indexed with a standard B-tree style index, and therefore searching for related questions becomes a costly table-scan. It's also more complex and inefficient to insert and delete a tag, or to apply constraints like uniqueness or foreign keys.
That's what I mean by denormalization making an improvement for one type of query at the expense of other uses of the data. That's why it's a good idea to start out with everything in normal form, and then refactor to denormalized designs later on a case by case basis as your bottlenecks reveal themselves.
This goes back to old wisdom:
"Premature optimization is the root of all evil" -- Donald Knuth
In other words, don't denormalize until you can demonstrate during load testing that (a) it makes a real improvement to performance that justifies the loss of data integrity, and (b) it does not degrade performance of other cases unacceptably.
It sounds like you already understand the benefits of normalisation, so I won't cover these.
There are a couple of considerations here:
1. Does a user always have one and only one phone number?
If so, then it is still normalised to add these to the user table. However, if the user can have either no phone number or multiple phone numbers, then the phone details should be held in a separate table.
2. Assuming you have these in separate tables, but after conducting performance tests you found that joining on these 2 tables was having a significant effect on performance, then you may choose to deliberately denormalise the tables for performance gains.
Others have already provided some good points and you may also want to take a look at this.
I'd just like to mention one more aspect that is often overlooked: I/O tends to be the greatest component of the cost of most queries, and denormalization generally increases the storage size of data, therefore making the DBMS cache "smaller".
If your normalized database fits into cache and denormalized doesn't, you may actually observe a performance decrease for the latter.
And you won't be able to spot that in development, unless you actually have the amount of data that is similar to production. This is one of many reasons why you should never, ever denormalize without solid measurements (on representative amounts of data) to justify it.

Should I be concerned that ORMs, by default, return all columns?

In my limited experience working with ORMs (so far LLBL Gen Pro and Entity Framework 4), I've noticed that inherently, queries return data for all columns. I know NHibernate is another popular ORM, and I'm not sure whether this applies to it or not, but I would assume it does.
Of course, I know there are workarounds:
Create a SQL view and create models and mappings on the view
Use a stored procedure and create models and mappings on the result set returned
I know that adhering to certain practices can help mitigate this:
Ensuring your row counts are reasonably limited when selecting data
Ensuring your tables aren't excessively wide (large number of columns and/or large data types)
So here are my questions:
Are the above practices sufficient, or should I still consider finding ways to limit the number of columns returned?
Are there other ways to limit returned columns other than the ones I listed above?
How do you typically approach this in your projects?
Thanks in advance.
UPDATE: This sort of stems from the notion that SELECT * is thought of as a bad practice. See this discussion.
One of the reasons to use an ORM of nearly any kind is to delay a lot of those lower-level concerns and focus on the business logic. As long as you keep your joins reasonable and your table widths sane, ORMs are designed to make it easy to get data in and out, and that requires having the entire row available.
Personally, I consider issues like this premature optimization until encountering a specific case that bogs down because of table width.
First off: great question, and about time someone asked this! :-)
Yes, the fact an ORM typically returns all columns for a database table is something you need to take into consideration when designing your systems. But as you've mentioned - there are ways around this.
The main fact for me is to be aware that this is what happens - either a SELECT * FROM dbo.YourTable, or (better) a SELECT (list of all columns) FROM dbo.YourTable.
This is not a problem when you really want the whole object and all its properties, and as long as you load a few rows, that's fine, too - the convenience beats the raw performance.
You might need to think about changing your database structures a little bit - things like:
maybe put large columns like BLOBs into separate tables with a 1:1 link to your base table - that way, a select on the parent tables doesn't grab all those large blobs of data
maybe put groups of columns that are optional, that might only show up in certain situations, into separate tables and link them - again, just to keep the base tables lean'n'mean
Also: avoid trying to "arm-wrestle" your ORM into doing bulk operations - that's just not their strong point.
And: keep an eye on performance, and try to pick an ORM that allows you to change certain operations into e.g. stored procedures - Entity Framework 4 allows this. So if the deletes are killing you - maybe you just write a Delete stored proc for that table and handle that operation differently.
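As a sketch of the first structural suggestion above, the 1:1 side table for the heavy column (all names here are made up):
from sqlalchemy import Column, ForeignKey, Integer, LargeBinary, String
from sqlalchemy.orm import backref, declarative_base, relationship

Base = declarative_base()

class Document(Base):
    __tablename__ = "documents"
    id = Column(Integer, primary_key=True)
    title = Column(String)   # lean base table, so the ORM's "all columns" SELECT stays cheap

class DocumentBody(Base):
    __tablename__ = "document_bodies"
    document_id = Column(Integer, ForeignKey("documents.id"), primary_key=True)
    content = Column(LargeBinary)   # the BLOB lives here and is only joined in when needed
    document = relationship(Document, backref=backref("body", uselist=False))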
The question here covers your options fairly well. Basically you're limited to hand-crafting the HQL/SQL. It's something you want to do if you run into scalability problems, but if you do in my experience it can have a very large positive impact. In particular, it saves a lot of disk and network IO, so your scalability can take a big jump. Not something to do right away though: analyse then optimise.
Are there other ways to limit returned columns other than the ones I listed above?
NHibernate lets you add projections to your queries so you wouldn't need to use views or procs just to limit your columns.
For me this has only been an issue if the table has LOTS of columns (more than 30) or if a column held a lot of data, for example over 5000 characters in a field.
The approach I have used is to just map another object to the existing table but with only the fields I need. So for a search that populates a table with 100 rows I would have a MyObjectLite, but when I click to view the details of that row I would call GetById and return a MyObject that has all the columns.
Another approach is to use custom SQL or stored procs, but I only think you should go down this path if you REALLY need the performance gain and have users complaining. So unless there is a performance problem, do not waste your time trying to fix a problem that does not exist.
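The same lite-object idea exists in other ORMs as well; for instance, a hedged SQLAlchemy sketch (model and column names made up) that fetches only a couple of columns for the list view and the full row for the detail view:
from sqlalchemy import Column, Integer, String, Text
from sqlalchemy.orm import Session, declarative_base, load_only

Base = declarative_base()

class MyObject(Base):
    __tablename__ = "my_objects"
    id = Column(Integer, primary_key=True)
    name = Column(String)
    details = Column(Text)   # the wide column we don't want in the search grid

def search_lite(session: Session):
    # populate the grid from a trimmed-down projection
    return session.query(MyObject).options(load_only(MyObject.id, MyObject.name)).all()

def get_by_id(session: Session, object_id: int):
    # fetch the full row only when the user opens the detail view
    return session.get(MyObject, object_id)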
You can limit the number of returned columns by using a Projection, Transformers.AliasToBean, and a DTO. Here is how it looks in the Criteria API:
.SetProjection(Projections.ProjectionList()
    .Add(Projections.Property("Id"), "Id")
    .Add(Projections.Property("PackageName"), "Caption"))
.SetResultTransformer(Transformers.AliasToBean(typeof(PackageNameDTO)));
In LLBLGen Pro, you can return Typed Lists which not only allow you to define which fields are returned but also allow you to join data so you can pull a custom list of fields from multiple tables.
Overall, I agree that for most situations, this is premature optimization.
One of the big advantages of using LLBLGen and other ORMs as well (I just feel confident speaking about LLBLGen because I have used it since its inception) is that the performance of the data access has been optimized by folks who understand the issues better than your average bear.
Whenever they figure out a way to further speed up their code, you get those changes "for free" just by re-generating your data layer or by installing a new dll.
Unless you consider yourself an expert at writing data access code, ORMs probably improve most developers' efficacy and accuracy.

DB Design: Does having 2 Tables (One is optimized for Read, one for Write) improve performance?

I am thinking about a DB Design Problem.
For example, I am designing this stackoverflow website where I have a list of Questions.
Each Question contains certain meta data that will probably not change.
Each Question also contains certain data that will be consistently changing (Recently Viewed Date, Total Views...etc)
Would it be better to have a main table for reading the constant metadata and doing a join, keeping the changing values in a different table?
OR
Would it be better to keep everything all in one table?
I am not sure if this is the case, but when updating, does the ROW lock?
When designing a database structure, it's best to normalize first and change for performance after you've profiled and benchmarked your queries. Normalization aims to prevent data-duplication, increase integrity and define the correct relationships between your data.
Bear in mind that performing the join comes at a cost as well, so it's hard to say if your idea would help any. Proper indexing with a normalized structure would be much more helpful.
And regarding row-level locks, that depends on the storage engine - some use row-level locking and some use table-locks.
Your initial database design should be based on conceptual and relational considerations only, completely independent of physical considerations. Database software is designed and intended to support good relational design. You will hardly ever need to relax those considerations to deal with performance. Don't even think about the costs of joins, locking, and activity type at first; put off these considerations until all other avenues have been explored.
Your rdbms is your friend, not your adversary.
You should have the two tables separated out, as you might want to record the history of the question. The main Question table is indexed by question ID; the Status table is indexed by question ID and a date/time stamp and contains a row for each time the status changes.
Don't know that the updates are really significant unless you were using pessimistic locking where the row would be locked for a period of time.
I would look at caching your results either locally with Asp.net caching or using MemCached.
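Going back to the two-table split, a sketch with a history table keyed by question and timestamp (names assumed, SQLite syntax for brevity):
import sqlite3

ddl = """
CREATE TABLE questions (
    id    INTEGER PRIMARY KEY,
    title TEXT                    -- the metadata that rarely changes
);

-- one row per change, so the history of the volatile values is kept
CREATE TABLE question_status (
    question_id INTEGER NOT NULL REFERENCES questions(id),
    recorded_at TEXT NOT NULL,    -- date/time stamp of the change
    total_views INTEGER,
    PRIMARY KEY (question_id, recorded_at)
);
"""

sqlite3.connect(":memory:").executescript(ddl)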
This would certainly be a bad idea if you were using Oracle. In Oracle, you can quite happily read records while other sessions are modifying them, due to its multi-version concurrency control. You would incur an extra performance penalty for the join, for no savings.
A design pattern that is useful, however, is to pre-join tables, pre-calculate aggregates, or pre-apply WHERE clauses using materialized views.
As already said, better to start with a clean normalized design. It's just easier to denormalize later than to go the other way around. Experience teaches that you will never denormalize that one big table! You will just throw more columns in as needed, and you will need more and more indexes, and updates will go slower and slower.
You should also take a look at the expected loads: will there be more new answers or just more querying? What other operations will you have? When it comes to optimization, you can use the features of your DBMS: indexing, views, ...
Eran Galperin already provided most of my answer. In addition, the structure you propose really wouldn't help you in terms of locking. If there are relatively static and dynamic attributes in the same row, breaking the static and dynamic into two tables isn't of much benefit. It doesn't matter if static data is being locked, since no one is trying to change it anyway.
In fact, you may actually do worse with this design. Some database engines use page locking. If a table has fewer/smaller columns, more rows will fit on a page. The more rows there are on a page, the more likely there will be a lock contention. By having the static data mixed in with the dynamic, the rows are bigger, therefore there are fewer rows in a page, and therefore fewer waits on page locks.
If you have two independent sets of dynamic attributes, and they are normally modified by different actors, then you might get some benefit by breaking them into different tables. This is a pretty unusual case, however.
I'd also point out that breaking the table into a static and dynamic portion may not be of benefit in a relatively small environment, but in a large distributed environment it may be useful to cache and replicate the dynamic data at different rates than the static data.