I am developing an application that creates random "battles" or "versus" matchups of two things of the same type. Let's say it's about cars and their features, for example:
There would be many groups of features: things related to safety, to comfort, etc.
Car A would have one safety feature, airbags; for Car B it would be ABS and air conditioning; and for Car C, heated seats.
Now I have to store a list of versus: airbags vs. ABS, heated seats vs. air conditioning. Note that I can't do airbags vs. heated seats.
I've come up with two ideas to make this work.
users
id | username
cars
id | name
groups
id | name
features
id | car_id | group_id | value
versus
First version:
id | user_id | group_id | car_a_id | car_b_id | winner_id
Second version:
id | user_id | feature_a_id | feature_b_id | winner_id
Now with the first version, I have to use car_a_id, car_b_id, and group_id to fetch features, which ensures I am not comparing features that are not in the same group. The thing is, if any feature gets deleted I will have an invalid versus, and I won't know that until I actually fetch the features.
The second version solves that, since I can just add ON DELETE CASCADE to my foreign keys. But now I have to make sure both features of a row are in the same group when fetching them (I can't rely on the list of versus actually being valid).
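For example, a validating fetch would need a self-join on the features table, along these lines (a sketch against the schemas above):

SELECT v.*
FROM versus v
JOIN features fa ON fa.id = v.feature_a_id
JOIN features fb ON fb.id = v.feature_b_id
WHERE fa.group_id = fb.group_id;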
Now I don't like either of these solutions; I feel like I'm doing something completely wrong, but I can't come up with anything better.
Is there a better / simpler way to do that?
I commented above, but want to post a few solutions.
If a "versus" is a direct feature to feature comparison, you need to directly reference the features in the "versus." This is shown in your "Second version."
It sounds like the main concern is ensuring that both features in a "versus" are of the same group. You can accomplish this in a few different ways.
Eliminate the option for users to compare features in different groups via the UI or other code. For example, have drop-down boxes that only show the features in a single group when the user is selecting features to compare.
You could also try to use subqueries or functions in PostgreSQL table constraints. I've never done something like this (nor would I recommend it), but it may be suitable for your specific application requirements. http://grokbase.com/t/postgresql/pgsql-general/052h6ybahr/checking-of-constraints-via-subqueries
You could store the group_id of both features in the "versus" table. This definitely violates the rules of normalization, but if you have no control over the calling code and need to ensure the groups do not conflict, you can create a simple constraint such that feature1_group_id = feature2_group_id. It is not a robust method, and I wouldn't recommend it, but it is another option.
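To illustrate that last option, here is a minimal sketch in PostgreSQL (column names are illustrative, and the duplicated group columns must be kept in sync by the calling code, which is exactly why it isn't robust):

CREATE TABLE versus (
    id serial PRIMARY KEY,
    user_id int REFERENCES users (id),
    feature_a_id int NOT NULL REFERENCES features (id) ON DELETE CASCADE,
    feature_b_id int NOT NULL REFERENCES features (id) ON DELETE CASCADE,
    feature_a_group_id int NOT NULL,
    feature_b_group_id int NOT NULL,
    winner_id int,
    CHECK (feature_a_group_id = feature_b_group_id)
);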
In summary, I think you need to coordinate with the UI to ensure that users cannot violate group membership constraints when comparing features (solution 1).
I have three entities Plan, Feature, Sensor represented by tables using the same structure as follows:
[Entity]
--------
Id (int)
Name (varchar)
Each entity links to the other with 1-many relationships:
Each Plan can have multiple Features
Each Feature can have multiple Sensors (note that different features may have common sensors, but have different requirement levels - optional/mandatory)
This data will be used inside an application where the user will select a plan and ultimately will be shown a list of required sensors.
Design Issue #1:
First, I want to have a table that describes the relationship between Plan and Feature, and my thought is to have a table:
PlanFeatures
--------------
PlanId (int)
FeatureId (int)
Required (bit)
If I have a Required column then I will effectively need a record for each combination of PlanId and FeatureId. The alternative is to only add a record if the PlanId and FeatureId combination actually applies. Which is better?
Design Issue #2
Similarly to #1, I want to have a table that describes the relationship between Features and Sensors with the only difference being that a Sensor may be either not required/optional/mandatory. So the idea is to have a table as follows:
FeatureSensors
--------------
FeatureId (int)
SensorId (int)
RequirementLevel (int)
As in #1, I am questioning whether I need a record for each combination of FeatureId and SensorId, using 0 for the RequirementLevel when the sensor is not needed, or whether I should only have a record where the sensor is optional (1) or mandatory (2).
Am I going down the right path here or is there a much better way to structure this data?
First of all, this is definitely one way to do it, on principle. Connecting tables via junction tables (also called associative or bridge tables) is a known practice.
Even before I finished reading, I asked myself the same question you did regarding issue #1: Why would you need a record for each combination of plan and feature? Similarly, why would you need a record for each combination of feature and sensor? I'd simply store the combinations that are actually applied in practice in the database, and not store the combinations that aren't used at all.
The requirement level (optional/mandatory) can then be stored as a boolean flag; i.e. maybe rename it to "required_flag" or something like that. If the flag is set to true, the feature/sensor is required for the plan/feature, respectively. If the flag is false, the combination is optional. This way, you also have a uniform presentation of requirement levels in both tables.
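A sketch of the two junction tables under that scheme (a row's presence means the combination applies; names follow the question, and [Plan] is bracketed because PLAN is a reserved word in T-SQL):

CREATE TABLE PlanFeatures (
    PlanId INT NOT NULL REFERENCES [Plan] (Id),
    FeatureId INT NOT NULL REFERENCES Feature (Id),
    RequiredFlag BIT NOT NULL,   -- 1 = required, 0 = optional
    PRIMARY KEY (PlanId, FeatureId)
);

CREATE TABLE FeatureSensors (
    FeatureId INT NOT NULL REFERENCES Feature (Id),
    SensorId INT NOT NULL REFERENCES Sensor (Id),
    RequiredFlag BIT NOT NULL,   -- 1 = mandatory, 0 = optional
    PRIMARY KEY (FeatureId, SensorId)
);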
Was not sure how to express this in the title. So here's the deal: I have a table storing information about currency pairs used in foreign exchange rates:
PAIR_ID | BASE_CURRENCY | TERM_CURRENCY | ATTRIBUTE1 | ATTRIBUTE2 ...
Ideally I should have another table to store the currency symbols (master data), say CURRENCY_SYMBOLS and foreign keys from BASE_CURRENCY and TERM_CURRENCY to this table. However I am confused about 2 possible approaches here.
Approach 1:
CURRENCY_PAIRS:
PAIR_ID | BASE_CURRENCY_ID | TERM_CURRENCY_ID | ATTRIBUTE1 | ATTRIBUTE2 ...
CURRENCY_SYMBOLS:
SYMBOL_ID | SYMBOL
with BASE_CURRENCY_ID & TERM_CURRENCY_ID referencing SYMBOL_ID
Or Approach 2: rather than having a symbol_id that really adds no value, just have:
CURRENCY_PAIRS:
PAIR_ID | BASE_CURRENCY | TERM_CURRENCY | ATTRIBUTE1 | ATTRIBUTE2 ...
CURRENCY_SYMBOLS:
SYMBOL
with BASE_CURRENCY & TERM_CURRENCY referencing the SYMBOL directly.
I am not sure which one is better. Approach 1 seems ideal but offers no real advantage - in fact, an additional join will be needed in all my queries to retrieve the data.
Approach 2 seems more efficient but somehow not correct.
Any pointers on which one I should go with?
Approach 2 seems like a good idea at first, but there are a few problems with it. I'll list them all even though 1 and 2 don't really apply as much to you, since you're only using three-letter ISO codes:
Foreign key references can take up more room. Depending on how long you need to make your VARCHARs, they can take up more room as foreign keys than, say, a byte or a short. If you have zillions of objects which refer to these foreign keys then it adds up. Some DBs are smart about this and replace the VARCHARs with hash table references in the referring tables, but some don't. No DB is smart about it 100% of the time.
You're necessarily exposing database keys (which should have no meaning, at least to end-users) as business keys. What if the bosses want to replace "USD" with "$" or "Dollars"? You would need to add a lookup table in that case, negating a primary reason to use this approach in the first place. Otherwise you'd need to change the value in the CURRENCY_SYMBOLS, which can be tricky (See #3).
It's hard to maintain. Countries occasionally change. They change currencies as they enter/leave the Euro, have coups, etc. Sometimes just the name of the currency becomes politically incorrect. With this approach you not only would have to change the entry in CURRENCY_SYMBOLS, but cascade that change to every object in the DB that refers to it. That could be incredibly slow. Also, since you have no constant keys, the keys the programmers are hard-wiring into their business logic are these same keys that have now changed. Good luck hunting through the entire code base to find them all.
I often use a "hybrid" approach; that is, I use approach 1 but with a very short VARCHAR as the ID (3 or 4 characters max). That way, each entry can have a "SYMBOL" field which is exposed to end users and can be changed as needed by simply modifying the one table entry. Also, developers have a slightly more meaningful ID than trying to remember that "14" is the Yen and "27" is the US Dollar. Since these keys are not exposed, they don't have to change so long as the developers remember that YEN was the currency before The Great Revolution. If a query is just for business logic, you may still be able to get away with not using a join. It's slower for some things but it's faster for others. YMMV.
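As a rough sketch of that hybrid, reusing the table names from the question (key lengths and display values are illustrative):

CREATE TABLE CURRENCY_SYMBOLS (
    SYMBOL_ID CHAR(3) PRIMARY KEY,   -- short, stable internal key, e.g. 'USD'
    SYMBOL VARCHAR(32) NOT NULL      -- what end users see; free to change
);

CREATE TABLE CURRENCY_PAIRS (
    PAIR_ID INT PRIMARY KEY,
    BASE_CURRENCY_ID CHAR(3) NOT NULL REFERENCES CURRENCY_SYMBOLS (SYMBOL_ID),
    TERM_CURRENCY_ID CHAR(3) NOT NULL REFERENCES CURRENCY_SYMBOLS (SYMBOL_ID)
);

When the bosses want "USD" shown as "$", only the SYMBOL value changes; every foreign key stays put.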
In both cases you need a join so you are not saving a join.
Option 1 adds an ID. By default this ID will get a clustered index, meaning the data is sorted on disk with the lowest ID first and the highest ID at the end. This is a flexible option that will allow easy future development.
Option 2 will hard-code the symbols into the Currency Pairs table. This means that if at a later date you want to add another column to the symbols table, e.g. for grouping, you will need to create the symbol_id field and update all your records in the currency pairs table. This increases maintenance costs.
I always add int ID fields for this sort of table because the overhead is low and maintenance is easier.
There are also indexing advantages to option 1.
I would recommend using the symbol id, but it is close. This assumes that you really mean the currency abbreviation, rather than the symbol. I generally prefer surrogate numeric keys. If I have to use a string, then I want to avoid international characters.
One issue is dealing with currencies that may not be international standards or that may change over time. In the past 15 years, we have seen many currencies change, primarily to the euro. But you have other instances where, say, the Turkish lira was redenominated. So if you used your own definition, you might not distinguish between the two currencies.
Also, depending on your application, you may be calling something a "currency" when it is not an official currency. This happens when financial products are priced using some sort of basket of currencies (or other benchmark metric), but you want to treat "currency-basket bonds" the same way as other bonds in your system.
Because the issue of currencies is more complicated than I had once thought, I would lean to having a surrogate key to give the application more flexibility.
I'm playing around with a database idea at the moment. It's likely not going to be deployed in any sort of fashion and is more of a learning experience.
It's meant to simplify the collection and handling of tutor information for a bunch of classes at the university I went to. I worked part time in an office that organised tutors for a handful of classes each semester.
I've got a number of questions, but the one that's causing me a problem at the moment is how I can store the availability of each tutor. I'm considering 3 options at the moment, and I'm looking for feedback on the pros and cons of each from a technical perspective.
Background:
Tutor information is stored in a "tutor" table (which tutorID references), and previous availability must be able to be recalled. Tutor availability is discrete (hourly) and constant throughout a semester.
Option 1:
Table: Availability
+-----------+---------+-------+-------+---+---+---+----+---+
| avID (PK) | tutorID | year | sem | M | T | W | Th | F |
| | | (int) | (int) | (all strings) |
+-----------+---------+-------+-------+---+---+---+----+---+
In this table, availability is stored in a string (08,09,10,13,14 represents 8am, 9am, 10am, 1pm and 2pm).
Data could be retrieved with
SELECT * FROM Availability WHERE tutorID=0001 AND year=2013 AND sem=1
And to see who's available
SELECT * FROM Availability WHERE year=2013 AND sem=1 AND M LIKE '%08%'
Option 2:
Table: Availability
+-----------+---------+-------+-------+--------------+
| avID (PK) | tutorID | year | sem | availability |
| | | (int) | (int) | (set) |
+-----------+---------+-------+-------+--------------+
In this layout, the availability column is stored as the SET datatype in MySQL, with the options being every combination of Monday through Friday and every time from 8 till 4 (M08, M09, ... Th14, F16, etc.). This works out to 45 acceptable values. This is the one that I'm currently leaning towards, but I don't know much about the SET datatype.
Data could be retrieved with
SELECT * FROM Availability WHERE tutorID=0001 AND year=2013 AND sem=1
And to see who's available
SELECT * FROM Availability WHERE year=2013 AND sem=1
AND FIND_IN_SET('M09',availability) > 0
Option 3:
Table: Availability
+-----------+---------+-------+-------+-------+-------+
| avID (PK) | tutorID | year | sem | day | time |
| | | (int) | (int) | (int) | (int) |
+-----------+---------+-------+-------+-------+-------+
In this option, there is a single row for each combination of tutor, semester, and timeslot.
Data could be retrieved with
SELECT * FROM Availability WHERE year=2013 AND sem=2 AND tutorID=0001
And to see who's available
SELECT * FROM Availability WHERE year=2013 AND sem=2 AND day=3 AND time=14
Anyway... Thanks for reading through all of that. Hopefully someone will be able to shed some light on this. I think that it basically will boil down to a best-practice type of question. Unless there's something that I've missed entirely!!
None of your listed options are normalized. Normalization, one of the main points and benefits of relational database technology, is basically about avoiding the storage of redundant information.
Option 1
You were not clear about the requirement, but I'm assuming a tutor may be available more than one hour per day. That would make Option 1 awkward, or a poor fit, because you would have to have multiple rows to cover multiple sessions in a single day. The other columns' values would be duplicated across rows – that kind of repetition means a violation of normalization.
Also, choosing text as the data type for the start time is probably not optimal. If the sessions always start on the hour, then you are dealing with hour numbers. If dealing with numbers, store them as numbers (as a general rule). If the sessions may not always start on the hour, then you are dealing with time values. Same general rule, store them as a Time data type.
Choosing int as the data type for year is probably not ideal. Usually an academic year is something like "2013-2014".
Option 2
In Option 2, stuffing multiple points of data into a single field is definitely not normalized. While your query would work, it has at least two shortcomings. One is performance: typically, searching a multi-value field like that will be relatively slow. But more importantly, violating normalization almost always leads to painting yourself into a corner. What if you want to tie additional values to each of those time slots? You can't, because you don't have access to each time slot when they are smashed together.
Option 3
In Option 3, you are getting closer to a normalized design. But notice how multiple fields will be repeated together (year and sem)? Again that kind of duplication is a flag for a violation of normalization.
Generalize
When designing, generally it is a good habit to broaden or generalize your thinking. For example, are sessions always forever going to start on the hour and last one hour? Not likely. So it may be smart to use a Time value rather than an hour number. Another example, "semester" – not all schools use semesters and even those that do (yours) may change. So it may be smart to generalize to "term" and not make assumptions related to semesters. On the other hand, don't over-generalize or else you can fall into a meaningless mess of a design or fall into analysis-paralysis.
Normalize
To normalize, look for the "things", the stuff that may take an action, or stuff that "owns" other stuff. We call these entities.
You've already identified the tutor as a separate entity. Good.
I see another: term (semester). That repeating of 'year' and 'sem' is the clue. Such repetition is avoided by moving those values into another table. That table is for the entity 'term'. Another clue that a separate table is correct is the idea that we may well want to tie other information to the 'term' table, such as the term's start date and length (or stop date). Such additional data certainly should not be repeated across all our 'availability' rows; it should be stored once, in a single row in the 'term' table.
My Design
So my initial design would look like this diagram.
This relationship is Many-to-Many. Each tutor may be available in multiple terms, and each term may have multiple tutors. A many-to-many is a problem in a relational design, and is always resolved with a third "bridge" or "junction" table. Many-to-many and bridge tables are quite common in databases designed for business contexts.
Here, the bridge table between them is availability. That bridge table is a child table to both, and carries each parent's primary key (a foreign key). Tip: when I place parents (blue here) higher vertically than children (orange here), and I notice the "bird body with raised wings" pattern of a parent on either side, then I recognize a many-to-many relationship exists between the parents.
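A minimal MySQL sketch of this design (names and types are my own illustration):

CREATE TABLE tutor (
    tutor_id INT PRIMARY KEY,
    name VARCHAR(100) NOT NULL
);

CREATE TABLE term (
    term_id INT PRIMARY KEY,
    year_label VARCHAR(9) NOT NULL,   -- e.g. '2013-2014'
    term_number INT NOT NULL,
    start_date DATE NULL
);

CREATE TABLE availability (
    tutor_id INT NOT NULL,
    term_id INT NOT NULL,
    day_of_week TINYINT NOT NULL,     -- 1 = Monday ... 5 = Friday
    start_time TIME NOT NULL,         -- a Time value generalizes beyond on-the-hour slots
    PRIMARY KEY (tutor_id, term_id, day_of_week, start_time),
    FOREIGN KEY (tutor_id) REFERENCES tutor (tutor_id),
    FOREIGN KEY (term_id) REFERENCES term (term_id)
);

-- Who is available on Wednesday at 2pm in a given term?
SELECT tutor_id FROM availability
WHERE term_id = 1 AND day_of_week = 3 AND start_time = '14:00:00';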
By the way, there are times to violate normalization. We call that "denormalizing". Usually the goal is related to performance. But denormalize only after you have consulted with another experienced database designer, and only when you have very good reasons, clearly know the price you are paying, and thoroughly document the violation for the edification of those who may later take your place.
I'm trying to design a new database structure and was wondering of any advantages/disadvantages of using one-to-many vs many-to-many relationships between tables for the following example.
Let's say we need to store information about clients, products, and their addresses (each client and product may have several different addresses, such as "shipping" and "billing").
One way of achieving this is:
This is a rather straightforward approach where each relationship is of one-to-many type, but it involves creating additional tables for each address type.
On the other hand we can create something like this:
This time we simply store an additional "type" field indicating what kind of address it is (Client-Billing (0), Client-Shipping (1), Product-Contact (2)) and a "source_id", which is either Clients.ID or Products.ID depending on the "type" field's value.
This way the "Addresses" table doesn't have a "direct" link to any other tables, but the structure seems to be a lot simpler.
My question is if either of those approaches have any significant advantages or is it just a matter of preference? Which one would you choose? Are there any challenges I should be aware of in the future while extending the database? Are there any significant performance differences?
Thank you.
There seems to be redundancy in both of the designs; using junction tables with an "address type" field and a unique constraint across all three columns would minimize this.
client : id | name
client_address : client_id | address_id | address_type
address : id | line_one | line_two | line_three | line_four | line_five
product_address : product_id | address_id | address_type
product : id | name
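For instance, the client side of that first variant could look like this (a sketch; product_address would be analogous):

CREATE TABLE client_address (
    client_id INT NOT NULL,
    address_id INT NOT NULL,
    address_type INT NOT NULL,   -- e.g. 0 = billing, 1 = shipping
    UNIQUE (client_id, address_id, address_type),
    FOREIGN KEY (client_id) REFERENCES client (id),
    FOREIGN KEY (address_id) REFERENCES address (id)
);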
Either that, or make the address type an attribute of product and client:
client : id | name | billing_address | contact_address
address : id | line_one | line_two | line_three | line_four | line_five
product : id | name | billing_address | contact_address
I think that you have pretty much covered the answer yourself.
To me it really depends on what you are modelling and how you think it will change in the future.
Personally I'd not recommend over-engineering solutions, so the one-to-many solution sounds great.
However if you expect that your needs will change in say 1 month then select the many-to-many.
It's really what you need.
You could change the database at a later stage to have a many-to-many relationship, but that will have a cost and time impact.
Personally I'd keep the database structure as simple as needed, however understanding how you'd change it later.
Personally I've only used MS SQL Server, so perhaps others have a better understanding of other database technologies.
I think it'd be interesting to see what you are using to access your database, eg Sprocs, or something like NHibernate or Entity Framework.
If you believe that changing the database structure could cause you big issues (it always has for me), experiment and find out how you'd do it.
That'll give you the experience you need to make more informed decisions.
The problem with the bottom model is that you will end up storing redundant data when the billing and shipping addresses are the same.
You could keep your associated table from the bottom model but remove the actual address fields from that and put them in their own table, that way you're not storing addresses multiple times, still normalized, but with 2 fewer tables.
We have a web application that is built on top of a SQL database. Several different types of objects can have comments added to them, and some of these objects need field-level tracking, similar to how field changes are tracked on most issue-tracking systems (such as status, assignment, priority). We'd like to show who the change is by, what the previous value was, and what the new value is.
At a pure design level, it would be most straightforward to track each change from any object in a generic table, with columns for the object type, object primary key, primary key of the user that made the change, the field name, and the old and new values. In our case, these would also optionally have a comment ID if the user entered a comment when making the changes.
However, with how quickly this data can grow, is this the best architecture? What are some methods commonly employed to add this type of functionality to an already large-scale application?
[Edit] I'm starting a bounty on this question mainly because I'd like to find out in particular what is the best architecture in terms of handling scale very well. Tom H.'s answer is informative, but the recommended solution seems to be fairly size-inefficient (a new row for every new state of an object, even if many columns did not change) and not possible given the requirement that we must be able to track changes to user-created fields as well. In particular, I'm likely to accept an answer that can explain how a common issue-tracking system (JIRA or similar) has implemented this.
There are several options available to you for this. You could have audit tables which basically mirror the base tables but also include a change date/time, change type and user. These can be updated through a trigger. This solution is typically better for behind the scenes auditing (IMO) though, rather than to solve an application-specific requirement.
The second option is as you've described. You can have a generic table that holds each individual change with a type code to show which attribute was changed. I personally don't like this solution as it prevents the use of check constraints on the columns and can also prevent foreign key constraints.
The third option (which would be my initial choice with the information given) would be to have a separate historical change table which is updated through the application and includes the PK for each table as well as the column(s) which you would be tracking. It's slightly different from the first option in that the application would be responsible for updating the table as needed. I prefer this over the first option in your case because you really have a business requirement that you're trying to solve, not a back-end technical requirement like auditing. By putting the logic in the application you have a bit more flexibility. Maybe some changes you don't want to track because they're maintenance updates, etc.
With the third option you can either have the "current" data in the base table or you can have each column that is kept historically in the historical table only. You would then need to look at the latest row to get the current state for the object. I prefer that because it avoids the problem of duplicate data in your database or having to look at multiple tables for the same data.
So, you might have:
Problem_Ticket (ticket_id, ticket_name)
Problem_Ticket_History (ticket_id, change_datetime, description, comment, username)
Alternatively, you could use:
Problem_Ticket (ticket_id, ticket_name)
Problem_Ticket_Comments (ticket_id, change_datetime, comment, username)
Problem_Ticket_Statuses (ticket_id, change_datetime, status_id, username)
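With the second layout, reading the current state of a ticket just means taking the latest history row, along these lines (SQL Server syntax; the literal ticket id is only an example):

SELECT TOP 1 status_id
FROM Problem_Ticket_Statuses
WHERE ticket_id = 42
ORDER BY change_datetime DESC;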
I'm not sure about the "issue tracking" specific approach, but I wouldn't say there is one ultimate way to do this. There are a number of options to accomplish it, each have their benefits and negatives as illustrated here.
I personally would just create one table that has some metadata columns about the change and a column that stores XML of the serialized version of the old object, or whatever you care about. That way, if you want to show the history of the object, you just get all the old versions, re-hydrate them, and you're done. One table to rule them all.
One often overlooked solution would be to use Change Data Capture. This might give you more space savings/performance if you really are concerned.
Good luck.
Here is the solution I would recommend to attain your objective.
Design your auditing model as shown below.
AuditEventType 1 ---- * AuditEvent
AuditEvent 1 ---- 0,1 AuditEventComment
AuditEvent 1 ---- + AuditDataTable
AuditDataTable 1 ---- + AuditDataRow
AuditDataRow 1 ---- + AuditDataColumn
AuditEventType
Contains a list of all possible event types in the system and a generic description of each.

AuditEvent
Contains information about the particular event that triggered this action.

AuditEventComment
Contains an optional custom user comment about the audit event. Comments can be really huge, so it is better to store them in a CLOB.

AuditDataTable
Contains a list of one or more tables that were impacted by the respective AuditEvent.

AuditDataRow
Contains a list of one or more identifying rows in the respective AuditDataTable that were impacted by the respective AuditEvent.

AuditDataColumn
Contains a list of zero or more columns of the respective AuditDataRow whose values were changed, along with their previous and current values.

AuditBuilder
Implement AuditBuilder (Builder pattern). Instantiate it at the beginning of the event and make it available in the request context, or pass it along with your other DTOs. Each time you make changes to your data, anywhere in your code, invoke the appropriate call on AuditBuilder to notify it of the change. At the end, invoke build() on AuditBuilder to form the above structure, and then persist it to the database.
Make sure all your activity for the event is in a single DB transaction, along with the persistence of the audit data.
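To make the model concrete, here is a rough DDL sketch of those tables (column names and types are my own assumptions, not part of the original design):

CREATE TABLE AuditEventType (
    event_type_id INT PRIMARY KEY,
    description VARCHAR(255) NOT NULL
);

CREATE TABLE AuditEvent (
    audit_event_id INT PRIMARY KEY,
    event_type_id INT NOT NULL REFERENCES AuditEventType (event_type_id),
    event_time TIMESTAMP NOT NULL,
    username VARCHAR(50) NOT NULL
);

CREATE TABLE AuditEventComment (
    audit_event_id INT PRIMARY KEY REFERENCES AuditEvent (audit_event_id),   -- at most one per event
    comment_text CLOB NOT NULL
);

CREATE TABLE AuditDataTable (
    audit_table_id INT PRIMARY KEY,
    audit_event_id INT NOT NULL REFERENCES AuditEvent (audit_event_id),
    table_name VARCHAR(128) NOT NULL
);

CREATE TABLE AuditDataRow (
    audit_row_id INT PRIMARY KEY,
    audit_table_id INT NOT NULL REFERENCES AuditDataTable (audit_table_id),
    row_key VARCHAR(128) NOT NULL    -- identifies the impacted row
);

CREATE TABLE AuditDataColumn (
    audit_row_id INT NOT NULL REFERENCES AuditDataRow (audit_row_id),
    column_name VARCHAR(128) NOT NULL,
    old_value CLOB,
    new_value CLOB,
    PRIMARY KEY (audit_row_id, column_name)
);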
It depends on your exact requirements, and this might not be for you, but for general auditing in the database with triggers (so front-end and even the SP interface layer don't matter), we use AutoAudit, and it works very well.
I don't understand the actual usage scenarios for the audited data, though... Do you need to just keep track of the changes? Will you need to "roll back" some of the changes? How frequent (and flexible) do you want the audit log report/lookup to be?
Personally I'd investigate something like this:
Create AuditTable. This has an ID, a version number, a user id and a clob field.
When Object #768795 is created, add a row in AuditTable, with values:
Id=#768795
Version:0
User: (Id of the user who created the new object)
Clob: an XML representation of the whole object. (If space is a problem and access to this table is not frequent, you could use a blob and "zip" the XML representation on the fly.)
Every time you change something, create a new version, and store the whole object "serialized" as XML.
In case you need to create an audit log you have all you need, and can use simple "text compare" tools or libraries to see what changed in time (a bit like Wikipedia does).
If you want to track only a subset of fields either because the rest is immutable, non significant or you are desperate for speed/space, just serialize the subset you care about.
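A sketch of that table (the types follow the description above; naming is illustrative):

CREATE TABLE AuditTable (
    object_id INT NOT NULL,    -- id of the audited object, e.g. #768795
    version INT NOT NULL,      -- 0 on creation, incremented on each change
    user_id INT NOT NULL,      -- who made the change
    snapshot CLOB NOT NULL,    -- XML serialization of the whole object
    PRIMARY KEY (object_id, version)
);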
I know this question is very old, but another possibility, which is built into SQL Server, is Change Data Capture.
You can find more information at this link:
Introduction to Change Data Capture (CDC) in SQL Server 2008
http://www.simple-talk.com/sql/learn-sql-server/introduction-to-change-data-capture-(cdc)-in-sql-server-2008/
I think Observer is an ideal pattern in this scenario.