I am trying to design my first database and I have found that I have quite a few different "flags" that I want to keep track of in the database:
Active # Shows whether the item submission has been completed
Indexed # Shows whether the item has been indexed
Reminded # Shows whether the “expiring email” has been sent to the user
Error # Shows whether there is an error with the submission
Confirmation # Shows whether the confirmation email has been sent
Other than just having a Boolean field for these, is there a clever way of storing these details? I was wondering if I had these under a status group in the database with an ID for every connotation (32) and just link to that.
Unless there is some reason to do otherwise, I'd recommend simply adding those five boolean (or bit) columns to the item table.
It depends on how immutable the list of connotations is.
If there are just the five you mentioned then just add five flag columns. If the list of possible connotations could change in the future, it might be safer to have a separate table with a list of connotations that currently apply to each row in the main table, with a one-to-many relationship.
Consider:
Table: Vehicle
ID
Type
Doors
Color
Table: Type_Categories
ID
Name
Table: Types
TypeID
CategoryID
Value
DataType
This way, types can be reused in other places as needed. However, this assumes non-boolean "flags"; if all flags are truly boolean, I'd stick with putting them in the table. But I always hated boolean values. I prefer timestamps, so I know when the flag was set, not just that it was set. If the timestamp is null, then it has not been set.
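To make the timestamp idea concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table and column names (item, indexed_at, reminded_at) are made up for illustration, and dates are stored as ISO text only because SQLite has no dedicated timestamp type.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE item (
        item_id     INTEGER PRIMARY KEY,
        indexed_at  TEXT,   -- NULL means "not indexed yet"
        reminded_at TEXT    -- NULL means "no reminder sent"
    )
""")
conn.execute("INSERT INTO item (item_id) VALUES (1), (2)")

# Setting the flag also records when it was set.
conn.execute(
    "UPDATE item SET indexed_at = ? WHERE item_id = ?",
    ("2024-01-15 09:30:00", 1),
)

# The boolean view of the flag is just a NULL test.
rows = conn.execute(
    "SELECT item_id, indexed_at IS NOT NULL AS is_indexed "
    "FROM item ORDER BY item_id"
).fetchall()
print(rows)
```

Clearing a flag is then a matter of setting the column back to NULL.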
In my experience the status columns frequently evolve to more than two states. So I would use a smallint for each status for convenience and simplicity.
But if your aim is to save space then you can save all the statuses in a single smallint using casts to and from bit to manipulate the statuses individually or as a whole.
create table t (status smallint);
To save 10010 then cast it to smallint:
insert into t (status) values (b'10010'::int::smallint);
List all statuses:
select status::int::bit(5) from t;
status
--------
10010
To set the 3rd status use the bitwise or:
update t set status = (status::integer::bit(5) | b'00100')::integer::smallint;
select status::int::bit(5) from t;
status
--------
10110
To unset that status use the bitwise and:
update t set status = (status::integer::bit(5) & b'11011')::integer::smallint;
select status::int::bit(5) from t;
status
--------
10010
To retrieve the lines with the 3rd status set:
select status
from t
where substring(status::integer::bit(5) from 3 for 1) = '1'
You could write functions to simplify the conversions.
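If the conversions live in application code rather than in the database, such helper functions could look roughly like this. This is only a sketch in Python: the function names are invented, and positions are counted from the left starting at 1 to match the bit(5) examples above.

```python
WIDTH = 5  # number of status bits, matching bit(5) above


def mask(position: int) -> int:
    """Mask for the position-th status, counted from the left starting at 1."""
    if not 1 <= position <= WIDTH:
        raise ValueError(f"position must be between 1 and {WIDTH}")
    return 1 << (WIDTH - position)


def set_status(value: int, position: int) -> int:
    """Turn one status on (bitwise or)."""
    return value | mask(position)


def clear_status(value: int, position: int) -> int:
    """Turn one status off (bitwise and with the inverted mask)."""
    return value & ~mask(position)


def has_status(value: int, position: int) -> bool:
    """Report whether one status is on."""
    return (value & mask(position)) != 0


def as_bits(value: int) -> str:
    """Render the stored smallint as a fixed-width bit string."""
    return format(value, f"0{WIDTH}b")


status = 0b10010
status = set_status(status, 3)
print(as_bits(status))   # 10110
status = clear_status(status, 3)
print(as_bits(status))   # 10010
```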
If they are "just" flags, store them as boolean-type columns on the table.
I'd recommend against Clodoaldo's solution unless space is REALLY tight - see this question.
It looks like the columns you mention have "business importance" - i.e. it may not be enough to store "Indexed", but also the date on which the item was indexed. It may be necessary to limit combinations of states, or impose rules on the sequencing (you can't go to complete whilst being in error state). In that case, you may want to implement an "item_status" table to store history etc.
In this case, your schema would be something like this:
ITEM
---------
item_id
....
STATUS
---------
status_id
description
ITEM_STATUS
--------------
item_id
status_id
date
Every time an item changes status, you insert a new row into the ITEM_STATUS table; the current status is the row with the latest date for that item.
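As a rough illustration of "the current status is the row with the latest date", here is a small sketch with Python's sqlite3 module. The sample rows are invented, and the date column is called changed_on here (my naming, not part of the schema above) to avoid confusion with the date type.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE status (
        status_id   INTEGER PRIMARY KEY,
        description TEXT NOT NULL
    );
    CREATE TABLE item_status (
        item_id    INTEGER NOT NULL,
        status_id  INTEGER NOT NULL REFERENCES status (status_id),
        changed_on TEXT    NOT NULL   -- ISO dates sort correctly as text
    );
    INSERT INTO status VALUES (1, 'Active'), (2, 'Indexed'), (3, 'Error');
    INSERT INTO item_status VALUES
        (10, 1, '2024-01-01'),
        (10, 2, '2024-01-05'),
        (11, 1, '2024-01-02'),
        (11, 3, '2024-01-03');
""")

# Current status per item: the history row with the latest date.
current = conn.execute("""
    SELECT s.item_id, st.description
    FROM item_status AS s
    JOIN status AS st ON st.status_id = s.status_id
    WHERE s.changed_on = (
        SELECT MAX(changed_on) FROM item_status WHERE item_id = s.item_id
    )
    ORDER BY s.item_id
""").fetchall()
print(current)
```

The full history of an item stays available by selecting all of its item_status rows ordered by date.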
I have a table where I have these fields:
id(primary key, auto increment)
car registration number
car model
garage id
and 31 fields, one for each day of the month, in each row.
In these fields I have a char of 1 or 2 characters representing the car's status on that date. I need to make a query to get the count of each possibility for that day. The field for any day could have one of the values D, I, R, TA, RZ, BV and LR.
For each row, I need to count the occurrences of each value in that row.
For example, how many I, how many D and so on, and this for every row in the table.
What would be the best approach here? Also, maybe there is a better way than having a field in the table for each day, because that obviously makes over 30 fields.
There is a better way. You should structure the data so you have another table, with rows such as:
CarId
Date
Status
Then your query would simply be:
select status, count(*)
from CarStatuses
where date >= :month_start and date < :month_end
group by status;
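For what it's worth, here is a small runnable sketch of that normalized layout using Python's sqlite3 module. The table and column names (car_statuses, status_date) and the sample rows are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE car_statuses (car_id INTEGER, status_date TEXT, status TEXT)"
)
conn.executemany(
    "INSERT INTO car_statuses VALUES (?, ?, ?)",
    [
        (1, "2024-03-01", "D"),
        (1, "2024-03-02", "D"),
        (1, "2024-03-03", "TA"),
        (2, "2024-03-01", "I"),
        (2, "2024-04-01", "D"),  # outside March, must not be counted
    ],
)


def status_counts(month_start, month_end):
    """Count statuses in the half-open range [month_start, month_end)."""
    return conn.execute(
        """
        SELECT status, COUNT(*)
        FROM car_statuses
        WHERE status_date >= ? AND status_date < ?
        GROUP BY status
        ORDER BY status
        """,
        (month_start, month_end),
    ).fetchall()


march = status_counts("2024-03-01", "2024-04-01")
print(march)
```

Adding car_id to the SELECT list and the GROUP BY gives the per-car counts the question asks for.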
For your data model, this is much harder to deal with. You can do something like this:
select status, count(*)
from ((select status_01 as status
from t
) union all
(select status_02
from t
) union all
. . .
(select status_31
from t
)
) s
group by status;
You seem to need to start with the most basic tutorials on relational databases and SQL design. Classic works such as Martin Gruber's "Understanding SQL" may help, as may others. At the moment you are missing the basics.
Few hints.
Documents that you print for the user or receive from the user do not represent your internal data structures. They are created or parsed for that very purpose: a machine-to-human interface. Inside, your program should structure the data for ease of storing and processing.
You have to add a "dictionary table" for the statuses.
ID / abbreviation / human-readable description
You may have a "business rule" that from the "R" status you can transition to either the "D" status or the "BV" status, but not to any other. In other words, you had better draft the "directed graph" of possible status transitions. You would keep it in extra columns of that dictionary table or in one more specialized helper table: a dictionary of transitions for the dictionary of possible statuses.
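Such a transition graph can be sketched in a few lines of Python. The only rule filled in here is the hypothetical "R" example from above; the real rules would come from your business process, and in the database they would live in the transitions dictionary table.

```python
# Allowed transitions as a directed graph: status -> set of statuses it may
# move to. Only the hypothetical "R" rule from the text is filled in.
ALLOWED_TRANSITIONS = {
    "R": {"D", "BV"},
}


def can_transition(current: str, new: str) -> bool:
    """Return True if moving from `current` to `new` is an allowed edge."""
    return new in ALLOWED_TRANSITIONS.get(current, set())


print(can_transition("R", "D"))    # True
print(can_transition("R", "TA"))   # False
```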
Your paper form combines both totals and per-day detail in the same row. That is easy for a human to look at, but for a computer it in a sense violates the single responsibility principle. A row should be responsible either for a primary record or for a derived total. You had better have two tables: one for the primary day-by-day records and another for the per-month totals.
Bonus point would be that when you would change values in the primary data table you may ask server to automatically recalculate the corresponding month totals. Read about SQL triggers.
Your triggers could also check that the new state is a valid transition from the previous day's state, as described in the "business rules". They might also have to check that there are no gaps between days: if there is a record for "March 03" and a new record for "March 05" is inserted, then a record for "March 04" should exist, or the server would refuse to add the row. Well, maybe not; that depends on your business processes. The general idea is that the server should reject any data that it can know to be invalid.
Your per-date and per-month tables should have proper UNIQUE CONSTRAINTs prohibiting duplicate rows. It also means the former should have a DATE-type column and the latter should either have month and year INTEGER-type columns or a DATE-type column whose day part is always "1"; you would want a CHECK CONSTRAINT for that.
If your company has some registry of cars (and it probably does; it does not look like those cars were driven in by random one-time customers passing by), you have to introduce a dictionary table of cars: integer ID (PK), registration plate, engine factory number, vagon factory number, colour and whatever else.
The per-month totals table would not have many columns per every status. It would instead have a special row for every status! The structure would probably be like that: Month / Year / ID of car in the registry / ID of status in the dictionary / count. All columns would be integer type (some may be SmallInt or BigInt, but that is minor nuancing). All the columns together (without count column) should constitute a UNIQUE CONSTRAINT or even better a "compound" Primary Key. Adding a special dedicated PK column here in the totaling table seems redundant to me.
Consequently, your per-day and per-month tables would not have literal (textual and immediate) data for status and car id. Instead they would have integer IDs referencing proper records in the corresponding cars dictionary and status dictionary tables. That you would code as FOREIGN KEY.
Remember the rule of thumb: it is easy to add/delete a row to any table but quite hard to add/delete a column.
With a column-oriented design like yours, what would happen if next year the boss introduced some more statuses? You would have to redesign the table, change the program in many places, and so on.
With the rows-oriented design you would just have to add one row in the statuses dictionary and maybe few rows to transition rules dictionary, and the rest works without any change.
That way you would not
I am new to DB Design and I've recently inherited the responsibility of adding some new attributes to an existing design.
Below is a sample of the current table in question:
Submission Table:
ID (int)
Subject (text)
Processed (bit)
SubmissionDate (datetime)
Submitted (bit)
...
The new requirements are:
A Submission can be marked as valid or invalid
A Reason must be provided when a Submission is marked as invalid. (So a submission may have an InvalidReason)
Submissions can be associated with one another such that: Multiple valid Submissions can be set as "replacements" for an invalid Submission.
So I've currently taken the easy solution and simply added new attributes directly to the Submission Table so that it looks like this:
NEW Submission Table:
ID (int)
Subject (text)
Processed (bit)
SubmissionDate (datetime)
Submitted (bit)
...
IsValid (bit)
InvalidReason (text)
ReplacedSubmissionID (int)
Everything works fine this way, but it just seems a little strange:
Having InvalidReason as a column that will be NULL for majority of submissions.
Having ReplacedSubmissionID as a column that will be NULL for majority of submissions.
If I understand normalization right, InvalidReason might be transitively dependent on the IsValid bit.
It just seems like somehow some of these attributes should be extracted to a separate table, but I don't see how to create that design with these requirements.
Is this single table design okay? Anyone have better alternative ideas?
Whether or not you should have a single table design really depends on
1) How you will be querying the data
2) How much data would end up being potentially NULL in the resulting table.
In your case it's probably OK, but again it depends on #1. If you will be querying separately to get information on invalid submissions, you may want to create a separate table that references the Id of invalid submissions and the reason:
New table: InvalidSubmissionInfo
Id (int) (of invalid submissions; will have FK constraint on Submission table)
InvalidReason (string)
Additionally if you will be querying for replaced submissions separately you may want to have a table just for those:
New table: ReplacementSubmissions
Id (int) (of the replacement submissions; will have FK constraint on Submission table)
ReplacedSubmissionId (int) (of what got replaced; will have FK constraint on submission table)
To get the rest of the info you will still have to join with the Submissions table.
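Here is a rough sketch of what that split could look like, using Python's sqlite3 module. The snake_case names and the sample rows are my own; the structure follows the two tables described above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces FKs when asked
conn.executescript("""
    CREATE TABLE submission (
        id      INTEGER PRIMARY KEY,
        subject TEXT NOT NULL
    );
    CREATE TABLE invalid_submission_info (
        id             INTEGER PRIMARY KEY REFERENCES submission (id),
        invalid_reason TEXT NOT NULL
    );
    CREATE TABLE replacement_submissions (
        id                     INTEGER PRIMARY KEY REFERENCES submission (id),
        replaced_submission_id INTEGER NOT NULL REFERENCES submission (id)
    );
    INSERT INTO submission VALUES
        (1, 'Budget'), (2, 'Budget v2'), (3, 'Budget appendix'), (4, 'Minutes');
    INSERT INTO invalid_submission_info VALUES (1, 'Wrong totals');
    INSERT INTO replacement_submissions VALUES (2, 1), (3, 1);
""")

# Invalid submissions with their reason and the submissions that replace them.
rows = conn.execute("""
    SELECT s.id, i.invalid_reason, r.id AS replacement_id
    FROM submission AS s
    JOIN invalid_submission_info AS i ON i.id = s.id
    LEFT JOIN replacement_submissions AS r ON r.replaced_submission_id = s.id
    ORDER BY s.id, r.id
""").fetchall()
print(rows)
```

Valid submissions that replace nothing (submission 4 here) have no rows in either side table, so no NULL columns are stored for them.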
All this to say, you do not need to separate this out into multiple tables. Having a NULL value only takes up 1 bit of memory, which isn't bad. And if you need to query and return an entire Submission record each time, it makes more sense to condense this info into one table.
Single table design looks good to me and it should work in your case.
If you do not like NULLs, you can give InvalidReason a default value of an empty string and ReplacedSubmissionID a default of 0. Default values are always preferable in database design.
Having an empty string or default value will make your data look cleaner.
Please remember that if you add default values, you might need to change your queries to get proper results.
For example:
Getting submissions which have not been replaced:
Select * from tblSubmission where ReplacedSubmissionID = 0
Don't fear joins. Looking for ways to place everything in a single table is at best a complete waste of time, at worst results in a convoluted, unmaintainable mess.
You are correct about InvalidReason and IsValid. However, you missed SubmissionDate and Submitted.
Whenever modeling an entity that will be processed in some way and going through consecutive state changes, these states really should be placed in a separate table. Any information concerning the state change -- date, reason for the change, authorization, etc. -- will have a functional dependency on the state rather than the entity as a whole, therefore an attempt to make the state information part of the entity tuple will fail at 2NF.
The problem this causes is shown in your very question. You already incorporated Submitted and SubmissionDate into the tuple. Now you have another state you want to add. If you had normalized the submission data, you could have simply added another state and gone on.
create table StateDefs(
ID int auto_generated primary key,
Name varchar( 16 ) not null, -- 'Submitted', 'Processed', 'Rejected', etc.
... -- any other data concerning states
);
create table Submissions(
ID int auto_generated primary key,
Subject varchar( 128 ) not null,
... -- other data
);
create table SubmissionStates(
SubID int not null references Submissions( ID ),
State int not null references StateDefs( ID ),
When date not null,
Description varchar( 128 )
);
This shows that a state consists of a date and an open text field to place any other information. That may suit your needs. If different states require different data, you may have to (gasp) create other state tables. Whatever your needs require.
You could insert the first state of a submission into the table and update that record at state changes. But you lose the history of state changes and that is useful information. So each state change would call for a new record each time. Reading the history of a submission would then be easy. Reading the current state would be more difficult.
But not too difficult:
select ss.*
from SubmissionStates ss
where ss.SubID = :SubID
and ss.When =(
select Max( When )
from SubmissionStates
where SubID = ss.SubID
and When <= Today() );
This finds the current row, that is, the row with the most recent date. To find the state that was in effect on a particular date, change Today() to something like :AsOf and place the date of interest in that variable. Storing the current date in that variable returns the current state so you can use the same query to find current or past data.
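To see the "as of" idea run end to end, here is a small sketch with Python's sqlite3 module. The column changed_on stands in for the When column above, states are stored as plain text instead of a reference to StateDefs to keep it short, and the sample dates are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE submission_states (
        sub_id     INTEGER NOT NULL,
        state      TEXT    NOT NULL,
        changed_on TEXT    NOT NULL   -- stands in for the "When" column
    );
    INSERT INTO submission_states VALUES
        (1, 'Submitted', '2024-02-01'),
        (1, 'Processed', '2024-02-10'),
        (1, 'Rejected',  '2024-02-20');
""")


def state_as_of(sub_id, as_of):
    """Return the state in effect on `as_of`, or None if there was none yet."""
    row = conn.execute(
        """
        SELECT ss.state
        FROM submission_states AS ss
        WHERE ss.sub_id = :sub_id
          AND ss.changed_on = (
              SELECT MAX(changed_on)
              FROM submission_states
              WHERE sub_id = ss.sub_id AND changed_on <= :as_of
          )
        """,
        {"sub_id": sub_id, "as_of": as_of},
    ).fetchone()
    return row[0] if row else None


print(state_as_of(1, "2024-02-15"))  # Processed
print(state_as_of(1, "2024-12-31"))  # Rejected
```

Passing today's date gives the current state, so the one function covers both cases.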
So given a table structure that looks something like this:
Order_date DATE
Order_id NUMBER
State VARCHAR2(16)
...
other properties/attributes
Keep in mind that I could use a sequence of integers here and generate a PK, however that does not interest me because of how I use this table in the main application.
So the composite key is made of Order_date, Order_id and State. The problem with this combination is that it is not necessarily unique, but it is constrained in a way.
Ex:
Order_date | Order_id | State
21-09-2014 7218821 Pending
22-09-2014 2771272 Pending
20-09-2014 3277127 Approved
13-08-2014 2218765 Done
13-08-2014 2218765 Cancelled
Constraints:
There is no way for one combination of the same order_date and order_id and state Done to be duplicated in this table
There can be any number of the same order_date and order_id with any other state than Done
You cannot add a record with state DONE or ERROR
You cannot skip from one state to another by bypassing their natural sequence (REGISTERED -> PENDING -> APPROVED -> DONE | CANCELLED | ERROR)
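What would be the best way for me to implement these constraints for an Oracle database?
The first two constraints together call for a conditional unique key: unique when the state is DONE, unrestricted otherwise. A plain primary key or unique key cannot express that.
It can be handled with a function-based unique index, because Oracle does not index rows whose key expressions are all NULL. Every key expression has to be NULL for the non-DONE rows, so each column is wrapped in the CASE:
create unique index unq_order_date_id_done on
orders((case when state = 'DONE' then order_date end),
(case when state = 'DONE' then order_id end));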
I think the third and fourth need a trigger to prevent the value from being added.
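I don't have an Oracle instance to paste output from, so here is the same "unique only when DONE" idea sketched with SQLite through Python's sqlite3 module. Note the swap: SQLite expresses it with a partial index (a WHERE clause on the index) rather than a function-based index, so this shows the behaviour, not the Oracle syntax. The table layout and sample rows are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (
        order_date TEXT    NOT NULL,
        order_id   INTEGER NOT NULL,
        state      TEXT    NOT NULL
    );
    -- At most one DONE row per (order_date, order_id);
    -- rows in any other state are not covered by the index at all.
    CREATE UNIQUE INDEX unq_order_done
        ON orders (order_date, order_id) WHERE state = 'DONE';
""")

# Duplicates in a non-DONE state are accepted.
conn.execute("INSERT INTO orders VALUES ('2014-08-13', 2218765, 'PENDING')")
conn.execute("INSERT INTO orders VALUES ('2014-08-13', 2218765, 'PENDING')")
conn.execute("INSERT INTO orders VALUES ('2014-08-13', 2218765, 'DONE')")

# A second DONE row for the same order is rejected.
try:
    conn.execute("INSERT INTO orders VALUES ('2014-08-13', 2218765, 'DONE')")
    second_done_accepted = True
except sqlite3.IntegrityError:
    second_done_accepted = False

row_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
print(second_done_accepted, row_count)
```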
Bullet by bullet:
This is most likely true with no monitoring needed. Although you don't show it, the DATE field contains the time down to the second. In order to have a duplicate, the state for the same order will have to be changed twice within the same second.
Doubtful. Unless your processing allows for multiple state changes for the same order within a second of each other.
Your example data shows a state of DONE. How did that get there?
Your description states that after APPROVED, the only allowed states are DONE or CANCELED or ERROR. Your example data shows an order going from DONE to CANCELED. This does not seem to be allowed. Actually, your second bullet suggests that a status of ERROR is not allowed under any circumstances.
The only way you can have duplicated (order, date) values is if status changes occur very quickly -- within the same second. OR...you truncate the time values from the date fields. This doesn't seem likely as there is no reason to be discarding such valuable information as the time a status change was recorded. You get no benefit and processing becomes more difficult: lose/lose.
For database design, if the value of a column comes from a constant list of strings, such as a status or type, should I create a new table with a foreign key, or just store plain strings in the same table?
For example, I have an orders table with status:
----------------------------
| id | price | status |
----------------------------
| 1 | 10.00 | pending |
| 2 | 03.00 | in_progress |
| 3 | xx.xx | done |
An alternative to the above table is to have an order_status table and store status_id in the orders table. I'm not sure if another table is necessary here.
If it's more than just a few different values and/or values are frequently added you should go with a normalized data model, i.e. a table.
Otherwise you also might go for a column, but you need to add a CHECK(status in ('pending','in_progress','done')) to avoid wrong data. This way you get the same consistency without the FK.
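A quick way to see the CHECK constraint at work is this small sketch with Python's sqlite3 module. The column types are SQLite's; the three allowed values are the ones from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE orders (
        id     INTEGER PRIMARY KEY,
        price  REAL NOT NULL,
        status TEXT NOT NULL
               CHECK (status IN ('pending', 'in_progress', 'done'))
    )
""")
conn.execute("INSERT INTO orders VALUES (1, 10.00, 'pending')")

# A value outside the list is refused by the database itself.
try:
    conn.execute("INSERT INTO orders VALUES (2, 3.00, 'progressing')")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

print(rejected)
```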
To save space you might use abbreviations (one or a few characters, e.g. 'p', 'i', 'd') but not meaningless numbers (1, 2, 3). Resolving the long values can be done at the view level using CASE.
ENUMs are proprietary, so IMHO better avoid it...
It's not a good practice to create a table just for static values.
Instead, you could use the ENUM type, which has a preset list of values, as in this example:
CREATE TABLE orders (
id INT,
price DOUBLE,
status ENUM('pending', 'in progress', 'done')
);
There are pros and cons for each solution and you need pick the best for your own project and you may have to switch later if the initial choice is bad.
In your case, storing status directly can be good enough. But if you want to prevent invalid status stored in your database, or you have a very long status text, you may want to store them separately with a foreign key constraint.
ENUM is another solution. However, if you need a new status later, you have to change your table definition, which can be a very bad thing.
If the status has extra data associated with it, like display order or a colour, then you would need a separate table. Also, choosing pre-entered values from a table prevents semi-duplicate values (for example, one person might write "in progress" whereas another might write "in_progress" or "progressing") and aids in searching for orders with the same status.
I would go for a separate table as it allows more capabilities and lowers error.
I would use an order_status table with the literal as the primary key. Then in your orders table, cascade updates on the status column in case you modify the literals in the order_status table. This way you have data consistency and avoid join queries.
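A minimal sketch of that setup with Python's sqlite3 module is below. The sample rows are invented, and the PRAGMA line is needed only because SQLite leaves foreign key enforcement off by default.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # required for FK enforcement in SQLite
conn.executescript("""
    CREATE TABLE order_status (
        status TEXT PRIMARY KEY          -- the literal itself is the key
    );
    CREATE TABLE orders (
        id     INTEGER PRIMARY KEY,
        status TEXT NOT NULL
               REFERENCES order_status (status) ON UPDATE CASCADE
    );
    INSERT INTO order_status VALUES ('pending'), ('in_progress'), ('done');
    INSERT INTO orders VALUES (1, 'pending'), (2, 'in_progress');
""")

# Renaming a literal in the lookup table cascades to every order using it.
conn.execute(
    "UPDATE order_status SET status = 'in progress' WHERE status = 'in_progress'"
)
statuses = conn.execute("SELECT id, status FROM orders ORDER BY id").fetchall()
print(statuses)
```

Reading an order's status needs no join, since the readable value is stored in the orders row itself.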
Let's say I have a User which has a status and the user's status can be 'active', 'suspended' or 'inactive'.
Now, when creating the database, I was wondering... would it be better to have a column with the string value (with an enum type, or rule applied) so it's easier to both query and know the current user status or are joins better and I should join in a UserStatuses table which contains the possible user statuses?
Assuming, of course statuses can not be created by the application user.
Edit: Some clarification
I would NOT use string joins, it would be a int join to UserStatuses PK
My primary concern is performance wise
The possible status ARE STATIC and will NEVER change
On most systems it makes little or no difference to performance. Personally I'd use a short string for clarity and join that to a table with more detail as you suggest.
create table intLookup
(
pk integer primary key,
value varchar(20) not null
)
insert into intLookup (pk, value) values
(1,'value 1'),
(2,'value 2'),
(3,'value 3'),
(4,'value 4')
create table stringLookup
(
pk varchar(4) primary key,
value varchar(20) not null
)
insert into stringLookup (pk, value) values
(1,'value 1'),
(2,'value 2'),
(3,'value 3'),
(4,'value 4')
create table masterData
(
stuff varchar(50),
fkInt integer references intLookup(pk),
fkString varchar(4)references stringLookup(pk)
)
create index i on masterData(fkInt)
create index s on masterData(fkString)
insert into masterData
(stuff, fkInt, fkString)
select COLUMN_NAME, (ORDINAL_POSITION %4)+1,(ORDINAL_POSITION %4)+1 from INFORMATION_SCHEMA.COLUMNS
go 1000
This results in 300K rows.
select
*
from masterData m inner join intLookup i on m.fkInt=i.pk
select
*
from masterData m inner join stringLookup s on m.fkString=s.pk
On my system (SQL Server)
- the query plans, I/O and CPU are identical
- execution times are identical.
- The lookup table is read and processed once (in either query)
There is NO difference using an int or a string.
I think, as a whole, everyone has hit on important components of the answer to your question. However, they all have good points which should be taken together, rather than separately.
As logixologist mentioned, a healthy amount of Normalization is generally considered to increase performance. However, in contrast to logixologist, I think your situation is the perfect time for normalization. Your problem seems to be one of normalization. In this case, using a numeric key as Santhosh suggested which then leads back to a code table containing the decodes for the statuses will result in less data being stored per record. This difference wouldn't show in a small Access database, but it would likely show in a table with millions of records, each with a status.
As David Aldridge suggested, you might find that normalizing this particular data point will result in a more controlled end-user experience. Normalizing the status field will also allow you to edit the status flag at a later date in one location and have that change perpetuated throughout the database. If your boss is like mine, then you might have to change the Status of Inactive to Closed (and then back again next week!), which would be more work if the status field was not normalized. By normalizing, it's also easier to enforce referential integrity. If a status key is not in the Status code table, then it can't be added to your main table.
If you're concerned about the performance when querying in the future, then there are some different things to consider. To pull back status, if it's normalized, you'll be adding a join to your query. That join will probably not hurt you in any sized recordset but I believe it will help in larger recordsets by limiting the amount of raw text that must be handled. If your primary concern is performance when querying the data, here's a great resource on how to optimize queries: http://www.sql-server-performance.com/2007/t-sql-where/ and I think you'll find that a lot of the rules discussed here will also apply to any inclusion criteria you enforce in the join itself.
Hope this helps!
Christopher
The whole idea behind normalization is to keep the data from repeating (well at least one of the concepts).
In this case a user can have only 1 status at a time (I assume), so there is no reason to put it in its own table. You would simply complicate things. The only reason you would have a separate table is if for some reason these statuses were not static, meaning next month you may add "Sort of Active" and "Maybe Inactive". This would mean changing code to make up for that if you didn't put them in their own table. You could create a maintenance page where users could add statuses, and that would require you to create a separate table.
An issue to consider is whether these status values have attributes of their own.
For example, perhaps you would want to have a default sort order that is different from the alphabetical order of the status text. You might also want to treat two of the statuses in a particular way that you do not treat the other, and that could be an attribute.
If you have a need for that, or suspect a future need for that, then move the status text to a different table and use an integer key value for them.
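As an illustration of statuses carrying attributes of their own, here is a sketch with Python's sqlite3 module. The sort_order and can_log_in columns and the sample users are hypothetical; they stand in for whatever attributes you actually need.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE user_status (
        status_id  INTEGER PRIMARY KEY,
        name       TEXT    NOT NULL UNIQUE,
        sort_order INTEGER NOT NULL,   -- display order, not alphabetical
        can_log_in INTEGER NOT NULL    -- example of a per-status attribute
    );
    CREATE TABLE users (
        user_id   INTEGER PRIMARY KEY,
        name      TEXT    NOT NULL,
        status_id INTEGER NOT NULL REFERENCES user_status (status_id)
    );
    INSERT INTO user_status VALUES
        (1, 'active', 1, 1), (2, 'suspended', 2, 0), (3, 'inactive', 3, 0);
    INSERT INTO users VALUES (1, 'ann', 3), (2, 'bob', 1), (3, 'cy', 2);
""")

# Sort by the status's own sort order rather than by its text.
ordered = conn.execute("""
    SELECT u.name, s.name
    FROM users AS u
    JOIN user_status AS s ON s.status_id = u.status_id
    ORDER BY s.sort_order
""").fetchall()
print(ordered)
```

Sorting on the status text alone would have put 'inactive' before 'suspended'.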
I would suggest using integer values like 0, 1, 2 if the list is fixed. When interpreting the results in reports, we can change these statuses back to strings.