Is this a textbook design pattern, or did I invent something new? - sql

I've just finished designing a set of tables, and I came up with an architecture that I'm very pleased with! I've never seen it anywhere else before, so I'd love to know if I've just reinvented the wheel (most probable), or if this is a genuine innovation.
Here's the problem statement: I have Employees who can each sign a different contract with the company. Each employee can perform different Activities, and each activity may have a different pay rate, sometimes a fixed amount for completing one activity, sometimes an hourly rate, and sometimes at a tiered rate. There may also be a specific customer who likes the employee particularly, so when he works with that specific customer, he gets a higher rate. And if no rate is defined, he gets the company default rate.
Don't fuss about the details: the main point is that there are a lot of pay rates that can be defined, each in a fairly complicated way. And the pay rates all have the following in common:
Service Type
Pay Scale Type (Enum: Fixed Amount/Hourly Rate/Tiered Rate)
Fixed Amount (if PayScaleType = FA)
Hourly Rate (if PayScaleType = HR) - yes, could be merged into one field, but for reasons I won't go into here, I've kept them separate
Tiers (1->n relationship, with all the tiers and the amount to pay once you have gone over the tier threshold)
These pay rates apply to:
Default company rate
Employee rate
Employee override rate (defined per customer)
If I had to follow the simple brute force approach, I would have to create a PayRate and PayRateTier clone table for each of the 3 above tables, plus their corresponding Linq classes, plus logic to calculate the rates in 3 separate places, somehow refactoring to reuse the calculation logic. Ugh. That's like using copy and paste, just on the database.
So instead, what did I do? I created an intermediary table, which I called PayRatePackage, consisting only of an ID field. I have only one PayRate table with a mandatory FK to PayRatePackage, and a PayRateTier table with a mandatory FK to PayRate. Then, DefaultCompanyPayRate has a mandatory FK to PayRatePackage, as do EmployeeRate and EmployeeOverrideRate.
So simple - and it works!
(Pardon me for not attaching diagrams; that would be a lot of effort to go to for a SO question where I've already solved the main problem. If a lot of people want to see a diagram, please say so in the comments, and I'll throw something together.)
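For readers who want something more concrete than prose, here is a minimal DDL sketch of the structure described above (not the asker's actual script; the column names and types are assumptions, in generic ANSI-style SQL):

-- The shared "package" table: nothing but an ID that the owner tables point to.
CREATE TABLE PayRatePackage (
    PayRatePackageId INT PRIMARY KEY
);

-- A single PayRate table, regardless of who owns the package.
CREATE TABLE PayRate (
    PayRateId        INT PRIMARY KEY,
    PayRatePackageId INT NOT NULL REFERENCES PayRatePackage (PayRatePackageId),
    ServiceType      VARCHAR(50)   NOT NULL,
    PayScaleType     CHAR(2)       NOT NULL,  -- 'FA' = fixed amount, 'HR' = hourly, 'TR' = tiered
    FixedAmount      DECIMAL(10,2) NULL,      -- used when PayScaleType = 'FA'
    HourlyRate       DECIMAL(10,2) NULL       -- used when PayScaleType = 'HR'
);

-- Tiers hang off PayRate (1->n), used when PayScaleType = 'TR'.
CREATE TABLE PayRateTier (
    PayRateTierId       INT PRIMARY KEY,
    PayRateId           INT NOT NULL REFERENCES PayRate (PayRateId),
    TierThreshold       DECIMAL(10,2) NOT NULL,
    AmountOverThreshold DECIMAL(10,2) NOT NULL
);

-- Each owner table carries only a mandatory FK to a package; no cloned rate tables.
CREATE TABLE DefaultCompanyPayRate (
    DefaultCompanyPayRateId INT PRIMARY KEY,
    PayRatePackageId        INT NOT NULL REFERENCES PayRatePackage (PayRatePackageId)
);
-- EmployeeRate and EmployeeOverrideRate would carry the same mandatory FK.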
Now, I'm pretty sure that something this simple and effective must be in a formal design pattern somewhere, and I'd love to know what it is. Or did I just invent something new? :)

I'm pretty sure this is the Strategy Pattern
"Define a family of algorithms, encapsulate each one, and make them interchangeable. Strategy lets the algorithm vary independently from clients that use it."

Sounds like relational database design to me. You broke out specific logic into specific entities, and keyed them back to the original tables... Standard normalization...

Related

How to build a proper DB schema to have "periodic snapshots" of a table for a selected day? [closed]

Problem to be solved:
I'm new to databases and I'm trying to find out the best way to store changes in a table that is a daily snapshot of some statuses: e.g. a "hotel_room_rentals" table (with 20 columns, every one of which can change).
I'd like to be able to generate that table for a selected day (e.g. the data inside it changes in production, so I have to store it somewhere else), or do some other transformations on it (e.g. average number of days rented in a period).
My theoretical example - detailed:
Let's say that I'm creating a DB for a hotel.
In the production system I have a table that shows info for all 10 000 rooms in the hotel.
This is a daily snapshot - let's assume that the table is updated once per day.
Some attributes of a room change often: e.g. is_rented, customer_number, rate_usd.
Some attributes don't change too often: e.g. disabled_room, room_color, type_of_furniture.
Room_number obviously does not change (primary key).
Now I want to find the best way to track changes in this table, the best way to create statistics based on it (e.g. average number of days rented in a period), and to be able to generate the table for a selected date (e.g. 2013-01-01).
MY IDEA:
Since I have no clue about databases, my idea is to copy the whole table every day, with one more column called "DB_dump_date" (holding the date). This is a pretty straightforward approach, but it will probably require a lot of space, since my 10k-room table will have to be copied 365 times a year.
OTHER SOLUTIONS:
On some other website, it was recommended that I create two tables:
A "Reservation" table with these columns: Startdate, Enddate, Room, Rate, Occupant_name
Then to transform this table into a FactReservations table: Date, Room, Is_occupied, Rate, Occupant_name
I do not understand how this helps me... in fact I assume I would have to make 20 intermediary tables and then 20 fact tables (since I have 20 columns in my database).
QUESTIONS:
What are the recommended ways to deal with such problems?
Is there any DB schema that is prepared to deal with it, without the user making magic ETLs? (e.g. a DB that can optimize the problem by itself)
What are the alternatives?
How would you, smart people, do this? (preferably in MS Access... or some freeware technology)
edit:
one more thing - everything can change in the table, not only room reservations, everything; and I want to be able to track the changes
Stop - slow down - and take a breath.
Do not - repeat, do not - make copies of tables each day. This approach is way off base.
Your problem is a normalization problem. As you indicate, you have other suggestions on how to normalize - this is the direction you want to go.
Your goal will be to find a structure that accommodates the SQL statements that can answer your questions (and hopefully many more that you haven't thought up yet). This will be one static model where the tables do not change or get copied; only the data inside the tables changes. (Ideally, to me, there will also be few to no updates, only inserts.)
You will certainly need a ROOM table and a CUSTOMER table, and then a relation between them, possibly RESERVATION.
These can then fill up - and you can get all the answers to the questions you posed without any copying or materialization or anything... just SQL.
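As a rough illustration of that direction (not the answerer's actual schema; the column names and types below are assumptions), the static model could look something like this:

CREATE TABLE room (
    room_number       INT PRIMARY KEY,
    disabled_room     BOOLEAN,
    room_color        VARCHAR(30),
    type_of_furniture VARCHAR(50)
);

CREATE TABLE customer (
    customer_number INT PRIMARY KEY,
    customer_name   VARCHAR(100)
);

-- One row per stay; history accumulates as inserts, never as table copies.
CREATE TABLE reservation (
    reservation_id  INT PRIMARY KEY,
    room_number     INT NOT NULL REFERENCES room (room_number),
    customer_number INT NOT NULL REFERENCES customer (customer_number),
    start_date      DATE NOT NULL,
    end_date        DATE NOT NULL,
    rate_usd        DECIMAL(10,2)
);

-- Example: which rooms were rented on a selected day.
SELECT room_number, customer_number, rate_usd
FROM   reservation
WHERE  start_date <= DATE '2013-01-01'
  AND  end_date   >  DATE '2013-01-01';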
You need to focus on the requirements and start there. So far, the requirements I see are:
-Generate that table for a selected day
-average number of days rented in a period
If we consider two extremes of design, at the more complex end would be a datamart with SCD tables, tracking changes to rooms, and at the simple end would be some kind of log table, along the lines of what you have already mentioned.
Reading between the lines, I don't really see any requirement for knowing the attributes of a room on a given day, but I do see a requirement for analysis of historical transactions.
So my suggestion is have a good hard think about your requirements before you start designing the database.
There is no magic design to cover this automatically. Dimensional design is a standard way of modelling business data to allow for easy analysis, but it might be over the top for your requirement.
Welcome to the world of databases! With that in mind, take almost everything that you know about Excel and throw it out the window. In Excel it's much more difficult to define relationships between two sheets of a workbook and report off of those two sheets, so most of the time it's easier to simply copy the same data down a single sheet; in Access or any other relational database, doing it properly is trivially easy.
Typically what you’d want to do is create several normalized tables and define a relationship between them. Then, when querying the view, you can easily join between the tables to get the data that you need.
So, working off of the assumption that you’re building this for simple reporting and not to create a property management system (if you are looking at that – I’d recommend that you look at some of the players in the industry, like Micros or Agilysys), based on my experience working in the industry, I’d recommend the following table layout:
Reservations – this holds the reservation information (guest name, arrival date, departure date, check-in date, check-out date, rate if you use a blended rate, etc.)
Rooms – this holds information on your rack (number, wing code, max guests, # beds, smoking/non, view, type, etc.)
Room Status – only if you need to track whether a room is on reserve/hold/OOO/OTM (status type, date start, date end)
Room Status Types – types of room status holds and how each affects inventory (type, out-of-inventory flag)
Rates (if you don't use a blended rate) – one entry per reservation per night (guest, rate)
Personally, I’m a huge fan of using surrogate keys for the unique identifiers, because all too often I've been burned where something changes in the business process and a natural key that was previously unique all of a sudden can be duplicated. In that vein, each table would have a surrogate key and the joins would be as follows:
Reservations – Rooms (many to one)
Rooms – Room Status (one to many)
Room Status – Room Status Types (many to one)
Reservations – Rates (one to many)
If you define the relationships properly in Access (i.e. foreign key relationships in other DBMS), it should automatically use them to build your joins when creating your queries (called Views in just about every other DBMS) or reports.
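For instance (a hedged sketch, not part of the original answer; the table and column names are guesses based on the layout above), generating the "state of the hotel" for a selected day could be a single join:

SELECT r.RoomNumber,
       r.RoomType,
       res.GuestName,
       res.Rate
FROM   Rooms AS r
LEFT JOIN Reservations AS res
       ON  res.RoomID = r.RoomID
       AND DATE '2013-01-01' BETWEEN res.ArrivalDate AND res.DepartureDate
ORDER BY r.RoomNumber;

-- Rooms with no matching reservation come back with NULL guest columns,
-- i.e. they were vacant on the selected date.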
For learning about databases I’d recommend that you review:
Wikipedia on Join types
Wikipedia on Slowly Changing Dimensions (you could use some of these techniques to record changes in room information over time)
Wikipedia on Relational Databases
Office documentation on Access
Kimball Group Design Tips (great for data warehouse/datamart design)
If you need to use your existing table, then the following is not applicable. If the data can be migrated to a new schema, this will readily address the challenge. TRE is an approach which uses the current-view paradigm for development but fully supports the time dimensions of data (system time = when the data goes into the DB, and valid time = the business time which applies to the data). Working in the current-view approach of TRE makes this sort of problem straightforward. Take a look at: http://youtu.be/V1EcsuJxUno
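As a generic illustration of those two time dimensions (this is a hedged sketch of the general bitemporal idea, not a description of TRE itself; all names and types are invented):

-- Each row records both when the fact was true in the business (valid time)
-- and when the row was written to the database (system time).
CREATE TABLE room_state_history (
    room_number     INT       NOT NULL,
    is_rented       BOOLEAN,
    customer_number INT,
    rate_usd        DECIMAL(10,2),
    valid_from      DATE      NOT NULL,   -- business time: state applies from this day
    valid_to        DATE      NOT NULL,   -- business time: state applies up to (excluding) this day
    recorded_at     TIMESTAMP NOT NULL,   -- system time: when this row was inserted
    PRIMARY KEY (room_number, valid_from, recorded_at)
);

-- "What did room 101 look like on 2013-01-01?"
SELECT *
FROM   room_state_history
WHERE  room_number = 101
  AND  valid_from <= DATE '2013-01-01'
  AND  valid_to   >  DATE '2013-01-01';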

Fact Table Design Suggestions

To keep things simple: I have a transactional system which logs instant messages between a doctor and a patient. At the end of each session between the doctor and the patient, the doctor fills out an outcome form, which is stored in a DimOutcome table that looks like this:
DimOutcome
----------
PK_OutcomeKey
OutcomeCategory1
OutcomeCategory2
OutcomeCategory3
...
I'm looking for the best way to design the fact table which will track messages. One thing that needs to be taken into consideration is that sometimes chat sessions can go unanswered (i.e. out-of-hours contact) and then be followed up later.
What would be the ideal way to design a FactMessage table, taking into consideration that I need to track the DimOutcome with each chat session?
I'm thinking I will need to create one fact for messages and another for the overall session - would this be the only way? I would also like to track the amount of time between each message, and the total session time.
the fact table which will track messages
First, be aware that a fact table usually contains data that can be aggregated: measured facts. Dimensions are used to filter the data in the fact table. Anything else doesn't make much sense in data warehousing. Maybe a normalized database model would be better for your needs.
One thing that needs to be taken into consideration is that sometimes chat sessions can go unanswered
That, for example, would go in a dimension, i.e. DimSession, holding attributes of all sessions such as the status, e.g. unanswered. Note that other attributes of the session, like the participants, might be in dimensions DimDoctor and DimPatient.
You also said that you want to track the "DimOutcome". Here are two possibilities. First, you save this information in the "session" dimension, so you can filter your fact table for the different outcomes.
The other possibility is to have a column for each outcome in your fact table, so that you have the number of sessions per outcome. That would at least be something measurable.
What you have to consider here is the granularity of your fact table. Does it have one entry per session or per day? One entry per session may not be the best choice if you go with outcome columns in your fact table, since you could also get that information by filtering on DimSession and doing a COUNT(*) on your fact table.
I'm thinking I will need to create one fact for messages and another for the overall session, would this be the only way?
I think this whole data-warehousing thing isn't what you are looking for. A normalized data structure would be better for your needs.
If you want to know more about it, google for star schema or snowflake schema if you want to get an idea, how data-warehousing is usually realized.
A very shortened star schema...
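To give that a concrete shape (a hedged sketch; apart from DimOutcome, every table and column name below is invented for illustration), a very shortened star schema in SQL could look like:

-- Dimensions: descriptive attributes used to filter and group.
CREATE TABLE DimDate    (DateKey    INT PRIMARY KEY, FullDate      DATE);
CREATE TABLE DimDoctor  (DoctorKey  INT PRIMARY KEY, DoctorName    VARCHAR(100));
CREATE TABLE DimPatient (PatientKey INT PRIMARY KEY, PatientName   VARCHAR(100));
CREATE TABLE DimSession (SessionKey INT PRIMARY KEY, SessionStatus VARCHAR(20));  -- e.g. 'answered', 'unanswered'

-- Fact table at a grain of one row per session; the last two columns are the measures.
CREATE TABLE FactSession (
    DateKey        INT NOT NULL REFERENCES DimDate    (DateKey),
    DoctorKey      INT NOT NULL REFERENCES DimDoctor  (DoctorKey),
    PatientKey     INT NOT NULL REFERENCES DimPatient (PatientKey),
    SessionKey     INT NOT NULL REFERENCES DimSession (SessionKey),
    OutcomeKey     INT NOT NULL REFERENCES DimOutcome (PK_OutcomeKey),  -- assumes PK_OutcomeKey is DimOutcome's primary key
    MessageCount   INT,
    SessionMinutes DECIMAL(9,2)
);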

Best way to calculate sum depending on dates with SQL

I don't know a good way to maintain sums depending on dates in a SQL database.
Take a database with two tables:
Client
clientID
name
overdueAmount
Invoice
clientID
invoiceID
amount
dueDate
paymentDate
I need to produce a list of the clients ordered by overdue amount (the sum of the client's unpaid, past-due invoices). On a big database it isn't possible to calculate this in real time.
The problem is maintaining an overdue amount field on the client. The value of this field can change at midnight from one day to the next even if nothing changed on the client's invoices.
This sum changes if an invoice is paid, if a new invoice is created whose due date is already past, or if a due date is past today but wasn't yesterday...
The only solution I found is to recalculate this field every night, for every client, by summing the invoices that meet the conditions. But that's not efficient on very big databases.
I think it's a common problem, and I would like to know if a best practice exists.
You should read about data warehousing. It will help you to solve this problem. It sounds similar to what you just said:
"The only solution I found is to recalculate every night this field
on every client by summing the invoices respecting the conditions. But
it's not efficient on very big databases."
But there is more to it than that. When you read about it, try to forget about normalization. Its main intention is to 'show' data, not 'manage' data. So you may feel weird at the beginning, but once you understand why we need data warehousing, it becomes very, very interesting.
This is a book that can be a good start http://www.amazon.com/Data-Warehouse-Toolkit-Complete-Dimensional/dp/0471200247 , classic one.
Firstly, I'd like to understand what you mean by "very big databases" - most RDBMS systems running on decent hardware should be able to calculate this in real time for anything less than hundreds of millions of invoices. I speak from experience here.
Secondly, "best practice" is one of those expressions that mean very little - it's often used to present someone's opinion as being more meaningful than simply an opinion.
In my opinion, by far the best option is to calculate it on the fly.
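A minimal sketch of that on-the-fly calculation, assuming the Client/Invoice columns from the question and that an unpaid invoice is one with paymentDate IS NULL:

SELECT c.clientID,
       c.name,
       COALESCE(SUM(i.amount), 0) AS overdueAmount
FROM   Client  AS c
LEFT JOIN Invoice AS i
       ON  i.clientID = c.clientID
       AND i.paymentDate IS NULL        -- not yet paid
       AND i.dueDate < CURRENT_DATE     -- and already past due
GROUP BY c.clientID, c.name
ORDER BY overdueAmount DESC;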
If your database is so big that you really can't do this, I'd consider a nightly batch (as you describe). Nightly batch runs are a pain - especially for systems that need to be available 24/7, but they have the benefit of keeping all the logic in a single place.
If you want to avoid nightly batches, you can use triggers to populate an "unpaid_invoices" table. When you create a new invoice record, a trigger copies that invoice to the "unpaid_invoices" table; when you update the invoice with a payment, and the payment amount equals the outstanding amount, you delete from the unpaid_invoices table. By definition, the unpaid_invoices table should be far smaller than the total number of invoices; calculating the outstanding amount for a given customer on the fly should be okay.
However, triggers are nasty, evil things, with exotic failure modes that can stump the unsuspecting developer, so only consider this if you have a ninja SQL developer on hand. Absolutely make sure you have a SQL query which checks the validity of your unpaid_invoices table, and ideally schedule it as a regular task.
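A rough, T-SQL-flavoured sketch of that trigger approach (the unpaid_invoices table, the trigger names, and the rule "paid means paymentDate is set" are assumptions; adapt to your DBMS and payment model):

-- Mirror table holding only invoices that are not yet paid.
CREATE TABLE unpaid_invoices (
    invoiceID INT PRIMARY KEY,
    clientID  INT NOT NULL,
    amount    DECIMAL(12,2) NOT NULL,
    dueDate   DATE NOT NULL
);

-- New invoices are copied in...
CREATE TRIGGER trg_invoice_insert ON Invoice AFTER INSERT AS
BEGIN
    INSERT INTO unpaid_invoices (invoiceID, clientID, amount, dueDate)
    SELECT invoiceID, clientID, amount, dueDate FROM inserted;
END;

-- ...and removed once the payment is recorded.
CREATE TRIGGER trg_invoice_paid ON Invoice AFTER UPDATE AS
BEGIN
    DELETE u
    FROM   unpaid_invoices AS u
    JOIN   inserted        AS i ON i.invoiceID = u.invoiceID
    WHERE  i.paymentDate IS NOT NULL;
END;

-- Validity check to schedule regularly: rows that are out of sync in either direction.
SELECT i.invoiceID
FROM   Invoice AS i
LEFT JOIN unpaid_invoices AS u ON u.invoiceID = i.invoiceID
WHERE  (i.paymentDate IS NULL     AND u.invoiceID IS NULL)
   OR  (i.paymentDate IS NOT NULL AND u.invoiceID IS NOT NULL);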

Is it preferred to use end-time or duration for events in sql? [closed]

My gut tells me that start time and end time would be better than start time and duration in general, but I'm wondering if there are some concrete advantages or disadvantages to the differing methods.
The advantage I see for start time and end time is that if you want to find all events active during a certain time period, you don't have to look outside that time period.
(this is for events that are not likely to change much after initial input and are tied to a specific time, if that makes a difference)
I do not see it as a preference or a personal choice. Computer Science is, well, a science, and we are programming machinery, not a sensitive child.
Re-inventing the Wheel
Entire books have been written on the subject of Temporal Data in Relational Databases, by giants of the industry. Codd has passed on, but his colleague and co-author C J Date, and more recently H Darwen, carry on the work of progressing and refining the Relational Model, in The Third Manifesto. The seminal book on the subject is Temporal Data & the Relational Model by C J Date, Hugh Darwen, and Nikos A Lorentzos.
There are many who post opinions and personal choices re CS subjects as if they were choosing ice cream. This is due to not having had any formal training, and thus treating their CS task as if they were the only person on the planet who had come across that problem, and found a solution. Basically they re-invent the wheel from scratch, as if there were no other wheels in existence. A lot of time and effort can be saved by reading technical material (that excludes Wikipedia and MS publications).
Buy a Modern Wheel
Temporal Data has been a problem that has been worked with by thousands of data modellers following the RM and trying to implement good solutions. Some of them are good and others not. But now we have the work of giants, seriously researched, and with solutions and prescribed treatment provided. As before, these will eventually be implemented in the SQL Standard. PostgreSQL already has a couple of the required functions (the authors are part of TTM).
Therefore we can take those solutions and prescriptions, which will be (a) future-proofed and (b) reliable (unlike the thousands of not-so-good Temporal databases that currently exist), rather than relying on either personal opinion, or popular votes on some web-site. Needless to say, the code will be much easier as well.
Inspect Before Purchase
If you do some googling, beware that there are also really bad "books" available. These are published under the banner of MS and Oracle, by PhDs who spend their lives at the ice cream parlour. Because they did not read and understand the textbooks, they have a shallow understanding of the problem, and invent quite incorrect "solutions". Then they proceed to provide massive solutions, not to Temporal data, but to the massive problems inherent in their "solutions". You will be locked into problems that have already been identified and solved, and into implementing triggers and all sorts of unnecessary code. Anything available free is worth exactly the price you paid for it.
Temporal Data
So I will try to simplify the Temporal problem, and paraphrase the guidance from the textbook, for the scope of your question. Simple rules, taking both Normalisation and Temporal requirements into account, as well as usage that you have not foreseen.
First and foremost, use the correct datatype for any kind of Temporal column. That means DATETIME or SMALLDATETIME, depending on the resolution and range that you require. Where only the DATE or TIME portion is required, you can use that. This allows you to perform date & time arithmetic using SQL functions, directly in your WHERE clause.
Second, make sure that you use really clear names for the columns and variables.
There are three types of Temporal Data. It is all about categorising them properly, so that the treatment (planned and unplanned) is easy (which is why yours is a good question, and why I provide a full explanation). The advantage is much simpler SQL using inline date/time functions (you do not need the planned Temporal SQL functions). Always store:
Instant as SMALL/DATETIME, eg. UpdatedDtm
Interval as INTEGER, clearly identifying the Unit in the column name, eg. IntervalSec or NumDays
There are some technicians who argue that Interval should be stored in DATETIME, regardless of the component being used, as (eg) seconds or months since midnight 01 Jan 1900, etc. That is fine, but requires more unwieldy (not complex) code both in the initial storage and whenever it is extracted.
Whatever you choose, be consistent.
Period or Duration. This is defined as the time period between two separate Instants. Storage depends on whether the Period is conjunct or disjunct.
For conjunct Periods, as in your Event requirement: use one SMALL/DATETIME for EventDateTime; the end of the Period can be derived from the beginning of the Period of the next row, and EndDateTime should not be stored.
For disjunct Periods, with gaps in-between: yes, you need 2 x SMALL/DATETIMEs in the same row, eg. a RentedFrom and a RentedTo.
Period or Duration across rows merely need the ending Instant to be stored in some other row. ExerciseStart is the Event.DateTime of the X1 Event row, and ExerciseEnd is the Event.DateTime of the X9 Event row.
Therefore Period or Duration stored as an Interval is simply incorrect, not subject to opinion.
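To make the conjunct case concrete (a hedged sketch; the Event table and column names below are invented for illustration), the end of each Period is simply derived from the start of the next row, rather than stored:

-- One row per Event; EndDateTime is not stored for conjunct Periods.
SELECT EventId,
       EventDateTime AS PeriodStart,
       LEAD(EventDateTime) OVER (ORDER BY EventDateTime) AS PeriodEnd   -- start of the next row
FROM   Event;

-- Equivalent without window functions (correlated subquery):
SELECT e.EventId,
       e.EventDateTime AS PeriodStart,
       (SELECT MIN(e2.EventDateTime)
        FROM   Event AS e2
        WHERE  e2.EventDateTime > e.EventDateTime) AS PeriodEnd
FROM   Event AS e;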
Data Duplication
Separately, in a Normalised database, ie. where EndDateTime is not stored (unless disjoint, as per above), storing a datum that can be derived will introduce an Update Anomaly where there was none.
with one EndDateTime, you have one version of the truth in one place; whereas with duplicated data, you have a second version of the fact in another column:
which breaks 1NF
the two facts need to be maintained (updated) together, transactionally, and are at the risk of being out of synch
different queries could yield different results, due to two versions of the truth
All easily avoided by maintaining the science. The return (insignificant increase in speed of single query) is not worth destroying the integrity of the data for.
Response to Comments
could you expand a little bit on the practical difference between conjunct and disjunct and the direct practical effect of these concepts on db design? (as I understand the difference, the exercise and temp-basal in my database are disjunct because they are distinct events separated by whitespace.. whereas basal itself would be conjunct because there's always a value)
Not quite. In your Db (as far as I understand it so far):
All the Events are Instants, not conjunct or disjunct Periods
The exceptions are Exercise and TempBasal, for which the ending Instant is stored, and therefore they have Periods, with whitespace between the Periods; thus they are disjunct.
I think you want to identify more Durations, such as ActiveInsulinPeriod and ActiveCarbPeriod, etc, but so far they only have an Event (Instant) that is causative.
I don't think you have any conjunct Periods (there may well be, but I am hard-pressed to identify any). I retract what I said (when they were Readings, they looked conjunct, but we have progressed).
For a simple example of conjunct Periods that we can work with re practical effect, please refer to this time-series question. The text and perhaps the code may be of value, so I have linked the Q/A, but I particularly want you to look at the Data Model. Ignore the three implementation options; they are irrelevant to this context.
Every Period in that database is Conjunct. A Product is always in some Status. The End-DateTime of any Period is the Start-DateTime of the next row for the Product.
It entirely depends on what you want to do with the data. As you say, you can filter by end time if you store that. On the other hand, if you want to find "all events lasting more than an hour" then the duration would be most useful.
Of course, you could always store both if necessary.
The important thing is: do you know how you're going to want to use the data?
EDIT: Just to add a little more meat, depending on the database you're using, you may wish to consider using a view: store only (say) the start time and duration, but have a view which exposes the start time, duration and computed end time. If you need to query against all three columns (whether together or separately) you'll want to check what support your database has for indexing a view column. This has the benefits of convenience and clarity, but without the downside of data redundancy (having to keep the "spare" column in sync with the other two). On the other hand, it's more complicated and requires more support from your database.
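A minimal sketch of that view idea, assuming a table that stores only the start time and duration (the names and the PostgreSQL-flavoured interval arithmetic are assumptions; whether the computed column can be indexed depends on your DBMS):

-- The base table stores only the two "real" facts.
CREATE TABLE event (
    event_id         INT PRIMARY KEY,
    start_time       TIMESTAMP NOT NULL,
    duration_minutes INT       NOT NULL
);

-- The view exposes the derived end time, so queries can use whichever column is convenient.
CREATE VIEW event_with_end AS
SELECT event_id,
       start_time,
       duration_minutes,
       start_time + duration_minutes * INTERVAL '1 minute' AS end_time
FROM   event;

-- e.g. "all events lasting more than an hour that were still active at 09:00":
SELECT *
FROM   event_with_end
WHERE  duration_minutes > 60
  AND  TIMESTAMP '2013-01-01 09:00' BETWEEN start_time AND end_time;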
End - Start = Duration.
One could argue you could even use End and Duration, so there really is no difference in any of the combinations.
Except for the triviality that you need a column stored in order to filter on it, so include:
duration: if you need to filter by duration of execution time
start + end: if you need to trap for events that both start and end within a timeframe

Which table design is better for a Balance that is broken up into multiple parts?

I have a database where the Balance and Payments need to be broken down into different "money buckets" to show how they are allocated. For example, there is principal, interest, late fees, bounced check fees, Misc, etc. There are up to 10 different money buckets.
Which of these two methods is the better way of designing a database for this, and why?
Option A
PAYMENTS
AccountId
// Other payment-related columns
TotalPaid
PrincipalPaid
InterestPaid
MiscPaid
BadCheckChargesPaid
...
Option B
PAYMENTS
AccountId
// Other payment-related columns
TotalPaid
PAYMENT_DETAILS
PaymentId
PaymentTypeId
AmountPaid
In most cases, only 1-3 of the different balance types are used.
Option B is the better normalized, more flexible option (easy to add a new bucket later) and would get my vote.
While the normalization fairy can often tempt you in the direction of the latter (as it does me), the former is probably the more sensible. You're only talking about 10 columns (not 500), and there are no normalization rules that are really being broken. Unless there's a strong possibility that this list of payment allocation buckets will grow, I would stay away from the EAV-style structure just because of the headaches (and innumerable joins in some queries) that it can produce.
Option B seems better to me. A clincher would be whether your application is designed to show the details like this:
Item Amount
-------------- ---------------
Principal $10.00
Interest $1.11
If so, the normalized version is not only "righter" but actually stores the data in a format closer to what your application requires.
To me, the big question is whether you store the payment total in the payment record or derive it from the details.
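For what it's worth, a hedged sketch of deriving the total from the details under Option B (a PaymentId on PAYMENTS and a PAYMENT_TYPES lookup table are assumptions beyond the columns listed above):

-- Itemised breakdown for one payment, matching the report layout shown earlier.
SELECT pt.PaymentTypeName AS Item,
       pd.AmountPaid      AS Amount
FROM   PAYMENT_DETAILS AS pd
JOIN   PAYMENT_TYPES   AS pt ON pt.PaymentTypeId = pd.PaymentTypeId
WHERE  pd.PaymentId = 12345;

-- Deriving TotalPaid rather than storing it on PAYMENTS:
SELECT p.PaymentId,
       SUM(pd.AmountPaid) AS TotalPaid
FROM   PAYMENTS        AS p
JOIN   PAYMENT_DETAILS AS pd ON pd.PaymentId = p.PaymentId
GROUP  BY p.PaymentId;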