Applying business rules at the database level - SQL

I'm working on a project in which we will need to determine certain types of statuses for a large body of people, stored in a database. The business rules for determining these statuses are fairly complex and may change.
For example,
if a person is part of group X
and (if they have attribute O) has either attribute P or attribute Q,
or (if they don't have attribute O) has attribute P but not Q,
and don't have attribute R,
and aren't part of group Y (unless they also are part of group Z),
then status A is true.
Multiply by several dozen statuses and possibly hundreds of groups and attributes. The people, groups, and attributes are all in the database.
Though this will be consumed by a Java app, we also want to be able to run reports directly against the database, so it would be best if the set of computed statuses were available at the data level.
Our current design plan, then, is to have a table or view that consists of a set of boolean flags (hasStatusA? hasStatusB? hasStatusC?) for each person. This way, if I want to query for everyone who has status C, I don't have to know all of the rules for computing status C; I just check the flag.
(Note that, in real life, the flags will have more meaningful names: isEligibleForReview?, isPastDueForReview?, etc.).
So a) is this a reasonable approach, and b) if so, what's the best way to compute those flags?
Some options we're considering for computing flags:
Make the set of flags a view, and calculate the flag values from the underlying data in real time using SQL or PL/SQL (this is an Oracle DB). This way the values are always accurate, but performance may suffer, and the rules would have to be maintained by a developer. (A rough sketch of what such a view might look like follows these options.)
Make the set of flags consist of static data, and use some type of rules engine to keep those flags up-to-date as the underlying data changes. This way the rules can be maintained more easily, but the flags could potentially be inaccurate at a given point in time. (If we go with this approach, is there a rules engine that can easily manipulate data within a database in this way?)
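For illustration, the first option might look roughly like this; every table, column, and rule below is hypothetical and deliberately simplified:
CREATE OR REPLACE VIEW person_status_flags AS
SELECT p.person_id,
       CASE
         WHEN EXISTS (SELECT 1 FROM group_membership g
                       WHERE g.person_id = p.person_id
                         AND g.group_code = 'X')
          AND NOT EXISTS (SELECT 1 FROM person_attribute a
                           WHERE a.person_id = p.person_id
                             AND a.attr_code = 'R')
         THEN 1 ELSE 0
       END AS has_status_a   -- only a fragment of the real rule for status A
  FROM people p;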

In a case like this I suggest applying Ward Cunningham's question: ask yourself "What's the simplest thing that could possibly work?"
In this case, the simplest thing might be to come up with a view that looks at the data as it exists and does the calculations and computations to produce all the fields you care about. Now, load up your database and try it out. Is it fast enough? If so, good - you did the simplest possible thing and it worked out fine. If it's NOT fast enough, good - the first attempt didn't work, but you've got the rules mapped out in the view code. Now you can go on to try the next iteration of "the simplest thing" - perhaps you write a background task that watches for inserts and updates and then jumps in to recompute the flags. If that works, fine and dandy. If not, go to the next iteration...and so on.
Share and enjoy.

I would advise against making the statuses column names; rather, use a status ID and value, such as a customer status table with columns of ID and Value.
I would have two methods for updating statuses. The first is a stored procedure that either has all the logic or calls separate stored procs to figure out each status; you could make this dynamic by having a function for each status evaluation, which the one stored proc then calls. The second is to have whatever stored proc(s) update user info call a stored proc that updates all of the user's statuses based on the current data. These two methods give you real-time updates for the data that changed, and if you add a new status, you can call the first method to update all statuses with the new logic.
Hopefully you have one point of updates to the user data, such as a user update stored proc, and you can put the status update stored proc call in that procedure. This would also save having to schedule a task every n seconds to update statuses.
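As a rough illustration of that two-pronged setup in PL/SQL (the question mentions Oracle): the table, function, and status names below are all hypothetical, and each eval_status_* function is assumed to return 1 or 0 for one rule.
CREATE OR REPLACE PROCEDURE update_person_statuses (p_person_id IN NUMBER) AS
BEGIN
  -- One MERGE per status; each eval_status_* function encapsulates one rule.
  MERGE INTO person_status ps
  USING (SELECT p_person_id                AS person_id,
                'A'                        AS status_id,
                eval_status_a(p_person_id) AS status_value
           FROM dual) src
     ON (ps.person_id = src.person_id AND ps.status_id = src.status_id)
   WHEN MATCHED THEN UPDATE SET ps.status_value = src.status_value
   WHEN NOT MATCHED THEN INSERT (person_id, status_id, status_value)
                         VALUES (src.person_id, src.status_id, src.status_value);
  -- ... repeat for statuses B, C, ... or drive it from a table of status codes.
END;
/
The user-update procedure then calls update_person_statuses as its last step, and a wrapper that loops over all people covers the "new status added, recompute everything" case.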

An option I'd consider would be for each flag to be backed by a deterministic function that returns the up-to-date value given the relevant data.
The function might not perform well enough, however, if you're calling it for many rows at a time (e.g. for reporting). So, if you're on Oracle 11g, you can solve this by adding virtual columns (search for "virtual column") to the relevant tables based on the function. The Result Cache feature should improve the performance of the function as well.
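A rough sketch of how that could look on 11g; the function body is a placeholder and all names are invented (RESULT_CACHE could also be added to the function declaration, per the Result Cache suggestion above):
-- A deterministic function wrapping one rule:
CREATE OR REPLACE FUNCTION is_eligible_for_review (p_person_id NUMBER)
  RETURN NUMBER DETERMINISTIC
AS
BEGIN
  -- evaluate the real business rule against the underlying tables here
  RETURN 0;
END;
/

-- Expose it as a virtual column so reports can filter on it like stored data:
ALTER TABLE people ADD (
  eligible_for_review NUMBER
    GENERATED ALWAYS AS (is_eligible_for_review(person_id)) VIRTUAL
);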

Related

SQL code to avoid the unnecessary start of a calculation if it was started by some other process?

I have a stored procedure that calculates some facts (say usp_calculate). It fills a cache-like table. The part of the table determined by the arguments of the procedure must be recalculated every 20 minutes. Basically, usp_calculate returns early if the cached data is fresh, or it spends, say, a minute calculating... and returns after that.
usp_calculate is shared by several outer procedures that need the data. How should I prevent starting the time-consuming part of the procedure if it was already started by some other process? How can I implement a kind of signaling and waiting for the result instead of starting the calculation again?
Context: I have an SQL stored procedure named, say, usp_products. It ultimately performs a SELECT that returns rows with a product code, a product name, and calculated information -- a special price for the customer and for the storage location. There are too many combinations (customers, price lists, other conditions) to precalculate the information in a separate process. It must be calculated on demand, for each specific combination.
The third-party database that is the source of the information is not designed for detecting changes. In any case, the time condition (not older than 20 minutes) is considered good enough to treat the data as "fresh".
The building block for this is probably going to be application locks.
You can obtain an exclusive application lock using sp_getapplock. You can then determine if the data is "fresh enough" either using freshness information inherently contained in that data or a separate table that you use for tracking this.
At this point, if necessary, you refresh the data and update the freshness information.
Finally, you release the lock using sp_releaseapplock and let all of the other callers have their chance to acquire the lock and discover that the data is fresh.
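A bare-bones sketch of that pattern in T-SQL; the resource name, the cache_freshness table, and the usp_calculate_refresh procedure are hypothetical stand-ins for your own objects:
DECLARE @rc INT;
-- Serialize callers on a named resource; wait up to 60s for whoever holds it.
EXEC @rc = sp_getapplock @Resource = 'usp_calculate_cache',
                         @LockMode = 'Exclusive',
                         @LockOwner = 'Session',
                         @LockTimeout = 60000;
IF @rc >= 0  -- 0 or 1 means the lock was granted
BEGIN
    -- Re-check freshness AFTER acquiring the lock: another caller may have
    -- refreshed the data while we were waiting.
    IF NOT EXISTS (SELECT 1 FROM cache_freshness
                    WHERE cache_name = 'usp_calculate'
                      AND refreshed_at > DATEADD(MINUTE, -20, GETDATE()))
    BEGIN
        EXEC usp_calculate_refresh;  -- the slow part
        UPDATE cache_freshness
           SET refreshed_at = GETDATE()
         WHERE cache_name = 'usp_calculate';
    END
    EXEC sp_releaseapplock @Resource = 'usp_calculate_cache', @LockOwner = 'Session';
END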

Row-granularity operations/error handling

I have read many articles and posts about how a cursor is a massive performance hindrance compared to the equivalent single set-based query.
However, with a cursor, you are able to perform the desired operation successfully on all rows that did not err, and provide an error message for each row that did.
Is there some other way I can achieve this row granularity with set operations?
No. A set-based operation works - as the name tells us - with a set. It will succeed or fail as a whole.
A CURSOR (or any other procedural approach like WHILE or an external program) can be the best choice in this case.
If performance matters, I would prefer to use a tolerant staging table for the first set-based import. Then do some quality/cleaning actions there to ensure a successful transfer, and shift the cleaned data into your target tables (set-based).
This depends on the data, your business rules and - of course - the amount of rows.
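A rough outline of that staging approach, in SQL Server flavour (TRY_CONVERT needs 2012+; all table and column names are made up):
-- 1. Bulk-load into a permissive staging table (loose types, no constraints),
--    so the initial set-based insert cannot fail row by row.
-- 2. Mark the rows that would err, recording why:
UPDATE staging_orders
   SET error_message = 'amount is not numeric'
 WHERE TRY_CONVERT(DECIMAL(18,2), amount_text) IS NULL;
-- 3. Move only the clean rows into the target table, still set-based:
INSERT INTO orders (order_no, amount)
SELECT order_no, CONVERT(DECIMAL(18,2), amount_text)
  FROM staging_orders
 WHERE error_message IS NULL;
-- The rows left behind with a non-NULL error_message are your per-row error report.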

Ensuring business rule is not broken with multiple users

Suppose I have an order which has order lines, and we have a multi-user system where different users can add order lines. Orders are persisted to the database.
One business rule is that only 10 order lines can be created for an order.
What are my options to ensure that this rule is not broken? Should I check in memory and apply a lock or do I handle this in the database via procedures?
You have a bunch of options on how you can handle this, including: triggers, constraints, business rule logic, and data structure.
My preference is the following. Wrap all inserts/deletes/updates to orderlines in a stored procedure and only give the application layer access to the table through the stored procedure. This procedure can then enforce this and other business rules. It would also lock the table for updates while running, so only one user can change the table at any given instant.
A similar approach is to have an insert instead-of trigger on the table. This would cause the insert to fail when this (or any other) business rule is violated. The main concern with this is maintainability. One trigger on one table is fine. However, you can end up with a triggering nightmare if you start doing this for multiple tables with cascading inserts/deletes.
You can attempt to do this with a constraint, as one of the comments suggests. You would definitely want to test this for performance, because you have little control over how the constraint is implemented.
Also, you could have ten orderline columns in the order table. This would emphatically enforce this rule. But, there is a cost to having separate columns for each order line and having to deal with issues such as deletes.
Another option, not appropriate in this case, is to enforce the business rule at the application level. However, with multiple concurrent users, you should be doing this work in the database.
Assume the worst case, that a large number (aka lots more than 2) of users are all simultaneously attempting to add lines to a single order. (Even worse, assume they’re all trying to add varying numbers of lines, though never more than 10 at a time -- the GUI can control at least that much). This leads to scenarios like:
A checks is there space for her rows, gets ok
B checks is there space for his rows, gets ok
A adds her rows, life is good
B adds his rows, now there are 11
All of this took place within less than a tenth of a second
To avoid this, there needs to be some form of central “checkpoint” or regulator where all users go to determine if they can add rows. If they pass the checkpoint, they can add; if they don’t pass, they can’t. The checkpoint must be absolute: once approval is received/assigned, that decision impacts all subsequent checks (i.e. when you check and are granted the 10th line, no one else has previously been granted it, and no one else subsequently will be granted it).
There are several ways to implement this, and they all involve database transactions (ACID properties). Transactions must always be as brief as possible, to avoid blocking or deadlocking with other users. This can be tricky code, and for my money the best way to implement/control the process is via stored procedures (as Gordon Linoff said).
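A minimal sketch of such a checkpoint as a stored procedure, in SQL Server syntax; the table, column, and procedure names are hypothetical and error handling is abbreviated:
CREATE PROCEDURE add_order_line
    @order_id   INT,
    @product_id INT,
    @quantity   INT
AS
BEGIN
    SET NOCOUNT ON;
    BEGIN TRAN;

    -- UPDLOCK + HOLDLOCK serialize concurrent callers on this order's rows,
    -- so two sessions cannot both count 9 lines and then insert a 10th and 11th.
    DECLARE @line_count INT;
    SELECT @line_count = COUNT(*)
      FROM order_lines WITH (UPDLOCK, HOLDLOCK)
     WHERE order_id = @order_id;

    IF @line_count >= 10
    BEGIN
        ROLLBACK;
        RAISERROR('An order may have at most 10 order lines.', 16, 1);
        RETURN;
    END

    INSERT INTO order_lines (order_id, product_id, quantity)
    VALUES (@order_id, @product_id, @quantity);

    COMMIT;
END;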

How should I keep accurate records summarising multiple tables?

I have a normalized database and need to produce web-based reports frequently that involve joins across multiple tables. These queries are taking too long, so I'd like to keep the results computed so that I can load pages quickly. There are frequent updates to the tables I am summarising, and I need the summary to reflect all updates so far.
All tables have autoincrement integer primary keys, and I almost always add new rows; I can arrange to clear the computed results if they change.
I approached a similar problem where I needed a summary of a single table by arranging to iterate over each row in the table, keeping track of the iterator state and the highest primary key (i.e. "highwater") seen. That's fine for a single table, but for multiple tables I'd end up keeping one highwater value per table, and that feels complicated. Alternatively I could denormalise down to one table (with fairly extensive application changes), which feels like a step backwards and would probably change my database size from about 5GB to about 20GB.
(I'm using sqlite3 at the moment, but MySQL is also an option).
I see two approaches:
You move the data into a separate, denormalized database, adding some precalculation to optimize it for quick access and reporting (it sounds like a small data warehouse). This implies you have to think about some jobs (scripts, a separate application, etc.) that copy and transform the data from the source to the destination. Depending on the way you want the copying to be done (full/incremental), the frequency of copying, and the complexity of the data model (both source and destination), it might take a while to implement and then to optimize the process. It has the advantage that it leaves your source database untouched.
You keep the current database, but you denormalize it. As you said, this might imply changes in the logic of the application (but you might find a way to minimize the impact on the logic using the database; you know the situation better than me :) ).
Can the reports be refreshed incrementally, or is it a full recalculation to rework the report? If it has to be a full recalculation then you basically just want to cache the result set until the next refresh is required. You can create some tables to contain the report output (and a metadata table to define which report output versions are available), but most of the time this is overkill and you are better off just saving the query results off to a file or other cache store.
If it is an incremental refresh then you need the PK ranges to work with anyhow, so you would want something like your high water mark data (except you may want to store min/max pairs).
You can create triggers.
As soon as one of the calculated values changes, you can do one of the following:
Update the calculated field (Preferred)
Recalculate your summary table
Store a flag that a recalculation is necessary. The next time you need the calculated values check this flag first and do the recalculation if necessary
Example:
CREATE TRIGGER update_summary_table AFTER UPDATE OF order_value ON orders
BEGIN
  UPDATE summary
     SET total_order_value = total_order_value
                             - old.order_value
                             + new.order_value;
  -- OR: Do a complete recalculation
  -- OR: Store a flag
END;
More Information on SQLite triggers: http://www.sqlite.org/lang_createtrigger.html
In the end I arranged for a single program instance to make all database updates, and maintain the summaries in its heap, i.e. not in the database at all. This works very nicely in this case but would be inappropriate if I had multiple programs doing database updates.
You haven't said anything about your indexing strategy. I would look at that first - making sure that your indexes are covering.
Then I think the trigger option discussed is also a very good strategy.
Another possibility is the regular population of a data warehouse with a model suitable for high performance reporting (for instance, the Kimball model).

Design Question - Put hundreds of Yes/No switches in columns, rows, or other?

We are porting an old application that used a hierarchical database to a relational web app, and are trying to figure out the best way to port configuration switches (Y/N values).
Our old system had 256 distinct switches (per client) that were each stored as a bit in one of 8 32-bit data fields. Each client would typically have ~100 switches set. To read or set a switch, we'd use bitwise arithmetic using a #define value. For example:
if (a_switchbank4 & E_SHOW_SALARY_ON_CHECKS) //If true, print salary on check
We were debating which approach to use for storing switches in our new relational (MS-SQL) database:
Put each switch in its own field
Pros: fast and easy read/write/access - 1 row per client
Cons: seems kludgey, need to change schema every time we add a switch
Create a row per switch per client
Pros: unlimited switches, no schema changes necessary w/ new switches
Cons: slightly more arduous to pull data, lose intellisense w/o extra work
Maintain bit fields
Pros: same code can be leveraged, smaller XML data transmissions between machines
Cons: doesn't make any sense to our developers, hard to debug, too easy to use wrong 'switch bank' field for comparison
I'm leaning towards #1 ... any thoughts?
It depends on a few factors such as:
How many switches are set for each client
How many switches are actually used
How often switches are added
If I had to guess (and I would be guessing) I'd say what you really want are tags. One table has clients, with a unique ID for each, another has tags (the tag name and a unique ID) and a third has client ID / tag ID pairs, to indicate which clients have which tags.
This differs from your solution #2 in that tags are only present for the clients where that switch is true. In other words, rather than storing a client ID, a switch ID, and a boolean you store just a client ID and a switch ID, but only for the clients with that switch set.
This takes up about one third the space over solution number two, but the real advantage is over solutions one and three: indexing. If you want to find out things like which clients have switches 7, 45, and 130 set but not 86 or 14, you can do them efficiently with a single index on a tag table, but there's no practical way to do them with the other solutions.
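A small sketch of that tag layout and of the kind of query it makes easy (MS-SQL flavoured; all names are made up):
CREATE TABLE clients  (client_id INT PRIMARY KEY, name VARCHAR(100));
CREATE TABLE switches (switch_id INT PRIMARY KEY, name VARCHAR(100));
CREATE TABLE client_switches (
    client_id INT NOT NULL REFERENCES clients(client_id),
    switch_id INT NOT NULL REFERENCES switches(switch_id),
    PRIMARY KEY (client_id, switch_id)  -- also serves as the index used below
);

-- Clients with switches 7, 45 and 130 set, but neither 86 nor 14:
SELECT cs.client_id
  FROM client_switches cs
 WHERE cs.switch_id IN (7, 45, 130)
 GROUP BY cs.client_id
HAVING COUNT(*) = 3
   AND NOT EXISTS (SELECT 1
                     FROM client_switches x
                    WHERE x.client_id = cs.client_id
                      AND x.switch_id IN (86, 14));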
You could think about using database views to give you the best of each solution.
For example store the data as one row per switch, but use a view that pivots the switches (rows) into columns where this is more convenient.
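For instance, assuming a client_switches table with one row per (client_id, switch_id) pair, a view along these lines exposes the switches as columns where that is handier (the switch IDs and column names are invented):
CREATE VIEW client_switch_flags AS
SELECT client_id,
       MAX(CASE WHEN switch_id = 1 THEN 1 ELSE 0 END) AS show_salary_on_checks,
       MAX(CASE WHEN switch_id = 2 THEN 1 ELSE 0 END) AS some_other_switch
       -- ... one CASE per switch you want exposed as a column
  FROM client_switches
 GROUP BY client_id;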
I would go with option #2, one row per flag.
However, I'd also consider a mix of #1 and #2. I don't know your app, but if some switches are related, you could group those into tables where you have multiple columns of switches. You could group them based on use or type. You could, and probably would, still have a generic table with one switch per row for the ones that don't fit into the groups.
Remember too, if you change the method, you may have a lot of application code to change that relies on the existing method of storing the data. Whether you should change the method may depend on exactly how hard it will be and how many hours it will take to change everything associated. I agree with Markus' solution, but you do need to consider exactly how hard refactoring is going to be and whether your project can afford the time.
The refactoring book I've been reading suggests that you maintain both for a set time period, with triggers to keep them in sync, while you start fixing all the references. Then, on a set date, you drop the original columns (and the triggers) from the database. This lets you use the new method going forward, but gives you the flexibility that nothing will break before you get it fixed, so you can roll out the change before all references are fixed. It requires discipline, however, as it is easy not to get rid of the legacy code and columns because everything is working and you are afraid to touch it. If you are in the midst of a total redesign where everything will be tested thoroughly and you have the time built into the project, then go ahead and change everything at once.
I'd also lean toward option 1, but would also consider an option 4 in some scenarios.
4- Store in dictionary of name value pairs. Serialize to database.
I would recommend option 2. It's relatively straightforward to turn a list of tags/rows into a hash in the code, which makes it fairly easy to check variables. Having a table with 256+ columns seems like a nightmare.
One problem with option #2 is that having a crosstab query is a pain:
Client  S1  S2  S3  S4  S5 ...
A       X   X
B       X   X   X
But there are usually methods for doing that in a database-specific way.