Database table design for a user referencing a particular outcome - sql

Scenario:
The user inputs name, goal, and size. Goals are "long", "average", "short". Sizes are "big", "medium", "small". There are 3 tables. User stores name, goal, and size of each user. Final stores outcome based on goal/size combination. UserFinal serves as an associate table that stores the appropriate user with corresponding outcome. A certain combination of goal and size results in a specific outcome (refer to the table below) that is assigned to that user.
GOAL SIZE OUTCOME
Long Big 12
Long Medium 14
Long Small 18
Average Big 13
Average Medium 16
Average Small 19
Short Big 15
Short Medium 17
Short Small 20
Objective: Table Final should not grow and is only intended to serve as a 'lookup' table for when a User submits their goal and size (not sure if a table that isn't expected to grow is a good design?). The outcome value will be used for further calculations for each particular user.
Question: Is the illustrated table design the correct way to reflect this scenario? If it isn't, what is the best way for accomplishing this?

It seems Goal and Size identifies Outcome, and each User sets one Outcome.
See the following data model:
Sample data:
Goal
--------
L Long
A Average
S Short
Size
--------
B Big
M Medium
S Small
Outcome
--------
L B 12
L M 14
L S 18
A B 13
A M 16
A S 19
S B 15
S M 17
S S 20

I deleted my answer and am submitting another one, because the edits changed the clarity of the question significantly.
Have a final table purely as a look-up table does not seem like good db design.
I would do a simple one to many relationship.
Every user has many outcomes like so:
Users
user_id [pk]
name
Outcomes
outcome_id [pk]
user_id [fk]
goal
size
outcome

Related

How to use normalization to set levels of confidence between a rating and the number of ratings in Python or SQL?

I have a list of about 800 sales items that have a rating (from 1 to 5), and the number of ratings. I'd like to list the items that are most probable of having a "good" rating in an unbiased way, meaning that 1 person voting 5.0 isn't nearly as good as 50 people having voted and the rating of the item being a 4.5.
Initially I thought about getting the smallest amount of votes (which will be zero 99% of the time), and the highest amount of votes for an item on the list and factor that into the ratings, giving me a confidence level of 0 to 100%, however I'm thinking that this approach would be too simplistic.
I've heard about Bayesian probability but I have no idea on how to implement it. My list of items, ratings and number of ratings is on a MySQL view, but I'm parsing the code using Python, so I can make the calculations on either side (but preferably at the SQL view).
Is there any practical way that I can normalize this voting with SQL, considering the rating and number of votes as parameters?
|----------|--------|--------------|
| itemCode | rating | numOfRatings |
|----------|--------|--------------|
| 12330 | 5.00 | 2 |
| 85763 | 4.65 | 36 |
| 85333 | 3.11 | 9 |
|----------|--------|--------------|
I've started off trying to assign percentiles to the rating and numOfRatings, this way I'd be able to do normalization (sum them with an initial 50/50 weight). Here's the code I've attempted:
SELECT p.itemCode AS itemCode, (p.rating - min(p.rating)) / (max(p.rating) - min(p.rating)) AS percentil_rating,
(p.numOfRatings - min(p.numOfRatings)) / (max(p.numOfRatings) - min(p.numOfRatings)) AS percentil_qtd_ratings
FROM products p
WHERE p.available = 1
GROUP BY p.itemCode
However that's only bringing me a result for the first itemCode on the list, not all of them.
Clearly the issue here is the low number of observations your data has. Implementing Bayesian's method is the way to go because it provides great probability distribution for applications involving ratings especially if there is limited observations, and it easily decides the future likelihood ratio based on given parameters (this article provides an excellent explanation about Bayesian probability for beginners).
I would suggest storing your data in CSV files so it becomes easier to manipulate in python. Denormalizing the data via joins is the first task to do before analyzing your ratings.
This is Bayesian's simplified formula to use in your python code:
R – Confidence level aka number of observations
v – number of votes for a single product
C – avg vote for all products
m - tuneable parameter aka cutoff number required for votes to be considered (How many votes do you want displayed)
Since this is the simplified formula, this article explains how its been derived from its original formula. This article is helpful too in explaining the parameters.
Knowing the formula pretty much gets 50% of your work done, the rest is just importing your data and working with it. I provided below examples similar to your problem in case you need full demonstration:
Github example 1
Github example 2

How to calculate score for each entry based on criterion and select the ones with biggest score?

I want to write a small script which would scrap user comments from a small forum and store some statistic data in database. Basically I am interested in how often users are using different words in their comments.
Imagine the following schema:
word
id
text
user
id
username
user_words
word_id
user_id
count (how much times user used the word in all his comments on the forum)
After this I would like to create a query which would select me 5 similar users from database. To be more specific:
Imagine that we have an average frequency usage of following words in all comments of forum:
word1: 4%
word2: 1%
word3: 10%
...
For specific user the frequency of these words is:
word1: 15%
word2: 2%
word3: 1%
...
Programmatically I will find user words with the biggest difference from average values. In my case these words are:
word1 (15% over average 4%)
word3 (1% over average 10%)
Now I want to find the users with similar usage frequencies on word1 and word3. For example:
user1:
word1: 13%
word3: 2%
user2:
word1: 5%
word3: 9%
user1 is much more similar to our original user. The similarity can be calculated as deltas sum:
user1: |15% - 13%| + |1% - 2%| = 2 + 1 = 3
user2: |15 - 9%|+ |1% - 9%| = 6 + 8 = 14
The difference with user1 is much much smaller. So user1 is much more similar to original user.
Now. Imagine I have thousands of users in database. I would like to select 5 users (limit) ordering them by score (where score is calculated in a way I showed above).
The problem is that I have no idea which mechanisms exists in SQL in order to achieve this (DBMS doesn't matter). Could you please give me a peace of advice, which mechanism I should use in order to assign some virtual "score" to each user and select ordering by it? I just need to know in which direction I should go, what should I read about.
First things first would be to be able to count occurence of string right? What I would do is JOIN your words table to comment table, then call a scalar function to count the words per comment then just sum it up group by userid and wordid
Please see sample scalar function:
CREATE FUNCTION dbo.CountNoOfString
(
#searchString nvarchar(max),
#valueString nvarchar(max)
)
RETURNS INT
AS
BEGIN
return (LEN(#searchString)-LEN(REPLACE(#searchString,#valueString ,'')))/LEN(#valueString)
END
From there it would be easy to get the metrics that you need

Access 2010 doubling the sum in query

I know this question has been asked and answered. I understand the problem and I understand the underlying cause and I understand the solution. What I DON'T understand is how to implement the solution.
I'll try to be detailed....
Background: Each material is being grouped on WellID (I work in oil and gas) and SandType which is my primary key in each table, these come from 2 lookup tables one for each. (I work in oil and gas)
I have 3 tables that store material (sand)) weights at 3 different stages in the job process. Basically the weight from the engineer's DESIGN, what was DELIVERED and what is in INVENTORY.
I know that the join is messed up and adding the total for each row in each table. Sometimes double triple etc.
I am grouping on WellID and SandID.
Now I don't want someone to do the work for me. I just don't know how or where in access to restrict it to what I want, or if modifying t he sql the proper way to write the code. Current work around is 3 separate sum queries one for each table, but that is going to get inefficient and added steps.
My whole database purpose and subsequent reports hinge off math on these 3 numbers so, my show stopper here is putting the fat lady on stage, and is about to become a deal breaker at the end of the line! 0
I need some advice, direction, criticism, wisdom, witty euphemisms or a new job!
The 3 tables look as follows
Design:
T_DESIGN
DesignID WellID Sand_ID Weight_DES Time_DES
89 201 1 100 4/21/2014 6:46:02 AM
98 201 2 100 4/21/2014 7:01:22 AM
86 201 4 100 4/21/2014 6:28:01 AM
93 228 5 100 4/21/2014 6:53:34 AM
91 228 1 100 4/21/2014 6:51:23 AM
92 228 1 100 4/21/2014 6:53:30 AM
Delivered:
T_BOL
BOLID WellID_BOL SandID_BOL Weight_BOL
279 201 1 100
280 201 1 100
281 228 2 5
282 228 1 10
283 228 9 100
Inventory:
T_BIN
StrapID WellID_BIN SandID_BIN Weight_BIN
11 201 1 100
13 228 1 10
14 228 1 0
17 228 1 103
19 201 1 50
The Query Results:
Test Query99
WellID
WellID SandID Sum Of Weight_DES Sum Of Weight_BOL Sum Of Weight_BIN
201 1 400 400 300
228 1 600 60 226
SQL:
SELECT DISTINCTROW L_WELL.WellID, L_SAND.SandID,
Sum(T_DESIGN.Weight_DES) AS [Sum Of Weight_DES],
Sum(T_BOL.Weight_BOL) AS [Sum Of Weight_BOL],
Sum(T_BIN.Weight_BIN) AS [Sum Of Weight_BIN]
FROM ((L_SAND INNER JOIN
(L_WELL INNER JOIN T_DESIGN ON L_WELL.[WellID] = T_DESIGN.[WellID_DES])
ON L_SAND.SandID = T_DESIGN.[SandID_DES])
INNER JOIN T_BIN
ON (L_WELL.WellID = T_BIN.WellID_BIN)
AND (L_SAND.SandID = T_BIN.SandID_BIN))
INNER JOIN T_BOL
ON (L_WELL.WellID = T_BOL.WellID_BOL) AND (L_SAND.SandID = T_BOL.SandID_BOL)
GROUP BY L_WELL.WellID, L_SAND.SandID;
Two LooUp tables are for Well Names and Sand Types. (Well has been abbreviate do to size)
L_Well:
WellID WellName_WELL
3 AAGVIK 1-35H
4 AARON 1-22
5 ACHILLES 5301 41-12B
6 ACKLINS 6092 12-18H
7 ADDY 5992 43-21 #1H
8 AERABELLE 5502 43-7T
9 AGNES 1-13H
10 AL 5493 44-23B
11 ALDER 6092 43-8H
12 AMELIA FEDERAL 5201 41-11B
13 AMERADA STATE 1-16X
14 ANDERSMADSON 5201 41-13H
15 ANDERSON 1-13H
16 ANDERSON 7-18H
17 ANDRE 5501 13-4H
18 ANDRE 5501 14-5 3B
19 ANDRE SHEPHERD 5501 14-7 1T
Sand Lookup:
LSand
SandID SandType_Sand
1 100 Mesh
2 20/40 EP
3 20/40 RC
4 20/40 W
5 30/50 Ceramic
6 30/50 EP
7 30/50 RC
8 40/70 EP
9 40/70 W
10 NA See Notes
Querying and Joining Aggregation Data through an MS Access Database
I noticed your concern for pointers on how to implement some of the theory behind your aggregation queries. While SQL queries are good power-tools to get to the core of a difficult analysis problem, it might also be useful to show some of the steps on how to bring things together using the built-in design tools of MS Access.
This solution was developed on MS Access 2010.
Comments on Previous Solutions
#xQbert had a solid start with the following SQL statement. The sub query approach could be visualized as individual query objects created in Access:
FROM
(SELECT WellID, Sand_ID, Sum(weight_DES) as sumWeightDES
FROM T_DESGN) A
INNER JOIN
(SELECT WellID_BOL, Sum(Weight_BOL) as SUMWEIGHTBOL
FROM T_BOL B) B
ON A.Well_ID = B.WellID_BOL
INNER JOIN
(SELECT WellID_BIN, sum(Weight_Bin) as SumWeightBin
FROM T_BIN) C
ON C.Well_ID_BIN = B.Well_ID_BOL
Depending on the actual rules of the business data, the following assumptions made in this query may not necessarily be true:
Will the tables of T_DESIGN, T_BOL and T_BIN be populated at the same time? The sample data has mixed values, i.e., there are WellID and SandID combinations which do not have values for all three of these categories.
INNER type joins assume all three tables have records for each dimension value (Well-Sand combination)
#Frazz improved on the query design by suggesting that whatever is selected as the "base" joining table (T_DESIGN in this case), this table must be populated with all the relevant dimensional values (WellID and SandID combinations).
SELECT
WellID_DES AS WellID,
SandID_DES AS SandID,
SUM(Weight_DES) AS Weight_DES,
(SELECT SUM(Weight_BOL) FROM T_BOL WHERE T_BOL.WellID_BOL=d.WellID_DES
AND T_BOL.SandID_BOL=d.SandID_DES) AS Weight_BOL,
(SELECT SUM(Weight_BIN) FROM T_BIN WHERE T_BIN.WellID_BIN=d.WellID_DES
AND T_BIN.SandID_BIN=d.SandID_DES) AS Weight_BIN
FROM T_DESIGN;
(... note: a group-by statement should be here...)
This was animprovement because now all joins originate from a single point. If a key-value does not exist in either T_BOL or T_BIN, results will still come back and the entire record of the query would not be lost.
Again, it may be possible that there are no T_DESIGN records matching to values stored in the other tables.
Building Aggregation Sub Query Objects
The presented data does not suggest that there is any direct interaction between the data in each of the three tables aside from lining up their results in the end for presentation based on a common key-value pair (WellID and SandID). Since we are using Access, there is a chance to do these calculations separately.
This query was designed using the "summarizing" feature of the Access query design tool. It's output, after pointing to the T_DESIGN table looked like this:
Making Dimension Table Through a Cartesian Product
There are mixed opinions out there about cartesian products, but they do actually have a purpose.
Most of the concern is that a runaway cartesian product query will make millions and millions of nonsensical data values. In this query, it's specifically designed to simulate a real business condition.
The Case for a Cartesian Product
Picking from the sample data provided:
Some of the Sand Types: "20/40 EP", "30/50 Ceramic", "40/70 EP", and "30/50 RC" that are moved between their respective wells, are these sand types found at these wells consistently throughout the year?
Without an anchoring dimension for the key-values, Wells would not be found anywhere in the database via querying. It's not that they do not exist... it's just that there is no recorded data (i.e., Sand Type Weights delivered) for them.
A Reference Dimension Query Product
A dimension query is simple to produce. By referencing the two sources of keys: L_WELL and L_SAND (both look up tables or dimensional tables) without identifying a join condition, all the different combinations of the two key-values (WellID and SandID) are made:
The shortcut in SQL looks like this:
SELECT L_WELL.WellID, L_SAND.SandID, L_WELL.WellName, L_SAND.SandType
FROM L_SAND, L_WELL;
The resulting data looks like this:
Instead of using any of the operational data tables: T_DESIGN, T_BOL, or T_BIN as sources of data for a static dimension such as a list of Oil Wells, or a catalog of Sand Types, that data has been predetermined and can even be transferred to a real table since it probably will not change much once it is created.
Correlating Sub Query Results from Different Sources
After repeating the process and creating the summary tables for the other two sources (T_BOL and T_BIN), You can finally arrange the results through a simple query and join process.
The actual JOIN operations are between the dimension table/query: QSUB_WELL_SAND and all three of the summary queries: QSUB_DES, QSUB_BOL, and QSUB_BIN.
I have chosen to chosen to implement LEFT OUTER joins. If you are not sure of the difference between the different "outer" joins, this is the choice I made through the Access Query Design dialogue:
QSUB_WELL_SAND is defined as our anchor dimension. It will always have more records than any of the other tables. An OUTER JOIN should be defined to KEEP all reference dimension records... and all Summary Table query results, regardless if there is a match between the two Query results.
QSUB_WEIGHTS/ The Query to Combine All Sub Query Results
This is what the design of the final output query looks like:
This is what the data output looks like when this query design is executed:
Conclusions and Clean Up: Some Closing Thoughts
With respect to the join to the dimension query, there is a lot of empty space where there are no records or data to report on. This is where a cleverly placed filter or query criteria can shrink the output to exactly what you care to look at the most. Here's how mine looked after I added additional ending query criteria:
My data was based on what was supplied by the OP, except where the ID's assigned to the Well Type attribute did not match the sample data. The values I assigned instead are posted below as well.
Access supports a different style of database operations. Step-wise queries can be developed to hold pre-processed, special sets of data that can be reintroduced to the other data tables and query results to develop complex query criteria.
All this being said, Programming in SQL can also be just as rewarding. Be sure to explore some of the differences between the results and the capabilities you can tap into by using one approach (sql coding), the other approach (access design wizards) or both of the approaches. There's definitely a lot of room to grow and discover new capabilities from just the example provided here.
Hopefully I haven't stolen all the fun from developing a solution for your situation. I read into your comment about "building more on top" as the harbinger of more fun to come, so I don't feel so bad...! Happy Developing!
Data Modifications from the Sample Set
Without understanding L_SAND and L_WELL this is the best I could come up with..
use sub selects to get the sums first so you don't compound the data issues on the joins.
Select WellID, Sand_ID, sumWeightDES, WellID_BOL, SUMWEIGHTBOL,
WellID_BIN, SumWeightBin
FROM
(SELECT WellID, Sand_ID, Sum(weight_DES) as sumWeightDES
FROM T_DESGN) A
INNER JOIN
(SELECT WellID_BOL, Sum(Weight_BOL) as SUMWEIGHTBOL
FROM T_BOL B) B
ON A.Well_ID = B.WellID_BOL
INNER JOIN
(SELECT WellID_BIN, sum(Weight_Bin) as SumWeightBin
FROM T_BIN) C
ON C.Well_ID_BIN = B.Well_ID_BOL
I would simplify it excluding L_WELL and L_SAND. If you are just interestend in IDs, then they really shouldn't be necessary joins. If all the other 3 tables have the WellID and SandID columns, then pick the one that is sure to have all combos.
Supposing it's the Design table, then:
SELECT
WellID_DES AS WellID,
SandID_DES AS SandID,
SUM(Weight_DES) AS Weight_DES,
(SELECT SUM(Weight_BOL) FROM T_BOL WHERE T_BOL.WellID_BOL=d.WellID_DES AND T_BOL.SandID_BOL=d.SandID_DES) AS Weight_BOL,
(SELECT SUM(Weight_BIN) FROM T_BIN WHERE T_BIN.WellID_BIN=d.WellID_DES AND T_BIN.SandID_BIN=d.SandID_DES) AS Weight_BIN
FROM T_DESIGN
GROUP BY WellID, SandID;
... and make sure all your tables have an index on WellID and SandID.
Just to be clear. I dont' think it's a good idea to start the join from the lookup tables, or from their cartesian product. You can always left join them to fetch descriptions and other data. But the main query should be the one with all the combos of WellID and SandID... or if not all, at least the most. Things get difficult if none of the 3 tables (DESIGN, BOL and BIN) have all combos. In that case (and I'd say only in that case) then you might as well start with the cartesian product of the two lookup tables. You could also do a UNION, but I doubt that would be more efficient.

How to tally and store votes for a web site?

I am using SQL Server 2005.
I have a site that people can vote on awesome motorcycles. Each time a user votes, there is one for the first bike and one vote against the second bike. Two votes are stored in the database. The vote table looks like this:
VoteID VoteDate BikeID Vote
1 2012-01-12 123 1
2 2012-01-12 125 0
3 2012-01-12 126 0
4 2012-01-12 129 1
I want to tally the votes for each bike quite frequently, say each hour. My idea is to store the tally as a percentage of contest won versus lost on the bike table as an attribute of the bike. So, if a bike won 10 contests and lost 20 contest, they would have a score (tally) of 33. I would tally up daily, weekly, and monthly scores.
BikeID BikeName DailyTally WeeklyTally MonthlyTally
1 Big Dog 5 10 50
2 Big Cat 3 15 40
3 Small Dog 9 8 0
4 Fish Face 19 21 0
Right now, there are about 500 votes per day being cast. We anticipate 2500 - 5000 per day in the next month or so.
What is the best way to tally the data and what is the best way to store it? Should the tallies be on their own table? Should a trigger be used to run a new tally each time a bike is voted on? Should a stored procedure be run hourly to get all tallies?
Any ideas would be very helpful!
Store your VoteDate as a datetime value instead of just date.
For your tallies, you can just make that a view and calculate it on the fly. This should be very simple to do using GROUP BY and DATEPART functions. If you need exact code for how to do this, please open a new question.
For that low volume of rows it doesn't make any sense to store aggregations in a table when you can just calculate them whenever you want to see them and get accurate and immediate results that are up-to-date.
I agree with #JNK try a view or just a normal stored proc to calculate the outputs on the fly. If you find it becomes too slow as your data grows I would investigate other routes then (like caching the data in another table etc). Probably worth keeping it simple to start with; you can always resuse the logic from the SP/VIEW later if you do want to setup a scheduled task.
Edit :
Removed the index view as per #Damien_The_Unbeliever comments its not deterministic and i'm stupid :)

How should you separate dimension tables from fact tables if you are not building a data warehouse?

I realize that referring to these as dimension and fact tables is not exactly appropriate. I am at a lost for better terminology, so please excuse this categorization that I use in the post.
I am building an application for employee record keeping.
The database will contain organizational information. The information is mostly defined in three tables: Locations, Divisions, and Departments. However, there are others with similar problems. First, I need to store the available values for these tables. This will allow for available values in the application when managing an employee and for management of these values when adding/deleting departments and such. For instance, the Locations table may look like,
LocationId | LocationName | LocationStatus
1 | New York | Active
2 | Denver | Inactive
3 | New Orleans | Active
I then need to store these values for each employee and keep their history. My first thought was to create LocationHistory, DivisionHistory, and DepartmentHistory tables. I cannot pinpoint why, but this struck me as poor design. My next inclination was to create a DimLocation/FactLocation, DimDivision/FactDivision, DimDepartment/FactDepartment set of tables. I do not believe this makes sense either. I have also considered naming them as a combination of Employee, i.e. EmployeeLocations, EmployeeDivisions, etc. Regardless of the naming convention for these tables, I imagine that data would look similar to a simplified version I have below:
EmployeeId | LocationId | EffectiveDate | EndDate
1 | 3 | 2008-07-01 | NULL
1 | 2 | 2007-04-01 | 2008-06-30
I realize any of the imagined solutions I described above could work, but I am really looking to create a design that will be easy for others to maintain with an intuitive, familiar structure. I would like to receive this community's help, opinions, and experience with this matter. I am open to and would welcome any suggestion to consider. For instance, should I even store the available values for these three tables in the database? Should they be maintained in the application code/business logic layer? Do I just need to get over seeing the word History repeating three times?
Thanks!
Firstly, I see no issue in describing these as Dimension and Fact tables outside of a warehouse :)
In terms of conceptualising and understanding the relationships, I personally see the use of start/end dates perfectly easy for people to understand. Allowing Agent and Location fact tables, and then time dependant mapping tables such as Agent_At_Location, etc. They do, however, have issues worthy of taking note.
If EndDate is 2008-08-30, was the employee in that location UP TO 30th August, or UP TO and INCLUDING 30th August.
Dealing with overlapping date periods in queries can give messy queries, but more importantly, slow queries.
The first one seems simply a matter of convention, but it can have certain implications when dealign with other data. For example, consider that an EndDate of 2008-08-30 means that they ARE at that location UP TO and INCLUDING 30th August. Then you join on to their Daily Agent Data for that day (Such as when they Actually arrived at work, left for breaks, etc). You need to join ON AgentDailyData.EventTimeStamp < '2008-08-30' + 1 in order to include all the events that happened during that day.
This is because the data's EventTimeStamp isn't measured in days, but probably minutes or seconds.
If you consider that the EndDate of '2008-08-30' means that the Agent was at that Location UP TO but NOT INCLDUING 30th August, the join does not need the + 1. In fact you don't need to know if the date is DAY bound, or can include a time component or not. You just need TimeStamp < EndDate.
By using EXCLUSIVE End markers, all of your queries simplify and never need + 1 day, or + 1 hour to deal with edge conditions.
The second one is much harder to resolve. The simplest way of resolving an overlapping period is as follows:
SELECT
CASE WHEN TableA.InclusiveFrom > TableB.InclusiveFrom THEN TableA.InclusiveFrom ELSE TableB.InclusiveFrom END AS [NetInclusiveFrom],
CASE WHEN TableA.ExclusiveFrom < TableB.ExclusiveFrom THEN TableA.ExclusiveFrom ELSE TableB.ExclusiveFrom END AS [NetExclusiveFrom],
FROM
TableA
INNER JOIN
TableB
ON TableA.InclusiveFrom < TableB.ExclusiveFrom
AND TableA.ExclusiveFrom > TableB.InclusiveFrom
-- Where InclusiveFrom is the StartDate
-- And ExclusiveFrom is the EndDate, up to but NOT including that date
The problem with that query is one of indexing. The first condition TableA.InclusiveFrom < TableB.ExclusiveFrom could be be resolved using an index. But it could give a Massive range of dates. And then, for each of those records, the ExclusiveDates could all be just about anything, and certainly not in an order that could help quickly resolve TableA.ExclusiveFrom > TableB.InclusiveFrom
The solution I have previously used for that is to have a maximum allowed gap between InclusiveFrom and ExclusiveFrom. This allows something like...
ON TableA.InclusiveFrom < TableB.ExclusiveFrom
AND TableA.InclusiveFrom >= TableB.InclusiveFrom - 30
AND TableA.ExclusiveFrom > TableB.InclusiveFrom
The condition TableA.ExclusiveFrom > TableB.InclusiveFrom STILL can't benefit from indexes. But instead we've limitted the number of rows that can be returned by searching TableA.InclusiveFrom. It's at most only ever 30 days of data, because we know that we restricted the duration to a maximum of 30 days.
An example of this is to break up the associations by calendar month (max duration of 31 days).
EmployeeId | LocationId | EffectiveDate | EndDate
1 | 2 | 2007-04-01 | 2008-05-01
1 | 2 | 2007-05-01 | 2008-06-01
1 | 2 | 2007-06-01 | 2008-06-25
(Representing Employee 1 being in Location 2 from 1st April to (but not including) 25th June.)
It's effectively a trade off; using Disk Space to gain performance.
I've even seen this pushed to the extreme of not actually storing date Ranges, but storing the actual mapping for each and every day. Essentially, it's like restricting the maximum duration to 1 day...
EmployeeId | LocationId | EffectiveDate
1 | 2 | 2007-06-23
1 | 2 | 2007-06-24
1 | 3 | 2007-06-25
1 | 3 | 2007-06-26
Instinctively I initially rebelled against this. But in subsequent ETL, Warehousing, Reporting, etc, I actually found it Very powerful, adaptable, and maintainable. I actually saw people making fewer coding mistakes, writing code in less time, the code ending up running faster, and being much more able to adapt to clients' changing needs.
The only two down sides were:
1. More disk space taken (But trival compared to the size of fact tables)
2. Inserts and Updates to this mapping was slower
The actual slow down for Inserts and Updates only actually matter Once, where this model was being used to represent a constantly changing process net; where the app wanted to change the mapping about 30 times a second. Even then it worked, it just chomped up more CPU time than was ideal.
If you want to be efficient and keep a history, do these things. There are multiple solutions to this problem, but this is the one that I keep going back to:
Remember that each row represents a single entity, if you make corrections that entity, that's fine, but don't re-use and ID for a new Location. Set it up so that instead of deleting a Location, you mark it as deleted with a bit and hide it from the interface, that way when it's referenced historically, it's still there.
Create a history table that includes the current value, or no records if a value isn't currently set. Have the foreign key tie back to the employee and tie to the location.
Create a column in the employee table that points to the current active location in the history. When you need to get the employees location, you join in the history table based on this ID. When you need to get all of the history for an employee you join from the history table.
This structure keeps it all normalized, and gives you an easy way to find the current value without having to do any date comparisons.
As far as using the word history, think of it in different terms: since it contains the current item as well as historical items, it's really just a junction table that keeps around the old item. As such you can name it something like EmployeeLocations.