I have a data-intensive problem which requires a lot of massaging and data manipulation, and I'm putting this out there to see if anyone has an idea as to how to approach it.
In its simplest form: I have a lot of tables which can be joined together to give me a price listing for dentists and how much each charges for a procedure.
So we have multiple tables that look like this:
Dentist | Procedure1 | Procedure2 | Procedure3 | .........| Procedure?
John | 500 | 342 | 434 | .........| 843
Dave | 343 | 434 | 322 | NULLs....|
Mary | 500 | 342 | 434 | .........| 843
Linda | 500 | 342 | Null | .........| 843
Dentists can have a different number of procedures and different pricing for each procedure, but there are a lot of dentists that have the same set of procedures and the same rates that go with them. Internally, we create a unique ID for each of these so-called fee listings.
For example, John would be 001 and Dave would be 002, but Mary would share fee ID 001 (her fees match John's) and Linda would be 003.
It wouldn't be so bad if I only had to deal with this data once, but these fee listings come in flat files (CSVs) which I basically have to DTS up to a SQL Server to work with, and they come on a monthly basis. The pricing can change from month to month for each dentist, which would then put them under a different unique ID internally.
Can someone shed some light on how best to approach this problem so that it's efficient to process on a monthly basis without tons of data manipulation?
What's the best approach to finding duplicate fee listings?
How do I keep track of updating a dentist's fee listing in case they change their rates the next month? If Mary decides to charge a different fee for Procedure2, she would get a different unique ID internally. How do I keep track of that on a monthly basis without having to delete everything and re-insert?
There are a few million fee listings that I'm working with; some follow standard rules based on zip codes and some are just unique fee listings. What's the approach here?
I could write some kind of ad-hoc .NET program to work with it, but it's a lot of data, and working directly in SQL Server would be easier for me.
Any help would be great, thanks.
You probably need to unpivot the data to normalize it - so that you end up with:
Doctor: DoctorID, DoctorDetails...
FeeSchedule: DoctorID, ScheduleID, EffectiveDate, OtherDetailAtThisLevel...
FeeScheduleDetail: ScheduleID, ProcedureCode, Fee, OtherDetailAtThisLevel...
When the data comes in for a doctor it arrives pivoted (wide); a new schedule is created, and the detail rows are created from the unpivoted data.
SSIS has an unpivot component which is fine - you would load the schedule first and then the detail. If the format varies significantly, you might need a custom data source or just avoid SSIS.
This system would keep track of new schedules for doctors. If the schedule is identical for a doctor, you could simply not insert it.
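A hedged T-SQL sketch of that "is it identical?" check, assuming the incoming month's rows have been landed in a table called Staging_FeeDetail and that @DoctorID / @CurrentScheduleID identify the doctor and their latest stored schedule (all of these names are illustrative, not from the original post):

-- Treat a schedule as a set of (ProcedureCode, Fee) pairs and compare the
-- staged rows against the current schedule with a two-way EXCEPT; if both
-- directions come back empty, the sets are identical.
IF NOT EXISTS (SELECT ProcedureCode, Fee
               FROM Staging_FeeDetail
               WHERE DoctorID = @DoctorID
               EXCEPT
               SELECT ProcedureCode, Fee
               FROM FeeScheduleDetail
               WHERE ScheduleID = @CurrentScheduleID)
   AND NOT EXISTS (SELECT ProcedureCode, Fee
                   FROM FeeScheduleDetail
                   WHERE ScheduleID = @CurrentScheduleID
                   EXCEPT
                   SELECT ProcedureCode, Fee
                   FROM Staging_FeeDetail
                   WHERE DoctorID = @DoctorID)
    PRINT 'Schedule unchanged - skip the insert';
ELSE
    PRINT 'Schedule changed - create a new FeeSchedule row plus details';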
If this logic is extensive, you could load the data to staging tables (SSIS or whatever) and do all this in SQL (T-SQL also has an UNPIVOT operator). That can have advantages in that the code is all in one place and can do all its operations in sets.
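For instance, a minimal T-SQL sketch of the unpivot step, assuming a wide staging table Staging_FeeListing with one column per procedure and a FeeSchedule header row already created for the load month (the table, columns, and @LoadMonth variable are assumptions):

-- Turn one wide row per dentist into one row per (dentist, procedure, fee).
-- UNPIVOT silently drops NULL columns, which handles dentists having
-- different numbers of procedures.
INSERT INTO FeeScheduleDetail (ScheduleID, ProcedureCode, Fee)
SELECT fs.ScheduleID, u.ProcedureCode, u.Fee
FROM Staging_FeeListing
     UNPIVOT (Fee FOR ProcedureCode
              IN (Procedure1, Procedure2, Procedure3)) AS u
JOIN FeeSchedule AS fs
  ON fs.DoctorID = u.DoctorID
 AND fs.EffectiveDate = @LoadMonth;   -- the schedule header loaded first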
Regarding the zip codes: if the doctor doesn't have a fee, are these like usual-and-customary fees? That could simply be determined from the zip code on the doctor row. In this case you have a few options. You can overlay the doctor fee schedule over a zip code fee schedule:
ZipCodeSchedule: ZipScheduleID, ZipCode, EffectiveDate
ZipCodeScheduleDetail: ZipScheduleID, ProcedureCode, Fee
Or you could save this in the regular FeeSchedule (potentially with some kind of flag noting that it was defaulted to the UCR).
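A sketch of the overlay option as a query, taking the doctor's own fee when one exists and falling back to the zip-code (UCR) fee otherwise (join columns such as Doctor.ZipCode are assumptions):

SELECT d.DoctorID,
       zd.ProcedureCode,
       COALESCE(fsd.Fee, zd.Fee) AS EffectiveFee   -- doctor fee wins when present
FROM Doctor AS d
JOIN ZipCodeSchedule       AS zs  ON zs.ZipCode       = d.ZipCode
JOIN ZipCodeScheduleDetail AS zd  ON zd.ZipScheduleID = zs.ZipScheduleID
LEFT JOIN FeeSchedule       AS fs  ON fs.DoctorID     = d.DoctorID
LEFT JOIN FeeScheduleDetail AS fsd ON fsd.ScheduleID  = fs.ScheduleID
                                  AND fsd.ProcedureCode = zd.ProcedureCode;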
Our organization is currently in the process of building a new data warehouse. We are actually able to use some techniques borrowed from the DW community, such as ETL processing to conform data, de-normalized dimensions in the Kimball style, and so on. Overall, data warehousing is still fairly new to our organization, but we are learning the concepts as we go along.
The problem: We have multiple sources of data, with often conflicting sources of facts. For example, we have a Master Person Index, where we use a score-based matching algorithm during ETL to match an inbound person to an existing person, so even if the inbound record doesn't exactly match, we can score based on other things like zip code radius.
Here's the question: What is the standard way to handle multiple versions of a fact from two or more sources?
I understand one of the main ideas of the data warehouse is to keep a running history of any fact, which we are doing. That's all fine and dandy when a record is maintained by one inbound source: we keep the history of that fact over time. The problem occurs when two different sources, each perhaps updating on a daily basis, have two different facts, e.g. source A says the name is Mary Smith while source B says the name is Mary Jane, changing the value every day! Based on the matching algorithm we're confident it's the same person, but because of our history-style table it keeps flopping back and forth between both names every day, since each source's value is read as a "change".
An example table:
first_name last_name source last_updated
Mary Smith A 5/2/12 1:00am
Mary Jane B 5/2/12 2:00am
Mary Smith A 5/3/12 1:00am
Mary Jane B 5/3/12 2:00am
Mary Smith A 5/4/12 1:00am
Mary Jane B 5/4/12 2:00am
...
Have one table that stores your external data:
id | first_name | last_name | source | external_unique_id | import_date
----+------------+-----------+--------+--------------------+-------------
1 | Mary | Smith | A | abcdefg123 | 5/2/12 1:00am
2 | Mary | Jane | B | 1234567abc | 5/2/12 2:00am
Then have a second table that contains your cleaned data:
id | first_name | last_name
----+------------+-----------
1 | Mary | Jane-Smith (or whatever)
Then have a mapping table between the two.
local_person_id | foreign_person_id
-----------------+-------------------
1 | 1
1 | 2
Or something broadly similar.
The objective is to load the facts from your source once, and keep them.
Then use your fuzzy logic to relate them to master records somewhere, which you only need to do when new facts are loaded or old facts change.
Still, you have a choice of which last_name to use, but that can be almost arbitrary in the absence of determining data. For example: pick the last name from the fact loaded most recently.
You can still quickly and simply relate the master to the child facts, to their sources, and to their corresponding data. But you have a unified entity in your warehouse to hang these external facts on.
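As a sketch of that relationship (using invented names external_person, cleaned_person, and person_map for the three tables above):

-- List every source row that feeds a given cleaned master record.
SELECT c.id AS master_id,
       c.first_name,
       c.last_name,
       e.source,
       e.external_unique_id,
       e.import_date
FROM cleaned_person  AS c
JOIN person_map      AS m ON m.local_person_id = c.id
JOIN external_person AS e ON e.id = m.foreign_person_id
ORDER BY c.id, e.import_date;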
One thing about terminology - What you've listed are "Attributes", not "Facts". A fact is a measure that you take on a set of dimensional Attributes. (for example, an order that this "person" places, or the dollar value of this customer's recent order, etc). In this case, you have multiple sources of dimensional attributes, each one considered the "same".
@Dems's method is one way (and a good one) to keep your cleaned data separate from your staging / operational data set.
Another, if you need to have access to both data sets in reporting, while still keeping a "clean" version, would be to put all the attributes on your person/customer dimension:
FIRST_NAME
LAST_NAME
SOURCE1_FIRST_NAME
SOURCE1_LAST_NAME
SOURCE2_FIRST_NAME
SOURCE2_LAST_NAME
For reports on measures where the user community is expecting to see the name from Source 2, you can use the source2 attribute. For people expecting source 1, use that. For people looking for the results of the processing which "conforms" the name, use the main attribute.
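A rough sketch of that choice in SQL (the dim_person table and @audience parameter are hypothetical):

DECLARE @audience varchar(10) = 'source2';

SELECT person_key,
       CASE @audience
           WHEN 'source1' THEN source1_last_name
           WHEN 'source2' THEN source2_last_name
           ELSE last_name                -- the conformed version
       END AS report_last_name
FROM dim_person;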
As an example, let's say my dataset holds:
EMPLOYEE_ID
EMPLOYEE_NAME
EMPLOYEE_ACCT_ID
EMPLOYEE_ACCT_TYPE
EMPLOYEE_ACCT_BALANCE
I would like to present the data in the following way:
EMPLOYEE | CHECKING | SAVINGS  | INVESTMENT | XMAS
---------+----------+----------+------------+--------
Mary     |   100.00 |   700.00 |   3,000.00 | 175.00
Jim      |   850.00 |   600.00 |   1,500.00 |   0.00
TOTAL    |   950.00 | 1,300.00 |   4,500.00 | 175.00
Where I'm stuck is how to break EMPLOYEE_ACCT_TYPE out into columns, with each account type listed alongside its balance. Thanks in advance.
What you are trying to do is called a pivot. Some systems (e.g. SQL Server) have native support for this in SQL, but only if you know the number of columns in advance (i.e. you would have to hard-code the account types into the SQL). Other systems don't support pivoting natively (e.g. MySQL), so you would need to write a stored procedure or some dynamic SQL to do it.
Since you don't mention what DBMS you are using, that's about as specific as I can get.
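That said, here is a hedged SQL Server sketch with the account types hard-coded, assuming the flat dataset lives in a table called employee_accounts:

SELECT employee_name AS employee,
       [CHECKING], [SAVINGS], [INVESTMENT], [XMAS]
FROM (SELECT employee_name, employee_acct_type, employee_acct_balance
      FROM employee_accounts) AS src
PIVOT (SUM(employee_acct_balance)
       FOR employee_acct_type
       IN ([CHECKING], [SAVINGS], [INVESTMENT], [XMAS])) AS p;

The TOTAL row would come from a separate aggregate query (or GROUP BY ... WITH ROLLUP) unioned onto the result.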
It sounds to me like you need to do some serious normalization, first. Break employee_account data out into its own table:
table: employee_account_data
EMPLOYEE_ACCT_ID int
EMPLOYEE_ACCT_TYPE varchar(15)
EMPLOYEE_ACCT_BALANCE decimal
You'll also need a bridge table, since many employees can have many accounts (many-to-many):
table: employee_account_lookup
EMPLOYEE_ID int
EMPLOYEE_ACCT_ID int
This way, you won't be repeating employee_name for each account type (as I suspect you are now). If you really wanted to normalize well, you could also create a table to hold the different employee account types. That way you wouldn't have to worry about someone misspelling "Checking" or "Savings" on data entry.
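A minimal sketch of that lookup table (names are illustrative); employee_account_data would then store the ID rather than free text:

CREATE TABLE employee_account_types (
    acct_type_id   int         NOT NULL PRIMARY KEY,
    acct_type_name varchar(15) NOT NULL UNIQUE   -- 'Checking', 'Savings', ...
);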
I realize that referring to these as dimension and fact tables is not exactly appropriate. I am at a loss for better terminology, so please excuse the categorization I use in this post.
I am building an application for employee record keeping.
The database will contain organizational information. The information is mostly defined in three tables: Locations, Divisions, and Departments. However, there are others with similar problems. First, I need to store the available values for these tables. This will supply the available values in the application when managing an employee, and allow management of those values when adding or deleting departments and such. For instance, the Locations table may look like:
LocationId | LocationName | LocationStatus
1 | New York | Active
2 | Denver | Inactive
3 | New Orleans | Active
I then need to store these values for each employee and keep their history. My first thought was to create LocationHistory, DivisionHistory, and DepartmentHistory tables. I cannot pinpoint why, but this struck me as poor design. My next inclination was to create a DimLocation/FactLocation, DimDivision/FactDivision, DimDepartment/FactDepartment set of tables. I do not believe this makes sense either. I have also considered naming them as a combination of Employee, i.e. EmployeeLocations, EmployeeDivisions, etc. Regardless of the naming convention for these tables, I imagine that data would look similar to a simplified version I have below:
EmployeeId | LocationId | EffectiveDate | EndDate
1 | 3 | 2008-07-01 | NULL
1 | 2 | 2007-04-01 | 2008-06-30
I realize any of the imagined solutions I described above could work, but I am really looking to create a design that will be easy for others to maintain with an intuitive, familiar structure. I would like to receive this community's help, opinions, and experience with this matter. I am open to and would welcome any suggestion to consider. For instance, should I even store the available values for these three tables in the database? Should they be maintained in the application code/business logic layer? Do I just need to get over seeing the word History repeating three times?
Thanks!
Firstly, I see no issue in describing these as Dimension and Fact tables outside of a warehouse :)
In terms of conceptualising and understanding the relationships, I personally find the use of start/end dates perfectly easy for people to understand: Agent and Location fact tables, and then time-dependent mapping tables such as Agent_At_Location, etc. They do, however, have issues worth noting.
If EndDate is 2008-08-30, was the employee in that location UP TO 30th August, or UP TO AND INCLUDING 30th August?
Dealing with overlapping date periods can give messy queries and, more importantly, slow queries.
The first one seems simply a matter of convention, but it can have implications when dealing with other data. For example, suppose an EndDate of 2008-08-30 means that they ARE at that location UP TO AND INCLUDING 30th August. Then you join on to their daily agent data for that day (such as when they actually arrived at work, left for breaks, etc.). You need to join ON AgentDailyData.EventTimeStamp < '2008-08-30' + 1 in order to include all the events that happened during that day.
This is because the data's EventTimeStamp isn't measured in days, but probably minutes or seconds.
If instead the EndDate of '2008-08-30' means that the agent was at that location UP TO but NOT INCLUDING 30th August, the join does not need the + 1. In fact, you don't need to know whether the date is day-bound or can include a time component. You just need TimeStamp < EndDate.
By using EXCLUSIVE End markers, all of your queries simplify and never need + 1 day, or + 1 hour to deal with edge conditions.
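For example, the daily-data join above collapses to this with exclusive end markers (the AgentID, StartDate, and EndDate columns on Agent_At_Location are assumptions):

SELECT m.AgentID, d.EventTimeStamp
FROM Agent_At_Location AS m
JOIN AgentDailyData    AS d
  ON  d.AgentID = m.AgentID
  AND d.EventTimeStamp >= m.StartDate   -- inclusive start
  AND d.EventTimeStamp <  m.EndDate;    -- exclusive end: no '+ 1' needed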
The second one is much harder to resolve. The simplest way of resolving an overlapping period is as follows:
SELECT
  CASE WHEN TableA.InclusiveFrom > TableB.InclusiveFrom THEN TableA.InclusiveFrom ELSE TableB.InclusiveFrom END AS [NetInclusiveFrom],
  CASE WHEN TableA.ExclusiveFrom < TableB.ExclusiveFrom THEN TableA.ExclusiveFrom ELSE TableB.ExclusiveFrom END AS [NetExclusiveFrom]
FROM
  TableA
INNER JOIN
  TableB
    ON  TableA.InclusiveFrom < TableB.ExclusiveFrom
    AND TableA.ExclusiveFrom > TableB.InclusiveFrom
-- Where InclusiveFrom is the StartDate
-- And ExclusiveFrom is the EndDate, up to but NOT including that date
The problem with that query is one of indexing. The first condition, TableA.InclusiveFrom < TableB.ExclusiveFrom, could be resolved using an index, but it could give a massive range of dates. And then, for each of those records, the ExclusiveFrom dates could be just about anything, and certainly not in an order that could help quickly resolve TableA.ExclusiveFrom > TableB.InclusiveFrom.
The solution I have previously used for that is to have a maximum allowed gap between InclusiveFrom and ExclusiveFrom. This allows something like...
ON TableA.InclusiveFrom < TableB.ExclusiveFrom
AND TableA.InclusiveFrom >= TableB.InclusiveFrom - 30
AND TableA.ExclusiveFrom > TableB.InclusiveFrom
The condition TableA.ExclusiveFrom > TableB.InclusiveFrom STILL can't benefit from indexes, but we've limited the number of rows that can be returned by searching TableA.InclusiveFrom. It's at most only ever 30 days of data, because we know that we restricted the duration to a maximum of 30 days.
An example of this is to break the associations up by calendar month (max duration of 31 days).
EmployeeId | LocationId | EffectiveDate | EndDate
1 | 2 | 2007-04-01 | 2007-05-01
1 | 2 | 2007-05-01 | 2007-06-01
1 | 2 | 2007-06-01 | 2007-06-25
(Representing Employee 1 being in Location 2 from 1st April to (but not including) 25th June.)
It's effectively a trade-off: using disk space to gain performance.
I've even seen this pushed to the extreme of not storing date ranges at all, but storing the actual mapping for each and every day. Essentially, it's like restricting the maximum duration to 1 day...
EmployeeId | LocationId | EffectiveDate
1 | 2 | 2007-06-23
1 | 2 | 2007-06-24
1 | 3 | 2007-06-25
1 | 3 | 2007-06-26
Instinctively I initially rebelled against this. But in subsequent ETL, warehousing, reporting, etc., I actually found it very powerful, adaptable, and maintainable. I saw people making fewer coding mistakes, writing code in less time, the code running faster, and the result much more able to adapt to clients' changing needs.
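Part of why it simplifies code: joining day-level mappings to intraday events becomes an exact-match join with no range predicates at all. A sketch with assumed table and column names:

SELECT m.EmployeeId, m.LocationId, e.EventTimeStamp
FROM EmployeeLocationByDay AS m
JOIN AgentDailyData        AS e
  ON  e.EmployeeId = m.EmployeeId
  AND CAST(e.EventTimeStamp AS date) = m.EffectiveDate;   -- one row per day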
The only two down sides were:
1. More disk space taken (but trivial compared to the size of fact tables)
2. Inserts and updates to this mapping were slower
The slowdown for inserts and updates only actually mattered once, where this model was being used to represent a constantly changing process net: the app wanted to change the mapping about 30 times a second. Even then it worked, it just chomped up more CPU time than was ideal.
If you want to be efficient and keep a history, do these things. There are multiple solutions to this problem, but this is the one that I keep going back to:
Remember that each row represents a single entity. If you make corrections to that entity, that's fine, but don't re-use an ID for a new location. Set it up so that instead of deleting a location, you mark it as deleted with a bit flag and hide it from the interface; that way, when it's referenced historically, it's still there.
Create a history table that includes the current value (or no records if a value isn't currently set), with foreign keys tying back to the employee and to the location.
Create a column in the employee table that points to the current active location in the history. When you need the employee's current location, you join to the history table on this ID. When you need all of the history for an employee, you join from the history table.
This structure keeps it all normalized, and gives you an easy way to find the current value without having to do any date comparisons.
As far as using the word history goes, think of it in different terms: since it contains the current item as well as historical items, it's really just a junction table that keeps the old items around. As such, you can name it something like EmployeeLocations.
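A minimal T-SQL sketch of this layout, assuming Employees and Locations tables already exist (all names are illustrative):

CREATE TABLE EmployeeLocations (
    EmployeeLocationId int  NOT NULL PRIMARY KEY,
    EmployeeId         int  NOT NULL REFERENCES Employees (EmployeeId),
    LocationId         int  NOT NULL REFERENCES Locations (LocationId),
    EffectiveDate      date NOT NULL
);

-- Pointer from the employee to their current history row.
ALTER TABLE Employees
    ADD CurrentEmployeeLocationId int NULL
        REFERENCES EmployeeLocations (EmployeeLocationId);

-- Current location: a plain join, no date comparisons needed.
SELECT e.EmployeeId, el.LocationId
FROM Employees AS e
JOIN EmployeeLocations AS el
  ON el.EmployeeLocationId = e.CurrentEmployeeLocationId;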
I have 2 tables, a purchases table and a users table. Records in the purchases table look like this:
purchase_id | product_ids | customer_id
---------------------------------------
1 | (99)(34)(2) | 3
2 | (45)(3)(74) | 75
Users table looks like this:
user_id | email | password
----------------------------------------
3 | joeShmoe@gmail.com | password
75 | nolaHue@aol.com | password
To get the purchase history of a user I use a query like this:
mysql_query(" SELECT * FROM purchases WHERE customer_id = '$users_id' ");
The problem is: what will happen when tens of thousands of records are inserted into the purchases table? I feel like this will take a performance toll.
So I was thinking about storing the purchases in an additional field directly in the user's row:
user_id | email | password | purchases
------------------------------------------------------
1 | joeShmoe@gmail.com | password | (99)(34)(2)
2 | nolaHue@aol.com | password | (45)(3)(74)
And when I query the user's table for things like username, etc. I can just as easily grab their purchase history using that one query.
Is this a good idea? Will it improve performance, or will the benefit be insignificant and not worth making the database look messier?
I really want to know what the pros do in these situations. For example, how does Amazon query its database for a user's purchase history when they have millions of customers? How come their queries don't take hours?
EDIT
Ok, so I guess keeping them separate is the way to go. Now the question is a design one:
Should I keep using the "purchases" table I illustrated earlier? In that design I separate the product IDs of each purchase using parentheses, and use those as delimiters to tell the IDs apart when extracting them via PHP.
Or should I instead store each product ID separately in the "purchases" table, so it looks like this?
purchase_id | product_id | customer_id
---------------------------------------
1 | 99 | 3
1 | 34 | 3
1 | 2 | 3
2 | 45 | 75
2 | 3 | 75
2 | 74 | 75
Nope, this is a very, very, very bad idea.
You're breaking first normal form because you don't know how to page through a large data set.
Amazon and Yahoo! and Google bring back (potentially) millions of records - but they only display them to you in chunks of 10 or 25 or 50 at a time.
They're also smart about guessing or calculating which ones are most likely to be of interest to you - they show you those first.
Which purchases in my history am I most likely to be interested in? The most recent ones, of course.
You should consider building these into your design before you violate relational database fundamentals.
Your database already looks messy, since you are storing multiple product_ids in a single field instead of creating an "association" table like this:
_____product_purchases____
purchase_id | product_id |
--------------------------
1 | 99 |
1 | 34 |
1 | 2 |
You can still fetch it in one query:
SELECT * FROM purchases p LEFT JOIN product_purchases pp USING (purchase_id)
WHERE p.customer_id = $user_id
But this also gives you more possibilities, like finding out how many units of product #99 were bought, getting a list of all customers that purchased product #34, etc.
And of course, don't forget about indexes, which will make all of this much faster.
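For instance (the index names are my own; the queries match the possibilities mentioned above):

-- Indexes for the common access paths.
CREATE INDEX idx_purchases_customer ON purchases (customer_id);
CREATE INDEX idx_pp_product         ON product_purchases (product_id);

-- How many times was product #99 bought?
SELECT COUNT(*) FROM product_purchases WHERE product_id = 99;

-- Which customers purchased product #34?
SELECT DISTINCT p.customer_id
FROM purchases AS p
JOIN product_purchases AS pp USING (purchase_id)
WHERE pp.product_id = 34;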
Doing this to your schema would break the entity relationships in your database.
You might want to look into Memcached, NoSQL, and Redis.
These are all tools that will help you improve query performance, mostly by storing data in RAM.
For example: run the query once and store the result in Memcache; if the user refreshes the page, you get the data from Memcache, not from MySQL, which avoids querying your database a second time.
Hope this helps.
First off, tens of thousands of records is nothing. Unless you're running on a teensy-weensy machine with limited RAM and hard drive space, a database won't even blink at 100,000 records.
As for storing purchase details in the users table... what happens if a user makes more than one purchase?
MySQL is hugely capable, and don't let the fact that it's free convince you otherwise. Keeping the two tables separate is probably best, not only because it keeps the DB more normalized, but because the extra indexes will speed up queries. A 10,000-record database is relatively small compared to multi-hundred-million-record health record databases.
As for Amazon and Google, they hire hundreds of developers to build specialized query systems for their specific application needs - not something developers like us have the resources to fund.
I'm getting the following data from a MySQL database
+----------------+------------+---------------------+----------+
| account_number | total_paid | doc_date | doc_type |
+----------------+------------+---------------------+----------+
| 18 | 54.0700 | 2009-10-22 02:37:09 | IN |
| 425 | 49.9500 | 2009-10-22 02:31:47 | PO |
+----------------+------------+---------------------+----------+
The query is fine and I'm getting the data I need, except that the doc_type isn't very human-readable. To fix this, I've done the following:
CREATE TEMPORARY TABLE doc_type (id char(2), string varchar(60));
INSERT INTO doc_type VALUES
('IN', 'Invoice'),
('PO', 'Online payment'),
('PF', 'Offline payment'),
('CA', 'Credit adjustment'),
('DA', 'Debit adjustment'),
('OR', 'Order');
I then add a join against this temporary table so my doc_type column is easier to read, which looks like this:
+----------------+------------+---------------------+----------------+
| account_number | total_paid | doc_date | document_type |
+----------------+------------+---------------------+----------------+
| 18 | 54.0700 | 2009-10-22 02:37:09 | Invoice |
| 425 | 49.9500 | 2009-10-22 02:31:47 | Online payment |
+----------------+------------+---------------------+----------------+
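For reference, the join producing that output looks roughly like this (billing_docs stands in for the real source table, which isn't shown here):

SELECT b.account_number,
       b.total_paid,
       b.doc_date,
       dt.string AS document_type
FROM billing_docs AS b
JOIN doc_type     AS dt ON dt.id = b.doc_type;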
Is this the best way to do this? Is it possible to replace the text in one query? I started looking at IF statements, but that doesn't seem to be what I'm after - or maybe I just read it incorrectly.
// EDIT //
Thanks everyone. I suppose I'll keep doing it this way.
Unfortunately, it's not possible to change doc_type to an integer, as this is an existing database for a billing application I didn't write. I'd end up breaking functionality if I made any changes other than adding a table here and there.
I also appreciate the easy-to-understand CASE statement from Rahul. It may come in handy later.
Your current way is the best. Arguably, doc_type could be changed to an int to save space and whatnot, but that's irrelevant here.
Doing the join will be much faster and more readable than any chained IFs.
Not to mention extensible: should you need to add a new doc_type, it's just an insert, versus potentially editing several queries.
You can use the SQL CASE statement to do this in a single query.
SELECT account_number, total_paid, doc_date,
       CASE doc_type
           WHEN 'IN' THEN 'Invoice'
           WHEN 'PO' THEN 'Online payment'
           -- ...and so on for the remaining codes
       END AS document_type
FROM your_table
It is the best way to do this :)
If doc_type could be an integer, you could also use the ELT function, as in
SELECT ELT(doc_type, 'Invoice', 'Document') FROM table;
but it is still worse than a simple join, since you have to put this expression into every query and every application that uses the database, and changing a description becomes hell.
IIRC, this is the correct way to achieve what you want. It's a normalized design.
I think you are asking about the design and not how the data should be fetched? If so, I should say that I have always used this kind of design.
This design leads to a normalized database. There won't be consistency problems if you ever need to change a value like Invoice or Online payment.
I would suggest changing the doc_type field to an int, as it not only saves space (as Tordek said) but is also faster to query.
First, if you stored a string like Invoice directly in doc_type, string comparison is slow compared to other data types.
Second, string matching can be case-sensitive (depending on collation), which may lead to mistakes.
Third, strings take up more space, so more storage is required in the main table.
Fourth, if you ever needed to change the name Invoice to, say, Billing, finding every occurrence would take time, and each and every row containing that value would have to be updated; with a lookup table you change it in one place.