Multiple conflicting facts in database / data warehouse - sql

Our organization is currently building a new data warehouse. We are able to use some techniques borrowed from the DW community, such as ETL processing to conform data, de-normalized dimensions in the Kimball style, and so on. Overall, data warehousing is still fairly new to our organization, but we are learning the concepts as we go.
The problem: We have multiple sources of data that often provide conflicting facts. For example, we have a Master Person Index, where we use a score-based matching algorithm during ETL to match an inbound person to an existing person; even if the inbound record doesn't match exactly, we can score it on other things like zip code radius.
Here's the question: What is the standard way to handle multiple versions of a fact from two or more sources?
I understand one of the main ideas of the data warehouse is to keep a running history of any fact, which we are doing. That's all fine and dandy when a record is maintained by a single inbound source: we keep the history of that fact over time. The problem occurs when two different sources, each perhaps updating on a daily basis, carry two different facts. For example, source A says the name is Mary Smith while source B says the name is Mary Jane, and each load changes the value. Based on the matching algorithm we're confident it's the same person, but because of our history-style table the name keeps flopping back and forth between the two values every day, since each load reads the name as a "change" from the other data source.
An example table:
first_name | last_name | source | last_updated
-----------+-----------+--------+---------------
Mary       | Smith     | A      | 5/2/12 1:00am
Mary       | Jane      | B      | 5/2/12 2:00am
Mary       | Smith     | A      | 5/3/12 1:00am
Mary       | Jane      | B      | 5/3/12 2:00am
Mary       | Smith     | A      | 5/4/12 1:00am
Mary       | Jane      | B      | 5/4/12 2:00am
...

Have one table that stores your external data:
id | first_name | last_name | source | external_unique_id | import_date
----+------------+-----------+--------+--------------------+-------------
1 | Mary | Smith | A | abcdefg123 | 5/2/12 1:00am
2 | Mary | Jane | B | 1234567abc | 5/2/12 2:00am
Then have a second table that contains your cleaned data:
id | first_name | last_name
----+------------+-----------
1 | Mary | Jane-Smith (or whatever)
Then have a mapping table between the two.
local_person_id | foreign_person_id
-----------------+-------------------
1 | 1
1 | 2
Or something broadly similar.
The objective is to load the facts from your source once, and keep them.
Then use your fuzzy logic to relate them to master records somewhere, which you only need to do when new facts are loaded or old facts change.
You still have to choose which last_name to use, but that can be almost arbitrary in the absence of determining data. For example: pick the last name from the fact loaded most recently.
You can still quickly and simply relate the master to the child facts, to their sources, and to their corresponding data. But you have a unified entity in your warehouse to hang these external facts on.
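For example, here is a minimal sketch of that "most recently loaded fact wins" rule, assuming the three tables above are called external_person, person, and person_map (the table names are my own invention; the columns follow the layouts shown):

SELECT  p.id,
        p.first_name,
        x.last_name AS surviving_last_name      -- name from the most recently imported fact
FROM    person AS p
JOIN   (SELECT  m.local_person_id,
                e.last_name,
                ROW_NUMBER() OVER (PARTITION BY m.local_person_id
                                   ORDER BY e.import_date DESC) AS rn
        FROM    person_map      AS m
        JOIN    external_person AS e ON e.id = m.foreign_person_id) AS x
  ON    x.local_person_id = p.id
 AND    x.rn = 1;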

One thing about terminology: what you've listed are "attributes", not "facts". A fact is a measure that you take on a set of dimensional attributes (for example, an order that this "person" places, or the dollar value of this customer's recent order). In this case, you have multiple sources of dimensional attributes, each one considered the "same".
Dems' method is one way (and a good one) to keep your cleaned data separate from your staging / operational data set.
Another, if you need to have access to both data sets in reporting, while still keeping a "clean" version, would be to put all the attributes on your person/customer dimension:
FIRST_NAME
LAST_NAME
SOURCE1_FIRST_NAME
SOURCE1_LAST_NAME
SOURCE2_FIRST_NAME
SOURCE2_LAST_NAME
For reports on measures where the user community is expecting to see the name from Source 2, you can use the source2 attribute. For people expecting source 1, use that. For people looking for the results of the processing which "conforms" the name, use the main attribute.
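A minimal sketch of what that "wide" dimension might look like (the names and types here are illustrative, not a prescription):

CREATE TABLE dim_person (
    person_key         INT          NOT NULL PRIMARY KEY,
    first_name         VARCHAR(50),    -- conformed value the ETL settles on
    last_name          VARCHAR(50),
    source1_first_name VARCHAR(50),    -- attribute exactly as received from source 1
    source1_last_name  VARCHAR(50),
    source2_first_name VARCHAR(50),    -- attribute exactly as received from source 2
    source2_last_name  VARCHAR(50)
);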

Related

Cannot identify how to query SQL data in unusual format

I have been practicing data manipulation in SQL Server Express 2017, and I have been provided with a data source I can't seem to make sense of. I was hoping someone more familiar with this sort of thing might be able to point me in the right direction. I need to write some SQL queries against the dataset, but I haven't the faintest idea where to start.
The data looks like this for instance:
Company Code - Field - Value (3 fields)
1001 - Vendor Name - 7 Eleven
1001 - Vendor Name - Bob Jane
1001 - Vendor Name - Krispy Kreme
1001 - Vendor Address - 102 Reservoir Street
1001 - Vendor Address - 110 Pitt Road
1001 - Vendor Address - 23 Foxy Place
Usually, I would expect to see it in a somewhat relational type of table like
Company Code Vendor Name Vendor Address
1001 7 Eleven 102 Reservoir Street.
1001 Bob Jane 110 Pitt Road.
What you have appears to be a design called EAV (entity-attribute-value); that term you can Google. Unfortunately, either you left out important information or your design is fundamentally broken. Given what you posted, there is no way to know that "Bob Jane" goes with "110 Pitt Road". Rows in a table have no reliable or specific order. You need a column in your table to define "order" if you want to associate rows based on "order".
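To illustrate, here is a sketch of what becomes possible once such a column exists. It assumes a hypothetical RecordId column tying each vendor's rows together, and a table name of VendorFeed (both invented for the example; the CompanyCode/Field/Value columns are from the question):

SELECT  CompanyCode,
        RecordId,
        MAX(CASE WHEN Field = 'Vendor Name'    THEN Value END) AS VendorName,
        MAX(CASE WHEN Field = 'Vendor Address' THEN Value END) AS VendorAddress
FROM    VendorFeed
GROUP BY CompanyCode, RecordId;    -- one relational-style row per vendor record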

postgis query for addresses (with osm data)

I want to query a PostGIS database containing OpenStreetMap data for addresses, check whether a given address exists in the database, and if so, get its coordinates. The database was filled from a .pbf file using Osmosis. This is the schema for the database: http://pastebin.com/Yigjt77f. I have addresses in the form of city name, street name, and house number. The most important table for me is this one:
CREATE TABLE node_tags (
    node_id BIGINT NOT NULL,
    k text NOT NULL,
    v text NOT NULL
);
The k column holds tags; the ones I'm interested in are addr:housenumber, addr:street, and addr:city, and v is the corresponding value. First I search for a city name matching the one in the database, then within that result set I search for the street, and then for the house number. The problem is that I don't know how to write a SQL query that gets this result in a single round trip. I can query first only for the city name, collect all node_ids matching my city in my Java program, and then query for the street for each found id, and so on. This approach is really slow, because as I ask for more detailed information (city, then street, then number) I have to make more and more queries, and what is more, I have to check a lot of addresses. Once I have a matching node_id I can easily find the coordinates, so that's not a problem.
Example of this table:
node_id | k | v
123 | addr:housenumber | 50
123 | addr:street | Kingsway
123 | addr:city | London
123 | (some other stuff) | .....
100 | addr:housenumber | 121
100 | addr:street | Edmund St
100 | addr:city | London
I hope I have explained my problem clearly.
This is not as easy as you might think. Addresses in OSM are hierarchical, like in the real world. Not all elements in OSM have a full address attached. Some only have addr:housenumber and simply belong to the nearest street. Some have addr:housenumber and addr:street but no addr:city because they simply belong to the nearest city, or they are enclosed by a boundary relation which specifies the corresponding city. And instead of addr:housenumber there are sometimes also just address interpolations described by the addr:interpolation key. See the addr key wiki page for more information.
The Karlsruhe Schema page in the OSM wiki explains a lot about addresses in OSM. It also mentions associatedStreet relations which are sometimes used to group house numbers and their corresponding streets.
As you can see, a single query in the database probably won't suffice. If you need some inspiration you can take a look at OSM's address search engine, Nominatim. But note that Nominatim uses a different database schema than the usual one in order to optimize address queries. You can also take a look at one of the many routing applications, which all have to do address lookups.
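That said, for the subset of nodes that do carry all three addr tags, a single self-join on node_tags can find the node in one round trip. A minimal sketch, using the table from the question and the example values shown there:

SELECT  city.node_id
FROM    node_tags AS city
JOIN    node_tags AS street  ON street.node_id  = city.node_id
JOIN    node_tags AS housenr ON housenr.node_id = city.node_id
WHERE   city.k    = 'addr:city'        AND city.v    = 'London'
  AND   street.k  = 'addr:street'      AND street.v  = 'Kingsway'
  AND   housenr.k = 'addr:housenumber' AND housenr.v = '50';

It won't find elements whose city or street is only implied by proximity or by an enclosing boundary relation, for the reasons described above.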

Specify Association Depth in NHibernate Criteria

Okay, let's say I have a many-to-many Student/Class relationship, with the following mapping:
HasManyToMany(m => m.Classes)
    .Table("StudentClasses")
    .LazyLoad()
    .Cascade.None();
and query:
return _session.CreateCriteria<Student>()
    .Add(Restrictions.Eq("Id", studentId))
    .SetCacheable(false)
    .UniqueResult<Student>();
My tables (Students, Classes, and the StudentClasses join table):

Students
ID | Student
---+--------------
 1 | John
 2 | Sue
 3 | Frank
 4 | Jim
 5 | Frankenstein

Classes
ID | Class
---+-----------
 1 | Algebra
 2 | Biology
 3 | Speech
 4 | Athletics
 5 | History

StudentClasses
StudentId | ClassId
----------+--------
        1 | 1
        1 | 2
        2 | 2
        2 | 5
        5 | 5
What I want is John, with Algebra and Biology. That comes back, but Biology comes with John and Sue, and Sue in turn brings in Biology and History, which populates Sue and Frankenstein. I don't want to get Frankenstein (or anything beyond Biology). Also, at any point this can start cycling back around in circles.
How do I specify not to populate that deeply? SetMaxRows obviously isn't what I'm looking for, as it cuts down on the total number of rows; I only want to hydrate the first level. I'm particularly confused because I thought lazy loading would force me to specify exactly what I did want to hydrate.
Lazy loading means you don't have to specify what you want to hydrate. The collections will be populated "lazily" at the time you access them. Your above query will only fetch a single row: John. When you access john.Classes, NHibernate will execute another query behind the scenes to fetch John's classes. Then when you access the biology.Students property, NHibernate will execute another query to populate that collection... and so on.
You definitely do want lazy-loading on this relationship (which is the default, by the way). If you turned lazy loading off, then this would do exactly what you're saying you want to avoid - NHibernate would know that it's not allowed to lazily load these collections, so it would start working to load them as soon as you loaded a student, which could potentially end up loading all of the data in these three tables, and in a very inefficient manner with lots of round-trips to the database.
If you know that all you want is "John, with Algebra and Biology", then you should add these lines to your query:
.SetFetchMode("Classes", FetchMode.Eager)
.SetResultTransformer(Transformers.DistinctRootEntity)
The SetFetchMode tells NHibernate to use a left outer join to pre-populate the student.Classes collection. The DistinctRootEntity transformer tells NHibernate not to return the duplicate students that will be generated by the left outer join.

How should you separate dimension tables from fact tables if you are not building a data warehouse?

I realize that referring to these as dimension and fact tables is not exactly appropriate. I am at a loss for better terminology, so please excuse the categorization that I use in this post.
I am building an application for employee record keeping.
The database will contain organizational information. The information is mostly defined in three tables: Locations, Divisions, and Departments; however, there are other tables with similar requirements. First, I need to store the available values for these tables. This will supply the available values in the application when managing an employee, and allow managing those values when adding or deleting departments and such. For instance, the Locations table may look like:
LocationId | LocationName | LocationStatus
1 | New York | Active
2 | Denver | Inactive
3 | New Orleans | Active
I then need to store these values for each employee and keep their history. My first thought was to create LocationHistory, DivisionHistory, and DepartmentHistory tables. I cannot pinpoint why, but this struck me as poor design. My next inclination was to create a DimLocation/FactLocation, DimDivision/FactDivision, DimDepartment/FactDepartment set of tables. I do not believe this makes sense either. I have also considered naming them as a combination of Employee, i.e. EmployeeLocations, EmployeeDivisions, etc. Regardless of the naming convention for these tables, I imagine that data would look similar to a simplified version I have below:
EmployeeId | LocationId | EffectiveDate | EndDate
1 | 3 | 2008-07-01 | NULL
1 | 2 | 2007-04-01 | 2008-06-30
I realize any of the imagined solutions I described above could work, but I am really looking to create a design that will be easy for others to maintain with an intuitive, familiar structure. I would like to receive this community's help, opinions, and experience with this matter. I am open to and would welcome any suggestion to consider. For instance, should I even store the available values for these three tables in the database? Should they be maintained in the application code/business logic layer? Do I just need to get over seeing the word History repeating three times?
Thanks!
Firstly, I see no issue in describing these as Dimension and Fact tables outside of a warehouse :)
In terms of conceptualising and understanding the relationships, I personally find the use of start/end dates perfectly easy for people to understand: Agent and Location fact tables, and then time-dependent mapping tables such as Agent_At_Location, etc. They do, however, have issues worth noting:
1. If EndDate is 2008-08-30, was the employee in that location UP TO 30th August, or UP TO and INCLUDING 30th August?
2. Dealing with overlapping date periods can give messy queries and, more importantly, slow queries.
The first one seems simply a matter of convention, but it can have certain implications when dealing with other data. For example, suppose an EndDate of 2008-08-30 means that they ARE at that location UP TO and INCLUDING 30th August. Then you join on to their Daily Agent Data for that day (such as when they actually arrived at work, left for breaks, etc). You need to join ON AgentDailyData.EventTimeStamp < '2008-08-30' + 1 in order to include all the events that happened during that day.
This is because the data's EventTimeStamp isn't measured in days, but probably minutes or seconds.
If instead the EndDate of '2008-08-30' means that the Agent was at that Location UP TO but NOT INCLUDING 30th August, the join does not need the + 1. In fact, you don't even need to know whether the date is day-bound or includes a time component. You just need TimeStamp < EndDate.
By using EXCLUSIVE End markers, all of your queries simplify and never need + 1 day, or + 1 hour to deal with edge conditions.
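A sketch of what that join looks like with exclusive end markers (the table and column names follow the discussion above and are illustrative):

SELECT  el.LocationId,
        d.EventTimeStamp
FROM    EmployeeLocations AS el
JOIN    AgentDailyData    AS d
  ON    d.EmployeeId      = el.EmployeeId
 AND    d.EventTimeStamp >= el.EffectiveDate
 AND    d.EventTimeStamp <  el.EndDate;   -- no "+ 1 day" edge-case handling needed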
The second one is much harder to resolve. The simplest way of resolving an overlapping period is as follows:
SELECT
    CASE WHEN TableA.InclusiveFrom > TableB.InclusiveFrom THEN TableA.InclusiveFrom ELSE TableB.InclusiveFrom END AS [NetInclusiveFrom],
    CASE WHEN TableA.ExclusiveFrom < TableB.ExclusiveFrom THEN TableA.ExclusiveFrom ELSE TableB.ExclusiveFrom END AS [NetExclusiveFrom]
FROM
    TableA
INNER JOIN
    TableB
        ON  TableA.InclusiveFrom < TableB.ExclusiveFrom
        AND TableA.ExclusiveFrom > TableB.InclusiveFrom
-- Where InclusiveFrom is the StartDate
-- And ExclusiveFrom is the EndDate, up to but NOT including that date
The problem with that query is one of indexing. The first condition, TableA.InclusiveFrom < TableB.ExclusiveFrom, could be resolved using an index, but it could give a massive range of dates. And then, for each of those records, the exclusive dates could be just about anything, and certainly not in an order that helps quickly resolve TableA.ExclusiveFrom > TableB.InclusiveFrom.
The solution I have previously used for that is to have a maximum allowed gap between InclusiveFrom and ExclusiveFrom. This allows something like...
ON TableA.InclusiveFrom < TableB.ExclusiveFrom
AND TableA.InclusiveFrom >= TableB.InclusiveFrom - 30
AND TableA.ExclusiveFrom > TableB.InclusiveFrom
The condition TableA.ExclusiveFrom > TableB.InclusiveFrom STILL can't benefit from indexes, but instead we've limited the number of rows that can be returned by searching TableA.InclusiveFrom. It's at most only ever 30 days of data, because we know that we restricted the duration to a maximum of 30 days.
An example of this is to break up the associations by calendar month (max duration of 31 days).
EmployeeId | LocationId | EffectiveDate | EndDate
1 | 2 | 2007-04-01 | 2007-05-01
1 | 2 | 2007-05-01 | 2007-06-01
1 | 2 | 2007-06-01 | 2007-06-25
(Representing Employee 1 being in Location 2 from 1st April to (but not including) 25th June.)
It's effectively a trade-off: using disk space to gain performance.
I've even seen this pushed to the extreme of not actually storing date Ranges, but storing the actual mapping for each and every day. Essentially, it's like restricting the maximum duration to 1 day...
EmployeeId | LocationId | EffectiveDate
1 | 2 | 2007-06-23
1 | 2 | 2007-06-24
1 | 3 | 2007-06-25
1 | 3 | 2007-06-26
Instinctively, I initially rebelled against this. But in subsequent ETL, warehousing, reporting, etc., I actually found it very powerful, adaptable, and maintainable. I saw people making fewer coding mistakes, writing code in less time, the code running faster, and being much more able to adapt to clients' changing needs.
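Part of the appeal is that, at a daily grain, range joins collapse into plain equality joins. A sketch of that (the fact table, its columns, and the mapping table name here are hypothetical, just to show the shape of the join):

SELECT  f.EmployeeId,
        m.LocationId,
        SUM(f.HoursWorked) AS HoursWorked
FROM    FactTimesheet       AS f            -- hypothetical daily fact table
JOIN    EmployeeLocationDay AS m            -- the day-level mapping table above
  ON    m.EmployeeId    = f.EmployeeId
 AND    m.EffectiveDate = f.WorkDate         -- simple equality; easily indexed
GROUP BY f.EmployeeId, m.LocationId;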
The only two downsides were:
1. More disk space taken (but trivial compared to the size of fact tables)
2. Inserts and updates to this mapping were slower
The slowdown for inserts and updates only actually mattered once, where this model was being used to represent a constantly changing process net; the app wanted to change the mapping about 30 times a second. Even then it worked, it just chomped up more CPU time than was ideal.
If you want to be efficient and keep a history, do these things. There are multiple solutions to this problem, but this is the one that I keep going back to:
Remember that each row represents a single entity. If you make corrections to that entity, that's fine, but don't re-use an ID for a new Location. Set it up so that instead of deleting a Location, you mark it as deleted with a bit flag and hide it from the interface; that way, when it's referenced historically, it's still there.
Create a history table that includes the current value (or no record if a value isn't currently set). Have foreign keys tying it back to the employee and to the location.
Create a column in the employee table that points to the current active location row in the history. When you need to get the employee's location, you join to the history table based on this ID. When you need to get all of the history for an employee, you join from the history table.
This structure keeps it all normalized, and gives you an easy way to find the current value without having to do any date comparisons.
As far as using the word history, think of it in different terms: since it contains the current item as well as historical items, it's really just a junction table that keeps around the old item. As such you can name it something like EmployeeLocations.
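A minimal sketch of that structure, with T-SQL-flavoured DDL and illustrative names:

CREATE TABLE Locations (
    LocationId   INT          NOT NULL PRIMARY KEY,
    LocationName VARCHAR(100) NOT NULL,
    IsDeleted    BIT          NOT NULL DEFAULT 0   -- hide instead of deleting, so history stays valid
);

CREATE TABLE EmployeeLocations (                   -- the "history" / junction table
    EmployeeLocationId INT  NOT NULL PRIMARY KEY,
    EmployeeId         INT  NOT NULL,              -- FK to Employees
    LocationId         INT  NOT NULL,              -- FK to Locations
    EffectiveDate      DATE NOT NULL
);

ALTER TABLE Employees
    ADD CurrentEmployeeLocationId INT NULL;        -- FK to EmployeeLocations (the current row)

-- Current location, with no date comparisons:
SELECT  l.LocationName
FROM    Employees         AS e
JOIN    EmployeeLocations AS el ON el.EmployeeLocationId = e.CurrentEmployeeLocationId
JOIN    Locations         AS l  ON l.LocationId          = el.LocationId;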

Best Approach to Processing SQL Data problem

I have a data-intensive problem that requires a lot of massaging and data manipulation, and I'm putting this out there to see if anyone has an idea of how to approach it.
In its simplest form: I have a lot of tables which can be joined together to give me a price listing for dentists and how much each charges for a procedure.
So we have multiple tables that look like this:
Dentist | Procedure1 | Procedure2 | Procedure3 | .........| Procedure?
John | 500 | 342 | 434 | .........| 843
Dave | 343 | 434 | 322 | NULLs....|
Mary | 500 | 342 | 434 | .........| 843
Linda | 500 | 342 | Null | .........| 843
Dentists can have different number of procedures and different pricing for each procedures. But there are a lot of Dentists that have the same number of procedures and the same rates that goes with it. Internally, we create a unique ID for each of these so-called fee listings.
For example, John would be 001 and Dave would be 002, but Mary would also be fee listing 001 (the same as John) and Linda would be 003.
It wouldn't be so bad if I only had to deal with this data once, but these fee listings come in flat files (CSVs) which I basically have to DTS up to a SQL Server to work with, and they come on a monthly basis. The pricing could change from month to month for each dentist, which would then put them under a different unique ID internally.
Can someone shed some light on how best to approach this problem so that it's efficient to process on a monthly basis without having to do tons of data manipulation?
What's the best approach to finding duplicate fee listings?
How do I keep track of updating a dentist's fee listing in case they change their rates the next month? If Mary decides to charge a different fee for Procedure2, then she would get a different unique ID internally. How do I keep track of that on a monthly basis without having to delete everything and re-insert?
There are a few million fee listings that I'm working with; some have standard rules based on zip codes and some are just unique fee listings. What's the approach here?
I can write some kind of ad-hoc .NET program to work with it, but it's a lot of data and working straight in SQL Server would be easier for me.
Any help would be great. Thanks, guys.
You probably need to unpivot the data to normalize it - so that you end up with:
Doctor: DoctorID, DoctorDetails...
FeeSchedule: DoctorID, ScheduleID, EffectiveDate, OtherDetailAtThisLevel...
FeeScheduleDetail: ScheduleID, ProcedureCode, Fee, OtherDetailAtThisLevel...
When the data comes in for a doctor, it is unpivoted, a new schedule is created, and the detail rows are created from the unpivoted data.
SSIS has an unpivot component which is fine - you would load the schedule first and then the detail. If the format varies significantly, you might need a custom data source or just avoid SSIS.
This system would keep track of new schedules for doctors. If the schedule is identical for a doctor, you could simply not insert it.
If this logic is extensive, you could load the data to staging tables (SSIS or whatever) and do all this in SQL (T-SQL also has an UNPIVOT operator). That can have advantages in that the code is all in one place and can do all its operations in sets.
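For example, a sketch of the unpivot step in T-SQL, assuming the monthly file has been bulk-loaded into a staging table; the staging table name and the list of ProcedureN columns are placeholders for whatever the real file contains:

SELECT  u.Dentist,
        u.ProcedureCode,
        u.Fee
FROM    Staging_DentistFees
UNPIVOT (Fee FOR ProcedureCode IN (Procedure1, Procedure2, Procedure3)) AS u;
-- Rows where the fee is NULL are dropped automatically, which suits the ragged source rows.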
Regarding the zip codes: if the doctor doesn't have a fee, are these something like a usual and customary (UCR) fee? That could simply be determined from the zip code of the doctor row. In this case you have a few options. You can overlay the doctor fee schedule over a zip code fee schedule:
ZipCodeSchedule: ZipScheduleID, ZipCode, EffectiveDate
ZipCodeScheduleDetail: ZipScheduleID, ProcedureCode, Fee
Or you could save this in the regular fee schedule (potentially with some kind of flag that it was defaulted to the UCR).
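A sketch of the overlay idea, using the schedule tables sketched above; the Doctor.ZipCode column and the DefaultedToUCR flag are assumptions, and effective-date filtering is left out for brevity:

SELECT    d.DoctorID,
          zd.ProcedureCode,
          COALESCE(fd.Fee, zd.Fee) AS Fee,                      -- the doctor's own fee, else the zip-code default
          CASE WHEN fd.Fee IS NULL THEN 1 ELSE 0 END AS DefaultedToUCR
FROM      Doctor                AS d
JOIN      ZipCodeSchedule       AS z  ON z.ZipCode        = d.ZipCode
JOIN      ZipCodeScheduleDetail AS zd ON zd.ZipScheduleID = z.ZipScheduleID
LEFT JOIN FeeSchedule           AS f  ON f.DoctorID       = d.DoctorID
LEFT JOIN FeeScheduleDetail     AS fd ON fd.ScheduleID    = f.ScheduleID
                                     AND fd.ProcedureCode = zd.ProcedureCode;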