Data Warehouse Modeling


I have a FactCase which reflects the metrics of cases created by customers. It has a field Owner_Key which is linked to DimEmployee.
DimEmployee has all the employees in the organization. However, the Owner_Key in FactCase can also be a particular "Team" (a Queue in Salesforce; as in JIRA, you assign the case to a team rather than to a person). DimEmployee will not have "Team"-related data. We could slam Queues into DimEmployee, but that's obviously a strange fit: it breaks the data modeling rule against mixing granularity (employee / team), and calling the result DimEmployee doesn't make sense in the long term.
Approach 1:
I thought of DimOwner as a separate dimension, but that's not a possibility: the Salesforce "Queue" that would populate DimOwner can be used all over the place, not just for owners, and DimOwner is not a business entity that makes much sense.
Approach 2:
The other approach I can think of is to create a new dimension as a union on top of "DimEmployee" and "DimQueue" that can be used for this type of occurrence. The facts will retain an "Employee_Key" and a "new_dimension_key" to allow both types of analysis. "Queue_key" is TBD; I'm not sure there's enough useful information there. The only question is what to name this "new_dimension_key", i.e. the dimension built from the union.
Please let me know your thoughts on the approaches discussed above, and suggest any better approaches to model this.
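To make Approach 2 concrete, here is a minimal sketch of a union dimension, run through Python's built-in sqlite3 module only so that it is executable. The name DimAssignee, the Assignee_Key column, the 'E'/'Q' key prefixes and all sample rows are assumptions for illustration, not part of the actual warehouse.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE DimEmployee (Employee_Key INTEGER PRIMARY KEY, Employee_Name TEXT NOT NULL);
CREATE TABLE DimQueue    (Queue_Key    INTEGER PRIMARY KEY, Queue_Name    TEXT NOT NULL);

-- Hypothetical union dimension: one row per employee or queue.
-- The type prefix keeps employee and queue keys from colliding.
CREATE VIEW DimAssignee AS
SELECT 'E' || Employee_Key AS Assignee_Key,
       'Employee'          AS Assignee_Type,
       Employee_Name       AS Assignee_Name
FROM DimEmployee
UNION ALL
SELECT 'Q' || Queue_Key, 'Queue', Queue_Name
FROM DimQueue;

-- The fact keeps both keys: Employee_Key stays NULL when a queue owns the case.
CREATE TABLE FactCase (
    Case_Key     INTEGER PRIMARY KEY,
    Assignee_Key TEXT NOT NULL,
    Employee_Key INTEGER REFERENCES DimEmployee (Employee_Key)
);

INSERT INTO DimEmployee VALUES (1, 'Alice'), (2, 'Bob');
INSERT INTO DimQueue    VALUES (1, 'Tier 2 Support');
INSERT INTO FactCase    VALUES (1, 'E1', 1), (2, 'E1', 1), (3, 'Q1', NULL), (4, 'E2', 2);
""")

# Case counts per owner, regardless of whether the owner is a person or a queue.
cases_per_assignee = conn.execute("""
SELECT a.Assignee_Type, a.Assignee_Name, COUNT(*) AS Case_Count
FROM FactCase f
JOIN DimAssignee a ON a.Assignee_Key = f.Assignee_Key
GROUP BY a.Assignee_Type, a.Assignee_Name
ORDER BY a.Assignee_Type, a.Assignee_Name
""").fetchall()

for row in cases_per_assignee:
    print(row)
```

Employee-only analysis can still join FactCase directly to DimEmployee on Employee_Key, while the view serves reports that must include queue-owned cases.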
TIA

Related

Which database design is better?

I am about to create a database to track weight-lifting exercises.
Which approach would you prefer?
Solution A:
Two tables
Exercise (with ID, Name etc.)
Set (with ID, Set_Number, Date, FK_Exercise)
Here, one Exercise and Set have a one-to-many relationship.
Set_Number is supposed to track which set it is on a given date (1st set, 2nd set, 3rd set etc.)
Advantage: One table less to deal with.
Solution B:
Three tables:
Exercise (with ID, Name etc.)
Session (with ID, Date, FK_Exercise)
Set (with ID, Set_Number, FK_Session)
Here, a Session would be something like a connector between Exercise and Set. So basically a sequence of sets on a given day for a given exercise will be pooled in one Session instance.
In this case, Exercise and Session have a one-to-many relationship and Session and Set also have a one-to-many relationship.
Advantage: The Date property will not be redundant for any given day. And logically it makes sense to bundle sets.

A good data model falls out of a proper understanding of the domain. Your domain has three entities:
EXERCISE: particular type of weightlifting move (name and weight)
SET: number of reps of a given EXERCISE (depending on training goal - strength, muscle, endurance?)
SESSION: number of SETs undertaken on a given date
So you need at least three tables. At least, because EXERCISE has two levels of detail: one is the exercise name and the other is the exercise weight. It's quite likely you will need to store SETs of different combinations of names and weights (Bicep curl / 10kg, Bicep curl / 15kg, etc), in which case you need a look-up table for the EXERCISE name and a fourth table, SET_EXERCISE, to store the weight used for a particular SET of reps.
Having gone through this exercise (o ho!) we can see that your foreign keys are wrong. A SESSION comprises a number of SETs; a SET comprises a number of EXERCISEs (SET_EXERCISEs).
Hence the logical data model should look something like:
EXERCISE (ID, Name, Weight, etc)
SET (ID, FK_Exercise, Reps, etc)
SESSION (ID, FK_Set, Date, etc)
Although this is not quite accurate: SET:SESSION is in fact a many-to-many relationship, as a SESSION will normally comprise more than one SET and a SET can be done in more than one SESSION.
When it comes to a physical data model i.e. tables you should have five tables:
EXERCISE (ID, Name, etc)
SET_EXERCISE (ID, FK_Exercise, FK_Set, Weight, etc)
SET (ID, FK_Set_Exercise, Reps, etc)
SESSION_SET (FK_Set, FK_Session, Set_Number, etc)
SESSION (ID, Date, etc)
The SESSION_SET table is necessary to resolve the many-to-many relationship between SET and SESSION.
The final model has five tables: three tables for the original entities and two intersection tables which join those entities. It so happens that all the relations between the logical entities (EXERCISE, SET, SESSION) have been implemented as intersection tables rather than foreign keys. This doesn't always happen when transforming from a Logical to a Physical data model.
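As a rough sketch of how the five tables could look as DDL, the following runs in SQLite through Python's sqlite3 module. Several things here are assumptions rather than part of the model above: SET is renamed WorkoutSet because SET is a reserved word in SQL, the weight column is called Weight_Kg, the back-reference SET.FK_Set_Exercise is left out so the two tables do not reference each other in a circle, and the sample rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces FKs when asked to
conn.executescript("""
CREATE TABLE Exercise   (ID INTEGER PRIMARY KEY, Name TEXT NOT NULL UNIQUE);
CREATE TABLE WorkoutSet (ID INTEGER PRIMARY KEY, Reps INTEGER NOT NULL);
CREATE TABLE Session    (ID INTEGER PRIMARY KEY, Date TEXT NOT NULL);

-- Intersection table: which exercise, at which weight, a set consists of.
CREATE TABLE SetExercise (
    ID          INTEGER PRIMARY KEY,
    FK_Exercise INTEGER NOT NULL REFERENCES Exercise (ID),
    FK_Set      INTEGER NOT NULL REFERENCES WorkoutSet (ID),
    Weight_Kg   INTEGER NOT NULL
);

-- Intersection table resolving the many-to-many between sets and sessions.
CREATE TABLE SessionSet (
    FK_Set     INTEGER NOT NULL REFERENCES WorkoutSet (ID),
    FK_Session INTEGER NOT NULL REFERENCES Session (ID),
    Set_Number INTEGER NOT NULL,
    PRIMARY KEY (FK_Session, Set_Number)
);

INSERT INTO Exercise    VALUES (1, 'Bench press');
INSERT INTO WorkoutSet  VALUES (1, 8), (2, 12);
INSERT INTO SetExercise VALUES (1, 1, 1, 130), (2, 1, 2, 100);
INSERT INTO Session     VALUES (1, '2024-01-01'), (2, '2024-01-03');
-- Set 1 is reused: it is performed in both sessions.
INSERT INTO SessionSet  VALUES (1, 1, 1), (2, 1, 2), (1, 2, 1);
""")

# Full training log: one row per set performed in a session.
training_log = conn.execute("""
SELECT s.Date, ss.Set_Number, e.Name, se.Weight_Kg, w.Reps
FROM Session s
JOIN SessionSet  ss ON ss.FK_Session = s.ID
JOIN WorkoutSet  w  ON w.ID = ss.FK_Set
JOIN SetExercise se ON se.FK_Set = w.ID
JOIN Exercise    e  ON e.ID = se.FK_Exercise
ORDER BY s.Date, ss.Set_Number
""").fetchall()

for row in training_log:
    print(row)
```

The composite primary key on SessionSet is one possible way to stop the same set number from appearing twice within a session.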
This is not the only way of modelling the domain. As a design activity data modelling is about interpreting the rules to fit the data you need to record. The data is the starting point.
"it seems I didn't make myself clear regarding the Session entity... the naming is probably bad and misleading"
This is why I said the data model follows from a proper understanding of the domain. EXERCISE, SET and SESSION are domain terms. You are of course welcome to make your own definitions of things for your private projects, but in real life data models are a mechanism for communication between Development and Business: the meaning of things is crucial, and must conform to a common understanding. We cannot build a data model where SESSION means something different from what the business understands by "session".
"I also don't understand how a Set can be done in more than one Session?"
A SET is a pattern of EXERCISE for a number of reps. So #1 / benchpress / 130KG / 8 reps is a SET and #2 / benchpress / 100KG / 12 reps is a different SET. If you benchpressed 130KG eight times on Monday and Wednesday then that's the same SET in two different SESSIONs. Maybe it's a layer of detail too far; but if you're going to build a database app to track your workouts instead of using a spreadsheet like most people you might as well build the best data model you can :-)
Again, data modelling is an exercise with a large dose of opinion: if your data model is good enough for your current needs then it is good enough. The thing is, a more rigorous data model is paradoxically more flexible (because enforcing data integrity rules makes it easier to write queries and be sure that the results are correct). What might be good enough now might be a terrible brake on innovation in the future.

Is it OK to have duplicated values in SQL?

I'm not a DBA so I'm not familiar with the proper lingo, so maybe the title of the question could be a little misleading.
So, the thing. I have Members for a certain system, these members can be part of a demographic segment (any kind of segment: favorite color, gender, job, etc)
These are the tables
SegmentCategory
ID, Name, Description
SegmentCategory_segment
SegmentID, SegmentCategoryID
Segment
ID, Name, Description
MemberSegment
ID, MemberID, SegmentID
So the guy that designed the DB decided to go uber normalizing everything so he put the member's gender on a segment and not in the Member's table.
Is this OK? According to my logic, gender is a property of the Member, so it must be on its entity. But doing this means there will be duplicated data (the gender on the Member and the gender as a segment). A trigger on the Member table could fix this, though (update the segment on a gender change).
Having to crawl 4 tables just to get a property from the member seems like over engineering to me.
My question is whether I'm right or not? If so, how could I propose the change to the DBA?

There isn't a blanket rule you can apply to database decisions like this. It depends on what applications/processes it is supporting. A database for reporting is much easier to work with when it is more de-normalized (in a well thought out way) than a more transactional database is.
You can have a customer record spread across 2 tables, for instance, if some data is accessed or updated more often than other parts. Say you only need one half of the data in 90% of your queries, but don't want to drag around the varchar(max) fields you have there for whatever reason.
Having said that, having a table with just a gender/memberid is on the far side of extreme. From my naive understanding of your situation I feel you just need a members table with views over top for your segments.
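One way the "members table with views over top" idea could look is sketched below, executed with Python's sqlite3 module. The Gender and FavoriteColor columns, the view name MemberSegmentView and the sample members are all assumptions; the real Member table is not shown in the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Properties of a member live on the member row itself.
CREATE TABLE Member (
    ID            INTEGER PRIMARY KEY,
    Name          TEXT NOT NULL,
    Gender        TEXT,
    FavoriteColor TEXT
);

-- Hypothetical view exposing those columns in segment shape.
-- Nothing is stored twice, so no trigger is needed to keep copies in sync.
CREATE VIEW MemberSegmentView AS
SELECT ID AS MemberID, 'Gender' AS SegmentCategory, Gender AS Segment
FROM Member WHERE Gender IS NOT NULL
UNION ALL
SELECT ID, 'FavoriteColor', FavoriteColor
FROM Member WHERE FavoriteColor IS NOT NULL;

INSERT INTO Member VALUES (1, 'Ann', 'F', 'Blue'), (2, 'Raj', 'M', NULL);
""")

segments_sql = """
SELECT MemberID, SegmentCategory, Segment
FROM MemberSegmentView
ORDER BY MemberID, SegmentCategory
"""
segments_before = conn.execute(segments_sql).fetchall()

# Changing the property on the member is immediately visible through the view.
conn.execute("UPDATE Member SET FavoriteColor = 'Green' WHERE ID = 2")
segments_after = conn.execute(segments_sql).fetchall()

print(segments_before)
print(segments_after)
```

Reading a member's gender is then a single-table lookup, while segment-style reports can keep querying the view.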
As for the DBA, ultimately I imagine it will be them who will need to maintain the integrity of the data, so I would just approach them and say "hey, what do you think of this?" Hopefully they'll either see the merit or be able to give you the reasons for their design decisions.

Database Modelling / ER Diagram - Should Look-up tables be left alone as a separate entity?

I'm trying to create a DB that can manage/record sightings of many differing machinery types, be it Cars, Buses, Trucks, Boats, Trains, etc. which will also record all the characteristics of such a sighting [which would obviously vary greatly: colour, hull type, vehicle model, etc., etc.] and where the sighting occurred.
Here's my confused ER diagram.
Where I'm getting confused is how I should go about recording/referencing the pre-defined characteristics [found in the characteristics table] in the Item_Observation table. I would have to create another many-to-many table to hold them, but I feel I'm not implementing it very well because of the table duplication.
But then I feel - I'm not 100% sure why - that storing the observed characteristic's data in the look-up table itself is also not a good idea?
Which begs the question, should Look-up tables be left alone as a separate entity? And probably more to the point, is it my schema that's completely flawed? If you haven't already guessed, I'm certainly no DB designer. Thanks in anticipation, cheers Dyr

You are modelling a DBMS's metadata design, but not your application.
See these two posts' questions and answers.

Sql design question - many tables or not?

15 ECTS credits worth of database design down the bin.. I really can't come up with the best design solution for my problem.
Which is this: Basically I'm making a tool that gathers a lot of information concerning the user. At the most the user would fill in 50 fields of data, ranging from simple checkboxes to text input. I'm designing the db right now (with MySQL) and can't decide whether or not to use a single User table with all of those fields, or to have a table for each category of input.
One example would be "type of payment". This one has three options and if I went with the "table" way I would add a table paymentType and give it binary fields for each payment type. Then I would need an id table to identify which paymentType the user has chosen, whereas if I use a single user table, the data would already be there.
The site will probably see a lot of users (tv, internet and radio marketing) so I'm concerned which alternative would be the best.
I'll be happy to provide more details if you need more to base a decision.
Thanks for reading.

Read this article "Database Normalization Basics", and come back here if you still have questions. It should help a lot.
The most fundamental idea behind these decisions, as you will see in this article, is that each table should represent one and only one "thing", and each field should relate directly and only to that thing.
In your payment types example, it probably makes sense to break it out into a separate table if you anticipate the need to store additional information about each payment type.

Create your "Type of Payment" table; there's no real question there. That's proper normalization and the power behind using relational databases. One of the many reasons to do so is the ability to update a Type of Payment record and not have to touch the related data in your users table. Your join between the two tables will allow your app to see the updated type of payment info by changing it in just the 1 place.
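A minimal sketch of that look-up table, run with Python's sqlite3 module, might look like this. The table and column names (PaymentType, AppUser, PaymentTypeID) and the three payment options are assumptions, since the question does not list them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
-- One row per payment type, instead of one binary column per type.
CREATE TABLE PaymentType (
    ID   INTEGER PRIMARY KEY,
    Name TEXT NOT NULL UNIQUE
);

-- Each user points at exactly one payment type through a foreign key.
CREATE TABLE AppUser (
    ID            INTEGER PRIMARY KEY,
    UserName      TEXT NOT NULL,
    PaymentTypeID INTEGER NOT NULL REFERENCES PaymentType (ID)
);

INSERT INTO PaymentType VALUES (1, 'Visa'), (2, 'Invoice'), (3, 'PayPal');
INSERT INTO AppUser     VALUES (1, 'ann', 1), (2, 'bob', 1), (3, 'cy', 3);
""")

# Rename a payment type in one place; no user rows are touched.
conn.execute("UPDATE PaymentType SET Name = 'Visa / Visa Debit' WHERE ID = 1")

users_with_payment = conn.execute("""
SELECT u.UserName, p.Name
FROM AppUser u
JOIN PaymentType p ON p.ID = u.PaymentTypeID
ORDER BY u.ID
""").fetchall()

print(users_with_payment)
```

The foreign key also rejects a user row that refers to a payment type that does not exist, which a free-text column on the user table would not do.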
Regarding your other fields, they may not be as clear cut. The question to ask yourself about each field is "does this field relate only to a user or does it have meaning and possible use in its own right?". If you can never imagine a field having meaning outside of the context of a user you're safe leaving it as a field on the user table, otherwise do the primary key-foreign key relationship and put the information in its own table.

If you are building a form with variable inputs, I wouldn't recommend building it as one table. This is inflexible and dirty.
Normalization is the key. However, if you end up with a key/value setup (effectively a scalar-type implementation spread across many tables), you may hit a performance boundary if you can't cache:
a) the form definition from table data, and
b) the joined result of storage (either a caching view or otherwise),
or if you don't build in proper sharding.
In this KVP setup, you might want to look at something like CouchDB or a less table-driven storage format.
You may also want to look at trickier setups such as serialized object storage and cache tables if your internal data is heavily related to other data already in the database.

50 columns is a lot. Have you considered a table that stores values like a property sheet? This would only be useful if you didn't need to regularly query the values it contains.
INSERT INTO UserProperty(UserID, Name, Value)
VALUES(1, 'PaymentType', 'Visa');
INSERT INTO UserProperty(UserID, Name, Value)
VALUES(1, 'TrafficSource', 'TV');
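To illustrate why a property sheet is awkward to query regularly, here is a small sketch using Python's sqlite3 module. The column types, the composite primary key and the row for user 2 are assumptions added for the example; the two rows for user 1 are the ones inserted above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE UserProperty (
    UserID INTEGER NOT NULL,
    Name   TEXT    NOT NULL,
    Value  TEXT,
    PRIMARY KEY (UserID, Name)   -- one value per property per user
);

INSERT INTO UserProperty (UserID, Name, Value) VALUES (1, 'PaymentType', 'Visa');
INSERT INTO UserProperty (UserID, Name, Value) VALUES (1, 'TrafficSource', 'TV');
INSERT INTO UserProperty (UserID, Name, Value) VALUES (2, 'PaymentType', 'Invoice');
""")

# Getting one row per user back means pivoting: one CASE expression per
# property, which is the cost of not having real columns.
user_rows = conn.execute("""
SELECT UserID,
       MAX(CASE WHEN Name = 'PaymentType'   THEN Value END) AS PaymentType,
       MAX(CASE WHEN Name = 'TrafficSource' THEN Value END) AS TrafficSource
FROM UserProperty
GROUP BY UserID
ORDER BY UserID
""").fetchall()

print(user_rows)
```

With 50 properties the pivot needs 50 such expressions, and every value is stored as text, so type checks and constraints have to happen outside the table.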

I think I figured out a great way of solving this. Thanks to a friend of mine for suggesting this!
I have three tables, Field {IdField, FieldName, FieldType}, FieldInput {IdInput, IdField, IdUser} and User { IdUser, UserName... etc }
This way it becomes very easy to see what a user has answered, the solution is somewhat scalable and it provides a good overview. I will constrain the alternatives in another layer, farther away from the db. I believe it's a tradeoff worth doing.
Any suggestions or criticism of this solution?

SQL efficiency argument, add a column or solvable by query?

I am a recent college graduate and a new hire for software development. Things have been a little slow lately so I was given a db task. My db skills are limited to pet projects with Rails and Django. So, I was a little surprised with my latest task.
I have been asked by my manager to subclass Person with a 'Parent' table and add a reference to their custodian in the Person table. This is to facilitate going from Parent to Form when the custodian, not the Parent, is the FormContact.
Here is a simplified, mock structure of a sql-db I am working with. I would have drawn the relationship tables if I had access to Visio.
We have a table 'Person' and we have a table 'Form'. There is a table, 'FormContact', that relates a Person to a Form, not all Persons are related to a Form. There is a relationship table for Person to Person relationships (Employer, Parent, etc.)
I've asked, "Why this couldn't be handled by a query?" Response, Inefficient. (Really!?!)
So, I ask, "Why not have a reference to the Form? That would be more efficient since you wouldn't be querying the FormContacts table with the reference from child/custodian." Response, this would essentially make the Parent a FormContact. (Fair enough.)
I went ahead and wrote a query to get from a non-FormContact Parent to Form, and tested it on the production server. The response time was instantaneous. SOME_VALUE is the Parent's fk ID.
SELECT FormID
FROM FormContact
WHERE FormContact.ContactID IN (
    SELECT SourceContactID
    FROM ContactRelationship
    WHERE ContactRelationship.RelatedContactID = *SOME_VALUE*
      AND ContactRelationship.Relationship = 'Parent'
);
If I am right, "This is an unnecessary change." What should I do: defend my position, or concede to the manager's request?
If I am wrong. What is my error? Is there a better solution than the manager's?

First things first, your query could use some reworking. Rather than subselects, try using a join:
SELECT fc.FormID
FROM FormContact fc
JOIN ContactRelationship cr
  ON cr.SourceContactID = fc.ContactID
 AND cr.Relationship = 'Parent'
WHERE cr.RelatedContactID = :parent_id  -- placeholder for the Parent's ID
Secondly, the issue you're dealing with is normalization vs. performance. From a purity perspective, yes, your solution is "more correct" (as you aren't duplicating data, which eliminates the possibility for the disparities in the duplicated data causing conflicts and aberrant behavior), but pure normalization is not always the wisest course of action.
Normalization can induce performance penalties, especially in larger data sets. These penalties have to be weighed alongside the benefits from normalization to see which side "wins".
That being said, I can't see how joining the Person table again on the ParentID column (I'm assuming that's what you'd be adding) would provide a performance boost over the join listed above, assuming that the columns in question are properly indexed.
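A quick way to check this kind of claim on a scratch copy is sketched below with Python's sqlite3 module. The column types, index names and sample rows are assumptions, and SQLite's planner is not the production engine's, so the printed plan only illustrates the idea of verifying index use.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE FormContact (FormID INTEGER NOT NULL, ContactID INTEGER NOT NULL);
CREATE TABLE ContactRelationship (
    SourceContactID  INTEGER NOT NULL,
    RelatedContactID INTEGER NOT NULL,
    Relationship     TEXT    NOT NULL
);

-- Hypothetical indexes on the columns both queries filter and join on.
CREATE INDEX ix_formcontact_contact ON FormContact (ContactID);
CREATE INDEX ix_rel_related ON ContactRelationship (RelatedContactID, Relationship);

-- Contact 20 has parent 10 and is the contact for forms 100 and 101.
INSERT INTO FormContact VALUES (100, 20), (101, 20), (200, 30);
INSERT INTO ContactRelationship VALUES (20, 10, 'Parent'), (30, 10, 'Employer');
""")

subquery_sql = """
SELECT FormID FROM FormContact
WHERE ContactID IN (SELECT SourceContactID FROM ContactRelationship
                    WHERE RelatedContactID = ? AND Relationship = 'Parent')
"""
join_sql = """
SELECT fc.FormID
FROM FormContact fc
JOIN ContactRelationship cr
  ON cr.SourceContactID = fc.ContactID AND cr.Relationship = 'Parent'
WHERE cr.RelatedContactID = ?
"""

parent_id = 10
subquery_forms = sorted(conn.execute(subquery_sql, (parent_id,)).fetchall())
join_forms = sorted(conn.execute(join_sql, (parent_id,)).fetchall())
print(subquery_forms, join_forms)

# Ask the planner how it would run the join; the last column is the description.
for step in conn.execute("EXPLAIN QUERY PLAN " + join_sql, (parent_id,)):
    print(step[-1])
```

One difference to keep in mind: the join returns a form once per matching relationship row, so if duplicate relationship rows can exist it may need DISTINCT to match the IN version.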
If the query above works for you and you do rigorous performance testing to show that it's valid, take it to your manager and ask for his input. Because you're new and fresh out of college, be very willing to defer to your manager's judgment and wishes on this one. There will be much bigger battles to fight in the future.
