NHS Data Warehouse - schema

I have got coursework, which I do not understand, I tried emailing my tutor but he did not respond and I have been waiting for about 2 months now... I am supposed to create a Star/Snowflake Schema focusing on 2 fact tables.
The project must focus on the NHS, we are free to define the scope so I decided to focus on COVID-19. I have created a star schema for 1 fact table, which is called "Deaths", my idea is the data warehouse to show which areas have the highest death rate so that the NHS knows which areas are in demand in order to manage the situation accordingly.
I was thinking, the second Fact table to be Infection/Infected, which is supposed to see which areas have the highest infection rates. I think that it would not work because the dimension for "Infected" should be different than the ones for deaths( I am not sure if they have to be the same)?
Could you share with me your thoughts and recommendation?
Here is the assignment brief and below the brief is my star schema design(Which I think is wrong).

I don't see the need of having two facts one for recovered and one for death cases.
You can have an only one FactDiagnosticAnalysis gathering :
TreatmenCenterSK
PatientSK
TreatmenSK
StaffSK
DiagnosticSK
DateSK
Result
InsertedDate : a technical column to capture when the record was insterted
The Result column will have the values : Infected,Not Infected, Recovered,Dead at a specific date since :
a patient will have many analysis until his recovery
a patient can be not infected when he arrives after doing the
analysis
a patient will be recovered after many analysis
a patient can die after many analysis
Your model can be like below :
Actually, in this case your fact is a factless fact.
A factless fact table captures the many-to-many relationships between dimensions, but contains no numeric or textual facts.
You ceate the measures in your reports/dashboards as views (if you are using SQL):
Area having the highest death rate
The number medical centers reaching their maximum capacities

Related

Designing a cube

I’ve been asked create our analysis cube and have a design question.
We sell ‘widgets’ and ‘parts’ to go with those widgets. Each order has many widgets and sometimes a few parts.
What I’m stuck on is – to me, an order is a fact in a measure. But, what are the widgets? Are they a dimension and each fact in the measure will be an entry for every part and widget for the order.
So, if order 123 had widget 1 and widget 2 and part 5, then there will be 3 facts in the measure for the same order? Is that correct?
At its basic level you can consider most facts to be transactions or transaction line items. So, for example, you may have a 'sales' fact table in which each record represents one line item from that sale. Each fact record would have numeric columns representing metrics and other columns joining to dimension tables. The combination of those dimensions would describe that line item. So, in your case, you likely have something like:
1) A 'date' dimension detailing the date of the transaction
2) A 'widget' dimension detailing the widget sold on that transaction
3) A 'customer' dimension detailing the customer who bought that item (almost certainly the same customer would appear on every line item for this transaction)
4) ... determined by what information you have and what business problem you're trying to solve.
Now, the dimension tables contain further details. For example, your widget dimension table likely contains things like the name of the widget, the color, the manufacturer, etc. Every time your company sells one of these widgets, the record in the fact table links to that same dimension record for that name, color, manufacturer, etc. combination (i.e. you don't create a new dimension record every time you sell the same item - this is a one-to-many relationship - each dimension record may have many related fact records).
You other dimension tables would similarly describe their dimensions. For example, the customer dimension might give the customer's name, their address, ...
So, the short answer to your question is that widget likely is a dimension, items and widgets may (or may not) actually be the same dimension (in a school class I suspect that they are), and that you would have 3 fact records for that one transaction.
This is probably along the same lines as the prior answer but....
If you try and model "many widgets per order" you'll have issues because you end up with a many (order fact) to many (widgets) relationship. In a cube / star schema design, many to many relationship usually need to be moddeled out to be many to one in some way.
So what you do is try and identify what special thing identifies an "order" (as opposed to a bunch of widgets in an order). Usually that is simply stuff like order date, customer, order number, tax
An example way to model this is:
If you have a single order with five widgets, you model that as a fact table with five records that happens to have a repeating widget, customer, date etc. in it
Then you have to work out how you spread an order header tax amount over five records. The two obvious solutions are:
Create a widget that represents tax and add that as another record
Spread the tax over five records, either evenly or weighted by something
Modelling "parts" just takes these concepts further.
It is important to understand what the end user wants to see, why they want to see parts. What do they want to measure by parts, how do you assign higher level values (like tax) down to lower levels like parts.

How can I model an optional "return ticket" in a database?

Here is a general layout of my tables:
I have an order table that contains one or more reservations.
I have a reservation table that is connected to a train and one or more tickets.
I have a train table which holds the information for a train, and a ticket table that does the same for tickets.
What I want to add to the system is the concept of return tickets, where the user asks for a two-way ticket. Since these tickets have two seat and ticket numbers and such, I will have to create new tickets. The same holds true for a train; since the information is different for the return train, I will have to create two trains: one for the originating train, and the other for the return train.
This leads me to conclude that these are, in fact, two different reservations (they need two calls to the underlying API, as well). So in other words, when someone buys a two-way ticket, I need to create two reservations, which are connected to two different trains, which connect to the same tickets as the originating reservation.
The problem begins when I want to connect these two reservations together. I'm looking for a strategy here that would allow me to know what kind of reservation (origination versus return) this reservation is, and what the other reservation is. For example, if the current reservation has an id of 4 and is an origination reservation, I want to be able to find the reservation with the id of 5 which is the return reservation.
Here are what I thought and the problem with each one.
Method One: originating_reservation_id and return_reservation_id
This is the easiest, but I think the worst, solution. In this case I will add these two columns to the reservation table. If originating_reservation_id is filled, then this reservation is a return reservation, and otherwise this is the origination reservation. The obvious downside to this is the fact that you get a lot of null values for one-way tickets.
Method Two: a Middle Table
In this scenario, I will have a table that has a type field, a first_reservation_id and a second_reservation_id. The types could be origination which means first_reservation_id is the origination, and second_reservation_id is the return. I will have another row in this table too, which is the opposite, meaning for each two-way ticket, two rows are created in this table. Apart from the fact that I need two rows for each two-way reservation, the layout seems unintuitive.
What am I missing here?
This problem is best solved by adding another level, rather than having two reservations at the same level.
Have a journey table that stores each reservation and a leg table that stores each trip that is part of that reservation. Journey and leg have a one to many relationship. A journey that is a single would only have one leg, and a return would have two.
This would have other benefits, besides solving your problem.
The whole journey and individual legs have different information associated with them and may need to be queried differently.
Train journeys may have more than two legs. A simple return trip might involve changing trains (and therefore having more than one seat reservation in each direction). This design would handle that requirement, now, or if it is ever introduced later.

SSAS Cube Modeling

Two part architecture question:
I have employee, job title, and supervisor dimensions. I kind of wanted to keep them in one dimension and have something like site > supervisor > job title > employee. The problem is that these need to be SCD. That is, they have historical associations to relate to the facts. The fact tables have a requirement to be processed every five minutes (dashboard).
1) Should I have these in a single dimension with a surrogate key (or composite for that matter)? The keys/surrogate key would be composed of calendar_id - employee_id.
2) Have the fact tables have maintain a reference to three different dimensions instead?
The requirement to process every 5 minutes (MOLAP SSIS ETL driven processing). Makes me lean toward keeping the time/change in the facts so that I would ease having to process the dimensions along with the fact tables.
I would design it as a single dimension, with the hierarchy you mentioned: site > supervisor > job title > employee.
Let's call this dimension EmployeeAssignment, because its granularity is not Employees, but any combination of site/supervisor/job title that an employee "adopts" during his/her career. (Feel free to come up with a better name).
I don't think you need a calendar_id key in this dimension: a surrogate key based on DISTINCT SiteID,SupervisorID,JobTitleID,EmployeeID would be enough. Adding a calendar_id key would be making the dimension do too much work: over and above slicing the actual facts, this would make the dimension answer questions like
"Where was employeeID 12345 (in the site/supervisor/job title network) on 1 January 2015?" and
"How many employees did supervisorID 98765 supervise on 1st January 2015?"
These questions IMHO are best addressed with a fact, not a dimension. One cube I've worked on addresses with with an EmployeeDay measure: sliced by dimensions "EmployeeAssignment" and Time, this simply has a 1 if the employee is in that "assignment" on that day.
This EmployeeAssignment SCD is actually pretty slowly-changing, especially compared to your 5-minute fact update interval. Employees are not going to move about or get promoted every 5 minutes, so a reprocess of the dimension shouldn't be necessary more often than daily.
If I've misunderstood anything, let me know in the comments.

Is there an appropriate way to show correlation or causation from a SQL query?

I'm curious if there's any way that Microsoft Access or SQL Server can provide a function to show correlation or causation from a SQL Select statement and its results.
This is basically the use case scenario:
Let's say you have two tables, table studentCourseSurveys and table onlineCourseReviews. At the forefront these two tables are mostly unrelated but could be joined based on the course name, for example "ENG 101".
studentCourseSurveys is a table that is intended to hold data that students submit during
their in person course survey at the end of a semester. For example, in the last day of class students receive that form to fill out to rate the instructor based on things such as "exams were related to the actual lecture content", "instructor was on time and prepared", and then at the end they have their short answer opportunities to give additional comments.
onlineCourseReviews is a table that is used by an internal department that conducts content reviews on the online component of courses. For example, this department has individual instructional designers who are assigned different courses in Blackboard to review. They review content, delivery, course structure, and so on and so forth. The course is then given its comments, score, etc.
As already mentioned the tables would be most unrelated. But let's say someone wanted to show a correlation that the results of the online course reviews could somehow indicate that the quality of the overall course was better because of these online reviews, and that this was shown in the responses from the students based on their survey results.(basically, a course that got an online course review score of 10 has course surveys where a great majority of the students rated it as excellent, content was relevant, teacher was prepared, etc, to indicate that an improvement in the online course translated to improvements in the in-person class and overall class quality).
This almost seems like a unique job for a statistician but I'd like to know if it's possible to show this data based on a query in Access or SQL Server. I know that you could just easily join the two tables with a foreign key and then get the results of a survey and online course review in a single statement but that doesn't really say anything. I would think that to show ANY kind of relationship that you need to illustrate a trend over any given period of time.
Thank you.
I would suggest creating a query with three values - course name (e.g., ENG101), the student rating, and the online course rating. In SQL Server you can save the results as a .csv file. Do this, then open it in Excel and use the RSQ function , or R-Squared, to find the correlation coefficient between columns two and three. The closer R-squared is to 1, the closer the two match. 1 mean a perfect correlation. 0 means no relation at all and -1 means related but but in a polar opposite manner.

SSIS Population of Slowly Changing Dimension with outrigger

Working on a data warehouse, a suitable analogy for the problem is that we have Healthcare Practitioners. Healthcare Practitioners have a number of professional attributes and work in an open number of teams and in an open number of clinical areas.
For example, you may have a nurse who works in children's services across a number of teams as a relief/contractor/bank staff person. Or you may have a newly qualified doctor who works general medicine who is doing time in a special area pending qualifying as a consultant of that special area.
So we have an open number of areas of work and an open number of teams, we can't have team 1, team 2 etc in our dimensions. The other attributes may change over time also, like base location (where they work out of), the main team and area they work in..
So, following Kimble I've gone for outriggers:
Table DimHealthProfessionals:
Key (primary key, identity)
Name
Main Team
Main Area of Work
Base Location
Other Attribute 1
Other Attribute 2
Start Date
End Date
Table OutriggerHealthProfessionalTeam:
HPKey (foreign key to DimHealthPRofessionals.Key)
Team Name
Team Type
Other Team Attribute 1
Other Team Attribute 2
Table OutriggerHealthProfessionalAreaOfWork:
HPKey (as above)
Area of Work
Other AoW attribute 1
If any attribute of the HP changes, or the combination of teams or areas of work in which they work change, we need to create a new entry in the SCD and it's outrigger tables to encapsulate this.
And we're doing this in SSIS.
The source data is basically an HP table with the main attributes, a table of areas of work, a table of teams and a pair of mapping tables to map a current set of areas of work to an HP.
I have three data sources, one brings in the HCP information, one the areas of work of all HCPs and one the team memberships.
The problem is how to run over all three datasets to determine if an HP has changed an attribute, and if they have changed an attribute, how we update the DIM and two outriggers appropriately.
Can anyone point me at a best practice for this? OR suggest an alternative way of modelling this dimension?
Admittedly I may not understand everything here, but it seems to me that the relationship in this example should be reversed. Place TeamKey and the WorkAreaKey in the dimHealthProfessionals -- this should simplify things.
With this in place, you simply make sure to deliver outriggers before the dimHealthProfessionals.
Treat outriggers as dimensions in their own right. You may want to treat dimHealthProfessionals as a type 2 dimension, to properly capture the history.
EDIT
Considering that team to person is many-to-many, a fact is more appropriate.
A column in a dimension table is appropriate only if a person can belong to only one team at a time. Same with work areas.
The problem is how to run over all three datasets to determine if an HP has changed an attribute, and if they have changed an attribute, how we update the DIM and two outriggers appropriately.
Can anyone point me at a best practice for this? OR suggest an alternative way of modelling this dimension?
I'm not sure I understand your question fully. If you are unsure about change detection, then use Checksums in the package. Build up a temp table with the data as it is in the source, then compare each row to its counterpart (joined via the business keys) by computing the checksum for both rows and comparing those. If they differ, the data has changed.
If you are talking about cascading updates in a historized dimension hierarchy (and you can treat the outriggers like a hierarchy in this context) then the foreign key lookups will automatically lookup the newer entry in DimHealthProfessionals if you have a historization (i.e. have validFrom / validThrough timestamps in DimHealthProfessionals). Those different foreign keys result in a different checksum.