Fact table designing for SSAS - sql

I'm designing a fact table for SSAS and this is the first time I'm trying my hand at this as this is to be a prototype system just to show what could be done and to show to someone to decide if it what they are after.
I've made up some data and am now trying to create the fact table. The cube will be looking at referrals and what I'm trying to show is the information over time showing the number of referrals that opened in a month, number that closed in a month and the number that were open at any point in the month (i.e. they could have opened in previous month and closed in a future month).
How is it best to design these measure is where I'm stuck. Should it be three fact tables or can I get away with one? If I do three fact tables, I can link on the record number and the open date to get number that opened in a month, I can link on record number and closed date to create number that closed in a month, but the one I have no idea on is to describe when it was open at any point in the month. For this table would I need to create a row for every day for every referral? This seems a bit intensive and so immediately I thought it was wrong.
So the questions are twofold:
Can I do the three measures in one table and if so what is the best method for this?
What is the best method for the open at any point in the month count?
Any thoughts would be most appreciated as I truely am a beginner at this and all I have to aid me is google as I have a short deadline for this.
Dimensions I have:
Demographics: Record number; Gender; Ethnicity; Birth date;
Referral: Record number; Open date; End date;
Time: Date; Month; Quarter; Year;
The fact table I initially designed was:
Data:
Record number; Opened_in_month; Closed_in_month; Open_in_month;
Since creating the cube, I can see that the numbers do not match up to what I put in the test data and so I know that I have messed up the fact table and it's that table I need to re-create.

I have little experience with creating cubes in SSAS but i would probably create a view as something like this
ReferallFacts:
Id | IsOpen | DateOpened | OpenedBy | DateClosed | ClosedBy | OpenForMinutes...
CalendarDimension:
ShortDate | Week | Month | Quarter | Year | FinancialWeek...
EmployeeDimension:
Id | FirstName | LastName | LineManager | Department...
DepartmentDimension:
Id | Name | ParentDepartment | Manager | Location...
I don't really see a need for more than one fact table in this case as all of what you describe "by month", "by day" is handled by the calendar dimension.
Here is a really nice walkthough, and also pcteach.me has some good videos on SSAS.

Have you considered an event-based approach, an event being a referral opening or closing?
First of all, you need to determine the granularity level of your fact table. If you need to know the number of open referrals at a specific date and time in a month, then your fact table must be at the lowest granularity (individual referral records):
FactReferrals: ( DateId, TimeId, EventId, RecordNumber, ReferralEventValue )
Here, ReferralEventValue is just an integer value of 1 when a Referral opens, and -1 when a Referral closes. EventId refers to a dimension with only two members: Opened and Closed.
This approach allows you to get the number of closed or opened events over any given time period. Also, by taking the sum of ReferralEventValue from the beginning of time, and up to a certain point in time, you get the exact amount of open referrals at that specific moment. To speed up this sum in SSAS, you could design aggregations or create a separate measure that is the accumulated sum of ReferralEventValue.
Edit: Of course, if you don't need data at individual referral granularity, you could always sum up the ReferralEventValue per day or even month, before loading the fact table.

Related

Query to find average stock ... with a twist

We are trying to calculate average stock from a movements table in a single sql sentence.
As far as we are, no problem with what we thought was a standard approach, instead of adding up the daily stock and divide by the number of days, as we don’t have daily stock, we simply add (movements*remaining days) :
select sum(quantity*(END_DATE-move_date))/(END_DATE-START_DATE)
from move_table
where move_date<=END_DATE
This is a simplified example, in real life we already take care of the initial stock at the starting date. Let’s say there are no movements prior to start_date.
Quantity sign depends on move type (sale, purchase, inventory, etc).
Of course this is done grouping by product, warehouse, ... but you get the idea.
It works as expected and the calculus is fine.
But (there is always a “but”), our customer doesn’t like accounting days when there is no stock (all stock sold out). So, he doesnt like
Sum of (daily_stock) / number_of_days (which is what we calculate using a diferent math)
Instead, he would like
Sum of (daily stock) / number_of_days_in_which_stock_is_not_zero
For sure we can do this in any programming language without much effort, but I was wondering how to do it using plain sql ... and wasn’t able to come up with a solution.
Any suggestion?
Consider creating a new table called something like Stock_EndOfDay_History that has the following columns.
stock#
date
stock_count_eod
This table would get a new row for each stock item at the start of a new day for the prior day. Rows could then be purged from this table once the applicable date value went outside the date window of interest.
To get the "number_of_days_in_which_stock_is_not_zero", use this.
SELECT COUNT(*) AS 'Not_Zero_Stock_Days' FROM Stock_EndOfDay_History
WHERE stock# = <stock#_value>
AND <date_window_clause>
Other approaches might attempt to just add a new column to the existing stock table to maintain a cumulative sum of the " number_of_days_in_which_stock_is_not_zero". But inevitably, questions will be asked as to how did the non-zero stock days count get calculated? Using this new table approach will address those questions better than the new column approach.

Query Distinct on a single Column

I have a Table called SR_Audit which holds all of the updates for each ticket in our Helpdesk Ticketing system.
The table is formatted as per the below representation:
|-----------------|------------------|------------|------------|------------|
| SR_Audit_RecID | SR_Service_RecID | Audit_text | Updated_By | Last_Update|
|-----------------|------------------|------------|------------|------------|
|........PK.......|.......FK.........|
I've constructed the below query that provides me with the appropriate output that I require in the format I want it. That is to say that I'm looking to measure how many tickets each staff member completes every day for a month.
select SR_audit.updated_by, CONVERT(CHAR(10),SR_Audit.Last_Update,101) as DateOfClose, count (*) as NumberClosed
from SR_Audit
where SR_Audit.Audit_Text LIKE '%to "Completed"%' AND SR_Audit.Last_Update >= DATEADD(day, -30, GETDATE())
group by SR_audit.updated_by, CONVERT(CHAR(10),SR_Audit.Last_Update,101)
order by CONVERT(CHAR(10),SR_Audit.Last_Update,101)
However the query has one weakness which I'm looking to overcome.
A ticket can be reopened once its completed, which means that it can be completed again. This allows a staff member to artificially inflate their score by re-opening a ticket and completing it again, thus increasing their completed ticket count by one each time they do this.
The table has a field called SR_Service_RecID which is essentially the Ticket number. I want to put a condition in the query so that each ticket is only counted once regardless of how many times its completed, while still honouring the current where clause.
I've tried sub queries and a few other methods but haven't been able to get the results I'm after.
Any assistance would be appreciated.
Cheers.
Courtenay
use as
COUNT(DISTINCT(SR_Service_RecID)) as NumberClosed
Use:
COUNT(DISTINCT SR_Service_RecID) as NumberClosed

Best practice for keeping historical data in SQL (for SSAS Cube use)

I am working on an Hotel DB, and the booking table changes a lot since people book and cancel reservation all the time. Trying to find out the best way to convert the booking table to a fact table in SSAS. I want to be able to get the right statsics from it.
For example: if a client X booked a room on Sep 20th for Dec 20th and canceled the order on Oct 20th. If I run the cube on the month of September (run it in Nov) and I want to see how many rooms got booked in the month of Sep, the order X made should be counted in the sum.
However, if I run the cube for YTD calculation (run it in Nov), the order shouldn't be counted in the sum.
I was thinking about inserting the updates to the same fact table every night, and in addition to the booking number (unique key) and add revision column to the table. So going back to the example, let say client X booking number is 1234, the first time I enter it to the table will get revision 0, in Oct when I add the cancellation record, it will get revision 1 (of course with timestamp on the row).
Now, if I want to look on any piroed of time, I can take it by the timestamp and look at the MAX(revision).
Does it make sense? Any ideas?
NOTE: I gave the example of cancelling the order, but we want to track another statistics.
Another option I read about is partitioning the cubes, but do I partition the entire table. I want to be able to add changes every night. Will I need to partition the entire table every night? it's a huge table.
One way to handle this is to insert records in your fact table for bookings and cancellations. You don't need to look at the max(revision) - cubes are all about aggregation.
If your table looks like this:
booking number, date, rooms booked
You can enter data like this:
00001, 9/10, 1
00002, 9/12, 1
00001, 10/5, -1
Then your YTDs will always have information accurate as of whatever month you're looking at. Simply sum up the booked rooms.

storing data ranges - effective representation

I need to store values for every day in timeline, i.e. every user of database should has status assigned for every day, like this:
from 1.1.2000 to 28.05.2011 - status 1
from 29.05.2011 to 30.01.2012 - status 3
from 1.2.2012 to infinity - status 4
Each day should have only one status assigned, and last status is not ending (until another one is given). My question is what is effective representation in sql database? Obvious solution is to create row for each change (with the last day the status is assigned in each range), like this:
uptodate status
28.05.2011 status 1
30.01.2012 status 3
01.01.9999 status 4
this has many problems - if i would want to add another range, say from 15.02.2012, i would need to alter last row too:
uptodate status
28.05.2011 status 1
30.01.2012 status 3
14.02.2012 status 4
01.01.9999 status 8
and it requires lots of checking to make sure there is no overlapping and errors, especially if someone wants to modify ranges in the middle of the list - inserting a new status from 29.01.2012 to 10.02.2012 is hard to implement (it would require data ranges of status 3 and status 4 to shrink accordingly to make space for new status). Is there any better solution?
i thought about completly other solution, like storing each day status in separate row - so there will be row for every day in timeline. This would make it easy to update - simply enter new status for rows with date between start and end. Of course this would generate big amount of needless data, so it's bad solution, but is coherent and easy to manage. I was wondering if there is something in between, but i guess not.
more context: i want moderator to be able to assign status freely to any dates, and edit it if he would need to. But most often moderator will be adding new status data ranges at the end. I don't really need the last status. After moderator finishes editing whole month time, I need to generate raport based on status on each day in that month. But anytime moderator may want to edit data months ago (which would be reflected on updated raports), and he can put one status for i.e. one year in advance.
You seem to want to use this table for two things - recording the current status and the history of status changes. You should separate the current status out and move it up to the parent (just like the registered date)
User
===============
Registered Date
Current Status
Status History
===============
Uptodate
Status
Your table structure should include the effective and end dates of the status period. This effectively "tiles" the statuses into groups that don't overlap. The last row should have a dummy end date (as you have above) or NULL. Using a value instead of NULL is useful if you have indexes on the end date.
With this structure, to get the status on any given date, you use the query:
select *
from t
where <date> between effdate and enddate
To add a new status at the end of the period requires two changes:
Modify the row in the table with the enddate = 01/01/9999 to have an enddate of yesterday.
Insert a new row with the effdate of today and an enddate of 01/01/9999
I would wrap this in a stored procedure.
To change a status on one date in the past requires splitting one of the historical records in two. Multiple dates may require changing multiple records.
If you have a date range, you can get all tiles that overlap a given time period with the query:
select *
from t
where <periodstart> <= enddate and <periodend> >= effdate

Is there a set based solution for this problem?

We have a table set up as follows:
|ID|EmployeeID|Date |Category |Hours|
|1 |1 |1/1/2010 |Vacation Earned|2.0 |
|2 |2 |2/12/2010|Vacation Earned|3.0 |
|3 |1 |2/4/2010 |Vacation Used |1.0 |
|4 |2 |5/18/2010|Vacation Earned|2.0 |
|5 |2 |7/23/2010|Vacation Used |4.0 |
The business rules are:
Vacation balance is calculated by vacation earned minus vacation used.
Vacation used is always applied against the oldest vacation earned amount first.
We need to return the rows for Vacation Earned that have not been offset by vacation used. If vacation used has only offset part of a vacation earned record, we need to return that record showing the difference. For example, using the above table, the result set would look like:
|ID|EmployeeID|Date |Category |Hours|
|1 |1 |1/1/2010 |Vacation Earned|1.0 |
|4 |2 |5/18/2010|Vacation Earned|1.0 |
Note that record 2 was eliminated because it was completely offset by used time, but records 1 and 4 were only partially used, so they were calculated and returned as such.
The only way we have thought of to do this is to get all of the vacation earned records in a temporary table. Then, get the total vacation used and loop through the temporary table, deleting the oldest record and subtracting that value from the total vacation used until the total vacation used is zero. We could clean it up for when the remaining vacation used is only part of the oldest vacation earned record. This would leave us with just the outstanding vacation earned records.
This works, but it is very inefficient and performs poorly. Also, the performance will just degrade over time as more and more records are added.
Are there any suggestions for a better solution, preferable set based? If not, we'll just have to go with this.
EDIT: This is a vendor database. We cannot modify the table structure in any way.
The following should do it..
(but as others mention, the best solution would be to adjust remaining vacations as they are spent..)
select
id, employeeid, date, category,
case
when earned_so_far + hours - total_spent > hours then
hours
else
earned_so_far + hours - total_spent
end as hours
from
(
select
id, employeeid, date, category, hours,
(
select
isnull(sum(hours),0)
from
vacations
WHERE
category = 'Vacation Earned'
and
date < v.date
and
employeeid = v.employeeid
) as earned_so_far,
(
select
isnull(sum(hours),0)
from
vacations
where
category = 'Vacation Used'
and
employeeid = v.employeeid
) as total_spent
from
vacations V
where category = 'Vacation Earned'
) earned
where
earned_so_far + hours > total_spent
The logic is
calculate for each earned row, the hours earned so far
calculate the total hours used for this user
select the record if the total_hours_so_far + hours of this record - total_spent_hours > 0
In thinking about the problem, it occurred to me that the only reason you need to care about when vacation is earned is if it expires. And if that's the case, the simplest solution is to add 'vacation expired' records to the table, such that the amount of vacation remaining for an employee is always just the sum(vacation earned) - (sum(vacation expired) + sum(vacatation used)). You can even show the exact records you want by using the last vacation expired record as a starting point for the query.
But I'm guessing that's not an option. To address the problem as asked, keep in mind that whenever you find yourself using a temporary table try putting that data into CTE (common table expression) instead. Unfortunately I have a meeting right now and so I don't have time to write the query (maybe later, it sounds like fun), but this should get you started.
I find your whole result set confusing and inaccurate and I can see employees sayng, "no I earned 2 hours on Jan 25th not 1." It is not true that they earned 1 hour on that date that was only partially offset, and you will have no end of problems if you choose to display this way. I'd look at a different way to present the information. Typically you either present a list of all leave actions (earned, expired and used) with a total at the bottom or you present a summary of available for use and used.
In over 30 years in the workforce and having been under many differnt timekeeping systems (as well as having studied even more when I was a managment analyst), I have never seen anyone want to display timekeeping information this way. I'm thinking there is a reason. If this is a requirement, I'd suggest pushing back on it and explaining how it will be confusing to read the data this was as well as being difficult to get a well-performing solution. I would not accept this as a requirement without trying to convince the client that it is a poor idea.
As time passes and records are added, performance will get worse and worse unless you do something about it, such as:
Purge old rows once they're "cancelled out" (e.g. vacation earned has had equivalent vacation used rows added and accounted for; vacation used has been used set "expire" vacation earned as "expended")
Add a column that flags if a a row has been "cancelled out", and incorporate this column into your indexes
Tracking how the data changes in this fashion seems an argument to modify your table sturctures (have several, not just one), but that's outside the scope of your current problem.
As for the query itself, I'd build two aggregates, do some subtraction, make that a subquery, then join it on some clever use of one of the ranking functions. Smells like a correlated subquery in there somewhere, too. I may try and hash this out later (I'm short on time), but I bet someone beats me to it.
I'd suggest modifying the table to keep track of Balance in its own column. That way, you only need to grab the most recent record to know where the employee stands.
That way, you can satisfy the simple case ("How much vacation time do I have"), while still being able to do the awkward rollup you're looking for in your "Which bits of vacation time don't line up with other bits" report, which I'd hope is something you don't need very often.