Is there a set based solution for this problem? - sql

We have a table set up as follows:
|ID|EmployeeID|Date     |Category       |Hours|
|1 |1         |1/1/2010 |Vacation Earned|2.0  |
|2 |2         |2/12/2010|Vacation Earned|3.0  |
|3 |1         |2/4/2010 |Vacation Used  |1.0  |
|4 |2         |5/18/2010|Vacation Earned|2.0  |
|5 |2         |7/23/2010|Vacation Used  |4.0  |
The business rules are:
Vacation balance is calculated by vacation earned minus vacation used.
Vacation used is always applied against the oldest vacation earned amount first.
We need to return the rows for Vacation Earned that have not been offset by vacation used. If vacation used has only offset part of a vacation earned record, we need to return that record showing the difference. For example, using the above table, the result set would look like:
|ID|EmployeeID|Date     |Category       |Hours|
|1 |1         |1/1/2010 |Vacation Earned|1.0  |
|4 |2         |5/18/2010|Vacation Earned|1.0  |
Note that record 2 was eliminated because it was completely offset by used time, but records 1 and 4 were only partially used, so they were calculated and returned as such.
The only way we have thought of to do this is to get all of the vacation earned records in a temporary table. Then, get the total vacation used and loop through the temporary table, deleting the oldest record and subtracting that value from the total vacation used until the total vacation used is zero. We could clean it up for when the remaining vacation used is only part of the oldest vacation earned record. This would leave us with just the outstanding vacation earned records.
This works, but it is very inefficient and performs poorly. Also, the performance will just degrade over time as more and more records are added.
Are there any suggestions for a better solution, preferably set-based? If not, we'll just have to go with this.
EDIT: This is a vendor database. We cannot modify the table structure in any way.

The following should do it (although, as others mention, the best solution would be to adjust the remaining vacation as it is spent):
select
    id, employeeid, date, category,
    case
        when earned_so_far + hours - total_spent > hours then
            hours
        else
            earned_so_far + hours - total_spent
    end as hours
from
(
    select
        id, employeeid, date, category, hours,
        (
            select isnull(sum(hours), 0)
            from vacations
            where category = 'Vacation Earned'
              and date < v.date
              and employeeid = v.employeeid
        ) as earned_so_far,
        (
            select isnull(sum(hours), 0)
            from vacations
            where category = 'Vacation Used'
              and employeeid = v.employeeid
        ) as total_spent
    from vacations v
    where category = 'Vacation Earned'
) earned
where
    earned_so_far + hours > total_spent
The logic is:
Calculate, for each earned row, the hours earned before it (earned_so_far).
Calculate the total hours used by that employee (total_spent).
Select the row if earned_so_far + hours of this row - total_spent > 0, and return the smaller of that difference and the row's own hours.
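As a rough, runnable check of that logic, here is a minimal sketch that loads the sample rows from the question into an in-memory SQLite database (through Python's sqlite3 module) and computes the running earned total with a window function instead of a correlated subquery. The assumptions are mine, not the answer's: SQLite stands in for SQL Server, dates are rewritten in ISO format so they sort as text, id is used as a tie-breaker for rows earned on the same date, and the bundled SQLite must be 3.25 or newer for window functions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE vacations (
    id INTEGER PRIMARY KEY, employeeid INTEGER,
    date TEXT, category TEXT, hours REAL
);
-- Sample rows from the question, with dates rewritten as ISO strings.
INSERT INTO vacations VALUES
    (1, 1, '2010-01-01', 'Vacation Earned', 2.0),
    (2, 2, '2010-02-12', 'Vacation Earned', 3.0),
    (3, 1, '2010-02-04', 'Vacation Used',   1.0),
    (4, 2, '2010-05-18', 'Vacation Earned', 2.0),
    (5, 2, '2010-07-23', 'Vacation Used',   4.0);
""")

QUERY = """
WITH earned AS (
    -- Running total of earned hours per employee, oldest first,
    -- including the current row.
    SELECT id, employeeid, date, category, hours,
           SUM(hours) OVER (PARTITION BY employeeid
                            ORDER BY date, id) AS earned_to_date
    FROM vacations
    WHERE category = 'Vacation Earned'
),
used AS (
    SELECT employeeid, SUM(hours) AS total_spent
    FROM vacations
    WHERE category = 'Vacation Used'
    GROUP BY employeeid
)
SELECT e.id, e.employeeid, e.date, e.category,
       -- Two-argument MIN is SQLite's scalar minimum.
       MIN(e.hours, e.earned_to_date - COALESCE(u.total_spent, 0)) AS hours
FROM earned AS e
LEFT JOIN used AS u ON u.employeeid = e.employeeid
WHERE e.earned_to_date > COALESCE(u.total_spent, 0)
ORDER BY e.id
"""

remaining = conn.execute(QUERY).fetchall()
for row in remaining:
    print(row)
```

On the sample data this returns rows 1 and 4 with 1.0 hour each, matching the result set in the question. On SQL Server 2012 or later the same shape should be expressible with SUM() OVER, but that has not been tested here.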

In thinking about the problem, it occurred to me that the only reason you need to care about when vacation is earned is if it expires. And if that's the case, the simplest solution is to add 'vacation expired' records to the table, such that the amount of vacation remaining for an employee is always just sum(vacation earned) - (sum(vacation expired) + sum(vacation used)). You can even show the exact records you want by using the last vacation expired record as a starting point for the query.
But I'm guessing that's not an option. To address the problem as asked, keep in mind that whenever you find yourself using a temporary table, you can try putting that data into a CTE (common table expression) instead. Unfortunately I have a meeting right now, so I don't have time to write the query (maybe later, it sounds like fun), but this should get you started.

I find your whole result set confusing and inaccurate, and I can see employees saying, "No, I earned 2 hours on Jan 25th, not 1." It is not true that they earned 1 hour on that date that was only partially offset, and you will have no end of problems if you choose to display it this way. I'd look at a different way to present the information. Typically you either present a list of all leave actions (earned, expired and used) with a total at the bottom, or you present a summary of available for use and used.
In over 30 years in the workforce, having been under many different timekeeping systems (as well as having studied even more when I was a management analyst), I have never seen anyone want to display timekeeping information this way. I'm thinking there is a reason. If this is a requirement, I'd suggest pushing back on it and explaining how it will be confusing to read the data this way, as well as being difficult to get a well-performing solution. I would not accept this as a requirement without trying to convince the client that it is a poor idea.

As time passes and records are added, performance will get worse and worse unless you do something about it, such as:
Purge old rows once they're "cancelled out" (e.g. vacation earned that has had equivalent vacation used rows added and accounted for; vacation used that has been applied to mark vacation earned as "expended")
Add a column that flags whether a row has been "cancelled out", and incorporate this column into your indexes
Tracking how the data changes in this fashion seems an argument to modify your table structures (have several, not just one), but that's outside the scope of your current problem.
As for the query itself, I'd build two aggregates, do some subtraction, make that a subquery, then join it on some clever use of one of the ranking functions. Smells like a correlated subquery in there somewhere, too. I may try and hash this out later (I'm short on time), but I bet someone beats me to it.

I'd suggest modifying the table to keep track of Balance in its own column. That way, you only need to grab the most recent record to know where the employee stands.
That way, you can satisfy the simple case ("How much vacation time do I have"), while still being able to do the awkward rollup you're looking for in your "Which bits of vacation time don't line up with other bits" report, which I'd hope is something you don't need very often.

Related

How to calculate performance growth across varying timeframes?

I have a problem that I'm wondering if you could help me with, please.
I have a table of customers and their transactions which looks a bit like this:
account_number|till_transaction_date|sales
--------------|---------------------|-----
A |2021-11-01 |10
A |2021-12-05 |25
A |2022-01-10 |40
B |2021-09-05 |15
B |2021-10-15 |10
B |2021-11-20 |20
As you can see, customer A's sales are growing at a faster rate than customer B's.
I also have a table which shows the date each customer joined a certain club (all customers from the previous table are also in this one):
account_number|join_date
--------------|----------
A |2022-01-05
B |2021-12-01
What I want to do is to quantify the sales growth of each customer in the period before they joined. So in my example, we would be looking at customer A's first two transactions, but all three of customer B's.
If it helps, the end goal of all of this is to identify which customers were showing the highest growth in performance before they joined the club.
Does anyone know what would be the best way to go about this please? Any advice would be appreciated!
EDIT: Just to clarify, the best plan I could come up with is to select from the first table all transactions before each customer's join_date, and use this to get a rolling X-day sales total per customer.
SUM(SALES) OVER (PARTITION BY ACCOUNT_NUMBER ORDER BY DAYS(TILL_TRANSACTION_DATE) RANGE BETWEEN X PRECEDING AND CURRENT ROW)
So if I set it to a 35-day total the results would look like this:
account_number|till_transaction_date|sales|35_day_total
--------------|---------------------|-----|------------
A |2021-11-01 |10 |10
A |2021-12-05 |25 |35
B |2021-09-05 |15 |15
B |2021-10-15 |10 |10
B |2021-11-20 |20 |20
I could then potentially calculate for each customer the difference between the min and max of their totals.
The problem is that this isn't a reliable method of measuring how much a customer's performance is improving - e.g. if a customer makes an expensive one-off purchase at the right time it could skew their result massively.
Theoretically, I could get the difference between each X-day total and the previous X-day total, then get the mean average of the differences, and this would provide a more accurate way of determining sales growth, but I'm not sure how to do this.
I'm not sure how to improve this plan, or even if I'm barking up the wrong tree entirely and there's a better methodology I've just not thought of.
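The "mean of the differences between consecutive X-day totals" idea described above can be sketched as follows. This is only an illustration under my own assumptions: the table names transactions and club are hypothetical, SQLite (via Python's sqlite3, version 3.28 or newer for RANGE frames with an offset) stands in for the asker's database, and julianday() replaces DAYS() as the numeric ordering key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical table names; the rows are the sample data from the question.
CREATE TABLE transactions (account_number TEXT, till_transaction_date TEXT, sales INTEGER);
CREATE TABLE club (account_number TEXT, join_date TEXT);
INSERT INTO transactions VALUES
    ('A', '2021-11-01', 10), ('A', '2021-12-05', 25), ('A', '2022-01-10', 40),
    ('B', '2021-09-05', 15), ('B', '2021-10-15', 10), ('B', '2021-11-20', 20);
INSERT INTO club VALUES ('A', '2022-01-05'), ('B', '2021-12-01');
""")

QUERY = """
WITH pre_join AS (
    -- Keep only transactions made before the customer joined the club.
    SELECT t.account_number, t.till_transaction_date, t.sales
    FROM transactions AS t
    JOIN club AS c ON c.account_number = t.account_number
    WHERE t.till_transaction_date < c.join_date
),
rolling AS (
    -- Rolling 35-day sales total per customer.
    SELECT account_number, till_transaction_date,
           SUM(sales) OVER (PARTITION BY account_number
                            ORDER BY julianday(till_transaction_date)
                            RANGE BETWEEN 35 PRECEDING AND CURRENT ROW) AS total_35
    FROM pre_join
),
deltas AS (
    -- Change from the previous rolling total to the current one.
    SELECT account_number,
           total_35 - LAG(total_35) OVER (PARTITION BY account_number
                                          ORDER BY till_transaction_date) AS delta
    FROM rolling
)
SELECT account_number, AVG(delta) AS mean_change
FROM deltas
WHERE delta IS NOT NULL
GROUP BY account_number
ORDER BY account_number
"""

growth = conn.execute(QUERY).fetchall()
print(growth)
```

One thing to keep in mind: the mean of consecutive differences telescopes to (last total - first total) / (number of totals - 1), so it depends only on the first and last rolling totals and is just as sensitive to a one-off purchase at either end. If robustness matters, a fitted slope or a median of the differences may be worth considering instead.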

Query Distinct on a single Column

I have a Table called SR_Audit which holds all of the updates for each ticket in our Helpdesk Ticketing system.
The table is formatted as per the below representation:
|-----------------|------------------|------------|------------|------------|
| SR_Audit_RecID | SR_Service_RecID | Audit_text | Updated_By | Last_Update|
|-----------------|------------------|------------|------------|------------|
|........PK.......|.......FK.........|
I've constructed the below query that provides me with the appropriate output that I require in the format I want it. That is to say that I'm looking to measure how many tickets each staff member completes every day for a month.
select SR_audit.updated_by, CONVERT(CHAR(10),SR_Audit.Last_Update,101) as DateOfClose, count (*) as NumberClosed
from SR_Audit
where SR_Audit.Audit_Text LIKE '%to "Completed"%' AND SR_Audit.Last_Update >= DATEADD(day, -30, GETDATE())
group by SR_audit.updated_by, CONVERT(CHAR(10),SR_Audit.Last_Update,101)
order by CONVERT(CHAR(10),SR_Audit.Last_Update,101)
However the query has one weakness which I'm looking to overcome.
A ticket can be reopened once it's completed, which means that it can be completed again. This allows a staff member to artificially inflate their score by re-opening a ticket and completing it again, thus increasing their completed ticket count by one each time they do this.
The table has a field called SR_Service_RecID, which is essentially the ticket number. I want to put a condition in the query so that each ticket is only counted once regardless of how many times it's completed, while still honouring the current where clause.
I've tried sub queries and a few other methods but haven't been able to get the results I'm after.
Any assistance would be appreciated.
Cheers.
Courtenay
use as
COUNT(DISTINCT(SR_Service_RecID)) as NumberClosed
Use:
COUNT(DISTINCT SR_Service_RecID) as NumberClosed
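To see the difference between COUNT(*) and COUNT(DISTINCT SR_Service_RecID), here is a small sketch on made-up audit rows, run in SQLite through Python's sqlite3. The sample rows, names and dates are invented for illustration, and the 30-day DATEADD filter is left out because it is SQL Server specific.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE SR_Audit (SR_Audit_RecID INTEGER PRIMARY KEY, "
    "SR_Service_RecID INTEGER, Audit_text TEXT, Updated_By TEXT, Last_Update TEXT)"
)

completed = 'Status changed from "In Progress" to "Completed"'
reopened = 'Status changed from "Completed" to "Re-opened"'
conn.executemany(
    "INSERT INTO SR_Audit VALUES (?, ?, ?, ?, ?)",
    [
        # Invented rows: ticket 100 is completed, re-opened, then completed again.
        (1, 100, completed, "alice", "2024-01-02"),
        (2, 100, reopened, "alice", "2024-01-02"),
        (3, 100, completed, "alice", "2024-01-02"),
        (4, 101, completed, "alice", "2024-01-02"),
        (5, 102, completed, "bob", "2024-01-02"),
    ],
)

QUERY = """
SELECT Updated_By, Last_Update AS DateOfClose,
       COUNT(*) AS TimesCompleted,                      -- counts every completion
       COUNT(DISTINCT SR_Service_RecID) AS NumberClosed -- counts each ticket once
FROM SR_Audit
WHERE Audit_text LIKE '%to "Completed"%'
GROUP BY Updated_By, Last_Update
ORDER BY Updated_By
"""

counts = conn.execute(QUERY).fetchall()
print(counts)
```

Note that the distinct count is taken within each group, so a ticket completed again on a different day, or by a different person, would still be counted once per group.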

Getting a snapshot of records where an "event" can mean several entries on the same date

This is really frustrating me.
So, I'm making a database recording people joining and leaving our office, as well as changing roles, in order to keep track of headcount. This is succinctly recorded in the following table:
EmployeeID | RoleID | FTE | Date
FTE is the proportion of full-time hours the role is worth (i.e. 1 is full-time, 0.5 is part-time, etc). Leaving events are recorded as changing the role to 0 (Absent) and FTE to 0. The trouble is, people can have more than one role, which means that the number of hours they actually worked is a composite of all the events for that employee that occur on the same day. So if someone goes from full time on one project to splitting their time between two projects, a ChangeRole event is logged for each.
So I want to know the total headcount on a monthly basis. Essentially the query I would want is "Select all records from this table where, for each EmployeeID, the date is the maximum date below a specified date." From there I can sum the FTE to get the headcount.
Now I can get some of those things in isolation: I can do max(date), and I can do a criteria of <#dd/mm/yyyy#. But for some reason I can't seem to combine it all to get what I want, and I'm at a point where I've been staring at the problem so long that it doesn't make sense to me. Can anyone help me out? Thanks!
Something like this?
SELECT Events.*
FROM Events INNER JOIN (
SELECT EmployeeID, Max(Date) AS LatestDate
FROM Events
WHERE Events.Date < [Date entered]
GROUP BY EmployeeID) AS S
ON (Events.EmployeeID = S.EmployeeID) AND (Events.Date = S.LatestDate)
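A minimal sketch of the same join, run against invented sample events in SQLite (through Python's sqlite3) rather than Access, with the FTE summed to give the headcount at a cutoff date. The sample rows and the headcount helper are hypothetical, and dates are stored as ISO strings so that text comparison orders them correctly.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Events (EmployeeID INTEGER, RoleID INTEGER, FTE REAL, Date TEXT)")
conn.executemany(
    "INSERT INTO Events VALUES (?, ?, ?, ?)",
    [
        (1, 1, 1.0, "2020-01-01"),  # employee 1 starts full-time
        (1, 2, 0.5, "2020-03-01"),  # ...then splits time between two roles,
        (1, 3, 0.5, "2020-03-01"),  # logged as two events on the same date
        (2, 1, 1.0, "2020-02-01"),  # employee 2 starts full-time
        (2, 0, 0.0, "2020-04-01"),  # ...and leaves (role 0, FTE 0)
    ],
)

SNAPSHOT = """
SELECT SUM(e.FTE)
FROM Events AS e
JOIN (
    -- Latest event date per employee before the cutoff.
    SELECT EmployeeID, MAX(Date) AS LatestDate
    FROM Events
    WHERE Date < :cutoff
    GROUP BY EmployeeID
) AS s
  ON e.EmployeeID = s.EmployeeID AND e.Date = s.LatestDate
"""


def headcount(cutoff):
    """Total FTE across all rows logged on each employee's latest date before cutoff."""
    return conn.execute(SNAPSHOT, {"cutoff": cutoff}).fetchone()[0]


print(headcount("2020-03-15"), headcount("2020-05-01"))
```

This only gives the right total if every role an employee currently holds is logged again on each change date, as the question describes; a role that was not re-logged on the latest date would silently drop out of the sum.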

Fact table designing for SSAS

I'm designing a fact table for SSAS, and this is the first time I've tried my hand at this. It is to be a prototype system, just to show what could be done, so that someone can decide whether it is what they are after.
I've made up some data and am now trying to create the fact table. The cube will be looking at referrals and what I'm trying to show is the information over time showing the number of referrals that opened in a month, number that closed in a month and the number that were open at any point in the month (i.e. they could have opened in previous month and closed in a future month).
How best to design these measures is where I'm stuck. Should it be three fact tables, or can I get away with one? If I do three fact tables, I can link on the record number and the open date to get the number that opened in a month, and I can link on the record number and closed date to get the number that closed in a month, but the one I have no idea about is how to describe when a referral was open at any point in the month. For this table, would I need to create a row for every day for every referral? This seems a bit intensive, and so I immediately thought it was wrong.
So the questions are twofold:
Can I do the three measures in one table and if so what is the best method for this?
What is the best method for the open at any point in the month count?
Any thoughts would be most appreciated, as I truly am a beginner at this and all I have to aid me is Google, and I have a short deadline for this.
Dimensions I have:
Demographics: Record number; Gender; Ethnicity; Birth date;
Referral: Record number; Open date; End date;
Time: Date; Month; Quarter; Year;
The fact table I initially designed was:
Data:
Record number; Opened_in_month; Closed_in_month; Open_in_month;
Since creating the cube, I can see that the numbers do not match up to what I put in the test data and so I know that I have messed up the fact table and it's that table I need to re-create.
I have little experience with creating cubes in SSAS, but I would probably create a view something like this:
ReferallFacts:
Id | IsOpen | DateOpened | OpenedBy | DateClosed | ClosedBy | OpenForMinutes...
CalendarDimension:
ShortDate | Week | Month | Quarter | Year | FinancialWeek...
EmployeeDimension:
Id | FirstName | LastName | LineManager | Department...
DepartmentDimension:
Id | Name | ParentDepartment | Manager | Location...
I don't really see a need for more than one fact table in this case as all of what you describe "by month", "by day" is handled by the calendar dimension.
Here is a really nice walkthrough, and also pcteach.me has some good videos on SSAS.
Have you considered an event-based approach, an event being a referral opening or closing?
First of all, you need to determine the granularity level of your fact table. If you need to know the number of open referrals at a specific date and time in a month, then your fact table must be at the lowest granularity (individual referral records):
FactReferrals: ( DateId, TimeId, EventId, RecordNumber, ReferralEventValue )
Here, ReferralEventValue is just an integer value of 1 when a Referral opens, and -1 when a Referral closes. EventId refers to a dimension with only two members: Opened and Closed.
This approach allows you to get the number of closed or opened events over any given time period. Also, by taking the sum of ReferralEventValue from the beginning of time, and up to a certain point in time, you get the exact amount of open referrals at that specific moment. To speed up this sum in SSAS, you could design aggregations or create a separate measure that is the accumulated sum of ReferralEventValue.
Edit: Of course, if you don't need data at individual referral granularity, you could always sum up the ReferralEventValue per day or even month, before loading the fact table.
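To make the event-based idea concrete, here is a small sketch in plain SQL (run in SQLite through Python's sqlite3, not in SSAS) that derives all three monthly measures from +1/-1 event rows. The table is simplified for illustration: EventDate stands in for the DateId/TimeId keys, the sample referrals are invented, and month_stats is a hypothetical helper.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE FactReferrals (EventDate TEXT, RecordNumber INTEGER, ReferralEventValue INTEGER)"
)
conn.executemany(
    "INSERT INTO FactReferrals VALUES (?, ?, ?)",
    [
        ("2023-01-10", 1, 1),   # referral 1 opens
        ("2023-02-01", 2, 1),   # referral 2 opens
        ("2023-02-20", 2, -1),  # referral 2 closes
        ("2023-03-05", 1, -1),  # referral 1 closes
        ("2023-03-15", 3, 1),   # referral 3 opens and stays open
    ],
)

MONTH_QUERY = """
SELECT
    COALESCE(SUM(CASE WHEN EventDate >= :start AND ReferralEventValue = 1
                      THEN 1 ELSE 0 END), 0) AS opened,
    COALESCE(SUM(CASE WHEN EventDate >= :start AND ReferralEventValue = -1
                      THEN 1 ELSE 0 END), 0) AS closed,
    -- Running sum of +1/-1 up to the start of the month = referrals still open then.
    COALESCE(SUM(CASE WHEN EventDate < :start
                      THEN ReferralEventValue ELSE 0 END), 0) AS open_at_start
FROM FactReferrals
WHERE EventDate < :end
"""


def month_stats(start, end):
    """Measures for the half-open period [start, end)."""
    opened, closed, open_at_start = conn.execute(
        MONTH_QUERY, {"start": start, "end": end}
    ).fetchone()
    # Open at any point in the month = already open at the start + opened during it.
    return {"opened": opened, "closed": closed, "open_at_any_point": open_at_start + opened}


for start, end in [("2023-01-01", "2023-02-01"), ("2023-02-01", "2023-03-01"), ("2023-03-01", "2023-04-01")]:
    print(start, month_stats(start, end))
```

The point of the sketch is that "open at any point in the month" needs no row per day per referral: it falls out of the accumulated sum before the month plus the openings within it.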

Weekly Sales Total in Fact Table using SSIS

I have 3 sales data tables that I have converted into 4 Dimension tables and 1 Fact table.
I have everything populated properly, and am wanting a "daily-sales" and "weekly-sales" in my Fact table.
My original 3 tables are as follows:
Order (order# (PK), product-id (FK), order-date, total-order, customer# (FK))
domains are numeric, numeric, Date, Currency,numeric
Product (product-id (PK), prod-name, unit-cost, manufacture-name)
domains are numeric, nvarchar, money, nvarchar.
Customer (customer# (PK), cust-name, address, telephone#)
domains are Number, nvarchar, nvarchar, nvarchar
The Star Schema of the data warehouse is linked here:
So I have only 10 records in each table(tiny!), and am just testing concept now. "daily-order" in the Fact table is easily translated from "total-order" in the Order table. My difficulty is getting the weekly totals. I used a derived column and expression "DATEPART("wk",[order-date])" to create my week column in the time dimension.
So I guess my question is, how do I get the weekly-sales from this point? My first guess is to use another derived column after the end of my Lookup sequences used to load my Fact table. My second guess is..... ask for help on stackoverflow. Any help would be appreciated, I will supply any info needed! Thanks in advance!
For the record I tried creating the derived column as I previously described, but could not figure out syntax that would be accepted....
EDIT: I want to list the weekly-sales column at product level, and as I used DATEPART to derive the week column I do not need the last 7 days but rather just a total for each week. My sales data stretches across only 2 consecutive weeks, so my fact table should list 1 total 7 times and a second total 3 times. I can either derive this from the original Order table, or from the tables I have in my DW environment. Preferably the latter. One note is that I am mainly restricted to SSIS and its tools (there is an "Execute SQL syntax" tool, so I can still use a query).
Within your data flow for Fact-sales, it would be challenging to fill the weekly-order amount as you are loading daily data. You would need to use a Multicast Transformation and add an Aggregate to that to sum your value for the product, customer, order and week. That's easy. The problem is how you are going to tie that data back to the actual detail-level information in the data flow. It can be done with a Merge Join, but don't.
As ElectricLlama is pointing out in the comments, what you are about to do is mix your level of granularity. If the weekly amount wasn't stored on the table, then you can add all the amounts, sliced by whatever dimensions you want. Once you put that weekly level data at the same level, your users will only be able to summarize the daily columns. The weekly values would not be additive, probably.
I say probably because you could fake it by accumulating the weekly values but only storing them on an arbitrary record: the first day of the week, third, last, whatever. That solves the additive problem, but now users have to remember which day of the week holds the summarized value for the week. Plus, you'd be generating a fake order record for that summary row unless they placed the order on the exact day of the week you picked to solve the additive value problem.
The right, unasked-for solution is to remove the weekly-order from the fact table and beef up your date dimension to allow people to determine the weekly amounts. If you aren't bringing the data out of the warehouse and into an analytics engine, you could precompute the weekly aggregates and materialize them to a reporting table/view.
Source data
Let's look at a simple scenario. A customer places 3 orders, two on the same day for same product (maybe one was expedited shipping, other standard)
SalesDate |Customer|Product|Order|Amount
2013-01-01|1 |20 |77 |30
2013-01-01|1 |30 |77 |10
2013-01-02|1 |20 |88 |1
2013-01-02|1 |20 |99 |19
Fact table
Your user is going to point their tool of choice at the table and you'll be getting a call because the numbers aren't right. Or you'll get a call later because they restocked with 150 units of product 20 when they only needed to reorder 50. I wouldn't want to field either of those calls.
SalesDate |Customer|Product|Order|daily-order|weekly-order
2013-01-01|1 |20 |77 |30 |50
2013-01-01|1 |30 |77 |10 |10
2013-01-02|1 |20 |88 |1 |50
2013-01-02|1 |20 |99 |19 |50
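The additivity problem in that table can be reproduced with a short sketch, using SQLite through Python's sqlite3 as a stand-in for the warehouse. The names fact_sales, dim_date, OrderNo and WeekOfYear are hypothetical (Order is a reserved word, so the column is renamed), and the week number is simply assumed to be 1 for both dates.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical names; the fact rows are the sample table shown above.
CREATE TABLE dim_date (SalesDate TEXT PRIMARY KEY, WeekOfYear INTEGER);
CREATE TABLE fact_sales (
    SalesDate TEXT, Customer INTEGER, Product INTEGER, OrderNo INTEGER,
    daily_order INTEGER, weekly_order INTEGER
);
INSERT INTO dim_date VALUES ('2013-01-01', 1), ('2013-01-02', 1);
INSERT INTO fact_sales VALUES
    ('2013-01-01', 1, 20, 77, 30, 50),
    ('2013-01-01', 1, 30, 77, 10, 10),
    ('2013-01-02', 1, 20, 88,  1, 50),
    ('2013-01-02', 1, 20, 99, 19, 50);
""")

# Summing the stored weekly column triple-counts product 20.
naive_total = conn.execute(
    "SELECT SUM(weekly_order) FROM fact_sales WHERE Product = 20"
).fetchone()[0]

# Deriving the weekly amount from the additive daily column via the date dimension.
weekly = conn.execute("""
    SELECT d.WeekOfYear, f.Product, SUM(f.daily_order) AS weekly_sales
    FROM fact_sales AS f
    JOIN dim_date AS d ON d.SalesDate = f.SalesDate
    GROUP BY d.WeekOfYear, f.Product
    ORDER BY f.Product
""").fetchall()

print(naive_total, weekly)
```

Summing weekly_order for product 20 gives 150 where the real weekly figure is 50, which is the restocking mistake described above; grouping the daily column by the week from the date dimension gives the correct totals without storing them on the fact rows.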