Best practice for saving a series of dates in SQL - sql

I'm reworking some old programs, and in one of them I need to save a repeating series of dates in the database. The user picks days ranging from 1-31 and months ranging from 1-12 in a PHP form. Multiple choices are possible, and at least one of each must be provided.
I'll then use a daily scheduled Task to check if the value (day and month) is given and if yes - do something.
In the old system I saved it like this:
| Days | Months |
|1,2,5,13,15 | 1,2,3,4,5,6,7,8,9,10,11,12|
Then I exploded every row in the PHP-File fired by the scheduled Task and iterated over the Array. If one of the dates is valid - do something.
What is best practice for this use case? I thought about solutions like "saving all possible combinations of days and months as single rows in a mapping table", but I don't think that's an elegant solution, and it needs to remain editable after being implemented.
Any suggestions?

I think you're looking at three tables.
Table one records the groups, give it a sequential group id and whatever other properties you need to record about the group of dates as a whole (requesting user id).
Second table is just group id from table one and the chosen days in rows, so each group has multiple rows.
Third table is the same as for second but for months.
When you need the final result, join the second and third tables to the first on the group id. You'll automatically get a cross join between the two, giving the combinations you need.
If you're expecting a large volume of data and/or a lot of repeats of the same groups, you may want to consider reusing the groups of days and months. The table design is similar, but tables two and three have their own group ids, and table one has two extra columns: one for the day group and one for the month group.
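As a rough illustration of the three-table layout described above, here is a minimal sketch using Python's built-in sqlite3 module. The table and column names (date_groups, group_days, group_months) and the user id 42 are assumptions made for the example, not names from the original system; the day and month values are the ones from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical names; one row per group of dates.
    CREATE TABLE date_groups (
        group_id INTEGER PRIMARY KEY,
        user_id  INTEGER NOT NULL
    );
    -- One row per chosen day in a group.
    CREATE TABLE group_days (
        group_id INTEGER NOT NULL REFERENCES date_groups(group_id),
        day      INTEGER NOT NULL CHECK (day BETWEEN 1 AND 31),
        PRIMARY KEY (group_id, day)
    );
    -- One row per chosen month in a group.
    CREATE TABLE group_months (
        group_id INTEGER NOT NULL REFERENCES date_groups(group_id),
        month    INTEGER NOT NULL CHECK (month BETWEEN 1 AND 12),
        PRIMARY KEY (group_id, month)
    );
""")

conn.execute("INSERT INTO date_groups (group_id, user_id) VALUES (1, 42)")
conn.executemany("INSERT INTO group_days VALUES (1, ?)",
                 [(d,) for d in (1, 2, 5, 13, 15)])
conn.executemany("INSERT INTO group_months VALUES (1, ?)",
                 [(m,) for m in range(1, 13)])

JOIN_SQL = """
    FROM date_groups AS g
    JOIN group_days   AS d ON d.group_id = g.group_id
    JOIN group_months AS m ON m.group_id = g.group_id
"""

# Joining both child tables on group_id yields every day/month combination.
combos = conn.execute("SELECT COUNT(*)" + JOIN_SQL).fetchone()[0]


def groups_due(conn, day, month):
    """Return the ids of groups scheduled for the given day and month."""
    rows = conn.execute(
        "SELECT g.group_id" + JOIN_SQL +
        "WHERE d.day = ? AND m.month = ? ORDER BY g.group_id",
        (day, month),
    ).fetchall()
    return [row[0] for row in rows]


print(combos)                    # 5 days x 12 months
print(groups_due(conn, 13, 7))   # group 1 is due on 13 July
```

The daily scheduled task would call something like groups_due with today's day and month instead of exploding comma-separated strings in PHP. Editing a schedule then becomes inserting or deleting single rows.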

It seems you can use a dimension-like scheme and attach day-month pairs to different entities. Suppose the entity is called "task".
| tasks   |            | days     |            | months   |
| ------- |            | -------- |            | -------- |
| id_task |            | id_day   |            | id_month |
| ...     | >---M:1--- | id_month | >---M:1--- | month    |
| id_day  |            | day      |            |          |
Don't forget to add check constraints for day (1-31) and month (1-12) columns.

I think you should expand the data in the database. Clearly, you need a table groups (or something like that) with one row per group:
create table groups (
    group_id int identity(1, 1) primary key,
    . . . -- additional columns
);
Then, expand the dates for each group for the schedule:
create table groups_schedule (
    group_schedule_id int identity(1, 1) primary key,
    group_id int references groups(group_id),
    month int,
    day int
);
This requires multiplying out the data in the database. However, I think it is a more accurate representation. In addition, it will give you more flexibility in the future so you are not tied specifically to lists of months/days. For instance, you might have day "25" in most months, but not December.
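To make the "multiplying out" step concrete, here is a small Python sketch that expands the submitted day and month lists into individual (month, day) rows ready to be inserted into a table like groups_schedule. The function name expand_schedule, the excluded parameter, and the decision to skip impossible dates such as 31 February are assumptions for illustration, not part of the answer above.

```python
import calendar
from itertools import product


def expand_schedule(days, months, excluded=frozenset()):
    """Expand day and month lists into sorted (month, day) rows.

    Impossible dates (for example 31 February) and explicitly excluded
    pairs are skipped. This filtering is an assumption of the sketch.
    """
    rows = []
    for month, day in product(sorted(months), sorted(days)):
        # 2024 is a leap year, so 29 February is kept as a possible date.
        if day > calendar.monthrange(2024, month)[1]:
            continue
        if (month, day) in excluded:
            continue
        rows.append((month, day))
    return rows


# Day 25 in most months, but not December, as in the example above.
rows = expand_schedule(days=[25, 31], months=[1, 2, 12], excluded={(12, 25)})
print(rows)
```

Because each pair is its own row, one-off exceptions like the December case need no special syntax; the row is simply absent.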

Related

HR Cube in SSAS

I have to design a cube for student attendance. We have four statuses (Present, Absent, Late, In vacation). The cube has to tell me the number of students who are not present over a period of time (day, month, year, etc.) and that number as a percentage of the total.
I built a fact table like this:
City ID | Class ID | Student ID | Attendance Date | Attendance State | Total Students number
--------------------------------------------------------------------------------------------
1 | 1 | 1 | 2016-01-01 | ABSENT | 20
But in my SSRS project I couldn't use this to get the correct numbers. I have to filter by date, city and attendance status.
For example, I must know that on date X there are 12 students not present, which corresponds to 11% of the total number.
Any suggestions for a good structure to achieve this?
I assume this is homework.
Your fact table is wrong.
Don't store aggregated data (Total Students) in the fact as it can make calculations difficult.
Don't store text values like 'Absent' in the fact table. Attributes belong in the dimension.
Reading homework for you:
Difference between a Fact and Dimension and how they work together
What is the grain of a Fact and how does that affect aggregations and calculations.
There is a wealth of information on the Kimball Group's pages. Start with the lower-numbered tips, as they get more advanced as you move on.
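As a hedged illustration of those two points, the Python sketch below keeps one fact row per student per day, stores only a status key in the fact, and derives both the total and the percentage at query time. The status ids, the rule that every status except Present counts as "not present", and the sample rows are assumptions made for the example.

```python
# Hypothetical status dimension: status_id -> (name, counts_as_not_present).
STATUS_DIM = {
    1: ("Present", False),
    2: ("Absent", True),
    3: ("Late", True),
    4: ("In vacation", True),
}

# Fact grain: one row per student per day.
# Columns: (city_id, class_id, student_id, attendance_date, status_id)
FACTS = [
    (1, 1, 1, "2016-01-01", 2),
    (1, 1, 2, "2016-01-01", 1),
    (1, 1, 3, "2016-01-01", 1),
    (2, 5, 4, "2016-01-01", 3),
]


def not_present_share(facts, date, city_id=None):
    """Return (not_present, total, percent) for a date and optional city."""
    rows = [f for f in facts
            if f[3] == date and (city_id is None or f[0] == city_id)]
    total = len(rows)  # the total is derived, never stored in the fact
    not_present = sum(1 for f in rows if STATUS_DIM[f[4]][1])
    percent = 100.0 * not_present / total if total else 0.0
    return not_present, total, percent


print(not_present_share(FACTS, "2016-01-01"))
print(not_present_share(FACTS, "2016-01-01", city_id=1))
```

Because the total is just the row count at the chosen grain, it stays correct under any filter (date, city, class), which a stored "Total Students number" column cannot guarantee.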

Calculating the number of new ID numbers per month in powerpivot

My dataset provides a monthly snapshot of customer accounts. Below is a very simplified version:
Date_ID | Acc_ID
------- | -------
20160430| 1
20160430| 2
20160430| 3
20160531| 1
20160531| 2
20160531| 3
20160531| 4
20160531| 5
20160531| 6
20160531| 7
20160630| 4
20160630| 5
20160630| 6
20160630| 7
20160630| 8
Customers can open or close their accounts, and I want to calculate the number of 'new' customers every month. The number of 'exited' customers will also be helpful if this is possible.
So in the above example, I should get the following result:
Month | New Customers
------- | -------
20160430| 3
20160531| 4
20160630| 1
Basically I want to compare distinct account numbers in the selected and previous month, any that exist in the selected month and not previous are new members, any that were there last month and not in the selected are exited.
I've searched but I can't seem to find any similar problems, and I hardly know where to start myself - I've tried using CALCULATE and FILTER along with DATEADD to filter the data to get two months, and then count the unique values. My PowerPivot skills aren't up to scratch to solve this on my own however!
Getting the new users is relatively straightforward - I'd add a calculated column which counts rows for that user in earlier months and if they don't exist then they are a new user:
=IF(CALCULATE(COUNTROWS(data),
        FILTER(data, [Acc_ID] = EARLIER([Acc_ID])
            && [Date_ID] < EARLIER([Date_ID]))) = BLANK(),
    "new",
    "existing")
Once this is in place you can simply write a measure for new_users:
=CALCULATE(COUNTROWS(data), data[customer_type] = "new")
Getting the cancelled users is a little harder because it means you have to be able to look backwards to the prior month - none of the time intelligence stuff in PowerPivot will work out of the box here as you don't have a true date column.
It's nearly always good practice to have a separate date table in your PowerPivot models and it is a good way to solve this problem - essentially the table should be 1 record per date with a unique key that can be used to create a relationship. Perhaps post back with a few more details.
This is an alternative to Jacob's method which also works. It avoids creating a calculated column, although I actually find the calculated column useful as a flag against other measures.
=CALCULATE(
    DISTINCTCOUNT('Accounts'[Acc_ID]),
    DATESBETWEEN(
        'Dates'[Date], 0, LASTDATE('Dates'[Date])
    )
) - CALCULATE(
    DISTINCTCOUNT('Accounts'[Acc_ID]),
    DATESBETWEEN(
        'Dates'[Date], 0, FIRSTDATE('Dates'[Date]) - 1
    )
)
It basically uses the dates table to make a distinct count of all Acc_ID from the beginning of time until the first day of the period of time selected, and subtracts that from the distinct count of all Acc_ID from the beginning of time until the last day of the period of time selected. This is essentially the number of new distinct Acc_ID, although you can't work out which Acc_ID's these are using this method.
I could then calculate 'exited accounts' by taking the previous month's total as 'existing accounts':
=CALCULATE(
DISTINCTCOUNT('Accounts'[Acc_ID]),
DATEADD('Dates'[Date], -1, MONTH)
)
Then adding the 'new accounts', and subtracting the 'total accounts':
=DISTINCTCOUNT('Accounts'[Acc_ID])
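To sanity-check the expected numbers outside PowerPivot, the comparison can be sketched with plain Python sets. This is not a DAX equivalent of either answer: it compares each snapshot with the immediately preceding one, as described in the question, so an account that closes and later reopens would count as new again. The function name monthly_changes is made up for the example; the data is the sample from the question.

```python
SNAPSHOTS = [
    (20160430, 1), (20160430, 2), (20160430, 3),
    (20160531, 1), (20160531, 2), (20160531, 3), (20160531, 4),
    (20160531, 5), (20160531, 6), (20160531, 7),
    (20160630, 4), (20160630, 5), (20160630, 6), (20160630, 7),
    (20160630, 8),
]


def monthly_changes(rows):
    """Map each Date_ID to (new_accounts, exited_accounts)."""
    by_month = {}
    for date_id, acc_id in rows:
        by_month.setdefault(date_id, set()).add(acc_id)

    result = {}
    previous = set()
    for date_id in sorted(by_month):
        current = by_month[date_id]
        new = current - previous      # in this month, not in the previous one
        exited = previous - current   # in the previous month, not in this one
        result[date_id] = (len(new), len(exited))
        previous = current
    return result


for date_id, (new, exited) in monthly_changes(SNAPSHOTS).items():
    print(date_id, new, exited)
```

The exited figures agree with the identity used above: previous month's total plus new accounts minus the current total (for June, 7 + 1 - 5 = 3).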

SSAS Row Count Aggregation

Hi I have a table like this:
idCustomer | idTime | idStatus
---------------------------------
1 | 20010101 | 2
1 | 20010102 | 2
1 | 20010103 | 3
2 | 20010101 | 1
...
I have now added this table as a factless fact table in my cube with a measure which aggregates the row count for each customer, so that for each day I can see how many customers are at each status and I can drill down to see which customers they are.
This is all well and good, but when I roll it up to the month or year level it starts summing the values of each day, whereas I want to see the last non-empty value.
I'm not sure if this is possible, but I can't think of another way of getting this information without creating a fact table with the counts for each status on each day and losing the ability to drill down.
Can anyone help??
An easy way to get what you want would be to convert your factless fact table to one having a fact: the count. Just add a named calculation to the table object in the data source view. Name the calculation like you want your measure to be named, and use 1 as the expression. Then you can define a measure based on this calculation using the aggregate function "LastNonEmpty", and use this instead of your current count measure.
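The difference between the two roll-up behaviours can be sketched in Python using the four sample rows from the question. This only approximates the semantics, on the assumption that LastNonEmpty takes, for each cell, the value from the last day that has any data for that cell; it is not SSAS code, and the function name rollup is hypothetical.

```python
from collections import defaultdict

# (idCustomer, idTime, idStatus), the visible sample rows from the question.
ROWS = [
    (1, 20010101, 2),
    (1, 20010102, 2),
    (1, 20010103, 3),
    (2, 20010101, 1),
]


def rollup(rows):
    """Roll daily counts per status up to one period in two ways."""
    daily = defaultdict(lambda: defaultdict(int))  # status -> day -> count
    for _customer, id_time, status in rows:
        daily[status][id_time] += 1  # each row contributes the constant 1

    # Sum aggregation: adds the counts of every day in the period.
    summed = {status: sum(days.values()) for status, days in daily.items()}
    # Last-non-empty style: keep only the count from the last day with data.
    last_non_empty = {status: days[max(days)] for status, days in daily.items()}
    return dict(summed), last_non_empty


summed, last_non_empty = rollup(ROWS)
print(summed)
print(last_non_empty)
```

Note that under this assumption status 2 still reports the count from 2001-01-02, the last day on which any customer had that status, even though customer 1 moved to status 3 the next day.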

Fact table designing for SSAS

I'm designing a fact table for SSAS, and this is my first attempt at one. It is for a prototype system, meant to show someone what could be done so they can decide whether it is what they are after.
I've made up some data and am now trying to create the fact table. The cube will be looking at referrals and what I'm trying to show is the information over time showing the number of referrals that opened in a month, number that closed in a month and the number that were open at any point in the month (i.e. they could have opened in previous month and closed in a future month).
Where I'm stuck is how best to design these measures. Should it be three fact tables, or can I get away with one? If I use three fact tables, I can link on the record number and the open date to get the number that opened in a month, and I can link on the record number and the closed date to get the number that closed in a month, but I have no idea how to describe a referral being open at any point in the month. For this table, would I need to create a row for every day for every referral? This seems a bit intensive, so I immediately thought it was wrong.
So the questions are twofold:
Can I do the three measures in one table and if so what is the best method for this?
What is the best method for the open at any point in the month count?
Any thoughts would be most appreciated, as I truly am a beginner at this and all I have to aid me is Google, and I have a short deadline.
Dimensions I have:
Demographics: Record number; Gender; Ethnicity; Birth date;
Referral: Record number; Open date; End date;
Time: Date; Month; Quarter; Year;
The fact table I initially designed was:
Data:
Record number; Opened_in_month; Closed_in_month; Open_in_month;
Since creating the cube, I can see that the numbers do not match up to what I put in the test data and so I know that I have messed up the fact table and it's that table I need to re-create.
I have little experience with creating cubes in SSAS, but I would probably create a view something like this:
ReferallFacts:
Id | IsOpen | DateOpened | OpenedBy | DateClosed | ClosedBy | OpenForMinutes...
CalendarDimension:
ShortDate | Week | Month | Quarter | Year | FinancialWeek...
EmployeeDimension:
Id | FirstName | LastName | LineManager | Department...
DepartmentDimension:
Id | Name | ParentDepartment | Manager | Location...
I don't really see a need for more than one fact table in this case as all of what you describe "by month", "by day" is handled by the calendar dimension.
Here is a really nice walkthrough, and pcteach.me also has some good videos on SSAS.
Have you considered an event-based approach, an event being a referral opening or closing?
First of all, you need to determine the granularity level of your fact table. If you need to know the number of open referrals at a specific date and time in a month, then your fact table must be at the lowest granularity (individual referral records):
FactReferrals: ( DateId, TimeId, EventId, RecordNumber, ReferralEventValue )
Here, ReferralEventValue is just an integer value of 1 when a Referral opens, and -1 when a Referral closes. EventId refers to a dimension with only two members: Opened and Closed.
This approach lets you get the number of opened or closed events over any given time period. Also, by summing ReferralEventValue from the beginning of time up to a certain point in time, you get the exact number of open referrals at that specific moment. To speed up this sum in SSAS, you could design aggregations or create a separate measure that is the accumulated sum of ReferralEventValue.
Edit: Of course, if you don't need data at individual referral granularity, you could always sum up the ReferralEventValue per day or even month, before loading the fact table.
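A minimal Python sketch of this event-based idea follows, using daily granularity rather than date plus time. The sample referrals and all function names are assumptions for illustration. It also shows one way to derive "open at any point in the month": referrals open just before the month starts plus those opened during it.

```python
from datetime import date, timedelta

# Hypothetical referrals: (record_number, open_date, end_date or None).
REFERRALS = [
    (1, date(2016, 1, 10), date(2016, 2, 5)),
    (2, date(2016, 1, 20), None),               # still open
    (3, date(2016, 2, 1), date(2016, 2, 20)),
]


def to_events(referrals):
    """Build fact rows: (date, record_number, event, ReferralEventValue)."""
    events = []
    for record_number, opened, closed in referrals:
        events.append((opened, record_number, "Opened", 1))
        if closed is not None:
            events.append((closed, record_number, "Closed", -1))
    return sorted(events)


def count_events(events, event, start, end):
    """Number of Opened or Closed events between start and end inclusive."""
    return sum(1 for d, _, e, _ in events if e == event and start <= d <= end)


def open_at(events, when):
    """Running sum of ReferralEventValue from the beginning of time."""
    return sum(value for d, _, _, value in events if d <= when)


def open_during(events, start, end):
    """Referrals open at any point in the period."""
    already_open = open_at(events, start - timedelta(days=1))
    return already_open + count_events(events, "Opened", start, end)


EVENTS = to_events(REFERRALS)
feb_start, feb_end = date(2016, 2, 1), date(2016, 2, 29)
print(count_events(EVENTS, "Opened", feb_start, feb_end))
print(count_events(EVENTS, "Closed", feb_start, feb_end))
print(open_during(EVENTS, feb_start, feb_end))
```

In this sketch a referral counts as closed from its end date onward, so open_at on the end date itself no longer includes it; whether that matches the reporting rules is a decision for the real model.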

Designing a scalable points leaderboard system using SQL Server

I'm looking for suggestions for scaling a points leaderboard system. I already have a working version using a very normalized strategy. This first version was essentially a table which looked something like this.
UserPoints - PK: (UserId,Date)
+------------+--------+---------------------+
| UserId | Points | Date |
+------------+--------+---------------------+
| 1 | 10 | 2011-03-17 07:16:36 |
| 2 | 35 | 2011-03-17 08:09:26 |
| 3 | 40 | 2011-03-17 08:05:36 |
| 1 | 65 | 2011-03-17 09:01:37 |
| 2 | 16 | 2011-03-17 10:12:35 |
| 3 | 64 | 2011-03-17 12:51:33 |
| 1 | 300 | 2011-03-17 12:19:21 |
| 2 | 1200 | 2011-03-17 13:24:13 |
| 3 | 510 | 2011-03-17 17:29:32 |
+------------+--------+---------------------+
I then have a stored procedure which basically does a GROUP BY UserId and sums the Points. I can also pass @StartDate and @EndDate parameters to create a leaderboard for a specific time period, for example time windows for Top Users for the Day / Week / Month / Lifetime.
This seemed to work well with a moderate amount of data, but things became noticeably slower as the number of points records passed a million or so. The test data I'm working with is just over a million point records created by about 500 users distributed over a timespan of 3 months.
Is there a different way to approach this? I have experimented with denormalizing the data by pre-grouping the points into hour datetime buckets to reduce the number of rows. But I'm starting to think the real problem I need to worry about is the increasing number of users that need to be accounted for in the leaderboard. The time window sizes will generally be small but more and more users will start generating points within any given window.
Unfortunately I don't have access to 'Jobs' since I'm using SQL Azure and the Agent is not available (yet). But, I am open to the idea of scaling this using a different storage system if you are convincing enough.
My past work experience tells me I should look into data warehousing since this is almost a reporting problem. But at the same time I need it to be as real-time as possible.
Update
Ultimately, I would like to support custom leaderboards that could span from Monday 8am - Friday 6pm every week. But that's down the road and why I'm trying to not get too fancy with the aggregation. I'm willing to settle with basic Day/Week/Month/Year/AllTime windows for now.
The tricky part is that I really can't store them denormalized because I need these windows to be time zone convertible. The system is multi-tenant and therefore all data is stored as UTC. The problem is that a week starts at different hours for different customers. Aggregating the sums together will cause some points to fall into the wrong buckets.
Here are a few thoughts:
1. Sticking with SQL Azure: you can have another table, PointsTotals. Every time you add a row to your UserPoints table, also increment the TotalPoints value for a given UserId in PointsTotals (or insert a new row if they don't have a row to increment). Now you always have totals computed for each UserId.
2. Going with Azure Table Storage: Create a UserPoints table, with Partition Key being userId. This keeps all of a user's points rows together, where you'd easily be able to sum them. And... you can borrow the idea from suggestion #1, creating a separate PointsTotals table, with PartitionKey being UserId and RowKey probably being the total points.
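The running-totals idea from the SQL-based suggestion can be sketched as follows. SQLite (via Python's sqlite3 module) stands in for SQL Azure here, so the syntax is illustrative only; the table and column names follow the ones used above, and the helper names add_points and leaderboard are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE UserPoints (
        UserId INTEGER NOT NULL,
        Points INTEGER NOT NULL,
        "Date" TEXT    NOT NULL,
        PRIMARY KEY (UserId, "Date")
    );
    CREATE TABLE PointsTotals (
        UserId      INTEGER PRIMARY KEY,
        TotalPoints INTEGER NOT NULL
    );
""")


def add_points(conn, user_id, points, when):
    """Record a points row and keep the running total in step."""
    with conn:  # both writes commit together or not at all
        conn.execute("INSERT INTO UserPoints VALUES (?, ?, ?)",
                     (user_id, points, when))
        cur = conn.execute(
            "UPDATE PointsTotals SET TotalPoints = TotalPoints + ? "
            "WHERE UserId = ?", (points, user_id))
        if cur.rowcount == 0:  # first points for this user
            conn.execute("INSERT INTO PointsTotals VALUES (?, ?)",
                         (user_id, points))


def leaderboard(conn):
    """Lifetime leaderboard read straight from the totals table."""
    return conn.execute(
        "SELECT UserId, TotalPoints FROM PointsTotals "
        "ORDER BY TotalPoints DESC").fetchall()


# The sample rows from the question.
for row in [
    (1, 10, "2011-03-17 07:16:36"), (2, 35, "2011-03-17 08:09:26"),
    (3, 40, "2011-03-17 08:05:36"), (1, 65, "2011-03-17 09:01:37"),
    (2, 16, "2011-03-17 10:12:35"), (3, 64, "2011-03-17 12:51:33"),
    (1, 300, "2011-03-17 12:19:21"), (2, 1200, "2011-03-17 13:24:13"),
    (3, 510, "2011-03-17 17:29:32"),
]:
    add_points(conn, *row)

print(leaderboard(conn))
```

This only covers the lifetime leaderboard. Day, week, or month windows would need either a query over UserPoints or one totals row per user per window.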
If it were my problem, I'd ignore the timestamps and store the user and points totals by day.
I decided to go with the idea of storing points along with a timespan (StartDate and EndDate columns) localized to the customer's current time zone setting. An extra benefit is that I can 'purge' old leaderboard round data after a few months without affecting the lifetime total of points.
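One way to compute such localized round boundaries is sketched below in Python. It assumes weekly rounds that start on Monday at midnight in the customer's local time, and it uses a fixed UTC offset per customer, which ignores daylight saving time; a real system would more likely use named time zones. The function name local_week_bounds_utc is hypothetical.

```python
from datetime import datetime, timedelta, timezone


def local_week_bounds_utc(utc_moment, utc_offset_hours):
    """Return (StartDate, EndDate) in UTC for the local week of utc_moment.

    Assumes the week starts Monday 00:00 local time and that the customer
    has a fixed UTC offset (no daylight saving).
    """
    tz = timezone(timedelta(hours=utc_offset_hours))
    local = utc_moment.astimezone(tz)
    start_local = (local - timedelta(days=local.weekday())).replace(
        hour=0, minute=0, second=0, microsecond=0)
    end_local = start_local + timedelta(days=7)
    return start_local.astimezone(timezone.utc), end_local.astimezone(timezone.utc)


# The same UTC instant lands in different weekly rounds for two customers.
moment = datetime(2011, 3, 21, 2, 0, tzinfo=timezone.utc)  # Monday 02:00 UTC
print(local_week_bounds_utc(moment, 0))    # customer on UTC
print(local_week_bounds_utc(moment, -5))   # customer five hours behind UTC
```

For the customer five hours behind UTC it is still Sunday evening, so the points belong to the round that started the previous Monday. This is the bucket mismatch that UTC-only aggregation would cause.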