Designing a scalable points leaderboard system using SQL Server - sql

I'm looking for suggestions for scaling a points leaderboard system. I already have a working version using a very normalized strategy. This first version was essentially a table which looked something like this.
UserPoints - PK: (UserId,Date)
+------------+--------+---------------------+
| UserId | Points | Date |
+------------+--------+---------------------+
| 1 | 10 | 2011-03-17 07:16:36 |
| 2 | 35 | 2011-03-17 08:09:26 |
| 3 | 40 | 2011-03-17 08:05:36 |
| 1 | 65 | 2011-03-17 09:01:37 |
| 2 | 16 | 2011-03-17 10:12:35 |
| 3 | 64 | 2011-03-17 12:51:33 |
| 1 | 300 | 2011-03-17 12:19:21 |
| 2 | 1200 | 2011-03-17 13:24:13 |
| 3 | 510 | 2011-03-17 17:29:32 |
+------------+--------+---------------------+
I then have a stored procedure which basically does a GROUP BY UserId and sums the Points. I can also pass @StartDate and @EndDate parameters to create a leaderboard for a specific time period, for example time windows for Top Users for the Day / Week / Month / Lifetime.
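For reference, the windowed aggregation described above can be sketched roughly as follows. This is only an illustration: it uses Python's built-in sqlite3 as a stand-in for SQL Server, the function name `leaderboard` is made up, and on SQL Server the `LIMIT` would be a `TOP (n)` instead.

```python
import sqlite3

def leaderboard(conn, start, end, limit=10):
    """Sum points per user for rows with start <= Date < end, highest total first."""
    return conn.execute(
        """
        SELECT UserId, SUM(Points) AS Total
        FROM UserPoints
        WHERE Date >= ? AND Date < ?
        GROUP BY UserId
        ORDER BY Total DESC, UserId
        LIMIT ?
        """,
        (start, end, limit),
    ).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE UserPoints (UserId INTEGER, Points INTEGER, Date TEXT,"
    " PRIMARY KEY (UserId, Date))"
)
# The sample rows from the table above.
conn.executemany(
    "INSERT INTO UserPoints VALUES (?, ?, ?)",
    [
        (1, 10, "2011-03-17 07:16:36"), (2, 35, "2011-03-17 08:09:26"),
        (3, 40, "2011-03-17 08:05:36"), (1, 65, "2011-03-17 09:01:37"),
        (2, 16, "2011-03-17 10:12:35"), (3, 64, "2011-03-17 12:51:33"),
        (1, 300, "2011-03-17 12:19:21"), (2, 1200, "2011-03-17 13:24:13"),
        (3, 510, "2011-03-17 17:29:32"),
    ],
)

whole_day = leaderboard(conn, "2011-03-17 00:00:00", "2011-03-18 00:00:00")
afternoon = leaderboard(conn, "2011-03-17 12:00:00", "2011-03-18 00:00:00")
```

The half-open window (start inclusive, end exclusive) is a choice made here so that adjacent windows never count a row twice.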
This seemed to work well with a moderate amount of data, but things became noticeably slower as the number of points records passed a million or so. The test data I'm working with is just over a million point records created by about 500 users distributed over a timespan of 3 months.
Is there a different way to approach this? I have experimented with denormalizing the data by pre-grouping the points into hour datetime buckets to reduce the number of rows. But I'm starting to think the real problem I need to worry about is the increasing number of users that need to be accounted for in the leaderboard. The time window sizes will generally be small but more and more users will start generating points within any given window.
Unfortunately I don't have access to 'Jobs' since I'm using SQL Azure and the Agent is not available (yet). But, I am open to the idea of scaling this using a different storage system if you are convincing enough.
My past work experience tells me I should look into data warehousing since this is almost a reporting problem. But at the same time I need it to be as real-time as possible.
Update
Ultimately, I would like to support custom leaderboards that could span from Monday 8am - Friday 6pm every week. But that's down the road and why I'm trying to not get too fancy with the aggregation. I'm willing to settle with basic Day/Week/Month/Year/AllTime windows for now.
The tricky part is that I really can't store them denormalized because I need these windows to be time-zone convertible. The system is multi-tenant and therefore all data is stored as UTC. The problem is that a week starts at different hours for different customers, so aggregating the sums together will cause some points to fall into the wrong buckets.

Here are a few thoughts:
1. Sticking with SQL Azure: you can have another table, PointsTotals. Every time you add a row to your UserPoints table, also increment the TotalPoints value for a given UserId in PointsTotals (or insert a new row if they don't have a row to increment). Now you always have totals computed for each UserId.
2. Going with Azure Table Storage: create a UserPoints table, with PartitionKey being UserId. This keeps all of a user's points rows together, where you'd easily be able to sum them. And you can borrow the idea from suggestion #1, creating a separate PointsTotals table, with PartitionKey being UserId and RowKey probably being the total points.
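A minimal sketch of the running-total idea from suggestion #1, again using sqlite3 as a stand-in rather than SQL Azure. The helper name `add_points` is hypothetical; the point is only that the detail insert and the total increment happen in one transaction.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE UserPoints (
        UserId INTEGER, Points INTEGER, Date TEXT,
        PRIMARY KEY (UserId, Date)
    );
    CREATE TABLE PointsTotals (
        UserId INTEGER PRIMARY KEY,
        TotalPoints INTEGER NOT NULL
    );
    """
)

def add_points(conn, user_id, points, when):
    """Insert the detail row and bump the running total atomically."""
    with conn:  # commits on success, rolls back if anything raises
        conn.execute(
            "INSERT INTO UserPoints VALUES (?, ?, ?)", (user_id, points, when)
        )
        cur = conn.execute(
            "UPDATE PointsTotals SET TotalPoints = TotalPoints + ? WHERE UserId = ?",
            (points, user_id),
        )
        if cur.rowcount == 0:  # no total row yet for this user
            conn.execute(
                "INSERT INTO PointsTotals VALUES (?, ?)", (user_id, points)
            )

add_points(conn, 1, 10, "2011-03-17 07:16:36")
add_points(conn, 1, 65, "2011-03-17 09:01:37")
add_points(conn, 2, 35, "2011-03-17 08:09:26")
totals = dict(conn.execute("SELECT UserId, TotalPoints FROM PointsTotals"))
```

Note that this only gives lifetime totals; day/week/month windows would need either the detail rows or one total row per user per period.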

If it were my problem, I'd ignore the timestamps and store the user and points totals by day

I decided to go with the idea of storing points along with a timespan (StartDate and EndDate columns) localized to the customer's current TimeZone setting. An extra benefit I realized is that I can 'purge' old leaderboard round data after a few months without affecting the lifetime total of points.
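To illustrate why the period has to be localized, here is a rough sketch of deriving the start of a customer's local week from a UTC timestamp. The function name and the Monday-based week are assumptions for the example, not something stated above; any `tzinfo` (a fixed offset here, or a `zoneinfo.ZoneInfo`) can be passed in.

```python
from datetime import date, datetime, timedelta, timezone

def local_week_start(utc_ts, tz):
    """Return local midnight of the Monday that starts utc_ts's week in tz."""
    local = utc_ts.astimezone(tz)  # utc_ts must be timezone-aware
    midnight = local.replace(hour=0, minute=0, second=0, microsecond=0)
    return midnight - timedelta(days=local.weekday())  # weekday(): Monday == 0

# The same UTC instant falls into different weeks for different customers.
ts = datetime(2011, 3, 21, 3, 0, tzinfo=timezone.utc)
utc_customer = local_week_start(ts, timezone.utc)
west_customer = local_week_start(ts, timezone(timedelta(hours=-8)))
```

Points awarded at `ts` belong to the week starting 2011-03-21 for the UTC customer but to the week starting 2011-03-14 for the customer at UTC-8, which is exactly the bucket mismatch a single UTC-based aggregate would cause.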

Related

Issue with Aggregated Access SQL

I'm trying to SUM a group of cells by their dates in Access, so that a report shows each date once along with the total for all cells that have that date, followed by the next date, and so on.
I'm sure it's a combination of things, or something simple, but could someone please explain how I would do this?
Thanks
EVV Table
+-------------+--------+
| DateInputed | Claims |
+-------------+--------+
|02/08/2021 | 15 |
|02/08/2021 | 31 |
|03/01/2020 | 21 |
+-------------+--------+
Report Should look like
By Date Report
-------------
02/08/2021 46
03/01/2020 21
--------------
Totals 67
With Distinct obviously being used by the Date portion Query and A SUM being done per Date.... Does this make more sense
Here's what I've thought of trying
So I was massively overthinking this, and I would like to credit @June7 with the win, since they were able to point out Grouping and Sorting to me: GROUPING.
So Here's my Answer that worked
I Created a SQL Query
SELECT EVV.DateInputed, Sum(EVV.ClaimNumber) AS Total_Claim, Sum(EVV.[Total Failed Claims]) AS Total_Fail
FROM EVV
GROUP BY EVV.DateInputed
HAVING (((EVV.DateInputed)>=[Forms]![Search for EVV Totals]![fromDate] And (EVV.DateInputed)<=[Forms]![Search for EVV Totals]![toDate]));
Then I created a simple report based on that and BAM, instant answer. So thank you to everyone who helped.
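The same grouping can be tried outside Access. The sketch below uses sqlite3 and the three sample rows from the question (with the `Claims` column as shown in the sample table, not the `ClaimNumber` column from the final query), so it is an approximation of the report rather than the Access query itself.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE EVV (DateInputed TEXT, Claims INTEGER)")
conn.executemany(
    "INSERT INTO EVV VALUES (?, ?)",
    [("02/08/2021", 15), ("02/08/2021", 31), ("03/01/2020", 21)],
)

# One row per date with the summed claims (the body of the report).
by_date = conn.execute(
    "SELECT DateInputed, SUM(Claims) FROM EVV"
    " GROUP BY DateInputed ORDER BY DateInputed"
).fetchall()

# Grand total (the report footer).
grand_total = conn.execute("SELECT SUM(Claims) FROM EVV").fetchone()[0]

for day, total in by_date:
    print(day, total)
print("Totals", grand_total)
```

The dates are kept as plain text here, so the ordering is alphabetical; a real date column would sort chronologically.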

Best practice for saving a series of dates in SQL

I'm reworking some old programs and in one of them I need to save a repeating series of dates in the database. The user picks days ranging from 1-31 and months ranging from 1-12 in a PHP form. Multiple choices are possible. At least one of each must be provided.
I'll then use a daily scheduled Task to check if the value (day and month) is given and if yes - do something.
In the old system I saved it like this:
| Days | Months |
|1,2,5,13,15 | 1,2,3,4,5,6,7,8,9,10,11,12|
Then I exploded every row in the PHP-File fired by the scheduled Task and iterated over the Array. If one of the dates is valid - do something.
What is best practice for this Use-Case? I thought about some solutions like "saving all possible Outcomes of days and months as single rows in an mapping-table" but I don't think that's an elegant solution...and it needs to be editable too after being implemented.
Any suggestions?
I think you're looking at three tables.
Table one records the groups, give it a sequential group id and whatever other properties you need to record about the group of dates as a whole (requesting user id).
Second table is just group id from table one and the chosen days in rows, so each group has multiple rows.
Third table is the same as for second but for months.
When you need the final result, join the second and third tables to the first on the group id. You'll automatically get a cross join between the two, giving the combinations you need.
If you're expecting a large volume of data and/or a lot of repeats of the same groups, then you may want to consider re-using the groups of days and months. It will be a similar table design, but tables 2 and 3 will have their own group ids and table one will have two extra columns, one for the day group and one for the month group.
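A rough sketch of the three-table layout, using sqlite3 so it can be run as-is. The table and column names (`date_groups`, `group_days`, `group_months`) and the helper `is_due` are invented for the example; the answer above does not name them.

```python
import sqlite3
from datetime import date

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE date_groups (group_id INTEGER PRIMARY KEY, user_id INTEGER);
    CREATE TABLE group_days (
        group_id INTEGER REFERENCES date_groups(group_id),
        day INTEGER CHECK (day BETWEEN 1 AND 31),
        PRIMARY KEY (group_id, day)
    );
    CREATE TABLE group_months (
        group_id INTEGER REFERENCES date_groups(group_id),
        month INTEGER CHECK (month BETWEEN 1 AND 12),
        PRIMARY KEY (group_id, month)
    );
    """
)

# The example row from the question: days 1,2,5,13,15 in every month.
conn.execute("INSERT INTO date_groups VALUES (1, 42)")
conn.executemany("INSERT INTO group_days VALUES (1, ?)", [(d,) for d in (1, 2, 5, 13, 15)])
conn.executemany("INSERT INTO group_months VALUES (1, ?)", [(m,) for m in range(1, 13)])

# Joining both child tables on group_id yields every (month, day) combination.
combos = conn.execute(
    """
    SELECT m.month, d.day
    FROM date_groups g
    JOIN group_days d ON d.group_id = g.group_id
    JOIN group_months m ON m.group_id = g.group_id
    WHERE g.group_id = ?
    ORDER BY m.month, d.day
    """,
    (1,),
).fetchall()

def is_due(conn, group_id, today):
    """True if today's day and month were both chosen for the group."""
    row = conn.execute(
        """
        SELECT 1 FROM group_days d
        JOIN group_months m ON m.group_id = d.group_id
        WHERE d.group_id = ? AND d.day = ? AND m.month = ?
        """,
        (group_id, today.day, today.month),
    ).fetchone()
    return row is not None
```

The daily scheduled task would then only need to call something like `is_due(conn, 1, date.today())` instead of exploding comma-separated strings.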
Seems, you can use a dimension-like scheme and attach day-month pairs to different entities. Suppose, the entity is called "task".
| tasks   |            | days     |            | months   |
| ------- |            | -------- |            | -------- |
| id_task |            | id_day   |            | id_month |
| ...     | >---M:1--- | id_month | >---M:1--- | month    |
| id_day  |            | day      |            |          |
Don't forget to add check constraints for day (1-31) and month (1-12) columns.
I think you should expand the data in the database. Clearly, you need a table groups (or something like that) with one row per group:
create table groups (
    group_id int identity(1, 1) primary key,
    . . . -- additional columns
);
Then, expand the dates for each group for the schedule:
create table groups_schedule (
    group_schedule_id int identity(1, 1) primary key,
    group_id int references groups(group_id),
    month int,
    day int
);
This requires multiplying out the data in the database. However, I think it is a more accurate representation. In addition, it will give you more flexibility in the future so you are not tied specifically to lists of months/days. For instance, you might have day "25" in most months, but not December.
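As an illustration of the expanded layout and the "25 in most months, but not December" case, here is a small sqlite3 sketch. The parent table is called `date_groups` here purely to sidestep keyword clashes in SQLite, the check constraints are an addition, and `due_groups` is a hypothetical helper for the daily task.

```python
import sqlite3
from datetime import date

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE date_groups (group_id INTEGER PRIMARY KEY);
    CREATE TABLE groups_schedule (
        group_schedule_id INTEGER PRIMARY KEY,
        group_id INTEGER REFERENCES date_groups(group_id),
        month INTEGER CHECK (month BETWEEN 1 AND 12),
        day INTEGER CHECK (day BETWEEN 1 AND 31)
    );
    """
)

# Group 1 runs on the 25th of every month except December:
# one explicit (month, day) row per scheduled date.
conn.execute("INSERT INTO date_groups VALUES (1)")
conn.executemany(
    "INSERT INTO groups_schedule (group_id, month, day) VALUES (1, ?, 25)",
    [(m,) for m in range(1, 12)],
)

def due_groups(conn, today):
    """Return the ids of all groups scheduled for today's month and day."""
    rows = conn.execute(
        "SELECT DISTINCT group_id FROM groups_schedule"
        " WHERE month = ? AND day = ? ORDER BY group_id",
        (today.month, today.day),
    ).fetchall()
    return [group_id for (group_id,) in rows]
```

Because each scheduled date is its own row, the daily check is a single indexed lookup, and irregular schedules need no special handling.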

Calculating interest using SQL

I am using PostgreSQL, and have a table for a billing cycle and another for payments made in a billing cycle.
I am trying to figure out how to calculate interest based on how much amount was left after each billing cycle's last payment date. Problem is that every time a repayment is made, the interest has to be calculated on the amount remaining after that.
My thoughts on building this query are like this. Build data for all dates from last pay date of the billing cycle to today. Using partitioning, get the remaining amount for the first date. For second date, use amount from previous row and add interest to it, and then calculate interest on this one.
Unfortunately I am stuck just at the thought and can't figure out how to make this into a query!
Here's some sample data to make things easier to understand.
Billing Cycles:
id | ends_at
-----+---------------------
1 | 2017-11-30
2 | 2017-11-30
Payments:
amount | billing_cycle_id | type | created_at
-----------+------------------+---------+----------------------------
6000.0000 | 1 | payment | 2017-11-15 18:40:22.151713
2000.0000 | 1 |repayment| 2017-11-19 11:45:15.6167
2000.0000 | 1 |repayment| 2017-12-02 11:46:40.757897
So if we see, user made a repayment on the 19th, so amount due for interest post ends date(30th Nov 2017), is only 4000. So, from 30th to the 2nd, interest will be calculated daily on 4000. However, from the 2nd, interest needs to be calculated on 2000 only.
Interest Calculations(Today being 2017-12-04):
date | amount | interest
------------+---------+----------
2017-12-01 | 4000 | 100 // First day of pending dues.
2017-12-02 | 2100 | 52.5 // Second day of pending dues.
2017-12-03 | 2152.5 | 53.8125 // Third day of pending dues.
2017-12-04 |2206.3125| // Fourth's day interest will be added tomorrow
Your data is too sparse. It doesn't make any sense to need to write this query, because over time the query will get significantly more complicated. What happens when interest rates change over time?
The table itself (or a secondary table, depending on how you want to structure it) could have a running balance you add every time a deposit / withdrawal is made. (I suggest this table be add-only) Otherwise you're making both the calculation and accounting far harder on yourself than it should be. Even with the way you've presented the problem here, there's not enough information to do the calculation. (interest rate is missing) When that's the case, your stored procedure is going to be too complicated. Complicated means bugs, and people get irritated about bugs when you're talking about their money.
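To make the running-balance idea concrete, here is a small sketch that reproduces the figures in the question's table. The daily rate of 2.5% is not given anywhere; it is inferred from the sample row (100 interest on 4000) and is purely an assumption, as is the function name `daily_balances`.

```python
from datetime import date, timedelta
from decimal import Decimal

DAILY_RATE = Decimal("0.025")  # assumption: inferred from 100 / 4000 in the sample

def daily_balances(opening, first_day, last_day, repayments):
    """Yield-style list of (day, balance, interest) with daily compounding.

    repayments maps a date to the amount repaid on that date; a repayment
    is subtracted before that day's interest is calculated.
    """
    rows = []
    balance = opening
    day = first_day
    while day <= last_day:
        balance -= repayments.get(day, Decimal("0"))
        interest = balance * DAILY_RATE
        rows.append((day, balance, interest))
        balance += interest  # interest is added to the next day's balance
        day += timedelta(days=1)
    return rows

rows = daily_balances(
    opening=Decimal("4000"),  # 6000 payment minus the 2000 repaid on Nov 19
    first_day=date(2017, 12, 1),
    last_day=date(2017, 12, 4),
    repayments={date(2017, 12, 2): Decimal("2000")},
)
for day, balance, interest in rows:
    print(day, balance, interest)
```

An add-only ledger table holding one such row per day (or per transaction) would let the database read the current balance directly instead of recomputing the whole chain in a query.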

HR Cube in SSAS

I have to design a cube for student attendance. We have four statuses (Present, Absent, Late, In vacation). The cube has to let me know the number of students who are not present in a given period of time (day, month, year, etc.) and what percentage that is of the total number.
I built a fact table like this:
City ID | Class ID | Student ID | Attendance Date | Attendance State | Total Students number
--------------------------------------------------------------------------------------------
1 | 1 | 1 | 2016-01-01 | ABSENT | 20
But in my SSRS project I couldn't use this to get the correct numbers. I have to filter by date, city and attendance status.
For example, I must know that on date X there are 12 students not present, which corresponds to 11% of the total number.
Any suggestion of a good structure to achieve this.
I assume this is homework.
Your fact table is wrong.
Don't store aggregated data (Total Students) in the fact as it can make calculations difficult.
Don't store text values like 'Absent' in the fact table. Attributes belong in the dimension.
Reading homework for you:
Difference between a Fact and Dimension and how they work together
What is the grain of a Fact and how does that affect aggregations and calculations.
There is a wealth of information on the Kimball Group's pages. Start with the lower-numbered tips, as they get more advanced as you move on.
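One possible reading of that advice, sketched with sqlite3 rather than SSAS: the fact table holds one row per student per day and only a key to a status dimension, and the totals and percentages are derived at query time. All table and column names are invented, and treating only 'Present' as present (so 'Late' counts as not present) is an assumption that would need checking against the real requirement.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE dim_status (
        status_id INTEGER PRIMARY KEY,
        status_name TEXT NOT NULL,
        is_present INTEGER NOT NULL  -- 1 only for 'Present' (assumption)
    );
    -- Grain: one row per student per day, no pre-aggregated totals.
    CREATE TABLE fact_attendance (
        city_id INTEGER, class_id INTEGER, student_id INTEGER,
        attendance_date TEXT,
        status_id INTEGER REFERENCES dim_status(status_id),
        PRIMARY KEY (student_id, attendance_date)
    );
    """
)
conn.executemany(
    "INSERT INTO dim_status VALUES (?, ?, ?)",
    [(1, "Present", 1), (2, "Absent", 0), (3, "Late", 0), (4, "In vacation", 0)],
)
# Made-up data: 20 students on one day, 17 present, 2 absent, 1 late.
statuses = [1] * 17 + [2, 2, 3]
conn.executemany(
    "INSERT INTO fact_attendance VALUES (1, 1, ?, '2016-01-01', ?)",
    [(student_id, s) for student_id, s in enumerate(statuses, start=1)],
)

def not_present_stats(conn, day, city_id):
    """Return (not_present, total, percent) for one city on one day."""
    not_present, total = conn.execute(
        """
        SELECT SUM(1 - s.is_present), COUNT(*)
        FROM fact_attendance f
        JOIN dim_status s ON s.status_id = f.status_id
        WHERE f.attendance_date = ? AND f.city_id = ?
        """,
        (day, city_id),
    ).fetchone()
    return not_present, total, 100.0 * not_present / total

stats = not_present_stats(conn, "2016-01-01", 1)
```

Because the total is a count over the fact rows, it stays correct under any filter (city, class, date range) without being stored anywhere.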

SSAS Row Count Aggregation

Hi I have a table like this:
idCustomer | idTime | idStatus
---------------------------------
1 | 20010101 | 2
1 | 20010102 | 2
1 | 20010103 | 3
2 | 20010101 | 1
...
I have now added this table as a factless fact table in my cube with a measure which aggregates the row count for each customer, so that for each day I can see how many customers are at each status and I can drill down to see which customers they are.
This is all well and good, but when I roll it up to the month or year level it starts summing up the values of each day, whereas I want to see the last non-empty value.
I'm not sure if this is possible, but I can't think of another way of getting this information without creating a fact table with the counts for each status on each day and losing the ability to drill down.
Can anyone help??
An easy way to get what you want would be to convert your factless fact table to one having a fact: the count. Just add a named calculation to the table object in the data source view. Name the calculation like you want your measure to be named, and use 1 as the expression. Then you can define a measure based on this calculation using the aggregate function "LastNonEmpty", and use this instead of your current count measure.
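To show the difference between the two roll-ups, here is a plain Python sketch that mimics, as far as I understand it, what a Sum versus a "LastNonEmpty" aggregation would return at month level. It uses only the four rows visible in the question (the elided rows are left out), and the function names are made up; it is not SSAS code.

```python
# (idCustomer, idTime, idStatus): only the rows shown in the question.
rows = [
    (1, 20010101, 2),
    (1, 20010102, 2),
    (1, 20010103, 3),
    (2, 20010101, 1),
]

def summed_count(rows, status):
    """Row count for a status added up over all days (the unwanted roll-up)."""
    return sum(1 for _, _, s in rows if s == status)

def last_non_empty_count(rows, status):
    """Row count for a status on the latest day that has any row for it."""
    days = [day for _, day, s in rows if s == status]
    if not days:
        return 0
    last_day = max(days)
    return sum(1 for _, day, s in rows if s == status and day == last_day)

for status in (1, 2, 3):
    print(status, summed_count(rows, status), last_non_empty_count(rows, status))
```

For status 2 the summed roll-up counts customer 1 twice (once per day), while the last-non-empty roll-up reports the single customer seen on the last day that has data for that status.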