Report scheduler system design using database as master - SQL

Problem
we have ~50k scheduled financial reports that we periodically deliver to clients via email
each report has its own delivery frequency (a date & time schedule, as configured by the client), e.g.:
weekly
daily
hourly
weekdays only
etc.
Current architecture
we have a table called report_metadata that holds report information
report_id
report_name
report_type
report_details
next_run_time
last_run_time
etc...
every week, all 6 instances of our scheduler service poll the report_metadata database, extract metadata for all reports due for delivery in the following week, and put them in an in-memory timed-queue.
Only in the master/leader instance (which is one of the 6 instances):
data in the timed-queue is popped at the appropriate time
processed
a few API calls are made to get a complete, up-to-date report
and the report is emailed to clients
the other 5 instances do nothing - they simply exist for redundancy
Proposed architecture
Numbers:
db can handle up to 1000 concurrent connections - which is good enough
the total number of existing reports (~50k) is unlikely to grow much in the near or distant future
Solution:
instead of polling the report_metadata db every week and storing data in an in-memory timed-queue, all 6 instances will poll the report_metadata db every 60 seconds (with a 10 s offset per instance)
so, across the 6 instances, a pick-up attempt happens on average every 10 seconds
data for any single report whose next_run_time is in the past is extracted; the table row is locked, and the report is processed and delivered to clients by that specific instance
after the report is successfully processed, the table row is unlocked and the report's next_run_time, last_run_time, etc. are updated
In general, the database serves as the master; individual instances of the process work independently, and the database ensures they do not overlap.
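For illustration, here is a minimal sketch of the claim step, assuming SQL Server (the question doesn't name the engine; the lock hints, types and next-run computation are placeholders, not a definitive implementation):

-- claim one due report; READPAST lets the other instances skip rows
-- that are already locked instead of blocking behind them
DECLARE @claimed_report_id int;

BEGIN TRANSACTION;

SELECT TOP (1) @claimed_report_id = report_id
FROM report_metadata WITH (UPDLOCK, ROWLOCK, READPAST)
WHERE next_run_time <= SYSUTCDATETIME()
ORDER BY next_run_time;

-- ... make the API calls, build and email the report ...

UPDATE report_metadata
SET last_run_time = SYSUTCDATETIME(),
    next_run_time = DATEADD(DAY, 1, next_run_time) -- placeholder: derive from the report's frequency
WHERE report_id = @claimed_report_id;

COMMIT; -- releases the row lock

-- for the indexing question: the polling predicate suggests an index on next_run_time
CREATE INDEX ix_report_metadata_next_run ON report_metadata (next_run_time);

Note that the row lock is held for the entire delivery; the flag-based claim suggested in the answers below avoids such a long-lived transaction.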
It would help if you could let me know:
whether the proposed architecture is a good/correct solution
which table columns can/should be indexed
any other considerations

I have worked on a different kind of scheduler, for a program that reported analyses at specific moments of the month/week. What I did was map the reports onto so-called business-cycle time moments: "start of a new week", "start of the month", and in general the start/end of a day/week/month/quarter/year. So I standardised the moments of sending the reports and added the IDs to a table carrying the details of each report. You can then add things to a cycle, or remove them when needed, by tagging each report with a label such as EOD (end of day), EOM (end of month), SOW (start of week), etc.
That way you can index the moments at which clients want to receive their reports and build on that track. Hope this comment helps you with your challenge.
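A rough sketch of what that could look like in SQL (the cycle_tag column name is my invention; the tag values are the ones above):

-- tag each report with a standardized business-cycle moment and index it
ALTER TABLE report_metadata ADD cycle_tag varchar(4); -- 'EOD', 'EOM', 'SOW', ...
CREATE INDEX ix_report_metadata_cycle_tag ON report_metadata (cycle_tag);

-- at the start of a week, pick up everything tagged "start of week":
SELECT report_id, report_name
FROM report_metadata
WHERE cycle_tag = 'SOW';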

It seems fine to simply have all 6 instances query that metadata table to check which report to process next, as you are suggesting.
It seems odd, though, to have a staggered approach where each server checks once every 60 seconds with a 10-second offset. You have 6 servers now, but that may change. Also, I don't understand the "locking" you are suggesting - why not simply set a flag on the row, such as [State] = "processing"? Then the next scheduler knows to skip that row and move on to the next available one. Once a run is processed, you can simply update a [Date_last_processed] column, or maybe something like [last_cycle_complete] = 'YES'.
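A sketch of that flag-based claim as a single atomic statement (assuming SQL Server and reusing the [State] and [Date_last_processed] names from above; the 'available' value is an assumption):

-- claim the next available row; a single UPDATE is atomic, so two
-- schedulers cannot claim the same report
UPDATE TOP (1) report_metadata
SET [State] = 'processing'
OUTPUT inserted.report_id
WHERE [State] = 'available'
  AND next_run_time <= SYSUTCDATETIME();

-- after a successful run, release the row and record completion
UPDATE report_metadata
SET [State] = 'available',
    [Date_last_processed] = SYSUTCDATETIME()
WHERE report_id = @claimed_report_id; -- the id returned by OUTPUT above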
Alternatively, you could have one server process go through the table and, for each available row, send it off to one of the instances in round-robin fashion (or keep track of who is busy and who isn't).

Related

Reducing database load from consecutive queries

I have an application which calls the database multiple times to achieve one simple goal.
A little information about this application: in short, it scrapes data from a webpage and stores specific information from that page in a database. The important fields here are: player name, position, kill points and class.
Player name: can change or remain the same from day to day
Position: multiple players can sit at one specific position
Kill points: can increase or remain the same from day to day; can never decrease
Class: only 2 possibilities; e.g. A can change to B or remain A (and the same in reverse), but can never be C, D, E or F
The player name can change on any particular day, and position can change depending on the kill-point increase since the last update - which brings us back to the goal: search the database day by day, from the current date as far back as 2021-02-22, starting at the most recent entry for a player name and back-tracking to the previous day to check whether that player name is still the same or has changed.
The main reference for detecting a change is the kill points. As the days go on, this number either stays exactly the same or increases; it can never decrease.
So now onto the implementation of this application.
The first query that runs finds the most recent entry for the player name:
SELECT TOP(1) * FROM [changes] WHERE [CharacterName]=#charname AND [Territory]=#territory AND [Archived]=0 ORDER BY [Recorded] DESC
Then it continues to check the previous day's entries with the following query:
SELECT TOP(1) * FROM [changes] WHERE [Territory]=#territory AND [CharacterName]=#charname AND [Recorded]=#searchdate AND ([Class] LIKE '%{Class}%' OR [Class] LIKE '%{GetOpposite(Class)}%') AND [Archived]=0
If no results are found, it proceeds to find an alternative name with the following query:
SELECT TOP(5) * FROM [changes] WHERE [Kills] <= #kills AND [Recorded]='{Data.Recorded.AddDays(-1):yyyy-MM-dd}' AND [Territory]=#territory AND [Mode]=#mode AND ([Class] LIKE #original OR [Class] LIKE #opposite) AND [Archived]=0 ORDER BY [Kills] DESC
The aim of the query above is to get the top 5 entries that are the closest possible matches, which are then cross-referenced with the day ahead:
SELECT COUNT(*) FROM [changes] WHERE [CharacterName]=#CharacterName AND [Territory]=#Territory AND [Recorded]=#SearchedDate AND [Archived]=0
So, when checking the day ahead: if the character name is not found in the day ahead, it is considered to be the old player name for this specific character; if, after searching all 5 of the results, they are all found to be present in the day-ahead searches, then this name is considered to be new to the table.
From the date this application started running up to today's date, that is over 400 individual queries against the database to achieve one goal.
It is also worth noting that this table grows by 14,400-14,500 rows each and every day.
The overall question for this specific case: is it possible to bring all these queries into fewer calls to the database, reducing queries and improving performance?
What you can do to improve performance will be based on what parts of the application stack you can manipulate. Things to try:
Store less data - database content retrieval speed is largely based on how well the database is ordered/normalized and just how much data needs to be searched for each query. Managing a cache of previously scraped pages and only storing data when something has changed between the current scrape and the last one would guarantee fewer redundant requests to the DB.
Separate specific classes of data - separating data into dedicated tables would allow you to query a specific table for a specific character, etc., effectively removing one WHERE clause.
Reduce time between queries - Less incoming concurrent requests means less resource contention and faster response times to prior requests.
Use another data structure - the only reason you're using TOP() is that you need data ordered in some specific way (most recent, etc.). If you used an in-code data structure that keeps the data ordered and still easily queryable, you could offload some SQL requests to this structure instead of the DB.
The suggestions above are not exhaustive, but what you do to improve performance is largely a function of what in the application stack you have the ability to modify.
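To make the last two points concrete, here is a hedged sketch of collapsing the day-by-day walk into one set-based query (T-SQL, columns from the question; the gap detection approximates the name-change check rather than replacing it outright):

-- pull each name's history once and flag the last day a name is seen
-- before it disappears (a rename candidate), instead of one query per day
WITH daily AS (
    SELECT [CharacterName], [Recorded], [Kills],
           LEAD([Recorded]) OVER (PARTITION BY [CharacterName], [Territory]
                                  ORDER BY [Recorded]) AS next_seen
    FROM [changes]
    WHERE [Territory] = @territory
      AND [Archived] = 0
)
SELECT [CharacterName], [Recorded], [Kills]
FROM daily
WHERE next_seen IS NULL OR DATEDIFF(day, [Recorded], next_seen) > 1;

One round trip returns every point where a name vanishes; the cross-reference against kill points can then run in memory over that much smaller result set.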

How to manually test a data retention requirement in a search functionality?

Say data needs to be kept for 2 years. Then all data created 2 years + 1 day ago should no longer be displayed and should be deleted from the server. How do you manually test that?
I'm new to testing and I can't think of any other way. Also, we cannot use automation due to time constraints.
You can create data in the database backdated by more than two years and test whether it is deleted automatically or not. Alternatively, you can change the current business date in the database and test it that way.
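A sketch of the backdating idea (illustrative table and column names, not from the question):

-- seed a row one day past the 2-year retention boundary
INSERT INTO search_history (keyword, created_on)
VALUES ('sport', DATEADD(DAY, -1, DATEADD(YEAR, -2, CAST(GETDATE() AS date))));

-- after the retention job runs, this should return no rows
SELECT keyword, created_on
FROM search_history
WHERE created_on < DATEADD(YEAR, -2, CAST(GETDATE() AS date));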
For the data retention functionality, a manual tester needs to remember the search data so that they can perform the test cases for the search retention feature.
Taking the example of a social networking app: as a manual tester, you need to remember all the users you searched for recently.
To check the retention time period, you can ask a backend developer to shorten it (from, say, one year to 10 minutes) for testing purposes.
Even if you delete the search history and then start typing an already-entered search term, the related result should appear in the first position of the search results. Data retention policies define what data should be stored or archived, where that should happen, and for exactly how long. Once the retention period for a particular data set expires, it can be deleted or moved to secondary or tertiary storage as historical data, depending on the requirement.
Let us understand with an example: suppose we have the data below in our database table, based on past searches made by users. With the help of this table, you can perform this testing with minimum effort and optimum results. The current date is '2022-03-10', and the Status column states whether the data is available in the database: Visible means available, while Expired means deleted from the table.

Search Keyword | Search On Date | Search Expiry Date | Status
sport          | 2022-03-05     | 2024-03-04         | Visible
cricket news   | 2020-03-10     | 2022-03-09         | Expired - Deleted
holy books     | 2020-03-11     | 2022-03-10         | Visible
dance          | 2020-03-12     | 2022-03-11         | Visible

ABAP Program to notify Users X days before user account will be disabled

I'm currently learning ABAP and trying to write an enhancement, but I have gotten confused about how to build on top of the existing code. I have a program that runs periodically via a background job and disables user accounts after X days of inactivity (in this case 90 days, based on USR02~TRDAT).
I want to add an enhancement that notifies users via their email address when their last logon date is 30, 15, 7, 5, 3 and 1 day away from the 90-day inactivity threshold, with a link to the SAP system to help them reactivate the account with ease. (The email address is found by matching USR02~BNAME to USR21~BNAME, then using USR21~PERSNUMBER and USR21~ADDRNUMBER to look up ADR6~SMTP_ADDR, giving the USR02~BNAME -> ADR6~SMTP_ADDR relationship.)
I'm beginning to think that an enhancement might not be a good idea, and that I should rather create a new program and schedule it as a daily background job. Any guidance or information would be greatly appreciated...
Extract
CLASS cl_inactive_users_reader DEFINITION.
  PUBLIC SECTION.
    TYPES:
      BEGIN OF ts_inactive_user,
        user_name          TYPE syst_uname,
        days_of_inactivity TYPE int1,
      END OF ts_inactive_user.
    TYPES tt_inactive_users TYPE STANDARD TABLE OF ts_inactive_user WITH EMPTY KEY.

    CLASS-METHODS read_inactive_users
      IMPORTING
        min_days_of_inactivity TYPE int1
      RETURNING
        VALUE(result)          TYPE tt_inactive_users.
ENDCLASS.
Then refactor
REPORT block_inactive_users.

DATA(inactive_users) = cl_inactive_users_reader=>read_inactive_users( 90 ).

LOOP AT inactive_users INTO DATA(inactive_user).
  " block user
ENDLOOP.
And add
REPORT warn_inactive_users.

DATA(inactive_users) = cl_inactive_users_reader=>read_inactive_users( 60 ).

LOOP AT inactive_users INTO DATA(inactive_user).
  CASE inactive_user-days_of_inactivity.
    " choose urgency
  ENDCASE.
  " send e-mail
ENDLOOP.
and run both reports daily.
Don't create a big ball of mud by squeezing new features into existing code.
From SAP wiki:
The enhancement concept allows you to add your own functionality to SAP's standard business applications without having to modify the original applications. To modify the standard SAP behavior as per customer requirements, we can use enhancement framework.
As per your description, it doesn't sound like a use case for an enhancement. It isn't an intervention in an existing process. The original process and your new requirement are two different processes with some common logic - the selection of users by days of inactivity. The two shouldn't rely on each other.
Structurally I think it is best to have a separate program for computing which e-mails need to be sent and when, and a separate program for actually sending them.
I would copy your original program to a new one and modify it a little, so that instead of disabling a user it records into some table, for each user: 1) an e-mail address, 2) a date when to send, 3) how many days are left (30, 15, 7, etc.), 4) a status showing whether the e-mail was sent or not. Initially you could even have one such job per period (30, 15, 7, etc.) and pass the period as a parameter (which you use inside instead of 90).
You run this program daily as a job, and it populates that table with e-mail "tasks" for what needs to be sent today. It only adds new lines, so lines from yesterday stay in there.
The second program just reads that table, sends the actual e-mails and updates the statuses. You run that program daily as well. (A sketch of such a table follows the list below.)
This way you have:
overview: just check the table to see what's going on
control: if the e-mailer dies or hangs, you can restart it and it will continue where it left off; with statuses you avoid sending duplicate mails
you can make sure you don't send outdated e-mails, by having your mailer script ignore all tasks older than, say, 2 days
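A minimal sketch of that task table in plain SQL (in a real system this would be an ABAP Dictionary table; all names here are illustrative):

CREATE TABLE email_tasks (
    user_name varchar(12)  NOT NULL,            -- USR02~BNAME
    smtp_addr varchar(241) NOT NULL,            -- looked up via ADR6~SMTP_ADDR
    send_on   date         NOT NULL,            -- the date the mail is due
    days_left int          NOT NULL,            -- 30, 15, 7, 5, 3 or 1
    sent      char(1)      NOT NULL DEFAULT 'N' -- flipped to 'Y' by the mailer
);

The first job inserts rows here each day; the mailer then selects rows with sent = 'N' and a recent send_on, sends the e-mails, and updates the flag.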
I want to clarify your confusion about the use of enhancements:
You would want to use an enhancement when "something" happens, or is going to happen, in the system and you want to change that standard behaviour.
That something - let's call it an event or process - could be, for example: an order is placed, a certain user logs onto the system, or a material has been (or is going to be) changed.
The change could be notifying another system of an order, or running additional checks on the logged-on user, for example checking their GUI version and warning them if it is not up to date.
Ask yourself: what process in the system does the execution of your program or code depend on? Does anything need to happen before the program is executed? No, only time elapsing.
Even if you had found an enhancement you wanted to use: if the process containing that enhancement did not run within 90 days, your mails would not be sent, because the enhancement would never be called.
Edit: That being said, if by "enhancement" you mean "building on your existing program" instead of "creating a new one", that is absolutely not the right terminology for enhancement in the SAP universe.
I would extend the functionality of your existing program, since you already compute how many days are left, and you would have only one job to maintain.

SSRS Data-Driven Subscription [based on static Subscription table] Not Picking Up Changes Made to Subscription Table

I have a .RDL report which I designed in BIDS and have deployed to my report server. The report asks for three parameters before viewing: Year, Month and Customer ID. The report works great and does exactly what it is supposed to.
While I used to run each report individually because there were 2-3 customers, there are now 30+ customers who receive the report, so I wanted to switch to a more automated fulfillment method. After doing some research, it appears that using Report Manager to create a "Data Driven Subscription" (DDS) with the "Windows File Share" option gives me the capabilities I need.
As part of creating the DDS, I created a table called [Subscription], containing one row for each customer receiving the report, with the following columns:
Year
Month
CustomerID
FileName
FileLocation
Overwrite
Format
...so through the DDS Wizard in Report Manager, I was able to successfully set up a Data Driven Subscription (linked to various columns in the [Subscription] table) which creates a new report for each customer in the [Subscription] table, saves it (overwriting, if necessary) as a PDF in the location specified per row in [Subscription].[FileLocation], and runs every minute (I plan on changing the frequency to once a week, eventually).
This works flawlessly, giving me a new set of 30 reports in the directory of my choosing, with each report having a name I assigned in the FileName column of my table. Exactly what I was looking for.
HERE'S THE PROBLEM: When I update the FileLocation or FileName (or anything, really) in the [Subscription] table, the subscription doesn't pick up the changes right away. Sometimes it doesn't pick them up at all. For example, I updated the [FileName] column for one customer from Report_711622 to SpecialReport_711622, so that the output file for that customer should be named SpecialReport_711622 while all of the other reports remain Report_XXXXX (no Special prefix). But the file name of the report for customer 711622 stays the same!
It's almost as if the job only sees what it needs to do once a day, and does not go back and reference the [Subscription] table until I leave for the night; when I come back in the morning, it has picked up the change.
Since I am about to scale this process out to a large customer base using a different report, I need to be able to make edits to the [Subscription] table and have them picked up by the Data Driven Subscription immediately (or at least at a fixed, adjustable interval, so that I know 100% when the change will be picked up).
Does anyone know what's causing the lag? How do I change it so that updates to the Subscription table get picked up regularly? I'm also having issues creating new DDSs on other reports (following the exact process outlined above): I've created the subscriptions to run every minute, and it says they are running, the number of outputs matches the number of customers, and there are 0 errors - but there are no files in the drive I specified (or anywhere else I've looked, for that matter).
Any help would be greatly appreciated!
I think the answer lies in the mechanism SSRS uses. There are a few places "lag" can occur.
The subscription is in fact an SQL Agent job which creates a record in the Event table. This table is a queue that SSRS checks to do scheduled tasks.
There is a small amount of time between the moment the subscription creates the Event record and the moment SSRS reads it and starts creating the dataset for your DDS. Creating the DDS dataset takes some time, too. During this time, the subscription is in the Pending state. If you change anything in the data during this window, the subscription will still use the old data as report parameters, so you will not notice your change until the next scheduled run.
Which brings me to the following: if a subscription is still being run when the next schedule kicks in (chances are, because yours runs every minute), the engine will not execute it but will wait for the next subscription schedule, and so on. So that's another possible source of lag - and a cause of missing reports for a given scheduled minute. The subscription processes reports sequentially, one row from your DDS recordset at a time; again, this takes some time. You can also see that in the subscription window when it says: # of # processed.
I suggest you look at the Event table in the ReportServer database during an execution. The ExecutionLog views (there are 3) may also be interesting. A scheduled run shows up as RequestType = 1 and generates one record for each report. You can see the exact timing and parameters of each report run in the subscription. You may be able to extract the data you need to resolve your other issues.
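For example, a starting point (this assumes the standard ExecutionLog3 view, present in SSRS 2008 R2 and later, where RequestType appears as text rather than the numeric code mentioned above):

-- recent subscription-driven executions, newest first
SELECT ItemPath, TimeStart, TimeEnd, Status, Parameters
FROM ReportServer.dbo.ExecutionLog3
WHERE RequestType = 'Subscription'
ORDER BY TimeStart DESC;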
EDIT: Here are more elaborate guides to DDS data and events:
http://blogs.msdn.com/b/deanka/archive/2009/01/13/diagnosing-and-troubleshooting-subscriptions.aspx
http://blogs.msdn.com/b/deanka/archive/2010/02/16/troubleshooting-subscriptions-part-ii-using-the-report-services-trace-log-file.aspx
Could this "Double-Hop" problem be the source of my issues? I'm so stuck on this one!
The Double-Hop Problem - MSDN Knowledgecast

How to handle recurring calendar events and tasks (SQL Server tables & C#)

I need to schedule events, tasks, appointments, etc. in my DB. Some of them will be one-time appointments, and some will be recurring "to-dos" which must be checked off. After looking at Google Calendar's layout and others, plus doing a lot of reading, here is what I have so far.
Calendar table (could be called a Schedule table, I guess): basic event title, start/end, recurrence info.
Calendar occurrence table: ties to schedule table, occurrence specific text, next occurrence date / time????
I looked at how SQL Server does its jobs here: http://technet.microsoft.com/en-us/library/ms178644.aspx
but this is slightly different.
Why two tables? I need to track the status of each instance of a recurring task. Otherwise this would be much simpler...
so... on to the questions:
1) Does this seem like the proper way to go about it? Is there a better way to handle the multiple occurrence issue?
2) How often / how should I trigger creation of the occurrences? I really don't want to create a bunch of occurrences... BUT... What if the user wants to view next year's calendar...
It makes sense to have your schedule definition for a task in one table and then a separate table to record each instance of it separately - that's the approach I've taken in the past.
And with regard to creating the occurrences, there's probably no need to create them all up front, especially when you consider tasks that repeat indefinitely! Again, the approach I've used in the past is to only create the next occurrence. When that instance is actioned, the next instance is then calculated and created.
This leaves the issue of viewing future occurrences. For this, you can start with the initial/next scheduled occurrence and just calculate the further occurrences on the fly at display time.
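A sketch of that two-table shape (SQL Server; the names and the recurrence encoding are illustrative, not prescriptive):

CREATE TABLE Schedules (
    ScheduleId int IDENTITY PRIMARY KEY,
    Title      varchar(100) NOT NULL,
    StartsOn   datetime2    NOT NULL,
    Recurrence varchar(50)  NULL  -- e.g. 'WEEKLY;MON'; NULL = one-time
);

CREATE TABLE Occurrences (
    OccurrenceId int IDENTITY PRIMARY KEY,
    ScheduleId   int NOT NULL REFERENCES Schedules (ScheduleId),
    DueAt        datetime2   NOT NULL,
    Status       varchar(10) NOT NULL DEFAULT 'pending' -- pending/done/skipped
);

-- only the next occurrence is materialized; when it is actioned, the
-- application computes the following DueAt from Recurrence and inserts it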
While this isn't an exact answer to your question, I've solved this problem before in SQL Server (though the database is irrelevant here) by modeling a solution based on Unix's cron.
Instead of string parsing, we used integer columns in a table to store the various time units.
We had events which could be scheduled; they could point either to a one-time schedule table representing a distinct point in time (a date/time), or to the recurring schedule table, which is modelled after cron.
Additionally, remember to model your solution correctly: an event has a duration, but the duration is unrelated to the schedule (though an event's duration may impact the schedule by causing conflicts). Do not try to model duration as part of your schedule.
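A sketch of the cron-style recurring schedule table (one integer column per time unit, with NULL standing in for cron's '*'; names are illustrative):

CREATE TABLE RecurringSchedules (
    ScheduleId int IDENTITY PRIMARY KEY,
    [Minute]   tinyint NULL, -- 0-59, NULL = every minute
    [Hour]     tinyint NULL, -- 0-23
    DayOfMonth tinyint NULL, -- 1-31
    [Month]    tinyint NULL, -- 1-12
    DayOfWeek  tinyint NULL  -- 0-6, Sunday = 0
);

-- e.g. "08:00 on the first day of every month":
INSERT INTO RecurringSchedules ([Minute], [Hour], DayOfMonth, [Month], DayOfWeek)
VALUES (0, 8, 1, NULL, NULL);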
In the past when we've done this, we had 2 tables:
1) Schedules -> Includes recurrence information
2) Exceptions -> Edit/changes to specific instances
Using SQL, it's possible to get the list of Schedules that have at least one instance in a given date range. Then you can expand in the GUI where each instance lies.
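For instance, if the Schedules table carries first/last-occurrence bounds, the range query could look like this (column names are assumptions):

-- schedules that could have at least one instance between @from and @to
SELECT ScheduleId, Title
FROM Schedules
WHERE StartsOn < @to
  AND (EndsOn IS NULL OR EndsOn >= @from);
-- the individual instances inside the range are then expanded in the GUI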