TMSL command to process big tabular cube - ssas

Currently spending a lot of time processing the cube which increases with every new wave of data.
The cube processing job is a sequence of steps containing ANALYSISCOMMANDs:
1 refresh type: full for all Dimensional tables.
2 refresh type: clearValues for Fact table 1
3 refresh type: dataOnly for Fact table 1
4 refresh type: calculate database (without table indication)
...
22 refresh type: clearValues for Fact table N
23 refresh type: dataOnly for Fact table N
24 refresh type: calculate database (without table indication)
One of the considerations was to split processing into steps so when the processing fails with "out of memory", we could start from the previous step.
Current sequence does not seem very efficient and I am looking how to decrease processing time.
Specifically having "refresh type: calculate database (without table indication)" coming each time after Fact table block makes me wonder:
can I either indicate a relevant Fact table here or get rid of those, leaving one calculate step on database level at the very end?

Calculate database (or recalc) is only needed once and it needs to be the last operation you do. It's only needed once, because it's a database (model) level command.
More info here: http://bifuture.blogspot.com/2017/02/ssas-processing-tabular-model.html

Related

Reducing database load from consecutive queries

I have an application which calls the database multiple times to achieve one simple goal.
A little information about this application; In short, the application scrapes data from a webpage & stores specific information from this page into a database. The important information in this query is: Player name, Position. There can be multiple sitting at one specific position, kill points & Class
Player name has every potential to change or remain the same every day
Regarding the Position, there can be multiple sitting in one position
Kill points has the potential to increase or remain the same every day
Class, there is only 2 possibilities that a name can be, Ex: A can change to B or remain A (same in reverse), but cannot be C,D,E,F
The player name can change at any particular day, Position can also change dependent on the kill point increase from the last update which spins back around to the goal. This is to search the database day by day, from the current date to as far back as 2021-02-22 starting at the most recent entry for a player name and back track to the previous day to check if that player name is still the same or has changed.
What is being used as a main reference to the change is the kill points. As the days go on, this number will either be the exact same or increase, it can never decrease.
So now onto the implementation of this application.
The first query which runs finds the most recent entry for the player name
SELECT TOP(1) * FROM [changes] WHERE [CharacterName]=#charname AND [Territory]=#territory AND [Archived]=0 ORDER BY [Recorded] DESC
Then continue to check the previous days entries with the following query:
SELECT TOP(1) * FROM [changes] WHERE [Territory]=#territory AND [CharacterName]=#charname AND [Recorded]=#searchdate AND ([Class] LIKE '%{Class}%' OR [Class] LIKE '%{GetOpposite(Class)}%' AND [Archived]=0 )
If no results are found, will then proceed to find an alternative name with the following query:
SELECT TOP(5) * FROM [changes] WHERE [Kills] <= #kills AND [Recorded]='{Data.Recorded.AddDays(-1):yyyy-MM-dd}' AND [Territory]=#territory AND [Mode]=#mode AND ([Class] LIKE #original OR [Class] LIKE #opposite) AND [Archived]=0 ORDER BY [Kills] DESC
The aim of the query above is to get the top 5 entries that are the closest possible matches & Then cross references with the day ahead
SELECT COUNT(*) FROM [changes] WHERE [CharacterName]=#CharacterName AND [Territory]=#Territory AND [Recorded]=#SearchedDate AND [Archived]=0
So with checking the day ahead, if the character name is not found in the day ahead, then this is considered to be the old player name for this specific character, else after searching all 5 of the results and they are all found to be present in the day aheads searches, then this name is considered to be new to the table.
Now with the date this application started to run up to today's date which is over 400 individual queries on the database to achieve one goal.
It is also worth a noting that this table grows by 14,400 - 14,500 Rows each and every day.
The overall question to this specific? Is it possible to bring all these queries into less calls onto the database, reduce queries & improve performance?
What you can do to improve performance will be based on what parts of the application stack you can manipulate. Things to try:
Store Less Data - Database content retrieval speed is largely based on how well the database is ordered/normalized and just how much data needs to be searched for each query. Managing a cache of prior scraped pages and only storing data when there's been a change between the current scrape and the last one would guarantee less redundant requests to the db.
Separate specific classes of data - Separating data into dedicated tables would allow you to query a specific table for a specific character, etc... effectively removing one where clause.
Reduce time between queries - Less incoming concurrent requests means less resource contention and faster response times to prior requests.
Use another data structure - The only reason you're using top() is because you need data ordered in some specific way (most-recent, etc...). If you just used a code data structure that keeps the data ordered and still easily-query-able you could then perhaps offload some sql requests to this structure instead of the db.
The suggestions above are not exhaustive, but what you do to improve performance is largely a function of what in the application stack you have the ability to modify.

report scheduler system design using database as master

Problem
we have ~50k scheduled financial reports that we periodically deliver to clients via email
reports have their own delivery frequency (date&time format - as configured by clients)
weekly
daily
hourly
weekdays only
etc.
Current architecture
we have a table called report_metadata that holds report information
report_id
report_name
report_type
report_details
next_run_time
last_run_time
etc...
every week, all 6 instances of our scheduler service poll the report_metadata database, extract metadata for all reports that are to be delivered in the following week, and puts them in a timed-queue in-memory.
Only in the master/leader instance (which is one of the 6 instances):
data in the timed-queue is popped at the appropriate time
processed
a few API calls are made to get a fully-complete and current/up-to-date report
and the report is emailed to clients
the other 5 instances do nothing - they simply exist for redundancy
Proposed architecture
Numbers:
db can handle up to 1000 concurrent connections - which is good enough
total existing report number (~50k) is unlikely to get much larger in the near/distant future
Solution:
instead of polling the report_metadata db every week and storing data in a timed-queue in-memory, all 6 instances will poll the report_metadata db every 60 seconds (with a 10 s offset for each instance)
on average the scheduler will attempt to pick up work every 10 seconds
data for any single report whose next_run_time is in the past is extracted, the table row is locked, and the report is processed/delivered to clients by that specific instance
after the report is successfully processed, table row is unlocked and the next_run_time, last_run_time, etc for the report is updated
In general, the database serves as the master, individual instances of the process can work independently and the database ensures they do not overlap.
It would help if you could let me know if the proposed architecture is:
a good/correct solution
which table columns can/should be indexed
any other considerations
I have worked on a differt kind of sceduler for a program that reported analyses on a specific moment of the month/week and what I did was combining the reports to so called business cycle based time moments. these moments are on the "start of a new week", "start of the month", "start/end of a D/W/M/Q/Y'. So I standardised the moments of sending the reports and added the id's to a table that would carry the details of the report. - now you add thinks to the cycle of you remove it when needed, you could do this by adding a tag like(EOD(end of day)/EOM (End of month) SOW (Start of week) ect, ect, ect,).
So you could index the moments of when the clients want to receive the reports and build on that track. Hope that this comment can help you with your challenge.
It seems good to simply query that metadata table by all 6 instances to check which is the next report to process as you are suggesting.
It seems odd though to have a staggered approach with a check once every 60 seconds offset by 10 seconds for your servers. You have 6 servers now but that may change. Also I don't understand the "locking" you are suggesting, why now simply set a flag on the row such as [State] = "processing", then the next scheduler knows to skip that row and move on to the next available one. Once a run is processed, you can simply update a [Date_last_processed] column, or maybe something like [last_cycle_complete] = 'YES'.
Alternatively you could have one server-process to go through the table, and for each available row, sends it off to one of the instances, in a round-robin fashion (or keep track of who is busy and who isn't).

How do I create a backup for a table which will be used for a full-refresh?

I have an incremental model A where each day is calculated using the previous day's value. Running a full-refresh means that this table needs to be calculated since the beginning of time which is very inefficient and takes too long.
I have tried to create a backup table which will take a copy of the table's value each month, and have model A refer to the backup table during a full-refresh so that the values only after the backup need to be recalculated and I can arrive at today's value much quicker. However this gives me an error:
Encountered an error:
Found a cycle: model.model_A --> model.backup --> model.model_A
This is because the backup refers to the model to get the value each month, while model A also refers to the backup to build off in the case of a full-refresh.
Is there a way around this problem, avoiding rebuilding the entire model from the beginning of time every time I do a full-refresh?
Yes, you can't have 'circular loops' or cycles in your build process.
If there is an application that calculates the values for each day, you could perhaps store the new values back in the same source table(s), just adding a 'updated_at' or something similar. If I understand your use case correctly, you could then use this value whenever you need to query only the last day's information.

SSAS: How to handle date dimension when date is null

I'm trying to add a new column to my SSAS cube. The column is a date field, and links to my DimDate table (a Date dimension). This date represents the project completion date.
However.... not all of the projects have a project completion date due to old projects not ever being assigned this value. And this is expected. We don't want to put bogus dates into the field just to get SSAS to work.
When processing the cube, it crashes with:
Errors in the OLAP storage engine: The attribute key cannot be found when
processing: Table: 'dbo_FactMyTable', Column: 'MyDate_id', Value: '0'.
The attribute is 'Date Id'.
I can't disable "missing values" for the entire project because in most cases, this really is an error. How can I disable missing values for this dimension?
Or is there a better way to handle missing dates/values like this?
Small correction - based on your question, you need to change Processing error handling for special Measure Group, not Dimension. You can do it for all dimensions linked to some measure group, but not to specific dimension.
You can process individual measure group with _Table: 'dbo_FactMyTable'_ first with necessary missing value settings, and then - process rest of your cube with default settings.
Main problem here - how to process rest of the cube. You might have sophisticated system which creates processing XMLA scripts dynamically based on data update knowledge (I do it with SSIS); in this case you would not ask this question. Suppose your environment is simpler - you update cube and would like to process it as a whole completely. In such scenario I would sudgest the following workflow:
Process Default all Dimensions (will do initial processing or in structure changes)
Process Update all Dimensions
Process Cube with Unprocess - invalidating it
Process your special measure group
Process Cube with Process Default
This will first update Dimensions, then - clear processing status flag from all measure groups in the Cube. After that you process your measure group with special flags; this set processing status for this MG. And then during Process Default on Cube - only unprocessed MGs will be covered, which excludes your special MG from processing scope.
The answer is a bit complicated, but this article did a great job of explaining it, including screen shots for the SSAS-challenged like me.
http://msbusinessintelligence.blogspot.com/2015/06/handling-null-dates-in-sql-server.html?m=1

Dynamically filtering large query result for presentation in SSRS

We have a system that records data to an SQL Server DB captured from field equipment every minute. This data is used for a number of purposes, one of which is for charting in reports via SSRS.
The issue is that with such a high volume of data, when a report is run for period of for example 3 months, the volume of data returned obviously causes excessive report rendering times.
I've been thinking of finding a way of dynamically reducing the amount of data returned, based on the start and end time periods chosen. Something along the lines of a sliding scale where from the duration between the start and end period, I can apply different levels of filtering so that where larger periods are chosen, more filtering occurs while for smaller periods less or no filtering occurs.
There is still a need to be able to produce higher resolution (as in more data points returned) reports for troubleshooting purposes.
For example:
Scenario 1:
User is executing a report for a period of 3 months. Result set returned by the query is reduced for performance reasons without adversely affecting what information the user wants to see (the chart is still representative of the changes over time).
Scenario 2:
User executes the report for a period of 1 hour, in order to look for potential indicator(s) of problems with field devices while troubleshooting the system. For this short time period, no filtering is applied.
My first thought was to use a modulo operation on the primary key of the data (which is an identity field), whereby the divisor is chosen depending on the difference between the start and end dates.
For example, something like if the difference between the start and end dates for the report execution period is 5 weeks, choose a divisor of 5 and apply a mod to the PK, selecting where the result is equal to zero.
I would love to get feedback as to whether this sounds like a valid approach or whether there is a better way to do this.
Thanks.