QuickSight filters in calculated fields

I have recently started using QuickSight's sumOver and avgOver functions, thinking they behaved like LOD expressions in Tableau, that is, that they exclude a specific filter from the calculation at the visualization's level of the data. In Tableau I would traditionally use this to exclude particular filters and give the user a comparison between, say, national and state numbers (when they had filtered for a specific state).
Using the Over functions in QuickSight does not produce the same result. Unless you have only one (possibly two) filters, the Over functions actually compute sums, averages, etc. over partitions. This is not the same as excluding the filter, and it produces incorrect information for the user when multiple filters are in place, because the granularity of the partitions becomes finer every time you add another dimension to the partitioning. When you then bring this through and visualise it, you end up averaging the averages from the partitions, which yields an unweighted average that is wrong.
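To make the averaging-of-averages problem concrete, here is an analogous plain SQL sketch (not QuickSight syntax; the sales table and its state, owner, and amount columns are hypothetical):
-- Averaging per-partition averages weights every (state, owner) partition equally,
-- regardless of how many rows each partition contains:
SELECT state,
       AVG(avg_amount) AS avg_of_partition_averages   -- unweighted
FROM (
    SELECT state, owner, AVG(amount) AS avg_amount
    FROM sales
    GROUP BY state, owner
) per_partition
GROUP BY state;
-- The correct, row-weighted figure is simply:
SELECT state, AVG(amount) AS avg_amount
FROM sales
GROUP BY state;
The two results only agree when every partition holds the same number of rows, which is exactly the unweighted-average issue described above.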
My question is: does QuickSight have any way to exclude a particular filter using a function in a calculated field, since these Over functions do not truly do that?

Related

Are there downsides to nesting data in BigQuery?

We have data of different dimensions, for example:
Name by Company
Stock prices by Date, Company
Commodity prices by Date & Commodity
Production volumes by Date, Commodity & Company
We're thinking of the best way of storing these in BigQuery. One potential method is to put them all in the same table, and nest the extra dimensions.
That would mean:
Almost all the data would be nested - e.g. there would be a single 'row' for each Company, and then its prices would be nested by Date.
Data would have to share at least one dimension - I don't think there would be a way of representing Commodity prices in a table whose first column was the company's Name
Are there disadvantages? Are there performance implications? Is it sensible to nest 5000 dates + associated values within each company's row?
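For concreteness, a hedged sketch of what that nested layout might look like in BigQuery DDL (the table name, column names, and types are assumptions, not our actual schema):
-- One row per company; dated prices nested as a repeated STRUCT.
CREATE TABLE mydataset.companies (
  name   STRING,
  prices ARRAY<STRUCT<price_date DATE, price FLOAT64>>
);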
It's common to have nested/repeated columns in BigQuery schemas since it makes reasoning about the data easier. Firebase produces schemas with repetition at many levels, for instance. If you flatten everything, the downside is you need some kind of unique ID for each row in order to associate events with each other, and then you'll need aggregations (using the ID as a key) rather than simple filters if you want to do any kind of counting.
As for downsides of nested/repeated schemas, one is that you may find yourself performing complicated transformations of the structure with ARRAY subqueries or STRUCT operators, for instance. These are generally fast, but they do have some overhead relative to queries without any structure imposed on the result at all.
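For illustration, a hedged BigQuery Standard SQL sketch of querying such a nested schema, assuming a hypothetical companies table where prices is a repeated STRUCT of (price_date, price):
-- Average price per company, computed with a correlated subquery over the repeated field.
SELECT
  name,
  (SELECT AVG(p.price) FROM UNNEST(prices) AS p) AS avg_price
FROM mydataset.companies;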
My best suggestion would be to load some data and run some experiments. Storage and querying both are relatively cheap, so you can try a few different schema shapes and see which works better for your purposes.
Updating in BigQuery is pretty new, but based on the publicly available info (BigQuery DML) it is currently limited to only 48 UPDATE statements per table per day.
Quotas
DML statements are significantly more expensive to process than SELECT statements.
Maximum UPDATE/DELETE statements per day per table: 48
Maximum UPDATE/DELETE statements per day per project: 500
Maximum INSERT statements per day per table: 1,000
Maximum INSERT statements per day per project: 10,000
Processing nested data is also very expensive since all of the data from that column is loaded on every query. It is also slow if you are doing a lot of operations on nested data.

Calculating counts over several columns in DB

We have a product backed by a DB (currently Oracle, planning to support MS SQL Server as well) with several dozen tables. For simplicity let's take one table called TASK.
We have a use case when we need to present the user the number of tasks having specific criteria. For example, suppose that among many columns the TASK table has, there are 3 columns suitable for this use case:
PRIORITY - possible values: LOW, MEDIUM, HIGH
OWNER - possible values are users registered in the system (can be 10s)
STATUS - possible values: IDLE, IN_PROCESS, DONE
So we want to show the user exactly how many tasks are LOW, MEDIUM, or HIGH, how many are owned by a specific user, and how many are in each status. Of course, the basic implementation would be to keep these counts up to date on every modification to the TASK table. However, what complicates the matter is that the user can additionally filter the result by criteria that may or may not include some of the columns mentioned above.
For example, the user might want to see those counts only for tasks that are owned by him and were created last month. The number of possible filter combinations is endless here, so needless to say, maintaining up-to-date counts for all of them is impossible.
So the question is: how can this problem be solved without a serious impact on DB performance? Can it be solved solely in the DB, or should we resort to other data stores, such as a sparse data store? It feels like a problem that comes up all over the place in many companies. For example, in the Amazon store you can see the counts on categories even while using arbitrary text search criteria, which means they also calculate them on the spot instead of maintaining them up to date all the time.
One last thing: we can accept a certain functional limitation, saying that the count should be exact up to 100, but starting from 100 it can just say "over 100 tasks". Maybe this mitigation can allow us to emit more efficient SQL queries.
Thank you!
As I understand it, you would like to have info about 3 different distributions: across PRIORITY, OWNER, and STATUS. I suppose the best way to solve this problem is to maintain 3 different data sources (a SQL query, aggregated info in the DB or Redis, etc.).
The simplest way I see to calculate this data is to build a separate SQL query for each distribution. For example, for priority it would be something like:
SELECT PRIORITY, COUNT(*) AS TASK_COUNT
FROM TASK
[WHERE <additional search criteria>]
GROUP BY PRIORITY
Of course, this is not the most efficient approach in terms of database performance, but it keeps the counts up to date.
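If the "over 100" relaxation from the question is acceptable, a hedged sketch of a capped count is shown below: it stops scanning after 101 matching rows instead of counting them all (Oracle 12c+ syntax; on SQL Server the inner query would use SELECT TOP (101) 1 instead of FETCH FIRST).
SELECT COUNT(*) AS capped_count      -- a result of 101 means "over 100 tasks"
FROM (
    SELECT 1
    FROM TASK
    WHERE PRIORITY = 'HIGH'          -- plus whatever filters the user applied
    FETCH FIRST 101 ROWS ONLY
) capped;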
If you would like to store aggregated values, which may significantly decrease database load (it depends on the row count), you probably need to build a cube whose dimensions are the available search criteria. With this approach you may also implement the limitation functionality.

Slow calculated measure with tuple

I have one fact table that has all the information regarding how much a company buys and sells. In order to create some calculations regarding, for example, margin, I need to use the rows for a purchase together with the rows for sales to get the correct results.
Now, I have created a calculated measure that gives me the correct result, but the more dimensions I add to my query, the slower the query runs when using this calculated measure. It seems like it is spending a lot of time returning the tuples I am using to find the purchase rows.
I am using tuples to "store" the purchase row, but the tuple becomes quite large because I need to include all the default members of the dimensions used by the sales rows in order for them to be used. Basically my tuples look like this, just with more dimension hierarchies:
(
[Dimension 1].[Hierarchy 1].&[member]
,[Dimension 1].[Hierarchy 2].&[member]
,[Dimension 2].[Hierarchy 1].&[member]
,[Dimension 3].[Hierarchy 1].&[member]
,[Dimension 4].[Hierarchy 1].&[member]
,[Measures].[Purchase Standard Cost]
)
I then multiply this tuple with a measure from the sales rows and I get my result.
Does anyone have any tips on how to improve the query performance? The calculation works, and if I slice by just a couple of dimensions it works just fine and performance is not too bad, but the more I add the slower it gets, and the users will hit performance issues.
Since the number of dimensions used has increased, the Storage Engine has to scan additional files, which could be the reason for this performance degradation.
I have several suggestions, ordered by their effectiveness from my point of view:
Implement partitioning (if it's not implemented yet) to scan a smaller amount of data.
"Materialize" some tuples into a physical dimension (if there are no dynamic, late-binding functions etc. in the MDX):
2.1. Add corresponding keys, which represent the tuples, to your source tables.
2.2. Build appropriate dimensions on these keys.
2.3. Use calculated measures with these "ex-tuples".
Example:
You have a 100M-row table with columns SomeDate, Customer, Product, Amount, and a single-partition measure group.
You need to create tuples like (2015-01-01, Customer A, Product Z, Amount).
The server has to scan all of the data to get exact values.
Once you add partitions by SomeDate year (+ slices), the server will read only the 2015 partition.
2.1. Add column Tuple_ID int to the table and map it during ETL.
E.g. Tuple_ID = 1 where Customer = 'Customer A' and Product = 'Product Z'
2.2. Create a dimension on this new field (or on an additional table with the list of combinations, to be able to modify the logic easily).
2.3. Use ([Tuple ID].[Tuple ID].&[1],[Measures].[Amount]) in calculation.
The advantage of this technique is that the server reads only pre-calculated values, and queries speed up as a result.
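For illustration, a hedged T-SQL sketch of the step 2.1 mapping during ETL; the fact table name FactSales and its columns are hypothetical:
-- Assign the surrogate tuple key to the rows that make up the tuple.
UPDATE FactSales
SET    Tuple_ID = 1
WHERE  Customer = 'Customer A'
  AND  Product  = 'Product Z';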

SSAS: Update a dimension won't drop aggregations, process index won't rebuild aggregations

I have an 'Employee' dimension which is changed (modified) every day. I made monthly partitions in the cube and only do a ProcessFull on the current month's partition. Lately I found that the past months' aggregations are not dropped. I tried 'ProcessUpdate' on this dimension and 'ProcessIndexes' on the partition, but the result remained the same. I also tried the 'ProcessAffectedObjects' setting and 'ProcessIndexes' again, still the same, and I tried both with LazyProcessing set to true and false, with no luck.
So my question is: how do I drop the stale aggregations for past months and rebuild them explicitly?
It is a distinct count measure, and no aggregations were designed via the wizard.
I tried dropping the indexes using a 'ProcessClearIndexes' XMLA command, and it worked fine; 'ProcessIndexes' did rebuild the indexes and aggregations, as I could see from the SSMS query execution messages.
So might this only be related to the distinct count, just because it is a non-additive measure?
"Non-additive measures create the following problems on a typical OLAP system:
Roll-ups are not possible. When pre-calculating results during cube processing, the system cannot deduce summaries from other summaries. All results must be calculated from the detail data. This situation places a heavy burden in processing time.
All results must be pre-calculated. With non-additive measures, there is no way to deduce the result for a higher-level summary query from one pre-calculated aggregation. Failure to pre-calculate the results in advance means that the results are not available. It is impossible to perform and maintain incremental updates to the system. A single transaction added to the cube usually invalidates huge portions of previously pre-calculated results. In order to recover from this, a complete recalculation is needed."
"Aggregations
As mentioned before, DISTINCT COUNTs are not additive (and this is the main reason why these measures are so problematic). Therefore, the aggregations, which are all derived from additive operators, are completely useless;"
Someone answered my question on MSDN:
https://social.msdn.microsoft.com/Forums/sqlserver/en-US/7302227f-11b8-4adc-98ff-72b6c395775b/ssas-update-a-dimension-wont-drop-aggregation-process-index-wont-rebuild-aggregation?forum=sqlanalysisservices
If you use materialized reference dimensions ensure you do ProcessFull to reprocess the fact tables again fully. The reason is that the join to the intermediate dimension happens in the measure group partition processing query:
http://sqlblog.com/blogs/alberto_ferrari/archive/2009/02/25/ssas-reference-materialized-dimension-might-produce-incorrect-results.aspx

What is the generic name of this technique CouchDB uses to index aggregated data?

CouchDB employs a cool pattern that can be used in a multitude of other scenarios. I'm talking about the persisted B-tree index of map/reduce results. The idea is to precalculate the aggregated data and store it at different levels of the B-tree index. The index can then be used to efficiently query the aggregate without having to reaggregate all the data all the time. Then, if any leaf-level value changes, only the ascending path through the tree has to get recalculated.
For example, if the data is price over time, the index could store the SUM and the COUNT of items at the day, month, and year levels. Then, if anybody wants to query the average price year-to-date, all you have to do is sum up the SUMs and COUNTs for all the full months since the start of the year, plus all the days available in the current month, then divide the total SUM by the total COUNT. If a past price has to change, the change has to propagate through the index, but only the corresponding day's, month's, and year's values have to be updated, and even then the values for the other days and other months within the year can be reused for the calculation.
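For concreteness, a hedged relational sketch of that year-to-date arithmetic; the month_rollup and day_rollup tables (each holding a SUM and COUNT per period) and their columns are illustrative only:
SELECT SUM(total_price) / SUM(item_count) AS ytd_avg_price
FROM (
    SELECT total_price, item_count
    FROM   month_rollup                 -- one row per full month since the start of the year
    WHERE  price_month >= DATE '2016-01-01'
      AND  price_month <  DATE '2016-06-01'
    UNION ALL
    SELECT total_price, item_count
    FROM   day_rollup                   -- one row per day of the current month
    WHERE  price_day >= DATE '2016-06-01'
) ytd;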
What is the generic name of this approach? Does anything similar exist in any of the popular RDBMSes? Any experience with using this in practice?
Materialized view
"A materialized view is a database object that contains the results of a query. They are local copies of data located remotely, or are used to create summary tables based on aggregations of a table's data. Materialized views, which store data based on remote tables, are also known as snapshots."
This is from a Wikipedia article that mainly discusses storing results in the context of an RDBMS.
Personally I prefer the term "indexed view". I actually found that Wikipedia article by searching for "indexed view" on Google.
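As a hedged example of the idea in RDBMS terms (Oracle-style syntax; the prices table and column names are illustrative), a materialized view that pre-aggregates prices by month might look like this:
CREATE MATERIALIZED VIEW monthly_prices AS
SELECT TRUNC(price_date, 'MM') AS price_month,
       SUM(price)              AS total_price,
       COUNT(*)                AS item_count
FROM   prices
GROUP  BY TRUNC(price_date, 'MM');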