Pentaho Workbench: Return MDX Query Result in a Tabular Format

I am using Pentaho Workbench version 3.14.0.0-12 on Mac.
When I run MDX queries, the results are printed axis by axis. Here is an example:
Axis #0:
{[OrderDate.Days].[2021]}
Axis #1:
{[Measures].[Profit]}
Axis #2:
{[Product.ProductCategory].[Bikes]}
{[Product.ProductCategory].[Accessories]}
{[Product.ProductCategory].[Clothing]}
{[Product.ProductCategory].[Components]}
Row #0: £3,183,670.33
Row #1: £217,753.85
Row #2: £69,062.19
Row #3:
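For reference, the query behind this output would look roughly like the following (the cube name [Sales] is an assumption; the slicer is Axis #0, the measure Axis #1, and the product categories Axis #2):
-- hypothetical reconstruction of the query that produced the listing above
select
{ [Measures].[Profit] } on columns,
{ [Product.ProductCategory].[Bikes],
  [Product.ProductCategory].[Accessories],
  [Product.ProductCategory].[Clothing],
  [Product.ProductCategory].[Components] } on rows
from [Sales]
where ( [OrderDate.Days].[2021] )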
Is it possible to force a tabular format for the MDX query results?

Related

HANA Studio - Calculation view Calculated column not being aggregated correctly

I am running into a problem when trying to aggregate (sum) a calculated column that was created in an Aggregation node of another Calculation view.
Calculation view: TEST2
Projection 1 (plain projection of another query)
(screenshot: Projection 1)
Aggregation 1: sum Amount_LC by HKONT and Unique_document_identifier. In this aggregation, a calculated column Clearing_sum is created with the following formula:
(screenshot: Aggregation 1, showing the Clearing_sum formula)
[Question 1] The result of this calculation in the Raw Data preview makes sense to me, but the result in the Analysis tab seems incorrect. What causes the different output between Analysis and Raw Data?
(screenshot: Raw Data result)
(screenshot: Analysis result)
I thought it might be that, instead of summing up, the Analysis tab re-evaluates the Clearing_sum formula, since it is defined in the same node.
So I created a new Calculation view (TEST3) with a projection on TEST2 (all columns included) and checked the output. I still get the same result (correct Raw Data but incorrect Analysis).
(screenshot: TEST3)
(screenshot: TEST3 Analysis result)
[Question 2] How can I get my desired result? (E.g. the sum of Clearing_sum for the highlighted row should be 2, according to the Raw Data tab.) I also tried enabling client-side aggregation on the calculated column, but it did not help.
Without the actual models (and not just screenshots) it is hard to tell what the cause of the problem is.
One possible cause is that removing HKONT from the query changed the grouping level of the underlying view that computes SUM(Amount_LC); this in turn affects the calculation of Clearing_sum.
A way to avoid this is to instruct HANA not to strip those unreferenced columns and not to change the grouping level. To do that, the KEEP FLAG needs to be set for the columns that should stay part of the grouping.
For a more detailed explanation of this flag, check the documentation and/or blog posts like Usage of “Keep Flag”.
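As a plain-SQL analogy of what changing the grouping level means (the table name FI_ITEMS is made up; the real models are graphical calculation views):
-- grouping as modelled: one row per (HKONT, Unique_document_identifier);
-- Clearing_sum is derived from this SUM
SELECT HKONT, Unique_document_identifier, SUM(Amount_LC) AS Amount_LC
FROM FI_ITEMS
GROUP BY HKONT, Unique_document_identifier;

-- if a client query does not reference HKONT, the engine may aggregate at
-- this coarser level instead, so the SUM (and anything calculated from it)
-- is evaluated over different groups
SELECT Unique_document_identifier, SUM(Amount_LC) AS Amount_LC
FROM FI_ITEMS
GROUP BY Unique_document_identifier;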

How many axes can we use in MDX, practically?

I have heard that there are around 128 axes in MDX.
AXIS(0) or simply 0 – Columns
AXIS(1) or simply 1 – Rows
AXIS(2) or simply 2 – Pages
AXIS(3) or simply 3 – Sections
… and so on.
So far I have used only two of them, Columns (0) & Rows (1).
I am just curious about how, where, and when or why I can use the other MDX axes.
As far as I know, SSMS only supports two axes.
Thanks.
How:
select ... on 0, ... on 1, ... on 2 and so on ... from [cube]
Where:
Any client that will not crash on an unexpected result format ;-)
When / why:
A client could take advantage of several axes to render the result in 3D using 3 axes. Even if the client does not render the result in 3D, it might be useful to ask the server to return the result split over 3 axes for ad-hoc (or easier) processing.
I do not know of any standard client that supports this.
But a typical application comes to mind: some years ago (before I was working with Analysis Services), we had a client requiring one and the same report for ten countries and five markets on fifty PowerPoint slides. If we had used Analysis Services at that time, we might have written a custom client application that uses a four-dimensional report and thus gets the data for all fifty PowerPoint slides with a single MDX query.
You need not think of OLAP dimensions as dimensions in space. You can also think of them (as the axis aliases suggest) as, e.g., pages and chapters.
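For illustration, a three-axis query could look like this (cube and member names loosely follow the Adventure Works sample and are only placeholders):
-- measures on axis 0, years on axis 1, countries on axis 2;
-- most client tools will only render the first two axes
select
{ [Measures].[Sales Amount] } on 0,
{ [Date].[Calendar Year].members } on 1,
{ [Geography].[Country].members } on 2
from [Adventure Works]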

Optimizing R code for ETL

I have both an R script and a Pentaho (PDI) ETL transformation for loading data from a SQL database and performing a calculation. The initial data set has 1.28 million rows of 21 variables and is equivalent in both R and PDI. In fact, I originally wrote the R code and then subsequently "ported" to a transformation in PDI.
The PDI transformation runs in 30s (and includes an additional step of writing the output to a separate DB table). The R script takes between 45m and one hour total. I realize that R is a scripting language and thus interpreted, but it seems like I'm missing some optimization opportunities here.
Here's an outline of the code:
Read data from a SQL DB into a data frame using sqlQuery() from the RODBC package (~45s)
str_trim() two of the columns (~2 - 4s)
split() the data into partitions to prepare for performing a quantitative calculation (separate function) (~30m)
run the calculation function in parallel for each partition of the data using parLapply() (~15-20m)
rbind the results together into a single resulting data frame (~10 - 15m)
I've tried using ddply() instead of split(), parLapply() and rbind(), but it ran for several hours (>3) without completing. I've also modified the SQL select statement to return an artificial group ID that is the dense rank of the rows based on the unique pairs of two columns, in an effort to increase performance. But it didn't seem to have the desired effect. I've tried using isplit() and foreach() %dopar%, but this also ran for multiple hours with no end.
The PDI transformation is running Java code, which is undoubtedly faster than R in general. But it seems that the equivalent R script should take no more than 10 minutes (i.e. 20X slower than PDI/Java) rather than an hour or longer.
Any thoughts on other optimization techniques?
Update: step 3 above, split(), was resolved by using indexes, as suggested here: Fast alternative to split in R.
Update 2: I tried using mclapply() instead of parLapply(), and it's roughly the same (~25m).
Update 3: rbindlist() instead of rbind() runs in under 2s, which resolves step 5.
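Putting the updates together, a rough sketch of steps 3-5 with synthetic data (the real quantitative calculation and the sqlQuery() load are placeholders) might look like this:
library(data.table)
library(parallel)

# synthetic stand-in for the 1.28M-row data frame loaded via sqlQuery()
set.seed(1)
dat <- data.table(grp_a = sample(letters, 1e5, TRUE),
                  grp_b = sample(1:50, 1e5, TRUE),
                  value = rnorm(1e5))

# step 3: build partition indexes instead of split()ing the data itself
idx <- split(seq_len(nrow(dat)), list(dat$grp_a, dat$grp_b), drop = TRUE)

# placeholder for the real quantitative calculation on one partition
calc <- function(rows, d)
  d[rows, .(grp_a = grp_a[1], grp_b = grp_b[1], mean_value = mean(value))]

# step 4: run the calculation in parallel over the partitions
cl <- makeCluster(max(1, detectCores() - 1))
clusterEvalQ(cl, library(data.table))
parts <- parLapply(cl, idx, calc, d = dat)
stopCluster(cl)

# step 5: rbindlist() instead of repeated rbind()
result <- rbindlist(parts)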

Grouping of Million Data Points slow

I have a simple table containing two float columns representing X and Y coordinates. There is a non-clustered index on each of those two columns. The table holds about 5 million data points, which I want to group into a custom grid using SQL like this:
SELECT COUNT(X) Count, AVG(X) CenterX, AVG(Y) CenterY
FROM DataPoints
GROUP BY FLOOR(X / 5), FLOOR(Y / 5)
In a test case I split a data set of 815,000 points into a grid where each point gets its own grid cell. SQL Server 2012 took 26,000 milliseconds to return the results, which is definitely too long. I made a C# implementation of the same grouping using LINQ on a simple point array, and there it took only 3,450 ms! I also created a stored procedure from the SQL for some speed-up, but it still takes 26-30 seconds to calculate the grid cells.
I can't understand why it takes SQL Server that long to calculate those groups. I know it may take a while to compute the grid cell index for all 815,000 points, but 7 times longer than a simple C# program can't be a realistic result.
I also tried using spatial types to calculate the grid, but those solutions are even slower. Using a geometry column and a spatial index (GEOMETRY_AUTO_GRID), the built-in sp_help_spatial_geometry_histogram needs 2:40 min to calculate 4 grid cells containing the data.
Does anybody have an idea how to speed up such a simple SQL query? In the future this data will be sent to a map in the browser and there will be a lot of requests, so <100 ms would be the ultimate goal.
What does the execution plan tell you? Why is this slow?
I suggest you put a nonclustered index on X and Y together (one composite index, not two separate ones). Is the result better?
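A sketch of that suggestion, plus an optional persisted grid-cell column (untested against the original table, and assuming the cell index fits in an int):
-- composite index covering both coordinates, as suggested above
CREATE NONCLUSTERED INDEX IX_DataPoints_XY ON DataPoints (X, Y);

-- optionally, persist the grid-cell index so GROUP BY no longer has to
-- evaluate FLOOR(X / 5) per row, and index the persisted columns
ALTER TABLE DataPoints ADD GridX AS CAST(FLOOR(X / 5) AS int) PERSISTED;
ALTER TABLE DataPoints ADD GridY AS CAST(FLOOR(Y / 5) AS int) PERSISTED;
CREATE NONCLUSTERED INDEX IX_DataPoints_Grid
    ON DataPoints (GridX, GridY) INCLUDE (X, Y);

SELECT COUNT(X) AS [Count], AVG(X) AS CenterX, AVG(Y) AS CenterY
FROM DataPoints
GROUP BY GridX, GridY;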

MDX query returns unexpected (null)

I've faced a strange problem with MS SSAS 2008 R2 (10.50.4000.0): two MDX queries which I expect to return the same result behave differently.
This query returns correct numbers:
select
[Measures].[Fact Count] on 0
from
[Cube]
where
[Dimension].[Attribute].&[id]
While this one, which I expect to be equivalent to the first query, returns (null) from time to time (see details below):
select
[Measures].[Fact Count] on 0
from
(
select
[Dimension].[Attribute].&[id] on 0
from
[Cube]
)
Some details
The problem is not persistent. It appears and disappears randomly (!) on different databases on different physical servers.
We are using incremental data import and non-lazy processing. There is no strict correlation between the problem's appearance and data imports, but we are continuing to investigate in this direction.
Adding other members to the axis of the subselect fixes the problem, i.e. {[Dimension].[Attribute].&[id1], [Dimension].[Attribute].&[id2]} on 0 works fine (written out in full after this list).
Several dimensions are affected. All of them have integer keys. The problem appears on both visible and hidden dimension attributes.
Adding an extra dimension to the second axis of the subselect fixes the problem for some pairs of dimensions, i.e. the filter [Dimension1].[Attribute].&[id] on 0 fails, but [Dimension1].[Attribute].&[id] on 0, [Dimension2].[Attribute].&[id] on 1 works.
We have two measure groups with several measures each. All dimensions are related to some (default) measure in the first measure group, but some dimensions are related only to the second measure group. The problem appears only on dimensions of the second type.
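For instance, the workaround from the third point, written out in full (dimension and member names kept generic, as in the question):
-- a second member on the subselect axis makes the (null) disappear
select
[Measures].[Fact Count] on 0
from
(
select
{ [Dimension].[Attribute].&[id1],
  [Dimension].[Attribute].&[id2] } on 0
from
[Cube]
)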
Does anyone have an idea about the reasons for such strange, non-deterministic behavior of MS OLAP?
Thanks.