Group Calculated Member Aggregates by partly related Dimension - mdx

I'm using Pentaho-CE 4.8 with Saiku Plugin 2.6 which uses Mondrian 3.6.5.
In a Mondrian Schema I defined a virtual Cube with a calculated Member which consists of two Virtual Measures. These Virtual Measures come from two Cubes which have two Dimensions in common. One of the Cubes has a degenerate Dimension which is also used in the virtual Cube.
I want to group the Calculated Member by a Dimension which only one of the Virtual Measures is related to, but I'm failing at this point.
Pseudo Schema:
<Time Dimension>
<Cube 1>
  <Dimension Usage: "Time Dimension">
  <Degenerate Dimension>
  <Measure 1>
</Cube 1>
<Cube 2>
  <Dimension Usage: "Time Dimension">
  <Measure 2>
</Cube 2>
<Virtual Cube>
  <Virtual Measure "Cube 1 Measure 1">
  <Virtual Measure "Cube 2 Measure 2">
  <Virtual Dimension "Time Dimension">
  <Virtual Dimension "Cube 1 Degenerate Dimension">
  <Calculated Member: [Virtual Measure "Cube 1 Measure 1"] / [Virtual Measure "Cube 2 Measure 2"]>
</Virtual Cube>
In Saiku I get results for the virtual and calculated measures as long as I do not use the "Cube 1 Degenerate Dimension". If I use it on Rows/Columns or as a filter, only values for <Virtual Measure "Cube 1 Measure 1"> are shown, since this Measure comes from a cube which has a relation to that dimension.
Is there a way or workaround to also get the calculated member shown for this Dimension? Because in theory Mondrian could do the following:
Get Virtual Measure "Cube 1 Measure 1" grouped by "Cube 1 Degenerate Dimension" and the Time Dimension, and aggregate the values.
Get Virtual Measure "Cube 2 Measure 2" grouped by the Time Dimension only.
Do the calculation (divide "Cube 1 Measure 1" by "Cube 2 Measure 2").

I found a solution for the problem:
Use the ValidMeasure() function like this:
<Calculated Member: [Virtual Measure "Cube 1 Measure 1"] / ValidMeasure([Virtual Measure "Cube 2 Measure 2"])>
The ValidMeasure function tells Mondrian that this measure has non-joining dimensions which can be ignored. Measure 2 is then joined against the other applicable dimensions, a value is obtained, and that value can be used in the calculation.

Related

Metabase evaluate AI runs

I am trying to evaluate user-corrected AI runs using Metabase. In particular, I have a classification problem with around 100 labels and would like to create a dashboard showing how often the predicted and user-corrected labels coincide, per label and per 3-month period. How do you do this in Metabase, or in SQL in general, if e.g. we have a table with id, predLabel and selectedLabel columns?
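A minimal sketch of the aggregation in pandas rather than Metabase or SQL (the createdAt timestamp column and the toy rows are assumptions; only id, predLabel and selectedLabel come from the question). The same logic maps to a SQL GROUP BY on the quarter and the predicted label:

import pandas as pd

# Toy stand-in for the runs table; createdAt is an assumed timestamp column.
runs = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "predLabel": ["a", "b", "a", "c", "a"],
    "selectedLabel": ["a", "b", "c", "c", "a"],
    "createdAt": pd.to_datetime(
        ["2024-01-05", "2024-02-10", "2024-04-01", "2024-05-20", "2024-06-02"]),
})

runs["match"] = runs["predLabel"] == runs["selectedLabel"]
runs["quarter"] = runs["createdAt"].dt.to_period("Q")

# Share of runs where the prediction and the user correction coincide,
# per predicted label and per 3-month period.
agreement = (runs.groupby(["quarter", "predLabel"])["match"]
                 .mean()
                 .reset_index(name="agreementRate"))
print(agreement)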

Data selecting in data science project about predicting house prices

This is my first data science project and I need to select some data. Of course, I know that I cannot just select all the data available because this will result in overfitting. I am currently investigating house prices in the capital of Denmark for the past 10 years, and I wanted to know which type of housing I should select for my data:
Owner-occupied flats and houses (This gives a dataset of 50000 elements)
Or just owner-occupied flats (this gives a dataset of 43000 elements)
So as you can see there are a lot more owner-occupied flats sold in the capital of Denmark. My opinion is that I should select just the owner-occupied flats because then I have the "same" kind of elements in my data and still have 43000 elements.
Also, the taxes involved are a lot higher if you own a house rather than an owner-occupied flat. This might affect the price of the house and skew the data a little bit.
I have seen a few projects where both owner-occupied flats and houses are selected for the data and the conclusion was overfitting, so that is what I am looking to avoid.
This is a classic example of over-fitting due to insufficient data.
Let me explain the selection process to resolve this kind of problem. I will use the example of credit card fraud, then relate it to your question, or to any future prediction problem with class labels.
In the real world, credit card fraud is not that common. So if you look at real data you will find that only about 2% of transactions are fraudulent. If you train a model on such a dataset it will be biased, because the classes are not evenly distributed (i.e. fraud vs. non-fraud transactions; in your case, owner-occupied flats vs. houses). There are 4 ways to tackle this issue.
Let's suppose the dataset has 90 non-fraud data points and 10 fraud data points.
1. Under-sampling the majority class
Here we select just 10 of the 90 non-fraud data points and train the model on a 10:10 split, so the distribution is balanced (in your case, using only 7000 of the 43000 flats). This is not ideal, as we would be throwing away a huge amount of data.
2. Over-sampling the minority class by duplication
Here we duplicate the 10 fraud data points until there are 90, so the distribution is balanced (in your case, duplicating the 7000 house records up to 43000, i.e. equal to the number of flats). While this works, there is a better way.
3. Over-sampling the minority class with SMOTE (recommended)
The Synthetic Minority Over-sampling Technique uses the k-nearest-neighbours algorithm to generate synthetic samples of the minority class (in your case, the housing data). The imbalanced-learn package can be used to implement this; see the sketch after this list.
4. Ensemble method
Here you divide your data into multiple balanced datasets, for example splitting the 90 non-fraud points into 9 sets so that each set has 10 fraud and 10 non-fraud points (in your case, dividing the 43000 flats into batches of 7000). You then train a model on each set separately and use a majority-vote mechanism to predict.
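For option 3, a minimal sketch with imbalanced-learn (the feature matrix here is synthetic, generated only for illustration; with real data you would pass your own X and y):

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

# Synthetic imbalanced data standing in for the flats/houses features.
X, y = make_classification(n_samples=5000, n_features=10,
                           weights=[0.86, 0.14], random_state=0)
print("before:", Counter(y))

# SMOTE generates synthetic minority-class samples from k nearest neighbours.
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print("after:", Counter(y_res))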
So now I have created the following diagram. The green line shows the price per square meter of owner-occupied flats and the red line shows the price per square meter of houses (all prices in DKK). I was wondering whether this counts as imbalanced classification. The maximum deviation between the prices is at most 10% (see for example 2018). Is 10% enough to say that the data is biased and therefore imbalanced?

SSAS Calculated Members' equivalent in Tabular model

I am looking for a way to implement the same idea as SSAS Multidimensional model calculated members in SSAS Tabular model.
In the Internet Sales sample database, let's consider the Location dimension and the Sales fact table. I am looking for a way to generate the following report:
State        Sales
NY           $100
NJ           $20
CA           $120
WA           $80
East Coast   $120
West Coast   $200
I could achieve this in the Multidimensional model using a calculated member on the Location dimension (State column), but I don't know how to do it in SSAS Tabular.
Can this be done in SSAS Tabular model?
Thanks

How to generate data that fits the normal distribution within each class?

Using numpy, I need to produce training and test data for a machine learning problem. The model is able to predict three different classes (X,Y,Z). The classes represent the types of patients in multiple clinical trials, and the model should be able to predict the type of patient based on data gathered about the patient (such as blood analysis and blood pressure, previous history etc.)
From a previous study we know that, in total, the classes are represented with the following distribution, in terms of a percentage of the total patient count per trial:
X - u=7.2, s=5.3
Y - u=83.7, s=15.2
Z - u=9.1, s=2.3
The u/s describe the distribution N(u, s) for each class (so, over all trials studied, class X had mean 7.2 and standard deviation 5.3). Unfortunately the data set for the study is not available.
How can I recreate a dataset that follows the same distribution over all classes, and within each class, subject to the constraint X+Y+Z=100 for each record?
It is easy to generate a dataset that follows the overall distribution (the u values), but how do I get a dataset that has the same distribution within each class?
The problem you have stated is to sample from a mixture distribution. A mixture distribution is just a number of component distributions, each with a weight, such that the weights are nonnegative and sum to 1. Your mixture has 3 components. Each is a Gaussian distribution with the mean and sd you gave. It is reasonable to assume the mixing weights are the proportion of each class in the population. To sample from a mixture, first select a component using the weights as probabilities for a discrete distribution. Then sample from the component. I assume you know how to sample from a Gaussian distribution.
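A minimal NumPy sketch of that recipe (the mixing weights below are an assumption: the class means from the question, normalised to sum to 1; nothing in the study specifies them):

import numpy as np

rng = np.random.default_rng(42)

# (mean, standard deviation) of the per-trial percentage for each class,
# taken from the question.
labels = np.array(["X", "Y", "Z"])
means  = np.array([7.2, 83.7, 9.1])
sds    = np.array([5.3, 15.2, 2.3])

# Assumed mixing weights: the average share of each class, normalised.
weights = means / means.sum()

n = 10_000
# Step 1: pick a component for every draw according to the weights.
comp = rng.choice(len(labels), size=n, p=weights)
# Step 2: sample from the chosen component's Gaussian.
values = rng.normal(means[comp], sds[comp])

for i, lbl in enumerate(labels):
    sel = values[comp == i]
    print(lbl, len(sel), round(sel.mean(), 1), round(sel.std(), 1))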

Why is the within-cluster sum of squares bigger in the next iteration than in the previous one when K-means is applied to SURF features?

I am using the K-Means algorithm to cluster SIFT vector features.
My K-Means skeleton is:
choose first K points of the data as initial centers
do {
    // assign each point to the corresponding cluster
    data assignment;
    // recalculate the center of each cluster:
    // each point is multi-dimensional data, e.g. (x1, x2, x3, ...),
    // and the center is the per-dimension average,
    // e.g. ((x1 + y1 + ...)/n, (x2 + y2 + ...)/n, ...)
    new centroid;
    sum total memberShipChanged;
} while (some point still changes its membership, i.e. total memberShipChanged != 0)
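For reference, a minimal NumPy version of this skeleton (an illustration only, not the code used above; it builds a dense distance matrix, so it is only practical for small inputs, and an empty cluster simply keeps its previous center here):

import numpy as np

def kmeans(data, k, max_iter=100):
    data = np.asarray(data, dtype=float)
    centers = data[:k].copy()          # first K points as initial centers
    labels = np.zeros(len(data), dtype=int)
    for _ in range(max_iter):
        # Data assignment: squared distance of every point to every center.
        dists = ((data[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        new_labels = dists.argmin(axis=1)
        changed = int((new_labels != labels).sum())
        labels = new_labels
        # Within-cluster sum of squares after the assignment step.
        wcss = dists[np.arange(len(data)), labels].sum()
        print("sum of square:", wcss, " memberShipChanged:", changed)
        # New centroid: per-dimension mean of the points in each cluster.
        for c in range(k):
            members = data[labels == c]
            if len(members):
                centers[c] = members.mean(axis=0)
        if changed == 0:
            break
    return centers, labels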
We all know that K-Means aims to minimize the within-cluster sum of squares.
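In standard notation, with \mu_j denoting the centroid of cluster C_j, the objective is:

\mathrm{WCSS} = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2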
We can use a do-while iteration to reach this target. Now I will show why the within-cluster sum of squares gets smaller after every iteration.
Proof:
For simplicity, I only consider 2 iterations.
After the data assignment step, every descriptor vector is assigned to its new nearest cluster center, so the within-cluster sum of squares cannot increase in this step. After all, the within-cluster sum of squares is the sum of the distances of every vector to its assigned center; if every vector chooses its own nearest center, the sum can only decrease or stay the same.
In the new-centroid step, I use the arithmetic mean to calculate the new center vector, so the local sum within each cluster cannot increase either (the mean minimizes the sum of squared distances to the cluster's points).
So the within-cluster sum of squares decreases (or stays the same) twice in one iteration. After several iterations no descriptor vector changes its membership any more, and the within-cluster sum of squares reaches a local minimum.
===============================================================================
Now my question:
My SURF data is derived from 3000 images, every descriptor vector is 128-dimensional, and there are 1,296,672 vectors in total. In my code I print:
1) the number of vectors in each cluster
2) the total memberShipChanged in one iteration
3) the within-cluster sum of squares before the iteration
Here is the output:
sum of square : 8246977014860
90504 228516 429755 266828 1653711 398631 193081 240072
memberShipChanged : 3501098
sum of square : 4462579627000
244521 284626 448700 228211 1361902 303864 317464 311810
memberShipChanged : 975442
sum of square : 4561378972772
323746 457785 388988 228431 993328 304606 473668 330546
memberShipChanged : 828709
sum of square : 4678353976030
359537 480818 346767 222646 789858 332876 612672 355924
memberShipChanged : 563256
......
I have only listed the output of 4 iterations. From the output we can see that after the first iteration the within-cluster sum of squares really does decrease, from 8246977014860 to 4462579627000. But the following iterations are nearly useless in minimizing it (it even increases), although we can still observe that memberShipChanged is converging. I don't know why this happens. I think the first k-means iteration is overwhelmingly important.
Besides, what should I set as the new center coordinates of an empty cluster while memberShipChanged has not yet converged to 0? Currently I use (0, 0, 0, 0, 0, ...). But is this correct? Perhaps the within-cluster sum of squares increases because of it.