SSAS Calculated Members' equivalent in Tabular model - ssas

I am looking for a way to implement the same idea as calculated members from the SSAS Multidimensional model in the SSAS Tabular model.
In the Internet Sales sample database, let's consider the Location dimension and the Sales fact table. I am looking for a way to generate the following report:
State Sales
NY $100
NJ $20
CA $120
WA $80
East Coast $120
West Coast $200
I could achieve this in the Multidimensional model using a calculated member on the Location dimension (State column), but I don't know how to do it in SSAS Tabular.
Can this be done in SSAS Tabular model?
Thanks

Related

Include belonging into model

If you had data like the following (prices and market caps are not real):
Date Stock Close Market-cap GDP
15.4.2010 Apple 7.74 1.03 ...
15.4.2010 VW 50.03 0.8 ...
15.5.2010 Apple 7.80 1.04 ...
15.5.2010 VW 52.04 0.82 ...
where Close is the y you want to predict and Market-cap and GDP are your x-variables, would you also include Stock in your model as another independent variable, since it could be, for example, that price formation works differently for Apple than for VW?
If yes, how would you do it? My idea is to assign 0 to Apple and 1 to VW in the Stock column.
You first need to identify what exactly you are trying to predict. As it stands, you have longitudinal data, i.e. multiple measurements from the same company over a period of time.
Are you trying to predict the close price based on market cap + GDP?
Or are you trying to predict the future close price based on previous close price measurements?
You could stratify based on company name, but it really depends on what you are trying to achieve. What is the question you are trying to answer?
You may also want to take the following considerations into account:
Close prices measured at different times for the same company are correlated with each other.
Correlations between two measurements taken soon after each other will be stronger than correlations between two measurements far apart in time.
There are four assumptions associated with a linear regression model:
Linearity: The relationship between X and the mean of Y is linear.
Homoscedasticity: The variance of residual is the same for any value of X.
Independence: Observations are independent of each other.
Normality: For any fixed value of X, Y is normally distributed.
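The dummy-variable idea from the question can be sketched in plain Python. This is only a minimal illustration of the encoding step; the figures are the made-up ones from the question's table, and the column names (`MarketCap`, `Stock_VW`) are assumptions for the example:

```python
# Hedged sketch: encoding the categorical "Stock" column as a 0/1 dummy
# (indicator) variable, as suggested in the question.  Values are the
# illustrative ones from the question, not real market data.

rows = [
    {"Date": "15.4.2010", "Stock": "Apple", "Close": 7.74,  "MarketCap": 1.03},
    {"Date": "15.4.2010", "Stock": "VW",    "Close": 50.03, "MarketCap": 0.80},
    {"Date": "15.5.2010", "Stock": "Apple", "Close": 7.80,  "MarketCap": 1.04},
    {"Date": "15.5.2010", "Stock": "VW",    "Close": 52.04, "MarketCap": 0.82},
]

# With only two companies a single 0/1 indicator is enough; with more
# companies you would one-hot encode (one indicator column per company).
for r in rows:
    r["Stock_VW"] = 1 if r["Stock"] == "VW" else 0

# Feature matrix X and target y for a regression Close ~ MarketCap + Stock_VW
X = [[r["MarketCap"], r["Stock_VW"]] for r in rows]
y = [r["Close"] for r in rows]
print(X)
```

Whether the dummy belongs in the model at all still depends on the question being answered, as the answer above points out.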

Data selecting in data science project about predicting house prices

This is my first data science project and I need to select some data. Of course, I know that I cannot just select all the data available, because this would result in overfitting. I am currently investigating house prices in the capital of Denmark over the past 10 years, and I wanted to know which type of housing I should select in my data:
Owner-occupied flats and houses (This gives a dataset of 50000 elements)
Or just owner-occupied flats (this gives a dataset of 43000 elements)
So as you can see, far more owner-occupied flats are sold in the capital of Denmark. My opinion is that I should select just the owner-occupied flats, because then I have the "same" kind of elements in my data and still have 43000 of them.
Also, much higher taxes are involved if you own a house rather than an owner-occupied flat. This might affect the price of the house and skew the data a little bit.
I have seen a few projects where both owner-occupied flats and houses are selected for the data and the conclusion was overfitting, so that is what I am looking to avoid.
This is a classic example of over-fitting due to insufficient data.
Let me explain the selection process used to resolve this kind of problem. I will use the example of credit card fraud and then relate it to your question, or to any future prediction problem with classified data.
In the real world, credit card fraud is not that common, so if you look at real data you will find that only about 2% of it is fraud. If you train a model on this dataset, it will be biased, as you don't have a balanced distribution of the classes (i.e. fraud and non-fraud transactions; in your case, owner-occupied flats and houses). There are four ways to tackle this issue.
Suppose the dataset has 90 non-fraud data points and 10 fraud data points.
1. Under-sampling the majority class
Here we select just 10 of the 90 majority data points and train the model 10:10, so the distribution is balanced (in your case, using only 7000 of the 43000 flats). This is not ideal, as we would be throwing away a huge amount of data.
2. Over-sampling the minority class by duplication
Here we duplicate the 10 minority data points until there are 90, so the distribution is balanced (in your case, duplicating the 7000 house records up to 43000, i.e. equal to the flats). While this works, there is a better way.
3. Over-sampling the minority class with SMOTE (recommended)
Synthetic Minority Over-sampling Technique (SMOTE) uses the k-nearest-neighbours algorithm to generate synthetic samples of the minority class, in your case the house data. There is a module named imbalanced-learn which can be used to implement this.
4. Ensemble method
Here you divide your data into multiple balanced datasets, for example splitting the 90 non-fraud points into 9 sets so that each set pairs 10 non-fraud points with the 10 fraud points (in your case, dividing the 43000 flats into batches of 7000). You then train a model on each set separately and use a majority-vote mechanism to predict.
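Methods 1 and 2 can be sketched without any extra libraries (SMOTE itself would need the imbalanced-learn package). The 90/10 toy labels below mirror the fraud example above and are not real data:

```python
import random

# Hedged sketch of under-sampling (method 1) and over-sampling by
# duplication (method 2) on a toy 90/10 imbalanced dataset.
random.seed(0)
majority = [("non-fraud", i) for i in range(90)]
minority = [("fraud", i) for i in range(10)]

# 1. Under-sample the majority class down to the minority size (10:10).
under = random.sample(majority, len(minority)) + minority

# 2. Over-sample the minority class by duplication up to the majority size (90:90).
over = majority + [random.choice(minority) for _ in range(len(majority))]

print(len(under), len(over))  # balanced 20-point and 180-point datasets
```

The same two lines generalise to the flats/houses case by substituting the 43000 flat records for `majority` and the 7000 house records for `minority`.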
So now I have created the following diagram. The green line shows the price per square metre of owner-occupied flats and the red line shows the price per square metre of houses (all prices in DKK). I was wondering: is there imbalanced classification here? The maximum deviation of the prices is at most 10% (see for example 2018). Is 10% enough to say that the data is biased and therefore imbalanced?

Is it possible to use only positive covid record for prediction

I am trying to use COVID-19 data to create a prediction model.
I have very good data for positive patients, which has these parameters:
Fever Tiredness Cough Difficulty-in-Breathing Sore-Throat None_Sympton Pains Nasal-Congestion Runny-Nose Diarrhea None_Experiencing Age_0-9 Age_10-19 Age_20-24 Age_25-59 Age_60+ Gender_Female Gender_Male Gender_Transgender Severity_Mild Severity_Moderate Severity_None Severity_Severe Contact_Dont-Know Contact_No Contact_Yes Country
The problem is that I can't find negative-patient data with similar parameters.
I have one negative-patient dataset, but it has only 4500 records and is also missing age, sex and location information.
What I want to know is: is it somehow possible to use only positive patient data and try to predict COVID probability?
As per my ML understanding, we need balanced data for both classes. But I am curious to know if there is any technique to deal with this situation.
Considering that your data has features strongly corresponding to positive COVID patients, you could take a classification approach with a decision-tree classifier, inputting only the features of the positive patients to check the probability of COVID.
You could also use a naive Bayes classification model, using conditional probability with the same approach.
To check the accuracy of the probabilities, you can use classifier metrics such as log loss to compare the predicted probabilities with the true values.
Hope this is helpful.
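As a rough illustration of the log-loss metric mentioned above, here is a minimal hand-written version (scikit-learn's `log_loss` does the same with more options); the labels and probabilities below are made up:

```python
import math

# Hedged sketch of binary log loss (negative log-likelihood), written out
# directly rather than via a library.
def log_loss(y_true, y_pred, eps=1e-15):
    """Mean negative log-likelihood of binary labels under predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

# Confident correct predictions score low (good); confident wrong ones score high.
print(log_loss([1, 0], [0.9, 0.1]))
print(log_loss([1, 0], [0.1, 0.9]))
```

Lower values are better; a model that is confidently wrong is penalised far more than one that hedges.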

Combine multiple source sets to make a decision

I'm working on a project in which I am using an OCR engine and TensorFlow to identify the vehicle number plate and vehicle model, respectively. I also have a database which contains vehicle information (e.g. owner, number plate, vehicle brand, color, etc.).
Simple flow:
1. Image input
2. Number plate recognition using OCR
3. Vehicle model (e.g. Hyundai, Toyota, Honda, etc.) using TensorFlow
4. Query (2) and (3) in the database to find the owner
Now, the fact is the OCR engine is not 100% accurate; let's consider INDXXXX0007 as the engine's best result.
When I query this result in database I get
Set 1,
Owner1 - INDXXXX0004 (95% match)
Owner2 - INDXXXX0009 (95% match)
In such cases, I use the TensorFlow data to make a decision.
Set 2, where vehicle model shows:
Hyundai (95.00%)
Honda (90.00%)
Here comes my main problem: TensorFlow sometimes gives me false-positive values. For example, the actual vehicle is a Honda but the model shows more confidence for Hyundai (ref. Set 2).
What would be a possible way to avoid such problems, or how can I combine both sets to make a decision?
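One possible way to combine the two sets, assuming the database also stores each candidate owner's registered brand, is to treat the plate-match score and the brand confidence as roughly independent and multiply them per candidate. All names and scores below are hypothetical, taken from Set 1 and Set 2 above:

```python
# Hedged sketch: per-candidate fusion of OCR plate-match score and
# TensorFlow brand confidence.  Owner names, brands and scores are
# illustrative placeholders, not real outputs.

plate_candidates = {            # from OCR + database query (Set 1)
    "Owner1": {"plate_score": 0.95, "brand": "Honda"},
    "Owner2": {"plate_score": 0.95, "brand": "Hyundai"},
}
brand_confidence = {            # from the TensorFlow model (Set 2)
    "Hyundai": 0.95,
    "Honda": 0.90,
}

def combined_score(owner):
    info = plate_candidates[owner]
    # Multiplying treats the two scores as (roughly) independent evidence;
    # a weighted combination would let you trust one source more than the other.
    return info["plate_score"] * brand_confidence.get(info["brand"], 0.0)

best = max(plate_candidates, key=combined_score)
print(best, combined_score(best))
```

Note that with the Set 2 numbers above this still picks the Hyundai owner, so when the brand classifier is the weaker source you would down-weight its contribution (or require a minimum margin before letting it break ties).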

SSAS - Classification - How To Split Data into: Training Set - Validation Set - Test Set

I have a set of 300,000 records of historic customer purchase data. I have started an SSAS data mining project to identify the best customers.
The split of data:
-90% non-buyers
-10% buyers
I have used various algorithms in SSAS (decision trees and neural networks showed the best lift) to explore my data.
The goal of the project is to identify/score customers according to who is most likely to buy a product.
Currently, I have used all my records for this purpose, but it feels like something is missing from the project. I am now reading two books about data mining. Both of them talk about splitting the data into different sets; however, neither explains HOW to actually split them.
I believe I need to split my records into three sets and re-run the SSAS algorithms.
Main questions:
1. How do I split the data into training, validation and test sets?
1.1 What ratio of buyers to non-buyers should be in the training set?
2. How do I score my customers from most likely to least likely to buy a product?
The division of your set could be done randomly, as your dataset is big and the proportion of buyers is not too low (10%). However, if you want to be sure that your sets are representative, you could take 80% of your buyer samples and 80% of your non-buyer samples and mix them to build a training set that contains 80% of your total dataset and has the same buyer/non-buyer ratio as the original, which makes the subsets representative. You may want to divide your dataset not into two subsets but into three: training, cross-validation and test. If you use a neural network, as you said, you should use the cross-validation subset to tune your model (weight decay, learning rate, momentum...).
Regarding your second question, you could use a neural network, as you said, and take its output as the probability; it will be in the range [0, 1] if you use a sigmoid as the activation function in the output layer. I would also recommend you take a look at collaborative filtering for this task, because it would help you find which products a customer may be interested in, using your knowledge of other buyers with similar preferences.
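The stratified split described above can be sketched in plain Python. The 100/900 group sizes and the 60/20/20 ratios below are illustrative stand-ins for the real 10%/90% data, not a recommendation:

```python
import random

# Hedged sketch of a stratified 3-way split: shuffle buyers and non-buyers
# separately, cut each group 60/20/20, then recombine, so every subset
# keeps the original buyer/non-buyer ratio.
random.seed(42)
buyers = [("buyer", i) for i in range(100)]           # stand-in for the 10% buyers
non_buyers = [("non-buyer", i) for i in range(900)]   # stand-in for the 90% non-buyers

def split_group(rows, fractions=(0.6, 0.2, 0.2)):
    rows = rows[:]
    random.shuffle(rows)
    n = len(rows)
    a = int(n * fractions[0])
    b = a + int(n * fractions[1])
    return rows[:a], rows[a:b], rows[b:]

splits = [split_group(g) for g in (buyers, non_buyers)]
train, val, test = [b + nb for b, nb in zip(*splits)]
print(len(train), len(val), len(test))
```

Because each class is shuffled and cut on its own, the training, cross-validation and test sets all carry the same 10% buyer ratio as the full dataset.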