Is there any data visualization technique that will tell us whether to do feature scaling? - data-science

When applying an algorithm in data science, we often need to do feature scaling on the input data set. I would like to know whether this is a mandatory step, or whether there is a technique that can decide when to perform feature scaling, based on:
1) Data Visualization
2) Statistical values

Feature scaling is needed if your inputs have a wide range of variation; if they are already normalized, then you don't need it.
There is no precise rule to follow. As a basic rule, consider that normalized inputs tend to work better than non-normalized ones.

If you create a model with two numerical features, one with high values such as salary (e.g. 2345, 1756, 34521) and one with low values such as age (e.g. 33, 17, 29), the feature with the higher values will obviously dominate the model.
To avoid this, we should scale both features to the same level before modeling.
It also depends on the algorithm you are using to build the model: only some models need feature scaling, not all.
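For illustration, a minimal sketch with scikit-learn of the two most common scalers, using the toy salary/age numbers from this answer (which scaler to pick depends on the model and the data):

import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Toy data from the example above: salary (large values) and age (small values)
X = np.array([[2345, 33],
              [1756, 17],
              [34521, 29]], dtype=float)

# Standardization: each column gets zero mean and unit variance
X_std = StandardScaler().fit_transform(X)

# Min-max scaling: each column is mapped to the [0, 1] range
X_minmax = MinMaxScaler().fit_transform(X)

print(X_std)
print(X_minmax)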

Related

A huge number of discrete features

I'm developing a regression model, but I ran into a problem when preparing the data. 17 out of 20 features are categorical, and each of them has a lot of categories. With one-hot encoding, my data table turns into a 10000x6000 table. How should I prepare this type of data?
I used PCA to try to reduce the dimensionality, but even capturing 70% of the variance takes 2500 components. That's why I'm asking here.
Unfortunately, I can't attach the dataset, as it is confidential.
How do I prepare the data to achieve the best results in the learning process?
Can the data be mapped more accurately in a non-linear manner? If so, you might want to try using an autoencoder for dimensionality reduction.
One thing to note about PCA is that it computes an orthogonal projection of the data onto a linear subspace, so it only gives a linear mapping of the data. Autoencoders, on the other hand, can give you a non-linear mapping, and so are able to represent a greater amount of the variance in the data in fewer dimensions. Just be sure to use non-linear activation functions in your autoencoder architecture.
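If you want to try this, here is a rough sketch of a small autoencoder in Keras; the layer widths, the 64-dimensional code, and the random array standing in for your one-hot table are all arbitrary choices for illustration:

import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

n_features, latent_dim = 6000, 64                        # 6000 one-hot columns, compressed to 64 dims
X = np.random.rand(1000, n_features).astype("float32")   # stand-in for the real table

inputs = keras.Input(shape=(n_features,))
encoded = layers.Dense(512, activation="relu")(inputs)   # non-linear activations matter here
encoded = layers.Dense(latent_dim, activation="relu")(encoded)
decoded = layers.Dense(512, activation="relu")(encoded)
decoded = layers.Dense(n_features, activation="sigmoid")(decoded)

autoencoder = keras.Model(inputs, decoded)
encoder = keras.Model(inputs, encoded)

autoencoder.compile(optimizer="adam", loss="binary_crossentropy")
autoencoder.fit(X, X, epochs=3, batch_size=64, verbose=0)

X_compressed = encoder.predict(X)                        # use these 64-dimensional codes downstream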
It really depends on exactly what you are trying to do. Getting a covariance matrix (and also PCA decomp.) will give you great insight about which classes tend to come together (and this requires one-hot encoded categories), but training a model off of that might be problematic.
In general, it really depends on the model you want to use.
One option would be a random forest. They can definitely be used for regression, though they need to be trained specifically for that. SKLearn has a class just for this:
https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html
The benefits of a random forest are that it works well on tabular data (as is the case here), and it can easily be trained using numerical values for class features, meaning your data vector can stay at dimension 20!
Decision-tree models (such as random forests) have been shown to outperform deep learning in many cases, and this may be one of them.
TL;DR: if you use a random forest, it can learn even with numerical encodings for the categories, and you avoid creating incredibly large data vectors.
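A hedged sketch of that idea with scikit-learn; the data below is synthetic (the real dataset is confidential) and the column names are made up, but it shows ordinal-encoding the categories so the forest sees one integer column per categorical feature instead of thousands of one-hot columns:

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import OrdinalEncoder
from sklearn.model_selection import train_test_split

# Synthetic stand-in: two high-cardinality categorical columns plus a numeric one
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "cat_a": rng.choice([f"a{i}" for i in range(300)], size=2000),
    "cat_b": rng.choice([f"b{i}" for i in range(150)], size=2000),
    "num_1": rng.normal(size=2000),
    "y":     rng.normal(size=2000),
})

cat_cols = ["cat_a", "cat_b"]
X = df.drop(columns="y").copy()
X[cat_cols] = OrdinalEncoder().fit_transform(X[cat_cols])  # integers, not one-hot

X_train, X_test, y_train, y_test = train_test_split(X, df["y"], random_state=0)
model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))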

Using Variance Threshold with normalized variance

We know that zero-variance or low-variance features should be dropped to help with model complexity. However, I have come to learn that comparing the variances of different features can be difficult: the features I'm looking at all have different medians, different variances, and different ranges, and higher-valued distributions tend to have bigger variances. So, to make a fair comparison, can we normalize all features by dividing them by their mean, like so:
normalized_df = df / df.mean()
I have seen this technique in a DataCamp course, where it is suggested that after normalizing like this we can choose a low variance threshold, such as 0.005, to make a fair comparison in feature selection. I was wondering whether that is correct.
If it is, what kind of threshold should be chosen for normalized features?
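For reference, the approach described above can be sketched end to end like this; the toy numbers and the 0.005 threshold are only illustrative, not a recommendation:

import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Hypothetical numeric frame; in practice this would be your own feature table
df = pd.DataFrame({
    "salary": [2345, 1756, 34521, 4200, 3900],
    "age":    [33, 17, 29, 41, 35],
    "height": [1.71, 1.68, 1.73, 1.70, 1.69],
})

# Divide by the mean so variances become comparable across scales
normalized_df = df / df.mean()

selector = VarianceThreshold(threshold=0.005)
selector.fit(normalized_df)
print(normalized_df.columns[selector.get_support()])  # features that survive the threshold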

Multiple trained models vs multiple features and one model

I'm trying to build a regression-based ML model using TensorFlow.
I am trying to estimate an object's ETA based on the following:
distance from target
distance from target (X component)
distance from target (Y component)
speed
The object travels on specific journeys. This could be represented as from A->B or from A->C or from D->F (POINT 1 -> POINT 2). There are 500 specific journeys (between a set of points).
These journeys aren't completely straight lines, and every journey is different (ie. the shape of the route taken).
I have two ways of getting around this problem:
I can have 500 different models with 4 features and one label(the training ETA data).
I can have 1 model with 5 features and one label.
My dilemma is that option 1 adds complexity, but it should be more accurate because every model is specific to one journey.
If I use option 2, the model will be pretty simple, but I don't know whether it would work properly. The new feature I would add is originCode + destinationCode. Unfortunately these are not quantifiable in a way that makes numerical sense or reveals a pattern - they're just text that defines the journey (for journey A->B the feature would be 'AB').
Is there some way that I can use one model and categorize the features so that one feature is just a 'grouping' feature (in order to separate the training data with respect to the journey)?
In ML, I believe that option 2 is generally the better option. We prefer general models rather than tailoring many models to specific tasks, as that gets dangerously close to hardcoding, which is what we're trying to get away from by using ML!
I think that, depending on the training data you have available, and the model size, a one-hot vector could be used to describe the starting/end points for the model. Eg, say we have 5 points (ABCDE), and we are going from position B to position C, this could be represented by the vector:
0100000100
as in, the first five values correspond to the origin spot whereas the second five are the destination. It is also possible to combine these if you want to reduce your input feature space to:
01100
There are other things to consider, as Scott has said in the comments:
How much data do you have? Maybe the feature space will be too big this way, I can't be sure. If you have enough data, then the model will intuitively learn the general distances (not actually, but intrinsically in the data) between datapoints.
If you have enough data, you might even be able to accurately predict between two points you don't have data for!
If it does come down to not having enough data, then finding representative features of the journey will come into use, ie. length of journey, shape of the journey, elevation travelled etc. Also a metric for distance travelled from the origin could be useful.
Best of luck!
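As a rough sketch of the one-hot journey encoding suggested above; the journey codes, numbers, and network size are invented for illustration, and any TensorFlow/Keras regression head would do:

import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

journeys = ["AB", "AC", "DF"]                       # in practice ~500 codes
journey_to_index = {code: i for i, code in enumerate(journeys)}

def encode_row(distance, dist_x, dist_y, speed, journey_code):
    # Concatenate the 4 numeric features with a one-hot journey vector
    one_hot = np.zeros(len(journeys), dtype="float32")
    one_hot[journey_to_index[journey_code]] = 1.0
    return np.concatenate([[distance, dist_x, dist_y, speed], one_hot])

X = np.stack([
    encode_row(12.0, 8.0, 9.0, 3.1, "AB"),
    encode_row(5.5, 2.0, 5.1, 2.7, "AC"),
])
y = np.array([240.0, 110.0])                        # made-up ETA labels

model = keras.Sequential([
    keras.Input(shape=(X.shape[1],)),
    layers.Dense(32, activation="relu"),
    layers.Dense(1),                                # single ETA output
])
model.compile(optimizer="adam", loss="mse")
model.fit(X, y, epochs=5, verbose=0)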
I would be inclined to lean toward individual models. This is because, for a given position along a given route and a constant speed, the ETA is a deterministic function of time. If one moves monotonically closer to the target along the route, it is also a deterministic function of distance to target. Thus, there is no information to transfer from one route to the next, i.e. "lumping" their parameters offers no a priori benefit. This is assuming, of course, that you have several "trips" worth of data along each route (i.e. (distance, speed) collected once per minute, or some such). If you have only, say, one datum per route then lumping the parameters is a must. However, in such a low-data scenario, I believe that including a dummy variable for "which route" would ultimately be fruitless, since that would introduce a number of parameters that rivals the size of your dataset.
As a side note, NEITHER of the models you describe could handle new routes. I would be inclined to build an individual model per route, data quantity permitting, and a single model neglecting the route identity entirely just for handling new routes, until sufficient data is available to build a model for that route.
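A hedged sketch of that per-route-plus-fallback arrangement, using a plain scikit-learn regressor as a stand-in for whatever model is actually trained per route (all numbers below are invented):

from sklearn.linear_model import LinearRegression

# Invented training data grouped by route code: (feature rows, ETA labels)
routes = {
    "AB": ([[12.0, 8.0, 9.0, 3.1], [10.0, 7.0, 7.2, 3.0]], [240.0, 200.0]),
    "AC": ([[5.5, 2.0, 5.1, 2.7], [4.0, 1.5, 3.7, 2.9]], [110.0, 85.0]),
}

per_route = {}
all_X, all_y = [], []
for code, (X, y) in routes.items():
    per_route[code] = LinearRegression().fit(X, y)   # one model per route
    all_X += X
    all_y += y
global_model = LinearRegression().fit(all_X, all_y)  # route-agnostic fallback

def predict_eta(features, route_code):
    model = per_route.get(route_code, global_model)  # unseen route -> fallback
    return model.predict([features])[0]

print(predict_eta([11.0, 7.5, 8.0, 3.0], "AB"))
print(predict_eta([6.0, 3.0, 5.2, 2.5], "ZZ"))       # new route handled by the global model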

Select important features then impute or first impute then select important features?

I have a dataset with lots of features (mostly categorical Yes/No features) and lots of missing values.
One of the techniques for dimensionality reduction is to generate a large and carefully constructed set of trees against a target attribute and then use each attribute's usage statistics to find the most informative subset of features. That is, we generate a large set of very shallow trees, with each tree trained on a small fraction of the total number of attributes. If an attribute is often selected as the best split, it is most likely an informative feature worth retaining.
I am also using an imputer to fill the missing values.
My doubt is about the order of these two steps. Which of the two (dimensionality reduction or imputation) should be done first, and why?
From a mathematical perspective you should avoid data imputation whenever you can (in the sense: use it only if you have to). In other words, if you have a method that can work with missing values, use it; if you do not, you are left with data imputation.
Data imputation is nearly always heavily biased; this has been shown many times, and I believe I even read a paper about it that is ~20 years old. In general, in order to do statistically sound data imputation you need to fit a very good generative model. Just imputing the most common value, the mean, etc. makes assumptions about the data of similar strength to Naive Bayes.
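If you do end up imputing, here is a minimal sketch of one possible ordering (impute first, then rank with the shallow trees described in the question), using synthetic Yes/No data; all column names and values are made up:

import numpy as np
import pandas as pd
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.impute import SimpleImputer

# Synthetic Yes/No data with missing values, already mapped to 1/0/NaN
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.choice([0.0, 1.0, np.nan], size=(500, 10), p=[0.45, 0.45, 0.1]),
                 columns=[f"f{i}" for i in range(10)])
y = rng.integers(0, 2, size=500)

# Impute first (the trees below cannot handle NaN), then rank the features
X_imputed = SimpleImputer(strategy="most_frequent").fit_transform(X)

forest = ExtraTreesClassifier(n_estimators=200, max_depth=3, random_state=0)  # shallow trees
forest.fit(X_imputed, y)

selector = SelectFromModel(forest, prefit=True, threshold="median")
print(X.columns[selector.get_support()])   # attributes kept as informative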

How to identify relevant features in WEKA?

I would like to perform feature analysis in WEKA. I have a data set of 8 features and 65 instances.
I would like to use the feature selection and optimization functionalities that are available for machine learning methods like SVM.
For example, in WEKA I would like to know how I can display which of the features contribute most to the classification result.
I think that WEKA provides a nice graphical user interface and allows a very detailed analysis of the influence of single features, but I don't know how to use it. Any help?
You have two options:
You can perform attribute selection using filters. For instance, you can use the AttributeSelection tab (or filter) with the search method Ranker and the attribute evaluation metric InfoGainAttributeEval. This way you get a ranked list of the most predictive features according to their Information Gain scores. I have done this many times with good results; sometimes it even helps to increase the accuracy of SVMs, which are known not to need (much) feature selection. You can also try other search methods in order to find subgroups of coupled predictors, and other metrics.
You can just look at the coefficients in the SVM output. For instance, in linear SVMs, the classifier is a linear function of the form a1·f1 + a2·f2 + ... + an·fn + f(n+1) > 0, where the ai are the attribute values for an instance and the fi are the weights obtained by the SVM training algorithm. Consequently, weights with values close to 0 correspond to attributes that do not count for much, and are therefore bad predictors; extreme weights (either positive or negative) indicate good predictors.
Additionally, you can check the visualization options available for a particular classifier (e.g. J48 is a decision tree; the attribute used in the root test is the best predictor). You can check the AttributeSelection tab visualization options as well.
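WEKA itself is driven through its GUI or Java API, but the same two ideas (an information-gain style ranking and inspecting linear-SVM weights) can be sketched in Python with scikit-learn; this is only an analogue for illustration, not the WEKA workflow itself:

from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import mutual_info_classif
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

data = load_breast_cancer()
X, y, names = data.data, data.target, data.feature_names

# Mutual-information ranking, similar in spirit to Ranker + InfoGainAttributeEval
mi = mutual_info_classif(X, y, random_state=0)
print(sorted(zip(names, mi), key=lambda t: -t[1])[:5])

# Linear SVM weights: large |coefficient| -> influential feature, near zero -> weak predictor
svm = LinearSVC(dual=False).fit(StandardScaler().fit_transform(X), y)
print(sorted(zip(names, svm.coef_[0]), key=lambda t: -abs(t[1]))[:5])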