How to Preprocess a Dataset with Categorical and Numerical Data - tensorflow

Let's say we have a dataset with both categorical and numerical data. I want to know whether:
(1) it is okay to scale the whole dataset after encoding the categorical data (using label encoding, say), or
(2) it is okay to scale only the columns with numerical data.
Note:
If (1), the columns with categorical data get scaled too.
If (2), there will be a bias towards the categorical data (the categorical values will be 0, 1, 2, etc. if a label encoder is used, while the numerical values will be between 0 and 1 if, say, MinMaxScaler is used, so the categorical columns end up on a larger scale).
I have tried both options, but I have reservations about each of them.
Thanks.
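For concreteness, here is a minimal sketch of option (2), assuming a hypothetical pandas DataFrame df with one categorical and two numerical columns (the column names and values are made up for illustration). The numerical columns are scaled with MinMaxScaler while the categorical column is encoded separately:

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# hypothetical example data
df = pd.DataFrame({
    "color": ["red", "green", "blue", "green"],    # categorical
    "age": [23, 45, 31, 52],                       # numerical
    "income": [40000, 85000, 62000, 91000],        # numerical
})

# option (2): scale only the numerical columns; encode the categorical column separately
preprocess = ColumnTransformer([
    ("num", MinMaxScaler(), ["age", "income"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["color"]),
])
X = preprocess.fit_transform(df)

Swapping OrdinalEncoder in for OneHotEncoder in the "cat" slot would reproduce the label-encoded variant of option (2), with the integer codes left unscaled.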

Related

Processing column with letters before feeding into a NN

I wanted to implement a classification algorithm using a NN, but some columns have complex alphanumeric strings, so for now I chose only the simpler columns to experiment with. Here is an example with a few elements of the column I chose:
[Image: a few elements of the column]
As you can see, these columns contain A, G, C, or T. Some rows had combinations of the four letters, but I removed those for now. My plan was to map each of these letters to values like 1, 2, 3, and 4 and then feed them to the NN.
Is this mapping acceptable for feeding into a dense NN, or is there a better method for doing this?
I would not map the letters to integers like 1, 2, 3, etc., because you would be implicitly giving them an order or rank that the NN may treat as meaningful, even though no such ranking truly exists.
If the cardinality is not high (i.e. there are few unique values), you can apply one-hot encoding. If the cardinality is high, you should use other encoding techniques; otherwise the one-hot encoder will introduce a lot of dimensionality and sparsity into your data, which is not welcome. You can find here some other interesting methods for encoding categorical variables.
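A minimal sketch of the one-hot approach, assuming a hypothetical pandas Series of nucleotide letters (the column name "base" is made up):

import pandas as pd

# hypothetical column of nucleotide letters
col = pd.Series(["A", "G", "C", "T", "G", "A"], name="base")

# one-hot encode: each letter becomes its own 0/1 column, so no artificial order is introduced
onehot = pd.get_dummies(col, prefix="base")
print(onehot)   # columns base_A, base_C, base_G, base_T with 0/1 values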

Scaling features in DataFrame modelling

I have a dataset with 15 columns, with the following scenario:
9 columns are categorical, so I converted them with a one-hot encoder.
6 columns are numeric; out of these 6, 3 columns have outliers and values in different ranges, so I chose RobustScaler() for them, and for the others I chose StandardScaler.
After that I combined all the data frames and applied the logistic regression algorithm, but my model produced a very low score, even though I got a good score without scaling.
Can anyone help with this?
Please apply column standardization to the data frame and see the output. My guess is that you are facing this problem because logistic regression is sensitive to outliers.
Impute the outliers properly and then apply column standardization.
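As an illustration only (the column names below are placeholders, not the real ones), a pipeline along the lines described above might look like this:

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, RobustScaler, StandardScaler
from sklearn.linear_model import LogisticRegression

# hypothetical column names standing in for the real ones
categorical_cols = [f"cat_{i}" for i in range(1, 10)]   # the 9 categorical columns
outlier_num_cols = ["num_1", "num_2", "num_3"]          # the 3 numeric columns with outliers
other_num_cols = ["num_4", "num_5", "num_6"]            # the remaining 3 numeric columns

preprocess = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ("robust", RobustScaler(), outlier_num_cols),
    ("standard", StandardScaler(), other_num_cols),
])

model = Pipeline([
    ("preprocess", preprocess),
    ("clf", LogisticRegression(max_iter=1000)),
])
# model.fit(X_train, y_train); model.score(X_test, y_test)

A single ColumnTransformer also avoids manually concatenating the separately transformed data frames.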

Using embedded columns

I'm trying to understand the TensorFlow tutorial on wide and deep learning. The demo application creates indicator columns for categorical features with few categories (gender, education), and it creates embedded columns for categorical features with many categories (native_country, occupation).
I don't understand embedded columns. Is there a rule that clarifies when to use embedded columns instead of indicator columns? According to the documentation, the dimension parameter sets the dimension of the embedding. What does that mean?
From the feature columns tutorial:
Now, suppose instead of having just three possible classes, we have a million. Or maybe a billion. For a number of reasons, as the number of categories grow large, it becomes infeasible to train a neural network using indicator columns.
We can use an embedding column to overcome this limitation. Instead of representing the data as a one-hot vector of many dimensions, an embedding column represents that data as a lower-dimensional, ordinary vector in which each cell can contain any number, not just 0 or 1. By permitting a richer palette of numbers for every cell, an embedding column contains far fewer cells than an indicator column.
The dimension parameter is the length of the vector you're reducing the categories to.
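For reference, here is a small sketch using the (legacy) tf.feature_column API that the wide-and-deep tutorial is built on; the vocabulary and bucket sizes below are made up:

import tensorflow as tf

# few categories: an indicator (one-hot) column is fine
gender = tf.feature_column.categorical_column_with_vocabulary_list(
    "gender", ["Female", "Male"])
gender_indicator = tf.feature_column.indicator_column(gender)

# many categories: hash into buckets and learn a dense embedding instead of a huge one-hot vector
occupation = tf.feature_column.categorical_column_with_hash_bucket(
    "occupation", hash_bucket_size=1000)
# dimension=8 means each occupation is represented by a learned vector of 8 floats
occupation_embedding = tf.feature_column.embedding_column(occupation, dimension=8)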

scikit-learn PCA with unknown feature values

I want to use sklearn for PCA analysis (then regression and k-means clustering). I have a dataset with 20k features and 2000k rows. However, for each row in the dataset only a subset (typically any 5 or so of the 20k) of the features has been measured.
How should I pad my pandas dataframe / set up sklearn so that sklearn does not use features for the instances where the value has not been measured? (E.g. if I set null feature values to 0.0, would this distort the outcome?)
eg:
from sklearn.decomposition import PCA

X = array[:, 0:n]   # feature columns
Y = array[:, n]     # target column
pca = PCA()
fit = pca.fit(X)
If the dataset is padded with zeros for most feature values, will PCA still be valid?
I see 3 options, though none of them is a complete solution to your problem:
1) Replace the null values with 0, but that will definitely worsen your results;
2) Replace the unknown values with the mean or median of each feature; this might be better, but it will still give you a distorted PCA;
3) Don't use PCA at all and look instead for a dimensionality reduction technique designed for sparse data.
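A rough sketch of options 2) and 3), assuming a small made-up array where NaN marks the unmeasured values (TruncatedSVD is just one example of a sparse-friendly technique):

import numpy as np
from scipy import sparse
from sklearn.impute import SimpleImputer
from sklearn.decomposition import PCA, TruncatedSVD

# made-up data: NaN marks the values that were never measured
X = np.array([[1.0, np.nan, 3.0],
              [np.nan, 2.0, np.nan],
              [4.0, np.nan, 6.0],
              [2.0, 5.0, np.nan]])

# option 2): impute with the per-feature mean, then run PCA (still distorted, but better than zeros)
X_mean = SimpleImputer(strategy="mean").fit_transform(X)
pca = PCA(n_components=2).fit(X_mean)

# option 3): keep the matrix sparse (zero = absent) and use a decomposition that accepts sparse input
X_sparse = sparse.csr_matrix(np.nan_to_num(X))
svd = TruncatedSVD(n_components=2).fit(X_sparse)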

Random projection in Python Pandas using a dataframe containing NaN values

I have a dataframe data containing real values and some NaN values. I'm trying to perform locality-sensitive hashing using random projections to reduce the dimension to 25 components, specifically with the sklearn.random_projection.GaussianRandomProjection class. However, when I run:
from sklearn import random_projection

tx = random_projection.GaussianRandomProjection(n_components=25)
data25 = tx.fit_transform(data)
I get Input contains NaN, infinity or a value too large for dtype('float64'). Is there a workaround for this? I tried changing all the NaN values to a value that is never present in my dataset, such as -1. How valid would my output be in that case? I'm not an expert on the theory behind locality-sensitive hashing/random projections, so any insight would be helpful as well. Thanks.
NA / NaN values (not-available / not-a-number) are, I have found, just plain troublesome.
You don't want to just substitute a random value like -1. If you are inclined to do that, use one of the Imputer classes. Otherwise, you are likely to very substantially change the distances between points. You likely want to preserve distances as much as possible if you are using random projection:
The dimensions and distribution of random projections matrices are controlled so as to preserve the pairwise distances between any two samples of the dataset.
However, this may or may not result in reasonable values for learning. As far as I know, imputation is an open field of study, which (for instance) this gentleman has specialized in studying.
If you have enough examples, consider dropping rows or columns that contain NaN values. Another possibility is training a generative model like a Restricted Boltzmann Machine and using it to fill in the missing values:
import numpy as np
from sklearn.neural_network import BernoulliRBM
from sklearn.impute import SimpleImputer  # replaces the old sklearn.preprocessing.Imputer

rbm = BernoulliRBM().fit(data_with_no_nans)                   # train on the complete rows only
mean_imputed_data = SimpleImputer().fit_transform(all_data)   # mean-impute as a starting point
rbm_imputation = rbm.gibbs(mean_imputed_data)                 # one Gibbs sampling step from the RBM
nan_mask = np.isnan(all_data)
all_data[nan_mask] = rbm_imputation[nan_mask]                 # keep observed values, fill only the NaNs
Finally, you might consider imputing using nearest neighbors. For a given column, train a nearest neighbors model on all the variables except that column using all complete rows. Then, for a row missing that column, find the k nearest neighbors and use the average value among them. (This gets very costly, especially if you have rows with more than one missing value, as you will have to train a model for every combination of missing columns).
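For what it's worth, scikit-learn also ships a KNNImputer (in sklearn.impute) that does roughly this in one step, averaging each missing value over the k nearest neighbours without needing a separate model per column. A minimal sketch with made-up data:

import numpy as np
from sklearn.impute import KNNImputer

# made-up data with NaN marking the missing entries
data = np.array([[1.0, 2.0, np.nan],
                 [3.0, np.nan, 6.0],
                 [2.0, 3.0, 5.0],
                 [8.0, 9.0, 7.0]])

# each missing value is replaced by the mean of that feature over the 2 nearest
# neighbours, with distances computed on the features both rows have in common
imputer = KNNImputer(n_neighbors=2)
data_imputed = imputer.fit_transform(data)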