Imputing values for a categorical feature - dataframe

I have a categorical feature that takes the values 'yes' or 'no', and 50% of its values are missing. How can I impute the missing values for this feature? Is there any specific way of imputing values for features with such a high number of NaNs?
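Not a definitive answer, but a minimal sketch of two common options (the answered column and toy values below are made up for illustration): impute with the most frequent value, or keep the missingness as its own category so a model can learn from it. With 50% missing, the second option is often worth trying:
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({"answered": ["yes", np.nan, "no", np.nan]})  # toy stand-in for the real column

# Option 1: impute with the most frequent value (risky when half the values are missing)
df["answered_mode"] = SimpleImputer(strategy="most_frequent").fit_transform(df[["answered"]]).ravel()

# Option 2: treat the missingness as its own category so the model can use it
df["answered_cat"] = df["answered"].fillna("missing")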

Related

How to Preprocess a Dataset with Categorical and Numerical Data

Let's say we have a dataset with categorical and numerical data. I want to know if:
(1) it is okay to scale the whole dataset after encoding the categorical data (using label encoding, say), or
(2) it is okay to scale only the columns with numerical data
Note:
If (1), the columns with categorical data will become scaled too
If (2), there will be a bias towards the categorical data (with a label encoder, say, the categorical values will be 0, 1, 2, etc., while the numerical values will be between 0 and 1 if MinMaxScaler is used, say)
I have tried both options, but I have reservations about each.
Thanks.
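For what it's worth, option (2) is straightforward to wire up with a ColumnTransformer, which scales only the numeric columns and one-hot encodes the categorical ones (avoiding the artificial 0/1/2 ordering a label encoder introduces). A minimal sketch, with hypothetical column names and df as a placeholder DataFrame:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

numeric_cols = ["age", "income"]        # hypothetical numeric column names
categorical_cols = ["gender", "city"]   # hypothetical categorical column names

# Option (2): scale only the numeric columns, encode the categorical ones separately
preprocess = ColumnTransformer([
    ("num", MinMaxScaler(), numeric_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])
X_processed = preprocess.fit_transform(df)  # df is the placeholder raw DataFrame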

Is it possible to mask individual features in tensorflow?

I have a large quantity of missing values that appear at random in my data. Unfortunately, I cannot simply drop observations with missing data as I am grouping observations by a feature and cannot drop NaNs without affecting the entire group.
I was hoping to simply mask features that were missing. So a single group might have 8 items in it, and each item may have 0 to N features, depending on how many got masked due to being missing.
I have been experimenting a lot with RaggedTensors, but have run into a number of issues: not being able to flatten the RaggedTensor, not being able to concatenate it with regular tensors of uniform shape, and Dense layers requiring the last dimension of their input (i.e., the number of features) to be known.
Does anybody know if there is a way to do this?
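Not a definitive answer, but one common workaround that avoids RaggedTensors altogether: keep a fixed feature dimension, fill the missing entries with zeros, and concatenate a binary mask so the network can tell a real zero from a masked-out feature. A minimal sketch with toy data:
import numpy as np
import tensorflow as tf

x = np.array([[1.0, np.nan, 3.0],
              [np.nan, 2.0, 0.5]], dtype=np.float32)  # toy batch with missing features

mask = np.isfinite(x).astype(np.float32)  # 1 where a feature is present, 0 where missing
x_filled = np.nan_to_num(x, nan=0.0)      # zero out the missing entries

# Feed both the filled values and the mask; the last dimension stays fixed,
# so regular Dense layers work without any ragged shapes.
inputs = tf.concat([x_filled, mask], axis=-1)
hidden = tf.keras.layers.Dense(16, activation="relu")(inputs)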

Feature scaling implemented in DataFrame modelling

I have a dataset with 15 columns and the following scenario:
9 columns are categorical, so I converted them with a one-hot encoder.
6 columns are numeric; 3 of those 6 have outliers and their values span different ranges, so I chose RobustScaler() for them and StandardScaler for the others.
After that I combined all the data frames and applied the Logistic Regression algorithm, but my model produced a very low score, even though I got a good score without scaling.
Can anyone help with this?
Please apply column standardization to the data frame and see the output. I guess that since logistic regression is sensitive to outliers, you are facing this problem.
Impute the outliers properly and then apply column standardization.
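To make the suggestion above concrete, here is a minimal sketch that applies the one-hot encoding, RobustScaler and StandardScaler inside a single pipeline, so every transformer is fit only on the training data before the logistic regression sees it. The column names and X_train/y_train/X_test/y_test are placeholders:
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, RobustScaler, StandardScaler

categorical_cols = ["c1", "c2"]          # stand-ins for the 9 categorical columns
outlier_cols = ["n1", "n2", "n3"]        # the 3 numeric columns with outliers
other_numeric_cols = ["n4", "n5", "n6"]  # the remaining numeric columns

model = Pipeline([
    ("prep", ColumnTransformer([
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
        ("robust", RobustScaler(), outlier_cols),
        ("standard", StandardScaler(), other_numeric_cols),
    ])),
    ("clf", LogisticRegression(max_iter=1000)),
])
model.fit(X_train, y_train)          # placeholders for the split data
print(model.score(X_test, y_test))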

scikit-learn PCA with unknown feature values

I want to use sklearn for PCA analysis (then regression and k-means clustering). I have a dataset with 20k features and 2000k rows. However, for each row in the dataset only a subset of the features (typically any 5 or so of the 20k) has been measured.
How should I pad my pandas dataframe / set up sklearn so that sklearn does not use features for the instances where the value has not been measured? (e.g. if I set null feature values to 0.0, would this distort the outcome?)
e.g.:
from sklearn.decomposition import PCA

X = array[:, 0:n]   # feature columns (array and n come from the loaded dataset)
Y = array[:, n]     # target column, used later for the regression step
pca = PCA()
fit = pca.fit(X)
If the dataset is padded with zeros for most feature values, will the PCA still be valid?
I see three options, though none fully solves your problem (a rough sketch of the last two follows below):
1) Replace the null values with 0, but that will definitely worsen your results;
2) Replace the unknown values with the mean or median of each feature; this might be better, but it will still give you a distorted PCA;
3) Last option: don't use PCA and instead look for a dimensionality reduction technique designed for sparse data.
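A rough sketch of options 2) and 3), where X stands for the raw array with NaNs and n_components=10 is an arbitrary choice:
import numpy as np
from scipy import sparse
from sklearn.decomposition import PCA, TruncatedSVD
from sklearn.impute import SimpleImputer

# Option 2: impute per-feature means, then run PCA
# (results will be distorted when almost every entry is imputed, as noted above)
X_imputed = SimpleImputer(strategy="mean").fit_transform(X)
pca = PCA(n_components=10).fit(X_imputed)

# Option 3: treat "not measured" as an explicit zero in a sparse matrix and
# use TruncatedSVD, which is designed to work on sparse input
X_sparse = sparse.csr_matrix(np.nan_to_num(X, nan=0.0))
svd = TruncatedSVD(n_components=10).fit(X_sparse)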

Random projection in Python Pandas using a dataframe containing NaN values

I have a dataframe data containing real values and some NaN values. I'm trying to perform locality-sensitive hashing using random projections to reduce the dimension to 25 components, specifically with the sklearn.random_projection.GaussianRandomProjection class. However, when I run:
from sklearn import random_projection

tx = random_projection.GaussianRandomProjection(n_components=25)
data25 = tx.fit_transform(data)
I get the error Input contains NaN, infinity or a value too large for dtype('float64'). Is there a workaround for this? I tried changing all the NaN values to a value that is never present in my dataset, such as -1. How valid would my output be in this case? I'm not an expert on the theory behind locality-sensitive hashing/random projections, so any insight would be helpful as well. Thanks.
NA / NaN values (not-available / not-a-number) are, I have found, just plain troublesome.
You don't want to just substitute a random value like -1. If you are inclined to do that, use one of the Imputer classes. Otherwise, you are likely to very substantially change the distances between points. You likely want to preserve distances as much as possible if you are using random projection:
The dimensions and distribution of random projections matrices are controlled so as to preserve the pairwise distances between any two samples of the dataset.
However, this may or may not result in reasonable values for learning. As far as I know, imputation is an open field of study, which (for instance) this gentleman has specialized in studying.
If you have enough examples, consider dropping rows or columns that contain NaN values. Another possibility is training a generative model such as a Restricted Boltzmann Machine and using it to fill in the missing values:
import numpy as np
import sklearn.neural_network
import sklearn.preprocessing  # in newer scikit-learn, Imputer is replaced by sklearn.impute.SimpleImputer

rbm = sklearn.neural_network.BernoulliRBM().fit(data_with_no_nans)  # train only on rows with no NaNs
mean_imputed_data = sklearn.preprocessing.Imputer().fit_transform(all_data)  # mean-impute to get a complete matrix
rbm_imputation = rbm.gibbs(mean_imputed_data)  # one Gibbs step reconstructs every cell
nan_mask = np.isnan(all_data)
all_data[nan_mask] = rbm_imputation[nan_mask]  # overwrite only the originally missing cells
Finally, you might consider imputing using nearest neighbors. For a given column, train a nearest neighbors model on all the variables except that column using all complete rows. Then, for a row missing that column, find the k nearest neighbors and use the average value among them. (This gets very costly, especially if you have rows with more than one missing value, as you will have to train a model for every combination of missing columns).
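For the nearest-neighbour idea, note that scikit-learn ships sklearn.impute.KNNImputer, which implements a similar scheme without training a model per column: each missing cell is filled with the average of that column over the k nearest rows, with distances computed only on the features the two rows share. A minimal sketch with toy data:
import numpy as np
from sklearn.impute import KNNImputer

data = np.array([[1.0, 2.0, np.nan],
                 [3.0, np.nan, 6.0],
                 [2.0, 3.0, 5.0],
                 [1.5, 2.5, 4.0]])  # toy data with missing entries

# Each missing cell is filled with the mean of that column over the 2 nearest rows
imputer = KNNImputer(n_neighbors=2)
data_imputed = imputer.fit_transform(data)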