Unable to Perform NULL value analysis in pandas dataframe

I want to perform NULL value analysis here.
Here are the first 2 rows of the dataset:
Shop_name      Bikes_avaiable             Shop_location   Average_price_of_bikes   Rating_of_shop
NYC Velo       Ninja,hbx                  Salida          5685$                    4.2
Bike Gallery   dtr,mtg,Harley Davidson    Portland        6022$                    4.8
Except for Shop_name, every column has some NULL values.
Earlier I used mean-based imputation and frequency-based imputation to replace the NULL values, but I have been told to follow a model-based imputation technique to replace all the NULL values instead.
Can anyone suggest how to do that?

I believe "model-based imputation technique" refers to using statistical models to predict the missing values. For example, you can use a K-Nearest Neighbours (KNN) model: train it on the rows that have no missing values, then predict the missing values for the rows that do. You would have to apply one-hot encoding to the categorical columns first.
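A minimal sketch of this idea using scikit-learn's KNNImputer (the NaN positions are illustrative, Bikes_avaiable is left out for brevity, and the price column is assumed to have already been converted to a plain number):

import pandas as pd
from sklearn.impute import KNNImputer

# Toy frame modelled on the question's two sample rows.
df = pd.DataFrame({
    "Shop_name": ["NYC Velo", "Bike Gallery"],
    "Shop_location": ["Salida", "Portland"],
    "Average_price_of_bikes": [5685.0, None],
    "Rating_of_shop": [None, 4.8],
})

# One-hot encode the categorical columns so KNN can compute distances on them.
encoded = pd.get_dummies(df, columns=["Shop_name", "Shop_location"])

# Each missing value is filled in from the nearest row(s) that do have it.
imputer = KNNImputer(n_neighbors=1)
imputed = pd.DataFrame(imputer.fit_transform(encoded), columns=encoded.columns)
print(imputed)

scikit-learn's IterativeImputer is another option in the same spirit: each column with missing values is modelled as a function of the other columns.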

Related

label encoder unable to convert a range of categorical columns into numerical columns

I have a 50-column, mostly categorical dataset; only 5 of the columns are numerical. I would like to apply a label encoder to convert the categorical columns into numerical columns. The categorical columns are basically nominal columns in my dataset. I need to convert columns 0 to 4 and columns 9 to 50 to numerical values.
I used the command
le = LabelEncoder()
df.iloc[:,0:4]=le.fit_transform(df.iloc[:,0:4])
df is the name of the dataframe.
error : ValueError: y should be a 1d array
How could I fix this problem? Thank you.
Use the .apply() method of DataFrame to apply a function to columns/rows.
In your particular case it will be something like df.iloc[:, 0:4].apply(le.fit_transform) (notice that you still need the .iloc slice here; .apply() then hands fit_transform one column at a time, which is what the "y should be a 1d array" error is asking for).
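A minimal sketch of that fix on a stand-in dataframe (the column names are made up; substitute your own iloc ranges):

import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Tiny stand-in for the question's dataframe: two nominal string columns.
df = pd.DataFrame({
    "colour": ["red", "blue", "red"],
    "size": ["S", "M", "S"],
})

le = LabelEncoder()
# fit_transform expects a 1-D array, so .apply() feeds it one column at a time.
df.iloc[:, 0:2] = df.iloc[:, 0:2].apply(le.fit_transform)
print(df)

Note that this refits the encoder for each column, so the integer codes are independent between columns; if you need to invert the encoding later, keep one fitted LabelEncoder per column.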

Is there a way to combine two columns in a dataset, keeping the larger float64 using Pandas?

I'll try to keep it simple, but these are very large datasets I am working with.
I am trying to combine columns A and B of my data frame.
If A has a value in a row then B doesn't, and vice versa; the hole is filled with NaN.
A {1,2,NaN,4,5}
B {NaN,NaN,3,NaN,NaN}
I need A to equal {1,2,3,4,5}
EDIT:
Using
df.rename(columns={"a": "b"})
before you concatenate your data allows the two columns to be combined easily, with the non-missing values layering over the NaN gaps.
df['A'] = df['A'].fillna(df['B'])
What this code does is fill all missing values of column A with the values found in column B.
For more options see: https://datascience.stackexchange.com/questions/17769/how-to-fill-missing-value-based-on-other-columns-in-pandas-dataframe
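A runnable sketch with the exact columns from the question:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [1, 2, np.nan, 4, 5],
    "B": [np.nan, np.nan, 3, np.nan, np.nan],
})

# Fill the gaps in A with the corresponding values from B.
df["A"] = df["A"].fillna(df["B"])
# Equivalent alternative: df["A"] = df["A"].combine_first(df["B"])
print(df["A"].tolist())   # [1.0, 2.0, 3.0, 4.0, 5.0]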

Null value of mean in dataframe columns after imputation

I have a dataframe where I imputed nulls with the mean of their group ID. The null count after applying the imputation to all such columns is 0. But when I call describe(), the mean of a few of the imputed columns shows as NaN. The null count for those columns is 0 and the data type is float16 (changed from float64 to save on in-memory computation). How could this possibly happen?
Thanks in advance!
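One plausible cause, not confirmed by anything in the question, is float16 overflow: pandas computes the intermediate sum (and count) of a float column in the column's own dtype, and float16 cannot represent anything above roughly 65504, so on a large column both can overflow to inf and the reported mean becomes inf/inf = NaN even though the column contains no nulls. A rough sketch of the effect (exact behaviour depends on your pandas version):

import numpy as np
import pandas as pd

# 100,000 perfectly valid values and no nulls anywhere.
s = pd.Series(np.full(100_000, 1000.0)).astype("float16")

print(s.isna().sum())   # 0
print(s.describe())     # mean/std may come out as inf or NaN because the
                        # intermediate sum and count overflow float16 (~65504 max)

If this is what is happening, computing the statistics on a float64 copy, e.g. df.astype("float64").describe(), should bring the means back.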

Merge vectors to a vector of vectors spark data frame

I have a PySpark data frame with 4 columns. The first three are vectors and the last one is a string. The length of each vector differs from row to row. I want to merge the first three into one column named features that will be a vector of length three, where each element is one of the vectors I described. I tried to use VectorAssembler, but the result was a single flat vector; as a result, logistic regression shows an error because, as I said, the vectors in each row have different sizes. Thanks in advance. Example:
Col1 = [1.5,2.3,4.8]
Col2 = [1.2,3.6,1.9,10.5,3.2]
Col3 = [1.4,5.6]
Then the feature for the first row will be
Feature = [[1.5,2.3,4.8],[1.2,3.6,1.9,10.5,3.2],[1.4,5.6]]

Selection column in a dataframe in pandas apply min function

I have n dataframes in a list
df = [df_1, df_2, df_3, ..., df_n]
where df_n is a pandas (Python) dataframe and each dataframe is one variable of my Keras model.
X_train = [df_1_1, df_2_1, ..., df_n_1]
Where:
df_1_1 is the first column of the first dataframe in the list (the first variable); each dataframe has m columns.
Each column of a dataframe is that variable with a different type of smoothing or filter applied.
I have 100 columns in each dataframe and I want to select the combination of columns (one from each dataframe) whose X_train gives the minimum value of my model's score.
score = model.evaluate(X_test,Y_test)
X_test and Y_test are the last n occurrences of the selected columns.
Is there some library for selecting these columns (neural networks, GA, ant colony, ...)?
How can I implement it?
What is your prediction task? Do you need a neural network at all? You are essentially looking at a feature selection problem here. You could use simpler models such as the lasso, which selects columns using L1-regularization, or an ensembling technique such as random forests and use the relative feature importances to select your columns. Perhaps have a look at scikit-learn.
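A rough sketch of both suggestions on synthetic data (the column layout and target here are invented for illustration, not taken from your dataframes):

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LassoCV

# Synthetic stand-in: 100 candidate smoothed/filtered columns, 500 samples.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 100))
y = 2.0 * X[:, 3] + rng.normal(size=500)   # only column 3 is actually informative

# Option 1: L1-regularization; the lasso keeps only columns with non-zero coefficients.
lasso_selector = SelectFromModel(LassoCV(cv=5)).fit(X, y)
print("lasso keeps columns:", np.flatnonzero(lasso_selector.get_support()))

# Option 2: rank columns by random-forest feature importance and keep the top k.
rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
print("top 10 by importance:", np.argsort(rf.feature_importances_)[::-1][:10])

You would then retrain your Keras model on only the selected columns and compare the evaluate() scores.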