Compare one-hot-encoded column header and predicted labels - pandas

I have 3 one-hot-encoded columns whose header names are the labels, and one prediction column, preds, holding the predicted labels (see image). I want to measure the performance of my predictions by comparing the label in preds with the header of the column that contains the 1.
In this example only 20% of the predictions are correct.
Is there a quick way of doing this in pandas?

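IIUC, you can use DataFrame.lookup and take the mean (note that lookup was deprecated in pandas 1.2 and removed in 2.0, so this only works on older versions):
df[['Type_1','Type_2','Type_3']].lookup(df.index, df['preds']).mean() * 100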
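On pandas versions without DataFrame.lookup, the same comparison can be sketched in two ways: recover the true label with idxmax, or index the one-hot block with NumPy. The data below is hypothetical and only mimics the shape described in the question; the column names Type_1 to Type_3 are taken from the answer above.

```python
import numpy as np
import pandas as pd

# Hypothetical data: three one-hot columns plus a column of predicted labels.
df = pd.DataFrame({
    "Type_1": [1, 0, 0, 1, 0],
    "Type_2": [0, 1, 0, 0, 0],
    "Type_3": [0, 0, 1, 0, 1],
    "preds": ["Type_1", "Type_3", "Type_1", "Type_2", "Type_2"],
})
label_cols = ["Type_1", "Type_2", "Type_3"]

# Option 1: recover the true label (the column holding the 1) and compare.
true_labels = df[label_cols].idxmax(axis=1)
accuracy = (true_labels == df["preds"]).mean() * 100

# Option 2: NumPy indexing, picking for each row the cell named by preds.
col_idx = df[label_cols].columns.get_indexer(df["preds"])
picked = df[label_cols].to_numpy()[np.arange(len(df)), col_idx]
accuracy_lookup = picked.mean() * 100

print(accuracy, accuracy_lookup)
```

Both variants assume every value in preds is one of the label column names; get_indexer returns -1 for unknown names, which would silently pick the last column.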

Related

How to adjust Python linear regression y axis

I have a problem with the price column: every time I plot it, the y axis shows decimals instead of the actual values, and all my graphs have this problem. I want the axis to show the actual values.
Example of a linear graph
This is the dataframe containing the dataset
train is the name of the dataframe.
columns contains the selected column names:
columns = ['Id', 'year', 'distance_travelled(kms)', 'brand_rank', 'car_age']
for i in columns:
    plt.scatter(train[i], y, label='Actual')
    plt.xlabel(i)
    plt.ylabel('price')
    plt.show()

label encoder unable to convert a range of categorical columns into numerical columns

I have a dataset with 50 columns, most of them categorical; only 5 columns are numerical. I would like to apply a label encoder to turn the categorical columns into numerical ones. The categorical columns are all nominal. I need to convert columns 0 to 4 and columns 9 to 50 to numerical values.
I used the following commands:
le = LabelEncoder()
df.iloc[:,0:4]=le.fit_transform(df.iloc[:,0:4])
df is the name of the dataframe.
error : ValueError: y should be a 1d array
How could I fix this problem? Thank you.
Use the .apply() method of DataFrame to apply a function to each column (or row).
In your case it would be something like df.iloc[:, 0:4].apply(le.fit_transform); apply passes one column at a time, which is the 1-D input that LabelEncoder expects.
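A minimal sketch of the column-by-column approach, assuming scikit-learn is installed. The frame and its column names are hypothetical. Note that a positional slice such as 0:4 excludes its end index, so it covers columns 0 to 3; covering columns 0 to 4 would need 0:5.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical frame: two nominal string columns and one numeric column.
df = pd.DataFrame({
    "color": ["red", "blue", "red", "green"],
    "size": ["S", "M", "L", "M"],
    "price": [1.0, 2.5, 3.0, 4.2],
})

# Positional slice of the columns to encode; the end index is excluded.
cat_cols = list(df.columns[0:2])

# One encoder per column, so each mapping can be inverted later.
encoders = {}
for col in cat_cols:
    encoders[col] = LabelEncoder()
    df[col] = encoders[col].fit_transform(df[col])

print(df)
```

Keeping a separate encoder per column is a design choice: reusing a single LabelEncoder, as df.apply(le.fit_transform) does, works for encoding but leaves only the last column's mapping available for inverse_transform.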

train-test split of scikit learn resulting in features having only one unique value in train data

I am trying to train a multivariate linear regression model. I have a dataset named 'main' that contains a few categorical variables, which I converted to dummy variables. Let's say the columns obtained this way are A, B, C, D and so on. When I run a train-test split on this dataset, the resulting train set contains only the value 0 in some of these columns. How can I overcome this problem?
The code which I am using is :
for train-test split:
from sklearn.model_selection import train_test_split
np.random.seed(0)
df_train, df_test = train_test_split(main, train_size = 0.7, test_size = 0.3, random_state = 100)
On running the below code :
main.columns[main.nunique() == 1]
The result is : Index([], dtype='object')
And when running the below code for train data :
df_train.columns[df_train.nunique() == 1]
The result is : Index(['A', 'D', 'S'], dtype='object')
I want the resulting train set to contain every value of each feature. However, this split gives me only one value in some features.
Edit: I checked the unique values in these columns. They are highly unbalanced, with only one row in the positive class. I tried stratify, but it needs at least two rows of the positive class, and many columns are like this. So I cannot handle these columns separately in the train dataset, as that would require writing code for every column. I want this to be done automatically.
Have you tried changing the random_state value?
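The suggestion to change random_state can be automated. The sketch below tries successive seeds until no column in the train set is constant. The helper split_without_constant_columns and the stand-in data are hypothetical, not part of scikit-learn or the question.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for `main`: one balanced dummy and one rare dummy.
rng = np.random.default_rng(0)
main = pd.DataFrame({
    "A": rng.integers(0, 2, size=50),
    "D": [1, 1] + [0] * 48,  # only two positive rows
    "y": rng.normal(size=50),
})

def split_without_constant_columns(data, max_tries=100):
    """Try successive random_state values until no train column is constant."""
    for seed in range(max_tries):
        train, test = train_test_split(data, train_size=0.7, random_state=seed)
        if (train.nunique() > 1).all():
            return train, test, seed
    raise ValueError("no seed in the tried range produced a usable split")

df_train, df_test, seed = split_without_constant_columns(main)
print(seed, list(df_train.columns[df_train.nunique() == 1]))
```

This only guarantees variation in the train set. A dummy with a single positive row will then be all zeros in the test set, so dropping or merging such rare categories may be the more robust choice.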

Selection column in a dataframe in pandas apply min function

I have n dataframes in a list:
df=[df_1, df_2, df_3, ...., df_n]
Each df_n is a pandas DataFrame and corresponds to one input variable of my Keras model.
X_train=[df_1_1,df_2_1,...,df_n_1]
Where:
df_1_1 is the first column of the first dataframe in the list (the first variable); each dataframe has m columns.
Each column of a dataframe is the same variable with a different type of smoothing or filter applied.
I have 100 columns in each dataframe, and I want to select the combination of columns (from the different dataframes), i.e. the X_train, that gives the minimum value of my model's score.
score = model.evaluate(X_test,Y_test)
X_test and Y_test are the last n occurrences of the selected columns.
Is there a library for selecting these columns (neural networks, GA, ant colony, ...)?
How can I implement it?
What is your prediction task? Do you need a neural network or not? You are essentially looking at a feature selection problem here. You could use simpler models such as the lasso which will select columns using L1-regularization. Or you could use an ensembling technique such as random forests and consider the relative feature importances to select your columns. Perhaps have a look at scikit-learn.
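The two options mentioned above can be sketched with scikit-learn as follows. The data is synthetic and the hyperparameters (alpha, n_estimators) are arbitrary illustrative choices, not tuned values.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso

# Synthetic data: only columns 0 and 3 drive the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + rng.normal(scale=0.1, size=200)

# Lasso: L1 regularization shrinks coefficients of unhelpful columns to zero,
# so the columns with non-zero coefficients are the selected ones.
lasso = Lasso(alpha=0.1).fit(X, y)
lasso_selected = np.flatnonzero(lasso.coef_)

# Random forest: rank columns by impurity-based feature importance.
forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
forest_top2 = np.argsort(forest.feature_importances_)[::-1][:2]

print(lasso_selected, forest_top2)
```

Both methods score columns against a single target, so they would be applied to the candidate columns of all dataframes concatenated into one matrix, rather than by retraining the Keras model for every combination.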

tensorflow wide model: how to use one-hot feature?

I have read about the model at https://www.tensorflow.org/versions/r0.9/tutorials/wide_and_deep/index.html
The features in the article are of two types: categorical and continuous.
In my case, I have a column that holds the user ID, ranging from 0 to 10000000.
I treat this column as categorical and use a hash bucket, but only get a poor AUC of about 0.50010.
1) Do I need to one-hot encode this ID column?
2) If so, how do I do it? I found "tf.contrib.layers.one_hot_encoding", but it does not support column names, so it cannot be used in the wide-n-deep demo.
No, you don't need to encode the UserID column. Each value is unique, so it is not really a categorical value. One-hot encoding makes sense when there are fewer than 1000 categories.
To answer your question on how to use one_hot_encoding, assuming you have a list of labels (note that they must be integers; this uses the TensorFlow 1.x API, and tf.contrib was removed in TensorFlow 2):
import tensorflow as tf
with tf.Session() as sess:
    labels = [0, 1, 2, 3]
    labels_t = tf.constant(labels)
    num_classes = len(labels)
    one_hot = tf.contrib.layers.one_hot_encoding(labels_t, num_classes=num_classes)
    print(one_hot.eval())
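To see what the encoding produces without a TensorFlow session, here is a NumPy-only sketch of the same transformation. It is an illustration of one-hot encoding in general, not the TensorFlow API.

```python
import numpy as np

# Integer labels, as in the snippet above.
labels = np.array([0, 1, 2, 3])
num_classes = int(labels.max()) + 1

# Row i of the identity matrix is the one-hot vector for class i,
# so indexing the identity with the labels encodes all of them at once.
one_hot = np.eye(num_classes, dtype=int)[labels]

print(one_hot)
```

With 10000000 distinct user IDs each row would have 10000000 entries, which is why a dense one-hot encoding of such a column is impractical.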