not able to convert string to float in python and how to train the model with this dataset - pandas

I have a dataset with columns: age (float type), gender (str type), regions (str type) and charges(float type).
I want to predict charges using age gender and region as features, how can I do that in scikit learn?
I have tried something but it shows "ValueError: could not convert string to float: 'northwest' "
import pandas as pd
import numpy as np
df = pd.read_csv('Desktop/insurance.csv')
X = df.loc[:,['age','sex','region']].values
y = df.loc[:,['charges']].values
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
from sklearn import svm
clf = svm.SVC(C=1.0, cache_size=200,decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf')
clf.fit(X_train, y_train)

The column region contains strings, which can't be used as such in the SVM classifier as it is not a vector.
Threfore you have to turn this column into something that is usable by the SVM. Here is an example by changing region into a categorical series:
import pandas as pd
from sklearn import svm
from sklearn.model_selection import train_test_split
df = pd.DataFrame({'age':[20,30,40,50],
'sex':['male','female','female','male'],
'region':['northwest','southwest','northeast','southeast'],
'charges':[1000,1000,2000,2000]})
df.sex = (df.sex == 'female')
df.region = pd.Categorical(df.region)
df.region = df.region.cat.codes
X = df.loc[:,['age','sex','region']]
y = df.loc[:,['charges']]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
clf = svm.SVC(C=1.0, cache_size=200,decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf')
clf.fit(X_train, y_train)
Another way to approach this problem is to use one-hot vector encoding:
import pandas as pd
from sklearn import svm
from sklearn.model_selection import train_test_split
df = pd.DataFrame({'age':[20,30,40,50],
'sex':['male','female','female','male'],
'region':['northwest','southwest','northeast','southeast'],
'charges':[1000,1000,2000,2000]})
df.sex = (df.sex == 'female')
df = pd.concat([df,pd.get_dummies(df.region)],axis = 1).drop('region',1)
X = df.drop('charges',1)
y = df.charges
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
clf = svm.SVC(C=1.0, cache_size=200,decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf')
clf.fit(X_train, y_train)
Yet another approach is to perform label encoding:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df.region = le.fit_transform(df.region)
This list of methods is of course non-exhaustive, and they perform differently according to your problem.
The use of non-numeric data is a non-trivial one, and requires a bit of knowledge on the existing techniques (I encourage you to go and search in kaggle's forums where you can find valuable informations).

Related

I'm having problems with one-hot encoding

I am using logistic regression for a football dataset, but it seems when i try to one-hot encode the home team names and away team names it gives the model a 100% accuracy, even when doing a train_test_split i still get 100. What am i doing wrong?
from sklearn.linear_model
import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.preprocessing import OneHotEncoder
import pandas as pd
import numpy as np
df = pd.read_csv("FIN.csv")
df['Date'] = pd.to_datetime(df["Date"])
df = df[(df["Date"] > '2020/04/01')]
df['BTTS'] = np.where((df.HG > 0) & (df.AG > 0), 1, 0)
#print(df.to_string())
df.dropna(inplace=True)
x = df[['Home', 'Away', 'Res', 'HG', 'AG', 'PH', 'PD', 'PA', 'MaxH', 'MaxD', 'MaxA', 'AvgH', 'AvgD', 'AvgA']].values
y = df['BTTS'].values
np.set_printoptions(threshold=np.inf)
model = LogisticRegression()
ohe = OneHotEncoder(categories=[df.Home, df.Away, df.Res], sparse=False)
x = ohe.fit_transform(x)
print(x)
model.fit(x, y)
print(model.score(x, y))
x_train, x_test, y_train, y_test = train_test_split(x, y, shuffle=False)
model.fit(x_train, y_train)
print(model.score(x_test, y_test))
y_pred = model.predict(x_test)
print("accuracy:",
accuracy_score(y_test, y_pred))
print("precision:", precision_score(y_test, y_pred))
print("recall:", recall_score(y_test, y_pred))
print("f1 score:", f1_score(y_test, y_pred))
Overfitting would be a situation where your training accuracy is very high, and your test accuracy is very low. That means it's "over fitting" because it essentially just learns what the outcome will be on the training, but doesn't fit well on new, unseen data.
The reason you are getting 100% accuracy is precisely as I stated in the comments, there's a (for lack of a better term) data leakage. You are essentially allowing your model to "cheat". Your target variable y (which is 'BTTS') is feature engineered by the data. It is derived from 'HG' and 'AG', and thus are highly (100%) correlated/associated to your target. You define 'BTTS' as 1 when both 'HG' and 'AG' are greater than 1. And then you have those 2 columns included in your training data. So the model simply picked up that obvious association (Ie, when the home goals is 1 or more, and the away goals are 1 or more -> Both teams scored).
Once the model sees those 2 values greater than 0, it predicts 1, if one of those values is 0, it predicts 0.
Drop 'HG' and 'AG' from the x (features).
Once we remove those 2 columns, you'll see a more realistic performance (albeit poor - slightly better than a flip of the coin) here:
1.0
0.5625
accuracy: 0.5625
precision: 0.6666666666666666
recall: 0.4444444444444444
f1 score: 0.5333333333333333
With the Confusion Matrix:
from sklearn.metrics import confusion_matrix
labels = labels = np.unique(y).tolist()
cf_matrixGNB = confusion_matrix(y_test, y_pred, labels=labels)
import seaborn as sns
import matplotlib.pyplot as plt
ax = sns.heatmap(cf_matrixGNB, annot=True,
cmap='Blues')
ax.set_title('Confusion Matrix\n');
ax.set_xlabel('\nPredicted Values')
ax.set_ylabel('Actual Values ');
ax.xaxis.set_ticklabels(labels)
ax.yaxis.set_ticklabels(labels)
plt.show()
Another option would be to do a calculated field of 'Total_Goals', then see if it can predict on that. Obviously again, it has a little help in the obvious (if 'Total_Goals' is 0 or 1, then 'BTTS' will be 0.). But then if 'Total_Goals' is 2 or more, it'll have to rely on the other features to try to work out if one of the teams got shut out.
Here's that example:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.preprocessing import OneHotEncoder
import pandas as pd
import numpy as np
df = pd.read_csv("FIN.csv")
df['Date'] = pd.to_datetime(df["Date"])
df = df[(df["Date"] > '2020/04/01')]
df['BTTS'] = np.where((df.HG > 0) & (df.AG > 0), 1, 0)
#print(df.to_string())
df.dropna(inplace=True)
df['Total_Goals'] = df['HG'] + df['AG']
x = df[['Home', 'Away', 'Res', 'Total_Goals', 'PH', 'PD', 'PA', 'MaxH', 'MaxD', 'MaxA', 'AvgH', 'AvgD', 'AvgA']].values
y = df['BTTS'].values
np.set_printoptions(threshold=np.inf)
model = LogisticRegression()
ohe = OneHotEncoder(sparse=False)
x = ohe.fit_transform(x)
#print(x)
model.fit(x, y)
print(model.score(x, y))
x_train, x_test, y_train, y_test = train_test_split(x, y, shuffle=False)
model.fit(x_train, y_train)
print(model.score(x_test, y_test))
y_pred = model.predict(x_test)
print("accuracy:",
accuracy_score(y_test, y_pred))
print("precision:", precision_score(y_test, y_pred))
print("recall:", recall_score(y_test, y_pred))
print("f1 score:", f1_score(y_test, y_pred))
from sklearn.metrics import confusion_matrix
labels = np.unique(y).tolist()
cf_matrixGNB = confusion_matrix(y_test, y_pred, labels=labels)
import seaborn as sns
import matplotlib.pyplot as plt
ax = sns.heatmap(cf_matrixGNB, annot=True,
cmap='Blues')
ax.set_title('Confusion Matrix\n');
ax.set_xlabel('\nPredicted Values')
ax.set_ylabel('Actual Values ');
ax.xaxis.set_ticklabels(labels)
ax.yaxis.set_ticklabels(labels)
plt.show()
Output:
1.0
0.8
accuracy: 0.8
precision: 0.8536585365853658
recall: 0.7777777777777778
f1 score: 0.8139534883720929
To Predict on new data, you need the new data in the form of the training data. You then also need to apply any transformations you fit on the trianing data, to transform on the new data:
new_data = pd.DataFrame(
data = [['Haka', 'Mariehamn', 3.05, 3.66, 2.35, 3.05, 3.66, 2.52, 2.88, 3.48, 2.32]],
columns = ['Home', 'Away', 'PH', 'PD', 'PA', 'MaxH', 'MaxD', 'MaxA', 'AvgH', 'AvgD', 'AvgA']
)
to_predcit = new_data[['Home', 'Away', 'PH', 'PD', 'PA', 'MaxH', 'MaxD', 'MaxA', 'AvgH', 'AvgD', 'AvgA']]
to_predict_encoded = ohe.transform(to_predcit)
prediction = model.predict(to_predict_encoded)
prediction_prob = model.predict_proba(to_predict_encoded)
print(f'Predict: {prediction[0]} with {prediction_prob[0][0]} probability.')
Output:
Predict: 0 with 0.8204957018099501 probability.

How to match dimensions in CNN

I'm trying to build a CNN, where the goal is from 3 features to predict the label, but is giving an error of dimension.
Could someone help me?
updated after comments from #M.Innat
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.model_selection import train_test_split
from tensorflow.keras.layers import Dense, Conv2D, Dropout, Flatten, MaxPooling2D
from tensorflow.keras.models import Sequential, load_model
from sklearn.metrics import accuracy_score, f1_score, mean_absolute_error
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.optimizers import Adam
from sklearn import metrics
import tensorflow as tf
import random
# Create data
n = 8500
l = [2, 3, 4, 5,6]
k = int(np.ceil(n/len(l)))
labels = [item for item in l for i in range(k)]
random.shuffle(labels,random.random)
labels =np.array(labels)
label_unique = np.unique(labels)
x = np.linspace(613000, 615000, num=n) + np.random.uniform(-5, 5, size=n)
y = np.linspace(7763800, 7765800, num=n) + np.random.uniform(-5, 5, size=n)
z = np.linspace(1230, 1260, num=n) + np.random.uniform(-5, 5, size=n)
X = np.column_stack((x,y,z))
Y = labels
# Split the dataset into training and testing.
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=1234)
seq_len=len(X_train)
n_features=len(X_train[0])
droprate=0.1
exit_un=len(label_unique)
seq_len=len(X_train)
n_features=len(X_train[0])
droprate=0.1
exit_un=len(label_unique)
print('n_features: {} \n seq_len: {} \n exit_un: {}'.format(n_features,seq_len,exit_un))
X_train = X_train[..., None][None, ...] # add channel axis+batch aix
Y_train = pd.get_dummies(Y_train) # transform to one-hot encoded
drop_prob = 0.5
my_model = Sequential()
my_model.add(Conv2D(input_shape=(seq_len,n_features,1),filters=32,kernel_size=(3,3),padding='same',activation="relu")) # 1 channel of grayscale.
my_model.add(MaxPooling2D(pool_size=(2,1)))
my_model.add(Conv2D(filters=64,kernel_size=(5,5), padding='same',activation="relu"))
my_model.add(MaxPooling2D(pool_size=(2,1)))
my_model.add(Flatten())
my_model.add(Dense(units = 1024, activation="relu"))
my_model.add(Dropout(rate=drop_prob))
my_model.add(Dense(units = exit_un, activation="softmax"))
n_epochs = 100
batch_size = 10
learn_rate = 0.005
# Define the optimizer and then compile.
my_optimizer=Adam(lr=learn_rate)
my_model.compile(loss = "categorical_crossentropy", optimizer = my_optimizer, metrics=['categorical_crossentropy','accuracy'])
my_summary = my_model.fit(X_train, Y_train, epochs=n_epochs, batch_size = batch_size, verbose = 1)
The error I have is:
ValueError: Data cardinality is ambiguous:
x sizes: 1
y sizes: 5950
Make sure all arrays contain the same number of samples.
You're passing the input sample without the channel axis and also the batch axis. Also, according to your loss function, you should transform your integer label to one-hot encoded.
exit_un=len(label_unique)
drop_prob = 0.5
X_train = X_train[..., None][None, ...] # add channel axis+batch aix
X_train = np.repeat(X_train, repeats=100, axis=0) # batch-ing
Y_train = np.repeat(Y_train, repeats=100, axis=0) # batch-ing
Y_train = pd.get_dummies(Y_train) # transform to one-hot encoded
print(X_train.shape, Y_train.shape)
my_model = Sequential()
...
update
Based on the discussion, it seems like you need the conv1d operation in the modeling time and need to reshape your sample as mentioned in the comment. Here is the colab, it should work now.

Getting Errors with StandardScaler Python

I am trying to scale my training and test data for a Logistic Regression but an error popped-up.
I implemented the answer in this stack: How to standard scale a 3D matrix?
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
df = pd.read_csv('GonzagaTakers.csv')
x_df=df.loc[:, df.columns !='Remarks_P']
features = x_df.keys()
target = 'Remarks_P'
X = df[features]
y = df[target]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=101)
scalers ={}
for i in range(X_train.shape[1]):
scalers[i] = StandardScaler()
X_train[:, :, i]=scalers[i].fit_transform(X_train[:, :, i])
for i in range(y_test.shape[1]):
y_test[:,:,i]=scalers[i].fit_transform(y_test[:,:,i])
_model = LogisticRegression(class_weight='balanced')
_model.fit(X_train, y_train)
accuracy = _model.score(X_test, y_test) * 100
Error occurs in this line
X_train[:, :, i]=scalers[i].fit_transform(X_train[:, :, i])
TypeError: '(slice(None, None, None), slice(None, None, None), 0)' is
an invalid key

how to create tf.feature_columns with data have no header(csv file)?

I am dealing with multi-class_classification_of_handwritten_digits in the following link google colab
Then I tried to put the code in my way to re write, feed and train the DNN.
Due to the csv file has no header I am not able to create my feature columns, so I cannot train my model.
Can you please help me to figure out how it has been done in the link or how it need to be for my code? Thanks in advance.
import pandas as pd
import seaborn as sns
import tensorflow as tf
mnist_df = pd.read_csv("https://download.mlcc.google.com/mledu-datasets/mnist_train_small.csv",header=None)
mnist_df.columns
hand_df = mnist_df[0]
hand_df.head()
matrix_df = mnist_df.drop([0],axis=1)
matrix_df.head()
mnist_df = mnist_df.head(10000)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(matrix_df, hand_df, test_size=0.3, random_state=101)
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
matrix_df = pd.DataFrame(data=scaler.fit_transform(matrix_df),
columns=matrix_df.columns,
index=matrix_df.index)
input_func = tf.estimator.inputs.pandas_input_fn(x=X_train,y=y_train,
batch_size=10,
num_epochs=1000,
shuffle=True)
my_optimizer = tf.train.AdagradOptimizer(learning_rate=0.03)
my_optimizer = tf.contrib.estimator.clip_gradients_by_norm(my_optimizer, 5.0)
model = tf.estimator.LinearClassifier(feature_columns=feat_cols,
n_classes=10,
optimizer=my_optimizer,
config=tf.estimator.RunConfig(keep_checkpoint_max=1))
model.train(input_fn=input_func,steps=1000)
The example code is already splitting the dataset into training and validation sets.
And I don't think this has anything to do with the header in the CSV.
training_targets, training_examples = parse_labels_and_features(mnist_dataframe[:7500])
validation_targets, validation_examples = parse_labels_and_features(mnist_dataframe[7500:10000])
So the training code is here separately.
import pandas as pd
import tensorflow as tf
from tensorflow.python.data import Dataset
import numpy as np
mnist_df = pd.read_csv("https://download.mlcc.google.com/mledu-datasets/mnist_train_small.csv",sep=",",header=None)
mnist_df = mnist_df.head(10000)
dataset = mnist_df[:7500]
labels = dataset[0]
print ( labels.shape )
# DataFrame.loc index ranges are inclusive at both ends.
features = dataset.loc[:, 1:784]
print ( features.shape )
# Scale the data to [0, 1] by dividing out the max value, 255.
features = features / 255
def create_training_input_fn(feature, label, batch_size, num_epochs=None, shuffle=True):
"""A custom input_fn for sending MNIST data to the estimator for training.
Args:
features: The training features.
labels: The training labels.
batch_size: Batch size to use during training.
Returns:
A function that returns batches of training features and labels during
training.
"""
def _input_fn(num_epochs=None, shuffle=True):
# Input pipelines are reset with each call to .train(). To ensure model
# gets a good sampling of data, even when number of steps is small, we
# shuffle all the data before creating the Dataset object
idx = np.random.permutation(feature.index)
raw_features = {"pixels": feature.reindex(idx)}
raw_targets = np.array(label[idx])
ds = Dataset.from_tensor_slices((raw_features, raw_targets)) # warning: 2GB limit
ds = ds.batch(batch_size).repeat(num_epochs)
if shuffle:
ds = ds.shuffle(10000)
# Return the next batch of data.
feature_batch, label_batch = ds.make_one_shot_iterator().get_next()
return feature_batch, label_batch
return _input_fn
my_optimizer = tf.train.AdagradOptimizer(learning_rate=0.03)
my_optimizer = tf.contrib.estimator.clip_gradients_by_norm(my_optimizer, 5.0)
model = tf.estimator.LinearClassifier(feature_columns=set([tf.feature_column.numeric_column('pixels', shape=784)]),
n_classes=10,
optimizer=my_optimizer,
config=tf.estimator.RunConfig(keep_checkpoint_max=1))
model.train(input_fn=create_training_input_fn(features, labels, batch_size=10),steps=1000)
Similarly you have a function for preparing the validation set for prediction. You could use this pattern as it is.
But if you are splitting the dataframe using train_test_split you can try this.
X_train, X_test = train_test_split(mnist_df, test_size=0.2)
You have to repeat the following procedure for X_test as well to get the validation features and labels.
X_train_labels = X_train[0]
print ( X_train_labels.shape )
# DataFrame.loc index ranges are inclusive at both ends.
X_train_features = X_train.loc[:, 1:784]
print ( X_train_features.shape )
# Scale the data to [0, 1] by dividing out the max value, 255.
X_train_features = X_train_features / 255
Rather than trying to find a way to use data without any column names, I have had an idea that :) I have named all my columns and append them into cols=[] then it was easy to assign and use by feature_columns = cols.
Here is my full working code for my own question.
Thanks.
import numpy as np
import pandas as pd
import seaborn as sns
import tensorflow as tf
from sklearn import metrics
from tensorflow.python.data import Dataset
mnist_df = pd.read_csv("https://download.mlcc.google.com/mledu-datasets/mnist_train_small.csv",header=None)
mnist_df.describe()
mnist_df.columns
hand_df = mnist_df[0]
matrix_df = mnist_df.drop([0],axis=1)
matrix_df.head()
hand_df.head()
#creating cols array and append a1 to a784 in order to name columns
cols=[]
for i in range(785):
if i!=0:
a = '{}{}'.format('a',i)
cols.append(a)
matrix_df.columns = cols
mnist_df = mnist_df.head(10000)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(matrix_df, hand_df, test_size=0.3, random_state=101)
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
matrix_df = pd.DataFrame(data=scaler.fit_transform(matrix_df),
columns=matrix_df.columns,
index=matrix_df.index)
#naming columns so I will not get error while assigning feature_columns
for i in range(len(cols)):
a=i+1
b='{}{}'.format('a',a)
cols[i] = tf.feature_column.numeric_column(str(b))
matrix_df.head()
input_func = tf.estimator.inputs.pandas_input_fn(x=X_train,y=y_train,
batch_size=10,num_epochs=1000,
shuffle=True)
my_optimizer = tf.train.AdagradOptimizer(learning_rate=0.03)
my_optimizer = tf.contrib.estimator.clip_gradients_by_norm(my_optimizer, 5.0)
model = tf.estimator.DNNClassifier(feature_columns=cols,
hidden_units=[32,64],
n_classes=10,
optimizer=my_optimizer,
config=tf.estimator.RunConfig(keep_checkpoint_max=1))
model.train(input_fn=input_func,steps=1000)
predict_input_func = tf.estimator.inputs.pandas_input_fn(x=X_test,
batch_size=50,
num_epochs=1,
shuffle=False)
pred_gen = model.predict(predict_input_func)
predictions = list(pred_gen)
predictions[0]

Unable to use FeatureUnion to combine processed numeric and categorical features in Python

I am trying to use Age and Gender to predict Med, but I am new to Pipeline and FeatureUnion of Scikit-learn, and encountered some issue. I read through some tutorial and answer, and that's how I wrote the codes below, but I don't have a good grasp on how to feed the split data into the pipeline functions.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LogisticRegression
from sklearn import svm
from sklearn.pipeline import Pipeline, FeatureUnion, make_pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score, classification_report
from sklearn.externals import joblib
from sklearn.metrics import confusion_matrix
# Import data into Pandas data frame
data_directory = 'C:/Users/Asus/'
file_name = 'Example.csv'
df = pd.read_csv(data_directory + file_name)
df_len = len(df)
# Get a lit of all variables
print (list(df))
# Class that identifies Column type
class Columns(BaseEstimator, TransformerMixin):
def __init__(self, names=None):
self.names = names
def fit (self, X, y=None, **fit_params):
return self
def transform(self, X):
return X[self.names]
numeric = [] # list of numeric column names
categorical = [] # list of categorical column names
# Creating random subsample for fast model building
def sample_n(df, n, replace=False, weight=None, seed=None):
"""Sample n rows from a DataFrame at random"""
rs = np.random.RandomState(seed)
locs = rs.choice(df.shape[0], size=n, replace=replace, p=weight)
return df.take(locs, axis=0)
df = sample_n(df, n=300, seed=1123)
# Merge FG-LAI, SG-LAI and Both-LAI together into one group (MED=3)
df.ix[(df['MED']==4)|(df['MED']==5), 'MED']=3
# Remove No-Med (MED=1) and Both-LAI (MED=5) cases
df = df.drop(df[(df['MED']==1)|(df['MED']==5)].index)
# Separate target from training features
y = df['MED']
X = df.drop('MED', axis=1)
# Retain only the needed predictors
X = X.filter(['age', 'gender'])
# Find the numerical columns, exclude categorical columns
X_num_cols = X.columns[X.dtypes.apply(lambda c: np.issubdtype(c, np.number))]
# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y,
test_size=0.5,
random_state=567,
stratify=y)
# Pipeline
pipe = Pipeline([
("features", FeatureUnion([
('numeric', make_pipeline(Columns(names=numeric),StandardScaler())),
('categorical', make_pipeline(Columns(names=categorical),OneHotEncoder(sparse=False)))
])),
('model', LogisticRegression())
])
# Declare hyperparameters
hyperparameters = {'logisticregression__c' : [0.01, 0.1, 1.0, 10.0],
'logisticregression__penalty' : ['l1', 'l2'],
'logisticregression__multi_class': ['ovr'],
'logisticregression__class_weight': ['balanced', None],
}
# SKlearn cross-validation with pipeline
clf = GridSearchCV(pipe, hyperparameters, cv=10)
# Fit and tune model
clf.fit(X_train, y_train)
Errors:
ValueError: Invalid parameter logisticregression for estimator Pipeline(memory=None,
steps=[('features', FeatureUnion(n_jobs=1,
transformer_list=[('numeric', Pipeline(memory=None,
steps=[('columns', Columns(names=[])), ('standardscaler', StandardScaler(copy=True, with_mean=True, with_std=True))])), ('categorical', Pipeline(memory=None,
steps=[('columns', Columns(nam...ty='l2', random_state=None, solver='liblinear', tol=0.0001,
verbose=0, warm_start=False))]). Check the list of available parameters with `estimator.get_params().keys()`.
Edits:
print (pipe.get_params().keys())
gives
dict_keys(['memory', 'steps', 'features', 'LR_model', 'features__n_jobs', 'features__transformer_list', 'features__transformer_weights', 'features__numeric', 'features__categorical', 'features__numeric__memory', 'features__numeric__steps', 'features__numeric__columns', 'features__numeric__standardscaler', 'features__numeric__columns__names', 'features__numeric__standardscaler__copy', 'features__numeric__standardscaler__with_mean', 'features__numeric__standardscaler__with_std', 'features__categorical__memory', 'features__categorical__steps', 'features__categorical__columns', 'features__categorical__onehotencoder', 'features__categorical__columns__names', 'features__categorical__onehotencoder__categorical_features', 'features__categorical__onehotencoder__dtype', 'features__categorical__onehotencoder__handle_unknown', 'features__categorical__onehotencoder__n_values', 'features__categorical__onehotencoder__sparse', 'LR_model__C', 'LR_model__class_weight', 'LR_model__dual', 'LR_model__fit_intercept', 'LR_model__intercept_scaling', 'LR_model__max_iter', 'LR_model__multi_class', 'LR_model__n_jobs', 'LR_model__penalty', 'LR_model__random_state', 'LR_model__solver', 'LR_model__tol', 'LR_model__verbose', 'LR_model__warm_start'])
After changing into 'model__', I am getting the new error:
ValueError: Found array with 0 feature(s) (shape=(109, 0)) while a minimum of 1 is required by StandardScaler.
Edits 2:
# Retain only the needed predictors
#X = X.filter(['age', 'ccis', 'num_claims', 'Prior_DIH', 'prior_ED_num'])
X_selected = X.filter(['age', 'Geo', 'ccis', 'num_claims', 'Prior_DIH', 'prior_ED_num',
'DAD_readmit', 'Num_DAD_readmit', 'ED_readmit', 'NUmber_ED_readmit'
'Fail_renew', 'FR_num'])
# from the selected X, further choose categorical only
X_selected_cat = X_selected.filter(['Geo', 'ccis']) # hand selected since some cat var has value 0, 1
# Find the numerical columns, exclude categorical columns
X_num_cols = X_selected.columns[X_selected.dtypes.apply(lambda c: np.issubdtype(c, np.number))] # list of numeric column names, automated here
X_cat_cols = X_selected_cat.columns # list of categorical column names, previously hand-slected
# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X_selected, y,
test_size=0.5,
random_state=567,
stratify=y)
# Pipeline
pipe = Pipeline([
("features", FeatureUnion([
('numeric', make_pipeline(Columns(names=X_num_cols),StandardScaler())),
('categorical', make_pipeline(Columns(names=X_cat_cols),OneHotEncoder(sparse=False)))
])),
('LR_model', LogisticRegression())
])
Errors:
ValueError: could not convert string to float: 'Urban'
The input array of OneHotEncoder is int but you provided string to it. You could use LabelEncoder or LabelBinarizer to convert string to int. Then, you will be allowed to use OneHotEncoder.
pipe = Pipeline([
("features", FeatureUnion([
('numeric', make_pipeline(Columns(names=X_num_cols),StandardScaler())),
('categorical', make_pipeline(Columns(names=X_cat_cols),LabelEncoder(), OneHotEncoder(sparse=False)))
])),
('LR_model', LogisticRegression())
])