JupyterLab with Naive Bayes - pandas

I am trying to detect Quillbot paraphrasing using Naive Bayes in a Jupyter notebook. My dataset has two columns: the first contains a sample text of around 250 words drawn from various sources, and the second is the type, set to 1 if the text is paraphrased or 0 if it is the original. I've got this code so far but I am getting an error. Here is a link to my dataset: https://pastebin.com/ts8SLGHq and here is my code:
from pydataset import data
import pandas as pd
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.feature_extraction.text import CountVectorizer
# read the CSV file into a DataFrame
df = pd.read_csv('Desktop/Dataset.csv', encoding='utf-8')
# specify the column names
df.columns = ['text', 'type']
# split the dataset into train and test sets
train_df, test_df = train_test_split(df, test_size=0.3, random_state=42)
# convert text data into numerical features
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_df['text'])
X_test = vectorizer.transform(test_df['text'])
y_train = train_df['type'].values
y_test = test_df['type'].values
# initialize a Gaussian Naive Bayes model
model = GaussianNB()
# train the model on the train set
model.fit(X_train, y_train)
# evaluate the model on the test set
accuracy = model.score(X_test, y_test)
print("Model accuracy on test set: {:.2f}%".format(accuracy * 100))
and here is the error I am getting:
```
TypeError Traceback (most recent call last)
Cell In[184], line 5
2 model = GaussianNB()
4 # train the model on the train set
----> 5 model.fit(X_train, y_train)
File ~\AppData\Local\Programs\Python\Python39\lib\site-packages\sklearn\naive_bayes.py:267, in GaussianNB.fit(self, X, y, sample_weight)
265 self._validate_params()
266 y = self._validate_data(y=y)
--> 267 return self._partial_fit(
268 X, y, np.unique(y), _refit=True, sample_weight=sample_weight
269 )
File ~\AppData\Local\Programs\Python\Python39\lib\site-packages\sklearn\naive_bayes.py:428, in GaussianNB._partial_fit(self, X, y, classes, _refit, sample_weight)
425 self.classes_ = None
427 first_call = _check_partial_fit_first_call(self, classes)
--> 428 X, y = self._validate_data(X, y, reset=first_call)
429 if sample_weight is not None:
430 sample_weight = _check_sample_weight(sample_weight, X)
File ~\AppData\Local\Programs\Python\Python39\lib\site-packages\sklearn\base.py:565, in BaseEstimator._validate_data(self, X, y, reset, validate_separately, **check_params)
563 y = check_array(y, input_name="y", **check_y_params)
564 else:
--> 565 X, y = check_X_y(X, y, **check_params)
566 out = X, y
568 if not no_val_X and check_params.get("ensure_2d", True):
File ~\AppData\Local\Programs\Python\Python39\lib\site-packages\sklearn\utils\validation.py:1106, in check_X_y(X, y, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, multi_output, ensure_min_samples, ensure_min_features, y_numeric, estimator)
1101 estimator_name = _check_estimator_name(estimator)
1102 raise ValueError(
1103 f"{estimator_name} requires y to be passed, but the target y is None"
1104 )
-> 1106 X = check_array(
1107 X,
1108 accept_sparse=accept_sparse,
1109 accept_large_sparse=accept_large_sparse,
1110 dtype=dtype,
1111 order=order,
1112 copy=copy,
1113 force_all_finite=force_all_finite,
1114 ensure_2d=ensure_2d,
1115 allow_nd=allow_nd,
1116 ensure_min_samples=ensure_min_samples,
1117 ensure_min_features=ensure_min_features,
1118 estimator=estimator,
1119 input_name="X",
1120 )
1122 y = _check_y(y, multi_output=multi_output, y_numeric=y_numeric, estimator=estimator)
1124 check_consistent_length(X, y)
File ~\AppData\Local\Programs\Python\Python39\lib\site-packages\sklearn\utils\validation.py:845, in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator, input_name)
843 if sp.issparse(array):
844 _ensure_no_complex_data(array)
--> 845 array = _ensure_sparse_format(
846 array,
847 accept_sparse=accept_sparse,
848 dtype=dtype,
849 copy=copy,
850 force_all_finite=force_all_finite,
851 accept_large_sparse=accept_large_sparse,
852 estimator_name=estimator_name,
853 input_name=input_name,
854 )
855 else:
856 # If np.array(..) gives ComplexWarning, then we convert the warning
857 # to an error. This is needed because specifying a non complex
858 # dtype to the function converts complex to real dtype,
859 # thereby passing the test made in the lines following the scope
860 # of warnings context manager.
861 with warnings.catch_warnings():
File ~\AppData\Local\Programs\Python\Python39\lib\site-packages\sklearn\utils\validation.py:522, in _ensure_sparse_format(spmatrix, accept_sparse, dtype, copy, force_all_finite, accept_large_sparse, estimator_name, input_name)
519 _check_large_sparse(spmatrix, accept_large_sparse)
521 if accept_sparse is False:
--> 522 raise TypeError(
523 "A sparse matrix was passed, but dense "
524 "data is required. Use X.toarray() to "
525 "convert to a dense numpy array."
526 )
527 elif isinstance(accept_sparse, (list, tuple)):
528 if len(accept_sparse) == 0:
TypeError: A sparse matrix was passed, but dense data is required. Use X.toarray() to convert to a dense numpy array.
```
Any help would be greatly appreciated.
I first tried to convert the text data into numerical features with CountVectorizer, but either I did that wrong or there is another issue.

Convert the sparse matrix to a dense NumPy array
CountVectorizer returns a SciPy sparse matrix, and GaussianNB only accepts dense arrays, which is exactly what the TypeError says. Call X_train.toarray() (or X_train.todense()) before fitting, or switch to MultinomialNB, which accepts sparse count features directly and is the usual choice for text classification.
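As a minimal sketch of both options, reusing the X_train, X_test, y_train and y_test variables built above (the dataset is small enough that densifying is not a problem):
```python
from sklearn.naive_bayes import GaussianNB, MultinomialNB

# Option 1: densify the count matrix so GaussianNB will accept it
gnb = GaussianNB()
gnb.fit(X_train.toarray(), y_train)
print("GaussianNB accuracy: {:.2f}%".format(gnb.score(X_test.toarray(), y_test) * 100))

# Option 2 (usually better for word counts): MultinomialNB handles sparse input natively
mnb = MultinomialNB()
mnb.fit(X_train, y_train)
print("MultinomialNB accuracy: {:.2f}%".format(mnb.score(X_test, y_test) * 100))
```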

Related

GridSearchCV with KerasClassifier causes Could not pickle the task to send it to the workers with an adapted Keras layer

Question
GridSearchCV via KerasClassifier causes the error below when a Keras Normalization layer has been adapted to the data. Without the adapted Normalization, it works. The reason for using Normalization is that it gave better results than simply dividing by 255.0.
PicklingError: Could not pickle the task to send it to the workers.
Workaround
Setting n_jobs=1 to disable multiprocessing makes it work, but running single-threaded is perhaps not of much use.
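For reference, a sketch of that workaround (model, param_grid, x_train and y_train are the ones defined in the Code section below):
```python
from sklearn.model_selection import GridSearchCV

# n_jobs=1 keeps everything in one process, so the adapted Normalization layer
# never has to be pickled and shipped to worker processes.
grid = GridSearchCV(estimator=model, param_grid=param_grid, n_jobs=1, cv=3)
grid_result = grid.fit(x_train, y_train)
```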
Environment
Python 3.9.13
TensorFlow version: 2.10.0
Eager execution is: True
Keras version: 2.10.0
sklearn version: 1.1.3
Code
import numpy as np
import tensorflow as tf
from keras.layers import (
    Dense,
    Flatten,
    Normalization,
    Conv2D,
    MaxPooling2D,
)
from keras.models import (
    Sequential
)
from scikeras.wrappers import (
    KerasClassifier,
)
from sklearn.model_selection import (
    GridSearchCV
)
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.cifar10.load_data()
# max_value = float(np.max(x_train))
# x_train, x_test = x_train/max_value, x_test/max_value
input_shape = x_train[0].shape
number_of_classes = 10
# Data Normalization
normalization = Normalization(
    name="norm",
    input_shape=input_shape,  # (32, 32, 3)
    axis=-1                   # Regard each pixel as a feature
)
normalization.adapt(x_train)
def create_model():
    model = Sequential([
        # Without the adapted Normalization layer, it works.
        normalization,
        Conv2D(
            name="conv",
            filters=32,
            kernel_size=(3, 3),
            strides=(1, 1),
            padding="same",
            activation='relu',
            input_shape=input_shape
        ),
        MaxPooling2D(
            name="maxpool",
            pool_size=(2, 2)
        ),
        Flatten(),
        Dense(
            name="full",
            units=100,
            activation="relu"
        ),
        Dense(
            name="label",
            units=number_of_classes,
            activation="softmax"
        )
    ])
    model.compile(
        loss=tf.keras.losses.sparse_categorical_crossentropy,
        optimizer='adam',
        metrics=['accuracy']
    )
    return model
model = KerasClassifier(model=create_model, verbose=2)
batch_size = [32]
epochs = [2, 3]
param_grid = dict(batch_size=batch_size, epochs=epochs)
grid = GridSearchCV(estimator=model, param_grid=param_grid, n_jobs=-1, cv=3)
grid_result = grid.fit(x_train, y_train)
Log
The above exception was the direct cause of the following exception:
PicklingError Traceback (most recent call last)
Cell In [28], line 7
4 param_grid = dict(batch_size=batch_size, epochs=epochs)
6 grid = GridSearchCV(estimator=model, param_grid=param_grid, n_jobs=-1, cv=3)
----> 7 grid_result = grid.fit(x_train, y_train)
File ~/venv/ml/lib/python3.9/site-packages/sklearn/model_selection/_search.py:875, in BaseSearchCV.fit(self, X, y, groups, **fit_params)
869 results = self._format_results(
870 all_candidate_params, n_splits, all_out, all_more_results
871 )
873 return results
--> 875 self._run_search(evaluate_candidates)
877 # multimetric is determined here because in the case of a callable
878 # self.scoring the return type is only known after calling
879 first_test_score = all_out[0]["test_scores"]
File ~/venv/ml/lib/python3.9/site-packages/sklearn/model_selection/_search.py:1379, in GridSearchCV._run_search(self, evaluate_candidates)
1377 def _run_search(self, evaluate_candidates):
1378 """Search all candidates in param_grid"""
-> 1379 evaluate_candidates(ParameterGrid(self.param_grid))
File ~/venv/ml/lib/python3.9/site-packages/sklearn/model_selection/_search.py:822, in BaseSearchCV.fit.<locals>.evaluate_candidates(candidate_params, cv, more_results)
814 if self.verbose > 0:
815 print(
816 "Fitting {0} folds for each of {1} candidates,"
817 " totalling {2} fits".format(
818 n_splits, n_candidates, n_candidates * n_splits
819 )
820 )
--> 822 out = parallel(
823 delayed(_fit_and_score)(
824 clone(base_estimator),
825 X,
826 y,
827 train=train,
828 test=test,
829 parameters=parameters,
830 split_progress=(split_idx, n_splits),
831 candidate_progress=(cand_idx, n_candidates),
832 **fit_and_score_kwargs,
833 )
834 for (cand_idx, parameters), (split_idx, (train, test)) in product(
835 enumerate(candidate_params), enumerate(cv.split(X, y, groups))
836 )
837 )
839 if len(out) < 1:
840 raise ValueError(
841 "No fits were performed. "
842 "Was the CV iterator empty? "
843 "Were there no candidates?"
844 )
File ~/venv/ml/lib/python3.9/site-packages/joblib/parallel.py:1098, in Parallel.__call__(self, iterable)
1095 self._iterating = False
1097 with self._backend.retrieval_context():
-> 1098 self.retrieve()
1099 # Make sure that we get a last message telling us we are done
1100 elapsed_time = time.time() - self._start_time
File ~/venv/ml/lib/python3.9/site-packages/joblib/parallel.py:975, in Parallel.retrieve(self)
973 try:
974 if getattr(self._backend, 'supports_timeout', False):
--> 975 self._output.extend(job.get(timeout=self.timeout))
976 else:
977 self._output.extend(job.get())
File ~/venv/ml/lib/python3.9/site-packages/joblib/_parallel_backends.py:567, in LokyBackend.wrap_future_result(future, timeout)
564 """Wrapper for Future.result to implement the same behaviour as
565 AsyncResults.get from multiprocessing."""
566 try:
--> 567 return future.result(timeout=timeout)
568 except CfTimeoutError as e:
569 raise TimeoutError from e
File /Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/concurrent/futures/_base.py:446, in Future.result(self, timeout)
444 raise CancelledError()
445 elif self._state == FINISHED:
--> 446 return self.__get_result()
447 else:
448 raise TimeoutError()
File /Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/concurrent/futures/_base.py:391, in Future.__get_result(self)
389 if self._exception:
390 try:
--> 391 raise self._exception
392 finally:
393 # Break a reference cycle with the exception in self._exception
394 self = None
PicklingError: Could not pickle the task to send it to the workers.
Research
Keras KerasClassifier gridsearch TypeError: can't pickle _thread.lock objects says the cause is that Keras does not support pickling. However, since the code works when the adapted Normalization is not used, that does not seem relevant here.
A GPU can cause this issue, but there is no GPU in my environment.
References
SciKeras Basic usage
How to Grid Search Hyperparameters for Deep Learning Models in Python with Keras

'numpy.ndarray' object has no attribute 'lower'

I am fairly new to ML and I am trying to fit some data with my NB classifier.
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
# Naïve Bayes:
text_clf_nb = Pipeline([('tfidf', TfidfVectorizer()),
                        ('clf', MultinomialNB()),
                        ])
# Linear SVC:
text_clf_lsvc = Pipeline([('tfidf', TfidfVectorizer()),
                          ('clf', LinearSVC()),
                          ])
Code for fitting the data:
text_clf_nb.fit(X_train, y_train)
The shape of my training & test data is
X_train.shape, X_test.shape, y_train.shape, y_test.shape : ((169, 1), (84,), (169, 1), (84,))
But I keep getting: 'numpy.ndarray' object has no attribute 'lower'
Here is the full traceback of the error:
AttributeError Traceback (most recent call last)
<ipython-input-57-139757126594> in <module>
----> 1 text_clf_nb.fit(X_train, y_train)
~\miniconda3\envs\nlp_course\lib\site-packages\sklearn\pipeline.py in fit(self, X, y, **fit_params)
263 This estimator
264 """
--> 265 Xt, fit_params = self._fit(X, y, **fit_params)
266 if self._final_estimator is not None:
267 self._final_estimator.fit(Xt, y, **fit_params)
~\miniconda3\envs\nlp_course\lib\site-packages\sklearn\pipeline.py in _fit(self, X, y, **fit_params)
228 Xt, fitted_transformer = fit_transform_one_cached(
229 cloned_transformer, Xt, y, None,
--> 230 **fit_params_steps[name])
231 # Replace the transformer of the step with the fitted
232 # transformer. This is necessary when loading the transformer
~\miniconda3\envs\nlp_course\lib\site-packages\sklearn\externals\joblib\memory.py in __call__(self, *args, **kwargs)
340
341 def __call__(self, *args, **kwargs):
--> 342 return self.func(*args, **kwargs)
343
344 def call_and_shelve(self, *args, **kwargs):
You have checked the shape of the arrays, but have you tried something like:
data = vectorizer.fit_transform(array.ravel())
X_train has shape (169, 1), so the TfidfVectorizer iterates over one-element arrays instead of strings, which is why .lower() fails; flattening it to a 1-D array of strings should do the trick for you.
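A minimal sketch of applying that to the pipeline above (assuming X_train, X_test and y_train are as described in the question, with X_train of shape (169, 1)):
```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

text_clf_nb = Pipeline([('tfidf', TfidfVectorizer()),
                        ('clf', MultinomialNB())])

# Flatten the (n_samples, 1) array to a 1-D array of strings so that
# TfidfVectorizer sees one document per element.
text_clf_nb.fit(np.asarray(X_train).ravel(), y_train)
predictions = text_clf_nb.predict(np.asarray(X_test).ravel())
```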

Using tf.data.Dataset with tf Hub Modules

How do I feed a tf.keras model, that includes a 1D input TF Hub module, with a tf.data.Dataset?
(Ultimately, the aim is to use a single tf.data.Dataset with a multi-input, multi-output Keras functional API model.)
Tried this:
import tensorflow as tf
import tensorflow_hub as hub
embed = "https://tfhub.dev/google/tf2-preview/gnews-swivel-20dim/1"
hub_layer = hub.KerasLayer(embed, output_shape=[20], input_shape=[],
dtype=tf.string, trainable=True, name='hub_layer')
# From tf hub webpage: "The module takes a batch of sentences in a 1-D tensor of strings as input."
input_tensor = tf.keras.Input(shape=(), dtype=tf.string)
hub_tensor = hub_layer(input_tensor)
x = tf.keras.layers.Dense(16, activation='relu')(hub_tensor)#(x)
main_output = tf.keras.layers.Dense(units=4, activation='softmax', name='main_output')(x)
model = tf.keras.models.Model(inputs=[input_tensor], outputs=[main_output])
# This works as expected.
X_tensor = tf.constant(['Hello World', 'The Quick Brown Fox'])
model(X_tensor)
# This fails
X_ds = tf.data.Dataset.from_tensors(X_tensor)
X_ds.element_spec
model(X_ds)
My expectation was that the 1-D tensor in the dataset would be automatically extracted and consumed by the model.
Error message:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
in
21 X_ds = tf.data.Dataset.from_tensors(X_tensor)
22 X_ds.element_spec
---> 23 model(X_ds)
24
25
~/projects/email_analysis/email_venv/lib/python3.6/site-packages/tensorflow/python/keras/engine/base_layer.py in __call__(self, *args, **kwargs)
966 with base_layer_utils.autocast_context_manager(
967 self._compute_dtype):
--> 968 outputs = self.call(cast_inputs, *args, **kwargs)
969 self._handle_activity_regularization(inputs, outputs)
970 self._set_mask_metadata(inputs, outputs, input_masks)
~/projects/email_analysis/email_venv/lib/python3.6/site-packages/tensorflow/python/keras/engine/network.py in call(self, inputs, training, mask)
717 return self._run_internal_graph(
718 inputs, training=training, mask=mask,
--> 719 convert_kwargs_to_constants=base_layer_utils.call_context().saving)
720
721 def compute_output_shape(self, input_shape):
~/projects/email_analysis/email_venv/lib/python3.6/site-packages/tensorflow/python/keras/engine/network.py in _run_internal_graph(self, inputs, training, mask, convert_kwargs_to_constants)
835 tensor_dict = {}
836 for x, y in zip(self.inputs, inputs):
--> 837 y = self._conform_to_reference_input(y, ref_input=x)
838 x_id = str(id(x))
839 tensor_dict[x_id] = [y] * self._tensor_usage_count[x_id]
~/projects/email_analysis/email_venv/lib/python3.6/site-packages/tensorflow/python/keras/engine/network.py in _conform_to_reference_input(self, tensor, ref_input)
959 # Dtype handling.
960 if isinstance(ref_input, (ops.Tensor, composite_tensor.CompositeTensor)):
--> 961 tensor = math_ops.cast(tensor, dtype=ref_input.dtype)
962
963 return tensor
~/projects/email_analysis/email_venv/lib/python3.6/site-packages/tensorflow/python/util/dispatch.py in wrapper(*args, **kwargs)
178 """Call target, and fall back on dispatchers if there is a TypeError."""
179 try:
--> 180 return target(*args, **kwargs)
181 except (TypeError, ValueError):
182 # Note: convert_to_eager_tensor currently raises a ValueError, not a
~/projects/email_analysis/email_venv/lib/python3.6/site-packages/tensorflow/python/ops/math_ops.py in cast(x, dtype, name)
785 # allows some conversions that cast() can't do, e.g. casting numbers to
786 # strings.
--> 787 x = ops.convert_to_tensor(x, name="x")
788 if x.dtype.base_dtype != base_type:
789 x = gen_math_ops.cast(x, base_type, name=name)
~/projects/email_analysis/email_venv/lib/python3.6/site-packages/tensorflow/python/framework/ops.py in convert_to_tensor(value, dtype, name, as_ref, preferred_dtype, dtype_hint, ctx, accepted_result_types)
1339
1340 if ret is None:
-> 1341 ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref)
1342
1343 if ret is NotImplemented:
~/projects/email_analysis/email_venv/lib/python3.6/site-packages/tensorflow/python/framework/constant_op.py in _constant_tensor_conversion_function(v, dtype, name, as_ref)
319 as_ref=False):
320 _ = as_ref
--> 321 return constant(v, dtype=dtype, name=name)
322
323
~/projects/email_analysis/email_venv/lib/python3.6/site-packages/tensorflow/python/framework/constant_op.py in constant(value, dtype, shape, name)
260 """
261 return _constant_impl(value, dtype, shape, name, verify_shape=False,
--> 262 allow_broadcast=True)
263
264
~/projects/email_analysis/email_venv/lib/python3.6/site-packages/tensorflow/python/framework/constant_op.py in _constant_impl(value, dtype, shape, name, verify_shape, allow_broadcast)
268 ctx = context.context()
269 if ctx.executing_eagerly():
--> 270 t = convert_to_eager_tensor(value, ctx, dtype)
271 if shape is None:
272 return t
~/projects/email_analysis/email_venv/lib/python3.6/site-packages/tensorflow/python/framework/constant_op.py in convert_to_eager_tensor(value, ctx, dtype)
94 dtype = dtypes.as_dtype(dtype).as_datatype_enum
95 ctx.ensure_initialized()
---> 96 return ops.EagerTensor(value, ctx.device_name, dtype)
97
98
ValueError: Attempt to convert a value () with an unsupported type () to a Tensor.
The point of a dataset is to provide a sequence of tensors, like here:
all_data = tf.constant([['Hello', 'World'], ['Brown Fox', 'lazy dog']])
ds = tf.data.Dataset.from_tensor_slices(all_data)
for tensor in ds:
    print(tensor)
which outputs
tf.Tensor([b'Hello' b'World'], shape=(2,), dtype=string)
tf.Tensor([b'Brown Fox' b'lazy dog'], shape=(2,), dtype=string)
Instead of just printing tensor, you can compute with it:
for tensor in ds:
    print(hub_layer(tensor))
which outputs 2 tensors of shape (2,20) each.
For more, see https://www.tensorflow.org/guide/data.
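If the goal is to have the model itself consume the dataset rather than iterating manually, a minimal sketch (assuming the model and X_tensor defined in the question) is to batch the dataset and hand it to predict, which accepts a tf.data.Dataset:
```python
import tensorflow as tf

# Assumes `model` and `X_tensor` are defined as in the question above.
# Calling the model directly on a Dataset object fails, but Model.predict
# (and Model.fit) know how to iterate over a batched tf.data.Dataset.
X_ds = tf.data.Dataset.from_tensor_slices(X_tensor).batch(2)
predictions = model.predict(X_ds)
print(predictions.shape)  # (2, 4): one softmax over 4 classes per input sentence
```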

ML: Multinomial NB error: ValueError: setting an array element with a sequence

I'm pretty new to Machine Learning, and I'm trying something experimental on a public wine dataset.
I'm ending up with an error and I can't find a solution.
Here is what I'm trying to do with my model:
X = data_all[['country', 'description', 'price', 'province', 'variety']]
y = data_all['points']
# Vectorizing Description column (text analysis)
vectorizerDesc = CountVectorizer()
descriptions = X['description']
vectorizerDesc.fit(descriptions)
vectorizedDesc = vectorizerDesc.transform(X['description'])
X['description'] = vectorizedDesc
# Categorizing other string columns
X = pd.get_dummies(X, columns=['country', 'province', 'variety'])
# Generating train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)
# Multinomial Naive Bayes
nb = MultinomialNB()
nb.fit(X_train, y_train)
Here's what X looks like just before calling train_test_split:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 83945 entries, 25 to 150929
Columns: 837 entries, description to variety_Zweigelt
dtypes: float64(1), object(1), uint8(835)
The last line (nb.fit) gives me an error:
ValueError Traceback (most recent call last)
<ipython-input-197-9d40e4624ff6> in <module>()
3 # Multinomial Naive Bayes is a specialised version of Naive Bayes designed more for text documents
4 nb = MultinomialNB()
----> 5 nb.fit(X_train, y_train)
/opt/conda/lib/python3.6/site-packages/sklearn/naive_bayes.py in fit(self, X, y, sample_weight)
577 Returns self.
578 """
--> 579 X, y = check_X_y(X, y, 'csr')
580 _, n_features = X.shape
581
/opt/conda/lib/python3.6/site-packages/sklearn/utils/validation.py in check_X_y(X, y, accept_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, multi_output, ensure_min_samples, ensure_min_features, y_numeric, warn_on_dtype, estimator)
571 X = check_array(X, accept_sparse, dtype, order, copy, force_all_finite,
572 ensure_2d, allow_nd, ensure_min_samples,
--> 573 ensure_min_features, warn_on_dtype, estimator)
574 if multi_output:
575 y = check_array(y, 'csr', force_all_finite=True, ensure_2d=False,
/opt/conda/lib/python3.6/site-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
446 # make sure we actually converted to numeric:
447 if dtype_numeric and array.dtype.kind == "O":
--> 448 array = array.astype(np.float64)
449 if not allow_nd and array.ndim >= 3:
450 raise ValueError("Found array with dim %d. %s expected <= 2."
ValueError: setting an array element with a sequence.
Would you know how I could combine my vectorized text features with the other columns (country, province, etc.) in a Multinomial NB model?
Thank you in advance :)
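A minimal sketch of one way to do this, assuming data_all is loaded as above and price has no missing values: keep the CountVectorizer output as a sparse matrix and combine it with the dummy-encoded columns via scipy.sparse.hstack, instead of writing the vectorized text back into a single DataFrame column.
```python
import pandas as pd
from scipy.sparse import hstack, csr_matrix
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

X = data_all[['country', 'description', 'price', 'province', 'variety']]
y = data_all['points']

# Vectorize the text column into its own sparse matrix
vectorizerDesc = CountVectorizer()
desc_features = vectorizerDesc.fit_transform(X['description'])

# One-hot encode the categorical columns; price stays as a numeric column
other = pd.get_dummies(X.drop(columns=['description']),
                       columns=['country', 'province', 'variety'])

# Stack everything into one sparse feature matrix
features = hstack([desc_features, csr_matrix(other.values.astype(float))]).tocsr()

X_train, X_test, y_train, y_test = train_test_split(features, y, test_size=0.3, random_state=101)
nb = MultinomialNB()
nb.fit(X_train, y_train)
```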

GridSearchCV: "TypeError: 'StratifiedKFold' object is not iterable"

I want to perform GridSearchCV on a RandomForestClassifier, but my data is not balanced, so I use StratifiedKFold:
from sklearn.model_selection import StratifiedKFold
from sklearn.grid_search import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
param_grid = {'n_estimators':[10, 30, 100, 300], "max_depth": [3, None],
"max_features": [1, 5, 10], "min_samples_leaf": [1, 10, 25, 50], "criterion": ["gini", "entropy"]}
rfc = RandomForestClassifier()
clf = GridSearchCV(rfc, param_grid=param_grid, cv=StratifiedKFold()).fit(X_train, y_train)
But I get an error:
TypeError Traceback (most recent call last)
<ipython-input-597-b08e92c33165> in <module>()
9 rfc = RandomForestClassifier()
10
---> 11 clf = GridSearchCV(rfc, param_grid=param_grid, cv=StratifiedKFold()).fit(X_train, y_train)
c:\python34\lib\site-packages\sklearn\grid_search.py in fit(self, X, y)
811
812 """
--> 813 return self._fit(X, y, ParameterGrid(self.param_grid))
c:\python34\lib\site-packages\sklearn\grid_search.py in _fit(self, X, y, parameter_iterable)
559 self.fit_params, return_parameters=True,
560 error_score=self.error_score)
--> 561 for parameters in parameter_iterable
562 for train, test in cv)
c:\python34\lib\site-packages\sklearn\externals\joblib\parallel.py in __call__(self, iterable)
756 # was dispatched. In particular this covers the edge
757 # case of Parallel used with an exhausted iterator.
--> 758 while self.dispatch_one_batch(iterator):
759 self._iterating = True
760 else:
c:\python34\lib\site-packages\sklearn\externals\joblib\parallel.py in dispatch_one_batch(self, iterator)
601
602 with self._lock:
--> 603 tasks = BatchedCalls(itertools.islice(iterator, batch_size))
604 if len(tasks) == 0:
605 # No more tasks available in the iterator: tell caller to stop.
c:\python34\lib\site-packages\sklearn\externals\joblib\parallel.py in __init__(self, iterator_slice)
125
126 def __init__(self, iterator_slice):
--> 127 self.items = list(iterator_slice)
128 self._size = len(self.items)
c:\python34\lib\site-packages\sklearn\grid_search.py in <genexpr>(.0)
560 error_score=self.error_score)
561 for parameters in parameter_iterable
--> 562 for train, test in cv)
563
564 # Out is a list of triplet: score, estimator, n_test_samples
TypeError: 'StratifiedKFold' object is not iterable
When I write cv=StratifiedKFold(y_train) I get ValueError: The number of folds must be of Integral type. But when I write cv=5, it works.
I don't understand what is wrong with StratifiedKFold.
I had exactly the same problem. The solution that worked for me is to replace:
from sklearn.grid_search import GridSearchCV
with
from sklearn.model_selection import GridSearchCV
Then it should work fine.
The problem here is an API change, as mentioned in the other answers; however, those answers could be more explicit.
The cv parameter documentation states:
cv : int, cross-validation generator or an iterable, optional
Determines the cross-validation splitting strategy. Possible inputs for cv are:
None, to use the default 3-fold cross-validation,
integer, to specify the number of folds,
an object to be used as a cross-validation generator,
an iterable yielding train/test splits.
For integer/None inputs, if y is binary or multiclass, StratifiedKFold is used. If the estimator is a classifier or if y is neither binary nor multiclass, KFold is used.
So, whatever cross-validation strategy is used, all that is needed is to provide the generator via its split function, as suggested:
kfolds = StratifiedKFold(5)
clf = GridSearchCV(estimator, parameters, scoring=qwk, cv=kfolds.split(xtrain,ytrain))
clf.fit(xtrain, ytrain)
It seems that cv=StratifiedKFold() in that call should be changed to cv=StratifiedKFold().split(X_train, y_train).
The API changed in a later version: you used to pass y when creating the StratifiedKFold object, but now you just pass the number of splits and supply y later, to split().
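Putting the pieces together, a minimal sketch of the corrected setup (assuming the same X_train and y_train as in the question): with the model_selection version of GridSearchCV, a StratifiedKFold instance can also be passed directly as cv, and y is supplied inside fit().
```python
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.ensemble import RandomForestClassifier

param_grid = {'n_estimators': [10, 30, 100, 300], "max_depth": [3, None],
              "max_features": [1, 5, 10], "min_samples_leaf": [1, 10, 25, 50],
              "criterion": ["gini", "entropy"]}
rfc = RandomForestClassifier()

# The splitter object goes to cv; GridSearchCV calls its split(X, y) internally.
clf = GridSearchCV(rfc, param_grid=param_grid, cv=StratifiedKFold(n_splits=5))
clf.fit(X_train, y_train)
```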