How can I make a DataFrame from a ColumnTransformer composed of a LOO Encoder, OHE and Ordinal Encoder? - pandas

Since get_feature_names() was deprecated on scikit-learn's "native" categorical encoders (it was replaced by get_feature_names_out()), how can I build a DataFrame in which the transformed variables keep their proper names, given that the ColumnTransformer contains some encoders that respond to get_feature_names_out() and others that only respond to get_feature_names()? Here is the situation:
features_pipe = make_column_transformer(
    (OneHotEncoder(handle_unknown='ignore', sparse=False), ['Gender', 'Race']),
    (OrdinalEncoder(), ['Age', 'Overall Work Exp.', 'Fieldwork Exp.', 'Level of Education']),
    (ce.LeaveOneOutEncoder(), ['State (US)'])
).fit(X_train, y_train)

X_train_encoded = features_pipe.transform(X_train)
X_test_encoded = features_pipe.transform(X_test)

X_train_encoded_df = pd.DataFrame(X_train_encoded, columns=features_pipe.get_features_names_out())
X_train_encoded_df.head()
I got this error: AttributeError: 'ColumnTransformer' object has no attribute 'get_features_names_out'
That's because LeaveOneOutEncoder does not support get_feature_names_out(). It supports get_feature_names().
How could I overcome this issue and print my DataFrame correctly?

I had this same issue once in the past.
If you don't mind using a subclass of ColumnTransformer, you can create one that falls back to get_feature_names() when get_feature_names_out() is not available.
In this case, you should declare the class like this:
from sklearn.compose import ColumnTransformer
from sklearn.compose._column_transformer import _is_empty_column_selection


class MyColumnTransformer(ColumnTransformer):
    def __init__(self, transformers, **kwargs):
        super().__init__(transformers=transformers, **kwargs)

    def _get_feature_name_out_for_transformer(
        self, name, trans, column, feature_names_in
    ):
        column_indices = self._transformer_to_input_indices[name]
        names = feature_names_in[column_indices]
        if trans == "drop" or _is_empty_column_selection(column):
            return
        elif trans == "passthrough":
            return names
        # Fall back to get_feature_names() for encoders (such as
        # LeaveOneOutEncoder) that do not implement get_feature_names_out().
        if not hasattr(trans, "get_feature_names_out"):
            return trans.get_feature_names()
        return trans.get_feature_names_out(names)
Although the use of ColumnTransformer is not as simple as using make_column_transformer, it's much more customizable.
So, in this case, you also have to pass a name to each transformer using the following schema:
(name, transformer, columns)
features_pipe = MyColumnTransformer(transformers=[
    ('OHE', OneHotEncoder(handle_unknown='ignore', sparse=False), ['Gender', 'Race']),
    ('OE', OrdinalEncoder(), ['Age', 'Overall Work Exp.', 'Fieldwork Exp.', 'Level of Education']),
    ('LOOE', ce.LeaveOneOutEncoder(), ['State (US)'])
])
features_pipe.fit(X_train, y_train)
and finally continue the code the way you suggested.
If you don't want the transformer names prefixed to the feature names, just include verbose_feature_names_out=False when initializing MyColumnTransformer.
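For completeness, here is a minimal sketch of that final step, assuming the fitted MyColumnTransformer above and that X_train is a pandas DataFrame (note the corrected method name, get_feature_names_out):
import pandas as pd

# Transform the training data and label the columns with the names
# produced by the subclass above.
X_train_encoded = features_pipe.transform(X_train)
X_train_encoded_df = pd.DataFrame(
    X_train_encoded,
    columns=features_pipe.get_feature_names_out(),
    index=X_train.index,
)
X_train_encoded_df.head()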

Related

Why doesn't the spacy morphologizer work when we use a custom tokenizer?

I don't understand why, when I do this,
import spacy
from copy import deepcopy

nlp = spacy.load("fr_core_news_lg")

class MyTokenizer:
    def __init__(self, tokenizer):
        self.tokenizer = deepcopy(tokenizer)

    def __call__(self, text):
        return self.tokenizer(text)

nlp.tokenizer = MyTokenizer(nlp.tokenizer)
doc = nlp("Un texte en français.")
the tokens don't have any morph assigned:
print([tok.morph for tok in doc])
> ['','','','','']
Is this behavior expected? If yes, why? (spacy v3.0.7)
The pipeline expects nlp.vocab and nlp.tokenizer.vocab to refer to the exact same Vocab object, which isn't the case after running deepcopy.
I admit that I'm not entirely sure off the top of my head why you end up with empty analyses instead of more specific errors, but I think the MorphAnalysis objects, which are stored centrally in the vocab in vocab.morphology, end up out-of-sync between the two vocabs.
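As a hedged sketch of one way around it, wrap the tokenizer without deep-copying it, so the wrapper keeps using the pipeline's own Vocab:
import spacy

nlp = spacy.load("fr_core_news_lg")

class MyTokenizer:
    def __init__(self, tokenizer):
        # Keep the original tokenizer (and therefore the original Vocab);
        # no deepcopy, so nlp.vocab and nlp.tokenizer.vocab stay identical.
        self.tokenizer = tokenizer

    def __call__(self, text):
        return self.tokenizer(text)

nlp.tokenizer = MyTokenizer(nlp.tokenizer)
doc = nlp("Un texte en français.")
print([tok.morph for tok in doc])  # morph analyses should now be populated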

What exactly to test (unittest) in a larger function containing several dataframe manipulations

Perhaps this is a gap in my understanding of unit tests, but I get quite confused as to what should be tested, patched, etc. in a method that contains several pandas dataframe manipulations. Many of the unittest examples out there focus on classes and methods that are small. For larger methods, I get a bit lost in the typical unittest paradigm. For example:
myscript.py
class Pivot:
    def prepare_dfs(self):
        df = pd.read_csv(self.file, sep=self.delimiter)
        g = df.groupby("Other_Location")
        df1 = g.apply(lambda x: x[x["PRN"] == "Free"].count())
        locations = ["O12-03-01", "O12-03-02"]
        cp = df1["PRN"]
        cp = cp[locations].tolist()
        data = [locations, cp]
        new_df = pd.DataFrame({"Other_Location": data[0], "Free": data[1]})
        return new_df, df
test_myscript.py
class TestPivot(unittest.TestCase):
    def setUp(self):
        args = parse_args(["-f", "test1", "-d", ","])
        self.pivot = Pivot(args)
        self.pivot.path = "Pivot/path"

    @mock.patch("myscript.cp[locations].tolist()", return_value=None)
    @mock.patch("myscript.pd.read_csv", return_value=df)
    def test_prepare_dfs_1(self, mock_read_csv, mock_cp):
        new_df, df = self.pivot.prepare_dfs()
        # Here I get a bit lost
For example, here I am trying to circumvent the following error message:
ModuleNotFoundError: No module named 'myscript.cp[locations]'; 'myscript' is not a package
I managed to mock pd.read_csv correctly in my method, however further down in the code there are groupby, apply, tolist, etc. The error message is thrown at the following line:
cp = cp[locations].tolist()
What is the best way to approach unit testing when your method involves several manipulations of a dataframe? Is refactoring the code into smaller chunks always advised? In this case, how can I correctly mock the tolist call?
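One approach, sketched here under the assumption that Pivot comes from myscript as above: patch only pd.read_csv to return a small in-memory DataFrame and let groupby/apply/tolist run for real on it, so nothing further needs to be mocked. The sample values are made up.
import pandas as pd
from unittest import mock

# Inside TestPivot from above:
@mock.patch("myscript.pd.read_csv")
def test_prepare_dfs_1(self, mock_read_csv):
    mock_read_csv.return_value = pd.DataFrame({
        "Other_Location": ["O12-03-01", "O12-03-01", "O12-03-02"],
        "PRN": ["Free", "Used", "Free"],
    })
    new_df, df = self.pivot.prepare_dfs()
    # One "Free" row per location in the fake data.
    self.assertEqual(new_df["Free"].tolist(), [1, 1])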

Custom scikit-learn pickling doesn't work inside a grid search

I have written a scikit-learn estimator. It has a parameter and a model_ attribute that is set by fit.
class MyEstimator(BaseEstimator, TransformerMixin):
    def __init__(self, param="default"):
        self.param = param
        self.model_ = None

    def fit(self, x, y):
        # Sets the value of self.model_
        ...
I want to be able to pickle MyEstimator, but the model_ object I create cannot be serialized with pickle because it is a keras model. Following the example of the blog post "Pickling Keras Models" I added the following pickling handler methods to my class.
class MyEstimator(BaseEstimator, TransformerMixin):
    def __getstate__(self):
        state = super().__getstate__().copy()
        with tempfile.NamedTemporaryFile(suffix=".hdf5", delete=True) as fd:
            keras.models.save_model(self.model_, fd.name, overwrite=True)
            state["model_"] = fd.read()
        return state

    def __setstate__(self, state):
        super().__setstate__(state)
        with tempfile.NamedTemporaryFile(suffix=".hdf5", delete=True) as fd:
            fd.write(state["model_"])
            fd.flush()
            self.__dict__["model_"] = keras.models.load_model(fd.name)
This replaces the unpickleable model_ member with a representation generated by keras' serializer that can be pickled. Using this customization I can call fit, serialize and deserialize, and get back my original model. Everything works.
e = MyEstimator()
e.fit(x, y)

with open("myfile.pk", mode="wb") as f:
    pickle.dump(e, f)

with open("myfile.pk", mode="rb") as f:
    pickle.load(f)  # Returns a copy of e
However, serialization does not work when I try to put MyEstimator in a pipeline and pickle the result of a GridSearchCV.
s = GridSearchCV(Pipeline([
    # ...
    ("estimator", MyEstimator())
    # ...
]))
s.fit(x, y)

with open("myfile.pk", mode="wb") as f:
    pickle.dump(s, f)
During the pickle.dump call I expect to see MyEstimator.__getstate__ get called with a fitted self.model_ object. (This is what happens when I serialize the model by itself, outside the grid search.) Instead self.model_ is None, so I am unable to serialize the best_estimator_ generated by my grid search.
It looks like grid search serialization is instantiating a new MyEstimator object instead of using the one that was in the pipeline. This seems wrong to me. I've looked through the scikit-learn code, but can't see where this is happening.
Is this a bug in scikit-learn, or am I doing something wrong?
(Note: keras does have a wrapper layer that can convert some keras models into scikit-learn estimators, but I can't use that here for other reasons and I'm not sure it wouldn't just have the same problem.)
The search object contains a mix of MyEstimator objects, some of which have not had fit called on them. The fix is to check whether model_ is None before trying to serialize it with the keras tools.
class MyEstimator(BaseEstimator, TransformerMixin):
    def __getstate__(self):
        state = super().__getstate__().copy()
        # Only swap in the keras serialization when the estimator has been fitted.
        if self.model_ is not None:
            with tempfile.NamedTemporaryFile(suffix=".hdf5", delete=True) as fd:
                keras.models.save_model(self.model_, fd.name, overwrite=True)
                state["model_"] = fd.read()
        return state

    def __setstate__(self, state):
        super().__setstate__(state)
        if self.model_ is not None:
            with tempfile.NamedTemporaryFile(suffix=".hdf5", delete=True) as fd:
                fd.write(state["model_"])
                fd.flush()
                self.__dict__["model_"] = keras.models.load_model(fd.name)
I don't know why there would be any unfitted models in the search object after the grid search had completed, but there are.
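For what it's worth, one likely source of those unfitted copies is that the search object keeps the original, unfitted pipeline around alongside the refitted best one. A small hedged illustration, assuming the step name "estimator" from the pipeline above:
# GridSearchCV stores the unfitted template pipeline in s.estimator and
# (with refit enabled) a fitted clone in s.best_estimator_. Pickling the
# search object serializes both, so __getstate__ is also called on an
# estimator whose model_ is still None.
print(s.estimator.named_steps["estimator"].model_)        # None, never fitted
print(s.best_estimator_.named_steps["estimator"].model_)  # the fitted keras model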

tensorflow record with float numpy array

I want to create TensorFlow records to feed my model.
So far I use the following code to store a uint8 numpy array in TFRecord format:
def _int64_feature(value):
    return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))

def _bytes_feature(value):
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

def _floats_feature(value):
    return tf.train.Feature(float_list=tf.train.FloatList(value=[value]))

def convert_to_record(name, image, label, map):
    filename = os.path.join(params.TRAINING_RECORDS_DATA_DIR, name + '.' + params.DATA_EXT)
    writer = tf.python_io.TFRecordWriter(filename)
    image_raw = image.tostring()
    map_raw = map.tostring()
    label_raw = label.tostring()
    example = tf.train.Example(features=tf.train.Features(feature={
        'image_raw': _bytes_feature(image_raw),
        'map_raw': _bytes_feature(map_raw),
        'label_raw': _bytes_feature(label_raw)
    }))
    writer.write(example.SerializeToString())
    writer.close()
which I read with this example code
features = tf.parse_single_example(example, features={
    'image_raw': tf.FixedLenFeature([], tf.string),
    'map_raw': tf.FixedLenFeature([], tf.string),
    'label_raw': tf.FixedLenFeature([], tf.string),
})

image = tf.decode_raw(features['image_raw'], tf.uint8)
image.set_shape(params.IMAGE_HEIGHT*params.IMAGE_WIDTH*3)
image = tf.reshape(image, (params.IMAGE_HEIGHT, params.IMAGE_WIDTH, 3))

map = tf.decode_raw(features['map_raw'], tf.uint8)
map.set_shape(params.MAP_HEIGHT*params.MAP_WIDTH*params.MAP_DEPTH)
map = tf.reshape(map, (params.MAP_HEIGHT, params.MAP_WIDTH, params.MAP_DEPTH))

label = tf.decode_raw(features['label_raw'], tf.uint8)
label.set_shape(params.NUM_CLASSES)
and that's working fine. Now I want to do the same with my array map being a float numpy array instead of uint8, and I could not find examples of how to do it.
I tried the function _floats_feature, which works if I pass a scalar to it, but not an array.
With uint8 the serialization can be done by the method tostring().
How can I serialize a float numpy array, and how can I read it back?
FloatList and BytesList expect an iterable. So you need to pass them a list of floats. Remove the extra brackets in your _floats_feature, i.e.
def _floats_feature(value):
    return tf.train.Feature(float_list=tf.train.FloatList(value=value))

numpy_arr = np.ones((3,)).astype(np.float)
example = tf.train.Example(features=tf.train.Features(feature={"bytes": _floats_feature(numpy_arr)}))
print(example)
features {
  feature {
    key: "bytes"
    value {
      float_list {
        value: 1.0
        value: 1.0
        value: 1.0
      }
    }
  }
}
I will expand on Yaroslav's answer.
Int64List, BytesList and FloatList expect an iterable of the underlying elements (a repeated field). In your case you can use a list as that iterable.
You mentioned that it works if you pass a scalar but not an array. This is expected, because when you pass a scalar, your _floats_feature creates an array with one float element in it (exactly as expected). But when you pass an array, you create a list of arrays and pass it to a function which expects a list of floats.
So just remove the construction of the array from your function: float_list=tf.train.FloatList(value=value)
I've stumbled across this while working on a similar problem. Since part of the original question was how to read back the float32 feature from tfrecords, I'll leave this here in case it helps anyone:
If map.ravel() was used to input map of dimensions [x, y, z] into _floats_feature:
features = {
    ...
    'map': tf.FixedLenFeature([x, y, z], dtype=tf.float32)
    ...
}
parsed_example = tf.parse_single_example(serialized=serialized, features=features)
map = parsed_example['map']
Yaroslav's example failed when an n-dimensional array was the input:
numpy_arr = np.ones((3,3)).astype(np.float)
I found that it worked when I used numpy_arr.ravel() as the input. But is there a better way to do it?
First of all, many thanks to Yaroslav and Salvador for their enlightening answers.
In my experience, their methods only work when the input is a 1D NumPy array of size (n,). When the input is a NumPy array with two or more dimensions, the following error appears:
def _float_feature(value):
    return tf.train.Feature(float_list=tf.train.FloatList(value=value))

numpy_arr = np.arange(12).reshape(2, 2, 3).astype(np.float)
example = tf.train.Example(features=tf.train.Features(feature={"bytes": _float_feature(numpy_arr)}))
print(example)

TypeError: array([[0., 1., 2.],
       [3., 4., 5.]]) has type numpy.ndarray, but expected one of: int, long, float
So, I'd like to expand on Tsuan's answer, that is, flatten the input before feeding it into the TF example. The modified code is as follows:
def _floats_feature(value):
    return tf.train.Feature(float_list=tf.train.FloatList(value=value))

numpy_arr = np.arange(12).reshape(2, 2, 3).astype(np.float).flatten()
example = tf.train.Example(features=tf.train.Features(feature={"bytes": _floats_feature(numpy_arr)}))
print(example)
In addition, ndarray.flatten() always returns a copy, whereas ndarray.ravel() only returns a view when possible, which makes flatten() the safer choice here.
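As a hedged sketch of the read path for this concrete example (using the same TF 1.x API as the rest of the thread, and assuming the feature was written flattened under the key "bytes" as above), the original shape can be restored with FixedLenFeature:
serialized = example.SerializeToString()

# FixedLenFeature with an explicit shape parses the 12 floats back into (2, 2, 3).
parsed = tf.parse_single_example(serialized, features={
    "bytes": tf.FixedLenFeature([2, 2, 3], dtype=tf.float32),
})
restored = parsed["bytes"]  # float32 tensor with shape (2, 2, 3)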
Use tfrmaker, a TFRecord utility package. You can install the package with pip:
pip install tfrmaker
Then you could create tfrecords like this:
from tfrmaker import images
# mapping label names with integer encoding.
LABELS = {"bishop": 0, "knight": 1, "pawn": 2, "queen": 3, "rook": 4}
# specifying data and output directories.
DATA_DIR = "datasets/chess/"
OUTPUT_DIR = "tfrecords/chess/"
# create tfrecords from the images present in the given data directory.
info = images.create(DATA_DIR, LABELS, OUTPUT_DIR)
# info contains a list of information (path: relative path, size: number of images in the tfrecord) about the created tfrecords.
print(info)
The package also has some cool features like:
dynamic resizing
splitting tfrecords into optimal shards
splitting tfrecords into training, validation and test sets
counting the number of images in tfrecords
asynchronous tfrecord creation
NOTE: This package currently supports image datasets that are organised as directories with class names as sub-directory names.

Use FeatureUnion in scikit-learn to combine two pandas columns for tfidf

While using this as a model for spam classification, I'd like to add an additional feature of the Subject plus the body.
I have all of my features in a pandas dataframe. For example, the subject is df['Subject'], the body is df['body_text'] and the spam/ham label is df['ham/spam']
I receive the following error:
TypeError: 'FeatureUnion' object is not iterable
How can I use both df['Subject'] and df['body_text'] as features all while running them through the pipeline function?
from sklearn.pipeline import FeatureUnion

features = df[['Subject', 'body_text']].values
combined_2 = FeatureUnion(list(features))

pipeline = Pipeline([
    ('count_vectorizer', CountVectorizer(ngram_range=(1, 2))),
    ('tfidf_transformer', TfidfTransformer()),
    ('classifier', MultinomialNB())])

pipeline.fit(combined_2, df['ham/spam'])

k_fold = KFold(n=len(df), n_folds=6)
scores = []
confusion = numpy.array([[0, 0], [0, 0]])
for train_indices, test_indices in k_fold:
    train_text = combined_2.iloc[train_indices]
    train_y = df.iloc[test_indices]['ham/spam'].values
    test_text = combined_2.iloc[test_indices]
    test_y = df.iloc[test_indices]['ham/spam'].values
    pipeline.fit(train_text, train_y)
    predictions = pipeline.predict(test_text)
    prediction_prob = pipeline.predict_proba(test_text)
    confusion += confusion_matrix(test_y, predictions)
    score = f1_score(test_y, predictions, pos_label='spam')
    scores.append(score)
FeatureUnion was not meant to be used that way. It instead takes two feature extractors / vectorizers and applies them to the input. It does not take data in the constructor the way it is shown.
CountVectorizer is expecting a sequence of strings. The easiest way to provide it with that is to concatenate the strings together. That would pass both the text in both columns to the same CountVectorizer.
combined_2 = df['Subject'] + ' ' + df['body_text']
An alternative method would be to run CountVectorizer and optionally TfidfTransformer individually on each column, and then stack the results.
import scipy.sparse as sp
subject_vectorizer = CountVectorizer(...)
subject_vectors = subject_vectorizer.fit_transform(df['Subject'])
body_vectorizer = CountVectorizer(...)
body_vectors = body_vectorizer.fit_transform(df['body_text'])
combined_2 = sp.hstack([subject_vectors, body_vectors], format='csr')
A third option is to implement your own transformer that would extract a dataframe column.
class DataFrameColumnExtracter(TransformerMixin):
    def __init__(self, column):
        self.column = column

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        return X[self.column]
In that case you could use FeatureUnion on two pipelines, each containing your custom transformer, then CountVectorizer.
subj_pipe = make_pipeline(
    DataFrameColumnExtracter('Subject'),
    CountVectorizer()
)

body_pipe = make_pipeline(
    DataFrameColumnExtracter('body_text'),
    CountVectorizer()
)

feature_union = make_union(subj_pipe, body_pipe)
This feature union of pipelines will take the dataframe and each pipeline will process its column. It will produce the concatenation of term count matrices from the two columns given.
sparse_matrix_of_counts = feature_union.fit_transform(df)
This feature union can also be added as the first step in a larger pipeline.
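As a short usage sketch, reusing the estimators from the original question (the exact steps are illustrative, not prescriptive), the feature union can feed the rest of such a pipeline like this:
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

full_pipeline = make_pipeline(
    feature_union,        # extracts and count-vectorizes 'Subject' and 'body_text'
    TfidfTransformer(),   # re-weights the combined counts
    MultinomialNB(),      # classifies ham/spam
)
full_pipeline.fit(df, df['ham/spam'])
predictions = full_pipeline.predict(df)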