What is the logic of the extra columns in Tensorflow categorical encoding?

I am following the official Tensorflow tutorial for preprocessing layers, and I am not sure I get why I end up getting these extra columns after the categorical encoding.
Here is a stripped-down minimal reproducible example (including the data):
import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow.keras import layers
from tensorflow.keras.layers.experimental import preprocessing
import pathlib
dataset_url = 'http://storage.googleapis.com/download.tensorflow.org/data/petfinder-mini.zip'
csv_file = 'datasets/petfinder-mini/petfinder-mini.csv'
tf.keras.utils.get_file('petfinder_mini.zip', dataset_url, extract=True, cache_dir='.')
df = pd.read_csv(csv_file)
# In the original dataset "4" indicates the pet was not adopted.
df['target'] = np.where(df['AdoptionSpeed']==4, 0, 1)
# Drop un-used columns.
df = df.drop(columns=['AdoptionSpeed', 'Description'])
# A utility method to create a tf.data dataset from a Pandas Dataframe
def df_to_dataset(dataframe, shuffle=True, batch_size=32):
    dataframe = dataframe.copy()
    labels = dataframe.pop('target')
    ds = tf.data.Dataset.from_tensor_slices((dict(dataframe), labels))
    if shuffle:
        ds = ds.shuffle(buffer_size=len(dataframe))
    ds = ds.batch(batch_size)
    ds = ds.prefetch(batch_size)
    return ds
batch_size = 5
ds = df_to_dataset(df, batch_size=batch_size)
[(train_features, label_batch)] = ds.take(1)
def get_category_encoding_layer(name, dataset, dtype, max_tokens=None):
    # Create a StringLookup layer which will turn strings into integer indices
    if dtype == 'string':
        index = preprocessing.StringLookup(max_tokens=max_tokens)
    else:
        index = preprocessing.IntegerLookup(max_values=max_tokens)
    # Prepare a Dataset that only yields our feature
    feature_ds = dataset.map(lambda x, y: x[name])
    # Learn the set of possible values and assign them a fixed integer index.
    index.adapt(feature_ds)
    # Create a CategoryEncoding layer for our integer indices.
    encoder = preprocessing.CategoryEncoding(max_tokens=index.vocab_size())
    #encoder = preprocessing.CategoryEncoding(max_tokens=2)
    # Prepare a Dataset that only yields our feature.
    feature_ds = feature_ds.map(index)
    # Learn the space of possible indices.
    encoder.adapt(feature_ds)
    # Apply one-hot encoding to our indices. The lambda function captures the
    # layers so we can use them, or include them in the functional model later.
    return lambda feature: encoder(index(feature))
So, after running
type_col = train_features['Type']
layer = get_category_encoding_layer('Type', ds, 'string')
layer(type_col)
I get a result as:
<tf.Tensor: shape=(5, 4), dtype=float32, numpy=
array([[0., 0., 1., 0.],
[0., 0., 1., 0.],
[0., 0., 0., 1.],
[0., 0., 0., 1.],
[0., 0., 0., 1.]], dtype=float32)>
which is indeed similar to what is shown in the tutorial.
Notice that this is a binary classification problem (Cat/Dog):
np.unique(type_col)
# array([b'Cat', b'Dog'], dtype=object)
So, what is the logic of the 2 extra columns after the categorical encoding shown in the result above? What do they represent, and why are there 2 (and not, say, 1, or 3, or more)?
(I am perfectly aware that, should I wish for a simple one-hot encoding, I could simply use to_categorical(), but this is not the question here)

As already implied in the question, categorical encoding is somewhat richer than a simple one-hot encoding. To see what these two columns represent, it suffices to add a diagnostic print somewhere inside the get_category_encoding_layer() function:
print(index.get_vocabulary())
Then the result of the last commands will be:
['', '[UNK]', 'Dog', 'Cat']
<tf.Tensor: shape=(5, 4), dtype=float32, numpy=
array([[0., 0., 1., 0.],
[0., 0., 1., 0.],
[0., 0., 0., 1.],
[0., 0., 0., 1.],
[0., 0., 0., 1.]], dtype=float32)>
The hint should hopefully be clear: the two extra columns here represent the empty (mask) value '' and out-of-vocabulary values '[UNK]', respectively, which could be present in future (unseen) data.
This is actually determined by the default arguments, not of CategoryEncoding, but of the preceding StringLookup; from the docs:
mask_token=''
oov_token='[UNK]'
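Here is a minimal standalone sketch (my addition, not from the tutorial) of how these StringLookup defaults produce the two extra slots, assuming the same TF version as the tutorial (where mask_token='' and oov_token='[UNK]' are indeed the defaults):
import tensorflow as tf
from tensorflow.keras.layers.experimental import preprocessing

lookup = preprocessing.StringLookup()
lookup.adapt(tf.constant(['Cat', 'Dog', 'Dog']))
print(lookup.get_vocabulary())
# ['', '[UNK]', 'Dog', 'Cat']  -> mask token first, OOV token second
print(lookup(tf.constant([['Dog'], ['Cat'], ['Bird']])).numpy())
# [[2], [3], [1]]  -> 'Bird' was never seen during adapt, so it maps to the OOV index 1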
You can end up with a somewhat tighter encoding (only 1 extra column instead of 2) by asking for oov_token='' instead of oov_token='[UNK]'; replace the call to StringLookup in the get_category_encoding_layer() function with
index = preprocessing.StringLookup(oov_token='', mask_token=None, max_tokens=max_tokens)
after which, the result will be:
['', 'Dog', 'Cat']
<tf.Tensor: shape=(5, 3), dtype=float32, numpy=
array([[0., 1., 0.],
[0., 1., 0.],
[0., 0., 1.],
[0., 0., 1.],
[0., 0., 1.]], dtype=float32)>
i.e. with only 3 columns (without a dedicated one for '[UNK]'). AFAIK, this is the lowest you can go; attempting to set both mask_token and oov_token to None will result in an error.
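Hedged aside (my addition, not part of the original answer): in more recent TF releases these layers have been promoted out of the .experimental namespace, and StringLookup can emit the one-hot encoding directly via output_mode, so the separate CategoryEncoding step can be dropped. Sketch assuming TF >= 2.6, where this argument exists; the column order depends on the adapted vocabulary.
import tensorflow as tf

lookup = tf.keras.layers.StringLookup(output_mode='one_hot')
lookup.adapt(tf.constant(['Cat', 'Dog']))
print(lookup.get_vocabulary())                       # '[UNK]' plus the adapted tokens; no mask token by default here
print(lookup(tf.constant(['Dog', 'Cat', 'Bird'])))   # one column per vocabulary entry; 'Bird' falls into the OOV column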

Related

Why don't my one-hot encoded labels and my labels match each other?

I was working on a multi-class classification model with 15 classes, using to_categorical to one-hot encode my labels. When I check whether my one-hot encoded labels match my integer labels, they don't.
Here is the code:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

X = df.copy()
Y = X.pop('label')
xTrain, xTest, yTrain, yTest = train_test_split(X, Y, test_size=0.2)
xTrain = xTrain.to_numpy()
columnsToScale = ['PriceReg', 'ItemCount', 'demand1', 'demand2', 'demand3']
scaler = MinMaxScaler(feature_range=(0,1))
xTrain = scaler.fit_transform(xTrain)
xTest = scaler.transform(xTest)
from keras.utils.np_utils import to_categorical
one_hot_yTrain = to_categorical(yTrain, num_classes=15)
one_hot_yTest = to_categorical(yTest,num_classes=15)
When I check for example:
one_hot_yTrain[11]
I got:
array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0.],
dtype=float32)
meanwhile for:
yTrain[11]
I got:
3
I expected that yTrain[11] to be 11
Did I misunderstand something? I'd appreciate any explanations.
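A hedged illustration (my addition; the thread above shows only the question) of one likely source of the mismatch: after train_test_split, yTrain is a shuffled pandas Series, so yTrain[11] is label-based indexing (the row whose original index was 11), while one_hot_yTrain is a plain NumPy array indexed by position. Comparing one_hot_yTrain[i] with yTrain.iloc[i] lines the two up.
import pandas as pd
from tensorflow.keras.utils import to_categorical  # same helper as keras.utils.np_utils.to_categorical

y = pd.Series([3, 7, 11], index=[11, 0, 5])   # shuffled index, as after a split
one_hot = to_categorical(y, num_classes=15)

print(y[11])                 # 3  -> label-based lookup on the Series
print(y.iloc[0])             # 3  -> positional lookup, matches one_hot[0]
print(one_hot[0].argmax())   # 3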


scipy sparse A[:,0] = ndarray ValueError

Setting the first row of a scipy sparse array with A[0,:] = np.ones() works fine,
but trying to set the first column with A[:,0] = np.ones() raises a ValueError.
Is this a bug in scipy 1.5.2, or have I just not found the documentation that describes this?
Update 13 Sep: this is a known bug area; see issues/10695
and the newest scipy/sparse/_index.py.
However, I have not tested A[:,0] against that version.
""" scipy sparse A[:,0] = ndarray ValueError """
# sparse A[0,:] = ndarray works, sparse A[:,0] = ndarray raises ValueError
# https://stackoverflow.com/search?q=[scipy] [sparse-matrix] ValueError > 100
import numpy as np
from scipy import sparse
# import warnings
# warnings.simplefilter( "ignore", sparse.SparseEfficiencyWarning )
def versionstr():
    import numpy, scipy, sys
    return "versions: numpy %s scipy %s python %s " % (
        numpy.__version__, scipy.__version__ , sys.version.split()[0] )
print( versionstr() ) # 11 Sep 2020: numpy 1.19.2 scipy 1.5.2 python 3.7.6
#...........................................................................
n = 3
ones = np.ones( n )
for A in [
        np.eye(n),
        sparse.eye( n ).tolil(),
        sparse.eye( n ).tocsr(),
        sparse.eye( n ).tocsr(),
        ]:
    print( "\n-- A:", type(A).__name__, A.shape )
    print( "A[0,:] = ones" )
    A[0,:] = ones
    print( "A: \n", getattr( A, "A", A ))  # dense

    # first column = ones --
    if sparse.issparse( A ):
        A[:,0] = ones.reshape( n, 1 )   # ok
        A[:,0] = np.matrix( ones ).T    # ok
        A[range(n),0] = ones            # ok
    try:
        print( "A[:,0] = ones" )
        A[:,0] = ones  # A dense ok, A sparse ValueError
    except ValueError as msg:
        print( "ValueError:", msg )
        # ValueError: cannot reshape array of size 9 into shape (3,1)
I'd probably call this a bug, yes - that's not the behavior I would have expected. Under the hood, it looks like this is driven by np.broadcast_arrays(), which is called when the fill array is dense. This function treats 1d arrays as if they're 2d (1, N) arrays. I would have expected based on the behavior of numpy slicing that a 1d array would be used without broadcasting if the size is correct.
Column slices:
>>> np.broadcast_arrays(np.ones((3,1)), A[:,0].A)
[array([[1.],
[1.],
[1.]]), array([[1.],
[0.],
[0.]])]
>>> np.broadcast_arrays(np.ones((3,)), A[:,0].A)
[array([[1., 1., 1.],
[1., 1., 1.],
[1., 1., 1.]]), array([[1., 1., 1.],
[0., 0., 0.],
[0., 0., 0.]])]
>>> np.broadcast_arrays(np.ones((1, 3)), A[:,0].A)
[array([[1., 1., 1.],
[1., 1., 1.],
[1., 1., 1.]]), array([[1., 1., 1.],
[0., 0., 0.],
[0., 0., 0.]])]
Row slices:
>>> np.broadcast_arrays(np.ones((3, )), A[0, :].A)
[array([[1., 1., 1.]]), array([[1., 0., 0.]])]
>>> np.broadcast_arrays(np.ones((3, 1)), A[0, :].A)
[array([[1., 1., 1.],
[1., 1., 1.],
[1., 1., 1.]]), array([[1., 0., 0.],
[1., 0., 0.],
[1., 0., 0.]])]
>>> np.broadcast_arrays(np.ones((1, 3)), A[0, :].A)
[array([[1., 1., 1.]]), array([[1., 0., 0.]])]
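For contrast (my addition, not from the original post): plain dense NumPy accepts a 1-D array for a column slice without any reshaping, which is the behavior the answer above says it would have expected from the sparse path as well.
import numpy as np

A = np.eye(3)
A[:, 0] = np.ones(3)   # 1-D fill of matching length, assigned without surprises
print(A)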

How to transform categorical column using one hot encoder in sklearn

I have a dataframe with a categorical column and am trying to one-hot encode it using sklearn with the snippet below:
from sklearn.preprocessing import OneHotEncoder

oneEncoder = OneHotEncoder()
features['COL2'] = features['COL2'].apply(lambda col: oneEncoder.fit_transform(col))
but it keep throwing
ValueError: Expected 2D array, got scalar array instead:
array=1771. Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
I also tried
oneEncoder = OneHotEncoder()  # initializing an object of class OneHotEncoder
oneEncoder.fit_transform(features['COL2'])
but it throws
ValueError: Expected 2D array, got 1D array instead:
Try:
oneEncoder.fit_transform(df[["Col2"]]).todense()
Suppose we have:
features = pd.DataFrame({"Col2":["a","b","c"]})
Then:
oneEncoder= OneHotEncoder()
oneEncoder.fit_transform(features[["Col2"]]).todense()
matrix([[1., 0., 0.],
[0., 1., 0.],
[0., 0., 1.]])
If you're dealing with a Series object, you may wish to reshape it:
oneEncoder.fit_transform(features.Col2.values.reshape(-1,1)).todense()
matrix([[1., 0., 0.],
[0., 1., 0.],
[0., 0., 1.]])
Dropping the todense() call will leave your transformation as a sparse matrix.
And finally, you can always decode what the columns of your matrix mean via:
oneEncoder.categories_
[array(['a', 'b', 'c'], dtype=object)]
Not surprisingly, they are your unique inputs, ordered alphabetically.
Alternatively, you could do this directly with pandas:
categories = pd.get_dummies(features['COL2'])
Otherwise, you could pass a 2-D array to the encoder instead:
oneEncoder.fit_transform( features[['COL2']].values)
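A compact end-to-end sketch of the answers above (my addition; the column name and values are illustrative): OneHotEncoder wants a 2-D input such as a one-column DataFrame, not the scalars that Series.apply() feeds it one at a time.
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

features = pd.DataFrame({"COL2": ["a", "b", "a", "c"]})
encoder = OneHotEncoder()
encoded = encoder.fit_transform(features[["COL2"]])   # 2-D input: (n_samples, 1)

print(encoded.todense())     # dense one-hot matrix; omit .todense() to keep it sparse
print(encoder.categories_)   # [array(['a', 'b', 'c'], dtype=object)]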

How to use CNTK classification_error()?

I am trying to understand the correct usage of cntk.metrics.classification_error() and use it to verify a batch of predictions against their ground truths.
The below toy example (based on the Python API docs):
import numpy as np
from cntk.metrics import classification_error
predictions = np.asarray([[1., 2., 3., 4.],[1., 2., 3., 4.],[1., 2., 3., 4.]], dtype=np.float32)
labels = np.asarray([[0., 0., 0., 1.],[0., 0., 0., 1.],[0., 0., 1., 0.]], dtype=np.float32)
classification_error(predictions, labels).eval()
yields the following result:
array([[ 0., 0., 1.],
[ 0., 0., 1.],
[ 0., 0., 1.]], dtype=float32)
Is there a way I can obtain a vector rather than a square matrix, which seems inefficient given that I would like to process a large batch?
I've tried using the axis keyword when calling classification_error(), but whether I set axis=0 or axis=1 I get an empty result.
This happens because CNTK is trying to be user-friendly and ends up being confused about the types :-) You can tell because the classification error is not even correct.
If you add a little bit of typing information it gets the semantics right.
import cntk as C

p = C.input(4)
y = C.input(4)
classification_error(p, y).eval({p: predictions, y: labels})
array([[ 0.],
[ 0.],
[ 1.]], dtype=float32)
We will work on a fix that will prevent the confusion.
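Follow-up sketch (my addition, not part of the original answer): with the typed inputs, the per-sample errors can simply be averaged to get the batch error rate.
import numpy as np
import cntk as C
from cntk.metrics import classification_error

predictions = np.asarray([[1., 2., 3., 4.], [1., 2., 3., 4.], [1., 2., 3., 4.]], dtype=np.float32)
labels = np.asarray([[0., 0., 0., 1.], [0., 0., 0., 1.], [0., 0., 1., 0.]], dtype=np.float32)

p = C.input(4)
y = C.input(4)
errors = classification_error(p, y).eval({p: predictions, y: labels})
print(errors.mean())   # 1/3 of the batch is misclassified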