Why don't my one-hot encoded labels and my integer labels match each other? - numpy

I was working on a multi-class classification model with 15 classes, using to_categorical to one-hot encode my labels. When I check whether my one-hot encoded labels match my integer labels, they don't.
Here is the code:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
X = df.copy()
Y = X.pop('label')
xTrain, xTest, yTrain, yTest = train_test_split(X, Y, test_size=0.2)
xTrain = xTrain.to_numpy()
columnsToScale = ['PriceReg', 'ItemCount', 'demand1', 'demand2', 'demand3']
scaler = MinMaxScaler(feature_range=(0,1))
xTrain = scaler.fit_transform(xTrain)
xTest = scaler.transform(xTest)
from keras.utils.np_utils import to_categorical
one_hot_yTrain = to_categorical(yTrain, num_classes=15)
one_hot_yTest = to_categorical(yTest,num_classes=15)
When I check, for example:
one_hot_yTrain[11]
I got:
array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0.],
dtype=float32)
meanwhile for:
yTrain[11]
I got:
3
I expected yTrain[11] to be 11.
Did I misunderstand something? I'd appreciate any explanation.
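No answer is included here, but one likely explanation (my assumption, based on how pandas indexing works): after train_test_split, yTrain is still a pandas Series that keeps the original DataFrame index, so yTrain[11] is a label-based lookup (the row whose original index was 11, which happens to have label 3), whereas one_hot_yTrain is a plain numpy array and one_hot_yTrain[11] is the 12th row by position. A minimal sketch of how to compare the two positionally:
# minimal sketch (assumption): compare by position, not by the pandas index label
print(yTrain.iloc[11])               # positional lookup into the Series -> 11
print(one_hot_yTrain[11].argmax())   # column of the 1 in the one-hot row -> 11
print(yTrain[11])                    # label-based lookup on the original index -> 3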

Related

What is the logic of the extra columns in Tensorflow categorical encoding?

I am following the official Tensorflow tutorial for preprocessing layers, and I am not sure I get why I end up getting these extra columns after the categorical encoding.
Here is a stripped-down minimal reproducible example (including the data):
import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow.keras import layers
from tensorflow.keras.layers.experimental import preprocessing
import pathlib
dataset_url = 'http://storage.googleapis.com/download.tensorflow.org/data/petfinder-mini.zip'
csv_file = 'datasets/petfinder-mini/petfinder-mini.csv'
tf.keras.utils.get_file('petfinder_mini.zip', dataset_url, extract=True, cache_dir='.')
df = pd.read_csv(csv_file)
# In the original dataset "4" indicates the pet was not adopted.
df['target'] = np.where(df['AdoptionSpeed']==4, 0, 1)
# Drop un-used columns.
df = df.drop(columns=['AdoptionSpeed', 'Description'])
# A utility method to create a tf.data dataset from a Pandas Dataframe
def df_to_dataset(dataframe, shuffle=True, batch_size=32):
    dataframe = dataframe.copy()
    labels = dataframe.pop('target')
    ds = tf.data.Dataset.from_tensor_slices((dict(dataframe), labels))
    if shuffle:
        ds = ds.shuffle(buffer_size=len(dataframe))
    ds = ds.batch(batch_size)
    ds = ds.prefetch(batch_size)
    return ds
batch_size = 5
ds = df_to_dataset(df, batch_size=batch_size)
[(train_features, label_batch)] = ds.take(1)
def get_category_encoding_layer(name, dataset, dtype, max_tokens=None):
    # Create a StringLookup layer which will turn strings into integer indices
    if dtype == 'string':
        index = preprocessing.StringLookup(max_tokens=max_tokens)
    else:
        index = preprocessing.IntegerLookup(max_values=max_tokens)
    # Prepare a Dataset that only yields our feature
    feature_ds = dataset.map(lambda x, y: x[name])
    # Learn the set of possible values and assign them a fixed integer index.
    index.adapt(feature_ds)
    # Create a CategoryEncoding for our integer indices.
    encoder = preprocessing.CategoryEncoding(max_tokens=index.vocab_size())
    # encoder = preprocessing.CategoryEncoding(max_tokens=2)
    # Prepare a Dataset that only yields our feature.
    feature_ds = feature_ds.map(index)
    # Learn the space of possible indices.
    encoder.adapt(feature_ds)
    # Apply one-hot encoding to our indices. The lambda function captures the
    # layers so we can use them, or include them in the functional model later.
    return lambda feature: encoder(index(feature))
So, after running
type_col = train_features['Type']
layer = get_category_encoding_layer('Type', ds, 'string')
layer(type_col)
I get this result:
<tf.Tensor: shape=(5, 4), dtype=float32, numpy=
array([[0., 0., 1., 0.],
[0., 0., 1., 0.],
[0., 0., 0., 1.],
[0., 0., 0., 1.],
[0., 0., 0., 1.]], dtype=float32)>
similar to what is shown in the tutorial indeed.
Notice that this is a binary classification problem (Cat/Dog):
np.unique(type_col)
# array([b'Cat', b'Dog'], dtype=object)
So, what is the logic of the 2 extra columns after the categorical encoding shown in the result above? What do they represent, and why are there 2 (and not, say, 1, 3, or more)?
(I am perfectly aware that, should I wish for a simple one-hot encoding, I could simply use to_categorical(), but this is not the question here)
As already implied in the question, categorical encoding is somewhat richer than simple one-hot encoding. To see what these two columns represent, it suffices to add a diagnostic print somewhere inside the get_category_encoding_layer() function:
print(index.get_vocabulary())
Then the result of the last commands will be:
['', '[UNK]', 'Dog', 'Cat']
<tf.Tensor: shape=(5, 4), dtype=float32, numpy=
array([[0., 0., 1., 0.],
[0., 0., 1., 0.],
[0., 0., 0., 1.],
[0., 0., 0., 1.],
[0., 0., 0., 1.]], dtype=float32)>
The hint should hopefully be clear: the two extra columns represent the empty value '' and unknown (out-of-vocabulary) values '[UNK]', respectively, which could be present in future (unseen) data.
This is actually determined by the default arguments, not of CategoryEncoding, but of the preceding StringLookup; from the docs:
mask_token=''
oov_token='[UNK]'
You can end up with a somewhat tighter encoding (only 1 extra column instead of 2) by asking for oov_token='' instead of oov_token='[UNK]'; replace the call to StringLookup in the get_category_encoding_layer() function with
index = preprocessing.StringLookup(oov_token='', mask_token=None, max_tokens=max_tokens)
after which, the result will be:
['', 'Dog', 'Cat']
<tf.Tensor: shape=(5, 3), dtype=float32, numpy=
array([[0., 1., 0.],
[0., 1., 0.],
[0., 0., 1.],
[0., 0., 1.],
[0., 0., 1.]], dtype=float32)>
i.e. with only 3 columns (without a dedicated one for '[UNK]'). AFAIK, this is the lowest you can go - attempting to set both mask_token and oov_token to None will result in an error.
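To see the effect of those defaults in isolation, here is a minimal sketch (assuming the same experimental preprocessing API used above) that adapts a StringLookup on a toy Cat/Dog feature and prints the resulting vocabulary:
import tensorflow as tf
from tensorflow.keras.layers.experimental import preprocessing

toy = tf.constant(['Dog', 'Dog', 'Cat'])
lookup = preprocessing.StringLookup()   # defaults: mask_token='', oov_token='[UNK]'
lookup.adapt(toy)
print(lookup.get_vocabulary())          # ['', '[UNK]', 'Dog', 'Cat']
print(lookup(tf.constant([['Cat'], ['Dog']])).numpy())   # [[3], [2]]
The two reserved slots at the front of the vocabulary are exactly what produces the two extra columns after CategoryEncoding.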

How can I convert a sparse representation in a .txt file to a dense matrix in scipy?

I have a .txt file from the Epinions data set which is in a sparse representation (i.e. "23 387 5" represents the fact "user 23 has rated item 387 as 5"). From this sparse format I want to convert it to its dense representation in scipy so I can do matrix factorization on it.
I have loaded the file with loadtxt() from numpy and it is a [664824, 3] array. I passed it to scipy.sparse.csr_matrix and then called todense(), hoping to obtain the dense format, but I always get a matrix of the same shape: [664824, 3]. How can I turn it into the original [40163, 139738] dense representation?
import numpy as np
from scipy.sparse import csr_matrix
d = np.loadtxt("MFCode/Epinions_dataset.txt")
S = csr_matrix(d)
D = S.todense()
I expected a dense matrix with the shape of [40163,139738]
A small sample of csv-like text:
In [219]: txt = """0 1 3
...: 1 0 4
...: 2 2 5
...: 0 3 6""".splitlines()
In [220]: data = np.loadtxt(txt)
In [221]: data
Out[221]:
array([[0., 1., 3.],
[1., 0., 4.],
[2., 2., 5.],
[0., 3., 6.]])
Using scipy.sparse with the (data, (row, col)) style of input:
In [222]: from scipy import sparse
In [223]: M = sparse.coo_matrix((data[:,2], (data[:,0], data[:,1])), shape=(5,4))
In [224]: M
Out[224]:
<5x4 sparse matrix of type '<class 'numpy.float64'>'
with 4 stored elements in COOrdinate format>
In [225]: M.A
Out[225]:
array([[0., 3., 0., 6.],
[4., 0., 0., 0.],
[0., 0., 5., 0.],
[0., 0., 0., 0.],
[0., 0., 0., 0.]])
Alternatively fill in a zeros array directly:
In [226]: arr = np.zeros((5,4))
In [227]: arr[data[:,0].astype(int), data[:,1].astype(int)]=data[:,2]
In [228]: arr
Out[228]:
array([[0., 3., 0., 6.],
[4., 0., 0., 0.],
[0., 0., 5., 0.],
[0., 0., 0., 0.],
[0., 0., 0., 0.]])
But beware that np.zeros([40163, 139738]) could raise a memory error; M.A (i.e. M.toarray()) could do the same.
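For the actual Epinions file, a minimal sketch along the same lines (assuming the first two columns are 0-based user and item ids; if they are 1-based, subtract 1 first) would be:
import numpy as np
from scipy import sparse

d = np.loadtxt("MFCode/Epinions_dataset.txt")
rows = d[:, 0].astype(int)    # user ids
cols = d[:, 1].astype(int)    # item ids
vals = d[:, 2]                # ratings
R = sparse.coo_matrix((vals, (rows, cols)), shape=(40163, 139738)).tocsr()
The dense equivalent would need roughly 40163 * 139738 * 8 bytes, i.e. about 45 GB, so it is usually better to keep R sparse and run the matrix factorization on the sparse matrix directly.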

Expected to see 2 array(s), but instead got the following list of 1 arrays

I want to create a model that receives one image and produces two softmax outputs. The code is:
from keras.applications.inception_v3 import InceptionV3
from keras.layers import Dense, GlobalAveragePooling2D
from keras.models import Model
base_model = InceptionV3(include_top=False)
x = base_model.output
x = GlobalAveragePooling2D()(x)
# first softmax
x_1 = Dense(1024, activation='relu')(x)
predictions_1 = Dense(4, activation='softmax')(x_1)
# second Softmax
x_2 = Dense(1024, activation='relu')(x)
predictions_2 = Dense(4, activation='softmax')(x_2)
my_model = Model(inputs=base_model.input, outputs=[predictions_1,predictions_2])
# train
my_model.compile(...)
my_model.fit_generator(...)
When training, I got this error:
ValueError: Expected to see 2 array(s), but instead got the following
list of 1 arrays: [array([[0., 0., 0., 1.], [0., 0., 0., 1.], [0., 0.,
0., 1.], [0., 1., 0., 0.], [1., 0., 0., 0.], [0., 1., 0., 0.], [0., 1., 0., 0.], [0., 1., 0., 0.],...
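No answer is included here, but the error message itself points at the likely cause (this is my assumption, not the original poster's conclusion): the model has two outputs, so Keras expects two label arrays, while the data pipeline supplies only one. A minimal sketch of the expected label format, using fit with hypothetical names x_train, y1 and y2:
my_model.compile(optimizer='adam',
                 loss=['categorical_crossentropy', 'categorical_crossentropy'])
# y1 and y2 must each have shape (n_samples, 4), one array per output head;
# passing a single array (or a generator that yields one) raises the
# "Expected to see 2 array(s)" error.
my_model.fit(x_train, [y1, y2], epochs=10, batch_size=32)
With fit_generator, the generator analogously has to yield tuples of the form (x_batch, [y1_batch, y2_batch]).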

Tensorflow skflow, Data seems compatible, Valuerror, Shape error

Trying to run a very simple nn classifier with skflow.
classifier = skflow.TensorFlowDNNClassifier(
    hidden_units=[10, 10, 10],
    n_classes=10,
    batch_size=100,
    learning_rate=0.05)
print(data.train.images.shape)
print(data.train.labels.shape)
classifier.fit(data.train.images, data.train.labels)
The output is:
(73257, 3072)
(73257, 10)
and the error is:
in assert_same_rank
"Shapes %s and %s must have the same rank" % (self, other))
ValueError: Shapes (?, 10) and (?, 10, 10) must have the same rank
I do not really understand what the problem is here :(
Perhaps the labels of your dataset are one-hot vectors. (To illustrate, I use the MNIST dataset; see also https://www.tensorflow.org/versions/r0.7/tutorials/mnist/beginners/index.html)
In [1]: from tensorflow.examples.tutorials.mnist import input_data
In [2]: mnist = input_data.read_data_sets("MNIST_DATA/", one_hot=True)
Extracting MNIST_DATA/train-images-idx3-ubyte.gz
Extracting MNIST_DATA/train-labels-idx1-ubyte.gz
Extracting MNIST_DATA/t10k-images-idx3-ubyte.gz
Extracting MNIST_DATA/t10k-labels-idx1-ubyte.gz
In [3]: mnist.train.labels
Out[3]:
array([[ 0., 0., 0., ..., 1., 0., 0.],
[ 0., 0., 0., ..., 0., 0., 0.],
[ 0., 0., 0., ..., 0., 0., 0.],
...,
[ 0., 0., 0., ..., 0., 0., 0.],
[ 0., 0., 0., ..., 0., 0., 0.],
[ 0., 0., 0., ..., 0., 1., 0.]])
In [4]: mnist.train.labels.shape
Out[4]: (55000, 10)
In [5]: import skflow
In [6]: classifier = skflow.TensorFlowDNNClassifier(hidden_units=[10, 10, 10], n_classes=10, batch_size=100, learning_rate=0.05)
In [7]: classifier.fit(mnist.train.images, mnist.train.labels)
Then I got same error.
ValueError: Shapes (?, 10) and (?, 10, 10) must have the same rank
But skflow assumes that labels are numbers between 0 and 9 (one_hot=False):
In [5]: mnist = input_data.read_data_sets("MNIST_DATA/", one_hot=False)
Extracting MNIST_DATA/train-images-idx3-ubyte.gz
Extracting MNIST_DATA/train-labels-idx1-ubyte.gz
Extracting MNIST_DATA/t10k-images-idx3-ubyte.gz
Extracting MNIST_DATA/t10k-labels-idx1-ubyte.gz
In [6]: mnist.train.labels
Out[6]: array([7, 3, 4, ..., 5, 6, 8], dtype=uint8)
In [7]: classifier.fit(mnist.train.images, mnist.train.labels)
Step #99, avg. train loss: 2.31658
Step #199, avg. train loss: 1.63361
Out[7]:
TensorFlowDNNClassifier(batch_size=100, class_weight=None,
config_addon=<skflow.addons.config_addon.ConfigAddon object at 0x11cf7eb90>,
continue_training=False, hidden_units=[10, 10, 10],
keep_checkpoint_every_n_hours=10000, learning_rate=0.05,
max_to_keep=5, n_classes=10, optimizer='SGD', steps=200,
tf_master='', tf_random_seed=42, verbose=1)
Please give it a try.
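Applied to the original question (an assumption on my part, since only MNIST is demonstrated above): if data.train.labels has shape (73257, 10) it is already one-hot encoded, so you can collapse it back to integer class ids before fitting:
import numpy as np
# assumption: data.train.labels is one-hot with shape (73257, 10)
int_labels = np.argmax(data.train.labels, axis=1)   # shape (73257,), values 0-9
classifier.fit(data.train.images, int_labels)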