tensorflow error while training model - Labels dtype should be integer Instead got <dtype: 'string'> - pandas

I am currently learning TensorFlow and am new to the concept. I am trying multi-class classification using LinearClassifier.
I have a dataset where I have reduced the number of input variables to 30 using PCA (done with scikit-learn). I have named the PCA columns PCA_Col_0 through PCA_Col_29.
I then created a TensorFlow feature column for each of the 30 variables using the following code:
feat_cols = ['PCA_Col_0', ..., 'PCA_Col_29']
d = {}
for item in feat_cols:
    d[item] = tf.feature_column.numeric_column(item)
feat_cols2 = list(d.values())
I then initialized the model:
import tensorflow as tf

n_classes = 3914
model = tf.estimator.LinearClassifier(feature_columns=feat_cols2, n_classes=n_classes)
input_fn = tf.estimator.inputs.pandas_input_fn(x=DF_Final_V1[feat_cols], y=DF_Final_V1['nUnique_ID'], shuffle=False)
model.train(input_fn)
I get the error "Labels dtype should be integer Instead got <dtype: 'string'>" from TensorFlow.
I have verified the following:
The input dataset has only float64 entries.
There are no null or NaN values in the input dataset.
The tf.feature_column entries show dtype as float32.
Why isn't my model training and why am I getting this error?

Credit to @Bruce Swain in the comments.
The code worked after I modified the label values to be integers in the range 0 to n_classes - 1.
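For reference, a minimal sketch of that kind of label remapping in pandas, assuming the labels live in the nUnique_ID column used above (the encoding step itself is an assumption, not code from the original post):

# Hypothetical fix: encode the string labels as integer codes in [0, n_classes - 1]
import pandas as pd
codes, uniques = pd.factorize(DF_Final_V1['nUnique_ID'])
DF_Final_V1['nUnique_ID'] = codes   # integer labels 0 .. n_classes - 1
n_classes = len(uniques)            # pass this to LinearClassifier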

Related

Discrepancy between results reported by TensorFlow model.evaluate and model.predict

I've been back and forth with this for ages, but without being able to find a solution anywhere so far. So, I have a HuggingFace model ('bert-base-cased') that I'm using with TensorFlow and a custom dataset. I've: (1) tokenized my data; (2) split the data; (3) converted the data to TF dataset format; (4) instantiated, compiled, and fit the model.
During training, it behaves as you'd expect: training and validation accuracy go up. But when I evaluate the model on the test dataset using TF's model.evaluate and model.predict, the results are very different. The accuracy as reported by model.evaluate is higher (and more or less in line with the validation accuracy); the accuracy as reported by model.predict is about 10% lower. (Maybe it's just a coincidence, but it's similar to the reported training accuracy after the single epoch of fine-tuning.)
Can anyone figure out what's causing this? I include snippets of my code below.
# tokenize the dataset
tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name_or_path="bert-base-cased", use_fast=False)

def tokenize_function(examples):
    return tokenizer(examples['text'], padding="max_length", truncation=True)

tokenized_datasets = dataset.map(tokenize_function, batched=True)
# splitting dataset
trainSize = 0.7
valTestSize = 1 - trainSize
train_testvalid = tokenized_datasets.train_test_split(test_size=valTestSize,stratify_by_column='class')
valid_test = train_testvalid['test'].train_test_split(test_size=0.5,stratify_by_column='class')
# renaming each of the datasets for convenience
train_set = train_testvalid['train']
val_set = valid_test['train']
test_set = valid_test['test']
# converting the tokenized datasets to TensorFlow datasets
data_collator = DefaultDataCollator(return_tensors="tf")

tf_train_dataset = train_set.to_tf_dataset(
    columns=["attention_mask", "input_ids", "token_type_ids"],
    label_cols=['class'],
    shuffle=True,
    collate_fn=data_collator,
    batch_size=8)

tf_validation_dataset = val_set.to_tf_dataset(
    columns=["attention_mask", "input_ids", "token_type_ids"],
    label_cols=['class'],
    shuffle=False,
    collate_fn=data_collator,
    batch_size=8)

tf_test_dataset = test_set.to_tf_dataset(
    columns=["attention_mask", "input_ids", "token_type_ids"],
    label_cols=['class'],
    shuffle=False,
    collate_fn=data_collator,
    batch_size=8)
# loading tensorflow model
model = TFAutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=1)

# compiling the model
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=5e-6),
    loss=tf.keras.losses.BinaryCrossentropy(),
    metrics=tf.metrics.BinaryAccuracy())

# fitting model
history = model.fit(tf_train_dataset,
                    validation_data=tf_validation_dataset,
                    epochs=1)
# Evaluating the model on the test data using `evaluate`
results = model.evaluate(x=tf_test_dataset, verbose=2)  # reports binary_accuracy: 0.9152

# first attempt at using model.predict method
hits = 0
misses = 0
for x, y in tf_test_dataset:
    logits = tf.keras.backend.get_value(model(x, training=False).logits)
    labels = tf.keras.backend.get_value(y)
    for i in range(len(logits)):
        if logits[i][0] < 0:
            z = 0
        else:
            z = 1
        if z == labels[i]:
            hits += 1
        else:
            misses += 1
print(hits / (hits + misses))  # reports binary_accuracy: 0.8187
# second attempt at using model.predict method
modelPredictions = model.predict(tf_test_dataset).logits
testDataLabels = np.concatenate([y for x, y in tf_test_dataset], axis=0)

hits = 0
misses = 0
for i in range(len(modelPredictions)):
    if modelPredictions[i][0] >= 0:
        z = 1
    else:
        z = 0
    if z == testDataLabels[i]:
        hits += 1
    else:
        misses += 1
print(hits / (hits + misses))  # reports binary_accuracy: 0.8187
Things I've tried include:
different loss functions (it's a binary classification problem with the label column of the dataset filled with either a zero or a one for each row);
different ways of unpacking the test dataset and feeding it to model.predict;
altering the 'num_labels' parameter between 1 and 2.
I fixed the problem by changing the num_labels parameter to 2 and the loss function to sparse categorical cross-entropy. (I then had to change my model.predict loop to take the argmax of the two logits produced by the model.)
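A minimal sketch of that fix, reusing the datasets and optimizer settings from the question; the from_logits=True flag and the accuracy-metric swap are assumptions on my part, since the model outputs raw logits:

model = TFAutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=2)
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=5e-6),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=tf.metrics.SparseCategoricalAccuracy())
history = model.fit(tf_train_dataset, validation_data=tf_validation_dataset, epochs=1)

# predict now returns two logits per example; take the argmax as the predicted class
logits = model.predict(tf_test_dataset).logits
predicted_classes = np.argmax(logits, axis=1)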

ValueError: Dimensions must be equal in Tensorflow/Keras

My code is as follows:
v = tf.Variable(initial_value=v, trainable=True)
where v.shape is (1, 768).
In the model:
inputs_sents = keras.Input(shape=(50, 3))
inputs_events = keras.Input(shape=(50, 768))
x_1 = tf.matmul(v, tf.transpose(inputs_events))
x_2 = tf.matmul(x_1, inputs_sents)
But I got this error:
ValueError: Dimensions must be equal, but are 768 and 50 for
'{{node BatchMatMulV2_3}} =
BatchMatMulV2[T=DT_FLOAT,
adj_x=false,
adj_y=false](BatchMatMulV2_3/ReadVariableOp,
Transpose_3)' with input shapes: [1,768], [768,50,?]
I think it is taking the batch dimension into account? But how should I deal with this?
v is a trainable vector (or a 2D array with a first dimension of 1); I want it to be trained during the training process.
PS: This is the result I got using the code provided by the first answer. I think it is incorrect, because Keras already takes the first (batch) dimension into account.
Plus, from the keras documentation,
shape: A shape tuple (integers), not including the batch size. For instance, shape=(32,) indicates that the expected input will be batches of 32-dimensional vectors. Elements of this tuple can be None; 'None' elements represent dimensions where the shape is not known.
https://keras.io/api/layers/core_layers/input/
Should I rewrite my code without Keras?
The shape of a batch is denoted by None:
import numpy as np
inputs_sents = keras.Input(shape=(None,1,3))
inputs_events = keras.Input(shape=(None,1,768))
v = np.ones(shape=(1,768), dtype=np.float32)
v = tf.Variable(initial_value=v, trainable=True)
x_1 = tf.matmul(v,tf.transpose(inputs_events))
x_2 = tf.matmul(x_1,inputs_sents)
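If the goal is just to apply v across every example in the batch, one alternative sketch, assuming inputs_events is (batch, 50, 768) and inputs_sents is (batch, 50, 3) as in the original model, is to transpose only the last two axes (or spell the contraction out with einsum) so the batch axis is left alone:

import tensorflow as tf
from tensorflow import keras

inputs_sents = keras.Input(shape=(50, 3))     # (batch, 50, 3)
inputs_events = keras.Input(shape=(50, 768))  # (batch, 50, 768)
v = tf.Variable(tf.ones((1, 768)), trainable=True)

# transpose_b transposes only the last two axes, so the batch axis is preserved
x_1 = tf.matmul(v, inputs_events, transpose_b=True)   # (batch, 1, 50)
x_2 = tf.matmul(x_1, inputs_sents)                     # (batch, 1, 3)

# equivalent contraction written explicitly
x_1_alt = tf.einsum('ij,bkj->bik', v, inputs_events)   # (batch, 1, 50)

For v to actually be tracked and updated as a trainable weight of the model, it would normally be created inside a custom layer via add_weight rather than as a free-standing tf.Variable.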

tensorflow model trained on keras.preprocessing.timeseries_dataset_from_array yields unexpected output shape of (sequence_length, 1)

I'm trying to train a TensorFlow model where my inputs are a lagged time series of multiple features and I want to predict a single value.
Somehow the output shape ends up as an array of (lag/sequence_length, 1) per sample when my lagged dataset has more than one feature, but I haven't been able to figure out why exactly that is. Here is a minimal example of what I'm trying to do:
import tensorflow as tf
from tensorflow.keras import layers
import numpy as np
import pandas as pd

# generate some dummy data
x0 = np.array(range(300))
x1 = np.array(range(300)) * 2
df = pd.DataFrame({"x0": x0, "x1": x1})
y = np.array(range(100))
# also tried reshaping my y, but no help
# y = np.array(range(100)).reshape(100,1)

# make a dataset with lagged values
ds = tf.keras.preprocessing.timeseries_dataset_from_array(
    data=df,
    targets=y,
    sequence_length=3,
    sequence_stride=1,
    sampling_rate=1,
    batch_size=5
)

# show an example of what we are working with
list(ds.take(1))

# define simple model and train it
model = tf.keras.Sequential(
    [
        layers.Dense(32),
        layers.Dense(1),
    ]
)
model.compile(loss="mse", optimizer=tf.optimizers.Adam())
model.fit(ds, epochs=4)

# make predictions on dataset
predictions = model.predict(ds)

# show predictions
predictions
print(predictions.shape)
"""
(100, 3, 1)
"""
If I create the dataset with only a single feature:
ds = tf.keras.preprocessing.timeseries_dataset_from_array(
    data=x1,
    targets=y,
    sequence_length=3,
    sequence_stride=1,
    sampling_rate=1,
    batch_size=5
)
then my outputs have the expected shape.
I would appreciate any pointers. I'm guessing something is probably getting broadcast, which then results in the output I'm seeing, but I haven't been able to figure out what exactly is going on.
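One plausible explanation (an assumption on my part, not something confirmed in the thread): Dense layers operate only on the last axis. With two features each sample has shape (3, 2), so the model maps (batch, 3, 2) to (batch, 3, 32) to (batch, 3, 1), whereas a single 1-D feature array yields samples of shape (3,), which collapse to (batch, 1). A sketch of how flattening the lag dimension would change the output shape:

model = tf.keras.Sequential(
    [
        layers.Flatten(),   # (batch, 3, 2) -> (batch, 6)
        layers.Dense(32),
        layers.Dense(1),    # (batch, 1)
    ]
)
model.compile(loss="mse", optimizer=tf.optimizers.Adam())
model.fit(ds, epochs=4)
print(model.predict(ds).shape)  # expected: (num_samples, 1)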

word2vec implementation in tensorflow 2.0

I want to implement word2vec using TensorFlow 2.0.
I have prepared a dataset according to the skip-gram model and have got approximately 18 million observations (target and context words).
I used the following dataset for my goal:
https://www.kaggle.com/c/quora-question-pairs/notebooks
I created a new dataset for the skip-gram model. I used a window size of 2 and a number of skips equal to 2 as well, so that for each target word (the input) a context word (what I have to predict) is created. It looks like this:
target  context
1       3
1       1
2       1
2       1222
Here is my code:
dataset_train = tf.data.Dataset.from_tensor_slices((target, context))
dataset_train = dataset_train.shuffle(buffer_size=1024).batch(64)

# Parameters:
num_words = len(word_index)  # approximately 100000
embed_size = 300
num_sampled = 64
initializer_softmax = tf.keras.initializers.GlorotUniform()

# Variables:
embeddings_weight = tf.Variable(tf.random.uniform([num_words, embed_size], -1.0, 1.0))
softmax_weight = tf.Variable(initializer_softmax([num_words, embed_size]))
softmax_bias = tf.Variable(initializer_softmax([num_words]))

optimizer = tf.keras.optimizers.Adam()

# As before, we are supplying a list of integers (that correspond to our validation vocabulary words) to the embedding_lookup() function, which looks up these rows in the normalized_embeddings tensor, and returns the subset of validation normalized embeddings.
# Now that we have the normalized validation tensor, valid_embeddings, we can multiply this by the full normalized vocabulary (normalized_embedding) to finalize our similarity calculation:
#tf.function
def training(X, y):
    with tf.GradientTape() as tape:
        embed = tf.nn.embedding_lookup(embeddings_weight, X)
        loss = tf.reduce_mean(tf.nn.sampled_softmax_loss(weights=softmax_weight, biases=softmax_bias, inputs=embed,
                                                         labels=y, num_sampled=num_sampled, num_classes=num_words))
    variables = [embeddings_weight, softmax_weight, softmax_bias]
    gradients = tape.gradient(loss, variables)
    optimizer.apply_gradients(zip(gradients, variables))

EPOCHS = 30
for epoch in range(EPOCHS):
    print('Epoch:', epoch)
    for X, y in dataset_train:
        training(X, y)

# compute similarity of words:
norm = tf.sqrt(tf.reduce_sum(tf.square(embeddings_weight), 1, keepdims=True))
norm_embed = embeddings_weight / norm
temp_emb = tf.nn.embedding_lookup(norm_embed, X)
similarity = tf.matmul(temp_emb, tf.transpose(norm_embed))
But the computation of even one epoch lasts too long. Is it possible to somehow improve the performance of my code? (I am using Google Colab for the code execution.)
EDIT: this is the shape of my train dataset:
dataset_train
<BatchDataset shapes: ((None,), (None, 1)), types: (tf.int64, tf.int64)>
I was following the instructions from this guide: https://adventuresinmachinelearning.com/word2vec-tutorial-tensorflow/
This is because the softmax function is computationally quite expensive when dealing with probabilities over millions of points in the word2vec algorithm, as explained here. Faster training would be possible with negative sampling.
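A minimal sketch of what that change could look like in the training step above, reusing the same variables and the (None,), (None, 1) batch shapes shown in the EDIT; swapping in tf.nn.nce_loss, the noise-contrastive (negative-sampling-style) counterpart of sampled softmax, is my suggestion, not code from the answer:

@tf.function  # compiling the step into a graph also tends to speed up the loop
def training_nce(X, y):
    with tf.GradientTape() as tape:
        embed = tf.nn.embedding_lookup(embeddings_weight, X)
        loss = tf.reduce_mean(tf.nn.nce_loss(weights=softmax_weight,
                                             biases=softmax_bias,
                                             labels=y,            # shape (batch, 1)
                                             inputs=embed,
                                             num_sampled=num_sampled,
                                             num_classes=num_words))
    variables = [embeddings_weight, softmax_weight, softmax_bias]
    gradients = tape.gradient(loss, variables)
    optimizer.apply_gradients(zip(gradients, variables))
    return loss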

record_defaults of tf.decode_csv in tensorflow

I used tf.decode_csv in TensorFlow as the decoder to parse training examples from a tab-delimited file into CNN models. For every training example, the features are two-dimensional (100 columns, 2000 rows). After reading the documentation on the TensorFlow official site, I still have two questions.
How do I create record_defaults? The following is my code to do that, but I am not sure if it is right.
Code:
filename_queue = tf.train.string_input_producer([file], num_epochs)
key, value = tf.TextLineReader().read(filename_queue)
record_defaults = [[1.0 for col in range(0, 100)] for row in range(0, 2000)]
content = tf.decode_csv(value, record_defaults = record_defaults, field_delim = '\t')
features = tf.pack(content[0:1999])
I am doing binary (0, 1) classification. Where do I put the labels for the training examples? In the 2001st row? (For every training example, the first 2000 rows are the features, and the 2001st row is the label.)
Thanks for your time!
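For what it's worth, record_defaults in tf.decode_csv describes the columns of a single line, not the whole multi-line example, so for a tab-delimited line with 100 values it would be a flat list of 100 one-element defaults. A minimal sketch under that assumption, keeping the TF 1.x queue-based pipeline from the question (tf.stack is the newer name of tf.pack):

num_cols = 100
# one default per column of a single line; the dtype of each default sets that column's dtype
record_defaults = [[1.0]] * num_cols
content = tf.decode_csv(value, record_defaults=record_defaults, field_delim='\t')
features_row = tf.stack(content)  # shape (100,): one of the 2000 rows of an example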