NumPy array value error from training in Auto-Keras with StratifiedKFold - pandas

Background
My sentiment analysis research involves a variety of datasets. Recently I encountered one dataset that I just cannot train successfully. I mostly work with open data in .CSV file format, so Pandas and NumPy are heavily used.
One of the approaches in my research is to integrate automated machine learning (AutoML), and the library I chose is Auto-Keras, mainly using its TextClassifier() wrapper to achieve AutoML.
Main Problem
I've verified with the official documentation that TextClassifier() takes data in the form of a NumPy array. However, when I load the data into a Pandas DataFrame and use .to_numpy() on the columns I need to train on, the following error keeps appearing:
ValueError Traceback (most recent call last)
<ipython-input-13-1444bf2a605c> in <module>()
16 clf = ak.TextClassifier(overwrite=True, max_trials=2)
17
---> 18 clf.fit(x_train, y_train, epochs=3, callbacks=cbs)
19
20
ValueError: Failed to convert a NumPy array to a Tensor (Unsupported object type float).
Error-related code sections
The section where I drop the unneeded Pandas DataFrame columns using .drop() and convert the needed columns to NumPy arrays using the to_numpy() function provided by Pandas:
df_src = pd.read_csv(get_data)
df_src = df_src.drop(columns=["Name", "Cast", "Plot", "Direction",
                              "Soundtrack", "Acting", "Cinematography"])
df_src = df_src.reset_index(drop=True)
X = df_src["Review"].to_numpy()
Y = df_src["Overall Sentiment"].to_numpy()
print(X, "\n")
print("\n", Y)
The main part where the error occurs, where I perform StratifiedKFold() and, at the same time, use TextClassifier() to train and test the model:
fold = 0
for train, test in skf.split(X, Y):
    fold += 1
    print(f"Fold #{fold}\n")

    x_train = X[train]
    y_train = Y[train]
    x_test = X[test]
    y_test = Y[test]

    cbs = [tf.keras.callbacks.EarlyStopping(patience=3)]
    clf = ak.TextClassifier(overwrite=True, max_trials=2)

    # The line where it indicated the error.
    clf.fit(x_train, y_train, epochs=3, callbacks=cbs)

    pred = clf.predict(x_test)  # result data type is in lists of `string`
    ceval = clf.evaluate(x_test, y_test)
    metrics_test = metrics.classification_report(y_test, np.array(list(pred), dtype=int))

    print(metrics_test, "\n")
    print(f"Fold #{fold} finished\n")
Supplementary
I am sharing the full code related to the error through Google Colab, so you can help me diagnose it here.
Edit notes
I have tried potential solutions such as:
x_train = np.asarray(x_train).astype(np.float32)
y_train = np.asarray(y_train).astype(np.float32)
or
x_train = tf.data.Dataset.from_tensor_slices((x_train,))
y_train = tf.data.Dataset.from_tensor_slices((y_train,))
However, the problem remains.

One of the strings is equal to nan. Just remove this entry and the corresponding label.
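For example, a minimal sketch of dropping the offending rows before converting to NumPy, reusing the df_src and column names from the question:

# Drop rows where the review text or the label is missing, so every
# entry handed to TextClassifier() is a real string with a real label.
df_src = df_src.dropna(subset=["Review", "Overall Sentiment"]).reset_index(drop=True)

X = df_src["Review"].to_numpy()
Y = df_src["Overall Sentiment"].to_numpy()

# Quick check that nothing float-typed (i.e. NaN) is left among the reviews.
print(df_src["Review"].map(type).value_counts())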

Related

TypeError: tuple indices must be integers or slices, not str, facing this error in keras model

I am running a keras model, LINK IS HERE. I have just changed the dataset for this model, and when I run it, it throws this error: TypeError: tuple indices must be integers or slices, not str. It's an image captioning model and the dataset is difficult for me to understand.
See the code below and note the location of the error.
reduce_lr = keras.callbacks.ReduceLROnPlateau(
    monitor="val_loss", factor=0.2, patience=3
)
# Create an early stopping callback.
early_stopping = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=5, restore_best_weights=True
)

history = dual_encoder.fit(
    train_dataloader,
    epochs=num_epochs,
    #validation_data=val_dataloader,
    #callbacks=[reduce_lr, early_stopping],
)

print("Training completed. Saving vision and text encoders...")
vision_encoder.save("vision_encoder")
text_encoder.save("text_encoder")
print("Models are saved.")
TypeError Traceback (most recent call last)
<ipython-input-31-745dd79762e6> in <module>()
15 history = dual_encoder.fit(
16 train_dataloader,
---> 17 epochs=num_epochs,
18 #validation_data=val_dataloader,
19 #callbacks=[reduce_lr, early_stopping],
11 frames
<ipython-input-26-0696c83bf387> in call(self, features, training)
16 with tf.device("/gpu:0"):
17 # Get the embeddings for the captions.
---> 18 caption_embeddings = text_encoder(features["caption"], training=training)
19 #caption_embeddings = text_encoder(train_inputs, training=training)
20 with tf.device("/gpu:1"):
TypeError: tuple indices must be integers or slices, not str
The error points to this line: caption_embeddings = text_encoder(features["caption"], training=training)
Now I am confused: I don't know whether this error is caused by the data I am passing to my model via history = dual_encoder.fit(train_dataloader), or whether it is related to caption_embeddings = text_encoder(features["caption"], training=training) and image_embeddings = vision_encoder(features["image"], training=training), which are defined in class DualEncoder.
I don't know what features["caption"] and features["image"], defined in class DualEncoder, are supposed to be; I have not changed these two for my new dataset, as you can check in my CODE HERE IN THIS COLAB FILE.
The dataset (train_dataloader) seems to return a tuple of items: link. In particular, the model input is a tuple (images, x_batch_input).
However, your code (in DualEncoder) seems to assume that it's a dict (with keys like "caption", "image", etc). I think that's the source of the mismatch.
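One way to reconcile the two, sketched below under the assumption that the dataloader is a tf.data.Dataset yielding (images, captions) tuples (check your dataset's actual element structure), is to remap its elements into the dict layout that DualEncoder expects:

# Remap the dataloader's tuples into the dict layout used inside DualEncoder,
# so features["image"] and features["caption"] resolve as before.
# (The tuple order is an assumption -- verify what your dataset actually yields.)
train_ds = train_dataloader.map(
    lambda images, captions: {"image": images, "caption": captions}
)

history = dual_encoder.fit(
    train_ds,
    epochs=num_epochs,
)

Alternatively, you could keep the dataloader as-is and unpack the tuple positionally inside DualEncoder.call (images, captions = features) instead of indexing with string keys.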

Passing a dict of tensors to a Keras model

I am trying to preprocess the infamous Titanic data (from Kaggle) by following this tutorial.
Everything was okay until I got to run the titanic_preprocessing model on the data (titanic_features), at which point I get this error:
ValueError: Failed to convert a NumPy array to a Tensor (Unsupported object type float).
In the tutorial it is mentioned that one should transform the data into a dict of tensors, but:
1. I don't see how the code (see HERE1 tag in my code below) makes a dict of tensors (there is no tf.convert_to_tensor, for example).
2. I don't understand why one should re-transform all the data, as the previous code was supposed to do just that (when one creates preprocessed_inputs, etc.).
Here is my code, but you can also execute it on Google Colab here.
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.layers.experimental import preprocessing

url = "https://raw.githubusercontent.com/aymeric75/IA/master/train.csv"
titanic = pd.read_csv(url)

titanic_features = titanic.copy()
titanic_labels = titanic_features.pop('Survived')

inputs = {}
for name, column in titanic_features.items():
    dtype = column.dtype
    if dtype == object:
        dtype = tf.string
    else:
        dtype = tf.float32
    inputs[name] = tf.keras.Input(shape=(1,), name=name, dtype=dtype)

numeric_inputs = {name: input for name, input in inputs.items()
                  if input.dtype == tf.float32}

x = layers.Concatenate()(list(numeric_inputs.values()))
norm = preprocessing.Normalization()
norm.adapt(np.array(titanic[numeric_inputs.keys()]))
all_numeric_inputs = norm(x)

preprocessed_inputs = [all_numeric_inputs]

for name, input in inputs.items():
    if input.dtype == tf.float32:
        continue
    lookup = preprocessing.StringLookup(vocabulary=np.unique(titanic_features[name].dropna()))
    one_hot = preprocessing.CategoryEncoding(max_tokens=lookup.vocab_size())
    x = lookup(input)
    x = one_hot(x)
    preprocessed_inputs.append(x)

preprocessed_inputs_cat = layers.Concatenate()(preprocessed_inputs)

titanic_preprocessing = tf.keras.Model(inputs, preprocessed_inputs_cat)

titanic_features_dict = {}

# This model just contains the input preprocessing. You can run it to see what it does to your data.
# Keras models don't automatically convert Pandas DataFrames because
# it's not clear if it should be converted to one tensor or to a dictionary of tensors. So convert it to a dictionary of tensors:
# HERE1
titanic_features_dict = {name: np.array(value)
                         for name, value in titanic_features.items()}

features_dict = {name: values[:1] for name, values in titanic_features_dict.items()}
titanic_preprocessing(features_dict)
Thanks a lot for your support!
Aymeric
[UPDATE] If you can answer question 2 ("I don't understand why one should re-transform all the data, as the previous code was supposed to do just that (when one creates preprocessed_inputs, etc.)"), then I will validate your answer, because I think I do indeed need to reformat the input (but I don't see the point of all the code before it...).
In your case, the problem is caused by the fact that your feature "Cabin" contains some nan (Not a Number) values. Tensorflow is fine with nan in floating point and integer data types, but not for strings.
You can replace all those nan values with an empty string in your pandas DataFrame:
titanic_features["Cabin"] = titanic_features["Cabin"].fillna("")
The previous code simply declares a preprocessing function as a keras model. You don't actually preprocess any data until your call to the titanic_preprocessing model.
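Putting the two points together, a minimal sketch that reuses the question's variable names: fill the missing strings first (filling every string column for safety, although Cabin is the one triggering the error), and note that data only flows through the graph once the preprocessing model is actually called on the dict of arrays.

# Replace missing values in every string column (e.g. "Cabin") with an empty string,
# so no float NaN is left where TensorFlow expects tf.string data.
for name in titanic_features.columns:
    if titanic_features[name].dtype == object:
        titanic_features[name] = titanic_features[name].fillna("")

# Nothing has been preprocessed yet: titanic_preprocessing is just a declared graph.
# Data flows through it only when the model is called on the dict of NumPy arrays.
titanic_features_dict = {name: np.array(value)
                         for name, value in titanic_features.items()}
preprocessed = titanic_preprocessing(titanic_features_dict)
print(preprocessed.shape)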

Strange dimension behaviour: needs both dimension 2 and 3 unsure why

I am trying to convert a simple model to TFLite and run into the following issue with dimensions.
I've already tried using perm=[1,0] and perm=[0,2,1]: the first one generates an error requiring 3 dimensions and the second one generates an error requiring 2 dimensions.
import tensorflow as tf

captions = tf.keras.layers.Input(shape=[5, 1024], name='captions')
cap_i = tf.keras.layers.Lambda(lambda q: q[0][:5, :])([captions])
cap_iT = tf.keras.layers.Lambda(lambda query: tf.transpose(query, perm=[0, 2, 1]))(cap_i)
model = tf.keras.models.Model(inputs=[captions], outputs=[cap_iT])
model.save('my_model.hd5')

converter = tf.lite.TFLiteConverter.from_keras_model_file('my_model.hd5')
tflite_model = converter.convert()
open("converted_modelfile.tflite", "wb").write(tflite_model)
ValueError: Dimension must be 2 but is 3 for 'lambda_1/transpose' (op: 'Transpose') with input shapes: [5,1024], [3].
You are probably getting the error in two different places.
You are throwing away the batch size dimension in the first Lambda with q[0]. You should not do this; you will need the batch dimension at the end of the Keras model (probably the location of the other error). Although you are passing [captions] inside a list, Keras is probably unwrapping it automatically because it's a single tensor.
The message in your question is certainly in the second Lambda, where you have a tensor with two dimensions [5,1024] (because you threw away the batch size in the first Lambda) and you are trying to permute 3 dimensions with [0,2,1].
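For instance, one way to keep the batch dimension (a sketch reusing the question's layer names; not necessarily the only fix) so that the 3-axis permutation [0, 2, 1] matches a (batch, 5, 1024) tensor:

import tensorflow as tf

captions = tf.keras.layers.Input(shape=[5, 1024], name='captions')

# Slice along the time axis but keep axis 0 (the batch dimension) intact.
cap_i = tf.keras.layers.Lambda(lambda q: q[:, :5, :])(captions)

# The tensor is now (batch, 5, 1024), so a 3-element perm is valid.
cap_iT = tf.keras.layers.Lambda(lambda q: tf.transpose(q, perm=[0, 2, 1]))(cap_i)

model = tf.keras.models.Model(inputs=[captions], outputs=[cap_iT])
model.summary()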
Found a nice way to fix the inputs using a compatible operation in TFLite.
import tensorflow.compat.v1 as tf
import numpy as np
tf.disable_v2_behavior()
initial_input = tf.placeholder(dtype=tf.float32, shape=(None,5,1024))
cap_i = tf.strided_slice(initial_input, [0,0,0], [0,5,1024], [1,1,1], shrink_axis_mask=1)
cap_i_reshaped = tf.reshape(cap_i, [1, 5, 1024])
cap_iT = tf.transpose(cap_i_reshaped, perm=[0,2,1])
sess = tf.Session()
sess.run(tf.global_variables_initializer())
tf.io.write_graph(sess.graph_def, '', 'train.pbtxt')
converter = tf.lite.TFLiteConverter.from_session(sess, [initial_input], [cap_iT])
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS,
                                       tf.lite.OpsSet.SELECT_TF_OPS]
tflite_model = converter.convert()
open('converted_model.tflite', "wb").write(tflite_model)
sess.close()

How to Combine a TF_IDF Vectorizer with a Custom Feature

I am trying to construct a model from a combination of numerical features and text features from a dataframe. However, I am having a lot of trouble successfully combining the features, training on them, and then testing them.
Right now I am trying to use a DataFrameMapper like so:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn_pandas import DataFrameMapper

mapper = DataFrameMapper([
    ('body', TfidfVectorizer()),
    ('numeric_feature', None),
])

for train_index, test_index in kFold.split(DF['body']):
    # Split the dataset by Kfold
    X_train = even_rand[['body', 'numeric_feature']].iloc[train_index]
    y_train = even_rand['sub_class'].iloc[train_index]
    X_test = even_rand[['body', 'numeric_feature']].iloc[test_index]
    y_test = even_rand['sub_class'].iloc[test_index]

    # Vectorize/transform docs
    X_train = mapper.fit_transform(X_train)
    X_test = mapper.fit_transform(X_test)

    # Get SVM
    svm = SGDClassifier(loss='hinge', penalty='l2',
                        alpha=1e-3, n_iter=5, random_state=10)
    svm.fit(X_train, y_train)
    svm_score = svm.score(X_test, y_test)
This successfully combines the data and trains the model; however, when I try to test it, the features don't seem to match up correctly and I get the error
ValueError: X has 49974 features per sample; expecting 87786
Would anyone know how to solve this issue, or know of a better way to combine, train, and test the numerical and text features together? I would also like to keep the features as sparse matrices if possible.
Instead of:
X_train = mapper.fit_transform(X_train)
X_test = mapper.fit_transform(X_test)
try:
X_train = mapper.fit_transform(X_train)
X_test = mapper.transform(X_test)  # change fit_transform to transform
Calling fit_transform on the test split refits the TfidfVectorizer and rebuilds its vocabulary from the test documents, so the test matrix ends up with a different number of columns than the one the classifier was trained on. Using transform reuses the vocabulary learned on the training split, keeping the feature dimensions consistent.

One_Hot Encode and Tensorflow (Explain behind the scenes )

I am new to the deep learning world and TensorFlow. TensorFlow is still quite complicated for me right now.
I was following a tutorial on the TF Layers API and I ran into this issue with one-hot encoding. Here is my code:
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

wine_data = load_wine()
feat_data = wine_data['data']
labels = wine_data['target']

X_train, X_test, y_train, y_test = train_test_split(feat_data,
                                                    labels,
                                                    test_size=0.3,
                                                    random_state=101)

scaler = MinMaxScaler()
scaled_x_train = scaler.fit_transform(X_train)
scaled_x_test = scaler.transform(X_test)

# ONE HOT ENCODED
onehot_y_train = pd.get_dummies(y_train).as_matrix()
one_hot_y_test = pd.get_dummies(y_test).as_matrix()

num_feat = 13
num_hidden1 = 13
num_hidden2 = 13
num_outputs = 3
learning_rate = 0.01

import tensorflow as tf
from tensorflow.contrib.layers import fully_connected

X = tf.placeholder(tf.float32, shape=[None, num_feat])
y_true = tf.placeholder(tf.float32, shape=[None, 3])

actf = tf.nn.relu

hidden1 = fully_connected(X, num_hidden1, activation_fn=actf)
hidden2 = fully_connected(hidden1, num_hidden2, activation_fn=actf)
output = fully_connected(hidden2, num_outputs)

loss = tf.losses.softmax_cross_entropy(onehot_labels=y_true, logits=output)

optimizer = tf.train.AdamOptimizer(learning_rate)
train = optimizer.minimize(loss)

init = tf.global_variables_initializer()

training_steps = 1000
with tf.Session() as sess:
    sess.run(init)

    for i in range(training_steps):
        sess.run(train, feed_dict={X: scaled_x_train, y_true: y_train})

    # Get Predictions
    logits = output.eval(feed_dict={X: scaled_x_test})
    preds = tf.argmax(logits, axis=1)
    results = preds.eval()
When I run this code I got this error
ValueError: Cannot feed value of shape (124,) for Tensor 'Placeholder_1:0', which has shape '(?, 3)'
After a little digging I found that modifying the sess.run call to
sess.run(train, feed_dict={X: scaled_x_train, y_true: onehot_y_train})
i.e. changing y_train to onehot_y_train, made the code run.
I just want to know what is happening behind the scenes and why the one-hot encoding is necessary in this code.
Your network is making a class prediction on 3 classes, class A, B, and C.
In defining a neural network to transform your 13 inputs to a representation that you can use to distinguish between these 3 classes you have a few choices.
You could output 1 number. Let's define it so that a single-value output < 0 represents class A, an output in [0, 1] is class B, and an output > 1 is class C.
You could define this, use a loss function like squared error, and the network would learn to work under these assumptions and probably do halfway decently at it.
However, that was a rather arbitrary choice of values to define 3 classes, as I'm sure you can see. And it's certainly sub-optimal. Learning this representation is harder than it needs to be. Can we do better?
Let's pick a more reasonable approach. Instead of 1 output we have 3 outputs. We define each output to represent how strongly we believe in a particular class. In order to conform to the cross entropy loss you use we'll further constrain those values to be in the range [0,1] by applying a sigmoid to them. So great, we now have 3 values in range [0,1] that each represent the belief that the input should fall into each of our 3 classes.
You have labels for each of your inputs, you know for sure that these inputs are class A, B, or C. So for a given input that is say class C, your label would naturally be [0, 0, 1] (e.g. you know it's not A or B, so 0 in both of those cases, and 1 for C which you know the class to be). Voila, you have the one-hot encoding!
As you might imagine this is a much easier problem to solve than the first one I presented. Hence we choose to represent our problem this way because we end up with networks that perform better when we do. It's not that you couldn't represent it another way, you just want the best results possible and one-hot encoding typically performs above other representations you might dream up.
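For concreteness, a tiny sketch of that [0, 0, 1]-style encoding in plain NumPy (equivalent to what pd.get_dummies produces in the question's code for integer labels):

import numpy as np

# Integer class ids for classes A, B, C -> 0, 1, 2
labels = np.array([2, 0, 1, 2])

# Each row has a single 1 in the column of its class: the one-hot encoding.
one_hot = np.eye(3, dtype=np.float32)[labels]
print(one_hot)
# [[0. 0. 1.]
#  [1. 0. 0.]
#  [0. 1. 0.]
#  [0. 0. 1.]]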