How to Combine a TF-IDF Vectorizer with a Custom Feature - pandas

I am trying to construct a model that combines numerical features and text features from the same dataframe. However, I am having a lot of trouble successfully combining the features, training on them, and then testing them.
Right now I am trying to use a DataFrameMapper like so:
from sklearn.feature_extraction.text import TfidfVectorizer  # the code uses TfidfVectorizer, not TfidfTransformer
from sklearn_pandas import DataFrameMapper
mapper = DataFrameMapper([
    ('body', TfidfVectorizer()),
    ('numeric_feature', None),
])
for train_index, test_index in kFold.split(DF['body']):
    # Split the dataset by KFold
    X_train = even_rand[['body', 'numeric_feature']].iloc[train_index]
    y_train = even_rand['sub_class'].iloc[train_index]
    X_test = even_rand[['body', 'numeric_feature']].iloc[test_index]
    y_test = even_rand['sub_class'].iloc[test_index]
    # Vectorize/transform docs
    X_train = mapper.fit_transform(X_train)
    X_test = mapper.fit_transform(X_test)
    # Get SVM
    svm = SGDClassifier(loss='hinge', penalty='l2',
                        alpha=1e-3, n_iter=5, random_state=10)
    svm.fit(X_train, y_train)
    svm_score = svm.score(X_test, y_test)
This successfully combines the data and trains the model, but when I try to test on the held-out fold, the features don't match up and I get the error
ValueError: X has 49974 features per sample; expecting 87786
Would anyone know how to solve this issue, or know of a better way to combine/train/test the numerical and text features together? I would also like to keep the features as sparse matrices if possible.

Instead of:
X_train = mapper.fit_transform(X_train)
X_test = mapper.fit_transform(X_test)
try:
X_train = mapper.fit_transform(X_train)
X_test = mapper.transform(X_test) # change fit_transform to transform
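The reason this fixes the error: fit_transform on the test fold refits the TfidfVectorizer, which learns a new vocabulary (and therefore a different number of columns) from the test documents, while transform reuses the vocabulary fitted on the training fold, so the train and test matrices line up. Below is a minimal sketch of the corrected loop, assuming the question's even_rand dataframe and column names; DataFrameMapper also takes a sparse=True flag, which should address the wish to keep the features sparse:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import KFold
from sklearn_pandas import DataFrameMapper

# sparse=True returns a sparse matrix when any extractor
# (here the TfidfVectorizer) produces sparse output
mapper = DataFrameMapper([
    ('body', TfidfVectorizer()),
    ('numeric_feature', None),
], sparse=True)

kfold = KFold(n_splits=5)
for train_index, test_index in kfold.split(even_rand):
    X_train = even_rand[['body', 'numeric_feature']].iloc[train_index]
    y_train = even_rand['sub_class'].iloc[train_index]
    X_test = even_rand[['body', 'numeric_feature']].iloc[test_index]
    y_test = even_rand['sub_class'].iloc[test_index]

    X_train = mapper.fit_transform(X_train)  # fit the vocabulary on the training fold only
    X_test = mapper.transform(X_test)        # reuse it for the test fold

    # n_iter from the question is called max_iter in newer scikit-learn
    svm = SGDClassifier(loss='hinge', penalty='l2',
                        alpha=1e-3, max_iter=5, random_state=10)
    svm.fit(X_train, y_train)
    print(svm.score(X_test, y_test))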

Related

NumPy array value error from training in Auto-Keras with StratifiedKFold

Background
My sentiment analysis research involves a variety of datasets. Recently I've encountered one dataset that somehow I just cannot train successfully. I mostly work with open data in .CSV format, hence Pandas and NumPy are heavily used.
During my research, one of the approaches is to integrate automated machine learning (AutoML), and the library I chose was Auto-Keras, mainly using its TextClassifier() wrapper function to achieve AutoML.
Main Problem
I've verified with the official documentation that TextClassifier() takes data in the form of a NumPy array. However, when I load the data into a Pandas DataFrame and use .to_numpy() on the columns that I need to train on, the following error keeps showing:
ValueError Traceback (most recent call last)
<ipython-input-13-1444bf2a605c> in <module>()
16 clf = ak.TextClassifier(overwrite=True, max_trials=2)
17
---> 18 clf.fit(x_train, y_train, epochs=3, callbacks=cbs)
19
20
ValueError: Failed to convert a NumPy array to a Tensor (Unsupported object type float).
Error-related code sections
The section where I drop the unneeded Pandas DataFrame columns using .drop() and convert the needed columns to NumPy arrays using the to_numpy() function that Pandas provides:
df_src = pd.read_csv(get_data)
df_src = df_src.drop(columns=["Name", "Cast", "Plot", "Direction",
                              "Soundtrack", "Acting", "Cinematography"])
df_src = df_src.reset_index(drop=True)
X = df_src["Review"].to_numpy()
Y = df_src["Overall Sentiment"].to_numpy()
print(X, "\n")
print("\n", Y)
The main error code part, where I perform StratifiedKFold() and, at the same time, use TextClassifier() to train and test the model:
fold = 0
for train, test in skf.split(X, Y):
    fold += 1
    print(f"Fold #{fold}\n")

    x_train = X[train]
    y_train = Y[train]
    x_test = X[test]
    y_test = Y[test]

    cbs = [tf.keras.callbacks.EarlyStopping(patience=3)]
    clf = ak.TextClassifier(overwrite=True, max_trials=2)
    # The line where it indicated the error.
    clf.fit(x_train, y_train, epochs=3, callbacks=cbs)
    pred = clf.predict(x_test)  # result data type is in lists of `string`
    ceval = clf.evaluate(x_test, y_test)
    metrics_test = metrics.classification_report(y_test, np.array(list(pred), dtype=int))
    print(metrics_test, "\n")

    print(f"Fold #{fold} finished\n")
Supplementary
I am sharing the full code related to the error through Google Colab, which you can help me diagnose here.
Edit notes
I have tried potential solutions, such as:
x_train = np.asarray(x_train).astype(np.float32)
y_train = np.asarray(y_train).astype(np.float32)
or
x_train = tf.data.Dataset.from_tensor_slices((x_train,))
y_train = tf.data.Dataset.from_tensor_slices((y_train,))
However, the problem remains.
One of the strings is equal to nan (a float), which makes the array's dtype object and breaks the conversion to a tensor. Just remove this entry and the corresponding label.
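A minimal sketch of that cleanup, assuming the column names from the question (Review and Overall Sentiment); dropping the whole row removes the label together with the bad review text, so X and Y stay aligned:

df_src = df_src.dropna(subset=["Review"]).reset_index(drop=True)  # drop rows whose review text is NaN
X = df_src["Review"].to_numpy()
Y = df_src["Overall Sentiment"].to_numpy()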

Problem with shapes of experimental Tensorflow dataset

I am trying to store numpy arrays in a Tensorflow dataset. The model fits correctly when I use the numpy arrays directly as train and test data, but not when I store them in a single Tensorflow dataset. The problem is with the dimensions of the dataset: something is wrong even though the shapes seem OK at first sight.
After trying multiple things to reshape my Tensorflow dataset, I am still unable to get it working. My code is the following:
train_x.shape
Out[54]: (7200, 40)
train_y.shape
Out[55]: (7200,)
dataset = tf.data.Dataset.from_tensor_slices((x,y))
print(dataset)
Out[56]: <TensorSliceDataset shapes: ((40,), ()), types: (tf.int32, tf.int32)>
model.compile(optimizer=optimizer, loss='sparse_categorical_crossentropy')
history = model.fit(dataset, epochs=EPOCHS, batch_size=256)
The fit fails inside sparse_softmax_cross_entropy_with_logits with:
ValueError: Shape mismatch: The shape of labels (received (1,)) should equal the shape of logits except for the last dimension (received (40, 1351)).
I have seen this answer, but I am sure it doesn't apply here; I must use sparse_categorical_crossentropy. I am taking inspiration from this example, where the train and test data are stored in a Tensorflow dataset. I also want to keep the arrays in a dataset, as I will have to use it later.
You can't use batch_size with model.fit() when passing a tf.data.Dataset; use tf.data.Dataset.batch() instead. Without batching, the dataset yields one example at a time, so Keras treats your length-40 feature vector as a batch of 40 scalar samples, which is why the logits come out with shape (40, 1351) while the label has shape (1,). You'll have to change your code as follows for it to work.
import numpy as np
import tensorflow as tf
# Some toy data
train_x = np.random.normal(size=(7200, 40))
train_y = np.random.choice([0,1,2], size=(7200))
dataset = tf.data.Dataset.from_tensor_slices((train_x,train_y))
dataset = dataset.batch(256)
#### - Define your model here - ####
model.compile(optimizer=optimizer, loss='sparse_categorical_crossentropy')
history = model.fit(dataset, epochs=EPOCHS)
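If you print the batched dataset you should now see a leading batch dimension, something like <BatchDataset shapes: ((None, 40), (None,)), types: (tf.float64, tf.int64)>, which is the layout that model.fit() with sparse_categorical_crossentropy expects.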

How to perform sklearn-style train-test split on feature and label tensors using built-in tensorflow methods?

Reposting my original question since even after significant improvements to clarity, it was not revived by the community.
I am looking for a way to split feature and corresponding label data into train and test using TensorFlow inbuilt methods. My data is already in two tensors (i.e. tf.Tensor objects), named features and labels.
I know how to do this easily for numpy arrays using sklearn.model_selection as shown in this post. Additionally, I was pointed to this method which requires the data to be in a single tensor. Also, I need the train and test sets to be disjoint, unlike in this method (meaning they can't have common data points after the split).
I am looking for a way to do the same using built-in methods in Tensorflow.
There may be too many conditions in my requirement, but basically what is needed is an equivalent of sklearn.model_selection.train_test_split() in Tensorflow, such as the one below:
import tensorflow as tf
X_train, X_test, y_train, y_test = tf.train_test_split(features,
                                                       labels,
                                                       test_size=0.1,
                                                       random_state=123)
You can achieve this by using TF in the following way:
from typing import Tuple
import tensorflow as tf

def split_train_test(features: tf.Tensor,
                     labels: tf.Tensor,
                     test_size: float,
                     random_state: int = 1729) -> Tuple[tf.Tensor, tf.Tensor, tf.Tensor, tf.Tensor]:
    # Generate random masks
    random = tf.random.uniform(shape=(tf.shape(features)[0],), seed=random_state)
    train_mask = random >= test_size
    test_mask = random < test_size
    # Gather values
    train_features, train_labels = tf.boolean_mask(features, mask=train_mask), tf.boolean_mask(labels, mask=train_mask)
    test_features, test_labels = tf.boolean_mask(features, mask=test_mask), tf.boolean_mask(labels, mask=test_mask)
    return train_features, test_features, train_labels, test_labels
What we are doing here is first creating a random uniform tensor with the same length as the data. Then we create boolean masks according to the ratio given by test_size, and finally we extract the relevant parts for train/test using tf.boolean_mask.
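A quick usage sketch with toy tensors (shapes and values are made up for illustration). Because the mask is random, the split is only approximately test_size, but every sample lands in exactly one of the two sets:

import tensorflow as tf

features = tf.random.normal(shape=(1000, 4))                         # 1000 samples, 4 features
labels = tf.random.uniform(shape=(1000,), maxval=2, dtype=tf.int32)  # binary labels
X_train, X_test, y_train, y_test = split_train_test(features, labels, test_size=0.1)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)      # roughly a 900/100 split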

tf.estimator.inputs.pandas_input_fn label tensor

I was trying out Tensorflow's built-in pandas_input_fn() with a pandas dataframe that I named training_examples.
It's a very simple dataframe describing one set of features and labels; it is then passed as argument x to the pandas_input_fn() function, as shown below, which, if I understand the docs correctly, should return an input function with the data already parsed into features and labels?
input_function = tf.estimator.inputs.pandas_input_fn(
    x=training_examples,
    y=None,
    batch_size=128,
    num_epochs=1,
    shuffle=True,
    queue_capacity=1000,
    num_threads=1,
    target_column='y'
)
However, when I then try to pass this function to the .train() method, I get the error shown below:
ValueError: You must provide a labels Tensor. Given: None. Suggested
troubleshooting steps: Check that your data contain your label feature. Check
that your input_fn properly parses and returns labels.
Not sure what I'm doing wrong?
train_input_function zips up its own tuple of features and labels. You're on the right track in your comments:
x = training_examples[[feature_column_list]]
y = training_examples[label_column_name]
Working with the full dataset (before splitting into train and test), I find it works well to produce train and test input functions like so. This makes use of sklearn's train_test_split function with stratify to make sure the right ratio of cases have each category in the label.
from sklearn.model_selection import train_test_split
train_x, test_x, train_y, test_y = train_test_split(x, y, stratify=y)
At this point you can specify your input functions.
train_input_fn = tf.estimator.inputs.pandas_input_fn(x=train_x, y=train_y, shuffle=True, num_epochs=whatever, batch_size=whatever)
test_input_fn = tf.estimator.inputs.pandas_input_fn(x=test_x, y=test_y, shuffle=False, batch_size=1)
Try target_column=None and pass the actual label column explicitly, as in y=training_examples['label/target'].
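Putting the two answers together, a minimal sketch, assuming a TF 1.x environment (tf.estimator.inputs was removed in TF 2.x), purely numeric features, and a hypothetical label column named 'y':

import tensorflow as tf
from sklearn.model_selection import train_test_split

x = training_examples.drop(columns=['y'])  # everything except the label
y = training_examples['y']
train_x, test_x, train_y, test_y = train_test_split(x, y, stratify=y)

# Pass the labels separately via y= instead of relying on target_column
train_input_fn = tf.estimator.inputs.pandas_input_fn(
    x=train_x, y=train_y, shuffle=True, num_epochs=None, batch_size=128)
test_input_fn = tf.estimator.inputs.pandas_input_fn(
    x=test_x, y=test_y, shuffle=False, batch_size=1)

feature_columns = [tf.feature_column.numeric_column(c) for c in train_x.columns]
estimator = tf.estimator.LinearClassifier(feature_columns=feature_columns)
estimator.train(input_fn=train_input_fn, steps=1000)
print(estimator.evaluate(input_fn=test_input_fn))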

One-Hot Encoding and Tensorflow (explain behind the scenes)

I am new to the deep learning world and to Tensorflow, which is so complicated for me right now.
I was following a tutorial on the TF Layers API and ran into this issue with one-hot encoding. Here is my code:
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
wine_data = load_wine()
feat_data = wine_data['data']
labels = wine_data['target']
X_train, X_test, y_train, y_test = train_test_split(feat_data,
                                                    labels,
                                                    test_size=0.3,
                                                    random_state=101)
scaler = MinMaxScaler()
scaled_x_train = scaler.fit_transform(X_train)
scaled_x_test = scaler.transform(X_test)
# ONE HOT ENCODED
onehot_y_train = pd.get_dummies(y_train).as_matrix()
one_hot_y_test = pd.get_dummies(y_test).as_matrix()
num_feat = 13
num_hidden1 = 13
num_hidden2 = 13
num_outputs = 3
learning_rate = 0.01
import tensorflow as tf
from tensorflow.contrib.layers import fully_connected
X = tf.placeholder(tf.float32,shape=[None,num_feat])
y_true = tf.placeholder(tf.float32,shape=[None,3])
actf = tf.nn.relu
hidden1 = fully_connected(X,num_hidden1,activation_fn=actf)
hidden2 = fully_connected(hidden1,num_hidden2,activation_fn=actf)
output = fully_connected(hidden2,num_outputs)
loss = tf.losses.softmax_cross_entropy(onehot_labels=y_true, logits=output)
optimizer = tf.train.AdamOptimizer(learning_rate)
train = optimizer.minimize(loss)
init = tf.global_variables_initializer()
training_steps = 1000
with tf.Session() as sess:
    sess.run(init)
    for i in range(training_steps):
        sess.run(train, feed_dict={X: scaled_x_train, y_true: y_train})
    # Get Predictions
    logits = output.eval(feed_dict={X: scaled_x_test})
    preds = tf.argmax(logits, axis=1)
    results = preds.eval()
When I run this code I get this error:
ValueError: Cannot feed value of shape (124,) for Tensor 'Placeholder_1:0', which has shape '(?, 3)'
After a little digging I found that modifying sess.run to
sess.run(train, feed_dict={X: scaled_x_train, y_true: onehot_y_train})
i.e. changing y_train to onehot_y_train, made the code run.
I just want to know what is happening behind the scenes, and why one-hot encoding is necessary in this code.
Your network is making a class prediction on 3 classes, class A, B, and C.
In defining a neural network to transform your 13 inputs to a representation that you can use to distinguish between these 3 classes you have a few choices.
You could output 1 number. Let's define the single-value output so that a value < 0 represents class A, a value in [0, 1] is class B, and a value > 1 is class C.
You could define this, use a loss function like squared error, and the network would learn to work under these assumptions and probably do halfway decently at it.
However, that was a rather arbitrary choice of values to define 3 classes, as I'm sure you can see. And it's certainly sub-optimal. Learning this representation is harder than it needs to be. Can we do better?
Let's pick a more reasonable approach. Instead of 1 output we have 3 outputs. We define each output to represent how strongly we believe in a particular class. In order to conform to the cross entropy loss you use we'll further constrain those values to be in the range [0,1] by applying a sigmoid to them. So great, we now have 3 values in range [0,1] that each represent the belief that the input should fall into each of our 3 classes.
You have labels for each of your inputs; you know for sure that each input is class A, B, or C. So for a given input that is, say, class C, your label would naturally be [0, 0, 1] (i.e. you know it's not A or B, so 0 in both of those positions, and 1 for C, which you know the class to be). Voila, you have the one-hot encoding!
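A tiny illustration of that encoding, mirroring what pd.get_dummies does to the integer labels in the question's code:

import numpy as np

labels = np.array([0, 2, 1, 2])  # class indices for 4 samples, 3 classes
one_hot = np.eye(3)[labels]      # one row per sample, a 1 marks the true class
print(one_hot)
# [[1. 0. 0.]
#  [0. 0. 1.]
#  [0. 1. 0.]
#  [0. 0. 1.]]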
As you might imagine this is a much easier problem to solve than the first one I presented. Hence we choose to represent our problem this way because we end up with networks that perform better when we do. It's not that you couldn't represent it another way, you just want the best results possible and one-hot encoding typically performs above other representations you might dream up.