Tensorflow: How to feed in data in vocabulary feature column? - pandas

I'm currently working on a classification problem on text input basis and my main question is the following:
Am I correct in assuming that I can pass my complete sentence as one string to the vocabulary column, or do I need to split the sentence into its words, i.e. a list of strings?
My data looks something like this:
A B text
1 .. .. My first example text
2 .. .. My second example text
(Beside my text input feature there are also some other categorical information - but they are not relevant in this context)
And my code looks basically like this:
# data import and data preparation
categorical_voc = tf.feature_column.categorical_column_with_vocabulary_list(
    key="text", vocabulary_list=vocabulary_list)
embedding_initializer = tf.random_uniform_initializer(-1.0, 1.0)
embed_column_dim = math.ceil(len(vocabulary_list) ** 0.25)
embed_column = tf.feature_column.embedding_column(
    categorical_column=categorical_voc,
    dimension=embed_column_dim,
    initializer=embedding_initializer,
    trainable=True)

estimator = tf.estimator.DNNClassifier(
    optimizer=optimizer,
    feature_columns=feature_columns,
    hidden_units=hidden_units,
    activation_fn=activation_fn,
    dropout=dropout,
    n_classes=target_size,
    label_vocabulary=target_list,
    config=config)

train_input_fn = tf.estimator.inputs.pandas_input_fn(
    x=train_data,
    y=train_target,
    batch_size=batch_size,
    num_epochs=1,
    shuffle=True)

estimator.train(input_fn=train_input_fn)
Thanks for your help :)
Edit 1:
For those who need the custom input function:
def input_fn(features, labels, batch_size):
    if labels is None:
        dataset = tf.data.Dataset.from_tensor_slices(features)
    else:
        dataset = tf.data.Dataset.from_tensor_slices((features, labels))
    # Shuffle, repeat, and batch the examples.
    dataset = dataset.shuffle(100).repeat().batch(batch_size)
    return dataset

def train_input_fn():
    return input_fn(features=_train_data,
                    labels=_train_target,
                    batch_size=train_batch_size)

estimator.train(input_fn=lambda: train_input_fn(), steps=total_training_steps, hooks=train_hooks)

For those who had the same problem figuring out how to handle a sentence within a vocabulary column:
My conclusion so far is that I have to feed the vocabulary column with an array of strings, i.e. one list of words per sentence. The only issue here is that pandas_input_fn() does not support a Series of lists. That's why I went back to my custom input function (a tokenization sketch is shown below).
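Here is a minimal sketch of the tokenization step. MAX_LEN and PAD_TOKEN are names I chose; padding is only there so that the "text" feature becomes a rectangular array of strings that from_tensor_slices can handle, and you may need to decide how the out-of-vocabulary padding entry should be treated by the column:
MAX_LEN = 20    # hypothetical maximum sentence length
PAD_TOKEN = ""  # padding entry, not part of vocabulary_list

def tokenize_and_pad(sentence):
    # split the sentence into words, then truncate/pad to a fixed length
    words = sentence.split()[:MAX_LEN]
    return words + [PAD_TOKEN] * (MAX_LEN - len(words))

# train_data is the pandas DataFrame from above; these are the structures the
# custom input_fn from Edit 1 consumes.
_train_data = {"text": [tokenize_and_pad(s) for s in train_data["text"]]}
_train_target = train_target.values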

Related

How to build a DataGenerator/ Sequence for Multi Loss and Multi Output Model in Keras (Tensorflow2)?

I'm working on a model with two losses and two different outputs. One output takes an image as y, just like an autoencoder / U-Net architecture. The other output is a simple binary classification that takes 0/1 as y.
What I'm trying to pull off is a Siamese-based U-Net: reconstruct the image based on MAE loss, and create a branch from the bottleneck layer so that it can predict whether two images are similar or not based on the Euclidean distance.
Keras has an ImageDataGenerator where you can use class_mode='input' to generate a corresponding image as the y label, and class_mode='binary' to generate a 0/1 value given in a column. But how can I generate both things in the same generator? The problem is that the Siamese branch accepts two inputs at the same time.
Have you checked the Sequence object?
It allows you to create custom data generators. The idea is to subclass Sequence and then override the methods __len__ and __getitem__. __len__ should return the number of batches in the sequence. The logic that returns a source and target pair is written inside __getitem__; it should return one batch of data. In the case of multi-input models, you can write __getitem__ so that its output includes a dictionary mapping batches to the input layers of your model (key = layer name). Similarly for the output tensors. More information can be found in the link to the official doc I have added above. Best
Edit
This is the gist as per my understanding of your problem:
class Dataset(Sequence):
    def __init__(self, filenames, batchsize, shape):
        self.filenames = filenames   # List of filenames
        self.batchsize = batchsize
        self.shape = shape           # Shape to which each image should be resized

    def __len__(self):
        return len(self.filenames) // self.batchsize

    def __getitem__(self, idx):
        i = idx * self.batchsize
        X_1 = np.zeros((self.batchsize, self.shape[0], self.shape[1], 3))
        y = np.zeros((self.batchsize, --, --, ..., --))  # Depends on your target choice
        filenames = self.filenames[i:i + self.batchsize]
        for index, filename in enumerate(filenames):
            image = cv2.imread(filename)
            # Preprocess
            image = your_preprocess(image)
            X_1[index] = image
            # You can include your pipeline for the other input (X_2) here as well.
            # Similarly obtain the target values and load them into y.
        # The keys are the (placeholder) names of your model's input/output layers.
        return {"input_layer_1": X_1, "input_layer_2": X_2}, {"output_layer": y}
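A hedged usage sketch, assuming the model's input and output layers were given names matching those dict keys (the layer names and the train_filenames variable are placeholders; in older Keras versions use fit_generator instead of fit):
# e.g. tf.keras.layers.Input(shape=(224, 224, 3), name="input_layer_1"), etc.
train_seq = Dataset(filenames=train_filenames, batchsize=16, shape=(224, 224))
model.fit(train_seq, epochs=10)  # a Sequence can be passed to fit directly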

keras generator with image and scalar

I am trying to train some layers of a network whose inputs are an image and a scalar. Please see the figure (an architecture diagram, not reproduced here) for a better understanding.
As you can see, only the dark yellow layers will be trained, so I need to freeze the rest; that is for later.
The purpose of this architecture is to map images (chest X-rays) to 14 kinds of diseases.
The images are stored in the following directory: /home/akde/chexnet/CheXNet-Keras/data/images
Names of the images are the image IDs.
A dataframe maps the images (named by their Image ID) to classes (diseases); as shown there, an image can be mapped to more than one class (disease).
Another dataframe maps the image IDs to the patient age.
The image is the first input and the patient age is the second.
So in short, for each image ID I have an image and an age value, which live in two separate dataframes.
I can already test it using the following code (this gives absurd results since the network is not trained, but it still proves that the network accepts the input and produces some result).
res3 = model3.predict( [test_image, a] )
where a is the scalar input while the test_image is the image input.
My training data is stored in multiple dataframes; having read that post, I deduce that flow_from_dataframe should be used.
The first thing I did was to read this post, which explains how to use mixed inputs. That gave me some background, but since it uses fit rather than fit_generator it did not solve my problem.
Then I read this post, which does not use multiple inputs. Again no clue.
Afterwards, I saw this post, which takes two images as input (not one image and one scalar). So again no help.
Even though I haven't found a solution to my problem, I have written the following piece of code, which will be the skeleton of the solution.
datagen = ImageDataGenerator(rescale=1./255., validation_split=0.25)
train_generator = datagen.flow_from_dataframe(
    traindf,
    directory="/home/akde/chexnet/CheXNet-Keras/data/images",
    class_mode="other",
    x_col="Image Index",
    y_col=["Atelectasis", "Cardiomegaly", "Effusion", "Infiltration", "Mass",
           "Nodule", "Pneumonia", "Pneumothorax", "Consolidation", "Edema",
           "Emphysema", "Fibrosis", "Pleural_Thickening", "Hernia"],
    color_mode="rgb",
    batch_size=32,
    target_size=(224, 224)
)
STEP_SIZE_TRAIN = train_generator.n // train_generator.batch_size
model3.compile(optimizers.rmsprop(lr=0.0001, decay=1e-6),
               loss="categorical_crossentropy", metrics=["accuracy"])
model3.fit_generator(generator=train_generator,
                     steps_per_epoch=STEP_SIZE_TRAIN,
                     epochs=10)
I know this piece of code is far from the solution.
So how can I create a generator that uses the two dataframes explained earlier (the one that maps images to the diseases and the other one that maps image IDs to age)?
In other words, what is the way of writing a generator that takes an image and a scalar value as inputs, considering the fact that both are represented in dataframes? How can I write the generator that is marked in bold below?
model3.fit_generator(**generator=train_generator**,
steps_per_epoch=STEP_SIZE_TRAIN,
epochs=10
)
For your purpose you need to create a custom generator.
I recommend you take a deep look at this link:
https://blog.ml6.eu/training-and-serving-ml-models-with-tf-keras-3d29b41e066c
And especially this code:
import ast
import math
import os
import random

import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow.keras.preprocessing.image import img_to_array as img_to_array
from tensorflow.keras.preprocessing.image import load_img as load_img


def load_image(image_path, size):
    # data augmentation logic such as random rotations can be added here
    return img_to_array(load_img(image_path, target_size=(size, size))) / 255.


class KagglePlanetSequence(tf.keras.utils.Sequence):
    """
    Custom Sequence object to train a model on out-of-memory datasets.
    """

    def __init__(self, df_path, data_path, im_size, batch_size, mode='train'):
        """
        df_path: path to a .csv file that contains columns with image names and labels
        data_path: path that contains the training images
        im_size: image size
        mode: when in training mode, data will be shuffled between epochs
        """
        self.df = pd.read_csv(df_path)
        self.im_size = im_size
        self.batch_size = batch_size
        self.mode = mode

        # Take labels and a list of image locations in memory
        self.wlabels = self.df['weather_labels'].apply(lambda x: ast.literal_eval(x)).tolist()
        self.glabels = self.df['ground_labels'].apply(lambda x: ast.literal_eval(x)).tolist()
        self.image_list = self.df['image_name'].apply(lambda x: os.path.join(data_path, x + '.jpg')).tolist()

    def __len__(self):
        return int(math.ceil(len(self.df) / float(self.batch_size)))

    def on_epoch_end(self):
        # Shuffles indexes after each epoch
        self.indexes = range(len(self.image_list))
        if self.mode == 'train':
            self.indexes = random.sample(self.indexes, k=len(self.indexes))

    def get_batch_labels(self, idx):
        # Fetch a batch of labels
        return [self.wlabels[idx * self.batch_size: (idx + 1) * self.batch_size],
                self.glabels[idx * self.batch_size: (idx + 1) * self.batch_size]]

    def get_batch_features(self, idx):
        # Fetch a batch of images
        batch_images = self.image_list[idx * self.batch_size: (1 + idx) * self.batch_size]
        return np.array([load_image(im, self.im_size) for im in batch_images])

    def __getitem__(self, idx):
        batch_x = self.get_batch_features(idx)
        batch_y = self.get_batch_labels(idx)
        return batch_x, batch_y
Hope this will help you find your solution!
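For the specific image-plus-age case in the question, a rough adaptation could look like the sketch below. All dataframe, column, and variable names here are assumptions: merge the disease dataframe with the age dataframe on "Image Index", then yield a two-input batch.
import math

import cv2
import numpy as np
from tensorflow.keras.utils import Sequence


class ImageAgeSequence(Sequence):
    """Yields ([image_batch, age_batch], label_batch) built from two dataframes."""

    def __init__(self, df_labels, df_age, image_dir, batch_size, im_size=224):
        # df_labels: "Image Index" -> 14 disease columns
        # df_age:    "Image Index" -> "Patient Age" (hypothetical column name)
        self.df = df_labels.merge(df_age, on="Image Index")
        self.image_dir = image_dir
        self.batch_size = batch_size
        self.im_size = im_size
        self.label_cols = [c for c in df_labels.columns if c != "Image Index"]

    def __len__(self):
        return int(math.ceil(len(self.df) / self.batch_size))

    def __getitem__(self, idx):
        rows = self.df.iloc[idx * self.batch_size:(idx + 1) * self.batch_size]
        images = np.array([
            cv2.resize(cv2.imread(self.image_dir + "/" + name),
                       (self.im_size, self.im_size)) / 255.
            for name in rows["Image Index"]])
        ages = rows["Patient Age"].values.astype("float32")
        labels = rows[self.label_cols].values.astype("float32")
        return [images, ages], labels

# Hypothetical usage:
# train_seq = ImageAgeSequence(traindf, agedf,
#                              "/home/akde/chexnet/CheXNet-Keras/data/images", batch_size=32)
# model3.fit_generator(train_seq, steps_per_epoch=len(train_seq), epochs=10)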

How can I read endlessly from a Tensorflow tf.data.Dataset?

I'm switching my old data layer (which used Queues) to the "new" and recommended Dataset API. I'm using it for the first time, so I'm providing code examples in case I got something fundamentally wrong.
I create my Dataset from a generator (that will read a file, and provide n samples). It's a small dataset and n_iterations >> n_samples, so I simply want to read this dataset over and over again, ideally shuffled.
sample_set = tf.data.Dataset.from_generator(
    data_generator(filename),
    (tf.uint8, tf.uint8),
    (tf.TensorShape([256, 256, 4]), tf.TensorShape([256, 256, 1]))
)
with the data_generator being:
class data_generator:
    def __init__(self, filename):
        self.filename = filename

    def __call__(self):
        with self.filename.open() as f:
            for idx in f:
                yield img[idx], label[idx]
To actually use the data, I gathered that I need to define an Iterator
sample = sample_set.make_one_shot_iterator().get_next()
and then we are set to read data
while True:
    try:
        my_sample = sess.run(sample)
    except tf.errors.OutOfRangeError:
        break  # this happens after the dataset has been read once
But all available Iterators seem to be "finite", in the way that they read a dataset only once.
Is there a simple way to make reading from the Dataset endless?
Datasets have repeat and shuffle methods.
BUF_SIZE = 100  # choose it depending on your data
sample_set = tf.data.Dataset.from_generator(
    data_generator(filename),
    (tf.uint8, tf.uint8),
    (tf.TensorShape([256, 256, 4]), tf.TensorShape([256, 256, 1]))
).repeat().shuffle(BUF_SIZE)
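If you want to keep separate epochs from blending together, you can also shuffle before repeating; a sketch with the same shapes:
sample_set = tf.data.Dataset.from_generator(
    data_generator(filename),
    (tf.uint8, tf.uint8),
    (tf.TensorShape([256, 256, 4]), tf.TensorShape([256, 256, 1]))
).shuffle(BUF_SIZE).repeat()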
The Dataset.repeat() transformation will repeat a dataset endlessly if you don't pass an explicit count to it:
sample_set = tf.data.Dataset.from_generator(
data_generator(filename), (tf.uint8, tf.uint8),
(tf.TensorShape([256,256,4]), tf.TensorShape([256,256,1])))
# Repeats `sample_set` endlessly.
sample_set = sample_set.repeat()
sample = sample_set.make_one_shot_iterator().get_next()
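With repeat() applied, the one-shot iterator never raises OutOfRangeError, so the read loop from the question can simply run for a fixed number of iterations, for example:
with tf.Session() as sess:
    for _ in range(n_iterations):  # n_iterations >> n_samples, as in the question
        my_sample = sess.run(sample)
        # ... use my_sample for training ...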
The reinitializable Iterator will work with reinitializing on the same dataset, so this code will read the same dataset over and over again:
sample_it = tf.data.Iterator.from_structure(sample_set.output_types,
                                            sample_set.output_shapes)
sample = sample_it.get_next()
sample_set_init_op = sample_it.make_initializer(sample_set)  # create initialize op

with tf.Session(config=config) as sess:
    sess.run(sample_set_init_op)  # initialize in the beginning
    while True:
        try:
            my_sample = sess.run(sample)
        except tf.errors.OutOfRangeError:
            sess.run(sample_set_init_op)  # re-initialize on the same dataset

Iterator usage in TensorFlow example code

I am learning TensorFlow (TF), and it's been just one day, so I apologize in advance if my doubt is too basic to ask.
I was studying the linear classification example on the official TF website.
The authors defined a function called input_fn to read the data. The function is as follows:
def input_fn(data_file, num_epochs, shuffle, batch_size):
    """Generate an input function for the Estimator."""
    assert tf.gfile.Exists(data_file), (
        '%s not found. Please make sure you have either run data_download.py or '
        'set both arguments --train_data and --test_data.' % data_file)

    def parse_csv(value):
        print('Parsing', data_file)
        columns = tf.decode_csv(value, record_defaults=_CSV_COLUMN_DEFAULTS)
        features = dict(zip(_CSV_COLUMNS, columns))
        labels = features.pop('income_bracket')
        return features, tf.equal(labels, '>50K')

    # Extract lines from input files using the Dataset API.
    dataset = tf.data.TextLineDataset(data_file)

    if shuffle:
        dataset = dataset.shuffle(buffer_size=_NUM_EXAMPLES['train'])

    dataset = dataset.map(parse_csv, num_parallel_calls=5)

    # We call repeat after shuffling, rather than before, to prevent separate
    # epochs from blending together.
    dataset = dataset.repeat(num_epochs)
    dataset = dataset.batch(batch_size)

    iterator = dataset.make_one_shot_iterator()
    features, labels = iterator.get_next()
    return features, labels
I am not able to understand the second-to-last line. get_next() is called only once on the one-shot iterator, but shouldn't it iterate over the data multiple times (i.e. as many times as there are rows) to extract the rows, like in this example here?
So here, get_next() basically returns a dequeue op. The data sits in a queue; when you consume (run) the element returned by get_next(), it is removed from the queue and the next image/label pair moves into its place, to be dequeued the next time you run the op.
So currently, this function only returns the TensorFlow op for dequeuing elements; you can consume it in your training loop.
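To make this concrete, here is a hedged sketch of how that op gets consumed: the Estimator calls input_fn once to build the graph, then runs the returned (features, labels) tensors over and over in its internal training loop, dequeuing a fresh batch each time (the file name and hyper-parameters below are placeholders):
# Each training step the Estimator runs the tensors returned by input_fn,
# which dequeues the next batch from the pipeline built above.
estimator.train(
    input_fn=lambda: input_fn('adult.data', num_epochs=5, shuffle=True, batch_size=32))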

How to output tensor flow prediction results in .csv?

I am training a model using CNN.
Here is my prediction part in the model.
predictions = {
    "classes": tf.argmax(input=logit2, axis=1),
    "probabilities": tf.nn.softmax(logit2, name="softmax_tensor")
}
Here is the code in main that does the evaluation.
eval_input_fn = tf.estimator.inputs.numpy_input_fn(
    x={"x": images_test},
    y=test_labels,
    num_epochs=1,
    shuffle=False)
eval_results = model.evaluate(input_fn=eval_input_fn)
I have trained my model. Now I have a list of test image names (in the first column of a CSV file), and I want to make predictions and output the corresponding results (a probability between 0 and 1) to the second column. How do I achieve this, and where do I add the code?
Thanks in advance.
The Estimator class has a predict function that returns the predictions as an iterable object (for an example, scroll to the very bottom of this page).
So you could do:
predictions = model.predict(input_fn=predict_input_fn)
for p in predictions:
    ...  # write p['classes'] (or p['probabilities']) to the csv
As for writing to the second column of the CSV, take a look at Python's csv module; a minimal sketch follows below.
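A minimal sketch of that last step, assuming the test image names live in test_images.csv and that you want the probability of class 1 (the file names, predict_input_fn, and class index are placeholders):
import csv

predictions = model.predict(input_fn=predict_input_fn)

with open("test_images.csv") as f_in, open("predictions.csv", "w", newline="") as f_out:
    reader = csv.reader(f_in)
    writer = csv.writer(f_out)
    for row, p in zip(reader, predictions):
        # first column: image name (copied through), second column: probability of class 1
        writer.writerow([row[0], p["probabilities"][1]])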