Converting tensorflow dataset to pandas dataframe - pandas

I am very new to deep learning and computer vision. I want to do a face recognition project. For that I downloaded some images from the Internet and converted them to a TensorFlow dataset with the help of this article from the TensorFlow documentation. Now I want to convert that dataset to a pandas DataFrame so that I can write it out to CSV files. I have tried a lot but am unable to do it.
Can someone help me with it?
Here is the code for building the dataset, followed by some of the incorrect code that I tried:
import tensorflow as tf
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

filenames = tf.constant(['al.jpg', 'al2.jpg', 'al3.jpg', 'al4.jpeg','al5.jpeg', 'al6.jpeg','al7.jpg','al8.jpeg', '5.jpg', 'hrit8.jpeg', 'Hrithik-Roshan.jpg', 'Hrithik.jpg', 'hriti1.jpeg', 'hriti2.jpg', 'hriti3.jpeg', 'hritik4.jpeg', 'hritik5.jpg', 'hritk9.jpeg', 'index.jpeg', 'sah.jpeg', 'sah1.jpeg', 'sah3.jpeg', 'sah4.jpg', 'sah5.jpg','sah6.jpg','sah7.jpg'])
labels = tf.constant([1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 2, 2, 2, 2, 2, 2])

dataset = tf.data.Dataset.from_tensor_slices((filenames, labels))

def _parse_function(filename, label):
    image_string = tf.read_file(filename)
    image_decoded = tf.image.decode_jpeg(image_string, channels=3)
    image_resized = tf.image.resize_images(image_decoded, [28, 28])
    return image_resized, label

dataset = dataset.map(_parse_function)
dataset = dataset.shuffle(buffer_size=100)
dataset = dataset.batch(26)

iterator = dataset.make_one_shot_iterator()
image, labels = iterator.get_next()

sess = tf.Session()
print(sess.run([image, labels]))
Initially I just tried to use df = pd.DataFrame(dataset)
Then I got the following error:
ValueError Traceback (most recent call last)
<ipython-input-15-d5503ae4603d> in <module>()
----> 1 df = pd.DataFrame((dataset))
~/anaconda3/lib/python3.6/site-packages/pandas/core/frame.py in __init__(self, data, index, columns, dtype, copy)
402 dtype=values.dtype, copy=False)
403 else:
--> 404 raise ValueError('DataFrame constructor not properly called!')
405
406 NDFrame.__init__(self, mgr, fastpath=True)
ValueError: DataFrame constructor not properly called!
Thereafter I came across this article and realized my mistake: in TensorFlow, anything exists only within a session. So I tried the following code:
with tf.Session() as sess:
    df = pd.DataFrame(sess.run(dataset))
Please pardon me if I made a stupid mistake; I wrote the above code by analogy with print(sess.run(dataset)), and got an even bigger error:
TypeError: Fetch argument <BatchDataset shapes: ((?, 28, 28, 3), (?,)), types: (tf.float32, tf.int32)> has invalid type <class 'tensorflow.python.data.ops.dataset_ops.BatchDataset'>, must be a string or Tensor. (Can not convert a BatchDataset into a Tensor or Operation.)

I think you could use map like this. I assume that you want to add a numpy array to a data frame, as described here. But you have to append the records one by one, and also figure out how this whole array fits into one column of the data frame.
import tensorflow as tf
import pandas as pd

filenames = tf.constant(['C:/Machine Learning/sunflower/50987813_7484bfbcdf.jpg'])
labels = tf.constant([1])
dataset = tf.data.Dataset.from_tensor_slices((filenames, labels))

sess = tf.Session()

def convert_to_dataframe(filename, label):
    print(pd.DataFrame.from_records(filename))
    return filename, label

def _parse_function(filename, label):
    image_string = tf.read_file(filename)
    image_decoded = tf.image.decode_jpeg(image_string, channels=3)
    image_resized = tf.image.resize_images(image_decoded, [28, 28])
    return image_resized, label

dataset = dataset.map(_parse_function)
dataset = dataset.map(lambda filename, label: tf.py_func(convert_to_dataframe,
                                                         [filename, label],
                                                         [tf.float32, tf.int32]))
dataset = dataset.shuffle(buffer_size=100)
dataset = dataset.batch(26)

iterator = dataset.make_one_shot_iterator()
image, labels = iterator.get_next()
sess.run([image, labels])
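If the end goal is really a CSV of pixel values plus labels, a minimal TF 1.x sketch along the following lines might also work. This assumes the image/labels tensors from the one-shot iterator above, that everything fits in memory, and that one flattened row per image is an acceptable layout:
import numpy as np
import pandas as pd

with tf.Session() as sess:
    images_np, labels_np = sess.run([image, labels])  # one batch: (N, 28, 28, 3) and (N,)
    flat = images_np.reshape(len(labels_np), -1)      # one row of 28*28*3 pixel values per image
    df = pd.DataFrame(flat)
    df['label'] = labels_np
    df.to_csv('images.csv', index=False)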

One easy way to do it is to save the dataset into a normal CSV file, and then read that CSV file directly into a pandas DataFrame.
import tensorflow_datasets as tfds

# Construct a tf.data.Dataset
ds = tfds.load('civil_comments/CivilCommentsCovert', split='train')

# Read the dataset into a pandas-compatible dataframe
df = tfds.as_dataframe(ds)

# Save the dataframe into a csv file
df.to_csv("/.../.../Desktop/covert_toxicity.csv")

# Read the csv file as normal, then you have the df you need
import pandas as pd
file_path = "/.../.../Desktop/covert_toxicity.csv"
df = pd.read_csv(file_path, header=0, sep=",")
df
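Note that tfds.as_dataframe is not limited to datasets downloaded with tfds.load; as far as I know it accepts any tf.data.Dataset in TF 2.x, so a hand-built dataset (hypothetical toy example below) can be converted the same way:
import tensorflow as tf
import tensorflow_datasets as tfds

ds = tf.data.Dataset.from_tensor_slices({"label": [0, 1, 2]})  # any tf.data.Dataset works
df = tfds.as_dataframe(ds)
print(df)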

A simpler way to convert a TensorFlow object to a dataframe is to convert the TensorFlow object to a numpy array and pass it to the pandas DataFrame constructor.
import pandas as pd
dataset = pd.DataFrame(labels.numpy(), columns=filenames)
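Applied to the question's (image, label) dataset, a hedged TF 2.x sketch of the same idea could look like this (eager mode, no Session; iterate the dataset before the .batch(26) step; storing the flattened pixels in one column is just one possible layout):
import pandas as pd

rows = []
for image, label in dataset:                     # eager iteration, no Session needed
    rows.append({'label': int(label.numpy()),
                 'pixels': image.numpy().flatten()})
df = pd.DataFrame(rows)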

Related

Using tf extract_image_patches for input to a CNN?

I want to extract patches from my original images to use them as input for a CNN.
After a little research I found a way to extract patches with
tensorflow.compat.v1.extract_image_patches.
Since these need to be reshaped to "image format" I implemented a method reshape_image_patches to reshape them and store the reshaped patches in an array.
image_patches2 = []

def reshape_image_patches(image_patches, sess, ksize_rows, ksize_cols):
    a = sess.run(tf.shape(image_patches))
    nr, nc = a[1], a[2]
    for i in range(nr):
        for j in range(nc):
            patch = tf.reshape(image_patches[0, i, j, ], [ksize_rows, ksize_cols, 3])
            image_patches2.append(patch)
    return image_patches2
How can I use this in combination with Keras generators to make these patches the input of my CNN?
Edit 1:
I have tried the approach in Load tensorflow images and create patches
import tensorflow as tf
import matplotlib.pyplot as plt
import numpy as np

dataset = tf.keras.preprocessing.image_dataset_from_directory(
    <directory>,
    label_mode=None,
    seed=1,
    subset='training',
    validation_split=0.1,
    image_size=(900, 900))

get_patches = lambda x: (tf.reshape(
    tf.image.extract_patches(
        x,
        sizes=[1, 16, 16, 1],
        strides=[1, 8, 8, 1],
        rates=[1, 1, 1, 1],
        padding='VALID'), (111*111, 16, 16, 3)))

dataset = dataset.map(get_patches)

fig = plt.figure()
plt.subplots_adjust(wspace=.1, hspace=.2)
images = next(iter(dataset))
for index, image in enumerate(images):
    ax = plt.subplot(2, 2, index + 1)
    ax.set_xticks([])
    ax.set_yticks([])
    ax.imshow(image)
plt.show()
On the line images = next(iter(dataset)) I get the error: InvalidArgumentError: Input to reshape is a tensor with 302800896 values, but the requested shape has 9462528
[[{{node Reshape}}]]
Does somebody know how to fix this?
tf.reshape does not change the order of, or the total number of, elements in the tensor. As the error states, you are trying to reduce the total number of elements from 302800896 to 9462528, and you are doing the tf.reshape inside the lambda function.
In the example below, I have recreated your scenario by passing a shape argument of 2 to tf.reshape, which cannot accommodate all the elements of the original tensor and thus throws the error -
Code -
%tensorflow_version 2.x
import tensorflow as tf
t1 = tf.Variable([1,2,2,4,5,6])
t2 = tf.reshape(t1, 2)
Output -
---------------------------------------------------------------------------
InvalidArgumentError Traceback (most recent call last)
<ipython-input-3-0ff1d701ff22> in <module>()
3 t1 = tf.Variable([1,2,2,4,5,6])
4
----> 5 t2 = tf.reshape(t1, 2)
3 frames
/usr/local/lib/python3.6/dist-packages/six.py in raise_from(value, from_value)
InvalidArgumentError: Input to reshape is a tensor with 6 values, but the requested shape has 2 [Op:Reshape]
tf.reshape should be used in such a way that the arrangement of elements can change, but the total number of elements must remain the same. So the fix is to change the shape to [2,3] -
Code -
%tensorflow_version 2.x
import tensorflow as tf
t1 = tf.Variable([1,2,2,4,5,6])
t2 = tf.reshape(t1, [2,3])
print(t2)
Output -
tf.Tensor(
[[1 2 2]
[4 5 6]], shape=(2, 3), dtype=int32)
To solve your problem, either extract patches (tf.image.extract_patches) of the size that you are trying to tf.reshape to, OR change the tf.reshape to match the size of the extracted patches.
I would also suggest looking into other tf.image functionality like tf.image.central_crop and tf.image.crop_and_resize.
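Applied to the patch code above, the arithmetic checks out: a 900x900 input with 16x16 patches at stride 8 yields 111x111 patches per image, and the default batch size of 32 gives 32 * 111 * 111 * 16 * 16 * 3 = 302800896 values, while the requested shape (111*111, 16, 16, 3) only holds 9462528. A hedged fix is to let tf.reshape infer the leading dimension:
get_patches = lambda x: tf.reshape(
    tf.image.extract_patches(
        x,
        sizes=[1, 16, 16, 1],
        strides=[1, 8, 8, 1],
        rates=[1, 1, 1, 1],
        padding='VALID'),
    (-1, 16, 16, 3))  # -1 absorbs batch_size * 111 * 111 patches per batch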

Passing x_train as a list of numpy arrays to tf.data.Dataset is not working

My problem is that x_train in tf.data.Dataset.from_tensor_slices((x_train, y_train)) needs to be a list. When I use the following line to pass [x1_train, x2_train] to tensorflow.data.Dataset.from_tensor_slices, I get the error below (x1_train, x2_train and y_train are numpy arrays):
Train=tensorflow.data.Dataset.from_tensor_slices(([x1_train,x2_train], y_train)).batch(batch_size)
Error:
Train=tensorflow.data.Dataset.from_tensor_slices(([x1_train,x2_train], y_train)).batch(batch_size)
return ops.EagerTensor(value, ctx.device_name, dtype)
ValueError: Can't convert non-rectangular Python sequence to Tensor.
What should I do?
If the main goal is to feed data to a model having multiple input layers then the following might be helpful:
import tensorflow as tf
from tensorflow import keras
import numpy as np

def _input_fn(n):
    x1_train = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=np.int64)
    x2_train = np.array([15, 25, 35, 45, 55, 65, 75, 85], dtype=np.int64)
    labels = np.array([40, 30, 20, 10, 80, 70, 50, 60], dtype=np.int64)
    dataset = tf.data.Dataset.from_tensor_slices(({"input_1": x1_train, "input_2": x2_train}, labels))
    dataset = dataset.batch(2, drop_remainder=True)
    dataset = dataset.repeat(n)
    return dataset

input1 = keras.layers.Input(shape=(1,), name='input_1')
input2 = keras.layers.Input(shape=(1,), name='input_2')
# join the two inputs and add a head so the model definition is complete
concat = keras.layers.concatenate([input1, input2])
output = keras.layers.Dense(1)(concat)
model = keras.models.Model(inputs=[input1, input2], outputs=output)
Basically, instead of passing a python list, pass a dictionary where each key is the name of the layer to which the array will be fed.
In the above code, x1_train will be fed to the tensor input1, whose name is input_1. Referred from here.
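A hedged usage sketch for the code above (the loss and the step count are arbitrary illustration choices, not part of the original answer):
model.compile(optimizer='adam', loss='mse')
model.fit(_input_fn(10), steps_per_epoch=4, epochs=1)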
If you have a dataframe with mixed types (float32, int and str) you have to create the dictionary manually.
Following Pratik's syntax:
tf.data.Dataset.from_tensor_slices(({"input_1": np.asarray(var_float).astype(np.float32), "input_2": np.asarray(var_int).astype(np.int), ...}, labels))
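A self-contained sketch of that pattern, with hypothetical column arrays (and np.int32 in place of the deprecated np.int alias):
import numpy as np
import tensorflow as tf

var_float = np.array([0.1, 0.2, 0.3])  # hypothetical float feature
var_int = np.array([1, 2, 3])          # hypothetical int feature
labels = np.array([0, 1, 0])

ds = tf.data.Dataset.from_tensor_slices((
    {"input_1": var_float.astype(np.float32),
     "input_2": var_int.astype(np.int32)},
    labels))
for features, label in ds.take(1):
    print(features, label)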

Value Error when trying to visualize an image

I'm trying to visualize some images belonging to different classes. The classes are class0, class1 and class2, which correspond to X-ray pictures of healthy, covid and pneumonia lungs respectively. As an example, see the picture below of a covid lung:
I've created three datasets containing the training, test and validation data. Please see the code below:
import pandas as pd
from keras_preprocessing.image import ImageDataGenerator
from matplotlib import pyplot as plt
import numpy as np
#Creating three dataframes reading .txt files
trainingfile = pd.read_table('data/training.txt', delim_whitespace=True, names=('class', 'image'))
testingfile = pd.read_table('data/testing.txt', delim_whitespace=True, names=('class', 'image'))
validationfile = pd.read_table('data/validation.txt', delim_whitespace=True, names=('class', 'image'))
#Change 0,1,2 to categorical class class0,class1,class2
trainingfile = trainingfile.replace([0, 1, 2], ['class0', 'class1', 'class2'])
testingfile = testingfile.replace([0, 1, 2], ['class0', 'class1', 'class2'])
validationfile = validationfile.replace([0, 1, 2], ['class0', 'class1', 'class2'])
#Final training, test and validation data
datagen=ImageDataGenerator(rescale=None)
train_generator=datagen.flow_from_dataframe(dataframe=trainingfile, directory="data/", x_col="image", y_col="class", class_mode="categorical", target_size=(256,256), batch_size=32)
test_generator=datagen.flow_from_dataframe(dataframe=testingfile, directory="data/", x_col="image", y_col="class", class_mode="categorical", target_size=(256,256), batch_size=15)
validation_generator=datagen.flow_from_dataframe(dataframe=validationfile, directory="data/", x_col="image", y_col="class", class_mode="categorical", target_size=(256,256), batch_size=21)
Now, the code to visualize one picture:
first_image = train_generator[0]
first_image = np.array(first_image, dtype='float')
pixels = first_image.reshape((28, 28))
plt.imshow(pixels, cmap='gray')
plt.show()
I get the following error:
ValueError Traceback (most recent call last)
<ipython-input-3-b237e88f96dd> in <module>
1 first_image = train_generator[0]
----> 2 first_image = np.array(first_image, dtype='float')
3 pixels = first_image.reshape((28, 28))
4 plt.imshow(pixels, cmap='gray')
5 plt.show()
ValueError: could not broadcast input array from shape (32,256,256,3) into shape (32)
Furthermore, is there any way to visualize an image corresponding to a specific class?
If instead of first_image = train_generator[0], I do first_image = train_generator[0][0], then the error that pops up is:
ValueError Traceback (most recent call last)
<ipython-input-4-0664c7dc8c6b> in <module>
1 first_image = train_generator[0][0]
2 first_image = np.array(first_image, dtype='float')
----> 3 pixels = first_image.reshape((28, 28))
4 plt.imshow(pixels, cmap='gray')
5 plt.show()
ValueError: cannot reshape array of size 6291456 into shape (28,28)
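For what it's worth, the first traceback already shows what train_generator[0] returns: a (images, labels) tuple whose image batch has shape (32, 256, 256, 3). So a single image can be displayed without any (28, 28) reshape; a minimal sketch, assuming the generators defined above:
images, labels = train_generator[0]    # images: (32, 256, 256, 3), labels: one-hot (32, 3)
plt.imshow(images[0].astype('uint8'))  # one 256x256 RGB image; no reshape needed
plt.show()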

How to get number of rows, columns /dimensions of tensorflow.data.Dataset?

Is there anything like pandas_df.shape for tensorflow.data.Dataset?
Thanks.
I'm not familiar with anything built-in, but the shapes can be retrieved from the Dataset._tensors attribute. Example:
import tensorflow as tf

def dataset_shapes(dataset):
    try:
        return [x.get_shape().as_list() for x in dataset._tensors]
    except TypeError:
        return dataset._tensors.get_shape().as_list()
And usage:
from sklearn.datasets import make_blobs

x_train, y_train = make_blobs(n_samples=10,
                              n_features=2,
                              centers=[[1, 1], [-1, -1]],
                              cluster_std=0.5)

dataset = tf.data.Dataset.from_tensor_slices(x_train)
print(dataset_shapes(dataset))  # [10, 2]

dataset = tf.data.Dataset.from_tensor_slices((x_train, y_train))
print(dataset_shapes(dataset))  # [[10, 2], [10]]
To add to Vlad's answer, just in case someone is trying this out for datasets downloaded via tfds, a possible way is to use the dataset information:
info.features['image'].shape # shape of 1 feature in dataset
info.features['label'].num_classes # number of classes
info.splits['train'].num_examples # number of training examples
E.g. tf_flowers:
import tensorflow as tf
import tensorflow_datasets as tfds
dataset, info = tfds.load("tf_flowers", with_info=True) # download data with info
image_size = info.features['image'].shape # (None, None, 3)
num_classes = info.features['label'].num_classes # 5
data_size = info.splits['train'].num_examples # 3670
E.g. fashion_mnist:
import tensorflow as tf
import tensorflow_datasets as tfds
dataset, info = tfds.load("fashion_mnist", with_info=True) # download data with info
image_size = info.features['image'].shape # (28, 28, 1)
num_classes = info.features['label'].num_classes # 10
data_splits = {k:v.num_examples for k,v in info.splits.items()} # {'test': 10000, 'train': 60000}
Hope this helps.
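A note for newer versions: in TF 2.x the public API also exposes this information, without touching private attributes; a small sketch:
import tensorflow as tf

ds = tf.data.Dataset.from_tensor_slices((tf.zeros((10, 2)), tf.zeros((10,))))
print(ds.element_spec)                               # per-element shapes and dtypes
print(tf.data.experimental.cardinality(ds).numpy())  # 10 (number of elements, if known)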

Converting tokens to word vectors effectively with TensorFlow Transform

I would like to use TensorFlow Transform to convert tokens to word vectors during my training, validation and inference phase.
I followed this StackOverflow post and implemented the initial conversion from tokens to vectors. The conversion works as expected, and I obtain a vector of size EMB_DIM for each token.
import numpy as np
import tensorflow as tf

tf.reset_default_graph()
EMB_DIM = 10

def load_pretrained_glove():
    tokens = ["a", "cat", "plays", "piano"]
    return tokens, np.random.rand(len(tokens), EMB_DIM)

# sample string
string_tensor = tf.constant(["plays", "piano", "unknown_token", "another_unknown_token"])

pretrained_vocab, pretrained_embs = load_pretrained_glove()
vocab_lookup = tf.contrib.lookup.index_table_from_tensor(
    mapping=tf.constant(pretrained_vocab),
    default_value=len(pretrained_vocab))
string_tensor = vocab_lookup.lookup(string_tensor)

# define the word embedding
pretrained_embs = tf.get_variable(
    name="embs_pretrained",
    initializer=tf.constant_initializer(np.asarray(pretrained_embs), dtype=tf.float32),
    shape=pretrained_embs.shape,
    trainable=False)
unk_embedding = tf.get_variable(
    name="unk_embedding",
    shape=[1, EMB_DIM],
    initializer=tf.random_uniform_initializer(-0.04, 0.04),
    trainable=False)

embeddings = tf.cast(tf.concat([pretrained_embs, unk_embedding], axis=0), tf.float32)
word_vectors = tf.nn.embedding_lookup(embeddings, string_tensor)

with tf.Session() as sess:
    tf.tables_initializer().run()
    tf.global_variables_initializer().run()
    print(sess.run(word_vectors))
When I refactor the code to run as a TFX Transform graph, I get the conversion error below.
import pprint
import tempfile

import numpy as np
import tensorflow as tf
import tensorflow_transform as tft
import tensorflow_transform.beam.impl as beam_impl
from tensorflow_transform.tf_metadata import dataset_metadata
from tensorflow_transform.tf_metadata import dataset_schema

tf.reset_default_graph()
EMB_DIM = 10

def load_pretrained_glove():
    tokens = ["a", "cat", "plays", "piano"]
    return tokens, np.random.rand(len(tokens), EMB_DIM)

def embed_tensor(string_tensor, trainable=False):
    """
    Convert a list of strings into a list of indices, then into EMB_DIM vectors
    """
    pretrained_vocab, pretrained_embs = load_pretrained_glove()
    vocab_lookup = tf.contrib.lookup.index_table_from_tensor(
        mapping=tf.constant(pretrained_vocab),
        default_value=len(pretrained_vocab))
    string_tensor = vocab_lookup.lookup(string_tensor)
    pretrained_embs = tf.get_variable(
        name="embs_pretrained",
        initializer=tf.constant_initializer(np.asarray(pretrained_embs), dtype=tf.float32),
        shape=pretrained_embs.shape,
        trainable=trainable)
    unk_embedding = tf.get_variable(
        name="unk_embedding",
        shape=[1, EMB_DIM],
        initializer=tf.random_uniform_initializer(-0.04, 0.04),
        trainable=False)
    embeddings = tf.cast(tf.concat([pretrained_embs, unk_embedding], axis=0), tf.float32)
    return tf.nn.embedding_lookup(embeddings, string_tensor)

def preprocessing_fn(inputs):
    input_string = tf.string_split(inputs['sentence'], delimiter=" ")
    return {'word_vectors': tft.apply_function(embed_tensor, input_string)}

raw_data = [{'sentence': 'This is a sample sentence'},]
raw_data_metadata = dataset_metadata.DatasetMetadata(dataset_schema.Schema({
    'sentence': dataset_schema.ColumnSchema(
        tf.string, [], dataset_schema.FixedColumnRepresentation())
}))

with beam_impl.Context(temp_dir=tempfile.mkdtemp()):
    transformed_dataset, transform_fn = (  # pylint: disable=unused-variable
        (raw_data, raw_data_metadata) | beam_impl.AnalyzeAndTransformDataset(
            preprocessing_fn))

transformed_data, transformed_metadata = transformed_dataset  # pylint: disable=unused-variable
pprint.pprint(transformed_data)
Error Message
TypeError: Failed to convert object of type <class
'tensorflow.python.framework.sparse_tensor.SparseTensor'> to Tensor.
Contents: SparseTensor(indices=Tensor("StringSplit:0", shape=(?, 2),
dtype=int64), values=Tensor("hash_table_Lookup:0", shape=(?,),
dtype=int64), dense_shape=Tensor("StringSplit:2", shape=(2,),
dtype=int64)). Consider casting elements to a supported type.
Questions
Why would the TF Transform step require an additional conversion/casting?
Is this approach of converting tokens to word vectors feasible? The word vectors might take multiple gigabytes of memory. How does Apache Beam handle the vectors? If Beam runs in a distributed setup, would it require N x the vector memory, with N the number of workers?
The SparseTensor-related error occurs because you are calling string_split, which returns a SparseTensor. Your test code does not call string_split, which is why the error only happens in your Transform code.
Regarding memory, you are correct, the embedding matrix must be loaded into each worker.
One cannot put a SparseTensor into the dictionary returned by the TFX Transform, in your case by the function preprocessing_fn. The reason is that a SparseTensor is not a Tensor; it is actually a small subgraph.
To fix your code, you can convert your SparseTensor into a Tensor. There are a number of ways to do so; I would recommend using tf.serialize_sparse for a regular SparseTensor and tf.serialize_many_sparse for a batched one.
To consume such a serialized Tensor in the Trainer, you can call the function tf.deserialize_many_sparse.
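A minimal sketch of that serialization route (TF 1.x names, on a standalone toy input rather than the full Transform pipeline):
import tensorflow as tf

tokens = tf.string_split(tf.constant(["a cat plays piano"]), delimiter=" ")  # SparseTensor
serialized = tf.serialize_many_sparse(tokens)                                # dense string Tensor, shape (batch, 3)
deserialized = tf.deserialize_many_sparse(serialized, dtype=tf.string)       # back to a SparseTensor, e.g. in the Trainer

with tf.Session() as sess:
    print(sess.run(serialized).shape)  # (1, 3)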