I created my tf.data.Dataset from the image files in the directory:
train_ds = tf.keras.utils.image_dataset_from_directory(
"home/the path/to the directory/",
validation_split=0.2,
subset="training",
seed=13,
image_size=image_size,
batch_size=batch_size,
)
val_ds = tf.keras.utils.image_dataset_from_directory(
"home/the path/to the directory/",
validation_split=0.2,
subset="validation",
seed=13,
image_size=image_size,
batch_size=batch_size,
)
I save the datasets using:
tf.data.experimental.save(train_ds, path)
tf.data.experimental.save(val_ds, path)
The original directory contained JPEG images totaling about 500 MB, but the binary files written by tf.data.experimental.save() are 15 GB each!
What did I do wrong?
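A likely explanation rather than a mistake: image_dataset_from_directory decodes and resizes every JPEG, so tf.data.experimental.save serializes the raw decoded tensors instead of the compressed JPEG bytes, which is far larger. A minimal sketch of one mitigation, assuming the same train_ds/val_ds as above (train_path and val_path are hypothetical distinct output directories); save() accepts an optional compression argument:
# Sketch: pass compression='GZIP' to shrink the serialized tensors.
# The files will still be larger than the source JPEGs, since the
# dataset stores decoded pixels, but compression helps considerably.
tf.data.experimental.save(train_ds, train_path, compression="GZIP")
tf.data.experimental.save(val_ds, val_path, compression="GZIP")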
I am fairly new to TensorFlow and I am trying to train a BERT model for a binary classification task.
I have a data set in a single CSV file that looks like this:
Description       Target
This text passed  1
This text failed  0
I loaded the data set as a pandas DataFrame.
The guide I am using is the official TensorFlow guide I found here.
The guide uses the IMDb dataset that is structured in separate folders.
This is the code block that created the TensorFlow dataset:
AUTOTUNE = tf.data.AUTOTUNE
batch_size = 32
seed = 42
raw_train_ds = tf.keras.preprocessing.text_dataset_from_directory(
'aclImdb/train',
batch_size=batch_size,
validation_split=0.2,
subset='training',
seed=seed)
class_names = raw_train_ds.class_names
train_ds = raw_train_ds.cache().prefetch(buffer_size=AUTOTUNE)
val_ds = tf.keras.preprocessing.text_dataset_from_directory(
'aclImdb/train',
batch_size=batch_size,
validation_split=0.2,
subset='validation',
seed=seed)
val_ds = val_ds.cache().prefetch(buffer_size=AUTOTUNE)
test_ds = tf.keras.preprocessing.text_dataset_from_directory(
'aclImdb/test',
batch_size=batch_size)
test_ds = test_ds.cache().prefetch(buffer_size=AUTOTUNE)
My question: is there a way to convert my pandas DataFrame into the same format?
I.e., how do I generate train_ds, test_ds, and val_ds from a pandas DataFrame?
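A minimal sketch of one way to do this, assuming your DataFrame is named df with the Description and Target columns shown above (the split fractions and seed are illustrative):
import tensorflow as tf

AUTOTUNE = tf.data.AUTOTUNE
batch_size = 32

# Shuffle the rows once, then slice into 80/10/10 train/val/test splits.
df = df.sample(frac=1.0, random_state=42)
n = len(df)
train_df = df[:int(0.8 * n)]
val_df = df[int(0.8 * n):int(0.9 * n)]
test_df = df[int(0.9 * n):]

def make_ds(frame, shuffle=False):
    # from_tensor_slices pairs each text with its label, mirroring the
    # (text, label) elements produced by text_dataset_from_directory.
    ds = tf.data.Dataset.from_tensor_slices(
        (frame['Description'].values, frame['Target'].values))
    if shuffle:
        ds = ds.shuffle(buffer_size=len(frame), seed=42)
    return ds.batch(batch_size).cache().prefetch(buffer_size=AUTOTUNE)

train_ds = make_ds(train_df, shuffle=True)
val_ds = make_ds(val_df)
test_ds = make_ds(test_df)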
I have a structure of directories like this:
-root_dir
--train
---dog (contains 750 images of dogs)
---cat (contains 750 images of cats)
---mouse (contains 750 images of mice)
--test
---dog (contains 250 images of dogs)
---cat (contains 250 images of cats)
---mouse (contains 250 images of mice)
This is how I load the data:
from tensorflow.keras.preprocessing.image import ImageDataGenerator
train_data_gen = ImageDataGenerator(rescale=1./255)
train_data = train_data_gen.flow_from_directory(directory='/root_dir/train/',
target_size=(224, 224),
class_mode='categorical',
batch_size=32,
seed=42)
test_data_gen = ImageDataGenerator(rescale=1./255)
test_data = test_data_gen.flow_from_directory(directory='/root_dir/test/',
target_size=(224, 224),
class_mode='categorical',
batch_size=32,
seed=42)
It works fine.
train_data contains 750 images of each class.
However, I need to run fast experiments on only 10 percent of the data.
I need a train_data_10_percent_subset that contains 75 randomly chosen images of each class.
Is there a simple way with ImageDataGenerator to randomly choose 10 percent of the images in each sub-folder of the train directory?
I need a generator that yields 75 images of each class from the train subfolders.
You can do this:
train_data_gen = ImageDataGenerator(rescale=1./255, validation_split=.1)
train_data = train_data_gen.flow_from_directory(directory='/root_dir/train/',
target_size=(224, 224),
class_mode='categorical',
batch_size=32,
seed=42, subset='validation')
Setting validation_split=0.1 reserves 10% of the data for validation, and subset='validation' makes train_data contain that reserved 10% of the training data. (One caveat, as far as I know: Keras takes this split deterministically from each class's file list rather than sampling it randomly.)
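If a truly random 10% of each class matters for your experiments, one alternative is to sample file paths into a DataFrame and use flow_from_dataframe — a sketch, assuming the directory layout above (column names and seed are illustrative):
import os
import pandas as pd
from tensorflow.keras.preprocessing.image import ImageDataGenerator

train_dir = '/root_dir/train/'
rows = [(os.path.join(train_dir, klass, f), klass)
        for klass in os.listdir(train_dir)
        for f in os.listdir(os.path.join(train_dir, klass))]
df = pd.DataFrame(rows, columns=['filepath', 'label'])

# randomly keep 10% of each class (75 of the 750 images per class)
subset_df = df.groupby('label', group_keys=False).sample(frac=0.1, random_state=42)

gen = ImageDataGenerator(rescale=1./255)
train_data_10_percent_subset = gen.flow_from_dataframe(
    subset_df, x_col='filepath', y_col='label', target_size=(224, 224),
    class_mode='categorical', batch_size=32, seed=42)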
I am having trouble finding a way to create a dataset in tensorflow from images. My dataset has the structure below:
fruit-data
|
|-train
| |
| |- Freshapple -> .png images of fresh apples
| |- Freshorange -> .png images of fresh oranges
| |- Freshbanana -> .png images of fresh bananas
|
|-test
| |
| |- Rottenapple -> .png images of rotten apples
| |- Rottenorange -> png images of rotten oranges
| |- Rottenbanana -> .png images of rotten bananas
|
I have my paths and classes set like so:
train_path = ".../Desktop/Data/fruit-dataset/train"
test_path = ".../Desktop/Data/fruit-dataset/train"
categories = ["freshapple", "freshorange", "freshbanana",
"rottenapple", "rottenorange", "rottenbanana"]
From other resources I've seen, because my dataset contains over 13k images, I need to use flow_from_directory(), since loading everything into memory would crash the runtime.
I'm confused on what the next steps are to get this dataset loaded in.
For other information, I will be using a tuned MobilenetV2 model. (experimenting with freezing layers)
There are a number of ways to load the data. I prefer to use pandas DataFrames because it is easy to partition the data in various ways. The code below should be what you need:
import os
import pandas as pd
from sklearn.model_selection import train_test_split

sdir = r'.../Desktop/Data/fruit-dataset'
categories = ['train', 'test']
for category in categories:
    catpath = os.path.join(sdir, category)
    classlist = os.listdir(catpath)
    filepaths = []
    labels = []
    for klass in classlist:
        classpath = os.path.join(catpath, klass)
        flist = os.listdir(classpath)
        for f in flist:
            fpath = os.path.join(classpath, f)
            filepaths.append(fpath)
            labels.append(klass)
    Fseries = pd.Series(filepaths, name='filepaths')
    Lseries = pd.Series(labels, name='labels')
    if category == 'train':
        df = pd.concat([Fseries, Lseries], axis=1)
    else:
        test_df = pd.concat([Fseries, Lseries], axis=1)
# create a validation data set
train_df, valid_df = train_test_split(df, train_size=.8, shuffle=True, random_state=123)
print('train_df length: ', len(train_df), ' test_df length: ', len(test_df), ' valid_df length: ', len(valid_df))
# check the balance of the training set
balance = list(train_df['labels'].value_counts())
for b in balance:
    print(b)
height=224
width=224
channels=3
batch_size=40
img_shape=(height, width, channels)
img_size=(height, width)
length = len(test_df)
# pick the largest batch size <= 80 that divides the test set length evenly,
# so every test sample is used exactly once
test_batch_size = sorted([int(length/n) for n in range(1, length+1) if length % n == 0 and length/n <= 80], reverse=True)[0]
test_steps = int(length/test_batch_size)
print('test batch size: ', test_batch_size, ' test steps: ', test_steps)
def scalar(img):
    img = img/255
    return img
trgen=ImageDataGenerator(preprocessing_function=scalar, horizontal_flip=True)
tvgen=ImageDataGenerator(preprocessing_function=scalar)
train_gen=trgen.flow_from_dataframe( train_df, x_col='filepaths', y_col='labels', target_size=img_size, class_mode='categorical',
color_mode='rgb', shuffle=True, batch_size=batch_size)
test_gen=tvgen.flow_from_dataframe( test_df, x_col='filepaths', y_col='labels', target_size=img_size, class_mode='categorical',
color_mode='rgb', shuffle=False, batch_size=test_batch_size)
valid_gen=tvgen.flow_from_dataframe( valid_df, x_col='filepaths', y_col='labels', target_size=img_size, class_mode='categorical',
color_mode='rgb', shuffle=True, batch_size=batch_size)
classes=list(train_gen.class_indices.keys())
class_count=len(classes)
history=model.fit(x=train_gen, epochs=20, verbose=2, validation_data=valid_gen,
validation_steps=None, shuffle=False, initial_epoch=0)
Or a simpler but less versatile way is with flow_from_directory:
gen=tf.keras.preprocessing.image.ImageDataGenerator( rescale=1/255,
validation_split=0.1)
tgen=tf.keras.preprocessing.image.ImageDataGenerator( rescale=1/255)
train_dir=r'.../Desktop/Data/fruit-dataset/train'
train_gen=gen.flow_from_directory(train_dir, target_size=(256, 256),
                class_mode="categorical", batch_size=32, shuffle=True,
                seed=123, subset='training')
valid_gen=gen.flow_from_directory(train_dir, target_size=(256, 256),
class_mode="categorical", batch_size=32, shuffle=True,
seed=123, subset='validation')
test_dir=r'.../Desktop/Data/fruit-dataset/test' # you had this wrong in your code
test_gen=tgen.flow_from_directory(test_dir, target_size=(256, 256),
class_mode="categorical", batch_size=32, shuffle=False)
history=model.fit(x=train_gen, epochs=20, verbose=2, validation_data=valid_gen,
validation_steps=None, shuffle=False, initial_epoch=0)
I am trying to learn machine learning from the official TensorFlow tutorials.
But most tutorials do the dataset download at the command prompt.
I can't find any tutorial about loading my own image dataset from my own disk.
It would be great if you could give me a direct answer.
I put the image dataset on my Windows 10 desktop:
C:\Users\User\Desktop\DataSet\coins\data
\test (label 1-211)
\train (label 1-211)
\validation (label 1-211)
You can use image_dataset_from_directory for this; you just have to pass the path to the files in the directory argument.
from tensorflow.keras.preprocessing import image_dataset_from_directory
train_dataset = image_dataset_from_directory(
directory=TRAIN_DIR,
labels="inferred",
label_mode="categorical",
image_size=SIZE,
seed=SEED,
subset=None,
interpolation="bilinear",
follow_links=False,
)
validation_dataset = image_dataset_from_directory(
directory=VALIDATION_DIR,
labels="inferred",
label_mode="categorical",
image_size=SIZE,
seed=SEED,
subset=None,
interpolation="bilinear",
follow_links=False,
)
test_dataset = image_dataset_from_directory(
directory=TEST_DIR,
labels="inferred",
label_mode="categorical",
image_size=SIZE,
seed=SEED,
subset=None,
interpolation="bilinear",
follow_links=False,
)
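TRAIN_DIR, VALIDATION_DIR, TEST_DIR, SIZE, and SEED are assumed to be defined elsewhere; hypothetical values matching your directory layout, for illustration:
TRAIN_DIR = r"C:\Users\User\Desktop\DataSet\coins\data\train"
VALIDATION_DIR = r"C:\Users\User\Desktop\DataSet\coins\data\validation"
TEST_DIR = r"C:\Users\User\Desktop\DataSet\coins\data\test"
SIZE = (224, 224)  # example image size; use whatever your model expects
SEED = 42          # any fixed seed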
You can use flow_from_directory in Keras (note: there is no flow_from_disk method; flow_from_directory is the one that reads images from disk).
There are pretty good tutorials covering it.
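A minimal sketch, with the directory and sizes as illustrative assumptions:
from tensorflow.keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(rescale=1./255)
train_gen = datagen.flow_from_directory(
    r'C:\Users\User\Desktop\DataSet\coins\data\train',  # one subfolder per label
    target_size=(224, 224),
    class_mode='categorical',
    batch_size=32)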
I'm trying to implement an autoencoder in TensorFlow 2.3, taking my own image dataset stored on disk as input. Can someone explain how to do this correctly?
I tried loading the data with tf.keras.preprocessing.image_dataset_from_directory(), but when I start training with the data from that method I get the following error:
"ValueError: y argument is not supported when using dataset as input."
Below is the code that I am running:
import tensorflow as tf
from convautoencoder import ConvAutoencoder
from tensorflow.keras.optimizers import Adam
import matplotlib.pyplot as plt
import numpy as np
EPOCHS = 25
batch_size = 1
img_height = 180
img_width = 180
data_dir = "/media/aniruddha/FE47-91B8/Laptop_Backup/Auto-Encoders/Basic/data"
train_ds = tf.keras.preprocessing.image_dataset_from_directory(
data_dir,
validation_split=0.2,
subset="training",
seed=123,
image_size=(img_height, img_width),
batch_size=batch_size)
val_ds = tf.keras.preprocessing.image_dataset_from_directory(
data_dir,
validation_split=0.2,
subset="validation",
seed=123,
image_size=(img_height, img_width),
batch_size=batch_size)
(encoder, decoder, autoencoder) = ConvAutoencoder.build(224, 224, 3)
opt = Adam(lr=1e-3)
autoencoder.compile(loss="mse", optimizer=opt)
H = autoencoder.fit( train_ds, train_ds, validation_data=(val_ds, val_ds), epochs=EPOCHS, batch_size=batch_size)
I resolved this. I was not feeding the input data to the model as (input, target) tuples for training; once I corrected that, the training started.
I used generators to feed the input data as tuples to the autoencoder.
Please find my code below.
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint
from tensorflow.keras.optimizers import Adam

BS = 32      # batch size (assumed value; not shown in the original post)
EPOCHS = 25  # from the question above

# initialize the training data augmentation object
trainAug = ImageDataGenerator(rescale=1. / 255)
valAug = ImageDataGenerator(rescale=1. / 255)
# initialize the training generator
trainGen = trainAug.flow_from_directory(
config.TRAIN_PATH,
class_mode="input",
classes=None,
target_size=(64, 64),
color_mode="grayscale",
shuffle=True,
batch_size=BS)
# initialize the validation generator
valGen = valAug.flow_from_directory(
config.TRAIN_PATH,
class_mode="input",
classes=None,
target_size=(64, 64),
color_mode="grayscale",
shuffle=False,
batch_size=BS)
# initialize the testing generator
testGen = valAug.flow_from_directory(
config.TRAIN_PATH,
class_mode="input",
classes=None,
target_size=(64, 64),
color_mode="grayscale",
shuffle=False,
batch_size=BS)
early_stop = EarlyStopping(monitor='val_loss', patience=20)
mc = ModelCheckpoint('best_model_1.h5', monitor='val_loss', mode='min', save_best_only=True)
# construct our convolutional autoencoder
print("[INFO] building autoencoder...")
(encoder, decoder, autoencoder) = ConvAutoencoder.build(64, 64, 1)
opt = Adam(learning_rate= 0.0001, beta_1=0.9, beta_2=0.999, epsilon=1e-04, amsgrad=False)
autoencoder.compile(loss="mse", optimizer=opt)
# train the convolutional autoencoder
H = autoencoder.fit( trainGen, validation_data=valGen, epochs=EPOCHS, batch_size=BS ,callbacks=[ mc , early_stop])
fit expects data and labels, but when the input is a tf.data.Dataset it accepts only the single dataset and no separate y argument. To use the images as their own labels for the autoencoder, you should provide them twice to the dataset constructor, e.g.:
dataset = tf.data.Dataset.from_tensor_slices((images, images))
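Alternatively, if you want to keep image_dataset_from_directory, a sketch of the same idea: map the (image, label) batches it yields to (image, image) so the model trains against its own input:
# image_dataset_from_directory yields (images, labels) batches;
# replace the labels with the images themselves for autoencoder training.
train_ds = train_ds.map(lambda x, y: (x, x))
val_ds = val_ds.map(lambda x, y: (x, x))

H = autoencoder.fit(train_ds, validation_data=val_ds, epochs=EPOCHS)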