TensorFlow 2.3 pipeline loads all the data into RAM - tensorflow

I created a pipeline using the tf.data API for reading a dataset of images. The dataset is large and the images are high resolution. However, each time I try to read the whole dataset, the computer crashes because the code uses all the RAM. I tested the code with about 1280 images and it works without any error, but when I use the whole dataset it crashes.
So I am wondering if there is a way to make tf.data read only one or two batches ahead and no more than that.
This is the code I am using to create the pipeline:
import numpy as np
import tensorflow as tf
from PIL import Image

def decode_img(self, img):
    img = tf.image.convert_image_dtype(img, tf.float32, saturate=False)
    img = tf.image.resize(img, size=self.input_dim, antialias=False, name=None)
    return img

def get_label(self, label):
    # One-hot encode the integer label
    y = np.zeros(self.n_class, dtype=np.float32)
    y[label] = 1
    return y

def process_path(self, file_path, label):
    label = self.get_label(label)
    img = Image.open(file_path)
    width, height = img.size
    # Halve the image in both dimensions
    new_hight = height // 2
    new_width = width // 2
    newsize = (new_width, new_hight)
    img = img.resize(newsize)
    if self.aug_img:
        img = self.policy(img)
    img = self.decode_img(np.array(img, dtype=np.float32))
    return img, label

def create_pip_line(self):
    def _fixup_shape(images, labels):
        images.set_shape([None, None, 3])
        labels.set_shape([7]) # I have 19 classes
        return images, labels

    tf_ds = tf.data.Dataset.from_tensor_slices((self.df["file_path"].values, self.df["class_num"].values))
    tf_ds = tf_ds.map(lambda img, label: tf.numpy_function(self.process_path,
                                                           [img, label],
                                                           (tf.float32, tf.float32)),
                      num_parallel_calls=tf.data.experimental.AUTOTUNE)
    tf_ds = tf_ds.map(_fixup_shape)
    if not self.is_val:
        tf_ds = tf_ds.shuffle(len(self.df), reshuffle_each_iteration=True)
    tf_ds = tf_ds.batch(self.batch_size).repeat(self.epoch_num)
    self.tf_ds = tf_ds.prefetch(tf.data.experimental.AUTOTUNE)

The main issue in my code was the shuffle function. I used it with two parameters: the first one, buffer_size, is the number of elements held in the shuffle buffer; the second one, reshuffle_each_iteration, controls whether the data is reshuffled on every epoch.
I found that the amount of data loaded into memory depends on this buffer size. Therefore, I reduced it from the size of the whole dataset to 100. The pipeline now keeps only about 100 decoded images in the buffer at a time, draws randomly from them, and refills the buffer with the next images as it goes.
if not self.is_val:
    tf_ds = tf_ds.shuffle(100, reshuffle_each_iteration=True)
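To make the memory argument concrete, here is a minimal pure-Python sketch of how a bounded shuffle buffer behaves. It imitates the documented behaviour of shuffle(buffer_size) and is not TensorFlow's actual implementation; buffered_shuffle and its parameters are names made up for this illustration.

```python
import random

def buffered_shuffle(items, buffer_size, seed=None):
    """Yield items in pseudo-random order while holding at most buffer_size of them."""
    rng = random.Random(seed)
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            buffer.append(item)  # still filling the buffer
            continue
        # Buffer is full: emit one random element and put the new item in its slot.
        idx = rng.randrange(buffer_size)
        out, buffer[idx] = buffer[idx], item
        yield out
    # Input exhausted: drain what is left in random order.
    rng.shuffle(buffer)
    yield from buffer

shuffled = list(buffered_shuffle(range(1000), buffer_size=100, seed=0))
print(shuffled[:10])
```

Only buffer_size elements are ever held at once, which is why the buffer size drives the RAM usage. The price is that the shuffle is only local: the k-th output can never come from further than buffer_size + k elements into the input. One option, assuming the rest of the pipeline stays as above, is to call shuffle before the map step, so that the buffer holds file paths rather than decoded images and a much larger buffer becomes affordable.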

Related

How to get top k predictions for a new Image

I am using this function to predict the output of never seen images
def predictor(img, model):
    image = cv2.imread(img)
    image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
    image = cv2.resize(image, (224, 224))
    image = np.array(image, dtype = 'float32')/255.0
    plt.imshow(image)
    image = image.reshape(1, 224,224,3)
    clas = model.predict(image).argmax()
    name = dict_class[clas]
    print('The given image is of \nClass: {0} \nSpecies: {1}'.format(clas, name))
How can I change it if I want the top 2 (or top k) predictions?
i.e.
70% chance its dog
15% its a bear
If you are using TensorFlow + Keras and doing multi-class classification, the output of model.predict() is an array containing either the logits or already the probabilities (softmax on top of the logits).
I am taking this example from https://www.tensorflow.org/api_docs/python/tf/math/top_k and slightly modifying it.
import tensorflow as tf
# Softmax output: the probabilities add up to 1
network_predictions = [0.7, 0.2, 0.05, 0.05]
prediction_probabilities = tf.math.top_k(network_predictions, k=2)
top_2_scores = prediction_probabilities.values.numpy()
dict_class_entries = prediction_probabilities.indices.numpy()
dict_class_entries then holds the class indices, ordered from the highest probability to the lowest, and top_2_scores holds the matching probabilities (i.e. dict_class_entries[0] = 0, which corresponds to top_2_scores[0] = 0.7, etc.).
You just need to replace network_predictions with model.predict(image). Because model.predict() returns one row per input image, the results then have shape (1, k), so take row [0] for a single image.
Notice I removed the argmax() so that the whole array of probabilities is passed on, instead of only the index of the highest score.
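For printing output in the "70% chance it's a dog" style from the question, the same ranking can be sketched in plain Python. This is only an illustration: top_k_classes is a hypothetical helper, and the label map and probabilities below are made-up values standing in for dict_class and one row of model.predict(image).

```python
def top_k_classes(probabilities, class_names, k=2):
    """Return the k most likely (class_name, probability) pairs, highest first."""
    ranked = sorted(range(len(probabilities)),
                    key=lambda i: probabilities[i],
                    reverse=True)
    return [(class_names[i], probabilities[i]) for i in ranked[:k]]

# Hypothetical label map and one row of softmax output
dict_class = {0: "dog", 1: "bear", 2: "cat", 3: "fox"}
probs = [0.70, 0.15, 0.10, 0.05]

for name, p in top_k_classes(probs, dict_class, k=2):
    print(f"{p:.0%} chance it's a {name}")
# 70% chance it's a dog
# 15% chance it's a bear
```

With a real model you would pass model.predict(image)[0] as the probabilities, assuming the last layer is a softmax.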

TF Dataset with CSV file where image is stored in another directory

I'm trying to create an input dataset into my TF model using a CSV dataset that I have. The dataset has the following scheme:
image_name, label
XXXXXXX.png, some_integer_value
XXXXXXX.png, some_integer_value
I did a bit of research and found that the tf.data.Dataset API seems to be optimized for this task. I am trying to use tf.data.experimental.make_csv_dataset in order to do this task. My issue that I'm facing is that I'm not sure how to load in the images into my dataset. I currently have the following setup:
csv_dataset = tf.data.experimental.make_csv_dataset(
    PATH_TO_DATA_CSV,
    batch_size = 5,
    select_columns = ['image_name', 'label'],
    label_name = 'label',
    num_epochs = 1,
    ignore_errors = True
)
My original idea was to use a map on the dataset in order to read the file, doing something like
def process_data(image_name, label):
    image_name = image_name.numpy().decode('utf-8')
    img = tf.io.read_file(DATA_PATH + '/' + image_name)
    img = decode_img(img)
    return img, label
csv_dataset = csv_dataset.map(process_data)
But this seems to be throwing the error
`File "", line 4, in process_data *
image_name = image_name.numpy().decode('utf-8')
AttributeError: 'collections.OrderedDict' object has no attribute 'numpy'`
Should I be approaching the problem this way (and if so, how can I fix my error)? If not, what is the most optimal way to approach this.
You can use tf.data.Dataset.from_tensor_slices in conjunction with Pandas (to build all_image_paths and all_image_labels), for something like:
def load_and_preprocess_image(path):
    # path is a scalar string tensor; tf.io.read_file accepts it directly
    image_string = tf.io.read_file(path)
    img = tf.io.decode_png(image_string, channels=3)
    return tf.image.resize(img, [1000, 1000])
def load_and_preprocess_from_path_labels(path, label):
    return load_and_preprocess_image(path), label
ds = tf.data.Dataset.from_tensor_slices((all_image_paths, all_image_labels))
csv_dataset = ds.map(load_and_preprocess_from_path_labels, num_parallel_calls=tf.data.AUTOTUNE)
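As for the error itself: when label_name is set, make_csv_dataset yields (features, label) pairs in which features is a dictionary of column tensors, so the first argument of the mapped function is an OrderedDict and not a string. Building the two lists up front avoids that. Below is a minimal sketch using the standard csv module instead of Pandas; read_image_index and the "images" directory are names assumed for this example, and the in-memory CSV stands in for the real file.

```python
import csv
import io
import os

def read_image_index(csv_file, image_dir):
    """Return (paths, labels) from a CSV with image_name and label columns."""
    paths, labels = [], []
    # skipinitialspace handles the blank after each comma in "image_name, label"
    reader = csv.DictReader(csv_file, skipinitialspace=True)
    for row in reader:
        paths.append(os.path.join(image_dir, row["image_name"]))
        labels.append(int(row["label"]))
    return paths, labels

# Small in-memory CSV in the same layout as the question
sample = io.StringIO("image_name, label\na.png, 3\nb.png, 7\n")
all_image_paths, all_image_labels = read_image_index(sample, "images")
print(all_image_paths, all_image_labels)
```

For a real file, pass open(PATH_TO_DATA_CSV, newline="") in place of the StringIO object. The resulting lists are what from_tensor_slices expects, and the image directory is already joined into each path.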

Sampling for large class and augmentation for small classes in each batch

Let's say we have 2 classes; one is small and the other is large.
I would like to use data augmentation similar to ImageDataGenerator for the small class, and sampling for the large class, in such a way that each batch is balanced (minority class: augmentation; majority class: sampling).
Also, I would like to continue using image_dataset_from_directory (since the dataset doesn't fit into RAM).
What about the sample_from_datasets function?
import tensorflow as tf
# Use the public API path rather than the private tensorflow.python module
sample_from_datasets = tf.data.experimental.sample_from_datasets

def augment(val):
    # Example of augmentation function
    return val - tf.random.uniform(shape=tf.shape(val), maxval=0.1)

big_dataset_size = 1000
small_dataset_size = 10
# Init some datasets
dataset_class_large_positive = tf.data.Dataset.from_tensor_slices(tf.range(100, 100 + big_dataset_size, dtype=tf.float32))
dataset_class_small_negative = tf.data.Dataset.from_tensor_slices(-tf.range(1, 1 + small_dataset_size, dtype=tf.float32))
# Upsample and augment small dataset
dataset_class_small_negative = dataset_class_small_negative \
    .repeat(big_dataset_size // small_dataset_size) \
    .map(augment)
dataset = sample_from_datasets(
    datasets=[dataset_class_large_positive, dataset_class_small_negative],
    weights=[0.5, 0.5]
)
dataset = dataset.shuffle(100)
dataset = dataset.batch(6)
iterator = dataset.as_numpy_iterator()
for i in range(5):
    print(next(iterator))
# [109. -10.044552 136. 140. -1.0505208 -5.0829906]
# [122. 108. 141. -4.0211563 126. 116. ]
# [ -4.085523 111. -7.0003924 -7.027302 -8.0362625 -4.0226436]
# [ -9.039093 118. -1.0695585 110. 128. -5.0553837]
# [100. -2.004463 -9.032592 -8.041705 127. 149. ]
Set the desired balance between the classes with the weights parameter of sample_from_datasets.
As Yaoshiang noticed, the last batches are imbalanced and the two datasets have different lengths. This can be avoided by using
# Repeat infinitely both datasets and augment the small one
dataset_class_large_positive = dataset_class_large_positive.repeat()
dataset_class_small_negative = dataset_class_small_negative.repeat().map(augment)
instead of
# Upsample and augment small dataset
dataset_class_small_negative = dataset_class_small_negative \
    .repeat(big_dataset_size // small_dataset_size) \
    .map(augment)
In this case, however, the dataset is infinite, so the number of batches per epoch has to be controlled separately (for example with take() on the dataset or with steps_per_epoch in model.fit).
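One way to pick that number of batches is sketched below. It assumes an "epoch" should draw enough samples for the large class to be seen about once on average; that definition, and the variable names other than big_dataset_size, are choices made for this example.

```python
import math

big_dataset_size = 1000
batch_size = 6
large_class_weight = 0.5  # share of the samples drawn from the large class

# Draw enough samples that the large class is covered about once on average.
samples_per_epoch = math.ceil(big_dataset_size / large_class_weight)
steps_per_epoch = math.ceil(samples_per_epoch / batch_size)
print(samples_per_epoch, steps_per_epoch)  # 2000 334
```

The result can be used as dataset.take(steps_per_epoch) on the batched dataset or passed as steps_per_epoch to model.fit.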
You can use tf.data.Dataset.from_generator, which gives you more control over how the data is generated without loading all of it into RAM.
def generator():
    # large_class_sample() and small_class_augmented() are placeholders for your own sampling code
    i = 0
    while True:
        if i % 2 == 0:
            elem = large_class_sample()
        else:
            elem = small_class_augmented()
        yield elem
        i = i + 1
ds = tf.data.Dataset.from_generator(
    generator,
    output_signature=(
        tf.TensorSpec(shape=yourElem_shape, dtype=yourElem_type)))
This generator alternates samples between the two classes, and you can add more dataset operations (batch, shuffle, ...).
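The alternation can be tried out without TensorFlow. The sketch below is a plain-Python stand-in under stated assumptions: alternating_generator, the 0/1 labels, and the toy data are all hypothetical, and in a real pipeline the samples would be image arrays.

```python
import itertools
import random

def alternating_generator(large_samples, small_samples, augment, seed=None):
    """Yield (sample, label) pairs, alternating large class (0) and small class (1)."""
    rng = random.Random(seed)
    large_cycle = itertools.cycle(large_samples)
    for i in itertools.count():
        if i % 2 == 0:
            # Even step: next sample from the large class, cycling when exhausted
            yield next(large_cycle), 0
        else:
            # Odd step: random sample from the small class, freshly augmented
            yield augment(rng.choice(small_samples), rng), 1

def jitter(value, rng):
    # Toy augmentation: subtract a small random amount
    return value - rng.uniform(0.0, 0.1)

gen = alternating_generator(range(100, 110), [-1.0, -2.0, -3.0], jitter, seed=0)
first_six = list(itertools.islice(gen, 6))
print(first_six)
```

With a batch size that is a multiple of 2, every batch built from such a generator holds the same number of samples from each class.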
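I didn't totally follow the problem. Would this pseudo-code work? Perhaps there are some operators on tf.data.Dataset that are sufficient to solve your problem.
# image_dataset_from_directory yields batches by default, so unbatch to filter single examples
ds = image_dataset_from_directory(...).unbatch()
ds1 = ds.filter(lambda image, label: label == MAJORITY)
ds2 = ds.filter(lambda image, label: label != MAJORITY)
# data_augment is a placeholder for your augmentation function
ds2 = ds2.map(lambda image, label: (data_augment(image), label))
ds1 = ds1.batch(int(10. / MAJORITY_RATIO))
ds2 = ds2.batch(int(10. / MINORITY_RATIO))
ds3 = tf.data.Dataset.zip((ds1, ds2))
# Each element is ((images1, labels1), (images2, labels2)); merge them along the batch axis
ds3 = ds3.map(lambda left, right: (tf.concat([left[0], right[0]], axis=0),
                                   tf.concat([left[1], right[1]], axis=0)))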
You can use tf.data.Dataset.from_tensor_slices to load the images of the two categories separately and do data augmentation for the minority class. Once you have the two datasets, combine them with tf.data.experimental.sample_from_datasets (also available as tf.data.Dataset.sample_from_datasets in recent TensorFlow releases).
import os
from glob import glob
import tensorflow as tf

# assume class1 is the minority class
files_class1 = glob(os.path.join('class1', '*.jpg'))
files_class2 = glob(os.path.join('class2', '*.jpg'))

def load(filepath):
    # The first path component is the class name
    class_name = tf.strings.split(filepath, os.sep)[0]
    image = tf.io.decode_jpeg(tf.io.read_file(filepath), channels=3)
    return image, class_name

def augment(image, class_name):
    # do all the data augmentation; here: keep the original and add a flipped copy
    images = tf.stack([image, tf.image.flip_left_right(image)])
    labels = tf.stack([class_name, class_name])
    return tf.data.Dataset.from_tensor_slices((images, labels))

# apply data augmentation for class1 only
train_class1 = tf.data.Dataset.from_tensor_slices(files_class1) \
    .map(load, num_parallel_calls=tf.data.AUTOTUNE) \
    .flat_map(augment)
train_class2 = tf.data.Dataset.from_tensor_slices(files_class2) \
    .map(load, num_parallel_calls=tf.data.AUTOTUNE)
dataset = tf.data.experimental.sample_from_datasets(
    datasets=[train_class1, train_class2],
    weights=[0.5, 0.5])
# batching assumes all images share one size; otherwise resize them in load()
dataset = dataset.batch(BATCH_SIZE)

Manipulating a matrix in TensorFlow, problem with tf.data.Dataset

I want to concatenate three images with size [1024,1024,3] to make a batch with size [3,1024,1024,3]. I wrote this code with TensorFlow but it doesn't work. It returns the error "InaccessibleTensorError: The tensor 'Tensor("truediv:0", shape=(1024, 1024, 3), dtype=float32)' cannot be accessed here: it is defined in another function or code block. Use return values, explicit Python locals or TensorFlow collections to access it.".
def decode_img(filename):
    image = tf.ones((3,1024,1024,3),dtype=tf.dtypes.float32)
    cnt=0
    slices = []
    for fi in filename:
        bits = tf.io.read_file(fi)
        img = tf.image.decode_jpeg(bits, channels=3)
        img = tf.image.resize(img, (1024,1024))
        slices.append(tf.cast(img, tf.float32) / 255.0)
        cnt +=1
    image = tf.stack(slices)
    return image
#-----------------------
filenames = ['img1.png', 'img2.png', 'img3.png']
dataset = tf.data.Dataset.from_tensor_slices(filenames)
dataset = dataset.map(decode_img, num_parallel_calls=AUTO)
In general, TensorFlow does not support item assignment. Instead, generate all the image layers you want and then combine them with tf.stack() or tf.concat().
filename = ['img1.png', 'img2.png', 'img3.png']
cnt = 0
slices = []
for fi in filename:
    bits = tf.io.read_file(fi)
    img = tf.image.decode_jpeg(bits, channels=3)
    img = tf.image.resize(img, (1024,1024))
    slices.append(tf.cast(img, tf.float32) / 255.0)
    cnt += 1
# Stack the three [1024, 1024, 3] images into one [3, 1024, 1024, 3] tensor
image = tf.stack(slices)

What is the PyTorch substitute for this TensorFlow code?

I am having trouble converting this code from TensorFlow to PyTorch:
datagen = ImageDataGenerator(
    shear_range=0.2,
    zoom_range=0.2,
)
def read_img(filename, size, path):
    img = image.load_img(os.path.join(path, filename), target_size=size)
    #convert image to array
    img = img_to_array(img) / 255
    return img
and then
corona_df = final_train_data[final_train_data['Label_2_Virus_category'] == 'COVID-19']
with_corona_augmented = []
#create a function for augmentation
def augment(name):
    img = read_img(name, (255,255), train_img_dir)
    i = 0
    for batch in tqdm(datagen.flow(tf.expand_dims(img, 0), batch_size=32)):
        with_corona_augmented.append(tf.squeeze(batch).numpy())
        if i == 20:
            break
        i =i+1
#apply the function
corona_df['X_ray_image_name'].apply(augment)
I tried doing
transform = transforms.Compose([transforms.Resize(255*255)
])
train_loader = torch.utils.data.DataLoader(os.path.join(train_dir,corona_df),transform = transform,batch_size =32)
def read_img(path):
    img = train_loader()
    img = np.asarray(img,dtype='int32')
    img = img/255
    return img
I tried continuing but got so confused by the errors.
I welcome any feedback. Tell me if I missed something.
Even a small piece of advice would help, thanks!
You can create a custom dataset to read the images. If you have a directory of images arranged in one subfolder per class, you can use the default ImageFolder dataset. Otherwise, if your folders are laid out differently, you can write your own custom dataset class. You can look at this link for custom datasets. The DataLoader automatically fetches the data from your dataset: it reads the images through the dataset's __getitem__ function and applies the transformations. So you don't need anything fancy to apply augmentation.
import torch
import torchvision
from torchvision import transforms
transform = transforms.Compose([
    # scale must be positive; (0.8, 1.2) corresponds to zoom_range=0.2
    transforms.RandomAffine(20, shear=20, scale=(0.8, 1.2)),
    transforms.Resize((255, 255)),
    transforms.ToTensor(),  # PIL image -> float tensor in [0, 1]
])
dataset = torchvision.datasets.ImageFolder(train_img_dir, transform=transform)
loader = torch.utils.data.DataLoader(dataset, batch_size=32, shuffle=True)
for images, labels in loader:
    output = model(images)