How to save a specific TensorFlow variable to local disk as an ndarray? (not a layer parameter, just a variable or tensor)

How can I save a specific TensorFlow variable (not a layer parameter, just a variable or tensor) to local disk as an ndarray?
Something like:
ux=tf.Variable([10,1600,1,2])
tf.save('ux.npy',ux)
Is there anything like the code above, so that afterwards I can load ux.npy like this:
ux = numpy.load('ux.npy')

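NumPy has its own save and load functions, so once you have the variable's value as a NumPy array (for example via ux.numpy() in TensorFlow 2.x, or sess.run(ux) in TensorFlow 1.x) you can save it with NumPy. To get a file that numpy.load can read back, use numpy.save; numpy.savetxt writes plain text instead and only handles 1-D and 2-D arrays. The syntax is:
numpy.save(out_file_name, array_to_save)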
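A minimal round-trip sketch using numpy.save and numpy.load, the pair that writes and reads the binary .npy format. It assumes the value has already been pulled out of TensorFlow as a NumPy array; the placeholder array and the temporary path are illustrative only.

```python
import os
import tempfile

import numpy as np

# Placeholder for the variable's value. With TensorFlow 2.x this would
# typically come from ux.numpy() (assumption: eager execution is enabled).
ux_value = np.array([10, 1600, 1, 2], dtype=np.int32)

# np.save writes the binary .npy format that np.load reads back.
out_path = os.path.join(tempfile.mkdtemp(), "ux.npy")
np.save(out_path, ux_value)

# Load the array again; dtype and shape are preserved by the .npy format.
ux_loaded = np.load(out_path)
```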

Related

Usage of spark.catalog.refreshTable(tablename) in S3

I want to write a CSV file after transforming my Spark data with a function. The Spark DataFrame obtained after the transformation looks fine, but when I try to write it to a CSV file, I get an error:
It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.
But I really don't understand how to use the spark.catalog.refreshTable(tablename) function. I tried calling it between the transformation and the file write, but it raised:
AttributeError: 'DataFrame' object has no attribute '_get_object_id'
So I don't know how to deal with it...
#Create the function to resize the images and extract the features with mobilenetV2 model
def red_dim(width, height, nChannels, data):
    #Transform image data to tensorflow compatible format
    images = []
    for i in range(height.shape[0]):
        x = np.ndarray(
            shape=(height[i], width[i], nChannels[i]),
            dtype=np.uint8,
            buffer=data[i],
            strides=(width[i] * nChannels[i], nChannels[i], 1))
        images.append(preprocess_input(x))
    #Resize images with the chosen size of the model
    images = np.array(tf.image.resize(images, [IMAGE_SIZE, IMAGE_SIZE]))
    #Load the model
    model = load_model('models')
    #Predict features for images
    preds = model.predict(images).reshape(len(width), 3 * 3 * 1280)
    #Return a pandas series with list of features for all images
    return pd.Series(list(preds))
#Transform the function to a pandas udf function
#This allows the function to be applied to the data in multiple chunks
red_dim_udf = pandas_udf(red_dim, returnType=ArrayType(DoubleType()))
#4 actions :
# apply the udf function defined just before
# cast the array of features to a string so it can be written in a csv
# select only the data that will be written in the csv
# write the data -> where the error occurs
results=df.withColumn("dim_red", red_dim_udf(col("image.width"), col("image.height"), \
                                             col("image.nChannels"), \
                                             col("image.data"))) \
          .withColumn("dim_red_string", lit(col("dim_red").cast("string"))) \
          .select("image.origin", 'dim_red_string') \
          .repartition(5).write.csv(S3dir + '/results' + today)
This is a well-known issue that occurs when the underlying source data is updated while Spark is processing it.
I would suggest checkpointing, i.e. moving or copying the data to another directory before applying your transformations.
I think I can close my question, as I found the answer.
This type of error can also occur when the S3 folders used to build your DataFrame have a space in their names: Spark doesn't handle the space character in the folder name, so it thinks the folder no longer exists...
But thanks @Constantine for your help!
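To check whether whitespace in folder names is the culprit, the input paths can be scanned before Spark reads them. This is only a sketch: the helper name and the example S3 keys are hypothetical, and how you list the real keys depends on your setup.

```python
def paths_with_whitespace(paths):
    """Return the paths that contain at least one whitespace character."""
    return [p for p in paths if any(ch.isspace() for ch in p)]


# Hypothetical S3 keys, for illustration only.
candidate_paths = [
    "s3://my-bucket/images/apple_1/img001.jpg",
    "s3://my-bucket/images/apple 2/img002.jpg",
]

# Paths listed here are candidates for renaming before building the DataFrame.
suspicious = paths_with_whitespace(candidate_paths)
```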

How to convert a HyperSpectral image or an image with many bands in TFRecord format?

I've been trying to use a hyperspectral image dataset stored in .mat files. I found that with the loadmat function from the scipy library I can load the hyperspectral images and select some bands to view them as RGB.
def RGBread(image):
    images = loadmat(image).get('new_image')
    return abs(images[:,:,(12,6,4)])
def SIread(image):
    images = loadmat(image).get('new_image')
    return abs(images[:,:,:])
While trying to implement the pix2pix architecture I ran into an unexpected error. The list of dataset file names is passed to a function responsible for loading the data (which are still .mat files). TensorFlow has no direct method for reading or decoding this format, so I load the data with my RGBread and SIread functions and then convert them into tensors.
def load_image(filename, augment=True):
    inimg = tf.cast(tf.convert_to_tensor(RGBread(ImagePATH+'/'+filename),
                                         dtype=tf.float32), tf.float32)[..., :3]
    tgimg = tf.cast(tf.convert_to_tensor(SIread(ImagePATH+'/'+filename),
                                         dtype=tf.float32), tf.float32)[..., :12]
    inimg, tgimg = resize(inimg, tgimg, IMG_HEIGH, IMG_WIDTH)
    if augment:
        inimg, tgimg = random_jitter(inimg, tgimg)
    return inimg, tgimg
When I call the load_image method with the name and path of a single .mat file (a hyperspectral image) from my dataset, it works perfectly.
plt.imshow(load_train_image(tr_urls[1])[0])
The problem started when I created my dataset tensor: my RGBread function cannot take a tensor as a parameter, since loadmat('.mat') expects a string. This produces the following error.
train_dataset = tf.data.Dataset.from_tensor_slices(tr_urls)
train_dataset = train_dataset.map(load_train_image,
                                  num_parallel_calls=tf.data.experimental.AUTOTUNE)
TypeError: expected str, bytes or os.PathLike object, not Tensor
After reading a lot about reading .mat files, I found a user who recommended converting the data to TFRecord format. I've been trying to do that but couldn't. Could someone help me?
Rasterio may be useful here.
https://rasterio.readthedocs.io/en/latest/
It can read hyperspectral .tif which can be passed to tf.data using a tf.keras data-generator. It may be a bit slow and perhaps should be done before training rather than at runtime.
An alternative is to ask whether you need the geotiff metadata. If not, you can preprocess and save as numpy arrays for tfrecords.
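One way to sidestep the TypeError is to keep the file reading in plain Python, where paths are ordinary strings, and only hand the resulting arrays to TensorFlow. The sketch below is an assumption-laden illustration, not the asker's code: make_pair_generator and fake_loader are hypothetical names, and in practice load_cube could be something like lambda p: loadmat(p).get('new_image').

```python
import numpy as np


def make_pair_generator(paths, load_cube, rgb_bands=(12, 6, 4)):
    """Build a zero-argument generator function yielding (rgb, cube) pairs.

    load_cube is any callable mapping a path string to an (H, W, bands) array.
    """
    def generator():
        for path in paths:
            # Paths are plain Python strings here, so loaders like loadmat work.
            cube = np.abs(np.asarray(load_cube(path), dtype=np.float32))
            yield cube[:, :, list(rgb_bands)], cube
    return generator


def fake_loader(path):
    """Hypothetical stand-in for a .mat reader: a 4x4 image with 16 bands."""
    return np.full((4, 4, 16), -1.0)


gen = make_pair_generator(["a.mat", "b.mat"], fake_loader)
pairs = list(gen())

# With TensorFlow installed, the generator could be wrapped roughly like this:
# dataset = tf.data.Dataset.from_generator(
#     gen, output_types=(tf.float32, tf.float32))
```

The same generator could also be run once before training to dump each pair with np.save, which matches the suggestion to preprocess ahead of time rather than at runtime.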

Keras model.get_config() returns list instead of dictionary

I am using tensorflow-gpu==1.10.0 and keras from tensorflow as tf.keras.
I am trying to use source code written by someone else to implement it on my network.
I saved my network using save_model and loaded it using load_model. When I call model.get_config(), I expect a dictionary, but I'm getting a list. The Keras documentation also says that get_config returns a dictionary (https://keras.io/models/about-keras-models/).
I checked whether the way the model is saved (save_model versus model.save) makes a difference, but both give me this error:
TypeError: list indices must be integers or slices, not str
My code block:
model_config = self.keras_model.get_config()
for layer in model_config['layers']:
    name = layer['name']
    if name in update_layers:
        layer['config']['filters'] = update_layers[name]['filters']
my pip freeze :
absl-py==0.6.1
astor==0.7.1
bitstring==3.1.5
coverage==4.5.1
cycler==0.10.0
decorator==4.3.0
Django==2.1.3
easydict==1.7
enum34==1.1.6
futures==3.1.1
gast==0.2.0
geopy==1.11.0
grpcio==1.16.1
h5py==2.7.1
image==1.5.15
ImageHash==3.7
imageio==2.5.0
imgaug==0.2.5
Keras==2.1.3
kiwisolver==1.1.0
lxml==4.1.1
Markdown==3.0.1
matplotlib==2.1.0
networkx==2.2
nose==1.3.7
numpy==1.14.1
olefile==0.46
opencv-python==3.3.0.10
pandas==0.20.3
Pillow==4.2.1
prometheus-client==0.4.2
protobuf==3.6.1
pyparsing==2.3.0
pyquaternion==0.9.2
python-dateutil==2.7.5
pytz==2018.7
PyWavelets==1.0.1
PyYAML==3.12
Rtree==0.8.3
scikit-image==0.13.1
scikit-learn==0.19.1
scipy==0.19.1
Shapely==1.6.4.post1
six==1.11.0
sk-video==1.1.8
sklearn-porter==0.6.2
tensorboard==1.10.0
tensorflow-gpu==1.10.0
termcolor==1.1.0
tqdm==4.19.4
utm==0.4.2
vtk==8.1.0
Werkzeug==0.14.1
xlrd==1.1.0
xmltodict==0.11.0
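One possible cause, which the question does not confirm, is that the loaded model is a Sequential: in older Keras and tf.keras releases, Sequential.get_config() returned a plain list of layer configs, while functional models returned a dict with a 'layers' key. Under that assumption, a small helper can accept both layouts. The helper name and the example configs below are hypothetical.

```python
def iter_layer_configs(model_config):
    """Yield (name, layer_config) pairs for dict-style and list-style configs."""
    if isinstance(model_config, dict):
        layers = model_config["layers"]
    else:
        layers = model_config
    for layer in layers:
        # List-style entries keep the layer name inside the nested config.
        name = layer.get("name", layer["config"].get("name"))
        yield name, layer["config"]


# Hand-written configs that mimic the two layouts (illustrative only).
dict_style = {
    "name": "model",
    "layers": [
        {"name": "conv1", "class_name": "Conv2D",
         "config": {"name": "conv1", "filters": 32}},
    ],
}
list_style = [
    {"class_name": "Conv2D", "config": {"name": "conv1", "filters": 32}},
]

update_layers = {"conv1": {"filters": 16}}
for config in (dict_style, list_style):
    for name, layer_config in iter_layer_configs(config):
        if name in update_layers:
            layer_config["filters"] = update_layers[name]["filters"]
```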

Feeding .npy (numpy files) into tensorflow data pipeline

TensorFlow seems to lack a reader for ".npy" files.
How can I read my data files into the new tensorflow.data.Dataset pipeline?
My data doesn't fit in memory.
Each object is saved in a separate ".npy" file. Each file contains 2 different ndarrays as features and a scalar as their label.
It is actually possible to read NPY files directly with TensorFlow instead of using TFRecords. The key pieces are tf.data.FixedLengthRecordDataset and tf.io.decode_raw, along with a look at the documentation of the NPY format. For simplicity, let's suppose you are given a float32 NPY file containing an array with shape (N, K), and that you know the number of features K beforehand, as well as the fact that it is a float32 array. An NPY file is just a binary file with a small header followed by the raw array data (object arrays are different, but we're only considering numbers here). In short, you can find the size of this header with a function like this:
def npy_header_offset(npy_path):
    with open(str(npy_path), 'rb') as f:
        # Every NPY file starts with this 6-byte magic string
        if f.read(6) != b'\x93NUMPY':
            raise ValueError('Invalid NPY file.')
        version_major, version_minor = f.read(2)
        # The header length field is 2 bytes in version 1 and 4 bytes in version 2
        if version_major == 1:
            header_len_size = 2
        elif version_major == 2:
            header_len_size = 4
        else:
            raise ValueError('Unknown NPY file version {}.{}.'.format(version_major, version_minor))
        # Decode the little-endian header length
        header_len = sum(b << (8 * i) for i, b in enumerate(f.read(header_len_size)))
        header = f.read(header_len)
        if not header.endswith(b'\n'):
            raise ValueError('Invalid NPY file.')
        # The current position is where the raw array data begins
        return f.tell()
With this you can create a dataset like this:
import tensorflow as tf
npy_file = 'my_file.npy'
num_features = ...
dtype = tf.float32
header_offset = npy_header_offset(npy_file)
dataset = tf.data.FixedLengthRecordDataset([npy_file], num_features * dtype.size, header_bytes=header_offset)
Each element of this dataset contains a long string of bytes representing a single example. You can now decode it to obtain an actual array:
dataset = dataset.map(lambda s: tf.io.decode_raw(s, dtype))
The elements will have indeterminate shape, though, because TensorFlow does not keep track of the length of the strings. You can just enforce the shape since you know the number of features:
dataset = dataset.map(lambda s: tf.reshape(tf.io.decode_raw(s, dtype), (num_features,)))
Similarly, you can choose to perform this step after batching, or combine it in whatever way you feel like.
The limitation is that you have to know the number of features in advance. It is possible to extract it from the NumPy header, though that is a bit of a pain and in any case very hard to do from within TensorFlow, so the file names would need to be known in advance. Another limitation is that, as it stands, the solution requires you to use either a single file per dataset or files that all have the same header size, although if you know that all the arrays have the same size that should actually be the case.
Admittedly, if one considers this kind of approach it may just be better to have a pure binary file without headers, and either hard code the number of features or read them from a different source...
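For the point about extracting the number of features from the header, NumPy's own numpy.lib.format module can parse it outside TensorFlow. The sketch below is one way to do that, not part of the original answer: the helper name npy_shape_dtype_offset and the example file are assumptions, and it only handles format versions 1 and 2.

```python
import os
import tempfile

import numpy as np
from numpy.lib import format as npy_format


def npy_shape_dtype_offset(npy_path):
    """Return (shape, dtype, data_offset) parsed from an NPY file header."""
    with open(npy_path, "rb") as f:
        major, _minor = npy_format.read_magic(f)
        if major == 1:
            shape, _fortran_order, dtype = npy_format.read_array_header_1_0(f)
        elif major == 2:
            shape, _fortran_order, dtype = npy_format.read_array_header_2_0(f)
        else:
            raise ValueError("Unsupported NPY version {}".format(major))
        # Once the header is consumed, the file position marks the raw data start.
        return shape, dtype, f.tell()


# Illustrative file: 5 examples with 3 float32 features each.
example = np.arange(15, dtype=np.float32).reshape(5, 3)
npy_file = os.path.join(tempfile.mkdtemp(), "my_file.npy")
np.save(npy_file, example)

shape, dtype, header_offset = npy_shape_dtype_offset(npy_file)
num_features = shape[1]
```

The returned offset plays the same role as the header_bytes value passed to tf.data.FixedLengthRecordDataset, and num_features no longer needs to be hard-coded.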
You can do it with tf.py_func, see the example here.
The parse function would simply decode the filename from bytes to string and call np.load.
Update: something like this:
def read_npy_file(item):
    data = np.load(item.decode())
    return data.astype(np.float32)
file_list = ['/foo/bar.npy', '/foo/baz.npy']
dataset = tf.data.Dataset.from_tensor_slices(file_list)
dataset = dataset.map(
    lambda item: tuple(tf.py_func(read_npy_file, [item], [tf.float32,])))
Does your data fit into memory? If so, you can follow the instructions from the Consuming NumPy Arrays section of the docs:
Consuming NumPy arrays
If all of your input data fit in memory, the simplest way to create a Dataset from them is to convert them to tf.Tensor objects and use Dataset.from_tensor_slices().
# Load the training data into two NumPy arrays, for example using `np.load()`.
with np.load("/var/data/training_data.npy") as data:
    features = data["features"]
    labels = data["labels"]
# Assume that each row of `features` corresponds to the same row as `labels`.
assert features.shape[0] == labels.shape[0]
dataset = tf.data.Dataset.from_tensor_slices((features, labels))
In the case that the file doesn't fit into memory, it seems like the only recommended approach is to first convert the npy data into a TFRecord format, and then use the TFRecord data set format, which can be streamed without fully loading into memory.
Here is a post with some instructions.
FWIW, it seems crazy to me that TFRecord cannot be instantiated with a directory name or file name(s) of npy files directly, but it appears to be a limitation of plain Tensorflow.
If you can split the single large npy file into smaller files that each roughly represent one batch for training, then you could write a custom data generator in Keras that would yield only the data needed for the current batch.
In general, if your dataset cannot fit in memory, storing it as one single large npy file makes it very hard to work with, and preferably you should reformat the data first, either as TFRecord or as multiple npy files, and then use other methods.
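The splitting idea can be sketched with NumPy alone. This example assumes the large file holds a single array whose first axis indexes the examples; np.load with mmap_mode="r" maps the file instead of reading it fully into memory. The function names and the tiny demo array are illustrative.

```python
import os
import tempfile

import numpy as np


def split_npy(src_path, out_dir, rows_per_file):
    """Split one large .npy file into smaller .npy files along axis 0."""
    big = np.load(src_path, mmap_mode="r")  # memory-mapped, not fully loaded
    part_paths = []
    for k, start in enumerate(range(0, big.shape[0], rows_per_file)):
        # Only this slice is copied into memory.
        chunk = np.ascontiguousarray(big[start:start + rows_per_file])
        part_path = os.path.join(out_dir, "part_{:05d}.npy".format(k))
        np.save(part_path, chunk)
        part_paths.append(part_path)
    return part_paths


def batch_generator(part_paths):
    """Yield one array per file, so only one part is in memory at a time."""
    for part_path in part_paths:
        yield np.load(part_path)


# Tiny demo standing in for a file that would not fit in memory.
work_dir = tempfile.mkdtemp()
source_path = os.path.join(work_dir, "big.npy")
original = np.arange(40, dtype=np.float32).reshape(10, 4)
np.save(source_path, original)

parts = split_npy(source_path, work_dir, rows_per_file=4)
batches = list(batch_generator(parts))
```

A generator like batch_generator is the kind of plain Python iterator that a custom Keras data generator or tf.data.Dataset.from_generator can consume.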
Problem setup
I had a folder with images that were being fed into an InceptionV3 model for extraction of features. This seemed to be a huge bottleneck for the entire process. As a workaround, I extracted features from each image and then stored them on disk in a .npy format.
Now I had two folders, one for the images and one for the corresponding .npy files. There was an evident problem with the loading of .npy files in the tf.data.Dataset pipeline.
Workaround
I came across TensorFlow's official tutorial on show attend and tell which had a great workaround for the problem this thread (and I) were having.
Load numpy files
First off we need to create a mapping function that accepts the .npy file name and returns the numpy array.
# Load the numpy files
def map_func(feature_path):
    feature = np.load(feature_path)
    return feature
Use the tf.numpy_function
With tf.numpy_function we can wrap any Python function and use it as a TensorFlow op. The function must accept NumPy objects (which is exactly what we want).
We create a tf.data.Dataset with the list of all the .npy filenames.
dataset = tf.data.Dataset.from_tensor_slices(feature_paths)
We then use the map function of the tf.data.Dataset API to do the rest of our task.
# Use map to load the numpy files in parallel
dataset = dataset.map(lambda item: tf.numpy_function(
              map_func, [item], tf.float16),
              num_parallel_calls=tf.data.AUTOTUNE)

Use tf.TextLineReader to read to a np.array in TensorFlow

I need to read a file in my train module into a np.array (I want to use the array as label_keys in a DNNClassifier).
I tried tf.read_file and tf.TextLineReader() but I can't get them to just output the rows to a np.array.
Is it possible?
(Why not just read the file with open? I'm training in GCS and want to get the file from storage :)
To access a file from GCS using TensorFlow, you can use the Python tf.gfile.GFile API (tf.io.gfile.GFile in TensorFlow 2.x), which acts like a regular Python file object, but allows you to use TensorFlow's filesystem connectors:
with tf.gfile.GFile("gs://...") as f:
    file_contents = f.read()
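Once the contents are in a Python string, turning the rows into a np.array needs no TensorFlow reader. The sketch below uses io.StringIO as a stand-in for the GFile object and assumes one label per line; the labels themselves are made up.

```python
import io

import numpy as np

# Stand-in for the GFile object (assumption: a text file with one label per line).
f = io.StringIO("cat\ndog\nbird\n")
file_contents = f.read()

# splitlines() drops the newline characters, giving one entry per row.
label_keys = np.array(file_contents.splitlines())
```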