How to recover pickled Keras histories? - tensorflow

Does anyone know how I can recover a list of Keras history objects that I saved to drive by pickling with the following code:
import pickle
with open("H:/hists", "wb") as fp:  # Pickling
    pickle.dump(hists, fp)
Currently I'm trying:
with open("H:/hists", "rb") as fp:  # Unpickling
    hists = pickle.load(fp)
but getting this error:
FileNotFoundError: Unsuccessful TensorSliceReader constructor: Failed to find any matching files for ram://b3edea45-d0d4-442d-ab4f-b0c43a87d19e/variables/variables
You may be trying to load on a different device from the computational device. Consider setting the `experimental_io_device` option in `tf.saved_model.LoadOptions` to the io_device such as '/job:localhost'.
I believe this is because the Python kernel in which I saved the history objects has been terminated and a new one started.
I now think that the best way to save histories is to convert them to DataFrames or numpy arrays and save those, but that's not possible now since the histories are no longer in memory. It took about 5 hours to produce the histories, so I'm hoping it's possible to recover them.
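For future runs, a minimal sketch of that idea: save only the plain metric dictionaries (metric name mapped to a list of per-epoch values) rather than the History objects, so the file holds no reference to the model. The helper names and the example dictionary are hypothetical; the dictionary stands in for `[h.history for h in hists]`. This does not recover the file that was already written.

```python
import json
import os
import tempfile

def save_history_dicts(history_dicts, path):
    """Write a list of {metric: [value per epoch]} dicts to a JSON file."""
    # Cast to plain floats: the values may be NumPy scalars, which json cannot serialize.
    plain = [
        {metric: [float(v) for v in values] for metric, values in h.items()}
        for h in history_dicts
    ]
    with open(path, "w") as fp:
        json.dump(plain, fp)

def load_history_dicts(path):
    """Read the list of metric dicts back from disk."""
    with open(path) as fp:
        return json.load(fp)

# Hypothetical stand-in for [h.history for h in hists]
example_histories = [{"loss": [0.9, 0.5], "val_loss": [1.0, 0.7]}]

path = os.path.join(tempfile.gettempdir(), "hists.json")
save_history_dicts(example_histories, path)
restored = load_history_dicts(path)
```

Because the file contains only numbers and strings, it can be read from any kernel, with or without TensorFlow installed.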

Related

reading hdf5 file from s3 to sagemaker, is the whole file transferred?

I'm reading a file from my S3 bucket in a notebook in sagemaker studio (same account) using the following code:
dataset_path_in_h5="/Mode1/SingleFault/SimulationCompleted/IDV2/Mode1_IDVInfo_2_100/Run1/processdata"
s3 = s3fs.S3FileSystem()
h5_file = h5py.File(s3.open(s3url,'rb'), 'r')
data = h5_file.get(dataset_path_in_h5)
But I don't know what actually happens behind the scenes. Is the whole h5 file transferred? That seems unlikely, since the code executes quite fast while the whole file is 20 GB. Or is only the dataset at dataset_path_in_h5 transferred?
I suppose that if the whole file is transferred at each call it could cost me a lot.
When you open the file, a file object is created. It has a tiny memory footprint. The dataset values aren't read into memory until you access them.
Calling .get() on the file returns an h5py dataset object, not a NumPy array, so the snippet in the question does not load the dataset values. It behaves like indexing the file with the dataset path, except that it returns None instead of raising an error when the path is missing. (NOTE: .get() is not deprecated, but indexing is the more common syntax and is used in the example.)
A dataset object has a small memory footprint, and the data is read into memory as you need it. Dataset objects behave like NumPy arrays. To load the entire dataset into memory as a NumPy array, index the dataset object with [()]. (Use of a dataset object vs NumPy array depends on downstream usage. Frequently you don't need an array, but sometimes they are required.) Also, if chunked I/O was enabled when the dataset was created, datasets are read in chunks.
Differences shown below. Note, I used Python's file context manager to open the file. It avoids problems if the file isn't closed properly (you forget or the program exits prematurely).
dataset_path_in_h5="/Mode1/SingleFault/SimulationCompleted/IDV2/Mode1_IDVInfo_2_100/Run1/processdata"
s3 = s3fs.S3FileSystem()
with h5py.File(s3.open(s3url,'rb'), 'r') as h5_file:
    # .get() returns a h5py dataset object (or None if the path is missing):
    data = h5_file.get(dataset_path_in_h5)
    # indexing also returns a h5py dataset object; values are read when you slice it:
    data_ds = h5_file[dataset_path_in_h5]
    # appending [()] reads the entire dataset into a NumPy array:
    data_arr = h5_file[dataset_path_in_h5][()]

How to load a large h5 file in memory?

I have a large h5 file containing a 5-dimensional NumPy array in HDFS. The file size is ~130 GB. I am facing memory issues while loading the file: the process gets killed with an OOM error even though the machine has 256 GB of RAM. How can I write the file in chunks and load it back in chunks? I looked around and found that h5py provides a method to chunk the dataset, like so, but how do I load the data back in chunks? Also, will it work if the file resides in HDFS?
dset = f.create_dataset("Images2", (100,480,640), 'f', chunks=True)
The idea is to load the file in batches, to reduce I/O time and avoid memory issues. Any help would be much appreciated.
Two similar (but different) h5py I/O concepts are mentioned in the answer and comments above:
HDF5 Chunking is used to enable chunked I/O for improved performance. Chunking may not help if you get an OOM error when you try to read a large dataset with insufficient memory.
NumPy style Slicing is used to read a slice of the data from the drive to memory (or write a slice of data to the drive). Slicing is the key to avoid OOM errors when reading very large files.
Also, when creating very large datasets, you generally need to make them resizeable. You can allocate an initial size, then use the ".resize()" method to increase the size on disk.
I wrote a simple example that shows how to use both slicing and chunking. It loads 100 images at a time into a resizeable dataset. It then closes the file and reopens (read-only) to read 100 images at a time into a NumPy array.
Effective chunking requires an appropriate chunk size/shape, which depends on your array shape and I/O needs. I set the chunk shape in my example to (10,480,640), so each 100-image block I write/read spans 10 whole chunks.
This example should get you started. You will need to modify to use a 5-d array/dataset.
import numpy as np
import h5py
with h5py.File('SO_64645940.h5','w') as h5w:
    # axis 0 is resizeable (maxshape None); the data is stored in chunks of 10 images
    img_ds = h5w.create_dataset('Images', shape=(100,480,640), dtype='f', maxshape=(None,480,640), chunks=(10,480,640))
    next_img_row = 0
    arr = np.random.random(100*480*640).reshape(100,480,640)
    for cnt in range(1,10):
        # print(cnt,img_ds.len(),next_img_row)
        if img_ds.len() == next_img_row:
            # dataset is full: grow it by another 100 images
            img_ds.resize(100*cnt, axis=0)
            print('new ds size=', img_ds.len())
        h5w['Images'][next_img_row:next_img_row+100] = arr
        next_img_row += 100
with h5py.File('SO_64645940.h5','r') as h5r:
    # the loop above wrote 9 blocks of 100 images; read back exactly that many
    for cnt in range(h5r['Images'].shape[0] // 100):
        print('get slice#', str(cnt))
        img_arr = h5r['Images'][cnt*100:(cnt+1)*100]
Chunking in HDF5 means that the data is not stored contiguously, but in chunks.
See information here: https://docs.h5py.org/en/stable/high/dataset.html#chunked-storage
So chunking by itself doesn't help with your problem.
The solution might be to write a function yourself that loads the data chunk by chunk.
For example, I did it this way to get the data in chunks:
import itertools
def get_chunked(data, chunk_size=100):
    # yield the rows of data in groups of at most chunk_size
    for i in give_chunk(len(data), chunk_size):
        chunked_array = data[i]
        yield chunked_array
def give_chunk(length, chunk_size):
    # yield lists of consecutive indices, chunk_size at a time
    it = iter(range(length))
    while True:
        chunk = list(itertools.islice(it, chunk_size))
        if not chunk:
            break
        yield chunk
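A small sketch of what these index chunks look like, with a plain Python list standing in for the h5py dataset (an assumption made so the snippet runs without a file). iter_slices is a hypothetical variant, not part of the answer above: it yields slice objects instead of index lists, so each read is a single contiguous range.

```python
import itertools

def give_chunk(length, chunk_size):
    """Yield lists of consecutive indices, chunk_size at a time."""
    it = iter(range(length))
    while True:
        chunk = list(itertools.islice(it, chunk_size))
        if not chunk:
            break
        yield chunk

def iter_slices(length, chunk_size):
    """Hypothetical variant: yield slice objects covering 0..length."""
    for start in range(0, length, chunk_size):
        yield slice(start, min(start + chunk_size, length))

# Stand-in for a dataset with 10 rows
data = list(range(10))

index_chunks = list(give_chunk(len(data), 4))
slice_chunks = [data[s] for s in iter_slices(len(data), 4)]
print(index_chunks)
print(slice_chunks)
```

With a real dataset, `data[s]` would read only the rows covered by that slice, so peak memory is bounded by the chunk size rather than the file size.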
For writing the data to HDF5 you can create the dataset first and then write the data chunk wise with slicing, see h5py documentation: https://docs.h5py.org/en/stable/high/dataset.html#reading-writing-data
I really can recommend this book for basic knowledge about HDF5: https://www.oreilly.com/library/view/python-and-hdf5/9781491944981/

Use tf.TextLineReader to read to a np.array in TensorFlow

I need to read a file in my train module into a np.array (I want to use the array as label_keys in a DNNClassifier).
I tried tf.read_file and tf.TextLineReader() but I can't get them to just output the rows to a np.array.
Is it possible?
(Why not just read the file with open? I'm training in GCS and want to get the file from storage :)
To access a file from GCS using TensorFlow, you can use the Python tf.gfile.GFile API (tf.io.gfile.GFile in TensorFlow 2.x), which acts like a regular Python file object, but allows you to use TensorFlow's filesystem connectors:
with tf.gfile.GFile("gs://...") as f:
    file_contents = f.read()
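Once the contents have been read, turning the rows into a np.array is plain Python and NumPy. A minimal sketch, assuming one label per line; lines_to_array is a hypothetical helper and the string below stands in for the file_contents read from GCS.

```python
import numpy as np

def lines_to_array(file_contents):
    """Split text into rows and return them as a 1-D NumPy array of strings."""
    rows = [line.strip() for line in file_contents.splitlines()]
    # Skip blank lines so they do not become empty labels
    return np.array([row for row in rows if row])

# Stand-in for the text returned by f.read()
file_contents = "cat\ndog\n\nbird\n"
label_keys = lines_to_array(file_contents)
print(label_keys)
```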

Google Storage (gs) wrapper file input/out for Cloud ML?

Google recently announced Cloud ML, https://cloud.google.com/ml/ and it's very useful. However, one limitation is that the input/output of a TensorFlow program should support gs://.
If we use TensorFlow APIs for all file reads/writes, it should be OK, since these APIs support gs://.
However, if we use native file IO APIs such as open, it does not work, because they don't understand gs://
For example:
with open(vocab_file, 'wb') as f:
    cPickle.dump(self.words, f)
This code won't work in Google Cloud ML.
However, modifying all native file IO APIs to tensorflow APIs or Google Storage Python APIs is really tedious. Is there any simple way to do this? Any wrappers to support google storage systems, gs:// on top of the native file IO?
As suggested in "Pickled scipy sparse matrix as input data?", perhaps we can use file_io.read_file_to_string('gs://...'), but this still requires significant code modification.
Do it like this:
from tensorflow.python.lib.io import file_io
with file_io.FileIO('gs://.....', mode='w+') as f:
    cPickle.dump(self.words, f)
Or you can read a pickle file like this:
# binary mode, which pickle needs on Python 3
file_stream = file_io.FileIO(train_file, mode='rb')
x_train, y_train, x_test, y_test = pickle.load(file_stream)
One solution is to copy all of the data to local disk when the program starts up. You can do that using gsutil inside the Python script that gets run, something like:
import os
import subprocess
vocab_file = 'vocab.pickled'
subprocess.check_call(['gsutil', '-m' , 'cp', '-r',
                       os.path.join('gs://path/to/', vocab_file), '/tmp'])
with open(os.path.join('/tmp', vocab_file), 'wb') as f:
    cPickle.dump(self.words, f)
And if you have any outputs, you can write them to local disk and gsutil rsync them. (But, be careful to handle restarts correctly, because you may be put on a different machine).
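A rough sketch of the output side, assuming gsutil is installed on the worker. build_rsync_command and sync_outputs are hypothetical helper names, and the runner parameter exists only so the command can be inspected without actually calling gsutil.

```python
import subprocess

def build_rsync_command(local_dir, gcs_dir):
    """Return the gsutil command that mirrors local_dir into gcs_dir."""
    # -m runs transfers in parallel, -r recurses into subdirectories
    return ['gsutil', '-m', 'rsync', '-r', local_dir, gcs_dir]

def sync_outputs(local_dir, gcs_dir, runner=subprocess.check_call):
    """Run the rsync command; raises if gsutil exits with an error."""
    cmd = build_rsync_command(local_dir, gcs_dir)
    runner(cmd)
    return cmd

# Example call (paths are placeholders):
# sync_outputs('/tmp/output', 'gs://path/to/output')
```

Calling this periodically, and once more at the end of the job, limits how much output is lost if the job restarts on another machine.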
The other solution is to monkey patch open (Note: untested):
import __builtin__  # Python 2; the module is named builtins in Python 3
# NB: not all modes are compatible; should handle more carefully.
# Probably should be reported on
# https://github.com/tensorflow/tensorflow/issues/4357
def new_open(name, mode='r', buffering=-1):
    return file_io.FileIO(name, mode)
__builtin__.open = new_open
Just be sure to do that before any module actually tries to read from GCS.
apache_beam has the gcsio module which can be used to return a standard Python file object to read/write GCS objects. You can use this object with any method that works with Python file objects. For example
def open_local_or_gcs(path, mode):
    """Opens the given path."""
    if path.startswith('gs://'):
        try:
            return gcsio.GcsIO().open(path, mode)
        except Exception as e:  # pylint: disable=broad-except
            # Currently we retry exactly once, to work around flaky gcs calls.
            logging.error('Retrying after exception reading gcs file: %s', e)
        time.sleep(10)
        return gcsio.GcsIO().open(path, mode)
    else:
        return open(path, mode)
with open_local_or_gcs(vocab_file, 'wb') as f:
    cPickle.dump(self.words, f)

TensorFlow: Opening log data written by SummaryWriter

After following this tutorial on summaries and TensorBoard, I've been able to successfully save and look at data with TensorBoard. Is it possible to open this data with something other than TensorBoard?
By the way, my application is to do off-policy learning. I'm currently saving each state-action-reward tuple using SummaryWriter. I know I could manually store/train on this data, but I thought it'd be nice to use TensorFlow's built in logging features to store/load this data.
As of March 2017, the EventAccumulator tool has been moved from Tensorflow core to the Tensorboard Backend. You can still use it to extract data from Tensorboard log files as follows:
from tensorboard.backend.event_processing.event_accumulator import EventAccumulator
event_acc = EventAccumulator('/path/to/summary/folder')
event_acc.Reload()
# Show all tags in the log file
print(event_acc.Tags())
# E. g. get wall clock, number of steps and value for a scalar 'Accuracy'
w_times, step_nums, vals = zip(*event_acc.Scalars('Accuracy'))
Easy, the data can actually be exported to a .csv file within TensorBoard under the Events tab, which can e.g. be loaded in a Pandas dataframe in Python. Make sure you check the Data download links box.
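If pandas is not at hand, the exported file can also be read with the standard library. A minimal sketch, assuming the export has the header row "Wall time,Step,Value" (check your own file, the column names are an assumption here); read_scalar_csv is a hypothetical helper and the in-memory sample stands in for the downloaded file.

```python
import csv
import io

def read_scalar_csv(fp):
    """Return (steps, values) from an exported scalar CSV file object."""
    steps, values = [], []
    for row in csv.DictReader(fp):
        steps.append(int(row["Step"]))
        values.append(float(row["Value"]))
    return steps, values

# Stand-in for open('run-tag-accuracy.csv', newline='')
sample = io.StringIO(
    "Wall time,Step,Value\n"
    "1500000000.0,0,0.5\n"
    "1500000060.0,1,0.75\n"
)
steps, values = read_scalar_csv(sample)
print(steps, values)
```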
For a more automated approach, check out the TensorBoard readme:
If you'd like to export data to visualize elsewhere (e.g. iPython
Notebook), that's possible too. You can directly depend on the
underlying classes that TensorBoard uses for loading data:
python/summary/event_accumulator.py (for loading data from a single
run) or python/summary/event_multiplexer.py (for loading data from
multiple runs, and keeping it organized). These classes load groups of
event files, discard data that was "orphaned" by TensorFlow crashes,
and organize the data by tag.
As another option, there is a script
(tensorboard/scripts/serialize_tensorboard.py) which will load a
logdir just like TensorBoard does, but write all of the data out to
disk as json instead of starting a server. This script is setup to
make "fake TensorBoard backends" for testing, so it is a bit rough
around the edges.
I think the data are protobufs encoded in the RecordReader (TFRecord) format. To get the serialized strings out of the files you can use py_record_reader or build a graph with the TFRecordReader op, and to deserialize those strings to protobufs use the Event schema. If you get a working example, please update this question, since we seem to be missing documentation on this.
I did something along these lines for a previous project. As mentioned by others, the main ingredient is TensorFlow's event accumulator:
from tensorflow.python.summary import event_accumulator as ea
acc = ea.EventAccumulator("folder/containing/summaries/")
acc.Reload()
# Print tags of contained entities, use these names to retrieve entities as below
print(acc.Tags())
# E. g. get all values and steps of a scalar called 'l2_loss'
xy_l2_loss = [(s.step, s.value) for s in acc.Scalars('l2_loss')]
# Retrieve images, e. g. the first one tagged 'generator/image/0'
# (Images() returns a list of image events, so take the first entry)
img = acc.Images('generator/image/0')[0]
with open('img_{}.png'.format(img.step), 'wb') as f:
    f.write(img.encoded_image_string)
You can also use tf.train.summary_iterator. To extract events from a ./logs folder where only the classic scalars lr, acc, loss, val_acc and val_loss are present, you can use this gist: tensorboard_to_csv.py
Chris Cundy's answer works well when you have fewer than 10000 data points in your tfevent file. However, when you have a large file with over 10000 data points, TensorBoard will automatically sample them and give you at most 10000 points. It is quite an annoying underlying behavior, as it is not well documented. See https://github.com/tensorflow/tensorboard/blob/master/tensorboard/backend/event_processing/event_accumulator.py#L186.
To get around it and get all data points, a slightly hacky way is:
from tensorboard.backend.event_processing.event_accumulator import EventAccumulator
class FalseDict(object):
    def __getitem__(self,key):
        return 0
    def __contains__(self, key):
        return True
event_acc = EventAccumulator('path/to/your/tfevents',size_guidance=FalseDict())
It looks like for tb version >=2.3 you can streamline the process of converting your tb events to a pandas dataframe using tensorboard.data.experimental.ExperimentFromDev().
It requires you to upload your logs to TensorBoard.dev, though, which is public. There are plans to expand the capability to locally stored logs in the future.
https://www.tensorflow.org/tensorboard/dataframe_api
You can also use the EventFileLoader to iterate through a tensorboard file
from tensorboard.backend.event_processing.event_file_loader import EventFileLoader
for event in EventFileLoader('path/to/events.out.tfevents.xxx').Load():
    print(event)
Surprisingly, the Python package tbparse has not been mentioned yet.
From documentation:
Installation:
pip install tensorflow # or tensorflow-cpu
pip install -U tbparse # requires Python >= 3.7
Note: If you don't want to install TensorFlow, see Installing without TensorFlow.
We suggest using an additional virtual environment for parsing and plotting the tensorboard events. So no worries if your training code uses Python 3.6 or older versions.
Reading one or more event files with tbparse only requires 5 lines of code:
from tbparse import SummaryReader
log_dir = "<PATH_TO_EVENT_FILE_OR_DIRECTORY>"
reader = SummaryReader(log_dir)
df = reader.scalars
print(df)