Google Storage (gs://) wrapper for file input/output for Cloud ML? - tensorflow

Google recently announced Cloud ML (https://cloud.google.com/ml/), and it's very useful. However, one limitation is that the input/output of a TensorFlow program has to support gs://.
If we use TensorFlow APIs for all file reads/writes, it should be OK, since these APIs support gs://.
However, if we use native file IO APIs such as open, they do not work, because they don't understand gs://.
For example:
with open(vocab_file, 'wb') as f:
    cPickle.dump(self.words, f)
This code won't work in Google Cloud ML.
However, replacing every native file IO call with TensorFlow APIs or the Google Storage Python APIs is really tedious. Is there any simple way to do this? Is there any wrapper that supports Google Storage (gs://) on top of native file IO?
As suggested in Pickled scipy sparse matrix as input data?, perhaps we can use file_io.read_file_to_string('gs://...'), but this still requires significant code modification.
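For reference, a rough sketch of what that read_file_to_string route looks like (the bucket path and filename here are hypothetical):
from tensorflow.python.lib.io import file_io
import cPickle

# Read the whole object from GCS into memory, then unpickle from the string.
vocab_bytes = file_io.read_file_to_string('gs://my-bucket/vocab.pickled')
words = cPickle.loads(vocab_bytes)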

Do it like this:
from tensorflow.python.lib.io import file_io
with file_io.FileIO('gs://.....', mode='w+') as f:
    cPickle.dump(self.words, f)
Or you can read a pickle file in like this:
file_stream = file_io.FileIO(train_file, mode='r')
x_train, y_train, x_test, y_test = pickle.load(file_stream)

One solution is to copy all of the data to local disk when the program starts up. You can do that using gsutil inside the Python script that gets run, something like:
vocab_file = 'vocab.pickled'
subprocess.check_call(['gsutil', '-m', 'cp', '-r',
                       os.path.join('gs://path/to/', vocab_file), '/tmp'])
with open(os.path.join('/tmp', vocab_file), 'wb') as f:
    cPickle.dump(self.words, f)
And if you have any outputs, you can write them to local disk and gsutil rsync them back to GCS. (But be careful to handle restarts correctly, because you may be put on a different machine.)
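For the output direction, a minimal sketch of that rsync step (the local and GCS paths are hypothetical):
import subprocess

# Mirror everything written under /tmp/outputs back to the job's GCS output path.
subprocess.check_call(['gsutil', '-m', 'rsync', '-r',
                       '/tmp/outputs', 'gs://path/to/outputs'])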
The other solution is to monkey patch open (Note: untested):
import __builtin__

from tensorflow.python.lib.io import file_io

# NB: not all modes are compatible; should handle more carefully.
# Probably should be reported on
# https://github.com/tensorflow/tensorflow/issues/4357
def new_open(name, mode='r', buffering=-1):
    return file_io.FileIO(name, mode)

__builtin__.open = new_open
Just be sure to do that before any module actually tries to read from GCS.
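If you only want to intercept gs:// paths and leave local files on the real built-in open, a slightly more careful (still untested) variant of the same patch might look like this:
import __builtin__
from tensorflow.python.lib.io import file_io

_real_open = __builtin__.open

def gcs_aware_open(name, mode='r', buffering=-1):
    # Route only gs:// paths through file_io; everything else uses the real open.
    if name.startswith('gs://'):
        return file_io.FileIO(name, mode)
    return _real_open(name, mode, buffering)

__builtin__.open = gcs_aware_open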

apache_beam has a gcsio module which can be used to get a standard Python file object for reading/writing GCS objects. You can use this object with any method that works with Python file objects. For example:
import logging
import time

# gcsio lives under apache_beam.io.gcp in current Beam releases.
from apache_beam.io.gcp import gcsio


def open_local_or_gcs(path, mode):
    """Opens the given path."""
    if path.startswith('gs://'):
        try:
            return gcsio.GcsIO().open(path, mode)
        except Exception as e:  # pylint: disable=broad-except
            # Currently we retry exactly once, to work around flaky gcs calls.
            logging.error('Retrying after exception reading gcs file: %s', e)
            time.sleep(10)
            return gcsio.GcsIO().open(path, mode)
    else:
        return open(path, mode)

with open_local_or_gcs(vocab_file, 'wb') as f:
    cPickle.dump(self.words, f)

Related

How to recover pickled Keras histories?

Does anyone know how I can recover a list of Keras history objects that I saved to drive by pickling with the following code:
import pickle
with open("H:/hists", "wb") as fp: #Pickling
pickle.dump(hists, fp)
Currently I'm trying :
with open("H:/hists", "rb") as fp: # Unpickling
hists = pickle.load(fp)
but getting this error:
FileNotFoundError: Unsuccessful TensorSliceReader constructor: Failed to find any matching files for ram://b3edea45-d0d4-442d-ab4f-b0c43a87d19e/variables/variables
You may be trying to load on a different device from the computational device. Consider setting the `experimental_io_device` option in `tf.saved_model.LoadOptions` to the io_device such as '/job:localhost'.
Which I believe is because the python kernel in which I saved the history object has been terminated and a new one started.
I now think that the best way to save histories is to convert them to DataFrames or numpy arrays and save those, but that's not possible now since the histories are no longer in memory. It took about 5 hours to produce the histories, so I'm hoping it's possible to recover them.
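For future runs, a small sketch of the dict/DataFrame approach mentioned above; pickling just history.history should avoid dragging the model reference (and hence the SavedModel machinery behind the error above) into the pickle:
import pickle
import pandas as pd

# hists is a list of Keras History objects; keep only the plain metric dicts.
hist_dicts = [h.history for h in hists]
with open("H:/hist_dicts.pkl", "wb") as fp:
    pickle.dump(hist_dicts, fp)

# Or convert each to a DataFrame and save as CSV:
for i, h in enumerate(hist_dicts):
    pd.DataFrame(h).to_csv("H:/hist_{}.csv".format(i), index=False)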

Stream private data to Google Colab TPUs from GCS

So I'm trying to make a photo classifier with 150 classes, and I'm trying to run it on Google Colab TPUs. I understood that I need a TFDS dataset with try_gcs = True for that, and that I therefore need to put the dataset on Google Cloud Storage. So I converted a generator to a tf.data dataset and stored it locally using:
my_tf_ds = tf.data.Dataset.from_generator(datafeeder.allGenerator,
    output_signature=(
        tf.TensorSpec(shape=(64,64,3), dtype=tf.float32),
        tf.TensorSpec(shape=(150), dtype=tf.float32)))
tf.data.experimental.save(my_tf_ds, filename)
Then I sent it to my bucket on GCS.
But when I try to load it from my bucket with
import tensorflow_datasets as tfds
dsFromGcs = tfds.load("pokemons",data_dir = "gs://dataset-7000")
It doesn't work and instead just lists the available datasets, like:
- abstract_reasoning
- accentdb
- aeslc
- aflw2k3d
- ag_news_subset
- ai2_arc
- ai2_arc_with_ir
- amazon_us_reviews
- anli
- arc
that are not on my GCS bucket.
When I load it myself from a local file:
tfds_from_file = tf.data.experimental.load(filename, element_spec=(
    tf.TensorSpec(shape=(64,64,3), dtype=tf.float32),
    tf.TensorSpec(shape=(150), dtype=tf.float32)))
it works, the dataset is fine.
So I don't understand why I can't read it from GCS. Can we read private datasets on GCS, or only the already-defined ones? I also granted the Storage Legacy Bucket Reader role on my bucket to the public.
I think the data_dir argument to tfds.load is where the module will store things locally on your device, and try_gcs controls whether to stream the data or not. So data_dir cannot be used to point the module at your own GCS bucket.
Here are some ideas you could try:
You could try these steps to add your dataset to TFDS, and then you should be able to load it using tfds.load.
You could get the dataset in the right format using tf.data.experimental.save (which I think you've already done), save it to GCS, and then load it using tf.data.experimental.load, which you said already works for you locally (a sketch of this follows below). You could also follow these steps to install gcsfuse and use it to download your dataset to Colab from GCS.
You could try TFRecords to load your dataset. Here is a codelab with an explanation, and here is a Colab example that's linked in the codelab.
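A rough sketch of the second idea above, writing the saved dataset directly to the gs:// path and loading it back with the same element_spec (TensorFlow's filesystem layer handles gs:// natively, so this should work from Colab as long as the runtime is authenticated for the bucket):
import tensorflow as tf

gcs_path = "gs://dataset-7000/pokemons"  # your bucket from the question

# Write the dataset straight to GCS instead of to a local file.
tf.data.experimental.save(my_tf_ds, gcs_path)

# Load it back using the same element_spec you used locally.
restored_ds = tf.data.experimental.load(gcs_path, element_spec=(
    tf.TensorSpec(shape=(64,64,3), dtype=tf.float32),
    tf.TensorSpec(shape=(150), dtype=tf.float32)))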

Writing preprocessed output CSV to S3 from Scikit Learn image on Sagemaker

My problem: writing out a CSV file to S3 from inside a Sagemaker SKLearn image. I know how to write CSVs to S3 from a notebook - that is working fine. It's within the docker image that I'm unable to get it to work.
This is a preprocessing.py script called as an entry_point parameter to the SKLearn estimator. The purpose is to pre-process the data prior to running an inference. It's the first step in my inference pipeline.
Everything is working as expected in my preprocessing script, except outputting the file at the end.
Attempt #1 - this produces a CSV file that has strange binary-looking data at the beginning and end of the file (before the first cell and after the last cell of the CSV). It's almost a valid CSV but not quite. See the image at the end.
def write_joblib(file, path):
    s3_bucket, s3_key = path.split('/')[2], path.split('/')[3:]
    s3_key = '/'.join(s3_key)
    with BytesIO() as f:
        joblib.dump(file, f)
        f.seek(0)
        boto3.client("s3").upload_fileobj(Bucket=s3_bucket, Key=s3_key, Fileobj=f)

predictors_csv = predictors_df.to_csv(index=False)
write_joblib(predictors_csv, predictors_s3_uri)
Attempt #2 - I used StringIO rather than BytesIO. However, this produced a zero-byte file on S3.
Attempt #3 - I tried boto3.client('s3').put_object(...) but got ClientError: An error occurred (AccessDenied) when calling the PutObject operation: Access Denied
I believe I am almost there with Attempt #1 above. I assume it's an encoding issue. If I can fix the structure of the CSV file to remove the non-text characters at the start, it will be working. A screenshot of the CSV in Notepad++ is below.
Notice the non-character text at the start of the CSV file below
I solved this myself. This works within an SKLearn estimator container. I assume it will work inside any built-in fit/transform container for writing a CSV to S3 directly.
The use case is writing out the results of pre-processing for model featurization, vectorization, dimensionality reduction etc. This would occur prior to model inference as the first step in an Inference Pipeline.
import boto3

def write_text_file_to_s3(file_string, path):
    s3_bucket, s3_key = path.split('/')[2], path.split('/')[3:]
    s3_key = '/'.join(s3_key)
    s3 = boto3.resource('s3')
    s3.Object(s3_bucket, s3_key).put(Body=file_string)

predictors_csv = predictors_df.to_csv(index=False, encoding='utf-8-sig')
write_text_file_to_s3(predictors_csv, predictors_s3_uri)
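For what it's worth, attempt #1 probably failed because joblib.dump wraps the CSV string in its own binary container; an untested variant that keeps the upload_fileobj route but uploads the encoded CSV bytes directly would look something like this:
from io import BytesIO
import boto3

def write_csv_fileobj_to_s3(csv_string, path):
    # Same bucket/key parsing as above, but upload the raw CSV bytes
    # instead of a joblib-serialized object.
    s3_bucket, s3_key = path.split('/')[2], '/'.join(path.split('/')[3:])
    with BytesIO(csv_string.encode('utf-8')) as f:
        boto3.client('s3').upload_fileobj(Fileobj=f, Bucket=s3_bucket, Key=s3_key)

write_csv_fileobj_to_s3(predictors_csv, predictors_s3_uri)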

Use tf.TextLineReader to read into a np.array in TensorFlow

I need to read a file in my train module into an np.array (I want to use the array as label_keys in a DNNClassifier).
I tried tf.read_file and tf.TextLineReader(), but I can't get them to just output the rows into an np.array.
Is it possible?
(Why not just read the file with open? I'm training in GCS and want to get the file from storage :)
To access a file from GCS using TensorFlow, you can use the Python tf.gfile.GFile API, which acts like a regular Python file object, but allows you to use TensorFlow's filesystem connectors:
with tf.gfile.GFile("gs://...") as f:
file_contents = f.read()
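From there, a minimal sketch of getting the rows into an np.array, assuming one label key per line (the path is hypothetical):
import numpy as np
import tensorflow as tf

with tf.gfile.GFile("gs://my-bucket/label_keys.txt") as f:
    label_keys = np.array(f.read().splitlines())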

TensorFlow: Opening log data written by SummaryWriter

After following this tutorial on summaries and TensorBoard, I've been able to successfully save and look at data with TensorBoard. Is it possible to open this data with something other than TensorBoard?
By the way, my application is to do off-policy learning. I'm currently saving each state-action-reward tuple using SummaryWriter. I know I could manually store/train on this data, but I thought it'd be nice to use TensorFlow's built in logging features to store/load this data.
As of March 2017, the EventAccumulator tool has been moved from TensorFlow core to the TensorBoard backend. You can still use it to extract data from TensorBoard log files as follows:
from tensorboard.backend.event_processing.event_accumulator import EventAccumulator
event_acc = EventAccumulator('/path/to/summary/folder')
event_acc.Reload()
# Show all tags in the log file
print(event_acc.Tags())
# E. g. get wall clock, number of steps and value for a scalar 'Accuracy'
w_times, step_nums, vals = zip(*event_acc.Scalars('Accuracy'))
Easy, the data can actually be exported to a .csv file within TensorBoard under the Events tab, which can e.g. be loaded in a Pandas dataframe in Python. Make sure you check the Data download links box.
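Once downloaded, the exported file is a plain CSV, so loading it is just pandas (the filename is whatever the download link produced; the columns are typically Wall time, Step and Value):
import pandas as pd

df = pd.read_csv("run_train-tag-Accuracy.csv")
print(df.head())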
For a more automated approach, check out the TensorBoard readme:
If you'd like to export data to visualize elsewhere (e.g. iPython
Notebook), that's possible too. You can directly depend on the
underlying classes that TensorBoard uses for loading data:
python/summary/event_accumulator.py (for loading data from a single
run) or python/summary/event_multiplexer.py (for loading data from
multiple runs, and keeping it organized). These classes load groups of
event files, discard data that was "orphaned" by TensorFlow crashes,
and organize the data by tag.
As another option, there is a script
(tensorboard/scripts/serialize_tensorboard.py) which will load a
logdir just like TensorBoard does, but write all of the data out to
disk as json instead of starting a server. This script is setup to
make "fake TensorBoard backends" for testing, so it is a bit rough
around the edges.
I think the data are encoded protobufs in RecordReader format. To get serialized strings out of the files you can use py_record_reader or build a graph with the TFRecordReader op, and to deserialize those strings to protobuf use the Event schema. If you get a working example, please update this question, since we seem to be missing documentation on this.
I did something along these lines for a previous project. As mentioned by others, the main ingredient is TensorFlow's event accumulator:
from tensorflow.python.summary import event_accumulator as ea
acc = ea.EventAccumulator("folder/containing/summaries/")
acc.Reload()
# Print tags of contained entities, use these names to retrieve entities as below
print(acc.Tags())
# E. g. get all values and steps of a scalar called 'l2_loss'
xy_l2_loss = [(s.step, s.value) for s in acc.Scalars('l2_loss')]
# Retrieve images, e. g. first labeled as 'generator'
img = acc.Images('generator/image/0')
with open('img_{}.png'.format(img.step), 'wb') as f:
    f.write(img.encoded_image_string)
You can also use tf.train.summary_iterator: to extract events from a ./logs folder where only classic scalars lr, acc, loss, val_acc and val_loss are present, you can use this GIST: tensorboard_to_csv.py
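If you prefer to stay at the raw proto level, a short sketch with the summary iterator (TF 1.x name shown; in TF 2.x it lives under tf.compat.v1.train.summary_iterator):
import tensorflow as tf

# Walk every Event proto in one event file and print the scalar summaries.
for event in tf.train.summary_iterator('./logs/events.out.tfevents.xxx'):
    for value in event.summary.value:
        if value.HasField('simple_value'):
            print(event.step, value.tag, value.simple_value)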
Chris Cundy's answer works well when you have fewer than 10000 data points in your tfevent file. However, when you have a large file with over 10000 data points, TensorBoard will automatically sample them and only give you at most 10000 points. This is a quite annoying underlying behavior, as it is not well documented. See https://github.com/tensorflow/tensorboard/blob/master/tensorboard/backend/event_processing/event_accumulator.py#L186.
To get around it and retrieve all data points, a slightly hacky way is:
from tensorboard.backend.event_processing.event_accumulator import EventAccumulator

class FalseDict(object):
    def __getitem__(self, key):
        return 0
    def __contains__(self, key):
        return True

event_acc = EventAccumulator('path/to/your/tfevents', size_guidance=FalseDict())
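A less hacky alternative, if I read the linked source correctly, is to pass an explicit size_guidance where a value of 0 means "keep everything" for that category:
from tensorboard.backend.event_processing.event_accumulator import (
    EventAccumulator, SCALARS)

# Keep all scalar points; other categories fall back to their defaults.
event_acc = EventAccumulator('path/to/your/tfevents', size_guidance={SCALARS: 0})
event_acc.Reload()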
It looks like for TensorBoard version >= 2.3 you can streamline the process of converting your TensorBoard events to a pandas DataFrame using tensorboard.data.experimental.ExperimentFromDev().
It requires you to upload your logs to TensorBoard.dev, though, which is public. There are plans to expand the capability to locally stored logs in the future.
https://www.tensorflow.org/tensorboard/dataframe_api
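The usage from that page boils down to a few lines (the experiment id is whatever `tensorboard dev upload` printed for your run):
import tensorboard as tb

experiment = tb.data.experimental.ExperimentFromDev("YOUR_EXPERIMENT_ID")
df = experiment.get_scalars()  # one pandas DataFrame with run, tag, step, value
print(df.head())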
You can also use the EventFileLoader to iterate through a TensorBoard event file:
from tensorboard.backend.event_processing.event_file_loader import EventFileLoader

for event in EventFileLoader('path/to/events.out.tfevents.xxx').Load():
    print(event)
Surprisingly, the Python package tbparse has not been mentioned yet.
From documentation:
Installation:
pip install tensorflow  # or tensorflow-cpu
pip install -U tbparse  # requires Python >= 3.7
Note: If you don't want to install TensorFlow, see Installing without TensorFlow.
We suggest using an additional virtual environment for parsing and plotting the tensorboard events. So no worries if your training code uses Python 3.6 or older versions.
Reading one or more event files with tbparse only requires 5 lines of code:
from tbparse import SummaryReader
log_dir = "<PATH_TO_EVENT_FILE_OR_DIRECTORY>"
reader = SummaryReader(log_dir)
df = reader.scalars
print(df)