set seed for entire model training/testing - tensorflow

I have code using TensorFlow v1 that I'd like to migrate to native TensorFlow 2.
The code creates random objects (using numpy.random or random), a neural network (Keras weight initialization, etc.), and calls other TensorFlow random functions. At the end, it makes predictions on a random test set and outputs the loss/accuracy of the model.
For this task, I keep the original code and a copy of it, and I am changing the copy part by part. I want to make sure the behaviour stays the same, so I want to fix the randomness and monitor whether the loss/accuracy changes.
However, even after setting the seeds of the various random modules in my original file, launching it multiple times still gives different loss/accuracy values.
Here are my imports:
import time
import random
import my_file as mf  # file in directory scope
import numpy as np
import copy
import os
from matplotlib import pyplot as plt
import tensorflow.compat.v1 as tf
and I'm setting the seeds at the beginning like this:
tf.set_random_seed(42)
random.seed(42)
np.random.seed(42)
My module my_file uses the random library, and I'm also setting the seed there.
I understand from the docs that tf.set_random_seed only sets the global seed and that each random operation in TensorFlow also uses its own seed, resulting in different behaviour for consecutive calls. For example, if I call the training/testing cell 3 times, I get the consecutive loss values L1 -> L2 -> L3.
However, this should still result in the same behaviour if I restart the environment, so why isn't that the case? If I restart the kernel and execute the cell 3 times, I get L1' =/= L1 -> L2' =/= L2 -> L3' =/= L3.
What else should I verify to make sure the behaviour is the same every time I restart the notebook kernel?
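The expected behaviour (consecutive draws differ, but the whole sequence repeats after a restart) can be illustrated without TensorFlow. This is a minimal sketch using only Python's standard random module; run_three_times is a hypothetical helper standing in for three executions of the training cell, and the kernel restart is simulated by creating a freshly seeded generator:

```python
import random

def run_three_times(seed):
    """Simulate a fresh kernel: seed once, then draw three consecutive values."""
    rng = random.Random(seed)  # private generator, so no global state leaks in
    return [rng.random() for _ in range(3)]

first_session = run_three_times(42)
second_session = run_three_times(42)

# Within one session the three values differ (L1 -> L2 -> L3) ...
print(len(set(first_session)) == 3)
# ... but a "restarted" session reproduces exactly the same sequence.
print(first_session == second_session)
```

If a restarted session does not reproduce the sequence, some source of randomness is not covered by the seeds that were set, which is the situation described in the question.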

Related

Why does keras (SGD) optimizer.minimize() not reach global minimum in this example?

I'm in the process of completing a TensorFlow tutorial via DataCamp and am transcribing/replicating the code examples I am working through in my own Jupyter notebook.
Here are the original instructions from the coding problem :
I'm running the following snippet of code and am not able to arrive at the same result that I am generating within the tutorial, which I have confirmed are the correct values via a connected scatterplot of x vs. loss_function(x) as seen a bit further below.
# imports
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
from tensorflow import Variable, keras
def loss_function(x):
    import math
    return 4.0*math.cos(x-1)+np.divide(math.cos(2.0*math.pi*x),x)
# Initialize x_1 and x_2
x_1 = Variable(6.0, np.float32)
x_2 = Variable(0.3, np.float32)
# Define the optimization operation
opt = keras.optimizers.SGD(learning_rate=0.01)
for j in range(100):
    # Perform minimization using the loss function and x_1
    opt.minimize(lambda: loss_function(x_1), var_list=[x_1])
    # Perform minimization using the loss function and x_2
    opt.minimize(lambda: loss_function(x_2), var_list=[x_2])
# Print x_1 and x_2 as numpy arrays
print(x_1.numpy(), x_2.numpy())
I drew a quick connected scatterplot to confirm (successfully) that the loss function I am using reproduces the graph provided by the example (seen in the screenshot above).
# Generate loss_function(x) values for given range of x-values
losses = []
for p in np.linspace(0.1, 6.0, 60):
    losses.append(loss_function(p))
# Define x,y coordinates
x_coordinates = list(np.linspace(0.1, 6.0, 60))
y_coordinates = losses
# Plot
plt.scatter(x_coordinates, y_coordinates)
plt.plot(x_coordinates, y_coordinates)
plt.title('Plot of Input values (x) vs. Losses')
plt.xlabel('x')
plt.ylabel('loss_function(x)')
plt.show()
Here are the resulting global and local minima, respectively, as per the DataCamp environment :
4.38 is the correct global minimum, and 0.42 indeed corresponds to the first local minimum on the graph's RHS (when starting from x_2 = 0.3).
And here are the results from my environment, both of which move opposite the direction that they should be moving towards when seeking to minimize the loss value:
I've spent the better part of the last 90 minutes trying to sort out why my results disagree with those of the DataCamp console / why the optimizer fails to minimize this loss for this simple toy example...?
I appreciate any suggestions that you might have after you've run the provided code in your own environments, many thanks in advance!!!
As it turned out, the difference in outputs arose from the default precision of tf.divide() (vs. np.divide()) and tf.cos() (vs. math.cos()), operations which were specified in my transcribed, "custom" definition of loss_function().
The loss_function() had been predefined in the body of the tutorial, and when I "inspected" it using the inspect package (using inspect.getsourcelines(loss_function)) in order to redefine it in my own environment, the output of said inspection didn't clearly indicate that tf.divide and tf.cos had been used instead of their NumPy/math counterparts (which my version of the code had used).
The actual difference is quite small, but is apparently sufficient to push the optimizer in the opposite direction (away from the two respective minima).
After swapping in tf.divide() and tf.cos() (as seen below), I was able to arrive at the same results as seen in the DC console.
Here is the code for the loss_function that reproduces the results seen in the console (screenshot):
def loss_function(x):
    import math
    return 4.0*tf.cos(x-1)+tf.divide(tf.cos(2.0*math.pi*x),x)
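As a cross-check that does not depend on TensorFlow, the same 100 steps of gradient descent with a learning rate of 0.01 can be run in plain Python. This is only a sketch: loss_gradient is a derivative worked out by hand for this example (an assumption, not code from the tutorial), and gradient_descent is a hypothetical helper rather than the Keras optimizer.

```python
import math

def loss_function(x):
    return 4.0 * math.cos(x - 1) + math.cos(2.0 * math.pi * x) / x

def loss_gradient(x):
    # Hand-derived derivative of loss_function (assumption of this sketch).
    ripple = (-2.0 * math.pi * math.sin(2.0 * math.pi * x) * x
              - math.cos(2.0 * math.pi * x)) / (x * x)
    return -4.0 * math.sin(x - 1) + ripple

def gradient_descent(x, learning_rate=0.01, steps=100):
    # Plain gradient descent: step against the gradient at every iteration.
    for _ in range(steps):
        x -= learning_rate * loss_gradient(x)
    return x

x_1 = gradient_descent(6.0)
x_2 = gradient_descent(0.3)
print(round(x_1, 2), round(x_2, 2))
```

Both starting points should settle close to the values quoted from the DataCamp console, which makes it easier to tell whether a discrepancy comes from the loss definition or from the optimizer call.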

Is it possible to use pyspark to speed up regression analysis on each column of a very large size of an array?

I have a very large array and want to run a linear regression on each of its columns. To speed up the calculation, I created a list with each column of the array as an element. I then used pyspark to create an RDD and applied a function to it. I ran into memory problems when creating that RDD (i.e. during parallelization).
I have tried to increase spark.driver.memory to 50g in spark-defaults.conf, but the program still seems to hang.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error
from pyspark import SparkContext
sc = SparkContext("local", "get Linear Coefficients")
def getLinearCoefficients(column):
    y=column[~np.isnan(column)] # Extract the non-NaN values of the column
    x=np.where(~np.isnan(column))[0]+1 # Extract the corresponding indices plus 1
    # Only fit the linear regression when at least 3 data pairs exist.
    if y.shape[0]>=3:
        model=LinearRegression(fit_intercept=True) # Initialize the linear regression model
        model.fit(x[:,np.newaxis],y) # Fit the model using the data
        n=y.shape[0]
        slope=model.coef_[0]
        intercept=model.intercept_
        r2=r2_score(y,model.predict(x[:,np.newaxis]))
        rmse=np.sqrt(mean_squared_error(y,model.predict(x[:,np.newaxis])))
    else:
        n,slope,intercept,r2,rmse=np.nan,np.nan,np.nan,np.nan,np.nan
    return n,slope,intercept,r2,rmse
random_array=np.random.rand(300,2000*2000) # Here we use a random array without missing data for testing purpose.
columns=[col for col in random_array.T]
columnsRDD=sc.parallelize(columns)
columnsLinearRDD=columnsRDD.map(getLinearCoefficients)
n=np.array([e[0] for e in columnsLinearRDD.collect()])
slope=np.array([e[1] for e in columnsLinearRDD.collect()])
intercept=np.array([e[2] for e in columnsLinearRDD.collect()])
r2=np.array([e[3] for e in columnsLinearRDD.collect()])
rmse=np.array([e[4] for e in columnsLinearRDD.collect()])
The program stalled with the following output.
Exception in thread "dispatcher-event-loop-0" java.lang.OutOfMemoryError
at java.io.ByteArrayOutputStream.hugeCapacity(ByteArrayOutputStream.java:123)
at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:117)
at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
at org.apache.spark.util.ByteBufferOutputStream.write(ByteBufferOutputStream.scala:41)
at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1877)
at java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1786)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1189)
at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:43)
at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
at org.apache.spark.scheduler.TaskSetManager$$anonfun$resourceOffer$1.apply(TaskSetManager.scala:486)
at org.apache.spark.scheduler.TaskSetManager$$anonfun$resourceOffer$1.apply(TaskSetManager.scala:467)
at scala.Option.map(Option.scala:146)
at org.apache.spark.scheduler.TaskSetManager.resourceOffer(TaskSetManager.scala:467)
at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$org$apache$spark$scheduler$TaskSchedulerImpl$$resourceOfferSingleTaskSet$1.apply$mcVI$sp(TaskSchedulerImpl.scala:315)
at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:160)
at org.apache.spark.scheduler.TaskSchedulerImpl.org$apache$spark$scheduler$TaskSchedulerImpl$$resourceOfferSingleTaskSet(TaskSchedulerImpl.scala:310)
at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$4$$anonfun$apply$11.apply(TaskSchedulerImpl.scala:412)
at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$4$$anonfun$apply$11.apply(TaskSchedulerImpl.scala:409)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$4.apply(TaskSchedulerImpl.scala:409)
at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$4.apply(TaskSchedulerImpl.scala:396)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.TaskSchedulerImpl.resourceOffers(TaskSchedulerImpl.scala:396)
at org.apache.spark.scheduler.local.LocalEndpoint.reviveOffers(LocalSchedulerBackend.scala:86)
at org.apache.spark.scheduler.local.LocalEndpoint$$anonfun$receive$1.applyOrElse(LocalSchedulerBackend.scala:64)
at org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:117)
at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:205)
at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:101)
at org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:221)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
I guess it is possible to use pyspark to speed up the calculation, but how can I do it? By modifying other parameters in spark-defaults.conf? Or by vectorizing each column of the array (I know the range() function in Python 3 works that way and it is really faster)?
That is not going to work that way. You are basically doing three things:
1. you are using an RDD for parallelization,
2. you are calling your getLinearCoefficients() function, and finally
3. you call collect() on it to use your existing code.
There is nothing wrong with the first point, but there is a huge mistake in the second and third steps. Your getLinearCoefficients() function does not benefit from pyspark, as you use numpy and sklearn (have a look at this post for a better explanation). For most of the functions you are using, there is a pyspark equivalent.
The problem with the third step is the collect() function. When you call collect(), pyspark brings all the rows of the RDD to the driver and executes the sklearn functions there. Therefore you only get the parallelization that sklearn allows. Using pyspark the way you currently do is completely pointless and maybe even a drawback. Pyspark is not a framework that lets you run arbitrary Python code in parallel. When you want to execute your code in parallel with pyspark, you have to use the pyspark functions.
So what can you do?
First of all, you could use the n_jobs parameter of the LinearRegression class to use more than one core for your calculation. This at least allows you to use all cores of one machine.
Another thing you could do is step away from sklearn and use the LinearRegression of pyspark (have a look at the guide and the API). With this you can use a whole cluster for your linear regression.
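Whichever framework ends up running it, the per-column work here is a regression against a single predictor (the row index), which has a closed form. The following is a minimal sketch in plain Python, assuming ordinary least squares with an intercept is all that is needed; linear_coefficients is a hypothetical stand-in for getLinearCoefficients, not a function from sklearn or pyspark.

```python
import math

def linear_coefficients(column):
    """Closed-form simple linear regression of a column against its 1-based row index."""
    # Keep only the non-NaN entries, paired with their 1-based positions.
    pairs = [(i + 1, y) for i, y in enumerate(column) if not math.isnan(y)]
    n = len(pairs)
    if n < 3:
        return (math.nan,) * 5
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in pairs)
    ss_tot = sum((y - mean_y) ** 2 for _, y in pairs)
    r2 = 1.0 - ss_res / ss_tot if ss_tot > 0 else math.nan
    rmse = math.sqrt(ss_res / n)
    return n, slope, intercept, r2, rmse

print(linear_coefficients([1.0, 3.0, 5.0, math.nan, 9.0]))
```

Because the formula only needs sums over each column, the same arithmetic could also be expressed with array operations over all columns at once instead of one model fit per column.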
For large datasets with more than 100k samples, using LinearRegression is discouraged. General advice is to use the SGDRegressor and set the parameters correctly, so that OLS loss is being used:
from sklearn.linear_model import SGDRegressor
And replace your LinearRegression with:
model = SGDRegressor(loss='squared_loss', penalty='none', fit_intercept=True)
Setting loss='squared_loss' and penalty='none' sets the SGDRegressor to use OLS and no regularization, thus it should produce results similar to LinearRegression.
Try out some options like learning_rate and eta0/power_t to find an optimum in the performance.
Furthermore I recommend using train_test_split to split the data set and use the test set for scoring. A good test size to begin with is test_size=.3.

Generating a *simple* TensorFlow graph illustration

I'm working on my first deep learning model using TensorFlow in a Jupyter notebook, and I would like to generate simplified graphs which illustrate the various layers of the network. Specifically, graphs such as those pictured in this answer:
This is very simple and clean and I can understand what's going on. This is more important than capturing 100% of the details. Contrast with the graph generated by TensorBoard which is a complete fustercluck:
How can I take a tf.Graph object and automatically generate a graph similar to the one above? Bonus points if it can be displayed in the Jupyter Notebook, too.
In short: you cannot. TF is a low-level library with no concept of "high-level operations"; it has ops, and those are the only things it can visualise in the way you are thinking about. In particular, from a mathematical perspective there are no "neurons" in your graph, only tensors being multiplied by each other. This additional "semantics" exists only to make it easier for humans to talk about networks; it is not really encoded in your graph.
What you can do is group nodes yourself by specifying a variable_scope for sections of your graph; in TensorBoard each scope is then displayed as a single node. It will not give you this "per-neuron-like" flavour of visualisation, but at least it will hide many details. Creating nice, visually appealing visualisations of neural nets is an "art" in its own right, and a hard task in general.
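The grouping idea can be sketched independently of TensorFlow: node names created under a scope share a "scope/" prefix, so collapsing every name to its top-level prefix yields a much smaller graph. This is a rough illustration in plain Python; collapse_by_scope and the node names are hypothetical, not part of any TensorFlow API.

```python
def collapse_by_scope(edges):
    """Map each node name to its top-level scope and drop edges inside one scope."""
    collapsed = set()
    for src, dst in edges:
        src_scope = src.split("/")[0]
        dst_scope = dst.split("/")[0]
        if src_scope != dst_scope:
            collapsed.add((src_scope, dst_scope))
    return sorted(collapsed)

# Hypothetical node names, as they might appear with two variable scopes.
edges = [
    ("input", "layer1/MatMul"),
    ("layer1/weights", "layer1/MatMul"),
    ("layer1/MatMul", "layer1/Relu"),
    ("layer1/Relu", "layer2/MatMul"),
    ("layer2/weights", "layer2/MatMul"),
    ("layer2/MatMul", "output"),
]
# Emit the simplified edges in Graphviz DOT syntax.
for src, dst in collapse_by_scope(edges):
    print('"%s" -> "%s";' % (src, dst))
```

Six op-level edges reduce to three scope-level edges here, which is roughly the level of detail a layer diagram shows.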
Here's a snippet of code that we use in our PipelineAI notebooks to display our TensorFlow graphs inline within our Jupyter notebooks:
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import re
from google.protobuf import text_format
from tensorflow.core.framework import graph_pb2
def convert_graph_to_dot(input_graph, output_dot, is_input_graph_binary):
    graph = graph_pb2.GraphDef()
    with open(input_graph, "rb") as fh:
        if is_input_graph_binary:
            graph.ParseFromString(fh.read())
        else:
            text_format.Merge(fh.read(), graph)
    with open(output_dot, "wt") as fh:
        print("digraph graphname {", file=fh)
        for node in graph.node:
            output_name = node.name
            print(" \"" + output_name + "\" [label=\"" + node.op + "\"];", file=fh)
            for input_full_name in node.input:
                parts = input_full_name.split(":")
                input_name = re.sub(r"^\^", "", parts[0])
                print(" \"" + input_name + "\" -> \"" + output_name + "\";", file=fh)
        print("}", file=fh)
    print("Created dot file '%s' for graph '%s'." % (output_dot, input_graph))
input_graph='/root/models/optimize_me/linear/cpu/unoptimized_cpu.pb'
output_dot='/root/notebooks/unoptimized_cpu.dot'
convert_graph_to_dot(input_graph=input_graph, output_dot=output_dot, is_input_graph_binary=True)
Using graphviz, you can convert the .dot to .png using a %%bash magic within your notebook cell:
%%bash
dot -T png /root/notebooks/unoptimized_cpu.dot \
-o /root/notebooks/unoptimized_cpu.png > /tmp/a.out
and finally, display the graph in your notebook:
from IPython.display import Image
Image('/root/notebooks/unoptimized_cpu.png', width=1024, height=768)
Here's an example of a simple Linear Regression model implemented in TensorFlow:
Here's the optimized version used to deploy and serve the TensorFlow Model in production (also rendered using the above code snippets):
More examples and details of these types of optimizations at http://pipeline.ai

sklearn's `RandomizedSearchCV` not working with `np.random.RandomState`

I am trying to optimize a pipeline and wanted to try giving RandomizedSearchCV a np.random.RandomState object. I can't get it to work, but I can give it other distributions.
Is there a special syntax I can use to give RandomizedSearchCV a np.random.RandomState(0).uniform(0.1,1.0)?
from scipy import stats
import numpy as np
from sklearn.neighbors import KernelDensity
from sklearn.grid_search import RandomizedSearchCV
# Generate data
x = np.random.normal(5,1,size=int(1e3))
# Make model
model = KernelDensity()
# Gridsearch for best params
# This one works
search_params = RandomizedSearchCV(model, param_distributions={"bandwidth":stats.uniform(0.1, 1)}, n_iter=30, n_jobs=2)
search_params.fit(x[:, None])
# RandomizedSearchCV(cv=None, error_score='raise',
# estimator=KernelDensity(algorithm='auto', atol=0, bandwidth=1.0, breadth_first=True,
# kernel='gaussian', leaf_size=40, metric='euclidean',
# metric_params=None, rtol=0),
# fit_params={}, iid=True, n_iter=30, n_jobs=2,
# param_distributions={'bandwidth': <scipy.stats._distn_infrastructure.rv_frozen object at 0x106ab7da0>},
# pre_dispatch='2*n_jobs', random_state=None, refit=True,
# scoring=None, verbose=0)
# This one doesn't work :(
search_params = RandomizedSearchCV(model, param_distributions={"bandwidth":np.random.RandomState(0).uniform(0.1, 1)}, n_iter=30, n_jobs=2)
# TypeError: object of type 'float' has no len()
What you observe is expected, as the uniform method of a np.random.RandomState object draws a sample immediately at the time of the call, so a plain float is passed to RandomizedSearchCV.
Compared to that, your usage of scipy's stats.uniform() creates a distribution that is yet to be sampled from. (Although I'm not sure it is working as you expect in your case; be careful with the parameters.)
If you want to incorporate something based on np.random.RandomState(), you have to build your own class as mentioned in the docs:
This example uses the scipy.stats module, which contains many useful distributions for sampling parameters, such as expon, gamma, uniform or randint. In principle, any function can be passed that provides a rvs (random variate sample) method to sample a value. A call to the rvs function should provide independent random samples from possible parameter values on consecutive calls.
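A class of that kind could look roughly like the sketch below. SeededUniform is a hypothetical name, and the demo uses Python's random.Random only to stay dependency-free; any object with a uniform(low, high) method, such as np.random.RandomState(0), should fit the same slot. The optional size and random_state arguments are an assumption about how the caller may invoke rvs, and are ignored here.

```python
import random

class SeededUniform:
    """Distribution-like object: samples are drawn lazily, one per rvs() call."""

    def __init__(self, rng, low, high):
        self.rng = rng      # any object with a uniform(low, high) method
        self.low = low
        self.high = high

    def rvs(self, size=None, random_state=None):
        # size and random_state are accepted for call compatibility and ignored
        # (assumption: the caller may pass them as keyword arguments).
        return self.rng.uniform(self.low, self.high)

bandwidth = SeededUniform(random.Random(0), 0.1, 1.0)
samples = [bandwidth.rvs() for _ in range(5)]
print(samples)
```

An instance would then be passed as the dictionary value, e.g. param_distributions={"bandwidth": bandwidth}, in place of the already-sampled float.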

TensorFlow: Opening log data written by SummaryWriter

After following this tutorial on summaries and TensorBoard, I've been able to successfully save and look at data with TensorBoard. Is it possible to open this data with something other than TensorBoard?
By the way, my application is to do off-policy learning. I'm currently saving each state-action-reward tuple using SummaryWriter. I know I could manually store/train on this data, but I thought it'd be nice to use TensorFlow's built in logging features to store/load this data.
As of March 2017, the EventAccumulator tool has been moved from Tensorflow core to the Tensorboard Backend. You can still use it to extract data from Tensorboard log files as follows:
from tensorboard.backend.event_processing.event_accumulator import EventAccumulator
event_acc = EventAccumulator('/path/to/summary/folder')
event_acc.Reload()
# Show all tags in the log file
print(event_acc.Tags())
# E. g. get wall clock, number of steps and value for a scalar 'Accuracy'
w_times, step_nums, vals = zip(*event_acc.Scalars('Accuracy'))
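Once extracted, the scalar records can be written to any format. The sketch below writes them as CSV with the standard library; the ScalarEvent namedtuple is a stand-in with the same three fields (wall_time, step, value) that the zip above unpacks, and scalars_to_csv is a hypothetical helper, so no log file is needed to run it.

```python
import csv
import io
from collections import namedtuple

# Stand-in for the (wall_time, step, value) records returned by event_acc.Scalars(tag).
ScalarEvent = namedtuple("ScalarEvent", ["wall_time", "step", "value"])

def scalars_to_csv(events, out):
    """Write one header row, then one CSV row per scalar event."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["wall_time", "step", "value"])
    for event in events:
        writer.writerow([event.wall_time, event.step, event.value])

events = [ScalarEvent(1.0, 0, 0.5), ScalarEvent(2.0, 1, 0.75)]
buf = io.StringIO()
scalars_to_csv(events, buf)
print(buf.getvalue())
```

With a real log, events would be event_acc.Scalars('Accuracy') and out an open file handle.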
Easy, the data can actually be exported to a .csv file within TensorBoard under the Events tab, which can e.g. be loaded in a Pandas dataframe in Python. Make sure you check the Data download links box.
For a more automated approach, check out the TensorBoard readme:
If you'd like to export data to visualize elsewhere (e.g. iPython
Notebook), that's possible too. You can directly depend on the
underlying classes that TensorBoard uses for loading data:
python/summary/event_accumulator.py (for loading data from a single
run) or python/summary/event_multiplexer.py (for loading data from
multiple runs, and keeping it organized). These classes load groups of
event files, discard data that was "orphaned" by TensorFlow crashes,
and organize the data by tag.
As another option, there is a script
(tensorboard/scripts/serialize_tensorboard.py) which will load a
logdir just like TensorBoard does, but write all of the data out to
disk as json instead of starting a server. This script is setup to
make "fake TensorBoard backends" for testing, so it is a bit rough
around the edges.
I think the data are protobufs encoded in RecordReader format. To get serialized strings out of the files, you can use py_record_reader or build a graph with the TFRecordReader op, and to deserialize those strings to protobufs, use the Event schema. If you get a working example, please update this question, since we seem to be missing documentation on this.
I did something along these lines for a previous project. As mentioned by others, the main ingredient is TensorFlow's event accumulator:
from tensorflow.python.summary import event_accumulator as ea
acc = ea.EventAccumulator("folder/containing/summaries/")
acc.Reload()
# Print tags of contained entities, use these names to retrieve entities as below
print(acc.Tags())
# E. g. get all values and steps of a scalar called 'l2_loss'
xy_l2_loss = [(s.step, s.value) for s in acc.Scalars('l2_loss')]
# Retrieve images, e.g. the first one tagged 'generator/image/0'
# (Images() returns a list of image events, so take the first entry)
img = acc.Images('generator/image/0')[0]
with open('img_{}.png'.format(img.step), 'wb') as f:
    f.write(img.encoded_image_string)
You can also use tf.train.summary_iterator. To extract events from a ./logs folder where only the classic scalars lr, acc, loss, val_acc and val_loss are present, you can use this gist: tensorboard_to_csv.py
Chris Cundy's answer works well when you have fewer than 10000 data points in your tfevent file. However, when you have a large file with more than 10000 data points, TensorBoard automatically samples them and gives you at most 10000 points. This is quite annoying because it is not well documented. See https://github.com/tensorflow/tensorboard/blob/master/tensorboard/backend/event_processing/event_accumulator.py#L186.
A slightly hacky way to get around this and retrieve all data points is:
from tensorboard.backend.event_processing.event_accumulator import EventAccumulator
class FalseDict(object):
    def __getitem__(self,key):
        return 0
    def __contains__(self, key):
        return True
event_acc = EventAccumulator('path/to/your/tfevents',size_guidance=FalseDict())
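Why this works can be sketched in isolation. As far as I can tell, the accumulator merges the user's size_guidance over its defaults key by key and treats a size of 0 as "store everything"; FalseDict claims to contain every key and always answers 0. The defaults and resolve_sizes below are illustrative assumptions, not the library's actual code.

```python
class FalseDict(object):
    def __getitem__(self, key):
        return 0
    def __contains__(self, key):
        return True

# Illustrative defaults only; the real values live in the EventAccumulator module.
DEFAULT_SIZE_GUIDANCE = {"scalars": 10000, "images": 4}

def resolve_sizes(size_guidance):
    """Hypothetical re-creation of the merge: user guidance overrides the defaults."""
    return {key: size_guidance[key] if key in size_guidance else default
            for key, default in DEFAULT_SIZE_GUIDANCE.items()}

print(resolve_sizes({}))           # defaults are kept
print(resolve_sizes(FalseDict()))  # every limit becomes 0
```

Under the same assumption, passing an ordinary dict such as {'scalars': 0} as size_guidance would presumably lift the limit for scalars only.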
It looks like for tb version >=2.3 you can streamline the process of converting your tb events to a pandas dataframe using tensorboard.data.experimental.ExperimentFromDev().
It requires you to upload your logs to TensorBoard.dev, though, which is public. There are plans to expand the capability to locally stored logs in the future.
https://www.tensorflow.org/tensorboard/dataframe_api
You can also use the EventFileLoader to iterate through a tensorboard file
from tensorboard.backend.event_processing.event_file_loader import EventFileLoader
for event in EventFileLoader('path/to/events.out.tfevents.xxx').Load():
    print(event)
Surprisingly, the Python package tbparse has not been mentioned yet.
From documentation:
Installation:
pip install tensorflow # or tensorflow-cpu
pip install -U tbparse # requires Python >= 3.7
Note: If you don't want to install TensorFlow, see Installing without TensorFlow.
We suggest using an additional virtual environment for parsing and plotting the tensorboard events. So no worries if your training code uses Python 3.6 or older versions.
Reading one or more event files with tbparse only requires 5 lines of code:
from tbparse import SummaryReader
log_dir = "<PATH_TO_EVENT_FILE_OR_DIRECTORY>"
reader = SummaryReader(log_dir)
df = reader.scalars
print(df)