Testing a Jupyter Notebook

I am trying to come up with a method to test a number of Jupyter notebooks. A test should run when a new notebook is implemented in a GitHub branch and submitted for a pull request. The tests are not that complicated; they mostly just check that the notebook runs end-to-end without errors, plus maybe a few asserts. However:
There are certain calls in some cells that need to be mocked, e.g. a call to download the data from a database.
There may be some magic cells in the notebooks which run a pip command or something else.
I am open to using any testing library, such as pytest or unittest, although pytest is preferred.
I looked at a few libraries for testing notebooks, such as nbmake, treon, and testbook, but I was unable to make them work. I also tried to convert the notebook to a Python file, but the magic cells were converted to get_ipython().run_cell_magic(...) calls, which became an issue since pytest uses Python rather than IPython, and get_ipython() is only available in IPython.
So, I am wondering what is a good way to test jupyter notebooks with all of that in mind. Any help is appreciated.

One straightforward approach I've already used is to execute the entire notebook with nbconvert.
A notebook failed.ipynb that raises an exception will result in a failed run, thanks to the --execute option that tells nbconvert to execute the notebook prior to conversion.
jupyter nbconvert --to notebook --execute failed.ipynb
# ...
# Exception: FAILED
echo $?
# 1
Another correct notebook passed.ipynb will result in a successful export.
jupyter nbconvert --to notebook --execute passed.ipynb
# [NbConvertApp] Converting notebook passed.ipynb to notebook
# [NbConvertApp] Writing 1172 bytes to passed.nbconvert.ipynb
echo $?
# 0
Cherry on the cake: you can do the same through the API and thus wrap it in pytest!
import nbformat
import pytest
from nbconvert.preprocessors import ExecutePreprocessor

@pytest.mark.parametrize("notebook", ["passed.ipynb", "failed.ipynb"])
def test_notebook_exec(notebook):
    with open(notebook) as f:
        nb = nbformat.read(f, as_version=4)
        ep = ExecutePreprocessor(timeout=600, kernel_name='python3')
        try:
            assert ep.preprocess(nb) is not None, f"Got empty notebook for {notebook}"
        except Exception:
            assert False, f"Failed executing {notebook}"
Running the test gives:
pytest test_nbconv.py
# FAILED test_nbconv.py::test_notebook_exec[failed.ipynb] - AssertionError: Failed executing failed.ipynb
# PASSED test_nbconv.py::test_notebook_exec[passed.ipynb]
Notes
There are several output formats; I've used notebook here.
This doesn't convert the notebook to a different format per se; instead, it runs nbconvert preprocessors on the notebook and/or converts it to other notebook formats.
The Python code example is just a quick draft; it can be improved considerably.

Here is my own solution using testbook. Let's say I have a notebook called my_notebook.ipynb with the following content:
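(The notebook cells are not reproduced here; a minimal sketch consistent with the test below, where the query text, the column names, and x = 7 are reconstructed from the asserts, might look like this:)
# Cell 0
import pandas as pd
from google.cloud import bigquery

# Cell 1: the query text (hypothetical)
query = "SELECT week, count FROM some_table"

# Cell 2: the call that the test will mock (injection happens before this cell)
client = bigquery.Client()
dataframe = client.query(query).result().to_dataframe()

# Cell 3
x = 7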
The trick is to inject a cell before my call to bigquery.Client and mock it:
from testbook import testbook

@testbook('./my_notebook.ipynb')
def test_get_details(tb):
    tb.inject(
        """
import mock
mock_client = mock.MagicMock()
mock_df = pd.DataFrame()
mock_df['week'] = range(10)
mock_df['count'] = 5
p1 = mock.patch.object(bigquery, 'Client', return_value=mock_client)
mock_client.query().result().to_dataframe.return_value = mock_df
p1.start()
        """,
        before=2,
        run=False
    )
    tb.execute()
    dataframe = tb.get('dataframe')
    assert dataframe.shape == (10, 2)
    x = tb.get('x')
    assert x == 7
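Assuming the test is saved in a file such as test_my_notebook.py (a hypothetical name), it runs like any other pytest test:
pytest test_my_notebook.py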

Related

Passing command line arguments in Google Colab

How do I pass command line arguments when running Python code in Google Colab?
I have written code which takes a file as input via sys.argv[]. How do I do this?
As far as I know, there is no special way to pass command line arguments to Python code. This is a working code sample I use when creating tfrecords:
!python generate_tfrecord.py --csv_input=data/test_labels.csv --output_path=data/test.record --image_dir=images/
I don't see any difference between regular command line Python argument passing and Colab. Please add more code to your question to get better help.
I tried this in a Google Colab notebook:
import sys
sys.argv[0] = "first_arg" # this is to assign the first command line argument
sys.argv[1] = "second_arg" # This line to assign the second arg for example
And it worked for me.
So if you want to run a python code that works like this:
!python test.py --image_folder '/content/image' --workers 2 --Prediction CTC --rgb True
You have to open test.py (or your file) in an editor; inside the file you will find lines similar to this:
parser = argparse.ArgumentParser()
parser.add_argument('--image_folder', required=True, help='path to image_folder')
parser.add_argument('--workers', type=int, default=1, help='number of workers')
parser.add_argument('--Prediction', type=str, default='CTC', help='Prediction stage.')
parser.add_argument('--rgb', action='store_true', help='use rgb input')
args = parser.parse_args()
But this will give you the error SystemExit: 2.
Then you have to change it like this:
parser = argparse.ArgumentParser()
parser.add_argument('--image_folder', required=False, default='/content/image', help='path to image_folder')
parser.add_argument('--workers', type=int, default=2, help='number of workers')
parser.add_argument('--Prediction', type=str, default='CTC', help='Prediction stage.')
parser.add_argument('--rgb', action='store_false', help='use rgb input')
parser.add_argument("-f", "--file", required=False)
args = parser.parse_args()
You must add, after the last parser.add_argument line:
parser.add_argument("-f", "--file", required=False)
Then you can access a command line argument like this:
image = args.image_folder
Or
img = Image.open(args.image_folder)
workers = args.workers
But if your last line is like this:
args = vars(ap.parse_args())
Then you have to call it like this:
image = args["image_folder"]
Or
img = Image.open(args["image_folder"])
workers = args["workers"]
Note: action='store_true' will default to False.
Likewise, action='store_false' will default to True.
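A tiny sketch demonstrating those defaults (the flag names here are hypothetical):
import argparse

p = argparse.ArgumentParser()
p.add_argument('--a', action='store_true')   # defaults to False
p.add_argument('--b', action='store_false')  # defaults to True
print(p.parse_args([]))  # prints: Namespace(a=False, b=True)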
Tested with Google Colab.
I made a bioinformatic tool locally in my machine to parse Uniprot big data files of proteins.
The tool I made needs different parameters passed as command line arguments. After the tool was working locally, I uploaded the data files and Python source files to my Google Drive.
I did not make any changes to my files. I just ran the following command directly in Google Colab:
!python3 drive/MyDrive/uniprot/uniprot_select.py FIELDS "ID,OS,SQ" FROM drive/MyDrive/data/uniprot.dat WHERE "SQ#EYDRRR" FASTA
It works perfectly!
No need for special parsing, no need for additional imports. All the work you normally do locally on your machine can be executed without changes.

Having problems declaring SUMO_HOME

I'm trying to run test Python code that uses the traci library, and it returns "please declare environment variable SUMO_HOME".
I'm on Ubuntu 18.04.2 and SUMO 0.32.0. I solved this problem before by running
export SUMO_HOME=/home/gustavo/Downloads/sumo-0.32.0/tools/
but this time it didn't solve the problem. So I tried adding a line inside the Python file, using the os library, that issues the same command from the code itself:
os.system("export SUMO_HOME=/home/gustavo/Downloads/sumo-0.32.0/tool/")
And it also didn't work, so I came here to ask for help. Can any of you help me, please?
import os
import sys
import optparse

os.system("export SUMO_HOME=/home/gustavo/Downloads/sumo-0.32.0/tool/")

# we need to import some python modules from the $SUMO_HOME/tools directory
if 'SUMO_HOME' in os.environ:
    tools = os.path.join(os.environ['SUMO_HOME=/home/gustavo/Downloads/sumo-0.32.0/tools/'], 'tools')
    sys.path.append(tools)
else:
    sys.exit("please declare environment variable 'SUMO_HOME'")

from sumolib import checkBinary  # Checks for the binary in environ vars
import traci


def get_options():
    opt_parser = optparse.OptionParser()
    opt_parser.add_option("--nogui", action="store_true",
                          default=False, help="run the commandline version of sumo")
    options, args = opt_parser.parse_args()
    return options


# contains TraCI control loop
def run():
    step = 0
    while traci.simulation.getMinExpectedNumber() > 0:
        traci.simulationStep()
        print(step)
        step += 1
    traci.close()
    sys.stdout.flush()


# main entry point
if __name__ == "__main__":
    options = get_options()

    # check binary
    if options.nogui:
        sumoBinary = checkBinary('sumo')
    else:
        sumoBinary = checkBinary('sumo-gui')

    # traci starts sumo as a subprocess and then this script connects and runs
    traci.start([sumoBinary, "-c", "demo.sumocfg",
                 "--tripinfo-output", "tripinfo.xml"])
    run()
I expected the steps to appear on the terminal.
The correct location is probably
export SUMO_HOME=/home/gustavo/Downloads/sumo-0.32.0
without the tools or tool suffix. It will not work from inside the Python script with os.system, but you can modify os.environ directly.
Furthermore, you mixed up the call to os.environ in the script. It should read:
tools = os.path.join(os.environ['SUMO_HOME'], 'tools')
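For example, a minimal sketch (assuming the install path above) that sets the variable from within the script instead of via os.system:
import os
import sys

# os.system("export ...") only affects a short-lived child shell;
# modify this process's environment instead
os.environ['SUMO_HOME'] = '/home/gustavo/Downloads/sumo-0.32.0'

tools = os.path.join(os.environ['SUMO_HOME'], 'tools')
sys.path.append(tools)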
I swapped the if/else part for this code instead:
try:
    sys.path.append("/home/gustavo/Downloads/sumo-0.32.0/tools")
    from sumolib import checkBinary
except ImportError:
    sys.exit("please declare environment variable 'SUMO_HOME' as the root directory of your sumo installation (it should contain folders 'bin', 'tools' and 'docs')")
That solved the problem.

Test if notebook is running on Google Colab

How can I test if my notebook is running on Google Colab?
I need this test as obtaining / unzipping my training data is different if running on my laptop or on Colab.
Try importing google.colab:
try:
    import google.colab
    IN_COLAB = True
except ImportError:
    IN_COLAB = False
Or just check if it's in sys.modules
import sys
IN_COLAB = 'google.colab' in sys.modules
For environments using IPython
If you are sure that the script will be run using IPython, which is the most typical usage, there is also the possibility to check the IPython interpreter used. I think this is a little bit clearer, and you don't have to import any module.
if 'google.colab' in str(get_ipython()):
    print('Running on CoLab')
else:
    print('Not running on CoLab')
If you need to do it multiple times you might want to assign a variable so you don't have to repeat the str(get_ipython()).
RunningInCOLAB = 'google.colab' in str(get_ipython())
RunningInCOLAB is True if run in a Google Colab notebook.
For environments not using IPython
In this case you have to check first that IPython is in use at all, assuming that Colab will always use IPython.
RunningInCOLAB = 'google.colab' in str(get_ipython()) if hasattr(__builtins__, '__IPYTHON__') else False
You can check an environment variable like this:
import os
if 'COLAB_GPU' in os.environ:
    print("I'm running on Colab")
Actually, you can print out os.environ to see what's associated with Colab and then check for that key.
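For example, a quick sketch (assuming Colab-set variables contain 'COLAB' in their names):
import os

# Show environment variables that mention COLAB
print([k for k in os.environ if 'COLAB' in k])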
Improved Solution for all Python environments
None of the other answers given here worked for me, and I was not using IPython. I checked the environment variables Colab uses, and the following works best for checking the environment:
import os

if os.getenv("COLAB_RELEASE_TAG"):
    print("Running in Colab")
else:
    print("NOT in Colab")
In a %%bash cell, use:
%%bash
[[ ! -e /colabtools ]] && exit # Continue only if running on Google Colab
# Do Colab-only stuff here
Or the Python equivalent:
import os
if os.path.exists('/colabtools'):
    pass  # do Colab-only stuff here

Running TensorFlow in a Jupyter Notebook instead of via terminal commands

I wish to run some TensorFlow code in a Jupyter notebook.
When running it in a terminal, the link above gives instructions like this:
python src/validate_on_lfw.py ~/datasets/lfw/lfw_mtcnnpy_160 ~/models/facenet/20170512-110547
Question: how do I run it in a Jupyter notebook? Thanks.
For example, the script loads the model like this:
# Load the model
facenet.load_model(args.model)
Simply replacing args.model with ~/models/facenet/20170512-110547
# Load the model
facenet.load_model('~/models/facenet/20170512-110547')
will give the error:
usage: ipykernel_launcher.py [-h] [--lfw_batch_size LFW_BATCH_SIZE]
[--image_size IMAGE_SIZE] [--lfw_pairs LFW_PAIRS]
[--lfw_file_ext {jpg,png}]
[--lfw_nrof_folds LFW_NROF_FOLDS]
lfw_dir model
ipykernel_launcher.py: error: too few arguments
sys.argv
Out[5]:
['/anaconda/envs/tensorflow/lib/python2.7/site-packages/ipykernel_launcher.py',
'-f',
'/Users/my_name/Library/Jupyter/runtime/kernel-770c12c9-8fbe-44f7-91dd-4b0a5c5d7537.json']
OK, simple solution...
Run it in the terminal as the GitHub repo suggests, and in the meantime print out sys.argv in the terminal, like this:
sys.argv = ['src/validate_on_lfw.py', '/Users/../datasets/lfw/lfw_mtcnnpy_160', '/Users/../models/facenet/20170512-110547']
Then use these values of sys.argv in the Jupyter notebook as the default values in def parse_arguments(argv), and it works.
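Alternatively, a minimal sketch of the same idea, overwriting sys.argv inside the notebook before invoking the script's entry points (this assumes the script exposes parse_arguments and main, as the facenet scripts do, and reuses the truncated placeholder paths from above):
import sys
import validate_on_lfw  # assumes src/ is on the import path

# Emulate the command line the script expects
sys.argv = ['validate_on_lfw.py',
            '/Users/../datasets/lfw/lfw_mtcnnpy_160',
            '/Users/../models/facenet/20170512-110547']

validate_on_lfw.main(validate_on_lfw.parse_arguments(sys.argv[1:]))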

TensorFlow: Opening log data written by SummaryWriter

After following this tutorial on summaries and TensorBoard, I've been able to successfully save and look at data with TensorBoard. Is it possible to open this data with something other than TensorBoard?
By the way, my application is to do off-policy learning. I'm currently saving each state-action-reward tuple using SummaryWriter. I know I could manually store/train on this data, but I thought it'd be nice to use TensorFlow's built in logging features to store/load this data.
As of March 2017, the EventAccumulator tool has been moved from TensorFlow core to the TensorBoard backend. You can still use it to extract data from TensorBoard log files as follows:
from tensorboard.backend.event_processing.event_accumulator import EventAccumulator
event_acc = EventAccumulator('/path/to/summary/folder')
event_acc.Reload()
# Show all tags in the log file
print(event_acc.Tags())
# E.g. get wall clock, number of steps, and value for a scalar 'Accuracy'
w_times, step_nums, vals = zip(*event_acc.Scalars('Accuracy'))
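As a follow-up, if you want those scalars as a table, a small sketch (assuming pandas is installed) turns the tuples above into a DataFrame:
import pandas as pd

# Build a table from the wall times, steps, and values extracted above
df = pd.DataFrame({'wall_time': w_times, 'step': step_nums, 'Accuracy': vals})
print(df.head())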
Easy: the data can actually be exported to a .csv file within TensorBoard under the Events tab, which can then be loaded into a pandas DataFrame in Python, for example. Make sure you check the Data download links box.
For a more automated approach, check out the TensorBoard readme:
If you'd like to export data to visualize elsewhere (e.g. iPython Notebook), that's possible too. You can directly depend on the underlying classes that TensorBoard uses for loading data: python/summary/event_accumulator.py (for loading data from a single run) or python/summary/event_multiplexer.py (for loading data from multiple runs, and keeping it organized). These classes load groups of event files, discard data that was "orphaned" by TensorFlow crashes, and organize the data by tag.

As another option, there is a script (tensorboard/scripts/serialize_tensorboard.py) which will load a logdir just like TensorBoard does, but write all of the data out to disk as json instead of starting a server. This script is set up to make "fake TensorBoard backends" for testing, so it is a bit rough around the edges.
I think the data are encoded as protobufs in RecordReader format. To get serialized strings out of the files you can use py_record_reader or build a graph with a TFRecordReader op, and to deserialize those strings to protobuf, use the Event schema. If you get a working example, please update this question, since we seem to be missing documentation on this.
I did something along these lines for a previous project. As mentioned by others, the main ingredient is TensorFlow's event accumulator:
from tensorflow.python.summary import event_accumulator as ea
acc = ea.EventAccumulator("folder/containing/summaries/")
acc.Reload()
# Print tags of contained entities, use these names to retrieve entities as below
print(acc.Tags())
# E.g. get all values and steps of a scalar called 'l2_loss'
xy_l2_loss = [(s.step, s.value) for s in acc.Scalars('l2_loss')]
# Retrieve images, e.g. the first labeled as 'generator'
img = acc.Images('generator/image/0')
with open('img_{}.png'.format(img.step), 'wb') as f:
    f.write(img.encoded_image_string)
You can also use tf.train.summary_iterator. To extract events from a ./logs folder where only classic scalars lr, acc, loss, val_acc, and val_loss are present, you can use this gist: tensorboard_to_csv.py
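For illustration, a minimal sketch of iterating one event file with summary_iterator (under TF 2.x it lives at tf.compat.v1.train.summary_iterator; the file path is a placeholder):
import tensorflow as tf

# Walk every event in a single tfevents file and print classic scalar summaries
for event in tf.compat.v1.train.summary_iterator('path/to/events.out.tfevents.xxx'):
    for value in event.summary.value:
        if value.HasField('simple_value'):
            print(value.tag, event.step, value.simple_value)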
Chris Cundy's answer works well when you have fewer than 10000 data points in your tfevent file. However, when you have a large file with over 10000 data points, TensorBoard will automatically sample them and give you at most 10000 points. This is quite an annoying underlying behavior, as it is not well documented. See https://github.com/tensorflow/tensorboard/blob/master/tensorboard/backend/event_processing/event_accumulator.py#L186.
To get around it and get all data points, a slightly hacky way is:
from tensorboard.backend.event_processing.event_accumulator import EventAccumulator

class FalseDict(object):
    def __getitem__(self, key):
        return 0
    def __contains__(self, key):
        return True

event_acc = EventAccumulator('path/to/your/tfevents', size_guidance=FalseDict())
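If you only need certain categories, a less hacky alternative is the documented size_guidance dict, where 0 means "load everything" for that category (a sketch, assuming scalars are what you want):
from tensorboard.backend.event_processing.event_accumulator import EventAccumulator

# A size_guidance of 0 disables downsampling for the 'scalars' category
event_acc = EventAccumulator('path/to/your/tfevents', size_guidance={'scalars': 0})
event_acc.Reload()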
It looks like for TensorBoard versions >= 2.3 you can streamline the process of converting your TensorBoard events to a pandas DataFrame using tensorboard.data.experimental.ExperimentFromDev().
It requires you to upload your logs to TensorBoard.dev, though, which is public. There are plans to expand the capability to locally stored logs in the future.
https://www.tensorflow.org/tensorboard/dataframe_api
You can also use the EventFileLoader to iterate through a TensorBoard event file:
from tensorboard.backend.event_processing.event_file_loader import EventFileLoader
for event in EventFileLoader('path/to/events.out.tfevents.xxx').Load():
    print(event)
Surprisingly, the Python package tbparse has not been mentioned yet.
From the documentation:
Installation:
pip install tensorflow  # or tensorflow-cpu
pip install -U tbparse  # requires Python >= 3.7
Note: If you don't want to install TensorFlow, see Installing without TensorFlow.
We suggest using an additional virtual environment for parsing and plotting the tensorboard events. So no worries if your training code uses Python 3.6 or older versions.
Reading one or more event files with tbparse only requires 5 lines of code:
from tbparse import SummaryReader
log_dir = "<PATH_TO_EVENT_FILE_OR_DIRECTORY>"
reader = SummaryReader(log_dir)
df = reader.scalars
print(df)