I want to profile my model. Here is a tutorial on how to do it: https://towardsdatascience.com/howto-profile-tensorflow-1a49fb18073d. But I would like to use the TensorFlow profiler, as shown in https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/profiler/README.md#quick-start. According to that README, the following code should start the profiler:
# When using high-level API, session is usually hidden.
#
# Under the default ProfileContext, run a few hundred steps.
# The ProfileContext will sample some steps and dump the profiles
# to files. Users can then use command line tool or Web UI for
# interactive profiling.
with tf.contrib.tfprof.ProfileContext('/tmp/train_dir') as pctx:
    # High level API, such as slim, Estimator, etc.
    train_loop()
bazel-bin/tensorflow/core/profiler/profiler \
--profile_path=/tmp/train_dir/profile_xx
tfprof> op -select micros,bytes,occurrence -order_by micros
# To be open sourced...
bazel-bin/tensorflow/python/profiler/profiler_ui \
--profile_path=/tmp/profiles/profile_1
I generated the file profile_100 and located the directory profiler. So this is what I typed in my terminal:
bazel-/Users/mencia/anaconda/envs/tensorflow_py36/lib/python3.6/site-packages/tensorflow/profiler \
--profile_path=~/tmp/train_dir/profile_100
This raised the following error:
-bash: bazel-/Users/mencia/anaconda/envs/tensorflow_py36/lib/python3.6/site-packages/tensorflow/profiler: No such file or directory
My directory profiler contains:
__init__.py
__pycache__
But according to the above code, there should be
profiler/profiler
Which I don't have.
What do I do to start the Profiler?
You have to build the profiler first. Clone the TensorFlow repo (git clone https://github.com/tensorflow/tensorflow.git) and run bazel build --config opt tensorflow/core/profiler:profiler in the repository root. The resulting binary ends up at bazel-bin/tensorflow/core/profiler/profiler, which is the path the README's quick start invokes; the profiler directory inside your site-packages only contains the Python package, not the command-line tool.
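If building the native tool is inconvenient, roughly the same statistics can also be printed from Python with tf.profiler. A minimal sketch, assuming TF 1.x and a train_op already defined in your graph:

import tensorflow as tf

# Collect run metadata for one step, then print per-op time and memory,
# similar to `op -select micros,bytes -order_by micros` in the CLI.
run_meta = tf.RunMetadata()
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(train_op,  # `train_op` is assumed to exist in your graph
             options=tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE),
             run_metadata=run_meta)
    tf.profiler.profile(
        sess.graph,
        run_meta=run_meta,
        cmd='op',
        options=tf.profiler.ProfileOptionBuilder.time_and_memory())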
I have been using GitHub Actions to build and publish a Docker image to the GitHub Container Registry according to the documentation. I am getting inconsistent behavior when I pull the new image and test it locally.
I have a CMake project in C++ that runs a simple hello world with an INTERFACE and SHARED library.
When I build a Docker image locally and test it, this is the output (which works fine):
*************************************
*** DBSCAN Cluster Segmentation ***
*************************************
--cloudfile: required.
Usage: program [options]
Optional arguments:
-h --help shows help message and exits [default: false]
-v --version prints version information and exits [default: false]
--cloudfile input cloud file [required]
--octree-res octree resolution [default: 120]
--eps epsilon value [default: 40]
--minPtsAux minimum auxiliar points [default: 5]
--minPts minimum points [default: 5]
-o --output-dir output dir to save clusters [default: "-"]
--ext cluster output extension [pcd, ply, txt, xyz] [default: "pcd"]
-d --display display clusters in the pcl visualizer [default: false]
--cal-eps calculate the value of epsilon with the distance to the nearest n points [default: false]
In Github Actions I am using this workflow:
name: Demo Push

on:
  push:
    # Publish `master` as Docker `latest` image.
    branches: ["test-github-packages"]
    # Publish `v1.2.3` tags as releases.
    tags:
      - v*
  # Run tests for any PRs.
  pull_request:

env:
  IMAGE_NAME: dbscan-octrees

jobs:
  # Push image to GitHub Packages.
  # See also https://docs.docker.com/docker-hub/builds/
  push:
    runs-on: ubuntu-latest
    permissions:
      packages: write
      contents: read
    steps:
      - uses: actions/checkout@v3
        with:
          submodules: recursive

      - name: Build image
        run: docker build --file Dockerfile --tag $IMAGE_NAME --label "runnumber=${GITHUB_RUN_ID}" .

      - name: Test image
        run: |
          docker run --rm \
            --env="DISPLAY" \
            --env="QT_X11_NO_MITSHM=1" \
            --volume="/tmp/.X11-unix:/tmp/.X11-unix:rw" \
            dbscan-octrees:latest

      - name: Log in to registry
        # This is where you will update the PAT to GITHUB_TOKEN
        run: echo "${{ secrets.GITHUB_TOKEN }}" | docker login ghcr.io -u ${{ github.actor }} --password-stdin

      - name: Push image
        run: |
          IMAGE_ID=ghcr.io/${{ github.repository_owner }}/$IMAGE_NAME
          # Change all uppercase to lowercase
          IMAGE_ID=$(echo $IMAGE_ID | tr '[A-Z]' '[a-z]')
          # Strip git ref prefix from version
          VERSION=$(echo "${{ github.ref }}" | sed -e 's,.*/\(.*\),\1,')
          # Strip "v" prefix from tag name
          [[ "${{ github.ref }}" == "refs/tags/"* ]] && VERSION=$(echo $VERSION | sed -e 's/^v//')
          # Use Docker `latest` tag convention
          [ "$VERSION" == "master" ] && VERSION=latest
          echo IMAGE_ID=$IMAGE_ID
          echo VERSION=$VERSION
          docker tag $IMAGE_NAME $IMAGE_ID:latest
          docker push $IMAGE_ID:latest
The compilation and test steps work fine with no errors (check this run). The problem is with the newly generated image after the push to the GitHub Container Registry: when I pull it locally and test it, the program crashes with an "Illegal instruction (core dumped)" error. I have tried to debug the problem and there is no compilation error, link error, or anything like that. I found out that this might be related to the linking of the SHARED library, but it is strange: if the image works when it is built in the GitHub Actions runner, I don't understand why the pushed image fails.
I found this post suggesting the error might be related to GitHub changing the container during installation.
Hope someone can help me with this.
This is the output of the Test image step in the workflow:
[workflow screenshot]
This is the error after pulling the newly generated image and testing it locally:
[error screenshot]
I have even compared the bad binary (the GitHub-built version inside the Docker image) with the good version (compiled locally) using ghex; the binary generated by GitHub after pushing a new image is a little bigger than the good one.
[binary comparison and binary size screenshots]
Issue
CPU AVX instruction set not supported by local PC
Solution
Enable compilation flags in CMake to disable AVX support
Description
After digging with binary-analysis tools, debugging, etc., I discovered that the problem was related to AVX CPU support in the GitHub Actions runner. My computer does not support AVX-optimized instructions, so I had to add a compilation flag to my shared libraries that disables AVX support. This flag tells the compiler in the GitHub Actions runner (whose CPU does support AVX) to build the project without AVX instructions or related CPU optimizations, so the resulting binary also runs on machines without them.
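As a quick sanity check (a hypothetical helper, not part of the project), you can ask Linux whether the local CPU advertises AVX at all:

# Hypothetical helper: report whether the local CPU advertises AVX
# (Linux only; reads the kernel's CPU flag list).
def has_avx():
    with open('/proc/cpuinfo') as f:
        for line in f:
            if line.startswith('flags'):
                return 'avx' in line.split()
    return False

print('AVX support:', has_avx())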
Analysis tools:
ldd binary
strace binary <-- this one allowed me to identify the illegal-instruction signal (SIGILL)
container-diff
Using the strace tool I got the following error:
--- SIGILL {si_signo=SIGILL, si_code=ILL_ILLOPN, si_addr=0x55dcd7324bc0} ---
+++ killed by SIGILL (core dumped) +++
Illegal instruction (core dumped)
This gave me the error code, and after searching the internet I found a solution to my specific problem: since my project uses the Point Cloud Library (PCL), I compiled it with -mno-avx, according to this post.
Solution
In the CMakeLists.txt file for each SHARED library, define the following compilation flag:
target_compile_options(${PROJECT_NAME} PUBLIC -mno-avx)
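To confirm the flag took effect, one rough check (a hypothetical sketch; libfoo.so stands in for your library) is to disassemble the binary and count instructions that touch ymm registers, which are only emitted for AVX code paths:

import subprocess

# Hypothetical check: disassemble a built library and count instructions
# that reference ymm registers (only present in AVX code).
out = subprocess.run(['objdump', '-d', 'libfoo.so'],
                     capture_output=True, text=True).stdout
count = sum('ymm' in line for line in out.splitlines())
print(count, 'ymm-register instructions found')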
New issue
I have resolved the major issue, but now one of my shared libraries hits the same error; I will try to fix it with one of these flags (I think).
After running many tests, using the CPU-X tool, I detected the proper architecture-specific options for my PC with the following GCC command:
gcc -march=native -E -v - </dev/null 2>&1 | grep cc1
output:
/usr/lib/gcc/x86_64-linux-gnu/9/cc1 -E -quiet -v -imultiarch
x86_64-linux-gnu - -march=haswell -mmmx -mno-3dnow -msse -msse2
-msse3 -mssse3 -mno-sse4a -mcx16 -msahf -mmovbe -mno-aes -mno-sha
-mpclmul -mpopcnt -mabm -mno-lwp -mno-fma -mno-fma4 -mno-xop
-mno-bmi -mno-sgx -mno-bmi2 -mno-pconfig -mno-wbnoinvd -mno-tbm
-mno-avx -mno-avx2 -msse4.2 -msse4.1 -mlzcnt -mno-rtm -mno-hle
-mrdrnd -mno-f16c -mfsgsbase -mno-rdseed -mno-prfchw -mno-adx
-mfxsr -mno-xsave -mno-xsaveopt -mno-avx512f -mno-avx512er
-mno-avx512cd -mno-avx512pf -mno-prefetchwt1 -mno-clflushopt
-mno-xsavec -mno-xsaves -mno-avx512dq -mno-avx512bw -mno-avx512vl
-mno-avx512ifma -mno-avx512vbmi -mno-avx5124fmaps
-mno-avx5124vnniw -mno-clwb -mno-mwaitx -mno-clzero -mno-pku
-mno-rdpid -mno-gfni -mno-shstk -mno-avx512vbmi2 -mno-avx512vnni
-mno-vaes -mno-vpclmulqdq -mno-avx512bitalg -mno-avx512vpopcntdq
-mno-movdiri -mno-movdir64b -mno-waitpkg -mno-cldemote
-mno-ptwrite --param l1-cache-size=32
--param l1-cache-line-size=64 --param l2-cache-size=3072
-mtune=haswell -fasynchronous-unwind-tables
-fstack-protector-strong -Wformat -Wformat-security
-fstack-clash-protection -fcf-protection
Final solution
I have fixed the execution error with the following flags in my SHARED library:
# MMX, SSE(1, 2, 3, 3S, 4.1, 4.2), CLMUL, RdRand, VT-x, x86-64
target_compile_options(${PROJECT_NAME} PRIVATE -Wno-cpp
-mmmx
-msse
-msse2
-msse3
-mssse3
-msse4.2
-msse4.1
-mno-sse4a
-mno-avx
-mno-avx2
-mno-fma
-mno-fma4
-mno-f16c
-mno-xop
-mno-bmi
-mno-bmi2
-mrdrnd
-mno-3dnow
-mlzcnt
-mfsgsbase
-mpclmul
)
Now, the docker image stored in the GitHub Container Registry is working as expected on my local PC.
Related posts
What is the proper architecture-specific options (-m) for Sandy Bridge based Pentium?
using cmake to make a library without sse support (windows version)
https://github.com/PointCloudLibrary/pcl/issues/5248
Compile errors with Assembler messages
https://github.com/PointCloudLibrary/pcl/issues/1837
I have a singularity container that has been made for me (to run tensorflow on comet GPU nodes) but I need to modify the keras install for my purposes.
I understand that .simg files are not editable (and that the writable .img format is deprecated), so the process of converting to an .img file, editing, and then converting back to .simg is discouraged:
sudo singularity build --writable development.img production.simg
## make changes
sudo singularity build production2.simg development.img
It seems to me the best way might be to extract the contents (say into a sandbox), edit them, and then rebuild the sandbox into an .simg image.
I know how to do the second conversion (singularity build new.sif sandbox/), but how can I do the first?
I have tried the following, but the command never finishes:
sudo singularity build tf_gpu tensorflow-gpu.simg
WARNING: Authentication token file not found : Only pulls of public images will succeed
Build target already exists. Do you want to overwrite? [N/y] y
2018/10/12 08:39:54 bufio.Scanner: token too long
INFO: Starting build...
You can easily convert between a sandbox and a production build using the following:
sudo singularity build lolcow.sif docker://godlovedc/lolcow # pulls and builds an example container
sudo singularity build --sandbox lolcow_sandbox/ lolcow.sif # converts from container to a writable sandbox
sudo singularity build lolcow2 lolcow_sandbox/ # converts from sandbox to container
So, you can edit the sandbox and then rebuild accordingly.
When I run the following in terminal:
MODEL_DIR=output
gcloud ml-engine local train --module-name trainer.task --package-path trainer/ --job-dir $MODEL_DIR
It runs successfully, but I don't get anything in the output folder, although according to this guide I should see some files and checkpoints: https://cloud.google.com/ml-engine/docs/tensorflow/getting-started-training-prediction
In the code I've got this line to save my model:
save_path = saver.save(sess, "./my_mnist_model.ckpt")
That generates following files in the active directory: my_mnist_model.ckpt.index, my_mnist_model.ckpt.meta, my_mnist_model.ckpt.data-00000-of-00001
However they are not in output folder. And when I run it on the Cloud Machine Learning Engine I don't get anything in the specified output folder in my bucket either.
So the model is successfully trained but not saved anywhere.
What am I missing in my code / gcloud command?
Just figured out myself that I need to handle --job-dir in the script. From the getting-started guide I had thought it was handled by the gcloud command that runs the training.
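For reference, a minimal sketch of handling it yourself, assuming a TF 1.x trainer with a Saver (flag and file names are illustrative):

import argparse
import os
import tensorflow as tf

# Parse the --job-dir flag that gcloud forwards to the trainer module.
parser = argparse.ArgumentParser()
parser.add_argument('--job-dir', dest='job_dir', default='output')
args, _ = parser.parse_known_args()

saver = tf.train.Saver()
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # ... training loop ...
    # Save the checkpoint under the job dir instead of the working directory.
    saver.save(sess, os.path.join(args.job_dir, 'my_mnist_model.ckpt'))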
PROBLEM: I am attempting to run a spark-submit script from my local machine to a cluster of machines. The work done by the cluster uses numpy. I currently get the following error:
ImportError:
Importing the multiarray numpy extension module failed. Most
likely you are trying to import a failed build of numpy.
If you're working with a numpy git repo, try `git clean -xdf` (removes all
files not under version control). Otherwise reinstall numpy.
Original error was: cannot import name multiarray
DETAIL:
In my local environment I have setup a virtualenv that includes numpy as well as a private repo I use in my project and other various libraries. I created a zip file (lib/libs.zip) from the site-packages directory at venv/lib/site-packages where 'venv' is my virtual environment. I ship this zip to the remote nodes. My shell script for performing the spark-submit looks like this:
$SPARK_HOME/bin/spark-submit \
--deploy-mode cluster \
--master yarn \
--conf spark.pyspark.virtualenv.enabled=true \
--conf spark.pyspark.virtualenv.type=native \
--conf spark.pyspark.virtualenv.requirements=${parent}/requirements.txt \
--conf spark.pyspark.virtualenv.bin.path=${parent}/venv \
--py-files "${parent}/lib/libs.zip" \
--num-executors 1 \
--executor-cores 2 \
--executor-memory 2G \
--driver-memory 2G \
$parent/src/features/pi.py
I also know that the remote nodes have a Python 2.7 install at /usr/local/bin/python2.7. So in my conf/spark-env.sh I have set the following:
export PYSPARK_PYTHON=/usr/local/bin/python2.7
export PYSPARK_DRIVER_PYTHON=/usr/local/bin/python2.7
When I run the script I get the error above. If I screen-print the installed_distributions I get a zero-length list []. Also, my private library imports correctly (which tells me it is actually accessing my libs.zip site-packages). My pi.py file looks something like this:
from myprivatelibrary.bigData.spark import spark_context
spark = spark_context()
import numpy as np
spark.parallelize(range(1, 10)).map(lambda x: np.__version__).collect()
EXPECTATION/MY THOUGHTS:
I expect this to import numpy correctly especially since I know numpy works correctly in my local virtualenv. I suspect this is because I'm not actually using the version of python that is installed in my virtualenv on the remote node. My question is first, how do I fix this and second how do I use my virtualenv installed python on the remote nodes instead of the python that is just manually installed and currently sitting on those machines? I've seen some write-ups on this but frankly they are not well written.
With the --conf spark.pyspark.* options and export PYSPARK_PYTHON=/usr/local/bin/python2.7 you set options for your local environment / your driver. To set options for the cluster (the executors), use the following syntax:
--conf spark.yarn.appMasterEnv.PYSPARK_PYTHON
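To verify which interpreter actually runs on the executors, here is a minimal sketch (submitted like any other PySpark job):

import sys
from pyspark import SparkContext

# Print the interpreter used by the driver and by an executor.
sc = SparkContext(appName='python-check')
print('driver:   ' + sys.executable)
print('executor: ' + sc.parallelize([0]).map(lambda _: sys.executable).first())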
Furthermore, I guess you should make your virtualenv relocatable (this is experimental, however). <edit 20170908> This means that the virtualenv uses relative instead of absolute links. </edit>
What we did in such cases: we shipped an entire anaconda distribution over hdfs.
<edit 20170908>
If we are talking about different environments (MacOs vs. Linux, as mentioned in the comment below), you cannot just submit a virtualenv, at least not if your virtualenv contains packages with binaries (as is the case with numpy). In that case I suggest you create yourself a 'portable' anaconda, i.e. install Anaconda in a Linux VM and zip it.
Regarding --archives vs. --py-files:
--py-files adds python files/packages to the python path. From the spark-submit documentation:
For Python applications, simply pass a .py file in the place of <application-jar> instead of a JAR, and add Python .zip, .egg or .py files to the search path with --py-files.
--archives means these are extracted into the working directory of each executor (only yarn clusters).
However, a crystal-clear distinction is lacking, in my opinion - see for example this SO post.
In the given case, add the anaconda.zip via --archives, and your 'other python files' via --py-files.
</edit>
See also: Running Pyspark with Virtualenv, a blog post by Henning Kropp.
I was following this tutorial to serve my object detection model with TensorFlow Serving. I am using the TensorFlow object detection API to generate the model, and I have created a frozen model using this exporter (the generated frozen model works when used from a Python script).
The frozen graph directory has the following contents (nothing in the variables directory):
variables/
saved_model.pb
Now when I try to serve the model using the following command,
tensorflow_model_server --port=9000 --model_name=ssd --model_base_path=/serving/ssd_frozen/
It always shows me
...
tensorflow_serving/model_servers/server_core.cc:421] (Re-)adding
model: ssd 2017-08-07 10:22:43.892834: W
tensorflow_serving/sources/storage_path/file_system_storage_path_source.cc:262]
No versions of servable ssd found under base path /serving/ssd_frozen/
2017-08-07 10:22:44.892901: W
tensorflow_serving/sources/storage_path/file_system_storage_path_source.cc:262]
No versions of servable ssd found under base path /serving/ssd_frozen/
...
I had the same problem. The reason is that the object detection API does not assign a version to your model when exporting it, whereas TensorFlow Serving requires a version number so that you can choose different versions of your models to serve. In your case, you should put your detection model (the .pb file and the variables folder) under the folder /serving/ssd_frozen/1/. This assigns your model to version 1, and TensorFlow Serving will automatically load it since you only have one version. By default, TensorFlow Serving serves the latest version (i.e., the largest version number).
Note that after you create the 1/ folder, model_base_path still needs to be set to --model_base_path=/serving/ssd_frozen/.
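If you prefer to script this step, here is a small sketch (assuming the exported .pb file and variables folder currently sit directly under /serving/ssd_frozen/):

import os
import shutil

base = '/serving/ssd_frozen'
version_dir = os.path.join(base, '1')  # version 1
os.makedirs(version_dir)
# Move the SavedModel contents into the numbered version directory.
for name in ('saved_model.pb', 'variables'):
    shutil.move(os.path.join(base, name), version_dir)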
As you may know, newer versions of TF Serving no longer support the old model format that used to be exported by SessionBundle, only SavedModel. So I suppose it's better to restore a session from your older model format and then export it with SavedModelBuilder, which lets you indicate the version of your model.
import os
import tensorflow as tf

def export_saved_model(version, path, sess=None):
    tf.app.flags.DEFINE_integer('version', version, 'version number of the model.')
    tf.app.flags.DEFINE_string('work_dir', path, 'your older model directory.')
    tf.app.flags.DEFINE_string('model_dir', '/tmp/model_name', 'saved model directory')
    FLAGS = tf.app.flags.FLAGS

    # You can pass in the session to export your model immediately after
    # training; otherwise restore one from the old checkpoint format.
    if not sess:
        sess = tf.Session()
        saver = tf.train.import_meta_graph(os.path.join(path, 'xxx.ckpt.meta'))
        saver.restore(sess, tf.train.latest_checkpoint(path))

    export_path = os.path.join(
        tf.compat.as_bytes(FLAGS.model_dir),
        tf.compat.as_bytes(str(FLAGS.version)))
    builder = tf.saved_model.builder.SavedModelBuilder(export_path)

    # Define the signature_def map here (build `prediction_signature`
    # from your graph's input/output tensors).
    # ...

    legacy_init_op = tf.group(tf.tables_initializer(), name='legacy_init_op')
    builder.add_meta_graph_and_variables(
        sess, [tf.saved_model.tag_constants.SERVING],
        signature_def_map={
            'predict_xxx':
                prediction_signature
        },
        legacy_init_op=legacy_init_op
    )
    builder.save()
    print('Export SavedModel!')
You can find the main part of the code above in the TF Serving examples. Finally, it will generate a SavedModel in a format that can be served.
Create a version folder under your model directory, e.g. serving/model_name/0000123/saved_model.pb.
The answers above already explain why it is important to keep a version number inside the model folder. Follow the link below; they have different sets of built models that you can take as a reference.
https://github.com/tensorflow/serving/tree/master/tensorflow_serving/servables/tensorflow/testdata
I was doing this on my personal computer running Ubuntu, not Docker. Note that I am in a directory called "serving", which is where I saved my folder "mobile_weight". I had to create a new folder, "0000123", inside "mobile_weight". My path looks like serving->mobile_weight->0000123->(variables folder and saved_model.pb).
The command from the TensorFlow Serving tutorial should look like this (change model_name and the directory to match yours):
nohup tensorflow_model_server \
--rest_api_port=8501 \
--model_name=model_weight \
--model_base_path=/home/murage/Desktop/serving/mobile_weight >server.log 2>&1
So my entire terminal screen looks like:
murage#murage-HP-Spectre-x360-Convertible:~/Desktop/serving$ nohup tensorflow_model_server --rest_api_port=8501 --model_name=model_weight --model_base_path=/home/murage/Desktop/serving/mobile_weight >server.log 2>&1
That error message can also result from issues with the --volume argument.
Ensure your --volume mount is actually correct and points to the model's directory, since this is a general 'model not found' error that just looks more complex.
If you are on Windows, just use cmd; otherwise it is easy to accidentally use a Linux file path and Linux separators in Cygwin or Git Bash. Even with the correct file structure you can get the OP's error if you don't use the Windows absolute path.
#using cygwin
$ echo $TESTDATA
/home/username/directory/serving/tensorflow_serving/servables/tensorflow/testdata
$ docker run -t --rm -p 8501:8501 -v "$TESTDATA/saved_model_half_plus_two_cpu:/models/half_plus_two" -e MODEL_NAME=half_plus_two tensorflow/serving
2021-01-22 20:12:28.995834: W tensorflow_serving/sources/storage_path/file_system_storage_path_source.cc:267] No versions of servable half_plus_two found under base path /models/half_plus_two. Did you forget to name your leaf directory as a number (eg. '/1/')?
Calling the same command with the same unchanged file structure, but with the full Windows path and Windows file separators, works:
#using cygwin
$ export TESTDATA="$(cygpath -w "/home/username/directory/serving/tensorflow_serving/servables/tensorflow/testdata")"
$ echo $TESTDATA
C:\Users\username\directory\serving\tensorflow_serving\servables\tensorflow\testdata
$ docker run -t --rm -p 8501:8501 -v "$TESTDATA\\saved_model_half_plus_two_cpu:/models/half_plus_two" -e MODEL_NAME=half_plus_two tensorflow/serving
2021-01-22 21:10:49.527049: I tensorflow_serving/core/basic_manager.cc:740] Successfully reserved resources to load servable {name: half_plus_two version: 1}