Slow NER training with GPU - spaCy v3.2

I am using a GPU to train an NER model from scratch in spaCy v3.2 (with the --gpu-id option) and the SLURM job scheduler:
sbatch -p gpu --gres=gpu:v100:1 my_script.sh
Here is the "my_script.sh" submission script:
#!/bin/bash
python -m spacy train config.cfg --output ./output --paths.train ./train.spacy --paths.dev ./train.spacy --gpu-id 0
When I run nvidia-smi, I can clearly see GPU utilization at 7% with memory usage at 0%, so training is slow. That's why I think adjustments are needed on my side, at the level of the spaCy configuration, to optimize GPU use.
Could you please tell me where this slow GPU training comes from?
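For reference, here is a minimal sanity check I can run inside the job to confirm spaCy actually sees the GPU (just a sketch; it assumes cupy is installed for the cluster's CUDA version):
import spacy

# Raises an error if no GPU is available, which makes a driver/cupy
# misconfiguration obvious; otherwise it moves ops to GPU 0.
spacy.require_gpu(0)

nlp = spacy.blank("en")
print(nlp("GPU sanity check."))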
Thanks in advance,
FA

Related

Stuck at training model with CPU

As the example points out:
docker run -it -p 8500:8500 --gpus all tensorflow/serving:latest-devel
should train the MNIST model; however, I want to use an Intel CPU for training, not a GPU. But no luck, it got stuck at Training model...
Here is the command I used:
docker run -it -p 8500:8500 tensorflow/serving:latest-devel
I found out that it downloads resources first, for which a proxy is sometimes needed.

YOLO darknet freezes when I start training my model

I am training a model with YOLO darknet in Google Colab, but when I start the training the page freezes and a pop-up window appears saying that the web page is not responding.
I don't know if it is because the model has many classes to train and the page collapses.
Here is my code:
!apt-get update
!unzip "/content/drive/My Drive/custom_dib_model/darknet.zip"
!sudo apt install dos2unix
!find . -type f -print0 | xargs -0 dos2unix
!chmod +x /content/darknet
!make
!./darknet detector test cfg/coco.data cfg/yolov4.cfg yolov4.weights data/person.jpg
!rm /content/darknet/backup -r
!ln -s /content/drive/'My Drive'/dib_weights/backup /content/darknet
!./darknet detector train dibujos_dataset/dib.data dib_yolov4.cfg yolov4.conv.137 -map -dont_show
The last line is the one that starts the training of my model; about 5 minutes pass before the page freezes. It should be noted that no error appears.
I found a similar question, but there is no concrete answer, only a possible answer from the user.
This is all the information that I can give you, and I hope it is enough.
I solved it by changing the YOLO configuration file dib_yolo.cfg, modifying the subdivisions from 64 to 16.
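For reference, a minimal sketch of applying that edit from the notebook (the cfg filename is taken from the training command above; adjust the path to your layout):
import re
from pathlib import Path

# Lower subdivisions so each mini-batch slice fits in Colab's GPU memory.
cfg = Path("dib_yolov4.cfg")  # adjust to your setup
text = re.sub(r"(?m)^subdivisions=64$", "subdivisions=16", cfg.read_text())
cfg.write_text(text)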

freeze model for inference with output_node_name for ssd mobilenet v1 coco

I want to compile a TensorFlow graph to a Movidius graph. I have used Model Zoo's ssd_mobilenet_v1_coco model and trained it on my own dataset.
Then I ran
python object_detection/export_inference_graph.py \
--input_type=image_tensor \
--pipeline_config_path=/home/redtwo/nsir/ssd_mobilenet_v1_coco.config \
--trained_checkpoint_prefix=/home/redtwo/nsir/train/model.ckpt-3362 \
--output_directory=/home/redtwo/nsir/output
which generates frozen_inference_graph.pb & saved_model/saved_model.pb.
Now I want to convert this saved model into a Movidius graph. The commands given are:
Export GraphDef file
python3 ../tensorflow/tensorflow/python/tools/freeze_graph.py \
--input_graph=inception_v3.pb \
--input_binary=true \
--input_checkpoint=inception_v3.ckpt \
--output_graph=inception_v3_frozen.pb \
--output_node_name=InceptionV3/Predictions/Reshape_1
Freeze model for inference
python3 ../tensorflow/tensorflow/python/tools/freeze_graph.py \
--input_graph=inception_v3.pb \
--input_binary=true \
--input_checkpoint=inception_v3.ckpt \
--output_graph=inception_v3_frozen.pb \
--output_node_name=InceptionV3/Predictions/Reshape_1
which can finally be fed to the Intel Movidius NCS SDK:
mvNCCompile -s 12 inception_v3_frozen.pb -in=input -on=InceptionV3/Predictions/Reshape_1
All of this is given at Intel Movidius Website here: https://movidius.github.io/ncsdk/tf_modelzoo.html
My model is already trained, i.e. output/frozen_inference_graph.pb. Why do I freeze it again using /slim/export_inference_graph.py? Or is it output/saved_model/saved_model.pb that should go as input to slim/export_inference_graph.py?
All I want is output_node_name=InceptionV3/Predictions/Reshape_1. How do I get this output_node_name directory structure & anything inside it? I don't know what it all contains.
What output node should I use for Model Zoo's ssd_mobilenet_v1_coco model (trained on my own custom dataset)?
python freeze_graph.py \
--input_graph=/path/to/graph.pbtxt \
--input_checkpoint=/path/to/model.ckpt-22480 \
--input_binary=false \
--output_graph=/path/to/frozen_graph.pb \
--output_node_names="the nodes that you want to output e.g. InceptionV3/Predictions/Reshape_1 for Inception V3 "
Things I understand & don't understand:
input_checkpoint: ✓ [checkpoints that were created during training]
output_graph: ✓ [path to the output frozen graph]
output_node_names: ✗
I don't understand the output_node_names parameter & what should go inside it, considering it's ssd_mobilenet, not inception_v3.
System information
What is the top-level directory of the model you are using:
Have I written custom code (as opposed to using a stock example script provided in TensorFlow):
OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Ubuntu 16.04
TensorFlow installed from (source or binary): TensorFlow installed with pip
TensorFlow version (use command below): 1.13.1
Bazel version (if compiling from source):
CUDA/cuDNN version: V10.1.168/7.*
GPU model and memory: 2080 Ti, 11 GB
Exact command to reproduce:
The graph in saved_model/saved_model.pb is the graph definition (graph architecture) of the pretrained inception_v3 model, without the weights loaded into the graph. The frozen_inference_graph.pb is the graph frozen with the checkpoints you have provided, taking the default output nodes of the inception_v3 model.
To get the output node names, the summarize_graph tool can be used.
You can use the commands below to run the summarize_graph tool if Bazel is installed:
bazel build tensorflow/tools/graph_transforms:summarize_graph
bazel-bin/tensorflow/tools/graph_transforms/summarize_graph \
--in_graph=/tmp/inception_v3_inf_graph.pb
If Bazel is not installed, the output nodes can be obtained using TensorBoard or other graph visualization tools like Netron.
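Alternatively, a minimal Python sketch (assuming the TF 1.x-compatible tf.compat.v1 API, matching TensorFlow 1.13 from the question) can list candidate output nodes directly from the frozen graph:
import tensorflow as tf

# Load the frozen GraphDef.
graph_def = tf.compat.v1.GraphDef()
with tf.compat.v1.gfile.GFile("frozen_inference_graph.pb", "rb") as f:
    graph_def.ParseFromString(f.read())

# A node is a candidate output if no other node consumes it as an input
# (inputs may look like "name:0" or "^name" for control dependencies).
consumed = {i.split(":")[0].lstrip("^") for n in graph_def.node for i in n.input}
print([n.name for n in graph_def.node if n.name not in consumed])
# For TF Object Detection API SSD exports, this typically includes
# detection_boxes, detection_scores, detection_classes and num_detections.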
The additional freeze_graph.py can be used to freeze the graph while specifying the output nodes (i.e. in a case where additional output nodes are added to InceptionV3). The frozen_inference_graph.pb is an equally good fit for inference.

how to install tensorflow on google cloud platform

When I use the command pip install tensorflow, the download only reaches 99% and terminates at that point. How can I install TensorFlow using Google Cloud Shell?
Instead of installing it yourself, you can use the Machine Learning API and use TensorFlow for training or inference. Just follow these guidelines: https://cloud.google.com/ml/docs/quickstarts/training
You can submit a TensorFlow job like this:
gcloud beta ml jobs submit training ${JOB_NAME} \
--package-path=trainer \
--module-name=trainer.task \
--staging-bucket="${TRAIN_BUCKET}" \
--region=us-central1 \
-- \
--train_dir="${TRAIN_PATH}/train"

Keras on Google Cloud ML does not seem to use GPU? Is it possible to make it work?

I tried running Keras with the TensorFlow backend on Cloud ML (Google Cloud Platform). I find that Keras does not seem to use the GPU: running one epoch takes 190 seconds on my CPU, which equals what I see in the dumped logs. Is there a way to identify whether code is running on the GPU or the CPU in Keras? Has anybody tried Keras on Cloud ML with the TensorFlow backend?
Update: as of March 2017, GPUs are publicly available; see Fuyang Liu's answer.
GPUs are not currently available on CloudML. However, they will be in the upcoming months.
Yes, it is supported now.
Basically you need to add a file such as cloudml-gpu.yaml to your module with the following content:
trainingInput:
  scaleTier: CUSTOM
  # standard_gpu provides 1 GPU. Change to complex_model_m_gpu for 4 GPUs.
  masterType: standard_gpu
  runtimeVersion: "1.0"
Then add an option --config=trainer/cloudml-gpu.yaml (assuming your training module is in a folder called trainer). For example:
export BUCKET_NAME=tf-learn-simple-sentiment
export JOB_NAME="example_5_train_$(date +%Y%m%d_%H%M%S)"
export JOB_DIR=gs://$BUCKET_NAME/$JOB_NAME
export REGION=europe-west1
gcloud ml-engine jobs submit training $JOB_NAME \
--job-dir gs://$BUCKET_NAME/$JOB_NAME \
--runtime-version 1.0 \
--module-name trainer.example5-keras \
--package-path ./trainer \
--region $REGION \
--config=trainer/cloudml-gpu.yaml \
-- \
--train-file gs://tf-learn-simple-sentiment/sentiment_set.pickle
You may also want to check out this URL for the GPU-enabled regions and other info.
import keras.backend.tensorflow_backend as K

# log_device_placement=True makes TensorFlow report which device each op runs on.
K.set_session(K.tf.Session(config=K.tf.ConfigProto(log_device_placement=True)))
This should make Keras print the device placement of each tensor to stdout or stderr.
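Another common check (using the TF 1.x API, which matches the era of these answers) is to list the devices TensorFlow itself can see; if a GPU is usable, an entry with device_type "GPU" shows up:
from tensorflow.python.client import device_lib

# Lists the CPU (and GPU, if visible) devices known to TensorFlow.
print([d.name for d in device_lib.list_local_devices()])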