Can't download c4 dataset with Dataflow in colab - google-colaboratory

I want to download the c4 dataset. As per the instructions page https://www.tensorflow.org/datasets/catalog/c4, it's recommended to use Dataflow. I followed the steps described here: https://www.tensorflow.org/datasets/beam_datasets in Google Colab.
Packages:
!pip install -q tensorflow-datasets
!pip install -q apache-beam[gcp]
This is the cell I'm trying to run in Colab:
%env DATASET_NAME=c4/en
%env GCP_PROJECT=......
%env GCS_BUCKET=gs://c4-dump
%env DATAFLOW_JOB_NAME=c4-en-gen
!echo "tensorflow_datasets[$DATASET_NAME]" > /tmp/beam_requirements.txt
!python -m tensorflow_datasets.scripts.download_and_prepare \
--datasets=$DATASET_NAME \
--data_dir=$GCS_BUCKET \
--beam_pipeline_options="runner=DataflowRunner,project=$GCP_PROJECT,job_name=$DATAFLOW_JOB_NAME,staging_location=$GCS_BUCKET/binaries,temp_location=$GCS_BUCKET/temp,requirements_file=/tmp/beam_requirements.txt"
It's pretty much the same code as in the tutorial. But there is no dataflow job created in the Dataflow tab and it looks like it's downloading locally. See output logs:
env: DATASET_NAME=c4/en
env: GCP_PROJECT=ai-vs-covid19
env: GCS_BUCKET=gs://c4-dump
env: DATAFLOW_JOB_NAME=c4-en-gen
2020-03-31 02:18:46.297213: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
I0331 02:18:49.098738 139869050173312 download_and_prepare.py:180] Running download_and_prepare for datasets:
c4/en
I0331 02:18:49.099436 139869050173312 download_and_prepare.py:181] Version: "None"
I0331 02:18:50.353859 139869050173312 dataset_builder.py:202] Load pre-computed datasetinfo (eg: splits) from bucket.
I0331 02:18:50.468347 139869050173312 dataset_info.py:431] Loading info from GCS for c4/en/2.2.1
I0331 02:18:50.522799 139869050173312 download_and_prepare.py:130] download_and_prepare for dataset c4/en/2.2.1...
I0331 02:18:50.560583 139869050173312 driver.py:124] Generating grammar tables from /usr/lib/python3.6/lib2to3/Grammar.txt
I0331 02:18:50.683776 139869050173312 driver.py:124] Generating grammar tables from /usr/lib/python3.6/lib2to3/PatternGrammar.txt
I0331 02:18:51.189772 139869050173312 dataset_builder.py:310] Generating dataset c4 (gs://c4-dump/c4/en/2.2.1)
Downloading and preparing dataset c4/en/2.2.1 (download: 6.96 TiB, generated: 816.78 GiB, total: 7.76 TiB) to gs://c4-dump/c4/en/2.2.1...
And then a bunch of lines like:
Dl Completed...: 0% 0/18 [00:38<?, ? url/s]
Dl Completed...: 0% 0/18 [00:38<?, ? url/s]
Dl Completed...: 0% 0/18 [00:39<?, ? url/s]I0331 02:19:33.506697 139869050173312 download_manager.py:256] Downloading https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2019-18/segments/1555578517558.8/wet/CC-MAIN-20190418101243-20190418123243-00326.warc.wet.gz into gs://c4-dump/downloads/comm.s3_craw-data_CC-MAIN-2019-18_segm_1555iQS7Yn3hZ3JmwClTiCNY5qtVgGfQQAObrCqx7cMloOg.gz.tmp.1bbeb83abada465287dcecabb0e4f4b0...
Am I missing something or is it just a preparation stage? My main concern is that I can't see the dataflow job running.
Thanks!
UPD: tried the same approach with the compute instance - same result.

I just updated the tfds-nightly package so the raw files will be downloaded on the DataFlow workers instead of the manager. Please give version 2.1.0.dev202003312203 a try and let me know if you have any issues.
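In case it is useful for debugging, the same generation can also be kicked off from the tfds Python API instead of the download_and_prepare script. This is only a minimal sketch, assuming a tfds version whose DownloadConfig accepts beam_options; the project and bucket names are placeholders:

import tensorflow_datasets as tfds
from apache_beam.options.pipeline_options import PipelineOptions

# Placeholder project/bucket values; substitute your own.
beam_options = PipelineOptions(flags=[
    '--runner=DataflowRunner',
    '--project=my-gcp-project',
    '--job_name=c4-en-gen',
    '--staging_location=gs://my-bucket/binaries',
    '--temp_location=gs://my-bucket/temp',
    '--requirements_file=/tmp/beam_requirements.txt',
])

builder = tfds.builder('c4/en', data_dir='gs://my-bucket')
builder.download_and_prepare(
    download_config=tfds.download.DownloadConfig(beam_options=beam_options))

Note that, as the answer above says, older tfds versions perform the raw downloads on the machine that launches the job before any Dataflow workers are started, so a long download phase with no Dataflow job visible is expected there.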

As of today, you don't have to do the processing yourself. We uploaded the dataset to a bucket in the Google Cloud, and also created a JSON version. More details at https://github.com/allenai/allennlp/discussions/5056.
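If you take that route, reading the already-prepared files is then just a matter of pointing tfds at the right bucket. A minimal sketch; the data_dir below is a placeholder, use the path given in the linked discussion:

import tensorflow_datasets as tfds

# Placeholder path; substitute the bucket from the linked discussion.
ds = tfds.load('c4/en', split='train', data_dir='gs://path-to-prepared-c4')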

Related

Training using object detection api is not running on GPUs in AI Platform

I am trying to train some models with the TensorFlow 2 Object Detection API.
I am using this command:
gcloud ai-platform jobs submit training segmentation_maskrcnn_`date +%m_%d_%Y_%H_%M_%S` \
--runtime-version 2.1 \
--python-version 3.7 \
--job-dir=gs://${MODEL_DIR} \
--package-path ./object_detection \
--module-name object_detection.model_main_tf2 \
--region us-central1 \
--scale-tier CUSTOM \
--master-machine-type n1-highcpu-32 \
--master-accelerator count=4,type=nvidia-tesla-p100 \
-- \
--model_dir=gs://${MODEL_DIR} \
--pipeline_config_path=gs://${PIPELINE_CONFIG_PATH}
The training job is submitted successfully, but when I look at it on AI Platform I notice that it's not using the GPUs!
Also, when looking at the logs for my training job, I noticed that in some cases it couldn't load CUDA. It would say something like this:
Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda/extras/CUPTI/lib64:/usr/local/cuda/lib64:/usr/local/nvidia/lib64
I was using AI platform for training a few months back and it was successful. I don't know what has changed now!
In fact, for my own setup, nothing has changed.
For the record, I am training Mask RCNN now. A few months back I trained Faster RCNN and SSD models.
Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda/extras/CUPTI/lib64:/usr/local/cuda/lib64:/usr/local/nvidia/lib64
I'm not sure, as I couldn't test this myself. From a quick Google search, it appears that people have hit this issue for many different reasons, and the right fix depends on the cause. The same question has already been asked on SO and you probably missed it, so check it first, here.
Also, check these related issues:
TensorFlow Issue #26182
TensorFlow Issue #45930
TensorFlow Issue #38578
If you have tried every possible solution and the issue still remains, update your question with what you found.
I think there may be a mismatch between your CUDA setup (CUDA, cuDNN) and your TensorFlow version; check those first in your working environment. Also make sure the CUDA path is set correctly. According to the error message, you need to ensure the following is set properly:
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda-11.0/lib64/
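Independently of the library path, a quick in-job sanity check can tell you whether the TensorFlow build has CUDA support at all and whether any GPU is visible. This is just a small sketch using standard TF 2.x calls, nothing AI Platform specific:

import tensorflow as tf

print(tf.__version__)
print(tf.test.is_built_with_cuda())            # False means this TF build has no CUDA support at all
print(tf.config.list_physical_devices('GPU'))  # an empty list means no GPU is visible to TensorFlow

If the build reports CUDA support but the device list is empty, the problem is the runtime environment (drivers/libraries such as libcudart), not your training code.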

Submit a Keras training job to Google cloud

I am trying to follow this tutorial:
https://medium.com/@natu.neeraj/training-a-keras-model-on-google-cloud-ml-cb831341c196
to upload and train a Keras model on Google Cloud Platform, but I can't get it to work.
Right now I have downloaded the package from GitHub, and I have created a cloud environment with AI-Platform and a bucket for storage.
I am uploading the files (with the suggested folder structure) to my Cloud Storage bucket (basically to the root of my storage), and then trying the following command in the cloud terminal:
gcloud ai-platform jobs submit training JOB1 \
--module-name=trainer.cnn_with_keras \
--package-path=./trainer \
--job-dir=gs://mykerasstorage \
--region=europe-north1 \
--config=gs://mykerasstorage/trainer/cloudml-gpu.yaml
But I get errors. First, the cloudml-gpu.yaml file can't be found ("no such folder or file"), and when I try to just drop that flag, I get errors saying the __init__.py file is missing, but it isn't, even though it is empty (which it was when I downloaded it from the tutorial's GitHub). I am guessing I haven't uploaded it the right way.
Any suggestions on how I should do this? There is really no info on this in the tutorial itself.
I have read in another guide that it is possible to let gcloud package and upload the job directly, but I am not sure how to do this, or where to write the commands: in my local terminal with the gcloud command, or in the Cloud Shell in the browser? And how do I define the path where my Python files are located?
I should mention that I am working on a Mac, and I am pretty new to Keras and Python.
I was able to follow the tutorial you mentioned successfully, with some modifications along the way.
I will mention all the steps, although, as you said, you already made it halfway.
First of all create a Cloud Storage Bucket for the job:
gsutil mb -l europe-north1 gs://keras-cloud-tutorial
To answer your question on where you should write these commands: it depends on where you want to store the files that you will download from GitHub. In the tutorial you posted, the writer is using his own computer to run the commands, and that's why he initializes the gcloud command with gcloud init. However, you can submit the job from the Cloud Shell too, if you download the needed files there.
The only files we need from the repository are the trainer folder and the setup.py file. So, if we put them in a folder named keras-cloud-tutorial we will have this file structure:
keras-cloud-tutorial/
├── setup.py
└── trainer
    ├── __init__.py
    ├── cloudml-gpu.yaml
    └── cnn_with_keras.py
Now, a possible reason for the ImportError: No module named eager error is that you might have changed the runtimeVersion inside the cloudml-gpu.yaml file. As we can read here, eager was introduced in TensorFlow 1.5. If you have specified an earlier version, you can expect this error (see the quick check after the config below). So the structure of cloudml-gpu.yaml should be like this:
trainingInput:
  scaleTier: CUSTOM
  # standard_gpu provides 1 GPU. Change to complex_model_m_gpu for 4 GPUs
  masterType: standard_gpu
  runtimeVersion: "1.5"
Note: "standard_gpu" is a legacy machine type.
Also, the setup.py file should look like this:
from setuptools import setup, find_packages

setup(name='trainer',
      version='0.1',
      packages=find_packages(),
      description='Example on how to run keras on gcloud ml-engine',
      author='Username',
      author_email='user@gmail.com',
      install_requires=[
          'keras==2.1.5',
          'h5py'
      ],
      zip_safe=False)
Attention: As you can see, I have specified that I want version 2.1.5 of keras. This is because if I don't do that, the latest version is used which has compatibility issues with versions of Tensorflow earlier than 2.0.
If everything is set, you can submit the job by running the following command inside the folder keras-cloud-tutorial:
gcloud ai-platform jobs submit training test_job --module-name=trainer.cnn_with_keras --package-path=./trainer --job-dir=gs://keras-cloud-tutorial --region=europe-west1 --config=trainer/cloudml-gpu.yaml
Note: I used gcloud ai-platform instead of gcloud ml-engine command although both will work. At some point in the future though, gcloud ml-engine will be deprecated.
Attention: Be careful when choosing the region in which the job will be submitted. Some regions do not support GPUs and will throw an error if chosen. For example, if in my command I set the region parameter to europe-north1 instead of europe-west1, I will receive the following error:
ERROR: (gcloud.ai-platform.jobs.submit.training) RESOURCE_EXHAUSTED:
Quota failure for project . The request for 1 K80
accelerators exceeds the allowed maximum of 0 K80, 0 P100, 0 P4, 0 T4,
0 TPU_V2, 0 TPU_V3, 0 V100. To read more about Cloud ML Engine quota,
see https://cloud.google.com/ml-engine/quotas.
- '@type': type.googleapis.com/google.rpc.QuotaFailure violations:
- description: The request for 1 K80 accelerators exceeds the allowed maximum of
0 K80, 0 P100, 0 P4, 0 T4, 0 TPU_V2, 0 TPU_V3, 0 V100.
subject:
You can read more about the features of each region here and here.
EDIT:
After the completion of the training job, there should be 3 folders in the bucket that you specified: logs/, model/ and packages/. The model is saved in the model/ folder as an .h5 file. Keep in mind that if you set a specific folder for the destination, you should include the '/' at the end. For example, you should set gs://my-bucket/output/ instead of gs://my-bucket/output. If you do the latter you will end up with folders output, outputlogs and outputmodel (see the small illustration below). Inside output there should be packages. The job page link should point to the output folder, so make sure to check the rest of the bucket too!
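To illustrate why the trailing slash matters, this is just string-prefix behaviour on the GCS paths; a hypothetical sketch, not AI Platform code:

# Hypothetical illustration: without the trailing slash the sub-folder names get glued onto the prefix.
job_dir = 'gs://my-bucket/output'    # missing trailing slash
print(job_dir + 'logs/')             # gs://my-bucket/outputlogs/  (unintended)

job_dir = 'gs://my-bucket/output/'   # with trailing slash
print(job_dir + 'logs/')             # gs://my-bucket/output/logs/ (intended)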
In addition, on the AI-Platform job page you should be able to see information regarding CPU, GPU and network utilization.
Also, I would like to clarify something as I saw that you posted some related questions as an answer:
Your local environment, whether it is your personal Mac or the Cloud Shell, has nothing to do with the actual training job. You don't need to install any specific package or framework locally. You just need to have the Google Cloud SDK installed (in the Cloud Shell it is of course already installed) to run the appropriate gcloud and gsutil commands. You can read more on how exactly training jobs on the AI-Platform work here.
I hope that you will find my answer helpful.
I got it halfway working now by not uploading the files but instead running the upload commands from my local terminal... however there was an error while it was running, ending in "job failed".
It seems it was trying to import something from the TensorFlow backend, "from tensorflow.python.eager import context", but there was an ImportError: No module named eager.
I have tried "pip install tf-nightly", which was suggested elsewhere, but it says I don't have permission, or I lose the connection to Cloud Shell (exactly when I try to run the command).
I have also tried making a virtual environment locally to match the one on gcloud (with Conda), and have made an environment with Python=3.5, TensorFlow=1.14.0 and Keras=2.2.5, which should be supported by gcloud.
The Python program works fine in this environment locally, but I still get the ImportError: No module named eager when trying to run the job on gcloud.
I am passing the flag --python-version 3.5 when submitting the job, but when I run "python -V" in the Google Cloud Shell, it says Python 2.7. Could this be the issue? I have not found a way to update the Python version from the Cloud Shell prompt, but Google Cloud should support Python 3.5. If this is indeed the issue, any suggestions on how to upgrade the Python version on Google Cloud?
It is also possible to manually create a new job in the Google Cloud web interface. Doing this, I get a different error message: ERROR: Could not find a version that satisfies the requirement cnn_with_keras.py (from versions: none) and No matching distribution found for cnn_with_keras.py, where cnn_with_keras.py is my Python code from the tutorial, which runs fine locally.
I really don't know what to do next. Any suggestions or tips would be very helpful!
The issue with the GPU is solved now; it was something as simple as my Google Cloud account having GPU settings disabled and needing to be upgraded.

tensorboard logdir with s3 path

I see tensorflow support AWS s3 file system (https://github.com/tensorflow/tensorflow/tree/master/tensorflow/core/platform/s3) but I am unable to use the S3 path with tensorboard.
I tried the latest nightly (0.4.0rc3) with no luck. I also built TensorFlow locally and made sure "Do you wish to build TensorFlow with Amazon S3 File System support? [Y/n]" was set to yes, but tensorboard --logdir=s3://bucket/path still doesn't work at all.
Am I missing something here?
If you want to start TensorBoard against an AWS S3 path, do the following (a quick verification sketch follows these steps):
(1) add the environment variables:
export AWS_ACCESS_KEY_ID=******
export AWS_SECRET_ACCESS_KEY=*******
export S3_ENDPOINT=******
export S3_VERIFY_SSL=0
export S3_USE_HTTPS=0
(2) upgrade tensorflow to the newest version using pip:
pip install tensorflow==1.4.1
(3) you don't need to upgrade tensorboard separately, because it is pulled in as a dependency in the previous step
Then you can start your TensorBoard with:
tensorboard --logdir=s3://bucket/path
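To verify that the TensorFlow build you are running actually has the S3 filesystem registered, independent of TensorBoard, here is a small sketch using the TF 1.x gfile API; the bucket path is a placeholder:

import tensorflow as tf

# Raises UnimplementedError ("File system scheme s3 not implemented") if S3 support is missing.
print(tf.gfile.ListDirectory('s3://bucket/path'))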

Tensorflow Canned Estimator problems running with multiple workers on Google cloud ml engine

I am trying to train a model using the canned DNNClassifier estimator on the google cloud ml-engine.
I am able to successfully train the model locally in single and distributed mode. Further I am able to train the model on the cloud with the provided BASIC and BASIC_GPU scale-tier.
I am now trying to pass my own custom config file. When I only specify "masterType: standard" in the config file, without mentioning workers or parameter servers, the job runs successfully.
However, whenever I try adding workers, the job fails:
trainingInput:
  scaleTier: CUSTOM
  masterType: standard
  workerType: standard
  workerCount: 4
Here is how I run the job (I get the same error without mentioning the staging bucket):
SCALE_TIER=CUSTOM
JOB_NAME=chasingdatajob_10252017_13
OUTPUT_PATH=gs://chasingdata/$JOB_NAME
STAGING_BUCKET=gs://chasingdata
gcloud ml-engine jobs submit training $JOB_NAME --staging-bucket "$STAGING_BUCKET" --scale-tier $SCALE_TIER --config $SIMPLE_CONFIG --job-dir $OUTPUT_PATH --module-name trainer.task --package-path trainer/ --region $REGION -- ...
My job log shows that the job exited with a non-zero status of 1. I see the following error for worker-replica-3:
Command '['gsutil', '-q', 'cp', u'gs://chasingdata/chasingdatajob_10252017_13/e476e75c04e89e4a0f2f5f040853ec21974ae0af2289a2563293d29179a81199/trainer-0.1.tar.gz', u'trainer-0.1.tar.gz']' returned non-zero exit status 1
I've checked my bucket (gs://chasingdata). I see the chasingdatajob_10252017_13 directory created by the engine, but there is no trainer-0.1.tar.gz file. Another thing to mention: I am passing "tensorflow==1.4.0rc0" as a PyPI package to the cloud in my setup.py file. I don't think this is the cause of the problem, but thought I'd mention it anyway.
Is there any reason for this error? Can someone please help me out?
Perhaps I am doing something stupid. I have tried (unsuccessfully) to find an answer to this.
Thanks a lot!!
Your user code contains logic that deletes the existing job-dir, and that deleted the staged user-code package in GCS as well, so the workers that started late were not able to download the package.
We recommend giving each job a separate job-dir to avoid similar issues.
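One simple way to follow that recommendation is to derive the job name and job-dir from a timestamp, so no run ever reuses (or cleans up) another run's directory. A hypothetical sketch mirroring the variables used above:

import datetime

# Hypothetical helper: a fresh job name and job-dir for every submission.
stamp = datetime.datetime.utcnow().strftime('%m%d%Y_%H%M%S')
job_name = 'chasingdatajob_' + stamp
output_path = 'gs://chasingdata/' + job_name
print(job_name, output_path)

# Avoid deleting an existing job-dir at startup: late-starting workers still need
# the staged trainer-0.1.tar.gz package that lives under that prefix.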

Running TensorBoard on summaries on S3 server

I want to use TensorBoard to visualize results stored on an S3 server, without downloading them to my machine. Ideally, this would work:
$ tensorboard --logdir s3://mybucket/summary
Assuming the tfevents files are stored under summary. However this does not work and returns UnimplementedError: File system scheme s3 not implemented.
Is there some workaround to enable TensorBoard to access the data on the server?
The S3 file system plugin for TensorFlow was released in version 1.4 in early October. You'll need to make sure your tensorflow-tensorboard version is at least 0.4.0-rc1: pip install tensorflow-tensorboard==0.4.0-rc1
Then you can start the server:
tensorboard --logdir=s3://root-bucket/jobs/4/train