I am trying to follow this tutorial:
https://medium.com/#natu.neeraj/training-a-keras-model-on-google-cloud-ml-cb831341c196
to upload and train a Keras model on Google Cloud Platform, but I can't get it to work.
Right now I have downloaded the package from GitHub, and I have created a cloud environment with AI-Platform and a bucket for storage.
I am uploading the files (with the suggested folder structure) to my Cloud Storage bucket (basically to the root of my storage), and then trying the following command in the cloud terminal:
gcloud ai-platform jobs submit training JOB1
--module-name=trainer.cnn_with_keras
--package-path=./trainer
--job-dir=gs://mykerasstorage
--region=europe-north1
--config=gs://mykerasstorage/trainer/cloudml-gpu.yaml
But I get errors, first the cloudml-gpu.yaml file can't be found, it says "no such folder or file", and trying to just remove it, I get errors because it says the --init--.py file is missing, but it isn't, even if it is empty (which it was when I downloaded from the tutorial GitHub). I am Guessing I haven't uploaded it the right way.
Any suggestions of how I should do this? There is really no info on this in the tutorial itself.
I have read in another guide that it is possible to let gcloud package and upload the job directly, but I am not sure how to do this or where to write the commands, in my terminal with gcloud command? Or in the Cloud Shell in the browser? And how do I define the path where my python files are located?
Should mention that I am working with Mac, and pretty new to using Keras and Python.
I was able to follow the tutorial you mentioned successfully, with some modifications along the way.
I will mention all the steps although you made it halfway as you mentioned.
First of all create a Cloud Storage Bucket for the job:
gsutil mb -l europe-north1 gs://keras-cloud-tutorial
To answer your question on where you should write these commands, depends on where you want to store the files that you will download from GitHub. In the tutorial you posted, the writer is using his own computer to run the commands and that's why he initializes the gcloud command with gcloud init. However, you can submit the job from the Cloud Shell too, if you download the needed files there.
The only files we need from the repository are the trainer folder and the setup.py file. So, if we put them in a folder named keras-cloud-tutorial we will have this file structure:
keras-cloud-tutorial/
├── setup.py
└── trainer
├── __init__.py
├── cloudml-gpu.yaml
└── cnn_with_keras.py
Now, a possible reason for the ImportError: No module named eager error is that you might have changed the runtimeVersion inside the cloudml-gpu.yaml file. As we can read here, eager was introduced in Tensorflow 1.5. If you have specified an earlier version, it is expected to experience this error. So the structure of cloudml-gpu.yaml should be like this:
trainingInput:
scaleTier: CUSTOM
# standard_gpu provides 1 GPU. Change to complex_model_m_gpu for 4 GPUs
masterType: standard_gpu
runtimeVersion: "1.5"
Note: "standard_gpu" is a legacy machine type.
Also, the setup.py file should look like this:
from setuptools import setup, find_packages
setup(name='trainer',
version='0.1',
packages=find_packages(),
description='Example on how to run keras on gcloud ml-engine',
author='Username',
author_email='user#gmail.com',
install_requires=[
'keras==2.1.5',
'h5py'
],
zip_safe=False)
Attention: As you can see, I have specified that I want version 2.1.5 of keras. This is because if I don't do that, the latest version is used which has compatibility issues with versions of Tensorflow earlier than 2.0.
If everything is set, you can submit the job by running the following command inside the folder keras-cloud-tutorial:
gcloud ai-platform jobs submit training test_job --module-name=trainer.cnn_with_keras --package-path=./trainer --job-dir=gs://keras-cloud-tutorial --region=europe-west1 --config=trainer/cloudml-gpu.yaml
Note: I used gcloud ai-platform instead of gcloud ml-engine command although both will work. At some point in the future though, gcloud ml-engine will be deprecated.
Attention: Be careful when choosing the region in which the job will be submitted. Some regions do not support GPUs and will throw an error if chosen. For example, if in my command I set the region parameter to europe-north1 instead of europe-west1, I will receive the following error:
ERROR: (gcloud.ai-platform.jobs.submit.training) RESOURCE_EXHAUSTED:
Quota failure for project . The request for 1 K80
accelerators exceeds the allowed maximum of 0 K80, 0 P100, 0 P4, 0 T4,
0 TPU_V2, 0 TPU_V3, 0 V100. To read more about Cloud ML Engine quota,
see https://cloud.google.com/ml-engine/quotas.
- '#type': type.googleapis.com/google.rpc.QuotaFailure violations:
- description: The request for 1 K80 accelerators exceeds the allowed maximum of
0 K80, 0 P100, 0 P4, 0 T4, 0 TPU_V2, 0 TPU_V3, 0 V100.
subject:
You can read more about the features of each region here and here.
EDIT:
After the completion of the training job, there should be 3 folders in the bucket that you specified: logs/, model/ and packages/. The model is saved on the model/ folder a an .h5 file. Have in mind that if you set a specific folder for the destination you should include the '/' at the end. For example, you should set gs://my-bucket/output/ instead of gs://mybucket/output. If you do the latter you will end up with folders output, outputlogs and outputmodel. Inside output there should be packages. The job page link should direct to output folder so make sure to check the rest of the bucket too!
In addition, in the AI-Platform job page you should be able to see information regarding CPU, GPU and Network utilization:
Also, I would like to clarify something as I saw that you posted some related questions as an answer:
Your local environment, either it is your personal Mac or the Cloud Shell has nothing to do with the actual training job. You don't need to install any specific package or framework locally. You just need to have the Google Cloud SDK installed (in Cloud Shell is of course already installed) to run the appropriate gcloud and gsutil commands. You can read more on how exactly training jobs on the AI-Platform work here.
I hope that you will find my answer helpful.
I got it to work halfway now by not uploading the files but just running the upload commands from cloud at my local terminal... however there was an error during it running ending in "job failed"
Seems it was trying to import something from the TensorFlow backend called "from tensorflow.python.eager import context" but there was an ImportError: No module named eager
I have tried "pip install tf-nightly" which was suggested at another place, but it says I don't have permission or I am loosing the connection to cloud shell(exactly when I try to run the command).
I have also tried making a virtual environment locally to match that on gcloud (with Conda), and have made an environment with Conda with Python=3.5, Tensorflow=1.14.0 and Keras=2.2.5, which should be supported for gcloud.
The python program works fine in this environment locally, but I still get the (ImportError: No module named eager) when trying to run the job on gcloud.
I am putting the flag --python-version 3.5 when submitting the job, but when I write the command "Python -V" in the google cloud shell, it says Python=2.7. Could this be the issue? I have not fins a way to update the python version with the cloud shell prompt, but it says google cloud should support python 3.5. If this is anyway the issue, any suggestions on how to upgrade python version on google cloud?
It is also possible to manually there a new job in the google cloud web interface, doing this, I get a different error message: ERROR: Could not find a version that satisfies the requirement cnn_with_keras.py (from versions: none) and No matching distribution found for cnn_with_keras.py. Where cnn_with_keras.py is my python code from the tutorial, which runs fine locally.
Really don't know what to do next. Any suggestions or tips would be very helpful!
The issue with the GPU is solved now, it was something so simple as, my google cloud account had GPU settings disabled and needed to be upgraded.
Getting the following error when using drone cli to add/activate repo
No help topic for 'add'
I can confirm I am successfully login and I am an admin.
{"id":1,"login":"XXXXX","email":"","machine":false,"admin":true,"active":true,"avatar":"https://bitbucket.org/account/XXXX/avatar/32/","syncing":false,"synced":1578888217,"created":1578431775,"updated":1578891320,"last_login":1578891344}
I can also list my repo using 'drone repo ls'
My guess, if you are using the add option is that you are still interacting with drone 0.8 or below, in this case the docs have been archived to an alternate location in favor of the latest version (v1.x). The old docs are still available under the following URL and help for the add option is present there:
https://0-8-0.docs.drone.io/cli-repository-add/
If you are not using 0.8 and are indeed trying to use 1.x, perhaps you are referencing improper documentation, as this cli option shifted in v1 to enable
$ drone repo enable <repo/name>
Regardless of the versions however, you will want to ensure you both have admin access to the repository (so that drone is able to add the appropriate webhooks) and also refresh or sync your repo listing in if it is something brand new:
$ drone repo sync
username/hello-world
organization/minio
...
NOTE: This might take a bit depending on how many repos you have access to
I'm trying to get Relay working with my React Native app and talking to my GraphQL server. I think I'm missing some pieces of understanding.
I'm following the instructions at https://facebook.github.io/relay/docs/relay-modern.html
It details the yarn commands to set up relay and the babel plugin. I added the "relay" script to my package.json like this:
"relay": "relay-compiler --src ./App --schema ./App/Data/schema.graphql"
But, when I run yarn run relay I get
Error: --schema path does not exist: /Users/user/dev/react-native-app/App/Data/schema.graphql.
Yeah. It doesn't exist. Isn't that what this command is supposed to generate? That documentation page doesn't explain what this command outputs, nor what it needs as input. How can I get this command working correctly? Do I really have to hand-write a schema when it already exists on the server?
EDIT, for php
Given that you are generating your schema on a PHP server, you can generate the .graphql file by creating a Node.js script to:
Send an introspection query to your /graphql endpoint
Pass the result to buildClientSchema
Call printSchema with it and write it on the disk
General usage with graphql-js
As far as I know, you effectively need your schema printed in GraphQL language. You can have a look at printSchema for this, to provide it to the relay-compiler.
printSchema will do the JS Object -> Schema Language conversion. If you already have your schema in Schema Language, this is what you need to provide to the relay-compiler.
It may be possible to use directly the JS Object schema, but I don't know how.
For a detailed explanation of the complete setup, you can look at my other answer here.
I have a Jenkins user that I want to give rights to run the remote CLI towards the Jenkins instance. The first command is to fetch the config.xml:
java -jar jenkins-cli.jar -s http://jenkins:8080/hudson get-job thejob
However when he invokes the command, it fails with:
Caught: java.lang.RuntimeException: \
hudson.security.AccessDeniedException2: \
USER is missing the Job/ExtendedRead permission \
at hudson.security.ACL.checkPermission(ACL.java:54)
I have given the rights to execute scripts, read/create/configure jobs and more in our matrix-based security grid. There is another user who has EXACTLY the same permissions in the grid, but for this other user, everything works fine.
I don't have any of the plugins 'Extended Read permission' or 'Read-only configurations' installed.
I cannot see why it fails for this new user. Suggestions anyone?
Differences in the 2 users config.xml file:
<com.cloudbees.plugins.credentials.UserCredentialsProvider_-UserCredentialsProperty plugin="credentials#1.4">
<credentials/>
vs:
<com.cloudbees.plugins.credentials.UserCredentialsProvider_-UserCredentialsProperty plugin="credentials#1.8.3">
<domainCredentialsMap class="hudson.util.CopyOnWriteMap$Hash"/>
And a final one:
<hudson.security.HudsonPrivateSecurityRealm_-Details>
<passwordHash>some values...</passwordHash>
</hudson.security.HudsonPrivateSecurityRealm_-Details>
I don't know if you are facing the same problem I had, but take a look here:
Jenkins CLI: using Anonymous permissions instead of the user defined ones
It looks like you have upgraded the credentials plugin but somehow the first user didn't get its record updated.
If you can I would suggest trying to update to the latest (1.9.1 for me). You could also edit the user record manually and force the real plugin version number in there (then restart Jenkins) and see if it processes this user more accurately.
I am getting a project not found error when trying to run queries with the bq command line tool or the BigQuery browser window.
I've registered the BigQuery API with the project. I've also setup billing.
For bq, I've setup the .bigqueryrc with the numeric project id.
When I try to query the system response is using the friendly project id so it seems that BigQuery is aware enough to do this mapping of numeric to friendly ids.
I've used the bq shell to verify that prompt reflects the right project id.
I can run 'bq ls publicdata:samples' just fine so I'm assuming the authorization really kicks in to query the data.
What's missing or wrong here?
It looks like there is an issue recognizing projects created through AppEngine. This is a bug and we're actively working on a fix.
As a workaround, you can use a project created through https://code.google.com/apis/console instead.
In my project I didn't have App Engine enabled. For me it was solved by authenticating again though gcloud:
$ gcloud auth login