Cloud9 deploy hitting size limit for numpy, pandas - pandas

I'm building in Cloud9 to deploy to Lambda. My function works fine in Cloud9 but when I go to deploy I get the error
Unzipped size must be smaller than 262144000 bytes
Running du -h | sort -h shows that my biggest offenders are:
/debug at 291M
/numpy at 79M
/pandas at 47M
/botocore at 41M
My function is extremely simple, it calls a service, uses panda to format the response, and sends it on.
What is in debug and how do I slim it down/eliminate it from the deploy package?
How do others use libraries at all if they eat up most of the memory limit?

A brief background to understand the problem root-cause
The problem is not with your function but with the size of the zipped packages. As per AWS documentation, the overall size of zipped package must not exceed greater than 3MB. With that said, if the package size is greater than 3MB which happens inevitably, as a library can have many dependencies, then consider uploading the zipped package to a AWS S3 bucket. Note: even s3 bucket has a size limit of 262MB. Ensure that your package does not exceed this limit. The error message that you have posted, Unzipped size must be smaller than 262144000 bytes is referring to the size of the deployment package aka the libraries.
Now, Understand some facts when working with AWS,
AWS Containers are empty.
AWS containers have a linux kernel
AWS Cloud9 is only an IDE like RStudio or Pycharm. And it uses S3 bucket for saving the installed packages.
This means you'll need to know the following:
the package and its related dependencies
extract the linux-compiled packages from cloud9 and save to a folder-structure like, python/lib/python3.6/site-packages/
Possible/Workable solution to overcome this problem
Overcome this problem by reducing the package size. See below.
Reducing the deployment package size
Manual method: delete files and folders within each library folder that are named *.info and *._pycache. You'll need to manually look into each folder for the above file extensions to delete them.
Automatic method: I've to figure out the command. work in progress
Use Layers
In AWS go to Lambda and create a layer
Attach the S3 bucket link containing the python package folder. Ensure the lambda function IAM role has permission to access S3 bucket.
Make sure the un-zipped folder size is less than 262MB. Because if its >260 MB then it cannot be attached to AWS Layer. You'll get an error, Failed to create layer version: Unzipped size must be smaller than 262144000 bytes

Related

Trying to download zip file from drive to instance in colab hits quota limit

I am trying to download the zip file dataset from drive using
from google.colab import drive
drive.mount('/content/drive')
!cp "/content/drive/Shared Drives/infinity-drive/datasets/coco.zip" ./data
I get the error that one of the quota limits is exceeded for 3 days in consecutively and file copying ends in I/O error.
I went over all the solutions. File size is 18GB. Although I could not understand the directive
Use drive.google.com to download the file.. What does it have to do with triggering limis in colab? For people trying to download file to colab instance and then to their local machine? Any way/
The file is in archive format.
The folder it is in has no any other files in it.
The file is private, although I am not the manager of the shared drive.
I am baffled at exactly which quota I am hitting. It does not say to me but I am 99% sure I should not be hitting any considering I can download the entire thing to my local machine. I can't keep track of exactly what is happening because even within the same day, it is able to copy about 3-5 GB of the file(even the file size is changing).
This is a bug in colab/Drive integration, and is tracked in #1607. See this comment for workarounds.

Submit a Keras training job to Google cloud

I am trying to follow this tutorial:
https://medium.com/#natu.neeraj/training-a-keras-model-on-google-cloud-ml-cb831341c196
to upload and train a Keras model on Google Cloud Platform, but I can't get it to work.
Right now I have downloaded the package from GitHub, and I have created a cloud environment with AI-Platform and a bucket for storage.
I am uploading the files (with the suggested folder structure) to my Cloud Storage bucket (basically to the root of my storage), and then trying the following command in the cloud terminal:
gcloud ai-platform jobs submit training JOB1
--module-name=trainer.cnn_with_keras
--package-path=./trainer
--job-dir=gs://mykerasstorage
--region=europe-north1
--config=gs://mykerasstorage/trainer/cloudml-gpu.yaml
But I get errors, first the cloudml-gpu.yaml file can't be found, it says "no such folder or file", and trying to just remove it, I get errors because it says the --init--.py file is missing, but it isn't, even if it is empty (which it was when I downloaded from the tutorial GitHub). I am Guessing I haven't uploaded it the right way.
Any suggestions of how I should do this? There is really no info on this in the tutorial itself.
I have read in another guide that it is possible to let gcloud package and upload the job directly, but I am not sure how to do this or where to write the commands, in my terminal with gcloud command? Or in the Cloud Shell in the browser? And how do I define the path where my python files are located?
Should mention that I am working with Mac, and pretty new to using Keras and Python.
I was able to follow the tutorial you mentioned successfully, with some modifications along the way.
I will mention all the steps although you made it halfway as you mentioned.
First of all create a Cloud Storage Bucket for the job:
gsutil mb -l europe-north1 gs://keras-cloud-tutorial
To answer your question on where you should write these commands, depends on where you want to store the files that you will download from GitHub. In the tutorial you posted, the writer is using his own computer to run the commands and that's why he initializes the gcloud command with gcloud init. However, you can submit the job from the Cloud Shell too, if you download the needed files there.
The only files we need from the repository are the trainer folder and the setup.py file. So, if we put them in a folder named keras-cloud-tutorial we will have this file structure:
keras-cloud-tutorial/
├── setup.py
└── trainer
├── __init__.py
├── cloudml-gpu.yaml
└── cnn_with_keras.py
Now, a possible reason for the ImportError: No module named eager error is that you might have changed the runtimeVersion inside the cloudml-gpu.yaml file. As we can read here, eager was introduced in Tensorflow 1.5. If you have specified an earlier version, it is expected to experience this error. So the structure of cloudml-gpu.yaml should be like this:
trainingInput:
scaleTier: CUSTOM
# standard_gpu provides 1 GPU. Change to complex_model_m_gpu for 4 GPUs
masterType: standard_gpu
runtimeVersion: "1.5"
Note: "standard_gpu" is a legacy machine type.
Also, the setup.py file should look like this:
from setuptools import setup, find_packages
setup(name='trainer',
version='0.1',
packages=find_packages(),
description='Example on how to run keras on gcloud ml-engine',
author='Username',
author_email='user#gmail.com',
install_requires=[
'keras==2.1.5',
'h5py'
],
zip_safe=False)
Attention: As you can see, I have specified that I want version 2.1.5 of keras. This is because if I don't do that, the latest version is used which has compatibility issues with versions of Tensorflow earlier than 2.0.
If everything is set, you can submit the job by running the following command inside the folder keras-cloud-tutorial:
gcloud ai-platform jobs submit training test_job --module-name=trainer.cnn_with_keras --package-path=./trainer --job-dir=gs://keras-cloud-tutorial --region=europe-west1 --config=trainer/cloudml-gpu.yaml
Note: I used gcloud ai-platform instead of gcloud ml-engine command although both will work. At some point in the future though, gcloud ml-engine will be deprecated.
Attention: Be careful when choosing the region in which the job will be submitted. Some regions do not support GPUs and will throw an error if chosen. For example, if in my command I set the region parameter to europe-north1 instead of europe-west1, I will receive the following error:
ERROR: (gcloud.ai-platform.jobs.submit.training) RESOURCE_EXHAUSTED:
Quota failure for project . The request for 1 K80
accelerators exceeds the allowed maximum of 0 K80, 0 P100, 0 P4, 0 T4,
0 TPU_V2, 0 TPU_V3, 0 V100. To read more about Cloud ML Engine quota,
see https://cloud.google.com/ml-engine/quotas.
- '#type': type.googleapis.com/google.rpc.QuotaFailure violations:
- description: The request for 1 K80 accelerators exceeds the allowed maximum of
0 K80, 0 P100, 0 P4, 0 T4, 0 TPU_V2, 0 TPU_V3, 0 V100.
subject:
You can read more about the features of each region here and here.
EDIT:
After the completion of the training job, there should be 3 folders in the bucket that you specified: logs/, model/ and packages/. The model is saved on the model/ folder a an .h5 file. Have in mind that if you set a specific folder for the destination you should include the '/' at the end. For example, you should set gs://my-bucket/output/ instead of gs://mybucket/output. If you do the latter you will end up with folders output, outputlogs and outputmodel. Inside output there should be packages. The job page link should direct to output folder so make sure to check the rest of the bucket too!
In addition, in the AI-Platform job page you should be able to see information regarding CPU, GPU and Network utilization:
Also, I would like to clarify something as I saw that you posted some related questions as an answer:
Your local environment, either it is your personal Mac or the Cloud Shell has nothing to do with the actual training job. You don't need to install any specific package or framework locally. You just need to have the Google Cloud SDK installed (in Cloud Shell is of course already installed) to run the appropriate gcloud and gsutil commands. You can read more on how exactly training jobs on the AI-Platform work here.
I hope that you will find my answer helpful.
I got it to work halfway now by not uploading the files but just running the upload commands from cloud at my local terminal... however there was an error during it running ending in "job failed"
Seems it was trying to import something from the TensorFlow backend called "from tensorflow.python.eager import context" but there was an ImportError: No module named eager
I have tried "pip install tf-nightly" which was suggested at another place, but it says I don't have permission or I am loosing the connection to cloud shell(exactly when I try to run the command).
I have also tried making a virtual environment locally to match that on gcloud (with Conda), and have made an environment with Conda with Python=3.5, Tensorflow=1.14.0 and Keras=2.2.5, which should be supported for gcloud.
The python program works fine in this environment locally, but I still get the (ImportError: No module named eager) when trying to run the job on gcloud.
I am putting the flag --python-version 3.5 when submitting the job, but when I write the command "Python -V" in the google cloud shell, it says Python=2.7. Could this be the issue? I have not fins a way to update the python version with the cloud shell prompt, but it says google cloud should support python 3.5. If this is anyway the issue, any suggestions on how to upgrade python version on google cloud?
It is also possible to manually there a new job in the google cloud web interface, doing this, I get a different error message: ERROR: Could not find a version that satisfies the requirement cnn_with_keras.py (from versions: none) and No matching distribution found for cnn_with_keras.py. Where cnn_with_keras.py is my python code from the tutorial, which runs fine locally.
Really don't know what to do next. Any suggestions or tips would be very helpful!
The issue with the GPU is solved now, it was something so simple as, my google cloud account had GPU settings disabled and needed to be upgraded.

how do I split a large csv.gz file in Google Cloud Storage?

I get this error when trying to load a table in Google BQ:
Input CSV files are not splittable and at least one of the files is
larger than the maximum allowed size. Size is: 56659381010. Max
allowed size is: 4294967296.
Is there a way to split the file using gsutil or something like that without having to upload everything again?
The largest compressed CSV file you can load into BigQuery is 4 gigabytes. GCS unfortunately does not provide a way to decompress a compressed file, nor does it provide a way to split a compressed file. GZip'd files can't be arbitrarily split up and reassembled in the way you could a tar file.
I imagine your best bet would likely be to spin up a GCE instance in the same region as your GCS bucket, download your object to that instance (which should be pretty fast, given that it's only a few dozen gigabytes), decompress the object (which will be slower), break that CSV file into a bunch of smaller ones (the linux split command is useful for this), and then upload the objects back up to GCS.
I ran into the same issue and this is how I dealt with it:
First, spin up a Google Compute Engine VM instance.
https://console.cloud.google.com/compute/instances
Then install the gsutil commands and then go through the authentication process.
https://cloud.google.com/storage/docs/gsutil_install
Once you have verified that the gcloud, gsutil, and bq commands are working then save a snapshot of the disk as snapshot-1 and then delete this VM.
On your local machine, run this command to create a new disk. This disk is used for the VM so that you have enough space to download and unzip the large file.
gcloud compute disks create disk-2017-11-30 --source-snapshot snapshot-1 --size=100GB
Again on your local machine, run this command to create a new VM instance that uses this disk. I use the --preemptible flag to save some cost.
gcloud compute instances create loader-2017-11-30 --disk name=disk-2017-11-30,boot=yes --preemptible
Now you can SSH into your instance and then run these commands on the remote machine.
First, copy the file from cloud storage to the VM
gsutil cp gs://my-bucket/2017/11/20171130.gz .
Then unzip the file. In my case, for ~4GB file, it took about 17 minutes to complete this step:
gunzip 20171130.gz
Once unzipped, you can either run the bq load command to load it into BigQuery but I found that for my file size (~70 GB unzipped), that operation would take about 4 hours. Instead, I uploaded the unzipped file back to Cloud Storage
gsutil cp 20171130 gs://am-alphahat-regional/unzipped/20171130.csv
Now that the file is back on cloud storage, you can run this command to delete the VM.
gcloud compute instances delete loader-2017-11-30
Theoretically, the associated disk should also have been deleted, but I found that the disk was still there and I needed to delete it with an additional command
gcloud compute disks delete disk-2017-11-30
Now finally, you should be able to run the bq load command or you can load the data from the console.

When using LZO on Hadoop output on AWS EMR, does it index the files (stored on S3) for future automatic splitting?

I want to use LZO compression on my Elastic Map Reduce job's output that is being stored on S3, but it is not clear if the files are automatically indexed so that future jobs run on this data will split the files into multiple tasks.
For example, if my output is a bunch of lines of TSV data, in a 1GB LZO file, will a future map job only create 1 task, or something like (1GB/blockSize) tasks (i.e. the behavior of when files were not compressed, or if there was a LZO index file in the directory)?
Edit: If this is not done automatically, what is recommended for getting my output to be LZO-indexed? Do the indexing before uploading the file to S3?
Short answer to my first question: AWS does not do automatic indexing. I've confirmed this with my own job, and also read the same from Andrew#AWS on their forum.
Here's how you can do the indexing:
To index some LZO files, you'll need to use my own Jar built from the Twitter hadoop-lzo project. You'll need to build the Jar somewhere, then upload to Amazon S3, if you want to Index directly with EMR.
On side note, Cloudera has good instructions on all the steps for setting this up on your own cluster. I did this on my local cluster, which allowed me to build the Jar and upload to S3. You can probably find a pre-built Jar on the net if you don't want to build it yourself.
When outputting your data from your Hadoop job, make sure you use the LzopCodec and not the LzoCodec, otherwise the files are not indexable (at least based on my experience). Example Java code (same idea carries over to Streaming API):
import com.hadoop.compression.lzo.LzopCodec;
TextOutputFormat.setCompressOutput(job, true);
TextOutputFormat.setOutputCompressorClass(job, LzopCodec.class)
Once your hadoop-lzo Jar is on S3, and your Hadoop job has outputted .lzo files, run your indexer on the output directory (instructions below you got a EMR job/cluster running):
elastic-mapreduce -j <existingJobId> \
--jar s3n://<yourBucketName>/hadoop-lzo-0.4.17-SNAPSHOT.jar \
--args com.hadoop.compression.lzo.DistributedLzoIndexer \
--args s3://<yourBucketName>/output/myLzoJobResults \
--step-name "Lzo file indexer Jar"
Then when you're using the data in a future job, be sure to specify that the input is in LZO format, otherwise the splitting won't occur. Example Java code:
import com.hadoop.mapreduce.LzoTextInputFormat;
job.setInputFormatClass(LzoTextInputFormat.class);

large file from ec2 to s3

I have a 27GB file that I am trying to move from an AWS Linux EC2 to S3. I've tried both the 'S3put' command and the 'S3cmd put' command. Both work with a test file. Neither work with the large file. No errors are given, the command returns immediately but nothing happens.
s3cmd put bigfile.tsv s3://bucket/bigfile.tsv
Though you can upload objects to S3 with sizes up to 5TB, S3 has a size limit of 5GB for an individual PUT operation.
In order to load files larger than 5GB (or even files larger than 100MB) you are going to want to use the multipart upload feature of S3.
http://docs.amazonwebservices.com/AmazonS3/latest/dev/UploadingObjects.html
http://aws.typepad.com/aws/2010/11/amazon-s3-multipart-upload.html
(Ignore the outdated description of a 5GB object limit in the above blog post. The current limit is 5TB.)
The boto library for Python supports multipart upload, and the latest boto software includes an "s3multiput" command line tool that takes care of the complexities for you and even parallelizes part uploads.
https://github.com/boto/boto
The file did not exist, doh. I realised this after running the s3 commands in verbose mode by adding the -v tag:
s3cmd put -v bigfile.tsv s3://bucket/bigfile.tsv
s3cmd version 1.1.0 supports the multi-part upload as part of the "put" command, but its still in beta (currently.)