AWS SageMaker giving "The model data archive is too large. Please reduce the size of the model data archive" - training-data

I am using AWS SageMaker to deploy a model whose generated artifacts are huge; the compressed size is about 80 GB. Deploying to an endpoint on an ml.m5.12xlarge instance throws this error:
The model data archive is too large. Please reduce the size of the model data archive or move to an instance type with more memory.
I found that AWS attaches an EBS volume based on the instance size (https://docs.aws.amazon.com/sagemaker/latest/dg/host-instance-storage.html), and I could not find anything larger than 30 GB there. Should I go with a multi-model endpoint here?

"d" instances have bigger (NVMe) volumes; you can try deploying to ml.m5d.* for example. But keep in mind that your download and instantiation time may exceed the service limits (between 15-20min in my experience) so if you can't have an endpoint up in that timeframe you may still encounter an error.

Related

Difference between Sagemaker notebook instance and Training job submission

I am getting an error with a SageMaker training job with the error message "OverflowError: signed integer is greater than maximum". This is an image identification problem with code written in Keras and TensorFlow. The input is a large npy file stored in an S3 bucket.
The code works fine when run in the SageMaker notebook cells but errors out when submitted as a training job using a boto3 request.
I am using the same role in both places. What could be the cause of this error? I am using an ml.g4dn.16xlarge instance in both cases.
A couple of things I would check:
Framework versions used in your notebook instance vs. the training job.
Instance storage volume for the training job; since you are using g4dn, it comes with an attached SSD, which ideally should be good enough.
This seems like a bug: requests and urllib3 should only ask for the maximum number of bytes they are capable of handling at once.
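If the overflow happens while reading the large .npy in one call (single reads over 2 GB are a common trigger for this exact error), streaming the download and memory-mapping the array can sidestep it. A minimal sketch with boto3 and NumPy; the bucket, key, and local path are placeholders:

```python
# Minimal sketch, assuming boto3 and NumPy; bucket, key, and local path are
# placeholders. download_file streams the object in chunks instead of issuing
# one huge read, and mmap_mode avoids pulling the whole array into RAM.
import boto3
import numpy as np

s3 = boto3.client("s3")
bucket = "my-training-bucket"           # placeholder
key = "data/train.npy"                  # placeholder
local_path = "/opt/ml/input/train.npy"  # placeholder

s3.download_file(bucket, key, local_path)

data = np.load(local_path, mmap_mode="r")
print(data.shape, data.dtype)
```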

Difference Between Tile and Data usage & Feature Service

With a Developer Account we get up to 5 GB free for Tile and Data usage and up to 100 MB free for the Feature Service. We are not sure what the difference between the two is.
If I upload a 100 MB+ GeoJSON file, will it be counted against the 100 MB or the 5 GB?
Thank you,
Raj
When you upload the data to ArcGIS, it will be published as a layer in a Feature Service. This will then count towards the 100 MB Feature Service limit. However, feature service storage is typically (always?) more efficient than GeoJSON storage. For example, in a quick test, a 521 KB GeoJSON file downloaded from here turned into a 328 KB feature service. Geometries in feature services are stored as binary fields, and various other efficiencies of the backing hosted feature service (such as how attribute data is stored) also help. There are of course many factors that influence this, but I expect you would always see an improvement over the raw GeoJSON size.
Note that the GeoJSON file you upload will also be stored as the source for the published feature service as part of your 5GB limit (this is so you can upload updated GeoJSON and republish your feature service at the same URL). You can delete this if you won't ever need to update the feature service this way. For reference, here's the GeoJSON file I uploaded (it seems that was also compressed slightly for storage to 509KB).
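To make the two quotas concrete, here is a minimal sketch (assuming the ArcGIS API for Python; the credentials, file name, and title are placeholders) of the upload-then-publish workflow, where the source GeoJSON item and the published layer are billed against different limits:

```python
# Minimal sketch, assuming the ArcGIS API for Python; the credentials,
# file name, and title are placeholders.
from arcgis.gis import GIS

gis = GIS("https://www.arcgis.com", "my_username", "my_password")

# Upload the raw GeoJSON file -- this source item counts against the 5 GB
# Tile and Data storage.
geojson_item = gis.content.add(
    {"title": "My GeoJSON upload", "type": "GeoJSON"},
    data="my_data.geojson",
)

# Publish it as a hosted feature layer -- the published layer is what counts
# against the 100 MB Feature Service storage.
feature_layer_item = geojson_item.publish()
print(feature_layer_item.url)
```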

How to resolve this error in Google Data Fusion: "Stage x contains a task of very large size (2803 KB). The maximum recommended task size is 100 KB."

I need to move data from a parameterized S3 bucket into Google Cloud Storage - a basic data dump. I don't own the S3 bucket. It has the following syntax:
s3://data-partner-bucket/mykey/folder/date=2020-10-01/hour=0
I was able to transfer data at the hourly granularity using the Amazon S3 Client provided by Data Fusion. I wanted to bring over a day's worth of data, so I reset the path in the client to:
s3://data-partner-bucket/mykey/folder/date=2020-10-01
It seemed like it was working until it stopped. The status is "Stopped." When I review the logs just before it stopped I see a warning, "Stage 0 contains a task of very large size (2803 KB). The maximum recommended task size is 100 KB."
I examined the data in the S3 bucket. Each folder contains a series of log files. None of them are "big". The largest folder contains a total of 3MB of data.
I saw a similar question for this error, but the answer involved Spark coding that I don't have access to in Data Fusion.
Screenshot of Advanced Settings in Amazon S3 Client
These are the settings I see in the client. Maybe there is another setting somewhere I need to set? What do I need to do so that Data Fusion can import these files from S3 to GCS?
When you deploy the pipeline you are redirected to a new page with a ribbon at the top. One of the tools in the ribbon is Configure.
In the Resources section of the Configure modal you can specify the memory resources. I fiddled around with the numbers: 1000 MB worked for me, while 6 MB was not enough.
I processed 756K records in about 46 minutes.

Lambda triggers high S3 costs

I created a new Lambda based on a 2 MB zip file (it has a heavy dependency). After that, my S3 costs really increased (from $12.27 to $31).
Question 1: As this is uploaded from a CI/CD pipeline, could it be that it's storing every version and thereby increasing costs?
Question 2: Is this storage alternative more expensive than directly choosing an S3 bucket I own instead of the private one owned by Amazon where this zip goes? Looking at the S3 price list, 2 MB alone can't result in 19 dollars.
Thanks!
A few things you can do to mitigate cost:
Use Lambda Layers for dependencies.
Use S3 Infrequent Access for your Lambda archive.
Since I don't have your full S3 configuration, it's hard to tell what is causing the cost; things like S3 versioning would do it.
The reason was that object versioning was enabled, and after some stress tests those versions accumulated and were stored. Costs went back to $12 after they were removed.
It's key to keep the "Show" option enabled (see image) to keep track of those files.
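If the old versions accumulate in a deployment bucket you control, a lifecycle rule can expire them automatically instead of cleaning up by hand. A minimal sketch with boto3; the bucket name and retention windows are placeholders:

```python
# Minimal sketch, assuming boto3 and a versioned deployment bucket; the bucket
# name and retention windows are placeholders. Expiring noncurrent versions
# keeps CI/CD uploads from accumulating storage costs.
import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="my-lambda-deployments",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-old-lambda-packages",
                "Status": "Enabled",
                "Filter": {"Prefix": ""},  # apply to the whole bucket
                "NoncurrentVersionExpiration": {"NoncurrentDays": 30},
                "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": 7},
            }
        ]
    },
)
```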

Google Big Query - Loading File From GCS Failed with "Not Found", but the file exists

We have a strange issue that happens quite often.
We have a process which gets files from sources and loads them into GCS. Then, and only if the file was uploaded successfully, we try to load it into the BigQuery table and get the error:
"Not found: Uris List of uris (possibly truncated): json: file_name: ...".
After a deep investigation, everything appears to be fine, and we don't know what changed. Within the relevant time frames, the file in the job exists in Cloud Storage and was uploaded into GCS 2 minutes before BigQuery tried to get it.
It should be noted that we load every file as a whole batch directory in Cloud Storage, like gs://<bucket>/path_to_dir/*. Is that still supported?
Also, the file sizes are kind of small - from a few bytes to a few KB. Does that matter?
job ids for checking:
load_file_8e4e16f737084ba59ce0ba89075241b7 load_file_6c13c25e1fc54a088af40199eb86200d
Known issue with Cloud Storage consistency
As noted by Felipe, this was indeed related to a known issue with Cloud Storage. Google Cloud Storage Incident #16036 is shown to have been resolved since December 20, 2016. This was also being tracked in Issue 738. Though Cloud Storage list operations are eventually consistent, this incident displayed excessive delays in operations returning consistent results.
Handling Cloud Storage inconsistency
Though this was an isolated incident, it is nevertheless a good practice to have some means of handling such inconsistencies. Two such suggestions can be found in comment #10 of the related public issue.
Retry the load job if it failed.
Verify if Cloud Storage results are consistent with expectations
Verify the expected number of files (and total size) was processed by BigQuery. You can get this information out of the job metadata, as shown in the sketch below.
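As a rough illustration of those two suggestions (assuming the google-cloud-bigquery client library; the URI, table name, and expected file count are placeholders), the load can be retried and its job metadata compared against what was uploaded:

```python
# Minimal sketch, assuming the google-cloud-bigquery client library; the URI,
# table name, and expected file count are placeholders. It retries the load
# once and checks how many source files the job actually picked up.
from google.cloud import bigquery

client = bigquery.Client()
uri = "gs://my-bucket/path_to_dir/*"
table_id = "my_project.my_dataset.my_table"
expected_file_count = 42  # whatever your upload step reported

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
)

for attempt in range(2):  # retry once if the load fails
    load_job = client.load_table_from_uri(uri, table_id, job_config=job_config)
    try:
        load_job.result()  # wait for completion
        break
    except Exception as exc:
        print(f"Load attempt {attempt + 1} failed: {exc}")
else:
    raise RuntimeError("Load job failed after retrying")

# Compare the job metadata against what you expected to load
print("Files processed:", load_job.input_files)
print("Bytes processed:", load_job.input_file_bytes)
if load_job.input_files != expected_file_count:
    print("Warning: fewer files than expected; the Cloud Storage listing may be stale")
```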
Still getting unexpected results
Should you encounter such an issue again and have the appropriate error handling measures in place, I recommend first consulting the Google Cloud Status Dashboard and BigQuery public issue tracker for existing reports showing similar symptoms. If none exist, file a new issue on the issue tracker.
The solution was to move from a Multi-Region bucket (which was set up before the Region type was available) to a Region bucket.
Since we moved, we have never faced this issue again.