I've created an S3 Batch Operations job using an S3 inventory manifest (JSON) that points to a few billion objects in my S3 bucket.
The job has been stuck in the "Preparing" status for 24 hours now.
What preparation times should I expect at these volumes?
Would the preparation time be shorter if, instead of providing the JSON manifest, I joined all the inventory CSVs into one big CSV?
I used the AWS CLI to create the job like so:
aws s3control create-job \
--region ... \
--account-id ... \
--operation '{"S3PutObjectCopy":{"TargetResource":"arn:aws:s3:::some-bucket","MetadataDirective":"COPY"}}' \
--manifest '{"Spec":{"Format":"S3InventoryReport_CSV_20161130"},"Location":{"ObjectArn":"arn:aws:s3:::path_to_manifest/manifest.json","ETag":"..."}}' \
--report '{"Bucket":"arn:aws:s3:::some-bucket","Prefix":"reports", "Format":"Report_CSV_20180820", "Enabled":true, "ReportScope":"AllTasks"}' \
--priority 42 \
--role-arn ... \
--client-request-token $(uuidgen) \
--description "Batch request"
Update: after ~4 days the job finished the preparation phase and the tasks were ready to be run.
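In case it helps anyone else watching a huge job sit in "Preparing": you can poll the job state (and, once it is running, the progress counters) from the CLI with describe-job. A minimal sketch; the account ID and job ID below are placeholders, and the job ID is the one returned by create-job:
aws s3control describe-job \
--region ... \
--account-id 111111111111 \
--job-id "job-id-returned-by-create-job" \
--query 'Job.[Status,ProgressSummary]'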
It seems that I can load data into BigQuery from S3 using the sample below.
This time, I would like to load compressed files from S3 rather than a plain CSV file.
Is that possible? If so, how can I load the compressed data into BigQuery from S3?
Sample:
bq mk \
--transfer_config \
--data_source=amazon_s3 \
--display_name=load_from_s3 \
--target_dataset=test_dataset_s3 \
--params='{
"data_path":"s3://xxx-test01",
"destination_table_name_template":"test_table",
"access_key_id":"xxxxxxxxxxxxx",
"secret_access_key":"xxxxxxxxxxxxxx",
"file_format":"CSV",
"max_bad_records":"0",
"ignore_unknown_values":"true",
"field_delimiter":",",
"skip_leading_rows":"0",
"allow_quoted_newlines":"true"
}'
The same bq CLI command can be used with hardly any changes. Assuming you have compressed CSV files, I tested the transfer with a sample gzip-compressed file of my own and it completed successfully. Below is the bq command I tested.
bq mk \
--transfer_config \
--data_source=amazon_s3 \
--display_name=load_from_s3 \
--target_dataset=test_dataset \
--params='{
"data_path":"s3://awsbucket-name/sample.csv.gz",
"destination_table_name_template":"table-name",
"access_key_id":"xxxxxxxxxxxxxxx",
"secret_access_key":"xxxxxxxxxxxxxxxx",
"file_format":"CSV",
"max_bad_records":"0",
"ignore_unknown_values":"true",
"field_delimiter":",",
"skip_leading_rows":"1",
"allow_quoted_newlines":"true"
}'
Note, however, that for formats such as CSV and JSON, BigQuery can load uncompressed files significantly faster than compressed files, because uncompressed files can be read in parallel. For more information, refer to this documentation.
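If you have many compressed files under a prefix rather than a single object, the data_path parameter should also accept a wildcard. A sketch under that assumption; the bucket, dataset, and table names are placeholders:
bq mk \
--transfer_config \
--data_source=amazon_s3 \
--display_name=load_gz_from_s3 \
--target_dataset=test_dataset \
--params='{
"data_path":"s3://awsbucket-name/exports/*.csv.gz",
"destination_table_name_template":"table-name",
"access_key_id":"xxxxxxxxxxxxxxx",
"secret_access_key":"xxxxxxxxxxxxxxxx",
"file_format":"CSV",
"skip_leading_rows":"1"
}'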
I have the following code, which is used to run a SQL query on a keyfile located in an S3 bucket. It runs perfectly. My question is: I would rather not have the output simply written out to (and overwriting) a file. Could I see the output on the screen instead (my preference #1)? If not, could I append to the output file rather than overwrite it (my preference #2)? I am using the AWS CLI binaries to run this query. If there is another way, I am happy to try it, as long as it stays within bash.
aws s3api select-object-content \
--bucket "project2" \
--key keyfile1 \
--expression "SELECT * FROM s3object s where Lower(s._1) = 'email#search.com'" \
--expression-type 'SQL' \
--input-serialization '{"CSV": {"FieldDelimiter": ":"}, "CompressionType": "GZIP"}' \
--output-serialization '{"CSV": {"FieldDelimiter": ":"}}' "OutputFile"
Of course, you can use the AWS CLI to do this, since stdout is just a special file on Linux.
aws s3api select-object-content \
--bucket "project2" \
--key keyfile1 \
--expression "SELECT * FROM s3object s where Lower(s._1) = 'email#search.com'" \
--expression-type 'SQL' \
--input-serialization '{"CSV": {"FieldDelimiter": ":"}, "CompressionType": "GZIP"}' \
--output-serialization '{"CSV": {"FieldDelimiter": ":"}}' /dev/stdout
Note the /dev/stdout at the end.
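If you also want your preference #2 (appending instead of overwriting), one option is to keep /dev/stdout as the outfile and pipe the result through tee -a, which prints the rows to the screen and appends them to the file at the same time. A sketch, untested against your data:
aws s3api select-object-content \
--bucket "project2" \
--key keyfile1 \
--expression "SELECT * FROM s3object s where Lower(s._1) = 'email#search.com'" \
--expression-type 'SQL' \
--input-serialization '{"CSV": {"FieldDelimiter": ":"}, "CompressionType": "GZIP"}' \
--output-serialization '{"CSV": {"FieldDelimiter": ":"}}' /dev/stdout | tee -a OutputFile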
The AWS CLI does not offer such options.
However, you are welcome to call the same API via an AWS SDK of your choice instead.
For example, the boto3 Python SDK has a select_object_content() function that returns the data as a stream, which you can then read, manipulate, print, or save however you wish.
I think it opens /dev/stdout twice, causing chaos.
In the official GCP documentation for the built-in image object detection classifier, Step 2 under "Submit a training job" says:
Submit the job:
gcloud ai-platform jobs submit training $JOB_ID \
--region=$REGION \
--config=config.yaml \
...
This is the first reference to "config.yaml" on this page.
Has anyone been able to implement this example?
Here is the code from the above documentation page in full, including a correction on the second line (the original BUCKET_NAME value produced a JOB_DIR starting with gs://gs://, which threw an error):
PROJECT_ID="myapp"
# Original:
#BUCKET_NAME="gs://mybucket/"
# Correction:
BUCKET_NAME="mybucket"
REGION="us-central1"
gcloud config set project $PROJECT_ID
gcloud config set compute/region $REGION
# Set paths to the training and validation data.
TRAINING_DATA_PATH="gs://cloud-samples-data/ai-platform/built-in/image/coco/train*"
VALIDATION_DATA_PATH="gs://cloud-samples-data/ai-platform/built-in/image/coco/val*"
# Specify the Docker container for your built-in algorithm selection.
IMAGE_URI="gcr.io/cloud-ml-algos/image_object_detection:latest"
DATASET_NAME="coco"
ALGORITHM="object_detection"
MODEL_NAME="${DATASET_NAME}_${ALGORITHM}_model"
# Give a unique name to your training job.
DATE="$(date '+%Y%m%d_%H%M%S')"
JOB_ID="${MODEL_NAME}_${DATE}"
# Make sure you have access to this Cloud Storage bucket.
JOB_DIR="gs://${BUCKET_NAME}/algorithms_training/${MODEL_NAME}/${DATE}"
gcloud ai-platform jobs submit training $JOB_ID \
--region=$REGION \
--config=config.yaml \
--job-dir=$JOB_DIR \
-- \
--training_data_path=$TRAINING_DATA_PATH \
--validation_data_path=$VALIDATION_DATA_PATH \
--train_batch_size=64 \
--num_eval_images=500 \
--train_steps_per_eval=2000 \
--max_steps=15000 \
--num_classes=90 \
--warmup_steps=500 \
--initial_learning_rate=0.08 \
--fpn_type="nasfpn" \
--aug_scale_min=0.8 \
--aug_scale_max=1.2
gcloud ai-platform jobs describe $JOB_ID
gcloud ai-platform jobs stream-logs $JOB_ID
Running the above results in the following error:
ERROR: (gcloud.ai-platform.jobs.submit.training) Failed to load YAML from [config.yaml]: Unable to read file [config.yaml]: [Errno 2] No such file or directory: u'config.yaml'
Creating an empty config.yaml produces this error:
ERROR: gcloud crashed (AttributeError): 'NoneType' object has no attribute 'get'
From the gcloud documentation:
Path to the job configuration file. This file should be a YAML
document (JSON also accepted) containing a Job resource as defined in
the API (all fields are optional):
https://cloud.google.com/ml/reference/rest/v1/projects.jobs
I submitted feedback on this page a couple of weeks ago, but haven't heard back and it is still broken.
What content is required in config.yaml to make this work?
Any and all ideas/suggestions are welcome!
I managed to get it working by replacing this command line argument:
--config=config.yaml
With this one:
--master-image-uri $IMAGE_URI
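For reference, the submit command then looks something like this (the variables are the same as in the question; only the --config line is swapped out):
gcloud ai-platform jobs submit training $JOB_ID \
--region=$REGION \
--master-image-uri=$IMAGE_URI \
--job-dir=$JOB_DIR \
-- \
--training_data_path=$TRAINING_DATA_PATH \
--validation_data_path=$VALIDATION_DATA_PATH
# ...plus the remaining algorithm flags (--train_batch_size through --aug_scale_max) exactly as in the question.
As I understand it, --master-image-uri populates the same trainingInput.masterConfig.imageUri field of the Job resource that a config.yaml would, so no separate config file is needed once this flag is supplied.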
I am new to training on Google Cloud.
When I am running the training job, I get the following error:
(gcloud.ml-engine.jobs.submit.training) Could not copy [research/dist/object_detection-0.1.tar.gz] to [training/packages/c5292b23e57f357dc2d63baab473c04337dbadd2deeb10965e743cd8422b964f/object_detection-0.1.tar.gz]. Please retry: HTTPError 404: Not Found
I am using this command to run the training job:
gcloud ml-engine jobs submit training job1 \
--job-dir=gs://${ml-project-neu}/training \
--packages research/dist/object_detection-0.1.tar.gz,research/slim/dist/slim-0.1.tar.gz \
--module-name object_detection.train \
--config cloud.yml \
--runtime-version=1.4 \
-- \
--train_dir=gs://${ml-project-neu}/training \
--pipeline_config_path=gs://${ml-project-neu}/data/faster_rcnn_inception_v2_pets.config
Make sure ${ml-project-neu} is valid (it may be expanding to the empty string in your case), make sure gs://${ml-project-neu} exists, and make sure the credentials you are using with gcloud have access to your GCS bucket (consider running gcloud auth login).
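A quick way to sanity-check all three points before resubmitting, as a sketch:
# See what the shell actually expands the bucket path to
echo "gs://${ml-project-neu}/training"
# Confirm the bucket exists and your current credentials can list it
gsutil ls "gs://${ml-project-neu}"
# Re-authenticate if the listing fails with a permissions error
gcloud auth login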
I am trying to submit a job to gcloud ml-engine. For reference, the job is using this sample provided by Google.
It went through the first time, but with errors unrelated to this question, and now I am trying to reissue the command after having corrected my errors:
gcloud ml-engine jobs submit training $JOB_NAME \
--stream-logs \
--runtime-version 1.0 \
--job-dir $GCS_JOB_DIR \
--module-name trainer.task \
--package-path trainer/ \
--region us-east1 \
-- \
--train-files $TRAIN_GCS_FILE \
--eval-files $EVAL_GCS_FILE \
--train-steps $TRAIN_STEPS
where $JOB_NAME = census. Unfortunately, it seems that I cannot resubmit the job unless I change $JOB_NAME to something like census2, then census3, etc. for every new job.
The following is the error I receive:
ERROR: (gcloud.ml-engine.jobs.submit.training) Project [my-project-name]
is the subject of a conflict: Field: job.job_id Error: A job with this
id already exists.
Is this part of the design to not be able to resubmit using the same job name or I am missing something?
Like Chunck just said, simply try setting JOB_NAME as:
JOB_NAME="census_$(date +%Y%m%d_%H%M%S)"
Not sure if this will help, but in Google's sample code for flowers the error is avoided by appending the date and time to the job id, as shown on line 22, e.g.,
declare -r JOB_ID="flowers_${USER}_$(date +%Y%m%d_%H%M%S)"