It has been suggested in the Amazon docs (http://aws.amazon.com/dynamodb/), among other places, that you can back up your DynamoDB tables using Elastic MapReduce.
I have a general understanding of how this could work, but I couldn't find any guides or tutorials on it.
So my question is: how can I automate DynamoDB backups (using EMR)?
So far, I think I need to create a "streaming" job with a map function that reads the data from DynamoDB and a reduce function that writes it to S3, and I believe these could be written in Python (or Java, or a few other languages).
Any comments, clarifications, code samples, corrections are appreciated.
With the introduction of AWS Data Pipeline, which includes a ready-made template for DynamoDB-to-S3 backups, the easiest way is to schedule a backup in Data Pipeline [link].
If you have special needs (data transformation, very fine-grained control, ...), consider the answer by @greg.
There are some good guides for working with MapReduce and DynamoDB. I followed this one the other day and got data exporting to S3 going reasonably painlessly. I think your best bet would be to create a Hive script that performs the backup task, save it in an S3 bucket, then use the AWS API for your language to programmatically spin up a new EMR job flow and complete the backup. You could set this up as a cron job.
Example of a hive script exporting data from Dynamo to S3:
CREATE EXTERNAL TABLE my_table_dynamodb (
company_id string
,id string
,name string
,city string
,state string
,postal_code string)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
TBLPROPERTIES ("dynamodb.table.name"="my_table","dynamodb.column.mapping" = "id:id,name:name,city:city,state:state,postal_code:postal_code");
CREATE EXTERNAL TABLE my_table_s3 (
company_id string
,id string
,name string
,city string
,state string
,postal_code string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://yourBucket/backup_path/dynamo/my_table';
INSERT OVERWRITE TABLE my_table_s3
SELECT * from my_table_dynamodb;
Here is an example of a PHP script that will spin up a new EMR job flow:
$emr = new AmazonEMR();

$response = $emr->run_job_flow(
    'My Test Job',
    array(
        "TerminationProtected" => "false",
        "HadoopVersion" => "0.20.205",
        "Ec2KeyName" => "my-key",
        "KeepJobFlowAliveWhenNoSteps" => "false",
        "InstanceGroups" => array(
            array(
                "Name" => "Master Instance Group",
                "Market" => "ON_DEMAND",
                "InstanceType" => "m1.small",
                "InstanceCount" => 1,
                "InstanceRole" => "MASTER",
            ),
            array(
                "Name" => "Core Instance Group",
                "Market" => "ON_DEMAND",
                "InstanceType" => "m1.small",
                "InstanceCount" => 1,
                "InstanceRole" => "CORE",
            ),
        ),
    ),
    array(
        "Name" => "My Test Job",
        "AmiVersion" => "latest",
        "Steps" => array(
            array(
                "HadoopJarStep" => array(
                    "Args" => array(
                        "s3://us-east-1.elasticmapreduce/libs/hive/hive-script",
                        "--base-path",
                        "s3://us-east-1.elasticmapreduce/libs/hive/",
                        "--install-hive",
                        "--hive-versions",
                        "0.7.1.3",
                    ),
                    "Jar" => "s3://us-east-1.elasticmapreduce/libs/script-runner/script-runner.jar",
                ),
                "Name" => "Setup Hive",
                "ActionOnFailure" => "TERMINATE_JOB_FLOW",
            ),
            array(
                "HadoopJarStep" => array(
                    "Args" => array(
                        "s3://us-east-1.elasticmapreduce/libs/hive/hive-script",
                        "--base-path",
                        "s3://us-east-1.elasticmapreduce/libs/hive/",
                        "--hive-versions",
                        "0.7.1.3",
                        "--run-hive-script",
                        "--args",
                        "-f",
                        "s3n://myBucket/hive_scripts/hive_script.hql",
                        "-d",
                        "INPUT=Var_Value1",
                        "-d",
                        "LIB=Var_Value2",
                        "-d",
                        "OUTPUT=Var_Value3",
                    ),
                    "Jar" => "s3://us-east-1.elasticmapreduce/libs/script-runner/script-runner.jar",
                ),
                "Name" => "Run Hive Script",
                "ActionOnFailure" => "CANCEL_AND_WAIT",
            ),
        ),
        "LogUri" => "s3n://myBucket/logs",
    )
);
AWS Data Pipeline is costly, and the complexity of managing a templated process cannot compare to the simplicity of a CLI command you can modify and run on a schedule (using cron, TeamCity, or your CI tool of choice).
Amazon promotes Data Pipeline because they make a profit on it. I'd say it only really makes sense if you have a very large database (>3 GB), where the performance improvement justifies it.
For small and medium databases (1 GB or less) I'd recommend one of the many tools available; all three below can handle backup and restore from the command line:
dynamo-backup-to-s3 ==> Streaming backup/restore via S3, using NodeJS/npm
SEEK-Jobs dynamotools ==> Streaming backup/restore via S3, using Golang
dynamodump ==> Local backup/restore using Python; upload/download to S3 using aws s3 cp
Bear in mind that, due to bandwidth/latency issues, these will always perform better from an EC2 instance than from your local network.
With the introduction of DynamoDB Streams and Lambda - you should be able to take backups and incremental backups of your DynamoDB data.
You can associate your DynamoDB Stream with a Lambda Function to automatically trigger code for every data update (Ie: data to another store like S3)
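As a rough illustration (my own sketch, not the replicator linked below), a minimal Lambda handler that copies each stream record to S3 could look like this; the bucket name and key scheme are placeholders:
import json
import boto3

s3 = boto3.client("s3")
BUCKET = "my-backup-bucket"  # placeholder bucket name

def handler(event, context):
    # Write each DynamoDB stream record to S3 as a small JSON object.
    for record in event["Records"]:
        table_name = record["eventSourceARN"].split("/")[1]
        s3_key = "incremental/{}/{}.json".format(table_name, record["dynamodb"]["SequenceNumber"])
        body = {
            "eventName": record["eventName"],                # INSERT, MODIFY or REMOVE
            "keys": record["dynamodb"]["Keys"],
            "newImage": record["dynamodb"].get("NewImage"),  # present if the stream view type includes new images
        }
        s3.put_object(Bucket=BUCKET, Key=s3_key, Body=json.dumps(body))
    return {"processed": len(event["Records"])}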
A Lambda function you can use with DynamoDB for incremental backups:
https://github.com/PageUpPeopleOrg/dynamodb-replicator
I've provided a detailed explanation of how you can use DynamoDB Streams, Lambda, and S3 versioned buckets to create incremental backups for your data in DynamoDB on my blog:
https://www.abhayachauhan.com/category/aws/dynamodb/dynamodb-backups
Edit:
As of Dec 2017, DynamoDB has released On Demand Backups/Restores. This allows you to take backups and store them natively in DynamoDB. They can be restored to a new table.
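As a quick sketch (not the scheduling code in the walkthrough linked below), an on-demand backup can be triggered from boto3 like this; the table name is a placeholder:
import boto3
from datetime import datetime, timezone

dynamodb = boto3.client("dynamodb")

def backup_table(table_name):
    # Create a native DynamoDB on-demand backup with a timestamped name.
    backup_name = "{}-{}".format(table_name, datetime.now(timezone.utc).strftime("%Y%m%d%H%M%S"))
    response = dynamodb.create_backup(TableName=table_name, BackupName=backup_name)
    return response["BackupDetails"]["BackupArn"]

if __name__ == "__main__":
    print(backup_table("my-table"))  # placeholder table name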
A detailed walk through is provided here, including code to schedule them:
https://www.abhayachauhan.com/2017/12/dynamodb-scheduling-on-demand-backups
HTH
You can use my simple node.js script dynamo-archive.js, which scans an entire Dynamo table and saves output to a JSON file. Then, you upload it to S3 using s3cmd.
You can use the handy dynamodump tool, which is Python-based (uses boto), to dump the tables to JSON files and then upload them to S3 with s3cmd.
I found the dynamodb-backup Lambda function to be really helpful. It took me 5 minutes to set up and can easily be configured to use a CloudWatch schedule event (don't forget to run npm install at the beginning, though).
Coming from Data Pipeline (~$40 per month), it's also a lot cheaper for me; I estimate the cost at around 1.5 cents per month (both excluding S3 storage). Note that it backs up all DynamoDB tables at once by default, which can easily be adjusted within the code.
The only missing part is being notified if the function fails, which Data Pipeline was able to do.
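One way to close that gap (my own suggestion, not part of the linked function) is a CloudWatch alarm on the function's Errors metric that notifies an SNS topic; the names below are placeholders:
import boto3

cloudwatch = boto3.client("cloudwatch")

# Alarm whenever the backup function reports any error in a 5-minute window.
cloudwatch.put_metric_alarm(
    AlarmName="dynamodb-backup-failed",  # placeholder alarm name
    Namespace="AWS/Lambda",
    MetricName="Errors",
    Dimensions=[{"Name": "FunctionName", "Value": "dynamodb-backup"}],  # placeholder function name
    Statistic="Sum",
    Period=300,
    EvaluationPeriods=1,
    Threshold=0,
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:backup-alerts"],  # placeholder SNS topic ARN
)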
AWS Data Pipeline is only available in a limited set of regions.
It took me 2 hours to debug the template.
https://docs.aws.amazon.com/general/latest/gr/rande.html#datapipeline_region
You can now back up your DynamoDB data straight to S3 natively, without using Data Pipeline or writing custom scripts. This is probably the easiest way to achieve what you want: it does not require you to write any code or run any task/script, because it's fully managed.
Since 2020 you can export a DynamoDB table to S3 directly in the AWS UI:
https://aws.amazon.com/blogs/aws/new-export-amazon-dynamodb-table-data-to-data-lake-amazon-s3/
You need to activate PITR (Point in Time Recovery) first. You can choose between JSON and Amazon ION format.
In the Java SDK (Version 2), you can do something like this:
// first activate PITR
PointInTimeRecoverySpecification pointInTimeRecoverySpecification = PointInTimeRecoverySpecification
.builder()
.pointInTimeRecoveryEnabled(true)
.build();
UpdateContinuousBackupsRequest updateContinuousBackupsRequest = UpdateContinuousBackupsRequest
.builder()
.tableName(myTable.getName())
.pointInTimeRecoverySpecification(pointInTimeRecoverySpecification)
.build();
try {
    UpdateContinuousBackupsResponse updateContinuousBackupsResponse =
            dynamoDbClient.updateContinuousBackups(updateContinuousBackupsRequest);
    String updatedPointInTimeRecoveryStatus = updateContinuousBackupsResponse
            .continuousBackupsDescription()
            .pointInTimeRecoveryDescription()
            .pointInTimeRecoveryStatus()
            .toString();
    log.info("Point in Time Recovery for Table {} activated: {}", myTable.getName(),
            updatedPointInTimeRecoveryStatus);
} catch (Exception e) {
    log.error("Point in Time Recovery Activation failed: {}", e.getMessage());
}

// ... now get the table ARN
DescribeTableRequest describeTableRequest = DescribeTableRequest
        .builder()
        .tableName(myTable.getName())
        .build();
DescribeTableResponse describeTableResponse = dynamoDbClient.describeTable(describeTableRequest);
String tableArn = describeTableResponse.table().tableArn();

String s3Bucket = "myBucketName";

// choose the format (JSON or ION)
ExportFormat exportFormat = ExportFormat.ION;
ExportTableToPointInTimeRequest exportTableToPointInTimeRequest = ExportTableToPointInTimeRequest
        .builder()
        .tableArn(tableArn)
        .s3Bucket(s3Bucket)
        .s3Prefix(myTable.getS3Prefix())
        .exportFormat(exportFormat)
        .build();

dynamoDbClient.exportTableToPointInTime(exportTableToPointInTimeRequest);
Your dynamoDbClient needs to be an instance of software.amazon.awssdk.services.dynamodb.DynamoDbClient; the DynamoDbEnhancedClient or DynamoDbEnhancedAsyncClient will not work.
I am doing this manually: exporting, uploading the dump to an S3 bucket, and deleting the old dumps. Can someone help me automate it?
1) Script to export the schema ICO_AV_PRD_OWR:
DECLARE
  hdnl NUMBER;
BEGIN
  hdnl := DBMS_DATAPUMP.OPEN(operation => 'EXPORT', job_mode => 'SCHEMA', job_name => null, version => 12);
  DBMS_DATAPUMP.ADD_FILE(handle => hdnl, filename => 'dump.dmp', directory => 'DATA_PUMP_DIR', filetype => dbms_datapump.ku$_file_type_dump_file);
  DBMS_DATAPUMP.ADD_FILE(handle => hdnl, filename => 'dump.log', directory => 'DATA_PUMP_DIR', filetype => dbms_datapump.ku$_file_type_log_file);
  DBMS_DATAPUMP.METADATA_FILTER(hdnl, 'SCHEMA_EXPR', 'IN (''schemaname'')');
  DBMS_DATAPUMP.START_JOB(hdnl);
END;
/
2) Copy the dump to the S3 bucket:
set lines 399 pages 999
col filename for a45
-- list the files in DATA_PUMP_DIR
select * from table(RDSADMIN.RDS_FILE_UTIL.LISTDIR('DATA_PUMP_DIR')) order by mtime;
SELECT rdsadmin.rdsadmin_s3_tasks.upload_to_s3(
p_bucket_name => 'bucketname',
p_directory_name => 'DATA_PUMP_DIR')
AS TASK_ID FROM DUAL;
3) Remove the dumps from RDS:
exec utl_file.fremove('DATA_PUMP_DIR','dump.dmp');
exec utl_file.fremove('DATA_PUMP_DIR','dump.log');
You can use a Lambda function scheduled to run on an event generated by CloudWatch. The Lambda function is initiated at the scheduled interval by an Amazon CloudWatch Events rule and runs the query you mentioned in your question.
To write the Lambda function you can use a language of your choice, say Python.
You can use a date-time value in p_s3_prefix to organize the files in S3. That will help you delete older files, say dump files that were uploaded 3 days ago.
SELECT rdsadmin.rdsadmin_s3_tasks.upload_to_s3(
p_bucket_name => 'mys3bucket',
p_prefix => '',
p_s3_prefix => 'date-time/',
p_directory_name => 'DATA_PUMP_DIR')
AS TASK_ID FROM DUAL;
Add logic to replace 'date-time' with the actual date and time, as sketched below.
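For illustration, here is a rough sketch of such a Lambda handler in Python, assuming network access to the RDS instance and the cx_Oracle driver (plus Oracle client libraries) bundled as a layer; the endpoint, credentials and bucket name are all placeholders:
from datetime import datetime, timezone
import cx_Oracle  # must be bundled with the Lambda deployment package or a layer

# Placeholder connection details; in practice read them from environment variables or Secrets Manager.
DSN = cx_Oracle.makedsn("my-rds-endpoint.amazonaws.com", 1521, service_name="ORCL")

def handler(event, context):
    # Kick off the RDS-to-S3 upload with a timestamped S3 prefix.
    prefix = datetime.now(timezone.utc).strftime("%Y-%m-%d-%H%M%S") + "/"
    with cx_Oracle.connect(user="admin", password="secret", dsn=DSN) as conn:
        with conn.cursor() as cursor:
            cursor.execute("""
                SELECT rdsadmin.rdsadmin_s3_tasks.upload_to_s3(
                         p_bucket_name    => 'mys3bucket',
                         p_prefix         => '',
                         p_s3_prefix      => :s3_prefix,
                         p_directory_name => 'DATA_PUMP_DIR')
                FROM DUAL
            """, s3_prefix=prefix)
            task_id = cursor.fetchone()[0]
    return {"task_id": task_id, "s3_prefix": prefix}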
You can see a step-by-step guide here, as suggested by AWS.
My S3 bucket is organised with this hierarchy, storing parquet files: <folder-name>/year=<yyyy>/month=<mm>/day=<dd>/<filename>.parquet
Manual Fix
For a particular date (i.e. a single parquet file), I did some manual fixes:
Downloaded the parquet file and read it as a pandas DataFrame
Updated some values, while the columns remained unchanged
Saved the pandas DataFrame back to a parquet file with the same filename
Uploaded it back to the same S3 bucket sub-folder
PS: I seem to have deleted the parquet file on S3 once, leaving an empty sub-folder.
Then I re-ran the Glue crawler, pointing it at <folder-name>/. Unfortunately, data for this particular date is missing from the Athena table.
After the crawler finished running, the notification was as follows:
Crawler <my-table-name> completed and made the following changes: 0 tables created, 0 tables updated. See the tables created in database <my-databse-name>.
Is there anything I have misconfigured in my Glue crawler? Thanks
Glue Crawler Config
Schema updates in the data store: Update the table definition in the data catalog.
Inherit schema from table: Update all new and existing partitions with metadata from the table.
Object deletion in the data store: Delete tables and partitions from the data catalog.
Crawler Log in CloudWatch
BENCHMARK : Running Start Crawl for Crawler <my-table-name>
BENCHMARK : Classification complete, writing results to database <my-database-name>
INFO : Crawler configured with Configuration
{
"Version": 1,
"CrawlerOutput": {
"Partitions": {
"AddOrUpdateBehavior": "InheritFromTable"
}
},
"Grouping": {
"TableGroupingPolicy": "CombineCompatibleSchemas"
}
}
and SchemaChangePolicy
{
"UpdateBehavior": "UPDATE_IN_DATABASE",
"DeleteBehavior": "DELETE_FROM_DATABASE"
}
. Note that values in the Configuration override values in the SchemaChangePolicy for S3 Targets.
BENCHMARK : Finished writing to Catalog
BENCHMARK : Crawler has finished running and is in state READY
If you are reading from or writing to S3 buckets, the bucket name should have the aws-glue* prefix for Glue to access the buckets, assuming you are using the preconfigured AWSGlueServiceRole IAM role. You can try adding the aws-glue prefix to the names of the folders.
I had the same problem. Check the inline policy of your IAM role. You should have something like this when you specify the bucket:
"Resource": [
"arn:aws:s3:::bucket/object*"
]
When the crawler didn't work, I instead had the following:
"Resource": [
"arn:aws:s3:::bucket/object"
]
I really liked BigQuery's Data Transfer Service. I have flat files in the exact schema waiting to be loaded into BQ. It would have been awesome to just set up a DTS schedule that picked up GCS files matching a pattern and loaded them into BQ. I like the built-in option to delete source files after copy and to email in case of trouble. But the biggest bummer is that the minimum interval is 60 minutes. That is crazy. I could have lived with a 10-minute delay, perhaps.
So if I set up the DTS to be on demand, how can I invoke it from an API? I am thinking of creating a cron job that calls it on demand every 10 minutes, but I can't figure out from the docs how to call it.
Also, what is my second-best, most reliable and cheapest way of moving GCS files (no ETL needed) into BQ tables that match the exact schema? Should I use Cloud Scheduler, Cloud Functions, Dataflow, Cloud Run, etc.?
If I use a Cloud Function, how can I submit all the files in my GCS bucket at the time of invocation as one BQ load job?
Lastly, anyone know if DTS will lower the limit to 10 mins in future?
So if I set up the DTS to be on demand, how can I invoke it from an API? I am thinking of creating a cron job that calls it on demand every 10 minutes, but I can't figure out from the docs how to call it.
StartManualTransferRuns is part of the RPC library but does not have a REST API equivalent as of now. How to use that will depend on your environment. For instance, you can use the Python Client Library (docs).
As an example, I used the following code (you'll need to run pip install google-cloud-bigquery-datatransfer for the dependencies):
import time
from google.cloud import bigquery_datatransfer_v1
from google.protobuf.timestamp_pb2 import Timestamp
client = bigquery_datatransfer_v1.DataTransferServiceClient()
PROJECT_ID = 'PROJECT_ID'
TRANSFER_CONFIG_ID = '5e6...7bc' # alphanumeric ID you'll find in the UI
parent = client.project_transfer_config_path(PROJECT_ID, TRANSFER_CONFIG_ID)
start_time = bigquery_datatransfer_v1.types.Timestamp(seconds=int(time.time() + 10))
response = client.start_manual_transfer_runs(parent, requested_run_time=start_time)
print(response)
Note that you'll need to use the right Transfer Config ID and the requested_run_time has to be of type bigquery_datatransfer_v1.types.Timestamp (for which there was no example in the docs). I set a start time 10 seconds ahead of the current execution time.
You should get a response such as:
runs {
name: "projects/PROJECT_NUMBER/locations/us/transferConfigs/5e6...7bc/runs/5e5...c04"
destination_dataset_id: "DATASET_NAME"
schedule_time {
seconds: 1579358571
nanos: 922599371
}
...
data_source_id: "google_cloud_storage"
state: PENDING
params {
...
}
run_time {
seconds: 1579358581
}
user_id: 28...65
}
and the transfer is triggered as expected (never mind the error).
Also, what is my second-best, most reliable and cheapest way of moving GCS files (no ETL needed) into BQ tables that match the exact schema? Should I use Cloud Scheduler, Cloud Functions, Dataflow, Cloud Run, etc.?
With the snippet above you can set up a cron job to execute your function every ten minutes. As discussed in the comments, the minimum interval is 60 minutes, so it won't pick up files less than one hour old (docs).
Apart from that, this is not a very robust solution and here come into play your follow-up questions. I think these might be too broad to address in a single StackOverflow question but I would say that, for on-demand refresh, Cloud Scheduler + Cloud Functions/Cloud Run can work very well.
Dataflow would be best if you needed ETL but it has a GCS connector that can watch a file pattern (example). With this you would skip the transfer, set the watch interval and the load job triggering frequency to write the files into BigQuery. VM(s) would be running constantly in a streaming pipeline as opposed to the previous approach but a 10-minute watch period is possible.
If you have complex workflows/dependencies, Airflow has recently introduced operators to start manual runs.
If I use a Cloud Function, how can I submit all the files in my GCS bucket at the time of invocation as one BQ load job?
You can use wildcards to match a file pattern when you create the transfer.
Also, this can be done on a file-by-file basis using Pub/Sub notifications for Cloud Storage to trigger a Cloud Function.
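As a rough sketch of that Cloud Function approach (the destination table is a placeholder, the file format is assumed to be CSV matching the table schema, and the function is assumed to be triggered by a GCS finalize event):
from google.cloud import bigquery

client = bigquery.Client()
TABLE_ID = "my-project.my_dataset.my_table"  # placeholder destination table

def load_gcs_file(event, context):
    # Background Cloud Function triggered by a GCS 'finalize' event; loads the new file into BigQuery.
    uri = "gs://{}/{}".format(event["bucket"], event["name"])
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,  # adjust to your file format
        skip_leading_rows=1,
        write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    )
    load_job = client.load_table_from_uri(uri, TABLE_ID, job_config=job_config)
    load_job.result()  # wait for the load job to complete
    print("Loaded {} into {}".format(uri, TABLE_ID))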
Lastly, anyone know if DTS will lower the limit to 10 mins in future?
There is already a Feature Request here. Feel free to star it to show your interest and receive updates
You can now easily trigger a manual run of a BigQuery data transfer using the REST API:
HTTP request
POST https://bigquerydatatransfer.googleapis.com/v1/{parent=projects/*/locations/*/transferConfigs/*}:startManualRuns
For the {parent=projects/*/locations/*/transferConfigs/*} part, check the CONFIGURATION tab of your transfer and use the resource name shown there.
More here:
https://cloud.google.com/bigquery-transfer/docs/reference/datatransfer/rest/v1/projects.locations.transferConfigs/startManualRuns
Following Guillem's answer and the API updates, this is my new code:
import time
from google.cloud.bigquery import datatransfer_v1
from google.protobuf.timestamp_pb2 import Timestamp
client = datatransfer_v1.DataTransferServiceClient()
config = '34y....654'
PROJECT_ID = 'PROJECT_ID'
TRANSFER_CONFIG_ID = config
parent = client.transfer_config_path(PROJECT_ID, TRANSFER_CONFIG_ID)
start_time = Timestamp(seconds=int(time.time()))
request = datatransfer_v1.types.StartManualTransferRunsRequest(
{ "parent": parent, "requested_run_time": start_time }
)
response = client.start_manual_transfer_runs(request, timeout=360)
print(response)
For this to work, you need to know the correct TRANSFER_CONFIG_ID.
In my case, I wanted to list all the BigQuery scheduled queries to get a specific ID. You can do it like this:
# Put your project ID here
PROJECT_ID = 'PROJECT_ID'
from google.cloud import bigquery_datatransfer_v1
bq_transfer_client = bigquery_datatransfer_v1.DataTransferServiceClient()
parent = bq_transfer_client.project_path(PROJECT_ID)
# Iterate over all results
for element in bq_transfer_client.list_transfer_configs(parent):
# Print Display Name for each Scheduled Query
print(f'[Schedule Query Name]:\t{element.display_name}')
# Print name of all elements (it contains the ID)
print(f'[Name]:\t\t{element.name}')
# Extract the IDs:
TRANSFER_CONFIG_ID= element.name.split('/')[-1]
print(f'[TRANSFER_CONFIG_ID]:\t\t{TRANSFER_CONFIG_ID}')
# You can print the entire element for debug purposes
print(element)
We had a third party create a Python-based image thumbnail script that we set up to trigger on an S3 ObjectCreated event. We then imported a collection of close to 5,000 images after testing the script, but the sheer volume of image files ended up filling the Lambda temp space during the import, and only about 12% of the images ended up having thumbnails created for them.
We need to manually create thumbnails for the other 88%. While I have a PHP-based script I can run from EC2, it's somewhat slow. It occurs to me that I could create them 'on demand' and avoid having to create thumbnails for all of the files that weren't auto-created during the import.
Some of the files may never be accessed again by a customer. The existing Lambda thumbnailer already has a slight delay that I account for in a JavaScript setTimeout retry loop, but before invoking this loop I could conceivably check whether the upload is recent (e.g. within the last 10 seconds) whenever a thumbnail is not found, and then trigger the Lambda manually before starting the retry loop.
But to do this, I need the ability to trigger the Lambda script with parameters similar to the event trigger. It appears as though their script only accesses the bucket name and key from the event values:
bucket = event['Records'][0]['s3']['bucket']['name']
key = urllib.parse.unquote_plus(event['Records'][0]['s3']['object']['key'], encoding='utf-8')
Being unfamiliar with Lambda and still somewhat new to the SDK, I am not sure how to trigger the Lambda function with those values for the Python script.
I can use either the PHP SDK or the JavaScript SDK (or even the CLI).
Any help is appreciated.
I think I figured it out: I copied the data structure from the Python references to create a bare-bones payload and triggered it as an event:
$lambda = $awsSvc->getAwsSdkCached()->createLambda();

// bucket = event['Records'][0]['s3']['bucket']['name']
// key = urllib.parse.unquote_plus(event['Records'][0]['s3']['object']['key'], encoding='utf-8')
$bucket = "mybucket";
$key = "somefolder/someimage.jpg";
$payload_json = sprintf('{"Records":[{"s3":{"bucket":{"name":"%s"},"object":{"key":"%s"}}}]}', $bucket, $key);

$params = array(
    'FunctionName' => 'ThumbnailGenerator',
    'InvocationType' => 'Event',
    'LogType' => 'Tail',
    'Payload' => $payload_json
);

$result = $lambda->invoke($params);
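For reference, the same minimal payload can be sent from Python with boto3 (the function name, bucket and key are placeholders), or via aws lambda invoke on the CLI:
import json
import boto3

lambda_client = boto3.client("lambda")

# Minimal S3-event-shaped payload containing only the fields the thumbnail script reads.
payload = {
    "Records": [
        {"s3": {"bucket": {"name": "mybucket"}, "object": {"key": "somefolder/someimage.jpg"}}}
    ]
}

response = lambda_client.invoke(
    FunctionName="ThumbnailGenerator",  # placeholder function name
    InvocationType="Event",             # asynchronous, like the PHP example
    Payload=json.dumps(payload),
)
print(response["StatusCode"])  # 202 for asynchronous invocations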
I used CreateImageRequest to take a snapshot of a running EC2 machine. When I log into the EC2 console I see the following:
AMI - An image that I can launch
Volume - I believe that this is the disk image?
Snapshot - Another entry related to the snapshot?
Can anyone explain the difference in usage of each of these? For example, is there any way to create a 'snapshot' without also having an associated 'AMI', and in that case how do I launch an EBS-backed copy of this snapshot?
Finally, is there a simple API to delete an AMI and all associated data (snapshot, volume and AMI). It turns out that our scripts only store the AMI identifier, and not the rest of the data, and so it seems that that's only enough information to just Deregister an image.
The AMI represents the launchable machine configuration - it does NOT actually contain any of the machine's data, just references to it. An AMI can get its disk image either from S3 or (in your case) an EBS snapshot.
The EBS Volume is associated with a running instance. It's basically a read-write disk image. When you terminate the instance, the volume will automatically be destroyed (this may take a few minutes, note).
The snapshot is a frozen image of the EBS volume at the point in time when you created the AMI. Snapshots can be associated with AMIs, but not all snapshots are part of an AMI - you can create them manually too.
More information on EBS-backed AMIs can be found in the user's guide. It is important to have a good grasp on these concepts, so I would recommend giving the entire users guide a good read-over before going any further.
If you want to delete all data associated with an AMI, you will have to use the DescribeImageAttribute API call on the AMI's blockDeviceMapping attribute to find the snapshot ID; then delete the AMI and snapshot, in that order.
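For example, a minimal boto3 sketch of that clean-up (using DescribeImages rather than DescribeImageAttribute; the AMI ID is a placeholder):
import boto3

ec2 = boto3.client("ec2")
ami_id = "ami-0123456789abcdef0"  # placeholder AMI ID

# Look up the snapshot(s) referenced by the AMI's block device mappings.
image = ec2.describe_images(ImageIds=[ami_id])["Images"][0]
snapshot_ids = [
    mapping["Ebs"]["SnapshotId"]
    for mapping in image.get("BlockDeviceMappings", [])
    if "Ebs" in mapping
]

# Deregister the AMI first, then delete its snapshots.
ec2.deregister_image(ImageId=ami_id)
for snapshot_id in snapshot_ids:
    ec2.delete_snapshot(SnapshotId=snapshot_id)
    print("Deleted snapshot", snapshot_id)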
This small PowerShell script takes the AMI parameter (stored in a variable), grabs the snapshots of the given AMI ID by storing them in an array, and finally performs the required clean-up (unregisters the AMI and removes the snapshots).
# Unregister and clean AMI snapshots
$amiName = 'ami-XXXX' # replace this with the AMI ID you need to clean-up
$myImage = Get-EC2Image $amiName
$count = $myImage[0].BlockDeviceMapping.Count
# Loop and store snapshotID(s) to an array
$mySnaps = @()
for ($i=0; $i -lt $count; $i++)
{
$snapId = $myImage[0].BlockDeviceMapping[$i].Ebs | foreach {$_.SnapshotId}
$mySnaps += $snapId
}
# Perform the clean up
Write-Host "Unregistering" $amiName
Unregister-EC2Image $amiName
foreach ($item in $mySnaps)
{
Write-Host 'Removing' $item
Remove-EC2Snapshot $item
}
Clear-Variable mySnaps