I would like to move my MarkLogic backup directory to the cloud. Are there any limitations to pushing full and incremental backups, with journal archiving enabled, to S3-compatible object storage? Is journal archiving supported on S3-compatible cloud storage? What would happen if I pointed a backup with journal archiving enabled at S3 storage? Will it eventually work, or will I get errors?
Also, please provide a link to the documentation for configuring MarkLogic to point to cloud storage.
You can back up to S3, but if you want journal archiving enabled, the journals will need to be written to a different location. Journal archiving is not supported on S3.
The default location for journals is inside the backup directory, but when creating the backup programmatically you can specify a different $journal-archive-path.
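As a concrete illustration, here is a minimal sketch of kicking off such a backup programmatically (the database name, bucket, and paths are hypothetical placeholders, and the exact xdmp:database-backup argument list should be checked against the docs for your MarkLogic version):

xquery version "1.0-ml";
(: full backup of the Documents database to S3, with journal archiving enabled
   but the journal archive written to a locally mounted path instead of S3 :)
xdmp:database-backup(
  xdmp:database-forests(xdmp:database("Documents")),
  "s3://my-backup-bucket/marklogic/",   (: hypothetical S3 backup target :)
  fn:true(),                            (: journal archiving on :)
  "/space/journal-archive/"             (: non-S3 $journal-archive-path :)
)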
Backing Up a Database
The directory you specified can be an operating system mounted directory path, it can be an HDFS path, or it can be an S3 path. For details on using HDFS and S3 storage in MarkLogic, see Disk Storage Considerations in the Query Performance and Tuning Guide.
S3 and MarkLogic
Storage on S3 has an 'eventual consistency' property, meaning that write operations might not be available immediately for reading, but they will be available at some point. Because of this, S3 data directories in MarkLogic have a restriction that MarkLogic does not create Journals on S3. Therefore, MarkLogic recommends that you use S3 only for backups and for read-only forests, otherwise you risk the possibility of data loss. If your forests are read-only, then there is no need to have journals.
I need to configure MarkLogic full/incremental backups to an S3 bucket. Is that possible? Can anyone share the documentation/steps to configure it?
Thanks!
Yes, you can back up to S3.
You will need to configure the S3 credentials so that MarkLogic is able to read and write objects in your S3 bucket.
MarkLogic can't use S3 for journal archive paths, because S3 does not support file append operations. So if you want to enable journal archives, you will need to specify a custom path for that when creating your backups.
Backing Up a Database
The directory you specified can be an operating system mounted directory path, it can be an HDFS path, or it can be an S3 path. For details on using HDFS and S3 storage in MarkLogic, see Disk Storage Considerations in the Query Performance and Tuning Guide.
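Since the question asks about both full and incremental backups, here is a minimal sketch of an incremental backup started programmatically (it assumes a full backup already exists under the same backup path; the names are placeholders, and the xdmp:database-incremental-backup signature should be checked against the docs for your MarkLogic version):

xquery version "1.0-ml";
(: incremental backup of the Documents database, taken against an earlier
   full backup in the same S3 backup directory :)
xdmp:database-incremental-backup(
  xdmp:database-forests(xdmp:database("Documents")),
  "s3://my-backup-bucket/marklogic/"   (: hypothetical S3 backup target :)
)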
S3 Storage
S3 requires authentication with the following S3 credentials:
AWS Access Key
AWS Secret Key
The S3 credentials for a MarkLogic cluster are stored in the security database for the cluster. You can only have one set of S3 credentials per cluster. Once you set up security access in S3, you can access any paths that are allowed access by those credentials. Because of the flexibility of how you can set up access in S3, you can set up any S3 account to allow access to any other account, so if you want to allow the credentials you have set up in MarkLogic to access S3 paths owned by other S3 users, those users need to grant access to those paths to the AWS Access Key set up in your MarkLogic cluster.
To set up the AWS credentials for a cluster, enter the keys in the Admin Interface under Security > Credentials. You can also set up the keys programmatically using the following Security API functions:
sec:credentials-get-aws
sec:credentials-set-aws
The credentials are stored in the Security database. Therefore, you cannot use S3 as the forest storage for a security database.
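For example, a minimal sketch of setting the cluster's S3 credentials with the Security API (the key values are placeholders; the call has to be evaluated against the Security database, e.g. by selecting it as the content source in Query Console, and the exact sec:credentials-set-aws signature should be checked against the Security API docs):

xquery version "1.0-ml";
import module namespace sec = "http://marklogic.com/xdmp/security"
  at "/MarkLogic/security.xqy";

(: stores the AWS key pair in the Security database for the whole cluster :)
sec:credentials-set-aws(
  "AKIAXXXXXXXXXXXXXXXX",   (: placeholder AWS Access Key :)
  "placeholder-secret-key"  (: placeholder AWS Secret Key :)
)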
I have a lot of files located in Amazon Glacier (the pre-S3 version), and have been uploading files using FastGlacier for years. I would now like to move these files to the S3 Glacier Deep Archive storage class to take advantage of better pricing, and am trying to figure out the best way to do so. (It is my understanding that the pre-S3 Glacier does not offer the Deep Archive storage class, but I am happy to be wrong.)
Is there any way I can restore my files from Glacier directly to an S3 account/bucket/whatever, so I can avoid the bandwidth usage associated with downloading everything to my house, only to re-upload it to the cloud? Or is that my only option?
No.
The original Amazon Glacier service was great for its low cost, but was difficult to use. Most requests required you to come back minutes (or hours!) later to retrieve the result. It was almost impossible to use without a tool, such as the one you mentioned.
Then the Amazon S3 team introduced the ability to specify Glacier as a storage class in S3, and they would take care of the difficult bits. This gave the low cost a better interface.
More recently, the Glacier Deep Archive storage class brought even lower costs than Glacier itself, so there's hardly any reason to use Glacier directly any more. Unless, of course, your files are still in there, as is your case.
Unfortunately, there is no mechanism to move from "old Glacier" to "S3 Glacier". You would need to extract the archives and then upload them to S3 (either specifying Glacier Deep Archive as the storage class or using S3 Lifecycle rules to change the storage class). You would need to do this yourself, preferably from Amazon EC2 to make things faster and avoid Data Transfer charges. Perhaps you could put your FastGlacier tool on a Windows EC2 instance and do it from there?
You can fire up an EC2 or Lightsail instance, with a public IP address, in the same AWS region as your Glacier vault and new bucket and do all the download/upload from there, avoiding the bandwidth costs since the charges are for the traffic to leave the AWS region -- and with a compute instance in the same region, that wouldn't apply.
There is no mechanism for directly transferring content from one service to the other, and you are correct... the legacy Glacier service does not appear to support deep archive.
You can now use Amazon S3 Glacier Re:Freezer. https://aws.amazon.com/about-aws/whats-new/2021/04/new-aws-solutions-implementation-amazon-s3-glacier-re-freezer/
It is a serverless solution that automatically copies entire Amazon S3 Glacier vault archives to a defined destination Amazon Simple Storage Service (Amazon S3) bucket and S3 storage class. The solution automates the optimized restore, copy, and transfer process and provides a prebuilt Amazon CloudWatch dashboard to visualize the copy operation progress. Deploying this solution allows you to seamlessly copy your S3 Glacier vault archives to more cost effective storage locations such as the Amazon S3 Glacier Deep Archive storage class.
We have a site where users upload files, some of them quite large. We've got multiple EC2 instances and would like to load balance them. Currently, we store the files on an EBS volume for fast access. What's the best way to replicate the files so they can be available on more than one instance?
My thought is that some automatic replication process that uploads the files to S3, and then automatically downloads them to other EC2 instances would be ideal.
EBS snapshots won't work because they replicate the entire volume, and we need to be able to replicate the directories of individual customers on demand.
You could write a shell script that spawns s3cmd to sync your local filesystem with an S3 bucket whenever a new file is uploaded (or deleted). It would look something like:
# assumes s3cmd is already configured; --delete-removed also propagates local deletions
s3cmd sync --delete-removed ./ s3://your-bucket/
It depends on what OS you are running on your EC2 instances. There isn't really any need to add S3 to the mix unless you want to store the files there for some other reason (like backup).
If you are running *nix the classic choice might be to run rsync and just sync between instances.
On Windows you could still use rsync or else SyncToy from Microsoft is a simple free option. Otherwise there are probably hundreds of commercial applications in this space...
If you do want to sync to S3 then I would suggest one of the S3 client apps like CloudBerry or JungleDisk, which both have sync functionality...
If you are running Windows it's also worth considering DFS (Distributed File System) which provides replication and is part of Windows Server...
The best way is to use the Amazon CloudFront service. All of the replication is managed as part of AWS. Content is served from several different availability zones, but this does not require you to have EBS volumes in those zones.
Amazon CloudFront delivers your static and streaming content using a global network of edge locations. Requests for your objects are automatically routed to the nearest edge location, so content is delivered with the best possible performance.
http://aws.amazon.com/cloudfront/
Two ways:
1. Forget EBS: transfer the files to S3 and use S3 as your file manager instead of EBS; add CloudFront and use the same link everywhere.
2. Mount the S3 bucket on all of the machines.
1. Amazon CloudFront is a web service for content delivery. It delivers your static and streaming content using a global network of edge locations.
http://aws.amazon.com/cloudfront/
2. You can mount the S3 bucket on your Linux machine. See below:
s3fs (http://code.google.com/p/s3fs/wiki/InstallationNotes) - this worked for me. It uses a FUSE file system plus rsync to sync the files in S3. It keeps a copy of all filenames in the local system and makes them look like files/folders.
That way you can share the S3 bucket on different machines.
I am working on a Java MapReduce app that has to be able to provide an upload service for some pictures from the local machine of the user to an S3 bucket.
The thing is the app must run on an EC2 cluster, so I am not sure how I can refer to the local machine when copying the files. The method copyFromLocalFile(..) needs a path from the local machine which will be the EC2 cluster...
I'm not sure if I stated the problem correctly, can anyone understand what I mean?
Thanks
You might also investigate s3distcp: http://docs.amazonwebservices.com/ElasticMapReduce/latest/DeveloperGuide/UsingEMR_s3distcp.html
Apache DistCp is an open-source tool you can use to copy large amounts of data. DistCp uses MapReduce to copy in a distributed manner—sharing the copy, error handling, recovery, and reporting tasks across several servers. S3DistCp is an extension of DistCp that is optimized to work with Amazon Web Services, particularly Amazon Simple Storage Service (Amazon S3). Using S3DistCp, you can efficiently copy large amounts of data from Amazon S3 into HDFS where it can be processed by your Amazon Elastic MapReduce (Amazon EMR) job flow. You can also use S3DistCp to copy data between Amazon S3 buckets or from HDFS to Amazon S3.
You will need to get the files from the user's machine to at least one node before you will be able to use them in a MapReduce job.
The FileSystem and FileUtil functions refer to paths either on the HDFS or the local disk of one of the nodes in the cluster.
It cannot reference the user's local system. (Maybe if you did some ssh setup... maybe?)