Apache Atlas and AWS S3 - amazon-s3

i am working on a project that has a requirement to store scientific data on AWS S3 as raw data for the beginning of a data lake. we are planning JSON for application data and using S3 metadata to persist application metadata (JSON schema) and process metadata. at the moment, on site S3 is the only service that we have available to us from the AWS cloud.
the client would like a publish environment where they can get the raw data back as files. we would like to avoid building a custom catalog and security infrastructure.
i don't see anything about Apache Atlas that will connect directly to AWS S3. but we can put Apache Hive on top of AWS S3 and then put Apache Atlas and Ranger on top of that. but not sure if this is how we can publish the raw data from S3 or if that even works as Hive is more of a processing environment.
is it possible to use Apache Atlas and Ranger on top of AWS S3 directly?

Related

AWS S3 : Copy large files from AWS S3 Storage to a SFTP location hosted on a different Server

I have large amount of historical daily data (over 1 TB across many files) that i need to transfer from AWS S3 storage location to a internally hosted server. I have SFTP access to the internal server. could you please advice what is the efficient way to copy the data.
I am using mac and have used cyberduck to download files from S3 and upload files to the internally hosted server. I have also tried using filezilla to move directly from one server to another without a intermediate copy. But would like to see if there a way to move using CLI without much manual intervention. I am new to AWS CLI, but would appreciate community's input. Thank you!

Can we configure Marklogic database backup on S3 bucket

I need to configure Marklogic Full/Incremental backup in the S3 bucket is it possible? Can anyone share the documents/steps to configure?
Thanks!
Yes, you can backup to S3.
You will need to configure the S3 credentials, so that MarkLogic is able to use S3 and read/write objects to your S3 bucket.
MarkLogic can't use S3 for journal archive paths, because S3 does not support file append operations. So if you want to enable journal archives, you will need to specify a custom path for that when creating your backups.
Backing Up a Database
The directory you specified can be an operating system mounted directory path, it can be an HDFS path, or it can be an S3 path. For details on using HDFS and S3 storage in MarkLogic, see Disk Storage Considerations in the Query Performance and Tuning Guide.
S3 Storage
S3 requires authentication with the following S3 credentials:
AWS Access Key
AWS Secret Key
The S3 credentials for a MarkLogic cluster are stored in the security database for the cluster. You can only have one set of S3 credentials per cluster. You can set up security access in S3, you can access any paths that are allowed access by those credentials. Because of the flexibility of how you can set up access in S3, you can set up any S3 account to allow access to any other account, so if you want to allow the credentials you have set up in MarkLogic to access S3 paths owned by other S3 users, those users need to grant access to those paths to the AWS Access Key set up in your MarkLogic Cluster.
To set up the AW credentials for a cluster, enter the keys in the Admin Interface under Security > Credentials. You can also set up the keys programmatically using the following Security API functions:
sec:credentials-get-aws
sec:credentials-set-aws
The credentials are stored in the Security database. Therefore, you cannot use S3 as the forest storage for a security database.
if you want to have Journaling enabled, you will need to have them written to a different location. Journal archiving is not supported on S3.
The default location for Journals are in the backup, but when creating programmatically you can specify a different $journal-archive-path .
S3 and MarkLogic
Storage on S3 has an 'eventual consistency' property, meaning that write operations might not be available immediately for reading, but they will be available at some point. Because of this, S3 data directories in MarkLogic have a restriction that MarkLogic does not create Journals on S3. Therefore, MarkLogic recommends that you use S3 only for backups and for read-only forests, otherwise you risk the possibility of data loss. If your forests are read-only, then there is no need to have journals.

Importing data into Neo4j from S3 bucket using authentication

I'm new to Neo4j and testing it on EC2 server, to see if we could use it for storing our ~1.5 nodes and their connection (currently using a Redshift).
I want to load all the data from Redshift to the Neo4j DB. I also work a lot with EMRs and usually storing most of my data on S3.
Is there any way to include AWS authentication information when I try importing data to Neo4j from S3 if the S3 location isn't public? Is there any other way to do so?
Thanks

Copy objects from S3 to google cloud storage using aws-cli

Is this possible to access Google Cloud Storage using aws CLI?
Google Cloud Platform have support to copy files from S3 to Google Cloud Storage using gsutil with the following CLI.
gsutil -m cp -R s3://bucketname gs://bucketname
But I need to do this with aws CLI instead of gsutil.
I am not aware of any solution from the AWS side, but unless you have a special reason not to use gsutil or other Google solution, you may consider using Google Cloud Storage Transfer Service instead. This service is recommended when transferring data from Amazon S3 buckets.
Compared with simply using gsutil, or other CLI tools out there, Google Cloud Storage Transfer has several nice features like the possibility to schedule one-time or recurring transfers, where you can use advanced filters. Also, you can indicate if you want the source objects to be deleted after transferring them, and even synchronize the destination bucket with the source one, deleting existing objects if they don't have a corresponding object in the source.
You can schedule transfers from the GCP Console or using the XML and JSON API.

Hadoop upload files from local machine to amazon s3

I am working on a Java MapReduce app that has to be able to provide an upload service for some pictures from the local machine of the user to an S3 bucket.
The thing is the app must run on an EC2 cluster, so I am not sure how I can refer to the local machine when copying the files. The method copyFromLocalFile(..) needs a path from the local machine which will be the EC2 cluster...
I'm not sure if I stated the problem correctly, can anyone understand what I mean?
Thanks
You might also investigate s3distcp: http://docs.amazonwebservices.com/ElasticMapReduce/latest/DeveloperGuide/UsingEMR_s3distcp.html
Apache DistCp is an open-source tool you can use to copy large amounts of data. DistCp uses MapReduce to copy in a distributed manner—sharing the copy, error handling, recovery, and reporting tasks across several servers. S3DistCp is an extension of DistCp that is optimized to work with Amazon Web Services, particularly Amazon Simple Storage Service (Amazon S3). Using S3DistCp, you can efficiently copy large amounts of data from Amazon S3 into HDFS where it can be processed by your Amazon Elastic MapReduce (Amazon EMR) job flow. You can also use S3DistCp to copy data between Amazon S3 buckets or from HDFS to Amazon S3.
You will need to get the files from the userMachine to at least 1 node before you will be able to use them through a MapReduce.
The FileSystem and FileUtil functions refer to paths either on the HDFS or the local disk of one of the nodes in the cluster.
It cannot reference the user's local system. (Maybe if you did some ssh setup... maybe?)