How to specify different AWS credentials for EMR and S3 when using MRJob - mrjob

I can specify which AWS credentials to use to create an EMR cluster via environment variables. However, I would like to run a MapReduce job against another AWS user's S3 bucket, for which they gave me a different set of AWS credentials.
Does MRJob provide a way to do this, or would I have to copy the bucket using my account first so that the bucket and EMR keys are the same?
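For reference, mrjob reads one credential pair (from AWS_ACCESS_KEY_ID/AWS_SECRET_ACCESS_KEY or from mrjob.conf) and, as far as I can tell, uses it for both the EMR and S3 calls; I'm not aware of a per-bucket credential option. A sketch of the relevant mrjob.conf section (placeholder values):

```yaml
# ~/.mrjob.conf -- a single credential set, used for both EMR and S3
runners:
  emr:
    aws_access_key_id: <your-access-key-id>       # placeholder
    aws_secret_access_key: <your-secret-key>      # placeholder
```

Given that, the practical options are likely either having the bucket owner grant your AWS account access via a bucket policy, or copying the data into a bucket your own credentials can read.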

Related

EMR serverless cannot connect to s3 in another region

I have an EMR Serverless app that cannot connect to an S3 bucket in another region. Is there a workaround for that, perhaps a parameter to set in the job parameters or Spark parameters when submitting a new job?
The error is this:
ExitCode: 1. Last few exceptions: Caused by: java.net.SocketTimeoutException: connect timed out Caused by: com.amazon.ws.emr.hadoop.fs.shaded.org.apache.http.conn.ConnectTimeoutException
In order to connect to an S3 bucket in another region or access external services, the EMR Serverless application needs to be created with a VPC.
This is mentioned on the considerations page:
Without VPC connectivity, a job can access some AWS service endpoints in the same AWS Region. These services include Amazon S3, AWS Glue, Amazon DynamoDB, Amazon CloudWatch, AWS KMS, and AWS Secrets Manager.
Here's an example AWS CLI command to create an application in a VPC - you need to provide a list of Subnet IDs and Security Group IDs. More details can be found in configuring VPC access.
aws emr-serverless create-application \
    --type SPARK \
    --name etl-jobs \
    --release-label "emr-6.6.0" \
    --network-configuration '{
        "subnetIds": ["subnet-01234567890abcdef","subnet-01234567890abcded"],
        "securityGroupIds": ["sg-01234566889aabbcc"]
    }'

Can we configure MarkLogic database backup on an S3 bucket

I need to configure MarkLogic full/incremental backups to an S3 bucket. Is this possible? Can anyone share documentation or steps to configure it?
Thanks!
Yes, you can backup to S3.
You will need to configure the S3 credentials, so that MarkLogic is able to use S3 and read/write objects to your S3 bucket.
MarkLogic can't use S3 for journal archive paths, because S3 does not support file append operations. So if you want to enable journal archives, you will need to specify a custom path for that when creating your backups.
Backing Up a Database
The directory you specified can be an operating system mounted directory path, it can be an HDFS path, or it can be an S3 path. For details on using HDFS and S3 storage in MarkLogic, see Disk Storage Considerations in the Query Performance and Tuning Guide.
S3 Storage
S3 requires authentication with the following S3 credentials:
AWS Access Key
AWS Secret Key
The S3 credentials for a MarkLogic cluster are stored in the security database for the cluster. You can only have one set of S3 credentials per cluster, and those credentials can access any S3 paths they have been granted access to. Because of the flexibility of how access can be set up in S3, any S3 account can grant access to any other account; so if you want the credentials configured in MarkLogic to access S3 paths owned by other S3 users, those users need to grant access to those paths to the AWS Access Key set up in your MarkLogic cluster.
To set up the AWS credentials for a cluster, enter the keys in the Admin Interface under Security > Credentials. You can also set up the keys programmatically using the following Security API functions:
sec:credentials-get-aws
sec:credentials-set-aws
The credentials are stored in the Security database. Therefore, you cannot use S3 as the forest storage for a security database.
If you want journal archiving enabled, you will need to have the journal archives written to a different location, since journal archiving is not supported on S3.
The default location for journal archives is in the backup, but when creating backups programmatically you can specify a different $journal-archive-path .
S3 and MarkLogic
Storage on S3 has an 'eventual consistency' property, meaning that write operations might not be available immediately for reading, but they will be available at some point. Because of this, S3 data directories in MarkLogic have a restriction that MarkLogic does not create Journals on S3. Therefore, MarkLogic recommends that you use S3 only for backups and for read-only forests, otherwise you risk the possibility of data loss. If your forests are read-only, then there is no need to have journals.

Processing AWS ELB access logs (from S3 bucket to InfluxDB)

We would like to process AWS ELB access logs and write them into InfluxDB to be used for application metrics and monitoring (e.g., Grafana).
We configured ELB to store access logs into S3 bucket.
What would be the best way to process those logs and write them to InfluxDB?
What we tried so far was to mount the S3 bucket to the filesystem using s3fs and then use the Telegraf agent for processing. But this approach has some issues: s3fs mounting feels like a hack, and all the files in the bucket are compressed and need to be unzipped before Telegraf can process them, which overcomplicates the task.
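As an alternative to the s3fs mount, the compressed objects could be downloaded (e.g. with the AWS CLI or boto3) and parsed directly before handing metrics to InfluxDB. A minimal stdlib-only sketch of the parsing step, assuming the classic ELB access-log field order (the gzip handling mirrors the compressed files described above):

```python
import gzip
import shlex

# Classic ELB access-log fields, in order (field names assumed from the
# classic ELB log format; verify against your actual log lines).
FIELDS = [
    "timestamp", "elb", "client_port", "backend_port",
    "request_processing_time", "backend_processing_time",
    "response_processing_time", "elb_status_code", "backend_status_code",
    "received_bytes", "sent_bytes", "request", "user_agent",
    "ssl_cipher", "ssl_protocol",
]

def parse_line(line):
    """Split one log line into a dict, honoring the quoted request
    and user-agent fields."""
    values = shlex.split(line)
    return dict(zip(FIELDS, values))

def parse_gzip_log(path):
    """Yield one dict per entry in a gzip-compressed ELB log file."""
    with gzip.open(path, "rt") as fh:
        for line in fh:
            if line.strip():
                yield parse_line(line)
```

From here, each dict could be converted to InfluxDB line protocol and written over HTTP, avoiding the filesystem mount entirely.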
Is there any better way?
Thanks,
Oleksandr
Can you just install the Telegraf agent on the AWS instance that is generating the logs, and have the logs sent directly to InfluxDB in real time?

Amazon EC2 creates automatically if I use S3?

Does Amazon EC2 automatically create an instance if I use S3? I use only S3.
No. If you use S3, it won't automatically create an Amazon EC2 instance, if that is what you are referring to. Can you clarify your question?
An AWS EC2 instance/server is different from S3.
If you use AWS S3 to upload, download, or store files, no EC2 servers will be launched.
You can access these files through the AWS console or through the AWS CLI on your local machine.

How to set up a volume linked to S3 in Docker Cloud with AWS?

I'm running my Play! webapp with Docker Cloud (could also use Rancher) and AWS and I'd like to store all the logs in S3 (via volume). Any ideas on how I could achieve that with minimal effort?
Use Docker volumes to store the logs on the host system.
Then use the AWS CLI to sync your local log directory with the S3 bucket:
aws s3 sync /var/logs/container-logs s3://bucket/
Create a cron job to run it every minute or so.
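For example, a crontab entry (paths assumed) running the sync every minute might look like:

```
* * * * * aws s3 sync /var/logs/container-logs s3://bucket/ >> /var/log/s3-sync.log 2>&1
```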
Reference: s3 aws-cli