Download files from FTP to Amazon EMR

I need to download files from an FTP server to Amazon EMR. I have a shell script that downloads the files, and it works on Linux machines but not on the Amazon EMR master (namenode). I am not getting any error; the terminal displays nothing after I run the shell script.
Note: I have enabled the ports on the master security group. I know about the other approach of downloading from FTP to S3 and then to Amazon EMR, but I need to download the files directly to Amazon EMR.

I assume you have tried to download the files from the FTP server to Amazon EMR using a bootstrap script.
To debug what is going wrong: can you connect to the master/core nodes once they are up and check whether your script runs there? That will tell you whether the script is running at all.
Another way to debug is, once a node is launched, to run the script manually on the EMR nodes and see whether it throws an error.
Hope this helps you figure out why the scripts are not running.
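As a starting point for that manual test, here is a minimal sketch of an FTP download script you could run directly on the master node; the host, credentials, and paths are placeholders, and set -x is used so an otherwise silent run prints each command it executes.
#!/bin/bash
set -euxo pipefail                   # -x echoes every command, so a silent failure becomes visible
FTP_HOST="ftp.example.com"           # placeholder FTP server
FTP_USER="myuser"                    # placeholder credentials
FTP_PASS="mypassword"
REMOTE_PATH="/incoming/data.csv"     # placeholder remote file
LOCAL_DIR="/home/hadoop/ftp-data"
mkdir -p "$LOCAL_DIR"
# wget speaks FTP natively; curl -u "$FTP_USER:$FTP_PASS" "ftp://$FTP_HOST$REMOTE_PATH" would also work
wget --ftp-user="$FTP_USER" --ftp-password="$FTP_PASS" "ftp://$FTP_HOST$REMOTE_PATH" -P "$LOCAL_DIR"
If this runs cleanly by hand but not as a bootstrap action, the problem is likely in how the script is invoked rather than in the FTP transfer itself.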

Related

QNAP Backup to s3

I am currently using a QNAP NAS and want to sync local files to AWS S3.
I have created a job and scheduled the backup operation, but the synchronization only works for small files; it fails for large files of around 3-4 GB.
How can I solve this issue?
Lowering multipart_chunksize to a smaller value reduces the chance of the S3 sync failing for big files.
Try setting it with the AWS CLI:
aws configure set default.s3.max_concurrent_requests 20
aws configure set default.s3.multipart_chunksize 2MB
I'm not sure, but this might resolve your issue.
For a detailed description of how these configuration values work, see the link below:
https://docs.aws.amazon.com/cli/latest/topic/s3-config.html
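Equivalently, the same values can be written straight into the AWS CLI config file; a minimal sketch (the 20 / 2MB values are simply the ones suggested above, not tuned recommendations):
cat >> ~/.aws/config <<'EOF'
[default]
s3 =
    max_concurrent_requests = 20
    multipart_chunksize = 2MB
EOF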

How to download a file from S3 into an EC2 instance using Packer to build a custom AMI

I am trying to create a custom AMI using Packer.
I want to install some specific software on the custom AMI, and my setup files are in an S3 bucket. But there seems to be no direct way to download an S3 file in Packer, the way cfn-init does.
So, is there any way to download a file onto the EC2 instance using Packer?
Install the awscli in the instance and use iam_instance_profile to give the instance permissions to get the files from S3.
I can envisage a case where this is ineffective.
When building the image on AWS, you normally use your local credentials. But while the image is building, the builder instance runs as Packer's own temporary user, not as you, so it does not have your credentials and cannot access the S3 bucket (if it is private).
Option one: https://github.com/enmand/packer-provisioner-s3
Option two: use a shell-local provisioner to pull the S3 files down to your machine with aws s3 cp, then a file provisioner to upload them to the correct folder in the builder image; you can then use a remote shell provisioner to do any other work on the files (see the sketch after this list). I chose this because, although it is more code, it is more universal when I share my build; others have no need to install anything extra.
Option three: wait and wait. There is an enhancement discussed in the Packer GitHub issues in 2019 to offer an S3 passthrough using local credentials, but it isn't on the official roadmap.
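A minimal sketch of option two's local pull step, run on the machine that invokes Packer (the bucket name and paths are hypothetical); the downloaded directory is what the file provisioner would then copy into the builder image:
# Pull the setup files down with your own credentials before the build starts
aws s3 cp s3://my-setup-bucket/installers/ ./installers/ --recursive
# The Packer file provisioner then uploads ./installers/ to e.g. /tmp/installers on the instance,
# and a remote shell provisioner can install from there.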
Assuming the awscli is already installed on the EC2 instance, use a sample command like the one below in a shell provisioner.
sudo aws s3 cp s3://bucket-name/path_to_folder/file_name /home/ec2-user/temp

GoReplay - Upload to S3 does not work

I am trying to capture all incoming traffic on a specific port using GoReplay and to upload it directly to S3 servers.
I am running a simple file server on port 8000 and a gor instance using the (simple) command
gor --input-raw :8000 --output-file s3://<MyBucket>/%Y_%m_%d_%H_%M_%S.log
It does create a temporary file under /tmp/, but other than that, it does not upload anything to S3.
Additional information:
The OS is Ubuntu 14.04.
The AWS CLI is installed.
The AWS credentials are defined in the environment.
The information you are providing, or the scenario you explained, seems incomplete; however, uploading a file from your EC2 machine to S3 is as simple as the command below:
aws s3 cp yourSourceFile s3://your-bucket
You can then see your file with the following command:
aws s3 ls s3://your-bucket
However, S3 is object storage, and you can't use it for files that are continually being edited, appended to, or updated.
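Given that, one possible workaround (a sketch, not something GoReplay documents for this exact setup) is to let gor rotate plain local files using the same timestamp pattern and periodically push the finished chunks to S3 yourself; the bucket name and directory are placeholders:
# Write rotated capture files locally instead of straight to S3
gor --input-raw :8000 --output-file ./captures/%Y_%m_%d_%H_%M_%S.log
# From cron (or a loop), upload completed files; sync only copies what is new
aws s3 sync ./captures s3://<MyBucket>/captures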

how to move files from remote server to s3 at the command line

I have a lot of big files on a remote server and I want to move them into S3. I want to do it at the command line or with a bash script (i.e., I do NOT want to use a GUI app like Cyberduck) so that I can automate/replicate the effort.
I have tried mounting the remote server onto my local machine using OSXFUSE and sshfs and then pushing the files to S3 using s3cmd. This does work, but I keep running into errors (the connection being lost for no apparent reason, mount errors, etc.).
Is this the best way to do it? Does anyone know a better way to do it?
Thanks,
A
You can use the MinIO client, aka mc, to do the same.
$ mc cp --recursive localDir/ s3/remoteBucket
In case of a network disconnection, mc will give you the option to resume the upload.
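The s3/ prefix in that command refers to a configured alias; a minimal sketch of setting it up (newer mc releases use mc alias set, older ones use mc config host add, and the keys here are placeholders):
mc alias set s3 https://s3.amazonaws.com MY_ACCESS_KEY MY_SECRET_KEY
# after this, s3/remoteBucket in the copy command resolves to the bucket "remoteBucket" in your account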
Is your remote server in EC2? Your current setup requires two copies (first pulling the data to your local machine via sshfs, then pushing it to S3 via s3cmd); if you run s3cmd on the remote server directly, you can reduce that to one.
If you want to mount S3 as a filesystem, you can also use tools like goofys or s3fs. Again, you should do that on the remote server to avoid extra copies.
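A minimal sketch of the single-copy approach, run over ssh from your local machine (the hostname, paths, and bucket are placeholders, and s3cmd is assumed to be installed and configured on the remote server):
# One hop: the remote server uploads straight to S3, nothing passes through your laptop
ssh user@remote-server 's3cmd sync /data/big-files/ s3://my-bucket/big-files/'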

Amazon EMR Spark Cluster: output/result not visible

I am running a Spark cluster on Amazon EMR. I am running the PageRank example programs on the cluster.
While running the programs on my local machine, I am able to see the output properly. But the same doesn't work on EMR. The S3 folder only shows empty files.
The commands I am using:
For starting the cluster:
aws emr create-cluster --name SparkCluster --ami-version 3.2 --instance-type m3.xlarge --instance-count 2 \
--ec2-attributes KeyName=sparkproj --applications Name=Hive \
--bootstrap-actions Path=s3://support.elasticmapreduce/spark/install-spark \
--log-uri s3://sampleapp-amahajan/output/ \
--steps Name=SparkHistoryServer,Jar=s3://elasticmapreduce/libs/script-runner/script-runner.jar,Args=s3://support.elasticmapreduce/spark/start-history-server
For adding the job:
aws emr add-steps --cluster-id j-9AWEFYP835GI --steps \
Name=PageRank,Jar=s3://elasticmapreduce/libs/script-runner/script-runner.jar,Args=[/home/hadoop/spark/bin/spark-submit,--deploy-mode,cluster,--master,yarn-cluster,--class,SparkPageRank,s3://sampleapp-amahajan/pagerank_2.10-1.0.jar,s3://sampleapp-amahajan/web-Google.txt,2],ActionOnFailure=CONTINUE
After a few unsuccessful attempts... I wrote the job's output to a text file, and it is created successfully when I run on my local machine. But I cannot find the same file when I SSH into the cluster. I also set up FoxyProxy to view the logs for the instances, and nothing shows up there either.
Could you please let me know where I am going wrong?
Thanks!
How are you writing the text file locally? Generally, EMR jobs save their output to S3, so you could use something like outputRDD.saveAsTextFile("s3n://<MY_BUCKET>"). You could also save the output to HDFS, but storing the results in S3 is useful for "ephemeral" clusters, where you provision an EMR cluster, submit a job, and terminate it upon completion.
"While running the programs on my local machine, I am able to see the
output properly. But the same doesn't work on EMR. The S3 folder only
shows empty files"
For the benefit of newbies:
If you print output to the console, it will be displayed in local mode, but when you execute on an EMR cluster, the reduce operation is performed on the worker nodes, and they can't write to the console of the master/driver node!
With a proper S3 path, you should be able to write the results to S3.
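Two quick checks from the master node follow from this (the application id is a placeholder, and the bucket path is taken from the question's setup): the driver and executor console output lands in the YARN container logs, and results saved with saveAsTextFile should appear as part-* files under the target S3 prefix.
# Console output from a yarn-cluster job lives in the YARN container logs
yarn logs -applicationId application_1400000000000_0001
# Results written with saveAsTextFile end up as part-* files under the target prefix
aws s3 ls s3://sampleapp-amahajan/output/ --recursive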