Access files in s3n://elasticmapreduce/samples/wordcount/input - amazon-s3

How can I access the files sitting in the following S3 folder, which is owned by someone else?
s3n://elasticmapreduce/samples/wordcount/input

The files in s3n://elasticmapreduce/samples/wordcount/input are public, and made available as input by Amazon to the sample word count Hadoop program. The best way to fetch them is to
Start a new Amazon Elastic MapReduce Job Flow (it doesn't matter which one) from the Amazon Web Services console, and make sure that you keep the job alive with the Keep Alive option
Once the EC2 machines have started, find the instances on EC2 from the Amazon Web Services console
ssh into one of the running EC2 instances, using the hadoop user, for example
ssh -i keypair.pem hadoop@ec2-IPADDRESS.compute-1.amazonaws.com
Obtain the files you need, using hadoop dfs -copyToLocal s3://elasticmapreduce/samples/wordcount/input/0002 .
sftp the files to your local system

You can access wordSplitter.py here:
https://elasticmapreduce.s3.amazonaws.com/samples/wordcount/wordSplitter.py
You can access the input files here:
https://elasticmapreduce.s3.amazonaws.com/samples/wordcount/input/0012
https://elasticmapreduce.s3.amazonaws.com/samples/wordcount/input/0011
https://elasticmapreduce.s3.amazonaws.com/samples/wordcount/input/0010
https://elasticmapreduce.s3.amazonaws.com/samples/wordcount/input/0009
https://elasticmapreduce.s3.amazonaws.com/samples/wordcount/input/0008
https://elasticmapreduce.s3.amazonaws.com/samples/wordcount/input/0007
https://elasticmapreduce.s3.amazonaws.com/samples/wordcount/input/0006
https://elasticmapreduce.s3.amazonaws.com/samples/wordcount/input/0005
https://elasticmapreduce.s3.amazonaws.com/samples/wordcount/input/0004
https://elasticmapreduce.s3.amazonaws.com/samples/wordcount/input/0003
https://elasticmapreduce.s3.amazonaws.com/samples/wordcount/input/0002
https://elasticmapreduce.s3.amazonaws.com/samples/wordcount/input/0001
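
The input URLs above follow a regular pattern, so they can be generated and fetched in a loop. The following is a minimal Python sketch, assuming the objects are still publicly readable over HTTPS and are numbered 0001 to 0012 as listed. The function names and the destination directory are illustrative and not part of any AWS tooling.

```python
import os
import urllib.request

BASE_URL = "https://elasticmapreduce.s3.amazonaws.com/samples/wordcount/input/"


def input_urls(first=1, last=12):
    """Return the sample input URLs, with the number zero-padded to four digits."""
    return [f"{BASE_URL}{n:04d}" for n in range(first, last + 1)]


def download_all(dest_dir="wordcount-input"):
    """Download every sample file into dest_dir (performs network requests)."""
    os.makedirs(dest_dir, exist_ok=True)
    for url in input_urls():
        # Use the last path component (e.g. "0001") as the local file name.
        target = os.path.join(dest_dir, url.rsplit("/", 1)[1])
        urllib.request.urlretrieve(url, target)
```

Calling download_all() should leave twelve files in the wordcount-input directory, provided the objects are still anonymously readable.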

The owner of the folder (most likely of a file in the folder) must have made it accessible to anonymous readers.
If that is the case, s3n://x/y... is translated to
http://s3.amazonaws.com/x/y...
or
http://x.s3.amazonaws.com/y...
x is the name of the bucket.
y... is the path within the bucket.
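
The translation described above can be sketched in a few lines of Python. This is only an illustration of the naming rule, assuming the object is publicly readable; the function name s3_uri_to_http is made up for this example.

```python
from urllib.parse import urlparse


def s3_uri_to_http(uri):
    """Return (path_style_url, virtual_hosted_url) for an s3:// or s3n:// URI."""
    parsed = urlparse(uri)
    if parsed.scheme not in ("s3", "s3n"):
        raise ValueError(f"not an S3 URI: {uri}")
    bucket = parsed.netloc            # x, the bucket name
    key = parsed.path.lstrip("/")     # y..., the path within the bucket
    return (
        f"http://s3.amazonaws.com/{bucket}/{key}",
        f"http://{bucket}.s3.amazonaws.com/{key}",
    )


print(s3_uri_to_http("s3n://elasticmapreduce/samples/wordcount/input/0001"))
```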
If you want to make sure the file exists, e.g. if you suspect the name was misspelled, you can open the following in your browser:
http://s3.amazonaws.com/x
and you'll see XML describing the "files" (that is, S3 objects) available.
Try this:
http://s3.amazonaws.com/elasticmapreduce
I tried this, and it seems that the path you want is not public.
The AWS EMR documentation quotes s3://elasticmapreduce/samples/wordcount/input in one of the "getting started" examples. But s3 is different from s3n, so the input might be available to EMR but not over HTTP.

In Amazon S3, there is no concept of folders; a bucket is just a flat collection of objects. But you can list all the files you are interested in from a browser with the following URL:
s3.amazonaws.com/elasticmapreduce?prefix=samples/wordcount/input/
Then you can download them by specifying the whole name, e.g.
s3.amazonaws.com/elasticmapreduce/samples/wordcount/input/0001
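
The same listing can be read programmatically. The sketch below builds the prefix query shown above and pulls the object keys out of the XML that S3 returns. It assumes the bucket allows anonymous listing; the helper names are illustrative, and SAMPLE is a hand-written, abbreviated stand-in for a listing response, not real output.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# XML namespace used by S3 list responses.
S3_NS = "{http://s3.amazonaws.com/doc/2006-03-01/}"


def list_url(bucket, prefix):
    """Build the listing URL for all keys in bucket that start with prefix."""
    query = urlencode({"prefix": prefix}, safe="/")
    return f"http://s3.amazonaws.com/{bucket}?{query}"


def keys_from_listing(xml_text):
    """Extract every <Key> value from a ListBucketResult document."""
    root = ET.fromstring(xml_text)
    return [element.text for element in root.iter(f"{S3_NS}Key")]


# Hand-written, abbreviated example of the response shape (an assumption).
SAMPLE = """<ListBucketResult xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
  <Name>elasticmapreduce</Name>
  <Prefix>samples/wordcount/input/</Prefix>
  <Contents><Key>samples/wordcount/input/0001</Key></Contents>
  <Contents><Key>samples/wordcount/input/0002</Key></Contents>
</ListBucketResult>"""

print(list_url("elasticmapreduce", "samples/wordcount/input/"))
print(keys_from_listing(SAMPLE))
```

To run this against the live bucket, fetch list_url(...) with urllib.request.urlopen and pass the response body to keys_from_listing.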

Related

Apache Camel eats S3 "folders" created programmatically, but not ones created in AWS S3 Console

We have an Apache Camel app that is supposed to read files in a certain directory structure in S3, process the files (generating some metadata based on the folder the file is in), submit the data in the file (and metadata) to another system and finally put the consumed files into a different bucket, deleting the original from the incoming bucket.
The behaviour I'm seeing is that when I programmatically create the directory structure in S3, those "folders" are being consumed, so the directory structure disappears.
I know S3 technically does not have folders, just empty files ending in /.
The twist here is that "folders" created in the S3 Console are NOT consumed; they stay there as we want them to. Any folders created via the AWS CLI or boto3 are immediately consumed.
The problem is that we do need the folders to be created with automation, there are too many to do by hand.
I've reached out to AWS Support, and they just tell me that there are no differences between how the Console creates folders, and how the CLI does it. Support confirmed that the command I used in CLI is correct.
I think my issue is similar to "Apache Camel deleting AWS S3 bucket's folder", but that has no answer...
How can I get Camel to not "eat" any folders?

AWS FTP behavior

I'm having an issue with my AWS S3 bucket and vsftpd.
I've created a vsftpd instance and mounted an AWS S3 bucket on it. My issue is that every time I upload a file and the connection is disrupted, the retry from the FTP client appends to the existing file on the S3 bucket instead of overwriting it. What should I set in the S3 bucket policy so that the file is overwritten instead of appended to?
There are no Amazon S3 configuration settings that would impact this behaviour -- it is totally the result of the software you are using.
It's also worth mentioning that FTP is a rather old protocol and these days there are much better alternatives, such as uploads via the browser or Dropbox-like shared folders.
One of the easiest options is to have your users upload directly to Amazon S3 -- that way, you don't need to run any servers. This could be done by uploading via a browser, or by providing users with some software, such as Cloudberry Explorer or the AWS Command-Line Interface (CLI).
I highly encourage you to stop using FTP these days.

Mounted S3 on EC2 (Directories are not accessible from AWS UI)

Some quick questions:
Does S3 support soft link?
With S3 mounted on a Linux EC2 instance, I can't access directories created on the instance from the AWS UI, although files created there are accessible.
Thanks
Amazon S3 is an object store, not a filesystem. It has a specific set of APIs for uploading, listing, downloading, etc but it does not behave like a normal filesystem.
There are some utilities that can mount S3 as a filesystem (eg Expandrive, Cloudberry Drive, s3fs), but in the background these utilities are actually translating requests into API calls. This can cause some issues -- for example, you can modify a 100MB file on a local disk by just writing one byte to disk. If you wish to modify one byte on S3, you must upload the whole object again. This can cause synchronization problems between your computer and S3, so such methods are not recommended for production situations. (However, they're a great way of uploading/downloading initial data.)
A good in-between option is to use the AWS Command-Line Interface (CLI), which has commands such as aws s3 cp and aws s3 sync, which are reliable ways to upload/download/sync files with Amazon S3.
To answer your questions...
Amazon S3 does not support a "soft link" (symbolic link). Amazon S3 is an object store, not a file system, so it only contains objects. Objects can also have metadata, which is often used for cache control, redirection, classification, etc.
Amazon S3 does not support directories (sort of). Amazon S3 objects are kept within buckets, and the buckets are 'flat' -- they do not contain directories/sub-folders. However, it does maintain the illusion of directories. For example, if file bar.jpg is stored in the foo directory, then the Key (filename) of the object is foo/bar.jpg. This makes the object 'appear' to be in the foo directory, but that's not how it is stored. The AWS Management Console maintains this illusion by allowing users to create and open Folders, but the actual data is stored 'flat'.
This leads to some interesting behaviours:
You do not need to create a directory to store an object in the directory.
Directories don't exist. Just store a file called images/cat.jpg and the images directory magically appears (even though it doesn't exist).
You cannot rename objects. The Key (filename) is a unique identifier for the object. To 'rename' an object, you must copy it to a new Key and delete the original.
You cannot rename a directory. They don't exist. Instead, rename all the objects within the directory (which really means you have to copy the objects, then delete their old versions).
You might create a directory but not see it. Amazon S3 keeps track of CommonPrefixes to assist in listing objects by path, but it doesn't create traditional directories. So, don't get worried if you create a (pretend) directory and then don't see it. Just store your object with a full-path name and the directory will 'appear'.
The above-mentioned utilities take all this into account when allowing an Amazon S3 bucket to be mounted. They translate 'normal' filesystem commands into Amazon S3 API calls, but they can't do everything (eg they might emulate renaming a file but they typically won't let you rename a directory).
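
The behaviours listed above can be illustrated without touching AWS at all. The sketch below models a bucket as a plain Python dict of key to bytes (an assumption made purely for illustration, not how S3 is implemented) and shows how a "directory" listing is derived from key prefixes, and why renaming a directory means copying and deleting every object under it. Both function names are invented for this example.

```python
def list_level(keys, prefix="", delimiter="/"):
    """Mimic a delimiter-based listing: return (objects, common_prefixes)."""
    objects, prefixes = [], set()
    for key in sorted(keys):
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        if delimiter in rest:
            # Deeper keys are rolled up into one pretend directory entry.
            prefixes.add(prefix + rest.split(delimiter, 1)[0] + delimiter)
        else:
            objects.append(key)
    return objects, sorted(prefixes)


def rename_prefix(store, old, new):
    """'Rename a directory': copy each object to its new key, delete the old."""
    for key in [k for k in store if k.startswith(old)]:
        store[new + key[len(old):]] = store[key]  # copy to the new key
        del store[key]                            # delete the original


bucket = {"images/cat.jpg": b"...", "images/dog.jpg": b"...", "readme.txt": b"..."}
print(list_level(bucket))   # (['readme.txt'], ['images/'])
rename_prefix(bucket, "images/", "photos/")
print(list_level(bucket))   # (['readme.txt'], ['photos/'])
```

Note that the images/ entry was never created explicitly; it appears only because two keys happen to share that prefix, and it vanishes once no key starts with it.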

Using data present in S3 inside EMR mappers

I need to access some data during the map stage. It is a static file, from which I need to read some data.
I have uploaded the data file to S3.
How can I access that data while running my job in EMR?
If I just specify the file path as:
s3n://<bucket-name>/path
in the code, will that work ?
Thanks
The s3n:// URL scheme is for Hadoop to read S3 files. If you want to read the S3 file in your map program, you either need to use a library that handles the s3:// URL format, such as jets3t (https://jets3t.s3.amazonaws.com/toolkit/toolkit.html), or access S3 objects via HTTP.
A quick search for an example program brought up this link.
https://gist.github.com/lucastex/917988
You can also access the S3 object through HTTP or HTTPS. This may require making the object public or configuring additional security. Then you can access it using the HTTP URL classes that Java supports natively.
Another good option is to use S3DistCp as a bootstrap step to copy the S3 file to HDFS before your Map step starts. http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/UsingEMR_s3distcp.html
What I ended up doing:
1) Wrote a small script that copies my file from s3 to the cluster
hadoop fs -copyToLocal s3n://$SOURCE_S3_BUCKET/path/file.txt $DESTINATION_DIR_ON_HOST
2) Created bootstrap step for my EMR Job, that runs the script in 1).
This approach doesn't require making the S3 data public.
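
Once the file has been copied to the node, the mapper can read it like any local file. Below is a minimal sketch of a Hadoop Streaming style mapper in Python, under several assumptions that are not in the original post: the static file holds one entry per line, the job emits tab-separated word and count pairs, and the local path in main() is a placeholder for wherever the bootstrap script put the file.

```python
import sys


def load_lookup(path):
    """Read the static file once at start-up; assumes one entry per line."""
    with open(path, encoding="utf-8") as handle:
        return {line.strip() for line in handle if line.strip()}


def map_lines(lines, lookup):
    """Yield 'word<TAB>1' for each input word that appears in the lookup set."""
    for line in lines:
        for word in line.split():
            if word in lookup:
                yield f"{word}\t1"


def main(lookup_path="/home/hadoop/file.txt"):  # hypothetical local path
    lookup = load_lookup(lookup_path)
    for record in map_lines(sys.stdin, lookup):
        print(record)
```

Loading the file once into a set, rather than re-reading it for every record, keeps the per-record cost of the map stage low.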

How to upload files directly to Amazon S3 from a remote server?

Is it possible to upload a file to S3 from a remote server?
The remote server is basically a URL-based file server. For example, http://example.com/1.jpg serves the image. It doesn't do anything else, and I can't run code on this server.
Is it possible to have another server tell S3 to upload a file from http://example.com/1.jpg?
upload from http://example.com/1.jpg
server -------------------------------------------> S3 <-----> example.com
If you can't run code on the server or execute requests then, no, you can't do this. You will have to download the file to a server or computer that you own and upload from there.
You can see the operations you can perform on Amazon S3 at http://docs.amazonwebservices.com/AmazonS3/latest/API/APIRest.html
Checking the operations for both the REST and SOAP APIs, you'll see there's no way to give Amazon S3 a remote URL and have it grab the object for you. All of the PUT requests require the object's data to be provided as part of the request, meaning the server or computer that is initiating the web request needs to have the data.
I have had a similar problem in the past where I wanted to download my users' Facebook thumbnails and upload them to S3 for use on my site. The way I did it was to download the image from Facebook into memory on my server, then upload it to Amazon S3 - the full thing took under 2 seconds. After the upload to S3 was complete, I wrote the bucket/key to a database.
Unfortunately there's no other way to do it.
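
The download-then-upload approach described above might look like the following Python sketch. It is not the original poster's code. The uploader is passed in as put_object, any callable that accepts Bucket, Key and Body keyword arguments; with boto3 installed, the put_object method of an S3 client takes those arguments, but that dependency is an assumption and is not imported here. The function names are illustrative.

```python
import urllib.request


def fetch_bytes(url):
    """Download the whole resource into memory and return it as bytes."""
    with urllib.request.urlopen(url) as response:
        return response.read()


def relay_to_s3(url, bucket, key, put_object, fetch=fetch_bytes):
    """Fetch url into memory, hand the bytes to put_object, return (bucket, key).

    put_object is supplied by the caller and must accept the keyword
    arguments Bucket, Key and Body.
    """
    body = fetch(url)
    put_object(Bucket=bucket, Key=key, Body=body)
    # The returned pair is what you would record in your database.
    return bucket, key
```

A call such as relay_to_s3("http://example.com/1.jpg", "my-bucket", "images/1.jpg", client.put_object) would then run on a machine you control, where my-bucket and client are placeholders for your own bucket name and S3 client.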
I think the suggestion provided is quite good: you can SCP the file to the S3 bucket. Supplying the .pem file gives passwordless authentication, and a PHP file can validate the extensions and pass the file as an argument to the scp command.
The only problem with this solution is that you must have your instance in AWS. You can't use it if your website is hosted with another hosting provider and you are trying to upload files straight to the S3 bucket.
Technically it's possible, using AWS Signature Version 4. Assuming your remote server is the customer in the image below, you could prepare a form on the main server and send the form fields to the remote server, for it to submit with curl. Detailed example here.
You can use the scp command from a terminal.
1) In the terminal, go to the directory that contains the file you want to transfer to the server
2) Type this:
scp -i yourAmazonKeypairPath.pem fileNameThatYouWantToTransfer.php ec2-user@ec2-00-000-000-15.us-west-2.compute.amazonaws.com:
N.B. Add "ec2-user@" before the EC2 host name that you got from the EC2 website! This is an easy detail to miss.
3) Your file will be uploaded and the progress will be shown. When it reaches 100%, you are done!