Problems reading files from an S3 bucket mounted to Ubuntu using s3fs - amazon-s3

I am trying to use s3fs (https://github.com/s3fs-fuse/s3fs-fuse) to handle files being uploaded by a different process to an S3 bucket. I can see the files listed when I ls in that mounted directory, and they show the correct timestamps and sizes. But opening the files, either in code or with vi or nano, just shows garbage such as
^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#
If I curl the s3 link then I do see content so it is available and there do not seem to be permission issues.
I am able to create a new file in this directory and the content is saving fine - can view it in the S3 explorer.
Any thoughts? Is this an s3fs-fuse problem, and might I fare better using an alternative such as https://github.com/kahing/goofys?
I have signed up for the AWS EFS preview but no idea what the wait time is for that.

Related

Apache Camel eats S3 "folders" created programmatically, but not ones created in AWS S3 Console

We have an Apache Camel app that is supposed to read files in a certain directory structure in S3, process the files (generating some metadata based on the folder the file is in), submit the data in the file (and metadata) to another system and finally put the consumed files into a different bucket, deleting the original from the incoming bucket.
The behaviour I'm seeing is that when I programmatically create the directory structure in S3, those "folders" are consumed, so the directory structure disappears.
I know S3 technically does not have folders, just empty files ending in /.
The twist here is that "folders" created in the S3 Console are NOT consumed; they stay there as we want them to. Any folders created via the AWS CLI or boto3 are immediately consumed.
The problem is that we do need the folders to be created with automation, there are too many to do by hand.
I've reached out to AWS Support, and they just tell me that there are no differences between how the Console creates folders, and how the CLI does it. Support confirmed that the command I used in CLI is correct.
I think my issue is similar to "Apache Camel deleting AWS S3 bucket's folder", but that has no answer...
How can I get Camel to not "eat" any folders?
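To make the "empty files ending in /" point concrete, here is a minimal sketch of how a consumer could tell folder placeholders apart from real files before processing them. It assumes the listing is available as (key, size) pairs; the function names are hypothetical, and nothing here claims to be how Camel itself decides what to consume.

```python
def is_folder_marker(key, size):
    """Return True for a zero-byte object whose key ends in '/'."""
    return key.endswith("/") and size == 0


def files_only(objects):
    """Drop folder placeholders from an iterable of (key, size) pairs."""
    return [(key, size) for key, size in objects if not is_folder_marker(key, size)]
```

With a filter like this, only real files would be moved to the outgoing bucket and the placeholders would be left in place.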

How to decompress split zip files on AWS S3?

I've got a file (4GB) which is too big to upload to AWS S3 over an unstable internet connection, so I split the file into several parts using WinZip.
So, file.csv became a series of files:
- file.z01
- file.z02
- ...
- file.z12
After uploading the parts to AWS S3, I need to unzip them. How do I do it?
You won't be able to do it without the help of an EC2 instance.
If you have already uploaded these small zip files, launch a new EC2 instance, download the files from S3 using curl or wget, combine them, and upload the result to S3 again.
Since you are using WinZip, consider launching a Windows-based instance, as it will be tough for you to find a Linux-based equivalent for WinZip.
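For the "combine them" step, a rough sketch of putting the parts in order and joining them byte for byte is shown below. It assumes the parts follow the file.z01 ... file.z12 naming from the question and that any part without a .zNN suffix belongs last; both function names are hypothetical. Whether the joined file opens as a normal archive depends on how WinZip wrote the split, so treat this as a starting point rather than a guaranteed fix.

```python
import re
import shutil


def part_number(name):
    """Sort key: the NN in '.zNN'; names without that suffix sort last."""
    match = re.search(r"\.z(\d+)$", str(name))
    return int(match.group(1)) if match else float("inf")


def concat_parts(parts, out_path):
    """Append each part, in numeric order, to a single output file."""
    with open(out_path, "wb") as out:
        for part in sorted(parts, key=part_number):
            with open(part, "rb") as src:
                shutil.copyfileobj(src, out)
    return out_path
```

Sorting numerically matters because a plain string sort would only work while every part number has the same width.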

AWS S3 download files with exec permission

I've been struggling with this one for quite a while. Thought it would work out-of-box based on AWS documentation of supporting the acl header.
I'm using the AWS S3 CLI in order to download files from my S3 bucket. Some of the files will need to have 'exec' permissions (running on Linux).
I can chmod the files but I would like to control that during the upload rather than during the download.
So, the question is whether I can use the AWS CLI so that it will automatically grant execution (or other) permissions based on something that I can set during the upload or afterwards on the uploaded file.
Thanks,
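S3 object ACLs control who may access the object in S3 rather than POSIX file modes, so one possible workaround is to record the desired mode as user-defined metadata at upload time and apply it after download. The sketch below assumes a hypothetical metadata key named mode holding an octal string such as 0755; the key name and both helpers are illustrative, not something the AWS CLI applies on its own.

```python
import os


def mode_from_metadata(metadata, default=0o644):
    """Read an octal permission string (e.g. '0755') from object metadata."""
    raw = metadata.get("mode")  # 'mode' is an assumed, user-chosen key
    if raw is None:
        return default
    return int(raw, 8) & 0o777


def apply_mode(path, metadata):
    """chmod a downloaded file according to its stored metadata."""
    os.chmod(path, mode_from_metadata(metadata))
```

The metadata could be set during upload with something like aws s3 cp script.sh s3://my-bucket/script.sh --metadata mode=0755 (my-bucket is a placeholder) and read back after download before calling apply_mode.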

s3fs disable cache

I have a problem viewing video from my bucket on S3.
I'm using an EC2 instance with the bucket mounted as a folder via s3fs. When I try to load a big file, there is a pause before the download starts. During this pause, I can see the file being downloaded (cached) to the EC2 instance. Once it is cached, the file starts downloading in the browser.
I tried to configure s3fs to disable the cache, but the option -o use_cache="" doesn't work. I also tried s3fslite, but it too caches files before sending them to the user.
How can I disable caching? Or is there a faster solution that would let me use an S3 bucket like a folder on EC2?
You don't need to download the files; either serve them directly from S3 or use CloudFront.
If you are trying to control access to the files, use signed URLs, which give the user a limited amount of time to access a file before the link expires.
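As a rough illustration of the signed-URL suggestion, the sketch below wraps boto3's generate_presigned_url call. It assumes boto3 is installed and credentials are configured; the helper name, bucket, and key are hypothetical placeholders.

```python
def presigned_get_url(s3_client, bucket, key, expires_seconds=3600):
    """Return a time-limited download URL for one object.

    s3_client is expected to be a boto3 S3 client, i.e. boto3.client("s3").
    """
    return s3_client.generate_presigned_url(
        "get_object",
        Params={"Bucket": bucket, "Key": key},
        ExpiresIn=expires_seconds,
    )


# Example usage (placeholder names, requires boto3 and AWS credentials):
# import boto3
# url = presigned_get_url(boto3.client("s3"), "my-video-bucket", "videos/big.mp4")
```

The browser then fetches the video straight from S3 with that URL, so nothing has to pass through the s3fs mount or be cached on the EC2 instance.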

Access files in s3n://elasticmapreduce/samples/wordcount/input

How can I access the files sitting in the following S3 folder, which is owned by someone else?
s3n://elasticmapreduce/samples/wordcount/input
The files in s3n://elasticmapreduce/samples/wordcount/input are public, and made available as input by Amazon to the sample word count Hadoop program. The best way to fetch them is to
Start a new Amazon Elastic MapReduce Job Flow (it doesn't matter which one) from the Amazon Web Services console, and make sure that you keep the job alive with the Keep Alive option
Once the EC2 machines have started, find the instances on EC2 from the Amazon Web Services console
ssh into one of the running EC2 instances, using the hadoop user, for example
ssh -i keypair.pem hadoop@ec2-IPADDRESS.compute-1.amazonaws.com
Obtain the files you need, using hadoop dfs -copyToLocal s3://elasticmapreduce/samples/wordcount/input/0002 .
sftp the files to your local system
You can access wordSplitter.py here:
https://elasticmapreduce.s3.amazonaws.com/samples/wordcount/wordSplitter.py
You can access the input files here:
https://elasticmapreduce.s3.amazonaws.com/samples/wordcount/input/0012
https://elasticmapreduce.s3.amazonaws.com/samples/wordcount/input/0011
https://elasticmapreduce.s3.amazonaws.com/samples/wordcount/input/0010
https://elasticmapreduce.s3.amazonaws.com/samples/wordcount/input/0009
https://elasticmapreduce.s3.amazonaws.com/samples/wordcount/input/0008
https://elasticmapreduce.s3.amazonaws.com/samples/wordcount/input/0007
https://elasticmapreduce.s3.amazonaws.com/samples/wordcount/input/0006
https://elasticmapreduce.s3.amazonaws.com/samples/wordcount/input/0005
https://elasticmapreduce.s3.amazonaws.com/samples/wordcount/input/0004
https://elasticmapreduce.s3.amazonaws.com/samples/wordcount/input/0003
https://elasticmapreduce.s3.amazonaws.com/samples/wordcount/input/0002
https://elasticmapreduce.s3.amazonaws.com/samples/wordcount/input/0001
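If you would rather not click through each link, the list above can be generated and fetched in a loop. This is a minimal sketch using only the standard library; it assumes the twelve URLs listed above are still publicly readable, and the function names are made up for illustration.

```python
import os
import urllib.request

BASE = "https://elasticmapreduce.s3.amazonaws.com/samples/wordcount/input/"


def input_urls(first=1, last=12):
    """Build the URLs for input files 0001 through 0012."""
    return [f"{BASE}{n:04d}" for n in range(first, last + 1)]


def download_all(dest_dir="."):
    """Fetch every input file into dest_dir (performs network access)."""
    for url in input_urls():
        target = os.path.join(dest_dir, url.rsplit("/", 1)[-1])
        urllib.request.urlretrieve(url, target)
```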
The owner of the folder (or, more likely, of a file in the folder) must have made it accessible to anonymous readers.
If that is the case, s3n://x/y... is translated to
http://s3.amazonaws.com/x/y...
or
http://x.s3.amazonaws.com/y...
x is the name of the bucket.
y... is the path within the bucket.
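The translation described above can be written as a small helper. This is only a sketch of the mapping as stated in this answer; the function name is hypothetical, and it says nothing about whether the resulting URL is actually readable anonymously.

```python
from urllib.parse import urlparse


def s3n_to_http(uri):
    """Map s3n://x/y... to the two HTTP forms given above."""
    parsed = urlparse(uri)
    bucket = parsed.netloc           # x: the bucket name
    path = parsed.path.lstrip("/")   # y...: the path within the bucket
    return (
        f"http://s3.amazonaws.com/{bucket}/{path}",
        f"http://{bucket}.s3.amazonaws.com/{path}",
    )
```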
If you want to make sure the file exists, e.g. if you suspect the name was misspelled, you can open this in your browser:
http://s3.amazonaws.com/x
and you'll see XML describing the available "files", that is, S3 objects.
Try this:
http://s3.amazonaws.com/elasticmapreduce
I tried this, and it seems that the path you want is not public.
AWS EBS documentation quotes s3://elasticmapreduce/samples/wordcount/input in one of the "getting started" examples. But s3 is different from s3n, so input might be available to EMR, but not to HTTP access.
In Amazon S3, there is no concept of folders; a bucket is just a flat collection of objects. But you can list all the files you are interested in from a browser with the following URL:
s3.amazonaws.com/elasticmapreduce?prefix=samples/wordcount/input/
Then you can download them by specifying the whole name, e.g.
s3.amazonaws.com/elasticmapreduce/samples/wordcount/input/0001
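The same listing can be read from a script instead of a browser. The sketch below builds the prefix URL and pulls the object keys out of the XML that S3 returns; it assumes the standard ListBucketResult layout and namespace, and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

S3_NS = "{http://s3.amazonaws.com/doc/2006-03-01/}"


def list_url(bucket, prefix):
    """Build the listing URL for all keys under a prefix."""
    return f"http://s3.amazonaws.com/{bucket}?" + urlencode({"prefix": prefix}, safe="/")


def keys_from_listing(xml_text):
    """Extract every object key from a ListBucketResult document."""
    root = ET.fromstring(xml_text)
    return [item.findtext(f"{S3_NS}Key") for item in root.iter(f"{S3_NS}Contents")]
```

Each key returned can then be appended to s3.amazonaws.com/elasticmapreduce/ to download that object, as described above.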