Limiting 'ls' command output in s3fs - amazon-s3

My Amazon S3 bucket has millions of files and I am mounting it using s3fs. Anytime a ls command is issued (not intentionally) the terminal hangs.
Is there a way to limit the number of results returned to 100 when a ls command is issued in a s3fs mounted path?

Try goofys (https://github.com/kahing/goofys). It doesn't limit the number of item returned for ls, but ls is about 40x faster than s3fs when there are lots of files.

It is not recommended to use s3fs in production situations. Amazon S3 is not a filesystem, so attempting to mount it can lead to some synchronization issues (and other issues like you have experienced).
It would be better to use the AWS Command-Line Interface (CLI), which has commands to list, copy and sync files to/from Amazon S3. It can also do partial listing of S3 buckets by path.

Related

How to list several specific files from amazon S3 quickly

I want to check if some files are really in my s3 bucket. Using aws cli I can do it for one files with ls like
aws s3 ls s3://mybucket/file/path/file001.jpg
but I need to be able to do it for several files
aws s3 ls s3://mybucket/file/path/file001.jpg ls s3://mybucket/file/path/file002.jpg
Won't work
nor
aws s3 ls s3://mybucket/file/path/file001.jpg s3://mybucket/file/path/file005.jpg
Of course
aws s3 ls s3://mybucket/file/path/file001.jpg;
aws s3 ls s3://mybucket/file/path/file005.jpg
Works perfectly, but slowly. It takes about 1 sec to get one file, because it connects an close the connection each time.
I've hundreds of files to check on a regular basis, so I need a fast way to do it. Thanks
I'm not insisting on using ls, or passing a path, a "find" of the filenames would also do (but aws cli seems to lack find). Another tool (as long as it can be invoked with the command line), will be ok
I don’t want to get a list of all files or have a script looking at all files and then post process. I need a way to ask s3 give me fila a,r,z in one go.
I think s3api listobjects call should be the one but I fail at its syntax to ask several file names at once.
You can easily do that using python boto3 sdk for AWS
import boto3
s3 = boto3.resource('s3')
bucket=s3.Bucket('mausamrest');
for obj in bucket.objects.all():
print(obj.key)
where mausamrest is the bucket

GoReplay - Upload to S3 does not work

I am trying to capture all incoming traffic on a specific port using GoReplay and to upload it directly to S3 servers.
I am running a simple file server on port 8000 and a gor instance using the (simple) command
gor --input-raw :8000 --output-file s3://<MyBucket>/%Y_%m_%d_%H_%M_%S.log
I does create a temporal file at /tmp/ but other than that, id does not upload any thing to S3.
Additional information :
The OS is Ubuntu 14.04
AWS cli is installed.
The AWS credentials are deffined within the environent
It seems the information you are providing or scenario you explained is not complete however to upload a file from your EC2 machine to S3 is simple as written command below.
aws s3 cp yourSourceFile s3://your bucket
To see your file you can see your file by using below command
aws s3 ls s3://your bucket
However, s3 is object storage and you can't use it to upload those files which are continually editing or adding or updating.

How to use S3 adapter cli for snowball

I'm using s3 adapter to copy files from a snowball device to local machine.
Everything appears to be in order as I was able to run this command and see the bucket name:
aws s3 ls --endpoint http://snowballip:8080
But besides this, aws doesn't offer any examples for calling cp command. How do I provide the bucket name and the key with this --endpoint flag.
Further, when I ran this:
aws s3 ls --endpoint http://snowballip:8080/bucketname
It returned 'Bucket'... Not sure what that means because I expect to see the files.
I can confirm the following is correct for snowball and snowball edge, as #sqlbot says in the comment
aws s3 ls --endpoint http://snowballip:8080 s3://bucketname/[optionalprefix]
References:
http://docs.aws.amazon.com/cli/latest/reference/
http://docs.aws.amazon.com/snowball/latest/ug/using-adapter-cli.html
Just got one in the post

Exporting data from Google Cloud Storage to Amazon S3

I would like to transfer data from a table in BigQuery, into another one in Redshift.
My planned data flow is as follows:
BigQuery -> Google Cloud Storage -> Amazon S3 -> Redshift
I know about Google Cloud Storage Transfer Service, but I'm not sure it can help me. From Google Cloud documentation:
Cloud Storage Transfer Service
This page describes Cloud Storage Transfer Service, which you can use
to quickly import online data into Google Cloud Storage.
I understand that this service can be used to import data into Google Cloud Storage and not to export from it.
Is there a way I can export data from Google Cloud Storage to Amazon S3?
You can use gsutil to copy data from a Google Cloud Storage bucket to an Amazon bucket, using a command such as:
gsutil -m rsync -rd gs://your-gcs-bucket s3://your-s3-bucket
Note that the -d option above will cause gsutil rsync to delete objects from your S3 bucket that aren't present in your GCS bucket (in addition to adding new objects). You can leave off that option if you just want to add new objects from your GCS to your S3 bucket.
Go to any instance or cloud shell in GCP
First of all configure your AWS credentials in your GCP
aws configure
if this is not recognising the install AWS CLI follow this guide https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-install.html
follow this URL for AWS configure
https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-configure.html
Attaching my screenshot
Then using gsutil
gsutil -m rsync -rd gs://storagename s3://bucketname
16GB data transferred in some minutes
Using Rclone (https://rclone.org/).
Rclone is a command line program to sync files and directories to and from
Google Drive
Amazon S3
Openstack Swift / Rackspace cloud files / Memset Memstore
Dropbox
Google Cloud Storage
Amazon Drive
Microsoft OneDrive
Hubic
Backblaze B2
Yandex Disk
SFTP
The local filesystem
Using the gsutil tool we can do a wide range of bucket and object management tasks, including:
Creating and deleting buckets.
Uploading, downloading, and deleting objects.
Listing buckets and objects. Moving, copying, and renaming objects.
we can copy data from a Google Cloud Storage bucket to an amazon s3 bucket using gsutil rsync and gsutil cp operations. whereas
gsutil rsync collects all metadata from the bucket and syncs the data to s3
gsutil -m rsync -r gs://your-gcs-bucket s3://your-s3-bucket
gsutil cp copies the files one by one and as the transfer rate is good it copies 1 GB in 1 minute approximately.
gsutil cp gs://<gcs-bucket> s3://<s3-bucket-name>
if you have a large number of files with high data volume then use this bash script and run it in the background with multiple threads using the screen command in amazon or GCP instance with AWS credentials configured and GCP auth verified.
Before running the script list all the files and redirect to a file and read the file as input in the script to copy the file
gsutil ls gs://<gcs-bucket> > file_list_part.out
Bash script:
#!/bin/bash
echo "start processing"
input="file_list_part.out"
while IFS= read -r line
do
command="gsutil cp ${line} s3://<bucket-name>"
echo "command :: $command :: $now"
eval $command
retVal=$?
if [ $retVal -ne 0 ]; then
echo "Error copying file"
exit 1
fi
echo "Copy completed successfully"
done < "$input"
echo "completed processing"
execute the Bash script and write the output to a log file to check the progress of completed and failed files.
bash file_copy.sh > /root/logs/file_copy.log 2>&1
I needed to transfer 2TB of data from Google Cloud Storage bucket to Amazon S3 bucket.
For the task, I created the Google Compute Engine of V8CPU (30 GB).
Allow Login using SSH on the Compute Engine.
Once logedin create and empty .boto configuration file to add AWS credential information. Added AWS credentials by taking the reference from the mentioned link.
Then run the command:
gsutil -m rsync -rd gs://your-gcs-bucket s3://your-s3-bucket
The data transfer rate is ~1GB/s.
Hope this help.
(Do not forget to terminate the compute instance once the job is done)
For large amounts of large files (100MB+) you might get issues with broken pipes and other annoyances, probably due to multipart upload requirement (as Pathead mentioned).
For that case you're left with simple downloading all files to your machine and uploading them back. Depending on your connection and data amount, it might be more effective to create VM instance to utilize high-speed connection and ability to run it in the background on different machine than yours.
Create VM machine (make sure the service account has access to your buckets), connect via SSH and install AWS CLI (apt install awscli) and configure the access to S3 (aws configure).
Run these two lines, or make it a bash script, if you have many buckets to copy.
gsutil -m cp -r "gs://$1" ./
aws s3 cp --recursive "./$1" "s3://$1"
(It's better to use rsync in general, but cp was faster for me)
Tools like gsutil and aws s3 cp won't use multipart uploads/downloads, so will have poor performance for large files.
Skyplane is a much faster alternative for transferring data between clouds (up to 110x for large files). You can transfer data with the command:
skyplane cp -r s3://aws-bucket-name/ gcs://google-bucket-name/
(disclaimer: I am a contributor)

getting large datasets onto amazon elastic map reduce

There are some large datasets (25gb+, downloadable on the Internet) that I want to play around with using Amazon EMR. Instead of downloading the datasets onto my own computer, and then re-uploading them onto Amazon, what's the best way to get the datasets onto Amazon?
Do I fire up an EC2 instance, download the datasets (using wget) into S3 from within the instance, and then access S3 when I run my EMR jobs? (I haven't used Amazon's cloud infrastructure before, so not sure if what I just said makes any sense.)
I recommend the following...
fire up your EMR cluster
elastic-mapreduce --create --alive --other-options-here
log on to the master node and download the data from there
wget http://blah/data
copy into HDFS
hadoop fs -copyFromLocal data /data
There's no real reason to put the original dataset through S3. If you want to keep the results you can move them into S3 before shutting down your cluster.
If the dataset is represented by multiple files you can use the cluster to download it in parallel across the machines. Let me know if this is the case and I'll walk you through it.
Mat
If you're just getting started and experimenting with EMR, I'm guessing you want these on s3 so you don't have to start an interactive Hadoop session (and instead use the EMR wizards via the AWS console).
The best way would be to start a micro instance in the same region as your S3 bucket, download to that machine using wget and then use something like s3cmd (which you'll probably need to install on the instance). On Ubuntu:
wget http://example.com/mydataset dataset
sudo apt-get install s3cmd
s3cmd --configure
s3cmd put dataset s3://mybucket/
The reason you'll want your instance and s3 bucket in the same region is to avoid extra data transfer charges. Although you'll be charged in bound bandwidth to the instance for the wget, the xfer to S3 will be free.
I'm not sure about it, but to me it seems like hadoop should be able to download files directly from your sources.
just enter http://blah/data as your input, and hadoop should do the rest. It certainly works with s3, why should it not work with http?