How to delete _$folder$ files from S3 with cmd or s3cmd? - file-io

I was using S3Fox and I ended up creating lots of _$folder$ files in multiple S3 directories. I want to clean them all up, but the files are visible neither through the command-line tool nor through S3Fox; they only show up in the AWS S3 console.
I am looking for a solution along the lines of:
hadoop fs -rmr s3://s3_bucket/dir1/dir2/dir3///*_\$folder\$

You can use s3cmd (http://s3tools.org/s3cmd) and the power of the shell:
s3cmd del $(s3cmd --recursive ls s3://your-bucket | grep '_\$folder\$' | awk '{ print $4 }')
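If you have the AWS CLI configured, an alternative sketch is to let its own filters do the matching; the bucket name is a placeholder, the include pattern is single-quoted so the shell does not expand $folder, and --dryrun only previews the deletions (drop it to actually delete):
aws s3 rm s3://your-bucket/ --recursive --exclude "*" --include '*_$folder$' --dryrun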

Related

How to download S3-Bucket, compress on the fly and reupload to another s3 bucket without downloading locally?

I want to download the contents of an S3 bucket (hosted on Wasabi, which claims to be fully S3 compatible) to my VPS, tar, gzip and gpg it, and re-upload the resulting archive to another S3 bucket on Wasabi.
My VPS only has 30 GB of storage and the whole bucket is about 1000 GB in size, so I need to download, archive, encrypt and re-upload all of it on the fly, without storing the data locally.
The secret seems to be the | pipe. But I am stuck at the very first step of downloading a bucket into a local archive (I want to go step by step):
s3cmd sync s3://mybucket | tar cvz archive.tar.gz -
In my mind at the end I expect some code like this:
s3cmd sync s3://mybucket | tar cvz | gpg --passphrase secretpassword | s3cmd put s3://theotherbucket/archive.tar.gz.gpg
but it's not working so far.
What am I missing?
The aws s3 sync command copies multiple files to the destination. It does not copy to stdout.
You could use aws s3 cp with a dash as the destination, for example aws s3 cp s3://mybucket/stream.txt - (note the dash at the end), to copy the contents of an object to stdout.
From cp — AWS CLI Command Reference:
The following cp command downloads an S3 object locally as a stream to standard output. Downloading as a stream is not currently compatible with the --recursive parameter:
aws s3 cp s3://mybucket/stream.txt -
This will only work for a single file.
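For a single object, the full chain the question describes can then be streamed end to end, because aws s3 cp can also upload from stdin when the source is a dash. A rough sketch using symmetric gpg encryption to match the --passphrase idea (the --pinentry-mode loopback option may or may not be needed depending on your GnuPG version):
aws s3 cp s3://mybucket/stream.txt - | gzip | gpg --batch --pinentry-mode loopback --passphrase secretpassword -c | aws s3 cp - s3://theotherbucket/stream.txt.gz.gpg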
You may try https://github.com/kahing/goofys. I guess in your case it could be the following approach:
$ goofys source-s3-bucket-name /mnt/src
$ goofys destination-s3-bucket-name /mnt/dst
$ tar -cvzf - /mnt/src | gpg -e -o /mnt/dst/archive.tgz.gpg
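If you would rather not mount the destination bucket as well, the same idea combines with the stdin upload shown above: tar the mounted source and stream the encrypted archive straight back to S3. A sketch, with bucket names and the mount point as placeholders:
goofys source-s3-bucket-name /mnt/src
tar -czf - -C /mnt/src . | gpg --batch --pinentry-mode loopback --passphrase secretpassword -c | aws s3 cp - s3://theotherbucket/archive.tar.gz.gpg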

Rename files in Amazon S3

I would like to rename all files in my Amazon S3 bucket with the extension .PDF to .pdf (lowercase).
Has anyone had to do this already? There are a lot of files (around 1500). Is s3cmd the best way to do this? How would you do it?
s3cmd --recursive ls s3://bucketname |
awk '{ print $4 }' | grep '\.pdf$' | while read -r line ; do
    s3cmd mv "$line" "${line%.*}.PDF"
done
A local Linux/Unix example of renaming all files with a .pdf extension to a .PDF extension:
mkdir pdf-test
cd pdf-test
touch a{1..10}.pdf
Before
ls
a1.pdf a2.pdf a4.pdf a6.pdf a8.pdf grep.sh
a10.pdf a3.pdf a5.pdf a7.pdf a9.pdf
The script file grep.sh
#!/bin/bash
ls | grep '\.pdf$' | while read -r line ; do # here use ls from s3
echo "Processing $line"
# your s3 code goes here
mv "$line" "${line%.*}.PDF"
done
Add execute permission and run it:
chmod u+x grep.sh
./grep.sh
After
ls
a1.PDF a2.PDF a4.PDF a6.PDF a8.PDF grep.sh
a10.PDF a3.PDF a5.PDF a7.PDF a9.PDF
You can apply the same logic; instead of mv, use s3cmd mv (or aws s3 mv).
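Applied to S3 in the direction the question actually asks for (.PDF to lowercase .pdf), a sketch with the AWS CLI could look like this; the bucket name is a placeholder, and keys containing spaces would need more careful handling than awk '{ print $4 }' provides:
aws s3 ls s3://bucketname --recursive | awk '{ print $4 }' | grep '\.PDF$' | while read -r key ; do
    aws s3 mv "s3://bucketname/$key" "s3://bucketname/${key%.PDF}.pdf"
done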

AWS S3 - Example of searching files in S3 using regex

How do I list files in S3 using a regex (from the Linux CLI)? I have files in an S3 bucket such as sales1.txt, sales2.txt, etc. When I run the command below, nothing is displayed. Is there a command to list all the files in an S3 bucket with a regex?
Command:
aws s3 ls s3://test/sales*txt
Expected output:
sales1.txt
sales2.txt
sales3.txt
Use the following command
aws s3 ls s3://test/ | grep '[sales].txt'
The accepted solution is too broad and matches too much. Try this:
aws s3 ls s3://test/ | grep 'sales.*\.txt'
I have been trying to sort this out; the aws s3 ls command does not support any regex or pattern-matching option, so we have to fall back on shell tools such as grep or awk:
aws s3 ls s3://bucket/path/ | grep sales | grep txt
aws s3 ls s3://bucket/path/ | grep 'sales..txt'
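If you want to avoid grep, the lower-level s3api command accepts a server-side key prefix plus a JMESPath query (the query itself is applied client-side by the CLI); a sketch using the bucket name from the question:
aws s3api list-objects-v2 --bucket test --prefix sales --query "Contents[?ends_with(Key, '.txt')].Key" --output text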

s3cmd copy files preserving path

Is there a way to copy files to an S3 bucket while preserving the file path?
This is the example:
1. I produce a list of files that are different in bucket1 than in bucket2 using s3cmd sync --dry-run
The list looks like this:
s3://BUCKET/20150831/PROD/JC-migration-test-01/META-INF/vault/definition/.content.xml
s3://BUCKET/20150831/PROD/JC-migration-test-01/META-INF/vault/nodetypes.cnd
s3://BUCKET/20150831/PROD/JC-migration-test-01/META-INF/vault/properties.xml
s3://BUCKET/20150831/PROD/JC-migration-test-01/jcr_root/.content.xml
s3://BUCKET/20150831/PROD/JC-migration-test-01/jcr_root/content/.content.xml
s3://BUCKET/20150831/PROD/JC-migration-test-01/jcr_root/content/app-store/.content.xml
I need to process this list and copy only the files in it to a new location in the bucket (e.g. s3://bucket/diff/), BUT keeping the full path as shown in the list.
A simple loop like this:
diff_file_list=$(s3cmd -c s3cfg sync --dry-run s3://BUCKET/20150831/PROD s3://BUCKET/20150831/DEV | awk '{print $2}')
for f in $diff_file_list; do
s3cmd -c s3cfg cp $f s3://BUCKET/20150831/DIFF/
done
does not work; it produces this:
File s3://BUCKET/20150831/PROD/JC-migration-test-01/META-INF/vault/definition/.content.xml copied to s3://BUCKET/20150831/DIFF/.content.xml
File s3://BUCKET/20150831/PROD/JC-migration-test-01/META-INF/vault/nodetypes.cnd copied to s3://BUCKET/20150831/DIFF/nodetypes.cnd
File s3://BUCKET/20150831/PROD/JC-migration-test-01/META-INF/vault/properties.xml copied to s3://BUCKET/20150831/DIFF/properties.xml
File s3://BUCKET/20150831/PROD/JC-migration-test-01/jcr_root/.content.xml copied to s3://BUCKET/20150831/DIFF/.content.xml
File s3://BUCKET/20150831/PROD/JC-migration-test-01/jcr_root/content/.content.xml copied to s3://BUCKET/20150831/DIFF/.content.xml
File s3://BUCKET/20150831/PROD/JC-migration-test-01/jcr_root/content/origin-store/.content.xml copied to s3://BUCKET/20150831/DIFF/.content.xml
Thanks,
Short answer: no, it is not! That is because the paths in S3 buckets are not actually directories/folders; an S3 bucket has no such concept of structure, even if various tools present it this way (including s3cmd, which is really confusing...).
So the "path" is actually a key prefix (although s3cmd sync to a local destination knows how to translate this prefix into a directory structure on your filesystem).
For a bash script the solution is:
1. Create a file listing all the paths from an s3cmd sync --dry-run command (basically a list of diffs) => file1
2. Copy that file and use sed to modify the paths as needed => file2:
sed 's/\(^s3.*\)PROD/\1DIFF/' file1 > file2
3. Merge the files so that line 1 in file1 is followed by line 1 in file2, and so on:
paste file1 file2 > final.txt
4. Read final.txt line by line in a loop and use each line as a pair of parameters to a copy or sync command:
while IFS='' read -r line || [[ -n "$line" ]]; do
s3cmd -c s3cfg sync $line
done < "final.txt"
Notes:
1. $line in the s3cmd command must not be in quotes; if it is, the sync command will complain that it received only one parameter... of course!
2. the [[ -n "$line" ]] is there so that the last line is still processed even if the file has no trailing newline character
Unfortunately boto could not help much more here, so if you need something similar in Python you would do it pretty much the same way...
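A variant of the same idea that skips the sed/paste steps: since every source path shares one prefix, the destination can be derived inside the loop with bash parameter expansion. A sketch using the prefixes from the question (here cp gets exactly one source and one destination per iteration, so quoting is fine):
src="s3://BUCKET/20150831/PROD"
dst="s3://BUCKET/20150831/DIFF"
s3cmd -c s3cfg sync --dry-run "$src" s3://BUCKET/20150831/DEV | awk '{ print $2 }' | while read -r f ; do
    s3cmd -c s3cfg cp "$f" "$dst${f#$src}"
done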

s3cmd: searching for files based on extension and delete from bucket

I have an S3 bucket with thousands of folders and many .txt files inside those folders.
I would like to list all the .txt files in the bucket so I can check whether they're removable, and then remove them if they are.
Any idea how to do this with s3cmd?
This is fairly simple, but depends on how sophisticated you want the check to be. Suppose you wanted to remove every text file whose filename includes 'foo':
s3cmd --recursive ls s3://mybucket |
awk '{ print $4 }' | grep '\.txt$' | grep "foo" | xargs s3cmd del
If you want a more sophisticated check than grep can handle, just redirect the first three commands to a file, then either manually edit the file or use awk or perl or whatever your favorite tool is, and then feed the output into s3cmd (depending on the check, you could do it all with piping, too):
s3cmd --recursive ls s3://mybucket | awk '{ print $4 }' | grep '\.txt$' > /tmp/textfiles
magic-command-to-check-filenames /tmp/textfiles
cat /tmp/textfiles | xargs s3cmd del
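As one example of a check that a plain grep cannot express, the size column from s3cmd ls lets you restrict the deletion, e.g. to empty .txt files only. A sketch (adjust the awk condition to whatever makes a file removable in your case; xargs -r is the GNU option that skips the call when nothing matches):
s3cmd --recursive ls s3://mybucket | awk '$4 ~ /\.txt$/ && $3 == 0 { print $4 }' | xargs -r s3cmd del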