I have an S3 bucket with thousands of folders and many .txt files inside those folders.
I would like to list all the .txt files in the bucket so I can check whether they're removable, then remove the ones that are.
Any idea how to do this with s3cmd?
This is fairly simple, but depends on how sophisticated you want the check to be. Suppose you wanted to remove every text file whose filename includes 'foo':
s3cmd ls --recursive s3://mybucket |
awk '{ print $4 }' | grep '\.txt$' | grep 'foo' | xargs s3cmd del
Note that the grep pattern is a regex, not a shell glob: '\.txt$' matches names ending in .txt, whereas "*.txt" would not do what you expect. If you want a more sophisticated check than grep can handle, redirect the first three commands to a file, then either edit the file manually or process it with awk, perl, or whatever your favorite tool is, and finally feed the result to s3cmd via xargs (depending on the check, you could do it all with piping, too):
s3cmd ls --recursive s3://mybucket | awk '{ print $4 }' | grep '\.txt$' > /tmp/textfiles
magic-command-to-check-filenames /tmp/textfiles
cat /tmp/textfiles | xargs s3cmd del
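As a stand-in for the "magic command", here is one hypothetical check run against a local copy of the listing (the bucket name, date-folder layout, and the "keep only 2014 keys" criterion are all invented for illustration):

```shell
# Hypothetical listing; in practice this comes from the s3cmd/awk/grep pipe above
cat > /tmp/textfiles <<'EOF'
s3://mybucket/20140101/report.txt
s3://mybucket/20150101/report.txt
s3://mybucket/20140315/notes.txt
EOF

# Keep only keys whose first folder is a 2014 date. Splitting on "/" gives
# $1="s3:", $2="", $3=bucket, $4=first folder.
awk -F/ '$4 ~ /^2014/' /tmp/textfiles > /tmp/removable
cat /tmp/removable
```

The surviving list in /tmp/removable is then what gets piped into xargs s3cmd del.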
I would like to list some of the oldest files in my GCS storage based on the timestamp in the file name.
The file names look like this:
abcdefghijklmnop_qrstu_vwxyz_table_v2.20190101000000.csv.gz
I was able to list the latest file with this command:
gsutil ls -l gs://bucket_name/folder/* | awk -F\. 'm<$4{m=$4;f=$0} END{print f}'
but I am unable to find the right command to list the oldest file based on the filename. sort -kn | head -n1 did not work.
gsutil ls gs://***/ | awk -F\. 'BEGIN{t=2**64}{if(t>$2){t=$2;m=$0;}}END{print m}'
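Since the timestamp is the second dot-separated field, plain sort can also find the oldest file. A local sketch with invented file names matching the pattern from the question (no gsutil needed to test the sorting itself):

```shell
# Simulated listing; in practice: gsutil ls gs://bucket_name/folder/*
cat > /tmp/listing <<'EOF'
gs://bucket_name/folder/table_v2.20190301000000.csv.gz
gs://bucket_name/folder/table_v2.20190101000000.csv.gz
gs://bucket_name/folder/table_v2.20190201000000.csv.gz
EOF

# Sort numerically on the 2nd "."-separated field (the timestamp);
# head -n1 is then the oldest file, tail -n1 would be the newest
sort -t. -k2,2n /tmp/listing | head -n1
```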
I would like to rename all files in my Amazon S3 bucket with the extension .PDF to .pdf (lowercase).
Has someone already had to do this? There are a lot of files (around 1500). Is s3cmd the best way to do this? How would you do it?
s3cmd ls --recursive s3://bucketname |
awk '{ print $4 }' | grep '\.PDF$' | while read -r line ; do
    s3cmd mv "$line" "${line%.*}.pdf"
done
Note that $4 of the s3cmd ls output is already the full s3:// URL, so it is passed to s3cmd mv as-is, with only the extension rewritten.
A local linux/unix example for renaming all files with .pdf extension to .PDF extension.
mkdir pdf-test
cd pdf-test
touch a{1..10}.pdf
Before
ls
a1.pdf a2.pdf a4.pdf a6.pdf a8.pdf grep.sh
a10.pdf a3.pdf a5.pdf a7.pdf a9.pdf
The script file grep.sh
#!/bin/bash
ls | grep '\.pdf$' | while read -r line ; do # with S3, feed this loop from s3cmd ls instead
  echo "Processing $line"
  # your s3 code goes here
  mv "$line" "${line%.*}.PDF"
done
Make the script executable and run it:
chmod u+x grep.sh
./grep.sh
After
ls
a1.PDF a2.PDF a4.PDF a6.PDF a8.PDF grep.sh
a10.PDF a3.PDF a5.PDF a7.PDF a9.PDF
You can apply the same logic; instead of mv, use s3cmd mv.
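The renaming trick in both answers is bash's ${line%.*} expansion, which strips the shortest trailing ".suffix" so a new extension can be appended. A quick check:

```shell
line="a1.pdf"
echo "${line%.*}"       # shortest trailing ".suffix" removed -> a1
echo "${line%.*}.PDF"   # rebuilt with the new extension -> a1.PDF

# With multiple dots, only the last suffix is stripped
line="archive.tar.pdf"
echo "${line%.*}.PDF"   # -> archive.tar.PDF
```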
Is there a way to copy files to an S3 bucket while preserving the file path?
This is the example:
1. I produce a list of files that are different in bucket1 than in bucket2 using s3cmd sync --dry-run
The list looks like this:
s3://BUCKET/20150831/PROD/JC-migration-test-01/META-INF/vault/definition/.content.xml
s3://BUCKET/20150831/PROD/JC-migration-test-01/META-INF/vault/nodetypes.cnd
s3://BUCKET/20150831/PROD/JC-migration-test-01/META-INF/vault/properties.xml
s3://BUCKET/20150831/PROD/JC-migration-test-01/jcr_root/.content.xml
s3://BUCKET/20150831/PROD/JC-migration-test-01/jcr_root/content/.content.xml
s3://BUCKET/20150831/PROD/JC-migration-test-01/jcr_root/content/app-store/.content.xml
I need to process this list to upload to a new location in the bucket (e.g. s3://bucket/diff/) only the files in the list BUT with the full path as shown in the list.
A simple loop like this:
diff_file_list=$(s3cmd -c s3cfg sync --dry-run s3://BUCKET/20150831/PROD s3://BUCKET/20150831/DEV | awk '{print $2}')
for f in $diff_file_list; do
s3cmd -c s3cfg cp $f s3://BUCKET/20150831/DIFF/
done
does not work; it produces this:
File s3://BUCKET/20150831/PROD/JC-migration-test-01/META-INF/vault/definition/.content.xml copied to s3://BUCKET/20150831/DIFF/.content.xml
File s3://BUCKET/20150831/PROD/JC-migration-test-01/META-INF/vault/nodetypes.cnd copied to s3://BUCKET/20150831/DIFF/nodetypes.cnd
File s3://BUCKET/20150831/PROD/JC-migration-test-01/META-INF/vault/properties.xml copied to s3://BUCKET/20150831/DIFF/properties.xml
File s3://BUCKET/20150831/PROD/JC-migration-test-01/jcr_root/.content.xml copied to s3://BUCKET/20150831/DIFF/.content.xml
File s3://BUCKET/20150831/PROD/JC-migration-test-01/jcr_root/content/.content.xml copied to s3://BUCKET/20150831/DIFF/.content.xml
File s3://BUCKET/20150831/PROD/JC-migration-test-01/jcr_root/content/origin-store/.content.xml copied to s3://BUCKET/20150831/DIFF/.content.xml
Thanks,
Short answer: no, it is not! That is because paths in S3 buckets are not actually directories/folders; S3 buckets have no such concept of structure, even though various tools present it that way (including s3cmd, which is really confusing...).
So the "path" is actually a key prefix (although s3cmd sync to a local destination knows how to translate this prefix into a directory structure on your filesystem).
For a bash script the solution is:
1. Create a file listing all the paths from an s3cmd sync --dry-run command (basically a list of diffs) => file1
2. Copy that file and use sed to modify the paths as needed:
sed 's/^\(s3.*\)PROD/\1DIFF/' file1 > file2
3. Merge the files so that line 1 of file1 is followed by line 1 of file2, and so on:
paste file1 file2 > final.txt
4. Read final.txt line by line in a loop and use each line as the two parameters of a copy or sync command:
while IFS='' read -r line || [[ -n "$line" ]]; do
s3cmd -c s3cfg sync $line
done < "final.txt"
Notes:
1. $line in the s3cmd command must not be in quotes; if it is, the sync command will complain that it received only one parameter... of course!
2. The [[ -n "$line" ]] is there so that the last line is not skipped if it has no trailing newline character.
Unfortunately boto is not much more help here, so if you need something similar in Python you would do it pretty much the same way...
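The sed and paste steps (2 and 3 above) can be verified locally with two of the paths from the dry-run output:

```shell
# file1: source keys, as produced by the s3cmd sync --dry-run listing
cat > /tmp/file1 <<'EOF'
s3://BUCKET/20150831/PROD/JC-migration-test-01/META-INF/vault/nodetypes.cnd
s3://BUCKET/20150831/PROD/JC-migration-test-01/META-INF/vault/properties.xml
EOF

# file2: the same keys with PROD rewritten to DIFF
sed 's/^\(s3.*\)PROD/\1DIFF/' /tmp/file1 > /tmp/file2

# final.txt: "source destination" pairs, one per line, ready for the sync loop
paste -d' ' /tmp/file1 /tmp/file2 > /tmp/final.txt
cat /tmp/final.txt
```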
I would like to process multiple .gz files with gawk.
I was thinking of decompressing and passing it to gawk on the fly
but I have an additional requirement to also store/print the original file name in the output.
The thing is there's 100s of .gz files with rather large size to process.
Looking for anomalies (~0.001% rows) and want to print out the list of found inconsistencies ALONG with the file name and row number that contained it.
If I could have all the files decompressed I would simply use FILENAME variable to get this.
Because of large quantity and size of those files I can't decompress them upfront.
Any ideas how to pass filename (in addition to the gzip stdout) to gawk to produce required output?
Assuming you are looping over all the files and piping their decompression directly into awk, something like the following will work:
for file in *.gz; do
    gunzip -c "$file" | awk -v origname="$file" '.... {print origname " whatever"}'
done
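A self-contained demo of the loop (the .gz files and the "anomaly" pattern BAD are invented for illustration):

```shell
mkdir -p /tmp/gzdemo
printf 'ok\nBAD row\nok\n' | gzip > /tmp/gzdemo/a.gz
printf 'ok\nok\n'          | gzip > /tmp/gzdemo/b.gz

# origname stands in for FILENAME; NR restarts per file because each
# file gets its own awk invocation, so it doubles as the row number
for file in /tmp/gzdemo/*.gz; do
    gunzip -c "$file" | awk -v origname="$file" '/BAD/ {print origname ":" NR ": " $0}'
done
```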
Edit: to use a list of filenames from some source other than a direct glob, something like the following can be used.
$ ls *.awk
a.awk e.awk
$ while IFS= read -r -d '' filename; do
echo "$filename";
done < <(find . -name \*.awk -printf '%P\0')
e.awk
a.awk
Using xargs instead of the above loop would, I believe, require the body of the command to be in a pre-written script file that xargs can then call with each filename.
This uses a combination of xargs and sh (to be able to pipe between two commands, gzip and awk):
find . -name '*.gz' -print0 | xargs -0 -n1 sh -c 'gzip -dc "$1" | gawk -v origfile="$1" -f printbadrowsonly.awk >> baddata.txt' sh
Passing the filename as a positional parameter ("$1") keeps it correctly quoted inside sh -c, even for names containing spaces.
I'm wondering if there's any bad practice with the above approach…
I was using S3Fox and I ended up creating lots of _$folder$ files in multiple S3 directories. I want to clean them all up, but the files are visible neither through the command-line tools nor through S3Fox; they are only visible in the AWS S3 console.
I am looking for a solution something like:
hadoop fs -rmr s3://s3_bucket/dir1/dir2/dir3///*_\$folder\$
You can use s3cmd (http://s3tools.org/s3cmd) and the power of the shell:
s3cmd del $(s3cmd ls --recursive s3://your-bucket | awk '{ print $4 }' | grep '_\$folder\$$')
Note that the object URL is field 4 of the s3cmd ls output (fields 1-3 are date, time, and size), and --recursive is needed to reach the nested prefixes.
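The quoting in the grep step matters: unquoted, the shell would expand $folder to an empty string and grep would treat the trailing $ as an end-of-line anchor, so the pattern must be single-quoted with both dollars escaped. A local check of just that step (the key names are made up):

```shell
# Simulated key listing; only the _$folder$ placeholder objects should match
printf 's3://b/dir1_$folder$\ns3://b/dir1/file.txt\ns3://b/dir2_$folder$\n' > /tmp/keys

# \$ = literal dollar sign; the final unescaped $ anchors the end of line
grep '_\$folder\$$' /tmp/keys
```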