How to ls the first file (path included) of every subfolder recursively? - awk

Suppose these are the files:
folder1/11.txt
folder1/12.txt
folder1/levela/11a1.txt
folder1/levela/11a2.txt
folder1/levela/levelb/11b1.txt
folder1/levela/levelb/11b2.txt
folder2/21.txt
folder2/22.txt
folder2/levela/21a1.txt
folder2/levela/21a2.txt
folder2/levela/levelb/21b1.txt
folder2/levela/levelb/21b2.txt
folder3/a/b/c/d/e/deepfile1.txt
folder3/a/b/c/d/e/deepfile2.txt
Is there a way (for example using ls, find, grep, or any GnuWin32 command) to show the first file from every subfolder?
Desired output:
folder1/11.txt
folder1/levela/11a1.txt
folder1/levela/levelb/11b1.txt
folder2/21.txt
folder2/levela/21a1.txt
folder2/levela/levelb/21b1.txt
folder3/a/b/c/d/e/deepfile1.txt
Thank you.

Suggesting this solution:
find -type f -printf "%p %h\n" | sort --key=2,2 --key=1,1 | uniq --skip-fields=1 | awk '{print $1}'
Explanation:
find -type f -printf "%p %h\n"
This find command searches for all regular files under the current directory
and prints, for each file, the file's relative path, a space, and the file's relative folder.
Suggesting to run this command on your directory.
sort --key=2,2 --key=1,1
Sorts the file list lexicographically by the 2nd field (the directory) and then by the 1st field (the path).
The result is that all files are grouped and sorted per their specific directory.
Suggesting to try this:
find -type f -printf "%p %h\n" | sort --key=2,2 --key=1,1
uniq --skip-fields=1
From the sorted file list, removes those lines having a duplicate directory (field #2), keeping only the first file per directory.
awk '{print $1}'
Prints only the first field, the file's relative path.
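For illustration (an assumed run from the directory containing folder1, folder2 and folder3, with GNU find printing paths relative to .), the intermediate "path directory" lines and the final result would look roughly like this:
./folder1/11.txt ./folder1
./folder1/12.txt ./folder1
./folder1/levela/11a1.txt ./folder1/levela
...
and after sort, uniq --skip-fields=1 and awk '{print $1}' only the first file of each directory remains:
./folder1/11.txt
./folder1/levela/11a1.txt
./folder1/levela/levelb/11b1.txt
./folder2/21.txt
./folder2/levela/21a1.txt
./folder2/levela/levelb/21b1.txt
./folder3/a/b/c/d/e/deepfile1.txt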

A bash script:
script.sh
#!/bin/bash
declare -A filesArr # declare associative array mapping each directory to a file
for currFile in $(find "$1" -type f); do # main loop: scan all files under $1
  currDir=$(dirname "$currFile") # get the current file's directory
  if [[ -z ${filesArr["$currDir"]} ]]; then # if current directory is not stored in filesArr yet
    filesArr["$currDir"]="$currFile" # store the current file for this directory
  fi
  if [[ ${filesArr["$currDir"]} > "$currFile" ]]; then # if current file sorts before the stored file
    filesArr["$currDir"]="$currFile" # replace the stored file with the current file
  fi
done
for currFile in "${filesArr[@]}"; do # loop over the array to output one file per directory
  echo "$currFile"
done
Running script.sh on the /tmp folder:
chmod a+x script.sh
./script.sh /tmp
BTW: the answer above with sort and uniq is much faster.

Related

s3cmd copy files preserving path

Is there a way to copy files to an S3 bucket while preserving the file path?
This is the example:
1. I produce a list of files that are different between bucket1 and bucket2 using s3cmd sync --dry-run
The list looks like this:
s3://BUCKET/20150831/PROD/JC-migration-test-01/META-INF/vault/definition/.content.xml
s3://BUCKET/20150831/PROD/JC-migration-test-01/META-INF/vault/nodetypes.cnd
s3://BUCKET/20150831/PROD/JC-migration-test-01/META-INF/vault/properties.xml
s3://BUCKET/20150831/PROD/JC-migration-test-01/jcr_root/.content.xml
s3://BUCKET/20150831/PROD/JC-migration-test-01/jcr_root/content/.content.xml
s3://BUCKET/20150831/PROD/JC-migration-test-01/jcr_root/content/app-store/.content.xml
I need to process this list and upload only the files in it to a new location in the bucket (e.g. s3://bucket/diff/), BUT with the full path as shown in the list.
A simple loop like this:
diff_file_list=$(s3cmd -c s3cfg sync --dry-run s3://BUCKET/20150831/PROD s3://BUCKET/20150831/DEV | awk '{print $2}')
for f in $diff_file_list; do
s3cmd -c s3cfg cp $f s3://BUCKET/20150831/DIFF/
done
does not work; it produces this:
File s3://BUCKET/20150831/PROD/JC-migration-test-01/META-INF/vault/definition/.content.xml copied to s3://BUCKET/20150831/DIFF/.content.xml
File s3://BUCKET/20150831/PROD/JC-migration-test-01/META-INF/vault/nodetypes.cnd copied to s3://BUCKET/20150831/DIFF/nodetypes.cnd
File s3://BUCKET/20150831/PROD/JC-migration-test-01/META-INF/vault/properties.xml copied to s3://BUCKET/20150831/DIFF/properties.xml
File s3://BUCKET/20150831/PROD/JC-migration-test-01/jcr_root/.content.xml copied to s3://BUCKET/20150831/DIFF/.content.xml
File s3://BUCKET/20150831/PROD/JC-migration-test-01/jcr_root/content/.content.xml copied to s3://BUCKET/20150831/DIFF/.content.xml
File s3://BUCKET/20150831/PROD/JC-migration-test-01/jcr_root/content/origin-store/.content.xml copied to s3://BUCKET/20150831/DIFF/.content.xml
Thanks,
Short answer: no, it is not! That is because the paths in S3 buckets are not actually directories/folders; S3 buckets have no such concept of structure, even if various tools present it this way (including s3cmd, which is really confusing...).
So, the "path" is actually a prefix (although s3cmd sync to local knows how to translate this prefix into a directory structure on your filesystem).
For a bash script the solution is:
1. create a file listing all the paths from a s3cmd sync --dry-run command (basically a list of diffs) => file1
2. Copy that file and use sed to modify the paths as needed:
sed 's/\(^s3.*\)PROD/\1DIFF/' => file2
3. Merge the files so that line 1 of file1 is followed by line 1 of file2, and so on:
paste file1 file2 > final.txt
4. Read final.txt line by line in a loop and use each line as a set of 2 parameters to a copy or sync command:
while IFS='' read -r line || [[ -n "$line" ]]; do
s3cmd -c s3cfg sync $line
done < "final.txt"
Notes:
1. $line in the s3cmd command must not be in quotes; if it is, the sync command will complain that it received only one parameter... of course!
2. the [[ -n "$line" ]] is used here so that the last line is still processed even if it has no trailing newline character
Boto unfortunately could not help more, so if you need something similar in Python you would do it pretty much the same way...
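Putting the steps together, a minimal sketch (assuming the bucket layout and the s3cfg config file from the example above; PROD and DIFF are just the prefixes used in the question):
#!/bin/bash
# 1. list the differing paths from the dry run => file1
s3cmd -c s3cfg sync --dry-run s3://BUCKET/20150831/PROD s3://BUCKET/20150831/DEV | awk '{print $2}' > file1
# 2. rewrite the PROD prefix to DIFF => file2
sed 's/\(^s3.*\)PROD/\1DIFF/' file1 > file2
# 3. pair each source path with its destination path
paste file1 file2 > final.txt
# 4. feed each "source destination" pair to s3cmd (deliberately no quotes around $line)
while IFS='' read -r line || [[ -n "$line" ]]; do
s3cmd -c s3cfg sync $line
done < "final.txt"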

AWK to process compressed files and printing original (compressed) file names

I would like to process multiple .gz files with gawk.
I was thinking of decompressing and passing it to gawk on the fly
but I have an additional requirement to also store/print the original file name in the output.
The thing is, there are hundreds of rather large .gz files to process.
Looking for anomalies (~0.001% rows) and want to print out the list of found inconsistencies ALONG with the file name and row number that contained it.
If I could have all the files decompressed I would simply use FILENAME variable to get this.
Because of large quantity and size of those files I can't decompress them upfront.
Any ideas how to pass filename (in addition to the gzip stdout) to gawk to produce required output?
Assuming you are looping over all the files and piping their decompression directly into awk, something like the following will work.
for file in *.gz; do
gunzip -c "$file" | awk -v origname="$file" '.... {print origname " whatever"}'
done
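The .... above stands for whatever anomaly check is needed. As a hedged sketch (the field-count test and the output format here are made up for illustration), it could look like:
# hypothetical check: report rows that do not have exactly 7 fields
for file in *.gz; do
gunzip -c "$file" | awk -v origname="$file" 'NF != 7 {print origname ": line " FNR ": " $0}'
done > baddata.txt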
Edit: To use a list of filenames from some source other than a direct glob something like the following can be used.
$ ls *.awk
a.awk e.awk
$ while IFS= read -d '' filename; do
echo "$filename";
done < <(find . -name \*.awk -printf '%P\0')
e.awk
a.awk
Using xargs instead of the above loop will, I believe, require the body of the command to be in a pre-written script file, which can then be called with xargs and the filename.
This uses a combination of xargs and sh (to be able to pipe two commands, gzip and awk):
find *.gz -print0 | xargs -0 -I fname sh -c 'gzip -dc fname | gawk -v origfile="fname" -f printbadrowsonly.awk >> baddata.txt'
I'm wondering if there's any bad practice with the above approach…
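One caveat with -I fname is that the file name is substituted into the shell command string itself, so names containing spaces or quotes can break the command. A sketch of a variant that passes the name as a positional parameter instead (reusing printbadrowsonly.awk and baddata.txt from the command above):
find . -name '*.gz' -print0 |
xargs -0 -n1 sh -c 'gzip -dc "$1" | gawk -v origfile="$1" -f printbadrowsonly.awk' _ >> baddata.txt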

SSH - Loop through lines from txt file and delete files

I have a .txt file and on each line is a different file location e.g.
file1.zip
file2.zip
file3.zip
How can I open that file, loop through each line and rm -f filename on each one?
Also, will deleting it throw an error if the file doesn't exist (has already been deleted) and if so how can I avoid this?
EDIT: The file names may have spaces in them, so this needs to be catered for as well.
You can use a for loop with cat to iterate through the lines:
IFS=$'\n'
for file in `cat list.txt`; do
  if [ -f "$file" ]; then
    rm -f "$file"
  fi
done
The if [ -f "$file" ] test checks whether the file exists and is a regular file (not a directory). If the check fails, the file is skipped.
The IFS=$'\n' at the top sets the word delimiter to newlines only, which allows you to process file names containing whitespace.
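An alternative worth mentioning, given as a sketch rather than part of the answer above, is the usual while read idiom, which avoids changing IFS for the whole script and also handles names with spaces:
while IFS= read -r file; do
  # -f checks for an existing regular file; rm -f also silences "no such file" errors
  [ -f "$file" ] && rm -f "$file"
done < list.txt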
xargs -n1 echo < test.txt
Replace echo with rm -f or any other command. You can also pipe the file in with cat test.txt | instead of using the redirection.
See man xargs for more info.
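Since the question mentions file names with spaces, note that plain xargs splits on any whitespace; a whitespace-safe sketch (GNU xargs assumed, for the -d option):
# treat each line of test.txt as exactly one argument, even if it contains spaces
xargs -d '\n' -n1 rm -f < test.txt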

script to run a certain program with input from a given directory

So I need to run a bunch of (maven) tests with testfiles being supplied as an argument to a maven task.
Something like this:
mvn clean test -Dtest=<filename>
And the test files are usually organized into different directories. So I'm trying to write a script which would execute the above 'command' and automatically feed the name of all files in a given dir to the -Dtest.
So I started out with a shellscript called 'run_test':
#!/bin/sh
if test $# -lt 2; then
echo "$0: insufficient arguments on the command line." >&1
echo "usage: $0 run_test dirctory" >&1
exit 1
fi
for file in allFiles <<<<<<< what should I put here? Can I somehow iterate thru the list of all files' name in the given directory put the file name here?
do mvn clean test -Dtest= $file
exit $?
The part where I got stuck is how to get a list of filenames.
Thanks,
Assuming $1 contains the directory name (validation of the user input is a separate issue), then
for file in "$1"/*
do
[[ -f $file ]] && mvn clean test -Dtest="$file"
done
will run the command on all files. If you want to recurse into subdirectories then you need to use the find command:
for file in $(find "$1" -type f)
do
etc...
done
#!/bin/bash
# Set IFS to newline to minimise problems with whitespace in file/directory
# names. If we also need to deal with newlines, we will need to use
# find -print0 | xargs -0 instead of a for loop.
IFS="
"
if ! [[ -d "${1}" ]]; then
  echo "Please supply a directory name" >&2
  exit 1
else
  # We use find rather than glob expansion in case there are nested directories.
  # We sort the filenames so that we execute the tests in a predictable order.
  for pathname in $(find "${1}" -type f | LC_ALL=C sort); do
    mvn clean test -Dtest="${pathname}" || break
  done
fi
# exit $? would be superfluous (it is the default)
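For completeness, a sketch of the find -print0 / xargs -0 variant mentioned in the comment above, which also copes with newlines in path names (GNU find, sort and xargs assumed; note it loses the stop-on-first-failure behaviour of || break):
# one mvn invocation per file, with NUL-delimited names throughout
find "${1}" -type f -print0 | LC_ALL=C sort -z | xargs -0 -I{} mvn clean test -Dtest={}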

Can someone help explain this code? It is a shell script for creating a checksum list

#!/bin/bash
# create a list of checksums
cat /dev/null > MD5SUM
for i in */*/*.sql ; do test -e $i && md5sum $i >>MD5SUM ; done
Then this command is used to check to see if anything has changed:
md5sum -c MD5SUM
It works fine and everything. I just don't really understand how. Say if I wanted to make a checksum list of all the files in my home directory $HOME how can I do that? What does the */*/*.sql part of the for loop mean? I'm assuming that is to display SQL files only but how can I modify that? Say I wanted all files in the directory? Why is it not just *.sql ? What does the rest of the for loop do in this case?
Let's go part by part:
cat /dev/null > MD5SUM
This just "erases" (truncates) the MD5SUM file/list that was created before.
for i in */*/*.sql;
this will iterate over files that are 2 directories deep from your current folder. If you have folders
~/a/b
~/c/d
~/e/f
and you run your script in your home folder (~), all *.sql files inside directories b, d, f will have their checksum calculated and appended to the file MD5SUM in the current directory:
do test -e $i && md5sum $i >>MD5SUM ; done
Now Answering your questions:
Say if I wanted to make a checksum list of all the files in my home directory $HOME how can I do that?
I would use the find command with the exec option
find $HOME -maxdepth 1 -name \*.sql -exec md5sum {} \;
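If you literally want every file under $HOME, recursively, a sketch along the same lines (this can take a while on a large home directory and also hashes dotfiles):
# hash every regular file under $HOME except the list itself, then verify later
find "$HOME" -type f ! -name MD5SUM -exec md5sum {} \; > MD5SUM
md5sum -c MD5SUM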
What does the */*/*.sql part of the for loop mean?
I answered this above; anyway, it only goes 2 directories deep before getting to the files.
I'm assuming that is to display SQL files only but how can I modify that? Say I wanted all files in the directory?
Change
for i in */*/*.sql;
to
for i in */*/*;
or, for the current directory:
find $HOME -maxdepth 1 -name \* -exec md5sum {} \;
Why is it not just *.sql ? What does the rest of the for loop do in this case?
Explained before.
Hope it helps =)