Print File Paths and Filenames within a Directory into a CSV - awk

I have a directory that contains several files. For instance:
File1.bam
File2.bam
File3.bam
I want to create a .csv file that contains 2 columns and includes a header:
Name,Path
File1, /Path/To/File1.bam
File2, /Path/To/File2.bam
File3, /Path/To/File3.bam
I've managed to piece together a way of doing this in separate steps, but it involves creating a csv with the path, appending with the filename, and then appending again with the header. I would like to add both the filename and path in 1 step, so that there is no possibility of linking an incorrect filename and path.
In case it matters, I'm trying to do this within a script that is running in a batch job (SLURM), and the output CSV will be used in subsequent workflow steps.
find ~/Desktop/test -iname '*.csv' -type f >bamlist1.csv
awk '{print FILENAME (NF?",":"") $0}' *.csv > test.csv
{ echo 'Name, Path'; cat bamlist.csv; } > bamdata.csv

Untested but should be close:
find ~/Desktop/test -name '*.bam' -type f |
awk '
BEGIN { OFS=","; print "Name", "Path" }
{ fname=$0; sub(".*/","",fname); print fname, $0 }
' > bamdata.csv
Like your original script the above assumes that none of your file/directory names contain newlines or commas.
If you have GNU find you can just do:
{ echo "Name,Path"; find ~/Desktop/test -name '*.bam' -type f -printf '%f,%p\n'; } > bamdata.csv
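The expected output in the question lists names without the .bam extension, which the commands above keep. Here is a minimal sketch that also strips the extension, using only POSIX find and awk. The throwaway directory from mktemp, the sample files and the sort step are assumptions added so the example is reproducible; they are not part of the original answer.

```shell
# Hypothetical demo directory standing in for ~/Desktop/test
demo=$(mktemp -d)
touch "$demo/File1.bam" "$demo/File2.bam" "$demo/File3.bam"

find "$demo" -name '*.bam' -type f | sort |
awk '
BEGIN { OFS = ","; print "Name", "Path" }
{
    name = $0
    sub(/.*\//, "", name)      # drop everything up to the last slash
    sub(/\.bam$/, "", name)    # drop the extension
    print name, $0             # name and path come from the same input line
}' > "$demo/bamdata.csv"

cat "$demo/bamdata.csv"
```

Like the commands above, this assumes no newlines or commas in file or directory names.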

Using awk to find and replace strings in every file in directory

I have a directory full of output files, with files with names:
file1.out,file2.out,..., fileN.out.
There is a certain key string in each of these files; let's call it keystring. I want to replace every instance of keystring with newstring in all files.
If there was only one file, I know I can do:
awk '{gsub("keystring","newstring",$0); print $0}' file1.out > file1.out
Is there a way to loop through all N files in awk?
You could use the find command for this. Run it on a test file first, and only run it on your actual path (on all your actual files) once it works, to be safe. It also needs a recent version of gawk (4.1.0 or later), which has the inplace option to save output back into the files themselves.
find your_path -type f -name "*.out" -exec awk -i inplace -f myawkProgram.awk {} +
Here your awk program is as follows, per your shown samples (cat myawkProgram.awk is only there to show the contents of the awk program):
cat myawkProgram.awk
{
gsub("keystring","newstring",$0); print $0
}
A 2nd option is to pass all the .out files to gawk itself with -i inplace, like this (again, run it on a single test file first, and run the actual command only once you are convinced it works):
awk -i inplace '{gsub("keystring","newstring",$0); print $0}' *.out
sed is arguably the best fit for this, so combine it with find (the -i form shown here is GNU sed's):
find /directory/path -type f -name "*.out" -exec sed -i 's/keystring/newstring/g' {} +
This finds files with the extension .out and runs the sed command on them, passing as many of the found files as possible to each sed invocation (the + with -exec).
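Redirecting awk's output onto the file it is reading (awk '...' file1.out > file1.out) truncates the file before awk gets to read it. Where neither gawk's inplace extension nor sed -i is available, a temporary file does the job. This is a rough sketch under that assumption; the demo directory and sample contents are made up for illustration.

```shell
# Hypothetical sample files in a throwaway directory
demo=$(mktemp -d)
printf 'a keystring b\nkeystring\n' > "$demo/file1.out"
printf 'nothing to change\n' > "$demo/file2.out"

for f in "$demo"/*.out; do
    tmp=$(mktemp) || exit 1
    # Write the edited text to a temporary file first, then copy it back
    # over the original only if awk succeeded.
    if awk '{ gsub(/keystring/, "newstring"); print }' "$f" > "$tmp"; then
        cat "$tmp" > "$f"
    fi
    rm -f "$tmp"
done

cat "$demo/file1.out"
```

Copying back with cat rather than mv keeps the original file's permissions and inode.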

An `awk` method to print after the last `/` on every line of a text file

I'm in need of an awk method to print after the last / on every line of a text file. I've got a txt file which includes file directories.
~/2500andMore.txt file
/Volumes/MCU/_ 0 _ Mt Cabins Ut/Raw/Log Cabin on The Stream/DJI_0003.JPG
/Volumes/MCU/_ 0 _ Mt Cabins Ut/Raw/DJI_0022.jpg
/Volumes/MCU/_ 0 _ Mt Cabins Ut/PTMAD/Insta/RAW/IMG_1049.jpg
The idea is to copy files that are all on one directory back to the example given. I'm planning to do something like this to copy them back:
awk '{print <after the last '/' of the line>}' ~/2500andMore.txt > ~/filename.txt
cd file-directory
while IFS= read -r file && IFS= read -r all <&3; do cp $file $all; done <~/filename.txt 3<~/2500andMore.txt
Judging by your example, you don't need to put the filenames into a separate file, just change your while loop:
while IFS= read -r all; do
file=${all##*/} # Or file=$(basename "$all")
cp "$file" "$all"
done <~/2500andMore.txt
But if you really want the filenames in filename.txt, then use:
awk -F / '{ print $NF }' ~/2500andMore.txt > ~/filename.txt
Explanation: "/" is used as a separator and $NF represents the last field
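A quick check of both approaches on one of the sample paths, which contains spaces. The variable name from_awk is only for illustration:

```shell
all='/Volumes/MCU/_ 0 _ Mt Cabins Ut/Raw/DJI_0022.jpg'

# Shell parameter expansion: remove the longest prefix matching "*/"
file=${all##*/}

# awk: split on "/" and print the last field
from_awk=$(printf '%s\n' "$all" | awk -F/ '{ print $NF }')

printf '%s\n%s\n' "$file" "$from_awk"
```

Both give the part after the last slash. The spaces are harmless here because the field separator is / and the shell variables are quoted.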

cat lines from X to Y of multiple files into one file

I have many huge files in, say, 3 different folders, from which I would like to copy lines X to Y of files with the same name and append them into a new file of the same name.
I tried doing
ls seed1/* | while read FILE; do
head -n $Y | tail -n $X seed1/$FILE seed2/$FILE seed3/$FILE > combined/$FILE
done
This does the job for the first value of $FILE, but this does not return the prompt, and hence I am unable to execute this loop.
For example i have the following files in three different folders, seed1, seed2 and seed3:
seed1/foo.dat
seed1/bar.dat
seed1/qax.dat
seed2/foo.dat
seed2/bar.dat
seed2/qax.dat
seed3/foo.dat
seed3/bar.dat
seed3/qax.dat
I would like to combine lines 10 to 20 of all files in to a combined folder:
combined/foo.dat
combined/bar.dat
combined/qax.dat
Each of the files in combined have 30 lines, with 10 each from seed1,seed2 and seed3.
No loop required:
awk -v x=10 -v y=20 '
FNR==1 { out = gensub(/.*\//,"combined/",1,FILENAME) }
FNR>=x { print > out }
FNR==y { nextfile }
' seed*/*.dat
The above assumes the "combined" directory already exists (empty or not) before awk is called and uses GNU awk for gensub() and nextfile and internal file management. Solutions with other awks are less efficient, require a bit more coding, and require you to manage closing files when too many are going to be open.
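If GNU awk is not available, a plain shell loop around sed is one possible fallback. This is a sketch, not a tuned solution. The demo directory and generated file contents are made up, and it assumes every file name in seed1 also exists in seed2 and seed3.

```shell
# Hypothetical test data: 30 numbered lines per file
demo=$(mktemp -d)
cd "$demo" || exit 1
mkdir seed1 seed2 seed3 combined
for d in seed1 seed2 seed3; do
    for n in foo bar; do
        awk -v tag="$d/$n" 'BEGIN { for (i = 1; i <= 30; i++) print tag, "line", i }' > "$d/$n.dat"
    done
done

x=10
y=20
for path in seed1/*.dat; do
    name=${path##*/}
    for d in seed1 seed2 seed3; do
        # Print lines x..y, then quit so the rest of a huge file is not read
        sed -n "${x},${y}p;${y}q" "$d/$name"
    done > "combined/$name"
done

wc -l combined/*.dat
```

Lines 10 to 20 inclusive are 11 lines per input file, so each combined file ends up with 33 lines.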

Trying to modify awk code

awk 'BEGIN{OFS=","} FNR == 1
{if (NR > 1) {print fn,fnr,nl}
fn=FILENAME; fnr = 1; nl = 0}
{fnr = FNR}
/ERROR/ && FILENAME ~ /\.gz$/ {nl++}
{
cmd="gunzip -cd " FILENAME
cmd; close(cmd)
}
END {print fn,fnr,nl}
' /tmp/appscraps/* > /tmp/test.txt
The above scans all files in a given directory and prints the file name, the number of lines in each file, and the number of lines containing 'ERROR'.
I'm now trying to make the script run a command if any file it reads isn't a regular file, i.e., if the file is a gzip file, run a particular command.
Above is my attempt to include the gunzip command myself. Unfortunately, it isn't working. Also, I cannot gunzip all the files in the directory beforehand, because not all files in the directory will be gzip files; some will be regular files.
So I need the script to treat any .gz file it finds differently so it can read it, count and print the number of lines in it, and the number of lines matching the supplied pattern (just as it would if the file had been a regular file).
Any help?
This part of your script makes no sense:
{if (NR > 1) {print fn,fnr,nl}
fn=FILENAME; fnr = 1; nl = 0}
{fnr = FNR}
/ERROR/ && FILENAME ~ /\.gz$/ {nl++}
Let me restructure it a bit and comment it so it's clearer what it does:
{   # for every line of every input file, do the following:
    # If this is the 2nd or subsequent line, print the values of these variables:
    if (NR > 1) {
        print fn,fnr,nl
    }
    fn = FILENAME   # set fn to FILENAME. Since this will occur for the first line of
                    # every file, this is the value fn will have when printed above,
                    # so why not just get rid of fn and print FILENAME?
    fnr = 1         # set fnr to 1. This is immediately over-written below by
                    # setting it to FNR so this is pointless.
    nl = 0
}
{   # for every line of every input file, also do the following
    # (note the unnecessary "}" then "{" above):
    fnr = FNR       # set fnr to FNR. Since this will occur for the first line of
                    # every file, this is the value fnr will have when printed above,
                    # so why not just get rid of fnr and print FNR-1?
}
/ERROR/ && FILENAME ~ /\.gz$/ {
    nl++            # increment the value of nl. Since nl is always set to zero above,
                    # this will only ever set it to 1, so why not just set it to 1?
                    # I suspect the real intent is to NOT set it to zero above.
}
You also have the code above testing for a file name that ends in ".gz" but then you're running gunzip on every file in the very next block.
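For reference, the per-file counting the script seems to be aiming for can be sketched in one awk call, restricted to plain (uncompressed) files. This is a guess at the intent, not the asker's confirmed requirement. The sample files are made up, and empty input files would be skipped.

```shell
# Hypothetical sample logs in a throwaway directory
demo=$(mktemp -d)
printf 'a\nERROR b\nc\n' > "$demo/one.log"
printf 'ERROR x\nERROR y\n' > "$demo/two.log"

counts=$(awk -v OFS=',' '
FNR == 1 && NR > 1 { print fn, fnr, nl + 0; nl = 0 }   # new file: report the previous one
{ fn = FILENAME; fnr = FNR }                           # remember current file and line count
/ERROR/ { nl++ }
END { if (NR > 0) print fn, fnr, nl + 0 }              # report the last file
' "$demo"/*.log)

printf '%s\n' "$counts"
```

The FNR == 1 condition and its action sit on the same line here, so the action runs once per file rather than for every input line.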
Beyond that, just call gunzip from shell as everyone else also suggested. awk is a tool for parsing text, it's not an environment from which to call other tools - that's what a shell is for.
For example, assuming your comment (prints the file name, number of lines in each file and number of lines found containing 'ERROR') accurately describes what you want your awk script to do and assuming it makes sense to test for the word "ERROR" directly in a ".gz" file using awk:
for file in /tmp/appscraps/*.gz
do
awk -v OFS=',' '/ERROR/{nl++} END{print FILENAME, NR+0, nl+0}' "$file"
gunzip -cd "$file"
done > /tmp/test.txt
Much clearer and simpler, isn't it?
If it doesn't make sense to test for the word ERROR directly in a ".gz" file, then you can do this instead:
for file in /tmp/appscraps/*.gz
do
zcat "$file" | awk -v file="$file" -v OFS=',' '/ERROR/{nl++} END{print file, NR+0, nl+0}'
gunzip -cd "$file"
done > /tmp/test.txt
To handle gz and non-gz files as you've now described in your comment below:
for file in /tmp/appscraps/*
do
case $file in
*.gz ) cmd="zcat" ;;
* ) cmd="cat" ;;
esac
"$cmd" "$file" |
awk -v file="$file" -v OFS=',' '/ERROR/{nl++} END{print file, NR+0, nl+0}'
done > /tmp/test.txt
I left out the gunzip since you don't need it as far as I can tell from your stated requirements. If I'm wrong, explain what you need it for.
I think it could be simpler than that.
With shell expansion you already have the file name (hence you can print it).
So you can do a loop over all the files, and for each do the following:
print the file name
zgrep -c ERROR "$file" (this outputs the number of lines containing 'ERROR')
zcat -f "$file" | wc -l (this outputs the number of lines in the file)
zgrep works on both plain text files and gzipped ones, and so does zcat when given -f (at least with GNU gzip; without -f it rejects files that are not in gzip format).
Assuming you don't have any spaces in the paths/filenames:
for f in /tmp/appscraps/*
do
n_lines=$(zcat -f "$f" | wc -l)
n_errors=$(zgrep -c ERROR "$f")
echo "$f $n_lines $n_errors"
done
This is untested but it should work.
You can run the following command for each file:
gunzip -t FILENAME; echo $?
It prints the exit status: 0 for a valid gzip file, non-zero (typically 1) for a corrupt or non-gzip file. You can then test that status with if to choose the required processing.
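One way the exit-status test could be wired into the counting loop is sketched below. The sample files and the report variable are made up for illustration, and gzip -t is used as the equivalent of gunzip -t.

```shell
# Hypothetical sample data: one gzipped log and one plain log
demo=$(mktemp -d)
printf 'ok\nERROR one\nERROR two\n' > "$demo/plain.log"
printf 'ERROR a\nfine\n' | gzip -c > "$demo/packed.log.gz"

report=$(
    for f in "$demo"/*; do
        # Exit status 0 means the file is valid gzip data; otherwise read it as plain text
        if gzip -t "$f" 2>/dev/null; then
            gzip -cd "$f"
        else
            cat "$f"
        fi |
        awk -v file="$f" -v OFS=',' '/ERROR/ { nl++ } END { print file, NR + 0, nl + 0 }'
    done
)

printf '%s\n' "$report"
```

The output is captured in a variable rather than written into the scanned directory, so the report file cannot end up among the files being scanned.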

Using grep and awk to search and print the output to new file

I have 100 files and want to search for a specific word in the first column of each file and print the content of all columns from this word to a new file.
I tried this code, but it doesn't work well; it prints only the content of one file, not all:
ls -t *.txt > Filelist.tmp
cat Filelist.tmp | while read line do; grep "searchword" | awk '{print $0}' > outputfile.txt; done
This is what you want:
$ awk '$1~/searchword/' *.txt >> output
This compares the first field against searchword and appends the line to output if it matches. The default field separator with awk is whitespace.
The main problem with your attempt is that you are overwriting the file every time with >; you want to append with >>.
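Note that ~ is a regular-expression match, so $1~/searchword/ also matches a first field such as mysearchwords. If an exact match is wanted, == can be used instead. A small sketch with made-up sample files:

```shell
# Hypothetical sample files in a throwaway directory
demo=$(mktemp -d)
printf 'searchword 1 2\nother 3 4\n' > "$demo/a.txt"
printf 'x searchword 9\nsearchword 5 6\nmysearchwords 7 8\n' > "$demo/b.txt"

# Exact string comparison on the first field. awk is started once for all
# files, so a single > redirection collects every match.
awk '$1 == "searchword"' "$demo"/*.txt > "$demo/output"

cat "$demo/output"
```

Only lines whose first field is exactly searchword are kept. The line with searchword in the second column and the mysearchwords line are both skipped.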