I need to sum all the values in a column across multiple files - awk

I have a directory with multiple CSV text files, each with a single line in the format:
field1,field2,field3,560
I need to output the sum of the fourth field across all files in the directory (there can be hundreds or thousands of files). So, for example, given:
file1.txt
field1,field2,field3,560
file2.txt
field1,field2,field3,415
file3.txt
field1,field2,field3,672
The output would simply be:
1647
I've been trying a few different things, with the most promising being an awk command that I found here in response to another user's question. It doesn't quite do what I need it to do, and I am an awk newb so I'm unsure how to modify it to work for my purpose:
awk -F"," 'NR==FNR{a[NR]=$4;next}{print $4+a[FNR]:' file1.txt file2.txt
This correctly outputs 975.
However, if I try to pass it a 3rd file, rather than adding field 4 from all 3 files, it adds file1 to file2, then file1 to file3:
awk -F"," 'NR==FNR{a[NR]=$4;next}{print $4+a[FNR]}' file1.txt file2.txt file3.txt
975
1232
Can anyone show me how to modify this awk statement to accept more than two files or, ideally (since there are thousands of files to sum up), a * glob so it outputs the sum of the fourth field of all files in the directory?
Thank you for your time and assistance.

A couple of issues with the current code:
NR==FNR is used to trigger special processing for the 1st file; in this case there is no processing that is 'special' for just the 1st file (i.e., all files are to be processed the same)
an array (e.g., a[NR]) is used to maintain a set of values; in this case you only have one global value to maintain, so there is no need for an array
Since you're only looking for one global sum, simpler code should suffice:
$ awk -F',' '{sum+=$4} END {print sum+0}' file{1..3}.txt
1647
NOTES:
in the (unlikely?) case all files are empty, sum will be undefined, so print sum would display a blank line; sum+0 ensures we print 0 if sum remains undefined (i.e., all files are empty)
for a variable number of files, file{1..3}.txt can be replaced with whatever pattern matches the desired set of files, e.g., file*.txt, *.txt, etc. (see the sketch below for very large file counts)
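With thousands of files, the expanded glob can exceed the shell's argument-length limit. One workaround (a sketch; adjust the -name pattern to your files, and note -maxdepth is a GNU/BSD find extension) is to concatenate everything with find and feed a single awk process, so the END block still sees one grand total:
$ find . -maxdepth 1 -name 'file*.txt' -exec cat {} + | awk -F',' '{sum+=$4} END {print sum+0}'
1647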

Here we go (no need to test NR==FNR when the files are simply concatenated):
$ cat file{1,2,3}.txt | awk -F, '{count+=$4}END{print count}'
1647
Or the same thing, without wasting a pipe (or a cat):
$ awk -F, '{count+=$4}END{print count}' file{1,2,3}.txt
1647

$ perl -MList::Util=sum0 -F, -lane'push @a,$F[3];END{print sum0 @a}' file{1..3}.txt
1647
$ perl -F, -lane'push @a,$F[3];END{foreach(@a){ $sum += $_ };print "$sum"}' file{1..3}.txt
1647

$ cut -d, -f4 file{1..3}.txt | paste -sd+ - | bc
1647

Related

What does this Awk expression mean

I am working with a bash script that has this command in it:
awk -F ‘‘ ‘/abc/{print $3}’|xargs
What is the meaning of this command? Assume input is provided to awk.
The quick answer is it'll do different things depending on the version of awk you're running and how many fields of output the awk script produces.
I assume you meant to write:
awk -F '' '/abc/{print $3}'|xargs
not the syntactically invalid (due to "smart quotes"):
awk -F ‘’’/abc/{print $3}’|xargs
-F '' is undefined behavior per POSIX, so what it does depends on the version of awk you're running. In some awks it'll split the current line into 1 character per field. In others it'll be ignored and the line will be split into fields at every sequence of white space. In other awks still, it could do anything else.
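For example, GNU awk takes the one-character-per-field interpretation (a sketch; other awks may differ):
$ echo 'xabc' | gawk -F '' '{print NF, $3}'
4 b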
/abc/ looks for a string matching the regexp abc on the current line and if found invokes the subsequent action, in this case {print $3}.
However it's split into fields, print $3 will print the 3rd such field.
xargs as used will just print chunks of the multi-line input it's getting all on 1 line, so you could get 1 line of all-fields output if few fields are being produced, or several lines of multi-field output if there are many.
I suspect the intent of that code was to do what this code actually will do in any awk alone:
awk '/abc/{printf "%s%s", sep, substr($0,3,1); sep=OFS} END{print ""}'
e.g.:
$ printf 'foo\nxabc\nyzabc\nbar\n' |
awk '/abc/{printf "%s%s", sep, substr($0,3,1); sep=OFS} END{print ""}'
b a

Giving an argument list as file names extracted from another tab separated file

I have a tab delimited file as below. The first column represents the list of file names without .txt extension which I want to pass as an argument list to another awk command.
File1 abcd xyz 234 pqr
File2 abcd xyz 234 pqr
File3 abcd xyz 234 pqr
File4 abcd xyz 234 pqr
e.g. assume this is my awk command; I want to pass arguments as:
awk -F"\t" '---Command-----' File1.txt File2.txt File3.txt File4.txt >> Final.txt
So that it takes each row from the 1st column, with the ".txt" extension added, as input and creates the Final.txt output file. It should be noted that the number of columns may vary each time.
I thought of creating it in a bash script, but I am not able to provide the correct arguments and append the next row from the 1st column as the next argument.
Going by my understanding of your requirements, you want to use the tab-separated file to get the file names from column 1, add the .txt extension to them, and pass them to another command. First use mapfile to get the names from the tab-separated file:
mapfile -t fileNames < <(awk -v FS="\t" '{print $1}' tabfile)
Now to pass this as an argument list to another command, all you need to do is expand the quoted array, suffixing the .txt extension to each element:
awk ... "${fileNames[@]/%/.txt}"
Not completely sure, as the requirement isn't fully clear. Based on your statement that you want to get file names from one awk and pass them to another awk, the following could be tried:
awk '{print $0}' <(awk 'NF{print $1".txt"}' Input_file)
Instead of print $0 you could do your operations here; I just printed the names to check that they come out properly. Also add -F'\t' to the 2nd awk in case your Input_file is TAB-delimited, and change $1 to another field in case the file names are not in the first column.
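If the names really must be handed to another awk as file arguments rather than as data, command substitution is one option (a sketch; it breaks on names containing whitespace):
$ awk -F'\t' '{print $0}' $(awk 'NF{print $1".txt"}' Input_file)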
You can try this awk
awk '{file=$1".txt"; while ((getline < file) == 1) print $2}' infile
append .txt to $1 of each line of infile to get a filename like File2.txt
print $2 of each line of that file, if it exists.
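A slightly more defensive sketch reads into a variable (so the fields of infile are not clobbered), treats any positive getline return as success, and closes each file when done:
$ awk '{f = $1 ".txt"; while ((getline line < f) > 0) { split(line, a); print a[2] }; close(f)}' infile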

awk to store and reset variable from file

Trying to use awk to look up the string from file1 (which is always just 1 field) in the corresponding line of file2. That is, if row 1 is being used in file1, then only row 1 is used in file2. Since it is possible for the value to be missing, this is a check done to ensure it is there. This is just an idea, so there probably is a better way, but I just wanted to see. Thank you :).
file1
R_2017_01_13_12_11_56_user_S5-00580-24-Medexome
R_2017_01_13_14_46_04_user_S5-00580-25-Medexome
file2
The oldest folder is R_2017_01_13_12_11_56_user_S5-00580-24-Medexome, created on 2017-01-17+11:31:02.5035483130 and analysis done using v1.4 by cmccabe at 01/17/17 12:41:03 PM
desired output for $filename
R_2017_01_13_12_11_56_user_S5-00580-24-Medexome
After a bunch of processes are run using $filename, I need to reset that variable with a new one.
file1 (after rerunning some process)
R_2017_01_13_12_11_56_user_S5-00580-24-Medexome
R_2017_01_13_14_46_04_user_S5-00580-25-Medexome
file2 (after rerunning some process)
The oldest folder is R_2017_01_13_12_11_56_user_S5-00580-24-Medexome, created on 2017-01-17+11:31:02.5035483130 and analysis done using v1.4 by cmccabe at 01/17/17 12:41:03 PM
The oldest folder is R_2017_01_13_14_46_04_user_S5-00580-25-Medexome, created on 2017-01-17+06:53:07.3194950000 and analysis done using v1.4 by cmccabe at 01/18/17 06:59:08 AM
desired output for $filename now is (since this value is new)
R_2017_01_13_14_46_04_user_S5-00580-25-Medexome
awk
filename=$(awk 'NR==1{print $1}' file1 file2)
You want to check if the last line of file2 contains a string given in file1.
For this, you just have to read that last line and then see if it matches any of the words in file1.
$ awk 'ENDFILE {line=$0} FNR<NR && line ~ $1' file2 file1
R_2017_01_13_14_46_04_user_S5-00580-25-Medexome
This uses:
ENDFILE {line=$0}
after reading a file, $0 contains the last line that was read (well, not always, but since ENDFILE is a GNU awk extension I assume you have gawk, where this holds). With this, we store the last line into line, so that we can use it when reading the next file.
FNR<NR && line ~ $1
while reading file1, check if the given word is present in the stored line. If so, the default action print is triggered.
This uses the FNR<NR trick, where FNR holds the line number within the current file, while NR holds the overall line number. This way, FNR==NR is only true while reading the first file, and FNR<NR from the second file on.
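For awks without ENDFILE, a portable sketch stores each line of file2 as it is read, so that only the last one survives to be tested against file1:
$ awk 'NR==FNR{line=$0; next} line ~ $1' file2 file1
R_2017_01_13_14_46_04_user_S5-00580-25-Medexome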
If you only need to check the last line of file2 continuously, you could:
$ awk 'NR==FNR{a[$1];next}{for(i in a)if($0 ~ i)print i}' file1 <(tail -f file2)
Explained:
NR==FNR{a[$1];next} read into a array the search terms from file1
file2 is tail -f'd into awk using process substitution, i.e. awk reads a record from the end of file2, goes through all the search words in a and looks for them in the record, printing the search word if there is a match

awk to output matches to separate files

I am trying to combine the lines whose text in $2 is the same and output them to separate files, with the match being the name of the new file. Since the actual files are quite large, I open each file then close it to save on speed and memory; my attempt is below. Thank you :).
awk '{printf "%s\n", $2==$2".txt"; close($2".txt")}' input.txt '{ print $2 > "$2.txt" }'
input.txt
chr19:41848059-41848167 TGFB1:exon.2;TGFB1:exon.3;TGFB1:exon.4 284.611 108 bases
chr15:89850833-89850913 FANCI:exon.20;FANCI:exon.27;FANCI:exon.32;FANCI:exon.33;FANCI:exon.34 402.012 80 bases
chr15:31210356-31210508 FANC1:exon.6;FANC1:exon.7 340.914 152 bases
chr19:41850636-41850784 TGFB1:exon.1;TGFB1:exon.2;TGFB1:exon.3 621.527 148 bases
Desired output for TGFB1.txt
chr19:41848059-41848167 TGFB1:exon.2;TGFB1:exon.3;TGFB1:exon.4 284.611 108 bases
chr19:41850636-41850784 TGFB1:exon.1;TGFB1:exon.2;TGFB1:exon.3 621.527 148 bases
Desired output for FANC1.txt
chr15:89850833-89850913 FANCI:exon.20;FANCI:exon.27;FANCI:exon.32;FANCI:exon.33;FANCI:exon.34 402.012 80 bases
chr15:31210356-31210508 FANC1:exon.6;FANC1:exon.7 340.914 152 bases
EDIT:
awk -F '[ :]' '{f = $3 ".txt"; close($3 ".txt")} print > f}' BMF_unix_loop_genes_IonXpress_008_150902_loop_genes_average_IonXpress_008_150902.bed > /home/cmccabe/Desktop/panels/BMF /"$f".txt;
bash: /home/cmccabe/Desktop/panels/BMF: Is a directory
You can just redefine your field separator to include a colon, and then the file name will be in $3:
awk -F '[ :]' '{f = $3 ".txt"; print > f}' input.txt
I've encountered problems with some awks where constructing the filename to the right of the redirection is problematic, which is why I'm using a variable. However, the Friday afternoon beer cart has been around and I can't recall the specific details :/
I wouldn't bother closing the files unless you're expecting hundreds or thousands of new files to be produced (see the sketch below if you are).
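If you do end up with that many output files, one sketch combines this approach with append-and-close (the same issue comes up in the next question; note that >> also appends across reruns, so remove any old output files first):
$ awk -F '[ :]' '{f = $3 ".txt"; print >> f; close(f)}' input.txt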
You need to split the second field to get the desired file name. This should do:
$ awk '{split($2,f,":"); p=f[1]".txt"; print $0 > p}' file
Note that it won't produce your output exactly, since you have a typo in one of the fields (FANC1 vs FANCI):
$ ls *.txt
FANC1.txt FANCI.txt TGFB1.txt

awk subsetting on ID's in $1 and close( )

I am trying to use awk to sort a large CSV file by the IDs in the first column, on OSX.
I started with:
awk -F, 'NR>1 {print > ($1 ".sync")}' file.csv
However, the process stopped at ID s_17 with the error:
awk: s_18.sync makes too many open files input record number 37674601,
file file.csv source line number 1
I tried modifying it with this close() statement, but then it only writes the first file:
awk -F, 'NR>1 {print > ($1 ".sync"); close($1 ".sync")}' file.csv
Can anyone provide insight on how to close the files after each one, properly, so that the number of open files stays manageable but they all get written?
Because you close the file you need to use the append >> operator so you don't clobber the output files:
$ awk -F, 'NR>1{f=$1".sync";print >> f;close(f)}' file.csv
Check out the manual for the official word on redirection with awk.
Don't sort with awk. awk is great for formatting data before sorting; pipe its output into sort(1) and let it sort the data. That's what sort does, and it does a great job.
Also, which type of sort do you need? Dictionary? Numeric? Do you need to ignore spaces?
example:
sort -t, -n <file
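For example, to sort on just the first comma-separated column (a sketch; append n to the key, as in -k1,1n, for a numeric sort):
$ sort -t, -k1,1 file.csv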