sum same column across multiple files using awk ? - awk

I want to add the 3rd column of 5 files such that the new file will have the same 2nd col and the sum of the 3rd col of the 5 files.
I tried something like this:
$ cat freqdat044.dat | awk '{n=$3; getline <"freqdat046.dat";print $2" " n+$3}' > freqtrial1.dat
freqdat048.dat`enter code here`$ cat freqdat044.dat | awk '{n=$3; getline <"freqdat046.dat";print $2" " n+$3}' > freqtrial1.dat
The files names:
freqdat044.dat
freqdat045.dat
freqdat046.dat
freqdat047.dat
freqdat049.dat
freqdat050.dat
And saved in output file the contain only $2 and the new col form the summation of the 3rd

awk '{x[$2] += $3} END {for(y in x) print y,x[y]}' freqdat044.dat freqdat045.dat freqdat046.dat freqdat047.dat freqdat049.dat freqdat050.dat
This does not necessarily print lines as they appear in the first file. If you want to preserve that sorting, then you have to save that ordering somewhere:
awk 'FNR==NR {keys[FNR]=$2; cnt=FNR} {x[$2] += $3} END {for(i=1; i<=cnt; ++i) print keys[i],x[keys[i]]}' freqdat044.dat freqdat045.dat freqdat046.dat freqdat047.dat freqdat049.dat freqdat050.dat

Related

selecting columns in awk discarding corresponding header

How to properly select columns in awk after some processing. My file here:
cat foo
A;B;C
9;6;7
8;5;4
1;2;3
I want to add a first column with line numbers and then extract some columns of the result. For the example let's get the new first (line numbers) and third columns. This way:
awk -F';' 'FNR==1{print "linenumber;"$0;next} {print FNR-1,$1,$3}' foo
gives me this unexpected output:
linenumber;A;B;C
1 9 7
2 8 4
3 1 3
but expected is (note B is now the third column as we added linenumber as first):
linenumber;B
1;6
2;5
3;2
[fixed and revised]
To get your expected output, use:
$ awk 'BEGIN {
FS=OFS=";"
}
{
print (FNR==1?"linenumber":FNR-1),$(FNR==1?3:1)
}' file
Output:
linenumber;C
1;9
2;8
3;1
To add a column with line number and extract first and last columns, use:
$ awk 'BEGIN {
FS=OFS=";"
}
{
print (FNR==1?"linenumber":FNR-1),$1,$NF
}' file
Output this time:
linenumber;A;C
1;9;7
2;8;4
3;1;3
Why do you print $0 (the complete record) in your header? And, if you want only two columns in your output, why to you print 3 (FNR-1, $1 and $3)? Finally, the reason why your output field separators are spaces instead of the expected ; is simply that... you did not specify the output field separator (OFS). You can do this with a command line variable assignment (OFS=\;), as shown in the second and third versions below, but also using the -v option (-v OFS=\;) or in a BEGIN block (BEGIN {OFS=";"}) as you wish (there are differences between these 3 methods but they don't matter here).
[EDIT]: see a generic solution at the end.
If the field you want to keep is the second of the input file (the B column), try:
$ awk -F\; 'FNR==1 {print "linenumber;" $2; next} {print FNR-1 ";" $2}' foo
linenumber;B
1;6
2;5
3;2
or
$ awk -F\; 'FNR==1 {print "linenumber",$2; next} {print FNR-1,$2}' OFS=\; foo
linenumber;B
1;6
2;5
3;2
Note that, as long as you don't want to keep the first field of the input file ($1), you could as well overwrite it with the line number:
$ awk -F\; '{$1=FNR==1?"linenumber":FNR-1; print $1,$2}' OFS=\; foo
linenumber;B
1;6
2;5
3;2
Finally, here is a more generic solution to which you can pass the list of indexes of the columns of the input file you want to print (1 and 3 in this example):
$ awk -F\; -v cols='1;3' '
BEGIN { OFS = ";"; n = split(cols, c); }
{ printf("%s", FNR == 1 ? "linenumber" : FNR - 1);
for(i = 1; i <= n; i++) printf("%s", OFS $(c[i]));
printf("\n");
}' foo
linenumber;A;C
1;9;7
2;8;4
3;1;3

How to print two lines of several files to a new file with speicific order?

I have a task to do with awk. I am doing sequence analysis for some genes.
I have several files with sequences in order. I would like to extract first sequence of each file into new file and like till the last sequence. I know only how to do with first or any specific line with awk.
awk 'FNR == 2 {print; nextfile}' *.txt > newfile
Here I have input like this
File 1
Saureus081.1
ATCGGCCCTTAA
Saureus081.2
ATGCCTTAAGCTATA
Saureus081.3
ATCCTAAAGGTAAGG
File 2
SaureusRF1.1
ATCGGCCCTTAC
SauruesRF1.2
ATGCCTTAAGCTAGG
SaureusRF1.3
ATCCTAAAGGTAAGC
File 3
SaureusN305.1
ATCGGCCCTTACT
SauruesN305.2
ATGCCTTAAGCTAGA
SaureusN305.3
ATCCTAAAGGTAATG
And similar files 12 are there
File 4
.
.
.
.File 12
Required Output
Newfile
Saureus081.1
ATCGGCCCTTAA
SaureusRF1.1
ATCGGCCCTTAC
SaureusN305.1
ATCGGCCCTTACT
Saureus081.2
ATGCCTTAAGCTATA
SaureusRF1.2
ATGCCTTAAGCTAGG
SauruesN305.2
ATGCCTTAAGCTAGA
Saureus081.3
ATCCTAAAGGTAAGG
SaureusRF1.3
ATCCTAAAGGTAAGC
SaureusN305.3
ATCCTAAAGGTAATG
I guess this task can be done easily with awk but not getting any idea how to do for multiple lines
Based on the modified question, the answer shall be done with some changes.
$ awk -F'.' 'NR%2{k=$2;v=$0;getline;a[k]=a[k]?a[k] RS v RS $0:v RS $0} END{for(i in a)print a[i]}' file1 file2 file3
Saureus081.1
ATCGGCCCTTAA
SaureusRF1.1
ATCGGCCCTTAC
SaureusN305.1
ATCGGCCCTTACT
Saureus081.2
ATGCCTTAAGCTATA
SauruesRF1.2
ATGCCTTAAGCTAGG
SauruesN305.2
ATGCCTTAAGCTAGA
Saureus081.3
ATCCTAAAGGTAAGG
SaureusRF1.3
ATCCTAAAGGTAAGC
SaureusN305.3
ATCCTAAAGGTAATG
Brief explanation,
Set '.' as the delimeter
For every odd record, distinguish k=$2 as the key of array a
Invoke getline to set $0 of next record as the value corresponds to the key k
Print the whole array for the last step
If your data is very large, I would suggest creating temporary files:
awk 'FNR%2==1 { filename = $1 }
{ print $0 >> filename }' file1 ... filen
Afterwards, you can cat them together:
cat Seq1 ... Seqn > result
This has the additional advantage that it will work if not all sequences are present in all files.
paste + awk solution:
paste File1 File2 | awk '{ p=$2;$2="" }NR%2{ k=p; print }!(NR%2){ v=p; print $1 RS k RS v }'
paste File1 File2 - merge corresponding lines of files
p=$2;$2="" - capture the value of the 2nd field which is the respective key/value from File2
The output:
Seq1
ATCGGCCCTTAA
Seq1
ATCGGCCCTTAC
Seq2
ATGCCTTAAGCTATA
Seq2
ATGCCTTAAGCTAGG
Seq3
ATCCTAAAGGTAAGG
Seq3
ATCCTAAAGGTAAGC
Additional approach for multiple files:
paste Files[0-9]* | awk 'NR%2{ k=$1; n=NF; print k }
!(NR%2){ print $1; for(i=2;i<=n;i++) print k RS $i }'

awk to compare two file by identifier & output in a specific format

I have 2 large files i need to compare all pipe delimited
file 1
a||d||f||a
1||2||3||4
file 2
a||d||f||a
1||1||3||4
1||2||r||f
Now I want to compare the files & print accordingly such as if any update found in file 2 will be printed as updated_value#oldvalue & any new line added to file 2 will also be updated accordingly.
So the desired output is: (only the updated & new data)
1||1#2||3||4
1||2||r||f
what I have tried so far is to get the separated changed values:
awk -F '[||]+' 'NR==FNR{for(i=1;i<=NF;i++)a[NR,i]=$i;next}{for(i=1;i<=NF;i++)if(a[FNR,i]!=$i)print $i"#"a[FNR,i]}' file1 file2 >output
But I want to print the whole line. How can I achieve that??
I would say:
awk 'BEGIN{FS=OFS="|"}
FNR==NR {for (i=1;i<=NF;i+=2) a[FNR,i]=$i; next}
{for (i=1; i<=NF; i+=2)
if (a[FNR,i] && a[FNR,i]!=$i)
$i=$i"#"a[FNR,i]
}1' f1 f2
This stores the file1 in a matrix a[line number, column]. Then, it compares its values with its correspondence in file2.
Note I am using the field separator | instead of || and looping in steps of two to use the proper data. This is because I for example did gawk -F'||' '{print NF}' f1 and got just 1, meaning that FS wasn't well understood. Will be grateful if someone points the error here!
Test
$ awk 'BEGIN{FS=OFS="|"} FNR==NR {for (i=1;i<=NF;i+=2) a[FNR,i]=$i; next} {for (i=1; i<=NF; i+=2) if (a[FNR,i] && a[FNR,i]!=$i) $i=$i"#"a[FNR,i]}1' f1 f2
a||d||f||b#a
1||1#2||3||4
1||2||r||f

Printing the compared result under last column header

I'm doing a comparision of 2 files file1,file2 using first column in file1 to first column in file2 and retriving corresponding value from 7 th column .
awk -F, 'FNR==NR{a[$1]=$7;next} {print (($1 in a) ? $0","a[$1] : $0",NA");}' file2.txt file1.txt > tmp && mv tmp file1.txt
also on next day it will compare and append the result .
cat file1.txt
N1,N2,N3,N4,N5,N6,D1,D2,D3,D4,D5,D6,D7,D8,D9,D10
XX,ZZ,XC,EE,RR,BB,OK,OK,OK,OK,OK,OK,OK,OK
XC,CF,FG,RG,GH,GH,NA,NA,NA,NA,NA,NA,NA,NA,NA
DM,DF,GR,TH,EW,BB
cat file2.txt
cat file2.txt
DF,GH,MH,FR,FG,GH,NA
XX,ZZ,XC,EE,RR,BB,OK
awk -F, 'FNR==NR{a[$1]=$7;next} {print (($1 in a) ? $0","a[$1] : $0",NA");}' file2.txt file1.txt > tmp && mv tmp file1.txt
mv: overwrite `file1.txt'? y
cat file1.txt
N1,N2,N3,N4,N5,N6,D1,D2,D3,D4,D5,D6,D7,D8,D9,D10,NA ---> Header
XX,ZZ,XC,EE,RR,BB,OK,OK,OK,OK,OK,OK,OK,OK,OK,OK
XC,CF,FG,RG,GH,GH,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA
DM,DF,GR,TH,EW,BB,NA
after adding new row
DM,DF,GR,TH,EW
problem is it is comparing and printing result for header also and result is printed
under header D1 instead of D10 for newly inserted row in file1
How can we print like this, compare should exclude header and result under last column header
N1,N2,N3,N4,N5,N6,D1,D2,D3,D4,D5,D6,D7,D8,D9,D10
XX,ZZ,XC,EE,RR,BB,OK,OK,OK,OK,OK,OK,OK,OK,OK,OK
XC,CF,FG,RG,GH,GH,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA
DM,DF,GR,TH,EW,BB ,NA
To avoid having header updated, change awk's expression to the following:
'FNR==NR{a[$1]=$7;next} FNR==1{print $0; next} {print (($1 in a) ? $0","a[$1] : $0",NA");}'
In this case 1st line of the file1.txt will be printed as is, without any changes.
But don't you also need to have new day (like "D10" in the example) be added to the header on each run? Or you do it elsewhere?
As to the 2nd question (printing new value at the same position in the string for the shorter line as for the longer line), you should further modify awk:
'FNR==NR{a[$1]=$7;next} FNR==1{print $0; len=length($0); next} {printf $0; cont=(($1 in a) ? ","a[$1] : ",NA"); for (i=length($0)+1;i<=len-length(cont);i++) printf " " ; print cont;}'

Combining columns within a single file using awk

I am trying to reformat a large file. The first 6 columns of each line are OK but the rest of the columns in the line need to be combined in increments of 2 with a "/" character in between.
Example file (showing only a few columns but have many more in actual file):
1 1 0 0 1 2 A T A C
Into:
1 1 0 0 1 2 A/T A/C
So far I have been trying awk and this is where I am at...
awk '{print $1,$2,$3,$4,$5; for(i=7; i < NF; i=i+2) print $i+"/"+$i+1}' myfile.txt > mynewfile.txt
awk '{for(i=j=7; i < NF; i+=2) {$j = $i"/"$(i+1); j++} NF=j-1}1' input
Please try this:
awk '{print $1" "$2" "$3" "$4" "$5" "$6" "$7"/"$8" "$9"/"$10}' myfile.txt > mynewfile.txt
"+" is the arithmetic "and" operator, string concatenation is done by simply listing the strings adjacent to each other, i.e. to get the string "foobar" you'd write:
"foo" "bar"
not:
"foo" + "bar"
Anyway, try this:
awk -v ORS= '{print $1,$2,$3,$4,$5,$6; for(i=7;i<=NF;i++) print (i%2?OFS:"/") $i; print "\n"}' myfile.txt > mynewfile.txt