mathematical operations in a text file using awk - awk

I have a text file which looks like this small example:
In this file, the first line of each group is an ID; under each ID there are lines in which the 1st column is a three-letter string and the 2nd is a number. The 2 columns are tab-separated.
ID1
AAA 17
TTA 3
ATA 6
ATC 12
AAG 9
ACA 13
ATG 21
ACC 13
ACG 5
AAT 12
AGA 11
ATT 22
AGC 11
TAA 3
ACT 8
TAC 12
ID2
AAA 10
AAC 7
AAG 4
ACA 3
ACC 1
ATG 6
ACG 1
Below I also have a list of three-letter strings. For each ID, I want the ratio of TTA to the total of the three-letter strings under that ID which are also present in the list below.
ATT
ATC
ATA
CTT
AAC
CTA
CTG
TTA
TTG
GTT
GTC
GTA
GTG
the output for this example would look like this:
ID1 0.065
ID2 0
For ID2 the ratio is 0 because there is no TTA; for ID1 it is 0.065 because 3 divided by 46 equals 0.065. For each ID, I only counted the three-letter strings that are common between the list above and the rows under that ID. The 2 output columns are also tab-separated.
I am quite new to the awk programming language. I wrote the following piece of code, but it does not return what I want. Could you please help me fix it?
3_letter_list= [ATT, ATC, ATA, CTT, AAC, CTA, CTG, TTA, TTG, GTT, GTC, GTA, GTG]
awk -F "\t" '{if($1==3_letter_list), (if $1=="TTA" & ratio=$2/$1)}' filename.txt > out.txt
ID3
AAA 2
AAC 8
ATA 1
ATC 20
AAG 26
ACA 6
ATG 11
ACC 16
ACG 7
AAT 2
ATT 4
AGC 18
TAA 1
TAC 8
ACT 3
AGG 1
TTC 20
TCA 1
TCC 8
TTG 6
TCG 4
AGT 5
TAT 3
GAC 18
GTC 12
TTT 6
TGC 7
GAG 31
TCT 1
GCC 19
GTG 21
TGG 6
GCG 8
CAC 12
GAT 6
CTC 12
GGA 2
CAG 22
GGC 25
CTG 52
CCC 15
GCT 3
GGG 6
CCG 4
CAT 4
CTT 2
CGC 18
GGT 4
CCT 3
CGG 13

Awk solution:
Assuming the list of three-letter groups is saved in groups_list.txt:
awk 'NR==FNR{ a[$1]; next }
/^ID[0-9]/{
    if (id) { printf "%s %.4f\n", id, tta/sum; id = tta = sum = "" }
    id = $1; next
}
$1 == "TTA"{ tta = $2 }
$1 in a{ sum += $2 }
END{ printf "%s %.4f\n", id, tta/sum }' groups_list.txt file.txt
The output:
ID1 0.0698
ID2 0.0000
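One edge case the END block does not cover: if an ID has no entries from the list at all, sum stays empty and tta/sum triggers a division-by-zero error. Below is a hedged sketch with a guard, run on a self-contained subset of the question's data (recreated inline, tab-separated):

```shell
# Recreate a subset of the sample data so the sketch runs standalone.
cat > groups_list.txt <<'EOF'
ATT
ATC
ATA
CTT
AAC
CTA
CTG
TTA
TTG
GTT
GTC
GTA
GTG
EOF
printf 'ID1\nTTA\t3\nATA\t6\nATC\t12\nATT\t22\nAAA\t17\nID2\nAAC\t7\nGGG\t5\n' > file.txt

# Same structure as the answer above, but (sum+0 ? ... : 0) avoids a
# division-by-zero error when an ID has no entries from the list.
awk 'NR==FNR{ a[$1]; next }
/^ID[0-9]/{
    if (id) printf "%s %.4f\n", id, (sum+0 ? tta/sum : 0)
    id = $1; tta = sum = ""; next
}
$1 == "TTA"{ tta = $2 }
$1 in a   { sum += $2 }
END{ if (id) printf "%s %.4f\n", id, (sum+0 ? tta/sum : 0) }' groups_list.txt file.txt
# ID1 0.0698
# ID2 0.0000
```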

Related

Print column header next to column entry if condition match

I have a large file in the following format, where the 1st column is the id and the rest are samples. I am trying to extract only the ids for which exactly one sample has a value larger than 5 and all the others have values less than 5. Additionally, I want to print the sample id along with the sample value that is greater than 5. What's the best way to proceed? I can identify all ids that fulfil the condition with the following, which I came across in the forum, but I cannot get what I am expecting.
awk '{for(i=1;i<=NF;i++) {if($i+0>5) c++; printf "%-5s%s", $i, (i==NF? OFS c ORS: OFS)}c=0}' input.txt | awk 'NR==1{print $0}; NR>1{if ($NF==1) print $0}'
Input File
id s1 s2 s3 s4 s5
T1 203 3 0 1 80
T2 70 2 0 0 1
T3 50 66 90 321 15
T4 1 4 2 1 10
T5 0 0 0 0 2
T6 2 1 2 11 2
T7 200 3 1 0 0
T8 15 11 9 8 1
T9 1 23 1 2 1
T10 34 1 1 2 1
Expected output
T2 s1 70
T4 s5 10
T6 s4 11
T7 s1 200
T9 s2 23
T10 s1 34
Could you please try the following; it was written on mobile and couldn't be tested, but it should work.
awk '
FNR==1{
  for(k=2;k<=NF;k++){
    header[k]=$k
  }
  next
}
{
  for(k=2;k<=NF;k++){
    if($k>5){
      count++
      val=$k
      second=header[k]
    }
  }
  if(count==1){
    print $1,second,val
  }
  count=val=second=""
}
' Input_file
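For what it's worth, the script does check out against the posted sample. Here is a self-contained run (the input recreated inline; same logic, compacted):

```shell
cat > input.txt <<'EOF'
id s1 s2 s3 s4 s5
T1 203 3 0 1 80
T2 70 2 0 0 1
T3 50 66 90 321 15
T4 1 4 2 1 10
T5 0 0 0 0 2
T6 2 1 2 11 2
T7 200 3 1 0 0
T8 15 11 9 8 1
T9 1 23 1 2 1
T10 34 1 1 2 1
EOF

# Print an id only when exactly one of its sample values exceeds 5,
# together with that sample's header and value.
awk '
FNR==1{ for(k=2;k<=NF;k++) header[k]=$k; next }
{
  for(k=2;k<=NF;k++) if($k>5){ count++; val=$k; second=header[k] }
  if(count==1) print $1, second, val
  count=val=second=""
}' input.txt
# T2 s1 70
# T4 s5 10
# T6 s4 11
# T7 s1 200
# T9 s2 23
# T10 s1 34
```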

Select current and previous line if certain value is found

To solve my problem, I subtract the previous value of column 3 from the current one and store the difference in a new column 5; then I print the previous and current lines whenever the value in column 5 equals 25.
Input file
1 1 35 1
2 5 50 1
2 6 75 1
4 7 85 1
5 8 100 1
6 9 125 1
4 1 200 1
I tried
awk '{$5 = $3 - prev3; prev3 = $3; print $0}' file
output
1 1 35 1 35
2 5 50 1 15
2 6 75 1 25
4 7 85 1 10
5 8 100 1 15
6 9 125 1 25
4 1 200 1 75
Desired Output
2 5 50 1 15
2 6 75 1 25
5 8 100 1 15
6 9 125 1 25
Thanks in advance
You're almost there: in addition to the previous $3, keep the previous $0, and only print when the condition is satisfied.
$ awk '{$5=$3-p3} $5==25{print p0; print} {p0=$0;p3=$3}' file
2 5 50 1 15
2 6 75 1 25
5 8 100 1 15
6 9 125 1 25
this can be further golfed to
$ awk '25==($5=$3-p3){print p0; print} {p0=$0;p3=$3}' file
This checks whether the newly computed field $5 equals 25; if so, it prints the previous line and the current line, then saves the current line and $3 for the computations on the next line.
You are close to the answer; just pipe it to another awk and print it:
awk '{$5 = $3 - prev3; prev3 = $3; print $0}' oxxo.txt | awk ' { curr=$0; if($5==25) { print prev;print curr } prev=curr } '
with Inputs:
$ cat oxxo.txt
1 1 35 1
2 5 50 1
2 6 75 1
4 7 85 1
5 8 100 1
6 9 125 1
4 1 200 1
$ awk '{$5 = $3 - prev3; prev3 = $3; print $0}' oxxo.txt | awk ' { curr=$0; if($5==25) { print prev;print curr } prev=curr } '
2 5 50 1 15
2 6 75 1 25
5 8 100 1 15
6 9 125 1 25
$
Could you please try the following.
awk '$3-prev==25{print line ORS $0,$3} {$(NF+1)=$3-prev;prev=$3;line=$0}' Input_file | column -t
Here's one:
$ awk '{$5=$3-q;t=p;p=$0;q=$3;$0=t ORS $0}$10==25' file
2 5 50 1 15
2 6 75 1 25
5 8 100 1 15
6 9 125 1 25
Explained:
$ awk '{
$5=$3-q # subtract
t=p # previous to temp
p=$0 # store previous for next round
q=$3 # store subtract value for next round
$0=t ORS $0 # prepare record for output
}
$10==25 # output if equals
' file
No checking for duplicates, so you might get the same record printed twice. The easiest way to fix that is to pipe the output to uniq.
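A quick sketch of that uniq fix, on hypothetical data where two consecutive differences are both 25 (so the middle line would otherwise be printed twice; since the duplicates are always adjacent, plain uniq is enough):

```shell
printf '1 1 25 1\n2 2 50 1\n3 3 75 1\n' > diffs.txt  # hypothetical sample
awk '{$5=$3-q;t=p;p=$0;q=$3;$0=t ORS $0}$10==25' diffs.txt | uniq
# 1 1 25 1 25
# 2 2 50 1 25
# 3 3 75 1 25
```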

awk sum collect in groups

This question is about an awk script (see a previous question from some weeks ago), but a bit more complicated.
The input file looks like:
Group1
id val1 val2
---------------------------
idone 2 10
idone 3 12
idone 6 9
idtwo 8 3
idtwo 14 1
Subtotal 33 35
Group2
id val1 val2
------------------------
idone 2 3
idone 1 4
idtwo 3 6
idtwo 4 7
Subtotal 10 20
Total 43 55
There might be more groups, and more entries in each group. I limited my example to 2 detail names (idone, idtwo) and 2 groups. The purpose is to have them summarized, with a result like:
val1 val2
idone 14 38
idtwo 29 17
total 43 55
The output layout is free to choose; if you prefer, it may look like this as well:
total_idone_val1=14
total_idone_val2=38
total_idtwo_val1=29
total_idtwo_val2=17
overall_total_val1=43
overall_total_val2=55
give this awk cmd a try (the $1!="id" test keeps the repeated "id val1 val2" header lines out of the sums):
awk 'NF==3 && $1!="id" && !/[Tt]otal/{v1[$1]+=$2;v2[$1]+=$3}END{print "id","v1","v2";
for(x in v1){
print x,v1[x],v2[x]
s1+=v1[x]
s2+=v2[x]
}
print "total",s1,s2}' file
it gives:
id v1 v2
idtwo 29 17
idone 14 38
total 43 55
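A sketch of the alternative key=value layout mentioned in the question, under the assumption that detail lines have exactly 3 fields and that the repeated id headers and the (Sub)total lines must be filtered out. Note that for-in traversal order is unspecified, so the per-id lines may appear in any order:

```shell
cat > groups.txt <<'EOF'
Group1
id val1 val2
---------------------------
idone 2 10
idone 3 12
idone 6 9
idtwo 8 3
idtwo 14 1
Subtotal 33 35
Group2
id val1 val2
------------------------
idone 2 3
idone 1 4
idtwo 3 6
idtwo 4 7
Subtotal 10 20
Total 43 55
EOF

# Sum val1/val2 per detail name, then emit key=value lines; the overall
# totals are accumulated while looping over the per-id sums.
awk 'NF==3 && $1!="id" && !/[Tt]otal/{ v1[$1]+=$2; v2[$1]+=$3 }
END{
  for (x in v1) {
    printf "total_%s_val1=%d\ntotal_%s_val2=%d\n", x, v1[x], x, v2[x]
    s1 += v1[x]; s2 += v2[x]
  }
  printf "overall_total_val1=%d\noverall_total_val2=%d\n", s1, s2
}' groups.txt
```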

sum rows based on unique columns awk

I'm looking for a more elegant way to do this (for more than 100 columns):
awk '{a[$1]+=$4}{b[$1]+=$5}{c[$1]+=$6}{d[$1]+=$7}{e[$1]+=$8}{f[$1]+=$9}{g[$1]+=$10}END{for(i in a) print i,a[i],b[i],c[i],d[i],e[i],f[i],g[i]}'
Here is the input:
a1 1 1 2 2
a2 2 5 3 7
a2 2 3 3 8
a3 1 4 6 1
a3 1 7 9 4
a3 1 2 4 2
and output:
a1 1 1 2 2
a2 4 8 6 15
a3 3 13 19 7
Thanks :)
I break the one-liner down into lines, to make it easier to read.
awk '{n[$1];for(i=2;i<=NF;i++)a[$1,i]+=$i}
END{for(x in n){
printf "%s ", x
for(y=2;y<=NF;y++)printf "%s%s", a[x,y],(y==NF?ORS:OFS)
}
}' file
This awk command should work with your 100-column file.
test with your file:
kent$ cat f
a1 1 1 2 2
a2 2 5 3 7
a2 2 3 3 8
a3 1 4 6 1
a3 1 7 9 4
a3 1 2 4 2
kent$ awk '{n[$1];for(i=2;i<=NF;i++)a[$1,i]+=$i}END{for(x in n){printf "%s ", x;for(y=2;y<=NF;y++)printf "%s%s", a[x,y],(y==NF?ORS:OFS)}}' f
a1 1 1 2 2
a2 4 8 6 15
a3 3 13 19 7
Using arrays of arrays (gawk version 4+):
awk '{for (i=2;i<=NF;i++) a[$1][i]+=$i}
END{for (i in a)
{ printf i FS;
for (j in a[i]) printf a[i][j] FS
printf RS}
}' file
a1 1 1 2 2
a2 4 8 6 15
a3 3 13 19 7
If you care about the order of the output, try this.
$ cat file
a1 1 1 2 2
a2 2 5 3 7
a2 2 3 3 8
a3 1 4 6 1
a3 1 7 9 4
a3 1 2 4 2
Awk Code :
$ cat tester
awk 'FNR==NR{
U[$1] # Array U with index being field1
for(i=2;i<=NF;i++) # loop through the columns, that is, column2 to NF
A[$1,i]+=$i # Array A holds sum of columns
next # stop processing the current record and go on to the next record
}
($1 in U){ # Here we read same file once again,if field1 is found in array U, then following statements
for(i=1;i<=NF;i++)
s = s ? s OFS A[$1,i] : A[$1,i] # I am writing sum to variable s since I want to use only one print statement, here you can use printf also
print $1,s # print column1 and variable s
delete U[$1] # We are done with this key, so delete the array element
s = "" # reset variable s
}' OFS='\t' file{,} # OFS is a tab (a comma also works); file{,} expands to "file file", reading the file twice
The result:
$ bash tester
a1 1 1 2 2
a2 4 8 6 15
a3 3 13 19 7
If you want to try this on a Solaris/SunOS system, change awk to /usr/xpg4/bin/awk , /usr/xpg6/bin/awk , or nawk
--edit--
As requested in a comment, here is the one-liner; in the post above I had added comments for readability, which spread it over several lines.
$ awk 'FNR==NR{U[$1];for(i=2;i<=NF;i++)A[$1,i]+=$i;next}($1 in U){for(i=1;i<=NF;i++)s = s ? s OFS A[$1,i] : A[$1,i];print $1,s;delete U[$1];s = ""}' OFS='\t' file{,}
a1 1 1 2 2
a2 4 8 6 15
a3 3 13 19 7

Using an array in AWK when working with two files

I have two files, which I merged on a key using the code below.
file1
-------------------------------
1 a t p bbb
2 b c f aaa
3 d y u bbb
2 b c f aaa
2 u g t ccc
2 b j h ccc
file2
--------------------------------
1 11 bbb
2 22 ccc
3 33 aaa
4 44 aaa
I merged these two files on the key using the code below:
awk 'NR==FNR{a[$3]=$0;next;}{for(x in a){if(x==$5) print $1,$2,$3,$4,a[x]};
My question is how I can save $2 of file2 in a variable or array and print it again after a[x].
My desired result is :
1 a t p 1 11 bbb 11
2 b c f 3 33 aaa 33
2 b c f 4 44 aaa 44
3 d y u 1 11 bbb 11
2 b c f 3 33 aaa 33
2 b c f 4 44 aaa 44
2 u g t 2 22 ccc 22
2 b j h 2 22 ccc 22
As you can see, the first 7 columns are the result of my merge code. I need to add the last column (field 2 of a[x]) to my result.
Important:
My next question is: if I have an .awk file, how can I use shell constructs such as (| column -t), or send the result to a file (awk ... > result.txt)? I always use these at the command prompt. Can I use them inside my .awk file?
Simply add all of file2 to an array, and use split to hold the bits you want:
awk 'FNR==NR { two[$0]++; next } { for (i in two) { split(i, one); if (one[3] == $NF) print $1,$2,$3,$4, i, one[2] } }' file2 file1
Results:
1 a t p 1 11 bbb 11
2 b c f 3 33 aaa 33
2 b c f 4 44 aaa 44
3 d y u 1 11 bbb 11
2 b c f 3 33 aaa 33
2 b c f 4 44 aaa 44
2 u g t 2 22 ccc 22
2 b j h 2 22 ccc 22
Regarding your last question: you can also use pipes and writes inside your awk. Here's an example of a pipe to column -t:
Contents of script.awk:
FNR==NR {
two[$0]++
next
}
{
for (i in two) {
split(i, one)
if (one[3] == $NF) {
print $1,$2,$3,$4, i, one[2] | "column -t"
}
}
}
Run like: awk -f script.awk file2 file1
EDIT:
Add the following to your shell script:
results=$(awk '
FNR==NR {
two[$0]++
next
}
{
for (i in two) {
split(i, one)
if (one[3] == $NF) {
print $1,$2,$3,$4, i, one[2] | "column -t"
}
}
}
' $1 $2)
echo "$results"
Run like:
./script.sh file2.txt file1.txt
Results:
1 a t p 1 11 bbb 11
2 b c f 3 33 aaa 33
2 b c f 4 44 aaa 44
3 d y u 1 11 bbb 11
2 b c f 3 33 aaa 33
2 b c f 4 44 aaa 44
2 u g t 2 22 ccc 22
2 b j h 2 22 ccc 22
Your current script is:
awk 'NR==FNR { a[$3]=$0; next }
{ for (x in a) { if (x==$5) print $1,$2,$3,$4,a[x] } }'
(Actually, the original is missing the second close brace for the second pattern/action pair.)
It seems that you process file2 before you process file1.
You shouldn't need the loop in the second code. And you can make life easier for yourself by using the splitting in the first phase to keep the values you need:
awk 'NR==FNR { c1[$3] = $1; c2[$3] = $2; next }
{ print $1, $2, $3, $4, c1[$5], c2[$5], $5, c2[$5] }'
You can upgrade that to check whether c1[$5] and c2[$5] are defined, presumably skipping the row if they are not.
Given your input files, the output is:
1 a t p 1 11 bbb 11
2 b c f 4 44 aaa 44
3 d y u 1 11 bbb 11
2 b c f 4 44 aaa 44
2 u g t 2 22 ccc 22
2 b j h 2 22 ccc 22
Give or take column spacing, that's what was requested. Column spacing can be fixed by using printf instead of print, or setting OFS to tab, or ...
The c1 and c2 notations for columns 1 and 2 are OK for two columns. If you need more, then you should probably use the 2D array notation:
awk 'NR==FNR { for (i = 1; i <= NF; i++) col[i,$3] = $i; next }
{ print $1, $2, $3, $4, col[1,$5], col[2,$5], $5, col[2,$5] }'
This produces the same output as before.
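The existence check suggested earlier can be a simple `in` test on the key. A self-contained sketch (rows of file1 whose $5 never appears in file2 are silently skipped):

```shell
cat > file1 <<'EOF'
1 a t p bbb
2 b c f aaa
3 d y u bbb
2 b c f aaa
2 u g t ccc
2 b j h ccc
EOF
cat > file2 <<'EOF'
1 11 bbb
2 22 ccc
3 33 aaa
4 44 aaa
EOF

# First pass (file2): remember columns 1 and 2 per key; second pass (file1):
# only print rows whose key exists in the lookup tables.
awk 'NR==FNR { c1[$3] = $1; c2[$3] = $2; next }
$5 in c1 { print $1, $2, $3, $4, c1[$5], c2[$5], $5, c2[$5] }' file2 file1
# 1 a t p 1 11 bbb 11
# 2 b c f 4 44 aaa 44
# 3 d y u 1 11 bbb 11
# 2 b c f 4 44 aaa 44
# 2 u g t 2 22 ccc 22
# 2 b j h 2 22 ccc 22
```

As with the answer above, only the last file2 row per key is kept (aaa maps to 4 44), which is why both aaa rows print 44.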
To achieve what you ask, save the second field after the whole line in the processing of your first file, with a[$3]=$0 OFS $2. For your second question, awk has a variable to separate fields in output, it's OFS, assign a tabulator to it and play with it. Your script would be like:
awk '
BEGIN { OFS = "\t"; }
NR==FNR{
a[$3]=$0 OFS $2;
next;
}
{
for(x in a){
if(x==$5) print $1,$2,$3,$4,a[x]
}
}
' file2 file1
That yields:
1 a t p 1 11 bbb 11
2 b c f 4 44 aaa 44
3 d y u 1 11 bbb 11
2 b c f 4 44 aaa 44
2 u g t 2 22 ccc 22
2 b j h 2 22 ccc 22