Use awk to sum or average for each unique ID - awk

Can anyone tell me how to use awk to calculate the sum of two individual columns, or the average of one column, for each unique ID?
Input
chr1 3661532 3661533 0.0 5 0 chr1 3661529 3662079 NM_01011874
chr1 3661534 3661535 0.2 5 1 chr1 3661529 3662079 NM_01011874
chr1 3661537 3661538 0.0 5 0 chr1 3661529 3662079 NM_01011874
chr1 3661559 3661560 0.0 6 0 chr1 3661529 3662079 NM_01011874
chr2 4661532 4661533 0.0 8 0 chr1 4661532 4661533 NM_00175642
chr2 6661534 6661535 0.2 5 2 chr1 6661534 6661535 NM_00175642
chr2 2661537 2661538 0.0 5 0 chr1 2661537 2661538 NM_00175642
chr2 9661559 9661560 0.0 7 0 chr1 9661559 9661560 NM_00175642
Output (sum $5 $6) for each unique ID
NM_01011874 21 1
NM_00175642 25 2
or average of $4 for each unique ID
NM_01011874 0.0476
NM_00175642 0.08
Also, if you could break down the components of the solution, I would be grateful. I'm a molecular biologist with minimal bioinformatics training.

sum of columns 5 and 6 per id:
awk '{sum5[$10] += $5; sum6[$10] += $6}; END{ for (id in sum5) { print id, sum5[id], sum6[id] } }' < /tmp/input
NM_00175642 25 2
NM_01011874 21 1
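Explained: $10 is the id field, and $5 and $6 are columns 5 and 6. We build two arrays, sum5 and sum6, to accumulate the column totals; awk arrays are indexed by strings, so the id field can serve as the key. Once all the lines (records) have been processed, the END block iterates over the array keys (the id strings) and prints the totals stored under each key. The order in which for (id in sum5) visits the keys is unspecified, which is why the output order can differ from the input order.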
average of column 4 per id:
awk '{sum4[$10] += $4; count4[$10]++}; END{ for (id in sum4) { print id, sum4[id]/count4[id] } }' < /tmp/input
NM_00175642 0.05
NM_01011874 0.05
Explained: Very similar to the summing example. We keep a sum of column 4 per id, and a count of records seen for each id. At the end, we iterate through the ids and print the sum/count.
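Note that the values expected in the question (0.0476 and 0.08) equal the column 6 total divided by the column 5 total (1/21 and 2/25), not the mean of column 4, which is 0.05 for both IDs. As a minimal sketch, assuming that ratio is what was wanted, both numbers can be produced in one pass. The function name per_id_stats, the temporary file, and the pipe through sort (used only to make the output order reproducible) are illustrative choices, not part of the original answer:

```shell
# Hypothetical helper: per-ID mean of column 4 and ratio sum($6)/sum($5).
per_id_stats() {
  awk '
    { sum4[$10] += $4; sum5[$10] += $5; sum6[$10] += $6; n[$10]++ }
    END {
      for (id in n) {
        # Guard against a zero total in column 5 (assumed to be the denominator).
        ratio = (sum5[id] != 0) ? sum6[id] / sum5[id] : "NA"
        print id, sum4[id] / n[id], ratio
      }
    }' "$@" | sort   # sort only to make the order reproducible
}

# Sample rows from the question, written to a temporary file.
input=$(mktemp)
cat > "$input" <<'EOF'
chr1 3661532 3661533 0.0 5 0 chr1 3661529 3662079 NM_01011874
chr1 3661534 3661535 0.2 5 1 chr1 3661529 3662079 NM_01011874
chr1 3661537 3661538 0.0 5 0 chr1 3661529 3662079 NM_01011874
chr1 3661559 3661560 0.0 6 0 chr1 3661529 3662079 NM_01011874
chr2 4661532 4661533 0.0 8 0 chr1 4661532 4661533 NM_00175642
chr2 6661534 6661535 0.2 5 2 chr1 6661534 6661535 NM_00175642
chr2 2661537 2661538 0.0 5 0 chr1 2661537 2661538 NM_00175642
chr2 9661559 9661560 0.0 7 0 chr1 9661559 9661560 NM_00175642
EOF

per_id_stats "$input"
```

Each output line holds the ID, the mean of column 4, and the ratio of the two sums, so the two candidate interpretations can be compared side by side.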
I don't do much with awk, I find Perl much better for small scripts. But this looks like a good starting point. There are links to more pages with example scripts.

Related

how to format a large txt file to bed

I am trying to convert CpG methylation calls from the R package "methylKit" to a simple BED format. Since it is a large file, I cannot do it in Excel. I also tried Seqmonk, but it does not let me export the data in the format I want. Linux awk/sed might be a good option, but I am new to them as well. Basically, I need to trim the "chr" column, add a "stop" column, convert "F" to "+" and "R" to "-", and print freqC with 2 decimal places. Can you please help?
From:
chrBase chr base strand coverage freqC freqT
chr1.339 chr1 339 F 7 0.00 100.00
chr1.183 chr1 183 R 4 0.00 100.00
chr1.192 chr1 192 R 6 0.00 100.00
chr1.340 chr1 340 R 5 40.00 60.00
chr1.10007 chr1 10007 F 13 53.85 46.15
chr1.10317 chr1 10317 F 8 0.00 100.00
chr1.10346 chr1 10346 F 9 88.89 11.11
chr1.10349 chr1 10349 F 9 88.89 11.11
To:
chr start stop freqc Coverage strand
1 67678 67679 0 3 -
1 67701 67702 0 3 -
1 67724 67725 0 3 -
1 67746 67747 0 3 -
1 67768 67769 0.333333 3 -
1 159446 159447 0 3 +
1 162652 162653 0 3 +
1 167767 167768 0.666667 3 +
1 167789 167790 0.666667 3 +
1 167797 167798 0 3 +
This should do what you actually want, producing a BED6 file with the methylation percentage in the score column:
$ cat foo.awk
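BEGIN{OFS="\t"}
{if(NR>1) {
    # Map the strand codes F/R to the BED symbols +/-
    if($4=="F") {
        strand="+"
    } else {
        strand="-"
    }
    # sub() returns the number of replacements, not the new string,
    # so copy the field and strip the prefix from the copy
    chromUse=$2
    sub(/^chr/, "", chromUse)
    # BED starts are 0-based, so start = position - 1
    print chromUse,$3-1,$3,$1,$6,strand,$5
}}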
That would then be run with:
awk -f foo.awk input.txt > output.bed
The additional column 7 is the coverage, since genome viewers will only display a single score column:
1 338 339 chr1.339 0.00 + 7
1 182 183 chr1.183 0.00 - 4
1 191 192 chr1.192 0.00 - 6
1 339 340 chr1.340 40.00 - 5
1 10006 10007 chr1.10007 53.85 + 13
1 10316 10317 chr1.10317 0.00 + 8
1 10345 10346 chr1.10346 88.89 + 9
1 10348 10349 chr1.10349 88.89 + 9
You can tweak that further as needed.
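One detail worth knowing when stripping the "chr" prefix: sub() and gsub() edit their target in place and return the number of replacements made, not the edited string. A minimal sketch of the difference, using made-up chromosome names (the function name strip_chr_demo is hypothetical):

```shell
# Hypothetical demo: print the return value of sub() next to the edited copy.
strip_chr_demo() {
  awk '{
    name = $1
    n = sub(/^chr/, "", name)   # n = number of replacements; name is edited in place
    print n, name
  }'
}

printf 'chr1\nchr2\nchrX\n' | strip_chr_demo
```

The first column is 1 on every line, while the second column carries the stripped name, so only the edited variable is suitable for printing as the chromosome.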
It is not entirely clear what exact output you want, since your "From" data does not correspond to your "To" results. If the "To" block only illustrates the general format change, then for the "From" data you want to:
discard field 1,
retrieve the "chr" value from the end of field 2,
if the 4th field is "F" make it "+" else if it is "R" make it "-" otherwise leave it unchanged,
use the 3rd field as "start" and 3rd + 1 as "stop" (adjust whether to add or subtract 1 as needed to get the desired "start" and "stop" values),
print 6th field as "freqc",
output 5th field as "Coverage", and finally
output modified 4th field as "strand"
If that is your goal, then with your from data in the file named from, you can do something like the following:
awk '
BEGIN { OFS="\t"; print "chr","start","stop","freqc","Coverage","strand" }
FNR > 1 {
    match($2, /[[:digit:]]+$/, arr)
    if ($4 == "F")
        $4 = "+"
    else if ($4 == "R")
        $4 = "-"
    print arr[0], $3, $3 + 1, $6, $5, $4
}
' from
Explanation: the BEGIN rule runs before awk starts processing records (lines) from the file. Here it simply sets the Output Field Separator (OFS) to a tab and prints the heading.
The condition (pattern) FNR > 1 on the second rule processes the from file from the 2nd record (line) onward, skipping the heading row. FNR is the record number within the current file, whereas NR counts records across all input files.
match($2, /[[:digit:]]+$/, arr) stores the trailing digits of the second field in arr[0]; the third argument to match() is a gawk extension, so this needs GNU awk. Not relevant here, match() also sets the RSTART and RLENGTH variables, which tell you where the match starts and how long it is.
The if and else if statements do the "F" to "+" and "R" to "-" change. Finally, the print statement prints the modified values and the unchanged fields in the order listed above.
Example Output
Running the above on your original "From" data will produce:
chr start stop freqc Coverage strand
1 339 340 0.00 7 +
1 183 184 0.00 4 -
1 192 193 0.00 6 -
1 340 341 40.00 5 -
1 10007 10008 53.85 13 +
1 10317 10318 0.00 8 +
1 10346 10347 88.89 9 +
1 10349 10350 88.89 9 +
Let me know if this is close to what you explained in your question, and if not, drop a comment below.
The GNU Awk User's Guide is a great gawk/awk reference.
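A variant that avoids the array argument to match(), as a sketch only: it strips a leading "chr" with sub(), which also keeps non-numeric names such as chrX, and it formats freqc with two decimals through printf. The function name reformat_calls and the chrX row in the sample are made up for illustration:

```shell
# Hypothetical helper: same column reshuffle using only POSIX awk features.
reformat_calls() {
  awk '
    BEGIN { OFS = "\t"; print "chr", "start", "stop", "freqc", "Coverage", "strand" }
    FNR > 1 {
      chrom = $2
      sub(/^chr/, "", chrom)            # "chr1" -> "1", "chrX" -> "X"
      strand = $4                       # left unchanged unless it is F or R
      if ($4 == "F") strand = "+"
      else if ($4 == "R") strand = "-"
      printf "%s\t%d\t%d\t%.2f\t%s\t%s\n", chrom, $3, $3 + 1, $6, $5, strand
    }' "$@"
}

# Two sample rows; the second one is invented to show a non-numeric chromosome.
calls=$(mktemp)
cat > "$calls" <<'EOF'
chrBase chr base strand coverage freqC freqT
chr1.339 chr1 339 F 7 0.00 100.00
chrX.183 chrX 183 R 4 40.00 60.00
EOF

reformat_calls "$calls"
```

Whether "stop" should be the position plus one or "start" the position minus one depends on the coordinate convention you need, so adjust the two %d arguments accordingly.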

manipulating columns in a text file in awk

I have a tab separated text file and want to do some math operation on one column and make a new tab separated text file.
This is an example of my file:
chr1 144520803 144520804 12 chr1 144520813 58
chr1 144520840 144520841 12 chr1 144520845 36
chr1 144520840 144520841 12 chr1 144520845 36
chr1 144520848 144520849 14 chr1 144520851 32
chr1 144520848 144520849 14 chr1 144520851 32
I want to change the 4th column: divide every element in the 4th column by the sum of all elements in that column, then multiply by 1000000, as in the expected output.
expected output:
chr1 144520803 144520804 187500 chr1 144520813 58
chr1 144520840 144520841 187500 chr1 144520845 36
chr1 144520840 144520841 187500 chr1 144520845 36
chr1 144520848 144520849 218750 chr1 144520851 32
chr1 144520848 144520849 218750 chr1 144520851 32
I am trying to do that in awk using the following command, but it does not return what I want. Do you know how to fix it?
awk '{print $1 "\t" $2 "\t" $3 "\t" $4/{sum+=$4}*1000000 "\t" $5 "\t" $6 "\t" $7}' myfile.txt > new_file.txt
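You need two passes over the file: one to compute the sum, and a second to scale the field. Something like this, where file{,} is bash brace expansion for "file file", so awk reads the same file twice and NR==FNR is true only during the first read: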
$ awk -v OFS='\t' 'NR==FNR {sum+=$4; next}
{$4*=(1000000/sum)}1' file{,} > newfile
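The two-pass idiom can be sketched without brace expansion by naming the file twice explicitly. This is an illustrative version, not the answerer's exact command: the function name scale_col4, the temporary file, and the guard against a zero sum are assumptions added here.

```shell
# Hypothetical helper: rescale column 4 so that the column sums to 1,000,000.
scale_col4() {
  awk -v OFS='\t' '
    NR == FNR { sum += $4; next }   # pass 1: add up column 4
    {                               # pass 2: rescale and print
      $4 = (sum != 0) ? $4 * 1000000 / sum : 0
      print
    }' "$1" "$1"
}

# Sample rows from the question (column 4 sums to 64).
data=$(mktemp)
cat > "$data" <<'EOF'
chr1 144520803 144520804 12 chr1 144520813 58
chr1 144520840 144520841 12 chr1 144520845 36
chr1 144520840 144520841 12 chr1 144520845 36
chr1 144520848 144520849 14 chr1 144520851 32
chr1 144520848 144520849 14 chr1 144520851 32
EOF

scale_col4 "$data"
```

Because the file is read twice, the input has to be a regular file rather than a pipe.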

Awk OR conditional not working

Input: A tab-separated input file with 15 columns where column 15 is an integer.
Output: The number of lines that satisfy the conditional.
My code:
$ closest-features --closest --no-overlaps --delim '\t' --dist --ec megatrans_enhancers.sorted.bed ../../data/alu_repeats.sorted.bed | awk -v OFS='\t' '{if ($15 <= 1000 || $15 >= -1000) print $0}' | wc -l
1188
The || conditional is not working in this case (the total number of lines in the file is 1188, and I know for certain that at least some lines do not satisfy the condition). If I remove the OR conditional, then suddenly it works:
$ closest-features --closest --no-overlaps --delim '\t' --dist --ec megatrans_enhancers.sorted.bed ../../data/alu_repeats.sorted.bed | awk -v OFS='\t' '{if ($15 <= 1000) print $0}' | wc -l
926
Not sure what I'm doing wrong. Any advice?
Example Input to Awk command:
chr1 378268 378486 chr1-798_Enhancer 17.2 + chr1 375923 376219 AluY|SINE|Alu-HOMER529 0 + E:375923 0.044 -2050
chr1 1079471 1079689 chr1-929_Enhancer 14.6 - chr1 1071271 1071563 AluSx1|SINE|Alu-HOMER1669 0 - E:1071271 0.13 -7909
chr1 1080259 1080477 chr1-830_Enhancer 16.7 - chr1 1071271 1071563 AluSx1|SINE|Alu-HOMER1669 0 - E:1071271 0.13 -8697
chr1 6611744 6611962 chr1-241_Enhancer 46.6 + chr1 6611431 6611723 AluSc|SINE|Alu-HOMER10257 0 + E:6611431 0.089 -22
chr1 6959639 6959857 chr1-58_Enhancer 100.1 - chr1 6966612 6966911 AluSx|SINE|Alu-HOMER11041 0 - E:6966612 0.137 6756
chr1 6960593 6960811 chr1-202_Enhancer 51.6 - chr1 6966612 6966911 AluSx|SINE|Alu-HOMER11041 0 - E:6966612 0.137 5802
chr1 7447888 7448106 chr1-2_Enhancer 181.9 - chr1 7449489 7449799 AluSz|SINE|Alu-HOMER11879 0 + E:7449489 0.119 1384
chr1 10752461 10752679 chr1-131_Enhancer 65.4 - chr1 10752754 10753065 AluSq2|SINE|Alu-HOMER19455 0 + E:10752754 0.106 76
chr1 12485694 12485912 chr1-353_Enhancer 36.7 + chr1 12487328 12487634 AluSx3|SINE|Alu-HOMER23581 0 + E:12487328 0.085 1417
chr1 12486469 12486687 chr1-141_Enhancer 63.6 + chr1 12487328 12487634 AluSx3|SINE|Alu-HOMER23581 0 + E:12487328 0.085 642
Use && instead of ||: the value has to be both greater than or equal to -1000 and less than or equal to 1000. With ||, every number satisfies at least one of the two comparisons, so every line is printed.
Your_command | awk '$15<=1000 && $15>=-1000{count++} END{print count+0}'
Add -F"\t" to the awk command above if its input is tab-delimited. There is also no need to pipe into wc -l: the script counts the matching lines in the variable count and prints it once, after the last line of input (count+0 makes it print 0 rather than an empty line when nothing matches).
For the sample lines you provided, the output is 3, which I believe is correct.
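The difference between the two operators can be checked on the distances alone. The sketch below uses the column 15 values from the ten sample rows; the helper names count_or and count_and are hypothetical:

```shell
# Distances taken from column 15 of the sample rows in the question.
distances='-2050 -7909 -8697 -22 6756 5802 1384 76 1417 642'

# "or": true for every number, since any value is <= 1000 or >= -1000.
count_or() {
  tr ' ' '\n' | awk '$1 <= 1000 || $1 >= -1000 { n++ } END { print n + 0 }'
}

# "and": true only inside the closed range [-1000, 1000].
count_and() {
  tr ' ' '\n' | awk '$1 <= 1000 && $1 >= -1000 { n++ } END { print n + 0 }'
}

echo "$distances" | count_or    # counts all 10 values
echo "$distances" | count_and   # counts -22, 76 and 642
```

The first call counts every value, which mirrors the 1188 out of 1188 lines seen in the question; the second counts only the three values inside the range.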

LINUX AWK command for a big file

I have encountered a problem that exceeds my basic unix knowledge and would really appreciate some help. I have a large file in the following format:
chr1 10495 10499 211
chr1 10496 10500 1
chr1 10587 10591 93
chr1 10588 10592 1
chr1 10639 10643 4
chr1 10668 10672 11
chr1 10697 10701 13
chr1 10726 10730 8
chr1 10755 10759 7
chr1 10784 10788 5
chr2 10856 10860 4
chr3 10932 10936 6
chr3 10933 10937 2
chr5 11056 11060 4
chr6 11155 11159 9
If the values in column one match and the values in column two differ by one, I want to sum the values in column 4 of both lines and replace the value of column 3 in line 1 with the value of column 3 in line 2; otherwise, just print the unique line without modifying any column.
So the output I am hoping for would look like this:
chr1 10495 10500 212
chr1 10587 10592 94
chr1 10639 10643 4
chr1 10668 10672 11
chr1 10697 10701 13
chr1 10726 10730 8
chr1 10755 10759 7
chr1 10784 10788 5
chr2 10856 10860 4
chr3 10932 10937 8
chr5 11056 11060 4
chr6 11155 11159 9
$ cat tst.awk
BEGIN { OFS="\t" }
NR>1 {
    if ( ($1==p[1]) && ($2==(p[2]+1)) ) {
        print p[1], p[2], $3, p[4]+$4
        delete p[0]
        next
    }
    else if (0 in p) {
        print p[0]
    }
}
{ split($0,p); p[0]=$0 }
END { if (0 in p) print p[0] }
$
$ awk -f tst.awk file
chr1 10495 10500 212
chr1 10587 10592 94
chr1 10639 10643 4
chr1 10668 10672 11
chr1 10697 10701 13
chr1 10726 10730 8
chr1 10755 10759 7
chr1 10784 10788 5
chr2 10856 10860 4
chr3 10932 10937 8
chr5 11056 11060 4
chr6 11155 11159 9
Haven't checked closely, but I think you want:
awk '{split(p,a)}
$1==a[1] && a[2]==$2-1{print a[1], a[2], $3, $4 + a[4]; p=""; next}
p {print p} {p=$0}
END {if (p != "") print p}' OFS=\\t input
At any given step (except the first), p holds the previous line. The 2nd line of the script checks whether the first field of the current line matches the first field of the previous line and whether the 2nd field is one greater than the 2nd field of the previous line. If so, it prints the first two fields from the previous line, the third from the current line, and the sum of the 4th fields, clears p, and moves on to the next line. If they don't match, it prints the previous line. At the end, it prints the line still held in p; if the final line was merged into the one before it, p is empty and nothing more is printed.
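The end-of-input case is easy to get wrong with this "remember the previous line" pattern, so here is a small sketch that exercises it: the last two input lines merge, and nothing extra should be printed afterwards. The function name merge_adjacent and the variable name prev are made up; the input is tab-separated so that merged and untouched lines use the same separator.

```shell
# Hypothetical helper: merge a line into the previous one when column 1
# matches and column 2 is exactly one greater.
merge_adjacent() {
  awk -v OFS='\t' '
    { split(prev, a) }
    prev != "" && $1 == a[1] && $2 == a[2] + 1 {
      print a[1], a[2], $3, a[4] + $4
      prev = ""          # both lines are consumed
      next
    }
    prev != "" { print prev }
    { prev = $0 }
    END { if (prev != "") print prev }   # flush only an unmerged last line
  ' "$@"
}

printf 'chr2\t10856\t10860\t4\nchr3\t10932\t10936\t6\nchr3\t10933\t10937\t2\n' | merge_adjacent
```

With these three rows the result has two lines: the chr2 row unchanged and one merged chr3 row ending in 10937 and 8.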
I use this script to merge intervals in transcriptome data:
awk '
NR==1{
    n= split($0, first);
    c=1;
    for(i=1; i<=n; i++) d[c, i] = first[i];
}
NR>1{
    n= split($0, actual);
    #if(actual[1] != d[c, 1] || actual[2]>d[c, 3]){ #for interval fusion
    if(actual[1] != d[c, 1] || actual[2]>d[c,2]+1){ #OP requirement
        c++;
        for(i=1; i<=n; i++) d[c, i] = actual[i];
    }else{
        if(actual[3] > d[c,3]) d[c,3] = actual[3];
        d[c,4] = d[c,4] + actual[4];
    }
}
END{
    for(i=1; i<=c; i++){
        print d[i, 1], d[i, 2], d[i, 3], d[i, 4]
    }
}' file
you get:
chr1 10495 10500 212
chr1 10587 10592 94
chr1 10639 10643 4
chr1 10668 10672 11
chr1 10697 10701 13
chr1 10726 10730 8
chr1 10755 10759 7
chr1 10784 10788 5
chr2 10856 10860 4
chr3 10932 10937 8
chr5 11056 11060 4
chr6 11155 11159 9

awk: divide odd columns by following even column

I want to divide all the odd columns in a file by the next even column, e.g. column1/column2, column3/column4,......, columnN/columnN+1
test1.txt
1 4 1 2 1 3
1 2 4 2 3 9
desired output
0.25 0.5 0.333
0.5 2 0.333
I tried this:
awk 'BEGIN{OFS="\t"} { for (i=2; i<NF+2; i+=2) printf $(i-1)/i OFS; printf "\n"}'
but it doesn't work.
I would like to add that my actual files have a very large and variable (but always even) number of columns and I would like something that would work on all of them.
awk '{for(i=1;i<NF;i+=2)printf "%f%s",$i/$(i+1),OFS;print "";}' input.txt
Output:
0.250000 0.500000 0.333333
0.500000 2.000000 0.333333
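You can adjust the printf format to your needs; for example, "%.3g" prints three significant digits and drops trailing zeros, which gives 0.25 0.5 0.333 for the first line.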
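The attempt in the question fails because $(i-1)/i divides by the column index i rather than by the field value $i. A minimal sketch that divides by the field, prints three significant digits, and avoids a trailing separator follows. It assumes an even number of columns, as stated in the question; the function name divide_pairs and the "NA" placeholder for a zero divisor are illustrative choices:

```shell
# Hypothetical helper: print column1/column2, column3/column4, ... per line.
divide_pairs() {
  awk '{
    for (i = 1; i < NF; i += 2) {
      # $(i + 1) is the value in the even column; a bare i + 1 is only its index.
      q = ($(i + 1) != 0) ? sprintf("%.3g", $i / $(i + 1)) : "NA"
      # OFS between quotients, ORS (newline) after the last pair.
      printf "%s%s", q, (i + 1 < NF ? OFS : ORS)
    }
  }' "$@"
}

printf '1 4 1 2 1 3\n1 2 4 2 3 9\n' | divide_pairs
```

For the two sample rows this prints 0.25 0.5 0.333 and 0.5 2 0.333, matching the desired output in the question.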