awk to match file 1 with file 2 and average field 7 - awk

I am trying to match all the file 1 names in file 2 and average them if there is a match. The field where the match will be is $5 before the | symbol and the average is the sum of $7 that matches $4. Thank you :).
file 1
AGRN
CYP2J2
file 2
chr1 955543 955763 chr1:955543 AGRN-6|gc=75 1 2
chr1 955543 955763 chr1:955543 AGRN-6|gc=75 2 2
chr1 955543 955763 chr1:955543 AGRN-6|gc=75 3 2
chr1 957571 957852 chr1:957571 AGRN-7|gc=61.2 1 148
chr1 957571 957852 chr1:957571 AGRN-7|gc=61.2 2 149
chr1 957571 957852 chr1:957571 AGRN-7|gc=61.2 3 151
chr1 60381600 60381782 chr1:60381600 CYP2J2-1596|gc=40.7 153 274
chr1 60381600 60381782 chr1:60381600 CYP2J2-1596|gc=40.7 154 273
Desired output (tab-delimited)
chr1:955543 AGRN-6 2
chr1:957571 AGRN 149.3
chr1:60381600 CYP2J2-1596 153.5
I have tried so far:
awk '
FNR==NR{d[$0]; next;}
{
for(k in d){
pat="(^|;)"k":";
if($5 ~ pat){
print;
break;
}
}
}' file 1 file2 > output.bed
The awk does run but the output file, as of now, is 0 bytes. Thank you :).

The script should look like this:
test.awk
BEGIN {
FS="[ \t|]*"
}
# Read search terms from file1 into 's'
FNR==NR {
s[$0]
next
}
{
# Check if $5 matches one of the search terms
for(i in s) {
if($5 ~ i) {
# Store first two fields for later usage
a[$5]=$1
b[$5]=$2
# Add $9 to total of $9 per $5
t[$5]+=$8
# Increment count of occurences of $5
c[$5]++
next
}
}
}
END {
# Calculate average and print output for all search terms
# that has been found
for( i in t ) {
avg = t[i] / c[i]
printf("%s:%s\t%s\t%s\n", a[i], b[i], i, avg)
}
}
Call it like:
awk -f test.awk file1 file2
Btw, the third avg in your expected output is wrong. The output should look like this:
chr1:955543 AGRN-6 2
chr1:957571 AGRN-7 149.333
chr1:60381600 CYP2J2-1596 273.5

Related

Counting the number of unique values based on more than two columns in bash

I need to modify the below code to work on more than one column.
Counting the number of unique values based on two columns in bash
awk ' ##Starting awk program from here.
BEGIN{
FS=OFS="\t"
}
!found[$0]++{ ##Checking condition if 1st and 2nd column is NOT present in found array then do following.
val[$1]++ ##Creating val with 1st column inex and keep increasing its value here.
}
END{ ##Starting END block of this progra from here.
for(i in val){ ##Traversing through array val here.
print i,val[i] ##Printing i and value of val with index i here.
}
}
' Input_file ##Mentioning Input_file name here.
Table to count how many of each double (all DIS)
patient sex DISa DISb DISc DISd DISe DISf DISg DISh DISi
patient1 male 550.1 550.5 594.1 594.3 594.8 591 1019 960.1 550.1
patient2 female 041 208 250.2 276.14 426.32 550.1 550.5 558 041
patient3 female NA NA NA NA NA NA NA 041 NA
The output I need is:
550.1 3
550.5 2
594.1 1
594.3 1
594.8 1
591 1
1019 1
960.1 1
550.1 1
041 3
208 1
250.2 1
276.14 1
426.32 1
558 1
Consider this awk:
awk -v OFS='\t' 'NR > 1 {for (i=3; i<=NF; ++i) if ($i+0 == $i) ++fq[$i]} END {for (i in fq) print i, fq[i]}' file
276.14 1
960.1 1
594.3 1
426.32 1
208 1
041 3
594.8 1
550.1 3
591 1
1019 1
558 1
550.5 2
250.2 1
594.1 1
A more readable form:
awk -v OFS='\t' '
NR > 1 {
for (i=3; i<=NF; ++i)
if ($i+0 == $i)
++fq[$i]
}
END {
for (i in fq)
print i, fq[i]
}' file
$i+0 == $i is a check for making sure column value is numeric.
If the ordering must be preserved, then you need an additional array b[] to keep the order each number is encountered, e.g.
awk '
BEGIN { OFS = "\t" }
FNR > 1 {
for (i=3;i<=NF;i++)
if ($i~/^[0-9]/) {
if (!($i in a))
b[++n] = $i;
a[$i]++
}
}
END {
for (i=1;i<=n;i++)
print b[i], a[b[i]]
}' file
Example Use/Output
$ awk '
> BEGIN { OFS = "\t" }
> FNR > 1 {
> for (i=3;i<=NF;i++)
> if ($i~/^[0-9]/) {
> if (!($i in a))
> b[++n] = $i;
> a[$i]++
> }
> }
> END {
> for (i=1;i<=n;i++)
> print b[i], a[b[i]]
> }' patients
550.1 3
550.5 2
594.1 1
594.3 1
594.8 1
591 1
1019 1
960.1 1
041 3
208 1
250.2 1
276.14 1
426.32 1
558 1
Let me know if you have further questions.
Taking complete solution from above 2 answers(#anubhava and #David) with all respect, just adding a little tweak(of applying check for integer value here as per shown samples of OP) to their solutions and adding 2 solutions here. Written and tested with shown samples only.
1st solution: If order doesn't matter in output try:
awk -v OFS='\t' '
NR > 1 {
for (i=3; i<=NF; ++i)
if (int($i))
++fq[$i]
}
END {
for (i in fq)
print i, fq[i]
}' Input_file
2nd solution: If order matters based on David's answer try.
awk '
BEGIN { OFS = "\t" }
FNR > 1 {
for (i=3;i<=NF;i++)
if (int($i)) {
if (!($i in a))
b[++n] = $i;
a[$i]++
}
}
END {
for (i=1;i<=n;i++)
print b[i], a[b[i]]
}' Input_file
Using GNU awk for multi-char RS:
$ awk -v RS='[[:space:]]+' '$0+0 == $0' file | sort | uniq -c
3 041
1 1019
1 208
1 250.2
1 276.14
1 426.32
3 550.1
2 550.5
1 558
1 591
1 594.1
1 594.3
1 594.8
1 960.1
If the order of fields really matters just pipe the above to awk '{print $2, $1}'.

How to print something multiple times in awk

I have a file sample.txt that looks like this:
Sequence: chr18_gl000207_random
Repeat 1
Indices: 2822--2996 Score: 135
Period size: 36 Copynumber: 4.8 Consensus size: 36
Consensus pattern (36 bp):
TCAGTTGCAGTGCTGGCTGTTGTTGTGGCAGACTGT
Repeat 2
Indices: 2736--3623 Score: 932
Period size: 111 Copynumber: 8.1 Consensus size: 111
Consensus pattern (111 bp):
TTGTGGCAGACTGTTCAGTTGCAGTGCTGGCTGTTGTTGTGGTTGCGGGTTCAGTAGAGGTGGTA
GTGGTGGCTGTTGTGGTTGTAGCCTCAGTGGAAGTGCCTGCAGTTG
Repeat 3
Indices: 3421--3496 Score: 89
Period size: 39 Copynumber: 1.9 Consensus size: 39
Consensus pattern (39 bp):
AGTGCTGACTGTTGTGGTGGCAGCCTCAGTAGAAGTGGT
I have used awk to extract values for parameters that are relevant for me like this:
paste <(awk '/Indices/ {print $2}' sample.txt) <(awk '/Period size/ {print $3}' sample.txt) <(awk '/Copynumber/ {print $5}' sample.txt) <(awk '/Consensus pattern/ {getline; print $0}' sample.txt)
Output:
2822--2996 36 4.8 TCAGTTGCAGTGCTGGCTGTTGTTGTGGCAGACTGT
2736--3623 111 8.1 TTGTGGCAGACTGTTCAGTTGCAGTGCTGGCTGTTGTTGTGGTTGCGGGTTCAGTAGAGGTGGTA
3421--3496 39 1.9 AGTGCTGACTGTTGTGGTGGCAGCCTCAGTAGAAGTGGT
Now I want to add the parameter Sequence to every row.
Desired output:
chr18_gl000207_random:2822--2996 36 4.8 TCAGTTGCAGTGCTGGCTGTTGTTGTGGCAGACTGT
chr18_gl000207_random:2736--3623 111 8.1 TTGTGGCAGACTGTTCAGTTGCAGTGCTGGCTGTTGTTGTGGTTGCGGGTTCAGTAGAGGTGGTA
chr18_gl000207_random:3421--3496 39 1.9 AGTGCTGACTGTTGTGGTGGCAGCCTCAGTAGAAGTGGT
I want to do this for many files in a loop so I need a solution that would work with a different number of Repeats as well.
$ cat tst.awk
BEGIN { OFS="\t" }
$1 == "Sequence:" { seq = $2; next }
$1 == "Indices:" { ind = $2; next }
$1 == "Period" { per = $3; cpy = $5; next }
$1 == "Consensus" { isCon=1; next }
isCon { print seq":"ind, per, cpy, $1; isCon=0 }
$ awk -f tst.awk file
chr18_gl000207_random:2822--2996 36 4.8 TCAGTTGCAGTGCTGGCTGTTGTTGTGGCAGACTGT
chr18_gl000207_random:2736--3623 111 8.1 TTGTGGCAGACTGTTCAGTTGCAGTGCTGGCTGTTGTTGTGGTTGCGGGTTCAGTAGAGGTGGTA
chr18_gl000207_random:3421--3496 39 1.9 AGTGCTGACTGTTGTGGTGGCAGCCTCAGTAGAAGTGGT

Process multiple files with awk

I would like to count the number of points in each interval. I have the positions of the points in the first file and the intervals in the second. First I store the point attributes in two arrays(pos and name) and then i want to loop over them in order to determine wheter it belongs to the given interval ($1 is the name and $2 is the start and $3 is the end of the interval). I have the following code:
awk 'NR==FNR{name[NR]=$1;pos[NR]=$2;next}; {for (i in name) if (name[i] == $1 && pos[i] >= $2 && pos[i] <= $3) {sum[NR] += 1;}} END {for (i = 1; i <=length(sum); i++) {print sum[i]}} file1 file2 > out'
I have a syntax error: "syntax error near unexpected token `i"
I am beginner in awk. Any help is highly appriciated. Thanks
awk '
NR==FNR{
name[NR]=$1
pos[NR]=$2
next
}
{
for(i in name){
if(name[i] == $1 && pos[i] >= $2 && pos[i] <= $3){ sum[FNR] += 1; }
}
}
END {
for(i = 1; i <=FNR; i++){
print sum[i];
}
}
' points windows > output
points:
chr1 52
chr1 65
chr2 120
chr2 101
chr2 160
chr3 20
chr4 50
windows:
chr1 0 100
chr1 100 200
chr2 0 100
chr2 100 200
chr3 0 100
chr3 100 200
chr4 0 100
chr5 0 100
chr6 0 100
chr6 100 200
chr7 0 100
chr8 0 100
gave me the desired output:
2
3
1
1
Thank You
Your ' is in wrong place and awk command is not ending properly, could you please try following. Couldn't test it since no samples are given.
awk 'NR==FNR{name[NR]=$1;pos[NR]=$2;next}; {for (i in name) if (name[i] == $1 && pos[i] >= $2 && pos[i] <= $3) {sum[NR] += 1;}} END {for (i = 1; i <=length(sum); i++) {print sum[i]}}' file1 file2
Non-one liner form of above solution.
awk '
NR==FNR{
name[NR]=$1
pos[NR]=$2
next
}
{
for(i in name){
if(name[i] == $1 && pos[i] >= $2 && pos[i] <= $3){ sum[NR] += 1 }
}
}
END{
for(i = 1; i <=length(sum); i++){
print sum[i]
}
}
' file1 file2 > out
As per sir #Ed Morton 's comment following could be few recommendations: Again these are not tested since no samples were given but you could try to apply them.
sum[NR] should be as sum[FNR] if in case you want to put index as per line number of Input_file2, why because difference between NR and FNR is that NR's value will be keep keep increasing till all Input_file(s) are read but FNR value will be RESET to 1 whenever there is new Input_file is being read.
Then, length(sum) could be value of FNR because basically you may be looking for total number of times loop has to run which you could get by FNR value.

Why doesn't this awk code do what I want it to do?

I have this file:
$ head -n 4 badRegionFromHWE.merged
seqnames start end width strand
chr1 144118070 145868461 1750392 *
chr7 100049516 101110026 1060511 *
chr7 141508887 142999071 1490185 *
$
I want to not print out the header line and print column 1,2,3 separated by tabs. So I wrote this:
awk 'OFS="\t";NR>1{print$1,$2,$3}' badRegionFromHWE.merged | head
seqnames start end width strand
chr1 144118070 145868461 1750392 *
chr1 144118070 145868461
chr7 100049516 101110026 1060511 *
chr7 100049516 101110026
chr7 141508887 142999071 1490185 *
chr7 141508887 142999071
It doesn't do what I wanted it to do!
The assignment OFS="\t" evaluates to true (non-zero, non-empty) on every line, so it prints every line. You should enclose the expression in a BEGIN block:
awk 'BEGIN { OFS="\t" } NR > 1 { print$1, $2, $3 }' badRegionFromHWE.merged

AWK prints empty line of NA's at end of file

I have an older script that has been bugging me for a while now, which has a small bug in it that I haven't really gotten around to fixing, but I think it's about time. The script basically appends the columns of different files based on the ID of the rows. For example...
test1.txt:
a 3
b 2
test2.txt:
a 5
b 9
... should yield a result of:
a 3 5
b 2 9
The script itself looks like this:
#!/bin/bash
gawk 'BEGIN { OFS="\t" }
{ vals[$1,ARGIND]=$2; keys[$1] }
END {
for (key in keys) {
printf "%s%s", key, OFS
for (colNr=1; colNr<=ARGIND; colNr++) {
printf "%s%s", vals[key,colNr], (colNr<ARGIND?OFS:ORS)
}
} printf "\n"
}' $1 $2
... called as $ script.sh test1.txt test2.txt. The problem is that the result I get is not exactly what I should get:
a 3 5
b 2 9
NA NA NA
... i.e. I get a row with NA's at the very end of the file. So far I've just been deleting this row manually, but it'd be nice to not have to do that. I don't really see where this weird functionality is coming from, though... Anybody got any ideas? I'm using GAWK on OSX, if that matters.
Here's some actual input (that's what I get for trying to make the question simple and to the point! =P)
target_id length eff_length est_counts tpm
ENST00000574176 596 282 6 0.825408
ENST00000575242 103 718 105 5.19804
ENST00000573052 291 291 21 2.61356
ENST00000312051 223 192 2559 46.8843
I'm interested in the target_id and tpm columns, the others are unimportant. My full script:
FILES=$(find . -name 'data.txt' | xargs)
# Get replicate names for column header
printf "%s" 'ENSTID'
for file in $FILES; do
file2="${file/.results\/data.txt/}"
file3="${file2/.\/*\//}"
printf "\t%s" $file3
done
printf "\n"
gawk 'BEGIN { OFS="\t" }
{ vals[$1,ARGIND]=$5; keys[$1] }
END {
for (key in keys) {
printf "%s%s", key, OFS
for (colNr=1; colNr<=ARGIND; colNr++) {
printf "%s%s", vals[key,colNr], (colNr<ARGIND?OFS:ORS)
}
} printf "\n"
}' $FILES
(i.e. all the files are named data.txt, but are located in differently named subfolders.)
A simpler idiomatic way to do it would be
$ cat test1.txt
a 3
b 2
$ cat test2.txt
a 5
b 9
$ awk -v OFS="\t" 'NR==FNR{rec[$1]=$0;next}$1 in rec{print rec[$1],$2}' test1.txt test2.txt
a 3 5
b 2 9
For the actual input
$ cat test1.txt
target_id length eff_length est_counts tpm
ENST00000574176 596 282 6 0.825408
ENST00000575242 103 718 105 5.19804
ENST00000573052 291 291 21 2.61356
ENST00000312051 223 192 2559 46.8843
$ cat test2.txt
target_id length eff_length est_counts tpm
ENST00000574176 996 122 6 0.3634
ENST00000575242 213 618 105 7.277
ENST00000573052 329 291 89 2.0356
ENST00000312051 21 00 45 0.123
$ awk 'NR==FNR{rec1[$1]=$1;rec2[$1]=$5;next}$1 in rec1{printf "%-20s %-15s %-15s\n", rec1[$1],rec2[$1],$5}' test1.txt test2.txt
target_id tpm tpm
ENST00000574176 0.825408 0.3634
ENST00000575242 5.19804 7.277
ENST00000573052 2.61356 2.0356
ENST00000312051 46.8843 0.123
Notes :
-v OFS="\t" is for tab separated fields in output, order of passed files is important (Matters to first solution).
Hard-coding newlines as in
printf "%-20s %-15s %-15s\n", rec1[$1],rec2[$1],$5
is not a good idea as it renders the script less portable.You may well replace it with
printf "%-20s %-15s %-15s", rec1[$1],rec2[$1],$5;print # same effect
Edit : for more than two files
$ shopt -s globstar
$ awk 'NR==FNR{rec1[$1]=$1" "$5;next}{if($1 in rec1){rec1[$1]=rec1[$1]" "$5}else{rec1[$1]=$1" "$5}}END{for(i in rec1){if(i != "target_id"){print rec1[i];}}}' **/test*.txt
ENST00000312051 46.8843 46.8843 0.123 46.8843 0.123
ENST00000573052 2.61356 2.61356 2.0356 2.61356 2.0356
ENST00000575242 5.19804 5.19804 7.277 5.19804 7.277
ENST00000574176 0.825408 0.825408 0.3634 0.825408 0.3634
ENST77777777777 01245
ENST66666666666 7.277 7.277
$ shopt -u globstar
As far as I can see, the only reason you would get an empty line at the end of the output (which is what I get with gawk on OS X) is that you have a printf "\n" at the end of the script, which will add a newline even though you've just printed ORS.
Since your bash script is essentially an awk script, I would make a proper awk script out of it. That would additionally save you the problem of having incorrect quoting of $1 and $2 in the shell script (would break on exotic filenames). This also gives you proper syntax highlighting in your favourite text editor, if it understands Awk:
#!/usr/bin/gawk -f
BEGIN { OFS = "\t" }
{
vals[$1,ARGIND] = $2;
keys[$1] = 1;
}
END {
for (key in keys) {
printf("%s%s", key, OFS);
for (colNr = 1; colNr <= ARGIND; colNr++) {
printf("%s%s", vals[key,colNr], (colNr < ARGIND ? OFS : ORS));
}
}
}
The same can be done with more complex sed editing scripts.