Compare two numerical ranges in two distinct files with awk and print ALL lines from file1 and the matching ones from file2 - awk

This new question is a follow-up to a recent question: Compare two numerical ranges in two distinct files with awk. The solution proposed there worked perfectly, but it was not practical for downstream analysis (a misconception in my question, not a problem with the solution that worked).
I have a file1 with 3 columns. Columns 2 and 3 define a numerical range. Data are sorted in ascending order on column 2. Numerical ranges never overlap.
file1
S 24 96
S 126 352
S 385 465
S 548 600
S 621 707
S 724 736
I have a second file2 (test) structured similarly.
file2
S 27 93
S 123 348
S 542 584
S 726 740
S 1014 2540
S 12652 12987
Desired output: print ALL lines from file1 and, next to them, the lines of file2 whose numerical ranges overlap (even partially) the ranges of file1. If no range from file2 overlaps a given range of file1, print a zero next to that file1 range.
S 24 96 S 27 93 * 27-93 overlaps with 24-96
S 126 352 S 123 348 * 123-348 overlaps with 126-352
S 385 465 0 * nothing in file2 overlaps with this range
S 548 600 S 542 584 * 542-584 overlaps with 548-600
S 621 707 0 * nothing in file2 overlaps with this range
S 724 736 S 726 740 * 726-740 overlaps with 724-736
Based on the answer from @EdMorton to the previous question, I modified the print command of the tst.awk script to add these new features. I also changed the file order from file1/file2 to file2/file1 so that all the lines from file1 are printed (whether or not there is a match in the second file):
'NR == FNR {
    begs2ends[$2] = $3
    next
}
{
    for (beg in begs2ends) {
        end = begs2ends[beg] + 0
        beg += 0
        if ( ( ($2 >= beg) && ($2 <= end) ) ||
             ( ($3 >= beg) && ($3 <= end) ) ||
             ( ($2 <= beg) && ($3 >= end) ) ) {
            print $0,"\t",$1,"\t",beg,"\t",end
        else
            print $0,"\t","0"
        next
        }
    }
}
Note: $1 is identical in file1 and file2, which is why I used print ... $1 to make it appear. I have no idea how to print it from file2 rather than file1 (if I understand correctly, this $1 refers to file1).
And I launch the analysis with awk -f tst.awk file2 file1
The script does not accept the else clause and I don't understand why. I assume it is linked to the looping, but I have tried several changes without any success.
Thanks if you can help me with this.

Assumptions:
a range from file1 can only overlap with one range from file2
The current code is almost correct; it just needs some work on the placement of the braces (using consistent indentation helps):
awk '
BEGIN { OFS="\t" }    # output field delimiter is "\t"
NR == FNR { begs2ends[$2] = $3; next }
{
    # $1=$1           # uncomment to have current line ($0) reformatted with "\t" delimiters during print
    for (beg in begs2ends) {
        end = begs2ends[beg] + 0
        beg += 0
        if ( ( ($2 >= beg) && ($2 <= end) ) ||
             ( ($3 >= beg) && ($3 <= end) ) ||
             ( ($2 <= beg) && ($3 >= end) ) ) {
            print $0,$1,beg,end    # spacing within $0 unchanged, 3 new fields prefaced with "\t"
            next
        }
    }
    # if we get this far it is because we have exhausted the "for" loop
    # (ie, found no overlaps) so print current line + "0"
    print $0,"0"    # spacing within $0 unchanged, 1 new field prefaced with "\t"
}
' file2 file1
This generates:
S 24 96 S 27 93
S 126 352 S 123 348
S 385 465 0
S 548 600 S 542 584
S 621 707 0
S 724 736 S 726 740
With the $1=$1 line uncommented, the output is the same except that the fields within $0 are also re-joined with "\t":
S 24 96 S 27 93
S 126 352 S 123 348
S 385 465 0
S 548 600 S 542 584
S 621 707 0
S 724 736 S 726 740
S 900 1000 S 901 905
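As an aside, the three-clause test above can be collapsed into the standard two-interval overlap check: two well-formed ranges overlap if and only if each one starts before the other ends. A minimal sketch keeping the same structure and file order as the script above (equivalent logic, not a separate approach):
awk '
BEGIN { OFS="\t" }
NR == FNR { begs2ends[$2] = $3; next }
{
    for (beg in begs2ends) {
        end = begs2ends[beg] + 0
        beg += 0
        if (beg <= $3 && $2 <= end) {   # same result as the three ||-ed comparisons
            print $0, $1, beg, end
            next
        }
    }
    print $0, "0"
}
' file2 file1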

A slight variation on @markp-fuso's answer.
It works with GNU awk; saved as overlaps.awk:
BEGIN { PROCINFO["sorted_in"] = "@ind_num_asc" }
function in_range(val, min, max) { return min <= val && val <= max }
NR == FNR {
    line[FNR] = $0
    lo[FNR] = $2
    hi[FNR] = $3
    next
}
{
    overlap = "0"
    for (i in line) {
        if (in_range(lo[i], $2, $3) || in_range(hi[i], $2, $3)) {
            overlap = line[i]
            delete line[i]
            break
        }
    }
    print $0, overlap
}
Then
gawk -f overlaps.awk file2 file1 | column -t
outputs
S 24 96 S 27 93
S 126 352 S 123 348
S 385 465 0
S 548 600 S 542 584
S 621 707 0
S 724 736 S 726 740
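The BEGIN line above relies on a GNU-awk-only feature: assigning "@ind_num_asc" to PROCINFO["sorted_in"] makes every subsequent "for (i in array)" loop visit the indices in ascending numeric order. A tiny throwaway illustration of the effect:
gawk 'BEGIN {
    a[126]; a[24]; a[548]                       # referencing an element creates it
    for (i in a) printf "%s ", i; print ""      # unspecified order
    PROCINFO["sorted_in"] = "@ind_num_asc"
    for (i in a) printf "%s ", i; print ""      # 24 126 548
}'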

$ cat tst.awk
BEGIN { OFS="\t" }
NR == FNR {
    ranges[++numRanges] = $0
    next
}
{
    overlapped = 0
    for ( i=1; i<=numRanges; i++ ) {
        range = ranges[i]
        split(range,vals)
        beg = vals[2]+0
        end = vals[3]+0
        if ( ( ($2 >= beg) && ($2 <= end) ) ||
             ( ($3 >= beg) && ($3 <= end) ) ||
             ( ($2 <= beg) && ($3 >= end) ) ) {
            overlapped = 1
            break
        }
    }
    if ( overlapped ) {
        print $0, range, sprintf("* %d-%d overlaps with %d-%d", beg, end, $2, $3)
    }
    else {
        print $0, 0, sprintf("* nothing in %s overlaps with this range", ARGV[1])
    }
}
$ awk -f tst.awk file2 file1 | column -s$'\t' -t
S 24 96 S 27 93 * 27-93 overlaps with 24-96
S 126 352 S 123 348 * 123-348 overlaps with 126-352
S 385 465 0 * nothing in file2 overlaps with this range
S 548 600 S 542 584 * 542-584 overlaps with 548-600
S 621 707 0 * nothing in file2 overlaps with this range
S 724 736 S 726 740 * 726-740 overlaps with 724-736

Related

Making AWK code more efficient when evaluating sets of records

I have a file with 5 fields of content. I am evaluating 4 lines at a time, so records 1-4 are evaluated as one set and records 5-8 as another set. Within each set, I want to extract the time from field 5 when field 4 has the maximum value. If there are duplicate maximum values in field 4, then evaluate the maximum value in field 2 and use the time in field 5 associated with that maximum.
For example, in the first 4 records there is a duplicate maximum value in field 4 (a value of 53). When that is true, I need to look at field 2, find its maximum value, and print the associated time from field 5.
The Data Set is:
00 31444 8.7 24 00:04:32
00 44574 12.4 25 00:01:41
00 74984 20.8 53 00:02:22
00 84465 23.5 53 00:12:33
01 34748 9.7 38 01:59:28
01 44471 12.4 37 01:55:29
01 74280 20.6 58 01:10:24
01 80673 22.4 53 01:55:49
The desired Output for records 1 through 4 is 00:12:33
The desired output for records 5 through 8 is 01:10:24
Here is my answer:
Evaluate Records 1 through 4
awk 'NR==1,NR==4 {if(max <= $4) {max = $4; time = $5} else if(max == $4) {max = $2; time = $5};next}END {print time}' test.txt test.txt
Output is: 00:12:33
Evaluate Records 5 through 8
awk 'NR==5,NR==8 {if(max <= $4) {max = $4; time = $5} else if(max == $4) {max = $2; time = $5};next}END {print time}' test.txt test.txt
Output is 01:10:24
Any suggestions on how to evaluate the record ranges more efficiently without having to write an awk statement for each set of records?
Thanks
Based on your sample input, the fact that there are 4 lines for each key (first field) seems to be irrelevant; what you really want is to produce output for each key. So consider sorting the input by your desired comparison fields (field 4 then field 2) and then printing the first desired output value (field 5) seen for each block per key (field 1):
$ sort -n -k1,1 -k4,4r -k2,2r file | awk '!seen[$1]++{print $5}'
00:12:33
01:10:24
This awk code
NR % 4 == 1 {max4 = $4; max2 = $2}
$4 > max4 || $4 == max4 && $2 >= max2 {max4 = $4; max2 = $2; val5 = $5}
NR % 4 == 0 {printf "lines %d-%d: %s\n", (NR - 3), NR, val5}
outputs
lines 1-4: 00:12:33
lines 5-8: 01:10:24
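The two NR % 4 patterns are what mark the block boundaries: NR % 4 == 1 fires on the first line of every 4-line set and NR % 4 == 0 on the last. A throwaway one-liner to visualize that:
$ seq 8 | awk '{ printf "NR=%d NR%%4=%d%s\n", NR, NR%4, (NR%4==1 ? "  <- block start" : NR%4==0 ? "  <- block end" : "") }'
NR=1 NR%4=1  <- block start
NR=2 NR%4=2
NR=3 NR%4=3
NR=4 NR%4=0  <- block end
NR=5 NR%4=1  <- block start
NR=6 NR%4=2
NR=7 NR%4=3
NR=8 NR%4=0  <- block end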
Looking at the data, you might want to group sets by $1 instead of hardcoding 4 lines:
awk '
function emit(nr) { printf "lines %d-%d: %s\n", nr - 3, nr, val5 }
$1 != setId {
    if (NR > 1) emit(NR - 1)
    setId = $1
    max4 = $4
    max2 = $2
}
$4 > max4 || $4 == max4 && $2 >= max2 { max4 = $4; max2 = $2; val5 = $5 }
END { emit(NR) }
' data
An awk-based solution that utilizes a synthetic ASCII string-comparison key combining $4 and $5, while avoiding any %-modulo operations:
mawk '
BEGIN { CONVFMT = "%020.f" (__=___=____=_____="")
_+=_+=++_ } { ____= __!=(__=__==$((_____=(+$_ "")"(" $NF)^!_) \
? __ : $!!_) || ____<_____ ? _____ : ____
} _==++___ {
printf(" group %-*s [%*.f, %-*.f] :: %s\n", --_*--_, "\"" (__) "\"", _+_,
NR-++_, ++_, NR, substr(____, index(____, "(")+_^(_____=____=___=""))) }'
group "00" [ 1, 4 ] :: 00:12:33
group "01" [ 5, 8 ] :: 01:10:24

Counting the number of unique values based on more than two columns in bash

I need to modify the below code to work on more than one column.
Counting the number of unique values based on two columns in bash
awk '                ##Starting awk program from here.
BEGIN{
  FS=OFS="\t"
}
!found[$0]++{        ##Checking condition: if current line is NOT present in found array, then do following.
  val[$1]++          ##Creating val with 1st column index and keep increasing its value here.
}
END{                 ##Starting END block of this program from here.
  for(i in val){     ##Traversing through array val here.
    print i,val[i]   ##Printing i and value of val with index i here.
  }
}
' Input_file         ##Mentioning Input_file name here.
Here is the table; I want to count how many times each value occurs across all the DIS columns:
patient sex DISa DISb DISc DISd DISe DISf DISg DISh DISi
patient1 male 550.1 550.5 594.1 594.3 594.8 591 1019 960.1 550.1
patient2 female 041 208 250.2 276.14 426.32 550.1 550.5 558 041
patient3 female NA NA NA NA NA NA NA 041 NA
The output I need is:
550.1 3
550.5 2
594.1 1
594.3 1
594.8 1
591 1
1019 1
960.1 1
550.1 1
041 3
208 1
250.2 1
276.14 1
426.32 1
558 1
Consider this awk:
awk -v OFS='\t' 'NR > 1 {for (i=3; i<=NF; ++i) if ($i+0 == $i) ++fq[$i]} END {for (i in fq) print i, fq[i]}' file
276.14 1
960.1 1
594.3 1
426.32 1
208 1
041 3
594.8 1
550.1 3
591 1
1019 1
558 1
550.5 2
250.2 1
594.1 1
A more readable form:
awk -v OFS='\t' '
NR > 1 {
    for (i=3; i<=NF; ++i)
        if ($i+0 == $i)
            ++fq[$i]
}
END {
    for (i in fq)
        print i, fq[i]
}' file
$i+0 == $i is a check to make sure the column value is numeric.
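A quick way to see what that test does on the kinds of tokens in this file (a throwaway check, not part of the answer):
$ printf 'patient1 male 041 NA 276.14\n' | awk '{ for (i=1; i<=NF; ++i) print $i, ($i+0 == $i ? "numeric" : "skipped") }'
patient1 skipped
male skipped
041 numeric
NA skipped
276.14 numeric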
If the ordering must be preserved, then you need an additional array b[] to keep the order in which each number is first encountered, e.g.:
awk '
BEGIN { OFS = "\t" }
FNR > 1 {
    for (i=3;i<=NF;i++)
        if ($i~/^[0-9]/) {
            if (!($i in a))
                b[++n] = $i;
            a[$i]++
        }
}
END {
    for (i=1;i<=n;i++)
        print b[i], a[b[i]]
}' file
Example Use/Output
$ awk '
> BEGIN { OFS = "\t" }
> FNR > 1 {
>     for (i=3;i<=NF;i++)
>         if ($i~/^[0-9]/) {
>             if (!($i in a))
>                 b[++n] = $i;
>             a[$i]++
>         }
> }
> END {
>     for (i=1;i<=n;i++)
>         print b[i], a[b[i]]
> }' patients
550.1 3
550.5 2
594.1 1
594.3 1
594.8 1
591 1
1019 1
960.1 1
041 3
208 1
250.2 1
276.14 1
426.32 1
558 1
Let me know if you have further questions.
Taking the complete solutions from the above 2 answers (@anubhava and @David) with all respect, and just adding a little tweak (applying a check for an integer value, as per the OP's shown samples) to arrive at 2 solutions here; a quick behaviour check of that test follows the two solutions. Written and tested with the shown samples only.
1st solution: If order doesn't matter in the output, try:
awk -v OFS='\t' '
NR > 1 {
    for (i=3; i<=NF; ++i)
        if (int($i))
            ++fq[$i]
}
END {
    for (i in fq)
        print i, fq[i]
}' Input_file
2nd solution: If order matters, then based on David's answer, try:
awk '
BEGIN { OFS = "\t" }
FNR > 1 {
    for (i=3;i<=NF;i++)
        if (int($i)) {
            if (!($i in a))
                b[++n] = $i;
            a[$i]++
        }
}
END {
    for (i=1;i<=n;i++)
        print b[i], a[b[i]]
}' Input_file
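For reference, here is how the int($i) test behaves on a few token types; it happily skips NA, but it would also skip a value between 0 and 1 such as 0.5, which is fine for the shown samples but worth keeping in mind (a throwaway check, not part of the solutions):
$ printf '041 NA 0.5 276.14\n' | awk '{ for (i=1; i<=NF; ++i) print $i, (int($i) ? "counted" : "skipped") }'
041 counted
NA skipped
0.5 skipped
276.14 counted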
Using GNU awk for multi-char RS:
$ awk -v RS='[[:space:]]+' '$0+0 == $0' file | sort | uniq -c
3 041
1 1019
1 208
1 250.2
1 276.14
1 426.32
3 550.1
2 550.5
1 558
1 591
1 594.1
1 594.3
1 594.8
1 960.1
If the order of fields really matters, just pipe the above to awk '{print $2, $1}', as shown below.
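In other words, the full pipeline with the columns swapped back into value-then-count order would be (a straightforward combination of the commands above):
$ awk -v RS='[[:space:]]+' '$0+0 == $0' file | sort | uniq -c | awk '{print $2, $1}'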

filtering file according to the highest number in a column of each line

I have the following file:
chr11_pilon3.g3568.t1 transcript:OIT01734 transcript:OIT01734 1.1e-107 389.8 1000 218 992 1 216 130 345 MDALTRHIQGDVPWCMLFADDIILIDETRAGVSERLEIWRQTLESKGFKISRSKTEYLECKFGDEPSGVGREVMLGSQAIAKRDSVRYLGSVIQGDGEIDGDVTHRIGAGWSKWRLASGVLCDKKIPHKLKGKFFRAMVRPAMFYEAECWPVKNSHIQRMKVAEMRMLRWMCGHTRLDKIKNEVIRQKVGVAPVDKKMGEARLRWFGHVRRRGPDA MDALTRHIQGDVPWCMLFADDIVLIDETRVGVNERLEVWRQTLESKGFKLSRSKTEYLECKFSAESSEVGRDVKLGSQVIAKRDSFRYLGSVIQGEGEIDGDVTHRIGAGWSKWRLASGVLCDKKVPQKLKGKFYRAVVRPAMLYGAECWPVKNSHVQRMKVAEMRMLRWMRGLTRLDRIRNEVIREKVGVALVDEKMREARLRWYGHVRRRRPDA MDALTRHIQGDVPWCMLFADDIILIDETRAGVSERLEIWRQTLESKGFKISRSKTEYLECKFGDEPSGVGREVMLGSQAIAKRDSVRYLGSVIQGDGEIDGDVTHRIGAGWSKWRLASGVLCDKKIPHKLKGKFFRAMVRPAMFYEAECWPVKNSHIQRMKVAEMRMLRWMCGHTRLDKIKNEVIRQKVGVAPVDKKMGEARLRWFGHVRRRGPDAR* MKVWERVVEARVREMTSISVNQFGFMPGRSTTEAIHLVRRLVEHFRDKKKDLHMVFIDLENAYDKVPREVLWRCLEAKSVPEAYIRVIKDMYDGAKTRVRTVGGDSDHFPVVMGLHQGSALSPLLFALVMDALTRHIQGDVPWCMLFADDIVLIDETRVGVNERLEVWRQTLESKGFKLSRSKTEYLECKFSAESSEVGRDVKLGSQVIAKRDSFRYLGSVIQGEGEIDGDVTHRIGAGWSKWRLASGVLCDKKVPQKLKGKFYRAVVRPAMLYGAECWPVKNSHVQRMKVAEMRMLRWMRGLTRLDRIRNEVIREKVGVALVDEKMREARLRWYGHVRRRRPDAPVRIYKSAILGHLNSHGSQNALAGPVEAEENRQKTKKEVMEEIIQKSKFFKAQKAKDREENDELTEQLDKDFTSLVESKALLSLTQPDKINALKALVNKNISVGNVKKDEVADVPRKASIGKEKPDTYEMLVSEMALDMRARPSDRTKTPEEIAQEEKERLELLEQEXXXXXXXXXXXXXXDGNASDDNSKLVKDPRTVSGDDLGDDLEEVPRTKLGWIGEILRRKENELESEDAASSGDSDDGEDEGXXXXXXXXXXXXXXXXXXXXDEEQGKTQTIKDWEQSDDDIIDTELEDDDEGFGDDAKKVVKIKDHKEENLSITVAAENKKKMQVFYGVLLQYFAVLANKKPLNSKLLNLLVKPLMEMSAVSPYFAAICARQRLQRTRAQFCEDLKNTGKSSWPSLKTIFLLRLWSMIFPCSDFRHCVMTPAILLMCEYLMRCTIISGRDIAIASFLCSLLLSVIKQSQKFCPEAIVFIQTLLMAALDRKQRSNSQLDNLMEIKELGPLLCIRSSKVEMDSLDFLTLMDLPEDSQYFHSDNYRTSMLVTVLETLQGFVNVYKELISFPEIFMLISKLLCKMAGENHIPDALREKIKDVSQLIDTKAQEHHMLRQPLKMRKKKPVPIRMLNPKFEENFVKGRDYDPDRERA 389.8 1000 216 85.6 185 31 200 0 0 92.6 0 22IV6AV2SN4IV11IL12GSDA1PS1GE3ED1MK4AV6VF9DE29IV1HQ6FY2MV5FL1EG10IV14CR1HL4KR1KR5QE5PL2KE2GR6FY6GR3 85.6 1.1e-107 99.1
gene.10002.1.1.p1 NisylKD957037g0001.1 NisylKD957037g0001.1 0.0e+00 1218.8 3152 668 780 5 667 122 780 KVIARCRPELAHIPSLEEAPVFHPSEEEFEDTLKYVGSILPHVKHYGICRIVPPSSWKPPSCIEEESTVYGVNTHIQRTSELQNLFFKKRLEGACTRTNNKQQKTLSRKSDFGLDIERKEFGCCNEHFEFENGPKLMLKYFKHYADHFKKQYFVKEDQITASEPSIQDIEGEYWRIIENPTEEIEVLQGTSAEIKATESGFPHERGVTIHRPQYVESGWNLNNTPKLQDSLLRFGSHESSSILLPRLSIGMCFSSNLWRIEEHHLYLLSYIHFGAPKIFYGVPGSYRCKFEEAVKKHLPQLSAHPCLFQNMAFQFSPSILTSEGIPVYRCVQNPKEFVLILPGAYHAHVDSGFNCSEAVNFAPFDWLPHGQNAVDLYSEQRRKTSISYDKLLFEAATERIRALAELPLLHKKFFDNLKWRAVCRSNEILTKALKSRFATEVRRRKYMCASLESRKMEDDFCATAKRECSICYYDLYLSAIGCTCSPQKYTCLLHAKQLCSCAWREKYLLIRYEIDELNIMVEALDGKVSAVHKWAKEKLGFPVSDFSKDASKDEMKVKSESGQSLDVEQDRKEASIPSVGPSARTNNLNRVTGSWVEADGLSHQPQPKGIVNDTVEVIFPKISQHATVGKNIMISSNTVLKKHLARESSSTKRTVIILSDDEN KVIARCRPELARIPSLEEAPVFHPNTLKYVASILPHVKHYGICRIVPPSSWKPPSRIEEPSTVYGVNTHIQRTSDLQNLFFKKRLEGACTRTNNKQQKTLSGKSDFGHDIERKEFGCCNEHFEFENGPKLMLKYFKHYADHFKKQYFVKEDQITASEPSIQDIEGEYWRIIENPTEEIEVLQGTSAEIKATESSFPHEGDVTSRRPPQYVESGWNLNNTPKLQDSLLRFGSRESSSILLPRLSIGMCFSSNLWRIEEHHLYLLSYIHFGAPKIFYGVPGSYRCKFEEAVKKHLPQLSAHPCLLQNIAFQFSPSVLTSEGIPVYRCVQNPKEFVLLLPGAYHAHADSGFNCSEAVNFAPFDWLPHGQNAVELYSEQGRKTSISYDKLLFEAATEGIRALPELPLLHKNFFDNLKWRAVYRSNEILTKALKSRVSTEVRRRTYLCASLESRKMEDDFCATTKRECPICYYDLYLSAIGCKCSPHKYTCLLHAKQLCPCAWSEKYLLIRYEIDELNIMVEALDGKVSAVHKWAKEKLGLPVSDVFKDASKDGMKVKSESGQSLDIEQDRKEEVSIPSVGPSARTNNVNRVSGSWVEADGSSHRPQSKGIINDKIEVLFPKISQHATVGKNIMTSSNTVLKKHLARESSSTKRSVIILSDDEN MFGFKVIARCRPELAHIPSLEEAPVFHPSEEEFEDTLKYVGSILPHVKHYGICRIVPPSSWKPPSCIEEESTVYGVNTHIQRTSELQNLFFKKRLEGACTRTNNKQQKTLSRKSDFGLDIERKEFGCCNEHFEFENGPKLMLKYFKHYADHFKKQYFVKEDQITASEPSIQDIEGEYWRIIENPTEEIEVLQGTSAEIKATESGFPHERGVTIHRPQYVESGWNLNNTPKLQDSLLRFGSHESSSILLPRLSIGMCFSSNLWRIEEHHLYLLSYIHFGAPKIFYGVPGSYRCKFEEAVKKHLPQLSAHPCLFQNMAFQFSPSILTSEGIPVYRCVQNPKEFVLILPGAYHAHVDSGFNCSEAVNFAPFDWLPHGQNAVDLYSEQRRKTSISYDKLLFEAATERIRALAELPLLHKKFFDNLKWRAVCRSNEILTKALKSRFATEVRRRKYMCASLESRKMEDDFCATAKRECSICYYDLYLSAIGCTCSPQKYTCLLHAKQLCSCAWREKYLLIRYEIDELNIMVEALDGKVSAVHKWAKEKLGFPVSDFSKDASKDEMKVKSESGQSLDVEQDRKEASIPSVGPSARTNNLNRVTGSWVEADGLSHQPQPKGIVNDTVEVIFPKISQHATVGKNIMISSNTVLKKHLARESSSTKRTVIILSDDEN* MGAKRTRSNSESDDGYKLSVPPGFESLMSFTLKKVKNSEEACNSVALGSGFAQGPSLVAATSTIISTGKLKSSVRHRPWILDDHVDHIEDDSEFEDDKSLSSSAFLPKGVIRGCSSCHNCQKVIARCRPELARIPSLEEAPVFHPNTLKYVASILPHVKHYGICRIVPPSSWKPPSRIEEPSTVYGVNTHIQRTSDLQNLFFKKRLEGACTRTNNKQQKTLSGKSDFGHDIERKEFGCCNEHFEFENGPKLMLKYFKHYADHFKKQYFVKEDQITASEPSIQDIEGEYWRIIENPTEEIEVLQGTSAEIKATESSFPHEGDVTSRRPPQYVESGWNLNNTPKLQDSLLRFGSRESSSILLPRLSIGMCFSSNLWRIEEHHLYLLSYIHFGAPKIFYGVPGSYRCKFEEAVKKHLPQLSAHPCLLQNIAFQFSPSVLTSEGIPVYRCVQNPKEFVLLLPGAYHAHADSGFNCSEAVNFAPFDWLPHGQNAVELYSEQGRKTSISYDKLLFEAATEGIRALPELPLLHKNFFDNLKWRAVYRSNEILTKALKSRVSTEVRRRTYLCASLESRKMEDDFCATTKRECPICYYDLYLSAIGCKCSPHKYTCLLHAKQLCPCAWSEKYLLIRYEIDELNIMVEALDGKVSAVHKWAKEKLGLPVSDVFKDASKDGMKVKSESGQSLDIEQDRKEEVSIPSVGPSARTNNVNRVSGSWVEADGSSHRPQSKGIINDKIEVLFPKISQHATVGKNIMTSSNTVLKKHLARESSSTKRSVIILSDDEN 1218.8 3152 665 91.0 605 52 621 3 8 93.4 0 11HR12SNE-E-E-F-E-D-5GA24CR3EP14ED26RG5LH85GS4RGGD2ISHR2-P24HR70FL2MI7IV20IL8VA25DE5RG17RG4AP7KN10CY13FVAS6KT1ML16AT4SP13TK3QH12SP3RS36FL4FVSF6EG12VI6-EAV13LV3TS8LS2QR2PS3VI2TKVI2IL15IT19TS9 91.0 0.0e+00 99.3
gene.10002.1.4.p1 NisylKD957037g0001.1 NisylKD957037g0001.1 0.0e+00 1216.8 3147 671 780 9 670 123 780 VIARCRPELAHIPSLEEAPVFHPSEEEFEDTLKYVGSILPHVKHYGICRIVPPSSWKPPSCIEEESTVYGVNTHIQRTSELQNLFFKKRLEGACTRTNNKQQKTLSRKSDFGLDIERKEFGCCNEHFEFENGPKLMLKYFKHYADHFKKQYFVKEDQITASEPSIQDIEGEYWRIIENPTEEIEVLQGTSAEIKATESGFPHERGVTIHRPQYVESGWNLNNTPKLQDSLLRFGSHESSSILLPRLSIGMCFSSNLWRIEEHHLYLLSYIHFGAPKIFYGVPGSYRCKFEEAVKKHLPQLSAHPCLFQNMAFQFSPSILTSEGIPVYRCVQNPKEFVLILPGAYHAHVDSGFNCSEAVNFAPFDWLPHGQNAVDLYSEQRRKTSISYDKLLFEAATERIRALAELPLLHKKFFDNLKWRAVCRSNEILTKALKSRFATEVRRRKYMCASLESRKMEDDFCATAKRECSICYYDLYLSAIGCTCSPQKYTCLLHAKQLCSCAWREKYLLIRYEIDELNIMVEALDGKVSAVHKWAKEKLGFPVSDFSKDASKDEMKVKSESGQSLDVEQDRKEASIPSVGPSARTNNLNRVTGSWVEADGLSHQPQPKGIVNDTVEVIFPKISQHATVGKNIMISSNTVLKKHLARESSSTKRTVIILSDDEN VIARCRPELARIPSLEEAPVFHPNTLKYVASILPHVKHYGICRIVPPSSWKPPSRIEEPSTVYGVNTHIQRTSDLQNLFFKKRLEGACTRTNNKQQKTLSGKSDFGHDIERKEFGCCNEHFEFENGPKLMLKYFKHYADHFKKQYFVKEDQITASEPSIQDIEGEYWRIIENPTEEIEVLQGTSAEIKATESSFPHEGDVTSRRPPQYVESGWNLNNTPKLQDSLLRFGSRESSSILLPRLSIGMCFSSNLWRIEEHHLYLLSYIHFGAPKIFYGVPGSYRCKFEEAVKKHLPQLSAHPCLLQNIAFQFSPSVLTSEGIPVYRCVQNPKEFVLLLPGAYHAHADSGFNCSEAVNFAPFDWLPHGQNAVELYSEQGRKTSISYDKLLFEAATEGIRALPELPLLHKNFFDNLKWRAVYRSNEILTKALKSRVSTEVRRRTYLCASLESRKMEDDFCATTKRECPICYYDLYLSAIGCKCSPHKYTCLLHAKQLCPCAWSEKYLLIRYEIDELNIMVEALDGKVSAVHKWAKEKLGLPVSDVFKDASKDGMKVKSESGQSLDIEQDRKEEVSIPSVGPSARTNNVNRVSGSWVEADGSSHRPQSKGIINDKIEVLFPKISQHATVGKNIMTSSNTVLKKHLARESSSTKRSVIILSDDEN MFGFKARIVIARCRPELAHIPSLEEAPVFHPSEEEFEDTLKYVGSILPHVKHYGICRIVPPSSWKPPSCIEEESTVYGVNTHIQRTSELQNLFFKKRLEGACTRTNNKQQKTLSRKSDFGLDIERKEFGCCNEHFEFENGPKLMLKYFKHYADHFKKQYFVKEDQITASEPSIQDIEGEYWRIIENPTEEIEVLQGTSAEIKATESGFPHERGVTIHRPQYVESGWNLNNTPKLQDSLLRFGSHESSSILLPRLSIGMCFSSNLWRIEEHHLYLLSYIHFGAPKIFYGVPGSYRCKFEEAVKKHLPQLSAHPCLFQNMAFQFSPSILTSEGIPVYRCVQNPKEFVLILPGAYHAHVDSGFNCSEAVNFAPFDWLPHGQNAVDLYSEQRRKTSISYDKLLFEAATERIRALAELPLLHKKFFDNLKWRAVCRSNEILTKALKSRFATEVRRRKYMCASLESRKMEDDFCATAKRECSICYYDLYLSAIGCTCSPQKYTCLLHAKQLCSCAWREKYLLIRYEIDELNIMVEALDGKVSAVHKWAKEKLGFPVSDFSKDASKDEMKVKSESGQSLDVEQDRKEASIPSVGPSARTNNLNRVTGSWVEADGLSHQPQPKGIVNDTVEVIFPKISQHATVGKNIMISSNTVLKKHLARESSSTKRTVIILSDDEN* MGAKRTRSNSESDDGYKLSVPPGFESLMSFTLKKVKNSEEACNSVALGSGFAQGPSLVAATSTIISTGKLKSSVRHRPWILDDHVDHIEDDSEFEDDKSLSSSAFLPKGVIRGCSSCHNCQKVIARCRPELARIPSLEEAPVFHPNTLKYVASILPHVKHYGICRIVPPSSWKPPSRIEEPSTVYGVNTHIQRTSDLQNLFFKKRLEGACTRTNNKQQKTLSGKSDFGHDIERKEFGCCNEHFEFENGPKLMLKYFKHYADHFKKQYFVKEDQITASEPSIQDIEGEYWRIIENPTEEIEVLQGTSAEIKATESSFPHEGDVTSRRPPQYVESGWNLNNTPKLQDSLLRFGSRESSSILLPRLSIGMCFSSNLWRIEEHHLYLLSYIHFGAPKIFYGVPGSYRCKFEEAVKKHLPQLSAHPCLLQNIAFQFSPSVLTSEGIPVYRCVQNPKEFVLLLPGAYHAHADSGFNCSEAVNFAPFDWLPHGQNAVELYSEQGRKTSISYDKLLFEAATEGIRALPELPLLHKNFFDNLKWRAVYRSNEILTKALKSRVSTEVRRRTYLCASLESRKMEDDFCATTKRECPICYYDLYLSAIGCKCSPHKYTCLLHAKQLCPCAWSEKYLLIRYEIDELNIMVEALDGKVSAVHKWAKEKLGLPVSDVFKDASKDGMKVKSESGQSLDIEQDRKEEVSIPSVGPSARTNNVNRVSGSWVEADGSSHRPQSKGIINDKIEVLFPKISQHATVGKNIMTSSNTVLKKHLARESSSTKRSVIILSDDEN 1216.8 3147 664 91.0 604 52 620 3 8 93.4 0 10HR12SNE-E-E-F-E-D-5GA24CR3EP14ED26RG5LH85GS4RGGD2ISHR2-P24HR70FL2MI7IV20IL8VA25DE5RG17RG4AP7KN10CY13FVAS6KT1ML16AT4SP13TK3QH12SP3RS36FL4FVSF6EG12VI6-EAV13LV3TS8LS2QR2PS3VI2TKVI2IL15IT19TS9 91.0 0.0e+00 98.7
gene.10002.1.5.p1 NisylKD957037g0001.1 NisylKD957037g0001.1 0.0e+00 1218.8 3152 668 780 5 667 122 780 KVIARCRPELAHIPSLEEAPVFHPSEEEFEDTLKYVGSILPHVKHYGICRIVPPSSWKPPSCIEEESTVYGVNTHIQRTSELQNLFFKKRLEGACTRTNNKQQKTLSRKSDFGLDIERKEFGCCNEHFEFENGPKLMLKYFKHYADHFKKQYFVKEDQITASEPSIQDIEGEYWRIIENPTEEIEVLQGTSAEIKATESGFPHERGVTIHRPQYVESGWNLNNTPKLQDSLLRFGSHESSSILLPRLSIGMCFSSNLWRIEEHHLYLLSYIHFGAPKIFYGVPGSYRCKFEEAVKKHLPQLSAHPCLFQNMAFQFSPSILTSEGIPVYRCVQNPKEFVLILPGAYHAHVDSGFNCSEAVNFAPFDWLPHGQNAVDLYSEQRRKTSISYDKLLFEAATERIRALAELPLLHKKFFDNLKWRAVCRSNEILTKALKSRFATEVRRRKYMCASLESRKMEDDFCATAKRECSICYYDLYLSAIGCTCSPQKYTCLLHAKQLCSCAWREKYLLIRYEIDELNIMVEALDGKVSAVHKWAKEKLGFPVSDFSKDASKDEMKVKSESGQSLDVEQDRKEASIPSVGPSARTNNLNRVTGSWVEADGLSHQPQPKGIVNDTVEVIFPKISQHATVGKNIMISSNTVLKKHLARESSSTKRTVIILSDDEN KVIARCRPELARIPSLEEAPVFHPNTLKYVASILPHVKHYGICRIVPPSSWKPPSRIEEPSTVYGVNTHIQRTSDLQNLFFKKRLEGACTRTNNKQQKTLSGKSDFGHDIERKEFGCCNEHFEFENGPKLMLKYFKHYADHFKKQYFVKEDQITASEPSIQDIEGEYWRIIENPTEEIEVLQGTSAEIKATESSFPHEGDVTSRRPPQYVESGWNLNNTPKLQDSLLRFGSRESSSILLPRLSIGMCFSSNLWRIEEHHLYLLSYIHFGAPKIFYGVPGSYRCKFEEAVKKHLPQLSAHPCLLQNIAFQFSPSVLTSEGIPVYRCVQNPKEFVLLLPGAYHAHADSGFNCSEAVNFAPFDWLPHGQNAVELYSEQGRKTSISYDKLLFEAATEGIRALPELPLLHKNFFDNLKWRAVYRSNEILTKALKSRVSTEVRRRTYLCASLESRKMEDDFCATTKRECPICYYDLYLSAIGCKCSPHKYTCLLHAKQLCPCAWSEKYLLIRYEIDELNIMVEALDGKVSAVHKWAKEKLGLPVSDVFKDASKDGMKVKSESGQSLDIEQDRKEEVSIPSVGPSARTNNVNRVSGSWVEADGSSHRPQSKGIINDKIEVLFPKISQHATVGKNIMTSSNTVLKKHLARESSSTKRSVIILSDDEN MFGFKVIARCRPELAHIPSLEEAPVFHPSEEEFEDTLKYVGSILPHVKHYGICRIVPPSSWKPPSCIEEESTVYGVNTHIQRTSELQNLFFKKRLEGACTRTNNKQQKTLSRKSDFGLDIERKEFGCCNEHFEFENGPKLMLKYFKHYADHFKKQYFVKEDQITASEPSIQDIEGEYWRIIENPTEEIEVLQGTSAEIKATESGFPHERGVTIHRPQYVESGWNLNNTPKLQDSLLRFGSHESSSILLPRLSIGMCFSSNLWRIEEHHLYLLSYIHFGAPKIFYGVPGSYRCKFEEAVKKHLPQLSAHPCLFQNMAFQFSPSILTSEGIPVYRCVQNPKEFVLILPGAYHAHVDSGFNCSEAVNFAPFDWLPHGQNAVDLYSEQRRKTSISYDKLLFEAATERIRALAELPLLHKKFFDNLKWRAVCRSNEILTKALKSRFATEVRRRKYMCASLESRKMEDDFCATAKRECSICYYDLYLSAIGCTCSPQKYTCLLHAKQLCSCAWREKYLLIRYEIDELNIMVEALDGKVSAVHKWAKEKLGFPVSDFSKDASKDEMKVKSESGQSLDVEQDRKEASIPSVGPSARTNNLNRVTGSWVEADGLSHQPQPKGIVNDTVEVIFPKISQHATVGKNIMISSNTVLKKHLARESSSTKRTVIILSDDEN* MGAKRTRSNSESDDGYKLSVPPGFESLMSFTLKKVKNSEEACNSVALGSGFAQGPSLVAATSTIISTGKLKSSVRHRPWILDDHVDHIEDDSEFEDDKSLSSSAFLPKGVIRGCSSCHNCQKVIARCRPELARIPSLEEAPVFHPNTLKYVASILPHVKHYGICRIVPPSSWKPPSRIEEPSTVYGVNTHIQRTSDLQNLFFKKRLEGACTRTNNKQQKTLSGKSDFGHDIERKEFGCCNEHFEFENGPKLMLKYFKHYADHFKKQYFVKEDQITASEPSIQDIEGEYWRIIENPTEEIEVLQGTSAEIKATESSFPHEGDVTSRRPPQYVESGWNLNNTPKLQDSLLRFGSRESSSILLPRLSIGMCFSSNLWRIEEHHLYLLSYIHFGAPKIFYGVPGSYRCKFEEAVKKHLPQLSAHPCLLQNIAFQFSPSVLTSEGIPVYRCVQNPKEFVLLLPGAYHAHADSGFNCSEAVNFAPFDWLPHGQNAVELYSEQGRKTSISYDKLLFEAATEGIRALPELPLLHKNFFDNLKWRAVYRSNEILTKALKSRVSTEVRRRTYLCASLESRKMEDDFCATTKRECPICYYDLYLSAIGCKCSPHKYTCLLHAKQLCPCAWSEKYLLIRYEIDELNIMVEALDGKVSAVHKWAKEKLGLPVSDVFKDASKDGMKVKSESGQSLDIEQDRKEEVSIPSVGPSARTNNVNRVSGSWVEADGSSHRPQSKGIINDKIEVLFPKISQHATVGKNIMTSSNTVLKKHLARESSSTKRSVIILSDDEN 1218.8 3152 665 91.0 605 52 621 3 8 93.4 0 11HR12SNE-E-E-F-E-D-5GA24CR3EP14ED26RG5LH85GS4RGGD2ISHR2-P24HR70FL2MI7IV20IL8VA25DE5RG17RG4AP7KN10CY13FVAS6KT1ML16AT4SP13TK3QH12SP3RS36FL4FVSF6EG12VI6-EAV13LV3TS8LS2QR2PS3VI2TKVI2IL15IT19TS9 91.0 0.0e+00 99.3
gene.10002.1.6.p1 NisylKD957037g0001.1 NisylKD957037g0001.1 0.0e+00 1440.2 3727 799 780 15 798 1 780 MGAKRTRSNGESDDGYKLSVPPGFESLMSFTLKKVKNSEEACNSVALESEFAQSPSQVAATSTIISIGKLKSSVRHRPWILDDHVDHIEDDSEFEDDKSLSSIAFLPKGVIRGCSSCHNCQKVIARCRPELAHIPSLEEAPVFHPSEEEFEDTLKYVGSILPHVKHYGICRIVPPSSWKPPSCIEEESTVYGVNTHIQRTSELQNLFFKKRLEGACTRTNNKQQKTLSRKSDFGLDIERKEFGCCNEHFEFENGPKLMLKYFKHYADHFKKQYFVKEDQITASEPSIQDIEGEYWRIIENPTEEIEVLQGTSAEIKATESGFPHERGVTIHRPQYVESGWNLNNTPKLQDSLLRFGSHESSSILLPRLSIGMCFSSNLWRIEEHHLYLLSYIHFGAPKIFYGVPGSYRCKFEEAVKKHLPQLSAHPCLFQNMAFQFSPSILTSEGIPVYRCVQNPKEFVLILPGAYHAHVDSGFNCSEAVNFAPFDWLPHGQNAVDLYSEQRRKTSISYDKLLFEAATERIRALAELPLLHKKFFDNLKWRAVCRSNEILTKALKSRFATEVRRRKYMCASLESRKMEDDFCATAKRECSICYYDLYLSAIGCTCSPQKYTCLLHAKQLCSCAWREKYLLIRYEIDELNIMVEALDGKVSAVHKWAKEKLGFPVSDFSKDASKDEMKVKSESGQSLDVEQDRKEASIPSVGPSARTNNLNRVTGSWVEADGLSHQPQPKGIVNDTVEVIFPKISQHATVGKNIMISSNTVLKKHLARESSSTKRTVIILSDDEN MGAKRTRSNSESDDGYKLSVPPGFESLMSFTLKKVKNSEEACNSVALGSGFAQGPSLVAATSTIISTGKLKSSVRHRPWILDDHVDHIEDDSEFEDDKSLSSSAFLPKGVIRGCSSCHNCQKVIARCRPELARIPSLEEAPVFHPNTLKYVASILPHVKHYGICRIVPPSSWKPPSRIEEPSTVYGVNTHIQRTSDLQNLFFKKRLEGACTRTNNKQQKTLSGKSDFGHDIERKEFGCCNEHFEFENGPKLMLKYFKHYADHFKKQYFVKEDQITASEPSIQDIEGEYWRIIENPTEEIEVLQGTSAEIKATESSFPHEGDVTSRRPPQYVESGWNLNNTPKLQDSLLRFGSRESSSILLPRLSIGMCFSSNLWRIEEHHLYLLSYIHFGAPKIFYGVPGSYRCKFEEAVKKHLPQLSAHPCLLQNIAFQFSPSVLTSEGIPVYRCVQNPKEFVLLLPGAYHAHADSGFNCSEAVNFAPFDWLPHGQNAVELYSEQGRKTSISYDKLLFEAATEGIRALPELPLLHKNFFDNLKWRAVYRSNEILTKALKSRVSTEVRRRTYLCASLESRKMEDDFCATTKRECPICYYDLYLSAIGCKCSPHKYTCLLHAKQLCPCAWSEKYLLIRYEIDELNIMVEALDGKVSAVHKWAKEKLGLPVSDVFKDASKDGMKVKSESGQSLDIEQDRKEEVSIPSVGPSARTNNVNRVSGSWVEADGSSHRPQSKGIINDKIEVLFPKISQHATVGKNIMTSSNTVLKKHLARESSSTKRSVIILSDDEN MSDCTWQRYKGEVLMGAKRTRSNGESDDGYKLSVPPGFESLMSFTLKKVKNSEEACNSVALESEFAQSPSQVAATSTIISIGKLKSSVRHRPWILDDHVDHIEDDSEFEDDKSLSSIAFLPKGVIRGCSSCHNCQKVIARCRPELAHIPSLEEAPVFHPSEEEFEDTLKYVGSILPHVKHYGICRIVPPSSWKPPSCIEEESTVYGVNTHIQRTSELQNLFFKKRLEGACTRTNNKQQKTLSRKSDFGLDIERKEFGCCNEHFEFENGPKLMLKYFKHYADHFKKQYFVKEDQITASEPSIQDIEGEYWRIIENPTEEIEVLQGTSAEIKATESGFPHERGVTIHRPQYVESGWNLNNTPKLQDSLLRFGSHESSSILLPRLSIGMCFSSNLWRIEEHHLYLLSYIHFGAPKIFYGVPGSYRCKFEEAVKKHLPQLSAHPCLFQNMAFQFSPSILTSEGIPVYRCVQNPKEFVLILPGAYHAHVDSGFNCSEAVNFAPFDWLPHGQNAVDLYSEQRRKTSISYDKLLFEAATERIRALAELPLLHKKFFDNLKWRAVCRSNEILTKALKSRFATEVRRRKYMCASLESRKMEDDFCATAKRECSICYYDLYLSAIGCTCSPQKYTCLLHAKQLCSCAWREKYLLIRYEIDELNIMVEALDGKVSAVHKWAKEKLGFPVSDFSKDASKDEMKVKSESGQSLDVEQDRKEASIPSVGPSARTNNLNRVTGSWVEADGLSHQPQPKGIVNDTVEVIFPKISQHATVGKNIMISSNTVLKKHLARESSSTKRTVIILSDDEN* MGAKRTRSNSESDDGYKLSVPPGFESLMSFTLKKVKNSEEACNSVALGSGFAQGPSLVAATSTIISTGKLKSSVRHRPWILDDHVDHIEDDSEFEDDKSLSSSAFLPKGVIRGCSSCHNCQKVIARCRPELARIPSLEEAPVFHPNTLKYVASILPHVKHYGICRIVPPSSWKPPSRIEEPSTVYGVNTHIQRTSDLQNLFFKKRLEGACTRTNNKQQKTLSGKSDFGHDIERKEFGCCNEHFEFENGPKLMLKYFKHYADHFKKQYFVKEDQITASEPSIQDIEGEYWRIIENPTEEIEVLQGTSAEIKATESSFPHEGDVTSRRPPQYVESGWNLNNTPKLQDSLLRFGSRESSSILLPRLSIGMCFSSNLWRIEEHHLYLLSYIHFGAPKIFYGVPGSYRCKFEEAVKKHLPQLSAHPCLLQNIAFQFSPSVLTSEGIPVYRCVQNPKEFVLLLPGAYHAHADSGFNCSEAVNFAPFDWLPHGQNAVELYSEQGRKTSISYDKLLFEAATEGIRALPELPLLHKNFFDNLKWRAVYRSNEILTKALKSRVSTEVRRRTYLCASLESRKMEDDFCATTKRECPICYYDLYLSAIGCKCSPHKYTCLLHAKQLCPCAWSEKYLLIRYEIDELNIMVEALDGKVSAVHKWAKEKLGLPVSDVFKDASKDGMKVKSESGQSLDIEQDRKEEVSIPSVGPSARTNNVNRVSGSWVEADGSSHRPQSKGIINDKIEVLFPKISQHATVGKNIMTSSNTVLKKHLARESSSTKRSVIILSDDEN 1440.2 3727 786 91.5 719 59 735 3 8 93.5 0 9GS37EG1EG3SG2QL9IT35IS29HR12SNE-E-E-F-E-D-5GA24CR3EP14ED26RG5LH85GS4RGGD2ISHR2-P24HR70FL2MI7IV20IL8VA25DE5RG17RG4AP7KN10CY13FVAS6KT1ML16AT4SP13TK3QH12SP3RS36FL4FVSF6EG12VI6-EAV13LV3TS8LS2QR2PS3VI2TKVI2IL15IT19TS9 91.5 0.0e+00 98.1
The above file has some IDs which are similar:
gene.10002.1.1.p1
gene.10002.1.4.p1
gene.10002.1.5.p1
gene.10002.1.6.p1
By keeping only gene.10002, the IDs become identical. I used this awk script (thank you to @anubhava) to keep, for each ID, only the line(s) with the smallest value in column 30:
awk '{
    if (/^gene\./) {
        split($1, a, /\./)
        k = a[1] "." a[2]
    }
    else
        k = $1
}
!(k in min) || $30 <= min[k] {
    if (!(k in min))
        ord[++n] = k
    else if (min[k] == $30) {
        print
        next
    }
    min[k] = $30
    rec[k] = $0
}
END {
    for (i=1; i<=n; ++i)
        print rec[ord[i]]
}' file
I have failed to modify the above awk script to instead consider the maximum value in column 31, and to keep multiple copies of a line if the column 31 value is tied. Here is my attempt:
awk '{
    if (/^gene\./) {
        split($1, a, /\./)
        k = a[1] "." a[2]
    }
    else
        k = $1
}
!(k in max) || $31 <= max[k] {
    if (!(k in max))
        ord[++n] = k
    else if (max[k] == $31) {
        print
        next
    }
    cov[k] = $31
    rec[k] = $0
}
END {
    for (i=1; i<=n; ++i)
        print rec[ord[i]]
}'
Fixing OP's attempt here, could you please try the following. You should change your condition to compare with >= ($31 >= max[k]), since we are looking for the maximum value now, and assign max[k] = $31 rather than cov[k] = $31 so the stored maximum actually gets updated; a detailed explanation is added in a later section of this post too.
awk '{
    if (/^gene\./) {
        split($1, a, /\./)
        k = a[1] "." a[2]
    }
    else
        k = $1
}
!(k in max) || $31 >= max[k] {
    if (!(k in max))
        ord[++n] = k
    else if (max[k] == $31) {
        print
        next
    }
    max[k] = $31
    rec[k] = $0
}
END {
    for (i=1; i<=n; ++i)
        print rec[ord[i]]
}' Input_file
Explanation: adding a detailed explanation for the above.
awk '{                            ##Starting awk program from here.
    if (/^gene\./) {              ##Checking condition: if line starts with "gene." then do following.
        split($1, a, /\./)        ##Splitting first field into array a with delimiter dot here.
        k = a[1] "." a[2]         ##Creating variable k with value of a[1] DOT a[2] here.
    }
    else                          ##In case line does NOT start with "gene." then do following.
        k = $1                    ##Setting 1st field value to k here.
}
!(k in max) || $31 >= max[k] {    ##Checking condition: if k is NOT in max array OR 31st field is >= max[k].
    if (!(k in max))              ##If either of those conditions is true, then check if k is NOT present in max.
        ord[++n] = k              ##Creating ord with index of increasing value of n and its value of k.
    else if (max[k] == $31) {     ##Else, if the line ties the current maximum, print the duplicate right away; no need to keep appending it to an array.
        print                     ##Printing it here.
        next                      ##next will skip all further statements from here.
    }
    max[k] = $31                  ##Creating max with index of k and value of 31st field.
    rec[k] = $0                   ##Creating rec with index of k and value of current line.
}
END {                             ##Starting END block of this program from here.
    for (i=1; i<=n; ++i)          ##Starting a for loop from i=1 till value of n here.
        print rec[ord[i]]         ##Printing array rec with index of: value of ord array which has i index.
}' Input_file                     ##Mentioning Input_file name here.
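To see the tie-handling on something small, here is the same pattern run on toy data (keys in field 1, values in field 2; purely illustrative, not the OP's file). A line that ties the current maximum is printed immediately, and the remembered maximum line for each key is printed at the END:
$ printf 'a 1\na 3\nb 2\na 3\nb 1\n' | awk '
!($1 in max) || $2 >= max[$1] {
    if (!($1 in max))
        ord[++n] = $1
    else if (max[$1] == $2) {   # tie with the current maximum
        print
        next
    }
    max[$1] = $2
    rec[$1] = $0
}
END {
    for (i=1; i<=n; ++i)
        print rec[ord[i]]
}'
a 3
a 3
b 2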

Process multiple files with awk

I would like to count the number of points in each interval. I have the positions of the points in the first file and the intervals in the second. First I store the point attributes in two arrays (pos and name) and then I want to loop over them in order to determine whether each point belongs to the given interval ($1 is the name, $2 is the start and $3 is the end of the interval). I have the following code:
awk 'NR==FNR{name[NR]=$1;pos[NR]=$2;next}; {for (i in name) if (name[i] == $1 && pos[i] >= $2 && pos[i] <= $3) {sum[NR] += 1;}} END {for (i = 1; i <=length(sum); i++) {print sum[i]}} file1 file2 > out'
I have a syntax error: "syntax error near unexpected token `i"
I am a beginner in awk. Any help is highly appreciated. Thanks.
awk '
NR==FNR{
    name[NR]=$1
    pos[NR]=$2
    next
}
{
    for(i in name){
        if(name[i] == $1 && pos[i] >= $2 && pos[i] <= $3){ sum[FNR] += 1; }
    }
}
END {
    for(i = 1; i <=FNR; i++){
        print sum[i];
    }
}
' points windows > output
points:
chr1 52
chr1 65
chr2 120
chr2 101
chr2 160
chr3 20
chr4 50
windows:
chr1 0 100
chr1 100 200
chr2 0 100
chr2 100 200
chr3 0 100
chr3 100 200
chr4 0 100
chr5 0 100
chr6 0 100
chr6 100 200
chr7 0 100
chr8 0 100
This gave me the desired output:
2
3
1
1
Thank You
Your ' is in the wrong place, so the awk command is not ending properly; could you please try the following. I couldn't test it since no samples were given.
awk 'NR==FNR{name[NR]=$1;pos[NR]=$2;next}; {for (i in name) if (name[i] == $1 && pos[i] >= $2 && pos[i] <= $3) {sum[NR] += 1;}} END {for (i = 1; i <=length(sum); i++) {print sum[i]}}' file1 file2
Non-one-liner form of the above solution:
awk '
NR==FNR{
    name[NR]=$1
    pos[NR]=$2
    next
}
{
    for(i in name){
        if(name[i] == $1 && pos[i] >= $2 && pos[i] <= $3){ sum[NR] += 1 }
    }
}
END{
    for(i = 1; i <=length(sum); i++){
        print sum[i]
    }
}
' file1 file2 > out
As per @Ed Morton's comment, the following are a few recommendations. Again, these are not tested since no samples were given, but you could try to apply them:
sum[NR] should be sum[FNR] in case you want the index to follow the line number of the second Input_file, because the difference between NR and FNR is that NR keeps increasing until all Input_file(s) have been read, while FNR is RESET to 1 whenever a new Input_file starts being read (see the small illustration after these recommendations).
Then, length(sum) could be replaced with the value of FNR, because you are basically looking for the total number of times the loop has to run, which you can get from the FNR value.
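A tiny illustration of that NR/FNR difference, using two throwaway two-line files f1 and f2:
$ printf 'a\nb\n' > f1; printf 'c\nd\n' > f2
$ awk '{print FILENAME, "NR=" NR, "FNR=" FNR}' f1 f2
f1 NR=1 FNR=1
f1 NR=2 FNR=2
f2 NR=3 FNR=1
f2 NR=4 FNR=2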

awk setting variables to make a range

I have the following two files:
File 1:
1 290 rs1345
2 450 rs5313
1 1120 rs4523
2 790 rs4325
File 2:
1 201 LDLR
2 714 APOA5
1 818 NOTCH5
1 514 TTN
I wish to isolate only the rows in file 2 in which the second field is within 100 units of the second field in file 1 (if field 1 matches):
Desired output (note the last field comes from the third field of the matching line in file1):
1 201 LDLR rs1345
2 714 APOA5 rs4325
I tried using the following code:
for i in {1..4}    #there are 4 lines in file2
do
    chr=$(awk 'NR=="'${i}'" { print $1 }' file2)
    pos=$(awk 'NR=="'${i}'" { print $2 }' file2)
    gene=$(awk 'NR=="'${i}'" { print $3 }' file2)
    start=$(echo $pos | awk '{print $1-100}')    #start and end variables for 100 unit range
    end=$(echo $pos | awk '{print $1+100}')
    awk '{if ($1=="'$chr'" && $2 > "'$start'" && $2 < "'$end'") print "'$chr'","'$pos'","'$gene'"$3}' file1
done
The code is not working. I believe something is wrong with my start and end variables, because when I echo $start I get 414, which doesn't make sense to me, and I get 614 when I echo $end.
I understand this question might be difficult to understand so please ask me if any clarification is necessary.
Thank you.
The difficulty is that $1 is not a unique key, so some care needs to be taken with the data structure used to store the data from file1.
With GNU awk, you can use arrays of arrays:
gawk '
NR==FNR {f1[$1][$2] = $3; next}
$1 in f1 {
    for (val in f1[$1])
        if (val-100 <= $2 && $2 <= val+100)
            print $0, f1[$1][val]
}
' file1 file2
Otherwise, you have to use a one-dimensional array and stuff 2 pieces of information into the key:
awk '
NR==FNR {f1[$1,$2] = $3; next}
{
    for (key in f1) {
        split(key, a, SUBSEP)
        if (a[1] == $1 && a[2]-100 <= $2 && $2 <= a[2]+100)
            print $0, f1[key]
    }
}
' file1 file2
That works with mawk and nawk (and gawk)
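For anyone unfamiliar with the second form: a comma inside an array subscript joins the pieces with SUBSEP (by default the unlikely character "\034"), and split(key, a, SUBSEP) recovers them. A minimal illustration:
$ awk 'BEGIN {
    x["chr1", 290] = "rs1345"      # stored under the single key "chr1" SUBSEP "290"
    for (key in x) {
        split(key, a, SUBSEP)      # a[1]="chr1", a[2]="290"
        print a[1], a[2], x[key]
    }
}'
chr1 290 rs1345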
#!/usr/bin/python
import pandas as pd
# The original answer used Python 2's "from StringIO import StringIO"; io.StringIO is the Python 3 equivalent
from io import StringIO

file1 = """
1 290 rs1345
2 450 rs5313
1 1120 rs4523
2 790 rs4325
"""
file2 = """
1 201 LDLR
2 714 APOA5
1 818 NOTCH5
1 514 TTN
"""

sio = StringIO(file1)
df1 = pd.read_table(sio, sep=" ", header=None)
df1.columns = ["a", "b", "c"]
sio = StringIO(file2)
df2 = pd.read_table(sio, sep=" ", header=None)
df2.columns = ["a", "b", "c"]

df = pd.merge(df2, df1, left_on="a", right_on="a", how="outer")
# query is intuitive
r = df.query("b_y - 100 < b_x < b_y + 100")
print(r[["a", "b_x", "c_x", "c_y"]])
output:
a b_x c_x c_y
0 1 201 LDLR rs1345
7 2 714 APOA5 rs4325
pandas is the right tool for this kind of tabular data manipulation.