awk equivalents for tidyverse concepts (melt and spread) - awk

I have some text logs that I need to parse and format into CSV.
I have a working R script, but it is slow once file sizes increase, and this problem seems like a good candidate for a speed-up using awk (or other command-line tools?), as I understand it.
I have not done much with awk, and the issue I am having is translating how I think about processing in R to how awk scripting is done.
Example truncated input data (Scrap.log):
; these are comment lines
; *******************************************************************************
; \\C:\Users\Computer\Folder\Folder\Scrap.log
!!G 99999 % % % % % % % % CURRENT XYZ ABC STATE1 STATE2
_START Header1 Header2 Header3 Header4 Header5 Header6 Header7
10 12.23 1.91 6.63 1.68 50.03 0.50 13.97
11 11.32 1.94 6.64 1.94 50.12 0.58 15.10
12 12.96 2.15 6.57 2.12 55.60 0.62 16.24
13 11.43 2.18 6.60 2.36 50.89 0.68 17.39
14 14.91 2.32 6.64 2.59 56.09 0.73 18.41
15 13.16 2.38 6.53 2.85 51.62 0.81 19.30
16 15.02 2.50 6.67 3.05 56.22 0.85 20.12
!!G 99999 % % % % % % % % CURRENT XYZ ABC STATE1 STATE2
_START Header8 Header9 Header10 Header11 Header12 Header13 Header14
10 22.03 24.41 15.01 51.44 44.28 16.57 11.52
11 21.05 24.62 15.62 51.23 45.42 16.47 11.98
12 20.11 24.64 16.38 52.16 46.59 16.54 12.42
13 24.13 24.93 17.23 52.34 47.72 16.51 12.88
14 27.17 24.95 18.06 52.79 48.72 16.45 13.30
15 22.87 25.04 19.27 53.01 49.50 16.47 13.63
16 23.08 25.22 20.12 53.75 50.64 16.55 14.03
Expected output (truncated):
HH1,HH2,HH3,HH4,HH5,HH6,HH7,HH8,HH9,HH10,HH11,HH12,HH13,HH14,START,HeaderName,Value
99999,CURRENT,XYZ,ABC,STATE1,STATE2,%,%,%,%,%,%,%,%,10,Header1,12.23
99999,CURRENT,XYZ,ABC,STATE1,STATE2,%,%,%,%,%,%,%,%,10,Header2,1.91
99999,CURRENT,XYZ,ABC,STATE1,STATE2,%,%,%,%,%,%,%,%,10,Header3,6.63
99999,CURRENT,XYZ,ABC,STATE1,STATE2,%,%,%,%,%,%,%,%,10,Header4,1.68
99999,CURRENT,XYZ,ABC,STATE1,STATE2,%,%,%,%,%,%,%,%,10,Header5,50.03
99999,CURRENT,XYZ,ABC,STATE1,STATE2,%,%,%,%,%,%,%,%,10,Header6,0.5
99999,CURRENT,XYZ,ABC,STATE1,STATE2,%,%,%,%,%,%,%,%,10,Header7,13.97
99999,CURRENT,XYZ,ABC,STATE1,STATE2,%,%,%,%,%,%,%,%,11,Header1,11.32
99999,CURRENT,XYZ,ABC,STATE1,STATE2,%,%,%,%,%,%,%,%,11,Header2,1.94
99999,CURRENT,XYZ,ABC,STATE1,STATE2,%,%,%,%,%,%,%,%,11,Header3,6.64
99999,CURRENT,XYZ,ABC,STATE1,STATE2,%,%,%,%,%,%,%,%,11,Header4,1.94
99999,CURRENT,XYZ,ABC,STATE1,STATE2,%,%,%,%,%,%,%,%,11,Header5,50.12
99999,CURRENT,XYZ,ABC,STATE1,STATE2,%,%,%,%,%,%,%,%,11,Header6,0.58
99999,CURRENT,XYZ,ABC,STATE1,STATE2,%,%,%,%,%,%,%,%,11,Header7,15.1
99999,CURRENT,XYZ,ABC,STATE1,STATE2,%,%,%,%,%,%,%,%,12,Header1,12.96
99999,CURRENT,XYZ,ABC,STATE1,STATE2,%,%,%,%,%,%,%,%,12,Header2,2.15
99999,CURRENT,XYZ,ABC,STATE1,STATE2,%,%,%,%,%,%,%,%,12,Header3,6.57
99999,CURRENT,XYZ,ABC,STATE1,STATE2,%,%,%,%,%,%,%,%,12,Header4,2.12
99999,CURRENT,XYZ,ABC,STATE1,STATE2,%,%,%,%,%,%,%,%,12,Header5,55.6
99999,CURRENT,XYZ,ABC,STATE1,STATE2,%,%,%,%,%,%,%,%,12,Header6,0.62
99999,CURRENT,XYZ,ABC,STATE1,STATE2,%,%,%,%,%,%,%,%,12,Header7,16.24
99999,CURRENT,XYZ,ABC,STATE1,STATE2,%,%,%,%,%,%,%,%,13,Header1,11.43
99999,CURRENT,XYZ,ABC,STATE1,STATE2,%,%,%,%,%,%,%,%,13,Header2,2.18
...
My general steps in the R script:
add a single header row with new names at the top of file
spread the top row (starting with !!G) to each row
melt the header column (_START) from wide to long format
Pieces I have working in awk so far include:
how to grab and print the header lines
awk '/_START/ {header = $0; print header}' Scrap.log
How to write a single row with the new header values
awk ' BEGIN{ ORS=" "; for (counter = 1; counter <= 14; counter++) print "HH",counter;}'
I know each block is separated by a newline and starts with !!G, so I can write a match on that. I am unsure whether a split-apply-combine style of thinking works well in awk.
awk '/!!G/,/\n/ {print}' Scrap.log
Alternatively, I tried setting the RS/FS parameters like:
awk ' BEGIN{RS="\n";FS=" ";}/^!!G/{header=$0;print header}/[0-9]/{print $2}END{}' Scrap.log
I then get stuck on iterating over the rows and fields to do the melt step as well as combining the capture groups correctly.
How do I combine all these pieces to get to the CSV format?

I think the following:
awk '
BEGIN{
# output the header line
print "HH1,HH2,HH3,HH4,HH5,HH6,HH7,HH8,HH9,HH10,HH11,HH12,HH13,HH14,START,HeaderName,Value"
}
# ignore comment lines (they start with ";")
/^;/{next}
/!!G/{
valcnt = 1
# save and shuffle the values
val[valcnt++] = $2
val[valcnt++] = $11
val[valcnt++] = $12
val[valcnt++] = $13
val[valcnt++] = $14
val[valcnt++] = $15
val[valcnt++] = $3
val[valcnt++] = $4
val[valcnt++] = $5
val[valcnt++] = $6
val[valcnt++] = $7
val[valcnt++] = $8
val[valcnt++] = $9
val[valcnt++] = $10
next
}
/_START /{
# these are headers - save them to head, to be reused later
for (i = 2; i <= NF; ++i) {
# note: head is indexed by field position (2..NF)
head[i] = $i
}
next
}
# this function is redundant, but it's just easier for me to think about the code
function output(firstval, header, value, \
cur, i) {
cur = valcnt
val[cur++] = firstval
val[cur++] = header
val[cur++] = value
# output val as csv
for (i = 1; i < cur; ++i) {
printf "%s%s", val[i], i != cur - 1 ? "," : "\n"
}
}
/[0-9]+/{
for (i = 2; i <= NF; ++i) {
# add these 3 to all the other values and output them
# ie. add first column, the header from header and the value
output($1, head[i], $i)
}
}
'
Should output what you want. Tested on repl.
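Note the program as quoted isn't given an input file: either append Scrap.log after the closing quote, or save the body to a file (say melt.awk, filename mine) and run:
awk -f melt.awk Scrap.log > Scrap.csv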

Related

Making AWK code more efficient when evaluating sets of records

I have a file with 5 fields of content. I am evaluating 4 lines at a time in the file. So, records 1-4 are evaluated as a set. Records 5-8 are another set. Within each set, I want to extract the time from field 5 when field 4 has the max value. If there are duplicate values in field 4, then evaluate the maximum value in field 2 and use the time in field 5 associated with the max value in field 2.
For example, in the first 4 records, there is a duplicate max value in field 4 (value of 53). If that is true, I need to look at field 2 and find the maximum value, then print the time in field 5 associated with that maximum value in field 2.
The Data Set is:
00 31444 8.7 24 00:04:32
00 44574 12.4 25 00:01:41
00 74984 20.8 53 00:02:22
00 84465 23.5 53 00:12:33
01 34748 9.7 38 01:59:28
01 44471 12.4 37 01:55:29
01 74280 20.6 58 01:10:24
01 80673 22.4 53 01:55:49
The desired Output for records 1 through 4 is 00:12:33
The desired output for records 5 through 8 is 01:10:24
Here is my answer:
Evaluate Records 1 through 4
awk 'NR==1,NR==4 {if(max <= $4) {max = $4; time = $5} else if(max == $4) {max = $2; time = $5};next}END {print time}' test.txt test.txt
Output is: 00:12:33
Evaluate Records 5 through 8
awk 'NR==5,NR==8 {if(max <= $4) {max = $4; time = $5} else if(max == $4) {max = $2; time = $5};next}END {print time}' test.txt test.txt
Output is 01:10:24
Any suggestions on how to evaluate the record ranges more efficiently without having to write an awk statement for each set of records?
Thanks
Based on your sample input, the fact that there are 4 lines for each key (first field) seems to be irrelevant: what you really want is to produce output for each key. So consider sorting the input by your desired comparison fields (field 4, then field 2) and then printing the first desired output value (field 5) seen for each key (field 1):
$ sort -n -k1,1 -k4,4r -k2,2r file | awk '!seen[$1]++{print $5}'
00:12:33
01:10:24
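The !seen[$1]++ test is the standard awk first-occurrence idiom: seen[$1] is zero (false) the first time a key appears and non-zero afterwards, so after sorting, only the best-ranked line per key gets through. A minimal demonstration:
$ printf 'a 1\na 2\nb 3\n' | awk '!seen[$1]++'
a 1
b 3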
This awk code
NR % 4 == 1 {max4 = $4; max2 = $2}
$4 > max4 || $4 == max4 && $2 >= max2 {max4 = $4; max2 = $2; val5 = $5}
NR % 4 == 0 {printf "lines %d-%d: %s\n", (NR - 3), NR, val5}
outputs
lines 1-4: 00:12:33
lines 5-8: 01:10:24
Looking at the data, you might want to group sets by $1 instead of hardcoding 4 lines:
awk '
function emit(nr) {printf "lines %d-%d: %s\n", nr - 3, nr, val5}
$1 != setId {
if (NR > 1) emit(NR - 1)
setId = $1
max4 = $4
max2 = $2
}
$4 > max4 || $4 == max4 && $2 >= max2 {max4 = $4; max2 = $2; val5 = $5}
END {emit(NR)}
' data
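With the sample data, where each key happens to span exactly 4 lines, this prints the same result:
lines 1-4: 00:12:33
lines 5-8: 01:10:24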
An awk-based solution that uses a synthetic ASCII-string-comparison key combining $4 and $5, while avoiding any %-modulo operations:
mawk '
BEGIN { CONVFMT = "%020.f" (__=___=____=_____="")
_+=_+=++_ } { ____= __!=(__=__==$((_____=(+$_ "")"(" $NF)^!_) \
? __ : $!!_) || ____<_____ ? _____ : ____
} _==++___ {
printf(" group %-*s [%*.f, %-*.f] :: %s\n", --_*--_, "\"" (__) "\"", _+_,
NR-++_, ++_, NR, substr(____, index(____, "(")+_^(_____=____=___=""))) }'
group "00" [ 1, 4 ] :: 00:12:33
group "01" [ 5, 8 ] :: 01:10:24

AWK: Help on transforming data table

I have the following file called in.txt:
2020-01-01 fruit banana 3.4
2020-03-02 alcohol smirnov 26.99
2020-03-10 fruit orange 4.20
2020-04-03 fruit orange 4.20
2021-09-01 alcohol beer 6.00
2021-08-03 fruit mango 6.99
2022-01-01 fruit orange 4.30
2022-03-04 alcohol beer 6.00
2022-03-03 alcohol beer 6.00
2022-04-01 fruit mango 7.20
I want to transform the file so it reads something like this:
2020-01-01 2021-01-01 2022-01-01
-2020-12-31 -2021-12-31 -2022-12-31
fruit banana 3.40 0.00 0.00
orange 8.40 0.00 4.30
mango 0.00 6.99 7.20
Subt 11.80 6.99 11.50
alcohol beer 0.00 6.00 12.00
smirnov 26.99 0.00 0.00
Subt 26.99 6.00 12.00
Total 38.79 12.99 23.50
I have started writing the following script but am stuck on how to approach this. How can I display the totals columns side by side? The other problem is that this is just dummy data: I have many different categories other than fruit and alcohol, and it seems wrong to write if statements and for loops for each one. Also, how can I print fruit and alcohol just once rather than for every iteration of column 3, and bring the date range to the top? Help is much appreciated.
#!/usr/bin/env bash
awk '
BEGIN{
FS=OFS="\t";
}
{
if ($2 ~ fruit && $1 >= "2020-01-01" && $1 <= "2020-12-31") {
a[$3]+=$4;
sa+=$4;
}
}
END {
PROCINFO["sorted_in"]="#ind_str_asc";
for (i in a) {
print "fruit", i, a[i]
}
}
' "${#:--}"
Would you please try the following:
#!/bin/bash
awk '
{
year = substr($1, 1, 4) # extract year
if (from == "" || from > year) from = year # first (smallest) year
if (to == "" || to < year) to = year # last (largest) year
if ($3 in category == 0) {
category[$3] = $2 # map item to category
list[$2] = list[$2] fs[$2] $3 # csv of items
fs[$2] = "," # delimiter for csv
}
sum[$3,year] += $4 # sum of the item in the year
subt[$2,year] += $4 # sum of the category in the year
ttl[year] += $4 # sum in the year
}
END {
format1 = "%-10s%-10s" # format for the left cells
format2 = "%-16s" # format for the header
format3 = "%-16.2f" # format for the amounts
# print upper header
printf(format1, "", "")
for (y = from; y <= to; y++) {
printf(format2, y "-01-01")
}
print ""
# print second header
printf(format1, "", "")
for (y = from; y <= to; y++) {
printf(format2, "-" y "-12-31")
}
print ""
for (cat in list) { # loop over the categories ("fruit" and "alcohol")
n = split(list[cat], item, ",") # split into items
for (i = 1; i <= n; i++) { # loop over the items
printf(format1, i == 1 ? cat : "", item[i])
for (y = from; y <= to; y++) { # loop over years
printf(format3, sum[item[i],y]) # append the sum of the year
}
print "" # finally break the line
}
print "" # insert blank line
printf(format1, "Subt", "")
for (y = from; y <= to; y++) {
printf(format3, subt[cat,y]) # append the subtotal
}
print "\n"
}
printf(format1, "Total", "")
for (y = from; y <= to; y++) {
printf(format3, ttl[y]) # append the total amount
}
print ""
}
' in.txt
Output with the provided input:
2020-01-01 2021-01-01 2022-01-01
-2020-12-31 -2021-12-31 -2022-12-31
alcohol smirnov 26.99 0.00 0.00
beer 0.00 6.00 12.00
Subt 26.99 6.00 12.00
fruit banana 3.40 0.00 0.00
orange 8.40 0.00 4.30
mango 0.00 6.99 7.20
Subt 11.80 6.99 11.50
Total 38.79 12.99 23.50
Please forgive me that the order of the items is not the same as the OP's.
Using GNU awk for arrays of arrays:
$ cat tst.awk
BEGIN { OFS="\t" }
{
sub(/-.*/,"",$1)
minYear = ( NR==1 || $1 < minYear ? $1 : minYear )
maxYear = ( NR==1 || $1 > maxYear ? $1 : maxYear )
items[$2][$3]
vals[$1][$2][$3] += $4
typeTots[$1][$2] += $4
yearTots[$1] += $4
}
END {
printf "%s", OFS
for ( year=minYear; year<=maxYear; year++ ) {
printf "%s%s", OFS, year
}
print ""
for ( type in items ) {
itemCnt = 0
for ( item in items[type] ) {
printf "%s%s%s", (itemCnt++ ? "" : type), OFS, item
for ( year=minYear; year<=maxYear; year++ ) {
printf "%s%0.2f", OFS, vals[year][type][item]
}
print ""
}
printf "Subt%s", OFS
for ( year=minYear; year<=maxYear; year++ ) {
printf "%s%0.2f", OFS, typeTots[year][type]
}
print ORS
}
printf "Total%s", OFS
for ( year=minYear; year<=maxYear; year++ ) {
printf "%s%0.2f", OFS, yearTots[year]
}
print ""
}
$ awk -f tst.awk in.txt
2020 2021 2022
alcohol beer 0.00 6.00 12.00
smirnov 26.99 0.00 0.00
Subt 26.99 6.00 12.00
fruit orange 8.40 0.00 4.30
mango 0.00 6.99 7.20
banana 3.40 0.00 0.00
Subt 11.80 6.99 11.50
Total 38.79 12.99 23.50
or if you really want specific date ranges instead of just the year in the header:
$ cat tst.awk
BEGIN { OFS="\t" }
{
sub(/-.*/,"",$1)
minYear = ( NR==1 || $1 < minYear ? $1 : minYear )
maxYear = ( NR==1 || $1 > maxYear ? $1 : maxYear )
items[$2][$3]
vals[$1][$2][$3] += $4
typeTots[$1][$2] += $4
yearTots[$1] += $4
}
END {
printf "%s", OFS
for ( year=minYear; year<=maxYear; year++ ) {
printf "%s%s-01-01", OFS, year
}
print ""
printf "%s", OFS
for ( year=minYear; year<=maxYear; year++ ) {
printf "%s-%s-12-31", OFS, year
}
print ""
for ( type in items ) {
itemCnt = 0
for ( item in items[type] ) {
printf "%s%s%s", (itemCnt++ ? "" : type), OFS, item
for ( year=minYear; year<=maxYear; year++ ) {
printf "%s%0.2f", OFS, vals[year][type][item]
}
print ""
}
printf "Subt%s", OFS
for ( year=minYear; year<=maxYear; year++ ) {
printf "%s%0.2f", OFS, typeTots[year][type]
}
print ORS
}
printf "Total%s", OFS
for ( year=minYear; year<=maxYear; year++ ) {
printf "%s%0.2f", OFS, yearTots[year]
}
print ""
}
$ awk -f tst.awk in.txt | column -s$'\t' -t
2020-01-01 2021-01-01 2022-01-01
-2020-12-31 -2021-12-31 -2022-12-31
alcohol beer 0.00 6.00 12.00
smirnov 26.99 0.00 0.00
Subt 26.99 6.00 12.00
fruit orange 8.40 0.00 4.30
mango 0.00 6.99 7.20
banana 3.40 0.00 0.00
Subt 11.80 6.99 11.50
Total 38.79 12.99 23.50
I believe the following piece of awk code is a good start. The remaining work is just some cleanup and some extra code for the sums (see the sketch after the output below).
BEGIN{
# how many divisions per year
n=1
# initialisation of some variables
tmax=0;tmin=999999; ymax=qmax=0;ymin=9999;qmin=99
}
# split the date into year y and sub-year division q (quarter/trimester/half, depending on n)
{ y=$1+0; q=(substr($1,6,7)+0)%n}
# compute min max time
(y*100+q < tmin) { ymin=y;qmin=q;tmin=y*100+q }
(y*100+q > tmax) { ymax=y;qmax=q;tmax=y*100+q }
# Create arrays that keep track of everything
# a : summed prices by year, q, category and element
# b : just a list of categories, e.g. fruit
# c : just a list of elements and the category each belongs to
{ a[y,q,$2,$3]+=$4; b[$2]; c[$3]=$2 }
END{
# loop over categories (eg fruit)
for(i in b) {
# loop over elements
for(j in c) {
# exclude elements that do not belong to category
if (i!=c[j]) continue
s=i OFS j;
# loop over the time
for (y=ymin;y<=ymax;y++) {
for (q=0;q<n;++q) {
if (y*100+q < tmin) continue
if (y*100+q > tmax) continue
s=s OFS a[y,q,i,j]+0
}
}
print s
}
}
}
This currently outputs:
alcohol beer 0 6 12
alcohol smirnov 26.99 0 0
fruit orange 8.4 0 4.3
fruit mango 0 6.99 7.2
fruit banana 3.4 0 0
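A hedged sketch of the remaining sum bookkeeping (the array names subt and ttl are mine, not from the original): accumulate per-category and overall totals in the main block alongside a, b and c:
{ a[y,q,$2,$3]+=$4; b[$2]; c[$3]=$2; subt[y,q,$2]+=$4; ttl[y,q]+=$4 }
Then, in END, print a Subt row after each category's element loop and a Total row after the category loop, reusing the same time loops (the tmin/tmax guards from the element loop apply here too):
s = "Subt"
for (y=ymin; y<=ymax; y++)
for (q=0; q<n; ++q)
s = s OFS subt[y,q,i]+0
print s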

filtering file according to the highest number in a column of each line

I have the following file:
chr11_pilon3.g3568.t1 transcript:OIT01734 transcript:OIT01734 1.1e-107 389.8 1000 218 992 1 216 130 345 MDALTRHIQGDVPWCMLFADDIILIDETRAGVSERLEIWRQTLESKGFKISRSKTEYLECKFGDEPSGVGREVMLGSQAIAKRDSVRYLGSVIQGDGEIDGDVTHRIGAGWSKWRLASGVLCDKKIPHKLKGKFFRAMVRPAMFYEAECWPVKNSHIQRMKVAEMRMLRWMCGHTRLDKIKNEVIRQKVGVAPVDKKMGEARLRWFGHVRRRGPDA MDALTRHIQGDVPWCMLFADDIVLIDETRVGVNERLEVWRQTLESKGFKLSRSKTEYLECKFSAESSEVGRDVKLGSQVIAKRDSFRYLGSVIQGEGEIDGDVTHRIGAGWSKWRLASGVLCDKKVPQKLKGKFYRAVVRPAMLYGAECWPVKNSHVQRMKVAEMRMLRWMRGLTRLDRIRNEVIREKVGVALVDEKMREARLRWYGHVRRRRPDA MDALTRHIQGDVPWCMLFADDIILIDETRAGVSERLEIWRQTLESKGFKISRSKTEYLECKFGDEPSGVGREVMLGSQAIAKRDSVRYLGSVIQGDGEIDGDVTHRIGAGWSKWRLASGVLCDKKIPHKLKGKFFRAMVRPAMFYEAECWPVKNSHIQRMKVAEMRMLRWMCGHTRLDKIKNEVIRQKVGVAPVDKKMGEARLRWFGHVRRRGPDAR* MKVWERVVEARVREMTSISVNQFGFMPGRSTTEAIHLVRRLVEHFRDKKKDLHMVFIDLENAYDKVPREVLWRCLEAKSVPEAYIRVIKDMYDGAKTRVRTVGGDSDHFPVVMGLHQGSALSPLLFALVMDALTRHIQGDVPWCMLFADDIVLIDETRVGVNERLEVWRQTLESKGFKLSRSKTEYLECKFSAESSEVGRDVKLGSQVIAKRDSFRYLGSVIQGEGEIDGDVTHRIGAGWSKWRLASGVLCDKKVPQKLKGKFYRAVVRPAMLYGAECWPVKNSHVQRMKVAEMRMLRWMRGLTRLDRIRNEVIREKVGVALVDEKMREARLRWYGHVRRRRPDAPVRIYKSAILGHLNSHGSQNALAGPVEAEENRQKTKKEVMEEIIQKSKFFKAQKAKDREENDELTEQLDKDFTSLVESKALLSLTQPDKINALKALVNKNISVGNVKKDEVADVPRKASIGKEKPDTYEMLVSEMALDMRARPSDRTKTPEEIAQEEKERLELLEQEXXXXXXXXXXXXXXDGNASDDNSKLVKDPRTVSGDDLGDDLEEVPRTKLGWIGEILRRKENELESEDAASSGDSDDGEDEGXXXXXXXXXXXXXXXXXXXXDEEQGKTQTIKDWEQSDDDIIDTELEDDDEGFGDDAKKVVKIKDHKEENLSITVAAENKKKMQVFYGVLLQYFAVLANKKPLNSKLLNLLVKPLMEMSAVSPYFAAICARQRLQRTRAQFCEDLKNTGKSSWPSLKTIFLLRLWSMIFPCSDFRHCVMTPAILLMCEYLMRCTIISGRDIAIASFLCSLLLSVIKQSQKFCPEAIVFIQTLLMAALDRKQRSNSQLDNLMEIKELGPLLCIRSSKVEMDSLDFLTLMDLPEDSQYFHSDNYRTSMLVTVLETLQGFVNVYKELISFPEIFMLISKLLCKMAGENHIPDALREKIKDVSQLIDTKAQEHHMLRQPLKMRKKKPVPIRMLNPKFEENFVKGRDYDPDRERA 389.8 1000 216 85.6 185 31 200 0 0 92.6 0 22IV6AV2SN4IV11IL12GSDA1PS1GE3ED1MK4AV6VF9DE29IV1HQ6FY2MV5FL1EG10IV14CR1HL4KR1KR5QE5PL2KE2GR6FY6GR3 85.6 1.1e-107 99.1
gene.10002.1.1.p1 NisylKD957037g0001.1 NisylKD957037g0001.1 0.0e+00 1218.8 3152 668 780 5 667 122 780 KVIARCRPELAHIPSLEEAPVFHPSEEEFEDTLKYVGSILPHVKHYGICRIVPPSSWKPPSCIEEESTVYGVNTHIQRTSELQNLFFKKRLEGACTRTNNKQQKTLSRKSDFGLDIERKEFGCCNEHFEFENGPKLMLKYFKHYADHFKKQYFVKEDQITASEPSIQDIEGEYWRIIENPTEEIEVLQGTSAEIKATESGFPHERGVTIHRPQYVESGWNLNNTPKLQDSLLRFGSHESSSILLPRLSIGMCFSSNLWRIEEHHLYLLSYIHFGAPKIFYGVPGSYRCKFEEAVKKHLPQLSAHPCLFQNMAFQFSPSILTSEGIPVYRCVQNPKEFVLILPGAYHAHVDSGFNCSEAVNFAPFDWLPHGQNAVDLYSEQRRKTSISYDKLLFEAATERIRALAELPLLHKKFFDNLKWRAVCRSNEILTKALKSRFATEVRRRKYMCASLESRKMEDDFCATAKRECSICYYDLYLSAIGCTCSPQKYTCLLHAKQLCSCAWREKYLLIRYEIDELNIMVEALDGKVSAVHKWAKEKLGFPVSDFSKDASKDEMKVKSESGQSLDVEQDRKEASIPSVGPSARTNNLNRVTGSWVEADGLSHQPQPKGIVNDTVEVIFPKISQHATVGKNIMISSNTVLKKHLARESSSTKRTVIILSDDEN KVIARCRPELARIPSLEEAPVFHPNTLKYVASILPHVKHYGICRIVPPSSWKPPSRIEEPSTVYGVNTHIQRTSDLQNLFFKKRLEGACTRTNNKQQKTLSGKSDFGHDIERKEFGCCNEHFEFENGPKLMLKYFKHYADHFKKQYFVKEDQITASEPSIQDIEGEYWRIIENPTEEIEVLQGTSAEIKATESSFPHEGDVTSRRPPQYVESGWNLNNTPKLQDSLLRFGSRESSSILLPRLSIGMCFSSNLWRIEEHHLYLLSYIHFGAPKIFYGVPGSYRCKFEEAVKKHLPQLSAHPCLLQNIAFQFSPSVLTSEGIPVYRCVQNPKEFVLLLPGAYHAHADSGFNCSEAVNFAPFDWLPHGQNAVELYSEQGRKTSISYDKLLFEAATEGIRALPELPLLHKNFFDNLKWRAVYRSNEILTKALKSRVSTEVRRRTYLCASLESRKMEDDFCATTKRECPICYYDLYLSAIGCKCSPHKYTCLLHAKQLCPCAWSEKYLLIRYEIDELNIMVEALDGKVSAVHKWAKEKLGLPVSDVFKDASKDGMKVKSESGQSLDIEQDRKEEVSIPSVGPSARTNNVNRVSGSWVEADGSSHRPQSKGIINDKIEVLFPKISQHATVGKNIMTSSNTVLKKHLARESSSTKRSVIILSDDEN MFGFKVIARCRPELAHIPSLEEAPVFHPSEEEFEDTLKYVGSILPHVKHYGICRIVPPSSWKPPSCIEEESTVYGVNTHIQRTSELQNLFFKKRLEGACTRTNNKQQKTLSRKSDFGLDIERKEFGCCNEHFEFENGPKLMLKYFKHYADHFKKQYFVKEDQITASEPSIQDIEGEYWRIIENPTEEIEVLQGTSAEIKATESGFPHERGVTIHRPQYVESGWNLNNTPKLQDSLLRFGSHESSSILLPRLSIGMCFSSNLWRIEEHHLYLLSYIHFGAPKIFYGVPGSYRCKFEEAVKKHLPQLSAHPCLFQNMAFQFSPSILTSEGIPVYRCVQNPKEFVLILPGAYHAHVDSGFNCSEAVNFAPFDWLPHGQNAVDLYSEQRRKTSISYDKLLFEAATERIRALAELPLLHKKFFDNLKWRAVCRSNEILTKALKSRFATEVRRRKYMCASLESRKMEDDFCATAKRECSICYYDLYLSAIGCTCSPQKYTCLLHAKQLCSCAWREKYLLIRYEIDELNIMVEALDGKVSAVHKWAKEKLGFPVSDFSKDASKDEMKVKSESGQSLDVEQDRKEASIPSVGPSARTNNLNRVTGSWVEADGLSHQPQPKGIVNDTVEVIFPKISQHATVGKNIMISSNTVLKKHLARESSSTKRTVIILSDDEN* MGAKRTRSNSESDDGYKLSVPPGFESLMSFTLKKVKNSEEACNSVALGSGFAQGPSLVAATSTIISTGKLKSSVRHRPWILDDHVDHIEDDSEFEDDKSLSSSAFLPKGVIRGCSSCHNCQKVIARCRPELARIPSLEEAPVFHPNTLKYVASILPHVKHYGICRIVPPSSWKPPSRIEEPSTVYGVNTHIQRTSDLQNLFFKKRLEGACTRTNNKQQKTLSGKSDFGHDIERKEFGCCNEHFEFENGPKLMLKYFKHYADHFKKQYFVKEDQITASEPSIQDIEGEYWRIIENPTEEIEVLQGTSAEIKATESSFPHEGDVTSRRPPQYVESGWNLNNTPKLQDSLLRFGSRESSSILLPRLSIGMCFSSNLWRIEEHHLYLLSYIHFGAPKIFYGVPGSYRCKFEEAVKKHLPQLSAHPCLLQNIAFQFSPSVLTSEGIPVYRCVQNPKEFVLLLPGAYHAHADSGFNCSEAVNFAPFDWLPHGQNAVELYSEQGRKTSISYDKLLFEAATEGIRALPELPLLHKNFFDNLKWRAVYRSNEILTKALKSRVSTEVRRRTYLCASLESRKMEDDFCATTKRECPICYYDLYLSAIGCKCSPHKYTCLLHAKQLCPCAWSEKYLLIRYEIDELNIMVEALDGKVSAVHKWAKEKLGLPVSDVFKDASKDGMKVKSESGQSLDIEQDRKEEVSIPSVGPSARTNNVNRVSGSWVEADGSSHRPQSKGIINDKIEVLFPKISQHATVGKNIMTSSNTVLKKHLARESSSTKRSVIILSDDEN 1218.8 3152 665 91.0 605 52 621 3 8 93.4 0 11HR12SNE-E-E-F-E-D-5GA24CR3EP14ED26RG5LH85GS4RGGD2ISHR2-P24HR70FL2MI7IV20IL8VA25DE5RG17RG4AP7KN10CY13FVAS6KT1ML16AT4SP13TK3QH12SP3RS36FL4FVSF6EG12VI6-EAV13LV3TS8LS2QR2PS3VI2TKVI2IL15IT19TS9 91.0 0.0e+00 99.3
gene.10002.1.4.p1 NisylKD957037g0001.1 NisylKD957037g0001.1 0.0e+00 1216.8 3147 671 780 9 670 123 780 VIARCRPELAHIPSLEEAPVFHPSEEEFEDTLKYVGSILPHVKHYGICRIVPPSSWKPPSCIEEESTVYGVNTHIQRTSELQNLFFKKRLEGACTRTNNKQQKTLSRKSDFGLDIERKEFGCCNEHFEFENGPKLMLKYFKHYADHFKKQYFVKEDQITASEPSIQDIEGEYWRIIENPTEEIEVLQGTSAEIKATESGFPHERGVTIHRPQYVESGWNLNNTPKLQDSLLRFGSHESSSILLPRLSIGMCFSSNLWRIEEHHLYLLSYIHFGAPKIFYGVPGSYRCKFEEAVKKHLPQLSAHPCLFQNMAFQFSPSILTSEGIPVYRCVQNPKEFVLILPGAYHAHVDSGFNCSEAVNFAPFDWLPHGQNAVDLYSEQRRKTSISYDKLLFEAATERIRALAELPLLHKKFFDNLKWRAVCRSNEILTKALKSRFATEVRRRKYMCASLESRKMEDDFCATAKRECSICYYDLYLSAIGCTCSPQKYTCLLHAKQLCSCAWREKYLLIRYEIDELNIMVEALDGKVSAVHKWAKEKLGFPVSDFSKDASKDEMKVKSESGQSLDVEQDRKEASIPSVGPSARTNNLNRVTGSWVEADGLSHQPQPKGIVNDTVEVIFPKISQHATVGKNIMISSNTVLKKHLARESSSTKRTVIILSDDEN VIARCRPELARIPSLEEAPVFHPNTLKYVASILPHVKHYGICRIVPPSSWKPPSRIEEPSTVYGVNTHIQRTSDLQNLFFKKRLEGACTRTNNKQQKTLSGKSDFGHDIERKEFGCCNEHFEFENGPKLMLKYFKHYADHFKKQYFVKEDQITASEPSIQDIEGEYWRIIENPTEEIEVLQGTSAEIKATESSFPHEGDVTSRRPPQYVESGWNLNNTPKLQDSLLRFGSRESSSILLPRLSIGMCFSSNLWRIEEHHLYLLSYIHFGAPKIFYGVPGSYRCKFEEAVKKHLPQLSAHPCLLQNIAFQFSPSVLTSEGIPVYRCVQNPKEFVLLLPGAYHAHADSGFNCSEAVNFAPFDWLPHGQNAVELYSEQGRKTSISYDKLLFEAATEGIRALPELPLLHKNFFDNLKWRAVYRSNEILTKALKSRVSTEVRRRTYLCASLESRKMEDDFCATTKRECPICYYDLYLSAIGCKCSPHKYTCLLHAKQLCPCAWSEKYLLIRYEIDELNIMVEALDGKVSAVHKWAKEKLGLPVSDVFKDASKDGMKVKSESGQSLDIEQDRKEEVSIPSVGPSARTNNVNRVSGSWVEADGSSHRPQSKGIINDKIEVLFPKISQHATVGKNIMTSSNTVLKKHLARESSSTKRSVIILSDDEN MFGFKARIVIARCRPELAHIPSLEEAPVFHPSEEEFEDTLKYVGSILPHVKHYGICRIVPPSSWKPPSCIEEESTVYGVNTHIQRTSELQNLFFKKRLEGACTRTNNKQQKTLSRKSDFGLDIERKEFGCCNEHFEFENGPKLMLKYFKHYADHFKKQYFVKEDQITASEPSIQDIEGEYWRIIENPTEEIEVLQGTSAEIKATESGFPHERGVTIHRPQYVESGWNLNNTPKLQDSLLRFGSHESSSILLPRLSIGMCFSSNLWRIEEHHLYLLSYIHFGAPKIFYGVPGSYRCKFEEAVKKHLPQLSAHPCLFQNMAFQFSPSILTSEGIPVYRCVQNPKEFVLILPGAYHAHVDSGFNCSEAVNFAPFDWLPHGQNAVDLYSEQRRKTSISYDKLLFEAATERIRALAELPLLHKKFFDNLKWRAVCRSNEILTKALKSRFATEVRRRKYMCASLESRKMEDDFCATAKRECSICYYDLYLSAIGCTCSPQKYTCLLHAKQLCSCAWREKYLLIRYEIDELNIMVEALDGKVSAVHKWAKEKLGFPVSDFSKDASKDEMKVKSESGQSLDVEQDRKEASIPSVGPSARTNNLNRVTGSWVEADGLSHQPQPKGIVNDTVEVIFPKISQHATVGKNIMISSNTVLKKHLARESSSTKRTVIILSDDEN* MGAKRTRSNSESDDGYKLSVPPGFESLMSFTLKKVKNSEEACNSVALGSGFAQGPSLVAATSTIISTGKLKSSVRHRPWILDDHVDHIEDDSEFEDDKSLSSSAFLPKGVIRGCSSCHNCQKVIARCRPELARIPSLEEAPVFHPNTLKYVASILPHVKHYGICRIVPPSSWKPPSRIEEPSTVYGVNTHIQRTSDLQNLFFKKRLEGACTRTNNKQQKTLSGKSDFGHDIERKEFGCCNEHFEFENGPKLMLKYFKHYADHFKKQYFVKEDQITASEPSIQDIEGEYWRIIENPTEEIEVLQGTSAEIKATESSFPHEGDVTSRRPPQYVESGWNLNNTPKLQDSLLRFGSRESSSILLPRLSIGMCFSSNLWRIEEHHLYLLSYIHFGAPKIFYGVPGSYRCKFEEAVKKHLPQLSAHPCLLQNIAFQFSPSVLTSEGIPVYRCVQNPKEFVLLLPGAYHAHADSGFNCSEAVNFAPFDWLPHGQNAVELYSEQGRKTSISYDKLLFEAATEGIRALPELPLLHKNFFDNLKWRAVYRSNEILTKALKSRVSTEVRRRTYLCASLESRKMEDDFCATTKRECPICYYDLYLSAIGCKCSPHKYTCLLHAKQLCPCAWSEKYLLIRYEIDELNIMVEALDGKVSAVHKWAKEKLGLPVSDVFKDASKDGMKVKSESGQSLDIEQDRKEEVSIPSVGPSARTNNVNRVSGSWVEADGSSHRPQSKGIINDKIEVLFPKISQHATVGKNIMTSSNTVLKKHLARESSSTKRSVIILSDDEN 1216.8 3147 664 91.0 604 52 620 3 8 93.4 0 10HR12SNE-E-E-F-E-D-5GA24CR3EP14ED26RG5LH85GS4RGGD2ISHR2-P24HR70FL2MI7IV20IL8VA25DE5RG17RG4AP7KN10CY13FVAS6KT1ML16AT4SP13TK3QH12SP3RS36FL4FVSF6EG12VI6-EAV13LV3TS8LS2QR2PS3VI2TKVI2IL15IT19TS9 91.0 0.0e+00 98.7
gene.10002.1.5.p1 NisylKD957037g0001.1 NisylKD957037g0001.1 0.0e+00 1218.8 3152 668 780 5 667 122 780 KVIARCRPELAHIPSLEEAPVFHPSEEEFEDTLKYVGSILPHVKHYGICRIVPPSSWKPPSCIEEESTVYGVNTHIQRTSELQNLFFKKRLEGACTRTNNKQQKTLSRKSDFGLDIERKEFGCCNEHFEFENGPKLMLKYFKHYADHFKKQYFVKEDQITASEPSIQDIEGEYWRIIENPTEEIEVLQGTSAEIKATESGFPHERGVTIHRPQYVESGWNLNNTPKLQDSLLRFGSHESSSILLPRLSIGMCFSSNLWRIEEHHLYLLSYIHFGAPKIFYGVPGSYRCKFEEAVKKHLPQLSAHPCLFQNMAFQFSPSILTSEGIPVYRCVQNPKEFVLILPGAYHAHVDSGFNCSEAVNFAPFDWLPHGQNAVDLYSEQRRKTSISYDKLLFEAATERIRALAELPLLHKKFFDNLKWRAVCRSNEILTKALKSRFATEVRRRKYMCASLESRKMEDDFCATAKRECSICYYDLYLSAIGCTCSPQKYTCLLHAKQLCSCAWREKYLLIRYEIDELNIMVEALDGKVSAVHKWAKEKLGFPVSDFSKDASKDEMKVKSESGQSLDVEQDRKEASIPSVGPSARTNNLNRVTGSWVEADGLSHQPQPKGIVNDTVEVIFPKISQHATVGKNIMISSNTVLKKHLARESSSTKRTVIILSDDEN KVIARCRPELARIPSLEEAPVFHPNTLKYVASILPHVKHYGICRIVPPSSWKPPSRIEEPSTVYGVNTHIQRTSDLQNLFFKKRLEGACTRTNNKQQKTLSGKSDFGHDIERKEFGCCNEHFEFENGPKLMLKYFKHYADHFKKQYFVKEDQITASEPSIQDIEGEYWRIIENPTEEIEVLQGTSAEIKATESSFPHEGDVTSRRPPQYVESGWNLNNTPKLQDSLLRFGSRESSSILLPRLSIGMCFSSNLWRIEEHHLYLLSYIHFGAPKIFYGVPGSYRCKFEEAVKKHLPQLSAHPCLLQNIAFQFSPSVLTSEGIPVYRCVQNPKEFVLLLPGAYHAHADSGFNCSEAVNFAPFDWLPHGQNAVELYSEQGRKTSISYDKLLFEAATEGIRALPELPLLHKNFFDNLKWRAVYRSNEILTKALKSRVSTEVRRRTYLCASLESRKMEDDFCATTKRECPICYYDLYLSAIGCKCSPHKYTCLLHAKQLCPCAWSEKYLLIRYEIDELNIMVEALDGKVSAVHKWAKEKLGLPVSDVFKDASKDGMKVKSESGQSLDIEQDRKEEVSIPSVGPSARTNNVNRVSGSWVEADGSSHRPQSKGIINDKIEVLFPKISQHATVGKNIMTSSNTVLKKHLARESSSTKRSVIILSDDEN MFGFKVIARCRPELAHIPSLEEAPVFHPSEEEFEDTLKYVGSILPHVKHYGICRIVPPSSWKPPSCIEEESTVYGVNTHIQRTSELQNLFFKKRLEGACTRTNNKQQKTLSRKSDFGLDIERKEFGCCNEHFEFENGPKLMLKYFKHYADHFKKQYFVKEDQITASEPSIQDIEGEYWRIIENPTEEIEVLQGTSAEIKATESGFPHERGVTIHRPQYVESGWNLNNTPKLQDSLLRFGSHESSSILLPRLSIGMCFSSNLWRIEEHHLYLLSYIHFGAPKIFYGVPGSYRCKFEEAVKKHLPQLSAHPCLFQNMAFQFSPSILTSEGIPVYRCVQNPKEFVLILPGAYHAHVDSGFNCSEAVNFAPFDWLPHGQNAVDLYSEQRRKTSISYDKLLFEAATERIRALAELPLLHKKFFDNLKWRAVCRSNEILTKALKSRFATEVRRRKYMCASLESRKMEDDFCATAKRECSICYYDLYLSAIGCTCSPQKYTCLLHAKQLCSCAWREKYLLIRYEIDELNIMVEALDGKVSAVHKWAKEKLGFPVSDFSKDASKDEMKVKSESGQSLDVEQDRKEASIPSVGPSARTNNLNRVTGSWVEADGLSHQPQPKGIVNDTVEVIFPKISQHATVGKNIMISSNTVLKKHLARESSSTKRTVIILSDDEN* MGAKRTRSNSESDDGYKLSVPPGFESLMSFTLKKVKNSEEACNSVALGSGFAQGPSLVAATSTIISTGKLKSSVRHRPWILDDHVDHIEDDSEFEDDKSLSSSAFLPKGVIRGCSSCHNCQKVIARCRPELARIPSLEEAPVFHPNTLKYVASILPHVKHYGICRIVPPSSWKPPSRIEEPSTVYGVNTHIQRTSDLQNLFFKKRLEGACTRTNNKQQKTLSGKSDFGHDIERKEFGCCNEHFEFENGPKLMLKYFKHYADHFKKQYFVKEDQITASEPSIQDIEGEYWRIIENPTEEIEVLQGTSAEIKATESSFPHEGDVTSRRPPQYVESGWNLNNTPKLQDSLLRFGSRESSSILLPRLSIGMCFSSNLWRIEEHHLYLLSYIHFGAPKIFYGVPGSYRCKFEEAVKKHLPQLSAHPCLLQNIAFQFSPSVLTSEGIPVYRCVQNPKEFVLLLPGAYHAHADSGFNCSEAVNFAPFDWLPHGQNAVELYSEQGRKTSISYDKLLFEAATEGIRALPELPLLHKNFFDNLKWRAVYRSNEILTKALKSRVSTEVRRRTYLCASLESRKMEDDFCATTKRECPICYYDLYLSAIGCKCSPHKYTCLLHAKQLCPCAWSEKYLLIRYEIDELNIMVEALDGKVSAVHKWAKEKLGLPVSDVFKDASKDGMKVKSESGQSLDIEQDRKEEVSIPSVGPSARTNNVNRVSGSWVEADGSSHRPQSKGIINDKIEVLFPKISQHATVGKNIMTSSNTVLKKHLARESSSTKRSVIILSDDEN 1218.8 3152 665 91.0 605 52 621 3 8 93.4 0 11HR12SNE-E-E-F-E-D-5GA24CR3EP14ED26RG5LH85GS4RGGD2ISHR2-P24HR70FL2MI7IV20IL8VA25DE5RG17RG4AP7KN10CY13FVAS6KT1ML16AT4SP13TK3QH12SP3RS36FL4FVSF6EG12VI6-EAV13LV3TS8LS2QR2PS3VI2TKVI2IL15IT19TS9 91.0 0.0e+00 99.3
gene.10002.1.6.p1 NisylKD957037g0001.1 NisylKD957037g0001.1 0.0e+00 1440.2 3727 799 780 15 798 1 780 MGAKRTRSNGESDDGYKLSVPPGFESLMSFTLKKVKNSEEACNSVALESEFAQSPSQVAATSTIISIGKLKSSVRHRPWILDDHVDHIEDDSEFEDDKSLSSIAFLPKGVIRGCSSCHNCQKVIARCRPELAHIPSLEEAPVFHPSEEEFEDTLKYVGSILPHVKHYGICRIVPPSSWKPPSCIEEESTVYGVNTHIQRTSELQNLFFKKRLEGACTRTNNKQQKTLSRKSDFGLDIERKEFGCCNEHFEFENGPKLMLKYFKHYADHFKKQYFVKEDQITASEPSIQDIEGEYWRIIENPTEEIEVLQGTSAEIKATESGFPHERGVTIHRPQYVESGWNLNNTPKLQDSLLRFGSHESSSILLPRLSIGMCFSSNLWRIEEHHLYLLSYIHFGAPKIFYGVPGSYRCKFEEAVKKHLPQLSAHPCLFQNMAFQFSPSILTSEGIPVYRCVQNPKEFVLILPGAYHAHVDSGFNCSEAVNFAPFDWLPHGQNAVDLYSEQRRKTSISYDKLLFEAATERIRALAELPLLHKKFFDNLKWRAVCRSNEILTKALKSRFATEVRRRKYMCASLESRKMEDDFCATAKRECSICYYDLYLSAIGCTCSPQKYTCLLHAKQLCSCAWREKYLLIRYEIDELNIMVEALDGKVSAVHKWAKEKLGFPVSDFSKDASKDEMKVKSESGQSLDVEQDRKEASIPSVGPSARTNNLNRVTGSWVEADGLSHQPQPKGIVNDTVEVIFPKISQHATVGKNIMISSNTVLKKHLARESSSTKRTVIILSDDEN MGAKRTRSNSESDDGYKLSVPPGFESLMSFTLKKVKNSEEACNSVALGSGFAQGPSLVAATSTIISTGKLKSSVRHRPWILDDHVDHIEDDSEFEDDKSLSSSAFLPKGVIRGCSSCHNCQKVIARCRPELARIPSLEEAPVFHPNTLKYVASILPHVKHYGICRIVPPSSWKPPSRIEEPSTVYGVNTHIQRTSDLQNLFFKKRLEGACTRTNNKQQKTLSGKSDFGHDIERKEFGCCNEHFEFENGPKLMLKYFKHYADHFKKQYFVKEDQITASEPSIQDIEGEYWRIIENPTEEIEVLQGTSAEIKATESSFPHEGDVTSRRPPQYVESGWNLNNTPKLQDSLLRFGSRESSSILLPRLSIGMCFSSNLWRIEEHHLYLLSYIHFGAPKIFYGVPGSYRCKFEEAVKKHLPQLSAHPCLLQNIAFQFSPSVLTSEGIPVYRCVQNPKEFVLLLPGAYHAHADSGFNCSEAVNFAPFDWLPHGQNAVELYSEQGRKTSISYDKLLFEAATEGIRALPELPLLHKNFFDNLKWRAVYRSNEILTKALKSRVSTEVRRRTYLCASLESRKMEDDFCATTKRECPICYYDLYLSAIGCKCSPHKYTCLLHAKQLCPCAWSEKYLLIRYEIDELNIMVEALDGKVSAVHKWAKEKLGLPVSDVFKDASKDGMKVKSESGQSLDIEQDRKEEVSIPSVGPSARTNNVNRVSGSWVEADGSSHRPQSKGIINDKIEVLFPKISQHATVGKNIMTSSNTVLKKHLARESSSTKRSVIILSDDEN MSDCTWQRYKGEVLMGAKRTRSNGESDDGYKLSVPPGFESLMSFTLKKVKNSEEACNSVALESEFAQSPSQVAATSTIISIGKLKSSVRHRPWILDDHVDHIEDDSEFEDDKSLSSIAFLPKGVIRGCSSCHNCQKVIARCRPELAHIPSLEEAPVFHPSEEEFEDTLKYVGSILPHVKHYGICRIVPPSSWKPPSCIEEESTVYGVNTHIQRTSELQNLFFKKRLEGACTRTNNKQQKTLSRKSDFGLDIERKEFGCCNEHFEFENGPKLMLKYFKHYADHFKKQYFVKEDQITASEPSIQDIEGEYWRIIENPTEEIEVLQGTSAEIKATESGFPHERGVTIHRPQYVESGWNLNNTPKLQDSLLRFGSHESSSILLPRLSIGMCFSSNLWRIEEHHLYLLSYIHFGAPKIFYGVPGSYRCKFEEAVKKHLPQLSAHPCLFQNMAFQFSPSILTSEGIPVYRCVQNPKEFVLILPGAYHAHVDSGFNCSEAVNFAPFDWLPHGQNAVDLYSEQRRKTSISYDKLLFEAATERIRALAELPLLHKKFFDNLKWRAVCRSNEILTKALKSRFATEVRRRKYMCASLESRKMEDDFCATAKRECSICYYDLYLSAIGCTCSPQKYTCLLHAKQLCSCAWREKYLLIRYEIDELNIMVEALDGKVSAVHKWAKEKLGFPVSDFSKDASKDEMKVKSESGQSLDVEQDRKEASIPSVGPSARTNNLNRVTGSWVEADGLSHQPQPKGIVNDTVEVIFPKISQHATVGKNIMISSNTVLKKHLARESSSTKRTVIILSDDEN* MGAKRTRSNSESDDGYKLSVPPGFESLMSFTLKKVKNSEEACNSVALGSGFAQGPSLVAATSTIISTGKLKSSVRHRPWILDDHVDHIEDDSEFEDDKSLSSSAFLPKGVIRGCSSCHNCQKVIARCRPELARIPSLEEAPVFHPNTLKYVASILPHVKHYGICRIVPPSSWKPPSRIEEPSTVYGVNTHIQRTSDLQNLFFKKRLEGACTRTNNKQQKTLSGKSDFGHDIERKEFGCCNEHFEFENGPKLMLKYFKHYADHFKKQYFVKEDQITASEPSIQDIEGEYWRIIENPTEEIEVLQGTSAEIKATESSFPHEGDVTSRRPPQYVESGWNLNNTPKLQDSLLRFGSRESSSILLPRLSIGMCFSSNLWRIEEHHLYLLSYIHFGAPKIFYGVPGSYRCKFEEAVKKHLPQLSAHPCLLQNIAFQFSPSVLTSEGIPVYRCVQNPKEFVLLLPGAYHAHADSGFNCSEAVNFAPFDWLPHGQNAVELYSEQGRKTSISYDKLLFEAATEGIRALPELPLLHKNFFDNLKWRAVYRSNEILTKALKSRVSTEVRRRTYLCASLESRKMEDDFCATTKRECPICYYDLYLSAIGCKCSPHKYTCLLHAKQLCPCAWSEKYLLIRYEIDELNIMVEALDGKVSAVHKWAKEKLGLPVSDVFKDASKDGMKVKSESGQSLDIEQDRKEEVSIPSVGPSARTNNVNRVSGSWVEADGSSHRPQSKGIINDKIEVLFPKISQHATVGKNIMTSSNTVLKKHLARESSSTKRSVIILSDDEN 1440.2 3727 786 91.5 719 59 735 3 8 93.5 0 9GS37EG1EG3SG2QL9IT35IS29HR12SNE-E-E-F-E-D-5GA24CR3EP14ED26RG5LH85GS4RGGD2ISHR2-P24HR70FL2MI7IV20IL8VA25DE5RG17RG4AP7KN10CY13FVAS6KT1ML16AT4SP13TK3QH12SP3RS36FL4FVSF6EG12VI6-EAV13LV3TS8LS2QR2PS3VI2TKVI2IL15IT19TS9 91.5 0.0e+00 98.1
The above file has some IDs which are similar
gene.10002.1.1.p1
gene.10002.1.4.p1
gene.10002.1.5.p1
gene.10002.1.6.p1
By keeping only gene.10002, the IDs become identical. I used this awk script (thanks to @anubhava) to keep, among lines with the same ID, only the one with the smallest value in column 30:
awk '{
if (/^gene\./) {
split($1, a, /\./)
k = a[1] "." a[2]
}
else
k = $1
}
!(k in min) || $30 <= min[k] {
if(!(k in min))
ord[++n] = k
else if (min[k] == $30) {
print
next
}
min[k] = $30
rec[k] = $0
}
END {
for (i=1; i<=n; ++i)
print rec[ord[i]]
}' file
I failed to modify the above awk script to consider the maximum value in column 31 instead, and to keep multiple copies if the column 31 value is the same. Here is my attempt:
awk '{
if (/^gene\./) {
split($1, a, /\./)
k = a[1] "." a[2]
}
else
k = $1
}
!(k in max) || $31 <= max[k] {
if(!(k in max))
ord[++n] = k
else if (max[k] == $31) {
print
next
}
cov[k] = $31
rec[k] = $0
}
END {
for (i=1; i<=n; ++i)
print rec[ord[i]]
}'
Fixing OP's attempt here, could you please try the following. You should change your condition to compare with >= in $31 >= max[k], since we are looking for the maximum value now; I have added a detailed explanation in a later section of this post too.
awk '{
if (/^gene\./) {
split($1, a, /\./)
k = a[1] "." a[2]
}
else
k = $1
}
!(k in max) || $31 >= max[k] {
if(!(k in max))
ord[++n] = k
else if (max[k] == $31) {
print
next
}
max[k] = $31
rec[k] = $0
}
END {
for (i=1; i<=n; ++i)
print rec[ord[i]]
}' Input_file
Explanation: Adding detailed explanation for above.
awk '{ ##Starting awk program from here.
if (/^gene\./) { ##Checking condition: if the line starts with gene. then do the following.
split($1, a, /\./) ##Splitting first field into array a with dot as the delimiter.
k = a[1] "." a[2] ##Creating variable k with the value of a[1] DOT a[2] here.
}
else ##In case the line does NOT start with gene. do the following.
k = $1 ##Setting 1st field value to k here.
}
!(k in max) || $31 >= max[k] { ##Run this block if k is NOT yet in the max array, OR the 31st field is >= max[k].
if(!(k in max)) ##If k is NOT yet present in max,
ord[++n] = k ##record the key in ord, indexed by the increasing counter n.
else if (max[k] == $31) { ##Else, if the 31st field ties the stored maximum, print the duplicate line right away; no need to keep it in the array.
print ##Printing it here.
next ##next will skip all further statements from here.
}
max[k] = $31 ##Creating max with index of k and value of the 31st field.
rec[k] = $0 ##Creating rec with index of k and value of the current line.
}
END { ##Starting END block of this program from here.
for (i=1; i<=n; ++i) ##Starting a for loop from i=1 till the value of n here.
print rec[ord[i]] ##Printing rec indexed by ord[i], i.e. the saved best line per key, in first-seen order.
}' Input_file ##Mentioning Input_file name here.
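For what it's worth, if I am counting fields correctly ($31 is the last field on each sample line), the column-31 values are 99.1, 99.3, 98.7, 99.3 and 98.1, so this should keep three lines: the chr11_pilon3.g3568.t1 line (a key of its own) plus the two gene.10002 lines that tie at the maximum of 99.3 (gene.10002.1.1.p1 and gene.10002.1.5.p1); the tie is printed inline and the rest at END.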

awk setting variables to make a range

I have the following two files:
File 1:
1 290 rs1345
2 450 rs5313
1 1120 rs4523
2 790 rs4325
File 2:
1 201 LDLR
2 714 APOA5
1 818 NOTCH5
1 514 TTN
I wish to isolate only the rows in file 2 in which the second field is within 100 units of the second field in file 1 (if field 1 matches):
Desired output (note the appended field is the third field from the matching line in file1):
1 201 LDLR rs1345
2 714 APOA5 rs4325
I tried using the following code:
for i in {1..4} #there are 4 lines in file2
do
chr=$(awk 'NR=="'${i}'" { print $1 }' file2)
pos=$(awk 'NR=="'${i}'" { print $2 }' file2)
gene=$(awk 'NR=="'${i}'" { print $3 }' file2)
start=$(echo $pos | awk '{print $1-100}') #start and end variables for 100 unit range
end=$(echo $pos | awk '{print $1+100}')
awk '{if ($1=="'$chr'" && $2 > "'$start'" && $2 < "'$end'") print "'$chr'","'$pos'","'$gene'"$3}' file1
done
The code is not working. I believe something is wrong with my start and end variables, because when I echo $start I get 414, which doesn't make sense to me, and I get 614 when I echo $end.
I understand this question might be difficult to understand so please ask me if any clarification is necessary.
Thank you.
The difficulty is that $1 is not a unique key, so some care needs to be taken with the data structure to store the data in file 1.
With GNU awk, you can use arrays of arrays:
gawk '
NR==FNR {f1[$1][$2] = $3; next}
$1 in f1 {
for (val in f1[$1])
if (val-100 <= $2 && $2 <= val+100)
print $0, f1[$1][val]
}
' file1 file2
Otherwise, you have to use a one-dimensional array and stuff 2 pieces of information into the key:
awk '
NR==FNR {f1[$1,$2] = $3; next}
{
for (key in f1) {
split(key, a, SUBSEP)
if (a[1] == $1 && a[2]-100 <= $2 && $2 <= a[2]+100)
print $0, f1[key]
}
}
' file1 file2
That works with mawk and nawk (and gawk).
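Run as shown with the two sample files, either version reproduces the desired output:
1 201 LDLR rs1345
2 714 APOA5 rs4325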
#!/usr/bin/env python3
import pandas as pd
from io import StringIO
file1 = """
1 290 rs1345
2 450 rs5313
1 1120 rs4523
2 790 rs4325
"""
file2 = """
1 201 LDLR
2 714 APOA5
1 818 NOTCH5
1 514 TTN
"""
sio = StringIO(file1)
df1 = pd.read_table(sio, sep=" ", header=None)
df1.columns = ["a", "b", "c"]
sio = StringIO(file2)
df2 = pd.read_table(sio, sep=" ", header=None)
df2.columns = ["a", "b", "c"]
df = pd.merge(df2, df1, left_on="a", right_on="a", how="outer")
# query is intuitive
r = df.query("b_y - 100 < b_x < b_y + 100")
print(r[["a", "b_x", "c_x", "c_y"]])
output:
a b_x c_x c_y
0 1 201 LDLR rs1345
7 2 714 APOA5 rs4325
pandas is the right tool to do such tabular data manipulation.
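Note that merging on "a" pairs every file2 row with every file1 row sharing that key, which mirrors the nested loop in the awk answers; the query then filters those pairs down to the ±100 window.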

Using awk to find min and max?

I have a large file that contains many types of lines. The lines of interest all start with ATOM, and their 7th-9th fields are the x, y, and z values of the specified atom. How can I use awk to find all ATOM lines and then calculate the min and max of the x, y, and z values?
This is my file: http://pastebin.com/EqA2SUMy
The ATOM lines look like this:
ATOM 1 N ASP A 435 7.397 28.376 121.784 1.00 34.35 N
ATOM 2 CA ASP A 435 8.023 27.301 122.545 1.00 30.66 C
ATOM 3 C ASP A 435 8.170 27.721 124.009 1.00 31.39 C
ATOM 4 O ASP A 435 9.078 28.509 124.284 1.00 38.78 O
Can anyone show me how to do this please?
#!/usr/bin/awk -f
$1 == "ATOM" {
    # seed the running min/max from the first ATOM line, so the result
    # is correct even for negative or very large coordinates
    if (!seen) {
        min7 = max7 = $7
        min8 = max8 = $8
        min9 = max9 = $9
        seen = 1
    }
    if ($7 < min7)
        min7 = $7
    if ($8 < min8)
        min8 = $8
    if ($9 < min9)
        min9 = $9
    if ($7 > max7)
        max7 = $7
    if ($8 > max8)
        max8 = $8
    if ($9 > max9)
        max9 = $9
}
END {
print min7, min8, min9
print max7, max8, max9
}