How can I re-format the fields of a FASTA file header and fold the sequence? - awk

One of the most widely used file formats in bioinformatics is the FASTA
format.
FASTA files are simple: they contain a "header" record that starts
with a ">", followed by the "sequence" record, which is everything after the
header and before the next record separator (i.e., the next ">").
>ENSP00000488314.1 pep chromosome:GRCh38:X:143884071:143885255:1 gene:ENSG00000276380.2 transcript:ENST00000618570.1 gene_biotype:polymorphic_pseudogene transcript_biotype:polymorphic_pseudogene gene_symbol:UBE2NL description:ubiquitin conjugating enzyme E2 N like (gene/pseudogene) [Source:HGNC Symbol;Acc:HGNC:31710]
MAELPHRIIKETQRLLAEPVPGIKAEPDESNARYFHVVIAGESKDSPFEGGTFKRELLLA
EEYPMAAPKVRFMTKIYHPNVDKLERIS*DILKDKWSPALQIRTVLLSIQALLNAPNPDD
PLANDVVEQWKTNEAQAIETARAWTRLYAMNSI
> next record...
> another one...
The header can be very simple (e.g., ">ENSP00000488314.1") or complex.
Complex headers carry important but variable information.
In the case of the example sequences above (coming from ENSEMBL), the header record is composed of:
Field 01: ENSP00000488314.1 <=Protein ID
Field 02: pep <=Peptide record
Field 03: chromosome:GRCh38:X:143884071:143885255:1 <=Chromosome and chromosomal coordinates
Field 04: gene:ENSG00000276380.2 <=Gene ID
Field 05: transcript:ENST00000618570.1 <=Transcript ID
Field 06: gene_biotype:polymorphic_pseudogene <=Gene Biotype
Field 07: transcript_biotype:polymorphic_pseudogene <=Transcript Biotype
Field 08: gene_symbol:UBE2NL <=Gene Symbol
Up to here the fields are all neatly separated by spaces; then comes Field 09, which is variable:
Field 09: description:ubiquitin conjugating enzyme E2 N like (gene/pseudogene)
Field 10: [Source:HGNC Symbol;Acc:HGNC:31710] <=Predictable
Long headers are often not well received by other bioinformatics applications, so it is necessary to shorten them, ideally in a smart way. Therefore, using AWK and the example sequences below, I would like to:
First: Control the printing of the header records as follows:
Always retain the first field:
>ENSP00000488314.1
But then be able to omit and/or include other fields. Examples:
>ENSP00000488314.1 gene:ENSG00000276380.2 transcript:ENST00000618570.1
Field: 01 04 05
>ENSP00000488314.1 pep chromosome:GRCh38:X:143884071:143885255:1 [Source:HGNC Symbol;Acc:HGNC:31710]
Field: 01 02 03 10
For simplicity, ignoring Field 09 entirely would be acceptable, but being able to use Field 10 would be nice.
Then be able to "Fold" the sequence to a user specified number. For Example the records having sequence folded every 60 characters:
>ENSP00000441696.1 pep chromosome:GRCh38:14:21868839:21869365:1 gene:ENSG00000211788.2 transcript:ENST00000390436.2 gene_biotype:TR_V_gene transcript_biotype:TR_V_gene gene_symbol:TRAV13-1 description:T cell receptor alpha variable 13-1 [Source:HGNC Symbol;Acc:HGNC:12108]
MTSIRAVFIFLWLQLDLVNGENVEQHPSTLSVQEGDSAVIKCTYSDSASNYFPWYKQELG
KGPQLIIDIRSNVGEKKDQRIAVTLNKTAKHFSLHITETQPEDSAVYFCAAS
>ENSP00000488314.1 pep chromosome:GRCh38:X:143884071:143885255:1 gene:ENSG00000276380.2 transcript:ENST00000618570.1 gene_biotype:polymorphic_pseudogene transcript_biotype:polymorphic_pseudogene gene_symbol:UBE2NL description:ubiquitin conjugating enzyme E2 N like (gene/pseudogene) [Source:HGNC Symbol;Acc:HGNC:31710]
MAELPHRIIKETQRLLAEPVPGIKAEPDESNARYFHVVIAGESKDSPFEGGTFKRELLLA
EEYPMAAPKVRFMTKIYHPNVDKLERIS*DILKDKWSPALQIRTVLLSIQALLNAPNPDD
PLANDVVEQWKTNEAQAIETARAWTRLYAMNSI
>ENSP00000437680.2 pep chromosome:GRCh38:22:42140203:42141924:-1 gene:ENSG00000205702.11 transcript:ENST00000435101.1 gene_biotype:polymorphic_pseudogene transcript_biotype:nonsense_mediated_decay gene_symbol:CYP2D7 description:cytochrome P450 family 2 subfamily D member 7 (gene/pseudogene) [Source:HGNC Symbol;Acc:HGNC:2624]
DPAQPPRDLTEAFLAKKEKAKGSPESSFNDENLRIVSVSNRRSTT
Could become (sequence folded every 120 characters):
>ENSP00000441696.1 gene:ENSG00000211788.2 transcript:ENST00000390436.2
MTSIRAVFIFLWLQLDLVNGENVEQHPSTLSVQEGDSAVIKCTYSDSASNYFPWYKQELGKGPQLIIDIRSNVGEKKDQRIAVTLNKTAKHFSLHITETQPEDSAVYFCAAS
>ENSP00000488314.1 gene:ENSG00000276380.2 transcript:ENST00000618570.1
MAELPHRIIKETQRLLAEPVPGIKAEPDESNARYFHVVIAGESKDSPFEGGTFKRELLLAEEYPMAAPKVRFMTKIYHPNVDKLERIS*DILKDKWSPALQIRTVLLSIQALLNAPNPDD
PLANDVVEQWKTNEAQAIETARAWTRLYAMNSI
>ENSP00000437680.2 gene:ENSG00000205702.11 transcript:ENST00000435101.1
DPAQPPRDLTEAFLAKKEKAKGSPESSFNDENLRIVSVSNRRSTT
So far, the best I have been able to do is call a script containing the following code:
awk -v w=60 -f script.awk fasta_file.fa
#!/usr/bin/env gawk
## Script.awk
/^>/ {
    if (seq != "") print seq
    print $1, $4, $5
    seq = ""
    next
}
{
    seq = seq $1
    while (length(seq) > w) {
        print substr(seq, 1, w)
        seq = substr(seq, 1 + w)
    }
}
END { if (seq != "") print seq }
The problem with the code above is that the fields $1, $4, and $5 are hard coded.
An elegant solution to a similar problem was proposed by
Ed Morton, but it requires me to understand the \s/\S gawk extensions and AWK arrays, which is something I am struggling to do.
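(For reference, \s and \S are GNU awk shorthand for [[:space:]] and [^[:space:]], i.e. "any whitespace character" and "any non-whitespace character", and an awk array is simply a map from string indices to values. A minimal illustration, assuming GNU awk; note that for (k in n) traversal order is unspecified, so the last two output lines may appear in either order:)
$ printf 'a  b\tc\n' | gawk '{ gsub(/\s+/, "_"); print }'
a_b_c
$ gawk 'BEGIN { n["x"]++; n["x"]++; n["y"]++; for (k in n) print k, n[k] }'
x 2
y 1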
Any ideas on how to improve the code above using AWK (not Perl/Python) would be greatly appreciated.

This shows not only how to do what you want with awk but also how to structure a shell script properly to call awk after parsing arguments (which you can't do if you invoke awk via a shebang, so don't do that). It uses GNU awk for gensub() and the 3rd arg to match():
$ cat tst.sh
#!/usr/bin/env bash

while getopts ":w:f:" opt; do
    case "$opt" in
        w) wid=${OPTARG} ;;
        f) flds=${OPTARG} ;;
        *) printf 'bad argument "%s"\n' "$opt" >&2
           exit 1 ;;
    esac
done
shift "$((OPTIND-1))"

awk -v wid="$wid" -v flds="$flds" '
    BEGIN {
        wid  = (wid  ? wid  : 120)
        flds = (flds ? flds : "protein gene transcript")
        numTags = split(flds,tags)
    }
    sub(/^>/,"") {
        if (NR > 1) {
            prt()
        }
        match($0,/(description:.*\S)\s+\[([^]]+)/,a)
        $0 = substr($0,1,RSTART-1)
        f["description"] = a[1]
        f["predictable"] = a[2]
        f["protein"] = $1
        f["peptide"] = $2
        for (i=3; i<=NF; i++) {
            tag = gensub(/:.*/,"",1,$i)
            f[tag] = $i
        }
        next
    }
    { f["sequence"] = f["sequence"] $0 }
    END { prt() }

    function prt(   tagNr, tag) {
        printf ">"
        for (tagNr=1; tagNr<=numTags; tagNr++) {
            tag = tags[tagNr]
            printf "%s%s", f[tag], (tagNr<numTags ? OFS : ORS)
        }
        print gensub(".{"wid"}","&"RS,"g",f["sequence"])
        delete f
    }
' "${@:--}"
$ ./tst.sh file
>ENSP00000441696.1 gene:ENSG00000211788.2 transcript:ENST00000390436.2
MTSIRAVFIFLWLQLDLVNGENVEQHPSTLSVQEGDSAVIKCTYSDSASNYFPWYKQELGKGPQLIIDIRSNVGEKKDQRIAVTLNKTAKHFSLHITETQPEDSAVYFCAAS
>ENSP00000488314.1 gene:ENSG00000276380.2 transcript:ENST00000618570.1
MAELPHRIIKETQRLLAEPVPGIKAEPDESNARYFHVVIAGESKDSPFEGGTFKRELLLAEEYPMAAPKVRFMTKIYHPNVDKLERIS*DILKDKWSPALQIRTVLLSIQALLNAPNPDD
PLANDVVEQWKTNEAQAIETARAWTRLYAMNSI
>ENSP00000437680.2 gene:ENSG00000205702.11 transcript:ENST00000435101.1
DPAQPPRDLTEAFLAKKEKAKGSPESSFNDENLRIVSVSNRRSTT
$ ./tst.sh -w 60 -f 'gene_symbol chromosome' file
>gene_symbol:TRAV13-1 chromosome:GRCh38:14:21868839:21869365:1
MTSIRAVFIFLWLQLDLVNGENVEQHPSTLSVQEGDSAVIKCTYSDSASNYFPWYKQELG
KGPQLIIDIRSNVGEKKDQRIAVTLNKTAKHFSLHITETQPEDSAVYFCAAS
>gene_symbol:UBE2NL chromosome:GRCh38:X:143884071:143885255:1
MAELPHRIIKETQRLLAEPVPGIKAEPDESNARYFHVVIAGESKDSPFEGGTFKRELLLA
EEYPMAAPKVRFMTKIYHPNVDKLERIS*DILKDKWSPALQIRTVLLSIQALLNAPNPDD
PLANDVVEQWKTNEAQAIETARAWTRLYAMNSI
>gene_symbol:CYP2D7 chromosome:GRCh38:22:42140203:42141924:-1
DPAQPPRDLTEAFLAKKEKAKGSPESSFNDENLRIVSVSNRRSTT
$ ./tst.sh -w 10000 -f 'description' file
>description:T cell receptor alpha variable 13-1
MTSIRAVFIFLWLQLDLVNGENVEQHPSTLSVQEGDSAVIKCTYSDSASNYFPWYKQELGKGPQLIIDIRSNVGEKKDQRIAVTLNKTAKHFSLHITETQPEDSAVYFCAAS
>description:ubiquitin conjugating enzyme E2 N like (gene/pseudogene)
MAELPHRIIKETQRLLAEPVPGIKAEPDESNARYFHVVIAGESKDSPFEGGTFKRELLLAEEYPMAAPKVRFMTKIYHPNVDKLERIS*DILKDKWSPALQIRTVLLSIQALLNAPNPDDPLANDVVEQWKTNEAQAIETARAWTRLYAMNSI
>description:cytochrome P450 family 2 subfamily D member 7 (gene/pseudogene)
DPAQPPRDLTEAFLAKKEKAKGSPESSFNDENLRIVSVSNRRSTT
$ ./tst.sh -w 10000 -f 'predictable' file
>Source:HGNC Symbol;Acc:HGNC:12108
MTSIRAVFIFLWLQLDLVNGENVEQHPSTLSVQEGDSAVIKCTYSDSASNYFPWYKQELGKGPQLIIDIRSNVGEKKDQRIAVTLNKTAKHFSLHITETQPEDSAVYFCAAS
>Source:HGNC Symbol;Acc:HGNC:31710
MAELPHRIIKETQRLLAEPVPGIKAEPDESNARYFHVVIAGESKDSPFEGGTFKRELLLAEEYPMAAPKVRFMTKIYHPNVDKLERIS*DILKDKWSPALQIRTVLLSIQALLNAPNPDDPLANDVVEQWKTNEAQAIETARAWTRLYAMNSI
>Source:HGNC Symbol;Acc:HGNC:2624
DPAQPPRDLTEAFLAKKEKAKGSPESSFNDENLRIVSVSNRRSTT
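The folding step in prt() works because gensub() replaces every run of wid characters with the matched text (&) followed by a newline (RS). The same idiom in isolation, with a made-up width and string (a minimal sketch, GNU awk only):
$ gawk 'BEGIN { w=10; s="MAELPHRIIKETQRLLAEPVPGIK"; print gensub(".{" w "}", "&" RS, "g", s) }'
MAELPHRIIK
ETQRLLAEPV
PGIK
One quirk to be aware of: if the sequence length is an exact multiple of the width, this idiom appends a newline after the final chunk, so the output gains a trailing blank line.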

Related

Separate lines with keys and store in different files

How do I get the entire line for each hexadecimal key when the line is a DEBUG line in a text file, and then store the lines in different files, where the key is in this format: "[ uid key]"?
I.e., ignore any line that is not DEBUG.
in.txt:
[ uid 28fd4583833] DEBUG web.Action
[ uid 39fd5697944] DEBUG test.Action
[ uid 56866969445] DEBUG test2.Action
[ uid 76696944556] INFO test4.Action
[ uid 39fd5697944] DEBUG test7.Action
[ uid 85483e10256] DEBUG testing.Action
The output files are named "out" + i + ".txt", where i = 1, 2, 3, 4.
I.e.:
out1.txt:
[ uid 28fd4583833] DEBUG web.Action
out2.txt:
[ uid 39fd5697944] DEBUG test.Action
[ uid 39fd5697944] DEBUG test7.Action
out3.txt:
[ uid 56866969445] DEBUG test2.Action
out4.txt:
[ uid 85483e10256] DEBUG testing.Action
I tried:
awk 'match($0, /uid ([^]]+)/, a) && /DEBUG/ {print > (a[1] ".txt")}' in.txt
If you are willing to change the output file names to include the keys (frankly, this seems more useful than a one-up counter in the names), you can do:
awk '/DEBUG/{print > ("out-" $3 ".txt")}' FS='[][ ]*' in.txt
This will put all lines that match the string DEBUG with key 85483e10256 into the file out-85483e10256.txt, etc.
If you do want to keep the one-up counter, you could do:
awk '/DEBUG/{if( ! a[$3] ) a[$3] = ++counter;
             print > ("out" a[$3] ".txt")}' FS='[][ ]*' in.txt
Basically, the idea is to use the regex [][ ]* as the field separator, which matches any run of square brackets and spaces. This way, $1 is the text preceding the initial [, $2 is the string uid, and $3 is the key. This will (should!) correctly get the key even for lines with slightly different whitespace. An associative array remembers which keys have already been seen, so the counter is only incremented for new keys. But it really is cleaner just to use the key in the output file name.
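You can see the effect of that field separator with a quick check (any POSIX awk):
$ echo '[ uid 28fd4583833] DEBUG web.Action' | awk -F'[][ ]*' '{ print "2=" $2, "3=" $3, "4=" $4 }'
2=uid 3=28fd4583833 4=DEBUG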
Using GNU sort for -s (to guarantee retaining input line order for every key value) and any awk:
$ sort -sk3,3 in.txt |
awk '$4!="DEBUG"{next} $3!=prev{close(out); out="out"(++i)".txt"; prev=$3} {print > out}'
$ head out*.txt
==> out1.txt <==
[ uid 28fd4583833] DEBUG web.Action
==> out2.txt <==
[ uid 39fd5697944] DEBUG test.Action
[ uid 39fd5697944] DEBUG test7.Action
==> out3.txt <==
[ uid 56866969445] DEBUG test2.Action
==> out4.txt <==
[ uid 85483e10256] DEBUG testing.Action
If you don't have GNU sort then you can apply the DSU (Decorate/Sort/Undecorate) idiom using any sort:
$ awk -v OFS='\t' '{print NR, $0}' in.txt | sort -k4,4 -k1,1n | cut -f2- |
awk '$4!="DEBUG"{next} $3!=prev{close(out); out="out"(++i)".txt"; prev=$3} {print > out}'
Note that with the above, only sort has to handle all of the input in memory, and it is designed to use demand paging etc. to handle extremely large amounts of input. awk processes one line at a time, keeps almost nothing in memory, and has only one output file open at a time, so this approach is far more likely to succeed for large files than one that stores a lot in memory in awk or keeps many output files open concurrently.
If your file format is as consistent as you show, you can just do:
awk '
$4!="DEBUG" { next }
!f[$3] { f[$3]=++i }
{ print > ("out" f[$3] ".txt") }
' in.txt
1st solution: Using GNU awk, try the following single awk program, which uses GNU awk's PROCINFO["sorted_in"] feature.
awk '
BEGIN{
    PROCINFO["sorted_in"] = "@ind_num_asc"
}
!/DEBUG/{ next }
match($0,/uid [a-zA-Z0-9]+/){
    ind = substr($0,RSTART,RLENGTH)
    arr[ind] = (arr[ind] ? arr[ind] ORS : "") $0
}
END{
    for(i in arr){
        outputFile = ("out" ++count ".txt")
        print arr[i] > (outputFile)
        close(outputFile)
    }
}
' Input_file
2nd solution: With any awk, please try the following solution for your shown samples. Change Input_file to your actual file's name. GNU sort is used here with option -s to maintain the input order while sorting values.
awk '
!/DEBUG/{ next }
match($0,/uid [0-9a-zA-Z]+/){
    print substr($0,RSTART,RLENGTH) ";" $0
}' Input_file |
sort -sk2n |
cut -d';' -f2- |
awk '
match($0,/uid [0-9a-zA-Z]+/){
    if (prev != substr($0,RSTART,RLENGTH)) {
        count++
        close(outputFile)
    }
    outputFile = "out" count ".txt"
    print > (outputFile)
    prev = substr($0,RSTART,RLENGTH)
}
'
1st solution's Explanation: Adding detailed explanation for 1st solution:
awk ' ##Starting awk program from here.
BEGIN{ ##Starting BEGIN section from here.
PROCINFO["sorted_in"] = "#ind_num_asc" ##Setting PROCINFO["sorted_in"] to #ind_num_asc to sort any array with index.
}
!/DEBUG/{ next } ##If a line does not contain DEBUG then jump to next line.
match($0,/uid [a-zA-Z0-9]+/){ ##using match function to match uid space and alphanumeric values here.
ind=substr($0,RSTART,RLENGTH) ##Creating ind which contains sub string of matched sub string in match function.
arr[ind]=(arr[ind]?arr[ind] ORS:"") $0 ##Creating array arr with index of ind and keep adding current line value to same index.
}
END{ ##Starting END block of this program from here.
for(i in arr){ ##Traversing through array arr here.
outputFile=("out"++count".txt") ##Creating output file name here as per OP requirement.
print arr[i] > (outputFile) ##printing current array element into outputFile variable.
close(outputFile) ##Closing output file in backend to avoid too many files opened error.
}
}
' Input_file ##Mentioning Input_file name here.
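If PROCINFO["sorted_in"] is unfamiliar, here is the feature in isolation (a minimal sketch with made-up data; GNU awk only):
$ gawk 'BEGIN {
    PROCINFO["sorted_in"] = "@ind_str_asc"   # traverse indices in ascending string order
    arr["b"] = 2; arr["c"] = 3; arr["a"] = 1
    for (i in arr) print i, arr[i]
}'
a 1
b 2
c 3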
A relatively portable awk-based solution with these highlights:
output rows do not truncate the leading double space
output filenames adhere to stable input row order without the need to pre-sort rows, post-sort rows, or use GNU gawk-specific features
tested and confirmed working on
gawk 5.1.1, including -ce flag,
mawk 1.3.4,
mawk 1.9.9.6, and
macOS nawk 20200816
————————————————————————————————
# gawk profile, created Thu May 19 12:10:56 2022
awk '
BEGIN {
    ____ = "test_72297811_"                # opt. filename prefix
    OFS = FS = "^ [[] uid "
    _+=_ = gsub("\\^|[[][]]", _, OFS)
    _*= _--
} NF *= / DEBUG / {
    print >> (__[___ = substr($NF,_~_,_)] ? __[___] :\
              __[___] = ____ "out" length(__) ".txt")
} END {
    for (_ in __) { close(__[_]) }
}' in.txt
————————————————————————————————
==> test_72297811_out1.txt <==
[ uid 28fd4583833] DEBUG web.Action
==> test_72297811_out2.txt <==
[ uid 39fd5697944] DEBUG test.Action
[ uid 39fd5697944] DEBUG test7.Action
==> test_72297811_out3.txt <==
[ uid 56866969445] DEBUG test2.Action
==> test_72297811_out4.txt <==
[ uid 85483e10256] DEBUG testing.Action

How to replace all escape sequences with non-escaped equivalent with unix utilities (sed/tr/awk)

I'm processing a Wireshark config file (dfilter_buttons) for display filters and would like to print out the filter with a given name. The content of the file looks like:
Sample input
"TRUE","test","sip contains \x22Hello, world\x5cx22\x22",""
And the resulting output should have the escape sequences replaced, so I can use them later in my script:
Desired output
sip contains "Hello, world\x22"
My first pass is like this:
Current parser
filter_name=test
awk -v filter_name="$filter_name" 'BEGIN {FS="\",\""} ($2 == filter_name) {print $3}' "$config_file"
And my output is this:
Current output
sip contains \x22Hello, world\x5cx22\x22
I know I can handle these exact two escape sequences by piping to sed and matching those exact two sequences, but is there a generic way to substitute all escape sequences? Future filters I build may use more escape sequences than just " and , and I would like to handle future scenarios.
Using gnu-awk you can do this using the split and strtonum functions:
awk -F '","' -v filt='test' '$2 == filt {n = split($3, subj, /\\x[0-9a-fA-F]{2}/, seps); for (i=1; i<n; ++i) printf "%s%c", subj[i], strtonum("0" substr(seps[i], 2)); print subj[i]}' file
sip contains "Hello, world\x22"
A more readable form:
awk -F '","' -v filt='test' '
$2 == filt {
n = split($3, subj, /\\x[0-9a-fA-F]{2}/, seps)
for (i=1; i<n; ++i)
printf "%s%c", subj[i], strtonum("0" substr(seps[i], 2))
print subj[i]
}' file
Explanation:
Using -F '","' we split the input on the delimiter ","
$2 == filt filters the input on the condition $2 == "test"
Using /\\x[0-9a-fA-F]{2}/ as the regex (which matches \x followed by two hex digits) we split $3, saving the split tokens in array subj and the matched separators in array seps
Using substr we remove the first char (i.e. \) and prepend 0
Using strtonum we convert the hex string to the equivalent number
Using %c in printf we print the corresponding ASCII character
The for loop and the final print join $3 back together from the subj elements and the converted seps elements
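The key ingredient is the optional fourth argument to split() (a GNU awk extension), which collects the text of each matched separator. A minimal sketch with a made-up string:
$ gawk 'BEGIN {
    n = split("foo\\x22bar\\x5cbaz", subj, /\\x[0-9a-fA-F]{2}/, seps)
    for (i=1; i<n; ++i) print subj[i], seps[i]
    print subj[n]
}'
foo \x22
bar \x5c
baz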
Using GNU awk for FPAT, gensub(), strtonum(), and the 3rd arg to match():
$ cat tst.awk
BEGIN { FPAT="([^,]*)|(\"[^\"]*\")"; OFS="," }
$2 == ("\"" filter_name "\"") {
    gsub(/^"|"$/,"",$3)
    while ( match($3,/(\\x[0-9a-fA-F]{2})(.*)/,a) ) {
        printf "%s%c", substr($3,1,RSTART-1), strtonum(gensub(/./,0,1,a[1]))
        $3 = a[2]
    }
    print $3
}
$ awk -v filter_name='test' -f tst.awk file
sip contains "Hello, world\x22"
The above assumes your escape sequences are always \x followed by exactly 2 hex digits. It isolates every \xHH string in the input, replaces \ with 0 in that string so that strtonum() can then convert the string to a number, then uses %c in the printf formatting string to convert that number to a character.
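The conversion step can be sanity-checked on its own (GNU awk's strtonum()):
$ gawk 'BEGIN { printf "<%c>\n", strtonum("0x22") }'
<">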
Note that GNU awk has a debugger (see https://www.gnu.org/software/gawk/manual/gawk.html#Debugger), so if you're ever not sure what any part of a program does, you can just run it in the debugger (-D) and trace it. For example, in the following I plant a breakpoint to tell awk to stop at line 1 of the script (b 1), then start running (r) and then step (s) through the script, printing the value of $3 (p $3) at each line so I can see how it changes after the gsub():
$ awk -D -v filter_name='test' -f tst.awk file
gawk> b 1
Breakpoint 1 set at file `tst.awk', line 1
gawk> r
Starting program:
Stopping in BEGIN ...
Breakpoint 1, main() at `tst.awk':1
1 BEGIN { FPAT="([^,]*)|(\"[^\"]*\")"; OFS="," }
gawk> p $3
$3 = uninitialized field
gawk> s
Stopping in Rule ...
2 $2 == "\"" filter_name "\"" {
gawk> p $3
$3 = "\"sip contains \\x22Hello, world\\x5cx22\\x22\""
gawk> s
3 gsub(/^"|"$/,"",$3)
gawk> p $3
$3 = "\"sip contains \\x22Hello, world\\x5cx22\\x22\""
gawk> s
4 while ( match($3,/(\\x[0-9a-fA-F]{2})(.*)/,a) ) {
gawk> p $3
$3 = "sip contains \\x22Hello, world\\x5cx22\\x22"

extract info from a tag using awk

I have a multi-column file and I want to extract some info from column 71.
I want to extract entries by tag, e.g. just AC=* and AF=*, where the value (*) can be anything.
I found a similar question and gave it a try, but it didn't work:
Extract columns with values matching a specific pattern
Column 71 looks like this:
AC=14511;AC_AFR=382;AC_AMR=1177;AC_Adj=14343;AC_EAS=5;AC_FIN=427;AC_Het=11813;AC_Hom=1265;AC_NFE=11027;AC_OTH=97;AC_SAS=1228;AF=0.137;AN=106198;AN_AFR=8190;AN_AMR=10424;AN_Adj=99264;AN_EAS=7068;AN_FIN=6414;AN_NFE=51090;AN_OTH=658;AN_SAS=15420;BaseQRankSum=1.73;ClippingRankSum=-1.460e-01;DB;DP=1268322;FS=0.000;GQ_MEAN=190.24;GQ_STDDEV=319.67;Het_AFR=358;Het_AMR=1049;Het_EAS=5;Het_FIN=399;Het_NFE=8799;Het_OTH=83;Het_SAS=1120;Hom_AFR=12;Hom_AMR=64;Hom_EAS=0;Hom_FIN=14;Hom_NFE=1114;Hom_OTH=7;Hom_SAS=54;InbreedingCoeff=0.0478;MQ=60.00;MQ0=0;MQRankSum=0.037;NCC=270;POSITIVE_TRAIN_SITE;QD=21.41;ReadPosRankSum=0.212;VQSLOD=4.79;culprit=MQ;DP_HIST=30|3209|1539|1494|30007|7938|4130|2038|1310|612|334|185|97|60|31|25|9|11|7|33,0|66|339|1048|2096|2665|2626|1832|1210|584|323|179|89|54|31|22|7|9|4|15;GQ_HIST=84|66|56|82|3299|568|617|403|250|319|436|310|28566|2937|827|834|451|186|217|12591,15|15|13|16|25|11|22|28|18|38|52|31|65|76|39|83|93|65|97|12397;CSQ=T|ENSG00000186868|ENST00000334239|Transcript|intron_variant||||||rs754512|1||1|MAPT|HGNC|6893|protein_coding||CCDS11502.1|ENSP00000334886|TAU_HUMAN|B4DSE3_HUMAN|UPI0000000C16||||2/8||ENST00000334239.8:c.134-94A>T||T:0.1171|||||||||15792962||||||||,T|ENSG00000186868|ENST00000570299|Transcript|intron_variant&non_coding_transcript_variant||||||rs754512|1||1|MAPT|HGNC|6893|processed_transcript||||||||||2/6||ENST00000570299.1:n.262-94A>T||T:0.1171|||||||||15792962||||||||,T|ENSG00000186868|ENST00000340799|Transcript|intron_variant||||||rs754512|1||1|MAPT|HGNC|6893|protein_coding||CCDS45716.1|ENSP00000340438|TAU_HUMAN||UPI000004EEE6||||3/10||ENST00000340799.5:c.221-94A>T||T:0.1171|||||||||15792962||||||||,T|ENSG00000186868|ENST00000262410|Transcript|intron_variant||||||rs754512|1||1|MAPT|HGNC|6893|protein_coding||CCDS11501.1|ENSP00000262410|TAU_HUMAN||UPI0000EE80B7||||4/13||ENST00000262410.5:c.308-94A>T||T:0.1171|||||||||15792962||||||||,T|ENSG00000186868|ENST00000446361|Transcript|intron_variant||||||rs754512|1||1|MAPT|HGNC|6893|protein_coding||CCDS11500.1|ENSP00000408975|TAU_HUMAN||UPI000004EEE5||||2/9||ENST00000446361.3:c.134-94A>T||T:0.1171|||||||||15792962||||||||,T|ENSG00000186868|ENST00000574436|Transcript|intron_variant||||||rs754512|1||1|MAPT|HGNC|6893|protein_coding||CCDS11499.1|ENSP00000460965|TAU_HUMAN||UPI000002D754||||3/10||ENST00000574436.1:c.308-94A>T||T:0.1171|||||||||15792962||||||||,T|ENSG00000186868|ENST00000571987|Transcript|intron_variant||||||rs754512|1||1|MAPT|HGNC|6893|protein_coding||CCDS11501.1|ENSP00000458742|TAU_HUMAN||UPI0000EE80B7||||3/12||ENST00000571987.1:c.308-94A>T||T:0.1171|||||||||15792962||||||||,T|ENSG00000186868|ENST00000415613|Transcript|intron_variant||||||rs754512|1||1|MAPT|HGNC|6893|protein_coding||CCDS45715.1|ENSP00000410838|TAU_HUMAN||UPI0001AE66E9||||3/13||ENST00000415613.2:c.308-94A>T||T:0.1171|||||||||15792962||||||||,T|ENSG00000186868|ENST00000571311|Transcript|intron_variant&NMD_transcript_variant||||||rs754512|1||1|MAPT|HGNC|6893|nonsense_mediated_decay|||ENSP00000460048||I3L2Z2_HUMAN|UPI00025A2E6E||||4/4||ENST00000571311.1:c.*176-94A>T||T:0.1171|||||||||15792962||||||||,T|ENSG00000186868|ENST00000535772|Transcript|intron_variant||||||rs754512|1||1|MAPT|HGNC|6893|protein_coding||CCDS56033.1|ENSP00000443028|TAU_HUMAN|B4DSE3_HUMAN|UPI000004EEE4||||4/10||ENST00000535772.1:c.308-94A>T||T:0.1171|||||||||15792962||||||||,T|ENSG00000186868|ENST00000576518|Transcript|stop_gained|5499|7|3|K/*|Aag/Tag|rs754512|1||1|MAPT|HGNC|6893|protein_coding|||ENSP00000458621||I3L170_HUMAN&B4DSE3_HUMAN|UPI0001639A7C|||1/7|||ENST00000
576518.1:c.7A>T|ENSP00000458621.1:p.Lys3Ter|T:0.1171|||||||||15792962|||||POSITION:0.00682261208576998&ANN_ORF:-255.6993&MAX_ORF:-255.6993|PHYLOCSF_WEAK|ANC_ALLELE|LC,T|ENSG00000186868|ENST00000420682|Transcript|intron_variant||||||rs754512|1||1|MAPT|HGNC|6893|protein_coding||CCDS45716.1|ENSP00000413056|TAU_HUMAN||UPI000004EEE6||||2/9||ENST00000420682.2:c.221-94A>T||T:0.1171|||||||||15792962||||||||,T|ENSG00000186868|ENST00000572440|Transcript|non_coding_transcript_exon_variant&non_coding_transcript_variant|2790|||||rs754512|1||1|MAPT|HGNC|6893|retained_intron|||||||||1/1|||ENST00000572440.1:n.2790A>T||T:0.1171|||||||||15792962||||||||,T|ENSG00000186868|ENST00000351559|Transcript|intron_variant||||||rs754512|1||1|MAPT|HGNC|6893|protein_coding||CCDS11499.1|ENSP00000303214|TAU_HUMAN||UPI000002D754||||4/11||ENST00000351559.5:c.308-94A>T||T:0.1171|||||||||15792962||||||||,T|ENSG00000186868|ENST00000344290|Transcript|intron_variant||||||rs754512|1||1|MAPT|HGNC|6893|protein_coding|YES|CCDS45715.1|ENSP00000340820|TAU_HUMAN||UPI0001AE66E9||||4/14||ENST00000344290.5:c.308-94A>T||T:0.1171|||||||||15792962||||||||,T|ENSG00000186868|ENST00000347967|Transcript|intron_variant||||||rs754512|1||1|MAPT|HGNC|6893|protein_coding|||ENSP00000302706|TAU_HUMAN|B4DSE3_HUMAN|UPI0000173D91||||4/10||ENST00000347967.5:c.32-100A>T||T:0.1171|||||||||15792962||||||||,T|ENSG00000186868|ENST00000431008|Transcript|intron_variant||||||rs754512|1||1|MAPT|HGNC|6893|protein_coding||CCDS56033.1|ENSP00000389250|TAU_HUMAN|B4DSE3_HUMAN|UPI000004EEE4||||3/9||ENST00000431008.3:c.308-94A>T||T:0.1171|||||||||15792962||||||||
The code that I tried:
awk '{
    for (i = 1; i <= NF; i++) {
        if ($i ~ /AC|AF/) {
            printf "%s %s ", $i, $(i + 1)
        }
    }
    print ""
}'
I keep getting a syntax error.
Output wanted:
AC=14511;AF=0.137
Whenever you have name=value pairs, it's usually simplest to first create an array that maps names to values (n2v[] below) and then you can just access the values by their names.
$ cat file
AC=1;AC_AFR=2;AF=3 AC=4;AC_AFR=5;AF=6
$ cat tst.awk
{
    delete n2v
    split($2,tmp,/[;=]/)
    for (i=1; i in tmp; i+=2) {
        n2v[tmp[i]] = tmp[i+1]
    }
    prt("AC")
    prt("AF")
}
function prt(name) { print name, "=", n2v[name] }
$ awk -f tst.awk file
AC = 4
AF = 6
Just change $2 to $71 for your real input.
Something like this should do it (GNU awk, due to switch):
$ awk '{split($71,a,";");for(i in a )if(a[i]~/^AF/) print a[i]}' foo
AF=0.137
You split the field $71 on ;, then loop through the resulting array looking for the desired match. For multiple matches, use switch:
$ awk '{
    split($0,a,";")
    for (i in a)
        switch (a[i]) {
        case /^AF=/:
            b = b a[i] OFS
            break
        case /^AC=/:
            b = b a[i] OFS
            break
        }
    sub(/.$/,"\n",b)
    printf b
}' foo
AC=14511 AF=0.137
EDIT: Now it buffers the output in a variable and prints it at the end. You can control the separator with OFS.

array over non-existing indices in awk

Sorry for the verbose question; it boils down to a very simple problem.
Assume there are n text files, each containing one column of strings (denoting groups) and one of integers (denoting the values of instances within these groups):
# filename xxyz.log
a 5
a 6
b 10
b 15
c 101
c 100
#filename xyzz.log
a 3
a 5
c 116
c 128
Note that while the lengths of the two columns within any given file are always identical, they differ between files. Furthermore, not all files contain the same range of groups (the first one contains groups a, b, c, while the second one only contains groups a and c). In awk one could calculate the average of column 2 for each string in column 1 within each file separately and output the results with the following code:
NAMES=$(ls | grep .log | awk -F'.' '{print $1}')
for q in $NAMES; do
    gawk -F' ' -v y=$q 'BEGIN {print "param", y}
        {sum1[$1] += $2; N[$1]++}
        END {for (key in sum1) {
            avg1 = sum1[key] / N[key];
            printf "%s %f\n", key, avg1;
        }}' $q.log | sort > $q.mean
done
However, for the above-mentioned reasons, the length of the resulting .mean files differs between files. For each .log file I'd like to output a .mean file listing the entire range of groups (a-d) in the first column and, in the second column, either the corresponding mean value or empty space, depending on whether the group is present in the .log file. I've tried the following code (given without $NAMES for brevity):
awk 'BEGIN {arr[a]="a"; arr[b]="b"; arr[c]="c"; arr[d]="d"}
    {sum[$1] += $2; N[$1]++}
    END {for (i in arr) {
        if (i in sum) {
            avg = sum[i] / N[i];
            printf "%s %f\n" i, avg;
        } else {
            printf "%s %s\n" i, "";
        }
    }}' xxyz.log > xxyz.mean;
but it returns the following error:
awk: (FILENAME=myfile FNR=7) fatal: not enough arguments to satisfy format string
`%s %s
'
^ ran out for this one
Any suggestions would be highly appreciated.
Will you ever have explicit zeroes or negative numbers in the log files? I'm going to assume not.
The first line of your second script doesn't do what you wanted:
awk 'BEGIN{arr[a]="a"; arr[b]="b"; arr[c]="c"; arr[d]="d"}
This assigns "a" to arr[0] (because a is a variable not previously used), then "b" to the same element (because b is a variable not previously used), then "c", then "d". Clearly, not what you had in mind. This (untested) code should do the job you need as long as you know that there are just the four groups. If you don't know the groups a priori, you need a more complex program (it can be done, but it is harder).
awk 'BEGIN { sum["a"] = 0; sum["b"] = 0; sum["c"] = 0; sum["d"] = 0 }
{ sum[$1] += $2; N[$1]++ }
END { for (i in sum) {
if (N[i] == 0) N[i] = 1 # Divide by zero protection
avg = sum[i] / N[i];
printf "%s %f\n" i, avg;
}
}' xxyz.log > xxyz.mean;
This will print a zero average for the missing groups. If you prefer, you can do:
awk 'BEGIN { sum["a"] = 0; sum["b"] = 0; sum["c"] = 0; sum["d"] = 0 }
{ sum[$1] += $2; N[$1]++ }
END { for (i in sum) {
if (N[i] == 0)
printf("%s\n", i;
else {
avg = sum[i] / N[i];
printf "%s %f\n" i, avg;
}
}
}' xxyz.log > xxyz.mean;
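As an aside, the misbehavior of the original BEGIN block is easy to confirm in one line; the array ends up with a single element whose index is the empty string and whose value is the last assignment (a quick check, any awk):
$ awk 'BEGIN { arr[a]="a"; arr[b]="b"; arr[c]="c"; arr[d]="d"
               for (k in arr) print "index=<" k "> value=<" arr[k] ">" }'
index=<> value=<d>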
For each .log file I'd like to output a .mean file listing the entire
range of groups (a-d) in the first column and the corresponding mean
value or empty spaces in the second column depending on whether this
category is present in the .log file.
Not purely an awk solution, but you can get all the groups with this.
awk '{print $1}' *.log | sort -u > groups
After you calculate the means, you can then join the groups file. Let's say the means for your second input file look like this temporary, intermediate file. (I called it xyzz.tmp.)
a 4
c 122
Join the groups, preserving all the values from the groups file.
$ join -a1 groups xyzz.tmp > xyzz.mean
$ cat xyzz.mean
a 4
b
c 122
Here's my take on the problem. Run like:
./script.sh
Contents of script.sh:
array=($(awk '!a[$1]++ { print $1 }' *.log))
readarray -t sorted < <(for i in "${array[@]}"; do echo "$i"; done | sort)

for i in *.log; do
    for j in "${sorted[@]}"; do
        awk -v var="$j" '
            {
                sum[$1] += $2
                cnt[$1]++
            }
            END {
                print var, (var in cnt ? sum[var]/cnt[var] : "")
            }
        ' "$i" >> "${i/.log/.main}"
    done
done
Results of grep . *.main:
xxyz.main:a 5.5
xxyz.main:b 12.5
xxyz.main:c 100.5
xyzz.main:a 4
xyzz.main:b
xyzz.main:c 122
Here is a pure awk answer:
find . -maxdepth 1 -name "*.log" -print0 |
xargs -0 awk '{SUBSEP=" ";sum[FILENAME,$1]+=$2;cnt[FILENAME,$1]+=1;next}
END{for(i in sum)print i, sum[i], cnt[i], sum[i]/cnt[i]}'
Easy enough to push this into a file, too.
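For example (a hypothetical means.awk containing the same program, invoked with awk -f; a sketch, not from the original answer):
$ cat means.awk
{ SUBSEP = " "; sum[FILENAME,$1] += $2; cnt[FILENAME,$1] += 1; next }
END { for (i in sum) print i, sum[i], cnt[i], sum[i]/cnt[i] }
$ find . -maxdepth 1 -name "*.log" -print0 | xargs -0 awk -f means.awk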

How to print a specific duplicate line based on the number of fields

I need to print only one of several consecutive lines that share the same first field, namely the one with the most elements in its last field. That is, the last field is a set of words, and I need to print the line whose last field has the most elements. In case of a tie for the maximum, any of the tied lines is OK.
Example input:
("aborrecimento",[Noun],[Masc],[Reg:Sing],[Bulk])
("aborrecimento",[Noun],[Masc],[Reg:Sing],[Device,Concrete,Count])
("aborrecimento",[Noun],[Masc],[Reg:Sing],[])
("adiamento",[Noun],[Masc],[Reg:Sing],[])
("adiamento",[Noun],[Masc],[Reg:Sing],[Count])
("adiamento",[Noun],[Masc],[Reg:Sing],[VerbNom])
Example output:
("aborrecimento",[Noun],[Masc],[Reg:Sing],[Device,Concrete,Count])
("adiamento",[Noun],[Masc],[Reg:Sing],[VerbNom])
A solution with awk would be nice, but it doesn't need to be a one-liner.
Generate an index file:
$ cat input.txt |
    sed 's/,\[/|[/g' |
    awk -F'|' '
        {if (!gensub(/[[\])]/, "", "g", $NF)) n = 0; else n = split($NF, a, /,/); print NR, $1, n}
    ' |
    sort -k2,2 -k3,3nr |
    awk '$2!=x{x=$2; print $1}' > idx.txt
Content of the index file:
$ cat idx.txt
2
5
Select the lines:
$ awk 'NR==FNR{idx[$0]; next}; (FNR in idx)' idx.txt input.txt
("aborrecimento",[Noun],[Masc],[Reg:Sing],[Device,Concrete,Count])
("adiamento",[Noun],[Masc],[Reg:Sing],[Count])
Note: this assumes there are no spaces in input.txt.
Use [ as the field delimiter, then split the last field on ,:
awk -F '[[]' '
{split($NF, f, /,/)}
length(f) > max[$1] {line[$1] = $0; max[$1] = length(f)}
END {for (l in line) print line[l]}
' filename
Since order is important, an update:
awk -F '[[]' '
{split($NF, f, /,/)}
length(f) > max[$1] {line[$1] = $0; max[$1] = length(f); nr[$1] = NR}
END {for (l in line) printf("%d\t%s\n", nr[l], line[l])}
' filename |
sort -n |
cut -f 2-
Something like this might work:
awk 'BEGIN {FS="["}
Ff != gensub("^([^,]+).*","\\1","g",$0) { Ff = gensub("^([^,]+).*","\\1","g",$0) ; Lf = $NF ; if (length(Ml) > 0) { print Ml } }
Ff == gensub("^([^,]+).*","\\1","g",$0) { if (length($NF) > length(Lf)) { Lf=$NF ; Ml=$0 } }
END {if (length(Ml) > 0) { print Ml } }' INPUTFILE
BUT it's not the solution you want to use, as this is rather a hack. And it fails if by "longer" you meant that the last field contains more comma-separated elements rather than more characters. (E.g. the above script happily reports [KABLAMMMMMMMMMMM!] as longer than [A,B,C].)
This might work for you:
sort -r file | sort -t, -k1,1 -u