myfile looks like
Split level: train 67.0%
importance 0.17
Score metric (accuracy_score): 0.986
Score metric (precision_score): 0.903
I want to extract the accuracy score (here it is 0.986) by awk:
$ awk '/Score metric \(accuracy_score\):/ { match(/[0-9]+\.[0-9]+ *$/); substr($0, RSTART, RLENGTH) }' myfile
awk: fatal: 1 is invalid as number of arguments for match
What does the error mean here? I don't have 1 in my awk program.
How can I correct my program to make it work?
What is your better solution?
Others have pointed out the problem with the number of arguments to match(); it is documented in the awk manual. The following is quick and easy and avoids match() and substr() altogether: it prints the last field of the line that matches your pattern. LC_ALL=C is used because your pattern and the number are plain ASCII, and awk will typically run faster in that locale.
LC_ALL=C awk '/Score metric \(accuracy_score\):/ { print $NF }' myfile
If you stick to awk, I would do:
awk -F':\\s*' '/Score metric \(accuracy_score\)/{print $2}' file
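In your code you call match(). From the man page:
match(s, r [, a]) s is the string, r is the regex, and the optional a is an array.
You gave match() only a single argument, /.../, but it needs at least two: the string and the regex. The 1 in the error message is the number of arguments you passed, not a value from your program. (A bare /.../ used as an expression is evaluated as $0 ~ /.../, which yields 1 or 0, but that is not where this 1 comes from.)
Note also that substr() only returns the substring; you still have to print it.
*By awk above I mean GNU awk.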
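For completeness, here is a minimal sketch of the original match()/substr() approach with both problems addressed: match() gets the string as its first argument, and the substring is actually printed. The sample file is recreated from the question; everything else is the asker's own pattern.

```shell
# Recreate the sample input from the question.
cat > myfile <<'EOF'
Split level: train 67.0%
importance 0.17
Score metric (accuracy_score): 0.986
Score metric (precision_score): 0.903
EOF

# match() needs the string ($0) and the regex; substr() must be printed.
accuracy=$(awk '/Score metric \(accuracy_score\):/ {
    if (match($0, /[0-9]+\.[0-9]+ *$/))
        print substr($0, RSTART, RLENGTH)
}' myfile)
echo "$accuracy"
```

This should work in any POSIX awk, since it uses only the two-argument form of match().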
Works here: awk.js.org/
but not in openwrt's awk, which returns the error message:
awk: bad regex '^(server=|address=)[': Missing ']'
Hello everyone!
I'm trying to use an awk command I wrote which is:
'!/^(server=|address=)[/][[:alnum:]][[:alnum:]-.]+([/]|[/]#)$|^#|^\s*$/ {count++}; END {print count+0}'
Which counts invalid lines in a dns blocklist (oisd in this case):
Input would be eg:
server=/0--foodwarez.da.ru/anyaddress.1.1.1
serverspellerror=/0-000.store/
server=/0-24bpautomentes.hu/
server=/0-29.com/
server=/0-day.us/
server=/0.0.0remote.cryptopool.eu/
server=/0.0mail6.xmrminingpro.com/
server=/0.0xun.cryptopool.space/
Output for this should be "2" since there are two lines that don't match the criteria (correctly formed address, comments, or blank lines).
I've tried formatting the command every which way with [], but can't find anything that works. Does anyone have an idea what format/syntax/option needs adjusting?
Thanks!
To portably include - in a bracket expression it has to be the first or last character, otherwise it means a range, and \s is shorthand for [[:space:]] in only some awks. This will work in any POSIX awk:
$ awk '!/^(server=|address=)[/][[:alnum:]][[:alnum:].-]+([/]|[/]#)$|^#|^[[:space:]]*$/ {count++}; END {print count+0}' file
2
Per @tripleee's comment below, if your awk is broken such that a / inside a bracket expression isn't treated as literal, then you may need this instead:
$ awk '!/^(server=|address=)\/[[:alnum:]][[:alnum:].-]+(\/|\/#)$|^#|^[[:space:]]*$/ {count++}; END {print count+0}' file
2
but get a new awk, e.g. GNU awk, as who knows what other surprises the one you're using may have in store for you!
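As a further fallback, here is a sketch (not taken from the answers) that spells the character classes out as explicit ASCII ranges and escapes every /, so it relies on nothing beyond basic ERE syntax. It assumes the host names are plain ASCII; the file name blocklist.txt and the shortened sample, including the added comment and blank line, are made up for illustration.

```shell
# Shortened sample: two malformed lines, plus a comment and a blank line.
cat > blocklist.txt <<'EOF'
server=/0--foodwarez.da.ru/anyaddress.1.1.1
serverspellerror=/0-000.store/
server=/0-24bpautomentes.hu/
server=/0-29.com/
# a comment

server=/0.0xun.cryptopool.space/
EOF

# The hyphen is last inside the bracket expression, so it is a literal "-".
bad=$(LC_ALL=C awk '!/^(server=|address=)\/[0-9A-Za-z][0-9A-Za-z.-]+(\/|\/#)$|^#|^[ \t]*$/ { count++ }
END { print count + 0 }' blocklist.txt)
echo "$bad"
```

Only the first two lines fail the pattern, so the count printed is 2.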
'!/^(server=|address=)[/][[:alnum:]][[:alnum:]-.]+([/]|[/]#)$|^#|^\s*$/ {count++}; END {print count+0}'
- has a special meaning inside [ and ]: it denotes a range, e.g. [A-Z] means an uppercase ASCII letter. Use the \ escape sequence to make it a literal dash. Let file.txt content be
server=/0--foodwarez.da.ru/anyaddress.1.1.1
serverspellerror=/0-000.store/
server=/0-24bpautomentes.hu/
server=/0-29.com/
server=/0-day.us/
server=/0.0.0remote.cryptopool.eu/
server=/0.0mail6.xmrminingpro.com/
server=/0.0xun.cryptopool.space/
then
awk '!/^(server=|address=)[/][[:alnum:]][[:alnum:]\-.]+([/]|[/]#)$|^#|^\s*$/ {count++}; END {print count+0}' file.txt
gives output
2
You might also consider replacing \s with [[:space:]] in order to maintain consistency.
(tested in GNU Awk 5.0.1)
I am trying to match two different regexps against long strings with awk, looking for each match only within a 35-character window and removing the part of the string that matches.
The problem is that the same code works when I look for the first regexp (which matches at the beginning) but fails to match the second one (at the end of the string).
Input:
Regexp1(1)(2)(3)(4)(5)xxxxxxxxxxxxxxx(20)(21)(22)(23)Regexp2
Desired output
(1)(2)(3)(4)(5)xxxxxxxxxxxxxxx(20)(21)(22)(23)
So far I have used this code, which extracts Regexp1 correctly but, unfortunately, cannot also extract Regexp2, since the RSTART and RLENGTH indexes for Regexp2 are incorrect.
Code for extracting Regexp1 (correct output):
awk -v F="Regexp1" '{if (match(substr($1,1,35),F)) print substr($1,RSTART,RLENGTH)}' file
Code for extracting Regexp2 (wrong output)
awk -v F="Regexp2" '{if (match(substr($1,length($1)-35,35),F)) print substr($1,RSTART,RLENGTH)}' file
Although the indexes for Regexp1 are correct, the indexes for Regexp2 are wrong (RSTART=13). I cannot figure out how to extract the second regexp.
Assuming your actual Input_file looks the same as the samples shown, could you please try the following (a recent awk is preferable, since old versions may not support interval expressions such as {5} in a regex). Note that each parenthesized number is wrapped in a group so that the interval applies to the whole (N) unit.
awk '
match($0,/(\([0-9]+\)){5}.*(\([0-9]+\)){4}/){
print substr($0,RSTART,RLENGTH)
}' Input_file
In case the number of parenthesized values is not fixed, you could do the following:
awk '
match($0,/(\([0-9]+\))+.*(\([0-9]+\))+/){
print substr($0,RSTART,RLENGTH)
}' Input_file
If this isn't all you need:
$ sed 's/Regexp1\(.*\)Regexp2/\1/' file
(1)(2)(3)(4)(5)xxxxxxxxxxxxxxx(20)(21)(22)(23)
or using GNU awk for gensub():
$ awk '{print gensub(/Regexp1(.*)Regexp2/,"\\1",1)}' file
(1)(2)(3)(4)(5)xxxxxxxxxxxxxxx(20)(21)(22)(23)
then edit your question to be far clearer with your requirements and example.
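On the RSTART problem itself: match() reports RSTART relative to the string it was given, so when that string is a substr() window, the window's starting offset has to be added back before indexing into the full field. A minimal sketch, assuming the sample line from the question and a 35-character trailing window (the variable names win and off are made up here):

```shell
line='Regexp1(1)(2)(3)(4)(5)xxxxxxxxxxxxxxx(20)(21)(22)(23)Regexp2'

found=$(printf '%s\n' "$line" | awk -v F="Regexp2" '{
    win = 35
    off = length($1) - win + 1        # 1-based start of the trailing window
    if (off < 1) off = 1              # field shorter than the window
    if (match(substr($1, off), F))    # RSTART is relative to the window
        print substr($1, off + RSTART - 1, RLENGTH)
}')
echo "$found"
```

For the sample line this prints Regexp2; without the off correction, substr() would read from the wrong position in $1.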
I am trying to filter all the singleton from a fasta file.
Here is my input file:
>OTU1;size=3;
ATTCCCCGGGGGGG
>OTU2;size=1;
ATCCGGGACTGATC
>OTU3;size=5;
GAACTATCGGGTAA
>OTU4;size=1;
AATTGGCCATCT
The expected output is:
>OTU1;size=3;
ATTCCCCGGGGGGG
>OTU3;size=5;
GAACTATCGGGTAA
I've tried
awk -F'>' '{if($1>=2) {print $0}' input.fasta > ouput.fasta
but this will remove all the header for each OTU.
Anyone could help me out?
Could you please try the following. It keeps every record whose size is at least 2, i.e. everything that is not a singleton.
awk -F'[=;]' '/^>/{flag=""} $3>=2{flag=1} flag' Input_file
$ awk '/>/{f=/=1;/}!f' file
>OTU1;size=3;
ATTCCCCGGGGGGG
>OTU3;size=5;
GAACTATCGGGTAA
awk -v FS='[;=]' 'prev_sz>=2 && !/size/{print prev RS $0} /size/{prev=$0;prev_sz=$(NF-1)}' file
>OTU1;size=3;
ATTCCCCGGGGGGG
>OTU3;size=5;
GAACTATCGGGTAA
This stores the size from each header line in prev_sz and the whole header line in prev. When a sequence line follows and prev_sz is >= 2, it prints the saved header and the current line. RS (a newline by default) is used to separate the two.
While all the above methods work, they are limited by the fact that the input always has to look the same, i.e. the sequence name in your fasta file needs to have the form:
>NAME;size=value;
A few solutions can handle a bit more extended sequence-names, but none handle the case where things go a bit more generic, i.e.
>NAME;label1=value1;label2=value2;STRING;label3=value3;
Print sequence where label xxx matches value vvv:
awk '/>/{f = /;xxx=vvv;/}f' file.fasta
Print sequence where label xxx has a numeric value p bigger than q:
awk -v label="xxx" -v limit=q \
'BEGIN{ere=";" label "="}
/>/{ f=0; if (match($0,ere)) { value=0+substr($0,RSTART+length(ere)); f=(value>limit) } }
f' file.fasta
In the above, ere is the regular expression we try to match. We use it to find the location of the value attached to label xxx. The substring that starts there has non-numeric characters after the value, but adding 0 converts it to a number and drops everything after the leading numeric part (i.e. 3;label4=value4; is converted to 3). We check whether the value is bigger than our limit and print the sequence based on that result. If the label is missing from a header, the sequence is not printed.
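As a worked instance of the label/limit approach, here is a sketch applied to the singleton question, with label set to size and limit set to 1. The file name otus.fasta is made up; the data is the sample from the question.

```shell
cat > otus.fasta <<'EOF'
>OTU1;size=3;
ATTCCCCGGGGGGG
>OTU2;size=1;
ATCCGGGACTGATC
>OTU3;size=5;
GAACTATCGGGTAA
>OTU4;size=1;
AATTGGCCATCT
EOF

kept=$(awk -v label="size" -v limit=1 '
BEGIN { ere = ";" label "=" }
/^>/  { f = 0                          # new record: decide whether to keep it
        if (match($0, ere))
            f = (0 + substr($0, RSTART + RLENGTH) > limit) }
f' otus.fasta)
printf '%s\n' "$kept"
```

Because the flag f stays set until the next header, this also keeps sequences that are wrapped over several lines.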
I have some text like
CreateMainPageLink("410",$objUserData,$mnt[139]);
from which i want to extract the number 139 after the occurrence of mnt with gawk. I tried the following expression (within a pipe expression to be used on a result of a grep)
gawk '{FS="[\[\]]";print NF}'
to print the number of fields. If my field separators were [ and ] I expect to see the number 3 printed out (three fields; one before the opening rectangular bracket, one after, and the actual number I want to extract). What I get instead is one field, corresponding to the full line, and two warnings:
gawk: warning: escape sequence `\[' treated as plain `['
gawk: warning: escape sequence `\]' treated as plain `]'
I was following the example given here, but obviously there is some problem/error with my expression.
Using either of the following two expressions also does not work:
gawk '{FS="[]"}{print NF;}'
gawk: (FILENAME=- FNR=1) fatal: Unmatched [ or [^: /[]/
and
gawk '{FS="\[\]"}{print NF;}'
gawk: warning: escape sequence `\[' treated as plain `['
gawk: warning: escape sequence `\]' treated as plain `]'
gawk: (FILENAME=- FNR=1) fatal: Unmatched [ or [^: /[]/
$ gawk -F'[][]' '{ print $0" -> "$1"\t"$2; }'
titi[toto]tutu
titi[toto]tutu -> titi toto
1) You must set the FS before entering the main parsing loop. You could do:
awk 'BEGIN { FS="[\\[\\]]"; } { print $0" -> "$1"\t"$2; }'
Which executes the BEGIN clause before parsing the file.
I have to escape the [ character twice: once because it is inside a quoted string, and once more because gawk requires it inside a bracket expression.
I personally prefer the -F flag, which is less verbose.
2) FS="[\[\]]" is wrong because, inside a quoted string, the backslash escapes the character within the string itself. Awk will see [[]], which is not the bracket expression you intended: it parses as [[] followed by a literal ].
3) FS="[]" is wrong because a ] right after the opening [ is taken literally, so the bracket expression is never closed.
4) FS="\[\]" is wrong again because it is errors 2) and 3) together :)
The gawk manual says: The regular expressions in awk are a superset of the POSIX specification. This is why you can use either [\\[\\]] or [][], the latter being the POSIX way.
To include a literal ']' in the list, make it the first character
See:
Posix Regex specification:
http://pubs.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap09.html#tag_09_04
Posix awk specification:
http://pubs.opengroup.org/onlinepubs/009695399/utilities/awk.html
Gnu Awk manual:
http://www.gnu.org/software/gawk/manual/gawk.html#Bracket-Expressions
FS="[]": here it looks for characters inside the [] and there are none.
To use square brackets as separators you need to write them like this: [][]
gawk '{FS="[\[\]]";print NF}' is also wrong: FS must be set outside the main block (with -F, as a variable assignment, or in BEGIN).
E.g.
echo 'CreateMainPageLink("410",$objUserData,$mnt[139]);' | awk -F'[][]' '{print $2}'
139
Or
awk '{print $2}' FS='[][]'
Or
awk 'BEGIN {FS="[][]"} {print $2}'
All give 139
Edit: gawk '{FS="[\[\]]";print NF}' prints the number of fields, NF, and not the value of the last field, $NF. Even $NF would not help, since splitting your data on [ and ] gives ); as the last field; use awk '{print $(NF-1)}' FS='[][]' to get the second-to-last field.
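The points above can be checked with a short sketch: with [][] as the field separator, set either via -F or in a BEGIN block, the sample line splits into three fields and the number is the second one. The shell variable names are only for illustration.

```shell
line='CreateMainPageLink("410",$objUserData,$mnt[139]);'

# -F is quoted so the shell never tries to expand [][] as a glob.
num=$(printf '%s\n' "$line" | awk -F'[][]' '{ print $2 }')

# Same separator set in BEGIN, i.e. before the first record is read.
nf=$(printf '%s\n' "$line" | awk 'BEGIN { FS = "[][]" } { print NF }')

echo "$num $nf"
```

This prints 139 3: the text before [, the number, and the trailing );.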
Do you need awk? You can get the value via sed like this:
# echo 'CreateMainPageLink("410",$objUserData,$mnt[139]);' | sed -n 's:.*\[\([0-9]\+\)\].*:\1:p'
139
In the following awk command
awk '{sum+=$1; ++n} END {avg=sum/n; print "Avg monitoring time = "avg}' file.txt
what should I change to remove scientific notation output (very small values displayed as 1.5e-05) ?
I was not able to succeed with the OMFT variable.
You should use the printf AWK statement. That way you can specify padding, precision, etc. In your case, the %f control letter seems the most appropriate.
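A minimal sketch of the printf approach, applied to the command from the question. The two input values are made up so that the average is small enough to trigger scientific notation under plain print; the n guard against empty input is an addition.

```shell
# Hypothetical sample data: one value per line.
printf '%s\n' 0.00001 0.00002 > file.txt

out=$(awk '{ sum += $1; ++n }
END { if (n) printf "Avg monitoring time = %f\n", sum / n }' file.txt)
echo "$out"
```

With printf the format is explicit, so neither OFMT nor CONVFMT affects the result; use something like %.10f if six decimal places are not enough.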
I was not able to succeed with the OMFT variable.
It is actually OFMT (output format), so for example:
awk 'BEGIN{OFMT="%f";print 0.000015}'
will output:
0.000015
as opposed to:
awk 'BEGIN{print 0.000015}'
which outputs:
1.5e-05
The GNU AWK manual says that, to be POSIX-compliant, OFMT should be a floating-point conversion specification.
Setting -v OFMT='%f' (without having to embed it into my awk statement) worked for me in the case where all I wanted from awk was to sum columns of arbitrary floating point numbers.
As the OP found, awk produces exponential notation with very small numbers,
$ some_accounting_widget | awk '{sum+=$0} END{print sum+0}'
8.992e-07 # Not useful to me
Setting OFMT fixed that, but it also rounded too aggressively,
$ some_accounting_widget | awk -v OFMT='%f' '{sum+=$0} END{print sum+0}'
0.000001 # Oops. Rounded off too much. %f rounds to 6 decimal places by default.
Specifying the number of decimal places got me what I needed,
$ some_accounting_widget | awk -v OFMT='%.10f' '{sum+=$0} END{print sum+0}'
0.0000008992 # Perfect.
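One caveat worth illustrating: OFMT only applies when print outputs a number directly. When a number is concatenated with a string, as in the command from the original question, it is converted with CONVFMT instead, which is probably why setting OFMT alone appears to have no effect there. A small sketch of the difference:

```shell
# Concatenation: the number is converted with CONVFMT (default %.6g), OFMT is ignored.
a=$(awk 'BEGIN { OFMT = "%f"; avg = 0.000015; print "avg = " avg }')

# Separate print arguments: the number is output with OFMT.
b=$(awk 'BEGIN { OFMT = "%f"; avg = 0.000015; print "avg =", avg }')

# Concatenation again, this time with CONVFMT changed.
c=$(awk 'BEGIN { CONVFMT = "%f"; avg = 0.000015; print "avg = " avg }')

printf '%s\n' "$a" "$b" "$c"
```

The first line comes out as avg = 1.5e-05, the other two as avg = 0.000015.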