How to have awk print variable instead of the full matched line? - awk

The awk script below prints a different result than expected, and I'd like to know why. How can I make it behave the way I want?
Awk script
$ cat script.awk
/^This is/ {
print "Block entered";
my_var="value";
print $my_var;
};
Input data
$ cat input-data
This is the text to match
Expected output
Block entered
value
Actual output
$ awk -f script.awk -- input-data
Block entered
This is the text to match

anubhava's already suggested a fix for OP's code; the following is an explanation of what OP's code is doing ...
$ is used to reference a field in the current line (eg, $1 == 1st field) so ...
$myvar becomes $value, but since this is an invalid field reference awk silently ignores it so ...
print $myvar becomes print but ...
print (with no args) says to print the current (input) line ...
hence the output of This is the text to match
NOTE: following the link provided by OP (see comment - above), it appears that $value may actually be treated as $0 leaving us with print $0 (which is the same as print in this case)
Consider:
awk '{myvar="1"; print $myvar}' <<< "field#1 field#2 field#3"
In this scenario $myvar becomes $1 which is a valid field reference so the output generated is:
field#1

Related

Replace value in nth line ABOVE search string using awk/sed

I have a large firewall configuration file with sections like these distributed all over:
edit 78231
set srcintf "port1"
set dstintf "any"
set srcaddr "srcaddr"
set dstaddr "vip-dnat"
set service "service"
set schedule "always"
set logtraffic all
set logtraffic-start enable
set status enable
set action accept
next
I want to replace value "port1", which is 3 lines above search string "vip-dnat".
It seems the below solution is close but I don't seem to be able to invert the search to check above the matched string. Also it does not replace the value inside the file:
Replace nth line below the searched pattern in a file
I'm able to extract the exact value using the following awk command but simply cannot figure out how to replace it within the file (sub/gsub?):
awk -v N=3 -v pattern=".*vip-dnat.*" '{i=(1+(i%N));if (buffer[i]&& $0 ~ pattern) print buffer[i]; buffer[i]=$3;}' filename
"port1"
We could use tac + awk combination here. I have created a variable occur with value after how many lines(when "vip-dnat" is found) you need to perform substitution.
tac Input_file |
awk -v occur="3" -v new_port="new_port_value" '
/\"vip-dnat\"/{
found=1
print
next
}
found && ++count==occur{
sub(/"port1"/,new_port)
found=""
}
1' |
tac
Explanation: Adding detailed explanation for above.
tac Input_file | ##Printing Input_file content in reverse order, sending output to awk command as an input.
awk -v occur="3" -v new_port="new_port_value" ' ##Starting awk program with 2 variables occur which has number of lines after we need to perform substitution and new_port what is new_port value we need to keep.
/\"vip-dnat\"/{ ##Checking if line has "vip-dnat" then do following.
found=1 ##Setting found to 1 here.
print ##Printing current line here.
next ##next will skip all statements from here.
}
found && ++count==occur{ ##Checking if found is SET and count value equals to occur.
sub(/"port1"/,new_port) ##Then substitute "port1" with new_port value here.
found="" ##Nullify found here.
}
1' | ##Mentioning 1 will print current line and will send output to tac here.
tac ##Again using tac will print output in actual order.
Use a Perl one-liner. In this example, it changes line number 3 above the matched string to set foo bar:
perl -0777 -pe 's{ (.*\n) (.*\n) ( (?:.*\n){2} .* vip-dnat ) }{${1} set foo bar\n${3}}xms' in_file
Prints:
edit 78231
set foo bar
set dstintf "any"
set srcaddr "srcaddr"
set dstaddr "vip-dnat"
set service "service"
set schedule "always"
set logtraffic all
set logtraffic-start enable
set status enable
set action accept
next
When you are satisfied with the replacement written into STDOUT, change perl to perl -i.bak to replace the file in-place.
The Perl one-liner uses these command line flags:
-e : Tells Perl to look for code in-line, instead of in a file.
-p : Loop over the input one line at a time, assigning it to $_ by default. Add print $_ after each loop iteration.
-i.bak : Edit input files in-place (overwrite the input file). Before overwriting, save a backup copy of the original file by appending to its name the extension .bak.
-0777 : Slurp files whole.
(.*\n) : Any character, repeated 0 or more times, ending with a newline. Parenthesis serve to capture the matched part into "match variables", numbered $1, $2, etc, from left to right according to the position of the opening parenthesis.
( (?:.*\n){2} .* vip-dnat ) : 2 lines followed by the line with the desired string vip-dnat. (?: ... ) represents non-capturing parentheses.
SEE ALSO:
perldoc perlrun: how to execute the Perl interpreter: command line switches
perldoc perlre: Perl regular expressions (regexes)
perldoc perlre: Perl regular expressions (regexes): Quantifiers; Character Classes and other Special Escapes; Assertions; Capture groups
perldoc perlrequick: Perl regular expressions quick start
The regex uses these modifiers:
/x : Ignore whitespace and comments, for readability.
/m : Allow multiline matches.
/s : Allow . to match a newline.
Whenever you have tag-value pairs in your data it's best to first create an array of that mapping (tag2val[] below) and then you can test and/or change and/or print the values in whatever order you like just be using their names:
$ cat tst.awk
$1 == "edit" { editId=$2; next }
editId != "" {
if ($1 == "next") {
# Here is where you test and/or set the values of whatever tags
# you like by referencing their names.
if ( tag2val[ifTag] == ifVal ) {
tag2val[thenTag] = thenVal
}
print "edit", editId
for (tagNr=1; tagNr<=numTags; tagNr++) {
tag = tags[tagNr]
val = tag2val[tag]
print " set", tag, val
}
print $1
editId = ""
numTags = 0
delete tag2val
}
else {
tag = $2
sub(/^[[:space:]]*([^[:space:]]+[[:space:]]+){2}/,"")
sub(/[[:space:]]+$/,"")
val = $0
if ( !(tag in tag2val) ) {
tags[++numTags] = tag
}
tag2val[tag] = val
}
}
$ awk -v ifTag='dstaddr' -v ifVal='"vip-dnat"' -v thenTag='srcintf' -v thenVal='"foobar"' -f tst.awk file
edit 78231
set srcintf "foobar"
set dstintf "any"
set srcaddr "srcaddr"
set dstaddr "vip-dnat"
set service "service"
set schedule "always"
set logtraffic all
set logtraffic-start enable
set status enable
set action accept
next
Note that the above approach:
Will work even if/when either of the values you want to find appear in other contexts (e.g. associated with other tags or as substrings of other values), and doesn't rely on how many lines are between the lines you're interested in or what order they appear in in the record.
Will let you change that value of any tag based on the value of any other tag and is easily extended to do compound comparisons, assignments, etc. and/or print the values in a different order or print a subset of them or add new tags+values.

Bash code for Selecting few columns from a variable

In a file I have a list of coordinates stored (see figure, to the left).
From there I want to copy the coordinates only (red marked) and put them in another file.
I copy the correct section from the file using COORD=`grep -B${i} '&END COORD' ${cpki_file}. Then I tried to use awk to extract the required numbers from the COORD variable . It does output all the numbers in the file but deletes the spaces between values (figure, to the right).
How to write the red marked section as they are?
N=200
NEndCoord=`grep -B${N} '&END COORD' ${cpki_file}|wc -l`
NCoord=`grep -B${N} '&END COORD' ${cpki_file}| grep -B200 '&COORD' |wc -l`
let i=$NEndCoord-$NCoord
COORD=`grep -B${i} '&END COORD' ${cpki_file}`
echo "$COORD" | awk '{ print $2 $3 $4 }'
echo "$COORD" | awk '{ print $2 $3 $4 }'>tmp.txt
When you start using combinations of grep, sed, awk, cut and alike, you should realize you can do it all in a single awk command. In case of the OP, this would do exactly the same:
awk '/[&]END COORD/{p=0}
p { print $2,$3,$4 }
/[&]COORD/{p=1}' file
This parses the file keeping track of a printing flag p. The flag is set if "&COORD" is found and unset if "&END COORD" is found. Printing is done, only when the flag p is set. Since we don't want to print the line with "&END COORD", we have to reset the flag before we do the check for the printing. The same holds for the line with "&COORD", but there we have to reset it after we do the check for the printing (its a bit a weird reversed logic).
The problem with the above is that it will also process the lines
UNIT angstrom
If you want to have these removed, you might want to do a check on the total columns:
awk '/[&]END COORD/{p=0}
p && (NF==4){ print $2,$3,$4 }
/[&]COORD/{p=1}' file
Of only print the lines which do not contain "UNIT" or are empty:
awk '/[&]END COORD/{p=0}
p && (NF>0) && ($1 != "UNIT"){ print $2,$3,$4 }
/[&]COORD/{p=1}' file
sed one-liner:
sed -n '/^&COORD$/,/^UNIT/{s/.*[[:space:]]\+\(.*\)[[:space:]]\+\(.*\)[[:space:]]\+\(.*\)/\1\t\2\t\3/p}' <infile.txt >outfile.txt
Explanation:
Invocation:
sed: stream editor
-n: do not print unless eplicit
Commands in sed:
/^&COORD$/,/^UNIT/: Selects groups of lines after &COORDS and before UNIT.
{s/.*[[:space:]]\+\(.*\)[[:space:]]\+\(.*\)[[:space:]]\+\(.*\)/\1\t\2\t\3/p}: Process each selected lines.
s/.*[[:space:]]\+\(.*\)[[:space:]]\+\(.*\)[[:space:]]\+\(.*\): Regex capture space delimited groups except the first.
/\1\t\2\t\3/: Replace with tab delimited values of the captured groups.
p: Explicit printout.

How to filter the OTU by counts with AWK?

I am trying to filter all the singleton from a fasta file.
Here is my input file:
>OTU1;size=3;
ATTCCCCGGGGGGG
>OTU2;size=1;
ATCCGGGACTGATC
>OTU3;size=5;
GAACTATCGGGTAA
>OTU4;size=1;
AATTGGCCATCT
The expected output is:
>OTU1;size=3;
ATTCCCCGGGGGGG
>OTU3;size=5;
GAACTATCGGGTAA
I've tried
awk -F'>' '{if($1>=2) {print $0}' input.fasta > ouput.fasta
but this will remove all the header for each OTU.
Anyone could help me out?
Could you please try following.
awk -F'[=;]' '/^>/{flag=""} $3>=3{flag=1} flag' Input_file
$ awk '/>/{f=/=1;/}!f' file
>OTU1;size=3;
ATTCCCCGGGGGGG
>OTU3;size=5;
GAACTATCGGGTAA
awk -v FS='[;=]' 'prev_sz>=2 && !/size/{print prev RS $0} /size/{prev=$0;prev_sz=$(NF-1)}'
>OTU1;size=3;
ATTCCCCGGGGGGG
>OTU3;size=5;
GAACTATCGGGTAA
Store the size from each line in prev_sz variable and whole line in prev variables. Now check if its >= 2, then print the previous line and the current line. RS is used to print new line.
While all the above methods work, they are limited to the fact that input always has to look the same. I.e the sequence-name in your fasta-file needs to have the form:
>NAME;size=value;
A few solutions can handle a bit more extended sequence-names, but none handle the case where things go a bit more generic, i.e.
>NAME;label1=value1;label2=value2;STRING;label3=value3;
Print sequence where label xxx matches value vvv:
awk '/>{f = /;xxx=vvv;/}f' file.fasta
Print sequence where label xxx has a numeric value p bigger than q:
awk -v label="xxx" -v limit=q \
'BEGIN{ere=";" label "="}
/>/{ f=0; match($0,ere);value=0+substr($0,RSTART+length(ere)); f=(value>limit)}
f' <file>
In the above ere is a regular expression we try to match. We use it to find the location of the value attached to label xxx. This substring will have none-numeric characters after its value, but by adding 0 to it, it is converted to a number, losing all non-numeric values (i.e. 3;label4=value4; is converted to 3). We check if the value is bigger than our limit, and print the sequence based on that result.

Find replace "./." in awk

I am very new to using linux and I am trying to find/replace some of the text in my file.
I have successfully been able to find and replace "0/0" using gsub:
awk '{gsub(/0\/0/,"0")}; 1' filename
However, if I try to replace "./." using the same idea
awk '{gsub(/\.\/\./,"U")}; 1' filename
the output is truncated and stops at the location of the first "./." in the file. I know that "." is a special wildcard character, but I thought that having the "\" in front of it would neutralize it. I have searched but have been unable to find an explanation why the formula I used would truncate the file.
Any thoughts would be much appreciated. Thank you.
Recall that the basic outline of an awk is:
awk 'pattern { action }'
The most common patterns are regexes or tests against line counts:
awk '/FOO/ { do_something_with_a_line_with_FOO_in_it }'
awk 'FNR==10'
The last one has no action so the default is to print the line.
But functions that return a value are also useable as patterns. gsub is a function and returns the number of substitutions.
So given:
$ echo "$txt"
abc./.def line 1
ghk/lmn won't get printed
abc./.def abc./.def printed
To print only lines that have a successful substitution you can do:
$ echo "$txt" | awk 'gsub(/\.\/\./,"U")'
abcUdef line 1
abcUdef abcUdef printed
You do not need to put gsub into an action block since you want to run it on every line and the return tells you something about what happened. The lines that successfully are matched are printed since gsub returns the number of substitutions.
If you want every line printed regardless if there is a match:
$ echo "$txt" | awk 'gsub(/\.\/\./,"U") || 1'
abcUdef line 1
ghk/lmn won't get printed
abcUdef abcUdef printed
Or, you can use the function as an action with an empty pattern and then a 1 with an empty action:
$ echo "$txt" | awk '{gsub(/\.\/\./,"U")} 1'
abcUdef line 1
ghk/lmn won't get printed
abcUdef abcUdef printed
In either case, 1 as a pattern with no action prints the line regardless if there is a match and the gsub makes the substitution if any.
The second awk is what you have. Why it is not working on your input data is probably related to you input data.
Your awk script is fine, your input contains control-Ms, probably from being created by a Windows program. You can see them with cat -v file and use dos2unix or similar to remove them.

Concatenating lines using awk

I have fasta file that contains two gene sequences and what I want to do is remove the fasta header (line starting with ">"), concatenate the rest of the lines and output that sequence
Here is my fasta sequence (genome.fa):
>Potrs164783
AGGAAGTGTGAGATTGAAAAAACATTACTATTGAGGAATTTTTGACCAGATCAGAATTGAACCAACATGATGAAGGGGAT
TGTTTGCCATCAGAATATGGCATGAAATTTCTCCCCTAGATCGGTTCAAGCTCCTGTAGGTTTGGAGTCCTTAGTGAGAA
CTTTCTTAAGAGAATCTAATCTGGTCTGTTCCTCGTCATAAGTTAAAGAAAAACTTGAAACAAATAACAAGCATGCATAA
>Potrs164784
TTACCCTCTACCAGCACCAATGCCTATGATCTTACAAAAATCCTTAATAAAAAGAAATCCAAAACCATTGTTACCATTCC
GGAATTACATTCTGAGATAAAAACCCTCAAATCTGAATTACAATCCCTTAAACAAGCCCAACAAAAAGACTCTGCCATAC
Desired output
AGGAAGTGTGAGATTGAAAAAACATTACTATTGAGGAATTTTTGACCAGATCAGAATTGAACCAACATGATGAAGGGGAT
TGTTTGCCATCAGAATATGGCATGAAATTTCTCCCCTAGATCGGTTCAAGCTCCTGTAGGTTTGGAGTCCTTAGTGAGAA
CTTTCTTAAGAGAATCTAATCTGGTCTGTTCCTCGTCATAAGTTAAAGAAAAACTTGAAACAAATAACAAGCATGCATAA
TTACCCTCTACCAGCACCAATGCCTATGATCTTACAAAAATCCTTAATAAAAAGAAATCCAAAACCATTGTTACCATTCC
GGAATTACATTCTGAGATAAAAACCCTCAAATCTGAATTACAATCCCTTAAACAAGCCCAACAAAAAGACTCTGCCATAC
I am using awk to do this but I am getting this error
awk 'BEGIN{filename="file1"}{if($1 ~ />/){filename=$1; sub(/>/,"",filename); print filename;} print $0 >filename.fa;}' ../genome.fa
awk: syntax error at source line 1
context is
BEGIN{filename="file1"}{if($1 ~ />/){filename=$1; sub(/>/,"",filename); print filename;} print $0 >>> >filename. <<< fa;}
awk: illegal statement at source line 1
I am basically a python person and I was given this script by someone. What am I doing wrong here?
I realized that i was not clear and so i am pasting the whole code that i got from someone. The input file and desired output remains the same
mkdir split_genome;
cd split_genome;
awk 'BEGIN{filename="file1"}{if($1 ~ />/){filename=$1; sub(/>/,"",filename); print filename;} print $0 >filename.fa;}' ../genome.fa;
ls -1 `pwd`/* > ../scaffold_list.txt;
cd ..;
If all you want to do is produce the desired output shown in your question, other solutions will work.
However, the script you have is trying to print each sequence to a file that is named using its header, and the extension .fa.
The syntax error you're getting is because filename.fa is neither a variable or a fixed string. While no Awk will allow you to print to filename.fa because it is neither in quotes or a variable (varaible names can't have a . in them), BSD Awk does not allow manipulating strings when they currently act as a file name where GNU Awk does.
So the solution:
print $0 > filename".fa"
would produce the same error in BSD Awk, but would work in GNU Awk.
To fix this, you can append the extension ".fa" to filename at assignment.
This will do the job:
$ awk '{if($0 ~ /^>/) filename=substr($0, 2)".fa"; else print $0 > filename}' file
$ cat Potrs164783.fa
AGGAAGTGTGAGATTGAAAAAACATTACTATTGAGGAATTTTTGACCAGATCAGAATTGAACCAACATGATGAAGGGGAT
TGTTTGCCATCAGAATATGGCATGAAATTTCTCCCCTAGATCGGTTCAAGCTCCTGTAGGTTTGGAGTCCTTAGTGAGAA
CTTTCTTAAGAGAATCTAATCTGGTCTGTTCCTCGTCATAAGTTAAAGAAAAACTTGAAACAAATAACAAGCATGCATAA
$ cat Potrs164784.fa
TTACCCTCTACCAGCACCAATGCCTATGATCTTACAAAAATCCTTAATAAAAAGAAATCCAAAACCATTGTTACCATTCC
GGAATTACATTCTGAGATAAAAACCCTCAAATCTGAATTACAATCCCTTAAACAAGCCCAACAAAAAGACTCTGCCATAC
You'll notice I left out the BEGIN{filename="file1"} declaration statement as it is unnecessary. Also, I replaced the need for sub(...) by using the string function substr as it is more clear and requires fewer actions.
The awk code that you show attempts to do something different than produce the output that you want. Fortunately, there are much simpler ways to obtain your desired output. For example:
$ grep -v '>' ../genome.fa
AGGAAGTGTGAGATTGAAAAAACATTACTATTGAGGAATTTTTGACCAGATCAGAATTGAACCAACATGATGAAGGGGAT
TGTTTGCCATCAGAATATGGCATGAAATTTCTCCCCTAGATCGGTTCAAGCTCCTGTAGGTTTGGAGTCCTTAGTGAGAA
CTTTCTTAAGAGAATCTAATCTGGTCTGTTCCTCGTCATAAGTTAAAGAAAAACTTGAAACAAATAACAAGCATGCATAA
TTACCCTCTACCAGCACCAATGCCTATGATCTTACAAAAATCCTTAATAAAAAGAAATCCAAAACCATTGTTACCATTCC
GGAATTACATTCTGAGATAAAAACCCTCAAATCTGAATTACAATCCCTTAAACAAGCCCAACAAAAAGACTCTGCCATAC
Alternatively, if you had intended to have all non-header lines concatenated into one line:
$ sed -n '/^>/!H; $!d; x; s/\n//gp' ../genome.fa
AGGAAGTGTGAGATTGAAAAAACATTACTATTGAGGAATTTTTGACCAGATCAGAATTGAACCAACATGATGAAGGGGATTGTTTGCCATCAGAATATGGCATGAAATTTCTCCCCTAGATCGGTTCAAGCTCCTGTAGGTTTGGAGTCCTTAGTGAGAACTTTCTTAAGAGAATCTAATCTGGTCTGTTCCTCGTCATAAGTTAAAGAAAAACTTGAAACAAATAACAAGCATGCATAATTACCCTCTACCAGCACCAATGCCTATGATCTTACAAAAATCCTTAATAAAAAGAAATCCAAAACCATTGTTACCATTCCGGAATTACATTCTGAGATAAAAACCCTCAAATCTGAATTACAATCCCTTAAACAAGCCCAACAAAAAGACTCTGCCATAC
Try this to print lines not started by > and in one line:
awk '!/^>/{printf $0}' genome.fa > filename.fa
With carriage return:
awk '!/^>/' genome.fa > filename.fa
To create single files named by the headers:
awk 'split($0,a,"^>")>1{file=a[2];next}{print >file}' genome.fa