What does this Awk expression mean?

I am working with a bash script that has this command in it.
awk -F ‘‘ ‘/abc/{print $3}’|xargs
What is the meaning of this command? Assume input is provided to awk.

The quick answer is it'll do different things depending on the version of awk you're running and how many fields of output the awk script produces.
I assume you meant to write:
awk -F '' '/abc/{print $3}'|xargs
not the syntactically invalid (due to "smart quotes"):
awk -F ‘’’/abc/{print $3}’|xargs
-F '' is undefined behavior per POSIX, so what it will do depends on the version of awk you're running. In some awks it'll split the current line into 1 character per field. In others it'll be ignored and the line will be split into fields at every sequence of white space. In other awks still it could do anything else.
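As a quick probe of what your awk does with it (GNU awk documents FS="" as splitting into one character per field; other awks may ignore it or do something else entirely):
$ echo 'xabc' | gawk -F '' '{print NF, $3}'
4 b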
/abc/ looks for a string matching the regexp abc on the current line and if found invokes the subsequent action, in this case {print $3}.
However it's split into fields, print $3 will print the 3rd such field.
xargs as used will just print chunks of the multi-line input it's getting all on one line, so you could get one line of all the fields if the awk script doesn't output many, or several lines of multi-field output if it does.
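For example, assuming an awk that splits one character per field with -F '', each matching line of the sample input below yields one character of output, and xargs then joins those lines:
$ printf 'foo\nxabc\nyzabc\nbar\n' | gawk -F '' '/abc/{print $3}'
b
a
$ printf 'foo\nxabc\nyzabc\nbar\n' | gawk -F '' '/abc/{print $3}' | xargs
b a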
I suspect the intent of that code was to do what this code will actually do in any awk, without needing xargs:
awk '/abc/{printf "%s%s", sep, substr($0,3,1); sep=OFS} END{print ""}'
e.g.:
$ printf 'foo\nxabc\nyzabc\nbar\n' |
awk '/abc/{printf "%s%s", sep, substr($0,3,1); sep=OFS} END{print ""}'
b a

Related

How to put a comma in between awk when filtering columns in bash shell script?

I want to put a comma in between outputs from awk in a bash shell script (Linux).
This is a snippet of my original command
awk {print $13, $10} >> test.csv
If I put a comma in between $13 and $10, I would get a space in between the two columns
But what I want is a comma between these two columns
I'm very new to this and I can't find any resources about this online so bear with me if this is a simple mistake. Thank you
suggestion 1
awk '{print $13 ";" $10}' >> CPU2.csv
suggestion 2
awk '{print $13, $10}' OFS=";" >> CPU2.csv
suggestion 3
awk '{printf("%s;%s\n", $13, $10)}' >> CPU2.csv
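As a quick sanity check of any of these on a made-up 13-field line (using a comma via OFS, since that's what was actually asked for):
$ echo 'a b c d e f g h i j k l m' | awk 'BEGIN{OFS=","} {print $13, $10}'
m,j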
echo "1 2 3"|awk '{a=";";print $1a$2","$3}'
Not as elegant as I hoped, but this should work:
mawk 'NF=($+_=$13(_)$10)~_' \_=\;
It first overwrites the entire row with just $13 and $10, with the semi-colon ; in between. _ holds a semi-colon, so the numeric evaluation of $+_ is identical to $0, and since the delimiter is forced in between them, the regex test for the presence of a semi-colon always yields true (1), making NF = 1 and printing just that.
Assigning 1 to NF takes the place of the usual $1 = $1 rebuild.
NF isn't 2 because the new sep sits between the fields instead of a space or tab, so even though $0 was overwritten, awk wouldn't have found the separator it needed to split out a 2nd field.
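For example, on a made-up 13-field line:
$ echo 'a b c d e f g h i j k l m' | mawk 'NF=($+_=$13(_)$10)~_' \_=\;
m;j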
Tested on mawk 1.3.4, mawk 1.9.9.6, macOS nawk, and gawk 5.1.1, including gawk's -c traditional flag and -P posix mode.
-- The 4Chan Teller

Match regexp at the end of the string with AWK

I am trying to match two different regexps against long strings with awk, removing the part of the string that matches within a 35-character window.
The problem is that the same bunch of code works when I am looking for the first one (which matches at the beginning), whereas it fails to match the second one (at the end of the string).
Input:
Regexp1(1)(2)(3)(4)(5)xxxxxxxxxxxxxxx(20)(21)(22)(23)Regexp2
Desired output
(1)(2)(3)(4)(5)xxxxxxxxxxxxxxx(20)(21)(22)(23)
So far I used this code, which extracts Regexp1 correctly but, unfortunately, is not able to also extract Regexp2, since the indexes RSTART and RLENGTH for Regexp2 are incorrect.
Code for extracting Regexp1 (correct output):
awk -v F="Regexp1" '{if (match(substr($1,1,35),F)) print substr($1,RSTART,RLENGTH)}' file
Code for extracting Regexp2 (wrong output)
awk -v F="Regexp2" '{if (match(substr($1,length($1)-35,35),F)) print substr($1,RSTART,RLENGTH)}' file
While the indexes for Regexp1 are correct, for Regexp2 they are wrong (RSTART=13). I cannot figure out how to extract the second regexp.
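For what it's worth, the underlying problem is that match() sets RSTART relative to the string you actually pass it, so when you match inside a substr() window you have to add the window's offset back before indexing into $1. A minimal sketch of that correction (note a 35-character window ending at the last character starts at length($1)-34):
awk -v F="Regexp2" '{
  s = length($1) - 34
  if (match(substr($1, s, 35), F))
    print substr($1, s + RSTART - 1, RLENGTH)
}' file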
Considering that your actual Input_file is the same as the samples shown, could you please try the following (it's good to have a newer version of awk, since old versions may not support the repetition-count logic for regexps):
awk '
match($0,/(\([0-9]+\)){5}.*(\([0-9]+\)){4}/){
print substr($0,RSTART,RLENGTH)
}' Input_file
In case your number of parenthesized values is not fixed, you could do it as follows:
awk '
match($0,/(\([0-9]+\)){1,}.*(\([0-9]+\)){1,}/){
print substr($0,RSTART,RLENGTH)
}' Input_file
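For example, with the sample input (assuming an awk with interval-expression support, e.g. a recent gawk):
$ echo 'Regexp1(1)(2)(3)(4)(5)xxxxxxxxxxxxxxx(20)(21)(22)(23)Regexp2' |
awk 'match($0,/(\([0-9]+\)){1,}.*(\([0-9]+\)){1,}/){print substr($0,RSTART,RLENGTH)}'
(1)(2)(3)(4)(5)xxxxxxxxxxxxxxx(20)(21)(22)(23)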
If this isn't all you need:
$ sed 's/Regexp1\(.*\)Regexp2/\1/' file
(1)(2)(3)(4)(5)xxxxxxxxxxxxxxx(20)(21)(22)(23)
or using GNU awk for gensub():
$ awk '{print gensub(/Regexp1(.*)Regexp2/,"\\1",1)}' file
(1)(2)(3)(4)(5)xxxxxxxxxxxxxxx(20)(21)(22)(23)
If that's not all you need, then edit your question to be far clearer about your requirements and examples.

Join lines into one line using awk

I have a file with the following records
ABC
BCD
CDE
EFG
I would like to convert this into
'ABC','BCD','CDE','EFG'
I attempted to attack this problem using Awk in the following way:
awk '/START/{if (x)print x;x="";next}{x=(!x)?$0:x","$0;}END{print x;}'
but I do not obtain what I expected:
ABC,BCD,CDE,EFG
Are there any suggestions on how we can achieve this?
Could you please try the following.
awk -v s1="'" 'BEGIN{OFS=","} {val=val?val OFS s1 $0 s1:s1 $0 s1} END{print val}' Input_file
Output will be as follows.
'ABC','BCD','CDE','EFG'
With GNU awk for multi-char RS:
$ awk -v RS='\n$' -F'\n' -v OFS="','" -v q="'" '{$1=$1; print q $0 q}' file
'ABC','BCD','CDE','EFG'
There are many ways of achieving this:
with pipes:
sed "s/.*/'&'/" <file> | paste -sd,
awk '{print "\047" $0 "\047"}' <file> | paste -sd,
remark: we do not make use of tr here as this would lead to an extra , at the end.
reading the full file into memory:
sed ':a;N;$!ba;s/\n/'"','"'/g;s/.*/'"'&'"'/g' <file> #POSIX
sed -z 's/^\|\n$/'"'"'/g;s/\n/'"','"'/g;' <file> #GNU
and the solution of #EdMorton
without reading the full file into memory:
awk '{printf (NR>1?",":"")"\047"$0"\047"}' <file>
and some random other attempts:
awk '(NR-1){s=s","}{s=s"\047"$0"\047"}END{print s}' <file>
awk 'BEGIN{printf s="\047";ORS=s","s}(NR>1){print t}{t=$0}END{ORS=s;print t}' <file>
So what is going on with the OP's attempts?
Writing down the OP's awk line, we have
/START/{if (x)print x;x="";next}
{x=(!x)?$0:x","$0;}
END{print x;}
What does this do? Let us analyze step by step:
/START/{if (x)print x;x="";next}:: This reads: if the current record/line contains the string START, then do
if (x) print x:: if x is not an empty string, print the value of x
x="":: set x to be an empty string
next:: skip to the next record/line
In this code block, the OP probably assumed that /START/ means do this at the beginning of all things. In awk, however, this is written as BEGIN, and since at the beginning all variables are empty strings or zero, the if statement is not executed by default. This block could be replaced by:
BEGIN{x=""}
But again, this is not needed and thus one can remove it.
{x=(!x)?$0:x","$0;}:: concatenate the string with the correct delimiter. This is good, especially due to the usage of the ternary operator. Sadly, the delimiter is set to , and not ',', which in awk is best written as \047,\047. So the line could read:
{x=(!x)?$0:x"\047,\047"$0;}
This line can be written shorter if you realize that x could be an empty string. For an empty string, x=$0 is equivalent to x=x $0, and all you want to do is add a separator which may or may not be an empty string. So you can write this as
{x= x ((!x)?"":"\047,\047") $0}
or inverting the logic to get rid of some more characters:
{x=x(x?"\047,\047":"")$0}
one could even write
{x=x(x?"\047,\047":x)$0}
but this is not optimal as it needs to read the content of x a second time. However, this form can be used to finally optimize it to (per #EdMorton's comment)
{x=(x?x"\047,\047":"")$0}
This is better as it removes an extra concatenation operator.
END{print x}:: Here the OP prints the result. This, however, will miss the single quotes at the beginning and end of the string, so they can be added:
END{print "\047" x "\047"}
So the corrected version of the OP's code would read:
awk '{x=(x?x"\047,\047":"")$0}END{print "\047" x "\047"}'
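Running the corrected version over the sample input:
$ printf 'ABC\nBCD\nCDE\nEFG\n' | awk '{x=(x?x"\047,\047":"")$0}END{print "\047" x "\047"}'
'ABC','BCD','CDE','EFG'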
awk with paste may be better:
awk '{printf fmt,$1}' fmt="'%s'\n" file | paste -sd, -
'ABC','BCD','CDE','EFG'

Different results in awk when using different FS syntax

I have a sample file which contains the following.
logging.20160309.113.txt.log: 0 Rows successfully loaded.
logging.20160309.1180.txt.log: 0 Rows successfully loaded.
logging.20160309.1199.txt.log: 0 Rows successfully loaded.
I am familiar with 2 ways of specifying a Field Separator in awk; however, I am getting different results.
For the longest time I have used:
the "FS=" syntax when my FS is more than one character,
the "-f" flag when my FS is just one character.
I would like to understand why the FS= syntax is giving me an unexpected result, as seen below. Somehow the 1st record is being left behind.
$ head -3 reload_list | awk -F"\.log\:" '{ print $1 }'
awk: warning: escape sequence `\.' treated as plain `.'
awk: warning: escape sequence `\:' treated as plain `:'
logging.20160309.113.txt
logging.20160309.1180.txt
logging.20160309.1199.txt
$ head -3 reload_list | awk '{ FS="\.log\:" } { print $1 }'
awk: warning: escape sequence `\.' treated as plain `.'
awk: warning: escape sequence `\:' treated as plain `:'
logging.20160309.113.txt.log:
logging.20160309.1180.txt
logging.20160309.1199.txt
The reason you are getting different results is that when you set FS inside the awk program, it is not in a BEGIN block. So by the time you've set it, the first record has already been parsed into fields (using the default separator).
Setting with -F
$ awk -F"\\.log:" '{ print $1 }' b.txt
logging.20160309.113.txt
logging.20160309.1180.txt
logging.20160309.1199.txt
Setting FS after parsing first record
$ awk '{ FS= "\\.log:"} { print $1 }' b.txt
logging.20160309.113.txt.log:
logging.20160309.1180.txt
logging.20160309.1199.txt
Setting FS before parsing any records
$ awk 'BEGIN { FS= "\\.log:"} { print $1 }' b.txt
logging.20160309.113.txt
logging.20160309.1180.txt
logging.20160309.1199.txt
I noticed this relevant bit in an awk manual. If you've seen different behavior previously or with a different implementation, this could explain why:
According to the POSIX standard, awk is supposed to behave as if
each record is split into fields at the time that it is read. In
particular, this means that you can change the value of FS after a
record is read, but before any of the fields are referenced. The value
of the fields (i.e. how they were split) should reflect the old value
of FS, not the new one.
However, many implementations of awk do not do this. Instead,
they defer splitting the fields until a field reference actually
happens, using the current value of FS! This behavior can be
difficult to diagnose.
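A quick way to see which camp your awk falls into (made-up two-field input): a POSIX-conforming awk prints the whole record a:b, because the fields were already split on whitespace when the line was read, while a deferring implementation re-splits on : and prints just a.
$ echo 'a:b c:d' | awk '{FS=":"; print $1}'
a:b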
-f is for running a script from a file. -F and FS work the same:
$ awk -F'.log' '{print $1}' logs
logging.20160309.113.txt
logging.20160309.1180.txt
logging.20160309.1199.txt
$ awk 'BEGIN{FS=".log"} {print $1}' logs
logging.20160309.113.txt
logging.20160309.1180.txt
logging.20160309.1199.txt

Concatenating lines using awk

I have a fasta file that contains two gene sequences, and what I want to do is remove the fasta headers (lines starting with ">"), concatenate the rest of the lines, and output that sequence.
Here is my fasta sequence (genome.fa):
>Potrs164783
AGGAAGTGTGAGATTGAAAAAACATTACTATTGAGGAATTTTTGACCAGATCAGAATTGAACCAACATGATGAAGGGGAT
TGTTTGCCATCAGAATATGGCATGAAATTTCTCCCCTAGATCGGTTCAAGCTCCTGTAGGTTTGGAGTCCTTAGTGAGAA
CTTTCTTAAGAGAATCTAATCTGGTCTGTTCCTCGTCATAAGTTAAAGAAAAACTTGAAACAAATAACAAGCATGCATAA
>Potrs164784
TTACCCTCTACCAGCACCAATGCCTATGATCTTACAAAAATCCTTAATAAAAAGAAATCCAAAACCATTGTTACCATTCC
GGAATTACATTCTGAGATAAAAACCCTCAAATCTGAATTACAATCCCTTAAACAAGCCCAACAAAAAGACTCTGCCATAC
Desired output
AGGAAGTGTGAGATTGAAAAAACATTACTATTGAGGAATTTTTGACCAGATCAGAATTGAACCAACATGATGAAGGGGAT
TGTTTGCCATCAGAATATGGCATGAAATTTCTCCCCTAGATCGGTTCAAGCTCCTGTAGGTTTGGAGTCCTTAGTGAGAA
CTTTCTTAAGAGAATCTAATCTGGTCTGTTCCTCGTCATAAGTTAAAGAAAAACTTGAAACAAATAACAAGCATGCATAA
TTACCCTCTACCAGCACCAATGCCTATGATCTTACAAAAATCCTTAATAAAAAGAAATCCAAAACCATTGTTACCATTCC
GGAATTACATTCTGAGATAAAAACCCTCAAATCTGAATTACAATCCCTTAAACAAGCCCAACAAAAAGACTCTGCCATAC
I am using awk to do this but I am getting this error
awk 'BEGIN{filename="file1"}{if($1 ~ />/){filename=$1; sub(/>/,"",filename); print filename;} print $0 >filename.fa;}' ../genome.fa
awk: syntax error at source line 1
context is
BEGIN{filename="file1"}{if($1 ~ />/){filename=$1; sub(/>/,"",filename); print filename;} print $0 >>> >filename. <<< fa;}
awk: illegal statement at source line 1
I am basically a python person and I was given this script by someone. What am I doing wrong here?
I realized that I was not clear, so I am pasting the whole code that I got from someone. The input file and desired output remain the same.
mkdir split_genome;
cd split_genome;
awk 'BEGIN{filename="file1"}{if($1 ~ />/){filename=$1; sub(/>/,"",filename); print filename;} print $0 >filename.fa;}' ../genome.fa;
ls -1 `pwd`/* > ../scaffold_list.txt;
cd ..;
If all you want to do is produce the desired output shown in your question, other solutions will work.
However, the script you have is trying to print each sequence to a file named after its header, with the extension .fa.
The syntax error you're getting is because filename.fa is neither a variable nor a quoted string: variable names can't contain a ., so no Awk can parse it. Beyond that, BSD Awk does not allow an unparenthesized string concatenation to be used as an output file name after >, where GNU Awk does.
So the solution:
print $0 > filename".fa"
would produce the same error in BSD Awk, but would work in GNU Awk.
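If you'd rather keep the concatenation at print time, parenthesizing the file-name expression should work in both, since every awk accepts a parenthesized expression after the redirection:
print $0 > (filename ".fa")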
To fix this, you can append the extension ".fa" to filename at assignment.
This will do the job:
$ awk '{if($0 ~ /^>/) filename=substr($0, 2)".fa"; else print $0 > filename}' file
$ cat Potrs164783.fa
AGGAAGTGTGAGATTGAAAAAACATTACTATTGAGGAATTTTTGACCAGATCAGAATTGAACCAACATGATGAAGGGGAT
TGTTTGCCATCAGAATATGGCATGAAATTTCTCCCCTAGATCGGTTCAAGCTCCTGTAGGTTTGGAGTCCTTAGTGAGAA
CTTTCTTAAGAGAATCTAATCTGGTCTGTTCCTCGTCATAAGTTAAAGAAAAACTTGAAACAAATAACAAGCATGCATAA
$ cat Potrs164784.fa
TTACCCTCTACCAGCACCAATGCCTATGATCTTACAAAAATCCTTAATAAAAAGAAATCCAAAACCATTGTTACCATTCC
GGAATTACATTCTGAGATAAAAACCCTCAAATCTGAATTACAATCCCTTAAACAAGCCCAACAAAAAGACTCTGCCATAC
You'll notice I left out the BEGIN{filename="file1"} declaration as it is unnecessary. Also, I replaced the sub(...) call with the string function substr, as it is clearer and requires fewer operations.
The awk code that you show attempts to do something different from producing the output that you want. Fortunately, there are much simpler ways to obtain your desired output. For example:
$ grep -v '>' ../genome.fa
AGGAAGTGTGAGATTGAAAAAACATTACTATTGAGGAATTTTTGACCAGATCAGAATTGAACCAACATGATGAAGGGGAT
TGTTTGCCATCAGAATATGGCATGAAATTTCTCCCCTAGATCGGTTCAAGCTCCTGTAGGTTTGGAGTCCTTAGTGAGAA
CTTTCTTAAGAGAATCTAATCTGGTCTGTTCCTCGTCATAAGTTAAAGAAAAACTTGAAACAAATAACAAGCATGCATAA
TTACCCTCTACCAGCACCAATGCCTATGATCTTACAAAAATCCTTAATAAAAAGAAATCCAAAACCATTGTTACCATTCC
GGAATTACATTCTGAGATAAAAACCCTCAAATCTGAATTACAATCCCTTAAACAAGCCCAACAAAAAGACTCTGCCATAC
Alternatively, if you had intended to have all non-header lines concatenated into one line:
$ sed -n '/^>/!H; $!d; x; s/\n//gp' ../genome.fa
AGGAAGTGTGAGATTGAAAAAACATTACTATTGAGGAATTTTTGACCAGATCAGAATTGAACCAACATGATGAAGGGGATTGTTTGCCATCAGAATATGGCATGAAATTTCTCCCCTAGATCGGTTCAAGCTCCTGTAGGTTTGGAGTCCTTAGTGAGAACTTTCTTAAGAGAATCTAATCTGGTCTGTTCCTCGTCATAAGTTAAAGAAAAACTTGAAACAAATAACAAGCATGCATAATTACCCTCTACCAGCACCAATGCCTATGATCTTACAAAAATCCTTAATAAAAAGAAATCCAAAACCATTGTTACCATTCCGGAATTACATTCTGAGATAAAAACCCTCAAATCTGAATTACAATCCCTTAAACAAGCCCAACAAAAAGACTCTGCCATAC
Try this to print the lines not starting with > joined into one line:
awk '!/^>/{printf $0}' genome.fa > filename.fa
Keeping the line breaks:
awk '!/^>/' genome.fa > filename.fa
To create separate files named after the headers:
awk 'split($0,a,"^>")>1{file=a[2];next}{print >file}' genome.fa
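For example, with the sample genome.fa this writes one file per header (note these file names get no .fa extension, and the behavior of ^ in split's regexp can vary between awks; this works with GNU awk):
$ awk 'split($0,a,"^>")>1{file=a[2];next}{print >file}' genome.fa
$ head -1 Potrs164783
AGGAAGTGTGAGATTGAAAAAACATTACTATTGAGGAATTTTTGACCAGATCAGAATTGAACCAACATGATGAAGGGGAT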