Why does an awk field assignment lose the output field separator?

This command works. It outputs the field separator (in this case, a comma):
$ echo "hi,ho"|awk -F, '/hi/{print $0}'
hi,ho
This command has strange output (it is missing the comma):
$ echo "hi,ho"|awk -F, '/hi/{$2="low";print $0}'
hi low
Setting the OFS (output field separator) variable to a comma fixes this case, but it really does not explain this behaviour.
Can I tell awk to keep the OFS?

When you assign to a field, awk rebuilds $0 by joining all fields with the value of OFS, which is a single space by default. You modified $2, so you forced awk to rebuild $0.
In your first case you print $0 without modifying any field, so awk never rebuilds the record and the original separators are preserved.
To keep the comma in the output, set OFS to the same value as FS. You can do that in any of the following ways.
In a BEGIN block:
$ echo "hi,ho" | awk 'BEGIN{FS=OFS=","}/hi/{$2="low";print $0}'
hi,low
Using the -v option:
$ echo "hi,ho" | awk -F, -v OFS="," '/hi/{$2="low";print $0}'
hi,low
As a variable assignment after the awk program:
$ echo "hi,ho" | awk -F, '/hi/{$2="low";print $0}' OFS=","
hi,low
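The rebuild described above can be seen in isolation with the common `$1 = $1` idiom. This is a minimal sketch, assuming a POSIX-style awk; the sample input and the `;` separator are only illustrative.

```shell
# $0 is printed untouched, so the input commas survive.
echo "a,b,c" | awk -F, '{ print $0 }'

# Assigning a field to itself does not change the field's value,
# but it still makes awk rebuild $0 by joining the fields with OFS.
echo "a,b,c" | awk -F, -v OFS=';' '{ $1 = $1; print $0 }'
```

The first command prints a,b,c and the second prints a;b;c, which shows that the separator in the output comes from OFS only once a field has been assigned.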

Your first example does not change anything, so the line is printed exactly as it was read.
Your second example changes a field, so awk rebuilds the line using the default OFS, which is a single space.
To overcome this:
echo "hi,ho"|awk -F, '/hi/{$2="low";print $0}' OFS=","
hi,low

In your BEGIN action, set OFS = FS.

Related

awk command to read a key value pair from a file

I have a file input.txt which stores information in KEY:VALUE form. I'm trying to read GOOGLE_URL from input.txt, but my command prints only https because the separator is :. What is the problem with my grep command, and how should I print the entire URL?
SCRIPT
$> cat script.sh
#!/bin/bash
URL=`grep -e '\bGOOGLE_URL\b' input.txt | awk -F: '{print $2}'`
printf " $URL \n"
INPUT_FILE
$> cat input.txt
GOOGLE_URL:https://www.google.com/
OUTPUT
https
DESIRED_OUTPUT
https://www.google.com/
Since there are multiple : characters in your input, printing $2 will not work in awk because it gives you only the second field. You actually need an equivalent of cut -d: -f2-, but you also need to check the key name that comes before the first :.
This awk should work for you:
awk -F: '$1 == "GOOGLE_URL" {sub(/^[^:]+:/, ""); print}' input.txt
https://www.google.com/
Or this non-regex awk approach that allows you to pass key name from command line:
awk -F: -v k='GOOGLE_URL' '$1==k{print substr($0, length(k FS)+1)}' input.txt
Or using gnu-grep:
grep -oP '^GOOGLE_URL:\K.+' input.txt
https://www.google.com/
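The substr approach above can be wrapped in a small shell function so the key and the file are both parameters. This is only a sketch: the function name get_value and the temporary sample file are hypothetical and not part of the original answers.

```shell
# get_value KEY FILE
# Print everything after the first ":" on lines whose first field equals KEY.
# (get_value is a hypothetical helper name used for illustration.)
get_value() {
    awk -F: -v k="$1" '$1 == k { print substr($0, length(k) + 2) }' "$2"
}

# Sample data: the second line has GOOGLE_URL only in the value position,
# so it must not match.
sample=$(mktemp)
printf 'GOOGLE_URL:https://www.google.com/\nKEY:GOOGLE_URL:\n' > "$sample"

get_value GOOGLE_URL "$sample"
rm -f "$sample"
```

Because the comparison is `$1 == k` rather than a regular expression match, keys containing regex metacharacters are compared literally.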
Could you please try the following, written and tested with the shown samples in GNU awk. It looks for lines starting with GOOGLE_URL: and extracts the URL whether it begins with http or https; if you need only https, change http[s]? to https in the solution below.
awk '/^GOOGLE_URL:/{match($0,/http[s]?:\/\/.*/);print substr($0,RSTART,RLENGTH)}' Input_file
Explanation: a detailed explanation of the above.
awk ' ##Starting awk program from here.
/^GOOGLE_URL:/{ ##If the line starts with GOOGLE_URL: then do the following.
match($0,/http[s]?:\/\/.*/) ##Using match to find http or https (s optional) followed by :// up to the end of the line.
print substr($0,RSTART,RLENGTH) ##Printing the substring that match found, using RSTART and RLENGTH.
}
' Input_file ##Mentioning Input_file name here.
2nd solution: In case you need everything that comes after the first :, try the following.
awk '/^GOOGLE_URL:/{match($0,/:.*/);print substr($0,RSTART+1,RLENGTH-1)}' Input_file
Take your pick:
$ sed -n 's/^GOOGLE_URL://p' file
https://www.google.com/
$ awk 'sub(/^GOOGLE_URL:/,"")' file
https://www.google.com/
The above will work using any sed or awk in any shell on every UNIX box.
I would use GNU AWK following way for that task:
Let file.txt content be:
EXAMPLE_URL:http://www.example.com/
GOOGLE_URL:https://www.google.com/
KEY:GOOGLE_URL:
Then:
awk 'BEGIN{FS="^GOOGLE_URL:"}{if(NF==2){print $2}}' file.txt
will output:
https://www.google.com/
Explanation: in GNU AWK, FS may be a regular expression, so I set it to GOOGLE_URL: anchored (^) to the beginning of the line. That way GOOGLE_URL: in the middle or at the end of a line is not treated as a separator (consider the 3rd line of input). With this FS each line has either 1 or 2 fields; it has 2 only if the line starts with GOOGLE_URL:, so I check the number of fields (NF) and in that case print the 2nd field ($2), because the first field is empty.
(tested in gawk 4.2.1)
Yet another awk alternative:
gawk -F'(^[^:]*:)' '/^GOOGLE_URL:/{ print $2 }' infile


How to use awk to find the line starting with a variable

I know 2 things about awk:
1.
PAT='aGeneName'
awk -v var="$PAT" '$3 ~ var {print $0}' file.txt # will print the line where 3rd field includes the variable $PAT
2.
awk '$3 ~ /^aGeneName/' file.txt # will print the line where 3rd field starts with string "aGeneName"
But what I want is the combination of these two: I want to print the line where the 3rd field starts with the variable $PAT, something like
PAT='aGeneName'
awk -v var="$PAT" '$3 ~ /^var/ {print $0}' file.txt # but this is wrong, since variable can't be put into //
One way is like this:
PAT='aGeneName'
awk -v var="$PAT" '$3 ~ "^" var {print $0}' file.txt
The {print $0} can be omitted here, since printing the line is the default action.
Another way, when var is a plain string with no regex metacharacters in it:
PAT='aGeneName'
awk -v var="$PAT" 'index($3, var)==1' file.txt
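The two approaches above differ as soon as the variable contains a regex metacharacter. The following sketch uses a made-up two-line file and the pattern a.b (both are assumptions for illustration) to show that the regex form treats . as "any character" while index() compares literally.

```shell
# Hypothetical sample data: only the first line has a literal "a.b" prefix in field 3.
tmp=$(mktemp)
printf 'x y a.b1\nx y axb2\n' > "$tmp"

PAT='a.b'

# Dynamic regex: "." matches any character, so both lines are printed.
awk -v var="$PAT" '$3 ~ "^" var' "$tmp"

# Literal prefix test: only the line whose 3rd field starts with "a.b" is printed.
awk -v var="$PAT" 'index($3, var) == 1' "$tmp"

rm -f "$tmp"
```

If the value may contain characters such as . or *, the index() form is the safer choice for a "starts with" test.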

How to print the length size of the following line

I would like to modify a file by including the size of following line using awk.
My file is like this:
>AAAS:1220136:1220159:-:0::NW_015494524.1:1220136-1220159(-)
ATGTCGATGCTCGATC
>AAAS::1215902:1215986:-:1::NW_015494524.1:1215902-1215986(-)
ATGCGATGCTAGCTAGCTCGAT
>AAAS:1215614:1215701:-:1::NW_015494524.1:1215614-1215701(-)
ATGCCGCGACGCAGCACCCGACGCGCAG
I am using awk to modify it to have the following format:
>Assembly_AAAS_1_16
ATGTCGATGCTCGATC
>Assembly_AAAS_2_22
ATGCGATGCTAGCTAGCTCGAT
>Assembly_AAAS_3_28
ATGCCGCGACGCAGCACCCGACGCGCAG
I have used awk to modify the first part.
awk -F":" -v i=1 '/>/{print ">Assembly_" $1 "_" val i "_";i++;next} {print length($0)} 1' infile | sed -e "s/_>/_/g" > outfile
I can use print length($0) but how to print it in the same line?
Thanks
EDIT2: Since the OP has changed the sample data again, adding this code now. The header is built in a separate variable (hdr) so that val keeps its original prefix from one record to the next.
awk -v val="Assembly_AAAS_" '/>/{++i;hdr=">"val i "_";next} {sub(/ +$/,"");print hdr length($0) ORS $0}' Input_file
OR
awk -v val="Assembly_AAAS_" '/>/{++i;hdr=">"val i "_";next} {print hdr length($1) ORS $0;}' Input_file
The first command strips trailing spaces from each sequence line of Input_file before measuring it; if you do not need that, remove the sub(/ +$/,""); part.
EDIT: Changed the solution to match the OP's updated question.
awk -v i=1 -v val=">Assembly_GeneName1_" -v val1="_sizeline" '/>/{value="\047" val i val1;i++;next} {print value length($0) ORS $0}' Input_file
OR
awk -v i=1 -v val=">Assembly_GeneName1_" -v val1="_sizeline" '
/>/{ value="\047" val i val1;
i++;
next}
{
print value length($0) ORS $0
}
' Input_file
The following awk may help you with this.
awk -v i="" -v j=2 '/>/{print "\047>Assembly_GeneName1_"++i"_sizeline"j;j+=2;next} 1' Input_file
2nd solution:
awk -v i=1 -v j=2 -v val=">Assembly_GeneName1_" -v val1="_sizeline" '/>/{print "\047" val i val1 j;j+=2;i++;next} 1' Input_file
What you are dealing with is a beautiful example of records which are not lines. awk is a record parser and by default, a record is defined to be a line. With awk you can define a record to be a block of text using the record separator RS.
RS : The first character of the string value of RS shall be the input record separator; a <newline> by default. If RS contains more
than one character, the results are unspecified. If RS is null, then
records are separated by sequences consisting of a <newline> plus one
or more blank lines, leading or trailing blank lines shall not result
in empty records at the beginning or end of the input, and a <newline>
shall always be a field separator, no matter what the value of FS is.
So the goal is to define the record to be
AAAS:1220136:1220159:-:0::NW_015494524.1:1220136-1220159(-)
ATGTCGATGCTCGATC
And this can be done by defining RS="\n>". Note that a multi-character RS is an extension: GNU awk and mawk treat it as a regular expression, while the text quoted above leaves the result unspecified. Furthermore, we will use \n as the field separator FS. This way you can get the requested length as length($2) and the count by using the record number NR.
A simple awk script is then:
awk 'BEGIN{RS="\n>"; FS=OFS="\n"}
{$1=">Assembly_AAAS_"NR"_"length($2)}
{print $1,$2}' file
This will do exactly what you want.
note: we use print $1,$2 and not print $0 as the last record might have 3 fields (if the last char of the file is a newline). This would imply that you would have an extra empty line at the end of your file.
If you want to pick the AAAS string out of $1, you can use substr($1,1,match($1,":")-1). The first record still carries its leading >, because no newline precedes it, so it is removed first. This results in:
awk 'BEGIN{RS="\n>"; FS=OFS="\n"}
{sub(/^>/,"",$1); $1=">Assembly_"substr($1,1,match($1,":")-1)"_"NR"_"length($2)}
{print $1,$2}' file
Finally, be aware that the above solution reports the wrong length if $2 contains spaces or tabs. To strip them before measuring, you can do this:
awk 'BEGIN{RS="\n>"; FS=OFS="\n"}
{ sub(/^>/,"",$1); gsub(/[[:blank:]]/,"",$2);
$1=">Assembly_"substr($1,1,match($1,":")-1)"_"NR"_"length($2)
}
{ print $1,$2 }' file
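For comparison, the same relabelling can be sketched line by line, without relying on a multi-character RS. This assumes each header is followed by exactly one sequence line, as in the sample; the function name relabel and the temporary file are hypothetical.

```shell
# relabel FILE
# For each ">NAME:..." header, remember NAME and a running count; when the
# sequence line arrives, print the new header with its length, then the line.
# (relabel is a hypothetical helper; assumes one sequence line per header.)
relabel() {
    awk -F: '
        /^>/ { name = substr($1, 2); n++; next }   # drop ">" and keep text before first ":"
             { print ">Assembly_" name "_" n "_" length($0); print }
    ' "$1"
}

sample=$(mktemp)
printf '%s\n' \
    '>AAAS:1220136:1220159:-:0::NW_015494524.1:1220136-1220159(-)' \
    'ATGTCGATGCTCGATC' \
    '>AAAS::1215902:1215986:-:1::NW_015494524.1:1215902-1215986(-)' \
    'ATGCGATGCTAGCTAGCTCGAT' > "$sample"

relabel "$sample"
rm -f "$sample"
```

Since the header is printed only once the sequence line has been read, its length is known at print time and no second pass over the file is needed.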

initialising field separators on condition in awk

I know that initialising FS in BEGIN is the correct practice, but what if I need different field separators for different lines (lines containing a particular pattern)? For example, my awk script is:
{if($0 ~ /.*youtube.*/){FS="=";print $2}}
This code is not processing the first line. How can I fix this?
You can use split. For example, to get the middle value (green) from the third field:
echo "on,cat ,blue|green|red,more" | awk -F, '{split($3,a,"|");print a[2]}'
green
And the BEGIN block is not the only place where you can set the field separator:
echo "on,two,three" | awk -F, '{print $2}'
echo "on,two,three" | awk '{print $2}' FS=,
echo "on,two,three" | awk 'BEGIN{FS=","} {print $2}'
echo "on,two,three" | awk -v FS=, '{print $2}'
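All of these print two.
They differ, however, in when the assignment takes effect. With -F (or -v) the value is already set when the BEGIN block runs:
awk -F, 'BEGIN{print FS}'
,
An assignment placed after the program is processed only when awk reaches it in the argument list, which happens after BEGIN has run. So the following does not print the comma; it prints the default FS (a single space):
awk 'BEGIN{print FS}' FS=,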
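A short sketch of that timing difference: the command below prints FS once in BEGIN and once in the main rule, with the assignment placed after the program. The bracket formatting is only there to make a space visible.

```shell
# The FS=, assignment is an operand after the program, so it is applied
# after BEGIN has run but before the first record is read.
echo "on,two,three" | awk '
    BEGIN { printf "in BEGIN: [%s]\n", FS }
          { printf "in main rule: [%s], $2 is %s\n", FS, $2 }
' FS=,
```

BEGIN reports the default single space, while the main rule already sees the comma and splits the record with it.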
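Back to your problem. Assigning FS inside a rule only affects records read afterwards, because the current line has already been split into fields; that is why your first line is not processed as expected.
So this: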
awk '{if($0 ~ /.*youtube.*/){FS="=";print $2}}' file
should be:
awk '{if($0 ~ /.*youtube.*/){split($0,a,"=");print a[2]}}' file
You do not need the .* before and after the pattern, since a regex match succeeds anywhere in the line, so:
awk '{if($0 ~ /youtube/){split($0,a,"=");print a[2]}}' file
And this can be simplified even further:
awk '/youtube/ {split($0,a,"=");print a[2]}' file
If data is like this:
cat file
youtube=thisisyoutube1 //starts here
youtube=thisisyoutube2
youtube=thisisyoutube3
youtube=thisisyoutube4
yautube=thisisnottobeprinted
Then do like this:
awk -F= '/youtube/ {split($2,a," ");print a[1]}' file
thisisyoutube1
thisisyoutube2
thisisyoutube3
thisisyoutube4
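To see why changing FS inside a rule misses the current line, compare it with split() on a two-line input. This is a sketch assuming an awk that follows POSIX here (GNU awk and mawk do); the sample input is made up.

```shell
# FS is changed while the first record is already split with the default FS,
# so the new separator is used only from the second record onwards.
printf 'a=1\nb=2\n' | awk '{ FS = "="; print $2 }'

# split() works on the current line immediately, regardless of FS.
printf 'a=1\nb=2\n' | awk '{ split($0, parts, "="); print parts[2] }'
```

With a conforming awk the first command prints an empty line followed by 2, while the second prints 1 and 2.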