Find replace "./." in awk - awk

I am very new to using linux and I am trying to find/replace some of the text in my file.
I have successfully been able to find and replace "0/0" using gsub:
awk '{gsub(/0\/0/,"0")}; 1' filename
However, if I try to replace "./." using the same idea
awk '{gsub(/\.\/\./,"U")}; 1' filename
the output is truncated and stops at the location of the first "./." in the file. I know that "." is a special wildcard character, but I thought that having the "\" in front of it would neutralize it. I have searched but have been unable to find an explanation why the formula I used would truncate the file.
Any thoughts would be much appreciated. Thank you.

Recall that the basic outline of an awk program is:
awk 'pattern { action }'
The most common patterns are regexes or tests against line counts:
awk '/FOO/ { do_something_with_a_line_with_FOO_in_it }'
awk 'FNR==10'
The last one has no action so the default is to print the line.
But functions that return a value are also useable as patterns. gsub is a function and returns the number of substitutions.
So given:
$ echo "$txt"
abc./.def line 1
ghk/lmn won't get printed
abc./.def abc./.def printed
To print only lines that have a successful substitution you can do:
$ echo "$txt" | awk 'gsub(/\.\/\./,"U")'
abcUdef line 1
abcUdef abcUdef printed
You do not need to put gsub into an action block here: it runs on every line, and its return value tells you what happened. Since gsub returns the number of substitutions, only the lines with at least one substitution are printed.
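To make the return value visible, here is a small sketch that prints the count gsub returns in front of each line. It reuses the same sample text as above; the variable names txt and counts are only for illustration.

```shell
# Same three sample lines as in the answer above.
txt=$(printf '%s\n' "abc./.def line 1" "ghk/lmn won't get printed" "abc./.def abc./.def printed")

# n holds the number of substitutions gsub made on the current line.
counts=$(printf '%s\n' "$txt" | awk '{ n = gsub(/\.\/\./, "U"); print n ": " $0 }')
printf '%s\n' "$counts"
```

A count of 0 is false when used as a pattern, which is why the middle line is skipped when gsub is used as the pattern on its own.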
If you want every line printed regardless of whether there is a match:
$ echo "$txt" | awk 'gsub(/\.\/\./,"U") || 1'
abcUdef line 1
ghk/lmn won't get printed
abcUdef abcUdef printed
Or, you can use the function as an action with an empty pattern and then a 1 with an empty action:
$ echo "$txt" | awk '{gsub(/\.\/\./,"U")} 1'
abcUdef line 1
ghk/lmn won't get printed
abcUdef abcUdef printed
In either case, 1 as a pattern with no action prints the line regardless of whether there is a match, and the gsub makes the substitution if there is one.
The last form is what you already have. Why it is not working is probably related to your input data.

Your awk script is fine, your input contains control-Ms, probably from being created by a Windows program. You can see them with cat -v file and use dos2unix or similar to remove them.
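If the file does turn out to contain carriage returns, one option that needs no extra tool is to strip them inside awk before substituting. This is only a sketch, assuming CRLF line endings; the two input lines are made up for illustration.

```shell
# Simulate a file with Windows (CRLF) line endings and clean it up in awk.
out=$(printf 'abc./.def line 1\r\nghk/lmn\r\n' |
  awk '{ sub(/\r$/, ""); gsub(/\.\/\./, "U") } 1')
printf '%s\n' "$out"
```

The sub(/\r$/, "") call removes a single trailing carriage return from each line, so the rest of the script sees plain Unix lines.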

Related

Replacing columns of a CSV with a string using awk and gsub

I have an input csv file that looks something like:
Name,Index,Location,ID,Message
Alexis,10,Punggol,4090b43,Production 4090b43
Scott,20,Bedok,bfb34d3,Prevent
Ronald,30,one-north,86defac,Difference 86defac
Cindy,40,Punggol,40d0ced,Central
Eric,50,one-north,aeff08d,Military aeff08d
David,60,Bedok,5d1152d,Study
And I want to write a bash shell script using awk and gsub to replace 6-7 alpha numeric character long strings under the ID column with "xxxxx", with the output in a separate .csv file.
Right now I've got:
#!/bin/bash
awk -F ',' -v OFS=',' '{gsub(/^([a-zA-Z0-9]){6,7}/g, "xxxxx", $4);}1' input.csv > output.csv
But the output I'm getting from running bash myscript.sh input.csv doesn't make any sense. The output.csv file looks like:
Name,Index,Location,ID,Message
Alexis,10,Punggol,4xxxxx9xxxxxb43,Production 4090b43
Scott,20,Bedok,bfb34d3,Prevent
Ronald,30,one-north,86defac,Difference 86defac
Cindy,40,Punggol,4xxxxxdxxxxxced,Central
Eric,50,one-north,aeffxxxxx8d,Military aeff08d
David,60,Bedok,5d1152d,Study
but the expected output csv should look like:
Name,Index,Location,ID,Message
Alexis,10,Punggol,xxxxx,Production 4090b43
Scott,20,Bedok,xxxxx,Prevent
Ronald,30,one-north,xxxxx,Difference 86defac
Cindy,40,Punggol,xxxxx,Central
Eric,50,one-north,xxxxx,Military aeff08d
David,60,Bedok,xxxxx,Study
With your shown sample, please try the following code:
awk -F ',[[:space:]]+' -v OFS=',\t' '
{
sub(/^([a-zA-Z0-9]){6,7}$/, "xxxxx", $4)
$1=$1
}
1
' Input_file | column -t -s $'\t'
Explanation: The field separator is set to a comma followed by whitespace, and the output field separator to a comma followed by a tab. In the 4th field, a value consisting entirely of 6 to 7 alphanumeric characters is replaced with xxxxx. The assignment $1=$1 forces awk to rebuild the line with the new output separator, and the final 1 prints it. The output of awk is then piped to the column command to align it like the sample shown by the OP.
EDIT: If your Input_file is separated only by commas, as in the edited samples, try the following.
awk -F ',' -v OFS=',' '
{
sub(/^([a-zA-Z0-9]){6,7}$/, "xxxxx", $4)
}
1
' Input_file
Note: The OP upgraded from an older version of awk to the latest one, after which this code worked.
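If upgrading awk is not an option, the same check can be written without the {6,7} interval expression. The following is a sketch, not the answer's own code: it assumes the ID is always the 4th field and tests the length and character class separately. Only two of the sample rows are used.

```shell
masked=$(printf '%s\n' \
  'Name,Index,Location,ID,Message' \
  'Alexis,10,Punggol,4090b43,Production 4090b43' \
  'Scott,20,Bedok,bfb34d3,Prevent' |
  awk -F ',' -v OFS=',' '
    # Skip the header, then mask IDs made of 6 or 7 alphanumeric characters.
    FNR > 1 && length($4) >= 6 && length($4) <= 7 && $4 ~ /^[a-zA-Z0-9]+$/ { $4 = "xxxxx" }
    1
  ')
printf '%s\n' "$masked"
```

Because it avoids interval expressions, this form does not depend on whether the installed awk supports them.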
The short answer would be the following:
$ awk 'BEGIN{FS=OFS=","}(FNR>1){$4="xxxxxx"}1' file
This will replace all entries in column 4 by "xxxxxx".
If you only want to change the first 6 to 7 characters of column 4 (and leave it alone if there are only 5 of them), there are a couple of ways:
$ awk 'BEGIN{FS=OFS=","}(FNR>1)&&(length($4)>5){$4="xxxxxx" substr($4,8)}1' file
$ awk 'BEGIN{FS=OFS=","}(FNR>1){sub(/.......?/,"xxxxxx",$4)}1' file
Here, 123456abcde becomes xxxxxxbcde, since the first 7 characters are replaced.
Why is your script failing:
Besides the fact that the approach is wrong, I'll try to explain what the following command does: gsub(/([a-zA-Z0-9]){6,7}/g,"xxxxx",$4)
The notation /abc/g is valid awk syntax, but it does not do what you expect it to do. The notation /abc/ is an ERE-token (an extended regular expression). The notation g is, at this point, nothing more than an undefined variable which defaults to an empty string or zero, depending on its usage. awk will now try to execute the operation /abc/g by first executing /abc/ which means: if my current record ($0) matches the regular expression "abc", return 1 otherwise return 0. So it converts /abc/g into 0g which means to concatenate the content of g to the number 0. For this, it will convert the number 0 to a string "0" and concatenate it with the empty string g. In the end, your gsub command is equivalent to gsub("0","xxxxx",$4) and means to replace all the ZERO's by "xxxxx".
Why do you always get gsub("0","xxxxx",$4) and never gsub("1","xxxxx",$4)? The reason is that your initial regular expression never matches anything in the full record/line ($0). Your regular expression reads /^([a-zA-Z0-9]){6,7}/, and while there are lines that start with 6 or 7 alphanumeric characters, it is likely that your awk does not recognize the extended regular expression notation '{m,n}', which makes the match fail. With GNU awk the output would be different when using --re-interval, which is not enabled by default in old versions of GNU awk.
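The evaluation of /abc/g described above can be observed directly. In this small sketch, result is a made-up variable name and g is deliberately left unset, as in the original command.

```shell
# "abc" matches the record, so /abc/ evaluates to 1; g is an unset (empty) variable.
match_out=$(echo "abc" | awk '{ result = /abc/ g; print result }')

# "xyz" does not match, so /abc/ evaluates to 0.
nomatch_out=$(echo "xyz" | awk '{ result = /abc/ g; print result }')

echo "$match_out $nomatch_out"
```

The string that comes out ("1" or "0") is what gsub then uses as its search pattern.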
I tried to find out why your code behaves like that. For simplicity's sake, I made an example concerning only the gsub you used:
awk 'BEGIN{id="4090b43"}END{gsub(/^([a-zA-Z0-9]){6,7}/g, "xxxxx", id);print id}' emptyfile.txt
output is
4xxxxx9xxxxxb43
after removing g in first argument
awk 'BEGIN{id="4090b43"}END{gsub(/^([a-zA-Z0-9]){6,7}/, "xxxxx", id);print id}' emptyfile.txt
output is
xxxxx
So the g following the regular expression caused the malfunction. I was unable to find a passage in the GNU AWK manual describing what g after / is supposed to do.
(tested in gawk 4.2.1)

Concatenating lines using awk

I have fasta file that contains two gene sequences and what I want to do is remove the fasta header (line starting with ">"), concatenate the rest of the lines and output that sequence
Here is my fasta sequence (genome.fa):
>Potrs164783
AGGAAGTGTGAGATTGAAAAAACATTACTATTGAGGAATTTTTGACCAGATCAGAATTGAACCAACATGATGAAGGGGAT
TGTTTGCCATCAGAATATGGCATGAAATTTCTCCCCTAGATCGGTTCAAGCTCCTGTAGGTTTGGAGTCCTTAGTGAGAA
CTTTCTTAAGAGAATCTAATCTGGTCTGTTCCTCGTCATAAGTTAAAGAAAAACTTGAAACAAATAACAAGCATGCATAA
>Potrs164784
TTACCCTCTACCAGCACCAATGCCTATGATCTTACAAAAATCCTTAATAAAAAGAAATCCAAAACCATTGTTACCATTCC
GGAATTACATTCTGAGATAAAAACCCTCAAATCTGAATTACAATCCCTTAAACAAGCCCAACAAAAAGACTCTGCCATAC
Desired output
AGGAAGTGTGAGATTGAAAAAACATTACTATTGAGGAATTTTTGACCAGATCAGAATTGAACCAACATGATGAAGGGGAT
TGTTTGCCATCAGAATATGGCATGAAATTTCTCCCCTAGATCGGTTCAAGCTCCTGTAGGTTTGGAGTCCTTAGTGAGAA
CTTTCTTAAGAGAATCTAATCTGGTCTGTTCCTCGTCATAAGTTAAAGAAAAACTTGAAACAAATAACAAGCATGCATAA
TTACCCTCTACCAGCACCAATGCCTATGATCTTACAAAAATCCTTAATAAAAAGAAATCCAAAACCATTGTTACCATTCC
GGAATTACATTCTGAGATAAAAACCCTCAAATCTGAATTACAATCCCTTAAACAAGCCCAACAAAAAGACTCTGCCATAC
I am using awk to do this but I am getting this error
awk 'BEGIN{filename="file1"}{if($1 ~ />/){filename=$1; sub(/>/,"",filename); print filename;} print $0 >filename.fa;}' ../genome.fa
awk: syntax error at source line 1
context is
BEGIN{filename="file1"}{if($1 ~ />/){filename=$1; sub(/>/,"",filename); print filename;} print $0 >>> >filename. <<< fa;}
awk: illegal statement at source line 1
I am basically a python person and I was given this script by someone. What am I doing wrong here?
I realized that I was not clear, so I am pasting the whole code that I got from someone. The input file and desired output remain the same.
mkdir split_genome;
cd split_genome;
awk 'BEGIN{filename="file1"}{if($1 ~ />/){filename=$1; sub(/>/,"",filename); print filename;} print $0 >filename.fa;}' ../genome.fa;
ls -1 `pwd`/* > ../scaffold_list.txt;
cd ..;
If all you want to do is produce the desired output shown in your question, other solutions will work.
However, the script you have is trying to print each sequence to a file that is named using its header, and the extension .fa.
The syntax error you're getting is because filename.fa is neither a variable nor a fixed string. No Awk will let you print to filename.fa, because it is neither quoted nor a valid variable (variable names can't contain a .). In addition, BSD Awk does not accept an unparenthesized string concatenation where a file name is expected, whereas GNU Awk does.
So the solution:
print $0 > filename".fa"
would produce the same error in BSD Awk, but would work in GNU Awk.
To fix this, you can append the extension ".fa" to filename at assignment.
This will do the job:
$ awk '{if($0 ~ /^>/) filename=substr($0, 2)".fa"; else print $0 > filename}' file
$ cat Potrs164783.fa
AGGAAGTGTGAGATTGAAAAAACATTACTATTGAGGAATTTTTGACCAGATCAGAATTGAACCAACATGATGAAGGGGAT
TGTTTGCCATCAGAATATGGCATGAAATTTCTCCCCTAGATCGGTTCAAGCTCCTGTAGGTTTGGAGTCCTTAGTGAGAA
CTTTCTTAAGAGAATCTAATCTGGTCTGTTCCTCGTCATAAGTTAAAGAAAAACTTGAAACAAATAACAAGCATGCATAA
$ cat Potrs164784.fa
TTACCCTCTACCAGCACCAATGCCTATGATCTTACAAAAATCCTTAATAAAAAGAAATCCAAAACCATTGTTACCATTCC
GGAATTACATTCTGAGATAAAAACCCTCAAATCTGAATTACAATCCCTTAAACAAGCCCAACAAAAAGACTCTGCCATAC
You'll notice I left out the BEGIN{filename="file1"} declaration statement as it is unnecessary. Also, I replaced the need for sub(...) by using the string function substr as it is more clear and requires fewer actions.
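Another option, not used in the answer above, is to parenthesize the concatenation after the redirection, which removes the ambiguity for every awk. This is a sketch; the sequence names seqA and seqB, the tiny sequences, and the temporary directory are made up for illustration.

```shell
# Work in a scratch directory so the generated .fa files do not clutter anything.
workdir=$(mktemp -d)
printf '%s\n' '>seqA' 'ACGT' 'TTGA' '>seqB' 'GGCC' > "$workdir/sample.fa"

(
  cd "$workdir" &&
  awk '/^>/ { name = substr($0, 2); next }   # remember the header without ">"
       { print > (name ".fa") }              # parentheses make the target unambiguous
      ' sample.fa
)

seqA=$(cat "$workdir/seqA.fa")
seqB=$(cat "$workdir/seqB.fa")
printf '%s\n---\n%s\n' "$seqA" "$seqB"
```

With the parentheses, the whole expression name ".fa" is taken as the output file name, so the extension does not have to be appended at assignment time.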
The awk code that you show attempts to do something different than produce the output that you want. Fortunately, there are much simpler ways to obtain your desired output. For example:
$ grep -v '>' ../genome.fa
AGGAAGTGTGAGATTGAAAAAACATTACTATTGAGGAATTTTTGACCAGATCAGAATTGAACCAACATGATGAAGGGGAT
TGTTTGCCATCAGAATATGGCATGAAATTTCTCCCCTAGATCGGTTCAAGCTCCTGTAGGTTTGGAGTCCTTAGTGAGAA
CTTTCTTAAGAGAATCTAATCTGGTCTGTTCCTCGTCATAAGTTAAAGAAAAACTTGAAACAAATAACAAGCATGCATAA
TTACCCTCTACCAGCACCAATGCCTATGATCTTACAAAAATCCTTAATAAAAAGAAATCCAAAACCATTGTTACCATTCC
GGAATTACATTCTGAGATAAAAACCCTCAAATCTGAATTACAATCCCTTAAACAAGCCCAACAAAAAGACTCTGCCATAC
Alternatively, if you had intended to have all non-header lines concatenated into one line:
$ sed -n '/^>/!H; $!d; x; s/\n//gp' ../genome.fa
AGGAAGTGTGAGATTGAAAAAACATTACTATTGAGGAATTTTTGACCAGATCAGAATTGAACCAACATGATGAAGGGGATTGTTTGCCATCAGAATATGGCATGAAATTTCTCCCCTAGATCGGTTCAAGCTCCTGTAGGTTTGGAGTCCTTAGTGAGAACTTTCTTAAGAGAATCTAATCTGGTCTGTTCCTCGTCATAAGTTAAAGAAAAACTTGAAACAAATAACAAGCATGCATAATTACCCTCTACCAGCACCAATGCCTATGATCTTACAAAAATCCTTAATAAAAAGAAATCCAAAACCATTGTTACCATTCCGGAATTACATTCTGAGATAAAAACCCTCAAATCTGAATTACAATCCCTTAAACAAGCCCAACAAAAAGACTCTGCCATAC
Try this to print the lines that do not start with > joined into one line:
awk '!/^>/{printf "%s", $0}' genome.fa > filename.fa
With each sequence line kept on its own line:
awk '!/^>/' genome.fa > filename.fa
To create single files named by the headers:
awk 'split($0,a,"^>")>1{file=a[2];next}{print >file}' genome.fa
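When joining everything into one line, the output has no trailing newline unless one is added at the end. A minimal sketch with made-up sequence data:

```shell
# Join all non-header lines, then finish the output with a single newline.
joined=$(printf '%s\n' '>seqA' 'ACGT' 'TTGA' '>seqB' 'GGCC' |
  awk '!/^>/ { printf "%s", $0 } END { print "" }')
echo "$joined"
```

Passing the line as an argument to a fixed "%s" format also keeps awk from interpreting any % characters in the data as format specifiers.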

Enclosing a single quote in Awk

I currently have this line of code, which needs to be increased by one every time I run this script. I would like to use awk to increment the third field (570).
'set t 570'
I currently have this to change the code, but the closing quotation mark goes missing. I would also like this to act only on the specific line above, but I am unsure where to place the syntax that awk uses for that.
awk '/set t /{$3+=1} 1' file.gs >file.tmp && mv file.tmp file.gs
Thank you very much for your input.
Use sub() to perform a replacement on the string itself:
$ awk '/set t/ {sub($3+0,$3+1,$3)} 1' file
'set t 571'
This looks for the value in $3 and replaces it with itself +1. To avoid replacing all of $3 and making sure the quote persists in the string, we say $3+0 so that it evaluates to just the number, not the quote:
$ echo "'set t 570'" | awk '{print $3}'
570'
$ echo "'set t 570'" | awk '{print $3+0}'
570
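Note that sub() here acts only on $3 and replaces only the first match in it, so other occurrences of the number on the same line are left alone. Modifying $3 does make awk rebuild the line with single spaces between fields, so any extra whitespace in the original line is lost.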
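An alternative sketch edits the number in place within $0 using match(), which keeps the quotes and the original spacing. It assumes the number to increment is the first run of digits on the matching line; the variable names before, num, and after are made up.

```shell
bumped=$(printf '%s\n' "'set t 570'" |
  awk '/set t [0-9]+/ && match($0, /[0-9]+/) {
         # Split the line around the matched digits and add 1 to them.
         before = substr($0, 1, RSTART - 1)
         num    = substr($0, RSTART, RLENGTH) + 1
         after  = substr($0, RSTART + RLENGTH)
         $0 = before num after
       }
       1')
echo "$bumped"
```

Lines that do not contain "set t" followed by digits pass through unchanged because of the final 1.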

Processing of awk with multiple variable from previous processing?

I have a question about awk processing. I have the file below:
cat test.txt
/home/shhh/
abc.c
/home/shhh/2/
def.c
gthjrjrdj.c
/kernel/sssh
sarawtera.c
wrawrt.h
wearwaerw.h
My goal is to build full paths such as /home/shhh/abc.c by joining each directory line with the file names that follow it.
This is the command I got from someone:
cat test.txt | awk '/^\/.*/{path=$0}/^[a-zA-Z]/{printf("%s/%s\n",path,$0);}'
It works, but I do not understand how to interpret it step by step.
Could you explain how to read it?
Result :
/home/shhh//abc.c
/home/shhh/2//def.c
/home/shhh/2//gthjrjrdj.c
/kernel/sssh/sarawtera.c
/kernel/sssh/wrawrt.h
/kernel/sssh/wearwaerw.h
What you probably want is the following:
$ awk '/^\//{path=$0}/^[a-zA-Z]/ {printf("%s/%s\n",path,$0)}' file
/home/shhh//abc.c
/home/shhh/2//def.c
/home/shhh/2//gthjrjrdj.c
/kernel/sssh/sarawtera.c
/kernel/sssh/wrawrt.h
/kernel/sssh/wearwaerw.h
Explanation
/^\//{path=$0} on lines starting with a /, store it in the path variable.
/^[a-zA-Z]/ {printf("%s/%s\n",path,$0)} on lines starting with a letter, print the stored path together with the current line.
Note you can also say
awk '/^\//{path=$0; next} {printf("%s/%s\n",path,$0)}' file
Some comments
cat file | awk '...' is better written as awk '...' file.
You don't need the ; at the end of a block {} if you are executing just one command. It is implicit.
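The doubled slashes in the output come from directory lines that already end in /. If that matters, the trailing slash can be dropped when the path is stored. A sketch using a few of the sample lines:

```shell
paths=$(printf '%s\n' '/home/shhh/' 'abc.c' '/kernel/sssh' 'wrawrt.h' |
  awk '/^\// { path = $0; sub(/\/$/, "", path); next }   # store directory without trailing /
       { print path "/" $0 }                             # join directory and file name
      ')
printf '%s\n' "$paths"
```

The sub() call only touches the copy in path, so directory lines without a trailing slash are stored unchanged.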

Extract data from ASCII file with grep/AWK

I have a long ASCII log-file from a simulation and need to extract some data from it.
The lines I want have the structure:
Main step= 1 a= 0.00E+00 b=-6.85E-08 c= 4.58E-08
The phrase "Main step" is only used in the lines I want. This is easy to grep for, but I also want to include the next line following the line above, which has the structure:
Fine step= 1 t=-1.31854E+01
Note that "Fine step" is used other places in the log-file.
My question boils down to this: How can I extract the lines containing a keyword/phrase (here "Main step") and also make sure that I get the next following line using grep or AWK or some other standard Linux program?
You can use sed
sed -n '/Main step/,/./p' inputFile
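This prints only the lines in the range that starts at a line matching Main step and ends at the next line matching . (the wildcard, which matches any non-empty line). Effectively, every Main step line and the line following it are printed, provided that following line is not empty.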
Since the question is tagged awk, here is one way using awk's getline function:
awk '/Main step/{print; getline; print}' file
It would print the Main step line and also the next line.
Because you tagged "grep", and since this is the most obvious solution to me:
grep -A1 'Main step' file
...although this will add "--" between matches. So to get the same output as the awk and sed answer:
grep -A1 'Main step' file | grep -v '^--$'
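For completeness, the same result can be obtained in awk without getline by using a flag. This is a sketch: the flag name want is made up, and so are the first and last log lines; the two middle lines come from the question.

```shell
log=$(printf '%s\n' \
  'Fine step= 0 t=-1.40000E+01' \
  'Main step= 1 a= 0.00E+00 b=-6.85E-08 c= 4.58E-08' \
  'Fine step= 1 t=-1.31854E+01' \
  'unrelated output')

# Print each "Main step" line and set a flag; the flag prints exactly one more line.
picked=$(printf '%s\n' "$log" |
  awk '/Main step/ { print; want = 1; next }
       want        { print; want = 0 }')
printf '%s\n' "$picked"
```

The "Fine step" line that is not preceded by a "Main step" line is skipped, since the flag is only set by a match.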