Replacing columns of a CSV with a string using awk and gsub - awk

I have an input csv file that looks something like:
Name,Index,Location,ID,Message
Alexis,10,Punggol,4090b43,Production 4090b43
Scott,20,Bedok,bfb34d3,Prevent
Ronald,30,one-north,86defac,Difference 86defac
Cindy,40,Punggol,40d0ced,Central
Eric,50,one-north,aeff08d,Military aeff08d
David,60,Bedok,5d1152d,Study
And I want to write a bash shell script using awk and gsub to replace the 6-7 character alphanumeric strings under the ID column with "xxxxx", with the output in a separate .csv file.
Right now I've got:
#!/bin/bash
awk -F ',' -v OFS=',' '{gsub(/^([a-zA-Z0-9]){6,7}/g, "xxxxx", $4);}1' input.csv > output.csv
But the output I'm getting from running bash myscript.sh input.csv doesn't make any sense. The output.csv file looks like:
Name,Index,Location,ID,Message
Alexis,10,Punggol,4xxxxx9xxxxxb43,Production 4090b43
Scott,20,Bedok,bfb34d3,Prevent
Ronald,30,one-north,86defac,Difference 86defac
Cindy,40,Punggol,4xxxxxdxxxxxced,Central
Eric,50,one-north,aeffxxxxx8d,Military aeff08d
David,60,Bedok,5d1152d,Study
but the expected output csv should look like:
Name,Index,Location,ID,Message
Alexis,10,Punggol,xxxxx,Production 4090b43
Scott,20,Bedok,xxxxx,Prevent
Ronald,30,one-north,xxxxx,Difference 86defac
Cindy,40,Punggol,xxxxx,Central
Eric,50,one-north,xxxxx,Military aeff08d
David,60,Bedok,xxxxx,Study

With your shown sample, please try the following code:
awk -F ',[[:space:]]+' -v OFS=',\t' '
{
sub(/^([a-zA-Z0-9]){6,7}$/, "xxxxx", $4)
$1=$1
}
1
' Input_file | column -t -s $'\t'
Explanation: Set the field separator to a comma followed by one or more whitespace characters, and the output field separator to a comma followed by a tab. Then, in the 4th field, substitute a value that consists entirely of 6 to 7 alphanumeric characters with xxxxx. $1=$1 rebuilds the record so that the new output separator is applied, and 1 prints the line. Finally, the output of the awk program is piped to the column command to align it like the OP's sample.
EDIT: If your Input_file is separated only by , (as in the edited samples), try the following instead.
awk -F ',' -v OFS=',' '
{
sub(/^([a-zA-Z0-9]){6,7}$/, "xxxxx", $4)
}
1
' Input_file
Note: the OP upgraded from an older awk to the latest version, after which this code worked.
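For an awk that does not understand interval expressions such as {6,7}, the same check can be sketched with length() and a negated bracket expression. This is one possible workaround rather than the answer's original code; the "Tim" row with the 5-character ID abc12 is made up to show a value that must stay unchanged.

```shell
# Mask 6-7 character alphanumeric IDs in column 4 without using {m,n}.
# The "Tim" row is a made-up example of an ID that is too short to be masked.
out=$(printf '%s\n' \
  'Name,Index,Location,ID,Message' \
  'Alexis,10,Punggol,4090b43,Production 4090b43' \
  'Scott,20,Bedok,bfb34d3,Prevent' \
  'Tim,70,Bedok,abc12,Short' |
awk 'BEGIN { FS = OFS = "," }
     # length() stands in for {6,7}; !~ /[^...]/ means "contains no non-alphanumeric character"
     FNR > 1 && length($4) >= 6 && length($4) <= 7 && $4 !~ /[^a-zA-Z0-9]/ { $4 = "xxxxx" }
     1')
printf '%s\n' "$out"
```

Only column 4 is touched, so the IDs repeated in the Message column are left alone.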

The short answer would be the following:
$ awk 'BEGIN{FS=OFS=","}(FNR>1){$4="xxxxxx"}1' file
This will replace all entries in column 4 by "xxxxxx".
If you only want to change the first 6 to 7 characters of column 4 (and not if there are only 5 of them), there are a couple of ways:
$ awk 'BEGIN{FS=OFS=","}(FNR>1)&&(length($4)>5){$4="xxxxxx" substr($4,8)}1' file
$ awk 'BEGIN{FS=OFS=","}(FNR>1){sub(/.......?/,"xxxxxx",$4)}1' file
Here, 123456abcde becomes xxxxxxbcde: the first seven characters are replaced.
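A quick way to check what the substr variant does is to run it on a couple of made-up rows (the names and IDs below are illustrative only). This is a sketch of the first command above, not a new method.

```shell
# Row "Ann" has an 11-character ID, row "Bob" a 5-character one (both made up).
out=$(printf '%s\n' \
  'Name,Index,Location,ID,Message' \
  'Ann,1,Here,123456abcde,note' \
  'Bob,2,There,12345,note' |
awk 'BEGIN { FS = OFS = "," }
     # substr($4, 8) keeps everything from the 8th character on
     FNR > 1 && length($4) > 5 { $4 = "xxxxxx" substr($4, 8) }
     1')
printf '%s\n' "$out"
```

The header and the short ID pass through unchanged because of the FNR>1 and length($4)>5 conditions.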
Why is your script failing:
Besides the fact that the approach is wrong, I'll try to explain what the following command does: gsub(/([a-zA-Z0-9]){6,7}/g,"xxxxx",$4)
The notation /abc/g is valid awk syntax, but it does not do what you expect it to do. The notation /abc/ is an ERE-token (an extended regular expression). The notation g is, at this point, nothing more than an undefined variable which defaults to an empty string or zero, depending on its usage. awk will now try to execute the operation /abc/g by first executing /abc/ which means: if my current record ($0) matches the regular expression "abc", return 1 otherwise return 0. So it converts /abc/g into 0g which means to concatenate the content of g to the number 0. For this, it will convert the number 0 to a string "0" and concatenate it with the empty string g. In the end, your gsub command is equivalent to gsub("0","xxxxx",$4) and means to replace all the ZERO's by "xxxxx".
Why do you always get gsub("0","xxxxx",$4) and never gsub("1","xxxxx",$4)? The reason is that your regular expression never matches anything in the full record/line ($0). Your regular expression reads /^([a-zA-Z0-9]){6,7}/, and while there are lines that start with 6 or 7 alphanumeric characters, it is likely that your awk does not recognize the extended regular expression interval notation '{m,n}', which makes the match fail. With GNU awk the output would be different when using --re-interval, which is not enabled by default in old versions of GNU awk.
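The "0" versus "1" behaviour described above can be reproduced with a small sketch. The id value is made up so that it contains both zeros and ones, and the regex /abc/ is chosen only because it is easy to make $0 match or not match it.

```shell
# /abc/ is evaluated against $0 first (giving 1 or 0), then concatenated with the unset variable g.
id='4090b43 1x1'   # made-up value containing both 0s and 1s
hit=$(echo 'abc' | awk -v id="$id" '{ gsub(/abc/g, "X", id); print id }')    # $0 matches: the regex becomes "1"
miss=$(echo 'zzz' | awk -v id="$id" '{ gsub(/abc/g, "X", id); print id }')   # no match: the regex becomes "0"
printf '%s\n%s\n' "$hit" "$miss"
```

In the first call the ones are replaced, in the second the zeros, which is the pattern seen in the question's output.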

I tried to find out why your code behaves like that. For simplicity's sake I made an example concerning only the gsub you have used:
awk 'BEGIN{id="4090b43"}END{gsub(/^([a-zA-Z0-9]){6,7}/g, "xxxxx", id);print id}' emptyfile.txt
output is
4xxxxx9xxxxxb43
after removing g in first argument
awk 'BEGIN{id="4090b43"}END{gsub(/^([a-zA-Z0-9]){6,7}/, "xxxxx", id);print id}' emptyfile.txt
output is
xxxxx
So the regular expression followed by g caused the malfunction. I was unable to find a relevant passage in the GNU AWK manual about what g after / is supposed to do.
(tested in gawk 4.2.1)

Related

awk command works, but not in openwrt's awk

Works here: awk.js.org/
but not in openwrt's awk, which returns the error message:
awk: bad regex '^(server=|address=)[': Missing ']'
Hello everyone!
I'm trying to use an awk command I wrote which is:
'!/^(server=|address=)[/][[:alnum:]][[:alnum:]-.]+([/]|[/]#)$|^#|^\s*$/ {count++}; END {print count+0}'
Which counts invalid lines in a dns blocklist (oisd in this case):
Input would be eg:
server=/0--foodwarez.da.ru/anyaddress.1.1.1
serverspellerror=/0-000.store/
server=/0-24bpautomentes.hu/
server=/0-29.com/
server=/0-day.us/
server=/0.0.0remote.cryptopool.eu/
server=/0.0mail6.xmrminingpro.com/
server=/0.0xun.cryptopool.space/
Output for this should be "2" since there are two lines that don't match the criteria (correctly formed address, comments, or blank lines).
I've tried formatting the command every which way with [], but can't find anything that works. Does anyone have an idea what format/syntax/option needs adjusting?
Thanks!
To portably include - in a bracket expression it has to be the first or last character, otherwise it means a range, and \s is shorthand for [[:space:]] in only some awks. This will work in any POSIX awk:
$ awk '!/^(server=|address=)[/][[:alnum:]][[:alnum:].-]+([/]|[/]#)$|^#|^[[:space:]]*$/ {count++}; END {print count+0}' file
2
Per @tripleee's comment below, if your awk is broken such that a / inside a bracket expression isn't treated as literal, then you may need this instead:
$ awk '!/^(server=|address=)\/[[:alnum:]][[:alnum:].-]+(\/|\/#)$|^#|^[[:space:]]*$/ {count++}; END {print count+0}' file
2
but get a new awk, e.g. GNU awk, as who knows what other surprises the one you're using may have in store for you!
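As a self-contained check of the hyphen placement, the sketch below feeds a few of the question's lines through the escaped-slash form of the pattern. It spells the character classes out as ASCII ranges ([A-Za-z0-9], [ \t]), which is an assumption about the input made here for illustration; the comment line in the data is also made up.

```shell
count=$(awk '
  # "-" sits last inside the brackets, so it is a literal hyphen rather than a range
  !/^(server=|address=)\/[A-Za-z0-9][A-Za-z0-9.-]+(\/|\/#)$|^#|^[ \t]*$/ { count++ }
  END { print count + 0 }
' <<'EOF'
server=/0--foodwarez.da.ru/anyaddress.1.1.1
serverspellerror=/0-000.store/
server=/0-24bpautomentes.hu/
# a comment line
server=/0.0xun.cryptopool.space/
EOF
)
echo "$count"
```

The first two data lines are the malformed ones, so they are the only lines counted.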
'!/^(server=|address=)[/][[:alnum:]][[:alnum:]-.]+([/]|[/]#)$|^#|^\s*$/ {count++}; END {print count+0}'
- has a special meaning inside [ and ]: it denotes a range, e.g. [A-Z] means an uppercase ASCII letter. Use a \ escape sequence to make it a literal dash. Let the content of file.txt be
server=/0--foodwarez.da.ru/anyaddress.1.1.1
serverspellerror=/0-000.store/
server=/0-24bpautomentes.hu/
server=/0-29.com/
server=/0-day.us/
server=/0.0.0remote.cryptopool.eu/
server=/0.0mail6.xmrminingpro.com/
server=/0.0xun.cryptopool.space/
then
awk '!/^(server=|address=)[/][[:alnum:]][[:alnum:]\-.]+([/]|[/]#)$|^#|^\s*$/ {count++}; END {print count+0}' file.txt
gives output
2
You might also consider replacing \s with [[:space:]] in order to maintain consistency.
(tested in GNU Awk 5.0.1)

Match regexp at the end of the string with AWK

I am trying to match two different regexps against long strings with awk, removing the part of the string that matches within a 35-character window.
The problem is that the same code works when I am looking for the first one (which matches at the beginning) but fails to match the second one (at the end of the string).
Input:
Regexp1(1)(2)(3)(4)(5)xxxxxxxxxxxxxxx(20)(21)(22)(23)Regexp2
Desired output
(1)(2)(3)(4)(5)xxxxxxxxxxxxxxx(20)(21)(22)(23)
So far I have used this code, which extracts Regexp1 correctly but is unfortunately not able to extract Regexp2, since the RSTART and RLENGTH indexes for Regexp2 are incorrect.
Code for extracting Regexp1 (correct output):
awk -v F="Regexp1" '{if (match(substr($1,1,35),F)) print substr($1,RSTART,RLENGTH)}' file
Code for extracting Regexp2 (wrong output)
awk -v F="Regexp2" '{if (match(substr($1,length($1)-35,35),F)) print substr($1,RSTART,RLENGTH)}' file
Although the indexes for Regexp1 are correct, the indexes for Regexp2 are wrong (RSTART=13). I cannot figure out how to extract the second regexp.
Assuming your actual Input_file looks like the shown sample, could you please try the following (a recent version of awk is advisable, since old versions may not support interval expressions such as {5} in a regex).
awk '
match($0,/(\([0-9]+\)){5}.*(\([0-9]+\)){4}/){
print substr($0,RSTART,RLENGTH)
}' Input_file
In case the number of parenthesized values is not fixed, you could do the following:
awk '
match($0,/(\([0-9]+\))+.*(\([0-9]+\))+/){
print substr($0,RSTART,RLENGTH)
}' Input_file
If this isn't all you need:
$ sed 's/Regexp1\(.*\)Regexp2/\1/' file
(1)(2)(3)(4)(5)xxxxxxxxxxxxxxx(20)(21)(22)(23)
or using GNU awk for gensub():
$ awk '{print gensub(/Regexp1(.*)Regexp2/,"\\1",1)}' file
(1)(2)(3)(4)(5)xxxxxxxxxxxxxxx(20)(21)(22)(23)
then edit your question to be far clearer with your requirements and example.
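If the 35-character windows from the question need to be kept, one way to sketch it is to remember where the window starts and add that offset back to RSTART, since match() reports positions relative to the substring it was given. The helper name strip_ends and its argument order are made up for this illustration.

```shell
strip_ends() {  # hypothetical helper: $1 = head regex, $2 = tail regex, $3 = window size
  awk -v head="$1" -v tail="$2" -v w="$3" '{
    s = $1
    # head: the window starts at position 1, so RSTART needs no correction
    if (match(substr(s, 1, w), head))
      s = substr(s, 1, RSTART - 1) substr(s, RSTART + RLENGTH)
    # tail: the window starts at off, so RSTART is relative to off
    off = length(s) - w + 1
    if (off < 1) off = 1
    if (match(substr(s, off), tail)) {
      p = off + RSTART - 1            # position in the full string
      s = substr(s, 1, p - 1) substr(s, p + RLENGTH)
    }
    print s
  }'
}

out=$(echo 'Regexp1(1)(2)(3)(4)(5)xxxxxxxxxxxxxxx(20)(21)(22)(23)Regexp2' |
  strip_ends 'Regexp1' 'Regexp2' 35)
printf '%s\n' "$out"
```

The tail window is taken after the head has been removed, so its offset is computed from the shortened string.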

How to filter the OTU by counts with AWK?

I am trying to filter all the singletons from a fasta file.
Here is my input file:
>OTU1;size=3;
ATTCCCCGGGGGGG
>OTU2;size=1;
ATCCGGGACTGATC
>OTU3;size=5;
GAACTATCGGGTAA
>OTU4;size=1;
AATTGGCCATCT
The expected output is:
>OTU1;size=3;
ATTCCCCGGGGGGG
>OTU3;size=5;
GAACTATCGGGTAA
I've tried
awk -F'>' '{if($1>=2) {print $0}' input.fasta > ouput.fasta
but this will remove all the header for each OTU.
Anyone could help me out?
Could you please try following.
awk -F'[=;]' '/^>/{flag=""} $3>=2{flag=1} flag' Input_file
$ awk '/>/{f=/=1;/}!f' file
>OTU1;size=3;
ATTCCCCGGGGGGG
>OTU3;size=5;
GAACTATCGGGTAA
awk -v FS='[;=]' 'prev_sz>=2 && !/size/{print prev RS $0} /size/{prev=$0;prev_sz=$(NF-1)}' file
>OTU1;size=3;
ATTCCCCGGGGGGG
>OTU3;size=5;
GAACTATCGGGTAA
Store the size from each header line in the prev_sz variable and the whole line in prev. If the size is >= 2, print the previous (header) line and the current line. RS is used to print the newline between them.
While all the above methods work, they are limited to the fact that input always has to look the same. I.e the sequence-name in your fasta-file needs to have the form:
>NAME;size=value;
A few solutions can handle a bit more extended sequence-names, but none handle the case where things go a bit more generic, i.e.
>NAME;label1=value1;label2=value2;STRING;label3=value3;
Print sequence where label xxx matches value vvv:
awk '/>/{f = /;xxx=vvv;/}f' file.fasta
Print sequence where label xxx has a numeric value p bigger than q:
awk -v label="xxx" -v limit=q \
'BEGIN{ere=";" label "="}
/>/{ f=0; match($0,ere);value=0+substr($0,RSTART+length(ere)); f=(value>limit)}
f' file.fasta
In the above, ere is the regular expression we try to match. We use it to find the location of the value attached to label xxx. The substring will have non-numeric characters after its value, but adding 0 to it converts it to a number, dropping everything from the first non-numeric character on (i.e. 3;label4=value4; is converted to 3). We check whether the value is bigger than our limit, and print the sequence based on that result.
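A variant of the label filter that also copes with headers lacking the label can be sketched as follows. The function name filter_fasta and the header >OTU5; without a size are made up for this illustration, and index() is used instead of match() so the label is searched as a plain string.

```shell
filter_fasta() {  # hypothetical helper: $1 = label, $2 = limit; reads FASTA on stdin
  awk -v label="$1" -v limit="$2" '
    BEGIN { key = ";" label "=" }
    /^>/ {
      keep = 0
      pos = index($0, key)            # plain string search, 0 when the label is absent
      if (pos) keep = (substr($0, pos + length(key)) + 0 > limit)
    }
    keep                              # print header and sequence lines of kept records
  '
}

out=$(printf '%s\n' \
  '>OTU1;size=3;' 'ATTCCCCGGGGGGG' \
  '>OTU2;size=1;' 'ATCCGGGACTGATC' \
  '>OTU5;'        'GGGG' \
  '>OTU3;size=5;' 'GAACTATCGGGTAA' | filter_fasta size 1)
printf '%s\n' "$out"
```

Records whose header has no size label are dropped here; whether that is the desired policy depends on the data.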

Find replace "./." in awk

I am very new to using linux and I am trying to find/replace some of the text in my file.
I have successfully been able to find and replace "0/0" using gsub:
awk '{gsub(/0\/0/,"0")}; 1' filename
However, if I try to replace "./." using the same idea
awk '{gsub(/\.\/\./,"U")}; 1' filename
the output is truncated and stops at the location of the first "./." in the file. I know that "." is a special wildcard character, but I thought that having the "\" in front of it would neutralize it. I have searched but have been unable to find an explanation why the formula I used would truncate the file.
Any thoughts would be much appreciated. Thank you.
Recall that the basic outline of an awk program is:
awk 'pattern { action }'
The most common patterns are regexes or tests against line counts:
awk '/FOO/ { do_something_with_a_line_with_FOO_in_it }'
awk 'FNR==10'
The last one has no action so the default is to print the line.
But functions that return a value are also useable as patterns. gsub is a function and returns the number of substitutions.
So given:
$ echo "$txt"
abc./.def line 1
ghk/lmn won't get printed
abc./.def abc./.def printed
To print only lines that have a successful substitution you can do:
$ echo "$txt" | awk 'gsub(/\.\/\./,"U")'
abcUdef line 1
abcUdef abcUdef printed
You do not need to put gsub into an action block, since you want to run it on every line and the return value tells you what happened. The lines that are successfully matched are printed, since gsub returns the number of substitutions.
If you want every line printed regardless of whether there is a match:
$ echo "$txt" | awk 'gsub(/\.\/\./,"U") || 1'
abcUdef line 1
ghk/lmn won't get printed
abcUdef abcUdef printed
Or, you can use the function as an action with an empty pattern and then a 1 with an empty action:
$ echo "$txt" | awk '{gsub(/\.\/\./,"U")} 1'
abcUdef line 1
ghk/lmn won't get printed
abcUdef abcUdef printed
In either case, 1 as a pattern with no action prints the line regardless of whether there is a match, and the gsub makes the substitution, if any.
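Because gsub returns the number of substitutions, that return value can also be accumulated. A minimal sketch, using made-up lines modelled on the sample text:

```shell
# Sum the per-line return values of gsub to count every "./." in the input.
total=$(awk '{ n += gsub(/\.\/\./, "U") } END { print n + 0 }' <<'EOF'
abc./.def line 1
ghk/lmn no match here
abc./.def abc./.def two matches
EOF
)
echo "$total"
```

The n + 0 in the END block makes sure a number is printed even when nothing matched.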
The second awk is what you have. Why it is not working on your input data is probably related to your input data.
Your awk script is fine, your input contains control-Ms, probably from being created by a Windows program. You can see them with cat -v file and use dos2unix or similar to remove them.
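If converting the file first is not convenient, the carriage returns could also be stripped inside awk before the substitution. This is a sketch under the assumption that each line ends in a single CR; the input is simulated with printf.

```shell
# Simulate a Windows-style file with \r\n line endings, strip the CR, then substitute.
out=$(printf 'abc./.def line 1\r\nghk/lmn\r\n' |
  awk '{ sub(/\r$/, ""); gsub(/\.\/\./, "U") } 1')
printf '%s\n' "$out"
```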

Concatenating lines using awk

I have fasta file that contains two gene sequences and what I want to do is remove the fasta header (line starting with ">"), concatenate the rest of the lines and output that sequence
Here is my fasta sequence (genome.fa):
>Potrs164783
AGGAAGTGTGAGATTGAAAAAACATTACTATTGAGGAATTTTTGACCAGATCAGAATTGAACCAACATGATGAAGGGGAT
TGTTTGCCATCAGAATATGGCATGAAATTTCTCCCCTAGATCGGTTCAAGCTCCTGTAGGTTTGGAGTCCTTAGTGAGAA
CTTTCTTAAGAGAATCTAATCTGGTCTGTTCCTCGTCATAAGTTAAAGAAAAACTTGAAACAAATAACAAGCATGCATAA
>Potrs164784
TTACCCTCTACCAGCACCAATGCCTATGATCTTACAAAAATCCTTAATAAAAAGAAATCCAAAACCATTGTTACCATTCC
GGAATTACATTCTGAGATAAAAACCCTCAAATCTGAATTACAATCCCTTAAACAAGCCCAACAAAAAGACTCTGCCATAC
Desired output
AGGAAGTGTGAGATTGAAAAAACATTACTATTGAGGAATTTTTGACCAGATCAGAATTGAACCAACATGATGAAGGGGAT
TGTTTGCCATCAGAATATGGCATGAAATTTCTCCCCTAGATCGGTTCAAGCTCCTGTAGGTTTGGAGTCCTTAGTGAGAA
CTTTCTTAAGAGAATCTAATCTGGTCTGTTCCTCGTCATAAGTTAAAGAAAAACTTGAAACAAATAACAAGCATGCATAA
TTACCCTCTACCAGCACCAATGCCTATGATCTTACAAAAATCCTTAATAAAAAGAAATCCAAAACCATTGTTACCATTCC
GGAATTACATTCTGAGATAAAAACCCTCAAATCTGAATTACAATCCCTTAAACAAGCCCAACAAAAAGACTCTGCCATAC
I am using awk to do this but I am getting this error
awk 'BEGIN{filename="file1"}{if($1 ~ />/){filename=$1; sub(/>/,"",filename); print filename;} print $0 >filename.fa;}' ../genome.fa
awk: syntax error at source line 1
context is
BEGIN{filename="file1"}{if($1 ~ />/){filename=$1; sub(/>/,"",filename); print filename;} print $0 >>> >filename. <<< fa;}
awk: illegal statement at source line 1
I am basically a python person and I was given this script by someone. What am I doing wrong here?
I realized that i was not clear and so i am pasting the whole code that i got from someone. The input file and desired output remains the same
mkdir split_genome;
cd split_genome;
awk 'BEGIN{filename="file1"}{if($1 ~ />/){filename=$1; sub(/>/,"",filename); print filename;} print $0 >filename.fa;}' ../genome.fa;
ls -1 `pwd`/* > ../scaffold_list.txt;
cd ..;
If all you want to do is produce the desired output shown in your question, other solutions will work.
However, the script you have is trying to print each sequence to a file that is named using its header, and the extension .fa.
The syntax error you're getting is because filename.fa is neither a variable nor a fixed string. While no Awk will allow you to print to filename.fa because it is neither in quotes nor a variable (variable names can't have a . in them), BSD Awk does not allow manipulating strings when they currently act as a file name where GNU Awk does.
So the solution:
print $0 > filename".fa"
would produce the same error in BSD Awk, but would work in GNU Awk.
To fix this, you can append the extension ".fa" to filename at assignment.
This will do the job:
$ awk '{if($0 ~ /^>/) filename=substr($0, 2)".fa"; else print $0 > filename}' file
$ cat Potrs164783.fa
AGGAAGTGTGAGATTGAAAAAACATTACTATTGAGGAATTTTTGACCAGATCAGAATTGAACCAACATGATGAAGGGGAT
TGTTTGCCATCAGAATATGGCATGAAATTTCTCCCCTAGATCGGTTCAAGCTCCTGTAGGTTTGGAGTCCTTAGTGAGAA
CTTTCTTAAGAGAATCTAATCTGGTCTGTTCCTCGTCATAAGTTAAAGAAAAACTTGAAACAAATAACAAGCATGCATAA
$ cat Potrs164784.fa
TTACCCTCTACCAGCACCAATGCCTATGATCTTACAAAAATCCTTAATAAAAAGAAATCCAAAACCATTGTTACCATTCC
GGAATTACATTCTGAGATAAAAACCCTCAAATCTGAATTACAATCCCTTAAACAAGCCCAACAAAAAGACTCTGCCATAC
You'll notice I left out the BEGIN{filename="file1"} declaration statement as it is unnecessary. Also, I replaced the need for sub(...) by using the string function substr as it is more clear and requires fewer actions.
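Another way to sidestep the ambiguity is to parenthesize the file name expression after >, and to close each file once its record is finished. The sketch below uses a made-up miniature input (seqA, seqB) in a temporary directory; it assumes every header is unique, since reopening a closed file with > would truncate it.

```shell
tmp=$(mktemp -d)   # temporary directory so the demo does not write into the current one
printf '>seqA\nAAAA\nCCCC\n>seqB\nGGGG\n' > "$tmp/genome.fa"   # made-up miniature input
(
  cd "$tmp" &&
  awk '
    /^>/ { if (name != "") close(name ".fa"); name = substr($0, 2); next }
    name != "" { print > (name ".fa") }   # parentheses keep the concatenation unambiguous
  ' genome.fa
)
cat "$tmp/seqA.fa"
```

Closing finished files keeps the number of open file descriptors small when the input holds many sequences.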
The awk code that you show attempts to do something different than produce the output that you want. Fortunately, there are much simpler ways to obtain your desired output. For example:
$ grep -v '>' ../genome.fa
AGGAAGTGTGAGATTGAAAAAACATTACTATTGAGGAATTTTTGACCAGATCAGAATTGAACCAACATGATGAAGGGGAT
TGTTTGCCATCAGAATATGGCATGAAATTTCTCCCCTAGATCGGTTCAAGCTCCTGTAGGTTTGGAGTCCTTAGTGAGAA
CTTTCTTAAGAGAATCTAATCTGGTCTGTTCCTCGTCATAAGTTAAAGAAAAACTTGAAACAAATAACAAGCATGCATAA
TTACCCTCTACCAGCACCAATGCCTATGATCTTACAAAAATCCTTAATAAAAAGAAATCCAAAACCATTGTTACCATTCC
GGAATTACATTCTGAGATAAAAACCCTCAAATCTGAATTACAATCCCTTAAACAAGCCCAACAAAAAGACTCTGCCATAC
Alternatively, if you had intended to have all non-header lines concatenated into one line:
$ sed -n '/^>/!H; $!d; x; s/\n//gp' ../genome.fa
AGGAAGTGTGAGATTGAAAAAACATTACTATTGAGGAATTTTTGACCAGATCAGAATTGAACCAACATGATGAAGGGGATTGTTTGCCATCAGAATATGGCATGAAATTTCTCCCCTAGATCGGTTCAAGCTCCTGTAGGTTTGGAGTCCTTAGTGAGAACTTTCTTAAGAGAATCTAATCTGGTCTGTTCCTCGTCATAAGTTAAAGAAAAACTTGAAACAAATAACAAGCATGCATAATTACCCTCTACCAGCACCAATGCCTATGATCTTACAAAAATCCTTAATAAAAAGAAATCCAAAACCATTGTTACCATTCCGGAATTACATTCTGAGATAAAAACCCTCAAATCTGAATTACAATCCCTTAAACAAGCCCAACAAAAAGACTCTGCCATAC
Try this to print the lines that do not start with > on a single line:
awk '!/^>/{printf "%s", $0}' genome.fa > filename.fa
With the newlines kept:
awk '!/^>/' genome.fa > filename.fa
To create single files named by the headers:
awk 'split($0,a,"^>")>1{file=a[2];next}{print >file}' genome.fa
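When joining everything onto one line, it can help to end the output with a newline and to pass the data through an explicit "%s" format so that any % in the input is not interpreted. A small sketch with made-up sequences:

```shell
# Join all non-header lines, then emit one final newline in the END block.
out=$(printf '>h1\nAAAA\nCCCC\n>h2\nGGGG\n' |
  awk '!/^>/ { printf "%s", $0 } END { print "" }')
printf '%s\n' "$out"
```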