Awk: Using invert match to a string and then substitute characters - awk

I want to extract lines that don't contain # and delete ", ; in the output.
My input FILE looks like this:
# ;string"1"
# string"2";
string"3";
Can use grep and tr to get wanted output:
grep -v '#' FILE | tr -d ';"'
string3
However I want to use awk.
I can extract invert match awk '!/#/' FILE, but how can I use sub to delete ", ; in the same awk command?

You can use gsub for global substitution:
awk '!/#/{gsub(/[";]/,"",$0);print}'
The following transcript shows this in action, it delivers the same results as your grep/tr pipeline:
pax> echo '# ;string"1"
# string"2";
string"3";' | awk '!/#/{gsub(/[";]/,"",$0);print}{}'
string3
Note that the final {} may not be necessary in some implementations of awk but it's there to stop output of non-matching lines in those implementations (usually older ones) that do it automatically for lines matching none of the rules.

Use gsub instead which would replace all matches not just one:
awk '/#/{next}{gsub(/[";]/,"")}1' file
Output:
string3
Skipping the third parameter to gsub makes it process $0 by default.
/#/{next} makes it skip lines containing #
1 makes it print $0

Another awk version
awk -F"[\";]" '{$1=$1} !/^#/' OFS= file
string3
awk '{gsub(/[";]/,x)} !/^#/' file
string3
The x represents nothing. Could also have used "", but saves one characters :)

If you want to give sed a chance:
sed -n '/^[^#]/s/[";]//gp' file
string3

Related

Extracting and rearranging columns

I read from stdin lines which contain fields. The field delimiter is a semicolon. There are no specific quoting characters in the input (i.e. the fields can't contain themselves semicolons or newline characters). The number of the input fields is unknown, but it is at least 4.
The output is supposed to be a similar file, consisting of the fields from 2 to the end, but field 2 and 3 reversed in order.
I'm using zsh.
I came up with a solution, but find it clumsy. In particular, I could not think of anything specific to zsh which would help me here, so basically I reverted to awk. This is my approach:
awk -F ';' '{printf("%s", $3 ";" $2); for(i=4;i<=NF;i++) printf(";%s", $i); print "" }' <input_file >output_file
The first printf takes care about the two reversed fields, and then I use an explicit loop to write out the remaining fields. Is there a possibility in awk (or gawk) to print a range of fields in a single command? Or did I miss some incredibly clever feature in zsh, which could make my life simpler?
UPDATE: Example input data
a;bb;c;D;e;fff
gg;h;ii;jj;kk;l;m;n
Should produce the output
c;bb;D;e;fff
ii;h;jj;kk;l;m;n
Using any awk in any shell on every Unix box:
$ awk 'BEGIN{FS=OFS=";"} {t=$3; $3=$2; $2=t; sub(/[^;]*;/,"")} 1' file
c;bb;D;e;fff
ii;h;jj;kk;l;m;n
With GNU awk you could try following code. Using match function ogf GNU awk, where using regex ^[^;]*;([^;]*;)([^;]*;)(.*)$ to catch the values as per requirement, this is creating 3 capturing groups; whose values are getting stored into array named arr(GNU awk's functionality) and then later in program printing values as per requirement.
Here is the Online demo for used regex.
awk 'match($0,/^[^;]*;([^;]*;)([^;]*;)(.*)$/,arr){
print arr[2] arr[1] arr[3]
}
' Input_file
If perl is accepted, it provides a join() function to join elements on a delimiter. In awk though you'd have to explicitly define one (which isn't complex, just more lines of code)
perl -F';' -nlae '$t = #F[2]; #F[2] = #F[1]; $F[1] = $t; print join(";", #F[1..$#F])' file
With sed, perl, hck and rcut (my own script):
$ sed -E 's/^[^;]+;([^;]+);([^;]+)/\2;\1/' ip.txt
c;bb;D;e;fff
ii;h;jj;kk;l;m;n
# can also use: perl -F';' -lape '$_ = join ";", #F[2,1,3..$#F]' ip.txt
$ perl -F';' -lane 'print join ";", #F[2,1,3..$#F]' ip.txt
c;bb;D;e;fff
ii;h;jj;kk;l;m;n
# -d and -D specifies input/output separators
$ hck -d';' -D';' -f3,2,4- ip.txt
c;bb;D;e;fff
ii;h;jj;kk;l;m;n
# syntax similar to cut, but output field order can be different
$ rcut -d';' -f3,2,4- ip.txt
c;bb;D;e;fff
ii;h;jj;kk;l;m;n
Note that the sed version will preserve input lines with less than 3 fields.
$ cat ip.txt
1;2;3
apple;fig
abc
$ sed -E 's/^[^;]+;([^;]+);([^;]+)/\2;\1/' ip.txt
3;2
apple;fig
abc
$ perl -F';' -lane 'print join ";", #F[2,1,3..$#F]' ip.txt
3;2
;fig
;
Another awk variant:
awk 'BEGIN{FS=OFS=";"} {$1=$3; $3=""; sub(/;;/, ";")} 1' file
c;bb;D;e;fff
ii;h;jj;kk;l;m;n
With gnu awk and gensub switching the position of 2 capture groups:
awk '{print gensub(/^[^;]*;([^;]*);([^;]*)/, "\\2;\\1", 1)}' file
The pattern matches
^ Start of string
[^;]*; Negated character class, match optional chars other than ; and then match ;
([^;]*);([^;]*) 2 capture groups, both capturing chars other than ; and match ; in between
Output
c;bb;D;e;fff
ii;h;jj;kk;l;m;n
awk '{print $3, $0}' {,O}FS=\; < file | cut -d\; -f1,3,5-
This uses awk to prepend the third column, then pipes to cut to extract the desired columns.
Here is one way to do it using only zsh:
rearrange() {
local -a lines=(${(#f)$(</dev/stdin)})
for line in $lines; do
local -a flds=(${(s.;.)line})
print $flds[3]';'$flds[2]';'${(j.;.)flds[4,-1]}
done
}
The same idea in a single line. This may not be an improvement over your awk script:
for l in ${(#f)$(<&0)}; print ${${(A)i::=${(s.;.)l}}[3]}\;$i[2]\;${(j.;.)i:3}
Some of the pieces:
$(</dev/stdin) - read from stdin using pseudo-device.
$(<&0) - another way to read from stdin.
(f) - parameter expansion flag to split by newlines.
(#) - treat split as an array.
(s.;.) - split by semicolon.
$flds[3] - expands to the third array element.
$flds[4,-1] - fourth, fifth, etc. array elements.
$i:3 - ksh-style array slice for fourth, fifth ... elements.
Mixing styles like this can be confusing, even if it is slightly shorter.
(j.;.) - join array by semicolon.
i::= - assign the result of the expansion to the variable i.
This lets us use the semicolon-split fields later.
(A)i::= - the (A) flag ensures i is an array.

Extract substring from a field with single awk in AIX

I have a file file with content like:
stringa 8.0.1.2 stringx
stringb 12.01.0.0 stringx
I have to get a substring from field 2 (first two values with the dot).
I am currently doing cat file | awk '{print $2}' | awk -F. '{print $1"."$2}' and getting the expected output:
8.0
12.01
The query is how to do this with single awk?
I have tried with match() but not seeing an option for a back reference.
Any help would be appreciated.
You can do something like this.
$ awk '{ split($2,str,"."); print str[1]"."str[2] }' file
8.0
12.01
Also, keep in mind that your cat is not needed. Simply give the file directly to awk.
With GNU grep please try following command once.
grep -oP '^\S+\s+\K[[:digit:]]+\.[[:digit:]]+' Input_file
Explanation: Using GNU grep here. Using its -oP options to print matched part and enable PCRE with -P option here. In main program, matching from starting non-space characters followed by 1 or more spaces, then using \K option to forget that match. Then matching 1 or more digits occurrences followed by a dot; which is further followed by digits. If a match is found then it prints matched value.
I would use GNU AWK's split function as follow, let file.txt content be
stringa 8.0.1.2 stringx
stringb 12.01.0.0 stringx
then
awk '{split($2,arr,".");print arr[1]"."arr[2]}' file.txt
output
8.0
12.01
Explantion: split at . 2nd field and put elements into array arr.
(tested in gawk 4.2.1)
You could match digits . digits from the second column and print if there is a match:
awk 'match($2, /^[[:digit:]]+\.[[:digit:]]+/) {
print substr($2, RSTART, RLENGTH)
}
' file
Output
8.0
12.01
Also with GNU awk and gensub():
awk '{print gensub(/([[:digit:]]+[.][[:digit:]]+)(.*)/,"\\1","g",$2)}' file
8.0
12.01
gensub() provides the ability to specify components of a regexp in the replacement text using parentheses in the regexp to mark the components and then specifying \\n in the replacement text, where n is a digit from 1 to 9.
You should perhaps not use awk at all (or any other external program, for that matter) but rely on the field-splitting capabilities of the shell and some variable expansion. For instance:
# printf "%s\n%s\n" "stringa 8.0.1.2 stringx" \
"stringb 12.01.0.0 stringx" |\
while read first second third junk ; do
printf "=%s= =%s= =%s=\n" "$first" "$second" "$third"
done
=stringa= =8.0.1.2= =stringx=
=stringb= =12.01.0.0= =stringx=
As you can see the value is captured in the variable "$second" already and you just need to further isolate the parts you want to see - the first and second part separated by a dot. You can do that either with parameter expansion:
# variable="8.0.1.2"
# echo ${variable%.*.*}
8.0
or like this:
# variable="12.01.0.0"
# echo ${variable%${variable#*.*.}}
12.01
or you can use a further read-statement to separate the parts and then put them back together:
# variable="12.01.0.0"
# echo ${variable} | IFS=. read parta partb junk
# echo ${parta}.${partb}
12.01
So, putting all together:
# printf "%s\n%s\n" "stringa 8.0.1.2 stringx" \
"stringb 12.01.0.0 stringx" |\
while read first second third junk ; do
printf "%s\n" "$second" | IFS=. read parta partb junk
printf "%s.%s\n" "$parta" "$partb"
done
8.0
12.01

Strip last field

My script will be receiving various lengths of input and I want to strip the last field separated by a "/". An example of the input I will be dealing with is.
this/that/and/more
But the issue I am running into is that the length of the input will vary like so:
this/that/maybe/more/and/more
or/even/this/could/be/it/and/maybe/more
short/more
In any case, the expected output should be the whole string minus the last "/more".
Note: The word "more" will not be a constant these are arbitrary examples.
Example input:
this/that/and/more
this/that/maybe/more/and/more
Expected output:
this/that/and
this/that/maybe/more/and
What I know works for a string you know the length of would be
cut -d'/' -f[x]
With what I need is a '/' delimited AWK command I'm assuming like:
awk '{$NF=""; print $0}'
With awk as requested:
$ awk '{sub("/[^/]*$","")} 1' file
this/that/maybe/more/and
or/even/this/could/be/it/and/maybe
short
but this is the type of job sed is best suited for:
$ sed 's:/[^/]*$::' file
this/that/maybe/more/and
or/even/this/could/be/it/and/maybe
short
The above were run against this input file:
$ cat file
this/that/maybe/more/and/more
or/even/this/could/be/it/and/maybe/more
short/more
Depending on how you have the input in your script, bash's Shell Parameter Expansion may be convenient:
$ s1=this/that/maybe/more/and/more
$ s2=or/even/this/could/be/it/and/maybe/more
$ s3=short/more
$ echo ${s1%/*}
this/that/maybe/more/and
$ echo ${s2%/*}
or/even/this/could/be/it/and/maybe
$ echo ${s3%/*}
short
(Lots of additional info on parameter expansion at https://www.gnu.org/software/bash/manual/html_node/Shell-Parameter-Expansion.html)
In your script, you could create a loop that removes the last character in the input string if it is not a slash through each iteration. Then, when the loop finds a slash character, exit the loop then remove the final character (which is supposed to be a slash).
Pseudo-code:
while (lastCharacter != '/') {
removeLastCharacter();
}
removeLastCharacter(); # removes the slash
(Sorry, it's been a while since I wrote a bash script.)
Another awk alternative using fields instead of regexs
awk -F/ '{printf "%s", $1; for (i=2; i<NF; i++) printf "/%s", $i; printf "\n"}'
Here is an alternative shell solution:
while read -r path; do dirname "$path"; done < file

Grep part of string after symbol and shuffle columns

I would like to take the number after the - sign and put is as column 2 in my matrix. I know how to grep the string but not how to print it after the text string.
in:
1-967764 GGCTGGTCCGATGGTAGTGGGTTATCAGAACT
3-425354 GCATTGGTGGTTCAGTGGTAGAATTCTCGCC
4-376323 GGCTGGTCCGATGGTAGTGGGTTATCAGAAC
5-221398 GGAAGAGCACACGTCTGAACTCCAGTCACGTGAAAATCTCGTATGCCGTCT
6-180339 TCCCTGGTGGTCTAGTGGTTAGGATTCGGCGCT
out:
GGCTGGTCCGATGGTAGTGGGTTATCAGAACT 967764
GCATTGGTGGTTCAGTGGTAGAATTCTCGCC 425354
GGCTGGTCCGATGGTAGTGGGTTATCAGAAC 376323
GGAAGAGCACACGTCTGAACTCCAGTCACGTGAAAATCTCGTATGCCGTCT 221398
TCCCTGGTGGTCTAGTGGTTAGGATTCGGCGCT 180339
awk -F'[[:space:]-]+' '{print $3,$2}' file
Seems like a simple substitution should do the job:
sed -E 's/[0-9]+-([0-9]+)[[:space:]]*(.*)/\2 \1/' file
Capture the parts you're interested in and use them in the replacement.
Alternatively, using awk:
awk 'sub(/^[0-9]+-/, "") { print $2, $1 }' file
Remove the leading digits and - from the start of the line. When this is successful, sub returns true, so the action is performed, printing the second field, followed by the first.
Using regex ( +|-) as field separator:
$ awk -F"( +|-)" '{print $3,$2}' file
GGCTGGTCCGATGGTAGTGGGTTATCAGAACT 967764
GCATTGGTGGTTCAGTGGTAGAATTCTCGCC 425354
GGCTGGTCCGATGGTAGTGGGTTATCAGAAC 376323
GGAAGAGCACACGTCTGAACTCCAGTCACGTGAAAATCTCGTATGCCGTCT 221398
TCCCTGGTGGTCTAGTGGTTAGGATTCGGCGCT 180339
here is another awk
$ awk 'split($1,a,"-") {print $2,a[2]}' file
awk '{sub(/.-/,"");print $2,$1}' file
GGCTGGTCCGATGGTAGTGGGTTATCAGAACT 967764
GCATTGGTGGTTCAGTGGTAGAATTCTCGCC 425354
GGCTGGTCCGATGGTAGTGGGTTATCAGAAC 376323
GGAAGAGCACACGTCTGAACTCCAGTCACGTGAAAATCTCGTATGCCGTCT 221398
TCCCTGGTGGTCTAGTGGTTAGGATTCGGCGCT 180339

Use AWK to search through fasta file, given a second file containing sequence names

I have a 2 files. One is a fasta file contain multiple fasta sequences, while another file includes the names of candidate sequences I want to search (file Example below).
seq.fasta
>Clone_18
GTTACGGGGGACACATTTTCCCTTCCAATGCTGCTTTCAGTGATAAATTGAGCATGATGGATGCTGATAATATCATTCCCGTGT
>Clone_23
GTTACGGGGGGCCGAAAAACACCCAATCTCTCTCTCGCTGAAACCCTACCTGTAATTTGCCTCCGATAGCCTTCCCCGGTGA
>Clone_27-1
GTTACGGGGACCACACCCTCACACATACAAACACAAACACTTCAAGTGACTTAGTGTGTTTCAGCAAAACATGGCTTC
>Clone_27-2
GTTACGGGGACCACACCCTCACACATACAAACACAAACACTTCAAGTGACTTAGTGTGTTTCAGCAAAACATGGCTTCGTTTTGTTCTAGATTAACTATCAGTTTGGTTCTGTTTGTCCTCGTACTGGGTTGTGTCAATGCACAACTT
>Clone_34-1
GTTACGGGGGAATAACAAAACTCACCAACTAACAACTAACTACTACTTCACTTTTCAACTACTTTACTACAATACTAAGAATGAAAACCATTCTCCTCATTATCTTTGCTCTCGCTCTTTTCACAAGAGCTCAAGTCCCTGGCTACCAAGCCATCG
>Clone_34-3
GTTACGGGGGAATAACAAAACTCACCAACTAACAACTAACTACTACTTCACTTTTCAACTACTTTACTACAATACTAAGAATGAAAACCATTCTCCTCATTATCTTTGCTCTCGCTCTTTTCACAAGAGCTCAAGTCCCTGGCTACCAAGCCATCGATATCGCTGAAGCCCAATC
>Clone_44-1
GTTACGGGGGAATCCGAATTCACAGATTCAATTACACCCTAAAATCTATCTTCTCTACTTTCCCTCTCTCCATTCTCTCTCACACACTGTCACACACATCC
>Clone_44-3
GTTACGGGGGAATCCGAATTCACAGATTCAATTACACCCTAAAATCTATCTTCTCTACTTTCCCTCTCTCCATTCTCTCTCACACACTGTCACACACATCCCGGCAGCGCAGCCGTCGTCTCTACCCTTCACCAGGAATAAGTTTATTTTTCTACTTAC
name.txt
Clone_23
Clone_27-1
I want to use AWK to search through the fasta file, and obtain all the fasta sequences for given candidates whose names were saved in another file.
awk 'NR==FNR{a[$1]=$1} BEGIN{RS="\n>"; FS="\n"} NR>FNR {if (match($1,">")) {sub(">","",$1)} for (p in a) {if ($1==p) print ">"$0}}' name.txt seq.fasta
The problem is that I can only extract the sequence of first candidate in name.txt, like this
>Clone_23
GTTACGGGGGGCCGAAAAACACCCAATCTCTCTCTCGCTGAAACCCTACCTGTAATTTGCCTCCGATAGCCTTCCCCGGTGA
Can anyone help to fix one-line awk command above?
If it is ok or even desired to print the name as well, you can simply use grep:
grep -Ff name.txt -A1 a.fasta
-f name.txt picks patterns from name.txt
-F treats them as literal strings rather than regular expressions
A1 prints the matching line plus the subsequent line
If the names are not desired in output I would simply pipe to another grep:
above_command | grep -v '>'
An awk solution can look like this:
awk 'NR==FNR{n[$0];next} substr($0,2) in n && getline' name.txt a.fasta
Better explained in a multiline version:
# True as long as we are reading the first file, name.txt
NR==FNR {
# Store the names in the array 'n'
n[$0]
next
}
# I use substr() to remove the leading `>` and check if the remaining
# string which is the name is a key of `n`. getline retrieves the next line
# If it succeeds the condition becomes true and awk will print that line
substr($0,2) in n && getline
$ awk 'NR==FNR{n[">"$0];next} f{print f ORS $0;f=""} $0 in n{f=$0}' name.txt seq.fasta
>Clone_23
GTTACGGGGGGCCGAAAAACACCCAATCTCTCTCTCGCTGAAACCCTACCTGTAATTTGCCTCCGATAGCCTTCCCCGGTGA
>Clone_27-1
GTTACGGGGACCACACCCTCACACATACAAACACAAACACTTCAAGTGACTTAGTGTGTTTCAGCAAAACATGGCTTC