Remove pattern from a column if it is present in another one - awk

I have this file:
>AX-89916436-Affx-G-[A/G]
TTGTCCGAGAGTGACGTCAATCCGCA
>AX-89916437-Affx-A-[A/G]
TGTGTGGAAACTCCG
>AX-89916438-Affx-C-[A/C]
GAAGTACGGTAACAT
>AX-89916440-Affx-T-[G/T]
AGTTGATGGTGTATGTGTGTCTTT
I would like to remove, from the last field ([X/X]), the letter that is present in the 4th field, to get something like this:
>AX-89916436-Affx-G-A
TTGTCCGAGAGTGACGTCAATCCGCA
>AX-89916437-Affx-A-G
TGTGTGGAAACTCCG
>AX-89916438-Affx-C-A
GAAGTACGGTAACAT
>AX-89916440-Affx-T-G
AGTTGATGGTGTATGTGTGTCTTT
So far I have:
awk -F'-' '
match($0, /\[[A-Z]\/[A-Z]]/) {m = substr($0, RSTART, RLENGTH); if(/^>/ && $NF~/m/); print ... }'

$ awk 'BEGIN{FS=OFS="-"} />/{gsub("[][/]",""); sub($(NF-1),"",$NF)}1' file
>AX-89916436-Affx-G-A
XXXXXXX
>AX-89916437-Affx-A-G
XXXXXXXXXXX
>AX-89916438-Affx-C-A
XXXXXXX
>AX-89916440-Affx-T-G
XXXXXXX

Here is another awk:
awk 'BEGIN {FS=OFS="-"} NF>1 {gsub("[][/" $(NF-1) "]", "", $NF) } 1' file
>AX-89916436-Affx-G-A
XXXXXXX
>AX-89916437-Affx-A-G
XXXXXXXXXXX
>AX-89916438-Affx-C-A
XXXXXXX
>AX-89916440-Affx-T-G
XXXXXXX

With your shown samples, please try the following awk code. A simple explanation: set FS and OFS to -, and in the main section check whether the line starts with > and the 5th field matches the regex \[[A-Z]\/[A-Z]]. If so, use gsub to remove the brackets, the slash, and any character equal to the 4th field from the 5th field. The trailing 1 is the awk-ish way of printing the current line, edited or not.
awk '
BEGIN{ FS=OFS="-" }
/^>/ && $5 ~ /\[[A-Z]\/[A-Z]]/{
gsub("[][/"$4"]", "", $5)
}
1' Input_file

Using sed
$ sed -E s'#([A-Z])-\[(\1|([A-Z]))/(\1|([A-Z]))]#\1-\3\5#' input_file
>AX-89916436-Affx-G-A
TTGTCCGAGAGTGACGTCAATCCGCA
>AX-89916437-Affx-A-G
TGTGTGGAAACTCCG
>AX-89916438-Affx-C-A
GAAGTACGGTAACAT
>AX-89916440-Affx-T-G
AGTTGATGGTGTATGTGTGTCTTT

You can use
awk 'BEGIN{FS=OFS="-"} /^>/ && $5 ~ /\[[A-Z]\/[A-Z]]/{gsub("[][/"$4"]", "", $5);}1' file
Details:
BEGIN{FS=OFS="-"} - set input/output field separator to -
/^>/ && $5 ~ /\[[A-Z]\/[A-Z]]/ - if the string starts with > and Field 5 contains [ + uppercase letter + / + uppercase letter + ] substring...
{gsub("[][/"$4"]", "", $5);} - then remove from Field 5 ], [, / and Field 4 chars
1 - fires the default print action.
See the online demo:
#!/bin/bash
s='>AX-89916436-Affx-G-[A/G]
XXXXXXX
>AX-89916437-Affx-A-[A/G]
XXXXXXXXXXX
>AX-89916438-Affx-C-[A/C]
XXXXXXX
>AX-89916440-Affx-T-[G/T]
XXXXXXX'
awk 'BEGIN{FS=OFS="-"} /^>/ && $5 ~ /\[[A-Z]\/[A-Z]]/{gsub("[][/"$4"]", "", $5);}1' <<< "$s"
Output:
>AX-89916436-Affx-G-A
XXXXXXX
>AX-89916437-Affx-A-G
XXXXXXXXXXX
>AX-89916438-Affx-C-A
XXXXXXX
>AX-89916440-Affx-T-G
XXXXXXX
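To see why the gsub call removes the right letter, it can help to print the bracket expression that is built from field 4. The following is only a sketch; show_regex is a hypothetical helper name, not something taken from the answers above.

```shell
# Hypothetical helper: print the bracket expression built for each header line.
show_regex() {
    awk 'BEGIN { FS = OFS = "-" }
         /^>/ {
             re = "[][/" $4 "]"   # same string that is passed to gsub
             print $5 " -> " re
         }'
}

printf '%s\n' '>AX-89916436-Affx-G-[A/G]' 'TTGTCCGAGAGTGACGTCAATCCGCA' | show_regex
```

For the header above this should print [A/G] -> [][/G]. Inside a bracket expression a leading ] is literal, so the set contains ], [, / and the letter from field 4, and gsub deletes every one of those characters from field 5.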

much better now :
>AX-89916436-Affx-G-A
TTGTCCGAGAGTGACGTCAATCCGCA
>AX-89916437-Affx-A-G
TGTGTGGAAACTCCG
>AX-89916438-Affx-C-A
GAAGTACGGTAACAT
>AX-89916440-Affx-T-G
AGTTGATGGTGTATGTGTGTCTTT
# gawk profile, created Thu May 12 05:05:48 2022
# Rule(s)
8 NF*=($_=(NF=NF)==!_?$!_:$!(NF-=($(_+=(_-=_)-+-++_-+-++_)=\
$((_+=_+=(_^=_<_)+_)-($--_!=$--_) ) )^(_-=_)+!_))~""'

Related

awk to add prefix if not present in field

I am trying to add a prefix to a field in awk if it is not already present. That is, if chr isn't present before the number, it is inserted; if it is already there, the field is left alone.
The first awk adds the prefix to every $2 even if it is already present. The second awk does skip the $2 values that contain chr, but it does not add chr to the $2 values without it. Thank you :).
file
ASPA,17:3483575-3483585
ATM,11:108289609-108289613
ATP7B,13:51937469-51937480
ATR,chr3:142562768-142562773
BAG3,chr10:119670120-119670123
desired
ASPA,chr17:3483575-3483585
ATM,chr11:108289609-108289613
ATP7B,chr13:51937469-51937480
ATR,chr3:142562768-142562773
BAG3,chr10:119670120-119670123
awk
awk -F, '{$2="chr"$2; print}' file
awk 2
awk -F, '$2 !~/chr/{gsub("chr","chr",$2)}1' file
You can use:
awk 'BEGIN {FS=OFS=","} $2 !~ /^chr/ {$2="chr" $2} 1' file
ASPA,chr17:3483575-3483585
ATM,chr11:108289609-108289613
ATP7B,chr13:51937469-51937480
ATR,chr3:142562768-142562773
BAG3,chr10:119670120-119670123
Or without using any regex:
awk 'BEGIN {FS=OFS=","} index($2 , "chr") != 1 {$2="chr" $2} 1' file
Another solution that might be shortest of all:
awk '{sub(/,(chr)?/, ",chr")} 1' file
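A quick way to check the sub-based one-liner is to feed it one line with the prefix and one without. This is a minimal sketch; add_chr is a made-up function name, and it assumes the chromosome always directly follows the first comma, since sub replaces only the first match on each line.

```shell
# Hypothetical wrapper around the one-liner above.
add_chr() {
    # ",(chr)?" matches the first comma plus an optional existing prefix,
    # so the replacement ",chr" never produces "chrchr".
    awk '{ sub(/,(chr)?/, ",chr") } 1'
}

printf '%s\n' 'ASPA,17:3483575-3483585' 'ATR,chr3:142562768-142562773' | add_chr
```

Both lines should come out with exactly one chr prefix in the second field.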
1st solution: With your shown samples, please try following awk code.
awk '
BEGIN{FS=OFS=":"}
{
split($1,arr,",")
if(int(arr[2]) || arr[2]==0){
$1=arr[1] ",chr" arr[2]
}
}
1
' Input_file
2nd solution: With GNU awk, using its match function, which stores the values of the capturing groups in an array, try the following code.
awk '
match($0,/^([^,]*,)([^:]*)(:.*)/,arr){
if(int(arr[2]) || arr[2]==0){
arr[2]="chr" arr[2]
}
print arr[1] arr[2] arr[3]
}
' Input_file
3rd solution (bonus): In case your 2nd field holds negative integers and you want to change, e.g., -11 to -chr11, you can try the following GNU awk code.
awk '
match($0,/^([^,]*,)(-)?([^:]*)(:.*)/,arr){
if(int(arr[3]) || arr[3]==0){
if(arr[2]=="-"){
arr[3]="-chr" arr[3]
}
else{
arr[3]="chr" arr[3]
}
$0=arr[1] arr[3] arr[4]
}
print
}
' Input_file
mawk NF=NF FS=',(chr)?' OFS=',chr'
ASPA,chr17:3483575-3483585
ATM,chr11:108289609-108289613
ATP7B,chr13:51937469-51937480
ATR,chr3:142562768-142562773
BAG3,chr10:119670120-119670123

AWK: How to number auto-increment?

I have a file. Its content is:
20210126000880000003|3|33.00|20210126|15:30
1|20210126000000000000000000002207|1220210126080109|1000|100000000000000319|100058110000000325|402041000012|402041000012|PT07|621067000000123645|收款方户名|2021-01-26|2021-01-26|10.00|TN|NCS|12|875466
2|20210126000000000000000000002208|1220210126080110|1000|100000000000000319|100058110000000325|402041000012|402041000012|PT06|621067000000123645|收款方户名|2021-01-26|2021-01-26|20.00|TN|NCS|12|875466
3|20210126000000000000000000002209|1220210126080111|1000|100000000000000319|100058110000000325|402041000012|402041000012|PT08|621067000000123645|收款方户名|2021-01-26|2021-01-26|3.00|TN|NCS|12|875466
I use this awk command:
awk -F"|" 'NR==1{print $1};FNR==2{print $2,$3}' testfile
I get the following result:
20210126000880000003
20210126000000000000000000002207 1220210126080109
I want the numbers to auto-increment:
awk -F"|" 'NR==1{print $1+1};FNR==2{print $2+1,$3+1}' testfile
But I get the following result:
20210126000880001024
20210126000000000944237587726336 1220210126080110
My question: I want the numbers to auto-increment. The result I hope for is:
20210126000880000003
20210126000000000000000000002207|1220210126080109
-------------------------------------------------
20210126000880000004
20210126000000000000000000002208|1220210126080110
--------------------------------------------------
20210126000880000005
20210126000000000000000000002209|1220210126080111
How can I auto-increment them?
Thanks!
You may try this gnu awk command:
awk -M 'BEGIN {FS=OFS="|"} NR == 1 {hdr = $1; next} NF>2 {print ++hdr; print $2, $3; print "-------------------"}' file
20210126000880000004
20210126000000000000000000002207|1220210126080109
-------------------
20210126000880000005
20210126000000000000000000002208|1220210126080110
-------------------
20210126000880000006
20210126000000000000000000002209|1220210126080111
-------------------
A more readable version:
awk -M 'BEGIN {
FS=OFS="|"
}
NR == 1 {
hdr = $1
next
}
NF > 2 {
print ++hdr
print $2, $3
print "-------------------"
}' file
Here is a POSIX awk solution that doesn't need -M:
awk 'BEGIN {FS=OFS="|"} NR == 1 {hdr = $1; next} NF>2 {"echo " hdr " + 1 | bc" | getline hdr; print hdr; print $2, $3; print "-------------------"}' file
20210126000880000004
20210126000000000000000000002207|1220210126080109
-------------------
20210126000880000005
20210126000000000000000000002208|1220210126080110
-------------------
20210126000880000006
20210126000000000000000000002209|1220210126080111
-------------------
Anubhava has the best solution but for older versions of GNU awk that don't support -M (big numbers) you can try the following:
awk -F\| 'NR==1 { print $1;hed=$1;hed1=substr($1,(length($1)-1));next; } !/^$/ {print $2" "$3 } /^$/ { print "--------------------------------------------------";printf "%s%s\n",substr(hed,1,((length(hed))-(length(hed1)+1))),++hed1 }' testfile
Explanation:
awk -F\| 'NR==1 { # Set field delimiter to | and process the first line
print $1; # Print the first field
hed=$1; # Set the variable hed to the first field
hed1=substr($1,(length($1)-1)); # Set a counter variable hed1 to the last two characters of hed ($1)
next;
}
!/^$/ {
print $2" "$3 # Where there is no blank line, print the second field, a space and the third field
}
/^$/ {
print "--------------------------------------------------"; # On a blank line, print a separator
printf "%s%s\n",substr(hed,1,((length(hed))-(length(hed1)+1))),++hed1 # print the header extract before the counter, followed by the incremented counter
}' testfile
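The unexpected results in the question arise because awk normally stores numbers as double-precision floating point, which cannot represent every integer above 2^53 exactly. If neither gawk -M nor bc is available, one option is to increment the value as a string of digits. The sketch below is only an illustration: incr_str and incr are hypothetical names, and the input is assumed to be a non-negative decimal integer with no sign or separators.

```shell
# Hypothetical helper: add 1 to an arbitrarily long digit string in POSIX awk.
incr_str() {
    awk '
    function incr(s,    i, d, n) {
        # Walk from the last digit to the first, propagating the carry.
        for (i = length(s); i >= 1; i--) {
            d = substr(s, i, 1)
            if (d != "9") {
                n = d + 1
                return substr(s, 1, i - 1) n substr(s, i + 1)
            }
            # A 9 becomes 0 and the carry moves one position to the left.
            s = substr(s, 1, i - 1) "0" substr(s, i + 1)
        }
        return "1" s    # every digit was 9
    }
    { print incr($0) }'
}

echo 20210126000000000000000000002207 | incr_str
```

Because only single digits are ever converted to numbers, the length of the value does not matter. Inside a larger awk program the incr function could be called on $1 or $2 directly.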

Awk column with pattern array

Is it possible to do this but use an actual array of strings where it says "array"
array=(cat
dog
mouse
fish
...)
awk -F "," '{ if ( $5!="array" ) { print $0; } }' file
I would like to use spaces in some of the strings in my array.
I would also like to be able to match partial matches, so "snow" in my array would match "snowman"
It should be case sensitive.
Example csv
s,dog,34
3,cat,4
1,african elephant,gd
A,African Elephant,33
H,snowman,8
8,indian elephant,3k
7,Fish,94
...
Example array
snow
dog
african elephant
Expected output
s,dog,34
H,snowman,8
1,african elephant,gd
Cyrus posted this, which works well, but it doesn't allow spaces in the array strings and won't match partial matches.
echo "${array[@]}" | awk 'FNR==NR{len=split($0,a," "); next} {for(i=1;i<=len;i++) {if(a[i]==$2){next}} print}' FS=',' - file
The brief approach using a single regexp for all array contents:
$ array=('snow' 'dog' 'african elephant')
$ printf '%s\n' "${array[@]}" | awk -F, 'NR==FNR{r=r s $0; s="|"; next} $2~r' - example.csv
s,dog,34
1,african elephant,gd
H,snowman,8
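The one-liner works by joining the array elements with | into a single alternation. A small sketch that prints that regexp can make this easier to follow. build_regex is a hypothetical helper, and note that each element is treated as a regular expression, so metacharacters such as . or * in an element are not matched literally.

```shell
# Hypothetical helper: join all arguments into one alternation regexp.
build_regex() {
    printf '%s\n' "$@" |
    awk '{ r = r s $0; s = "|" }   # s is empty for the first element only
         END { print r }'
}

build_regex 'snow' 'dog' 'african elephant'
```

This should print snow|dog|african elephant. In bash the array can be passed as build_regex "${array[@]}", which keeps elements containing spaces intact.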
Or if you prefer string comparisons:
$ cat tst.sh
#!/usr/bin/env bash
array=('snow' 'dog' 'african elephant')
printf '%s\n' "${array[@]}" |
awk -F',' '
NR==FNR {
array[$0]
next
}
{
for (val in array) {
if ( index($2,val) ) { # or $2 ~ val for a regexp match
print
next
}
}
}
' - example.csv
$ ./tst.sh
s,dog,34
1,african elephant,gd
H,snowman,8
This prints every line of the CSV file whose column 5 does not equal any element of the array:
echo "${array[@]}" | awk 'FNR==NR{len=split($0,a," "); next} {for(i=1;i<=len;i++) {if(a[i]==$5){next}} print}' FS=',' - file

How to count spaces between columns

How can I count the number of spaces (16) between S1, and // in the following line:
S1,                // name
One way:
awk -F '//' '{ n = gsub(/ /, "", $1); print n }'
Test:
echo 'S1,                // name' | awk -F '//' '{ n = gsub(/ /, "", $1); print n }'
Results:
16
If you really want awk then you can build on the following.
$ echo "S1,                // name" | awk '{x=gsub(/ /," ",$0); print x}'
17
gsub returns the number of replacements made. Obviously this regex will also find and count other spaces but you get the point.
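Building on that return value, one way to avoid counting the other spaces is to cut the line down to the part between the comma and // before calling gsub. This is just a sketch: count_gap is a hypothetical name, and it assumes the first comma and the first // after it delimit the gap.

```shell
# Hypothetical helper: count the spaces between the first "," and the next "//".
count_gap() {
    awk '{
        gap = $0
        sub(/^[^,]*,/, "", gap)    # drop everything up to and including the first comma
        sub(/\/\/.*$/, "", gap)    # drop the first "//" and everything after it
        print gsub(/ /, "", gap)   # gsub returns how many spaces it replaced
    }'
}

# printf pads an empty string to width 16, giving exactly 16 spaces
printf 'S1,%16s// name\n' '' | count_gap
```

With the 16-space input built by printf, this should print 16, since the space before name is cut off before gsub runs.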
Or try something like this:
echo "S1,                // name" |
awk -F'[,/]' ' { for (i=1;i<=NF;i++) print "$"i " is \""$i"\" of length, " length($i);}'
Test:
$ echo "S1,                // name" | awk -F'[,/]' ' { for (i=1;i<=NF;i++) print "$"i " is \""$i"\" of length, " length($i);}'
$1 is "S1" of length, 2
$2 is "                " of length, 16
$3 is "" of length, 0
$4 is " name" of length, 5
Count all spaces between S1, and // only with awk:
$ echo 'S1,                // name' | awk -F'[,/]' '{print length($2)}'
16
Or a method based off fedorqui comment:
$ echo 'S1,                // name' | grep -Po '(?<=S1,) *(?=//)' | wc -L
16
Pure bash
x='S1,                // name'
x=${x#S1,}
x=${x%//*}
echo ${#x}
16

How to print out a specific field in AWK?

A very simple question, which I found no answer to. How do I print out a specific field in awk?
awk '/word1/' prints the whole line, when I need just word1. Or I need only a chain of patterns (word1 + word2) to be printed from a text.
Well, if the pattern is a single word (which you want to print and which can't contain FS, the input field separator), why not:
awk -v MYPATTERN="INSERT_YOUR_PATTERN" '$0 ~ MYPATTERN { print MYPATTERN }' INPUTFILE
If your pattern is a regex (gensub requires GNU awk):
awk -v MYPATTERN="INSERT_YOUR_PATTERN" '$0 ~ MYPATTERN { print gensub(".*(" MYPATTERN ").*","\\1","1",$0) }' INPUTFILE
If your pattern must be checked in every single field:
awk -v MYPATTERN="INSERT_YOUR_PATTERN" '$0 ~ MYPATTERN {
for (i=1;i<=NF;i++) {
if ($i ~ MYPATTERN) { print "Field " i " in " NR " row matches: " MYPATTERN }
}
}' INPUTFILE
Modify any of the above to your taste.
The fields in awk are represented by $1, $2, etc:
$ echo this is a string | awk '{ print $2 }'
is
$0 is the whole line, $1 is the first field, $2 is the next field ( or blank ),
$NF is the last field, $( NF - 1 ) is the 2nd to last field, etc.
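As a small illustration of those variables (show_fields is a made-up name for this sketch):

```shell
# Hypothetical helper: print the first, second-to-last and last field of each line.
show_fields() {
    awk '{ print $1, $(NF - 1), $NF }'
}

echo 'this is a string' | show_fields
```

With the default whitespace field separator this should print this a string.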
EDIT (in response to comment).
You could try:
awk '/crazy/{ print substr( $0, match( $0, "crazy" ), RLENGTH )}'
I know you can do this with awk, but an alternative would be:
sed -nr "s/.*(PATTERN_TO_MATCH).*/\1/p" file
or you can use grep -o
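A minimal sketch of the grep -o variant, assuming a grep that supports -o (GNU, BSD and BusyBox grep do, although the option is not required by POSIX). extract_word is a hypothetical wrapper name.

```shell
# Hypothetical wrapper: print only the parts of each line that match the pattern.
extract_word() {
    # $1 is the pattern; the text is read from standard input
    grep -o "$1"
}

echo 'a crazy idea' | extract_word 'crazy'
```

Each match is printed on its own line, so a line containing the pattern twice yields two output lines.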
Something like this perhaps:
awk '{split("bla1 bla2 bla3",a," "); print a[1], a[2], a[3]}'