Match string pattern - Replacement via awk-gsub with another pattern - awk

AIM
I want to match a pattern in a string, identifying it by its initial and final boundaries.
I then want to replace the matched pattern with "ID=".
STRING
Class=Grainyhead.domain.factors;Family=CP2-related.factors;id=TFCP2.Ca9750.2.YY2017.HT-SE2;strand=+;seq=TTCTGGTTGGGACCAGGA;score=7.62921;pval=6.53e-05;Averageconservationscore=1.77
DESIRED PATTERN OF THE STRING TO BE MATCHED WITH A COMMAND IN AWK
PATTERN
Class=Grainyhead.domain.factors;Family=CP2-related.factors;id=
COMMAND
(/\Class=(.*);id=/)
AWK-GSUB
awk 'BEGIN{FS=OFS="\t"} {gsub(/\Class=(.*);id=/), "ID=", $4) 1'}
I am not sure about the use of (.*).
I commonly use it in R to select part of a string.
Can it also be used with gsub in awk?

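Your field separator appears to be ';', not a tab.
To anchor the match at the start of the record, begin the regexp with '^' (not '\').
Called without a third argument, gsub works on the whole record ($0), and the fields are re-split afterwards.
After the replacement, select the columns you want with $number.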
cat file | awk 'BEGIN{FS=OFS=";"} {gsub(/^Class=(.*);id=/, "id="); print $1, $6}' > outputfile
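To see the replacement on its own, here is a minimal sketch that runs it against the sample string from the question. The variable names record and result are only for illustration, and it uses sub instead of gsub on the assumption that the prefix occurs once per record.

```shell
# Sample record from the question (semicolon-separated key=value pairs).
record='Class=Grainyhead.domain.factors;Family=CP2-related.factors;id=TFCP2.Ca9750.2.YY2017.HT-SE2;strand=+;seq=TTCTGGTTGGGACCAGGA;score=7.62921;pval=6.53e-05;Averageconservationscore=1.77'

# ^ anchors the match at the start of the record; .* is greedy, which is
# harmless here because ";id=" occurs only once in the record.
result=$(printf '%s\n' "$record" |
  awk '{ sub(/^Class=.*;id=/, "ID=") } 1')

printf '%s\n' "$result"
```

The parentheses around .* are not needed, since awk's sub and gsub cannot refer back to a group in the replacement anyway.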

Related

How to extract (First match)text between two words

I have a file having the following structure
destination list
move from station d-435-435 to point place1
move from station d-435-435 to point place2
move from mainpoint
I want to extract the word "d-435-435"(Only the first match, this need not be same value always) in between the words "from station" and "to point"
How can I achieve this?
What I have tried so far?
id=$(sed 's/.*from station \(.*\) to.*/\1/' input.txt)
But this returns the following value: destination list d-435-435 move from mainpoint
1st solution: With your shown samples, please try the following GNU awk code. It uses awk's match function with the regex from station\s+\S+\s+to point to find the requested value, then removes from station\s+ and \s+to point from the matched text and prints what remains.
awk '
match($0,/from station\s+\S+\s+to point/){
val=substr($0,RSTART,RLENGTH)
gsub(/from station\s+|\s+to point/,"",val)
print val
exit
}
' Input_file
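The \s and \S escapes are GNU awk extensions. If another awk is in use, the same idea can probably be written with bracket expressions. This is a sketch that assumes the words are separated by spaces or tabs, and it recreates the sample input inline (the variable names input and first_id are illustrative):

```shell
input='destination list
move from station d-435-435 to point place1
move from station d-435-435 to point place2
move from mainpoint'

first_id=$(printf '%s\n' "$input" | awk '
match($0, /from station[ \t]+[^ \t]+[ \t]+to point/) {
  # Keep only the matched text, then trim the leading and trailing markers.
  val = substr($0, RSTART, RLENGTH)
  sub(/^from station[ \t]+/, "", val)
  sub(/[ \t]+to point$/, "", val)
  print val
  exit   # stop after the first matching line
}')

printf '%s\n' "$first_id"
```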
2nd solution: Using GNU grep, please try the following. The -o option prints only the matched portion, -P enables PCRE, and -m1 stops after the first matching line. The pattern matches the string from station followed by space(s); \K then discards everything matched so far (since we don't need it in the output). Next it matches \S+ (non-space characters), followed by a positive lookahead for space(s) and to point, which checks that they are present without including them in the output.
grep -oP -m1 'from station\s+\K\S+(?=\s+to point)' Input_file
If GNU sed is available, how about:
id=$(sed -nE '0,/from station.*to/ s/.*from station (.*) to.*/\1/p' input.txt)
The -n option suppresses printing unless the substitution succeeds (the p flag prints in that case).
The address range 0,/pattern/ works like a flip-flop: it is true from the start of the input
up to and including the first line that matches the pattern, and false afterwards.
The 0 address is a GNU sed extension which allows the pattern to match on the 1st line as well.
With awk you can test the fields before and after field $4, which holds d-435-435,
then print that field for the first match only and stop with exit after the print statement:
awk '$2=="from" && $3=="station" && $5=="to" && $6=="point" {print $4; exit}' file
d-435-435
or using GNU awk for the 3rd arg to match():
awk 'match($0,/from station\s+(.*)\s+to point/,a){print a[1];exit}' file
d-435-435
The regexp contains a parenthesized group, so a[1] holds the portion of the string between from station followed by space(s) (\s+) and space(s) (\s+) followed by to point.
This might work for you (GNU sed):
sed -nE '/.*station (\S+) to point.*/{s//\1/;H;x;/\n(\S+)\n.*\1/{s/\n\S+$//;x;d};x;p}' file
Turn off implicit printing and turn on extended regexps with the command line options -nE.
If a line matches the required criteria, extract the required string, append a copy to the hold space, check if the match has already been seen and if not print it. If the match has been seen, remove it from the hold space.
Otherwise, do not print anything.
This should work in any sed:
sed -e '/.*from station \([^ ]*\) to .*/!d' -e 's//\1/' -e q file

Using a wildcard in awk with references to file

Using awk, I want to print all lines that have a string in the first column that starts with /dev/vda
I tried the following, but obviously * does not work as a wildcard in awk:
awk '$1=="/dev/vda/*" {print $3}'
Is this possible in awk?
For string matching there is the index() function:
awk 'index($1, "/dev/vda") == 1'
The function returns the starting position of a substring, which must be 1 if it's at the beginning.
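A quick way to check the prefix test is to run it on a few made-up lines. The sample data below is hypothetical (device, size, used) and only loosely modelled on df output:

```shell
# Hypothetical input: device name, size, used space.
sample='/dev/vda1 50G 12G
/dev/vdb1 20G 3G
/dev/vda2 10G 1G
tmpfs 2G 0'

# index() returns 1 only when $1 begins with the given substring.
used=$(printf '%s\n' "$sample" |
  awk 'index($1, "/dev/vda") == 1 { print $3 }')

printf '%s\n' "$used"
```

Because index() compares plain strings, none of the characters in the prefix need escaping.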
"* does not work as a wildcard"
This is true: awk does not use shell-like patterns, although * may appear as a regex metacharacter (where it matches zero or more repetitions of the preceding character or group).
awk '$1~/^\/dev\/vda\// {print $3}'
awk has wildcards in regular expressions, not in string equality checks.
Technically the * is not a wildcard in a regex, it's a quantifier. The regex equivalent of the wildcard * of any number of any character would be .*.
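The difference between the quantifier * and the wildcard-like .* shows up in a small count. This is only a sketch with made-up names:

```shell
names='vda1
vdb1
vd'

# /^vda*/ means "vd" followed by zero or more "a": all three names match.
quantifier=$(printf '%s\n' "$names" |
  awk '/^vda*/ { n++ } END { print n + 0 }')

# /^vda.*/ means "vda" followed by anything: only vda1 matches.
wildcard=$(printf '%s\n' "$names" |
  awk '/^vda.*/ { n++ } END { print n + 0 }')

printf 'a*: %s, a.*: %s\n' "$quantifier" "$wildcard"
```

Since a regex match is unanchored at the end by default, the trailing .* is optional: /^vda/ selects the same lines as /^vda.*/.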

Delete string from line that matches regex with AWK

I have file that contains a lot of data like this and I have to delete everything that matches this regex [-]+\d+(.*)
Input:
zxczxc-6-9hw7w
qweqweqweqweqwe-18-8c5r6
asdasdasasdsad-11-br9ft
Output should be:
zxczxc
qweqweqweqweqwe
asdasdasasdsad
How can I do this with AWK?
sed might be easier...
$ sed -E 's/-+[0-9].*//' file
Note that [0-9].* already covers [0-9]+.*, so the + after the digit class is unnecessary.
AFAIK awk doesn't support \d, so use [0-9] instead. Your regex is otherwise correct; you only need to pass it to the appropriate awk function.
awk '{sub(/-+[0-9].*/,"")} 1' Input_file
You don't need the extra plus sign after [0-9], as this is covered by the .*
Generally, if you want to delete a string that matches a regular expression, then all you need to do is substitute it with an empty string. The most straightforward solution is sed which is presented by karafka, the other solution is using awk as presented by RavinderSingh13.
The overall syntax would look like this:
sed -e 's/ere//g' file
awk '{gsub(/ere/,"")}1' file
where ere stands for the extended regular expression. Note that I use g and gsub here to substitute all non-overlapping matches.
Due to the nature of the regular expression in the OP, i.e. it ends with .*, the g can be dropped. It also allows us to write a different awk solution which works with field separators:
awk -F '-+[0-9]' '{print $1}' file
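As a sanity check, the substitution and the field-separator variant can be run side by side on the sample input from the question. The variable names are illustrative:

```shell
input='zxczxc-6-9hw7w
qweqweqweqweqwe-18-8c5r6
asdasdasasdsad-11-br9ft'

# Delete from the first "dashes followed by a digit" to the end of the line.
with_sub=$(printf '%s\n' "$input" | awk '{ sub(/-+[0-9].*/, "") } 1')

# Treat the same pattern as the field separator and keep the first field.
with_fs=$(printf '%s\n' "$input" | awk -F '-+[0-9]' '{ print $1 }')

printf '%s\n' "$with_sub"
[ "$with_sub" = "$with_fs" ] && echo "both variants agree"
```

Lines that contain no match pass through unchanged in both variants, since sub() then does nothing and $1 is the whole line.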

Gawk matching one word - one unexpected match

I wanted to get all matches in Column 3 which have the exact word "aa" (case insensitive match) in the string in Column 3
The gawk command used in the awk file is:
$3 ~ /\<aa\>/
The BEGIN statement specifies: IGNORECASE = 1
The command returns 20 rows. What is puzzling is this value in Column 3 in the returned rows:
aA.AHAB
How do I avoid this row as it is not a word by itself because there is dot following the first two aa's and not a space?
A is a word character. . is not a word character. \> matches the zero-width string at the end of a word. Such a zero-width string occurs between A and ..
To search for the string aa delimited by space characters (or start/end of field):
$3 ~ /(^|[ ])aa([ ]|$)/
Add any other characters that you care about inside the set ([ ]).
Note that by default, awk splits records into fields on whitespace, so you will not get any spaces in $3 unless you have changed the value of FS.
1st solution: to match exactly aa (IGNORECASE is specific to gawk), try:
awk 'BEGIN{IGNORECASE=1} $3 ~ /^aa$/' Input_file
2nd solution: without the IGNORECASE option, try:
awk 'tolower($3)=="aa"' Input_file
Question: Why does the awk regex pattern /\<aa\>/ match a string like "aa.bbb"?
We can quickly verify this with:
$ echo aa.bbb | awk '/\<aa\>/'
aa.bbb
The answer is simply found in the manual of gnu awk:
3.7 gawk-Specific Regexp Operators
GNU software that deals with regular expressions provides a number of additional regexp operators. These operators are described in this section and are specific to gawk; they are not available in other awk implementations. Most of the additional operators deal with word matching. For our purposes, a word is a sequence of one or more letters, digits, or underscores (‘_’):
\<: Matches the empty string at the beginning of a word. For example, /\<away/ matches "away" but not "stowaway".
\>: Matches the empty string at the end of a word. For example, /stow\>/ matches "stow" but not "stowaway".
source: GNU awk manual: Section 3 :: Regular Expressions
So to come back to the example from above, the string "aa.bbb" contains two words, "aa" and "bbb", since the <dot> character is not part of the character set that can build up a word. The empty strings matched here are the empty string before "aa.bbb" and the empty string between the characters a and . (an empty string really is empty: length 0, 0 characters, commonly written as "").
Solution to the OP: Since FS is most likely the default value, the field $3 cannot have a space. So the following two solutions are possible:
$3 ~ /^aa$/
$3 == "aa"
If the field separator FS is defined in the code, so that $3 may contain spaces, the following might work:
(" " $3 " ") ~ / aa /
$3 ~ /(^|[ ])aa([ ]|$)/ # See solution of JHNC
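To try the space-delimited match, here is a small sketch on hypothetical tab-separated rows, so that column 3 can actually contain spaces. It uses tolower() rather than IGNORECASE so that it does not depend on gawk:

```shell
# Hypothetical rows: id, label, text (tab-separated).
rows=$(printf '1\tx\taA.AHAB\n2\ty\tfoo AA bar\n3\tz\taa\n4\tw\tbaa\n')

# Print the id of each row whose third column contains "aa" as a
# space-delimited word, ignoring case.
matched=$(printf '%s\n' "$rows" |
  awk -F '\t' 'tolower($3) ~ /(^| )aa( |$)/ { print $1 }')

printf '%s\n' "$matched"
```

Row 1 (aA.AHAB) is skipped because the character after the first two letters is a dot, and row 4 (baa) is skipped because aa is preceded by a letter.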

Awk multi character field separator containing caret not working as expected

I have tried multiple google searches, but none of the proposed answers are working for my example below. NF should be 3, but I keep getting 1.
# cat a
1^%2^%3
# awk -F^% '{print NF}' a
1
# awk -F'^%' '{print NF}' a
1
# awk -F "^%" '{print NF}' a
1
The -F option in awk takes a regular expression as its value, so ^ is interpreted as the start-of-string anchor and must be stripped of its special meaning. You do that by escaping it with a backslash, which itself has to be doubled inside the string:
awk -F'\\^%' '{ print NF }' a
From the GNU Awk manual, on escape sequences:
The backslash character itself is another character that cannot be included normally; you must write \\ to put one backslash in the string or regexp. Thus, the string whose contents are the two characters " and \ must be written \"\\.
You should escape ^ to remove its special meaning, since the field separator is treated as a regex. Once you escape ^ by writing \\^, it is treated as a literal character, ^% is matched as a plain string, and you get 3 as the answer.
awk -F'\\^%' '{print NF}' Input_file
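The doubled backslash is only needed where the separator is written as a string. As a rough comparison (the variable names are illustrative), the same separator written as a regex constant needs just one backslash:

```shell
line='1^%2^%3'

# String context (-F value): the backslash must be doubled.
via_fs=$(printf '%s\n' "$line" | awk -F'\\^%' '{ print NF }')

# Regex constant: a single backslash is enough. split() returns the
# number of pieces, which should equal NF above.
via_split=$(printf '%s\n' "$line" |
  awk '{ print split($0, parts, /\^%/) }')

printf '%s %s\n' "$via_fs" "$via_split"
```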
Here is an SO answer that serves as a further example. It doesn't deal with the ^ character specifically, but it explains how to use escape sequences in the awk field separator.
https://stackoverflow.com/a/44072825/5866580