Using a wildcard in awk with references to file - awk

Using awk, I want to print all lines that have a string in the first column that starts with /dev/vda
I tried the following, but obviously * does not work as a wildcard in awk:
awk '$1=="/dev/vda/*" {print $3}'
Is this possible in awk?

For string matching there is the index() function:
awk 'index($1, "/dev/vda") == 1'
The function returns the starting position of a substring, which must be 1 if it's at the beginning.
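A minimal sketch of the index() approach next to an anchored regex, run on a made-up df-style listing (the device names and sizes are assumptions for illustration only):

```shell
# Hypothetical df-style sample; names and sizes are invented.
sample='/dev/vda1 20G 5G
/dev/vdb1 10G 1G
tmpfs 2G 0'

# index() returns 1 only when $1 begins with the literal string /dev/vda.
by_index=$(printf '%s\n' "$sample" | awk 'index($1, "/dev/vda") == 1 {print $3}')

# The same prefix test as an anchored regex; slashes are escaped inside /.../.
by_regex=$(printf '%s\n' "$sample" | awk '$1 ~ /^\/dev\/vda/ {print $3}')

echo "$by_index"   # prints 5G
echo "$by_regex"   # prints 5G
```

index() compares plain strings, so nothing in the search string needs escaping; the regex form needs the slashes escaped.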
* does not work as a wildcard
This is true: awk does not use shell-like patterns, although * may appear as a regex metacharacter (where it matches zero or more repetitions of the preceding character or group).

awk '$1 ~ /^\/dev\/vda/ {print $3}'
awk has wildcards in regular expressions, not in string equality checks.
Technically the * is not a wildcard in a regex, it's a quantifier. The regex equivalent of the wildcard * of any number of any character would be .*.
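To illustrate the quantifier point, here is a small sketch on three made-up device names. In a regex, a* means "zero or more a", so the first pattern also accepts names with no a at all:

```shell
# Hypothetical first-column values.
sample='/dev/vda1
/dev/vd
/dev/vdb'

# a* is a quantifier: "/dev/vd" followed by zero or more "a" characters.
quantifier=$(printf '%s\n' "$sample" | awk '$1 ~ /^\/dev\/vda*/')

# .* is the regex counterpart of the shell wildcard: "/dev/vda" plus anything.
wildcard=$(printf '%s\n' "$sample" | awk '$1 ~ /^\/dev\/vda.*/')

echo "$quantifier"   # prints all three lines
echo "$wildcard"     # prints /dev/vda1
```

Because a regex match is unanchored at the end, the trailing .* is redundant here: /^\/dev\/vda/ selects the same lines.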

Related

Regex match last three integers and one character before that integers

I have been trying this for a long time, but my search keeps failing. I have the test data below:
mhie0104:x:104:600:Martinescu Horia:/home/scs/gr911/mhie0104:/bin/bash
mlie0105:x:105:600:Martinescu Laurentiu:/home/scs/gr911/mlie0105:/bin/bash
mmie0106:x:106:600:Martinescu Marius:/home/scs/gr911/mmie0106:/bin/bash
mnie0107:x:107:600:Martinescu Nicolae:/home/scs/gr911/mnie0107:/bin/bash
mpiel110:x:110:600:Malinescu Paul:/home/scs/gr911/mpie110:/bin/bash
I am trying to find the users whose names end in exactly three digits. This is what I did:
awk -F: '$1 ~ /*[a-z]//d{3}/'
My understanding of the regex above is:
"*" at the beginning should match any characters
[a-z] should match any letter just before the digits
Finally, three digits
I also tried the variation below:
awk -F: '$1 ~ /*?//d{3}/'
So what I need from the test data above is
mpiel110:x:110:600:Malinescu Paul:/home/scs/gr911/mpie110:/bin/bash
1st solution: If you only need to check the last 4 characters of the 1st field, where the 4th character from the end is NOT a digit and the last 3 are digits, try the following code.
awk -F':' '$1 ~ /[^0-9][0-9]{3}$/' Input_file
Explanation:
Set the field separator to : for every line of Input_file.
Then test the 1st field against /[^0-9][0-9]{3}$/: if the 4th character from the end is anything other than a digit and the last 3 are digits, the line is printed.
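As a quick check, the sketch below runs this condition on the test data from the question. It spells {3} out as [0-9][0-9][0-9], on the assumption that some awk builds (older mawk, for instance) lack interval expressions:

```shell
# The test data from the question.
data='mhie0104:x:104:600:Martinescu Horia:/home/scs/gr911/mhie0104:/bin/bash
mlie0105:x:105:600:Martinescu Laurentiu:/home/scs/gr911/mlie0105:/bin/bash
mmie0106:x:106:600:Martinescu Marius:/home/scs/gr911/mmie0106:/bin/bash
mnie0107:x:107:600:Martinescu Nicolae:/home/scs/gr911/mnie0107:/bin/bash
mpiel110:x:110:600:Malinescu Paul:/home/scs/gr911/mpie110:/bin/bash'

# A non-digit followed by exactly three digits at the end of field 1.
matched=$(printf '%s\n' "$data" | awk -F: '$1 ~ /[^0-9][0-9][0-9][0-9]$/ {print $1}')

echo "$matched"   # prints mpiel110
```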
2nd solution: If no character of the 1st field other than the last 3 may be a digit, and the last 3 characters must all be digits, try the following code.
awk -F':' '
substr($1,1,length($1)-3)!~/[0-9]/ && substr($1,length($1)-2)~/^[0-9]{3}$/
' Input_file
Explanation:
First, set the field separator to : for this awk program.
The condition substr($1,1,length($1)-3)!~/[0-9]/ uses awk's substr function to check that the 1st field, apart from its last 3 characters, contains no digit.
The condition substr($1,length($1)-2)~/^[0-9]{3}$/ then checks that the last 3 characters are all digits. The substring is tested as a string; converting it with int() first would drop leading zeros (012 becomes 12) and wrongly fail the 3-digit test.
If both conditions are TRUE, the line is printed.
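The two solutions differ on names that contain a digit earlier in the field. A small sketch on made-up usernames (all four names are assumptions, and {3} is again written out for awk builds without interval expressions):

```shell
# Hypothetical usernames, one per line.
names='mpiel110
ab1cd234
ab012
mhie0104'

# 1st solution: only the last four characters are inspected.
first=$(printf '%s\n' "$names" | awk '$1 ~ /[^0-9][0-9][0-9][0-9]$/')

# 2nd solution: no digit is allowed anywhere before the last three characters.
second=$(printf '%s\n' "$names" | awk '
  substr($1, 1, length($1) - 3) !~ /[0-9]/ &&
  substr($1, length($1) - 2) ~ /^[0-9][0-9][0-9]$/
')

echo "$first"    # prints mpiel110, ab1cd234, ab012 (one per line)
echo "$second"   # prints mpiel110, ab012 (one per line)
```

ab1cd234 passes the first test but not the second, because of the 1 inside its prefix.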
You can't use the \d notation here.
It is Perl-style regex syntax, which awk's POSIX regular expressions do not support; use [0-9] instead.
Solution:
$ awk -F: '$1 ~ /[a-zA-Z][0-9]{3}$/' file
mpiel110:x:110:600:Malinescu Paul:/home/scs/gr911/mpie110:/bin/bash
You can use negative lookbehind in perl
$ perl -F: -ne ' print if $F[0]=~/(?<!\d)\d{3}$/ ' gameiswar.txt
mpiel110:x:110:600:Malinescu Paul:/home/scs/gr911/mpie110:/bin/bash
$
For this particular task, sed might be used as well:
sed '/^[^0-9]*[0-9]\{3\}:/!d' file
I am not sure whether a username can consist only of digits, but since it is the first field and : is the delimiter, the regex can be anchored at the start of the line and end at the first :.
Here, ([^:]*[^0-9])? matches optional repetitions of any char except : followed by a char other than 0-9:
awk '/^([^:]*[^0-9])?[0-9]{3}:/' file
If there has to be a leading char a-z
awk '/^[^:]*[a-z][0-9]{3}:/' file
Output
mpiel110:x:110:600:Malinescu Paul:/home/scs/gr911/mpie110:/bin/bash
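The difference between the two whole-line patterns shows up on an all-digit username. The sketch below uses shortened, made-up passwd-style lines and writes {3} as [0-9][0-9][0-9] for awk builds without interval expressions:

```shell
# Hypothetical, shortened passwd-style lines.
lines='mpiel110:x:110
123:x:1
mhie0104:x:104'

# Optional prefix: an all-digit name such as 123 is accepted.
optional=$(printf '%s\n' "$lines" | awk '/^([^:]*[^0-9])?[0-9][0-9][0-9]:/')

# Mandatory a-z before the digits: 123 is rejected.
letter=$(printf '%s\n' "$lines" | awk '/^[^:]*[a-z][0-9][0-9][0-9]:/')

echo "$optional"   # prints mpiel110:x:110 and 123:x:1
echo "$letter"     # prints mpiel110:x:110
```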
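A terser variant uses the pattern as the field separator. If it matches, the line splits into more than one field, so !_<NF (that is, 1 < NF, because the unset variable _ is false) is true and the line is printed:
mawk '!_<NF' FS='^[^:]*[a-z][0-9][0-9][0-9]:'
or
gawk '!_<NF' FS='^[^:]*[a-z][0-9]{3}:'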
mpiel110:x:110:600:Malinescu Paul:/home/scs/gr911/mpie110:/bin/bash

Match string pattern - Replacement via awk-gsub with another pattern

AIM
I want to match a pattern in a string using its start and end boundaries.
I then want to replace the matched pattern with "ID=".
STRING
Class=Grainyhead.domain.factors;Family=CP2-related.factors;id=TFCP2.Ca9750.2.YY2017.HT-SE2;strand=+;seq=TTCTGGTTGGGACCAGGA;score=7.62921;pval=6.53e-05;Averageconservationscore=1.77
DESIRED PATTERN OF THE STRING TO BE MATCHED WITH A COMMAND IN AWK
PATTERN
Class=Grainyhead.domain.factors;Family=CP2-related.factors;id=
COMMAND
(/\Class=(.*);id=/)
AWK-GSUB
awk 'BEGIN{FS=OFS="\t"} {gsub(/\Class=(.*);id=/), "ID=", $4) 1'}
I am not sure about the use of (.*).
I commonly use it in R to select part of a string.
Can it be used in awk's gsub as well?
Your separator looks like ';' (not a tab).
To anchor the match at the start of the string, use '^' (not \) at the beginning of the regexp.
gsub without a third argument modifies $0, and awk then re-splits the record, so after the replacement you can select columns with $number.
awk 'BEGIN{FS=OFS=";"} {gsub(/^Class=(.*);id=/, "ID="); print $1, $6}' file > outputfile
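A minimal sketch of the replacement on the string from the question, using sub() since only one match is expected. The greedy .* runs up to the last ";id=" in the record, and fields are re-split after $0 changes:

```shell
# The string from the question.
record='Class=Grainyhead.domain.factors;Family=CP2-related.factors;id=TFCP2.Ca9750.2.YY2017.HT-SE2;strand=+;seq=TTCTGGTTGGGACCAGGA;score=7.62921;pval=6.53e-05;Averageconservationscore=1.77'

# Replace everything from the leading Class= through ;id= with ID=,
# then print the first ;-separated field of the modified record.
first_field=$(printf '%s\n' "$record" | awk -F';' '{ sub(/^Class=.*;id=/, "ID="); print $1 }')

echo "$first_field"   # prints ID=TFCP2.Ca9750.2.YY2017.HT-SE2
```

The parentheses around .* are harmless but unnecessary: standard awk's sub and gsub have no backreferences to captured groups.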

Gawk matching one word - one unexpected match

I want to get all rows where the string in Column 3 contains the exact word "aa" (case-insensitive match).
The gawk command used in the awk file is:
$3 ~ /\<aa\>/
The BEGIN statement specifies: IGNORECASE = 1
The command returns 20 rows. What is puzzling is this value in Column 3 in the returned rows:
aA.AHAB
How do I exclude this row? Here aA is not a word by itself, because the first two a's are followed by a dot and not a space.
A is a word character. . is not a word character. \> matches the zero-width string at the end of a word. Such a zero-width string occurs between A and ..
To search for the string aa delimited by space characters (or start/end of field):
$3 ~ /(^|[ ])aa([ ]|$)/
Add any other characters that you care about inside the set ([ ]).
Note that by default, awk splits records into fields on whitespace, so you will not get any spaces in $3 unless you have changed the value of FS.
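A sketch of the space-delimited match on made-up tab-separated rows, so that column 3 can actually contain spaces. It uses tolower() instead of IGNORECASE, on the assumption that the awk in use may not be gawk:

```shell
# Hypothetical tab-separated rows: id, filler, text.
rows=$(printf 'r1\tx\taA.AHAB\nr2\tx\tfoo aa bar\nr3\tx\tAA\nr4\tx\tbaab\n')

# aa must be bounded by a space or by the start/end of the field.
hits=$(printf '%s\n' "$rows" | awk -F'\t' 'tolower($3) ~ /(^|[ ])aa([ ]|$)/ {print $1}')

echo "$hits"   # prints r2 and r3; aA.AHAB (r1) and baab (r4) are rejected
```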
1st solution: OR to exactly match aa try:
awk 'BEGIN{IGNORECASE=1} $3 ~ /^aa$/' Input_file
2nd solution: OR without IGNORECASE option try:
awk 'tolower($3)=="aa"' Input_file
Question: Why does the awk regex pattern /\<aa\>/ match a string like "aa.bbb"?
We can quickly verify this with:
$ echo aa.bbb | awk '/\<aa\>/'
aa.bbb
The answer is simply found in the manual of gnu awk:
3.7 gawk-Specific Regexp Operators
GNU software that deals with regular expressions provides a number of additional regexp operators. These operators are described in this section and are specific to gawk; they are not available in other awk implementations. Most of the additional operators deal with word matching. For our purposes, a word is a sequence of one or more letters, digits, or underscores (‘_’):
\<: Matches the empty string at the beginning of a word. For example, /\<away/ matches "away" but not "stowaway".
\>: Matches the empty string at the end of a word. For example, /stow\>/ matches "stow" but not "stowaway".
source: GNU awk manual: Section 3 :: Regular Expressions
So, to come back to the example above, the string "aa.bbb" contains two words, "aa" and "bbb", since the <dot> character is not in the set of characters that can make up a word. The empty strings matched here are the one before "aa.bbb" and the one between the characters a and . (an empty string really is empty: length 0, 0 characters, commonly written as "").
Solution to the OP: Since FS is most likely the default value, the field $3 cannot have a space. So the following two solutions are possible:
$3 ~ /^aa$/
$3 == "aa"
If the field separator FS is defined in the code, so that $3 can contain spaces, the following might work:
(" " $3 " ") ~ / aa /
$3 ~ /(^|[ ])aa([ ]|$)/ # See solution of JHNC

Awk multi character field separator containing caret not working as expected

I have tried multiple google searches, but none of the proposed answers are working for my example below. NF should be 3, but I keep getting 1.
# cat a
1^%2^%3
# awk -F^% '{print NF}' a
1
# awk -F'^%' '{print NF}' a
1
# awk -F "^%" '{print NF}' a
1
The -F option in awk takes a regular expression as its value, so ^ is interpreted as the start-of-string anchor and must be stripped of its special meaning. You do that by escaping it with a backslash. The backslash is doubled because awk first processes the -F value as a string, where \\ stands for a single \.
awk -F'\\^%' '{ print NF }' a
from GNU Awk manual for Escape Sequences
The backslash character itself is another character that cannot be included normally; you must write \\ to put one backslash in the string or regexp. Thus, the string whose contents are the two characters " and \ must be written \"\\.
You should escape ^ to remove the special meaning it has in the regex used as the field separator. Once you escape it by writing \\^, it is treated as a literal character, ^% is taken as a plain string, and you get 3 as the answer.
awk -F'\\^%' '{print NF}' Input_file
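A short sketch of the escaped separator on the sample line, together with a split() call that takes a regex literal and so needs only one backslash. The variable names are illustrative:

```shell
line='1^%2^%3'

# The -F value is read as a string first (\\ becomes \), then used as the regex \^%.
nf_escaped=$(printf '%s\n' "$line" | awk -F'\\^%' '{print NF}')

# In a regex literal there is no string-escape pass, so a single backslash is enough.
nf_split=$(printf '%s\n' "$line" | awk '{print split($0, parts, /\^%/)}')

echo "$nf_escaped"   # prints 3
echo "$nf_split"     # prints 3
```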
Here is a useful SO answer that serves as a further example. It does not discuss the ^ character specifically, but it explains how to use escape sequences in an awk field separator.
https://stackoverflow.com/a/44072825/5866580

using a wildcard in awk

Using awk, I want to print all lines that have a string in the first column that starts with 22_
I tried the following, but obviously * does not work as a wildcard in awk:
awk '$1=="22_*" {print $0}' input > output
Is this possible in awk?
Let's start with a test file:
$ cat >file
22_something keep
23_other omit
To keep only lines that start with 22_:
$ awk '/^22_/' file
22_something keep
Alternatively, if you prefer to reference the first field explicitly, we could use:
$ awk '$1 ~ /^22_/' file
22_something keep
Note that we don't have to write {print $0} after the condition because that is exactly the default action that awk associates with a condition.
At the start of a regular expression, ^ matches the beginning of the string being tested: the whole line for /^22_/, or the field for $1 ~ /^22_/. Thus, if you want 22_ to occur at the start of a line or the start of a field, you write ^22_.
In the condition $1 ~ /^22_/, note that the operator is ~. That operator tells awk to check if the preceding string, $1, matches the regular expression ^22_.
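The contrast between == and ~ can be sketched on the test file contents plus one extra, made-up line whose first field is literally 22_*:

```shell
# The two test lines from above plus a hypothetical third one.
sample='22_something keep
23_other omit
22_* literal'

# == compares strings character by character; * has no special meaning.
exact=$(printf '%s\n' "$sample" | awk '$1 == "22_*"')

# ~ applies a regular expression; ^ anchors it at the start of the field.
regex=$(printf '%s\n' "$sample" | awk '$1 ~ /^22_/')

echo "$exact"   # prints only: 22_* literal
echo "$regex"   # prints: 22_something keep, then 22_* literal
```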
The chosen answer does not show how to write a wildcard in awk. The regex counterpart of the shell's * is .*, but it only works in a regex match (~), not in a string comparison (==):
awk '$1 ~ /^22_.*/ {print $0}' input > output