Gawk matching one word - one unexpected match - awk

I wanted to get all matches in Column 3 which have the exact word "aa" (case insensitive match) in the string in Column 3
The gawk command used in the awk file is:
$3 ~ /\<aa\>/
The BEGIN statement specifies: IGNORECASE = 1
The command returns 20 rows. What is puzzling is this value in Column 3 in the returned rows:
aA.AHAB
How do I avoid this row as it is not a word by itself because there is dot following the first two aa's and not a space?

A is a word character; . is not. \> matches the zero-width string at the end of a word, and such a zero-width string occurs between A and the dot.
To search for the string aa delimited by space characters (or start/end of field):
$3 ~ /(^|[ ])aa([ ]|$)/
Add any other characters that you care about inside the set ([ ]).
Note that by default, awk splits records into fields on whitespace, so you will not get any spaces in $3 unless you have changed the value of FS.
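A quick way to check the space-delimited pattern is a sketch like the one below. It assumes tab-separated sample rows (made up for illustration) so that column 3 can contain spaces, and it uses tolower() instead of IGNORECASE so that it should also run in awks other than gawk.

```shell
# Hypothetical sample rows: id, filler, column 3 (tab-separated).
result=$(printf 'r1\tx\taA.AHAB\nr2\tx\tfoo AA bar\nr3\tx\taa\n' |
  awk -F'\t' 'tolower($3) ~ /(^|[ ])aa([ ]|$)/ { print $1 }')
# r1 is rejected because the character after "aa" is a dot, not a space or end of field.
echo "$result"
```

Only r2 and r3 are printed; the aA.AHAB row is skipped.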

1st solution: To match exactly aa, try:
awk 'BEGIN{IGNORECASE=1} $3 ~ /^aa$/' Input_file
2nd solution: Without the IGNORECASE option, try:
awk 'tolower($3)=="aa"' Input_file

Question: Why does the awk regex-pattern /\<aa\>/ match a string like: "aa.bbb"?
We can quickly verify this with:
$ echo aa.bbb | awk '/\<aa\>/'
aa.bbb
The answer is simply found in the manual of gnu awk:
3.7 gawk-Specific Regexp Operators
GNU software that deals with regular expressions provides a number of additional regexp operators. These operators are described in this section and are specific to gawk; they are not available in other awk implementations. Most of the additional operators deal with word matching. For our purposes, a word is a sequence of one or more letters, digits, or underscores (‘_’):
\<: Matches the empty string at the beginning of a word. For example, /\<away/ matches "away" but not "stowaway".
\>:
Matches the empty string at the end of a word. For example, /stow\>/ matches "stow" but not "stowaway".
source: GNU awk manual: Section 3 :: Regular Expressions
So, to come back to the example above, the string "aa.bbb" contains two words, "aa" and "bbb", since the dot character is not part of the character set that can make up a word. The empty strings matched here are the empty string before "aa.bbb" and the empty string between the characters a and . (an empty string really is empty: length 0, 0 characters, commonly written as "").
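To see which "words" the manual's definition yields for the value from the question, one rough sketch is to split the string on every run of non-word characters. This uses plain split() with a hand-written character class rather than the gawk operators themselves, so it is only an illustration of the definition.

```shell
# Split on anything that is not a letter, digit, or underscore.
words=$(echo 'aA.AHAB' | awk '{
  n = split($0, w, /[^A-Za-z0-9_]+/)
  for (i = 1; i <= n; i++) print i ": " w[i]
}')
echo "$words"
```

The string falls apart into aA and AHAB, which is why \<aa\> (with IGNORECASE) finds a whole word in it.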
Solution to the OP: Since FS is most likely the default value, the field $3 cannot have a space. So the following two solutions are possible:
$3 ~ /^aa$/
$3 == "aa"
If the field separator FS is defined in the code, the following might work:
(" " $3 " ") ~ / aa /
$3 ~ /(^|[ ])aa([ ]|$)/ # See solution of JHNC

Related

Regex match last three integers and one character before that integers

I have been trying this for a long time, but my search keeps failing. I have the test data below:
mhie0104:x:104:600:Martinescu Horia:/home/scs/gr911/mhie0104:/bin/bash
mlie0105:x:105:600:Martinescu Laurentiu:/home/scs/gr911/mlie0105:/bin/bash
mmie0106:x:106:600:Martinescu Marius:/home/scs/gr911/mmie0106:/bin/bash
mnie0107:x:107:600:Martinescu Nicolae:/home/scs/gr911/mnie0107:/bin/bash
mpiel110:x:110:600:Malinescu Paul:/home/scs/gr911/mpie110:/bin/bash
I am trying to find the users who have exactly three digits at the end. Below is what I did:
awk -F: '$1 ~ /*[a-z]//d{3}/'
My understanding of the above regex is:
"*" at the beginning should match any characters
[a-z] should match any character just before the digits
Finally, three digits
I also tried the variation below:
awk -F: '$1 ~ /*?//d{3}/'
So what I need from the test data above is:
mpiel110:x:110:600:Malinescu Paul:/home/scs/gr911/mpie110:/bin/bash
1st solution: If you want to check only the last 4 characters of the 1st field, where the 4th-from-last character is NOT a digit, try the following code.
awk -F':' '$1 ~ /[^0-9][0-9]{3}$/' Input_file
Explanation:
Set the field separator to : for all lines of Input_file.
Then test the 1st field against /[^0-9][0-9]{3}$/: if the 4th character from the end is anything other than a digit and the last 3 characters are digits, print that line.
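The condition can be tried on two of the sample lines with a sketch like this. The three digits are spelled out as [0-9][0-9][0-9] on the assumption that the awk in use might not support the {3} interval syntax; with gawk the {3} form should behave the same.

```shell
users=$(printf '%s\n' \
  'mhie0104:x:104:600:Martinescu Horia:/home/scs/gr911/mhie0104:/bin/bash' \
  'mpiel110:x:110:600:Malinescu Paul:/home/scs/gr911/mpie110:/bin/bash' |
  awk -F: '$1 ~ /[^0-9][0-9][0-9][0-9]$/ { print $1 }')
# mhie0104 ends in four digits, so the character before the last three is a digit.
echo "$users"
```

Only mpiel110 is printed.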
2nd solution: In case none of the characters of the 1st field except the last 3 may be a digit, and the last 3 characters must be digits, try the following code.
awk -F':' '
substr($1,1,length($1)-3)!~/[0-9]/ && substr($1,length($1)-2)~/^[0-9]{3}$/
' Input_file
Explanation:
First, set the field separator to : for this awk program.
The condition substr($1,1,length($1)-3)!~/[0-9]/ uses awk's substr function to check that everything in the 1st field apart from the last 3 characters contains no digit.
The condition substr($1,length($1)-2)~/^[0-9]{3}$/ checks that the last 3 characters are 3 digits. The substring is compared as a string: wrapping it in int() would turn a value such as 010 into 10 and make the 3-digit test fail.
If both conditions are TRUE, the line is printed.
You can't use the \d notation here; it is Perl-style regex syntax, which awk does not support.
Solution:
$ awk -F: '$1 ~ /[a-zA-Z][0-9]{3}$/' file
mpiel110:x:110:600:Malinescu Paul:/home/scs/gr911/mpie110:/bin/bash
You can use negative lookbehind in perl
$ perl -F: -ne ' print if $F[0]=~/(?<!\d)\d{3}$/ ' gameiswar.txt
mpiel110:x:110:600:Malinescu Paul:/home/scs/gr911/mpie110:/bin/bash
For this particular task, sed might be used as well:
sed '/^[^0-9]*[0-9]\{3\}:/!d' file
Not sure whether a username can consist only of digits, but since it is the first field and : is the delimiter, the match can be anchored at the start of the line.
In the first command below, ([^:]*[^0-9])? optionally matches any run of characters other than : followed by one character other than 0-9:
awk '/^([^:]*[^0-9])?[0-9]{3}:/' file
If there has to be a leading char a-z
awk '/^[^:]*[a-z][0-9]{3}:/' file
Output
mpiel110:x:110:600:Malinescu Paul:/home/scs/gr911/mpie110:/bin/bash
mawk '!_<NF' FS='^[^:]*[a-z][0-9][0-9][0-9]:'
— or —
gawk '!_<NF' FS='^[^:]*[a-z][0-9]{3}:'
mpiel110:x:110:600:Malinescu Paul:/home/scs/gr911/mpie110:/bin/bash

Working with the awk line matching pattern

The tool awk has line pattern matching like
/pattern/ { statements; }
Is there any way to get the string of pattern as a variable, for use in match expressions etc?
Or even better, directly get:
pattern matched text
pattern matched length
match groups if there are any (groups) in the pattern
within the {statements} block?
If you use GNU awk and, instead of using /pattern/ in the condition part, use match and its third argument match(string, regexp [, array]) you get access to matched text, start index, length and the groups:
$ echo foobar |
gawk 'match($0, /(fo*)(b.*)/, a) {
print a[0],a[0,"start"],a[0,"length"] # 0 index refers to whole matched text
print a[2],a[2,"start"],a[2,"length"] # 1, 2, etc. to matched groups
}'
foobar 1 6
bar 4 3
See GNU awk documentation for match for more info.
Try the following.
1st: To get the matching text, match is the best option.
awk 'match($0,/regex/){print substr($0,RSTART,RLENGTH)}' Input_file
2nd: To get length of matched string:
awk 'match($0,/regex/){print RLENGTH}' Input_file
3rd: To get all matches in a line, call match in a while loop until it no longer finds a match.
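A minimal sketch of such a loop, assuming a made-up input line and the pattern [0-9]+ for illustration: after each hit the already-scanned prefix is cut off, so the next call to match continues behind the previous match.

```shell
all=$(echo 'id=12 len=345 x=6' | awk '{
  rest = $0
  # match() returns 0 once nothing is left to find. The pattern cannot match
  # the empty string, so each round consumes at least one character.
  while (match(rest, /[0-9]+/)) {
    print substr(rest, RSTART, RLENGTH)
    rest = substr(rest, RSTART + RLENGTH)
  }
}')
echo "$all"
```

This prints 12, 345 and 6 on separate lines. A pattern that can match the empty string would need an extra guard against looping forever.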

Retrieve matched regex record-separator using Gnu AWK

Using AWK, I am processing a text file by splitting it into multiple records. As a record separator RS I use a regular expression. Is there a way to obtain the found record separator as RS only represents the regex string?
Example:
BEGIN { RS="a[0-9]*. "; ORS="\n-----\n"}
/foo/ {print $0 RS;}
END {}
input file:
a1. Hello
this
is foo
a2. hello
this
is bar
a3. Hello
this
is foo
output:
Hello
this
is foo
a[0-9]*.
-----
Hello
this
is foo
a[0-9]*.
-----
As you see, the output is printing RS as a string representing the regular expression, but not printing the actual value.
How can I retrieve the actual matched value of the record separator?
expected output:
Hello
this
is foo
a1
-----
Hello
this
is foo
a3
-----
In POSIX-compliant AWK, the record separator RS is only a single character, so it is easy to print it back:
awk 'BEGIN{RS="a"}{print $0 RS}'
GNU AWK, on the other hand, does not limit RS to be a one-character string but allows it to be any regular expression. In this case, it becomes a bit more tricky to use the above AWK because RS is a regular expression and not a string.
To this end, GNU AWK introduced the variable RT which represents nothing more than the found record separator. When RS is a single character, RT contains the same single character. However, when RS is a regular expression, RT contains the actual input text that matched the regular expression.
So naively, one could update your AWK program as:
BEGIN{RS="a[0-9]+[.] "; ORS="\n-----\n"}
/foo/{print $0 RT}
Unfortunately, RT is set to the separator found after the current record, and it seems the OP wants the one before the current record. Hence you can introduce a new variable pRT, which could be read as the previous record separator found.
BEGIN{RS="a[0-9]+[.] "; ORS="\n-----\n"}
/foo/{print $0 pRT}{pRT=RT}
and, as Shaki Siegal pointed out in the comments, you still have to update pRT to remove the trailing dot and space:
BEGIN{RS="a[0-9]+[.] "; ORS="\n-----\n"}
/foo/{print $0 pRT}{pRT=RT;sub(/[.] $/,"",pRT)}
Note: the original RS of the OP (RS="a[0-9]*. ") has been changed to RS="a[0-9]+[.] " for more precise matching. This ensures that a number follows the a and that an actual dot (.) follows the number.
If, as the original example indicates, the record separator always appears at the beginning of a line, RS should be slightly modified into RS="(^|\n)a[0-9]+[.] ". Dito's comment also made various excellent points. So if the string a[0-9]+. always appears at the beginning of a line, you need to do a bit more processing:
BEGIN {
RS ="(^|\n)a[0-9]+[.] ";
ORS="\n-----\n"
}
/foo/ {
if (RT ~ /^$/ && NR != 2) pRT = substr(pRT,2)
print $0 pRT
}
{pRT=RT;sub(/[.] $/,"",pRT)}
Here, we added a correction to fix the last record.
If there are more than two AWK records (the first record is always empty), you need to remove the first new-line character from pRT, otherwise you include an extra new-line caused by the last record, which ends with a new-line (in contrast to all others).
If there are only two AWK records (one effective in the text), then you should not do this correction, as the first RT does not start with a new-line.
The final improvement is done by realising that we always remove the initial newline in pRT if it is there, so we can merge it all in a single gsub:
BEGIN {
RS ="(^|\n)a[0-9]+[.] ";
ORS="\n-----\n"
}
/foo/ { print $0 pRT }
{pRT=RT;gsub(/^\n|[.] $/,"",pRT)}
RS: The input record separator. Its default value is a string containing a single newline character, which means that an input record consists of a single line of text. It can also be the null string, in which case records are separated by runs of blank lines. If it is a regexp, records are separated by matches of the regexp in the input text.
The ability for RS to be a regular expression is a gawk extension. In most other AWK implementations, or if gawk is in compatibility mode (see Options), just the first character of RS’s value is used.
ORS: The output record separator. It is output at the end of every print statement. Its default value is "\n", the newline character.
RT: (GNU AWK specific) The input text that matched the text denoted by RS, the record separator. It is set every time a record is read.
source: GNU AWK manual
This might work for you (GNU sed):
sed -rn '/^a[0-9]+\.\s/{:a;x;/foo/{s/^(a[0-9]+\.)\s*(.*)/\2\n\1\n-----/p;$d};x;h;b};H;$ba' file
Gather up lines that begin an. where n is an integer. If the line(s) contain the word foo make the required substitution and print the results otherwise do nothing.
Apology: When I began the solution the question was tagged sed.
When a line beginning with an. is encountered, this line replaces whatever was in the hold space. Before it does, however, the hold space is checked: if it contains the word foo (i.e. a collection already exists), the requirements for processing are met, so the lines are formatted as required and printed. Other lines are appended to the hold space. The end of the file is a special case that must be handled the same way as a line beginning with an.; this is allowed for by the goto label :a.
With GNU awk, which you're already using for multi-char RS, the builtin variable that contains the string that matched the RS regexp is RT.
We need to fix your RS setting, though. You need a regexp for RS that matches either a<integer><dot><blank> at the start of a line ((^|\n)a[0-9]+[.] ) or a newline on its own at the end of the file (\n$), so that the last record in the file is parsed the same as all the rest; below is how to write that. Note that RT will start with a newline for all except the very first match in the file, so we need to strip that leading newline from RT to get the actual identifier we want to print for each record:
$ cat tst.awk
BEGIN {
RS = "(^|\n)a[0-9]+[.] |\n$"
ORS = "\n-----\n"
}
/foo/ { print $0 "\n" id }
{ id = gensub(/^\n|[.] /,"","g",RT) }
Here's what it does given this input which includes more rainy-day cases than are present in the question (you should test other proposed solutions against this):
input:
$ cat file
a1. Hello
this
is foo bat man
a2. hello
this
is bar
a3. Hello
this is a7. just fine
is foo
output:
$ awk -f tst.awk file
Hello
this
is foo bat man
a1
-----
Hello
this is a7. just fine
is foo
a3
-----
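For awks that have neither a regexp RS nor RT, a rough alternative is to read the file line by line and buffer each record by hand. This swaps in a different technique from the RT-based answers above; the function name emit and the variables buf and id are made up for this sketch, and it uses the question's original sample input.

```shell
out=$(printf '%s\n' 'a1. Hello' 'this' 'is foo' 'a2. hello' 'this' 'is bar' \
                    'a3. Hello' 'this' 'is foo' |
awk '
function emit() {                  # print the buffered record if it mentions foo
  if (buf ~ /foo/) print buf "\n" id "\n-----"
}
/^a[0-9]+[.] / {                   # a separator at the start of a line opens a new record
  emit()                           # id still holds the identifier of the finished record
  id = $1; sub(/[.]$/, "", id)     # "a1." becomes "a1"
  sub(/^a[0-9]+[.] /, "")          # drop the separator text from the line
  buf = $0
  next
}
{ buf = buf "\n" $0 }              # any other line belongs to the current record
END { emit() }                     # the last record has no following separator
')
echo "$out"
```

On the sample input this should reproduce the expected output from the question (the a1 and a3 records). Lines that appear before the first separator are not handled specially here.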

Filter fields with multiple delimiters

I've done extensive searching for a solution but can't quite find what I need. Have a file like this:
aaa|bbb|ccc|ddd~eee^fff^ggg|hhh|iii
111|222|333|444~555^666^777|888|999
AAA|BBB|CCC||EEE|FFF
What I want to do is use awk or something else to return lines from this file with a change to field 4(pipe delimited). Field 4 has a tilde and caret as delimiters which is where I'm struggling. We want the lines returned as this:
aaa|bbb|ccc|eee|hhh|iii
111|222|333|555|888|999
AAA|BBB|CCC||EEE|FFF
If field 4 is empty, it's returned as is. But when field 4 has multiple values, we want the first value right after the tilde returned only.
awk -F "[|^~]" 'BEGIN{OFS="|"}NF==6{print} NF==9{print $1,$2,$3,$5,$8,$9}' tmp.txt
aaa|bbb|ccc|eee|hhh|iii
111|222|333|555|888|999
AAA|BBB|CCC||EEE|FFF
use a regular expression as your delimiter
count the fields to decide what to do
set the output delimiter to pipe
$ awk -F'|' '{sub(/^[^~]*~/, "", $4); sub(/\^.*/, "", $4)} 1' OFS='|' file
aaa|bbb|ccc|eee|hhh|iii
111|222|333|555|888|999
AAA|BBB|CCC||EEE|FFF
This approach makes no assumption about the contents of fields other than field 4. The other fields may, for example, contain ~ or ^ characters and that will not affect the results.
How it works
-F'|'
This sets the field delimiter on input to |.
sub(/^[^~]*~/, "", $4)
If field 4 contains a ~, this removes the first ~ and everything before the first ~.
sub(/\^.*/, "", $4)
If field 4 contains ^, this removes the first ^ and everything after it.
1
This is awk's cryptic shorthand for print-the-line.
OFS='|'
This sets the field separator on output to |.
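The claim that other fields may contain ~ or ^ can be checked with a made-up line in which fields 1 and 2 carry those characters. This is only a sketch on hypothetical data, reusing the command from above unchanged.

```shell
line=$(echo 'a~b|c^d|ccc|ddd~eee^fff^ggg|hhh|iii' |
  awk -F'|' '{sub(/^[^~]*~/, "", $4); sub(/\^.*/, "", $4)} 1' OFS='|')
# Only field 4 is rewritten; the ~ in field 1 and the ^ in field 2 are left alone.
echo "$line"
```

The result is a~b|c^d|ccc|eee|hhh|iii, whereas splitting on [|^~] would have miscounted the fields for this line.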

Escaping separator within double quotes, in awk

I am using awk to parse my data with "," as separator as the input is a csv file. However, there are "," within the data which is escaped by double quotes ("...").
Example
filed1,filed2,field3,"field4,FOO,BAR",field5
How can I ignore the comma within the double quotes so that I can parse the output correctly using awk? I know we can do this in Excel, but how do we do it in awk?
It's easy, with GNU awk 4:
zsh-4.3.12[t]% awk '{
for (i = 0; ++i <= NF;)
printf "field %d => %s\n", i, $i
}' FPAT='([^,]+)|("[^"]+")' infile
field 1 => filed1
field 2 => filed2
field 3 => field3
field 4 => "field4,FOO,BAR"
field 5 => field5
Adding some comments as per OP requirement.
From the GNU awk manual on "Defining Fields by Content":
The value of FPAT should be a string that provides a regular
expression. This regular expression describes the contents of each
field. In the case of CSV data as presented above, each field is
either “anything that is not a comma,” or “a double quote, anything
that is not a double quote, and a closing double quote.” If written as
a regular expression constant, we would have /([^,]+)|("[^"]+")/. Writing this as a string
requires us to escape the double quotes, leading to:
FPAT = "([^,]+)|(\"[^\"]+\")"
Using + twice, this does not work properly for empty fields, but it can be fixed as well:
As written, the regexp used for FPAT requires that each field contain at least one character. A straightforward modification (changing the first ‘+’ to ‘*’) allows fields to be empty:
FPAT = "([^,]*)|(\"[^\"]+\")"
FPAT works when there are newlines and commas inside the quoted fields, but not when there are double quotes, like this:
field1,"field,2","but this field has ""escaped"" quotes"
You can use a simple wrapper program I wrote called csvquote to make data easy for awk to interpret, and then restore the problematic special characters, like this:
csvquote inputfile.csv | awk -F, '{print $4}' | csvquote -u
See https://github.com/dbro/csvquote for code and docs
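Staying in plain awk, one possible sketch is to peel fields off the front of the line with match(), trying a quoted field (where "" stands for an escaped quote) before a bare one. This is a hand-rolled illustration, not what FPAT or csvquote do internally; it assumes one record per line (no newlines inside quoted fields), and the sample line is made up.

```shell
fields=$(printf '%s\n' 'field1,"field,2","has ""escaped"" quotes",,last' | awk '{
  rest = $0; n = 0
  while (1) {
    # Try a quoted field first; otherwise take everything up to the
    # next comma, which may be nothing at all for an empty field.
    if (!match(rest, /^"([^"]|"")*"/)) match(rest, /^[^,]*/)
    len = RLENGTH
    printf "field %d => [%s]\n", ++n, substr(rest, 1, len)
    if (len == length(rest)) break      # nothing left after this field
    rest = substr(rest, len + 2)        # skip the field and its trailing comma
  }
}')
echo "$fields"
```

The quoted fields are printed with their surrounding quotes intact, and the empty fourth field shows up as []. Unquoting and collapsing "" to " would be a further step.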
Fully fledged CSV parsers such as Perl's Text::CSV_XS are purpose-built to handle that kind of weirdness.
Suppose you only want to print the 4th field:
perl -MText::CSV_XS -lne 'BEGIN{$csv=Text::CSV_XS->new()} if($csv->parse($_)){ @f=$csv->fields(); print "\"$f[3]\"" }' file
The input line is split into the array @f
Field 4 is $f[3] since Perl starts indexing at 0
I provided more explanation of Text::CSV_XS within my answer here: parse csv file using gawk