Retrieve matched regex record-separator using Gnu AWK - awk

Using AWK, I am processing a text file by splitting it into multiple records. As a record separator RS I use a regular expression. Is there a way to obtain the found record separator as RS only represents the regex string?
Example:
BEGIN { RS="a[0-9]*. "; ORS="\n-----\n"}
/foo/ {print $0 RS;}
END {}
input file:
a1. Hello
this
is foo
a2. hello
this
is bar
a3. Hello
this
is foo
output:
Hello
this
is foo
a[0-9]*.
-----
Hello
this
is foo
a[0-9]*.
-----
As you see, the output is printing RS as a string representing the regular expression, but not printing the actual value.
How can I retrieve the actual matched value of the record separator?
expected output:
Hello
this
is foo
a1
-----
Hello
this
is foo
a3
-----

In POSIX-compliant AWK, the record separator RS is only a single character, so it is easy to print it back, as in:
awk 'BEGIN{RS="a"}{print $0 RS}'
GNU AWK, on the other hand, does not limit RS to a one-character string but allows it to be any regular expression. In this case, the above approach no longer works, because RS holds the regular expression and not the text that matched it.
To this end, GNU AWK introduced the variable RT which represents nothing more than the found record separator. When RS is a single character, RT contains the same single character. However, when RS is a regular expression, RT contains the actual input text that matched the regular expression.
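A minimal sketch of the difference between RS and RT, assuming gawk is installed under the name gawk. The sample input x1y22z and the shell variable rt_demo are made up for illustration:

```shell
# RT is gawk-only, so run the demo only when gawk is available.
if command -v gawk >/dev/null 2>&1; then
    # RS is a regex matching runs of digits; RT holds the text that matched.
    rt_demo=$(printf 'x1y22z' | gawk -v RS='[0-9]+' '{ print $0 " [" RT "]" }')
    printf '%s\n' "$rt_demo"
fi
```

With gawk this prints x [1], y [22] and z [] on separate lines. The last RT is empty because the input ends without another separator.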
So naively, one could update your AWK program as:
BEGIN{RS="a[0-9]+[.] "; ORS="\n-----\n"}
/foo/{print $0 RT}
Unfortunately, RT is set to the separator found after the current record, while the OP seems to want the one before the current record. Hence you can introduce a new variable pRT, which can be read as "previous record separator found":
BEGIN{RS="a[0-9]+[.] "; ORS="\n-----\n"}
/foo/{print $0 pRT}{pRT=RT}
As Shaki Siegal pointed out in the comments, you still have to update pRT to remove the final dot and space:
BEGIN{RS="a[0-9]+[.] "; ORS="\n-----\n"}
/foo/{print $0 pRT}{pRT=RT;sub(/[.] $/,"",pRT)}
Note: The original RS of the OP (RS="a[0-9]*. ") has been changed to RS="a[0-9]+[.] " for improved matching. This ensures that at least one digit follows a and that the digits are followed by an actual dot.
If, as the original example indicates, the record separator always appears at the beginning of a line, RS should be slightly modified into RS="(^|\n)a[0-9]+[.] ". Dito's comment also made various excellent points. So if the string a[0-9]+. always appears at the beginning of a line, you need to do a bit more processing:
BEGIN {
RS ="(^|\n)a[0-9]+[.] ";
ORS="\n-----\n"
}
/foo/ {
if (RT ~ /^$/ && NR != 2) pRT = substr(pRT,2)
print $0 pRT
}
{pRT=RT;sub(/[.] $/,"",pRT)}
Here, we added a correction to fix the last record.
If there are more than two AWK records (the first record is always empty), you need to remove the first newline character from pRT; otherwise you include an extra newline caused by the last record, which ends with a newline (in contrast to all others).
If there are only two AWK records (one effective record in the text), you should not apply this correction, as the first RT does not start with a newline.
The final improvement is done by realising that we always remove the initial newline in pRT if it is there, so we can merge it all in a single gsub:
BEGIN {
RS ="(^|\n)a[0-9]+[.] ";
ORS="\n-----\n"
}
/foo/ { print $0 pRT }
{pRT=RT;gsub(/^\n|[.] $/,"",pRT)}
RS: The input record separator. Its default value is a string containing a single newline character, which means that an input record consists of a single line of text. It can also be the null string, in which case records are separated by runs of blank lines. If it is a regexp, records are separated by matches of the regexp in the input text.
The ability for RS to be a regular expression is a gawk extension. In most other AWK implementations, or if gawk is in compatibility mode (see Options), just the first character of RS’s value is used.
ORS: The output record separator. It is output at the end of every print statement. Its default value is "\n", the newline character.
RT: (GNU AWK specific) The input text that matched the text denoted by RS, the record separator. It is set every time a record is read.
source: GNU AWK manual

This might work for you (GNU sed):
sed -rn '/^a[0-9]+\.\s/{:a;x;/foo/{s/^(a[0-9]+\.)\s*(.*)/\2\n\1\n-----/p;$d};x;h;b};H;$ba' file
Gather up lines starting from a line that begins with a<n>. where <n> is an integer. If the gathered lines contain the word foo, make the required substitution and print the result; otherwise do nothing.
Apology: When I began the solution the question was tagged sed.
When a line beginning with a<n>. is encountered, this line replaces whatever was in the hold space. Before it does, however, the hold space is checked: if it contains the word foo, i.e. a collection already exists and meets the requirements, the lines are formatted as required and printed. Other lines are appended to the hold space. End-of-file is a special case that must be handled the same way as a line beginning with a<n>.; this is done by jumping to the label :a.

With GNU awk, which you're already using for multi-char RS, the builtin variable that contains the string that matched the RS regexp is RT.
We need to fix your RS setting though. You need a regexp for RS that matches either a<integer><dot><blank> at the start of a line ((^|\n)a[0-9]+[.]) or a newline on its own at the end of the file (\n$), so that the last record in the file is parsed the same as all the rest. Below is how to write that. Note that RT will start with a newline for all except the very first match in the file, so we need to strip that leading newline from RT to get the actual identifier we want to print for each record:
$ cat tst.awk
BEGIN {
RS = "(^|\n)a[0-9]+[.] |\n$"
ORS = "\n-----\n"
}
/foo/ { print $0 "\n" id }
{ id = gensub(/^\n|[.] /,"","g",RT) }
Here's what it does given this input which includes more rainy-day cases than are present in the question (you should test other proposed solutions against this):
input:
$ cat file
a1. Hello
this
is foo bat man
a2. hello
this
is bar
a3. Hello
this is a7. just fine
is foo
output:
$ awk -f tst.awk file
Hello
this
is foo bat man
a1
-----
Hello
this is a7. just fine
is foo
a3
-----

Related

Gawk matching one word - one unexpected match

I wanted to get all matches in Column 3 which have the exact word "aa" (case insensitive match) in the string in Column 3
The gawk command used in the awk file is:
$3 ~ /\<aa\>/
The BEGIN statement specifies: IGNORECASE = 1
The command returns 20 rows. What is puzzling is this value in Column 3 in the returned rows:
aA.AHAB
How do I avoid this row as it is not a word by itself because there is dot following the first two aa's and not a space?
A is a word character. . is not a word character. \> matches the zero-width string at the end of a word. Such a zero-width string occurs between A and ..
To search for the string aa delimited by space characters (or start/end of field):
$3 ~ /(^|[ ])aa([ ]|$)/
Add any other characters that you care about inside the set ([ ]).
Note that by default, awk splits records into fields on whitespace, so you will not get any spaces in $3 unless you have changed the value of FS.
1st solution: To match exactly aa, try:
awk 'BEGIN{IGNORECASE=1} $3 ~ /^aa$/' Input_file
2nd solution: Or, without the IGNORECASE option, try:
awk 'tolower($3)=="aa"' Input_file
Question: Why does the awk regex pattern /\<aa\>/ match a string like "aa.bbb"?
We can quickly verify this with:
$ echo aa.bbb | awk '/\<aa\>/'
aa.bbb
The answer is simply found in the manual of gnu awk:
3.7 gawk-Specific Regexp Operators
GNU software that deals with regular expressions provides a number of additional regexp operators. These operators are described in this section and are specific to gawk; they are not available in other awk implementations. Most of the additional operators deal with word matching. For our purposes, a word is a sequence of one or more letters, digits, or underscores (‘_’):
\<: Matches the empty string at the beginning of a word. For example, /\<away/ matches "away" but not "stowaway".
\>:
Matches the empty string at the end of a word. For example, /stow\>/ matches "stow" but not "stowaway".
source: GNU awk manual: Section 3 :: Regular Expressions
So to come back to the example from above, the string "aa.bbb" contains two words, "aa" and "bbb", since the <dot>-character is not part of the character set that can build up a word. The empty strings matched here are the empty string before "aa.bbb" and the empty string between the characters a and . (an empty string is really an empty string: length 0, 0 characters, commonly written as "").
Solution to the OP: Since FS is most likely the default value, the field $3 cannot have a space. So the following two solutions are possible:
$3 ~ /^aa$/
$3 == "aa"
If the field separator FS is defined in the code, the following might work:
(" " $3 " ") ~ / aa /
$3 ~ /(^|[ ])aa([ ]|$)/ # See solution of JHNC
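As a quick check of the exact-match approach, here is a small sketch that should run on any POSIX awk. The three-column sample rows and the shell variable exact are invented for illustration; only the second row has exactly "aa" (in any letter case) in column 3:

```shell
# Hypothetical sample: column 3 is aA.AHAB, Aa and aab respectively.
exact=$(printf 'r1 x aA.AHAB\nr2 x Aa\nr3 x aab\n' | awk 'tolower($3) == "aa"')
printf '%s\n' "$exact"
```

Only the row r2 x Aa is printed; aA.AHAB is rejected because the whole field is compared rather than a word inside it.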

AWK doesn't update record with new separator

I'm a bit confused with awk (I'm totally new to awk)
find static/*
static/conf
static/conf/server.xml
my goal is to remove 'static/' from the result
First step:
find static/* | awk -F/ '{print $(0)}'
static/conf
static/conf/server.xml
Same result. I expected it. Now deleting the first part:
find static/* | awk -F/ '{$1="";print $(0)}'
conf
conf server.xml
that's nearly good, but I don't know why the delimiter is killed
But I can deal with it just adding the delimiter to the output:
find static/* | awk -F/ '{$1="";OFS=FS;print $(0)}'
conf
/conf/server.xml
OK now I'm completely lost.
Why is a '/' on the second line and not on the first? In both cases I deleted the first column.
Any explanations, ideas.
BTW my preferred output would be
conf
conf/server.xml
Addendum: thank you for your kind answers. they will help me to fix the problem.
However I want to understand why the first '/' is deleted in my last try. To make it a bit clearer:
find static/* | awk -F/ '{$1="";OFS="#";print $(0)}'
conf
^ a space and no / ?
#conf#server.xml
but I don't know why the delimiter is killed.
Whenever you redefine a field in awk using a statement like:
$n = new_value
awk will rebuild the current record $0 and automatically replace all field separators defined by FS, by the output field separator OFS (see below). The default value of OFS is a single space. This implies the following:
awk -F/ '{$1="";print $(0)}'
The field separator FS is set to a single <slash>-character. The first field is reset to "" which enables the re-evaluation of $0 by which all regular expression matches corresponding to FS are replaced by the string OFS which is currently a single space.
awk -F/ '{$1="";OFS=FS;print $(0)}'
The same action applies as earlier. However, after the re-computation of $0, the output field separator OFS is set to FS. This implies that from record 2 onward, you will not replace FS with a space, but with the value of FS.
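The rebuild of $0 with OFS can be seen in a small sketch that avoids changing OFS in the middle of a record. The shell variables sample, default_ofs and slash_ofs are names chosen here for illustration:

```shell
sample='static/conf
static/conf/server.xml'

# Default OFS (a single space): the rebuilt record is joined with spaces.
default_ofs=$(printf '%s\n' "$sample" | awk -F/ '{ $1 = ""; print }')

# OFS set in BEGIN, before any record is rebuilt: slashes are kept.
slash_ofs=$(printf '%s\n' "$sample" | awk 'BEGIN { FS = OFS = "/" } { $1 = ""; print }')

printf '%s\n---\n%s\n' "$default_ofs" "$slash_ofs"
```

In both cases the emptied first field is still there, which is why every output line starts with a separator (a space in the first run, a slash in the second).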
Possible solution with same ideology
awk 'BEGIN{FS=OFS="/"}{$1=""}{print substr($0,2)}'
The substring function substr is needed to remove the first /
DESCRIPTION
The awk utility shall interpret each input record as a sequence of fields where, by default, a field is a string of non- <blank> non- <newline> characters. This default <blank> and <newline> field delimiter can be changed by using the FS built-in variable or the -F sepstring option. The awk utility shall denote the first field in a record $1, the second $2, and so on. The symbol $0 shall refer to the entire record; setting any other field causes the re-evaluation of $0. Assigning to $0 shall reset the values of all other fields and the NF built-in variable.
Variables and Special Variables
References to nonexistent fields (that is, fields after $NF), shall evaluate to the uninitialized value. Such references shall not create new fields. However, assigning to a nonexistent field (for example, $(NF+2)=5) shall increase the value of NF; create any intervening fields with the uninitialized value; and cause the value of $0 to be recomputed, with the fields being separated by the value of OFS. Each field variable shall have a string value or an uninitialized value when created. Field variables shall have the uninitialized value when created from $0 using FS and the variable does not contain any characters.
source: POSIX standard: awk utility
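The rule about assigning to a nonexistent field can be tried out directly. This is a minimal sketch, with OFS set to - only to make the separators visible; the shell variable grown is a name made up here:

```shell
# Assigning to $(NF+2) raises NF from 2 to 4 and creates an empty $3.
grown=$(echo 'a b' | awk -v OFS=- '{ $(NF+2) = 5; print NF ": " $0 }')
printf '%s\n' "$grown"
```

This prints 4: a-b--5, showing the empty intervening field and the record rebuilt with OFS.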
Be aware that the default field separator FS=" " has some special rules
If you have GNU find you don't need awk at all.
$ find static/ -mindepth 1 -printf '%P\n'
conf
conf/server.xml
1st solution: Assuming the string static/ occurs only once per line, try the following. It uses the string static/ as the field separator and prints the last field of each line, which is whatever follows static/.
find static/* | awk -F'static/' '{print $NF}'
2nd solution: A more generic solution. It matches everything from the first occurrence of / to the end of the line and prints the match without the leading /.
find static/* | awk 'match($0,/\/.*/){print substr($0,RSTART+1,RLENGTH)}'
When you reset the first field value the field is still there. Just remove the initial / chars after that with sub(/^\/+/, "") (where ^\/+ pattern matches one or more / chars at the start of the string):
awk 'BEGIN{OFS=FS="/"} {$1="";sub(/^\/+/, "")}1'
See an online demo:
s="static/conf
static/conf/server.xml"
awk 'BEGIN{OFS=FS="/"} {$1="";sub(/^\/+/, "")}1' <<< "$s"
Output:
conf
conf/server.xml
Note that with BEGIN{OFS=FS="/"} you set the input/output field separator just once at the start, and 1 at the end triggers the default line print operation.

Can RS be set "empty" to split string characters to records?

Is there a way in awk—gawk most likely—to set the record separator RS to empty value to process each character of a string as a separate record? Kind of like setting the FS to empty to separate each character in its own field:
$ echo abc | awk -F '' '{print $2}'
b
but to separate them each as a separate record, like:
$ echo abc | awk -v RS='?' '{print $0}'
a
b
c
The most obvious one:
$ echo abc | awk -v RS='' '{print $0}'
abc
didn't award me (as that one was apparently meant for something else per GNU awk documentation).
Am I basically stuck using for etc.?
EDIT:
@xhienne's answer was what I was looking for but even using that (20 chars and a questionable variable A :):
$ echo abc | awk -v A="\n" -v RS='(.)' -v ORS="" '{print(RT==A?NR:RT)}'
abc4
wouldn't help me shorten my earlier code using length. Then again, how could I win the Pyth code: +Qfql+Q :D.
If you just want to print one character per line, @klashxx's answer is OK. But a sed 's/./&\n/g' would be shorter since you are golfing.
If you truly want a separate record for each character, the best approaching solution I have found for you is:
echo -n abc | awk -v RS='(.)' '{ print RT }'
(use gawk; your input character is in RT, not $1)
[update] If RS is set to the null string, it means to awk that records are separated by blank lines. If I had just defined RS='.', the record separator would have been a mere dot (i.e. a fixed string). But if its length is more than one character, one feature of gawk is to consider RS as a regex. So, what I did here is to give gawk a regex meaning "each character" as a record separator. And I use another feature of gawk: to retrieve the string that matched the regex in the special variable RT (record terminator)
Here are the relevant parts of the gawk manual:
Normally, records are separated by newline characters. You can control how records are separated by assigning values to the built-in variable RS. If RS is any single character, that character separates records. Otherwise, RS is a regular expression. Text in the input that matches this regular expression separates the record.
If RS is set to the null string, then records are separated by blank lines.
Gawk sets RT to the input text that matched the character or regular expression specified by RS.
It is not possible
The empty string "" (a string without any characters) has a special
meaning as the value of RS. It means that records are separated by one
or more blank lines and nothing else.
A simple alternative:
echo abc | awk 'BEGIN{FS="";OFS="\n"}$1=$1'
No, there is no setting of RS that will do what you want. It looks like your requirement is to append a newline after every character that is not a newline; if so, this will produce the output you want:
$ echo 'abc' | awk -v ORS= 'gsub(/[^\n]/,"&\n")'
a
b
c
That will work on any awk on any UNIX system.
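If each character needs to be processed rather than just printed, the portable fallback is the for loop the question mentions. A minimal sketch (the index prefix and the shell variable chars are only there for illustration):

```shell
chars=$(echo abc | awk '{
    n = length($0)
    for (i = 1; i <= n; i++)
        print i ": " substr($0, i, 1)   # one character per iteration
}')
printf '%s\n' "$chars"
```

This is not golfed, but it does not rely on any gawk extension.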

Add a number by subtracting an existing number by awk

I would like to convert
Title Page/4,Black,notBold,notItalic,open,TopLeftZoom,0,0,0.0
Contents/16,Black,notBold,notItalic,open,TopLeftZoom,0,0,0.0
to
Title Page 1/4,Black,notBold,notItalic,open,TopLeftZoom,0,0,0.0
Contents 13/16,Black,notBold,notItalic,open,TopLeftZoom,0,0,0.0
The rule is to subtract 3 from the number following / and add that result in front of /.
I tried to do that with awk.
awk -F',/' '{gsub(/\//, ($2-10) + "\/"}' myfile
but it doesn't work. Why is it? Thanks.
A slight modification to your attempt produces the desired output:
$ awk -F'[,/]' '{sub(/\//, " " ($2-3) "/") }1' file
Title Page 1/4,Black,notBold,notItalic,open,TopLeftZoom,0,0,0.0
Contents 13/16,Black,notBold,notItalic,open,TopLeftZoom,0,0,0.0
-F is used to specify the input field separator. I have changed it to a bracket expression which matches either a comma or a slash, which means that the second field $2 contains the number following the slash. As you are only interested in making a single substitution in each record, I have used sub rather than gsub. Note that in awk, strings are concatenated simply by writing them next to each other (you shouldn't use +).
Awk programs are structured like condition { action }. If no condition is specified, the action block is always run. If no action is specified, the default action is { print }, which prints the record. In the above script, 1 is used to print the record, as it is the simplest expression that evaluates to true.
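The three forms described above can be compared side by side. This is a small sketch on a two-line input; the shell variable names second, tagged and all are arbitrary:

```shell
# Pattern only: the default action { print } runs for matching records.
second=$(printf 'a\nb\n' | awk 'NR == 2')

# Action only: runs for every record. Strings are concatenated by juxtaposition.
tagged=$(printf 'a\nb\n' | awk '{ print NR ":" $0 }')

# The constant pattern 1 is always true, so every record is printed.
all=$(printf 'a\nb\n' | awk '1')

printf '%s\n' "$second" "$tagged" "$all"
```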

why the last new-line-character not replaced

the file to be processed by awk.
hello world
hello Jack
hello Jim
Hello Marry
Hello Bob
Hello Everyone
And my command is awk 'BEGIN{RS=""; FS="\n";} {gsub("\n","#"); print}'. The awk manual said that when the RS is set to the null (empty?) string, then records are separated by blank lines. So the result is expected to be
hello world#hello Jack#hello Jim#
hello Marry#hello Bob#hello Everyone#
But actually, the result is
hello world#hello Jack#hello Jim
hello Marry#hello Bob#hello Everyone
The last new-line-character is not replaced by #. Is it because the last new-line-character of a record is omitted by awk when awk reads the content and cuts it into fields? Are there any manuals about the details of how awk reads, cuts and processes fields with patterns and actions? Thanks.
The reason you don't have trailing # in output is:
If you set RS="", it behaves similarly to RS="\n\n+" (but with a difference that I explain below). So the longest run of two or more consecutive line breaks is used by awk as the record separator.
Looking at your data, after Jim there are two \n characters before the next text block. awk takes these two \n as the record separator, so there is no trailing \n in your record (the Jim record), and of course your gsub won't replace it. The line break you see in your output comes from print.
The 2nd line in your output has no trailing # either, because we used RS="" instead of RS="\n\n+". The important difference is that, for RS="", leading newlines in the input file are ignored, and if a file ends without extra blank lines after the last record, the final newline is removed from the record. That's why there is no trailing # in output line 2.
If you changed it to RS="\n\n+", you should see the trailing # on the 2nd line of your output.
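The paragraph-mode behaviour can be reproduced on a tiny input. This sketch uses made-up lines l1 to l4 and a shell variable para; the record number is printed only to make the record boundaries visible:

```shell
# Two paragraphs separated by one blank line; the input ends with a single newline.
para=$(printf 'l1\nl2\n\nl3\nl4\n' |
       awk 'BEGIN { RS = "" } { gsub("\n", "#"); print NR ": " $0 }')
printf '%s\n' "$para"
```

Neither record ends in #: the blank line is consumed as the separator after the first record, and the final newline is dropped from the last one.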
I guess you want to find out why the output you got was not something you expected. but not try to achieve your expected output, right? if your question is how to get that output, I would edit my answer.
You can have a look at this page: http://www.gnu.org/software/gawk/manual/gawk.html#Multiple-Line
It says:
"When RS is set to the empty string, and FS is set to a single character, the newline character always acts as a field separator."
So you do not have to specify FS="\n"; it happens automatically when you set RS="".
In order to produce your expected output you can do the following:
BEGIN{
RS=""
}
{
$0=$0 ORS
gsub("\n","#")
print
}