Output field separators in awk after substitution in fields - awk

Is it always the case, after modifying a specific field in awk, that information on the output field separator is lost? What happens if there are multiple field separators and I want them to be recovered?
For example, suppose I have a simple file example that contains:
a:e:i:o:u
If I just run an awk script, which takes account of the input field separator, that prints each line in my file, such as running
awk -F: '{print $0}' example
I will see the original line. If however I modify one of the fields directly, e.g. with
awk -F: '{$2=$2"!"; print $0}' example
I do not get back a modified version of the original line. Instead, I see the fields separated by the default output separator, a single space, i.e.:
a e! i o u
I can get back a modified version of the original by specifying OFS, e.g.:
awk -F: 'BEGIN {OFS=":"} {$2=$2"!"; print $0}' example
However, where there are multiple potential field separators, is there a simple way of restoring the original separators?
For example, if example had both : and ; as separators, I could use -F":|;" to process the file, but OFS would not be sufficient to restore the original separators in their relative positions.
More explicitly, if we switched to example2 containing
a:e;i:o;u
we could use
awk -F":|;" 'BEGIN {OFS=":"} {$2=$2"!"; print $0}' example2
(or -F"[:;]") to get
a:e!:i:o:u
but we've lost the distinction between : and ; which would have been maintained if we could recover
a:e!;i:o;u

You need to use GNU awk for the 4th arg to split() which saves the separators, like RT does for RS:
$ awk -F'[:;]' '{split($0,f,FS,s); $2=$2"!"; r=s[0]; for (i=1;i<=NF;i++) r=r $i s[i]; $0=r} 1' file
a:e!;i:o;u
There is no automatically populated array of FS matching strings because of how expensive it'd be in time and memory to store the string that matches FS every time you split a record into fields. Instead the GNU awk folks provided a 4th arg to split() so you can do it yourself if/when you want it. That is the result of a long conversation a few years ago in the comp.lang.awk newsgroup between experienced awk users and gawk providers before all agreeing that this was the best approach.
See split() at https://www.gnu.org/software/gawk/manual/gawk.html#String-Functions.
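Where GNU awk is not available, a similar result can be sketched in POSIX awk by walking the record with match() and copying each separator as it is found. This is only an illustration: the function name bang_field2 is hypothetical, and the separator set [:;] and the target field number 2 are hard-coded assumptions taken from the question.

```shell
# Hypothetical helper: append "!" to field 2 while keeping every original separator.
bang_field2() {
  awk '{
    rest = $0; out = ""; n = 0
    # Find the next separator; the text before it is the next field.
    while (match(rest, /[:;]/)) {
      n++
      field = substr(rest, 1, RSTART - 1)
      sep = substr(rest, RSTART, RLENGTH)
      if (n == 2) field = field "!"
      out = out field sep
      rest = substr(rest, RSTART + RLENGTH)
    }
    # Whatever remains after the last separator is the final field.
    n++
    if (n == 2) rest = rest "!"
    print out rest
  }'
}

printf 'a:e;i:o;u\n' | bang_field2
# prints: a:e!;i:o;u
```

Unlike the split() approach, this never assigns to a field, so $0 is not rebuilt with OFS at all.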

Related

How does PROCINFO show info on FS on the specific record?

I was reading the definition of the PROCINFO built-in variable on GNU Awk User's Guide → 7.5.2 Built-in Variables That Convey Information:
PROCINFO #
The elements of this array provide access to information about the running awk program. The following elements (listed alphabetically) are guaranteed to be available:
PROCINFO["FS"]
This is "FS" if field splitting with FS is in effect, "FIELDWIDTHS" if field splitting with FIELDWIDTHS is in effect, "FPAT" if field matching with FPAT is in effect, or "API" if field splitting is controlled by an API input parser.
And yes, it works very well. See this example, where I provide the string "hello;you" and set, in turn, FS to ";", FIELDWIDTHS to "2 2 2" and FPAT to three characters:
$ gawk 'BEGIN{FS=";"}{print PROCINFO["FS"]; print $1}' <<< "hello;you"
FS
hello
$ gawk 'BEGIN{FIELDWIDTHS="2 2 2"}{print PROCINFO["FS"]; print $1}' <<< "hello;you"
FIELDWIDTHS
he
$ gawk 'BEGIN{FPAT="..."}{print PROCINFO["FS"]; print $1}' <<< "hello;you"
FPAT
hel
This is fine and works very well.
Then, a bit earlier, they mention in 4.8 Checking How gawk Is Splitting Records:
In order to tell which kind of field splitting is in effect, use PROCINFO["FS"] (see section Built-in Variables That Convey Information). The value is "FS" if regular field splitting is being used, "FIELDWIDTHS" if fixed-width field splitting is being used, or "FPAT" if content-based field splitting is being used.
And also in Changing FS Does Not Affect the Fields they describe how the changes affect the next record:
According to the POSIX standard, awk is supposed to behave as if each record is split into fields at the time it is read. In particular, this means that if you change the value of FS after a record is read, the values of the fields (i.e., how they were split) should reflect the old value of FS, not the new one.
This case explains it very well:
$ gawk 'BEGIN{FS=";"} {FS="|"; print $1}' <<< "hello;you
bye|everyone"
hello # "hello;you" is split using FS=";", the assignment FS="|" doesn't affect it yet
bye # "bye|everyone" is split using FS="|"
Taking all of this into consideration, I would assume that PROCINFO["FS"] would always reflect the kind of field splitting used for the record it is being printed on.
However, see this case:
$ gawk 'BEGIN{FPAT="..."}{FS=";"; print PROCINFO["FS"]; print $1}' <<< "hello;you"
FS
hel
PROCINFO["FS"] shows the info set in the current record (FS), not the one that Awk is taking into account when processing the data (that is, FPAT). The same occurs if we swap the assignments:
$ gawk 'BEGIN{FS=";"}{FPAT="..."; print PROCINFO["FS"]; print $1}' <<< "hello;you"
FPAT
hello
Why is PROCINFO["FS"] showing a different FS than the one that is being used in the record it is printed in?
Field splitting (using FS, FIELDWIDTHS, or FPAT) occurs when a record is read or $0 as a whole is given a new value otherwise (e.g. $0="foo" or sub(/foo/,"bar")). print PROCINFO["FS"] tells you the value that PROCINFO["FS"] currently has which is not necessarily the same value it had when field splitting last occurred.
With:
$ gawk 'BEGIN{FPAT="..."}{FS=";"; print PROCINFO["FS"]; print $1}' <<< "hello;you"
FS
hel
You're setting FS=";" after $1 has already been populated based on FPAT="...", then printing the new value of PROCINFO["FS"] (which will be used the next time a record is split into fields), then printing the value of $1, which was populated before you set FS=";".
If you set $0 to itself the field splitting will occur again, this time using the new FS value rather than the original FPAT value:
$ gawk 'BEGIN{FPAT="..."}{FS=";"; print PROCINFO["FS"]; print $1; $0=$0; print $1}' <<< "hello;you"
FS
hel
hello
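The same re-splitting behaviour can be sketched without PROCINFO, so it also runs in awks other than gawk. This is a minimal illustration; the function name resplit and the sample input are made up for the example.

```shell
# Hypothetical helper: show $1 before and after forcing a re-split.
resplit() {
  awk 'BEGIN { FS = ";" } {
    before = $1   # field as split by the FS in effect when the record was read
    FS = "|"      # changing FS does not touch the existing fields
    $0 = $0       # assigning $0 forces a re-split with the new FS
    print before
    print $1
  }'
}

echo 'hello;you|there' | resplit
# prints:
# hello
# hello;you
```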

AWK doesn't update record with new separator

I'm a bit confused with awk (I'm totally new to awk)
find static/*
static/conf
static/conf/server.xml
My goal is to remove 'static/' from the result.
First step:
find static/* | awk -F/ '{print $(0)}'
static/conf
static/conf/server.xml
Same result. I expected it. Now deleting the first part:
find static/* | awk -F/ '{$1="";print $(0)}'
conf
conf server.xml
That's nearly good, but I don't know why the delimiter is killed.
But I can deal with it just adding the delimiter to the output:
find static/* | awk -F/ '{$1="";OFS=FS;print $(0)}'
conf
/conf/server.xml
OK, now I'm completely lost.
Why is there a '/' on the second line and not on the first? In both cases I deleted the first column.
Any explanations or ideas?
BTW my preferred output would be
conf
conf/server.xml
Addendum: thank you for your kind answers. they will help me to fix the problem.
However I want to understand why the first '/' is deleted in my last try. To make it a bit clearer:
find static/* | awk -F/ '{$1="";OFS="#";print $(0)}'
conf
^ a space and no / ?
#conf#server.xml
but I don't know why the delimiter is killed.
Whenever you redefine a field in awk using a statement like:
$n = new_value
awk will rebuild the current record $0 and automatically replace all field separators defined by FS, by the output field separator OFS (see below). The default value of OFS is a single space. This implies the following:
awk -F/ '{$1="";print $(0)}'
The field separator FS is set to a single <slash> character. The first field is set to "", which triggers the re-evaluation of $0: every match of FS is replaced by the string OFS, which is currently a single space.
awk -F/ '{$1="";OFS=FS;print $(0)}'
The same action applies as earlier. However, after the re-computation of $0, the output field separator OFS is set to FS. This implies that from record 2 onward, you will not replace FS with a space, but with the value of FS.
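The effect of the ordering can be seen with a small sketch that assigns OFS before the field instead of after it. The helper names sample and ofs_first are hypothetical; the input lines are the ones from the question.

```shell
# Sample input taken from the question.
sample() { printf 'static/conf\nstatic/conf/server.xml\n'; }

# OFS is assigned before $1, so the rebuild of $0 already uses "#" on record 1.
ofs_first() { awk -F/ '{ OFS = "#"; $1 = ""; print }'; }

sample | ofs_first
# prints:
# #conf
# #conf#server.xml
```

Setting OFS once in a BEGIN block has the same effect and avoids repeating the assignment for every record.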
Possible solution with same ideology
awk 'BEGIN{FS=OFS="/"}{$1=""}{print substr($0,2)}'
The substring function substr is needed to remove the first /
DESCRIPTION
The awk utility shall interpret each input record as a sequence of fields where, by default, a field is a string of non- <blank> non- <newline> characters. This default <blank> and <newline> field delimiter can be changed by using the FS built-in variable or the -F sepstring option. The awk utility shall denote the first field in a record $1, the second $2, and so on. The symbol $0 shall refer to the entire record; setting any other field causes the re-evaluation of $0. Assigning to $0 shall reset the values of all other fields and the NF built-in variable.
Variables and Special Variables
References to nonexistent fields (that is, fields after $NF), shall evaluate to the uninitialized value. Such references shall not create new fields. However, assigning to a nonexistent field (for example, $(NF+2)=5) shall increase the value of NF; create any intervening fields with the uninitialized value; and cause the value of $0 to be recomputed, with the fields being separated by the value of OFS. Each field variable shall have a string value or an uninitialized value when created. Field variables shall have the uninitialized value when created from $0 using FS and the variable does not contain any characters.
source: POSIX standard: awk utility
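As a small illustration of the quoted rule about assigning to a nonexistent field, the sketch below assigns to $(NF+2) and prints the new NF and the recomputed record. The function name grow_record and the choice of OFS="-" are assumptions made for the example.

```shell
# Hypothetical helper: assign to a field two past the end of the record.
grow_record() { awk -v OFS='-' '{ $(NF + 2) = 5; print NF; print }'; }

echo 'a b' | grow_record
# prints:
# 4
# a-b--5
```

The intervening field $3 is created empty, which is why two OFS characters appear next to each other.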
Be aware that the default field separator FS=" " has some special rules
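A quick way to see those special rules is to compare the default FS with a regex that matches exactly one space. This is a sketch on made-up input; the function names are hypothetical.

```shell
# Default FS: leading and trailing blanks are ignored and runs of blanks act as one separator.
default_fs() { awk '{ n = NF; $1 = $1; print n ": [" $0 "]" }'; }

# FS as the regex [ ]: every single space separates two (possibly empty) fields.
single_space_regex() { awk -F'[ ]' '{ n = NF; $1 = $1; print n ": [" $0 "]" }'; }

printf '  a   b  \n' | default_fs
# prints: 2: [a b]
printf '  a   b  \n' | single_space_regex
# prints: 8: [  a   b  ]
```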
If you have GNU find you don't need awk at all.
$ find static/ -mindepth 1 -printf '%P\n'
conf
conf/server.xml
1st solution: Assuming the string static/ occurs only once in each line, try the following. It uses the string static/ as the field separator and prints the last field, which is everything after static/.
find static/* | awk -F'static/' '{print $NF}'
2nd solution: A more generic solution. It matches from the very first occurrence of / to the end of the line and prints the match without its leading /.
find static/* | awk 'match($0,/\/.*/){print substr($0,RSTART+1,RLENGTH-1)}'
When you reset the first field value the field is still there. Just remove the initial / chars after that with sub(/^\/+/, "") (where ^\/+ pattern matches one or more / chars at the start of the string):
awk 'BEGIN{OFS=FS="/"} {$1="";sub(/^\/+/, "")}1'
See an online demo:
s="static/conf
static/conf/server.xml"
awk 'BEGIN{OFS=FS="/"} {$1="";sub(/^\/+/, "")}1' <<< "$s"
Output:
conf
conf/server.xml
Note that with BEGIN{OFS=FS="/"} you set the input/output field separator just once at the start, and 1 at the end triggers the default line print operation.

Match regexp at the end of the string with AWK

I am trying to match two different Regexp to long strings with awk, removing the part of the string that matches in a 35 characters window.
The problem is that the same code works when I am looking for the first one (which matches at the beginning) but fails to match the second one (at the end of the string).
Input:
Regexp1(1)(2)(3)(4)(5)xxxxxxxxxxxxxxx(20)(21)(22)(23)Regexp2
Desired output
(1)(2)(3)(4)(5)xxxxxxxxxxxxxxx(20)(21)(22)(23)
So far I have used this code, which extracts Regexp1 correctly but, unfortunately, is not able to extract Regexp2, since the RSTART and RLENGTH indexes for Regexp2 are incorrect.
Code for extracting Regexp1 (correct output):
awk -v F="Regexp1" '{if (match(substr($1,1,35),F)) print substr($1,RSTART,RLENGTH)}' file
Code for extracting Regexp2 (wrong output)
awk -v F="Regexp2" '{if (match(substr($1,length($1)-35,35),F)) print substr($1,RSTART,RLENGTH)}' file
Although the indexes for Regexp1 are correct, the indexes for Regexp2 are wrong (RSTART=13). I cannot figure out how to extract the second Regexp.
Assuming your actual Input_file looks like the sample shown, you could try the following (a reasonably recent awk is needed, since older versions may not support interval expressions such as {5} in a regex).
awk '
match($0,/(\([0-9]+\)){5}.*(\([0-9]+\)){4}/){
print substr($0,RSTART,RLENGTH)
}' Input_file
If the number of parenthesized values is not fixed, you could do the following:
awk '
match($0,/(\([0-9]+\))+.*(\([0-9]+\))+/){
print substr($0,RSTART,RLENGTH)
}' Input_file
If this isn't all you need:
$ sed 's/Regexp1\(.*\)Regexp2/\1/' file
(1)(2)(3)(4)(5)xxxxxxxxxxxxxxx(20)(21)(22)(23)
or using GNU awk for gensub():
$ awk '{print gensub(/Regexp1(.*)Regexp2/,"\\1",1)}' file
(1)(2)(3)(4)(5)xxxxxxxxxxxxxxx(20)(21)(22)(23)
then edit your question to be far clearer with your requirements and example.
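On the RSTART value reported in the question: match() sets RSTART relative to the string it is given, so when that string is a substr() window, the window's offset has to be added back before indexing into $1. A minimal sketch under that assumption follows; the function name tail_window_match, the variable W and the shortened sample input are all made up for the example.

```shell
# Hypothetical helper: search only the last W characters of $1 for F.
tail_window_match() {
  awk -v F="Regexp2" -v W=35 '{
    off = length($1) - W          # characters skipped before the window starts
    if (off < 0) off = 0
    # RSTART is relative to the substring, so add the offset back
    if (match(substr($1, off + 1), F))
      print substr($1, RSTART + off, RLENGTH)
  }'
}

printf 'Regexp1(1)(2)(3)xxxxx(20)(21)Regexp2\n' | tail_window_match
# prints: Regexp2
```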

awk: Adding a new column based on concatenated value of two columns

I am trying to add a new column to a text file based on the concatenated values of two columns. Value is being inserted in the middle instead of the end of the string.
I am using awk. Here is a sample line:
$ head -1 file.txt
8502CC169154|02|GA|TN|89840|9|2008-11-15 00:00:00.000|2009-11-15 00:00:00.000|1|TEAM1|1639009|1000000|0|2008-11-15 00:00:00.000|2009-11-15 00:00:00.000|85|00|37421||241|20|331|1052A|5000|0|.1500|Chattanooga|47065|.000|025|35|25000|0|0|0|0|0|718||E|-17.00|-17.00|-17.00|-17.00|-17.00|-2.55|-2.55|-2.55|-2.55|D|C9N7I4115531902|-2.19|-2.19|-2.19|-2.19|-14.81|051|2008-12-31 00:00:00.000|151|2008-12-17 00:00:00.000|||AC|CC|Y||2008-12-31 00:00:00.000|.000000|A|.000000|.000000|.000000|Y|8502CC169154-8|8502CC169154|8|||122130|122130M|7764298|RA
I tried the following.
$ head -1 file.txt | awk -F'|' '{$(NF+1)=$1"-"$6;}1' OFS='|'
I am expecting a new column at the end of the string. But you can see that the concatenated field is being inserted in the middle of the string instead of the end of the string.
8502CC169154|02|GA|TN|89840|9|2008-11-15 00:00:00.000|2009-11-15 00:00:00.000|1|TEAM1|1639009|1000000|0|2008-11-15 00:00:00.000|2009-11-15 00:00:00.000|85|00|37421||241|20|331|1052A|5000|0|.1500|Chattanooga|47065|.000|025|35|25000|0|0|0|0|0|718||E|-17.00|-17.00|-17.00|-17.00|-17.00|-2.55|-2.55|-2.55|-2.55|D|C9N7I4115531902|-2.19|-2.19|-2.19|-2.19|-14.81|051|2008-12-31 00:00:00.000|151|2008|8502CC169154-9.000|||AC|CC|Y||2008-12-31 00:00:00.000|.000000|A|.000000|.000000|.000000|Y|8502CC169154-8|8502CC169154|8|||122130|122130M|7764298|RA
Your original code works for me using GNU awk but I suspect that not all awks support setting $(NF+1). To avoid that, try:
head -1 file.txt | awk -F'|' '{$0=$0 FS $1"-"$6;}1' OFS='|'
Awk is a surprisingly powerful language and it has all the capabilities that head has, making the pipeline unnecessary. So, for greater efficiency, try this simple command:
awk -F'|' '{print $0 FS $1"-"$6; exit}' file.txt
How it works:
-F'|'
This sets the field separator to a vertical bar.
print $0 FS $1"-"$6
This prints the output line that you want which consists of the original line, $0, followed by a field separator, FS, followed by combination of the first field, a dash, and the sixth field.
exit
After the first line is printed, this tells awk to exit. This eliminates the need for head -1.
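For comparison, here is the $(NF+1) form from the question run on a short, made-up record, which makes it easy to see where the new column lands. The function name append_key and the sample data are assumptions for illustration only.

```shell
# Hypothetical helper: append field 1, a dash and field 6 as a new last column.
append_key() { awk -F'|' -v OFS='|' '{ $(NF + 1) = $1 "-" $6 } 1'; }

printf 'A1|02|GA|TN|89840|9\n' | append_key
# prints: A1|02|GA|TN|89840|9|A1-9
```

Passing OFS with -v sets it before any record is read, so the rebuilt record keeps the vertical bars.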

awk to take FS into effect

Why does the following happen? How can I understand the logic?
$ echo "123456" | awk 'BEGIN {FS="4"; OFS="-"}; {print}'
123456
But if I "modify" some of the fields, everything is OK:
$ echo "123456" | awk 'BEGIN {FS="4"; OFS="-"}; {$1=$1;print}'
123-56
The output field separator OFS only takes effect once the record has been rebuilt, for example by assigning to a field. From the GNU AWK manual:
It is important to remember that $0 is the full record, exactly as it was read from the input. This includes any leading or trailing whitespace, and the exact whitespace (or other characters) that separates the fields.
It is a common error to try to change the field separators in a record simply by setting FS and OFS, and then expecting a plain print or print $0 to print the modified record.
But this does not work, because nothing was done to change the record itself. Instead, you must force the record to be rebuilt, typically with a statement such as $1 = $1