Is a /start/,/end/ range expression ever useful in awk?

I've always contended that you should never use a range expression like:
/start/,/end/
in awk because although it makes the trivial case where you only want to print matching text including the start and end lines slightly briefer than the alternative*:
/start/{f=1} f{print; if (/end/) f=0}
when you want to tweak it even slightly to do anything else, it requires a complete re-write or results in duplicated or otherwise undesirable code. E.g., if you want to print the matching text excluding the range delimiters, then using the second form above you'd just tweak it to move the components around:
f{if (/end/) f=0; else print} /start/{f=1}
but if you started with /start/,/end/ you'd need to abandon that approach in favor of what I just posted or you'd have to write something like:
/start/,/end/{ if (!/start|end/) print }
i.e. you duplicate the conditions, which is undesirable.
Then I saw a question posted that required identifying the LAST end in a file and where a range expression was used in the solution and I thought it seemed like that might have some value (see https://stackoverflow.com/a/21145009/1745001).
Now, though, I'm back to thinking that it's just not worth bothering with range expressions at all and a solution that doesn't use range expressions would have worked just as well for that case.
So - does anyone have an example where a range expression actually adds noticeable value to a solution?
*I used to use:
/start/{f=1} f; /end/{f=0}
but too many times I found I had to do something additional when f is true and /end/ is found (or, to put it another way, ONLY do something when /end/ is found IF f is true), so now I just try to stick to the slightly less brief but much more robust and extensible:
/start/{f=1} f{print; if (/end/) f=0}
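For example, a hypothetical tweak of that kind: reporting how many lines each block contained only means adding statements inside the existing action:
awk '/start/{f=1} f{print; n++; if (/end/){print "lines in block: " n; n=0; f=0}}' file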

Interesting. I also often start with a range expression and then later on switch to using a variable.
I think a situation where this could be useful, aside from the pure range-only situations, is if you want to print a match, but only if it lies in a certain range. It also helps that it is immediately obvious what it does. For example:
awk '/start/,/end/{if(/ppp/)print}' file
with this input:
start
dfgd gd
ppp 1
gfdg
fd gfd
end
ppp 2
ppp 3
start
ppp 4
ppp 5
end
ppp 6
ppp 7
gfdgdgd
will produce:
ppp 1
ppp 4
ppp 5
One could of course also use:
awk '/start/{f=1} /ppp/ && f; /end/{f=0}' file
But it is longer and somewhat less readable.

While you are right that the /start/,/end/ range expression can easily be reimplemented with a conditional, it has many interesting use-cases where it is used on its own. As you observe, it may have little value for processing tabular data, which is the main but not the only use case of awk.
So - does anyone have an example where a range expression actually adds noticeable value to a solution?
In the mentioned use-cases, the range expression improves legibility. Here are a few examples where the range expression accurately selects the text to be processed. These are only a handful of examples, but there are countless similar applications, demonstrating the incredible versatility of awk.
Filter logs within a time range
Assuming each log line starts with an ISO timestamp, the filter below selects all events in a given one-hour range (provided lines with those exact boundary timestamps actually appear in the log):
awk '/^2015-06-30T12:00:00Z/,/^2015-06-30T13:00:00Z/'
Extract a document from a file
awk '/---- begin file.data ----/,/---- end file.data ----/'
This can be used to bundle resources with shell scripts (with cat), to extract parts of GPG-signed messages (prepared with --clearsign) or more generally of MIME-messages.
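For instance (a sketch with made-up names), to pull one document out of such a bundle and drop the delimiter lines:
awk '/---- begin file.data ----/,/---- end file.data ----/' bundle.txt | sed '1d;$d' > file.data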
Process LaTeX files
The range pattern can be used to match LaTeX environments, so for instance we can select the abstracts of all articles in our directory:
awk '/begin{abstract}/,/end{abstract}/' *.tex
or all the theorems, to prepare a theorem database!
awk '/begin{theorem}/,/end{theorem}/' *.tex
or write a linter ensuring that theorems do not contain citations (if we regard this as bad style):
awk '
/begin{theorem}/,/end{theorem}/ { if(/\\cite{/) { c+= 1 } }
END { printf("There were %d bad-style citations.\n", c) }
'
or preprocess tables, etc.

Related

AIX: remove the last symbols (CRLF) from a file

There is a large file where the last symbols are \r\n. I need to remove them. It seems to be equivalent to removing the last line(?).
UPD: no, it's not: the file has only one line, which ends with \r\n.
I know two ways, but both don't work for AIX:
sed 's/\r\n$//' file # I don't know why this doesn't work
head -c-2 # head doesn't work with negative numbers
Is there any solution for AIX? A lot of large files must be processed, so performance is important.
Usually, if you need to edit a file via a script in place, I use ed due to historical reasons. For example:
ed - /tmp/foo.txt <<EOF
g/^$/d
w
q
EOF
ed is more than a bit cantankerous. Note also that you did not really remove the empty lines at the bottom of the file but rather all of the empty lines. With ed and some practice you can probably achieve deleting only the empty lines at the bottom of the file. e.g. go to the bottom of the file, search up for a non-empty line, then move down a line and delete from that point to the end of the file. ed command scripts act (pretty much) as you would expect.
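A sketch of such an ed script (untested; it assumes the file really ends with one or more empty lines and contains at least one non-empty line, and the quoted EOF keeps the shell from expanding $d):
ed - /tmp/foo.txt <<'EOF'
?.?+1,$d
w
q
EOF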
Also, if they really do have \r\n, then those are not going to be considered empty lines but rather lines with a control-M (\r) in them. You may need to adjust your pattern if that is the case.
My answer https://stackoverflow.com/a/46083912/3220113 to the duplicate question should work here too. Another solution is using
awk '
NR > 1 { print s }                              # print each line one record late
{ s = $0 }                                      # remember the current line
END { printf("%s", substr(s, 1, length(s)-1)) } # last line without its trailing \r, and without a newline
' inputfile

DCL sort - different start positions

I have a DCL script that creates a .txt file that looks something like this
something,somethingelse,00000004
somethingdifferent,somethingelse1,00000002
anotherline,line,00000015
I need to sort the file by the 3rd column highest to lowest
ex:
anotherline,line,00000015
something,somethingelse,00000004
somethingdifferent,somethingelse1,00000002
Is it best to use the sort command? If so, everything I've seen requires a position number; how can this be done if each line has a different start position?
If sort is a bad way to handle this, is there something else, or can I somehow handle this while writing the lines to the file?
I've only been working with VMS/DCL for a few weeks now so I'm not familiar with all of the commands yet.
Thanks!
As you already noticed, the VMS sort expects fields with a fixed start position within a record. You cannot specify a field by a separator. If you want to use the VMS sort, you have to make sure your third field starts at the same column for all records. In other words, you have to pad the preceding fields. If you have control over how the file is created, this may work for you. If you don't, or you don't know how big the strings in front of the sort field will be, this may not be a workaround. Maybe changing the order of the fields is an option.
On the other hand, you may find GNV installed on your system. Then you can try to use its sort, which is a GNU style sort. That is, $ mcr gnv$gnu:[bin]sort -t, -k3 -r x.txt may get you the wanted results.
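Since the third column in the sample data is zero-padded, a reverse textual sort happens to give the right order; if the numbers were not padded, a numeric key on just that field would be safer (a variation on the same command, assuming GNU sort semantics):
$ mcr gnv$gnu:[bin]sort -t, -k3,3nr x.txt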
VMS Sort is indeed not really equipped for this.
Reformatting as you did is about the only way.
If you do not have access to GNV sort on the OpenVMS system, then perhaps you have, or can install, Perl? It is somewhat easier to install.
In perl there are of course many ways.
For example, using an anonymous sort function ($a is the first arg, $b the second; <> reads all input):
$ perl -e "print sort { 0+(split /,/,$b)[2] <=> 0+(split /,/,$a)[2]} <>" x.x
where the 0+ forces numeric evaluation. For a (fixed-length?) string compare, use:
$ perl -e "print sort { (split /,/,$b)[2] cmp (split /,/,$a)[2]} <>" x.x
hth,
Hein.

In awk, is there a functional difference between putting conditions outside the brackets or inside?

As far as I can tell, this:
$2 > 50{print}
functions exactly the same way as this:
{if ($2 > 50) print}
When should I use one vs the other, or is it a matter of style?
The 2nd one is much more readable in my opinion (especially for non-awk gurus who may have to maintain your code).
It's mostly style. The first example is canonical AWK. There are, of course, times when an if statement is required.
Consider that:
$2 > 50 {foo; print}
takes 20 characters versus 27 for the following:
{if ($2 > 50) {foo; print}}
If you remove spaces, it would be 16 versus 22 characters.
So the former is more desirable if you are interested in economy (both in terms of actual characters and visually).
As potong points out, the if form requires curly braces if there is more than one statement (as I have shown in my second example above). If there is only one statement, they are optional. This is a source of subtle bugs if it's written without braces and additional statements are later added without adding them. Also, for the reader of a script, the intent is much clearer if the braces are included.
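To illustrate the pitfall with a made-up snippet: the second statement below looks guarded by its indentation, but it runs for every record because there are no braces:
{ if ($2 > 50)
      print
      count++        # runs unconditionally; only print is controlled by the if
}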
Yes, they are the same. As with most syntax alternatives, there are trade offs.
I mostly agree with John3136 that the 2nd form has a higher likelihood of being understood by someone with a background in C, C++, C#, Java, and others.
And as Dennis Williamson points out, there are times you have to use the if (...) form.
For one-liners, I think the first form is useful, acceptable, and helps reinforce the primary design of awk, i.e. [pattern]-[action] (either part being optional, with a default provided when omitted).
If you look at comp.lang.awk, you'll see that the long-timers mostly prefer style 1.
Large blocks of code in style 1, to my eye, are difficult to read, but do have the look of lex code.
So.... depends ;-) .... are you writing code that will ultimately be maintained by likely non-awk experts? Then stick with style 2. Writing for yourself? Use both so you don't get rusty!
I hope this helps.
Yes they are the same.
On the first form, you can omit { print } because it's the default action. So, that works too:
$2 > 50

How to skip records that turn on/off the range pattern?

gawk '/<Lexer>/,/<\/Lexer>/' file
This works, but it prints the first and last records, which I'd like to omit. How to do so?
The gawk manual says: "The record that turns on the range pattern and the one that turns it off both match the range pattern. If you don't want to operate on these records, you can write if statements in the rule's action to distinguish them from the records you are interested in." But it gives no example.
I tried something like
gawk '/<Lexer>/,/<\/Lexer>/' {1,FNR-1} file
but it doesn't work.
If you have a better way to do this, without using awk, say so.
You can do it with 2 separate match statements and a variable
gawk '/<Lexer>/{p=1; next} /<\/Lexer>/ {p=0} p==1 {print}' file
This matches <Lexer>, sets p to 1, and then skips to the next line. While p is 1, each line is printed. When it matches </Lexer>, it sets p to 0, so the p==1 rule no longer fires and printing is suppressed until the next <Lexer>.
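For completeness, the if-inside-the-action form that the manual alludes to would look something like this sketch (it duplicates the delimiter patterns, which is one reason many prefer the flag approach):
gawk '/<Lexer>/,/<\/Lexer>/{ if (!/<Lexer>/ && !/<\/Lexer>/) print }' file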

Reorganizing named fields with AWK

I have to deal with various input files with a number of fields, arbitrarily arranged, but all consistently named and labelled with a header line. These files need to be reformatted such that all the desired fields are in a particular order, with irrelevant fields stripped and missing fields accounted for. I was hoping to use AWK to handle this, since it has done me so well when dealing with field-related dilemmata in the past.
After a bit of mucking around, I ended up with something much like the following (writing from memory, untested):
# imagine a perfectly-functional BEGIN {} block here
NR==1 {
    fldname[1] = "first_name"
    fldname[2] = "last_name"
    fldname[3] = "middle_name"
    maxflds = 3
    # this is just a sample -- my real script went through forty-odd fields
    for (i=1;i<=NF;i++) for (j=1;j<=maxflds;j++) if ($i == fldname[j]) fldpos[j]=i
}
NR!=1 {
    for (j=1;j<=maxflds;j++) {
        if (fldpos[j]) printf "%s",$fldpos[j]
        printf "%s","\t"
    }
    print ""
}
Now this solution works fine. I run it, I get my output exactly how I want it. No complaints there. However, for anything longer than three fields or so (such as the forty-odd fields I had to work with), it's a lot of painfully redundant code which always has and always will bother me. And the thought of having to insert a field somewhere else into that mess makes me shudder.
I die a little inside each time I look at it.
I'm sure there must be a more elegant solution out there. Or, if not, perhaps there is a tool better suited for this sort of task. AWK is awesome in its own domain, but I fear I may be stretching its limits some with this.
Any insight?
The only suggestion that I can think of is to move the initial array setup into the BEGIN block and read the ordered field names from a separate template file in a loop. Then your awk program consists only of loops with no embedded data. Your external template file would be a simple newline-separated list.
BEGIN {while ((getline < "fieldfile") > 0) fldname[++maxflds] = $0}
You would still read the header line in the same way you are now, of course. However, it occurs to me that you could use an associative array and reduce the nested for loops to a single for loop. Something like (untested):
BEGIN {while ((getline < "fieldfile") > 0) fldname[$0] = ++maxflds}
NR==1 {
for (i=1;i<=NF;i++) if ($i in fldname) fldpos[fldname[$i]] = i   # map output position -> input column, so the existing print loop still works
}
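The data rows would then be handled by the same loop as in the original script (repeated here as a sketch for completeness):
NR!=1 {
    for (j=1; j<=maxflds; j++) {
        if (fldpos[j]) printf "%s", $fldpos[j]
        printf "%s", "\t"
    }
    print ""
}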