Awk understanding variables - variables

What does the words (probably variables?) like NF, RF, FS mean in awk? I believe they have some pre-defined meaning and usage.
Please let me know, if there are more such variables that must be known to a beginner?
-Thanks

These are built-in variables, in awk they have special meaning.
There is the part in GAWK reference manual covering this topic.
In particular:
FS:
This is the input field separator (see Field Separators). The value
is a single-character string or a multi-character regular expression
that matches the separations between fields in an input record. If the
value is the null string (""), then each character in the record
becomes a separate field. (This behavior is a gawk extension. POSIX
awk does not specify the behavior when FS is the null string.
Nonetheless, some other versions of awk also treat "" specially.)
The default value is " ", a string consisting of a single space. As a
special exception, this value means that any sequence of spaces, TABs,
and/or newlines is a single separator. It also causes spaces, TABs,
and newlines at the beginning and end of a record to be ignored.
NF:
The number of fields in the current input record. NF is set each
time a new record is read, when a new field is created or when $0
changes (see Fields).
Unlike most of the variables described in this section, assigning a
value to NF has the potential to affect awk's internal workings. In
particular, assignments to NF can be used to create or remove fields
from the current record. See Changing Fields.

Related

sed: cut out a substring following various patterns

There is a list of identifiers I want to modify:
3300000526.a:P_A23_Liq_2_FmtDRAFT_1000944_2,
200254578.a:CR_10_Liq_3_inCRDRAFT_100545_11,
3300000110.a:BSg2DRAFT_c10006505_1,
3300000062.a:IMNBL1DRAFT_c0010786_1,
3300000558.a:Draft_10335283_1
I want to remove all starting from first . and first _ after DRAFT (case-insensitive), i.e.:
3300000526_1000944_2,
200254578_100545_11,
3300000110_c10006505_1,
3300000062_c0010786_1,
3300000558_10335283_1
I am using sed 's/.a.*[a-zA-Z0-9]DRAFT_.*[^_]_[a-zA-Z0-9]//g' but it ignores the first _ after DRAFT and does this:
3300000526_2,
200254578_11,
3300000110_1,
3300000062_1,
3300000558_1
P.S.
There can be various identifiers and I tried to show a little portion on their variance here, but they all keep same pattern.
I'd be grateful for corrections!
You could easily do this in awk, could you please try following once. Based on shown samples only.
awk -F'[.]|DRAFT_' '{$2="_";sub(/ +_ +/,"_")} 1' Input_file
OR with GNU awk for handling case-insensitive try:
awk -v IGNORECASE="1" -F'[.]|DRAFT_' '{$2="_";sub(/ +_ +/,"_")} 1' Input_file
To handle case insensitive without ignorecase option try:
awk -F'[.]|[dD][rR][aA][fF][tT]_' '{$2="_";sub(/ +_ +/,"_")} 1' Input_file
Explanation: Simply setting field separator as . OR DRAFT_ as per OP's need. Then in main program setting 2nd field to _ and then substituting spaces underscore spaces with only _. Finally printing the current line by 1.
A workable solution
You can use:
sed 's/[.].*[dD][rR][aA][fF][tT]_/_/' data
You could also use \. in place of [.] but I'm allergic to unnecessary backslashes — you might be too if you'd had to spend time fighting whether 8 or 16 consecutive backslashes was the correct way to write to document markup (using troff).
For your sample data, it produces:
3300000526_1000944_2,
200254578_100545_11,
3300000110_c10006505_1,
3300000062_c0010786_1,
3300000558_10335283_1
What went wrong
Your command is:
sed 's/.a.*[a-zA-Z0-9]DRAFT_.*[^_]_[a-zA-Z0-9]//g'
This matches:
any character (the leading .)
lower-case 'a'
any sequence of characters
an alphanumeric character
upper-case only DRAFT
underscore
any sequence of characters
underscore
an alphanumeric character
doing the match globally on each line
It then deletes all the matched material. You could rescue it by using:
sed 's/[.]a.*[a-zA-Z0-9]DRAFT\(_.*[^_]_[a-zA-Z0-9]\)/\1/'
This matches a dot rather than any character, and saves the material after DRAFT starting with the underscore (that's the \(…\)), replacing what was matched with what was saved (that's the \1). You can convert DRAFT to the case-insensitive pattern too, of course. However, the terms of the question refer to "from the first dot (.) up to (but not including) the underscore after a (case-insensitive) DRAFT", and detailing, saving, and replacing the material after the underscore is not necessary.
Laziness
I saved myself typing the elaborate case-insensitive string by using a program called mkpattern that (according to RCS) I created on 1990-08-23 (a while ago now). It's not rocket science. I use it every so often — I've actually used it a number of times in the last week, searching for some elusive documents on a system at work.
$ mkpattern DRAFT
[dD][rR][aA][fF][tT]
$
You might have to do that longhand in future.
try something like
{mawk/mawk2/gawk} 'BEGIN { FS = "[\056].+DRAFT_"; OFS = ""; } (NF < 2) || ($1 = $1)'
It might not be the fastest but it's relatively clean approach. octal \056 is the period, and it's less ambiguous to reader when the next item is a ".+"
This might work for you (GNU sed):
sed -nE 's/DRAFT[^_]*/\n/i;s/\..*\n//p' file
First turn on the -n and -E to turn off implicit printing and make regexp more easily seen.
Since we want the first occurrence of DRAFT we can not use a regexp that uses the .* idiom as this is greedy and may pass over it if there are two or more such occurrences. Thus we replace the DRAFT by a unique character which cannot occur in the line. The newline can only be introduced by the programmer and is the best choice.
Now we can use the .* idiom as the introduced newline can only exist if the previous substitution has matched successfully.
N.B. The i flag in the first substitution allows for any upper/lower case rendition of the string DRAFT, also the second substitution includes the p flag to print the successful substitution.

Retrieve matched regex record-separator using Gnu AWK

Using AWK, I am processing a text file by splitting it into multiple records. As a record separator RS I use a regular expression. Is there a way to obtain the found record separator as RS only represents the regex string?
Example:
BEGIN { RS="a[0-9]*. "; ORS="\n-----\n"}
/foo/ {print $0 RS;}
END {}
input file:
a1. Hello
this
is foo
a2. hello
this
is bar
a3. Hello
this
is foo
output:
Hello
this
is foo
a[0-9]*.
-----
Hello
this
is foo
a[0-9]*.
-----
As you see, the output is printing RS as a string representing the regular expression, but not printing the actual value.
How can I retrieve the actual matched value of the record separator?
expected output:
Hello
this
is foo
a1
-----
Hello
this
is foo
a3
-----
In POSIX compliant AWK, the record separator RS is only a single character, hence it is easy to call it back in the form of.
awk 'BEGIN{RS="a"}{print $0 RS}'
GNU AWK, on the other hand, does not limit RS to be a one-character string but allows it to be any regular expression. In this case, it becomes a bit more tricky to use the above AWK because RS is a regular expression and not a string.
To this end, GNU AWK introduced the variable RT which represents nothing more than the found record separator. When RS is a single character, RT contains the same single character. However, when RS is a regular expression, RT contains the actual input text that matched the regular expression.
So naively, one could update your AWK program as:
BEGIN{RS="a[0-9]+[.] "; ORS="\n-----\n"}
/foo/{print $0 RT}
Unfortunately, RT is set to the value found after the current record and it seems the OP requests the value before the current record, hence you can introduce a new variable pRT which could be read as prevous record separator found.
BEGIN{RS="a[0-9]+[.] "; ORS="\n-----\n"}
/foo/{print $0 pRT}{pRT=RT}
and as Shaki Siegal pointed out in the comments, you still have to update pRT to remove the final space and dot:
BEGIN{RS="a[0-9]+[.] "; ORS="\n-----\n"}
/foo/{print $0 pRT}{pRT=RT;sub(/[.] $/,"",pRT)}
note: The original RS of the OP (RS="a[0-9]*. ") has been updated for an improved matching to RS="a[0-9]+[.] " This ensures the appearance of a number behind a and an actual ..
If, as the original example indicates, the record separator always appears at the beginning of the line, RS should be slightly modified into RS="(^|\n)a[0-9]+[.] "Dito comment also made various excellent points. So if the string a[0-9]+. appears always at the beginning, you need to process a bit more:
BEGIN {
RS ="(^|\n)a[0-9]+[.] ";
ORS="\n-----\n"
}
/foo/ {
if (RT ~ /^$/ && NR != 2) pRT = substr(pRT,2)
print $0 pRT
}
{pRT=RT;sub(/[.] $/,"",pRT)}
Here, we added a correction to fix the last record.
If there are more then two AWK records (the first record is always empty), you need to remove the first new-line character from pRT, otherwise you include an extra new-line caused by the last record which ends with a new-line (in contrast to all others).
If there are only two AWK records (one effective in the text), then you should not do this correction as the first RT does not start with a new-line
The final improvement is done by realising that we always remove the initial newline in pRT if it is there, so we can merge it all in a single gsub:
BEGIN {
RS ="(^|\n)a[0-9]+[.] ";
ORS="\n-----\n"
}
/foo/ { print $0 pRT }
{pRT=RT;gsub(/^\n|[.] $/,"",pRT)}
RS: The input record separator. Its default value is a string containing a single newline character, which means that an input record consists of a single line of text. It can also be the null string, in which case records are separated by runs of blank lines. If it is a regexp, records are separated by matches of the regexp in the input text.
The ability for RS to be a regular expression is a gawk extension. In most other AWK implementations, or if gawk is in compatibility mode (see Options), just the first character of RS’s value is used.
ORS: The output record separator. It is output at the end of every print statement. Its default value is "\n", the newline character.
RT: (GNU AWK specific) The input text that matched the text denoted by RS, the record separator. It is set every time a record is read.
source: GNU AWK manual
This might work for you (GNU sed):
sed -rn '/^a[0-9]+\.\s/{:a;x;/foo/{s/^(a[0-9]+\.)\s*(.*)/\2\n\1\n-----/p;$d};x;h;b};H;$ba' file
Gather up lines that begin an. where n is an integer. If the line(s) contain the word foo make the required substitution and print the results otherwise do nothing.
Apology: When I began the solution the question was tagged sed.
When a line beginning an. is encountered, this line replaces whatever was in the hold space. However before it does, the hold space is first checked, and if it contains the word foo i.e. a collection already exists, the requirements to be processed are met and the so the lines are formatted as required and printed. Other lines are appended to the hold space. A special condition is met when the end-of-file is encountered which the is the same condition as when line beginning an. This is allowed for by the addition of a goto label :a.
With GNU awk, which you're already using for multi-char RS, the builtin variable that contains the string that matched the RS regexp is RT.
We need to fix your RS setting though because you need a regexp for RS that matches a<integer><dot><blank> at the start of a line ((^|\n)a[0-9]+[.]) or a newline on it's own at the end of the file (\n$) so the last record in the file is parsed the same as all the rest and below is how to write that. Note that the RT will start with a newline for all except the very first match in the file so we need to strip that leading newline from RT to get the actual identifier we want to print for each record:
$ cat tst.awk
BEGIN {
RS = "(^|\n)a[0-9]+[.] |\n$"
ORS = "\n-----\n"
}
/foo/ { print $0 "\n" id }
{ id = gensub(/^\n|[.] /,"","g",RT) }
Here's what it does given this input which includes more rainy-day cases than are present in the question (you should test other proposed solutions against this):
input:
$ cat file
a1. Hello
this
is foo bat man
a2. hello
this
is bar
a3. Hello
this is a7. just fine
is foo
output:
$ awk -f tst.awk file
Hello
this
is foo bat man
a1
-----
Hello
this is a7. just fine
is foo
a3
-----

Using awk, how I do specify doing something only when the first field is an integer?

I am just starting out in awk, and wonder, using awk, what is the proper way to state, do something (such as printing out the record) only when the first field is an integer?
do something ... only when the first field is an integer?
This does the command in braces, print in this case, only if the first field is a positive integer:
awk '$1 ~ /^[[:digit:]]+$/{print;}'
Floating point numbers are rejected.
If we want to accept either positive or negative integers, then, as mklement0 suggests, use the following:
awk '$1 ~ /^[+-]?[[:digit:]]+$/{print;}'
Note that, because [:digit:] is used, these tests are unicode safe.

Delete lines which contain a number smaller/larger than a user specified value

I need to delete lines in a large file which contain a value larger than a user specified number(see picture). For example I'd like to get rid of lines with values larger than 5e-48 (x>5e-48), i. e. lines with 7e-46, 7e-40, 1e-36,.... should be deleted.
Can sed, grep, awk or any other command do that?
Thank you
Markus
With awk:
awk '$3 <= 5e-48' filename
This selects only those lines whose third field is smaller than 5e-48.
If fields can contain spaces (since the data appears to be tab-separated) use
awk -F '\t' '$3 <= 5e-48' filename
This sets the field separator to \t, so lines are split at tabs rather than any whitespace. It does not appear to be necessary with the shown input data, but it is good practice to be defensive about these things (thanks to #tripleee for pointing this out).
In Perl, for example, the solution can be
perl -ane'print unless$F[2]>5e-48'

Using awk how do I reprint a found pattern with a new line character?

I have a text file in the format of:
aaa: bcd;bcd;bcddd;aaa:bcd;bcd;bcd;
Where "bcd" can be any length of any characters, excluding ; or :
What I want to do is print the text file in the format of:
aaa: bcd;bcd;bcddd;
aaa: bcd;bcd;bcd;
-etc-
My method of approach to this problem was to isolate a pattern of ";...:" and then reprint this pattern without the initial ;
I concluded I would have to use awk's 'gsub' to do this, but have no idea how to replicate the pattern nor how to print the pattern again with this added new line character 1 character into my pattern.
Is this possible?
If not, can you please direct me in a way of tackling it?
We can't quite be sure of the variability in the aaa or bcd parts; presumably, each one could be almost anything.
You should probably be looking for:
a series of one or more non-colon, non-semicolon characters followed by colon,
with one or more repeats of:
a series of one or more non-colon, non-semicolon characters followed by a semi-colon
That makes up the unit you want to match.
/[^:;]+:([^:;]+;)+/
With that, you can substitute what was found by the same followed by a newline, and then print the result. The only trick is avoiding superfluous newlines.
Example script:
{
echo "aaa: bcd;bcd;bcddd;aaa:bcd;bcd;bcd;"
echo "aaz: xcd;ycd;bczdd;baa:bed;bid;bud;"
} |
awk '{ gsub(/[^:;]+:([^:;]+;)+/, "&\n"); sub(/\n+$/, ""); print }'
Example output
aaa: bcd;bcd;bcddd;
aaa:bcd;bcd;bcd;
aaz: xcd;ycd;bczdd;
baa:bed;bid;bud;
Paraphrasing the question in a comment:
Why does the regular expression not include the characters before a colon (which is what it's intended to do, but I don't understand why)? I don't understand what "breaks" or ends the regex.
As I tried to explain at the top, you're looking for what we can call 'words', meaning sequences of characters that are neither a colon nor a semicolon. In the regex, that is [^:;]+, meaning one or more (+) of the negated character class — one or more non-colon, non-semicolon characters.
Let's pretend that spaces in a regex are not significant. We can space out the regex like this:
/ [^:;]+ : ( [^:;]+ ; ) + /
The slashes simply mark the ends, of course. The first cluster is a word; then there's a colon. Then there is a group enclosed in parentheses, tagged with a + at the end. That means that the contents of the group must occur at least once and may occur any number of times more than that. What's inside the group? Well, a word followed by a semicolon. It doesn't have to be the same word each time, but there does have to be a word there. If something can occur zero or more times, then you use a * in place of the +, of course.
The key to the regex stopping is that the aaa: in the middle of the first line does not consist of a word followed by a semicolon; it is a word followed by a colon. So, the regex has to stop before that because the aaa: doesn't match the group. The gsub() therefore finds the first sequence, and replaces that text with the same material and a newline (that's the "&\n", of course). It (gsub()) then resumes its search directly after the end of the replacement material, and — lo and behold — there is a word followed by a colon and some words followed by semicolons, so there's a second match to be replaced with its original material plus a newline.
I think that $0 must contain the newline at the end of the line. Therefore, without the sub() to remove a trailing newlines, the print (implictly of $0 with a newline) generated a blank line I didn't want in the output, so I removed the extraneous newline(s). The newline at the end of $0 would not be matched by the gsub() because it is not followed by a colon or semicolon.
This might work for you:
awk '{gsub(/[^;:]*:/,"\n&");sub(/^\n/,"");gsub(/: */,": ")}1' file
Prepend a newline (\n) to any string not containing a ; or a : followed by a :
Remove any newline prepended to the start of line.
Replace any : followed by none or many spaces with a : followed by a single space.
Print all lines.
Or this:
sed 's/;\([^;:]*: *\)/;\n\1 /g' file
Not sure how to do it in awk, but with sed this does what I think you want:
$ nl='
'
$ sed "s/\([^;]*:\)/\\${nl}\1/g" input
The first command sets the shell variable $nl to the string containing a single new line. Some versions of sed allow you to use \n inside the replacement string, but not all allow that. This keeps any whitespace that appears after the final ; and puts it at the start of the line. To get rid of that, you can do
$ sed "s/\([^;]*:\)/\\${nl}\1/g; s/\n */\\$nl/g" input
Ordinary awk gsub() and sub() don't allow you to specify components in the replacement strings Gnu awk - "gawk" - supplies "gensub()" which would allow "gensub(/(;) (.+:)/,"\1\n\2","g")"