Commands in awk

Do I need to have a ; at the end of each command in awk files?
title = gensub (beg_ere, "\\2", 1, $0);
subtitle = gensub (beg_ere, "\\3", 1, $0);
keywords = gensub (beg_ere, "\\4", 1, $0);
nu = split (ukeys, uaggr, ",");
nk = split (keywords, kaggr, ",");

The awk grammar contains the definitions:
program : item_list
| item_list item
;
item_list : /* empty */
| item_list item terminator
;
and
terminator : terminator NEWLINE
| ';'
| NEWLINE
;
So, no, you do not always need a semicolon. You need one only when you put two items on the same line (i.e., with no newline between them).
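As a small illustration (the variable names and sample input are made up for this example), the two programs below should behave identically: the first separates its statements with newlines only, the second puts them on one line and therefore needs semicolons.

```shell
# Statements separated by newlines: no semicolons needed.
with_newlines=$(printf 'a b\n' | awk '{
    first = $1
    second = $2
    print second, first
}')

# The same statements on a single line: semicolons act as the separators.
with_semicolons=$(printf 'a b\n' | awk '{ first = $1; second = $2; print second, first }')

printf '%s\n%s\n' "$with_newlines" "$with_semicolons"
```

Both commands print b a.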

The AWK Programming Language says this about actions:
Sometimes an action is very simple: a single print or assignment.
Other times, it may be a sequence of several statements separated by
newlines or semicolons. (...)
This means that separating statements with newlines alone, or with semicolons alone, is enough.

awk: first, split a line into separate lines; second, use those new lines as a new input

Let's say I have this line:
foo|bar|foobar
I want to split it at | and then use those 3 new lines as the input for the further proceedings (let's say replace bar with xxx).
Sure, I can pipe two awk instances, like this:
echo "foo|bar|foobar" | awk '{gsub(/\|/, "\n"); print}' | awk '/bar/ {gsub(/bar/, "xxx"); print}'
But how can I achieve this in one script? First, do one operation on some input, and then treat the result as the new input for the second operation?
I tried something like this:
echo "foo|bar|foobar" | awk -v c=0 '{
{
gsub(/\|/, "\n");
sprintf("%s", $0);
}
{
if ($0 ~ /bar/) {
c+=1;
gsub(/bar/, "xxx");
print c;
print
}
}
}'
Which results in this:
1
foo
xxx
fooxxx
And thanks to the counter c, it's absolutely obvious that the subsequent if doesn't treat the multi-line input it receives as several new records but instead just as one multi-lined record.
Thus, my question is: how to tell awk to treat this new multi-line record it receives as many single-line records?
The desired output in this very example should be something like this if I'm correct:
1
xxx
2
fooxxx
But this is just an example, the question is more about the mechanics of such a transition.
I would suggest an alternative approach using split(): instead of working on a single multi-line string, split the line on the delimiter into an array and iterate over its elements.
echo "foo|bar|foobar" |\
awk '{
count = 0
n = split($0, arr, "|")
for ( i = 1; i <= n; i++ )
{
if ( arr[i] ~ /bar/ )
{
count += sub(/bar/, "xxx", arr[i])
print count
print arr[i]
}
}
}'
Also, you don't need a separate increment of the count variable: sub() returns the number of substitutions it made (0 or 1), so you can add its return value directly to count.
As one more level of optimization, you can get rid of the ~ match in the if condition and use the sub() call itself as the condition:
if ( sub(/bar/, "xxx", arr[i]) )
{
count++
print count
print arr[i]
}
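Putting that condition back into the full loop gives, as a sketch, the complete command below. It uses the same sample input as the question; the shell variable result exists only so the output can be inspected.

```shell
result=$(echo "foo|bar|foobar" | awk '{
    count = 0
    n = split($0, arr, "|")             # a single-character separator is taken literally
    for (i = 1; i <= n; i++) {
        if (sub(/bar/, "xxx", arr[i])) { # sub() returns 1 when it replaced something
            count++
            print count
            print arr[i]
        }
    }
}')
printf '%s\n' "$result"
```

This prints 1, xxx, 2 and fooxxx on separate lines, which is the output the question asks for.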
If you set the record separator (RS) to the pipe character, you almost get the desired effect, e.g.:
echo 'foo|bar|foobar' | awk -v RS='|' 1
Output:
foo
bar
foobar
(an empty line)
The exception is that the trailing newline becomes part of the last record, so there is an extra line at the end of the output. You can work around this either by including a newline in the RS variable, which makes it less portable, or by not sending a trailing newline to awk.
For example using the less portable way:
echo 'foo|bar|foobar' | awk -v RS='\\||\n' '{ sub(/bar/, "baz") } 1'
Output:
foo
baz
foobaz
Note that the empty record at the end is ignored.
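The other workaround mentioned above, not sending a trailing newline to awk at all, can be sketched as follows, assuming the input can be produced with printf instead of echo. A single-character RS is standard, so this should not depend on GNU awk.

```shell
# printf emits no trailing newline, so the last record is exactly "foobar".
records=$(printf 'foo|bar|foobar' | awk -v RS='|' '{ sub(/bar/, "baz") } 1')
printf '%s\n' "$records"
```

This prints foo, baz and foobaz, one per line, with no extra empty line.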
With GNU awk:
$ awk -v RS='[|\n]' 'gsub(/bar/,"xxx"){print ++c ORS $0}' file
1
xxx
2
fooxxx
With any awk:
$ awk -F'|' '{c=0; for (i=1;i<=NF;i++) if ( gsub(/bar/,"xxx",$i) ) print ++c ORS $i }' file
1
xxx
2
fooxxx

Remove word from a comma separated values of specific field

The NIS group file has format
group1:*:100:bat,cat,zat,ratt
group2:*:200:rat,cat,bat
group3:*:300:rat
With : as the delimiter, I need to remove an exact word (for example rat) from the 4th column. Any leading or trailing , next to the word should be deleted as well, to preserve the comma-separated format of the 4th column.
Expected output:
group1:*:100:bat,cat,zat,ratt
group2:*:200:cat,bat
group3:*:300:
You'd better use awk for this job. Try this (GNU awk):
awk 'BEGIN {OFS=FS=":"} {gsub (/\yrat,?\y|\y,?rat\y/, "", $4)}1' file
Using : as field separator, gsub removes all rat in 4th field. \y is used for word boundaries so that rat will match but not rrat.
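For an awk without \y, one possible sketch is to split the 4th field on commas and rebuild it from the members that are not an exact match. The function name remove_member and the variable names are hypothetical, chosen for this example; the sample lines include two extra cases (rat in the middle and at the end of the list).

```shell
# remove_member WORD: read a group file on stdin, drop WORD from the 4th field.
remove_member() {
    awk -v word="$1" 'BEGIN { FS = OFS = ":" } {
        n = split($4, members, ",")      # n is 0 when the 4th field is empty
        out = ""; sep = ""
        for (i = 1; i <= n; i++)
            if (members[i] != word) {    # whole-member comparison, so ratt is kept
                out = out sep members[i]
                sep = ","
            }
        $4 = out                         # assigning a field rebuilds $0 with OFS
        print
    }'
}

cleaned=$(printf '%s\n' \
    'group1:*:100:bat,cat,zat,ratt' \
    'group2:*:200:rat,cat,bat' \
    'group3:*:300:rat' \
    'group4:*:400:mat,rat,sat' \
    'group5:*:500:pat,rat' | remove_member rat)
printf '%s\n' "$cleaned"
```

Because members are compared as whole strings, no word-boundary operator is needed, and the commas are reinserted only between the members that remain.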
If perl solution is okay:
Modified sample input to add more relevant cases..
$ cat ip.txt
group1:*:100:bat,cat,zat,ratt
group2:*:200:rat,cat,bat
group3:*:300:rat
group4:*:400:mat,rat,sat
group5:*:500:pat,rat
$ perl -F: -lane '(@a) = split/,/,$F[3]; $F[3] = join ",", grep { $_ ne "rat" } @a; print join ":", @F' ip.txt
group1:*:100:bat,cat,zat,ratt
group2:*:200:cat,bat
group3:*:300:
group4:*:400:mat,sat
group5:*:500:pat
-F: split input line on : and save to @F array
(@a) = split/,/,$F[3] split 4th column on , and save to @a array
$F[3] = join ",", grep { $_ ne "rat" } @a remove elements in @a array exactly matching rat, join those elements with , and modify 4th field of input line
print join ":", @F print the modified @F array elements joined by :
Golfing to avoid the temp array @a
$ perl -F: -lane '$F[3] = join ",", grep { $_ ne "rat" } split/,/,$F[3]; print join ":", @F' ip.txt
Using regex on 4th column:
$ perl -F: -lane '$F[3] =~ s/,rat\b|\brat(,|\b)//g; print join ":", @F' ip.txt
group1:*:100:bat,cat,zat,ratt
group2:*:200:cat,bat
group3:*:300:
group4:*:400:mat,sat
group5:*:500:pat
This might work for you (GNU sed):
sed -r 's/\brat\b,?//g' file
Remove one or more words rat followed by a possible ,.
awk 'NR>1{sub(/rat,*/,"")}1' file
group1:*:100:bat,cat,zat,ratt
group2:*:200:cat,bat
group3:*:300:

Convert single column into three comma separated columns using awk

I have a single long column and want to reformat it into three comma separated columns, as indicated below, using awk or any Unix tool.
Input:
Xaa
Ybb
Mdd
Tmmn
UUnx
THM
THSS
THEY
DDe
Output:
Xaa,Ybb,Mdd
Tmmn,UUnx,THM
THSS,THEY,DDe
$ awk '{printf "%s%s",$0,NR%3?",":"\n";}' file
Xaa,Ybb,Mdd
Tmmn,UUnx,THM
THSS,THEY,DDe
How it works
For every line of input, this prints the line followed by, depending on the line number, either a comma or a newline.
The key part is this ternary expression:
NR%3?",":"\n"
This takes the line number modulo 3. If that is non-zero, then it returns a comma. If it is zero, it returns a newline character.
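To see which separator the expression picks for each line, a quick check like the following can help. The six-line input and the words comma and newline are placeholders for this illustration only.

```shell
# Print the line number, NR % 3, and the separator the ternary would choose.
choices=$(printf '%s\n' a b c d e f |
    awk '{ print NR, NR % 3, (NR % 3 ? "comma" : "newline") }')
printf '%s\n' "$choices"
```

Lines 3 and 6 give a remainder of 0 and therefore select the newline; every other line selects the comma.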
Handling files that end before the final line is complete
This assumes that the number of lines in the file is an integer multiple of three. If it isn't, then we probably want to ensure that the last line ends with a newline. This can be done, as Jonathan Leffler suggests, using:
awk '{printf "%s%s",$0,NR%3?",":"\n";} END { if (NR%3 != 0) print ""}' file
If the final line is short of three columns, the above code will leave a trailing comma on the line. This may or may not be a problem. If we do not want the final comma, then use:
awk 'NR==1{printf "%s",$0; next} {printf "%s%s",(NR-1)%3?",":"\n",$0;} END {print ""}' file
Jonathan Leffler offers this slightly simpler alternative to achieve the same goal:
awk '{ printf("%s%s", pad, $1); pad = (NR%3 == 0) ? "\n" : "," } END { print "" }'
Improved portability
To support platforms which don't use \n as the line terminator, Ed Morton suggests:
awk -v OFS=, '{ printf("%s%s", pad, $1); pad = (NR%3?OFS:ORS)} END { print "" }' file
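The same pad technique generalizes to any number of columns. The sketch below wraps it in a shell function; the name to_columns and the awk variable cols are hypothetical, introduced only for this example. It uses $0 rather than $1 so that lines containing spaces are kept whole.

```shell
# to_columns N: join every N input lines with commas.
to_columns() {
    awk -v cols="$1" '
        { printf "%s%s", pad, $0             # pad is empty before the first line
          pad = (NR % cols ? "," : "\n") }   # newline after every cols-th line
        END { if (NR) print "" }             # end the last row unless input was empty
    '
}

# Seven lines into three columns: the last row is short and has no trailing comma.
rows=$(printf '%s\n' Xaa Ybb Mdd Tmmn UUnx THM THSS | to_columns 3)
printf '%s\n' "$rows"
```

Because the separator is printed before each line rather than after it, an incomplete final row never ends with a comma.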
There is a tool for this: pr.
pr -3ats, file
That is: 3 columns (-3), filled across (-a), header suppressed (-t), with a comma as the separator (-s,).
xargs -n3 < file | awk -v OFS="," '{$1=$1} 1'
xargs uses echo as default action, $1=$1 forces rebuild of $0.
Using only awk, I would go with this (which is similar to what was proposed by @jonathan-leffler and @John1024):
{
sep = NR == 1 ? "" : \
(NR-1)%3 ? "," : \
"\n"
printf "%s%s", sep, $0   # pass the data as an argument, never as the format string
}
END {
printf "\n"
}

why doesn't awk seem to work on splitting into fields based on alternative involving "."?

It is OK to awk split on .:
>printf foo.bar | awk '{split($0, a, "."); print a[1]}'
foo
It is OK to awk split on an alternative:
>printf foo.bar | awk '{split($0, a, "b|a"); print a[1]}'
foo.
Then why is it not OK to split on an alternative involving .:
>printf foo.bar | awk '{split($0, a, ".|a"); print a[1]}'
(nothing printed)
Escape that period and I think you'll be golden:
printf foo.bar | awk '{split($0, a, "\\.|a"); print a[1]}'
JNevill showed how to get it working. But to answer your question of why the escape is needed in one case but not the other, we can find the answer in the awk manual in the summary of "how fields are split, based on the value of FS." (And the same rules apply to the fieldsep given to the split command.)
The bottom line is that when FS is a single character (other than a space) it is not treated as a regular expression; otherwise it is.
Hence split($0, a, ".") works as we hope, taking . literally, but split($0, a, ".|a") takes .|a as a regexp in which . has its special meaning and matches any character. That is why the backslashes are needed to have the . treated literally.
FS == " "
Fields are separated by runs of whitespace. Leading and trailing whitespace are ignored. This is the default.
FS == any single character
Fields are separated by each occurrence of the character. Multiple successive occurrences delimit empty fields, as do leading and trailing occurrences.
FS == regexp
Fields are separated by occurrences of characters that match regexp. Leading and trailing matches of regexp delimit empty fields.
You can see that, despite the empty result, .|a is really doing something: it divides the line into eight empty fields, just as a line like ,,,,,,, would be divided with FS set to ,.
$ printf foo.bar | awk '{split($0, a, ".|a"); for (i in a) print i ": " a[i]; }'
4:
5:
6:
7:
8:
1:
2:
3:
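The difference is easy to check by looking at the field count that split() returns. The following sketch compares the literal single-character separator, the unescaped regexp, the escaped regexp, and a bracket expression, which is one way to avoid the backslashes; the shell variable names are arbitrary.

```shell
# Single character: taken literally, so the line splits into foo and bar.
literal=$(printf 'foo.bar\n' | awk '{ print split($0, a, ".") }')

# Longer string: a regexp in which . matches every character, giving 8 empty fields.
regex=$(printf 'foo.bar\n' | awk '{ print split($0, a, ".|a") }')

# Escaped dot, or a bracket expression: splits on a literal . or on a.
escaped=$(printf 'foo.bar\n' | awk '{ n = split($0, a, "\\.|a"); print n, a[1] }')
bracket=$(printf 'foo.bar\n' | awk '{ n = split($0, a, "[.]|a"); print n, a[1] }')

printf '%s\n' "$literal" "$regex" "$escaped" "$bracket"
```

The four results are 2, 8, 3 foo and 3 foo.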

replacing the `'` char using awk

I have lines with a single : and a ' in them that I want to get rid of. I want to use awk for this. I've tried using:
awk '{gsub ( "[:\\']","" ) ; print $0 }'
and
awk '{gsub ( "[:\']","" ) ; print $0 }'
and
awk '{gsub ( "[:']","" ) ; print $0 }'
None of them worked; they all return the error Unmatched ".. When I put
awk '{gsub ( "[:_]","" ) ; print $0 }'
then it works and removes all : and _ chars. How can I get rid of the ' char?
tr is made for this purpose
echo test\'\'\'\':::string | tr -d \':
teststring
$ echo test\'\'\'\':::string | awk '{gsub(/[:\47]*/,"");print $0}'
teststring
This works:
awk '{gsub( "[:'\'']","" ); print}'
You could use:
Octal code for the single quote:
[:\47]
The single quote inside double quotes, but in that case special
characters will be expanded by the shell:
% print a\': | awk "sub(/[:']/, x)"
a
Use a dynamic regexp, but there are performance implications related
to this approach:
% print a\': | awk -vrx="[:\\\']" 'sub(rx, x)'
a
With bash you cannot insert a single quote inside a literal surrounded by single quotes. Use '"'"' for example.
The first ' closes the current literal, then "'" concatenates it with a literal containing only a single quote, and the final ' reopens a string literal, which is also concatenated.
What you want is:
awk '{gsub ( "[:'"'"']","" ) ; print $0; }'
ssapkota's alternative is also good ('\'').
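The quoting options from this thread can be compared side by side. This is a sketch with a made-up input string; the third variant, which passes the bracket expression in through -v so that the awk program itself contains no quote character, is an additional assumption and not taken from the answers above.

```shell
input="it's a: test"

# Close the quoted program, add an escaped quote, reopen: '\''
via_escape=$(printf '%s\n' "$input" | awk '{ gsub(/[:'\'']/, ""); print }')

# Octal escape for the single quote inside the regexp.
via_octal=$(printf '%s\n' "$input" | awk '{ gsub(/[:\047]/, ""); print }')

# Dynamic regexp supplied from the shell with -v.
via_var=$(printf '%s\n' "$input" | awk -v chars="[:']" '{ gsub(chars, ""); print }')

printf '%s\n' "$via_escape" "$via_octal" "$via_var"
```

All three print its a test.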
I don't know why you are restricting yourself to awk; anyway, you've got many answers from other users. You can also use sed to get rid of both : and ' by putting the script in double quotes:
sed "s/[:']//g"
This also serves your purpose and is simpler.
This also works:
awk '{gsub("\x27",""); print}'
simplest
awk '{gsub(/\047|:/,"")};1'