Extract substring from a field with single awk in AIX - awk

I have a file file with content like:
stringa 8.0.1.2 stringx
stringb 12.01.0.0 stringx
I have to get a substring from field 2 (first two values with the dot).
I am currently doing cat file | awk '{print $2}' | awk -F. '{print $1"."$2}' and getting the expected output:
8.0
12.01
How can I do this with a single awk invocation?
I have tried match(), but I do not see an option for a back reference.
Any help would be appreciated.

You can do something like this.
$ awk '{ split($2,str,"."); print str[1]"."str[2] }' file
8.0
12.01
Also, keep in mind that your cat is not needed. Simply give the file directly to awk.

With GNU grep, you could try the following command.
grep -oP '^\S+\s+\K[[:digit:]]+\.[[:digit:]]+' Input_file
Explanation: This uses GNU grep. The -o option prints only the matched part and -P enables PCRE. The regex matches the leading non-space characters followed by 1 or more whitespace characters, then \K discards everything matched so far. After that it matches 1 or more digits, a dot, and 1 or more digits. If a match is found, the matched value is printed.

I would use AWK's split function as follows. Let the content of file.txt be
stringa 8.0.1.2 stringx
stringb 12.01.0.0 stringx
then
awk '{split($2,arr,".");print arr[1]"."arr[2]}' file.txt
output
8.0
12.01
Explanation: split the 2nd field at . and put the elements into the array arr.
(tested in gawk 4.2.1)
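The split approach generalizes to any number of leading components. The following is only a sketch: the function name first_n_parts and its parameter are hypothetical, and it assumes the version is always in field 2.

```shell
# Hypothetical helper: print the first N dot-separated parts of field 2.
# $1 is the number of leading components to keep.
first_n_parts() {
    awk -v n="$1" '{
        count = split($2, parts, ".")        # number of components found
        limit = (n < count) ? n : count      # never read past the last part
        out = parts[1]
        for (i = 2; i <= limit; i++)
            out = out "." parts[i]
        print out
    }'
}

printf 'stringa 8.0.1.2 stringx\nstringb 12.01.0.0 stringx\n' | first_n_parts 2
# prints:
# 8.0
# 12.01
```

It uses only split() and -v, so it should not depend on GNU extensions.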

You could match digits . digits from the second column and print if there is a match:
awk 'match($2, /^[[:digit:]]+\.[[:digit:]]+/) {
print substr($2, RSTART, RLENGTH)
}
' file
Output
8.0
12.01

Also with GNU awk and gensub():
awk '{print gensub(/([[:digit:]]+[.][[:digit:]]+)(.*)/,"\\1","g",$2)}' file
8.0
12.01
gensub() provides the ability to specify components of a regexp in the replacement text using parentheses in the regexp to mark the components and then specifying \\n in the replacement text, where n is a digit from 1 to 9.

You should perhaps not use awk at all (or any other external program, for that matter) but rely on the field-splitting capabilities of the shell and some variable expansion. For instance:
# printf "%s\n%s\n" "stringa 8.0.1.2 stringx" \
"stringb 12.01.0.0 stringx" |\
while read first second third junk ; do
printf "=%s= =%s= =%s=\n" "$first" "$second" "$third"
done
=stringa= =8.0.1.2= =stringx=
=stringb= =12.01.0.0= =stringx=
As you can see the value is captured in the variable "$second" already and you just need to further isolate the parts you want to see - the first and second part separated by a dot. You can do that either with parameter expansion:
# variable="8.0.1.2"
# echo ${variable%.*.*}
8.0
or like this:
# variable="12.01.0.0"
# echo ${variable%.${variable#*.*.}}
12.01
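or you can use a further read statement to separate the parts and then put them back together (this relies on ksh, the default shell on AIX, running the last command of a pipeline in the current shell; in bash the variables set by read would be lost):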
# variable="12.01.0.0"
# echo ${variable} | IFS=. read parta partb junk
# echo ${parta}.${partb}
12.01
So, putting all together:
# printf "%s\n%s\n" "stringa 8.0.1.2 stringx" \
"stringb 12.01.0.0 stringx" |\
while read first second third junk ; do
printf "%s\n" "$second" | IFS=. read parta partb junk
printf "%s.%s\n" "$parta" "$partb"
done
8.0
12.01
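If the script also has to run under shells that execute every part of a pipeline in a subshell (bash does this by default), the pipe into read can be avoided by relying on parameter expansion alone. A minimal sketch, assuming the value of interest is always the second field; the function name first_two is made up for illustration:

```shell
# Hypothetical helper: print the first two dot-separated parts of field 2
# using only shell built-ins (no awk, no pipe into read).
first_two() {
    while read -r _first second _rest; do
        # ${second#*.*.} is what follows the second dot; removing a dot
        # plus that tail from the end leaves the first two parts.
        printf '%s\n' "${second%.${second#*.*.}}"
    done
}

printf '%s\n' "stringa 8.0.1.2 stringx" "stringb 12.01.0.0 stringx" | first_two
# prints:
# 8.0
# 12.01
```

A value that has only two parts is printed unchanged, because the inner expansion then matches nothing.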

Related

Extracting and rearranging columns

I read from stdin lines which contain fields. The field delimiter is a semicolon. There are no specific quoting characters in the input (i.e. the fields can't contain themselves semicolons or newline characters). The number of the input fields is unknown, but it is at least 4.
The output is supposed to be a similar file, consisting of the fields from 2 to the end, but field 2 and 3 reversed in order.
I'm using zsh.
I came up with a solution, but find it clumsy. In particular, I could not think of anything specific to zsh which would help me here, so basically I reverted to awk. This is my approach:
awk -F ';' '{printf("%s", $3 ";" $2); for(i=4;i<=NF;i++) printf(";%s", $i); print "" }' <input_file >output_file
The first printf takes care about the two reversed fields, and then I use an explicit loop to write out the remaining fields. Is there a possibility in awk (or gawk) to print a range of fields in a single command? Or did I miss some incredibly clever feature in zsh, which could make my life simpler?
UPDATE: Example input data
a;bb;c;D;e;fff
gg;h;ii;jj;kk;l;m;n
Should produce the output
c;bb;D;e;fff
ii;h;jj;kk;l;m;n
Using any awk in any shell on every Unix box:
$ awk 'BEGIN{FS=OFS=";"} {t=$3; $3=$2; $2=t; sub(/[^;]*;/,"")} 1' file
c;bb;D;e;fff
ii;h;jj;kk;l;m;n
With GNU awk you could try the following code. It uses the match function of GNU awk with the regex ^[^;]*;([^;]*;)([^;]*;)(.*)$, which creates 3 capturing groups. Their values are stored in the array arr (the third argument to match is a GNU awk extension) and are then printed in the required order.
awk 'match($0,/^[^;]*;([^;]*;)([^;]*;)(.*)$/,arr){
print arr[2] arr[1] arr[3]
}
' Input_file
If perl is accepted, it provides a join() function to join elements on a delimiter. In awk though you'd have to explicitly define one (which isn't complex, just more lines of code)
perl -F';' -nlae '$t = $F[2]; $F[2] = $F[1]; $F[1] = $t; print join(";", @F[1..$#F])' file
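For comparison, a join helper in awk could look like the sketch below. The names swap_fields and join_range are hypothetical; the code uses only POSIX awk features and assumes ; as the delimiter.

```shell
# Hypothetical helper: drop field 1, swap fields 2 and 3, keep the rest.
swap_fields() {
    awk -F';' '
    # Join fields first..last of the current record with sep.
    # i and out are local variables (extra parameters, awk idiom).
    function join_range(first, last, sep,    i, out) {
        for (i = first; i <= last; i++)
            out = out (i > first ? sep : "") $i
        return out
    }
    { print $3 ";" $2 (NF > 3 ? ";" join_range(4, NF, ";") : "") }
    '
}

printf 'a;bb;c;D;e;fff\ngg;h;ii;jj;kk;l;m;n\n' | swap_fields
# prints:
# c;bb;D;e;fff
# ii;h;jj;kk;l;m;n
```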
With sed, perl, hck and rcut (my own script):
$ sed -E 's/^[^;]+;([^;]+);([^;]+)/\2;\1/' ip.txt
c;bb;D;e;fff
ii;h;jj;kk;l;m;n
# can also use: perl -F';' -lape '$_ = join ";", @F[2,1,3..$#F]' ip.txt
$ perl -F';' -lane 'print join ";", @F[2,1,3..$#F]' ip.txt
c;bb;D;e;fff
ii;h;jj;kk;l;m;n
# -d and -D specifies input/output separators
$ hck -d';' -D';' -f3,2,4- ip.txt
c;bb;D;e;fff
ii;h;jj;kk;l;m;n
# syntax similar to cut, but output field order can be different
$ rcut -d';' -f3,2,4- ip.txt
c;bb;D;e;fff
ii;h;jj;kk;l;m;n
Note that the sed version leaves input lines with fewer than 3 fields unchanged.
$ cat ip.txt
1;2;3
apple;fig
abc
$ sed -E 's/^[^;]+;([^;]+);([^;]+)/\2;\1/' ip.txt
3;2
apple;fig
abc
$ perl -F';' -lane 'print join ";", @F[2,1,3..$#F]' ip.txt
3;2
;fig
;
Another awk variant:
awk 'BEGIN{FS=OFS=";"} {$1=$3; $3=""; sub(/;;/, ";")} 1' file
c;bb;D;e;fff
ii;h;jj;kk;l;m;n
With gnu awk and gensub switching the position of 2 capture groups:
awk '{print gensub(/^[^;]*;([^;]*);([^;]*)/, "\\2;\\1", 1)}' file
The pattern matches
^ Start of string
[^;]*; Negated character class, match optional chars other than ; and then match ;
([^;]*);([^;]*) 2 capture groups, both capturing chars other than ; and match ; in between
Output
c;bb;D;e;fff
ii;h;jj;kk;l;m;n
awk '{print $3, $0}' {,O}FS=\; < file | cut -d\; -f1,3,5-
This uses awk to prepend the third column, then pipes to cut to extract the desired columns.
Here is one way to do it using only zsh:
rearrange() {
  local -a lines=(${(@f)$(</dev/stdin)})
  for line in $lines; do
    local -a flds=(${(s.;.)line})
    print $flds[3]';'$flds[2]';'${(j.;.)flds[4,-1]}
  done
}
The same idea in a single line. This may not be an improvement over your awk script:
for l in ${(@f)$(<&0)}; print ${${(A)i::=${(s.;.)l}}[3]}\;$i[2]\;${(j.;.)i:3}
Some of the pieces:
$(</dev/stdin) - read from stdin using pseudo-device.
$(<&0) - another way to read from stdin.
(f) - parameter expansion flag to split by newlines.
(@) - treat split as an array.
(s.;.) - split by semicolon.
$flds[3] - expands to the third array element.
$flds[4,-1] - fourth, fifth, etc. array elements.
$i:3 - ksh-style array slice for fourth, fifth ... elements.
Mixing styles like this can be confusing, even if it is slightly shorter.
(j.;.) - join array by semicolon.
i::= - assign the result of the expansion to the variable i.
This lets us use the semicolon-split fields later.
(A)i::= - the (A) flag ensures i is an array.

Awk: gsub("\\\\", "\\\\") yields surprising results

Consider the following input:
$ cat a
d:\
$ cat a.awk
{ sub("\\", "\\\\"); print $0 }
$ cat a_double.awk
{ sub("\\\\", "\\\\"); print $0 }
Now running cat a | awk -f a.awk gives
d:\
and running cat a | awk -f a_double.awk gives
d:\\
and I expect exactly the other way around. How should I interpret this?
$ awk -V
GNU Awk 4.1.4, API: 1.1 (GNU MPFR 4.0.1, GNU MP 6.1.2)
Yes, this is the expected behavior of awk. In your first script the first argument of sub() is written as a string in double quotes rather than as a regexp between /, so it is processed twice: first as a string, then as a regexp. To match one literal \ you have to escape it for the regexp, and each of those two backslashes must in turn be escaped inside the string, hence it becomes \\\\
\\                                     \\
|                                      |
|                                      |
first 2 chars denote the escaping      next 2 chars denote the actual literal character \
This is NOT done in your 1st case, hence there is no match and no substitution. In your 2nd awk script you are doing this (the escaping in the regexp argument of sub), hence it matches \ perfectly.
Let's see this by example and try putting ... for checking purposes.
When nothing happens, since there is no match:
awk '{sub("\\", "....\\\\"); print $0}' Input_file
d:\
When pattern matching happens:
awk '{sub("\\\\", "...\\\\"); print $0}' Input_file
d:...\\
From man awk:
gsub(r, s [, t])
For each substring matching the regular expression r in the string t,
substitute the string s, and return the number of substitutions.
How can we escape only once (a single \ before the character)? Write the regexp between /../ in the first argument of sub; then there is no need to double-escape \.
awk '{sub(/\\/,"&\\")} 1' Input_file
The first arg to *sub() is a regexp, not a string, so you should use regexp (/.../) rather than string ("...") delimiters. The former is a literal regexp which is used as-is while the latter defines a dynamic (or computed) regexp which forces awk to interpret the string twice, the first time to convert the string to a regexp and the second to use it as a regexp, hence double the backslashes needed for escaping. See https://www.gnu.org/software/gawk/manual/gawk.html#Computed-Regexps.
In the following we just need to escape the backslash once because we're using a literal, rather than dynamic, regexp:
$ cat a
d:\
$ awk '{sub(/\\/,"\\\\")}1' a
d:\\
Your first script would rightfully produce a syntax error in a more recent version of gawk (5.1.0) since "\\" in a dynamic regexp is equivalent to /\/ in a literal one and in that expression the backslash is escaping the final forward slash which means there is no final delimiter:
$ cat a.awk
{ sub("\\", "\\\\"); print $0 }
$ awk -f a.awk a
awk: a.awk:1: (FILENAME=a FNR=1) fatal: invalid regexp: Trailing backslash: /\/
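The two spellings can be compared side by side. This is only an illustrative sketch: the function names are made up, and the replacement is a plain / so that the rules for backslashes in the replacement text do not come into play.

```shell
# Literal regexp: the backslash is escaped once, for the regexp.
literal_form() { awk '{ sub(/\\/, "/"); print }'; }

# Dynamic (string) regexp: escaped once for the string and once more
# for the regexp, so four backslashes are needed.
dynamic_form() { awk '{ sub("\\\\", "/"); print }'; }

printf 'd:\\\n' | literal_form    # prints d:/
printf 'd:\\\n' | dynamic_form    # prints d:/
```

Both forms match the same single backslash; only the number of escaping layers differs.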

Combining two awk commands into one

Currently, I'm using the following two awk commands connected with one pipeline:
awk 'sub(/([^ ]+[ ]+){4}[^ ]+[ ]/,"")' ~/.bash_eternal_history | awk '!a[$0]++'
I want to combine them in one awk invocation. How should I revise the commands above?
This works like your code:
awk 'sub(/([^ ]+[ ]+){4}[^ ]+[ ]/,"") && !a[$0]++' ~/.bash_eternal_history
The first part returns false if the pattern does not match, the second condition returns false if the replaced string is already in the hash.
A simplified example
echo -e "xlmx\nxlmx\nyyy\nxlmx"|awk 'sub("lm", "") && !a[$0]++'
Output:
xx
Notes
With older gawk (like 3.1.5) --re-interval has to be used to enable {n,m} RE interval expressions. In newer version it is on by default as OP noted.
The RE could be reduced a bit. [ ] is identical to a simple space, so the pattern could be /([^ ]+ +){4}[^ ]+ /. Or, to extend the pattern, use [[:space:]] to accept all kinds of whitespace as separator.
It turned out that some older gawk versions have an RE problem. The second command below does not return any row with gawk v3.1.5, but it does work with the newer gawk v4.1.3.
$ echo -e "al\na b c \n a"|awk --re-interval '/([^ ]+ +){2}/'
a b c
$ echo -e "al\na b c \n a"|awk --re-interval '/([^ ]+[ ]+){2}/'
You can move the second script to the first separated by ;
awk '{sub(/([^ ]+[ ]+){4}[^ ]+[ ]/,"")}; !a[$0]++' ~/.bash_eternal_history
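Note: The first statement, sub, must be placed in {} so that it runs as an action. Used as a bare pattern, it would print every line it changes, and !a[$0]++ would then print the line a second time. Unlike the original pipeline, this version also prints (deduplicated) lines in which the pattern did not match.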
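The difference between the two combined forms shows up on lines that the pattern does not match. A small sketch with a simplified pattern (the function names and the sample input are only for illustration):

```shell
# sub() as part of the pattern: only lines that were changed are
# candidates for printing, as in the original two-command pipeline.
dedupe_changed() { awk 'sub("lm", "") && !seen[$0]++'; }

# sub() inside an action: every line reaches the de-duplication step.
dedupe_all() { awk '{ sub("lm", "") } !seen[$0]++'; }

printf 'xlmx\nxlmx\nyyy\nxlmx\n' | dedupe_changed
# prints:
# xx
printf 'xlmx\nxlmx\nyyy\nxlmx\n' | dedupe_all
# prints:
# xx
# yyy
```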

Is there a way to write multiple awk commands on a single line?

Is there a clear method to write the following with pipes avoiding the temp file(s) redirection?
awk '{gsub("Sat ", "Sat. ");print}' Sat.txt >win.txt
awk '{gsub(" 1-0", "");print}' win.txt >loss.txt
awk '{gsub(" 0-1", "");print}' loss.txt >draw.txt
awk '{gsub(" 1/2-1/2", "");print}' draw.txt >$TARGET
You can have as many gsub() as you want in awk. Each one of them will replace $0, so every time you will be working with the modified string.
However, note you can compress the gsub() into just two of them by using some regular expressions:
awk '{gsub("Sat ", "Sat. "); gsub(/ (1-0|0-1|1\/2-1\/2)/, "")}1' file
#                                    ^^^ ^^^ ^^^^^^^^^
#                                    1-0 0-1 1/2-1/2
The first one replaces Sat with Sat. and the second one removes space + either of 1-0, 0-1 or 1/2-1/2.
Test
$ cat a
hello Sat is now and 1-0 occurs there when 0-1 results happen but 1/2-1/2 also
bye
$ awk '{gsub("Sat ", "Sat. "); gsub(/ (1-0|0-1|1\/2-1\/2)/, "")}1' a
hello Sat. is now and occurs there when results happen but also
bye
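For readers who prefer a one-to-one translation of the four original commands, the gsub() calls can simply be listed one after another in a single action. A sketch, with clean_results as a made-up function name:

```shell
# Hypothetical wrapper: apply the four substitutions in order, each one
# working on the $0 left behind by the previous one.
clean_results() {
    awk '{
        gsub("Sat ", "Sat. ")
        gsub(" 1-0", "")
        gsub(" 0-1", "")
        gsub(" 1/2-1/2", "")
        print
    }' "$@"
}

echo 'hello Sat is now and 1-0 occurs there when 0-1 results happen but 1/2-1/2 also' | clean_results
# prints: hello Sat. is now and occurs there when results happen but also
```

With a file argument it could be called as clean_results Sat.txt >"$TARGET", which removes the need for the intermediate files.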
You can also run several awk commands on one line by separating them with a semicolon ;
For example, the code below finds the paragraphs that contain Sub and those that contain Warning and saves them into a new file sub.txt:
awk '/Sub/' RS="\n\n" ORS="\n\n" IMD.txt >sub.txt ; awk '/Warning/' RS="\n\n" ORS="\n\n" IMD.txt >> sub.txt

Awk: Using invert match to a string and then substitute characters

I want to extract lines that don't contain # and delete ", ; in the output.
My input FILE looks like this:
# ;string"1"
# string"2";
string"3";
I can use grep and tr to get the wanted output:
grep -v '#' FILE | tr -d ';"'
string3
However I want to use awk.
I can invert the match with awk '!/#/' FILE, but how can I use sub to delete ", ; in the same awk command?
You can use gsub for global substitution:
awk '!/#/{gsub(/[";]/,"",$0);print}'
The following transcript shows this in action, it delivers the same results as your grep/tr pipeline:
pax> echo '# ;string"1"
# string"2";
string"3";' | awk '!/#/{gsub(/[";]/,"",$0);print}{}'
string3
Note that the final {} may not be necessary in some implementations of awk but it's there to stop output of non-matching lines in those implementations (usually older ones) that do it automatically for lines matching none of the rules.
Use gsub instead which would replace all matches not just one:
awk '/#/{next}{gsub(/[";]/,"")}1' file
Output:
string3
Skipping the third parameter to gsub makes it process $0 by default.
/#/{next} makes it skip lines containing #
1 makes it print $0
Another awk version
awk -F"[\";]" '{$1=$1} !/^#/' OFS= file
string3
awk '{gsub(/[";]/,x)} !/^#/' file
string3
The x is an unset variable, so it represents the empty string. "" could be used instead, but x saves one character :)
If you want to give sed a chance:
sed -n '/^[^#]/s/[";]//gp' file
string3