Extracting and rearranging columns - awk

I read lines from stdin which contain fields. The field delimiter is a semicolon. There are no quoting characters in the input (i.e. the fields can't themselves contain semicolons or newline characters). The number of input fields is unknown, but it is at least 4.
The output is supposed to be a similar file, consisting of the fields from 2 to the end, but with fields 2 and 3 reversed in order.
I'm using zsh.
I came up with a solution, but find it clumsy. In particular, I could not think of anything specific to zsh which would help me here, so basically I reverted to awk. This is my approach:
awk -F ';' '{printf("%s", $3 ";" $2); for(i=4;i<=NF;i++) printf(";%s", $i); print "" }' <input_file >output_file
The first printf takes care of the two reversed fields, and then I use an explicit loop to write out the remaining fields. Is there a way in awk (or gawk) to print a range of fields in a single command? Or did I miss some incredibly clever feature in zsh which could make my life simpler?
UPDATE: Example input data
a;bb;c;D;e;fff
gg;h;ii;jj;kk;l;m;n
Should produce the output
c;bb;D;e;fff
ii;h;jj;kk;l;m;n

Using any awk in any shell on every Unix box:
$ awk 'BEGIN{FS=OFS=";"} {t=$3; $3=$2; $2=t; sub(/[^;]*;/,"")} 1' file
c;bb;D;e;fff
ii;h;jj;kk;l;m;n

With GNU awk you could try the following code. It uses GNU awk's match() function with the regex ^[^;]*;([^;]*;)([^;]*;)(.*)$ to capture the values as required. This creates 3 capturing groups whose values are stored in an array named arr (GNU awk functionality), and the program then prints them in the required order.
Here is an online demo of the regex used.
awk 'match($0,/^[^;]*;([^;]*;)([^;]*;)(.*)$/,arr){
print arr[2] arr[1] arr[3]
}
' Input_file

If perl is accepted, it provides a join() function to join elements on a delimiter. In awk, though, you'd have to define one explicitly (which isn't complex, just a few more lines of code).
perl -F';' -nlae '$t = $F[2]; $F[2] = $F[1]; $F[1] = $t; print join(";", @F[1..$#F])' file
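For the record, here's a rough sketch of what that would look like in awk with a hand-rolled join (illustrative only, not part of the answer above):
awk -F';' '{
  t = $3; $3 = $2; $2 = t                      # swap fields 2 and 3
  out = $2                                     # start output at (rearranged) field 2
  for (i = 3; i <= NF; i++) out = out ";" $i   # join fields 2..NF on ";"
  print out
}' file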

With sed, perl, hck and rcut (my own script):
$ sed -E 's/^[^;]+;([^;]+);([^;]+)/\2;\1/' ip.txt
c;bb;D;e;fff
ii;h;jj;kk;l;m;n
# can also use: perl -F';' -lape '$_ = join ";", @F[2,1,3..$#F]' ip.txt
$ perl -F';' -lane 'print join ";", @F[2,1,3..$#F]' ip.txt
c;bb;D;e;fff
ii;h;jj;kk;l;m;n
# -d and -D specify input/output separators
$ hck -d';' -D';' -f3,2,4- ip.txt
c;bb;D;e;fff
ii;h;jj;kk;l;m;n
# syntax similar to cut, but output field order can be different
$ rcut -d';' -f3,2,4- ip.txt
c;bb;D;e;fff
ii;h;jj;kk;l;m;n
Note that the sed version will preserve input lines with fewer than 3 fields.
$ cat ip.txt
1;2;3
apple;fig
abc
$ sed -E 's/^[^;]+;([^;]+);([^;]+)/\2;\1/' ip.txt
3;2
apple;fig
abc
$ perl -F';' -lane 'print join ";", @F[2,1,3..$#F]' ip.txt
3;2
;fig
;
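If you want the perl version to also leave short lines untouched, like sed does, one option (a sketch, not from the original answer) is to rearrange only when at least three fields are present:
$ perl -F';' -lane '$_ = join ";", @F[2,1,3..$#F] if $#F >= 2; print' ip.txt
3;2
apple;fig
abc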

Another awk variant:
awk 'BEGIN{FS=OFS=";"} {$1=$3; $3=""; sub(/;;/, ";")} 1' file
c;bb;D;e;fff
ii;h;jj;kk;l;m;n

With GNU awk and gensub(), switching the position of the 2 capture groups:
awk '{print gensub(/^[^;]*;([^;]*);([^;]*)/, "\\2;\\1", 1)}' file
The pattern matches
^ Start of string
[^;]*; Negated character class: match any chars other than ; (possibly none) and then match ;
([^;]*);([^;]*) 2 capture groups, each capturing chars other than ;, with a literal ; matched in between
Output
c;bb;D;e;fff
ii;h;jj;kk;l;m;n

awk '{print $3, $0}' {,O}FS=\; < file | cut -d\; -f1,3,5-
This uses awk to prepend the third field (the zsh brace expansion {,O}FS=\; expands to FS=';' OFS=';'), then pipes to cut to extract the desired fields.

Here is one way to do it using only zsh:
rearrange() {
local -a lines=(${(@f)$(</dev/stdin)})
for line in $lines; do
local -a flds=(${(s.;.)line})
print $flds[3]';'$flds[2]';'${(j.;.)flds[4,-1]}
done
}
The same idea in a single line. This may not be an improvement over your awk script:
for l in ${(@f)$(<&0)}; print ${${(A)i::=${(s.;.)l}}[3]}\;$i[2]\;${(j.;.)i:3}
Some of the pieces:
$(</dev/stdin) - read from stdin using pseudo-device.
$(<&0) - another way to read from stdin.
(f) - parameter expansion flag to split by newlines.
(@) - treat the result of the split as an array (keep the elements separate).
(s.;.) - split by semicolon.
$flds[3] - expands to the third array element.
$flds[4,-1] - fourth, fifth, etc. array elements.
$i:3 - ksh-style array slice for fourth, fifth ... elements.
Mixing styles like this can be confusing, even if it is slightly shorter.
(j.;.) - join array by semicolon.
i::= - assign the result of the expansion to the variable i.
This lets us use the semicolon-split fields later.
(A)i::= - the (A) flag ensures i is an array.
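A quick way to test the function above, reusing the sample input from the question (the output should match the expected output shown there):
print -l 'a;bb;c;D;e;fff' 'gg;h;ii;jj;kk;l;m;n' | rearrange
c;bb;D;e;fff
ii;h;jj;kk;l;m;n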

Related

Extract substring from a field with single awk in AIX

I have a file file with content like:
stringa 8.0.1.2 stringx
stringb 12.01.0.0 stringx
I have to get a substring from field 2 (the first two values, joined by the dot).
I am currently doing cat file | awk '{print $2}' | awk -F. '{print $1"."$2}' and getting the expected output:
8.0
12.01
The question is how to do this with a single awk?
I have tried with match() but don't see an option for a back-reference.
Any help would be appreciated.
You can do something like this.
$ awk '{ split($2,str,"."); print str[1]"."str[2] }' file
8.0
12.01
Also, keep in mind that your cat is not needed. Simply give the file directly to awk.
With GNU grep please try following command once.
grep -oP '^\S+\s+\K[[:digit:]]+\.[[:digit:]]+' Input_file
Explanation: This uses GNU grep. The -o option prints only the matched part and -P enables PCRE. The regex matches the non-space characters at the start of the line followed by one or more spaces, then uses \K to discard that part of the match; it then matches one or more digits, a dot, and one or more further digits. If a match is found, the matched value is printed.
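For reference, running it on the sample file from the question prints:
grep -oP '^\S+\s+\K[[:digit:]]+\.[[:digit:]]+' Input_file
8.0
12.01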
I would use GNU AWK's split function as follows. Let file.txt's content be
stringa 8.0.1.2 stringx
stringb 12.01.0.0 stringx
then
awk '{split($2,arr,".");print arr[1]"."arr[2]}' file.txt
output
8.0
12.01
Explanation: split the 2nd field at each . and put the elements into the array arr.
(tested in gawk 4.2.1)
You could match digits . digits from the second column and print if there is a match:
awk 'match($2, /^[[:digit:]]+\.[[:digit:]]+/) {
print substr($2, RSTART, RLENGTH)
}
' file
Output
8.0
12.01
Also with GNU awk and gensub():
awk '{print gensub(/([[:digit:]]+[.][[:digit:]]+)(.*)/,"\\1","g",$2)}' file
8.0
12.01
gensub() provides the ability to specify components of a regexp in the replacement text using parentheses in the regexp to mark the components and then specifying \\n in the replacement text, where n is a digit from 1 to 9.
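A tiny standalone illustration of that back-reference syntax (throwaway example input, GNU awk only):
echo 'abc123' | gawk '{print gensub(/([a-z]+)([0-9]+)/, "\\2-\\1", 1)}'
123-abc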
You should perhaps not use awk at all (or any other external program, for that matter) but rely on the field-splitting capabilities of the shell and some variable expansion. For instance:
# printf "%s\n%s\n" "stringa 8.0.1.2 stringx" \
"stringb 12.01.0.0 stringx" |\
while read first second third junk ; do
printf "=%s= =%s= =%s=\n" "$first" "$second" "$third"
done
=stringa= =8.0.1.2= =stringx=
=stringb= =12.01.0.0= =stringx=
As you can see the value is captured in the variable "$second" already and you just need to further isolate the parts you want to see - the first and second part separated by a dot. You can do that either with parameter expansion:
# variable="8.0.1.2"
# echo ${variable%.*.*}
8.0
or like this:
# variable="12.01.0.0"
# echo ${variable%.${variable#*.*.}}
12.01
or you can use a further read-statement to separate the parts and then put them back together:
# variable="12.01.0.0"
# echo ${variable} | IFS=. read parta partb junk
# echo ${parta}.${partb}
12.01
So, putting all together:
# printf "%s\n%s\n" "stringa 8.0.1.2 stringx" \
"stringb 12.01.0.0 stringx" |\
while read first second third junk ; do
printf "%s\n" "$second" | IFS=. read parta partb junk
printf "%s.%s\n" "$parta" "$partb"
done
8.0
12.01

Replace character sequence with several characters using AWK gsub()

I'm trying to convert text by replacing runs of several identical letters (more than 3 in a row) with two * characters.
My input:
ffffOOOOuuuurrrr
fffffiiiiivvvvveeeee
What should I get:
**OOOO****
********
My test command is:
awk '{gsub(/[a-z]{4}/,"*"); print}' textfile
I don't understand how to transform {4} into 'more than 3'.
Also, how do I print * two times (like multiplying it)?
I'm also fairly sure that a plain 'more than three' condition ([a-z]{4,} with ** as the replacement) would merge adjacent runs and convert the input into:
**OOOO**
**
Is there any way to avoid this (i.e. to only replace a sequence of identical letters)? Or is it not possible to fit into one small command?
POSIX awk doesn't support back-references in regular expressions, so you will need sed or perl here. With GNU sed (for -E):
sed -E 's/([a-z])\1{3,}/**/g' file
**OOOO****
********
or using perl:
perl -pe 's/([a-z])\1{3,}/**/g' file
RegEx Details:
([a-z]): Match [a-z] and capture in group #1
\1: Back-reference of the letter captured in group #1
{3,}: Repeat 3 or more times
You mentioned sed as an option in the tags:
echo "fffffiiiiivvvvveeeee" | sed 's/\([A-Za-z]\)\1\1\1\+/\1/g'
five
echo "fffffiiiiivvveeeee" | sed 's/\([A-Za-z]\)\1\1\1\+/\1/g'
fivvve
Here's how to do it with any awk assuming a locale where lower case letters are a-z = ASCII 97-122:
$ cat tst.awk
{
for (i=97; i<=122; i++) {
gsub(sprintf("%c{3,}",i),"**")
}
print
}
$ awk -f tst.awk file
**OOOO****
********
otherwise with GNU awk for the ord() function:
$ cat tst.awk
#load "ordchr"
{
for (i=ord("a"); i<=ord("z"); i++) {
gsub(sprintf("%c{3,}",i),"**")
}
print
}
$ awk -f tst.awk file
**OOOO****
********
Or you can use a different numeric loop range, or split("abc...z",...), or whatever else to build the loop, but the point is: you need to loop over each character.
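For example, one way to spell that loop without relying on ASCII codes is to walk a literal alphabet string with substr() (just a sketch; like the scripts above it still needs an awk that supports {3,} intervals):
$ cat tst.awk
{
    abc = "abcdefghijklmnopqrstuvwxyz"
    for (i = 1; i <= length(abc); i++) {
        c = substr(abc, i, 1)        # current letter
        gsub(c "{3,}", "**")         # squash runs of 3+ of that letter
    }
    print
}
$ awk -f tst.awk file
**OOOO****
********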

Combine grep -f and awk

I am using two commands:
awk '{ print $2 }' SomeFile.txt > Pattern.txt
grep -f Pattern.txt File.txt
With the first command I create a list of desirable patterns. With the second command I extract all lines in File.txt that match the lines in the Pattern.txt
My question is, is there a way to combine awk and grep in a pipeline so that I don't have to generate the intermediate Pattern.txt file?
Thanks!
You can do this all in one invocation of awk:
awk 'NR==FNR{a[$2];next}{for(i in a)if($0~i)print}' Somefile.txt File.txt
Populate keys in the array a from the second column of the first file. NR==FNR identifies the first file (total record number is equal to this file's record number). next skips the second block for the first file.
In the second block, loop through all the keys in the array and if the line matches any of them, print it. To avoid printing the line more than once if it matches more than one pattern, you could add a next here too, i.e. {for(i in a)if($0~i){print;next}}.
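Spelled out, that variant of the command is:
awk 'NR==FNR{a[$2];next} {for(i in a)if($0~i){print;next}}' Somefile.txt File.txt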
If the "patterns" are actually fixed strings, it is even simpler:
awk 'NR==FNR{a[$2];next}$0 in a' Somefile.txt File.txt
If your shell supports it, you can use process substitution:
grep -f <(awk '{ print $2 }' SomeFile.txt) File.txt
bash and zsh support that; other shells probably do too, but I haven't tested them.
Simpler than the above, and supported by all shells, would be to use a pipe:
awk '{ print $2 }' SomeFile.txt | grep -f - File.txt
Here - is used as the argument to -f; it has a special meaning and stands for stdin. Thanks to Tom Fenech for mentioning that!

AWK get specific pattern

I have lines like this:
Volume.Free_IBM_LUN59_28D: 2072083693568
I would like to get only IBM_LUN59_28D from this line using awk.
Thanks
You can use sub to do substitutions on each input line, as per the following transcript:
pax> echo 'Volume.Free_IBM_LUN59_28D: 2072083693568' | awk '
...> {
...> sub (".*Free_", "");
...> sub (":.*", "");
...> print
...> }'
IBM_LUN59_28D
That command crosses multiple lines for readability but, if you're operating on a file and not too concerned about readability, you can just use the compressed version:
awk '{sub(".*Free_","");sub(":.*","");print}' inputFile
If you're amenable to non-awk solutions, you could also use sed:
sed -e 's/.*Free_//' -e 's/:.*//' inputFile
Note that both those solutions rely on your (somewhat sparse) test data. If your definition of "like" includes preceding textual segments other than Free_ or subsequent characters other than :, some more work may be needed.
For example, if you wanted the string between the first _ and the first :, you could use:
awk '{sub("[^_]*_","");sub(":.*","");print}'
With sed:
sed 's/[^_]*_\(.*\):.*/\1/'
Search for a sequence of non-_ characters followed by _ (this matches Volume.Free_), then another sequence of characters (this matches IBM_LUN59_28D, which we capture for later use), followed by : and any remaining characters. Substitute the whole line with the saved group (\1). That's it.
Sample:
$ echo "Volume.Free_IBM_LUN59_28D: 2072083693568" | sed 's/[^_]*_\(.*\):.*/\1/'
IBM_LUN59_28D
Here is one awk
awk -F"Free_" 'NF>1{split($2,a,":");print a[1]}'
Example:
echo "Volume.Free_IBM_LUN59_28D: 2072083693568" | awk -F"Free_" 'NF>1{split($2,a,":");print a[1]}'
IBM_LUN59_28D
It splits the line on Free_.
If the line then has more than one field (NF>1), then:
split the second field on : and print the first part, a[1].
With awk:
echo "$val" | awk -F: '{print $1}' | awk -F. '{print $2}' | awk '{print substr($0,6)}'
where the given string is in $val.

awk to transpose lines of a text file

A .csv file that has lines like this:
20111205 010016287,1.236220,1.236440
It needs to read like this:
20111205 01:00:16.287,1.236220,1.236440
How do I do this in awk? Experimenting, I got this far. I need to do it in two passes I think. One sub to read the date&time field, and the next to change it.
awk -F, '{print;x=$1;sub(/.*=/,"",$1);}' data.csv
Use this awk command:
echo "20111205 010016287,1.236220,1.236440" | \
awk -F[\ \,] '{printf "%s %s:%s:%s.%s,%s,%s\n", \
$1,substr($2,1,2),substr($2,3,2),substr($2,5,2),substr($2,7,3),$3,$4}'
Explanation:
-F[\ \,]: sets the delimiter to space and ,
printf "%s %s:%s:%s.%s,%s,%s\n": format the output
substr($2,1,2), ...: cuts the second field ($2) into the desired pieces
Or use this sed command:
echo "20111205 010016287,1.236220,1.236440" | \
sed 's/\([0-9]\{8\}\) \([0-9]\{2\}\)\([0-9]\{2\}\)\([0-9]\{2\}\)\([0-9]\{3\}\)/\1 \2:\3:\4.\5/g'
Explanation:
[0-9]\{8\}: first match an 8-digit pattern and save it as \1
[0-9]\{2\}...: after a space, match a 2-digit pattern 3 times and save the pieces as \2, \3 and \4
[0-9]\{3\}: finally match a 3-digit pattern and save it as \5
\1 \2:\3:\4.\5: format the output
sed is better suited to this job since it's a simple substitution on single lines:
$ sed -r 's/( ..)(..)(..)/\1:\2:\3./' file
20111205 01:00:16.287,1.236220,1.236440
but if you prefer here's GNU awk with gensub():
$ awk '{print gensub(/( ..)(..)(..)/,"\\1:\\2:\\3.","")}' file
20111205 01:00:16.287,1.236220,1.236440