automatic text formatting with printf (basing on maxlength of each column) - awk

the input data have n columns delimeted with "|" like on example below:
121|234234|5345|2342342342432423
1|2342|2|2342
234|23|343|34214222
how to find max length of each column and use it later in printf formatting of the input which will work even when input data are changed in a future?
in command like:
awk -F'|' '..... { printf "%-longestincol1s %-longestincol2s %-longestincol3s %-s\n", $1, $2, $3, $4 }' ....

Input:
$ cat infile
121|234234|5345|2342342342432423
1|2342|2|2342
234|23|343|34214222
Output:
With printf("%*s%s"
awk 'BEGIN{FS=OFS="|"}FNR==NR{for(i=1; i<=NF;i++)wd[i]=wd[i]>length($i)?wd[i]:length($i);next}{for(i=1; i<=NF; i++)printf("%*s%s",wd[i],$i,(i<NF?OFS:ORS))}' infile infile
121|234234|5345|2342342342432423
1| 2342| 2| 2342
234| 23| 343| 34214222
With printf("%-*s%s"
awk 'BEGIN{FS=OFS="|"}FNR==NR{for(i=1; i<=NF;i++)wd[i]=wd[i]>length($i)?wd[i]:length($i);next}{for(i=1; i<=NF; i++)printf("%-*s%s",wd[i],$i,(i<NF?OFS:ORS))}' infile infile
121|234234|5345|2342342342432423
1 |2342 |2 |2342
234|23 |343 |34214222
Better Readable:
awk '
BEGIN{
FS=OFS="|"
}
FNR==NR{
for(i=1; i<=NF;i++)
wd[i]=wd[i]>length($i)?wd[i]:length($i);
next
}
{
for(i=1; i<=NF; i++)
printf("%*s%s",wd[i],$i,(i<NF?OFS:ORS));
}
' infile infile
Explanation
Like C/C++ Specifies how much space to allocate for the string
* The width is not specified in the format string, but as an
additional integer value argument preceding the argument that has to
be formatted.
printf("%*s",5,"")
is same as
printf("%5s", "");

Column
if you just want to have a pretty-printed output, you can use column e.g.
column -t -s'|' -o'|' file
However, it is not exactly fit your printf format. It does left-alignment.
Awk
If you want to do it with awk, you can apply at least two approaches:
process the file once
You let awk go through the input once, calc the max widths during looping, save all lines in memory. At the end, END{...} you loop through the array to print.
process the file twice
First you let awk go through the file, just calculate max-width, and save into variables. In the second go, do format printing.
I didn't give working codes, but I hope I answered clearly. Also it would be helpful when you next time face the similar problem.

Related

assigning a var inside AWK for use outside awk

I am using ksh on AIX.
I have a file with multiple comma delimited fields. The value of each field is read into a variable inside the script.
The last field in the file may contain multiple | delimited values. I need to test each value and keep the first one that doesn't begin with R, then stop testing the values.
sample value of $principal_diagnosis0
R65.20|A41.9|G30.9|F02.80
I've tried:
echo $principal_diagnosis0 | awk -F"|" '{for (i = 1; i<=NF; i++) {if ($i !~ "R"){echo $i; primdiag = $i}}}'
but I get this message : awk: Field $i is not correct.
My goal is to have a variable that I can use outside of the awk statement that gets assigned the first non-R code (in this case it would be A41.9).
echo $principal_diagnosis0 | awk -F"|" '{for (i = 1; i<=NF; i++) {if ($i !~ "R"){print $i}}}'
gets me the output of :
A41.9
G30.9
F02.80
So I know it's reading the values and evaluating properly. But I need to stop after the first match and be able to use that value outside of awk.
Thanks!
To answer your specific question:
$ principal_diagnosis0='R65.20|A41.9|G30.9|F02.80'
$ foo=$(echo "$principal_diagnosis0" | awk -v RS='|' '/^[^R]/{sub(/\n/,""); print; exit}')
$ echo "$foo"
A41.9
The above will work with any awk, you can do it more briefly with GNU awk if you have it:
foo=$(echo "$principal_diagnosis0" | awk -v RS='[|\n]' '/^[^R]/{print; exit}')
you can make FS and OFS do all the hard work :
echo "${principal_diagnosis0}" |
mawk NF=NF FS='^(R[^|]+[|])+|[|].+$' OFS=
A41.9
——————————————————————————————————————————
another slightly different variation of the same concept — overwriting fields but leaving OFS as is :
gawk -F'^.*R[^|]+[|]|[|].+$' '$--NF=$--NF'
A41.9
this works, because when you break it out :
gawk -F'^.*R[^|]+[|]|[|].+$' '
{ print NF
} $(_=--NF)=$(__=--NF) { print _, __, NF, $0 }'
3
1 2 1 A41.9
you'll notice you start with NF = 3, and the two subsequent decrements make it equivalent to $1 = $2,
but since final NF is now reduced to just 1, it would print it out correctly instead of 2 copies of it
…… which means you can also make it $0 = $2, as such :
gawk -F'^.*R[^|]+[|]|[|].+$' '$-_=$-—NF'
A41.9
——————————————————————————————————————————
a 3rd variation, this time using RS instead of FS :
mawk NR==2 RS='^.*R[^|]+[|]|[|].+$'
A41.9
——————————————————————————————————————————
and if you REALLY don't wanna mess with FS/OFS/RS, use gsub() instead :
nawk 'gsub("^.*R[^|]+[|]|[|].+$",_)'
A41.9

awk match pattern and convert number to different unit

I have a csv file that contains this kind of values:
vm47,8,32794384Ki,16257320Ki
vm47,8,30223304245,15223080Ki
vm48,8,32794384Ki,16257312Ki
vm48,8,30223304245,15223072Ki
vm49,8,32794384Ki,16257320Ki
vm49,8,30223304245,15223080Ki
The columns 3 and 4 are memoy values expressed either in bytes, or kibibytes. The problem is that the "Ki" string appears randomly through the CSV file, particularly in column3, it's inconsistent.
So to make the file consistent, I need to convert everything in bytes. So basically, any value matching a trailing "Ki" needs to have its numeric value multiplied by 1024, and then replace the corresponding XXXXXKi match.
The reason why I want to do it with awk is because I am already using awk to generate that csv format, but I am happy to do it with sed too.
This is my code so far but obviously it's wrong as it's multiplying any value in columns 3 and 4 by 1024 even though it does not match "Ki". I am not sure at this point how to ask awk "if you see Ki at the end, then multiply by 1024".
kubectl describe node --context=$context| sed -E '/Name:|cpu:|ephemeral-storage:|memory:/!d' | sed 's/\s//g' | awk '
BEGIN {FS = ":"; OFS = ","}
{record[$1] = $2}
$1 == "memory" {print record["Name"], record["cpu"], record["ephemeral-storage"], record["memory"]}
' | awk -F, '{print $1,$2,$3,$3*1024,$4,$4*1024}' >> describe_nodes.csv
Edit: I made a mistake, you need to multiply by 128 to convert KiB in bytes, not 1024.
"if you see Ki at the end, then multiply by 1024
You may use:
awk 'BEGIN{FS=OFS=","} $3 ~ /Ki$/ {$3 *= 1024} $4 ~ /Ki$/ {$4 *= 1024} 1' file
vm47,8,33581449216,16647495680
vm47,8,30223304245,15588433920
vm48,8,33581449216,16647487488
vm48,8,30223304245,15588425728
vm49,8,33581449216,16647495680
vm49,8,30223304245,15588433920
Or a bit shorter:
awk 'BEGIN{FS=OFS=","} {
for (i=3; i<=4; ++i) $i ~ /Ki$/ && $i *= 1024} 1' file
With your shown samples/attempts, please try following awk code. Simple explanation would be, traverse through fields from 3rd field onwards and look for if a value has Ki(ignore cased manner) then multiply it with 128, print all edited/non-edited lines at last.
awk 'BEGIN{FS=OFS=","} {for(i=3;i<=NF;i++){if($i~/[Kk][Ii]$/){$i *= 128}}} 1' Input_file
You could try numfmt:
$ numfmt -d, --field 3,4 --from=auto --to=none <<EOF
vm47,8,32794384Ki,16257320Ki
vm47,8,30223304245,15223080Ki
EOF
vm47,8,33581449216,16647495680
vm47,8,30223304245,15588433920

Awk Field number of matched pattern

I was wondering if there's a built in command in awk to get the field number of the phrase that you just matched.
Banana is yellow.
awk {
/yellow/{ for (i=1;i<=NF;i++) if($i ~/yellow/) print $i}'
Is there a way to avoid writing the loop?
Your command doesn't work when I test it. Here's my version:
echo "banana is yellow" | awk '{for (i=1;i<=NF;i++) if($i ~/yellow/) print i}'
The output is :
3
As far as I know, there's no such built-in feature, to improve your command, the pattern match /yellow/ at the beginning is not necessary, and also $i will print the matching field other than the field number that you need.
Alternatively, you can use an array to store each field and its corresponding index number, and then print field by arr["yellow"]
If the input string is a oneline string you can set the record delimiter to the field delimiter. Doing so you can use NR to print the position:
awk 'BEGIN{RS=FS}/yellow/{print NR}' <<< 'banana is yellow'
3

awk: split a column of delimited text in a row into lines

I have a file with five columns and the second column has delimited text. I want to split that delimited text dedup it and print into lines. I'm able to do it with the commands below. I want to make a awk script. Can anyone help me.
awk -F"\t" 'NR>1{print $2}' <input file> | awk -F\| '{for (i = 0; ++i <= NF;) print $i}' | awk '!x[$0]++'
Input file:
test hello|good|this|will|be 23421 test 4543
test2 good|would|may|can 43234 test2 3421
Output:
hello
good
this
will
be
would
may
can
You could use this single awk one-liner:
$ awk '{split($2,a,"|");for(i in a)if(!seen[a[i]]++)print a[i]}' file
will
be
hello
good
this
can
would
may
The second field is split into the array a on the | character. Each element of a is printed if it isn't already in seen, which will only be true on the first occurrence.
Note that the order of the keys is undefined.
To preserve the order, you can use this:
$ awk '{n=split($2,a,"|");for(i=1;i<=n;++i)if(!seen[a[i]]++)print a[i]}' file
split returns the number of elements in the array a, which you can use to loop through them in the order they appeared.
I wrote exactly Tom's answer before I saw it. If you want to maintain the order of the words as they are seen, it's a little more work:
awk '
{
n = split($2, a, "|")
for (i=1; i<=n; i++)
if (!(a[i] in seen)) {
# the hash to store the unique keys
seen[a[i]] = 1
# the array to store the keys in order
words[++count] = a[i]
}
}
END {for (i=1; i<=count; i++) print words[i]}
' file
hello
good
this
will
be
would
may
can
Here is how I would have done it:
awk '{n=split($2,a,"|");for (i=1;i<=n;i++) print a[i]}' file
hello
good
this
will
be
good
would
may
can
Or this way (this may change the order of the outdata, but for some reason I am not sure about, it works fine here):
awk '{split($2,a,"|");for(i in a) print a[i]}' file
hello
good
this
will
be
good
would
may
can
Or if you do not like duplicate output:
awk '{split($2,a,"|");for(i in a) if (!f[a[i]]++) print a[i]}' file
hello
good
this
will
be
would
may
can

Print every nth column of a file

I have a rather big file with 255 coma separated columns and I need to print out every third column only.
I was trying something like this
awk '{ for (i=0;i<=NF;i+=3) print $i }' file
but that doesn't seem to be the solution, since it prints to only one long column. Anybody can help? Thanks
Here is one way to do this.
The script prog.awk:
BEGIN {FS = ","} # field separator
{for (i = 1; i <= NF; i += 3) printf ("%s%c", $i, i + 3 <= NF ? "," : "\n");}
Invocation:
awk -f prog.awk <input.csv >output.csv
Example input.csv:
1,2,3,4,5,6,7,8,9,10
11,12,13,14,15,16,17,18,19,20
Example output.csv:
1,4,7,10
11,14,17,20
It behaves like that because by default awk splits fields in spaces. You have to tell it to split them with commas, and it's done using the FS variable or the -F switch. Besides that, first field is number one. The zero is the whole line, so also change the initial value of the for loop:
awk -F',' '{ for (i=1;i<=NF;i+=3) print $i }' file