awk: split a column of delimited text in a row into lines - awk

I have a file with five columns, and the second column contains delimited text. I want to split that delimited text, deduplicate it, and print each value on its own line. I'm able to do it with the commands below, but I want to turn them into a single awk script. Can anyone help me?
awk -F"\t" 'NR>1{print $2}' <input file> | awk -F\| '{for (i = 0; ++i <= NF;) print $i}' | awk '!x[$0]++'
Input file:
test hello|good|this|will|be 23421 test 4543
test2 good|would|may|can 43234 test2 3421
Output:
hello
good
this
will
be
would
may
can

You could use this single awk one-liner:
$ awk '{split($2,a,"|");for(i in a)if(!seen[a[i]]++)print a[i]}' file
will
be
hello
good
this
can
would
may
The second field is split into the array a on the | character. Each element of a is printed if it isn't already in seen, which will only be true on the first occurrence.
Note that the order of the keys is undefined.
To preserve the order, you can use this:
$ awk '{n=split($2,a,"|");for(i=1;i<=n;++i)if(!seen[a[i]]++)print a[i]}' file
split returns the number of elements in the array a, which you can use to loop through them in the order they appeared.
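Since the question asks for an awk script rather than a pipeline, here is a minimal sketch of the order-preserving version stored in a script file. The file name split_dedup.awk and the temporary directory are hypothetical. The sample input is assumed to be whitespace-separated with no header line; add -F'\t' and an NR>1 condition if your real file is tab-separated with a header, as the original pipeline suggests.

```shell
# Work in a temporary directory (hypothetical layout, for illustration only).
workdir=$(mktemp -d)

# Sample input from the question, assumed whitespace-separated.
printf '%s\n' \
  'test hello|good|this|will|be 23421 test 4543' \
  'test2 good|would|may|can 43234 test2 3421' > "$workdir/input.txt"

# split_dedup.awk is a hypothetical name for the script file.
cat > "$workdir/split_dedup.awk" <<'EOF'
{
    # split the second column on "|" and remember how many parts there are
    n = split($2, parts, "|")
    # walk the parts in their original order, printing first occurrences only
    for (i = 1; i <= n; i++)
        if (!seen[parts[i]]++)
            print parts[i]
}
EOF

result=$(awk -f "$workdir/split_dedup.awk" "$workdir/input.txt")
printf '%s\n' "$result"
rm -rf "$workdir"
```

Because seen is never reset, a word is printed only once across the whole file, not once per line.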

I wrote exactly Tom's answer before I saw it. If you want to maintain the order of the words as they are seen, it's a little more work:
awk '
{
    n = split($2, a, "|")
    for (i=1; i<=n; i++)
        if (!(a[i] in seen)) {
            # the hash to store the unique keys
            seen[a[i]] = 1
            # the array to store the keys in order
            words[++count] = a[i]
        }
}
END {for (i=1; i<=count; i++) print words[i]}
' file
hello
good
this
will
be
would
may
can

Here is how I would have done it:
awk '{n=split($2,a,"|");for (i=1;i<=n;i++) print a[i]}' file
hello
good
this
will
be
good
would
may
can
Or this way (for (i in a) visits the elements in an unspecified order, so the output order may change, although it happens to come out in order here):
awk '{split($2,a,"|");for(i in a) print a[i]}' file
hello
good
this
will
be
good
would
may
can
Or if you do not like duplicate output:
awk '{split($2,a,"|");for(i in a) if (!f[a[i]]++) print a[i]}' file
hello
good
this
will
be
would
may
can

Related

assigning a var inside AWK for use outside awk

I am using ksh on AIX.
I have a file with multiple comma delimited fields. The value of each field is read into a variable inside the script.
The last field in the file may contain multiple | delimited values. I need to test each value and keep the first one that doesn't begin with R, then stop testing the values.
sample value of $principal_diagnosis0
R65.20|A41.9|G30.9|F02.80
I've tried:
echo $principal_diagnosis0 | awk -F"|" '{for (i = 1; i<=NF; i++) {if ($i !~ "R"){echo $i; primdiag = $i}}}'
but I get this message : awk: Field $i is not correct.
My goal is to have a variable that I can use outside of the awk statement that gets assigned the first non-R code (in this case it would be A41.9).
echo $principal_diagnosis0 | awk -F"|" '{for (i = 1; i<=NF; i++) {if ($i !~ "R"){print $i}}}'
gets me the output of :
A41.9
G30.9
F02.80
So I know it's reading the values and evaluating properly. But I need to stop after the first match and be able to use that value outside of awk.
Thanks!
To answer your specific question:
$ principal_diagnosis0='R65.20|A41.9|G30.9|F02.80'
$ foo=$(echo "$principal_diagnosis0" | awk -v RS='|' '/^[^R]/{sub(/\n/,""); print; exit}')
$ echo "$foo"
A41.9
The above will work with any awk. If you have GNU awk, you can do it more briefly:
foo=$(echo "$principal_diagnosis0" | awk -v RS='[|\n]' '/^[^R]/{print; exit}')
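A variation that stays closer to the loop in the question may also be worth sketching: keep -F'|', test each field against /^R/ (anchored, so only codes that begin with R are skipped), print the first field that passes, and exit immediately. The shell variable name primdiag comes from the question; the rest is an illustrative assumption rather than the answerer's method.

```shell
principal_diagnosis0='R65.20|A41.9|G30.9|F02.80'

# Command substitution captures awk's output in a shell variable.
primdiag=$(printf '%s\n' "$principal_diagnosis0" |
  awk -F'|' '{
      for (i = 1; i <= NF; i++)
          if ($i !~ /^R/) {   # first code that does not begin with R
              print $i
              exit            # stop testing the remaining values
          }
  }')

echo "$primdiag"
```

Note that a variable assigned inside awk (such as primdiag = $i in the original attempt) is never visible to the shell; the value has to be printed and captured with $(...).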
you can make FS and OFS do all the hard work :
echo "${principal_diagnosis0}" |
mawk NF=NF FS='^(R[^|]+[|])+|[|].+$' OFS=
A41.9
——————————————————————————————————————————
another slightly different variation of the same concept — overwriting fields but leaving OFS as is :
gawk -F'^.*R[^|]+[|]|[|].+$' '$--NF=$--NF'
A41.9
this works, because when you break it out :
gawk -F'^.*R[^|]+[|]|[|].+$' '
{ print NF
} $(_=--NF)=$(__=--NF) { print _, __, NF, $0 }'
3
1 2 1 A41.9
you'll notice you start with NF = 3, and the two subsequent decrements make it equivalent to $1 = $2,
but since final NF is now reduced to just 1, it would print it out correctly instead of 2 copies of it
…… which means you can also make it $0 = $2, as such :
gawk -F'^.*R[^|]+[|]|[|].+$' '$-_=$--NF'
A41.9
——————————————————————————————————————————
a 3rd variation, this time using RS instead of FS :
mawk NR==2 RS='^.*R[^|]+[|]|[|].+$'
A41.9
——————————————————————————————————————————
and if you REALLY don't wanna mess with FS/OFS/RS, use gsub() instead :
nawk 'gsub("^.*R[^|]+[|]|[|].+$",_)'
A41.9

automatic text formatting with printf (basing on maxlength of each column)

The input data have n columns delimited with "|", as in the example below:
121|234234|5345|2342342342432423
1|2342|2|2342
234|23|343|34214222
How can I find the maximum length of each column and use it later in the printf format, so that the formatting keeps working even if the input data change in the future?
For example, in a command like:
awk -F'|' '..... { printf "%-longestincol1s %-longestincol2s %-longestincol3s %-s\n", $1, $2, $3, $4 }' ....
Input:
$ cat infile
121|234234|5345|2342342342432423
1|2342|2|2342
234|23|343|34214222
Output:
With printf("%*s%s"
awk 'BEGIN{FS=OFS="|"}FNR==NR{for(i=1; i<=NF;i++)wd[i]=wd[i]>length($i)?wd[i]:length($i);next}{for(i=1; i<=NF; i++)printf("%*s%s",wd[i],$i,(i<NF?OFS:ORS))}' infile infile
121|234234|5345|2342342342432423
  1|  2342|   2|            2342
234|    23| 343|        34214222
With printf("%-*s%s"
awk 'BEGIN{FS=OFS="|"}FNR==NR{for(i=1; i<=NF;i++)wd[i]=wd[i]>length($i)?wd[i]:length($i);next}{for(i=1; i<=NF; i++)printf("%-*s%s",wd[i],$i,(i<NF?OFS:ORS))}' infile infile
121|234234|5345|2342342342432423
1  |2342  |2   |2342            
234|23    |343 |34214222        
Better Readable:
awk '
BEGIN{
    FS=OFS="|"
}
FNR==NR{
    for(i=1; i<=NF;i++)
        wd[i]=wd[i]>length($i)?wd[i]:length($i);
    next
}
{
    for(i=1; i<=NF; i++)
        printf("%*s%s",wd[i],$i,(i<NF?OFS:ORS));
}
' infile infile
Explanation
As in C/C++, the width in a printf format specifies the minimum number of characters to print for the string.
A * means the width is not given in the format string itself, but as an additional integer argument preceding the argument that has to be formatted.
printf("%*s",5,"")
is the same as
printf("%5s", "");
Column
if you just want to have a pretty-printed output, you can use column e.g.
column -t -s'|' -o'|' file
However, it does not exactly match your printf format: it left-aligns the columns.
Awk
If you want to do it with awk, you can apply at least two approaches:
process the file once
You let awk go through the input once, calc the max widths during looping, save all lines in memory. At the end, END{...} you loop through the array to print.
process the file twice
First you let awk go through the file, just calculate max-width, and save into variables. In the second go, do format printing.
I haven't given working code, but I hope the idea is clear. It should also help the next time you face a similar problem.
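The "process the file once" approach could look roughly like the sketch below. It keeps every line in memory, tracks the widest value seen in each column, and formats everything in END. The array names line, wd and f are illustrative, the columns are separated by a single space, and the last column is left unpadded to mirror the %-s in the question. It relies on printf accepting * as a width, as in the answer above.

```shell
# Sample input from the question, written to a temporary file.
infile=$(mktemp)
printf '%s\n' \
  '121|234234|5345|2342342342432423' \
  '1|2342|2|2342' \
  '234|23|343|34214222' > "$infile"

formatted=$(awk -F'|' '
{
    line[NR] = $0                        # remember the line for later
    for (i = 1; i <= NF; i++)            # track the widest value per column
        if (length($i) > wd[i]) wd[i] = length($i)
}
END {
    for (r = 1; r <= NR; r++) {
        n = split(line[r], f, "|")
        for (i = 1; i < n; i++)
            printf "%-*s ", wd[i], f[i]  # left-align, padded to column width
        printf "%s\n", f[n]              # last column is not padded
    }
}' "$infile")

printf '%s\n' "$formatted"
rm -f "$infile"
```

The trade-off against the two-pass version is memory: the whole input is held in the line array, which is fine for small files but not for very large ones.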

Awk Field number of matched pattern

I was wondering if there's a built in command in awk to get the field number of the phrase that you just matched.
Banana is yellow.
awk {
/yellow/{ for (i=1;i<=NF;i++) if($i ~/yellow/) print $i}'
Is there a way to avoid writing the loop?
Your command doesn't work when I test it. Here's my version:
echo "banana is yellow" | awk '{for (i=1;i<=NF;i++) if($i ~/yellow/) print i}'
The output is :
3
As far as I know, there's no such built-in feature. To improve your command: the pattern match /yellow/ at the beginning is not necessary, and print $i prints the matching field rather than the field number that you need.
Alternatively, you can use an array to store each field and its corresponding index number, and then look up the field number with arr["yellow"].
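A minimal sketch of that array idea, assuming an exact match on the whole field is acceptable (so "yellow." with a trailing period would not be found). The array name pos is illustrative. Building the map still needs a loop, but afterwards any word can be looked up directly.

```shell
fieldno=$(echo "banana is yellow" | awk '{
    split("", pos)                         # clear the map for each record
    for (i = 1; i <= NF; i++)
        if (!($i in pos)) pos[$i] = i      # keep the first position of each word
    if ("yellow" in pos) print pos["yellow"]
}')

echo "$fieldno"
```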
If the input is a single-line string, you can set the record separator to the field separator. NR then gives the position of the match:
awk 'BEGIN{RS=FS}/yellow/{print NR}' <<< 'banana is yellow'
3

awk to compare two file by identifier & output in a specific format

I have two large pipe-delimited files that I need to compare.
file 1
a||d||f||a
1||2||3||4
file 2
a||d||f||a
1||1||3||4
1||2||r||f
Now I want to compare the files and print the differences: any value updated in file 2 should be printed as updated_value#old_value, and any new line added to file 2 should be printed as well.
So the desired output is: (only the updated & new data)
1||1#2||3||4
1||2||r||f
what I have tried so far is to get the separated changed values:
awk -F '[||]+' 'NR==FNR{for(i=1;i<=NF;i++)a[NR,i]=$i;next}{for(i=1;i<=NF;i++)if(a[FNR,i]!=$i)print $i"#"a[FNR,i]}' file1 file2 >output
But I want to print the whole line. How can I achieve that??
I would say:
awk 'BEGIN{FS=OFS="|"}
FNR==NR {for (i=1;i<=NF;i+=2) a[FNR,i]=$i; next}
{for (i=1; i<=NF; i+=2)
if (a[FNR,i] && a[FNR,i]!=$i)
$i=$i"#"a[FNR,i]
}1' f1 f2
This stores file1 in a matrix a[line number, column] and then compares each value with the corresponding value in file2.
Note that I am using | as the field separator instead of || and looping in steps of two, so that only the data fields are used (the even-numbered fields are the empty strings between each pair of pipes). I did this because gawk -F'||' '{print NF}' f1 printed just 1, meaning that FS was not being interpreted as I expected. I would be grateful if someone could point out the error here!
Test
$ awk 'BEGIN{FS=OFS="|"} FNR==NR {for (i=1;i<=NF;i+=2) a[FNR,i]=$i; next} {for (i=1; i<=NF; i+=2) if (a[FNR,i] && a[FNR,i]!=$i) $i=$i"#"a[FNR,i]}1' f1 f2
a||d||f||b#a
1||1#2||3||4
1||2||r||f
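On the open question about -F'||': a likely explanation is that awk treats any field separator longer than one character as a regular expression, and in a regular expression | is the alternation operator, so '||' does not mean two literal pipes. Putting each pipe in a bracket expression makes it literal. The sketch below uses that separator and, as an assumption about the desired output, prints only lines that changed or are new in file 2. The array name old and the temporary file names are illustrative.

```shell
workdir=$(mktemp -d)
printf '%s\n' 'a||d||f||a' '1||2||3||4' > "$workdir/f1"
printf '%s\n' 'a||d||f||a' '1||1||3||4' '1||2||r||f' > "$workdir/f2"

diffout=$(awk '
BEGIN { FS = "[|][|]"; OFS = "||" }      # two literal pipes as separator
FNR == NR {                              # first file: remember every value
    for (i = 1; i <= NF; i++) old[FNR, i] = $i
    next
}
{                                        # second file: compare field by field
    changed = 0
    for (i = 1; i <= NF; i++) {
        if (!((FNR, i) in old))
            changed = 1                  # no counterpart in file 1: new data
        else if (old[FNR, i] != $i) {
            $i = $i "#" old[FNR, i]      # updated_value#old_value
            changed = 1
        }
    }
    if (changed) print
}' "$workdir/f1" "$workdir/f2")

printf '%s\n' "$diffout"
rm -rf "$workdir"
```

Testing membership with (FNR, i) in old, rather than the truth value of the element, avoids skipping old values such as 0 or the empty string.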

Print every nth column of a file

I have a rather big file with 255 comma-separated columns and I need to print out every third column only.
I was trying something like this
awk '{ for (i=0;i<=NF;i+=3) print $i }' file
but that doesn't seem to be the solution, since it prints everything in one long column. Can anybody help? Thanks
Here is one way to do this.
The script prog.awk:
BEGIN {FS = ","} # field separator
{for (i = 1; i <= NF; i += 3) printf ("%s%c", $i, i + 3 <= NF ? "," : "\n");}
Invocation:
awk -f prog.awk <input.csv >output.csv
Example input.csv:
1,2,3,4,5,6,7,8,9,10
11,12,13,14,15,16,17,18,19,20
Example output.csv:
1,4,7,10
11,14,17,20
It behaves like that because by default awk splits fields on whitespace. You have to tell it to split them on commas, which is done with the FS variable or the -F switch. Besides that, the first field is number one; $0 is the whole line, so also change the initial value of the for loop:
awk -F',' '{ for (i=1;i<=NF;i+=3) print $i }' file
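Note that print still puts each selected field on its own line. To keep the selected columns of each input line together, one option is to build the output line in a variable and print it once per record. This is a sketch; the variable names n and out are illustrative, and n is passed with -v so the step is easy to change.

```shell
every_third=$(printf '%s\n' \
  '1,2,3,4,5,6,7,8,9,10' \
  '11,12,13,14,15,16,17,18,19,20' |
  awk -F',' -v n=3 '{
      out = $1                              # start with the first column
      for (i = 1 + n; i <= NF; i += n)      # then every n-th column after it
          out = out FS $i
      print out                             # one output line per input line
  }')

printf '%s\n' "$every_third"
```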