Remove word from a comma separated values of specific field - awk

The NIS group file has format
group1:*:100:bat,cat,zat,ratt
group2:*:200:rat,cat,bat
group3:*:300:rat
With : as delimiter, need to remove exact word (for example rat) from 4th column. Any leading or trailing , to the word should be deleted as well to preserve comma separated values format in 4th column
Expected output:
group1:*:100:bat,cat,zat,ratt
group2:*:200:cat,bat
group3:*:300:

You'd better use awk for this job. Try this (GNU awk):
awk 'BEGIN {OFS=FS=":"} {gsub (/\yrat,?\y|\y,?rat\y/, "", $4)}1' file
Using : as field separator, gsub removes all rat in 4th field. \y is used for word boundaries so that rat will match but not rrat.

If perl solution is okay:
Modified sample input to add more relevant cases..
$ cat ip.txt
group1:*:100:bat,cat,zat,ratt
group2:*:200:rat,cat,bat
group3:*:300:rat
group4:*:400:mat,rat,sat
group5:*:500:pat,rat
$ perl -F: -lane '(#a) = split/,/,$F[3]; $F[3] = join ",", grep { $_ ne "rat" } #a; print join ":", #F' ip.txt
group1:*:100:bat,cat,zat,ratt
group2:*:200:cat,bat
group3:*:300:
group4:*:400:mat,sat
group5:*:500:pat
-F: split input line on : and save to #F array
(#a) = split/,/,$F[3] split 4th column on , and save to #a array
$F[3] = join ",", grep { $_ ne "rat" } #a remove elements in #a array exactly matching rat, join those elements with , and modify 4th field of input line
print join ":", #F print the modified #F array elements joined by :
Golfing to avoid the temp array #a
$ perl -F: -lane '$F[3] = join ",", grep { $_ ne "rat" } split/,/,$F[3]; print join ":", #F' ip.txt
Using regex on 4th column:
$ perl -F: -lane '$F[3] =~ s/,rat\b|\brat(,|\b)//g; print join ":", #F' ip.txt
group1:*:100:bat,cat,zat,ratt
group2:*:200:cat,bat
group3:*:300:
group4:*:400:mat,sat
group5:*:500:pat

This might work for you (GNU sed):
sed -r 's/\brat\b,?//g' file
Remove one or more words rat followed by a possible ,.

awk 'NR>1{sub(/rat,*/,"")}1' file
group1:*:100:bat,cat,zat,ratt
group2:*:200:cat,bat
group3:*:300:

Related

Add a comma to every column value in a table [unix]

I have a file produced from a program that is filled with values as such :
1 [4:space] 2 [4:space] 3 [4:space] ... N
There is 4 space between each values, I want to remove the 3 spaces and place commas after each values to get the final results :
1, 2, 3, ..., N
I found out from other topics that this command can remove the 3 spaces :
awk -F' +' -v OFS='\t' '{sub(/ +$/,""); $1=$1}1' file
I need to add commas then, or maybe is there a way to removes the space and add commas at the same time.
To replace all space with comma and a space, use:
$ awk '{gsub(/ +/,", ")}1' file
1, 2, 3, ..., N
To replace exactly three spaces with a comma, use:
$ awk '{gsub(/ {3}/,",")}1' file
Using field delimiters for it:
$ awk -F" " -v OFS=", " '{$1=$1}1' file
Using GNU sed to modify file in place:
sed -i -e 's/ /, /g' file
And with the brackets:
sed -i -e 's/ /, /g;s/^/[/;s/$/]/' file
how about hands-free driving with awk?
{m,n,g}awk NF=NF RS='\r?\n' OFS=', '
It'll handle both CRLF from windows and unix LF, trim both ends,and place ", " in between each field

gawk - Delimit lines with custom character and no similar ending character

Let's say I have a file like so:
test.txt
one
two
three
I'd like to get the following output: one|two|three
And am currently using this command: gawk -v ORS='|' '{ print $0 }' test.txt
Which gives: one|two|three|
How can I print it so that the last | isn't there?
Here's one way to do it:
$ seq 1 | awk -v ORS= 'NR>1{print "|"} 1; END{print "\n"}'
1
$ seq 3 | awk -v ORS= 'NR>1{print "|"} 1; END{print "\n"}'
1|2|3
With paste:
$ seq 1 | paste -sd'|'
1
$ seq 3 | paste -sd'|'
1|2|3
Convert one column to one row with field separator:
awk '{$1=$1} 1' FS='\n' OFS='|' RS='' file
Or in another notation:
awk -v FS='\n' -v OFS='|' -v RS='' '{$1=$1} 1' file
Output:
one|two|three
See: 8 Powerful Awk Built-in Variables – FS, OFS, RS, ORS, NR, NF, FILENAME, FNR
awk solutions work great. Here is tr + sed solution:
tr '\n' '|' < file | sed 's/\|$//'
1|2|3
just flatten it :
gawk/mawk 'BEGIN { FS = ORS; RS = "^[\n]*$"; OFS = "|"
} NF && ( $NF ? NF=NF : —-NF )'
ascii | = octal \174 = hex 0x7C. The reason for —-NF is that more often than not, the input includes a trailing new line, which makes field count 1 too many and result in
1|2|3|
Both NF=NF and --NF are similar concepts to $1=$1. Empty inputs, regardless of whether trailing new lines exist or not, would result in nothing printed.
At the OFS spot, you can delimit it with any string combo you like instead of being constrained by tr, which has inconsistent behavior. For instance :
gtr '\012' '高' # UTF8 高 = \351\253\230 = xE9 xAB x98
on bsd-tr, \n will get replaced by the unicode properly 1高2高3高 , but if you're on gnu-tr, it would only keep the leading byte of the unicode, and result in
1 \351 2 \351 . . .
For unicode equiv-classes, bsd-tr works as expected while gtr '[=高=]' '\v' results in
gtr: ?\230: equivalence class operand must be a single character
and if u attempt equiv-classes with an arbitrary non-ASCII byte, bsd-tr does nothing while gnu-tr would gladly oblige, even if it means slicing straight through UTF8-compliant characters :
g3bn 77138 | (g)tr '[=\224=]' '\v'
bsd-tr : 77138=Koyote 코요태 KYT✜ 高耀太
gnu-tr : 77138=Koyote ?
?
태 KYT✜ 高耀太
I would do it following way, using GNU AWK, let test.txt content be
one
two
three
then
awk '{printf NR==1?"%s":"|%s", $0}' test.txt
output
one|two|three
Explanation: If it is first line print that line content sans trailing newline, otherwise | followed by line content sans trailing newline. Note that I assumed that test.txt has not trailing newline, if this is not case test this solution before applying it.
(tested in gawk 5.0.1)
Also you can try this with awk:
awk '{ORS = (NR%3 ? "|" : RS)} 1' file
one|two|three
% is the modulo operator and NR%3 ? "|" : RS is a ternary expression.
See Ed Morton's explanation here: https://stackoverflow.com/a/55998710/14259465
With a GNU sed, you can pass -z option to match line breaks, and thus all you need is replace each newline but the last one at the end of string:
sed -z 's/\n\(.\)/|\1/g' test.txt
perl -0pe 's/\n(?!\z)/|/g' test.txt
perl -pe 's/\n/|/g if !eof' test.txt
See the online demo.
Details:
s - substitution command
\n\(.\) - an LF char followed with any one char captured into Group 1 (so \n at the end of string won't get matched)
|\1 - a | char and the captured char
g - all occurrences.
The first perl command matches any LF char (\n) not at the end of string ((?!\z)) after slurping the whole file into a single string input (again, to make \n visible to the regex engine).
The second perl command replaces an LF char at the end of each line except the one at the end of file (eof).
To make the changes inline add -i option (mind this is a GNU sed example):
sed -i -z 's/\n\(.\)/|\1/g' test.txt
perl -i -0pe 's/\n(?!\z)/|/g' test.txt
perl -i -pe 's/\n/|/g if !eof' test.txt

Replacing with Awk and preserving the FS to OFS

I have a file with input text below (this is not the original file and just example of input text ) and I want to replace all the 2 letter string to numeric 100 . In this file FS can be :,| or " " (space) , I have no other choice but to treat all three of them as FS, and I want to preserve these field separators at the original position (as in input file) in the output
A:B C|D
AA:C EE G
BB|FF XX1 H
DD:MM:YY K
I have tried
awk -F"[:| ]" '{gsub(/[A-Z]{2}/,"100");print}'
but this does not seem to work , please suggest.
Desired output:
A:B C|D
100:C 1000 G
100|100 1001 H
100:100:100 K
There is no functionality in POSIX awk to retain the strings that match the string defined by RS (POSIX) or regexp defined by FS. Since in POSIX RS is just a string there's no need for such functionality and to do it for every FS matching string would be unnecessarily inefficient given it's rarely needed.
With GNU awk where RS can be a regexp, not just a string, you can retain the string that matched the regexp RS with RT but there is no functionality that retains the values that match FS for the same efficiency reason that POSIX doesn't do it. Instead in GNU awk they added a 4th arg to split() so you can retain the strings that match FS in an array yourself if you want it (seps[] below):
$ awk -v FS='[:| ]' '{
split($0,flds,FS,seps)
gsub(/[A-Z]{2}/,"100")
for (i=1;i<=NF;i++) {
printf "%s%s", $i, seps[i]
}
print ""
}' file
A:B C|D
100:C 100 G
100|100 1001 H
100:100:100 K
Look up split() in the GNU awk manual for more info.
in this case
sed 's/[A-Z]\{2\}/100/g' YourFile
awk '{gsub(/[A-Z]{2}/, "100"); print}' YourFile
no need of field separation in this case, change all group of upper letter by "100", unless you specify other constraint than in OP (like other element in the string, you than need to specify what is possible and idealy, add a sample of expected result to be univoq)
Now you certainly have lot more thing around, so this code will certainly failed by changing thing like ABC:DEF with 100C:100F that is certainly not expected
in this case
awk -F '[[:blank:]:|]+' '
{
split( $0, aS, /[^[:blank:]:|]+/)
for( i=1;i<=NF;i++){
if( $i ~ /^[A-Z][A-Z]$/) $i = "100"
printf( "%s%s", $i, aS[i+1])
}
printf( "\n" )
} ' YourFile
Give this sed one-liner a try:
kent$ sed -r 's/(^|[:| ])[A-Z][A-Z]([:| ]|$)/\1100\2/g' file
A:B C|D
100:C 100 G
100|FF XX1 H
100:MM:100 K
Note:
this will search and replace pattern: exact two [A-Z] between two delimiters. If this is not what you want exactly, paste the desired output.
Your code seems to work just fine with my Gnu awk:
A:B C|D
100:C 100 G # even the typo in this record got fixed.
100|100 1001 H
100:100:100 K
I'd say the problem is that the regex /[A-Z]{2}/ should be written /[A-Z][A-Z]/.

Convert single column into three comma separated columns using awk

I have a single long column and want to reformat it into three comma separated columns, as indicated below, using awk or any Unix tool.
Input:
Xaa
Ybb
Mdd
Tmmn
UUnx
THM
THSS
THEY
DDe
Output:
Xaa,Ybb,Mdd
Tmmn,UUnx,THM
THSS,THEY,DDe
$ awk '{printf "%s%s",$0,NR%3?",":"\n";}' file
Xaa,Ybb,Mdd
Tmmn,UUnx,THM
THSS,THEY,DDe
How it works
For every line of input, this prints the line followed by, depending on the line number, either a comma or a newline.
The key part is this ternary statement:
NR%3?",":"\n"
This takes the line number modulo 3. If that is non-zero, then it returns a comma. If it is zero, it returns a newline character.
Handling files that end before the final line is complete
The assumes that the number of lines in the file is an integer multiple of three. If it isn't, then we probably want to assure that the last line has a newline. This can be done, as Jonathan Leffler suggests, using:
awk '{printf "%s%s",$0,NR%3?",":"\n";} END { if (NR%3 != 0) print ""}' file
If the final line is short of three columns, the above code will leave a trailing comma on the line. This may or may not be a problem. If we do not want the final comma, then use:
awk 'NR==1{printf "%s",$0; next} {printf "%s%s",(NR-1)%3?",":"\n",$0;} END {print ""}' file
Jonathan Leffler offers this slightly simpler alternative to achieve the same goal:
awk '{ printf("%s%s", pad, $1); pad = (NR%3 == 0) ? "\n" : "," } END { print "" }'
Improved portability
To support platforms which don't use \n as the line terminator, Ed Morton suggests:
awk -v OFS=, '{ printf("%s%s", pad, $1); pad = (NR%3?OFS:ORS)} END { print "" }' file
There is a tool for this. Use pr
pr -3ats,
3 columns width, across, suppress header, comma as separator.
xargs -n3 < file | awk -v OFS="," '{$1=$1} 1'
xargs uses echo as default action, $1=$1 forces rebuild of $0.
Using only awk I would go with this (which is similar to what proposed by #jonathan-leffler and #John1024)
{
sep = NR == 1 ? "" : \
(NR-1)%3 ? "," : \
"\n"
printf sep $0
}
END {
printf "\n"
}

How to split a delimited string into an array in awk?

How to split the string when it contains pipe symbols | in it.
I want to split them to be in array.
I tried
echo "12:23:11" | awk '{split($0,a,":"); print a[3] a[2] a[1]}'
Which works fine. If my string is like "12|23|11" then how do I split them into an array?
Have you tried:
echo "12|23|11" | awk '{split($0,a,"|"); print a[3],a[2],a[1]}'
To split a string to an array in awk we use the function split():
awk '{split($0, array, ":")}'
# \/ \___/ \_/
# | | |
# string | delimiter
# |
# array to store the pieces
If no separator is given, it uses the FS, which defaults to the space:
$ awk '{split($0, array); print array[2]}' <<< "a:b c:d e"
c:d
We can give a separator, for example ::
$ awk '{split($0, array, ":"); print array[2]}' <<< "a:b c:d e"
b c
Which is equivalent to setting it through the FS:
$ awk -F: '{split($0, array); print array[2]}' <<< "a:b c:d e"
b c
In GNU Awk you can also provide the separator as a regexp:
$ awk '{split($0, array, ":*"); print array[2]}' <<< "a:::b c::d e
#note multiple :
b c
And even see what the delimiter was on every step by using its fourth parameter:
$ awk '{split($0, array, ":*", sep); print array[2]; print sep[1]}' <<< "a:::b c::d e"
b c
:::
Let's quote the man page of GNU awk:
split(string, array [, fieldsep [, seps ] ])
Divide string into pieces separated by fieldsep and store the pieces in array and the separator strings in the seps array. The first piece is stored in array[1], the second piece in array[2], and so forth. The string value of the third argument, fieldsep, is a regexp describing where to split string (much as FS can be a regexp describing where to split input records). If fieldsep is omitted, the value of FS is used. split() returns the number of elements created. seps is a gawk extension, with seps[i] being the separator string between array[i] and array[i+1]. If fieldsep is a single space, then any leading whitespace goes into seps[0] and any trailing whitespace goes into seps[n], where n is the return value of split() (i.e., the number of elements in array).
Please be more specific! What do you mean by "it doesn't work"?
Post the exact output (or error message), your OS and awk version:
% awk -F\| '{
for (i = 0; ++i <= NF;)
print i, $i
}' <<<'12|23|11'
1 12
2 23
3 11
Or, using split:
% awk '{
n = split($0, t, "|")
for (i = 0; ++i <= n;)
print i, t[i]
}' <<<'12|23|11'
1 12
2 23
3 11
Edit: on Solaris you'll need to use the POSIX awk (/usr/xpg4/bin/awk) in order to process 4000 fields correctly.
I do not like the echo "..." | awk ... solution as it calls unnecessary fork and execsystem calls.
I prefer a Dimitre's solution with a little twist
awk -F\| '{print $3 $2 $1}' <<<'12|23|11'
Or a bit shorter version:
awk -F\| '$0=$3 $2 $1' <<<'12|23|11'
In this case the output record put together which is a true condition, so it gets printed.
In this specific case the stdin redirection can be spared with setting an awk internal variable:
awk -v T='12|23|11' 'BEGIN{split(T,a,"|");print a[3] a[2] a[1]}'
I used ksh quite a while, but in bash this could be managed by internal string manipulation. In the first case the original string is split by internal terminator. In the second case it is assumed that the string always contains digit pairs separated by a one character separator.
T='12|23|11';echo -n ${T##*|};T=${T%|*};echo ${T#*|}${T%|*}
T='12|23|11';echo ${T:6}${T:3:2}${T:0:2}
The result in all cases is
112312
Actually awk has a feature called 'Input Field Separator Variable' link. This is how to use it. It's not really an array, but it uses the internal $ variables. For splitting a simple string it is easier.
echo "12|23|11" | awk 'BEGIN {FS="|";} { print $1, $2, $3 }'
I know this is kind of old question, but I thought maybe someone like my trick. Especially since this solution not limited to a specific number of items.
# Convert to an array
_ITEMS=($(echo "12|23|11" | tr '|' '\n'))
# Output array items
for _ITEM in "${_ITEMS[#]}"; do
echo "Item: ${_ITEM}"
done
The output will be:
Item: 12
Item: 23
Item: 11
Joke? :)
How about echo "12|23|11" | awk '{split($0,a,"|"); print a[3] a[2] a[1]}'
This is my output:
p2> echo "12|23|11" | awk '{split($0,a,"|"); print a[3] a[2] a[1]}'
112312
so I guess it's working after all..
echo "12|23|11" | awk '{split($0,a,"|"); print a[3] a[2] a[1]}'
should work.
echo "12|23|11" | awk '{split($0,a,"|"); print a[3] a[2] a[1]}'
code
awk -F"|" '{split($0,a); print a[1],a[2],a[3]}' <<< '12|23|11'
output
12 23 11
The challenge: parse and store split strings with spaces and insert them into variables.
Solution: best and simple choice for you would be convert the strings list into array and then parse it into variables with indexes. Here's an example how you can convert and access the array.
Example: parse disk space statistics on each line:
sudo df -k | awk 'NR>1' | while read -r line; do
#convert into array:
array=($line)
#variables:
filesystem="${array[0]}"
size="${array[1]}"
capacity="${array[4]}"
mountpoint="${array[5]}"
echo "filesystem:$filesystem|size:$size|capacity:$capacity|mountpoint:$mountpoint"
done
#output:
filesystem:/dev/dsk/c0t0d0s1|size:4000|usage:40%|mountpoint:/
filesystem:/dev/dsk/c0t0d0s2|size:5000|usage:50%|mountpoint:/usr
filesystem:/proc|size:0|usage:0%|mountpoint:/proc
filesystem:mnttab|size:0|usage:0%|mountpoint:/etc/mnttab
filesystem:fd|size:1000|usage:10%|mountpoint:/dev/fd
filesystem:swap|size:9000|usage:9%|mountpoint:/var/run
filesystem:swap|size:1500|usage:15%|mountpoint:/tmp
filesystem:/dev/dsk/c0t0d0s3|size:8000|usage:80%|mountpoint:/export
awk -F'['|'] -v '{print $1"\t"$2"\t"$3}' file <<<'12|23|11'