How to replace multiple empty fields with zeroes using awk

I am using the following command to replace tab delimited empty fields with zeroes.
awk 'BEGIN { FS = OFS = "\t" } { for(i=1; i<=NF; i++) if($i ~ /^ *$/) $i = 0 }; 1'
How can I do the same if my input is not tab-delimited and has multiple empty fields?
input
name A1348138 A1086070 A1080879 A1070208 A821846 A1068905 A1101931
g1 5 8 1 2 1 3 1
g2 1 3 2 1 1 2
desired output
name A1348138 A1086070 A1080879 A1070208 A821846 A1068905 A1101931
g1 5 8 1 2 1 3 1
g2 1 3 2 1 1 2 0

I'd suggest using GNU awk for FIELDWIDTHS to solve the problem you appear to be asking about and also to convert your fixed-width input to tab-separated output (or something else sensible) while you're at it:
$ cat file
1 2 3
4 6
$ gawk -v FIELDWIDTHS='4 4 4' -v OFS='\t' '{for (i=1;i<=NF;i++) {gsub(/^[[:space:]]+|[[:space:]]+$/,"",$i); $i=($i==""?0:$i)}; print}' file
1 2 3
4 0 6
$ gawk -v FIELDWIDTHS='4 4 4' -v OFS=',' '{for (i=1;i<=NF;i++) {gsub(/^[[:space:]]+|[[:space:]]+$/,"",$i); $i=($i==""?0:$i)}; print}' file
1,2,3
4,0,6
$ gawk -v FIELDWIDTHS='4 4 4' -v OFS=',' '{for (i=1;i<=NF;i++) {gsub(/^[[:space:]]+|[[:space:]]+$/,"",$i); $i="\""($i==""?0:$i)"\""}; print}' file
"1","2","3"
"4","0","6"
Take your pick of the above.
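FIELDWIDTHS is a GNU awk extension. Where only a POSIX awk is available, the same idea can be approximated with substr(). The following is a minimal sketch, assuming every column is exactly w characters wide; the function name fill_empty and its parameters w and n are made up for this illustration.

```shell
# Hypothetical helper: cut each line into n columns of w characters,
# trim each cell, and print 0 for cells that end up empty.
fill_empty() {
  awk -v w="$1" -v n="$2" -v OFS='\t' '{
    out = ""
    for (i = 1; i <= n; i++) {
      cell = substr($0, (i - 1) * w + 1, w)      # i-th fixed-width slice
      gsub(/^[ \t]+|[ \t]+$/, "", cell)          # trim padding
      out = (i == 1 ? "" : out OFS) (cell == "" ? 0 : cell)
    }
    print out
  }'
}

# Two sample rows with 4-character columns; the middle cell of row 2 is blank.
{ printf '%-4s%-4s%s\n' 1 2 3; printf '%-4s%-4s%s\n' 4 '' 6; } | fill_empty 4 3
```

The second row comes out as 4, 0 and 6 separated by tabs; change OFS to get comma-separated output instead.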

Related

awk cannot select a column with an empty value

I am trying to select a column, including its missing values.
Here is my input file, separated by tabs:
1 2 3
4 5
6
7 8
9
I am trying to select the first column, so that the output looks like
1
4
7
and the column should still have 5 rows in this case, with the missing values left empty.
I have tried
awk '$1!=""{print $1}' ./demo.txt
but it returns
1
4
6
7
9
Can anybody help with this? I am new to awk.
You can use cut:
$ cut -f 1 file # the default delimiter is a tab
Or with sed:
$ sed 's/[[:blank:]].*$//' file
Or awk:
$ awk '{sub(/[[:blank:]].*$/,"")}1' file
Or:
$ awk 'BEGIN{FS=OFS="\t"} {print $1}' file
All of those print the first column for all five lines (blank or not):
1
4
7
Tell awk to use a tab (\t) as the input field delimiter (-F):
$ awk -F'\t' '{ print $1 }' demo.txt
1
4
7
If you want to print multiple columns, maintaining the same delimiter for output, another approach using the FS and OFS variables:
$ awk 'BEGIN { FS=OFS="\t" } { print $1,$3 }' demo.txt
1 3
4 5
7
9
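The reason the original attempt loses rows is that, with the default field separator, awk splits on runs of blanks and skips leading ones, so an empty first column disappears and $1 becomes whatever comes next. A small sketch of the difference (the sample function and the bracket markers are only there for illustration):

```shell
# Sample input: the second line has empty first and second columns.
sample() { printf '1\t2\t3\n\t\t6\n'; }

# Default FS: leading tabs are skipped, so $1 of line 2 is "6".
sample | awk '{ print "[" $1 "]" }'

# FS set to a single tab: every tab separates fields, so $1 of line 2 is empty.
sample | awk -F'\t' '{ print "[" $1 "]" }'
```

The first command prints [1] and [6]; the second prints [1] and [].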
With sed something like:
sed 's/^\([^[:blank:]]*\).*/\1/' demo.txt
Using FIELDWIDTHS in gnu-awk you can do this for fixed width separated data:
awk 'BEGIN {FIELDWIDTHS = "4 4 *"} {print $1}' file
1
4
7
For demo purpose:
awk 'BEGIN {FIELDWIDTHS = "4 4 *"} {print NR ":", $1}' file
1: 1
2: 4
3:
4: 7
5:
If the values in the 1st column are all single digits:
echo \
'1 2 3
4 5
6
7 8
9' |
mawk NF=1 FS= |
gcat -n
1 1
2 4
3
4 7
5
That's literally all you need. To play it safe, use one of:
nawk NF=1 FS='[[:space:]]' # overly-verbose so-called
# "proper" posix form
gawk NF=1 FS='[ \t]' # suffices unless the input
# happens to have uncommon bytes
# like \013 \v or \014 \f
or a very fringe way of fudging NF :
mawk 'NF ^= FS="[ \t]"'

How to get the filenumber that is being processing by an awk script?

Suppose I have 2 or more files being processed by an awk script.
$ cat file1
a
b
c
$ cat file2
d
e
How do I get the number of the file being processed? Is there a built-in awk variable for that?
I want a script that behaves like the one below. What could I use as my
SOMEVARIABLE?
$ awk '{print FILENAME, NR, FNR, SOMEVARIABLE, $0}' file1 file2
file1 1 1 1 a
file1 2 2 1 b
file1 3 3 1 c
file2 4 1 2 d
file2 5 2 2 e
EDIT: Since the OP needs the output in a specific format and does NOT want just a count of the files, here is a solution that also accounts for empty files (written and tested in GNU awk; ENDFILE is a GNU extension).
awk '
FNR==1{
FNUM++
}
{
print FILENAME, NR, FNR, FNUM, $0
}
ENDFILE{
if(FNUM==prev){
FNUM++
print FILENAME, 0, 0, FNUM, "Empty file"
}
prev=FNUM
}' file1 file2
Output for a non-empty file1 and an empty file2 is as follows.
file1 1 1 1 a
file1 2 2 1 b
file1 3 3 1 c
file2 0 0 2 Empty file
Solutions for when you want to know the total number of files processed by the awk command:
1st solution: Using GNU awk (assuming you don't want to count empty files here):
awk 'NF{count++;nextfile} END{print count}' Input_file1 Input_file2
2nd solution: If you only want to know the number of files passed to the awk command, try the following.
awk 'END {print ARGC-1}' Input_file1 Input_file2
Explanation of the commands above, with examples: say Input_file1 has contents and Input_file2 is an empty file, as follows:
cat Input_file1
a
b
c
> Input_file2
Now when we run the ARGC command, we get 2 files as output.
awk 'END {print ARGC-1}' Input_file1 Input_file2
2
When I run my 1st command, it gives 1 file, since it does not count the empty file.
awk 'NF{count++;nextfile} END{print count}' Input_file1 Input_file2
1
Well... I managed to do it as following:
$ awk 'BEGIN{FNUM=0} FNR==1{FNUM++} {print FILENAME, NR, FNR, FNUM, $0}' file1 file2
file1 1 1 1 a
file1 2 2 1 b
file1 3 3 1 c
file2 4 1 2 d
file2 5 2 2 e
I guess there is no built-in variable to help with that, so I created the variable FNUM (for file number). If there is a solution with a built-in variable, please give me a better answer.

How to use awk and grep combination

I have a file with 10 columns and lots of lines. I want to add a fixed correction to the 10th column on lines that contain the 'G01' pattern.
For example, in the file below
AS G17 2014 3 31 0 2 0.000000 1 -0.809159910000E-04
AS G12 2014 3 31 0 2 0.000000 1 0.195515363000E-03
AS G15 2014 3 31 0 2 0.000000 1 -0.171167837000E-03
AS G29 2014 3 31 0 2 0.000000 1 0.521982134000E-03
AS G07 2014 3 31 0 2 0.000000 1 0.329889640000E-03
AS G05 2014 3 31 0 2 0.000000 1 -0.381588767000E-03
AS G25 2014 3 31 0 2 0.000000 1 0.203352860000E-04
AS G01 2014 3 31 0 2 0.000000 1 0.650180300000E-05
AS G24 2014 3 31 0 2 0.000000 1 -0.258444780000E-04
AS G27 2014 3 31 0 2 0.000000 1 -0.203691700000E-04
the 10th column of the line with G01 should be corrected.
I've used 'awk' inside a 'while' loop to do that, but it takes a very long time for massive files.
I would appreciate any help finding a more efficient way.
You can use the following :
awk '$2 == "G01" {$10="value"}1' file.txt
To preserve whitespace you can use the solution from this post (GNU awk only, since it relies on the four-argument split()):
awk '$2 == "G01" {
data=1
n=split($0,a," ",b)
a[10]="value"
line=b[0]
for (i=1;i<=n; i++){
line=(line a[i] b[i])
}
print line
}{
if (data!=1){
print;
}
else {
data=0;
}
}' file.txt
the 10th column of the line with G01 should be corrected
The syntax is as follows; it searches for the regex given inside /../ in the current record/line/row, regardless of which field the regex is found in.
Either
$ awk '/regex/{ $10 = "somevalue"; print }' infile
OR
1 at the end does default operation print $0, that is print current record/line/row
$ awk '/regex/{ $10 = "somevalue" }1' infile
OR
$0 means current record/line/row
$ awk '$0 ~ /regex/{ $10 = "somevalue"}1' infile
So in current context, it will be any of the following
$ awk '/G01/{$10 = "somevalue" ; print }' infile
$ awk '/G01/{$10 = "somevalue" }1' infile
$ awk '$0 ~ /G01/{$10 = "somevalue"; print }' infile
$ awk '$0 ~ /G01/{$10 = "somevalue" }1' infile
If you would like to restrict your search to a specific field/column in the record/line/row, then
$10 means 10th field/column
$ awk '$2 == "G01" {$10 = "somevalue"; print }' infile
$ awk '$2 == "G01" {$10 = "somevalue" }1' infile
If you would like to pass a word to awk, either literally or from a shell variable, then
$ awk -v search="G01" -v replace="foo" '$2 == search {$10 = replace }1' infile
and then the same from the shell
$ search_value="G01"
$ new_value="foo"
$ awk -v search="$search_value" -v replace="$new_value" '$2 == search {$10 = replace }1' infile
From man
-v var=val
--assign var=val
Assign the value val to the variable var, before execution of
the program begins. Such variable values are available to the
BEGIN block of an AWK program.
For additional syntax instructions:
"sed & awk" by Dale Dougherty and Arnold Robbins
(O'Reilly)
"UNIX Text Processing," by Dale Dougherty and Tim O'Reilly (Hayden
Books)
"GAWK: Effective awk Programming," by Arnold D. Robbins
(O'Reilly)
http://www.gnu.org/software/gawk/manual/
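The answers above assign a literal replacement. If the correction is a number to be added to the existing value, something like the following may work. It is only a sketch: the function name add_correction, the variables sat and corr, and the %.12E output format are assumptions made for illustration. Note that %.12E prints the mantissa with one digit before the decimal point rather than the 0.ddd style of the input, and that assigning to $10 rebuilds the line with single spaces between fields.

```shell
# Hypothetical helper: add a numeric correction to column 10 on lines
# whose column 2 equals the given identifier.
add_correction() {
  # $1 = value to match in column 2, $2 = number added to column 10
  awk -v sat="$1" -v corr="$2" '
    $2 == sat { $10 = sprintf("%.12E", $10 + corr) }  # only matching lines change
    1                                                  # print every line
  '
}

printf 'AS G01 2014 3 31 0 2 0.000000 1 0.650180300000E-05\n' | add_correction G01 1e-06
```

Because the whole file is handled in a single awk pass, no shell while loop is needed.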

Join two columns from different files with awk

I want to join two columns from two different files using awk. These files look like (A, B, C, 0, 1, 2, etc are columns)
file1:
A B C D E F
file2:
0 1 2 3 4 5
And I want to be able to select arbitrary columns in my output.
I.e., I want the output to be:
A C E 4 5
I've seen a million answers with the following awk code (and very similar ones), offering no explanation. But none of them address the exact problem I want to solve:
awk 'FNR==NR{a[FNR]=$2;next};{$NF=a[FNR]};1' file2 file1
awk '
NR==FNR {A[$1,$3,$6] = $0; next}
($1 SUBSEP $2 SUBSEP $3) in A {print A[$1,$2,$3], $4}
' A.txt B.txt
But none of them seem to do what I want and I am not able to understand them.
So, how can I achieve the desired output using awk? (and please, offer an explanation, I want to actually learn)
Note:
I know I can do this using something like
paste <(awk '{print $1}' file1) <(awk '{print $2}' file2)
As I said, I'm trying to learn and understand awk.
With GNU awk for true multi-dimensional arrays and ARGIND:
$ awk -v flds='1 1 1 3 1 5 2 5 2 6' '
BEGIN{ nf = split(flds,o) }
{ f[ARGIND][1]; split($0,f[ARGIND]) }
NR!=FNR { for (i=2; i<=nf; i+=2) printf "%s%s", f[o[i-1]][o[i]], (i<nf?OFS:ORS) }
' file1 file2
A C E 4 5
The "flds" string is just a series of <file number> <field number in that file> pairs so you can print the fields from each file in whatever order you like, e.g.:
$ awk -v flds='1 1 2 2 1 3 2 4 1 5 2 6' 'BEGIN{nf=split(flds,o)} {f[ARGIND][1]; split($0,f[ARGIND])} NR!=FNR{for (i=2; i<=nf; i+=2) printf "%s%s",f[o[i-1]][o[i]], (i<nf?OFS:ORS)}' file1 file2
A 1 C 3 E 5
$ awk -v flds='2 1 1 2 2 3 1 4 2 5' 'BEGIN{nf=split(flds,o)} {f[ARGIND][1]; split($0,f[ARGIND])} NR!=FNR{for (i=2; i<=nf; i+=2) printf "%s%s",f[o[i-1]][o[i]], (i<nf?OFS:ORS)}' file1 file2
0 B 2 D 4
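For awks without ARGIND or arrays of arrays, the NR==FNR idiom quoted in the question can be adapted. NR counts records across all input files while FNR restarts at 1 for each file, so NR==FNR is true only while the first file is being read. A minimal sketch, assuming both files have the same number of lines and the first file is not empty; the function name join_cols and the hard-coded column choices are illustrative only.

```shell
# Hypothetical helper: columns 1, 3, 5 of the first file followed by
# columns 5 and 6 of the second file, matched line by line.
join_cols() {
  # $1 = first file, $2 = second file
  awk '
    NR == FNR { a[FNR] = $1 OFS $3 OFS $5; next }  # first file: remember by line number
    { print a[FNR], $5, $6 }                       # second file: append its columns
  ' "$1" "$2"
}

# Usage: join_cols file1 file2
```

The next statement skips the second rule while the first file is read, which is why the print only runs for lines of the second file.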

Print all but the first three columns

Too cumbersome:
awk '{print " "$4" "$5" "$6" "$7" "$8" "$9" "$10" "$11" "$12" "$13}' things
awk '{for(i=1;i<4;i++) $i="";print}' file
use cut
$ cut -f4-13 file
or if you insist on awk and $13 is the last field
$ awk '{$1=$2=$3="";print}' file
else
$ awk '{for(i=4;i<=13;i++)printf "%s ",$i;printf "\n"}' file
A solution that does not add extra leading or trailing whitespace:
awk '{ for(i=4; i<NF; i++) printf "%s",$i OFS; if(NF) printf "%s",$NF; printf ORS}'
### Example ###
$ echo '1 2 3 4 5 6 7' |
awk '{for(i=4;i<NF;i++)printf"%s",$i OFS;if(NF)printf"%s",$NF;printf ORS}' |
tr ' ' '-'
4-5-6-7
Sudo_O proposes an elegant improvement using the ternary operator i==NF?ORS:OFS
$ echo '1 2 3 4 5 6 7' |
awk '{ for(i=4; i<=NF; i++) printf "%s",$i (i==NF?ORS:OFS) }' |
tr ' ' '-'
4-5-6-7
EdMorton gives a solution preserving original whitespaces between fields:
$ echo '1 2 3 4   5    6 7' |
awk '{ sub(/([^ ]+ +){3}/,"") }1' |
tr ' ' '-'
4---5----6-7
BinaryZebra also provides two awesome solutions:
(these solutions even preserve trailing spaces from original string)
$ echo -e ' 1 2\t \t3 4   5   6 7 \t 8\t ' |
awk -v n=3 '{ for ( i=1; i<=n; i++) { sub("^["FS"]*[^"FS"]+["FS"]+","",$0);} } 1 ' |
sed 's/ /./g;s/\t/->/g;s/^/"/;s/$/"/'
"4...5...6.7.->.8->."
$ echo -e ' 1 2\t \t3 4   5   6 7 \t 8\t ' |
awk -v n=3 '{ print gensub("["FS"]*([^"FS"]+["FS"]+){"n"}","",1); }' |
sed 's/ /./g;s/\t/->/g;s/^/"/;s/$/"/'
"4...5...6.7.->.8->."
The solution given by larsr in the comments is almost correct:
$ echo '1 2 3 4 5 6 7' |
awk '{for (i=3;i<=NF;i++) $(i-2)=$i; NF=NF-2; print $0}' | tr ' ' '-'
3-4-5-6-7
This is the fixed and parametrized version of larsr's solution:
$ echo '1 2 3 4 5 6 7' |
awk '{for(i=n;i<=NF;i++)$(i-(n-1))=$i;NF=NF-(n-1);print $0}' n=4 | tr ' ' '-'
4-5-6-7
All other answers before Sep-2013 are nice but add extra spaces:
Example of answer adding extra leading spaces:
$ echo '1 2 3 4 5 6 7' |
awk '{$1=$2=$3=""}1' |
tr ' ' '-'
---4-5-6-7
Example of answer adding extra trailing space
$ echo '1 2 3 4 5 6 7' |
awk '{for(i=4;i<=13;i++)printf "%s ",$i;printf "\n"}' |
tr ' ' '-'
4-5-6-7-------
Try this:
awk '{ $1=""; $2=""; $3=""; print $0 }'
The correct way to do this is with an RE interval because it lets you simply state how many fields to skip, and retains inter-field spacing for the remaining fields.
e.g. to skip the first 3 fields without affecting spacing between remaining fields given the format of input we seem to be discussing in this question is simply:
$ echo '1 2 3 4 5 6' |
awk '{sub(/([^ ]+ +){3}/,"")}1'
4 5 6
If you want to accommodate leading spaces and non-blank spaces, but again with the default FS, then it's:
$ echo ' 1 2 3 4 5 6' |
awk '{sub(/[[:space:]]*([^[:space:]]+[[:space:]]+){3}/,"")}1'
4 5 6
If you have an FS that's an RE you can't negate in a character set, you can convert it to a single char first (RS is ideal if it's a single char since an RS CANNOT appear within a field, otherwise consider SUBSEP), then apply the RE interval substitution, then convert to the OFS. e.g. if chains of "."s separated the fields:
$ echo '1...2.3.4...5....6' |
awk -F'[.]+' '{gsub(FS,RS);sub("([^"RS"]+["RS"]+){3}","");gsub(RS,OFS)}1'
4 5 6
Obviously if OFS is a single char AND it can't appear in the input fields you can reduce that to:
$ echo '1...2.3.4...5....6' |
awk -F'[.]+' '{gsub(FS,OFS); sub("([^"OFS"]+["OFS"]+){3}","")}1'
4 5 6
Then you have the same issue as with all the loop-based solutions that reassign the fields - the FSs are converted to OFSs. If that's an issue, you need to look into GNU awk's patsplit() function.
Pretty much all the answers currently add either leading spaces, trailing spaces or some other separator issue. To select from the fourth field where the separator is whitespace and the output separator is a single space using awk would be:
awk '{for(i=4;i<=NF;i++)printf "%s",$i (i==NF?ORS:OFS)}' file
To parametrize the starting field you could do:
awk '{for(i=n;i<=NF;i++)printf "%s",$i (i==NF?ORS:OFS)}' n=4 file
And also the ending field:
awk '{for(i=n;i<=m=(m>NF?NF:m);i++)printf "%s",$i (i==m?ORS:OFS)}' n=4 m=10 file
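As written, the range version above assigns to m inside the loop condition, so a short line would appear to lower m for all later lines as well. A sketch that clamps into a separate variable instead (print_fields, n, m and last are names chosen for this example):

```shell
# Hypothetical helper: print fields n..m of every line, joined by OFS,
# without leading or trailing separators.
print_fields() {
  awk -v n="$1" -v m="$2" '{
    last = (m > NF ? NF : m)          # clamp per line, leave m untouched
    line = ""
    for (i = n; i <= last; i++)
      line = line (i > n ? OFS : "") $i
    print line
  }'
}

echo '1 2 3 4 5 6 7' | print_fields 4 10
```

For the sample line this prints 4 5 6 7. Lines with fewer than n fields produce an empty line.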
awk '{$1=$2=$3="";$0=$0;$1=$1}1'
Input
1 2 3 4 5 6 7
Output
4 5 6 7
echo 1 2 3 4 5| awk '{ for (i=3; i<=NF; i++) print $i }'
Another way to avoid using the print statement:
$ awk '{$1=$2=$3=""}sub("^"FS"*","")' file
In awk when a condition is true print is the default action.
I can't believe nobody offered plain shell:
while read -r a b c d; do echo "$d"; done < file
Options 1 to 3 have issues with multiple whitespace (but are simple).
That is the reason to develop options 4 and 5, which process multiple white spaces with no problem.
Of course, if options 4 or 5 are used with n=0 both will preserve any leading whitespace as n=0 means no splitting.
Option 1
A simple cut solution (works with single delimiters):
$ echo '1 2 3 4 5 6 7 8' | cut -d' ' -f4-
4 5 6 7 8
Option 2
Forcing awk to re-evaluate the record sometimes solves the problem of added leading spaces (works with some versions of awk):
$ echo '1 2 3 4 5 6 7 8' | awk '{ $1=$2=$3="";$0=$0;} NF=NF'
4 5 6 7 8
Option 3
Printing each field formatted with printf will give more control:
$ echo ' 1 2 3 4 5 6 7 8 ' |
awk -v n=3 '{ for (i=n+1; i<=NF; i++){printf("%s%s",$i,i==NF?RS:OFS);} }'
4 5 6 7 8
However, all previous answers change all FS between fields to OFS. Let's build a couple of solutions to that.
Option 4
A loop with sub to remove fields and delimiters is more portable, and doesn't trigger a change of FS to OFS:
$ echo ' 1 2 3 4 5 6 7 8 ' |
awk -v n=3 '{ for(i=1;i<=n;i++) { sub("^["FS"]*[^"FS"]+["FS"]+","",$0);} } 1 '
4 5 6 7 8
NOTE: The "^["FS"]*" is to accept an input with leading spaces.
Option 5
It is quite possible to build a solution that does not add extra leading or trailing whitespace, and preserves existing whitespace, using the function gensub from GNU awk, like this:
$ echo ' 1 2 3 4 5 6 7 8 ' |
awk -v n=3 '{ print gensub("["FS"]*([^"FS"]+["FS"]+){"n"}","",1); }'
4 5 6 7 8
It also may be used to swap a field list given a count n:
$ echo ' 1 2 3 4 5 6 7 8 ' |
awk -v n=3 '{ a=gensub("["FS"]*([^"FS"]+["FS"]+){"n"}","",1);
b=gensub("^(.*)("a")","\\1",1);
print "|"a"|","!"b"!";
}'
|4 5 6 7 8 | ! 1 2 3 !
Of course, in such case, the OFS is used to separate both parts of the line, and the trailing white space of the fields is still printed.
Note1: ["FS"]* is used to allow leading spaces in the input line.
Cut has a --complement flag that makes it easy (and fast) to delete columns. The resulting syntax is analogous with what you want to do -- making the solution easier to read/understand. Complement also works for the case where you would like to delete non-contiguous columns.
$ foo='1 2 3 %s 5 6 7'
$ echo "$foo" | cut --complement -d' ' -f1-3
%s 5 6 7
$
Perl solution which does not add leading or trailing whitespace:
perl -lane 'splice @F,0,3; print join " ",@F' file
The perl @F autosplit array starts at index 0 while awk fields start with $1
Perl solution for comma-delimited data:
perl -F, -lane 'splice @F,0,3; print join ",",@F' file
Python solution:
python -c "import sys;[sys.stdout.write(' '.join(line.split()[3:]) + '\n') for line in sys.stdin]" < file
For me the most compact and compliant solution to the request is
$ a='1 2\t \t3 4 5 6 7 \t 8\t ';
$ echo -e "$a" | awk -v n=3 '{while (i<n) {i++; sub($1 FS"*", "")}; print $0}'
And if you have more lines to process as for instance file foo.txt, don't forget to reset i to 0:
$ awk -v n=3 '{i=0; while (i<n) {i++; sub($1 FS"*", "")}; print $0}' foo.txt
Thanks your forum.
As I was annoyed by the first highly upvoted but wrong answer I found enough to write a reply there, and here the wrong answers are marked as such, here is my bit. I do not like proposed solutions as I can see no reason to make answer so complex.
I have a log where, after $5 with an IP address, there can be more text or no text. I need everything from the IP address to the end of the line, should there be anything after $5. In my case, this is actually within an awk program, not an awk one-liner, so awk must solve the problem. When I try to remove the first 4 fields using the old, nice-looking and most upvoted but completely wrong answer:
echo " 7 27.10.16. Thu 11:57:18 37.244.182.218 one two three" | awk '{$1=$2=$3=$4=""; printf "[%s]\n", $0}'
it spits out wrong and useless response (I added [] to demonstrate):
[ 37.244.182.218 one two three]
Instead, if columns are fixed width until the cut point and awk is needed, the correct and quite simple answer is:
echo " 7 27.10.16. Thu 11:57:18 37.244.182.218 one two three" | awk '{printf "[%s]\n", substr($0,28)}'
which produces the desired output:
[37.244.182.218 one two three]
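When the leading columns are not fixed-width, counting characters with substr() no longer works. One possible alternative, sketched here under the assumption that fields are separated by spaces or tabs, is to delete one leading field per sub() call, which leaves the spacing of the remainder untouched (strip_fields and n are hypothetical names):

```shell
# Hypothetical helper: remove the first n whitespace-separated fields,
# keeping the rest of the line exactly as it was.
strip_fields() {
  awk -v n="$1" '{
    for (i = 1; i <= n; i++)
      sub(/^[ \t]*[^ \t]+[ \t]*/, "")   # drop one leading field and its separator
    printf "[%s]\n", $0
  }'
}

echo "  7 27.10.16. Thu 11:57:18   37.244.182.218 one two three" | strip_fields 4
```

Since no field is ever assigned, awk does not rebuild the record, so no leading blanks are introduced.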
I've found this other possibility, maybe it could be useful also...
awk 'BEGIN {OFS=ORS="\t" }; {for(i=1; i<14; i++) print $i " "; print $NF "\n" }' your_file
Note: 1. For tabular data and from column $1 to $14
Use cut:
cut -d <delimiter character> -f <number of first column>-<number of last column> <file name>
e.g.: If you have file1 containing : car.is.nice.equal.bmw
Run : cut -d . -f1-3 file1 will print car.is.nice
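In the -f list of cut, a comma selects individual fields while a hyphen selects a range, and an open-ended range such as 4- runs to the last field. A short sketch using the sample line from this answer:

```shell
line='car.is.nice.equal.bmw'

printf '%s\n' "$line" | cut -d . -f1,3   # fields 1 and 3 only: car.nice
printf '%s\n' "$line" | cut -d . -f1-3   # fields 1 through 3:  car.is.nice
printf '%s\n' "$line" | cut -d . -f4-    # field 4 to the end:  equal.bmw
```

The last form is the one that answers "all but the first three columns" when the delimiter is a single character.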
This isn't very far from some of the previous answers, but does solve a couple of issues:
cols.sh:
#!/bin/bash
awk -v s="$1" '{for(i=s; i<=NF;i++) printf "%-5s", $i; print "" }'
Which you can now call with an argument that will be the starting column:
$ echo "1 2 3 4 5 6 7 8 9 10 11 12 13 14" | ./cols.sh 3
3 4 5 6 7 8 9 10 11 12 13 14
Or:
$ echo "1 2 3 4 5 6 7 8 9 10 11 12 13 14" | ./cols.sh 7
7 8 9 10 11 12 13 14
This is 1-indexed; if you prefer zero indexed, use i=s + 1 instead.
Moreover, if you would like to have two arguments for the starting index and end index, change the file to:
#!/bin/bash
awk -v s="$1" -v e="$2" '{for(i=s; i<=e;i++) printf "%-5s", $i; print "" }'
For example:
$ echo "1 2 3 4 5 6 7 8 9 10 11 12 13 14" | ./cols.sh 7 9
7 8 9
The %-5s aligns the result as 5-character-wide columns; if this isn't enough, increase the number, or use "%s " (with a trailing space) instead if you don't care about alignment.
AWK printf-based solution that avoids the % problem, and is unique in that it prints nothing (not even a newline) if there are fewer than 4 columns:
awk 'NF > 3 { for(i=4; i<NF; i++) printf("%s ", $(i)); print $(i) }'
Testing:
$ x='1 2 3 %s 4 5 6'
$ echo "$x" | awk 'NF > 3 { for(i=4; i<NF; i++) printf("%s ", $(i)); print $(i) }'
%s 4 5 6
$ x='1 2 3'
$ echo "$x" | awk 'NF > 3 { for(i=4; i<NF; i++) printf("%s ", $(i)); print $(i) }'
$ x='1 2 3 '
$ echo "$x" | awk 'NF > 3 { for(i=4; i<NF; i++) printf("%s ", $(i)); print $(i) }'
$