Join two columns from different files with awk - awk

I want to join two columns from two different files using awk. These files look like (A, B, C, 0, 1, 2, etc are columns)
file1:
A B C D E F
file2:
0 1 2 3 4 5
And I want to be able to select arbitrary columns for my output.
I.e., I want the output to be:
A C E 4 5
I've seen a million answers with the following awk code (and very similar ones), offering no explanation. But none of them address the exact problem I want to solve:
awk 'FNR==NR{a[FNR]=$2;next};{$NF=a[FNR]};1' file2 file1
awk '
NR==FNR {A[$1,$3,$6] = $0; next}
($1 SUBSEP $2 SUBSEP $3) in A {print A[$1,$2,$3], $4}
' A.txt B.txt
But none of them seem to do what I want and I am not able to understand them.
So, how can I achieve the desired output using awk? (and please, offer an explanation, I want to actually learn)
Note:
I know I can do this using something like
paste <(awk '{print $1}' file1) <(awk '{print $2}' file2)
As I said, I'm trying to learn and understand awk.

With GNU awk for true multi-dimensional arrays and ARGIND:
$ awk -v flds='1 1 1 3 1 5 2 5 2 6' '
BEGIN{ nf = split(flds,o) }
{ f[ARGIND][1]; split($0,f[ARGIND]) }
NR!=FNR { for (i=2; i<=nf; i+=2) printf "%s%s", f[o[i-1]][o[i]], (i<nf?OFS:ORS) }
' file1 file2
A C E 4 5
The "flds" string is just a series of <file number> <field number in that file> pairs so you can print the fields from each file in whatever order you like, e.g.:
$ awk -v flds='1 1 2 2 1 3 2 4 1 5 2 6' 'BEGIN{nf=split(flds,o)} {f[ARGIND][1]; split($0,f[ARGIND])} NR!=FNR{for (i=2; i<=nf; i+=2) printf "%s%s",f[o[i-1]][o[i]], (i<nf?OFS:ORS)}' file1 file2
A 1 C 3 E 5
$ awk -v flds='2 1 1 2 2 3 1 4 2 5' 'BEGIN{nf=split(flds,o)} {f[ARGIND][1]; split($0,f[ARGIND])} NR!=FNR{for (i=2; i<=nf; i+=2) printf "%s%s",f[o[i-1]][o[i]], (i<nf?OFS:ORS)}' file1 file2
0 B 2 D 4
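For readers without GNU awk, the same result can be sketched with the NR==FNR idiom in any POSIX awk. This is only a minimal sketch, not the method of the answer above: it assumes both files have the same number of lines and pairs them by line number. The array name saved and the temporary directory are illustrative.

```shell
# Hypothetical sample data matching the question
tmp=$(mktemp -d)
printf 'A B C D E F\n' > "$tmp/file1"
printf '0 1 2 3 4 5\n' > "$tmp/file2"

# Rule 1: NR==FNR holds only while the first file is being read, so store
#         the wanted fields of that line under its line number, then "next"
#         skips rule 2 for this line.
# Rule 2: runs for the second file only; print the stored fields followed
#         by fields 5 and 6 of the current line.
out=$(awk '
NR == FNR { saved[FNR] = $1 OFS $3 OFS $5; next }
{ print saved[FNR], $5, $6 }
' "$tmp/file1" "$tmp/file2")

printf '%s\n' "$out"    # A C E 4 5
rm -rf "$tmp"
```

Changing the field numbers in either rule selects different columns; the flds approach above is more flexible when the order has to interleave fields from both files.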

Related

awk Can not Select Column with empty value

i am trying to select a column with its missing value
here is my input file separated by tab
1 2 3
4 5
6
7 8
9
i am trying to select the first column in which output will look like
1
4
7
and the length of my column would be 5 in this case
I have tried
awk '$1!=""{print $1}' ./demo.txt
but it returns
1
4
6
7
9
can anybody help with this I am new in AWK
You can use cut:
$ cut -f 1 file # the default delimiter is a tab
Or with sed:
$ sed 's/[[:blank:]].*$//' file
Or awk:
$ awk '{sub(/[[:blank:]].*$/,"")}1' file
Or:
$ awk 'BEGIN{FS=OFS="\t"} {print $1}' file
All those print the first column and all five lines (blank or not)
Prints:
1
4
7
Tell awk to use a tab (\t) as the input field delimiter (-F):
$ awk -F'\t' '{ print $1 }' demo.txt
1
4
7
If you want to print multiple columns, maintaining the same delimiter for output, another approach using the FS and OFS variables:
$ awk 'BEGIN { FS=OFS="\t" } { print $1,$3 }' demo.txt
1 3
4 5
7
9
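To see why the field separator matters here, the following sketch compares the default splitting with tab splitting. The exact layout of the input is an assumption (lines 3 and 5 are taken to start with a tab), and the bracketed output format is only there to make empty values visible.

```shell
tmp=$(mktemp -d)
# Assumed layout: lines 3 and 5 begin with a tab, i.e. an empty first column
printf '1\t2\t3\n4\t5\n\t6\n7\t8\n\t9\n' > "$tmp/demo.txt"

# Default FS: leading whitespace is ignored, so 6 and 9 end up in $1
default_out=$(awk '{ printf "%d:[%s]\n", NR, $1 }' "$tmp/demo.txt")

# FS set to a tab: every tab separates two fields, so empty columns survive
tab_out=$(awk -F'\t' '{ printf "%d:[%s]\n", NR, $1 }' "$tmp/demo.txt")

printf 'default FS:\n%s\n\ntab FS:\n%s\n' "$default_out" "$tab_out"
rm -rf "$tmp"
```

With the default separator the third and fifth lines report 6 and 9 as the first field; with -F'\t' they report an empty first field, which is what the question asks for.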
With sed something like:
sed 's/^\([^[:blank:]]*\).*/\1/' demo.txt
Using FIELDWIDTHS in gnu-awk you can do this for fixed width separated data:
awk 'BEGIN {FIELDWIDTHS = "4 4 *"} {print $1}' file
1
4
7
For demo purpose:
awk 'BEGIN {FIELDWIDTHS = "4 4 *"} {print NR ":", $1}' file
1: 1
2: 4
3:
4: 7
5:
if they're all single digits in 1st column :
echo \
'1 2 3
4 5
6
7 8
9' |
mawk NF=1 FS= |
gcat -n
1 1
2 4
3
4 7
5
that's literally all you need. To play it safe, then do
nawk NF=1 FS='[[:space:]]' # overly-verbose so-called
# "proper" posix form
gawk NF=1 FS='[ \t]' # suffices unless the input
# happens to have uncommon bytes
# like \013 \v or \014 \f
or a very fringe way of fudging NF :
mawk 'NF ^= FS="[ \t]"'

How can I use awk to calculate sum and replace column in file

I'm new to the site and to the programming world and I hope you have time to help me.
My problem is as follows: I have a file with several columns. The 2nd column contains values. I'm trying to add a given number to each of these values and to replace the second column with a new column containing the results.
Here an example of my input:
A B C
x 1 t
y 2 u
z 3 v
I want to sum values in B column to 5 and obtain an output like the one below:
A B C
x 6 t
y 7 u
z 8 v
The code I tried unsuccessfully is
zcat my_file.vcf.gz| tail -n +49 | awk 'BEGIN{FS=OFS="\t"} {print $0, $2+5}'>my.output.vcf
Thanks in advance
We can avoid tail, since skipping to the 49th line can be handled within awk itself. You also need to add the value to the 2nd field itself and then print the whole line with the print command.
Important point: per the OP's sample, if the 2nd field contains letters then 5 must NOT be added to it, so that condition is handled here too.
zcat my_file.vcf.gz |
awk '
BEGIN{ FS=OFS="\t" }
FNR>=49{
$2=($2~/[a-zA-Z]/?$2:$2+5)
print
}
' > my.output.vcf
You can use
awk 'BEGIN{FS=OFS="\t"} {$2+=5}1'
Here, $2+=5 will add 5 to the field 2 value, and 1 will trigger the display of the record (row, line, same as print $0).
See an online awk demo:
#!/bin/bash
s='A B C
x 1 t
y 2 u
z 3 v'
awk 'BEGIN{FS=OFS="\t"} {$2+=5}1' <<< "$s"
Output:
A 5 C
x 6 t
y 7 u
z 8 v
Another form for clarity:
awk 'BEGIN{FS=OFS="\t"} {print $1, $2+5, $3}'
you can use:
awk 'BEGIN {FS=OFS="\t"} NR == 1 {print $0} NR > 1 {print $1,($2+5),$3;}'
output:
A B C
x 6 t
y 7 u
z 8 v
Maybe this can help you:
cat file | awk '{if (NR > 1 && $2 = ($2+5))
print $0;
else print $0;}'
Applied to your code:
zcat my_file.vcf.gz| tail -n +49 | awk '{if (NR > 1 && $2 = ($2+5)) print $0; else print $0;}' > my.output.vcf
cat boo
A B C
x 1 t
y 2 u
z 3 v
cat boo | awk 'BEGIN{FS=OFS="\t"} $2 ~ /^[0-9]+$/ {print $1, $2+5, $3} $2 !~ /^[0-9]+$/ {print} '
A B C
x 6 t
y 7 u
z 8 v
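Some of the commands above set FS=OFS="\t" and some do not. The sketch below, using a single hypothetical tab-separated line, shows why it matters: assigning to a field makes awk rebuild the record using OFS, which is a single space unless it is set.

```shell
# One tab-separated sample line (illustrative data)
line=$(printf 'x\t1\tt')

# Only FS is a tab: after $2 is modified, $0 is rebuilt with the default
# OFS, so the tabs in the output turn into spaces.
without_ofs=$(printf '%s\n' "$line" | awk -F'\t' '{ $2 += 5 } 1')

# FS and OFS are both tabs: the rebuilt record keeps its tab separators.
with_ofs=$(printf '%s\n' "$line" | awk 'BEGIN { FS = OFS = "\t" } { $2 += 5 } 1')

printf '%s\n%s\n' "$without_ofs" "$with_ofs"
```

The first result is "x 6 t" separated by spaces, the second keeps the tabs, which is usually what a tab-delimited format such as VCF needs.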

How to get the filenumber that is being processing by an awk script?

Suppose I have 2 or more files being processed by an awk script.
$ cat file1
a
b
c
$ cat file2
d
e
How do I get the number of the file being processed? Is there a built-in awk variable for that?
I want to have a script that behaves like the one below. What could I use as my
SOMEVARIABLE?
$ awk '{print FILENAME, NR, FNR, SOMEVARIABLE, $0}' file1 file2
file1 1 1 1 a
file1 2 2 1 b
file1 3 3 1 c
file2 4 1 2 d
file2 5 2 2 e
EDIT: Since the OP needs output in a specific format and does NOT want only a count of files, I am adding the following solution, which also counts empty files (written and tested in GNU awk).
awk '
FNR==1{
FNUM++
}
{
print FILENAME, NR, FNR, FNUM, $0
}
ENDFILE{
if(FNUM==prev){
FNUM++
print FILENAME, 0, 0, FNUM, "Empty file"
}
prev=FNUM
}' file1 file2
Output for a non-empty file1 and an empty file2 is as follows.
file1 1 1 1 a
file1 2 2 1 b
file1 3 3 1 c
file2 0 0 2 Empty file
Solutions when one wants to know total number of files processed by awk command:
1st solution: Could you please try following, using GNU awk(considering that you don't want to count empty files here).
awk 'NF{count++;nextfile} END{print count}' Input_file1 Input_file2
2nd solution: In case you only want to know number of files passed to awk command then try following.
awk 'END {print ARGC-1}' Input_file1 Input_file2
Explanation of the above commands with examples: Let's say the following are the input files, where Input_file1 has contents and Input_file2 is an empty file:
cat Input_file1
a
b
c
> Input_file2
Now when we run the ARGC command we get 2 as the number of files.
awk 'END {print ARGC-1}' Input_file1 Input_file2
2
Now when I run my 1st command it gives 1 file since it is not counting empty file.
awk 'NF{count++;nextfile} END{print count}' Input_file1 Input_file2
1
Well... I managed to do it as following:
$ awk 'BEGIN{FNUM=0} FNR==1{FNUM++} {print FILENAME, NR, FNR, FNUM, $0}' file1 file2
file1 1 1 1 a
file1 2 2 1 b
file1 3 3 1 c
file2 4 1 2 d
file2 5 2 2 e
I guess there is no built-in variable to help with that, so I created the variable FNUM (for file number). If there is a solution with a built-in variable, please give me a better answer.
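GNU awk does provide ARGIND for this, but it is not available in other awks. The sketch below is a portable approximation, under the assumption that every command-line operand is a plain file name (no var=value assignments): on the first line of each file it advances a counter through ARGV until it reaches the current FILENAME, so empty files that awk skipped silently are still counted. The variable name fnum is illustrative.

```shell
tmp=$(mktemp -d)
printf 'a\nb\nc\n' > "$tmp/file1"
: > "$tmp/empty"                      # an empty file in the middle
printf 'd\ne\n' > "$tmp/file2"

out=$(awk '
FNR == 1 {
    # Advance fnum to the ARGV slot holding the file now being read.
    while (fnum < ARGC - 1) {
        fnum++
        if (ARGV[fnum] == FILENAME) break
    }
}
{ print fnum, $0 }
' "$tmp/file1" "$tmp/empty" "$tmp/file2")

printf '%s\n' "$out"
rm -rf "$tmp"
```

Lines from file1 are numbered 1 and lines from file2 are numbered 3, because the empty file occupies position 2 on the command line. A plain FNR==1 counter would have numbered file2 as 2.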

How to replace multiple empty fields into zeroes using awk

I am using the following command to replace tab delimited empty fields with zeroes.
awk 'BEGIN { FS = OFS = "\t" } { for(i=1; i<=NF; i++) if($i ~ /^ *$/) $i = 0 }; 1'
How can I do the same, if I have the following input that is not tab delimited and have multiple empty fields ?
input
name A1348138 A1086070 A1080879 A1070208 A821846 A1068905 A1101931
g1 5 8 1 2 1 3 1
g2 1 3 2 1 1 2
desired output
name A1348138 A1086070 A1080879 A1070208 A821846 A1068905 A1101931
g1 5 8 1 2 1 3 1
g2 1 3 2 1 1 2 0
I'd suggest using GNU awk for FIELDWIDTHS to solve the problem you appear to be asking about and also to convert your fixed-width input to tab-separated output (or something else sensible) while you're at it:
$ cat file
1 2 3
4 6
$ gawk -v FIELDWIDTHS='4 4 4' -v OFS='\t' '{for (i=1;i<=NF;i++) {gsub(/^[[:space:]]+|[[:space:]]+$/,"",$i); $i=($i==""?0:$i)}; print}' file
1 2 3
4 0 6
$ gawk -v FIELDWIDTHS='4 4 4' -v OFS=',' '{for (i=1;i<=NF;i++) {gsub(/^[[:space:]]+|[[:space:]]+$/,"",$i); $i=($i==""?0:$i)}; print}' file
1,2,3
4,0,6
$ gawk -v FIELDWIDTHS='4 4 4' -v OFS=',' '{for (i=1;i<=NF;i++) {gsub(/^[[:space:]]+|[[:space:]]+$/,"",$i); $i="\""($i==""?0:$i)"\""}; print}' file
"1","2","3"
"4","0","6"
Take your pick of the above.
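If GNU awk is not available, the same idea can be approximated in POSIX awk by slicing each line with substr(). This is a sketch under the same assumption as the FIELDWIDTHS examples above (three columns, each 4 characters wide); the variable names width and ncols are illustrative.

```shell
out=$(printf '1   2   3\n4       6\n' | awk -v width=4 -v ncols=3 '
{
    line = ""
    for (i = 1; i <= ncols; i++) {
        # Cut out the i-th fixed-width cell and trim surrounding blanks.
        cell = substr($0, (i - 1) * width + 1, width)
        gsub(/^[ \t]+|[ \t]+$/, "", cell)
        if (cell == "") cell = 0      # an empty cell becomes 0
        line = line (i > 1 ? "," : "") cell
    }
    print line
}')

printf '%s\n' "$out"
```

For the two sample lines this prints 1,2,3 and 4,0,6. Real data with columns of different widths would need a list of widths rather than a single value.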

What are NR and FNR and what does "NR==FNR" imply?

I am learning file comparison using awk.
I found syntax like below,
awk 'NR==FNR{a[$1];next}$1 in a{print $1}' file1 file2
I couldn't understand what is the significance of NR==FNR in this?
If I try with FNR==NR then also I get the same output?
What exactly does it do?
In Awk:
FNR refers to the record number (typically the line number) in the current file.
NR refers to the total record number.
The operator == is a comparison operator, which returns true when the two surrounding operands are equal.
This means that the condition NR==FNR is normally only true for the first file, as FNR resets back to 1 for the first line of each file but NR keeps on increasing.
This pattern is typically used to perform actions on only the first file. It works assuming that the first file is not empty, otherwise the two variables would continue to be equal while Awk was processing the second file.
The next inside the block means any further commands are skipped, so they are only run on files other than the first.
The condition FNR==NR compares the same two operands as NR==FNR, so it behaves in the same way.
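The empty-first-file caveat can be reproduced with a small sketch. The file names and the array name seen are illustrative, and comparing FILENAME with ARGV[1] is just one possible workaround; it assumes the first operand is a file name that is not repeated later on the command line.

```shell
tmp=$(mktemp -d)
: > "$tmp/file1"                      # first file is empty
printf 'a\nb\n' > "$tmp/file2"

# Goal: print lines of file2 whose first word is not in file1.
# With an empty file1, NR==FNR stays true during file2, so every line of
# file2 is stored and skipped, and nothing is printed.
naive=$(awk 'NR == FNR { seen[$1]; next } !($1 in seen)' \
        "$tmp/file1" "$tmp/file2")

# Testing the file name instead does not depend on file1 having lines.
robust=$(awk 'FILENAME == ARGV[1] { seen[$1]; next } !($1 in seen)' \
        "$tmp/file1" "$tmp/file2")

printf 'naive:  [%s]\nrobust: [%s]\n' "$naive" "$robust"
rm -rf "$tmp"
```

The NR==FNR version prints nothing, while the FILENAME version prints both lines of file2 as expected.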
Look for keys (first word of line) in file2 that are also in file1.
Step 1: fill array a with the first words of file 1:
awk '{a[$1];}' file1
Step 2: Fill array a and ignore file 2 in the same command. To do this, compare the total number of records read so far (NR) with the record number within the current input file (FNR).
awk 'NR==FNR{a[$1]}' file1 file2
Step 3: Ignore actions that might come after } when parsing file 1
awk 'NR==FNR{a[$1];next}' file1 file2
Step 4: print key of file2 when found in the array a
awk 'NR==FNR{a[$1];next} $1 in a{print $1}' file1 file2
Look up NR and FNR in the awk manual and then ask yourself what is the condition under which NR==FNR in the following example:
$ cat file1
a
b
c
$ cat file2
d
e
$ awk '{print FILENAME, NR, FNR, $0}' file1 file2
file1 1 1 a
file1 2 2 b
file1 3 3 c
file2 4 1 d
file2 5 2 e
NR and FNR are awk built-in variables.
NR - the total number of records processed so far, across all input files.
FNR - the number of records processed so far in the current input file; it resets for each new file.
Assuming you have Files a.txt and b.txt with
cat a.txt
a
b
c
d
1
3
5
cat b.txt
a
1
2
6
7
Keep in mind
NR and FNR are awk built-in variables.
NR - Gives the total number of records processed so far (in this case across both a.txt and b.txt)
FNR - Gives the number of records processed so far in the current input file (records in either a.txt or b.txt)
awk 'NR==FNR{a[$0];}{if($0 in a)print FILENAME " " NR " " FNR " " $0}' a.txt b.txt
a.txt 1 1 a
a.txt 2 2 b
a.txt 3 3 c
a.txt 4 4 d
a.txt 5 5 1
a.txt 6 6 3
a.txt 7 7 5
b.txt 8 1 a
b.txt 9 2 1
Let's add "next" so that lines from the first file (matched by NR==FNR) are only stored and never reach the print block:
in b.txt and in a.txt
awk 'NR==FNR{a[$0];next}{if($0 in a)print FILENAME " " NR " " FNR " " $0}' a.txt b.txt
b.txt 8 1 a
b.txt 9 2 1
in b.txt but not in a.txt
awk 'NR==FNR{a[$0];next}{if(!($0 in a))print FILENAME " " NR " " FNR " " $0}' a.txt b.txt
b.txt 10 3 2
b.txt 11 4 6
b.txt 12 5 7
awk 'NR==FNR{a[$0];next}!($0 in a)' a.txt b.txt
2
6
7
Here is the pseudo code for your interest.
NR = 1
for (i=1; i<=files.length; ++i) {
    line = read line from files[i]
    FNR = 1
    while (not EOF) {
        columns = getColumns(line)
        if (NR equals FNR) { // processing first file
            add columns[1] to a
        } else { // processing remaining files
            if (columns[1] exists in a) {
                print columns[1]
            }
        }
        NR = NR + 1
        FNR = FNR + 1
        line = read line from files[i]
    }
}