Merge files print 0 in empty field - awk

I have 5 tab delim files
file 0 is basically a key
A
C
F
AA
BC
CC
D
KKK
S
file1
A 2
C 3
F 5
AA 5
BC 4
D 7
file2
A 2
C 3
F 7
D 10
file3
A 2
C 2
F 5
CC 4
D 7
file4
A 1
C 3
F 5
CC 4
D 7
KKK 10
I would like to merge all files based on the 1st column and print 0 in missing fields.
A 2 2 2 1
C 3 3 2 3
F 5 7 5 5
AA 5 0 0 0
BC 4 0 0 0
CC 0 0 4 4
D 7 10 7 7
KKK 0 0 0 10
S 0 0 0 0
Columns must keep the order of input file0, file1, file2, file3, file4

I was going to wait til you included your own attempt in your question but since you have 2 answers already anyway....
$ cat tst.awk
NR==FNR {
key2rowNr[$1] = ++numRows
rowNr2key[numRows] = $1
next
}
FNR==1 { ++numCols }
{
rowNr = key2rowNr[$1]
vals[rowNr,numCols] = $2
}
END {
for (rowNr=1; rowNr<=numRows; rowNr++) {
printf "%s", rowNr2key[rowNr]
for (colNr=1; colNr<=numCols; colNr++) {
printf "%s%d", OFS, vals[rowNr,colNr]
}
print ""
}
}
$ awk -f tst.awk file0 file1 file2 file3 file4
A 2 2 2 1
C 3 3 2 3
F 5 7 5 5
AA 5 0 0 0
BC 4 0 0 0
CC 0 0 4 4
D 7 10 7 7
KKK 0 0 0 10
S 0 0 0 0

awk solution
awk '
FNR==1{f++}
{
a[f""$1]=$2
b[$1]++
}
END{
for(i in b){
printf i" "
for(j=1;j<=f;j++){
tmp=j""i
if(tmp in a){
printf a[tmp]" "
}else{
printf 0" "
}
}
print ""
}
}
' file*
oupput :
A 2 2 2 1
AA 5 0 0 0
BC 4 0 0 0
C 3 3 2 3
CC 0 0 4 4
D 7 10 7 7
F 5 7 5 5
KKK 0 0 0 10
S 0 0 0 0
first i store every value for each file number and key value in variable a
then store all uniqe key in variable b
and in END block, checked if key is exists or not, if exists print it OR not exist print 0
we can delete file0, if delete it, awk show only exists key in file1,2,3,4,..

Not awk, but this sort of joining of files on a common field is exactly what join is meant for. Complicated a bit by it only working with two files at a time; you have to pipe the results of each one into the next as the first file.
$ join -o 0,2.2 -e0 -a1 <(sort file0) <(sort file1) \
| join -o 0,1.2,2.2 -e0 -a1 - <(sort file2) \
| join -o 0,1.2,1.3,2.2 -e0 -a1 - <(sort file3) \
| join -o 0,1.2,1.3,1.4,2.2 -e0 -a1 - <(sort file4) \
| tr ' ' '\t'
A 2 2 2 1
AA 5 0 0 0
BC 4 0 0 0
C 3 3 2 3
CC 0 0 4 4
D 7 10 7 7
F 5 7 5 5
KKK 0 0 0 10
S 0 0 0 0
Caveats: This requires a shell like bash or zsh that understands <(command) redirection. Sorting all the files in advance is an alternative. Or as pointed out, even though join normally requires its input files to be sorted on the column that's being joined on, it works anyways without the sorts for this particular input.

With GNU awk you can use the ENDFILE clause to make sure you have enough elements in all rows, e.g.:
parse.awk
BEGIN { OFS = "\t" }
# Collect all information into the `h` hash
{ h[$1] = (ARGIND == 1 ? $1 : h[$1] OFS $2) }
# At the end of each file do the necessary padding
ENDFILE {
for(k in h) {
elems = split(h[k], a, OFS)
if (elems != ARGIND)
h[k] = h[k] OFS 0
}
}
# Print the content of `h`
END {
for(k in h)
print h[k]
}
Run it like this:
awk -f parse.awk file[0-4]
Output:
AA 5 0 0 0
A 2 2 2 1
C 3 3 2 3
D 7 10 7 7
BC 4 0 0 0
CC 0 0 4 4
S 0 0 0 0
KKK 0 0 0 10
F 5 7 5 5
NB: This solution assumes you only have two columns per file (except the first one).

You could use coreutils join to determine missing fields and add them to each file:
sort file0 > file0.sorted
for file in file[1-4]; do
{
cat $file
join -j 1 -v 1 file0.sorted <(sort $file) | sed 's/$/ 0/'
} | sort > $file.sorted
done
Now you just need to paste them together:
paste file0.sorted \
<(cut -d' ' -f2 file1.sorted) \
<(cut -d' ' -f2 file2.sorted) \
<(cut -d' ' -f2 file3.sorted) \
<(cut -d' ' -f2 file4.sorted)
Output:
A 2 2 2 1
AA 5 0 0 0
BC 4 0 0 0
C 3 3 2 3
CC 0 0 4 4
D 7 10 7 7
F 5 7 5 5
KKK 0 0 0 10
S 0 0 0 0

Related

For each unique occurrence in field, print sum corresponding numerical field and number of occurrences/counts

I have a file
a x 0 3
a x 0 1
b,c x 4 4
dd x 3 5
dd x 2 5
e,e,t x 5 7
a b 1 9
cc b 2 1
cc b 1 1
e,e,t b 1 2
e,e,t b 1 2
e,e,t b 1 2
for each element in $1$2, I want print the sum $3, $4 and the number of occurrences/lenght/counts
So that I have
a x 0 4 0 2
b,c x 4 4 1 1
dd x 5 10 2 2
e,e,t x 5 7 1 1
a b 1 9 1 1
cc b 3 2 2 2
e,e,t b 3 6 3 3
I am using
awk -F"\t" '{for(n=2;n<=NF; ++n) a[$1 OFS $2][n]+=$n}
END {for(i in a) {
printf "%s", i
for (n=3; n<=4; ++n) printf "\t%s", a[i][n], a[i][n]++
printf "\n" }}' file
but it's only printing the sums, not the counts
The actual file has many columns: the keys are $4$6$7$8 and the numerical columns are $9-$13
You may use this awk:
cat sum.awk
BEGIN {
FS = OFS = "\t" # set input/output FS to tab
}
{
k = $1 OFS $2 # create key using $1 tab $2
if (!(k in map3)) # if k is not in map3 save it in an ordered array
ord[++n] = k
map3[k] += $3 # sum of $3 in array map3 using key as k
$3 > 0 && ++fq3[k] # frequency of $3 if it is > 0
map4[k] += $4 # sum of $4 in array map4 using key as k
$4 > 0 && ++fq4[k] # frequency of $4 if it is > 0
}
END {
for(i=1; i<=n; ++i) # print everything by looping through ord array
print ord[i], map3[ord[i]], map4[ord[i]], fq3[ord[i]]+0, fq4[ord[i]]+0
}
Then use it as:
awk -f sum.awk file
a x 0 4 0 2
b,c x 4 4 1 1
dd x 5 10 2 2
e,e,t x 5 7 1 1
a b 1 9 1 1
cc b 3 2 2 2
e,e,t b 3 6 3 3

awk: take entries from file and add values in between

I have the following file:
2 some
5 some
8 some
10 thing
15 thing
19 thing
Now I want to end up with entries, where for "some" 2,5,8 correspond to rows where there is a 1, everything else is 0. It doesn't matter how many rows there are. This means for "some":
0
1
0
0
1
0
0
1
0
0
and for "thing"
0
0
0
0
0
0
0
0
0
1
0
0
0
0
1
0
0
0
1
0
Is this possible in a quick way with awk? I mean with something like:
awk '{for(i=1;i<=10;i++) entries[$i]=0 for(f=0;<=NF;f++) entries[$f]=1' testfile.txt
another awk, output terminates with the last index
awk -v key='thing' '$2==key{while(++c<$1) print 0; print 1}' file
to add some extra 0's after the last 1; add END{while(i++<3) print 0}
Something like this seems to work in order to produce "some" data:
$ cat file1
2 some
5 some
8 some
10 thing
15 thing
19 thing
$ awk 'max<$1 && $2=="some"{max=$1;b[$1]=1}END{for (i=1;i<=max;i++) print (i in b?1:0)}' file1
0
1
0
0
1
0
0
1
Similarly , this one works for "thing" data
$ awk 'max<$1 && $2=="thing"{max=$1;b[$1]=1}END{for (i=1;i<=max;i++) print (i in b?1:0)}' file1
Alternativelly, as mentioned by glennjackman in comments we could use an external variable to select between some or thing:
$ awk -v word="some" 'max<$1 && $2==word{max=$1;b[$1]=1}END{for (i=1;i<=max;i++) print (i in b?1:0)}' file1
# for thing just apply awk -v word="thing"
You can achieve better parameterizing using an awk variable like this:
$ w="some" #selectable / set by shell , by script , etc
$ awk -v word="$w" 'max<$1 && $2==word{max=$1;b[$1]=1}END{for (i=1;i<=max;i++) print (i in b?1:0)}' file1
perl:
perl -lanE '
push #{$idx{$F[1]}}, $F[0] - 1; # subtract 1 because we are working with
# (zero-based) array indices
$max = $F[0]; # I assume the input is sorted by column 1
} END {
$, = "\n";
for $word (keys %idx) {
# create a $max-sized array filled with zeroes
#a = (0) x $max;
# then, populate the entries which should be 1
#a[ #{$idx{$word}} ] = (1) x #{$idx{$word}};
say $word, #a;
}
' file | pr -2T -s | nl -v0
0 thing some
1 0 0
2 0 1
3 0 0
4 0 0
5 0 1
6 0 0
7 0 0
8 0 1
9 0 0
10 1 0
11 0 0
12 0 0
13 0 0
14 0 0
15 1 0
16 0 0
17 0 0
18 0 0
19 1 0

Split a column in lines after every 5 entries - awk

I have a file that looks like the following
1
1
1
1
1
1
12
2
2
2
2
2
2
2
3
4
What I want to do is convert this column in multiple rows. Each new line/row should start after 5 entries, so that the output will like like this
1 1 1 1 1
1 12 2 2 2
2 2 2 2 3
4
I tried to achieve that by using
awk '{printf "%s" (NR%5==0?"RS:FS),$1}' file
but I get the following error
awk: line 1: runaway string constant "RS:FS),$1} ...
Any idea on how to achieve the desired output?
Maybe this awk one liner can help.
awk '{if (NR%5==0){a=a $0" ";print a; a=""} else a=a $0" "}END{print}' file
Output:
1 1 1 1 1
1 12 2 2 2
2 2 2 2 3
4
Longer awk:
{
if (NR%5==0)
{
a=a $0" ";
print a;
a="";
}
else
{
a=a $0" ";
}
}
END
{
print
}
Slightly different approach, still using awk:
$ awk '{if (NR%5) {ORS=""} else {ORS="\n"}{print " "$0}}' input.txt
1 1 1 1 1
1 12 2 2 2
2 2 2 2 3
4
Using perl:
$ perl -p -e 's/\n/ / if $.%5' input.txt
1 1 1 1 1
1 12 2 2 2
2 2 2 2 3
4
No need to complicate...
$ pr -5ats' ' <file
1 1 1 1 1
1 12 2 2 2
2 2 2 2 3
4

sum rows based on unique columns awk

I'm looking for a more elegant way to do this (for more than >100 columns):
awk '{a[$1]+=$4}{b[$1]+=$5}{c[$1]+=$6}{d[$1]+=$7}{e[$1]+=$8}{f[$1]+=$9}{g[$1]+=$10}END{for(i in a) print i,a[i],b[i],c[i],d[i],e[i],f[i],g[i]}'
Here is the input:
a1 1 1 2 2
a2 2 5 3 7
a2 2 3 3 8
a3 1 4 6 1
a3 1 7 9 4
a3 1 2 4 2
and output:
a1 1 1 2 2
a2 4 8 6 15
a3 3 13 19 7
Thanks :)
I break the one-liner down into lines, to make it easier to read.
awk '{n[$1];for(i=2;i<=NF;i++)a[$1,i]+=$i}
END{for(x in n){
printf "%s ", x
for(y=2;y<=NF;y++)printf "%s%s", a[x,y],(y==NF?ORS:FS)
}
}' file
this awk command should work with your 100 columns file.
test with your file:
kent$ cat f
a1 1 1 2 2
a2 2 5 3 7
a2 2 3 3 8
a3 1 4 6 1
a3 1 7 9 4
a3 1 2 4 2
kent$ awk '{n[$1];for(i=2;i<=NF;i++)a[$1,i]+=$i}END{for(x in n){printf "%s ", x;for(y=2;y<=NF;y++)printf "%s%s", a[x,y],(y==NF?ORS:OFS)}}' f
a1 1 1 2 2
a2 4 8 6 15
a3 3 13 19 7
Using arrays of arrays in gnu awk version 4
awk '{for (i=2;i<=NF;i++) a[$1][i]+=$i}
END{for (i in a)
{ printf i FS;
for (j in a[i]) printf a[i][j] FS
printf RS}
}' file
a1 1 1 2 2
a2 4 8 6 15
a3 3 13 19 7
If you care about order of output try this
$ cat file
a1 1 1 2 2
a2 2 5 3 7
a2 2 3 3 8
a3 1 4 6 1
a3 1 7 9 4
a3 1 2 4 2
Awk Code :
$ cat tester
awk 'FNR==NR{
U[$1] # Array U with index being field1
for(i=2;i<=NF;i++) # loop through columns thats is column2 to NF
A[$1,i]+=$i # Array A holds sum of columns
next # stop processing the current record and go on to the next record
}
($1 in U){ # Here we read same file once again,if field1 is found in array U, then following statements
for(i=1;i<=NF;i++)
s = s ? s OFS A[$1,i] : A[$1,i] # I am writing sum to variable s since I want to use only one print statement, here you can use printf also
print $1,s # print column1 and variable s
delete U[$1] # We have done, so delete array element
s = "" # reset variable s
}' OFS='\t' file{,} # output field separator is tab you can set comma also
Resulting
$ bash tester
a1 1 1 2 2
a2 4 8 6 15
a3 3 13 19 7
If you want to try this on a Solaris/SunOS system, change awk to /usr/xpg4/bin/awk , /usr/xpg6/bin/awk , or nawk
--edit--
As requested in comment here is one liner, in above post for better reading purpose I had commented and it became several lines.
$ awk 'FNR==NR{U[$1];for(i=2;i<=NF;i++)A[$1,i]+=$i;next}($1 in U){for(i=1;i<=NF;i++)s = s ? s OFS A[$1,i] : A[$1,i];print $1,s;delete U[$1];s = ""}' OFS='\t' file{,}
a1 1 1 2 2
a2 4 8 6 15
a3 3 13 19 7

Using an array in AWK when working with two files

I have two files I merged them based key using below code
file1
-------------------------------
1 a t p bbb
2 b c f aaa
3 d y u bbb
2 b c f aaa
2 u g t ccc
2 b j h ccc
file2
--------------------------------
1 11 bbb
2 22 ccc
3 33 aaa
4 44 aaa
I merged these two file based key using below code
awk 'NR==FNR{a[$3]=$0;next;}{for(x in a){if(x==$5) print $1,$2,$3,$4,a[x]};
My question is how I can save $2 of file2 in variable or array and print after a[x] again.
My desired result is :
1 a t p 1 11 bbb 11
2 b c f 3 33 aaa 33
2 b c f 4 44 aaa 44
3 d y u 1 11 bbb 11
2 b c f 3 33 aaa 33
2 b c f 4 44 aaa 44
2 u g t 2 22 ccc 22
2 b j h 2 22 ccc 22
As you see the first 7 columns is the result of my merge code. I need add the last column (field 2 of a[x]) to my result.
Important:
My next question is if I have .awk file, how I can use some bash script code like (| column -t) or send result to file (awk... > result.txt)? I always use these codes in command prompt. Can I use them inside my code in .awk file?
Simply add all of file2 to an array, and use split to hold the bits you want:
awk 'FNR==NR { two[$0]++; next } { for (i in two) { split(i, one); if (one[3] == $NF) print $1,$2,$3,$4, i, one[2] } }' file2 file1
Results:
1 a t p 1 11 bbb 11
2 b c f 3 33 aaa 33
2 b c f 4 44 aaa 44
3 d y u 1 11 bbb 11
2 b c f 3 33 aaa 33
2 b c f 4 44 aaa 44
2 u g t 2 22 ccc 22
2 b j h 2 22 ccc 22
Regarding your last question; you can also add 'pipes' and 'writes' inside of your awk. Here's an example of a pipe to column -t:
Contents of script.awk:
FNR==NR {
two[$0]++
next
}
{
for (i in two) {
split(i, one)
if (one[3] == $NF) {
print $1,$2,$3,$4, i, one[2] | "column -t"
}
}
}
Run like: awk -f script.awk file2 file1
EDIT:
Add the following to your shell script:
results=$(awk '
FNR==NR {
two[$0]++
next
}
{
for (i in two) {
split(i, one)
if (one[3] == $NF) {
print $1,$2,$3,$4, i, one[2] | "column -t"
}
}
}
' $1 $2)
echo "$results"
Run like:
./script.sh file2.txt file1.txt
Results:
1 a t p 1 11 bbb 11
2 b c f 3 33 aaa 33
2 b c f 4 44 aaa 44
3 d y u 1 11 bbb 11
2 b c f 3 33 aaa 33
2 b c f 4 44 aaa 44
2 u g t 2 22 ccc 22
2 b j h 2 22 ccc 22
Your current script is:
awk 'NR==FNR { a[$3]=$0; next }
{ for (x in a) { if (x==$5) print $1,$2,$3,$4,a[x] } }'
(Actually, the original is missing the second close brace for the second pattern/action pair.)
It seems that you process file2 before you process file1.
You shouldn't need the loop in the second code. And you can make life easier for yourself by using the splitting in the first phase to keep the values you need:
awk 'NR==FNR { c1[$3] = $1; c2[$3] = $2; next }
{ print $1, $2, $3, $4, c1[$5], c2[$5], $5, c2[$5] }'
You can upgrade that to check whether c1[$5] and c2[$5] are defined, presumably skipping the row if they are not.
Given your input files, the output is:
1 a t p 1 11 bbb 11
2 b c f 4 44 aaa 44
3 d y u 1 11 bbb 11
2 b c f 4 44 aaa 44
2 u g t 2 22 ccc 22
2 b j h 2 22 ccc 22
Give or take column spacing, that's what was requested. Column spacing can be fixed by using printf instead of print, or setting OFS to tab, or ...
The c1 and c2 notations for column 1 and 2 is OK for two columns. If you need more, then you should probably use the 2D array notation:
awk 'NR==FNR { for (i = 1; i <= NF; i++) col[i,$3] = $i; next }
{ print $1, $2, $3, $4, col[1,$5], col[2,$5], $5, col[2,$5] }'
This produces the same output as before.
To achieve what you ask, save the second field after the whole line in the processing of your first file, with a[$3]=$0 OFS $2. For your second question, awk has a variable to separate fields in output, it's OFS, assign a tabulator to it and play with it. Your script would be like:
awk '
BEGIN { OFS = "\t"; }
NR==FNR{
a[$3]=$0 OFS $2;
next;
}
{
for(x in a){
if(x==$5) print $1,$2,$3,$4,a[x]
}
}
' file2 file1
That yields:
1 a t p 1 11 bbb 11
2 b c f 4 44 aaa 44
3 d y u 1 11 bbb 11
2 b c f 4 44 aaa 44
2 u g t 2 22 ccc 22
2 b j h 2 22 ccc 22