Combine two columns into new and print all columns - awk

I want to combine columns 1 and 2 and add the result as a new column in my data frame. Then I want to print all the old columns plus the newly created column. I can combine the columns using the script below, but I am not sure how to print all columns rather than only the combined one:
awk ' { print $1 $2 "_" $NF } ' input_file
in
c1 c2 c3
12 1 12
4 4 57
out
c1 c2 c3 c4
12 1 12 12_1
4 4 57 4_4

If you want to print the _ between field 1 and 2, then the first output would be c1 c2 c3 c1_c2 instead of c1 c2 c3 c4
You can add a column at the end with the value of $1 and $2 and then print the whole line:
awk ' { $(NF+1) = $1"_"$2 }1' input_file
Output
c1 c2 c3 c1_c2
12 1 12 12_1
4 4 57 4_4
Or you can print the whole line followed by field $1 and $2
awk '{print $0, $1"_"$2}' input_file
Output
c1 c2 c3 c1_c2
12 1 12 12_1
4 4 57 4_4
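The two commands above give identical output for this input, but they differ in general: assigning to a field makes awk rebuild the record using OFS, whereas print $0, ... leaves the original spacing untouched. A minimal sketch, assuming a made-up input line with uneven spacing (the sample data is not from the question):

```shell
# Hypothetical sample line with uneven spacing between the fields.
sample='12   1    12'

# Variant 1: assigning to $(NF+1) makes awk rebuild $0, joining fields with OFS.
rebuilt=$(printf '%s\n' "$sample" | awk '{ $(NF+1) = $1 "_" $2 } 1')

# Variant 2: print $0 as it was read, then append the new value after one OFS.
appended=$(printf '%s\n' "$sample" | awk '{ print $0, $1 "_" $2 }')

printf '%s\n%s\n' "$rebuilt" "$appended"
```

With input that is already separated by single spaces the difference disappears, which is why both commands print the same thing here.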

Here is a generic solution in awk. List the field numbers in the awk variable named fields (e.g. 1,2,3,4,7,8) and it will join the values of those fields with _ and append the result as the last column. The header line is printed unchanged. Written and tested in GNU awk; it should work in any awk.
awk -v fields="1,2" '
BEGIN{
num=split(fields,arr,",")
for(i=1;i<=num;i++){
field[arr[i]]
}
}
FNR==1{
print
next
}
{
val=""
for(i=1;i<=NF;i++){
if(i in field){
val=(val?val "_":"")$i
}
}
print $0,val
}
' Input_file
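As written, the script above prints the header unchanged, so the appended column has no name. One possible variation, offered as a sketch rather than as the answerer's method, treats the header like any other row so that the new column is labelled from the selected header fields. The helper name join_fields and the choice of fields 1 and 3 are assumptions made for illustration:

```shell
# join_fields is a hypothetical helper: $1 is a comma-separated list of field numbers.
join_fields() {
  awk -v fields="$1" '
  BEGIN {
    n = split(fields, arr, ",")
    for (i = 1; i <= n; i++) want[arr[i]]      # remember the requested field numbers
  }
  {
    val = ""; sep = ""
    for (i = 1; i <= NF; i++)
      if (i in want) { val = val sep $i; sep = "_" }
    print $0, val                              # the header row is handled the same way
  }'
}

out=$(printf 'c1 c2 c3\n12 1 12\n4 4 57\n' | join_fields 1,3)
printf '%s\n' "$out"
```

Values are joined in field order, not in the order given in the list, so 3,1 gives the same result as 1,3.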

$ awk '{print $0, (NR>1 ? $1"_"$2 : "c4")}' file
c1 c2 c3 c4
12 1 12 12_1
4 4 57 4_4
or to get tab-separated output if your input is tab-separated:
$ awk 'BEGIN{FS=OFS="\t"} {print $0, (NR>1 ? $1"_"$2 : "c4")}' file
c1 c2 c3 c4
12 1 12 12_1
4 4 57 4_4
or if it isn't:
$ awk -v OFS='\t' '{$(NF+1)=(NR>1 ? $1"_"$2 : "c4")} 1' file
c1 c2 c3 c4
12 1 12 12_1
4 4 57 4_4

Another awk which at FNR==1 uses the field name in $NF to create the field name for the next field (c3 -> c4, c -> c1, etc):
$ awk '{
printf "%s%s%s\n",
$0,
OFS,
(FNR>1?$1 "_" $2:(match($3,/[0-9]+$/)?substr($3,1,RSTART-1) substr($3,RSTART)+1:$3 1))
}' file
Output:
c1 c2 c3 c4
12 1 12 12_1
4 4 57 4_4

golfed version
$ awk '$++NF=NR>1?$1"_"$2:"c4"' file
c1 c2 c3 c4
12 1 12 12_1
4 4 57 4_4

Related

How to update one file's column from another file's column in awk

I have two files, the first file:
1 AA
2 BB
3 CC
4 DD
and the second file
15 AA
17 BB
20 CC
25 FF
File 1 should be updated and the expected output should look like this:
15 AA
17 BB
20 CC
4 DD
I have tried this script from another post but it didn't work
awk 'NR==FNR{a[$1]=$2;next}a[$1]{print $2,a[$1]}' file1 file2
$ awk 'NR==FNR{a[$2]=$1; next} $2 in a{$1=a[$2]} 1' file2 file1
15 AA
17 BB
20 CC
4 DD
Here is an awk:
awk 'FNR==NR{f2[$2]=$0; next}
$2 in f2 {print f2[$2]; next}
1' file2 file1
Prints:
15 AA
17 BB
20 CC
4 DD
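Both answers rely on the NR==FNR idiom: NR counts records across all input files while FNR restarts for each file, so the two are equal only while the first file on the command line is being read. The sketch below spells that out with comments, using the sample data from the question; the temporary directory and the array name new are illustrative choices. One caveat worth knowing: if the first file is empty, NR==FNR also holds for the second file.

```shell
tmp=$(mktemp -d)
printf '1 AA\n2 BB\n3 CC\n4 DD\n'     > "$tmp/file1"
printf '15 AA\n17 BB\n20 CC\n25 FF\n' > "$tmp/file2"

out=$(awk '
  # First file on the command line (file2): NR and FNR are equal only here.
  NR == FNR { new[$2] = $1; next }
  # Second file (file1): replace column 1 when the key in column 2 was loaded.
  $2 in new { $1 = new[$2] }
  { print }
' "$tmp/file2" "$tmp/file1")

printf '%s\n' "$out"
rm -rf "$tmp"
```

Rows of file2 whose key never appears in file1 (25 FF here) are simply ignored.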

AWK program that can read a second file either from a file specified on the command line or from data received via a pipe

I have an AWK program that does a join of two files, file1 and file2. The files are joined based on a set of columns. I placed the AWK program into a bash script that I named join.sh. See below. Here is an example of how the script is executed:
./join.sh '1,2,3,4' '2,3,4,5' file1 file2
That says this: Do a join of file1 and file2, using columns (fields) 1,2,3,4 of file1 and columns (fields) 2,3,4,5 of file2.
That works great.
Now what I would like to do is to filter file2 and pipe the results to the join tool:
./fetch.sh ident file2 | ./join.sh '1,2,3,4' '2,3,4,5' file1
fetch.sh is a bash script containing an AWK program that fetches the rows in file2 with primary key ident and outputs to stdout the rows that were fetched.
Unfortunately, that pipeline is not working. I get no results.
Recap: I want the join program to be able to read the second file either from a file that I specify on the command line or from data received via a pipe. How to do that?
Here is my bash script, named join.sh
#!/bin/bash
awk -v f1cols=$1 -v f2cols=$2 '
BEGIN { FS=OFS="\t"
m=split(f1cols,f1,",")
n=split(f2cols,f2,",")
}
{ sub(/\r$/, "") }
NR == 1 { b[0] = $0 }
(NR == FNR) && (NR > 1) { idx2=$(f2[1])
for (i=2;i<=n;i++)
idx2=idx2 $(f2[i])
a[idx2] = $0
next
}
(NR != FNR) && (FNR == 1) { print $0, b[0] }
FNR > 1 { idx1=$(f1[1])
for (i=2;i<=m;i++)
idx1=idx1 $(f1[i])
for (idx1 in a)
print $0, a[idx1]
}' $3 $4
I'm not sure if this is 'correct' as you haven't provided any example input and expected output, but does using - to signify stdin work for your use-case? E.g.
cat file1
1 2 3 4
AA BB CC DD
AA EE FF GG
cat file2
1 2 3 4
AA ZZ YY XX
AA 11 22 33
./join.sh '1' '1' file1 file2
1 2 3 4 1 2 3 4
AA ZZ YY XX AA BB CC DD
AA ZZ YY XX AA EE FF GG
AA 11 22 33 AA BB CC DD
AA 11 22 33 AA EE FF GG
cat file2 | ./join.sh '1' '1' file1 -
1 2 3 4 1 2 3 4
AA ZZ YY XX AA BB CC DD
AA ZZ YY XX AA EE FF GG
AA 11 22 33 AA BB CC DD
AA 11 22 33 AA EE FF GG
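If typing the - every time is inconvenient, the wrapper could fall back to it when no second file is given. The following is only a sketch of that idea: join_demo is a hypothetical, much simplified stand-in for join.sh (it joins on column 1 only), and the sample data is made up.

```shell
tmp=$(mktemp -d)
printf 'AA\tBB\nCC\tDD\n' > "$tmp/file1"

# join_demo FILE1 [FILE2]: when FILE2 is missing, "${2:--}" expands to "-",
# which awk treats as standard input.
join_demo() {
  awk 'BEGIN { FS = OFS = "\t" }
       NR == FNR  { seen[$1] = $0; next }      # load the first file
       $1 in seen { print $0, seen[$1] }       # join rows of the second input
      ' "$1" "${2:--}"
}

out=$(printf 'AA\t11\nEE\t22\n' | join_demo "$tmp/file1")
printf '%s\n' "$out"
rm -rf "$tmp"
```

Called with two file names it reads the second file as before; called with one, it reads the piped data.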
"be able to read (...) from data received via a pipe"
GNU AWK, like any POSIX awk, supports using getline from a pipe. Consider the following simple example:
awk 'BEGIN{cmd="seq 7";while((cmd | getline) > 0){print $1*7};close(cmd)}' emptyfile
gives output
7
14
21
28
35
42
49
Explanation: this processes the output of the seq 7 command (the numbers from 1 to 7 inclusive, each on a separate line). The body of the while loop is executed once for each line of that output, and the fields are set just as in normal processing.
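The same mechanism can load a lookup table in BEGIN and then process the main input as usual. A minimal sketch, assuming a hypothetical command (two echo calls here) that emits one key per line; in the question's setting that command would be the fetch step instead:

```shell
out=$(printf 'AA 1\nBB 2\nCC 3\n' | awk '
  BEGIN {
    cmd = "echo AA; echo CC"          # hypothetical command, one key per line
    while ((cmd | getline key) > 0)   # reading into a variable leaves $0 and NF alone
      keep[key]
    close(cmd)
  }
  $1 in keep                          # main input: print rows whose key was loaded
')
printf '%s\n' "$out"
```

The explicit > 0 comparison matters because getline returns -1 on error, which would otherwise count as true and loop forever.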

Reading from a file and writing to another using Awk

There are two tab-delimited text files. My aim is to change File 1 so that the zeros in its 2nd column are replaced with the corresponding values from the 2nd column of File 2.
To visualize,
File 1:
AA 0
BB 0
CC 0
DD 0
EE 0
File 2:
AA 256
DD 142
EE 26
File 1 - Output:
AA 256
BB 0
CC 0
DD 142
EE 26
I wrote the command below, but as you can see, I supply the value from the 1st row of File 2 by hand. I want to do this automatically. What should I do?
awk -F'\t' 'BEGIN {OFS=FS} {if($1 == "AA") $2="256";print}' test > test.tmp && mv test.tmp test
Thank you in advance.
awk 'BEGIN {FS=OFS="\t"} NR==FNR{a[$1]=$2; next} {print $1, a[$1]+0}' file2 file1
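In the command above, a[$1]+0 turns a missing key into 0, which works because every value in File 1 is 0 to begin with. If File 1 could hold other values that should survive, a variation along the following lines may be safer. It is a sketch using the question's sample data, with the temp-file-and-mv step from the question used to update the file in place; the directory and array names are illustrative.

```shell
tmp=$(mktemp -d)
printf 'AA\t0\nBB\t0\nCC\t0\nDD\t0\nEE\t0\n' > "$tmp/file1"
printf 'AA\t256\nDD\t142\nEE\t26\n'          > "$tmp/file2"

awk 'BEGIN { FS = OFS = "\t" }
     NR == FNR { val[$1] = $2; next }     # load File 2
     $1 in val { $2 = val[$1] }           # overwrite only keys present in File 2
     { print }                            # every other row passes through unchanged
    ' "$tmp/file2" "$tmp/file1" > "$tmp/file1.new" &&
  mv "$tmp/file1.new" "$tmp/file1"

out=$(cat "$tmp/file1")
printf '%s\n' "$out"
rm -rf "$tmp"
```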

sum rows based on unique columns awk

I'm looking for a more elegant way to do this (for more than 100 columns):
awk '{a[$1]+=$4}{b[$1]+=$5}{c[$1]+=$6}{d[$1]+=$7}{e[$1]+=$8}{f[$1]+=$9}{g[$1]+=$10}END{for(i in a) print i,a[i],b[i],c[i],d[i],e[i],f[i],g[i]}'
Here is the input:
a1 1 1 2 2
a2 2 5 3 7
a2 2 3 3 8
a3 1 4 6 1
a3 1 7 9 4
a3 1 2 4 2
and output:
a1 1 1 2 2
a2 4 8 6 15
a3 3 13 19 7
Thanks :)
I broke the one-liner into several lines to make it easier to read.
awk '{n[$1];for(i=2;i<=NF;i++)a[$1,i]+=$i}
END{for(x in n){
printf "%s ", x
for(y=2;y<=NF;y++)printf "%s%s", a[x,y],(y==NF?ORS:OFS)
}
}' file
This awk command should work with your 100-column file.
Test with your file:
kent$ cat f
a1 1 1 2 2
a2 2 5 3 7
a2 2 3 3 8
a3 1 4 6 1
a3 1 7 9 4
a3 1 2 4 2
kent$ awk '{n[$1];for(i=2;i<=NF;i++)a[$1,i]+=$i}END{for(x in n){printf "%s ", x;for(y=2;y<=NF;y++)printf "%s%s", a[x,y],(y==NF?ORS:OFS)}}' f
a1 1 1 2 2
a2 4 8 6 15
a3 3 13 19 7
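Using arrays of arrays in GNU awk version 4 or later:
awk '{for (i=2;i<=NF;i++) a[$1][i]+=$i}
END{for (i in a) {
printf "%s", i # print the key as data, never as the printf format string
for (j=2; (j in a[i]); j++) printf "%s%s", OFS, a[i][j] # walk the columns in numeric order
print ""
}
}' file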
a1 1 1 2 2
a2 4 8 6 15
a3 3 13 19 7
If you care about order of output try this
$ cat file
a1 1 1 2 2
a2 2 5 3 7
a2 2 3 3 8
a3 1 4 6 1
a3 1 7 9 4
a3 1 2 4 2
Awk code:
$ cat tester
awk 'FNR==NR{
U[$1] # array U is indexed by field 1
for(i=2;i<=NF;i++) # loop over columns 2 to NF
A[$1,i]+=$i # array A holds the per-key column sums
next # skip the rest and go on to the next record
}
($1 in U){ # second pass over the same file: act only the first time each key is seen
for(i=2;i<=NF;i++)
s = (i>2 ? s OFS : "") A[$1,i] # build the sums in s so that a single print suffices (a sum of 0 is kept too)
print $1,s # print column 1 followed by the sums
delete U[$1] # done with this key, so later rows with it are skipped
s = "" # reset s for the next key
}' OFS='\t' file{,} # file{,} expands to "file file" in bash; OFS is a tab here, but a comma works too
Resulting
$ bash tester
a1 1 1 2 2
a2 4 8 6 15
a3 3 13 19 7
If you want to try this on a Solaris/SunOS system, change awk to /usr/xpg4/bin/awk , /usr/xpg6/bin/awk , or nawk
--edit--
As requested in a comment, here is the one-liner. In the post above I added comments for readability, which spread it over several lines.
$ awk 'FNR==NR{U[$1];for(i=2;i<=NF;i++)A[$1,i]+=$i;next}($1 in U){for(i=2;i<=NF;i++)s=(i>2?s OFS:"")A[$1,i];print $1,s;delete U[$1];s=""}' OFS='\t' file{,}
a1 1 1 2 2
a2 4 8 6 15
a3 3 13 19 7
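Reading the file twice keeps the output in input order, but it cannot be used on piped data. A single-pass alternative is to record each key the first time it appears and replay that list in END. The sketch below is one way to do that; the function name sum_in_order and the array names are illustrative. It also tracks the widest row itself instead of reading NF inside END, since some older awk implementations reportedly do not keep NF there.

```shell
# sum_in_order: sum columns 2..N per key in column 1, printing keys in first-seen order.
sum_in_order() {
  awk '
  !($1 in seen) { seen[$1]; order[++n] = $1 }      # remember first-seen order
  {
    if (NF > maxnf) maxnf = NF                     # widest row seen so far
    for (i = 2; i <= NF; i++) sum[$1, i] += $i
  }
  END {
    for (j = 1; j <= n; j++) {
      k = order[j]; line = k
      for (i = 2; i <= maxnf; i++) line = line OFS (sum[k, i] + 0)
      print line
    }
  }' "$@"
}

out=$(printf 'a1 1 1 2 2\na2 2 5 3 7\na2 2 3 3 8\na3 1 4 6 1\na3 1 7 9 4\na3 1 2 4 2\n' | sum_in_order)
printf '%s\n' "$out"
```

Because it reads its input only once, the function works both with a file argument and at the end of a pipeline.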

How to replace blank space with zero?

I have a file:
nr   kl1  kl2  kl3  kl4
d1   15   58   63   58
d2        3    3
d3   3         8    0
I want to print:
nr   kl1  kl2  kl3  kl4
d1   15   58   63   58
d2   0    3    3    0
d3   3    0    8    0
I tried a gsub solution, but it does not work.
awk '{gsub(/ /, 0, $2); print }' file
Thank you for your help.
EDIT:
Ed Morton's solution works with gawk, but it does not work with mawk.
$ gawk 'BEGIN{ FIELDWIDTHS="5 5 5 5 5"; OFS="" }NR>1 {for (i=2;i<=NF;i++)$i=sprintf("%-5d",$i)}{ sub(/ +$/,""); print }' file
nr kl1 kl2 kl3 kl4
d1 15 58 63 58
d2 0 3 3 0
d3 3 0 8 0
.
$ mawk 'BEGIN{ FIELDWIDTHS="5 5 5 5 5"; OFS="" }NR>1 {for (i=2;i<=NF;i++)$i=sprintf("%-5d",$i)}{ sub(/ +$/,""); print }' file
nr kl1 kl2 kl3 kl4
d115 58 63 58
d23 3
d33 8 0
How can I do the same with mawk?
What you tried didn't work because your fields aren't separated by spaces, they're a fixed width. Try this with GNU awk:
BEGIN{ FIELDWIDTHS="5 5 5 5 5"; OFS="" }
NR>1 {
for (i=2;i<=NF;i++)
$i=sprintf("%-5d",$i)
}
{ sub(/ +$/,""); print }
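FIELDWIDTHS is a GNU awk extension, which is why the same script misbehaves in mawk. For awks without it, the fixed-width columns can be cut out with substr() instead. The following is a sketch of that approach, not a tested drop-in replacement: the function name fill_blanks and the variables w (column width) and n (number of columns) are assumptions chosen to match the 5 x 5 layout above.

```shell
# fill_blanks: emulate FIELDWIDTHS="5 5 5 5 5" with substr() so it runs in any POSIX awk.
fill_blanks() {
  awk -v w=5 -v n=5 '
  NR > 1 {
    out = sprintf("%-" w "s", substr($0, 1, w))            # first column stays text
    for (i = 2; i <= n; i++)                               # a blank or missing column becomes 0
      out = out sprintf("%-" w "d", substr($0, (i - 1) * w + 1, w))
    $0 = out
  }
  { sub(/ +$/, ""); print }' "$@"
}

# Sample rebuilt from the question, assuming 5-character columns.
sample=$(
  printf '%-5s%-5s%-5s%-5s%s\n' nr kl1 kl2 kl3 kl4
  printf '%-5s%-5s%-5s%-5s%s\n' d1 15 58 63 58
  printf '%-5s%-5s%-5s%s\n'     d2 '' 3 3
  printf '%-5s%-5s%-5s%-5s%s\n' d3 3 '' 8 0
)
out=$(printf '%s\n' "$sample" | fill_blanks)
printf '%s\n' "$out"
```

substr() returns an empty string past the end of a short line, and %d formats an empty or blank string as 0, which is what fills the gaps.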