Preserve the header while joining files in bash/awk

I have these two tab-separated files:
fileA.tsv
probeId sample1_betaval sample2_betaval sample3_betaval
a 1 2 3
b 4 5 6
c 7 8 9
fileB.tsv
probeId region gene
a intronic tp53
b non-coding NA
c exonic kras
As they are already sorted by probeId, I've merged both files:
join -j 1 fileA.tsv fileB.tsv -t $'\t' > complete.tsv
The problem is that the output does not preserve the header:
a 1 2 3 intronic tp53
b 4 5 6 non-coding NA
c 7 8 9 exonic kras
While my desired output is:
probeId sample1_betaval sample2_betaval sample3_betaval region gene
a 1 2 3 intronic tp53
b 4 5 6 non-coding NA
c 7 8 9 exonic kras
How can I achieve that?

Add the --header option if your join provides it (it is a GNU coreutils extension):
join --header -j 1 fileA.tsv fileB.tsv -t $'\t' > complete.tsv
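If your join doesn't have --header (e.g. BSD/macOS join), a portable sketch is to join the header lines and the bodies separately; this assumes both files start with exactly one header line:
{
  join -j 1 -t $'\t' <(head -n 1 fileA.tsv) <(head -n 1 fileB.tsv)
  join -j 1 -t $'\t' <(tail -n +2 fileA.tsv) <(tail -n +2 fileB.tsv)
} > complete.tsv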

Could you please try the following (in case you are OK with it):
awk '
FNR==NR{
  array[$1]=$0
  next
}
($1 in array){
  print array[$1],$2,$3
}
' fileA.tsv fileB.tsv | column -t
EDIT: In case fileB.tsv has many columns and you want to print all of them apart from the first, try the following.
awk '
FNR==NR{
  array[$1]=$0
  next
}
($1 in array){
  val=$1
  $1=""
  sub(/^ +/,"")
  print array[val],$0
}
' fileA.tsv fileB.tsv | column -t
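Note that column -t replaces the tabs with space-aligned columns. If you want to keep the output tab-separated instead, a minimal sketch of the same idea with FS and OFS set to tab (fileA.tsv/fileB.tsv as in the question):
awk '
BEGIN{ FS=OFS="\t" }
FNR==NR{
  array[$1]=$0
  next
}
($1 in array){
  val=$1
  $1=""
  sub(/^\t/,"")
  print array[val] OFS $0
}
' fileA.tsv fileB.tsv > complete.tsv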

Related

Merge files, printing 0 in empty fields

I have 5 tab-delimited files.
file0 is basically a list of keys:
A
C
F
AA
BC
CC
D
KKK
S
file1
A 2
C 3
F 5
AA 5
BC 4
D 7
file2
A 2
C 3
F 7
D 10
file3
A 2
C 2
F 5
CC 4
D 7
file4
A 1
C 3
F 5
CC 4
D 7
KKK 10
I would like to merge all files based on the 1st column and print 0 in missing fields.
A 2 2 2 1
C 3 3 2 3
F 5 7 5 5
AA 5 0 0 0
BC 4 0 0 0
CC 0 0 4 4
D 7 10 7 7
KKK 0 0 0 10
S 0 0 0 0
Columns must keep the order of the input: file0, file1, file2, file3, file4.
I was going to wait until you included your own attempt in your question, but since you have 2 answers already anyway...
$ cat tst.awk
NR==FNR {
    key2rowNr[$1] = ++numRows
    rowNr2key[numRows] = $1
    next
}
FNR==1 { ++numCols }
{
    rowNr = key2rowNr[$1]
    vals[rowNr,numCols] = $2
}
END {
    for (rowNr=1; rowNr<=numRows; rowNr++) {
        printf "%s", rowNr2key[rowNr]
        for (colNr=1; colNr<=numCols; colNr++) {
            printf "%s%d", OFS, vals[rowNr,colNr]
        }
        print ""
    }
}
$ awk -f tst.awk file0 file1 file2 file3 file4
A 2 2 2 1
C 3 3 2 3
F 5 7 5 5
AA 5 0 0 0
BC 4 0 0 0
CC 0 0 4 4
D 7 10 7 7
KKK 0 0 0 10
S 0 0 0 0
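Since the input files are tab-delimited, you may want the output tab-separated as well; the script uses OFS, so it can be overridden on the command line, e.g.:
$ awk -v OFS='\t' -f tst.awk file0 file1 file2 file3 file4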
An awk solution:
awk '
FNR==1{ f++ }
{
    a[f""$1]=$2
    b[$1]++
}
END{
    for(i in b){
        printf i" "
        for(j=1;j<=f;j++){
            tmp=j""i
            if(tmp in a){
                printf a[tmp]" "
            }else{
                printf 0" "
            }
        }
        print ""
    }
}
' file*
Output:
A 2 2 2 1
AA 5 0 0 0
BC 4 0 0 0
C 3 3 2 3
CC 0 0 4 4
D 7 10 7 7
F 5 7 5 5
KKK 0 0 0 10
S 0 0 0 0
First I store every value under a file-number + key index in array a, then store all unique keys in array b.
In the END block I check whether each (file, key) pair exists in a: if it exists, print the value, otherwise print 0.
You can also drop file0 from the argument list; without it, awk only shows the keys that exist in file1, file2, file3, file4.
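Note that for (i in b) iterates in an unspecified order, so the row order is not guaranteed. A sketch to restore file0's row order, assuming the program above is saved as merge.awk (a name chosen here for illustration): number each key by its position in file0, sort on that number, then strip it.
awk -f merge.awk file* |
awk 'NR==FNR{ ord[$1]=FNR; next } { print ord[$1], $0 }' file0 - |
sort -n | cut -d' ' -f2-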
Not awk, but this sort of joining of files on a common field is exactly what join is meant for. It's complicated a bit by join only working with two files at a time; you have to pipe the result of each join into the next one as the first file.
$ join -o 0,2.2 -e0 -a1 <(sort file0) <(sort file1) \
| join -o 0,1.2,2.2 -e0 -a1 - <(sort file2) \
| join -o 0,1.2,1.3,2.2 -e0 -a1 - <(sort file3) \
| join -o 0,1.2,1.3,1.4,2.2 -e0 -a1 - <(sort file4) \
| tr ' ' '\t'
A 2 2 2 1
AA 5 0 0 0
BC 4 0 0 0
C 3 3 2 3
CC 0 0 4 4
D 7 10 7 7
F 5 7 5 5
KKK 0 0 0 10
S 0 0 0 0
Caveats: this requires a shell like bash or zsh that understands <(command) process substitution; sorting all the files in advance is an alternative. And, as pointed out, even though join normally requires its input files to be sorted on the join column, it works anyway without the sorts for this particular input.
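If the number of files grows, writing one join with a hand-built -o list per file gets tedious. A minimal bash sketch that builds the same chain in a loop (file names assumed as above):
result=$(sort file0)
spec="0"
k=1
for f in file1 file2 file3 file4; do
  result=$(join -o "$spec,2.2" -e0 -a1 <(printf '%s\n' "$result") <(sort "$f"))
  k=$((k + 1))
  spec="$spec,1.$k"   # the column just added is field 1.k in the next join
done
printf '%s\n' "$result" | tr ' ' '\t'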
With GNU awk you can use the ENDFILE clause to make sure you have enough elements in all rows, e.g.:
parse.awk
BEGIN { OFS = "\t" }

# Collect all information into the `h` hash
{ h[$1] = (ARGIND == 1 ? $1 : h[$1] OFS $2) }

# At the end of each file do the necessary padding
ENDFILE {
    for (k in h) {
        elems = split(h[k], a, OFS)
        if (elems != ARGIND)
            h[k] = h[k] OFS 0
    }
}

# Print the content of `h`
END {
    for (k in h)
        print h[k]
}
Run it like this:
awk -f parse.awk file[0-4]
Output:
AA 5 0 0 0
A 2 2 2 1
C 3 3 2 3
D 7 10 7 7
BC 4 0 0 0
CC 0 0 4 4
S 0 0 0 0
KKK 0 0 0 10
F 5 7 5 5
NB: This solution assumes you only have two columns per file (except the first one).
You could use coreutils join to determine missing fields and add them to each file:
sort file0 > file0.sorted
for file in file[1-4]; do
  {
    cat "$file"
    join -j 1 -v 1 file0.sorted <(sort "$file") | sed 's/$/ 0/'
  } | sort > "$file.sorted"
done
Now you just need to paste them together:
paste file0.sorted \
<(cut -d' ' -f2 file1.sorted) \
<(cut -d' ' -f2 file2.sorted) \
<(cut -d' ' -f2 file3.sorted) \
<(cut -d' ' -f2 file4.sorted)
Output:
A 2 2 2 1
AA 5 0 0 0
BC 4 0 0 0
C 3 3 2 3
CC 0 0 4 4
D 7 10 7 7
F 5 7 5 5
KKK 0 0 0 10
S 0 0 0 0

Find match between 2 columns of different files and replace third with awk

I am looking for a way to replace a column in a file if two ID columns match.
I have file A.txt
c a b ID
0.1 0.01 5 1
0.2 0.1 6 2
0.3 2 3
and file B.txt
ID a b
1 10 15
2 20 16
3 30 12
4 40 14
The output I'm looking for is:
file A.txt
ID a b
1 0.01 5
2 0.1 6
3 30 2
I found out that it is possible with the following:
awk 'NR==FNR{ if(NR>1) a[$1]=$2; next }
FNR>1 && $1 in a && NF<3{ f=$2; $2=a[$1]; $3=f }1' B.txt A.txt | column -t
But the problem is that it compares $1 from both files. How can I instead compare $4 from A.txt with $1 from B.txt?
I tried the following:
awk 'NR==FNR{ if(NR>1) a[$1]=$2; b[$1]=$1; next }
FNR>1 && $1~ /b[$1] in a && NF<3{ f=$2; $2=a[$1]; $3=f }1' eaf.txt final.txt | column -t
But it didn't work. Is there a way to solve it? Thank you.
An awk solution:
awk 'NR==FNR{ if(NR>1) a[$1]=$2; next }        # B.txt: map ID to its a value
     FNR==1{ print $NF,$2,$3; next }           # output the header line rearranged
     FNR>1 && ($NF in a){
         $1=$NF                                # move the ID to the front
         if(NF<4){ f=$2; $2=a[$1]; $3=f }      # short line: insert a, shift b right
         else $NF=""                           # full line: drop the trailing ID
     }1' B.txt A.txt | column -t
The output:
ID a b
1 0.01 5
2 0.1 6
3 30 2

awk print column data in rows based on matching key

I am trying to write an awk one-liner to print column data in rows based on a matching key.
My file is as below:
$ cat 1.txt
2016-05-10,UJ,ALL 1 7
2016-05-10,UJ,ALL 1 10
2016-05-10,UJ,ALL 1 9
2016-05-10,UJ,ALL 1 8
2016-05-10,UJ,ALL 1 14
2016-05-10,UJ,ALL 1 8
2016-05-10,UJ,ALL 1 12
2016-05-10,UJ,ALL 2 11
2016-05-10,UJ,ALL 1 10
2016-05-10,UJ,ALL 2 12
2016-05-10,UJ,ALL 2 9
2016-05-10,UJ,ALL 1 13
The expected output is as below (the unique key is everything before the first space, i.e. 2016-05-10,UJ,ALL; <\tab> marks a tab character):
2016-05-10,UJ,ALL<\tab>1 1 1 1 1 1 1 2 1 2 2 1<\tab>7 10 9 8 14 8 12 11 10 12 9 13
I am using below awk pattern matching
awk '$1 != prev{printf "%s%s",ors,$1; ors=ORS; ofs="\t"} {printf "%s%s",ofs,$2; ofs=OFS; prev=$1} END{print ""}' 1.txt
but it is not working on the last column; I tried all possible combinations but with no success. Please suggest.
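For reference, the attempt above only ever prints $2, which is why the last column never appears. A minimal sketch of a fix that also buffers $3 (assuming, as in the sample, that lines with the same key are grouped together):
awk '
$1 != prev {
    if (prev != "") print prev "\t" c2 "\t" c3   # flush the previous key
    prev = $1; c2 = $2; c3 = $3                  # start buffering a new key
    next
}
{ c2 = c2 " " $2; c3 = c3 " " $3 }               # append to both buffers
END { if (prev != "") print prev "\t" c2 "\t" c3 }
' 1.txt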
I would go for something like:
awk -v OFS="\t" '{
    cols[$1]
    col2[$1] = (length(col2[$1]) ? col2[$1] FS : "") $2
    col3[$1] = (length(col3[$1]) ? col3[$1] FS : "") $3
} END { for (i in cols) print i, col2[i], col3[i] }' file
See it in action:
$ awk -v OFS="\t" '{cols[$1]; col2[$1]=(length(col2[$1]) ? col2[$1] FS : "") $2; col3[$1]=(length(col3[$1]) ? col3[$1] FS : "") $3} END {for (i in cols) print i, col2[i], col3[i]}' a
2016-05-10,UJ,ALL 1 1 1 1 1 1 1 2 1 2 2 1 7 10 9 8 14 8 12 11 10 12 9 13
# ^ ^
# tab tab
If there is only one unique key (as in the sample input), you can also build the output with cut and paste:
$ head -n1 1.txt | cut -d' ' -f1
2016-05-10,UJ,ALL
$ # transform multiple lines to single line with space as separator
$ cut -d' ' -f2 1.txt | paste -sd' '
1 1 1 1 1 1 1 2 1 2 2 1
$ cut -d' ' -f3 1.txt | paste -sd' '
7 10 9 8 14 8 12 11 10 12 9 13
$ # finally, combine the three results
$ # by default paste uses tab as delimiter
$ paste <(head -n1 1.txt | cut -d' ' -f1) <(cut -d' ' -f2 1.txt | paste -sd' ') <(cut -d' ' -f3 1.txt | paste -sd' ')
2016-05-10,UJ,ALL 1 1 1 1 1 1 1 2 1 2 2 1 7 10 9 8 14 8 12 11 10 12 9 13
$ # to use a different delimiter
$ paste -d: <(head -n1 1.txt | cut -d' ' -f1) <(cut -d' ' -f2 1.txt | paste -sd' ') <(cut -d' ' -f3 1.txt | paste -sd' ')
2016-05-10,UJ,ALL:1 1 1 1 1 1 1 2 1 2 2 1:7 10 9 8 14 8 12 11 10 12 9 13
Another option is to use GNU datamash; however, it will give comma-separated values:
$ datamash -t' ' -W -g1 collapse 2 -g1 collapse 3 <1.txt
2016-05-10,UJ,ALL 1,1,1,1,1,1,1,2,1,2,2,1 7,10,9,8,14,8,12,11,10,12,9,13
-t' ' input delimiter is space
-W whitespace as output delimiter
-g1 collapse 2 comma separated column 2 values using column 1 as key
-g1 collapse 3 comma separated column 3 values using column 1 as key
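If you want the collapsed lists space-separated, note that a blanket tr ',' ' ' would also mangle the commas inside the key. A small awk postprocess that only touches fields 2 and 3 (a sketch, reusing the command above) produces the tab-separated layout the question asks for:
datamash -t' ' -W -g1 collapse 2 -g1 collapse 3 <1.txt |
awk -v OFS='\t' '{ gsub(/,/, " ", $2); gsub(/,/, " ", $3) } 1'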

replace a block of lines in file1 with a block of lines in file2

file1:
a xyz 1 2 4
a xyz 1 2 3
a abc 3 9 7
a abc 3 9 2
a klm 9 3 1
a klm 9 8 3
a tlc 3 9 3
file2:
a xyz 9 2 9
a xyz 8 9 2
a abc 3 8 9
a abc 6 2 7
a tlk 7 8 9
I want to replace the lines that have 'abc' in file1 with the lines that have 'abc' in file2. I'm new to sed, awk, etc. Any help is appreciated.
I tried cat file1 <(sed '/$r = abc;/d' file2) > newfile among others, but this one simply copies file1 to newfile. Also, I don't want to generate a new file, but only edit file1 in place.
desired output:
(processed) file1:
a xyz 1 2 4
a xyz 1 2 3
a abc 3 8 9
a abc 6 2 7
a klm 9 3 1
a klm 9 8 3
a tlc 3 9 3
With GNU awk, you can use this trick:
gawk -v RS='([^\n]* abc [^\n]*\n)+' 'NR == FNR { save = RT; nextfile } FNR == 1 { printf "%s", $0 save; next } { printf "%s", $0 RT }' file2 file1
With the record separator ([^\n]* abc [^\n]*\n)+, this splits the input files into records delimited by blocks of lines with " abc " in them. Then,
NR == FNR {                 # while processing the first given file (file2)
    save = RT               # remember the first record terminator -- the
                            # first block of lines with abc in them
    nextfile                # and go to the next file.
}
FNR == 1 {                  # for the first record in file1
    printf "%s", $0 save    # print it with the saved record terminator
    next                    # from file2, and get the next record.
}
{                           # from then on, just echo.
    printf "%s", $0 RT
}
Note that this uses several GNU extensions, so it will not work with mawk.
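If gawk isn't available, here is a portable sketch with plain POSIX awk, assuming the abc lines form one contiguous block in each file:
awk '
NR==FNR { if ($2 == "abc") block = block $0 ORS; next }          # collect file2 abc lines
$2 == "abc" { if (!done) { printf "%s", block; done = 1 }; next } # swap the block in once
{ print }                                                         # everything else unchanged
' file2 file1 > file1.tmp && mv file1.tmp file1
The tmp-and-rename at the end rewrites file1 in place, as requested.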

How to merge two files based on the first three columns using awk

I want to merge two files into a single one, line by line, using the first three columns as a key. Example:
file1.txt
a b c 1 4 7
x y z 2 5 8
p q r 3 6 9
file2.txt
p q r 11
a b c 12
x y z 13
My desired output for the above two files is:
a b c 1 4 7 12
x y z 2 5 8 13
p q r 3 6 9 11
The number of columns in each file is not fixed; it can vary from line to line. Also, I have more than 27K lines in each file.
The files are not ordered. The only thing is that the first three fields are the same for both files.
You could also use join; it requires sorted input and a single join field, so the first 3 fields need to be merged first. The example below sorts each file and lets sed merge and later re-separate the fields:
join <(sort file1.txt | sed 's/ /-/; s/ /-/') \
<(sort file2.txt | sed 's/ /-/; s/ /-/') |
sed 's/-/ /; s/-/ /'
Output:
a b c 1 4 7 12
p q r 3 6 9 11
x y z 2 5 8 13
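One caveat: if the data itself can contain a literal '-', the merge/split round trip breaks. With GNU sed you could use an unlikely byte such as \x01 as the temporary separator instead (a sketch):
join <(sort file1.txt | sed 's/ /\x01/; s/ /\x01/') \
     <(sort file2.txt | sed 's/ /\x01/; s/ /\x01/') |
sed 's/\x01/ /; s/\x01/ /'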
Join on the first three fields where the number of fields is variable (four or more):
{
    # get the fourth field through the last
    for (i=4; i<=NF; i++)
        f = f $i " "
    # concatenate the fields onto this key's entry
    arr[$1 OFS $2 OFS $3] = arr[$1 OFS $2 OFS $3] f
    # reset field string
    f = ""
}
END {
    for (key in arr)
        print key, arr[key]
}
Run like:
$ awk -f script.awk file1 file2
a b c 1 4 7 12
p q r 3 6 9 11
x y z 2 5 8 13
try this:
awk 'NR==FNR{a[$1$2$3]=$4;next}$1$2$3 in a{print $0, a[$1$2$3]}' file2 file1
If the columns have varying lengths, you could try something like this using SUBSEP:
awk 'NR==FNR{A[$1,$2,$3]=$4; next}($1,$2,$3) in A{print $0, A[$1,$2,$3]}' file2 file1
For varying columns in file1 and sorted output, try:
awk '{$1=$1; i=$1 FS $2 FS $3 FS; sub(i,x)} NR==FNR{A[i]=$0; next}i in A{print i $0, A[i]}' file2 file1 | sort
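That last one-liner is dense; here is the same program expanded with comments (a sketch, behavior unchanged):
awk '
{
    $1 = $1                   # rebuild the record to normalize whitespace
    i = $1 FS $2 FS $3 FS     # key = first three fields plus a trailing separator
    sub(i, x)                 # strip the key from $0 (x is unset, i.e. the empty string)
}
NR==FNR { A[i] = $0; next }   # file2: remember the rest of each line per key
i in A  { print i $0, A[i] }  # file1: key + file1 remainder + file2 remainder
' file2 file1 | sort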