Append columns to a tab-delimited file using AWK

I have multiple files without headers that share the same first four columns but have a different fifth column. I need to combine the four common columns with the fifth column of every file, adding the respective headers as shown below, into a single final tab-delimited text file using awk.
File_1.txt
chr1 101845021 101845132 A 0
chr2 128205033 128205154 B 0
chr3 128205112 128205223 C 0
chr4 36259133 36259244 D 0
chr5 36259333 36259444 E 0
chr6 25497759 25497870 F 1
chr7 25497819 25497930 G 1
chr8 25497869 25497980 H 1
File_2.txt
chr1 101845021 101845132 A 6
chr2 128205033 128205154 B 7
chr3 128205112 128205223 C 7
chr4 36259133 36259244 D 7
chr5 36259333 36259444 E 10
chr6 25497759 25497870 F 11
chr7 25497819 25497930 G 11
chr8 25497869 25497980 H 12
File_3.txt
chr1 101845021 101845132 A 41
chr2 128205033 128205154 B 41
chr3 128205112 128205223 C 42
chr4 36259133 36259244 D 43
chr5 36259333 36259444 E 47
chr6 25497759 25497870 F 48
chr7 25497819 25497930 G 48
chr8 25497869 25497980 H 49
Expected Output file Final.txt
Part Start End Name File1 File2 File3
chr1 101845021 101845132 A 0 6 41
chr2 128205033 128205154 B 0 7 41
chr3 128205112 128205223 C 0 7 42
chr4 36259133 36259244 D 0 7 43
chr5 36259333 36259444 E 0 10 47
chr6 25497759 25497870 F 1 11 48
chr7 25497819 25497930 G 1 11 48
chr8 25497869 25497980 H 1 12 49

Files in same order
If it is safe to assume that the rows are in the same order in each file, then you can do the job fairly succinctly with:
awk '
FILENAME != oname { FN++; oname = FILENAME }
{ p[FNR] = $1; s[FNR] = $2; e[FNR] = $3; n[FNR] = $4; f[FN,FNR] = $5; N = FNR }
END {
printf("%-8s %-12s %-12s %-4s %-5s %-5s %-5s\n",
"Part", "Start", "End", "Name", "File1", "File2", "File3");
for (i = 1; i <= N; i++)
{
printf("%-8s %-12d %-12d %-4s %-5d %-5d %-5d\n",
p[i], s[i], e[i], n[i], f[1,i], f[2,i], f[3,i]);
}
}' file_1.txt file_2.txt file_3.txt
The first line spots when you start on a new file, and increments the FN variable (so lines from file 1 can be tagged with FN == 1, etc). It records the file name in oname so it can spot changes.
The second line operates on each data line, storing the first four fields in the arrays p, s, e, n (indexed by record number within the current file), and records the fifth column in f (indexed by FN and record number). It records the current record number in the current file in N.
The END block prints out the heading, then for each row in the array (indexed from 1 to N), prints out the various fields.
The output is (unsurprisingly):
Part Start End Name File1 File2 File3
chr1 101845021 101845132 A 0 6 41
chr2 128205033 128205154 B 0 7 41
chr3 128205112 128205223 C 0 7 42
chr4 36259133 36259244 D 0 7 43
chr5 36259333 36259444 E 0 10 47
chr6 25497759 25497870 F 1 11 48
chr7 25497819 25497930 G 1 11 48
chr8 25497869 25497980 H 1 12 49
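The file-counting idiom from the first line of the script can be tried on its own. The sketch below is only an illustration of the bookkeeping: the file names demo_a.txt, demo_b.txt and demo_counts.txt are made up for the example, and the script prints the file index FN next to the per-file record number FNR and the fifth column.

```shell
# Two small throwaway input files (hypothetical names, not from the question).
printf 'chr1 1 2 A 0\nchr2 3 4 B 0\n' > demo_a.txt
printf 'chr1 1 2 A 6\nchr2 3 4 B 7\n' > demo_b.txt

# FN counts files; FNR restarts at 1 for each new file.
awk '
FILENAME != oname { FN++; oname = FILENAME }
{ print FN, FNR, $5 }
' demo_a.txt demo_b.txt > demo_counts.txt

cat demo_counts.txt
```

Each output line shows which file (1 or 2) and which row within that file the fifth-column value came from, which is exactly the pair of subscripts used for f[FN,FNR].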
Files in different orders
If you can't rely on the records being in the same order in each file, you have to work harder. Assuming that the records in the first file are in the required order, the following script arranges to print the records in that order:
awk '
FILENAME != oname { FN++; oname = FILENAME }
{ key = $1 SUBSEP $2 SUBSEP $3 SUBSEP $4
if (FN == 1)
{ p[key] = $1; s[key] = $2; e[key] = $3; n[key] = $4; f[FN,key] = $5; k[FNR] = key; N = FNR }
else
{ if (key in p)
f[FN,key] = $5
else
printf "Unmatched key (%s) in %s\n", key, FILENAME
}
}
END {
printf("%-8s %-12s %-12s %-4s %-5s %-5s %-5s\n",
"Part", "Start", "End", "Name", "File1", "File2", "File3")
for (i = 1; i <= N; i++)
{
key = k[i]
printf("%-8s %-12d %-12d %-4s %-5d %-5d %-5d\n",
p[key], s[key], e[key], n[key], f[1,key], f[2,key], f[3,key])
}
}' "$@"
This is closely based on the previous script; the FN handling is identical. The SUBSEP variable is used to separate subscripts in a multi-index array. The variable key contains the same value that would be generated by indexing an array z[$1,$2,$3,$4].
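The equivalence between a hand-built SUBSEP key and a multi-index subscript can be checked with a tiny sketch. The array name z follows the text above; the values and the output file name demo_subsep.txt are made up for illustration.

```shell
awk 'BEGIN {
    z["chr1", "101845021"] = "stored"      # multi-index subscript
    key = "chr1" SUBSEP "101845021"        # hand-built equivalent key
    if (key in z) print "hand-built key matches"
    if (("chr1", "101845021") in z) print "tuple test matches"
}' > demo_subsep.txt

cat demo_subsep.txt
```

Both tests succeed because awk stores z["chr1","101845021"] under the single string "chr1" SUBSEP "101845021".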
If working on the first file (FN == 1), then the values in arrays p, s, e, n are created, indexed by key. The fifth column is recorded in f similarly. The order in which the keys appear in the file is recorded in array k, indexed by the (file) record number.
If working on the second or third file, check whether the key is known, reporting if it is not. Assuming it is known, add the fifth column in f again.
The printing is similar, except it collects the keys in sequence from k, and then prints the relevant values.
Given these files:
file_4.txt
chr8 25497869 25497980 H 1
chr7 25497819 25497930 G 1
chr6 25497759 25497870 F 1
chr5 36259333 36259444 E 0
chr4 36259133 36259244 D 0
chr3 128205112 128205223 C 0
chr2 128205033 128205154 B 0
chr1 101845021 101845132 A 0
file_5.txt
chr2 128205033 128205154 B 7
chr8 25497869 25497980 H 12
chr3 128205112 128205223 C 7
chr1 101845021 101845132 A 6
chr6 25497759 25497870 F 11
chr4 36259133 36259244 D 7
chr7 25497819 25497930 G 11
chr5 36259333 36259444 E 10
file_6.txt
chr5 36259333 36259444 E 47
chr4 36259133 36259244 D 43
chr6 25497759 25497870 F 48
chr8 25497869 25497980 H 49
chr2 128205033 128205154 B 41
chr3 128205112 128205223 C 42
chr7 25497819 25497930 G 48
chr1 101845021 101845132 A 41
The script yields the output:
Part Start End Name File1 File2 File3
chr8 25497869 25497980 H 1 12 49
chr7 25497819 25497930 G 1 11 48
chr6 25497759 25497870 F 1 11 48
chr5 36259333 36259444 E 0 10 47
chr4 36259133 36259244 D 0 7 43
chr3 128205112 128205223 C 0 7 42
chr2 128205033 128205154 B 0 7 41
chr1 101845021 101845132 A 0 6 41
There are many circumstances that these scripts do not accommodate very thoroughly. For example, if the files are of different lengths; if there are repeated keys; if there are keys found in one or two files not found in the other(s); if the fifth column data is not numeric; if the second and third columns are not numeric; if there are only two files, or more than three files listed. The 'not numeric' issue is actually easily fixed; simply use %s instead of %d. But the scripts are fragile. They work in the ecosystems shown, but not very generally. The necessary fixes are not incredibly hard; they are a nuisance to have to code, though.
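One of the gaps listed above, keys that appear in some files but not in others, could be handled along the following lines. This is a sketch under assumptions, not part of the scripts above: it orders rows by first appearance across all files, fills missing values with the placeholder NA (my choice), and lets the last value win if a key repeats within a file. The file names demo_1.txt, demo_2.txt and demo_merged.txt are hypothetical.

```shell
# Hypothetical inputs: chr1 is only in the first file, chr3 only in the second.
printf 'chr1 10 20 A 0\nchr2 30 40 B 1\n' > demo_1.txt
printf 'chr2 30 40 B 7\nchr3 50 60 C 9\n' > demo_2.txt

awk '
FILENAME != oname { FN++; oname = FILENAME }
{
    key = $1 SUBSEP $2 SUBSEP $3 SUBSEP $4
    if (!(key in row)) {            # first time this key is seen in any file
        row[key] = $1 "\t" $2 "\t" $3 "\t" $4
        k[++N] = key                # remember first-seen order
    }
    f[FN, key] = $5                 # a repeated key in one file overwrites
}
END {
    for (i = 1; i <= N; i++) {
        line = row[k[i]]
        for (j = 1; j <= FN; j++)
            line = line "\t" (((j, k[i]) in f) ? f[j, k[i]] : "NA")
        print line
    }
}' demo_1.txt demo_2.txt > demo_merged.txt

cat demo_merged.txt
```

Testing with (j, k[i]) in f rather than reading f[j, k[i]] directly distinguishes a genuinely missing value from an empty one and avoids creating array elements as a side effect.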
There could be more or less than 3 files
Extending the previous script to handle an arbitrary number of files, and to output tab-separated data instead of formatted (readable) data is not very difficult.
awk '
FILENAME != oname { FN++; file[FN] = oname = FILENAME }
{ key = $1 SUBSEP $2 SUBSEP $3 SUBSEP $4
if (FN == 1)
{ p[key] = $1; s[key] = $2; e[key] = $3; n[key] = $4; f[FN,key] = $5; k[FNR] = key; N = FNR }
else
{ if (key in p)
f[FN,key] = $5
else
{
printf "Unmatched key (%s) in %s\n", key, FILENAME
exit 1
}
}
}
END {
printf("%s\t%s\t%s\t%s", "Part", "Start", "End", "Name")
for (i = 1; i <= FN; i++) printf("\t%s", file[i]);
print ""
for (i = 1; i <= N; i++)
{
key = k[i]
printf("%s\t%s\t%s\t%s", p[key], s[key], e[key], n[key])
for (j = 1; j <= FN; j++)
printf("\t%s", f[j,key])
print ""
}
}' "$@"
The key point is that printf doesn't output a newline unless you tell it to do so, but print does output a newline. The code keeps a record of the actual file names for use in printing out the columns. It loops over the array of file data, assuming that there are the same number of lines in each file.
Given 6 files as input — the three original files, a copy of the first file in reverse order, and permuted copies of the second and third files, the output has 6 columns of extra data, with the columns identified:
Part Start End Name file_1.txt file_2.txt file_3.txt file_4.txt file_5.txt file_6.txt
chr1 101845021 101845132 A 0 6 41 0 6 41
chr2 128205033 128205154 B 0 7 41 0 7 41
chr3 128205112 128205223 C 0 7 42 0 7 42
chr4 36259133 36259244 D 0 7 43 0 7 43
chr5 36259333 36259444 E 0 10 47 0 10 47
chr6 25497759 25497870 F 1 11 48 1 11 48
chr7 25497819 25497930 G 1 11 48 1 11 48
chr8 25497869 25497980 H 1 12 49 1 12 49
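The printf/print distinction that the script relies on can be seen in isolation with a short sketch; the output file name demo_printf.txt is made up for the example.

```shell
awk 'BEGIN {
    printf("%s\t%s", "Part", "Start")   # no newline yet
    printf("\t%s", "End")               # still on the same line
    print ""                            # print supplies the newline
    print "next line"
}' > demo_printf.txt

cat demo_printf.txt
```

The three header cells end up on one tab-separated line, and only the empty print moves the output to the next line.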

Assuming all three files are sorted, you can use the join command:
join -o "1.1,1.2,1.3,1.4,2.5,2.6,1.5" file3 <(join -o "1.1,1.2,1.3,1.4,1.5,2.5" file1 file2)
The -o option formats the output by selecting specific fields from the two inputs. 1.x and 2.x refer to field x of the first and second file given, respectively. For example, 1.1 refers to the first field of the first file.
Since join only accepts two files, the bash process substitution <(...) is used to present the output of the inner join as if it were a file.
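Where process substitution is not available, the same nesting can probably be written as a pipeline, because POSIX join reads standard input when a file operand is given as -. The sketch below uses small made-up files (j1.txt, j2.txt, j3.txt) that are already sorted on the first field; like the command above, it joins on the first field only.

```shell
# Hypothetical sorted inputs with the same first four columns.
printf 'chr1 10 20 A 0\nchr2 30 40 B 1\n'   > j1.txt
printf 'chr1 10 20 A 6\nchr2 30 40 B 7\n'   > j2.txt
printf 'chr1 10 20 A 41\nchr2 30 40 B 42\n' > j3.txt

# Inner join of the first two files feeds the second join through stdin (-).
join -o 1.1,1.2,1.3,1.4,1.5,2.5 j1.txt j2.txt |
    join -o 1.1,1.2,1.3,1.4,2.5,2.6,1.5 j3.txt - > j_out.txt

cat j_out.txt
```

In the second join, file 1 is j3.txt and file 2 is the piped result, so 2.5 and 2.6 are the fifth columns of the first two files and 1.5 is the fifth column of the third.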
Another solution using paste and awk (still assuming files are sorted):
paste file* | awk '{print $1,$2,$3,$4,$5,$10,$15}'
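Since the question asks for tab-delimited output with a header line, the paste approach could be extended roughly as follows. This is a sketch with made-up file names (p_1.txt, p_2.txt, p_3.txt) and the header names from the expected output; it still assumes the rows are in the same order in every file.

```shell
# Hypothetical inputs, rows in the same order in each file.
printf 'chr1 10 20 A 0\nchr2 30 40 B 1\n'   > p_1.txt
printf 'chr1 10 20 A 6\nchr2 30 40 B 7\n'   > p_2.txt
printf 'chr1 10 20 A 41\nchr2 30 40 B 42\n' > p_3.txt

# paste puts the three rows side by side (15 fields); awk keeps 7 of them.
paste p_1.txt p_2.txt p_3.txt |
    awk -v OFS='\t' '
    BEGIN { print "Part", "Start", "End", "Name", "File1", "File2", "File3" }
    { print $1, $2, $3, $4, $5, $10, $15 }
    ' > p_out.txt

cat p_out.txt
```

Listing the files explicitly instead of using file* keeps the column order under control and avoids picking up unrelated files that happen to match the glob.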

Related

Using awk to select rows with a specific value in column greater than x

I tried to use awk to select all rows with a value greater than 98 in the third column. In the output, only lines with values between 98 and 98.99... were selected; lines with a value above 98.99 were not.
I would like to extract all lines with a value greater than 98 including 99, 100 and so on.
Here my code and my input format:
for i in *input.file; do awk '$3>98' "$i" > "${i/input./output.}"; done
A chr11 98.80 83 1 0 2 84
B chr7 95.45 22 1 0 40 61
C chr7 88.89 27 0 1 46 72
D chr6 100.00 20 0 0 1 20
Expected Output
A chr11 98.80 83 1 0 2 84
D chr6 100.00 20 0 0 1 20
Okay, if you have a series of files, *input.file and you want to select those lines where $3 > 98 and then write the values to the same prefix, but with output.file as the rest of the filename, you can use:
awk '$3 > 98 {
match (FILENAME,/input\.file$/)
print $0 > (substr(FILENAME,1,RSTART-1) "output.file")
}' *input.file
This uses match to find the index where input.file begins, then uses substr to get the part of the filename before that index and appends "output.file" to the substring to form the final output filename.
match() sets the RSTART value to the index where input.file begins in the current filename, which is then used by substr to truncate the current filename at that index. See GNU awk String Functions for complete details.
For example, if you had input files:
$ ls -1 *input.file
v1input.file
v2input.file
Both with your example content:
$ cat v1input.file
A chr11 98.80 83 1 0 2 84
B chr7 95.45 22 1 0 40 61
C chr7 88.89 27 0 1 46 72
D chr6 100.00 20 0 0 1 20
Running the awk command above would result in two output files:
$ ls -1 *output.file
v1output.file
v2output.file
Containing the records where the third-field was greater than 98:
$ cat v1output.file
A chr11 98.80 83 1 0 2 84
D chr6 100.00 20 0 0 1 20
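The match()/substr() step can be checked in isolation on one of the example filenames. This is only a sketch of the string handling; the variable name and the file demo_match.txt are made up for illustration.

```shell
awk 'BEGIN {
    name = "v1input.file"
    match(name, /input\.file$/)                      # sets RSTART (and RLENGTH)
    print RSTART                                     # index where "input.file" starts
    print substr(name, 1, RSTART - 1) "output.file"  # prefix plus the new suffix
}' > demo_match.txt

cat demo_match.txt
```

For "v1input.file" the match starts at index 3, so the prefix is "v1" and the derived name is "v1output.file".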

Compare 4 columns in two files; and output the line for unique combination (from first file) and line for duplicate combination (from second file)

I have two tab-separated values files, say
File1.txt
chr1 894573 rs13303010 GG
chr2 18674 rs10195681 **CC**
chr3 104972 rs990284 AA <--- Unique Line
chr4 111487 rs17802159 AA
chr5 200868 rs4956994 **GG**
chr5 303686 rs6896163 AA <--- Unique Line
chrX 331033 rs4606239 TT
chrY 2893277 i4000106 **GG**
chrY 2897433 rs9786543 GG
chrM 57 i3002191 **TT**
File2.txt
chr1 894573 rs13303010 GG
chr2 18674 rs10195681 AT
chr4 111487 rs17802159 AA
chr5 200868 rs4956994 CC
chrX 331033 rs4606239 TT
chrY 2893277 i4000106 GA
chrY 2897433 rs9786543 GG
chrM 57 i3002191 TA
Desired Output:
Output.txt
chr1 894573 rs13303010 GG
chr2 18674 rs10195681 AT
chr3 104972 rs990284 AA <--Unique Line from File1.txt
chr4 111487 rs17802159 AA
chr5 200868 rs4956994 CC
chr5 303686 rs6896163 AA <--Unique Line from File1.txt
chrX 331033 rs4606239 TT
chrY 2893277 i4000106 GA
chrY 2897433 rs9786543 GG
chrM 57 i3002191 TA
File1.txt has total 10 entries while File2.txt has 8 entries.
I want to compare both files using Column 1 and Column 2.
If the values in the first two columns are the same in both files, the corresponding line from File2.txt should be printed to Output.txt.
When File1.txt has a unique combination (Column1:Column2 that is not present in File2.txt), the corresponding line from File1.txt should be printed to Output.txt.
I tried various awk and perl combinations available on the website, but couldn't get the correct answer.
Any suggestion will be helpful.
Thanks,
Amit
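Next time, show the awk code you tried so we can help with the error or the missing piece.
awk 'NR==FNR || (($1,$2) in k) {k[$1,$2]=$0} END{for(K in k)print k[K]}' file1 file2
Note that for (K in k) visits the keys in an unspecified order, so the output lines may not follow the order of file1.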
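If the output has to keep the line order of the first file, as in the desired output, one possible variant is to load the second file into an array first and then stream the first file. This is a sketch under that assumption, not the answerer's command; the file names cmp_1.txt, cmp_2.txt and cmp_out.txt are made up and hold a shortened version of the question's data.

```shell
# Shortened, tab-separated stand-ins for File1.txt and File2.txt.
printf 'chr1\t894573\trs13303010\tGG\nchr2\t18674\trs10195681\tCC\nchr3\t104972\trs990284\tAA\n' > cmp_1.txt
printf 'chr1\t894573\trs13303010\tGG\nchr2\t18674\trs10195681\tAT\n' > cmp_2.txt

awk '
NR == FNR { k[$1, $2] = $0; next }        # first operand (cmp_2.txt): remember its lines
{
    # second operand (cmp_1.txt): prefer the stored line when the key matches
    line = (($1, $2) in k) ? k[$1, $2] : $0
    print line
}' cmp_2.txt cmp_1.txt > cmp_out.txt

cat cmp_out.txt
```

Because lines are printed while the first file is being read, the order of cmp_1.txt is preserved without any sorting.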

LINUX AWK command for a big file

I have encountered a problem that exceeds my basic unix knowledge and would really appreciate some help. I have a large file in the following format:
chr1 10495 10499 211
chr1 10496 10500 1
chr1 10587 10591 93
chr1 10588 10592 1
chr1 10639 10643 4
chr1 10668 10672 11
chr1 10697 10701 13
chr1 10726 10730 8
chr1 10755 10759 7
chr1 10784 10788 5
chr2 10856 10860 4
chr3 10932 10936 6
chr3 10933 10937 2
chr5 11056 11060 4
chr6 11155 11159 9
If the values in column one match and the values in column two differ by one, I want to sum the values in column 4 of both lines and replace the value of column 3 in line 1 with the value of column 3 in line 2; otherwise, just print the unique line without modifying any column.
So the output I am hoping for would look like this:
chr1 10495 10500 212
chr1 10587 10592 94
chr1 10639 10643 4
chr1 10668 10672 11
chr1 10697 10701 13
chr1 10726 10730 8
chr1 10755 10759 7
chr1 10784 10788 5
chr2 10856 10860 4
chr3 10932 10937 8
chr5 11056 11060 4
chr6 11155 11159 9
$ cat tst.awk
BEGIN { OFS="\t" }
NR>1 {
if ( ($1==p[1]) && ($2==(p[2]+1)) ) {
print p[1], p[2], $3, p[4]+$4
delete p[0]
next
}
else if (0 in p) {
print p[0]
}
}
{ split($0,p); p[0]=$0 }
END { if (0 in p) print p[0] }
$
$ awk -f tst.awk file
chr1 10495 10500 212
chr1 10587 10592 94
chr1 10639 10643 4
chr1 10668 10672 11
chr1 10697 10701 13
chr1 10726 10730 8
chr1 10755 10759 7
chr1 10784 10788 5
chr2 10856 10860 4
chr3 10932 10937 8
chr5 11056 11060 4
chr6 11155 11159 9
Haven't checked closely, but I think you want:
awk '{split(p,a)}
$1==a[1] && a[2]==$2-1{print a[1], a[2], $3, $4 + a[4]; p=""; next}
p != "" {print p} {p=$0}
END {if (p != "") print p}' OFS=\\t input
At any given step (except the first), p holds the value from the previous line. The 2nd line of the script checks whether the first field in the current line matches the first field of the last line and whether the 2nd field is one greater than the 2nd field of the last line. In that case, it prints the first two fields from the previous line, the third from the current line, and the sum of the 4th fields, clears p, and moves on to the next line. If they don't match, it prints the previous line. At the end, it prints the pending line, if there is one; if the last input line was already merged into a pair, nothing is left to print.
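The previous-line technique can be exercised on a short input whose last two lines form a pair, which is the case where the end-of-input handling matters. The sketch below is a variant written for illustration, with a made-up input file iv.txt; it flushes the pending line at the end only when one exists.

```shell
# Hypothetical input: the last two lines are an adjacent pair on chr3.
printf 'chr1 10 14 5\nchr3 20 24 6\nchr3 21 25 2\n' > iv.txt

awk -v OFS='\t' '
{ split(p, a) }                                   # fields of the pending line
p != "" && $1 == a[1] && $2 == a[2] + 1 {         # adjacent start on same chromosome
    print a[1], a[2], $3, a[4] + $4
    p = ""                                        # pair consumed, nothing pending
    next
}
p != "" { print p }                               # pending line had no partner
{ p = $0 }
END { if (p != "") print p }                      # flush only if something is pending
' iv.txt > iv_out.txt

cat iv_out.txt
```

Unmerged lines are printed exactly as read (space-separated here), while merged lines use OFS, so a real script may want to normalize the separators.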
I use this script to merge intervals in transcriptome data:
awk '
NR==1{
n= split($0, first);
c=1;
for(i=1; i<=n; i++) d[c, i] = first[i];
}
NR>1{
n= split($0, actual);
#if(actual[1] != d[c, 1] || actual[2]>d[c, 3]){ #for interval fusion
if(actual[1] != d[c, 1] || actual[2]>d[c,2]+1){ #OP requirement
c++;
for(i=1; i<=n; i++) d[c, i] = actual[i];
}else{
if(actual[3] > d[c,3]) d[c,3] = actual[3];
d[c,4] = d[c,4] + actual[4];
}
}
END{
for(i=1; i<=c; i++){
print d[i, 1], d[i, 2], d[i, 3], d[i, 4]
}
}' file
you get:
chr1 10495 10500 212
chr1 10587 10592 94
chr1 10639 10643 4
chr1 10668 10672 11
chr1 10697 10701 13
chr1 10726 10730 8
chr1 10755 10759 7
chr1 10784 10788 5
chr2 10856 10860 4
chr3 10932 10937 8
chr5 11056 11060 4
chr6 11155 11159 9

How to add multiple columns to a file from another file

I have two files like shown below which are tab-delimited:
file A
chr1 123 aa b c d
chr1 234 a b c d
chr1 345 aa b c d
chr1 456 a b c d
....
file B
chr1 123 aa c d e ff
chr1 345 aa e f g gg
chr1 123 aa c d e hh
chr1 567 aa z c a ii
chr1 345 bb x q r kk
chr1 789 df f g s ff
chr1 345 sh d t g ll
...
I want to add a new column to file A from file B based on two key columns (the first two columns, e.g. "chr1" and "123"). If the key columns match in both files, the data in column 7 of file B should be inserted as column 3 of file A.
For example, the key (chr1 123) is found twice in file B, so the 3rd column in file A contains ff and hh separated by a comma. If the key is not found, NA should be inserted. The output should look like this:
output:
chr1 123 ff,hh aa b c d
chr1 234 NA a b c d
chr1 345 gg,kk,ll aa b c d
chr1 456 NA a b c d
I achieved this using the following awk solution:
awk -F'\t' -v OFS='\t' 'NR==FNR{a[$1FS$2]=a[$1FS$2]?a[$1FS$2]","$7:$7;next}{$3=(($1FS$2 in a)?a[$1FS$2]:"NA")FS $3}1' fileB fileA
Now, I would like to add column 6 as another column alongside column 7. Could anyone suggest how to do this?
The output should look like:
chr1 123 ff,hh e,e aa b c d
chr1 234 NA NA a b c d
chr1 345 gg,kk,ll g,r,g aa b c d
chr1 456 NA NA a b c d
Thanks
My suggestion is to use another array to track the next variable you want to add, but to keep the code a little more readable, I've made an executable awk script to generalize it a bit:
#!/usr/bin/awk -f
BEGIN { FS="\t"; OFS="\t" }
{ key = $1 FS $2 }
FNR==NR {
updateArray( a, $7 )
updateArray( b, $6 )
next
}
{ $3 = concat( a, concat( b, $3 ) ) }
1
function updateArray( arr, fld ) {
arr[key] = arr[key]!="" ? arr[key] "," fld : fld
}
function concat( arr, suffix ) {
return( (arr[key]=="" ? "NA" : arr[key]) OFS suffix )
}
Here's the breakdown:
Set the FS and OFS values
Make a global key for every line read
Store data from the first file in arrays a and b where they are passed by reference to the function updateArray and the field value is passed by value
Update $3 using the local concat function
Print the updated line out with 1
As another option, you could make the value stored in a single a[key] equal to all the file B fields you want represented in $3 and have them separated by OFS. That would require parsing and reassembling the value in a[key] every time it changed as file B is parsed, but would make creating the $3 a simple three part concatenation.
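The single-array alternative described in the last paragraph might look roughly like this. It is a sketch under assumptions, not the script above rewritten by its author: the array a holds the column 7 list and the column 6 list joined by OFS, and the names demo_A.txt, demo_B.txt and demo_AB.txt are made-up stand-ins for shortened versions of file A and file B.

```shell
# Shortened, tab-separated stand-ins for file A and file B.
printf 'chr1\t123\taa\tb\tc\td\nchr1\t234\ta\tb\tc\td\n' > demo_A.txt
printf 'chr1\t123\taa\tc\td\te\tff\nchr1\t123\taa\tc\td\te\thh\nchr1\t345\taa\te\tf\tg\tgg\n' > demo_B.txt

awk '
BEGIN { FS = OFS = "\t" }
{ key = $1 FS $2 }
NR == FNR {                                      # first operand: file B
    if (key in a) {
        split(a[key], part, OFS)                 # part[1]: column 7 list, part[2]: column 6 list
        a[key] = part[1] "," $7 OFS part[2] "," $6
    } else {
        a[key] = $7 OFS $6
    }
    next
}
{                                                # second operand: file A
    extra = (key in a) ? a[key] : ("NA" OFS "NA")
    $3 = extra OFS $3
    print
}' demo_B.txt demo_A.txt > demo_AB.txt

cat demo_AB.txt
```

The split-and-reassemble step on every repeated key is the extra work mentioned above; in exchange, building the new $3 is a single concatenation.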

How to replace blank space with zero?

I have a file:
nr   kl1  kl2  kl3  kl4
d1   15   58   63   58
d2        3    3
d3   3         8    0
I want to print:
nr   kl1  kl2  kl3  kl4
d1   15   58   63   58
d2   0    3    3    0
d3   3    0    8    0
I tried a gsub solution, but it does not work.
awk '{gsub(/ /, 0, $2); print }' file
Thank you for your help.
EDIT:
Ed Morton's solution works with gawk, but it does not work with mawk.
$ gawk 'BEGIN{ FIELDWIDTHS="5 5 5 5 5"; OFS="" }NR>1 {for (i=2;i<=NF;i++)$i=sprintf("%-5d",$i)}{ sub(/ +$/,""); print }' file
nr   kl1  kl2  kl3  kl4
d1   15   58   63   58
d2   0    3    3    0
d3   3    0    8    0
.
$ mawk 'BEGIN{ FIELDWIDTHS="5 5 5 5 5"; OFS="" }NR>1 {for (i=2;i<=NF;i++)$i=sprintf("%-5d",$i)}{ sub(/ +$/,""); print }' file
nr   kl1  kl2  kl3  kl4
d115   58   63   58
d23    3
d33    8    0
How can I do the same with mawk?
What you tried didn't work because your fields aren't separated by spaces, they're a fixed width. Try this with GNU awk:
BEGIN{ FIELDWIDTHS="5 5 5 5 5"; OFS="" }
NR>1 {
for (i=2;i<=NF;i++)
$i=sprintf("%-5d",$i)
}
{ sub(/ +$/,""); print }
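For awks without FIELDWIDTHS, such as mawk, the same fixed-width slicing can probably be done with substr(). The sketch below is an assumption-laden variant, not the answer's script: it hard-codes five columns of width 5, treats the first column as text, and uses the made-up file names fw.txt and fw_out.txt.

```shell
# Fixed-width sample: five columns, each 5 characters wide (trailing blanks omitted).
printf 'nr   kl1  kl2  kl3  kl4\nd1   15   58   63   58\nd2        3    3\nd3   3         8    0\n' > fw.txt

awk '
NR > 1 {
    out = sprintf("%-5s", substr($0, 1, 5))                    # label column, kept as text
    for (i = 2; i <= 5; i++)                                   # four numeric columns
        out = out sprintf("%-5d", substr($0, (i - 1) * 5 + 1, 5))
    sub(/ +$/, "", out)                                        # drop trailing padding
    $0 = out
}
{ print }
' fw.txt > fw_out.txt

cat fw_out.txt
```

A blank or missing slice converts to 0 under %d, so short lines without trailing spaces are padded with zeros as well.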