LINUX AWK command for a big file - awk

I have encountered a problem that exceeds my basic unix knowledge and would really appreciate some help. I have a large file in the following format:
chr1 10495 10499 211
chr1 10496 10500 1
chr1 10587 10591 93
chr1 10588 10592 1
chr1 10639 10643 4
chr1 10668 10672 11
chr1 10697 10701 13
chr1 10726 10730 8
chr1 10755 10759 7
chr1 10784 10788 5
chr2 10856 10860 4
chr3 10932 10936 6
chr3 10933 10937 2
chr5 11056 11060 4
chr6 11155 11159 9
If the values in column 1 match and the column 2 values differ by exactly one, I want to sum the column 4 values of both lines and replace the column 3 value of the first line with the column 3 value of the second line; otherwise, just print the unique line without modifying any column.
So the output I am hoping for would look like this:
chr1 10495 10500 212
chr1 10587 10592 94
chr1 10639 10643 4
chr1 10668 10672 11
chr1 10697 10701 13
chr1 10726 10730 8
chr1 10755 10759 7
chr1 10784 10788 5
chr2 10856 10860 4
chr3 10932 10937 8
chr5 11056 11060 4
chr6 11155 11159 9

$ cat tst.awk
BEGIN { OFS="\t" }
NR>1 {
    if ( ($1==p[1]) && ($2==(p[2]+1)) ) {
        print p[1], p[2], $3, p[4]+$4
        split("",p)   # clear the buffered line so it cannot match again
        next
    }
    else if (0 in p) {
        print p[0]
    }
}
{ split($0,p); p[0]=$0 }
END { if (0 in p) print p[0] }
$
$ awk -f tst.awk file
chr1 10495 10500 212
chr1 10587 10592 94
chr1 10639 10643 4
chr1 10668 10672 11
chr1 10697 10701 13
chr1 10726 10730 8
chr1 10755 10759 7
chr1 10784 10788 5
chr2 10856 10860 4
chr3 10932 10937 8
chr5 11056 11060 4
chr6 11155 11159 9

Haven't checked closely, but I think you want:
awk '{split(p,a)}
$1==a[1] && a[2]==$2-1{print a[1], a[2], $3, $4 + a[4]; p=""; next}
p {print p} {p=$0}
END {if (p) print p}' OFS=\\t input
At any given step (except the first), p holds the previous line. The second line of the script checks whether the first field of the current line matches the first field of the previous line and the second field is one greater than the previous line's second field. If so, it prints the first two fields from the previous line, the third field from the current line, and the sum of the fourth fields, then moves on to the next line. If they don't match, it prints the previous line. At the end, it prints the stored previous line, if one remains.

I use this script to merge intervals in transcriptome data:
awk '
NR==1 {
    n = split($0, first)
    c = 1
    for (i=1; i<=n; i++) d[c, i] = first[i]
}
NR>1 {
    n = split($0, actual)
    #if (actual[1] != d[c, 1] || actual[2] > d[c, 3]) {   # for interval fusion
    if (actual[1] != d[c, 1] || actual[2] > d[c, 2]+1) {  # OP requirement
        c++
        for (i=1; i<=n; i++) d[c, i] = actual[i]
    } else {
        if (actual[3] > d[c, 3]) d[c, 3] = actual[3]
        d[c, 4] += actual[4]
    }
}
END {
    for (i=1; i<=c; i++)
        print d[i, 1], d[i, 2], d[i, 3], d[i, 4]
}' file
You get:
chr1 10495 10500 212
chr1 10587 10592 94
chr1 10639 10643 4
chr1 10668 10672 11
chr1 10697 10701 13
chr1 10726 10730 8
chr1 10755 10759 7
chr1 10784 10788 5
chr2 10856 10860 4
chr3 10932 10937 8
chr5 11056 11060 4
chr6 11155 11159 9

Related

awk to calculate difference between two files and output specific text based on value

I am trying to use awk to check whether each $2 in file1 falls between $2 and $3 of the matching $4 line of file2. If it does, $5 of the output is exon; if it does not, intron. I think the awk below will do that, but I am struggling to add a calculation so that if the difference is less than or equal to 10, $5 is splicing. I have added a worked example of line 1 below.
The 6th line is an example of the splicing, because the $2 value in file1 is 2 away from the $2 value in file2. My actual data is very large, with file2 always being several hundred thousand lines. File1 is variable but usually ~100 lines. The files are hardcoded in this example but will be supplied by a bash for loop, which will provide the input. Thank you :).
file1 tab-delimited with whitespace after $3 and $4
chr1 17345304 17345315 SDHB
chr1 17345516 17345524 SDHB
chr1 93306242 93306261 RPL5
chr1 93307262 93307291 RPL5
chrX 153295819 153296875 MECP2
chrX 153295810 153296830 MECP2
file2 tab-delimited
chr1 17345375 17345453 SDHB_cds_0_0_chr1_17345376_r 0 -
chr1 17349102 17349225 SDHB_cds_1_0_chr1_17349103_r 0 -
chr1 17350467 17350569 SDHB_cds_2_0_chr1_17350468_r 0 -
chr1 17354243 17354360 SDHB_cds_3_0_chr1_17354244_r 0 -
chr1 17355094 17355231 SDHB_cds_4_0_chr1_17355095_r 0 -
chr1 17359554 17359640 SDHB_cds_5_0_chr1_17359555_r 0 -
chr1 17371255 17371383 SDHB_cds_6_0_chr1_17371256_r 0 -
chr1 17380442 17380514 SDHB_cds_7_0_chr1_17380443_r 0 -
chr1 93297671 93297674 RPL5_cds_0_0_chr1_93297672_f 0 +
chr1 93298945 93299015 RPL5_cds_1_0_chr1_93298946_f 0 +
chr1 93299101 93299217 RPL5_cds_2_0_chr1_93299102_f 0 +
chr1 93300335 93300470 RPL5_cds_3_0_chr1_93300336_f 0 +
chr1 93301746 93301949 RPL5_cds_4_0_chr1_93301747_f 0 +
chr1 93303012 93303190 RPL5_cds_5_0_chr1_93303013_f 0 +
chr1 93306107 93306196 RPL5_cds_6_0_chr1_93306108_f 0 +
chr1 93307322 93307422 RPL5_cds_7_0_chr1_93307323_f 0 +
chrX 153295817 153296901 MECP2_cds_0_0_chrX_153295818_r 0 -
chrX 153297657 153298008 MECP2_cds_1_0_chrX_153297658_r 0 -
chrX 153357641 153357667 MECP2_cds_2_0_chrX_153357642_r 0 -
desired output tab-delimited
chr1 17345304 17345315 SDHB intron
chr1 17345516 17345524 SDHB intron
chr1 93306242 93306261 RPL5 intron
chr1 93307262 93307291 RPL5 intron
chrX 153295819 153296875 MECP2 exon
chrX 153295810 153296800 MECP2 splicing
awk
awk '
FNR==NR {
    a[$4];
    min[$4] = $2
    max[$4] = $3
    next
}
{
    split($4, array, "_")
    print $0, (array[1] in a) && ($2 >= min[array[1]] && $2 <= max[array[1]]) ? "exon" : "intron"
}' file1 OFS="\t" file2 > output
Example of line 1:
a[$4] = SDHB
min[$4] = 17345304
max[$4] = 17345315
array[1] = SDHB, 17345304 >= 17345375 && array[1] = SDHB, 17345315 <= 17345453 ---- intron
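Note that the posted attempt keeps only one min/max per gene (each assignment to min[$4]/max[$4] overwrites the previous line's values), and it reads file1 first rather than file2. One possible sketch of the full exon/intron/splicing logic reads file2 first and keeps every interval per gene; the 10-base "splicing" window is my interpretation of the requirement, checked only against the sample data above:

```shell
# Hypothetical sketch: read file2 (exon intervals) first, storing EVERY
# interval per gene, then classify each file1 position. A position inside an
# interval is "exon"; within 10 bases outside a boundary is "splicing";
# otherwise "intron". Assumes gene name is the part of $4 before the first "_".
awk -v OFS='\t' '
FNR == NR {                          # first file on the command line: file2
    split($4, t, "_"); g = t[1]      # extract the gene name
    n[g]++
    lo[g, n[g]] = $2; hi[g, n[g]] = $3
    next
}
{                                    # second file: file1 query positions
    g = $4; label = "intron"
    for (i = 1; i <= n[g]; i++) {
        d1 = $2 - lo[g, i]; d2 = hi[g, i] - $2
        if (d1 >= 0 && d2 >= 0) { label = "exon"; break }
        if ((d1 < 0 && -d1 <= 10) || (d2 < 0 && -d2 <= 10)) label = "splicing"
    }
    print $0, label
}' file2 file1
```

On the six sample lines this yields intron, intron, intron, intron, exon, splicing, matching the desired output labels.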

manipulating columns in a text file in awk

I have a tab-separated text file and want to do a math operation on one column and write a new tab-separated text file.
This is an example of my file:
chr1 144520803 144520804 12 chr1 144520813 58
chr1 144520840 144520841 12 chr1 144520845 36
chr1 144520840 144520841 12 chr1 144520845 36
chr1 144520848 144520849 14 chr1 144520851 32
chr1 144520848 144520849 14 chr1 144520851 32
I want to change the 4th column: divide every element in the 4th column by the sum of all elements in the 4th column, and then multiply by 1000000, as in the expected output.
expected output:
chr1 144520803 144520804 187500 chr1 144520813 58
chr1 144520840 144520841 187500 chr1 144520845 36
chr1 144520840 144520841 187500 chr1 144520845 36
chr1 144520848 144520849 218750 chr1 144520851 32
chr1 144520848 144520849 218750 chr1 144520851 32
I am trying to do that in awk using the following command, but it does not return what I want. Do you know how to fix it?
awk '{print $1 "\t" $2 "\t" $3 "\t" $4/{sum+=$4}*1000000 "\t" $5 "\t" $6 "\t" $7}' myfile.txt > new_file.txt
You need two passes: one to compute the sum, and a second to scale the field. Something like this:
$ awk -v OFS='\t' 'NR==FNR {sum+=$4; next}
{$4*=(1000000/sum)}1' file{,} > newfile
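If reading the input twice is not possible (for example, it arrives on a pipe), a one-pass sketch that buffers the lines in memory instead (an assumption: the file fits in memory):

```shell
# One-pass alternative: buffer every line, accumulate the column-4 sum,
# then rescale in END. Trades the second read for memory.
awk -v OFS='\t' '
{ line[NR] = $0; sum += $4 }
END {
    for (i = 1; i <= NR; i++) {
        $0 = line[i]             # re-split the buffered line into fields
        $4 *= 1000000 / sum      # modifying $4 rebuilds $0 with OFS
        print
    }
}' myfile.txt > new_file.txt
```

For the sample data the column-4 sum is 64, so 12 becomes 187500 and 14 becomes 218750, as in the expected output.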

Awk OR conditional not working

Input: A tab-separated input file with 15 columns where column 15 is an integer.
Output: The number of lines that satisfy the conditional.
My code:
$ closest-features --closest --no-overlaps --delim '\t' --dist --ec megatrans_enhancers.sorted.bed ../../data/alu_repeats.sorted.bed | awk -v OFS='\t' '{if ($15 <= 1000 || $15 >= -1000) print $0}' | wc -l
1188
The || conditional in this case is failing to work (the total number of lines in the file is 1188, and I know for certain at least some lines do not satisfy the condition), because if I remove the OR conditional then it suddenly works:
$ closest-features --closest --no-overlaps --delim '\t' --dist --ec megatrans_enhancers.sorted.bed ../../data/alu_repeats.sorted.bed | awk -v OFS='\t' '{if ($15 <= 1000) print $0}' | wc -l
926
Not sure what I'm doing wrong. Any advice?
Example Input to Awk command:
chr1 378268 378486 chr1-798_Enhancer 17.2 + chr1 375923 376219 AluY|SINE|Alu-HOMER529 0 + E:375923 0.044 -2050
chr1 1079471 1079689 chr1-929_Enhancer 14.6 - chr1 1071271 1071563 AluSx1|SINE|Alu-HOMER1669 0 - E:1071271 0.13 -7909
chr1 1080259 1080477 chr1-830_Enhancer 16.7 - chr1 1071271 1071563 AluSx1|SINE|Alu-HOMER1669 0 - E:1071271 0.13 -8697
chr1 6611744 6611962 chr1-241_Enhancer 46.6 + chr1 6611431 6611723 AluSc|SINE|Alu-HOMER10257 0 + E:6611431 0.089 -22
chr1 6959639 6959857 chr1-58_Enhancer 100.1 - chr1 6966612 6966911 AluSx|SINE|Alu-HOMER11041 0 - E:6966612 0.137 6756
chr1 6960593 6960811 chr1-202_Enhancer 51.6 - chr1 6966612 6966911 AluSx|SINE|Alu-HOMER11041 0 - E:6966612 0.137 5802
chr1 7447888 7448106 chr1-2_Enhancer 181.9 - chr1 7449489 7449799 AluSz|SINE|Alu-HOMER11879 0 + E:7449489 0.119 1384
chr1 10752461 10752679 chr1-131_Enhancer 65.4 - chr1 10752754 10753065 AluSq2|SINE|Alu-HOMER19455 0 + E:10752754 0.106 76
chr1 12485694 12485912 chr1-353_Enhancer 36.7 + chr1 12487328 12487634 AluSx3|SINE|Alu-HOMER23581 0 + E:12487328 0.085 1417
chr1 12486469 12486687 chr1-141_Enhancer 63.6 + chr1 12487328 12487634 AluSx3|SINE|Alu-HOMER23581 0 + E:12487328 0.085 642
Use an && condition instead: you want values that are both greater than or equal to -1000 and less than or equal to 1000. The || version is always true, because every number satisfies at least one of $15 <= 1000 or $15 >= -1000, so no line is ever filtered out.
Your_command | awk '$15<=1000 && $15>=-1000{count++} END{print count}'
Add -F"\t" to the above awk in case its input is TAB-delimited too. There is also no need to use wc -l after awk: the script counts the lines satisfying the condition in a variable named count and prints it at the very end of the input.
Also, for your provided samples the output comes to 3, which I believe is correct.
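The always-true || can be demonstrated on the column-15 values taken from the sample input:

```shell
# Every number is <= 1000 OR >= -1000, so the || filter is a no-op:
printf '%s\n' -2050 -7909 -22 6756 642 |
    awk '$1 <= 1000 || $1 >= -1000' | wc -l    # all 5 lines pass

# The && version keeps only values inside [-1000, 1000]:
printf '%s\n' -2050 -7909 -22 6756 642 |
    awk '$1 <= 1000 && $1 >= -1000' | wc -l    # only -22 and 642 pass
```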

append columns to tab delimited file using AWK

I have multiple files without headers that share the same first four columns and differ in the fifth column. Using awk, I have to combine them into a single final tab-delimited text file containing the four common columns followed by all the fifth columns, with headers as shown below.
File_1.txt
chr1 101845021 101845132 A 0
chr2 128205033 128205154 B 0
chr3 128205112 128205223 C 0
chr4 36259133 36259244 D 0
chr5 36259333 36259444 E 0
chr6 25497759 25497870 F 1
chr7 25497819 25497930 G 1
chr8 25497869 25497980 H 1
File_2.txt
chr1 101845021 101845132 A 6
chr2 128205033 128205154 B 7
chr3 128205112 128205223 C 7
chr4 36259133 36259244 D 7
chr5 36259333 36259444 E 10
chr6 25497759 25497870 F 11
chr7 25497819 25497930 G 11
chr8 25497869 25497980 H 12
File_3.txt
chr1 101845021 101845132 A 41
chr2 128205033 128205154 B 41
chr3 128205112 128205223 C 42
chr4 36259133 36259244 D 43
chr5 36259333 36259444 E 47
chr6 25497759 25497870 F 48
chr7 25497819 25497930 G 48
chr8 25497869 25497980 H 49
Expected Output file Final.txt
Part Start End Name File1 File2 File3
chr1 101845021 101845132 A 0 6 41
chr2 128205033 128205154 B 0 7 41
chr3 128205112 128205223 C 0 7 42
chr4 36259133 36259244 D 0 7 43
chr5 36259333 36259444 E 0 10 47
chr6 25497759 25497870 F 1 11 48
chr7 25497819 25497930 G 1 11 48
chr8 25497869 25497980 H 1 12 49
Files in same order
If it is safe to assume that the rows are in the same order in each file, then you can do the job fairly succinctly with:
awk '
FILENAME != oname { FN++; oname = FILENAME }
{ p[FNR] = $1; s[FNR] = $2; e[FNR] = $3; n[FNR] = $4; f[FN,FNR] = $5; N = FNR }
END {
    printf("%-8s %-12s %-12s %-4s %-5s %-5s %-5s\n",
           "Part", "Start", "End", "Name", "File1", "File2", "File3");
    for (i = 1; i <= N; i++)
    {
        printf("%-8s %-12d %-12d %-4s %-5d %-5d %-5d\n",
               p[i], s[i], e[i], n[i], f[1,i], f[2,i], f[3,i]);
    }
}' file_1.txt file_2.txt file_3.txt
The first line spots when you start on a new file, and increments the FN variable (so lines from file 1 can be tagged with FN == 1, etc). It records the file name in oname so it can spot changes.
The second line operates on each data line, storing the first four fields in the arrays p, s, e, n (indexed by record number within the current file), and records the fifth column in f (indexed by FN and record number). It records the current record number in the current file in N.
The END block prints out the heading, then for each row in the array (indexed from 1 to N), prints out the various fields.
The output is (unsurprisingly):
Part Start End Name File1 File2 File3
chr1 101845021 101845132 A 0 6 41
chr2 128205033 128205154 B 0 7 41
chr3 128205112 128205223 C 0 7 42
chr4 36259133 36259244 D 0 7 43
chr5 36259333 36259444 E 0 10 47
chr6 25497759 25497870 F 1 11 48
chr7 25497819 25497930 G 1 11 48
chr8 25497869 25497980 H 1 12 49
Files in different orders
If you can't rely on the records being in the same order in each file, you have to work harder. Assuming that the records in the first file are in the required order, the following script prints the records in that order:
awk '
FILENAME != oname { FN++; oname = FILENAME }
{
    key = $1 SUBSEP $2 SUBSEP $3 SUBSEP $4
    if (FN == 1)
        { p[key] = $1; s[key] = $2; e[key] = $3; n[key] = $4; f[FN,key] = $5; k[FNR] = key; N = FNR }
    else
    {
        if (key in p)
            f[FN,key] = $5
        else
            printf "Unmatched key (%s) in %s\n", key, FILENAME
    }
}
END {
    printf("%-8s %-12s %-12s %-4s %-5s %-5s %-5s\n",
           "Part", "Start", "End", "Name", "File1", "File2", "File3")
    for (i = 1; i <= N; i++)
    {
        key = k[i]
        printf("%-8s %-12d %-12d %-4s %-5d %-5d %-5d\n",
               p[key], s[key], e[key], n[key], f[1,key], f[2,key], f[3,key])
    }
}' "$@"
This is closely based on the previous script; the FN handling is identical. The SUBSEP variable is used to separate subscripts in a multi-index array. The variable key contains the same value that would
be generated by indexing an array z[$1,$2,$3,$4].
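That equivalence can be checked directly (a tiny sketch):

```shell
# a[x, y] is shorthand for a[x SUBSEP y]; building the key by hand
# addresses the same array element.
awk 'BEGIN {
    a["chr1", 100] = "seen"
    key = "chr1" SUBSEP 100
    if (key in a) print "same element:", a[key]
}'
# prints: same element: seen
```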
If working on the first file (FN == 1), the values in arrays p, s, e, n are created, indexed by key. The fifth column is recorded in f similarly. The order in which the keys appear in the file is recorded in array k, indexed by the (file) record number.
If working on the second or third file, the script checks whether the key is known, reporting it if it is not. Assuming it is known, the fifth column is again added to f.
The printing is similar, except it collects the keys in sequence from k, and then prints the relevant values.
Given these files:
file_4.txt
chr8 25497869 25497980 H 1
chr7 25497819 25497930 G 1
chr6 25497759 25497870 F 1
chr5 36259333 36259444 E 0
chr4 36259133 36259244 D 0
chr3 128205112 128205223 C 0
chr2 128205033 128205154 B 0
chr1 101845021 101845132 A 0
file_5.txt
chr2 128205033 128205154 B 7
chr8 25497869 25497980 H 12
chr3 128205112 128205223 C 7
chr1 101845021 101845132 A 6
chr6 25497759 25497870 F 11
chr4 36259133 36259244 D 7
chr7 25497819 25497930 G 11
chr5 36259333 36259444 E 10
file_6.txt
chr5 36259333 36259444 E 47
chr4 36259133 36259244 D 43
chr6 25497759 25497870 F 48
chr8 25497869 25497980 H 49
chr2 128205033 128205154 B 41
chr3 128205112 128205223 C 42
chr7 25497819 25497930 G 48
chr1 101845021 101845132 A 41
The script yields the output:
Part Start End Name File1 File2 File3
chr8 25497869 25497980 H 1 12 49
chr7 25497819 25497930 G 1 11 48
chr6 25497759 25497870 F 1 11 48
chr5 36259333 36259444 E 0 10 47
chr4 36259133 36259244 D 0 7 43
chr3 128205112 128205223 C 0 7 42
chr2 128205033 128205154 B 0 7 41
chr1 101845021 101845132 A 0 6 41
There are many circumstances that these scripts do not accommodate very thoroughly. For example, if the files are of different lengths; if there are repeated keys; if there are keys found in one or two files not found in the other(s); if the fifth column data is not numeric; if the second and third columns are not numeric; if there are only two files, or more than three files listed. The 'not numeric' issue is actually easily fixed; simply use %s instead of %d. But the scripts are fragile. They work in the ecosystems shown, but not very generally. The necessary fixes are not incredibly hard; they are a nuisance to have to code, though.
There could be more or less than 3 files
Extending the previous script to handle an arbitrary number of files, and to output tab-separated data instead of formatted (readable) data is not very difficult.
awk '
FILENAME != oname { FN++; file[FN] = oname = FILENAME }
{
    key = $1 SUBSEP $2 SUBSEP $3 SUBSEP $4
    if (FN == 1)
        { p[key] = $1; s[key] = $2; e[key] = $3; n[key] = $4; f[FN,key] = $5; k[FNR] = key; N = FNR }
    else
    {
        if (key in p)
            f[FN,key] = $5
        else
        {
            printf "Unmatched key (%s) in %s\n", key, FILENAME
            exit 1
        }
    }
}
END {
    printf("%s\t%s\t%s\t%s", "Part", "Start", "End", "Name")
    for (i = 1; i <= FN; i++) printf("\t%s", file[i])
    print ""
    for (i = 1; i <= N; i++)
    {
        key = k[i]
        printf("%s\t%s\t%s\t%s", p[key], s[key], e[key], n[key])
        for (j = 1; j <= FN; j++)
            printf("\t%s", f[j,key])
        print ""
    }
}' "$@"
The key point is that printf doesn't output a newline unless you tell it to do so, but print does output a newline. The code keeps a record of the actual file names for use in printing out the columns. It loops over the array of file data, assuming that there are the same number of lines in each file.
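A minimal illustration of that printf/print newline behaviour:

```shell
# printf writes exactly what the format string says; print appends ORS
# (a newline by default).
awk 'BEGIN {
    printf("%s\t%s", "Part", "Start")   # no newline yet: the row stays open
    printf("\t%s", "extra")             # keep appending cells to the same row
    print ""                            # terminate the row with a newline
}'
```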
Given 6 files as input (the three original files, a copy of the first file in reverse order, and permuted copies of the second and third files), the output has 6 columns of extra data, with the columns identified:
Part Start End Name file_1.txt file_2.txt file_3.txt file_4.txt file_5.txt file_6.txt
chr1 101845021 101845132 A 0 6 41 0 6 41
chr2 128205033 128205154 B 0 7 41 0 7 41
chr3 128205112 128205223 C 0 7 42 0 7 42
chr4 36259133 36259244 D 0 7 43 0 7 43
chr5 36259333 36259444 E 0 10 47 0 10 47
chr6 25497759 25497870 F 1 11 48 1 11 48
chr7 25497819 25497930 G 1 11 48 1 11 48
chr8 25497869 25497980 H 1 12 49 1 12 49
Assuming all 3 files are sorted, you can use the join command:
join -o "1.1,1.2,1.3,1.4,2.5,2.6,1.5" file3 <(join -o "1.1,1.2,1.3,1.4,1.5,2.5" file1 file2)
The -o option formats the output by selecting fields from both files: 1.x and 2.x refer to the first and second file given. For example, 1.1 refers to the first field of the first file.
Since join only accepts 2 files, bash process substitution <(...) is used to feed the result of the inner join in as the second file.
Another solution using paste and awk (still assuming files are sorted):
paste file* | awk '{print $1,$2,$3,$4,$5,$10,$15}'
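If the number of files is not fixed at three, the paste idea generalizes (a sketch assuming every file has exactly five columns, and tab-separated output):

```shell
# Paste any number of 5-column files side by side, then keep the first four
# columns followed by every fifth column ($5, $10, $15, ...).
paste File_*.txt | awk -v OFS='\t' '{
    out = $1 OFS $2 OFS $3 OFS $4
    for (i = 5; i <= NF; i += 5) out = out OFS $i
    print out
}'
```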

How to split awk field correctly

I have a file (test.bed) that looks like this (it might not be tab-separated):
chr1 10002 10116 id=1;frame=0;strand=+; 0 +
chr1 10116 10122 id=2;frame=0;strand=+; 0 +
chr1 10122 10128 id=3;frame=0;strand=+; 0 +
chr1 10128 10134 id=4;frame=0;strand=+; 0 +
chr1 10134 10140 id=5;frame=0;strand=+; 0 +
chr1 10140 10146 id=6;frame=0;strand=+; 0 +
chr1 10146 10182 id=7;frame=0;strand=+; 0 +
chr1 10182 10188 id=8;frame=0;strand=+; 0 +
chr1 10188 10194 id=9;frame=0;strand=+; 0 +
chr1 10194 10200 id=10;frame=0;strand=+; 0 +
I want to produce the following output (which should be tab-separated):
chr1 10002 10116 id=1 0 +
chr1 10116 10122 id=2 0 +
chr1 10122 10128 id=3 0 +
chr1 10128 10134 id=4 0 +
chr1 10134 10140 id=5 0 +
chr1 10140 10146 id=6 0 +
chr1 10146 10182 id=7 0 +
chr1 10182 10188 id=8 0 +
chr1 10188 10194 id=9 0 +
chr1 10194 10200 id=10 0 +
I have tried with the following code:
awk 'OFS="\t" split ($0, a, ";"){print a[1],$5,$6}' test.bed
But then I get:
chr1 10002 10116 id=1 40 4+
chr1 10116 10122 id=2 40 4+
chr1 10122 10128 id=3 40 4+
chr1 10128 10134 id=4 40 4+
chr1 10134 10140 id=5 40 4+
chr1 10140 10146 id=6 40 4+
chr1 10146 10182 id=7 40 4+
chr1 10182 10188 id=8 40 4+
chr1 10188 10194 id=9 40 4+
chr1 10194 10200 id=10 40 4+
What am I doing wrong? Somehow the number '4' is added to the last two fields. I thought the '4' might have something to do with splitting on the 4th field, but I produced a similar file where the 3rd field was the one split and still got the number '4' added to the last two fields. I am rather new to awk, so I guess it is a syntax error. Any help would be appreciated.
If you set your field separator to whitespace or semicolons, you won't have to handle the splitting yourself:
$ awk '{print $1,$2,$3,$4,$8,$9}' FS='[[:space:]]+|;' OFS='\t' file
chr1 10002 10116 id=1 0 +
chr1 10116 10122 id=2 0 +
chr1 10122 10128 id=3 0 +
chr1 10128 10134 id=4 0 +
chr1 10134 10140 id=5 0 +
chr1 10140 10146 id=6 0 +
chr1 10146 10182 id=7 0 +
chr1 10182 10188 id=8 0 +
chr1 10188 10194 id=9 0 +
chr1 10194 10200 id=10 0 +
As for what you are doing wrong in:
awk 'OFS="\t" split ($0, a, ";"){print a[1],$5,$6}'
The syntax of awk is condition{block}; setting the value of OFS and splitting are statements, not conditions, and should be inside the block. As written, the string "\t" is concatenated with the return value of split (4, the number of pieces produced), and the result "\t4" is both assigned to OFS and used as the pattern. That is exactly where the stray '4' comes from: every output field separator is now a tab followed by a 4.
However, you really don't need to set the value of OFS on every line, so it should be initialized only once. You can do this using the -v option, in the BEGIN block, or after the script.
Valid alternatives:
$ awk -v OFS='\t' '{split($0,a,";");print a[1],$5,$6}' file
$ awk 'BEGIN{OFS="\t"}{split($0,a,";");print a[1],$5,$6}' file
$ awk '{split ($0,a,";");print a[1],$5,$6}' OFS='\t' file
Try this :
awk -F\; '{print $1,$4}' test.bed