How to format a large txt file to BED - awk

I am trying to format CpG methylation calls from the R package "methylKit" into simple BED format. Since it is a large file, I cannot do it in Excel. I also tried SeqMonk, but it does not allow me to export the data in the format I want. Linux awk/sed might be a good option, but I am new to them as well. Basically, I need to trim the "chr" prefix from the chr column, add a "stop" column, convert "F" to "+" and "R" to "-", and keep freqC with 2 decimal places. Can you please help?
From:
chrBase chr base strand coverage freqC freqT
chr1.339 chr1 339 F 7 0.00 100.00
chr1.183 chr1 183 R 4 0.00 100.00
chr1.192 chr1 192 R 6 0.00 100.00
chr1.340 chr1 340 R 5 40.00 60.00
chr1.10007 chr1 10007 F 13 53.85 46.15
chr1.10317 chr1 10317 F 8 0.00 100.00
chr1.10346 chr1 10346 F 9 88.89 11.11
chr1.10349 chr1 10349 F 9 88.89 11.11
To:
chr start stop freqc Coverage strand
1 67678 67679 0 3 -
1 67701 67702 0 3 -
1 67724 67725 0 3 -
1 67746 67747 0 3 -
1 67768 67769 0.333333 3 -
1 159446 159447 0 3 +
1 162652 162653 0 3 +
1 167767 167768 0.666667 3 +
1 167789 167790 0.666667 3 +
1 167797 167798 0 3 +

This should do what you actually want, producing a BED6 file with the methylation percentage in the score column:
$ cat foo.awk
BEGIN{OFS="\t"}
{if(NR>1) {
    if($4=="F") {
        strand="+"
    } else {
        strand="-"
    }
    # gsub() returns the number of substitutions, so strip "chr" in place and print $2 itself
    gsub("chr", "", $2)
    print $2,$3-1,$3,$1,$6,strand,$5
}}
That would then be run with:
awk -f foo.awk input.txt > output.bed
The additional column 7 is the coverage, since genome viewers will only display a single score column:
1 338 339 chr1.339 0.00 + 7
1 182 183 chr1.183 0.00 - 4
1 191 192 chr1.192 0.00 - 6
1 339 340 chr1.340 40.00 - 5
1 10006 10007 chr1.10007 53.85 + 13
1 10316 10317 chr1.10317 0.00 + 8
1 10345 10346 chr1.10346 88.89 + 9
1 10348 10349 chr1.10349 88.89 + 9
You can tweak that further as needed.
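For example, if a downstream tool needs the BED coordinate-sorted (a common requirement), one extra sort does it; a sketch, reusing the output.bed name from above:
sort -k1,1 -k2,2n output.bed > output.sorted.bed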

It is not entirely clear exactly what you want, since your "From" data does not correspond to what you show as your "To" results, but if what you are showing is just the general format change, then from the "From" data you want to:
discard field 1,
retrieve the chromosome number from the end of field 2,
if the 4th field is "F" make it "+", else if it is "R" make it "-", otherwise leave it unchanged,
use the 3rd field as "start" and the 3rd field + 1 as "stop" (adjust whether to add or subtract 1 as needed to get the desired "start" and "stop" values),
print the 6th field as "freqc",
output the 5th field as "Coverage", and finally
output the modified 4th field as "strand".
If that is your goal, then with your "From" data in a file named from, you can do something like the following:
awk '
BEGIN { OFS="\t"; print "chr","start","stop","freqc","Coverage","strand" }
FNR > 1 {
    match($2, /[[:digit:]]+$/, arr)
    if ($4 == "F")
        $4 = "+"
    else if ($4 == "R")
        $4 = "-"
    print arr[0], $3, $3 + 1, $6, $5, $4
}
' from
Explanation: the BEGIN rule is run before awk starts processing records (lines) from the file. Above, it simply sets the Output Field Separator to a tab and prints the heading.
The FNR > 1 condition (pattern) on the second rule processes the from file from the 2nd record (line) on, skipping the heading row. FNR is the record number within the current input file (as opposed to NR, which counts records across all input files).
match($2, /[[:digit:]]+$/, arr) captures the trailing digits of the second field into the first element of arr (i.e. arr[0]); it also sets the RSTART and RLENGTH built-in variables (not needed here), which tell you where the first digit starts and how many digits there are. Note that the three-argument form of match() is a GNU awk extension.
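A quick way to see what that call produces (a throwaway one-liner, assuming GNU awk for the three-argument match()):
$ echo "chr1" | gawk '{ match($1, /[[:digit:]]+$/, arr); print arr[0], RSTART, RLENGTH }'
1 4 1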
The if and else if statements handle the "F" to "+" and "R" to "-" change. Finally, the print statement prints the modified values and unchanged fields in the order specified above.
Example Output
Running the above on your original "From" data will produce:
chr start stop freqc Coverage strand
1 339 340 0.00 7 +
1 183 184 0.00 4 -
1 192 193 0.00 6 -
1 340 341 40.00 5 -
1 10007 10008 53.85 13 +
1 10317 10318 0.00 8 +
1 10346 10347 88.89 9 +
1 10349 10350 88.89 9 +
Let me know if this is close to what you explained in your question, and if not, drop a comment below.
The GNU Awk User's Guide is a great gawk/awk reference.

Related

Using awk to select rows with a specific value in column greater than x

I tried to use awk to select all rows with a value greater than 98 in the third column. In the output, only lines with values between 98 and 98.99... were selected; lines with a value greater than 98.99 were not.
I would like to extract all lines with a value greater than 98 including 99, 100 and so on.
Here are my code and my input format:
for i in *input.file; do awk '$3>98' $i >{i/input./output.}; done
A chr11 98.80 83 1 0 2 84
B chr7 95.45 22 1 0 40 61
C chr7 88.89 27 0 1 46 72
D chr6 100.00 20 0 0 1 20
Expected Output
A chr11 98.80 83 1 0 2 84
D chr6 100.00 20 0 0 1 20
Okay, if you have a series of files matching *input.file and you want to select the lines where $3 > 98 and write them to files with the same prefix but with output.file as the rest of the filename, you can use:
awk '$3 > 98 {
    match(FILENAME, /input\.file$/)
    print $0 > (substr(FILENAME, 1, RSTART-1) "output.file")
}' *input.file
This uses match to find the index where input.file begins, and then uses substr to take the part of the filename before that, appending "output.file" to build the final output filename.
match() sets RSTART to the index where input.file begins in the current filename, which substr then uses to truncate the current filename at that point (the parentheses around the filename expression keep the concatenation unambiguous in the redirection). See GNU awk String Functions for complete details.
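To see those two calls in isolation (a throwaway one-liner using one of the example filenames shown below):
$ awk 'BEGIN { f = "v1input.file"; match(f, /input\.file$/); print substr(f, 1, RSTART-1) "output.file" }'
v1output.file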
For example, if you had the input files:
$ ls -1 *input.file
v1input.file
v2input.file
Both with your example content:
$ cat v1input.file
A chr11 98.80 83 1 0 2 84
B chr7 95.45 22 1 0 40 61
C chr7 88.89 27 0 1 46 72
D chr6 100.00 20 0 0 1 20
Running the awk command above would result in two output files:
$ ls -1 *output.file
v1output.file
v2output.file
Containing the records where the third field was greater than 98:
$ cat v1output.file
A chr11 98.80 83 1 0 2 84
D chr6 100.00 20 0 0 1 20

How to loop awk command over row values

I would like to use awk to search for a particular word in the first column of a table and print the value in the 6th column. I understand how to do this searching one word at a time, using something along the lines of:
awk '$1 == "<insert-word>" { print $6 }' file.txt
But I was wondering if it is possible to loop this over a list of words in a row?
For example If I had a table like file1.txt below:
cat file1.txt
dna1 dna4 dna5
dna3 dna6 dna2
dna7 dna8 dna9
Could I loop over each value in row 1 and search for this word in column 1 of file2.txt below, each time printing the value of column 6? Then do this for row 2, 3 and so on...
cat file2
dna1 0 229 7 0 4 0 0
dna2 0 296 39 2 1 3 100
dna3 0 255 15 0 6 0 0
dna4 0 209 3 0 0 0 0
dna5 0 253 14 2 3 7 100
dna6 0 897 629 7 8 1 100
dna7 0 214 4 0 9 0 0
dna8 0 255 15 0 2 0 0
dna9 0 606 338 8 3 1 100
So, for example, looping the awk over row 1 of file1 would return the numbers 4, 0 and 3.
Looping the command over row 2 would return the numbers 6, 8 and 1,
and finally looping over row 3 would return the numbers 9, 2 and 3.
An example output might be
4 0 3
6 8 1
9 2 3
What I would really like to do is sum the numbers returned for each row. I just wasn't sure if this would be possible...
An example output of this would be
7
15
14
But I am not worried if this step isn't possible using awk as I could just do it separately
Hope this makes sense
Cheers
Ollie
Yes, you can give awk multiple input files. For your example:
awk 'NR==FNR{a[$1]=a[$2]=1;next}a[$1]{print $6}' file1 file2
I didn't test the above one-liner, but it should work; at least you get the idea.
If you don't know how many columns are in your file1 and, as you said, you want to loop over all of them:
awk 'NR==FNR{for(x=1;x<=NF;x++)a[$x]=1;next}a[$1]{print $6}' file1 file2
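For readers new to the NR==FNR idiom used in these one-liners: NR counts records across all input files, while FNR restarts at 1 for each file, so NR==FNR is only true while the first file is being read. A quick illustration with two throwaway files (t1 and t2 are hypothetical names):
$ printf 'a\nb\n' > t1; printf 'x\ny\nz\n' > t2
$ awk '{ print FILENAME, NR, FNR }' t1 t2
t1 1 1
t1 2 2
t2 3 1
t2 4 2
t2 5 3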
Update for the new requirement:
awk 'NR==FNR{a[$1]=$6;next}{for(i=1;i<=NF;i++)s+=a[$i];print s;s=0}' f2 f1
The output of the above one-liner (where f1 and f2 are your example file1 and file2):
7
15
14
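Spelled out with comments (the same logic as the one-liner above, just reformatted; file2 and file1 are the question's files):
awk '
# first pass (NR==FNR is only true for the first file given): map each ID to its column 6 value
NR == FNR { a[$1] = $6; next }
# second pass (file1): for every ID on the row, add its stored value and print the row total
{
    s = 0
    for (i = 1; i <= NF; i++)
        s += a[$i]
    print s
}
' file2 file1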

Extract lines from file 2 whose values in two columns match those in two columns of file 1

I am trying to use awk to extract the lines of file2 whose values in columns $132 and $133 match those in $1 and $2 of file1, and create an output with those lines from file2 (which has a lot of columns compared with file1).
File 1
1 6727802 TTC T 0/1 0/0 0/0
2 12887332 C A 0/1 0/0 0/0
File 2 (it has a great number of columns and also a lot of lines; I only show one of them)
1 6727803 6727804 TC - exonic DNAJC11 frameshift deletion DNAJC11:NM_018198:exon4:c.343_344del:p.E115fs exonic ENSG00000007923 frameshift deletion ENSG00000007923:ENST00000377577:exon4:c.343_344del:p.E115fs,ENSG00000007923:ENST00000426784:exon4:c.343_344del:p.E115fs,ENSG00000007923:ENST00000294401:exon4:c.343_344del:p.E115fs,ENSG00000007923:ENST00000542246:exon4:c.229_230del:p.E77fs,ENSG00000007923:ENST00000451196:exon3:c.271_272del:p.E91fs,ENSG00000007923:ENST00000377573:exon3:c.73_74del:p.E25fs,ENSG00000007923:ENST00000349363:exon3:c.229_230del:p.E77fs Score=562;lod=257 614827 rs374290353 224 0.0020 0.0001 0.0001 0 0 0 0.0006 6.805e-05 0 0.0012 0.0010 0.0012 0.0012 0.0010 0.0003 0.0014 0.0019 0.0020 MU9804 GBM-US|1|268|0.00373,PRAD-US|1|256|0.00391,LGG-US|1|283|0.00353,CESC-US|1|194|0.00515,BRCA-US|1|955|0.00105,SKCM-US|1|335|0.00299,COAD-US|1|216|0.00463 ID=COSM426618,COSM426619;OCCURENCE=2(NS),1(large_intestine),1(breast) het 280660 129 1 6727802 rs374290353 TTC T 280660 PASS 1 0.164634 2 -0.246 0 1498874 0 -0.0829 59.33 0.452 1.04 -0.079 0.72 3.49 FS 0/1 113 12 129 68 68,0,4158
I use the following awk code with success to extract the lines from file2 whose values in columns $1, $2 match those in $1, $2 of file1:
awk 'NR == FNR { a[$1, $2]++; next } a[$1, $2]' 'file1' 'file2' > file3
But now I need to extract all lines from file2 where $132 (file2) = $1 (file1) and $133 (file2) = $2 (file1). I tried to change the code in different ways with no success. I would appreciate your help; I am new to awk!
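A minimal adaptation of that command for the new key columns (a sketch, assuming the same whitespace-delimited layout): the first rule still indexes file1 by its first two fields, while the second rule now looks up each file2 line by its 132nd and 133rd fields instead of its first two.
awk 'NR == FNR { a[$1, $2]++; next } a[$132, $133]' 'file1' 'file2' > file3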

Use awk to sum or average for each unique ID

Can anyone tell me how to use awk to calculate either the sum of two individual columns or the average of one column for each unique ID?
Input
chr1 3661532 3661533 0.0 5 0 chr1 3661529 3662079 NM_01011874
chr1 3661534 3661535 0.2 5 1 chr1 3661529 3662079 NM_01011874
chr1 3661537 3661538 0.0 5 0 chr1 3661529 3662079 NM_01011874
chr1 3661559 3661560 0.0 6 0 chr1 3661529 3662079 NM_01011874
chr2 4661532 4661533 0.0 8 0 chr1 4661532 4661533 NM_00175642
chr2 6661534 6661535 0.2 5 2 chr1 6661534 6661535 NM_00175642
chr2 2661537 2661538 0.0 5 0 chr1 2661537 2661538 NM_00175642
chr2 9661559 9661560 0.0 7 0 chr1 9661559 9661560 NM_00175642
Output (sum of $5 and $6) for each unique ID:
NM_01011874 21 1
NM_00175642 25 2
or average of $4 for each unique ID
NM_01011874 0.0476
NM_00175642 0.08
Also, if you could breakdown the components of the solution I would be grateful. I'm a molecular biologist with minimal bioinformatics training.
Sum of columns 5 and 6 per ID:
awk '{sum5[$10] += $5; sum6[$10] += $6}; END{ for (id in sum5) { print id, sum5[id], sum6[id] } }' < /tmp/input
NM_00175642 25 2
NM_01011874 21 1
Explained: $10 is the ID field; $5 and $6 are columns 5 and 6. We build two arrays for summing columns 5 and 6 (awk arrays are indexed by strings, so we can use the ID field as the key). Once we've processed all the lines/records, we iterate through the array keys (the ID strings) and print the values stored at each index.
Average of column 4 per ID:
awk '{sum4[$10] += $4; count4[$10]++}; END{ for (id in sum4) { print id, sum4[id]/count4[id] } }' < /tmp/input
NM_00175642 0.05
NM_01011874 0.05
Explained: Very similar to the summing example. We keep a sum of column 4 per id, and a count of records seen for each id. At the end, we iterate through the ids and print the sum/count.
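Note that the expected averages in the question (0.0476 and 0.08) look like the sum of column 6 divided by the sum of column 5, rather than a plain mean of column 4; if that weighted ratio is what is wanted, a sketch in the same style (it reproduces the question's numbers on the sample input):
awk '{sum5[$10] += $5; sum6[$10] += $6}; END{ for (id in sum5) { print id, sum6[id]/sum5[id] } }' < /tmp/input
NM_00175642 0.08
NM_01011874 0.047619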
I don't do much with awk; I find Perl much better for small scripts, but this should be a good starting point.

Obtaining "consensus" results from two different files using awk

I have file1 as the result of a first operation; it has the following structure:
201 12 0.298231 8.8942
206 13 -0.079795 0.6367
101 34 0.86348 0.7456
301 15 0.215355 4.6378
303 16 0.244734 5.9895
and file2 as the result of a different operation; it has the same type of structure.
File 2 sample:
204 60 -0.246038 6.0535
304 83 -0.246209 6.0619
101 34 -0.456629 6.0826
211 36 -0.247003 6.1011
305 83 -0.247134 6.1075
206 46 -0.247485 6.1249
210 39 -0.248066 6.1537
107 41 -0.248201 6.1603
102 20 -0.248542 6.1773
I would like to select the field 1 and 2 values whose field 3 value is higher than a threshold in file1 (0.8), and then, for those selected field 1 and 2 values, keep the ones whose field 3 value in file2 is higher than another threshold in absolute value (abs(x) >= 0.4).
Note that although files 1 and 2 have the same structure, their field 1 and 2 values are not the same (not the same number of lines, etc.).
Can you do this with awk?
desired output
101 34
If you combine awk with other Unix commands you can do the following:
sort file1.txt > sorted1.txt
sort file2.txt > sorted2.txt
Sorting will allow you to use join on the first field (which I assume is unique). In the joined output, field 3 of file1 is $3 and field 3 of file2 is $6. Using awk you can write the following:
join sorted1.txt sorted2.txt | awk 'function abs(value){return (value<0?-value:value)} $3 >= 0.8 && abs($6) >= 0.4 {print $1"\t"$2}'
In essence, in the awk you first define a function to deal with absolute values, then you print fields 1 and 2 only for the lines that meet the criteria you detailed on $3 and $6 (formerly field 3 of file1 and file2, respectively).
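For reference, with the sample data above the join step produces these two joined records (the join key, then the remaining fields of file1, then those of file2), which is why file2's third field ends up in $6:
101 34 0.86348 0.7456 34 -0.456629 6.0826
206 13 -0.079795 0.6367 46 -0.247485 6.1249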
Hope this helps...