Extract lines from file 2 whose values in two columns match values in two columns of file 1 - awk

I am trying to use awk to extract lines from file2 whose values in columns $132 and $133 match those in $1 and $2 of file 1, and create an output with those lines from file 2 (which has a lot of columns compared with file 1).
File 1
1 6727802 TTC T 0/1 0/0 0/0
2 12887332 C A 0/1 0/0 0/0
File 2 (it has a great number of columns and also a lot of lines; I only show one of them)
1 6727803 6727804 TC - exonic DNAJC11 frameshift deletion DNAJC11:NM_018198:exon4:c.343_344del:p.E115fs exonic ENSG00000007923 frameshift deletion ENSG00000007923:ENST00000377577:exon4:c.343_344del:p.E115fs,ENSG00000007923:ENST00000426784:exon4:c.343_344del:p.E115fs,ENSG00000007923:ENST00000294401:exon4:c.343_344del:p.E115fs,ENSG00000007923:ENST00000542246:exon4:c.229_230del:p.E77fs,ENSG00000007923:ENST00000451196:exon3:c.271_272del:p.E91fs,ENSG00000007923:ENST00000377573:exon3:c.73_74del:p.E25fs,ENSG00000007923:ENST00000349363:exon3:c.229_230del:p.E77fs Score=562;lod=257 614827 rs374290353 224 0.0020 0.0001 0.0001 0 0 0 0.0006 6.805e-05 0 0.0012 0.0010 0.0012 0.0012 0.0010 0.0003 0.0014 0.0019 0.0020 MU9804 GBM-US|1|268|0.00373,PRAD-US|1|256|0.00391,LGG-US|1|283|0.00353,CESC-US|1|194|0.00515,BRCA-US|1|955|0.00105,SKCM-US|1|335|0.00299,COAD-US|1|216|0.00463 ID=COSM426618,COSM426619;OCCURENCE=2(NS),1(large_intestine),1(breast) het 280660 129 1 6727802 rs374290353 TTC T 280660 PASS 1 0.164634 2 -0.246 0 1498874 0 -0.0829 59.33 0.452 1.04 -0.079 0.72 3.49 FS 0/1 113 12 129 68 68,0,4158
I use the following awk code successfully to extract lines from file 2 whose values in columns $1, $2 match those in $1, $2 of file 1:
awk 'NR == FNR { a[$1, $2]++; next } a[$1, $2]' 'file1' 'file2' > file3
But now I need to extract all lines from file2 where $132 (file2) = $1 (file1) and $133 (file2) = $2 (file1). I tried to change the code in different ways with no success. I would appreciate your help; I am new to awk!
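A minimal adaptation of the working command keeps building the lookup table from $1, $2 of file1 and only changes the test on file2 to use $132, $133 — a sketch, assuming both files are whitespace-separated as shown:
awk 'NR == FNR { a[$1, $2]; next } ($132, $133) in a' 'file1' 'file2' > file3
Using the ($132, $133) in a membership test (rather than a bare a[$132, $133] lookup) avoids creating empty entries in the array as a side effect.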

how to format a large txt file to bed

I am trying to format CpG methylation calls from the R package "methylKit" to simple BED format. Since it is a large file, I cannot do it in Excel. I also tried SeqMonk, but it does not allow me to export the data in the format I want. Linux awk/sed might be a good option, but I am new to them as well. Basically, I need to trim the "chr" column, add a "stop" column, convert "F" to "+" / "R" to "-", and keep freqC with 2 decimal places. Can you please help?
From:
chrBase chr base strand coverage freqC freqT
chr1.339 chr1 339 F 7 0.00 100.00
chr1.183 chr1 183 R 4 0.00 100.00
chr1.192 chr1 192 R 6 0.00 100.00
chr1.340 chr1 340 R 5 40.00 60.00
chr1.10007 chr1 10007 F 13 53.85 46.15
chr1.10317 chr1 10317 F 8 0.00 100.00
chr1.10346 chr1 10346 F 9 88.89 11.11
chr1.10349 chr1 10349 F 9 88.89 11.11
To:
chr start stop freqc Coverage strand
1 67678 67679 0 3 -
1 67701 67702 0 3 -
1 67724 67725 0 3 -
1 67746 67747 0 3 -
1 67768 67769 0.333333 3 -
1 159446 159447 0 3 +
1 162652 162653 0 3 +
1 167767 167768 0.666667 3 +
1 167789 167790 0.666667 3 +
1 167797 167798 0 3 +
This should do what you actually want, producing a BED6 file with the methylation percentage in the score column:
$ cat foo.awk
BEGIN{OFS="\t"}
NR>1 {
    if($4=="F")
        strand="+"
    else
        strand="-"
    gsub("chr", "", $2)   # strip the "chr" prefix in place; note gsub() returns a count, not the modified string
    print $2, $3-1, $3, $1, $6, strand, $5
}
That would then be run with:
awk -f foo.awk input.txt > output.bed
The additional column 7 is the coverage, since genome viewers will only display a single score column:
1 338 339 chr1.339 0.00 + 7
1 182 183 chr1.183 0.00 - 4
1 191 192 chr1.192 0.00 - 6
1 339 340 chr1.340 40.00 - 5
1 10006 10007 chr1.10007 53.85 + 13
1 10316 10317 chr1.10317 0.00 + 8
1 10345 10346 chr1.10346 88.89 + 9
1 10348 10349 chr1.10349 88.89 + 9
You can tweak that further as needed.
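If the result is destined for a genome browser or tabix indexing, BED output generally needs to be coordinate-sorted first; a typical follow-up step, assuming the file names used above:
sort -k1,1 -k2,2n output.bed > output.sorted.bed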
It is not entirely clear exactly what you want, since your "From" data does not correspond to what you show as your "To" results. But if what you are showing is the general format change, then from the "From" data you want to:
discard field 1,
retrieve the "chr" value from the end of field 2,
if the 4th field is "F" make it "+" else if it is "R" make it "-" otherwise leave it unchanged,
use the 3rd field as "start" and 3rd + 1 as "stop" (adjust whether to add or subtract 1 as needed to get the desired "start" and "stop" values),
print 6th field as "freqc",
output 5th field as "Coverage", and finally
output modified 4th field as "strand"
If that is your goal, then with your "From" data in the file named from, you can do something like the following:
awk '
BEGIN { OFS="\t"; print "chr","start","stop","freqc","Coverage","strand" }
FNR > 1 {
match($2, /[[:digit:]]+$/, arr)
if ($4 == "F")
$4 = "+"
else if ($4 == "R")
$4 = "-"
print arr[0], $3, $3 + 1, $6, $5, $4
}
' from
Explanation: the BEGIN rule is run before awk starts processing records (lines) from the file. Above, it simply sets the Output Field Separator to tab and prints the heading.
The condition (pattern) FNR > 1 on the second rule processes the from file from the 2nd record (line) on, skipping the heading row. FNR is awk's way of saying File Record Number (even though it looks like the N and R are backwards).
match($2, /[[:digit:]]+$/, arr) splits the trailing digits from the second field into the first element of arr (i.e. arr[0]) and, though not relevant here, sets the RSTART and RLENGTH internal variables telling you which character the first digit starts on and how many digits there are.
The if and else if statements do the "F" to "+" and "R" to "-" change. Finally, the print statement prints the modified values and unchanged fields in the order specified above.
Example Output
Running the above on your original "From" data will produce:
chr start stop freqc Coverage strand
1 339 340 0.00 7 +
1 183 184 0.00 4 -
1 192 193 0.00 6 -
1 340 341 40.00 5 -
1 10007 10008 53.85 13 +
1 10317 10318 0.00 8 +
1 10346 10347 88.89 9 +
1 10349 10350 88.89 9 +
Let me know if this is close to what you explained in your question, and if not, drop a comment below.
The GNU Awk User's Guide is a great gawk/awk reference.
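One portability note: the three-argument match() used above is a gawk extension. On a strictly POSIX awk, a sketch using sub() on a copy of the field achieves the same result:
awk '
BEGIN { OFS="\t"; print "chr","start","stop","freqc","Coverage","strand" }
FNR > 1 {
    chrom = $2
    sub(/^chr/, "", chrom)                           # strip the leading "chr" on a copy of $2
    strand = ($4 == "F" ? "+" : $4 == "R" ? "-" : $4)
    print chrom, $3, $3 + 1, $6, $5, strand
}
' from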

How to insert column from a file to another file at multiple places

I would like to insert columns 1 and 2 from file 2 into file 1 after every second column, through to the last column.
File1.txt (tab-separated, column range from 1-2400 and cell range from 1-4500)
ID IMPACT ID IMPACT ID IMPACT
51 0.288 128 0.4557 156 0.85
625 0.858 15 -0.589 51 0.96
8 0.845 7 0.5891
File2.txt (consists of only two tab-separated columns with 19000 rows)
ID IMPACT
18 -1
165 -1
41 -1
11 -1
Output file
ID IMPACT ID IMPACT ID IMPACT ID IMPACT ID IMPACT ID IMPACT
51 0.288 18 -1 128 0.4557 18 -1 156 0.85 18 -1
625 0.858 165 -1 15 -0.589 165 -1 51 0.96 165 -1
8 0.845 41 -1 7 0.5891 41 -1 41 -1
11 -1 11 -1 11 -1
I tried the commands below but they are not working
paste <(cut -f 1,2 File1.txt) <(cut -f 1,2 File2.txt) <(cut -f 3,4 File1.txt) <(cut -f 1,2 File2.txt)......... > File3
Problem: it starts shifting the File2.txt column values into different columns after the last row of File1.txt
paste File1.txt File2.txt > File3.txt
awk '{print $1 "\t" $2 "\t" $3 "\t" $4 "\t" $5 "\t" $6 "\t" $3 "\t" $4....}' File3.txt > File4.txt
This does the job; however, it mixes up the values of File1.txt from one column to another.
I tried everything but failed to succeed.
Any help would be appreciated; bash or pandas would be better. Thanks in advance.
$ awk '
BEGIN {
FS=OFS="\t" # tab-separated data
}
NR==FNR { # hash fields of file2
a[FNR]=$1 # index with record numbers FNR
b[FNR]=$2
next
}
{ # print file1 records with file2 fields
print $1,$2,a[FNR],b[FNR],$3,$4,a[FNR],b[FNR],$5,$6,a[FNR],b[FNR]
}
END { # in the end
for(i=(FNR+1);(i in a);i++) # deal with extra records of file2
print "","",a[i],b[i],"","",a[i],b[i],"","",a[i],b[i]
}' file2 file1
Output:
ID IMPACT ID IMPACT ID IMPACT ID IMPACT ID IMPACT ID IMPACT
51 0.288 18 -1 128 0.4557 18 -1 156 0.85 18 -1
625 0.858 165 -1 15 -0.589 165 -1 51 0.96 165 -1
8 0.845 41 -1 7 0.5891 41 -1 41 -1
11 -1 11 -1 11 -1
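The print statement above hardcodes three ID/IMPACT pairs ($1 through $6). Since the question mentions up to 2400 columns, a hedged generalization is to read the pair count from file1's header row and loop — a sketch assuming the same tab-separated layout as above:
awk '
BEGIN { FS=OFS="\t" }
NR==FNR { a[FNR]=$1 OFS $2; next }   # hash each file2 row as "ID<TAB>IMPACT"
FNR==1  { pairs = NF/2 }             # number of ID/IMPACT pairs in file1
{
    out = ""
    for (i = 1; i <= pairs; i++)     # emit each file1 pair followed by the file2 pair
        out = out $(2*i-1) OFS $(2*i) OFS a[FNR] (i < pairs ? OFS : "")
    print out
}
END {                                # extra file2 records get empty file1 fields
    for (j = FNR+1; j in a; j++) {
        out = ""
        for (i = 1; i <= pairs; i++)
            out = out OFS OFS a[j] (i < pairs ? OFS : "")
        print out
    }
}' file2 file1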

Modify tab delimited txt file

I want to modify a tab-delimited txt file using Linux commands (sed, awk, or any other method).
This is an example of tab delimited txt file which I want to modify for R boxplot input:
----start of input format---------
chr8 38277027 38277127 Ex8_inner
25425 8 100 0.0800000
chr8 38277027 38277127 Ex8_inner
25426 4 100 0.0400000
chr9 38277027 38277127 Ex9_inner
25427 9 100 0.0900000
chr9 38277027 38277127 Ex9_inner
25428 1 100 0.0100000
chr10 38277027 38277127 Ex10_inner
30935 1 100 0.0100000
chr10 38277027 38277127 Ex10_inner
31584 1 100 0.0100000
all 687 1 1000 0.0010000
all 694 1 1000 0.0010000
all 695 1 1000 0.0010000
all 697 1 1000 0.0010000
all 699 6 1000 0.0060000
all 700 2 1000 0.0020000
all 723 7 1000 0.0070000
all 740 8 1000 0.0080000
all 742 1 1000 0.0010000
all 761 5 1000 0.0050000
all 814 2 1000 0.0020000
all 821 48 1000 0.0480000
------end of input file format------
I want it modified so that the 4th column of the odd rows becomes the 1st column, and the 2nd column of the even rows (whose 1st column is blank) becomes the 2nd column. Rows starting with "all" get deleted.
This is how output file should look:
-----start of the output file----
Ex8_inner 25425
Ex8_inner 25426
Ex9_inner 25427
Ex9_inner 25428
Ex10_inner 30935
Ex10_inner 31584
-----end of the output file----
EDIT: Since the OP has changed the Input_file sample a bit, adding code for it too.
awk --re-interval 'match($0,/Exon[0-9]{1,}/){val=substr($0,RSTART,RLENGTH);getline;sub(/^ +/,"",$1);print val,$1}' Input_file
NOTE: My awk is an old version, so I added --re-interval; you need not add it if you have a recent version.
With a single awk, the following may help you too.
awk '/Ex[0-9]+_inner/{val=$NF;getline;sub(/^ +/,"",$1);print val,$1}' Input_file
Explanation of the above code:
awk '
/Ex[0-9]+_inner/{ ##Checking condition here: if a line contains string Ex, then digits, then _inner, do the following actions.
val=$NF; ##Creating a variable named val whose value is $NF (the last field of the current line).
getline; ##Using getline, an out-of-the-box awk keyword, to move the cursor from the current line to the next line.
sub(/^ +/,"",$1); ##Using awk's sub to substitute the initial spaces of the first field with NULL.
print val,$1 ##Printing variable val and the first field's value here.
}
' Input_file ##Mentioning the Input_file name here.
another awk
$ awk '/^all/{next}
!/^chr/{printf "%s\n", $1; next}
{printf "%s ", $NF}' file
Ex8_inner 25425
Ex8_inner 25426
Ex9_inner 25427
Ex9_inner 25428
Ex10_inner 30935
Ex10_inner 31584
or perhaps
$ awk '!/^all/{if(/^chr/) printf "%s", $NF OFS; else print $1}' file
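Or as a pipeline: drop the "all" rows with grep, join each pair of remaining lines with paste, and pick the two wanted fields — a sketch assuming the chr/count lines always come in adjacent pairs:
grep -v '^all' file | paste - - | awk '{print $4, $5}'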

awk to update file with sum of matching $1 in another file

In the awk below I am trying to add a penalty to the score of each matching $1 in file2, based on the sum of $3+$4 (variable TL) in file1. Then the $4 value in file1 is divided by TL and multiplied by 100 (this value is variable S). Finally, $2 in file2 minus S gives the updated $2 result in file2. Since math is not my strong suit there probably is a better way of doing this, but this is what I could think of. Thank you :).
file1 space delimited
ACP5 4 1058 0
ACTB5 10 1708 79
ORAI1 2 952 0
TBX1 9 1932 300
file2 tab-delimited
ACP5 100.00
ACTB 100.00
ORAI1 94.01
TBX1 77.23
Desired output, tab-delimited (the --- is an example calculation and not part of the output)
ACP5 100.00
ACTB 89.59 ---- $3+$4=1787 this is TL (comes from file1), $4/TL*100 is 4.42, $2 in file2 is 100 - 4.42 = 95.58 ----
ORAI1 94.01
TBX1 63.79
awk
awk '
FNR==NR{ # process each line
TL[$1]=($3+$4);next} ($1 in TL) # from file1 store sum of $3 and $4 in TL
{S=(P[$4]/TL)*100;printf("%s\t %.2f\n",$1, $2-S) # store $4/TL from file1 in S and subtract S from $2 in file2, output two decimal places
}1' OFS="\t" file1 FS="\t" file2 # update and define input
current output
ACP5 100.00
ACTB 100.00
ORAI1 94.01
TBX1 77.23
As pointed out in the comments, the question is not completely clear. Since I can't comment yet I will give a solution that calculates the values as requested.
awk '
NF==4 { S[$1] = 100 * $4 / ($3 + $4) }
NF==2 { printf("%s\t%.2f\n", $1, $2 - S[$1]) }
' file1 file2
file1
ACP5 4 1058 0
ACTB 10 1708 79
ORAI1 2 952 0
TBX1 9 1932 300
file2
ACP5 100.00
ACTB 100.00
ORAI1 94.01
TBX1 77.23
output
ACP5 100.00
ACTB 95.58
ORAI1 94.01
TBX1 63.79
Explanation:
The script works by calculating and storing the S value in an associative array, using $1 as the key. This is done in a block filtered by NF==4, so it only runs for the first file (the only one with 4 fields). Finally, for NF==2, representing the second file, the result is printed using printf, subtracting the corresponding S value from $2.
Observation: keep in mind that, as @kvantour pointed out, the example you provided does not follow the indications in the question. For example, where did the 89.59 value come from? The explanation ends up with 95.58 as the result, just like the output of the script I provided.
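One small hardening worth noting: for a label present in file2 but absent from file1 (the question's original file1 has ACTB5 where file2 has ACTB), the bare S[$1] reference evaluates to 0 but also quietly creates an empty array entry. A sketch that tests membership explicitly and leaves unmatched scores untouched:
awk '
NF==4 { S[$1] = 100 * $4 / ($3 + $4) }                          # penalty per label from file1
NF==2 { printf("%s\t%.2f\n", $1, ($1 in S) ? $2 - S[$1] : $2) } # subtract only on a match
' file1 file2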

Concatenate files based on unique labels in their first column

I have many files that are of two column format with a label in the first column and a number in the second column. The number is positive (never zero):
AGS 3
KET 45
WEGWET 12
FEW 56
Within each file, the labels are not repeated.
I would like to concatenate these many files into one file with many+1 columns, such that the first column includes the unique set of all labels across all files, and the remaining columns include the number for each label from each file. If a label does not exist in a certain file (and hence there is no number for it), I would like it to default to zero. For instance, if the second file contains this:
AGS 5
KET 14
KJV 2
FEW 3
then the final output would look like:
AGS 3 5
KET 45 14
WEGWET 12 0
KJV 0 2
FEW 56 3
I am new to Linux, and have been playing around with sed and awk, but realize this probably requires multiple steps...
*Edit note: I had to change it from just 2 files to many files. Even though my example only shows 2 files, I would like to do this in case of >2 files as well. Thank you...
Here is one way using awk:
awk '
NR==FNR {a[$1]=$0;next}
{
print (($1 in a)?a[$1] FS $2: $1 FS "0" FS $2)
delete a[$1]
}
END{
for (x in a) print a[x],"0"
}' file1 file2 | column -t
AGS 3 5
KET 45 14
KJV 0 2
FEW 56 3
WEGWET 12 0
You read file1 into an array indexed by column 1 and assign the entire line as its value.
For file2, check if column 1 is present in the array. If it is, print the value from file1 along with the value from file2. If it is not present, print 0 as the value for file1.
Delete the array element as we go along, so that only what was unique to file1 remains.
In the END block, print what was unique to file1 and print 0 for file2.
Pipe the output to column -t for pretty formatting.
Assuming that your data are in files named file1 and file2:
$ awk 'FNR==NR {a[$1]=$2; b[$1]=0; next} {a[$1]+=0; b[$1]=$2} END{for (x in b) {printf "%-15s%3s%3s\n",x,a[x],b[x]}}' file1 file2
KJV 0 2
WEGWET 12 0
KET 45 14
AGS 3 5
FEW 56 3
To understand the above, we have to understand an awk trick.
In awk, NR is the number of records (lines) that have been processed and FNR is the number of records that we have processed in the current file. Consequently, the condition FNR==NR is true only when we are processing in the first file. In this case, the associative array a gets all the values from the first file and associative array b gets placeholder, i.e. zero, values. When we process the second file, its values go in array b and we make sure that array a at least has a placeholder value of zero. When we are done with the second file, the data is printed.
More than two files using GNU Awk
I created a file3:
$ cat file3
AGS 3
KET 45
WEGWET 12
FEW 56
AGS 17
ABC 100
The awk program extended to work with any number of files is:
$ awk 'FNR==1 {n+=1} {a[$1][n]=$2} END{for (x in a) {printf "%-15s",x; for (i=1;i<=n;i++) {printf "%5s",a[x][i]};print ""}}' file1 file2 file3
KJV 2
ABC 100
WEGWET 12 12
KET 45 14 45
AGS 3 5 17
FEW 56 3 56
This code creates a file counter: we know that we are in a new file every time FNR is 1, and a counter, n, is incremented. For every line we encounter, we put the data in a 2-D array. The first dimension of a is the label and the second is the number of the file we encountered it in. In the end, we just loop over all the labels and all the files, from 1 to n, and print the data.
More than 2 files without GNU Awk
Without requiring GNU's awk, we can solve the problem using simulated two-dimensional arrays:
$ awk 'FNR==1 {n+=1} {b[$1]=1; a[$1,":",n]=$2} END{for (x in b) {printf "%-15s",x; for (i=1;i<=n;i++) {q=a[x,":",i]+0; printf "%5s",q};print ""}}' file1 file2 file3
KJV 0 2 0
ABC 0 0 100
WEGWET 12 0 12
KET 45 14 45
AGS 3 5 17
FEW 56 3 56
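A final caveat for both versions: for (x in a) visits keys in an unspecified order, which is why the rows come out shuffled above. If you need stable output, pipe through sort, or in GNU awk set the traversal order up front, for example:
awk 'BEGIN {PROCINFO["sorted_in"]="@ind_str_asc"}   # gawk only: iterate arrays in string order
     FNR==1 {n+=1} {a[$1][n]=$2}
     END {for (x in a) {printf "%-15s",x; for (i=1;i<=n;i++) printf "%5s",a[x][i]; print ""}}' file1 file2 file3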