awk to add text to file based on coordinates - awk

In the below awk (which executes but produces an empty output) I am using the $4 in file1 as a unique id and reading each $1, $2, and $3 value into a variable chr, min, and max.
The $4 is then split on the _ in file2 and read into array. Each value in the split will match a $4 id in file1 The chr needs to match the $1, the min and max must be between the $2 and $3 values in file2.
An exact match is not needed rather just that the min or max variables falls within $2 and $3. If that is true then exon is printed in $5 of file1, if it is not true then intron is printed in $5.
The desired output has the exon/intron added to it but there is another part where the values in $2 or $3 are adjusted but I am trying to script that before I ask. I am not sure if the below is the best way but hopefully it is a start. Thank you :).
file1 tab delimited except for whitespace after $3 and $4
chr7 94027591 94027701 COL1A2
chr6 31980068 31980074 TNXB
file2 tab delimited
chr7 94027059 94027070 COL1A2_cds_1_0_chr7_94027060_f 0 +
chr7 94027693 94027708 COL1A2_cds_2_0_chr7_94027694_f 0 +
chr6 32009125 32009227 TNXB_cds_0_0_chr6_32009126_r 0 -
chr6 32009547 32009711 TNXB_cds_1_0_chr6_32009548_r 0 -
desired output
chr7 94027683 94027701 COL1A2 exon
chr6 31980068 31980074 TNXB intron
awk w/ comments
awk '
FNR==NR{ open block process matching line in file 1 and file2
a[$4]; # use as a key with unique id
chr[$4]=$1; # store $1 value in chr
min[$4]=$2; # store $2 value in min
max[$4]=$3; # store $3 value in max
next # process next line
} # close block
{ # open block
split($4,array,"_"); # spilt $4 on underscore
print $0,(array[1] in a) && ($2<=min[array[1]] && $2<=max[array[1] && $1=chr[array[1]])?"exon":"intron"
}' file1 OFS="\t" file2 > output # close block, mention input with field separators and output

IMHO, your shown final output is NOT looking correct by logic, since Input_file2 has multiple entries and Input_file1 has only single ones(I am going by samples shown only). Could you please check this one once? If any changes in your output or logic then please do mention them clearly.
awk '
BEGIN{
SUBSEP=","
}
FNR==NR{
max[$1,$NF]=$3
min[$1,$NF]=$2
next
}
{
split($4,array,"_")
}
(($1,array[1]) in max){
if(($2>min[array[5],array[1]] && $2<max[array[5],array[1]]) || ($3>max[array[5],array[1]] && $3<max[array[5],array[1]])){
print array[5],array[1],min[array[5],array[1]],max[array[5],array[1]],"exon"
next
}
}
{
print $0,"intron"
}' Input_file1 Input_file2 | column -t
What this command is doing it is checking Input_file2's 2nd field OR 3rd field either they are coming in range of Input_file1's 2nd and 3rd field. If anyone of them is coming then I am printing Input_file1's output adding exon in it or else printing Input_file2's output adding intron string at last of it.

Related

Printing lines of file2 when two fields from file1 match substrings of a single field in file2

Goal: To print lines of File2 when field 1 ($1) and field 4 ($4) of File1 both match a substring in field 4 ($4) on lines beginning with ">" in File2.
Important note #1: The lines being printed to output include the line being searched and all the lines following it until the next line with a ">".
Example: When fields 1 and 4 of File1 are 2776 & 2968 respectively, these should be searched against field 4 of File2 to evntually find the match 2776-2968(+) (because both numbers of File1 match a substring in field 4 of File2). The order of the numbers in the string does not matter - 2968-2776(+) should also be considered a match. Since they match, that line of File2 is printed with all lines below it until another line with ">" is encountered.
Important Note #2: File1 is tab-delimited: \t. File 2 is colon-delimited: :.
File1:
Transcription_Start Translation_Start Translation_Stop Transcription_Stop Strand Expression
2776 2968 + 920
17374 17563 + 1959
2968 2786 - 802
17563 17375 - 1694
19606 19395 - 1914
File2:
>-::NC_013316.1:2776-2968(+)
ATTGAACGCTGGCGGCAGGCCTAACACATGCAAGTCGAGCGGTAGAGAGAAGCTTGCTTC
TCTTGAGAGCGGCGGACGGGTGAGTAATGCCTAGGAATCTGCCTGGTAGTGGGGGATAAC
GCTCGGAAACGGACGCTAATACCGCATAC
>-::NC_013316.1:17374-17563(+)
AAAATTAAAGAAAATTCTAAAAAAATAAAAGATAGAATTTCAATTAAGTAAAAAAGTGAA
>-::NC_013316.1:2786-2968(-)
GTTCCTCCTTGTCACTATTTTAAACAAATTCCTATTGATACACTAAAAGTATATTATTTC
>antisense_CDR20291_r27::NC_013316.1:10830-11707(-)
TATTTCTTGTTCCTTTTTTCAAGGACAAGTAAATAAATTAACCTACTGTTTAATTTTCAA
>antisense_CDR20291_r27::NC_013316.1:11814-11874(-)
TTCCTTTGAGTTTCACTCTTGCGAGCGTACTTCCCAGGCGGA
Desired Output:
>-::NC_013316.1:2776-2968(+)
ATTGAACGCTGGCGGCAGGCCTAACACATGCAAGTCGAGCGGTAGAGAGAAGCTTGCTTC
TCTTGAGAGCGGCGGACGGGTGAGTAATGCCTAGGAATCTGCCTGGTAGTGGGGGATAAC
GCTCGGAAACGGACGCTAATACCGCATAC
>-::NC_013316.1:17374-17563(+)
AAAATTAAAGAAAATTCTAAAAAAATAAAAGATAGAATTTCAATTAAGTAAAAAAGTGAA
>-::NC_013316.1:2786-2968(-)
GTTCCTCCTTGTCACTATTTTAAACAAATTCCTATTGATACACTAAAAGTATATTATTTC
This is what I've tried so far (it outputs the full contents of File2, thus failing to produce the desired output):
$ awk -F"\t|:" 'NR==FNR{a[$4]; next} ($1 in a) || ($4 in a)' File1 File2 > Output
>-::NC_013316.1:2776-2968(+)
ATTGAACGCTGGCGGCAGGCCTAACACATGCAAGTCGAGCGGTAGAGAGAAGCTTGCTTC
TCTTGAGAGCGGCGGACGGGTGAGTAATGCCTAGGAATCTGCCTGGTAGTGGGGGATAAC
GCTCGGAAACGGACGCTAATACCGCATAC
>-::NC_013316.1:17374-17563(+)
AAAATTAAAGAAAATTCTAAAAAAATAAAAGATAGAATTTCAATTAAGTAAAAAAGTGAA
>-::NC_013316.1:2786-2968(-)
GTTCCTCCTTGTCACTATTTTAAACAAATTCCTATTGATACACTAAAAGTATATTATTTC
>antisense_CDR20291_r27::NC_013316.1:10830-11707(-)
TATTTCTTGTTCCTTTTTTCAAGGACAAGTAAATAAATTAACCTACTGTTTAATTTTCAA
>antisense_CDR20291_r27::NC_013316.1:11814-11874(-)
TTCCTTTGAGTTTCACTCTTGCGAGCGTACTTCCCAGGCGGA
How can I process my files with awk (or similar) to achieve my goal?
With your shown samples, please try following. Written and tested with GNU awk.
awk '
FNR==NR{
arr[$1,$2]
next
}
/^>/{
found=""
if((($5,$6) in arr) || (($6,$5) in arr)){
found=1
}
}
found
' file1 FS=":|-|\\\\(" file2
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
FNR==NR{ ##Checking condition which will be TRUE when file1 is being read.
arr[$1,$2] ##Creating arr with index of 1st and 2nd field.
next ##next will skip all further statements from here.
}
/^>/{ ##Checking condition if line starts from > then do following.
found="" ##Nullifying found here.
if((($5,$6) in arr) || (($6,$5) in arr)){ ##Checking condition if either 5th 6th field is present in arr OR 6th 5th field as a key present in arr then do following.
found=1 ##Setting found to 1 here.
}
}
found ##Checking condition if found is set then print that line.
' file1 FS=":|-|\\\\(" file2 ##Mentioning Input_file(s) and setting field separator before Input_file2 to get exact values to match.

awk to remove text and split on two delimiters

I am trying to use awk to remove the text after the last digit and split by the :. That is common to both lines and I believe the first portion of the awk below will do that. If there is no _ in the line then $2 is repeated in $3 and I believe the split will do that. What I am not sure how to do is if the is an _ in the line then the number to the left of the _ is $2 and the number to the right of the _ is $3. Thank you :).
input
chr7:140453136A>T 
chr7:140453135_140453136delCAinsTT
desired
chr7 140453136 140453136 
chr7 140453135 140453136
awk
awk '{sub(/[^0-9]+$/, "", $1); {split($0,a,":"); print a[1],a[2]a[2]} 1' input
Here is one:
$ awk '
BEGIN {
FS="[:_]" # using field separation for the job
OFS="\t"
}
{
sub(/[^0-9]*$/,"",$NF) # strip non-digits off the end of last field
if(NF==2) # if only 2 fields
$3=$2 # make the $2 from $2
}1' file # output
Output:
chr7 140453136 140453136
chr7 140453135 140453136
Tested on GNU awk, mawk, Busybox awk and awk version 20121220.
Using GNU awk:
awk -v FPAT='[0-9]+|chr[0-9]*' -v OFS='\t' 'NF==2{$3=$2}{$1=$1}1'
This relies on the field pattern FPAT that is a regex representing a number or the string chr with a number.
The statement NF==2{$3=$2} is to duplicate the second field if there is only 2 in the record.
The last statement is to force awk to rebuild the record to have the wanted formatting.
$ awk -F'[:_]' '{print $1, $2+0, $NF+0}' file
chr7 140453136 140453136
chr7 140453135 140453136
Could you please try following, more generic solution in terms of NO hard coding of copying fields values to another fields etc, you can simply mention maximum number of field value in awk variable and it will check each line(along with removing alphabets from their value) and will copy last value to till end of max value for that line.
awk -F'[:_]' -v max="3" '
{
for(i=2;i<=max;i++){
if($i==""){
$i=$(i-1)
}
gsub(/[^0-9]+/,"",$i)
}
}
1
' Input_file
To get output in TAB delimited form append | column -t in above code.

awk search file may contain multiple entries of id

The awk below works perfect, thanks to #RavinderSingh13, if the id in $4 (like in f) is unique. However most of my data is like f1 where the same id can appear multiple times. I think that is the reason why the awk is not working as expected. I added a comment on the line that I can not change without causing the script to abort. Each line in f2 is searched and must contain the id, in this case COL1A2 but that id may not be a single entry. That is the id may appear 5 times, but each line with that is in f2 is searched. Using the $4 in f1 as the id and reading each $1, $2, and $3 value into a variable min and max.
The $4 is then split on the _ in f2 and read into array. The same id from f1 may appear in multiple lines of f2 however each will have unique $2 and $3 values. Each value in the split will match a $4 id in f1. The min and max must match the $1 of f2 and be between the $2 and $3 values in f2. An exact match is not needed rather just that the min or max variables falls within $2 and $3. If that is true then exon is printed in $5 of f2 if it is not true then intron is printed in $5. Most of this works as expected I just did not account for the possibity for multiple enteries and am nut sure how to adjust for it. Thank you :)
For example using the contents of the f1 where COL1A2 appears 3 times, each entry or line is searched in f2. Currently, I believe since COL1A2 is not unique not match is found in f2 as the min and max are not set per entry or line. Thank you :).
awk w/ desired output
awk '
BEGIN{
SUBSEP=","
}
FNR==NR{
max[$1,$NF]=$3
min[$1,$NF]=$2
next
}
{
split($4,array,"_") # How do I change/modify this so it only looks a each line with this id `COL1A2` in it?
}
(($1,array[1]) in max){
if(($2>min[array[5],array[1]] && $2<max[array[5],array[1]]) || ($3>max[array[5],array[1]] && $3<max[array[5],array[1]])){
print array[5],array[1],min[array[5],array[1]],max[array[5],array[1]],"exon"
next
}
}
{
print $0,"intron"}' f f2
chr7 94024333 94024423 COL1A2_cds_0_0_chr7_94024344_f 0 + intron
chr7 94027049 94027080 COL1A2_cds_1_0_chr7_94027060_f 0 + intron
chr7 COL1A2 94027591 94027701 exon
awk w/ current output
.... }' f1 f2
chr7 94024333 94024423 COL1A2_cds_0_0_chr7_94024344_f 0 + intron
chr7 94027049 94027080 COL1A2_cds_1_0_chr7_94027060_f 0 + intron
chr7 94027683 94027718 COL1A2_cds_2_0_chr7_94027694_f 0 + intron
contents of f single COL1A2 entry
chr7 94027591 94027701 COL1A2
contents of f1 multiple COL1A2 entry, this is most of the actual data, very few are single entries though there are some
chr7 94027591 94027701 COL1A2
chr7 94027799 94027811 COL1A2
chr7 94030799 94030847 COL1A2
contents of f2 always the same format
chr7 94024333 94024423 COL1A2_cds_0_0_chr7_94024344_f 0 +
chr7 94027049 94027080 COL1A2_cds_1_0_chr7_94027060_f 0 +
chr7 94027683 94027718 COL1A2_cds_2_0_chr7_94027694_f 0 +

Awk command to compare specific columns in file1 to file2 and display output

File1
111,222,560,0.7
111,333,560,0.2
111,444,560,0.1
File2
2017,111,560,0.0537
2018,111,560,0.0296
2019,111,560,0.0624
Desired output:
2017,111,560,0.0537,222,0.7
2018,111,560,0.0296,222,0.7
2019,111,560,0.0624,222,0.7
2017,111,560,0.0537,333,0.2
2018,111,560,0.0296,333,0.2
2019,111,560,0.0624,333,0.2
2017,111,560,0.0537,444,0.1
2018,111,560,0.0296,444,0.1
2019,111,560,0.0296,444,0.1
I tried awk NR==FNR command but it’s displaying only the last matched...
reads every line and check if column 1 and 3 of file1 exists in file2:
2017,111,560,0.0537,444,0.1
2018,111,560,0.0296,444,0.1
2019,111,560,0.0296,444,0.1
I tried awk NR==FNR command but it’s displaying only the last
matched...
reads every line and check if column 1 and 3 of file1 exists in file2:
Using awk and sort
awk 'BEGIN{
# set input and output field separator
FS=OFS=","
}
# read first file f1
# index key field1 and field3 of file1 (f1)
{
k=$1 FS $3
}
# save 2nd and last field of file1 (f1) in array a, key being k
FNR==NR{
a[k]=(k in a ? a[k] RS:"") $2 OFS $NF;
# stop processing go to next line
next
}
# read 2nd file f2 from here
# 2nd and 3rd field of fiel2 (f2) used as key
{
k=$2 FS $3
}
# if key exists in array a
k in a{
# split array value by RS row separator, and put it in array t
split(a[k],t,RS);
# iterate array t, print and sort
for(i=1; i in t; i++)
print $0,t[i] | "sort -t, -nk5"
}
' f1 f2
Test Results:
$ cat f1
111,222,560,0.7
111,333,560,0.2
111,444,560,0.1
$ cat f2
2017,111,560,0.0537
2018,111,560,0.0296
2019,111,560,0.0624
$ awk 'BEGIN{FS=OFS=","}{k=$1 FS $3}FNR==NR{a[k]=(k in a ? a[k] RS:"") $2 OFS $NF; next}{k=$2 FS $3}k in a{split(a[k],t,RS); for(i=1; i in t; i++)print $0,t[i] | "sort -t, -nk5" }' f1 f2
2017,111,560,0.0537,222,0.7
2018,111,560,0.0296,222,0.7
2019,111,560,0.0624,222,0.7
2017,111,560,0.0537,333,0.2
2018,111,560,0.0296,333,0.2
2019,111,560,0.0624,333,0.2
2017,111,560,0.0537,444,0.1
2018,111,560,0.0296,444,0.1
2019,111,560,0.0624,444,0.1
Following awk may help you in same.
awk -F, '
FNR==NR{
a[FNR]=$0;
next
}
{
for(i=1;i<=length(a);i++){
print a[i] FS $2 FS $NF
}
}' Input_file2 Input_file1
Adding explanation too for code as follows.
awk -F, ' ##Setting field separator as comma here for all the lines.
FNR==NR{ ##Using FNR==NR condition which will be only TRUE then first Input_file named File2 is being read.
##FNR and NR both indicates the number of lines for a Input_file only difference is FNR value will be RESET whenever a new file is being read and NR value will be keep increasing till all Input_files are read.
a[FNR]=$0; ##Creating an array named a whose index is FNR(current line) value and its value is current line value.
next ##Using next statement will sip all further statements now.
}
{
for(i=1;i<=length(a);i++){##Starting a for loop from variable i value from 1 to length of array a value. This will be executed on 2nd Input_file reading.
print a[i] FS $2 FS $NF ##Printing the value of array a whose index is variable i and printing 2nd and last field of current line.
}
}' File2 File1 ##Mentioning the Input_file names here.
another one with join/awk
$ join -t, -j99 file2 file1 |
awk -F, -v OFS=, '$3==$6 && $4==$8 {print $2,$3,$4,$5,$7,$9}'

How to find the difference between two files using multiple conditions?

I have two files file1.txt and file2.txt like below -
cat file1.txt
2016-07-20-22 4343250019 1003116 001 data45343 25-JUL-16 11-MAR-16 1 N 0 0 N
2016-06-20-22 654650018 1003116 001 data45343 25-JUL-17 11-MAR-16 1 N 0 0 N
cat file2.txt
2016-07-20-22|9|1003116|001|data45343|25-JUL-16 11-MAR-16|1|N|0|0|N|hello|table|one
2016-06-20-22|8|1003116|001|data45343|25-JUL-17 11-MAR-16|1|N|0|0|N|hi|this|kill
2017-06-22-22|8|1003116|001|data45333|25-JUL-17 11-MAR-16|1|N|0|0|N|kill|boll|one
Requirement is to fetch the records which are not available in
file1.txt using below condition.
file1.txt file2.txt
col1(date) col1(Date)
col2(number: 4343250019 ) col2(last value of number: 9)
col3(number) col3(number)
col5(alphanumeric) col5(alphanumeric)
Expected Output :
2017-06-22-22|8|1003116|001|data45333|25-JUL-17 11-MAR-16|1|NULL|0|0|N|kill|boll|one
This output line doesn't available in file1.txt but available in
file2.txt after satisfying the matching criteria.
I was trying below steps to achieve this output -
###Replacing the space/tab from the file1.txt with pipe
awk '{print $1,$2,$3,$4,$5,$6,$7,$8,$9,$10}' OFS="|" file1.txt > file1.txt1
### Looping on a combination of four column of file1.txt1 with combination of modified column of file2.txt and output in output.txt
awk 'BEGIN{FS=OFS="|"} {a[$1FS$2FS$3FS$5];next} {(($1 FS substr($2,length($2),1) FS $3 FS $5) in a) print $0}' file2.txt file1.txt1 > output.txt
###And finally, replace the "N" from column 8th and put "NULL" if the value is "N".
awk -F'|' '{ gsub ("N","NULL",$8);print}' OFS="|" output.txt > output.txt1
What is the issue?
My 2nd operation is not working and I am trying to put all 3 operations in one operation.
awk -F'[|]|[[:blank:]]+' 'FNR==NR{E[$1($2%10)$3$5]++;next}!($1$2$3$5 in E)' file1.txt file2.txt
and your sample output is wrong, it should be (last field if different: data45333)
2016-07-20-22|9|1003116|001|data45333|25-JUL-16 11-MAR-16|1|N|0|0|N|hello|table|one
2017-06-22-22|8|1003116|001|data45343|25-JUL-17 11-MAR-16|1|N|0|0|N|kill|boll|one
Commented code
# separator for both file first with blank, second with `|`
awk -F'[|]|[[:blank:]]+' '
# for first file
FNR==NR{
# create en index entry based on the 4 field. The forat of filed allow to use them directly without separator (univoq)
E[ $1 ( $2 % 10 ) $3 $5 ]++
# for this line (file) don't go further
next
}
# for next file lines
# if not in the index list of entry, print the line (default action)
! ( ( $1 $2 $3 $5 ) in E ) { print }
' file1.txt file2.txt
Input
$ cat f1
2016-07-20-22 4343250019 1003116 001 data45343 25-JUL-16 11-MAR-16 1 N 0 0 N
2016-06-20-22 654650018 1003116 001 data45343 25-JUL-17 11-MAR-16 1 N 0 0 N
$ cat f2
2016-07-20-22|9|1003116|001|data45343|25-JUL-16 11-MAR-16|1|N|0|0|N|hello|table|one
2016-06-20-22|8|1003116|001|data45343|25-JUL-17 11-MAR-16|1|N|0|0|N|hi|this|kill
2017-06-22-22|8|1003116|001|data45333|25-JUL-17 11-MAR-16|1|N|0|0|N|kill|boll|one
Output
$ awk 'FNR==NR{a[$1,substr($2,length($2)),$3,$5];next}!(($1,$2,$3,$5) in a)' f1 FS="|" f2
2017-06-22-22|8|1003116|001|data45333|25-JUL-17 11-MAR-16|1|N|0|0|N|kill|boll|one
Explanation
awk ' # call awk.
FNR==NR{ # This is true when awk reads first file
a[$1,substr($2,length($2)),$3,$5] # array a where index being $1(field1), last char from $2, $3 and $5
next # stop processing go to next line
}
!(($1,$2,$3,$5) in a) # here we check index $1,$2,$3,$5 exists in array a by reading file f2
' f1 FS="|" f2 # Read f1 then
# set FS and then read f2
FNR==NR If the number of records read so far in the current file
is equal to the number of records read so far across all files,
condition which can only be true for the first file read.
a[$1,substr($2,length($2)),$3,$5] populate array "a" such that the
indexed by the first
field, last char of second field, third field and fifth field from
current record of file1
next Move on to the next record so we don't do any processing
intended for records from the second file.
!(($1,$2,$3,$5) in a) IF the array a index constructed from the
fields ($1,$2,$3,$5) of the current record of file2 does not exist
in array a, we get boolean true (! Called Logical NOT Operator. It is used to reverse the logical state of its operand. If a condition is true, then Logical NOT operator will make it false and vice versa.) so awk does default operation print $0 from file2
f1 FS="|" f2 read file1(f1), set field separator "|" after
reading first file, and then read file2(f2)
--edit--
When filesize is huge around 60GB(900 millions rows), its not a good
idea to process the file two times. 3rd operation - (replace "N" with
"NULL" from col - 8 ""awk -F'|' '{ gsub ("N","NULL",$8);print}'
OFS="|" output.txt
$ awk 'FNR==NR{
a[$1,substr($2,length($2)),$3,$5];
next
}
!(($1,$2,$3,$5) in a){
sub(/N/,"NULL",$8);
print
}' f1 FS="|" OFS="|" f2
2017-06-22-22|8|1003116|001|data45333|25-JUL-17 11-MAR-16|1|NULL|0|0|N|kill|boll|one
You can try this awk:
awk -F'[ |]*' 'NR==FNR{su=substr($2,length($2),1); a[$1":"su":"$3":"$5]=1;next} !a[$1":"$2":"$3":"$5]{print $0}' f1 f2
Here,
a[] - an associative array
$1":"su":"$3":"$5 - this forms key for an array index. su is last digit of field $2 (su=substr($2,length($2),1)). Then, assigning an 1 as value for this key.
NR==FNR{...;next} - this block works for processing f1.
Update:
awk 'NR==FNR{$2=substr($2,length($2),1); a[$1":"$2":"$3":"$5]=1;next} !a[$1":"$2":"$3":"$5]{gsub(/^N$/,"NULL",$8);print}' f1 FS="|" OFS='|' f2