awk to find match using unique id between two files and append data to second file - awk

In the awk below I am trying to skip the # in f1 and match $4 in f1 with $6 of f2 and if there is a match the contents in $5 in f1 are appended to $8 in f2.
I added comments as well.
f1 tab-delimited
#
#
chr1 2019345 2030758 GABRD {There is a lot of text here}
f2 tab-delimited
chr1 2028270 2028270 G A GABRD This has text in it.
chr1 2028297 2028302 CAT C GABRD This has text in it.
chr1 2229406 2229406 A G SKI This has text in it.
chr1 2304553 2304553 G A SKI This has text in it.
chr1 2306636 2306636 C T SKI This has text in it.
desired tab-delimited
chr1 2028270 2028270 G A GABRD This has text in it. {There is a lot of text here}
chr1 2028297 2028302 CAT C GABRD This has text in it. {There is a lot of text here}
chr1 2229406 2229406 A G SKI This has text in it. unknown
chr1 2304553 2304553 G A SKI This has text in it. unknown
chr1 2306636 2306636 C T SKI This has text in it. unknown
awk
awk '/^[^#]/ # skipping lines starting with # in f1
FNR==NR{ # checking condition which will be TRUE when f2 is being read.
a[$4]=$6 # creating array a with index of $4 and value of $6 here.
next # next will skip all further statements from here.
}
{
print $0,($4 in a?a[$4]:"unknown") # Printing current line and checking if 1st field is there in and print a[$4] else print unknown.
}' f1 f2 # close and inputs

Assumptions:
skip all lines (all files) that start with a #
f1/$5 does not contain any embedded tabs
if f1/$4 == f2/$6 then replace f2/$8 with f1/$5 else set f2/$8 = "unknown"
One awk idea:
awk '
BEGIN { FS=OFS="\t" }
/^#/ { next }
FNR==NR { a[$4]=$5; next }
{ $8= ($6 in a) ? a[$6] : "unknown" }
1
' f1 f2
This generates:
chr1 2028270 2028270 G A GABRD This has text in it. {There is a lot of text here}
chr1 2028297 2028302 CAT C GABRD This has text in it. {There is a lot of text here}
chr1 2229406 2229406 A G SKI This has text in it. unknown
chr1 2304553 2304553 G A SKI This has text in it. unknown
chr1 2306636 2306636 C T SKI This has text in it. unknown

Related

awk to update unknown values in file using range in another

I am trying to modify an awk kindly provided by @karakfa to update all the unknown values in $6 of file2, if the $4 value in file2 is within the range of $1 of file1. If there is already a value in $6 other than unknown, it is skipped and the next line is processed. In my awk attempt below the final output is 6 tab-delimited fields. Currently the awk runs but the unknown values are not updated and I cannot seem to solve this. Thank you :)
file1 (space delimited)
chr1:4714792-4852594 AJAP1
chr1:4714792-4837854 AJAP1
chr1:9160364-9189229 GPR157
chr1:9160364-9189229 GPR157
chr1:15783223-15798586 CELA2A
chr1:15783224-15798586 CELA2A
file2 (tab-delimited)
chr1 3649533 3649653 chr1:3649533-3649653 . TP73
chr1 4736396 4736516 chr1:4736396-4736516 . unknown
chr1 5923314 5923434 chr1:5923314-5923434 . NPHP4
chr1 9161991 9162111 chr1:9161991-9162111 . unknown
chr1 9162050 9162051 chr1:9162050-9162051 . rs6697376
desired output
--- the second and fourth unknown values are updated based on the range that they fall in $1 of file1
chr1 3649533 3649653 chr1:3649533-3649653 . TP73
chr1 4736396 4736516 chr1:4736396-4736516 . AJAP1
chr1 5923314 5923434 chr1:5923314-5923434 . NPHP4
chr1 9161991 9162111 chr1:9161991-9162111 . unknown
chr1 9162050 9162051 chr1:9162050-9162051 . rs6697376
current output with awk
awk -v OFS='\t' 'NR==FNR{
rstart[a[1]]=a[2]
rend[a[1]]=a[3]
value[a[1]]=$2
next}
$6~/unknown/ && $2>=rstart[$1] && $3<=rend[$1] {sub(/unknown/,value[$1],$6)}1' hg19.txt input | column -t
chr1 3649533 3649653 chr1:3649533-3649653 . TP73
chr1 4736396 4736516 chr1:4736396-4736516 . unknown
chr1 5923314 5923434 chr1:5923314-5923434 . NPHP4
chr1 9161991 9162111 chr1:9161991-9162111 . unknown
chr1 9162050 9162051 chr1:9162050-9162051 . rs6697376
edit:
awk -v OFS='\t' 'NR==FNR{split($1,a,/[:-]/)
rstart[a[1]]=a[2]
rend[a[1]]=a[3]
value[a[1]]=$2
next}
$6~/unknown/ && $2>=rstart[$1] && $3<=rend[$1] {sub(/unknown/,value[$1],$6)}1' hg19.txt input | column -t
possible solution to issue 2:
----- matching $2 values in file1 are combined with the first lines rstart[a[1]]=a[2] being the start and the last lines rend[a[1]]=a[3] being the end
chr1:4714792-4837854 AJAP1
chr1:9160364-9189229 GPR157
chr1:15783223-15798586 CELA2A
here is another script (it's inefficient since it does a linear scan instead of a more efficient search) but it works and is simpler.
$ awk -v OFS='\t' 'NR==FNR{split($1,a,"[:-]"); k=a[1]; c[k]++;
rstart[k,c[k]]=a[2];
rend[k,c[k]]=a[3];
value[k,c[k]]=$2;
next}
$6=="unknown" && ($1 in c) {k=$1;
for(i=1; i<=c[k]; i++)
if($2>=rstart[k,i] && $3<=rend[k,i])
{$6=value[k,i]; break}}1' file1 file2 |
column -t
since it's possible to have more than one match, this one uses the first found.
chr1 3649533 3649653 chr1:3649533-3649653 . TP73
chr1 4736396 4736516 chr1:4736396-4736516 . AJAP1
chr1 5923314 5923434 chr1:5923314-5923434 . NPHP4
chr1 9161991 9162111 chr1:9161991-9162111 . GPR157
chr1 9162050 9162051 chr1:9162050-9162051 . rs6697376
note that the fourth record also matches based on the rules.
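The merge the question proposes under "possible solution to issue 2" (one range per gene, taking the start from the first line and the end from the last line for that gene) could be sketched roughly as follows. This is only an illustration built on the sample file1 from the question; the array names first, last, chr and order are made up for the example, and it assumes lines for the same gene share a chromosome.

```shell
# Sample of file1 from the question (space-delimited: range, gene).
file1=$(mktemp)
cat > "$file1" <<'EOF'
chr1:4714792-4852594 AJAP1
chr1:4714792-4837854 AJAP1
chr1:9160364-9189229 GPR157
chr1:9160364-9189229 GPR157
chr1:15783223-15798586 CELA2A
chr1:15783224-15798586 CELA2A
EOF

# Keep the start of the first line and the end of the last line seen per gene.
merged=$(awk '{
    split($1, a, "[:-]")          # a[1]=chr, a[2]=start, a[3]=end
    g = $2
    if (!(g in first)) { order[++n] = g; chr[g] = a[1]; first[g] = a[2] }
    last[g] = a[3]                # overwritten until the last line for g
}
END {
    for (i = 1; i <= n; i++) {
        g = order[i]
        print chr[g] ":" first[g] "-" last[g], g
    }
}' "$file1")
printf '%s\n' "$merged"
rm -f "$file1"
```

The merged list can then be fed to the range lookup in place of the original file1.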

awk to add header to output file

I am trying to filter a file_to_filter by using another filter_file, which is just a list of strings in $1. I think I am close but can not seem to include the header row in the output. The file_to_filter is tab delimited as well. Thank you :).
file_to_filter
Chr Start End Ref Alt Func.refGene Gene.refGene
chr1 160098543 160098543 G A exonic ATP1A2
chr1 172410967 172410967 G A exonic PIGC
filter_file
PIGC
desired output (header included)
Chr Start End Ref Alt Func.refGene Gene.refGene
chr1 172410967 172410967 G A exonic PIGC
awk with current output (header not included)
awk -F'\t' 'NR==1{A[$1];next}$7 in A' file test
chr1 172410967 172410967 G A exonic PIGC
Assuming your fields really are tab-separated:
awk -F'\t' 'NR==FNR{tgts[$1]; next} (FNR==1) || ($7 in tgts)' filter_file file_to_filter
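A small self-contained run of that one-liner against the sample data from the question, just to show the header surviving the filter. The temporary directory and the variable name filtered are only for the demonstration.

```shell
dir=$(mktemp -d)
# Rebuild the two sample files from the question with real tabs.
printf 'Chr\tStart\tEnd\tRef\tAlt\tFunc.refGene\tGene.refGene\n' >  "$dir/file_to_filter"
printf 'chr1\t160098543\t160098543\tG\tA\texonic\tATP1A2\n'      >> "$dir/file_to_filter"
printf 'chr1\t172410967\t172410967\tG\tA\texonic\tPIGC\n'        >> "$dir/file_to_filter"
printf 'PIGC\n' > "$dir/filter_file"

# First file: remember each target gene. Second file: print the header
# line (FNR==1) plus every row whose 7th field is a remembered target.
filtered=$(awk -F'\t' 'NR==FNR{tgts[$1]; next} (FNR==1) || ($7 in tgts)' \
    "$dir/filter_file" "$dir/file_to_filter")
printf '%s\n' "$filtered"
rm -rf "$dir"
```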
To start learning awk, read the book Effective AWK Programming, 4th Edition, by Arnold Robbins.

remove field from tab separated file using awk

I am trying to clean up some tab-delimited files and thought that the awk below would remove field 18 (Otherinfo) from the file. I also tried cut and cannot seem to get the desired output. Thank you :).
file
Chr Start End Ref Alt Func.refGene Gene.refGene GeneDetail.refGene ExonicFunc.refGene AAChange.refGene PopFreqMax CLINSIG CLNDBN CLNACC CLNDSDB CLNDSDBID common Otherinfo
chr1 949654 949654 A G exonic ISG15 . synonymous SNV ISG15:NM_005101:exon2:c.294A>G:p.V98V 0.96 . . . . . . 1 3825.28 624 chr1 949654 . A G 3825.28 PASS AF=1;AO=621;DP=624;FAO=399;FDP=399;FR=.;FRO=0;FSAF=225;FSAR=174;FSRF=0;FSRR=0;FWDB=0.00425236;FXX=0.00249994;HRUN=1;LEN=1;MLLD=97.922;OALT=G;OID=.;OMAPALT=G;OPOS=949654;OREF=A;PB=0.5;PBP=1;QD=38.3487;RBI=0.0367904;REFB=0.0353003;REVB=-0.0365438;RO=2;SAF=335;SAR=286;SRF=0;SRR=2;SSEN=0;SSEP=0;SSSB=0.00332809;STB=0.5;STBP=1;TYPE=snp;VARB=-3.42335e-05;ANN=ISG15 GT:GQ:DP:FDP:RO:FRO:AO:FAO:AF:SAR:SAF:SRF:SRR:FSAR:FSAF:FSRF:FSRR 1/1:171:624:399:2:0:621:399:1:286:335:0:2:174:225:0:0 GOOD 399 reads
desired output
Chr Start End Ref Alt Func.refGene Gene.refGene GeneDetail.refGene ExonicFunc.refGene AAChange.refGene PopFreqMax CLINSIG CLNDBN CLNACC CLNDSDB CLNDSDBID common
chr1 949654 949654 A G exonic ISG15 0 synonymous SNV ISG15:NM_005101:exon2:c.294A>G:p.V98V 0.96 . . . . . .
awk (runs but doesn't remove field 18)
awk '{ $18=""; print }' file1
cut (removes all field except 18)
cut -f18 file1
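By default, awk splits fields on runs of blanks (spaces and tabs) and joins them with a single space on output, so assigning to $18 rebuilds the line with spaces. Set both the input (FS) and output (OFS) separators to a tab: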
awk 'BEGIN{FS=OFS="\t"}{$18=""; gsub(/\t\t/,"\t")}1' file1
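In the sample row the Otherinfo column appears to spill over into several further tab-separated values, so fields beyond 18 exist as well. If the goal is simply to keep the first 17 columns, as in the desired output, a sketch along these lines may be closer. The 20-column test file with made-up values c1..c20 is purely an assumption for the demonstration.

```shell
f=$(mktemp)
# One line with 20 tab-separated placeholder columns: c1 .. c20.
awk 'BEGIN { for (i = 1; i <= 20; i++) printf "%sc%d", (i > 1 ? "\t" : ""), i; print "" }' > "$f"

# cut keeps fields 1-17 (tab is its default delimiter).
with_cut=$(cut -f1-17 "$f")

# The same in awk: rebuild each line from the first 17 fields only.
with_awk=$(awk 'BEGIN { FS = OFS = "\t" }
{
    out = $1
    for (i = 2; i <= 17 && i <= NF; i++) out = out OFS $i
    print out
}' "$f")

printf '%s\n%s\n' "$with_cut" "$with_awk"
rm -f "$f"
```

Unlike collapsing doubled tabs with gsub, neither variant touches empty fields elsewhere in the line.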

separate number range from a file using awk

I have a file with 5 columns and I want to separate the columns using number range as a criteria: example:
chr1 2120987 2144159 NM_001282670 0.48106
chr1 2123333 2126214 NM_001256946 2.71647
chr1 4715104 4837854 NM_001042478 0
chr1 4715104 4843851 NM_018836 0
chr1 3728644 3773797 NM_014704 4.61425
chr1 3773830 3801993 NM_004402 4.39674
chr1 3773830 3801993 NM_001282669 0
chr1 6245079 6259679 NM_000983 75.1769
chr1 6304251 6305638 NM_001024598 0
chr1 6307405 6321035 NM_207370 0.273874
chr1 6161846 6240194 NM_015557 0.0149477
chr1 6266188 6281359 NM_207396 0
chr1 6281252 6296044 NM_012405 14.0752
I want to remove the rows with 0 from the list, then sort out the numbers between 0.01 and 0.27, and so on.
I am new to shell programming....can someone help with awk ?
Thanks.
As you are new to shell programming, you may not be aware of grep and sort which would be simpler for this job.
If you are hell-bent on awk as your tool of choice, please just disregard my answer.
I would do it like this:
grep -v '\s0$' file | sort -k 5,5 -g
chr1 6161846 6240194 NM_015557 0.0149477
chr1 6307405 6321035 NM_207370 0.273874
chr1 2120987 2144159 NM_001282670 0.48106
chr1 2123333 2126214 NM_001256946 2.71647
chr1 3773830 3801993 NM_004402 4.39674
chr1 3728644 3773797 NM_014704 4.61425
chr1 6281252 6296044 NM_012405 14.0752
chr1 6245079 6259679 NM_000983 75.1769
The grep with -v inverts the search and looks for lines not containing the sequence space followed by a zero followed by end of line. The sort sorts the data on column 5, and does a general numeric sort because of the -g.
If you are trying to select the rows in which $5 is non-zero and within a certain range, then indeed awk makes sense, and the following may be close to what you're after:
awk -v min=0.01 -v max=0.27 '
$5 == 0 { next }
min <= $5 && $5 <= max { print }' file
Here, the call to awk has been parameterized to suggest how these few lines can be adapted for more general usage.
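A quick run of that filter on a few rows taken from the question, as a rough check of how the bounds behave. The temporary file and the variable name in_range are only for illustration.

```shell
data=$(mktemp)
cat > "$data" <<'EOF'
chr1 2120987 2144159 NM_001282670 0.48106
chr1 4715104 4837854 NM_001042478 0
chr1 6307405 6321035 NM_207370 0.273874
chr1 6161846 6240194 NM_015557 0.0149477
EOF

# Skip rows whose 5th column is zero, then keep rows with min <= $5 <= max.
in_range=$(awk -v min=0.01 -v max=0.27 '
$5 == 0 { next }
min <= $5 && $5 <= max' "$data")
printf '%s\n' "$in_range"
rm -f "$data"
```

Only the NM_015557 row is kept here: 0.273874 lies just above max=0.27, so the upper bound would need to be raised slightly if that row is meant to fall in the same group.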

Matching up MySQL tables using Perl

I have 2 MYSQL tables viz. main_table and query1. main_table contains the columns position and chr whilst query1 contains position, chr and symbol. The table query1 is derived by querying the main_table. I am wanting to match up both these tables using Perl such that the output would have the entire list of positions from the main_table in the first column and 2nd column would be symbols corresponding to that position. There could be no symbols at all or just one symbol or multiple symbols for each positions.
I am not very certain how to write the code up for this, currently I have
#!/usr/bin/perl
use strict;
use DBI;
my %ucsc;
my $dbh = DBI->connect('DBI:mysql:disc1pathway;user=home;password=home');
my $dbs = DBI->connect('DBI:mysql:results;user=home;password=home');
my $main = $dbh->prepare("select chr, position from main_table");
my $q1 = $dbs->prepare("select position, symbol, chrom from query1");
$main->execute();
$q1->execute();
while (my $main_ref = $main->fetchrow_hashref()) {
$ucsc{$main_ref->{chr}}{$main_ref->{position}} = 1;
}
while (my $gene_ref = $q1->fetchrow_hashref()) {
my $q1position = $gene_ref->{position};
my $q1symbol = $gene_ref->{symbol};
my $q1chr = $gene_ref->{chr};
foreach my $ucsc (keys %{$ucsc{$q1chr}}) {
print "$ucsc $q1symbol\n";
}
}
$dbh->disconnect();
$dbs->disconnect();
exit (0);
The following are examples of the main_table and query1. The desired output is what I am expecting and I worked it out using the VLOOKUP function in Excel.
main_table
CHR Position
chr1 229830537
chr1 229723373
chr1 229723385
chr1 229723393
chr1 229723420
chr1 229829627
chr1 229723430
chr1 229829926
chr1 229723483
chr1 229723490
chr1 229723499
chr1 229723501
chr1 229830343
chr1 229723534
chr1 229723540
chr1 230039934
chr1 229723576
chr1 229830537
chr1 229830469
chr1 229725982
chr1 229726209
chr1 229966154
chr1 229726439
chr1 229726726
chr1 229726755
chr1 229726973
chr1 229967564
chr1 229727249
chr1 229727408
chr1 229727612
chr1 229728018
chr1 229728050
chr1 229728435
chr1 229728513
chr1 229966327
Query1
symbol CHR Position
C1 chr1 229829230
C1 chr1 229829278
C1 chr1 229829442
C1 chr1 229829627
C1 chr1 229829653
C1 chr1 229829683
C1 chr1 229829810
C1 chr1 229829926
C1 chr1 229829961
C1 chr1 229830085
C1 chr1 229830086
C1 chr1 229830087
C1 chr1 229830088
C1 chr1 229830141
C1 chr1 229830343
C1 chr1 229830469
C1 chr1 229830534
C1 chr1 229830537
C2 chr1 230039932
C2 chr1 230039934
C2 chr1 230039939
C2 chr1 230039944
457 chr1 229966154
457 chr1 229966327
457 chr1 229966500
457 chr1 229966552
457 chr1 229966748
457 chr1 229966998
457 chr1 229967327
457 chr1 229967564
457 chr1 229967594
457 chr1 229829627
Desired Output
Position symbol
229830537 C1
229723373
229723385
229723393
229723420
229829627 C1, 457
229723430
229829926 C1
229723483
229723490
229723499
229723501
229830343 C1
229723534
229723540
230039934 c2
229723576
229830537 C1
229830469
229725982
229726209
229966154 457
229726439
229726726
229726755
229726973
229967564 457
229727249
229727408
229727612
229728018
229728050
229728435
229728513
229966327
Thanks in advance
Caren
It sounds like you need to do a join operation in your SQL query, but you'll need some kind of relationship in order for this to work properly. You might be able to figure out what you need using the MySQL reference manual's section on JOIN syntax.
On the Perl side you'll need to write the logic for your output. I would recommend making a hash, using the "position" as your key and then any symbols as values. Fill the hash first, then do your output. It would simplify your process for outputting your query the way you would like.
If you have all the data already and you're just wondering how to output it in columns, you should look at sprintf and printf, which allow you to format output strings.
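The "hash keyed by position, symbols as values" idea can be illustrated outside of Perl and DBI. The sketch below uses awk and assumes both tables have been exported to header-less tab-separated files; main.tsv and query1.tsv are hypothetical names, and only a few rows from the question are used. It is not a replacement for doing the work in SQL or Perl, just a picture of the fill-first, print-second logic.

```shell
dir=$(mktemp -d)
# query1 export: symbol, chr, position
printf 'C1\tchr1\t229829627\n457\tchr1\t229829627\nC1\tchr1\t229830537\n' > "$dir/query1.tsv"
# main_table export: chr, position
printf 'chr1\t229830537\nchr1\t229723373\nchr1\t229829627\n' > "$dir/main.tsv"

matched=$(awk 'BEGIN { FS = OFS = "\t" }
NR == FNR {                      # first file: fill the lookup table
    k = $2 SUBSEP $3             # key on chr + position
    if (k in sym) sym[k] = sym[k] ", " $1
    else          sym[k] = $1
    next
}
{                                # second file: one output row per position
    k = $1 SUBSEP $2
    print $2, ((k in sym) ? sym[k] : "")
}' "$dir/query1.tsv" "$dir/main.tsv")
printf '%s\n' "$matched"
rm -rf "$dir"
```

Every position from the main table is printed once, with zero, one or several symbols next to it.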
use strict;
use DBI;
my %ucsc;
my $dbh = DBI->connect('DBI:mysql:disc1pathway;user=home;password=home');
my $dbs = DBI->connect('DBI:mysql:results;user=home;password=home');
my $main = $dbh->prepare("select chr, position from main_table");
$main->execute();
my $q1 = $dbs->prepare("select position, symbol, chrom from query1");
$q1->execute();
while (my $main_ref = $main->fetchrow_hashref()) {
$ucsc{$main_ref->{chr}}{$main_ref->{position}} = 1;
}
while (my $gene_ref = $q1->fetchrow_hashref()) {
my $q1position = $gene_ref->{position};
my $q1symbol = $gene_ref->{symbol};
my $q1chr = $gene_ref->{chr};
foreach my $ucsc (keys %{$ucsc{$q1chr}}) {
print "$ucsc $q1symbol\n";
}
}
$dbh->disconnect();
$dbs->disconnect();
exit (0);
=====================================================================================
The above code just lists the position and the symbol, but does not match them up. I can't seem to get my head around how to match them up. Any suggestions?
Thanks.
Caren
Weegee has the right answer. In MySQL you can qualify a table name with its database, as database.table, as long as both databases are on the same server; if the table is in the database you are connected to, you can drop the database portion. So your code should wind up looking like:
#!/usr/bin/perl
use strict;
use warnings;
use DBI;
my $dbh = DBI->connect(
'DBI:mysql:disc1pathway',
"home",
"home",
{
ChopBlanks => 1,
AutoCommit => 1,
PrintError => 0,
RaiseError => 1,
FetchHashKeyName => 'NAME_lc',
}
) or die "could not connect to database: ", DBI->errstr;
my $sth = $dbh->prepare("
SELECT
disc1pathway.main_table.chr,
disc1pathway.main_table.position,
results.query1.symbol,
results.query1.chrom
FROM disc1pathway.main_table
JOIN results.query1 ON (
disc1pathway.main_table.position = results.query1.position
)
");
$sth->execute;
while (my $col = $sth->fetchrow_hashref) {
print join(" ", @{$col}{qw/chr position symbol chrom/}), "\n";
}
$sth->finish;
$dbh->disconnect;