awk to extract value in each line and create new file

In the awk attempts below I am trying to extract the value of a substring from each line, but neither attempt produces the desired result. The first awk executes but returns no data,
and the second prints only the value rather than the full line. Thank you :).
file
#CHROM POS ID REF ALT QUAL FILTER INFO
1 930215 CM1613956 A G . . PHEN="Retinitis_pigmentosa";RANKSCORE=0.21
awk 1
awk '/^#/ {for (I=1;I<NF;I++) if ($I == "RANKSCORE=") print $(I+1)}' file
awk 2
awk 'BEGIN{FS=OFS="\t"}; /^#/ {print $1,$2,$3} {sub(/.*RANKSCORE=/, ""); print}' file
#CHROM POS ID
#CHROM POS ID REF ALT QUAL FILTER INFO
0.21
0.99
desired (tab-delimited)
1 930215 CM1613956 A G . . 0.21

You may use this awk:
awk 'BEGIN {FS=OFS="\t"}
/^#/ {next}
$NF ~ /;RANKSCORE=/ {
sub(/.+=/, "", $NF)
} 1' file
1 930215 CM1613956 A G . . 0.21
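Note that sub(/.+=/, "", $NF) is greedy, so it strips everything in the INFO field up to the last =; that works here because RANKSCORE is the final KEY=value pair. If other pairs could follow it (a hypothetical input, not in your sample), a sketch along the same lines that anchors on the tag and trims any trailing pairs would be:
awk 'BEGIN {FS=OFS="\t"}
/^#/ {next}
$NF ~ /(^|;)RANKSCORE=/ {
    sub(/.*RANKSCORE=/, "", $NF)   # keep only what follows RANKSCORE=
    sub(/;.*/, "", $NF)            # drop any later ;KEY=value pairs
} 1' file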

With your shown samples, please try the following awk code.
awk -F';RANKSCORE=' '
BEGIN{ OFS ="\t" }
/^#/ { next }
NF==2 && match($0,/.* /){
print substr($0,RSTART,RLENGTH-1),$2
}
' Input_file
Explanation: Adding detailed explanation for above code.
awk -F';RANKSCORE=' ' ##Starting awk program from here, setting the field separator as ;RANKSCORE=
BEGIN{ OFS ="\t" } ##Setting OFS as tab in BEGIN section of this code.
/^#/ { next } ##If a line starts from # then simply skip that line.
NF==2 && match($0,/.* /){ ##Check if NF is 2 AND matching till last occurrence of single space.
print substr($0,RSTART,RLENGTH-1),$2 ##Printing the substring up to the matched regex along with the 2nd field.
}
' Input_file ##Mentioning Input_file name here.

Your RANKSCORE seems to appear in field 8.
match can locate it. substr can extract it.
$ awk -F'\t' -v OFS='\t' '
match($8,/RANKSCORE=[0-9.]+/){
$8 = substr($8, RSTART+10, RLENGTH-10)   # "RANKSCORE=" is 10 characters; keep only the value
print
}
' file
Or more safely, assuming semi-colon sub-delimiters, a couple of subs:
$ awk -F'\t' -v OFS='\t' '
sub(/^(.*;)?RANKSCORE=/,"",$8){
sub(/[^0-9.].*$/,"",$8)
print
}
' file
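Both variants print only the lines where a RANKSCORE was actually found (the header and any non-matching lines are dropped), and because a field is assigned the record is rebuilt with tab separators, so the sample line should come out as:
1 930215 CM1613956 A G . . 0.21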

Assumptions:
we only want exact word matches on RANKSCORE (eg, do not match on old_RANKSCORE)
RANKSCORE=value could show up anywhere in a ;-delimited last field
Adding some lines with different locations of RANKSCORE:
#CHROM POS ID REF ALT QUAL FILTER INFO
1 930215 CM1613956 A G . . PHEN="Retinitis_pigmentosa";RANKSCORE=0.21
1 930215 CM1613956 A G . . RANKSCORE=3.235;PHEN="Retinitis_pigmentosa"
1 930215 CM1613956 A G . . stuff=123;old_RANKSCORE=7.7234;PHEN="Retinitis_pigmentosa"
1 930215 CM1613956 A G . . stuff=123;RANKSCORE=9.3325;PHEN="Retinitis_pigmentosa"
One awk idea:
awk '
BEGIN { FS=OFS="\t" }
/RANKSCORE/ { n=split($NF,a,"[;=]") # if line contains "RANKSCORE" then split last field on dual delimiters ";" and "="
for (i=1;i<=n;i=i+2) # loop through attribute names (odd-numbered indices) and ...
if (a[i] == "RANKSCORE") { # if attribute == "RANKSCORE" then ...
$NF=a[i+1] # use associated value (even-numbered index) as new value for last field
print # print new line
next # go to next input line
}
}
' file
This generates:
1 930215 CM1613956 A G . . 0.21
1 930215 CM1613956 A G . . 3.235
1 930215 CM1613956 A G . . 9.3325

No arrays needed:
{m,n,g}awk '!+_<+NF && sub(";.*$", _, $(NF=NF))^_'\
FS='[ \t]+([^ \t]*;)?RANKSCORE=' OFS='\t'
1 930215 CM1613956 A G . . 0.21
1 930215 CM1613956 A G . . 3.235
1 930215 CM1613956 A G . . 9.3325
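If the one-liner is hard to follow, here is a hedged, more conventional sketch of the same idea (skip header lines, find a RANKSCORE= that either starts the INFO field or follows a semicolon, and replace the field with just its value):
awk 'BEGIN { FS = OFS = "\t" }
/^#/ { next }                               # skip header lines
match($NF, /(^|;)RANKSCORE=[^;]*/) {        # tag at start of INFO or after a ;
    val = substr($NF, RSTART, RLENGTH)      # e.g. ";RANKSCORE=0.21"
    sub(/(^|;)RANKSCORE=/, "", val)         # keep only the value
    $NF = val
    print
}' file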

Related

compare and print 2 columns from 2 files in awk or perl

I have 2 files with 2 million lines each.
I need to compare 2 columns in 2 different files and print the lines of the 2 files where the items are equal.
This awk code finds the matches, but it only prints the matching lines of file2, not fields from both files:
awk 'NR == FNR {a[$3]; next}$3 in a' file1.txt file2.txt
file1.txt
0001 00000001 084010800001080
0001 00000010 041140000100004
file2.txt
2451 00000009 401208008004000
2451 00000010 084010800001080
desired output:
file1[$1]-file2[$1] file1[$2]-file2[$2] $3 ( same on both files )
0001-2451 00000001-00000010 084010800001080
how to do this in awk or perl?
Assuming your $3 values are unique within each input file as shown in your sample input/output:
$ cat tst.awk
NR==FNR {
foos[$3] = $1
bars[$3] = $2
next
}
$3 in foos {
print foos[$3] "-" $1, bars[$3] "-" $2, $3
}
$ awk -f tst.awk file1.txt file2.txt
0001-2451 00000001-00000010 084010800001080
I named the arrays foos[] and bars[] as I don't know what the first 2 columns of your input actually represent - choose a more meaningful name.
With your shown samples, please try the following awk code. Fair warning:
I haven't tested it yet with millions of lines.
awk '
FNR == NR{
arr1[$3]=$0
next
}
($3 in arr1){
split(arr1[$3],arr2)
print (arr2[1]"-"$1,arr2[2]"-"$2,$3)
delete arr2
}
' file1.txt file2.txt
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
FNR == NR{ ##checking condition which will be TRUE when first Input_file is being read.
arr1[$3]=$0 ##Creating array arr1 with index $3 and the whole current line as its value.
next ##next will skip all further statements from here.
}
($3 in arr1){ ##checking if $3 is present in arr1 then do following.
split(arr1[$3],arr2) ##Splitting value of arr1 into arr2.
print (arr2[1]"-"$1,arr2[2]"-"$2,$3) ##printing values as per requirement of OP.
delete arr2 ##Deleting arr2 array here.
}
' file1.txt file2.txt ##Mentioning Input_file names here.
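Run against the shown samples, this should produce the same line as the first answer:
0001-2451 00000001-00000010 084010800001080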
If you have two massive files, you may want to use sort, join and awk to produce your output without having to hold the first file mostly in memory.
Based on your example, this pipe would do that:
join -1 3 -2 3 <(sort -k3 -n file1) <(sort -k3 -n file2) | awk '{printf("%s-%s %s-%s %s\n",$2,$4,$3,$5,$1)}'
Prints:
0001-2451 00000001-00000010 084010800001080
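One caveat: join expects its inputs sorted lexicographically on the join field, so pairing it with sort -n only works when numeric and lexicographic order coincide (they do here, since the $3 values are fixed-width digit strings). A slightly more defensive sketch sorts and joins in the C locale without -n:
LC_ALL=C join -1 3 -2 3 <(LC_ALL=C sort -k3,3 file1) <(LC_ALL=C sort -k3,3 file2) |
awk '{ printf("%s-%s %s-%s %s\n", $2, $4, $3, $5, $1) }'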
If your files are that big, you might want to avoid storing the data in memory. It's a whole lot of comparisons, 2 million lines times 2 million lines = 4 × 10^12 comparisons.
use strict;
use warnings;
use feature 'say';

my $file1 = shift;
my $file2 = shift;

open my $fh1, "<", $file1 or die "Cannot open '$file1': $!";
while (<$fh1>) {
    my @F = split;
    open my $fh2, "<", $file2 or die "Cannot open '$file2': $!";
    # for each line of file1, file2 is reopened and read again
    while (my $cmp = <$fh2>) {
        my @C = split ' ', $cmp;
        if ($F[2] eq $C[2]) {    # check string equality
            say "$F[0]-$C[0] $F[1]-$C[1] $F[2]";
        }
    }
}
With your rather limited test set, I get the following output:
0001-2451 00000001-00000010 084010800001080
Python: tested with 2,000,000 rows in each file
d = {}
with open('1.txt', 'r') as f1, open('2.txt', 'r') as f2:
    # build a dict from file 1: third column -> (first, second column)
    for line in f1:
        if not line: break
        c0, c1, c2 = line.split()
        d[c2] = (c0, c1)
    # stream file 2 and print combined lines for matching third columns
    for line in f2:
        if not line: break
        c0, c1, c2 = line.split()
        if c2 in d:
            print("{}-{} {}-{} {}".format(d[c2][0], c0, d[c2][1], c1, c2))
$ time python3 comapre.py
1001-2001 10000001-20000001 224010800001084
1042-2013 10000042-20000013 224010800001096
real 0m3.555s
user 0m3.234s
sys 0m0.321s

Extract first position of a regex match grep

Good morning everyone,
I have a text file containing multiple lines. I want to find a regular pattern inside it and print its position using grep.
For example:
ARTGHFRHOPLIT
GFRTLOPLATHLG
TGHLKTGVARTHG
I want to find L[any_letter]T in the file and print the position of the L and the three letter code. In this case the result would be:
11 LIT
8 LAT
4 LKT
I wrote a grep command, but it doesn't return what I need. The command is:
grep -E -boe "L.T" file.txt
It returns:
11:LIT
21:LAT
30:LKT
Any help would be appreciated!!
Awk suits this better:
awk 'match($0, /L[[:alpha:]]T/) {
print RSTART, substr($0, RSTART, RLENGTH)}' file
11 LIT
8 LAT
4 LKT
This is assuming only one such match per line.
If there can be multiple overlapping matches per line then use:
awk '{
n = 0
while (match($0, /L[[:alpha:]]T/)) {
n += RSTART
print n, substr($0, RSTART, RLENGTH)
$0 = substr($0, RSTART + 1)
}
}' file
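For example, on a hypothetical input line LLTT (not in the question's data) the overlapping loop reports both matches, which advancing by RSTART + RLENGTH would miss:
$ echo 'LLTT' | awk '{ n = 0; while (match($0, /L[[:alpha:]]T/)) { n += RSTART; print n, substr($0, RSTART, RLENGTH); $0 = substr($0, RSTART + 1) } }'
1 LLT
2 LTT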
With your shown samples, please try the following awk code. Written and tested in GNU awk; it should work in any awk.
awk '
{
ind=prev=""
while(ind=index($0,"L")){
if(substr($0,ind+2,1)=="T" && substr($0,ind+1,1) ~ /[a-zA-Z]/){
if(prev==""){ print prev+ind,substr($0,ind,3) }
if(prev>1) { print prev+ind+2,substr($0,ind,3) }
}
$0=substr($0,ind+3)
prev+=ind
}
}' Input_file
Explanation: Adding detailed explanation for above code.
awk ' ##Starting awk program from here.
{
ind=prev="" ##Nullifying ind and prev variables here.
while(ind=index($0,"L")){ ##Run while loop to check if index for L letter is found(whose index will be stored into ind variable).
if(substr($0,ind+2,1)=="T" && substr($0,ind+1,1) ~ /[a-zA-Z]/){ ##Checking whether the character two positions after L is T AND the character right after L is a letter.
if(prev==""){ print prev+ind,substr($0,ind,3) } ##Checking if prev variable is NULL then printing prev+ind along with 3 letters from index of L eg:(LIT).
if(prev>1) { print prev+ind+2,substr($0,ind,3) } ##If prev is greater than 1 then printing prev+ind+2 and along with 3 letters from index of L eg:(LIT).
}
$0=substr($0,ind+3) ##Resetting $0 to the remainder of the line after the 3-character match.
prev+=ind ##adding ind to prev value.
}
}' Input_file ##Mentioning Input_file name here.
Peeking at the answer of @anubhava, you might also add RSTART + RLENGTH and use that as the start for the substr to get multiple matches per line and per word.
The while loop takes the current line and, on every iteration, replaces it with the part of the string right after the last match.
Note that if you use the . in a regex it can match any character.
awk '{
pos = 0
while (match($0, /L[a-zA-Z]T/)) {
    print pos + RSTART, substr($0, RSTART, RLENGTH)
    pos += RSTART + RLENGTH - 1      # characters consumed so far
    $0 = substr($0, RSTART + RLENGTH)
}
}' file
If file contains
ARTGHFRHOPLIT
GFRTLOPLATHLG
TGHLKTGVARTHG
ARTGHFRHOPLITLOT LATTELET
LUT
The output is
11 LIT
8 LAT
4 LKT
11 LIT
14 LOT
18 LAT
23 LET
1 LUT

Prevent awk from adding non-integers?

I have a file that has these columns that I would like to add:
absolute_broad_major_cn
1
1
1
1
1.76
1.76
NA
1
and
absolute_broad_minor_cn
1
1
1
1
0.92
0.92
NA
1
I did awk '{ print $1+$2 }', which worked well, but it put 0 wherever there was an NA. Is it possible to make awk skip these and just print NA again instead (so awk only adds numbers)?
Edit: Desired output is:
<Column header>
2
2
2
2
2.68
2.68
NA
2
paste absolute* | awk '{ if ($1 == "NA" && $2 == "NA") print "NA"; else print $1 + $2; }'
would do the trick; whether you want && (both must be "NA" to produce an "NA") or || (either one being "NA" produces an NA) depends on your need.
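For instance, a hedged sketch of the || variant that also carries a header through (the combined header name below is made up, since the question doesn't specify one):
paste absolute_broad_major_cn absolute_broad_minor_cn |
awk 'NR == 1 { print "absolute_broad_total_cn"; next }   # hypothetical combined header
     $1 == "NA" || $2 == "NA" { print "NA"; next }
     { print $1 + $2 }'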
Could you please try the following, written and tested with the shown samples.
awk '
FNR==NR{
a[FNR]=$0
next
}
{
print ($0~/[a-zA-Z]/ && a[FNR]~/[a-zA-Z]/?"NA":a[FNR]+$0)
}
' absolute_broad_major_cn absolute_broad_minor_cn
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
FNR==NR{ ##Checking condition FNR==NR which will be TRUE when Input_file absolute_broad_major_cn is being read.
a[FNR]=$0 ##Creating array a with index FNR and having value as current line here.
next ##next will skip all further statements from here.
}
{
print ($0~/[a-zA-Z]/ && a[FNR]~/[a-zA-Z]/?"NA":a[FNR]+$0) ##Printing either the sum of the current line and the array a value, or NA in case an alphabetic character is found either in the array value OR in the current line.
}
' absolute_broad_major_cn absolute_broad_minor_cn ##Mentioning Input_file names here.
I think what you're really trying to do is sum 2 numeric columns from 1 file:
awk '{print ($1==($1+0) ? $1+$2 : $1)}' file
$1 == $1+0 will only be true if $1 is a number.
Just remove the lines with NA & then add them
awk '$1 != "NA"' FS=' ' file | awk '{ print $1+$2 }'

awk duplicates lines when skipping lines starting with # symbol

In the awk below, is there a way to process only the lines below the #CHROM pattern, yet still print all lines in the output? The problem I am having is that if I ignore all lines with a #, they do still print in the output, but the other lines without the # get duplicated. In my data file there are thousands of lines, but only the one format below is updated by the awk. Thank you :).
file tab-delimited
##bcftools_normVersion=1.3.1+htslib-1.3.1
##bcftools_normCommand=norm -m-both -o genome_split.vcf genome.vcf.gz
##bcftools_normCommand=norm -f /home/cmccabe/Desktop/NGS/picard-tools-1.140/resources/ucsc.hg19.fasta -o genome_annovar.vcf genome_split.vcf
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT
chr1 948797 . C . 0 PASS DP=159;END=948845;MAX_DP=224;MIN_DP=95 GT:DP:MIN_DP:MAX_DP 0/0:159:95:224
awk
awk '!/^#/
BEGIN {FS = OFS = "\t"
}
NF == 10 {
split($8, a, /[=;]/)
$11 = $12 = $13 = $14 = $15 = $18 = "."
$16 = (a[1] == "DP") ? a[2] : "DP=num_Missing"
$17 = "homref"
}
1' out > ref
current output tab-delimited
##bcftools_normVersion=1.3.1+htslib-1.3.1
##bcftools_normCommand=norm -m-both -o genome_split.vcf genome.vcf.gz
##bcftools_normCommand=norm -f /home/cmccabe/Desktop/NGS/picard-tools-1.140/resources/ucsc.hg19.fasta -o genome_annovar.vcf genome_split.vcf
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT
chr1 948797 . C . 0 PASS DP=159;END=948845;MAX_DP=224;MIN_DP=95 GT:DP:MIN_DP:MAX_DP 0/0:159:95:224 --- duplicated line ---
chr1 948797 . C . 0 PASS DP=159;END=948845;MAX_DP=224;MIN_DP=95 GT:DP:MIN_DP:MAX_DP 0/0:159:95:224 . . . . . 159 homref . --- this line is correct ---
desired output tab-delimited
##bcftools_normVersion=1.3.1+htslib-1.3.1
##bcftools_normCommand=norm -m-both -o genome_split.vcf genome.vcf.gz
##bcftools_normCommand=norm -f /home/cmccabe/Desktop/NGS/picard-tools-1.140/resources/ucsc.hg19.fasta -o genome_annovar.vcf genome_split.vcf
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT
chr1 948797 . C . 0 PASS DP=159;END=948845;MAX_DP=224;MIN_DP=95 GT:DP:MIN_DP:MAX_DP 0/0:159:95:224 . . . . . 159 homref .
Your first statement:
!/^#/
says "print every line that does not start with #" and your last:
1
says "print every line". Hence the duplicated data lines in the output.
To only modify lines that don't start with # but print all lines would be:
!/^#/ { do stuff }
1
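Putting that together with the script from the question, a sketch of the corrected command (same field logic, just without the stray bare pattern) would be:
awk 'BEGIN {FS = OFS = "\t"}
!/^#/ && NF == 10 {
    split($8, a, /[=;]/)
    $11 = $12 = $13 = $14 = $15 = $18 = "."
    $16 = (a[1] == "DP") ? a[2] : "DP=num_Missing"
    $17 = "homref"
}
1' out > ref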

awk to update value in field of out file using contents of another

In the out.txt below I am trying to use awk to update the contents of $9. out.txt is created by the awk before the pipe |. If $9 contains a + or - then $8 of out.txt is used as a key to look up in $2 of file2. When a match is found (there will always be one), the $3 value from that line of file2 is used to update $9 of out.txt, separated by a :. So the original +6 in out.txt would become +6:NM_005101.3. The awk below is close but has syntax errors after the | that I can not seem to fix. Thank you :).
out.txt tab-delimited
R_Index Chr Start End Ref Alt Func.IDP.refGene Gene.IDP.refGene GeneDetail.IDP.refGene Inheritence ExonicFunc.IDP.refGene AAChange.IDP.refGene
1 chr1 948846 948846 - A upstream ISG15 -0 . . .
2 chr1 948870 948870 C G UTR5 ISG15 NM_005101.3:c.-84C>G . .
4 chr1 949925 949925 C T downstream ISG15 +6 . . .
5 chr1 207646923 207646923 G A intronic CR2 >50 . . .
8 chr1 948840 948840 - C upstream ISG15 -6 . . .
file2 space-delimited
2 ISG15 NM_005101.3 948846-948956 949363-949919
desired output (tab-delimited)
R_Index Chr Start End Ref Alt Func.IDP.refGene Gene.IDP.refGene GeneDetail.IDP.refGene Inheritence ExonicFunc.IDP.refGene AAChange.IDP.refGene
1 chr1 948846 948846 - A upstream ISG15 -0:NM_005101.3 . . .
2 chr1 948870 948870 C G UTR5 ISG15 NM_005101.3:c.-84C>G . .
4 chr1 949925 949925 C T downstream ISG15 +6:NM_005101.3 . . .
5 chr1 207646923 207646923 G A intronic CR2 >50 . . .
8 chr1 948840 948840 - C upstream ISG15 -6:NM_005101.3 . . .
Description
lines 1, 3, 5: `$9` updated with `:` and the value of `$3` in `file2`
lines 2 and 4 are skipped as these do not have a `+` or `-` in them
awk
awk -v extra=50 -v OFS='\t' '
NR == FNR {
count[$2] = $1
for(i = 1; i <= $1; i++) {
low[$2, i] = $(2 + 2 * i)
high[$2, i] = $(3 + 2 * i)
mid[$2, i] = (low[$2, i] + high[$2, i]) / 2
}
next
}
FNR != 1 && $9 == "." && $12 == "." && $8 in count {
for(i = 1; i <= count[$8]; i++)
if($4 >= (low[$8, i] - extra) && $4 <= (high[$8, i] + extra)) {
if($4 > mid[$8, i]) {
sign = "+"
value = high[$8, i]
}
else {
sign = "-"
value = low[$8, i]
}
diff = (value > $4) ? value - $4 : $4 - value
$9 = (diff > 50) ? ">50" : (sign diff)
break
}
if(i > count[$8]) {
$9 = ">50"
}
}
1
' FS='[- ]' file2 FS='\t' file1 | awk if($6 == "-" || $6 == "+") printf ":" ; 'FNR==NR {a[$2]=$3; next} a[$8]{$3=a[$8]}1' OFS='\t' file2 > final.txt
bash: syntax error near unexpected token `('
As far as I can tell, your awk code is OK and your bash usage is wrong.
FS='[- ]' file2 FS='\t' file1 |
awk if($6 == "-" || $6 == "+")
printf ":" ;
'FNR==NR {a[$2]=$3; next}
a[$8]{$3=a[$8]}1' OFS='\t' file2 > final.txt
bash: syntax error near unexpected token `('
I don't know what that's supposed to do. This much is certain, though: on the second line, the awk code needs to be quoted (awk 'if(...'). The bash error message stems from the fact that bash is interpreting the (unquoted) awk code, and ( is not a valid shell-script token after if. (Even once quoted, the if(...) would also need to live inside an action block, i.e. awk '{ if (...) ... }'.)
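For what it's worth, here is a hedged sketch of what that second stage might look like as a separate, properly quoted awk, assuming out.txt is the tab-delimited result of the first command and the goal is as the question describes: when $9 starts with + or - (as in the desired output), append : and the transcript from $3 of file2, keyed on the gene name in $8:
awk 'BEGIN { OFS = "\t" }
NR == FNR { tx[$2] = $3; next }                        # file2 (space-delimited): gene -> transcript
($9 ~ /^[+-]/) && ($8 in tx) { $9 = $9 ":" tx[$8] }
1' file2 FS='\t' out.txt > final.txt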