Compare two files and print matching lines with some lines after match - awk

I have two files file1.txt and file2.txt.
file1.txt
DS496218 40654 42783
DS496218 40654 42783
DS496218 40654 42783
file2.txt
###
DS496108 ena gene 99942 102567 . -
DS496128 ena mRNA 99942 102567 . -
DS496118 ena three_prime_UTR 99942 100571
###
DS496218 ena gene 40654 42783 . -
DS496108 ena mRNA 99942 102567 . -
DS496108 ena three_prime_UTR 99942 100571
###
DS496128 ena gene 99942 102567 . -
DS496133 ena mRNA 99942 102567 . -
DS496139 ena three_prime_UTR 99942 100571
###
I want to match columns 1, 2 and 3 of file1.txt against columns 1, 4 and 5 of file2.txt. When a line matches, print it together with the following lines up to the next ###, but don't print the ### itself. I tried this awk command:
awk -F'\t' 'NR==FNR{c[$1$2$3]++;next};c[$1$4$5] > 0' file1.txt file2.txt > out.txt

Without seeing your expected output it's a guess but it sounds like this is what you want:
awk '
NR==FNR { a[$1,$2,$3]; next }
($1,$4,$5) in a { found=1 }
/^###/ { found=0 }
found
' file1 file2
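To check the logic against the sample data, here is a small self-contained run. It is only a sketch: the temporary directory, the `result` variable and the shortened file2 are assumptions made for the demo, and the awk program is the one above with comments added.

```shell
tmp=$(mktemp -d)
# Recreate (a shortened copy of) the sample inputs from the question.
printf 'DS496218 40654 42783\n' > "$tmp/file1"
cat > "$tmp/file2" <<'EOF'
###
DS496108 ena gene 99942 102567 . -
DS496128 ena mRNA 99942 102567 . -
DS496118 ena three_prime_UTR 99942 100571
###
DS496218 ena gene 40654 42783 . -
DS496108 ena mRNA 99942 102567 . -
DS496108 ena three_prime_UTR 99942 100571
###
DS496128 ena gene 99942 102567 . -
###
EOF

result=$(awk '
NR==FNR { a[$1,$2,$3]; next }   # file1: remember each (col1,col2,col3) key
($1,$4,$5) in a { found=1 }     # file2: start printing at a matching line
/^###/ { found=0 }              # stop printing at the block separator
found                           # print while inside a matched block
' "$tmp/file1" "$tmp/file2")
printf '%s\n' "$result"
```

With this input the matching gene line and the two lines that follow it are printed, and the ### separators are not.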

Related

awk to extract value in each line and create new file

In the awk below I am trying to extract the value of a substring in each line, but neither of my 2 attempts produces the desired result. The first awk executes but returns no data, and the second only extracts the value. Thank you :).
file
#CHROM POS ID REF ALT QUAL FILTER INFO
1 930215 CM1613956 A G . . PHEN="Retinitis_pigmentosa";RANKSCORE=0.21
awk 1
awk '/^#/ {for (I=1;I<NF;I++) if ($I == "RANKSCORE=") print $(I+1)}' file
awk 2
awk 'BEGIN{FS=OFS="\t"}; /^#/ {print $1,$2,$3} {sub(/.*RANKSCORE=/, ""); print}' file
#CHROM POS ID
#CHROM POS ID REF ALT QUAL FILTER INFO
0.21
0.99
desired (tab-delimited)
1 930215 CM1613956 A G . . 0.21
You may use this awk:
awk 'BEGIN {FS=OFS="\t"}
/^#/ {next}
$NF ~ /;RANKSCORE=/ {
sub(/.+=/, "", $NF)
} 1' file
1 930215 CM1613956 A G . . 0.21
With your shown samples, please try the following awk code.
awk -F';RANKSCORE=' '
BEGIN{ OFS ="\t" }
/^#/ { next }
NF==2 && match($0,/.* /){
print substr($0,RSTART,RLENGTH-1),$2
}
' Input_file
Explanation: a detailed walk-through of the code above.
awk -F';RANKSCORE=' ' ##Start the awk program, setting the field separator to ;RANKSCORE=
BEGIN{ OFS ="\t" } ##Set OFS to a tab in the BEGIN section.
/^#/ { next } ##If a line starts with #, simply skip it.
NF==2 && match($0,/.* /){ ##Check that NF is 2 AND match up to the last occurrence of a single space.
print substr($0,RSTART,RLENGTH-1),$2 ##Print the matched substring (minus the trailing space) along with the 2nd field.
}
' Input_file ##Input_file name goes here.
Your RANKSCORE seems to appear in field 8.
match can locate it. substr can extract it.
$ awk -F'\t' -v OFS='\t' '
match($8,/RANKSCORE=[0-9.]+/){
$8 = substr($8, RSTART+10, RLENGTH-10)
print
}
' file
Or more safely, assuming semi-colon sub-delimiters, a couple of subs:
$ awk -F'\t' -v OFS='\t' '
sub(/^(.*;)?RANKSCORE=/,"",$8){
sub(/[^0-9.].*$/,"",$8)
print
}
' file
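To make the offsets in the first command concrete: match() sets RSTART to the 1-based start of the match and RLENGTH to its length, and the literal RANKSCORE= is 10 characters long, so RSTART+10 and RLENGTH-10 select just the number. A small sketch on one INFO string from the sample (the `result` variable is only there for the demo):

```shell
result=$(printf '%s\n' 'PHEN="Retinitis_pigmentosa";RANKSCORE=0.21' | awk '
match($0, /RANKSCORE=[0-9.]+/) {
    # skip the 10 characters of "RANKSCORE=" and keep the rest of the match
    print RLENGTH, substr($0, RSTART + 10, RLENGTH - 10)
}')
printf '%s\n' "$result"
```

Here the match is RANKSCORE=0.21, so RLENGTH is 14 and the extracted value is 0.21.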
Assumptions:
we only want exact word matches on RANKSCORE (eg, do not match on old_RANKSCORE)
RANKSCORE=value could show up anywhere in a ;-delimited last field
Adding some lines with different locations of RANKSCORE:
#CHROM POS ID REF ALT QUAL FILTER INFO
1 930215 CM1613956 A G . . PHEN="Retinitis_pigmentosa";RANKSCORE=0.21
1 930215 CM1613956 A G . . RANKSCORE=3.235;PHEN="Retinitis_pigmentosa"
1 930215 CM1613956 A G . . stuff=123;old_RANKSCORE=7.7234;PHEN="Retinitis_pigmentosa"
1 930215 CM1613956 A G . . stuff=123;RANKSCORE=9.3325;PHEN="Retinitis_pigmentosa"
One awk idea:
awk '
BEGIN { FS=OFS="\t" }
/RANKSCORE/ { n=split($NF,a,"[;=]") # if line contains "RANKSCORE" then split last field on dual delimiters ";" and "="
for (i=1;i<=n;i=i+2) # loop through attribute names (odd-numbered indices) and ...
if (a[i] == "RANKSCORE") { # if attribute == "RANKSCORE" then ...
$NF=a[i+1] # use associated value (even-numbered index) as new value for last field
print # print new line
next # go to next input line
}
}
' file
This generates:
1 930215 CM1613956 A G . . 0.21
1 930215 CM1613956 A G . . 3.235
1 930215 CM1613956 A G . . 9.3325
no arrays needed :
{m,n,g}awk '!+_<+NF && sub(";.*$", _, $(NF=NF))^_'\
FS='[ \t]+([^ \t]*;)?RANKSCORE=' OFS='\t'
1 930215 CM1613956 A G . . 0.21
1 930215 CM1613956 A G . . 3.235
1 930215 CM1613956 A G . . 9.3325

awk duplicated lines with starting with # symbol

In the awk below, is there a way to process only the lines below the #CHROM pattern but still print every line in the output? The problem I am having is that if I ignore all lines with a #, they do print in the output, but the other lines without the # get duplicated. In my data file there are thousands of lines, but only the one format below is updated by the awk. Thank you :).
file tab-delimited
##bcftools_normVersion=1.3.1+htslib-1.3.1
##bcftools_normCommand=norm -m-both -o genome_split.vcf genome.vcf.gz
##bcftools_normCommand=norm -f /home/cmccabe/Desktop/NGS/picard-tools-1.140/resources/ucsc.hg19.fasta -o genome_annovar.vcf genome_split.vcf
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT
chr1 948797 . C . 0 PASS DP=159;END=948845;MAX_DP=224;MIN_DP=95 GT:DP:MIN_DP:MAX_DP 0/0:159:95:224
awk
awk '!/^#/
BEGIN {FS = OFS = "\t"
}
NF == 10 {
split($8, a, /[=;]/)
$11 = $12 = $13 = $14 = $15 = $18 = "."
$16 = (a[1] == "DP") ? a[2] : "DP=num_Missing"
$17 = "homref"
}
1' out > ref
current output tab-delimited
##bcftools_normVersion=1.3.1+htslib-1.3.1
##bcftools_normCommand=norm -m-both -o genome_split.vcf genome.vcf.gz
##bcftools_normCommand=norm -f /home/cmccabe/Desktop/NGS/picard-tools-1.140/resources/ucsc.hg19.fasta -o genome_annovar.vcf genome_split.vcf
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT
chr1 948797 . C . 0 PASS DP=159;END=948845;MAX_DP=224;MIN_DP=95 GT:DP:MIN_DP:MAX_DP 0/0:159:95:224 --- duplicated line ---
chr1 948797 . C . 0 PASS DP=159;END=948845;MAX_DP=224;MIN_DP=95 GT:DP:MIN_DP:MAX_DP 0/0:159:95:224 . . . . . 159 homref . --- this line is correct ---
desired output tab-delimited
##bcftools_normVersion=1.3.1+htslib-1.3.1
##bcftools_normCommand=norm -m-both -o genome_split.vcf genome.vcf.gz
##bcftools_normCommand=norm -f /home/cmccabe/Desktop/NGS/picard-tools-1.140/resources/ucsc.hg19.fasta -o genome_annovar.vcf genome_split.vcf
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT
chr1 948797 . C . 0 PASS DP=159;END=948845;MAX_DP=224;MIN_DP=95 GT:DP:MIN_DP:MAX_DP 0/0:159:95:224 . . . . . 159 homref .
Your first statement:
!/^#/
says "print every line that does not start with #" and your last:
1
says "print every line". Hence the duplicated non-# lines in the output.
To only modify lines that don't start with # but print all lines would be:
!/^#/ { do stuff }
1
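Putting that together with the body of the script from the question gives something like the sketch below. The temporary directory and the shortened header are assumptions for the demo; the field assignments are copied from the question unchanged.

```shell
tmp=$(mktemp -d)
# Two header lines plus one 10-field, tab-delimited record modelled on the question.
printf '##bcftools_normVersion=1.3.1+htslib-1.3.1\n' > "$tmp/out"
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\n' >> "$tmp/out"
printf 'chr1\t948797\t.\tC\t.\t0\tPASS\tDP=159;END=948845;MAX_DP=224;MIN_DP=95\tGT:DP:MIN_DP:MAX_DP\t0/0:159:95:224\n' >> "$tmp/out"

awk '
BEGIN { FS = OFS = "\t" }
!/^#/ && NF == 10 {                  # only modify non-header lines with 10 fields
    split($8, a, /[=;]/)
    $11 = $12 = $13 = $14 = $15 = $18 = "."
    $16 = (a[1] == "DP") ? a[2] : "DP=num_Missing"
    $17 = "homref"
}
1                                    # print every line exactly once
' "$tmp/out" > "$tmp/ref"
cat "$tmp/ref"
```

Because the only printing rule left is the final 1, each input line appears once: header lines untouched, data lines with the eight extra fields appended.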

Awk output formatting

I have 2 .po files in which some words have 2 different meanings,
and I want to use awk to turn them into some kind of translator.
For example
in .po file 1
msgid "example"
msgstr "something"
in .po file 2
msgid "example"
msgstr "somethingelse"
I came up with this
awk -F'"' 'match($2, /^example$/) {printf "%s", $2": ";getline; printf "%s", $2}' file1.po file2.po
The output will be
example:something example:somethingelse
How do I make it into this kind of format
example : something, somethingelse.
Reformatting
example:something example:somethingelse
into
example : something, somethingelse
can be done with this one-liner:
awk -F":| " -v OFS="," '{printf "%s:", $1; for (i=1;i<=NF;i++) if (i % 2 == 0)printf("%s%s%s", ((i==2)?"":OFS), $i, ((i==NF)?"\n":""))}'
Testing:
$ echo "example:something example:somethingelse example:something3 example:something4" | \
awk -F":| " -v OFS="," '{ \
printf "%s:", $1; \
for (i=1;i<=NF;i++) \
if (i % 2 == 0) \
printf("%s%s%s", ((i==2)?"":OFS), $i, ((i==NF)?"\n":""))}'
example:something,somethingelse,something3,something4
Explanation:
$ cat tst.awk
BEGIN{FS=":| ";OFS=","} # define field sep and output field sep
{ printf "%s:", $1 # print header line "example:"
for (i=1;i<=NF;i++) # loop over all fields
if (i % 2 == 0) # we're only interested in all "even" fields
printf("%s%s%s", ((i==2)?"":OFS), $i, ((i==NF)?"\n":""))
}
But you could have done the whole thing in one go with something like this:
$ cat tst.awk
BEGIN{OFS=","} # set output field sep to ","
NF{ # if NF (i.e. number of fields) > 0
# - to skip empty lines -
if (match($0,/msgid "(.*)"/,a)) id=a[1] # if line matches 'msgid "something",
# set "id" to "something"
if (match($0,/msgstr "(.*)"/,b)) str=b[1] # same here for 'msgstr'
if (id && str){ # if both "id" and "str" are set
r[id]=(id in r)?r[id] OFS str:str # save "str" in array r with index "id".
# if index "id" already exists,
# add "str" preceded by OFS (i.e. "," here)
id=str=0 # after saving, reset "id" and "str"
}
}
END { for (i in r) printf "%s : %s\n", i, r[i] } # print array "r"
and call this like:
awk -f tst.awk *.po
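Three-argument match() is a GNU awk extension, so tst.awk needs gawk. A minimal sketch of the same accumulate-by-msgid idea in POSIX awk follows. It assumes single-line msgid/msgstr entries with no escaped quotes; the `order` array, the `result` variable and the temporary files are illustrative names, not part of the original answer.

```shell
tmp=$(mktemp -d)
printf 'msgid "example"\nmsgstr "something"\n' > "$tmp/file1.po"
printf 'msgid "example"\nmsgstr "somethingelse"\n' > "$tmp/file2.po"

result=$(awk -F'"' '
$1 ~ /^msgid /  { id = $2; next }          # remember the current msgid
$1 ~ /^msgstr / && id != "" {
    if (!(id in r)) order[++n] = id        # keep ids in first-seen order
    r[id] = (r[id] == "" ? "" : r[id] ", ") $2
    id = ""
}
END { for (i = 1; i <= n; i++) printf "%s : %s\n", order[i], r[order[i]] }
' "$tmp/file1.po" "$tmp/file2.po")
printf '%s\n' "$result"
```

Splitting on the double quote puts the quoted text in $2, and tracking ids in `order` avoids the unspecified iteration order of `for (i in r)`.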
$ awk -F'"' 'NR%2{k=$2; next} NR==FNR{a[k]=$2; next} {print k" : "a[k]", "$2}' file1 file2
example : something, somethingelse

Awk to split input a tab-delimited file using multiple delimiters in the same field

I am trying to use awk to split the file, skipping the header, into either an 8-column or 6-column output. I am not sure if I did the split correctly though, as I need to split $2 first by the : and then by the -. The desired output of each awk is below, as one or the other is used depending on the situation. Thank you :).
file (tab-delimited)
Gene Position Strand
SMARCB1 22:24133967-24133967 +
RB1 13:49037865-49037865 -
SMARCB1 22:24176357-24176357 +
awk
awk -F'\t' -v OFS="\t" 'NR>1{split($2,a,":"); print a[1],a[2],a[3],"chr"$2,"0",$3,"GENE_ID="$1}'
8-column desired output tab-delimited
chr22 24133967 24133967 chr22:24133967-24133967 0 + . GENE_ID=SMARCB1
chr13 49037865 49037865 chr13:49037865-49037865 0 - . GENE_ID=RB1
chr22 24176357 24176357 chr22:24176357-24176357 0 + . GENE_ID=SMARCB1
awk
awk -F'\t' -v OFS="\t" 'NR>1{split($2,a,":"); print a[1],a[2],a[3],"chr"$2,".",$1,}'
6-column desired output tab-delimited
chr22 24133967 24133967 chr22:24133967-24133967 . SMARCB1
chr13 49037865 49037865 chr13:49037865-49037865 . RB1
chr22 24176357 24176357 chr22:24176357-24176357 . SMARCB1
Extended approach:
For 6-column output:
awk -v c=6 'BEGIN{ FS=OFS="\t" }NR>1{ split($2,a,":|-"); k="chr";
printf("%s\t%d\t%d\t%s\t",k a[1],a[2],a[3],k $2);
if (c==6) print ".",$1; else print "0",$3,".","GENE_ID="$1 }' file
The output:
chr22 24133967 24133967 chr22:24133967-24133967 . SMARCB1
chr13 49037865 49037865 chr13:49037865-49037865 . RB1
chr22 24176357 24176357 chr22:24176357-24176357 . SMARCB1
For 8-column output (selected by passing the column count as -v c=<number>):
awk -v c=8 'BEGIN{ FS=OFS="\t" }NR>1{ split($2,a,":|-"); k="chr";
printf("%s\t%d\t%d\t%s\t",k a[1],a[2],a[3],k $2);
if (c==6) print ".",$1; else print "0",$3,".","GENE_ID="$1 }' file
The output:
chr22 24133967 24133967 chr22:24133967-24133967 0 + . GENE_ID=SMARCB1
chr13 49037865 49037865 chr13:49037865-49037865 0 - . GENE_ID=RB1
chr22 24176357 24176357 chr22:24176357-24176357 0 + . GENE_ID=SMARCB1
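The piece the question was missing is that split() accepts a regular expression as its separator, so one call can break $2 on both : and -. A small sketch on a single Position value (the bracket expression [:-] is equivalent to the ":|-" used above; `result` is just a demo variable):

```shell
result=$(printf '22:24133967-24133967\n' | awk '{
    n = split($0, a, /[:-]/)     # split on either ":" or "-"
    print n, a[1], a[2], a[3]    # number of pieces, then chromosome, start, end
}')
printf '%s\n' "$result"
```

This yields three pieces: the chromosome number, the start and the end.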

awk to update value in field of out file using contents of another

In the out.txt below I am trying to use awk to update the contents of $9. The out.txt is created by the awk before the pipe |. If $9 contains a + or - then $8 of out.txt is used as a key to look up in $2 of file2. When a match (there will always be one) is found, the $3 value of that file2 line is used to update $9 of out.txt, separated by a :. So the original +6 in out.txt would become +6:NM_005101.3. The awk below is close but has syntax errors after the | that I cannot seem to fix. Thank you :).
out.txt tab-delimited
R_Index Chr Start End Ref Alt Func.IDP.refGene Gene.IDP.refGene GeneDetail.IDP.refGene Inheritence ExonicFunc.IDP.refGene AAChange.IDP.refGene
1 chr1 948846 948846 - A upstream ISG15 -0 . . .
2 chr1 948870 948870 C G UTR5 ISG15 NM_005101.3:c.-84C>G . .
4 chr1 949925 949925 C T downstream ISG15 +6 . . .
5 chr1 207646923 207646923 G A intronic CR2 >50 . . .
8 chr1 948840 948840 - C upstream ISG15 -6 . . .
file2 space-delimited
2 ISG15 NM_005101.3 948846-948956 949363-949919
desired output tab-delimited
R_Index Chr Start End Ref Alt Func.IDP.refGene Gene.IDP.refGene GeneDetail.IDP.refGene Inheritence ExonicFunc.IDP.refGene AAChange.IDP.refGene
1 chr1 948846 948846 - A upstream ISG15 -0:NM_005101.3 . . .
2 chr1 948870 948870 C G UTR5 ISG15 NM_005101.3:c.-84C>G . .
4 chr1 949925 949925 C T downstream ISG15 +6:NM_005101.3 . . .
5 chr1 207646923 207646923 G A intronic CR2 >50 . . .
8 chr1 948840 948840 - C upstream ISG15 -6:NM_005101.3 . . .
Description
lines 1, 3 and 5: `$9` is updated with a `:` followed by the value of `$3` in `file2`
lines 2 and 4 are skipped as these do not have a `+` or `-` in them
awk
awk -v extra=50 -v OFS='\t' '
NR == FNR {
count[$2] = $1
for(i = 1; i <= $1; i++) {
low[$2, i] = $(2 + 2 * i)
high[$2, i] = $(3 + 2 * i)
mid[$2, i] = (low[$2, i] + high[$2, i]) / 2
}
next
}
FNR != 1 && $9 == "." && $12 == "." && $8 in count {
for(i = 1; i <= count[$8]; i++)
if($4 >= (low[$8, i] - extra) && $4 <= (high[$8, i] + extra)) {
if($4 > mid[$8, i]) {
sign = "+"
value = high[$8, i]
}
else {
sign = "-"
value = low[$8, i]
}
diff = (value > $4) ? value - $4 : $4 - value
$9 = (diff > 50) ? ">50" : (sign diff)
break
}
if(i > count[$8]) {
$9 = ">50"
}
}
1
' FS='[- ]' file2 FS='\t' file1 | awk if($6 == "-" || $6 == "+") printf ":" ; 'FNR==NR {a[$2]=$3; next} a[$8]{$3=a[$8]}1' OFS='\t' file2 > final.txt
bash: syntax error near unexpected token `('
As far as I can tell, your awk code is OK and your bash usage is wrong.
FS='[- ]' file2 FS='\t' file1 |
awk if($6 == "-" || $6 == "+")
printf ":" ;
'FNR==NR {a[$2]=$3; next}
a[$8]{$3=a[$8]}1' OFS='\t' file2 > final.txt
bash: syntax error near unexpected token `('
I don't know what that's supposed to do. One thing is certain, though: on the second line, the awk code needs to be quoted (awk 'if(....), and an if statement has to sit inside a { } action block. The bash error message stems from the fact that bash is interpreting the (unquoted) awk code, and ( is not a valid shell-script token after if.
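If the intent of the second stage is what the question describes, a minimal sketch could look like the following. It is a guess at that intent, not the asker's code: it assumes the values to update are those where $9 starts with + or - (which agrees with the desired output, where NM_005101.3:c.-84C>G is left alone), and the `tx` array and temporary files are illustrative names. Only two data rows from the sample are used.

```shell
tmp=$(mktemp -d)
# file2 is space-delimited; out.txt is tab-delimited.
printf '2 ISG15 NM_005101.3 948846-948956 949363-949919\n' > "$tmp/file2"
printf '4\tchr1\t949925\t949925\tC\tT\tdownstream\tISG15\t+6\t.\t.\t.\n' > "$tmp/out.txt"
printf '5\tchr1\t207646923\t207646923\tG\tA\tintronic\tCR2\t>50\t.\t.\t.\n' >> "$tmp/out.txt"

awk -v OFS='\t' '
FNR == NR { tx[$2] = $3; next }                        # file2: gene name -> transcript id
$9 ~ /^[+-]/ && ($8 in tx) { $9 = $9 ":" tx[$8] }      # out.txt: append ":" and the id
1                                                      # print every line
' "$tmp/file2" FS='\t' "$tmp/out.txt" > "$tmp/final.txt"
cat "$tmp/final.txt"
```

The whole program is inside one pair of single quotes, so bash never sees the parentheses, and the FS='\t' assignment between the file names switches the field separator only for out.txt.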