awk duplicates lines when trying to skip lines starting with # symbol - awk

In the awk below, is there a way to process only the lines below the pattern #CHROM, but still print everything in the output? The problem I am having is that if I ignore all lines with a #, they do print in the output, but the other lines without the # get duplicated. In my data file there are thousands of lines, but only the one format shown below is updated by the awk. Thank you :).
file tab-delimited
##bcftools_normVersion=1.3.1+htslib-1.3.1
##bcftools_normCommand=norm -m-both -o genome_split.vcf genome.vcf.gz
##bcftools_normCommand=norm -f /home/cmccabe/Desktop/NGS/picard-tools-1.140/resources/ucsc.hg19.fasta -o genome_annovar.vcf genome_split.vcf
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT
chr1 948797 . C . 0 PASS DP=159;END=948845;MAX_DP=224;MIN_DP=95 GT:DP:MIN_DP:MAX_DP 0/0:159:95:224
awk
awk '!/^#/
BEGIN {FS = OFS = "\t"
}
NF == 10 {
split($8, a, /[=;]/)
$11 = $12 = $13 = $14 = $15 = $18 = "."
$16 = (a[1] == "DP") ? a[2] : "DP=num_Missing"
$17 = "homref"
}
1' out > ref
current output tab-delimited
##bcftools_normVersion=1.3.1+htslib-1.3.1
##bcftools_normCommand=norm -m-both -o genome_split.vcf genome.vcf.gz
##bcftools_normCommand=norm -f /home/cmccabe/Desktop/NGS/picard-tools-1.140/resources/ucsc.hg19.fasta -o genome_annovar.vcf genome_split.vcf
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT
chr1 948797 . C . 0 PASS DP=159;END=948845;MAX_DP=224;MIN_DP=95 GT:DP:MIN_DP:MAX_DP 0/0:159:95:224 --- duplicated line ---
chr1 948797 . C . 0 PASS DP=159;END=948845;MAX_DP=224;MIN_DP=95 GT:DP:MIN_DP:MAX_DP 0/0:159:95:224 . . . . . 159 homref . --- this line is correct ---
desired output tab-delimited
##bcftools_normVersion=1.3.1+htslib-1.3.1
##bcftools_normCommand=norm -m-both -o genome_split.vcf genome.vcf.gz
##bcftools_normCommand=norm -f /home/cmccabe/Desktop/NGS/picard-tools-1.140/resources/ucsc.hg19.fasta -o genome_annovar.vcf genome_split.vcf
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT
chr1 948797 . C . 0 PASS DP=159;END=948845;MAX_DP=224;MIN_DP=95 GT:DP:MIN_DP:MAX_DP 0/0:159:95:224 . . . . . 159 homref .

Your first statement:
!/^#/
says "print every line that doesn't start with #" and your last:
1
says "print every line". Hence the duplicate lines in the output.
To modify only the lines that don't start with # but still print all lines, use:
!/^#/ { do stuff }
1
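Applied to the script in the question, a minimal corrected sketch (same field logic as above, untested against the full data) would be:
awk 'BEGIN { FS = OFS = "\t" }
!/^#/ && NF == 10 {                               # only touch 10-field data lines; header lines fall through
    split($8, a, /[=;]/)
    $11 = $12 = $13 = $14 = $15 = $18 = "."
    $16 = (a[1] == "DP") ? a[2] : "DP=num_Missing"
    $17 = "homref"
}
1' out > ref                                      # the single 1 prints every line exactly once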

Related

awk to extract value in each line and create new file

In the awk below I am trying to extract the value of a substring in each line, and the 2 attempts do not produce the desired results. The first awk executes and returns no data,
and the second only extracts the value. Thank you :).
file
#CHROM POS ID REF ALT QUAL FILTER INFO
1 930215 CM1613956 A G . . PHEN="Retinitis_pigmentosa";RANKSCORE=0.21
awk 1
awk '/^#/ {for (I=1;I<NF;I++) if ($I == "RANKSCORE=") print $(I+1)}' file
awk 2
awk 'BEGIN{FS=OFS="\t"}; /^#/ {print $1,$2,$3} {sub(/.*RANKSCORE=/, ""); print}' file
#CHROM POS ID
#CHROM POS ID REF ALT QUAL FILTER INFO
0.21
0.99
desired (tab-delimited)
1 930215 CM1613956 A G . . 0.21
You may use this awk:
awk 'BEGIN {FS=OFS="\t"}
/^#/ {next}                          # skip the #CHROM header line
$NF ~ /;RANKSCORE=/ {                # only touch lines whose last field contains ";RANKSCORE="
   sub(/.+=/, "", $NF)               # greedy .+= removes everything up to the last "=", leaving just the score
} 1' file
1 930215 CM1613956 A G . . 0.21
With your shown samples, please try the following awk code.
awk -F';RANKSCORE=' '
BEGIN{ OFS ="\t" }
/^#/ { next }
NF==2 && match($0,/.* /){
print substr($0,RSTART,RLENGTH-1),$2
}
' Input_file
Explanation: Adding detailed explanation for above code.
awk -F';RANKSCORE=' ' ##Starting awk program from here, settings field separator as ;RANKSCORE=
BEGIN{ OFS ="\t" } ##Setting OFS as tab in BEGIN section of this code.
/^#/ { next } ##If a line starts from # then simply skip that line.
NF==2 && match($0,/.* /){ ##Check if NF is 2 AND matching till last occurrence of single space.
print substr($0,RSTART,RLENGTH-1),$2 ##Printing substring till matched regex along with 2nd field.
}
' Input_file ##Mentioning Input_file name here.
Your RANKSCORE seems to appear in field 8.
match can locate it. substr can extract it.
$ awk -F'\t' -v OFS='\t' '
match($8,/RANKSCORE=[0-9.]+/){
$8 = substr($8, RSTART+10, RLENGTH-10)
print
}
' file
Or more safely, assuming semi-colon sub-delimiters, a couple of subs:
$ awk -F'\t' -v OFS='\t' '
sub(/^(.*;)?RANKSCORE=/,"",$8){
sub(/[^0-9.].*$/,"",$8)
print
}
' file
Assumptions:
we only want exact word matches on RANKSCORE (e.g., do not match on old_RANKSCORE)
RANKSCORE=value could show up anywhere in a ;-delimited last field
Adding some lines with different locations of RANKSCORE:
#CHROM POS ID REF ALT QUAL FILTER INFO
1 930215 CM1613956 A G . . PHEN="Retinitis_pigmentosa";RANKSCORE=0.21
1 930215 CM1613956 A G . . RANKSCORE=3.235;PHEN="Retinitis_pigmentosa"
1 930215 CM1613956 A G . . stuff=123;old_RANKSCORE=7.7234;PHEN="Retinitis_pigmentosa"
1 930215 CM1613956 A G . . stuff=123;RANKSCORE=9.3325;PHEN="Retinitis_pigmentosa"
One awk idea:
awk '
BEGIN { FS=OFS="\t" }
/RANKSCORE/ { n=split($NF,a,"[;=]") # if line contains "RANKSCORE" then split last field on dual delimiters ";" and "="
for (i=1;i<=n;i=i+2) # loop through attribute names (odd-numbered indices) and ...
if (a[i] == "RANKSCORE") { # if attribute == "RANKSCORE" then ...
$NF=a[i+1] # use associated value (even-numbered index) as new value for last field
print # print new line
next # go to next input line
}
}
' file
This generates:
1 930215 CM1613956 A G . . 0.21
1 930215 CM1613956 A G . . 3.235
1 930215 CM1613956 A G . . 9.3325
no arrays needed:
{m,n,g}awk '!+_<+NF && sub(";.*$", _, $(NF=NF))^_'\
FS='[ \t]+([^ \t]*;)?RANKSCORE=' OFS='\t'
1 930215 CM1613956 A G . . 0.21
1 930215 CM1613956 A G . . 3.235
1 930215 CM1613956 A G . . 9.3325

compare and print 2 columns from 2 files in awk or perl

I have 2 files with 2 million lines each.
I need to compare 2 columns in 2 different files and I want to print the lines of the 2 files where there are equal items.
This awk code works, but it does not print the lines from both files:
awk 'NR == FNR {a[$3]; next}$3 in a' file1.txt file2.txt
file1.txt
0001 00000001 084010800001080
0001 00000010 041140000100004
file2.txt
2451 00000009 401208008004000
2451 00000010 084010800001080
desired output:
file1[$1]-file2[$1] file1[$2]-file2[$2] $3 ( same on both files )
0001-2451 00000001-00000010 084010800001080
how to do this in awk or perl?
Assuming your $3 values are unique within each input file as shown in your sample input/output:
$ cat tst.awk
NR==FNR {
foos[$3] = $1
bars[$3] = $2
next
}
$3 in foos {
print foos[$3] "-" $1, bars[$3] "-" $2, $3
}
$ awk -f tst.awk file1.txt file2.txt
0001-2451 00000001-00000010 084010800001080
I named the arrays foos[] and bars[] as I don't know what the first 2 columns of your input actually represent - choose a more meaningful name.
With your shown samples, please try the following awk code. Fair warning: I haven't tested it yet with millions of lines.
awk '
FNR == NR{
arr1[$3]=$0
next
}
($3 in arr1){
split(arr1[$3],arr2)
print (arr2[1]"-"$1,arr2[2]"-"$2,$3)
delete arr2
}
' file1.txt file2.txt
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
FNR == NR{ ##checking condition which will be TRUE when first Input_file is being read.
arr1[$3]=$0 ##Creating arr1 array indexed by $3, holding the whole line as its value.
next ##next will skip all further statements from here.
}
($3 in arr1){ ##checking if $3 is present in arr1 then do following.
split(arr1[$3],arr2) ##Splitting value of arr1 into arr2.
print (arr2[1]"-"$1,arr2[2]"-"$2,$3) ##printing values as per requirement of OP.
delete arr2 ##Deleting arr2 array here.
}
' file1.txt file2.txt ##Mentioning Input_file names here.
If you have two massive files, you may want to use sort, join and awk to produce your output without having to hold most of the first file in memory.
Based on your example, this pipe would do that:
join -1 3 -2 3 <(sort -k3 -n file1) <(sort -k3 -n file2) | awk '{printf("%s-%s %s-%s %s\n",$2,$4,$3,$5,$1)}'
Prints:
0001-2451 00000001-00000010 084010800001080
If your files are that big, you might want to avoid storing the data in memory. It's a whole lot of comparisons, 2 million lines times 2 million lines = 4 * 10^12 comparisons.
use strict;
use warnings;
use feature 'say';

my $file1 = shift;
my $file2 = shift;

open my $fh1, "<", $file1 or die "Cannot open '$file1': $!";
while (<$fh1>) {
    my @F = split;
    open my $fh2, "<", $file2 or die "Cannot open '$file2': $!";
    # for each line of file1, file2 is reopened and read again
    while (my $cmp = <$fh2>) {
        my @C = split ' ', $cmp;
        if ($F[2] eq $C[2]) {    # check string equality
            say "$F[0]-$C[0] $F[1]-$C[1] $F[2]";
        }
    }
}
With your rather limited test set, I get the following output:
0001-2451 00000001-00000010 084010800001080
Python: tested with 2,000,000 rows in each file
d = {}
with open('1.txt', 'r') as f1, open('2.txt', 'r') as f2:
    for line in f1:
        if not line: break
        c0,c1,c2 = line.split()
        d[(c2)] = (c0,c1)
    for line in f2:
        if not line: break
        c0,c1,c2 = line.split()
        if (c2) in d: print("{}-{} {}-{} {}".format(d[(c2)][0], c0, d[(c2)][1], c1, c2))
$ time python3 comapre.py
1001-2001 10000001-20000001 224010800001084
1042-2013 10000042-20000013 224010800001096
real 0m3.555s
user 0m3.234s
sys 0m0.321s

Awk if else expression not printing correct results for mathematical operation

So I have an input file that looks like this:
atom Comp
C1 45.7006
H40 30.0407
N41 148.389
S44 502.263
F45 365.162
I also have some variables that I have called in from another file, which I know are defined correctly, as the correct values print when I call them using echo.
These values are
Hslope=-1.1120
Hint=32.4057
Cslope=-1.0822
Cint=196.4234
What I am trying to do is, for all lines with C in the first column, print (column 2 - Cint)/Cslope; do the same for all lines with H in the first column with the appropriate variables; and have all lines that don't have C or H print "NA".
The first line should be skipped.
Currently, my code reads
awk -v Hslope=$Hslope -v Hint=$Hint -v Cslope=$Cslope -v Cint=$Cint '{for(i=2; i<=NR; i++)
{
if($1 ~ /C/)
{ shift = (($2-Cint)/Cslope); print shift }
else if($1 ~ /H/)
{ shift = (($2-Hint)/Hslope); print shift }
else
{ print "NA" }
} }' avRNMR >> vgRNMR
Here avRNMR is the input file and vgRNMR is the output file, which has already been created with the header "shift" by an earlier command.
I have also tried a version where print is just set to the mathematical expression instead of using "shift" as a variable. Another attempt was putting $ in front of every variable. Neither of these produced any different results.
The output I get is
shift
139.274
2.1268
2.1268
NA
NA
NA
NA
NA
NA
NA
NA
NA
NA
NA
NA
Which is not the correct answer, particularly considering that my input file only has the six lines shown above. Note that the number of lines with C, H, and other letters is variable.
What I should get is
shift
139.27
2.13
NA
NA
NA
EDIT
As suggested, exchanging "for(i=2; i<=NR; i++)" for FNR>1 gives the following output
shift
NA
C1 45.7006
139.274
H40 30.0407
2.1268
N41 148.389
NA
S44 502.263
NA
F45 365.162
NA
Which is almost the correct output for the math answers, but not in the desired format. That first NA also means a line is being read to produce it, which shouldn't happen if the first line is truly being skipped.
Remove the for loop on i=2. Add pattern FNR>1 before the action. Anchor the two patterns to the beginning of the field:
awk -v Hslope=$Hslope -v Hint=$Hint -v Cslope=$Cslope -v Cint=$Cint '
FNR > 1 { # skip first record
if($1 ~ /^C/) print (($2-Cint)/Cslope)
else if($1 ~ /^H/) print (($2-Hint)/Hslope)
else print "NA"
}' avRNMR >> vgRNMR
Warning: I didn't test that code.
EDIT: I have now tested the code:
$ cat avRNMR
atom Comp
C1 45.7006
H40 30.0407
N41 148.389
S44 502.263
F45 365.162
$ awk -v Hslope=-1.1120 -v Hint=32.4057 -v Cslope=-1.0822 -v Cint=196.4234 '
> FNR > 1 { # skip first record
> if($1 ~ /^C/) print (($2-Cint)/Cslope)
> else if($1 ~ /^H/) print (($2-Hint)/Hslope)
> else print "NA"
> }' avRNMR
139.274
2.1268
NA
NA
NA
That looks to me like what you want. Please tell me what you are seeing.
Try this:
$ awk 'NR==FNR{v[$1]=$2} NR<=FNR||FNR==1{next} /^[CH]/{c=substr($0, 0, 1); print ($2-v[c"int"])/v[c"slope"];next} {print "NA"}' FS="=" vars FS=" " file
139.274
2.1268
NA
NA
NA
The first pattern/action pair reads variables from the file vars into an array v. The second skips further processing of the first file and also skips the first line (the header) of the second file, file. The third matches lines starting with C or H and does the calculations.
You'll need to change the file names and redirect the output to your outfile.
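For reference, vars here is assumed to be a file holding the assignments shown earlier in the question, so with the question's file names the call would look something like:
$ cat vars
Hslope=-1.1120
Hint=32.4057
Cslope=-1.0822
Cint=196.4234
$ awk 'NR==FNR{v[$1]=$2} NR<=FNR||FNR==1{next} /^[CH]/{c=substr($0, 0, 1); print ($2-v[c"int"])/v[c"slope"];next} {print "NA"}' FS="=" vars FS=" " avRNMR >> vgRNMR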
$ cat tst.awk
{ shift = "NA" }
/^C/ { shift = ($2 - Cint) / Cslope }
/^H/ { shift = ($2 - Hint) / Hslope }
NR>1 { print shift }
$ awk -v Hslope="$Hslope" -v Hint="$Hint" -v Cslope="$Cslope" -v Cint="$Cint" -f tst.awk file
139.274
2.1268
NA
NA
NA
or if this is what you really want:
$ cat tst.awk
{ shift = (NR==1 ? "shift" : "NA") }
/^C/ { shift = ($2 - Cint) / Cslope }
/^H/ { shift = ($2 - Hint) / Hslope }
{ print shift }
$ awk -v Hslope="$Hslope" -v Hint="$Hint" -v Cslope="$Cslope" -v Cint="$Cint" -f tst.awk file
shift
139.274
2.1268
NA
NA
NA

awk to update value in field of out file using contents of another

In the out.txt below I am trying to use awk to update the contents of $9. The out.txt is created by the awk before the pipe |. If $9 contains a + or -, then $8 of out.txt is used as a key to look up $2 of file2. When a match is found (there will always be one), the $3 value from that line of file2 is used to update $9 of out.txt, separated by a :. So the original +6 in out.txt would become +6:NM_005101.3. The awk below is close but has syntax errors after the | that I cannot seem to fix. Thank you :).
out.txt tab-delimited
R_Index Chr Start End Ref Alt Func.IDP.refGene Gene.IDP.refGene GeneDetail.IDP.refGene Inheritence ExonicFunc.IDP.refGene AAChange.IDP.refGene
1 chr1 948846 948846 - A upstream ISG15 -0 . . .
2 chr1 948870 948870 C G UTR5 ISG15 NM_005101.3:c.-84C>G . .
4 chr1 949925 949925 C T downstream ISG15 +6 . . .
5 chr1 207646923 207646923 G A intronic CR2 >50 . . .
8 chr1 948840 948840 - C upstream ISG15 -6 . . .
file2 space-delimited
2 ISG15 NM_005101.3 948846-948956 949363-949919
desired output tab-delimited
R_Index Chr Start End Ref Alt Func.IDP.refGene Gene.IDP.refGene GeneDetail.IDP.refGene Inheritence ExonicFunc.IDP.refGene AAChange.IDP.refGene
1 chr1 948846 948846 - A upstream ISG15 -0:NM_005101.3 . . .
2 chr1 948870 948870 C G UTR5 ISG15 NM_005101.3:c.-84C>G . .
4 chr1 949925 949925 C T downstream ISG15 +6:NM_005101.3 . . .
5 chr1 207646923 207646923 G A intronic CR2 >50 . . .
8 chr1 948840 948840 - C upstream ISG15 -6:NM_005101.3 . . .
Description
lines 1, 3, and 5: $9 updated with ":" and the value of $3 in file2
lines 2 and 4 are skipped as these do not have a + or - in them
awk
awk -v extra=50 -v OFS='\t' '
NR == FNR {
count[$2] = $1
for(i = 1; i <= $1; i++) {
low[$2, i] = $(2 + 2 * i)
high[$2, i] = $(3 + 2 * i)
mid[$2, i] = (low[$2, i] + high[$2, i]) / 2
}
next
}
FNR != 1 && $9 == "." && $12 == "." && $8 in count {
for(i = 1; i <= count[$8]; i++)
if($4 >= (low[$8, i] - extra) && $4 <= (high[$8, i] + extra)) {
if($4 > mid[$8, i]) {
sign = "+"
value = high[$8, i]
}
else {
sign = "-"
value = low[$8, i]
}
diff = (value > $4) ? value - $4 : $4 - value
$9 = (diff > 50) ? ">50" : (sign diff)
break
}
if(i > count[$8]) {
$9 = ">50"
}
}
1
' FS='[- ]' file2 FS='\t' file1 | awk if($6 == "-" || $6 == "+") printf ":" ; 'FNR==NR {a[$2]=$3; next} a[$8]{$3=a[$8]}1' OFS='\t' file2 > final.txt
bash: syntax error near unexpected token `('
As far as I can tell, your awk code is OK and your bash usage is wrong.
FS='[- ]' file2 FS='\t' file1 |
awk if($6 == "-" || $6 == "+")
printf ":" ;
'FNR==NR {a[$2]=$3; next}
a[$8]{$3=a[$8]}1' OFS='\t' file2 > final.txt
bash: syntax error near unexpected token `('
I don't know what that's supposed to do. This much is certain, though: on the second line, the awk code needs to be quoted (awk 'if(...'). The bash error message stems from the fact that bash is interpreting the (unquoted) awk code, and ( is not a valid shell-script token after if.
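For what it's worth, going by the description at the top of the question (append ":" plus file2's $3 to $9 whenever $9 starts with + or - and $8 matches $2 of file2), the part after the | could be quoted and written as one awk program along these lines (an untested sketch; the - reads the piped output of the first awk):
... FS='[- ]' file2 FS='\t' file1 |
awk -v OFS='\t' '
FNR == NR { a[$2] = $3; next }                    # file2 (space-delimited): gene name -> transcript
$9 ~ /^[+-]/ && ($8 in a) { $9 = $9 ":" a[$8] }   # e.g. +6 becomes +6:NM_005101.3
1' file2 FS='\t' - > final.txt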

editing an output file to be delimited with a semicolon when the input file is a CSV - kornshell

My input file is CSV
AED,E ,3.67295,20160105,20:10:00,UAE DIRHAM
ATS,E ,10.9814,20160105,20:10:00,AUSTRIAN SHILLINGS
AUD,A ,0.71525,20160105,20:10:00,AUSTRALIAN DOLLAR
I want to read it in and output it like so:
EUR;1.127650;USD/EUR;EURO;Cash
JPY;124.335000;JPY/USD;JAPANESE YEN;Cash
GBP;1.538050;USD/GBP;BRITISH POUND;Cash
actual code:
cat $FILE2 | while read a b c d e f
do
echo $a $c $a/USD $f Cash \
| awk -F, 'BEGIN { OFS =";" } {print $1, $2, $3, $4, $5}' >> my_ratesoutput.csv
done
output:
Cash;;;;95 AED/USD UAE DIRHAM
Cash;;;;14 ATS/USD AUSTRIAN SHILLINGS
Cash;;;;25 AUD/USD AUSTRALIAN DOLLAR
Cash;;;;/USD BARBADOS DOLLAR
export IFS=","
semico=';'
FILE=rates.csv
FILE2=rateswork.csv
echo $FILE
rm my_ratesoutput.csv
cp -p $FILE $FILE2
sed 1d $FILE2 > temp.csv
mv temp.csv $FILE2
echo "Currency;Spot Rate;Terms;Name;Curve" >>my_ratesoutput.csv
cat $FILE2 |while read a b c d e f
do
echo $a$semico$c$semico$a/USD$semico$f$semicoCash >> my_ratesoutput.csv
done
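For comparison, the whole read/echo loop could be replaced by a single awk pass over the CSV. This is only a sketch of what the script above appears to be aiming for (the field choices mirror the echo line, and rates.csv is assumed to have a header row, as the sed 1d suggests):
awk -F',' -v OFS=';' '
NR == 1 { print "Currency", "Spot Rate", "Terms", "Name", "Curve"; next }   # write the new header, skip the CSV header
{ print $1, $3, $1 "/USD", $6, "Cash" }                                     # e.g. AED;3.67295;AED/USD;UAE DIRHAM;Cash
' rates.csv > my_ratesoutput.csv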