I have the following .txt file:
##fileformat=VCFv4.0
##FILTER=<ID=PASS,Description="All filters passed">
##Tassel=<ID=GenotypeTable,Version=5,Description="Reference allele is not known. The major allele was used as reference allele">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=AD,Number=.,Type=Integer,Description="Allelic depths for the reference and alternate alleles in the order listed">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth (only filtered reads used for calling)">
##FORMAT=<ID=GQ,Number=1,Type=Float,Description="Genotype Quality">
##FORMAT=<ID=PL,Number=G,Type=Float,Description="Normalized, Phred-scaled likelihoods for AA,AB,BB genotypes where A=ref and B=alt; not applicable if site is not biallelic">
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=.,Type=Float,Description="Allele Frequency">
##bcftools_viewVersion=1.12-57-g0c2765b+htslib-1.12-45-g1830551
##bcftools_viewCommand=view -h 20Perc.SNPs.mergedAll.vcf; Date=Tue Sep 28 09:46:59 2021
I would like to make two edits to this text. First, in the line:
##FORMAT=<ID=PL,Number=.,Type=Float,Description="Normalized, Phred-scaled likelihoods for AA,AB,BB genotypes where A=ref and B=alt; not applicable if site is not biallelic">
I would like to replace "Number=." with "Number=G"
And immediately after the line:
##INFO=<ID=AF,Number=.,Type=Float,Description="Allele Frequency">
I would like to add a new line of text (and a line break):
##INFO=<ID=QualityScore,Number=.,Type=Float,Description="Quality score">
I was wondering if this could be done with one or two awk commands.
Thanks for any suggestions!
My solution is similar to @Daweo's. Consider this script, replace.awk:
/^##FORMAT=<ID=PL,/ { sub(/Number=\./, "Number=G") }
/##INFO=<ID=AF,Number=.,Type=Float,Description="Allele Frequency">/ {
print
print "##INFO=<ID=QualityScore,Number=.,Type=Float,Description=\"Quality score\">"
next
}
1
Run it:
awk -f replace.awk file.txt
Notes
The first line is easy to understand: it is a straight replacement.
The next group of lines deals with your second requirement. First, the print statement prints out the current line.
The next print statement prints out your new line.
The next command skips to the next input line.
Finally, the pattern 1 tells awk to print every line.
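As a quick sanity check, the two edits can be tried on a shortened stand-in for the header. This is only a sketch: the function name fix_header and the abbreviated Description texts are made up for illustration, and the AD line is included to confirm that only the PL line changes.

```shell
# Shortened stand-in for the VCF header (descriptions abbreviated for illustration)
header='##FORMAT=<ID=AD,Number=.,Type=Integer,Description="Allelic depths">
##FORMAT=<ID=PL,Number=.,Type=Float,Description="Phred-scaled likelihoods">
##INFO=<ID=AF,Number=.,Type=Float,Description="Allele Frequency">
##bcftools_viewVersion=1.12'

# Hypothetical helper: reads a header on stdin, writes the edited header to stdout
fix_header() {
  awk '
    # Change Number only on the PL FORMAT line
    /^##FORMAT=<ID=PL,/ { sub(/Number=\./, "Number=G") }
    # Print every line (after any substitution above)
    { print }
    # Right after the AF INFO line, emit the new INFO line
    /^##INFO=<ID=AF,/ {
      print "##INFO=<ID=QualityScore,Number=.,Type=Float,Description=\"Quality score\">"
    }
  '
}

printf '%s\n' "$header" | fix_header
```

Printing first and appending afterwards avoids the need for next; either arrangement should give the same result.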
I would use GNU AWK in the following way. Let file.txt content be
##fileformat=VCFv4.0
##FILTER=<ID=PASS,Description="All filters passed">
##Tassel=<ID=GenotypeTable,Version=5,Description="Reference allele is not known. The major allele was used as reference allele">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=AD,Number=.,Type=Integer,Description="Allelic depths for the reference and alternate alleles in the order listed">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth (only filtered reads used for calling)">
##FORMAT=<ID=GQ,Number=1,Type=Float,Description="Genotype Quality">
##FORMAT=<ID=PL,Number=G,Type=Float,Description="Normalized, Phred-scaled likelihoods for AA,AB,BB genotypes where A=ref and B=alt; not applicable if site is not biallelic">
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=.,Type=Float,Description="Allele Frequency">
##bcftools_viewVersion=1.12-57-g0c2765b+htslib-1.12-45-g1830551
##bcftools_viewCommand=view -h 20Perc.SNPs.mergedAll.vcf; Date=Tue Sep 28 09:46:59 2021
then
awk '/##FORMAT=<ID=PL/{gsub("Number=\\.","Number=G")}/##INFO=<ID=AF/{print;print "##INFO=<ID=QualityScore,Number=.,Type=Float,Description=\x22Quality score\x22>";next}{print}' file.txt
output
##fileformat=VCFv4.0
##FILTER=<ID=PASS,Description="All filters passed">
##Tassel=<ID=GenotypeTable,Version=5,Description="Reference allele is not known. The major allele was used as reference allele">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=AD,Number=.,Type=Integer,Description="Allelic depths for the reference and alternate alleles in the order listed">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth (only filtered reads used for calling)">
##FORMAT=<ID=GQ,Number=1,Type=Float,Description="Genotype Quality">
##FORMAT=<ID=PL,Number=G,Type=Float,Description="Normalized, Phred-scaled likelihoods for AA,AB,BB genotypes where A=ref and B=alt; not applicable if site is not biallelic">
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=.,Type=Float,Description="Allele Frequency">
##INFO=<ID=QualityScore,Number=.,Type=Float,Description="Quality score">
##bcftools_viewVersion=1.12-57-g0c2765b+htslib-1.12-45-g1830551
##bcftools_viewCommand=view -h 20Perc.SNPs.mergedAll.vcf; Date=Tue Sep 28 09:46:59 2021
Explanation: If the current line contains ##FORMAT=<ID=PL, change Number=. to Number=G (note that the backslashes in "Number=\\." are required to match a literal . rather than . meaning any character). If the current line contains ##INFO=<ID=AF, print it, then print ##INFO=<ID=QualityScore,Number=.,Type=Float,Description="Quality score"> (\x22 is the hex escape code for "; a bare " cannot be used inside a "-delimited string) and go to the next line. The final print handles all lines except those containing ##INFO=<ID=AF, as these are already printed by the earlier block.
(tested in gawk 4.2.1)
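On the quoting point: as far as I know, \x22 is not required by POSIX awk, so here is a small sketch of three alternatives that should behave the same in any awk. The shortened Description text is only for illustration.

```shell
# Three ways to emit a double quote from an awk string
awk 'BEGIN { print "Description=\"Quality score\">" }'                  # backslash escape
awk 'BEGIN { print "Description=\042Quality score\042>" }'              # octal escape
awk -v q='"' 'BEGIN { print "Description=" q "Quality score" q ">" }'   # passed in as a variable
```

Each command should print the same line. The backslash escape works here because the whole awk program sits inside single quotes for the shell.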
I want to replace strings in a target file (target.txt) by strings in a lookup table (lookup.tab), which looks as follows.
Seq_1 Name_one
Seq_2 Name_two
Seq_3 Name_three
...
Seq_10 Name_ten
Seq_11 Name_eleven
Seq_12 Name_twelve
The target.txt file is a large file with a tree structure (Nexus format). It is not arranged in columns.
Therefore I use the following command:
awk 'FNR==NR { array[$1]=$2; next } { for (i in array) gsub(i, array[i]) }1' "lookup.tab" "target.txt"
Unfortunately, this command does not take the full length of the elements from the first column, so that Seq_1, Seq_10, Seq_11, Seq_12 end up as Name_one, Name_one0, Name_one1, Name_one2 etc...
How can the awk command be made more specific to correctly substitute the strings?
Please try this and see if it meets your needs:
awk 'FNR==NR { le=length($1); a[le][$1]=$2; if (maxL<le) maxL=le; next } { for(le=maxL;le>0;le--) if(length(a[le])) for (i in a[le]) gsub(i, a[le][i]) }1' "lookup.tab" "target.txt"
It's based on your own attempt, but instead of replacing in the arbitrary order in which awk iterates over the array, it replaces the longer keys first. Note that it relies on gawk's arrays of arrays.
This way, based on your examples, I think it is enough to avoid wrong substitutions.
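For awks without arrays of arrays, the same longest-key-first idea can be sketched with plain indexed arrays. This is an illustration, not a drop-in replacement: the function name replace_ids and the sample files are invented, keys are matched as literal strings with index() instead of as regular expressions, and a short key can still match inside a longer ID that is missing from the lookup table.

```shell
# Hypothetical helper: replace_ids LOOKUP TARGET
replace_ids() {
  awk '
    FNR == NR {                       # first file: the lookup table
      n++; key[n] = $1; val[n] = $2
      next
    }
    !sorted {                         # once, before the first target line
      # insertion sort by key length, longest first
      for (i = 2; i <= n; i++) {
        k = key[i]; v = val[i]
        for (j = i - 1; j >= 1 && length(key[j]) < length(k); j--) {
          key[j + 1] = key[j]; val[j + 1] = val[j]
        }
        key[j + 1] = k; val[j + 1] = v
      }
      sorted = 1
    }
    {
      for (i = 1; i <= n; i++) {
        # literal (non-regex) replacement of every occurrence of key[i]
        out = ""; rest = $0
        while ((p = index(rest, key[i])) > 0) {
          out = out substr(rest, 1, p - 1) val[i]
          rest = substr(rest, p + length(key[i]))
        }
        $0 = out rest
      }
      print
    }
  ' "$1" "$2"
}

# Small made-up example in a temporary directory
tmp=$(mktemp -d)
printf 'Seq_1 Name_one\nSeq_10 Name_ten\nSeq_11 Name_eleven\n' > "$tmp/lookup.tab"
printf '(Seq_1:0.1,(Seq_10:0.2,Seq_11:0.3));\n' > "$tmp/target.txt"

replace_ids "$tmp/lookup.tab" "$tmp/target.txt"
```

With the three sample keys, Seq_10 and Seq_11 are handled before Seq_1, so the shorter key no longer clobbers the longer ones.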
If you look through my questions from the past weeks, you will find I asked questions similar to this one. I had trouble asking in the required format because I did not really know where my problems came from. E. Morton tells me not to use range expressions. Well, I do not know exactly what they are. I found many questions like mine in this forum, with working answers.
For example: "How to print following line from a match".
But all the solutions I found stop working when I process more than one input file, and I need to process many.
I use this command:
gawk -f 1.awk print*.csv > new.txt
while 1.awk contains:
BEGIN { OFS=FS=";"
pattern="row4"
}
go {print} $0 ~ pattern {go = 1}
input file 1 print1.csv contains:
row1;something;in;this;row;;;;;;;
row2;something;in;this;row;;;;;;;
row3;something;in;this;row;;;;;;;
row4;don't;need;to;match;the;whole;line,;
row5;something;in;this;row;;;;;;;
row6;something;in;this;row;;;;;;;
row7;something;in;this;row;;;;;;;
row8;something;in;this;row;;;;;;;
row9;something;in;this;row;;;;;;;
row10;something;in;this;row;;;;;;;
Input file 2, print2.csv, contains the same data, just for illustration purposes.
1.awk (and several other ways I found in this forum to print from a match) works for one file. Output:
row5;something;in;this;row;;;;;;;
row6;something;in;this;row;;;;;;;
row7;something;in;this;row;;;;;;;
row8;something;in;this;row;;;;;;;
row9;something;in;this;row;;;;;;;
row10;something;in;this;row;;;;;;;
BUT not when I process more input files.
Each time I process more than one input file this way, the awk commands to "print from match" seem to be ignored.
As I said, I was told not to use range expressions. I do not know how, and maybe the problem is linked to the way I input several files?
just reset your match indicator at the beginning of each file
$ awk 'FNR==1{p=0} p; /row4/{p=1} ' file1 file2
row5;something;in;this;row;;;;;;;
row6;something;in;this;row;;;;;;;
row7;something;in;this;row;;;;;;;
row8;something;in;this;row;;;;;;;
row9;something;in;this;row;;;;;;;
row10;something;in;this;row;;;;;;;
row5;something;in;this;row;;;;;;;
row6;something;in;this;row;;;;;;;
row7;something;in;this;row;;;;;;;
row8;something;in;this;row;;;;;;;
row9;something;in;this;row;;;;;;;
row10;something;in;this;row;;;;;;;
UPDATE
From the comments
is it possible to combine your awk with: "If $1="row5" then write in
$6="row5" and delete the value "row5" in $5? In other words, to move
content "row5" in column1, if found there, to new column 6? I could do
this with another awk but a combination into one would be nicer
... $1=="row5"{$6=$5; $5=""} ...
or, if you want to use another field instead of $5 replace $5 with the corresponding field number.
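Putting the pieces together might look like the sketch below. It assumes ";" as both input and output separator (as in the question's BEGIN block), and it places the field move before the printing rule so the modified record is what gets printed. The function name print_after_match and the small sample files are invented for the example.

```shell
# Hypothetical helper: print_after_match FILE...
print_after_match() {
  awk '
    BEGIN { FS = OFS = ";" }                 # keep ";" when a record is rebuilt
    FNR == 1 { p = 0 }                       # new file: forget the previous match
    p && $1 == "row5" { $6 = $5; $5 = "" }   # the field move from the answer
    p                                        # print lines that follow the match
    /row4/ { p = 1 }                         # start printing from the next line
  ' "$@"
}

# Two identical made-up input files
tmp=$(mktemp -d)
printf 'row3;a;b;c;d;\nrow4;x\nrow5;a;b;c;d;\nrow6;a;b;c;d;\n' > "$tmp/print1.csv"
cp "$tmp/print1.csv" "$tmp/print2.csv"

print_after_match "$tmp/print1.csv" "$tmp/print2.csv"
```

Assigning to a field makes awk rebuild the record with OFS, which is why OFS is set to ";" here; without it the modified line would come out space-separated.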
For example,
I want to add 13.29s to 2013-4-24 3:10:50.50
How do I handle the milliseconds?
I've tried to use mktime and strftime, but it seems they can only deal with whole seconds...
awk is really a powerful tool, but I don't think awk is the best choice here; I would go with GNU date.
See the test with your example data:
#add 13.29s to date 2013-4-24 3:10:50.50
kent$ date -d'+13.29 second 2013-4-24 3:10:50.50' +"%F %T.%N"
2013-04-24 03:11:03.790000000
I know there are trailing zeros for the nanoseconds, but I think it wouldn't be a problem for you to remove them.
You can invoke an external command from awk, if using awk is a must for you.
Not a simple thing to do, but here we go:
time="2013-4-24 3:10:50.50"
echo "13.29" | awk '{split(v,a,"[-:. ]");t=mktime(a[1]" "a[2]" "a[3]" "a[4]" "a[5]" "a[6])+(a[7]/100)+$1;print strftime("%Y-%m-%d %H:%M:%S",t)"."(t-int(t))*100}' v="$time"
2013-04-24 03:11:03.79
With explanation
echo "13.29" | awk '
{
split(v,a,"[-:. ]") # Split the date string into separate parts ("-" comes first so it is not read as a range)
t=mktime(a[1]" "a[2]" "a[3]" "a[4]" "a[5]" "a[6])+(a[7]/100)+$1 # Convert to epoch time, then add the fractional part (hundredths) and the offset
print strftime("%Y-%m-%d %H:%M:%S",t)"."(t-int(t))*100 # Convert back to normal time format and print it out
}
' v="$time" # Read the variable
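If gawk's mktime and strftime are not available, the fractional-second carry can also be sketched with plain arithmetic. Note that this is a different technique from the answers above and much more limited: it assumes a non-negative offset, two-digit fractions (hundredths), and a result that stays within the same calendar day. The function name add_seconds is made up for the example.

```shell
# Hypothetical helper: add_seconds "YYYY-M-D H:M:S.ff" SECONDS
add_seconds() {
  awk -v ts="$1" -v add="$2" 'BEGIN {
    # Split "2013-4-24 3:10:50.50" into date and time-of-day parts
    split(ts, dt, " ")
    split(dt[1], d, "-")
    split(dt[2], t, ":")
    # Work in integer hundredths of a second to avoid floating-point noise
    cs = int((t[1] * 3600 + t[2] * 60 + t[3] + add) * 100 + 0.5)
    if (cs < 0 || cs >= 8640000) {
      print "result leaves the original day; not handled in this sketch" > "/dev/stderr"
      exit 1
    }
    h = int(cs / 360000); cs -= h * 360000   # hours
    m = int(cs / 6000);   cs -= m * 6000     # minutes
    s = int(cs / 100);    cs -= s * 100      # seconds; cs is now the hundredths
    printf "%04d-%02d-%02d %02d:%02d:%02d.%02d\n", d[1], d[2], d[3], h, m, s, cs
  }'
}

add_seconds "2013-4-24 3:10:50.50" 13.29
# 2013-04-24 03:11:03.79
```

For anything that crosses midnight or needs time zones, the date or mktime approaches above are the safer route.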
My data looks like this:
10:15:8:6.06000000:
10:15:2:19.03400000:
10:20:8:63.50600000:
10:20:2:24.71800000:
10:25:8:33.26200000:
10:30:8:508.23400000:
20:15:8:60.06300000:
20:15:2:278.63100000:
20:20:8:561.18000000:
20:20:2:215.46600000:
20:25:8:793.36000000:
20:25:2:2347.52900000:
20:30:8:5124.98700000:
20:30:2:447.41000000:
(...)
I'd like to plot a "linespoints" plot with $1 on the x-axis, and 8 different lines representing each combination of ($2,$3), e.g.: (15,8), (15,2), ...
In order to do this sort of conditional plotting, people suggest the following:
plot 'mydata.dat' using 1:($2==15 && $3==8 ? $4 : 1/0) with linespoints title 'v=15, l=8'
However, gnuplot is unable to draw a line through these points, because the invalid value "1/0" is inserted in place of each data point for which ($2==15 && $3==8) doesn't hold.
Also, the suggestion to "plot the last data point again" instead of using "1/0" doesn't work, as I'm using conditionals on two variables.
Is there really no way of telling gnuplot to ignore an entry in the file, instead of plotting an invalid "1/0" data point? Note that replacing it by "NaN" yields the same result.
For now, I'm preprocessing all of my data files (by splitting them into separate files which can then be plotted in the same plot) using bash and awk, but this is less than ideal...
Thanks!
+1 for a great question. I (mistakenly) would have thought that what you had would work, but looking at help datafile using examples shows that I was in fact wrong. The behavior you're seeing is as documented. Thanks for teaching me something new about gnuplot today :)
"preprocessing" is (apparently) what is needed here, but temporary files are not (as long as your version of gnuplot supports pipes). Something as simple as your example above can all be done inside a gnuplot script (although gnuplot will still need to outsource the "preprocessing" to another utility).
Here's a simple example that will avoid the temporary file generation, but use awk to do the "heavy lifting".
set datafile sep ':' #split lines on ':'
plot "<awk -F: '{if($2 == 15 && $3 == 8){print $0}}' mydata.dat" u 1:4 w lp title 'v=15, l=8'
Notice the "< awk ...". Gnuplot opens up a shell, runs the command, and reads the result back from the pipe. No temporary files necessary. Of course, in this example, we could have used {print $1,$4} (instead of {print $0}) and left off the using specification altogether, e.g.:
plot "<awk -F: '{if($2 == 15 && $3 == 8){print $1,$4}}' mydata.dat" w lp title 'v=15, l=8'
will also work. Any command on your system which writes to standard output will work.
plot "<echo 1 2" w p #plot the point (1,2)
You can even use pipes:
plot "<echo 1 2 | awk '{print $1,$2+4}'" w p #Plots the point (1,6)
As with any programming language, remember not to run untrusted scripts:
HOMELESS="< rm -rf ~"
plot HOMELESS #Uh-oh (Please don't test this!!!!!)
Isn't gnuplot fun?
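To see what gnuplot would receive from such a pipe, the same kind of awk filter can be run directly in a shell first. A small sketch, with an invented helper name (filter_series) and a handful of rows from the question written to a temporary file:

```shell
# Hypothetical helper: print "x y" pairs for one (v, l) combination
filter_series() {  # usage: filter_series V L FILE
  awk -F: -v v="$1" -v l="$2" '$2 == v && $3 == l { print $1, $4 }' "$3"
}

# A few rows from the question, in a temporary file
data=$(mktemp)
cat > "$data" <<'EOF'
10:15:8:6.06000000:
10:15:2:19.03400000:
10:20:8:63.50600000:
20:15:8:60.06300000:
20:15:2:278.63100000:
EOF

filter_series 15 8 "$data"
# 10 6.06000000
# 20 60.06300000
```

Since only matching rows are printed, gnuplot sees a contiguous series and can connect the points with lines.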
...just stumbled across this old question... Well, it's not "acceptable" that you need an external tool for such a basic task when you want to plot the filtered data connected with lines or with linespoints. There is a gnuplot-native solution. The "trick" of the workaround is to plot several data points on top of each other and only change the coordinates if a new point has been found.
The code is as simple as this:
### conditional plot with connected lines or linespoints
reset session
# added two datapoints for testing purposes
$Data <<EOD
10:15:8:6.06000000:
10:15:2:19.03400000:
10:20:8:63.50600000:
10:20:2:24.71800000:
10:25:8:33.26200000:
10:30:8:508.23400000:
13:20:8:8.88888888:
15:15:8:9.99999999:
20:15:8:60.06300000:
20:15:2:278.63100000:
20:20:8:561.18000000:
20:20:2:215.46600000:
20:25:8:793.36000000:
20:25:2:2347.52900000:
20:30:8:5124.98700000:
20:30:2:447.41000000:
EOD
set datafile separator ":"
x0 = y0 = NaN
plot $Data u ($2==15 && $3==8 ? (y0=$4,x0=$1) : x0):(y0) w lp pt 7
### end of code
Result:
Addition:
Just for completeness: set datafile missing "NaN" actually solves the problem in gnuplot 5.x. But since this question is from gnuplot 4.6 times, and some people still seem to plot with version 4.x...
SO_Filter.dat
# added two datapoints for testing purposes
10:15:8:6.06000000:
10:15:2:19.03400000:
10:20:8:63.50600000:
10:20:2:24.71800000:
10:25:8:33.26200000:
10:30:8:508.23400000:
13:20:8:8.88888888:
15:15:8:9.99999999:
20:15:8:60.06300000:
20:15:2:278.63100000:
20:20:8:561.18000000:
20:20:2:215.46600000:
20:25:8:793.36000000:
20:25:2:2347.52900000:
20:30:8:5124.98700000:
20:30:2:447.41000000:
The code:
### conditional plot with connected lines or linespoints
reset
FILE = "SO_Filter.dat"
set datafile separator ":"
set multiplot layout 2,1 title "generated with gnuplot 4.6"
# this works with gnuplot 4.x and 5.x
x0 = y0 = NaN
plot FILE u ($2==15 && $3==8 ? (y0=$4,x0=$1) : x0):(y0) w lp pt 7 ti "works with gnuplot >4.x and 5.x"
# this works with gnuplot >5.x
set datafile missing "NaN"
plot FILE u ($2==15 && $3==8 ? $1 : NaN ):4 w lp pt 7 ti "works only with gnuplot >5.x"
unset multiplot
### end of code
Result in gnuplot 4.6: