Why is my awk command not printing within the range I specified? - awk

I am trying to get awk to print the lines that have the values in column 2 between 71395943 - 72282539. Below is the command I ran.
gzip -cd ALL.wgs.integrated_phase1_v3.20101123.snps_indels_sv.sites.vcf.gz | awk {'if($2-1>="71395943" && $2-1<="72282539" && $2-2>="71395943" && $2-2<="72282539")print $1"\t"$2-1"\t"$2"\t"$3"\t"$8"\t.\t+"'} > negr1_var.bed
and this is part of the output. All of the output starts with 7 but it is a lot smaller than the range I had specified. I am still new to using awk and would really appreciate any insight or an alternative method to accomplish the same thing. Thank you in advance!
1 72118 72119 rs199639004 AA=.;AC=8;AF=0.0037;AMR_AF=0.0028;AN=2184;ASN_AF=0.01;AVGPOST=0.9589;ERATE=0.0026;EUR_AF=0.0013;LDAF=0.0243;RSQ=0.2268;THETA=0.0016;VT=INDEL . +
1 72147 72148 rs182862337 AN=2184;RSQ=0.2794;THETA=0.0130;VT=SNP;AA=.;LDAF=0.0019;AVGPOST=0.9971;SNPSOURCE=LOWCOV;AC=1;ERATE=0.0007;AF=0.0005;AMR_AF=0.0028 . +
1 713976 713977 rs74512038 ERATE=0.0004;AN=2184;VT=SNP;AA=.;AC=155;THETA=0.0019;AVGPOST=0.9916;SNPSOURCE=LOWCOV;LDAF=0.0723;RSQ=0.9544;AF=0.07;ASN_AF=0.22;AMR_AF=0.07;AFR_AF=0.01;EUR_AF=0.0040 . +
Here is an example of the desired output
1 71396733 713957241 rs74512038 ERATE=0.0004;AN=2184;VT=SNP;AA=.;AC=155;THETA=0.0019;AVGPOST=0.9916;SNPSOURCE=LOWCOV;LDAF=0.0723;RSQ=0.9544;AF=0.07;ASN_AF=0.22;AMR_AF=0.07;AFR_AF=0.01;EUR_AF=0.0040 . +
Example of input. The file is pretty large, 10582-10583 is where it starts and it ends at 249000000. I just want the lines between 71395943 - 72282539.
1 10582 10583 rs58108140 AVGPOST=0.7707;RSQ=0.4319;LDAF=0.2327;ERATE=0.0161;AN=2184;VT=SNP;AA=.;THETA=0.0046;AC=314;SNPSOURCE=LOWCOV;AF=0.14;ASN_AF=0.13;AMR_AF=0.17;AFR_AF=0.04;EUR_AF=0.21 . +
1 10610 10611 rs189107123 AN=2184;THETA=0.0077;VT=SNP;AA=.;AC=41;ERATE=0.0048;SNPSOURCE=LOWCOV;AVGPOST=0.9330;LDAF=0.0479;RSQ=0.3475;AF=0.02;ASN_AF=0.01;AMR_AF=0.03;AFR_AF=0.01;EUR_AF=0.02 . +
1 13301 13302 rs180734498 THETA=0.0048;AN=2184;AC=249;VT=SNP;AA=.;RSQ=0.6281;LDAF=0.1573;SNPSOURCE=LOWCOV;AVGPOST=0.8895;ERATE=0.0058;AF=0.11;ASN_AF=0.02;AMR_AF=0.08;AFR_AF=0.21;EUR_AF=0.14 . +
1 13326 13327 rs144762171 AVGPOST=0.9698;AN=2184;VT=SNP;AA=.;RSQ=0.6482;AC=59;SNPSOURCE=LOWCOV;ERATE=0.0012;LDAF=0.0359;THETA=0.0204;AF=0.03;ASN_AF=0.02;AMR_AF=0.03;AFR_AF=0.02;EUR_AF=0.04 . +
example of current output

if($2-1>="71395943" && $2-1<="72282539" && $2-2>="71395943" &&
$2-2<="72282539")
You should not use string literals when you want a numeric comparison. As the documentation on Comparison Operators says:
When comparing operands of mixed types, numeric operands are converted
to strings using the value of CONVFMT
so in effect you get a comparison in lexicographical order. Consider that
awk 'END{print 20<=100;print 20<="100";print "20"<="100"}' emptyfile.txt
gives output
1
0
0
Explanation: when comparing numbers, the condition holds (20 is less than 100). Comparing a number with a string constant, however, works the same as comparing string with string, and the condition does not hold because the first character 2 has a higher ASCII code than 1 (0x32 vs 0x31).
(tested in gawk 4.2.1)
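Applied to the question, the fix is to drop the quotes around the bounds. The following is only a minimal sketch: it assumes the intent is to keep rows whose BED start ($2-1) lies inside the range, uses made-up two-column sample rows instead of the real VCF, and the variable names lo, hi and start are illustrative.

```shell
# Hypothetical sample: tab-separated chromosome and position columns only.
in_range=$(printf '1\t72118\n1\t713976\n1\t71395944\n1\t72282540\n1\t72282541\n' |
  awk -F'\t' -v lo=71395943 -v hi=72282539 '
    # lo and hi are passed with -v and look numeric, so the comparison is numeric
    { start = $2 - 1 }
    start >= lo && start <= hi { print $1 "\t" start "\t" $2 }
  ')
printf '%s\n' "$in_range"
```

Only the rows with positions 71395944 and 72282540 should survive, whereas the quoted bounds in the question also let through short values such as 713976.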

Related

extract specific row with numbers over N

I have a dataframe like this
1 3 MAPQ=0;CT=3to5;SRMAPQ=60
2 34 MAPQ=60;CT=3to5;SRMAPQ=67
4 56 MAPQ=67;CT=3to5;SRMAPQ=50
5 7 MAPQ=44;CT=3to5;SRMAPQ=61
Using awk (or another tool), I want to extract only the rows where SRMAPQ is over 60.
This means the output is
2 34 MAPQ=60;CT=3to5;SRMAPQ=67
5 7 MAPQ=44;CT=3to5;SRMAPQ=61
update: "SRMAPQ=60" can be anywhere in the line,
MAPQ=44;CT=3to5;SRMAPQ=61;DT=3to5
You don't have to extract the value out of SRMAPQ separately and do the comparison. If the format is fixed like above, just use = as the field separator and access the last field using $NF
awk -F= '$NF > 60' file
Or if SRMAPQ can occur anywhere in the line (as updated in the comments), use a generic approach
awk '{ v = 0 } match($0, /SRMAPQ=[0-9]+/){ l = length("SRMAPQ="); v = substr($0, RSTART+l, RLENGTH-l) + 0 } v > 60' file
I would use GNU AWK in the following way. Let the content of file.txt be
1 3 MAPQ=0;CT=3to5;SRMAPQ=60
2 34 MAPQ=60;CT=3to5;SRMAPQ=67;SOMETHING=2
4 56 MAPQ=67;CT=3to5;SRMAPQ=50
5 7 MAPQ=44;CT=3to5;SRMAPQ=61
then
awk 'BEGIN{FS="SRMAPQ="}$2+0>60' file.txt
output
2 34 MAPQ=60;CT=3to5;SRMAPQ=67;SOMETHING=2
5 7 MAPQ=44;CT=3to5;SRMAPQ=61
Note: I added SOMETHING to test whether it works when SRMAPQ is not last. Explanation: I set FS to SRMAPQ=, so what is before it becomes the first field ($1) and what is after it becomes the second field ($2). In the 2nd line this is 67;SOMETHING=2; adding 0 makes awk convert the longest leading prefix that forms a number, in this case 67, so the comparison is numeric. Without the +0, a field such as 67;SOMETHING=2 does not look like a number and would be compared with 60 as a string. The other lines have just numbers. Disclaimer: this solution assumes that all fields but the last have a trailing ;. If this does not hold, please test my solution fully before use.
(tested in gawk 4.2.1)
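One pitfall worth checking with either approach: a value cut out with substr() is a string, so comparing it directly with 60 is a string comparison, and values such as 100 or 9 come out wrong. The sketch below uses only POSIX awk features and forces a number with + 0; the sample rows (including the SRMAPQ=100 and SRMAPQ=9 lines and the line without SRMAPQ) are invented for illustration.

```shell
sample='1 3 MAPQ=0;CT=3to5;SRMAPQ=60
2 34 MAPQ=60;CT=3to5;SRMAPQ=67;SOMETHING=2
4 56 SRMAPQ=100;MAPQ=67;CT=3to5
5 7 MAPQ=44;CT=3to5
6 8 MAPQ=44;CT=3to5;SRMAPQ=9'

kept=$(printf '%s\n' "$sample" | awk '
  # "SRMAPQ=" is 7 characters long; the digits follow it.
  match($0, /SRMAPQ=[0-9]+/) {
    v = substr($0, RSTART + 7, RLENGTH - 7) + 0   # + 0 forces a number
    if (v > 60) print
  }')
printf '%s\n' "$kept"
```

Lines without SRMAPQ are skipped because match() returns 0 for them, and nothing is carried over from a previous line.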

Adding a decimal point to an integer with awk or sed

So, I have csv files to use with hledger, and last field of every row is the amount for that line transaction.
Lines are in the following format:
date1, date2, description, amount
The amount can be anywhere between 4 and 6 digits long; for some reason, all amounts are missing the period before the last two digits.
Now: 1000
Should be: 10.00
Now: 25452
Should be: 254.52
How to add a '.' before the last two digits of all lines, preferably with sed/awk?
So the input file is:
16.12.2005,18.12.2005,ATM,2000
17.12.2005,18.12.2005,utility,12523
18.12.2005,20.12.2005,salary,459023
desired output
16.12.2005,18.12.2005,ATM,20.00
17.12.2005,18.12.2005,utility,125.23
18.12.2005,20.12.2005,salary,4590.23
Thanks
You could try:
awk -F , '{printf "%s,%s,%s,%.2f\n", $1, $2, $3, $4/100.0}'
You should always add a sample of your input file and of the output you want in your question.
For the input you provide, you will have to define what should happen when the description field contains a , and whether amounts of less than 100 are possible as input.
Depending on your answer, the code may or may not need adapting.
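Both caveats can be tried out on a small sample. This is a sketch under two assumptions that the question does not confirm: the amount is always the last field, and amounts below 100 should become 0.xx. The third row (the fee of 5) is invented to exercise the short-amount case.

```shell
amounts='16.12.2005,18.12.2005,ATM,2000
17.12.2005,18.12.2005,utility,12523
19.12.2005,19.12.2005,fee,5'

# Rewrite only the last field, so extra commas in the description do not shift it.
fixed=$(printf '%s\n' "$amounts" |
  awk 'BEGIN { FS = OFS = "," } { $NF = sprintf("%.2f", $NF / 100) } 1')
printf '%s\n' "$fixed"
```

Dividing by 100 turns 5 into 0.05, whereas a purely textual approach that inserts a period before the last two characters needs at least two digits to work with.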
sed 's/..$/.&/'
You can also use the cut utility to get the desired output. In your case, you always want to add '.' before the last two digits, so it can be thought of as follows:
Step 1: Get all the characters from the beginning till the last 2 characters.
Step 2: Get the last 2 characters from the end.
Step 3: Concatenate them with the character that you want ('.' in this case).
The corresponding command for each of the step is the following:
$ a='17.12.2005,18.12.2005,utility,12523'
$ b=$(echo "$a" | rev | cut -c3- | rev)
$ c=$(echo "$a" | rev | cut -c1-2 | rev)
$ echo "$b.$c"
This would produce the output
17.12.2005,18.12.2005,utility,125.23
16.12.2005,18.12.2005,ATM,20.00
17.12.2005,18.12.2005,utility,125.23
18.12.2005,20.12.2005,salary,4590.23
awk -F, '{sub(/..$/,".&")}1' file

awk compare two elements in the same line with regular expression

I have very long files where I have to compare two chromosome numbers present in the same line. I would like to use awk to create a file that contains only the lines where the chromosome numbers are different.
Here is the example of my file:
CHROM ALT
1 ]1:1234567]T
1 T[1:2345678[
1 A[12:3456789[
2 etc...
In this example, I wish to compare the number of the chromosome (here '1' in the CHROM column) and the number that is between the first bracket ([ or ]) and the ":" symbol. If these numbers are different, I wish to print the corresponding line.
Here, the result should be like this:
1 A[12:3456789[
Thank you for your help.
$ awk -F'[][]' '$1+0 != $2+0' file
1 A[12:3456789[
2 etc...
This requires GNU awk for the 3 argument match() function:
gawk 'match($2, /[][]([0-9]+):/, a) && $1 != a[1]' file
Thanks again for the different answers.
Here is what my data looks like with several columns:
CHROM POS ID REF ALT
1 1000000 123:1 A ]1:1234567]T
1 2000000 456:1 A T[1:2345678[
1 3000000 789:1 T A[12:3456789[
2 ... ... . ...
My question is: how do I modify the previous code, when I have several columns?
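One possible adaptation for the multi-column layout is to stop relying on the field separator and instead look only at the ALT column. The sketch below assumes whitespace-separated columns with ALT in field 5, a single header line, and purely numeric chromosome names; the variable name mate is illustrative. It uses the POSIX two-argument match(), so it does not need gawk.

```shell
vcf='CHROM POS ID REF ALT
1 1000000 123:1 A ]1:1234567]T
1 2000000 456:1 A T[1:2345678[
1 3000000 789:1 T A[12:3456789['

diff_chrom=$(printf '%s\n' "$vcf" | awk '
  # Skip the header, then look only at the ALT column (field 5 here).
  NR > 1 && match($5, /[][][0-9]+:/) {
    mate = substr($5, RSTART + 1, RLENGTH - 2) + 0   # digits between bracket and ":"
    if (mate != $1 + 0) print
  }')
printf '%s\n' "$diff_chrom"
```

Restricting the match to $5 keeps the colon in the ID column (123:1) from interfering. Chromosome names such as X or Y would need a string comparison and a wider character class instead of the + 0 conversion.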

In a CSV file, subtotal 2 columns based on a third one, using AWK in KSH

Disclaimers:
1) English is my second language, so please forgive any grammatical horrors you may find. I am pretty confident you will be able to understand what I need despite these.
2) I have found several examples in this site that address questions/problems similar to mine, though I was unfortunately not able to figure out the modifications that would need to be introduced to fit my needs.
The "Problem":
I have a CSV file that looks like this:
c1,c2,c3,c4,c5,134.6,,c8,c9,SERVER1,c11
c1,c2,c3,c4,c5,0,,c8,c9,SERVER1,c11
c1,c2,c3,c4,c5,0.18,,c8,c9,SERVER2,c11
c1,c2,c3,c4,c5,0,,c8,c9,SERVER2,c11
c1,c2,c3,c4,c5,416.09,,c8,c9,SERVER3,c11
c1,c2,c3,c4,c5,0,,c8,c9,SERVER3,c11
c1,c2,c3,c4,c5,12.1,,c8,c9,SERVER3,c11
c1,c2,c3,c4,c5,480.64,,c8,c9,SERVER4,c11
c1,c2,c3,c4,c5,,83.65,c8,c9,SERVER5,c11
c1,c2,c3,c4,c5,,253.15,c8,c9,SERVER6,c11
c1,c2,c3,c4,c5,,18.84,c8,c9,SERVER7,c11
c1,c2,c3,c4,c5,,8.12,c8,c9,SERVER7,c11
c1,c2,c3,c4,c5,,22.45,c8,c9,SERVER7,c11
c1,c2,c3,c4,c5,,117.81,c8,c9,SERVER8,c11
c1,c2,c3,c4,c5,,96.34,c8,c9,SERVER9,c11
Complementary facts:
1) File has 11 columns.
2) The data in columns 1, 2, 3, 4, 5, 8, 9 and 11 is irrelevant in this case. In other words, I will only work with columns 6, 7 and 10.
3) Column 10 will be typically alphanumeric strings (server names), though it may contain also "-" and/or "_".
4) Columns 6 and 7 will have exclusively numbers, with up to two decimal places (A possible value is 0). Only one of the two will have data per line, never both.
What I need as an output:
- A single occurrence of every string in column 10 (as column 1), then the sum (subtotal) of its values in column 6 (as column 2) and last, the sum (subtotal) of its values in column 7 (as column 3).
- If the total for a field is "0" the field must be left empty, but still must exist (its respective comma has to be printed).
- **Note** that the strings in column 10 will be already alphabetically sorted, so there is no need to do that part of the processing with AWK.
Output sample, using the sample above as an input:
SERVER1,134.6,,
SERVER2,0.18,,
SERVER3,428.19,,
SERVER4,480.64,,
SERVER5,,83.65
SERVER6,,253.15
SERVER7,,26.96
I've already found within these pages not one, but two AWK one-liners that PARTIALLY accomplish what I need:
awk -F "," 'NR==1{last=$10; sum=0;}{if (last != $10) {print last "," sum; last=$10; sum=0;} sum += $6;}END{print last "," sum;}' inputfile
awk -F, '{a[$10]+=$6;}END{for(i in a)print i","a[i];}' inputfile
My "problems" in both cases are the same:
- Subtotals of 0 are printed.
- I can only handle the sum of one column at a time. Whenever I try to add the second one, I get either a syntax error or it does simply not print the third column at all.
Thanks in advance for your support people!
Regards,
Martín
something like this?
$ awk 'BEGIN{FS=OFS=","}
{s6[$10]+=$6; s7[$10]+=$7}
END{for(k in s6) print k,(s6[k]?s6[k]:""),(s7[k]?s7[k]:"")}' file | sort
SERVER1,134.6,
SERVER2,0.18,
SERVER3,428.19,
SERVER4,480.64,
SERVER5,,83.65
SERVER6,,253.15
SERVER7,,49.41
SERVER8,,117.81
SERVER9,,96.34
note that your treatment of commas is not consistent, you're adding an extra one when the last field is zero (count the commas)
Your posted expected output doesn't seem to match your posted sample input so we're guessing but this might be what you're looking for:
$ cat tst.awk
BEGIN { FS=OFS="," }
$10 != prev {
    if (NR > 1) {
        print prev, sum6, sum7
    }
    sum6 = sum7 = ""
    prev = $10
}
$6 { sum6 += $6 }
$7 { sum7 += $7 }
END { print prev, sum6, sum7 }
$ awk -f tst.awk file
SERVER1,134.6,
SERVER2,0.18,
SERVER3,428.19,
SERVER4,480.64,
SERVER5,,83.65
SERVER6,,253.15
SERVER7,,49.41
SERVER8,,117.81
SERVER9,,96.34

awk greater than why show string value?

I am using this command
awk '$1 > 3 {print $1}' file;
file :
String
2
4
5
6
7
String
output this;
String
4
5
6
7
String
Why is the result not only numbers, as below?
4
5
6
7
This happens because one side of the comparison is a string, so awk does a string comparison, and the numeric value of the character 'S' is greater than that of the character '3'.
$ printf "3: %d S: %d\n" \'3 \'S
3: 51 S: 83
Note: the ' before the arguments passed to printf are important, as they trigger the conversion to the numeric value in the underlying codeset:
If the leading character is a single-quote or double-quote, the value shall be the numeric value in the underlying codeset of the character following the single-quote or double-quote.
We write \' so that the ' is passed to printf, rather than being interpreted as syntax by the shell (a plain ' would open/close a string literal).
Returning to the question, to get the desired behaviour, you need to convert the first field to a number:
awk '+$1 > 3 { print $1 }' file
I am using the unary plus operator to convert the field to a number. Alternatively, some people prefer to simply add 0.
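As a quick check of the add-0 variant, here is a small sketch with made-up input. The value 10 is included on purpose, since it is greater than 3 numerically but would sort before "3" in a string comparison.

```shell
values='String
2
4
10
String'

# Adding 0 converts the field to a number; non-numeric text such as "String" becomes 0.
numeric_only=$(printf '%s\n' "$values" | awk '$1 + 0 > 3 { print $1 }')
printf '%s\n' "$numeric_only"
```

Note that the conversion keeps any leading numeric prefix, so a field like 5abc would count as 5. If that matters, a regex test on the field can be added to the condition.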
Taken from the awk user guide...
ftp://ftp.gnu.org/old-gnu/Manuals/gawk-3.0.3/html_chapter/gawk_8.html
When comparing operands of mixed types, numeric operands are converted
to strings using the value of CONVFMT. ... CONVFMT's default value is
"%.6g", which prints a value with at least six significant digits.
So, basically they are all treated as strings, and "String" happens to be greater than "3".