jq while loop with variable produces error: Invalid path expression with result 0

Consider
$ jq -n '0 as $i | while($i<3; $i+=1)'
null
jq: error (at <unknown>): Invalid path expression with result 0
I was expecting output of something like 0 1 2 but I get this error.
What am I doing wrong?

You are using variables wrong. You cannot just assign a value to a variable like $i = 1 or update it like $i += 1.
Variables behave similarly to other functions/filters, in the sense that they produce their value when evaluated, and that they have to be defined using the as construction (just like def for functions) before they can be used. The assignment operator = and its update variants like +=, on the other hand, only work on the context or a subset of the context. You can set . = 1 (which can more simply be written as just 1), or, if appropriate, a given subset like .a[0].b += 1.
You may want to have a look at the variables section, and the assignment section of the manual; especially the part in the assignment section reading
Note that the LHS of assignment operators refers to a value in .. Thus $var.foo = 1 won't work as expected ($var.foo is not a valid or useful path expression in .); use $var | .foo = 1 instead.
Your example without variables works as expected. The context is being modified during the loop:
$ jq -n '0 | while(.<3; .+1)'
0
1
2
Referring to this part in the variable section reading
The expression exp as $x | ... means: for each value of expression exp, run the rest of the pipeline with the entire original input, and with $x set to that value. Thus as functions as something of a foreach loop.
you can implement that loop using a variable like so:
$ jq -n '(0,1,2) as $i | $i'
0
1
2
Or so:
$ jq -n 'range(3) as $i | $i'
0
1
2
Note that the variable definition only takes care of the looping through the values (by producing multiple contexts). The generation of the different values needs to be taken care of separately. And here, you can employ your while loop:
$ jq -n '0 | while(.<3; .+1) as $i | $i'
0
1
2
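The manual's advice about $var | .foo = 1 can be tried directly. This is only a minimal sketch, assuming jq is on the PATH; the function name set_foo and the object {"foo": 0} are made up for illustration:

```shell
# Hypothetical helper: bind an object to $var, then update one of its fields.
set_foo() {
  # Per the manual, $var.foo = 1 does not work because the left-hand side
  # must be a path in the context; piping $var makes it the context first.
  jq -nc '{"foo": 0} as $var | $var | .foo = 1'
}

set_foo
```

This should print {"foo":1}. The variable itself is never modified; the update is applied to the copy that became the context.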

Related

Why is my awk command not printing within the range I specified?

I am trying to get awk to print the lines that have the values in column 2 between 71395943 - 72282539. Below is the command I ran.
gzip -cd ALL.wgs.integrated_phase1_v3.20101123.snps_indels_sv.sites.vcf.gz | awk {'if($2-1>="71395943" && $2-1<="72282539" && $2-2>="71395943" && $2-2<="72282539")print $1"\t"$2-1"\t"$2"\t"$3"\t"$8"\t.\t+"'} > negr1_var.bed
and this is part of the output. All of the output starts with 7 but it is a lot smaller than the range I had specified. I am still new to using awk and would really appreciate any insight or an alternative method to accomplish the same thing. Thank you in advance!
1 72118 72119 rs199639004 AA=.;AC=8;AF=0.0037;AMR_AF=0.0028;AN=2184;ASN_AF=0.01;AVGPOST=0.9589;ERATE=0.0026;EUR_AF=0.0013;LDAF=0.0243;RSQ=0.2268;THETA=0.0016;VT=INDEL . +
1 72147 72148 rs182862337 AN=2184;RSQ=0.2794;THETA=0.0130;VT=SNP;AA=.;LDAF=0.0019;AVGPOST=0.9971;SNPSOURCE=LOWCOV;AC=1;ERATE=0.0007;AF=0.0005;AMR_AF=0.0028 . +
1 713976 713977 rs74512038 ERATE=0.0004;AN=2184;VT=SNP;AA=.;AC=155;THETA=0.0019;AVGPOST=0.9916;SNPSOURCE=LOWCOV;LDAF=0.0723;RSQ=0.9544;AF=0.07;ASN_AF=0.22;AMR_AF=0.07;AFR_AF=0.01;EUR_AF=0.0040 . +
Here is an example of the desired output
1 71396733 713957241 rs74512038 ERATE=0.0004;AN=2184;VT=SNP;AA=.;AC=155;THETA=0.0019;AVGPOST=0.9916;SNPSOURCE=LOWCOV;LDAF=0.0723;RSQ=0.9544;AF=0.07;ASN_AF=0.22;AMR_AF=0.07;AFR_AF=0.01;EUR_AF=0.0040 . +
Example of input. The file is pretty large, 10582-10583 is where it starts and it ends at 249000000. I just want the lines between 71395943 - 72282539.
1 10582 10583 rs58108140 AVGPOST=0.7707;RSQ=0.4319;LDAF=0.2327;ERATE=0.0161;AN=2184;VT=SNP;AA=.;THETA=0.0046;AC=314;SNPSOURCE=LOWCOV;AF=0.14;ASN_AF=0.13;AMR_AF=0.17;AFR_AF=0.04;EUR_AF=0.21 . +
1 10610 10611 rs189107123 AN=2184;THETA=0.0077;VT=SNP;AA=.;AC=41;ERATE=0.0048;SNPSOURCE=LOWCOV;AVGPOST=0.9330;LDAF=0.0479;RSQ=0.3475;AF=0.02;ASN_AF=0.01;AMR_AF=0.03;AFR_AF=0.01;EUR_AF=0.02 . +
1 13301 13302 rs180734498 THETA=0.0048;AN=2184;AC=249;VT=SNP;AA=.;RSQ=0.6281;LDAF=0.1573;SNPSOURCE=LOWCOV;AVGPOST=0.8895;ERATE=0.0058;AF=0.11;ASN_AF=0.02;AMR_AF=0.08;AFR_AF=0.21;EUR_AF=0.14 . +
1 13326 13327 rs144762171 AVGPOST=0.9698;AN=2184;VT=SNP;AA=.;RSQ=0.6482;AC=59;SNPSOURCE=LOWCOV;ERATE=0.0012;LDAF=0.0359;THETA=0.0204;AF=0.03;ASN_AF=0.02;AMR_AF=0.03;AFR_AF=0.02;EUR_AF=0.04 . +
example of current output
if($2-1>="71395943" && $2-1<="72282539" && $2-2>="71395943" &&
$2-2<="72282539")
You should not use string literals when you want a numerical comparison. As the Comparison Operators documentation states:
When comparing operands of mixed types, numeric operands are converted
to strings using the value of CONVFMT
so in effect you get a comparison in lexicographical order. Consider that
awk 'END{print 20<=100;print 20<="100";print "20"<="100"}' emptyfile.txt
gives output
1
0
0
Explanation: when comparing two numbers the condition holds (20 is less than 100); however, comparing a number with a string works the same as comparing a string with a string, and there the condition does not hold, because the first character 2 has a bigger ASCII code than 1 (0x32 vs 0x31).
(tested in gawk 4.2.1)
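Applied to the range filter from the question, dropping the quotes around the bounds should be enough. The following is a sketch rather than a drop-in replacement: it tests column 2 itself instead of $2-1 and $2-2, prints fewer columns, and the function name in_range and the sample rows are invented for the demonstration.

```shell
# Hypothetical filter: keep rows whose position (column 2) lies in the range.
in_range() {
  awk -F'\t' -v OFS='\t' '
    # Unquoted constants are numbers, so awk compares numerically.
    $2 >= 71395943 && $2 <= 72282539 { print $1, $2 - 1, $2, $3 }
  '
}

# Made-up sample rows (CHROM, POS, ID); only the middle two are in range.
printf '1\t72119\trsA\n1\t71395943\trsB\n1\t72000000\trsC\n1\t72282540\trsD\n' | in_range
```

Only the rsB and rsC rows should come out; 72119 no longer slips through just because its text sorts between the two bounds.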

awk compare two elements in the same line with regular expression

I have very long files where I have to compare two chromosome numbers present in the same line. I would like to use awk to create a file that keeps only the lines where the chromosome numbers are different.
Here is the example of my file:
CHROM ALT
1 ]1:1234567]T
1 T[1:2345678[
1 A[12:3456789[
2 etc...
In this example, I wish to compare the number of the chromosome (here '1' in the CHROM column) and the number that is between the first bracket ([ or ]) and the ":" symbol. If these numbers are different, I wish to print the corresponding line.
Here, the result should be like this:
1 A[12:3456789[
Thank you for your help.
$ awk -F'[][]' '$1+0 != $2+0' file
1 A[12:3456789[
2 etc...
This requires GNU awk for the 3 argument match() function:
gawk 'match($2, /[][]([0-9]+):/, a) && $1 != a[1]' file
Thanks again for the different answers.
Here is what my data looks like with several columns:
CHROM POS ID REF ALT
1 1000000 123:1 A ]1:1234567]T
1 2000000 456:1 A T[1:2345678[
1 3000000 789:1 T A[12:3456789[
2 ... ... . ...
My question is: how do I modify the previous code, when I have several columns?
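One possible adaptation for the multi-column layout, offered as an untested-on-real-data sketch: it assumes ALT is always the fifth whitespace-separated column and uses only the two-argument match(), so it should not need GNU awk. The function name diff_chrom is made up.

```shell
# Hypothetical helper: print rows whose CHROM (column 1) differs from the
# chromosome named inside the brackets of ALT (column 5).
diff_chrom() {
  awk 'match($5, /[][][0-9]+:/) {
    # The match spans bracket, digits, colon; strip the first and last char.
    mate = substr($5, RSTART + 1, RLENGTH - 2)
    if ($1 != mate) print
  }'
}

diff_chrom <<'EOF'
CHROM POS ID REF ALT
1 1000000 123:1 A ]1:1234567]T
1 2000000 456:1 A T[1:2345678[
1 3000000 789:1 T A[12:3456789[
EOF
```

With the sample above only the A[12:3456789[ row is printed. Restricting the search to $5 keeps the colon in the ID column (123:1) from interfering, and rows without a bracketed chromosome, such as the header, are skipped.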

Awk, order foreach 12 lines to insert query

I have the following script:
curl -s 'https://someonepage=5m' | jq '.[]|.[0],.[1],.[2],.[3],.[4],.[5],.[6],.[7],.[8],.[9],.[10],.[11],.[12]' | perl -p -e 's/\"//g' |awk '/^[0-9]/{print; if (++onr%12 == 0) print ""; }'
This is part of result:
1517773500000
0.10250100
0.10275700
0.10243500
0.10256600
257.26700000
1517773799999
26.38912220
1229
104.32200000
10.70579910
0
1517773800000
0.10256600
0.10268000
0.10231600
0.10243400
310.64600000
1517774099999
31.83806883
1452
129.70500000
13.29758266
0
1517774100000
0.10243400
0.10257500
0.10211800
0.10230000
359.06300000
1517774399999
36.73708621
1296
154.78500000
15.84041910
0
I want to insert this data in a MySQL database. I want for each line this result:
(1517773800000,0.10256600,0.10268000,0.10231600,0.10243400,310.64600000,1517774099999,31.83806883,1452,129.70500000,13.29758266,0)
(1517774100000,0.10243400,0.10257500,0.10211800,0.10230000,359.06300000,151774399999,36.73708621,1296,154.78500000,15.84041910,0)
I need to merge every 12 lines into one line. Can anyone help me get this result?
Here's an all-jq solution:
.[] | .[0:12] | @tsv | gsub("\t";",") | "(\(.))"
In the sample, all the subarrays have length 12, so you might be able to drop the .[0:12] part of the pipeline. If using jq 1.5 or later, you could use join(",") instead of the @tsv|gsub portion of the pipeline. You might, for example, want to consider:
.[] | join(",") | "(\(.))"   # jq 1.5 or later
Invocation: use the -r command-line option
Sample output:
(1517627400000,0.10452300,0.10499000,0.10418200,0.10449400,819.50400000,1517627699999,85.57150693,2340,452.63400000,47.27213035,0)
(1517627700000,0.10435700,0.10449200,0.10366000,0.10370000,717.37000000,1517627999999,74.60582079,1996,321.25500000,33.42273846,0)
(1517628000000,0.10376600,0.10390000,0.10366000,0.10370400,519.59400000,1517628299999,53.88836170,1258,239.89300000,24.88613854,0)
$ awk 'BEGIN {RS=""; OFS=","} {$1=$1; $0="("$0")"}1' file
(1517773500000,0.10250100,0.10275700,0.10243500,0.10256600,257.26700000,1517773799999,26.38912220,1229,104.32200000,10.70579910,0)
(1517773800000,0.10256600,0.10268000,0.10231600,0.10243400,310.64600000,1517774099999,31.83806883,1452,129.70500000,13.29758266,0)
(1517774100000,0.10243400,0.10257500,0.10211800,0.10230000,359.06300000,1517774399999,36.73708621,1296,154.78500000,15.84041910,0)
RS="":
Treat groups of lines separated by one or more blank lines as a record
OFS=","
Set the output separator to be a ","
$1=$1
Reconstitute the line, replacing the input separators with the output separator
$0="("$0")"
Surround the record with parens
1
Print the record
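If the blank separator lines are not available, the grouping can also be done by counting lines. A minimal sketch, assuming every tuple is exactly n consecutive lines; the helper name group_lines and its parameter are hypothetical, and a trailing incomplete group is silently dropped:

```shell
# Hypothetical helper: join every n input lines into one "(a,b,...)" tuple.
group_lines() {
  awk -v n="${1:-12}" '
    # Start a new tuple on the first line of each group, otherwise append.
    { buf = ((NR - 1) % n == 0 ? "(" : buf ",") $0 }
    # Close and print the tuple once the group is complete.
    NR % n == 0 { print buf ")" }
  '
}

printf '%s\n' 1 2 3 4 5 6 | group_lines 3
```

The demo prints (1,2,3) and (4,5,6); calling group_lines with no argument groups by 12, matching the data in the question.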

How do I check if input is one word or 2 separated by delimiter "-"

I need help with the following ksh script:
ExpResult=`echo "$LoadString" | awk -F"-" '{print NF}'=2`
MinExp=`echo "$ExpResult" | tr -s " " | sed 's/^[ ]//g'| cut -d"-" -f1`
MaxExp=`echo "$ExpResult" | tr -s " " | sed 's/^[ ]//g'| cut -d"-" -f2`
I can get an input as two options : "50-100" or "50" (for example)
I have two questions:
How do I check if the input is "one word" or two words separated by delimiter "-"?
If the input is two words, how can I separate them?
Rather than call an external program to parse your input, you can use the internal case statement to validate input and parameter expansion features to convert your input, i.e.
# set a copy/paste value for $1
set -- 50-100
case "$1" in
*-* )
range="$1"
min="${range%-*}"
max="${range#*-}"
;;
* )
singleNum="$1"
;;
esac
echo min=$min ... max=$max
output
min=50 ... max=100
Try for non-pair
unset min max
set -- other values
case ...
echo min= ... max= ... singleNum=$singleNum
output
min= ... max= ... singleNum=other
Hopefully the case processing is self-explanatory, but the parameter expansion may require a little explanation.
The statement
min=${range%-*}
says remove from the right side of the expanded value (50-100) anything starting at the last - until the end of the string. This leaves the value 50 remaining.
The reverse happens with
max=${range#*-}
Says remove from the left side of the expanded value anything up to the first - char. This leaves the 100.
As there is only one - char in this string, you don't need to worry about the other versions of ${var##*-} which says remove all from the left until the last match of -, and the reverse ${var%%-*} , remove all from right (backwards) until the very first - char.
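The difference between the short and long forms only shows up when the value contains more than one dash. A small illustration with a made-up value (the variable names are arbitrary):

```shell
# Made-up value with two dashes, so the short and long forms differ.
range="10-20-30"

short_suffix=${range%-*}    # removes "-30"      -> 10-20
long_suffix=${range%%-*}    # removes "-20-30"   -> 10
short_prefix=${range#*-}    # removes "10-"      -> 20-30
long_prefix=${range##*-}    # removes "10-20-"   -> 30

echo "$short_suffix $long_suffix $short_prefix $long_prefix"
```

These expansions are standard POSIX shell, so the same lines should behave identically in ksh and bash.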
The fanatical minimalists will remind us that this can be done without a temporary variable, i.e.
min=${1%-*} ; max=${1#*-}
And the one-line fantasists can be satisfied with
case "$1" in *-* ) range="$1";min="${range%-*}";max="${range#*-}";;* ) singleNum="$1";;esac; echo min=$min ... max=$max ... singleNum=$singleNum
:-)
IHTH
you can try this;
LoadString=$1
MinExp=`echo "$LoadString" | awk -F"-" '{if (NF==2) print $1}'`
MaxExp=`echo "$LoadString" | awk -F"-" '{if (NF==2) print $2}'`
echo $MinExp
echo $MaxExp
eg:
user@host:/tmp/test$ ksh test.ksh 50-100
50
100

Awk print single element of array

This ought to be ridiculously easy. I simply want to print a single element of an array. However, all I get from a command like print arr[1] is an empty line.
Here is my entire bash script:
#!/bin/bash
find -X $1 -type f |
xargs md5 |
awk '
NF == 4 {
md5[$4]++;
files[$2]++;
}
END {
for (i = 1; i <= NF; i++)
for (j = i + 1; j <= NF; j++)
if (md5[i] == md5[j]) {
print "These are duplicates: "
print files[j+1]
print files[i]
}
'
exit 0
It is a very simple duplicate file finder. The problematic part is in the END{} statement within awk.
This just gives me a bunch of "These are duplicates: " with empty lines after them. I know that the information is available, because I add this to END{}: for (x in arr); print x and it flawlessly prints every element in arr, as expected.
I must be doing something very silly.
What you're currently doing is storing the values you want to save as the indices of the two arrays, which is a common pattern in awk code examples. However, that's usually used in conjunction with the for (x in y) syntax. The fix that comes to mind is to modify your awk bit like so:
BEGIN {
md5idx = 0;
filesidx = 0;
}
And then change:
NF == 4 {
md5[md5idx++] = $4;
files[filesidx++] = $2;
}
That should about do it, I think, but I haven't tested it.
Instead of using separate counter variables, you can also use NR, which contains the current line number, as the index when storing field values into your arrays.
NF == 4 {
md5[NR]=$4;
files[NR]=$2;
}
and then in your END block you can use something like for (i=1;i<=NR;i++). Since NR always holds the last line number inside the END block, you don't need an arbitrary number or even awk's length function to find the length of an array.
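Put together, the index-based approach might look like the sketch below. It deviates slightly from the suggestions above by using its own counter n, so that lines skipped by the NF == 4 test leave no holes in the arrays. The function name find_dups and the sample md5 lines are invented, and file names containing spaces are assumed not to occur.

```shell
# Hypothetical helper: read "MD5 (name) = hash" lines and report equal hashes.
find_dups() {
  awk 'NF == 4 {
    n++
    md5[n] = $4 ""     # appending "" forces string comparison later on
    files[n] = $2
  }
  END {
    # Compare every pair of stored hashes exactly once.
    for (i = 1; i <= n; i++)
      for (j = i + 1; j <= n; j++)
        if (md5[i] == md5[j])
          print "These are duplicates: " files[i] " " files[j]
  }'
}

# Made-up input in the same shape as the md5 output; a.txt and c.txt match.
printf 'MD5 (a.txt) = aa11\nMD5 (b.txt) = bb22\nMD5 (c.txt) = aa11\n' | find_dups
```

For the sample input this reports (a.txt) and (c.txt) as duplicates. The pairwise loop is quadratic in the number of files, which is fine for small trees but slow for large ones.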
It took me a while to find a standard md5 as opposed to my own home-brew version, but the example output from a version on MacOS X 10.7.2 is:
$ /sbin/md5 $(which -a md5)
MD5 (./md5) = 57f49e1c53ca7875fe63a33958ab0b0b
MD5 (/Users/jleffler/bin/md5) = 57f49e1c53ca7875fe63a33958ab0b0b
MD5 (/sbin/md5) = dd00b1dc4dd11c8443a70b5d33e0cade
$
Assuming that the output of md5 is a hash in column 4 and a file name in column 2 with the parentheses around the name not mattering, and also assuming that the names do not contain any spaces (because spaces in the file name will mess up the column numbering), then you probably want something like:
#!/bin/bash
find -X "${@:-.}" -type f |
xargs /sbin/md5 |
awk '
NF == 4 {
if (file[$4] != "") printf "Duplicate: MD5 %s - %s & %s\n", $4, file[$4], $2;
else file[$4] = $2;
}'
exit 0
Example output:
Duplicate: MD5 57f49e1c53ca7875fe63a33958ab0b0b - (./md5) & (/Users/jleffler/bin/md5)
This identifies duplicate MD5 values as it goes. If there is no entry in the (associative) array file for the given MD5 hash, then an entry is created with the file's name. If there is an entry, then the MD5 value and the two file names are printed; you can debate the format, which might be better spread over three lines than cramped onto one.
The "${@:-.}" notation means 'use the command line arguments if there are any; otherwise, use . (the current directory)'. The default is written without single quotes because, inside double quotes, they would be passed to find literally. This seems likely to be more usable than using the first argument (only) and failing if no argument is supplied.