Anyone know how to make a self-contained Awk/Gawk program on Windows

I'm using an awk script to do some reasonably heavy parsing that could be useful to repeat in the future but I'm not sure if my unix-unfriendly co-workers will be willing to install awk/gawk in order to do the parsing. Is there a way to create a self-contained executable from my script?

I'm not aware of a way to make a self-contained binary using AWK. However, if you like AWK, chances seem good that you might like Python, and there are several ways to make a self-contained Python program. For example, Py2Exe.
Here's a quick example of Python:
# comments are introduced by '#', same as AWK
import re   # make regular expressions available
import sys  # system stuff like args or stdin

# read from the specified file, else read standard input
if len(sys.argv) == 2:
    f = open(sys.argv[1])
else:
    f = sys.stdin

# Compile some regular expressions to use later.
# You don't have to pre-compile, but it's more efficient.
pat0 = re.compile("regexp_pattern_goes_here")
pat1 = re.compile("some_other_regexp_here")

# for loop to read input lines.
# This assumes you want normal line separation.
# If you want lines split on some other character, you would
# have to split the input yourself (which isn't hard).
# I can't remember ever changing the line separator in my AWK code...
for line in f:
    FS = None  # default: split on whitespace
    # change FS to some other string to change the field separator
    words = line.split(FS)
    if pat0.search(line):
        pass  # handle the pat0 match case
    elif pat1.search(line):
        pass  # handle the pat1 match case
    elif words[0].lower() == "the":
        pass  # handle the case where the first word is "the"
    else:
        for word in words:
            pass  # do something with each word
Not the same as AWK, but easy to learn, and actually more powerful than AWK (the language has more features and there are many "modules" to import and use). Python doesn't have anything implicit like the
/pattern_goes_here/ {
    # code goes here
}
feature in AWK, but you can simply have an if/elif/elif/else chain with patterns to match.
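For comparison, here is roughly what that chain looks like as AWK pattern-action rules (a sketch reusing the placeholder patterns from the Python example above; the print statements stand in for real handling code):
/regexp_pattern_goes_here/ { print "pat0 match"; next }      # like the first if
/some_other_regexp_here/   { print "pat1 match"; next }      # like the first elif
tolower($1) == "the"       { print "first word is the"; next }
{ for (i = 1; i <= NF; i++) print $i }                       # the else: do something with each field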

There's a standalone awk.exe in the Cygwin toolkit, as far as I know.
You could just bundle it in with whatever files you're distributing to your colleagues.
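For example, if awk.exe and the script travel together in one folder, your co-workers could run something like this from a command prompt (parse.awk, input.log and output.txt are hypothetical names):
awk.exe -f parse.awk input.log > output.txt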

Does it have to be self-contained? You could write a small executable that invokes awk with the right arguments and pipes the results to a file the user chooses, or to stdout, whichever is appropriate for your co-workers.

There's also MAWK in GnuWin32: http://gnuwin32.sourceforge.net/packages/mawk.htm
Another interesting alternative is Jawk, a Java implementation: http://sourceforge.net/projects/jawk/

Related

How to open and read a .gz file in Nim (preferably line by line)

I just sat down to write my first Nim script to parse a .vcf (Variant Call Format) file. This file format stores genetic mutations from sequencing data.
For scripting languages, I 'grew up' on Perl and later migrated to Python, but I would love to use a language with the speed that Nim offers. I realize Nim is still young, but I couldn't even find a clear example for how to open and read a .gz (gzip) file (preferably line by line).
Can anyone provide a simple example to open and read a gzip file using Nim, line by line?
In Python, I'm accustomed to the following (uber-simple) code:
import gzip
my_file = gzip.open('my_file.vcf.gz', 'rt')  # open for reading ('w' would truncate the file)
for line in my_file:
    pass  # do something
my_file.close()
I have seen related questions, but they're not clear. The posts are also relatively old and I hope/suspect something better has come about. Here's what I've found:
Read gzip-compressed file line by line
File, FileStream, and GZFileStream
Reading files from tar.gz archive in Nim
Really appreciate it.
P.S. I also think it would be useful if someone created a Nim tag in StackOverflow. I do not have the reputation to create tags.
Just in case you need to handle VCF rather than .gz, there's a nice wrapper for htslib written by Brent Pedersen:
https://github.com/brentp/hts-nim
You need to install htslib on your system, then require the library in your .nimble file with requires "hts", or install it with nimble install hts. If you are going to do NGS analysis in Nim, you'll need it.
The code you need:
import hts

var v: VCF
doAssert open(v, "myfile.vcf.gz")
# Here you have the VCF file loaded in v; you can access the headers
# through the v.header property.
for record in v:
  # You get a Record object per line, e.g. extract the REF and ALTs:
  echo record.REF, " ", record.ALT
v.close()
Be sure to follow the docs, because some things differ from Python, especially when getting the INFO and FORMAT fields.
Check out Brent's whole repo. It has plenty of wrappers, code samples and utilities for NGS problems (e.g. an ultrafast coverage utility called Mosdepth).
Per suggestion from Maurice Meyer, I looked at the tests for the Nim zip package. It turned out to be quite simple. This is my first Nim script, so my apologies if I didn't follow convention, etc.
import zip/gzipfiles  # import the zip package

block:
  let vcf = newGzFileStream("my_file.vcf.gz")  # open the gzip file
  defer: vcf.close()  # close the file when the block exits (like a 'finally' in a 'try' block)
  var line: string    # declare the line variable
  # Loop over each line in the file
  while not vcf.atEnd():
    line = vcf.readLine()
    # Cure disease with my VCF file
To install the zip package (it is already in the Nim package library), I simply ran:
> nimble refresh
> nimble install zip
I tried to use Nim some time ago to parse a fastq or fastq.gz file.
The code should be available here:
https://gitlab.pasteur.fr/bli/qaf_demux/blob/master/Nim/src/qaf_demux.nim
I don't remember exactly how this works, but apparently I did an import zip/gzipfiles and used newGzFileStream on the input file name to obtain a Stream, from which lines can be read using .readLine() in this piece of code:
proc fastqParser(stream: Stream): iterator(): Fastq =
  result = iterator(): Fastq =
    var
      nameLine: string
      nucLine: string
      quaLine: string
    while not stream.atEnd():
      nameLine = stream.readLine()
      nucLine = stream.readLine()
      discard stream.readLine()  # skip the '+' separator line
      quaLine = stream.readLine()
      yield [nameLine, nucLine, quaLine]
It is used in something that amounts to this piece of code:
let inputFqs = fastqParser(newGzFileStream($inFastqFilename))
Hopefully you can adapt this to your case.
My .nimble file has a requires "zip#head". I suppose this triggers the installation of zip/gzipfiles.

Apache Velocity: remap $ and # keys

I wonder if it is possible to remap "$" and "#" to other keys.
sample:
#set( $foo = "bar" )
I want to use other keys because these interfere with the syntax of another script I am using.
The $ and # characters are not configurable in Velocity. Changing them would imply, at the very least, recompiling the parser and doing a full code review for standalone $ and # characters...
That said:
Velocity copes pretty well with syntax fragments it cannot parse, like the jQuery $ object. It just renders them as-is, and most of the time that does the job.
You can escape your other script's sensitive characters whenever needed, for instance by using the EscapeTool: ${esc.d} for dollar, ${esc.h} for hash.

Buffering output with AWK

I have an input file which consists of three parts:
inputFirst
inputMiddle
inputLast
Currently I have an AWK script which with this input creates an output file which consists of two parts:
outputFirst
outputLast
where outputFirst and outputLast is generated (on the fly) from inputFirst and inputLast respectively. However, to calculate the outputMiddle part (which is only one line) I need to scan the entire input, so I store it in a variable. The problem is that the value of this variable should go in between outputFirst and outputLast in the output file.
Is there a way to solve this using a single portable AWK script that takes no arguments? Is there a portable way to create temporary files in an AWK script or should I store the output from outputFirst and outputLast in two variables? I suspect that using variables will be quite inefficient for large files.
All versions of AWK (since at least 1985) can do basic I/O redirection to files and pipelines, just like the shell can, and can run external commands without I/O redirection.
So there are any number of ways to approach your problem and solve it without reading the entire input file into memory. The best solution will depend on exactly what you're trying to do and what constraints you must honour.
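For instance, all of these are plain, portable AWK (a minimal sketch; "first.tmp" and the sort command are placeholders):
{ print $0 > "first.tmp" }      # redirect output to a file
{ print $0 | "sort -n" }        # pipe output through a command
END { system("rm first.tmp") }  # run an external command directly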
A simple approach to the more precise example problem you describe in your comment above might go like this. First, in the BEGIN clause, form two unique filenames with rand() and define your variables. Then read and sum the first 50 numbers from standard input while also writing them to a temporary file, and continue by reading and summing the next 50 numbers while writing them to a second file. Finally, in an END clause, loop over the first temporary file with getline and write it to standard output, print the total sum, read the second temporary file the same way and write it to standard output, and call system("rm " file1 " " file2) to remove the temporary files.
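In AWK, that description comes out roughly like this (a sketch, not a drop-in solution; the rand()-based names and the fixed 50-number groups are illustrative):
BEGIN {
    srand()
    file1 = "tmp.first." rand()   # two unique-ish temporary filenames
    file2 = "tmp.last." rand()
    sum = 0
}
NR <= 50 { sum += $1; print > file1 }   # the first 50 numbers
NR > 50  { sum += $1; print > file2 }   # the next 50 (and any remainder)
END {
    close(file1); close(file2)          # flush the files we wrote
    while ((getline line < file1) > 0) print line
    print sum                           # the middle line
    while ((getline line < file2) > 0) print line
    system("rm " file1 " " file2)       # remove the temporary files
}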
If the output file is not too large (whatever that is), saving outputLast in a variable is quite reasonable. The first part, outputFirst, can (as described) be generated on the fly. I tried this approach and it worked fine.
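A minimal sketch of that variant, again assuming the split happens at line 50 and that "generating" the output just means echoing the line:
NR <= 50 { sum += $1; print }               # outputFirst, printed on the fly
NR > 50  { sum += $1; buf = buf $0 "\n" }   # outputLast, buffered in a variable
END {
    print sum                               # outputMiddle, needs the whole input
    printf "%s", buf                        # then the buffered last part
}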
Print the "first" output while processing the file, then write the remainder to a temporary file until you have written the middle.
Here is a self-contained shell script which processes its input files and writes to standard output.
#!/bin/sh
t=$(mktemp -t middle.XXXXXXXXX) || exit 127
trap 'rm -f "$t"' EXIT
trap 'exit 126' HUP INT TERM
awk -v temp="$t" 'NR < 500000 { print }    # outputFirst, generated on the fly
  { s += $1 }                              # running total for the middle line
  NR >= 500000 { print >> temp }           # outputLast goes to the temp file
  END { print s }' "$@"
cat "$t"
For illustration purposes, I used really big line numbers. I'm afraid your question is still too vague to get a more specific answer, but perhaps this can point you in the right direction.

How to Input Redirect Two Files to Standard Input?

Is it possible to redirect two or more files to standard input in one command? For example
$ myProgram < file1 < file2
I tried that command; however, it seemed like the OS was only taking the first file and ignoring the other...
If not, how can I achieve that?
NOTE: concatenating the two files will not help in my case.
When you do this from bash, you aren't feeding multiple files to standard input: with several < redirections, only one of them actually ends up as standard input. What bash does offer is process substitution, e.g. myProgram <(cmd1) <(cmd2).
The output of each substitution is exposed via a file descriptor under /dev/fd/<n>, which the program receives as an argument rather than on standard input.

Is there any way to generate a set of JWebUnit tests from an apache rewrite config?

Seems unlikely, but is there any way to generate a set of unit tests for the following rewrite rule:
RewriteRule ^/(user|group|country)/([a-z]+)/(photos|videos)$ http://whatever?type=$1&entity=$2&resource=$3
From this I'd like to generate a set of urls of the form:
/user/foo/photos
/user/bar/photos
/group/baz/videos
/country/bar/photos
etc...
The reason I don't want to just do this once by hand is that I'd like the bounded alternation groups (e.g. (user|group|country)) to be able to grow and maintain coverage without having to update the tests by hand.
Is there a rewrite rule or regex parser that might be able to do this, or am I doing it by hand?
If you don't mind hacking a few lines of Perl, then there's a package, Regexp::Genex, that you can use to generate something close to what you require, e.g.
# perl -MRegexp::Genex=:all -le 'print for strings(qr/\/(user|group|country)\/([a-z]+)\//)'
/user/dxb/
/user/dx/
/user/d/
/group/xd/
/group/x/
# perl -MRegexp::Genex=:all -le 'my $re=qr/\/(user|group|country)\/([a-z]+)\/(phone|videos)/;$Regexp::Genex::DEFAULT_LEN = length $re;print for strings($re)'
/user/mgcgmccdmgdmmzccgmczgmzzdcmmd/phone
/user/mgcgmccdmgdmmzccgmczgmzzdcmm/phone
/user/mgcgmccdmgdmmzccgmczgmzzdcm/phone
/user/mgcgmccdmgdmmzccgmczgmzzdc/phone
...
/group/gg/videos
/group/g/phone
/group/g/videos
/country/jvmmm/phone
/country/jvmmm/videos
/country/jvmm/phone
/country/jvmm/videos
/country/jvm/phone
/country/jvm/videos
/country/jv/phone
/country/jv/videos
/country/j/phone
/country/j/videos
#
Note:
1) You'll need to write a wrapper to parse the source file, tokenise (extract) the source patterns, escape certain characters in the rules (e.g. "/"), and possibly split your rules into more manageable parts before expanding them via Genex and outputting the results in the desired format.
2) To install the module, type: cpan Regexp::Genex