Is it possible to set a global AWK separator - awk

Is it possible to set a global separator for awk, e.g., in a .conf file or an environment variable?
I'm handling a lot of files that use a customized separator, and it doesn't feel right to set FS and OFS manually every time.
Thanks.

No, the best you can do is either export a shell variable and then set FS from that, e.g.:
$ cat tst.awk
BEGIN { FS=ENVIRON["AWK_FS"] }
{ print NF }
$ export AWK_FS=","
$ echo 'a,b,c' | awk -f tst.awk
3
or include a file that sets the FS:
$ cat setFS.awk
BEGIN{ FS="," }
$ cat tst.awk
#include "setFS.awk"
{ print NF }
$ echo 'a,b,c' | awk -f tst.awk
3
The "#include" construct is gawk-specific, see https://www.gnu.org/software/gawk/manual/gawk.html#Include-Files. You still have to put the same code into every script to include that file but at least the actual FS setting (and OFS and whatever else you commonly do) code will only be specified once in that included file.


Bash how to split file on empty line with awk

I have a text file (A.in) and I want to split it into multiple files. The split should occur every time an empty line is found. The file names should be progressive (A1.in, A2.in, ..).
I found this answer that suggests using awk, but I can't make it work with my desired naming convention:
awk -v RS="" '{print $0 > $1".txt"}' file
I also found other answers telling me to use the command csplit -l, but I can't make it match empty lines. I tried matching the pattern '' but I am not that familiar with regex, and I get the following:
bash-3.2$ csplit A.in ""
csplit: : unrecognised pattern
Input file:
A.in
4
RURDDD
6
RRULDD
KKKKKK
26
RRRULU
Desired output:
A1.in
4
RURDDD
A2.in
6
RRULDD
KKKKKK
A3.in
26
RRRULU
Another approach with awk:
$ awk -v RS="" '{
split(FILENAME,a,".") # separate name and extension
f=a[1] NR "." a[2] # form the filename, use NR as number
print > f # output to file
close(f) # in case there are MANY files, to avoid running out of fds
}' A.in
In any normal case, the following script should work:
awk 'BEGIN{RS=""}{ print > ("A" NR ".in") }' file
The reason why this might fail is most likely due to some CRLF terminations (See here and here).
As mentioned by James, making it a bit more robust as:
awk 'BEGIN{RS=""}{ f = "A" NR ".in"; print > f; close(f) }' file
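If CRLF line endings are indeed the problem, one option (a sketch, not taken from the linked answers) is to strip the carriage returns before awk sees the input:
tr -d '\r' < file | awk 'BEGIN{RS=""}{ f = "A" NR ".in"; print > f; close(f) }'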
If you want to use csplit, the following will do the trick:
csplit --suppress-matched -f "A" -b "%0.2d.in" A.in '/^$/' '{*}'
See man csplit for understanding the above.
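Roughly, what those options do (this breakdown is paraphrased from the GNU csplit documentation, not from the answer itself):
# --suppress-matched   do not write the matched (empty) lines themselves
# -f "A"               use "A" as the prefix of the output file names
# -b "%0.2d.in"        printf-style suffix format for the output file names
# '/^$/'               start a new piece before every empty line
# '{*}'                repeat the pattern as many times as possible
csplit --suppress-matched -f "A" -b "%0.2d.in" A.in '/^$/' '{*}'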
Input file content:
$ cat A.in
4
RURDDD
6
RRULDD
KKKKKK
26
RRRULU
AWK file content:
BEGIN{
n=1
}
{
if(NF!=0){
print $0 >> "A"n".in"
}else{
n++
}
}
Execution:
awk -f ctrl.awk A.in
Output:
$ cat A1.in
4
RURDDD
$ cat A2.in
6
RRULDD
KKKKKK
$ cat A3.in
26
RRRULU
PS: One-liner execution without AWK file:
awk 'BEGIN{n=1}{if(NF!=0){print $0 >> "A"n".in"}else{n++}}' A.in

awk field separator not working for first line

echo 'NODE_1_length_317516_cov_18.568_ID_4005' | awk 'FS="_length" {print $1}'
Obtained output:
NODE_1_length_317516_cov_18.568_ID_4005
Expected output:
NODE_1
How is that possible? I'm missing something.
When you are going through lines with awk, the field separator is interpreted before processing the record. Awk reads the record according to the current values of FS and RS and then performs the operations you ask for.
This means that if you set the value of FS while reading a record, it won't take effect for that specific record. Instead, the new FS takes effect when reading the next one, and so on.
So if you have a file like this:
$ cat file
1,2 3,4
5,6 7,8
And you set the field separator while reading one record, it takes effect from the next line:
$ awk '{FS=","} {print $1}' file
1,2 # FS is still the space!
5
So what you want to do is to set the FS before starting to read the file. That is, set it in the BEGIN block or via parameter:
$ awk 'BEGIN{FS=","} {print $1}' file
1 # now FS is the comma
5
$ awk -F, '{print $1}' file
1
5
There is also another way: make Awk recompute the full record with {$0=$0}. With this, Awk will take into account the current FS and act accordingly:
$ awk '{FS=","} {$0=$0;print $1}' file
1
5
You are using the awk statement incorrectly. The correct way is:
awk 'BEGIN { FS = "#{delimiter}" } ; { print $1 }'
In your case you can use
awk 'BEGIN { FS = "_length" } ; { print $1 }'
Built-in variables like FS, ORS, etc. must be set in the right context, i.e. in one of the following blocks: BEGIN, a pattern-action block, or END. To affect the very first record, set FS in BEGIN.
$ echo 'NODE_1_length_317516_cov_18.568_ID_4005' | awk 'BEGIN{FS="_length"} {print $1}'
NODE_1
$
You can also pass the delimiter using the -F switch, like this:
$ echo 'NODE_1_length_317516_cov_18.568_ID_4005' | awk -F "_length" '{print $1}'
NODE_1
$
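For completeness (not shown in the answer above, but standard awk behaviour): a -v assignment is performed before the first record is read, so setting FS that way also works for line one:
$ echo 'NODE_1_length_317516_cov_18.568_ID_4005' | awk -v FS="_length" '{print $1}'
NODE_1
$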

include library of functions in awk

There are many common functions (especially arithmetic/mathematics) that are not built into awk that I need to write myself all the time.
For example:
There is no c=min(a,b), so in awk I constantly write c=a<b?a:b.
The same goes for maximum, i.e. c=max(a,b).
The same goes for absolute value, i.e. c=abs(a), so I have to constantly write c=a>0?a:-a.
And so on.
Ideally, I could write these functions into an awk source file, and "include" it into all of my instances of awk, so I can call them at will.
I looked into the @include functionality of GNU's gawk, but it just executes whatever is in the included script - i.e. I cannot call functions.
I was hoping to write some functions in e.g. mylib.awk, and then "include" this whenever I call awk.
I tried the -f mylib.awk option to awk, but the script is executed - the functions therein are not callable.
With GNU awk:
$ ls lib
prims.awk
$ cat lib/prims.awk
function abs(num) { return (num > 0 ? num : -num) }
function max(a,b) { return (a > b ? a : b) }
function min(a,b) { return (a < b ? a : b) }
$ export AWKPATH="$PWD/lib"
$ awk -i prims.awk 'BEGIN{print min(4,7), abs(-3)}'
4 3
$ cat tst.awk
#include "prims.awk"
BEGIN { print min(4,7), abs(-3) }
$ awk -f tst.awk
4 3
You can have multiple -f program-file options, so one can be your common functions and the other can be a specific problem-solving awk script, which will have access to those functions.
awk -f common-funcs.awk -f specific.awk file-to-process.txt
I don't know if this is what you were looking for, but it's the best I've come up with. Here's an example:
$ cat common_func.awk
# Remove spaces from front and back of string
function trim(s) {
gsub(/^[ \t]+/, "", s);
gsub(/[ \t]+$/, "", s);
return s;
}
$ cat specific.awk
{ print $1, $2 }
{ print trim($1), trim($2) }
$ cat file-to-process.txt
abc | def |
$ awk -F\| -f common_func.awk -f specific.awk file-to-process.txt
abc def
abc def
With regular awk (non-gnu) you can't mix the -f program-file option with an inline program. That is, the following won't work:
awk -f common_func.awk '{ print trim($1) }' file-to-process.txt # WRONG
As pointed out in the comments, however, with gawk you can use the -f option together with -e:
awk -f file.awk -e '{stuff}' file.txt
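For example, a sketch assuming a gawk new enough to have -e and reusing the prims.awk library from the answer above:
$ gawk -f lib/prims.awk -e 'BEGIN{ print max(2,5), abs(-7) }'
5 7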
In case you can't use -i (if your awk is older than version 4.1), which Ed Morton suggested, give the below a try; it works with GNU Awk 3.1.7:
--source program-text
Provide program source code in the program-text. This option allows you to mix source code in files with source code that you enter on the command line. This is particularly useful when you have library functions that you want to use from your command-line programs.
$ awk --version
GNU Awk 3.1.7
Copyright (C) 1989, 1991-2009 Free Software Foundation.
$ cat primes.awk
function abs(num) { return (num > 0 ? num : -num) }
function max(a,b) { return (a > b ? a : b) }
function min(a,b) { return (a < b ? a : b) }
$ awk -f primes.awk --source 'BEGIN{print min(4,7), abs(-3)}'
4 3
With regular awk (non-GNU) you can still fake it a bit using the shell, by cat-ing the file(s) into the 'code' (generally in front, but it could go anywhere, since it respects awk's usual order of definitions):
> cat /tmp/delme.awk
function PrintIt( a) { printf( "#%s\n", a )}
> echo "aze\nqsd" | awk "$( cat /tmp/delme.awk)"'{ sub( /./, ""); PrintIt( $0 )}'
#ze
#sd
With GNU awk you can use the -i command line option or, from inside a script, the @include directive, but if you want a POSIX solution then awk -f functions.awk -f script.awk file.txt is the way you need to go.

Awk script: How to prevent ARGV from being treated as an input file name

It seems that an awk script considers ARGV[1] to ARGV[ARGC-1] as input files.
Is there any way to make awk treat ARGV entries as simple arguments instead of input file names?
Example:
test.awk
#!/usr/bin/awk -f
BEGIN {title=ARGV[2]}
{if ($1=="AA") {print title}}
dat file
AB
BA
AA
CC
$ test.awk dat 'My Interesting Title'
My Interesting Title
awk: test.awk:3: fatal: cannot open file `My Interesting Title' for reading (No such file or directory)
You can modify ARGV at any time. Awk processes the elements of ARGV in turn, so if you modify them during processing, you can arrange to read different files or not to treat some arguments as file names. In particular, if you modify ARGV in the BEGIN block, anything is possible. For example, the following snippet causes awk to read from standard input even when arguments were passed, and saves the arguments in an array called args:
awk '
BEGIN {for (i in ARGV) {args[i] = ARGV[i]; delete ARGV[i]}}
…
' hello world
If you just want to skip the first argument, delete it only:
awk '
BEGIN {title = ARGV[1]; delete ARGV[1]}
$1 == "AA" {print title}
' 'My Interesting Title' input.txt
However, this is unusual and therefore may be considered hard to maintain. Consider using a shell wrapper and passing the title through an environment variable instead.
#!/bin/sh
title=$1; shift
export title
awk '
$1 == "AA" {print ENVIRON["title"]}
' "$@"
You can also pass a string as an awk variable. Beware that the value undergoes backslash expansion.
awk -v 'title=My Interesting Title\nThis is a subtitle' '
$1 == "AA" {print title} # prints two lines!
' input.txt
Something like this?
$ awk -v title='My Interesting Title' '$0 ~ /AA/ {print title}1' input
AB
BA
My Interesting Title
AA
CC
Yes:
BEGIN{title=ARGV[2];ARGV[--ARGC]=""}
$1=="AA" {print title}
but you probably want this instead:
$ cat tst.sh
awk -v title="$2" '$1=="AA" {print title}'
See http://cfajohnson.com/shell/cus-faq-2.html#Q24 for details on those and the other ways to pass the value of shell variables to awk scripts.
As an aside, note that whether you use this script or your original, what is in your file is a shell script that calls awk, not an awk script, so the suffix should not be .awk; it should be .sh or similar.
You can decrement ARGC after reading the arguments so that only the first argument(s) are considered by awk as input file(s):
#!/bin/awk -f
BEGIN {
for (i=ARGC; i>2; i--) {
print ARGV[ARGC-1];
ARGC--;
}
}
…
Or alternatively, you can reset ARGC after having read all arguments :
#!/bin/awk -f
BEGIN {
for (i=2; i<ARGC; i++) {
print ARGV[i];
}
ARGC=2;
}
…
Both methods will correctly process myawkscript.awk foobar foo bar … as if foobar were the only file to process (of course you can set ARGC to 3 if you want the first two arguments treated as files, etc.).
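A minimal self-contained sketch along those lines, adapted to the test.awk from the question (the array name extra is just illustrative): stash everything after the first file in BEGIN, then shrink ARGC so awk only opens ARGV[1]:
#!/usr/bin/awk -f
BEGIN {
for (i=2; i<ARGC; i++) extra[i-1]=ARGV[i] # save the extra arguments
title=extra[1]
ARGC=2 # only ARGV[1] (the data file) is still treated as input
}
$1=="AA" {print title}
Run it as ./test.awk dat 'My Interesting Title', as in the question.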
This also works:
awk '{if ($1=="AA") {print title}}' title="My Interesting Title" input.txt
Any command-line argument of the format varname=VarContent is treated by awk as a variable assignment rather than an input file name; the assignment takes effect when awk reaches it in the argument list, so place it before the data file.

awk won't print new line characters

I am using the below code to change an existing awk script so that I can add more and more cases with a simple command.
echo `awk '{if(/#append1/){print "pref'"$1"'=0\n" $0 "\n"} else{print $0 "\n"}}' tf.a`
Note that the first print is "pref'"$1"'=0\n", so it is referring to the variable $1 in its (shell) environment, not in awk itself.
The command ./tfb.a "c" should change the code from:
BEGIN{
#append1
}
...
to:
BEGIN{
prefc=0
#append1
}
...
However, it gives me everything on one line.
Does anyone know why this is?
If you take awk right out of the equation you can see what's going on:
# Use a small test file instead of an awk script
$ cat xxx
hello
there
$ echo `cat xxx`
hello there
$ echo "`cat xxx`"
hello
there
$ echo "$(cat xxx)"
hello
there
$
The backtick operator expands the output into shell "words" too soon. You could play around with the $IFS variable in the shell (yikes), or you could just use double-quotes.
If you're running a modern sh (e.g. ksh or bash, not the "classic" Bourne sh), you may also want to use the $() syntax (it's easier to find the matching start/end delimiter).
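Applied to the command from the question, a minimal sketch of the quoting fix (same tf.a file and shell $1 as above; the explicit "\n" in the print calls is dropped because the real newlines are now preserved):
echo "$(awk '{if(/#append1/){print "pref'"$1"'=0"; print} else print}' tf.a)"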
Do it like this: pass the variable from the shell to awk properly using -v.
#!/bin/bash
toinsert="$1"
awk -v toinsert="$toinsert" '
/#append1/{
$0="pref"toinsert"=0\n"$0
}
{print}
' file > temp
mv temp file
Output:
$ cat file
BEGIN{
#append1
}
$ ./shell.sh c
$ cat file
BEGIN{
prefc=0
#append1
}