The code
#!/usr/bin/awk
# Sed and AWK by O'Reilly (p.179)
# Example of what should happen: running factorial gives
# factorial
# Enter number: 3
# The factorial of 3 is 6
BEGIN {
    printf("Enter number: ")
}
$1 ~ /^[0-9]+$/ {
    # assign value of $1 to number & fact
    number = $1
    if (number == 0)
        fact = 1
    else
        fact = number
    for (x = 1; x < number; x++)
        fact *= x
    printf("The factorial of %d is %g\n", number, fact)
    exit
}
# if not a number, prompt again.
{ printf("\nInvalid entry. Enter a number: ") }
I run the command, unsuccessfully, with
./factorial.awk
I get
/usr/bin/awk: syntax error at source line 1
context is
>>> <<< ./factorial.awk
/usr/bin/awk: bailing out at source line 1
What does the error message mean?
I think the problem is in how the script is being handed to its interpreter. The following is a shell script, hence the #! /bin/sh, so it will be passed to the shell (Bourne-compatible in this case).
#! /bin/sh
awk 'BEGIN { printf("Hello world!\n"); exit }'
The she-bang (#!) line tells the system which interpreter to run the script with. In the shell-script case above you call awk explicitly from within the shell, which assumes awk is in your PATH somewhere.
The following, however, is an awk script.
#! /usr/bin/awk -f
BEGIN {
printf("Hello world!\n");
exit
}
The she-bang invokes awk with the -f flag, which tells awk to read its program from the named file. You don't need to invoke awk explicitly in this case and you don't have to quote the entire script, since it is passed directly to awk. This is also what your error message means: without -f, awk treats its first argument, here ./factorial.awk, as the program text itself, which is why the error context shows >>> <<< ./factorial.awk.
Think of the she-bang as saying take what follows the she-bang, append the name of the file, and execute it. Wikipedia describes the usage pretty well including some common ways to solve the path to the interpreter problem.
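For example (a quick sketch; it assumes awk really is at /usr/bin/awk on your system):
$ cat hello.awk
#! /usr/bin/awk -f
# hello.awk is a made-up example file
BEGIN { printf("Hello world!\n"); exit }
$ chmod +x hello.awk
$ ./hello.awk
Hello world!
When you run ./hello.awk, the kernel effectively executes /usr/bin/awk -f ./hello.awk.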
Possibly a dumb answer but in my terminal I would have to type in:
./factorial.awk
where the file is factorial.awk.
You could edit your PATH environment variable to include ., but ./ should work just fine. Besides, adding . to $PATH can prove very dangerous in some situations, since you may end up running code you did not expect to.
Does that work??
EDIT:
./factorial.awk
-bash: ./factorial.awk: /usr/bin/gawk: bad interpreter: No such file or directory
That says that the system tried to run the file but could not find the interpreter /usr/bin/gawk named on the she-bang line.
Please type in 'which gawk' and then 'which awk'.
Is your first line supposed to be:
#!/usr/bin/awk
Also, just to amuse me, type in:
sudo apt-get install gawk
That will make sure you actually have gawk on your system.
EDIT2:
I took a look at your code and this is what I have now. I removed two quotes and a dash.
#!/usr/bin/gawk
# I think we do not need these (p.179) so I comment them out, since I do not know where else to put them.
# The same bug occurs also with them.
#fact = number
#for (x = number -1 ; x > 1; x--)
# fact *= x
# factorial: returns factorial of user-supplied number
BEGIN {
    printf("Enter number: ")
}
$1 ~ /^[0-9]+$/ {
    # assign value of $1 to number & fact
    number = $1
    if (number == 0)
        fact = 1
    else
        fact = number
    # loop to multiply fact*x until x = 1
    for (x = number - 1; x > 1; x--)
        fact *= x
    printf("The factorial of %d is %g\n", number, fact)
    # exit -- saves user from typing ^-d
    exit
}
# if not a number, prompt again.
{ printf("\nInvalid entry. Enter a number: ") }
Maybe it wasn't that complicated.
#!/usr/bin/awk ---------> #!/usr/bin/awk -f
Check whether there is a file /usr/bin/gawk; if not, use either the path of awk or the correct location for gawk.
Also, did you make the script executable?
And also, do you have the current directory in your PATH?
I got the script to work in Ubuntu and OS X by running
awk -f factorial.awk
It seems that you cannot run the script directly, although the book says so (for that to work, the she-bang line needs the -f flag, as noted above):
./factorial.awk
Here's a recursive version:
#!/usr/bin/awk -f
function f(x) {
    if (x <= 1) return 1
    return f(x-1) * x
}
BEGIN {
    printf("Enter number: ")
}
$1 ~ /^[0-9]+$/ {
    printf("The factorial of %d is %d\n", $1, f($1))
    exit
}
{ printf("\nInvalid entry. Enter a number: ") }
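Saved as factorial.awk and made executable, it behaves like this (illustrative run):
$ printf '5\n' | ./factorial.awk
Enter number: The factorial of 5 is 120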
This question was the top hit on Google for the search phrase "awk factorial", so here's a simple way to print a factorial in awk:
$ awk 'BEGIN{x=1;for(i=2;i<=6;i++)x*=i;print x}'
720
As a shell function (the space after -v is required by nawk which comes with macOS but not by gawk):
$ fac(){ awk -v "n=$1" 'BEGIN{x=1;for(i=2;i<=n;i++)x*=i;print x}';}
$ fac 6
720
As an awk function for calculating k-combinations:
$ awk 'function f(x){r=1;for(i=2;i<=x;i++)r*=i;return r}BEGIN{n=5;k=3;print f(n)/(f(k)*f(n-k))}'
10
Related
To print multiple (here, 2) lines following a pattern using awk:
I found the following solution somewhere:
$ awk '/Linux/{x=NR+2}(NR<=x){print}' file
Linux
Solaris
Aix
I am trying to understand the syntax
Generally awk syntax is
awk 'pattern{action}' file
Here we find
pattern = /Linux/
action = {x=NR+2}
then what is (NR<=x){print}?
Solution:
My understanding of the C-like syntax for this is:
while read(file, line)
{
    if (line ~ /pattern/)
    {
        x = NR + 2
    }
    if (NR <= x)
    {
        print
    }
}
For NR=1, if (line ~ /pattern/) then x is set to NR+2, e.g. 1+2 = 3. This value is not reset until the process is over. So when the next line is read and !(line ~ /pattern/), x is still 3, and (NR (2) <= 3) is true, so it prints that line too.
Thanks to @EdMorton for the explanation.
FWIW I wouldn't write the code you're asking about; instead I'd write:
awk '/Linux/{c=3} c&&c--' file
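Spelled out with comments (same code, just unrolled):
awk '
/Linux/ { c = 3 }   # arm a 3-line countdown starting at the matching line
c && c--            # while c is non-zero the condition is true, so the line is
                    # printed (default action), and c is decremented
' file
Note that the countdown includes the /Linux/ line itself, giving the match plus the two lines after it.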
See example "g" at https://stackoverflow.com/a/17914105/1745001.
Having said that, your original code in C-like syntax would be:
NR=0
x=0
while read(file, line)
{
    NR++
    if (line ~ "Linux") {
        x = NR+2
    }
    if (NR <= x) {
        print
    }
}
Btw, I know it's frequently misused, but don't use the word "pattern" in your software, as it's highly ambiguous. Use string, regexp, or condition (or, in shell but not in awk, sed, grep, etc., and only where appropriate, "globbing pattern"), whichever it is you really mean.
For example you wrote that awk syntax is:
awk 'pattern{action}' file
No. Or maybe, depending on what you think "pattern" means! Despite what many books, tutorials, etc. say, to remove any ambiguity you should simply think of awk syntax as:
awk 'condition{action}' file
where condition can be any of:
a key word like BEGIN or END
an arithmetic expression like var < 7 or NF or 1
a regexp comparison like $0 ~ "foo" or $0 ~ /foo/ or /foo/ or $0 ~ var or match($0,/foo/)
a string comparison like $0 == "foo" or index($0,"foo")
nothing at all in which case it's assumed to be true when there's an associated action block.
and probably other things I'm forgetting to list.
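To make that concrete, here's a toy script exercising a few of those condition forms (an illustration, not from the original answer; file is whatever input you have):
awk '
BEGIN     { print "start" }            # keyword
NF        { print "non-empty:", $0 }   # expression: true when the line has fields
/foo/     { print "matches foo" }      # regexp
$1 == "x" { print "first field is x" } # string comparison
END       { print "done" }             # keyword
' file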
Your script has two blocks:
$ awk '/Linux/ {x=NR+2}
NR<=x {print}' file
The first block sets the variable x; the second uses it to print the lines. Note that you can drop {print}, since it's the default action.
I am working with bash on a linux cluster. I am trying to extract reads from a .fastq file if they contain a match to a queried sequence. Below is an example .fastq file containing three reads.
$ cat example.fastq
@SRR1111111.1 1/1
CTGGANAAGTGAAATAATATAAATTTTTCCACTATTGAATAAAAGCAACTTAAATTTTCTAAGTCG
+
AAAAA#EEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEA<AAEEEEE<6
@SRR1111111.2 2/1
CTATANTATTCTATATTTATTCTAGATAAAAGCATTCTATATTTAGCATATGTCTAGCAAAAAAAA
+
AAAAA#EE6EEEEEEEEEEEEAAEEAEEEEEEEEEEEE/EAE/EAE/EA/EAEAAAE//EEAEAA6
@SRR1111111.3 3/1
CTATANTATTGAAATAATAATGTAGATAAAACTATTGAATAACAGCAACTTAAATTTTCAATAAGA
+
AAAAA#EE6EEEEEEEEEEEEAAEEAEEEEEEEEEEEE/EAE/EAE/EA/EAEAAAE//EEAEAA6
I would like to extract reads containing the sequence GAAATAATA. I can perform this extraction using grep as shown in the following command.
$ grep -F -B 1 -A 2 "GAAATAATA" example.fastq > MATCH.fastq
$ cat MATCH.fastq
@SRR1111111.1 1/1
CTGGANAAGTGAAATAATATAAATTTTTCCACTATTGAATAAAAGCAACTTAAATTTTCTAAGTCG
+
AAAAA#EEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEA<AAEEEEE<6
@SRR1111111.3 3/1
CTATANTATTGAAATAATAATGTAGATAAAACTATTGAATAACAGCAACTTAAATTTTCAATAAGA
+
AAAAA#EE6EEEEEEEEEEEEAAEEAEEEEEEEEEEEE/EAE/EAE/EA/EAEAAAE//EEAEAA6
However, this strategy does not tolerate any mismatches. For example, a read containing the sequence GAAATGATA will be ignored. I need this extraction to tolerate one mismatch at any position in the queried sequence. So my question is how can I achieve this? Is there a sequence alignment package available with similar functionality to grep? Are there any fastq subsetting packages available that perform this type of operation? One note is that speed is very important. Thanks for your guidance.
Here is a solution using agrep to get the line numbers of the matches, and awk to print out those records with some context (needed because agrep lacks grep's -A and -B options):
$ agrep -1 -n "GAAATGATA" file |
awk -F: 'NR==FNR{for(i=($1-1);i<=($1+2);i++)a[i];next}FNR in a' - file
Output:
@SRR1111111.1 1/1
CTGGANAAGTGAAATAATATAAATTTTTCCACTATTGAATAAAAGCAACTTAAATTTTCTAAGTCG
+
AAAAA#EEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEA<AAEEEEE<6
@SRR1111111.3 3/1
CTATANTATTGAAATAATAATGTAGATAAAACTATTGAATAACAGCAACTTAAATTTTCAATAAGA
+
AAAAA#EE6EEEEEEEEEEEEAAEEAEEEEEEEEEEEE/EAE/EAE/EA/EAEAAAE//EEAEAA6
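Unrolled with comments, the awk part of that pipeline works like this (same logic as the one-liner above):
agrep -1 -n "GAAATGATA" file |       # -1: allow one error; -n: prefix each match with "lineno:"
awk -F: '
NR==FNR {                            # first input: the agrep output ("lineno:text")
    for (i = $1-1; i <= $1+2; i++)   # the match is the sequence line, so take the
        a[i]                         # header (-1) through the quality line (+2);
    next                             # referencing a[i] is enough to create the key
}
FNR in a                             # second input (the file itself): print the remembered lines
' - file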
This should work, but I don't know whether the MATCH.fastq in your question is the expected output, or even whether your sample input contains any cases that a working solution needs to find, so I can't say for sure that it's actually working:
$ cat tst.awk
BEGIN {
    # build one regexp branch per position in seq, with that position wildcarded
    for (i=1; i<=length(seq); i++) {
        regexp = regexp sep substr(seq,1,i-1) "." substr(seq,i+1)
        sep = "|"
    }
}
{ rec = rec $0 ORS }    # accumulate the current 4-line fastq record
!(NR % 4) {             # every 4th line completes a record
    if (rec ~ regexp) {
        printf "%s", rec
    }
    rec = ""
}
$ awk -v seq='GAAATAATA' -f tst.awk example.fastq
@SRR1111111.1 1/1
CTGGANAAGTGAAATAATATAAATTTTTCCACTATTGAATAAAAGCAACTTAAATTTTCTAAGTCG
+
AAAAA#EEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEA<AAEEEEE<6
@SRR1111111.3 3/1
CTATANTATTGAAATAATAATGTAGATAAAACTATTGAATAACAGCAACTTAAATTTTCAATAAGA
+
AAAAA#EE6EEEEEEEEEEEEAAEEAEEEEEEEEEEEE/EAE/EAE/EA/EAEAAAE//EEAEAA6
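For what it's worth, with seq='GAAATAATA' the BEGIN loop above builds this alternation (one branch per position, with that position wildcarded):
.AAATAATA|G.AATAATA|GA.ATAATA|GAA.TAATA|GAAA.AATA|GAAAT.ATA|GAAATA.TA|GAAATAA.A|GAAATAAT.
which is exactly the set of patterns in the next answer's pattern file, joined into a single regexp.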
You might try a file of patterns -
$: cat GAAATAATA
.AAATAATA
G.AATAATA
GA.ATAATA
GAA.TAATA
GAAA.AATA
GAAAT.ATA
GAAATA.TA
GAAATAA.A
GAAATAAT.
then
grep -B 1 -A 2 -f GAAATAATA example.fastq > MATCH.fastq
but it will probably slow the process down a bit to add both full regex parsing AND an alternate pattern for each possible single change...
responding to question in comments:
For a given value of $word, such as word=GAAATAATA,
awk '{
    for ( i=1; i<=length($0); i++ ) {
        split($0, tmp, "")                # re-split each time so only position i is wildcarded
        tmp[i] = "."
        for ( n=1; n<=length($0); n++ ) printf "%s", tmp[n]
        printf "\n"
    }
}' <<< "$word" > "$word"
This will create that specific pattern file.
Hope that helps, but remember that this will be a lot slower since you are now using regexes instead of just matching plain strings, AND you are introducing a whole series of alternate patterns to match...
I'm trying to use GAWK from CYGWIN to process a csv file. Pass 1 finds the max value, and pass 2 prints the records that match the max value. I'm using a .awk file as input. When I use the text in the manual, it matches on both passes. I can use the IF form as a workaround, but that forces me to use IF inside every pattern match, which is kind of a pain. Any idea what I'm doing wrong?
Here's my .awk file:
pass == 1
{
print "pass1 is", pass;
}
pass == 2
{
if(pass == 2)
print "pass2 is", pass;
}
Here's my output (the input file is just "hello"):
hello
pass1 is 1
pass1 is 2
hello
pass2 is 2
Here's my command line:
gawk -F , -f test.awk pass=1 x.txt pass=2 x.txt
I'd appreciate any help.
A (g)awk solution might look like this:
awk 'FNR == NR{print "1st pass"; next}
{print "second pass"}' x.txt x.txt
(Please replace awk by gawk if necessary.)
Let's say you wanted to search for the maximum value in the first column of file x.txt and then print all lines which have this value in the first column. Your program might look like this (thanks to Ed Morton for the tip, see comment):
awk -F"," 'FNR==NR {max = ( (FNR==1) || ($1 > max) ? $1 : max ); next}
$1==max' x.txt x.txt
The output for x.txt:
6,5
2,6
5,7
6,9
is
6,5
6,9
How does this work? The variable NR keeps increasing with every record, whereas FNR is reset to 1 when reading a new file. Therefore, FNR==NR is only true for the first file processed.
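A quick way to see this for yourself (a throwaway two-line file f):
$ printf 'a\nb\n' > f
$ awk '{print FILENAME, NR, FNR}' f f
f 1 1
f 2 2
f 3 1
f 4 2
NR keeps counting across both files while FNR restarts, so NR==FNR holds only while the first file is being read.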
So... F.Knorr answered your question accurately and concisely, and he deserves a big green checkmark. NR==FNR is exactly the secret sauce you're looking for.
But here is a different approach, just in case the multi-pass thing proves to be problematic. (Perhaps you're reading the file from a slow drive, a USB stick, across a network, DAT tape, etc.)
awk -F, '$1>m{delete l;n=0;m=$1}m==$1{l[++n]=$0}END{for(i=1;i<=n;i++)print l[i]}' inputfile
Or, spaced out for easier reading:
BEGIN {
    FS=","
}
$1 > max {
    delete list   # empty the array
    n=0           # reset the array counter
    max=$1        # set a new max
}
max==$1 {
    list[++n]=$0  # record the line in our array
}
END {
    for (i=1; i<=n; i++) {  # print the array in order of found lines
        print list[i]
    }
}
With the same input data that F.Knorr tested with, I get the same results.
The idea here is to go through the file in ONE pass. We record every line that matches our max in an array, and if we come across a value that exceeds the max, we clear the array and start collecting lines afresh.
This approach is heavier on CPU and memory (depending on the size of your dataset), but being single pass, it is likely to be lighter on I/O.
The issue here is that newlines matter to awk: a condition and its action block must start on the same line, otherwise they are parsed as two separate rules.
# This does what I should have done:
pass==1 {print "pass1 is", pass;}
pass==2 {if (pass==2) print "pass2 is", pass;}
# This is the code in my question:
# When pass == 1, print the line (the default action for a bare condition)
pass==1
# On every line, regardless of pass, do this
{print "pass1 is", pass;}
# When pass == 2, print the line (the default action for a bare condition)
pass==2
# On every line, regardless of pass, do this
{if (pass==2) print "pass2 is", pass;}
Using pass==1, pass==2 isn't as elegant, but it works.
I'm trying to set up a script where an alert is generated when a certain string appears in a log file.
The solution already in place greps the whole log file once a minute and counts how often the string appears, using the log line's timestamp to count only occurrences in the previous minute.
I figured it would be much more efficient to do this with a tail, so I tried the following, as a test:
FILENAME="/var/log/file.log"
tail -f $FILENAME | awk -F , -v var="$HOSTNAME" '
BEGIN {
    failed_count=0;
}
/account failure reason/ {
    failed_count++;
}
END {
    printf("%saccount failure reason (Errors per Interval)=%d\n", var, failed_count);
}
'
but this just hangs and doesn't output anything. Somebody suggested this minor change:
FILENAME="/var/log/file.log"
awk -F , -v var="$HOSTNAME" '
BEGIN {
    failed_count=0;
}
/account failure reason/ {
    failed_count++;
}
END {
    printf("%saccount failure reason (Errors per Interval)=%d\n", var, failed_count);
}
' <(tail -f $FILENAME)
but that does the same thing.
The awk script I'm using (simplified in the code above) works: it's used in the existing script, where the results of grep "^$TIMESTAMP" are piped into it.
My question is, how can I get tail -f to work with awk?
Assuming your log looks something like this:
Jul 13 06:43:18 foo account failure reason: unknown
 │  │
 │  └─── $2 in awk
 └────── $1 in awk
you could do something like this:
FILENAME="/var/log/file.log"
tail -F $FILENAME | awk -v hostname="$HOSTNAME" '
NR == 1 {
    last=$1 " " $2;
}
$1 " " $2 != last {
    printf("%s account failure reason (Errors on %s)=%d\n", hostname, last, failed);
    last=$1 " " $2;
    failed=0;
}
/account failure reason/ {
    failed++;
}
'
Note that I've changed this to tail -F (capital F) because it handles log rotation (the file being renamed and recreated). This isn't supported in every operating system, but it should work in modern BSDs and Linuces.
How does this work?
Awk scripts consist of sets of test { commands; } rules evaluated against each line of input. (There are two special tests, BEGIN and END, whose commands run when awk starts and when awk ends, respectively. In your question, awk never ended, because tail -f never ended, so the END code was never run.)
The script above has three test/command sections:
In the first, NR == 1 is a test that evaluates true on only the first line of input. The command it runs creates the initial value for the last variable, used in the next section.
In the second section, we test whether the last variable has changed since the previous line was evaluated. If it has, that indicates we're evaluating a new day's data, so it's time to print a summary (log line) for the previous day, reset our variables, and move on.
In the third, if the line we're evaluating matches the regular expression /account failure reason/, we increment our counter.
Clear as mud? :-)
I find AWK really useful. Here is a one-liner I put together to manipulate data.
ls | awk '{ print "awk " "'"'"'" " {print $1,$2,$3} " "'"'"'" " " $1 ".old_ext > " $1 ".new_ext" }' > file.csh
I used this AWK to make a script file that would rename some files and print out only selected columns. Anyone know a better way to do this? What are your best AWK one-liners or clever manipulations?
The AWK book is full of great examples. They used to be collected for download from Kernighan's webpage (404s now).
You can find several nice one-liners here.
I use this:
df -m | awk '{p+=$3}; END {print p}'
To total all disk space used on a system across filesystems.
Many years ago I wrote a tail script in awk:
#!/usr/bin/awk -f
BEGIN {
    lines = 10
}
{
    high = NR % lines + 1
    a[high] = $0
}
END {
    for (i = 0; i < lines; i++) {
        n = (i + high) % lines + 1
        if (n in a) {
            print a[n]
        }
    }
}
It's silly, I know, but that's what awk does to you. It's just very fun playing with it.
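For the record, you'd use it like this (assuming it's saved as tail.awk and awk really is at /usr/bin/awk):
$ chmod +x tail.awk
$ ./tail.awk /etc/passwd      # prints the last 10 lines, like tail(1)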
Henry Spencer wrote a fairly good implementation of nroff on awk. He called it "awf". He also claimed that if Larry Wall had known how powerful awk was, he wouldn't have needed to invent perl.
Here are a couple of awks that I used to use regularly... note that you can use $1, $2, etc. to pull out the column you want. So, for manipulating a bunch of files, here's a stupid command you could use instead of mv...
ls -1 *.mp3 | awk '{printf("mv %s newDir/%s\n",$1,$1)}' | /bin/sh
Or if you're looking at a set of processes maybe ...
ps -ef | grep -v username | awk '{printf("kill -9 %s\n",$2)}' | /bin/sh
Pretty trivial but you can see how that would get you quite a ways. =) Most of the stuff I used to do you can use xargs for, but hey, who needs them new fangled commands?
I use this script a lot for editing PATH and path-like environment variables.
Usage:
export PATH=$(clnpath /new/bin:/other/bin:$PATH /old/bin:/other/old/bin)
This command adds /new/bin and /other/bin in front of PATH, removes both /old/bin and /other/old/bin from PATH (if present - no error if absent), and removes duplicate directory entries on path.
: "#(#)$Id: clnpath.sh,v 1.6 1999/06/08 23:34:07 jleffler Exp $"
#
# Print minimal version of $PATH, possibly removing some items
case $# in
0) chop=""; path=${PATH:?};;
1) chop=""; path=$1;;
2) chop=$2; path=$1;;
*) echo "Usage: `basename $0 .sh` [$PATH [remove:list]]" >&2
exit 1;;
esac
# Beware of the quotes in the assignment to chop!
echo "$path" |
${AWK:-awk} -F: '#
BEGIN { # Sort out which path components to omit
chop="'"$chop"'";
if (chop != "") nr = split(chop, remove); else nr = 0;
for (i = 1; i <= nr; i++)
omit[remove[i]] = 1;
}
{
for (i = 1; i <= NF; i++)
{
x=$i;
if (x == "") x = ".";
if (omit[x] == 0 && path[x]++ == 0)
{
output = output pad x;
pad = ":";
}
}
print output;
}'
Count memory used by httpd
ps -ylC httpd | awk '/[0-9]/ {SUM += $8} END {print SUM/1024}'
Or any other process by replacing httpd. Dividing by 1024 to get output in MB.
I managed to build a DOS tree command emulator for UNIX (find + awk):
find . -type d -print 2>/dev/null|awk '{for (i=1;i< NF;i++)printf("%"length($i)"s","|");gsub(/[^\/]*\//,"--",$0);print $NF}' FS='/'
Print lines between two patterns:
awk '/END/{flag=0}flag;/START/{flag=1}' inputFile
Detailed explanation: http://nixtip.wordpress.com/2010/10/12/print-lines-between-two-patterns-the-awk-way/
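A quick illustration (made-up input):
$ cat inputFile
one
START
two
three
END
four
$ awk '/END/{flag=0}flag;/START/{flag=1}' inputFile
two
three
Note the order of the rules: /END/ is tested before printing and /START/ after, so the START and END marker lines themselves are excluded.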
A couple of favorites, essentially unrelated to each other. Read as 2 different, unconnected suggestions.
Identifying Column Numbers Easily
For those that use awk frequently, as I do for log analysis at work, I often find myself needing to find out what the column numbers are for a file. So, if I am analyzing, say, Apache access files (some samples can be found here) I run the script below against the file:
NR == 1 {
    for (i = 1 ; i <= NF ; i++)
    {
        print i "\t" $i
    }
}
NR > 1 {
    exit
}
I typically call it "cn.awk", for 'c'olumn 'n'umbers. Creative, eh? Anyway, the output looks like:
1 64.242.88.10
2 -
3 -
4 [07/Mar/2004:16:05:49
5 -0800]
6 "GET
7 /twiki/bin/edit/Main/Double_bounce_sender?topicparent=Main.ConfigurationVariables
8 HTTP/1.1"
9 401
10 12846
Very easy to tell what's what. I usually alias this on my servers and have it everywhere.
Referencing Fields by Name
Now, suppose your file has a header row and you'd rather use those names instead of field numbers. This allows you to do so:
NR == 1 {
    for (i = 1 ; i <= NF ; i++)
    {
        field[$i] = i
    }
}
Now, suppose I have this header row...
metric,time,val,location,http_status,http_request
...and I'd like to sum the val column. Instead of referring to $3, I can refer to it by name:
NR > 1 {
    SUM += $field["val"]
}
The main benefit is making the script much more readable.
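Putting the two pieces together into one runnable script (my sketch; sum_val.awk and data.csv are made-up names, and the -F, matches the comma-separated header row above):
$ cat sum_val.awk
NR == 1 {
    for (i = 1; i <= NF; i++)
        field[$i] = i        # map header name -> column number
}
NR > 1 {
    sum += $field["val"]     # i.e. $(field["val"]), index the column by name
}
END { print sum }
$ awk -F, -f sum_val.awk data.csv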
Printing fields is one of the first things mentioned in most AWK tutorials.
awk '{print $1,$3}' file
Lesser known but equally useful is excluding fields, which is also possible:
awk '{$1=$3=""}1' file
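One gotcha worth knowing: assigning empty strings removes the field contents but leaves the output field separators behind, e.g.:
$ echo 'a b c d' | awk '{$1=$3=""}1'
 b  d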