awk - Skip Header, Add Columns

awk - Skip Header, Add Columns - awk

I need to create a script that uses awk to sum the values of several columns that are output by a LSF command. I also need the script to skip the headers in the first line. This is what I have so far, will it work? I'm not sure that it will properly skip the first line and add the others. I would test it, but I do not have access to the LSF machines.
bhosts | awk '
BEGIN { running=suspended=reserved=0; }
NR < 2 { next }
(running = running + $6)
(a = a + $7)
(b = b + $8)
(suspended = a + b)
(reserved = reserved + $9)
END {
...
...
}'
exit

I can't test either. This would be better asked on http://codereview.stackexchange.com, but if you want to do some calculations on every line except the first one:
bhosts | awk '
NR >= 2 {
running += $6
a += $7
b += $8
suspended = a + b
reserved += $9
}
END {
...
}
'
Undeclared variables are automatically treated as zero in numeric context, so it's not strictly necessary to declare them.

Related

awk - split column then get average

I'm trying to compute the average based on the values in column 5.
I intend to split the entries at the comma, sum the two numbers, compute the average and assign them to two new columns (ave1 and ave2).
I keep getting the error fatal: division by zero attempted, and I cannot get around it.
Below is the table in image format since I tried to post a markdown table and it failed.
Here's my code:
awk -v FS='\t' -v OFS='\t' '{split($5,a,","); sum=a[1]+a[2]}{ print $0, "ave1="a[1]/sum, "ave2="a[2]/sum}' vcf_table.txt

Never write a[2]/sum, always write (sum ? a[2]/sum : 0) or similar instead to protect from divide-by-zero.
You also aren't taking your header row into account. Try this:
awk '
BEGIN { FS=OFS="\t" }
NR == 1 {
ave1 = "AVE1"
ave2 = "AVE2"
}
NR > 1 {
split($5,a,",")
sum = a[1] + a[2]
if ( sum ) {
ave1 = a[1] / sum
ave2 = a[2] / sum
}
else {
ave1 = 0
ave2 = 0
}
}
{ print $0, ave1, ave2 }
' vcf_table.txt

How to return 0 if awk returns null from processing an expression?

I currently have a awk method to parse through whether or not an expression output contains more than one line. If it does, it aggregates and prints the sum. For example:
someexpression=$'JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)'
might be the one-liner where it DOESN'T yield any information. Then,
echo "$someexpression" | awk '
NR>1 {a[$4]++}
END {
for (i in a) {
printf "%d\n", a[i]
}
}'
this will yield NULL or an empty return. Instead, I would like to have it return a numeric value of $0$ if empty. How can I modify the above to do this?

Nothing in UNIX "returns" anything (despite the unfortunately named keyword for setting the exit status of a function), everything (tools, functions, scripts) outputs X and exits with status Y.
Consider these 2 identical functions named foo(), one in C and one in shell:
C (x=foo() means set x to the return code of foo()):
foo() {
printf "7\n"; // this is outputting 7 from the full program
return 3; // this is returning 3 from this function
}
x=foo(); <- 7 is output on screen and x has value '3'
shell (x=foo means set x to the output of foo()):
foo() {
printf "7\n"; # this is outputting 7 from just this function
return 3; # this is setting this functions exit status to 3
}
x=foo <- nothing is output on screen, x has value '7', and '$?' has value '3'
Note that what the return statement does is vastly different in each. Within an awk script, printing and return codes from functions behave the same as they do in C but in terms of a call to the awk tool, externally it behaves the same as every other UNIX tool and shell script and produces output and sets an exit status.
So when discussing anything in UNIX avoid using the term "return" as it's imprecise and ambiguous and so different people will think you mean "output" while others think you mean "exit status".
In this case I assume you mean "output" BUT you should instead consider setting a non-zero exit status when there's no match like grep does, e.g.:
echo "$someexpression" | awk '
NR>1 {a[$4]++}
END {
for (i in a) {
print a[i]
}
exit (NR < 2)
}'
and then your code that uses the above can test for the success/fail exit status rather than testing for a specific output value, just like if you were doing the equivalent with grep.
You can of course tweak the above to:
echo "$someexpression" | awk '
NR>1 {a[$4]++}
END {
if ( NR > 1 ) {
for (i in a) {
print a[i]
}
}
else {
print "$0$"
exit 1
}
}'
if necessary and then you have both a specific output value and a success/fail exit status.

You may keep a flag inside for loop to detect whether loop has executed or not:
echo "$someexpression" |
awk 'NR>1 {
a[$4]++
}
END
{
for (i in a) {
p = 1
printf "%d\n", a[i]
}
if (!p)
print "$0$"
}'
$0$

Migration from Pelican to Hugo

I was reading an article showing how to migrate markdown files from Pelican to Hugo. I'm trying to understand what the awk script is doing. :
# begin block, executed once,
# to set field separator, output fied separator & print 3 dashes
BEGIN { FS = ":"; OFS = ":"; print "---" }
# ???
!c && /^$/ { print "---\n"; c = 1 }
# user defined function?
c { print; next }
# user defined function?
!c {
# lower first field
$1 = tolower($1)
# if first field is "date"
if ($1 == "date") {
# transform second field
$2 = gensub(/ ([^.]+)\.([^.]+).([^.]+)/, " \\3-\\2-\\1", 1, $2)
$2 = gensub(/-([0-9])-/, "-0\\1-", 1, $2)
}
if ($1 == "tags")
$2 = " [" gensub(/[-a-z]+/, "'\\0'", "g", substr($2, 2)) "]"
print
}
I don't really understand, what are c and !c are they user defined functions? Without the function keyword and without parameters? What is exactly the meaning of c=1?

c is a variable. c=1 sets the value of c to 1
c is a test of variable c and its true, other than 0
!c is a test of variable c and its true if c is not set or 0
c { print; next } If c is set to some other than nothing or 0, then print (will print the whole line since nothing other is specified). next stop what you are doing and skip to next line and start over.

Match specific pattern and print just the matched string in the previous line

I update the question with additional information
I have a .fastq file formatted in the following way
#M01790:39:000000000-C3C6P:1:1101:14141:1618 1:N:0:8 (sequence name)
CATCTACATATTCACATATAGACATGAAACACCTGTGGTTCTTCCTC.. (sequence)
+
ACCCGGGGGGGGGDGGGFGGGGGGFGGGGGGGGGGGFGGGGFGFGFF.. (sequence quality)
For each sequence the format is the same (repetition of 4 lines)
What I am trying to do is searching for a specific regex pattern ([A-Z]{5,}ACA[A-Z]{5,}ACA[A-Z]{5,})in a window of n=35 characters of the 2nd line, cut it if found and report it at the end of the previous line.
So far I've written a bunch of code that does almost what I want.I thought using the match function together wit the substr of my window of interest but i didn't achieve my goal. I report below the script.awk :
match(substr($0,0,35),/regexp/,a) {
print p,a[0] #print the previous line respect to the matched one
print #print the current line
for(i=0;i<=1;i++) { # print the 2 lines following
getline
print
}
}#store previous line
{ p = $0 }
Starting from a file like this:
#M01790:39:000000000-C3C6P:1:1101:14141:1618 1:N:0:8
AACATCTACATATTCACATATAGACATGAAACACCTGTGGTTCTTCCTC..
+
GGGGGGGGDGGGFGGGGGGFGGGGGGGGGGGFGGGGFGFGFFGGGGFGF..
I would like to obtain an output like this:
#M01790:39:000000000-C3C6P:1:1101:14141:1618 1:N:0:8 TATTCACATATAGACATGAAA #is the string that matched the regexp WITHOUT initial AA that doesn' match my expression
ATATTCACATATAGACATGAAACACCTGTGGTTCTTCCTC #without initial AA
+
GGGFGGGGGGFGGGGGGGGGGGFGGGGFGFGFFGGGGFGF # without "GGGGGGGGDGGGFGGGGGGFGGG" that is the same number of characters removed in the 2nd line

$ cat tst.awk
BEGIN {
tgtStr = "pattern"
tgtLgth = length(tgtStr)
winLgth = 35
numLines = 4
}
{
lineNr = ( (NR-1) % numLines ) + 1
rec[lineNr] = $0
}
lineNr == numLines {
if ( idx = index(substr(rec[2],1,winLgth),tgtStr) ) {
rec[1] = rec[1] " " tgtStr
rec[2] = substr(rec[2],idx+tgtLgth)
rec[4] = substr(rec[4],idx+tgtLgth)
}
for ( lineNr=1; lineNr<=numLines; lineNr++ ) {
print rec[lineNr]
}
}
$ awk -f tst.awk file
#M01790:39:000000000-C3C6P:1:1101:14141:1618 1:N:0:8 pattern
ATATTCACATATAGACATGAAACACCTGTGGTTCTTCCTC..
+
GGGFGGGGGGFGGGGGGGGGGGFGGGGFGFGFFGGGGFGF..
wrt the code you posted:
substr($0,0,35) - strings, fields, line numbers, and arrays in awk start at 1 not 0 so that should be substr($0,1,35). Awk will compensate for your mistake and treat it as if you had written 1 instead of 0 in this case but get used to starting everything at 1 to avoid mistakes when it matters.
for(i=0;i<=1;i++) - should be for(i=1;i<=2;i++) for the same reason.
getline - not an appropriate use and syntactically fragile, see for(i=0;i<=1;i++)
Update - per your comment below that pattern is actually a regexp rather than a string:
$ cat tst.awk
BEGIN {
tgtRegexp = "[A-Z]{5,}ACA[A-Z]{5,}ACA[A-Z]{5,}"
winLgth = 35
numLines = 4
}
{
lineNr = ( (NR-1) % numLines ) + 1
rec[lineNr] = $0
}
lineNr == numLines {
if ( match(substr(rec[2],1,winLgth),tgtRegexp) ) {
rec[1] = rec[1] " " substr(rec[2],RSTART,RLENGTH)
rec[2] = substr(rec[2],RSTART+RLENGTH)
rec[4] = substr(rec[4],RSTART+RLENGTH)
}
for ( lineNr=1; lineNr<=numLines; lineNr++ ) {
print rec[lineNr]
}
}

I warn you, I wanted to have some fun and it is twisted.
awk -v pattern=pattern -v window=15 '
BEGIN{RS="#";FS=OFS="\n"}
{pos = match($2, pattern); n_del=pos+length(pattern)}
pos && (n_del<=window){$1 = $1 " " pattern; $2=substr($2, n_del); $4=substr($4, n_del)}
NR!=1{printf "%s%s", RS, $0}
' file
Input :
#M01790:39:000000000-C3C6P:1:1101:14141:1618 1:N:0:8
CATCTACpatternATATTCACATATAGACATGAAACACCTGTGGTTCTTCCTC..
+
ACCCGGGGGGGGGDGGGFGGGGGGFGGGGGGGGGGGFGGGGFGFGFFGGGGFGF..
#M01790:39:000000000-C3C6P:1:1101:14141:1618 1:N:0:8
CATCTACGCpatternATATTCACATATAGACATGAAACACCTGTGGTTCTTCCTC..
+
ACCCGGGGDGGGGGGDGGGFGGGGGGFGGGGGGGGGGGFGGGGFGFGFFGGGGFGF..
Output :
#M01790:39:000000000-C3C6P:1:1101:14141:1618 1:N:0:8 pattern
ATATTCACATATAGACATGAAACACCTGTGGTTCTTCCTC..
+
GGGFGGGGGGFGGGGGGGGGGGFGGGGFGFGFFGGGGFGF..
#M01790:39:000000000-C3C6P:1:1101:14141:1618 1:N:0:8
CATCTACGCpatternATATTCACATATAGACATGAAACACCTGTGGTTCTTCCTC..
+
ACCCGGGGDGGGGGGDGGGFGGGGGGFGGGGGGGGGGGFGGGGFGFGFFGGGGFGF..
Second block is not updated because window is 15 and it cannot find the pattern within this window.
I used variable RS to deal with entire 4 lines block with $0, $1, $2, $3 and $4. Because input file starts with RS and does not end with RS, I prefered to not set ORS and use printf instead of print.

AWK: How can I print averages of consecutive numbers in a file, but skip over alphabetical characters/strings?

I've figured out how to get the average of a file that contains numbers in all lines such as:
Numbers.txt
1
2
4
8
Output:
Average: 3.75
This is the code I use for that:
awk '{ sum += $1; tot++ } END { print sum / tot; }' Numbers.txt
However, the problem is that this doesn't take into account possible strings that might be in the file. For example, a file that looks like this:
NumbersAndExtras.txt
1
2
4
8
Hello
4
5
6
Cat
Dog
2
4
3
For such a file I'd want to print the multiple averages of the consecutive numbers, ignoring the strings such that the result looks something like this:
Output:
Average: 3.75
Average: 5
Average: 3
I could devise some complicated code that might accomplish that with variables and 'if' statements and loops and whatnot, but I've been told it's easier than that given some of awk features. I'd like to know how that might look like, along with an explanation of why it works.

BEGIN runs before reading the first line from file. Set sum and count to 0.
awk 'BEGIN{ sum=0; count=0} {if ( /[a-z][A-Z]/ ) { if (count > 0) {avg = sum/count; print avg;} count=0; sum=0} else { count++; sum += $1} } END{if (count > 0) {avg = sum/count; print avg}} ' NumbersAndExtras.txt
When there is an alphabet on the line, calculate and print average so far.
And do the same in the END block that runs after processing the whole file.

Keep it simple:
awk '/^$/{next}
/^[0-9]+/{a+=$1+0;c++;next}
c&&a{print "Average: "a/c;a=c=0}
END{if(c&&a){print "Average: "a/c}}' input_file
Results:
Average: 3.75
Average: 5
Average: 3

Another one:
$ awk '
function avg(s, c) { print "Average: ", s/c }
NF && !/^[[:digit:]]/ { if (count) avg(sum, count); sum = 0; count = 0; next}
NF { sum += $1; count++ }
END {if (count) avg(sum, count)}
' <file

Note: The value of this answer in explaining the solution; other answers offer more concise alternatives.
Try the following:
Note that this is an awk command with a script specified as a multi-line shell string literal - you can paste the whole thing into your terminal to try it; while it is possible to cram this into a single line, it hurts readability and the ability to comment:
awk '
# Define output function that prints an average.
function printAvg() { print "Average: ", sum/count }
# Skip blank lines
NF == 0 { next}
# Is the line non-numeric?
/[[:alpha:]]/ {
# If this line ends a numeric block, print its
# average now and reset the variables to start the next group.
if (count) {
printAvg()
wasNum = sum = count = 0
}
# Skip to next line.
next
}
# Numeric line: set flag, sum, and increment counter.
{ sum += $1; count++ }
# Finally:
END {
# If there is a group whose average has not been printed yet,
# do it now.
if (count) printAvg()
}
' NumbersAndExtras.txt
If we condense whitespace and strip the comments, we still get a reasonably readable solution, as long as we still use multiple lines:
awk '
function printAvg() { print "Average: ", sum/count }
NF == 0 { next}
/[[:alpha:]]/ { if (count) { printAvg(); sum = count = 0 } next }
{ sum += $1; count++ }
END { if (count) printAvg() }
' NumbersAndExtras.txt

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas

awk - Skip Header, Add Columns - awk

Related

awk - split column then get average

How to return 0 if awk returns null from processing an expression?

Migration from Pelican to Hugo

Match specific pattern and print just the matched string in the previous line

AWK: How can I print averages of consecutive numbers in a file, but skip over alphabetical characters/strings?

Categories

Resources