AWK: Ensure only one blank line after the output block

The following awk code outputs what is required, except that it emits two blank lines after each block of data when only one is needed. (Without the final {print "\n"} statement, no blank lines are output; with it, there are two. I need exactly one.)
/Reco/ {for(i=0; i<=2; i++) {getline; print} {print "\n"}}

Based on your comment below that you actually want the line that matches /Reco/ and 2 subsequent lines and a blank line (to be inserted after that) here's how to do that based on idiom "g" below:
awk '/Reco/{c=3} c&&c--{print; if(!c)print ""}' file
As for an explanation, just remember that awk provides this functionality for you:
WHILE read line from file
DO
execute the user's script (/Reco/{c=3} c&&c--{print; if(!c)print ""})
DONE
and that the body of an awk script is made up of:
<condition> { <action> }
statements with the default condition being TRUE and the default action being to print the current record/line.
The posted awk script above does the following:
/Reco/ { # IF the pattern "Reco" is present on the current line THEN
c=3 # Set the count of the number of lines to print to 3
} # ENDIF
c&&c-- { # IF c is non-zero THEN decrement c and THEN
print; # print the current line
if(!c) # IF c is now zero (i.e. this is the 3rd line) THEN
print "" # print a blank line
# ENDIF
} # ENDIF
so the whole execution of parsing the input file is:
WHILE read line from file
DO
/Reco/ { # IF the pattern "Reco" is present on the current line THEN
c=3 # Set the count of the number of lines to print to 3
} # ENDIF
c&&c-- { # IF c is non-zero THEN decrement c and THEN
print; # print the current line
if(!c) # IF c is now zero (i.e. this is the 3rd line) THEN
print "" # print a blank line
# ENDIF
} # ENDIF
DONE
Maybe it'd be a little clearer if the script was written as something like:
awk '/Reco/{c=3} c{c--; print; if(c == 0)print ""}' file
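To see the effect, here is a minimal sketch that feeds the spelled-out script a made-up input. The sample lines ("header", "Reco 1", "a", and so on) are assumptions, since the original input file is not shown:

```shell
# Hypothetical input: two "Reco" blocks, each followed by three more lines.
out=$(printf '%s\n' 'header' 'Reco 1' 'a' 'b' 'c' 'Reco 2' 'd' 'e' 'f' |
  awk '/Reco/ { c = 3 }  c { c--; print; if (c == 0) print "" }')

# Each block is the matching line plus the next two lines,
# followed by exactly one blank line.
printf '%s\n' "$out"
```

Note that the command substitution strips the trailing blank line, so the final blank line only shows up when awk writes straight to the terminal or a file.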
You got the answer you were looking for but here's how to really print the N lines after some pattern in awk:
c&&c--;/pattern/{c=N}
which in your case would be:
c&&c--;/Reco/{c=3}
and if you want to add that extra newline then it becomes:
c&&c--{print; if(!c)print ""} /Reco/{c=3}
If you're considering using getline make sure you read http://awk.info/?tip/getline first and understand all of the caveats so you know what you're getting yourself into.
P.S. The following idioms describe how to select a range of records given
a specific pattern to match:
a) Print all records from some pattern:
awk '/pattern/{f=1}f' file
b) Print all records after some pattern:
awk 'f;/pattern/{f=1}' file
c) Print the Nth record after some pattern:
awk 'c&&!--c;/pattern/{c=N}' file
d) Print every record except the Nth record after some pattern:
awk 'c&&!--c{next}/pattern/{c=N}1' file
e) Print the N records after some pattern:
awk 'c&&c--;/pattern/{c=N}' file
f) Print every record except the N records after some pattern:
awk 'c&&c--{next}/pattern/{c=N}1' file
g) Print the N records from some pattern:
awk '/pattern/{c=N}c&&c--' file
I changed the variable name from "f" for "found" to "c" for "count" where
appropriate as that's more expressive of what the variable actually IS.
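As a rough illustration of how idioms c), e), and g) differ, the sketch below runs them with N=2 over a small made-up input. The marker text "MARK" and the sample lines are assumptions chosen only for the demonstration:

```shell
# Hypothetical five-line input; the second line contains the pattern.
sample() { printf '%s\n' 'one' 'two MARK' 'three' 'four' 'five'; }

# c) the Nth record after the pattern
nth=$(sample | awk 'c && !--c; /MARK/ { c = 2 }')

# e) the N records after the pattern (matching line excluded)
after=$(sample | awk 'c && c--; /MARK/ { c = 2 }')

# g) the N records from the pattern (matching line included)
from=$(sample | awk '/MARK/ { c = 2 } c && c--')

printf 'c) %s\n' "$nth"
printf 'e) %s\n' "$after"
printf 'g) %s\n' "$from"
```

The only difference between e) and g) is the order of the two rules: setting the counter before testing it makes the matching line count as the first of the N records.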

@Kevin's post provides the specific answer (use print "" or, as suggested by @BMW, printf ORS), but here's some background:
In awk,
print
is the same as:
print $0
i.e., it prints the current input line followed by the output record separator - which defaults to \n and is stored in the special ORS variable.
You can pass arguments to print to print something other than (or in addition to) $0, but the record separator is invariably appended.
Note that if you pass multiple arguments separated with , to print, they will be output separated by the output field separator - which defaults to a space and is stored in the special variable OFS.
By contrast, the - more flexible - printf function takes a format string (as in its C counterpart) and as many arguments as are needed to instantiate the placeholders (fields) in the format string.
An output record separator is NOT appended to the result.
For instance, the printf equivalent of what print without arguments does is:
printf "%s\n", $0 # assumes that \n is the output record separator
Or, more generally:
printf "%s%s", $0, ORS
Note that, as the names suggest, the output field/record separators (OFS/ORS) have input counterparts (FS/RS) - their respective default values are identical (single space / \n - though on parsing input multiple adjacent spaces are treated as a single field separator).
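A small sketch of the difference, using deliberately odd separator values ("-" and "|" are arbitrary choices for the demonstration) so that they are visible in the output:

```shell
# print joins its arguments with OFS and appends ORS;
# printf emits only what the format string says.
out=$(awk 'BEGIN {
  OFS = "-"; ORS = "|\n"
  print "a", "b"              # a-b|
  printf "%s %s\n", "a", "b"  # a b
}')
printf '%s\n' "$out"
```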

print already includes the newline. Just use print "".
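A quick way to check this is to count the lines each variant produces. This is only a sketch; the "x" and "y" strings are placeholders:

```shell
# print "\n" emits the embedded newline plus ORS, i.e. two line ends.
two=$(awk 'BEGIN { print "x"; print "\n"; print "y" }' | awk 'END { print NR }')

# print "" emits only ORS, i.e. a single blank line.
one=$(awk 'BEGIN { print "x"; print ""; print "y" }' | awk 'END { print NR }')

printf 'print "\\n" yields %s lines in total, print "" yields %s\n' "$two" "$one"
```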

Related

awk choose a line with $1 present in a file and output with a changed field

I've tried to use Awk to do the following:
I have a large txt file with first column the name of a gene and different values, essentially numeric, in each column.
Now I have a file with a list of genes (not all genes, just a subset) that I want to modify.
Initially I just removed lines using something I found in a forum
awk -F '\t' ' FILENAME=="gene_list" {arr[$1]; next} # create an array without values
!($1 in arr)' gene_list original_file.txt > modified_file.txt
This worked great but now I need to keep all rows (in the same order) but modify these genes to do something like:
if ($1 in arr) {print $1, $2, $3-($4/10), $4}
else {print $0}
So you see, this time, if it is different (the gene is not in my list), I want to keep the whole line, otherwise I want to keep the whole line but modify the value in one column by a given number.
If you could include something so that the value remains an integer, that would be great. I'll also have to replace the value with 0 if it becomes negative, but that I know how to do, at least in a separate command.
Edit: minimal example:
list of genes in a txt file, one under the other:
ccl5
cxcr4
setx
File to modify: (I use commas as the field separator here, but the real file uses tabs to separate the fields)
ccl4,3,18000,50000
ccl5,4,400,5000
cxcr4,5,300,2500
apoe,4,100,90
setx,3,200,1903
Expected output: (when the gene in the first column matches a gene in my separate txt file, I subtract a tenth of the 4th column from the 3rd column; otherwise I keep the full line unchanged)
ccl4,3,18000,50000
ccl5,4,0,5000
cxcr4,5,50,2500
apoe,4,100,90
setx,3,10,1903
Just spell out the arithmetic constraints.
The following is an attempt to articulate it in idiomatic Awk.
if (something) { print } can be rearticulated as just something. So just 1 (which is always true) is a common idiom for "print all lines" (if you reach this point in the script before hitting next).
Rounding a floating-point number to the nearest integer can be done with sprintf("%1.0f", n), which rounds up when the fraction is bigger than 0.5 (int(n) would simply truncate toward zero).
awk 'BEGIN { FS=OFS="\t" }
FILENAME=="gene_list" {arr[$1]; next}
$1 in arr { x=sprintf("%1.0f", $3-($4/10))+0; # +0 makes x numeric, so x<0 compares numbers
if (x<0) x=0; print $1, $2, x, $4; next }
1' gene_list original_file.txt > modified_file.txt
Demo: https://ideone.com/oDjKhf
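For a self-contained check against the sample data in the question, the sketch below writes both files to a temporary directory and runs a close variant of the script. The use of NR == FNR instead of the FILENAME test, the array name genes, and the temporary paths are assumptions made for the example; NR == FNR misbehaves if the gene list is empty, so the FILENAME test is safer in that case.

```shell
tmp=$(mktemp -d)
printf 'ccl5\ncxcr4\nsetx\n' > "$tmp/gene_list"
printf 'ccl4\t3\t18000\t50000\nccl5\t4\t400\t5000\ncxcr4\t5\t300\t2500\napoe\t4\t100\t90\nsetx\t3\t200\t1903\n' \
  > "$tmp/original_file.txt"

out=$(awk 'BEGIN { FS = OFS = "\t" }
NR == FNR { genes[$1]; next }        # first file: remember the gene names
$1 in genes {
  x = sprintf("%.0f", $3 - $4 / 10) + 0   # round, then force a numeric value
  if (x < 0) x = 0                        # clamp negatives to zero
  $3 = x
}
1' "$tmp/gene_list" "$tmp/original_file.txt")

printf '%s\n' "$out"
rm -rf "$tmp"
```

Assigning to $3 and falling through to the final 1 keeps any additional columns intact, which matters if the real file has more than four fields.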

preceding each line by a number when "\n" is also inside quoted records

I am running an awk script as this:
find ~/dir/ -regextype sed -regex '.*\.[0-9]\{3\}\.txt' -exec ~/script.awk {} \;
It produces lines by extracting data from the files found by find. I would like each line to be preceded by an integer forming a sequence, i.e. the first line by 1, the second line by 2, etc. The problem is that one of the columns that awk produces contains newline characters (\n), i.e. a line from awk can look like:
"col1" col2 "col3_\n_col3_continued_\n_more_col3" "col4"
such lines are constructed in the END part of the script:
END {
method=""
i=1
for (i=1;i<nm;i++) {
if (i==1) {
method=methodarr[i]
}
else {
method=method"\n"methodarr[i]
}
}
method="\""method"\""
printf "%s %s %s %s\n",date,time,method,"\""FILENAME"\""
}
As a result, the output has more lines than there were files, but the records are still correctly delimited because the newline characters are enclosed in quotes. The resulting text file is then imported as data into spreadsheet software, and the newline-separated parts enclosed in quotes are correctly put into a single cell. But this rules out simply adding a number before each line.
One simple solution is to assume a format for the first column, pass the result through awk again, match the first-column format, and precede those lines with the line number. I do not like this approach. Is there a simpler way, ideally without having to assume anything about the first column? I would guess at some kind of global variable i that is incremented on each run of awk, but I have no idea whether that is possible or how to achieve it.
The simplest solution, I think, is to precede each line with a special character sequence and then replace it with numbers by another AWK script.
For example, replace your code:
printf "%s %s %s %s\n",date,time,method,"\""FILENAME"\""
with:
printf "<LINE_N> %s %s %s %s\n",date,time,method,"\""FILENAME"\""
And then run the output through this script to replace <LINE_N> with numbers:
awk '$1 == "<LINE_N>" { $1 = ++cnt } 1'
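As a sketch of the idea, the following feeds two marked records through a renumbering step, the first of which contains an embedded newline. The field values (d1, t1, m1 and so on) are made up, and the approach assumes that no continuation line happens to begin with <LINE_N>:

```shell
# Two records; the first spans two physical lines inside its quoted field.
out=$(printf '%s\n' '<LINE_N> d1 t1 "m1' 'm2" "f1"' '<LINE_N> d2 t2 "m3" "f2"' |
  awk '$1 == "<LINE_N>" { $1 = ++cnt } 1')

# Only lines that start a record receive a number.
printf '%s\n' "$out"
```

Assigning to $1 makes awk rebuild the record with single spaces between fields, so runs of whitespace on the numbered line are collapsed.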

How can I exclude blank lines with awk?

Question
How can I exclude lines starting with a space character, and that have nothing else on the line? With awk, I want to print the line Need to print, but it's also printing the blank line. How can I exclude it?
Script: test.awk
$0 !~/^start|^#/ {
print "Result : %s",$0
}
Data
# test
start
Need to print
Result
Result : %s
Result : %s Need to print
Use the NF Variable
You aren't really asking about lines that start with a space, you're asking about how to discard blank lines. Pragmatically speaking, blank lines have no fields, so you can use the built-in NF variable to discard lines which don't have at least one field. For example:
$ awk 'NF > 0 && !/^(start|#)/ {print "Result: " $0}' /tmp/corpus
Result: Need to print
You can use:
awk '/^[^[:space:]]/{print "Result : " $0}'
The use of ^[^[:space:]] ensures that every line that gets printed starts with a non-whitespace character, so empty lines and lines beginning with a space are skipped.
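The two suggestions are not equivalent, which a small experiment makes visible. In this sketch the input mirrors the question's data, with an empty line and a whitespace-only line added as assumptions, and printf replaces print so that %s is actually interpreted:

```shell
sample() { printf '%s\n' '# test' 'start' '' '   ' 'Need to print'; }

# NF-based filter: drops blank lines as well as "start" and "#" lines.
by_nf=$(sample | awk 'NF > 0 && !/^(start|#)/ { printf "Result : %s\n", $0 }')

# Character-class filter: drops blank and space-led lines only.
by_space=$(sample | awk '/^[^[:space:]]/ { printf "Result : %s\n", $0 }')

printf '%s\n' "$by_nf" '---' "$by_space"
```

The second filter would still need the original !~ /^start|^#/ condition to reproduce the intended output.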

The meaning of "a" in an awk command?

I have an awk command in a script I am trying to make work, and I don't understand the meaning of 'a':
awk 'FNR==NR{ a[$1]=$0;next } ($2 in a)' FILELIST.TXT FILEIN.* > FILEOUT.*
I'm quite new to using command line, so I'm just trying to figure things out, thanks.
a is an associative array.
a[$1] = $0;
takes the first word $1 on the line as the index in the array, and stores the whole line $0 as the value. It does this for the first file (while the file record number is equal to the overall record number). The next command means it doesn't process the rest of the script while it is processing the first file.
For the rest of the data files, it evaluates:
($2 in a)
and prints the line if the word in $2 is found. This makes storing $0 in a relatively expensive because it is storing a copy of the whole file (possibly twice if there's only one word on each line of the file). It is more conventional and sufficient to do a[$1]++ or even a[$1] = 1.
Given FILELIST.TXT
ABC The rest
DEF And more
Given FILEIN.1 containing:
Word ABC and so on
Grow FED won't be shown
This DEF will be shown
The XYZ will be missing
The output will be:
Word ABC and so on
This DEF will be shown
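The example can be reproduced with the lighter-weight form of the array, which stores only the keys. This is a sketch: the array name seen and the temporary directory are choices made for the demonstration, not part of the original command.

```shell
tmp=$(mktemp -d)
printf '%s\n' 'ABC The rest' 'DEF And more' > "$tmp/FILELIST.TXT"
printf '%s\n' 'Word ABC and so on' "Grow FED won't be shown" \
  'This DEF will be shown' 'The XYZ will be missing' > "$tmp/FILEIN.1"

# Referencing seen[$1] creates the key without storing the whole line.
out=$(awk 'FNR == NR { seen[$1]; next }  $2 in seen' \
  "$tmp/FILELIST.TXT" "$tmp/FILEIN.1")

printf '%s\n' "$out"
rm -rf "$tmp"
```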
Here a is not a command but an awk array; it could just as well be named arr:
awk 'FNR==NR {arr[$1]=$0;next} ($2 in arr)' FILELIST.TXT FILEIN.* > FILEOUT.*
a is nothing but an array, in your code
FNR==NR{ a[$1]=$0;next }
Creates an array called "a" with indexes taken from the first column of the first input file.
All element values are set to the current record.
The next statement forces awk to immediately stop processing the current record and go on to the next record.

awk and regular expressions confusion

Having never used awk before on Linux I am attempting to understand how it matches regular expressions. For example in the past based on my experience the regular expression /2/ would match 2 in all of the following lines.
This will match 2
This will not match 2
Now if I run the command awk '{if(NR~2)print}' sample.txt which has the contents
2 will be matched
This will not match 2
2 may be matched
The line that is matched is This will not match 2, which indicates it is matching line 2, because if I replace the command with awk '{if(NR~3)print}' sample.txt it matches 2 may be matched. Now if I also run the command awk '{if(NR~/^2$/)print}' sample.txt, it matches the same exact line, i.e. line 2.
However the source I am referring to at http://www.youtube.com/watch?feature=player_detailpage&v=Htnno4CHVus#t=502s seems to indicate otherwise.
What am I missing and how is the command awk '{if(NR~2)print}' sample.txt different to that of awk '{if(NR~/^2$/)print}' sample.txt?
The condition NR~2 is checking whether the record number, NR, matches 2. For a 2 or 3 line input file, the expression is equivalent to:
if (NR == 2)
Similarly with NR~3, of course. Try:
awk '/2/'
That will print all lines where the text of the line ($0) contains a 2. By default, a regular expression matches against the whole line; you could limit it to a particular field with $3 ~ /3/, for example.
An awk program consists of patterns and actions, where either the pattern or the action may be omitted.
awk '{ if ($0 ~ /2/) print }
/2/
/2/ { if ($0 ~ /a.*z/) print "Matches a.*z"; }'
The first line has no pattern; the action in the { ... } is executed for each input line (but only some input lines will generate output because of the conditional). All lines that contain a 2 will be printed. (If there is no argument to print, it prints $0 followed by a newline.)
The second line has a pattern but no action; all lines that contain a 2 will be printed again. (The missing action is equivalent to { print }.)
The third line has both a pattern and an action; all lines that both contain a 2 and also contain an 'a' followed by a 'z' will be remarked upon.
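Running the three-rule program over a few made-up lines shows each rule firing independently. The input strings here are assumptions for the demonstration:

```shell
out=$(printf '%s\n' 'a2z' 'plain' '2' |
  awk '{ if ($0 ~ /2/) print }
       /2/
       /2/ { if ($0 ~ /a.*z/) print "Matches a.*z" }')

# "a2z" is printed by rules 1 and 2 and remarked on by rule 3;
# "2" is printed by rules 1 and 2 only; "plain" produces nothing.
printf '%s\n' "$out"
```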
How are these two commands different?
`awk '{if(NR~2)print}' sample.txt`
`awk '{if(NR~/^2$/)print}' sample.txt`
The first command will print line numbers 2, 12, 20..29, 32, 42, ... 102, 112, 120..129, ... 200..299, ...; all lines where the line number contains a 2.
The second command will print only line number 2 because the /^2$/ constrains the value to contain start of string, digit 2 and end of string.
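With only three input lines the two commands look identical, so a longer input helps. The sketch below generates 25 lines (an arbitrary length) and collects the record numbers each condition selects:

```shell
# Generate the numbers 1..25, one per line.
numbers() { awk 'BEGIN { for (i = 1; i <= 25; i++) print i }'; }

# Unanchored: any record number containing the digit 2.
loose=$(numbers | awk 'NR ~ 2 { printf "%s%s", sep, NR; sep = " " } END { print "" }')

# Anchored: only the record number that is exactly 2.
strict=$(numbers | awk 'NR ~ /^2$/ { printf "%s%s", sep, NR; sep = " " } END { print "" }')

printf 'NR ~ 2      selects: %s\n' "$loose"
printf 'NR ~ /^2$/  selects: %s\n' "$strict"
```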
I take it that means that the source is wrong?
Now I've looked at the YouTube resource, I think you must have misunderstood what it is trying to teach. When it talks about {if (NR~2) print}, it should be saying it will print any line whose line number contains a 2; the video cites line numbers 2, 12, 20, 21, 22, etc. It should not be saying any line which contains a 2; I think the video does say that, but the video misspoke (while the text was accurate). The comparison against NR is not actually wrong, but it is unconventional; I'm not sure that I'd include regexes against NR in an introductory video describing awk. So, the video appears to have a glitch in the audio but the text on screen is accurate, I think. I may still have missed something.
The command awk '{ if ($0 ~ /2/) print }' against the file sample.txt, with the contents I mentioned, would only result in the output 2 will be matched. Is that correct?
That command, given the input:
2 will be matched
This will not match 2
2 may be matched
will print all three lines; they all contain the digit 2.
I also thought that the action was print and the pattern was $0 ~ /2/.
No; the pattern was empty (because there was nothing before the open brace) — so all lines match it — and the action was the part in braces { if ($0 ~ /2/) print }. Now, the action contains a conditional, but that's a separate issue.
Now the command awk '/2/' sample.txt would print all three lines. Is that correct?
Yes.
NR means the Number of the Record being processed...
You are matching against line number 2.