How can I exclude blank lines with awk?

Question
How can I exclude lines starting with a space character, and that have nothing else on the line? With awk, I want to print the line Need to print, but it's also printing the blank line. How can I exclude it?
Script: test.awk
$0 !~/^start|^#/ {
print "Result : %s",$0
}
Data
# test
start

Need to print
Result
Result : %s
Result : %s Need to print

Use the NF Variable
You aren't really asking about lines that start with a space, you're asking about how to discard blank lines. Pragmatically speaking, blank lines have no fields, so you can use the built-in NF variable to discard lines which don't have at least one field. For example:
$ awk 'NF > 0 && !/^(start|#)/ {print "Result: " $0}' /tmp/corpus
Result: Need to print

You can use:
awk '/^[^[:space:]]/{print "Result : " $0}'
The use of [^[:space:]] ensures that there is at least a single non-space character in every line that gets printed.
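For instance, combined with the question's start/# exclusion (a sketch; it assumes the sample data above is saved as file):
$ awk '/^[^[:space:]]/ && !/^(start|#)/ {print "Result : " $0}' file
Result : Need to print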

Related

Can I delete a field in awk?

This is test.txt:
0x01,0xDF,0x93,0x65,0xF8
0x01,0xB0,0x01,0x03,0x02,0x00,0x64,0x06,0x01,0xB0
0x01,0xB2,0x00,0x76
If I run
awk -F, 'BEGIN{OFS=","}{$2="";print $0}' test.txt
the result is:
0x01,,0x93,0x65,0xF8
0x01,,0x01,0x03,0x02,0x00,0x64,0x06,0x01,0xB0
0x01,,0x00,0x76
The $2 wasn't deleted, it just became empty.
When printing $0, I want the result to be:
0x01,0x93,0x65,0xF8
0x01,0x01,0x03,0x02,0x00,0x64,0x06,0x01,0xB0
0x01,0x00,0x76
All the existing solutions are good, though this is actually a tailor-made job for cut:
cut -d, -f 1,3- file
0x01,0x93,0x65,0xF8
0x01,0x01,0x03,0x02,0x00,0x64,0x06,0x01,0xB0
0x01,0x00,0x76
If you want to remove 3rd field then use:
cut -d, -f 1,2,4- file
To remove 4th field use:
cut -d, -f 1-3,5- file
I believe the simplest approach would be to use the sub() function to replace the first occurrence of the consecutive ,, (created after you set the 2nd field to NULL) with a single ,. But this assumes that you don't have any commas within the field values.
awk 'BEGIN{FS=OFS=","}{$2="";sub(/,,/,",");print $0}' Input_file
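Run against test.txt from the question, this should produce the desired output:
$ awk 'BEGIN{FS=OFS=","}{$2="";sub(/,,/,",");print $0}' test.txt
0x01,0x93,0x65,0xF8
0x01,0x01,0x03,0x02,0x00,0x64,0x06,0x01,0xB0
0x01,0x00,0x76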
2nd solution: Or you could use the match() function to match the stretch from the first comma to the next comma, and print the parts of the line before and after the matched string.
awk '
match($0, /,[^,]*,/){
    print substr($0,1,RSTART-1) "," substr($0,RSTART+RLENGTH)
}' Input_file
It's a bit heavy-handed, but this moves each field after field 2 down a place, and then changes NF so the unwanted field is not present:
$ awk -F, -v OFS=, '{ for (i = 2; i < NF; i++) $i = $(i+1); NF--; print }' test.txt
0x01,0x93,0x65,0xF8
0x01,0x01,0x03,0x02,0x00,0x64,0x06,0x01
0x01,0x00,0x76
$
Tested with both GNU Awk 4.1.3 and BSD Awk ("awk version 20070501" on macOS Mojave 10.14.6 — don't ask; it frustrates me too, but sometimes employers are not very good at forward thinking). Setting NF may or may not work on older versions of Awk — I was a little surprised it did work, but the surprise was a pleasant one, for a change.
If Awk is not an absolute requirement, and the input is indeed as trivial as in your example, sed might be a simpler solution.
sed 's/,[^,]*//' test.txt
This is especially elegant if you want to remove the second field. A more generic approach to remove the nth field would require you to put in a regex which matches the first n - 1 fields followed by the nth, then replace that with just the first n - 1.
So for n = 4 you'd have
sed 's/\([^,]*,[^,]*,[^,]*,\)[^,]*,/\1/' test.txt
or more generally, if your sed dialect understands braces for specifying repetitions
sed 's/\(\([^,]*,\)\{3\}\)[^,]*,/\1/' test.txt
Some sed dialects allow you to lose all those pesky backslashes with an option like -r or -E but again, this is not universally supported or portable.
In case it's not obvious, [^,] matches a single character which is not (newline or) comma; and \1 recalls the text from first parenthesized match (back reference; \2 recalls the second, etc).
Also, this is completely unsuitable for escaped or quoted fields (though I'm not saying it can't be done). Every comma acts as a field separator, no matter what.
With GNU sed you can add a number modifier to substitute nth match of non-comma characters followed by comma:
sed -E 's/[^,]*,//2' file
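That number modifier also makes the command easy to parameterize. A sketch, assuming GNU sed and a shell variable n holding the field number to delete:
n=2
sed -E "s/[^,]*,//$n" file
As with the other regex approaches, this assumes the deleted field is not the last one, since it relies on the trailing comma.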
Using awk in a regex-free way, with the option to choose which column will be deleted:
awk '{ col = 2; n = split($0,arr,","); line = ""; for (i = 1; i <= n; i++) line = line ( i == col ? "" : ( line == "" ? "" : "," ) arr[i] ); print line }' test.txt
Step by step:
{
    col = 2                    # defines which column will be deleted
    n = split($0,arr,",")      # each line is split into an array;
                               # n is the number of elements in the array
    line = ""                  # this will be the new line
    for (i = 1; i <= n; i++)   # roaming through all elements in the array
        line = line ( i == col ? "" : ( line == "" ? "" : "," ) arr[i] )
        # appends a comma (except if line is still empty)
        # and the current array element to the line (except when on the selected column)
    print line                 # prints the line
}
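Applied to test.txt from the question, this should yield:
$ awk '{ col = 2; n = split($0,arr,","); line = ""; for (i = 1; i <= n; i++) line = line ( i == col ? "" : ( line == "" ? "" : "," ) arr[i] ); print line }' test.txt
0x01,0x93,0x65,0xF8
0x01,0x01,0x03,0x02,0x00,0x64,0x06,0x01,0xB0
0x01,0x00,0x76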
Another solution:
You can just pipe the output to sed and squeeze the doubled delimiters.
$ awk -F, 'BEGIN{OFS=","}{$2=""}1' edward.txt | sed 's/,,/,/g'
0x01,0x93,0x65,0xF8
0x01,0x01,0x03,0x02,0x00,0x64,0x06,0x01,0xB0
0x01,0x00,0x76
$
Commenting on the first solution of @RavinderSingh13 using the sub() function:
awk 'BEGIN{FS=OFS=","}{$2="";sub(/,,/,",");print $0}' Input_file
The gnu-awk manual: https://www.gnu.org/software/gawk/manual/html_node/Changing-Fields.html
"It is important to note that making an assignment to an existing field changes the value of $0 but does not change the value of NF, even when you assign the empty string to a field." (4.4 Changing the Contents of a Field)
So, following the first solution of RavinderSingh13 but without using sub() in this case, "The field is still there; it just has an empty value", here delimited by the two commas:
awk 'BEGIN {FS=OFS=","} {$2="";print $0}' file
0x01,,0x93,0x65,0xF8
0x01,,0x01,0x03,0x02,0x00,0x64,0x06,0x01,0xB0
0x01,,0x00,0x76
My solution:
awk -F, '
{
    regex = "^" $1 "," $2
    sub(regex, $1, $0)
    print $0
}'
or as a one-liner:
awk -F, '{regex="^"$1","$2;sub(regex, $1, $0);print $0;}' test.txt
I found that setting OFS="," was not necessary, since sub() edits $0 in place rather than rebuilding it from the fields. Note that this approach uses $1 and $2 as part of a regular expression, so it assumes the field values contain no regex metacharacters.
I would do it the following way; let file.txt content be:
0x01,0xDF,0x93,0x65,0xF8
0x01,0xB0,0x01,0x03,0x02,0x00,0x64,0x06,0x01,0xB0
0x01,0xB2,0x00,0x76
then
awk 'BEGIN{FS=",";OFS=""}{for(i=2;i<=NF;i+=1){$i="," $i};$2="";print}' file.txt
output
0x01,0x93,0x65,0xF8
0x01,0x01,0x03,0x02,0x00,0x64,0x06,0x01,0xB0
0x01,0x00,0x76
Explanation: I set OFS to nothing (an empty string), then for the 2nd and following columns I prepend a , to each value. Finally I set the 2nd field, which is now a comma plus the value, to nothing. Keep in mind this solution would need rework if you wish to remove the 1st column; a sketch of that follows.
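A sketch of that rework for removing the 1st column instead (untested beyond the sample data): append the comma after every field except the last, then blank out $1:
awk 'BEGIN{FS=",";OFS=""}{for(i=1;i<NF;i+=1){$i=$i ","};$1="";print}' file.txt
which should print
0xDF,0x93,0x65,0xF8
0xB0,0x01,0x03,0x02,0x00,0x64,0x06,0x01,0xB0
0xB2,0x00,0x76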

Bash code for selecting a few columns from a variable

In a file I have a list of coordinates stored.
From there I want to copy only the coordinate values and put them in another file.
I copy the correct section from the file using COORD=`grep -B${i} '&END COORD' ${cpki_file}`. Then I tried to use awk to extract the required numbers from the COORD variable. It does output all the numbers in the file, but it deletes the spaces between the values.
How do I write out the coordinate values just as they appear?
N=200
NEndCoord=`grep -B${N} '&END COORD' ${cpki_file}|wc -l`
NCoord=`grep -B${N} '&END COORD' ${cpki_file}| grep -B200 '&COORD' |wc -l`
let i=$NEndCoord-$NCoord
COORD=`grep -B${i} '&END COORD' ${cpki_file}`
echo "$COORD" | awk '{ print $2 $3 $4 }'
echo "$COORD" | awk '{ print $2 $3 $4 }'>tmp.txt
When you start using combinations of grep, sed, awk, cut and the like, you should realize you can do it all in a single awk command. In the case of the OP, this would do exactly the same:
awk '/[&]END COORD/{p=0}
p { print $2,$3,$4 }
/[&]COORD/{p=1}' file
This parses the file keeping track of a printing flag p. The flag is set when "&COORD" is found and unset when "&END COORD" is found. Printing is done only when the flag p is set. Since we don't want to print the line with "&END COORD", we have to unset the flag before we do the check for printing. The same holds for the line with "&COORD", but there we have to set the flag after we do the check for printing (it's a bit of a weird, reversed logic).
The problem with the above is that it will also process the lines
UNIT angstrom
If you want to have these removed, you might want to do a check on the total number of columns:
awk '/[&]END COORD/{p=0}
p && (NF==4){ print $2,$3,$4 }
/[&]COORD/{p=1}' file
Or only print the lines which are not empty and do not contain "UNIT":
awk '/[&]END COORD/{p=0}
p && (NF>0) && ($1 != "UNIT"){ print $2,$3,$4 }
/[&]COORD/{p=1}' file
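For illustration, given a hypothetical coordinate block like the following (element names and values invented for the example):
&COORD
UNIT angstrom
O 12.235 1.850 8.631
H 12.425 1.925 9.569
&END COORD
the last command should print only the three coordinate columns:
12.235 1.850 8.631
12.425 1.925 9.569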
sed one-liner:
sed -n '/^&COORD$/,/^UNIT/{s/.*[[:space:]]\+\(.*\)[[:space:]]\+\(.*\)[[:space:]]\+\(.*\)/\1\t\2\t\3/p}' <infile.txt >outfile.txt
Explanation:
Invocation:
sed: stream editor
-n: do not print unless explicitly requested
Commands in sed:
/^&COORD$/,/^UNIT/: Selects the group of lines from a &COORD line to the next UNIT line (inclusive).
{s/.*[[:space:]]\+\(.*\)[[:space:]]\+\(.*\)[[:space:]]\+\(.*\)/\1\t\2\t\3/p}: Processes each selected line.
s/.*[[:space:]]\+\(.*\)[[:space:]]\+\(.*\)[[:space:]]\+\(.*\): The regex captures the last three space-delimited groups on the line.
/\1\t\2\t\3/: Replace with tab delimited values of the captured groups.
p: Explicit printout.

How to filter the OTU by counts with AWK?

I am trying to filter out all the singletons from a fasta file.
Here is my input file:
>OTU1;size=3;
ATTCCCCGGGGGGG
>OTU2;size=1;
ATCCGGGACTGATC
>OTU3;size=5;
GAACTATCGGGTAA
>OTU4;size=1;
AATTGGCCATCT
The expected output is:
>OTU1;size=3;
ATTCCCCGGGGGGG
>OTU3;size=5;
GAACTATCGGGTAA
I've tried
awk -F'>' '{if($1>=2) {print $0}}' input.fasta > output.fasta
but this removes all the headers for each OTU.
Could anyone help me out?
Could you please try the following.
awk -F'[=;]' '/^>/{flag=""} $3>=2{flag=1} flag' Input_file
On each header (>) line, f=/=1;/ sets the flag f when the line matches the regex =1; (i.e. a singleton with size=1); !f then prints every record, header and sequence alike, while the flag is unset:
$ awk '/>/{f=/=1;/} !f' file
>OTU1;size=3;
ATTCCCCGGGGGGG
>OTU3;size=5;
GAACTATCGGGTAA
awk -v FS='[;=]' 'prev_sz>=2 && !/size/{print prev RS $0} /size/{prev=$0;prev_sz=$(NF-1)}'
>OTU1;size=3;
ATTCCCCGGGGGGG
>OTU3;size=5;
GAACTATCGGGTAA
Store the size from each header line in the prev_sz variable and the whole line in the prev variable. Now check whether it is >= 2; if so, print the previous line and the current line. RS is used to print the newline between them.
While all the above methods work, they are limited by the fact that the input always has to look the same, i.e. the sequence name in your fasta file needs to have the form:
>NAME;size=value;
A few solutions can handle slightly more extended sequence names, but none handle the case where things get a bit more generic, i.e.
>NAME;label1=value1;label2=value2;STRING;label3=value3;
Print the sequences where label xxx matches value vvv:
awk '/>/{f = /;xxx=vvv;/} f' file.fasta
Print sequence where label xxx has a numeric value p bigger than q:
awk -v label="xxx" -v limit=q '
BEGIN{ ere = ";" label "=" }
/>/{ f = 0
     if (match($0, ere)) {
       value = 0 + substr($0, RSTART + RLENGTH)
       f = (value > limit)
     }
   }
f' <file>
In the above, ere is the regular expression we try to match; we use it to find the location of the value attached to label xxx. The substring starting after the match will have non-numeric characters after the value, but by adding 0 to it, it is converted to a number, dropping everything from the first non-numeric character onward (i.e. 3;label4=value4; is converted to 3). We check if the value is bigger than our limit, and print the sequence based on that result.
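As a hypothetical usage example (the label name and limit are made up), this would keep the sequences whose label2 value exceeds 5:
$ awk -v label="label2" -v limit=5 '
BEGIN{ ere = ";" label "=" }
/>/{ f=0; if (match($0, ere)) f = (0 + substr($0, RSTART+RLENGTH) > limit) }
f' file.fasta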

AWK - get value between two strings over multiple lines

input.txt:
>block1
111111111111111111111
>block2
222222222222222222222
>block3
333333333333333333333
AWK command:
awk '/>block2.*>/' input.txt
Expected output
222222222222222222222
However, AWK is returning nothing. What am I misunderstanding?
Thanks!
If you want to print the line after the line containing >block2, then you could use:
awk '/^>block2$/ { nr=NR+1 } NR == nr { print }'
Track the record number plus 1 when you find the match; when the current record number matches the remembered one, print the current record.
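With the question's input.txt, this should print:
$ awk '/^>block2$/ { nr=NR+1 } NR == nr { print }' input.txt
222222222222222222222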
If you want all the lines between the line >block2 and >block3, then you'd use:
awk '/^>block2$/,/^>block3/ {if ($0 !~ /^>block[23]$/) print }'
For all lines between the two markers, if the line doesn't match either marker, print it. The output is the same with the sample data file.
another awk
$ awk 'c&&c--; /^>block2/{c=1}' file
222222222222222222222
c specifies how many lines you want to print after the match. If you want the text between two markers:
$ awk '/^>block3/{exit} s; /^>block2/{s=1}' file
222222222222222222222
if there are multiple instances and you want them all, just change exit to s=0
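That variant would look like this (a sketch; the flag is reset instead of exiting, so every block between the markers is printed):
$ awk '/^>block3/{s=0} s; /^>block2/{s=1}' file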
You probably meant:
$ awk '/>/{f=/^>block2$/;next} f' file
222222222222222222222

print last two words of last line

I have a script which returns a few lines of output, and I am trying to print the last two words of the last line (irrespective of the number of lines in the output).
$ ./test.sh
service is running..
check are getting done
status is now open..
the test is passed
I tried running it as below, but it prints the last word of each line.
$ ./test.sh | awk '{ print $NF }'
running..
done
open..
passed
How do I print the last two words "is passed" using awk or sed?
Just say:
awk 'END {print $(NF-1), $NF}'
"normal" awks store the last line (but not all of them!), so that it is still accessible by the time you reach the END block.
Then, it is a matter of printing the penultimate and the last one. This can be done using the NF-1 and NF trick.
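Applied to the question's example, this should print exactly the desired words:
$ ./test.sh | awk 'END {print $(NF-1), $NF}'
is passed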
For robustness, if your last line might contain only 1 field, or your awk doesn't retain the field values in the END section:
awk '{split($0,a)} END{print (NF>1?a[NF-1]OFS:"") a[NF]}'
This might work for you (GNU sed):
sed '$s/.*\(\<..*\<.*\)/\1/p;d' file
This deletes all lines in the file, but on the last line it replaces the whole line with its last two words and prints the result if the substitution succeeds.