Delete matching nth line until blank line in awk/sed/grep

I need to delete the nth matching line in a file from the match up to the next blank line (i.e. one chunk of blank line delimited text starting with the nth match).

This deletes the chunk of text that begins at the fourth blank line and runs through the next blank line. The delimiting blank lines are deleted too.
sed -n '/^$/!{p;b};H;x;/^\(\n[^\n]*\)\{4\}/{:a;n;/^$/!ba;d};x;p' inputfile
Change the first /^$/ to change the start match. Change the second one to change the end match.
Given this input:
aaa
---
bbb
---
ccc
---
ddd delete me
eee delete me
===
fff
---
ggg
This version of the command:
sed -n '/^---$/!{p;b};H;x;/^\(\n[^\n]*\)\{3\}/{:a;n;/^===$/!ba;d};x;p' inputfile
would give this as the result:
aaa
---
bbb
---
ccc
fff
---
ggg
Edit:
I removed an extraneous b instruction from the sed commands above.
Here's a commented version:
sed -n ' # do not print by default
/^---$/!{ # if the input line does not match the begin block marker
p; # print it
b}; # branch to end of script and start processing next input line
H; # line matches begin mark, append to hold space
x; # swap pattern space and hold space
/^\(\n[^\n]*\)\{3\}/{ # if what was in hold consists of 3 lines
# in other words, 3 copies of the begin marker
:a; # label a
n; # read the next line
/^===$/!ba; # if it is not the end of block marker, branch to :a
d}; # otherwise, delete it, d branches to the end automatically
x; # swap pattern space and hold space
p; # print the line (it is outside the block we are looking for)
' inputfile # end of script, name of input file
Any unambiguous pattern should work for the begin and end markers. They can be the same or different.
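For comparison, the same begin/end-marker logic can be sketched in awk. This is only a minimal sketch under stated assumptions: the function name delete_block and the awk variables start_re, stop_re and n are made up for illustration and do not come from the answer above.

```shell
# Hypothetical helper: delete_block START_RE STOP_RE N FILE
# Drops the block that begins at the Nth line matching START_RE and
# ends at the next line matching STOP_RE (both marker lines included).
delete_block() {
    awk -v start_re="$1" -v stop_re="$2" -v n="$3" '
        skipping {                            # inside the block being deleted
            if ($0 ~ stop_re) skipping = 0    # end marker closes the block and is dropped too
            next
        }
        $0 ~ start_re && ++count == n {       # the nth begin marker starts the deletion
            skipping = 1
            next
        }
        { print }                             # everything else passes through
    ' "$4"
}
```

With the sample input above, delete_block '^---$' '^===$' 3 inputfile should give the same result as the sed command. Note that awk -v interprets backslash escapes, so patterns containing backslashes need extra care.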

perl -00 -pe 'if (/pattern/) {++$count == $n and $_ = "$`\n";}' file
-00 is to read the file in "paragraph" mode (record separator is one or more blank lines)
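$` is Perl's special variable for the "prematch" (text in front of the matching pattern)
$n must be replaced with the number of the match to delete (for example 3); as written it is an unset Perl variable, so the comparison never succeeds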
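Paragraph mode also exists in awk (an empty RS), so the same idea can be sketched there. This is a rough equivalent, not a translation of the Perl one-liner: the function name delete_from_nth_match and the variables pat, n and head are hypothetical, and unlike the Perl version it always ends the output with a blank line because ORS is set to two newlines.

```shell
# Hypothetical helper: delete_from_nth_match PATTERN N FILE
# Reads blank-line separated paragraphs; in the Nth paragraph that
# matches PATTERN, keeps only the text in front of the match.
delete_from_nth_match() {
    awk -v pat="$1" -v n="$2" '
        BEGIN { RS = ""; ORS = "\n\n" }          # paragraph mode in, blank line between paragraphs out
        match($0, pat) && ++count == n {
            head = substr($0, 1, RSTART - 1)     # the "prematch" part of the paragraph
            sub(/\n$/, "", head)                 # drop the newline left in front of the match
            if (head != "") print head
            next
        }
        { print }
    ' "$3"
}
```

For example, delete_from_nth_match 'm1' 2 file trims the second paragraph containing m1 and leaves all other paragraphs untouched.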

In AWK
/m1/ {i++};
(i==3) {while (getline temp > 0 && temp != "" ){}; if (temp == "") {i++;next}};
{print}
Transforms this:
m1 1
first
m1 2
second
m1 3
third delete me!
m1 4
fourth
m1 5
last
into this:
m1 1
first
m1 2
second
m1 4
fourth
m1 5
last
deleting the third block of "m1" ...
HTH!

Obligatory awk script. Just change n=2 to whatever your nth match should be.
n=2; awk -v n=$n '/^HEADER$/{++i==n && ++flag} !flag; /^$/&&flag{flag=0}' ./file
Input
$ cat ./file
HEADER
line1a
line2a
line3a
HEADER
line1b
line2b
line3b
HEADER
line1c
line2c
line3c
HEADER
line1d
line2d
line3d
Output
$ n=2; awk -v n=$n '/^HEADER$/{++i==n&&++flag} !flag; /^$/&&flag{flag=0}' ./file
HEADER
line1a
line2a
line3a
HEADER
line1c
line2c
line3c
HEADER
line1d
line2d
line3d

Related

awk match ONLY X and Y

I want to do strict matching on a text file so that it only returns the patterns I have anded. So for example in a file:
xyz
xy
yx
zyx
I want to run a command similar to:
awk '/x/ && /y/' filename.txt
and I would like it to return only the lines.
yx
xy
and ignore the others because although they do contain an x and a y, they also have a z so they are ignored.
Is this possible in awk?
I'd just keep it clear and simple, e.g. depending on your requirements for matching lines that only contain x or only contain y which you didn't include in your example:
$ awk '/^[xy]+$/' file
xy
yx
or:
$ awk '/x/ && /y/ && !/[^xy]/' file
xy
yx
The condition /x/ && /y/ matches any line in which both an x and a y are present.
Edit:
To allow the same chars in the whole string, you can use a repeated character class and assert the start and end of the string:
awk '/^[xy]+$/' file
If you also want to allow matching spaces, uppercase X and Y and do not want to match empty lines:
awk '/^[[:space:]]*[xyXY][[:space:]xyXY]*$/' file
The pattern matches:
^ Start of string
[[:space:]]* Match optional spaces
[xyXY] Match a single char x y X Y
[[:space:]xyXY]* Match optional spaces or x y X Y
$ End of string
Assumptions:
user provides a list of characters to match on (x and y in the provided example)
lines of interest are those that contain only said characters (plus white space)
matches should be case insensitive, ie, x will match on both x and X
blank/empty lines, and lines with only white space, are to be ignored
Adding more lines to the sample input:
$ cat filename.txt
xyz
xy
yx
zyx
---------
xxx
abc def xy
Xy xY XY
z x yy z
x; y; X; Y:
xxyYxy XXyxyy yx # tab delimited
# 1 space
# blank/empty line
NOTE: comments added for clarification; file does not contain any comments
One awk idea:
awk -v chars='xY' ' # provide list of characters (in the form of a string) to match on
BEGIN { regex="[" tolower(chars) "]" } # build regex of lowercase characters, eg: "[xy]"
{ line=tolower($0) # make copy of all lowercase line
gsub(/[[:space:]]/,"",line) # remove all white space
if (length(line) == 0) # if length of line==0 (blank/empty lines, lines with only white space) then ...
next # skip to next line of input
gsub(regex,"",line) # remove all characters matching regex
if (length(line) == 0) # if length of line == 0 (ie, no other characters) then ...
print $0 # print current line to stdout
}
' filename.txt
This generates:
xy
yx
xxx
Xy xY XY
xxyYxy XXyxyy yx
NOTE: the last 2 input lines (1 space, blank/empty) are ignored
This awk solution applies the condition on the main block to process only lines containing 'x' and 'y' using /x/&&/y/.
Inside the action block the record $0 is assigned to a variable named temp which then has the 'x' and 'y' occurrences removed using gsub(/[xy]/, "",temp). A conditional block then determines the length of temp after the substitution: if the length is 0, the line could only have contained 'x' and 'y' characters, so the line is printed.
awk '/x/&&/y/ { temp=$0; gsub(/[xy]/, "",temp); if (length(temp)==0){print $0}}' input.txt
tested with input.txt file:
xyz
xy
yx
zyx
y
x
xxy
yyx
result:
xy
yx
xxy
yyx
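The same "every listed character present, nothing else present" check can be written without building a regex, which sidesteps escaping problems if the character list ever contains regex metacharacters. A minimal sketch, assuming a hypothetical function name only_these_chars and a hypothetical awk variable chars:

```shell
# Hypothetical helper: only_these_chars CHARS FILE
# Prints lines that contain every character in CHARS and no other character.
only_these_chars() {
    awk -v chars="$1" '
        {
            # every required character must occur somewhere in the line
            for (i = 1; i <= length(chars); i++)
                if (index($0, substr(chars, i, 1)) == 0) next
            # no character of the line may fall outside the list
            for (i = 1; i <= length($0); i++)
                if (index(chars, substr($0, i, 1)) == 0) next
            print
        }
    ' "$2"
}
```

Called as only_these_chars xy input.txt on the test file above, it should print the same four lines (xy, yx, xxy, yyx).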
You can treat the strings as a set of characters and do a set equality on the two strings.
awk -v set='xy' '
function cmp(s1, s2) {
# turns s1 and s2 into associative arrays to do a set equality comparison
# cmp("xy", "xyxyxyxy") returns 1; cmp("xy", "xyz") returns 0
split("", a1); split("", a2) # clear the arrays from last use
split(s1, tmp, ""); for (i in tmp) a1[tmp[i]]
split(s2, tmp, ""); for (i in tmp) a2[tmp[i]]
if (length(a1) != length(a2)) return 0
for (e in a1) if (!(e in a2)) return 0
return 1
}
cmp(set, $1)' file
Prints:
xy
yx

sed match pattern and print two lines before pattern and all lines upto end of file

sample
tyu
abc
def
ghi
fgg
yui
Output
abc
def
ghi
fgg
yui
Matching pattern : ^def
Print two lines before matching line including pattern and print all lines after pattern until end
1st solution: With your shown samples try following awk code, written and tested in GNU awk.
awk -v RS='(^|\n)def.*' '
RT{
num=split($0,arr,ORS)
sub(/\n$/,"",RT)
print arr[num-1] ORS arr[num] RT
}
' Input_file
2nd solution (more generic): here the number of lines to print before the match is given in the awk variable lines, so we do NOT need to hardcode which array elements (produced by split) to print, as the first solution does.
awk -v lines="2" -v RS='(^|\n)def.*' '
RT{
val=""
num=split($0,arr,ORS)
sub(/\n$/,"",RT)
for(i=lines;i<=num;i++){
val=(val?val ORS:"") arr[i]
}
print val RT
}
' Input_file
Would you please try the following:
awk '
f {print; next} # if the flag f is set, print the line
{que[NR % 3] = $0} # store the line in a queue
/^def/ { # if the pattern matches
f = 1 # then set the flag
for (i = NR - 2; i <= NR; i++) # and print two previous lines and current line
print que[i % 3]
}
' input_file.txt
This might work for you (GNU sed):
sed '1N;N;/def/!D;:a;n;ba' file
Open a window of 3 lines and if the desired string is not present, delete the first and append another until a match is found.
Then print those lines and all other lines to the end of the file.
N.B. This will start printing as soon as a match is found, even if the match is in the first or second line. If the match must be in the third or a subsequent line, use:
sed '1N;N;/def[^\n]*$/!D;:a;n;ba' file
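The ring-buffer awk answer above hardcodes a window of three lines. A possible generalization, as a sketch only, takes the number of leading context lines as a parameter; the function name print_from_match and the variables pat, before, buf and first are assumptions made for this example.

```shell
# Hypothetical helper: print_from_match PATTERN BEFORE FILE
# Prints BEFORE lines of leading context, the first line matching PATTERN,
# and everything after it up to the end of the file.
print_from_match() {
    awk -v pat="$1" -v before="$2" '
        found { print; next }                    # after the match: pass everything through
        { buf[NR % (before + 1)] = $0 }          # ring buffer holding the last before+1 lines
        $0 ~ pat {
            found = 1
            first = NR - before
            if (first < 1) first = 1             # match too close to the top of the file
            for (i = first; i <= NR; i++) print buf[i % (before + 1)]
        }
    ' "$3"
}
```

With the sample above, print_from_match '^def' 1 file starts the output at abc, while a value of 2 starts it at tyu.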

awk to remove lines starting with symbol without keyword in them

I am trying to selectively remove lines that start with # but do not contain the keywords Build or Type in them. The lines that do not start with # are unchanged. I can remove all lines starting with # using the first awk, but I am not sure how to selectively remove lines that start with # but do not contain a keyword. The second awk does execute but only leaves two lines (#CN Filters:
# Flags = 1,2,3). Thank you :).
awk
awk '!/#/' input < out # will remove all lines with #
awk
awk '/#/ && !/Build|Length/' input < out # remove lines starting with # but must not have Build or Length in them
input various spacing
#Build = NCBI Build 37
#CN Filters:
# Flags = 1,2,3
# Type = Lowess
Length Event ID
1 Gain xxx
10 Loss yyy
desired output
Build = NCBI Build 37
Type = Lowess
Length Event ID
1 Gain xxx
10 Loss yyy
You want to do something with lines that start with # and do not contain Build or Type, right? I'm sure you could write that condition:
Start with # = /^#/
AND = &&
Do not contains Build or Type = !/Build|Type/
i.e.
/^#/ && !/Build|Type/
Now, what is it you wanted to do when that condition is true? Not print the current line. So you could just write that as simply:
awk '/^#/ && !/Build|Type/{next} 1'
but if you prefer to use awk's default print given a true condition then you just need to negate your condition (a{next} 1 = !a):
awk '!(/^#/ && !/Build|Type/)'
which by boolean algebra ( !(a && b) = !a || !b) can be reduced to:
awk '!/^#/ || /Build|Type/'
$ awk '!/^#/ || /Build|Type/' file
#Build = NCBI Build 37
# Type = Lowess
Length Event ID
1 Gain xxx
10 Loss yyy
If you want to remove those initial # characters and the spaces after them:
$ awk '!/^#/ || /Build|Type/ { sub("^#[[:blank:]]*", ""); print }' file
Build = NCBI Build 37
Type = Lowess
Length Event ID
1 Gain xxx
10 Loss yyy
Following awk may help you on same too.
awk '!(/^#/ && !/Build/ && !/Type/){gsub(/^#|^# +/,"");print}' Input_file
Explanation:
awk '
!(/^#/ && !/Build/ && !/Type/){ ##Select every line except those that start with # and contain neither Build nor Type.
gsub(/^#|^# +/,""); ##Remove a leading hash, together with any spaces that follow it, from the current line.
print ##Printing the current line here.
}' Input_file ##Mentioning the Input_file name here.
A sed solution:
$ sed 's/^# *\(.*\(Build\|Type\).*\)/\1/;/^#/d' file
Build = NCBI Build 37
Type = Lowess
Length Event ID
1 Gain xxx
10 Loss yyy
awk '!/CN|Fl/{sub(/\43/,"")sub(/^\s*/,"");print}' file
Build = NCBI Build 37
Type = Lowess
Length Event ID
1 Gain xxx
10 Loss yyy

print whole variable contents if the number of lines are greater than N

How to print all lines if certain condition matches.
Example:
echo "$ip"
this is a sample line
another line
one more
last one
If this file has more than 3 lines then print the whole variable.
I tried:
echo $ip| awk 'NR==4'
last one
echo $ip|awk 'NR>3{print}'
last one
echo $ip|awk 'NR==12{} {print}'
this is a sample line
another line
one more
last one
echo $ip| awk 'END{x=NR} x>4{print}'
Need to achieve this:
If this file has more than 3 lines then print the whole file. I can do this using wc and bash but need a one liner.
The right way to do this (no echo, no pipe, no loops, etc.):
$ awk -v ip="$ip" 'BEGIN{if (gsub(RS,"&",ip)>2) print ip}'
this is a sample line
another line
one more
last one
You can use Awk as follows,
echo "$ip" | awk '{a[$0]; next}END{ if (NR>3) { for(i in a) print i }}'
one more
another line
this is a sample line
last one
you can also make the value 3 configurable from an awk variable,
echo "$ip" | awk -v count=3 '{a[$0]; next}END{ if (NR>count) { for(i in a) print i }}'
The idea is to store each line with {a[$0]; next} as it is read; by the time the END clause is reached, NR holds the number of lines in the string/file. The lines are then printed only if that count is greater than 3 (or whatever value count is set to).
And always remember to double-quote variables in bash so that the shell does not word-split them.
Using James Brown's useful comment below to preserve the order of lines, do
echo "$ip" | awk -v count=3 '{a[NR]=$0; next}END{if(NR>count)for(i=1;i<=NR;i++)print a[i]}'
this is a sample line
another line
one more
last one
Another in awk. First test files:
$ cat 3
1
2
3
$ cat 4
1
2
3
4
Code:
$ awk 'NR<4{b=b (NR==1?"":ORS)$0;next} b{print b;b=""}1' 3 # look ma, no lines
[this line left intentionally blank. no wait!]
$ awk 'NR<4{b=b (NR==1?"":ORS)$0;next} b{print b;b=""}1' 4
1
2
3
4
Explained:
NR<4 { # for the first 3 records
b=b (NR==1?"":ORS) $0 # buffer them to b with ORS delimiter
next # proceed to next record
}
b { # if buffer has records, ie. NR>=4
print b # output buffer
b="" # and reset it
}1 # print all records after that
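The buffering approach explained above can be made configurable. The following is a sketch rather than a drop-in replacement: the function name print_if_more_than and the variables min and buf are hypothetical, and it buffers lines in an array instead of a string so that leading empty lines are not mistaken for an empty buffer.

```shell
# Hypothetical helper: print_if_more_than MIN   (reads standard input)
# Prints the whole input only if it has more than MIN lines.
print_if_more_than() {
    awk -v min="$1" '
        NR <= min { buf[NR] = $0; next }                            # hold back the first MIN lines
        NR == min + 1 { for (i = 1; i <= min; i++) print buf[i] }   # threshold passed: flush the buffer
        { print }                                                   # then stream the remaining lines
    '
}
```

Usage with the variable from the question would be printf '%s\n' "$ip" | print_if_more_than 3, which prints nothing for inputs of three lines or fewer.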

Extracting ID and sequence from a FASTQ file

I'm trying to manipulate a Fastq file.
It looks like this:
#HWUSI-EAS610:1:1:3:1131#0/1
GATGCTAAGCCCCTAAGGTCATAAGACTGNNANGTC
+
B<ABA<;B#=4A9#:6#96:1??9;>##########
#HWUSI-EAS610:1:1:3:888#0/1
GATAGGACCAAACATCTAACATCTTCCCGNNGNTTC
+
B9>>ABA#B7BB:7?#####################
#HWUSI-EAS610:1:1:4:941#0/1
GCTTAGGAAGGAAGGAAGGAAGGGGTGTTCTGTAGT
+
BBBB:CB=#CB#?BA/#BA;6>BBA8A6A<?A4?B=
...
...
...
#HWUSI-EAS610:1:1:7:1951#0/1
TGATAGATAAGTGCCTACCTGCTTACGTTACTCTCC
+
BB=A6A9>BBB9B;B:B?B#BA#AB#B:74:;8=>7
My expected output is:
#HWUSI-EAS610:1:1:3:1131#0/1
GACNTNNCAGTCTTATGACCTTAGGGGCTTAGCATC
#HWUSI-EAS610:1:1:3:888#0/1
GAANCNNCGGGAAGATGTTAGATGTTTGGTCCTATC
#HWUSI-EAS610:1:1:4:941#0/1
ACTACAGAACACCCCTTCCTTCCTTCCTTCCTAAGC
So, the ID lines are those starting with #HWUSI (e.g. #HWUSI-EAS610:1:1:7:1951#0/1). After each ID there is a line with its sequence.
Now, I would like to obtain a file with only each ID and its corresponding sequence, and the sequence should be reverse-complemented (A=T, T=A, C=G, G=C).
With sed I can obtain all the reverse-complemented sequences with the command
sed -n '2~4p' MYFILE.fq | rev | tr ATCG TAGC
How can I obtain also the corresponding ID?
With sed:
sed -n '/#HWUSI/ { p; s/.*//; N; :a /\n$/! { s/\n\(.*\)\(.\)/\2\n\1/; ba }; y/ATCG/TAGC/; p }' filename
This works as follows:
/#HWUSI/ { # If a line contains #HWUSI
p # print it
s/.*// # empty the pattern space
N # fetch the sequence line. It is now preceded
# by a newline in the pattern space. That is
# going to be our cursor
:a # jump label for looping
/\n$/! { # while the cursor has not arrived at the end
s/\n\(.*\)\(.\)/\2\n\1/ # move the last character before the cursor
ba # go back to a. This loop reverses the sequence
}
y/ATCG/TAGC/ # then invert it
p # and print it.
}
I intentionally left the newline in there for more readable spacing; if that is not desired, replace the last p with a P (upper case instead of lower case). Where p prints the whole pattern space, P only prints the stuff before the first newline.
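The same task can also be sketched in awk, which some may find easier to read than the sed loop. The assumptions here are mine, not the question's: records are exactly four lines long, bases are upper case, and the function name revcomp_fastq and the array comp are hypothetical.

```shell
# Hypothetical helper: revcomp_fastq FILE
# For each 4-line record, prints the ID line followed by the
# reverse complement of the sequence line.
revcomp_fastq() {
    awk '
        BEGIN { comp["A"] = "T"; comp["T"] = "A"; comp["C"] = "G"; comp["G"] = "C" }
        NR % 4 == 1 { print; next }                   # line 1 of a record: the ID
        NR % 4 == 2 {                                 # line 2 of a record: the sequence
            out = ""
            for (i = length($0); i >= 1; i--) {       # walk the sequence backwards
                c = substr($0, i, 1)
                out = out ((c in comp) ? comp[c] : c) # complement A/T/C/G, keep N and anything else
            }
            print out
        }
    ' "$1"
}
```

On the first record of the example this yields GACNTNNCAGTCTTATGACCTTAGGGGCTTAGCATC, which is the sequence given in the expected output of the question.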
$ sed -n '/^[^#]/y/ATCG/TAGC/;/^#/p;/^[ATCGN]*$/p' file
#HWUSI-EAS610:1:1:3:1131#0/1
CTACGATTCGGGGATTCCAGTATTCTGACNNTNCAG
#HWUSI-EAS610:1:1:3:888#0/1
CTATCCTGGTTTGTAGATTGTAGAAGGGCNNCNAAG
#HWUSI-EAS610:1:1:4:941#0/1
CGAATCCTTCCTTCCTTCCTTCCCCACAAGACATCA
#HWUSI-EAS610:1:1:7:1951#0/1
ACTATCTATTCACGGATGGACGAATGCAATGAGAGG
Explanation
/^[^#]/y/ATCG/TAGC/ # Translate bases on lines that don't start with an #
/^#/p # Print IDs
/^[ATCGN]*$/p # Print sequence lines