I need to delete the nth matching line in a file from the match up to the next blank line (i.e. one chunk of blank line delimited text starting with the nth match).
This deletes the chunk of text that starts at the fourth blank line and ends at the next blank line. The two delimiting blank lines are deleted as well.
sed -n '/^$/!{p;b};H;x;/^\(\n[^\n]*\)\{4\}/{:a;n;/^$/!ba;d};x;p' inputfile
Change the first /^$/ to change the start match. Change the second one to change the end match.
Given this input:
aaa
---
bbb
---
ccc
---
ddd delete me
eee delete me
===
fff
---
ggg
This version of the command:
sed -n '/^---$/!{p;b};H;x;/^\(\n[^\n]*\)\{3\}/{:a;n;/^===$/!ba;d};x;p' inputfile
would give this as the result:
aaa
---
bbb
---
ccc
fff
---
ggg
Edit:
I removed an extraneous b instruction from the sed commands above.
Here's a commented version:
sed -n ' # do not print by default
/^---$/!{ # if the input line does not match the begin block marker
p; # print it
b}; # branch to end of script and start processing next input line
H; # line matches begin mark, append to hold space
x; # swap pattern space and hold space
/^\(\n[^\n]*\)\{3\}/{ # if what was in hold consists of 3 lines
# in other words, 3 copies of the begin marker
:a; # label a
n; # read the next line
/^===$/!ba; # if it is not the end of block marker, branch to :a
d}; # otherwise, delete it, d branches to the end automatically
x; # swap pattern space and hold space
p; # print the line (it is outside the block we are looking for)
' inputfile # end of script, name of input file
Any unambiguous pattern should work for the begin and end markers. They can be the same or different.
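If the hold-space bookkeeping is hard to follow, the same idea can be sketched in awk with an explicit counter. This is an illustrative alternative, not the sed command above; the function name delete_nth_block, its argument order, and the variable names start_re and stop_re are made up for this example.

```shell
# Hypothetical helper: delete_nth_block N START_REGEX STOP_REGEX [file...]
# Drops the N-th block that starts at a line matching START_REGEX and ends at
# the next line matching STOP_REGEX; both marker lines are dropped as well.
delete_nth_block() {
    n=$1 start_re=$2 stop_re=$3
    shift 3
    awk -v n="$n" -v start_re="$start_re" -v stop_re="$stop_re" '
        skipping { if ($0 ~ stop_re) skipping = 0; next }     # inside the block: drop lines up to and including the stop marker
        $0 ~ start_re && ++count == n { skipping = 1; next }  # n-th start marker: begin dropping
        { print }                                             # everything else passes through
    ' "$@"
}

# Same sample as above: the third --- through the following === disappears.
printf '%s\n' aaa --- bbb --- ccc --- 'ddd delete me' 'eee delete me' === fff --- ggg |
    delete_nth_block 3 '^---$' '^===$'
```

Because the "skipping" rule is tested first, the start and stop patterns can be identical (for example '^$' for both), just as with the sed version.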
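perl -00 -pe 'BEGIN { $n = 3 } if (/pattern/) {++$count == $n and $_ = "$`\n";}' file
-00 is to read the file in "paragraph" mode (record separator is one or more blank lines)
$` is Perl's special variable for the "prematch" (text in front of the matching pattern)
$n is the number of the match to delete; it has to be set inside the Perl code (the value 3 in the BEGIN block is only an example), otherwise it is undefined and nothing is deleted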
In AWK
/m1/ {i++};
(i==3) {while (getline temp > 0 && temp != "" ){}; if (temp == "") {i++;next}};
{print}
Transforms this:
m1 1
first
m1 2
second
m1 3
third delete me!
m1 4
fourth
m1 5
last
into this:
m1 1
first
m1 2
second
m1 4
fourth
m1 5
last
deleting the third block of "m1" ...
HTH!
Obligatory awk script. Just change n=2 to whatever your nth match should be.
n=2; awk -v n=$n '/^HEADER$/{++i==n && ++flag} !flag; /^$/&&flag{flag=0}' ./file
Input
$ cat ./file
HEADER
line1a
line2a
line3a
HEADER
line1b
line2b
line3b
HEADER
line1c
line2c
line3c
HEADER
line1d
line2d
line3d
Output
$ n=2; awk -v n=$n '/^HEADER$/{++i==n&&++flag} !flag; /^$/&&flag{flag=0}' ./file
HEADER
line1a
line2a
line3a
HEADER
line1c
line2c
line3c
HEADER
line1d
line2d
line3d
I want to do strict matching on a text file so that it only returns the patterns I have anded. So for example in a file:
xyz
xy
yx
zyx
I want to run a command similar to:
awk '/x/ && /y/' filename.txt
and I would like it to return only these lines:
yx
xy
and ignore the others because although they do contain an x and a y, they also have a z so they are ignored.
Is this possible in awk?
I'd just keep it clear and simple, e.g. depending on your requirements for matching lines that only contain x or only contain y which you didn't include in your example:
$ awk '/^[xy]+$/' file
xy
yx
or:
$ awk '/x/ && /y/ && !/[^xy]/' file
xy
yx
The condition /x/ && /y/ matches whenever both an x and a y are present, regardless of which other characters the line contains.
Edit:
To allow only those characters across the whole string, use a repeated character class anchored at the start and end of the string:
awk '/^[xy]+$/' file
If you also want to allow matching spaces, uppercase X and Y and do not want to match empty lines:
awk '/^[[:space:]]*[xyXY][[:space:]xyXY]*$/' file
The pattern matches:
^ Start of string
[[:space:]]* Match optional spaces
[xyXY] Match a single char x y X Y
[[:space:]xyXY]* Match optional spaces or x y X Y
$ End of string
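As a quick check of the pattern breakdown above, here is a small sketch that runs the regex against a few representative lines. The wrapper name only_xy is hypothetical and exists only to keep the example short.

```shell
# Hypothetical wrapper around the pattern explained above.
only_xy() {
    awk '/^[[:space:]]*[xyXY][[:space:]xyXY]*$/' "$@"
}

# Lines containing z or punctuation are rejected; so are the blank line
# and the line of spaces, because at least one x/y/X/Y is required.
printf '%s\n' 'xyz' 'xy' 'zyx' 'Xy xY' 'x; y' '   ' '' | only_xy
```

Only "xy" and "Xy xY" are printed.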
Assumptions:
user provides a list of characters to match on (x and y in the provided example)
lines of interest are those that contain only said characters (plus white space)
matches should be case insensitive, ie, x will match on both x and X
blank/empty lines, and lines with only white space, are to be ignored
Adding more lines to the sample input:
$ cat filename.txt
xyz
xy
yx
zyx
---------
xxx
abc def xy
Xy xY XY
z x yy z
x; y; X; Y:
xxyYxy XXyxyy yx # tab delimited
# 1 space
# blank/empty line
NOTE: comments added for clarification; file does not contain any comments
One awk idea:
awk -v chars='xY' ' # provide list of characters (in the form of a string) to match on
BEGIN { regex="[" tolower(chars) "]" } # build regex of lowercase characters, eg: "[xy]"
{ line=tolower($0) # make copy of all lowercase line
gsub(/[[:space:]]/,"",line) # remove all white space
if (length(line) == 0) # if length of line==0 (blank/empty lines, lines with only white space) then ...
next # skip to next line of input
gsub(regex,"",line) # remove all characters matching regex
if (length(line) == 0) # if length of line == 0 (ie, no other characters) then ...
print $0 # print current line to stdout
}
' filename.txt
This generates:
xy
yx
xxx
Xy xY XY
xxyYxy XXyxyy yx
NOTE: the last 2 input lines (1 space, blank/empty) are ignored
This awk solution applies the condition on the main block to process only lines containing 'x' and 'y' using /x/&&/y/.
Inside the action block the record $0 is assigned to a variable named temp which then has the 'x' and 'y' occurrences removed using gsub(/[xy]/, "",temp). A conditional block then determines the length of temp after the substitution: if the length is 0, the line could only have contained 'x' and 'y' characters, so the line is printed.
awk '/x/&&/y/ { temp=$0; gsub(/[xy]/, "",temp); if (length(temp)==0){print $0}}' input.txt
tested with input.txt file:
xyz
xy
yx
zyx
y
x
xxy
yyx
result:
xy
yx
xxy
yyx
You can treat the strings as a set of characters and do a set equality on the two strings.
awk -v set='xy' '
function cmp(s1, s2) {
# turns s1 and s2 into associative arrays to do a set equality comparison
# cmp("xy", "xyxyxyxy") returns 1; cmp("xy", "xyz") returns 0
split("", a1); split("", a2) # clear the arrays from last use
split(s1, tmp, ""); for (i in tmp) a1[tmp[i]]
split(s2, tmp, ""); for (i in tmp) a2[tmp[i]]
if (length(a1) != length(a2)) return 0
for (e in a1) if (!(e in a2)) return 0
return 1
}
cmp(set, $1)' file
Prints:
xy
yx
sample
tyu
abc
def
ghi
fgg
yui
Output
abc
def
ghi
fgg
yui
Matching pattern : ^def
Print two lines before matching line including pattern and print all lines after pattern until end
1st solution: With your shown samples try following awk code, written and tested in GNU awk.
awk -v RS='(^|\n)def.*' '
RT{
num=split($0,arr,ORS)
sub(/\n$/,"",RT)
print arr[num-1] ORS arr[num] RT
}
' Input_file
2nd solution (more generic): here the number of lines to print before the match is given in the awk variable named lines, so there is no need to hardcode which array elements are printed (as the print statement of the first solution does).
awk -v lines="2" -v RS='(^|\n)def.*' '
RT{
val=""
num=split($0,arr,ORS)
sub(/\n$/,"",RT)
for(i=lines;i<=num;i++){
val=(val?val ORS:"") arr[i]
}
print val RT
}
' Input_file
Would you please try the following:
awk '
f {print; next} # if the flag f is set, print the line
{que[NR % 3] = $0} # store the line in a queue
/^def/ { # if the pattern matches
f = 1 # then set the flag
for (i = (NR > 2 ? NR - 2 : 1); i <= NR; i++) # and print up to two previous lines and the current line
print que[i % 3] # (the lower bound avoids empty output when the match is on line 1 or 2)
}
' input_file.txt
This might work for you (GNU sed):
sed '1N;N;/def/!D;:a;n;ba' file
Open a window of 3 lines and if the desired string is not present, delete the first and append another until a match is found.
Then print those lines and all other lines to the end of the file.
N.B. This will start printing as soon as a match is found, even if the match is in the first or second line. If the match must be in the third or a subsequent line, use:
sed '1N;N;/def[^\n]*$/!D;:a;n;ba' file
I am trying to selectively remove lines that start with # but do not contain the keywords Build or Type. Lines that do not start with # should be left unchanged. I can remove all lines starting with # using the first awk, but I am not sure how to selectively remove lines that start with # but do not contain a keyword. The second awk does execute but only leaves two lines (#CN Filters: and # Flags = 1,2,3). Thank you :).
awk '!/#/' input > out # will remove all lines with #
awk '/#/ && !/Build|Length/' input > out # remove lines starting with # but must not have Build or Length in them
input various spacing
#Build = NCBI Build 37
#CN Filters:
# Flags = 1,2,3
# Type = Lowess
Length Event ID
1 Gain xxx
10 Loss yyy
desired output
Build = NCBI Build 37
Type = Lowess
Length Event ID
1 Gain xxx
10 Loss yyy
You want to do something with lines that start with # and do not contain Build or Type, right? I'm sure you could write that condition:
Start with # = /^#/
AND = &&
Do not contains Build or Type = !/Build|Type/
i.e.
/^#/ && !/Build|Type/
Now, what is it you wanted to do when that condition is true? Not print the current line. So you could write that simply as:
awk '/^#/ && !/Build|Type/{next} 1'
but if you prefer to use awk's default print for a true condition, then you just need to negate your condition (a{next} 1 = !a):
awk '!(/^#/ && !/Build|Type/)'
which by boolean algebra ( !(a && b) = !a || !b) can be reduced to:
awk '!/^#/ || /Build|Type/'
$ awk '!/^#/ || /Build|Type/' file
#Build = NCBI Build 37
# Type = Lowess
Length Event ID
1 Gain xxx
10 Loss yyy
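To see that the three forms derived above really are equivalent, a small sketch can run all of them against the sample input and compare the results. The variable names sample, a, b and c are just for this illustration.

```shell
# Sample input from the question, kept in a variable for the comparison.
sample='#Build = NCBI Build 37
#CN Filters:
#   Flags = 1,2,3
#   Type = Lowess
Length Event ID
1 Gain xxx
10 Loss yyy'

a=$(printf '%s\n' "$sample" | awk '/^#/ && !/Build|Type/{next} 1')  # skip unwanted lines, print the rest
b=$(printf '%s\n' "$sample" | awk '!(/^#/ && !/Build|Type/)')       # negated condition, default print
c=$(printf '%s\n' "$sample" | awk '!/^#/ || /Build|Type/')          # after applying De Morgan

[ "$a" = "$b" ] && [ "$b" = "$c" ] && echo "all three forms agree"
```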
If you want to remove those initial # characters and the spaces after them:
$ awk '!/^#/ || /Build|Type/ { sub("^#[[:blank:]]*", ""); print }' file
Build = NCBI Build 37
Type = Lowess
Length Event ID
1 Gain xxx
10 Loss yyy
Following awk may help you on same too.
awk '!(/^#/ && !/Build/ && !/Type/){gsub(/^#|^# +/,"");print}' Input_file
Explanation:
awk '
!(/^#/ && !/Build/ && !/Type/){ ##Check whether the line starts with # and contains neither Build nor Type; the condition is negated, so this block runs for every other line.
gsub(/^#|^# +/,""); ##Use gsub to remove a leading hash, together with any spaces that follow it, from the current line.
print ##Printing the current line here.
}' Input_file ##Mentioning the Input_file name here.
A sed solution:
$ sed 's/^# *\(.*\(Build\|Type\).*\)/\1/;/^#/d' file
Build = NCBI Build 37
Type = Lowess
Length Event ID
1 Gain xxx
10 Loss yyy
awk '!/CN|Fl/{sub(/\43/,"")sub(/^\s*/,"");print}' file
Build = NCBI Build 37
Type = Lowess
Length Event ID
1 Gain xxx
10 Loss yyy
How to print all lines if certain condition matches.
Example:
echo "$ip"
this is a sample line
another line
one more
last one
If this file has more than 3 lines then print the whole variable.
I tried:
echo $ip| awk 'NR==4'
last one
echo $ip|awk 'NR>3{print}'
last one
echo $ip|awk 'NR==12{} {print}'
this is a sample line
another line
one more
last one
echo $ip| awk 'END{x=NR} x>4{print}'
Need to achieve this:
If this file has more than 3 lines then print the whole file. I can do this using wc and bash but need a one liner.
The right way to do this (no echo, no pipe, no loops, etc.):
$ awk -v ip="$ip" 'BEGIN{if (gsub(RS,"&",ip)>2) print ip}'
this is a sample line
another line
one more
last one
You can use Awk as follows,
echo "$ip" | awk '{a[$0]; next}END{ if (NR>3) { for(i in a) print i }}'
one more
another line
this is a sample line
last one
you can also make the value 3 configurable from an awk variable,
echo "$ip" | awk -v count=3 '{a[$0]; next}END{ if (NR>count) { for(i in a) print i }}'
The idea is to store each line in the array via {a[$0]; next} as it is read. By the time the END block runs, NR holds the number of lines in the string or file, and the stored lines are printed only if that count is greater than 3 (or whatever value count is set to). Note that for (i in a) visits the array in an unspecified order, and that duplicate lines are stored only once because the line text is used as the key.
And always remember to double-quote variables in bash to prevent the shell from word-splitting them.
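The effect of quoting is easy to demonstrate. In this small sketch (the variable names unquoted and quoted are only for illustration) awk counts how many lines it receives in each case, assuming a POSIX shell or bash with the default IFS.

```shell
ip='this is a sample line
another line
one more
last one'

# Unquoted: the shell splits $ip into words and echo joins them with spaces,
# so awk sees a single line.
unquoted=$(echo $ip | awk 'END { print NR }')

# Quoted: the embedded newlines survive, so awk sees four lines.
quoted=$(echo "$ip" | awk 'END { print NR }')

echo "unquoted: $unquoted line(s), quoted: $quoted line(s)"
```

This is why the unquoted attempts in the question only ever saw one record.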
Using James Brown's useful comment below to preserve the order of lines, do
echo "$ip" | awk -v count=3 '{a[NR]=$0; next}END{if(NR>count)for(i=1;i<=NR;i++)print a[i]}'
this is a sample line
another line
one more
last one
Another in awk. First test files:
$ cat 3
1
2
3
$ cat 4
1
2
3
4
Code:
$ awk 'NR<4{b=b (NR==1?"":ORS)$0;next} b{print b;b=""}1' 3 # look ma, no lines
[this line left intentionally blank. no wait!]
$ awk 'NR<4{b=b (NR==1?"":ORS)$0;next} b{print b;b=""}1' 4
1
2
3
4
Explained:
NR<4 { # for the first 3 records
b=b (NR==1?"":ORS) $0 # buffer them to b with ORS delimiter
next # proceed to next record
}
b { # if buffer has records, ie. NR>=4
print b # output buffer
b="" # and reset it
}1 # print all records after that
I'm trying to manipulate a Fastq file.
It looks like this:
#HWUSI-EAS610:1:1:3:1131#0/1
GATGCTAAGCCCCTAAGGTCATAAGACTGNNANGTC
+
B<ABA<;B#=4A9#:6#96:1??9;>##########
#HWUSI-EAS610:1:1:3:888#0/1
GATAGGACCAAACATCTAACATCTTCCCGNNGNTTC
+
B9>>ABA#B7BB:7?#####################
#HWUSI-EAS610:1:1:4:941#0/1
GCTTAGGAAGGAAGGAAGGAAGGGGTGTTCTGTAGT
+
BBBB:CB=#CB#?BA/#BA;6>BBA8A6A<?A4?B=
...
...
...
#HWUSI-EAS610:1:1:7:1951#0/1
TGATAGATAAGTGCCTACCTGCTTACGTTACTCTCC
+
BB=A6A9>BBB9B;B:B?B#BA#AB#B:74:;8=>7
My expected output is:
#HWUSI-EAS610:1:1:3:1131#0/1
GACNTNNCAGTCTTATGACCTTAGGGGCTTAGCATC
#HWUSI-EAS610:1:1:3:888#0/1
GAANCNNCGGGAAGATGTTAGATGTTTGGTCCTATC
#HWUSI-EAS610:1:1:4:941#0/1
ACTACAGAACACCCCTTCCTTCCTTCCTTCCTAAGC
So, the ID lines are those starting with #HWUSI (e.g. #HWUSI-EAS610:1:1:7:1951#0/1). Each ID is followed by a line with its sequence.
Now, I would like to obtain a file with only each ID and its corresponding sequence, and the sequence should be reverse-complemented (A=T, T=A, C=G, G=C).
With Sed I can obtain all the sequence reverse and complementary with the command
sed -n '2~4p' MYFILE.fq | rev | tr ATCG TAGC
How can I obtain also the corresponding ID?
With sed:
sed -n '/#HWUSI/ { p; s/.*//; N; :a /\n$/! { s/\n\(.*\)\(.\)/\2\n\1/; ba }; y/ATCG/TAGC/; p }' filename
This works as follows:
/#HWUSI/ { # If a line contains #HWUSI (the pattern is not anchored)
p # print it
s/.*// # empty the pattern space
N # fetch the sequence line. It is now preceded
# by a newline in the pattern space. That is
# going to be our cursor
:a # jump label for looping
/\n$/! { # while the cursor has not arrived at the end
s/\n\(.*\)\(.\)/\2\n\1/ # move the last character before the cursor
ba # go back to a. This loop reverses the sequence
}
y/ATCG/TAGC/ # then complement it
p # and print it.
}
I intentionally left the newline in there for more readable spacing; if that is not desired, replace the last p with a P (upper case instead of lower case). Where p prints the whole pattern space, P only prints the stuff before the first newline.
$ sed -n '/^[^#]/y/ATCG/TAGC/;/^#/p;/^[ATCGN]*$/p' file
#HWUSI-EAS610:1:1:3:1131#0/1
CTACGATTCGGGGATTCCAGTATTCTGACNNTNCAG
#HWUSI-EAS610:1:1:3:888#0/1
CTATCCTGGTTTGTAGATTGTAGAAGGGCNNCNAAG
#HWUSI-EAS610:1:1:4:941#0/1
CGAATCCTTCCTTCCTTCCTTCCCCACAAGACATCA
#HWUSI-EAS610:1:1:7:1951#0/1
ACTATCTATTCACGGATGGACGAATGCAATGAGAGG
Explanation
/^[^#]/y/ATCG/TAGC/ # Translate bases on lines that don't start with an #
/^#/p # Print IDs
/^[ATCGN]*$/p # Print sequence lines
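Note that a plain y/ATCG/TAGC/ only complements the bases and does not reverse them. For comparison, here is a minimal awk sketch that prints each ID followed by the reversed and complemented sequence. It assumes strict 4-line records (ID, sequence, +, quality) as in the sample; the function name revcomp_ids is hypothetical.

```shell
# Hypothetical helper: print each ID line followed by the reverse complement
# of its sequence. Assumes every record is exactly 4 lines long.
revcomp_ids() {
    awk '
        BEGIN { comp["A"] = "T"; comp["T"] = "A"; comp["C"] = "G"; comp["G"] = "C" }
        NR % 4 == 1 { print; next }              # line 1 of each record: the ID
        NR % 4 == 2 {                            # line 2 of each record: the sequence
            out = ""
            for (i = length($0); i >= 1; i--) {  # walk the sequence backwards
                base = substr($0, i, 1)
                out = out ((base in comp) ? comp[base] : base)   # N and other symbols stay as they are
            }
            print out
        }
    ' "$@"
}

printf '%s\n' '#HWUSI-EAS610:1:1:3:1131#0/1' \
              'GATGCTAAGCCCCTAAGGTCATAAGACTGNNANGTC' \
              '+' \
              'B<ABA<;B#=4A9#:6#96:1??9;>##########' | revcomp_ids
```

For the first record of the sample this prints the ID followed by GACNTNNCAGTCTTATGACCTTAGGGGCTTAGCATC, which matches the expected output in the question.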