How to print lines between two patterns with optional end pattern - awk

I have gone through Stack Overflow and found these questions:
How to print lines between two patterns, inclusive or exclusive (in sed, AWK or Perl)?
Combine multiple lines between flags in one line in AWK
The problem in my case is that there can be another TAG1 without a matching TAG2, like this:
file.txt:
aa
TAG1
some right text
TAG2
some text2
TAG1
some text3
TAG1
some text4
TAG1
some right text 2
TAG2
some text4
TAG1
some text5
some text6
expected output:
TAG1
some right text
TAG2
TAG1
some right text 2
TAG2

One way is to reverse the input, get TAG2 to TAG1 and then reverse again:
$ tac ip.txt | sed -n '/TAG2/,/TAG1/p' | tac
TAG1
some right text
TAG2
TAG1
some right text 2
TAG2
Another way is to reset and start collecting lines once the first one is found and print only when the second one is found:
$ awk '/TAG1/{f=1; buf=$0; next}
f{buf=buf ORS $0}
/TAG2/{if(f) print buf; f=0}' ip.txt
TAG1
some right text
TAG2
TAG1
some right text 2
TAG2
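To see the reset behaviour in isolation, here is a minimal sketch of the same awk program run on a smaller, made-up input (the `orphan`, `kept` and `after` lines are illustrative, not from the question). The first TAG1 never meets a TAG2, so its buffer is discarded when the second TAG1 arrives.

```shell
# Hypothetical input: the first TAG1 has no matching TAG2, the second one does.
input=$(printf '%s\n' 'TAG1' 'orphan' 'TAG1' 'kept' 'TAG2' 'after')

result=$(printf '%s\n' "$input" | awk '
  /TAG1/ { f = 1; buf = $0; next }    # (re)start the buffer at every TAG1
  f      { buf = buf ORS $0 }         # while inside a block, keep appending
  /TAG2/ { if (f) print buf; f = 0 }  # flush only when the block is closed
')

printf '%s\n' "$result"
```

Only the TAG1 ... TAG2 block containing `kept` should be printed; `orphan` is dropped because its buffer was overwritten before any TAG2 was seen.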

Here is an example with GNU sed. Collect the data into pattern space and only print when matching TAG1/TAG2 found:
sed -nE ':a; /TAG1$/ s/.*(TAG1)/\1/; N; /TAG2$/ { /^TAG1/ { G; p; }; z; }; ba'
Or as a stand-alone script with explanation:
parse.sed
:a # main-loop
/TAG1$/ s/.*(TAG1)/\1/ # Ensure only one TAG1
N # Read next line
/TAG2$/ { # When TAG2 encountered
/^TAG1/ { G; p; } # Which started with a TAG1, print
z # Clear out pattern space
}
ba # Repeat main-loop
Run it like this:
sed -nEf parse.sed infile

Extract span between starttext and endtext in txt file, exactly n times

I want to extract only the n first occurrences of
starttext(
some text
endtext)
in text file F.
I've tried experimenting with sed:
sed -n '/starttext/,/endtext/p' inputfile
... this will give me ALL the ranges between starttext and endtext in inputfile. But I only want the first n ranges...
File F:
starttext(
sometext1
more text
endtext)
starttext(
sometext2
pineapple
endtext)
starttext(
sometext3
orange
banana
endtext)
starttext(
sometext4
some other text
endtext)
starttext(
sometext5
coconut
endtext)
starttext(
sometext6
endtext)
Fake command
sed -n '/starttext/,/endtext/p' (get the top 3 instances) inputfile
Expected output:
starttext(
sometext1
more text
endtext)
starttext(
sometext2
pineapple
endtext)
starttext(
sometext3
orange
banana
endtext)
I was asked to provide the following:
Also tell us if there can be nested ranges, or overlapping ranges, or starts with no end or ends with no start.
There is always an endtext after each starttext. Ranges are not nested or overlapping.
Also do you want to look for the start and end as regexps or strings?
Strings
And do you want to do a full-line match or partial line or something else?
I want the full line match from starttext to the full line of endtext with all the text (several lines) in between.
With awk:
awk -v n=3 'index($0, "starttext"){f=1; if(c++ == n) exit}
f; index($0, "endtext"){f=0}' ip.txt
If you don't have any lines outside of these two markers (based on given sample), you can also use:
awk -v n=3 'index($0, "starttext") && c++ == n{exit} 1' ip.txt
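The difference between the two commands only shows up when there are lines outside the markers. A small sketch, assuming a made-up input with one `stray` line between blocks (the input and the variable names are illustrative):

```shell
# Hypothetical input with a stray line between the first two blocks
input=$(printf '%s\n' 'starttext(' 'a' 'endtext)' 'stray' \
                      'starttext(' 'b' 'endtext)' 'starttext(' 'c' 'endtext)')

# Flag-based version: prints only the lines inside the first n blocks
with_flag=$(printf '%s\n' "$input" | awk -v n=2 '
  index($0, "starttext") { f = 1; if (c++ == n) exit }
  f
  index($0, "endtext")   { f = 0 }')

# Shortcut version: prints everything before the (n+1)th start marker
without_flag=$(printf '%s\n' "$input" | awk -v n=2 '
  index($0, "starttext") && c++ == n { exit } 1')

printf 'with flag:\n%s\n\nwithout flag:\n%s\n' "$with_flag" "$without_flag"
```

With the flag, `stray` is skipped; without it, `stray` is printed along with the blocks, which is why the shortcut is only safe when the file contains nothing but marked ranges.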
It looks like you want everything from the beginning of the file up to, and including, the nth endtext.
Not pretty but it gets the job done.
head -n "$(grep -nm "$n" endtext "$file" | tail -n 1 | grep -o '^[0-9]*')" "$file"
Where $n is the number of text spans you want and $file is the text file.
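Under the same assumption (everything from the top of the file through the nth endtext is wanted), the pipeline can probably be collapsed into a single awk call. This is a sketch with a made-up three-block file, not the answerer's own command:

```shell
# Hypothetical sample file with three blocks
file=$(mktemp)
printf '%s\n' 'starttext(' 'a' 'endtext)' \
              'starttext(' 'b' 'endtext)' \
              'starttext(' 'c' 'endtext)' > "$file"
n=2

# Print every line; stop right after the nth line containing "endtext"
result=$(awk -v n="$n" '{ print } /endtext/ && ++c == n { exit }' "$file")

rm -f "$file"
printf '%s\n' "$result"
```

Like the head/grep version, this prints any lines that sit between blocks, so it is only equivalent to a true range extraction when no such lines exist.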
$ awk -v n=3 '
$0=="starttext(" { f=1 }
f { print; if ($0=="endtext)") { f=0; if (++c==n) exit } }
' fileF
starttext(
sometext1
more text
endtext)
starttext(
sometext2
pineapple
endtext)
starttext(
sometext3
orange
banana
endtext)
If ed is available/acceptable, with brace expansion from the bash shell:
printf '%s\n' '/^starttext(/;/^endtext)/p'{,,} q | ed -s file.txt
Each comma , inside the braces produces one more copy of whatever is on the left side.
So the string itself counts as the first copy, and the two commas ,, inside the curly braces bring the total to 3 copies of the command '/^starttext(/;/^endtext)/p'.
The above brace expansion expands to:
printf '%s\n' '/^starttext(/;/^endtext)/p' '/^starttext(/;/^endtext)/p' '/^starttext(/;/^endtext)/p' q | ed -s file.txt
Or mapfile from the bash shell
mapfile -d')' -t array < file.txt
IFS=')'; printf '%s\n' "${array[*]:0:3})"
This might work for you (GNU sed):
sed -n '/starttext/,/endtext/{H;g;s/endtext/&/3;T;s/.//p;q}' file
Gather up the matching occurrences in the hold space.
If the end delimiter matches n times (in this case 3), remove the introduced newline, print the result and quit.

How can I remove a string after a specific character ONLY in a column/field in awk or bash?

I have a file with tab-delimited fields (or columns) like this one below:
cat abc_table.txt
a b c
1 11;qqw 213
2 22 222
3 333;rs2 83838
I would like to remove everything after the ";" on only the second field.
I have tried with
awk 'BEGIN{FS=OFS="\t"} NR>=1 && sub (/;[*]/,"",$2){print $0}' abc_table.txt
but it does not seem to work.
I also tried with sed:
sed 's/;.*//g' abc_table.txt
but it also erases the strings in the third field:
a b c
1 11
2 22 222
3 333
The desired output is:
a b c
1 11 213
2 22 222
3 333 83838
If someone could help me, I would be very grateful!
You need to simply correct your regex.
awk '{sub(/;.*/,"",$2)} 1' Input_file
If your Input_file is TAB-delimited, try:
awk 'BEGIN{FS=OFS="\t"} {sub(/;.*/,"",$2)} 1' Input_file
The problem with the OP's regex: ;[*] looks for a ; followed by a literal * character, which never occurs in the 2nd field, so nothing is substituted. Using ;.* instead matches everything from the first ; to the end of the 2nd field, and sub() then replaces that with an empty string.
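The difference between the two patterns can be checked on a single field value. A minimal sketch (the variable names and the `11;*qqw` test string are made up for illustration):

```shell
field='11;qqw'

# ;[*] is ";" followed by a bracket expression holding a literal "*": no match here
literal=$(printf '%s\n' "$field" | awk '{ sub(/;[*]/, ""); print }')

# ;.* is ";" followed by any run of characters up to the end of the string
greedy=$(printf '%s\n' "$field" | awk '{ sub(/;.*/, ""); print }')

# ;[*] only matches when a real "*" follows the ";"
star=$(printf '%s\n' '11;*qqw' | awk '{ sub(/;[*]/, ""); print }')

printf '%s\n' "$literal" "$greedy" "$star"
```

Note also that in the OP's command sub() sits in the pattern position, so a line is printed only when a substitution actually happened; that is why lines without a match would disappear from the output.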
An alternative solution using gnu sed:
sed -E 's/(^[^\t]*\t+[^;\t]*);[^\t]*/\1/' file
a b c
1 11 213
2 22 222
3 333 83838
This might work for you (GNU sed):
sed 's/[^\t]*/&\n/2;s/;[^\t]*\n//;s/\n//' file
Append a unique marker e.g. newline, to the end of field 2.
Remove everything from the first ; up to that newline, provided no tab intervenes (so the match stays within field 2).
Remove the newline if any.
N.B. This method can be extended for selective or all fields e.g. same removal but for the first and third fields:
sed 's/[^\t]*/&\n/1;s//&\n/3;s/;[^\t]*\n//g;s/\n//g' file

How to duplicate every word in the first line of a file

How can I duplicate every word in the header of a file?
I have a dataframe looking like this:
ID sample1 sample2 ...
123 1 0 1 2 ...
...
I want to duplicate every column header in the file such that after splitting the data at the space, each of them will have a header.
Desired output:
ID sample1 sample1 sample2 sample2 ...
123 1 0 1 2 ...
...
I tried to use sed:
sed -e '1s/*./& &/g' file.in
but it only appends the duplicated content at the end of the line.
Thanks
Another option with awk is to simply use string concatenation to duplicate each field from 2 on. For example using a 3-space separator (and your input file with the ellipses in place), you could do:
$ awk 'FNR == 1 { for (i = 2; i <= NF; i++) $i = " " $i " " $i }1' file
ID sample1 sample1 sample2 sample2 ... ...
123 1 0 1 2 ...
...
The essential part of the expression is simply setting $i = " " $i " " $i to duplicate the field.
Using sed with extended regular expressions, you could do:
sed -r '1 s/\s+\w+/& &/g' file
ID sample1 sample1 sample2 sample2 ...
123 1 0 1 2 ...
...
Here the substitution is limited to line 1: it matches one or more separator characters \s+ followed by one or more word characters \w+ and replaces the match with itself twice, & &.
You can do the same thing a bit more crudely with basic regular expressions using:
sed '1 s/[ \t][ \t]*[^ \t][^ \t]*/& &/g' file
Where you match one or more spaces or tabs followed by one or more not-spaces or not-tabs. (same output, but it also duplicates the ellipses in the first line)
Something like this:
awk 'NR==1 {printf "%s ",$1;for (i=2; i<=NF; i++) printf "%s %s ", $i,$i;print "";next}1' file
ID sample1 sample1 sample2 sample2 ... ...
123 1 0 1 2 ...
...
In line #1, it duplicates every word, except the first.
Using TAB as separator
awk 'NR==1 {printf "%s\t",$1;for (i=2; i<=NF; i++) printf "%s\t%s\t", $i,$i;print "";next} {$1=$1} 1' OFS="\t" file
ID sample1 sample1 sample2 sample2 ... ...
123 1 0 1 2 ...
...
This might work for you (GNU sed):
sed -E 's/\s{2,}/\t/g;1h;1d;2{H;s/\t/& /g;G;s/^\S+([^\n]*\n)(\S+)/\2\1/;:a;s/\t \S+([^\n]*\n(\t\S+))/\2\t\1/;s/\t(\t[^\n]*\n)\t\S+/\1/;ta;s/\t\n\t\S+//};y/ /\t/' file
Replace every run of 2 or more consecutive spaces by a tab. Copy the header to the hold space and delete it. Append the second line to the hold space and add a space after each tab in the second line. Append the first and second lines to the second line. The first line in the pattern space is used as a template for the headings. The first column is special (ID) and is copied non-iteratively. All other headings are replaced iteratively until there are no further headings. The last tab of the first line and the remainder of the second line (the last column of the headings) are removed. All subsequent spaces are replaced by tabs.
N.B. All columns will be tab delimited, if space delimited is preferred, replace the last command by y/\t/ /.
I assume you actually meant '1s/.*/& &/g' rather than '1s/*./& &/g'?
In that case, remember that * is a greedy quantifier, so .* will match the whole line and the line is duplicated as a single unit. You want to match each word on the line:
sed -e '1s/\w\+/& &/g'
Looking at the example, it seems that we don't want the first word (ID) to be doubled like the rest - only the words with preceding whitespace:
sed -e '1s/ \+\w\+/&&/g'
Output:
ID sample1 sample1 sample2 sample2 ...
123 1 0 1 2 ...

How to awk pattern over two consecutive lines?

I am trying to do something which I guess could be done very easily, but I can't seem to find the answer. I want to use awk to pick out lines between two patterns, but I also want the pattern to match two consecutive lines. I have tried to find the solution on the Internet, but perhaps I did not search for the right keywords. An example would better describe this.
Suppose I have the following file called test:
aaaa
bbbb
SOME CONTENT 1
ddddd
fffff
aaaa
cccc
SOME CONTENT 2
ccccc
fffff
For example, let's say I would like to find "SOME CONTENT 1".
Then I would use awk like this:
cat test | awk '/aaa*/ { show=1} show; /fff*/ {show=0}'
But that is not what I want. I want somehow to enter the pattern:
aaaa*\nbbbb*
And the same for the end pattern. Any suggestions how to do this?
You can use this:
awk '/aaa*/ {f=1} /bbb*/ && f {show=1} show; /fff*/ {show=f=0}' file
bbbb
SOME CONTENT 1
ddddd
fffff
If pattern1 is aaa* then set flag f
If pattern2 is bbb* and flag f is true, then set the show flag
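The flag f above stays set until the end pattern is seen, so the bbb* line does not have to come directly after the aaa* line. If strictly consecutive lines are required, one possible sketch is to remember the previous line instead. The `prev` variable and the anchored patterns `^aaa`, `^bbb`, `^fff` are assumptions made for this illustration, not part of the answer above:

```shell
# The sample file from the question, supplied on stdin
input=$(printf '%s\n' aaaa bbbb 'SOME CONTENT 1' ddddd fffff \
                      aaaa cccc 'SOME CONTENT 2' ccccc fffff)

result=$(printf '%s\n' "$input" | awk '
  prev ~ /^aaa/ && /^bbb/ { show = 1 }  # start only if bbb* directly follows aaa*
  show                                  # print while inside the block
  /^fff/ { show = 0 }                   # stop at the end pattern
  { prev = $0 }                         # remember this line for the next one
')

printf '%s\n' "$result"
```

On the sample data this prints the same four lines as the command above; the two versions would differ only if some other line appeared between aaaa and bbbb.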
If you also need to print pattern1 (the aaa* line):
awk '/aaa*/ {f=$0} /bbb*/ && f {show=1;$0=f RS $0} show; /fff*/ {show=f=0}' file
aaaa
bbbb
SOME CONTENT 1
ddddd
fffff
If every record ends with fffff, and GNU awk is available, you could do something like this:
$ awk '/aaa*\nbbbb*/' RS='fffff' file
aaaa
bbbb
SOME CONTENT 1
ddddd
Or if you want just SOME CONTENT 1 to be visible, you can do:
$ awk -F $'\n' '/aaa*\nbbbb*/{print $3}' RS='fffff' file
SOME CONTENT 1
I searched for the two patterns and checked that they were on consecutive lines by comparing their line numbers. Having the line numbers also lets sed insert a line between them, just after the first matching line.
awk '$0 ~ "Encryption" {print NR} $0 ~ "Bit Rates:1" {print NR}' /tmp/mainscan | while read line1; do read line2; echo "$(($line2 - 1)) $line1"; done > /tmp/this
while read line
do
    pato=$(echo $line | cut -f1 -d' ')
    patt=$(echo $line | cut -f2 -d' ')
    if [[ "$pato" = "$patt" ]]; then
        inspat=$((patt + 1))
        sed -i "${inspat}iESSID:##" /tmp/mainscan
        sed -i 's/##/""/g' /tmp/mainscan
    fi
done < /tmp/this

How to replace the lower case letters to upper case letter 'C' using awk?

I have a text file with protein sequences. I would like to replace all the lowercase letters with the upper case letter 'C'. How can I do this with awk?
>1CHE
aHKLbMaHc
>2HV3
PNMRrYnf
>5GH3
LKDeVmqQ
desired output
>1CHE
CHKLCMCHC
>2HV3
PNMRCYCC
>5GH3
LKDCVCCQ
echo 'changecase' | tr '[:lower:]' C
I would use sed for this:
sed '/^>/!s/[a-z]/C/g' file.txt
If you'd like the awk, here it is:
awk '!/^>/ { gsub(/[a-z]/, "C") }1' file.txt
Results:
>1CHE
CHKLCMCHC
>2HV3
PNMRCYCC
>5GH3
LKDCVCCQ
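One caveat, offered as a hedge rather than a correction: in some locales and with some older tool versions, a range such as [a-z] may not be limited to ASCII lowercase letters. A minimal sketch that pins the C locale for the awk call (an assumption about the environment, not something the question requires):

```shell
# First two records of the sample data from the question
input=$(printf '%s\n' '>1CHE' 'aHKLbMaHc' '>2HV3' 'PNMRrYnf')

# LC_ALL=C makes [a-z] mean exactly the 26 ASCII lowercase letters
result=$(printf '%s\n' "$input" | LC_ALL=C awk '!/^>/ { gsub(/[a-z]/, "C") } 1')

printf '%s\n' "$result"
```

Where the awk in use supports POSIX character classes, [[:lower:]] is another way to express the same intent.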