Extract lines that follow a pattern - awk

I'm looking to extract every line that begins with Start and whose next line ends with ***.
Appreciate any help.
Example:
*********************************************
Start the extract for customer_id [XXXX-2359]
*********************************************
Start the extract for customer_id [XXXX-2987]
Available
Printing records
Moving to output file
*********************************************
Start the extract for customer_id [XXXX-1539]
*********************************************
Start the extract for customer_id [XXXX-4527]
Available
Printing records
Moving to output file
*********************************************
Desired Output:
Start the extract for customer_id [XXXX-2359]
Start the extract for customer_id [XXXX-1539]
I tried:
awk '/Start*/ {p=1;print;next} /$**/ && p {p=0;print} p' test

Could you please try the following, written and tested with the shown samples? It uses a tac + awk approach.
tac Input_file |
awk '
/^\*/{
count=""
found=1
next
}
found && ++count==1 && /^Start/{
print
found=""
}
' | tac
Explanation: Adding detailed explanation for above.
tac Input_file | ##Using tac with Input_file to print its contents in reverse order and send them to the awk command.
awk ' ##Starting awk program here which reads tac output as an Input here.
/^\*/{ ##Checking condition if line starts from * then do following.
count="" ##Nullifying count here.
found=1 ##Setting found as 1 here.
next ##next will skip all further statements from here.
}
found && ++count==1 && /^Start/{ ##Checking if found is SET and count is 1 and line starts with Start then do following.
print ##Printing current line here.
found="" ##Nullifying found here.
}
' | tac ##Sending awk program output as an input to tac to get output in exact order.

$ awk '(index($0,"***")==1) && (p1=="Start"){print p0} {p1=$1; p0=$0}' file
Start the extract for customer_id [XXXX-2359]
Start the extract for customer_id [XXXX-1539]
For every line after the first, p1 and p0 contain the values of $1 and $0 from the previous line read. So when the current line starts with 3 *s and the $1 from the previous line (p1) was Start then it prints the $0 from the previous line (p0).
With respect to the regexps in your question:
Start* means Star followed by t repeated 0 or more times.
$** contains back-to-back regexp repetition characters (*) and so is undefined behavior per POSIX and so any tool can do whatever it likes with it. Some will report it, some will silently ignore one of the *s, others could do anything else. The $ at the start is an end-of-string indication which matches the end of the current input so having any *s after it doesn't make sense but AFAIK it's not technically invalid.
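To make the difference between Start* and ^Start concrete, here is a small sketch. The sample lines and the function names are made up purely for illustration; only POSIX awk features are used.

```shell
# Hypothetical sample lines, chosen only to illustrate the two regexps.
sample() {
  printf '%s\n' 'Start here' 'Starfish' 'A Star' 'star'
}

# /Start*/ means "Star" followed by zero or more "t", matched anywhere in the line.
loose_match() { sample | awk '/Start*/'; }

# /^Start/ means the literal string "Start" anchored at the beginning of the line.
anchored_match() { sample | awk '/^Start/'; }

echo "Start* selects:"
loose_match
echo "^Start selects:"
anchored_match
```

The unanchored pattern also selects Starfish and A Star, which is why the original attempt printed more than intended.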

This awk should work:
awk 'p != "" && /^\*{3}/ {print p} {p = ($1 == "Start" ? $0 : "")}' file
Start the extract for customer_id [XXXX-2359]
Start the extract for customer_id [XXXX-1539]

sed -n '/^Start/!d;N;/\*\{3\}$/P;D' file
pcregrep -Mo1 '^(Start.*)\n.*\*{3}$' file
Start* would match Star or Startttt anywhere in a string. The regex to match Start at the beginning of a string is ^Start.
$** is not a useful RE - * should not be repeated like that. POSIX leaves the behaviour undefined, with GNU sed reporting it as Invalid preceding regular expression. \*\{3\}$ (or equivalent) will match a string ending with three asterisks.

I would do it following way using GNU AWK, let file.txt content be
*********************************************
Start the extract for customer_id [XXXX-2359]
*********************************************
Start the extract for customer_id [XXXX-2987]
Available
Printing records
Moving to output file
*********************************************
Start the extract for customer_id [XXXX-1539]
*********************************************
Start the extract for customer_id [XXXX-4527]
Available
Printing records
Moving to output file
*********************************************
then
awk 'BEGIN{RS="\n?[*]+\n";FS="\n"}($NF~/^Start/)' file.txt
output
Start the extract for customer_id [XXXX-2359]
Start the extract for customer_id [XXXX-1539]
Explanation: I instructed AWK to treat the content between lines of * as rows and the lines within that content as fields. To get this effect I set the row separator (RS) to one or more * followed by a newline (\n), with an optional \n at the beginning (without that, the first ********************************************* would not be detected correctly), and FS to newline (\n). Then I check whether the last field, i.e. the last line, starts with Start. DISCLAIMER: this solution assumes that the last line of your file always matches [*]+; it might fail if it does not.
(tested in GNU Awk 5.0.1)

This might work for you (GNU sed):
sed -n '/^\*\*\*/{g;/^Start/p};h' file
Turn off implicit printing.
If the current line begins ***, replace it by the previous line stored in the hold space and if that line begins Start print that line.
In all cases, store the current line in the hold space.
Another couple of solutions:
sed -n 'N;/^Start.*\n\*\*\*/P;D' file
or:
sed -n 'N;/^Start/{/^\*\*\*/MP};D' file
Both open a 2 line window and print the first of those lines if it begins Start and the second begins ***.
The second solution uses the M flag to match the *** at the start of a line; as there are only 2 lines in the window and the first begins with Start, the criteria for both matches are met.

Related

Search through a markdown using sed

The Problem
I have multiple Markdown files in a folder, formatted like so...
# Cool Project
* Random Text
* Other information
TODO: This is a task
TODO: This is another task
And I've written a script that pulls out all the strings that start with TODO from all the files...
ag TODO: ~/myfolder/journal | sed 's/\(^.*:\)\(.*\)/TODO:\2 /g' | sed ''/TODO:/s//`printf "\033[35mTODO:\033[0m"`/'' | sed ''s/![a-zA-Z0-9]*/$(printf "\033[31;1m&\033[0m")/''
and this gives me an output like this
TODO: This is a task
TODO: This is another task
I was wondering if it would be possible to look backward from the pattern using sed to identify and pick up the line that starts with /^# / and append it to the end of the string, something like this:
TODO: This is a task # Cool Project
TODO: This is another task # Cool Project
Using sed:
sed -n '/^#/h;/^TODO/{G;s/\n/ /p}' file
Search for lines beginning with # and copy them to the hold space (h). Then, when a line begins with "TODO", append the hold space to the pattern space (G) and substitute the newline with a space.
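As a rough illustration of how the hold space follows the most recent heading, here is a sketch with a made-up two-project journal (the file content and the function name todo_demo are assumptions, not from the question):

```shell
# Hypothetical journal content with two headings, used only for illustration.
todo_demo() {
  printf '%s\n' \
    '# Cool Project' \
    '* Random Text' \
    'TODO: This is a task' \
    '# Other Project' \
    'TODO: Another task' |
  sed -n '
    # Remember the most recent heading in the hold space.
    /^#/h
    # On a TODO line, append the held heading and join the two lines.
    /^TODO/{
      G
      s/\n/ /p
    }'
}

todo_demo
```

Each TODO line is printed with the heading that most recently preceded it, because every new heading overwrites the hold space.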
You could do this in a single awk. With your shown samples, could you please try the following, written and tested with GNU awk?
awk '/^# /{val=$0;next} /^TODO/{print $0,val}' Input_file
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
/^# /{ ##Checking condition if line starts from hash space then do following.
val=$0 ##Creating val, which stores the current line (the heading).
next ##next will skip all statements from here.
}
/^TODO/{ ##Checking condition if line starts with TODO then do following.
print $0,val ##Printing current line and val here.
}
' Input_file ##Mentioning Input_file name here.

Replace a letter with another from the last word from the last two lines of a text file

How could I possibly replace a character with another, selecting the last word from the last two lines of a text file in shell, using only a single command? In my case, replacing every occurrence of a with E from the last word only.
Like, from a text file containing this:
tree;apple;another
mango.banana.half
monkey.shelf.karma
to this:
tree;apple;another
mango.banana.hElf
monkey.shelf.kErmE
I tried using sed -n 'tail -2 'mytext.txt' -r 's/[a]+/E/*$//' but it doesn't work (my error: sed expression #1, char 10: unknown option to 's).
Could you please try the following tac + awk solution? It is based entirely on the OP's samples.
tac Input_file |
awk 'FNR<=2{if(/;/){FS=OFS=";"};if(/\./){FS=OFS="."};gsub(/a/,"E",$NF)} 1' |
tac
Output with shown samples is:
tree;apple;another
mango.banana.hElf
monkey.shelf.kErmE
NOTE: Change gsub to sub in case you want to substitute only very first occurrence of character a in last field.
This might work for you (GNU sed):
sed -E 'N;${:a;s/a([^a.]*)$/E\1/mg;ta};P;D' file
Open a two line window throughout the length of the file by using the N to append the next line to the previous and the P and D commands to print then delete the first of these. Thus at the end of the file, signified by the $ address the last two lines will be present in the pattern space.
Using the m multiline flag on the substitution command, as well as the g global flag and a loop between :a and ta, replace any a in the last word (delimited by .) by an E.
Thus the first pass of the substitution command will replace the a in half and the last a in karma. The next pass will match nothing in the penultimate line and replace the a in karmE. The third pass will match nothing, so the ta command will not branch and the last two lines will be printed with the required changes.
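The passes can be traced by hand with a simplified sketch. It applies the same substitution to a single line at a time (so the M flag is not needed) and uses POSIX basic regexp syntax; the helper name one_pass is made up for the example.

```shell
# One pass of the substitution: replace an "a" that is followed only by
# characters other than "a" and "." up to the end of the line.
one_pass() { sed 's/a\([^a.]*\)$/E\1/'; }

first=$(printf '%s\n' 'monkey.shelf.karma' | one_pass)
second=$(printf '%s\n' "$first" | one_pass)
third=$(printf '%s\n' "$second" | one_pass)

# The third pass changes nothing, which is what makes the ta loop stop.
printf '%s\n' "$first" "$second" "$third"
```

Each pass can only change the last remaining a of the final word, so the loop needs as many passes as there are a's in that word, plus one that fails.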
If you want to use Sed, here's a solution:
tac input_file | sed -E '1,2{h;s/.*[^a-zA-Z]([a-zA-Z]+)/\1/;s/a/E/;x;s/(.*[^a-zA-Z]).*/\1/;G;s/\n//}' | tac
One tiny detail. In your question you say you want to replace a letter, but then you transform karma into kErmE, so which is it? If you meant to write kErma, then the command above will work; if you meant to write kErmE, then you have to change it just a bit: the s/a/E/ should become s/a/E/g.
With tac+perl
$ tac ip.txt | perl -pe 's/\w+\W*$/$&=~tr|a|E|r/e if $.<=2' | tac
tree;apple;another
mango.banana.hElf
monkey.shelf.kErmE
\w+\W*$ match last word in the line, \W* allows any possible trailing non-word characters to be matched as well. Change \w and \W accordingly if numbers and underscores shouldn't be considered as word characters - for ex: [a-zA-Z]+[^a-zA-Z]*$
$&=~tr|a|E|r change all a to E only for the matched portion
e flag to enable use of Perl code in replacement section instead of string
To do it in one command, you can slurp the entire input as single string (assuming this'll fit available memory):
perl -0777 -pe 's/\w+\W*$(?=(\n.*)?\n\z)/$&=~tr|a|E|r/gme'
Using GNU awk for the 4th arg to split(), since according to the comments on another solution the field delimiter is every sequence of characters that are neither alphabetic nor numeric:
$ gawk '
BEGIN {
pc=2 # previous counter, ie how many are affected
}
{
for(i=pc;i>=1;i--) # buffer to p hash, a FIFO
if(i==pc && (i in p)) # when full, output
print p[i]
else if(i in p) # and keep filling
p[i+1]=p[i] # above could be done using mod also
p[1]=$0
}
END {
for(i=pc;i>=1;i--) {
n=split(p[i],t,/[^a-zA-Z0-9\r]+/,seps) # split on non alnum
gsub(/a/,"E",t[n]) # replace
for(j=1;j<=n;j++) {
p[i]=(j==1?"":p[i] seps[j-1]) t[j] # pack it up
}
print p[i] # output
}
}' file
Output:
tree;apple;another
mango.banana.hElf
monkey.shelf.kErmE
Would this help you? It uses GNU awk:
$ cat file
tree;apple;another
mango.banana.half
monkey.shelf.karma
$ tac file | awk 'NR<=2{s=gensub(/(.*)([.;])(.*)$/,"\\3",1);gsub(/a/,"E",s); print gensub(/(.*)([.;])(.*)$/,"\\1\\2",1) s;next}1' | tac
tree;apple;another
mango.banana.hElf
monkey.shelf.kErmE
A more readable version:
$ tac file | awk 'NR<=2{
s=gensub(/(.*)([.;])(.*)$/,"\\3",1);
gsub(/a/,"E",s);
print gensub(/(.*)([.;])(.*)$/,"\\1\\2",1) s;
next
}1' | tac
With GNU awk you can set FS to match either of the two separators, then use gsub for the replacement in $3, the third field, if NR>1:
awk -v FS=";|[.]" 'NR>1 {gsub("a", "E",$3)}1' OFS="." file
tree;apple;another
mango.banana.hElf
monkey.shelf.kErmE
With GNU awk for the 3rd arg to match() and gensub():
$ awk -v n=2 '
NR>n { print p[NR%n] }
{ p[NR%n] = $0 }
END {
for (i=1; i<=n; i++) { # (NR+i)%n visits the buffered lines oldest first
match(p[(NR+i)%n],/(.*[^[:alnum:]])(.*)/,a)
print a[1] gensub(/a/,"E","g",a[2])
}
}
' file
tree;apple;another
mango.banana.hElf
monkey.shelf.kErmE
or with any awk:
awk -v n=2 '
NR>n { print p[NR%n] }
{ p[NR%n] = $0 }
END {
for (i=1; i<=n; i++) {
line = p[(NR+i)%n] # oldest buffered line first
match(line,/.*[^[:alnum:]]/)
lastWord = substr(line,1+RLENGTH)
gsub(/a/,"E",lastWord)
print substr(line,1,RLENGTH) lastWord
}
}
' file
If you want to do it for the last 50 lines of a file instead of the last 2 lines just change -v n=2 to -v n=50.
The above assumes there are at least n lines in your input.
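The NR%n buffering can be looked at in isolation. The sketch below only prints the last n lines, reading the buffer from the oldest slot to the newest; the function name last_n is made up, and the "in" test is an addition so that inputs shorter than n lines do not print empty slots.

```shell
# Print the last n lines of the input using an NR%n ring buffer.
# The function name last_n is hypothetical, chosen for this sketch.
last_n() {
  awk -v n="$1" '
    { p[NR % n] = $0 }                 # overwrite the oldest slot
    END {
      # (NR + i) % n walks the slots from oldest to newest.
      for (i = 1; i <= n; i++)
        if (((NR + i) % n) in p)       # skip slots that were never filled
          print p[(NR + i) % n]
    }'
}

printf '%s\n' one two three four five | last_n 2
```

Only n lines are held in memory at any time, which is the reason to prefer this over reading the whole file into an array.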
You can let sed repeat changing an a into E only for the last word with a label.
tac mytext.txt| sed -r ':a; 1,2s/a(\w*)$/E\1/; ta' | tac

use awk to split one file into several small files by pattern

I have read this post about using awk to split one file into several files:
and I am interested in one of the solutions provided by Pramod and jaypal singh:
awk '/^>chr/ {OUT=substr($0,2) ".fa"}; {print >> OUT; close(OUT)}' Input_File
Because I cannot add comments yet, I am asking here.
If the input is
>chr22
asdgasge
asegaseg
>chr1
aweharhaerh
agse
>chr14
gasegaseg
How come it will result in three files:
chr22.fasta
chr1.fasta
chr14.fasta
As an example, in chr22.fasta:
>chr22
asdgasge
asegaseg
I understand the first part
/^>chr/ {OUT=substr($0,2) ".fa"};
and these commands:
/^>chr/ substr() close() >>
But I don't understand how awk splits the input with the second part:
{print >> OUT; close(OUT)}
Could anyone explain more details about this command? Thanks a lot!
Could you please go through the following and let me know if this helps you?
awk ' ##Starting awk program here.
/^>chr/{ ##Checking condition if a line starts with the string >chr then do following.
OUT=substr($0,2) ".fa" ##Creating variable OUT whose value is the substring of the current line from the 2nd character to the end, with .fa concatenated to it.
}
{
print >> OUT ##Appending the current line to the file whose name is the value of variable OUT.
close(OUT) ##Using close to close the output file named by variable OUT. Basically this is to avoid a "too many open files" error.
}' Input_File ##Mentioning Input_file name here.
You could take reference from man awk page for used functions of awk too as follows.
substr(s, i [, n]) Returns the at most n-character substring of s starting at i. If n is omitted, the rest of s is used.
The part you are asking questions about is a bit uncomfortable:
{ print $0 >> OUT; close(OUT) }
With this part, the awk program does the following for every line it processes:
Open the file OUT
Move the file pointer to the end of the file OUT
append the line $0 followed by ORS to the file OUT
close the file OUT
Why is this uncomfortable? Mainly because of the structure of your files. You should only close the file when you have finished writing to it, not every time you write to it. Currently, if you have a fasta record of 100 lines, it will open and close the file 100 times.
A better approach would be:
awk '/^>chr/{close(OUT); OUT=substr($0,2)".fasta" }
{print > OUT }
END {close(OUT)}'
Here we only open the file the first time we write to it and we close it when we don't need it anymore.
note: the END statement is not really needed.
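To watch the splitting happen, here is a self-contained sketch that runs a close-on-new-header variant in a temporary directory. The input is a shortened, made-up sample; the function name split_demo, the file name input.txt and the extra OUT != "" guards (which skip any lines before the first header) are assumptions added for the example.

```shell
# Sketch: split a small FASTA-like input into one file per ">chr" header.
split_demo() {
  dir=$(mktemp -d) || return 1
  (
    cd "$dir" || exit 1
    printf '%s\n' '>chr22' 'asdgasge' 'asegaseg' '>chr1' 'aweharhaerh' > input.txt
    awk '
      /^>chr/ {
        if (OUT != "") close(OUT)     # finished with the previous record
        OUT = substr($0, 2) ".fasta"  # drop the leading ">" to build the name
      }
      OUT != "" { print > OUT }       # ">" truncates only on the first write
    ' input.txt
    for f in chr22.fasta chr1.fasta; do
      echo "== $f"
      cat "$f"
    done
  )
  rm -rf "$dir"
}

split_demo
```

The key point is that OUT keeps its value between input lines, so every line after a header goes to that header's file until the next header changes OUT.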

Using SED/AWK to replace letters after a certain position

I have a file with words (1 word per line). I need to censor all letters in the word, except the first five, with a *.
Ex.
Authority -> Autho****
I'm not very sure how to do this.
If you are lucky, all you need is
sed 's/./*/6g' file
When I originally posted this, I believed this to be reasonably portable; but as per @ghoti's comment, it is not.
Perl to the rescue:
perl -pe 'substr($_, 5) =~ s/./*/g' -- file
-p reads the input line by line and prints each line after processing
substr returns a substring of the given string starting at the given position.
s/./*/g replaces any character with an asterisk. The g means the substitution will happen as many times as possible, not just once, so all the characters will be replaced.
In some versions of sed, you can specify which substitution should happen by appending a number to the operation:
sed -e 's/./*/g6'
This will replace all (again, because of g) characters, starting from the 6th position.
Here's a portable solution for sed:
$ echo abcdefghi | sed -e 's/\(.\{5\}\)./\1*/;:x' -e 's/\*[a-z]/**/;t x'
abcde****
Here's how it works:
's/\(.\{5\}\)./\1*/' - preserve the first five characters, replacing the 6th with an asterisk.
':x' - set a "label", which we can branch back to later.
's/\*[a-z]/**/' - substitute the letter following an asterisk with an asterisk.
't x' - if the last substitution succeeded, jump back to the label "x".
This works equally well in GNU and BSD sed.
Of course, adjust the regexes to suit.
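As one possible adjustment, the sketch below swaps [a-z] for [^*] so that uppercase letters and digits after the fifth position are censored too. It assumes the input words contain no literal * of their own, and the function name censor is made up. The label and the branch are passed as separate -e arguments, which both GNU and BSD sed accept.

```shell
# Variation on the loop above: [^*] instead of [a-z].
# Assumption: the words being censored do not already contain "*".
censor() {
  sed -e 's/\(.\{5\}\)./\1*/' \
      -e ':x' \
      -e 's/\*[^*]/**/' \
      -e 't x'
}

printf '%s\n' 'Authority' 'ABCDEFGH1' 'abc' | censor
```

Words of five characters or fewer pass through unchanged, because the first substitution never matches and so no asterisk exists for the loop to extend.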
The following awk solutions may also help.
Solution 1st: awk solution with substr and gensub.
awk '{print substr($0,1,5) gensub(/./,"*","g",substr($0,6))}' Input_file
Solution 2nd:
awk 'NF{len=length($0);if(len>5){i=6;while(i<=len){val=val?val "*":"*";i++};print substr($0,1,5) val};val=i=""}' Input_file
Autho****
EDIT: Adding a non-one liner form of solution too now. Adding explanation with it too now.
awk '
NF{ ##Checking if a line is NON-empty.
len=length($0); ##Taking length of the current line into a variable called len here.
if(len>5){ ##Checking if length of current line is greater than 5 as per OP request. If yes then do following.
i=6; ##creating variable named i whose value is 6 here.
while(i<=len){ ##starting a while loop here which runs from the value of variable i up to the length of the current line.
val=val?val "*":"*"; ##creating variable named val here whose value will be concatenated to its own value, it will add * to its value each time.
i++ ##incrementing variable named i value with 1 each time.
};
print substr($0,1,5) val ##printing value of substring from 1st letter to 5th letter and then printing value of variable val here too.
};
val=i="" ##Nullifying values of variable val and i here too.
}
' Input_file ##Mentioning Input_file name here.
Personally I'd just use sed for this (see @triplee's answer) but if you want to do it in awk it'd be:
$ awk '{t=substr($0,1,5); gsub(/./,"*"); print t substr($0,6)}' file
Autho****
or with GNU awk for gensub():
$ awk '{print substr($0,1,5) gensub(/./,"*","g",substr($0,6))}' file
Autho****
It is also possible and quite straightforward with sed:
sed 's/./\*/6;:loop;s/\*[^\*]/\**/;/\*[^\*]/b loop' file_to_censor.txt
output:
explanation:
s/./\*/6 #replace the 6th character of the string with *
:loop #define a label for the branch
s/\*[^\*]/\**/ #replace a * followed by a non-* character with **
/\*[^\*]/b loop #then loop until no * followed by a non-* character remains
Here is a pretty straightforward sed solution (that does not require GNU sed):
sed -e :a -e 's/^\(.....\**\)[^*]/\1*/;ta' filename

print last two words of last line

I have a script which returns a few lines of output, and I am trying to print the last two words of the last line (irrespective of the number of lines in the output).
$ ./test.sh
service is running..
check are getting done
status is now open..
the test is passed
I tried running as below but it prints last word of each line.
$ ./test.sh | awk '{ print $NF }'
running..
done
open..
passed
How do I print the last two words "is passed" using awk or sed?
Just say:
awk 'END {print $(NF-1), $NF}'
"normal" awks store the last line (but not all of them!), so that it is still accessible by the time you reach the END block.
Then, it is a matter of printing the penultimate and the last one. This can be done using the NF-1 and NF trick.
For robustness if your last line can only contain 1 field and your awk doesn't retain the field values in the END section:
awk '{split($0,a)} END{print (NF>1?a[NF-1]OFS:"") a[NF]}'
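A slightly more defensive sketch of the same idea keeps the field count in its own variable as well, so the END block relies on neither $0 nor NF surviving. The function name last_two_words and the variable n are made up for the example.

```shell
# Save the fields and the field count of every line in our own variables,
# so the END block does not depend on awk preserving $0 or NF.
last_two_words() {
  awk '{ n = split($0, a) } END { print (n > 1 ? a[n-1] OFS : "") a[n] }'
}

printf '%s\n' 'status is now open..' 'the test is passed' | last_two_words
printf '%s\n' 'done' | last_two_words
```

With a one-word last line only that word is printed, instead of an error or a stray field from an earlier line.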
This might work for you (GNU sed):
sed '$s/.*\(\<..*\<.*\)/\1/p;d' file
This deletes all lines in the file but on the last line it replaces all words by the last two words and prints them if successful.