Search through Markdown files using sed and awk

The Problem
I have multiple Markdown files in a folder, formatted like so...
# Cool Project
* Random Text
* Other information
TODO: This is a task
TODO: This is another task
And I've written a script that pulls out all the strings that start with TODO from all the files...
ag TODO: ~/myfolder/journal | sed 's/\(^.*:\)\(.*\)/TODO:\2 /g' | sed ''/TODO:/s//`printf "\033[35mTODO:\033[0m"`/'' | sed ''s/![a-zA-Z0-9]*/$(printf "\033[31;1m&\033[0m")/''
and this gives me an output like this
TODO: This is a task
TODO: This is another task
I was wondering if it would be possible to look backward from the pattern using sed, identify the line that starts with /^# /, and append it to the end of the string... something like this
TODO: This is a task # Cool Project
TODO: This is another task # Cool Project

Using sed:
sed -n '/^#/h;/^TODO/{G;s/\n/ /p}' file
Search for lines beginning with # and copy them to the hold space (h). Then, when a line begins with "TODO", append the hold space to the pattern space (G) and substitute the embedded newline with a space before printing.
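As a quick end-to-end check of this approach (the file name journal.md and its contents are just a stand-in for the files described in the question):

```shell
# Sample file in the question's format (name and contents are illustrative).
cat > journal.md <<'EOF'
# Cool Project
* Random Text
* Other information
TODO: This is a task
TODO: This is another task
EOF

# h copies each heading into the hold space; on TODO lines, G appends
# the held heading and s/\n/ / joins the two lines before printing.
sed -n '/^#/h;/^TODO/{G;s/\n/ /p}' journal.md
```

This prints each TODO line with the most recent # heading appended, e.g. TODO: This is a task # Cool Project.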

You could do this in a single awk. With your shown samples, please try the following, written and tested with GNU awk.
awk '/^# /{val=$0;next} /^TODO/{print $0,val}' Input_file
Explanation: a detailed breakdown of the above.
awk ' ##Start the awk program.
/^# /{ ##If a line starts with a hash followed by a space, do the following.
val=$0 ##Store the current line in the variable val.
next ##Skip all further statements for this line.
}
/^TODO/{ ##If a line starts with TODO, do the following.
print $0,val ##Print the current line followed by val.
}
' Input_file ##The input file name.
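Since the original question runs over a whole folder of files, one detail worth noting: val should be reset at each file boundary so a heading from one file can't be attached to TODOs in the next. A minimal sketch (the glob and paths are assumptions about how the files are named):

```shell
# FNR==1 is true on the first line of every input file, so val is
# cleared before each file's own heading is seen (hypothetical paths).
awk 'FNR==1{val=""} /^# /{val=$0; next} /^TODO/{print $0, val}' ~/myfolder/journal/*.md
```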

Related

Extract lines that follow a pattern

I'm looking to extract all lines that begin with Start and whose next line ends with ***.
Appreciate any help.
Example:
*********************************************
Start the extract for customer_id [XXXX-2359]
*********************************************
Start the extract for customer_id [XXXX-2987]
Available
Printing records
Moving to output file
*********************************************
Start the extract for customer_id [XXXX-1539]
*********************************************
Start the extract for customer_id [XXXX-4527]
Available
Printing records
Moving to output file
*********************************************
Desired Output:
Start the extract for customer_id [XXXX-2359]
Start the extract for customer_id [XXXX-1539]
I tried:
awk '/Start*/ {p=1;print;next} /$**/ && p {p=0;print} p' test
Please try the following, written and tested with the shown samples, using a tac + awk approach.
tac Input_file |
awk '
/^\*/{
count=""
found=1
next
}
found && ++count==1 && /^Start/{
print
found=""
}
' | tac
Explanation: a detailed breakdown of the above.
tac Input_file | ##Use tac on Input_file to print its contents in reverse order and pipe them to the awk command.
awk ' ##Start the awk program, which reads the output of tac as its input.
/^\*/{ ##If a line starts with *, do the following.
count="" ##Reset count.
found=1 ##Set found to 1.
next ##Skip all further statements for this line.
}
found && ++count==1 && /^Start/{ ##If found is set, count is 1, and the line starts with Start, do the following.
print ##Print the current line.
found="" ##Reset found.
}
' | tac ##Pipe the awk output through tac again to restore the original order.
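A quick check of the whole pipeline on a reduced sample (fed through a pipe rather than a named Input_file; the separator lines are shortened to three stars, which still match /^\*/):

```shell
# In reverse order, the first line after a star line that starts with
# "Start" is kept; re-reversing restores the original order.
printf '%s\n' '***' 'Start A' '***' 'Start B' 'Available' '***' |
tac |
awk '/^\*/{count=""; found=1; next}
     found && ++count==1 && /^Start/{print; found=""}' |
tac
```

Only Start A is immediately followed by a star line, so it is the only line printed.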
$ awk '(index($0,"***")==1) && (p1=="Start"){print p0} {p1=$1; p0=$0}' file
Start the extract for customer_id [XXXX-2359]
Start the extract for customer_id [XXXX-1539]
For every line after the first, p1 and p0 hold the values of $1 and $0 from the previous line read. So when the current line starts with 3 *s and the previous line's $1 (p1) was Start, it prints the previous line's $0 (p0).
With respect to the regexps in your question:
Start* means Star followed by t repeated 0 or more times.
$** contains back-to-back regexp repetition characters (*), which is undefined behavior per POSIX, so any tool can do whatever it likes with it: some will report it, some will silently ignore one of the *s, and others could do anything else. The $ at the start is an end-of-string anchor that matches the end of the current input, so having any *s after it doesn't make sense, but AFAIK it's not technically invalid.
This awk should work:
awk 'p != "" && /^\*{3}/ {print p} {p = ($1 == "Start" ? $0 : "")}' file
Start the extract for customer_id [XXXX-2359]
Start the extract for customer_id [XXXX-1539]
sed -n '/^Start/!d;N;/\*\{3\}$/P;D' file
pcregrep -Mo1 '^(Start.*)\n.*\*{3}$' file
Start* would match Star or Startttt anywhere in a string. The regex to match Start at the beginning of a string is ^Start.
$** is not a useful RE - * should not be repeated like that. POSIX leaves the behaviour undefined, with GNU sed reporting it as Invalid preceding regular expression. \*\{3\}$ (or equivalent) will match a string ending with three asterisks.
I would do it the following way using GNU AWK. Let the file.txt content be
*********************************************
Start the extract for customer_id [XXXX-2359]
*********************************************
Start the extract for customer_id [XXXX-2987]
Available
Printing records
Moving to output file
*********************************************
Start the extract for customer_id [XXXX-1539]
*********************************************
Start the extract for customer_id [XXXX-4527]
Available
Printing records
Moving to output file
*********************************************
then
awk 'BEGIN{RS="\n?[*]+\n";FS="\n"}($NF~/^Start/)' file.txt
output
Start the extract for customer_id [XXXX-2359]
Start the extract for customer_id [XXXX-1539]
Explanation: I instructed AWK to treat the content between the lines of *s as records and the lines within each record as fields. To get that effect I set the record separator (RS) to one or more *s followed by a newline (\n), with an optional \n at the beginning (without it, the first ********************************************* line would not be detected correctly), and set FS to the newline (\n). Then I check whether the last field, i.e. the last line of the record, starts with Start. DISCLAIMER: this solution assumes that the last line of your file always matches [*]+; it might fail if it does not.
(tested in GNU Awk 5.0.1)
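The same one-liner can be checked on a reduced sample (shortened star lines, input via a pipe; this relies on GNU awk treating a multi-character RS as a regular expression):

```shell
# Records are the chunks between runs of stars; fields are lines.
# A record whose last line starts with "Start" is printed.
printf '%s\n' '***' 'Start A' '***' 'Start B' 'Available' '***' |
awk 'BEGIN{RS="\n?[*]+\n"; FS="\n"} $NF ~ /^Start/'
```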
This might work for you (GNU sed):
sed -n '/^\*\*\*/{g;/^Start/p};h' file
Turn off implicit printing.
If the current line begins with ***, replace it with the previous line stored in the hold space, and if that line begins with Start, print it.
In all cases, store the current line in the hold space.
Another couple of solutions:
sed -n 'N;/^Start.*\n\*\*\*/P;D' file
or:
sed -n 'N;/^Start/{/^\*\*\*/MP};D' file
Both open a 2 line window and print the first of those lines if it begins Start and the second begins ***.
The second solution uses the M flag so that the *** can be matched at the start of a line within the pattern space; as there are only 2 such lines and the first begins with Start, the criteria for both matches are met.

How to extract data in such a pattern using grep or awk?

I have multiple instances of the following pattern in my document:
Dipole Moment: [D]
X: 1.5279 Y: 0.1415 Z: 0.1694 Total: 1.5438
I want to extract the total dipole moment, so 1.5438. How can I pull this off?
When I throw in grep "Dipole Moment: [D]" filename, I don't get the line after. I am new to these command line interfaces. Any help you can provide would be greatly appreciated.
Please try the following, written and tested with the shown samples in GNU awk.
awk '/Dipole Moment: \[D\]/{found=1;next} found{print $NF;found=""}' Input_file
Explanation: a detailed breakdown of the above.
awk ' ##Start the awk program.
/Dipole Moment: \[D\]/{ ##If the line contains Dipole Moment: [D] (the [ and ] are escaped), do the following.
found=1 ##Set found to 1.
next ##Skip all further statements for this line.
}
found{ ##If found is set, do the following.
print $NF ##Print the last field of the current line.
found="" ##Reset found.
}
' Input_file ##The input file name.
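A quick check with the sample lines inlined (the spacing is approximate; only the last field on the line after the header matters here):

```shell
# The header line sets the flag; on the next line, $NF is the value
# after "Total:", i.e. the total dipole moment.
printf '%s\n' 'Dipole Moment: [D]' \
  '     X:     1.5279      Y:     0.1415      Z:     0.1694      Total:     1.5438' |
awk '/Dipole Moment: \[D\]/{found=1; next} found{print $NF; found=""}'
```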
Sed alternative:
sed -rn '/^Dipole/{n;s/(^[[:space:]]{5}.*[[:space:]]{5})(.*)(([[:space:]]{5}.*+[:][[:space:]]{5}.*){3})/\2/p}' file
Search for the line beginning with "Dipole", then read the next line (n). Split that line into three sections based on the regular expressions and substitute the whole line with the second section only, printing the result.
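Since the question asks about grep specifically: GNU grep's -A1 option prints one line of trailing context after each match, which addresses the "I don't get the line after" problem directly. A hedged sketch (file is a placeholder name):

```shell
# -A1 prints the matching line plus the line after it; the awk step
# then keeps only the last field of the Total line.
grep -A1 'Dipole Moment: \[D\]' file | awk '/Total:/{print $NF}'
```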

Delete third-to-last line of file using sed or awk

I have several text files with different numbers of lines, and I have to delete the third-to-last line in all of them. Here is a sample file:
bear
horse
window
potato
berry
cup
Expected result for this file:
bear
horse
window
berry
cup
Can we delete the third-to-last line of a file:
a. not based on any string/pattern,
b. based only on the condition that it has to be the third-to-last line?
My problem is how to index the file starting from its last line. I have tried this (from another SO question) for the second-to-last line:
sed -i 'N;$!P;D' output1.txt
With a tac + awk solution, please try the following. Just set awk's line variable to whichever line (counted from the bottom) you want to skip.
tac Input_file | awk -v line="3" 'line==FNR{next} 1' | tac
Explanation: tac reads the Input_file in reverse (from the last line to the first) and passes its output to the awk command, which checks whether the current line number equals line (the line we want to skip) and, if so, skips it; the trailing 1 prints all other lines.
2nd solution: with an awk + wc solution, kindly try the following.
awk -v lines="$(wc -l < Input_file)" -v skipLine="3" 'FNR!=(lines-skipLine+1)' Input_file
Explanation: the awk program creates a variable lines holding the total number of lines in Input_file, and a variable skipLine holding the line number (counted from the bottom) that we want to skip. The main program then prints every line whose number is not equal to lines-skipLine+1.
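A quick run of the 2nd solution against the sample data (sample.txt is just an illustrative name):

```shell
printf '%s\n' bear horse window potato berry cup > sample.txt
# lines=6, skipLine=3: every line except line 6-3+1 = 4 (potato) is printed.
awk -v lines="$(wc -l < sample.txt)" -v skipLine="3" 'FNR != (lines - skipLine + 1)' sample.txt
```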
3rd solution: a solution as per Ed sir's comment.
awk -v line=3 '{a[NR]=$0} END{for (i=1;i<=NR;i++) if (i != (NR-line+1)) print a[i]}' Input_file
Explanation: a detailed breakdown of the 3rd solution.
awk -v line=3 ' ##Start the awk program, setting the awk variable line to 3 (the line, counted from the bottom, that the OP wants to skip).
{
a[NR]=$0 ##Store the current line in array a, indexed by NR.
}
END{ ##Start the END block of the program.
for(i=1;i<=NR;i++){ ##Loop from 1 up to NR.
if(i != (NR-line+1)){ ##If i is not the line to skip (NR-line+1, i.e. the 3rd line from the bottom), do the following.
print a[i] ##Print a[i].
}
}
}
' Input_file ##The input file name.
With ed
ed -s ip.txt <<< $'$-2d\nw'
# thanks Shawn for a more portable solution
printf '%s\n' '$-2d' w | ed -s ip.txt
This does in-place editing. $ refers to the last line, and you can apply a negative offset to it, so $-2 refers to the third-to-last line. The w command then writes the changes.
See ed: Line addressing for more details.
This might work for you (GNU sed):
sed '1N;N;$!P;D' file
Open a window of 3 lines into the file, then print and delete the first line of the window until the end of the file is reached.
At the end of the file, do not print the first line of the window, i.e. the 3rd line from the end of the file. Instead, delete it and repeat the sed cycle. This tries to append a line after the end of the file, which causes sed to bail out, printing the remaining lines in the window.
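The window behaviour can be checked directly (GNU sed; the final N with no input left prints the remaining two lines and exits):

```shell
# Deletes "potato", the third-to-last line of the sample.
printf '%s\n' bear horse window potato berry cup | sed '1N;N;$!P;D'
```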
A generic solution for n lines back (where n is 2 or more lines from the end of the file) is:
sed ':a;N;s/[^\n]*/&/3;Ta;$!P;D' file
Of course you could use:
tac file | sed 3d | tac
But then you would be reading the file 3 times.
To delete the 3rd-to-last line of a file, you can use head and tail:
{ head -n -3 file; tail -2 file; }
In the case of a large input file, when performance matters, this is very fast because it doesn't read and write line by line. Also, do not modify the semicolons or the spaces next to the braces; see the notes on command grouping.
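Checking the grouping on the sample file (note that the negative count in head -n -3, meaning "all but the last 3 lines", is a GNU extension):

```shell
printf '%s\n' bear horse window potato berry cup > sample.txt
# head: everything except the last 3 lines; tail: the last 2 lines.
# Together they omit exactly the third-to-last line.
{ head -n -3 sample.txt; tail -n 2 sample.txt; }
```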
Or use sed with tac:
tac file | sed '3d' | tac
Or use awk with tac:
tac file | awk 'NR!=3' | tac

Understand the code of Split file to fasta

I understand the matching pattern, but how is the sequence read? The code matches only the pattern ">chr", so how do the sequence lines end up in the output file?
awk '/^>chr/ {OUT=substr($0,2) ".fa"}; {print >> OUT; close(OUT)}' Input_File
Please go through the following explanation once.
awk ' ##Start the awk program.
/^>chr/{ ##If a line starts with the string >chr, do the following.
OUT=substr($0,2) ".fa" ##Create the variable OUT: the current line from its 2nd character to the end, with .fa appended.
} ##Close the block for the ^>chr condition.
{
print >> OUT ##Append the current line to the file named by the value of OUT; this is what writes the sequence lines to the output file.
close(OUT) ##Keeping many files open would eventually raise a "too many open files" error, so close the file after each write to avoid that.
}
' Input_File ##The input file we are processing with awk.

use awk to split one file into several small files by pattern

I have read this post about using awk to split one file into several files:
and I am interested in one of the solutions provided by Pramod and jaypal singh:
awk '/^>chr/ {OUT=substr($0,2) ".fa"}; {print >> OUT; close(OUT)}' Input_File
Because I can not yet add comments, I am asking here.
If the input is
>chr22
asdgasge
asegaseg
>chr1
aweharhaerh
agse
>chr14
gasegaseg
Why does it result in three files:
chr22.fasta
chr1.fasta
chr14.fasta
As an example, in chr22.fasta:
>chr22
asdgasge
asegaseg
I understand the first part
/^>chr/ {OUT=substr($0,2) ".fa"};
and these commands:
/^>chr/ substr() close() >>
But I don't understand that how awk split the input by the second part:
{print >> OUT; close(OUT)}
Could anyone explain more details about this command? Thanks a lot!
Please go through the following and let me know if it helps.
awk ' ##Start the awk program.
/^>chr/{ ##If a line starts with the string >chr, do the following.
OUT=substr($0,2) ".fa" ##Create the variable OUT: the substring of the current line from the 2nd character to the end, with .fa appended.
}
{
print >> OUT ##Append the current line to the file whose name is the value of OUT.
close(OUT) ##Close the output file named by OUT; basically this avoids a "too many open files" error.
}' Input_File ##The input file name.
You can also refer to the man awk page for the awk functions used, as follows.
substr(s, i [, n]) Returns the at most n-character substring of s starting at i. If n is omitted, the rest of s is used.
The part you are asking about is a bit uncomfortable:
{ print $0 >> OUT; close(OUT) }
With this part, the awk program does the following for every line it processes:
Open the file OUT
Move the file pointer to the end of the file OUT
Append the line $0 followed by ORS to the file OUT
Close the file OUT
Why is this uncomfortable? Mainly because of the structure of your files. You should only close a file when you have finished writing to it, not after every single write. Currently, if you have a fasta record of 100 lines, it will open and close the file 100 times.
A better approach would be:
awk '/^>chr/{close(OUT); OUT=substr($0,2)".fasta" }
{print > OUT }
END {close(OUT)}'
Here we only open the file the first time we write to it and we close it when we don't need it anymore.
note: the END statement is not really needed.
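A quick run of the improved splitter on the sample records, working in a throwaway directory; the close() calls are guarded with if (OUT), since OUT is empty before the first header line is seen:

```shell
cd "$(mktemp -d)"
printf '%s\n' '>chr22' 'asdgasge' 'asegaseg' '>chr1' 'aweharhaerh' > input.fa
# Each ">chr..." header closes the previous output file and opens a new
# one; every line (header included) is written to the current file.
awk '/^>chr/{if (OUT) close(OUT); OUT=substr($0,2) ".fasta"}
     {print > OUT}
     END{if (OUT) close(OUT)}' input.fa
cat chr22.fasta
```

chr22.fasta then holds the >chr22 header plus its two sequence lines, and chr1.fasta holds the second record.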