Using awk how do I reprint a found pattern with a new line character? - awk

I have a text file in the format of:
aaa: bcd;bcd;bcddd;aaa:bcd;bcd;bcd;
Where "bcd" can be any length of any characters, excluding ; or :
What I want to do is print the text file in the format of:
aaa: bcd;bcd;bcddd;
aaa: bcd;bcd;bcd;
-etc-
My method of approach to this problem was to isolate a pattern of ";...:" and then reprint this pattern without the initial ;
I concluded I would have to use awk's 'gsub' to do this, but have no idea how to replicate the pattern nor how to print the pattern again with this added new line character 1 character into my pattern.
Is this possible?
If not, can you please direct me in a way of tackling it?

We can't quite be sure of the variability in the aaa or bcd parts; presumably, each one could be almost anything.
You should probably be looking for:
a series of one or more non-colon, non-semicolon characters followed by colon,
with one or more repeats of:
a series of one or more non-colon, non-semicolon characters followed by a semi-colon
That makes up the unit you want to match.
/[^:;]+:([^:;]+;)+/
With that, you can substitute what was found by the same followed by a newline, and then print the result. The only trick is avoiding superfluous newlines.
Example script:
{
echo "aaa: bcd;bcd;bcddd;aaa:bcd;bcd;bcd;"
echo "aaz: xcd;ycd;bczdd;baa:bed;bid;bud;"
} |
awk '{ gsub(/[^:;]+:([^:;]+;)+/, "&\n"); sub(/\n+$/, ""); print }'
Example output
aaa: bcd;bcd;bcddd;
aaa:bcd;bcd;bcd;
aaz: xcd;ycd;bczdd;
baa:bed;bid;bud;
Paraphrasing the question in a comment:
Why does the regular expression not include the characters before a colon (which is what it's intended to do, but I don't understand why)? I don't understand what "breaks" or ends the regex.
As I tried to explain at the top, you're looking for what we can call 'words', meaning sequences of characters that are neither a colon nor a semicolon. In the regex, that is [^:;]+, meaning one or more (+) of the negated character class — one or more non-colon, non-semicolon characters.
Let's pretend that spaces in a regex are not significant. We can space out the regex like this:
/ [^:;]+ : ( [^:;]+ ; ) + /
The slashes simply mark the ends, of course. The first cluster is a word; then there's a colon. Then there is a group enclosed in parentheses, tagged with a + at the end. That means that the contents of the group must occur at least once and may occur any number of times more than that. What's inside the group? Well, a word followed by a semicolon. It doesn't have to be the same word each time, but there does have to be a word there. If something can occur zero or more times, then you use a * in place of the +, of course.
The key to the regex stopping is that the aaa: in the middle of the first line does not consist of a word followed by a semicolon; it is a word followed by a colon. So, the regex has to stop before that because the aaa: doesn't match the group. The gsub() therefore finds the first sequence, and replaces that text with the same material and a newline (that's the "&\n", of course). It (gsub()) then resumes its search directly after the end of the replacement material, and — lo and behold — there is a word followed by a colon and some words followed by semicolons, so there's a second match to be replaced with its original material plus a newline.
awk strips the record separator when it reads a line, so $0 does not contain the newline at the end of the line. The gsub(), however, appends a newline after every match, including the last one on the line, and print then adds its own newline. Without the sub() to remove the trailing newline(s), that produced a blank line I didn't want after each input line, so I removed the extraneous newline(s).
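To see where each match starts and stops, the walk-through can be sketched with match(), which sets RSTART and RLENGTH for the leftmost, longest match. This is only an illustration of the matching behaviour, not the method used in the answer; the function name split_units is made up for the example.

```shell
# split_units is a hypothetical helper name used only for this illustration.
split_units() {
    awk '{
        s = $0
        # match() finds the leftmost, longest match and sets RSTART/RLENGTH.
        while (match(s, /[^:;]+:([^:;]+;)+/)) {
            print substr(s, RSTART, RLENGTH)    # one "word: word;word;" unit
            s = substr(s, RSTART + RLENGTH)     # resume right after the match
        }
    }'
}

printf '%s\n' 'aaa: bcd;bcd;bcddd;aaa:bcd;bcd;bcd;' | split_units
```

Because each unit is printed on its own, there is no trailing newline to clean up afterwards.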

This might work for you:
awk '{gsub(/[^;:]*:/,"\n&");sub(/^\n/,"");gsub(/: */,": ")}1' file
Prepend a newline (\n) to every run of characters other than ; and : that is followed by a :
Remove any newline prepended to the start of line.
Replace any : followed by none or many spaces with a : followed by a single space.
Print all lines.
Or this:
sed 's/;\([^;:]*:\) */;\n\1 /g' file

Not sure how to do it in awk, but with sed this does what I think you want:
$ nl='
'
$ sed "s/\([^;]*:\)/\\${nl}\1/g" input
The first command sets the shell variable $nl to the string containing a single new line. Some versions of sed allow you to use \n inside the replacement string, but not all allow that. This keeps any whitespace that appears after the final ; and puts it at the start of the line. To get rid of that, you can do
$ sed "s/\([^;]*:\)/\\${nl}\1/g; s/\n */\\$nl/g" input

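Ordinary awk gsub() and sub() don't let you refer to captured groups in the replacement string; only & (the whole match) is available. GNU awk ("gawk") supplies gensub(), which does: gensub(/(;) *([^;:]+:)/, "\\1\n\\2", "g"). Note that the backslashes must be doubled inside the string, and that gensub() returns the modified string rather than changing its target in place.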

Related

extracting a specific word between : using sed, awk or grep

I have a file that has the following contents and many more.
#set_property board_part my.biz:ab0860_1cf:part0:1.0 [current_project]
set_property board_part my.biz:ab0820_1ab:part0:1.0 [current_project]
My ideal output is as shown below (i.e., the text between the first ":" and the second ":").
ab0820_1ab
I generally use python and use regular expression along the lines of below to get the result.
\s*set_property board_part trenz.biz:([a-zA-Z_0-9]+)
I wish to know how can it be done quickly and in a more generic way using commandline tools (sed, awk).
With GNU sed you might do it the following way. Let the content of file.txt be
#set_property board_part my.biz:ab0860_1cf:part0:1.0 [current_project]
set_property board_part my.biz:ab0820_1ab:part0:1.0 [current_project]
garbage garbage garbage
then
sed -n '/^[[:space:]]*set_property board_part my\.biz/s/[^:]*:\([^:]*\):.*/\1/p' file.txt
gives output
ab0820_1ab
Explanation: -n turns off default printing. /^[[:space:]]*set_property board_part my\.biz/ is a so-called address: the command that follows is applied only to lines matching it, which here means lines that begin (after optional whitespace) with set_property, so the commented-out #set_property line is skipped. The command is a substitution (s) that uses a capturing group, denoted by \( and \). The regular expression reads as follows: zero or more non-: characters (everything before the 1st :), a :, zero or more non-: characters enclosed in the capturing group (everything between the 1st and 2nd :), another :, and zero or more of any character (everything after the 2nd :). The whole match is replaced by the content of the 1st (and in this case only) capturing group. The p flag at the end of the substitution prints the line if a replacement was made.
(tested in GNU sed 4.2.2)
Your example data has my.biz but your pattern tries to match trenz.biz
If GNU awk is available, you can use a capture group with match(); the captured text is then available in a[1]:
awk 'match($0, /^\s*set_property board_part \w+\.biz:(\w+)/, a) {print a[1]}' file
The pattern matches:
^ Start of string
\s* Match optional whitespace chars
set_property board_part Match literally
\w+\.biz: Match 1+ word chars followed by .biz (note to escape the dot to match it literally)
(\w+) Capture group 1, match 1+ word chars
Notes
If you just want to match trenz.biz then you can replace \w+\.biz with trenz\.biz
If the strings are not at the start of the line, you can replace ^ with \s to match a whitespace character instead
Output
ab0820_1ab
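Where gawk's three-argument match() is not available, a field-based sketch in plain awk may be enough: split on : and print the second field of the lines that start with set_property. It assumes the wanted value is always the text between the first and second colon; board_part_name is a made-up helper name.

```shell
# board_part_name is a hypothetical helper name; pass file names as arguments
# or feed the text on standard input.
board_part_name() {
    # -F: splits each line on colons, so $2 is the text between the
    # first and the second colon. Commented-out lines start with # and
    # therefore do not match the pattern.
    awk -F: '/^[[:space:]]*set_property board_part / { print $2 }' "$@"
}

printf '%s\n' \
    '#set_property board_part my.biz:ab0860_1cf:part0:1.0 [current_project]' \
    'set_property board_part my.biz:ab0820_1ab:part0:1.0 [current_project]' |
    board_part_name
```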

sed: cut out a substring following various patterns

There is a list of identifiers I want to modify:
3300000526.a:P_A23_Liq_2_FmtDRAFT_1000944_2,
200254578.a:CR_10_Liq_3_inCRDRAFT_100545_11,
3300000110.a:BSg2DRAFT_c10006505_1,
3300000062.a:IMNBL1DRAFT_c0010786_1,
3300000558.a:Draft_10335283_1
I want to remove everything from the first . up to (but not including) the first _ after DRAFT (case-insensitive), i.e.:
3300000526_1000944_2,
200254578_100545_11,
3300000110_c10006505_1,
3300000062_c0010786_1,
3300000558_10335283_1
I am using sed 's/.a.*[a-zA-Z0-9]DRAFT_.*[^_]_[a-zA-Z0-9]//g' but it ignores the first _ after DRAFT and does this:
3300000526_2,
200254578_11,
3300000110_1,
3300000062_1,
3300000558_1
P.S.
There can be various identifiers and I tried to show a little portion on their variance here, but they all keep same pattern.
I'd be grateful for corrections!
You could easily do this in awk. Please try the following, which is based only on the samples shown.
awk -F'[.]|DRAFT_' '{$2="_";sub(/ +_ +/,"_")} 1' Input_file
OR with GNU awk for handling case-insensitive try:
awk -v IGNORECASE="1" -F'[.]|DRAFT_' '{$2="_";sub(/ +_ +/,"_")} 1' Input_file
To handle case insensitive without ignorecase option try:
awk -F'[.]|[dD][rR][aA][fF][tT]_' '{$2="_";sub(/ +_ +/,"_")} 1' Input_file
Explanation: The field separator is set to either . or DRAFT_, as the OP needs. The main program sets the 2nd field to _, which rebuilds the line with spaces around it, and then replaces space-underscore-space with a plain _. Finally, the 1 prints the current line.
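To make the field splitting visible, here is a small sketch that prints each field produced by that separator. The helper name show_fields is invented for the illustration, and it uses the case-sensitive separator from the first command.

```shell
# show_fields is a hypothetical helper name used only for this illustration.
show_fields() {
    # Same separator as above: a literal dot, or the string DRAFT_.
    awk -F'[.]|DRAFT_' '{
        for (i = 1; i <= NF; i++)
            printf "$%d=[%s]\n", i, $i
    }'
}

printf '%s\n' '3300000110.a:BSg2DRAFT_c10006505_1,' | show_fields
```

The middle field is the part that gets overwritten with _ before the line is printed.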
A workable solution
You can use:
sed 's/[.].*[dD][rR][aA][fF][tT]_/_/' data
You could also use \. in place of [.] but I'm allergic to unnecessary backslashes — you might be too if you'd had to spend time fighting whether 8 or 16 consecutive backslashes was the correct way to write to document markup (using troff).
For your sample data, it produces:
3300000526_1000944_2,
200254578_100545_11,
3300000110_c10006505_1,
3300000062_c0010786_1,
3300000558_10335283_1
What went wrong
Your command is:
sed 's/.a.*[a-zA-Z0-9]DRAFT_.*[^_]_[a-zA-Z0-9]//g'
This matches:
any character (the leading .)
lower-case 'a'
any sequence of characters
an alphanumeric character
upper-case only DRAFT
underscore
any sequence of characters
a character that is not an underscore
underscore
an alphanumeric character
doing the match globally on each line
It then deletes all the matched material. You could rescue it by using:
sed 's/[.]a.*[a-zA-Z0-9]DRAFT\(_.*[^_]_[a-zA-Z0-9]\)/\1/'
This matches a dot rather than any character, and saves the material after DRAFT starting with the underscore (that's the \(…\)), replacing what was matched with what was saved (that's the \1). You can convert DRAFT to the case-insensitive pattern too, of course. However, the terms of the question refer to "from the first dot (.) up to (but not including) the underscore after a (case-insensitive) DRAFT", and detailing, saving, and replacing the material after the underscore is not necessary.
Laziness
I saved myself typing the elaborate case-insensitive string by using a program called mkpattern that (according to RCS) I created on 1990-08-23 (a while ago now). It's not rocket science. I use it every so often — I've actually used it a number of times in the last week, searching for some elusive documents on a system at work.
$ mkpattern DRAFT
[dD][rR][aA][fF][tT]
$
You might have to do that longhand in future.
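Try something like:
{mawk/mawk2/gawk} 'BEGIN { FS = "[\056].+[dD][rR][aA][fF][tT]_"; OFS = "_"; } (NF < 2) || ($1 = $1)'
It might not be the fastest, but it's a relatively clean approach. Octal \056 is the period, which is less ambiguous to the reader when the next item is a ".+". OFS is the underscore that joins the two remaining fields, and the bracketed letters make DRAFT case-insensitive. Note that .+ is greedy, so if DRAFT_ occurs more than once, everything up to the last occurrence is removed.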
This might work for you (GNU sed):
sed -nE 's/DRAFT[^_]*/\n/i;s/\..*\n//p' file
First turn on the -n and -E to turn off implicit printing and make regexp more easily seen.
Since we want the first occurrence of DRAFT we can not use a regexp that uses the .* idiom as this is greedy and may pass over it if there are two or more such occurrences. Thus we replace the DRAFT by a unique character which cannot occur in the line. The newline can only be introduced by the programmer and is the best choice.
Now we can use the .* idiom as the introduced newline can only exist if the previous substitution has matched successfully.
N.B. The i flag in the first substitution allows for any upper/lower case rendition of the string DRAFT, also the second substitution includes the p flag to print the successful substitution.
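The "first occurrence" requirement can also be sketched in plain awk without any regex greediness to worry about, by locating the pieces with index(). This is an alternative illustration, not the sed method above; strip_to_first_draft is a made-up name, and lines lacking a dot, a DRAFT or a following underscore are assumed to pass through unchanged.

```shell
# strip_to_first_draft is a hypothetical helper name used for illustration.
strip_to_first_draft() {
    awk '{
        dot  = index($0, ".")                    # first dot, 0 if none
        tail = substr($0, dot + 1)               # text after that dot
        d    = index(tolower(tail), "draft")     # first DRAFT, any case
        rest = substr(tail, d + 5)               # text after that DRAFT
        u    = index(rest, "_")                  # first underscore after it
        if (dot && d && u)
            print substr($0, 1, dot - 1) substr(rest, u)
        else
            print                                # leave other lines alone
    }'
}

printf '%s\n' \
    '3300000526.a:P_A23_Liq_2_FmtDRAFT_1000944_2,' \
    '3300000558.a:Draft_10335283_1' |
    strip_to_first_draft
```

With an identifier that contains DRAFT twice, this keeps everything after the first one, whereas a greedy .*DRAFT_ pattern would skip to the last.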

Remove pattern and everything before using AWK in fasta file

I searched a lot but could not find a solution to my issue. I have a file that looks like:
>HEADER1
AACTGGTTACGTGGTTCTCT
>HEADER2
GGTTTCTC
>HEADER3
CCAGGTTTCGAGGGGTTACGGGGTA
I want to remove GGTT pattern and everything before it. So basically there are several of these patterns in some of the lines so I want to remove all of them including everything before the pattern or among them.
The desired output should look like:
>HEADER1
CTCT
>HEADER2
TCTC
>HEADER3
ACGGGGTA
I tried suggested ways but could not be able to adjust it to my data.
Thank you in Advance for your help.
If it's not possible for your headers to include GGTT, I suppose the easiest would be:
$ sed 's/.*GGTT//' file
>HEADER1
CTCT
>HEADER2
TCTC
>HEADER3
ACGGGGTA
If your headers might contain GGTT, then awk would probably be better:
$ awk '!/^>/ {sub(/.*GGTT/, "")}1' file
>HEADER1
CTCT
>HEADER2
TCTC
>HEADER3
ACGGGGTA
In both cases, the .*GGTT is "greedy", so it doesn't matter if there are multiple instances of GGTT, it will always match up to and remove everything through the last occurrence.
In the awk version, the pattern !/^>/ makes sure that the substitution is only done on lines that do not start with >.
Note that in general, sequences in fasta format as shown in the question may span multiple lines (= they are often wrapped to 80 or 100 nucleotides per line). This answer handles such cases correctly as well, unlike some other answers in this thread.
Use these two Perl one-liners connected by a pipe. The first one-liner does all of the common reformatting of the fasta sequences that is necessary in this and similar cases. It removes newlines and whitespace in the sequence (which also unwraps the sequence), but does not change the sequence header lines. It also properly handles leading and trailing whitespace/newlines in the file. The second one-liner actually removes everything up to and including the last GGTT in the sequence, in a case-insensitive manner.
Note: If GGTT is at the end of the sequence, the output will be a header plus an empty sequence. See seq4 in the example below. This may cause issues with some bioinformatics tools used downstream.
# Create the input for testing:
cat > in.fa <<EOF
>seq1 with blanks
ACGT GGTT ACGT
>seq2 with newlines
ACGT
GGTT
ACGT
>seq3 without blanks or newlines
ACGTGGTTACGT
>seq4 everything should be deleted, with empty sequence in the output
ACGTGGTTACGTGGTT
>seq5 lowercase
acgtggttacgt
EOF
# Reformat to single-line fasta, then delete subsequences:
perl -ne 'chomp; if ( /^>/ ) { print "\n" if $n; print "$_\n"; $n++; } else { s/\s+//g; print; } END { print "\n"; }' in.fa | \
perl -pe 'next if /^>/; s/.*GGTT//i;' > out.fa
Output in file out.fa:
>seq1 with blanks
ACGT
>seq2 with newlines
ACGT
>seq3 without blanks or newlines
ACGT
>seq4 everything should be deleted, with empty sequence in the output
>seq5 lowercase
acgt
The Perl one-liners use these command line flags and constructs:
-e : Tells Perl to look for code in-line, instead of in a file.
-n : Loop over the input one line at a time, assigning it to $_ by default.
-p : Loop over the input one line at a time, assigning it to $_ by default. Add print $_ after each loop iteration.
chomp : Remove the input line separator (\n on *NIX).
if ( /^>/ ) : Test if the current line is a sequence header line.
$n : This variable is undefined (false) at the beginning, and true after seeing the first sequence header, in which case we print an extra newline. This newline goes at the end of each sequence, starting from the first sequence.
END { print "\n"; } : Print the final newline after the last sequence.
s/\s+//g; print; : If the current line is sequence (not header), remove all the whitespace and print without the terminal newline.
next if /^>/; : Skip the header lines.
s/.*GGTT//i; : Replace everything (.*) up to and including the last GGTT with nothing (= delete it). The /i modifier means case-insensitive match.
SEE ALSO:
perldoc perlrun: how to execute the Perl interpreter: command line switches
perldoc perlre: Perl regular expressions (regexes)
perldoc perlre: Perl regular expressions (regexes): Quantifiers; Character Classes and other Special Escapes; Assertions; Capture groups
Remove line breaks in a FASTA file
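For readers who would rather stay in awk, the same two steps (unwrap each sequence, then delete through the last GGTT) might be sketched as a single pass. This is an assumption-laden illustration rather than a drop-in replacement: trim_fasta and emit are made-up names, and any sequence text before the first header is silently dropped.

```shell
# trim_fasta and emit are hypothetical names used only for this sketch.
trim_fasta() {
    awk '
        # Print the sequence collected so far, minus everything up to
        # and including the last GGTT (matched in either case).
        function emit() {
            if (have) {
                sub(/.*[Gg][Gg][Tt][Tt]/, "", seq)
                print seq
            }
        }
        /^>/ { emit(); print; seq = ""; have = 1; next }   # header line
        { gsub(/[[:space:]]+/, ""); seq = seq $0 }          # unwrap sequence
        END { emit() }
    '
}

printf '%s\n' '>seq2 with newlines' 'ACGT' 'GGTT' 'ACGT' | trim_fasta
```

As with the Perl version, a sequence that ends in GGTT comes out as a header followed by an empty line.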

Print out a line from a .txt file that has a keyword with parentheses in it

I'm trying to print out lines of a bunch of output files that contain the characters "g(tot)" in them.
awk '/g(tot)/{print}' ./*/*.out
However, this is not printing anything, and it seems to be due to the parentheses around the "tot". How can I get around this?
( and ) are interpreted as special characters in a regular expression.
Escape ( and ) with a \:
awk '/g\(tot\)/{print}' ./*/*.out
( and ) are special characters in regular expressions; they are used for grouping:
(…)
Parentheses are used for grouping in regular expressions, as in arithmetic. They can be used to concatenate regular expressions containing the alternation operator, ‘|’. For example, ‘#(samp|code){[^}]+}’ matches both ‘#code{foo}’ and ‘#samp{bar}’. (These are Texinfo formatting control sequences. The ‘+’ is explained further on in this list.)
So /g(tot)/ actually matches gtot not g(tot).
You need to escape it, like this /g\(tot\)/.
Also you can remove the part {print}, it's implied after a condition, so that sums to:
awk '/g\(tot\)/' ./*/*.out
However, for this simple task, I would suggest using grep instead. In grep's default basic regular expressions, unescaped parentheses are literal:
grep 'g(tot)' ./*/*.out
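And you can use sed too. Like grep, sed uses basic regular expressions by default, where \( and \) would start a group, so the parentheses stay unescaped:
sed -n '/g(tot)/p' ./*/*.out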
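When the text to find is a fixed string rather than a pattern, the escaping question can be sidestepped entirely: awk's index() does a literal substring search (grep -F does the same for grep). A minimal sketch, with find_literal as a made-up helper name; note that awk -v still interprets backslash escapes, so a search string containing backslashes would need extra care.

```shell
# find_literal is a hypothetical helper name used only for this illustration.
find_literal() {
    # index() returns the position of the literal substring, or 0 if it is
    # absent, so no character in the needle is treated as a regex operator.
    awk -v needle="$1" 'index($0, needle)'
}

printf '%s\n' 'g(tot) = 1.5' 'gtot = 2.5' | find_literal 'g(tot)'
```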

To find ";" then delete spaces up to next character

I have many lines starting with ";", then 1 or more spaces, followed by some other character(s) on the same line. I need to remove the spaces following the ";" up to but not including the characters that follow.
I tried a few variations of the following code because it worked great on lines with empty spaces, but I am not very familiar with awk.
awk '{gsub(/^ +| +$/,"")}1' filea>fileb
Sample input:
; 4
; group 452
; ring
Output wanted:
;4
;group 452
;ring
To remove any white space after the first semicolon, try:
$ awk '{sub(/^;[[:blank:]]+/, ";")} 1' filea
;4
;group 452
;ring
The regex ^;[[:blank:]]+ matches the first semicolon and any blanks or tabs which follow it. The function sub replaces this with ;. Since this only occurs once on the line (at the beginning), there is no need for gsub.
[[:blank:]] is the POSIX character class for spaces and tabs, so tabs after the semicolon are removed as well.
awk '{sub(/^; +/,";")}1' file
;4
;group 452
;ring
sed would also do :
sed -E 's/^(;){1}([[:blank:]]+)/\1/' file
The parentheses act as capture groups, and a \number refers to the corresponding group.
In ^(;){1}([[:blank:]]+) we check for the start of line (^) and a ; that occurs one time ({1}), followed by one or more blank characters ([[:blank:]]+), and then replace the matched text with our first group.
This defines the field separator FS to be the start of the record followed by a ; and a set of spaces. It then redefines the output field separator OFS to be just a ;. The conversion from FS to OFS is done by reassigning $1 to itself.
awk 'BEGIN{FS="^; *";OFS=";"}{$1=$1}1'
or
awk -F'^; *' -vOFS=";" '{$1=$1}1'