extracting a specific word between : using sed, awk or grep - awk

I have a file that has the following contents and many more.
#set_property board_part my.biz:ab0860_1cf:part0:1.0 [current_project]
set_property board_part my.biz:ab0820_1ab:part0:1.0 [current_project]
My ideal output is as shown below (i.e., the text between the first ":" and the second ":").
ab0820_1ab
I generally use python and use regular expression along the lines of below to get the result.
\s*set_property board_part trenz.biz:([a-zA-Z_0-9]+)
I would like to know how this can be done quickly and in a more generic way using command-line tools (sed, awk).

You might use GNU sed in the following way. Let the content of file.txt be
#set_property board_part my.biz:ab0860_1cf:part0:1.0 [current_project]
set_property board_part my.biz:ab0820_1ab:part0:1.0 [current_project]
garbage garbage garbage
then
sed -n '/^[[:space:]]*set_property board_part my\.biz/ s/[^:]*:\([^:]*\):.*/\1/ p' file.txt
gives output
ab0820_1ab
Explanation: -n turns off default printing. /^[[:space:]]*set_property board_part my\.biz/ is a so-called address: the commands that follow are applied solely to lines matching it. The ^[[:space:]]* part allows optional leading whitespace but rejects the commented-out line that starts with #. The first command is a substitution (s) that uses a capturing group denoted by \( and \). The regular expression reads as follows: zero or more non-: characters (i.e. everything before the 1st :), a :, then zero or more non-: characters (i.e. everything between the 1st and 2nd :) enclosed in the capturing group, a :, and zero or more of any character (i.e. everything after the 2nd :). All of this is replaced by the content of the 1st (and in this case only) capturing group. After the substitution takes place, the p command prompts GNU sed to print the changed line.
(tested in GNU sed 4.2.2)

Your example data has my.biz but your pattern tries to match trenz.biz
If GNU awk is available, you can use a capture group and then print its value, which is available in a[1]
awk 'match($0, /^\s*set_property board_part \w+\.biz:(\w+)/, a) {print a[1]}' file
The pattern matches:
^ Start of string
\s* Match optional whitespace chars
set_property board_part Match literally
\w+\.biz: Match 1+ word chars followed by .biz (note to escape the dot to match it literally)
(\w+) Capture group 1, match 1+ word chars
Notes
If you just want to match trenz.biz then you can replace \w+\.biz with trenz\.biz
If the strings are not at the start of the string, you can change ^ for \s to match a whitespace char instead
Output
ab0820_1ab
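If GNU awk is not available, a rough portable alternative is to let awk split each line on the colon, so that the wanted text is simply field 2. This is a minimal sketch rather than a drop-in replacement: it assumes the lines of interest begin (after optional blanks) with set_property board_part, and the variable name result is only for illustration.

```shell
# Sample lines copied from the question; the commented-out line must be skipped.
result=$(printf '%s\n' \
  '#set_property board_part my.biz:ab0860_1cf:part0:1.0 [current_project]' \
  'set_property board_part my.biz:ab0820_1ab:part0:1.0 [current_project]' |
  awk -F: '/^[ \t]*set_property board_part / { print $2 }')   # $2 = text between 1st and 2nd ":"
printf '%s\n' "$result"
```

Because the field separator does the work, no capture group is needed, which keeps the command usable with awk implementations that lack the three-argument match().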

Related

How to extract (First match)text between two words

I have a file having the following structure
destination list
move from station d-435-435 to point place1
move from station d-435-435 to point place2
move from mainpoint
I want to extract the word "d-435-435"(Only the first match, this need not be same value always) in between the words "from station" and "to point"
How can I achieve this?
What I have tried so far?
id=$(sed 's/.*from station \(.*\) to.*/\1/' input.txt)
But this returns the following value: destination list d-435-435 move from mainpoint
1st solution: With your shown samples, please try the following GNU awk code. It uses awk's match function with the regex from station\s+\S+\s+to point to find the requested value, then removes from station\s+ and \s+to point from the matched text and prints what remains.
awk '
match($0,/from station\s+\S+\s+to point/){
val=substr($0,RSTART,RLENGTH)
gsub(/from station\s+|\s+to point/,"",val)
print val
exit
}
' Input_file
2nd solution: Using GNU grep, please try the following. The -o option prints only the matched portion and -P enables PCRE regexes. The pattern matches the string from station followed by space(s); \K then makes grep forget everything matched so far (since we don't need it in the output). Next it matches \S+ (non-space characters), followed by a positive lookahead for space(s) and to point, which checks that the text is present without printing it. The -m1 option stops after the first matching line.
grep -oP -m1 'from station\s+\K\S+(?=\s+to point)' Input_file
If GNU sed is available, how about:
id=$(sed -nE '0,/from station.*to/ s/.*from station (.*) to.*/\1/p' input.txt)
The -n option suppresses printing unless the substitution succeeds.
The condition 0,/pattern/ is a range address that acts like a flip-flop operator: it stops
matching after the line where the pattern first succeeds. The 0 address is a GNU sed extension that
allows the pattern to be tested against the 1st line as well.
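The effect of -n together with the p flag can be seen with the sample input from the question. The following is a small sketch using only portable sed features; the variable names input, all and first are illustrative and not part of any answer here.

```shell
input='destination list
move from station d-435-435 to point place1
move from station d-435-435 to point place2
move from mainpoint'

# Without -n, sed also prints every line the substitution did not touch.
all=$(printf '%s\n' "$input" | sed 's/.*from station \(.*\) to.*/\1/')

# With -n, the p flag prints only the substituted line, and q quits after it.
first=$(printf '%s\n' "$input" |
  sed -n '/from station .* to/{s/.*from station \(.*\) to.*/\1/p;q;}')

printf '%s\n' "$first"
```

Here all holds four lines (the two untouched lines plus the extracted value twice), which is why the original attempt returned more than expected, while first holds only the first extracted value.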
With awk you can test the fields before and after field $4, which holds d-435-435,
then print that field for the first match only and stop with exit after the print statement:
awk '$2=="from" && $3=="station" && $5=="to" && $6=="point" {print $4; exit}' file
d-435-435
or using GNU awk for the 3rd arg to match():
awk 'match($0,/from station\s+(.*)\s+to point/,a){print a[1];exit}' file
d-435-435
The regexp contains parentheses, so the integer-indexed array element a[1] contains the portion of the string between from station followed by space(s) (\s+) and space(s) (\s+) followed by to point.
This might work for you (GNU sed):
sed -nE '/.*station (\S+) to point.*/{s//\1/;H;x;/\n(\S+)\n.*\1/{s/\n\S+$//;x;d};x;p}' file
Turn off implicit printing and turn on extended regexps with the command line options -nE.
If a line matches the required criteria, extract the required string, append a copy to the hold space, check if the match has already been seen and if not print it. If the match has been seen, remove it from the hold space.
Otherwise, do not print anything.
This should work in any sed:
sed -e '/.*from station \([^ ]*\) to .*/!d' -e 's//\1/' -e q file

sed: cut out a substring following various patterns

There is a list of identifiers I want to modify:
3300000526.a:P_A23_Liq_2_FmtDRAFT_1000944_2,
200254578.a:CR_10_Liq_3_inCRDRAFT_100545_11,
3300000110.a:BSg2DRAFT_c10006505_1,
3300000062.a:IMNBL1DRAFT_c0010786_1,
3300000558.a:Draft_10335283_1
I want to remove everything from the first . up to (but not including) the first _ after DRAFT (case-insensitive), i.e.:
3300000526_1000944_2,
200254578_100545_11,
3300000110_c10006505_1,
3300000062_c0010786_1,
3300000558_10335283_1
I am using sed 's/.a.*[a-zA-Z0-9]DRAFT_.*[^_]_[a-zA-Z0-9]//g' but it ignores the first _ after DRAFT and does this:
3300000526_2,
200254578_11,
3300000110_1,
3300000062_1,
3300000558_1
P.S.
There can be various identifiers and I tried to show a little portion on their variance here, but they all keep same pattern.
I'd be grateful for corrections!
You could easily do this in awk. Please try the following, which is based on the shown samples only.
awk -F'[.]|DRAFT_' '{$2="_";sub(/ +_ +/,"_")} 1' Input_file
Or, with GNU awk, to handle case-insensitivity try:
awk -v IGNORECASE="1" -F'[.]|DRAFT_' '{$2="_";sub(/ +_ +/,"_")} 1' Input_file
To handle case-insensitivity without the IGNORECASE variable, try:
awk -F'[.]|[dD][rR][aA][fF][tT]_' '{$2="_";sub(/ +_ +/,"_")} 1' Input_file
Explanation: The field separator is set to either . or DRAFT_, as the OP needs. The main program sets the 2nd field to _, which rebuilds the line with spaces between the fields, and then replaces space(s), underscore, space(s) with a single _. Finally, the 1 prints the current line.
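To see why the sub() call is needed, it can help to look at what the field separator produces and what happens to the line once a field is assigned. This is only an illustrative sketch on one sample line; the variable names line, fields and rebuilt are made up for the example.

```shell
line='3300000526.a:P_A23_Liq_2_FmtDRAFT_1000944_2,'

# Show the field count and the three fields produced by the separator.
fields=$(printf '%s\n' "$line" |
  awk -F'[.]|DRAFT_' '{ printf "%d|%s|%s|%s\n", NF, $1, $2, $3 }')

# Assigning to $2 rebuilds $0 with the default output separator (a space).
rebuilt=$(printf '%s\n' "$line" | awk -F'[.]|DRAFT_' '{ $2 = "_"; print }')

printf '%s\n%s\n' "$fields" "$rebuilt"
```

The rebuilt line contains spaces around the underscore, which is exactly what the sub(/ +_ +/,"_") step then collapses.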
A workable solution
You can use:
sed 's/[.].*[dD][rR][aA][fF][tT]_/_/' data
You could also use \. in place of [.] but I'm allergic to unnecessary backslashes — you might be too if you'd had to spend time fighting whether 8 or 16 consecutive backslashes was the correct way to write to document markup (using troff).
For your sample data, it produces:
3300000526_1000944_2,
200254578_100545_11,
3300000110_c10006505_1,
3300000062_c0010786_1,
3300000558_10335283_1
What went wrong
Your command is:
sed 's/.a.*[a-zA-Z0-9]DRAFT_.*[^_]_[a-zA-Z0-9]//g'
This matches:
any character (the leading .)
lower-case 'a'
any sequence of characters
an alphanumeric character
upper-case only DRAFT
underscore
any sequence of characters
underscore
an alphanumeric character
doing the match globally on each line
It then deletes all the matched material. You could rescue it by using:
sed 's/[.]a.*[a-zA-Z0-9]DRAFT\(_.*[^_]_[a-zA-Z0-9]\)/\1/'
This matches a dot rather than any character, and saves the material after DRAFT starting with the underscore (that's the \(…\)), replacing what was matched with what was saved (that's the \1). You can convert DRAFT to the case-insensitive pattern too, of course. However, the terms of the question refer to "from the first dot (.) up to (but not including) the underscore after a (case-insensitive) DRAFT", and detailing, saving, and replacing the material after the underscore is not necessary.
Laziness
I saved myself typing the elaborate case-insensitive string by using a program called mkpattern that (according to RCS) I created on 1990-08-23 (a while ago now). It's not rocket science. I use it every so often — I've actually used it a number of times in the last week, searching for some elusive documents on a system at work.
$ mkpattern DRAFT
[dD][rR][aA][fF][tT]
$
You might have to do that longhand in future.
try something like
{mawk/mawk2/gawk} 'BEGIN { FS = "[\056].+DRAFT_"; OFS = ""; } (NF < 2) || ($1 = $1)'
It might not be the fastest, but it's a relatively clean approach. Octal \056 is the period, and it's less ambiguous to the reader when the next item is a ".+".
This might work for you (GNU sed):
sed -nE 's/DRAFT[^_]*/\n/i;s/\..*\n//p' file
First turn on the -n and -E to turn off implicit printing and make regexp more easily seen.
Since we want the first occurrence of DRAFT we can not use a regexp that uses the .* idiom as this is greedy and may pass over it if there are two or more such occurrences. Thus we replace the DRAFT by a unique character which cannot occur in the line. The newline can only be introduced by the programmer and is the best choice.
Now we can use the .* idiom as the introduced newline can only exist if the previous substitution has matched successfully.
N.B. The i flag in the first substitution allows for any upper/lower case rendition of the string DRAFT, also the second substitution includes the p flag to print the successful substitution.
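Where the i flag and \n in the replacement are not available, the "first occurrence" requirement can also be approached without regular expressions at all. The following awk sketch is one possible way, under the assumption that the first DRAFT_ on the line (ignoring case) comes after the first dot; lines that do not fit that assumption are printed unchanged.

```shell
result=$(printf '%s\n' \
  '3300000526.a:P_A23_Liq_2_FmtDRAFT_1000944_2,' \
  '3300000558.a:Draft_10335283_1' |
  awk '{
    dot = index($0, ".")                  # position of the first "."
    tag = index(toupper($0), "DRAFT_")    # first DRAFT_, ignoring case
    if (dot && tag > dot)                 # only rewrite lines that have both
      $0 = substr($0, 1, dot - 1) "_" substr($0, tag + 6)
    print
  }')
printf '%s\n' "$result"
```

index() always reports the leftmost occurrence, so there is no greedy .* that could skip past an earlier DRAFT_.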

Remove pattern and everything before using AWK in fasta file

I searched a lot but could not find a solution to my issue. I have a file that looks like:
>HEADER1
AACTGGTTACGTGGTTCTCT
>HEADER2
GGTTTCTC
>HEADER3
CCAGGTTTCGAGGGGTTACGGGGTA
I want to remove GGTT pattern and everything before it. So basically there are several of these patterns in some of the lines so I want to remove all of them including everything before the pattern or among them.
The desired output should look like:
>HEADER1
CTCT
>HEADER2
TCTC
>HEADER3
ACGGGGTA
I tried the suggested approaches but was not able to adapt them to my data.
Thank you in Advance for your help.
If it's not possible for your headers to include GGTT, I suppose the easiest would be:
$ sed 's/.*GGTT//' file
>HEADER1
CTCT
>HEADER2
TCTC
>HEADER3
ACGGGGTA
If your headers might contain GGTT, then awk would probably be better:
$ awk '!/^>/ {sub(/.*GGTT/, "")}1' file
>HEADER1
CTCT
>HEADER2
TCTC
>HEADER3
ACGGGGTA
In both cases, the .*GGTT is "greedy", so it doesn't matter if there are multiple instances of GGTT, it will always match up to and remove everything through the last occurrence.
In the awk version, the pattern !/^>/ makes sure that substitution is only done on lines that do not start with >.
Note that in general, sequences in fasta format as shown in the question may span multiple lines (= they are often wrapped to 80 or 100 nucleotides per line). This answer handles such cases correctly as well, unlike some other answers in this thread.
Use these two Perl one-liners connected by a pipe. The first one-liner does all of the common reformatting of the fasta sequences that is necessary in this and similar cases. It removes newlines and whitespace in the sequence (which also unwraps the sequence), but does not change the sequence header lines. It also properly handles leading and trailing whitespace/newlines in the file. The second one-liner actually removes everything up to and including the last GGTT in the sequence, in a case-insensitive manner.
Note: If GGTT is at the end of the sequence, the output will be a header plus an empty sequence. See seq4 in the example below. This may cause issues with some bioinformatics tools used downstream.
# Create the input for testing:
cat > in.fa <<EOF
>seq1 with blanks
ACGT GGTT ACGT
>seq2 with newlines
ACGT
GGTT
ACGT
>seq3 without blanks or newlines
ACGTGGTTACGT
>seq4 everything should be deleted, with empty sequence in the output
ACGTGGTTACGTGGTT
>seq5 lowercase
acgtggttacgt
EOF
# Reformat to single-line fasta, then delete subsequences:
perl -ne 'chomp; if ( /^>/ ) { print "\n" if $n; print "$_\n"; $n++; } else { s/\s+//g; print; } END { print "\n"; }' in.fa | \
perl -pe 'next if /^>/; s/.*GGTT//i;' > out.fa
Output in file out.fa:
>seq1 with blanks
ACGT
>seq2 with newlines
ACGT
>seq3 without blanks or newlines
ACGT
>seq4 everything should be deleted, with empty sequence in the output
>seq5 lowercase
acgt
The Perl one-liners use these command line flags:
-e : Tells Perl to look for code in-line, instead of in a file.
-n : Loop over the input one line at a time, assigning it to $_ by default.
-p : Loop over the input one line at a time, assigning it to $_ by default. Add print $_ after each loop iteration.
chomp : Remove the input line separator (\n on *NIX).
if ( /^>/ ) : Test if the current line is a sequence header line.
$n : This variable is undefined (false) at the beginning, and true after seeing the first sequence header, in which case we print an extra newline. This newline goes at the end of each sequence, starting from the first sequence.
END { print "\n"; } : Print the final newline after the last sequence.
s/\s+//g; print; : If the current line is sequence (not header), remove all the whitespace and print without the terminal newline.
next if /^>/; : Skip the header lines.
s/.*GGTT//i; : Replace everything (.*) up to and including the last GGTT with nothing (= delete it). The /i modifier means case-insensitive match.
SEE ALSO:
perldoc perlrun: how to execute the Perl interpreter: command line switches
perldoc perlre: Perl regular expressions (regexes)
perldoc perlre: Perl regular expressions (regexes): Quantifiers; Character Classes and other Special Escapes; Assertions; Capture groups
Remove line breaks in a FASTA file
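The wrapped-sequence case can also be sketched in awk alone, by collecting each record's sequence lines before removing anything. This is a rough, case-sensitive sketch (unlike the Perl version with /i), and the helper name emit and the variables seq and pending are made up for the example. The sample input deliberately splits a GGTT across two lines.

```shell
result=$(printf '%s\n' '>HEADER1' 'AACTGG' 'TTACGTGGTTCTCT' '>HEADER2' 'GGTTTCTC' |
  awk '
    # Print the pending sequence with everything through the last GGTT removed.
    function emit() { if (pending) { sub(/.*GGTT/, "", seq); print seq } }
    /^>/ { emit(); print; seq = ""; pending = 1; next }   # header: flush, then start over
    { gsub(/[ \t\r]/, ""); seq = seq $0 }                  # sequence line: strip blanks, append
    END { emit() }
  ')
printf '%s\n' "$result"
```

A line-by-line substitution would miss the GGTT that straddles the first two sequence lines; joining the lines first avoids that.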

Printing lines with duplicate words

I am trying to print all lines that contain the same word twice or more
E.g. with this input file:
cat dog cat
dog cat deer
apple peanut banana apple
car bus train plane
car train car train
Output should be
cat dog cat
apple peanut banana apple
car train car train.
I have tried this code and it works but I think there must be a shorter way.
awk '{ a=0;for(i=1;i<=NF;i++){for(j=i+1;j<=NF;j++){if($i==$j)a=1} } if( a==1 ) print $0}'
Later I want to find all such duplicate words and delete all the duplicate entries except for 1st occurrence.
So input:
cat dog cat lion cat
dog cat deer
apple peanut banana apple
car bus train plane
car train car train
Desired output:
cat dog lion
dog cat deer
apple peanut banana
car bus train plane
car train
You can use this GNU sed command:
sed -rn '/(\b\w+\b).*\b\1\b/ p' yourfile
-r activates extended regular expressions and -n deactivates the implicit printing of every line
the p command then prints only lines that match the preceding re (inside the slashes):
\b\w+\b are words: a nonempty sequence of word characters (\w) between word boundaries (\b); these are GNU extensions
such a word is "stored" in \1 for later reuse, due to the use of parentheses
then we try to match this word with \b\1\b again with something optional (.*) between those two places.
and that is the whole trick: match something, put it in parentheses so you can reuse it in the same re with \1
To answer the second part of the question, deleting the doubled words after the first, but print all lines (modifying only the lines with doubled words), you could use some sed s magic:
sed -r ':A s/(.*)(\b\w+\b)(.*)\b\2\b(.*)/\1\2\3\4/g; t A ;'
here we use again the backreference trick.
but we have to account for the things before, between and after our doubled words, thus we have a \2 in the matching part of the s command and we have the other backreferences in the replacement part.
notice that only the \2 has no parens in the matching part and we use all groups in the replacement, thus we effectively deleted the second word of the pair.
for more repetitions of the word we need a loop:
:A is a label
t A jumps to the label if there was a replacement done in the last s command
this builds a "while loop" around the s to delete the other repetitions, too
Here's a solution for printing only lines that contain duplicate words.
awk '{
delete seen
for (i=1;i<=NF;++i) {
if (seen[$i]) { print ; next }
seen[$i] = 1
}
}'
Here's a solution for deleting duplicate words after the first.
awk '{
delete seen
for (i=1;i<=NF;++i) {
if (seen[$i]) { continue }
printf("%s ", $i);
seen[$i] = 1
}
print "";
}'
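A possible variant of the second script builds the output line in a variable, so no trailing space is printed, and empties the array with split(), which also works in awk versions that do not accept delete on a whole array. It is a sketch under the assumption that words are whitespace-separated and that joining them with single spaces is acceptable; the variable names are illustrative.

```shell
result=$(printf '%s\n' 'cat dog cat lion cat' 'dog cat deer' 'car train car train' |
  awk '{
    split("", seen)                 # portable way to empty the array for each line
    out = ""
    for (i = 1; i <= NF; i++)
      if (!seen[$i]++)              # true only the first time a word is met
        out = out (out == "" ? "" : " ") $i
    print out
  }')
printf '%s\n' "$result"
```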
Re your comment...
Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems. — Jamie Zawinski, 1997
With egrep you can use a so called back reference:
egrep '(\b\w+\b).*\b\1\b' file
(\b\w+\b) matches a word at word boundaries in capturing group 1. \1 references that matched word in the pattern.
I'll show solutions in Perl as it is probably the most flexible tool for text parsing, especially when it comes to regular expressions.
Detecting Duplicates
perl -ne 'print if m{\b(\S+)\b.*?(\b\1\b)}g' file
where
-n causes Perl to execute the expression passed via -e for each input line;
\b matches word boundaries;
\S+ matches one or more non-space characters;
.*? is a non-greedy match for zero or more characters;
\1 is a backreference to the first group, i.e. the word \S+;
g globally matches the pattern repeatedly in the string.
Removing Duplicates
perl -pe '1 while (s/\b(\S+)\b.*?\K(\s\1\b)//g)' file
where
-p causes Perl to print the line ($_), like sed;
1 while loop runs as long as the substitution replaces something;
\K keeps the part matching the previous expression;
Duplicate words (\s\1\b) are replaced with empty string (//g).
Why Perl?
Perl regular expressions are known to be very flexible, and regular expressions in Perl are actually more than just regular expressions. For example, you can embed Perl code into the substitution using the /e modifier. You can use the /x modifier that allows to write regular expressions in a more readable format and even use Perl comments in it, e.g.:
perl -pe '1 while (
s/ # Begins substitution: s/pattern/replacement/flags
\b (\S+) \b # A word
.*? # Ungreedy pattern for any number of characters
\K # Keep everything that matched the previous patterns
( # Group for the duplicate word:
\s # - space
\1 # - backreference to the word
\b # - word boundary
)
//xg
)' file
As you should have noticed, the \K anchor is very convenient, but is not available in many popular tools including awk, bash, and sed.

Using awk how do I reprint a found pattern with a new line character?

I have a text file in the format of:
aaa: bcd;bcd;bcddd;aaa:bcd;bcd;bcd;
Where "bcd" can be any length of any characters, excluding ; or :
What I want to do is print the text file in the format of:
aaa: bcd;bcd;bcddd;
aaa: bcd;bcd;bcd;
-etc-
My method of approach to this problem was to isolate a pattern of ";...:" and then reprint this pattern without the initial ;
I concluded I would have to use awk's 'gsub' to do this, but have no idea how to replicate the pattern nor how to print the pattern again with this added new line character 1 character into my pattern.
Is this possible?
If not, can you please direct me in a way of tackling it?
We can't quite be sure of the variability in the aaa or bcd parts; presumably, each one could be almost anything.
You should probably be looking for:
a series of one or more non-colon, non-semicolon characters followed by colon,
with one or more repeats of:
a series of one or more non-colon, non-semicolon characters followed by a semi-colon
That makes up the unit you want to match.
/[^:;]+:([^:;]+;)+/
With that, you can substitute what was found by the same followed by a newline, and then print the result. The only trick is avoiding superfluous newlines.
Example script:
{
echo "aaa: bcd;bcd;bcddd;aaa:bcd;bcd;bcd;"
echo "aaz: xcd;ycd;bczdd;baa:bed;bid;bud;"
} |
awk '{ gsub(/[^:;]+:([^:;]+;)+/, "&\n"); sub(/\n+$/, ""); print }'
Example output
aaa: bcd;bcd;bcddd;
aaa:bcd;bcd;bcd;
aaz: xcd;ycd;bczdd;
baa:bed;bid;bud;
Paraphrasing the question in a comment:
Why does the regular expression not include the characters before a colon (which is what it's intended to do, but I don't understand why)? I don't understand what "breaks" or ends the regex.
As I tried to explain at the top, you're looking for what we can call 'words', meaning sequences of characters that are neither a colon nor a semicolon. In the regex, that is [^:;]+, meaning one or more (+) of the negated character class — one or more non-colon, non-semicolon characters.
Let's pretend that spaces in a regex are not significant. We can space out the regex like this:
/ [^:;]+ : ( [^:;]+ ; ) + /
The slashes simply mark the ends, of course. The first cluster is a word; then there's a colon. Then there is a group enclosed in parentheses, tagged with a + at the end. That means that the contents of the group must occur at least once and may occur any number of times more than that. What's inside the group? Well, a word followed by a semicolon. It doesn't have to be the same word each time, but there does have to be a word there. If something can occur zero or more times, then you use a * in place of the +, of course.
The key to the regex stopping is that the aaa: in the middle of the first line does not consist of a word followed by a semicolon; it is a word followed by a colon. So, the regex has to stop before that because the aaa: doesn't match the group. The gsub() therefore finds the first sequence, and replaces that text with the same material and a newline (that's the "&\n", of course). It (gsub()) then resumes its search directly after the end of the replacement material, and — lo and behold — there is a word followed by a colon and some words followed by semicolons, so there's a second match to be replaced with its original material plus a newline.
Note that $0 does not include the newline that ended the input line; awk strips the record separator. The trailing newline comes from the gsub() itself: the last match on each line ends at the end of the line, so the "&\n" replacement leaves a newline there. Without the sub() to remove it, the print (implicitly of $0 with a newline) generated a blank line I didn't want in the output, so I removed the extraneous newline(s).
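To watch the regex stop at each unit, the matches can be pulled out one at a time with match() instead of being rewritten in place with gsub(). This is only an illustrative sketch of the same pattern; the variable names rest and result are made up for the example.

```shell
result=$(printf '%s\n' 'aaa: bcd;bcd;bcddd;aaa:bcd;bcd;bcd;' |
  awk '{
    rest = $0
    # Peel off one "word: word; word; ..." unit per iteration.
    while (match(rest, /[^:;]+:([^:;]+;)+/)) {
      print substr(rest, RSTART, RLENGTH)
      rest = substr(rest, RSTART + RLENGTH)
    }
  }')
printf '%s\n' "$result"
```

Each iteration prints one unit on its own line, so no extra newline has to be cleaned up afterwards.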
This might work for you:
awk '{gsub(/[^;:]*:/,"\n&");sub(/^\n/,"");gsub(/: */,": ")}1' file
Prepend a newline (\n) to any run of characters that contains neither ; nor : and is followed by a :
Remove any newline prepended to the start of line.
Replace any : followed by none or many spaces with a : followed by a single space.
Print all lines.
Or this:
sed 's/;\([^;:]*: *\)/;\n\1 /g' file
Not sure how to do it in awk, but with sed this does what I think you want:
$ nl='
'
$ sed "s/\([^;]*:\)/\\${nl}\1/g" input
The first command sets the shell variable $nl to the string containing a single new line. Some versions of sed allow you to use \n inside the replacement string, but not all allow that. This keeps any whitespace that appears after the final ; and puts it at the start of the line. To get rid of that, you can do
$ sed "s/\([^;]*:\)/\\${nl}\1/g; s/\n */\\$nl/g" input
Ordinary awk gsub() and sub() don't allow you to refer to captured components in the replacement string. GNU awk ("gawk") supplies gensub(), which does, e.g. gensub(/(;) (.+:)/, "\\1\n\\2", "g").