Extract text between symbols with bash - awk

First off, I am relatively new at this, so please bear with me.
I have an annotated transcriptome .fasta file, which contains ~60,000 records of genes like these two:
>comp35897_c0_seq11 len=1039 path=[11:0-12;24:13-1038] Match_Acc=E5SX33 Gene=Putative_CAP-Gly_domain_protein
TTTTAAATTGATTACTTTGCTATTTTTGGCAATGTTGGACTGAGTTGTCGTATTTTTTCG
>comp32620_c0_seq3 len=1874 path=[1:0-195;197:196-220;222:221-354;356:355-481;4197:482-487;489:488-579;581:580-1159;1161:1160-1712;1714:1713-1729;1731:1730-1794;5873:1795-1873] Match_Acc=K1PQJ1 Gene=HAUS_augmin-like_complex_subunit_3 GO=GO:0051225,GO:0070652
CAGACTTTTGGATTTAGTACATGTATGTATGAATATGTGTTTCAATGTACAACTCAGGAT
I am trying to create a two-column, space-delimited .tab with component number in the first column and gene name in the second column. I have looked at many similar posts using grep, sed, or awk, but none of the suggested code has worked for me.
Specifically, what I need to pull from the .fasta is the comp number between the > and the next space for the first column, and the gene name between Gene= and the next space for the second column. For the two genes above, that should give me:
comp35897_c0_seq11 Putative_CAP-Gly_domain_protein
comp32620_c0_seq3 HAUS_augmin-like_complex_subunit_3
Any help would be much appreciated!

with awk:
Skips the gene name if 'Gene=' is absent
awk 'BEGIN{RS=">"} NF>1{if($5 ~ /Gene=/){gsub("Gene=","",$5); print $1,$5} else {print $1}}' < transcriptome.fasta > space-delimited.tab
Output:
comp35897_c0_seq11 Putative_CAP-Gly_domain_protein
comp32620_c0_seq3
Skips the record if 'Gene=' is absent
awk 'BEGIN{RS=">"} NF>1{if($5 ~ /Gene=/){gsub("Gene=","",$5); print $1,$5}}' < transcriptome.fasta > space-delimited.tab
Output:
comp35897_c0_seq11 Putative_CAP-Gly_domain_protein
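If the Gene= tag is not always the fifth field, looping over the header fields avoids hard-coding $5. This is only a sketch: extract_genes is a made-up helper name, and it assumes header lines start with > and that fields are separated by whitespace. Headers without a Gene= tag are skipped.

```shell
# extract_genes: hypothetical helper, not part of the answer above.
# Prints "<id> <gene>" for every FASTA header that carries a Gene= tag.
extract_genes() {
  awk '/^>/ {
    id = substr($1, 2)                 # drop the leading ">"
    gene = ""
    for (i = 2; i <= NF; i++)          # look for Gene= in any field
      if ($i ~ /^Gene=/) { gene = substr($i, 6); break }
    if (gene != "") print id, gene
  }' "$@"
}
```

Usage would be something like: extract_genes transcriptome.fasta > space-delimited.tab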

Have you tried anything yet?
with sed you could do:
sed 's/>\(comp[^ ]\+\) \+.*Gene=\([^ ]\+\) .*$/\1 \2/'
which looks complex but is relatively easy to understand if you take it slowly and break it down into its component parts.
edit
ok, so to ensure sed only outputs what you want, you need to switch on the 'no output by default' mode with -n and explicitly print each line you are interested in with the p flag
I'll try to break it down, so that it is understandable.
comp[^ ]\+ #is a regex that says:
#text that starts with the string 'comp'
#and is followed by at least one character
#that is anything that isn't a space (the [^ ])
\(comp[^ ]\+\) #is the sed construct that remembers what
#that regex matches.
.* #is the regex for zero or more of any chars.
'Gene=\([^ ]\+\) ' #look for the string 'Gene' followed by an
#equals sign, followed by at least one char
#that isn't a space, followed by a space
#oh, and remember the bit after = and before the space
so, along with the -n and the p switches for sed you could use:
sed -n 's/>\(comp[^ ]\+\) \+.*Gene=\([^ ]\+\) .*$/\1 \2/p'
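One caveat: the pattern above expects a space after the gene name, so a header where Gene= is the last field (like the first sample record) would not match. A possible variant that drops that requirement and sticks to POSIX basic regex syntax is sketched below; fasta_genes_sed is a hypothetical name, and it assumes each header has at most one Gene= tag.

```shell
# fasta_genes_sed: hypothetical wrapper; prints "<id> <gene>" per header.
# [^ ][^ ]* is the portable spelling of "one or more non-space characters".
fasta_genes_sed() {
  sed -n 's/^>\([^ ][^ ]*\) .*Gene=\([^ ][^ ]*\).*$/\1 \2/p' "$@"
}
```

Because the trailing .* can match nothing, it works whether or not anything follows the gene name.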

Related

How to extract (First match)text between two words

I have a file having the following structure
destination list
move from station d-435-435 to point place1
move from station d-435-435 to point place2
move from mainpoint
I want to extract the word "d-435-435"(Only the first match, this need not be same value always) in between the words "from station" and "to point"
How can I achieve this?
What I have tried so far?
id=$(sed 's/.*from station \(.*\) to.*/\1/' input.txt)
But this returns the following value: destination list d-435-435 move from mainpoint
1st solution: With your shown samples, please try the following GNU awk code. It uses awk's match function with the regex from station\s+\S+\s+to point to find the requested value, then removes from station\s+ and \s+to point from the matched text and prints what remains.
awk '
match($0,/from station\s+\S+\s+to point/){
val=substr($0,RSTART,RLENGTH)
gsub(/from station\s+|\s+to point/,"",val)
print val
exit
}
' Input_file
2nd solution: Using GNU grep, please try the following. The -o option prints only the matched portion, -P enables PCRE, and -m1 stops after the first matching line. The pattern matches the string from station followed by space(s); \K then discards everything matched so far (since we don't need it in the output). Next it matches \S+ (non-space characters), and a positive lookahead checks that space(s) and to point follow without including them in the output.
grep -oP -m1 'from station\s+\K\S+(?=\s+to point)' Input_file
If GNU sed is available, how about:
id=$(sed -nE '0,/from station.*to/ s/.*from station (.*) to.*/\1/p' input.txt)
The -n option suppresses printing unless the substitution succeeds.
The condition 0,/pattern/ is an address range that acts like a flip-flop: it stops
applying after the pattern first matches. The 0 address is a GNU sed extension that
allows the pattern to match on the very first line.
With awk you can test the fields before and after
field $4, where d-435-435 is, then print that field for the first match only and stop with exit after the print statement:
awk '$2=="from" && $3=="station" && $5=="to" && $6=="point" {print $4; exit}' file
d-435-435
or using GNU awk for the 3rd arg to match():
awk 'match($0,/from station\s+(.*)\s+to point/,a){print a[1];exit}' file
d-435-435
The regexp contains a parenthesized group, so a[1] contains the portion of the string between from station followed by space(s) (\s+) and space(s) (\s+) followed by to point.
This might work for you (GNU sed):
sed -nE '/.*station (\S+) to point.*/{s//\1/;H;x;/\n(\S+)\n.*\1/{s/\n\S+$//;x;d};x;p}' file
Turn off implicit printing and turn on extended regexps with the command-line options -nE.
If a line matches the required criteria, extract the required string, append a copy to the hold space, check if the match has already been seen and if not print it. If the match has been seen, remove it from the hold space.
Otherwise, do not print anything.
This should work in any sed:
sed -e '/.*from station \([^ ]*\) to .*/!d' -e 's//\1/' -e q file
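The field-comparison awk answer above assumes the words always sit in fields 2 to 6. If the phrase can start anywhere on the line, one option is to slide over the fields instead. This is a sketch in plain POSIX awk; first_station is a made-up name, and it assumes the value is a single whitespace-free word.

```shell
# first_station: hypothetical helper; prints the word between
# "from station" and "to point" for the first line that has both.
first_station() {
  awk '{
    for (i = 1; i + 4 <= NF; i++)
      if ($i == "from" && $(i+1) == "station" &&
          $(i+3) == "to" && $(i+4) == "point") {
        print $(i+2)
        exit                           # stop after the first match
      }
  }' "$@"
}
```

It could then be used as id=$(first_station input.txt).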

Generating csv from text file in Linux command line with sed, awk or other

I have a file with thousands of lines that I would like to have it as a csv, for later processing.
The original file looks like this:
cc_1527 (ILDO_I173_net9 VSSA) capacitor_mis c=9.60713e-16
cc_1526 (VDD_MAIN Istartupcomp_I115_G7) capacitor_mis \
c=4.18106e-16
cc_1525 (VDD_MAIN Istartupcomp_I7_net025) capacitor_mis \
c=9.71462e-16
cc_1524 (VDD_MAIN Istartupcomp_I7_ST_net14) \
capacitor_mis c=4.6011e-17
cc_1523 (VDD_MAIN Istartupcomp_I7_ST_net15) \
capacitor_mis c=1.06215e-15
cc_1522 (VDD_MAIN ILDO_LDO_core_Istartupcomp_I7_ST_net16) \
capacitor_mis c=1.37289e-15
cc_1521 (VDD_MAIN ILDO_LDO_core_Istartupcomp_I7_I176_G4) capacitor_mis \
c=6.81758e-16
The problem here, is that some of the lines continue to the next one, indicated by the symbol "\".
The final csv format for the first 5 lines of the original text should be:
cc_1527,(ILDO_I173_net9 VSSA),capacitor_mis,c=9.60713e-16
cc_1526,(VDD_MAIN Istartupcomp_I115_G7),capacitor_mis,c=4.18106e-16
cc_1525,(VDD_MAIN Istartupcomp_I7_net025),capacitor_mis,c=9.71462e-16
So, now everything is in one line only and the "\" characters have been removed.
Please note that there may be spaces at the beginning of each line, so these should be trimmed before anything else is done.
Any idea on how to accomplish this?
Thanks in advance.
Best regards,
Pedro
Using some of the more obscure features of sed (It can do more than s///):
$ sed -E ':line /\\$/ {s/\\$//; N; b line}; s/[[:space:]]+/,/g' demo.txt
cc_1527,(ILDO_I173_net9,VSSA),capacitor_mis,c=9.60713e-16
cc_1526,(VDD_MAIN,Istartupcomp_I115_G7),capacitor_mis,c=4.18106e-16
cc_1525,(VDD_MAIN,Istartupcomp_I7_net025),capacitor_mis,c=9.71462e-16
cc_1524,(VDD_MAIN,Istartupcomp_I7_ST_net14),capacitor_mis,c=4.6011e-17
cc_1523,(VDD_MAIN,Istartupcomp_I7_ST_net15),capacitor_mis,c=1.06215e-15
cc_1522,(VDD_MAIN,ILDO_LDO_core_Istartupcomp_I7_ST_net16),capacitor_mis,c=1.37289e-15
cc_1521,(VDD_MAIN,ILDO_LDO_core_Istartupcomp_I7_I176_G4),capacitor_mis,c=6.81758e-16
Basically:
Read a line into the pattern space.
:line /\\$/ {s/\\$//; N; b line}: If the pattern space ends in a \, remove that backslash, read the next line and append it to the pattern space, and repeat this step.
s/[[:space:]]+/,/g: Convert every case of 1 or more whitespace characters to a single comma.
Print the result, and go back to the beginning with a new line.
The answer by @Shawn has been accepted by the OP and I'm not sure
if my answer is worth posting but allow me to do it just for information.
If Perl is your option, please try the following script which preserves
the whitespaces within parens not replacing them by commas:
perl -0777 -ne '
s/\\\n//g;
foreach $line (split(/\n/)) {
while ($line =~ /(\([^)]+\))|(\S+)/g) {
push(@ary, $&);
}
print join(",", @ary), "\n";
@ary = ();
}
' input.txt
Output:
cc_1527,(ILDO_I173_net9 VSSA),capacitor_mis,c=9.60713e-16
cc_1526,(VDD_MAIN Istartupcomp_I115_G7),capacitor_mis,c=4.18106e-16
cc_1525,(VDD_MAIN Istartupcomp_I7_net025),capacitor_mis,c=9.71462e-16
cc_1524,(VDD_MAIN Istartupcomp_I7_ST_net14),capacitor_mis,c=4.6011e-17
cc_1523,(VDD_MAIN Istartupcomp_I7_ST_net15),capacitor_mis,c=1.06215e-15
cc_1522,(VDD_MAIN ILDO_LDO_core_Istartupcomp_I7_ST_net16),capacitor_mis,c=1.37289e-15
cc_1521,(VDD_MAIN ILDO_LDO_core_Istartupcomp_I7_I176_G4),capacitor_mis,c=6.81758e-16
[How it works]
First of all, -0777 -ne option tells Perl to slurp all lines
into the Perl's default variable $_.
Next, s/\\\n//g; removes trailing backslashes by merging lines.
Then split(/\n/) splits the lines on newlines back again.
The regex /(\([^)]+\))|(\S+)/g will be the most important part
which divides each line into fields. The field pattern is defined as:
"substring surrounded by parens OR substring which does not include whitespaces." It works as FPAT in awk and preserves whitespaces
between parens without dividing the line on them.
I've tested with approx. 10,000 line input and the execution time
is less than a second.
Hope this helps.
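For anyone who would rather stay with awk, the same idea (join continuation lines, then split on whitespace except inside parentheses) can be sketched in portable awk. to_csv is a hypothetical name, and the sketch assumes parentheses are balanced and not nested in any way that matters.

```shell
# to_csv: hypothetical helper; joins lines ending in a backslash and
# turns runs of blanks outside parentheses into single commas.
to_csv() {
  awk '
    { sub(/^[ \t]+/, "") }                        # trim leading blanks
    /\\$/ { sub(/\\$/, ""); buf = buf $0; next }  # continuation: keep collecting
    {
      line = buf $0; buf = ""
      out = ""; depth = 0
      n = length(line)
      for (i = 1; i <= n; i++) {
        c = substr(line, i, 1)
        if (c == "(") depth++
        else if (c == ")") depth--
        if ((c == " " || c == "\t") && depth == 0) {
          if (out !~ /,$/) out = out ","          # collapse repeated blanks
        } else {
          out = out c
        }
      }
      sub(/,$/, "", out)                          # drop a trailing comma, if any
      print out
    }' "$@"
}
```

Called as to_csv input.txt, it should keep (VDD_MAIN Istartupcomp_I115_G7) together as one field, like the Perl version.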

Using SED/AWK to replace letters after a certain position

I have a file with words (1 word per line). I need to censor all letters in the word, except the first five, with a *.
Ex.
Authority -> Autho****
I'm not very sure how to do this.
If you are lucky, all you need is
sed 's/./*/6g' file
When I originally posted this, I believed this to be reasonably portable; but as per @ghoti's comment, it is not.
Perl to the rescue:
perl -pe 'substr($_, 5) =~ s/./*/g' -- file
-p reads the input line by line and prints each line after processing
substr returns a substring of the given string starting at the given position.
s/./*/g replaces any character with an asterisk. The g means the substitution will happen as many times as possible, not just once, so all the characters will be replaced.
In some versions of sed, you can specify which substitution should happen by appending a number to the operation:
sed -e 's/./*/g6'
This will replace all (again, because of g) characters, starting from the 6th position.
Here's a portable solution for sed:
$ echo abcdefghi | sed -e 's/\(.\{5\}\)./\1*/;:x' -e 's/\*[a-z]/**/;t x'
abcde****
Here's how it works:
's/\(.\{5\}\)./\1*/' - preserve the first five characters, replacing the 6th with an asterisk.
':x' - set a "label", which we can branch back to later.
's/\*[a-z]/**/' - substitute the letter following an asterisk with an asterisk.
't x' - if the last substitution succeeded, jump back to the label "x".
This works equally well in GNU and BSD sed.
Of course, adjust the regexes to suit.
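As one example of such an adjustment, [a-z] only covers lowercase letters, so uppercase letters or digits after position 5 would stop the loop early. A sketch that masks any character instead (censor5 is a made-up name; it assumes the words themselves contain no asterisks):

```shell
# censor5: hypothetical wrapper around the label/branch loop above,
# with [^*] in place of [a-z] so any character gets masked.
censor5() {
  sed -e 's/^\(.\{5\}\)./\1*/' -e ':x' -e 's/\*[^*]/**/' -e 't x' "$@"
}
```

For example, echo Authority | censor5 should print Autho****, and words of five characters or fewer pass through unchanged.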
The following awk may help with this.
Solution 1st: awk solution with substr and gensub.
awk '{print substr($0,1,5) gensub(/./,"*","g",substr($0,6))}' Input_file
Solution 2nd:
awk 'NF{len=length($0);if(len>5){i=6;while(i<=len){val=val?val "*":"*";i++};print substr($0,1,5) val};val=i=""}' Input_file
Autho****
EDIT: Adding a non-one liner form of solution too now. Adding explanation with it too now.
awk '
NF{ ##Checking if a line is NON-empty.
len=length($0); ##Taking length of the current line into a variable called len here.
if(len>5){ ##Checking if length of current line is greater than 5 as per OP request. If yes then do following.
i=6; ##creating variable named i whose value is 6 here.
while(i<=len){ ##starting a while loop here which runs from the value of variable i up to the length of the current line.
val=val?val "*":"*"; ##creating variable named val here whose value will be concatenated to its own value, it will add * to its value each time.
i++ ##incrementing variable named i value with 1 each time.
};
print substr($0,1,5) val ##printing value of substring from 1st letter to 5th letter and then printing value of variable val here too.
};
val=i="" ##Nullifying values of variable val and i here too.
}
' Input_file ##Mentioning Input_file name here.
Personally I'd just use sed for this (see @triplee's answer) but if you want to do it in awk it'd be:
$ awk '{t=substr($0,1,5); gsub(/./,"*"); print t substr($0,6)}' file
Autho****
or with GNU awk for gensub():
$ awk '{print substr($0,1,5) gensub(/./,"*","g",substr($0,6))}' file
Autho****
It is also possible and quite straightforward with sed:
sed 's/./\*/6;:loop;s/\*[^\*]/\**/;/\*[^\*]/b loop' file_to_censor.txt
output:
explanation:
s/./\*/6 #replace the 6th character of the chain by *
:loop #define a label for the goto
s/\*[^\*]/\**/ #replace * followed by non * char by **
/\*[^\*]/b loop #then loop until it does not exist a * followed by a non * char
Here is a pretty straightforward sed solution (that does not require GNU sed):
sed -e :a -e 's/^\(.....\**\)[^*]/\1*/;ta' filename

Replace chars after column X

Let's say my data looks like this
iqwertyuiop
and I want to replace all the letters i after column 3 with a Z.. so my output would look like this
iqwertyuZop
How can I do this with sed or awk?
It's not clear what you mean by "column" but maybe this is what you want using GNU awk for gensub():
$ echo iqwertyuiop | awk '{print substr($0,1,3) gensub(/i/,"Z","g",substr($0,4))}'
iqwertyuZop
Perl is handy for this: you can assign to a substring
$ echo "iiiiii" | perl -pe 'substr($_,3) =~ s/i/Z/g'
iiiZZZ
This would totally be ideal for the tr command, if only you didn't have the requirement that the first 3 characters remain untouched.
However, if you are okay using some bash tricks plus cut and paste, you can split the file into two parts and paste them back together afterwards:
paste -d'\0' <(cut -c-3 foo) <(cut -c4- foo | tr i Z)
The above uses paste to rejoin together the two parts of the file that get split with cut. The second section is piped through tr to translate i's to Z's.
(1) Here's a short-and-simple way to accomplish the task using GNU sed:
sed -r -e ':a;s/^(...)([^i]*)i/\1\2Z/g;ta'
This entails looping (t), and so would not be as efficient as non-looping approaches. The above can also be written using escaped parentheses instead of unescaped ones, and so there is no real need for the -r option. Other implementations of sed should (in principle) be up to the task as well, but YMMV.
(2) It's easy enough to use "old awk" as well:
awk '{s=substr($0,4);gsub(/i/,"Z",s); print substr($0,1,3) s}'
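If the column and the characters need to vary, the same "old awk" idea can be wrapped in a small function. This is a sketch only: replace_after and its argument order are made up here, and it treats the search text literally (via index) rather than as a regex. It assumes the arguments contain no backslashes, since awk -v interprets escape sequences.

```shell
# replace_after: hypothetical helper.
# usage: replace_after COL FROM TO [file...]
# Replaces every literal FROM with TO after the first COL characters.
replace_after() {
  col=$1 from=$2 to=$3
  shift 3
  awk -v col="$col" -v from="$from" -v to="$to" '{
    head = substr($0, 1, col)          # untouched prefix
    tail = substr($0, col + 1)         # part that gets edited
    out = ""
    while (length(from) > 0 && (p = index(tail, from)) > 0) {
      out = out substr(tail, 1, p - 1) to
      tail = substr(tail, p + length(from))
    }
    print head out tail
  }' "$@"
}
```

With the sample data, echo iqwertyuiop | replace_after 3 i Z should give iqwertyuZop.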
The most intuitive way would be to use awk:
awk 'BEGIN{FS="";OFS=FS}{for(i=4;i<=NF;i++){if($i=="i"){$i="Z"}}}1' file
FS="" splits the input string by characters into fields. We iterate through character/field 4 to the end and replace i by Z.
The final 1 evaluates to true and make awk print the modified input line.
With sed it does not look very intuitive, but it is still possible:
sed -r '
h # Backup the current line in hold buffer
s/.{3}// # Remove the first three characters
s/i/Z/g # Replace all i by Z
G # Append the contents of the hold buffer to the pattern buffer (this adds a newline between them)
s/(.*)\n(.{3}).*/\2\1/ # Remove that newline ^^^ and assemble the result
' file

text processing: sed to work backwards to delete until string

My AWK script generates 1 of the following 2 outputs depending on what text file it is being used on.
49 1146.469387755102 mongodb 192.168.0.8:27017 -p mongodb.database
1 1243.0 jdbc:mysql 192.168.0.8:3306/ycsb -p db.user
I need a way of deleting everything past the IP address, including the port number.
sed 's/:[^:]*//2g'
Works apart from the fact it deletes from left to right and as one of the outputs contains 2 : 's it stops and deletes everything after that. Is there a way of reversing sed to work from right to left?
Just to be clear, desired output of each would be:
49 1146.469387755102 mongodb 192.168.0.8
1 1243.0 jdbc:mysql 192.168.0.8
You could use the below sed command.
sed 's/:[0-9]\{4\}.*//' file
OR
sed 's/:[^:]*$//' file
[^:]* is a negated character class which matches any character except :, zero or more times. $ matches the end of the line. So :[^:]*$ matches everything from the last colon up to the end of the line. Replacing the matched characters with an empty string gives you the desired output.
You can take advantage of the greedy nature of the Kleene *:
sed 's/\(.*\):.*/\1/' file
The .* consumes as much as it can, while still matching the pattern. The captured part of the line is used in the replacement.
Alternatively, using awk (thanks to glenn jackman for setting me straight):
awk -F: -v OFS=: 'NF{NF--}1' file
Set the input and output field separators to a colon and remove the final field by decrementing NF. 1 is true so the default action {print} is performed. The NF condition prevents empty lines from causing an error, which may not be necessary in your case but does no harm.
Output either way:
49 1146.469387755102 mongodb 192.168.0.8
1 1243.0 jdbc:mysql 192.168.0.8
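Another way to look at it is to anchor on the IP address itself rather than on the last colon, which would also cope with extra colons later in the line. A sketch, assuming the address is always a dotted IPv4 quad directly followed by :port (strip_port is a hypothetical name):

```shell
# strip_port: hypothetical helper; keeps everything up to and including
# the first dotted-quad address and drops ":port" plus the rest of the line.
strip_port() {
  sed 's/\([0-9][0-9]*\.[0-9][0-9]*\.[0-9][0-9]*\.[0-9][0-9]*\):.*$/\1/' "$@"
}
```

It can sit at the end of the pipeline in place of the other sed commands, and lines without such an address are passed through untouched.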