extracting subtext between two characters using grep - awk

I have a text file that contains information like the following:
#Mp_chzt_1
asdjhsadhasdhdbjashdjaudashdjashdasdhasdhasdh
asdasdkasjdkaskdskadkasdkasdkjaskldasdklasdas
ahsjdasdfdfsdhghrtuztiuiuzozuoiouiouiouiouiou
asjkjieqjeroiweoriksfjksjksjkf
+
!!!#!!!!!!!!++??????????????~~~~~~~~~~~~~
BBBBBBBBBBBBMMMMMM!!!!!++LLLLLL******
#Mp_btrea_1
uokjjkzghqawsdasduihdlöklöaklöskdlkaökgzgzggz
asdasduzuqwtzeqweuvixcvdjfiisduiifuzwpqüqwoeü
kjkjiuijwiqquzwuziziqz
+
**********||||||||||||#########++++?????????
MMMMMMMMMUUUU***+++~~~~~~~~~~~~~~~~~~~~~~~~~~
#Mp_trwe_3
jhtrqhkjiqkjkqwjelasjjljiewkjkljkldjflsjljki8u
immhgwqtzopirpjgbsdkfjieipwippieoroeirkvsdjjfk
jkahdjhjhfuhjkwekksjakjeiuwiurweiurioweuroweod
poplrtm,ernmjhazqweqwjidiipfiopdifosidpfppsdif
mnasnbdhgqweqweipoipoxkajksdökalsklsaksldkasöd
asdas
+
!!!!!!!!!!!!!!!!!!#####???????????????????
I would like to extract only the region between each #Mp_* line and the + line that follows it, and export it to a text file like the following:
#Mp_chzt_1
asdjhsadhasdhdbjashdjaudashdjashdasdhasdhasdh
asdasdkasjdkaskdskadkasdkasdkjaskldasdklasdas
ahsjdasdfdfsdhghrtuztiuiuzozuoiouiouiouiouiou
asjkjieqjeroiweoriksfjksjksjkf
#Mp_btrea_1
uokjjkzghqawsdasduihdlöklöaklöskdlkaökgzgzggz
asdasduzuqwtzeqweuvixcvdjfiisduiifuzwpqüqwoeü
kjkjiuijwiqquzwuziziqz
#Mp_trwe_3
jhtrqhkjiqkjkqwjelasjjljiewkjkljkldjflsjljki8u
immhgwqtzopirpjgbsdkfjieipwippieoroeirkvsdjjfk
jkahdjhjhfuhjkwekksjakjeiuwiurweiurioweuroweod
poplrtm,ernmjhazqweqwjidiipfiopdifosidpfppsdif
mnasnbdhgqweqweipoipoxkajksdökalsklsaksldkasöd
asdas
When I used the following code
grep -o -P '(?<=#MP.*).*(?=+)' query.txt > output.txt
It gave me "grep: nothing to repeat".
Could anyone point out where my mistake is and how to fix it?
Thanks in advance.

Better use awk for this:
awk '/^#/{f=1} /^\+/ {f=0} f' file > output.txt
Or, if the lines may have leading whitespace, allow for it with [[:space:]]* (\s* also works in GNU awk):
awk '/^[[:space:]]*#/{f=1} /^[[:space:]]*\+/ {f=0} f' file > output.txt
This uses a flag f to decide whether a line should be printed.
When it sees a line starting with #, it sets the flag.
When it sees a line starting with +, it clears the flag.
The final bare f is a pattern with no action, so awk prints the current line whenever the flag is non-zero.
With your given input it returns:
#Mp_chzt_1
asdjhsadhasdhdbjashdjaudashdjashdasdhasdhasdh
asdasdkasjdkaskdskadkasdkasdkjaskldasdklasdas
ahsjdasdfdfsdhghrtuztiuiuzozuoiouiouiouiouiou
asjkjieqjeroiweoriksfjksjksjkf
#Mp_btrea_1
uokjjkzghqawsdasduihdlöklöaklöskdlkaökgzgzggz
asdasduzuqwtzeqweuvixcvdjfiisduiifuzwpqüqwoeü
kjkjiuijwiqquzwuziziqz
#Mp_trwe_3
jhtrqhkjiqkjkqwjelasjjljiewkjkljkldjflsjljki8u
immhgwqtzopirpjgbsdkfjieipwippieoroeirkvsdjjfk
jkahdjhjhfuhjkwekksjakjeiuwiurweiurioweuroweod
poplrtm,ernmjhazqweqwjidiipfiopdifosidpfppsdif
mnasnbdhgqweqweipoipoxkajksdökalsklsaksldkasöd
asdas
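As a minimal sketch of the same flag technique on a tiny sample: the function name extract_records and the sample records are made up for illustration, and the header pattern is narrowed to #Mp_ on the assumption that every header starts that way (a quality line that happens to begin with a bare # would otherwise switch printing back on).

```shell
# Hypothetical helper: print each record from its "#Mp_" header up to,
# but not including, the next line that starts with "+".
extract_records() {
    awk '
        /^#Mp_/ { keep = 1 }   # header line: start printing
        /^[+]/  { keep = 0 }   # separator line: stop printing
        keep                   # bare pattern: print while the flag is set
    ' "$@"
}

# Made-up sample input with two records.
sample=$(printf '%s\n' '#Mp_a_1' 'abc' 'def' '+' '#!!#!!' '#Mp_b_2' 'ghi' '+' '~~~~')
result=$(printf '%s\n' "$sample" | extract_records)
printf '%s\n' "$result"
```

Called with a file name instead of standard input, this would be used as extract_records query.txt > output.txt.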

Related

AWK - Replace between two csv files and print correctly

I have 2 files:
f1.csv:
CL*VIN
AV*AZA
PS*LUG
f2.csv:
2100-12-31*1234A*Thomas*Frederuc*1931-02-20*6791237*6791238*test1*1*0*0*CL*Jame 12*13*a1*zz3*D*13*1234*Tex*F
2100-12-31*1235A*Jack*Borbin*1931-02-21*7791238*7791239*test2*1*0*0*PS*Willliams Hou*14*a2*zz4*A*13*1235*Barc*F
2100-12-31*1236A*Pierce*Matheus*1931-02-22*8791239*8791240*test3*1*1*1*AV*Magnum st*15*a3*zz5*A*13*1236*Euo*F
And I want this output:
2100-12-31*1234A*Thomas*Frederuc*1931-02-20*6791237*6791238*test1*1*0*0*VIN*Jame 12*13*a1*zz3*D*13*1234*Tex*F
2100-12-31*1235A*Jack*Borbin*1931-02-21*7791238*7791239*test2*1*0*0*LUG*Willliams Hou*14*a2*zz4*A*13*1235*Barc*F
2100-12-31*1236A*Pierce*Matheus*1931-02-22*8791239*8791240*test3*1*1*1*AZA*Magnum st*15*a3*zz5*A*13*1236*Euo*F
I have the following code:
awk -F"*" 'FNR==NR{ A[$1]=$2;next} ($12 in A){$12=A[$12];print}' OFS='*' f1.csv f2.csv
But the output is:
*Jame 12*13*a1*zz3*D*13*1234*Tex*F931-02-20*6791237*6791238*test1*1*0*0*VIN
*Willliams Hou*14*a2*zz4*A*13*1235*Barc*F791238*7791239*test2*1*0*0*LUG
*Magnum st*15*a3*zz5*A*13*1236*Euo*F-02-22*8791239*8791240*test3*1*1*1*VIN
How can I obtain my desired output?
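Your code works perfectly fine here. What system and awk version are you using?
It looks like a carriage-return problem (the files probably have Windows line endings), so it is best to run this before working with the files:
dos2unix files
Alternatively, with an awk that accepts a regular expression as RS (GNU awk, for example), you can try this: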
awk 'BEGIN{FS=OFS="*";RS="\r\n|\r|\n";}FNR==NR{A[$1]=$2;next}($12 in A){$12=A[$12];print}' f1.csv f2.csv
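If neither dos2unix nor a regex RS is available, a possible alternative is to strip the trailing carriage return inside awk before any field is used. This is only a sketch: the two temporary files and their contents are invented for the demo, and column 3 stands in for column 12 of the real data.

```shell
# Hypothetical sample files with CRLF line endings, built only for this demo.
tmpdir=$(mktemp -d)
printf 'CL*VIN\r\nPS*LUG\r\n' > "$tmpdir/f1.csv"
printf 'a*b*CL*x\r\nc*d*PS*y\r\n' > "$tmpdir/f2.csv"

joined=$(awk 'BEGIN { FS = OFS = "*" }
    { sub(/\r$/, "") }                    # drop the CR left by CRLF endings
    FNR == NR { map[$1] = $2; next }      # first file: build the lookup table
    ($3 in map) { $3 = map[$3]; print }   # second file: replace and print
' "$tmpdir/f1.csv" "$tmpdir/f2.csv")
printf '%s\n' "$joined"
rm -rf "$tmpdir"
```

Without the sub() call, the lookup values would keep their trailing CR, which is what makes the terminal overwrite the start of each output line.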

Removing asterisk (*) from the end of a fasta sequence in a multi fasta file

I have a multi-FASTA file containing predicted proteins from 2 ab initio tools. Every sequence ends with an asterisk (*). I want to remove it from the file. My sequences look like this:
>snapgene1
SFLPSAEAIEKVLSHMSRRIIDDMKAELQQPEMRWFWP*
>snapgene2
SFLPSAEAIEKVLSHIIIIAAAAKKKPPFFDDMKAELQQPEMRWFWP*
I want the sequences to look like this:
>snapgene1
SFLPSAEAIEKVLSHMSRRIIDDMKAELQQPEMRWFWP
>snapgene2
SFLPSAEAIEKVLSHIIIIAAAAKKKPPFFDDMKAELQQPEMRWFWP
Can anyone help me with this? Thank you.
If the text is stored in a file "temp.txt", you can use the command:
sed -i 's/\*$//' temp.txt
In awk, if your FASTA sequences are in file:
$ awk '{sub(/\*$/,"")}1' file
>snapgene1
SFLPSAEAIEKVLSHMSRRIIDDMKAELQQPEMRWFWP
>snapgene2
SFLPSAEAIEKVLSHIIIIAAAAKKKPPFFDDMKAELQQPEMRWFWP
It replaces a trailing * on each line with nothing.
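If a header line could itself end in *, a slightly more cautious variant only touches sequence lines. This is a sketch under that assumption; the function name strip_trailing_asterisk and the sample records are made up.

```shell
# Hypothetical helper: remove one trailing "*" from sequence lines only,
# leaving FASTA header lines (those starting with ">") untouched.
strip_trailing_asterisk() {
    awk '
        !/^>/ { sub(/\*$/, "") }   # sequence line: drop one trailing asterisk
        { print }                  # print header and sequence lines alike
    ' "$@"
}

cleaned=$(printf '%s\n' '>gene1*' 'MKAEL*' '>gene2' 'PEMRW' | strip_trailing_asterisk)
printf '%s\n' "$cleaned"
```

Note that a multi-line sequence would only carry the * on its last line, which this handles the same way as the one-liner above.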

awk/sed - generate an error if 2nd address of range is missing

We are currently using sed to filter output of regression runs. Sometimes we have a filter that looks like this:
/copyright/,/end copyright/d
If that end copyright is ever missing, the rest of the file is deleted. I'm wondering if there's some way to generate an error for this? awk would also be okay to use. I don't really want to add code that reads the file line by line and issues an error if it hits EOF.
here's a string
copyright
2016 jan 15
end copyright
date 2016 jan 5 time 15:36
last one
I'd like to get an error if end copyright is missing. The real filter would also replace the date line with DATE, so it's more than just ripping out the copyright.
You can persuade sed to generate an error if you reach end of input (i.e. see address $) between your start and end, but it won't be a very helpful message:
/copyright/,/end copyright/{
$s//\1/ # here
d
}
This will error if end copyright is missing or on the last line, with an exit status of 1 and the helpful message:
sed: -e expression #1, char 0: invalid reference \1 on `s' command's RHS
If you're using this in a makefile, you might want to echo a helpful message first, or (better) to wrap this in something that catches the error and produces a more useful one.
I tested this with GNU sed; though if you are using GNU sed, you could more easily use its useful extension:
q [EXIT-CODE]
This command only accepts a single address.
Exit 'sed' without processing any more commands or input. Note
that the current pattern space is printed if auto-print is not
disabled with the -n options. The ability to return an exit code
from the 'sed' script is a GNU 'sed' extension.
Q [EXIT-CODE]
This command only accepts a single address.
This command is the same as 'q', but will not print the contents of
pattern space. Like 'q', it provides the ability to return an exit
code to the caller.
So you could simply write
/copyright/,/end copyright/{
$Q 42
d
}
Never use range expressions /start/,/end/: they make trivial code very slightly briefer, but require a complete rewrite or duplicated conditions when you have the tiniest change in requirements. Always use a flag instead. Since sed doesn't support variables, it doesn't support flag variables, so you shouldn't be using sed; you should be using awk instead.
In this case your original code would be:
awk '/copyright/{f=1} !f; /end copyright/{f=0}' file
And your modified code would be:
awk '/copyright/{f=1} !f; /end copyright/{f=0} END{if (f) print "Missing end copyright"}' file
The above is obviously untested since you didn't provide any sample input/output we could test a potential solution against.
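For use in a regression script, the same flag logic can also report the problem through the exit status. The sketch below is one way to do that, not a tested drop-in: the function name filter_copyright is invented, the exit code 1 is an arbitrary choice, and writing to "/dev/stderr" assumes an awk or system that provides that name.

```shell
# Hypothetical helper: delete copyright blocks and fail if one is left open.
filter_copyright() {
    awk '
        /copyright/     { skipping = 1 }   # block starts (also matches the end line)
        !skipping       { print }          # print only lines outside the block
        /end copyright/ { skipping = 0 }   # block ends
        END {
            if (skipping) {
                print "Missing end copyright" > "/dev/stderr"
                exit 1                     # non-zero status for the caller
            }
        }
    ' "$@"
}

# Well-formed input taken from the sample in the question.
filtered=$(printf '%s\n' 'here is a string' 'copyright' '2016 jan 15' \
    'end copyright' 'last one' | filter_copyright)
printf '%s\n' "$filtered"
```

A caller can then write something like filter_copyright run.log > run.filtered || echo "filter failed" and decide what to do with the partial output.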
With sed you can build a loop:
sed -e '/copyright/{:a;/end copyright/d;N;ba;};' file
:a defines the label "a"
/end copyright/d deletes the pattern space, but only when "end copyright" matches
N appends the next line to the pattern space
ba jumps to the label "a"
Note that d ends the loop.
This way you avoid deleting the text all the way to the end of the file.
If you don't want the text to be displayed at all and prefer an error message when a "copyright" block stays unclosed, you obviously need to wait for the end of the file. You can do this with sed too, by storing all the lines in the hold space until the end:
sed -n -e '/copyright/{:a;/end copyright/d;${c\ERROR MESSAGE
;};N;ba;};H;${g;p};' file
H appends the current line to the hold space
g copies the content of the hold space to the pattern space
The file content is only displayed once the last line is reached, with ${g;p}. Otherwise, when the closing "end copyright" is missing, the current line is replaced by the error message with ${c\ERROR MESSAGE\n;} inside the loop.
This way you can check what sed returns before redirecting it to wherever you want.

grep/awk - how to filter out a certain keyword

I have the following line of text, where I want to filter out the N from (KEY_N), etc. Keep in mind that the N is not constant; it can be anything, like (KEY_J), (KEY_K), (KEY_L), (KEY_I), (KEY_SPACE) and so on.
Event: time 1442439135.995248, type 1 (EV_KEY), code 49 (KEY_N), value 0
Update:
I hope that I got the question properly, if not then please let me know.
With GNU grep you can use this:
grep -oP '.*\(\K[^)]+' file
An alternative on non-GNU systems might be to use sed:
sed 's/.*(\([^)]\{1,\}\)).*/\1/' file
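Both commands above return the whole name, for example KEY_N. If only the part after KEY_ is wanted, which is one possible reading of the question, a sed sketch along these lines may do; the function name key_name is made up, and lines without a (KEY_...) group are simply skipped.

```shell
# Hypothetical helper: print only the text after "KEY_" inside "(KEY_...)".
key_name() {
    # -n with the p flag prints only the lines where the substitution matched
    sed -n 's/.*(KEY_\([^)]*\)).*/\1/p' "$@"
}

line='Event: time 1442439135.995248, type 1 (EV_KEY), code 49 (KEY_N), value 0'
name=$(printf '%s\n' "$line" | key_name)
printf '%s\n' "$name"
```

The (EV_KEY) group earlier in the line is not picked up because it does not start with (KEY_.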

awk getline skipping to last line -- possible newline character issue

I'm using
while( (getline line < "filename") > 0 )
within my BEGIN statement, but this while loop only seems to read the last line of the file instead of each line. I think it may be a newline character problem, but really I don't know. Any ideas?
I'm trying to read the data in from a file other than the main input file.
The same syntax actually works for one file, but not another, and the only difference I see is that the one for which it DOES work has "^M" at the end of each line when I look at it in Vim, and the one for which it DOESN'T work doesn't have ^M. But this seems like an odd problem to be having on my (UNIX based) Mac.
I wish I understood what was going with getline a lot better than I do.
You would have to set RS to something more permissive.
Here is an ugly hack to get things working:
RS="[\x0d\x0a\x0d]"
Now, this may require some explanation.
Different systems use different ways to mark the end of a line.
Read http://en.wikipedia.org/wiki/Carriage_return and http://en.wikipedia.org/wiki/Newline if you are interested in it.
Normally awk handles this gracefully, but it appears that in your environment, some files are being naughty.
A file should use 0x0d, or 0x0a, or 0x0d 0x0a (CR+LF) as its line ending, but not a mix of them.
So let's try an example of a mixed data stream:
$ echo -e "foo\x0d\x0abar\x0d\x0adoe\x0arar\x0azoe\x0dqwe\x0dtry" |awk 'BEGIN{while((getline r )>0){print "r=["r"]";}}'
Result:
r=[foo]
r=[bar]
r=[doe]
r=[rar]
try]oe
We can see that the last lines are lost.
Now using the ugly hack to RS
$ echo -e "foo\x0d\x0abar\x0d\x0adoe\x0arar\x0azoe\x0dqwe\x0dtry" |awk 'BEGIN{RS="[\x0d\x0a\x0d]";while((getline r )>0){print "r=["r"]";}}'
Result:
r=[foo]
r=[bar]
r=[doe]
r=[rar]
r=[zoe]
r=[qwe]
r=[try]
We can see every line is obtained, regardless of the 0x0d 0x0a junk :-)
Maybe you should preprocess your input file with, for example, the dos2unix utility (http://sourceforge.net/projects/dos2unix/)?
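If dos2unix is not installed, a rough equivalent can be put together from awk and tr. The sketch below is only an illustration: the function name normalize_newlines is made up, and it assumes every CR in the data is meant as (part of) a line ending.

```shell
# Hypothetical helper: turn CRLF and bare CR line endings into plain LF.
normalize_newlines() {
    # Step 1: drop the CR of each CRLF pair. Step 2: turn remaining bare CRs into LF.
    awk '{ sub(/\r$/, ""); print }' "$@" | tr '\r' '\n'
}

# Same mixed stream as in the example above, written with printf escapes.
records=$(printf 'foo\r\nbar\r\ndoe\nrar\nzoe\rqwe\rtry\n' |
    normalize_newlines |
    awk 'BEGIN { while ((getline r) > 0) print "r=[" r "]" }')
printf '%s\n' "$records"
```

For the original problem, the secondary file could be normalized into a temporary file first and then read with getline line < "filename" as before.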