Searching multiple lines as a single sequence - python-re

I have some DNA sequencing data that is in FASTA format:
Example
>Seq1
CTTGGCAAGCAACACTTCAGGAACCGATTACCAGTGTACCTTCGTCGATGTAGAGTCCTA
TATCTCCTCTTAGAGTTAGACAGGCGATTACCAGTCAGATCTCCAGTCAGGACTCCAGTC
AGTAGCTATCAACGATACTGACTCTAGATG
I am trying to find a specific sequence and then write out the next 7 letters. I have code that worked when my FASTA was in single-line format:
Example:
>Seq1
CTTGGCAAGCAACACTTCAGGAACCGATTACCAGTGTACCTTCGTCGATGTAGAGTCCTATATCTCCTCTTAGAGTTAGACAGGCGATTACCAGTCAGATCTCCAGTCAGGACTCCAGTCAGTAGCTATCAACGATACTGACTCTAGATG
The code I was using is below:
import re

infile = "/home/Desktop/FASTA Test.fasta"
with open(infile, "r") as seq, open("Fasta_Test.fa", "w") as w:
    for i, line in enumerate(seq):
        if ">" not in line:
            m = re.search("(?<=AGTCAGTAGC)...C...", line)
            if m is not None:
                w.write("%s \n %s \n" % (">seq" + str(int((i + 1) / 2)), m.group(0)))
This code would write out the results in the format:
>seq1
TATCAAC
The issue I am having with the multi-line format is searching the FASTA as one continuous sequence string. I am fairly new to Python, so there is probably an easy solution, but I cannot find it (or the right words to search for to find it).
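A minimal sketch of one way to handle the wrapped format, keeping the paths and search pattern from the question: collect each record's sequence lines into a single string first, then search that string.

import re

infile = "/home/Desktop/FASTA Test.fasta"
with open(infile) as seq, open("Fasta_Test.fa", "w") as w:
    records = []                      # list of (header, full sequence) pairs
    header, chunks = None, []
    for line in seq:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line, []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    # search each reassembled sequence as one continuous string
    for n, (header, sequence) in enumerate(records, start=1):
        m = re.search("(?<=AGTCAGTAGC)...C...", sequence)
        if m is not None:
            w.write(">seq%d\n%s\n" % (n, m.group(0)))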


How can I read and parse files with variant spaces as delim?

I need help solving this problem:
I have a directory full of .txt files that look like this:
file1.no
file2.no
file3.no
And every file has the following structure (I only care about the first two "columns" in the .txt):
#POS SEQ SCORE QQ-INTERVAL STD MSA DATA
#The alpha parameter 0.75858
#The likelihood of the data given alpha and the tree is:
#LL=-4797.62
1 M 0.3821 [0.01331,0.5465] 0.4421 7/7
2 E 0.4508 [0.05393,0.6788] 0.5331 7/7
3 L 0.5334 [0.05393,0.6788] 0.6279 7/7
4 G 0.5339 [0.05393,0.6788] 0.624 7/7
And I want to parse all of them into one DataFrame, while also converting the columns into lists for each row (i.e., the first column should be converted into a string like this: ["MELG"]).
But now I am running into two issues:
How to read the different files and append all of them to a single DataFrame, while also making a single column out of all the rows inside each file
How to parse these files, given that the spaces between the columns vary for almost all of them.
My output should look like this:
|File |SEQ |SCORE|
| --- | ---| --- |
|File1|MELG|0.3821,0.4508,0.5334,0.5339|
|File2|AAHG|0.5412,1.2345,0.0241,0.5901|
|File3|LLKM|0.9812,0.2145,0.4142,0.4921|
So, the first column for the first file (file1.no), the one with single letters, is now in a list, in a row with all the information from that file, and the DataFrame has one row for each file.
Any help is welcome, thanks in advance.
Here is example code that should work for you:
using DataFrames

function parsefile(filename)
    l = readlines(filename)
    filter!(x -> !startswith(x, "#"), l)
    sl = split.(l)
    return (File=filename,
            SEQ=join(getindex.(sl, 2)),
            SCORE=parse.(Float64, getindex.(sl, 3)))
end

df = DataFrame()
foreach(fn -> push!(df, parsefile(fn)), ["file$i.no" for i in 1:3])
Your result will be in the df data frame.
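If you would rather do this in Python, here is a rough pandas equivalent of the same idea (a sketch, assuming files named file1.no to file3.no with the layout shown above):

import glob
import pandas as pd

rows = []
for fn in sorted(glob.glob("file*.no")):
    seq, scores = [], []
    with open(fn) as fh:
        for line in fh:
            if line.startswith("#") or not line.strip():
                continue                 # skip the comment/header lines
            parts = line.split()         # split on any run of whitespace
            seq.append(parts[1])
            scores.append(float(parts[2]))
    rows.append({"File": fn, "SEQ": "".join(seq), "SCORE": scores})

df = pd.DataFrame(rows)
print(df)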

SAS/SQL Text File Reading

I'm looking to upload a ".txt" file into SAS so that I can search the contents for specific characters and words to analyse. The text file in question is poorly formatted, so I would ideally like to have one column, with each word being a new observation, like:
TEXT
1 Hello
2 World
Currently I'm downloading the file into SAS, but there are lots of spaces and it has multiple words per observation.
data mylib.textimport;
    infile "../TEXTTEST.txt" dlm="' ', ',', '.'";
    input __text__ $char300.;
run;
Could anyone help me with how to put every new word into a new column?
Thanks in advance. :)
If you want to read the file "word by word", then just tell SAS what characters you consider to be delimiters and use the FLOWOVER option to read the words. So if you wanted to treat spaces, commas, periods, quotes, tabs, linefeeds and carriage returns as word delimiters, your program could look like this:
data want;
    dlm=' ,."''' || '090A0D'x;
    infile "../TEXTTEST.txt" dlm=dlm flowover;
    length word $300;
    input word @@;   /* double trailing @ holds the line so several words can be read from it */
run;
I would TRANSLATE the characters you want out into spaces and then loop over the remaining words, outputting each one.
Here's some test data:
data have;
    format line $200.;
    input;
    line = _infile_;
datalines;
This is some, test.text
How,about,this cheesey.cheese?
;
run;
Here is a DATA Step to loop through and output what you are looking for:
data want(keep=word);
    format word $200.;
    set have;
    line = translate(line, " ", ",.");  /* convert , and . to space */
    n = countw(line);
    /* loop through the words and output each one */
    do i = 1 to n;
        word = scan(line, i);
        output;
    end;
run;
TRANSLATE converts the characters in the 3rd argument into the characters in the second. Think of the string as an array: it does this replacement for each value in the array.
As this example shows, you probably want to think about other punctuation.
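For what it's worth, the same translate-then-scan idea is easy to prototype outside SAS; here is a rough Python sketch (assuming the text lives in a file named TEXTTEST.txt):

# replace commas and periods with spaces, then emit one word per line
with open("TEXTTEST.txt") as fh:
    text = fh.read().translate(str.maketrans(",.", "  "))
for word in text.split():            # split on any run of whitespace
    print(word)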

Plotting a function directly from a text file

Is there a way to plot a function based on values from a text file?
I know how to define a function in gnuplot and then plot it but that is not what I need.
I have a table with constants for functions that are updated regularly. When this update happens I want to be able to run a script that draws a figure with this new curve. Since there are quite a few figures to draw, I want to automate the procedure.
Here is an example table with constants:
location a b c
1 1 3 4
2
There are two ways I see to solve the problem but I do not know if and how they can be implemented.
I can then use awk to produce the string: f(x)=1(x)**2+3(x)+4, write it to a file and somehow make gnuplot read this new file and plot on a certain x range.
or use awk inside gnuplot something like f(x) = awk /1/ {print "f(x)="$2 etc., or use awk directly in the plot command.
In any case, I'm stuck and have not found a solution to this problem online; do you have any suggestions?
Another possibility, to have a somewhat generic version of this, is the following:
Assume the parameters are stored in a file parameters.dat, with the first line containing the variable names and all others the parameter sets, like:
location a b c
1 1 3 4
The script file looks like this:
file = 'parameters.dat'
par_names = system('head -1 '.file)
par_cnt = words(par_names)

# which parameter set to choose
par_line_num = 2

# select the respective string
par_line = system(sprintf('head -%d ', par_line_num).file.' | tail -1')
par_string = ''

do for [i=1:par_cnt] {
    eval(word(par_names, i).' = '.word(par_line, i))
}

f(x) = a*x**2 + b*x + c
plot f(x) title sprintf('location = %d', location)
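To see what the eval loop is doing, here is the same idea written out in Python (a sketch that assumes the parameters.dat layout shown above and reads the first parameter set):

with open("parameters.dat") as fh:
    names = fh.readline().split()               # header: location a b c
    values = fh.readline().split()              # first parameter set
params = dict(zip(names, (float(v) for v in values)))

def f(x):
    return params["a"] * x**2 + params["b"] * x + params["c"]

print(f(2.0))                                   # evaluate the curve at x = 2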
This question (gnuplot store one number from data file into variable) had some hints for me in the first answer.
In my case I have a file which contains parameters for a parabola. I have saved the parameters in gnuplot variables. Then I plot the function containing the parameter variables for each timestep.
#!/usr/bin/gnuplot
datafile = "parabola.txt"
set terminal pngcairo size 1000,500
set xrange [-100:100]
set yrange [-100:100]

titletext(timepar, apar, cpar) = sprintf("In timestep %d we have parameter a = %f, parameter c = %f", timepar, apar, cpar)

do for [step=1:400] {
    set output sprintf("parabola%04d.png", step)
    # read parameters from file, where the first line is the header, thus the +1
    a = system("awk '{ if (NR == " . step . "+1) printf \"%f\", $1}' " . datafile)
    c = system("awk '{ if (NR == " . step . "+1) printf \"%f\", $2}' " . datafile)
    # convert parameters to numeric format
    a = a + 0.
    c = c + 0.
    set title titletext(step, a, c)
    plot c + a*x**2
}
This gives a series of PNG files called parabola0001.png, parabola0002.png, parabola0003.png, …, each showing a parabola with the parameters read from the file called parabola.txt. The title contains the parameters of the given time step.
To understand the gnuplot system() function, you have to know that:
stuff inside double quotes is not parsed by gnuplot
the dot is for concatenating strings in gnuplot
the double quotes for the awk printf command have to be escaped, to hide them from gnuplot parser
To test this gnuplot script, save it into a file with an arbitrary name, e.g. parabolaplot.gplot, and make it executable (chmod a+x parabolaplot.gplot). The parabola.txt file can be created with
awk 'BEGIN {for (i=1; i<=1000; i++) printf "%f\t%f\n", i/200, i/100}' > parabola.txt
awk '/1/ {print "plot "$2"*x**2+"$3"*x+"$4}' | gnuplot -persist
This will select the matching line and plot it.
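The same pipe-a-command-into-gnuplot idea also works from Python, if you prefer it to awk; a rough sketch using the constants 1, 3, 4 from the example table:

import subprocess

a, b, c = 1, 3, 4                                   # constants from the example table
cmd = "plot %g*x**2 + %g*x + %g\n" % (a, b, c)      # build the gnuplot command
subprocess.run(["gnuplot", "-persist"], input=cmd, text=True)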
This was/is another question about how to extract specific values into variables with gnuplot (maybe it would be worth creating a wiki entry about this topic).
There is no need to use awk; you can do this with gnuplot alone (hence platform-independent), even with gnuplot 4.6.0 (March 2012).
You can do a stats (check help stats) and assign the values to variables.
Data: SO15007620_Parameters.txt
location a b c
1 1 3 4
2 -1 2 3
3 2 1 -1
Script: (works with gnuplot 4.6.0, March 2012)
### read parameters from separate file into variables
reset
FILE = "SO15007620_Parameters.txt"
myLine = 1 # line index 0-based
stats FILE u (a=$2, b=$3, c=$4) every ::myLine::myLine nooutput
f(x) = a*x**2 + b*x + c
plot f(x) w l lc rgb "red" ti sprintf("f(x) = %gx^2 + %gx + %g", a,b,c)
### end of script

gnuplot store one number from data file into variable

OSX v10.6.8 and Gnuplot v4.4
I have a data file with 8 columns. I would like to take the first value from the 6th column and make it the title. Here's what I have so far:
#m1 m2 q taua taue K avgPeriodRatio time
#1 2 3 4 5 6 7 8
K = #read in data here
graph(n) = sprintf("K=%.2e",n)
set term aqua enhanced font "Times-Roman,18"
plot file using 1:3 title graph(K)
And here is what the first few rows of my data file looks like:
1.00e-07 1.00e-07 1.00e+00 1.00e+05 1.00e+04 1.00e+01 1.310 12070.00
1.11e-06 1.00e-07 9.02e-02 1.00e+05 1.00e+04 1.00e+01 1.310 12070.00
2.12e-06 1.00e-07 4.72e-02 1.00e+05 1.00e+04 1.00e+01 1.310 12070.00
3.13e-06 1.00e-07 3.20e-02 1.00e+05 1.00e+04 1.00e+01 1.310 12090.00
I don't know how to correctly read in the data or if this is even the right way to go about this.
EDIT #1
OK, thanks to mgilson, I now have:
#m1 m2 q taua taue K avgPeriodRatio time
#1 2 3 4 5 6 7 8
set term aqua enhanced font "Times-Roman,18"
K = "`head -1 datafile | awk '{print $6}'`"
print K+0
graph(n) = sprintf("K=%.2e",n)
plot file using 1:3 title graph(K)
but I get the error: Non-numeric string found where a numeric expression was expected
EDIT #2
file = "testPlot.txt"
K = "`head -1 file | awk '{print $6}'`"
K=K+0 #Cast K to a floating point number #this is line 9
graph(n) = sprintf("K=%.2e",n)
plot file using 1:3 title graph(K)
This gives the error--> head: file: No such file or directory
"testPlot.gnu", line 9: Non-numeric string found where a numeric expression was expected
You have a few options...
FIRST OPTION:
use columnheader
plot file using 1:3 title columnheader(6)
I haven't tested it, but this may prevent the first row from actually being plotted.
SECOND OPTION:
use an external utility to get the title:
TITLE="`head -1 datafile | awk '{print $6}'`"
plot 'datafile' using 1:3 title TITLE
If the variable is numeric and you want to reformat it, in gnuplot you can cast strings to a numeric type (integer/float) by adding 0 to them, e.g.:
print "36.5"+0
Then you can format it with sprintf or gprintf as you're already doing.
It's weird that there is no float function. (int will work if you want to cast to an integer).
EDIT
The script below worked for me (when I pasted your example data into a file called "datafile"):
K = "`head -1 datafile | awk '{print $6}'`"
K=K+0 #Cast K to a floating point number
graph(n) = sprintf("K=%.2e",n)
plot "datafile" using 1:3 title graph(K)
EDIT 2 (addresses comments below)
To expand a variable in backticks, you'll need macros:
set macro
file="mydatafile.txt"
#THE ORDER OF QUOTES (' and ") IS CRUCIAL HERE.
cmd='"`head -1 ' . file . ' | awk ''{print $6}''`"'
# . is string concatenation. (this string has 3 pieces)
# to get a single quote inside a single-quoted string
# you need to double it, e.g. 'a''b' yields the string a'b
data=@cmd
To address your question 2, it is a good idea to familiarize yourself with shell utilities -- sed and awk can both do it. I'll show a combination of head/tail:
cmd='"`head -2 ' . file . ' | tail -1 | awk ''{print $6}''`"'
should work.
EDIT 3
I recently learned that in gnuplot, system is a function as well as a command. To do the above without all the backtick gymnastics:
data=system("head -1 " . file . " | awk '{print $6}'")
Wow, much better.
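For comparison, grabbing the 6th field of the first line needs no shell tools at all in Python (a sketch, assuming a whitespace-separated file like the data shown above):

with open("mydatafile.txt") as fh:
    K = float(fh.readline().split()[5])   # 6th whitespace-separated column of line 1
print("K=%.2e" % K)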
This is a very old question, but here's a nice way to get access to a single value anywhere in your data file and save it as a gnuplot-accessible variable:
set term unknown #This terminal will not attempt to plot anything
plot 'myfile.dat' index 0 every 1:1:0:0:0:0 u (var=$1):1
The index number allows you to address a particular dataset (separated by two carriage returns), while every allows you to specify a particular line.
The colon-separated numbers after every should be of the form 1:1:<line_number>:<block_number>:<line_number>:<block_number>, where the line number is the line within the block (starting from 0), and the block number is the number of the block (separated by a single carriage return, again starting from 0). The first and second numbers say plot every line and every data block, and the third and fourth say start from line <line_number> and block <block_number>. The fifth and sixth say where to stop. This allows you to select a single line anywhere in your data file.
The last part of the plot command assigns the value in a particular column (in this case, column 1) to your variable (var). There need to be two values in a plot command, so I chose column 1 to plot against my variable assignment statement.
Here is a less 'awk'-ward solution which assigns the value from the first row and 6th column of the file 'Data.txt' to the variable x16.
set table
# Syntax: u 0:($0==RowIndex?(VariableName=$ColumnIndex):$ColumnIndex)
# RowIndex starts with 0, ColumnIndex starts with 1
# 'u' is an abbreviation for the 'using' modifier
plot 'Data.txt' u 0:($0==0?(x16=$6):$6)
unset table
A more general example for storing several values is given below:
# Load data from file to variable
# Gnuplot can only access the data via the "plot" command
set table
# Syntax: u 0:($0==RowIndex?(VariableName=$ColumnIndex):$ColumnIndex)
# RowIndex starts with 0, ColumnIndex starts with 1
# 'u' is an abbreviation for the 'using' modifier
# Example: Assign all values according to: xij = Data33[i,j]; i,j = 1,2,3
plot 'Data33.txt' u 0:($0==0?(x11=$1):$1),\
'' u 0:($0==0?(x12=$2):$2),\
'' u 0:($0==0?(x13=$3):$3),\
'' u 0:($0==1?(x21=$1):$1),\
'' u 0:($0==1?(x22=$2):$2),\
'' u 0:($0==1?(x23=$3):$3),\
'' u 0:($0==2?(x31=$1):$1),\
'' u 0:($0==2?(x32=$2):$2),\
'' u 0:($0==2?(x33=$3):$3)
unset table
print x11, x12, x13 # Data from first row
print x21, x22, x23 # Data from second row
print x31, x32, x33 # Data from third row
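As a cross-check, the same values can be read in a single call in Python (a sketch, assuming Data33.txt is a whitespace-separated 3x3 table):

import numpy as np

data = np.loadtxt("Data33.txt")     # 3x3 array of floats
x11, x12, x13 = data[0]             # first row
print(x11, x12, x13)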

How to load 2D array from a text(csv) file into Octave?

Consider the following text(csv) file:
1, Some text
2, More text
3, Text with comma, more text
How to load the data into a 2D array in Octave? The number can go into the first column, and all text to the right of the first comma (including other commas) goes into the second text column.
If necessary, I can replace the first comma with a different delimiter character.
AFAIK you cannot put strings of different sizes into an array. You need to create a so-called cell array.
A possible way to read the data from your question stored in a file Test.txt into a cell array is
t1 = textread("Test.txt", "%s", "delimiter", "\n");
for i = 1:length(t1)
    j = findstr(t1{i}, ",")(1);
    T{i,1} = t1{i}(1:j - 1);
    T{i,2} = strtrim(t1{i}(j + 1:end));
end
Now
T{3,1} gives you 3 and
T{3,2} gives you Text with comma, more text.
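Incidentally, splitting only on the first comma is a one-liner per row in Python (a sketch using the Test.txt contents above):

rows = []
with open("Test.txt") as fh:
    for line in fh:
        num, text = line.rstrip("\n").split(",", 1)   # split on the first comma only
        rows.append((int(num), text.strip()))
print(rows[2])   # (3, 'Text with comma, more text')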
After many long hours of searching and debugging, here's how I got it to work on Octave 3.2.4, using | as the delimiter (instead of a comma).
The data file now looks like:
1|Some text
2|More text
3|Text with comma, more text
Here's how to call it: data = load_data('data/data_file.csv', NUMBER_OF_LINES);
Limitation: you need to know how many lines you want to get. If you want to get all of them, then you will need to write a function to count the number of lines in the file in order to initialize the cell array. It's all very clunky and primitive. So much for "high level languages like Octave".
Note: After the unpleasant exercise of getting this to work, it seems that Octave is not very useful unless you enjoy wasting your time writing code to do the simplest things. Better choices seems to be R, Python, or C#/Java with a Machine Learning or Matrix library.
function all_messages = load_data(filename, NUMBER_OF_LINES)
    fid = fopen(filename, "r");
    all_messages = cell(NUMBER_OF_LINES, 2);
    counter = 1;
    line = fgetl(fid);
    while line != -1
        separator_index = index(line, '|');
        all_messages{counter, 1} = substr(line, 1, separator_index - 1);  % up to the separator
        all_messages{counter, 2} = substr(line, separator_index + 1, length(line) - separator_index);  % after the separator
        counter++;
        line = fgetl(fid);
    endwhile
    fprintf("Processed %i lines.\n", counter - 1);
    fclose(fid);
end