Iterating through Pandas dataframe and dictionary items - pandas

Here's a tough one.
Problem Introduction:
I'm working with two different files: a GFF3, which is basically a nine-column TSV, and a FASTA, which is a plain-text file.
Imported with gffpandas, the GFF3 file looks like this:
seq_id source type start end score strand phase attributes
1 ctg.s1.000000F_arrow prokka gene 56.0 244.0 . + . NHDIEHOJ_00001
3 ctg.s1.000000F_arrow prokka gene 902.0 2167.0 . - . NHDIEHOJ_00002
5 ctg.s1.000001F_arrow prokka gene 2363.0 2905.0 . - . NHDIEHOJ_00003
7 ctg.s1.000003F_arrow prokka gene 2916.0 3947.0 . - . NHDIEHOJ_00004
9 ctg.s2.000000F_arrow prokka gene 4353.0 5174.0 . + . NHDIEHOJ_00005
The seq_id column contains the following unique values:
ctg.s1.000000F_arrow
ctg.s1.000001F_arrow
ctg.s1.000003F_arrow
ctg.s2.000000F_arrow
Now let's move on to the FASTA file, which looks like this:
>ctg.s1.000000F_arrow
CCGGAACATCGTCCTCATCCGCAAAGTCGAGCTCTGCCTCGATCATTGCACGCGAATGGGTCAGCCGTCGGGCCCAACCG
GCATAGAGTGCGGACTGTCCGCCACCGGACTGCTCTATGGCGAGACGACGCTGCATTTCCGTTTCTGCCGCAATCAGGTC
>ctg.s1.000001F_arrow
ACGCCGGCTGCAACTTTGAGAAGATGTGGCGATGTCGACCGCTGCATCCCGCCCTTCTCTGCAGAATTTTCCAGCTGTCC
GAGGACATTGGCAAAAAGGCCCTTGGAATCCTTGCGGCTCATTCTCTCCCCCATGCCTTCCAGAAGAGGCCCTCGAGTTC
>ctg.s1.000003F_arrow
GGCGCTGGTTTTCCCCGACACCTCGCCGCGCGGCGAGGGCGTGGCTGACGACGAGGCTTATGATCTCGGTCAGGGTGCGG
GCTTCTATGTCAATGCGACGCAGAAGCCCTGGTCGCCGCACTATCGCATGTATGATTATATCGTCACCGAATTGCCCGCC
>ctg.s2.000000F_arrow
GCGCTCGACGGCATGCCCGTACGCGGCCGATCCTGCGCCGCTTCCTTAACCTTAGCTGCGGATGGAAAGTCGTCCTCGGA
GTTCGGCTCGCAAACGCTTTCGAGCGCGCAATTGACGACGATGTCGTACCCAACTTAGATCGCCGAACGCCATGAGGTCG
Assume that each uppercase sequence is much longer than the two lines shown. As you can see, the header lines marked with ">" carry the same values as the seq_id column of the GFF3.
I have already written a few lines that load the FASTA file into a dictionary in which the key is the header text after ">" and the value is the uppercase sequence.
Problem processing
Each attributes value in the dataframe has a start and an end value, which define an interval on the corresponding seq_id. I'd like to extract exactly that interval from the FASTA file, but only from the sequence that the seq_id refers to. That is, the interval 56-244 must be looked up only in the FASTA sequence of ctg.s1.000000F_arrow, and the same goes for 902-2167.
The final goal is a dataframe with an additional 10th column (e.g. 'sequence') that contains the FASTA sequence of the interval, like this:
seq_id source type start end score strand phase attributes sequence
1 ctg.s1.000000F_arrow prokka gene 56.0 244.0 . + . NHDIEHOJ_00001 CCGGAACATCGTCCTCATCCG
3 ctg.s1.000000F_arrow prokka gene 902.0 2167.0 . - . NHDIEHOJ_00002 CAAGGACATCGTGATCAATTC
5 ctg.s1.000001F_arrow prokka gene 2363.0 2905.0 . - . NHDIEHOJ_00003 TCGCCGCGCGGCGAGTGATTA
7 ctg.s1.000003F_arrow prokka gene 2916.0 3947.0 . - . NHDIEHOJ_00004 TCGAGCGCGCAATTGACGACG
9 ctg.s2.000000F_arrow prokka gene 4353.0 5174.0 . + . NHDIEHOJ_00005 AGATCGCCGAACGCCATATTT
N.B. The sequences in the sequence column were typed at random and all have the same length here; the real ones will vary in length with end - start for each attributes value.
I hope I was clear in the explanation.
Thank you everybody for the help.

Assuming df is the DataFrame, dic is the dictionary, and sequence positions start at 1 (not 0 as in Python indexing):
df['sequence'] = [dic[k][int(i-1):int(j)] for k, i, j in
zip(df['seq_id'], df['start'], df['end'])]
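For completeness, here is a minimal end-to-end sketch of the same idea on toy data. The `read_fasta` and `reverse_complement` helpers, the contig names and the coordinates are all made up for illustration. Reverse-complementing minus-strand features is an optional assumption that is not part of the one-liner above.

```python
import io

import pandas as pd


def read_fasta(handle):
    """Parse FASTA text into {header: sequence}, joining wrapped lines."""
    sequences = {}
    name = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Keep only the first word of the header as the key.
            name = line[1:].split()[0]
            sequences[name] = []
        elif name is not None:
            sequences[name].append(line)
    return {key: "".join(parts) for key, parts in sequences.items()}


def reverse_complement(seq):
    """Reverse-complement an uppercase DNA string (A, C, G, T only)."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


# Toy inputs: sequences and coordinates are invented for illustration.
fasta_text = ">ctg1\nAACCGGTT\nACGTACGT\n>ctg2\nTTTTCCCC\n"
dic = read_fasta(io.StringIO(fasta_text))

df = pd.DataFrame({
    "seq_id": ["ctg1", "ctg1", "ctg2"],
    "start": [1.0, 7.0, 3.0],
    "end": [4.0, 10.0, 6.0],
    "strand": ["+", "-", "+"],
})

# GFF3 coordinates are 1-based and inclusive, hence start - 1.
df["sequence"] = [
    dic[k][int(i) - 1:int(j)]
    for k, i, j in zip(df["seq_id"], df["start"], df["end"])
]

# Optional assumption: report minus-strand features as reverse complement.
df["stranded"] = [
    reverse_complement(s) if strand == "-" else s
    for s, strand in zip(df["sequence"], df["strand"])
]
```

If a seq_id in the dataframe is missing from the dictionary, the lookup raises a KeyError, which is usually the safer behaviour for mismatched files.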

Related

How can I convert this format of S. tuberosum gene sequence ID - Soltu.DM.10G013850.1 - to Entrez ID?

I have a problem with annotation due to inadequate gene IDs.

'grep' or 'awk' for extracting numeric data from a file

I have a CFD output file containing alphanumeric data. My goal is to extract certain rows of numeric data so that I can plot them. I was able to extract the rows that start with a numeric value using grep. However, some of the extracted rows start with a number but also contain letters, which I do not want. Here is a sample:
3185 interface metric data, zone 1444, binary.
33268 interface metric data, zone 1440, binary.
3d, double precision, pressure-based, SST k-omega solver.
1 1.0000e+00 1.0163e-01 4.9782e-06 1.2250e-05 6.5126e-06 3.8876e+01 4.1845e+03 7.8685e+02 7.9475e+02 7.8234e+02 3.0537e+00 4.4427e+02 106:48:28 4999
2 1.0000e+00 6.5455e-02 1.4961e-04 2.2052e-04 1.3530e-02 6.8334e-01 4.5948e-01 7.9448e+02 8.0249e+02 7.9007e+02 1.3742e+00 5.7040e+02 92:12:06 4998
4587 interface metric data, zone 2541, binary.
6584 interface metric data, zone 1254, binary.
3 1.0000e+00 4.2029e-02 1.5227e-04 2.1588e-04 3.0255e-03 6.4570e-01 1.2661e-01 7.8044e+02 7.9048e+02 7.7804e+02 -2.3999e+05 6.4085e+02 80:35:24 4997
4 9.9121e-01 3.0808e-02 1.1390e-04 1.7132e-04 1.6542e-03 6.0594e-01 3.4626e-02 7.8613e+02 7.9673e+02 7.8422e+02 -1.9033e+05 7.0184e+02 70:56:41 4996
This is the command I used: grep -P '^\s*\d+' file. How can I modify the grep command to give me only the rows with purely numeric data, i.e.
1 1.0000e+00 1.0163e-01 4.9782e-06 1.2250e-05 6.5126e-06 3.8876e+01 4.1845e+03 7.8685e+02 7.9475e+02 7.8234e+02 3.0537e+00 4.4427e+02 106:48:28 4999
2 1.0000e+00 6.5455e-02 1.4961e-04 2.2052e-04 1.3530e-02 6.8334e-01 4.5948e-01 7.9448e+02 8.0249e+02 7.9007e+02 1.3742e+00 5.7040e+02 92:12:06 4998
3 1.0000e+00 4.2029e-02 1.5227e-04 2.1588e-04 3.0255e-03 6.4570e-01 1.2661e-01 7.8044e+02 7.9048e+02 7.7804e+02 -2.3999e+05 6.4085e+02 80:35:24 4997
4 9.9121e-01 3.0808e-02 1.1390e-04 1.7132e-04 1.6542e-03 6.0594e-01 3.4626e-02 7.8613e+02 7.9673e+02 7.8422e+02 -1.9033e+05 7.0184e+02 70:56:41 4996
How can I modify the grep command to give me the last 4 rows only
Pipe the grep output to tail.
grep -P '^\s*\d+' file | tail -n 4
Given that the text in the question is the only thing we have to go on, I see a few patterns we might use to extract the last four lines.
The following matches lines whose first field is a number and contain no commas:
egrep '^[[:space:]]*[0-9][^,]+$'
This one matches lines containing numbers in scientific notation:
grep '[0-9]e[+-][0-9]'
And this one matches lines containing what looks like a time followed by an integer at the end of the line:
egrep '[0-9]+(:[0-9]{2}){2} [0-9]+$'
Or if you want an explicit match for the whole line -- that is, an integer, a number of scientific numbers, a time and then an integer, you can bundle it all together:
egrep '^[[:space:]]*[0-9]+([[:space:]]+-?[0-9]+\.[0-9]+e[+-][0-9]+)+[[:space:]]+[0-9]+(:[0-9]{2}){2} [0-9]+$'
Note that I'm using explicit class names and ERE rather than shortcuts and PCRE to maintain compatibility with non-Linux environments.
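The whole-line match can also be sketched in Python, which may be handy if the rows are going to be plotted from a script anyway. The pattern below is a translation of the idea (integer, scientific-notation numbers, a time, an integer) that allows a multi-digit leading integer. The `numeric_rows` helper is hypothetical and the sample rows are shortened versions of the ones in the question.

```python
import re

# Integer, one or more numbers in scientific notation, a time, an integer.
DATA_LINE = re.compile(
    r"^\s*[0-9]+"
    r"(\s+-?[0-9]+\.[0-9]+e[+-][0-9]+)+"
    r"\s+[0-9]+(:[0-9]{2}){2}\s+[0-9]+\s*$"
)


def numeric_rows(lines):
    """Return only the lines that match DATA_LINE in full."""
    return [line for line in lines if DATA_LINE.match(line)]


# Shortened sample rows (fewer columns than the real output).
sample = [
    "3185 interface metric data, zone 1444, binary.",
    "3d, double precision, pressure-based, SST k-omega solver.",
    "1 1.0000e+00 1.0163e-01 3.0537e+00 106:48:28 4999",
    "4587 interface metric data, zone 2541, binary.",
    "3 1.0000e+00 4.2029e-02 -2.3999e+05 80:35:24 4997",
]
rows = numeric_rows(sample)
```

Here `rows` keeps only the two purely numeric lines, regardless of where the text lines are interleaved.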
If your desired section of data can be identified by a certain header, e.g., the 3d, before it, you can look for the header and only start printing matching lines afterwards, e.g.,
awk '/^\s*3d,/ { in_data=1; next } in_data && /^\s*[0-9]/' file
Here /^\s*3d,/ is the pattern for the header which indicates the beginning of the "data section" (from the next line, i.e., not including the header itself). And /^\s*[0-9]/ is the pattern for lines to print within the data section.
In case there is no such header, you could try to identify the first line of data itself with a more complicated pattern, e.g., the number of fields combined with a regular expression:
awk 'NF == 15 && /^\s*[0-9]+\s/ { in_data=1 } in_data && /^\s*[0-9]/' file
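The header-then-data idea can be sketched in Python as a small state machine. In the sample from the question, lines such as "4587 interface metric data" also appear after the header and start with a digit, so this sketch additionally checks the field count. That extra check, the `rows_after_header` name and its default arguments are assumptions for illustration, not part of the awk commands above.

```python
def rows_after_header(lines, header_prefix="3d,", n_fields=15):
    """Yield digit-led lines after the header that have n_fields fields."""
    in_data = False
    for line in lines:
        stripped = line.strip()
        if not in_data:
            # Switch on at the header; the header itself is not emitted.
            in_data = stripped.startswith(header_prefix)
            continue
        fields = stripped.split()
        if fields and fields[0].isdigit() and len(fields) == n_fields:
            yield stripped


# Synthetic 15-field data row: index, 12 numbers, a time, an integer.
data_row = " ".join(["1"] + ["1.0000e+00"] * 12 + ["106:48:28", "4999"])
sample = [
    "3185 interface metric data, zone 1444, binary.",
    "3d, double precision, pressure-based, SST k-omega solver.",
    data_row,
    "4587 interface metric data, zone 2541, binary.",
]
result = list(rows_after_header(sample))
```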

find partial match between two data frame

I have two data frames.
head(NEexp)
Gene Transcript Ratio_log2 FDR
HLHmgamma HLHmgamma-RA 3.759200 1.09e-10
Brd Brd-RA 3.527000 2.66e-08
CG4080 CG4080-RE 3.378500 2.95e-50
RpII215 RpII215-RA 3.343967 1.82e-10
head(excel$gene)
Enhancer of split mgamma, helix-loop-helix
distal antenna
CG4080 gene product from transcript CG4080-RB
As you can see, the two gene columns match partially (HLHmgamma matches Enhancer of split mgamma, helix-loop-helix; CG4080 matches CG4080 gene product from transcript CG4080-RB). Is there any way that I can link these two?
Code I have tried so far:
genename=as.character(NEexp$Gene)
query=paste("select * from excel where excel.gene LIKE \"", genename,"\ ",sep"")
Newtable<-dbGetQuery(con,query)
dbGetQuery(con,"select * from excel, NEexp where excel.gene LIKE % "NEexp$Gene" %")
You need merge, which is basically the same as a join in SQL. But first you might want to split excel$gene to get the part you want to match.
http://stat.ethz.ch/R-manual/R-devel/library/base/html/merge.html
https://stat.ethz.ch/R-manual/R-devel/library/base/html/strsplit.html
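As a rough illustration of the split-then-merge idea, here is a sketch in pandas rather than R (the language is swapped only for the example). It assumes, hypothetically, that the first whitespace-separated token of the description is the matching key. That rule finds CG4080 but not HLHmgamma, whose description shares no token with the symbol, so a case like that would probably need a separate synonym table.

```python
import pandas as pd

NEexp = pd.DataFrame({
    "Gene": ["HLHmgamma", "Brd", "CG4080", "RpII215"],
    "Ratio_log2": [3.759200, 3.527000, 3.378500, 3.343967],
})
excel = pd.DataFrame({
    "gene": [
        "Enhancer of split mgamma, helix-loop-helix",
        "distal antenna",
        "CG4080 gene product from transcript CG4080-RB",
    ],
})

# Assumption: the first whitespace-separated token is the matching key.
excel["key"] = excel["gene"].str.split().str[0]

# Left join keeps every NEexp row; unmatched rows get NaN in "gene".
merged = NEexp.merge(excel, how="left", left_on="Gene", right_on="key")
```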

Stata - Spin on Reshape

I was working through reshaping a file and was wondering how Stata handled a file in the below format. Using data from a race, for example.
Race_Number Race_Date Racer_1_Name Racer_2_Name Racer_3_Name Racer_1_Position Racer_2_Position Racer_3_Position
Is it possible to transform this to the following?
Race_Number Race_Date Racer_Name Racer Position
Out of curiosity I created the above dataset and reshape did not work and I had to manually manipulate.
We appreciate it when you show us exactly what your input and output were. Things like
...reshape did not work and I had to manually manipulate.
don't tell us much.
Also, a complete toy data set would have helped. I assume you mean Race_Date where you typed Race Date (first code line) and Racer_Position where you typed Racer Position (second code line).
You can try
clear all
set more off
*----- example dataset -----
input ///
Race_Num Race_Dat str5(R1_Name R2_Name R3_Name) R1_Pos R2_Pos R3_Pos
1 5 "Al" "Bob" "Carl" 3 2 1
2 7 "Al" "Bob" "Carl" 3 1 2
3 15 "Al" "Bob" "Carl" 1 2 3
end
format Race_Dat %td
list
*----- what you want -----
forvalues i = 1/3 {
rename R`i'_Name Nam_R`i'
rename R`i'_Pos Pos_R`i'
}
list
reshape long Pos_R Nam_R, i(Race_Num) j(Racer)
order Race_Num Race_Dat
list, sepby(Race_Num)
All I did was change variable names before the reshape.
A better way is to use the @ placeholder, and then there's no need to rename variables:
reshape long R@_Pos R@_Name, i(Race_Num) j(Racer)
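For readers who want to cross-check the reshaped result outside Stata, the same wide-to-long step can be sketched in pandas using the toy data above. This is only an analogue, not Stata's implementation. The `move_racer_number` helper is hypothetical and exists because `pandas.wide_to_long` expects the numeric part at the end of the column name.

```python
import pandas as pd

wide = pd.DataFrame({
    "Race_Num": [1, 2, 3],
    "Race_Dat": [5, 7, 15],
    "R1_Name": ["Al", "Al", "Al"],
    "R2_Name": ["Bob", "Bob", "Bob"],
    "R3_Name": ["Carl", "Carl", "Carl"],
    "R1_Pos": [3, 3, 1],
    "R2_Pos": [2, 1, 2],
    "R3_Pos": [1, 2, 3],
})


def move_racer_number(col):
    """Turn 'R1_Name' into 'Name_1'; leave other columns untouched."""
    if col[0] == "R" and col[1].isdigit():
        number, stub = col[1:].split("_")
        return f"{stub}_{number}"
    return col


renamed = wide.rename(columns=move_racer_number)

# One row per (race, racer); Race_Dat is carried along unchanged.
long_df = (
    pd.wide_to_long(renamed, stubnames=["Name", "Pos"],
                    i="Race_Num", j="Racer", sep="_")
    .reset_index()
    .sort_values(["Race_Num", "Racer"])
    .reset_index(drop=True)
)
```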

gnuplot store one number from data file into variable

OSX v10.6.8 and Gnuplot v4.4
I have a data file with 8 columns. I would like to take the first value from the 6th column and make it the title. Here's what I have so far:
#m1 m2 q taua taue K avgPeriodRatio time
#1 2 3 4 5 6 7 8
K = #read in data here
graph(n) = sprintf("K=%.2e",n)
set term aqua enhanced font "Times-Roman,18"
plot file using 1:3 title graph(K)
And here is what the first few rows of my data file look like:
1.00e-07 1.00e-07 1.00e+00 1.00e+05 1.00e+04 1.00e+01 1.310 12070.00
1.11e-06 1.00e-07 9.02e-02 1.00e+05 1.00e+04 1.00e+01 1.310 12070.00
2.12e-06 1.00e-07 4.72e-02 1.00e+05 1.00e+04 1.00e+01 1.310 12070.00
3.13e-06 1.00e-07 3.20e-02 1.00e+05 1.00e+04 1.00e+01 1.310 12090.00
I don't know how to correctly read in the data or if this is even the right way to go about this.
EDIT #1
Ok, thanks to mgilson I now have
#m1 m2 q taua taue K avgPeriodRatio time
#1 2 3 4 5 6 7 8
set term aqua enhanced font "Times-Roman,18"
K = "`head -1 datafile | awk '{print $6}'`"
print K+0
graph(n) = sprintf("K=%.2e",n)
plot file using 1:3 title graph(K)
but I get the error: Non-numeric string found where a numeric expression was expected
EDIT #2
file = "testPlot.txt"
K = "`head -1 file | awk '{print $6}'`"
K=K+0 #Cast K to a floating point number #this is line 9
graph(n) = sprintf("K=%.2e",n)
plot file using 1:3 title graph(K)
This gives the error:
head: file: No such file or directory
"testPlot.gnu", line 9: Non-numeric string found where a numeric expression was expected
You have a few options...
FIRST OPTION:
use columnheader
plot file using 1:3 title columnheader(6)
I haven't tested it, but this may prevent the first row from actually being plotted.
SECOND OPTION:
use an external utility to get the title:
TITLE="`head -1 datafile | awk '{print $6}'`"
plot 'datafile' using 1:3 title TITLE
If the variable is numeric and you want to reformat it in gnuplot, you can cast a string to a numeric type (integer/float) by adding 0 to it, e.g.:
print "36.5"+0
Then you can format it with sprintf or gprintf as you're already doing.
It's weird that there is no float function. (int will work if you want to cast to an integer).
EDIT
The script below worked for me (when I pasted your example data into a file called "datafile"):
K = "`head -1 datafile | awk '{print $6}'`"
K=K+0 #Cast K to a floating point number
graph(n) = sprintf("K=%.2e",n)
plot "datafile" using 1:3 title graph(K)
EDIT 2 (addresses comments below)
To expand a variable inside backticks, you'll need macros:
set macro
file="mydatafile.txt"
#THE ORDER OF QUOTES (' and ") IS CRUCIAL HERE.
cmd='"`head -1 ' . file . ' | awk ''{print $6}''`"'
# . is string concatenation. (this string has 3 pieces)
# to get a single quote inside a single quoted string
# you need to double. e.g. 'a''b' yields the string a'b
data=@cmd
To address your question 2, it is a good idea to familiarize yourself with shell utilities -- sed and awk can both do it. I'll show a combination of head/tail:
cmd='"`head -2 ' . file . ' | tail -1 | awk ''{print $6}''`"'
should work.
EDIT 3
I recently learned that in gnuplot, system is a function as well as a command. To do the above without all the backtick gymnastics,
data=system("head -1 " . file . " | awk '{print $6}'")
Wow, much better.
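To double-check which value the head/awk pipeline hands to gnuplot, the same extraction can be sketched in Python outside gnuplot. The `first_row_field` helper and the temporary file are assumptions for illustration; the rows are the first two from the question.

```python
import os
import tempfile


def first_row_field(path, column=6):
    """Return column `column` (1-based) of the first line as a float."""
    with open(path) as handle:
        first_line = handle.readline()
    return float(first_line.split()[column - 1])


# Toy file using the first two rows shown in the question.
rows = (
    "1.00e-07 1.00e-07 1.00e+00 1.00e+05 1.00e+04 1.00e+01 1.310 12070.00\n"
    "1.11e-06 1.00e-07 9.02e-02 1.00e+05 1.00e+04 1.00e+01 1.310 12070.00\n"
)
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write(rows)

K = first_row_field(tmp.name)
os.remove(tmp.name)

# Same format string as graph(n) in the gnuplot script.
title = "K=%.2e" % K
```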
This is a very old question, but here's a nice way to get access to a single value anywhere in your data file and save it as a gnuplot-accessible variable:
set term unknown #This terminal will not attempt to plot anything
plot 'myfile.dat' index 0 every 1:1:0:0:0:0 u (var=$1):1
The index number allows you to address a particular dataset (datasets are separated by two blank lines), while every allows you to specify a particular line.
The colon-separated numbers after every should be of the form 1:1:<line_number>:<block_number>:<line_number>:<block_number>, where the line number is the line within the block (starting from 0), and the block number is the number of the block (blocks are separated by a single blank line, again starting from 0). The first and second numbers say to plot every line and every data block, the third and fourth say to start from line <line_number> of block <block_number>, and the fifth and sixth say where to stop. This allows you to select a single line anywhere in your data file.
The last part of the plot command assigns the value in a particular column (in this case, column 1) to your variable (var). A plot command needs two values, so I chose column 1 to plot against my variable assignment statement.
Here is a less 'awk'-ward solution which assigns the value from the first row and 6th column of the file 'Data.txt' to the variable x16.
set table
# Syntax: u 0:($0==RowIndex?(VariableName=$ColumnIndex):$ColumnIndex)
# RowIndex starts with 0, ColumnIndex starts with 1
# 'u' is an abbreviation for the 'using' modifier
plot 'Data.txt' u 0:($0==0?(x16=$6):$6)
unset table
A more general example for storing several values is given below:
# Load data from file to variable
# Gnuplot can only access the data via the "plot" command
set table
# Syntax: u 0:($0==RowIndex?(VariableName=$ColumnIndex):$ColumnIndex)
# RowIndex starts with 0, ColumnIndex starts with 1
# 'u' is an abbreviation for the 'using' modifier
# Example: Assign all values according to: xij = Data33[i,j]; i,j = 1,2,3
plot 'Data33.txt' u 0:($0==0?(x11=$1):$1),\
'' u 0:($0==0?(x12=$2):$2),\
'' u 0:($0==0?(x13=$3):$3),\
'' u 0:($0==1?(x21=$1):$1),\
'' u 0:($0==1?(x22=$2):$2),\
'' u 0:($0==1?(x23=$3):$3),\
'' u 0:($0==2?(x31=$1):$1),\
'' u 0:($0==2?(x32=$2):$2),\
'' u 0:($0==2?(x33=$3):$3)
unset table
print x11, x12, x13 # Data from first row
print x21, x22, x23 # Data from second row
print x31, x32, x33 # Data from third row