Inputting and aligning protein sequences

I have a script for finding mutated positions in protein sequences. The following script does this:
import pandas as pd  # data analysis module
# Ten protein sequences, hard-coded as one comma-separated string
data = 'MTAQDDSYSDGKGDYNTIYLGAVFQLN,MTAQDDSYSDGRGDYNTIYLGAVFQLN,MTSQEDSYSDGKGNYNTIMPGAVFQLN,MTAQDDSYSDGRGDYNTIMPGAVFQLN,MKAQDDSYSDGRGNYNTIYLGAVFQLQ,MKSQEDSYSDGRGDYNTIYLGAVFQLN,MTAQDDSYSDGRGDYNTIYPGAVFQLN,MTAQEDSYSDGRGEYNTIYLGAVFQLQ,MTAQDDSYSDGKGDYNTIMLGAVFQLN,MTAQDDSYSDGRGEYNTIYLGAVFQLN'
df = pd.DataFrame(map(list, data.split(',')))  # one row per sequence, one column per position
# Columns where at least one sequence differs from the first row
I = df.columns[(df.ix[0] != df).any()]
# One-hot encode each variable column, labelled by 1-based position
J = [pd.get_dummies(df[i], prefix=df[i].name + 1, prefix_sep='') for i in I]
print df[[]].join(J)
Here the data (the input protein sequences) is hard-coded. In a real application the user should be able to supply the input sequences instead.
Also, no alignment is done here. I read the Biopython tutorial and got the following script, but I don't know how to combine it with the one above:
from Bio import AlignIO
alignment = AlignIO.read("c:\python27\proj\data1.fasta", "fasta")
print alignment
How can I do this?
What I have tried:
>>> import sys
>>> import pandas as pd
>>> from Bio import AlignIO
>>> data=sys.stdin.read()
MTAQDDSYSDGKGDYNTIYLGAVFQLN
MTAQDDSYSDGRGDYNTIYLGAVFQLN
MTSQEDSYSDGKGNYNTIMPGAVFQLN
MTAQDDSYSDGRGDYNTIMPGAVFQLN
MKAQDDSYSDGRGNYNTIYLGAVFQLQ
MKSQEDSYSDGRGDYNTIYLGAVFQLN
MTAQDDSYSDGRGDYNTIYPGAVFQLN
MTAQEDSYSDGRGEYNTIYLGAVFQLQ
MTAQDDSYSDGKGDYNTIMLGAVFQLN
MTAQDDSYSDGRGEYNTIYLGAVFQLN
^Z
>>> df=pd.DataFrame(map(list,data.split(',')))
>>> I=df.columns[(df.ix[0]!=df).any()]
>>> J=[pd.get_dummies(df[i],prefix=df[i].name+1,prefix_sep='')for i in I]
>>> print df[[]].join(J)
But it gives an empty DataFrame as output.
I also tried the following, but I don't know how to load the entered sequences into my script:
while 1:
    var = raw_input("Enter your sequence here: ")
    print "you entered ", var
Please help me.

When you read in data via:
sys.stdin.read()
the sequences are separated by '\n' rather than ',' (printing data would confirm whether this is the case; it may be system dependent), so you should split on that instead:
df = pd.DataFrame(map(list, data.split('\n')))
A good way to check this kind of thing is to step through the code line by line, where you would see that df is a one-row DataFrame (which then propagates to make I empty).
Aside: what a well written piece of code you are using! :)
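If you also want the sequences to come from an aligned FASTA file instead of stdin, here is a minimal sketch tying the two scripts together (the file name data1.fasta and the use of iloc in place of the older ix are my assumptions; note that AlignIO reads a pre-computed alignment, it does not align for you):
import pandas as pd
from Bio import AlignIO
# Read a pre-computed alignment from a FASTA file (file name is an assumption)
alignment = AlignIO.read("data1.fasta", "fasta")
# One row per sequence, one column per alignment position
df = pd.DataFrame([list(str(rec.seq)) for rec in alignment])
# Positions where at least one sequence differs from the first row
I = df.columns[(df.iloc[0] != df).any()]
# One-hot encode each variable position, labelled by 1-based position
J = [pd.get_dummies(df[i], prefix=str(i + 1), prefix_sep='') for i in I]
print(df[[]].join(J))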

Related

MATLAB .mat in Pandas DataFrame to be used in Tensorflow

I have spent days trying to figure this out; hopefully someone can help.
I am loading a .mat file into Python using scipy.io, placing the struct into a DataFrame, which will then be used in TensorFlow.
from scipy.io import loadmat
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
#import TF
path = '/home/anthony/PycharmProjects/Deep_Learning_MATLAB/circuit-data/for tinghao/template1-lib5-eqns-CR-RESULTS-SET1-FINAL.mat'
raw_data = loadmat(path, squeeze_me=True)
data = raw_data['Graphs']
df = pd.DataFrame(data, dtype=int)
df.pop('transferFunc')
print(df.dtypes)
The output is:
A object
Ln object
types object
nz int64
np int64
dtype: object
Process finished with exit code 0
The struct is 43249x6. Each cell in the 'A' column is a different-sized matrix, e.g. 18x18 or 16x16. Each cell in 'Ln' is a row of letters, each in its own cell. Each cell in 'types' contains 12 columns of numbers; 'nz' and 'np' give me no trouble.
I want to put all the columns into a DataFrame and use column 'A', 'Ln', or 'types' as the labels and 'nz' and 'np' as the features (again, I have no issues with the latter). Can anyone help with this, or suggest some kind of workaround?
The end goal is to have TensorFlow train on nz and np and give me either a matrix, an Ln, or a type.
What type of data does your .mat file hold? Is your application very time-critical?
If you can collect all your data in a struct, you could give jsonencode a try: turn the struct into a JSON file in MATLAB and load it back into Python via the json module (see the json documentation on loading data).
Then you can create a pandas DataFrame via
pd.DataFrame.from_dict()
Of course this is only a workaround. You would still have to ensure the data in the MATLAB struct is correctly ordered before it is exported and transferred to a DataFrame.
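A minimal sketch of the Python side, assuming the struct was written to a file named graphs.json with MATLAB's jsonencode (the file name and the flat key-to-list layout are assumptions):
import json
import pandas as pd
# Load the JSON file exported from MATLAB via jsonencode
with open('graphs.json') as f:
    raw = json.load(f)
# Build a DataFrame; assumes the decoded JSON is a dict of equal-length lists
df = pd.DataFrame.from_dict(raw)
print(df.dtypes)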
raw_data = loadmat(path, squeeze_me=True)
data = raw_data['Graphs']
graph_labels = pd.DataFrame()
graph_labels['perf'] = raw_data['Objective'][0:1000]
graph_labels['np'] = data['np'][0:1000]
The code above helped out. It's very simple and drawn out, but it got the job done. However, it does not work in TensorFlow, because TensorFlow does not accept this format, and that was my main issue. I have to convert the adjacency matrices to networkx graphs and then load them into StellarGraph.
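For that last step, a minimal sketch (assuming data['A'] holds the adjacency matrices as in the snippet above, and networkx 2.x, where from_numpy_array is the relevant constructor):
import networkx as nx
import numpy as np
# Convert one adjacency-matrix cell (e.g. an 18x18 array) into a graph
A = np.asarray(data['A'][0])
g = nx.from_numpy_array(A)
print(g.number_of_nodes(), g.number_of_edges())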

My Bokeh code runs fine, but outputs a blank chart. What am I doing wrong?

I am trying to run some simple code that reads a CSV file and plots the data as a line graph. The code runs fine and gives me the output below, but for some reason it shows a very odd date format on the x-axis, which leads to a very odd line with several apparent outliers (not actually present in the data). Could someone help?
Date,Value
01/11/2020,4.5202
01/12/2020,4.6555
01/01/2021,4.7194
01/02/2021,4.7317
01/03/2021,4.6655
01/04/2021,4.4641
01/05/2021,4.3875
01/06/2021,4.3560
01/07/2021,4.3318
01/08/2021,4.3607
01/09/2021,4.4853
01/10/2021,4.6456
01/11/2021,5.2262
01/12/2021,5.3259
01/01/2022,5.3820
01/02/2022,5.3855
01/03/2022,5.2673
01/04/2022,4.9346
01/05/2022,4.7287
01/06/2022,4.6274
01/07/2022,4.6632
01/08/2022,4.6929
01/09/2022,4.7841
01/10/2022,4.9572
01/11/2022,5.4293
01/12/2022,5.5214
01/01/2023,5.5697
01/02/2023,5.5738
01/03/2023,5.4550
01/04/2023,5.1962
01/05/2023,4.9534
01/06/2023,4.8514
01/07/2023,4.8112
01/08/2023,4.8415
01/09/2023,4.9338
01/10/2023,5.1461
01/11/2023,5.6022
01/12/2023,5.6960
01/01/2024,5.7451
01/02/2024,5.7499
01/03/2024,5.6308
01/04/2024,5.2752
01/05/2024,5.0306
01/06/2024,4.9282
01/07/2024,4.8877
01/08/2024,4.9188
01/09/2024,5.0127
01/10/2024,5.2100
01/11/2024,5.6716
01/12/2024,5.7669
01/01/2025,5.8176
01/02/2025,5.8229
01/03/2025,5.7031
01/04/2025,5.2633
01/05/2025,5.0164
01/06/2025,4.9133
01/07/2025,4.8730
01/08/2025,4.9053
01/09/2025,5.0005
01/10/2025,5.3274
01/11/2025,5.6325
01/12/2025,5.7293
import pandas as pd
# Read in the CSV file: df
df = pd.read_csv('TTFcurve.csv', parse_dates=['Date'])
# Import figure from bokeh.plotting
from bokeh.plotting import figure, output_file, show
output_file("lines.html")
# Create the figure: p
#x = df.Date
#y = df.Value
p = figure(x_axis_label='Date', y_axis_label='Value')
# Plot Value against Date as a red line
p.line(df['Date'], df['Value'], line_color="red")
# Specify the name of the output file and show the result
show(p)
You have to tell Bokeh that your X axis is datetime:
p = figure(..., x_axis_type='datetime')
Regarding the outliers: check the data. I'm almost certain that Bokeh cannot "invent" new points here. If you make sure your data is absolutely fine, please post it so the plot above can be reproduced and checked.
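Putting both fixes together, a minimal sketch of the corrected script (dayfirst=True is my assumption: the sample rows look like DD/MM/YYYY, and parsing them as MM/DD would scramble the ordering and produce exactly the kind of apparent outliers you describe):
import pandas as pd
from bokeh.plotting import figure, output_file, show
# Parse the Date column as datetimes; dayfirst=True assumes DD/MM/YYYY input
df = pd.read_csv('TTFcurve.csv', parse_dates=['Date'], dayfirst=True)
output_file("lines.html")
# x_axis_type='datetime' makes Bokeh format the x axis as dates
p = figure(x_axis_label='Date', y_axis_label='Value', x_axis_type='datetime')
p.line(df['Date'], df['Value'], line_color="red")
show(p)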

Merging geodataframes in geopandas (CRS do not match)

I am trying to merge two GeoDataFrames (I want to see which polygon each point falls in).
The following code first gives me a warning ("CRS does not match!")
and then an error ("RTreeError: Coordinates must not have minimums more than maximums").
What exactly is wrong there? Are CRSs coordinate systems? If so, why are they not loaded the same way?
import pandas as pd
import geopandas as gpd
from shapely.geometry import Point, mapping, shape
from geopandas import GeoDataFrame, read_file
#from geopandas.tools import overlay
from geopandas.tools import sjoin
print('Reading points...')
points=pd.read_csv(points_csv)
points['geometry'] = points.apply(lambda z: Point(z.Latitude, z.Longitude), axis=1)
PointsGeodataframe = gpd.GeoDataFrame(points)
print PointsGeodataframe.head()
print('Reading polygons...')
PolygonsGeodataframe = gpd.GeoDataFrame.from_file(china_shapefile+".shp")
print PolygonsGeodataframe.head()
print('Merging GeoDataframes...')
merged=sjoin(PointsGeodataframe, PolygonsGeodataframe, how='left', op='intersects')
#merged = PointsGeodataframe.merge(PolygonsGeodataframe, left_on='iso_alpha2', right_on='ISO2', how='left')
print(merged.head(5))
Link to data for reproduction:
Shapefile,
GPS points
As noted in the comments on the question, you can eliminate the "CRS does not match!" warning by manually setting PointsGeodataframe.crs = PolygonsGeodataframe.crs (assuming the CRSs are indeed the same for both datasets).
However, that doesn't address the RTreeError. It's possible that you have missing lat/lon data in points_csv - in that case you would end up creating Point objects containing NaN values (i.e. Point(nan nan)), which go on to cause issues in rtree. I had a similar problem and the fix was just to filter out rows with missing coordinate data when loading in the CSV:
points=pd.read_csv(points_csv).dropna(subset=["Latitude", "Longitude"])
I'll add an answer since I was recently struggling with this and couldn't find a great one here. The GeoPandas documentation has a good example of how to solve the "CRS does not match" issue.
I copied the entire code chunk from the documentation below, but the most relevant line is the one where the to_crs() method is used to reproject a GeoDataFrame. You can inspect mygeodataframe.crs to find the CRS of each dataframe, and then call to_crs() to reproject one to match the other, like so:
world = world.to_crs({'init': 'epsg:3395'})
Simply setting PointsGeodataframe.crs = PolygonsGeodataframe.crs will stop the error from being thrown, but will not actually reproject the geometry.
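Applied to the question's dataframes, that would look roughly like this (a sketch only; EPSG:4326 for the CSV points is an assumption you must verify against your data):
# Declare the CRS the points are actually in (assumed WGS84 lat/lon here)
PointsGeodataframe.crs = {'init': 'epsg:4326'}
# Reproject the points into the polygons' CRS before the spatial join
PointsGeodataframe = PointsGeodataframe.to_crs(PolygonsGeodataframe.crs)
merged = sjoin(PointsGeodataframe, PolygonsGeodataframe, how='left', op='intersects')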
Full documentation code for reference:
# load example data
In [1]: world = geopandas.read_file(geopandas.datasets.get_path('naturalearth_lowres'))
# Check original projection
# (it's Plate Carrée! x-y are long and lat)
In [2]: world.crs
Out[2]: {'init': 'epsg:4326'}
# Visualize
In [3]: ax = world.plot()
In [4]: ax.set_title("WGS84 (lat/lon)");
# Reproject to Mercator (after dropping Antarctica)
In [5]: world = world[(world.name != "Antarctica") & (world.name != "Fr. S. Antarctic Lands")]
In [6]: world = world.to_crs({'init': 'epsg:3395'}) # world.to_crs(epsg=3395) would also work
In [7]: ax = world.plot()
In [8]: ax.set_title("Mercator");

How to Render Math Table Properly in IPython Notebook

The math problem that I'm solving gives different analytical solutions in different scenarios, and I would like to summarize the result in a nice table. IPython Notebook renders the list nicely:
for example:
import sympy
from pandas import DataFrame
from sympy import *
init_printing()
a, b, c, d = symbols('a b c d')
t = [[a/b, b/a], [c/d, d/c]]
t
However, when I summarize the answers into a table using DataFrame, the math cannot be rendered any more:
df = DataFrame(t, index=['Situation 1', 'Situation 2'], columns=['Answer1','Answer2'])
df
"print df.to_latex()" also gives the same result. I also tried "print(latex(t))" but it gives this after compiling in LaTex, which is alright, but I still need to manually convert it to a table:
How should I use DataFrame properly in order to render the math properly? Or is there any other way to export the math result into a table in Latex? Thanks!
Update (01/25/14):
Thanks again to @Jakob for solving the problem. It works perfectly for simple matrices, though there are still some minor problems with more complicated math expressions. But I guess, as @asmeurer said, perfection requires an update to IPython and Pandas.
Update (01/26/14):
If I render the result directly, i.e. just print the list, it works fine:
MathJax is currently not able to render tables, hence the most obvious approach (pure LaTeX) does not work.
However, following the advice of @asmeurer, you should use an HTML table and render the cell content as LaTeX. In your case this can be easily achieved by the following intermediate step:
from sympy import latex
tl = map(lambda tc: '$'+latex(tc)+'$',t)
df = DataFrame(tl, index=['Situation 1', 'Situation 2'], columns=['Answer'])
df
which gives:
Update:
In case of two dimensional data, the simple map function will not work directly. To cope with this situation the numpy shape, reshape and ravel functions could be used like:
import numpy as np
t = [[a/b, b/a],[a*a,b*b]]
tl=np.reshape(map(lambda tc: '$'+latex(tc)+'$',np.ravel(t)),np.shape(t))
df = DataFrame(tl, index=['Situation 1', 'Situation 2'], columns=['Answer 1','Answer 2'])
df
This gives:
Update 2:
Pandas crops cell content if the string length exceeds a certain number. E.g. a more complicated expression like
t1 = [a/2+b/2+c/2+d/2]
tl=np.reshape(map(lambda tc: '$'+latex(tc)+'$',np.ravel(t1)),np.shape(t1))
df = DataFrame(tl, index=['Situation 1'], columns=['Answer 1'])
df
gives:
To cope with this issue, a pandas display option has to be altered (for details see here). In the present case max_colwidth has to be changed. The default value is 50, so let's change it to 100:
import pandas as pd
pd.options.display.max_colwidth=100
df
gives:

Stuck importing NetCDF file into Pandas DataFrame

I've been working on this as a beginner for a while. Overall, I want to read in a NetCDF file and import multiple (~50) columns (and 17520 cases) into a Pandas DataFrame. At the moment I have set it up for a list of 4 variables, but I want to be able to expand that somehow. I made a start, but any help on how to loop through this with 50 variables would be great. It does work for 4 variables using the code below. I know it's not pretty - still learning!
Another question I have is that when I try to read the numpy arrays directly into a Pandas DataFrame, it doesn't work and instead creates a DataFrame that is 17520 columns wide. It should be the other way around (transposed). If I create a Series, it works fine, so I have had to use the following lines to get around this. I'm not even sure why they work. Any suggestions for a better way (especially when it comes to 50 variables)?
d={vnames[0] :vartemp[0], vnames[1] :vartemp[1], vnames[2] :vartemp[2], vnames[3] :vartemp[3]}
hs = pd.DataFrame(d,index=times)
The whole code is pasted below:
import pandas as pd
import datetime as dt
import xlrd
import numpy as np
import netCDF4
def excel_to_pydate(exceldate):
    datemode = 0  # datemode: 0 for 1900-based, 1 for 1904-based
    pyear, pmonth, pday, phour, pminute, psecond = xlrd.xldate_as_tuple(exceldate, datemode)
    py_date = dt.datetime(pyear, pmonth, pday, phour, pminute, psecond)
    return py_date

def main():
    filename = 'HowardSprings_2010_L4.nc'
    # Define a list of variable names we want from the netcdf file
    vnames = ['xlDateTime', 'Fa', 'Fh', 'Fg']
    # Open the NetCDF file
    nc = netCDF4.Dataset(filename)
    # Create some lists of size equal to the length of the vnames list.
    temp = list(xrange(len(vnames)))
    vartemp = list(xrange(len(vnames)))
    # Enumerate the list and assign each NetCDF variable to an element in the lists:
    # first get the netcdf variable object and assign it to temp,
    # then strip the data from that and add it to a temporary variable (vartemp).
    for index, variable in enumerate(vnames):
        temp[index] = nc.variables[variable]
        vartemp[index] = temp[index][:]
    # Now call the function to convert to datetime from excel. Assume datemode: 0
    times = [excel_to_pydate(elem) for elem in vartemp[0]]
    # Don't know why I can't just pass a list of variables, i.e. [vartemp[0], vartemp[1], vartemp[2]],
    # but this is the only thing that worked.
    # Create a Pandas dataframe using times as the index
    d = {vnames[0]: vartemp[0], vnames[1]: vartemp[1], vnames[2]: vartemp[2], vnames[3]: vartemp[3]}
    theDataFrame = pd.DataFrame(d, index=times)
    # Define the missing data value and apply it to the DataFrame
    missing = -9999
    theDataFrame1 = theDataFrame.replace({vnames[0]: missing, vnames[1]: missing, vnames[2]: missing, vnames[3]: missing}, 'NaN')

main()
You could replace:
d = {vnames[0]: vartemp[0], ..., vnames[3]: vartemp[3]}
hs = pd.DataFrame(d, index=times)
with
hs = pd.DataFrame(dict(zip(vnames, vartemp)), index=times)
(Passing the bare list vartemp[0:4] as the data argument would lay the arrays out as rows rather than columns - the same transposition problem you ran into - whereas zipping names to arrays builds the dict for you and scales to all 50 variables.)
That said, pandas can read HDF5 directly, so perhaps the same is true for netCDF (which is based on HDF5)...
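If you go down that route today, a minimal sketch using the xarray library (my suggestion, not something from the question) reads the NetCDF file directly and converts straight to pandas:
import xarray as xr
# Open the NetCDF file and flatten every variable into a DataFrame column
ds = xr.open_dataset('HowardSprings_2010_L4.nc')
df = ds.to_dataframe()
print(df.head())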