Reading BED files into a pandas dataframe (Windows) - pandas

For a bioinformatics project, I would like to read a .BED file into a pandas dataframe, and I have no clue how to do it or what tools/programs are required. Nothing I found on the internet was really applicable to me, as I am working on Windows 10 with Python 3.7 (Anaconda distribution).
Any help would be appreciated.

According to https://software.broadinstitute.org/software/igv/BED:
A BED file (.bed) is a tab-delimited text file that defines a feature track.
According to http://genome.ucsc.edu/FAQ/FAQformat#format1, it contains up to 12 fields (columns) and possibly comment lines starting with the word 'track'. The following is a minimal program to read such a BED file into a pandas dataframe.
import pandas as pd
# Treat everything from a 't' onward as a comment; this skips 'track' lines.
df = pd.read_csv('so58178958.bed', sep='\t', comment='t', header=None)
# Standard BED field names; keep only as many as the file has columns.
header = ['chrom', 'chromStart', 'chromEnd', 'name', 'score', 'strand', 'thickStart', 'thickEnd', 'itemRgb', 'blockCount', 'blockSizes', 'blockStarts']
df.columns = header[:len(df.columns)]
This is just a very simple snippet that treats any line starting with a 't' as a comment. It should work because every 'chrom' field entry starts with a 'c', an 's', or a digit. One caveat: pandas applies the comment character anywhere in a line, not just at the start, so a 't' inside a later field (e.g. the name column) would truncate that line.
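As a quick sanity check, here is a self-contained sketch using a hypothetical example.bed written on the fly (the feature names deliberately avoid the letter 't', per the caveat above):
import pandas as pd
# Write a tiny hypothetical BED file with a 'track' comment line.
with open('example.bed', 'w') as f:
    f.write('track name=demo description="a demo track"\n')
    f.write('chr1\t100\t200\tgene1\t0\t+\n')
    f.write('chr2\t300\t400\tgene2\t0\t-\n')
df = pd.read_csv('example.bed', sep='\t', comment='t', header=None)
header = ['chrom', 'chromStart', 'chromEnd', 'name', 'score', 'strand']
df.columns = header[:len(df.columns)]
print(df)
#   chrom  chromStart  chromEnd   name  score strand
# 0  chr1         100       200  gene1      0      +
# 1  chr2         300       400  gene2      0      -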

If you use pyranges, the columns will be given names and appropriate data types:
import pyranges as pr
df = pr.read_bed("your.bed", as_df=True)
It also has readers for untidy bioinformatics formats such as GTF and GFF3.
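For example (file paths are placeholders; the readers accept the same as_df flag, if memory serves):
import pyranges as pr
# Hypothetical paths; each reader returns a PyRanges, or a DataFrame with as_df=True.
gtf_df = pr.read_gtf("your.gtf", as_df=True)
gff_df = pr.read_gff3("your.gff3", as_df=True)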

Related

Replacing whole string with part of it using Regex in Python Pandas

I have a table in a PDF found at this link: https://taxation-customs.ec.europa.eu/system/files/2022-11/tobacco_products_releases-consumption.pdf
I am trying to clean the data before doing analysis, but I noticed that between 2014-2017 the cigarette data was merged due to an error. Instead of two cells per year in a column for Sweden and the UK, I got one merged cell which looks something like this: 5393688\r28587000
I would like to update the data only for Sweden and keep the first value, i.e. the part before \r.
So far my code was as follows:
import numpy as np
import pandas as pd
import tabula  # needed for tabula.read_pdf below
from pandas import Series, DataFrame
cig = pd.DataFrame(tabula.read_pdf(r"https://taxation-customs.ec.europa.eu/system/files/2022-11/tobacco_products_releases-consumption.pdf", pages='all')[0])
cig.replace(to_replace='N/A', value=0, inplace=True, regex=True)
cig = cig.replace(',', '', regex=True)
After this I tried
df.iloc[26,:].str.replace("('\r').*","")
cig.iloc[26,:] = cig.iloc[26,:].replace("('\r').*","", regex=True)
and
cig.iloc[26,:].replace(to_replace='(?:[0-9]+)([^0-9]{2})([0-9]+)', value='', regex=True)
But none of the above seems to produce the desired result, and I still have values in the same format, i.e. 5393688\r28587000.
Set regex=True, use a pattern that matches from the carriage return to the end of the string (the quotes and parentheses in your pattern were matched literally), and assign the changed subset back to the dataframe:
cig.iloc[26,:] = cig.iloc[26,:].replace(r"\r.*", "", regex=True)
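A quick demonstration on a toy series (made-up values in the same shape as the merged cells):
import pandas as pd
row = pd.Series(["5393688\r28587000", "1234\r5678"])
print(row.replace(r"\r.*", "", regex=True))
# 0    5393688
# 1       1234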

Save output in CSV without losing previous data on that CSV in pandas dataframe

I'm doing sentiment analysis of Twitter data. For this work, I've made several datasets in CSV format, with a different month in each dataset. After preprocessing each dataset individually, I want to save them all in one single CSV file. But when I write the code below using a pandas dataframe:
df.to_csv('dataset.csv', index=False)
it removes the previous data (rows) from that file. Is there any way to keep the previous data too, so that I can merge all the data together? Thank you.
It's not entirely clear what you want from your question, so this is just a guess, but something like this might be what you're looking for. If you keep assigning dataframes to df, then new data will overwrite the old data. Try assigning them to differently named dataframes like df1 and df2. Then you can merge them.
# vertically merge the multiple dataframes and reassign to new variable
df = pd.concat([df1, df2])
# save the dataframe
df.to_csv('my_dataset.csv', index=False)
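For the reassignment step itself, a sketch assuming hypothetical monthly file names:
import pandas as pd
# Hypothetical monthly files: read each into its own dataframe
# instead of overwriting df after every month's preprocessing.
df1 = pd.read_csv('january.csv')
df2 = pd.read_csv('february.csv')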
In Python you can use the open("file") function with the parameter 'a':
open("file", 'a')
The 'a' means "append", so you will add lines at the end of the file.
You can use the same mode parameter for the pandas.DataFrame.to_csv() method.
e.g:
import pandas as pd
# code where you get data and return df
df.to_csv("file", mode='a')  # mode='a' appends instead of overwriting
@thehand0: Your code works, but it's inefficient, so it will take longer for your script to run.
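One caveat with mode='a': to_csv writes the column names on every call, so the header line gets repeated inside the file. A common workaround (a sketch; append_to_csv is a hypothetical helper) is to write the header only when the file does not exist yet:
import os
import pandas as pd
def append_to_csv(df, path="dataset.csv"):
    # Append rows; write the header only on the first write.
    df.to_csv(path, mode="a", index=False, header=not os.path.exists(path))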

Beckhoff TwinCat Scope CSV Format into pandas dataframe

After recording data in Beckhoff TwinCAT Scope, one can export the data to a CSV file. Said CSV file, however, has a rather complicated format. Can anyone suggest the most effective way to import such a file into a pandas DataFrame so I can perform analysis?
An example of the format can be found here:
https://infosys.beckhoff.com/english.php?content=../content/1033/tcscope2/html/TwinCATScopeView2_Tutorial_SaveExport.htm&id=
No need to write a custom parser. Using the example data scope_data.csv:
Name,fasd,,,,
File,C;\,,,,
Start,dfsd,,,,
,,,,,
,,,,,
Name,Peak,Name,PULS1,Name,SINUS_FAST
Net id,123.123.123,Net id,123.123.124,Net Id,123.123.125
Port,801,Port,801,Port,801
,,,,,
0,0.6113936598,0,0.07994111349,0,0.08425652468
0,0.524852539,0,0.2051963401,0,0.4391185847
0,0.4993723482,0,0.2917317117,0,0.4583736263
0,0.5976553194,0,0.8675482865,0,0.8435987898
0,0.06087224998,0,0.7933980583,0,0.5614294705
0,0.1967968423,0,0.3923966599,0,0.1951608414
0,0.9723649064,0,0.5187276782,0,0.7646786192
You can import as follows:
import pandas as pd
scope_data = pd.read_csv(
    "scope_data.csv",
    skiprows=[*range(5), *range(6, 9)],  # skip the metadata block but keep row 5 (channel names)
    usecols=[*range(1, 6, 2)],           # keep only the value columns 1, 3, 5
)
Then you get
>>> scope_data.head()
       Peak     PULS1  SINUS_FAST
0  0.611394  0.079941    0.084257
1  0.524853  0.205196    0.439119
2  0.499372  0.291732    0.458374
3  0.597655  0.867548    0.843599
4  0.060872  0.793398    0.561429
I don't have the original scope CSV, but a little adjustment of skiprows and usecols should give you the desired result.
To read the bulk of the file (ignoring the header material) use the skiprows keyword argument to read_csv:
import pandas as pd
df = pd.read_csv('data.csv', skiprows=18)
For the header material, I think you'd have to write a custom parser.
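For what it's worth, a rough sketch of such a parser, written against the example layout above (the row indices are an assumption; adjust them for the real export):
import csv
with open("scope_data.csv", newline="") as f:
    rows = list(csv.reader(f))
# Rows 5-7 of the example hold the per-channel metadata in
# alternating key/value columns: Name, Net id, Port.
channels = {}
for key, row in zip(("name", "net_id", "port"), rows[5:8]):
    for i in range(1, len(row), 2):
        channels.setdefault(i // 2, {})[key] = row[i]
print(channels)
# {0: {'name': 'Peak', 'net_id': '123.123.123', 'port': '801'}, ...}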

Pandas- Groupby Plot is not working for object

I am new to Pandas and doing some analysis on a CSV file. I have successfully read the CSV and shown all details. I have two columns of object type which I need to plot. I have done a groupby on those two columns and can get the first and all of the data; however, I am not sure how to plot these object types in Pandas. Below is a sample of my groupby, with event_type and event_description as the columns I need to plot. If I can plot the Application and Network values of event_type, that would be a great help.
import pandas as pd
data = pd.read_csv('/Users/temp/Downloads/sample.csv')
data.head()
grouped_df = data.groupby([ "event_type", "event_description"])
grouped_df.first()
As commented, more info is needed, but IIUC, try:
data['event_type'].value_counts(sort=True).plot(kind='barh')
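If the goal is instead a breakdown of event_description within each event_type (a guess at the intent), the grouped counts can be plotted directly:
# Count rows per (event_type, event_description) pair and plot
# one bar group per event_type.
counts = (
    data.groupby(["event_type", "event_description"])
        .size()
        .unstack(fill_value=0)
)
counts.plot(kind="bar", stacked=True)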

inputting and aligning protein sequence

I have a script for finding mutated positions in a protein sequence. The following script will do this.
import pandas as pd #data analysis python module
data = 'MTAQDDSYSDGKGDYNTIYLGAVFQLN,MTAQDDSYSDGRGDYNTIYLGAVFQLN,MTSQEDSYSDGKGNYNTIMPGAVFQLN,MTAQDDSYSDGRGDYNTIMPGAVFQLN,MKAQDDSYSDGRGNYNTIYLGAVFQLQ,MKSQEDSYSDGRGDYNTIYLGAVFQLN,MTAQDDSYSDGRGDYNTIYPGAVFQLN,MTAQEDSYSDGRGEYNTIYLGAVFQLQ,MTAQDDSYSDGKGDYNTIMLGAVFQLN,MTAQDDSYSDGRGEYNTIYLGAVFQLN' #protein sequences
df = pd.DataFrame(map(list,data.split(',')))
I = df.columns[(df.ix[0] != df).any()]
J = [pd.get_dummies(df[i], prefix=df[i].name+1, prefix_sep='') for i in I]
print df[[]].join(J)
Here I gave the data hard-coded, i.e. the input protein sequences. Normally, in an application, the user has to give the input sequences, i.e. I mean soft-coding.
Also, the alignment is not done here. I read the Biopython tutorial and got the following script, but I don't know how to combine it with the one above.
from Bio import AlignIO
alignment = AlignIO.read("c:\python27\proj\data1.fasta", "fasta")
print alignment
How can I do this?
What I have tried:
>>> import sys
>>> import pandas as pd
>>> from Bio import AlignIO
>>> data=sys.stdin.read()
MTAQDDSYSDGKGDYNTIYLGAVFQLN
MTAQDDSYSDGRGDYNTIYLGAVFQLN
MTSQEDSYSDGKGNYNTIMPGAVFQLN
MTAQDDSYSDGRGDYNTIMPGAVFQLN
MKAQDDSYSDGRGNYNTIYLGAVFQLQ
MKSQEDSYSDGRGDYNTIYLGAVFQLN
MTAQDDSYSDGRGDYNTIYPGAVFQLN
MTAQEDSYSDGRGEYNTIYLGAVFQLQ
MTAQDDSYSDGKGDYNTIMLGAVFQLN
MTAQDDSYSDGRGEYNTIYLGAVFQLN
^Z
>>> df=pd.DataFrame(map(list,data.split(',')))
>>> I=df.columns[(df.ix[0]!=df).any()]
>>> J=[pd.get_dummies(df[i],prefix=df[i].name+1,prefix_sep='')for i in I]
>>> print df[[]].join(J)
But it gives an empty DataFrame as output.
I also tried the following, but I don't know how to load these sequences into my script:
while 1:
    var = raw_input("Enter your sequence here:")
    print "you entered ", var
Please help me.
When you read in data via:
sys.stdin.read()
the sequences are separated by '\n' rather than ',' (printing data would confirm whether this is the case; it may be system dependent), so you should split on that:
df = pd.DataFrame(map(list,data.split('\n')))
A good way to check this kind of thing is to go through it line by line, where you would see that df was a one-row DataFrame (which then propagates to make I empty).
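A quick way to see this in the same session (Python 2, matching the transcript above):
print repr(data[:40])                # shows '\n' between sequences, no commas
print len(data.split(','))           # 1  -> the whole input is one chunk
print len(data.strip().split('\n'))  # 10 -> one entry per sequence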
Aside: what a well written piece of code you are using! :)