CSV amino acid composition columnwise - sequence

Sample file:
Column 10: A|Y|E|A
Column 11: W|I|Q|Q
How do I calculate amino acid composition (percentage) specific to each column?
For example: the composition of A in column 10 is 50%, E is 25%, and Y is 25%.
Biopython provides modules to calculate the amino acid composition of an entire file in FASTA format:
from Bio import SeqIO
from Bio.SeqUtils.ProtParam import ProteinAnalysis

for record in SeqIO.parse('output_translation3.fasta', 'fasta'):
    X = ProteinAnalysis(str(record.seq))
    print('\n Results for record: {}'.format(record.id))
    print(X.count_amino_acids()['G'])
    print(X.count_amino_acids()['A'])
    print(X.count_amino_acids()['L'])
    print(X.count_amino_acids()['M'])

from collections import Counter
import re

with open("input.txt") as f:
    for line in f:
        line = line.strip()
        [col, sep, seq] = re.split(r'(: )', line)
        aa = re.split(r'[|]', seq)
        aa_counts = Counter(aa)
        aa_length = len(aa)
        print(col)
        for k, v in aa_counts.items():
            print(" ", k, v / aa_length)
Gives:
Column 10
A 0.5
Y 0.25
E 0.25
Column 11
W 0.25
I 0.25
Q 0.5
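If pandas is already in your toolchain, a value_counts-based alternative (just a sketch, assuming the same "Column N: A|Y|E|A" layout in input.txt) gives the same per-column fractions:
import pandas as pd

rows = []
with open("input.txt") as f:
    for line in f:
        col, seq = line.strip().split(": ")
        # one row per (column, amino acid) occurrence
        rows.extend({"column": col, "aa": aa} for aa in seq.split("|"))

df = pd.DataFrame(rows)
# normalize=True converts per-column counts into fractions
print(df.groupby("column")["aa"].value_counts(normalize=True))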

Related

Specific calculations for unique column values in DataFrame

I want to make a beta calculation in my dataframe, where beta = Σ[(daily return - mean daily return) * (daily market return - mean market return)] / Σ[(daily market return - mean market return)**2].
But I want my beta calculation to apply to specific firms. In my dataframe, each firm has an ID code number (specified in column 1), and I want each ID code to be associated with its own beta.
I tried groupby, loc and a for loop, but it always seems to return an error, since the beta calculation is quite long and requires many parentheses when inserted.
Any idea how to solve this problem? Thank you!
Dataframe:
index  ID  price  daily_return  mean_daily_return_per_ID  daily_market_return  mean_daily_market_return  date
0      1   27.50  0.008         0.0085                     0.0023               0.03345                    01-12-2012
1      2   33.75  0.0745        0.0745                     0.00458              0.0895                     06-12-2012
2      3   29.20  0.00006       0.00006                    0.0582               0.0045                     01-05-2013
3      4   20.54  0.00486       0.005125                   0.0009               0.0006                     27-11-2013
4      1   21.50  0.009         0.0085                     0.0846               0.04345                    04-05-2014
5      4   22.75  0.00539       0.005125                   0.0003               0.0006
I assume you intended the standard form of that equation, beta = Σ[(DR - mean DR) * (DMR - mean DMR)] / Σ[(DMR - mean DMR)**2], where DR is the daily return and DMR is the daily market return. Then the following should compute the beta value for each group identified by ID.
Method 1: Creating our own function to output beta
import pandas as pd
import numpy as np

# beta_data.csv is a csv version of the sample data frame you provided.
df = pd.read_csv("./beta_data.csv")

def beta(daily_return, daily_market_return):
    """
    Returns the beta calculation for two pandas columns of equal length.
    Will return NaN for groups that have just one row each. Adjust
    this function to account for groups that have only a single value.
    """
    mean_daily_return = np.sum(daily_return) / len(daily_return)
    mean_daily_market_return = np.sum(daily_market_return) / len(daily_market_return)
    num = np.sum(
        (daily_return - mean_daily_return)
        * (daily_market_return - mean_daily_market_return)
    )
    denom = np.sum((daily_market_return - mean_daily_market_return) ** 2)
    return num / denom

# groupby the column ID, then 'apply' the function we created above
# to the two desired columns of each group
betas = df.groupby("ID")[["daily_return", "daily_market_return"]].apply(
    lambda x: beta(x["daily_return"], x["daily_market_return"])
)
print(f"betas: {betas}")
Method 2: Using pandas' builtin statistical functions
Notice that beta as stated above is just the covariance of DR and DMR divided by the variance of DMR. Therefore we can write the above program much more concisely as follows.
import pandas as pd
import numpy as np

df = pd.read_csv("./beta_data.csv")

def beta(dr, dmr):
    """
    dr: daily_return (pandas column)
    dmr: daily_market_return (pandas column)
    TODO: Fix the divide-by-zero errors etc.
    """
    num = dr.cov(dmr)
    denom = dmr.var()
    return num / denom

betas = df.groupby("ID")[["daily_return", "daily_market_return"]].apply(
    lambda x: beta(x["daily_return"], x["daily_market_return"])
)
print(f"betas: {betas}")
The output in both cases is:
ID
1 0.012151
2 NaN
3 NaN
4 -0.883333
dtype: float64
The reason for getting NaNs for IDs 2 and 3 is that they only have a single row each. You should modify the function beta to accommodate these corner cases.
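For instance, a minimal guard (just a sketch built on Method 2, returning an explicit NaN instead of relying on a silent NaN from the 0/0 division) could look like this:
def beta(dr, dmr):
    # Groups with fewer than two rows, or with a constant market return,
    # cannot produce a meaningful beta, so return NaN explicitly.
    if len(dr) < 2 or dmr.var() == 0:
        return np.nan
    return dr.cov(dmr) / dmr.var()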
Maybe you can start like this?
id_list = list(set(df["ID"].values.tolist()))
for firm_id in id_list:
    new_df = df.loc[df["ID"] == firm_id]
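From there, one possible continuation (only a sketch, reusing the column names from the question) is to compute a beta per sliced frame and collect the results:
betas = {}
for firm_id in id_list:
    new_df = df.loc[df["ID"] == firm_id]
    dr = new_df["daily_return"]
    dmr = new_df["daily_market_return"]
    # same numerator and denominator as the formula in the question
    num = ((dr - dr.mean()) * (dmr - dmr.mean())).sum()
    denom = ((dmr - dmr.mean()) ** 2).sum()
    betas[firm_id] = num / denom if denom != 0 else float("nan")
print(betas)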

Stratified Sampling with different sizes

I am trying to create a function for stratified sampling which takes in a dataframe created using the faker module along with strata, sample size and a random seed. For the sample size, I want the number of samples in each strata to vary based on user input. This is my code for creating the data:
import pandas as pd
import numpy as np
import random as rn  # generating random numbers
from faker import Faker

fake = Faker()
frame_fake = pd.DataFrame([{"region": fake.random_number(1, fix_len=True),
                            "district": fake.random_number(2, fix_len=True),
                            "enum_area": fake.random_number(5, fix_len=True),
                            "hhs": fake.random_number(3),
                            "pop": fake.random_number(4),
                            "area": fake.random_number(1)} for x in range(100)])
# check for and remove duplicates from enum area (should be unique)
# before any further analysis
mask = frame_fake.duplicated('enum_area', keep='last')
duplicates = frame_fake[mask]
# print(duplicates)
# drop all except last
frame_fake = frame_fake.drop_duplicates('enum_area', keep='last').sort_values(by='enum_area', ascending=True)
# reset index to have them sequentially after sorting by enum_area and
# drop the old index column
frame_fake = frame_fake.reset_index().drop('index',axis=1)
frame_fake
This is the code for sampling:
def stratified_custom(data, strata, sample_size, seed=None):
    # for this part, we sample 5 enum areas in each stratum/region
    # we groupby strata and use the transform method with 'count' parameter
    # to get strata sizes
    data['strat_size'] = data.groupby(strata)[strata].transform('count')
    # map input sample size to each stratum
    data['strat_sample_size'] = data[strata].map(sample_size)
    # groupby strata, get sample size per stratum, cast to int and reset index
    smp_size = data.groupby(strata)['strat_sample_size'].unique().astype(int).reset_index()
    # groupby strata and select a sample per stratum based on the sample size
    # for that stratum
    sample = (data.groupby(strata, group_keys=False)
                  .apply(lambda x: x.sample(smp_size, random_state=seed)))
    # probability of inclusion
    sample['inclusion_prob'] = sample['strat_sample_size'] / sample['strat_size']
    return sample

# pass in strata and sample size as a dict (key, values)
s_size = {1: 7, 2: 5, 3: 5, 4: 5, 5: 5, 6: 5, 7: 5, 8: 5, 9: 8}
(stratified_custom(data=frame_fake, strata='region', sample_size=s_size,
                   seed=99).sort_values(by=['region', 'enum_area'], ascending=True))
I however receive this error:
ValueError: The truth value of a DataFrame is ambiguous. Use a.empty,
a.bool(), a.item(), a.any() or a.all().
I can't figure out what this error is talking about. Any help is appreciated.
After much research, I stumbled upon this post https://stackoverflow.com/a/58794577/14198137 and implemented it in my code, so that the same function can sample with both varying and fixed sample sizes. Here is my code for the data:
import pandas as pd
import numpy as np
import random as rn
from faker import Faker
Faker.seed(99)
fake = Faker()
frame_fake = pd.DataFrame([{"region": fake.random_number(1, fix_len=True),
                            "district": fake.random_number(2, fix_len=True),
                            "enum_area": fake.random_number(5, fix_len=True),
                            "hhs": fake.random_number(3),
                            "pop": fake.random_number(4),
                            "area": rn.randint(1, 2)} for x in range(100)])
frame_fake = frame_fake.drop_duplicates('enum_area',keep='last').sort_values(by='enum_area',ascending=True)
frame_fake = frame_fake.reset_index().drop('index',axis=1)
Here is the updated code for stratified sampling which now works.
def stratified_custom(data, strata, sample_size, seed=None):
    data = data.copy()
    data['strat_size'] = data.groupby(strata)[strata].transform('count')
    try:
        data['strat_sample_size'] = data[strata].map(sample_size)
        smp_size = data.set_index(strata)['strat_sample_size'].to_dict()
        strat2_sample = (data.groupby(strata, group_keys=False)
                             .apply(lambda x: x.sample(smp_size[x.name], random_state=seed)))
        strat2_sample['inclusion_prob'] = strat2_sample['strat_sample_size'] / strat2_sample['strat_size']
        return strat2_sample
    except:
        data['strat_sample_size'] = sample_size
        strat2_sample = (data.groupby(strata, group_keys=False)
                             .apply(lambda x: x.sample(sample_size, random_state=seed)))
        strat2_sample['inclusion_prob'] = strat2_sample['strat_sample_size'] / strat2_sample['strat_size']
        return strat2_sample
s_size={1:3,2:9,3:5,4:5,5:5,6:5,7:5,8:5,9:8}
variablesize = (stratified_custom(data=frame_fake,strata='region',sample_size=s_size, seed=99).sort_values(by=['region','enum_area'],ascending=True)).head()
variablesize
fixedsize = (stratified_custom(data=frame_fake,strata='region',sample_size=3, seed=99).sort_values(by=['region','enum_area'],ascending=True)).head()
fixedsize
The output of variable sample size:
region district enum_area ... strat_size strat_sample_size inclusion_prob
5 1 60 14737 ... 5 3 0.6
26 1 42 34017 ... 5 3 0.6
68 1 31 72092 ... 5 3 0.6
0 2 65 10566 ... 10 9 0.9
15 2 22 25560 ... 10 9 0.9
The output of fixed sample size:
region district enum_area ... strat_size strat_sample_size inclusion_prob
5 1 60 14737 ... 5 3 0.6
26 1 42 34017 ... 5 3 0.6
68 1 31 72092 ... 5 3 0.6
38 2 74 48408 ... 10 3 0.3
43 2 15 56365 ... 10 3 0.3
I was however wondering if there is a better way of achieving this?
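One possible refinement (a sketch, not an authoritative rewrite) is to branch on the type of sample_size instead of using a bare except, which can mask unrelated errors:
def stratified_custom(data, strata, sample_size, seed=None):
    data = data.copy()
    data['strat_size'] = data.groupby(strata)[strata].transform('count')
    # accept either a per-stratum dict or a single fixed integer
    if isinstance(sample_size, dict):
        data['strat_sample_size'] = data[strata].map(sample_size)
    else:
        data['strat_sample_size'] = sample_size
    sample = (data.groupby(strata, group_keys=False)
                  .apply(lambda x: x.sample(int(x['strat_sample_size'].iat[0]),
                                            random_state=seed)))
    sample['inclusion_prob'] = sample['strat_sample_size'] / sample['strat_size']
    return sample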

Random Choice loop through groups of samples

I have a df containing the columns "Income_Groups", "Rate", and "Probability". I need to randomly select a rate for each income group. How can I write a loop and print out the result for each income bin?
The pandas data frame table looks like this:
import pandas as pd
df={'Income_Groups':['1','1','1','2','2','2','3','3','3'],
'Rate':[1.23,1.25,1.56, 2.11,2.32, 2.36,3.12,3.45,3.55],
'Probability':[0.25, 0.50, 0.25,0.50,0.25,0.25,0.10,0.70,0.20]}
df2=pd.DataFrame(data=df)
df2
Shooting in the dark here, but you can use np.random.choice:
import numpy as np

(df2.groupby('Income_Groups')
    .apply(lambda x: np.random.choice(x['Rate'], p=x['Probability']))
)
Output (can vary due to randomness):
Income_Groups
1 1.25
2 2.36
3 3.45
dtype: float64
You can also pass size into np.random.choice:
(df2.groupby('Income_Groups')
.apply(lambda x: np.random.choice(x['Rate'], size=3, p=x['Probability']))
)
Output:
Income_Groups
1 [1.23, 1.25, 1.25]
2 [2.36, 2.11, 2.11]
3 [3.12, 3.12, 3.45]
dtype: object
GroupBy.apply is needed here because of the per-group weights.
import numpy as np
(df2.groupby('Income_Groups')
.apply(lambda gp: np.random.choice(a=gp.Rate, p=gp.Probability, size=1)[0]))
#Income_Groups
#1 1.23
#2 2.11
#3 3.45
#dtype: float64
Another silly way, because your weights seem to have precision to 2 decimal places:
s = df2.set_index(['Income_Groups', 'Probability']).Rate
(s.repeat(s.index.get_level_values('Probability')*100)  # weight
 .sample(frac=1)                                         # shuffle |
 .reset_index()                                          #         |-> random select
 .drop_duplicates(subset=['Income_Groups'])              # select  |
 .drop(columns='Probability'))
# Income_Groups Rate
#0 2 2.32
#1 1 1.25
#3 3 3.45
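For completeness, DataFrame.sample also accepts per-row weights, which sidesteps np.random.choice entirely; a sketch using the same df2 as above:
picked = (df2.groupby('Income_Groups', group_keys=False)
             .apply(lambda g: g.sample(n=1, weights=g['Probability'], random_state=0)))
print(picked[['Income_Groups', 'Rate']])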

tfidf from word counts

I have a categorical variable with large cardinality (1000+ distinct values). Each of these values can occur repeatedly in each train/test instance.
Although this is not really text data, it seems to have similar properties, and I would like to treat it as a text classification problem.
My starting point is a dataframe listing the number of occurrences of each "word" in each "document", e.g.
{'Word1': {0: '1',
           1: '3',
           2: '0',
           3: '0',
           4: '0'},
 'Word2': {0: '0',
           1: '2',
           2: '0',
           3: '0',
           4: '0'}}
I would like to apply tfidf transformation to these "word" counts. How can I do that?
sklearn.feature_extraction.text.TfidfVectorizer seems to expect a sequence of strings or a file as an input which it preprocesses and tokenizes. None of this is necessary in this case as I already have the "word" counts.
So how to get the tfidf transformation of these counts?
I had a similar situation where I was trying to recreate TF-IDF from word counts. Try the code below; it worked for me.
from collections import Counter

import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["The dog ate a sandwich and I ate a sandwich",
          "The wizard transfigured a sandwich"]
vectorizer = TfidfVectorizer(stop_words='english')
tfidfs = vectorizer.fit_transform(corpus)

columns = [k for (v, k) in sorted((v, k)
                                  for k, v in vectorizer.vocabulary_.items())]
tfidfs = pd.DataFrame(tfidfs.todense(), columns=columns)
#   ate   dog   sandwich  transfigured  wizard
# 0 0.75  0.38  0.54      0.00          0.00
# 1 0.00  0.00  0.45      0.63          0.63

df = (1 / pd.DataFrame([vectorizer.idf_], columns=columns))
#   ate   dog   sandwich  transfigured  wizard
# 0 0.71  0.71  1.0       0.71          0.71

corp = [txt.lower().split() for txt in corpus]
corp = [[w for w in d if w in vectorizer.vocabulary_] for d in corp]
tfs = pd.DataFrame([Counter(d) for d in corp]).fillna(0).astype(int)
#   ate  dog  sandwich  transfigured  wizard
# 0 2    1    2         0             0
# 1 0    0    1         1             1

# The first document's TFIDF vector:
tfidf0 = tfs.iloc[0] * (1. / df)
tfidf0 = tfidf0 / np.linalg.norm(tfidf0)
#   ate       dog       sandwich  transfigured  wizard
# 0 0.754584  0.377292  0.536893  0.0           0.0

tfidf1 = tfs.iloc[1] * (1. / df)
tfidf1 = tfidf1 / np.linalg.norm(tfidf1)
#   ate  dog  sandwich  transfigured  wizard
# 0 0.0  0.0  0.449436  0.631667      0.631667
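Since the question already has the counts, another option worth sketching is sklearn's TfidfTransformer, which accepts a document-term count matrix directly (here the string counts from the question are cast to int first):
import pandas as pd
from sklearn.feature_extraction.text import TfidfTransformer

# the count DataFrame from the question, with string counts cast to int
counts = pd.DataFrame({'Word1': {0: '1', 1: '3', 2: '0', 3: '0', 4: '0'},
                       'Word2': {0: '0', 1: '2', 2: '0', 3: '0', 4: '0'}}).astype(int)

transformer = TfidfTransformer()
tfidf = transformer.fit_transform(counts.values)
tfidf_df = pd.DataFrame(tfidf.toarray(), columns=counts.columns, index=counts.index)
print(tfidf_df)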

Numpy or Pandas function for "x-value-window" means or other stats?

Let's say I have x-y data samples sorted by x-value. I'm going to use Pandas as example, but I would be perfectly happy with a Numpy/Scipy-only solution, of course.
In [24]: pd.set_option('display.max_rows', 10)
In [25]: df = pd.DataFrame(np.random.randn(100, 2), columns=['x', 'y'])
In [26]: df = df.sort_values('x')
In [27]: df
Out[27]:
x y
13 -3.403818 0.717744
49 -2.688876 1.936267
74 -2.388332 -0.121599
52 -2.185848 0.617896
90 -2.155343 -1.132673
.. ... ...
65 1.736506 -0.170502
0 1.770901 0.520490
60 1.878376 0.206113
63 2.263602 1.112115
33 2.384195 -1.877502
[100 rows x 2 columns]
Now, I want to kind of "window" it or "discretize" it and get statistics on each window. But I don't want to do the Pandas moving-window functions because they define windows by rows. I want to define windows by a span of x-values, thus "x-value-window". Specifically, let's define each x-value-window with 2 parameters:
- center x-value of each window: in this example, let's say I want x = 0.0 + 0.4 * k for all positive or negative k, thus -3.2, -2.8, -2.4, ..., 1.6, 2.0, 2.4
- width of each window: in this example, let's say I want W = 0.5, so the example windows will be [-3.2-0.25, -3.2+0.25], [-2.8-0.25, -2.8+0.25], ..., [2.4-0.25, 2.4+0.25]; note that the windows overlap, which is intended
Having thus defined the windows, I would like to ask if there's a function that will produce the following data frame (or numpy array):
x y
-3.2 mean of y-values in x-value-window centered at -3.2
-2.8 mean of y-values in x-value-window centered at -2.8
-2.4 mean of y-values in x-value-window centered at -2.4
... ...
1.6 mean of y-values in x-value-window centered at 1.6
2.0 mean of y-values in x-value-window centered at 2.0
2.4 mean of y-values in x-value-window centered at 2.4
Is there anything that will do this for me? Or do I have to totally roll my own (and probably in a very slow python loop instead of fast numpy or pandas code)?
Extra 1: It would be even better if there's support for weighted windows (such as supported by Pandas's rolling_window function) but of course the weights in this case would not be based on how far the sample's row is from the center row of the window, but rather, how far the sample's x-value is from the center of the x-value-window.
Extra 2: It would be nice if there's support for statistics other than mean on the x-value-windows, e.g. (a) variance of the y-values in each x-value-window or (b) count of the number of samples falling within each x-value-window.
I first create a range of x values centered at zero. This range is wide enough so that the min value minus the width and the max value plus the width will capture all x values.
I then iterate through this range of x values, which uses k as the step size. At each point, I use loc to capture the y values located at the selected x value plus and minus the width. The mean of these selected values is then calculated. These values are used to create the result dataframe.
import math
import numpy as np
import pandas as pd
k = .4
w = .5
np.random.seed(0)
df = pd.DataFrame(np.random.randn(100, 2), columns=['x', 'y'])
x_range = np.arange(math.floor((df.x.min() + w) / k) * k,
                    k * (math.ceil((df.x.max() - w) / k) + 1), k)
result = pd.DataFrame((df.loc[df.x.between(x - w, x + w), 'y'].mean() for x in x_range),
                      index=x_range, columns=['y_mean'])
result.index.name = 'centered_x'
>>> result
y_mean
centered_x
-2.400000e+00 0.653619
-2.000000e+00 0.733606
-1.600000e+00 0.576594
-1.200000e+00 0.150462
-8.000000e-01 0.065884
-4.000000e-01 0.022925
-8.881784e-16 0.211693
4.000000e-01 0.057527
8.000000e-01 -0.141970
1.200000e+00 0.233695
1.600000e+00 0.203570
2.000000e+00 0.306409
2.400000e+00 0.576789
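Regarding Extra 2, the same between-based pattern extends to other statistics; a sketch (reusing df, w, k and x_range from above) that also collects the per-window variance and sample count:
rows = []
for x in x_range:
    in_window = df.x.between(x - w, x + w)
    rows.append({'centered_x': x,
                 'y_mean': df.loc[in_window, 'y'].mean(),
                 'y_var': df.loc[in_window, 'y'].var(),
                 'count': int(in_window.sum())})
stats = pd.DataFrame(rows).set_index('centered_x')
print(stats)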