Why does a table I create from a Pandas dataframe have extra decimal places? - pandas

I have a dataframe which I want to use to create a table. The dataframe contains data in float form which I have already rounded to two decimal places. Here is a dataframe which is a subset of the dataframe I am working on:
          Band   R^2
0  Band2 Train  0.37
1  Band3 Train  0.50
2  Band4 Train  0.19
3   Band2 Test  0.41
4   Band3 Test  0.53
5   Band4 Test  0.12
As you can see, all values in the R^2 column are rounded to two decimal places.
I have written the following simple code to create a table, which I intend to export as a PNG so I can embed it in a LaTeX document:
import matplotlib.pyplot as plt
from pandas.plotting import table

ax1 = plt.subplot(111, frameon=False)
ax1.xaxis.set_visible(False)
ax1.yaxis.set_visible(False)
ax1.set_frame_on(False)
myTable = table(ax1, df)
myTable.auto_set_font_size(False)
myTable.set_fontsize(13)
myTable.scale(1.2, 3.5)
Here is the table:
Can anybody explain why the values in the R^2 column are displayed with more than two decimal places?
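No accepted explanation is reproduced here, but one plausible cause is that the matplotlib table draws the str() of each underlying float, which can expose the full binary floating-point representation rather than pandas' rounded display. A minimal sketch of a common workaround, under that assumption, is to format the column as fixed two-decimal strings before building the table (the dataframe below is just a stand-in for the one in the question):

import pandas as pd

# Hypothetical stand-in for the dataframe from the question.
df = pd.DataFrame({"Band": ["Band2 Train", "Band3 Train"], "R^2": [0.37, 0.50]})

# Render the column as fixed two-decimal strings; whatever draws the cells
# can then never show more than two decimal places.
df["R^2"] = df["R^2"].map("{:.2f}".format)
print(df)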

Related

Specific calculations for unique column values in DataFrame

I want to make a beta calculation in my dataframe, where beta = Σ[(daily return - mean daily return) * (daily market return - mean market return)] / Σ(daily market return - mean market return)**2
But I want my beta calculation to apply to specific firms. In my dataframe, each firm has an ID code number (specified in column 1), and I want each ID code to be associated with its unique beta.
I tried groupby, loc and a for loop, but it always seems to return an error, since the beta calculation is quite long and requires many parentheses when written inline.
Any idea how to solve this problem? Thank you!
Dataframe:
index  ID  price  daily_return  mean_daily_return_per_ID  daily_market_return  mean_daily_market_return  date
0      1   27.50  0.008         0.0085                     0.0023               0.03345                    01-12-2012
1      2   33.75  0.0745        0.0745                     0.00458              0.0895                     06-12-2012
2      3   29.20  0.00006       0.00006                    0.0582               0.0045                     01-05-2013
3      4   20.54  0.00486       0.005125                   0.0009               0.0006                     27-11-2013
4      1   21.50  0.009         0.0085                     0.0846               0.04345                    04-05-2014
5      4   22.75  0.00539       0.005125                   0.0003               0.0006
I assume the following form of your equation is what you intended:
beta = Σ[(daily_return - mean(daily_return)) * (daily_market_return - mean(daily_market_return))] / Σ[(daily_market_return - mean(daily_market_return))**2]
where the sums and means are taken within each ID group. Then the following should compute the beta value for each group identified by ID.
Method 1: Creating our own function to output beta
import pandas as pd
import numpy as np

# beta_data.csv is a CSV version of the sample data frame you provided.
df = pd.read_csv("./beta_data.csv")

def beta(daily_return, daily_market_return):
    """
    Returns the beta calculation for two pandas columns of equal length.
    Will return NaN for groups that have just one row each. Adjust
    this function to account for groups that have only a single value.
    """
    mean_daily_return = np.sum(daily_return) / len(daily_return)
    mean_daily_market_return = np.sum(daily_market_return) / len(daily_market_return)
    num = np.sum(
        (daily_return - mean_daily_return)
        * (daily_market_return - mean_daily_market_return)
    )
    denom = np.sum((daily_market_return - mean_daily_market_return) ** 2)
    return num / denom

# Group by the ID column, then apply the function above to the two
# relevant columns of each group.
betas = df.groupby("ID")[["daily_return", "daily_market_return"]].apply(
    lambda x: beta(x["daily_return"], x["daily_market_return"])
)
print(f"betas: {betas}")
Method 2: Using pandas' builtin statistical functions
Notice that beta as stated above is just the covariance of DR and DMR divided by the variance of DMR. Therefore we can write the above program much more concisely as follows.
import pandas as pd

df = pd.read_csv("./beta_data.csv")

def beta(dr, dmr):
    """
    dr: daily_return (pandas Series)
    dmr: daily_market_return (pandas Series)
    TODO: fix the division-by-zero errors etc.
    """
    num = dr.cov(dmr)
    denom = dmr.var()
    return num / denom

betas = df.groupby("ID")[["daily_return", "daily_market_return"]].apply(
    lambda x: beta(x["daily_return"], x["daily_market_return"])
)
print(f"betas: {betas}")
The output in both cases is:
ID
1 0.012151
2 NaN
3 NaN
4 -0.883333
dtype: float64
The reason for getting NaNs for IDs 2 and 3 is that they only have a single row each. You should modify the function beta to accommodate these corner cases.
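A minimal sketch of one way to guard those corner cases, assuming a group with fewer than two rows (or zero market-return variance) should simply yield NaN rather than raise:

import numpy as np

def beta_safe(dr, dmr):
    """Beta of one group; returns NaN when there are fewer than two rows
    or when the market-return variance is zero."""
    if len(dr) < 2:
        return np.nan
    denom = dmr.var()
    if denom == 0 or np.isnan(denom):
        return np.nan
    return dr.cov(dmr) / denom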
Maybe you can start like this?
id_list = list(set(df["ID"].values.tolist()))
for firm_id in id_list:
    new_df = df.loc[df["ID"] == firm_id]
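A hedged sketch of how that loop could be continued, assuming df and the column names from the sample frame above (the beta formula is just the one restated from the question):

import numpy as np
import pandas as pd

betas = {}
for firm_id in df["ID"].unique():
    sub = df.loc[df["ID"] == firm_id]
    dr = sub["daily_return"]
    dmr = sub["daily_market_return"]
    denom = np.sum((dmr - dmr.mean()) ** 2)
    # Guard against single-row groups, where the denominator is zero.
    betas[firm_id] = np.sum((dr - dr.mean()) * (dmr - dmr.mean())) / denom if denom != 0 else np.nan

print(pd.Series(betas, name="beta"))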

In python pandas: Using '\t' Doesn't change the number of columns when I import data

I'm trying to process some data files from some equipment in my lab. My code works for half the data, but for some reason it doesn't correctly import the rest of the data.
If I import the data without trying to create new columns, it looks like this:
sep=
0 Time [s]\tS1΄_m [ng/cm^2]
1 0.0000\t0.00
2 9.6292\t0.00
3 16.0488\t0.00
4 22.4638\t0.00
[418 rows x 1 columns]
The code is pretty simple and, as I said, it creates two columns for some of the files, but for the others it only makes one column, so I cannot graph the data.
def graph(file):
    data = pd.read_csv(path + "/" + file, sep='\t', header=0)
I get a readout like this:
sep=
Time [s] S1΄_m [ng/cm^2]
0.0000 0.00
9.6292 0.00
16.0488 0.00
22.4638 0.00
[418 rows x 1 columns]
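No answer is reproduced here, but a hedged sketch of two things worth trying, based only on what the readouts show: the problem files appear to start with a literal sep= line, and the \t printed inside the single column suggests the delimiter in those files may be the two-character sequence backslash-t rather than a real tab character (both are assumptions about the file format):

import pandas as pd

def graph(file):
    # `path` is assumed to be defined elsewhere, as in the question.
    # Skip a possible leading "sep=" line and accept either a real tab or a
    # literal backslash-t sequence as the delimiter (a regex separator
    # requires the python engine).
    data = pd.read_csv(
        path + "/" + file,
        sep=r"\t|\\t",
        engine="python",
        skiprows=1,
        header=0,
    )
    return data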

How to show shaded error bands when the data is in two different dataframes?

I want to plot a line graph with shaded error bands around the line. My data is in two different dataframes. The first dataframe, shown below, has the mean; the positive error is mean + stdev and the negative error is mean - stdev. A different dataframe has the x-axis column against which I want to plot the mean and the error bands.
a      b      mean    stdev
0.05   0.06   0.055   0.007
0.1    0.13   0.115   0.02
-0.2   -0.5   -0.35   0.21
I found a few discussions on how to do this with lists and arrays, but nothing with dataframes.
How to plot shaded error bands with seaborn?
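No answer is reproduced here, but a minimal sketch using matplotlib's fill_between, assuming the second dataframe holds a single x-axis column (called 'x' here purely for illustration) with the same number of rows as the first:

import matplotlib.pyplot as plt
import pandas as pd

df_stats = pd.DataFrame({
    "a": [0.05, 0.1, -0.2],
    "b": [0.06, 0.13, -0.5],
    "mean": [0.055, 0.115, -0.35],
    "stdev": [0.007, 0.02, 0.21],
})
df_x = pd.DataFrame({"x": [1, 2, 3]})  # hypothetical x-axis dataframe

fig, ax = plt.subplots()
ax.plot(df_x["x"], df_stats["mean"], label="mean")
# Shade the band between mean - stdev and mean + stdev.
ax.fill_between(
    df_x["x"],
    df_stats["mean"] - df_stats["stdev"],
    df_stats["mean"] + df_stats["stdev"],
    alpha=0.3,
    label="mean ± stdev",
)
ax.legend()
plt.show()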

How to filter a Dataframe based on an ID-Column which corresponds to a second Dataframe containing conditions for each ID efficiently?

I have a dataframe with one ID column and two data columns X, Y containing numeric values. For each ID there are several rows of data.
I have a second dataframe with the same ID column and two numeric columns specifying the lower and upper limit of the X values for each ID.
I want to use the second dataframe to filter the first dataframe so that it only has rows whose X values lie within the X_min-X_max range of the specific ID.
I can solve this by looping over the second dataframe and filtering groupby(ID) elements of the first dataframe, but that is slow for a large number of IDs. Is there an efficient way to solve this?
Example code with the data in df, the ranges in df_ranges and the expected result in df_result. The real dataframe is obviously a lot bigger.
import pandas as pd

x = [2.1, 2.2, 2.6, 2.4, 2.8, 3.5, 2.8, 3.2]
y = [3.1, 3.5, 3.4, 2.7, 2.1, 2.7, 4.1, 4.3]
ID = [0]*4 + [0.1]*4
x_min = [2.0, 3.0]
x_max = [2.5, 3.4]
IDs = [0, 0.1]

df = pd.DataFrame({'ID': ID, 'X': x, 'Y': y})
df_ranges = pd.DataFrame({'ID': IDs, 'X_min': x_min, 'X_max': x_max})
df_result = df.iloc[[0, 1, 3, 7], :]
Possible Solution:
def filter_ranges(grp, df_ranges):
    x_min = df_ranges.loc[df_ranges.ID == grp.name, 'X_min'].values[0]
    x_max = df_ranges.loc[df_ranges.ID == grp.name, 'X_max'].values[0]
    return grp.loc[(grp.X >= x_min) & (grp.X <= x_max), :]

target_df_grp = df.groupby('ID').apply(filter_ranges, df_ranges=df_ranges)
Try this:
merged = df.merge(df_ranges, on='ID')
# Keep only rows whose X lies within the [X_min, X_max] range of their ID.
target_df = merged[(merged.X >= merged.X_min) & (merged.X <= merged.X_max)][['ID', 'X', 'Y']]
print(target_df) will give:
ID X Y
0 0.0 2.1 3.1
1 0.0 2.2 3.5
3 0.0 2.4 2.7
7 0.1 3.2 4.3
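As a quick sanity check (a sketch that assumes the variables defined in the example code above), the merged-and-filtered frame should match the expected df_result up to the index produced by the merge:

# Compare ignoring the index, which the merge resets.
print(target_df.reset_index(drop=True).equals(df_result.reset_index(drop=True)))  # True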

Series of if statements applied to data frame

I have a question on how to do this task. I want to return or group a series of numbers in my data frame. The numbers come from the column 'PD', which ranges from .001 to 1. What I want to do is group those with .9 < 'PD' < .91 to .91 (i.e. return a value of .91), .91 <= 'PD' < .92 to .92, ..., .99 <= 'PD' <= 1 to 1, in a column named 'Grouping'. What I have been doing is manually writing each if statement and then merging it with the base data frame. Can anyone please help me with a more efficient way of doing this? I am still in the early stages of using Python. Sorry if the question seems easy. Thank you for answering and for your time.
Let your data look like this:
>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame({'PD': np.arange(0.001, 1, 0.001), 'data': np.random.randint(10, size=999)})
>>> df.head()
PD data
0 0.001 6
1 0.002 3
2 0.003 5
3 0.004 9
4 0.005 7
Then cut off the last decimal of the PD column. This is a bit tricky, since you run into rounding issues when doing it without converting to str first. E.g.
>>> df['PD'] = df['PD'].apply(lambda x: float('{:.3f}'.format(x)[:-1]))
>>> df.tail()
PD data
994 0.99 1
995 0.99 3
996 0.99 2
997 0.99 1
998 0.99 0
Now you can use pandas' groupby. Do whatever you want with the data, e.g.
>>> df.groupby('PD').agg(lambda x: ','.join(map(str, x)))
data
PD
0.00 6,3,5,9,7,3,6,8,4
0.01 3,5,7,0,4,9,7,1,7,1
0.02 0,0,9,1,5,4,1,6,7,3
0.03 4,4,6,4,6,5,4,4,2,1
0.04 8,3,1,4,6,5,0,6,0,5
[...]
Note that the first group is one item shorter because 0.000 is missing from my sample.
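If the goal from the question is specifically to map each PD to the upper edge of its hundredth-wide bin and store that in a 'Grouping' column, a minimal sketch (not from the answer above, and with the boundary convention 0.90 < PD <= 0.91 -> 0.91 as an assumption) using pd.cut:

import numpy as np
import pandas as pd

df = pd.DataFrame({'PD': np.arange(0.001, 1, 0.001),
                   'data': np.random.randint(10, size=999)})

# Bin edges 0.00, 0.01, ..., 1.00; label every value with the upper edge of
# its bin, so 0.905 -> 0.91 and 0.992 -> 1.0. Values sitting exactly on an
# edge can be affected by floating-point representation.
edges = np.round(np.arange(0.0, 1.01, 0.01), 2)
df['Grouping'] = pd.cut(df['PD'], bins=edges, labels=edges[1:])

print(df.tail())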