Pandas: Update() dataframe issue

I am using pandas 0.18 on SUSE Enterprise Linux 11 with Python 2.7.9. I have two tables, A and B.
A contains the following columns and types:
>>> print a.dtypes
cid object
bid int64
li object
lit int64
x1 float64
y1 float64
x2 float64
y2 float64
hit_num object
B contains the following columns and types:
>>> print b.dtypes
cid object
li object
x1 float64
y1 float64
x2 float64
y2 float64
hit_num object
Now here is a sample dataset for A:
cid,bid,li,lit,x1,y1,x2,y2,hit_num
id1,0,m0,1,6775.5711,6102.5771,6775.6051,6102.7731,
id1,0,m0,2,6775.5311,6103.0631,6775.5531,6103.2051,
id1,0,m0,3,6775.6231,6103.0631,6775.6451,6103.2051,
id1,0,m0,4,6775.1631,6103.6571,6775.1971,6103.7451,
Now here is a sample dataset for B:
cid,li,x1,y1,x2,y2,hit_num
id1,m0,6775.1631,6103.6571,6775.1971,6103.7451,hello
id1,m0,6775.6231,6103.0631,6775.6451,6103.2051,world
id1,m0,6775.5311,6103.0631,6775.5531,6103.2051,gotta
id1,m0,6775.5711,6102.5771,6775.6051,6102.7731,go
I do A.update(B), so I'm expecting B[hit_num] to update A[hit_num] by aligning on columns cid,li,x1,y1,x2,y2.
So I expect something like this (unless my understanding of update() is wrong?):
cid,bid,li,lit,x1,y1,x2,y2,hit_num
id1,0,m0,1,6775.5711,6102.5771,6775.6051,6102.7731,go
id1,0,m0,2,6775.5311,6103.0631,6775.5531,6103.2051,gotta
id1,0,m0,3,6775.6231,6103.0631,6775.6451,6103.2051,world
id1,0,m0,4,6775.1631,6103.6571,6775.1971,6103.7451,hello
However, what I end up getting is shown below. The 'lit' column seems to be messed up: there is a duplicate entry of '1', which is not present in A. I am wondering why this is happening. I created a small example and tried to reproduce the issue, but was unsuccessful; there I get the expected results.
However, in the larger table that I'm running my regression on, I am seeing this behavior. I've printed table A, table B, and the result of A.update(B), and I see the output below. I'm not calling any other dataframe operations in between. In pseudocode:
print v['les_tables']['foo']
print overlay_tables['foo']
v['les_tables']['foo'].update(overlay_tables['foo'])
print v['les_tables']['foo']
I am not totally sure how update() works, but I would think it uses some kind of equality comparison to match rows. If so, could x1,y1,x2,y2 being float64 cause any issue? Any ideas what I'm doing wrong?
I've confirmed the columns to align on are the same name/type in both A/B (see A.dtypes/B.dtypes above).
cid,bid,li,lit,x1,y1,x2,y2,hit_num
id1,0,m0,1,6775.5711,6102.5771,6775.6051,6102.7731,go
id1,0,m0,3,6775.5311,6103.0631,6775.5531,6103.2051,gotta
id1,0,m0,2,6775.6231,6103.0631,6775.6451,6103.2051,world
id1,0,m0,1,6775.1631,6103.6571,6775.1971,6103.7451,hello

try this:
In [73]: df = (A.set_index(['cid','li','x1','y1','x2','y2'])
   ....:         .drop(['hit_num'], axis=1)
   ....:         .join(B.set_index(['cid','li','x1','y1','x2','y2']))
   ....:         .reset_index()
   ....:        )
In [74]: df
Out[74]:
   cid  li         x1         y1         x2         y2  bid  lit hit_num
0  id1  m0  6775.5711  6102.5771  6775.6051  6102.7731    0    1      go
1  id1  m0  6775.5311  6103.0631  6775.5531  6103.2051    0    2   gotta
2  id1  m0  6775.6231  6103.0631  6775.6451  6103.2051    0    3   world
3  id1  m0  6775.1631  6103.6571  6775.1971  6103.7451    0    4   hello
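For context on why the original call misbehaves: DataFrame.update() aligns on the index (and column labels), not on the value columns, so two frames with plain 0..n indexes in different row orders will mix up hit_num values. If you prefer to keep using update(), a minimal sketch (not from the answer above, and assuming the key tuples are unique and the float key values are bit-identical in both frames) would be:

key = ['cid', 'li', 'x1', 'y1', 'x2', 'y2']
a_keyed = A.set_index(key)
a_keyed.update(B.set_index(key))   # fills hit_num where the key tuples match
result = a_keyed.reset_index()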

Related

Specific calculations for unique column values in DataFrame

I want to make a beta calculation in my dataframe, where beta = Σ((daily return - mean daily return) * (daily market return - mean market return)) / Σ((daily market return - mean market return)**2).
But I want my beta calculation to apply to specific firms. In my dataframe, each firm has an ID code number (specified in column 1), and I want each ID code to be associated with its own beta.
I tried groupby, loc and a for loop, but it always seems to return an error, since the beta calculation is quite long and requires many parentheses when inserted.
Any idea how to solve this problem? Thank you!
Dataframe:
index ID price daily_return mean_daily_return_per_ID daily_market_return mean_daily_market_return date
0 1 27.50 0.008 0.0085 0.0023 0.03345 01-12-2012
1 2 33.75 0.0745 0.0745 0.00458 0.0895 06-12-2012
2 3 29.20 0.00006 0.00006 0.0582 0.0045 01-05-2013
3 4 20.54 0.00486 0.005125 0.0009 0.0006 27-11-2013
4 1 21.50 0.009 0.0085 0.0846 0.04345 04-05-2014
5 4 22.75 0.00539 0.005125 0.0003 0.0006
I assume the following form of your equation is what you intended.
Then the following should compute the beta value for each group
identified by ID.
Method 1: Creating our own function to output beta
import pandas as pd
import numpy as np

# beta_data.csv is a csv version of the sample data frame you provided.
df = pd.read_csv("./beta_data.csv")

def beta(daily_return, daily_market_return):
    """
    Returns the beta calculation for two pandas columns of equal length.
    Will return NaN for columns that have just one row each. Adjust
    this function to account for groups that have only a single value.
    """
    mean_daily_return = np.sum(daily_return) / len(daily_return)
    mean_daily_market_return = np.sum(daily_market_return) / len(daily_market_return)
    num = np.sum(
        (daily_return - mean_daily_return)
        * (daily_market_return - mean_daily_market_return)
    )
    denom = np.sum((daily_market_return - mean_daily_market_return) ** 2)
    return num / denom

# Group by the ID column, then apply the function above to the two desired columns.
betas = df.groupby("ID")[["daily_return", "daily_market_return"]].apply(
    lambda x: beta(x["daily_return"], x["daily_market_return"])
)
print(f"betas: {betas}")
Method 2: Using pandas' builtin statistical functions
Notice that beta as stated above is just the covariance of DR and DMR divided by the variance of DMR. Therefore we can write the above program much more concisely as follows.
import pandas as pd
import numpy as np

df = pd.read_csv("./beta_data.csv")

def beta(dr, dmr):
    """
    dr: daily_return (pandas column)
    dmr: daily_market_return (pandas column)
    TODO: handle the divide-by-zero errors etc.
    """
    num = dr.cov(dmr)
    denom = dmr.var()
    return num / denom

betas = df.groupby("ID")[["daily_return", "daily_market_return"]].apply(
    lambda x: beta(x["daily_return"], x["daily_market_return"])
)
print(f"betas: {betas}")
The output in both cases is:
ID
1 0.012151
2 NaN
3 NaN
4 -0.883333
dtype: float64
The reason you get NaNs for IDs 2 and 3 is that they only have a single row each. You should modify the function beta to accommodate these corner cases.
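If you want to guard those corner cases explicitly, one possible sketch (beta_safe is a hypothetical name, not part of the original answer):

def beta_safe(dr, dmr):
    # Groups with fewer than two rows (or zero market variance) cannot
    # support a slope estimate, so return NaN explicitly.
    if len(dmr) < 2 or dmr.var() == 0:
        return float("nan")
    return dr.cov(dmr) / dmr.var()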
Maybe you can start like this?
id_list = list(set(df["ID"].values.tolist()))
for firm_id in id_list:
    new_df = df.loc[df["ID"] == firm_id]
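If you go down that route, the loop could be finished along these lines (a sketch only, assuming pandas is imported as pd and the column names from the sample dataframe):

betas = {}
for firm_id in id_list:
    sub = df.loc[df["ID"] == firm_id]
    dr, dmr = sub["daily_return"], sub["daily_market_return"]
    denom = ((dmr - dmr.mean()) ** 2).sum()
    # Guard against single-row firms, which have a zero denominator.
    betas[firm_id] = ((dr - dr.mean()) * (dmr - dmr.mean())).sum() / denom if denom else float("nan")
betas = pd.Series(betas, name="beta")
print(betas)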

Pandas (with def and np.where): error with values in a dataframe row conditioned on another dataframe row

I have dataframes A of shape XxY with values and dataframes B of shape ZxY to be filled with statistics calculated from A.
As an example:
A = pd.DataFrame(np.array(range(9)).reshape((3,3)))
B = pd.DataFrame(np.array(range(6)).reshape((2,3)))
Now I need to fill row 1 of B with quantile(0.5) of A columns where row 0 of B > 1 (else: np.nan). I need to use a function of the kind:
def mydef(df0, df1):
    df1.loc[1] = np.where(df1.loc[0] > 1,
                          df0.quantile(0.5),
                          np.nan)

mydef(A, B)
Now B is:
0 1 2
0 0.0 1.0 2.0
1 NaN NaN 3.5
It works perfectly for these mock dataframes and all my real ones apart from one.
For that one this error is raised:
ValueError: cannot set using a list-like indexer with a different length than the value
When I run the same code without calling a function, it doesn't raise any error.
Since I need to use a function, any suggestion?
I found the error. I erroneously had the same label twice in the index. Essentially my dataframe B was something like:
B = pd.DataFrame(np.array(range(9)).reshape((3,3)), index=[0,0,1])
so that calling the def:
def mydef(df0, df1):
    df1.loc[1] = np.where(df1.loc[0] > 1,
                          df0.quantile(0.5),
                          np.nan)
would cause the condition and the if-false arguments of np.where to not match their shapes, I guess.
I'm still not sure why running the same code outside the def worked.
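A small reproduction of the effect described above (a sketch, not from the original post):

import numpy as np
import pandas as pd

df0 = pd.DataFrame(np.arange(9).reshape((3, 3)))
df1 = pd.DataFrame(np.arange(9).reshape((3, 3)), index=[0, 0, 1])

# With the duplicated label, df1.loc[0] selects two rows, so the np.where
# condition is 2x3 while df0.quantile(0.5) has length 3; assigning the result
# back to the single row df1.loc[1] then fails with the length error.
print(df1.loc[0].shape)  # (2, 3) instead of (3,)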

Comparing string values from sequential rows in pandas series

I am trying to count common string values in sequential rows of a pandas series using a user-defined function and write the output into a new column. I figured out the individual steps, but when I put them together, I get a wrong result. Could you please tell me the best way to do this? I am a very beginner Pythonista!
My pandas df is:
df = pd.DataFrame({"Code": ['d7e', '8e0d', 'ft1', '176', 'trk', 'tr71']})
My string comparison loop is:
x = 'd7e'
y = '8e0d'
s = 0
for i in y:
    b = str(i)
    if b not in x:
        s += 0
    else:
        s += 1
print(s)
The right result for these particular strings is 2.
Note: when I wrap this in def func(x, y), something happens to the s counter and it doesn't produce the right result. I think I need to reset it to 0 every time the loop runs.
Then, I use df.shift to specify the position of y and x in a series:
x = df["Code"]
y = df["Code"].shift(periods=-1, axis=0)
And finally, I use the df.apply() method to run the function:
df["R1SB"] = df.apply(func, axis=0)
and I get None values in my new column "R1SB".
My correct output would be:
"Code" "R1SB"
0 d7e None
1 8e0d 2
2 ft1 0
3 176 1
4 trk 0
5 tr71 2
Thank you for your help!
TRY:
df['R1SB'] = df.assign(temp=df.Code.shift(1)).apply(
    lambda x: np.nan
    if pd.isna(x['temp'])
    else sum(i in str(x['temp']) for i in str(x['Code'])),
    axis=1,
)
OUTPUT:
   Code  R1SB
0   d7e   NaN
1  8e0d   2.0
2   ft1   0.0
3   176   1.0
4   trk   0.0
5  tr71   2.0
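If you would rather keep the def-based approach from the question, a sketch along the same lines (count_common is a hypothetical helper name; the counter is reset inside the function on every call, and the function is applied row-wise with axis=1):

import numpy as np
import pandas as pd

def count_common(row):
    x, y = str(row["Code"]), row["prev"]
    if pd.isna(y):
        return np.nan
    s = 0                      # reset the counter for every row
    for ch in x:
        if ch in str(y):
            s += 1
    return s

df["prev"] = df["Code"].shift(1)
df["R1SB"] = df.apply(count_common, axis=1)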

Grouped function between 2 columns in a pandas.DataFrame?

I have a dataframe that has multiple numerical data columns, and a 'group' column. I want to get the output of various functions over two of the columns, for each group.
Example data and function:
df = pandas.DataFrame({"Dummy": [1, 2]*6, "X": [1, 3, 7]*4,
                       "Y": [2, 3, 4]*4, "group": ["A", "B"]*6})
def RMSE(X):
    return np.sqrt(np.sum((X.iloc[:, 0] - X.iloc[:, 1])**2))
I want to do something like
group_correlations = df[["X", "Y"]].groupby('group').apply(RMSE)
But if I do that, the 'group' column isn't in the dataframe. If I do it the other way around, like this:
group_correlations = df.groupby('group')[["X", "Y"]].apply(RMSE)
Then the column selection doesn't work:
df.groupby('group')[['X', 'Y']].head(1)
Dummy X Y group
group
A 0 1 1 2 A
B 1 2 3 3 B
the Dummy column is still included, so the function will calculate RMSE on the wrong data.
Is there any way to do what I'm trying to do? I know I could do a for loop over the different groups, and subselect the columns manually, but I'd prefer to do it the pandas way, if there is one.
This looks like a bug (or perhaps grabbing multiple columns in a groupby is not implemented?); a workaround is to pass the groupby column in directly:
In [11]: df[['X', 'Y']].groupby(df['group']).apply(RMSE)
Out[11]:
group
A 4.472136
B 4.472136
dtype: float64
To see it's the same:
In [12]: df.groupby('group')[['X', 'Y']].apply(RMSE) # wrong
Out[12]:
group
A 8.944272
B 7.348469
dtype: float64
In [13]: df.iloc[:, 1:].groupby('group')[['X', 'Y']].apply(RMSE) # correct: ignore dummy col
Out[13]:
group
A 4.472136
B 4.472136
dtype: float64
More robust implementation:
To avoid this completely, you could change RMSE to select the columns by name:
In [21]: def RMSE2(X, left_col, right_col):
   ....:     return np.sqrt(np.sum((X[left_col] - X[right_col])**2))
In [22]: df.groupby('group').apply(RMSE2, 'X', 'Y') # equivalent to passing lambda x: RMSE2(x, 'X', 'Y'))
Out[22]:
group
A 4.472136
B 4.472136
dtype: float64
Thanks to @naught101 for pointing out the sweet apply syntax that avoids the lambda.
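For reference, the lambda form that the In [22] comment alludes to would be spelled out like this (a usage sketch, assuming the df and RMSE2 defined above):

group_rmse = df.groupby('group').apply(lambda g: RMSE2(g, 'X', 'Y'))
print(group_rmse)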

Case insensitive pandas.concat

How would I perform a case insensitive pandas.concat?
df1 = pd.DataFrame({"a":[1,2,3]},index=["a","b","c"])
df2 = pd.DataFrame({"b":[1,2,3]},index=["a","b","c"])
df1a = pd.DataFrame({"A":[1,2,3]},index=["A","B","C"])
pd.concat([df1, df2],axis=1)
a b
a 1 1
b 2 2
c 3 3
but this does not work:
pd.concat([df1, df1a],axis=1)
a A
A NaN 1
B NaN 2
C NaN 3
a 1 NaN
b 2 NaN
c 3 NaN
Is there an easy way to do this?
I have the same question for concat on a Series.
This works for a DataFrame:
pd.DataFrame([11,21,31],index=pd.MultiIndex.from_tuples([("A",x) for x in ["a","B","c"]])).rename(str.lower)
but this does not work for a Series:
pd.Series([11,21,31],index=pd.MultiIndex.from_tuples([("A",x) for x in ["a","B","c"]])).rename(str.lower)
TypeError: descriptor 'lower' requires a 'str' object but received a 'tuple'
For renaming, DataFrames use:
def rename_axis(self, mapper, axis=1):
    index = self.axes[axis]
    if isinstance(index, MultiIndex):
        new_axis = MultiIndex.from_tuples([tuple(mapper(y) for y in x) for x in index],
                                          names=index.names)
    else:
        new_axis = Index([mapper(x) for x in index], name=index.name)
whereas when renaming Series:
result.index = Index([mapper_f(x) for x in self.index], name=self.index.name)
So my updated question is: how do I perform the rename / case-insensitive concat with a Series?
You can do this via rename:
pd.concat([df1, df1a.rename(index=str.lower)], axis=1)
EDIT:
If you want to do this with a MultiIndexed Series you'll need to set it manually, for now. There's a bug report over at the pandas GitHub repo waiting to be fixed (thanks @ViktorKerkez).
s.index = pd.MultiIndex.from_tuples(s.index.map(lambda x: tuple(map(str.lower, x))))
You can replace str.lower with whatever function you want to use to rename your index.
Note that you cannot use reindex in general here, because it tries to find values with the renamed index and thus it will return nan values, unless your rename results in no changes to the original index.
For the MultiIndexed Series objects, if this is not a bug, you can do:
s.index = pd.MultiIndex.from_tuples(
s.index.map(lambda x: tuple(map(str.lower, x)))
)
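Putting it together, a quick usage sketch of the case-insensitive concat for two MultiIndexed Series (s1 and s2 are made-up examples, not from the question):

import pandas as pd

s1 = pd.Series([11, 21, 31], index=pd.MultiIndex.from_tuples([("A", x) for x in ["a", "B", "c"]]))
s2 = pd.Series([1, 2, 3], index=pd.MultiIndex.from_tuples([("a", x) for x in ["A", "b", "C"]]))

# Lower-case every level of both indexes; concat then aligns the rows.
for s in (s1, s2):
    s.index = pd.MultiIndex.from_tuples(s.index.map(lambda x: tuple(map(str.lower, x))))
print(pd.concat([s1, s2], axis=1))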