Looking for efficient way to get pearsonr between two pandas columns - pandas

I am trying to find a way to get the Pearson correlation and p-value between two columns in a dataframe when a third column meets certain conditions.
df =
BucketID   Intensity   BW25113
 825.326     3459870      0.5
 825.326     8923429      0.95
 734.321       12124      0.4
 734.321     2387499      0.3
I originally tried something with the pd.Series.corr() function, which is very fast and gives me the final outputs I want:
bio1 = df.columns[1:].tolist()
pcorrs2 = [s + '_Corr' for s in bio1]
coldict2 = dict(zip(bio1, pcorrs2))
coldict2
df2 = df.groupby('BucketID')[bio1].corr(method='pearson').unstack()['Intensity'].reset_index().rename(columns=coldict2)
df3 = pd.melt(df2, id_vars='BucketID', var_name='Org', value_name='correlation')
df3['Org'] = df3.Org.apply(lambda x: x.replace('_Corr', ''))
df3
This then gives me the (mostly) desired table:
BucketID   Org         correlation
 734.321   Intensity           1.0
 825.326   Intensity           1.0
 734.321   BW25113            -1.0
 825.326   BW25113             1.0
This works for giving me the Pearson correlations but not the p-values, which would be helpful for determining the relevance of the correlations.
Is there a way to get the p-value associated with pd.Series.corr() in this way or would some version with scipy.stats.pearsonr that iterates over the dataframe for each BucketID be more efficient? I tried something of this flavor, but it has been incredibly slow (tens of minutes instead of a few seconds).
Thanks in advance for the assistance and/or comments.

You can use scipy.stats.pearsonr on a dataframe as follows:
import pandas as pd
import scipy.stats

df = pd.DataFrame({'col1': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                   'col2': [1, 2, 6, 4, 5, 7, 7, 8, 7, 12]})
scipy.stats.pearsonr(df['col1'], df['col2'])
This returns a tuple: the first value is the correlation coefficient and the second is the p-value.
(0.9049484650760702, 0.00031797789083818853)
Update
To do this for groups programmatically, you can groupby() and then loop through the groups:
df = pd.DataFrame({'group': ['A', 'A', 'B', 'A', 'A', 'B', 'B', 'B', 'A', 'B'],
                   'col1': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                   'col2': [1, 2, 6, 4, 5, 7, 7, 8, 7, 12]})

for group_name, group_data in df.groupby('group'):
    print(group_name, scipy.stats.pearsonr(group_data['col1'], group_data['col2']))
Results in...
A (0.9817469600192116, 0.0029521879612042588)
B (0.8648495371134326, 0.05841898744667266)
These can also be stored in a new results DataFrame (DataFrame.append was removed in pandas 2.0, so collect the rows in a list first):
rows = []
for group_name, group_data in df.groupby('group'):
    correlation, p_value = scipy.stats.pearsonr(group_data['col1'], group_data['col2'])
    rows.append({'group': group_name, 'corr': correlation, 'p_value': p_value})
results = pd.DataFrame(rows)
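Applied back to the original question, a minimal sketch along the same lines (assuming the column names BucketID, Intensity and BW25113 from the example, and at least two rows per group) computes both values per BucketID with a single groupby().apply():

def corr_with_p(group):
    # pearsonr needs at least two observations in each group
    r, p = scipy.stats.pearsonr(group['Intensity'], group['BW25113'])
    return pd.Series({'correlation': r, 'p_value': p})

# df here is the question's dataframe with BucketID, Intensity and BW25113 columns
per_bucket = df.groupby('BucketID').apply(corr_with_p).reset_index()

This keeps the fast grouped workflow from the question while returning the p-value alongside each correlation.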

Related

Select columns from data frame using 1, 0 list, pandas

I have a list of 1s and 0s where each element corresponds to a column of a data frame, for example:
df.columns = ['a','b','c']
binary_list = [0,1,0]
Based on that, I want to select only column b from the data frame, since the only 1 in my binary list corresponds to b.
Is there a way to do that in pandas?
P.S. This is my first time posting on Stack Overflow, apologies if I am not following a specific style.
If the binary list is aligned with the columns, you can use boolean indexing:
df = pd.DataFrame([[1, 2, 3]], columns=['a', 'b', 'c'])
binary_list = [0, 1, 0]
df.loc[:, list(map(bool, binary_list))]
Output:
b
0 2
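An equivalent sketch, assuming the list is in the same order as df.columns, builds the mask explicitly with NumPy and selects the matching column names:

import numpy as np

mask = np.array(binary_list, dtype=bool)   # array([False,  True, False])
df.loc[:, df.columns[mask]]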

Merge dataframes based on Column and Row values

I have a dataframe that I'd like to merge with another dataframe that shares column values, matching on specific row values.
Dataframe 1
d = {'id': ['111', '222', '333'], 'queries': ['High', 'Mid', 'Low'], 'time_stay': ['High', 'Mid', 'Low']}
dd = pd.DataFrame(data=d)
Dataframe 2
l = {'Features': ['queries', 'queries', 'queries', 'time_stay', 'time_stay', 'time_stay'],
     'groups': ['High', 'Mid', 'Low', 'High', 'Mid', 'Low'],
     'parameters': [1.2, 1.1, 1.0, 1000, 2000, 3000]}
feature_data = pd.DataFrame(data=l)
feature_data
I transposed dataframe 2 so that the Features values become the columns.
feature_data = feature_data.T
feature_data.columns = feature_data.loc['Features', :]
Then I merged it
dd.merge(feature_data, on=list(feature_data.columns), how='left')
As expected, pandas doesn't let me merge it because column queries is duplicated.
Expected output
What's a better way to do this? Thanks.
Filter feature_data for the relevant Features value and then merge the result onto dd:
cols_name = 'queries'
queries = feature_data[feature_data['Features'] == cols_name]

dd = dd.merge(queries[['groups', 'parameters']],
              left_on=['queries'],
              right_on=['groups'],
              how='left').drop(columns='groups')
print(dd)
    id queries time_stay  parameters
0  111    High      High         1.2
1  222     Mid       Mid         1.1
2  333     Low       Low         1.0
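If parameters are needed for both feature columns, one hedged sketch (column names taken from the example above, starting from the original dd) repeats the same lookup per feature in a loop and gives each merged column its own name:

for feat in ['queries', 'time_stay']:
    lookup = feature_data[feature_data['Features'] == feat]
    lookup = lookup[['groups', 'parameters']].rename(columns={'parameters': feat + '_parameters'})
    dd = dd.merge(lookup, left_on=feat, right_on='groups', how='left').drop(columns='groups')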

A pandas groupby operation on a categorical index that respects order

I want to groupby and aggregate to go from the first cell to the second cell in the below image (aggregate method is mean).
Use Series.shift with Series.cumsum:
In [662]: df = pd.DataFrame(index=['a', 'b', 'c', 'c', 'a'], data=[1,2,3,4,5], columns=['val'])
In [634]: x = df.reset_index()
In [653]: df['new'] = (x['index'] != x['index'].shift()).cumsum().tolist()
In [659]: df.groupby('new').transform('mean').drop_duplicates()
Out[659]:
val
a 1.0
b 2.0
c 3.5
a 5.0
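If you'd rather end up with one row per consecutive run instead of transforming and then dropping duplicates, a minimal sketch reusing the same x and cumsum key from above could be:

run = (x['index'] != x['index'].shift()).cumsum()
out = x.groupby(run).agg(label=('index', 'first'), val=('val', 'mean'))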

How to insert a tuple into row of pandas DataFrame

I want to insert a row of values into a DataFrame based on the values in a tuple. Below is an example where I want to insert the values from names['blue'] into columns 'a' and 'b' of the DataFrame.
import numpy as np
import pandas as pd
df = pd.DataFrame({'name': ['red', 'blue', 'green'], 'a': [1,np.nan,2], 'b':[2,np.nan,3]})
names = {'blue': (1, 2),
         'yellow': (5, 5)}
I have an attempt below (note that 'a' and 'b' will always be missing together):
names_needed = df.loc[df['a'].isnull(), 'name']
subset_dict = {colour: names[colour] for colour in names_needed}
for colour, values in subset_dict.items():
    df.loc[df['name'] == colour, ['a', 'b']] = values
I think there has to be a more elegant solution, possibly using some map function?
Applying a lambda function over the rows where there are missing values, and then unpacking the values appropriately:
names_needed = df.loc[df['a'].isnull(), 'name']
subset_dict = {colour: names[colour] for colour in names_needed}
mask = df['name'].isin(list(subset_dict.keys()))
# .tolist() keeps one tuple per masked row, so this also works if several names are missing
df.loc[mask, ['a', 'b']] = df[mask].apply(lambda x: subset_dict.get(x['name']), axis=1).tolist()
Then gives you:
df
name a b
0 red 1.0 2.0
1 blue 1.0 2.0
2 green 2.0 3.0
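Since the question asks about map: a minimal alternative sketch, reusing the mask from above, maps each masked name straight to its tuple (this assumes every masked name has an entry in names):

df.loc[mask, ['a', 'b']] = df.loc[mask, 'name'].map(names).tolist()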

Pandas - understanding output of pivot table

Here is my example:
import pandas as pd
df = pd.DataFrame({
    'Student': ['A', 'B', 'B'],
    'Assessor': ['C', 'D', 'D'],
    'Score': [72, 19, 92]})

df = df.pivot_table(
    index='Student',
    columns='Assessor',
    values='Score',
    aggfunc=lambda x: x)
print(df)
The output looks like:
Assessor C D
Student
A 72 NaN
B NaN [1, 2]
I am not sure why I get '[1,2]' as output. I would expect something like:
Assessor C D
Student
A 72 NaN
B NaN 19
B NaN 92
Here is a related question: if I replace my dataframe with
df = pd.DataFrame({
    'Student': ['A', 'B', 'B'],
    'Assessor': ['C', 'D', 'D'],
    'Score': ['foo', 'bar', 'foo']})
The output of the same pivot is going to be
Process finished with exit code 255
Any thoughts?
pivot_table finds the unique values of the index/columns and aggregates if there are multiple rows in the original DataFrame in a particular cell.
Indexes/columns are generally meant to be unique, so if you want to get the data in that form, you have to do something a little ugly like this, although you probably don't want to.
In [21]: pivoted = pd.DataFrame(columns=df['Assessor'], index=df['Student'])

In [22]: for (student, assessor, score) in df.itertuples(index=False):
    ...:     pivoted.loc[student, assessor] = score
For your second question, the reason the aggregation fails is that there are no numeric columns to aggregate, although it seems to be a bug that it crashes completely like that. I added a note to the issue here.
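If the goal is just to see every score per Student/Assessor pair without collapsing anything, a minimal sketch that avoids pivoting altogether keeps the frame long and sets a MultiIndex (using the original numeric-score frame from the question):

df.set_index(['Student', 'Assessor'])['Score']

which lists 72, 19 and 92 on separate rows instead of packing the duplicate B/D entries into one cell.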