Pandas - understanding output of pivot table

Here is my example:
import pandas as pd

df = pd.DataFrame({
    'Student': ['A', 'B', 'B'],
    'Assessor': ['C', 'D', 'D'],
    'Score': [72, 19, 92]})

df = df.pivot_table(
    index='Student',
    columns='Assessor',
    values='Score',
    aggfunc=lambda x: x)

print(df)
The output looks like:
Assessor    C       D
Student
A          72     NaN
B         NaN  [1, 2]
I am not sure why I get '[1,2]' as output. I would expect something like:
Assessor    C    D
Student
A          72  NaN
B         NaN   19
B         NaN   92
Here is a related question: if I replace my dataframe with

df = pd.DataFrame({
    'Student': ['A', 'B', 'B'],
    'Assessor': ['C', 'D', 'D'],
    'Score': ['foo', 'bar', 'foo']})
The output of the same pivot is going to be
Process finished with exit code 255
Any thoughts?

pivot_table finds the unique values of the index/columns and aggregates whenever multiple rows of the original DataFrame fall into the same cell.
Indexes/columns are generally meant to be unique, so if you want to get the data in that form, you have to do something a little ugly like this, although you probably don't want to.
In [21]: pivoted = pd.DataFrame(columns=df['Assessor'], index=df['Student'])

In [22]: for (student, assessor, score) in df.itertuples(index=False):
    ...:     pivoted.loc[student, assessor] = score
For your second question, the reason that groupby generally fails is that there are no numeric columns to aggregate, although it seems to be a bug that it completely crashes like that. I added a note to the issue here.
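To make that aggregation step visible, here is a minimal sketch (not part of the original answer) that passes an explicit aggfunc so you can see duplicate (Student, Assessor) pairs being collapsed into one cell:

import pandas as pd

df = pd.DataFrame({
    'Student': ['A', 'B', 'B'],
    'Assessor': ['C', 'D', 'D'],
    'Score': [72, 19, 92]})

# every cell receives the list of scores for that (Student, Assessor) pair;
# B/D collects both 19 and 92 in a single cell instead of two separate rows
print(df.pivot_table(index='Student', columns='Assessor',
                     values='Score', aggfunc=list))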

Related

How to Exclude NaNs from Pandas Rolling, but not return NaN if there is one in the DataFrame

Currently I have the DataFrame seen below, and I want to do a rolling average over the last 10 occurrences that have actual values, but skip the NaNs.
Example DataFrame
The issue is that if I run df['AST_Hit'].rolling(10).mean(skipna=True).shift(1) I get the DataFrame below, which is not what I am looking for.
Example Output DataFrame
I've tried using window and min_periods, but that does not give me what I want, as I don't want the average over anything greater than 10.
Ideally I would like the DataFrame to be able to discard a NaN but still check whether there are 10 values in that selection. From what I am describing, I think I need some sort of max period equal to 10 as well as min_periods equal to 10, but I could not find anything in the pandas documentation about setting a max period for rolling.
Maybe it would also be best if I just dropped any NaN rows. My DataFrame is much bigger than what is shown, so it isn't just those 3 rows that contain a NaN, but it may be the best course of action.
Any help or tips is greatly appreciated.
This might help you:
import pandas as pd
import numpy as np

# create a sample DataFrame with non-numeric columns
np.random.seed(123)
df = pd.DataFrame({
    'Date': pd.date_range(start='2022-01-01', periods=100),
    'AST_Hit': np.random.randint(0, 10, size=100),
    'Other_Column': ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j'] * 10
})
df.iloc[3:6, 1] = np.nan
df.iloc[7, 0] = np.nan
df.iloc[10:15, 2] = np.nan
df.iloc[20:25, 1] = np.nan
df.iloc[30:40, 2] = np.nan

# compute rolling average over the last 10 non-null values
rolling_mask = df['AST_Hit'].notnull().rolling(window=10, min_periods=1).sum().eq(10)
result = df['AST_Hit'].rolling(window=10, min_periods=1).apply(lambda x: np.mean(x[rolling_mask]))
print(result)
which gives
0     NaN
1     NaN
2     NaN
3     NaN
4     NaN
      ...
95    5.6
96    5.1
97    4.7
98    4.2
99    3.9
Name: AST_Hit, Length: 100, dtype: float64
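If the goal is simply a 10-value average that ignores NaN rows entirely, here is an alternative sketch (my assumption about the intent, not part of the answer above): drop the NaNs first so every window holds exactly ten real observations, then align the result back to the original index.

# assumes df and the 'AST_Hit' column from the example above
rolling_no_nan = (
    df['AST_Hit']
    .dropna()                              # skip NaN rows entirely
    .rolling(window=10, min_periods=10)    # exactly 10 real values per window
    .mean()
    .shift(1)                              # same shift as in the question
    .reindex(df.index)                     # align back to the original rows
)
print(rolling_no_nan)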

Subsetting a MultiIndex DataFrame keeps the original index values

I found that subsetting a MultiIndex DataFrame keeps the original index values behind.
Here is some sample code to test:
import pandas as pd

level_one = ["foo", "bar", "baz"]
level_two = ["a", "b", "c"]
df_index = pd.MultiIndex.from_product((level_one, level_two))
df = pd.DataFrame(range(9), index=df_index, columns=["number"])
df
The above code shows a DataFrame like this:
       number
foo a       0
    b       1
    c       2
bar a       3
    b       4
    c       5
baz a       6
    b       7
    c       8
The code below subsets the DataFrame so it contains only 'a' and 'b' in index level 1.
df_subset = df.query("(number%3) <=1")
df_subset
       number
foo a       0
    b       1
bar a       3
    b       4
baz a       6
    b       7
The DataFrame itself is the expected result,
BUT its index still contains the original index level, which is NOT expected.
# The following code still returns index 'c'
df_subset.index.levels[1]
#Result
Index(['a', 'b', 'c'], dtype='object')
My first question is: how can I remove the 'original' index levels after subsetting?
The second question is: is this expected behavior for pandas?
Thanks
Yes, this is expected; it allows you to access the missing levels after filtering. You can drop the unused levels with remove_unused_levels:
df_subset.index = df_subset.index.remove_unused_levels()
print(df_subset.index.levels[1])
Output:
Index(['a', 'b'], dtype='object')
It is normal that the "original" index remains after subsetting, because this is pandas' documented behavior: "The MultiIndex keeps all the defined levels of an index, even if they are not actually used. This is done to avoid a recomputation of the levels in order to make slicing highly performant."
You can see that the index levels are a FrozenList using:
[I]: df_subset.index.levels
[O]: FrozenList([['bar', 'baz', 'foo'], ['a', 'b', 'c']])
If you want to see only the used levels, you can use the get_level_values() or unique() methods.
Here are some examples:
[I]: df_subset.index.get_level_values(level=1)
[O]: Index(['a', 'b', 'a', 'b', 'a', 'b'], dtype='object')
[I]: df_subset.index.unique(level=1)
[O]: Index(['a', 'b'], dtype='object')
Hope it can help you!

How to calculate mean in a dataframe for a single year [duplicate]

I'm trying to use the groupby function. Is there a way to group on a specific element value rather than the column name? Example of code:
df.groupby(['Month', 'Place'])['Number'].sum()
This is what I want to do:
df.groupby(['April', 'Place'])['Number'].sum()
You need to filter your DataFrame first:
df.loc[df['Month'].eq('April')].groupby('Place')['Number'].sum()
#df[df['Month'].eq('April')].groupby('Place')['Number'].sum()
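As a self-contained sketch of this filter-then-groupby approach (the Month/Place/Number column names come from the question; the sample values are made up):

import pandas as pd

df = pd.DataFrame({
    'Month': ['April', 'April', 'May', 'April'],
    'Place': ['X', 'Y', 'X', 'X'],
    'Number': [1, 2, 3, 4]})

# keep only the April rows, then aggregate per Place
print(df.loc[df['Month'].eq('April')].groupby('Place')['Number'].sum())
# Place
# X    5
# Y    2
# Name: Number, dtype: int64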
Yes, you can pass as many columns as you want to groupby:
df = pd.DataFrame([['a', 'x', 1], ['a', 'x', 2], ['b', 'y',3], ['b', 'z',4]])
df.columns = ['c1', 'c2', 'c3']
df.groupby(['c1', 'c2'])['c3'].mean()
results in
c1  c2
a   x     1.5
b   y     3.0
    z     4.0

Looking for efficient way to get pearsonr between two pandas columns

I am trying to find a way to get the Pearson correlation and p-value between two columns in a dataframe when a third column meets certain conditions.
df =

BucketID  Intensity  BW25113
 825.326    3459870     0.5
 825.326    8923429     0.95
 734.321      12124     0.4
 734.321    2387499     0.3
I originally tried something with the pd.Series.corr() function which is very fast and does what I want it to do to get my final outputs:
bio1 = df.columns[1:].tolist()
pcorrs2 = [s + '_Corr' for s in bio1]
coldict2 = dict(zip(bio1, pcorrs2))
coldict2

df2 = df.groupby('BucketID')[bio1].corr(method='pearson').unstack()['Intensity'].reset_index().rename(columns=coldict2)
df3 = pd.melt(df2, id_vars='BucketID', var_name='Org', value_name='correlation')
df3['Org'] = df3.Org.apply(lambda x: x.replace('_Corr', ''))
df3
This then gives me the (mostly) desired table:
BucketID        Org  correlation
 734.321  Intensity          1.0
 825.326  Intensity          1.0
 734.321    BW25113         -1.0
 825.326    BW25113          1.0
This works for giving me the Pearson correlations but not the p-value, which would be helpful for determining the relevance of the correlations.
Is there a way to get the p-value associated with pd.Series.corr() in this way or would some version with scipy.stats.pearsonr that iterates over the dataframe for each BucketID be more efficient? I tried something of this flavor, but it has been incredibly slow (tens of minutes instead of a few seconds).
Thanks in advance for the assistance and/or comments.
You can use scipy.stats.pearsonr on a dataframe as follows:
import pandas as pd
import scipy.stats

df = pd.DataFrame({'col1': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                   'col2': [1, 2, 6, 4, 5, 7, 7, 8, 7, 12]})

scipy.stats.pearsonr(df['col1'], df['col2'])
This results in a tuple: the first value is the correlation and the second is the p-value.
(0.9049484650760702, 0.00031797789083818853)
Update
To do this for groups programmatically, you can groupby() and then loop through the groups...
df = pd.DataFrame({'group': ['A', 'A', 'B', 'A', 'A', 'B', 'B', 'B', 'A', 'B'],
                   'col1': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                   'col2': [1, 2, 6, 4, 5, 7, 7, 8, 7, 12]})

for group_name, group_data in df.groupby('group'):
    print(group_name, scipy.stats.pearsonr(group_data['col1'], group_data['col2']))
Results in...
A (0.9817469600192116, 0.0029521879612042588)
B (0.8648495371134326, 0.05841898744667266)
These can also be stored in a new results DataFrame:
results = []
for group_name, group_data in df.groupby('group'):
    correlation, p_value = scipy.stats.pearsonr(group_data['col1'], group_data['col2'])
    results.append({'group': group_name, 'corr': correlation, 'p_value': p_value})
results = pd.DataFrame(results)
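As a hedged alternative to the explicit loop (not part of the original answer), groupby().apply() can build the same table in one pass; corr_with_p is just an illustrative helper name:

from scipy.stats import pearsonr

def corr_with_p(g):
    # compute correlation and p-value for one group of rows
    r, p = pearsonr(g['col1'], g['col2'])
    return pd.Series({'corr': r, 'p_value': p})

# one row per group, with the correlation and p-value as columns
results = df.groupby('group').apply(corr_with_p).reset_index()
print(results)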

From Pandas groupBy to PySpark groupBy

Consider a Spark DataFrame with a few columns. The goal is to perform a groupBy operation on it without converting it to a Pandas DataFrame. Equivalent Pandas groupBy code looks something like this:
def compute_metrics(x):
    return pd.Series({
        'a': x['a'].values[0],
        'new_b': np.sum(x['b']),
        'c': np.mean(x['c']),
        'cnt': len(x)
    })

data.groupby([
    'col_1',
    'col_2'
]).apply(compute_metrics).reset_index()
And I'm intending to write this in PySpark. So far I have come up with something like this in PySpark:
gdf = df.groupBy([
    'col_1',
    'col_2'
]).agg({
    'c': 'avg',
    'b': 'sum'
}).withColumnRenamed('sum(b)', 'new_b')
However, I am not sure how to go about 'a': x['a'].values[0] and 'cnt': len(x). I thought about using collect_list from pyspark.sql.functions, but that slaps my face with "Column object is not callable". Any idea how to accomplish the aforementioned conversion? Thanks!
[UPDATE] Would it make sense to perform a count operation on any column in order to get cnt? Say I do this:
gdf = (df.groupBy([
           'col_1',
           'col_2'
       ]).agg({
           'c': 'avg',
           'b': 'sum',
           'some_column': 'count'
       })
       .withColumnRenamed('sum(b)', 'new_b')
       .withColumnRenamed('count(some_column)', 'cnt'))
I have this toy solution using the PySpark functions sum, avg, count and first. Note that I use Spark 2.1 in this solution. Hope this helps a bit!
from pyspark.sql.functions import sum, avg, count, first

# create a toy example dataframe with columns 'A', 'B' and 'C'
ls = [['a', 'b', 3], ['a', 'b', 4], ['a', 'c', 3], ['b', 'b', 5]]
df = spark.createDataFrame(ls, schema=['A', 'B', 'C'])

# group by columns 'A' and 'B', then apply the aggregation functions
group_df = df.groupby(['A', 'B'])
df_grouped = group_df.agg(sum("C").alias("sumC"),
                          avg("C").alias("avgC"),
                          count("C").alias("countC"),
                          first("C").alias("firstC"))
df_grouped.show()  # print out the Spark dataframe
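To tie this back to the column names the question asks about (the first value of a, new_b, c and cnt), here is a sketch along the same lines, assuming a Spark DataFrame df with the col_1/col_2/a/b/c schema from the question:

# column names are taken from the question's schema and are illustrative
from pyspark.sql.functions import avg, count, first, sum as sum_

result = (df.groupBy('col_1', 'col_2')
            .agg(first('a').alias('a'),      # analogue of x['a'].values[0]
                 sum_('b').alias('new_b'),   # analogue of np.sum(x['b'])
                 avg('c').alias('c'),        # analogue of np.mean(x['c'])
                 count('*').alias('cnt')))   # analogue of len(x)
result.show()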