Iterating over rows and columns of a DataFrame to calculate the mean - pandas

I have a dataframe which reads:
A 2007/Ago 2007/Set 2007/Out ... 2020/Jan 2020/Fev
row1 x number number number ... number number
row2 y number number number ... number number
row3 w number number number ... number number
...
row27 z number number number ... number number
I mean, there are numbers in each cell. I want to calculate the mean of the cells whose column starts with 2007, then the mean of the cells whose column starts with 2008, and so on through 2020, and do this for each row.
What I tried to sketch is something like:
x = []
for i, row in df.iterrows():        # that is, for each row of the dataframe
    for j in df.columns:
        if j.startswith('2008'):    # or 2009, 2010, etc.
            x.append(df.loc[i, j])  # collect the number at row i, column j to sum it
What I want in the end are several new columns with the result of the mean for each year, that is:

result1       result2       result3       ...  resultn
mean of       mean of       mean of            mean of
columns       columns       columns            columns
starting      starting      starting           starting
with 2008     with 2009     with 2010          with 2020
That is, I want 13 new columns: one for each mean (years from 2008 to 2020).
I can't get this loop to work and I do not know how basic this is, but my question is:
1 - Is there a more efficient way of doing this, i.e. using pandas functions rather than loops?
In my dataframe, each cell corresponds to the total cost of health expenses in that month, and I want to take the mean cost over the entire year to compare it to the population of each city (which are the rows). I have been struggling with this for some time and have not been able to solve it. My level with pandas is very basic.
PS: sorry for the dataframe representation, I do not know how to properly write one in the Stack Overflow question body.

An option via melt + pivot_table with the aggfunc set to mean:
import pandas as pd

df = pd.DataFrame({
    'A': {'row1': 'x', 'row2': 'y', 'row3': 'w', 'row27': 'z'},
    '2007/Ago': {'row1': 1, 'row2': 2, 'row3': 3, 'row27': 4},
    '2007/Set': {'row1': 5, 'row2': 6, 'row3': 7, 'row27': 8},
    '2007/Out': {'row1': 9, 'row2': 10, 'row3': 11, 'row27': 12},
    '2020/Jan': {'row1': 13, 'row2': 14, 'row3': 15, 'row27': 16},
    '2020/Fev': {'row1': 17, 'row2': 18, 'row3': 19, 'row27': 20}
})
df = df.melt(id_vars='A', var_name='year')
# reduce the month labels to their year value
df['year'] = df['year'].str.split('/').str[0]
# pivot to wide format based on the new year value
df = (
    df.pivot_table(columns='year', index='A', aggfunc='mean')
      .droplevel(0, 1)
      .rename_axis(None)
      .rename_axis(None, axis=1)
)
print(df)
df:

   2007  2020
w     7    17
x     5    15
y     6    16
z     8    18
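For comparison, starting again from the original wide df (before the melt), a shorter sketch of the same per-row result groups the column labels themselves by their first four characters. This assumes, as above, that every non-'A' column is numeric and named 'YYYY/Mon':

# transpose so the month columns become the index, group the labels by
# year, average, and transpose back
out = df.set_index('A').T.groupby(lambda c: c[:4]).mean().T
print(out)
#    2007  2020
# A
# x   5.0  15.0
# y   6.0  16.0
# w   7.0  17.0
# z   8.0  18.0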

Suppose you have this dataframe:
A 2007/Ago 2007/Set 2007/Out 2020/Jan 2020/Fev
row1 x 1 5 9 13 17
row2 y 2 6 10 14 18
row3 w 3 7 11 15 19
row27 z 4 8 12 16 20
You can use .filter() and .mean(axis=1) to compute the values:
df["result"] = df.filter(regex=r"^\d{4}").mean(axis=1)
print(df)
Prints:
A 2007/Ago 2007/Set 2007/Out 2020/Jan 2020/Fev result
row1 x 1 5 9 13 17 9.0
row2 y 2 6 10 14 18 10.0
row3 w 3 7 11 15 19 11.0
row27 z 4 8 12 16 20 12.0
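If you want one result column per year rather than a single overall mean, the same filter idea extends to a short loop. A sketch, assuming every data column is named 'YYYY/Mon' as in the question:

# collect the distinct years from the column names, then average each
# year's columns row-wise into its own result column
years = sorted({c[:4] for c in df.columns if c[:4].isdigit()})
for year in years:
    df[f"result_{year}"] = df.filter(regex=rf"^{year}/").mean(axis=1)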

While re-working my other answer, I found this one-liner:
df.mean().groupby(lambda x: x[:4]).mean()
Explanation
Pandas' mean() function calculates the mean per column:
# using the DataFrame from Henry's answer:
df = pd.DataFrame({
    'A': {'row1': 'x', 'row2': 'y', 'row3': 'w', 'row27': 'z'},
    '2007/Ago': {'row1': 1, 'row2': 2, 'row3': 3, 'row27': 4},
    '2007/Set': {'row1': 5, 'row2': 6, 'row3': 7, 'row27': 8},
    '2007/Out': {'row1': 9, 'row2': 10, 'row3': 11, 'row27': 12},
    '2020/Jan': {'row1': 13, 'row2': 14, 'row3': 15, 'row27': 16},
    '2020/Fev': {'row1': 17, 'row2': 18, 'row3': 19, 'row27': 20}
})
# calculate mean per column (note: recent pandas versions need
# df.mean(numeric_only=True) to skip the non-numeric column 'A')
col_means = df.mean()
# 2007/Ago 2.5
# 2007/Set 6.5
# 2007/Out 10.5
# 2020/Jan 14.5
# 2020/Fev 18.5
# dtype: float64
# group above columns by first 4 characters, i.e., the year
year_groups = col_means.groupby(lambda x: x[:4])
# calculate the mean per year group
year_groups.mean()
# 2007 6.5
# 2020 16.5
# dtype: float64

You could iterate over the years, select the subset of columns and just use pandas' mean() function to get the mean of that year:
means = {}
for year in range(2007, 2021):
    # assuming df is your dataframe
    sub_df = df.loc[:, df.columns.str.startswith(str(year))]
    # first mean() aggregates per column, second mean() aggregates the whole year
    means[year] = sub_df.mean().mean()
This yields a dict with the years as key and the mean for that year as value. If there are no columns for one year, means[year] contains NaN.
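Since the question ultimately asks for one mean per row and per year, here is a sketch of the same column selection combined with mean(axis=1), which aggregates row-wise instead:

# per-row yearly means, one column per year (rows for years with no
# matching columns come out as NaN)
yearly = {
    year: df.loc[:, df.columns.str.startswith(str(year))].mean(axis=1)
    for year in range(2007, 2021)
}
result = pd.DataFrame(yearly)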

Why doesn't .loc reverse slice correctly?

From my understanding, there are two ways to subset a dataframe in pandas:
a) df['columns']['rows']
b) df.loc['rows', 'columns']
I was following a guided case study, where the instruction was to select the first and last n rows of a column in a dataframe. The solution used Method A, whereas I tried Method B.
My method wasn't working and I couldn't for the life of me figure out why.
I've created a simplified version of the dataframe...
import numpy as np
import pandas as pd

male = [6, 14, 12, 13, 21, 14, 14, 14, 14, 18]
female = [9, 11, 6, 10, 11, 13, 12, 11, 9, 11]
df = pd.DataFrame({'Male': male,
                   'Female': female},
                  index=np.arange(1, 11))
df['Mean'] = df[['Male', 'Female']].mean(axis=1).round(1)
df
Selecting the first two rows works fine for both Method A and Method B:
print('Method A: \n', df['Mean'][:2])
print('Method B: \n', df.loc[:2, 'Mean'])
Method A:
1 7.5
2 12.5
Method B:
1 7.5
2 12.5
But selecting the last two rows doesn't work the same way. Method A returns the last two rows as it should.
Method B (.loc) doesn't; it returns the whole dataframe. Why is this, and how do I fix it?
print('Method A: \n', df['Mean'][-2:])
print('Method B: \n', df.loc[-2:, 'Mean'])
Method A:
9 11.5
10 14.5
Method B:
1 7.5
2 12.5
3 9.0
4 11.5
5 16.0
6 13.5
7 13.0
8 12.5
9 11.5
10 14.5
You could use .index[-2:] to get the index labels of the last two rows, which are 9 and 10, instead of only -2:. Here is some reproducible code:
import numpy as np
import pandas as pd

male = [6, 14, 12, 13, 21, 14, 14, 14, 14, 18]
female = [9, 11, 6, 10, 11, 13, 12, 11, 9, 11]
df = pd.DataFrame({'Male': male,
                   'Female': female},
                  index=np.arange(1, 11))
df['Mean'] = df[['Male', 'Female']].mean(axis=1).round(1)
print('Method B: \n', df.loc[df.index[-2:], 'Mean'])
Output:
Method B:
9 11.5
10 14.5
Name: Mean, dtype: float64
As you can see, it returns the last two rows of your dataframe.
You can also get them with slicing, iloc, or the tail method, like this:
df['Mean'][-2:]
df['Mean'].iloc[-2:]
df['Mean'].tail(2)
We don't usually use loc for this; iloc or the other methods above are easier. But if you want to use it, it could look like this:
df.loc[df.index[-2:],'Mean']
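As for the "why": .loc slices by label, not by position. A short sketch of what happens with the sorted integer index above:

# -2 is not a label in df.index (1..10). For a sorted index, a label slice
# starts from the position where the missing label *would* be inserted --
# before 1, i.e. the very beginning -- so .loc[-2:] returns every row.
df.loc[-2:, 'Mean']     # whole column: label slice from "before 1" to the end
df['Mean'].iloc[-2:]    # last two rows: iloc slices by position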

Pandas Vlookup 2 DF Columns Different Lengths & Perform Calculation

I need to execute a vlookup-like calculation considering two dataframes of different lengths with the same column names. Suppose I have a df called df1 such as:
Y M P D
2020 11 Red 10
2020 11 Blue 9
2020 11 Green 12
2020 11 Tan 7
2020 11 White 5
2020 11 Cyan 17
and a second df called df2 such as:
Y M P D
2020 11 Blue 4
2020 11 Red 12
2020 11 White 6
2020 11 Tan 7
2020 11 Green 20
2020 11 Violet 10
2020 11 Black 7
2020 11 BlackII 3
2020 11 Cyan 14
2020 11 Copper 6
I need a new df like df3[['Res', 'P']] with 2 columns showing the results of subtracting df2's values from df1's, such as:
Res P
Red -2
Blue 5
Green -8
Tan 0
White -1
Cyan 3
I have not been able to find anything on the web about a lookup followed by a calculation. I've tried merging df1 and df2 into one df, but I do not see how to execute the calculation where the values in the "P" column match. I think that a merge of df1 and df2 is probably the first step, though?
Based on the example, columns 'Y' and 'M' do not matter for the merge. If these columns are relevant, then pass a list to the on parameter (e.g. on=['Y', 'M', 'P']).
Currently, only columns [['P', 'D']] are being used from df1 and df2.
The following code produces the desired output for the example, but it's difficult to say what will happen with larger dataframes or if there are repeating values in 'P'.
import pandas as pd

# set up the dataframes
df1 = pd.DataFrame({'Y': [2020, 2020, 2020, 2020, 2020, 2020],
                    'M': [11, 11, 11, 11, 11, 11],
                    'P': ['Red', 'Blue', 'Green', 'Tan', 'White', 'Cyan'],
                    'D': [10, 9, 12, 7, 5, 17]})
df2 = pd.DataFrame({'Y': [2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020],
                    'M': [11, 11, 11, 11, 11, 11, 11, 11, 11, 11],
                    'P': ['Blue', 'Red', 'White', 'Tan', 'Green',
                          'Violet', 'Black', 'BlackII', 'Cyan', 'Copper'],
                    'D': [4, 12, 6, 7, 20, 10, 7, 3, 14, 6]})

# merge the dataframes on 'P'
df = pd.merge(df1[['P', 'D']], df2[['P', 'D']], on='P',
              suffixes=('_1', '_2')).rename(columns={'P': 'Res'})
# subtract the values
df['P'] = df.D_1 - df.D_2
# drop the unneeded columns
df = df.drop(columns=['D_1', 'D_2'])
# display(df)
Res P
0 Red -2
1 Blue 5
2 Green -8
3 Tan 0
4 White -1
5 Cyan 3
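A hedged alternative sketch (not the only way): since arithmetic between Series aligns on the index, you can also set 'P' as the index and subtract directly, without an explicit merge:

# subtract with automatic alignment on 'P'; colours missing from either
# side become NaN
diff = df1.set_index('P')['D'].sub(df2.set_index('P')['D'])
# keep only the colours present in df1, in df1's original order
diff = diff.reindex(df1['P']).astype(int)
df3 = diff.rename_axis('Res').reset_index(name='P')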

Reshaping column values into rows with Identifier column at the end

I have measurements of Power for different sensors, i.e. A1_Pin, A2_Pin, and so on. These measurements are recorded in a file as columns, and each record is uniquely keyed by a timestamp.
import pandas as pd

df1 = pd.DataFrame({'DateTime': ['12/12/2019', '12/13/2019', '12/14/2019',
                                 '12/15/2019', '12/16/2019'],
                    'A1_Pin': [2, 8, 8, 3, 9],
                    'A2_Pin': [1, 2, 3, 4, 5],
                    'A3_Pin': [85, 36, 78, 32, 75]})
I want to reshape the table so that each row corresponds to one sensor reading. The last column indicates the sensor ID to which the row data belongs.
The final table should look like:
df2 = pd.DataFrame({'DateTime': ['12/12/2019', '12/12/2019', '12/12/2019',
                                 '12/13/2019', '12/13/2019', '12/13/2019',
                                 '12/14/2019', '12/14/2019', '12/14/2019',
                                 '12/15/2019', '12/15/2019', '12/15/2019',
                                 '12/16/2019', '12/16/2019', '12/16/2019'],
                    'Power': [2, 1, 85, 8, 2, 36, 8, 3, 78, 3, 4, 32, 9, 5, 75],
                    'ModID': ['A1_Pin', 'A2_Pin', 'A3_Pin', 'A1_Pin', 'A2_Pin',
                              'A3_Pin', 'A1_Pin', 'A2_Pin', 'A3_Pin', 'A1_Pin',
                              'A2_Pin', 'A3_Pin', 'A1_Pin', 'A2_Pin', 'A3_Pin']})
I have tried groupby, melt, reshape, stack and loops but could not manage it. Could anyone help? Thanks
When you tried stack, you were on a good track; you just need to set_index first and reset_index afterwards, such as:
df2 = df1.set_index('DateTime').stack().reset_index(name='Power')\
         .rename(columns={'level_1': 'ModID'})  # to fit the names of your expected output
And you get:
print(df2)
DateTime ModID Power
0 12/12/2019 A1_Pin 2
1 12/12/2019 A2_Pin 1
2 12/12/2019 A3_Pin 85
3 12/13/2019 A1_Pin 8
4 12/13/2019 A2_Pin 2
5 12/13/2019 A3_Pin 36
6 12/14/2019 A1_Pin 8
7 12/14/2019 A2_Pin 3
8 12/14/2019 A3_Pin 78
9 12/15/2019 A1_Pin 3
10 12/15/2019 A2_Pin 4
11 12/15/2019 A3_Pin 32
12 12/16/2019 A1_Pin 9
13 12/16/2019 A2_Pin 5
14 12/16/2019 A3_Pin 75
I'd try something like this:
df1.set_index('DateTime').unstack().reset_index()
Note that you'd still need to rename the resulting columns to ModID and Power.
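A melt-based sketch would also get there and names the columns in one call (assuming pandas >= 0.20, where DataFrame.melt exists):

df2 = df1.melt(id_vars='DateTime', var_name='ModID', value_name='Power')

Apart from column order and row order (melt lists all A1_Pin rows first, then A2_Pin, and so on), this matches the target; a sort_values(['DateTime', 'ModID']) would restore the date-major order shown above.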

Apply rolling function to groupby over several columns

I'd like to apply rolling functions to a dataframe grouped by two columns with repeated date entries. Specifically, with both "freq" and "window" as datetime values, not simply ints.
In principle, I'm trying to combine the methods from How to apply rolling functions in a group by object in pandas and pandas rolling sum of last five minutes.
Input
Here is a sample of the data, with one id=33 although we expect several id's.
X = [{'date': '2017-02-05', 'id': 33, 'item': 'A', 'points': 20},
     {'date': '2017-02-05', 'id': 33, 'item': 'B', 'points': 10},
     {'date': '2017-02-06', 'id': 33, 'item': 'B', 'points': 10},
     {'date': '2017-02-11', 'id': 33, 'item': 'A', 'points': 1},
     {'date': '2017-02-11', 'id': 33, 'item': 'A', 'points': 1},
     {'date': '2017-02-11', 'id': 33, 'item': 'A', 'points': 1},
     {'date': '2017-02-13', 'id': 33, 'item': 'A', 'points': 4}]
# df = pd.DataFrame(X) and reindex df to pd.to_datetime(df['date'])
df
            id item  points
date
2017-02-05  33    A      20
2017-02-05  33    B      10
2017-02-06  33    B      10
2017-02-11  33    A       1
2017-02-11  33    A       1
2017-02-11  33    A       1
2017-02-13  33    A       4
Goal
Sample each 'id' every 2 days (freq='2d') and return the sum of total points for each item over the previous three days (window='3D'), end-date inclusive
Desired Output
            id   A   B
date
2017-02-05  33  20  10
2017-02-07  33  20  30
2017-02-09  33   0  10
2017-02-11  33   3   0
2017-02-13  33   7   0
E.g. on the right-inclusive end-date 2017-02-13, we sample the 3-day period 2017-02-11 to 2017-02-13. In this period, id=33 had a sum of A points equal to 1+1+1+4 = 7
Attempts
An attempt at groupby with pd.rolling_sum as follows didn't work, due to the repeated dates:
df.groupby(['id', 'item'])['points'].apply(pd.rolling_sum, freq='4D', window=3)
ValueError: cannot reindex from a duplicate axis
Also note that from the documentation http://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.rolling_apply.html 'window' is an int representing the size of the sample period, not the number of days to sample.
We can also try resampling with last; however, the desired look-back of 3 days doesn't seem to be applied:
df.groupby(['id', 'item'])['points'].resample('2D', label='right', closed='right')\
    .apply(lambda x: x.last('3D').sum())
id  item  date
33  A     2017-02-05    20
          2017-02-07     0
          2017-02-09     0
          2017-02-11     3
          2017-02-13     4
    B     2017-02-05    10
          2017-02-07    10
Of course, setting up a loop over unique id's ID, selecting df_id = df[df['id'] == ID], and summing over the periods does work, but it is computationally intensive and doesn't exploit groupby's nice vectorization.
Thanks to @jezrael for good suggestions so far.
Notes
Pandas version = 0.20.1
I'm a little confused as to why the documentation on rolling() here: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.rolling.html
suggests that the "window" parameter can be an int or an offset, but on attempting df.rolling(window='3D', ...) I get: raise ValueError("window must be an integer")
It appears that the above documentation is not consistent with the latest code for rolling's window, from ./core/window.py:
https://github.com/pandas-dev/pandas/blob/master/pandas/core/window.py
elif not is_integer(self.window):
    raise ValueError("window must be an integer")
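Update (an assumption on my part, not fully verified against 0.20.1): the error seems to be about the index rather than the version. Offset windows like '3D' are only accepted when the object being rolled has a datetime-like, monotonic index; with any other index pandas requires an integer window. A minimal sketch:

import pandas as pd

s = pd.Series([1, 2, 3],
              index=pd.to_datetime(['2017-02-05', '2017-02-06', '2017-02-08']))
print(s.rolling('3D').sum())              # works: time-based window on a DatetimeIndex
# s.reset_index(drop=True).rolling('3D')  # ValueError: window must be an integer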
It's easiest to handle resample and rolling with date frequencies when we have a single-level datetime index. The steps:
- I can't pivot/unstack appropriately without dealing with duplicate A/Bs, so I groupby and sum.
- I unstack one level, date, so that I can use fill_value=0. Currently, I can't fill_value=0 when I unstack more than one level at a time. I make up for it with a transpose T.
- Now that I've got a single level in the index, I reindex with a date range from the min to the max values in the index.
- Finally, I do a rolling 3-day sum and resample that result every 2 days with resample.
- I clean this up with a bit of renaming of indices and one more pivot.
s = df.set_index(['id', 'item'], append=True).points
s = s.groupby(level=['date', 'id', 'item']).sum()
d = s.unstack('date', fill_value=0).T
tidx = pd.date_range(d.index.min(), d.index.max())
d = d.reindex(tidx, fill_value=0)
d1 = d.rolling('3D').sum().resample('2D').first().astype(d.dtypes).stack(0)
d1 = d1.rename_axis(['date', 'id']).rename_axis(None, axis=1)
print(d1)
                A   B
date       id
2017-02-05 33  20  10
2017-02-07 33  20  20
2017-02-09 33   0   0
2017-02-11 33   3   0
2017-02-13 33   7   0
df = pd.DataFrame(X)
# group sum by day
df = df.groupby(['date', 'id', 'item'])['points'].sum().reset_index()\
       .sort_values(['date', 'id', 'item'])
# convert index to a datetime index
df = df.set_index('date')
df.index = pd.DatetimeIndex(df.index)
# rolling sum over a 3-day window
df['pointsum'] = df.groupby(['id', 'item']).transform(lambda x: x.rolling(window='3D').sum())
# reshape dataframe
df = df.reset_index().set_index(['date', 'id', 'item'])['pointsum']\
       .unstack().reset_index().set_index('date').fillna(0)
df

Pandas DataFrame.update with MultiIndex label

Given a DataFrame A with a MultiIndex and a DataFrame B with a one-dimensional index, how do you update the column values of A with the new values from B, where the index of B should be matched against the second index level of A?
Test data:
begin = [10, 10, 12, 12, 14, 14]
end = [10, 11, 12, 13, 14, 15]
values = [1, 2, 3, 4, 5, 6]
values_updated = [10, 20, 3, 4, 50, 60]

multiindexed = pd.DataFrame({'begin': begin,
                             'end': end,
                             'value': values})
multiindexed.set_index(['begin', 'end'], inplace=True)

singleindexed = pd.DataFrame.from_dict(dict(zip([10, 11, 14, 15],
                                                [10, 20, 50, 60])),
                                       orient='index')
singleindexed.columns = ['value']
And the desired result should be:
           value
begin end
10    10      10
      11      20
12    12       3
      13       4
14    14      50
      15      60
Now I was thinking about a variant of
multiindexed.update(singleindexed)
I searched the docs of DataFrame.update, but could not find anything w.r.t. index handling.
Am I missing an easier way to accomplish this?
You can use loc for selecting the matching rows in multiindexed and then set the new values from singleindexed's values:
print(singleindexed.index)
Int64Index([10, 11, 14, 15], dtype='int64')

print(singleindexed.values)
[[10]
 [20]
 [50]
 [60]]

idx = pd.IndexSlice
print(multiindexed.loc[idx[:, singleindexed.index], :])
           value
begin end
10    10       1
      11       2
14    14       5
      15       6

multiindexed.loc[idx[:, singleindexed.index], :] = singleindexed.values
print(multiindexed)
           value
begin end
10    10      10
      11      20
12    12       3
      13       4
14    14      50
      15      60
See "Using slicers" in the pandas docs.
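A hedged alternative sketch that avoids IndexSlice, starting from the original test data: select the matching rows with a boolean mask on the second level and assign the values reindexed into that row order:

end = multiindexed.index.get_level_values('end')
mask = end.isin(singleindexed.index)
# reindex makes the assignment order-safe even if the levels are not sorted
multiindexed.loc[mask, 'value'] = singleindexed['value'].reindex(end[mask]).values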