Calculating an average that depends on the number of non-zero values in the raw data in Python (pandas)

I am stuck on calculating an average that depends on how many of the variables are non-zero in Python.
I have 8 variables, as below.
Case 1.
Time df1 df2 df3 df4 df5 df6 df7 df8
2020-01-01 220 250 235 215 221 221 220 253
In this case I can just calculate the average like below:
df['dfaverage']=(df['df1']+df['df2']+df['df3']+df['df4']+df['df5']+df['df6']+df['df7']+df['df8'])/8
The output will be 229.4.
But if one of the values is zero, how would I ignore that value in the calculation?
Case 2.
Time df1 df2 df3 df4 df5 df6 df7 df8
2020-01-01 220 250 235 215 221 221 220 0
The output should be 226 for case 2, but when I run the code I get 197.8.
How can I ignore 0 when calculating the average?

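You can replace 0 with missing values and use mean (this needs import numpy as np; numeric_only=True skips the non-numeric Time column, which recent pandas versions no longer drop silently):
df['dfaverage'] = df.replace(0, np.nan).mean(axis=1, numeric_only=True)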
print (df)
Time df1 df2 df3 df4 df5 df6 df7 df8 dfaverage
0 2020-01-01 220 250 235 215 221 221 220 0 226.0
Or, if you need to specify the column names, use a list:
cols = [ 'df1', 'df2', 'df3', 'df4', 'df5', 'df6', 'df7', 'df8']
df['dfaverage'] = df[cols].replace(0, np.nan).mean(axis=1)
print (df)
Time df1 df2 df3 df4 df5 df6 df7 df8 dfaverage
0 2020-01-01 220 250 235 215 221 221 220 0 226.0
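A minimal sketch of the same idea using mask instead of replace, rebuilt from the Case 2 row in the question. The variable manual is only an illustrative cross-check (row sum divided by the count of non-zero entries) and is not part of the original answer.

```python
import pandas as pd

# Case 2 from the question: df8 is zero and should be ignored
df = pd.DataFrame({
    "Time": ["2020-01-01"],
    "df1": [220], "df2": [250], "df3": [235], "df4": [215],
    "df5": [221], "df6": [221], "df7": [220], "df8": [0],
})

cols = [f"df{i}" for i in range(1, 9)]
values = df[cols]

# mask() turns every cell where the condition holds into NaN;
# mean() skips NaN by default, so zeros no longer count
df["dfaverage"] = values.mask(values.eq(0)).mean(axis=1)

# Cross-check: row sum divided by the number of non-zero entries
# (a row made only of zeros would give 0/0 = NaN here)
manual = values.sum(axis=1) / values.ne(0).sum(axis=1)

print(df)
```

Selecting the numeric columns explicitly also keeps the Time column out of the calculation.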

Related

Create a dataframe from a series with a TimeSeriesIndex multiplied by another series

Let's say I have a series, ser1 with a TimeSeriesIndex length x. I also have another series, ser2 length y. How do I multiply these so that I get a dataframe shape (x,y) where the index is from ser1 and the columns are the indices from ser2. I want every element of ser2 to be multiplied by the values of each element in ser1.
import pandas as pd
ser1 = pd.Series([100, 105, 110, 114, 89],index=pd.date_range(start='2021-01-01', end='2021-01-05', freq='D'), name='test')
test_ser2 = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])
Perhaps this is more elegantly done with numpy.
Try this using np.outer (after import numpy as np) with the pandas DataFrame constructor:
pd.DataFrame(np.outer(ser1, test_ser2), index=ser1.index, columns=test_ser2.index)
Output:
a b c d e
2021-01-01 100 200 300 400 500
2021-01-02 105 210 315 420 525
2021-01-03 110 220 330 440 550
2021-01-04 114 228 342 456 570
2021-01-05 89 178 267 356 445
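For comparison, here is a sketch of the same product written with plain NumPy broadcasting rather than np.outer. It reuses the two series from the question; grid and result are illustrative names.

```python
import numpy as np
import pandas as pd

ser1 = pd.Series(
    [100, 105, 110, 114, 89],
    index=pd.date_range(start="2021-01-01", end="2021-01-05", freq="D"),
    name="test",
)
test_ser2 = pd.Series([1, 2, 3, 4, 5], index=["a", "b", "c", "d", "e"])

# A (5, 1) column times a (5,) row broadcasts to a (5, 5) grid,
# which is exactly the outer product of the two value arrays
grid = ser1.to_numpy()[:, None] * test_ser2.to_numpy()

result = pd.DataFrame(grid, index=ser1.index, columns=test_ser2.index)
print(result)
```

Working on the raw arrays sidesteps index alignment, which is why the index and columns are attached explicitly afterwards.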

pandas dataframe how to shift rows based on date

I am trying to assess the impact of a promotional campaign on our customers. The goal is to assess revenue from the point the promotion was offered. However promotion was offered for different customers at different points. How do I rearrange the data to Month 0, Month 1, Month 2, Month 3. Month 0 being the month the customer first got the promotion.
With the self-explanatory code below you can get your desired output:
# Create DataFrame
import pandas as pd
df = pd.DataFrame({"Account": [1, 2, 3, 4, 5, 6],
                   "May-18": [181, 166, 221, 158, 210, 159],
                   "Jun-18": [178, 222, 230, 189, 219, 200],
                   "Jul-18": [184, 207, 175, 167, 201, 204],
                   "Aug-18": [161, 174, 178, 233, 223, 204],
                   "Sep-18": [218, 209, 165, 165, 204, 225],
                   "Oct-18": [199, 206, 205, 196, 212, 205],
                   "Nov-18": [231, 196, 189, 218, 234, 235],
                   "Dec-18": [173, 178, 189, 218, 234, 205],
                   "Promotion Month": ["Sep-18", "Aug-18", "Jul-18", "May-18", "Aug-18", "Jun-18"]})
df = df.set_index("Account")
cols = ["May-18", "Jun-18", "Jul-18", "Aug-18", "Sep-18", "Oct-18", "Nov-18", "Dec-18", "Promotion Month"]
df = df[cols]
# Define function to select the four months after promotion
def selectMonths(row):
    cols = df.columns.to_list()
    colMonth0 = cols.index(row["Promotion Month"])
    colsOut = cols[colMonth0:colMonth0+4]
    out = pd.Series(row[colsOut].to_list())
    return out
# Apply the function and set the index and columns of output DataFrame
out = df.apply(selectMonths, axis=1)
out.index = df.index
out.columns=["Month 0","Month 1","Month 2","Month 3"]
Then the output you get is:
>>> out
Month 0 Month 1 Month 2 Month 3
Account
1 218 199 231 173
2 174 209 206 196
3 175 178 165 205
4 158 189 167 233
5 223 204 212 234
6 200 204 204 225
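If the frame is large, the row-wise apply can be replaced by NumPy integer indexing. The following is a sketch, not the original answer's method; it assumes every promotion month is followed by at least three more month columns (true for this data), otherwise the indexing raises an IndexError. The names months, start, col_idx and row_idx are illustrative.

```python
import numpy as np
import pandas as pd

months = ["May-18", "Jun-18", "Jul-18", "Aug-18",
          "Sep-18", "Oct-18", "Nov-18", "Dec-18"]
df = pd.DataFrame({
    "Account": [1, 2, 3, 4, 5, 6],
    "May-18": [181, 166, 221, 158, 210, 159],
    "Jun-18": [178, 222, 230, 189, 219, 200],
    "Jul-18": [184, 207, 175, 167, 201, 204],
    "Aug-18": [161, 174, 178, 233, 223, 204],
    "Sep-18": [218, 209, 165, 165, 204, 225],
    "Oct-18": [199, 206, 205, 196, 212, 205],
    "Nov-18": [231, 196, 189, 218, 234, 235],
    "Dec-18": [173, 178, 189, 218, 234, 205],
    "Promotion Month": ["Sep-18", "Aug-18", "Jul-18", "May-18", "Aug-18", "Jun-18"],
}).set_index("Account")

values = df[months].to_numpy()

# Column position of each account's promotion month
start = pd.Index(months).get_indexer(df["Promotion Month"])

# Row i reads columns start[i], start[i] + 1, start[i] + 2, start[i] + 3
col_idx = start[:, None] + np.arange(4)
row_idx = np.arange(len(df))[:, None]

out = pd.DataFrame(
    values[row_idx, col_idx],
    index=df.index,
    columns=[f"Month {k}" for k in range(4)],
)
print(out)
```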

Extract dictionary value from a list contained in Pandas dataframe column

I'm trying to extract values from dictionaries contained in a list in a pandas DataFrame column. The objective is to split the id key into multiple columns. Sample data looks like:
Column_Header
[{'id': '498', 'relTypeId': '2'},{'id': '499', 'relTypeId': '3'}]
[{'id': '499', 'relTypeId': '3'},{'id': '500', 'relTypeId': '4'},{'id': '501', 'relTypeId': '5'}]
I have tried the following:
list(map(lambda x: x["id"], df["Column_Header"]))
But I get the following error:
"list indices must be integers or slices, not str"
The desired output is:
col1|col2|col3
498 |499 |
499 |500 |501
Can someone please help?
We can explode first, then create the additional key with cumcount, and pivot:
s=df.Column_Header.explode().str['id']
s=pd.crosstab(index=s.index,columns=s.groupby(level=0).cumcount(),values=s,aggfunc='sum')
Out[133]:
col_0 0 1 2
row_0
0 498 499 NaN
1 499 500 501
If performance is important, use a nested list comprehension that selects the id key from each dictionary:
df = pd.DataFrame([[y['id'] for y in x] for x in df['Column_Header']], index=df.index)
print (df)
0 1 2
0 498 499 None
1 499 500 501
If some values may be missing, use:
L = [[y['id'] for y in x] if isinstance(x, list) else [None] for x in df['Column_Header']]
df = pd.DataFrame(L, index=df.index)
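To tie this back to the question, here is a small sketch that reproduces the error and then builds the col1/col2/col3 layout. It assumes the relTypeId values are strings (the sample in the question is ambiguous on that point); ids is an illustrative name.

```python
import pandas as pd

df = pd.DataFrame({
    "Column_Header": [
        [{"id": "498", "relTypeId": "2"}, {"id": "499", "relTypeId": "3"}],
        [{"id": "499", "relTypeId": "3"}, {"id": "500", "relTypeId": "4"},
         {"id": "501", "relTypeId": "5"}],
    ]
})

# Each cell is a list of dicts, so x["id"] indexes a list with a string
try:
    [x["id"] for x in df["Column_Header"]]
except TypeError as exc:
    print(exc)

# Iterate over the cells and over the dicts inside each cell
ids = pd.DataFrame(
    [[d["id"] for d in cell] for cell in df["Column_Header"]],
    index=df.index,
)

# Rename the positional columns 0, 1, 2 to col1, col2, col3
ids.columns = [f"col{i + 1}" for i in range(ids.shape[1])]
print(ids)
```

Shorter rows are padded with missing values, which matches the blank cell in the desired output.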

How to reset pandas data reader index? [duplicate]

This seems rather obvious, but I can't seem to figure out how to convert the index of a data frame to a column.
For example:
df=
gi ptt_loc
0 384444683 593
1 384444684 594
2 384444686 596
To,
df=
index1 gi ptt_loc
0 0 384444683 593
1 1 384444684 594
2 2 384444686 596
either:
df['index1'] = df.index
or, .reset_index:
df = df.reset_index(level=0)
so, if you have a multi-index frame with 3 levels of index, like:
>>> df
val
tick tag obs
2016-02-26 C 2 0.0139
2016-02-27 A 2 0.5577
2016-02-28 C 6 0.0303
and you want to convert the 1st (tick) and 3rd (obs) levels in the index into columns, you would do:
>>> df.reset_index(level=['tick', 'obs'])
tick obs val
tag
C 2016-02-26 2 0.0139
A 2016-02-27 2 0.5577
C 2016-02-28 6 0.0303
rename_axis + reset_index
You can first rename your index to a desired label, then elevate it to a column with reset_index:
df = df.rename_axis('index1').reset_index()
print(df)
index1 gi ptt_loc
0 0 384444683 593
1 1 384444684 594
2 2 384444686 596
This also works for MultiIndex dataframes:
print(df)
# val
# tick tag obs
# 2016-02-26 C 2 0.0139
# 2016-02-27 A 2 0.5577
# 2016-02-28 C 6 0.0303
df = df.rename_axis(['index1', 'index2', 'index3']).reset_index()
print(df)
index1 index2 index3 val
0 2016-02-26 C 2 0.0139
1 2016-02-27 A 2 0.5577
2 2016-02-28 C 6 0.0303
To provide a bit more clarity, let's look at a DataFrame with two levels in its index (a MultiIndex).
index = pd.MultiIndex.from_product([['TX', 'FL', 'CA'],
['North', 'South']],
names=['State', 'Direction'])
df = pd.DataFrame(index=index,
data=np.random.randint(0, 10, (6,4)),
columns=list('abcd'))
The reset_index method, called with the default parameters, converts all index levels to columns and uses a simple RangeIndex as new index.
df.reset_index()
Use the level parameter to control which index levels are converted into columns. If possible, use the level name, which is more explicit. If there are no level names, you can refer to each level by its integer location, which begins at 0 from the outside. You can use a scalar value here or a list of all the levels you would like to reset.
df.reset_index(level='State') # same as df.reset_index(level=0)
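To see what each call returns, here is a small sketch on the same two-level index. It fills the frame with np.arange instead of random integers so the result is reproducible; all_levels and one_level are illustrative names.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product(
    [["TX", "FL", "CA"], ["North", "South"]],
    names=["State", "Direction"],
)
df = pd.DataFrame(np.arange(24).reshape(6, 4), index=index, columns=list("abcd"))

# Default call: both levels become columns, the index becomes a RangeIndex
all_levels = df.reset_index()
print(all_levels.columns.tolist())

# Only 'State' becomes a column; 'Direction' stays as the index
one_level = df.reset_index(level="State")
print(one_level.index.name, one_level.columns.tolist())
```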
In the rare event that you want to preserve the index and turn the index into a column, you can do the following:
# for a single level
df.assign(State=df.index.get_level_values('State'))
# for all levels
df.assign(**df.index.to_frame())
For a MultiIndex you can extract one of its levels (a subindex) using
df['si_name'] = df.index.get_level_values('si_name')
where si_name is the name of the subindex.
If you want to use the reset_index method and also preserve your existing index you should use:
df.reset_index().set_index('index', drop=False)
or to change it in place:
df.reset_index(inplace=True)
df.set_index('index', drop=False, inplace=True)
For example:
print(df)
gi ptt_loc
0 384444683 593
4 384444684 594
9 384444686 596
print(df.reset_index())
index gi ptt_loc
0 0 384444683 593
1 4 384444684 594
2 9 384444686 596
print(df.reset_index().set_index('index', drop=False))
index gi ptt_loc
index
0 0 384444683 593
4 4 384444684 594
9 9 384444686 596
And if you want to get rid of the index label you can do:
df2 = df.reset_index().set_index('index', drop=False)
df2.index.name = None
print(df2)
index gi ptt_loc
0 0 384444683 593
4 4 384444684 594
9 9 384444686 596
This should do the trick (if not multilevel indexing) -
df.reset_index().rename({'index':'index1'}, axis = 'columns')
You can also pass inplace=True to rename instead of assigning the result to a new variable, but only on a stored DataFrame: in the chained call above it would modify a temporary copy and return None.
df1 = pd.DataFrame({"gi":[232,66,34,43],"ptt":[342,56,662,123]})
p = df1.index.values
df1.insert( 0, column="new",value = p)
df1
new gi ptt
0 0 232 342
1 1 66 56
2 2 34 662
3 3 43 123
Since pandas 1.5.0, reset_index accepts the argument names, a list of names to give the index columns. Here is a reproducible example with one index column:
import pandas as pd
df = pd.DataFrame({"gi":[232,66,34,43],"ptt":[342,56,662,123]})
gi ptt
0 232 342
1 66 56
2 34 662
3 43 123
df.reset_index(names=['new'])
Output:
new gi ptt
0 0 232 342
1 1 66 56
2 2 34 662
3 3 43 123
This can also easily be applied with MultiIndex. Just create a list of the names you want.
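A minimal sketch of the MultiIndex case, assuming pandas 1.5.0 or later. The two-level index and the names state and direction are made up for illustration.

```python
import pandas as pd

# A MultiIndex whose levels have no names
index = pd.MultiIndex.from_product([["TX", "FL"], ["North", "South"]])
df = pd.DataFrame({"val": [1, 2, 3, 4]}, index=index)

# One name per index level, in level order
flat = df.reset_index(names=["state", "direction"])
print(flat)
```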
I usually do it this way:
df = df.assign(index1=df.index)

aggregate data by quarter

I have a pivot pandas data frame (sales by region) that got created from another pandas data frame (sales by store) using the pivot_table method.
As an example:
df = pd.DataFrame(
{'store':['A','B','C','D','E']*7,
'region':['NW','NW','SW','NE','NE']*7,
'date':['2017-03-30']*5+['2017-04-05']*5+['2017-04-07']*5+['2017-04-12']*5+['2017-04-13']*5+['2017-04-17']*5+['2017-04-20']*5,
'sales':[30,1,133,9,1,30,3,135,9,11,30,1,140,15,15,25,10,137,9,3,29,10,137,9,11,30,19,145,20,10,30,8,141,25,25]
})
df['date'] = pd.to_datetime(df['date'])
df_sales = df.pivot_table(index = ['region'], columns = ['date'], aggfunc = [np.sum], margins = True)
df_sales = df_sales.iloc[:, :-1]  # drop the last (margins) column; .ix was removed in pandas 1.0
My goal is to do the following to the sales data frame, df_sales.
Create a new dataframe that summarizes sales by quarter. I could use the original dataframe df, or df_sales.
The data here covers only two quarters (USA fiscal calendar year), so the quarterly aggregated data frame would look like:
region 2017Q1 2017Q2
NE 10 27
NW 31 37.5
SW 133 139.17
I take the average for all days in Q1, and same for Q2. Thus, for example for the North east region, 'NE', the Q1 is the average of only one day 2017-03-30, i.e., 10, and for the Q2 is the average across 2017-04-05 to 2017-04-20, i.e.,
(20+30+12+20+30+50)/6=27
Any suggestions?
ADDITIONAL NOTE: I would ideally do the quarter aggregations on the df_sales pivoted table since it's a much smaller dataframe to keep in memory. The current solution does it on the original df, but I am still seeking a way to do it in the df_sales dataframe.
UPDATE:
Setup:
df.date = pd.to_datetime(df.date)
df_sales = df.pivot_table(index='region', columns='date', values='sales', aggfunc='sum')
In [318]: df_sales
Out[318]:
date 2017-03-30 2017-04-05 2017-04-07 2017-04-12 2017-04-13 2017-04-17 2017-04-20
region
NE 10 20 30 12 20 30 50
NW 31 33 31 35 39 49 38
SW 133 135 140 137 137 145 141
Solution:
In [319]: (df_sales.groupby(pd.PeriodIndex(df_sales.columns, freq='Q'), axis=1)
...: .apply(lambda x: x.sum(axis=1)/x.shape[1])
...: )
Out[319]:
date 2017Q1 2017Q2
region
NE 10.0 27.000000
NW 31.0 37.500000
SW 133.0 139.166667
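Note that groupby(..., axis=1) is deprecated in recent pandas versions (2.1 and later). A sketch of an equivalent that avoids it is to transpose, group the rows, and transpose back. The frame below is rebuilt by hand from the df_sales output shown above, and quarters and quarterly are illustrative names.

```python
import pandas as pd

dates = pd.DatetimeIndex(
    ["2017-03-30", "2017-04-05", "2017-04-07", "2017-04-12",
     "2017-04-13", "2017-04-17", "2017-04-20"],
    name="date",
)
df_sales = pd.DataFrame(
    [[10, 20, 30, 12, 20, 30, 50],
     [31, 33, 31, 35, 39, 49, 38],
     [133, 135, 140, 137, 137, 145, 141]],
    index=pd.Index(["NE", "NW", "SW"], name="region"),
    columns=dates,
)

# One quarter label per date column
quarters = pd.PeriodIndex(df_sales.columns, freq="Q")

# Transpose so dates are rows, average within each quarter, transpose back
quarterly = df_sales.T.groupby(quarters).mean().T
print(quarterly)
```

Taking the mean per quarter gives the same numbers as sum divided by the column count, as long as the pivoted table has no missing cells.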
Solution based on the original DF:
In [253]: (df.groupby(['region', pd.PeriodIndex(df.date, freq='Q-DEC')])
...: .apply(lambda x: x['sales'].sum()/x['date'].nunique())
...: .to_frame('avg').unstack('date')
...: )
...:
Out[253]:
avg
date 2017Q1 2017Q2
region
NE 10.0 27.000000
NW 31.0 37.500000
SW 133.0 139.166667
NOTE: df - is the original DF (before "pivoting")
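The same result can also be reached from the original df in two explicit steps: first the regional total per day, then the average of those totals per quarter. This is a sketch using the sample data from the question; daily and quarterly are illustrative names.

```python
import pandas as pd

dates = ["2017-03-30", "2017-04-05", "2017-04-07", "2017-04-12",
         "2017-04-13", "2017-04-17", "2017-04-20"]
df = pd.DataFrame({
    "store": ["A", "B", "C", "D", "E"] * 7,
    "region": ["NW", "NW", "SW", "NE", "NE"] * 7,
    "date": [d for d in dates for _ in range(5)],
    "sales": [30, 1, 133, 9, 1, 30, 3, 135, 9, 11, 30, 1, 140, 15, 15,
              25, 10, 137, 9, 3, 29, 10, 137, 9, 11, 30, 19, 145, 20, 10,
              30, 8, 141, 25, 25],
})
df["date"] = pd.to_datetime(df["date"])

# Step 1: total sales per region and day
daily = df.groupby(["region", "date"], as_index=False)["sales"].sum()

# Step 2: label each day with its calendar quarter and average the daily totals
daily["quarter"] = daily["date"].dt.to_period("Q")
quarterly = daily.groupby(["region", "quarter"])["sales"].mean().unstack("quarter")
print(quarterly)
```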