How to subset a dataframe, groupby and export the dataframes as multiple sheets of one Excel file in Python - pandas
Python newbie here
In the dataset below:
import pandas as pd
import numpy as np
data = {'Gender':['M','M','M','M','F','F','F','F','M','M','M','M','F','F','F','F'],
        'Location':['NE','NE','NE','NE','SW','SW','SW','SW','SE','SE','SE','SE','NC','NC','NC','NC'],
        'Type':['L','L','L','L','L','L','L','L','R','R','R','R','R','R','R','R'],
        'PDP':['<10','<10','<10','<10',10,10,10,10,20,20,20,20,'>20','>20','>20','>20'],
        'PDP_code':[1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4],
        'diff':[-1,-1,-1,-1,0,0,0,0,1,1,1,1,3,3,3,3],
        'series':[1,2,4,8,1,2,4,8,1,2,4,8,1,2,4,8],
        'Revenue_YR1':[1150.78,1162.34,1188.53,1197.69,2108.07,2117.76,2129.48,1319.51,1416.87,1812.54,1819.57,1991.97,2219.28,2414.73,2169.91,2149.19],
        'Revenue_YR2':[250.78,262.34,288.53,297.69,308.07,317.7,329.81,339.15,346.87,382.54,369.59,399.97,329.28,347.73,369.91,349.12],
        'Revenue_YR3':[240.18,232.14,258.53,276.69,338.07,307.74,359.16,339.25,365.87,392.48,399.97,410.75,429.08,448.39,465.15,469.33],
        'Revenue_YR4':[270.84,282.14,298.53,306.69,318.73,327.47,369.63,389.59,398.75,432.18,449.78,473.55,494.85,509.39,515.52,539.23],
        'Revenue_YR5':[251.78,221.34,282.53,272.69,310.07,317.7,329.81,333.15,334.87,332.54,336.59,339.97,329.28,334.73,336.91,334.12],
        'Revenue_YR6':[240.18,232.14,258.53,276.69,338.07,307.74,359.16,339.25,365.87,392.48,399.97,410.75,429.08,448.39,465.15,469.33],
        'Revenue_YR7':[27.84,28.14,29.53,30.69,18.73,27.47,36.63,38.59,38.75,24.18,24.78,21.55,13.85,9.39,15.52,39.23],
        'Revenue_YR8':[279.84,289.14,299.53,309.69,318.73,327.47,336.63,398.59,398.75,324.18,324.78,321.55,333.85,339.39,315.52,319.23],
        }
df = pd.DataFrame(data, columns=['Gender','Location','Type','PDP','PDP_code','diff','series',
                                 'Revenue_YR1','Revenue_YR2','Revenue_YR3','Revenue_YR4',
                                 'Revenue_YR5','Revenue_YR6','Revenue_YR7','Revenue_YR8'])
df.head(5)
I want a Pythonic way of doing the following:
subset df into 4 dataframes based on the unique Location values, resulting in NE, SW, SE & NC dataframes
aggregate all the Revenue_YR columns while grouping by the series and PDP_code columns, and export all the aggregated dataframes (NE, SW, SE & NC) as multiple sheets of one xlsx file
My attempt
### this code returns the output of only 1 df instead of 4, because to_excel
### overwrites 12312021_output.xlsx on every pass through the loop; I need help
### aggregating each of the 4 dataframes and exporting them to 4 sheets of 12312021_output.xlsx
for i, part_df in df.groupby('Location'):
    part_df.groupby(['series','PDP_code'])[['Revenue_YR1', 'Revenue_YR2', 'Revenue_YR3',
        'Revenue_YR4', 'Revenue_YR5', 'Revenue_YR6',
        'Revenue_YR7']].mean().unstack().style.background_gradient(
            cmap='Blues').to_excel('12312021_output.xlsx')
Please share your code.
You can use pandas.ExcelWriter, and your loop (which I improved slightly for readability):

import pandas as pd

cols = df.filter(like='Revenue_YR').columns
with pd.ExcelWriter("output.xlsx") as writer:
    for g, d in df.groupby('Location'):
        (d.groupby(['series', 'PDP_code'])[cols].mean().unstack()
          .style.background_gradient(cmap='Blues')
        ).to_excel(writer, sheet_name=g)
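As a quick sanity check of the two building blocks used above, `filter(like=...)` selects every column whose name contains the substring, and iterating a groupby yields one (name, sub-frame) pair per unique value. A minimal sketch on a small toy frame (not the full dataset from the question):

```python
import pandas as pd

df = pd.DataFrame({
    'Location': ['NE', 'NE', 'SW', 'SE', 'NC'],
    'series': [1, 2, 1, 1, 1],
    'PDP_code': [1, 1, 2, 3, 4],
    'Revenue_YR1': [100.0, 110.0, 200.0, 300.0, 400.0],
    'Revenue_YR2': [10.0, 11.0, 20.0, 30.0, 40.0],
})

# filter(like=...) picks every column whose name contains the substring
cols = df.filter(like='Revenue_YR').columns
print(list(cols))      # ['Revenue_YR1', 'Revenue_YR2']

# groupby('Location') yields one sub-frame per unique location,
# which is what becomes one sheet per location in the ExcelWriter loop
groups = {name: g for name, g in df.groupby('Location')}
print(sorted(groups))  # ['NC', 'NE', 'SE', 'SW']
```

Each value in `groups` is a full sub-DataFrame, so any aggregation applied inside the loop sees only that location's rows.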
Related
How to include an if condition after merging two dataframes?
Currently in my code I'm merging two dataframes from my desktop, dropping some duplicates and some columns, and the final output is converted to a picture to be sent via Telegram.

import pandas as pd
import dataframe_image as di
import telepot

df = pd.read_csv('a.csv', delimiter=';')
df1 = pd.read_csv('b.csv', delimiter=';')
total = pd.merge(df, df1, on="Conc", how="inner")
total = total.drop_duplicates(subset=["A"], keep="first")
total = total.drop(['A', 'B', 'C', 'D', 'E', 'Conc'], axis=1)
di.export(total, 'total.png')

bot = telepot.Bot('token')
bot.sendPhoto(chatid, photo=open('total.png', 'rb'))

This is the good path, in case the merge gives me a new dataframe with text in it. How can I handle the situation where the merge outputs an empty df, so I can send "NA" via Telegram? Many thanks
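The branch the question asks for can be sketched with `DataFrame.empty`: an inner merge with no matching keys produces a frame with zero rows, which that attribute detects. A minimal sketch, with the Telegram send replaced by a placeholder `message` string (the bot calls are an assumption of the original setup, not shown here):

```python
import pandas as pd

# Two frames whose 'Conc' keys never match
df = pd.DataFrame({'Conc': [1, 2], 'x': ['a', 'b']})
df1 = pd.DataFrame({'Conc': [3, 4], 'y': ['c', 'd']})

total = pd.merge(df, df1, on='Conc', how='inner')

# An inner merge with no common keys yields an empty frame
if total.empty:
    message = 'NA'              # send this text via the bot instead of a picture
else:
    message = 'sending image'   # export to PNG and send as before

print(message)  # NA
```

In the real script, the `if total.empty:` branch would call something like `bot.sendMessage(chatid, 'NA')` before the image-export path runs.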
Lambda functions on multiple columns
I am trying to extract only the number from multiple columns in my pandas DataFrame. I am able to do so column by column, however I would like to perform this operation on multiple columns simultaneously. My reproduced example:

import pandas as pd
import re
import numpy as np
import seaborn as sns

df = sns.load_dataset('diamonds')

# Create the column once again
df['clarity2'] = df['clarity']
df.head()

df[['clarity', 'clarity2']].apply(lambda x: x.str.extract(r'(\d+)'))
If you want a tuple:

cols = ['clarity', 'clarity2']
tuple(df[col].str.extract(r'(\d+)') for col in cols)

If you want a list:

cols = ['clarity', 'clarity2']
[df[col].str.extract(r'(\d+)') for col in cols]

Adding them to the original data:

df['digit1'], df['digit2'] = [df[col].str.extract(r'(\d+)') for col in cols]
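The original `apply` attempt can also be made to work directly: by default `str.extract` returns a DataFrame per column, which confuses the column-wise `apply`; passing `expand=False` makes it return a Series instead, so the result assembles cleanly. A small sketch on toy data (not the diamonds dataset):

```python
import pandas as pd

df = pd.DataFrame({'clarity': ['SI2', 'VS1'], 'clarity2': ['SI2', 'VS1']})
cols = ['clarity', 'clarity2']

# expand=False makes str.extract return a Series per column,
# so apply produces a single DataFrame of extracted digits
out = df[cols].apply(lambda s: s.str.extract(r'(\d+)', expand=False))
print(out)
#   clarity clarity2
# 0       2        2
# 1       1        1
```

Note the extracted values are strings; chain `.astype(int)` if numeric digits are needed.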
How do I swap two (or more) columns between two different data tables in pandas?
New here and I am new to programming. So, as the title says, I am trying to swap two full columns from two different files (the columns have the same name but different data). I started with this:

import numpy as np
import pandas as pd
from pandas import DataFrame

df = pd.read_csv('table1.csv', col_name='COL1')
df1 = pd.read_csv('table2.csv', col_name='COL1')
df1.COL1 = df.COL1

But now I am stuck: how do I select a whole column, and how can I print the new combined table to a new file (i.e. table 3)?
You could perform the swap by copying one column into a temporary one and deleting it afterwards, like this:

df1['temp'] = df1['COL1']
df1['COL1'] = df['COL1']
df['COL1'] = df1['temp']
del df1['temp']

and then write the result to a third CSV via to_csv:

df1.to_csv('table3.csv')
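The temporary column can also be avoided with Python's tuple assignment, since the right-hand side is evaluated before either assignment happens. A minimal sketch on in-memory frames (the CSV reading/writing from the question is left out):

```python
import pandas as pd

df = pd.DataFrame({'COL1': [1, 2, 3]})
df1 = pd.DataFrame({'COL1': [7, 8, 9]})

# Tuple assignment swaps the columns without an explicit temporary;
# .copy() ensures the two frames do not share the same underlying data
df['COL1'], df1['COL1'] = df1['COL1'].copy(), df['COL1'].copy()

print(df['COL1'].tolist())   # [7, 8, 9]
print(df1['COL1'].tolist())  # [1, 2, 3]
```

After the swap, either frame can be written out with `to_csv('table3.csv')` as in the answer above.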
Parse JSON to Excel - Pandas + xlwt
I'm kind of half way through this functionality. However, I need some help with formatting the data in the sheet that contains the output. My current code:

response = {"sic2":[{"confidence":1.0,"label":"73"}],"sic4":[{"confidence":0.5,"label":"7310"}],"sic8":[{"confidence":0.5,"label":"73101000"},{"confidence":0.25,"label":"73102000"},{"confidence":0.25,"label":"73109999"}]}

# Create a Pandas dataframe from the data.
df = pd.DataFrame.from_dict(json.loads(response), orient='index')

# Create a Pandas Excel writer using XlsxWriter as the engine.
writer = pd.ExcelWriter('pandas_simple.xlsx', engine='xlsxwriter')

# Convert the dataframe to an XlsxWriter Excel object.
df.to_excel(writer, sheet_name='Sheet1')

# Close the Pandas Excel writer and output the Excel file.
writer.save()

The output is as follows... What I want is something like this... I suppose that first I would need to extract and organise the headers. This would also include manually assigning a header for a column that cannot have a header by default, as in the case of the SIC column. After that, I can feed data to the columns with their respective headers.
You can loop over the keys of your JSON object and create a dataframe from each, then use pd.concat to combine them all:

import json
import pandas as pd

response = '{"sic2":[{"confidence":1.0,"label":"73"}],"sic4":[{"confidence":0.5,"label":"7310"}],"sic8":[{"confidence":0.5,"label":"73101000"},{"confidence":0.25,"label":"73102000"},{"confidence":0.25,"label":"73109999"}]}'
json_data = json.loads(response)

all_frames = []
for k, v in json_data.items():
    df = pd.DataFrame(v)
    df['SIC Category'] = k
    all_frames.append(df)

final_data = pd.concat(all_frames).set_index('SIC Category')
print(final_data)

This prints:

              confidence     label
SIC Category
sic2                1.00        73
sic4                0.50      7310
sic8                0.50  73101000
sic8                0.25  73102000
sic8                0.25  73109999

Which you can export to Excel as before, through final_data.to_excel(writer, sheet_name='Sheet1').
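The loop-and-append can also be collapsed into one expression: `pd.concat` accepts a dict of frames and uses its keys as an outer index level. A compact sketch of the same idea, on a shortened version of the payload:

```python
import json
import pandas as pd

# Shortened payload, same shape as the question's response
response = ('{"sic2":[{"confidence":1.0,"label":"73"}],'
            '"sic4":[{"confidence":0.5,"label":"7310"}],'
            '"sic8":[{"confidence":0.5,"label":"73101000"},'
            '{"confidence":0.25,"label":"73102000"}]}')
data = json.loads(response)

# concat on a dict of frames uses the keys as an outer index level;
# dropping the inner integer level leaves one labelled row per record
final = (pd.concat({k: pd.DataFrame(v) for k, v in data.items()})
           .droplevel(1)
           .rename_axis('SIC Category'))
print(final)
```

This produces the same layout as the loop version, with one row per record indexed by SIC category.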
Average of selected rows in csv file
In a csv file, how can I calculate the average of selected rows in a column? I did this:

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Read the csv file:
df = pd.read_csv("D:\\xxxxx\\mmmmm.csv")

# Separate the columns and get the average:
# Skid:
S = df['Skid Number after milling'].mean()

But this just gave me the average for the entire column. Thank you for the help!
For selecting rows in a pandas dataframe or series you can use the .iloc attribute. For example, df['A'].iloc[3:5] selects the fourth and fifth row in column "A" of a DataFrame. Indexing starts at 0 and the number behind the colon is not included. This returns a pandas series.

You can do the same using numpy:

df["A"].values[3:5]

This already returns a numpy array. Possibilities to calculate the mean are therefore:

df['A'].iloc[3:5].mean()

or

df["A"].values[3:5].mean()

Also see the documentation about indexing in pandas.
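Both styles of selection can be checked on a tiny frame; a condition-based mask with `.loc` is also shown, since "selected rows" often means rows matching a criterion rather than a positional range. The column name follows the question; the numbers are made up:

```python
import pandas as pd

col = 'Skid Number after milling'
df = pd.DataFrame({col: [40.0, 42.0, 55.0, 60.0, 30.0]})

# Positional slice: rows 3 and 4 (0-based, end excluded)
positional_mean = df[col].iloc[3:5].mean()
print(positional_mean)   # (60.0 + 30.0) / 2 = 45.0

# Condition-based selection via a boolean mask
masked_mean = df.loc[df[col] > 45, col].mean()
print(masked_mean)       # (55.0 + 60.0) / 2 = 57.5
```

`.iloc` answers "rows at these positions", while `.loc` with a mask answers "rows where this condition holds"; either slice can feed `.mean()` directly.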