Average of selected rows in csv file - pandas

In a csv file, how can I calculate the average of selected rows in a column?
I did this:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
#Read the csv file:
df = pd.read_csv("D:\\xxxxx\\mmmmm.csv")
#Separate the columns and get the average:
# Skid:
S = df['Skid Number after milling'].mean()
But this just gave me the average of the entire column.
Thank you for the help!

For selecting rows in a pandas DataFrame or Series, you can use the .iloc attribute.
For example, df['A'].iloc[3:5] selects the fourth and fifth rows in column "A" of a DataFrame. Indexing starts at 0, and the position after the colon is excluded. This returns a pandas Series.
You can do the same using numpy: df["A"].values[3:5]
This already returns a numpy array.
Possibilities to calculate the mean are therefore:
df['A'].iloc[3:5].mean()
or
df["A"].values[3:5].mean()
Also see the documentation about indexing in pandas.
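Putting it together for the question's column, a minimal sketch (the row range 3:10 is an assumption for illustration; the question doesn't say which rows should be averaged):
import pandas as pd
df = pd.read_csv("D:\\xxxxx\\mmmmm.csv")
# mean of the rows at positions 3 through 9 (the stop position 10 is excluded)
S = df['Skid Number after milling'].iloc[3:10].mean()
print(S)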

Related

How to subset a dataframe, groupby and export the dataframes as multiple sheets of one Excel file in Python

Python newbie here
In the dataset below:
import pandas as pd
import numpy as np
data = {'Gender': ['M','M','M','M','F','F','F','F','M','M','M','M','F','F','F','F'],
        'Location': ['NE','NE','NE','NE','SW','SW','SW','SW','SE','SE','SE','SE','NC','NC','NC','NC'],
        'Type': ['L','L','L','L','L','L','L','L','R','R','R','R','R','R','R','R'],
        'PDP': ['<10','<10','<10','<10',10,10,10,10,20,20,20,20,'>20','>20','>20','>20'],
        'PDP_code': [1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4],
        'diff': [-1,-1,-1,-1,0,0,0,0,1,1,1,1,3,3,3,3],
        'series': [1,2,4,8,1,2,4,8,1,2,4,8,1,2,4,8],
        'Revenue_YR1': [1150.78,1162.34,1188.53,1197.69,2108.07,2117.76,2129.48,1319.51,1416.87,1812.54,1819.57,1991.97,2219.28,2414.73,2169.91,2149.19],
        'Revenue_YR2': [250.78,262.34,288.53,297.69,308.07,317.7,329.81,339.15,346.87,382.54,369.59,399.97,329.28,347.73,369.91,349.12],
        'Revenue_YR3': [240.18,232.14,258.53,276.69,338.07,307.74,359.16,339.25,365.87,392.48,399.97,410.75,429.08,448.39,465.15,469.33],
        'Revenue_YR4': [270.84,282.14,298.53,306.69,318.73,327.47,369.63,389.59,398.75,432.18,449.78,473.55,494.85,509.39,515.52,539.23],
        'Revenue_YR5': [251.78,221.34,282.53,272.69,310.07,317.7,329.81,333.15,334.87,332.54,336.59,339.97,329.28,334.73,336.91,334.12],
        'Revenue_YR6': [240.18,232.14,258.53,276.69,338.07,307.74,359.16,339.25,365.87,392.48,399.97,410.75,429.08,448.39,465.15,469.33],
        'Revenue_YR7': [27.84,28.14,29.53,30.69,18.73,27.47,36.63,38.59,38.75,24.18,24.78,21.55,13.85,9.39,15.52,39.23],
        'Revenue_YR8': [279.84,289.14,299.53,309.69,318.73,327.47,336.63,398.59,398.75,324.18,324.78,321.55,333.85,339.39,315.52,319.23],
        }
df = pd.DataFrame(data, columns=['Gender','Location','Type','PDP','PDP_code','diff','series',
                                 'Revenue_YR1','Revenue_YR2','Revenue_YR3','Revenue_YR4','Revenue_YR5',
                                 'Revenue_YR6','Revenue_YR7','Revenue_YR8'])
df.head(5)
I want a pythonic way of doing the following:
subset df into 4 dataframes / lists based on unique Location, resulting in NE, SW, SE & NC dataframes
aggregate all the Revenue_YR columns while grouping by the series and PDP_code columns, and export all the aggregated dataframes (NE, SW, SE & NC) as multiple sheets of one xlsx file
My attempt
### this code returns output of 1 df instead of 4 dfs; I need help aggregating each of the 4 dataframes and exporting them to 4 sheets of 12312021_output.xlsx
for i, part_df in df.groupby('Location'):
    (part_df.groupby(['series', 'PDP_code'])
            [['Revenue_YR1', 'Revenue_YR2', 'Revenue_YR3', 'Revenue_YR4',
              'Revenue_YR5', 'Revenue_YR6', 'Revenue_YR7']]
            .mean().unstack()
            .style.background_gradient(cmap='Blues')
            .to_excel('12312021_output.xlsx'))
Please share your code.
You can use pandas.ExcelWriter, and your loop (which I improved slightly for readability):
import pandas as pd
with pd.ExcelWriter("output.xlsx") as writer:
    cols = df.filter(like='Revenue_YR').columns
    for g, d in df.groupby('Location'):
        (d.groupby(['series', 'PDP_code'])[cols].mean().unstack()
           .style.background_gradient(cmap='Blues')
        ).to_excel(writer, sheet_name=g)
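Two design notes: df.filter(like='Revenue_YR') picks up all eight Revenue_YR columns, whereas the hard-coded list in the original attempt stopped at Revenue_YR7; and sheet_name=g names each sheet after its Location value, so the workbook gets NE, SW, SE and NC sheets instead of one file being overwritten four times.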

merge the code of groupby ffill then bfill

In the following code, I want to do ffill then bfill according to item_name:
import pandas as pd
import numpy as np
df=pd.DataFrame({"date":['1/1/2021','1/2/2021','1/3/2021','1/4/2021','1/5/2021','1/1/2021','1/2/2021','1/3/2021','1/4/2021','1/5/2021'],
"item_name":["bracelet","bracelet","bracelet","bracelet","bracelet","earring","earring","earring","earring","earring"],
"quantity_sold":[np.nan,np.nan,3,4,np.nan,100,200,300,400,500]})
df['date']=pd.to_datetime(df['date'])
display(df)
#sort on the right fields before the calculation
df=df.sort_values(['item_name','date'])
#ffill then bfill within each item_name group
df['quantity_sold']=df.groupby("item_name")['quantity_sold'].ffill()
df['quantity_sold']=df.groupby("item_name")['quantity_sold'].bfill()
df
[Images in the original post showed the table before, and after, the ffill-then-bfill.]
The question is: is there a way to combine the following code from two lines into one?
df['quantity_sold']=df.groupby("item_name")['quantity_sold'].ffill()
df['quantity_sold']=df.groupby("item_name")['quantity_sold'].bfill()
Instead of writing this in 2 separate lines:
df['quantity_sold']=df.groupby("item_name")['quantity_sold'].ffill()
df['quantity_sold']=df.groupby("item_name")['quantity_sold'].bfill()
write it in a single line by method chaining:
df['quantity_sold']=df.groupby("item_name")['quantity_sold'].ffill().bfill()
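One caveat worth hedging: in the chained one-liner only the ffill is per-group; the trailing .bfill() runs on the already-flattened Series, so a group whose leading values were entirely NaN could back-fill from the next group. With this particular data the result is the same, but a strictly per-group sketch would use transform:
df['quantity_sold'] = (df.groupby("item_name")['quantity_sold']
                         .transform(lambda s: s.ffill().bfill()))  # both fills stay inside each group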

how to plot graded letters like A* in matplotlib

I'm a complete beginner and I have a college stats project comparing exam scores for our year group and the one below. I collected my own data, and since I do CS I decided to try to visualize it with pandas and matplotlib (my first time). I was able to read the csv file into a dataframe with columns = Level, Grade, Difficulty, Revision, Happy, MAG. Level is just 'year group', e.g. AS or A2, and MAG is like a minimum expected grade; the rest are numeric values out of 5.
I want to do some type of plotting but I can't seem to get it to work.
I want to plot Revision against Difficulty for the AS group and try to show a correlation. I also want to show a bar chart (if appropriate) for Grade vs MAG.
here is the csv https://docs.google.com/spreadsheets/d/169UKfcet1qh8ld-eI7B4U14HIl7pvgZfQLE45NrleX8/edit?usp=sharing
this is the code so far:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
df = pd.read_csv('Report Task.csv')
df.columns = ['Level','Grade','Difficulty','Revision','Happy','MAG'] #numerical values are out of 5
df[df.Level.str.match('AS')] #to get only AS group
plt.plot(df.Revision, df.Difficulty)
This is my first time ever posting on Stack Overflow, so I'm really sorry if I did something wrong.
For difficulty vs revision, you were using a line plot. You're probably looking for a scatter plot:
df = df[df.Level.str.match('AS')] # note the extra `df =` as per comments
plt.scatter(x=df.Revision, y=df.Difficulty)
plt.xlabel('Revision')
plt.ylabel('Difficulty')
Alternatively you can plot via pandas directly:
df = df[df.Level.str.match('AS')] # note the extra `df =` as per comments
df.plot.scatter(x='Revision', y='Difficulty')
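For the Grade vs MAG part of the question, a minimal sketch, assuming both columns hold letter grades such as A* (the actual values live in the linked sheet and aren't shown here), could count each Grade/MAG combination with a crosstab:
import pandas as pd
import matplotlib.pyplot as plt
# df as loaded in the question; Grade and MAG are categorical letter grades
counts = pd.crosstab(df['Grade'], df['MAG'])  # Grade on the rows, MAG on the columns
counts.plot.bar()                             # one group of bars per Grade, one bar per MAG value
plt.xlabel('Grade')
plt.ylabel('Count')
plt.show()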

Time series analysis - putting values into bins

How can I split the values in the category_lvl2 column into bins, one for each distinct value, and find the average amount for all the values in each bin?
For example, finding the average amount spent on coffee.
I have already performed feature scaling on the amounts.
You can use the groupby() method and provide it the groups you get from pd.cut(). The example below bins the data into 10 categories by the sepal_length column; those categories are then used to group the iris df. You can also bin on one variable and take the mean of another with the same groupby.
import pandas as pd
import seaborn as sns
iris = sns.load_dataset('iris')
bins = pd.cut(iris.sepal_length, 10)
iris.groupby(bins).sepal_length.mean()
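And for the "bin one variable, mean of another" case from the last sentence, a small sketch continuing from the snippet above (the petal_length choice is just for illustration):
bins = pd.cut(iris.sepal_length, 10)      # 10 equal-width bins over sepal_length
iris.groupby(bins).petal_length.mean()    # average petal_length inside each bin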

How do you append a column and drop a column with pandas dataframes? Can't figure out why it won't print the dataframe afterwards

The DataFrame that I am working with has a datetime column that I changed to a date object. I attempted to append the date object as the last column of the DataFrame, and I also wanted to drop the original datetime column.
Both the append and the drop fail to take effect: the final print shows the DataFrame unchanged. It should print the entire modified DataFrame (shortened here, since it is long).
My code:
import pandas as pd
import numpy as np
df7=pd.read_csv('kc_house_data.csv')
print(df7)
mydates = pd.to_datetime(df7['date']).dt.date
print(mydates)
df7.append(mydates)
df7.drop(['date'], axis=1)
print(df7)
Why drop/append? You can overwrite:
df7['date'] = pd.to_datetime(df7['date']).dt.date
import pandas as pd
import numpy as np
# read csv, convert column type
df7=pd.read_csv('kc_house_data.csv')
df7['date'] = pd.to_datetime(df7['date']).dt.date
print(df7)
Drop a column using df7.drop('date', axis=1, inplace=True).
Append a column using df7['date'] = mydates.
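For background (standard pandas behavior, not specific to this dataset): both calls in the original code do run, but each returns a new object that is immediately discarded, and DataFrame.append adds rows rather than columns (it was deprecated and later removed in pandas 2.0). A sketch of the reassignment style, if you prefer it to inplace=True:
# drop returns a new frame; keep it by reassigning
df7 = df7.drop('date', axis=1)
# adding a column is plain assignment, not append
df7['date'] = mydates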