How to get the average of all averages in pandas after groupby

I am working in pandas and below is the dataframe:
import pandas as pd

# initialize list of lists
data = [['A','Excel','1'], ['A','Word','0'], ['A','Java','1'],['B','Excel','1'],['B','Word','0'],['C','Word','0'],['D','Java','1'],['E','PPT','0'], ['E','Word','0'], ['E','Java','1']]
# Create the pandas DataFrame
df = pd.DataFrame(data, columns=['System','App','DevTool'])
I obtained the DevTool average for each system using the code below, but how can I get the total average of all these averages?
df.groupby('System')['DevTool'].mean()*100
System    DevTool Ratio
A                 66.67
B                 50.00
C                 00.00
D                100.00
E                 33.33
Please advise.

You can use:
import numpy as np
import pandas as pd

# initialize list of lists
data1 = [['A','Excel','1'], ['A','Word','0'], ['A','Java','1'],['B','Excel','1'],['B','Word','0'],['C','Word','0'],['D','Java','1'],['E','PPT','0'], ['E','Word','0'], ['E','Java','1']]
# Create the pandas DataFrame
dfdf = pd.DataFrame(data1, columns=['System','App','DevTool'])
# DevTool is stored as strings, so cast to int before averaging
dfdf['DevTool'] = dfdf['DevTool'].astype('int')
# average per System
PVT_T = dfdf.pivot_table(index='System', aggfunc={'DevTool':np.mean})
# average over all DevTool values
dfdf['DevTool'].mean()
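For reference (not part of the original answer), on the sample data above these two steps should give roughly:
PVT_T
         DevTool
System
A       0.666667
B       0.500000
C       0.000000
D       1.000000
E       0.333333

dfdf['DevTool'].mean()
0.5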

IIUC use:
df['DevTool'] = df['DevTool'].astype(int)
s = df.groupby('System')['DevTool'].mean()*100
s.loc['total avg'] = df['DevTool'].mean()*100
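On the sample data (after the astype conversion) the resulting Series should look roughly like:
print(s)
System
A             66.666667
B             50.000000
C              0.000000
D            100.000000
E             33.333333
total avg     50.000000
Name: DevTool, dtype: float64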

Related

Merge value_counts of different pandas dataframes

I have a list of pandas dataframes for which I take the value_counts of a column and finally append all the results to another dataframe.
df_AB = pd.read_pickle('df_AB.pkl')
df_AC = pd.read_pickle('df_AC.pkl')
df_AD = pd.read_pickle('df_AD.pkl')
df_AE = pd.read_pickle('df_AE.pkl')
df_AF = pd.read_pickle('df_AF.pkl')
df_AG = pd.read_pickle('df_AG.pkl')
The format of the above dataframes is as below (Example: df_AB):
df_AB:
id is_valid
121 True
122 False
123 True
For every pandas dataframe, I would need to get the value_counts of the is_valid column and store the results in df_result. I tried the code below, but it doesn't seem to work as expected.
df_AB_VC = df_AB['is_valid'].value_counts()
df_AB_VC['group'] = "AB"
df_AC_VC = df_AC['is_valid'].value_counts()
df_AC_VC['group'] = "AC"
Result dataframe (df_result):
Group  is_valid_True_Count  is_Valid_False_Count
AB     2                    1
AC     ...                  ...
AD     ...                  ...
...
Any leads would be appreciated.
I think you just need to work on the dataframes a bit more systematically:
groups = ['AB', 'AC', 'AD',...]
out = pd.DataFrame({
    g: pd.read_pickle(f'df_{g}.pkl')['is_valid'].value_counts()
    for g in groups
}).T
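If you also want the exact column names from the desired output, one option (not part of the original answer) is to rename the boolean columns produced by value_counts afterwards:
# rename the True/False columns and label the index
out = out.rename(columns={True: 'is_valid_True_Count', False: 'is_valid_False_Count'})
out.index.name = 'Group'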
Do not use separate variables; that makes your code much more complicated. Use a container:
files = ['df_AB.pkl', 'df_AC.pkl', 'df_AD.pkl', 'df_AE.pkl', 'df_AF.pkl']
# using the XX part in "df_XX.pkl", you need to adapt to your real use-case
dataframes = {f[3:5]: pd.read_pickle(f) for f in files}
# compute counts
counts = (pd.DataFrame({k: d['is_valid'].value_counts()
                        for k, d in dataframes.items()})
          .T.add_prefix('is_valid_').add_suffix('_Count')
          )
example output:
    is_valid_True_Count  is_valid_False_Count
AB                    2                     1
AC                    2                     1
Use pathlib to extract the group name, then collect the data into a dictionary before concatenating all the entries:
import pandas as pd
import pathlib
data = {}
for pkl in pathlib.Path().glob('df_*.pkl'):
    group = pkl.stem.split('_')[1]
    df = pd.read_pickle(pkl)
    data[group] = df['is_valid'].value_counts() \
                    .add_prefix('is_valid_') \
                    .add_suffix('_Count')
df = pd.concat(data, axis=1).T
>>> df
    is_valid_True_Count  is_valid_False_Count
AD                    2                     1
AB                    4                     2
AC                    0                     3

Daily to Weekly Pandas conversion

I am trying to convert 15 years' worth of daily data into weekly data by taking the mean, diff and count of certain features. I tried using .resample, but I was not sure if that is the most efficient way.
My sample data:
Date,Product,New Quantity,Price,Refund Flag
8/16/1994,abc,10,0.5,
8/17/1994,abc,11,0.9,1
8/18/1994,abc,15,0.6,
8/19/1994,abc,19,0.4,
8/22/1994,abc,22,0.2,1
8/23/1994,abc,19,0.1,
8/16/1994,xyz,16,0.5,1
8/17/1994,xyz,10,0.9,1
8/18/1994,xyz,12,0.6,1
8/19/1994,xyz,19,0.4,
8/22/1994,xyz,26,0.2,1
8/23/1994,xyz,30,0.1,
8/16/1994,pqr,0,0,
8/17/1994,pqr,0,0,
8/18/1994,pqr,1,1,
8/19/1994,pqr,2,0.6,
8/22/1994,pqr,9,0.1,
8/23/1994,pqr,12,0.2,
This is the output I am looking for:
Date,Product,Net_Quantity_diff,Price_avg,Refund
8/16/1994,abc,9,0.6,1
8/22/1994,abc,-3,0.15,0
8/16/1994,xyz,3,0.6,3
8/22/1994,xyz,4,0.15,1
8/16/1994,pqr,2,0.4,0
8/22/1994,pqr,3,0.15,0
I think the pandas resample method is indeed ideal for this. You can pass a dictionary to the agg method, defining which aggregation function to use for each column. For example:
import numpy as np
import pandas as pd

df = pd.read_csv('sales.txt')  # your sample data
df['Date'] = pd.to_datetime(df['Date'])
df = df.set_index(df['Date'])
del df['Date']
df['Refund Flag'] = df['Refund Flag'].fillna(0).astype(bool)

def span(s):
    return np.max(s) - np.min(s)

df_weekly = df.resample('w').agg({'New Quantity': span,
                                  'Price': np.mean,
                                  'Refund Flag': np.sum})
df_weekly
            New Quantity     Price  Refund Flag
Date
1994-08-21            19  0.533333            4
1994-08-28            21  0.150000            2
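The table above aggregates all products together; if you need the weekly numbers per Product as in the desired output, a minimal sketch (not part of the original answer) is to group before resampling:
# same aggregations as above, computed separately for each Product
df_weekly_by_product = (df.groupby('Product')
                          .resample('W')
                          .agg({'New Quantity': span,
                                'Price': 'mean',
                                'Refund Flag': 'sum'}))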

Restructuring Dataframes (stored in a dictionary)

I have football data stored in a dictionary, by league, for different seasons. So, for example, I have the results from one league for the seasons 2017-2020 in one dataframe stored in the dictionary. Now I need to create new dataframes by season, so that I have all the results from 2019 in one dataframe. What is the best way to do this?
Thank you!
Assuming you are using open football as your source:
use the GitHub API to get a listing of all files in the repo
write a function to normalize the JSON
it is then simple to generate either a concatenated DataFrame or a dict of all the results
import requests
import pandas as pd

# normalize football scores data into a dataframe
def structuredf(res):
    js = res.json()
    if "rounds" not in js.keys():
        return (pd.json_normalize(js["matches"])
                .pipe(lambda d: d.loc[:,].join(d["score.ft"].apply(pd.Series).rename(columns={0:"home",1:"away"})))
                .drop(columns="score.ft")
                .rename(columns={"round":"name"})
                .assign(seasonname=js["name"], url=res.url)
                )
    df = (pd.json_normalize(pd.json_normalize(js["rounds"])
                            .explode("matches").to_dict("records"))
          .assign(seasonname=js["name"], url=res.url)
          .pipe(lambda d: d.loc[:,].join(d["matches.score.ft"].apply(pd.Series).rename(columns={0:"home",1:"away"})))
          .drop(columns="matches.score.ft")
          .pipe(lambda d: d.rename(columns={c:c.split(".")[-1] for c in d.columns}))
          )
    return df

# get listing of all data files that we're interested in
res = requests.get("https://api.github.com/repos/openfootball/football.json/git/trees/master?recursive=1")
dfm = pd.DataFrame(res.json()["tree"])

# concat into one dataframe
df = pd.concat([structuredf(res)
                for p in dfm.loc[dfm["path"].str.contains(r".en.[0-9]+.json"), "path"].items()
                for res in [requests.get(f"https://raw.githubusercontent.com/openfootball/football.json/master/{p[1]}")]])

# dictionary of dataframes, keyed by season name
d = {res.json()["name"]: structuredf(res)
     for p in dfm.loc[dfm["path"].str.contains(r".en.[0-9]+.json"), "path"].items()
     for res in [requests.get(f"https://raw.githubusercontent.com/openfootball/football.json/master/{p[1]}")]}
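To come back to the original question of one dataframe per season, a minimal sketch (assuming the concatenated df above, which carries a seasonname column) is to split it with groupby:
# one dataframe per value of the seasonname column
frames_by_season = {season: g.reset_index(drop=True)
                    for season, g in df.groupby('seasonname')}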

How to quickly normalise data in a pandas dataframe?

I have a pandas dataframe as follows.
import pandas as pd
df = pd.DataFrame({
'A':[1,2,3],
'B':[100,300,500],
'C':list('abc')
})
print(df)
A B C
0 1 100 a
1 2 300 b
2 3 500 c
I want to normalise the entire dataframe. Since column C is not a numeric column, what I do is as follows (i.e. remove C first, normalise the data and then add the column back).
df_new = df.drop('C', axis=1)
df_concept = df[['C']]
from sklearn import preprocessing
x = df_new.values  # returns a numpy array
min_max_scaler = preprocessing.MinMaxScaler()
x_scaled = min_max_scaler.fit_transform(x)
df_new = pd.DataFrame(x_scaled)
df_new['C'] = df_concept
However, I am sure there is an easier way of doing this in pandas (i.e. given the column names that I do not need to normalise, do the normalisation directly).
I am happy to provide more details if needed.
Use DataFrame.select_dtypes to get the DataFrame of numeric columns, normalize them by subtracting the minimum and dividing by the range (max - min), and then assign back only the normalized columns:
import numpy as np

df1 = df.select_dtypes(np.number)
df[df1.columns] = (df1 - df1.min()) / (df1.max() - df1.min())
print(df)
A B C
0 0.0 0.0 a
1 0.5 0.5 b
2 1.0 1.0 c
In case you want to apply any other functions on the data frame, you can use df[columns] = df[columns].apply(func).
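For instance, a small sketch of that apply-based variant on the example dataframe (the same min-max scaling as above, just expressed through apply):
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [100, 300, 500], 'C': list('abc')})
num_cols = df.select_dtypes(np.number).columns
# min-max scale each numeric column independently, leaving 'C' untouched
df[num_cols] = df[num_cols].apply(lambda s: (s - s.min()) / (s.max() - s.min()))
print(df)
     A    B  C
0  0.0  0.0  a
1  0.5  0.5  b
2  1.0  1.0  c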

Pandas data frame creation using static data

I have a data set like this: {'IT': [1,20,35,44,51,....,1000]}
I want to convert this into a python/pandas data frame and see output in the below format. How can I achieve this?
Dept Count
IT 1
IT 20
IT 35
IT 44
IT 51
.. .
.. .
.. .
IT 1000
I can write it the way below, but this is not an efficient way for huge data.
data = [['IT',1],['IT',2],['IT',3]]
df = pd.DataFrame(data,columns=['Dept','Count'])
print(df)
No need for a list comprehension since pandas will automatically fill IT in for every row.
import pandas as pd
d = {'IT':[1,20,35,44,51,1000]}
df = pd.DataFrame({'dept': 'IT', 'count': d['IT']})
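On the sample dict this should print something like:
  dept  count
0   IT      1
1   IT     20
2   IT     35
3   IT     44
4   IT     51
5   IT   1000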
Use list comprehension for tuples and pass to DataFrame constructor:
d = {'IT':[1,20,35,44,51], 'NEW':[1000]}
data = [(k, x) for k, v in d.items() for x in v]
df = pd.DataFrame(data,columns=['Dept','Count'])
print(df)
Dept Count
0 IT 1
1 IT 20
2 IT 35
3 IT 44
4 IT 51
5 NEW 1000
You can use melt
import pandas as pd
d = {'IT': [10]*100000}
df = pd.DataFrame(d)
df = pd.melt(df, var_name='Dept', value_name='Count')
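melt turns every column of the wide frame into (variable, value) pairs, so on the dict above the result has 100000 rows that all look like:
print(df.head())
  Dept  Count
0   IT     10
1   IT     10
2   IT     10
3   IT     10
4   IT     10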