Pandas to summarize (sum and count) columns

I have a dataframe as below and I want to summarize the columns (sum of Sales, sum of Spend, count of Transport).
I tried pivot_table:
import pandas as pd
from io import StringIO
csvfile = StringIO(
"""Staff Color Sales Spend Transport
Amelia Red 188 49 35
Elijah Yellow 326 18
James Blue 378 10
Benjamin Red 144 34 45
Isabella Red 269 10
Lucas Yellow 159 48
Mason Blue 496 48 20""")
df = pd.read_csv(csvfile, sep='\t', engine='python')
df = df.pivot_table(index=None,
                    values=['Sales', 'Spend', 'Transport'],
                    aggfunc={'Sales': 'sum', 'Spend': 'sum', 'Transport': 'count'})
print (df)
But the error says:
ValueError: No group keys passed!
And I also tried:
columns = ['sum of Sales', 'sum of Spend', 'count of Transport']
df_1 = pd.DataFrame(columns=columns)
df_1 = df_1.append({'sum of Sales': df.Sales.sum(),
                    'sum of Spend': df.Spend.sum(),
                    'count of Transport': df.Transport.count()}, ignore_index=True)
print (df_1)
It looks ok but I wonder what's the better way to achieve it. Thank you.

In your case just do
df.agg({'Sales':'sum','Spend':'sum','Transport':'count'}).to_frame().T
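A quick sanity check of that one-liner on data shaped like the question's (built inline here rather than parsed from the tab-separated text). Note that DataFrame.append was removed in pandas 2.0, which is another reason to prefer agg:

```python
import pandas as pd

# Same numbers as the question's table; Transport is missing for some staff
df = pd.DataFrame({
    'Sales': [188, 326, 378, 144, 269, 159, 496],
    'Spend': [49, 18, 10, 34, 10, 48, 48],
    'Transport': [35, None, None, 45, None, None, 20],
})

# agg with a dict applies a different reduction per column;
# to_frame().T turns the resulting Series into a one-row frame
summary = df.agg({'Sales': 'sum', 'Spend': 'sum', 'Transport': 'count'}).to_frame().T
print(summary)
```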


Pandas histogram with legend

My dataframe looks like this:
Customer ID  Age  Is True
123          31   1
124          33   1
125          45   0
126          27   0
127          37   1
128          39   0
129          49   0
130          30   0
131          30   0
132          38   1
I can create an age histogram like this:
import matplotlib.pyplot as plt
df.Age.hist()
plt.title('Age')
plt.xlabel('Age')
plt.ylabel('Frequency')
And I will get a plain histogram of ages.
I would like to add a legend of the 'Is True' field. For each Bin, I would like to see what portion is 1. How can I do that?
I'm not sure you can do that with Matplotlib, but I know you can with Plotly:
import plotly.express as px
df = px.data.tips()
fig = px.histogram(df, x="total_bill", color="sex")
fig.show()
more here:
https://plotly.com/python/histograms/
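That said, a stacked histogram with a legend is also doable in plain Matplotlib by passing one array of ages per group to hist. A minimal sketch on data shaped like the question's (column names assumed from the post):

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so this runs headless
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({
    'Customer ID': [123, 124, 125, 126, 127, 128, 129, 130, 131, 132],
    'Age':         [31, 33, 45, 27, 37, 39, 49, 30, 30, 38],
    'Is True':     [1, 1, 0, 0, 1, 0, 0, 0, 0, 1],
})

# One array of ages per 'Is True' value; stacked=True piles them up,
# so each bin shows what portion of it is 0 vs 1
groups = [df.loc[df['Is True'] == v, 'Age'] for v in (0, 1)]
fig, ax = plt.subplots()
ax.hist(groups, stacked=True, label=['Is True = 0', 'Is True = 1'])
ax.set_title('Age')
ax.set_xlabel('Age')
ax.set_ylabel('Frequency')
ax.legend()
```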

Improper broadcasting(?) on dataframe multiply() operation on two multiindex slices

I'm trying to multiply() two multilevel slices of a dataframe, but I'm unable to coerce the multiply operation to broadcast properly, so I just end up with lots of NaNs. It's like I'm somehow not specifying the indexing properly.
I've tried all variations of both axis and level, but it either throws an exception or gives me a 6x6 grid of NaNs.
import numpy as np
import pandas as pd
np.random.seed(0)
idx = pd.IndexSlice
df_a = pd.DataFrame(index=range(6),
                    columns=pd.MultiIndex.from_product([['weight'], ['alice', 'bob', 'sue']],
                                                       names=['measure', 'person']),
                    data=np.random.randint(70, high=120, size=(6, 3), dtype=int))
df_a.index.name = "m"
df_b = pd.DataFrame(index=range(6),
                    columns=pd.MultiIndex.from_product([['coef'], ['alice', 'bob', 'sue']],
                                                       names=['measure', 'person']),
                    data=np.random.rand(6, 3))
df_b.index.name = "m"
df_c = pd.DataFrame(index=range(6),
                    columns=pd.MultiIndex.from_product([['extraneous'], ['alice', 'bob', 'sue']],
                                                       names=['measure', 'person']),
                    data=np.random.rand(6, 3))
df_c.index.name = "m"
df = df_a.join([df_b, df_c])
# What I'm wanting:
# new column = coef*weight
#measure NewCol
#person alice bob sue
#m
#0 30.2 48.1 88.9
#...
#5 18.3 32.2 103
# all of these variations generate a 6x6 grid of NaNs
df.loc[:, idx['weight', :]].multiply(df.loc[:, idx['coef', :]], axis="rows")
df.loc[:, idx['weight', :]].multiply(df.loc[:, idx['coef', :]], axis="columns")
Here is an approach using pandas.concat:
df = pd.concat([df,
                pd.concat({'NewCol': df['coef'].mul(df['weight'])}, axis=1)],
               axis=1)
output:
measure weight coef extraneous NewCol
person alice bob sue alice bob sue alice bob sue alice bob sue
m
0 107 98 89 0.906243 0.761173 0.754762 0.889252 0.140435 0.708203 96.968045 74.594927 67.173827
1 106 77 117 0.193279 0.138338 0.699014 0.826331 0.087769 0.242337 20.487623 10.652021 81.784634
2 104 77 101 0.340416 0.131111 0.394653 0.465670 0.825667 0.624923 35.403258 10.095575 39.859948
3 80 92 116 0.329999 0.144878 0.794014 0.539082 0.968411 0.588952 26.399889 13.328731 92.105674
4 75 76 100 0.024841 0.083313 0.113684 0.160948 0.003354 0.246954 1.863067 6.331802 11.368357
5 115 99 71 0.662492 0.755795 0.123242 0.144265 0.993883 0.513367 76.186541 74.823720 8.750217
If you want to assign the changes back to the DataFrame, you can go via to_numpy():
df.loc[:, idx['weight', :]] = df.loc[:, idx['weight', :]].to_numpy() * df.loc[:, idx['coef', :]].to_numpy()
# you can also use the values attribute
OR
If you want to create a new MultiIndexed column then use concat()+join():
df=df.join(pd.concat([df['coef'].mul(df['weight'])],keys=['NewCol'],axis=1))
#OR
#df=df.join(pd.concat({'NewCol': df['coef'].mul(df['weight'])},axis=1))
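The NaN grid in the question comes from alignment: both slices keep the 'measure' level in their column labels ('weight' vs 'coef'), so no column of one ever matches a column of the other. Dropping that level before multiplying is another way to make the labels line up. A small sketch with made-up numbers:

```python
import pandas as pd

idx = pd.IndexSlice
cols = pd.MultiIndex.from_product([['weight', 'coef'], ['alice', 'bob']],
                                  names=['measure', 'person'])
df = pd.DataFrame([[100, 80, 0.5, 0.25]], columns=cols)

# ('weight', 'alice') never matches ('coef', 'alice'), hence the NaNs;
# dropping the 'measure' level leaves both slices with plain 'person' columns
new_col = (df.loc[:, idx['weight', :]].droplevel('measure', axis=1)
             .mul(df.loc[:, idx['coef', :]].droplevel('measure', axis=1)))
print(new_col)  # alice: 50.0, bob: 20.0
```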

Getting a variable number of pandas rows w.r.t. a dictionary lookup

In this sample dataframe df:
import pandas as pd
import numpy as np
import random, string
max_rows = {'A': 3, 'B': 2, 'D': 4} # max number of rows to be extracted
data_size = 1000
df = pd.DataFrame({'symbol': pd.Series(random.choice(string.ascii_uppercase) for _ in range(data_size)),
                   'qty': np.random.randn(data_size)}).sort_values('symbol')
How to get a dataframe with variable rows from a dictionary?
I tried [df.groupby('symbol').head(i) for i in df.symbol.map(max_rows)], but it gives a RuntimeWarning and the result looks incorrect.
You can use concat with list comprehension:
print (pd.concat([df.loc[df["symbol"].eq(k)].head(v) for k,v in max_rows.items()]))
symbol qty
640 A -0.725947
22 A -1.361063
190 A -0.596261
451 B -0.992223
489 B -2.014979
593 D 1.581863
600 D -2.162044
793 D -1.162758
738 D 0.345683
Adding another method using groupby + cumcount and df.query:
df.assign(v=df.groupby("symbol").cumcount()+1, k=df['symbol'].map(max_rows)).query("v<=k")
Or the same logic without assigning extra columns (thanks @jezrael):
df[df.groupby("symbol").cumcount()+1 <= df['symbol'].map(max_rows)]
symbol qty
882 A -0.249236
27 A 0.625584
122 A -1.154539
229 B -1.269212
55 B 1.403455
457 D -2.592831
449 D -0.433731
634 D 0.099493
734 D -1.551012
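The cumcount mask is easy to verify on a tiny fixed frame. Note that symbols absent from max_rows compare against NaN and are therefore dropped, which matches the concat answer (it only iterates over the dict's keys):

```python
import pandas as pd

max_rows = {'A': 2, 'B': 1}
df = pd.DataFrame({'symbol': ['A', 'A', 'A', 'B', 'B', 'C'],
                   'qty': range(6)})

# cumcount numbers rows within each symbol group starting at 0;
# +1 makes it a 1-based position, compared against the per-symbol cap
out = df[df.groupby('symbol').cumcount() + 1 <= df['symbol'].map(max_rows)]
print(out)  # keeps two 'A' rows, one 'B' row, and no 'C' rows
```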

Is there a way to add a column called "Rank" in pandas that would rank a list of values, with number 1 being the highest value, and so on?

df['Total'] = df['HP'] + df['Attack'] + df['Defense'] + df['Sp. Atk'] + df['Sp. Def'] + df['Speed']
df['Total'] = df.iloc[:,4:10].sum(axis=1)
df['Total'] = df['Total'].astype(int)
cols = list(df.columns.values)
df = df[cols[1:3] + [cols[-1]] + cols[3:12]]
df = df.sort_values(by=['Name','Total'], ascending=[True,False])
My output looks like this:
Name Type 1 Total ... Speed Generation Legendary
510 Abomasnow Grass 494 ... 60 4 False
511 AbomasnowMega Abomasnow Grass 594 ... 30 4 False
68 Abra Psychic 310 ... 90 1 False
392 Absol Dark 465 ... 75 3 False
393 AbsolMega Absol Dark 565 ... 115 3 False
Is there a way to add a new column titled 'Rank' that will rank the names by their Total from first to last?
Yes!
df["Rank"] = df["Total"].rank(ascending=False).astype(int)
pandas' rank method assigns rank 1 to the highest value when ascending=False. (numpy's argsort "returns the indices that would sort an array", but those are sort positions rather than ranks, so it does not directly produce this result.)
Using pandas, you can also build the rank from a sort order:
order = df.sort_values(by='Total', ascending=False).index
df.loc[order, 'Rank'] = list(range(1, len(df) + 1))
And you will find a new "Rank" column in your dataframe.
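A quick check of rank() semantics on a toy Series, since ties matter here: method='min' gives competition-style ranks, so two values tied for 2nd both get rank 2 and the next value gets rank 4:

```python
import pandas as pd

totals = pd.Series([594, 494, 565, 565])

# ascending=False makes the largest total rank 1
ranks = totals.rank(ascending=False, method='min').astype(int)
print(list(ranks))  # [1, 4, 2, 2]
```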

How to order dataframe using a list in pandas

I have a pandas dataframe as follows.
import pandas as pd
data = [['Alex',10, 175],['Bob',12, 178],['Clarke',13, 179]]
df = pd.DataFrame(data,columns=['Name','Age', 'Height'])
print(df)
I also have a list as follows.
mynames = ['Emj', 'Bob', 'Jenne', 'Alex', 'Clarke']
I want to order the rows of my dataframe in the order of mynames list. In other words, my output should be as follows.
Name Age Height
0 Bob 12 178
1 Alex 10 175
2 Clarke 13 179
I was trying to do this by converting the dataframe to a list, but I am wondering if there is an easier way to do it directly in pandas.
I am happy to provide more details if needed.
You can do pd.Categorical + argsort:
df = df.loc[pd.Categorical(df.Name, mynames).argsort()]
Name Age Height
1 Bob 12 178
0 Alex 10 175
2 Clarke 13 179
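Another option (pandas 1.1+) is sort_values with a key that maps each name to its position in mynames. A sketch producing the same result:

```python
import pandas as pd

data = [['Alex', 10, 175], ['Bob', 12, 178], ['Clarke', 13, 179]]
df = pd.DataFrame(data, columns=['Name', 'Age', 'Height'])
mynames = ['Emj', 'Bob', 'Jenne', 'Alex', 'Clarke']

# Map every name to its index in mynames and sort by that position
order = {name: i for i, name in enumerate(mynames)}
out = df.sort_values('Name', key=lambda s: s.map(order))
print(out)  # Bob, Alex, Clarke
```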