Bokeh grouped bar chart, changing data presented - pandas

I'm experimenting with the grouped bar chart example for Bokeh using pandas, shown below. I'd like it to display the data differently; for example, I want the bars to show a count of the rows in each group. I tried replacing every instance of 'mpg_mean' with 'mpg_count', but got an invalid column error. I also tried showing a sum by using 'mpg_sum', with the same error. I'm assuming the calculation of 'mpg_mean' happens in the groupby, but how do I get it to display the count or the sum? It's not clear from this example where any calculation happens.
Thanks in advance for any help!
from bokeh.io import output_file, show
from bokeh.palettes import Spectral5
from bokeh.plotting import figure
from bokeh.sampledata.autompg import autompg_clean as df
from bokeh.transform import factor_cmap
output_file("bar_pandas_groupby_nested.html")
df.cyl = df.cyl.astype(str)
df.yr = df.yr.astype(str)
group = df.groupby(by=['cyl', 'mfr'])
index_cmap = factor_cmap('cyl_mfr', palette=Spectral5, factors=sorted(df.cyl.unique()), end=1)
p = figure(width=800, height=300, title="Mean MPG by # cylinders and manufacturer",
           x_range=group, toolbar_location=None,
           tooltips=[("MPG", "@mpg_mean"), ("Cyl, Mfr", "@cyl_mfr")])
p.vbar(x='cyl_mfr', top='mpg_mean', width=1, source=group,
       line_color="white", fill_color=index_cmap)
p.y_range.start = 0
p.x_range.range_padding = 0.05
p.xgrid.grid_line_color = None
p.xaxis.axis_label = "Manufacturer grouped by # Cylinders"
p.xaxis.major_label_orientation = 1.2
p.outline_line_color = None
show(p)
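One likely explanation, offered as a hedged sketch rather than a confirmed diagnosis: when a pandas GroupBy is passed as `source`, Bokeh builds the data source from `group.describe()`, so only the describe statistics (count, mean, std, min, quartiles, max) are available as flattened columns such as `mpg_mean`. A sum is not among them, which would explain the error for `mpg_sum`. To plot any other aggregate you can compute it yourself first. The example below uses only pandas and a tiny made-up frame in place of `autompg_clean`; the column names `mpg_count`, `mpg_sum` and `cyl_mfr` are chosen for illustration.

```python
import pandas as pd

# Tiny made-up stand-in for the autompg data (values are illustrative only).
df = pd.DataFrame({
    "cyl": ["4", "4", "4", "6", "6"],
    "mfr": ["ford", "ford", "honda", "ford", "ford"],
    "mpg": [30.0, 26.0, 33.0, 20.0, 18.0],
})

# Compute the aggregates explicitly instead of relying on describe().
agg = (
    df.groupby(["cyl", "mfr"])["mpg"]
      .agg(mpg_count="count", mpg_sum="sum", mpg_mean="mean")
)

# Flatten the two-level index into ordinary columns and add one column of
# (cyl, mfr) tuples that can serve as nested categorical factors.
flat = agg.reset_index()
flat["cyl_mfr"] = list(zip(flat["cyl"], flat["mfr"]))

print(flat)
```

The resulting `flat` frame could then be wrapped in `ColumnDataSource(flat)` and drawn with `x='cyl_mfr'`, `top='mpg_sum'` (or `top='mpg_count'`), using `FactorRange(*flat['cyl_mfr'])` as the `x_range`. Treat that plotting step as an assumption to check against your Bokeh version.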

Related

matplotlib: make a pie chart for select column in Python

Dataset: https://dl.dropboxusercontent.com/s/v9gmgxupkypn5dw/train-data.csv
I want to make a pie chart for Fuel_Type.
Divide Fuel_Type into three parts: Diesel, Petrol, and Others (neither Diesel nor Petrol).
Make a pie chart for that and show the percentages on the chart.
My code is below:
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('car_train_data.csv',sep= ",",index_col= 0)
temp = df.groupby("Fuel_Type", as_index=False).size()
cutoff = temp["size"].sum() * 0.05
temp_idx = temp["size"] < cutoff
other_sum = temp.loc[temp_idx, "size"].sum()
But I'm really confused about how to define "not Diesel and not Petrol". Can anyone help me, or show me an example that I can adapt? Thanks a lot!
You can filter your dataframe based on the desired value for a specific column. Try something like:
petrol = df[df['Fuel_Type'] == 'Petrol']
diesel = df[df['Fuel_Type'] == 'Diesel']
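others = df[(df['Fuel_Type'] != 'Petrol') & (df['Fuel_Type'] != 'Diesel')]
This creates the three portions you need. Pass their sizes (for example len(petrol), len(diesel) and len(others)) to the pie chart. Note that each comparison must be wrapped in parentheses, because & binds more tightly than != in Python.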
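As an alternative to three separate filters, the grouping can be done in one step by relabelling every other fuel type as "Others" and counting. This is a minimal sketch with a made-up sample column; the real values would come from the CSV in the question, and the variable names are chosen for illustration.

```python
import pandas as pd

# Made-up sample; the real data would be read from the CSV in the question.
df = pd.DataFrame({
    "Fuel_Type": ["Diesel", "Petrol", "CNG", "Diesel",
                  "LPG", "Petrol", "Diesel", "Electric"]
})

# Keep Diesel and Petrol as they are; relabel everything else as "Others".
is_main = df["Fuel_Type"].isin(["Diesel", "Petrol"])
fuel_group = df["Fuel_Type"].where(is_main, "Others")

# Count rows per group and convert the counts to percentages.
counts = fuel_group.value_counts()
percent = counts / counts.sum() * 100

print(counts)
print(percent.round(1))
```

With matplotlib installed, `counts.plot.pie(autopct="%1.1f%%")` should draw the pie with percentage labels.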

how to plot graded letters like A* in matplotlib

I'm a complete beginner and I have a college stats project: I'm comparing exam scores for our year group and the one below. I collected my own data, and since I do CS I decided to try visualizing it with pandas and matplotlib (my first time). I was able to read the CSV file into a dataframe with columns Level, Grade, Difficulty, Revision, Happy, MAG. Level is just the year group, e.g. AS or A2, and MAG is a minimum expected grade; the rest are numeric values out of 5.
I want to do some plotting but I can't seem to get it to work.
I want to plot revision against difficulty for the AS group and try to show a correlation. I also want to show a bar chart (if appropriate) for Grade vs MAG.
here is the csv https://docs.google.com/spreadsheets/d/169UKfcet1qh8ld-eI7B4U14HIl7pvgZfQLE45NrleX8/edit?usp=sharing
this is the code so far:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
df = pd.read_csv('Report Task.csv')
df.columns = ['Level','Grade','Difficulty','Revision','Happy','MAG'] #numerical values are out of 5
df[df.Level.str.match('AS')] #to get only AS group
plt.plot(df.Revision, df.Difficulty)
this is my first time ever posting on stack so im really sorry if i did something wrong.
For difficulty vs revision, you were using a line plot. You're probably looking for a scatter plot:
df = df[df.Level.str.match('AS')] # note the extra `df =` as per comments
plt.scatter(x=df.Revision, y=df.Difficulty)
plt.xlabel('Revision')
plt.ylabel('Difficulty')
Alternatively you can plot via pandas directly:
df = df[df.Level.str.match('AS')] # note the extra `df =` as per comments
df.plot.scatter(x='Revision', y='Difficulty')
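Since the question also asks about showing a correlation, a number to go with the scatter plot can be computed with `Series.corr`, which returns the Pearson coefficient by default. The sketch below uses made-up values in place of the spreadsheet, so the result says nothing about the real data.

```python
import pandas as pd

# Made-up rows standing in for the real spreadsheet.
df = pd.DataFrame({
    "Level":      ["AS", "AS", "AS", "A2", "A2", "AS"],
    "Revision":   [1, 2, 3, 5, 4, 4],
    "Difficulty": [5, 4, 3, 1, 2, 2],
})

# Keep only the AS group, assigning the filtered result to a new name.
as_group = df[df.Level.str.match("AS")]

# Pearson correlation between the two columns (between -1 and 1).
r = as_group["Revision"].corr(as_group["Difficulty"])
print(f"Pearson r = {r:.2f}")
```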

Align Pandas barchart style

I went through the documentation at:
https://pandas.pydata.org/pandas-docs/stable/style.html
(section "Bar Charts"), but when I use the same code with this simple dataframe, something seems to go wrong.
My code is the following:
import pandas as pd
df = pd.DataFrame([[0,0,19],[0,-3,16],[1,0,21]], columns = ["5D Net Chg","20D Net Chg","60D Net Chg"] )
df.style.bar(subset=["5D Net Chg","20D Net Chg","60D Net Chg"], align='mid', color=['#d65f5f', '#5fba7d'])
If you run this code, you should see what I see, that is:
* the -3 cell is fully red, whereas I would expect to see a red bar starting from the middle of the cell
* the cell with value 1 is fully green. I would expect to see a tiny green bar starting from the middle of the cell, given the max value in the dataframe is 21
What could I do to make it look more like the "Conditional Formatting" data bars in Excel?
Thanks
What you want is to align the value 0 at the middle of the cell, so you should choose align='zero'. Note that the alignment is computed column-wise.
import pandas as pd
df = pd.DataFrame([[0,0,-5],[0,-3,16],[1,2,21]], columns = [1,2,3] )
df.style.bar(subset=[1,2,3], align='zero', color=['#d65f5f', '#5fba7d'])
I think everything is correct. It's just the values that you have. Changing the values to the following:
import pandas as pd
df = pd.DataFrame([[0,2,19],[0,-3,16],[1,-4,21]],
                  columns = ["5D Net Chg","20D Net Chg","60D Net Chg"] )
I obtain this
df.style.bar(subset=["5D Net Chg","20D Net Chg","60D Net Chg"], align='mid', color=['#d65f5f', '#5fba7d'])

Creating dataframe boxplot from dataframe with row and column multiindex

I have the following pandas data frame and I'm trying to create a boxplot of the "dur" value for both client and server, organized by qdepth (qdepth on the x-axis, duration on the y-axis, with two variables, client and server). It seems like I need to get `client` and `server` as columns. I haven't been able to figure this out trying combinations of `unstack` and `reset_index`.
Here's some dummy data I recreated since you didn't post yours aside from an image:
qdepth,mode,runid,dur
1,client,0x1b7bd6ef955979b6e4c109b47690c862,7.0
1,client,0x45654ba030787e511a7f0f0be2db21d1,30.0
1,server,0xb760550f302d824630f930e3487b4444,19.0
1,server,0x7a044242aec034c44e01f1f339610916,95.0
2,client,0x51c88822b28dfa006bf38603d74f9911,15.0
2,client,0xd5a9028fddf9a400fd8513edbdc58de0,49.0
2,server,0x3943710e587e3932adda1cad8eaf2aeb,30.0
2,server,0xd67650fd984a48f2070de426e0a942b0,93.0
Load the data: df = pd.read_clipboard(sep=',', index_col=[0,1,2])
Option 1:
df.unstack(level=1).boxplot()
Option 2:
df.unstack(level=[0,1]).boxplot()
Option 3:
Using seaborn:
import seaborn as sns
sns.boxplot(x="qdepth", hue="mode", y="dur", data=df.reset_index(),)
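To see what the two `unstack` options hand to `boxplot()`, it can help to inspect the reshaped frames directly. The sketch below rebuilds the dummy data from a string instead of the clipboard and uses shortened, hypothetical run ids (`r1` to `r8`) in place of the long hex values.

```python
import io
import pandas as pd

# Same dummy values as above, with shortened hypothetical run ids.
csv_text = """qdepth,mode,runid,dur
1,client,r1,7.0
1,client,r2,30.0
1,server,r3,19.0
1,server,r4,95.0
2,client,r5,15.0
2,client,r6,49.0
2,server,r7,30.0
2,server,r8,93.0
"""
df = pd.read_csv(io.StringIO(csv_text), index_col=[0, 1, 2])

# Option 1: move "mode" into the columns, one column per mode.
wide = df.unstack(level=1)
print(wide.columns.tolist())

# Option 2: move both "qdepth" and "mode" into the columns,
# one column per (qdepth, mode) combination.
by_both = df.unstack(level=[0, 1])
print(by_both.columns.tolist())
```

Each resulting column becomes one box. Because every run id occurs only once, the reshaped frames contain many NaN cells, which the boxplot skips.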
Update:
To answer your comment, here's a very approximate way (could be used as a starting point) to recreate the seaborn option using only pandas and matplotlib:
import matplotlib as mpl
import matplotlib.pyplot as plt

fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(12, 6))
#bp = df.unstack(level=[0,1])['dur'].boxplot(ax=ax, return_type='dict')
bp = df.reset_index().boxplot(column='dur', by=['qdepth', 'mode'], ax=ax, return_type='dict')['dur']
# Now fill the boxes with desired colors
boxColors = ['darkkhaki', 'royalblue']
numBoxes = len(bp['boxes'])
for i in range(numBoxes):
    box = bp['boxes'][i]
    # Each box outline is a closed line with five vertices
    boxX = []
    boxY = []
    for j in range(5):
        boxX.append(box.get_xdata()[j])
        boxY.append(box.get_ydata()[j])
    boxCoords = list(zip(boxX, boxY))
    # Alternate between Dark Khaki and Royal Blue
    k = i % 2
    boxPolygon = mpl.patches.Polygon(boxCoords, facecolor=boxColors[k])
    ax.add_patch(boxPolygon)
plt.show()

Pandas Stacked Bar Plot - Columns by Max Value, Not Summed

%matplotlib inline
import matplotlib
matplotlib.style.use('ggplot')
import numpy as np
import pandas as pd
my_data = np.array([[0.110622  , 0.98174432, 0.56583323],
                    [0.61825694, 0.14166864, 0.44180003],
                    [0.02572145, 0.55764373, 0.24183103],
                    [0.98040318, 0.76171712, 0.41994361],
                    [0.49859658, 0.76637672, 0.75487683]])
pd.DataFrame(my_data).plot(kind='bar', stacked=True)
Using the above code I get:
How do I change this so that the height of every bar is the max value for that bar instead of the sum, with all the lower values shown in the same bar in different colors?
Thanks for your help.
If I understood your question correctly, I would normalize the data by multiplying each value by its row's maximum and dividing by the row's sum, so that the total height of each stacked bar equals the row maximum:
df = pd.DataFrame(my_data)
df = df.apply(lambda x: x*df.max(axis=1)/df.sum(axis=1))
The new plot is:
df.plot(kind='bar', stacked=True)
Hope that helps.
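As a quick check of the rescaling described above, the sketch below computes the per-row factor once and confirms that each row then sums to its original maximum. It uses `DataFrame.mul` with `axis=0` as an equivalent, more direct way to write the same scaling; the names `factor` and `scaled` are chosen for illustration.

```python
import numpy as np
import pandas as pd

my_data = np.array([[0.110622  , 0.98174432, 0.56583323],
                    [0.61825694, 0.14166864, 0.44180003],
                    [0.02572145, 0.55764373, 0.24183103],
                    [0.98040318, 0.76171712, 0.41994361],
                    [0.49859658, 0.76637672, 0.75487683]])
df = pd.DataFrame(my_data)

# One scaling factor per row: row maximum divided by row sum.
factor = df.max(axis=1) / df.sum(axis=1)

# Multiply every row by its factor (axis=0 aligns the factor with the rows).
scaled = df.mul(factor, axis=0)

# After scaling, each row sums to that row's original maximum.
print(np.allclose(scaled.sum(axis=1), df.max(axis=1)))
```

Note that the segments keep their relative proportions within a bar but no longer show the raw values, so this makes the bar height equal the maximum rather than overlaying the original values.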