Align Pandas barchart style - pandas

I went through the documentation at:
https://pandas.pydata.org/pandas-docs/stable/style.html
(section "Bar Charts"), but when I use the same code with this simple DataFrame, something seems to go wrong.
My code is the following:
import pandas as pd
df = pd.DataFrame([[0,0,19],[0,-3,16],[1,0,21]], columns = ["5D Net Chg","20D Net Chg","60D Net Chg"] )
df.style.bar(subset=["5D Net Chg","20D Net Chg","60D Net Chg"], align='mid', color=['#d65f5f', '#5fba7d'])
If you run this code, you should see what I see, that is:
* the -3 cell is fully red, whereas I would expect to see a red bar starting from the middle of the cell
* the cell with value 1 is fully green. I would expect to see a tiny green bar starting from the middle of the cell, given the max value in the dataframe is 21
What could I do to make it look more like the "Conditional Formatting" data bars in Excel?
Thanks

What you want is to align the value 0 with the middle of the cell, so you should choose align='zero'. Note that the alignment works column-wise.
import pandas as pd
df = pd.DataFrame([[0,0,-5],[0,-3,16],[1,2,21]], columns = [1,2,3] )
df.style.bar(subset=[1,2,3], align='zero', color=['#d65f5f', '#5fba7d'])

I think everything is correct. It's just the values that you have. Changing the values to the following:
import pandas as pd
df = pd.DataFrame([[0,2,19],[0,-3,16],[1,-4,21]],
columns = ["5D Net Chg","20D Net Chg","60D Net Chg"] )
I obtain this
df.style.bar(subset=["5D Net Chg","20D Net Chg","60D Net Chg"], align='mid', color=['#d65f5f', '#5fba7d'])
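To see why the original values fill whole cells, it helps to look at each column's own range. As far as I understand the documented behaviour of align='mid', bars are scaled per column, and when every value in a column has the same sign (or is zero) the zero point sits at the edge of the cell rather than in the middle. The sketch below only inspects the data from the question; the `summary` frame and its `mixed_signs` column are names made up for illustration.

```python
import pandas as pd

# Data from the question.
df = pd.DataFrame(
    [[0, 0, 19], [0, -3, 16], [1, 0, 21]],
    columns=["5D Net Chg", "20D Net Chg", "60D Net Chg"],
)

# Bars are scaled column by column, so look at each column's own range.
summary = pd.DataFrame({"min": df.min(), "max": df.max()})

# A column only has bars on both sides of zero if it contains both signs.
summary["mixed_signs"] = (summary["min"] < 0) & (summary["max"] > 0)
print(summary)
```

None of the three columns mixes positive and negative values, which is consistent with the -3 and the 1 each filling their cell.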

Related

How can I always choose the last column in a csv table that's updated monthly?

I'm automating small-business reporting from my QuickBooks P&L. I'm trying to get the net income value for the current month from a specific cell in a dataframe, but that cell moves one column to the right every month when I update the csv file.
For example, for the code below, this month I want the value from Nov[0], but next month I'll want the value from Dec[0], even though that column doesn't exist yet.
Is there a graceful way to always select the second right most column, or is this a stupid way to try and get this information?
import numpy as np
import pandas as pd
nov = -810
dec = 14958
total = 8693
d = {'Jan': [50], 'Feb': [70], 'Total':[120]}
df = pd.DataFrame(data=d)
Sure, you can reference the last or second-to-last row or column.
d = {'Jan': [50], 'Feb': [70], 'Total':[120]}
df = pd.DataFrame(data=d)
x = df.iloc[-1,-2]
This will select the value in the last row for the second-to-last column, in this case 70. :)
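A minimal sketch of how the positional lookup keeps working when a new month is added before the Total column. The helper name `latest_month_value` and the "Mar" figure are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"Jan": [50], "Feb": [70], "Total": [120]})

def latest_month_value(frame):
    # Hypothetical helper: name and first-row value of the second-to-last column.
    return frame.columns[-2], frame.iloc[0, -2]

before = latest_month_value(df)

# Simulate next month's file: a new column appears just before "Total".
df.insert(len(df.columns) - 1, "Mar", [90])
after = latest_month_value(df)

print(before, after)
```

Because the lookup is by position, it depends on Total always being the last column in the file.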
If you plan to use the full file, @VincentRupp's answer will get you what you want.
But if you only plan to use the values in the second-rightmost column and you can infer what it will be called, you can tell pd.read_csv that's all you want.
import pandas as pd # 1.5.1
# assuming we want this month's name
# can modify to use some other month
abbreviated_month_name = pd.to_datetime("today").strftime("%b")
df = pd.read_csv("path/to/file.csv", usecols=[abbreviated_month_name])
print(df.iloc[-1, 0])
References
pd.read_csv
strftime cheat-sheet
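A small self-contained sketch of the usecols approach, using an in-memory CSV instead of a real file. The CSV text and the fixed month name are assumptions for the example. It also shows the caveat that asking for a column that is not in the file yet raises a ValueError.

```python
import io
import pandas as pd

# Stand-in for the real file; contents are made up.
csv_text = "Jan,Feb,Total\n50,70,120\n"

month = "Feb"  # in practice this could come from strftime("%b")
df = pd.read_csv(io.StringIO(csv_text), usecols=[month])
value = df.iloc[-1, 0]
print(value)

# Requesting a month that has no column yet fails loudly.
try:
    pd.read_csv(io.StringIO(csv_text), usecols=["Mar"])
    missing_month_raises = False
except ValueError:
    missing_month_raises = True
print(missing_month_raises)
```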

How to plot coordinates from single pandas series

I have a pandas series called df1['geometry.coordinates'] of coordinate values in the following format:
geometry.coordinates
0 [150.792711, -34.210868]
1 [151.551228, -33.023339]
2 [148.92149870748742, -34.767207772932835]
3 [151.033742, -33.919998]
4 [150.953963043732, -32.3935017885229]
... ...
432 [114.8927165, -28.902492300000002]
433 [115.34601918477634, -30.041742290803096]
434 [115.4632611, -30.8581035]
435 [121.42151909999998, -30.7804027]
436 [115.69424934340425, -30.680970908597665]
I want to plot each point on a graph, probably through using a scatter plot.
I tried: df1['geometry.coordinates'].plot.scatter() but it gets confused because it only reads it as one list value rather than two and therefore I always get the following error:
TypeError: scatter() missing 2 required positional arguments: 'x' and 'y'
Anyone know how I can solve this?
You need to separate the column containing the list so that you can specify x and y in the plot call.
You can split a column containing a list by constructing a data frame from a list.
pd.DataFrame(df1["geometry.coordinates"].to_list(), columns=['x', 'y']).plot.scatter(x="x", y="y")
Step 1: Split array into multiple columns
df1[['x','y']] = pd.DataFrame(df1['geometry.coordinates'].tolist(), index= df1.index)
Step 2: Plot
df1.plot.scatter(x = 'x', y = 'y', s = 30) #s is size of dots
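A short sketch of the split step on two sample rows, with a deliberately non-default index to show why passing index=df1.index matters: the new columns are aligned on the index, so without it the x/y values would not line up with the original rows. It uses join instead of the column assignment above, and the index values are made up.

```python
import pandas as pd

df1 = pd.DataFrame(
    {"geometry.coordinates": [[150.792711, -34.210868], [151.551228, -33.023339]]},
    index=[10, 20],  # deliberately not 0..n-1
)

# Build the x/y columns with the same index as the source frame.
coords = pd.DataFrame(
    df1["geometry.coordinates"].tolist(),
    index=df1.index,
    columns=["x", "y"],
)

# join aligns on the index, so each row gets its own coordinates.
df1 = df1.join(coords)
print(df1[["x", "y"]])
```

After this, df1.plot.scatter(x='x', y='y') has the two named columns it needs.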
You are not passing the x and y parameters to scatter(), so the error is expected. Once the coordinates are in two columns, something along the lines of df.plot.scatter(x=0, y=1) should work.
Also, if your data is laid out with one point per column rather than one per row, you need to transpose it first: df.T.plot.scatter(x=0, y=1)
I did it this way.
import matplotlib.pyplot as plt
import pandas as pd
geometry = pd.Series([
    [150.792711, -34.210868],
    [151.551228, -33.023339],
    [148.92149870748742, -34.767207772932835],
    [151.033742, -33.919998],
    [150.953963043732, -32.3935017885229]])
# Split each [x, y] pair into two named columns.
df = pd.DataFrame(geometry.to_list(), columns=['x', 'y'])
plt.scatter(x=df['x'], y=df['y'], edgecolor='black')
plt.grid(alpha=.15)
plt.show()
you can try
import pandas as pd
geometry_coordinates=[[150.792711, -34.210868],
[151.551228, -33.023339],
[148.92149870748742, -34.767207772932835],
[151.033742, -33.919998],
[150.953963043732, -32.3935017885229],
[114.8927165, -28.902492300000002],
[115.34601918477634, -30.041742290803096],
[115.4632611, -30.8581035],
[121.42151909999998, -30.7804027],
[115.69424934340425, -30.680970908597665]]
# Each pair is [longitude, latitude], so name the columns in that order.
geometry_coordinates=pd.DataFrame(geometry_coordinates,columns=['long','lat'])
geometry_coordinates.plot.scatter(x='long',y='lat')

Bokeh grouped bar chart, changing data presented

I'm experimenting with the grouped bar chart example for Bokeh using pandas, shown below. I am trying to see if I can get it to display the data differently; for example, I wanted the bar graph to show a count of the rows in each group. I tried that by replacing instances of 'mpg_mean' with 'mpg_count' but just got an invalid column error. I also tried having the graph show a sum by using 'mpg_sum', with the same error. I'm assuming the calculation for 'mpg_mean' occurs in the groupby, but how do I get it to display the count or the sum? It's not clear in this example where any calculations are happening.
Thanks in advance for any help!
from bokeh.io import output_file, show
from bokeh.palettes import Spectral5
from bokeh.plotting import figure
from bokeh.sampledata.autompg import autompg_clean as df
from bokeh.transform import factor_cmap
output_file("bar_pandas_groupby_nested.html")
df.cyl = df.cyl.astype(str)
df.yr = df.yr.astype(str)
group = df.groupby(by=['cyl', 'mfr'])
index_cmap = factor_cmap('cyl_mfr', palette=Spectral5, factors=sorted(df.cyl.unique()), end=1)
p = figure(width=800, height=300, title="Mean MPG by # cylinders and manufacturer",
x_range=group, toolbar_location=None, tooltips=[("MPG", "@mpg_mean"), ("Cyl, Mfr", "@cyl_mfr")])
p.vbar(x='cyl_mfr', top='mpg_mean', width=1, source=group,
line_color="white", fill_color=index_cmap, )
p.y_range.start = 0
p.x_range.range_padding = 0.05
p.xgrid.grid_line_color = None
p.xaxis.axis_label = "Manufacturer grouped by # Cylinders"
p.xaxis.major_label_orientation = 1.2
p.outline_line_color = None
show(p)
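One way to get at counts or sums, independent of how the example derives 'mpg_mean', is to compute the aggregates explicitly in pandas first. This is only a sketch: the small `cars` frame is made up to stand in for the autompg sample data, and the `mpg_` column prefix is chosen just to mirror the names used in the question.

```python
import pandas as pd

# Made-up stand-in for the sample data.
cars = pd.DataFrame({
    "cyl": ["4", "4", "6", "6", "6"],
    "mfr": ["ford", "ford", "ford", "dodge", "dodge"],
    "mpg": [30.0, 28.0, 20.0, 18.0, 22.0],
})

# Compute several aggregates per (cyl, mfr) group in one pass.
stats = cars.groupby(["cyl", "mfr"])["mpg"].agg(["mean", "count", "sum"])

# Rename to mpg_mean, mpg_count, mpg_sum.
stats.columns = [f"mpg_{name}" for name in stats.columns]
print(stats)
```

With the aggregates in an ordinary frame, it is at least explicit which columns exist and where each number comes from.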

how to plot graded letters like A* in matplotlib

I'm a complete beginner with a college stats project: I'm comparing exam scores for our year group and the one below. I collected my own data, and since I do CS I decided to try visualizing it with pandas and matplotlib (my first time). I was able to read the CSV file into a DataFrame with columns Level, Grade, Difficulty, Revision, Happy, MAG. Level is just the year group, e.g. AS or A2, and MAG is like a minimum expected grade; the rest are numeric values out of 5.
I want to do some plotting but I can't seem to get it to work.
I want to plot revision against difficulty for the AS group and try to show a correlation. I also want to show a bar chart (if appropriate) for Grade vs MAG.
Here is the CSV: https://docs.google.com/spreadsheets/d/169UKfcet1qh8ld-eI7B4U14HIl7pvgZfQLE45NrleX8/edit?usp=sharing
This is the code so far:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
df = pd.read_csv('Report Task.csv')
df.columns = ['Level','Grade','Difficulty','Revision','Happy','MAG'] #numerical values are out of 5
df[df.Level.str.match('AS')] #to get only AS group
plt.plot(df.Revision, df.Difficulty)
This is my first time ever posting on Stack, so I'm really sorry if I did something wrong.
For difficulty vs revision, you were using a line plot. You're probably looking for a scatter plot:
df = df[df.Level.str.match('AS')] # note the extra `df =` as per comments
plt.scatter(x=df.Revision, y=df.Difficulty)
plt.xlabel('Revision')
plt.ylabel('Difficulty')
Alternatively you can plot via pandas directly:
df = df[df.Level.str.match('AS')] # note the extra `df =` as per comments
df.plot.scatter(x='Revision', y='Difficulty')
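For Grade vs MAG, letter grades such as A* have no numeric order of their own, so one option is to make them ordered categoricals first. This is a sketch under assumptions: it supposes both columns hold letter grades, the grade scale in `grade_order` is a guess, and the three rows are made up rather than taken from the spreadsheet.

```python
import pandas as pd

# Assumed grade scale, lowest to highest.
grade_order = ["U", "E", "D", "C", "B", "A", "A*"]

# Made-up rows standing in for the real CSV.
df = pd.DataFrame({"Grade": ["A*", "B", "A"], "MAG": ["A", "B", "A*"]})

# Ordered categoricals give each letter a position on the scale.
for col in ["Grade", "MAG"]:
    df[col] = pd.Categorical(df[col], categories=grade_order, ordered=True)

# Positive means the grade beat the minimum expected grade.
diff = df["Grade"].cat.codes - df["MAG"].cat.codes
print(diff.tolist())
```

The integer codes (or the difference) can then be plotted as bars, with the letters used as tick labels.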

All-NaN column is plotted as an all-0 column in pandas

I have got some problems with plotting a sliced DataFrame with entire columns filled with NaN's.
How come:
import numpy as np
import pandas as pd
pd.DataFrame(
    dict(
        A=pd.Series([np.nan] * 32),
        B=pd.Series(range(-1, 32))
    )
).plot()
differs from:
# Ugly fix
pd.DataFrame(
    dict(
        A=pd.Series([0] + [np.nan] * 32),
        B=pd.Series(range(-1, 32))
    )
).plot()
by plotting a 0-line as if the column is filled with zeros.
Shouldn't the first code work just as:
pylab.plot(
    range(0, 33),
    range(-1, 32),
    range(0, 32),
    [np.nan] * 32
)
And also plotting just a Series filled with NaN works fine:
pd.Series([np.nan] * 32).plot()
What am I missing? Is there a right way to plot a column with all NaN's or is it a bug?
This looks like a bug in pandas. Looking at the source code, in pandas.tools.plotting, lines 554:556:
empty = df[col].count() == 0
# is this right?
values = df[col].values if not empty else np.zeros(len(df))
If the column contains only NaNs, then empty is True and values is set to np.zeros().
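The effect of those lines can be reproduced outside the plotting code. This is a sketch of the quoted logic only (as it appears above for v0.8.1), applied to a small made-up frame; it says nothing about how current pandas versions behave.

```python
import numpy as np
import pandas as pd

# Small made-up frame with one all-NaN column.
df = pd.DataFrame({"A": [np.nan] * 4, "B": range(4)})
col = "A"

# count() ignores NaN, so an all-NaN column counts as empty.
empty = df[col].count() == 0

# Same branch as the quoted source: empty columns become zeros.
values = df[col].values if not empty else np.zeros(len(df))
print(empty, values)
```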
Note: I did not add the "is this right?" comment: it's in the source code! (pandas v.0.8.1).
I've raised a bug about it: https://github.com/pydata/pandas/issues/1696