Only plotting observed dates in matplotlib, skipping range of dates

I have a simple dataframe I am plotting in matplotlib. However, the plot is showing the range of the dates, rather than just the two observed data points.
How can I only plot the two data points and not the range of the dates?
df structure:
Date Number
2018-01-01 12:00:00 1
2018-02-01 12:00:00 2
Output of the matplotlib code:
Here is what I expected (this was done using a string and not a date on the x-axis data):
df code:
import pandas as pd
df = pd.DataFrame([['2018-01-01 12:00:00', 1], ['2018-02-01 12:00:00',2]], columns=['Date', 'Number'])
df['Date'] = pd.to_datetime(df['Date'])
df.set_index(['Date'],inplace=True)
Plot code:
import matplotlib.pyplot as plt
fig, ax1 = plt.subplots(
    figsize=(4,5),
    dpi=72
)
width = 0.75
#starts the bar chart creation
ax1.bar(df.index, df['Number'],
        width,
        align='center',
        color=('#666666', '#333333'),
        edgecolor='#FF0000',
        linewidth=2
        )
ax1.set_ylim(0,3)
ax1.set_ylabel('Score')
fig.autofmt_xdate()
#Title
plt.title('Scores by group and gender')
plt.tight_layout()
plt.show()

Try adding something like:
import matplotlib.dates as mdates
myFmt = mdates.DateFormatter('%y-%m-%d')
ax1.xaxis.set_major_formatter(myFmt)
plt.xticks(df.index)
Matplotlib converts the dates to floating-point numbers (days since an epoch) when plotting, so width = 0.75 means three quarters of a day, which is very narrow on an axis that spans a month. Try something bigger (like width = 20):
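To see why the width matters, it can help to look at the numbers Matplotlib actually places on a date axis. The snippet below is only an illustration using the two dates from the question; `gap_days` and `width` are names made up here, and the 60% factor is an arbitrary assumption, not a recommended value.

```python
import matplotlib.dates as mdates
import pandas as pd

dates = pd.to_datetime(['2018-01-01 12:00:00', '2018-02-01 12:00:00'])

# date2num returns the float "days since the epoch" values used on a date axis
nums = mdates.date2num(dates.to_numpy())

# Distance between the two bars, measured in days
gap_days = nums[1] - nums[0]

# One possible choice (assumption): let each bar take up 60% of the gap
width = 0.6 * gap_days
```

A width derived from the gap between observations scales with the data, whereas a hard-coded 0.75 or 20 only suits one particular date spacing.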

Matplotlib bar plots are numeric in nature. If you want a categorical bar plot instead, you may use pandas bar plots.
df.plot.bar()
You may then want to beautify the labels a bit
import matplotlib.pyplot as plt
import pandas as pd
df = pd.DataFrame([['2018-01-01 12:00:00', 1], ['2018-02-01 12:00:00',2]], columns=['Date', 'Number'])
df['Date'] = pd.to_datetime(df['Date'])
df.set_index(['Date'],inplace=True)
ax = df.plot.bar()
ax.tick_params(axis="x", rotation=0)
ax.set_xticklabels([t.get_text().split()[0] for t in ax.get_xticklabels()])
plt.show()
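If you would rather stay in plain Matplotlib, the question already hints at another route: pass strings instead of dates, so the axis becomes categorical. A minimal sketch, assuming the '%Y-%m-%d' label format is what you want (the `labels` and `bars` names are only for illustration):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere; drop this in a notebook
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame([['2018-01-01 12:00:00', 1], ['2018-02-01 12:00:00', 2]],
                  columns=['Date', 'Number'])
df['Date'] = pd.to_datetime(df['Date'])
df.set_index('Date', inplace=True)

# Format the dates as strings; Matplotlib then places the bars at positions 0, 1, ...
labels = list(df.index.strftime('%Y-%m-%d'))

fig, ax = plt.subplots(figsize=(4, 5), dpi=72)
bars = ax.bar(labels, df['Number'], width=0.75,
              color=('#666666', '#333333'), edgecolor='#FF0000', linewidth=2)
ax.set_ylim(0, 3)
ax.set_ylabel('Score')
fig.tight_layout()
```

Because the bars sit at integer positions here, width = 0.75 behaves as originally intended. The trade-off is that the spacing between bars no longer reflects the time between observations.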

Related

How make scatterplot in pandas readable

I've been playing with Titanic dataset and working through some visualisations in Pandas using this tutorial. https://www.kdnuggets.com/2023/02/5-pandas-plotting-functions-might-know.html
I have a visual of scatterplot having used this code.
import pandas as pd
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.read_csv('train.csv')
I was confused by bootstrap plot result so went on to scatterplot.
pd.plotting.scatter_matrix(df, figsize=(10,10), )
plt.show()
I can sort of interpret it but I'd like to put the various variables at top and bottom of every column. Is that doable?
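You can lay out the scatter plots yourself on a grid of subplots, so that each one gets its own axis labels. The column names below come from a different dataset, so substitute your own:
fig, ax = plt.subplots(4, 3, figsize=(20, 15))
sns.scatterplot(x='bedrooms', y='price', data=dataset, ax=ax[0, 0])
sns.scatterplot(x='bathrooms', y='price', data=dataset, ax=ax[0, 1])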
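Another option is to keep scatter_matrix and label the grid it returns. pd.plotting.scatter_matrix returns a 2D array of Axes, so the top row can be given titles. This is a rough sketch on random placeholder data: the Age, Fare and SibSp column names only mimic the Titanic file, and `demo` is a made-up stand-in for the real DataFrame.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Placeholder numeric data standing in for the Titanic columns (assumption)
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.normal(size=(50, 3)), columns=['Age', 'Fare', 'SibSp'])

# scatter_matrix returns an array of Axes with shape (n_columns, n_columns)
axes = pd.plotting.scatter_matrix(demo, figsize=(6, 6))

# Put the variable name above each column of the grid
for ax, name in zip(axes[0, :], demo.columns):
    ax.set_title(name)
```

The bottom row normally already carries the variable names as x-axis labels, so titling the top row should give names at both ends of every column.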

How to start Seaborn Logarithmic Barplot at y=1

I have a problem figuring out how to have Seaborn show the right values in a logarithmic barplot. In the ideal case, each of my values should be 1. My data series (5, 2, 1, 0.5, 0.2) has values that deviate from unity, and I want to visualize these deviations in a logarithmic barplot. However, when plotting this in the standard log-barplot it shows the following:
But the values under one are shown to increase from -infinity to their value, whilst the real values ought to look like this:
Strangely enough, I was unable to find a Seaborn, Pandas or Matplotlib attribute to "snap" to a different horizontal axis or "align" or ymin/ymax. I have a feeling I am unable to find it because I can't find the terms to shove down my favorite search engine. Some semi-solutions I found just did not match what I was looking for or did not have either xaxis = 1 or a ylog. A try that uses some jank Matplotlib lines:
If someone knows the right terms or a solution, thank you in advance.
Here are the Jupyter cells I used:
{1}
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt
data = {'X': ['A','B','C','D','E'], 'Y': [5,2,1,0.5,0.2]}
df = pd.DataFrame(data)
{2}
%matplotlib widget
g = sns.catplot(data=df, kind="bar", y = "Y", x = "X", log = True)
{3}
%matplotlib widget
plt.vlines(x=data['X'], ymin=1, ymax=data['Y'])
You could let the bars start at 1 instead of at 0. You'll need to use sns.barplot directly.
The example code subtracts 1 from all y-values and sets the bar bottom at 1.
import matplotlib.pyplot as plt
from matplotlib.ticker import NullFormatter
import seaborn as sns
import pandas as pd
import numpy as np
data = {'X': ['A', 'B', 'C', 'D', 'E'], 'Y': [5, 2, 1, 0.5, 0.2]}
df = pd.DataFrame(data)
ax = sns.barplot(y=df["Y"] - 1, x=df["X"], bottom=1, log=True, palette='flare_r')
ax.axhline(y=1, c='k')
# change the y-ticks, as the default shows too few in this case
ax.set_yticks(np.append(np.arange(.2, .8, .1), np.arange(1, 7, 1)), minor=False)
ax.set_yticks(np.arange(.3, 6, .1), minor=True)
ax.yaxis.set_major_formatter(lambda x, pos: f'{x:.0f}' if x >= 1 else f'{x:.1f}')
ax.yaxis.set_minor_formatter(NullFormatter())
ax.bar_label(ax.containers[0], labels=df["Y"])
sns.despine()
plt.show()
PS: With these specific values, the plot might go without logscale:
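The same idea can also be tried without Seaborn. Below is a minimal Matplotlib-only sketch of "height = value - 1, bottom = 1" on a log axis; the variable names `bars` and `tops` are just for illustration, and tick formatting is left at the defaults.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt

x = ['A', 'B', 'C', 'D', 'E']
y = [5, 2, 1, 0.5, 0.2]

fig, ax = plt.subplots()
# Bars start at 1 and extend by (value - 1), so values below 1 point downwards
bars = ax.bar(x, [v - 1 for v in y], bottom=1)
ax.set_yscale('log')
ax.axhline(y=1, color='k')

# The far end of each bar lands on the original value
tops = [bar.get_y() + bar.get_height() for bar in bars]
```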

Show multiple columns values on labels with squarify.plot

I have a dataframe that I'd like to plot as a tree map with squarify. I'd like to show both the country_name and counts on the chart by editing the label parameter, but it seems to take only one value.
Example data
import squarify
import pandas as pd
from matplotlib import pyplot as plt
import matplotlib
d = {'country_name':['USA', 'UK', 'Germany'], 'counts':[100, 200, 300]}
dd = pd.DataFrame(data=d)
fig = plt.gcf()
ax = fig.add_subplot()
fig.set_size_inches(16, 4.5)
norm = matplotlib.colors.Normalize(vmin=min(dd.counts), vmax=max(dd.counts))
colors = [matplotlib.cm.Blues(norm(value)) for value in dd.counts]
squarify.plot(label=dd.country_name, sizes=dd.counts, alpha=.7, color=colors)
plt.axis('off')
plt.show()
Expected output will have both counts and country_name on the chart.
You can create a list of labels by looping simultaneously through both columns and composing combined strings. For example:
import squarify
import pandas as pd
from matplotlib import pyplot as plt
import matplotlib
d = {'country_name': ['USA', 'UK', 'Germany'], 'counts': [100, 200, 300]}
dd = pd.DataFrame(data=d)
labels = [f'{country}\n{count}' for country, count in zip(dd.country_name, dd.counts)]
fig = plt.gcf()
ax = fig.add_subplot()
fig.set_size_inches(16, 4.5)
norm = matplotlib.colors.Normalize(vmin=min(dd.counts), vmax=max(dd.counts))
colors = [matplotlib.cm.Blues(norm(value)) for value in dd.counts]
squarify.plot(label=labels, sizes=dd.counts, alpha=.7, color=colors)
plt.axis('off')
plt.show()
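The same zip-and-format pattern extends to derived values. As a small illustration, the labels below also carry each country's share of the total; the percentage is an addition made up for this sketch and not something the question asked for.

```python
import pandas as pd

d = {'country_name': ['USA', 'UK', 'Germany'], 'counts': [100, 200, 300]}
dd = pd.DataFrame(data=d)

total = dd['counts'].sum()

# One label per rectangle: name on the first line, count and share on the second
labels = [f'{country}\n{count} ({count / total:.0%})'
          for country, count in zip(dd['country_name'], dd['counts'])]
```

The resulting list can be passed to squarify.plot through the label argument, exactly as in the answer above.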

Pandas histogram df.hist() group by

How to plot a histogram with pandas DataFrame.hist() using group by?
I have a data frame with 5 columns: "A", "B", "C", "D" and "Group"
There are two Groups classes: "yes" and "no"
Using:
df.hist()
I get the hist for each of the 4 columns.
Now I would like to get the same 4 graphs but with blue bars (group="yes") and red bars (group = "no").
I tried this without success:
df.hist(by = "group")
Using Seaborn
If you are open to using Seaborn, a plot with multiple subplots and multiple variables within each subplot can easily be made using seaborn.FacetGrid.
import numpy as np; np.random.seed(1)
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
df = pd.DataFrame(np.random.randn(300,4), columns=list("ABCD"))
df["group"] = np.random.choice(["yes", "no"], p=[0.32,0.68],size=300)
df2 = pd.melt(df, id_vars='group', value_vars=list("ABCD"), value_name='value')
bins=np.linspace(df2.value.min(), df2.value.max(), 10)
g = sns.FacetGrid(df2, col="variable", hue="group", palette="Set1", col_wrap=2)
g.map(plt.hist, 'value', bins=bins, ec="k")
g.axes[-1].legend()
plt.show()
This is not the most flexible workaround but will work for your question specifically.
def sephist(col):
    yes = df[df['group'] == 'yes'][col]
    no = df[df['group'] == 'no'][col]
    return yes, no
for num, alpha in enumerate('ABCD', start=1):  # subplot indices start at 1
    plt.subplot(2, 2, num)
    plt.hist(sephist(alpha)[0], bins=25, alpha=0.5, label='yes', color='b')
    plt.hist(sephist(alpha)[1], bins=25, alpha=0.5, label='no', color='r')
    plt.legend(loc='upper right')
    plt.title(alpha)
plt.tight_layout(pad=0.4, w_pad=0.5, h_pad=1.0)
You could make this more generic by:
adding a df and by parameter to sephist: def sephist(df, by, col)
making the subplots loop more flexible: for num, alpha in enumerate(df.columns)
Because the first argument to matplotlib.pyplot.hist can be either a single array or a sequence of arrays, which are not required to be of the same length, an alternative would be:
for num, alpha in enumerate('ABCD', start=1):  # subplot indices start at 1
    plt.subplot(2, 2, num)
    plt.hist((sephist(alpha)[0], sephist(alpha)[1]), bins=25, alpha=0.5, label=['yes', 'no'], color=['b', 'r'])
    plt.legend(loc='upper right')
    plt.title(alpha)
plt.tight_layout(pad=0.4, w_pad=0.5, h_pad=1.0)
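To check the claim about unequal lengths in isolation, here is a small sketch on random data. The arrays `yes` and `no` are stand-ins for the two groups, and their sizes are arbitrary.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(1)
yes = rng.normal(size=96)    # stand-in for the "yes" rows
no = rng.normal(size=204)    # stand-in for the "no" rows; a different length is fine

fig, ax = plt.subplots()
# A sequence of arrays gives one set of bars per array, all sharing the same bin edges
counts, bin_edges, _ = ax.hist([yes, no], bins=25, label=['yes', 'no'], color=['b', 'r'])
ax.legend(loc='upper right')
```

With the default histtype, the two groups are drawn side by side within each bin rather than overlaid, so the alpha setting matters less than in the two-call version.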
I generalized one of the other answers' solutions. Hope it helps someone out there. I added a line to ensure the binning (number and range) is the same for each column, regardless of group. The code should work for both "binary" and "categorical" groupings, i.e. "by" can specify a column containing any number of unique groups. Columns that do not fit into the subplot grid are skipped.
import numpy as np
import matplotlib.pyplot as plt
def composite_histplot(df, columns, by, nbins=25, alpha=0.5):
    def _sephist(df, col, by):
        unique_vals = df[by].unique()
        df_by = dict()
        for uv in unique_vals:
            df_by[uv] = df[df[by] == uv][col]
        return df_by
    subplt_c = 4
    subplt_r = 5
    fig = plt.figure()
    for num, col in enumerate(columns):
        if num + 1 > subplt_c * subplt_r:
            continue
        plt.subplot(subplt_c, subplt_r, num+1)
        bins = np.linspace(df[col].min(), df[col].max(), nbins)
        for lbl, sepcol in _sephist(df, col, by).items():
            plt.hist(sepcol, bins=bins, alpha=alpha, label=lbl)
        plt.legend(loc='upper right', title=by)
        plt.title(col)
    plt.tight_layout()
    return fig
TL;DR one-liner:
It won't create subplots, but it will create 4 separate plots:
[df.groupby('group')[i].plot(kind='hist',title=i)[0] and plt.legend() and plt.show() for i in 'ABCD']
Full working example below
import numpy as np; np.random.seed(1)
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
df = pd.DataFrame(np.random.randn(300,4), columns=list("ABCD"))
df["group"] = np.random.choice(["yes", "no"], p=[0.32,0.68],size=300)
[df.groupby('group')[i].plot(kind='hist',title=i)[0] and plt.legend() and plt.show() for i in 'ABCD']

Arrange two plots horizontally

As an exercise, I'm reproducing a plot from The Economist with matplotlib
So far, I can generate a random data and produce two plots independently. I'm struggling now with putting them next to each other horizontally.
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline
df1 = pd.DataFrame({"broadcast": np.random.randint(110, 150,size=8),
"cable": np.random.randint(100, 250, size=8),
"streaming" : np.random.randint(10, 50, size=8)},
index=pd.Series(np.arange(2009,2017),name='year'))
df1.plot.bar(stacked=True)
df2 = pd.DataFrame({'usage': np.sort(np.random.randint(1,50,size=7)),
'avg_hour': np.sort(np.random.randint(0,3, size=7) + np.random.ranf(size=7))},
index=pd.Series(np.arange(2009,2016),name='year'))
plt.figure()
fig, ax1 = plt.subplots()
ax1.plot(df2['avg_hour'])
ax2 = ax1.twinx()
ax2.bar(x=range(2009,2016),height=df2['usage'])  # the old left= keyword is now x=
plt.show()
You should try using subplots. First create a figure with plt.figure(). Then add a subplot with add_subplot(121), where the first 1 is the number of rows, 2 is the number of columns, and the last 1 selects the first plot. Plot the first dataframe on it, making sure to pass the created axes ax1. Then add the second subplot with add_subplot(122) and repeat for the second dataframe. I renamed your twin axes from ax2 to ax3, since there are now three axes on one figure. The code below produces what I believe you are looking for. You can then work on the aesthetics of each plot separately.
%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
fig = plt.figure()
df1 = pd.DataFrame({"broadcast": np.random.randint(110, 150,size=8),
"cable": np.random.randint(100, 250, size=8),
"streaming" : np.random.randint(10, 50, size=8)},
index=pd.Series(np.arange(2009,2017),name='year'))
ax1 = fig.add_subplot(121)
df1.plot.bar(stacked=True,ax=ax1)
df2 = pd.DataFrame({'usage': np.sort(np.random.randint(1,50,size=7)),
'avg_hour': np.sort(np.random.randint(0,3, size=7) + np.random.ranf(size=7))},
index=pd.Series(np.arange(2009,2016),name='year'))
ax2 = fig.add_subplot(122)
ax2.plot(df2['avg_hour'])
ax3 = ax2.twinx()
ax3.bar(x=range(2009,2016),height=df2['usage'])  # the old left= keyword is now x=
plt.show()
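An equivalent layout can be sketched with plt.subplots(1, 2), which creates the figure and both side-by-side axes in one call. This is only one possible variant: the data is random as in the question, and the names `ax_left`, `ax_right` and `ax_twin`, the figsize and the alpha value are arbitrary choices made for this example.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df1 = pd.DataFrame({'broadcast': rng.integers(110, 150, size=8),
                    'cable': rng.integers(100, 250, size=8),
                    'streaming': rng.integers(10, 50, size=8)},
                   index=pd.Index(np.arange(2009, 2017), name='year'))
df2 = pd.DataFrame({'usage': np.sort(rng.integers(1, 50, size=7)),
                    'avg_hour': np.sort(rng.uniform(0, 3, size=7))},
                   index=pd.Index(np.arange(2009, 2016), name='year'))

# One row, two columns: the axes come back already arranged left to right
fig, (ax_left, ax_right) = plt.subplots(1, 2, figsize=(10, 4))

df1.plot.bar(stacked=True, ax=ax_left)

# Right panel: line on the primary y-axis, bars on a twin y-axis
ax_right.plot(df2.index, df2['avg_hour'])
ax_twin = ax_right.twinx()
ax_twin.bar(df2.index, df2['usage'], alpha=0.3)

fig.tight_layout()
```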