Changed frequency of ticks in Pandas '.bar' plot, but messed up the actual bars

How's your self-isolation going?
Mine rocks, as I'm drilling through visualization in Python. Recently, however, I've run into an issue.
I figured out that .plot.bar() in Pandas formats the x-axis in an unusual way (which kinda confirms that I read before I ask). I had price data with monthly frequency, so I applied a fix to display only yearly ticks in a bar chart:
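import matplotlib.pyplot as plt
import matplotlib.dates as mdates
fig, ax = plt.subplots()
ax.bar(btc_returns.index, btc_returns)
# one major tick per year, labelled with the year only
ax.xaxis.set_major_locator(mdates.YearLocator())
ax.xaxis.set_major_formatter(mdates.DateFormatter('%Y'))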
Where btc_returns is a Series object with datetime in index.
The output I got was weird. Here are the screenshots of what I expected vs the end result.
I tried to find a solution to this, but no luck. Can you guys please give me a hand? Thanks! Criticism is welcome as always :)

And my solution is like this:
fig, ax = plt.subplots(figsize=(15,7))
ax.bar(btc_returns.index, btc_returns.returns.values, width = 1)
Where btc_returns is a DataFrame with the returns of BTC. I figured that .values makes the bar plot read the datetime input correctly. As for the 'missing' bars: they were just too narrow to show up, so I set the width to 1.
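A hedged sketch of why the bars can look missing: on a date axis matplotlib measures bar width in days, so the default width of 0.8 is a sliver when the bars are a month apart. The toy series and the width of 20 days below are assumptions for illustration, not the original btc_returns data.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import numpy as np
import pandas as pd

# toy monthly returns (assumed data, stands in for btc_returns)
idx = pd.date_range("2015-01-01", periods=48, freq="MS")
returns = pd.Series(np.random.default_rng(0).normal(0, 10, len(idx)), index=idx)

fig, ax = plt.subplots(figsize=(15, 7))
# width is given in days on a date axis; 20 fills most of each month
bars = ax.bar(returns.index, returns.values, width=20)
# yearly ticks, as in the question
ax.xaxis.set_major_locator(mdates.YearLocator())
ax.xaxis.set_major_formatter(mdates.DateFormatter("%Y"))
```

With daily data a width near 1 is a natural choice; for monthly data something in the range of 20 to 28 days keeps neighbouring bars from touching.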

Using the stock value data from Yahoo Finance: Bitcoin USD
Technically, you can do pd.to_datetime(btc.Date).dt.date at the beginning, but then resample won't work, which is why btc_monthly.index.date is applied as a second step.
resample can use different periods (e.g. 2M = every two months).
Load and transform the data
import pandas as pd
import matplotlib.pyplot as plt
# load data
btc = pd.read_csv('data/BTC-USD.csv')
# Date to datetime
btc.Date = pd.to_datetime(btc.Date)
# calculate daily return %
btc['return'] = ((btc.Close - btc.Close.shift(1))/btc.Close.shift(1))*100
# resample to monthly and aggregate by sum
btc_monthly = btc.resample('M', on='Date').sum()
# set the index to be date only (no time)
btc_monthly.index = btc_monthly.index.date
Plot
btc_monthly.plot(y='return', kind='bar', figsize=(15, 8))
plt.show()
Plot Bimonthly
btc_monthly = btc.resample('2M', on='Date').sum() # instead of 'M'
btc_monthly.index = btc_monthly.index.date
btc_monthly.plot(y='return', kind='bar', figsize=(15, 8), legend=False)
plt.title('Bitcoin USD: Bimonthly % Return')
plt.ylabel('% return')
plt.xlabel('Date')
plt.show()
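For context on why date locators misbehave with pandas bar plots: pandas draws the bars at integer positions 0, 1, 2, ... and only labels them with the dates. One alternative, sketched here under that assumption, is to thin the labels by position. The toy data and the choice of every 12th bar are made up for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# toy monthly data (assumed, stands in for btc_monthly['return'])
idx = pd.date_range("2015-01-01", periods=48, freq="MS")
monthly = pd.Series(np.random.default_rng(1).normal(0, 10, len(idx)), index=idx)

ax = monthly.plot(kind="bar", figsize=(15, 8))
positions = ax.get_xticks()  # integer bar positions, not dates

# keep one tick per year and label it with the year only
step = 12
ax.set_xticks(positions[::step])
ax.set_xticklabels(list(monthly.index[::step].strftime("%Y")), rotation=0)
```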

Related

ValueError: Maximum allowed size exceeded when plotting using seaborn

I have a data set with 37 columns and 230k rows.
I am trying to plot a histogram of every column with seaborn.
I have not yet cleaned my data.
Here is my code:
import matplotlib.pyplot as plt
import seaborn as sns
for i in X.columns:
    plt.figure()
    ax = sns.histplot(data=df, x=i)
I also got this: File C:\ProgramData\Anaconda3\lib\site-packages\numpy\core\function_base.py:135 in linspace y = _nx.arange(0, num, dtype=dt).reshape((-1,) + (1,) * ndim(delta))
Is there any solution for this, please?
It may be due to the size of your dataset. So you can try to draw one histogram at a time.
I think there is an inconsistency in your code: you loop over the columns of the dataframe X but you plot the columns of the dataframe df. It is more consistent like this:
for i in df.columns:
    plt.figure()
    ax = sns.histplot(data=df, x=i)
The problem was solved by setting the number of bins explicitly. The default is bins='auto', and that was the cause: on a large dataset with high variance it can ask for an enormous number of bins.
The code that solved my issue is below:
for i in X.columns:
    plt.figure()
    ax = sns.histplot(data=df, x=i, bins=50)
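To see how an automatic bin rule can blow up, here is a rough sketch of the Freedman-Diaconis bin width, one of the rules that NumPy's bins='auto' can draw on. Whether seaborn hit exactly this path for the data above is an assumption, and the synthetic column with a single outlier is made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# assumed toy column: 10,000 well-behaved values plus one extreme outlier
col = np.append(rng.normal(0.0, 1.0, 10_000), 1e5)

# Freedman-Diaconis rule: bin width = 2 * IQR / n^(1/3)
q75, q25 = np.percentile(col, [75, 25])
width = 2.0 * (q75 - q25) / len(col) ** (1.0 / 3.0)

# the outlier stretches the range while the IQR stays small,
# so the implied number of bins becomes very large
n_bins = int(np.ceil((col.max() - col.min()) / width))
print(n_bins)

# a fixed bin count sidesteps the estimator entirely
counts, edges = np.histogram(col, bins=50)
```

Cleaning or clipping outliers before plotting is another way to keep the automatic rule within reason.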

Iterating through a list of series for plotting

I have a list of series where each series is 100 observations of the time it took to parse some data.
list_series = [Series1, Series2, Series3...]
When I try to plot the series individually, I would do something like
# generate subplots
fig, ax = plt.subplots()
# histplot
sns.histplot(data=Series1, ax=ax, color='deepskyblue', kde=True, alpha=0.3)
# title
ax.set_title(f'Parsing times for Series1: Mean = {np.mean(Series1):0.4}, Median = {np.median(Series1):0.4}')
# fix overlap
fig.tight_layout()
simple enough, and the plot looks good.
However, when I run pretty much the same steps in a loop over list_series, the plots come out wonky:
list_series = [Series1, Series2, Series3...]
for i in list_series:
    fig, ax = plt.subplots()
    sns.histplot(data=i, ax=ax, color='deepskyblue', kde=True, alpha=0.3)
    ax.set_title(f'Parsing times for {i}: Mean = {np.mean(i):0.4}, Median = {np.median(i):0.4}')
    fig.tight_layout()
The name of the Series in the title gets replaced with the index number, and the head and tail of the series get squished above the actual plot...
(don't worry about mean/median not matching up, I did some order shuffling within the list during trials)
Is there something I'm not understanding as to how loops or plotting works?
Thank you.
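One likely cause, offered as a guess since no answer is quoted here: inside the f-string, {i} formats the whole Series (index, values, dtype) rather than a variable name. A minimal sketch of a workaround, assuming each Series is given a .name (the names and data below are made up):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# hypothetical stand-ins for the parsing-time series
list_series = [
    pd.Series(rng.random(100), name="Series1"),
    pd.Series(rng.random(100), name="Series2"),
]

titles = []
for s in list_series:
    # s.name is a short label; f"{s}" would embed the full multi-line Series repr
    title = (f"Parsing times for {s.name}: "
             f"Mean = {np.mean(s):0.4}, Median = {np.median(s):0.4}")
    titles.append(title)
    # pass `title` to ax.set_title(...) inside the plotting loop
```

A dict mapping names to series, iterated with .items(), would work just as well if the series have no names.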

Adding grouping ticks to a bar chart

I have a chart created from a pandas DataFrame that looks like this:
I've formatted the ticks with:
ax = df.plot(kind='bar')
ax.set_xticklabels(df.index.strftime('%I %p'))
However, I'd like to add a second set of larger ticks, to achieve this kind of effect:
I've tried many variations of set_major_locator and set_major_formatter (as well as combining major and minor formatters), but it seems I'm not approaching it correctly, and I wasn't able to find useful examples of similar combined ticks online either.
Does someone have a suggestion on how to achieve something similar to the bottom image?
The dataframe has a datetime index and holds binned data, from something like df.resample(bin_size, label='right', closed='right').sum()
One idea is to set major ticks to display the date (%-d-%b) at noon each day with some padding (e.g., pad=40). This will leave a minor tick gap at noon, so for consistency you could set minor ticks only on the odd hours and give them rotation=90.
Note that this uses matplotlib's bar() since pandas' plot.bar() doesn't play well with the date formatting.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
# toy data
dates = pd.date_range('2021-08-07', '2021-08-10', freq='1H')
df = pd.DataFrame({'date': dates, 'value': np.random.randint(10, size=len(dates))}).set_index('date')
# pyplot bar instead of pandas bar
fig, ax = plt.subplots(figsize=(14, 4))
ax.bar(df.index, df.value, width=0.02)
# put day labels at noon
# ('%-d' and '%-I' drop the leading zero on Linux/macOS; Windows uses '%#d' and '%#I')
ax.xaxis.set_major_locator(mdates.HourLocator(byhour=[12]))
ax.xaxis.set_major_formatter(mdates.DateFormatter('%-d-%b'))
ax.xaxis.set_tick_params(which='major', pad=40)
# put hour labels on odd hours
ax.xaxis.set_minor_locator(mdates.HourLocator(byhour=range(1, 25, 2)))
ax.xaxis.set_minor_formatter(mdates.DateFormatter('%-I %p'))
ax.xaxis.set_tick_params(which='minor', pad=0, rotation=90)
# add day separators at every midnight tick
ticks = df[df.index.strftime('%H:%M:%S') == '00:00:00'].index
arrowprops = dict(width=2, headwidth=1, headlength=1, shrink=0.02)
for tick in ticks:
    xy = (mdates.date2num(tick), 0)  # convert date index to float coordinate
    xytext = (0, -65)  # draw downward 65 points
    ax.annotate('', xy=xy, xytext=xytext, textcoords='offset points',
                annotation_clip=False, arrowprops=arrowprops)

Pandas: How can I plot with separate y-axis, but still control the order?

I am trying to plot multiple time series in one plot. The scales are different, so they need separate y-axis, and I want a specific time series to have its y-axis on the right. I also want that time series to be behind the others. But I find that when I use secondary_y=True, this time series is always brought to the front, even if the code to plot it comes before the others. How can I control the order of the plots when using secondary_y=True (or is there an alternative)?
Furthermore, when I use secondary_y=True the y-axis on the left no longer adapts to appropriate values. Is there a fix for this?
# imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# dummy data
lenx = 1000
x = range(lenx)
np.random.seed(4)
y1 = np.random.randn(lenx)
y1 = pd.Series(y1, index=x)
y2 = 50.0 + y1.cumsum()
# plot time series.
# use ax to make Pandas plot them in the same plot.
ax = y2.plot.area(secondary_y=True)
y1.plot(ax=ax)
So what I would like is to have the blue area plot behind the green time series, and to have the left y-axis take appropriate values for the green time series:
https://i.stack.imgur.com/6QzPV.png
Perhaps something like the following using matplotlib.axes.Axes.twinx instead of using secondary_y, and then following the approach in this answer to move the twinned axis to the background:
# plot time series.
fig, ax = plt.subplots()
y1.plot(ax=ax, color='green')
ax.set_zorder(10)
ax.patch.set_visible(False)
ax1 = ax.twinx()
y2.plot.area(ax=ax1, color='blue')

Pandas time series plot - setting custom ticks

I am creating a general-purpose average_week aggregation and plot tool using pandas. Everything works fine (I'd be glad to receive comments on that, too), except for the ticks: since I "fake" the dates, I want to replace the whole set of ticks with my own (I have already received some questions about January 1 on the timeline).
Yet it seems that pandas overwrites all the ticks, no matter what I pass afterwards. I was able to add the ticks I want, but I can't find how to erase the pandas ones.
import datetime
import matplotlib.pyplot as plt

def averageWeek(df, ax, tcol='ts', ccol='id', label=None, treshold=0,
                normalize=True, **kwargs):
    '''calculate average week on ts'''
    s = df[[tcol, ccol]].rename(columns={tcol: 'ts', ccol: 'id'})  # rename to convention
    # count events per 15-minute bin
    s = s.set_index('ts').resample('15min').count().reset_index()
    s['id'] = s['id'].astype(float)
    # map every timestamp onto a "fake" week, January 1-7, 2015
    s['ts'] = s.ts.apply(lambda x: datetime.datetime(year=2015, month=1,
                                                     day=(x.weekday() + 1),
                                                     hour=x.hour,
                                                     minute=x.minute))
    s = s.groupby(['ts']).agg('mean')
    if s.id.sum() >= treshold:
        if normalize:
            s = 1.0 * s / s.sum()
        if label:
            s.rename(columns={'id': label}, inplace=True)
        s.plot(ax=ax, legend=False, **kwargs)
    else:
        print(label, "didn't pass threshold:", s['id'].sum())
    return s  # the original returned an undefined name; the aggregate is assumed here
fig, ax = plt.subplots(figsize=(18, 6))
aw = averageWeek(LMdata, ax=ax, label='Lower Manhattan', alpha=1, lw=1)
x = [datetime.datetime(year=2015, month=1, day=i) for i in range(1, 8)]
labels = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
ax.get_xaxis().set_ticks([])
plt.xlabel('Average week')
plt.legend()
Your problem is that there are actually two kinds of tick labels involved in this: major and minor ticklabels, at major and minor ticks. You want to clear both of them. For example, if ax is the axis in question, the following will work:
ax.set_xticklabels([],minor=False) # the default
ax.set_xticklabels([],minor=True)
You can then set the ticklabels and tick locations that you want.
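A minimal sketch of that last step, assuming a "fake week" in January 2015 like the one in the question. The toy data is made up, and it is drawn with matplotlib's own plot() rather than pandas' plot so that the x-axis is in plain matplotlib date units; with a pandas-drawn axis the tick positions may need to be expressed differently.

```python
import datetime
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import numpy as np

# toy "fake week" in 15-minute steps (assumed data)
start = datetime.datetime(2015, 1, 1)
times = [start + datetime.timedelta(minutes=15 * k) for k in range(7 * 96)]
values = np.sin(np.linspace(0, 14 * np.pi, len(times)))

fig, ax = plt.subplots(figsize=(18, 6))
ax.plot(times, values)

# clear both sets of tick labels first
ax.set_xticklabels([], minor=False)
ax.set_xticklabels([], minor=True)

# then place one major tick per day and label it with the weekday name
days = [datetime.datetime(2015, 1, d) for d in range(1, 8)]
labels = ['Monday', 'Tuesday', 'Wednesday', 'Thursday',
          'Friday', 'Saturday', 'Sunday']
ax.set_xticks(mdates.date2num(days))
ax.set_xticks([], minor=True)  # drop minor ticks entirely
ax.set_xticklabels(labels)
```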