How to expand bars over the month on the x-axis while being the same width? - dataframe

for i in range(len(basin)):
    prefix = "URL here"
    state = "OR"
    basin_name = basin[i]
    df_orig = pd.read_csv(f"{prefix}/{basin_name}.csv", index_col=0)
    #---create date x-index
    curr_wy_date_rng = pd.date_range(
        start=dt(curr_wy-1, 10, 1),
        end=dt(curr_wy, 9, 30),
        freq="D",
    )
    if not calendar.isleap(curr_wy):
        print("dropping leap day")
        df_orig.drop(["02-29"], inplace=True)
    use_cols = ["Median ('91-'20)", f"{curr_wy}"]
    df = pd.DataFrame(data=df_orig[use_cols].copy())
    df.index = curr_wy_date_rng
    #--create EOM percent of median values-------------------------------------
    curr_wy_month_rng = pd.date_range(
        start=dt(curr_wy-1, 10, 1),
        end=dt(curr_wy, 6, 30),
        freq="M",
    )
    df_monthly_prec = pd.DataFrame(data=df_monthly_basin[basin[i]].copy())
    df_monthly_prec.index = curr_wy_month_rng
    df_monthly = df.groupby(pd.Grouper(freq="M")).max()
    df_monthly["date"] = df_monthly.index
    df_monthly["wy_date"] = df_monthly["date"].apply(lambda x: cal_to_wy(x))
    df_monthly.index = pd.to_datetime(df_monthly["wy_date"])
    df_monthly.index = df_monthly["date"]
    df_monthly["month"] = df_monthly["date"].apply(
        lambda x: calendar.month_abbr[x.month]
    )
    df_monthly["wy"] = df_monthly["wy_date"].apply(lambda x: x.year)
    df_monthly.sort_values(by="wy_date", axis=0, inplace=True)
    df_monthly.drop(
        columns=[i for i in df_monthly.columns if "date" in i], inplace=True
    )
    # df_monthly.index = df_monthly['month']
    df_merge = pd.merge(df_monthly,df_monthly_prec,how='inner', left_index=True, right_index=True)
    #---Subplots---------------------------------------------------------------
    fig, ax = plt.subplots(figsize=(8,4))
    ax.plot(df_merge.index, df_merge["Median ('91-'20)"], color="green", linewidth="1", linestyle="dashed", label = 'Median Snowpack')
    ax.plot(df_merge.index, df_merge[f'{curr_wy}'], color='red', linewidth='2',label='WY Current')
    #------Setting x-axis range to expand bar width for ax2
    ax.bar(df_merge.index,df_merge[basin[i]], color = 'blue', label = 'Monthly %')
    #n = n + 1
    #--format chart
    ax.set_title(chart_name[w], fontweight = 'bold')
    w = w + 1
    ax.set_ylabel("Basin Precipitation Index")
    ax.set_yticklabels([])
    ax.margins(x=0)
    ax.legend()
    #plt.xlim(0,9)
    #---Setting date format
    ax.xaxis.set_major_locator(mdates.MonthLocator())
    ax.xaxis.set_major_formatter(mdates.DateFormatter('%b'))
    #---EXPORT
    plt.show()
Desired end result: plot the monthly DataFrame (df_monthly_prec) as bars together with the daily DataFrame, which currently charts only its end-of-month values (df_monthly). Ideally, each bar from the monthly DataFrame should span its whole month on the chart.
I have tried creating a secondary axis, but had trouble aligning the dates on the primary and secondary axes. Ideally, I would also like to plot df instead of df_monthly (showing all daily data rather than just the end-of-month values from the daily dataset).
Any assistance or pointers would be much appreciated! Apologies if additional clarification is needed.
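One possible approach, sketched under assumptions rather than tested against the original data: on a matplotlib date axis a numeric bar width is measured in days, so each bar can start on the first of its month (align="edge") and take that month's day count as its width. The bars then share one axes with the daily line, so no secondary axis is needed. The helper names month_spans and plot_monthly_bars are hypothetical, and monthly and daily are assumed to be pandas Series with a timezone-naive DatetimeIndex.

```python
import pandas as pd


def month_spans(month_dates):
    """Return (month starts, widths in days) for the month containing each date."""
    idx = pd.DatetimeIndex(month_dates)
    # First day of the calendar month that contains each date.
    starts = idx.to_period("M").to_timestamp(how="start")
    # Number of days in that month, as floats (28.0 to 31.0).
    widths = idx.days_in_month.to_numpy().astype(float)
    return starts, widths


def plot_monthly_bars(ax, monthly, daily):
    """Draw month-wide bars for `monthly` and a line for `daily` on one axes."""
    starts, widths = month_spans(monthly.index)
    # On a date axis a numeric width is read as a number of days.
    ax.bar(starts, monthly.to_numpy(), width=widths, align="edge",
           color="blue", alpha=0.3, label="Monthly %")
    ax.plot(daily.index, daily.to_numpy(), color="red", label="WY Current")
    ax.margins(x=0)
    ax.legend()
    return ax
```

Because months have 28 to 31 days, bars that fill their month differ slightly in width. If strictly identical bars matter more than filling the month, a fixed width such as width=28 would be a compromise that leaves small gaps.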

Related

How to add count (n) / summary statistics as a label to ggplot2 boxplots?

I am new to R and trying to add count labels to my boxplots, so the sample size per boxplot shows in the graph.
This is my code:
bp_east_EC <- total %>%
  filter(year %in% c(1977, 2020, 2021, 1992),
         sampletype == "groundwater",
         East == 1,
         #EB == 1,
         #N59 == 1,
         variable %in% c("EC_uS")) %>%
  ggplot(., aes(x = as.character(year), y = value, colour = as.factor(year))) +
  theme_ipsum() +
  ggtitle("Groundwater EC, eastern Curacao") +
  theme(plot.title = element_text(hjust = 0.5, size=14)) +
  theme(legend.position = "none") +
  labs(x="", y="uS/cm") +
  geom_jitter(color="grey", size=0.4, alpha=0.9) +
  geom_boxplot() +
  stat_summary(fun.y=mean, geom="point", shape=23, size=2) #shows mean
I have googled a lot and tried different things (with annotate, with return functions, mtext, etc), but it keeps giving different errors. I think I am such a beginner I cannot figure out how to integrate such suggestions into my own code.
Does anybody have an idea what the best way would be for me to approach this?
I would create a new variable that contained your sample sizes per group and plot that number with geom_label. I've generated an example of how to add count/sample sizes to a boxplot using the iris dataset since your example isn't fully reproducible.
library(tidyverse)
data(iris)
# boxplot with no label
ggplot(iris, aes(x = Species, y = Sepal.Length, fill = Species)) +
  geom_boxplot()
# boxplot with label
iris %>%
  group_by(Species) %>%
  mutate(count = n()) %>%
  mutate(mean = mean(Sepal.Length)) %>%
  ggplot(aes(x = Species, y = Sepal.Length, fill = Species)) +
  geom_boxplot() +
  geom_label(aes(label = count, y = mean + 0.75), # <- change this to move label up and down
             size = 4, position = position_dodge(width = 0.75)) +
  geom_jitter(alpha = 0.35, aes(color = Species)) +
  stat_summary(fun = mean, geom = "point", shape = 23, size = 6)

Is there any other way to find percentage and plot a group bar-chart without using matplotlib?

emp_attrited = pd.DataFrame(df[df['Attrition'] == 'Yes'])
emp_not_attrited = pd.DataFrame(df[df['Attrition'] == 'No'])
print(emp_attrited.shape)
print(emp_not_attrited.shape)
att_dep = emp_attrited['Department'].value_counts()
percentage_att_dep = (att_dep/237)*100
print("Attrited")
print(percentage_att_dep)
not_att_dep = emp_not_attrited['Department'].value_counts()
percentage_not_att_dep = (not_att_dep/1233)*100
print("\nNot Attrited")
print(percentage_not_att_dep)
fig = plt.figure(figsize=(20,10))
ax1 = fig.add_subplot(221)
index = np.arange(att_dep.count())
bar_width = 0.15
rect1 = ax1.bar(index, percentage_att_dep, bar_width, color = 'black', label = 'Attrited')
rect2 = ax1.bar(index + bar_width, percentage_not_att_dep, bar_width, color = 'green', label = 'Not Attrited')
ax1.set_ylabel('Percentage')
ax1.set_title('Comparison')
xTickMarks = att_dep.index.values.tolist()
ax1.set_xticks(index + bar_width)
xTickNames = ax1.set_xticklabels(xTickMarks)
plt.legend()
plt.tight_layout()
plt.show()
The first block splits the dataset in two based on Attrition.
The second block calculates the percentage of attrited and non-attrited employees in each Department.
The third block plots those percentages as a grouped bar chart.
You can do:
(df.groupby(['Department'])
   ['Attrition'].value_counts(normalize=True)
   .unstack('Attrition')
   .plot.bar()
)
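Note that normalizing inside each Department gives the share of Yes/No within a department, which is not the same number the question computes (each department's share of all attrited, or of all non-attrited, employees). A minimal sketch of the latter using pd.crosstab with normalize="columns", which also avoids hard-coding 237 and 1233. The sample rows are made up for illustration, the helper name draw_grouped_bars is hypothetical, and only the column names Department and Attrition come from the question.

```python
import pandas as pd

# Made-up sample rows; the real data would be the question's df.
df = pd.DataFrame({
    "Department": ["Sales", "Sales", "R&D", "R&D", "R&D", "HR"],
    "Attrition": ["Yes", "No", "No", "Yes", "No", "No"],
})

# Each column (Yes / No) is scaled so that it sums to 100 percent.
pct = pd.crosstab(df["Department"], df["Attrition"], normalize="columns") * 100


def draw_grouped_bars(table):
    """Grouped bar chart: one group per Department, one bar per Attrition value."""
    # pandas delegates the drawing to matplotlib behind the scenes.
    return table.plot.bar(rot=0, ylabel="Percentage", title="Comparison")
```

Calling draw_grouped_bars(pct) should give one group of bars per department. pandas still uses matplotlib internally, but none of the bar positions or tick labels have to be set by hand.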

Stacked Bar Chart Labels-- Using geom_text to label % on a value based y-axis

I am looking to create a stacked bar chart where the y-axis measures the value but the label on each segment shows its % of the bar total.
I think I need to add a pct column to my table then use that but am not sure how to get the pct column either.
Df for example is:
date, type, value, pct
Jan 1, A, 5, 45% (5/11)
Jan 1, B, 6, 55% (6/11)
table and chart image
Maybe something like this?
library(dplyr)
library(ggplot2)
test.df <- data.frame(date = c("2020-01-01", "2020-01-01", "2020-01-02", "2020-01-02"),
                      type = c("A", "B", "A", "B"),
                      val = c(5:6, 1, 7))
test.df <- test.df %>%
  group_by(date) %>%
  mutate(
    type.num = as.numeric(type),
    prop = val/sum(val),
    y_text_pos = ifelse(type=="B", val, sum(val))) %>%
  ungroup()
ggplot(data = test.df, aes(x = as.Date(date), y = val, fill = type)) +
  geom_col() +
  geom_text(aes(y = y_text_pos, label = paste0(round(prop*100,1), "%")), color = "black", vjust = 1.1)

Stacked bar plot - percentage

I want to represent this information in a stacked bar plot of percentages.
On the x-axis I want the age groups, and on the y-axis the percentage of each Gender within each age group.
Age is represented by bins in the dataset.
I have this so far
This is my code:
c = ds.groupby(['Age','Gender'])['Gender'].count()
d=(((c /c.groupby(level=0).sum())*100).round()).astype('int64')
d
I created a test data frame:
df = pd.DataFrame({'Gender': ['F','M','F','F','F','M','M','M','F','F','M','F','F','M','M','M','M','F','F','M','M','M'], 'Age': [17,10,20,51,53,15,50,60,43,28,35,67,33,17,20,40,43,47,48,51,53,54]})
You can use pandas.cut function to segment the age into proper intervals:
bins = pd.IntervalIndex.from_tuples([(0,17),(17,25),(25,35),(35,46),(46,50),(50,55), (55,np.inf)])
df['Age_interval'] = pd.cut(df['Age'], bins=bins)
df = df.groupby(['Age_interval', 'Gender']).size().unstack().fillna(0)
total = (df['F'] + df['M']).sum()  # compute once, before either column is overwritten
df['F'] = df['F'] / total * 100
df['M'] = df['M'] / total * 100
df['Age'] = ['0-17', '18-25', '26-35', '36-46', '47-50', '51-55', '56+']
df.plot(kind='bar', x='Age', title = 'Gender distribution in Age groups', rot=0,figsize=(10,5), color=['turquoise','brown'], stacked=True)
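The code above expresses each count as a share of all rows. If each age group should instead sum to 100%, as the question's own c / c.groupby(level=0).sum() does, one option is pd.crosstab with normalize="index". This is only a sketch: the sample ages, the bin edges (including the upper bound of 120), the labels, and the helper name draw_stacked_bars are assumptions made for illustration.

```python
import pandas as pd

# Small made-up sample; the question's data would replace this.
df = pd.DataFrame({
    "Gender": ["F", "M", "F", "F", "M", "M"],
    "Age": [10, 15, 20, 22, 24, 40],
})

# Right-closed bins: (0, 17], (17, 25], (25, 120].
edges = [0, 17, 25, 120]
labels = ["0-17", "18-25", "26+"]
df["Age_group"] = pd.cut(df["Age"], bins=edges, labels=labels)

# Each row (age group) is scaled so that F and M together make 100 percent.
pct = pd.crosstab(df["Age_group"], df["Gender"], normalize="index") * 100


def draw_stacked_bars(table):
    """Stacked bar chart with one bar per age group, split by Gender."""
    return table.plot(kind="bar", stacked=True, rot=0,
                      title="Gender distribution in Age groups")
```

With this normalization every stacked bar reaches the same height, which makes the gender split easy to compare across age groups.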

Pandas: Using datetime as a condition

I'm using a for loop to plot all of the features in my dataset. I want it to skip any attributes that have a datetime type, but it doesn't seem to be skipping them correctly. What do I need to fix?
(JFYI, I have confirmed with df.dtypes that the columns appear as datetime64[ns])
def plot_distribution(dataset, cols=5, width=20, height=50, hspace=0.2, wspace=0.5):
    plt.style.use('seaborn-whitegrid')
    fig = plt.figure(figsize=(width,height))
    fig.subplots_adjust(left=None, bottom=None, right=None, top=None, wspace=wspace, hspace=hspace)
    rows = math.ceil(float(dataset.shape[1]) / cols)
    for i, column in enumerate(dataset.columns):
        ax = fig.add_subplot(rows, cols, i + 1)
        ax.set_title(column)
        g = sns.countplot(y=column, hue = target_column, data = df)
        if df.dtypes[column] == np.datetime64:
            continue
plot_distribution(df, cols=1, width=20, height=500, hspace=0.8, wspace=0.5)
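One thing that stands out in the loop: the dtype check runs after sns.countplot has already drawn the column, so the continue cannot skip anything. Comparing a dtype with np.datetime64 directly is also fragile. A minimal sketch of filtering the columns before the loop with pandas' is_datetime64_any_dtype; the helper name plottable_columns and the sample frame are hypothetical.

```python
import pandas as pd
from pandas.api.types import is_datetime64_any_dtype


def plottable_columns(dataset):
    """Return the column names whose dtype is not a datetime type."""
    return [col for col in dataset.columns
            if not is_datetime64_any_dtype(dataset[col])]


# Made-up frame with one datetime64[ns] column to show the filtering.
sample = pd.DataFrame({
    "age": [31, 45, 28],
    "joined": pd.to_datetime(["2020-01-05", "2021-06-30", "2019-11-12"]),
    "dept": ["HR", "Sales", "HR"],
})
columns = plottable_columns(sample)
```

Inside plot_distribution, the loop could then iterate over plottable_columns(dataset), with rows computed from the length of that list, so that no empty subplot slots are created for the skipped columns.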