I have a pandas DataFrame, a snippet of which is shown below.
I wish to recreate the graphs shown below in Seaborn. These graphs were created in R using ggplot, but I am working with pandas/matplotlib/seaborn.
Essentially, the graphs summarize the variables (mi, steps, st...) grouped by sensor id, with hours to the event on the x-axis. Additionally, and most importantly, smoothing is performed by stat_smooth() within ggplot. I have included a snippet of my ggplot code.
step.plot <- ggplot(data=cdays, aes(x=dfc, y=steps, col=legid)) +
  ggtitle('time to event') +
  labs(x="Days from event", y='Number of steps') +
  stat_smooth(method='loess', span=0.2, formula=y~x) +
  geom_vline(mapping=aes(xintercept=0), color='blue')

Here is how I would do it. Bear in mind that I had to make assumptions about the structure of your data, so please review what I did before applying it.
Creating some simulated data
import numpy as np
import pandas as pd
import seaborn as sns
subject = np.repeat(np.repeat([1, 2, 3, 4, 5], 4), 31)
time = np.tile(np.repeat(np.arange(-15, 16, 1), 4), 5)
sensor = np.tile([1, 2, 3, 4], 31*5)
measure1 = subject*20 + time*(5-sensor) - time**2*(sensor-2)*0.1 + (time >= 0)*np.random.normal(100*(sensor-2), 10, 620) + np.random.normal(0, 10, 620)
measure2 = subject*10 + time*(2-sensor) - time**2*(sensor-4)*0.1 + (time >= 0)*np.random.normal(50*(sensor-1), 10, 620) + np.random.normal(0, 8, 620)
measure3 = time**2*(sensor-1)*0.1 + (time >= 0)*np.random.normal(50*(sensor-3), 10, 620) + np.random.normal(0, 8, 620)
measure4 = time**2*(sensor-1)*0.1 + np.random.normal(0, 8, 620)
Putting it in a long form dataset for plotting
df = pd.DataFrame(dict(subject=subject, time=time, sensor=sensor, measure1=measure1,
                       measure2=measure2, measure3=measure3, measure4=measure4))
df = pd.melt(df, id_vars=["sensor", "subject", "time"],
             value_vars=["measure1", "measure2", "measure3", "measure4"],
             var_name="measure")
Creating the plot, without smoothing
g = sns.FacetGrid(data=df, col="measure", col_wrap=2)
g.map_dataframe(sns.tsplot, time="time", value="value", condition="sensor", unit="subject", color="deep")
g.add_legend(title="Sensor Number")
g.set_xlabels("Days from Event")
Plotted data, before smoothing
Now let's use statsmodels to smooth the data.
Please review this part, this is where I made assumptions about the sampling unit (I assume that the sampling unit is the subject, and therefore treat sensors and measure types as conditions).
from statsmodels.nonparametric.smoothers_lowess import lowess
dfs = []
for sens in df.sensor.unique():
    for meas in df.measure.unique():
        # One independent smoothing per Sensor/Measure condition.
        df_filt = df.loc[(df.sensor == sens) & (df.measure == meas)].copy()
        # frac plays the role of span in R
        # return_sorted=False keeps the fitted values in the row order of df_filt
        filtered = lowess(df_filt.value, df_filt.time, frac=0.2, return_sorted=False)
        df_filt["filteredvalue"] = filtered
        dfs.append(df_filt)
df = pd.concat(dfs)
Plotted data, after smoothing
From there you can tweak your plot however you like. Let me know if you have any questions.
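To make the role of frac (span) more concrete, here is a minimal sketch of the idea behind this kind of smoother: for every point, fit a straight line to its nearest neighbours, weighted by a tricube kernel, and keep the fitted value at that point. The function names and the rounding used for the neighbour count are my own assumptions for illustration. This is not what statsmodels or R run internally: statsmodels' lowess adds robustness iterations by default, and R's loess fits local quadratics by default, so neither will match this output exactly.

```python
import math


def tricube(u):
    """Tricube kernel: weight 1 at u = 0, falling to 0 at |u| >= 1."""
    u = abs(u)
    return (1.0 - u**3) ** 3 if u < 1.0 else 0.0


def local_linear_smooth(x, y, frac=0.2):
    """Smooth y against x with tricube-weighted local linear fits.

    frac is the share of points used for each local fit, which is the
    role `span` plays in R. The rounding below is an assumption.
    """
    n = len(x)
    k = max(2, math.ceil(frac * n))  # neighbours per local fit
    fitted = []
    for x0 in x:
        # Bandwidth: distance from x0 to its k-th nearest point (x0 included).
        h = sorted(abs(xi - x0) for xi in x)[k - 1]
        if h == 0:
            h = 1.0  # all k neighbours coincide with x0
        sw = sd = sdd = sy = sdy = 0.0
        for xi, yi in zip(x, y):
            d = xi - x0  # centre on x0 so the intercept is the fitted value
            w = tricube(d / h)
            sw += w
            sd += w * d
            sdd += w * d * d
            sy += w * yi
            sdy += w * d * yi
        denom = sw * sdd - sd * sd
        if abs(denom) < 1e-12:
            fitted.append(sy / sw)  # degenerate case: weighted mean
        else:
            fitted.append((sdd * sy - sd * sdy) / denom)
    return fitted


# Same time grid as the simulated data above (-15 .. 15).
time_grid = list(range(-15, 16))
values = [0.1 * t**2 + (10 if t >= 0 else 0) for t in time_grid]
smooth_narrow = local_linear_smooth(time_grid, values, frac=0.2)
smooth_wide = local_linear_smooth(time_grid, values, frac=0.8)
```

A small frac follows the jump at time 0 closely, while a large one averages over it, which is the trade-off you tune with span in ggplot.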


How to expand bars over the month on the x-axis while being the same width?

# Imports assumed from the names used below
import calendar
from datetime import datetime as dt
import matplotlib.pyplot as plt
import pandas as pd

for i in range(len(basin)):
    prefix = "URL here"
    state = "OR"
    basin_name = basin[i]
    df_orig = pd.read_csv(f"{prefix}/{basin_name}.csv", index_col=0)
    #---create date x-index
    curr_wy_date_rng = pd.date_range(
        start=dt(curr_wy-1, 10, 1),
        end=dt(curr_wy, 9, 30),
        freq="D",
    )
    if not calendar.isleap(curr_wy):
        print("dropping leap day")
        df_orig.drop(["02-29"], inplace=True)
    use_cols = ["Median ('91-'20)", f"{curr_wy}"]
    df = pd.DataFrame(data=df_orig[use_cols].copy())
    df.index = curr_wy_date_rng
    #--create EOM percent of median values-------------------------------------
    curr_wy_month_rng = pd.date_range(
        start=dt(curr_wy-1, 10, 1),
        end=dt(curr_wy, 6, 30),
        freq="M",
    )
    df_monthly_prec = pd.DataFrame(data=df_monthly_basin[basin[i]].copy())
    df_monthly_prec.index = curr_wy_month_rng
    df_monthly = df.groupby(pd.Grouper(freq="M")).max()
    df_monthly["date"] = df_monthly.index
    df_monthly["wy_date"] = df_monthly["date"].apply(lambda x: cal_to_wy(x))
    df_monthly.index = pd.to_datetime(df_monthly["wy_date"])
    df_monthly.index = df_monthly["date"]
    df_monthly["month"] = df_monthly["date"].apply(
        lambda x: calendar.month_abbr[x.month]
    )
    df_monthly["wy"] = df_monthly["wy_date"].apply(lambda x: x.year)
    df_monthly.sort_values(by="wy_date", axis=0, inplace=True)
    df_monthly.drop(
        columns=[col for col in df_monthly.columns if "date" in col], inplace=True
    )
    # df_monthly.index = df_monthly['month']
    df_merge = pd.merge(df_monthly, df_monthly_prec, how='inner', left_index=True, right_index=True)
    fig, ax = plt.subplots(figsize=(8, 4))
    ax.plot(df_merge.index, df_merge["Median ('91-'20)"], color="green", linewidth=1, linestyle="dashed", label='Median Snowpack')
    ax.plot(df_merge.index, df_merge[f'{curr_wy}'], color='red', linewidth=2, label='WY Current')
    #------Setting x-axis range to expand bar width for ax2
    # ... df_merge[basin[i]], color = 'blue', label = 'Monthly %')
    #n = n + 1
    #--format chart
    ax.set_title(chart_name[w], fontweight='bold')
    w = w + 1
    ax.set_ylabel("Basin Precipitation Index")
    #---Setting date format
Desired end result: plot the monthly DataFrame (df_monthly_prec) together with the daily DataFrame, which currently charts only its end-of-month values (df_monthly). The bars for the monthly DataFrame should ideally span the whole month on the chart.
I have tried creating a secondary axis, but had trouble aligning the times for the primary and secondary axes. Ideally, I would like to replace plotting df_monthly with df (showing all daily data instead of just the end-of-month values within the daily dataset).
Any assistance or pointers would be much appreciated! Apologies if additional clarification is needed.
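One possible approach, sketched under assumptions: on a matplotlib date axis one x unit is one day, so a bar whose left edge is the first of the month and whose width is the number of days in that month covers the month exactly. The helper names below (month_bar_geometry, plot_month_bars) are hypothetical and not part of the code above; the heights would come from something like df_monthly_prec.

```python
import calendar
from datetime import date


def month_bar_geometry(first_year, first_month, n_months):
    """Return (left_edges, widths_in_days) for n_months consecutive months."""
    lefts, widths = [], []
    year, month = first_year, first_month
    for _ in range(n_months):
        lefts.append(date(year, month, 1))
        widths.append(calendar.monthrange(year, month)[1])  # days in this month
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return lefts, widths


def plot_month_bars(ax, lefts, widths, heights, **bar_kwargs):
    """Draw one bar per month on a matplotlib Axes `ax`.

    align="edge" starts each bar at its left edge (the 1st of the month)
    instead of centring it there; widths are interpreted in days.
    """
    return ax.bar(lefts, heights, width=widths, align="edge", **bar_kwargs)


# Example: the first nine months of a water year starting in October 2022.
lefts, widths = month_bar_geometry(2022, 10, 9)
```

If the bars go on a twin axis created with ax.twinx(), they share the x-axis with the daily lines, so the full daily df can be plotted on ax without any manual alignment. Note that the bars will differ slightly in width (28 to 31 days), since each one follows its own month.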

geom_bar for total counts of binned continuous variable

I'm really struggling to achieve what feels like an incredibly basic geom_bar plot. I would like the sum of y to be represented by one solid bar (with a black outline, colour = "black") in bins of 10 for x. I know that stat = "identity" is what creates the unnecessary individual blocks in each bar, but I can't find an alternative that achieves what is so close to my end goal. I cheated and made the desired plot below in Illustrator.
I don't really want to code x as a factor for the bins, as I want to keep the format of the axis ticks and text rather than having labels such as "0-10", "10-20", etc. Is there a way to do this in ggplot without using the summarise or cut functions on the raw data? I am also aware of the geom_col and stat_count options but, again, can't achieve my desired outcome.
DF as below, where y = counts at various values of a continuous variable x. Also a factor variable of type.
y = c(1, 1, 3, 2, 1, 1, 2, 1, 1, 1, 1, 1, 4, 1, 1, 1, 2, 1, 2, 3, 2, 2, 1)
x = c(26.7, 28.5, 30.0, 34.8, 35.0, 36.4, 38.6, 40.0, 42.1, 43.7, 44.1, 45.0, 45.5, 47.4, 48.0, 57.2, 57.8, 64.2, 65.0, 66.7, 68.0, 74.4, 94.1)
type = c(rep("Type 1", 20), "Type 2", rep("Type 1", 2))
df <- data.frame(x, y, type)
Bar plot of total y count for each bin of x - trying to fill by total of type, but getting individual proportions as shown by line colour = black. Would like total for each type in each bar.
ggplot(df,aes(y=y, x=x))+
geom_bar(stat = "identity",color = "black", aes(fill = type))+
scale_x_binned(limits = c(20,100))+
scale_y_continuous(expand = c(0, 0), breaks = seq(0,10,2)) +
ylab("Total Count")
Or trying to just have the total count within each bin but don't want the internal lines in the bars, just the outer colour = black for each bar
ggplot(df,aes(y=y, x=x))+
geom_col(fill = "#00C3C6", color = "black")+
scale_x_binned(limits = c(20,100))+
scale_y_continuous(expand = c(0, 0), breaks = seq(0,10,2)) +
ylab("Total Count")
Here is one way to do it, transforming the data first and then using geom_col:
library(dplyr)
library(ggplot2)
df <- df |>
mutate(bins = floor(x/10) * 10) |>
group_by(bins, type) |>
summarise(y = sum(y))
ggplot(data = df,
aes(y = y,
x = bins))+
geom_col(aes(fill = type),
color = "black")+
scale_x_continuous(breaks = seq(0,100,10)) +
scale_y_continuous(expand = c(0, 0),
breaks = seq(0,10,2)) +
ylab("Total Count")
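For readers who want to check the numbers outside R, the aggregation step above (floor x to the lower bin edge, then sum y per bin and type) can be sketched in plain Python. The function name bin_totals and the variable kind are my own, hypothetical choices; the data are copied from the question.

```python
import math
from collections import defaultdict

y = [1, 1, 3, 2, 1, 1, 2, 1, 1, 1, 1, 1, 4, 1, 1, 1, 2, 1, 2, 3, 2, 2, 1]
x = [26.7, 28.5, 30.0, 34.8, 35.0, 36.4, 38.6, 40.0, 42.1, 43.7, 44.1, 45.0,
     45.5, 47.4, 48.0, 57.2, 57.8, 64.2, 65.0, 66.7, 68.0, 74.4, 94.1]
kind = ["Type 1"] * 20 + ["Type 2"] + ["Type 1"] * 2


def bin_totals(x, y, kind, width=10):
    """Sum y per (lower bin edge, type), mirroring floor(x/10)*10 + group_by."""
    totals = defaultdict(int)
    for xi, yi, ki in zip(x, y, kind):
        left_edge = math.floor(xi / width) * width
        totals[(left_edge, ki)] += yi
    return dict(totals)


totals = bin_totals(x, y, kind)
for (left_edge, ki), total in sorted(totals.items()):
    print(f"{left_edge}-{left_edge + 10}  {ki}: {total}")
```

Each key is the lower edge of a bin, which is why the x-axis stays numeric and keeps its ordinary tick labels.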

ggplot2: add title changes point colors <-> scale_color_manual removes ggtitle

I am facing a silly point-color problem in a dot plot with ggplot2. I have a whole table of data, from which I take the relevant rows to make a dot plot. With scale_color_manual, my points are colored according to the named palette and the genotype factor specified in aes(), but when I simply add a title specifying the cell line used, the points revert to the automatic yellow and purple. Adding the title first and setting scale_color_manual as the last layer changes the point colors but removes the title.
What is going wrong here? I don't get it, and it is a bit frustrating.
Thanks for your help!
Here's reproducible code to get my whole df and the subset for the plots:
# df of data to plot
exp <- c(rep(284, times = 6), rep(285, times = 12))
geno <- c(rep(rep(c("WT", "KO"), each =3), times = 6))
line <- c(rep(5, times = 6),rep(8, times= 12), rep(5, times =12), rep(8, times = 6))
ttt <- c(rep(c(0, 10, 60), times = 10), rep(c("ZAc60", "Cu60", "Cu200"), times = 2))
rep <- c(rep(1, times = 12), rep(2, times = 6), rep(c(1,2), times = 6), rep(1, times = 6))
rel_expr <- c(0.20688185, 0.21576131, 0.94046028, 0.30327675, 0.22865200,
0.92941881, 0.13787508, 0.13325281, 0.22114990, 0.95591724,
1.03239718, 0.83339248, 0.15332420, 0.17558160, 0.22475604,
1.02356351, 0.77882000, 0.69214403, 0.16874097, 0.15548158,
0.45207943, 0.28123760, 0.23500083, 0.51588856, 0.1399634,
0.14610184, 1.06716713, 0.16517801, 0.34736164, 0.64773650,
0.18334429, 0.05924757, 0.01803593, 0.86685230, 0.39554685,
df_all <- data.frame(exp, geno, line, ttt, rep, rel_expr)
names(df_all) <- c("EXP", "Geno", "Line", "TTT", "Rep", "Rel_Expr")
# make Geno an ordered factor
df_all$Geno <- ordered(df_all$Geno, levels = c("WT", "KO"))
# select set of whole dataset for current plot
df_ions <- df_all[df_all$Line == 8 & !df_all$TTT %in% c(10, 60),]
# add a treatment as factor columns fTTT
df_ions$fTTT <- ordered(df_ions$TTT, levels = c("0", "ZAc60", "Cu60", "Cu200"))
# plot rel_exp vs factor treatment, color points by geno
# with named color palette
col_palette <- c("#000000", "#1356BC")
names(col_palette) <- c("WT", "KO")
plt <- ggplot(df_ions, aes(x = fTTT, y = Rel_Expr, color = Geno)) +
geom_jitter(width = 0.1)
plt # intermediate_plt_1.png
plt + scale_color_manual(values = col_palette) # intermediate_plt_2.png
plt + ggtitle("mRPTEC8") # final_plot.png

Smoothing geom_ribbon

I've created a plot with geom_line and geom_ribbon (image 1) and the result is okay, but for the sake of aesthetics, I'd like the line and ribbon to be smoother. I know I can use geom_smooth for the line (image 2), but I'm not sure if it's possible to smooth the ribbon. I could create a geom_smooth line for the top and bottom lines of the ribbon (image 3), but is there any way to fill in the space between those two lines?
A principled way to achieve what you want is to fit a GAM model to your data using the gam() function in mgcv and then apply the predict() function to that model over a finer grid of values for your predictor variable. The grid can cover the span defined by the range of observed values for your predictor variable. The R code below illustrates this process for a concrete example.
# load R packages
library(ggplot2)
library(mgcv)
# simulate some x and y data
# x = predictor; y = response
x <- seq(-10, 10, by = 1)
y <- 1 - 0.5*x - 2*x^2 + rnorm(length(x), mean = 0, sd = 20)
d <- data.frame(x,y)
# plot the simulated data
ggplot(data = d, aes(x,y)) +
  geom_point()
# fit GAM model
m <- gam(y ~ s(x), data = d)
# define finer grid of predictor values
xnew <- seq(-10, 10, by = 0.1)
# apply predict() function to the fitted GAM model
# using the finer grid of x values
p <- predict(m, newdata = data.frame(x = xnew), se.fit = TRUE)
# plot the estimated mean values of y (fit) at given x values
# over the finer grid of x values;
# superimpose approximate 95% confidence band for the true
# mean values of y at given x values in the finer grid
g <- data.frame(x = xnew,
                fit = p$fit,
                lwr = p$fit - 1.96*p$se.fit,
                upr = p$fit + 1.96*p$se.fit)
ggplot(data = g, aes(x, fit)) +
geom_ribbon(aes(ymin = lwr, ymax = upr), fill = "lightblue") +
geom_line() +
geom_point(data = d, aes(x, y), shape = 1)
This same principle would apply if you were to fit a polynomial regression model to your data using the lm() function.
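For anyone doing the same thing in Python, here is a rough sketch of the polynomial-regression variant of this idea, assuming NumPy is available. It fits a quadratic, predicts on a finer grid, and builds an approximate 95% band from the coefficient covariance. It is only an illustration of the principle, not a port of the mgcv code; the seed and the variable names are my own.

```python
import numpy as np

# Simulated data of the same shape as the R example (seed chosen arbitrarily).
rng = np.random.default_rng(0)
x = np.arange(-10, 11, 1.0)
y = 1 - 0.5 * x - 2 * x**2 + rng.normal(0, 20, size=x.size)

# Fit a quadratic; cov is the estimated covariance matrix of the coefficients.
coefs, cov = np.polyfit(x, y, deg=2, cov=True)

# Finer grid of predictor values spanning the observed range.
xnew = np.linspace(-10, 10, 201)

# Design matrix with columns x^2, x, 1 (same order as coefs).
design = np.vander(xnew, 3)
fit = design @ coefs

# Standard error of the fitted mean at each grid point: sqrt(diag(D C D^T)).
se = np.sqrt(np.einsum("ij,jk,ik->i", design, cov, design))
lwr = fit - 1.96 * se
upr = fit + 1.96 * se
```

The arrays xnew, lwr and upr can then be passed to matplotlib's fill_between to draw the smooth ribbon, with fit drawn as the line on top.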

How to create a R shiny app for getting PCA plot

I am just starting to learn R Shiny and am trying to create a Shiny app that produces a scatter plot for principal component analysis and allows the user to choose which principal components go on the X and Y axes of the scatter plot. I know how to write R code to do PCA, but I just can't seem to get the Shiny app to give me what I need. I have tried following the examples available for Iris k-means clustering, but I am having trouble getting the scatter plot. Here is my code so far (P.S. my original dataset has genes as rows and samples as columns; columns 1 through 10 are cancer samples, 11 through 20 are normal):
library(shiny)
library(ggplot2)
data<-read.table("genes_data.txt", header=TRUE, row.names=1)
pca_data<-prcomp(t(data), scale=T)
pca_sig.var<-pca_data$sdev^2
pca_sig.var.per<-round(pca_sig.var/sum(pca_sig.var)*100, 1)
pca_sig.data<-data.frame(Sample=rownames(pca_data$x), PC1=pca_data$x[,1], PC2=pca_data$x[,2], PC3=pca_data$x[,3], PC4=pca_data$x[,4], PC5=pca_data$x[,5])
pca_sig.data2<-pca_sig.data[-1]
pca_sig.data2$category=rep("CANCER", 20)
pca_sig.data2$category[11:20]=rep("NORMAL", 10)
ggplot(data=pca_sig.data2, aes(x=PC1, y=PC2, label=category, colour=category))+
  geom_point(size=2, stroke=1, alpha=0.8, aes(color=category))+
  xlab(paste("PCA1 - ", pca_sig.var.per[1], "%", sep=""))+
  ylab(paste("PCA2 - ", pca_sig.var.per[2], "%", sep=""))+
  ggtitle("My PCA Graph")
ui<-pageWithSidebar(
  headerPanel('Gene Data PCA'),
  sidebarPanel(
    selectInput('xcol', 'X Variable', names(pca_sig.data2[,1:5])),
    selectInput('ycol', 'Y Variable', names(pca_sig.data2[,1:5]))
  ),
  mainPanel(plotOutput('plot1'))
)
server<- function(input, output, session) {
  # Combine the selected variables into a new data frame
  selectedData <- reactive({
    pca_sig.data2[, c(input$xcol, input$ycol)]
  })
  output$plot1 <- renderPlot({
    palette(c("#E41A1C", "#377EB8"))
    par(mar = c(5.1, 4.1, 0, 1))
    plot(selectedData(),
         pch = 20, cex = 3)
    points(selectedData()[,1:5], pch = 4, cex = 4, lwd = 4)
  })
}
shinyApp(ui = ui, server = server)
At the end, when I run the app, I get "Error: undefined columns selected".
Also, for simplicity's sake, let's assume that the original dataset I want to run PCA on looks something like this (in reality I have about 600 genes and 20 samples):
probeID<-c("gene1", "gene2", "gene3", "gene4","gene5")
BCR1<-c(28.005966, 30.806433, 17.341375, 17.40666, 30.039436)
BCR2<-c(30.973469, 29.236025, 30.41161, 20.914383, 20.904331)
BCR3<-c(26.322796, 25.542833, 22.460772, 19.972183, 30.409641)
BCR4<-c(26.441898, 25.837685, 23.158352, 20.379173, 33.81327)
BCR5<-c(39.750206, 19.901133, 28.180124, 22.668673, 25.748884)
CTL6<-c(23.004385, 28.472675, 23.81621, 26.433413, 28.851719)
CTL7<-c(22.239546, 28.741674, 23.754929, 26.015385, 28.16368)
CTL8<-c(29.590443, 30.041988, 21.323061, 24.272501, 18.099016)
CTL9<-c(15.856442, 22.64224, 29.629637, 25.374926, 22.356894)
CTL10<-c(38.137985, 24.753338, 26.986668, 24.578161, 19.223558)
data<-data.frame(probeID, BCR1, BCR2, BCR3, BCR4, BCR5, CTL6, CTL7, CTL8, CTL9, CTL10)
where BCR1 through BCR5 are the cancer samples and CTL6 through CTL10 are the normal samples.
Is this what you want?
server<- function(input, output, session) {
  # Combine the selected variables into a new data frame
  selectedData <- reactive({
    pca_sig.data2[c(input$xcol, input$ycol, 'category')]
  })
  output$plot1 <- renderPlot({
    palette(c("#E41A1C", "#377EB8"))
    plot(selectedData()[,c(1:2)], col=factor(selectedData()$category), pch = 20, cex = 3)
    points(selectedData()[,c(1:2)], pch = 4, cex = 4, lwd = 4)
  })
}
The result is like this: