Failure using geom_polygon with 2D facet_grid: 1D works - ggplot2

I'm trying to add a state map boundary to a 2D facet_grid in ggplot which plots correlations against principal components (PC).
The 1D case works but I cannot discover what is causing the failure in the 2D case. It may be a limit to facet_grid in some way or, more likely, something I don't understand.
ggplot(A0G, aes(Longitude, Latitude, color=Correlation)) +
geom_point(size=2*abs(A0G$Correlation)) +
geom_polygon(data = ca_df, aes(x=Longitude, y=Latitude, fill = NA, color = "black")) +
facet_grid(rows=vars(A0G$Lag), cols=vars(A0G$PC))
# Error in `$<-.data.frame`(`*tmp*`, "PANEL", value = c(4L, 3L, 5L, 10L, :
# replacement has 11112 rows, data has 516
The issue seems related to the dimensions of the map df and the data df but I can't grok the problem since it works in the 1D case.
dim(ca_df)
[1] 516 6
dim(A0G)
[1] 11112 18
Any suggestions welcome. I'll just keep breaking the problem down if there isn't anything obvious.


How to convert a mimira object (Cox regression model, from multiple imputations and a propensity score matching (MatchThem pkg)) into a Forest plot

Dear StackOverflow community,
as a surgeon, and full of enthusiasm for 6 months for R learning in self-taught mode (StackOverflow, and so many websites), I beg your indulgence in the triviality of my concern.
The background:
Briefly, my objective is to run a Cox survival regression on a dataset of cancer patients. Because the study is retrospective, I planned a 1:3 matching with propensity score matching (PSM). The missing data were handled with multiple imputation ("mice" pkg). The PSM was managed with the "MatchThem" pkg.
I used "survey" pkg for pooling the survival (svycoxph() pooled through with() function). This leads us to a mimira object, which I can easily print out into a beautiful Table, with tbl_regression ("gtsummary" pkg).
The issue:
As I usually present my Cox regressions as a hazard ratio table and a graphical version (forest plot with ggforest(), from the "survminer" pkg), this time I am really stuck. The function ggforest() doesn't recognize the mimira object as a "coxph" object and sends this error:
Error in ggforest(tbl_regression_object, data = mimira_object) :
inherits(model, "coxph") is not TRUE
I guess that adding PSM to my multiple imputations is the problem, as I had no trouble producing a forest plot from a Cox regression on multiple imputations alone (mira objects can be handled without problem via the pool_and_tidy_mice() function).
Here is the script:
#Data
library(fabricatr)
library(simsurv)
# Simulate patient data in a clinical trial
participant_data <- fabricate(
N = 2000,
age = runif(N, min = 18, max = 85),
is_female = draw_binary(prob = 0.5, N = N),
is_smoker = draw_binary(prob = 0.2 + 0.2 * (age > 50), N = N),
disease_stage = round(runif(N, min = 1 + 0.5 * (age > 65), max = 4)),
treatment = draw_binary(prob = 0.5, N = N),
kps = runif(N, min = 40, max = 100)
)
# Simulate data in the survival context
survival_data <- simsurv(
lambdas = 0.1, gammas = 1.8,
x = participant_data,
betas = c(is_female = -0.2, is_smoker = 1.2,
treatment = -0.4, kps = -0.005,
disease_stage = 0.2),
maxt = 5)
# Merging df
library(dplyr)
mydata_complete <- bind_cols(survival_data, participant_data)
# generating missing value
library(missMethods)
mydata_uncomp <- delete_MCAR(mydata_complete, 0.3)
mydata <- mydata_uncomp
#1 imputation with "mice"
library(mice)
mydata$nelsonaalen <- nelsonaalen(mydata, eventtime, status)
mydata_mice_imp_m3 <- mice(mydata, maxit = 2, m = 3, seed = 20200801) # m=3 is for testing
#2 matching (PSM 1:3) with "MatchThem"
library(MatchThem)
mydata_imp_m3_psm <- matchthem(treatment ~ age + is_female + disease_stage, data = mydata_mice_imp_m3, approach = "within" ,ratio= 1, method = "optimal")
#3 Pooling Coxph models in multiple imputed datasets and PSM with "survey"
library(survey)
mimira_object <- with(data = mydata_imp_m3_psm, expr = svycoxph(Surv(eventtime, status) ~ age+ is_smoker + disease_stage))
pool_and_tidy_mice(mimira_object, exponentiate = TRUE, conf.int=TRUE) -> pooled_imp_m3_cph
# pool_and_tidy_mice() returns estimates for the mimira_object but cannot give me the degrees of freedom. Warning message:
In get.dfcom(object, dfcom) : Infinite sample size assumed.
> pooled_imp_m3_cph
term estimate std.error statistic p.value conf.low conf.high b df dfcom fmi lambda m riv ubar
1 age 0.9995807 0.001961343 -0.2138208 NaN NaN NaN 1.489769e-06 NaN Inf NaN 0.5163574 3 1.067643 1.860509e-06
2 is_smoker 2.8626952 0.093476026 11.2516931 NaN NaN NaN 4.182884e-03 NaN Inf NaN 0.6382842 3 1.764601 3.160589e-03
3 disease_stage 1.2386947 0.044092483 4.8547535 NaN NaN NaN 8.995628e-04 NaN Inf NaN 0.6169374 3 1.610540 7.447299e-04
#4 Table summary of the pooled results
library(gtsummary)
tbl_regression_object <- tbl_regression(mimira_object, exp=TRUE, conf.int = TRUE) # 95% CI and p-value are missing because of another issue in the pooling of the mimira_object: the MatchThem:::get.2dfcom function gives dfcom = 999999 (a separate problem I still need to solve)
#5 What it should look like as a graphical summary
library(survival)
mydata.cox <- coxph(Surv(eventtime, status) ~ age+ is_smoker + disease_stage, mydata_uncomp) # (df mydata_uncomp is without imputation and PSM)
#with gtsummary
forestGT <-
mydata.cox %>%
tbl_regression(exponentiate = TRUE,
add_estimate_to_reference_rows = TRUE) %>%
plot()
(forestGT) # See picture GT_plot1. Almost perfect. Would have been great to know how to add N, 95% CI, HR, p-value and parameters of the model (AIC, events, concordance, etc.)
#with survminer
HRforest <-
survminer::ggforest(mydata.cox, data = mydata_uncomp)
(HRforest) # See picture Ggforest. Everything I need to know about my cox regression is all in there. For me it is just a great regression cox forest plot.
#6 Actually what happens when I do the same thing with imputed and matched df
#with gtsummary
forestGT_imp_psm <-
mimira_object %>%
tbl_regression(exponentiate = TRUE,
add_estimate_to_reference_rows = TRUE) %>%
plot() # WARNING message : In get.dfcom(object, dfcom) : Infinite sample size assumed.
(forestGT_imp_psm) # See picture GT_plot2. The plot is rendered but without the 95% CI
#with survminer
HRforest_imp_psm <-
ggforest(mimira_object, data = mydata_imp_m3_psm) # ERROR:in ggforest(mimira_object, data = mydata_imp_m3_psm) : inherits(model, "coxph") is not TRUE
(HRforest_imp_psm)
#7 The lucky and providential step
# your solution/advise
Would greatly appreciate your help.
cheers.
AK
Picture GT_plot1 (I am not allowed to embed images in this post, so the pictures were provided as share links): GT_plot1
Picture Ggforest_plot: Ggforest_plot
Picture GT_plot2: GT_plot2
It seems that there are two distinct problems here:
Problem #1. Getting gtsummary to produce a table with p values and confidence intervals of the pooled, matched data
Problem #2. Getting ggforest() to produce a plot of the pooled estimates.
Problem #1:
Let us follow the instructions in the paper "MatchThem:: Matching and Weighting after Multiple Imputation" (https://arxiv.org/ftp/arxiv/papers/2009/2009.11772.pdf) [page 15]
and modify your block #3. Instead of calling pool_and_tidy_mice() we do the following:
matched.results <- pool(mimira_object)
summary(matched.results, conf.int = TRUE)
This produces the following:
term estimate std.error statistic df p.value 2.5 % 97.5 %
1 age -0.0005997864 0.001448251 -0.4141453 55.266353 6.803707e-01 -0.003501832 0.00230226
2 is_smoker 1.1157796620 0.077943244 14.3152839 9.961064 5.713387e-08 0.942019234 1.28954009
3 disease_stage 0.2360965310 0.051799813 4.5578645 3.879879 1.111782e-02 0.090504018 0.38168904
This means that performing the imputation with mice and then matching with MatchThem works, since you do get the p values and the confidence intervals.
Compare to the output from pool_and_tidy_mice():
term estimate std.error statistic p.value b df dfcom fmi lambda m
1 age -0.0005997864 0.001448251 -0.4141453 NaN 2.992395e-07 NaN Inf NaN 0.1902260 3
2 is_smoker 1.1157796620 0.077943244 14.3152839 NaN 2.041627e-03 NaN Inf NaN 0.4480827 3
3 disease_stage 0.2360965310 0.051799813 4.5578645 NaN 1.444843e-03 NaN Inf NaN 0.7179644 3
riv ubar
1 0.2349124 1.698446e-06
2 0.8118657 3.352980e-03
3 2.5456522 7.567636e-04
Everything is the same except for df and p.value, which were not calculated in the latter table.
I therefore think this is an issue with the pool_and_tidy_mice() and you should post this as an issue on GitHub at gtsummary.
For right now, you can bypass this problem by changing svycoxph() to survival::coxph() in block #3 when you call the with() function. If you do that, then eventually you will get a gtsummary table with p values and confidence intervals. Ultimately, the problem is probably some interaction between svycoxph() and pool_and_tidy_mice(), which is why I believe you should post this on GitHub.
Problem #2:
The short answer is that there cannot be a ggforest plot with all the data that you are looking for.
https://www.rdocumentation.org/packages/mice/versions/3.13.0/topics/pool reads:
A common error is to reverse steps 2 and 3, i.e., to pool the multiply-imputed data instead of the estimates. Doing so may severely bias the estimates of scientific interest and yield incorrect statistical intervals and p-values. The pool() function will detect this case.
This means that there is no "real" dataset for the pooled estimates (i.e. you cannot really combine the datasets for imputations 1-3), which means that ggforest() cannot compute the desired plot (since it needs to have a dataset and that cannot be used because it would lead to erroneous estimates).
What you could do, is present all the ggforest plots for each imputation (so if you did 3 imputations, you will get 3 slightly different ggforest plots) and finally add the pooled estimates plot by using plot() as suggested above.
To create each ggforest plot you need the following line of code:
ggforest(mimira_object$analyses[[1]], complete(mydata_imp_m3_psm, 1))
This will create the ggforest plot for your first imputation. Change the numbers to 2 and 3 to check the remaining imputations.
I hope this helped,
Alex
If you provide a reproducible example (i.e. an example on a data set that we can all run on our machines), we can better help you out.
The gtsummary package exports a plot() method you can use to construct a forest plot. Example below!
library(gtsummary)
library(survival)
ggforest <-
coxph(Surv(ttdeath, death) ~ trt + grade, trial) %>%
tbl_regression(exponentiate = TRUE,
add_estimate_to_reference_rows = TRUE) %>%
plot()
#> Registered S3 method overwritten by 'GGally':
#> method from
#> +.gg ggplot2
ggforest
Created on 2021-08-26 by the reprex package (v2.0.1)

Have trouble concat indexing

I have been trying to put a series of values into a dataframe so that later on I can make a plot out of it. Here is my code:
datafileR = datafile = pd.read_csv("pixel_data.csv")
datafileR = pd.DataFrame(datafileR)
region = datafileR.groupby(["Reg"])
mm_MidEast= region["PP"].median().loc["Middle East and North Africa"] ##>> 138
mm_Africa= region["PP"].median().loc["Africa (excl MENA)"] ##>> 151
mm_Asia= region["PP"].median().loc["Asia and Pacific"] ##>> 158
mm_Europe= region["PP"].median().loc["Europe and Eurasia"] ##>> 127
mm_Cross= region["PP"].median().loc["Cross-regional"] ##>> 86
ppdata= pd.concat([mm_MidEast,mm_Africa,mm_Asia,mm_Europe,mm_Cross],axis="columns", sort=False)
I am getting an error:
TypeError: cannot concatenate object of type '<class 'numpy.int64'>'; only Series and DataFrame objs are valid
I understand what it means, but I do not know how to fix it by turning these values into a DataFrame object. This is the final graph I would like to achieve:
sns.set_theme()
display(ppdata["tip"].mean())
ax = sns.distplot(ppdata,color="black")
(Image: expected graph)
Never mind. Found a solution with a simple line of code:
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_theme()
df2 = region["PP"].median()  # one median per region, still a Series
plt.bar(x=df2.index, height=df2.values, align="center",color=(0.2, 0.4, 0.6, 0.6))
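For anyone hitting the same TypeError: each .loc[...] lookup on the median Series returns a single number, and pd.concat only accepts Series or DataFrame objects. A minimal sketch of keeping the medians together as one Series and selecting regions from it, assuming a small made-up table in place of pixel_data.csv (the values and the `wanted` list are hypothetical):

```python
import pandas as pd

# Hypothetical stand-in for pixel_data.csv; the numbers are invented
datafile = pd.DataFrame({
    "Reg": ["Africa (excl MENA)", "Africa (excl MENA)",
            "Asia and Pacific", "Asia and Pacific", "Cross-regional"],
    "PP": [150, 152, 158, 160, 86],
})

# groupby + median already yields a Series indexed by region
medians = datafile.groupby("Reg")["PP"].median()

# Select several regions at once with a list; the result is still a Series
wanted = ["Asia and Pacific", "Cross-regional"]
ppdata = medians.loc[wanted].to_frame(name="PP")  # one-column DataFrame
```

Passing a list to .loc keeps the index labels, so the result can go straight into a bar plot without rebuilding anything by hand.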

Grouping the factors in ggplot

I am trying to create a graph based on matrix similar to one below... I am trying to group the Erosion values based on "Slope"...
library(ggplot2)
new_mat<-matrix(,nrow = 135, ncol = 7)
colnames(new_mat)<-c("Scenario","Runoff (mm)","Erosion (t/ac)","Slope","Soil","Tillage","Rotation")
for ( i in 1:nrow(new_mat)){
new_mat[i,2]<-sample(10:50, 1)
new_mat[i,3]<-sample(0.1:20, 1)
new_mat[i,4]<-sample(c("S2","S3","S4","S5","S1"),1)
new_mat[i,5]<-sample(c("Deep","Moderate","Shallow"),1)
new_mat[i,7]<-sample(c("WBP","WBF","WF"),1)
new_mat[i,6]<-sample(c("Intense","Reduced","Notill"),1)
new_mat[i,1]<-paste0(new_mat[i,4],"_",new_mat[i,5],"_",new_mat[i,6],"_",new_mat[i,7],"_")
}
#### Graph part ########
grphs_mat<-as.data.frame(new_mat)
grphs_mat$`Runoff (mm)`<-as.numeric(as.character(grphs_mat$`Runoff (mm)`))
grphs_mat$`Erosion (t/ac)`<-as.numeric(as.character(grphs_mat$`Erosion (t/ac)`))
ggplot(grphs_mat, aes(Scenario, `Erosion (t/ac)`,group=Slope, colour = Slope))+
scale_y_continuous(limits=c(0,max(as.numeric((grphs_mat$`Erosion (t/ac)`)))))+
geom_point()+geom_line()
But when I run this code, the values are spread along the x-axis across all 135 scenarios. What I want is for the grouping to be done by slope, while the other common factors (Soil + Rotation + Tillage) are recognized and placed on the x-axis. For example:
For these five scenarios:
S1_Deep_Intense_WBF_
S2_Deep_Intense_WBF_
S3_Deep_Intense_WBF_
S4_Deep_Intense_WBF_
S5_Deep_Intense_WBF_
The plot should separate S1, S2, S3, S4 and S5, but also recognize that the other factors are the same and put them at one x-axis position, so that the slope lines are stacked on top of each other over 135/5 = 27 x-axis points. The final figure should look like this (refer to the image). Apologies for not being able to explain it better.
I think I am making a mistake in grouping or assigning the x-axis values.
I will appreciate your suggestions.
In the example you give, I didn't get every possible factor combination represented so the plots looked a bit weird. What I did instead was start with the following:
set.seed(42)
new_mat <- matrix(,nrow = 1000, ncol = 7)
And then deduplicated this by summarising the values. A possibly relevant step here for your analysis is that I made a new variable with the interaction() function that is the combination of three other factors.
library(tidyverse)
df <- grphs_mat
df$x <- with(df, interaction(Rotation, Soil, Tillage))
# The simulation did not yield unique combinations
df <- df %>% group_by(x, Slope) %>%
summarise(n = sum(`Erosion (t/ac)`))
Next, I plotted this new x variable on the x-axis and used "stack" positions for the lines and points.
g <- ggplot(df, aes(x, y = n, colour = Slope, group = Slope)) +
geom_line(position = "stack") +
geom_point(position = "stack")
To make the x-axis slightly more readable, you can replace the . that the interaction() function placed by newlines.
g + scale_x_discrete(labels = function(x){gsub("\\.", "\n", x)})
Another option is to simply rotate the x axis labels:
g + theme(axis.text.x.bottom = element_text(angle = 90))
There are a few additional options for the x-axis if you go into ggplot2 extension packages.

Creating a line plot after every 48 rows in Dataframe

So I am given thousands of lines of data, which I inserted into a data frame using pandas. I would like to create plots that each include only 48 rows of data, so that after every 48 rows a new plot is created with the next 48 rows, and so on. I'm confused as to how to do that. I would also like to know how to graph only certain rows of my data frame in my line plot. P.S. this is my first question so I apologize for any formatting errors.
I isolated a certain column of my code "HP" and assigned into the variable hp by doing hp = df.HP. I also made a basic plot for the whole data already by doing hp.plot(x = '#', y = None, kind = 'line'). I've looked up my issue and tried using
hpnew = hp[seq(1, nrow(hp), 48), ]
hpnew.plot(x = '#', y = None, kind = 'line')
where hpnew would be every 48th row. It didn't work and I was left with the error message
NameError: name 'seq' is not defined
Initially I was told to use
for i to range(hp):
hp(i)
But I was left with a syntax error and was confused about what to do from there.
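You can use the answer by Roman Pekar here to bin your dataframe into groups of 48. Use integer division (//) so that rows 0-47 get label 0, rows 48-95 get label 1, and so on; in Python 3 a plain / returns floats, and every row would end up in its own group. This assumes the default 0, 1, 2, ... index:
df.groupby(df.index // 48)
Then if you have some plotting function you can apply it to the grouped data (the column in the question is called HP):
def plot_function(df):
    df.plot( ... )
df.groupby(df.index // 48)['HP'].apply(plot_function)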
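To make the chunking concrete, here is a minimal sketch that splits a frame into blocks of 48 rows by position and draws one figure per block. The helper names split_into_chunks and plot_chunks and the 100-row frame are hypothetical, made up for illustration; the last line shows plain positional slicing with .iloc, which covers the "only certain rows" part of the question.

```python
import numpy as np
import pandas as pd


def split_into_chunks(df, size=48):
    """Return a list of consecutive row blocks, each with at most `size` rows."""
    # Position-based labels: 0 for rows 0..47, 1 for rows 48..95, and so on.
    # Using positions instead of df.index also works for a non-default index.
    labels = np.arange(len(df)) // size
    return [chunk for _, chunk in df.groupby(labels)]


def plot_chunks(df, column="HP", size=48):
    """Draw one line plot per block of `size` rows, each on its own figure."""
    import matplotlib.pyplot as plt  # imported here so the splitting works without it

    for number, chunk in enumerate(split_into_chunks(df, size), start=1):
        fig, ax = plt.subplots()
        chunk[column].plot(ax=ax, kind="line", title="Block %d" % number)
    plt.show()


# Hypothetical data: 100 rows of made-up HP readings
df = pd.DataFrame({"HP": np.arange(100)})
chunks = split_into_chunks(df)

# Plotting only certain rows: slice by position first, then plot the slice
second_block = df.iloc[48:96]
```

Calling plot_chunks(df) would then open three figures here: two with 48 rows and a last one with the remaining 4.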

How to auto-deduce axes in pandas plot()

I am struggling to replicate the elegant ease - and successful outcome - teasingly promised in the 'Basic Plotting:plot' section of the pandas df.plot() documentation at:
http://pandas.pydata.org/pandas-docs/stable/visualization.html#visualization
There the authors' first image is pretty close to the kind of line-graph I want to plot from my dataframe. Their first df and resulting plot is a single-liner just as I hoped my df below would look when plotted.
My df looks like this:
2014-03-28 2014-04-04 2014-04-11 2014-04-18 \
Jenny Todd 1699.6 1741.6 1710.7 1744.2
2014-04-25 2014-05-02 2014-05-09
Jenny Todd 1764.2 1789.7 1802.3
Their second image is a multi-line graph very similar to what I hoped for when I try to plot a multiple-index version of my df. Eg:
2014-06-13 2014-06-20 2014-06-27 \
William Acer 1674.7 1689.4 1682.0
Katherine Baker 1498.5 1527.3 1530.5
2014-07-04 2014-07-11 2014-07-18 \
William Acer 1700.0 1674.5 1677.8
Katherine Baker 1540.4 1522.3 1537.3
2014-07-25
William Acer 1708.0
Katherine Baker 1557.1
However, they get plots. I get featureless 3.3kb images and a warning:
/home/lee/test/local/lib/python2.7/site-packages/matplotlib/axes/_base.py:2787: UserWarning: Attempting to set identical left==right results in singular transformations; automatically expanding.
left=0.0, right=0.0
'left=%s, right=%s') % (left, right))
The authors of the documentation seem to have the plot() function deducing from the df's indexes the values of the x-axis and the range and values of the y axis.
Searching around, I can find people with different data, different indexes and different scenarios (for example, plotting one column against another or trying to produce multiple subplots) who get this kind of 'axes' error. However, I haven't been able to map their issues to mine.
I wonder if anyone can help resolve what is different about my data or code that leads to a different plot outcome from the documentation's seemingly-similar data and seemingly-similar code.
My code:
print(plotting_df) # (This produces the df examples I pasted above)
plottest = plotting_df.plot.line(title='Calorie Intake', legend=True)
plottest.set_xlabel('Weeks')
plottest.set_ylabel('Calories')
fig = plt.figure()
plot_name = week_ending + '_' + collection_name + '.png'
fig.savefig(plot_name)
Note this dataframe is being created dynamically many times within the script. On any given run, the script will acquire different sets of dates, differently-named people, and different numbers to plot. So I don't have predictability about what strings will come up for index and legend labels for plotting beforehand. I do have predictability about the format.
I get that my dataframe's date index has differently-formatted dates than the referred documentation describes. Is this the cause? Whether it is or isn't, how should one best handle this issue?
Added on 2016-08-24 to answer the comment below about being unable to recreate my data
plotting_df is created on the fly as a subset of a much larger dataframe. It's simply an index (or sometimes multiple indices) and some of the date columns extracted from the larger dataframe. The code that produces plotting_df works fine and always produces plotting_df with correct indices and columns in a format I expect.
I can simulate creation of a dataset to store in plotting_df with this python code:
plotting_1 = {
'2014-03-28': 1699.6,
'2014-04-04': 1741.6,
'2014-04-11': 1710.7,
'2014-04-18': 1744.2,
'2014-04-25': 1764.2,
'2014-05-02': 1789.7,
'2014-05-09': 1802.3
}
plotting_df = pd.DataFrame(plotting_1, index=['Jenny Todd'])
and I can simulate creation of a multiple-indices plotting_df with this python code:
plotting_2 = {
'Katherine Baker': {
'2014-06-13': 1498.5,
'2014-06-20': 1527.3,
'2014-06-27': 1530.5,
'2014-07-04': 1540.4,
'2014-07-11': 1522.3,
'2014-07-18': 1537.3,
'2014-07-25': 1557.1
},
'William Acer': {
'2014-06-13': 1674.7,
'2014-06-20': 1689.4,
'2014-06-27': 1682.0,
'2014-07-04': 1700.0,
'2014-07-11': 1674.5,
'2014-07-18': 1677.8,
'2014-07-25': 1708.0
}
}
plotting_df = pd.DataFrame.from_dict(plotting_2)
I did try the suggested transform with code:
plotdf = plotting_df.T
plotdf.index = pd.to_datetime(plotdf.index)
so that my original code now looks like:
print(plotting_df) # (This produces the df examples I pasted above)
plotdf = plotting_df.T # Transform the df - date columns to indices
plotdf.index = pd.to_datetime(plotdf.index) # Convert indices to datetime
plottest = plotdf.plot.line(title='Calorie Intake', legend=True)
plottest.set_xlabel('Weeks')
plottest.set_ylabel('Calories')
fig = plt.figure()
plot_name = week_ending + '_' + collection_name + '.png'
fig.savefig(plot_name)
but I still get the same result (blank 3.3kb images created).
I did note that adding the transform made no difference when I printed out the first instance of plotdf. So should I be doing some other transform?
This is your problem:
fig = plt.figure()
plot_name = week_ending + '_' + collection_name + '.png'
fig.savefig(plot_name)
You are creating a second figure after creating the first one, and then you are saving only that second, empty figure. Just take out the line fig = plt.figure() and change fig.savefig to plt.savefig, which saves the current figure (the one pandas just drew on).
So you should have :
print(plotting_df) # (This produces the df examples I pasted above)
plotdf = plotting_df.T # Transform the df - date columns to indices
plotdf.index = pd.to_datetime(plotdf.index) # Convert indices to datetime
plottest = plotdf.plot.line(title='Calorie Intake', legend=True)
plottest.set_xlabel('Weeks')
plottest.set_ylabel('Calories')
plot_name = week_ending + '_' + collection_name + '.png'
plt.savefig(plot_name)
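Since the script builds many plots in a loop, an alternative that does not depend on which figure is "current" is to save through the Axes that plot() returns. This is only a sketch: it uses a few of the numbers from the question, and the output path (a temp directory and the name calories.png) is a hypothetical stand-in for week_ending + '_' + collection_name + '.png'.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so this runs without a display
import pandas as pd

# A few of the values from the question: people as columns, dates as index
plotting_df = pd.DataFrame({
    "Katherine Baker": {"2014-06-13": 1498.5, "2014-06-20": 1527.3, "2014-06-27": 1530.5},
    "William Acer": {"2014-06-13": 1674.7, "2014-06-20": 1689.4, "2014-06-27": 1682.0},
})
plotting_df.index = pd.to_datetime(plotting_df.index)  # date strings to datetimes

# plot() returns the Axes it drew on
ax = plotting_df.plot.line(title="Calorie Intake", legend=True)
ax.set_xlabel("Weeks")
ax.set_ylabel("Calories")

# Save the figure that owns this Axes, not a freshly created empty one
plot_name = os.path.join(tempfile.mkdtemp(), "calories.png")  # hypothetical name
ax.get_figure().savefig(plot_name)
```

When many plots are created in one run, closing each figure after saving (matplotlib.pyplot.close) keeps memory use down.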