Construct simple 1D probability contours in r - ggplot2

I want to recreate an image like the attached for my own data in R.
Green and Theobald. Sexing birds by discriminant analysis: further considerations. 1989. IBIS
An example of my data:
df <- data.frame (sex = c("M", "M", "F", "M", "F"),
wing = c(110,110,115,113,113),
bill = c(24,24,26,25,28)
)
I've plotted my real data using the following code just fine, but want to add lines representing 0.01, 0.5, 0.99 probability of a bird being male given bill and wing.
Dunlin_abbrev %>%
ggplot(
aes(Bill, Wing, color = Sex))+
geom_point()
The attached graph is from a Green and Theobald (1989) paper that used a discriminant function analysis to predict bird gender by bill and wing length.
The paper said they constructed a graph with contours corresponding to specified probabilities a bird is a male. However, all my searches for "probability contours" give me 2D contour lines that completely encircle my data like mapping a mountain such as geom_contour(), rather than the simple 1D splitting line I desire.

Related

Altering size of points on map in R (geom_sf) to reflect categorical data

I am creating a map to depict density of datapoints at different locations. At some locations, there is a high density of data available, and at others, there is a low density of data available. I would like to present the map with each data point shown but with each point a certain size to represent the density.
In my data table I have the location, and each location is assigned 'A', 'B', or 'C' to depict 'Low', 'Medium', and 'High' density. When plotting using geom_sf, I am able to get the points on the map, but I would like each category to be represented by a different size circle. I.e. 'Low density' locations with a small circle, and 'High density' locations with a larger circle.
I have been approaching the aesthetics of this map in the same way I would approach it as if it were a normal ggplot situation, but have not had any luck. I feel like I must be missing something obvious related to the fact that I am using geom_sf(), so any advice would be appreciated!
Using a very simple code:
ggplot() +
geom_sf(data = stc_land, color = "grey40", fill = "grey80") +
geom_sf(data = stcdens, aes(shape = Density)) +
theme_classic()
I know that the aes() call should go in with the 'stcdens' data, and I got close with the 'shape = Density', but I am not sure how to move forward with assigning what shapes I want to each category.
You probably want to swap shape = Density for size = Density; then the plot should behave itself (and yes, it is a standard ggplot behavior, nothing sf specific :)
As your code is not exactly reproducible allow me to use my favorite example of 3 cities in NC:
library(sf)
library(ggplot2)
library(dplyr) # provides %>% and mutate()
shape <- st_read(system.file("shape/nc.shp", package="sf")) # included with sf package
cities <- data.frame(name = c("Raleigh", "Greensboro", "Wilmington"),
x = c(-78.633333, -79.819444, -77.912222),
y = c(35.766667, 36.08, 34.223333),
population = c("high", "medium","low")) %>%
st_as_sf(coords = c("x", "y"), crs = 4326) %>%
dplyr::mutate(population = ordered(population,
levels = c("low", "medium", "high")))
ggplot() +
geom_sf(data = shape, fill = NA) +
geom_sf(data = cities, aes(size = population))
Note that I turned the population from a character variable to ordered factor, where high > medium > low (so that the circles follow the expected order).

Create a bar chart with bars colored according to a category and line on the same chart

I trained a model to predict a value and I want to make a bar chart that plots target - prediction for each sample, and then color these bars according to a category. I then want to add two horizontal lines for plus or minus sigma around the central axis, so it's clear which predictions are very far off. Imagine we know sigma == 0.3 and we have a dataframe
error  sample_id  category
 .1    1          'A'
 .4    2          'A'
 .1    3          'B'
-.2    4          'B'
-.1    5          'C'
How could I do this? I've managed to do just the errors and the plus or minus sigma lines just using matplotlib, here it is to communicate what I mean.
You'll find the pd.Series.transform() and/or pd.DataFrame.apply() methods quite useful. Essentially, you can map each value of your input columns (in this case errors) into some valid color value, returning a pd.Series of colors that's the same shape as errors.
The phrasing of the question is unclear, but it sounds like you want a single pair of lines for each category? In that case, you will first need a pd.Series.groupby() operation to get the shape that you want before the transform operation. Probably just a series of length 3, for your A B C categories.
Then, this Series (whether it is of length len(df) or df.category.nunique()) can be passed into your plt.bar method as the color argument.
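As a minimal sketch of that mapping, assuming the five rows from the question and sigma == 0.3: the palette dictionary and its color names are made up for illustration, and Series.map is used here as a simpler stand-in for transform/apply.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs without a display
import matplotlib.pyplot as plt
import pandas as pd

sigma = 0.3  # value stated in the question

# Example rows from the question's table
df = pd.DataFrame({
    "error": [0.1, 0.4, 0.1, -0.2, -0.1],
    "sample_id": [1, 2, 3, 4, 5],
    "category": ["A", "A", "B", "B", "C"],
})

# Hypothetical palette: one color per category (any valid matplotlib colors work)
palette = {"A": "tab:blue", "B": "tab:orange", "C": "tab:green"}

# One color per row, in the same order as the bars
bar_colors = df["category"].map(palette).tolist()

fig, ax = plt.subplots()
ax.bar(df["sample_id"], df["error"], color=bar_colors)

# Horizontal reference lines at +sigma, -sigma and the central axis
ax.axhline(sigma, color="black", linestyle="--")
ax.axhline(-sigma, color="black", linestyle="--")
ax.axhline(0, color="gray", linewidth=0.8)
ax.set_xlabel("sample_id")
ax.set_ylabel("target - prediction")

# Samples whose error falls outside the +/- sigma band
outliers = df.loc[df["error"].abs() > sigma, "sample_id"].tolist()
```

Call plt.show() or fig.savefig(...) afterwards to view the chart. The outliers list is only a convenience for checking which bars cross the sigma lines.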
This is actually very easy, I just didn't understand the 'color' option of plt.bar. If it is a list of length equal to the number of bars, then it will color each bar with the corresponding color. It's as simple as
plt.bar(x, y, color=z)
#len(x) = len(y) = len(z), and z is an array of colors
As krukah mentions, you just need to translate categories to colors. I picked a color map, made a dictionary that picked a color for each unique category, and then turned the cats array (a 2d np array, each row encodes a category) into an array of colors.
import numpy as np
import matplotlib.pyplot as plt

unique_cats = np.unique(cats, axis=0)
n_unique = unique_cats.shape[0]
for_picking = np.arange(0, 1, 1/n_unique)
cmap = plt.get_cmap('plasma')
color_dict = {}
# this for loop fills in the dictionary by picking colors from the cmap
for i in range(n_unique):
    color_dict[str(unique_cats[i])] = cmap(for_picking[i])
color_cats = [color_dict[str(cat)] for cat in cats]
Hopefully that helps someone some day.

Drawing map with geopandas library for a continent, but the data points contains whole world

I have a data set of meteorite finds with latitude and longitude information, almost 30,000 data points from all around the world. I would like to plot a map of only one continent, for example South America, using the geopandas library.
I am using geopandas' default 'naturalearth_lowres' map. From that world map, I filtered South America. My data, called mod_data_geo, consists of geometry-type data, Point(longitude, latitude).
Data Set looks like that:
My code:
mod_data_geo = gpd.GeoDataFrame(mod_data, geometry = gpd.points_from_xy(mod_data['long'], mod_data['lat']))
world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))
countries = world[world['continent'] == "South America"]
axis = countries.plot(color = 'Lightblue', edgecolor = 'black', figsize=(15,15))
mod_data_geo.plot(ax=axis, markersize = 1, color = 'purple' )
Map that I plotted:
How can I filter the meteorites in the mod_data_geo dataframe, with geopandas or any other tool, so that I see only those found on the South American continent?
Thank you in advance!
Three ways - crop the image, filter the points with a bounding box, or filter the points by checking whether they're inside the country shapes.
Crop the image
If you'd like the points to extend to the edge of the image and only want to limit the image extent, you can set the x and y limits on the matplotlib axis:
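axis = countries.plot(color = 'Lightblue', edgecolor = 'black', figsize=(15,15))
xlim, ylim = axis.get_xlim(), axis.get_ylim()  # extent of the continent plot
mod_data_geo.plot(ax=axis, markersize = 1, color = 'purple' )
axis.set_xlim(xlim)  # restore the limits after the points widened them
axis.set_ylim(ylim)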
Filter to a bounding box
The first approach is nice in that it retains all the data that can fit within your plot. But it's not super efficient, as matplotlib has to filter the data for you based on whether it will appear in the image. A faster approach could filter the data first; note that the data will no longer go to the edge of the image.
First find a bounding box around the countries:
In [6]: bounds = countries.bounds.agg({'minx': 'min', 'miny': 'min', 'maxx': 'max', 'maxy': 'max'})
In [7]: bounds
Out[7]:
minx -81.410943
miny -55.611830
maxx -34.729993
maxy 12.437303
dtype: float64
Then you can filter the data based on these bounds:
In [8]: mod_data_filtered = mod_data_geo[(
   ...:     (mod_data_geo.lat >= bounds.miny)
   ...:     & (mod_data_geo.lat <= bounds.maxy)
   ...:     & (mod_data_geo.long >= bounds.minx)
   ...:     & (mod_data_geo.long <= bounds.maxx)
   ...: )]
Now you can plot with mod_data_filtered.
Note that you could set the extent of the plot to the bounding box, though this will get a bit tight.
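To see the mask logic in isolation, here is a small sketch with made-up coordinates (not the meteorite data). The box values are copied from the bounds output above; the points frame and its column names are assumptions that mirror the lat/long columns in the question.

```python
import pandas as pd

# Made-up points: "a" and "c" lie inside the South America box, "b" and "d" outside
points = pd.DataFrame({
    "name": ["a", "b", "c", "d"],
    "lat": [-15.0, 48.9, -34.6, -33.9],
    "long": [-47.9, 2.35, -58.4, 18.4],
})

# Bounding box values copied from the bounds output above
minx, miny, maxx, maxy = -81.410943, -55.611830, -34.729993, 12.437303

# Series.between is inclusive on both ends by default;
# & combines the latitude and longitude masks element-wise
inside = points["lat"].between(miny, maxy) & points["long"].between(minx, maxx)
filtered = points[inside]
```

Note that point "d" passes the latitude test but fails the longitude test, which is why both conditions have to be combined.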
Filter with country shapes
If you'd like to filter the data to being within one of the countries rather than just cropping the data to the bounding box, you could use geopandas.GeoSeries.contains.
First, dissolve the data to get a single shape for South America:
In [8]: south_america = countries.dissolve()
In [9]: south_america
Out[9]:
geometry pop_est continent name iso_a3 gdp_md_est
0 MULTIPOLYGON (((-57.75000 -51.55000, -58.05000... 44293293 South America Argentina ARG 879400.0
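Then, filter the points to those within the shape. Because GeoSeries.contains compares two series row by row, test every point against the single dissolved geometry with within (the inverse of contains):
In [10]: sa_shape = south_america.geometry.iloc[0]
In [11]: mod_data_filtered = mod_data_geo[mod_data_geo.within(sa_shape)]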

Add a line at z=0 to ggplot2 heatmap

I have plotted a heatmap in ggplot2. I want to add a curved line to the plot to show where z=0 (i.e. where the value of the data used for the fill is zero). How can I do this?
Thanks
Since no example data or code is provided, I'll illustrate with the volcano dataset, representing heights of a volcano in a matrix. Since the data doesn't contain a zero point, we'll draw the line at the arbitrarily chosen 125 mark.
library(ggplot2)
# Convert matrix to data.frame
df <- data.frame(
row = as.vector(row(volcano)),
col = as.vector(col(volcano)),
value = as.vector(volcano)
)
# Set contour breaks at desired level
ggplot(df, aes(col, row, fill = value)) +
geom_raster() +
geom_contour(aes(z = value),
breaks = 125, col = 'red')
Created on 2020-04-06 by the reprex package (v0.3.0)
If this isn't a good approximation of your problem, I'd suggest to include example data and code in your question.

Number at risk for cox regression plot

Can I make a "number at risk" table for a Cox plot if I have more than one independent variable?
If it is possible, where can I find the relevant code? (I searched but couldn't find it.)
the code I used on my data:
fit <- coxph(Surv(time,event) ~chr1q21_status+CCND1+CRTM1+IRF4,data = myeloma)
ggsurvplot(fit, data = myeloma,
risk.table=TRUE, break.time.by=365, xlim = c(0,4000),
risk.table.y.text=FALSE, legend.labs = c("2","3","4+"))
I got this message: "object 'ggsurv' not found", although with only one variable and the survfit function it worked.
"number at risk "table for cox plot
It's not a Cox plot, it's a Kaplan-Meier plot. You're trying to plot a Cox model, when what you want is to fit KM curves using survfit and then to plot the resulting fit:
library("survival")
library("survminer")
fit <- survfit(Surv(time,status) ~ ph.ecog + sex , data = lung)
ggsurvplot(fit, data = lung, risk.table = TRUE)
Since you now mention that you have continuous predictors, perhaps you could think about what you expect an at-risk table or KM plot to show.
Here's an example of binning a continuous measure (age):
library("survival")
library("survminer")
#> Loading required package: ggplot2
#> Loading required package: ggpubr
#> Loading required package: magrittr
lung$age_bin <- cut(lung$age, quantile(lung$age), include.lowest = TRUE) # keep the minimum age in the first bin
fit <- survfit(Surv(time,status) ~ age_bin + sex , data = lung)
ggsurvplot(fit, data = lung, risk.table = TRUE)
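To make concrete why the strata have to be discrete groups, here is a rough, language-agnostic sketch (written in Python) of what a risk table counts: for each group, the number of subjects whose follow-up time has not ended before a given time point. The function name, the toy times, and the group labels are all made up for illustration, and this is not how survminer computes its table internally.

```python
def number_at_risk(times, groups, at):
    """Count, per group, the subjects whose follow-up time is >= `at`."""
    counts = {g: 0 for g in groups}  # start every group at zero
    for t, g in zip(times, groups):
        if t >= at:
            counts[g] += 1
    return counts

# Toy follow-up times in days and a binned (discrete) grouping variable
times = [100, 400, 800, 200, 900, 1200]
groups = ["young", "young", "young", "old", "old", "old"]

# One column of the risk table per time point
table = {t: number_at_risk(times, groups, t) for t in (0, 365, 1000)}
```

With a continuous predictor every subject would form its own group, so each count would be 0 or 1. Binning is what makes the table readable.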