Altering size of points on map in R (geom_sf) to reflect categorical data - ggplot2

I am creating a map to depict density of datapoints at different locations. At some locations, there is a high density of data available, and at others, there is a low density of data available. I would like to present the map with each data point shown but with each point a certain size to represent the density.
In my data table I have the location, and each location is assigned 'A', 'B', or 'C' to depict 'Low', 'Medium', and 'High' density. When plotting using geom_sf, I am able to get the points on the map, but I would like each category to be represented by a different size circle. I.e. 'Low density' locations with a small circle, and 'High density' locations with a larger circle.
I have been approaching the aesthetics of this map in the same way I would approach it as if it were a normal ggplot situation, but have not had any luck. I feel like I must be missing something obvious related to the fact that I am using geom_sf(), so any advice would be appreciated!
Using a very simple code:
ggplot() +
geom_sf(data = stc_land, color = "grey40", fill = "grey80") +
geom_sf(data = stcdens, aes(shape = Density)) +
theme_classic()
I know that the aes() call should go in with the 'stcdens' data, and I got close with the 'shape = Density', but I am not sure how to move forward with assigning what shapes I want to each category.

You probably want to swap shape = Density for size = Density; then the plot should behave itself (and yes, it is a standard ggplot behavior, nothing sf specific :)
As your code is not exactly reproducible allow me to use my favorite example of 3 cities in NC:
library(sf)
library(ggplot2)
library(dplyr) # provides %>% and mutate()
shape <- st_read(system.file("shape/nc.shp", package="sf")) # included with sf package
cities <- data.frame(name = c("Raleigh", "Greensboro", "Wilmington"),
x = c(-78.633333, -79.819444, -77.912222),
y = c(35.766667, 36.08, 34.223333),
population = c("high", "medium","low")) %>%
st_as_sf(coords = c("x", "y"), crs = 4326) %>%
dplyr::mutate(population = ordered(population,
levels = c("low", "medium", "high")))
ggplot() +
geom_sf(data = shape, fill = NA) +
geom_sf(data = cities, aes(size = population))
Note that I turned the population from a character variable to ordered factor, where high > medium > low (so that the circles follow the expected order).

Related

ggplot2 - wrap data around legend in custom position

When placing a legend in a custom position (using legend.position = c(x, y)) in a ggplot, is it possible to format the legend so that it does not overlay the data, and instead, the datapoints wrap around it?
In this example, would it be possible to, say, have ggplot insert extra space in the plot, so that datapoints are not obscured by the legend (without changing the legend.position)?
Thanks!
library(tidyverse)
data(mtcars)
ggplot(data = mtcars, aes(x = wt, y = hp))+
geom_point(aes(color = mpg))+
theme(legend.direction = "horizontal",
legend.position = c(0.5, 0.9))
An inelegant solution is to add theme(plot.title = element_text(margin = margin(a, b, c, d))), where a, b, c and d are padding values for top, right, bottom and left, respectively, and to increase the c (bottom) value until there is sufficient space. This assumes the plot has a title for the margin to apply to. Let me know if you come up with a better solution!

Drawing map with geopandas library for a continent, but the data points contains whole world

I have a data set of meteorite finds with latitude and longitude information. I have almost 30,000 data points from all around the world, but I would like to plot the map of only one continent, for example "South America", using the geopandas library.
I am using the 'naturalearth_lowres' default map of geopandas. From that world map, I filtered South America. My data, called mod_data_geo, contains geometry-type data, Point(longitude, latitude).
Data Set looks like that:
My code:
import geopandas as gpd
mod_data_geo = gpd.GeoDataFrame(mod_data, geometry = gpd.points_from_xy(mod_data['long'], mod_data['lat']))
world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))
countries = world[world['continent'] == "South America"]
axis = countries.plot(color = 'Lightblue', edgecolor = 'black', figsize=(15,15))
mod_data_geo.plot(ax=axis, markersize = 1, color = 'purple' )
Map that I plotted:
How can I filter the meteorite data inside the mod_data_geo dataframe, with the geopandas library or any other tool, so that I see only the meteorites found on the South American continent?
Thank you in advance!
Three ways - crop the image, filter the points with a bounding box, or filter the points by checking whether they're inside the country shapes.
Crop the image
If you'd like the points to extend to the edge of the image, but simply to limit the image extent, you can simply set the x and y limits on the matplotlib axis:
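axis = countries.plot(color = 'Lightblue', edgecolor = 'black', figsize=(15,15))
# remember the limits that fit the continent before adding the points
orig_xlim = axis.get_xlim()
orig_ylim = axis.get_ylim()
mod_data_geo.plot(ax=axis, markersize = 1, color = 'purple' )
# restore the limits so the view stays on the continent
axis.set_xlim(orig_xlim)
axis.set_ylim(orig_ylim)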
Filter to a bounding box
The first approach is nice in that it retains all the data that can fit within your plot. But it's not super efficient, as matplotlib has to filter the data for you based on whether it will appear in the image. A faster approach could filter the data first; note that the data will no longer go to the edge of the image.
First find a bounding box around the countries:
In [6]: bounds = countries.bounds.agg({'minx': 'min', 'miny': 'min', 'maxx': 'max', 'maxy': 'max'})
In [7]: bounds
Out[7]:
minx -81.410943
miny -55.611830
maxx -34.729993
maxy 12.437303
dtype: float64
Then you can filter the data based on these bounds:
In [8]: mod_data_filtered = mod_data_geo[(
...: (mod_data_geo.lat >= bounds.miny)
...: & (mod_data_geo.lat <= bounds.maxy)
...: & (mod_data_geo.long >= bounds.minx)
...: & (mod_data_geo.long <= bounds.maxx)
...: )]
Now you can plot with mod_data_filtered.
Note that you could set the extent of the plot to the bounding box, though this will get a bit tight.
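As a rough, self-contained illustration of the bounding-box filter, here is a sketch on a plain pandas frame. The four points and their labels are made up for the example, and the bounds are rounded from the values printed above; only the column names long and lat come from the question.

```python
import pandas as pd

# Hypothetical sample points (longitude/latitude in degrees); labels are made up.
points = pd.DataFrame({
    "name": ["m1", "m2", "m3", "m4"],
    "long": [-61.7, 17.9, 61.1, -68.8],
    "lat": [-27.5, -19.6, 54.8, -24.2],
})

# Bounding box of South America, rounded from the bounds shown above.
bounds = pd.Series({"minx": -81.41, "miny": -55.61, "maxx": -34.73, "maxy": 12.44})

# Series.between is inclusive on both ends by default; combine the masks with &.
in_box = (
    points["long"].between(bounds["minx"], bounds["maxx"])
    & points["lat"].between(bounds["miny"], bounds["maxy"])
)
filtered = points[in_box]
print(filtered["name"].tolist())
```

Each comparison produces a boolean Series, so the conditions have to be joined with & (not the Python keyword and) before indexing.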
Filter with country shapes
If you'd like to filter the data to being within one of the countries rather than just cropping the data to the bounding box, you could use geopandas.GeoSeries.contains.
First, dissolve the data to get a single shape for South America:
In [8]: south_america = countries.dissolve()
In [9]: south_america
Out[9]:
geometry pop_est continent name iso_a3 gdp_md_est
0 MULTIPOLYGON (((-57.75000 -51.55000, -58.05000... 44293293 South America Argentina ARG 879400.0
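Then, filter the points to those within the shape. A contains call between two GeoSeries is aligned row by row on the index, so instead test every point against the single dissolved geometry (within is the inverse of contains):
In [10]: mod_data_filtered = mod_data_geo[mod_data_geo.within(south_america.geometry.iloc[0])]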
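A minimal sketch of this kind of point-in-shape filter, assuming geopandas and shapely are installed. The square polygon is a hypothetical stand-in for the dissolved continent, and the three points and their names are invented for the example.

```python
import geopandas as gpd
from shapely.geometry import Polygon

# Hypothetical stand-in for the dissolved continent: a single square polygon.
region = gpd.GeoSeries(
    [Polygon([(0, 0), (10, 0), (10, 10), (0, 10)])], crs="EPSG:4326"
)

# Made-up points; in the question these would be the meteorite locations.
points = gpd.GeoDataFrame(
    {"name": ["inside_a", "outside", "inside_b"]},
    geometry=gpd.points_from_xy([1.0, 20.0, 9.0], [1.0, 5.0, 9.0]),
    crs="EPSG:4326",
)

# Passing one shapely geometry applies the test to every point,
# so no index alignment between two multi-row GeoSeries is involved.
mask = points.within(region.iloc[0])
inside = points[mask]
print(inside["name"].tolist())
```

The mask has one boolean per point, so it can be used directly to index the points frame before plotting.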

How can I define and add a legend to this ggplot2 script?

I came up with the following script to bin my data on X values, and plot the means of those bins in overlapping bar graphs. It works fine, but I can't seem to get a legend to generate, probably due to poor understanding of aesthetic mapping.
Here is the script, note that "MOI" and "T_cell_contacts" are two data columns in each DF.
ggplot(mapping=aes(MOI, T_cell_contacts)) +
stat_summary_bin(data = Cleaned24hr4, fun = "mean", geom="bar", bins= 100, fill = "#FF6666", alpha = 0.3) +
stat_summary_bin(data = cleaned24hr8, fun = "mean", geom="bar", bins= 100, fill = "#3733FF", alpha = 0.3) +
ylab("mean")
I also added the graph that it plots.
Full disclosure: I was in the middle of writing this when @schumacher posted their response :). Decided to finish anyway.
There are two ways to approach this. One way (more complicated) is to keep the dataframes separate and ask ggplot2 to create a legend via mapping, and the second (simpler) way is to combine into one dataset similar to what @schumacher posted and map the fill color to the extra id column created.
I'll show you both, but first, here's a sample dataset:
library(ggplot2)
set.seed(8675309)
df1 <- data.frame(my_x=rep(1:100, 3), my_y=rnorm(300, 40, 4))
df2 <- data.frame(my_x=rep(11:110, 3), my_y=rnorm(300, 110, 10))
# and the plot code similar to OP's question
ggplot(mapping=aes(x = my_x, y = my_y)) +
stat_summary_bin(data=df1, fun="mean", geom="bar", bins=40, fill="blue", alpha=0.3) +
stat_summary_bin(data=df2, fun="mean", geom="bar", bins=40, fill="red", alpha=0.3)
Method 1 : Combine Dataframes
This is the preferred method for a variety of reasons I can't list completely here. There are a lot of options you can use for combining datasets. One is using union() or rbind() after adding some sort of ID column to your data, but you can do all in one shot using bind_rows() from dplyr:
df <- dplyr::bind_rows(list(dataset1 = df1, dataset2 = df2), .id="id")
The result binds the rows together, and because the .id argument is specified, it creates a new column in the dataset called "id" that uses the names of the datasets in the list as its values. In this case, the value in the df$id column is either "dataset1" if the row originated from df1 or "dataset2" if it originated from df2.
Then you use aes(fill=...) to map the fill color to the column "id" in the combined dataset.
p <- ggplot(df, aes(x=my_x, y=my_y)) +
stat_summary_bin(aes(fill=id), fun="mean", geom="bar", bins=40, alpha=0.3)
p
This creates a plot with the default colors for fill, so if you want to supply your own, just use scale_fill_manual(values=...) to specify the particular colors. Using a named vector for values= ensures that each color is applied the way you want it to be, but you can just supply an unnamed vector of color names.
p + scale_fill_manual(values = c("dataset1" = "blue", "dataset2" = "red"))
Method 2 : Use mapping to add the legend
While Method 1 is preferred, there is another way that does not force you to combine your dataframes. This is also useful to illustrate a bit about how ggplot2 decides to create and draw legends. The legend is created automatically via the mapping= argument, specifically via aes(). If you put any aesthetic inside of aes() that would normally impart a different appearance and not location (with some exceptions like x, y, and label), then this initiates the creation of a legend. You can map either a column in your dataset (like above), or you can just supply a single value and that will be applied to the entire dataset used for the geom. In this case, see what happens when you change the fill= argument for each geom call to be within aes() and assign it to a character value:
p1 <- ggplot(mapping = aes(x=my_x, y=my_y)) +
stat_summary_bin(aes(fill="first"), data=df1, fun="mean", geom="bar", bins=40, alpha=0.3) +
stat_summary_bin(aes(fill="second"), data=df2, fun="mean", geom="bar", bins=40, alpha=0.3) +
scale_fill_manual(values = c("first" = "blue", "second" = "red"))
p1
It works! When you provide a character value for the fill= aesthetic inside aes(), it's basically labeling every observation in that data to have the value "first" or "second" and using that to make the legend. Cool, right?
You will notice a problem though, which is that the alpha value for the legend is not correct. This is because you get overplotting. It's just one of the reasons why you shouldn't really do it this way, but it sort of works. It is only noticeable if you have an alpha value. You can get that to look normal, but you need to use guide_legend() to override the aesthetics. Since the code effectively causes the legend to be drawn completely for each geom, you have to cut the alpha value in half for it to display correctly.
p1 + guides(fill=guide_legend(override.aes = list(alpha=0.15)))
Oh, and the real reason why not to use Method 2 is.... just think about doing that again for 5 datasets... how about 10?... how about 20?.....
I think the difficulty has to do with building a single legend out of two different geoms. My approach was to combine your data into a single data frame. The records from each to be set apart by a new category column, I'll call "cat" for short.
With the popular dplyr package:
Cleaned24hr4 <- mutate(Cleaned24hr4, cat = "hr4")
Cleaned24hr8 <- mutate(Cleaned24hr8, cat = "hr8")
Then put them together:
Cleaned <- bind_rows(Cleaned24hr4, Cleaned24hr8) # unlike union(), bind_rows() keeps duplicate rows
Define your colors:
colorcode <- c("hr4" = "#FF6666", "hr8" = "#3733FF")
Here's my ggplot statement:
ggplot(Cleaned, mapping=aes(MOI, T_cell_contacts)) +
stat_summary_bin(fun = "mean", geom="bar", bins= 100, aes(fill = cat), alpha = 0.3) +
scale_fill_manual(values = colorcode) +
ylab("mean")
Output using some dummy data.

Grouping the factors in ggplot

I am trying to create a graph based on matrix similar to one below... I am trying to group the Erosion values based on "Slope"...
library(ggplot2)
new_mat<-matrix(,nrow = 135, ncol = 7)
colnames(new_mat)<-c("Scenario","Runoff (mm)","Erosion (t/ac)","Slope","Soil","Tillage","Rotation")
for ( i in 1:nrow(new_mat)){
new_mat[i,2]<-sample(10:50, 1)
new_mat[i,3]<-sample(0.1:20, 1)
new_mat[i,4]<-sample(c("S2","S3","S4","S5","S1"),1)
new_mat[i,5]<-sample(c("Deep","Moderate","Shallow"),1)
new_mat[i,7]<-sample(c("WBP","WBF","WF"),1)
new_mat[i,6]<-sample(c("Intense","Reduced","Notill"),1)
new_mat[i,1]<-paste0(new_mat[i,4],"_",new_mat[i,5],"_",new_mat[i,6],"_",new_mat[i,7],"_")
}
#### Graph part ########
grphs_mat<-as.data.frame(new_mat)
grphs_mat$`Runoff (mm)`<-as.numeric(as.character(grphs_mat$`Runoff (mm)`))
grphs_mat$`Erosion (t/ac)`<-as.numeric(as.character(grphs_mat$`Erosion (t/ac)`))
ggplot(grphs_mat, aes(Scenario, `Erosion (t/ac)`,group=Slope, colour = Slope))+
scale_y_continuous(limits=c(0,max(as.numeric((grphs_mat$`Erosion (t/ac)`)))))+
geom_point()+geom_line()
When I run this code, the values are spread along the x-axis across all 135 scenarios. What I want is for the grouping to be done by slope, while the other shared factors (Soil + Rotation + Tillage) are picked up and placed on the x-axis. For example:
For these five scenarios:
S1_Deep_Intense_WBF_
S2_Deep_Intense_WBF_
S3_Deep_Intense_WBF_
S4_Deep_Intense_WBF_
S5_Deep_Intense_WBF_
The plot should separate S1, S2, S3, S4 and S5, but also recognise that the other factors are the same and place them at one x-axis position, so that the slope lines are stacked on top of each other at 135/5 = 27 x-axis points. The final figure should look like this (refer to the image). Apologies for not being able to explain it better.
I think I am making a mistake in grouping or assigning the x-axis values.
I will appreciate your suggestions.
In the example you give, I didn't get every possible factor combination represented so the plots looked a bit weird. What I did instead was start with the following:
set.seed(42)
new_mat <- matrix(,nrow = 1000, ncol = 7)
And then deduplicated this by summarising the values. A possibly relevant step here for your analysis is that I made a new variable with the interaction() function that is the combination of three other factors.
library(tidyverse)
df <- grphs_mat
df$x <- with(df, interaction(Rotation, Soil, Tillage))
# The simulation did not yield unique combinations
df <- df %>% group_by(x, Slope) %>%
summarise(n = sum(`Erosion (t/ac)`))
Next, I plotted this new x variable on the x-axis and used "stack" positions for the lines and points.
g <- ggplot(df, aes(x, y = n, colour = Slope, group = Slope)) +
geom_line(position = "stack") +
geom_point(position = "stack")
To make the x-axis slightly more readable, you can replace the . that the interaction() function placed by newlines.
g + scale_x_discrete(labels = function(x){gsub("\\.", "\n", x)})
Another option is to simply rotate the x axis labels:
g + theme(axis.text.x.bottom = element_text(angle = 90))
There are a few additional options for the x-axis if you go into ggplot2 extension packages.

How can I make this plot awesome (colours by group plus alpha value by second group)

I do have following dataframe:
I plotted it the following way:
Right now the plot looks ugly. Aside from using a different font size, marker edge width, marker face color etc., I would like to have two colors, one for each protein (hum1 and hum2), and within each group the different pH values should have different intensities. What makes it more difficult is the fact that my groups do not have the same size.
Any ideas?
P.S. Such a built-in feature would be really cool, e.g. colourby = level_one thenby level_two
import numpy as np
import matplotlib.pyplot as plt
# gr is the grouped dataframe from the question
fig = plt.figure(figsize=(9,9))
ax = fig.add_subplot(1,1,1)
c1 = plt.cm.Greens(np.linspace(0.5, 1, 4))
c2 = plt.cm.Blues(np.linspace(0.5, 1, 4))
colors = np.vstack((c1,c2))
gr.unstack(level=(0,1))['conc_dil'].plot(marker='o',linestyle='-',color=colors,ax=ax)
plt.legend(loc=1,bbox_to_anchor = (0,0,1.5,1),numpoints=1)
gives:
P.S This post helped me:
stacked bar plot and colours
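One way to handle groups of unequal size is to build the color array per group instead of hard-coding 4 shades for each. The sketch below is only an illustration: the helper group_shades, the group sizes (4 and 3) and the dummy lines are assumptions and not part of the original post; only the Greens and Blues colormaps come from the code above.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt

def group_shades(group_sizes, cmap_names, lo=0.4, hi=0.9):
    """Return one RGBA row per series: one colormap per group, one shade per member."""
    blocks = []
    for cmap_name, n in zip(cmap_names, group_sizes):
        cmap = plt.get_cmap(cmap_name)
        # n evenly spaced shades between lo and hi within this colormap
        blocks.append(cmap(np.linspace(lo, hi, n)))
    return np.vstack(blocks)

# Hypothetical sizes: 4 pH values for hum1, 3 for hum2.
colors = group_shades([4, 3], ["Greens", "Blues"])

# Dummy lines, just to show one color per series.
fig, ax = plt.subplots()
x = np.arange(5)
for i, c in enumerate(colors):
    ax.plot(x, x + i, marker="o", linestyle="-", color=c)
```

The resulting array has one row per series, so it can be used in place of the fixed-size colors array, provided the rows follow the same order as the unstacked columns.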