Graphing Multiple Column Averages from Different dfs Representing Different Years - ggplot2

Below is a sample of the data:
df_1 <- data.frame(total = c(0.9, 0.4, 0.2), white = c(0.6, 0.2, 0.1), black = c(0.3, 0.2, 0.1), immigrant = c(0.7, 0.3, 0.9))
df_2 <- data.frame(total = c(0.8, 0.7, 0.6), white = c(0.4, 0.3, 0.2), black = c(0.4, 0.4, 0.4), immigrant = c(0.9, 0.2, 0.1))
df_3 <- data.frame(total = c(0.6, 0.8, 0.9), white = c(0.4, 0.2, 0.7), black = c(0.2, 0.6, 0.2), immigrant = c(0.6, 0.8, 0.5))
Hi, I am interested in using ggplot2 to graph the dataframes above. In my example, each dataframe represents a different decade as follows: df_1 represents 1930, df_2 represents 1990, and df_3 represents 2020. I am interested in calculating the mean/average of each of the four columns and then graphing the results. I would like the x-axis to represent each year (1930, 1990, and 2020) and the y-axis to represent the calculated means (which should range from 0-1). The columns in all of the dataframes show different demographic groups and would be visualized as a point in the graph. Below is an idea of what I am envisioning.
Illustration of the desired graph
I tried grouping the dataframes first but then I am not sure how to categorize each dataframe as a different year. The code below is something I adapted from another graph I made but it didn't work as expected. Note, 'ratio' is meant to represent the calculated means of each column.
Consideration:
The number of rows in each column may be different throughout the dataframes
list(df_1,
df_2,
df_3) %>%
lapply(function(x) setNames(x, 'ratio')) %>%
{do.call(bind_rows, c(., .id = 'demographic'))} %>%
mutate(ratio = mean(ratio)) %>%
group_by(demographic) %>%
ggplot(aes(ratio, n, colour = demographic, group = demographic)) +
labs(x="Mean", y="Year", ))

If you want your plot to be a ggplot, then it's important for your data to be tidy. That means that 1) each variable must have its own column, 2) each observation must have its own row, and 3) each value must have its own cell. These requirements also imply that all relevant values are in one dataset, not distributed over multiple datasets.
One option is to assign a year variable to each dataset, bind your datasets together, and then "lengthen" your dataset using pivot_longer(), so you can see each combination of year and your grouping variable. Then you can use summarize() to average by year and your grouping variable.
library(tidyverse)
df_1 <- data.frame(total = c(0.9, 0.4, 0.2), white = c(0.6, 0.2, 0.1), black = c(0.3, 0.2, 0.1), immigrant = c(0.7, 0.3, 0.9))
df_2 <- data.frame(total = c(0.8, 0.7, 0.6), white = c(0.4, 0.3, 0.2), black = c(0.4, 0.4, 0.4), immigrant = c(0.9, 0.2, 0.1))
df_3 <- data.frame(total = c(0.6, 0.8, 0.9), white = c(0.4, 0.2, 0.7), black = c(0.2, 0.6, 0.2), immigrant = c(0.6, 0.8, 0.5))
df_1$year <- 1930
df_2$year <- 1990
df_3$year <- 2020
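# Stack the three years, reshape to long format, then average per year and group
bigdf <- rbind(df_1, df_2, df_3) %>%
  pivot_longer(cols = -year) %>%
  mutate(year = as.factor(year)) %>%
  group_by(year, name) %>%
  summarize(value = mean(value))
# One line per group across the years, with a point at each mean
ggplot(bigdf, aes(x = year, y = value,
                  color = name, group = name)) +
  geom_path() + geom_point()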
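For readers who want to see the reshape-and-average step outside the tidyverse, here is a minimal sketch in plain Python. It is not part of the R answer: the `frames` dictionary and the `long_rows` and `means` names are made up for illustration, and only the numbers come from the question. Columns of different lengths are fine, because each mean is taken per year and group.

```python
from collections import defaultdict
from statistics import fmean

# Sample data from the question, keyed by the year each data frame represents
frames = {
    1930: {"total": [0.9, 0.4, 0.2], "white": [0.6, 0.2, 0.1],
           "black": [0.3, 0.2, 0.1], "immigrant": [0.7, 0.3, 0.9]},
    1990: {"total": [0.8, 0.7, 0.6], "white": [0.4, 0.3, 0.2],
           "black": [0.4, 0.4, 0.4], "immigrant": [0.9, 0.2, 0.1]},
    2020: {"total": [0.6, 0.8, 0.9], "white": [0.4, 0.2, 0.7],
           "black": [0.2, 0.6, 0.2], "immigrant": [0.6, 0.8, 0.5]},
}

# "Lengthen": one (year, name, value) record per cell, like pivot_longer()
long_rows = [
    (year, name, value)
    for year, columns in frames.items()
    for name, values in columns.items()
    for value in values
]

# Group by (year, name) and average, like group_by() + summarize()
groups = defaultdict(list)
for year, name, value in long_rows:
    groups[(year, name)].append(value)
means = {key: fmean(values) for key, values in groups.items()}

for (year, name), value in sorted(means.items()):
    print(year, name, round(value, 3))
```

Each key of `means` is one point in the desired plot: the year goes on the x-axis, the mean on the y-axis, and the group name picks the color.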
small edit
If you want to reorder the labels in the legend, you can turn name into a factor with an explicit level order.
bigdf <- bigdf %>%
mutate(name = factor(name,
levels = c("total",
"black",
"white",
"immigrant")))

Related

R Markdown - PDF Table with conditional bold format for row maximum AND percentage format

This question is similar to my past question: Conditionally format each cell containing the max value of a row in a data frame - R Markdown PDF
The difference is that in the past question my example printed a table of numbers, whereas this time the cells are technically characters (numbers in percentage format).
Data for example:
---
title: "Untitled"
output: pdf_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
```{r, include=FALSE}
segment<- c('seg1', 'seg1', 'seg2', 'seg2', 'seg3', 'seg3', 'Tot')
subSegment<- c('subseg1.1', 'subseg1.2', 'subseg2.1', 'subseg2.2', 'subseg3.1', 'subseg3.2', "-")
co.1<- c(0.1, 0.4, 0.3, 0.2, 0.5, 0.4, 0.4)
co.2<- c(0.5, 0.3, 0.3, 0.2, 0.1, 0.5, 0.4)
co.3<- c(0.2, 0.1, 0.4, 0.4, 0.1, 0.1, 0.15)
co.4<- c(0.2, 0.2, 0.0, 0.2, 0.3, 0.0, 0.05)
total<- c(1,1,1,1,1,1,1)
df<-data.frame(segment, subSegment, co.1, co.2, co.3, co.4, total) %>%
rowwise() %>%
mutate(across(co.1:co.4, ~cell_spec(.x, 'latex', bold = ifelse(.x == max(c_across(co.1:co.4)), TRUE, FALSE))))
df %>%
kable(booktabs = TRUE,
caption = "Title",
align = "c",
escape = FALSE) %>%
kable_styling(latex_options = c("HOLD_position", "repeat_header", "scale_down"),
font_size = 6) %>%
pack_rows(index = table(fct_inorder(df$segment)),
italic = FALSE,
bold = FALSE,
underline = TRUE,
latex_gap_space = "1em",
background = "#f2f2f2")%>%
column_spec(1, monospace = TRUE, color = "white") %>%
row_spec(nrow(df), bold = TRUE)
```
After doing this I get a very nice table:
My problem is that I want the numbers to be printed as percentages. I tried using scales::percent both before and after the conditional formatting, but neither works.
If I try to give the percentage format after the bold I get the error:
Error in UseMethod("round_any") :
no applicable method for 'round_any' applied to an object of class "character".
If I try to use it before the conditional bold then I can't find the maximum of each row since they are characters and not numbers.
aux.n<- df
aux.n[c(3:ncol(aux.n))] = sapply(aux.n[c(3:ncol(aux.n))], function(x) scales::percent(x, accuracy = 0.1))
I should add that this is just an example but actual numbers are stuff like 0.5471927 so it's really important to print "54.7%" instead of the full number.
libraries I used:
require("pacman")
p_load(tidyverse, reshape, reshape2, knitr, kableExtra, tinytex, scales, pander, janitor)
The values are converted to character by the cell_spec() function. With a bit of stringr and regex, the decimal values can be converted to percentages. Note that % is a reserved symbol in LaTeX, so it needs escaping.
---
output:
pdf_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE)
require("pacman")
p_load(dplyr, tidyr, stringr, kableExtra, forcats, tinytex, scales, janitor)
```
```{r df, include=FALSE}
segment<- c('seg1', 'seg1', 'seg2', 'seg2', 'seg3', 'seg3', 'Tot')
subSegment<- c('subseg1.1', 'subseg1.2', 'subseg2.1', 'subseg2.2', 'subseg3.1', 'subseg3.2', "-")
co.1<- c(0.1, 0.4, 0.3, 0.2, 0.5, 0.4, 0.4)
co.2<- c(0.5, 0.3, 0.3, 0.2, 0.1, 0.5, 0.4)
co.3<- c(0.2, 0.1, 0.4, 0.4, 0.1, 0.1, 0.15)
co.4<- c(0.2, 0.2, 0.0, 0.2, 0.3, 0.0, 0.05)
total<- c(1,1,1,1,1,1,1)
df <-
data.frame(segment, subSegment, co.1, co.2, co.3, co.4, total) %>%
rowwise() %>%
mutate(across(co.1:co.4, ~cell_spec(.x, 'latex', bold = ifelse(.x == max(c_across(co.1:co.4)), TRUE, FALSE)))) %>%
ungroup() %>%
pivot_longer(starts_with("co."))%>%
mutate(pc = percent(as.numeric(str_extract(value, "0\\.\\d+|0")), accuracy = 0.1),
value = str_replace(value, "0\\.\\d+|0", pc),
value = str_replace(value, "%", "\\\\%")) %>%
select(-pc) %>%
pivot_wider() %>%
select(-total, everything(), total)
```
```{r kable, results='asis'}
df %>%
kable(booktabs = TRUE,
caption = "Title",
align = "c",
escape = FALSE) %>%
kable_styling(latex_options = c("HOLD_position", "repeat_header", "scale_down"),
font_size = 6) %>%
pack_rows(index = table(fct_inorder(df$segment)),
italic = FALSE,
bold = FALSE,
underline = TRUE,
latex_gap_space = "1em",
background = "#f2f2f2") %>%
column_spec(1, monospace = TRUE, color = "white") %>%
row_spec(nrow(df), bold = TRUE)
```
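One way to sidestep the regex is to decide which cell is the row maximum while the values are still numeric, and only then build the display string. The sketch below shows that ordering in plain Python rather than R. `format_row` is a hypothetical helper, not a kableExtra function, and it assumes that tied maxima should all be bolded.

```python
def format_row(values, digits=1):
    """Return LaTeX cell strings with the row maximum in bold."""
    top = max(values)  # compare while the values are still numbers
    cells = []
    for v in values:
        text = f"{v * 100:.{digits}f}\\%"  # % is escaped for LaTeX
        cells.append(f"\\textbf{{{text}}}" if v == top else text)
    return cells


print(format_row([0.1, 0.5, 0.2, 0.2]))
print(format_row([0.5471927, 0.25, 0.2028073]))
```

Because the comparison happens before formatting, the rounding applied for display cannot change which cell is treated as the maximum.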

Two Pandas dataframes, how to interpolate row-wise using scipy

How can I use scipy interpolate on two dataframes, interpolating row-wise?
For example, if I have:
dfx = pd.DataFrame({"a": [0.1, 0.2, 0.5, 0.6], "b": [3.2, 4.1, 1.1, 2.8]})
dfy = pd.DataFrame({"a": [0.8, 0.2, 1.1, 0.1], "b": [0.5, 1.3, 1.3, 2.8]})
display(dfx)
display(dfy)
And say I want to interpolate for y(x=0.5), how can I get the results into an array that I can put in a new dataframe?
Expected result is: [0.761290323 0.284615385 1.1 -0.022727273]
For example, for first row, you can see the expected value is 0.761290323:
x = [0.1, 3.2] # from dfx, row 0
y = [0.8,0.5] # from dfy, row 0
fig, ax = plt.subplots(1,1)
ax.plot(x,y)
f = scipy.interpolate.interp1d(x,y)
out = f(0.5)
print(out)
I tried the following but received ValueError: x and y arrays must be equal in length along interpolation axis.
f = scipy.interpolate.interp1d(dfx, dfy)
out = np.exp(f(0.5))
print(out)
Since you are looking for linear interpolation, you can do:
def interpolate(val, dfx, dfy):
    t = (dfx['b'] - val) / (dfx['b'] - dfx['a'])
    return dfy['a'] * t + dfy['b'] * (1-t)

interpolate(0.5, dfx, dfy)
Output:
0 0.761290
1 0.284615
2 1.100000
3 -0.022727
dtype: float64
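The formula in `interpolate` is the straight line through the two points of each row, (x0, y0) and (x1, y1), evaluated at `val`. Since it is a line rather than a bounded lookup, it also extrapolates when `val` lies outside [x0, x1], as in the last row. Here is a small sketch that checks it against the expected result from the question. `line_through` is a name made up for this illustration; it uses only arithmetic, so it should carry over to pandas Series unchanged.

```python
def line_through(x0, y0, x1, y1, val):
    """Value at `val` of the straight line through (x0, y0) and (x1, y1)."""
    t = (x1 - val) / (x1 - x0)  # weight of the first point
    return y0 * t + y1 * (1 - t)


# Rows of dfx and dfy from the question, as (a, b) pairs
x_rows = [(0.1, 3.2), (0.2, 4.1), (0.5, 1.1), (0.6, 2.8)]
y_rows = [(0.8, 0.5), (0.2, 1.3), (1.1, 1.3), (0.1, 2.8)]

result = [line_through(x0, y0, x1, y1, 0.5)
          for (x0, x1), (y0, y1) in zip(x_rows, y_rows)]

# Expected values as stated in the question
expected = [0.761290323, 0.284615385, 1.1, -0.022727273]
print([round(r, 6) for r in result])
```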

ggplot multiple densities with common density

I would like to plot something that is "between" a histogram and a density plot. Here is an example:
library(ggplot2)
set.seed(1)
f1 <- rep(1, 100)
v1 <- rnorm(100)
df1 <- data.frame(f1, v1)
f1 <- rep(2, 10)
v1 <- (rnorm(10)+1*2)
df2 <- data.frame(f1, v1)
df <- rbind(df1, df2)
df$f1 <- as.factor(df$f1)
ggplot(df, aes(x = v1, colour = f1)) +
geom_density(position="identity", alpha = 0.6, fill = NA, size = 1)
You will see that the area under each curve is 1.0, which is OK for a density. BUT notice that the second distribution is made up of just 10 observations rather than the 100 of the first. What I would like is that the area under curve 2 reflects this, e.g. is a tenth of that of curve 1. Thanks.
There is a computed variable for stat_density that you can use, called count.
ggplot(df, aes(x = v1, colour = f1)) +
geom_density(position="identity", alpha = 0.6, fill = NA, size = 1,
aes(y = after_stat(count)))
Note: for ggplot2 versions before 3.3.0, use stat(count) instead of after_stat(count).
You can find these tricks in the documentation of ?geom_density() under the section "Computed Variables".
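The ggplot2 documentation describes `count` as the density multiplied by the number of points, so the area under each curve equals the group size. The sketch below illustrates that relationship in plain Python with a hand-rolled Gaussian kernel density estimate. The fixed bandwidth of 0.4 and all function names are assumptions made for the illustration; it does not reproduce ggplot2's bandwidth selection or the random numbers from the R example.

```python
import math
import random


def kde(data, grid, bandwidth):
    """Gaussian kernel density estimate of `data` evaluated on `grid`."""
    norm = len(data) * bandwidth * math.sqrt(2 * math.pi)
    return [
        sum(math.exp(-0.5 * ((g - d) / bandwidth) ** 2) for d in data) / norm
        for g in grid
    ]


def area(grid, ys):
    """Trapezoidal area under the curve."""
    return sum((grid[i + 1] - grid[i]) * (ys[i] + ys[i + 1]) / 2
               for i in range(len(grid) - 1))


random.seed(1)
group1 = [random.gauss(0, 1) for _ in range(100)]  # 100 observations
group2 = [random.gauss(2, 1) for _ in range(10)]   # 10 observations

lo = min(group1 + group2) - 4
hi = max(group1 + group2) + 4
grid = [lo + i * (hi - lo) / 1000 for i in range(1001)]

areas = {}
for label, data in (("1", group1), ("2", group2)):
    density = kde(data, grid, bandwidth=0.4)
    count = [d * len(data) for d in density]  # density scaled by group size
    areas[label] = (area(grid, density), area(grid, count))
    print(label, round(areas[label][0], 3), round(areas[label][1], 3))
```

Both density curves integrate to about 1, while the scaled curves integrate to about 100 and 10, which is the one-tenth ratio the question asks for.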

linear regression fit plot over boxplots in shared y-axis

I have a plot in the picture below:
Is it possible to add a colored band to indicate a linear regression across the different x-axes? I want a plot like this, with the whole zone between the two green lines filled with the same color:
A quick and dirty solution, to create a visually equal single plot, would be to use range(1,17) for x values and use the matplotlib functions xticks, grid and axvline to fine tune the plot:
import numpy as np
import matplotlib.pyplot as plt

# fake some data
xs = range(1, 17)
vals = np.asarray([0.73, 0.74, 0.73, 0.71,
0.75, 0.76, 0.75, 0.73,
0.77, 0.78, 0.77, 0.75,
0.79, 0.80, 0.79, 0.77])
data = np.random.rand(20, len(vals)) * 0.03 + vals
avgs = np.mean(data, axis=0)
# plot linear regr. lines and fill
xs2 = [0,20]
coef = np.polyfit(xs[0::4], avgs[0::4], 1) # values for 0.01
ys2a = np.polyval(coef, xs2)
coef = np.polyfit(xs[3::4], avgs[3::4], 1) # values for 0.5
ys2b = np.polyval(coef, xs2)
plt.fill_between(xs2, ys2a, ys2b, color='OliveDrab', alpha=0.5)
plt.plot(xs2, ys2a, color='OliveDrab', lw=3)
plt.plot(xs2, ys2b, color='OliveDrab', lw=3)
# plot data and manipulate axis and grid
plt.boxplot(data, showfliers=False)
plt.xticks(xs, [0.01, 0.1, 0.2, 0.5] * 4)
plt.xlim(0.5, 16.5)
plt.grid(False)
for i in range(3):
    plt.axvline(i * 4 + 4.5, c='white')
plt.xlabel(r'$\sigma^{2}$')
plt.ylabel('$F_{w}(t)$')
plt.show()

Setting colors individually in matplotlib

I want to create a custom plot. I want to precisely specify the color of each object. Specifically, I am creating a Gantt chart for system events. I am classifying those events into groups and color coding them to visualize.
Please consider the following code:
import matplotlib.pyplot as plt
import pandas as pd
df = pd.DataFrame()
df['y'] = [0,4,5,6,10]
df['color'] = [(.5, .5, .5, .5),]*len(df)
print(df['color'])
#fig = plt.figure(figsize=(12, 6))
#vax = fig.add_subplot(1,1,1)
#vax.hlines(df['y'], 0, 10, colors=df['color'])
#fig.savefig('ok.png')
only_four = df['y']==4
df['color'][only_four] = [(0.7, 0.6, 0.5, 0.4),]*sum(only_four)
print(df['color'])
Note that I first am setting the color for all to be a semi-transparent gray. Later, for a particular set of values, I want to change the color. I end up with this color table.
0 (0.5, 0.5, 0.5, 0.5)
1 0.6
2 (0.5, 0.5, 0.5, 0.5)
3 (0.5, 0.5, 0.5, 0.5)
4 (0.5, 0.5, 0.5, 0.5)
I want to be able to specify any RGBA value (i.e. including transparency) for any subset of the hlines. Could someone share how to do this? I'm open to any other way to do this as long as I can precisely color each line including a transparency.
ADDITION TO QUESTION:
I am able to update multiple rows by iterating as in:
def set_color(df, row_bool, r, g, b, a=1.0):
    idx = np.where(row_bool)[0]
    for i in idx:
        df['color'][i] = (r,g,b,a)
    return
This is sufficient, but I really wanted a vector operation (i.e. no explicit loop written by me).
I'm guessing the problem is that your updated tuple never makes it into the DataFrame, and only the 0.6 value ends up there. Have you tried using DataFrame.set_value? (It was deprecated in pandas 0.21 and removed in 1.0; the .at accessor is the documented replacement on newer versions.)
In [1]: df
Out[1]:
y color
0 0 (0.5, 0.5, 0.5, 0.5)
1 4 0.6
2 5 (0.5, 0.5, 0.5, 0.5)
3 6 (0.5, 0.5, 0.5, 0.5)
4 10 (0.5, 0.5, 0.5, 0.5)
In [2]: df.set_value(1, 'color', (0.7, 0.6, 0.5, 0.4))
Out[2]:
y color
0 0 (0.5, 0.5, 0.5, 0.5)
1 4 (0.7, 0.6, 0.5, 0.4)
2 5 (0.5, 0.5, 0.5, 0.5)
3 6 (0.5, 0.5, 0.5, 0.5)
4 10 (0.5, 0.5, 0.5, 0.5)
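For the vectorized update asked about in the question, one option is to keep the colors in an N x 4 NumPy array instead of tuples in an object column, so that a boolean mask can assign a whole RGBA row at once. This is a sketch under that assumption, not the approach from the answer above; the `rgba` array name is made up for the example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"y": [0, 4, 5, 6, 10]})

# Assumed layout: one RGBA row per line, stored as floats next to the frame
rgba = np.tile((0.5, 0.5, 0.5, 0.5), (len(df), 1))

# The boolean mask selects rows; the tuple is broadcast to each selected row
only_four = (df["y"] == 4).to_numpy()
rgba[only_four] = (0.7, 0.6, 0.5, 0.4)

# If tuples inside the frame are still wanted, rebuild the column in one go
df["color"] = [tuple(row) for row in rgba.tolist()]
print(df)
```

The array can be converted back to a column of tuples as shown. Passing `rgba` directly as the `colors` argument of `hlines` should also work, since Matplotlib generally accepts an N x 4 array of RGBA values where a sequence of colors is expected.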