Unique values in categorical variables in RStudio - ggplot2

How can I find how many unique values each categorical variable takes in a data frame and then represent it with a graph? All this in RStudio.

We'll use the tidyverse here.
library(tidyverse)
You can apply the unique() function to a dataframe to remove any repeat rows.
df <- iris %>% unique()
The group_by(), summarise() and n() functions let you count the number of occurrences of each value of a variable in a dataframe.
df2 <- df %>% group_by(Species) %>% summarise(n = n())
## alternatively use count() which does the same thing
df2 <- df %>% count(Species)
Finally, we can use the ggplot2 package to create a graph.
ggplot() + geom_col(data = df2, aes(x = Species, y = n))
If you're not interested in keeping a separate dataframe with the counts and want to jump straight to the graph, you can skip the group_by() and summarise() step and just use geom_bar(), which does the counting for you.
ggplot() + geom_bar(data = df, aes(Species))
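If instead you want the number of distinct values of every column at once (closer to "how many unique values each variable takes"), here is a minimal sketch; variable and n_unique are just my labels:
df %>%
  summarise(across(everything(), n_distinct)) %>%   # one distinct count per column
  pivot_longer(everything(), names_to = "variable", values_to = "n_unique") %>%
  ggplot(aes(variable, n_unique)) + geom_col()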

Related

How to add a vector to a table in backend using dbplyr (R)

I created a table from a data source using tbl(). I need to add a column containing 1:nrow() to my dataset. I tried different methods but didn't succeed. My code is as below:
nrow_df1 <- df1 %>% summarise(n = n()) %>% pull(n)
df1 <- df1 %>% mutate(ID = 1:nrow_df1, step = 1)
It doesn't add column ID to my dataset and only adds column step.
Using as.data.frame() it works, but it is very slow.
Do you have any ideas? Thanks in advance.
For this case, you can use row_number().
library(dplyr)   # tbl(), mutate() and the %>% pipe come from dplyr
library(dbplyr)
library(DBI)
# simulate a fake database
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "mtcars", mtcars)
# add in the row
tbl(con, "mtcars") %>%
mutate(ID = row_number())
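# before disconnecting, you can inspect the SQL that dbplyr generates;
# row_number() should become a ROW_NUMBER() window function (exact SQL is backend-dependent)
tbl(con, "mtcars") %>%
  mutate(ID = row_number()) %>%
  show_query()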
dbDisconnect(con)
I found the answer: use row_number(), but as.numeric() is also needed to convert the output from integer64 to numeric:
df1 <- df1 %>% mutate(ID = as.numeric(row_number(a_column)), step = 1)  # a_column: whichever column defines the row order
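A minimal end-to-end sketch of that pattern, reusing the in-memory table from above; mpg is a stand-in for whichever column defines your row order:
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "mtcars", mtcars)
tbl(con, "mtcars") %>%
  mutate(ID = as.numeric(row_number(mpg)), step = 1) %>%  # as.numeric() guards against integer64 on backends that return bigint
  collect()
dbDisconnect(con)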

dplyr to data.table to speed up execution time

I am currently dealing with a moderately large dataframe called d.mkt (> 2M rows and 12 columns). As dplyr is too slow when applying summarise() combined with group_by_at(), I am trying to write an equivalent statement using data.table to speed up the summarise part. However, the situation is special in that the original dataframe is grouped with group_by_at() and then summarised over the same set of columns (e.g. X %>% select(-id) %>% group_by_at(vars(-x, -y, -z, -t)) %>% summarise(x = sum(x), y = sum(y), z = sum(z), t = sum(t)) %>% ungroup()).
With that in mind, below is my current attempt, which keeps failing with this error: keyby or by has length (1,1,1,1). Could someone please let me know how to fix it?
dplyr's code
d.mkt <- d.mkt %>%
left_join(codes, by = c('rte_cd', 'cd')) %>%
mutate(is_valid = replace_na(is_valid, FALSE),
rte_cd = ifelse(is_valid, rte_cd, 'RC'),
rte_dsc = ifelse(is_valid, rte_dsc, 'SKIPPED')) %>%
select(-is_valid) %>%
group_by_at(vars(-c_rv, -g_rv, -h_rv, -rn)) %>%
summarise(c_rv = sum(as.numeric(c_rv)), g_rv = sum(as.numeric(g_rv)), h_rv = sum(as.numeric(h_rv)), rn = sum(as.numeric(rn))) %>%
ungroup()
My attempt at translating the above:
d.mkt <- as.data.table(d.mkt)
d.mkt <- d.mkt[codes, on = c('rte_cd', 'sb_cd'),
`:=` (is.valid = replace_na(is_valid, FALSE), rte_cd = ifelse(is_valid, rte_cd, 'RC00'),
rte_ds = ifelse(is_valid, rte_ds, 'SKIPPED'))]
d.mkt <- d.mkt[, -"is.valid", with=FALSE]
d.mkt <- d.mkt[, .(c_rv = sum(c_rv), g_rv = sum(g_rv), h_rv = sum(h_rv), rn = sum(rn)), by = .('prop', 'date')]  # --- Error here already, but how do we ungroup a `data.table` though?
Close. Some suggestions/answers.
If you're shifting to data.table for speed, I suggest using fifelse() in lieu of replace_na() and ifelse(); minor.
The canonical way to remove is_valid is d.mkt[, is_valid := NULL].
Grouping can be done with a setdiff(). In data.table there is no need to "ungroup"; each [-call uses its own grouping. (For that reason, if you have multiple chained [-operations that use the same grouping, it can be useful to store that grouping as a variable, perhaps index it, and/or combine the whole [-chain into a single call. That is prone to lots of benchmarking discussion outside the scope of what we have here.)
Since all of your summary stats are the same, we can lapply(.SD, ..) this for a little readability improvement.
This might work:
library(data.table)
library(magrittr)  # provides the %>% pipe used below
setDT(codes) # or using `as.data.table(codes)` below instead
setDT(d.mkt) # ditto
tmp <- codes[d.mkt, on = .(rte_cd, cd) ] %>%
.[, c("is_valid", "rte_cd", "rte_dsc") :=
.(fcoalesce(is_valid, FALSE),
fifelse(is.na(is_valid), rte_cd, "RC"),
fifelse(is.an(is_valid), rte_dsc, "SKIPPED")) ]
tmp[, is_valid := NULL ]
cols <- c("c_rv", "g_rv", "h_rv", "rn")
tmp[, lapply(.SD, function(z) sum(as.numeric(z))),
by = setdiff(names(tmp), cols), .SDcols = cols ]
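To see the setdiff() grouping trick in isolation, here is a tiny self-contained sketch on mtcars (the choice of value columns is arbitrary):
library(data.table)
dt <- as.data.table(mtcars)
cols <- c("mpg", "hp")
# sum the value columns, grouping by every remaining column; no ungroup step is needed
dt[, lapply(.SD, sum), by = setdiff(names(dt), cols), .SDcols = cols]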

Scatter plot with ggplot, using indexing to plot subsets of the same variable on the x and y axes

I'm working with a subset of weather data for Heathrow, downloaded from the Met Office. This data set contains no missing values.
Using ggplot, I'd like to create a scatter plot of the maximum temperature (tmax) for Heathrow, with 2018 data plotted against 2019 data. There are 12 data points for each of 2018 and 2019.
I've attempted this with the code below; however, it does not work. This appears to be due to the indexing, as the code works fine when the indexes are not used within the aes() function.
How can I get this to work?
2018Index <- which(HeathrowData$Year == 2018)
2019Index <- which(HeathrowData$Year == 2019)
scatter<-ggplot(HeathrowData, aes(tmax[2018Index], tmax[2019Index]))
scatter + geom_point()
scatter + geom_point(size = 2) + labs(x = "2018", y = "2019")
As your data is in long format, you need some data wrangling to put the values for your years in separate columns, i.e. you have to reshape your data to wide format:
Using some random fake data:
library(dplyr)
library(tidyr)
library(ggplot2)
# Example data
set.seed(123)
HeathrowData <- data.frame(
Year = rep(2017:2019, each = 12),
tmax = runif(36)
)
# Select, Filter, Convert to Wide
HeathrowData <- HeathrowData %>%
select(Year, tmax) %>%
filter(Year %in% c(2018, 2019)) %>%
group_by(Year) %>%
mutate(id = row_number()) %>%
ungroup() %>%
pivot_wider(names_from = Year, values_from = tmax, names_prefix = "y")
ggplot(HeathrowData, aes(y2018, y2019)) +
geom_point(size = 2) +
labs(x = "2018", y = "2019")

How to add two boxplots in the same graph in ggplot2

I have this sample data.
sample <- data.frame(sample = 1:12,
site = c('A','A','A','B','B','B','A','A','A','B','B','B'),
month = c(rep('Feb', 6), rep('Aug', 6)),
Ar = c(7,8,9,8,9,9,4,5,7,5,8,9))
And created two boxplots
ggplot(sample, aes(x=factor(month), y=Ar)) +
geom_boxplot(aes(fill=site))
ggplot(sample, aes(x=factor(month), y=Ar)) +
geom_boxplot()
I wonder if there is a way to combine them in the same graph so that total, site A and site B are right next to each other for each month.
You could utilize dplyr (via the tidyverse package) and reshape2.
library(dplyr)
library(reshape2)
library(ggplot2)
sample %>%
  dplyr::select(-sample) %>%
  mutate(global = 'Global') %>%
  melt(., id.vars = c("month", "Ar")) %>%
  ggplot(aes(month, Ar, fill = value)) + geom_boxplot()
This drops the sample column as you aren't currently using it, adds the term global in a separate column, reshapes the data via the melt function, and generates the figure. Note that I changed the input code format in your original question; with the changes to the data.frame you no longer need to coerce the variables to factors.
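If you prefer to stay in the tidyverse, tidyr::pivot_longer() can stand in for reshape2::melt(); a sketch under that substitution (group is just my name for the value column):
library(tidyr)
sample %>%
  dplyr::select(-sample) %>%
  mutate(global = 'Global') %>%
  pivot_longer(c(site, global), values_to = "group") %>%
  ggplot(aes(month, Ar, fill = group)) +
  geom_boxplot()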

Julia DataFrame: Create new column sum of col values :x by :y

I have a DataFrame of x and y occurrences. I would like to count how often each occurrence happens in the DataFrame and what percentage of the :y occurrences that combination represents. I have the first part down now, thanks to a previous question.
using DataFrames
mydf = DataFrame(y = rand('a':'h', 1000), x = rand('i':'p', 1000))
mydfsum = by(mydf, [:x, :y], df -> DataFrame(n = length(df[:x])))
This successfully creates a column that counts how often each value of :x occurs with each value of :y. Now I need to be able to generate a new column that counts how often each value of :y occurs. I could next create a new DataFrame using:
mydfsumy = by(mydf, [:y], df -> DataFrame(ny = length(df[:x])))
Join the DataFrames together.
mydfsum = join(mydfsum, mydfsumy, on = :y)
And create the percentage :yp column
mydfsum[:yp] = mydfsum[:n] ./ mydfsum[:ny]
But this seems like a clunky workaround for a common data management problem. In R I would do all of this in one line using dplyr:
mydf %>% group_by(x, y) %>% summarize(n = n()) %>% group_by(y) %>% mutate(yp = n / sum(n))
You can do it in one line:
mydfsum = by(mydf, :y, df -> by(df, :x, dd -> DataFrame(n = size(dd,1), yp = size(dd,1)/size(df,1))))
or, if that becomes hard to read, you can use the do notation for anonymous functions:
mydfsum = by(mydf,:y) do df
by(df, :x) do dd
DataFrame(n = size(dd,1), yp = size(dd,1)/size(df,1))
end
end
What you are doing in R is actually a first by on both x and y, followed by mutating a column of the output. You can do that here too, but you need to have created that column first. Below I first initialize the yp column with zeroes and then modify it in place with another by.
mydfsum = by(mydf,[:x,:y], df -> DataFrame(n = size(df,1), yp = 0.))
by(mydfsum, :y, df -> (df[:yp] = df[:n]/sum(df[:n])))
For more advanced data manipulation you may want to take a look at Query.jl.
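For comparison, here is a runnable version of that R pipeline (note the function is group_by(), and the fake data mirrors the Julia mydf above):
library(dplyr)
set.seed(1)
# letters a-h for y and i-p for x, matching rand('a':'h') and rand('i':'p')
mydf <- data.frame(y = sample(letters[1:8], 1000, replace = TRUE),
                   x = sample(letters[9:16], 1000, replace = TRUE))
mydf %>%
  group_by(x, y) %>%
  summarize(n = n()) %>%
  group_by(y) %>%
  mutate(yp = n / sum(n))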