I need help with this database: https://www.kaggle.com/datasets/hugomathien/soccer.
I need a table with the date, the home team name, the away team name, the goals the home team scored, and the goals the away team scored for a single game (I chose match_api_id = 492476). I use this code in R:
con <- DBI::dbConnect(RSQLite::SQLite(), "data/database.sqlite")
library(tidyverse)
library(DBI)
match<-tbl(con,"Match")
team<-tbl(con,"Team")
T1<-match %>%
left_join(team, by = c(away_team_api_id="team_api_id"))
T1<-T1 %>% rename(away_team_name = team_long_name)
T2<-match %>%
left_join(team, by = c(home_team_api_id="team_api_id"))
T2<-T2 %>% rename(home_team_name = team_long_name)
TT<-T1 %>%
left_join(T2,by = c("match_api_id","date","home_team_goal","away_team_goal"))
finalT<-TT %>%
left_join(match,by = c("match_api_id","date","home_team_goal","away_team_goal")) %>%
filter(match_api_id==492476) %>%
select(home_team_name,away_team_name,date,home_team_goal,away_team_goal)
glimpse(finalT)
It works, but can I get the same table with simpler, more straightforward code, without defining all these tables?
Try this to reduce the objects created:
library(tidyverse)
con <- DBI::dbConnect(RSQLite::SQLite(), "database.sqlite")
team <- tbl(con, "Team") %>%
select(team_api_id, team_long_name)
tbl(con, "Match") %>%
select(date, ends_with("api_id"), ends_with("_goal")) %>%
left_join(team, by = c("home_team_api_id" = "team_api_id")) %>%
rename(home_team = team_long_name) %>%
left_join(team, by = c("away_team_api_id" = "team_api_id")) %>%
rename(away_team = team_long_name) %>%
filter(match_api_id == 492476) %>%
select(-ends_with("_id"))
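If you are comfortable with SQL, the same two joins can also be written as a single query. The following is a minimal sketch rather than the answer's method: it builds a tiny in-memory database whose table and column names follow the Kaggle schema used above, while the team names, date, and scores are made up for illustration.

```r
library(DBI)

# In-memory stand-in for database.sqlite; the rows below are invented toy data
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "Team", data.frame(
  team_api_id    = c(1L, 2L),
  team_long_name = c("Alpha FC", "Beta United"),
  stringsAsFactors = FALSE
))
dbWriteTable(con, "Match", data.frame(
  match_api_id     = 492476L,
  date             = "2008-08-17 00:00:00",
  home_team_api_id = 1L,
  away_team_api_id = 2L,
  home_team_goal   = 3L,
  away_team_goal   = 1L,
  stringsAsFactors = FALSE
))

# Join Team twice under different aliases: once for the home side, once for the away side
sql <- '
  SELECT m.date,
         h.team_long_name AS home_team,
         a.team_long_name AS away_team,
         m.home_team_goal,
         m.away_team_goal
  FROM "Match" AS m
  LEFT JOIN "Team" AS h ON m.home_team_api_id = h.team_api_id
  LEFT JOIN "Team" AS a ON m.away_team_api_id = a.team_api_id
  WHERE m.match_api_id = ?
'
result <- dbGetQuery(con, sql, params = list(492476L))
print(result)
dbDisconnect(con)
```

Against the real file, only the dbConnect() path changes and the two dbWriteTable() calls are dropped.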
I can't figure out where the problem is: I don't get values for my "MIN", and I still have "NA" on my graph after using functions like na.omit() and drop_na(). Pictures and code chunks are listed below:
aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual, FUN = mean)
aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual, FUN = median)
aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual, FUN = max)
aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual, FUN = min)
code to get min value
all_trips_v2 %>%
mutate(weekday = wday(started_at, label = TRUE)) %>%
group_by(member_casual, weekday) %>%
summarise(number_of_rides = n(),average_duration = mean(ride_length)) %>%
arrange(member_casual,weekday)
I tried using na.omit() and print(df %>% drop_na()) to remove the NA values, but it's not working.
For mean in summarise, add na.rm = TRUE after the variable name. – Elin, Feb 1 at 5:50
all_trips_v2 %>%
mutate(weekday = wday(started_at, label = TRUE)) %>%
group_by(member_casual, weekday) %>%
summarise(number_of_rides = n(),average_duration = mean(ride_length,na.rm = TRUE)) %>%
arrange(member_casual,weekday)
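A small sketch of why this matters, using made-up ride lengths rather than the real all_trips_v2 data: mean() returns NA as soon as a group contains a single NA, unless na.rm = TRUE is set.

```r
library(dplyr)

# Toy data (invented values); one member ride has a missing length
rides <- tibble(
  member_casual = c("member", "member", "casual", "casual"),
  ride_length   = c(10, NA, 20, 30)
)

summary_tbl <- rides %>%
  group_by(member_casual) %>%
  summarise(
    avg_default = mean(ride_length),               # NA if any value in the group is NA
    avg_na_rm   = mean(ride_length, na.rm = TRUE)  # NAs dropped before averaging
  )
print(summary_tbl)
```

Note that na.omit() and drop_na() return a new data frame, so their result has to be assigned (for example df <- df %>% drop_na()) before later steps can see it; printing the result alone leaves df unchanged.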
I'm still learning R and I'm not sure why there is NA data in my graph, considering that I have used the table function to check the variables in the column.
Any suggestions for removing the NA category from my graph?
Please find sample code below (not the actual dataset):
# Install and load relevant packages
install.packages("tidyverse")
install.packages("lubridate")
install.packages("ggplot2")
install.packages("tibble")
library(tidyverse)
library(lubridate)
library(ggplot2)
library(tibble)
library(dplyr)
# data frame
all_trips <- tribble(~start, ~end, ~start_name, ~type,
"2020-03-22 03:20:20", "2020-03-22 04:10:15", "A", "member",
"2020-03-25 01:01:07", "2020-03-25 05:09:45", NA, "member",
"2020-03-26 07:09:55", "2020-03-26 08:10:20", "B", "casual",
"2020-03-29 09:10:30", "2020-03-29 09:00:20", "A", "casual",
"2020-03-30 11:09:18", "2020-03-30 03:40:10", "B", "member")
# generate new columns
all_trips$date <- as.Date(all_trips$start) #The default format is yyyy-mm-dd
all_trips$month <- format(as.Date(all_trips$date), "%m")
all_trips$day <- format(as.Date(all_trips$date), "%d")
all_trips$year <- format(as.Date(all_trips$date), "%Y")
all_trips$day_of_week <- format(as.Date(all_trips$date), "%A")
all_trips$ride_length <- difftime(all_trips$end,all_trips$start)
is.factor(all_trips$ride_length)
all_trips$ride_length <- as.numeric(as.character(all_trips$ride_length))
is.numeric(all_trips$ride_length)
# data cleaning
all_trips_v2 <- all_trips[!(all_trips$start_name == "NA" |
all_trips$ride_length<0),]
# data viz
all_trips_v2 %>%
mutate(weekday = wday(start, label = TRUE)) %>% #creates weekday field using wday()
group_by(type, weekday) %>% #groups by usertype and weekday
summarise(number_of_rides = n() #calculates the number of rides and average duration
,average_duration = mean(ride_length)) %>% # calculates the average duration
arrange(type, weekday) %>% # sorts
ggplot(aes(x = weekday, y = number_of_rides, fill = type)) +
geom_col(position = "dodge", na.rm = TRUE) +
scale_x_discrete(na.translate = FALSE)
Adding the na.rm and na.translate arguments removes missing values from the bar chart without a warning message, as shown here:
tibble(x = rep(c('One', 'Two', 'Two', NA),2), Group=rep(c("A","B"),each=4)) %>%
ggplot(aes(x, fill=Group)) +
labs(title="Sample Group Bar Chart with NA's Removed") +
geom_bar(stat="Count", position=position_dodge(), na.rm = TRUE) +
scale_x_discrete(na.translate = FALSE)
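One possible source of the NA rows in the question's sample code, offered as a guess rather than a confirmed diagnosis, is the cleaning step: comparing a missing value with the string "NA" yields NA instead of TRUE, and indexing rows with an NA produces a row that is entirely NA. A minimal base R sketch with made-up data, using is.na() instead:

```r
# Toy data (invented values): one missing station name, one negative ride length
trips <- data.frame(
  start_name  = c("A", NA, "B"),
  ride_length = c(50, 248, -10),
  stringsAsFactors = FALSE
)

# Comparing with the string "NA": the condition is NA for the missing name,
# so the subset contains an all-NA row instead of dropping that trip
bad <- trips[!(trips$start_name == "NA" | trips$ride_length < 0), ]

# Testing with is.na(): the condition is always TRUE or FALSE
good <- trips[!(is.na(trips$start_name) | trips$ride_length < 0), ]

print(bad)
print(good)
```

With is.na(), the missing values are removed before plotting, so the chart no longer depends on na.rm or na.translate to hide them.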
So, I've done my searches but cannot find the solution to this problem I have with a bar plot in ggplot.
I'm trying to make the bars show the percentage of the total number of cases in each group of grouping variable 2.
Right now it visualises the counts.
Dataframe = ASAP
Grouping variable 1 - cc_groups (seen at the top of the graph)
(counts the number of cases within a range (steps of 20) of a score from 0-100)
Grouping variable 2 - asap
(binary variable with either intervention or control; the numbers of controls and interventions are not the same)
Initial code:
``` r
ggplot(ASAP, aes(x = asap, fill = asap)) + geom_bar(position = "dodge") +
facet_grid(. ~ cc_groups) + scale_fill_manual(values = c("red",
"darkgray"))
#> Error in ggplot(ASAP, aes(x = asap, fill = asap)): could not find function "ggplot"
```
Created on 2020-05-19 by the reprex package (v0.3.0)
This gives me a graph that visualises the counts in each subgroup.
I have manually calculated the different percentages that actually need to be visualised:
table_groups <- matrix(c(66/120,128/258,34/120,67/258,10/120,30/258,2/120,4/258,0,1/258,8/120,28/258),ncol = 2, byrow = T)
colnames(table_groups) <- c("ASAP","Control")
rownames(table_groups) <- c("0-10","20-39","40-59","60-79","80-99","100")
         ASAP  Control
0-10  0.55000 0.496124
20-39 0.28333 0.259690
40-59 0.08333 0.116279
60-79 0.01667 0.015504
80-99 0.00000 0.003876
100   0.06667 0.108527
When I use the solution provided by Stefan below (which was an excellent answer but didn't quite do the trick), I get the following output:
``` r
ASAP %>% count(cc_groups, asap) %>% group_by(cc_groups) %>% mutate(pct = n/sum(n)) %>%
ggplot(aes(x = asap, y = pct, fill = asap)) + geom_col(position = "dodge") +
facet_grid(~cc_groups) + scale_fill_manual(values = c("red",
"darkgray"))
#> Error in ASAP %>% count(cc_groups, asap) %>% group_by(cc_groups) %>% mutate(pct = n/sum(n)) %>% : could not find function "%>%"
```
Created on 2020-05-19 by the reprex package (v0.3.0)
whereas (when I go analogue) I'd like it to show the percentages listed above, which I sketched by hand.
I'm SO sorry about that drawing.. :) and reprex kept feeding me errors; I'm sure I'm using it incorrectly.
The easiest way to achieve this is to aggregate the data before plotting, i.e. to compute the counts and percentages manually. Grouping by asap makes each percentage relative to the total number of cases in that group:
library(ggplot2)
library(dplyr)
ASAP %>%
count(cc_groups, asap) %>%
group_by(asap) %>%
mutate(pct = n / sum(n)) %>%
ggplot(aes(x = asap, y = pct, fill=asap)) +
geom_col(position="dodge")+
facet_grid(~cc_groups)+
scale_fill_manual(values = c("red","darkgray"))
Using ggplot2::mpg as example data:
library(ggplot2)
library(dplyr)
# example data
mpg2 <- mpg %>%
filter(cyl %in% c(4, 6)) %>%
mutate(cyl = factor(cyl))
# Manually compute counts and percentages
# (grouping by cyl, the x/fill variable, mirrors group_by(asap) above)
mpg3 <- mpg2 %>%
  count(class, cyl) %>%
  group_by(cyl) %>%
  mutate(pct = n / sum(n))
# Plot
ggplot(mpg3, aes(x = cyl, y = pct, fill = cyl)) +
geom_col(position = "dodge") +
facet_grid(~ class) +
scale_fill_manual(values = c("red","darkgray"))
Created on 2020-05-18 by the reprex package (v0.3.0)
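As a quick check, the same aggregation can be run on the counts from the question's manual table (the numerators 66, 128, 34, ... and the group totals 120 and 258). This is only a sketch: the counts are typed in by hand as a pre-counted table, and the percent axis labels via scales::percent are an optional addition, not part of the original answer.

```r
library(dplyr)
library(ggplot2)

# Counts per cc_groups/asap cell, copied from the question's manual calculation
counts <- tibble(
  cc_groups = rep(c("0-10", "20-39", "40-59", "60-79", "80-99", "100"), each = 2),
  asap      = rep(c("ASAP", "Control"), times = 6),
  n         = c(66, 128, 34, 67, 10, 30, 2, 4, 0, 1, 8, 28)
)

# Share of each cc_groups cell within its asap group (120 ASAP, 258 Control)
pct_tbl <- counts %>%
  group_by(asap) %>%
  mutate(pct = n / sum(n)) %>%
  ungroup() %>%
  mutate(cc_groups = factor(cc_groups, levels = unique(cc_groups)))  # keep facet order

p <- ggplot(pct_tbl, aes(x = asap, y = pct, fill = asap)) +
  geom_col(position = "dodge") +
  facet_grid(~ cc_groups) +
  scale_y_continuous(labels = scales::percent) +
  scale_fill_manual(values = c("red", "darkgray"))
print(p)
```

The resulting pct column reproduces the manually calculated values, for example 66/120 = 0.55 for ASAP in the first range, and sums to 1 within each asap group.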
How do I edit this code to replace == with a check for whether Species contains %setosa%?
new_iris <- iris %>%
  mutate(flag = ifelse(Species == "setosa", 1, 0)) # add a new column
To stay in the tidyverse you could use stringr.
library(dplyr)
library(stringr)
iris %>%
mutate(flag = str_detect(Species, "setosa"))
We can use %like% from data.table
library(dplyr)
library(data.table)
iris %>%
mutate( flag = as.integer(Species %like% "setosa") )
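Note that str_detect() returns TRUE/FALSE rather than the 1/0 of the original ifelse(). If a numeric flag is needed, the logical result can be wrapped in as.integer(); the sketch below does this with base R's grepl(), which is one further alternative and not taken from the answers above.

```r
library(dplyr)

new_iris <- iris %>%
  # fixed = TRUE treats "setosa" as a literal substring, not a regular expression
  mutate(flag = as.integer(grepl("setosa", Species, fixed = TRUE)))

# Cross-tabulate species against the flag to check the result
print(table(new_iris$Species, new_iris$flag))
```

The same as.integer() wrapper works around str_detect(Species, "setosa") if you prefer to stay with stringr.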
I'm looking for a way to automate table generation using the expss package in an attempt to move from SPSS to R.
I think this should be simple, but I seem to be missing something...
I only define a few different tables based on the question type.
E.g. the table for a single-response question looks like this:
banner <- d %>% tab_cols(total(),Q2.banner,Q3.banner)
banner %>%
tab_cells (Q1) %>%
tab_stat_cases(total_row_position = c("above"),label = 'N') %>%
tab_stat_cpct(total_row_position = c("none"), label = '%') %>%
tab_pivot (stat_position = "inside_rows") %>%
drop_c () %>%
custom_format()
I'm looking for a function in which I only have to specify the variable, e.g.:
Table1 = function (Q, banner) {
banner %>%
tab_cells (Q) %>%
tab_stat_cases(total_row_position = c("above"),label = 'N') %>%
tab_stat_cpct(total_row_position = c("none"), label = '%') %>%
tab_pivot (stat_position = "inside_rows") %>%
drop_c () %>%
custom_format()
}
Ideally I would like to add a table title as well.
I'm running the table books in R Notebook.
Any other tips to automate table generation are all welcome.
Thanks for all help,
michaëla
There is a rather universal solution for working with non-standard evaluation - eval.parent(substitute(...)). In your case it looks like this:
Table1 = function (Q, banner) {
eval.parent(substitute(
{
banner %>%
tab_cells (Q) %>%
tab_stat_cases(total_row_position = c("above"),label = 'N') %>%
tab_stat_cpct(total_row_position = c("none"), label = '%') %>%
tab_pivot (stat_position = "inside_rows") %>%
drop_c () %>%
custom_format()
}
))
}
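To see what eval.parent(substitute(...)) does without expss, here is a minimal base R sketch. The helper subset_mean() is hypothetical and exists only for illustration: substitute() swaps the unevaluated arguments into the expression, and eval.parent() then runs the resulting call in the caller's environment, as if it had been typed there.

```r
# Hypothetical helper, not part of expss: mean of a column over a filtered subset
subset_mean <- function(data, cond, col) {
  eval.parent(substitute(
    mean(with(subset(data, cond), col))
  ))
}

# Evaluated as if mean(with(subset(mtcars, cyl == 4), mpg)) had been written here
res <- subset_mean(mtcars, cyl == 4, mpg)
print(res)
```

With the answer's Table1(), the call would then presumably be Table1(Q1, banner), assuming Q1 and banner are defined as in the question.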