How to unlist a `tknlist`? - tidyverse

step_tokenize() returns a vector of type tknlist. How can I get a rectangular form of it? I mean something like unnesting the tokens and adding them as columns of the tibble.
library(textrecipes)
library(modeldata)
data(tate_text)
tate_rec <- recipe(~., data = tate_text) %>%
  step_tokenize(medium) %>%
  show_tokens(medium)
tate_obj <- tate_rec %>%
  prep()
dd <- bake(tate_obj, new_data = NULL, medium)

There isn't a direct way to get a rectangular data frame from a tknlist object in {textrecipes}; the object is mainly used internally by the package.
You can use the unexported textrecipes:::get_tokens() function to turn the tknlist column into a list of character vectors, but the package doesn't have any functions that unnest that object for you.
library(textrecipes)
library(modeldata)
data(tate_text)
tate_rec <- recipe(~., data = tate_text) %>%
  step_tokenize(medium)
tate_obj <- tate_rec %>%
  prep()
dd <- bake(tate_obj, new_data = NULL, medium, everything())
dd
#> # A tibble: 4,284 × 5
#> medium id artist title year
#> <tknlist> <dbl> <fct> <fct> <dbl>
#> 1 [8 tokens] 21926 Absalon Proposals for a Habitat 1990
#> 2 [3 tokens] 20472 Auerbach, Frank Michael 1990
#> 3 [3 tokens] 20474 Auerbach, Frank Geoffrey 1990
#> 4 [3 tokens] 20473 Auerbach, Frank Jake 1990
#> 5 [4 tokens] 20513 Auerbach, Frank To the Studios 1990
#> 6 [4 tokens] 21389 Ayres, OBE Gillian Phaëthon 1990
#> 7 [4 tokens] 121187 Barlow, Phyllida Untitled 1990
#> 8 [3 tokens] 19455 Baselitz, Georg Green VIII 1990
#> 9 [6 tokens] 20938 Beattie, Basil Present Bound 1990
#> 10 [3 tokens] 105941 Beuys, Joseph Joseph Beuys: A Private Collectio… 1990
#> # … with 4,274 more rows
dd %>%
  mutate(medium = textrecipes:::get_tokens(medium))
#> # A tibble: 4,284 × 5
#> medium id artist title year
#> <list> <dbl> <fct> <fct> <dbl>
#> 1 <chr [8]> 21926 Absalon Proposals for a Habitat 1990
#> 2 <chr [3]> 20472 Auerbach, Frank Michael 1990
#> 3 <chr [3]> 20474 Auerbach, Frank Geoffrey 1990
#> 4 <chr [3]> 20473 Auerbach, Frank Jake 1990
#> 5 <chr [4]> 20513 Auerbach, Frank To the Studios 1990
#> 6 <chr [4]> 21389 Ayres, OBE Gillian Phaëthon 1990
#> 7 <chr [4]> 121187 Barlow, Phyllida Untitled 1990
#> 8 <chr [3]> 19455 Baselitz, Georg Green VIII 1990
#> 9 <chr [6]> 20938 Beattie, Basil Present Bound 1990
#> 10 <chr [3]> 105941 Beuys, Joseph Joseph Beuys: A Private Collection… 1990
#> # … with 4,274 more rows
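If you want a fully rectangular, one-row-per-token result, one option is to pair get_tokens() with tidyr. A minimal sketch, assuming you want the remaining columns repeated for every token (which is what unnest_longer() does with a list-column):
library(tidyr)
dd %>%
  mutate(medium = textrecipes:::get_tokens(medium)) %>%
  unnest_longer(medium)
Each row of the result then holds a single token in medium alongside id, artist, title and year.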

Related

How to convert rows into columns by group?

I'd like to do association analysis using the apriori algorithm.
To do that, I have to make a dataset.
The data I have looks like this.
data.frame("order_number"=c("100145", "100155", "100155", "100155",
"500002", "500002", "500002", "500007"),
"order_item"=c("27684535","15755576",
"1357954","124776249","12478324","15755576","13577","27684535"))
order_number order_item
1 100145 27684535
2 100155 15755576
3 100155 1357954
4 100155 124776249
5 500002 12478324
6 500002 15755576
7 500002 13577
8 500007 27684535
and I want to transform the data like this:
data.frame("order_number"=c("100145","100155","500002","500007"),
"col1"=c("27684535","15755576","12478324","27684535"),
"col2"=c(NA,"1357954","15755576",NA),
"col3"=c(NA,"124776249","13577",NA))
order_number col1 col2 col3
1 100145 27684535 <NA> <NA>
2 100155 15755576 1357954 124776249
3 500002 12478324 15755576 13577
4 500007 27684535 <NA> <NA>
Thank you for your help.
This is a case for pivot_wider() (or other functions for changing column layout). The first step is creating a row id variable to note whether each item is the 1st, 2nd, or 3rd within its order, then reshaping this into the data frame you want:
df <- data.frame("order_number"=c("100145", "100155", "100155", "100155",
"500002", "500002", "500002", "500007"),
"order_item"=c("27684535","15755576",
"1357954","124776249","12478324","15755576","13577","27684535"))
library(tidyr)
library(dplyr)
df |>
group_by(order_number) |>
mutate(rank = row_number()) |>
pivot_wider(names_from = rank, values_from = order_item,
names_prefix = "col")
#> # A tibble: 4 × 4
#> # Groups: order_number [4]
#> order_number col1 col2 col3
#> <chr> <chr> <chr> <chr>
#> 1 100145 27684535 <NA> <NA>
#> 2 100155 15755576 1357954 124776249
#> 3 500002 12478324 15755576 13577
#> 4 500007 27684535 <NA> <NA>
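Since the end goal here is the apriori algorithm, it may be worth noting that you don't necessarily need the wide col1/col2/col3 layout at all: the {arules} package can build its transactions object straight from the long data. A sketch, assuming {arules} is the package you'll use (the support/confidence thresholds are placeholders):
library(arules)
# one transaction per order_number, items taken from order_item
trans <- as(split(as.character(df$order_item), df$order_number), "transactions")
rules <- apriori(trans, parameter = list(supp = 0.25, conf = 0.5))
inspect(rules)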

Joining two dataframes which have variable names as values in R / SQL

I am using R.
I am trying to join two data frames which look like this:
DF1:
Name Species Value Variable_id
Jake Human 99 1
Jake Human 20 2
Mike Lizard 12 1
Mike Lizard 30 2
DF2:
Variable_id Varible_name
1 Height
2 Age
And I need it in the form of
Name Species Height Age
Jake Human 99 20
Mike Lizard 12 30
library(dplyr)
library(tidyr)
DF1 %>%
  left_join(DF2) %>%
  select(-Variable_id) %>%
  pivot_wider(names_from = Varible_name, values_from = Value)
# Joining, by = "Variable_id"
# # A tibble: 2 x 4
# Name Species Height Age
# <chr> <chr> <int> <int>
# 1 Jake Human 99 20
# 2 Mike Lizard 12 30
Using this data:
DF1 = read.table(text = 'Name Species Value Variable_id
Jake Human 99 1
Jake Human 20 2
Mike Lizard 12 1
Mike Lizard 30 2', header = T)
DF2 = read.table(text = "Variable_id Varible_name
1 Height
2 Age", header = TRUE)

R summarise_at dynamically by condition: mean for some columns, sum for others

I would like to do that (mean for some columns, sum for others), but with the conditions inside summarise_at().
Edit #1: I've added the word "dynamically" in the title. When I use vars(c()) in summarise_at() it's for quick and clear examples, but in practice I want to use contains(), starts_with() and matches(, perl = TRUE), because I have 50 columns, with many sum() and some mean().
And the goal is to generate dynamic SQL with tbl() ... %>% group_by() ... %>% summarise_at() ... %>% collect().
Edit #2: I added the generated SQL in my second example.
library(tidyverse)
(mtcars
  %>% group_by(carb)
  %>% summarise_at(vars(c("mpg", "cyl", "disp")), list(~mean(.), ~sum(.)))
  # I don't want this line below, I would like a conditional in summarise_at()
  # because I have 50 columns in my real case
  %>% select(carb, cyl_mean, disp_mean, mpg_sum)
)
#> # A tibble: 6 x 4
#> carb cyl_mean disp_mean mpg_sum
#> <dbl> <dbl> <dbl> <dbl>
#> 1 1 4.57 134. 177.
#> 2 2 5.6 208. 224
#> 3 3 8 276. 48.9
#> 4 4 7.2 309. 158.
#> 5 6 6 145 19.7
#> 6 8 8 301 15
Created on 2020-02-19 by the reprex package (v0.3.0)
This works, but I want only sum for mpg, and only mean for cyl and disp:
library(RSQLite)
library(dbplyr)
library(tidyverse)
library(DBI)
db <- dbConnect(SQLite(), ":memory:")
dbCreateTable(db, "mtcars_table", mtcars)
(tbl(db, build_sql(con = db, "select * from mtcars_table"))
  %>% group_by(carb)
  %>% summarise_at(vars(c("mpg", "cyl", "disp")), list(~mean(.), ~sum(.)))
  %>% select(carb, cyl_mean, disp_mean, mpg_sum)
  %>% show_query()
)
#> <SQL>
#> Warning: Missing values are always removed in SQL.[...] to silence this warning
#> SELECT `carb`, `cyl_mean`, `disp_mean`, `mpg_sum`
#> FROM (SELECT `carb`, AVG(`mpg`) AS `mpg_mean`, AVG(`cyl`) AS `cyl_mean`, AVG(`disp`) AS `disp_mean`, SUM(`mpg`) AS `mpg_sum`, SUM(`cyl`) AS `cyl_sum`, SUM(`disp`) AS `disp_sum`
#> FROM (select * from mtcars_table)
#> GROUP BY `carb`)
#> # Source: lazy query [?? x 4]
#> # Database: sqlite 3.30.1 [:memory:]
#> # … with 4 variables: carb <dbl>, cyl_mean <lgl>, disp_mean <lgl>,
#> # mpg_sum <lgl>
I tried all sorts of variations like the ones below, but they either don't work or produce errors.
(mtcars %>% group_by(carb)%>% summarise_at(vars(c("mpg","cyl","disp")),ifelse(vars(contains(names(.),"mpg")),list(sum(.)),list(mean(.)))) )
Not good, too many columns
library(tidyverse)
(mtcars %>% group_by(carb)%>% summarise_at(vars(c("mpg","cyl","disp")),ifelse ((names(.)=="mpg"), list(~sum(.)) , list(~mean(.)))))
#> # A tibble: 6 x 34
#> carb mpg_sum cyl_sum disp_sum mpg_mean..2 cyl_mean..2 disp_mean..2
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 177. 32 940. 25.3 4.57 134.
#> 2 2 224 56 2082. 22.4 5.6 208.
#> 3 3 48.9 24 827. 16.3 8 276.
#> 4 4 158. 72 3088. 15.8 7.2 309.
#> 5 6 19.7 6 145 19.7 6 145
#> 6 8 15 8 301 15 8 301
#> # … with 27 more variables: mpg_mean..3 <dbl>, cyl_mean..3 <dbl>,
#> # disp_mean..3 <dbl>, mpg_mean..4 <dbl>, cyl_mean..4 <dbl>,
#> # disp_mean..4 <dbl>, mpg_mean..5 <dbl>, cyl_mean..5 <dbl>,
#> # disp_mean..5 <dbl>, mpg_mean..6 <dbl>, cyl_mean..6 <dbl>,
#> # disp_mean..6 <dbl>, mpg_mean..7 <dbl>, cyl_mean..7 <dbl>,
#> # disp_mean..7 <dbl>, mpg_mean..8 <dbl>, cyl_mean..8 <dbl>,
#> # disp_mean..8 <dbl>, mpg_mean..9 <dbl>, cyl_mean..9 <dbl>,
#> # disp_mean..9 <dbl>, mpg_mean..10 <dbl>, cyl_mean..10 <dbl>,
#> # disp_mean..10 <dbl>, mpg_mean..11 <dbl>, cyl_mean..11 <dbl>,
#> # disp_mean..11 <dbl>
Some other attempts and remarks: I would like a conditional sum(.) or mean(.) depending on the name of the column in the summarise().
It would be good if it accepted not only primitive functions.
In the end it's for tbl() ... %>% group_by() ... %>% summarise_at() ... %>% collect() to generate conditional SQL with AVG() and SUM().
T-SQL functions like ~convert(varchar(...)) work for mutate_at(), and similarly ~AVG() works for summarise_at(), but I arrive at the same point: a summarise_at() that is conditional on the column names doesn't work.
:)
An option is to group_by 'carb', then create the sum of 'mpg' as another grouping variable, and then use summarise_at() with the rest of the variables needed:
library(dplyr)
mtcars %>%
  group_by(carb) %>%
  group_by(mpg_sum = sum(mpg), .add = TRUE) %>%
  summarise_at(vars(cyl, disp), list(mean = mean))
# A tibble: 6 x 4
# Groups: carb [6]
# carb mpg_sum cyl_mean disp_mean
# <dbl> <dbl> <dbl> <dbl>
#1 1 177. 4.57 134.
#2 2 224 5.6 208.
#3 3 48.9 8 276.
#4 4 158. 7.2 309.
#5 6 19.7 6 145
#6 8 15 8 301
Or, using the devel version of dplyr, this can be done in a single summarise() by wrapping blocks of columns in across(), keeping single columns by themselves, and applying different functions to each:
mtcars %>%
group_by(carb) %>%
summarise(across(one_of(c("cyl", "disp")), list(mean = mean)),
mpg_sum = sum(mpg))
# A tibble: 6 x 4
# carb cyl_mean disp_mean mpg_sum
# <dbl> <dbl> <dbl> <dbl>
#1 1 4.57 134. 177.
#2 2 5.6 208. 224
#3 3 8 276. 48.9
#4 4 7.2 309. 158.
#5 6 6 145 19.7
#6 8 8 301 15
NOTE: summarise_at()/summarise_if()/mutate_at()/mutate_if()/etc. will be superseded by across() used inside the default verbs (summarise()/mutate()/filter()/...) in the upcoming releases.
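Since the asker wants to pick columns with helpers such as starts_with()/contains()/matches() rather than listing all 50 by name, here is a minimal across() sketch along those lines (dplyr >= 1.0.0; the regexes are placeholders for the real column patterns):
library(dplyr)
mtcars %>%
  group_by(carb) %>%
  summarise(
    across(matches("^(cyl|disp)$"), list(mean = mean)),  # -> cyl_mean, disp_mean
    across(matches("^mpg$"), list(sum = sum))            # -> mpg_sum
  )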
Workaround while waiting for across(), using a regex on the generated SQL:
library(RSQLite)
library(dbplyr)
library(tidyverse)
library(DBI)
db <- dbConnect(SQLite())
mtcars_table <- mtcars %>% rename(mpg_sum = mpg, cyl_mean = cyl, disp_mean = disp)
RSQLite::dbWriteTable(db, "mtcars_table", mtcars_table)
req <- as.character((tbl(db, build_sql(con = db, "select * from mtcars_table"))
  %>% group_by(carb)
  %>% summarise_at(vars(c(ends_with("mean"), ends_with("sum"))), ~sum(.))
) %>% sql_render())
#> Warning: Missing values are always removed in SQL.
#> Use `SUM(x, na.rm = TRUE)` to silence this warning
#> This warning is displayed only once per session.
req<-gsub("(SUM)(\\(.{1,30}mean.{1,10}\\))", "AVG\\2", req, perl=TRUE)
print(req)
#> [1] "SELECT `carb`, AVG(`cyl_mean`) AS `cyl_mean`, AVG(`disp_mean`) AS `disp_mean`,
# SUM(`mpg_sum`) AS `mpg_sum`\nFROM (select * from mtcars_table)\n
# GROUP BY `carb`"
dbGetQuery(db, req)
#> carb cyl_mean disp_mean mpg_sum
#> 1 1 4.571429 134.2714 177.4
#> 2 2 5.600000 208.1600 224.0
#> 3 3 8.000000 275.8000 48.9
#> 4 4 7.200000 308.8200 157.9
#> 5 6 6.000000 145.0000 19.7
#> 6 8 8.000000 301.0000 15.0
sessionInfo()
R version 3.6.1 (2019-07-05)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.6 LTS
Matrix products: default
BLAS: /usr/lib/libblas/libblas.so.3.6.0
LAPACK: /usr/lib/lapack/liblapack.so.3.6.0
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8
[4] LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] DBI_1.1.0 forcats_0.4.0 stringr_1.4.0 dplyr_0.8.4 purrr_0.3.3
[6] readr_1.3.1 tidyr_1.0.2 tibble_2.1.3 ggplot2_3.2.1 tidyverse_1.3.0
[11] dbplyr_1.4.2 RSQLite_2.2.0
loaded via a namespace (and not attached):
[1] xfun_0.10 tidyselect_1.0.0 haven_2.2.0 lattice_0.20-38 colorspace_1.4-1
[6] vctrs_0.2.2 generics_0.0.2 htmltools_0.4.0 blob_1.2.1 rlang_0.4.4
[11] pillar_1.4.3 glue_1.3.1 withr_2.1.2 bit64_0.9-7 modelr_0.1.5
[16] readxl_1.3.1 lifecycle_0.1.0 munsell_0.5.0 gtable_0.3.0 cellranger_1.1.0
[21] rvest_0.3.5 memoise_1.1.0 evaluate_0.14 knitr_1.25 callr_3.3.2
[26] ps_1.3.0 fansi_0.4.1 broom_0.5.2 Rcpp_1.0.3 clipr_0.7.0
[31] scales_1.1.0 backports_1.1.5 jsonlite_1.6.1 fs_1.3.1 bit_1.1-15.1
[36] hms_0.5.3 digest_0.6.23 stringi_1.4.5 processx_3.4.1 grid_3.6.1
[41] cli_2.0.1 tools_3.6.1 magrittr_1.5 lazyeval_0.2.2 whisker_0.4
[46] crayon_1.3.4 pkgconfig_2.0.3 xml2_1.2.2 reprex_0.3.0 lubridate_1.7.4
[51] assertthat_0.2.1 rmarkdown_1.16 httr_1.4.1 rstudioapi_0.10 R6_2.4.1
[56] nlme_3.1-141 compiler_3.6.1
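For the database side of the question, a recent dbplyr (2.0.0 or later) also translates across(), so the AVG()/SUM() split can be expressed directly on the lazy table without post-processing the SQL with a regex. A sketch under that assumption:
library(DBI)
library(RSQLite)
library(dplyr)
library(dbplyr)
db <- dbConnect(SQLite(), ":memory:")
dbWriteTable(db, "mtcars_table", mtcars)
tbl(db, "mtcars_table") %>%
  group_by(carb) %>%
  summarise(across(c(cyl, disp), list(mean = mean)),  # translated to AVG()
            mpg_sum = sum(mpg)) %>%                   # translated to SUM()
  show_query()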

dbplyr, dplyr, and functions with no SQL equivalents [eg `slice()`]

library(tidyverse)
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
copy_to(con, mtcars)
mtcars2 <- tbl(con, "mtcars")
I can create this mock SQL database above. And it's very cool that I can perform standard dplyr functions on this "database":
mtcars2 %>%
group_by(cyl) %>%
summarise(mpg = mean(mpg, na.rm = TRUE)) %>%
arrange(desc(mpg))
#> # Source: lazy query [?? x 2]
#> # Database: sqlite 3.29.0 [:memory:]
#> # Ordered by: desc(mpg)
#> cyl mpg
#> <dbl> <dbl>
#> 1 4 26.7
#> 2 6 19.7
#> 3 8 15.1
It appears I'm unable to use dplyr functions that have no direct SQL equivalents (e.g. dplyr::slice()). In the case of slice() I can use the alternative combination of filter() and row_number() to get the same results as just using slice(). But what happens when there's not such an easy workaround?
mtcars2 %>% slice(1:5)
#>Error in UseMethod("slice_") :
#> no applicable method for 'slice_' applied to an object of class
#> "c('tbl_SQLiteConnection', 'tbl_dbi', 'tbl_sql', 'tbl_lazy', 'tbl')"
When dplyr functions have no direct SQL equivalents can I force their use with dbplyr, or is the only option to get creative with dplyr verbs that do have SQL equivalents, or just write the SQL directly (which is not my preferred solution)?
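For reference, the filter() + row_number() workaround mentioned above might look like the sketch below; note that row numbers are only well defined on a database after an explicit ordering, so dbplyr will warn about the missing ORDER BY:
mtcars2 %>%
  mutate(rn = row_number()) %>%  # translated to a ROW_NUMBER() window function
  filter(rn <= 5) %>%
  select(-rn)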
I understood this question as: How can I make slice() work for SQL databases? This is different from "forcing their use" but still might work in your case.
The example below shows how to implement a "poor man's" variant of slice() that works on the database. We still need to do the legwork and implement it with verbs that work on the database, but then we can use it similarly to data frames.
Read more about S3 classes in http://adv-r.had.co.nz/OO-essentials.html#s3.
library(tidyverse)
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
copy_to(con, mtcars)
mtcars2 <- tbl(con, "mtcars")
# mtcars2 has a class attribute
class(mtcars2)
#> [1] "tbl_SQLiteConnection" "tbl_dbi" "tbl_sql"
#> [4] "tbl_lazy" "tbl"
# slice() is an S3 method
slice
#> function(.data, ..., .preserve = FALSE) {
#> UseMethod("slice")
#> }
#> <bytecode: 0x560a03460548>
#> <environment: namespace:dplyr>
# we can implement a "poor man's" variant of slice()
# for the particular class. (It doesn't work quite the same
# in all cases.)
#' @export
slice.tbl_sql <- function(.data, ...) {
  rows <- c(...)
  .data %>%
    mutate(...row_id = row_number()) %>%
    filter(...row_id %in% !!rows) %>%
    select(-...row_id)
}
mtcars2 %>%
  slice(1:5)
#> # Source: lazy query [?? x 11]
#> # Database: sqlite 3.29.0 [:memory:]
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
#> 2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
#> 3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
#> 4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
#> 5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
Created on 2019-12-07 by the reprex package (v0.3.0)

Graphing time series in ggplot2 with CDC weeks ordered sensibly

I have a data frame ('Example') like this.
n CDCWeek Year Week
25.512324 2011-39 2011 39
26.363035 2011-4 2011 4
25.510500 2011-40 2011 40
25.810663 2011-41 2011 41
25.875451 2011-42 2011 42
25.860873 2011-43 2011 43
25.374876 2011-44 2011 44
25.292944 2011-45 2011 45
24.810807 2011-46 2011 46
24.793090 2011-47 2011 47
22.285000 2011-48 2011 48
23.015480 2011-49 2011 49
26.296376 2011-5 2011 5
22.074581 2011-50 2011 50
22.209183 2011-51 2011 51
22.270705 2011-52 2011 52
25.391377 2011-6 2011 6
25.225481 2011-7 2011 7
24.678918 2011-8 2011 8
24.382214 2011-9 2011 9
I want to plot this as a time series with 'CDCWeek' as the X-axis and 'n' as the Y using this code.
ggplot(Example, aes(CDCWeek, n, group=1)) + geom_line()
The problem I am running into is that it is not graphing CDCWeek in the right order. CDCWeek is the year followed by the week number (1 to 52 or 53 depending on the year). It is being graphed in the order shown in the data frame, with 2011-39 followed by 2011-4, etc. I understand why this is happening, but is there any way to force ggplot2 to use the proper order of weeks?
EDIT: I can't just use the 'week' variable because the actual dataset covers many years.
Thank you
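For context, the root cause is that CDCWeek is a character vector, so ggplot2 orders it alphabetically. A compact version of the fix used in the last answer below is to turn it into a factor whose levels follow Year and Week; a sketch, assuming dplyr and forcats are available:
library(dplyr)
library(forcats)
library(ggplot2)
Example <- Example %>%
  arrange(Year, Week) %>%
  mutate(CDCWeek = fct_inorder(CDCWeek))
ggplot(Example, aes(CDCWeek, n, group = 1)) + geom_line()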
aweek::get_date allows you to get weekly dates using only the year and the epiweek.
Here I create a reprex with a sequence of dates (link), extract the epiweek with lubridate::epiweek, define Sunday as the start of the week with aweek::set_week_start, summarise weekly values, create a new date vector with aweek::get_date, and plot them.
library(tidyverse)
library(lubridate)
library(aweek)
data_ts <- tibble(date = seq(ymd('2012-04-07'),
                             ymd('2014-03-22'),
                             by = '1 day')) %>%
  mutate(value = rnorm(n(), mean = 5),
         # using aweek
         epidate = date2week(date, week_start = 7),
         # using lubridate
         epiweek = epiweek(date),
         dayw = wday(date, label = T, abbr = F),
         month = month(date, label = F, abbr = F),
         year = year(date)) %>%
  print()
#> # A tibble: 715 x 7
#> date value epidate epiweek dayw month year
#> <date> <dbl> <aweek> <dbl> <ord> <dbl> <dbl>
#> 1 2012-04-07 3.54 2012-W14-7 14 sábado 4 2012
#> 2 2012-04-08 5.79 2012-W15-1 15 domingo 4 2012
#> 3 2012-04-09 4.50 2012-W15-2 15 lunes 4 2012
#> 4 2012-04-10 5.44 2012-W15-3 15 martes 4 2012
#> 5 2012-04-11 5.13 2012-W15-4 15 miércoles 4 2012
#> 6 2012-04-12 4.87 2012-W15-5 15 jueves 4 2012
#> 7 2012-04-13 3.28 2012-W15-6 15 viernes 4 2012
#> 8 2012-04-14 5.72 2012-W15-7 15 sábado 4 2012
#> 9 2012-04-15 6.91 2012-W16-1 16 domingo 4 2012
#> 10 2012-04-16 4.58 2012-W16-2 16 lunes 4 2012
#> # ... with 705 more rows
#CORE: Here you set the start of the week!
set_week_start(7) #sunday
get_week_start()
#> [1] 7
data_ts_w <- data_ts %>%
group_by(year,epiweek) %>%
summarise(sum_week_value=sum(value)) %>%
ungroup() %>%
#using aweek
mutate(epi_date=get_date(week = epiweek,year = year),
wik_date=date2week(epi_date)
) %>%
print()
#> # A tibble: 104 x 5
#> year epiweek sum_week_value epi_date wik_date
#> <dbl> <dbl> <dbl> <date> <aweek>
#> 1 2012 1 11.0 2012-01-01 2012-W01-1
#> 2 2012 14 3.54 2012-04-01 2012-W14-1
#> 3 2012 15 34.7 2012-04-08 2012-W15-1
#> 4 2012 16 35.1 2012-04-15 2012-W16-1
#> 5 2012 17 34.5 2012-04-22 2012-W17-1
#> 6 2012 18 34.7 2012-04-29 2012-W18-1
#> 7 2012 19 36.5 2012-05-06 2012-W19-1
#> 8 2012 20 32.1 2012-05-13 2012-W20-1
#> 9 2012 21 35.4 2012-05-20 2012-W21-1
#> 10 2012 22 37.5 2012-05-27 2012-W22-1
#> # ... with 94 more rows
# you can use the get_date output with ggplot
data_ts_w %>%
  slice(-(1:3)) %>%
  ggplot(aes(epi_date, sum_week_value)) +
  geom_line() +
  scale_x_date(date_breaks = "5 week", date_labels = "%Y-%U") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  labs(title = "Weekly time series",
       x = "Time (Year - CDC epidemiological week)",
       y = "Sum of weekly values")
ggsave("figure/000-timeserie-week.png", height = 3, width = 10)
Created on 2019-08-12 by the reprex package (v0.3.0)
Convert the Year and Week into a date with dplyr:
df <- df %>%
mutate(date=paste(Year, Week, 1, sep="-") %>%
as.Date(., "%Y-%U-%u"))
ggplot(df, aes(date, n, group=1)) +
geom_line() +
scale_x_date(date_breaks="8 week", date_labels = "%Y-%U")
One option would be to use the Year and Week variables you already have but facet by Year. I changed the Year variable in your data a bit to make my case.
Example$Year = rep(2011:2014, each = 5)
ggplot(Example, aes(x = Week, y = n)) +
  geom_line() +
  facet_grid(Year ~ ., scales = "free_x")
  # facet_grid(. ~ Year, scales = "free_x")
This has the added advantage of being able to compare across years. If you switch the final line to the option I've commented out then the facets will be horizontal.
Yet another option would be to group by Year as a factor level and include them all on the same figure.
ggplot(Example, aes(x = Week, y = n)) +
  geom_line(aes(group = Year, color = factor(Year)))
It turns out I just had to order Example$CDCWeek properly, and then ggplot graphs it in the right order.
1) Put the data frame in the proper order.
Example <- Example[order(Example$Year, Example$Week), ]
2) Reset the rownames.
row.names(Example) <- NULL
3) Create a new variable with the observation number from the rownames
Example$Obs <- as.numeric(rownames(Example))
4) Order the CDCWeek variable as a factor according to the observation number.
Example$CDCWeek <- factor(Example$CDCWeek, levels=Example$CDCWeek[order(Example$Obs)], ordered=TRUE)
5) Graph it
ggplot(Example, aes(CDCWeek, n, group=1)) + geom_line()
Thanks a lot for the help, everyone!