Graphing time series in ggplot2 with CDC weeks ordered sensibly

Graphing time series in ggplot2 with CDC weeks ordered sensibly - ggplot2

I have a data frame ('Example') like this.
n CDCWeek Year Week
25.512324 2011-39 2011 39
26.363035 2011-4 2011 4
25.510500 2011-40 2011 40
25.810663 2011-41 2011 41
25.875451 2011-42 2011 42
25.860873 2011-43 2011 43
25.374876 2011-44 2011 44
25.292944 2011-45 2011 45
24.810807 2011-46 2011 46
24.793090 2011-47 2011 47
22.285000 2011-48 2011 48
23.015480 2011-49 2011 49
26.296376 2011-5 2011 5
22.074581 2011-50 2011 50
22.209183 2011-51 2011 51
22.270705 2011-52 2011 52
25.391377 2011-6 2011 6
25.225481 2011-7 2011 7
24.678918 2011-8 2011 8
24.382214 2011-9 2011 9
I want to plot this as a time series with 'CDCWeek' as the X-axis and 'n' as the Y using this code.
ggplot(Example, aes(CDCWeek, n, group=1)) + geom_line()
The problem I am running into is that it is not graphing CDCWeek in the right order. CDCWeek is the year followed by the week number (1 to 52 or 53 depending on the year). It is being graphed in the order shown in the data frame, with 2011-39 followed by 2011-4, etc. I understand why this is happening but is there anyway to force ggplot2 to use the proper order of weeks?
EDIT: I can't just use the 'week' variable because the actual dataset covers many years.
Thank you

aweek::get_date allows you to get weekly dates only using the year and epiweek.
Here I created a reprex with a sequence of dates (link), extract the epiweek with lubridate::epiweek, defined sunday as start of a week with aweek::set_week_start, summarized weekly values, created a new date vector with aweek::get_date, and plot them.
library(tidyverse)
library(lubridate)
library(aweek)
data_ts <- tibble(date=seq(ymd('2012-04-07'),
ymd('2014-03-22'),
by = '1 day')) %>%
mutate(value = rnorm(n(),mean = 5),
#using aweek
epidate=date2week(date,week_start = 7),
#using lubridate
epiweek=epiweek(date),
dayw=wday(date,label = T,abbr = F),
month=month(date,label = F,abbr = F),
year=year(date)) %>%
print()
#> # A tibble: 715 x 7
#> date value epidate epiweek dayw month year
#> <date> <dbl> <aweek> <dbl> <ord> <dbl> <dbl>
#> 1 2012-04-07 3.54 2012-W14-7 14 sábado 4 2012
#> 2 2012-04-08 5.79 2012-W15-1 15 domingo 4 2012
#> 3 2012-04-09 4.50 2012-W15-2 15 lunes 4 2012
#> 4 2012-04-10 5.44 2012-W15-3 15 martes 4 2012
#> 5 2012-04-11 5.13 2012-W15-4 15 miércoles 4 2012
#> 6 2012-04-12 4.87 2012-W15-5 15 jueves 4 2012
#> 7 2012-04-13 3.28 2012-W15-6 15 viernes 4 2012
#> 8 2012-04-14 5.72 2012-W15-7 15 sábado 4 2012
#> 9 2012-04-15 6.91 2012-W16-1 16 domingo 4 2012
#> 10 2012-04-16 4.58 2012-W16-2 16 lunes 4 2012
#> # ... with 705 more rows
#CORE: Here you set the start of the week!
set_week_start(7) #sunday
get_week_start()
#> [1] 7
data_ts_w <- data_ts %>%
group_by(year,epiweek) %>%
summarise(sum_week_value=sum(value)) %>%
ungroup() %>%
#using aweek
mutate(epi_date=get_date(week = epiweek,year = year),
wik_date=date2week(epi_date)
) %>%
print()
#> # A tibble: 104 x 5
#> year epiweek sum_week_value epi_date wik_date
#> <dbl> <dbl> <dbl> <date> <aweek>
#> 1 2012 1 11.0 2012-01-01 2012-W01-1
#> 2 2012 14 3.54 2012-04-01 2012-W14-1
#> 3 2012 15 34.7 2012-04-08 2012-W15-1
#> 4 2012 16 35.1 2012-04-15 2012-W16-1
#> 5 2012 17 34.5 2012-04-22 2012-W17-1
#> 6 2012 18 34.7 2012-04-29 2012-W18-1
#> 7 2012 19 36.5 2012-05-06 2012-W19-1
#> 8 2012 20 32.1 2012-05-13 2012-W20-1
#> 9 2012 21 35.4 2012-05-20 2012-W21-1
#> 10 2012 22 37.5 2012-05-27 2012-W22-1
#> # ... with 94 more rows
#you can use get_date output with ggplot
data_ts_w %>%
slice(-(1:3)) %>%
ggplot(aes(epi_date, sum_week_value)) +
geom_line() +
scale_x_date(date_breaks="5 week", date_labels = "%Y-%U") +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
labs(title = "Weekly time serie",
x="Time (Year - CDC epidemiological week)",
y="Sum of weekly values")
ggsave("figure/000-timeserie-week.png",height = 3,width = 10)
Created on 2019-08-12 by the reprex package (v0.3.0)

Convert the Year and Week into a date with dplyr:
df <- df %>%
mutate(date=paste(Year, Week, 1, sep="-") %>%
as.Date(., "%Y-%U-%u"))
ggplot(df, aes(date, n, group=1)) +
geom_line() +
scale_x_date(date_breaks="8 week", date_labels = "%Y-%U")

One option would be to use the Year and Week variables you already have but facet by Year. I changed the Year variable in your data a bit to make my case.
Example$Year = rep(2011:2014, each = 5)
ggplot(Example, aes(x = Week, y = n)) +
geom_line() +
facet_grid(Year~., scales = "free_x")
#facet_grid(.~Year, scales = "free_x")
This has the added advantage of being able to compare across years. If you switch the final line to the option I've commented out then the facets will be horizontal.
Yet another option would be to group by Year as a factor level and include them all on the same figure.
ggplot(Example, aes(x = Week, y = n)) +
geom_line(aes(group = Year, color = factor(Year)))

It turns out I just had to order Example$CDCWeek properly and then ggplot would graph it properly.
1) Put the database in the proper order.
Example <- Example[order(Example$Year, Example$Week), ]
2) Reset the rownames.
row.names(Example) <- NULL
3) Create a new variable with the observation number from the rownames
Example$Obs <- as.numeric(rownames(Example))
4) Order the CDCWeeks variable as a factor according to the observation number
Example$CDCWeek <- factor(Example$CDCWeek, levels=Example$CDCWeek[order(Example$Obs)], ordered=TRUE)
5) Graph it
ggplot(Example, aes(CDCWeek, n, group=1)) + geom_line()
Thanks a lot for the help, everyone!

Related

How to convert large .csv file with "too many columns" into SQL database

I was given a large .csv file (around 6.5 Gb) with 25k rows and 20k columns. Let's call first column ID1 and then each additional column is a value for each of these ID1s in different conditions. Let's call these ID2s.
This is the first time I work with such large files. I wanted to process the .csv file in R and summarize the values, mean, standard deviation and coefficient of variation for each ID1.
My idea was to read the file directly (with datatable fread), convert it into "long" data (with dplyr) so I have three columns: ID1, ID2 and value. Then group them by ID1,ID2 and summarize. However, I do not seem to have enough memory to read the file (I assume R uses more memory than the file's size to store it).
I think it would be more efficient to first convert the file into a SQL database and then process it from there. I have tried to convert it using sqlite3 but it gives me an error message stating that the maximum number of columns to read are 4096.
I have no experience with SQL, so I was wondering what would be the best way of converting the .csv file into a database. I guess reading each column and storing them as a table or something like that would work.
I have searched for similar questions but most of them just say that having so many columns is a bad db design. I cannot generate the .csv file with a proper structure.
Any suggestions for an efficient way of processing the .csv file?
Best,
Edit: I was able to read the initial file in R, but I still find some problems:
1- I cannot write into a sqlite db because of the "too many columns" limit.
2- I cannot pivot it inside R because I get the error:
Error: cannot allocate vector of size 7.8 Gb
Eventhough my memory limit is high enough. I have 8.5 Gb of free memory and:
> memory.limit()
[1] 16222
I have used #danlooo 's code but the data is not in the format I would like it to be. Probably I was not clear enough explaining its structure.
Here is an example of how I would like the data to look like (ID1 = Sample, ID2 = name, value = value)
> test = input[1:5,1:5]
>
> test
Sample DRX007662 DRX007663 DRX007664 DRX014481
1: AT1G01010 12.141565 16.281420 14.482322 35.19884
2: AT1G01020 12.166693 18.054251 12.075236 37.14983
3: AT1G01030 9.396695 9.704697 8.211935 4.36051
4: AT1G01040 25.278412 24.429031 22.484845 17.51553
5: AT1G01050 64.082870 66.022141 62.268711 58.06854
> test2 = pivot_longer(test, -Sample)
> test2
# A tibble: 20 x 3
Sample name value
<chr> <chr> <dbl>
1 AT1G01010 DRX007662 12.1
2 AT1G01010 DRX007663 16.3
3 AT1G01010 DRX007664 14.5
4 AT1G01010 DRX014481 35.2
5 AT1G01020 DRX007662 12.2
6 AT1G01020 DRX007663 18.1
7 AT1G01020 DRX007664 12.1
8 AT1G01020 DRX014481 37.1
9 AT1G01030 DRX007662 9.40
10 AT1G01030 DRX007663 9.70
11 AT1G01030 DRX007664 8.21
12 AT1G01030 DRX014481 4.36
13 AT1G01040 DRX007662 25.3
14 AT1G01040 DRX007663 24.4
15 AT1G01040 DRX007664 22.5
16 AT1G01040 DRX014481 17.5
17 AT1G01050 DRX007662 64.1
18 AT1G01050 DRX007663 66.0
19 AT1G01050 DRX007664 62.3
20 AT1G01050 DRX014481 58.1
> test3 = test2 %>% group_by(Sample) %>% summarize(mean(value))
> test3
# A tibble: 5 x 2
Sample `mean(value)`
<chr> <dbl>
1 AT1G01010 19.5
2 AT1G01020 19.9
3 AT1G01030 7.92
4 AT1G01040 22.4
5 AT1G01050 62.6
How should I change the code to make it look that way?
Thanks a lot!

Pivoting in SQL is very tedious and often requires writing nested queries for each column. SQLite3 is indeed the way to go if the data can not live in the RAM. This code will read the text file in chunks, pivot the data in long format and puts it into the SQL database. Then you can access the database with dplyr verbs for summarizing. This uses another example dataset, because I have no idea which column types ID1 and ID2 have. You might want to do pivot_longer(-ID2) to have two name columns.
library(tidyverse)
library(DBI)
library(vroom)
conn <- dbConnect(RSQLite::SQLite(), "my-db.sqlite")
dbCreateTable(conn, "data", tibble(name = character(), value = character()))
file <- "https://github.com/r-lib/vroom/raw/main/inst/extdata/mtcars.csv"
chunk_size <- 10 # read this many lines of the text file at once
n_chunks <- 5
# start with offset 1 to ignore header
for(chunk_offset in seq(1, chunk_size * n_chunks, by = chunk_size)) {
# everything must be character to allow pivoting numeric and text columns
vroom(file, skip = chunk_offset, n_max = chunk_size,
col_names = FALSE, col_types = cols(.default = col_character())
) %>%
pivot_longer(everything()) %>%
dbAppendTable(conn, "data", value = .)
}
data <- conn %>% tbl("data")
data
#> # Source: table<data> [?? x 2]
#> # Database: sqlite 3.37.0 [my-db.sqlite]
#> name value
#> <chr> <chr>
#> 1 X1 Mazda RX4
#> 2 X2 21
#> 3 X3 6
#> 4 X4 160
#> 5 X5 110
#> 6 X6 3.9
#> 7 X7 2.62
#> 8 X8 16.46
#> 9 X9 0
#> 10 X10 1
#> # … with more rows
data %>%
# summarise only the 3rd column
filter(name == "X3") %>%
group_by(value) %>%
count() %>%
arrange(-n) %>%
collect()
#> # A tibble: 3 × 2
#> value n
#> <chr> <int>
#> 1 8 14
#> 2 4 11
#> 3 6 7
Created on 2022-04-15 by the reprex package (v2.0.1)

Calculate mean-deviated values (subtract mean of all columns except one from this one column)

I have a dataset with the following structure:
df <- data.frame(id = 1:5,
study = c("st1","st2","st3","st4","st5"),
a_var = c(10,20,30,40,50),
b_var = c(6,5,4,3,2),
c_var = c(3,4,5,6,7),
d_var = c(80,70,60,50,40))
I would like to calculate the difference between each column that has _var in its name and the mean of all other columns containing _var in their names, like this:
mean_deviated_value <- function(data, variable) {
md_value = data[,variable] - rowMeans(data[,names(data) != variable])
md_value
}
df$a_var_md <- mean_deviated_value(dplyr::select(df, contains("_var")), "a_var")
df$b_var_md <- mean_deviated_value(dplyr::select(df, contains("_var")), "b_var")
df$c_var_md <- mean_deviated_value(dplyr::select(df, contains("_var")), "c_var")
df$d_var_md <- mean_deviated_value(dplyr::select(df, contains("_var")), "d_var")
Which gives me my desired output:
id study a_var b_var c_var d_var a_var_md b_var_md c_var_md d_var_md
1 1 st1 10 6 3 80 -19.666667 -12.33333 -9.80 83.80000
2 2 st2 20 5 4 70 -6.333333 -16.91667 -10.35 70.76667
3 3 st3 30 4 5 60 7.000000 -21.50000 -10.90 57.73333
4 4 st4 40 3 6 50 20.333333 -26.08333 -11.45 44.70000
5 5 st5 50 2 7 40 33.666667 -30.66667 -12.00 31.66667
How do I do it in one go, without repeating the code, preferably with dplyr/purrr?
I tried this:
df %>%
mutate(across(contains("_var"), ~ list(md = .x - rowMeans(select(., contains("_var") & !.x)))))
And got this error:
Error: Problem with `mutate()` input `..1`.
ℹ `..1 = across(...)`.
x no applicable method for 'select' applied to an object of class "c('double', 'numeric')"

We can use map_dfc with transmute to create *_md columns, and glue syntax for the names.
library(tidyverse)
nms <- names(df) %>%
str_subset('^.*_')
bind_cols(df, map_dfc(nms, ~transmute(df, '{.x}_md' := mean_deviated_value(select(df, contains("_var")), .x))))
#> id study a_var b_var c_var d_var a_var_md b_var_md c_var_md d_var_md
#> 1 1 st1 10 6 3 80 -19.666667 -25.00000 -29.00000 73.66667
#> 2 2 st2 20 5 4 70 -6.333333 -26.33333 -27.66667 60.33333
#> 3 3 st3 30 4 5 60 7.000000 -27.66667 -26.33333 47.00000
#> 4 4 st4 40 3 6 50 20.333333 -29.00000 -25.00000 33.66667
#> 5 5 st5 50 2 7 40 33.666667 -30.33333 -23.66667 20.33333
Note that if you use assigment. The first time rowMeans will compute with b_var, c_bar and d_bar. But the second time, contains("_var") will also capture the previously created a_var_md and use it to compute the means. I don't know if this is intended behaviour but it is worth mentioning.
df$a_var_md <- mean_deviated_value(dplyr::select(df, contains("_var")), "a_var")
select(df, contains("_var"))
#> a_var b_var c_var d_var a_var_md
#> 1 10 6 3 80 -19.666667
#> 2 20 5 4 70 -6.333333
#> 3 30 4 5 60 7.000000
#> 4 40 3 6 50 20.333333
#> 5 50 2 7 40 33.666667
We can avoid this by replacing contains("_var") with matches("^.*_var$")
Created on 2021-12-20 by the reprex package (v2.0.1)

Standard deviation with groupby(multiple columns) Pandas

I am working with data from the California Air Resources Board.
site,monitor,date,start_hour,value,variable,units,quality,prelim,name
5407,t,2014-01-01,0,3.00,PM25HR,Micrograms/Cubic Meter ( ug/m<sup>3</sup> ),0,y,Bombay Beach
5407,t,2014-01-01,1,1.54,PM25HR,Micrograms/Cubic Meter ( ug/m<sup>3</sup> ),0,y,Bombay Beach
5407,t,2014-01-01,2,3.76,PM25HR,Micrograms/Cubic Meter ( ug/m<sup>3</sup> ),0,y,Bombay Beach
5407,t,2014-01-01,3,5.98,PM25HR,Micrograms/Cubic Meter ( ug/m<sup>3</sup> ),0,y,Bombay Beach
5407,t,2014-01-01,4,8.09,PM25HR,Micrograms/Cubic Meter ( ug/m<sup>3</sup> ),0,y,Bombay Beach
5407,t,2014-01-01,5,12.05,PM25HR,Micrograms/Cubic Meter ( ug/m<sup>3</sup> ),0,y,Bombay Beach
5407,t,2014-01-01,6,12.55,PM25HR,Micrograms/Cubic Meter ( ug/m<sup>3</sup> ),0,y,Bombay Beach
...
df = pd.concat([pd.read_csv(file, header = 0) for file in f]) #merges all files into one dataframe
df.dropna(axis = 0, how = "all", subset = ['start_hour', 'variable'],
inplace = True) #drops bottom columns without data in them, NaN
df.start_hour = pd.to_timedelta(df['start_hour'], unit = 'h')
df.date = pd.to_datetime(df.date)
df['datetime'] = df.date + df.start_hour
df.drop(columns=['date', 'start_hour'], inplace=True)
df['month'] = df.datetime.dt.month
df['day'] = df.datetime.dt.day
df['year'] = df.datetime.dt.year
df.set_index('datetime', inplace = True)
df = df.rename(columns={'value':'conc'})
I have multiple years of hourly PM2.5 concentration data and am trying to prepare graphs that show the average monthly concentration over many years (different graphs for each month). Here's an image of the graph I've created thus far. [![Bombay Beach][1]][1] However, I want to add error bars to the average concentration line but I am having issues when attempting to calculate the standard deviation. I've created a new dataframe d_avg that includes the year, month, day, and average concentration of PM2.5; here's some of the data.
d_avg = df.groupby(['year', 'month', 'day'], as_index=False)['conc'].mean()
year month day conc
0 2014 1 1 9.644583
1 2014 1 2 4.945652
2 2014 1 3 4.345238
3 2014 1 4 5.047917
4 2014 1 5 5.212857
5 2014 1 6 2.095714
After this, I found the monthly average m_avg and created a datetime index to plot datetime vs monthly avg conc (refer above, black line).
m_avg = d_avg.groupby(['year','month'], as_index=False)['conc'].mean()
m_avg['datetime'] = pd.to_datetime(m_avg.year.astype(str) + m_avg.month.astype(str), format='%Y%m') + MonthEnd(1)
[In]: m_avg.head(6)
[Out]:
year month conc datetime
0 2014 1 4.330985 2014-01-31
1 2014 2 2.280096 2014-02-28
2 2014 3 4.464622 2014-03-31
3 2014 4 6.583759 2014-04-30
4 2014 5 9.069353 2014-05-31
5 2014 6 9.982330 2014-06-30
Now I want to calculate the standard deviation of the d_avg concentration, and I've tried multiple things:
sd = d_avg.groupby(['year', 'month'], as_index=False)['conc'].std()
sd = d_avg.groupby(['year', 'month'], as_index=False)['conc'].agg(np.std)
sd = d_avg['conc'].apply(lambda x: x.std())
However, each attempt has left me with the same error in the dataframe. I am unable to plot the standard deviation because I believe it is taking the standard deviation of the year and month too, which I am trying to group the data by. Here's what my resulting dataframe sd looks like:
year month sd
0 44.877611 1.000000 1.795868
1 44.877611 1.414214 2.355055
2 44.877611 1.732051 2.597531
3 44.877611 2.000000 2.538749
4 44.877611 2.236068 5.456785
5 44.877611 2.449490 3.315546
Please help me!
[1]: https://i.stack.imgur.com/ueVrG.png

I tried to reproduce your error and it works fine for me. Here's my complete code sample, which is pretty much exactly the same as yours EXCEPT for the generation of the original dataframe. So I'd suspect that part of the code. Can you provide the code that creates the dataframe?
import pandas as pd
columns = ['year', 'month', 'day', 'conc']
data = [[2014, 1, 1, 2.0],
[2014, 1, 1, 4.0],
[2014, 1, 2, 6.0],
[2014, 1, 2, 8.0],
[2014, 2, 1, 2.0],
[2014, 2, 1, 6.0],
[2014, 2, 2, 10.0],
[2014, 2, 2, 14.0]]
df = pd.DataFrame(data, columns=columns)
d_avg = df.groupby(['year', 'month', 'day'], as_index=False)['conc'].mean()
m_avg = d_avg.groupby(['year', 'month'], as_index=False)['conc'].mean()
m_std = d_avg.groupby(['year', 'month'], as_index=False)['conc'].std()
print(f'Concentrations:\n{df}\n')
print(f'Daily Average:\n{d_avg}\n')
print(f'Monthly Average:\n{m_avg}\n')
print(f'Standard Deviation:\n{m_std}\n')
Outputs:
Concentrations:
year month day conc
0 2014 1 1 2.0
1 2014 1 1 4.0
2 2014 1 2 6.0
3 2014 1 2 8.0
4 2014 2 1 2.0
5 2014 2 1 6.0
6 2014 2 2 10.0
7 2014 2 2 14.0
Daily Average:
year month day conc
0 2014 1 1 3.0
1 2014 1 2 7.0
2 2014 2 1 4.0
3 2014 2 2 12.0
Monthly Average:
year month conc
0 2014 1 5.0
1 2014 2 8.0
Monthly Standard Deviation:
year month conc
0 2014 1 2.828427
1 2014 2 5.656854

I decided to dance around my issue since I couldn't figure out what was causing the problem. I merged the m_avg and sd dataframes and dropped the year and month columns that were causing me issues. See code below, lots of renaming.
d_avg = df.groupby(['year', 'month', 'day'], as_index=False)['conc'].mean()
m_avg = d_avg.groupby(['year','month'], as_index=False)['conc'].mean()
sd = d_avg.groupby(['year', 'month'], as_index=False)['conc'].std(ddof=0)
sd = sd.rename(columns={"conc":"sd", "year":"wrongyr", "month":"wrongmth"})
m_avg_sd = pd.concat([m_avg, sd], axis = 1)
m_avg_sd.drop(columns=['wrongyr', 'wrongmth'], inplace = True)
m_avg_sd['datetime'] = pd.to_datetime(m_avg_sd.year.astype(str) + m_avg_sd.month.astype(str), format='%Y%m') + MonthEnd(1)
and here's the new dataframe:
m_avg_sd.head(5)
Out[2]:
year month conc sd datetime
0 2009 1 48.350105 18.394192 2009-01-31
1 2009 2 21.929383 16.293645 2009-02-28
2 2009 3 15.094729 6.821124 2009-03-31
3 2009 4 12.021009 4.391219 2009-04-30
4 2009 5 13.449100 4.081734 2009-05-31

Pandas 1.0 create column of months from year and date

I have a dataframe df with values as:
df.iloc[1:4, 7:9]
Year Month
38 2020 4
65 2021 4
92 2022 4
I am trying to create a new MonthIdx column as:
df['MonthIdx'] = pd.to_timedelta(df['Year'], unit='Y') + pd.to_timedelta(df['Month'], unit='M') + pd.to_timedelta(1, unit='D')
But I get the error:
ValueError: Units 'M' and 'Y' are no longer supported, as they do not represent unambiguous timedelta values durations.
Following is the desired output:
df['MonthIdx']
MonthIdx
38 2020/04/01
65 2021/04/01
92 2022/04/01

So you can pad the month value in a series, and then reformat to get a datetime for all of the values:
month = df.Month.astype(str).str.pad(width=2, side='left', fillchar='0')
df['MonthIdx'] = pd.to_datetime(pd.Series([int('%d%s' % (x,y)) for x,y in zip(df['Year'],month)]),format='%Y%m')
This will give you:
Year Month MonthIdx
0 2020 4 2020-04-01
1 2021 4 2021-04-01
2 2022 4 2022-04-01
You can reformat the date to be a string to match exactly your format:
df['MonthIdx'] = df['MonthIdx'].apply(lambda x: x.strftime('%Y/%m/%d'))
Giving you:
Year Month MonthIdx
0 2020 4 2020/04/01
1 2021 4 2021/04/01
2 2022 4 2022/04/01

return rows which elements are duplicates, not the logical vector

I know the duplicated-function of the package dplyr. The problem is that it only returns a logical vector indicating which elements (rows) are duplicates.
I want to get a vector which gives back those rows with the certain elements.
I want to get back all the observations of A and B because they have for the key Name and year duplicated values.
I already have coded this:
>df %>% group_by(Name) %>% filter(any(( ?????)))
but I dont know how to write the last part of code.
Anyone any ideas?
Thanks :)

An option using dplyr can be achieved by grouping on both Name and Year to calculate count. Afterwards group on only Name and filter for groups having any count > 1 (meaning duplicate):
library(dplyr)
df %>% group_by(Name, Year) %>%
mutate(count = n()) %>%
group_by(Name) %>%
filter(any(count > 1)) %>%
select(-count)
# # A tibble: 7 x 3
# # Groups: Name [2]
# Name Year Value
# <chr> <int> <int>
# 1 A 1990 5
# 2 A 1990 3
# 3 A 1991 5
# 4 A 1995 5
# 5 B 2000 0
# 6 B 2000 4
# 7 B 1998 5
Data:
df <- read.table(text =
"Name Year Value
A 1990 5
A 1990 3
A 1991 5
A 1995 5
B 2000 0
B 2000 4
B 1998 5
C 1890 3
C 1790 2",
header = TRUE, stringsAsFactors = FALSE)

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas

Graphing time series in ggplot2 with CDC weeks ordered sensibly - ggplot2

Convert the Year and Week into a date with dplyr: df <- df %>% mutate(date=paste(Year, Week, 1, sep="-") %>% as.Date(., "%Y-%U-%u")) ggplot(df, aes(date, n, group=1)) + geom_line() + scale_x_date(date_breaks="8 week", date_labels = "%Y-%U")

Related

How to convert large .csv file with "too many columns" into SQL database

Calculate mean-deviated values (subtract mean of all columns except one from this one column)

Standard deviation with groupby(multiple columns) Pandas

Pandas 1.0 create column of months from year and date

return rows which elements are duplicates, not the logical vector

Categories

Resources