How can I correct a spread() error when converting rows to columns? - tidyverse

I am trying to convert rows to columns with the spread() function (tidyr) and it gives the following error: Error in spread():
! Each row of output must be identified by a unique combination of keys.
I have this data frame
Month     pH
January   7.2
January   5.2
February  4.0
February  7.3
March     7.1
March     5.0
There are about 8,000 pH values from January to December, roughly 700 per month, but the months have different lengths.
I want this
January  February  March
7.2      4.0       7.1
5.2      7.3       5.0

This happens because there is no unique identifier for each row. Also, spread() is deprecated, so you can use pivot_wider() instead.
Data
data <- tibble::tribble(
  ~Month,     ~pH,
  "January",  7.2,
  "January",  5.2,
  "February", 4,
  "February", 7.3,
  "March",    7.1,
  "March",    5
)
Code
library(dplyr)
library(tidyr)
data %>%
  group_by(Month) %>%
  mutate(id = row_number()) %>%
  pivot_wider(names_from = Month, values_from = pH)
Output
# A tibble: 2 x 4
id January February March
<int> <dbl> <dbl> <dbl>
1 1 7.2 4 7.1
2 2 5.2 7.3 5

Related

Rolling Rows in pandas.DataFrame

I have a dataframe that looks like this:
year  month  valueCounts
2019  1      73.411285
2019  2      53.589128
2019  3      71.103842
2019  4      79.528084
I want the valueCounts column's values to be rolled like:
year  month  valueCounts
2019  1      53.589128
2019  2      71.103842
2019  3      79.528084
2019  4      NaN
I can do this by dropping the first index of the dataframe and assigning NaN to the last index, but that doesn't look efficient. Is there a simpler method to do this?
Thanks.
Assuming your dataframe is already sorted, use shift:
df['valueCounts'] = df['valueCounts'].shift(-1)
print(df)
# Output
year month valueCounts
0 2019 1 53.589128
1 2019 2 71.103842
2 2019 3 79.528084
3 2019 4 NaN
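If the real data spans several years and the roll should not cross year boundaries, shift can also be applied per group (a sketch; the multi-year frame below is made up):
import pandas as pd

df = pd.DataFrame({'year': [2019, 2019, 2020, 2020],
                   'month': [11, 12, 1, 2],
                   'valueCounts': [1.0, 2.0, 3.0, 4.0]})
# Shift within each year, so the last month of every year becomes NaN
df['valueCounts'] = df.groupby('year')['valueCounts'].shift(-1)
print(df)
#    year  month  valueCounts
# 0  2019     11          2.0
# 1  2019     12          NaN
# 2  2020      1          4.0
# 3  2020      2          NaN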

Standard deviation with groupby(multiple columns) Pandas

I am working with data from the California Air Resources Board.
site,monitor,date,start_hour,value,variable,units,quality,prelim,name
5407,t,2014-01-01,0,3.00,PM25HR,Micrograms/Cubic Meter ( ug/m³ ),0,y,Bombay Beach
5407,t,2014-01-01,1,1.54,PM25HR,Micrograms/Cubic Meter ( ug/m³ ),0,y,Bombay Beach
5407,t,2014-01-01,2,3.76,PM25HR,Micrograms/Cubic Meter ( ug/m³ ),0,y,Bombay Beach
5407,t,2014-01-01,3,5.98,PM25HR,Micrograms/Cubic Meter ( ug/m³ ),0,y,Bombay Beach
5407,t,2014-01-01,4,8.09,PM25HR,Micrograms/Cubic Meter ( ug/m³ ),0,y,Bombay Beach
5407,t,2014-01-01,5,12.05,PM25HR,Micrograms/Cubic Meter ( ug/m³ ),0,y,Bombay Beach
5407,t,2014-01-01,6,12.55,PM25HR,Micrograms/Cubic Meter ( ug/m³ ),0,y,Bombay Beach
...
import pandas as pd
from pandas.tseries.offsets import MonthEnd  # needed later to build month-end dates

df = pd.concat([pd.read_csv(file, header=0) for file in f])  # merges all files into one dataframe (f is the list of CSV paths)
df.dropna(axis=0, how="all", subset=['start_hour', 'variable'],
          inplace=True)  # drops bottom rows with no data in them (NaN)
df.start_hour = pd.to_timedelta(df['start_hour'], unit='h')
df.date = pd.to_datetime(df.date)
df['datetime'] = df.date + df.start_hour
df.drop(columns=['date', 'start_hour'], inplace=True)
df['month'] = df.datetime.dt.month
df['day'] = df.datetime.dt.day
df['year'] = df.datetime.dt.year
df.set_index('datetime', inplace=True)
df = df.rename(columns={'value': 'conc'})
I have multiple years of hourly PM2.5 concentration data and am trying to prepare graphs that show the average monthly concentration over many years (different graphs for each month); see the Bombay Beach figure at https://i.stack.imgur.com/ueVrG.png. However, I want to add error bars to the average concentration line, and I am having issues when attempting to calculate the standard deviation. I've created a new dataframe d_avg that includes the year, month, day, and average concentration of PM2.5; here's some of the data.
d_avg = df.groupby(['year', 'month', 'day'], as_index=False)['conc'].mean()
year month day conc
0 2014 1 1 9.644583
1 2014 1 2 4.945652
2 2014 1 3 4.345238
3 2014 1 4 5.047917
4 2014 1 5 5.212857
5 2014 1 6 2.095714
After this, I found the monthly average m_avg and created a datetime index to plot datetime vs monthly avg conc (refer above, black line).
m_avg = d_avg.groupby(['year','month'], as_index=False)['conc'].mean()
m_avg['datetime'] = pd.to_datetime(m_avg.year.astype(str) + m_avg.month.astype(str), format='%Y%m') + MonthEnd(1)
[In]: m_avg.head(6)
[Out]:
year month conc datetime
0 2014 1 4.330985 2014-01-31
1 2014 2 2.280096 2014-02-28
2 2014 3 4.464622 2014-03-31
3 2014 4 6.583759 2014-04-30
4 2014 5 9.069353 2014-05-31
5 2014 6 9.982330 2014-06-30
Now I want to calculate the standard deviation of the d_avg concentration, and I've tried multiple things:
sd = d_avg.groupby(['year', 'month'], as_index=False)['conc'].std()
sd = d_avg.groupby(['year', 'month'], as_index=False)['conc'].agg(np.std)
sd = d_avg['conc'].apply(lambda x: x.std())
However, each attempt has left me with the same problem in the resulting dataframe: I am unable to plot the standard deviation because it appears to also be taking the standard deviation of the year and month columns, which I am trying to group the data by. Here's what my resulting dataframe sd looks like:
year month sd
0 44.877611 1.000000 1.795868
1 44.877611 1.414214 2.355055
2 44.877611 1.732051 2.597531
3 44.877611 2.000000 2.538749
4 44.877611 2.236068 5.456785
5 44.877611 2.449490 3.315546
Please help me!
I tried to reproduce your error and it works fine for me. Here's my complete code sample, which is pretty much exactly the same as yours EXCEPT for the generation of the original dataframe. So I'd suspect that part of the code. Can you provide the code that creates the dataframe?
import pandas as pd
columns = ['year', 'month', 'day', 'conc']
data = [[2014, 1, 1, 2.0],
[2014, 1, 1, 4.0],
[2014, 1, 2, 6.0],
[2014, 1, 2, 8.0],
[2014, 2, 1, 2.0],
[2014, 2, 1, 6.0],
[2014, 2, 2, 10.0],
[2014, 2, 2, 14.0]]
df = pd.DataFrame(data, columns=columns)
d_avg = df.groupby(['year', 'month', 'day'], as_index=False)['conc'].mean()
m_avg = d_avg.groupby(['year', 'month'], as_index=False)['conc'].mean()
m_std = d_avg.groupby(['year', 'month'], as_index=False)['conc'].std()
print(f'Concentrations:\n{df}\n')
print(f'Daily Average:\n{d_avg}\n')
print(f'Monthly Average:\n{m_avg}\n')
print(f'Monthly Standard Deviation:\n{m_std}\n')
Outputs:
Concentrations:
year month day conc
0 2014 1 1 2.0
1 2014 1 1 4.0
2 2014 1 2 6.0
3 2014 1 2 8.0
4 2014 2 1 2.0
5 2014 2 1 6.0
6 2014 2 2 10.0
7 2014 2 2 14.0
Daily Average:
year month day conc
0 2014 1 1 3.0
1 2014 1 2 7.0
2 2014 2 1 4.0
3 2014 2 2 12.0
Monthly Average:
year month conc
0 2014 1 5.0
1 2014 2 8.0
Monthly Standard Deviation:
year month conc
0 2014 1 2.828427
1 2014 2 5.656854
I decided to dance around my issue since I couldn't figure out what was causing the problem. I merged the m_avg and sd dataframes and dropped the duplicated year and month columns that were causing me issues. See the code below; there is a lot of renaming.
d_avg = df.groupby(['year', 'month', 'day'], as_index=False)['conc'].mean()
m_avg = d_avg.groupby(['year','month'], as_index=False)['conc'].mean()
sd = d_avg.groupby(['year', 'month'], as_index=False)['conc'].std(ddof=0)
sd = sd.rename(columns={"conc":"sd", "year":"wrongyr", "month":"wrongmth"})
m_avg_sd = pd.concat([m_avg, sd], axis = 1)
m_avg_sd.drop(columns=['wrongyr', 'wrongmth'], inplace = True)
m_avg_sd['datetime'] = pd.to_datetime(m_avg_sd.year.astype(str) + m_avg_sd.month.astype(str), format='%Y%m') + MonthEnd(1)
and here's the new dataframe:
m_avg_sd.head(5)
Out[2]:
year month conc sd datetime
0 2009 1 48.350105 18.394192 2009-01-31
1 2009 2 21.929383 16.293645 2009-02-28
2 2009 3 15.094729 6.821124 2009-03-31
3 2009 4 12.021009 4.391219 2009-04-30
4 2009 5 13.449100 4.081734 2009-05-31
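For what it's worth, both statistics can be computed in a single groupby with named aggregation, which avoids the merge-and-drop step entirely (a sketch, assuming pandas 0.25+; the d_avg values below are the ones from the toy example above):
import pandas as pd

d_avg = pd.DataFrame({'year': [2014, 2014, 2014, 2014],
                      'month': [1, 1, 2, 2],
                      'conc': [3.0, 7.0, 4.0, 12.0]})
# Named aggregation computes the mean and the std in one pass,
# so there are no duplicate year/month columns to drop afterwards
m_avg_sd = d_avg.groupby(['year', 'month'], as_index=False)['conc'].agg(conc='mean', sd='std')
print(m_avg_sd)
#    year  month  conc        sd
# 0  2014      1   5.0  2.828427
# 1  2014      2   8.0  5.656854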

dataframe - timespan between timestamps based on value of other column

I have a pandas dataframe with the index containing years and a column containing dividend payouts. I now want to determine how many years the company has continuously paid out dividends (column dividend > 0).
As an example, for the following table I want the result to be 2 (2019+2018)
year dividend
2019 1.89
2018 1.70
2017 0
2016 1.5
And for this one 4
year dividend
2019 1.89
2018 1.70
2017 1.6
2016 1.58
Even though the answer below is a roundabout, it can solve your problem.
Convert the data to a DataFrame and use idxmin() with an ne() wrapper to find the continuous dividend payments.
year dividend
2019 1.89
2018 1.70
2017 0
2016 1.5
df = pd.DataFrame({'dividend' : [1.89,1.70,0,1.5]}, index=[2019,2018,2017,2016])
print('Continuous Dividend value :', df.loc[ : df.dividend.idxmin()].ne(0).sum()[0])
Continuous Dividend value : 2
year dividend
2019 1.89
2018 1.70
2017 1.6
2016 1.58
df = pd.DataFrame({'dividend' : [1.89,1.70,1.6,1.58]}, index=[2019,2018,2017,2016])
print('Continuous Dividend value :', df.loc[ : df.dividend.idxmin()].ne(0).sum()[0])
Continuous Dividend value : 4
Edit
Based on your comment, I just played with loops; I hope this satisfies your requirements. If not, let me know.
value = (0 if df['bool'].iloc[0] == 0
         else (len(df) if len(df) == df.iloc[-1::]['bool'].values[0]
               else len(df.loc[: df[df['bool'].duplicated(keep='last')].index[0]])))
print('Continuous Dividend value : ', value)
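For reference, a more compact alternative (a sketch; it assumes the rows are ordered from the most recent year downwards, as in the examples above): a boolean cumprod zeroes out everything from the first non-payout year on, so its sum counts the leading streak of dividend > 0.
import pandas as pd

df = pd.DataFrame({'dividend': [1.89, 1.70, 0, 1.5]}, index=[2019, 2018, 2017, 2016])
# gt(0) flags payout years; cumprod() kills the run at the first 0; sum() counts the streak
streak = int(df['dividend'].gt(0).cumprod().sum())
print(streak)  # 2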

Sort an alphanumeric column in pandas and replace it with the original column of the dataset [duplicate]

I have a data frame like this:
print(df)
0 1 2
0 354.7 April 4.0
1 55.4 August 8.0
2 176.5 December 12.0
3 95.5 February 2.0
4 85.6 January 1.0
5 152 July 7.0
6 238.7 June 6.0
7 104.8 March 3.0
8 283.5 May 5.0
9 278.8 November 11.0
10 249.6 October 10.0
11 212.7 September 9.0
As you can see, months are not in calendar order. So I created a second column to get the month number corresponding to each month (1-12). From there, how can I sort this data frame according to calendar months' order?
Use sort_values to sort the df by a specific column's values:
In [18]:
df.sort_values('2')
Out[18]:
0 1 2
4 85.6 January 1.0
3 95.5 February 2.0
7 104.8 March 3.0
0 354.7 April 4.0
8 283.5 May 5.0
6 238.7 June 6.0
5 152.0 July 7.0
1 55.4 August 8.0
11 212.7 September 9.0
10 249.6 October 10.0
9 278.8 November 11.0
2 176.5 December 12.0
If you want to sort by two columns, pass a list of column labels to sort_values with the column labels ordered according to sort priority. If you use df.sort_values(['2', '0']), the result would be sorted by column 2 then column 0. Granted, this does not really make sense for this example because each value in df['2'] is unique.
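To see the tie-breaking in action, here is a quick sketch with a toy frame (hypothetical data, string column labels):
import pandas as pd

df = pd.DataFrame({'2': [1, 1, 2], '0': [9.0, 3.0, 5.0]})
# Rows are ordered by column '2' first; ties are broken by column '0'
print(df.sort_values(['2', '0']))
#    2    0
# 1  1  3.0
# 0  1  9.0
# 2  2  5.0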
I tried the solutions above and did not achieve the expected results, so I found a different solution that works for me. ascending=False orders the dataframe in descending order; by default it is True. I am using Python 3.6.6 and pandas 0.23.4.
final_df = df.sort_values(by=['2'], ascending=False)
You can see more details in pandas documentation here.
Using column name worked for me.
sorted_df = df.sort_values(by=['Column_name'], ascending=True)
Pandas' sort_values does the work.
There are various parameters one can pass, such as ascending (bool or list of bool):
Sort ascending vs. descending. Specify list for multiple sort orders. If this is a list of bools, must match the length of the by.
As the default is ascending and the OP's goal is to sort ascending, one doesn't need to specify that parameter (see the last note below for sorting descending). One can use one of the following approaches:
Performing the operation in-place, and keeping the same variable name. This requires one to pass inplace=True as follows:
df.sort_values(by=['2'], inplace=True)
# or
df.sort_values(by = '2', inplace = True)
# or
df.sort_values('2', inplace = True)
If doing the operation in-place is not a requirement, one can assign the change (sort) to a variable:
With the same name of the original dataframe, df as
df = df.sort_values(by=['2'])
With a different name, such as df_new, as
df_new = df.sort_values(by=['2'])
All the previous operations give the following output
0 1 2
4 85.6 January 1.0
3 95.5 February 2.0
7 104.8 March 3.0
0 354.7 April 4.0
8 283.5 May 5.0
6 238.7 June 6.0
5 152 July 7.0
1 55.4 August 8.0
11 212.7 September 9.0
10 249.6 October 10.0
9 278.8 November 11.0
2 176.5 December 12.0
Finally, one can reset the index with pandas.DataFrame.reset_index, to get the following
df.reset_index(drop = True, inplace = True)
# or
df = df.reset_index(drop = True)
[Out]:
0 1 2
0 85.6 January 1.0
1 95.5 February 2.0
2 104.8 March 3.0
3 354.7 April 4.0
4 283.5 May 5.0
5 238.7 June 6.0
6 152 July 7.0
7 55.4 August 8.0
8 212.7 September 9.0
9 249.6 October 10.0
10 278.8 November 11.0
11 176.5 December 12.0
A one-liner that sorts ascending, and resets the index would be as follows
df = df.sort_values(by=['2']).reset_index(drop = True)
[Out]:
0 1 2
0 85.6 January 1.0
1 95.5 February 2.0
2 104.8 March 3.0
3 354.7 April 4.0
4 283.5 May 5.0
5 238.7 June 6.0
6 152 July 7.0
7 55.4 August 8.0
8 212.7 September 9.0
9 249.6 October 10.0
10 278.8 November 11.0
11 176.5 December 12.0
Notes:
If one is not doing the operation in-place, forgetting to assign the result may lead one to not get the expected result.
There are strong opinions on using inplace. For that, one might want to read this.
This assumes that column 2 is numeric, not a string. If it is a string, one will have to convert it:
Using pandas.to_numeric
df['2'] = pd.to_numeric(df['2'])
Using pandas.Series.astype
df['2'] = df['2'].astype(float)
If one wants in descending order, one needs to pass ascending=False as
df = df.sort_values(by=['2'], ascending=False)
# or
df.sort_values(by = '2', ascending=False, inplace=True)
[Out]:
0 1 2
2 176.5 December 12.0
9 278.8 November 11.0
10 249.6 October 10.0
11 212.7 September 9.0
1 55.4 August 8.0
5 152 July 7.0
6 238.7 June 6.0
8 283.5 May 5.0
0 354.7 April 4.0
7 104.8 March 3.0
3 95.5 February 2.0
4 85.6 January 1.0
Just as another solution: instead of creating the second column, you can categorize your string data (month names) and sort by that, like this:
df.rename(columns={1: 'month'}, inplace=True)
# Categories are listed December -> January, so sorting with ascending=False yields January -> December
df['month'] = pd.Categorical(df['month'], categories=['December', 'November', 'October', 'September', 'August', 'July', 'June', 'May', 'April', 'March', 'February', 'January'], ordered=True)
df = df.sort_values('month', ascending=False)
It will give you the ordered data by month name as you specified while creating the Categorical object.
Just adding some more operations on the data. Suppose we have a dataframe df; we can do several operations to get the desired output:
ID cost tax label
1 216590 1600 test
2 523213 1800 test
3 250 1500 experiment
(df['label'].value_counts().to_frame().reset_index()).sort_values('label', ascending=False)
will give sorted output of labels as a dataframe
index label
0 test 2
1 experiment 1
This worked for me
df.sort_values(by='Column_name', inplace=True, ascending=False)
You probably need to reset the index after sorting:
df = df.sort_values('2')
df = df.reset_index(drop=True)
Here is the template of sort_values according to the pandas documentation.
DataFrame.sort_values(by, axis=0, ascending=True,
                      inplace=False, kind='quicksort',
                      na_position='last', ignore_index=False,
                      key=None)
In this case it will be like this.
df.sort_values(by=['2'])
API Reference pandas.DataFrame.sort_values
Just adding a few more insights
df = raw_df['2'].sort_values()  # sorts only the one column '2' and returns a sorted Series
but,
df = raw_df.sort_values(by=["2"], ascending=False)  # sorts the whole df in descending order on the basis of the column "2"
If you want to sort a column in a custom, non-alphabetical order, and don't want to sort by the raw values with pd.sort_values(), you can try the solution below.
Problem: sort column "col1" in the sequence ['A', 'C', 'D', 'B']
import pandas as pd
import numpy as np
## Sample DataFrame ##
df = pd.DataFrame({'col1': ['A', 'B', 'D', 'C', 'A']})
>>> df
col1
0 A
1 B
2 D
3 C
4 A
## Solution ##
conditions = []
values = []
for i, j in enumerate(['A', 'C', 'D', 'B']):
    conditions.append(df['col1'] == j)
    values.append(i)
# np.select maps each value to its position in the desired sequence
df['col1_Num'] = np.select(conditions, values)
df.sort_values(by='col1_Num', inplace=True)
>>> df
col1 col1_Num
0 A 0
4 A 0
3 C 1
2 D 2
1 B 3
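On pandas 1.1 or newer, the same custom order can also be expressed without a helper column, using the key parameter of sort_values (a sketch under that version assumption):
import pandas as pd

df = pd.DataFrame({'col1': ['A', 'B', 'D', 'C', 'A']})
# Map each value to its position in the desired sequence and sort by that key
order = {v: i for i, v in enumerate(['A', 'C', 'D', 'B'])}
print(df.sort_values(by='col1', key=lambda s: s.map(order)))
#   col1
# 0    A
# 4    A
# 3    C
# 2    D
# 1    B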
This one worked for me:
df = df.sort_values(by=[2])
Whereas:
df = df.sort_values(by=['2'])
did not. Which form works depends on whether the column label is the integer 2 or the string '2'.
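As a quick illustration of the label-type difference (toy frame, hypothetical data):
import pandas as pd

# Here the column label is the integer 2, not the string '2'
df = pd.DataFrame({2: [3, 1, 2]})
print(df.sort_values(by=[2]))    # works: 2 is the actual label
# df.sort_values(by=['2'])       # would raise KeyError: '2' is not a column label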
Example:
Assume you have a column with values 1 and 0 and you want to separate and use only one value; then:
import pandas as pan

# furniture is one of the columns in the csv file.
allrooms = data.groupby('furniture')['furniture'].agg('count')
allrooms
myrooms1 = pan.DataFrame(allrooms, columns=['furniture'], index=[1])
myrooms2 = pan.DataFrame(allrooms, columns=['furniture'], index=[0])
print(myrooms1); print(myrooms2)

Graphing time series in ggplot2 with CDC weeks ordered sensibly

I have a data frame ('Example') like this.
n CDCWeek Year Week
25.512324 2011-39 2011 39
26.363035 2011-4 2011 4
25.510500 2011-40 2011 40
25.810663 2011-41 2011 41
25.875451 2011-42 2011 42
25.860873 2011-43 2011 43
25.374876 2011-44 2011 44
25.292944 2011-45 2011 45
24.810807 2011-46 2011 46
24.793090 2011-47 2011 47
22.285000 2011-48 2011 48
23.015480 2011-49 2011 49
26.296376 2011-5 2011 5
22.074581 2011-50 2011 50
22.209183 2011-51 2011 51
22.270705 2011-52 2011 52
25.391377 2011-6 2011 6
25.225481 2011-7 2011 7
24.678918 2011-8 2011 8
24.382214 2011-9 2011 9
I want to plot this as a time series with 'CDCWeek' as the X-axis and 'n' as the Y using this code.
ggplot(Example, aes(CDCWeek, n, group=1)) + geom_line()
The problem I am running into is that it is not graphing CDCWeek in the right order. CDCWeek is the year followed by the week number (1 to 52 or 53, depending on the year). It is being graphed in the order shown in the data frame, with 2011-39 followed by 2011-4, etc. I understand why this is happening, but is there any way to force ggplot2 to use the proper order of weeks?
EDIT: I can't just use the 'week' variable because the actual dataset covers many years.
Thank you
aweek::get_date lets you build weekly dates from just the year and the epiweek.
Here I created a reprex with a sequence of dates, extracted the epiweek with lubridate::epiweek, defined Sunday as the start of the week with aweek::set_week_start, summarized weekly values, created a new date vector with aweek::get_date, and plotted them.
library(tidyverse)
library(lubridate)
library(aweek)
data_ts <- tibble(date = seq(ymd('2012-04-07'),
                             ymd('2014-03-22'),
                             by = '1 day')) %>%
  mutate(value = rnorm(n(), mean = 5),
         # using aweek
         epidate = date2week(date, week_start = 7),
         # using lubridate
         epiweek = epiweek(date),
         dayw = wday(date, label = T, abbr = F),
         month = month(date, label = F, abbr = F),
         year = year(date)) %>%
  print()
#> # A tibble: 715 x 7
#> date value epidate epiweek dayw month year
#> <date> <dbl> <aweek> <dbl> <ord> <dbl> <dbl>
#> 1 2012-04-07 3.54 2012-W14-7 14 sábado 4 2012
#> 2 2012-04-08 5.79 2012-W15-1 15 domingo 4 2012
#> 3 2012-04-09 4.50 2012-W15-2 15 lunes 4 2012
#> 4 2012-04-10 5.44 2012-W15-3 15 martes 4 2012
#> 5 2012-04-11 5.13 2012-W15-4 15 miércoles 4 2012
#> 6 2012-04-12 4.87 2012-W15-5 15 jueves 4 2012
#> 7 2012-04-13 3.28 2012-W15-6 15 viernes 4 2012
#> 8 2012-04-14 5.72 2012-W15-7 15 sábado 4 2012
#> 9 2012-04-15 6.91 2012-W16-1 16 domingo 4 2012
#> 10 2012-04-16 4.58 2012-W16-2 16 lunes 4 2012
#> # ... with 705 more rows
#CORE: Here you set the start of the week!
set_week_start(7) #sunday
get_week_start()
#> [1] 7
data_ts_w <- data_ts %>%
  group_by(year, epiweek) %>%
  summarise(sum_week_value = sum(value)) %>%
  ungroup() %>%
  # using aweek
  mutate(epi_date = get_date(week = epiweek, year = year),
         wik_date = date2week(epi_date)) %>%
  print()
#> # A tibble: 104 x 5
#> year epiweek sum_week_value epi_date wik_date
#> <dbl> <dbl> <dbl> <date> <aweek>
#> 1 2012 1 11.0 2012-01-01 2012-W01-1
#> 2 2012 14 3.54 2012-04-01 2012-W14-1
#> 3 2012 15 34.7 2012-04-08 2012-W15-1
#> 4 2012 16 35.1 2012-04-15 2012-W16-1
#> 5 2012 17 34.5 2012-04-22 2012-W17-1
#> 6 2012 18 34.7 2012-04-29 2012-W18-1
#> 7 2012 19 36.5 2012-05-06 2012-W19-1
#> 8 2012 20 32.1 2012-05-13 2012-W20-1
#> 9 2012 21 35.4 2012-05-20 2012-W21-1
#> 10 2012 22 37.5 2012-05-27 2012-W22-1
#> # ... with 94 more rows
#you can use get_date output with ggplot
data_ts_w %>%
  slice(-(1:3)) %>%
  ggplot(aes(epi_date, sum_week_value)) +
  geom_line() +
  scale_x_date(date_breaks = "5 week", date_labels = "%Y-%U") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  labs(title = "Weekly time series",
       x = "Time (Year - CDC epidemiological week)",
       y = "Sum of weekly values")
ggsave("figure/000-timeserie-week.png", height = 3, width = 10)
Created on 2019-08-12 by the reprex package (v0.3.0)
Convert the Year and Week into a date with dplyr:
df <- df %>%
  mutate(date = paste(Year, Week, 1, sep = "-") %>%
           as.Date(., "%Y-%U-%u"))

ggplot(df, aes(date, n, group = 1)) +
  geom_line() +
  scale_x_date(date_breaks = "8 week", date_labels = "%Y-%U")
One option would be to use the Year and Week variables you already have but facet by Year. I changed the Year variable in your data a bit to make my case.
Example$Year = rep(2011:2014, each = 5)
ggplot(Example, aes(x = Week, y = n)) +
  geom_line() +
  facet_grid(Year ~ ., scales = "free_x")
  # facet_grid(. ~ Year, scales = "free_x")
This has the added advantage of being able to compare across years. If you switch the final line to the option I've commented out then the facets will be horizontal.
Yet another option would be to group by Year as a factor level and include them all on the same figure.
ggplot(Example, aes(x = Week, y = n)) +
  geom_line(aes(group = Year, color = factor(Year)))
It turns out I just had to order Example$CDCWeek properly and then ggplot would graph it properly.
1) Put the database in the proper order.
Example <- Example[order(Example$Year, Example$Week), ]
2) Reset the rownames.
row.names(Example) <- NULL
3) Create a new variable with the observation number from the rownames
Example$Obs <- as.numeric(rownames(Example))
4) Order the CDCWeeks variable as a factor according to the observation number
Example$CDCWeek <- factor(Example$CDCWeek, levels=Example$CDCWeek[order(Example$Obs)], ordered=TRUE)
5) Graph it
ggplot(Example, aes(CDCWeek, n, group=1)) + geom_line()
Thanks a lot for the help, everyone!