Rolling up daily data to weekly with resample and offset in pandas

I have daily covid data that I would like to roll up to weekly. The problem is that I want my weeks to go from Sunday to Saturday, but the default is Monday to Sunday. I tried to use loffset, but it only changes the dates, not my data, and it also adds a date that does not exist in the dataset.
Code:
logic = {'iso_code': 'first',
         'new_cases': 'sum',
         'new_deaths': 'sum',
         'icu_patients': 'sum',
         'hosp_patients': 'sum',
         'people_vaccinated': 'sum'}  # it's possible to have 'first', 'max', 'last', etc.
offset = pd.offsets.DateOffset(-1)
df_covid_weekly = df_covid_file.resample('W', on='date', label = 'right', loffset=offset).apply(logic).reset_index()
Raw Data Snippet:
Current Outcome:
Expected Outcome:

Use anchored offsets:
df_covid_file.resample('W-SAT', on='date', label = 'right')
The offset W is equivalent to W-SUN ("week ending on Sunday"), W-SAT means "week ending on Saturday", and so on.
If you want an offset object you can use pd.offsets.Week(weekday=5), which is equivalent to W-SAT. The offset strings are aliases for these objects. Sometimes using the objects instead of their string counterparts makes code parametrization a little easier.
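Putting it together with the logic dict and df_covid_file from the question (a sketch under those assumptions; .agg is used here, which for a dict of per-column aggregations behaves like the .apply in the question):
# 'W-SAT' bins run Sunday through Saturday; label='right' stamps each bin with the Saturday that ends it.
df_covid_weekly = (df_covid_file
                   .resample('W-SAT', on='date', label='right')
                   .agg(logic)
                   .reset_index())
No loffset is needed, so the weekly labels line up with the intended Saturday week ends.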

Related

How to set Custom Business Day End Frequency in Pandas

I have a pandas dataframe with an unusual DatetimeIndex. The frame contains daily data (end of each day) from 1985 to 1990 but some "random" days are missing:
DatetimeIndex(['1985-01-02', '1985-01-03', '1985-01-04', '1985-01-07',
'1985-01-08', '1985-01-09', '1985-01-10', '1985-01-11',
'1985-01-14', '1985-01-15',
...
'1990-12-17', '1990-12-18', '1990-12-19', '1990-12-20',
'1990-12-21', '1990-12-24', '1990-12-26', '1990-12-27',
'1990-12-28', '1990-12-31'],
dtype='datetime64[ns]', name='date', length=1516, freq=None)
I often need operations like shifting an entire column such that a value on the last day of a month (which in my DatetimeIndex could e.g. be '1985-05-30') is shifted to the last day of the next month (which in my DatetimeIndex could e.g. be '1985-06-27').
While looking for a smart way to perform such shifts, I stumbled over the Offset Aliases provided by pandas.tseries.offsets. Among them are the custom business day frequency (C) and the custom business month end frequency (CBM). Looking at an example, it seems like this could provide exactly what I need:
from pandas.tseries.holiday import USFederalHolidayCalendar
import pandas as pd
mth_us = pd.offsets.CustomBusinessMonthEnd(calendar=USFederalHolidayCalendar())
day_us = pd.offsets.CustomBusinessDay(calendar=USFederalHolidayCalendar())
df['Col1_shifted'] = df['Col1'].shift(periods=1, freq=mth_us)  # shifted by 1 month
df['Col2_shifted'] = df['Col2'].shift(periods=1, freq=day_us)  # shifted by 1 day
The problem is that my DatetimeIndex is not equal to USFederalHolidayCalendar(). Can someone please tell me how I can use pd.offsets.CustomBusinessMonthEnd (and also pd.offsets.CustomBusinessDay) with my own custom DatetimeIndex?
If not, does anyone have an idea how to tackle this issue in a different way?
Thanks a lot for your help!
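One possible direction (a sketch, not tested against this exact data): besides a calendar, CustomBusinessDay and CustomBusinessMonthEnd also accept an explicit holidays list, so you can declare every regular business day that is missing from your own DatetimeIndex to be a "holiday". The offsets then skip exactly the days your index skips. Assuming df is the frame and its index is the DatetimeIndex shown above:
import pandas as pd
# Regular business days spanning the custom index.
all_bdays = pd.date_range(df.index.min(), df.index.max(), freq='B')
# Treat every business day the custom index skips as a holiday.
missing = all_bdays.difference(df.index)
day_custom = pd.offsets.CustomBusinessDay(holidays=missing)
mth_custom = pd.offsets.CustomBusinessMonthEnd(holidays=missing)
df['Col1_shifted'] = df['Col1'].shift(periods=1, freq=mth_custom)  # shift by one custom month end
df['Col2_shifted'] = df['Col2'].shift(periods=1, freq=day_custom)  # shift by one custom business day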

pandas reading csv date as a string that's a 5 digit number

I have a date in a .csv with format YYYY-MM-DD. Pandas is reading it in as a string, but instead of the format shown in the csv it comes in as a 5-digit number stored as a string.
I've tried:
pd.to_datetime(df['alert_date'], unit = 's')
pd.to_datetime(df['alert_date'], unit = 'D')
I've also tried explicitly reading it in as a string and letting the date parser take over. See below:
dtype_dict = {'alert_date': 'str', 'lossdate1': 'str', 'lossdate2': 'str',
              'lossdate3': 'str', 'lossdate4': 'str', 'lossdate5': 'str',
              'effdate': 'str'}
parse_dates = ['lossdate1', 'lossdate2', 'lossdate3',
               'lossdate4', 'lossdate5', 'effdate']
df = pd.read_csv("Agent Alerts Earned and Incurred with Loss Dates as of Q3 2021.csv",
                 encoding='latin1', dtype=dtype_dict, parse_dates=parse_dates)
I'm not sure what else to try or what is wrong with it to begin with.
Here is an example of what the data looks like.
alertflag,alert_type,alert_date,effdate,cal_year,totalep,eufactor,product,NonCatincrd1,Catincrd1,lossdate1,NonCatcvrcnt1,Catcvrcnt1,NonCatincrd2,Catincrd2,lossdate2,NonCatcvrcnt2,Catcvrcnt2,NonCatincrd3,Catincrd3,lossdate3,NonCatcvrcnt3,Catcvrcnt3,NonCatincrd4,Catincrd4,lossdate4,NonCatcvrcnt4,Catcvrcnt4,NonCatincrd5,Catincrd5,lossdate5,NonCatcvrcnt5,Catcvrcnt5,incurred
1,CANCEL NOTICE,2019-06-06,2018-12-17,2019,91.00,0.96,444,,,,,,,,,,,,,,,,,,,,,,,,,,
The alert_date comes through on that record as 21706.
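For what it's worth, 21706 happens to be the number of days between 1960-01-01 (the SAS date epoch) and 2019-06-06, so the column may actually contain day-count serial numbers rather than the formatted text you see in the raw file. If that turns out to be the case, a sketch of decoding them (assumes alert_date holds integer day counts; the epoch is a guess):
import pandas as pd
# Hypothetical: interpret alert_date as days elapsed since 1960-01-01.
serial = pd.to_numeric(df['alert_date'], errors='coerce')
df['alert_date'] = pd.to_datetime(serial, unit='D', origin='1960-01-01')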

groupby based on one column and get the sum values in another column

I have a data frame such as:
mode           travel time
transit_walk   284.0
transit_walk   284.0
pt             270.0
transit_walk   346.0
walk           455.0
I want to group by "mode" and get the sum of all travel time.
so my desired result looks like:
mode           total travel time
transit_walk   1200000000
pt             30000000
walk           88888888
I have written the following code:
df.groupby('mode')['travel time'].sum()
however, I have the result such as:
mode
pt 270.01488.01518.01788.01300.01589.01021.01684....
transit_walk 284.0284.0346.0142.0142.01882.0154.0154.0336.0...
walk 455.018.0281.0554.0256.0256.0244.0244.0244.045...
Name: travel time, dtype: object
which just puts all the times side by side instead of summing them.
There are strings in the travel time column, so try converting with Series.astype:
df['travel time'] = df['travel time'].astype(float)
If that fails because of non-numeric values, use to_numeric with errors='coerce':
df['travel time'] = pd.to_numeric(df['travel time'], errors='coerce')
And finally aggregate:
df1 = df.groupby('mode', as_index=False)['travel time'].sum()
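A quick check with the sample rows from the question (values stored as strings to mimic the problem):
import pandas as pd
df = pd.DataFrame({'mode': ['transit_walk', 'transit_walk', 'pt', 'transit_walk', 'walk'],
                   'travel time': ['284.0', '284.0', '270.0', '346.0', '455.0']})
df['travel time'] = pd.to_numeric(df['travel time'], errors='coerce')
df1 = df.groupby('mode', as_index=False)['travel time'].sum()
print(df1)
#            mode  travel time
# 0            pt        270.0
# 1  transit_walk        914.0
# 2          walk        455.0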

Make a plot by occurrence of a col by hour of a second col

I have this df:
and I would like to make a graph of how many rows I have per half hour, without including the day. Just a graph with the number of occurrences per half hour, not including the day.
3272 8711600410367 2019-03-11T20:23:45.415Z d7ec8e9c5b5df11df8ec7ee130552944 home 2019-03-11T20:23:45.415Z DISPLAY None
3273 8711600410367 2019-03-11T20:23:51.072Z d7ec8e9c5b5df11df8ec7ee130552944 home 2019-03-11T20:23:51.072Z DISPLAY None
Here is my try:
df["Created"] = pd.to_datetime(df["Created"])
df.groupby(df.Created.dt.hour).size().plot()
But it's not by half hour. I would like to show all half hours on my graph.
One way you could do this is to code the hours and half-hours separately, and then bring them together. To illustrate, I extended your data example a bit:
import pandas as pd
df = pd.DataFrame({'Created':['2019-03-11T20:23:45.415Z', '2019-03-11T20:23:51.072Z', '2019-03-11T20:33:03.072Z', '2019-03-11T21:10:10.072Z']})
df["Created"] = pd.to_datetime(df["Created"])
First create an 'Hours' column:
df['Hours'] = df.Created.dt.hour
Then create a column that codes half hours. That is, if the minutes are 30 or more, count it as the half hour.
df['HalfHours'] = [0.5 if x >= 30 else 0 for x in df.Created.dt.minute]
Then bring them together again:
df['Hours_and_HalfHours'] = df['Hours']+df['HalfHours']
Finally, count the number of rows by groupby, and plot:
df.groupby(df['Hours_and_HalfHours']).size().plot()
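A vectorized version of the same idea (a sketch, assuming 'Created' is already a datetime column) builds the half-hour bin directly from the hour and minute accessors:
# 20:23 -> 20.0, 20:33 -> 20.5, 21:10 -> 21.0, ...
half_hours = df['Created'].dt.hour + (df['Created'].dt.minute // 30) * 0.5
df.groupby(half_hours).size().plot()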

Querying a DataFrame values from a specific year

I have a pandas dataframe I have created from weather data that shows the high and low temperatures by day from 2005-2015. I want to be able to query my dataframe such that it only shows the values from the year 2015. Is there any way to do this without first changing the datetime values to show only the year (i.e. without applying strftime('%y') first)?
DataFrame Creation:
df=pd.read_csv('data/C2A2_data/BinnedCsvs_d400/fb441e62df2d58994928907a91895ec62c2c42e6cd075c2700843b89.csv')
df['Date']=pd.to_datetime(df.Date)
df['Date'] = df['Date'].dt.strftime('%m-%d-%y')
Attempt to Query:
daily_df = df[df['Date'] == datetime.date(year=2015)]
Error: asks for a month and day to be specified.
Data:
An NOAA dataset has been stored in the file data/C2A2_data/BinnedCsvs_d400/fb441e62df2d58994928907a91895ec62c2c42e6cd075c2700843b89.csv. The data for this assignment comes from a subset of The National Centers for Environmental Information (NCEI) Daily Global Historical Climatology Network (GHCN-Daily). The GHCN-Daily is comprised of daily climate records from thousands of land surface stations across the globe.
Each row in the assignment datafile corresponds to a single observation.
The following variables are provided to you:
id : station identification code
date : date in YYYY-MM-DD format (e.g. 2012-01-24 = January 24, 2012)
element : indicator of element type
TMAX : Maximum temperature (tenths of degrees C)
TMIN : Minimum temperature (tenths of degrees C)
value : data value for element (tenths of degrees C)
Image of DataFrame:
I resolved this by adding a column with just the year and then querying that way, but there has to be a better way to do this?
df['Date']=pd.to_datetime(df['Date']).dt.strftime('%d-%m-%y')
df['Year']=pd.to_datetime(df['Date']).dt.strftime('%y')
daily_df = df[df['Year']=='15']
return daily_df
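A more direct way (a sketch that keeps Date as real datetimes instead of converting it to strings) is to compare the .dt.year accessor:
df['Date'] = pd.to_datetime(df['Date'])
daily_df = df[df['Date'].dt.year == 2015]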