I have a pandas dataframe with an unusual DatetimeIndex. The frame contains daily data (end of each day) from 1985 to 1990 but some "random" days are missing:
DatetimeIndex(['1985-01-02', '1985-01-03', '1985-01-04', '1985-01-07',
'1985-01-08', '1985-01-09', '1985-01-10', '1985-01-11',
'1985-01-14', '1985-01-15',
...
'1990-12-17', '1990-12-18', '1990-12-19', '1990-12-20',
'1990-12-21', '1990-12-24', '1990-12-26', '1990-12-27',
'1990-12-28', '1990-12-31'],
dtype='datetime64[ns]', name='date', length=1516, freq=None)
I often need operations like shifting an entire column such that a value at the last day of a month (which in my DatetimeIndex could e.g. be '1985-05-30') is shifted to the last day of the next month (which in my DatetimeIndex could e.g. be '1985-06-27').
While looking for a smart way to perform such shifts, I stumbled over the Offset Aliases provided by pandas.tseries.offsets. Among them are the custom business day frequency (C) and the custom business month end frequency (CBM). Looking at an example, it seems like this could provide exactly what I need:
from pandas.tseries.holiday import USFederalHolidayCalendar

mth_us = pd.offsets.CustomBusinessMonthEnd(calendar=USFederalHolidayCalendar())
day_us = pd.offsets.CustomBusinessDay(calendar=USFederalHolidayCalendar())
df['Col1_shifted'] = df['Col1'].shift(periods=1, freq=mth_us)  # shifted by 1 month
df['Col2_shifted'] = df['Col2'].shift(periods=1, freq=day_us)  # shifted by 1 day
The problem is that my DatetimeIndex does not follow the USFederalHolidayCalendar(). Can someone please tell me how I can use pd.offsets.CustomBusinessMonthEnd (and also pd.offsets.CustomBusinessDay) with my own custom DatetimeIndex?
If not, has any of you an idea how to tackle this issue in a different way?
Thanks a lot for your help!
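One hedged approach (a sketch, not the only way): besides calendar, both CustomBusinessDay and CustomBusinessMonthEnd accept a holidays list and a weekmask. You can treat every calendar day that is *not* in your index as a "holiday" and open the weekmask to all seven days, so that exactly your index dates count as business days:

```python
import pandas as pd

# A small index with gaps, standing in for the question's DatetimeIndex
idx = pd.DatetimeIndex(['1985-01-02', '1985-01-03', '1985-01-04',
                        '1985-01-07', '1985-01-08', '1985-01-31'])

# Every calendar day NOT present in the index becomes a "holiday"
all_days = pd.date_range(idx.min(), idx.max(), freq='D')
holidays = all_days.difference(idx).tolist()

# Open the weekmask to all seven days so only the holidays list matters
day_custom = pd.offsets.CustomBusinessDay(
    weekmask='Sun Mon Tue Wed Thu Fri Sat', holidays=holidays)
mth_custom = pd.offsets.CustomBusinessMonthEnd(
    weekmask='Sun Mon Tue Wed Thu Fri Sat', holidays=holidays)

# '1985-01-04' + 1 custom business day skips the missing Jan 5/6
print(pd.Timestamp('1985-01-04') + day_custom)  # 1985-01-07 00:00:00
```

Note that dates outside the span covered by the holidays list are treated as ordinary business days, so build it over the full range of dates you will shift into.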
I have a date in a .csv with format YYYY-MM-DD. Pandas reads it in, but instead of the format shown in the csv, the value comes through as a 5-digit number stored as a string.
I've tried:
pd.to_datetime(df['alert_date'], unit = 's')
pd.to_datetime(df['alert_date'], unit = 'D')
I've also tried explicitly reading it in as a string and letting the date parser take over. See below:
dtype_dict = {'alert_date':'str','lossdate1':'str', 'lossdate2':'str',
'lossdate3':'str', 'lossdate4':'str', 'lossdate5':'str',
'effdate':'str'}
parse_dates = ['lossdate1', 'lossdate2', 'lossdate3',
'lossdate4', 'lossdate5', 'effdate']
df = pd.read_csv("Agent Alerts Earned and Incurred with Loss Dates as of Q3 2021.csv",
encoding='latin1', dtype = dtype_dict, parse_dates=parse_dates)
I'm not sure what else to try or what is wrong with it to begin with.
Here is an example of what the data looks like.
alertflag,alert_type,alert_date,effdate,cal_year,totalep,eufactor,product,NonCatincrd1,Catincrd1,lossdate1,NonCatcvrcnt1,Catcvrcnt1,NonCatincrd2,Catincrd2,lossdate2,NonCatcvrcnt2,Catcvrcnt2,NonCatincrd3,Catincrd3,lossdate3,NonCatcvrcnt3,Catcvrcnt3,NonCatincrd4,Catincrd4,lossdate4,NonCatcvrcnt4,Catcvrcnt4,NonCatincrd5,Catincrd5,lossdate5,NonCatcvrcnt5,Catcvrcnt5,incurred
1,CANCEL NOTICE,2019-06-06,2018-12-17,2019,91.00,0.96,444,,,,,,,,,,,,,,,,,,,,,,,,,,
The alert_date comes through on that record as 21706.
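For what it's worth, a 5-digit day count landing in 2019 suggests a non-Unix epoch: 21706 days after 1960-01-01 (the SAS epoch) is exactly 2019-06-06, the value shown in the csv. If the file was exported from SAS, a hedged conversion via to_datetime's origin parameter would be:

```python
import pandas as pd

# Assuming the serial numbers are day counts from the SAS epoch (1960-01-01)
alert_date = pd.to_datetime(21706, unit='D', origin='1960-01-01')
print(alert_date)  # 2019-06-06 00:00:00
```

On the whole column that would be something like pd.to_datetime(df['alert_date'].astype(int), unit='D', origin='1960-01-01'), assuming every value is such a serial number.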
I have a data frame such as:
mode travel time
transit_walk 284.0
transit_walk 284.0
pt 270.0
transit_walk 346.0
walk 455.0
I want to group by "mode" and get the sum of all travel time.
so my desired result looks like:
mode total travel time
transit_ walk 1200000000
pt 30000000
walk 88888888
I have written code such as:
df.groupby('mode')['travel time'].sum()
however, I get a result like this:
mode
pt 270.01488.01518.01788.01300.01589.01021.01684....
transit_walk 284.0284.0346.0142.0142.01882.0154.0154.0336.0...
walk 455.018.0281.0554.0256.0256.0244.0244.0244.045...
Name: travel time, dtype: object
which just puts all the times side by side instead of summing them up.
There are strings in the travel time column, so try converting with Series.astype:
df['travel time'] = df['travel time'].astype(float)
If that fails because of non-numeric values, use to_numeric with errors='coerce':
df['travel time'] = pd.to_numeric(df['travel time'], errors='coerce')
And finally aggregate:
df1 = df.groupby('mode', as_index=False)['travel time'].sum()
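Putting those steps together on a tiny frame mimicking the question's data (the values here are made up for illustration):

```python
import pandas as pd

# Travel times stored as strings, which is why sum() concatenates them
df = pd.DataFrame({'mode': ['transit_walk', 'transit_walk', 'pt',
                            'transit_walk', 'walk'],
                   'travel time': ['284.0', '284.0', '270.0', '346.0', '455.0']})

df['travel time'] = pd.to_numeric(df['travel time'], errors='coerce')
df1 = df.groupby('mode', as_index=False)['travel time'].sum()
print(df1)  # transit_walk sums to 914.0, pt to 270.0, walk to 455.0
```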
I have this df:
and I would like to make a graph of how many rows I have per half hour, without including the day. Just a count of occurrences per half hour, ignoring the date.
3272 8711600410367 2019-03-11T20:23:45.415Z d7ec8e9c5b5df11df8ec7ee130552944 home 2019-03-11T20:23:45.415Z DISPLAY None
3273 8711600410367 2019-03-11T20:23:51.072Z d7ec8e9c5b5df11df8ec7ee130552944 home 2019-03-11T20:23:51.072Z DISPLAY None
Here is my try :
df["Created"] = pd.to_datetime(df["Created"])
df.groupby(df.Created.dt.hour).size().plot()
But that groups by hour, not by half hour. I would like to show every half hour on my graph.
One way you could do this is to split the coding into hours and half-hours, and then bring them together. To illustrate, I extended your data example a bit:
import pandas as pd
df = pd.DataFrame({'Created':['2019-03-11T20:23:45.415Z', '2019-03-11T20:23:51.072Z', '2019-03-11T20:33:03.072Z', '2019-03-11T21:10:10.072Z']})
df["Created"] = pd.to_datetime(df["Created"])
First create an 'Hours' column:
df['Hours'] = df.Created.dt.hour
Then create a column that codes half hours. That is, if the minutes are 30 or more, count it as the half hour:
df['HalfHours'] = [0.5 if x >= 30 else 0 for x in df.Created.dt.minute]
Then bring them together again:
df['Hours_and_HalfHours'] = df['Hours']+df['HalfHours']
Finally, count the number of rows by groupby, and plot:
df.groupby(df['Hours_and_HalfHours']).size().plot()
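An alternative sketch that avoids the manual hour/half-hour split: floor each timestamp to its 30-minute bin with dt.floor, then keep only the time of day so different dates collapse together (the data here extends the question's sample, with one row moved to another day to show the date being ignored):

```python
import pandas as pd

df = pd.DataFrame({'Created': ['2019-03-11T20:23:45.415Z',
                               '2019-03-11T20:23:51.072Z',
                               '2019-03-12T20:33:03.072Z',
                               '2019-03-11T21:10:10.072Z']})
df['Created'] = pd.to_datetime(df['Created'])

# Floor to 30-minute bins, then keep only the time of day (drops the date)
counts = df.groupby(df['Created'].dt.floor('30min').dt.time).size()
counts.plot()
```

Here counts is indexed by 20:00, 20:30, 21:00 with values 2, 1, 1.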
I have a pandas dataframe created from weather data that shows the high and low temperatures by day from 2005-2015. I want to query the dataframe so that it only shows rows from the year 2015. Is there a way to do this without first converting the datetime values to show only the year (i.e. without applying strftime('%y') first)?
DataFrame Creation:
df=pd.read_csv('data/C2A2_data/BinnedCsvs_d400/fb441e62df2d58994928907a91895ec62c2c42e6cd075c2700843b89.csv')
df['Date']=pd.to_datetime(df.Date)
df['Date'] = df['Date'].dt.strftime('%m-%d-%y')
Attempt to Query:
daily_df=df[df['Date']==datetime.date(year=2015)]
Error: asks for a month and day to be specified.
Data:
An NOAA dataset has been stored in the file data/C2A2_data/BinnedCsvs_d400/fb441e62df2d58994928907a91895ec62c2c42e6cd075c2700843b89.csv. The data for this assignment comes from a subset of The National Centers for Environmental Information (NCEI) Daily Global Historical Climatology Network (GHCN-Daily). The GHCN-Daily is comprised of daily climate records from thousands of land surface stations across the globe.
Each row in the assignment datafile corresponds to a single observation.
The following variables are provided to you:
id : station identification code
date : date in YYYY-MM-DD format (e.g. 2012-01-24 = January 24, 2012)
element : indicator of element type
TMAX : Maximum temperature (tenths of degrees C)
TMIN : Minimum temperature (tenths of degrees C)
value : data value for element (tenths of degrees C)
Image of DataFrame:
I resolved this by adding a column with just the year and then querying on that, but there has to be a better way to do this?
df['Date']=pd.to_datetime(df['Date']).dt.strftime('%d-%m-%y')
df['Year']=pd.to_datetime(df['Date']).dt.strftime('%y')
daily_df = df[df['Year']=='15']
return daily_df
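A likely simpler route (a sketch, assuming the Date column is kept as datetime64 rather than converted to strings with strftime): leave pd.to_datetime's output alone and filter with the .dt.year accessor:

```python
import pandas as pd

# Toy frame standing in for the NOAA data
df = pd.DataFrame({'Date': ['2014-12-31', '2015-01-01', '2015-06-15'],
                   'Data_Value': [10, 20, 30]})
df['Date'] = pd.to_datetime(df['Date'])  # keep as datetime64, no strftime

daily_df = df[df['Date'].dt.year == 2015]
print(daily_df)  # the two 2015 rows
```

This keeps the column usable for further date arithmetic, whereas converting to strings up front forces string comparisons everywhere.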