I have a pandas dataframe with an unusual DatetimeIndex. The frame contains daily data (end of each day) from 1985 to 1990 but some "random" days are missing:
DatetimeIndex(['1985-01-02', '1985-01-03', '1985-01-04', '1985-01-07',
'1985-01-08', '1985-01-09', '1985-01-10', '1985-01-11',
'1985-01-14', '1985-01-15',
...
'1990-12-17', '1990-12-18', '1990-12-19', '1990-12-20',
'1990-12-21', '1990-12-24', '1990-12-26', '1990-12-27',
'1990-12-28', '1990-12-31'],
dtype='datetime64[ns]', name='date', length=1516, freq=None)
I often need operations like shifting an entire column so that a value at the last day of a month (which in my DatetimeIndex could be, e.g., '1985-05-30') is moved to the last day of the next month (which in my DatetimeIndex could be, e.g., '1985-06-27').
While looking for a smart way to perform such shifts, I stumbled over the Offset Aliases provided by pandas.tseries.offsets. Among them are the aliases for custom business day frequency (C) and custom business month end frequency (CBM). Judging from an example, these could provide exactly what I need:
import pandas as pd
from pandas.tseries.holiday import USFederalHolidayCalendar

mth_us = pd.offsets.CustomBusinessMonthEnd(calendar=USFederalHolidayCalendar())
day_us = pd.offsets.CustomBusinessDay(calendar=USFederalHolidayCalendar())
df['Col1_shifted'] = df['Col1'].shift(periods=1, freq=mth_us)  # shifted by 1 month
df['Col2_shifted'] = df['Col2'].shift(periods=1, freq=day_us)  # shifted by 1 day
The problem is that my DatetimeIndex does not match USFederalHolidayCalendar(). Can someone please tell me how I can use pd.offsets.CustomBusinessMonthEnd (and also pd.offsets.CustomBusinessDay) with my own custom DatetimeIndex?
If not, does anyone have an idea how to tackle this issue in a different way?
Thanks a lot for your help!
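One possible approach, sketched here under assumptions rather than taken from the thread: both offsets accept a holidays argument, so every weekday that falls inside the covered date range but is absent from the index can be passed as a holiday. The small index below is a hypothetical stand-in for the real one, and the names day_own and mth_own are made up for illustration. Dates outside the covered range are not known to this calendar, so offsets that land beyond the last index entry fall back to plain weekdays.

```python
import pandas as pd

# Hypothetical stand-in for the real index: weekdays in Q1 1985 with two
# "random" days removed (both happen to be calendar month ends).
full = pd.bdate_range("1985-01-02", "1985-03-29")
idx = full.difference(pd.to_datetime(["1985-01-31", "1985-02-28"]))
df = pd.DataFrame({"Col1": range(len(idx))}, index=idx)

# Treat every weekday inside the covered range that is missing from the
# index as a holiday.
all_bdays = pd.bdate_range(df.index.min(), df.index.max())
missing = all_bdays.difference(df.index).tolist()

day_own = pd.offsets.CustomBusinessDay(holidays=missing)
mth_own = pd.offsets.CustomBusinessMonthEnd(holidays=missing)

# 1985-01-30 is the last January date present in the index.
next_day = pd.Timestamp("1985-01-30") + day_own
next_month_end = pd.Timestamp("1985-01-30") + mth_own

# Shifting by the custom day offset moves every label to the next index date.
shifted = df["Col1"].shift(periods=1, freq=day_own)
```

Assigning shifted back to a column of df aligns on the index, so the value that moved past the last index date is dropped and the first row becomes NaN.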
Related
I have a datafile with quality scores from different suppliers over a time range of 3 years. The end goal is to use machine learning to predict the quality label (good or bad) of a shipment based on supplier information.
I want to use the mean historic quality data over a specific period of time as an input feature in this model, using a pandas rolling window. The problem with this method is that pandas only lets you create a rolling window from t=0-x until t=0, as shown below:
df['average_score t-2w'] = df['score'].rolling(window='14d',closed='left').mean()
And this is where the problem comes in. For my feature I want to use quality data from a period of 2 weeks, but not the 2 weeks directly before the corresponding shipment: the period should start at t=-4 weeks and end at t=-2 weeks.
You would imagine that this could be solved by using the same string of code but changing the window as presented below:
df['average_score t-2w'] = df['score'].rolling(window='28d' - '14d',closed='left').mean()
This, or any other notation for this specific window, does not seem to work.
It seems that pandas does not offer a solution to this problem, so we built the following workaround:
def time_shift_week(df):
    def _avg_score_interval_func(series):
        # Mean of the scores strictly between 4 and 2 weeks before the
        # last timestamp in the current window.
        current_time = series.index[-1]
        result = series[(series.index > (current_time - pd.Timedelta(value=4, unit='w')))
                        & (series.index < (current_time - pd.Timedelta(value=2, unit='w')))]
        return result.mean() if len(result) > 0 else 0.0

    temp_df = df.groupby(by=["supplier", "timestamp"], as_index=False).aggregate({"score": np.mean}).set_index('timestamp')
    temp_df["w-42"] = (
        temp_df
        .groupby(["supplier"])
        .score
        .apply(lambda x:
               x
               .rolling(window='30D', closed='both')
               .apply(_avg_score_interval_func)
               ))
    return temp_df.reset_index()
This results in a new df containing the average score per supplier per timestamp, which we can subsequently merge with the original data frame to obtain the new feature.
Doing it this way seems really cumbersome and overly complicated for the task I am trying to perform. Even though we have found a workaround, I am wondering if there is an easier method of doing this.
Is anyone aware of a less complicated way of performing this rolling window feature extraction?
While pandas does not have the custom date offset you need, calculating the mean is pretty simple: it's just sum divided by count. You can subtract the 14-day rolling window from the 28-day rolling window:
import numpy as np
import pandas as pd

# Some sample data. All scores are sequential for easy verification
idx = pd.MultiIndex.from_product(
[list("ABC"), pd.date_range("2020-01-01", "2022-12-31")],
names=["supplier", "timestamp"],
)
df = pd.DataFrame({"score": np.arange(len(idx))}, index=idx).reset_index()
# Now compute the rolling average of score over the custom window.
# closed="left" means the current row is excluded from the window.
score = df.set_index("timestamp").groupby("supplier")["score"]
r28 = score.rolling("28d", closed="left")
r14 = score.rolling("14d", closed="left")
avg_score = (r28.sum() - r14.sum()) / (r28.count() - r14.count())
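To use the result as a model feature, it has to be joined back onto the original rows. The following is a minimal sketch of that step on a smaller sample, assuming one row per supplier and timestamp; the column name avg_score_4w_to_2w is made up for illustration.

```python
import numpy as np
import pandas as pd

# Smaller sample: 2 suppliers, 60 consecutive days, sequential scores.
idx = pd.MultiIndex.from_product(
    [list("AB"), pd.date_range("2020-01-01", periods=60)],
    names=["supplier", "timestamp"],
)
df = pd.DataFrame({"score": np.arange(len(idx), dtype=float)}, index=idx).reset_index()

score = df.set_index("timestamp").groupby("supplier")["score"]
r28 = score.rolling("28d", closed="left")
r14 = score.rolling("14d", closed="left")
avg_score = (r28.sum() - r14.sum()) / (r28.count() - r14.count())

# The result is indexed by (supplier, timestamp); turn it into columns
# and merge it back as a new feature.
feature = avg_score.rename("avg_score_4w_to_2w").reset_index()
df = df.merge(feature, on=["supplier", "timestamp"], how="left")

# Supplier A on 2020-02-10 (day 40): the window covers days 12..25.
row = df[(df["supplier"] == "A") & (df["timestamp"] == "2020-02-10")]
value = row["avg_score_4w_to_2w"].iloc[0]
```

Rows without any observation in the interval from 28 to 14 days back get NaN, because both the sum difference and the count difference are empty there.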
Total beginner here. If my question is irrelevant, apologies in advance; I'll remove it. Using pandas, I want to calculate an evolution ratio for a week's data compared with the mean of the previous rolling 4 weeks.
df['rolling_mean_fourweeks'] = df.rolling(4).mean().round(decimals=1)
From here I want to create a new column for the evolution ratio, comparing each week's data with the rolling mean at the previous week's row.
What is the best way to go here? (I don't have big data.) I have tried unsuccessfully with .shift(), but I am very unfamiliar with it. I should get NaN for week 3 (the fourth week) and ~47% for the fifth week.
Any suggestion for retrieving the value at the row with step -1?
Thanks and have a good day!
Your idea of using shift can work perfectly well. The shift(x) function simply shifts a series (a full column in your case) by x steps; a negative x moves values towards earlier rows.
A simple way to check whether rolling_mean_fourweeks is a good predictor is to shift Column1 and then check how it differs from rolling_mean_fourweeks:
df['column1_shifted'] = df['Column1'].shift(-1)
df['rolling_accuracy'] = ((df['column1_shifted']-df['rolling_mean_fourweeks'])
/df['rolling_mean_fourweeks'])
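The same comparison can also be aligned on the current week's row by shifting the rolling mean forward instead of shifting Column1 back. The sketch below uses made-up sample values, not the asker's data, and the column name evolution_ratio is an assumption.

```python
import pandas as pd

# Hypothetical weekly values; the real data is not shown in the question.
df = pd.DataFrame({"Column1": [10, 20, 30, 40, 50, 60]})

df["rolling_mean_fourweeks"] = df["Column1"].rolling(4).mean().round(decimals=1)

# shift(1) moves each rolling mean down one row, so every week is compared
# with the rolling mean that ended the week before.
previous_mean = df["rolling_mean_fourweeks"].shift(1)
df["evolution_ratio"] = (df["Column1"] - previous_mean) / previous_mean
```

With these values the first four rows of evolution_ratio are NaN, because no previous four-week mean exists yet, and the fifth row is (50 - 25) / 25 = 1.0.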
My data looks like this:
   Creation Day  Time St1  Time St2
0    28.01.2022  14:18:00  15:12:00
1    28.01.2022  14:35:00  16:01:00
2    29.01.2022  00:07:00  03:04:00
3    30.01.2022  17:03:00  22:12:00
It represents parts being at a given station. What I need now is something that counts how many rows share the same day and hour, i.e. how many parts were at the same station during a given hour.
Here, 2 parts were at Station 1 on the 28th in the 14:00-15:00 time span.
In the end I want a bar graph that shows production speed. Later in the project I also want to highlight parts that haven't moved for more than 2 hours.
Is it practical to create a datetime object for every station (I have 5 in total)? Or is there a much simpler way to do this?
FYI, I import this data from an Excel sheet.
I found the solution. As they are just strings I can just add them and reformat the result with pd.to_datetime().
Example:
df["Time St1"] = pd.to_datetime(
    df["Creation Day"] + ' ' + df["Time St1"],
    format='%d.%m.%Y %H:%M:%S'
)
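Once the station columns are real datetimes, the hourly counts and the "not moved for more than 2 hours" check from the question follow fairly directly. This is a rough sketch using the four sample rows above; it assumes that every station time falls on the Creation Day, which may not hold for parts that cross midnight.

```python
import pandas as pd

df = pd.DataFrame({
    "Creation Day": ["28.01.2022", "28.01.2022", "29.01.2022", "30.01.2022"],
    "Time St1": ["14:18:00", "14:35:00", "00:07:00", "17:03:00"],
    "Time St2": ["15:12:00", "16:01:00", "03:04:00", "22:12:00"],
})

# Combine the day with each station time (assumption: same calendar day).
for col in ["Time St1", "Time St2"]:
    df[col] = pd.to_datetime(
        df["Creation Day"] + " " + df[col], format="%d.%m.%Y %H:%M:%S"
    )

# Number of parts at Station 1 per day and hour.
counts_st1 = df["Time St1"].dt.floor("h").value_counts().sort_index()

# Parts that took more than 2 hours to get from Station 1 to Station 2.
slow = (df["Time St2"] - df["Time St1"]) > pd.Timedelta(hours=2)
```

counts_st1 can be passed to counts_st1.plot.bar() for the bar graph, and the same pattern extends to the remaining stations by adding their column names to the loop.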
I have a simple data set, where we have a Dates column from which I want to extract the year.
I am using a negative index to get the year:
d0['Year'] = d0['Dates'].apply(lambda x: x[-1:-5])
This normally works, but not on this data set: a blank column is created.
I sampled the column for some of the data and saw no odd characters present.
I have tried the following variations
d0['Year'] = d0['Dates'].apply(lambda x: str(x)[-1:-5]) # column is created and it is blank.
d0['Year'] = d0.Dates.str.extract('\d{4}') # gives an error "ValueError: pattern contains no capture groups"
d0['Year'] = d0['Dates'].apply(lambda x: str(x).replace('[^a-zA-Z0-9_-]','a')[-1:-5]) # same - gives a blank column
I am really not sure what other options I have or where the issue lies.
What could possibly be the issue?
Below is a sample dump of the data I have
Outbreak,Dates,Region,Tornadoes,Fatalities,Notes
2000 Southwest Georgia tornado outbreak,"February 13–14, 2000",Georgia,17,18,"Produced a series of strong and deadly tornadoes that struck areas in and around Camilla, Meigs, and Omega, Georgia. Weaker tornadoes impacted other states."
2000 Fort Worth tornado,"March 28, 2000",U.S. South,10,2,"Small outbreak produced an F3 that hit downtown Fort Worth, Texas, severely damaging skyscrapers and killing two. Another F3 caused major damage in Arlington and Grand Prairie."
2000 Easter Sunday tornado outbreak,"April 23, 2000","Oklahoma, Texas, Louisiana, Arkansas",33,0,
"2000 Brady, Nebraska tornado","May 17, 2000",Nebraska,1,0,"Highly photographed F3 passed near Brady, Nebraska."
2000 Granite Falls tornado,"July 25, 2000","Granite Falls, Minnesota",1,1,"F4 struck Granite Falls, causing major damage and killing one person."
To extract the year from the "Dates" column as a string (object dtype), use
d0['Year'] = d0['Dates'].apply(lambda x: x[-4:])
If you want it as an int, you can convert the result after the step above:
d0['Year'] = pd.to_numeric(d0['Year'])
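The blank column comes from the slice itself: x[-1:-5] starts at the last character and, with the default step of +1, can never reach the position 5 from the end, so it is always empty. The ValueError from str.extract goes away once the pattern has a capture group. A small sketch with two of the sample rows:

```python
import pandas as pd

d0 = pd.DataFrame({"Dates": ["February 13–14, 2000", "March 28, 2000"]})

# A slice from -1 to -5 with a positive step selects nothing.
empty = "March 28, 2000"[-1:-5]

# The parentheses form the capture group that str.extract requires;
# expand=False returns a Series instead of a one-column DataFrame.
d0["Year"] = d0["Dates"].str.extract(r"(\d{4})", expand=False).astype(int)
```

Unlike x[-4:], the regular expression does not depend on the year being the last four characters, although it picks the first four-digit run it finds.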
I get the TypeError shown below. I have checked my df and it contains only numbers. Could this be caused by the conversion to a NumPy array? After the conversion, the array has items like
[Timestamp('1993-02-11 00:00:00') 28.1216 28.3374 ...]
Any suggestion how to solve this, please?
df:
Date Open High Low Close Volume
9 1993-02-11 28.1216 28.3374 28.1216 28.2197 19500
10 1993-02-12 28.1804 28.1804 28.0038 28.0038 42500
11 1993-02-16 27.9253 27.9253 27.2581 27.2974 374800
12 1993-02-17 27.2974 27.3366 27.1796 27.2777 210900
X = np.array(df.drop(['High'], 1))
X = preprocessing.scale(X)
TypeError: float() argument must be a string or a number
While you're saying that your dataframe "all contains numbers only", you also note that the first column consists of datetime objects. The error is telling you that preprocessing.scale only wants to work with float values.
The real question, however, is what you expect to happen to begin with. preprocessing.scale centers values on the mean and normalizes the variance. This is such that measured quantities are all represented on roughly the same footing. Now, your first column tells you what dates your data correspond to, while the rest of the columns are numeric data themselves. Why would you want to normalize the dates? How would you normalize the dates?
Semantically speaking, I believe you should leave your dates alone. Whatever post-processing you're planning to perform on your numerical data, the normalized data should still be parameterized by the original dates. If you want to process your dates too, you need to come up with an explicit way to convert your dates to something numeric (say, elapsed time from a given date in given units).
So I believe you should drop your dates from your processing round altogether, and start with
X = df.drop(columns=['Date', 'High']).to_numpy()
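As a rough illustration of both suggestions, the sketch below drops the dates before scaling and separately converts them to elapsed days. It standardizes with plain NumPy (zero mean, unit variance per column), which is assumed here to match the default behaviour of preprocessing.scale; the name elapsed_days is made up.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Date": pd.to_datetime(["1993-02-11", "1993-02-12", "1993-02-16", "1993-02-17"]),
    "Open": [28.1216, 28.1804, 27.9253, 27.2974],
    "High": [28.3374, 28.1804, 27.9253, 27.3366],
    "Low": [28.1216, 28.0038, 27.2581, 27.1796],
    "Close": [28.2197, 28.0038, 27.2974, 27.2777],
    "Volume": [19500, 42500, 374800, 210900],
})

# Numeric columns only: the dates stay out of the scaling step.
X = df.drop(columns=["Date", "High"]).to_numpy(dtype=float)

# Column-wise standardization (assumed equivalent to preprocessing.scale).
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

# If the dates are needed as a feature, make them numeric explicitly,
# e.g. as days elapsed since the first observation.
elapsed_days = (df["Date"] - df["Date"].iloc[0]).dt.days
```

The scaled rows stay in the same order as df, so they can still be matched to their original dates by position.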