Efficiently label each row in a large dataframe - pandas

I have a file containing dates from June 2015 + 365 days. I am using this as a lookup table because there are custom business dates (only certain holidays are observed, and there are some no-work dates for internal reasons). Using the custom business day offsets was just too slow with 3.5 million records.
initial_date   day_1        day_2        day_3        ...   day_365
2015-06-01     2015-06-02   2015-06-03   2015-06-04
2015-06-02     2015-06-03   2015-06-04   2015-06-05
The idea is to 'tag' each row in the data based on the number of (custom) business dates since specific entries. Is there a better way to do this?
For example, if a new entry happens on 2016-06-28 then this is labeled as 'initial_date'. day_1 is tomorrow, day_2 is the next day, day_3 is Friday, and day_4 is next Monday.
My naive way of doing this is to create a loop which basically does this:
df['day_1_label'] = np.where(df.date == df.day_1, 'Day 1', '')
df['day_2_label'] = np.where(df.date == df.day_2, 'Day 2', '')
# ... one line per day_ column ...
df['day_label'] = df.day_1_label + df.day_2_label + ...
This would result in one label per row which I could then aggregate or plot. Eventually this would be used for forecasting. initial_date + the subset of customers from previous dates = total customers
Also, depending on what events happen, a subset of the customers would have a certain event occur in the future. I want to know how many business days after the initial_date this happens, on average. So if we have 100 customers today, a certain percent will have an event next July 15th or whatever it might be.
Edit- pastebin with some sample data:
http://pastebin.com/ZuE1Q2KJ
So the output I am looking for is the day_label column. This basically checks whether the date equals each date from day_0 through day_n. Is there a better way to fill that in? I am just trying to match the date of each row with a value in one of the day_ columns.

I think this can be done more efficiently if your data looks the way I think it looks.
Say you have a calendar of dates (June 2015 + 365 days) and a data frame as for example:
import numpy as np
import pandas as pd

cal = ['2015-06-01', '2015-06-02', '2015-06-03', '2015-06-04', '2015-06-08',
       '2015-06-09']
df = pd.DataFrame({'date': ['2015-06-04', '2015-06-09', '2015-06-09'],
                   'initial_date': ['2015-06-02', '2015-06-02', '2015-06-01']})
Keeping dates in numpy.datetime types is more efficient, so I convert:
df['date'] = df['date'].astype(np.datetime64)
df['initial_date'] = df['initial_date'].astype(np.datetime64)
Now, let's turn cal into a fast lookup table:
# converting types:
cal = np.asarray(cal, dtype=np.datetime64)
# lookup series
s_cal = pd.Series(range(len(cal)), index=cal)
# a convenience lookup function
def lookup(dates):
    return s_cal[dates].reset_index(drop=True)
EDIT:
The above lookup returns a series with a RangeIndex, which may be different from the index of df (this causes problems when assigning the result to a column). So it's probably better to rewrite this lookup so that it returns a plain numpy array:
def lookup2(dates):
    return s_cal[dates].values
It's possible to set the correct index in lookup (taken from the input dates), but such a solution would be less flexible (it would require the input to be a Series) or unnecessarily complicated. A sketch of that variant follows below.
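A minimal sketch of that Series-only variant, for illustration (the name lookup3 is made up):

def lookup3(dates):
    # assumes `dates` is a pandas Series; preserves the caller's index
    return pd.Series(s_cal[dates].values, index=dates.index)

With this, the result aligns with df on assignment, at the cost of only accepting Series input.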
The series:
s_cal
Out[220]:
2015-06-01 0
2015-06-02 1
2015-06-03 2
2015-06-04 3
2015-06-08 4
2015-06-09 5
dtype: int64
and how it works:
lookup(df.date)
Out[221]:
0 3
1 5
2 5
dtype: int64
lookup2(df.date)
Out[36]: array([3, 5, 5])
The rest is straightforward. This adds a column of integers (day differences) to the data frame:
df['day_label'] = lookup2(df.date) - lookup2(df.initial_date)
and if you want to convert it to labels:
df['day_label'] = 'day ' + df['day_label'].astype(str)
df
Out[225]:
date initial_date day_label
0 2015-06-04 2015-06-02 day 2
1 2015-06-09 2015-06-02 day 4
2 2015-06-09 2015-06-01 day 5
Hope this is what you wanted.
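As a side note: if the custom calendar can be expressed as a weekmask plus a list of no-work dates, NumPy's vectorized busday_count gives the day differences directly, without building the lookup series. A hedged sketch (the holidays list here is made up for illustration):

import numpy as np
import pandas as pd

# hypothetical no-work dates for illustration
holidays = np.array(['2015-06-05'], dtype='datetime64[D]')

start = df['initial_date'].values.astype('datetime64[D]')
end = df['date'].values.astype('datetime64[D]')

# number of custom business days between initial_date and date
day_numbers = np.busday_count(start, end, holidays=holidays)
df['day_label'] = 'day ' + pd.Series(day_numbers, index=df.index).astype(str)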

Related

Count rows in a dataframe where date is in the past 7 days

I have this dataframe with a "date" column in it. I want to count the rows where the date is in the past 7 days. What's the best way to do this? I feel like using an If and a counter isn't very pandas-esque.
Also, I'm importing the data from a SQL db. Should I just load it already filtered with a query? What's the most efficient way?
Consider a dataframe something like this:
df = pd.DataFrame({'date': ['2021-12-03', '2021-12-02', '2021-12-01', '2021-11-30'], 'data': [1, 2, 3, 4]})
date data
0 2021-12-03 1
1 2021-12-02 2
2 2021-12-01 3
3 2021-11-30 4
if you want to filter the data between dates 2021-11-30 and 2021-12-02, you can use the following command:
df_filtered = df.set_index('date').loc['2021-12-02':'2021-11-30'].reset_index()
date data
0 2021-12-02 2
1 2021-12-01 3
2 2021-11-30 4
In the first step, you set the date as the index, and after that use the .loc method to filter the desired dates. Finally, you can count the rows with len(df_filtered).
My suggestion:
First, calculate the interval datetimes: today and past 7 days.
import datetime
today = datetime.date.today()
past7 = today - datetime.timedelta(days=7)
Use them to filter your dataframe (making sure the date column holds dates, not strings):
df['date'] = pd.to_datetime(df['date']).dt.date
df_filtered = df[(df['date'] >= past7) & (df['date'] <= today)]
Get the df_filtered length:
print(len(df_filtered))
try:
len(df[df['date'] > datetime.date.today() - pd.to_timedelta("7day")])
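On the SQL part of the question: pushing the filter into the query is usually most efficient, since rows outside the window never leave the database. A minimal sketch, assuming a table named events with a date column and an open connection named conn (all three names are made up; the placeholder style depends on your driver):

import datetime
import pandas as pd

past7 = datetime.date.today() - datetime.timedelta(days=7)
query = "SELECT * FROM events WHERE date >= %(past7)s"
df = pd.read_sql(query, con=conn, params={'past7': past7})
print(len(df))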

Pandas Period to to_timestamp giving me TypeError

I have a Pandas Dataframe of the format as shown below:
Month Count
2021-02 100
2021-03 200
Where the "Month" column is obtained from a timestamp using dt.to_period('M').
Now I have to convert this "Month" column to a fiscal quarter and I have used some approaches to convert the Period to a "datetime" object using "to_timestamp", but I get the error
TypeError: unsupported Type Int64Index
Is there another way to approach this problem?
If working with a column, it is necessary to add .dt. If it is omitted, pandas tries to convert the DatetimeIndex and, if one does not exist, raises an error, because DataFrame.to_timestamp is called instead of Series.dt.to_timestamp:
df['Date'] = df['Month'].to_timestamp()
TypeError: unsupported Type RangeIndex
df['Date'] = df['Month'].dt.to_timestamp()
print (df)
Month Count Date
0 2021-02 100 2021-02-01
1 2021-03 200 2021-03-01
The solution for the fiscal quarter's year is to use Series.dt.qyear (it is only sparsely documented):
df['fquarter'] = df['Month'].dt.qyear
print (df)
Month Count fquarter
0 2021-02 100 2021
1 2021-03 200 2021
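If the fiscal quarter itself is needed rather than only its year, Series.dt.asfreq can convert each monthly period to a quarterly one. A sketch assuming an April-to-March fiscal year (quarters ending in March, i.e. 'Q-MAR'; adjust the anchor month to your fiscal calendar):

# each month mapped to its fiscal quarter, e.g. 2021-02 -> 2021Q4 under Q-MAR
df['fq'] = df['Month'].dt.asfreq('Q-MAR')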

Facebook Prophet Future Dataframe

I have the last 5 years of monthly data, which I am using to create a forecasting model with fbprophet. The last 5 rows of my data are as follows:
data1['ds'].tail()
Out[86]:
55    2019-01-08
56    2019-01-09
57    2019-01-10
58    2019-01-11
59    2019-01-12
I have created the model on this and made a future prediction dataframe.
model = Prophet(
    interval_width=0.80,
    growth='linear',
    daily_seasonality=False,
    weekly_seasonality=False,
    yearly_seasonality=True,
    seasonality_mode='additive'
)
# fit the model to data
model.fit(data1)
future_data = model.make_future_dataframe(periods=4, freq='m', include_history=True)
After December 2019, I need the first four months of the next year, but it's adding the next 4 months within the same year, 2019.
future_data.tail()
ds
59 2019-01-12
60 2019-01-31
61 2019-02-28
62 2019-03-31
63 2019-04-30
How do I get the first 4 months of the next year in the future dataframe? Is there a specific parameter to adjust the year?
The issue is the date format: 2019-01-12 (December 2019, as per your question) is in the format "%Y-%d-%m".
Prophet therefore parses it as January 12, 2019 and creates data with month-end frequency (stated by 'm') for the next 4 periods from there.
Just for reference this is how the future dataframe is created by Prophet:
dates = pd.date_range(
    start=last_date,
    periods=periods + 1,  # An extra in case we include start
    freq=freq)
dates = dates[dates > last_date]  # Drop start if equals last_date
dates = dates[:periods]  # Return correct number of periods
Hence, it infers the date format and extrapolates in the future dataframe.
Solution: change the date format in the training data to "%Y-%m-%d", for example as sketched below.
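A minimal sketch of that conversion, assuming the ds column currently holds strings in the swapped "%Y-%d-%m" form:

data1['ds'] = pd.to_datetime(data1['ds'], format='%Y-%d-%m')

After this, the last training date is parsed as 2019-12-01, and make_future_dataframe extrapolates into 2020 as expected.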
I stumbled here while searching for the appropriate frequency string for minutes.
As per the docs, the datetime needs to be in YYYY-MM-DD format:
The input to Prophet is always a dataframe with two columns: ds and y. The ds (datestamp) column should be of a format expected by Pandas, ideally YYYY-MM-DD for a date or YYYY-MM-DD HH:MM:SS for a timestamp. The y column must be numeric, and represents the measurement we wish to forecast.
2019-01-12 in YYYY-MM-DD is 2019-12-01; using this:
>>> dates = pd.date_range(start='2019-12-01',periods=4 + 1,freq='M')
>>> dates
DatetimeIndex(['2019-12-31', '2020-01-31', '2020-02-29', '2020-03-31',
'2020-04-30'],
dtype='datetime64[ns]', freq='M')
Other frequency strings are listed here (they are not given explicitly for Python in the Prophet docs):
https://pandas.pydata.org/docs/reference/api/pandas.tseries.frequencies.to_offset.html
>>> dates = pd.date_range(start='2022-03-17 11:40:00', periods=10 + 1, freq='min')
>>> dates
DatetimeIndex(['2022-03-17 11:40:00', '2022-03-17 11:41:00',
               '2022-03-17 11:42:00', '2022-03-17 11:43:00',
               ...],
              dtype='datetime64[ns]', freq='T')

Is there a way to fix or bypass weird time formats in a specific column in a dataframe?

I am working with a SLURM dataset in Pandas that has time formats like so in the 'Elapsed' column:
00:00:00
00:26:51
However, sometimes there are entries greater than 24 hours, which are displayed like so:
1-00:02:00
3-01:25:02
I want to find the mean of the entire column, but the to_timedelta conversion mishandles entries above 24 hours like those shown above. One example:
Before to_timedelta: 3-01:25:02
After to_timedelta: -13 days +10:34:58
I cannot simply convert the column into a new format, because entries not greater than 24 hours have no leading day component (e.g. 0-20:00:00); otherwise that approach would be easiest, I believe.
Is there a way to fix this conversion or any other ideas on approaching this?
One way around it is to replace the - with ' days ':
pd.to_timedelta(df['time'].str.replace('-', ' days ', regex=False))
Output (for 4 lines above):
0 0 days 00:00:00
1 0 days 00:26:51
2 1 days 00:02:00
3 3 days 01:25:02
Name: time, dtype: timedelta64[ns]
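From there, the column mean the question asks for is a one-liner:

elapsed = pd.to_timedelta(df['time'].str.replace('-', ' days ', regex=False))
print(elapsed.mean())  # 1 days 00:28:28.250000 for the four rows above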

Timeseries resample error - none of Dateindex in column pandas

Please excuse obvious errors - still in the learning process.
I am trying to do a simple timeseries plot on my data, which has a frequency of 15 minutes. The idea is to plot monthly means, starting by resampling the data every hour and including only those hourly means that have at least 1 observation in the interval. There are subsequent conditions for daily and monthly means.
This would be relatively simple if this error did not crop up: "None of [DatetimeIndex(['2016-01-01 05:00:00', '2016-01-01 05:15:00',\n....2016-12-31 16:15:00'],\n dtype='datetime64[ns]', length=103458, freq=None)] are in the [columns]"
This is my code:
# Original dataframe
            Date   value
0  1/1/2016 0:00  405.22
1  1/1/2016 0:15  418.56

Date     object
value    object
dtype: object

# Conversion of 'Date' to datetime and 'value' to numeric/float values
df.Date = pd.to_datetime(df.Date, errors='coerce')
year = df.Date.dt.year
df['Year'] = df['Date'].map(lambda x: x.year)
df.value = pd.to_numeric(df.value, errors='coerce')

Date     datetime64[ns]
value           float64
Year              int64
dtype: object

                 Date   value  Year
0 2016-01-01 00:00:00  405.22  2016
1 2016-01-01 00:15:00  418.56  2016

df = df.set_index(Date)
diurnal1 = df[df['Date']].resample('h').mean().count() >= 2
**(line of error)**
diurnal_mean_1 = diurnal1.mean()[diurnal1.count() >= 1]
(the code follows)
Any help in solving the error will be appreciated.
I think you want df = df.set_index('Date'), since Date is a string column name. Also, I would move the conversions into the read/constructor step, if possible, once you get it working. A sketch follows.
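A minimal sketch of the corrected steps, keeping the question's at-least-one-observation-per-hour condition (the variable names are made up):

df['Date'] = pd.to_datetime(df['Date'], errors='coerce')
df['value'] = pd.to_numeric(df['value'], errors='coerce')
df = df.set_index('Date')  # note the quotes: 'Date' is a column name

# hourly means, keeping only hours with at least 1 observation
hourly = df['value'].resample('h').agg(['mean', 'count'])
hourly_means = hourly.loc[hourly['count'] >= 1, 'mean']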