How do I use ffill with a MultiIndex - pandas

I asked (and answered) a question here, "Pandas ffill resampled data grouped by column", where I wanted to know how to ffill a date range for each unique entry in a column (my assets column).
My solution requires that the asset "id" is a column. However, the data makes more sense to me as a MultiIndex, and I would like more fields in the MultiIndex. Is dropping the non-date fields from the MultiIndex before ffilling the only way to fill forward?
Here is a modified version of my example that works on a df with a MultiIndex:
from datetime import datetime, timedelta
import pandas as pd
import pytz
some_time = datetime(2018,4,2,20,20,42)
start_date = datetime(some_time.year,some_time.month,some_time.day).astimezone(pytz.timezone('Europe/London'))
end_date = start_date + timedelta(days=1)
start_date = start_date + timedelta(hours=some_time.hour,minutes=(0 if some_time.minute < 30 else 30 ))
df = pd.DataFrame(['A','B'],columns=['asset_id'])
df2=df.copy()
df['datetime'] = start_date
df2['datetime'] = end_date
df['some_property']=0
df.loc[df['asset_id']=='B','some_property']=2
# DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent
df = pd.concat([df, df2]).set_index(['asset_id','datetime'])
And here is my arguably crazy solution:
df = df.reset_index()
df = df.set_index('datetime').groupby('asset_id').resample('30T').ffill().drop('asset_id',axis=1)
df = df.reset_index().set_index(['asset_id','datetime'])
Can I avoid all that re-indexing?
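One possible way to keep the MultiIndex throughout is to reindex onto the full (asset, time) grid and then forward-fill within each asset. This is a minimal sketch, not a confirmed answer from the thread: the small df below is a stand-in for the one built above, and it assumes every asset should cover the same overall date range.

```python
import pandas as pd

# Stand-in for the question's frame: one start row and one end row per asset.
times = pd.to_datetime(["2018-04-02 20:00", "2018-04-03 00:00"])
df = pd.DataFrame(
    {"some_property": [0.0, None, 2.0, None]},
    index=pd.MultiIndex.from_product(
        [["A", "B"], times], names=["asset_id", "datetime"]
    ),
)

# Build the complete grid of every asset at every 30-minute step.
full_index = pd.MultiIndex.from_product(
    [
        df.index.get_level_values("asset_id").unique(),
        pd.date_range(times.min(), times.max(), freq="30min"),
    ],
    names=["asset_id", "datetime"],
)

# Reindex introduces NaN rows; ffill per asset keeps values from leaking
# from one asset into the next.
filled = df.reindex(full_index).groupby(level="asset_id").ffill()
print(filled)
```

With more index levels, the same idea should apply by grouping on all non-date levels, for example groupby(level=["asset_id", "other_level"]), where other_level is a hypothetical name.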

Related

Why do I get NaN after a left join?

I am doing the Udacity ML course. After df_final.join(df_temp, how="left") I get NaN, but in the course venv everything works. Where might the problem be?
P.S.: I also tried df_temp.index = pd.to_datetime(df_temp.index, utc=True) for each frame, but it seems to have no effect.
Here we load the data.
import os
import pandas as pd
import yfinance as yf
tickets = ["AAPL", "AMD", "GOOG", "GLD"]
def download_tickets(tickets):
    # Download the full price history of each ticker and save it as CSV
    for ticket in tickets:
        df = yf.Ticker(ticket)
        df = df.history(period="max")
        df.to_csv(symbol_to_path(ticket))
Here we create the path to the CSV file from the symbol.
def symbol_to_path(symbol, base_dir="data"):
    if not os.path.exists(base_dir):
        os.mkdir(base_dir)
    return os.path.join(base_dir, "{}.csv".format(str(symbol)))
Here we join the data.
# Create empty df with specified dates.
start_date = "2022-01-01"
end_date = "2023-01-01"
dates = pd.date_range(start_date, end_date)
df_final = pd.DataFrame(index=dates)
df_final.index = pd.to_datetime(df_final.index, utc=True)
# Combine all with df_final
for symbol in tickets:
    file_path = symbol_to_path(symbol)
    df_temp = pd.read_csv(file_path, parse_dates=True, index_col="Date",
                          usecols=["Date", "Close"], na_values=["nan"])
    df_temp = df_temp.rename(columns={"Close": symbol})
    df_final = df_final.join(df_temp, how="left")
    print(df_temp.head())
    print(df_final.head())
return df_final  # the enclosing function definition was not shown in the post
Output:
As you can see, the float values become NaN with a left join.
With a right join we get data, but not for the range 2022-01-01 to 2023-01-01.
Inner join
Outer join
Thank you.
UPD: Data after 2021
The problem is the time zones. The ticker data is at offset -05:00 (New York, I assume), while you generate df_final in UTC (+00:00). When you join, pandas finds no matching timestamps between the two indices, so every joined value is NaN.
The simplest solution for me was to change the time zone (tz) of df_final, i.e. generate it with the correct tz:
# Create empty df with specified dates.
start_date = "2022-01-01"
end_date = "2023-01-01"
dates = pd.date_range(start_date, end_date, tz='-05:00') # change here
df_final = pd.DataFrame(index=dates)
# df_final.index = pd.to_datetime(df_final.index, utc=True) # NOT needed anymore
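A fixed -05:00 offset may stop matching once the exchange switches to daylight saving time (-04:00), so another option is to drop the time zone from the price index and join on naive dates. The sketch below is only an illustration under assumptions: the two timestamps and the values 1.0 and 2.0 are made up, and "America/New_York" is assumed to be the zone of the downloaded data.

```python
import pandas as pd

# Made-up prices stamped at local midnight in New York (winter and summer).
raw = ["2022-01-03 00:00:00-05:00", "2022-07-01 00:00:00-04:00"]
idx = pd.to_datetime(raw, utc=True).tz_convert("America/New_York")
prices = pd.DataFrame({"AAPL": [1.0, 2.0]}, index=idx)

# Joining against UTC midnights finds no common instants: all NaN.
utc_frame = pd.DataFrame(index=pd.date_range("2022-01-01", "2022-07-05", tz="UTC"))
mismatched = utc_frame.join(prices, how="left")

# Dropping the time zone keeps the local wall time (midnight), which lines up
# with a naive daily date_range in both winter and summer.
naive_prices = prices.copy()
naive_prices.index = naive_prices.index.tz_localize(None)
naive_frame = pd.DataFrame(index=pd.date_range("2022-01-01", "2022-07-05"))
joined = naive_frame.join(naive_prices, how="left")

print(mismatched["AAPL"].notna().sum(), joined["AAPL"].notna().sum())
```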

In a pyspark.sql.dataframe.DataFrame, how to resample the "TIMESTAMP" column to daily intervals, for each unique id in the "ID" column?

The title almost says it already. I have a pyspark.sql.dataframe.DataFrame with "ID", "TIMESTAMP", "CONSUMPTION" and "TEMPERATURE" columns. I need the "TIMESTAMP" column resampled to daily intervals (from 15-minute intervals) and the "CONSUMPTION" and "TEMPERATURE" columns aggregated by summation. However, this needs to be done for each unique id in the "ID" column. How do I do this?
Efficiency/speed is important to me. I have a huge dataframe to start with, which is why I would like to avoid .toPandas() and for loops.
Any help would be greatly appreciated!
The following code builds a spark_df to play around with. input_spark_df represents the input Spark dataframe; the desired output is like desired_outcome_spark_df.
import pandas as pd
import numpy as np
from pyspark.sql import SparkSession
df_list = []
for unique_id in ['012', '345', '678']:
    date_range = pd.date_range(pd.Timestamp('2022-12-28 00:00'), pd.Timestamp('2022-12-30 23:00'),freq = 'H')
    df = pd.DataFrame()
    df['TIMESTAMP'] = date_range
    df['ID'] = unique_id
    df['TEMPERATURE'] = np.random.randint(1, 10, df.shape[0])
    df['CONSUMPTION'] = np.random.randint(1, 10, df.shape[0])
    df = df[['ID', 'TIMESTAMP', 'TEMPERATURE', 'CONSUMPTION']]
    df_list.append(df)
pandas_df = pd.concat(df_list)
spark = SparkSession.builder.getOrCreate()
input_spark_df = spark.createDataFrame(pandas_df)
desired_outcome_spark_df = spark.createDataFrame(pandas_df.set_index('TIMESTAMP').groupby('ID').resample('1d').sum().reset_index())
To condense the question: how do I go from input_spark_df to desired_outcome_spark_df as efficiently as possible?
I found the answer to my own question. I first reduce the timestamp to "date only" using pyspark.sql.functions.to_date. Then I group by both "ID" and "TIMESTAMP" and perform the aggregation.
from pyspark.sql.functions import avg, col, sum, to_date
# Truncate each timestamp to its calendar date, then group by ID and date
desired_outcome = (input_spark_df
    .withColumn('TIMESTAMP', to_date(col('TIMESTAMP')))
    .groupBy("ID", 'TIMESTAMP')
    .agg(
        sum(col("CONSUMPTION")).alias("CUMULATIVE_DAILY_POWER_CONSUMPTION"),
        avg(col('TEMPERATURE')).alias("AVERAGE_DAILY_TEMPERATURE")
    ))
desired_outcome.show()  # show() works in any PySpark session; display() is Databricks-specific

setting pandas series row values to multiple column values

I have a dataframe object df containing data read from an Excel sheet, to which I have added certain date columns. The df also holds stock tickers from Yahoo Finance. I now fetch two months of price history for each ticker from Yahoo Finance (about 60 rows) and try to assign those values to the columns of df headed by the matching dates, but I cannot get it to work.
In the last line of the code I try to set the "Volume" values, which come back as separate rows, as the column values for the respective dates in df, but the assignment fails. I need help. Thanks.
df = pd.read_excel(r"D:\Volume Trading\python\excel"
r"\Nifty-sector-cap.xlsx")
start_date = date(2022,3,1) # Date YYYY MM DD
end_date = date(2022,4,25)
## downloading below data just to get dates which will be columns of df.
temp_data = yf.download("HDFCBANK.NS", start_date, end_date, interval = '1d', index = False)["Adj Close"]
temp_data.index = temp_data.index.date
# setting the dates as columns header in df
df = df.reindex(columns = df.columns.tolist() + temp_data.index.tolist())
i = 0
# putting the volume for each ticker on each date in df
for i in range(0,len(df)):
    temp_vol = yf.download(df["Yahoo_Symbol"].iloc[i], start_date, end_date, interval ="1d")["Volume"]
    temp_vol.index = temp_vol.index.date
    # .loc writes into df itself; df[cols].iloc[i] = ... would only modify a temporary copy
    df.loc[df.index[i], temp_vol.index.tolist()] = temp_vol.to_numpy()
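The row-wise assignment can be tried without a network call. The following is a minimal sketch under assumptions: the tickers AAA.NS and BBB.NS, the three dates, and the volume numbers are all made up, and the volumes dict stands in for what yf.download(...)["Volume"] would return per ticker.

```python
import datetime as dt
import pandas as pd

# Hypothetical tickers standing in for the Excel sheet's Yahoo_Symbol column.
df = pd.DataFrame({"Yahoo_Symbol": ["AAA.NS", "BBB.NS"]})
dates = [dt.date(2022, 3, 1), dt.date(2022, 3, 2), dt.date(2022, 3, 3)]

# Add one (initially empty) column per date.
df = df.reindex(columns=df.columns.tolist() + dates)

# Made-up volumes in place of the downloaded data; BBB.NS lacks the last date.
volumes = {
    "AAA.NS": pd.Series([100, 110, 120], index=dates),
    "BBB.NS": pd.Series([200, 220], index=dates[:2]),
}

for row_label, symbol in df["Yahoo_Symbol"].items():
    # Align to the date columns so missing days become NaN instead of shifting.
    vol = volumes[symbol].reindex(dates)
    # A single .loc call with (row, columns) writes into df itself.
    df.loc[row_label, dates] = vol.to_numpy(dtype=float)

print(df)
```

Reindexing each series to the date columns before assigning keeps the values lined up even when a ticker did not trade on some of the days.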

Add business days to pandas dataframe with dates and skip over holidays python

I have a dataframe with dates, as in the tables below. The first block shows what the result should look like; the second shows what I get when I simply add the BDays. I want to take the first column and add 5 business days to each date, but if those 5 business days span a holiday (like 15 Feb '21), I need to add one additional day. Adding 5 business days with BDay (from pandas.tseries.offsets import BDay) is fairly simple, but I cannot skip the holidays while working with the dataframe.
I have tried USFederalHolidayCalendar (from pandas.tseries.holiday import USFederalHolidayCalendar) and the workdays and workalendar modules, but cannot figure it out. Does anyone have an idea what I can do?
Correct Example
DATE        EXIT DATE +5
2021/02/09  2021/02/17
2021/02/10  2021/02/18
Wrong Example
DATE        EXIT DATE +5
2021/02/09  2021/02/16
2021/02/10  2021/02/17
Here are some examples of code I tried:
import pandas as pd
from workdays import workday
...
df['DATE'] = workday(df['EXIT DATE +5'], days=5, holidays=holidays)
Next Example:
import pandas as pd
from pandas.tseries.holiday import USFederalHolidayCalendar
bday_us = pd.offsets.CustomBusinessDay(calendar=USFederalHolidayCalendar())
dt = df['DATE']
df['EXIT DATE +5'] = dt + bday_us
=========================================
Final code:
Below is the code I finally settled on. I had to define the holidays manually to match the days the NYSE actually trades, such as the day President Bush was laid to rest.
import datetime as dt
import pandas as pd
from pandas.tseries.holiday import USFederalHolidayCalendar
from pandas.tseries.offsets import BDay
from pandas.tseries.holiday import AbstractHolidayCalendar, Holiday, nearest_workday, \
USMartinLutherKingJr, USPresidentsDay, GoodFriday, USMemorialDay, \
USLaborDay, USThanksgivingDay
class USTradingCalendar(AbstractHolidayCalendar):
    rules = [
        Holiday('NewYearsDay', month=1, day=1, observance=nearest_workday),
        USMartinLutherKingJr,
        USPresidentsDay,
        GoodFriday,
        USMemorialDay,
        Holiday('USIndependenceDay', month=7, day=4, observance=nearest_workday),
        Holiday('BushDay', year=2018, month=12, day=5),
        USLaborDay,
        USThanksgivingDay,
        Holiday('Christmas', month=12, day=25, observance=nearest_workday)
    ]
offset = 5
df = pd.DataFrame(['2019-10-11', '2019-10-14', '2017-04-13', '2018-11-28', '2021-07-02'], columns=['DATE'])
df['DATE'] = pd.to_datetime(df['DATE'])
def offset_date(start, offset):
    return start + pd.offsets.CustomBusinessDay(n=offset, calendar=USTradingCalendar())
df['END'] = df.apply(lambda x: offset_date(x['DATE'], offset), axis=1)
print(df)
Input data
df = pd.DataFrame(['2021-02-09', '2021-02-10', '2021-06-28', '2021-06-29', '2021-07-02'], columns=['DATE'])
df['DATE'] = pd.to_datetime(df['DATE'])
Suggested solution using apply
from pandas.tseries.holiday import USFederalHolidayCalendar
from pandas.tseries.offsets import BDay
def offset_date(start, offset):
    return start + pd.offsets.CustomBusinessDay(n=offset, calendar=USFederalHolidayCalendar())
offset = 5
df['END'] = df.apply(lambda x: offset_date(x['DATE'], offset), axis=1)
DATE        END
2021-02-09  2021-02-17
2021-02-10  2021-02-18
2021-06-28  2021-07-06
2021-06-29  2021-07-07
2021-07-02  2021-07-12
PS: If you want to use a particular calendar such as the NYSE, instead of the default USFederalHolidayCalendar, I recommend following the instructions on this answer, about creating a custom calendar.
Alternative solution which I do not recommend
Currently, to the best of my knowledge, pandas does not support a vectorized approach to your problem. But if you want to follow an approach similar to the one you mentioned, here is what you should do.
First, you will have to define an arbitrary far away end date that includes all the periods you might need and use it to create a list of holidays.
holidays = USFederalHolidayCalendar().holidays(start='2021-02-09', end='2030-02-09')
Then, you pass the holidays list to CustomBusinessDay through the holidays parameter instead of the calendar to generate the desired offset.
offset = 5
bday_us = pd.offsets.CustomBusinessDay(n=offset, holidays=holidays)
df['END'] = df['DATE'] + bday_us
However, this type of approach is not a truly vectorized solution, even though it might look like one. See the following SO answer for further clarification. Under the hood, this approach is probably doing a conversion that is not efficient. This is why it yields the following warning.
PerformanceWarning: Non-vectorized DateOffset being applied to Series
or DatetimeIndex
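If a vectorized computation matters, NumPy's np.busday_offset is one alternative outside pandas. This is a sketch, not part of the answer above: it assumes a fixed holiday window (all of 2021) and that every start date is itself a business day. For non-business start dates, roll="forward" first moves to the next business day and then counts 5 more, which can differ from CustomBusinessDay.

```python
import numpy as np
import pandas as pd
from pandas.tseries.holiday import USFederalHolidayCalendar

df = pd.DataFrame({
    "DATE": pd.to_datetime(
        ["2021-02-09", "2021-02-10", "2021-06-28", "2021-06-29", "2021-07-02"]
    )
})

# Holidays over a window assumed wide enough to cover every start date plus offset.
holidays = USFederalHolidayCalendar().holidays(start="2021-01-01", end="2021-12-31")

# busday_offset works on day-resolution datetime64 arrays and handles the
# whole column in one call.
end = np.busday_offset(
    df["DATE"].to_numpy().astype("datetime64[D]"),
    5,
    roll="forward",
    holidays=holidays.to_numpy().astype("datetime64[D]"),
)
df["END"] = pd.to_datetime(end)
print(df)
```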
Here's one way to do it
import pandas as pd
from pandas.tseries.holiday import USFederalHolidayCalendar
from datetime import timedelta as td
def get_exit_date(date):
    holiday_list = cals.holidays(start=date, end=date + td(weeks=2)).tolist()
    # 6 periods since start date is included in set
    n_bdays = pd.bdate_range(start=date, periods=6, freq='C', holidays=holiday_list)
    return n_bdays[-1]
df = pd.read_clipboard()
cals = USFederalHolidayCalendar()
# I would convert this to datetime
df['DATE'] = pd.to_datetime(df['DATE'])
df['EXIT DATE +5'] = df['DATE'].apply(get_exit_date)
This uses bdate_range, which returns a DatetimeIndex.
Results:
DATE EXIT DATE +5
0 2021-02-09 2021-02-17
1 2021-02-10 2021-02-18
Another option: instead of creating the holiday list dynamically, choose a start date and build the list once outside the function, like so:
def get_exit_date(date):
    # 6 periods since start date is included in set
    n_bdays = pd.bdate_range(start=date, periods=6, freq='C', holidays=holiday_list)
    return n_bdays[-1]
df = pd.read_clipboard()
cals = USFederalHolidayCalendar()
holiday_list = cals.holidays(start='2021-01-01').tolist()
# I would convert this to datetime
df['DATE'] = pd.to_datetime(df['DATE'])
df['EXIT DATE +5'] = df['DATE'].apply(get_exit_date)

Add fractional number of years to date in pandas Python

I have a pandas df that includes two columns: time_in_years (float64) and date (datetime64).
import pandas as pd
df = pd.DataFrame({
'date': ['2009-12-25','2005-01-09','2010-10-31'],
'time_in_years': ['10.3434','5.0977','3.3426']
})
df['date'] = pd.to_datetime(df['date'])
df["time_in_years"] = df.time_in_years.astype(float)
I need to create date_2 as a datetime64 column by adding the number of years to the date.
I tried the following but with no luck:
df['date_2'] = df['date'] + datetime.timedelta(years=df['time_in_years'])
I know that with fractions I will not be able to get the exact date, but I want to get the closest possible new date.
Try package dateutil:
from dateutil.relativedelta import relativedelta
First convert fractional years to number of days, then use lambda function and apply it to dataframe:
df['date_2'] = df.apply(lambda x: x['date'] + relativedelta(days = int(x['time_in_years']*365)), axis = 1)
Result:
date time_in_years date_2
0 2009-12-25 10.3434 2020-04-26
1 2005-01-09 5.0977 2010-02-12
2 2010-10-31 3.3426 2014-03-04
datetime.timedelta also works fine:
df['date_2'] = df.apply(lambda x: x['date'] + datetime.timedelta(days = int(x['time_in_years']*365)), axis = 1)
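Please note that int() truncates the result to whole days. The year-to-day conversion is needed because relativedelta rejects non-integer years and months, and timedelta has no years argument at all. datetime.timedelta itself does accept fractional days, so the int() call is optional there.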
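The same calculation can also be done on the whole column at once with pd.to_timedelta, avoiding apply. This is a minimal sketch that keeps the 365-days-per-year approximation and the truncation to whole days used above, so it should reproduce the same dates.

```python
import pandas as pd

df = pd.DataFrame({
    "date": pd.to_datetime(["2009-12-25", "2005-01-09", "2010-10-31"]),
    "time_in_years": [10.3434, 5.0977, 3.3426],
})

# Convert fractional years to whole days (truncating), then add as a timedelta.
whole_days = (df["time_in_years"] * 365).astype(int)
df["date_2"] = df["date"] + pd.to_timedelta(whole_days, unit="D")
print(df)
```

Using 365.25 instead of 365 would track leap years a little more closely; which approximation is appropriate depends on how the fractional years were produced.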