Qualifying time series data as year-wise data using pd.infer_freq() - pandas

How can I qualify time series data as year-wise data using pd.infer_freq()? pd.infer_freq() gives me frequency strings for months and days, but I don't get a plain "yearly" indication. How can this check be automated?
import pandas as pd
df = pd.DataFrame({'Date': ['2005-01-01', '2005-02-01', '2005-03-01', '2005-04-01'],
                   'Date1': ['2005-01-01', '2005-01-02', '2005-01-03', '2005-01-04'],
                   'Date2': ['2006-01-01', '2007-01-01', '2008-01-01', '2009-01-01'],
                   'Date3': ['2006-01-01', '2006-02-06', '2006-03-11', '2006-04-01']})
df['Date'] = pd.to_datetime(df['Date'])
df['Date1'] = pd.to_datetime(df['Date1'])
df['Date2'] = pd.to_datetime(df['Date2'])
df['Date3'] = pd.to_datetime(df['Date3'])
Output:
pd.infer_freq(df.Date)
'MS'
pd.infer_freq(df.Date1)
'D'
pd.infer_freq(df.Date2)
'AS-JAN'
But how do I get it recognised as a yearly frequency?
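One possible way to automate the check (a sketch, not part of the original question): pd.infer_freq() returns an offset alias string such as 'AS-JAN', which can be converted to an offset object and tested against the yearly offset classes. This assumes the column has already been parsed with pd.to_datetime as above.
from pandas.tseries.frequencies import to_offset
from pandas.tseries.offsets import YearBegin, YearEnd

def is_yearly(series):
    # infer_freq returns None when no regular frequency can be detected
    freq = pd.infer_freq(series)
    if freq is None:
        return False
    # 'AS-JAN', 'A-DEC', 'YS' etc. all map to a YearBegin or YearEnd offset
    return isinstance(to_offset(freq), (YearBegin, YearEnd))

print(is_yearly(df.Date2))   # True  ('AS-JAN')
print(is_yearly(df.Date))    # False ('MS')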

Related

In a pyspark.sql.dataframe.Dataframe, how to resample the "TIMESTAMP" column to daily intervals, for each unique id in the "ID" column?

The title almost says it already. I have a pyspark.sql.dataframe.DataFrame with "ID", "TIMESTAMP", "CONSUMPTION" and "TEMPERATURE" columns. I need the "TIMESTAMP" column to be resampled to daily intervals (from 15-minute intervals) and the "CONSUMPTION" and "TEMPERATURE" columns aggregated by summation. However, this needs to be performed for each unique id in the "ID" column. How do I do this?
Efficiency/speed is of importance to me. I have a huge dataframe to start with, which is why I would like to avoid .toPandas() and for loops.
Any help would be greatly appreciated!
The following code builds a Spark dataframe to play around with. input_spark_df represents the input Spark dataframe; the desired output looks like desired_outcome_spark_df.
import pandas as pd
import numpy as np
from pyspark.sql import SparkSession
df_list = []
for unique_id in ['012', '345', '678']:
    date_range = pd.date_range(pd.Timestamp('2022-12-28 00:00'), pd.Timestamp('2022-12-30 23:00'), freq='H')
    df = pd.DataFrame()
    df['TIMESTAMP'] = date_range
    df['ID'] = unique_id
    df['TEMPERATURE'] = np.random.randint(1, 10, df.shape[0])
    df['CONSUMPTION'] = np.random.randint(1, 10, df.shape[0])
    df = df[['ID', 'TIMESTAMP', 'TEMPERATURE', 'CONSUMPTION']]
    df_list.append(df)
pandas_df = pd.concat(df_list)
spark = SparkSession.builder.getOrCreate()
input_spark_df = spark.createDataFrame(pandas_df)
desired_outcome_spark_df = spark.createDataFrame(pandas_df.set_index('TIMESTAMP').groupby('ID').resample('1d').sum().reset_index())
To condense the question: how do I go from input_spark_df to desired_outcome_spark_df as efficiently as possible?
I found the answer to my own question. I first change the timestamp to "date only" using pyspark.sql.functions.to_date. Then I group by both "ID" and "TIMESTAMP" and perform the aggregation.
from pyspark.sql.functions import to_date, sum, avg, col

# Truncate the timestamp to the date, then group by "ID" and the day
spark_df = input_spark_df.withColumn('TIMESTAMP', to_date(col('TIMESTAMP')))
desired_outcome = (spark_df
                   .groupBy("ID", 'TIMESTAMP')
                   .agg(
                       sum(col("CONSUMPTION")).alias("CUMULATIVE_DAILY_POWER_CONSUMPTION"),
                       avg(col('TEMPERATURE')).alias("AVERAGE_DAILY_TEMPERATURE")
                   ))
desired_outcome.show()
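A quick sanity check (a sketch; it assumes the example DataFrames built in the question are still available in the same session): the aggregation produces one row per (ID, day) pair, so the row count should match the pandas-derived desired_outcome_spark_df.
# one row per (ID, day): 3 IDs x 3 days = 9 rows for the example data
assert desired_outcome.count() == desired_outcome_spark_df.count()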

setting pandas series row values to multiple column values

I have one dataframe object df into which I have read some data from an Excel sheet. I then added certain date columns to this df object. The df also contains stock tickers from Yahoo Finance. I now fetch two months of price history for these tickers from Yahoo Finance (roughly 60 rows) and try to assign these price values to the column headers for the relevant dates in the df object, but I am not able to do so.
In the last line of code, I am trying to set the values of "Volume", which arrive in different rows, as the column values for the respective dates in df, but it does not work. Need help. Thanks.
from datetime import date

import pandas as pd
import yfinance as yf

df = pd.read_excel(r"D:\Volume Trading\python\excel"
                   r"\Nifty-sector-cap.xlsx")

start_date = date(2022, 3, 1)  # date YYYY MM DD
end_date = date(2022, 4, 25)

# downloading the data below just to get the dates which will be columns of df
temp_data = yf.download("HDFCBANK.NS", start_date, end_date, interval='1d')["Adj Close"]
temp_data.index = temp_data.index.date

# setting the dates as column headers in df
df = df.reindex(columns=df.columns.tolist() + temp_data.index.tolist())

# putting the volume for each ticker on each date in df
for i in range(len(df)):
    temp_vol = yf.download(df["Yahoo_Symbol"].iloc[i], start_date, end_date, interval="1d")["Volume"]
    temp_vol.index = temp_vol.index.date
    # this is the assignment that does not stick
    df[temp_vol.index.tolist()].iloc[i] = temp_vol.transpose()
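A possible fix (a sketch, not tested against the original Excel file): write through .loc with the row's index label and the list of date columns, so the assignment is not a chained-indexing write on a copy. This assumes the dates downloaded for each ticker match the columns created from temp_data above.
# assign one row's volume values to the matching date columns via .loc
for i in range(len(df)):
    temp_vol = yf.download(df["Yahoo_Symbol"].iloc[i], start_date, end_date, interval="1d")["Volume"]
    temp_vol.index = temp_vol.index.date
    df.loc[df.index[i], temp_vol.index.tolist()] = temp_vol.values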

Add fractional number of years to date in pandas Python

I have a pandas df that includes two columns: time_in_years (float64) and date (datetime64).
import pandas as pd
df = pd.DataFrame({
    'date': ['2009-12-25', '2005-01-09', '2010-10-31'],
    'time_in_years': ['10.3434', '5.0977', '3.3426']
})
df['date'] = pd.to_datetime(df['date'])
df["time_in_years"] = df.time_in_years.astype(float)
I need to create date_2 as a datetime64 column by adding the number of years to the date.
I tried the following but with no luck:
df['date_2'] = df['date'] + datetime.timedelta(years=df['time_in_years'])
I know that with fractions I will not be able to get the exact date, but I want to get as close to the true new date as possible.
Try the dateutil package:
from dateutil.relativedelta import relativedelta
First convert the fractional years to a number of days, then use a lambda function and apply it to the dataframe:
df['date_2'] = df.apply(lambda x: x['date'] + relativedelta(days = int(x['time_in_years']*365)), axis = 1)
Result:
        date  time_in_years     date_2
0 2009-12-25        10.3434 2020-04-26
1 2005-01-09         5.0977 2010-02-12
2 2010-10-31         3.3426 2014-03-04
datetime.timedelta also works fine:
import datetime
df['date_2'] = df.apply(lambda x: x['date'] + datetime.timedelta(days=int(x['time_in_years'] * 365)), axis=1)
Note that the year fraction is expressed in days because relativedelta does not accept fractional years or months; the int() cast here simply rounds down to whole days.
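A vectorized alternative (a sketch; it approximates a year as 365.25 days instead of walking the calendar): pd.to_timedelta accepts fractional values, so the per-row apply can be avoided entirely.
df['date_2'] = df['date'] + pd.to_timedelta(df['time_in_years'] * 365.25, unit='D')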

Create datetime from columns in a DataFrame

I got a DataFrame with these columns :
year month day gender births
I'd like to create a new "Date"-type column based on the year, month and day columns, formatted as "yyyy-mm-dd".
I'm just beginning in Python and I just can't figure out how to proceed...
Assuming you are using pandas to create your dataframe, you can try:
>>> import pandas as pd
>>> df = pd.DataFrame({'year':[2015,2016],'month':[2,3],'day':[4,5],'gender':['m','f'],'births':[0,2]})
>>> df['dates'] = pd.to_datetime(df.iloc[:,0:3])
>>> df
   year  month  day gender  births      dates
0  2015      2    4      m       0 2015-02-04
1  2016      3    5      f       2 2016-03-05
Taken from the example here and the slicing (iloc use) "Selection" section of "10 minutes to pandas" here.
You can use .assign
For example:
df2 = df.assign(ColumnDate = df.Column1.astype(str) + '-' + df.Column2.astype(str) + '-' + df.Column3.astype(str))
It is simple and it is much faster than lambda if you have tonnes of data.
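For the question's actual columns, a small sketch (assuming they are named year, month and day as in the first answer) that combines .assign with pd.to_datetime, which accepts a frame of year/month/day columns directly:
df = df.assign(dates=pd.to_datetime(df[['year', 'month', 'day']]))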

How do I use ffill with a MultiIndex

I asked (and answered) a question here Pandas ffill resampled data grouped by column where I wanted to know how to ffill a date range for each unique entry for a column (my assets column).
My solution requires that the asset "id" is a column. However, the data makes more sense to me as a MultiIndex, and I would also like more fields in the MultiIndex. Is the only way to fill forward to drop the non-date fields from the MultiIndex before ffilling?
A modified version of my example (reworked for a df with a MultiIndex):
from datetime import datetime, timedelta

import pandas as pd
import pytz

some_time = datetime(2018, 4, 2, 20, 20, 42)
start_date = datetime(some_time.year, some_time.month, some_time.day).astimezone(pytz.timezone('Europe/London'))
end_date = start_date + timedelta(days=1)
start_date = start_date + timedelta(hours=some_time.hour, minutes=(0 if some_time.minute < 30 else 30))

df = pd.DataFrame(['A', 'B'], columns=['asset_id'])
df2 = df.copy()
df['datetime'] = start_date
df2['datetime'] = end_date
df['some_property'] = 0
df.loc[df['asset_id'] == 'B', 'some_property'] = 2
# DataFrame.append was removed in pandas 2.0; pd.concat does the same here
df = pd.concat([df, df2]).set_index(['asset_id', 'datetime'])
With what is arguably my crazy solution here:
df = df.reset_index()
df = df.set_index('datetime').groupby('asset_id').resample('30T').ffill().drop('asset_id',axis=1)
df = df.reset_index().set_index(['asset_id','datetime'])
Can I avoid all that re-indexing?
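One possible alternative that keeps the MultiIndex (a sketch, assuming a single value column such as some_property above): unstack the asset level into columns, resample and ffill on the plain DatetimeIndex, then stack back.
out = (df['some_property']
       .unstack('asset_id')        # assets become columns, index is the datetimes
       .resample('30T').ffill()    # upsample to 30-minute rows and fill forward
       .stack('asset_id')          # back to a (datetime, asset_id) MultiIndex
       .swaplevel()                # -> (asset_id, datetime)
       .sort_index()
       .to_frame('some_property'))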