pandas add datetime column [duplicate]

This question already has answers here:
Pandas: create timestamp from 3 columns: Month, Day, Hour
(2 answers)
How to combine multiple columns in a Data Frame to Pandas datetime format
(3 answers)
Closed 1 year ago.
I have a dataframe where year, month, day, hour is split into separate columns:
df.head()
Year Month Day Hour
0 2020 1 1 0
1 2020 1 1 1
2 2020 1 1 2
...
I'd like to add a proper datetime column to the dataframe so I end up with something along these lines:
df.head()
Year Month Day Hour datetime
0 2020 1 1 0 2020-01-01T00:00
1 2020 1 1 1 2020-01-01T01:00
2 2020 1 1 2 2020-01-01T02:00
...
I could add a loop that processes one row at a time, but that's not panda-esque.
Here are three things that don't work (not that I expected any of them to do so):
df['datetime'] = pd.to_datetime(datetime.datetime(df['Year'], df['Month'], df['Day'], df['Hour']))
df['datetime'] = pd.to_datetime(df['Year'], df['Month'], df['Day'], df['Hour'])
df['datetime'] = pd.datetime(df['Year'], df['Month'], df['Day'], df['Hour'])
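For reference, pd.to_datetime can assemble a datetime directly from a DataFrame of component columns, matching names like Year/Month/Day/Hour case-insensitively (the same idiom the second answer below uses). A minimal sketch:
import pandas as pd

df = pd.DataFrame({'Year': [2020, 2020, 2020],
                   'Month': [1, 1, 1],
                   'Day': [1, 1, 1],
                   'Hour': [0, 1, 2]})
# to_datetime recognizes the component column names and builds one datetime per row
df['datetime'] = pd.to_datetime(df[['Year', 'Month', 'Day', 'Hour']])
print(df)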

Related

extract week columns from date in pandas

I have a dataframe that has columns like these:
Date earnings workingday length_week first_wday_week last_wdayweek
01.01.2000 10000 1 1
02.01.2000 0 0 1
03.01.2000 0 0 2
04.01.2000 0 0 2
05.01.2000 0 0 2
06.01.2000 23000 1 2
07.01.2000 1000 1 2
08.01.2000 0 0 2
09.01.2000 0 0 2
..
..
..
30.01.2000 0 0 0
31.01.2000 0 1 3
01.02.2000 0 1 3
02.02.2000 2500 1 3
workingday indicates whether earnings are present on that particular day. I am trying to generate the last three columns from the date:
length_week: the number of working days in that week
first_working_day_of_week: 1 if it is the first working day of a week
last_working_day_of_week: 1 if it is the last working day of a week
Can anyone help me with this?
I first changed the format of your Date column, since pd.to_datetime couldn't infer the right date format:
df.Date = df.Date.str.replace('.', '-', regex=False)
df.Date = pd.to_datetime(df.Date, format='%d-%m-%Y')
Then use isocalendar so that we can work with weeks and days more easily:
df[['year', 'week', 'weekday']] = df.Date.dt.isocalendar()
Now length_week is just the sum of workingday for each separate week:
df['length_week'] = df.groupby(['year', 'week']).workingday.transform('sum')
and we can get frst_worday_week with idxmax, which returns the index of the first maximum per group:
import numpy as np

min_indexes = df.groupby(['year', 'week']).workingday.transform('idxmax')
df['frst_worday_week'] = np.where(df.index == min_indexes, 1, 0)
Lastly, last_workdayweek is similar but a bit tricky: we need the last occurrence of the maximum, so we reverse each week inside the groupby:
max_indexes = df.groupby(['year', 'week']).workingday.transform(lambda x: x[::-1].idxmax())
df['last_workdayweek'] = np.where(df.index == max_indexes, 1, 0)
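Putting the steps together, a minimal runnable sketch on a few rows of the sample data (names as in the snippets above):
import numpy as np
import pandas as pd

df = pd.DataFrame({'Date': ['01.01.2000', '02.01.2000', '03.01.2000', '06.01.2000', '07.01.2000'],
                   'earnings': [10000, 0, 0, 23000, 1000],
                   'workingday': [1, 0, 0, 1, 1]})
df.Date = pd.to_datetime(df.Date.str.replace('.', '-', regex=False), format='%d-%m-%Y')
df[['year', 'week', 'weekday']] = df.Date.dt.isocalendar()
df['length_week'] = df.groupby(['year', 'week']).workingday.transform('sum')
df['frst_worday_week'] = np.where(
    df.index == df.groupby(['year', 'week']).workingday.transform('idxmax'), 1, 0)
df['last_workdayweek'] = np.where(
    df.index == df.groupby(['year', 'week']).workingday.transform(lambda x: x[::-1].idxmax()), 1, 0)
print(df)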

Pandas rolling window cumsum

I have a pandas df as follows:
YEAR MONTH USERID TRX_COUNT
2020 1 1 1
2020 2 1 2
2020 3 1 1
2020 12 1 1
2021 1 1 3
2021 2 1 3
2021 3 1 4
I want to compute TRX_COUNT_SUM such that each row holds the sum of TRX_COUNT over the next 12 months (the current month plus the following 11).
So my end result would look like
YEAR MONTH USERID TRX_COUNT TRX_COUNT_SUM
2020 1 1 1 5
2020 2 1 2 7
2020 3 1 1 8
2020 12 1 1 11
2021 1 1 3 10
2021 2 1 3 7
2021 3 1 4 4
For example, TRX_COUNT_SUM for 2020/1 is 1+2+1+1=5, the total over the first 12 months.
Two areas where I am unsure how to proceed:
I tried various variations of cumsum grouped by USERID, YEAR, and MONTH, but I am running into errors handling the time window, because there may be months where a user has no transactions and these have to be accounted for. For example, the user has no rows for 2020/4 through 2020/11, yet the 12-month total starting at 2020/1 must still come out to 5.
Towards the end there will be partial years, which should just be summed as far as the data goes and left as is (like 2021/3, which stays 4).
Any thoughts on how to handle this?
Thanks!
I was able to accomplish this using a combination of NumPy arrays, pandas, and indexing:
import pandas as pd
import numpy as np
#df = your dataframe
df_dates = pd.DataFrame(np.arange(np.datetime64('2020-01-01'), np.datetime64('2021-04-01'),
                                  np.timedelta64(1, 'M'), dtype='datetime64[M]').astype('datetime64[D]'),
                        columns=['DATE'])
df_dates['YEAR'] = df_dates['DATE'].dt.year
df_dates['MONTH'] = df_dates['DATE'].dt.month
df_merge = df_dates.merge(df, how = 'left')
df_merge.replace(np.nan, 0, inplace=True)
df_merge.reset_index(inplace = True)
max_index = df_merge['index'].max()
for i in range(len(df_merge)):
    if i + 11 < max_index:
        df_merge.at[i, 'TRX_COUNT_SUM'] = df_merge.iloc[i:i + 12]['TRX_COUNT'].sum()
    elif i != max_index:
        df_merge.at[i, 'TRX_COUNT_SUM'] = df_merge.iloc[i:max_index + 1]['TRX_COUNT'].sum()
    else:
        df_merge.at[i, 'TRX_COUNT_SUM'] = df_merge.iloc[i]['TRX_COUNT']
final_df = pd.merge(df_merge, df)
Try this:
# Set the DataFrame index to a time series constructed from YEAR and MONTH
ts = pd.to_datetime(df.assign(DAY=1)[["YEAR", "MONTH", "DAY"]])
df.set_index(ts, inplace=True)
df["TRX_COUNT_SUM"] = (
    # Reindex the dataframe with every missing month in between,
    # and reverse the index so that rolling(12) means 12 months
    # forward instead of backward
    df.reindex(pd.date_range(ts.min(), ts.max(), freq="MS")[::-1])
    # Roll and sum
    .rolling(12, min_periods=1)["TRX_COUNT"].sum()
)
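A quick way to sanity-check this against the sample frame (a sketch assuming a single USERID, as in the example):
import pandas as pd

df = pd.DataFrame({"YEAR": [2020, 2020, 2020, 2020, 2021, 2021, 2021],
                   "MONTH": [1, 2, 3, 12, 1, 2, 3],
                   "USERID": [1] * 7,
                   "TRX_COUNT": [1, 2, 1, 1, 3, 3, 4]})
ts = pd.to_datetime(df.assign(DAY=1)[["YEAR", "MONTH", "DAY"]])
df.set_index(ts, inplace=True)
df["TRX_COUNT_SUM"] = (
    df.reindex(pd.date_range(ts.min(), ts.max(), freq="MS")[::-1])
    .rolling(12, min_periods=1)["TRX_COUNT"].sum()
)
print(df)  # TRX_COUNT_SUM comes out 5, 7, 8, 11, 10, 7, 4, as expected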

Converting daily data to monthly and get months last value in pandas

I have daily data where each row belongs to a specific month and year (see the sample below).
I want to convert the daily data to monthly data, taking the last value of each month as that month's Return value,
for example:
AccoutId, Date, Return
1 2016-01 -4.1999 (because this return value is the last value of January, 1/29/16)
1 2016-02 0.19 (same here, the last value of February, 2/29/16)
and so on
I've looked at some topics about converting daily data to monthly data, but the problem is that after converting, they take the mean() or sum() of the month as the return value. Instead, I want the month's last Return value.
You can group by AccountId and the year-month. Convert Date to datetime first and then format it as year-month via df['Date'].dt.strftime('%Y-%m'). Then just use last():
df['Date'] = pd.to_datetime(df['Date'])
df = df.groupby(['AccountId', df['Date'].dt.strftime('%Y-%m')])['Return'].last().reset_index()
df
Sample data:
In[1]:
AccountId Date Return
0 1 1/7/16 15
1 1 1/29/16 10
2 1 2/1/16 25
3 1 2/15/16 20
4 1 2/28/16 30
df['Date'] = pd.to_datetime(df['Date'])
df = df.groupby(['AccountId', df['Date'].dt.strftime('%Y-%m')])['Return'].last().reset_index()
df
Out[1]:
AccountId Date Return
0 1 2016-01 10
1 1 2016-02 30
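For what it's worth, a monthly Period makes an equivalent grouping key without string formatting. A sketch on the same sample data (the Month helper column is added here, and sort_values guards against unsorted input):
import pandas as pd

df = pd.DataFrame({'AccountId': [1, 1, 1, 1, 1],
                   'Date': ['1/7/16', '1/29/16', '2/1/16', '2/15/16', '2/28/16'],
                   'Return': [15, 10, 25, 20, 30]})
df['Date'] = pd.to_datetime(df['Date'])
df['Month'] = df['Date'].dt.to_period('M')  # e.g. Period('2016-01', 'M')
# last() picks the final row of each (account, month) group in date order
out = df.sort_values('Date').groupby(['AccountId', 'Month'])['Return'].last().reset_index()
print(out)  # 2016-01 -> 10, 2016-02 -> 30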

Quickly fill cells with datetime based on column name in pandas?

I need to convert my cumbersome column headers into a datetime for every cell in that column. For example, I need the datetime "2001-10-06 6:00" from the column header 20011006_6_blah_blah_blah. I have a column of other datetimes that I will eventually be using to do some calculations.
Construction of an example df:
import datetime
import numpy as np
import pandas as pd

date_rng0 = pd.date_range(start=datetime.date(2001,10,1), end=datetime.date(2001,10,7), freq='D')
date_rng1 = pd.date_range(start=datetime.date(2001,10,5), end=datetime.date(2001,10,8), freq='D')
drstr0 = [str(i.year)+str(i.month)+str(i.day)+'_blah' for i in date_rng0]
drstr1 = [str(i.year)+str(i.month)+str(i.day)+'_blah' for i in date_rng1]
# make zero df (placeholder zeros; overwritten column by column below)
arr = np.zeros((len(date_rng0), len(date_rng1)))
df = pd.DataFrame(arr, index=drstr0, columns=drstr1)
First I copy all the column names into the cells, column by column. This is very slow with my data:
for c in df.columns:
    df[c] = c
Then I convert them to datetime using an atrocious looking lambda mess:
for c in df.columns:
    df.loc[:, c] = df.loc[:, c].apply(lambda x: datetime.date(
        int(x.split('_')[0][:4]), int(x.split('_')[0][4:6]), int(x.split('_')[0][6:])))
Then I make a datetime column using a similar lambda function:
df['date_time']=df.index
df['date_time']=df.loc[:,'date_time'].apply(lambda x: datetime.date(int(x.split('_')[0][:4]),int(x.split('_')[0][4:6]),int(x.split('_')[0][6:])))
df.head()
gives:
2001105_blah 2001106_blah 2001107_blah 2001108_blah date_time
2001101_blah 2001-10-05 2001-10-06 2001-10-07 2001-10-08 2001-10-01
2001102_blah 2001-10-05 2001-10-06 2001-10-07 2001-10-08 2001-10-02
2001103_blah 2001-10-05 2001-10-06 2001-10-07 2001-10-08 2001-10-03
2001104_blah 2001-10-05 2001-10-06 2001-10-07 2001-10-08 2001-10-04
2001105_blah 2001-10-05 2001-10-06 2001-10-07 2001-10-08 2001-10-05
Then I can do a little math:
ndf = df.copy()
for c in df.columns:
    ndf.loc[:, c] = df.loc[:, c] - df.loc[:, 'date_time']
Which gives what I am ultimately after:
2001105_blah 2001106_blah 2001107_blah 2001108_blah date_time
2001101_blah 4 days 5 days 6 days 7 days 0 days
2001102_blah 3 days 4 days 5 days 6 days 0 days
2001103_blah 2 days 3 days 4 days 5 days 0 days
2001104_blah 1 days 2 days 3 days 4 days 0 days
2001105_blah 0 days 1 days 2 days 3 days 0 days
The problem is, this process has never completed on my 2,000 x 30,000 dataframe, even after walking away for 30 minutes. I feel like I am doing something wrong. Any suggestions to improve the efficiency?
You can try with str.split, ' '.join, and pd.to_datetime:
# add a new column whose value is all the column names joined into one string
df['temp'] = ' '.join(df.columns.astype(str))
# expand that string into one column per original column
temp = df['temp'].str.split(expand=True)
# rename the columns with the original names
temp.columns = df.columns[:-1]
# parse the index to datetime
index = pd.to_datetime(df.index.str.split('_').str[0], format='%Y%m%d').to_numpy()
# subtract the index from each column
newdf = temp.apply(lambda x: pd.to_datetime(x.str.split('_').str[0], format='%Y%m%d') - index)
# keep only the rows where all values are non-negative
newdf = newdf[newdf.apply(lambda x: x >= pd.Timedelta(0)).all(1)]
Output:
print(newdf)
2001105_blah 2001106_blah 2001107_blah 2001108_blah
2001101_blah 4 days 5 days 6 days 7 days
2001102_blah 3 days 4 days 5 days 6 days
2001103_blah 2 days 3 days 4 days 5 days
2001104_blah 1 days 2 days 3 days 4 days
2001105_blah 0 days 1 days 2 days 3 days
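If even that is slow at 2,000 x 30,000, the date arithmetic can be done with a single NumPy broadcast instead of a per-column apply. A sketch, assuming df is the original frame whose row and column labels all start with a parseable date prefix as above:
import pandas as pd

# parse the date prefix out of the labels once
col_dates = pd.to_datetime(df.columns.str.split('_').str[0], format='%Y%m%d')
row_dates = pd.to_datetime(df.index.str.split('_').str[0], format='%Y%m%d')
# outer difference via broadcasting: one (rows x cols) timedelta matrix, no apply
diffs = pd.DataFrame(col_dates.to_numpy()[None, :] - row_dates.to_numpy()[:, None],
                     index=df.index, columns=df.columns)
# keep only the rows where every difference is non-negative
diffs = diffs[(diffs >= pd.Timedelta(0)).all(axis=1)]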

week number from given date in pandas

I have a data frame with two columns Date and value.
I want to add a new column named week_number that is, essentially, how many weeks back each row's Date is from a given date:
import pandas as pd
df = pd.DataFrame(columns=['Date','value'])
df['Date'] = [ '04-02-2019','03-02-2019','28-01-2019','20-01-2019']
df['value'] = [10,20,30,40]
df
Date value
0 04-02-2019 10
1 03-02-2019 20
2 28-01-2019 30
3 20-01-2019 40
Suppose the given date is 05-02-2019. Then I need to fill the week_number column with how many weeks back each Date is from the given date.
The output should be
Date value week_number
0 04-02-2019 10 1
1 03-02-2019 20 1
2 28-01-2019 30 2
3 20-01-2019 40 3
How can I do this in pandas?
First convert the column to datetimes with to_datetime and dayfirst=True, then subtract the dates from the given date with rsub, convert the timedeltas to days, floor-divide by 7, and add 1:
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
df['week_number'] = df['Date'].rsub(pd.Timestamp('2019-02-05')).dt.days // 7 + 1
#alternative
#df['week_number'] = (pd.Timestamp('2019-02-05') - df['Date']).dt.days // 7 + 1
print (df)
Date value week_number
0 2019-02-04 10 1
1 2019-02-03 20 1
2 2019-01-28 30 2
3 2019-01-20 40 3