Datetime reformat weekly column - pandas

I split a dataframe from minute data into daily, weekly and monthly frames. I had no problem reformatting the daily dataframe, but I am having a good bit of trouble trying to do the same with the weekly one. Here is the weekly dataframe; if someone could help me out it would be great. I am also adding the code I used to reformat the daily dataframe, in case it helps!
I am plotting it with Bokeh and without the datetime format I won't be able to format the axis and hovertools as I would like.
Thanks beforehand.
dfDay1 = dfDay.loc['2014-01-01':'2020-09-30']  # note: '2020-09-31' is not a valid date
dfDay1 = dfDay1.reset_index()
dfDay1['date1'] = pd.to_datetime(dfDay1['date'], format='%Y/%m/%d')
dfDay1 = dfDay1.set_index('date')
That worked fine for the day format.

If you need to convert the dates before the /, use Series.str.split with str[0]; if you need the dates after the /, use str[1]:
df['date1'] = pd.to_datetime(df['week'].str.split('/').str[0])
print(df)
week Open Low High Close Volume \
0 2014-01-07/2014-01-13 58.1500 55.38 58.96 56.0000 324133239
1 2014-01-14/2014-01-20 56.3500 55.96 58.57 56.2500 141255151
2 2014-01-21/2014-01-27 57.8786 51.85 59.31 52.8600 279370121
3 2014-01-28/2014-02-03 53.7700 52.75 63.95 62.4900 447186604
4 2014-02-04/2014-02-10 62.8900 60.45 64.90 63.9100 238316161
.. ... ... ... ... ... ...
347 2020-09-01/2020-09-07 297.4000 271.14 303.90 281.5962 98978386
348 2020-09-08/2020-09-14 275.0000 262.64 281.40 271.0100 109717114
349 2020-09-15/2020-09-21 272.6300 244.13 274.52 248.5800 123816172
350 2020-09-22/2020-09-28 254.3900 245.40 259.98 255.8800 98550687
351 2020-09-29/2020-10-05 258.2530 256.50 268.33 261.3500 81921670
date1
0 2014-01-07
1 2014-01-14
2 2014-01-21
3 2014-01-28
4 2014-02-04
.. ...
347 2020-09-01
348 2020-09-08
349 2020-09-15
350 2020-09-22
351 2020-09-29
[352 rows x 7 columns]
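For completeness, a minimal sketch that extracts both ends of the weekly range and sets a datetime index for plotting (the column name 'week' is taken from the output above):
import pandas as pd

df = pd.DataFrame({'week': ['2014-01-07/2014-01-13', '2014-01-14/2014-01-20']})

parts = df['week'].str.split('/')
df['week_start'] = pd.to_datetime(parts.str[0])  # date before the /
df['week_end'] = pd.to_datetime(parts.str[1])    # date after the /

# a DatetimeIndex lets Bokeh format the axis and hover tools properly
df = df.set_index('week_start')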

Related

How to concatenate a dataframe to a multiindex main dataframe along columns

I have tried a few answers but was not able to get the desired result in my case.
I am working with stocks data.
I have a list: ['3MINDIA.NS.csv', 'AARTIDRUGS.NS.csv', 'AARTIIND.NS.csv', 'AAVAS.NS.csv', 'ABB.NS.csv']
For every stock in the list I get an output which contains trades and related info. It looks something like this:
BUY SELL profits rel_profits
0 2004-01-13 2004-01-27 -44.200012 -0.094606
1 2004-02-05 2004-02-16 18.000000 0.044776
2 2005-03-08 2005-03-11 25.000000 0.048077
3 2005-03-31 2005-04-01 13.000000 0.025641
4 2005-10-11 2005-10-26 -20.400024 -0.025342
5 2005-10-31 2005-11-04 67.000000 0.095578
6 2006-05-22 2006-06-05 -55.100098 -0.046693
7 2007-03-06 2007-03-14 3.000000 0.001884
8 2007-03-19 2007-03-28 41.500000 0.028222
9 2007-07-31 2007-08-14 69.949951 0.038224
10 2008-01-24 2008-02-05 25.000000 0.013055
11 2009-11-04 2009-11-05 50.000000 0.031250
12 2010-12-10 2010-12-15 63.949951 0.018612
13 2011-02-02 2011-02-15 -53.050049 -0.015543
14 2011-09-30 2011-10-07 74.799805 0.018181
15 2015-12-09 2015-12-18 -215.049805 -0.019523
16 2016-01-18 2016-02-01 -475.000000 -0.046005
17 2016-11-16 2016-11-30 -1217.500000 -0.096877
18 2018-03-26 2018-04-02 0.250000 0.000013
19 2018-05-22 2018-05-25 250.000000 0.012626
20 2018-06-05 2018-06-12 101.849609 0.005361
21 2018-09-25 2018-10-10 -2150.000000 -0.090717
22 2021-01-27 2021-02-03 500.150391 0.024638
23 2021-06-30 2021-07-07 393.000000 0.016038
24 2021-08-12 2021-08-13 840.000000 0.035279
25 NaN NaN -1693.850281 0.995277
# note: every dataframe will have a last row with NaN values in the buy, sell columns
# each dataframe has a different number of rows
Now I tried to add an extra level of index to this dataframe like this (where symbol is the name of the stock from the given list, e.g. for 3MINDIA.NS.csv the symbol is 3MINDIA):
trades.columns = pd.MultiIndex.from_product([[symbol], trades.columns])
after this I tried to concatenate each trades dataframe that is generated in the loop to a main dataframe using:
result_df = pd.concat([result_df, trades], axis=1)
# I am trying to do this so that whenever I call result_df[symbol]
# I can see the trade dates for that particular symbol
But I get a result_df that has a lot of NaN values, because each trades dataframe has a variable number of rows in it.
Is there any way I can combine the trades dataframes along the columns, with the stock symbol as the higher-level index, and not get all the NaN values in my result_df?
[screenshot of the result_df I got]
So I found a way to get what I wanted.
First I added this code in the loop:
trades = pd.concat([trades], keys=[symbol], names=['Stocks'])
After this I used concat again on result_df and trades:
# Desired Result
result_df = pd.concat([result_df, trades], axis=0, ignore_index=False)
And BAM!!! This is exactly what I wanted.
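A minimal end-to-end sketch of that loop, where get_trades() is a hypothetical stand-in for whatever builds each per-stock trades dataframe:
import pandas as pd

files = ['3MINDIA.NS.csv', 'AARTIDRUGS.NS.csv']
result_df = pd.DataFrame()

for f in files:
    symbol = f.split('.')[0]    # '3MINDIA.NS.csv' -> '3MINDIA'
    trades = get_trades(f)      # hypothetical helper, not from the question
    # add the symbol as an outer *row* index level instead of a column level
    trades = pd.concat([trades], keys=[symbol], names=['Stocks'])
    result_df = pd.concat([result_df, trades], axis=0)

result_df.loc['3MINDIA'] then returns only that stock's trades, with no padding NaNs: stacking along rows never requires the frames to share a row count, which is why the column-wise concat produced NaNs and this one does not.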

How do I get the first 15 min high from OHLC data through pandas?

Here is a dataframe which has minute-wise OHLC data from 2011-2021.
I want to make another column named "first15 high" holding the first-15-minute high, i.e. the highest high from 9:15 to 9:30, for each day.
The desired output is the yellow column in the original screenshot (not reproduced here). The dataframe has more than 10 years of data (i.e. contains more than 2000 days).
The data you present is maintained in Excel, but I will answer on the assumption that pandas is available. As sample data, download some from Yahoo Finance. First take the first 15 rows of each day by grouping on year, month, and day; then group that result by date and take the maximum; finally, merge the maxima back onto the original dataframe. If you're looking for a quick answer, posting the data as text and providing the code you're working on is a must.
import pandas as pd
import yfinance as yf
df = yf.download("AAPL", interval='1m', start="2021-05-18", end="2021-05-25")
df.index = pd.to_datetime(df.index)
df.index = df.index.tz_localize(None)
df['date'] = df.index.date
# first 15 one-minute records per day
first_15min = df.groupby([df.index.year,df.index.month,df.index.day])['High'].head(15).to_frame()
# max value
first_15min = first_15min.groupby([first_15min.index.date]).max()
df.merge(first_15min, left_on='date', right_on=first_15min.index, how='inner')
Open High_x Low Close Adj Close Volume date High_y
0 125.980003 126.099998 125.970001 126.065002 126.065002 0 2021-05-17 126.099998
1 126.060097 126.070000 125.900002 125.910004 125.910004 135988 2021-05-17 126.099998
2 125.900002 125.900002 125.790298 125.880096 125.880096 172001 2021-05-17 126.099998
3 125.889999 125.889999 125.790001 125.860001 125.860001 81338 2021-05-17 126.099998
4 125.870003 125.968201 125.870003 125.919998 125.919998 187059 2021-05-17 126.099998
... ... ... ... ... ... ... ... ...
1942 127.490097 127.557404 127.480003 127.540001 127.540001 161355 2021-05-24 126.419998
1943 127.540001 127.559998 127.480003 127.485001 127.485001 143420 2021-05-24 126.419998
1944 127.485001 127.529999 127.449997 127.480003 127.480003 132487 2021-05-24 126.419998
1945 127.479897 127.500000 127.449997 127.470001 127.470001 98478 2021-05-24 126.419998
1946 127.480003 127.550003 127.460098 127.532303 127.532303 128118 2021-05-24 126.419998
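An alternative sketch, assuming the index is a one-minute DatetimeIndex and the session opens at 9:15: between_time selects by clock time rather than by position, so it still works on days where some bars are missing (head(15) would silently reach past 9:30 on such days).
import pandas as pd

# first 15 one-minute bars by clock time: 09:15 through 09:29
f15 = df.between_time('09:15', '09:29')['High']
daily_high = f15.groupby(f15.index.date).max()

# map each row's date to that day's first-15-minute high
df['first15_high'] = pd.Series(df.index.date, index=df.index).map(daily_high)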

Insert items from MultiIndexed dataframe into regular dataframe based on time

I have this regular dataframe indexed by 'Date', called ES:
Price Day Hour num_obs med abs_med Ret
Date
2006-01-03 08:30:00 1260.583333 1 8 199 1260.416667 0.166667 0.000364
2006-01-03 08:35:00 1261.291667 1 8 199 1260.697917 0.593750 0.000562
2006-01-03 08:40:00 1261.125000 1 8 199 1260.843750 0.281250 -0.000132
2006-01-03 08:45:00 1260.958333 1 8 199 1260.895833 0.062500 -0.000132
2006-01-03 08:50:00 1261.214286 1 8 199 1260.937500 0.276786 0.000203
I have this other dataframe indexed by the following MultiIndex. The first index level goes from 0 to 23 (hours) and the second from 0 to 55 in steps of 5 (minutes). In other words, we have daily data in 5-minute increments.
5min_Ret
0 0 2.235875e-06
5 9.814064e-07
10 -1.453213e-06
15 4.295757e-06
20 5.884896e-07
25 -1.340122e-06
30 9.470660e-06
35 1.178204e-06
40 -1.111621e-05
45 1.159005e-05
50 6.148861e-06
55 1.070586e-05
1 0 1.485287e-05
5 3.018576e-06
10 -1.513273e-05
15 -1.105312e-05
20 3.600874e-06
...
I want to create a column in the original dataframe, ES, that holds the appropriate '5min_Ret' value for each hour/5-minute combination.
I've tried multiple things: looping over rows, looking for some apply function. But nothing has worked so far. I feel like I'm overlooking a simple and Pythonic solution here.
The expected output adds a new column called '5min_ret' to the original dataframe, in which each row gets the correct value for its hour/5-minute pair from the smaller dataframe containing the 5min_Ret values:
Price Day Hour num_obs med abs_med Ret 5min_ret
Date
2006-01-03 08:30:00 1260.583333 1 8 199 1260.416667 0.166667 0.000364 xxxx
2006-01-03 08:35:00 1261.291667 1 8 199 1260.697917 0.593750 0.000562 xxxx
2006-01-03 08:40:00 1261.125000 1 8 199 1260.843750 0.281250 -0.000132 xxxx
2006-01-03 08:45:00 1260.958333 1 8 199 1260.895833 0.062500 -0.000132 xxxx
2006-01-03 08:50:00 1261.214286 1 8 199 1260.937500 0.276786 0.000203 xxxx
I think one way is to use merge on hour and minute. First create a column 'min' in ES from the DatetimeIndex:
ES['min'] = ES.index.minute
Now you can merge with your multiindex DF containing the column '5min_Ret' that I named df_multi such as:
ES = ES.merge(df_multi.reset_index(), left_on=['Hour', 'min'],
              right_on=['level_0', 'level_1'], how='left')
Here you merge 'Hour' and 'min' from ES (note the capital H, matching the column shown above) with 'level_0' and 'level_1', the columns created from the multiindex of df_multi when you call reset_index.
You should get a new column in ES named '5min_Ret' with the values you are looking for. You can drop the column 'min' if you don't need it anymore with ES = ES.drop('min', axis=1). Note that merge resets the index, so preserve the Date index as shown in the sketch below.
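A self-contained sketch with toy data standing in for ES and df_multi (the reset_index/set_index pair is what preserves the Date index, which a bare merge would discard):
import pandas as pd

ES = pd.DataFrame(
    {'Price': [1260.583333, 1261.291667]},
    index=pd.to_datetime(['2006-01-03 08:30:00', '2006-01-03 08:35:00']),
)
ES.index.name = 'Date'
ES['Hour'] = ES.index.hour
ES['min'] = ES.index.minute

# unnamed MultiIndex levels become 'level_0' and 'level_1' after reset_index
idx = pd.MultiIndex.from_product([range(24), range(0, 60, 5)])
df_multi = pd.DataFrame({'5min_Ret': 0.0}, index=idx)

ES = (ES.reset_index()
        .merge(df_multi.reset_index(),
               left_on=['Hour', 'min'],
               right_on=['level_0', 'level_1'],
               how='left')
        .set_index('Date')
        .drop(columns=['min', 'level_0', 'level_1']))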

How do I sort a column by targeting a specific number within that cell?

I would like to use pandas to sort a specific column by date (more specifically, by the year). However, the year is buried within a bunch of other numbers. How do I target just the 2 digits that I need?
In the example below, I want to sort this column by the year digits [16, 14, 15, ...] rather than considering all the numbers in that row.
3/18/16 11:46
6/19/14 14:58
7/27/15 14:22
8/3/15 12:59
2/20/13 12:33
9/27/16 12:08
7/27/15 14:22
Given a dataframe like this,
date
0 3/18/16
1 6/19/14
2 7/27/15
3 8/3/15
4 2/20/13
5 9/27/16
6 7/27/15
You can convert the date column to datetime format and then sort:
df['date'] = pd.to_datetime(df['date'], format='%m/%d/%y')
df = df.sort_values(by='date')
The resulting dataframe
date
4 2013-02-20
1 2014-06-19
2 2015-07-27
6 2015-07-27
3 2015-08-03
0 2016-03-18
5 2016-09-27
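If you literally want to order by the two-digit year alone and ignore month and day, sort_values accepts a key function (pandas >= 1.1); a small sketch:
import pandas as pd

df = pd.DataFrame({'date': ['3/18/16', '6/19/14', '7/27/15']})
df['date'] = pd.to_datetime(df['date'], format='%m/%d/%y')

# mergesort is stable, so the original order is kept within each year
df = df.sort_values(by='date', key=lambda s: s.dt.year, kind='mergesort')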

Pandas Resample Strange Zero Tolerance Behavior

I'm attempting to resample a time series in Pandas and I am getting some odd behavior:
print samples[196].base_df.to_string()
Units Sales
2008-07-03 3 820.00
2008-07-04 3 470.00
...
2010-06-22 1 335.00
2010-06-24 2 180.00
2010-06-30 -1 -2502.00
print samples[196].base_df.resample('15d', how='sum')
Units Sales
2008-07-03 17 3.149130e+03
2008-07-18 29 6.305210e+03
...
2010-06-08 18 5.204000e+03
2010-06-23 1 -2.322000e+03
2010-07-08 0 6.521324e-312
I would have expected the last value in the resampled series to be either zero or omitted. Is this expected behavior for the resample function? If helpful I can post the full time series, but it is a bit long...
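For what it's worth, a tiny subnormal float like 6.521324e-312 usually points to uninitialized memory rather than a real sum, so this looks like a bug in the pandas version used here rather than expected behavior. In current pandas the how= keyword is gone; the equivalent call, which sums an empty bin to a clean 0, would be:
resampled = samples[196].base_df.resample('15d').sum()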