Pandastic way of growing a dataframe - numpy

So, I have a year-indexed dataframe that I would like to increment by some logic beyond the end year (2013), say, grow the last value by n percent for 10 years, but the logic could also be to just add a constant, or slightly growing number. I will leave that to a function and just stuff the logic there.
I can't think of a neat vectorized way to do that with arbitrary length of time and logic, leaving a longer dataframe with the extra increments added, and would prefer not to loop it.

The particular calculation matters. In general you would have to compute the values in a loop. Some NumPy ufuncs (such as np.add, np.multiply, np.minimum, np.maximum) have an accumulate method, however, which may be useful depending on the calculation.
For example, to calculate values given a constant growth rate, you could use np.multiply.accumulate (or cumprod):
import numpy as np
import pandas as pd
N = 10
index = pd.date_range(end='2013-12-31', periods=N, freq='D')
df = pd.DataFrame({'val':np.arange(N)}, index=index)
last = df['val'].iloc[-1]  # positional access to the final value
# val
# 2013-12-22 0
# 2013-12-23 1
# 2013-12-24 2
# 2013-12-25 3
# 2013-12-26 4
# 2013-12-27 5
# 2013-12-28 6
# 2013-12-29 7
# 2013-12-30 8
# 2013-12-31 9
# expand df
index = pd.date_range(start='2014-1-1', periods=N, freq='D')
df = df.reindex(df.index.union(index))
# compute new values
rate = 1.1
# assign through .loc to avoid chained assignment
df.loc[df.index[-N:], 'val'] = last*np.multiply.accumulate(np.full(N, fill_value=rate))
yields
val
2013-12-22 0.000000
2013-12-23 1.000000
2013-12-24 2.000000
2013-12-25 3.000000
2013-12-26 4.000000
2013-12-27 5.000000
2013-12-28 6.000000
2013-12-29 7.000000
2013-12-30 8.000000
2013-12-31 9.000000
2014-01-01 9.900000
2014-01-02 10.890000
2014-01-03 11.979000
2014-01-04 13.176900
2014-01-05 14.494590
2014-01-06 15.944049
2014-01-07 17.538454
2014-01-08 19.292299
2014-01-09 21.221529
2014-01-10 23.343682
To increment by a constant value you could simply use np.arange:
step=2
df.loc[df.index[-N:], 'val'] = np.arange(last+step, last+(N+1)*step, step)
or cumsum:
step=2
df.loc[df.index[-N:], 'val'] = last + np.full(N, fill_value=step).cumsum()
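Since the question asks for the growth logic to live in a function, here is a minimal sketch of one way to wrap the idea. The helper name extend_frame, its rule argument, and the toy year-indexed data are all hypothetical and not part of the original answer; it assumes an integer year index and uses pd.concat instead of reindex.

```python
import numpy as np
import pandas as pd

def extend_frame(df, column, periods, rule):
    """Append `periods` yearly rows; `rule(last, periods)` returns the new values.

    Hypothetical helper: assumes an integer year index and a single column.
    """
    last = df[column].iloc[-1]
    first_new_year = int(df.index[-1]) + 1
    new_index = pd.RangeIndex(first_new_year, first_new_year + periods)
    extra = pd.DataFrame({column: rule(last, periods)}, index=new_index)
    return pd.concat([df, extra])

# Toy data ending in 2013, as in the question
df = pd.DataFrame({'val': [100.0, 110.0, 121.0]}, index=[2011, 2012, 2013])

# Grow the last value by 10 percent per year for 10 years
grown = extend_frame(df, 'val', 10,
                     lambda last, n: last * np.cumprod(np.full(n, 1.1)))
```

Swapping the lambda for, say, lambda last, n: last + 2 * np.arange(1, n + 1) would give the constant-increment case instead.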
Some linear recurrence relations can be expressed using scipy.signal.lfilter. See for example,
Trying to vectorize iterative calculation with numpy and Recursive definitions in Pandas
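As a rough illustration of the lfilter idea, the sketch below evaluates the first-order recurrence y[n] = rate*y[n-1] + step without a Python loop and checks it against the loop version. The recurrence itself and the values of last, rate and step are assumptions chosen for the example, not something taken from the linked answers.

```python
import numpy as np
from scipy.signal import lfilter

last, rate, step, N = 9.0, 1.1, 2.0, 10

# lfilter computes a[0]*y[n] = b[0]*x[n] - a[1]*y[n-1], so with
# b = [1] and a = [1, -rate] we get y[n] = x[n] + rate*y[n-1].
b = [1.0]
a = [1.0, -rate]
x = np.full(N, step)

# Initial filter state that makes the first output equal rate*last + step
zi = [rate * last]
vectorized, _ = lfilter(b, a, x, zi=zi)

# Plain loop for comparison
looped = []
prev = last
for _ in range(N):
    prev = rate * prev + step
    looped.append(prev)
```

With these inputs the two sequences should agree up to floating-point error.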

Related

How to concatenate a dataframe to a multiindex main dataframe along columns

I have tried a few answers but was not able to get the desired result in my case.
I am working with stocks data.
I have a list ['3MINDIA.NS.csv', 'AARTIDRUGS.NS.csv', 'AARTIIND.NS.csv', 'AAVAS.NS.csv', 'ABB.NS.csv']
for every stock in the list I get an output which contains trades and related info.. it looks something like this:
BUY SELL profits rel_profits
0 2004-01-13 2004-01-27 -44.200012 -0.094606
1 2004-02-05 2004-02-16 18.000000 0.044776
2 2005-03-08 2005-03-11 25.000000 0.048077
3 2005-03-31 2005-04-01 13.000000 0.025641
4 2005-10-11 2005-10-26 -20.400024 -0.025342
5 2005-10-31 2005-11-04 67.000000 0.095578
6 2006-05-22 2006-06-05 -55.100098 -0.046693
7 2007-03-06 2007-03-14 3.000000 0.001884
8 2007-03-19 2007-03-28 41.500000 0.028222
9 2007-07-31 2007-08-14 69.949951 0.038224
10 2008-01-24 2008-02-05 25.000000 0.013055
11 2009-11-04 2009-11-05 50.000000 0.031250
12 2010-12-10 2010-12-15 63.949951 0.018612
13 2011-02-02 2011-02-15 -53.050049 -0.015543
14 2011-09-30 2011-10-07 74.799805 0.018181
15 2015-12-09 2015-12-18 -215.049805 -0.019523
16 2016-01-18 2016-02-01 -475.000000 -0.046005
17 2016-11-16 2016-11-30 -1217.500000 -0.096877
18 2018-03-26 2018-04-02 0.250000 0.000013
19 2018-05-22 2018-05-25 250.000000 0.012626
20 2018-06-05 2018-06-12 101.849609 0.005361
21 2018-09-25 2018-10-10 -2150.000000 -0.090717
22 2021-01-27 2021-02-03 500.150391 0.024638
23 2021-06-30 2021-07-07 393.000000 0.016038
24 2021-08-12 2021-08-13 840.000000 0.035279
25 NaN NaN -1693.850281 0.995277
# note: every dataframe will have a last row with NaN values in the BUY and SELL columns
# each dataframe has a different number of rows
Now I tried to add an extra level of index to this dataframe like this:
symbol = name of the stock from given list for ex. for 3MINDIA.NS.csv symbol is 3MINDIA
trades.columns = pd.MultiIndex.from_product([[symbol], trades.columns])
after this I tried to concatenate each trades dataframe that is generated in the loop to a main dataframe using:
result_df = pd.concat([result_df, trades], axis=1)
# I am trying to do this so that whenever I call result_df[symbol] I can see the trade dates for that particular symbol.
But I get a result_df that has a lot of NaN values, because each trades dataframe has a different number of rows.
Is there any way I can combine the trades dataframes along the columns, with the stock symbol as the higher-level index, without getting all the NaN values in my result_df?
So I found a way to get what I wanted.
first I added this code in loop
trades = pd.concat([trades], keys=[symbol], names=['Stocks'])
after this I used concatenate again on result_df and trades
# Desired Result
result_df = pd.concat([result_df, trades], axis=0, ignore_index=False)
And BAM!!! This is exactly what I wanted
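For reference, the same row-wise stacking can probably be done in a single call once all the per-symbol frames are collected, since pd.concat accepts a dict and uses its keys as the outer index level. The sketch below uses made-up toy frames (two symbols, shortened columns) purely for illustration; trades_by_symbol is a hypothetical name.

```python
import pandas as pd

# Toy stand-ins for the per-symbol trades frames, with different row counts
trades_by_symbol = {
    '3MINDIA': pd.DataFrame({'profits': [-44.2, 18.0],
                             'rel_profits': [-0.0946, 0.0448]}),
    'ABB': pd.DataFrame({'profits': [25.0, 13.0, -20.4],
                         'rel_profits': [0.0481, 0.0256, -0.0253]}),
}

# Dict keys become the outer index level, named 'Stocks'
result_df = pd.concat(trades_by_symbol, names=['Stocks'])

# Rows for one symbol, selected through the outer level
abb = result_df.loc['ABB']
```

Because the frames are stacked along the rows rather than the columns, differing row counts do not introduce NaN padding.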

Time column interval filter

I have a dataframe with a "Fecha" column. I would like to reduce the dataframe's size by filtering it: keep only the rows whose time falls on a multiple of 10 minutes and discard all the others.
Any ideas?
Thanks
I have to guess some variable names. But assuming your dataframe name is df, the solution should look similar to:
df['Fecha'] = pd.to_datetime(df['Fecha'])
df = df[df['Fecha'].dt.minute % 10 == 0]
The first line guarantees that your 'Fecha' column has a datetime dtype. The second line keeps only the rows whose minute is a multiple of 10, using the modulus operator %. Note that a column needs the .dt accessor to reach the minute, whereas a DatetimeIndex exposes it directly.
Since I'm not sure if this solves your problem, here's a minimal example that runs by itself:
import pandas as pd
idx = pd.date_range(pd.Timestamp(2020, 1, 1), periods=60, freq='1min')
series = pd.Series(1, index=idx)
series = series[series.index.minute % 10 == 0]
series
The first three lines construct a series with a 1 minute index, which is filtered in the fourth line.
Output:
2020-01-01 00:00:00 1
2020-01-01 00:10:00 1
2020-01-01 00:20:00 1
2020-01-01 00:30:00 1
2020-01-01 00:40:00 1
2020-01-01 00:50:00 1
dtype: int64
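The same filter applied to a column rather than an index might look like the sketch below. The 'valor' column and the generated timestamps are made up for illustration; only 'Fecha' comes from the question. The second variant, comparing against the timestamp floored to 10 minutes, is an alternative that also rejects rows with non-zero seconds.

```python
import pandas as pd

# Toy frame: 60 one-minute timestamps stored as strings, plus a dummy column
df = pd.DataFrame({
    'Fecha': pd.date_range('2020-01-01', periods=60, freq='min').astype(str),
    'valor': range(60),
})

df['Fecha'] = pd.to_datetime(df['Fecha'])

# Keep rows whose minute is a multiple of 10 (note the .dt accessor)
filtered = df[df['Fecha'].dt.minute % 10 == 0]

# Stricter alternative: the timestamp must equal itself floored to 10 minutes
exact = df[df['Fecha'] == df['Fecha'].dt.floor('10min')]
```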

How to sum up a selected range of rows via a condition?

I hope that with this additional information someone can find time to help me with this new issue.
Sample data here --> file
The index is the date (datetime.date).
As I said, I'm trying to select a range of rows in a dataframe every time x is in the interval [-20 -190], and create a new dataframe with a new column holding the sum of the selected rows, keeping the last "encountered" date as the index.
EDIT: The "loop" starts at the first date (the beginning of the df). When a value which is less than 0 or -190 is found, the rows are summed up; then the search continues to the next such value, and so on.
BUT I still get values which are in the interval (-190, 0).
Example and code below.
Thanks
import pandas as pd
df = pd.read_csv('http://www.sharecsv.com/s/0525f76a07fca54717f7962d58cac692/sample_file.csv', sep = ';')
df['Date'] = df['Date'].where(df['x'].between(-190, 0)).bfill()
df3 = df.groupby('Date', as_index=False)['x'].sum()
df3
##### output #####
Date sum
0 2019-01-01 13:48:00 -131395.21
1 2019-01-02 11:23:00 -250830.08
2 2019-01-02 11:28:00 -154.35
3 2019-01-02 12:08:00 -4706.87
4 2019-01-03 12:03:00 -260158.22
... ... ...
831 2019-09-29 09:18:00 -245939.92
832 2019-09-29 16:58:00 -0.38
833 2019-09-30 17:08:00 -129365.71
834 2019-09-30 17:13:00 -157.05
835 2019-10-01 08:58:00 -111911.98
########## expected output #############
Date sum
0 2019-01-01 13:48:00 -131395.21
1 2019-01-02 11:23:00 -250830.08
2 2019-01-02 12:08:00 -4706.87
3 2019-01-03 12:03:00 -260158.22
... ... ...
831 2019-09-29 09:18:00 -245939.92
832 2019-09-30 17:08:00 -129365.71
833 2019-10-01 08:58:00 -111911.98
...
...
Use Series.where with Series.between to replace the values of the Date column with NaN wherever x is outside the range, back fill the missing values, and then aggregate with sum. Next, filter out the rows whose sum still falls inside the range with boolean indexing. Finally, use DataFrame.resample, casting the resulting Series to a one-column DataFrame with Series.to_frame:
#range -190, 0
df['Date'] = df['Date'].where(df['x'].between(-190, 0)).bfill()
df3 = df.groupby('Date', as_index=False)['x'].sum()
df3 = df3[~df3['x'].between(-190, 0)]
df3 = df3.resample('D', on='Date')['x'].sum().to_frame()
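To see what the first three steps do, here is a small sketch on made-up data (six rows, five minutes apart; none of these values come from the linked file). The final resample step is left out so the intermediate grouping stays visible.

```python
import pandas as pd

df = pd.DataFrame({
    'Date': pd.date_range('2019-01-01 13:00', periods=6, freq='5min'),
    'x': [-500, -300, -100, -400, -50, -10],
})

# Keep the date only where x is inside [-190, 0]; each other row then
# takes the date of the next in-range row via back fill
in_range = df['x'].between(-190, 0)
df['Date'] = df['Date'].where(in_range).bfill()

# Sum every block that ends at an in-range row
df3 = df.groupby('Date', as_index=False)['x'].sum()

# Drop the groups whose sum is itself still inside the range
df3 = df3[~df3['x'].between(-190, 0)]
```

Here the rows group as 13:10 (sum -900), 13:20 (sum -450) and 13:25 (sum -10), and the last group is removed by the filter.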

Computing Rolling autocorrelation using Pandas.rolling

I am attempting to calculate the rolling auto-correlation for a Series object using Pandas (0.23.3)
Setting up the example:
dt_index = pd.date_range('2018-01-01','2018-02-01', freq = 'B')
data = np.random.rand(len(dt_index))
s = pd.Series(data, index = dt_index)
Creating a Rolling object with window size = 5:
r = s.rolling(5)
Getting:
Rolling [window=5,center=False,axis=0]
Now when I try to calculate the correlation (Pretty sure this is the wrong approach):
r.corr(other=r)
I get only NaNs
I tried another approach based on the documentation:
df = pd.DataFrame()
df['a'] = s
df['b'] = s.shift(-1)
df.rolling(window=5).corr()
Getting something like:
...
2018-03-01 a NaN NaN
b NaN NaN
Really not sure where I'm going wrong with this. Any help would be immensely appreciated! The docs use float64 as well. Thinking it's because the correlation is very close to zero and so it's showing NaN? Somebody had raised a bug report here, but jreback solved the problem in a previous bug fix I think.
This is another relevant answer, but it's using pd.rolling_apply, which does not seem to be supported in Pandas version 0.23.3?
IIUC,
>>> s.rolling(5).apply(lambda x: x.autocorr(), raw=False)
2018-01-01 NaN
2018-01-02 NaN
2018-01-03 NaN
2018-01-04 NaN
2018-01-05 -0.502455
2018-01-08 -0.072132
2018-01-09 -0.216756
2018-01-10 -0.090358
2018-01-11 -0.928272
2018-01-12 -0.754725
2018-01-15 -0.822256
2018-01-16 -0.941788
2018-01-17 -0.765803
2018-01-18 -0.680472
2018-01-19 -0.902443
2018-01-22 -0.796185
2018-01-23 -0.691141
2018-01-24 -0.427208
2018-01-25 0.176668
2018-01-26 0.016166
2018-01-29 -0.876047
2018-01-30 -0.905765
2018-01-31 -0.859755
2018-02-01 -0.795077
The statsmodels approach below is a lot faster than Pandas' autocorr, but the results are different. In my dataset, there is a 0.87 Pearson correlation between the results of those two methods. There is a discussion about why the results are different here.
from statsmodels.tsa.stattools import acf
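# [1] picks the lag-1 autocorrelation; newer statsmodels releases name the `unbiased` keyword `adjusted`
s.rolling(5).apply(lambda x: acf(x, unbiased=True, fft=False)[1], raw=True)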
Note that the input cannot have null values, otherwise it will return all nulls.
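If the goal is to keep the same numbers as Series.autocorr while avoiding the per-window Series construction, one option is to compute the lag-1 Pearson correlation with NumPy on raw windows. This is a sketch under that assumption; lag1_corr is a hypothetical helper name and the random data is only there to make the example self-contained.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
idx = pd.date_range('2018-01-01', '2018-02-01', freq='B')
s = pd.Series(rng.random(len(idx)), index=idx)

def lag1_corr(window):
    # Pearson correlation between the window and itself shifted by one step
    return np.corrcoef(window[:-1], window[1:])[0, 1]

# Reference: pandas autocorr on each window (builds a Series per window)
via_series = s.rolling(5).apply(lambda x: x.autocorr(), raw=False)

# Same quantity on plain ndarrays
via_numpy = s.rolling(5).apply(lag1_corr, raw=True)
```

Both results should agree up to floating-point error, with the first four entries NaN because the window is not yet full.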

pandas HDFStore select rows by datetime index

I'm sure this is probably very simple but I can't figure out how to slice a pandas HDFStore table by its datetime index to get a specific range of rows.
I have a table that looks like this:
mdstore = pd.HDFStore('store.h5')
histTable = '/ES_USD20120615_MIDPOINT30s'
print(mdstore[histTable])
open high low close volume WAP \
date
2011-12-04 23:00:00 1266.000 1266.000 1266.000 1266.000 -1 -1
2011-12-04 23:00:30 1266.000 1272.375 1240.625 1240.875 -1 -1
2011-12-04 23:01:00 1240.875 1242.250 1240.500 1242.125 -1 -1
...
[488000 rows x 7 columns]
For example I'd like to get the range from 2012-01-11 23:00:00 to 2012-01-12 22:30:00. If it were in a df I would just use datetimes to slice on the index, but I can't figure out how to do that directly from the store table so I don't have to load the whole thing into memory.
I tried mdstore.select(histTable, where='index>20120111') and that worked in as much as I got everything on the 11th and 12th, but I couldn't see how to add a time in.
Example is here
needs pandas >= 0.13.0
In [2]: df = pd.DataFrame(np.random.randn(5),index=pd.date_range('20130101 09:00:00',periods=5,freq='s'))
In [3]: df
Out[3]:
0
2013-01-01 09:00:00 -0.110577
2013-01-01 09:00:01 -0.420989
2013-01-01 09:00:02 0.656626
2013-01-01 09:00:03 -0.350615
2013-01-01 09:00:04 -0.830469
[5 rows x 1 columns]
In [4]: df.to_hdf('test.h5','data',mode='w',format='table')
Specify it as a quoted string
In [8]: pd.read_hdf('test.h5','data',where='index>"20130101 09:00:01" & index<"20130101 09:00:04"')
Out[8]:
0
2013-01-01 09:00:02 0.656626
2013-01-01 09:00:03 -0.350615
[2 rows x 1 columns]
You can also specify it directly as a Timestamp
In [10]: pd.read_hdf('test.h5','data',where='index>Timestamp("20130101 09:00:01") & index<Timestamp("20130101 09:00:04")')
Out[10]:
0
2013-01-01 09:00:02 0.656626
2013-01-01 09:00:03 -0.350615
[2 rows x 1 columns]