Apply custom function to Rolling Dataframe - pandas

I have a function, let's call it RBSA(df), that I'm currently treating as a black box. It takes a dataframe like this
DATE RETURN STYLE1 STYLE2 STYLE3 STYLE4
2020-09-01 0.01 100 251 300 211
2020-09-02 0.04 106 248 310 210
2020-09-03 0.03 104 251 308 211
2020-09-03 0.02 110 258 306 212
...
and returns a dataframe like this
DATE STYLE1 STYLE2 STYLE3 STYLE4 R2
2020 0.01 85 10 4.99 68
Now I want to apply that function on a rolling basis with a window of 30 to the initial dataframe, so that the result looks something like this.
DATE STYLE1 STYLE2 STYLE3 STYLE4 R2
2020-09 0.01 85 10 4.99 68 #applied date range would be 09-01 to 09-30
2020-09 0.99 80 15 4.01 77 #applied date range would be 09-02 to 10-01
2020-09 3.93 80 10 6.07 89 #applied date range would be 09-03 to 10-02
So far I've tried df.rolling(30).apply(RBSA), but from what I can tell rolling.apply passes each window to the function as a numpy.ndarray. Since I'm treating RBSA() as a black box, I would rather not change it to take a numpy.ndarray as its input.
My second idea was to write a loop that append()s each result dataframe to an initially empty dataframe. However, I'm not really sure how to emulate a rolling window using a while loop.
def rolling30(df):
    count = len(df) - 30
    ret = []
    while (count > 0):
        count = count - 1
        df2 = df[count:count + 30]
        df2 = RBSA(df2)
        ret.append(df2)
    return ret
However, unlike when I manually append the dataframes together, this seems to create an output that looks like this (notice the comma):
DATE STYLE1 STYLE2 STYLE3 STYLE4 R2
2020-09 0.01 85 10 4.99 68, DATE STYLE1 STYLE2 STYLE3 STYLE4 R2
2020-09 0.99 80 15 4.01 77, DATE STYLE1 STYLE2 STYLE3 STYLE4 R2
2020-09 3.93 80 10 6.07 89, DATE STYLE1 STYLE2 STYLE3 STYLE4 R2
Right now the while loop feels like my closest route to a solution, though it doesn't feel as elegant as using rolling.apply.
UPDATE: I just did isinstance(rolling30(df), pd.DataFrame) and it returned False, so I assume the problem is that somewhere it's being turned into something that's not a dataframe.

So I figured out the solution to my while loop problem. I realized the initial ret was a list, so I changed it to ret = pd.DataFrame() and made the append assign back to ret:
def rolling30(df):
    count = len(df) - 30
    ret = pd.DataFrame()
    while (count > 0):
        count = count - 1
        df2 = df[count:count + 30]
        df2 = RBSA(df2)
        ret = ret.append(df2)
    return ret
I would still like to see what methods other people have for this problem, since the while loop doesn't feel like an elegant solution here.
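For what it's worth, here is a sketch of a tidier variant of the same idea (assuming RBSA returns a one-row DataFrame per window, and using rolling_rbsa as an illustrative name): slice each 30-row window with a list comprehension and concatenate once at the end, which avoids the repeated DataFrame.append calls.
import pandas as pd

def rolling_rbsa(df, window=30):
    # Apply RBSA to every contiguous 30-row window and stack the results.
    frames = [RBSA(df.iloc[i:i + window]) for i in range(len(df) - window + 1)]
    return pd.concat(frames, ignore_index=True)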

Very slow filtering with multiple conditions in pandas dataframe

UPDATE: I have edited the question (and code) to make the problem clearer. I use synthetic data here, but imagine a large df of floods and a small one of significant floods. I want to add a reference to every row (of the large_df) if it is somewhat close to a significant flood.
I have 2 pandas dataframes (1 large and 1 small).
In every iteration I want to create a subset of the small dataframe based on a few conditions that are dependent on each row (of the large df):
import numpy as np
import pandas as pd
import time

SOME_THRESHOLD = 10.5
NUMBER_OF_ROWS = int(2e4)

large_df = pd.DataFrame(index=np.arange(NUMBER_OF_ROWS), data={'a': np.arange(NUMBER_OF_ROWS)})
small_df = large_df.loc[np.random.randint(0, NUMBER_OF_ROWS, 5)]
large_df['past_index'] = None

count_time = 0
for ind, row in large_df.iterrows():
    start = time.time()
    # This line takes forever.
    df_tmp = small_df[(small_df.index<ind) & (small_df['a']>(row['a']-SOME_THRESHOLD)) & (small_df['a']<(row['a']+SOME_THRESHOLD))]
    count_time += time.time()-start
    if not df_tmp.empty:
        past_index = df_tmp.loc[df_tmp.index.max()]['a']
        large_df.loc[ind, 'similar_past_flood_tag'] = f'Similar to the large flood of {past_index}'

print(f'The total time of creating the subset df for 2e4 rows is: {count_time} seconds.')
The line that creates the subset takes a long time to compute:
The total time of creating the subset df for 2e4 rows is: 18.276793956756592 seconds.
This seems far too long to me. I have found similar questions, but none of the answers seemed to work (e.g. query and numpy conditions).
Is there a way to optimize this?
Note: the code does what is expected - just very slow.
While your code is logically correct, building the many boolean arrays and slicing the DataFrame on every iteration adds up.
Here are some stats with %timeit:
(small_df.index<ind): ~30μs
(small_df['a']>(row['a']-SOME_THRESHOLD)): ~100μs
(small_df['a']<(row['a']+SOME_THRESHOLD)): ~100μs
After '&'-ing all three: ~500μs
Including the DataFrame slice: ~700μs
That, multiplied by 20K iterations, is indeed about 14 seconds. :)
What you could do is take advantage of numpy's broadcasting to compute the boolean matrix more efficiently, and then reconstruct the "valid" DataFrame. See below:
l_ind = np.array(large_df.index)
s_ind = np.array(small_df.index)
l_a = np.array(large_df.a)
s_a = np.array(small_df.a)

arr1 = (l_ind[:, None] < s_ind[None, :])
arr2 = (((l_a[:, None] - SOME_THRESHOLD) < s_a[None, :]) &
        (s_a[None, :] < (l_a[:, None] + SOME_THRESHOLD)))
arr = arr1 & arr2

large_valid_inds, small_valid_inds = np.where(arr)
pd.DataFrame({'large_ind': np.take(l_ind, large_valid_inds),
              'small_ind': np.take(s_ind, small_valid_inds)})
That gives you the following DF, which if I understood the question properly, is the expected solution:
    large_ind  small_ind
0        1621       1631
1        1622       1631
2        1623       1631
3        1624       1631
4        1625       1631
5        1626       1631
6        1627       1631
7        1628       1631
8        1629       1631
9        1630       1631
10       1992       2002
11       1993       2002
12       1994       2002
13       1995       2002
14       1996       2002
15       1997       2002
16       1998       2002
17       1999       2002
18       2000       2002
19       2001       2002
20       8751       8761
21       8752       8761
22       8753       8761
23       8754       8761
24       8755       8761
25       8756       8761
26       8757       8761
27       8758       8761
28       8759       8761
29       8760       8761
30      10516      10526
31      10517      10526
32      10518      10526
33      10519      10526
34      10520      10526
35      10521      10526
36      10522      10526
37      10523      10526
38      10524      10526
39      10525      10526
40      18448      18458
41      18449      18458
42      18450      18458
43      18451      18458
44      18452      18458
45      18453      18458
46      18454      18458
47      18455      18458
48      18456      18458
49      18457      18458
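To turn that pair table into the tag column from the question, one possible follow-up (my sketch, keeping the answer's pairing and assuming small_df has a unique index) is:
pairs = pd.DataFrame({'large_ind': np.take(l_ind, large_valid_inds),
                      'small_ind': np.take(s_ind, small_valid_inds)})

# For each large_df row keep one matching small_df row (here: the one with the
# largest index, mirroring df_tmp.index.max() in the question) and write the tag.
best = pairs.groupby('large_ind')['small_ind'].max()
past_a = small_df.loc[best.to_numpy(), 'a'].to_numpy()
large_df.loc[best.index, 'similar_past_flood_tag'] = [
    f'Similar to the large flood of {a}' for a in past_a
]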
In pandas, for loops are much slower than column operations. So changing the calculation to loop over small_df instead of large_df will already give a big improvement:
for ind, row in small_df.iterrows():
    df_tmp = large_df[ <some condition> ]
    # ... some other processing
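A hedged sketch of what that loop could look like for this exact task, iterating small_df in ascending index order so that, when several significant floods match a row, the latest one wins (mirroring df_tmp.index.max() in the question):
for s_ind, s_row in small_df.sort_index().iterrows():
    # Vectorized over all of large_df: rows after this flood and within the threshold.
    mask = ((large_df.index > s_ind) &
            (large_df['a'] > s_row['a'] - SOME_THRESHOLD) &
            (large_df['a'] < s_row['a'] + SOME_THRESHOLD))
    large_df.loc[mask, 'similar_past_flood_tag'] = f'Similar to the large flood of {s_row["a"]}'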
Even better for your case is to use a merge rather than a condition on large_df. The problem is that your merge is not on exactly equal columns but on approximately equal ones. To use this approach, you should truncate your column and use that for the merge. Here's a hacky example.
small_df['a_rounded'] = (small_df['a'] / SOME_THRESHOLD / 2).astype(int)
large_df['a_rounded'] = (large_df['a'] / SOME_THRESHOLD / 2).astype(int)
merge_result = small_df.merge(large_df, on='a_rounded')
small_df['a_rounded2'] = ((small_df['a'] + SOME_THRESHOLD) / SOME_THRESHOLD / 2).astype(int)
large_df['a_rounded2'] = ((large_df['a'] + SOME_THRESHOLD) / SOME_THRESHOLD / 2).astype(int)
merge_result2 = small_df.merge(large_df, on='a_rounded2')
total_merge_result = pd.concat([merge_result, merge_result2])
# Now remove duplicates and impose additional filters.
You can impose the additional filters on the result later.
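A rough sketch of that last step (my addition, not part of the answer above): re-run the two merges with reset_index so the original row labels survive as columns, then re-apply the exact conditions and drop duplicate pairs. Pandas' default suffixes make small_df's columns '_x' and large_df's '_y'.
merged = pd.concat([
    small_df.reset_index().merge(large_df.reset_index(), on='a_rounded'),
    small_df.reset_index().merge(large_df.reset_index(), on='a_rounded2'),
])

# Keep only pairs where the significant flood comes earlier and the values
# really are within the threshold, then drop duplicates from the two merges.
final = (merged[(merged['index_x'] < merged['index_y']) &
                ((merged['a_x'] - merged['a_y']).abs() < SOME_THRESHOLD)]
         .drop_duplicates(subset=['index_x', 'index_y']))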

Empty Dataframe after being populated from URL

import requests
import pandas as pd
from bs4 import BeautifulSoup

html_data = requests.get('https://www.macrotrends.net/stocks/charts/GME/gamestop/revenue')
soup = BeautifulSoup(html_data.text, 'lxml')
all_tables = soup.find_all('table', attrs={'class': 'historical_data_table table'})

gme_revenue = pd.DataFrame(columns=["Date", "Revenue"])
for table in all_tables:
    if table.find('th').getText().startswith("Gamestop Quarterly Revenue"):
        for row in table.find_all("tr"):
            col = row.find_all("td")
            if len(col) == 2:
                date = col[0].text
                revenue = col[1].text.replace('$', '').replace(',', '')
                gme_revenue = gme_revenue.append({"Date": date, "Revenue": revenue}, ignore_index=True)
However, when I check the resulting table, it comes up empty as
Empty DataFrame
Columns: [Date, Revenue]
Index: []
and after I do a test, this appears:
gme_revenue.empty
>>>True
I'm unsure why my dataframe is empty. I've even copied the code from another dataframe and it still doesn't work.
Help is appreciated.
Change
if table.find('th').getText().startswith("Gamestop Quarterly Revenue"):
to
if 'Quarterly' in table.find('th').text:
and it should work. The startswith check most likely never matches because the header text on the page isn't written exactly as "Gamestop Quarterly Revenue" (the capitalization differs), whereas the looser 'Quarterly' containment check does match.
Output:
Date Revenue
0 2020-10-31 1005
1 2020-07-31 942
2 2020-04-30 1021
3 2020-01-31 2194
4 2019-10-31 1439
... ... ...
59 2006-01-31 1667
60 2005-10-31 534
61 2005-07-31 416
62 2005-04-30 475
63 2005-01-31 709
64 rows × 2 columns
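As a side note, pandas can often parse such tables directly from the fetched HTML with read_html, skipping the manual row loop. A sketch, assuming the page is served to a plain request as in the question and that the quarterly table is the only one whose text matches 'Quarterly Revenue':
from io import StringIO

import pandas as pd
import requests

html_data = requests.get('https://www.macrotrends.net/stocks/charts/GME/gamestop/revenue')

# 'match' keeps only the tables whose text matches the given pattern.
tables = pd.read_html(StringIO(html_data.text), match='Quarterly Revenue')
gme_revenue = tables[0]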

Merging two DataFrames in Pandas, based on conditions

I have 2 DataFrames:
siccd date retp
0 2892 31135 -0.036296
1 2892 31226 0.144768
2 2892 31320 0.063924
3 1650 31412 -0.009190
4 1299 31502 0.063326
and
start end ind indname
0 100 999 1 Agric
1 1000 1299 2 Mines
2 1300 1399 3 Oil
3 1400 1499 4 Stone
4 1500 1799 5 Cnstr
5 2000 2099 6 Food
6 2100 2199 7 Smoke
7 2200 2299 8 Txtls
8 2300 2399 9 Apprl
9 2400 2499 10 Wood
10 2500 2599 11 Chair
11 2600 2661 12 Paper
12 2700 2799 13 Print
13 2800 2899 14 Chems
14 2900 2999 15 Ptrlm
15 3000 3099 16 Rubbr
16 3100 3199 17 Lethr
17 3200 3299 18 Glass
18 3300 3399 19 Metal
The task is to take the df1['siccd'] column, compare it to the df2['start'] and df2['end'] column. If (start <= siccd <= end), assign the ind and indname values of that respective index in the second DataFrame to the first DataFrame. The output would look something like:
siccd date retp ind indname
0 2892 31135 -0.036296 14 Chems
1 2892 31226 0.144768 14 Chems
2 2892 31320 0.063924 14 Chems
3 1650 31412 -0.009190 5 Cnstr
4 1299 31502 0.063326 2 Mines
I've tried doing this with crude nested for loops, and it gives me the correct lists that I can append to the end of the DataFrame, but this is extremely inefficient and, given the data set's length, inadequate.
siccd_lst = list(tmp['siccd'])
ind_lst = []
indname_lst = []

def categorize(siccd, df, index):
    if (siccd >= df.iloc[index]['start']) and (siccd <= df.iloc[index]['end']):
        ind_lst.append(df.iloc[index]['ind'])
        indname_lst.append(df.iloc[index]['indname'])
    else:
        pass

for i in range(0, len(ff38.index)-1):
    [categorize(x, ff38, i) for x in siccd_lst]
I have also attempted to vectorize the problem, however, I could not figure out how to iterate through the entire df2 when "searching" for the correct ind and indname to assign to the first DataFrame.
Intervals
We'll create a DataFrame where the index is the Interval and the columns are the values we'll want to map, then we can use .loc with that DataFrame to bring over the data.
If any of your 'siccd' values lie outside all of the intervals you will get a KeyError, so in that case this method won't work.
dfi = pd.DataFrame({'indname': df2['indname'].to_numpy(), 'ind': df2['ind'].to_numpy()},
                   index=pd.IntervalIndex.from_arrays(left=df2.start, right=df2.end, closed='both'))

df1[['indname', 'ind']] = dfi.loc[df1.siccd].to_numpy()
Merge
You can perform the full merge (all rows in df1 with all rows in df2) using a temporary column ('t') and then filter the result where it's in between the values.
Since your second DataFrame seems to have a small number of non-overlapping ranges, the result of the merge shouldn't be prohibitively large, in terms of memory, and the non-overlapping ranges ensure the filtering will result in at most one row remaining for each original row in df1.
If any of your 'siccd' values lie outside of all intervals the row from the original DataFrame will get dropped.
res = (df1.assign(t=1)
          .merge(df2.assign(t=1), on='t', how='left')
          .query('siccd >= start & siccd <= end')
          .drop(columns=['t', 'start', 'end']))

#    siccd   date      retp  ind indname
#13   2892  31135 -0.036296   14   Chems
#32   2892  31226  0.144768   14   Chems
#51   2892  31320  0.063924   14   Chems
#61   1650  31412 -0.009190    5   Cnstr
#77   1299  31502  0.063326    2   Mines
If you expect values to lie outside some of the intervals, modify the merge: bring along the original index, do the subset (which drops those rows), and use combine_first to add them back after the merge. I added a row with a 'siccd' of 252525 as a 6th row to your original df1:
res = (df1.reset_index().assign(t=1)
          .merge(df2.assign(t=1), on='t', how='left')
          .query('siccd >= start & siccd <= end')
          .drop(columns=['t', 'start', 'end'])
          .set_index('index')
          .combine_first(df1)  # Adds back rows, based on index,
      )                        # that were outside any Interval

#      date   ind indname      retp     siccd
#0  31135.0  14.0   Chems -0.036296    2892.0
#1  31226.0  14.0   Chems  0.144768    2892.0
#2  31320.0  14.0   Chems  0.063924    2892.0
#3  31412.0   5.0   Cnstr -0.009190    1650.0
#4  31502.0   2.0   Mines  0.063326    1299.0
#5  31511.0   NaN     NaN  0.151341  252525.0
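A third option, not covered above (a sketch, assuming the ranges in df2 do not overlap): IntervalIndex.get_indexer returns the position of the containing interval for each value, and -1 where there is none, so 'siccd' values outside every interval simply stay untagged instead of raising a KeyError.
import pandas as pd

bins = pd.IntervalIndex.from_arrays(df2['start'], df2['end'], closed='both')
pos = bins.get_indexer(df1['siccd'])   # -1 means "contained in no interval"

matched = pos != -1
df1.loc[matched, ['ind', 'indname']] = df2[['ind', 'indname']].to_numpy()[pos[matched]]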

Datetime reformat weekly column

I split a dataframe from minute frequency into daily, weekly and monthly frames. I had no problem reformatting the daily dataframe, but I'm having a good bit of trouble trying to do the same with the weekly one. Here is the weekly dataframe; if someone could help me out, it would be great. I am adding the code I used to reformat the daily dataframe, in case it helps!
I am plotting it with Bokeh and without the datetime format I won't be able to format the axis and hovertools as I would like.
Thanks beforehand.
dfDay1 = dfDay.loc['2014-01-01':'2020-09-31']
dfDay1 = dfDay1.reset_index()
dfDay1['date1'] = pd.to_datetime(dfDay1['date'], format=('%Y/%m/%d'))
dfDay1 = dfDay1.set_index('date')
That worked fine for the day format.
If you need the dates before the /, use Series.str.split with str[0]; if you need the dates after the /, use str[1]:
df['date1'] = pd.to_datetime(df['week'].str.split('/').str[0])
print (df)
week Open Low High Close Volume \
0 2014-01-07/2014-01-13 58.1500 55.38 58.96 56.0000 324133239
1 2014-01-14/2014-01-20 56.3500 55.96 58.57 56.2500 141255151
2 2014-01-21/2014-01-27 57.8786 51.85 59.31 52.8600 279370121
3 2014-01-28/2014-02-03 53.7700 52.75 63.95 62.4900 447186604
4 2014-02-04/2014-02-10 62.8900 60.45 64.90 63.9100 238316161
.. ... ... ... ... ... ...
347 2020-09-01/2020-09-07 297.4000 271.14 303.90 281.5962 98978386
348 2020-09-08/2020-09-14 275.0000 262.64 281.40 271.0100 109717114
349 2020-09-15/2020-09-21 272.6300 244.13 274.52 248.5800 123816172
350 2020-09-22/2020-09-28 254.3900 245.40 259.98 255.8800 98550687
351 2020-09-29/2020-10-05 258.2530 256.50 268.33 261.3500 81921670
date1
0 2014-01-07
1 2014-01-14
2 2014-01-21
3 2014-01-28
4 2014-02-04
.. ...
347 2020-09-01
348 2020-09-08
349 2020-09-15
350 2020-09-22
351 2020-09-29
[352 rows x 7 columns]
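If you ever need the end of each weekly range too, the same split gives it with str[1]; setting date1 as a sorted DatetimeIndex then plays nicely with Bokeh's datetime axis and hover formatting (a small sketch, assuming every 'week' string contains exactly one '/'):
# Also parse the end of each weekly range and index the frame by the start date.
df['date2'] = pd.to_datetime(df['week'].str.split('/').str[1])
df = df.set_index('date1').sort_index()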

Select every nth row as a Pandas DataFrame without reading the entire file

I am reading a large file that contains ~9.5 million rows x 16 cols.
I am interested in retrieving a representative sample, and since the data is organized by time, I want to do this by selecting every 500th element.
I am able to load the data, and then select every 500th row.
My question: Can I immediately read every 500th element (using pd.read_csv() or some other method), without having to read the whole file first and then filter my data?
Question 2: How would you approach this problem if the date column was not ordered? At the moment, I am assuming it's ordered by date, but all data is prone to errors.
Here is a snippet of what the data looks like (first five rows). The first 4 rows are out of order, but the remaining dataset looks ordered (by time):
VendorID tpep_pickup_datetime tpep_dropoff_datetime passenger_count trip_distance RatecodeID store_and_fwd_flag PULocationID DOLocationID payment_type fare_amount extra mta_tax tip_amount tolls_amount improvement_surcharge total_amount
0 1 2017-01-09 11:13:28 2017-01-09 11:25:45 1 3.30 1 N 263 161 1 12.5 0.0 0.5 2.00 0.00 0.3 15.30
1 1 2017-01-09 11:32:27 2017-01-09 11:36:01 1 0.90 1 N 186 234 1 5.0 0.0 0.5 1.45 0.00 0.3 7.25
2 1 2017-01-09 11:38:20 2017-01-09 11:42:05 1 1.10 1 N 164 161 1 5.5 0.0 0.5 1.00 0.00 0.3 7.30
3 1 2017-01-09 11:52:13 2017-01-09 11:57:36 1 1.10 1 N 236 75 1 6.0 0.0 0.5 1.70 0.00 0.3 8.50
4 2 2017-01-01 00:00:00 2017-01-01 00:00:00 1 0.02 2 N 249 234 2 52.0 0.0 0.5 0.00 0.00 0.3 52.80
Can I immediately read every 500th element (using pd.read_csv() or some other method), without having to read first and then filter my data?
Something you could do is use the skiprows parameter in read_csv, which accepts a list-like argument of rows to discard (and thereby also determines which rows to keep). So you could create a np.arange with a length equal to the number of rows in the file, and remove every 500th element from it using np.delete, so that only every 500th row is read:
import numpy as np
import pandas as pd

n_rows = int(9.5e6)
skip = np.arange(n_rows)
skip = np.delete(skip, np.arange(0, n_rows, 500))

df = pd.read_csv('my_file.csv', skiprows=skip)
Can I immediately read every 500th element (using pd.read_csv() or some other method), without having to read first and then filter my data?
First get the length of the file with a custom function, remove every 500th row with numpy.setdiff1d and pass the result to the skiprows parameter in read_csv:
import numpy as np
import pandas as pd

#https://stackoverflow.com/q/845058
def file_len(fname):
    with open(fname) as f:
        for i, l in enumerate(f):
            pass
    return i + 1

len_of_file = file_len('test.csv')
print (len_of_file)

skipped = np.setdiff1d(np.arange(len_of_file), np.arange(0,len_of_file,500))
print (skipped)

df = pd.read_csv('test.csv', skiprows=skipped)
How would you approach this problem if the date column was not ordered? At the moment, I am assuming it's ordered by date, but all data is prone to errors.
The idea is to read only the datetime column via the usecols parameter, then sort it and select every 500th index value, take the set difference with all row indices and pass that to skiprows again:
def file_len(fname):
    with open(fname) as f:
        for i, l in enumerate(f):
            pass
    return i + 1

len_of_file = file_len('test.csv')

df1 = pd.read_csv('test.csv',
                  usecols=['tpep_pickup_datetime'],
                  parse_dates=['tpep_pickup_datetime'])

sorted_idx = (df1['tpep_pickup_datetime'].sort_values()
                 .iloc[np.arange(0,len_of_file,500)].index)

skipped = np.setdiff1d(np.arange(len_of_file), sorted_idx)
print (skipped)

df = pd.read_csv('test.csv', skiprows=skipped).sort_values(by=['tpep_pickup_datetime'])
Use a lambda with skiprows:
pd.read_csv(path, skiprows=lambda i: i % N)
This skips every row whose index is not a multiple of N, i.e. it keeps only every Nth row.
Source: https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html
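A concrete version of that one-liner, using the same 'test.csv' as the snippets above (N = 500; row 0 is kept, so a header line survives):
import pandas as pd

N = 500
# skiprows gets each row index and skips the row whenever the callable is truthy,
# so only rows 0, 500, 1000, ... are read.
df = pd.read_csv('test.csv', skiprows=lambda i: i % N != 0)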
You can use the csv module, which returns an iterator, and use itertools.cycle to select every nth row.
import csv
from itertools import cycle

source_file = 'D:/a.txt'
cycle_size = 500

chooser = (x == 0 for x in cycle(range(cycle_size)))

with open(source_file) as f1:
    rdr = csv.reader(f1)
    data = [row for pick, row in zip(chooser, rdr) if pick]
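If the end result should be a DataFrame and the file's first line is a header (it is row 0, so the chooser picks it), the collected rows convert directly:
import pandas as pd

# data[0] is the header row picked by the chooser; the rest are every 500th data row.
df = pd.DataFrame(data[1:], columns=data[0])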