Very slow filtering with multiple conditions in pandas dataframe

UPDATE: I have edited the question (and code) to make the problem clearer. I use synthetic data here, but imagine a large df of floods and a small one of significant floods. I want to add a reference to every row (of the large df) if it is reasonably close to a significant flood.
I have 2 pandas dataframes (1 large and 1 small).
In every iteration I want to create a subset of the small dataframe based on a few conditions that depend on the current row (of the large df):
import numpy as np
import pandas as pd
import time

SOME_THRESHOLD = 10.5
NUMBER_OF_ROWS = int(2e4)

large_df = pd.DataFrame(index=np.arange(NUMBER_OF_ROWS), data={'a': np.arange(NUMBER_OF_ROWS)})
small_df = large_df.loc[np.random.randint(0, NUMBER_OF_ROWS, 5)]
large_df['similar_past_flood_tag'] = None

count_time = 0
for ind, row in large_df.iterrows():
    start = time.time()
    # This line takes forever.
    df_tmp = small_df[(small_df.index < ind) &
                      (small_df['a'] > (row['a'] - SOME_THRESHOLD)) &
                      (small_df['a'] < (row['a'] + SOME_THRESHOLD))]
    count_time += time.time() - start
    if not df_tmp.empty:
        past_index = df_tmp.loc[df_tmp.index.max()]['a']
        large_df.loc[ind, 'similar_past_flood_tag'] = f'Similar to the large flood of {past_index}'

print(f'The total time of creating the subset df for 2e4 rows is: {count_time} seconds.')
The line that creates the subset takes a long time to compute:
The total time of creating the subset df for 2e4 rows is: 18.276793956756592 seconds.
This seems far too long to me. I have found similar questions, but none of the answers seemed to work (e.g. query and numpy conditions).
Is there a way to optimize this?
Note: the code does what is expected - just very slow.

While your code is logically correct, building the three boolean arrays and slicing the DataFrame on every iteration adds up to quite some time.
Here are some stats with %timeit:
(small_df.index<ind): ~30μs
(small_df['a']>(row['a']-SOME_THRESHOLD)): ~100μs
(small_df['a']<(row['a']+SOME_THRESHOLD)): ~100μs
After '&'-ing all three: ~500μs
Including the DataFrame slice: ~700μs
That, multiplied by 20K iterations, is indeed about 14 seconds. :)
What you can do is take advantage of NumPy broadcasting to compute the boolean matrix more efficiently, and then reconstruct the "valid" DataFrame. See below:
l_ind = np.array(large_df.index)
s_ind = np.array(small_df.index)
l_a = np.array(large_df.a)
s_a = np.array(small_df.a)

arr1 = (l_ind[:, None] < s_ind[None, :])
arr2 = (((l_a[:, None] - SOME_THRESHOLD) < s_a[None, :]) &
        (s_a[None, :] < (l_a[:, None] + SOME_THRESHOLD)))
arr = arr1 & arr2

large_valid_inds, small_valid_inds = np.where(arr)
pd.DataFrame({'large_ind': np.take(l_ind, large_valid_inds),
              'small_ind': np.take(s_ind, small_valid_inds)})
That gives you the following DF, which, if I understood the question properly, is the expected solution:
    large_ind  small_ind
0        1621       1631
1        1622       1631
2        1623       1631
3        1624       1631
4        1625       1631
5        1626       1631
6        1627       1631
7        1628       1631
8        1629       1631
9        1630       1631
10       1992       2002
11       1993       2002
12       1994       2002
13       1995       2002
14       1996       2002
15       1997       2002
16       1998       2002
17       1999       2002
18       2000       2002
19       2001       2002
20       8751       8761
21       8752       8761
22       8753       8761
23       8754       8761
24       8755       8761
25       8756       8761
26       8757       8761
27       8758       8761
28       8759       8761
29       8760       8761
30      10516      10526
31      10517      10526
32      10518      10526
33      10519      10526
34      10520      10526
35      10521      10526
36      10522      10526
37      10523      10526
38      10524      10526
39      10525      10526
40      18448      18458
41      18449      18458
42      18450      18458
43      18451      18458
44      18452      18458
45      18453      18458
46      18454      18458
47      18455      18458
48      18456      18458
49      18457      18458
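If the end goal is the similar_past_flood_tag column from the question, one possible way to finish the job from here (a sketch of mine, not part of the answer; it reuses l_ind, s_ind, large_valid_inds and small_valid_inds from the code above) would be:

# Sketch: turn the (large_ind, small_ind) pairs back into the tag column.
pairs = pd.DataFrame({'large_ind': np.take(l_ind, large_valid_inds),
                      'small_ind': np.take(s_ind, small_valid_inds)})

# Keep one match per large row (here: the largest small index,
# mirroring df_tmp.index.max() in the original loop).
best = pairs.groupby('large_ind')['small_ind'].max()

a_by_small_index = small_df['a'].to_dict()  # small index label -> 'a' value
large_df.loc[best.index, 'similar_past_flood_tag'] = [
    f'Similar to the large flood of {a_by_small_index[i]}' for i in best
]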

In pandas, for loops are much slower than column operations, so changing the calculation to loop over small_df instead of large_df will already give a big improvement:
for ind, row in small_df.iterrows():
    df_tmp = large_df[ <some condition> ]
    # ... some other processing
Even better for your case is to use a merge rather than a condition on large_df. The problem is that your merge is not on exactly equal columns but on approximately equal ones. To use this approach, you can truncate your column into buckets and merge on the bucket. Here's a hacky example.
small_df['a_rounded'] = (small_df['a'] / SOME_THRESHOLD / 2).astype(int)
large_df['a_rounded'] = (large_df['a'] / SOME_THRESHOLD / 2).astype(int)
merge_result = small_df.merge(large_df, on='a_rounded')
small_df['a_rounded2'] = ((small_df['a'] + SOME_THRESHOLD) / SOME_THRESHOLD / 2).astype(int)
large_df['a_rounded2'] = ((large_df['a'] + SOME_THRESHOLD) / SOME_THRESHOLD / 2).astype(int)
merge_result2 = small_df.merge(large_df, on='a_rounded2')
total_merge_result = pd.concat([merge_result, merge_result2])
# Now remove duplicates and impose additional filters.
You can impose the additional filters on the result later.
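A minimal end-to-end sketch of that bucketing idea, assuming the large_df/small_df from the question (the helper names a_large, a_small, b0, b1 and the final query are mine, not from the answer above):

bucket = 2 * SOME_THRESHOLD  # any two values within SOME_THRESHOLD share a bucket
                             # in at least one of the two (shifted) tilings
s = small_df[['a']].rename(columns={'a': 'a_small'}).reset_index().rename(columns={'index': 'small_ind'})
l = large_df[['a']].rename(columns={'a': 'a_large'}).reset_index().rename(columns={'index': 'large_ind'})
for shift, key in [(0.0, 'b0'), (SOME_THRESHOLD, 'b1')]:
    s[key] = ((s['a_small'] + shift) // bucket).astype(int)
    l[key] = ((l['a_large'] + shift) // bucket).astype(int)

cols = ['large_ind', 'small_ind', 'a_large', 'a_small']
candidates = pd.concat([l.merge(s, on='b0')[cols], l.merge(s, on='b1')[cols]]).drop_duplicates()

# Exact filters from the question, applied to the (much smaller) candidate set.
result = candidates.query('small_ind < large_ind and abs(a_large - a_small) < @SOME_THRESHOLD')

From result you can then pick, per large_ind, the row with the largest small_ind, as in the df_tmp.index.max() step of the original loop.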

Related

create seaborn facetgrid based on crosstab table

My dataframe is structured like this:
ACTIVITY 2014 2015 2016 2017 2018
WALK 198 501 485 394 461
RUN 187 446 413 371 495
JUMP 45 97 88 103 78
JOG 1125 2150 2482 2140 2734
SLIDE 1156 2357 2530 2044 1956
My visualization goal: a FacetGrid of bar charts showing the percentage changes over time, with each bar positive or negative depending on that year's percentage change. Each facet is an ACTIVITY type; for example, one facet would be a bar plot for WALK, another for RUN, and so on. The x-axis would be time (2014, 2015, 2016, etc.) and the y-axis would be the value (% change) for each year.
In my analysis, I added pct_change columns for every year except the baseline 2014, using a simple pct_change() function that takes two columns from df and returns a new calculated column:
df['%change_2015'] = pct_change(df['2014'],df['2015'])
df['%change_2016'] = pct_change(df['2015'],df['2016'])
... etc.
So with these new columns, I think I have the elements I need for my visualization goal. How can I do it with seaborn FacetGrids, specifically bar plots?
Augmented dataframe (slice view):
ACTIVITY 2014 2015 2016 2017 2018 %change_2015 %change_2016
WALK 198 501 485 394 461 153.03 -3.19
RUN 187 446 413 371 495 xyz xyz
JUMP 45 97 88 103 78 xyz xyz
I tried reading through the seaborn documentation but was having trouble understanding the configuration options: https://seaborn.pydata.org/generated/seaborn.FacetGrid.html
Is the problem the way my dataframe is ordered and structured? I hope all of that made sense. I appreciate any help with this.
Use:
import pandas as pd
cols = 'ACTIVITY', '%change_2015', '%change_2016'
data = [['Jump', '10.1', '-3.19'],['Run', '9.35', '-3.19'], ['Run', '4.35', '-1.19']]
df = pd.DataFrame(data, columns = cols)
dfm = pd.melt(df, id_vars=['ACTIVITY'], value_vars=['%change_2015', '%change_2016'])
dfm['value'] = dfm['value'].astype(float)
import seaborn as sns
g = sns.FacetGrid(dfm, col='variable')
g.map(sns.barplot, 'ACTIVITY', "value")
Output:
Based on your comment:
g = sns.FacetGrid(dfm, col='ACTIVITY')
g.map(sns.barplot, 'variable', "value")
Output:
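For the dataframe in the question specifically, the same melt-then-facet pattern might look like the sketch below (it assumes your frame is named df, has an ACTIVITY column plus numeric %change_* columns, and that pandas and seaborn are already imported):

# Sketch: reshape the question's wide frame and facet by ACTIVITY.
change_cols = [c for c in df.columns if c.startswith('%change_')]
dfm = df.melt(id_vars='ACTIVITY', value_vars=change_cols,
              var_name='year', value_name='pct_change')
dfm['pct_change'] = dfm['pct_change'].astype(float)

# One facet per ACTIVITY, years on the x-axis, % change on the y-axis.
g = sns.FacetGrid(dfm, col='ACTIVITY', col_wrap=3)
g.map(sns.barplot, 'year', 'pct_change')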

SUM rows in dataframe that have the same date and ADD new column

My code starts this way: it takes data from HERE and I want to extract all the rows that contain "fascia_anagrafica" equal to "20-29". In Italian, "fascia_anagrafica" means "age range". That was relatively simple, as you see below, and I dropped some unimportant columns.
import pandas as pd
import json
import numpy
import sympy
from numpy import arange,exp
from scipy.optimize import curve_fit
from matplotlib import pyplot
import math
import decimal
df = pd.read_csv('https://raw.githubusercontent.com/italia/covid19-opendata-vaccini/master/dati/somministrazioni-vaccini-latest.csv')
df = df[df["fascia_anagrafica"] == "20-29"]
df01=df.drop(columns= ["fornitore","area","sesso_maschile","sesso_femminile","seconda_dose","pregressa_infezione","dose_aggiuntiva","codice_NUTS1","codice_NUTS2","codice_regione_ISTAT","nome_area"])
Now the dataframe looks like this: [image]
As you can see, for every date there is the "20-29" age range, and on every line there is a "prima_dose" value, which stands for "first dose".
Now the problem:
If you look at the date "2020-12-27" you will notice that it is repeated about 20 times (with 20 different values), since in Italy there are 21 regions, and the same applies to the other dates. Unfortunately there are not always 21 rows, because in certain regions no values were entered on some days, so the dataframe is NOT periodic.
I want to add a column to the dataframe that contains the sum of the values that have the same date, for all dates in the dataframe. An example here:
Date          prima_dose   sum_column
2020-8-9      1            13   <---- 1+3+4+5 for 2020-8-9
2020-8-9      3            13
2020-8-9      4            13
2020-8-9      5            13
2020-8-10     2            8    <---- 2+5+1 for 2020-8-10
2020-8-10     5            8
2020-8-10     1            8
thanks!
If you just want to sum all the values of 'prima_dose' for each date and get the result in a new dataframe, you could use groupby.sum():
result = df01.groupby('data_somministrazione')['prima_dose'].sum().reset_index()
Prints:
>>> result
data_somministrazione prima_dose
0 2020-12-27 700
1 2020-12-28 171
2 2020-12-29 87
3 2020-12-30 486
4 2020-12-31 2425
.. ... ...
289 2021-10-12 11583
290 2021-10-13 12532
291 2021-10-14 15347
292 2021-10-15 13689
293 2021-10-16 9293
[294 rows x 2 columns]
This will change the structure of your current dataframe and return a unique row per date.
If you want to add a new column to your existing dataframe without altering its structure, you should use groupby.transform():
df01['prima_dose_per_date'] = df01.groupby('data_somministrazione')['prima_dose'].transform('sum')
Prints:
>>> df01
data_somministrazione fascia_anagrafica prima_dose prima_dose_per_date
0 2020-12-27 20-29 2 700
7 2020-12-27 20-29 9 700
12 2020-12-27 20-29 60 700
17 2020-12-27 20-29 59 700
23 2020-12-27 20-29 139 700
... ... ... ...
138475 2021-10-16 20-29 533 9293
138484 2021-10-16 20-29 112 9293
138493 2021-10-16 20-29 0 9293
138502 2021-10-16 20-29 529 9293
138515 2021-10-16 20-29 0 9293
[15595 rows x 4 columns]
This will keep the current structure of your dataframe and return a new column with the sum of prima_dose per each date.
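To see the difference between the two approaches on a tiny, self-contained example (synthetic data echoing the example in the question, not the vaccine dataset):

import pandas as pd

toy = pd.DataFrame({'data_somministrazione': ['2020-8-9'] * 4 + ['2020-8-10'] * 3,
                    'prima_dose': [1, 3, 4, 5, 2, 5, 1]})

# groupby().sum(): one row per date (13 and 8).
per_date = toy.groupby('data_somministrazione')['prima_dose'].sum().reset_index()

# groupby().transform('sum'): same shape as toy, the per-date sum repeated on every row.
toy['sum_column'] = toy.groupby('data_somministrazione')['prima_dose'].transform('sum')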

Merging two DataFrames in Pandas, based on conditions

I have 2 DataFrames:
siccd date retp
0 2892 31135 -0.036296
1 2892 31226 0.144768
2 2892 31320 0.063924
3 1650 31412 -0.009190
4 1299 31502 0.063326
and
start end ind indname
0 100 999 1 Agric
1 1000 1299 2 Mines
2 1300 1399 3 Oil
3 1400 1499 4 Stone
4 1500 1799 5 Cnstr
5 2000 2099 6 Food
6 2100 2199 7 Smoke
7 2200 2299 8 Txtls
8 2300 2399 9 Apprl
9 2400 2499 10 Wood
10 2500 2599 11 Chair
11 2600 2661 12 Paper
12 2700 2799 13 Print
13 2800 2899 14 Chems
14 2900 2999 15 Ptrlm
15 3000 3099 16 Rubbr
16 3100 3199 17 Lethr
17 3200 3299 18 Glass
18 3300 3399 19 Metal
The task is to take the df1['siccd'] column and compare it to the df2['start'] and df2['end'] columns. If (start <= siccd <= end), assign the ind and indname values of that row in the second DataFrame to the first DataFrame. The output would look something like:
siccd date retp ind indname
0 2892 31135 -0.036296 14 Chems
1 2892 31226 0.144768 14 Chems
2 2892 31320 0.063924 14 Chems
3 1650 31412 -0.009190 5 Cnstr
4 1299 31502 0.063326 2 Mines
I've tried doing this with crude nested for loops, and it provides me with the correct lists that I can append to the end of the DataFrame; however, this is extremely inefficient, and given the dataset's length it is inadequate.
siccd_lst = list(tmp['siccd'])
ind_lst = []
indname_lst = []

def categorize(siccd, df, index):
    if (siccd >= df.iloc[index]['start']) and (siccd <= df.iloc[index]['end']):
        ind_lst.append(df.iloc[index]['ind'])
        indname_lst.append(df.iloc[index]['indname'])
    else:
        pass

for i in range(0, len(ff38.index)-1):
    [categorize(x, ff38, i) for x in siccd_lst]
I have also attempted to vectorize the problem; however, I could not figure out how to iterate through the entire df2 when "searching" for the correct ind and indname to assign to the first DataFrame.
Intervals
We'll create a DataFrame where the index is the Interval and the columns are the values we'll want to map, then we can use .loc with that DataFrame to bring over the data.
If any of your 'siccd' values lie outside all of the intervals you will get a KeyError, so in that case this method won't work.
dfi = pd.DataFrame({'indname': df2['indname'].to_numpy(), 'ind': df2['ind'].to_numpy()},
                   index=pd.IntervalIndex.from_arrays(left=df2.start, right=df2.end, closed='both'))

df1[['indname', 'ind']] = dfi.loc[df1.siccd].to_numpy()
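If some 'siccd' values may fall outside every interval, one possible workaround (a sketch of mine, not part of the answer above) is IntervalIndex.get_indexer, which returns -1 for unmatched values instead of raising:

# Sketch: tolerate siccd values outside all intervals.
ii = pd.IntervalIndex.from_arrays(left=df2.start, right=df2.end, closed='both')
pos = ii.get_indexer(df1['siccd'])   # position in df2, or -1 if no interval matches

matched = pos >= 0
df1.loc[matched, 'ind'] = df2['ind'].to_numpy()[pos[matched]]
df1.loc[matched, 'indname'] = df2['indname'].to_numpy()[pos[matched]]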
Merge
You can perform the full merge (all rows in df1 with all rows in df2) using a temporary column ('t') and then filter the result where it's in between the values.
Since your second DataFrame seems to have a small number of non-overlapping ranges, the result of the merge shouldn't be prohibitively large, in terms of memory, and the non-overlapping ranges ensure the filtering will result in at most one row remaining for each original row in df1.
If any of your 'siccd' values lie outside of all intervals the row from the original DataFrame will get dropped.
res = (df1.assign(t=1)
          .merge(df2.assign(t=1), on='t', how='left')
          .query('siccd >= start & siccd <= end')
          .drop(columns=['t', 'start', 'end']))
# siccd date retp ind indname
#13 2892 31135 -0.036296 14 Chems
#32 2892 31226 0.144768 14 Chems
#51 2892 31320 0.063924 14 Chems
#61 1650 31412 -0.009190 5 Cnstr
#77 1299 31502 0.063326 2 Mines
If you expect values to lie outside of some of the intervals, modify the merge: bring along the original index, subset (which drops those rows), and use combine_first to add them back after the merge. I added a row at the end with a 'siccd' of 252525 as a 6th row to your original df1:
res = (df1.reset_index().assign(t=1)
          .merge(df2.assign(t=1), on='t', how='left')
          .query('siccd >= start & siccd <= end')
          .drop(columns=['t', 'start', 'end'])
          .set_index('index')
          .combine_first(df1)  # Adds back rows, based on index,
      )                        # that were outside any Interval
# date ind indname retp siccd
#0 31135.0 14.0 Chems -0.036296 2892.0
#1 31226.0 14.0 Chems 0.144768 2892.0
#2 31320.0 14.0 Chems 0.063924 2892.0
#3 31412.0 5.0 Cnstr -0.009190 1650.0
#4 31502.0 2.0 Mines 0.063326 1299.0
#5 31511.0 NaN NaN 0.151341 252525.0
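If the float upcasting that combine_first introduces is a nuisance, another option (again a sketch, not from the answer above) is to only bring the two mapped columns back onto the untouched df1 with a join, so rows outside every interval simply get NaN there:

# Sketch: keep df1's original dtypes by joining only 'ind'/'indname' back.
mapped = (df1.reset_index().assign(t=1)
             .merge(df2.assign(t=1), on='t', how='left')
             .query('siccd >= start & siccd <= end')
             .set_index('index')[['ind', 'indname']])

res = df1.join(mapped)   # df1's own columns keep their original dtypes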

Summing columns and rows

How do I add up rows and columns?
The last column, Sum, needs to be the row-wise sum of R0+R1+R2.
The last row (Total) needs to be the column-wise sum of those columns.
import pandas as pd
# initialize list of lists
data = [['AP',16,20,78], ['AP+', 10,14,55], ['SP',32,26,90],['Total',0, 0, 0]]
# Create the pandas DataFrame
df = pd.DataFrame(data, columns = ['Type', 'R0', 'R1', 'R2'])
The result:
Type R0 R1 R2 Sum
0 AP 16 20 78 NaN
1 AP+ 10 14 55 NaN
2 SP 32 26 90 NaN
3 Total 0 0 0 NaN
Let us try .iloc position selection:
df.iloc[-1,1:]=df.iloc[:-1,1:].sum()    # fill the 'Total' row with the column-wise sums
df['Sum']=df.iloc[:,1:].sum(axis=1)     # add a 'Sum' column with the row-wise sums
df
Type R0 R1 R2 Sum
0 AP 16 20 78 114
1 AP+ 10 14 55 79
2 SP 32 26 90 148
3 Total 58 60 223 341
In general it may be better practice to specify column names:
import pandas as pd
# initialize list of lists
data = [['AP',16,20,78], ['AP+', 10,14,55], ['SP',32,26,90],['Total',0, 0, 0]]
# Create the pandas DataFrame
df = pd.DataFrame(data, columns = ['Type', 'R0', 'R1', 'R2'])
# List columns
cols_to_sum=['R0', 'R1', 'R2']
# Access last row and sum columns-wise
df.loc[df.index[-1], cols_to_sum] = df[cols_to_sum].sum(axis=0)
# Create 'Sum' column summing row-wise
df['Sum']=df[cols_to_sum].sum(axis=1)
df
Type R0 R1 R2 Sum
0 AP 16 20 78 114
1 AP+ 10 14 55 79
2 SP 32 26 90 148
3 Total 58 60 223 341

What type of graph can best show the correlation between 'Fare' (price) and "Survival" (Titanic)?

I'm playing around with Seaborn and Matplotlib, and I'm trying to find the best type of graph to show the correlation between fare values and the chance of survival in the Titanic dataset.
The Titanic fare column has a lot of different values ranging from 1 to 500 and some of the values are repeated often.
Here is a sample of value_counts:
titanic.fare.value_counts()
8.0500 43
13.0000 42
7.8958 38
7.7500 34
26.0000 31
10.5000 24
7.9250 18
7.7750 16
0.0000 15
7.2292 15
26.5500 15
8.6625 13
7.8542 13
7.2500 13
7.2250 12
16.1000 9
9.5000 9
15.5000 8
24.1500 8
14.5000 7
7.0500 7
52.0000 7
31.2750 7
56.4958 7
69.5500 7
14.4542 7
30.0000 6
39.6875 6
46.9000 6
21.0000 6
.....
91.0792 2
106.4250 2
164.8667 2
The survived column, on the other hand, has only two values:
>>> titanic.survived.head(10)
271 1
597 0
302 0
633 0
277 0
413 0
674 0
263 0
466 0
A histogram would only show the frequency of fares in certain ranges.
For a scatter plot I would need two variables; having "survived" which has only two values would make for a strange variable.
Is there a way to show the rise of survivability as fare increases clearly through a line graph?
I know there is a correlation, because if I sort the fare values in ascending order (0-500) and then do:
>>> titanic.head(50).survived.sum()
5
>>> titanic.tail(50).survived.sum()
37
I see a correlation.
Thanks.
This is what I did to show the correlation between the fare values and the chance of survival:
First, I created a new column Fare Groups, converting fare values to groups of fare ranges, using cut().
df['Fare Groups'] = pd.cut(df.Fare, [0,50,100,150,200,550])
Next, I created a pivot_table().
piv_fare = df.pivot_table(index='Fare Groups', columns='Survived', values = 'Fare', aggfunc='count')
Output:
Survived 0 1
Fare Groups
(0, 50] 484 232
(50, 100] 37 70
(100, 150] 5 19
(150, 200] 3 6
(200, 550] 6 14
Plot:
piv_fare.plot(kind='bar')
It seems that those who had the cheapest tickets (0 to 50) had the lowest chance of survival. In fact, (0, 50] is the only fare range where the chance of dying is higher than the chance of surviving, and not just higher, but significantly higher.
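Another way to read the same data (a sketch building on the pivot table above, not part of the original answer) is to plot the survival rate per fare group directly, so each bar shows the share of passengers who survived rather than raw counts:

# Sketch: survival rate (mean of the 0/1 Survived column) per fare group.
rate = df.groupby(pd.cut(df.Fare, [0, 50, 100, 150, 200, 550]))['Survived'].mean()
ax = rate.plot(kind='bar')
ax.set_ylabel('Survival rate')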