Pandas row sum for values > 0

I have a dataframe of the following format:
ROW  Value1  Value2  Value3  Value4
  1      10      10      -5      -2
  2      50      20     -10      -7
  3      10       5       0      -1
I am looking to calculate, for each row, the sum of the positive values and the sum of the negative values. Essentially, the resulting frame should look like:
ROW  Pos_Total  Neg_Total
  1         20         -7
  2         70        -17
  3         15         -1
One property of my dataset: a given column contains only positive or only negative values.
Any ideas on how this can be done? I tried subsetting by > 0 but was not successful.
Thanks!

Since each column is either all positive or all negative, you can use all() to check the condition along the columns, then groupby along axis=1 (with ROW as the index):
df.groupby(df.gt(0).all(), axis=1).sum()
Output:
     False  True
ROW
1       -7    20
2      -17    70
3       -1    15
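Note that grouping along columns with axis=1 is deprecated in recent pandas versions (2.1+), so on newer releases the same idea can be written by transposing first; a minimal sketch of an equivalent, still assuming ROW is the index:
# group the transposed frame by the per-column sign, then transpose back
out = df.T.groupby(df.gt(0).all()).sum().T
out.columns = ['Neg_Total', 'Pos_Total']  # False -> negative columns, True -> positive columns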
In general, I'll just subset/clip and sum:
out = pd.DataFrame({'pos': df.clip(lower=0).sum(1),
                    'neg': df.clip(upper=0).sum(1)})
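If ROW is still a regular column (as in the sample input), it would get clipped and summed as well, so presumably you would set it as the index first; a minimal sketch of that assumption:
d = df.set_index('ROW')  # keep ROW out of the sums
out = pd.DataFrame({'pos': d.clip(lower=0).sum(1),
                    'neg': d.clip(upper=0).sum(1)})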

Use DataFrame.melt, but if performance is important the other solutions are better ;):
df1 = (df.melt('ROW')
         .assign(g=lambda x: np.where(x['value'].gt(0), 'Pos_Total', 'Neg_Total'))
         .pivot_table(index='ROW', columns='g', values='value', aggfunc='sum', fill_value=0)
         .reset_index()
         .rename_axis(None, axis=1))
print(df1)
   ROW  Neg_Total  Pos_Total
0    1         -7         20
1    2        -17         70
2    3         -1         15
Numpy alternative with numpy.clip:
a = df.set_index('ROW').to_numpy()
df = pd.DataFrame({'Pos_Total': np.sum(np.clip(a, a_min=0, a_max=None), 1),
                   'Neg_Total': np.sum(np.clip(a, a_min=None, a_max=0), 1)},
                  index=df['ROW'])

You could use:
(df.melt(id_vars='ROW')
   .assign(sign=lambda d: np.where(d['value'].gt(0), 'Pos_Total', 'Neg_Total'))
   .groupby(['ROW', 'sign'])['value'].sum()
   .unstack('sign')
)
Or alternatively, using masks.
numpy version (faster):
import numpy as np
a = df.set_index('ROW').values
mask = a > 0
pd.DataFrame({'Pos_Total': np.where(mask, a, 0).sum(1),
              'Neg_Total': np.where(mask, 0, a).sum(1)})
pandas version (slower than numpy but faster than melt):
d = df.set_index('ROW')
mask = d.gt(0)
pd.DataFrame({'Pos_Total': d.where(mask).sum(1),
              'Neg_Total': d.mask(mask).sum(1)},
             index=df['ROW'])
Output:
     Pos_Total  Neg_Total
ROW
1         20.0       -7.0
2         70.0      -17.0
3         15.0       -1.0

Let us try apply:
out = df.set_index('ROW').apply(lambda x: {'Pos': x[x > 0].sum(), 'Neg': x[x < 0].sum()},
                                result_type='expand',
                                axis=1)
Out[33]:
     Pos  Neg
ROW
1     20   -7
2     70  -17
3     15   -1

Timing of all answers in order of speed, computed with timeit on 30k rows with unique ROW values.
# @mozway + @jezrael (numpy mask v2)
940 µs ± 10.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# @mozway (numpy mask):
1.29 ms ± 26.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# @Quang Hoang (groupby)
4.68 ms ± 184 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# @Quang Hoang (clip)
5.2 ms ± 91 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# @mozway (pandas mask)
10.5 ms ± 612 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# @mozway (melt+groupby)
36.2 ms ± 1.14 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# @jezrael (melt+pivot_table)
48.5 ms ± 740 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
# @BENY (apply)
9.05 s ± 76.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
setup:
df = pd.DataFrame({'ROW': [1, 2, 3],
                   'Value1': [10, 50, 10],
                   'Value2': [10, 20, 5],
                   'Value3': [-5, -10, 0],
                   'Value4': [-2, -7, -1]})
df = pd.concat([df]*10000, ignore_index=True)
df['ROW'] = range(len(df))


Frequency Table from All DataFrame Data

I want to generate a frequency table from all values in the DataFrame. I do not want the values from the index, and the index can be destroyed.
Sample data:
col_list = ['ob1','ob2','ob3','ob4', 'ob5']
df = pd.DataFrame(np.random.uniform(73.965,74.03,size=(25, 5)).astype(float), columns=col_list)
My attempt based off this answer:
my_bins = [i for i in np.arange(73.965, 74.030, 0.005)]
df2 = df.apply(pd.Series.value_counts, bins=my_bins)
The code crashes, and I can't find another example that does what I'm trying to do.
Desired output is a frequency table with counts for all values in bins. Something like this:
data_range      Frequency
73.965<=73.97           1
73.97<=73.975           0
73.98<=73.985           3
73.99<=73.995           2
And so on.
Your approach/code works fine for me.
my_bins = [i for i in np.arange(73.965, 74.030, 0.005)]
out1 = (
    df.apply(pd.Series.value_counts, bins=my_bins)
      .sum(axis=1).reset_index()
      .set_axis(['data_range', 'Frequency'], axis=1)
)
# 32.6 ms ± 803 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Here is a different approach (using cut) that seems to be ~12x faster than apply.
my_bins = np.arange(73.965, 74.030, 0.005)

labels = [f"{np.around(l, 3)}<={np.around(r, 3)}"
          for l, r in zip(my_bins[:-1], my_bins[1:])]

out2 = (
    pd.Series(pd.cut(df.to_numpy().flatten(),
                     my_bins, labels=labels))
      .value_counts(sort=False).reset_index()
      .set_axis(['data_range', 'Frequency'], axis=1)
)
# 2.42 ms ± 45.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Output:
print(out2)
       data_range  Frequency
0   73.965<=73.97         16
1   73.97<=73.975          0
2   73.975<=73.98         15
3   73.98<=73.985         12
4   73.985<=73.99          7
..            ...        ...
7    74.0<=74.005          8
8   74.005<=74.01          9
9   74.01<=74.015          7
10  74.015<=74.02          7
11  74.02<=74.025         11

[12 rows x 2 columns]
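If only the counts are needed, np.histogram over the flattened values is another option along the same lines; a minimal sketch reusing my_bins and labels from above (note that np.histogram bins are closed on the left while pd.cut defaults to right-closed, so values sitting exactly on an edge can land in a different bin):
counts, _ = np.histogram(df.to_numpy().ravel(), bins=my_bins)  # len(my_bins) - 1 counts
out3 = pd.DataFrame({'data_range': labels, 'Frequency': counts})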

How to speed up df.query

df.query uses numexpr under the hood, but is much slower than pure numexpr
Let's say I have a big DataFrame:
from random import shuffle
import pandas as pd
l1=list(range(1000000))
l2=list(range(1000000))
shuffle(l1)
shuffle(l2)
df = pd.DataFrame([l1, l2]).T
df=df.sample(frac=1)
df=df.rename(columns={0: 'A', 1:'B'})
And I want to compare 2 columns:
%timeit (df.A == df.B) | (df.A / df.B < 1) | (df.A * df.B > 3000)
10.8 ms ± 309 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
It takes 10.8 ms in this example.
Now I import numexpr and do the same thing:
import numexpr
a = df.A.__array__()
b = df.B.__array__()
%timeit numexpr.evaluate('((a == b) | (a / b < 1) | (a * b > 3000))')
1.95 ms ± 25.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
numexpr is 6 times faster than Pandas
Now let's use df.loc:
%timeit df.loc[numexpr.evaluate('((a == b) | (a / b < 1) | (a * b > 3000))')]
20.5 ms ± 155 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit df.loc[(df.A == df.B) | (df.A / df.B < 1) | (df.A * df.B > 3000)]
27 ms ± 296 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit df.query('((A == B) | (A / B < 1) | (A * B > 3000))')
32.5 ms ± 80.4 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
numexpr is still significantly faster than pure pandas. But why is df.query so slow? It uses numexpr under the hood. Is there a way to fix that, or any other way to use numexpr in pandas without doing a lot of tweaking?
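One option, in the spirit of the snippets above, is to skip query entirely and feed the numexpr mask straight into .loc through a small helper; a minimal sketch (fast_filter is a hypothetical name, not a pandas API):
import numexpr
import pandas as pd

def fast_filter(df, expr, **cols):
    # hypothetical helper: map expression names to columns, evaluate with numexpr,
    # then use the resulting boolean mask to index the DataFrame
    arrays = {name: df[col].to_numpy() for name, col in cols.items()}
    mask = numexpr.evaluate(expr, local_dict=arrays)
    return df.loc[mask]

out = fast_filter(df, '(a == b) | (a / b < 1) | (a * b > 3000)', a='A', b='B')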

Pandas: is there a difference of speed between the deprecated df.sum(level=...) and df.groupby(level=...).sum()?

I'm using pandas and noticed a HUGE difference in performance between these two statements:
df.sum(level=['My', 'Index', 'Levels']) # uses numpy sum which is vectorized
and
df.groupby(level=['My', 'Index', 'Levels']).sum() # Slow...
The first example uses numpy.sum, which is vectorized, as stated in the documentation.
Unfortunately, using sum(level=...) is deprecated in the API and produces an ugly warning:
FutureWarning: Using the level keyword in DataFrame and Series
aggregations is deprecated and will be removed in a future version.
Use groupby instead. df.sum(level=1) should use
df.groupby(level=1).sum()
I don't want to use the non-vectorized version and get poor processing performance. How can I use numpy.sum along with groupby?
Edit: following the comments, here is a basic test I have done: pandas 1.4.4, 10k random lines, 10 levels (index).
import pandas as pd
import numpy as np
print('pandas:', pd.__version__)
nb_lines = int(1e4)
nb_levels = 10
# sequences of random integers [0, 9] x 10k
ix = np.random.randint(0, nb_levels-1, size=(nb_lines, nb_levels))
cols = [chr(65+i) for i in range(nb_levels)] # A, B, C, ...
df = pd.DataFrame(ix, columns=cols)
df = df.set_index(cols)
df['VALUE'] = np.random.rand(nb_lines) # random values to aggregate
print('with groupby:')
%timeit -n 300 df.groupby(level=cols).sum()
print('without groupby:')
%timeit -n 300 df.sum(level=cols)
And the result is:
pandas: 1.4.4
with groupby:
5.51 ms ± 1.06 µs per loop (mean ± std. dev. of 7 runs, 500 loops each)
without groupby:
<magic-timeit>:1: FutureWarning: Using the level keyword in DataFrame and Series aggregations is deprecated and will be removed in a future version. Use groupby instead. df.sum(level=1) should use df.groupby(level=1).sum().
4.93 ms ± 40.1 µs per loop (mean ± std. dev. of 7 runs, 500 loops each)
This is just an example, but the result is always faster without groupby. Changing parameters (levels, step size for the columns to group on, etc) does not change the result.
In the end, for a big data set, you can see the difference between the two methods (numpy.sum vs other).
@mozway, your results indicate similar performance; however, if you increase the number of levels, you should see results worsening with the groupby version, at least those are the results on my computer. See the edited code so you can change the number of levels (example with 10 levels and 100k lines):
import numpy as np
import pandas as pd
from string import ascii_uppercase as UP
np.random.seed(0)
N = 100_000
nb_levels = 10
cols = [chr(65+i) for i in range(nb_levels)]
d = {cols[i]: np.random.choice(list(UP), size=N) for i in range(nb_levels)}
d.update({'num': np.random.random(size=N)})
df = pd.DataFrame(d).set_index(cols)
print(pd.__version__)
print('with groupby:')
%timeit -n 300 df.groupby(level=cols).sum()
print('without groupby:')
%timeit -n 300 df.sum(level=cols)
... and the result:
1.4.4
with groupby:
50.8 ms ± 536 µs per loop (mean ± std. dev. of 7 runs, 300 loops each)
without groupby:
<magic-timeit>:1: FutureWarning: Using the level keyword in DataFrame and Series aggregations is deprecated and will be removed in a future version. Use groupby instead. df.sum(level=1) should use df.groupby(level=1).sum().
42 ms ± 506 µs per loop (mean ± std. dev. of 7 runs, 300 loops each)
Thanks
This doesn't seem to be true; both approaches have a similar speed.
Setup (3 levels, 26 groups each, ~18k combinations of groups, 1M rows):
import numpy as np
import pandas as pd
from string import ascii_uppercase as UP
np.random.seed(0)
N = 1_000_000
cols = ['A', 'B', 'C']
df = pd.DataFrame({'A': np.random.choice(list(UP), size=N),
                   'B': np.random.choice(list(UP), size=N),
                   'C': np.random.choice(list(UP), size=N),
                   'num': np.random.random(size=N)}
                 ).set_index(cols)
Test:
pd.__version__
1.4.4
%%timeit # 3 times
df.sum(level=cols)
316 ms ± 85.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
287 ms ± 21.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
297 ms ± 54.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit # 3 times
df.groupby(level=cols).sum()
284 ms ± 41.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
286 ms ± 18.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
311 ms ± 31.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Updated example from the OP:
import numpy as np
import pandas as pd
from string import ascii_uppercase as UP
np.random.seed(0)
N = 1_000_000
nb_cols = 10
cols = [chr(65+i) for i in range(nb_cols)]
d = {cols[i]: np.random.choice(list(UP), size=N) for i in range(nb_cols)}
d.update({'num': np.random.random(size=N)})
df = pd.DataFrame(d).set_index(cols)
print(pd.__version__)
1.5.0
%%timeit
df.sum(level=cols)
3.36 s ± 478 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
df.groupby(level=cols).sum()
2.94 s ± 444 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
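As a side note, when the order of the group keys does not matter, sort=False sometimes trims a bit of the groupby overhead; a minimal sketch, worth benchmarking on your own data rather than a guaranteed win:
# result order follows first appearance of each key instead of sorted key order
res = df.groupby(level=cols, sort=False).sum()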

Create new dataframe column with isoweekday from datetime

I have this dataframe and I want to make a new column for which day of the week the collisions were on.
collision_date
0 2020-03-14
1 2020-07-26
2 2009-02-03
3 2009-02-28
4 2009-02-09
I have tried variations of this but nothing works.
df["day of the week"] = df["collision_date"].isoweekday()
df["day of the week"] = df["collision_date"].apply(isoweekday)
Assuming collision_date is datetime, we can use dt.weekday (+1 to match isoweekday, which returns 1-7 instead of 0-6):
# Convert If needed
df['collision_date'] = pd.to_datetime(df['collision_date'])
# Turn into Number
df['day of week'] = df['collision_date'].dt.weekday + 1
The slower option with apply is to call isoweekday per date:
from datetime import date
# Convert If needed
df['collision_date'] = pd.to_datetime(df['collision_date'])
# Call isoweekday per date
df['day of week'] = df['collision_date'].apply(date.isoweekday)
df:
  collision_date  day of week
0     2020-03-14            6
1     2020-07-26            7
2     2009-02-03            2
3     2009-02-28            6
4     2009-02-09            1
Timing Information via timeit:
Sample Data with 1000 rows:
df = pd.DataFrame({
    'collision_date': pd.date_range(start='now', periods=1000, freq='D')
})
dt.weekday:
%timeit df['collision_date'].dt.weekday + 1
261 µs ± 1.91 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
apply:
%timeit df['collision_date'].apply(date.isoweekday)
2.53 ms ± 90.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
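For completeness, dt.isocalendar() (available in pandas >= 1.1) exposes the ISO weekday directly, so the +1 shift is not needed; a minimal sketch:
# isocalendar() returns a frame with 'year', 'week' and 'day' (ISO day of week, 1-7)
df['day of week'] = df['collision_date'].dt.isocalendar().day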

efficiently extract rows from a pandas DataFrame ignoring missing index labels

I am looking for a more efficient equivalent of
df.reindex(labels).dropna(subset=[0])
that avoids including NaN rows for missing labels in the result, rather than having to delete them after reindex puts them in.
Equivalently, I am looking for an efficient version of
df.loc[labels]
that silently ignores labels that are not in df.index, i.e. the result may have fewer rows than labels has elements.
I need something that is efficient when the numbers of rows, columns and labels are all large and there is a significant miss rate. Specifically, I'm looking for something sublinear in the length of the dataset.
Update 1
Here is a concrete demonstration of the issue, following on from @MaxU's answer:
In [2]: L = 10**7
...: M = 10**4
...: N = 10**9
...: np.random.seed([3, 1415])
...: df = pd.DataFrame(np.random.rand(L, 2))
...: labels = np.random.randint(N, size=M)
...: M-len(set(labels))
...:
...:
Out[2]: 0
In [3]: %timeit df[df.index.isin(set(labels))]
904 ms ± 59.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [4]: %timeit df.loc[df.index.intersection(set(labels))]
207 ms ± 11.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [5]: %timeit df.loc[np.intersect1d(df.index, labels)]
427 ms ± 37 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [6]: %timeit df.loc[labels[labels<L]]
329 µs ± 23 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [7]: %timeit df.iloc[labels[labels<L]]
161 µs ± 8.35 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
The last 2 examples are ~1000 times faster than those iterating over df.index. This demonstrates that df.loc[labels] does not iterate over the index and that dataframes have an efficient index structure, i.e. df.index does indeed index.
So the question is: how do I get something as efficient as df.loc[labels[labels<L]] when df.index is not a contiguous sequence of numbers? A partial solution is the original
In [8]: %timeit df.reindex(labels).dropna(subset=[0])
1.81 ms ± 187 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
That is still ~100 times faster than the suggested solutions, but still loses an order of magnitude relative to what may be possible.
Update 2
To further demonstrate that it is possible to get sublinear performance even without assumptions on the index, repeat the above with a string index:
In [16]: df.index=df.index.map(str)
...: labels = np.array(list(map(str, labels)))
...:
...:
In [17]: %timeit df[df.index.isin(set(labels))]
657 ms ± 48.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [18]: %timeit df.loc[df.index.intersection(set(labels))]
974 ms ± 160 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [19]: %timeit df.reindex(labels).dropna()
8.7 ms ± 121 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
So to be clear I am after something that is more efficient than df.reindex(labels).dropna(). This is already sublinear in df.shape[0] and makes no assumptions about the index, therefore so should the solution.
The issue I want to address is that df.reindex(labels) will include NaN rows for missing labels that then need removing with dropna. I am after an equivalent of df.reindex(labels) that does not put them there in the first place, without scanning the entire df.index to figure out the missing labels. This must be possible at least in principle: If reindex can efficiently handle missing labels on the fly by inserting dummy rows, it should be possible to handle them even more efficiently on the fly by doing nothing.
Here is a small comparison for different approaches.
Sample DF (shape: 10,000,000 x 2):
np.random.seed([3, 1415])
df = pd.DataFrame(np.random.rand(10**7, 2))
labels = np.random.randint(10**9, size=10**4)
In [88]: df.shape
Out[88]: (10000000, 2)
valid (existing labels):
In [89]: (labels <= 10**7).sum()
Out[89]: 1008
invalid (not existing labels):
In [90]: (labels > 10**7).sum()
Out[90]: 98992
Timings:
In [103]: %timeit df[df.index.isin(set(labels))]
943 ms ± 7.86 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [104]: %timeit df.loc[df.index.intersection(set(labels))]
360 ms ± 1.65 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [105]: %timeit df.loc[np.intersect1d(df.index, labels)]
513 ms ± 655 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
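Another candidate that stays sublinear in len(df) and makes no assumption about the index values is Index.get_indexer, which hash-looks-up each label and returns -1 for misses (it does require a unique index); a minimal sketch:
# positions of the requested labels in df.index; -1 marks labels that are missing
pos = df.index.get_indexer(labels)
result = df.iloc[pos[pos >= 0]]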