How to improve the speed of groupby/transform? - pandas

I want to implement the groupmax function, which finds the max value within each group, and assign it back to the rows within each group. It seems groupby(name).transform(max) is what I need. E.g.
In [1]: print df
name value
0 0 0.363030
1 0 0.324828
2 0 0.499279
3 1 0.799836
4 1 0.886653
5 1 0.335056
In [2]: print df.groupby('name').transform(max)
value
0 0.499279
1 0.499279
2 0.499279
3 0.886653
4 0.886653
5 0.886653
However this approach is very slow when the size of the data frame becomes large and there are many small groups. E.g. the following code will hang there forever
df = pd.DataFrame({'name' : repeat([str(x) for x in range(0, 1000000)], 2), 'value' : rand(2000000)})
print df.groupby('name').transform(max)
I wonder if there is any fast solutions for this problem?
Thanks a lot!

You could try something like
>>> df = pd.DataFrame({'name': np.repeat(list(map(str,range(10**6))), 2), 'value': np.random.rand(2*10**6)})
>>> %timeit df.groupby("name").max().loc[df.name.values].reset_index(drop=True)
1 loops, best of 3: 2.12 s per loop
Still not great, but better.

Related

Some trouble as a beginner [duplicate]

This seems like a ridiculously easy question... but I'm not seeing the easy answer I was expecting.
So, how do I get the value at an nth row of a given column in Pandas? (I am particularly interested in the first row, but would be interested in a more general practice as well).
For example, let's say I want to pull the 1.2 value in Btime as a variable.
Whats the right way to do this?
>>> df_test
ATime X Y Z Btime C D E
0 1.2 2 15 2 1.2 12 25 12
1 1.4 3 12 1 1.3 13 22 11
2 1.5 1 10 6 1.4 11 20 16
3 1.6 2 9 10 1.7 12 29 12
4 1.9 1 1 9 1.9 11 21 19
5 2.0 0 0 0 2.0 8 10 11
6 2.4 0 0 0 2.4 10 12 15
To select the ith row, use iloc:
In [31]: df_test.iloc[0]
Out[31]:
ATime 1.2
X 2.0
Y 15.0
Z 2.0
Btime 1.2
C 12.0
D 25.0
E 12.0
Name: 0, dtype: float64
To select the ith value in the Btime column you could use:
In [30]: df_test['Btime'].iloc[0]
Out[30]: 1.2
There is a difference between df_test['Btime'].iloc[0] (recommended) and df_test.iloc[0]['Btime']:
DataFrames store data in column-based blocks (where each block has a single
dtype). If you select by column first, a view can be returned (which is
quicker than returning a copy) and the original dtype is preserved. In contrast,
if you select by row first, and if the DataFrame has columns of different
dtypes, then Pandas copies the data into a new Series of object dtype. So
selecting columns is a bit faster than selecting rows. Thus, although
df_test.iloc[0]['Btime'] works, df_test['Btime'].iloc[0] is a little bit
more efficient.
There is a big difference between the two when it comes to assignment.
df_test['Btime'].iloc[0] = x affects df_test, but df_test.iloc[0]['Btime']
may not. See below for an explanation of why. Because a subtle difference in
the order of indexing makes a big difference in behavior, it is better to use single indexing assignment:
df.iloc[0, df.columns.get_loc('Btime')] = x
df.iloc[0, df.columns.get_loc('Btime')] = x (recommended):
The recommended way to assign new values to a
DataFrame is to avoid chained indexing, and instead use the method shown by
andrew,
df.loc[df.index[n], 'Btime'] = x
or
df.iloc[n, df.columns.get_loc('Btime')] = x
The latter method is a bit faster, because df.loc has to convert the row and column labels to
positional indices, so there is a little less conversion necessary if you use
df.iloc instead.
df['Btime'].iloc[0] = x works, but is not recommended:
Although this works, it is taking advantage of the way DataFrames are currently implemented. There is no guarantee that Pandas has to work this way in the future. In particular, it is taking advantage of the fact that (currently) df['Btime'] always returns a
view (not a copy) so df['Btime'].iloc[n] = x can be used to assign a new value
at the nth location of the Btime column of df.
Since Pandas makes no explicit guarantees about when indexers return a view versus a copy, assignments that use chained indexing generally always raise a SettingWithCopyWarning even though in this case the assignment succeeds in modifying df:
In [22]: df = pd.DataFrame({'foo':list('ABC')}, index=[0,2,1])
In [24]: df['bar'] = 100
In [25]: df['bar'].iloc[0] = 99
/home/unutbu/data/binky/bin/ipython:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
self._setitem_with_indexer(indexer, value)
In [26]: df
Out[26]:
foo bar
0 A 99 <-- assignment succeeded
2 B 100
1 C 100
df.iloc[0]['Btime'] = x does not work:
In contrast, assignment with df.iloc[0]['bar'] = 123 does not work because df.iloc[0] is returning a copy:
In [66]: df.iloc[0]['bar'] = 123
/home/unutbu/data/binky/bin/ipython:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
In [67]: df
Out[67]:
foo bar
0 A 99 <-- assignment failed
2 B 100
1 C 100
Warning: I had previously suggested df_test.ix[i, 'Btime']. But this is not guaranteed to give you the ith value since ix tries to index by label before trying to index by position. So if the DataFrame has an integer index which is not in sorted order starting at 0, then using ix[i] will return the row labeled i rather than the ith row. For example,
In [1]: df = pd.DataFrame({'foo':list('ABC')}, index=[0,2,1])
In [2]: df
Out[2]:
foo
0 A
2 B
1 C
In [4]: df.ix[1, 'foo']
Out[4]: 'C'
Note that the answer from #unutbu will be correct until you want to set the value to something new, then it will not work if your dataframe is a view.
In [4]: df = pd.DataFrame({'foo':list('ABC')}, index=[0,2,1])
In [5]: df['bar'] = 100
In [6]: df['bar'].iloc[0] = 99
/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pandas-0.16.0_19_g8d2818e-py2.7-macosx-10.9-x86_64.egg/pandas/core/indexing.py:118: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
self._setitem_with_indexer(indexer, value)
Another approach that will consistently work with both setting and getting is:
In [7]: df.loc[df.index[0], 'foo']
Out[7]: 'A'
In [8]: df.loc[df.index[0], 'bar'] = 99
In [9]: df
Out[9]:
foo bar
0 A 99
2 B 100
1 C 100
Another way to do this:
first_value = df['Btime'].values[0]
This way seems to be faster than using .iloc:
In [1]: %timeit -n 1000 df['Btime'].values[20]
5.82 µs ± 142 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [2]: %timeit -n 1000 df['Btime'].iloc[20]
29.2 µs ± 1.28 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
df.iloc[0].head(1) - First data set only from entire first row.
df.iloc[0] - Entire First row in column.
In a general way, if you want to pick up the first N rows from the J column from pandas dataframe the best way to do this is:
data = dataframe[0:N][:,J]
To access a single value you can use the method iat that is much faster than iloc:
df['Btime'].iat[0]
You can also use the method take:
df['Btime'].take(0)
.iat and .at are the methods for getting and setting single values and are much faster than .iloc and .loc. Mykola Zotko pointed this out in their answer, but they did not use .iat to its full extent.
When we can use .iat or .at, we should only have to index into the dataframe once.
This is not great:
df['Btime'].iat[0]
It is not ideal because the 'Btime' column was first selected as a series, then .iat was used to index into that series.
These two options are the best:
Using zero-indexed positions:
df.iat[0, 4] # get the value in the zeroth row, and 4th column
Using Labels:
df.at[0, 'Btime'] # get the value where the index label is 0 and the column name is "Btime".
Both methods return the value of 1.2.
To get e.g the value from column 'test' and row 1 it works like
df[['test']].values[0][0]
as only df[['test']].values[0] gives back a array
Another way of getting the first row and preserving the index:
x = df.first('d') # Returns the first day. '3d' gives first three days.
According to pandas docs, at is the fastest way to access a scalar value such as the use case in the OP (already suggested by Alex on this page).
Building upon Alex's answer, because dataframes don't necessarily have a range index it might be more complete to index df.index (since dataframe indexes are built on numpy arrays, you can index them like an array) or call get_loc() on columns to get the integer location of a column.
df.at[df.index[0], 'Btime']
df.iat[0, df.columns.get_loc('Btime')]
One common problem is that if you used a boolean mask to get a single value, but ended up with a value with an index (actually a Series); e.g.:
0 1.2
Name: Btime, dtype: float64
you can use squeeze() to get the scalar value, i.e.
df.loc[df['Btime']<1.3, 'Btime'].squeeze()

Comparing string values from sequential rows in pandas series

I am trying to count common string values in sequential rows of a panda series using a user defined function and to write an output into a new column. I figured out individual steps, but when I put them together, I get a wrong result. Could you please tell me the best way to do this? I am a very beginner Pythonista!
My pandas df is:
df = pd.DataFrame({"Code": ['d7e', '8e0d', 'ft1', '176', 'trk', 'tr71']})
My string comparison loop is:
x='d7e'
y='8e0d'
s=0
for i in y:
b=str(i)
if b not in x:
s+=0
else:
s+=1
print(s)
the right result for these particular strings is 2
Note, when I do def func(x,y): something happens to s counter and it doesn't produce the right result. I think I need to reset it to 0 every time the loop runs.
Then, I use df.shift to specify the position of y and x in a series:
x = df["Code"]
y = df["Code"].shift(periods=-1, axis=0)
And finally, I use df.apply() method to run the function:
df["R1SB"] = df.apply(func, axis=0)
and I get None values in my new column "R1SB"
My correct output would be:
"Code" "R1SB"
0 d7e None
1 8e0d 2
2 ft1 0
3 176 1
4 trk 0
5 tr71 2
Thank you for your help!
TRY:
df['R1SB'] = df.assign(temp=df.Code.shift(1)).apply(
lambda x: np.NAN
if pd.isna(x['temp'])
else sum(i in str(x['temp']) for i in str(x['Code'])),
1,
)
OUTPUT:
Code R1SB
0 d7e NaN
1 8e0d 2.0
2 ft1 0.0
3 176 1.0
4 trk 0.0
5 tr71 2.0

filter on pandas array

I'm doing this kind of code to find if a value belongs to the array a inside a dataframe:
Solution 1
df = pd.DataFrame([{'a':[1,2,3], 'b':4},{'a':[5,6], 'b':7},])
df = df.explode('a')
df[df['a'] == 1]
will give the output:
a b
0 1 4
Problem
This can go worst if there are repetitions:
df = pd.DataFrame([{'a':[1,2,1,3], 'b':4},{'a':[5,6], 'b':7},])
df = df.explode('a')
df[df['a'] == 1]
will give the output:
a b
0 1 4
0 1 4
Solution 2
Another solution could go like:
df = pd.DataFrame([{'a':[1,2,1,3], 'b':4},{'a':[5,6], 'b':7},])
df = df[df['a'].map(lambda row: 1 in row)]
Problem
That Lambda can't go fast if the Dataframe is Big.
Question
As a first goal, I want all the lines where the value 1 belongs to a:
without using Python, since it is slow
with high performance
avoiding memory issues
...
So I'm trying to understand what may I do with the arrays inside Pandas. Is there some documentation on how to use this type efficiently?
IIUC, you are trying to do:
df[df['a'].eq(1).groupby(level=0).transform('any')
Output:
a b
0 1 4
0 2 4
0 3 4
Nothing is wrong. This is normal behavior of pandas.explode().
To check whether a value belongs to values in a you may use this:
if x in df.a.explode()
where x is what you test for.
I think you can convert arrays to scalars with DataFrame constructor and then test value with DataFrame.eq and DataFrame.any:
df = df[pd.DataFrame(df['a'].tolist()).eq(1).any(axis=1)]
print (df)
a b
0 [1, 2, 1, 3] 4
Details:
print (pd.DataFrame(df['a'].tolist()))
0 1 2 3
0 1 2 1.0 3.0
1 5 6 NaN NaN
print (pd.DataFrame(df['a'].tolist()).eq(1))
0 1 2 3
0 True False True False
1 False False False False
So I'm trying to understand what may I do with the arrays inside Pandas. Is there some documentation on how to use this type efficiently?
I think working with lists in pandas is not good idea.

How to efficiently remove duplicate rows from a DataFrame

I'm dealing with a very large Data Frame and I'm using pandas to do the analysis.
The data frame is structured as follows
import pandas as pd
df = pd.read_csv("data.csv")
df.head()
Source Target Weight
0 0 25846 1
1 0 1916 1
2 25846 0 1
3 0 4748 1
4 0 16856 1
The issue is that I want to remove all the "duplicates". In the sense that if I already have a row that contains a Source and a Target I do not want this information to be repeated on another row.
For instance, rows number 0 and 2 are "duplicate" in this sense and only one of them should be retained.
A simple way to get rid of all the "duplicates" is
for index, row in df.iterrows():
df = df[~((df.Source==row.Target)&(df.Target==row.Source))]
However, this approach is horribly slow since my data frame has about 3 million rows. Do you think there's a better way of doing this?
Create two temp columns to save minimum(df.Source, df.Target) and maximum(df.Source, df.Target), and then check duplicated rows by duplicated() method:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(0, 5, (20, 2)), columns=["Source", "Target"])
df["T1"] = np.minimum(df.Source, df.Target)
df["T2"] = np.maximum(df.Source, df.Target)
df[~df[["T1", "T2"]].duplicated()]
No need (as usual) to use a loop with a dataframe. Use the Series.isin method:
So start with this:
df = pandas.DataFrame({
'src': [0, 0, 25, 0, 0],
'tgt': [25, 12, 0, 85, 363]
})
print(df)
src tgt
0 0 25
1 0 12
2 25 0
3 0 85
4 0 363
Then select all of the where src is not in tgt:
df[~(df['src'].isin(df['tgt']) & df['tgt'].isin(df['src']))]
src tgt
1 0 12
3 0 85
4 0 363
Your Source and Targets appear to be mutually exclusive (i.e. you can have one, but not both). Why not add them together (e.g. 25846 + 0) to get the unique identifier. You can then delete the unneeded Target column (reducing memory), and then drop duplicates. In the event your weights are not the same, it will take the first one by default.
df.Source += df.Target
df.drop('Target', axis=1, inplace=True)
df.drop_duplicates(inplace=True)
>>> df
Source Weight
0 25846 1
1 1916 1
3 4748 1
4 16856 1

selecting data from pandas panel with MultiIndex

I have a DataFrame with MultiIndex, for example:
In [1]: arrays = [['one','one','one','two','two','two'],[1,2,3,1,2,3]]
In [2]: df = DataFrame(randn(6,2),index=MultiIndex.from_tuples(zip(*arrays)),columns=['A','B'])
In [3]: df
Out [3]:
A B
one 1 -2.028736 -0.466668
2 -1.877478 0.179211
3 0.886038 0.679528
two 1 1.101735 0.169177
2 0.756676 -1.043739
3 1.189944 1.342415
Now I want to compute the means of elements 2 and 3 (index level 1) for each row (index level 0) and each column. So I need a DataFrame which would look like
A B
one 1 mean(df['A'].ix['one'][1:3]) mean(df['B'].ix['one'][1:3])
two 1 mean(df['A'].ix['two'][1:3]) mean(df['B'].ix['two'][1:3])
How do I do that without using loops over rows (index level 0) of the original data frame? What if I want to do the same for a Panel? There must be a simple solution with groupby, but I'm still learning it and can't think of an answer.
You can use the xs function to select on levels.
Starting with:
A B
one 1 -2.712137 -0.131805
2 -0.390227 -1.333230
3 0.047128 0.438284
two 1 0.055254 -1.434262
2 2.392265 -1.474072
3 -1.058256 -0.572943
You can then create a new dataframe using:
DataFrame({'one':df.xs('one',level=0)[1:3].apply(np.mean), 'two':df.xs('two',level=0)[1:3].apply(np.mean)}).transpose()
which gives the result:
A B
one -0.171549 -0.447473
two 0.667005 -1.023508
To do the same without specifying the items in the level, you can use groupby:
grouped = df.groupby(level=0)
d = {}
for g in grouped:
d[g[0]] = g[1][1:3].apply(np.mean)
DataFrame(d).transpose()
I'm not sure about panels - it's not as well documented, but something similar should be possible
I know this is an old question, but for reference who searches and finds this page, the easier solution I think is the level keyword in mean:
In [4]: arrays = [['one','one','one','two','two','two'],[1,2,3,1,2,3]]
In [5]: df = pd.DataFrame(np.random.randn(6,2),index=pd.MultiIndex.from_tuples(z
ip(*arrays)),columns=['A','B'])
In [6]: df
Out[6]:
A B
one 1 -0.472890 2.297778
2 -2.002773 -0.114489
3 -1.337794 -1.464213
two 1 1.964838 -0.623666
2 0.838388 0.229361
3 1.735198 0.170260
In [7]: df.mean(level=0)
Out[7]:
A B
one -1.271152 0.239692
two 1.512808 -0.074682
In this case it means that level 0 is kept over axis 0 (the rows, default value for mean)
Do the following:
# Specify the indices you want to work with.
idxs = [("one", elem) for elem in [2,3]] + [("two", elem) for elem in [2,3]]
# Compute grouped mean over only those indices.
df.ix[idxs].mean(level=0)