efficiently setting values on a subset of rows - pandas

I am wondering about the best way to change values in a subset of rows in a dataframe.
Let's say I want to double the values in column value in rows where selected is true.
In [1]: import pandas as pd
In [2]: df = pd.DataFrame({'value': [1, 2, 3, 4], 'selected': [False, False, True, True]})
In [3]: df
Out[3]:
selected value
0 False 1
1 False 2
2 True 3
3 True 4
There are several ways to do this:
# 1. Subsetting with .loc on left and right hand side:
df.loc[df['selected'], 'value'] = df.loc[df['selected'], 'value'] * 2
# 2. Subsetting with .loc on left hand side:
df.loc[df['selected'], 'value'] = df['value'] * 2
# 3. Using where()
df['value'] = (df['value'] * 2).where(df['selected'], df['value'])
If I only subset on the left hand side (option 2), would Pandas actually make the calculation for all rows and then discard the result for all but the selected rows?
In terms of evaluation, is there any difference between using loc and where?

Your #2 option is the most standard and recommended way to do this. Your #1 option is fine also, but the extra code is unnecessary because ix/loc/iloc are designed to pass the boolean selection through and do the necessary alignment to make sure it applies only to your desired subset.
# 2. Subsetting with .loc on left hand side:
df.loc[df['selected'], 'value'] = df['value'] * 2
If you don't use ix/loc/iloc on the left hand side, problems can arise that we don't want to get into in a simple answer. Hence, using ix/loc/iloc is generally the safest and most recommened way to go. There is nothing wrong with your option #3, but it is the least readable of the three.
One faster and acceptable alternative you should know about is numpy's where() function:
df['value'] = np.where( df['selected'], df['value'] * 2, df['value'] )
The first argument is the selection or mask, the second is the value to assign if True, and third is the value to assign if false. It's especially useful if you want to also create or change the value if the selection is False.

Related

How to apply a function on a column of a pandas dataframe? [duplicate]

I have a pandas dataframe with two columns. I need to change the values of the first column without affecting the second one and get back the whole dataframe with just first column values changed. How can I do that using apply() in pandas?
Given a sample dataframe df as:
a b
0 1 2
1 2 3
2 3 4
3 4 5
what you want is:
df['a'] = df['a'].apply(lambda x: x + 1)
that returns:
a b
0 2 2
1 3 3
2 4 4
3 5 5
For a single column better to use map(), like this:
df = pd.DataFrame([{'a': 15, 'b': 15, 'c': 5}, {'a': 20, 'b': 10, 'c': 7}, {'a': 25, 'b': 30, 'c': 9}])
a b c
0 15 15 5
1 20 10 7
2 25 30 9
df['a'] = df['a'].map(lambda a: a / 2.)
a b c
0 7.5 15 5
1 10.0 10 7
2 12.5 30 9
Given the following dataframe df and the function complex_function,
import pandas as pd
def complex_function(x, y=0):
if x > 5 and x > y:
return 1
else:
return 2
df = pd.DataFrame(data={'col1': [1, 4, 6, 2, 7], 'col2': [6, 7, 1, 2, 8]})
col1 col2
0 1 6
1 4 7
2 6 1
3 2 2
4 7 8
there are several solutions to use apply() on only one column. In the following I will explain them in detail.
I. Simple solution
The straightforward solution is the one from #Fabio Lamanna:
df['col1'] = df['col1'].apply(complex_function)
Output:
col1 col2
0 2 6
1 2 7
2 1 1
3 2 2
4 1 8
Only the first column is modified, the second column is unchanged. The solution is beautiful. It is just one line of code and it reads almost like english: "Take 'col1' and apply the function complex_function to it."
However, if you need data from another column, e.g. 'col2', it won't work. If you want to pass the values of 'col2' to variable y of the complex_function, you need something else.
II. Solution using the whole dataframe
Alternatively, you could use the whole dataframe as described in this SO post or this one:
df['col1'] = df.apply(lambda x: complex_function(x['col1']), axis=1)
or if you prefer (like me) a solution without a lambda function:
def apply_complex_function(x):
return complex_function(x['col1'])
df['col1'] = df.apply(apply_complex_function, axis=1)
There is a lot going on in this solution that needs to be explained. The apply() function works on pd.Series and pd.DataFrame. But you cannot use df['col1'] = df.apply(complex_function).loc[:, 'col1'], because it would throw a ValueError.
Hence, you need to give the information which column to use. To complicate things, the apply() function does only accept callables. To solve this, you need to define a (lambda) function with the column x['col1'] as argument; i.e. we wrap the column information in another function.
Unfortunately, the default value of the axis parameter is zero (axis=0), which means it will try executing column-wise and not row-wise. This wasn't a problem in the first solution, because we gave apply() a pd.Series. But now the input is a dataframe and we must be explicit (axis=1). (I marvel how often I forget this.)
Whether you prefer the version with the lambda function or without is subjective. In my opinion the line of code is complicated enough to read even without a lambda function thrown in. You only need the (lambda) function as a wrapper. It is just boilerplate code. A reader should not be bothered with it.
Now, you can modify this solution easily to take the second column into account:
def apply_complex_function(x):
return complex_function(x['col1'], x['col2'])
df['col1'] = df.apply(apply_complex_function, axis=1)
Output:
col1 col2
0 2 6
1 2 7
2 1 1
3 2 2
4 2 8
At index 4 the value has changed from 1 to 2, because the first condition 7 > 5 is true but the second condition 7 > 8 is false.
Note that you only needed to change the first line of code (i.e. the function) and not the second line.
Side note
Never put the column information into your function.
def bad_idea(x):
return x['col1'] ** 2
By doing this, you make a general function dependent on a column name! This is a bad idea, because the next time you want to use this function, you cannot. Worse: Maybe you rename a column in a different dataframe just to make it work with your existing function. (Been there, done that. It is a slippery slope!)
III. Alternative solutions without using apply()
Although the OP specifically asked for a solution with apply(), alternative solutions were suggested. For example, the answer of #George Petrov suggested to use map(); the answer of #Thibaut Dubernet proposed assign().
I fully agree that apply() is seldom the best solution, because apply() is not vectorized. It is an element-wise operation with expensive function calling and overhead from pd.Series.
One reason to use apply() is that you want to use an existing function and performance is not an issue. Or your function is so complex that no vectorized version exists.
Another reason to use apply() is in combination with groupby(). Please note that DataFrame.apply() and GroupBy.apply() are different functions.
So it does make sense to consider some alternatives:
map() only works on pd.Series, but accepts dict and pd.Series as input. Using map() with a function is almost interchangeable with using apply(). It can be faster than apply(). See this SO post for more details.
df['col1'] = df['col1'].map(complex_function)
applymap() is almost identical for dataframes. It does not support pd.Series and it will always return a dataframe. However, it can be faster. The documentation states: "In the current implementation applymap calls func twice on the first column/row to decide whether it can take a fast or slow code path.". But if performance really counts you should seek an alternative route.
df['col1'] = df.applymap(complex_function).loc[:, 'col1']
assign() is not a feasible replacement for apply(). It has a similar behaviour in only the most basic use cases. It does not work with the complex_function. You still need apply() as you can see in the example below. The main use case for assign() is method chaining, because it gives back the dataframe without changing the original dataframe.
df['col1'] = df.assign(col1=df.col1.apply(complex_function))
Annex: How to speed up apply()?
I only mention it here because it was suggested by other answers, e.g. #durjoy. The list is not exhaustive:
Do not use apply(). This is no joke. For most numeric operations, a vectorized method exists in pandas. If/else blocks can often be refactored with a combination of boolean indexing and .loc. My example complex_function could be refactored in this way.
Refactor to Cython. If you have a complex equation and the parameters of the equation are in your dataframe, this might be a good idea. Check out the official pandas user guide for more information.
Use raw=True parameter. Theoretically, this should improve the performance of apply() if you are just applying a NumPy reduction function, because the overhead of pd.Series is removed. Of course, your function has to accept an ndarray. You have to refactor your function to NumPy. By doing this, you will have a huge performance boost.
Use 3rd party packages. The first thing you should try is Numba. I do not know swifter mentioned by #durjoy; and probably many other packages are worth mentioning here.
Try/Fail/Repeat. As mentioned above, map() and applymap() can be faster - depending on the use case. Just time the different versions and choose the fastest. This approach is the most tedious one with the least performance increase.
You don't need a function at all. You can work on a whole column directly.
Example data:
>>> df = pd.DataFrame({'a': [100, 1000], 'b': [200, 2000], 'c': [300, 3000]})
>>> df
a b c
0 100 200 300
1 1000 2000 3000
Half all the values in column a:
>>> df.a = df.a / 2
>>> df
a b c
0 50 200 300
1 500 2000 3000
Although the given responses are correct, they modify the initial data frame, which is not always desirable (and, given the OP asked for examples "using apply", it might be they wanted a version that returns a new data frame, as apply does).
This is possible using assign: it is valid to assign to existing columns, as the documentation states (emphasis is mine):
Assign new columns to a DataFrame.
Returns a new object with all original columns in addition to new ones. Existing columns that are re-assigned will be overwritten.
In short:
In [1]: import pandas as pd
In [2]: df = pd.DataFrame([{'a': 15, 'b': 15, 'c': 5}, {'a': 20, 'b': 10, 'c': 7}, {'a': 25, 'b': 30, 'c': 9}])
In [3]: df.assign(a=lambda df: df.a / 2)
Out[3]:
a b c
0 7.5 15 5
1 10.0 10 7
2 12.5 30 9
In [4]: df
Out[4]:
a b c
0 15 15 5
1 20 10 7
2 25 30 9
Note that the function will be passed the whole dataframe, not only the column you want to modify, so you will need to make sure you select the right column in your lambda.
If you are really concerned about the execution speed of your apply function and you have a huge dataset to work on, you could use swifter to make faster execution, here is an example for swifter on pandas dataframe:
import pandas as pd
import swifter
def fnc(m):
return m*3+4
df = pd.DataFrame({"m": [1,2,3,4,5,6], "c": [1,1,1,1,1,1], "x":[5,3,6,2,6,1]})
# apply a self created function to a single column in pandas
df["y"] = df.m.swifter.apply(fnc)
This will enable your all CPU cores to compute the result hence it will be much faster than normal apply functions. Try and let me know if it become useful for you.
Let me try a complex computation using datetime and considering nulls or empty spaces. I am reducing 30 years on a datetime column and using apply method as well as lambda and converting datetime format. Line if x != '' else x will take care of all empty spaces or nulls accordingly.
df['Date'] = df['Date'].fillna('')
df['Date'] = df['Date'].apply(lambda x : ((datetime.datetime.strptime(str(x), '%m/%d/%Y') - datetime.timedelta(days=30*365)).strftime('%Y%m%d')) if x != '' else x)
Make a copy of your dataframe first if you need to modify a column
Many answers here suggest modifying some column and assign the new values to the old column. It is common to get the SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. warning. This happens when your dataframe was created from another dataframe but is not a proper copy.
To silence this warning, make a copy and assign back.
df = df.copy()
df['a'] = df['a'].apply('add', other=1)
apply() only needs the name of the function
You can invoke a function by simply passing its name to apply() (no need for lambda). If your function needs additional arguments, you can pass them either as keyword arguments or pass the positional arguments as args=. For example, suppose you have file paths in your dataframe and you need to read files in these paths.
def read_data(path, sep=',', usecols=[0]):
return pd.read_csv(path, sep=sep, usecols=usecols)
df = pd.DataFrame({'paths': ['../x/yz.txt', '../u/vw.txt']})
df['paths'].apply(read_data) # you don't need lambda
df['paths'].apply(read_data, args=(',', [0, 1])) # pass the positional arguments to `args=`
df['paths'].apply(read_data, sep=',', usecols=[0, 1]) # pass as keyword arguments
Don't apply a function, call the appropriate method directly
It's almost never ideal to apply a custom function on a column via apply(). Because apply() is a syntactic sugar for a Python loop with a pandas overhead, it's often slower than calling the same function in a list comprehension, never mind, calling optimized pandas methods. Almost all numeric operators can be directly applied on the column and there are corresponding methods for all of them.
# add 1 to every element in column `a`
df['a'] += 1
# for every row, subtract column `a` value from column `b` value
df['c'] = df['b'] - df['a']
If you want to apply a function that has if-else blocks, then you should probably be using numpy.where() or numpy.select() instead. It is much, much faster. If you have anything larger than 10k rows of data, you'll notice the difference right away.
For example, if you have a custom function similar to func() below, then instead of applying it on the column, you could operate directly on the columns and return values using numpy.select().
def func(row):
if row == 'a':
return 1
elif row == 'b':
return 2
else:
return -999
# instead of applying a `func` to each row of a column, use `numpy.select` as below
import numpy as np
conditions = [df['col'] == 'a', df['col'] == 'b']
choices = [1, 2]
df['new'] = np.select(conditions, choices, default=-999)
As you can see, numpy.select() has very minimal syntax difference from an if-else ladder; only need to separate conditions and choices into separate lists. For other options, check out this answer.

Error in using Pandas groupby.apply to drop duplication

I have a Pandas data frame which has some duplicate values, not rows. I want to use groupby.apply to remove the duplication. An example is as follows.
df = pd.DataFrame([['a', 1, 1], ['a', 1, 2], ['b', 1, 1]], columns=['A', 'B', 'C'])
A B C
0 a 1 1
1 a 1 2
2 b 1 1
# My function
def get_uniq_t(df):
if df.shape[0] > 1:
df['D'] = df.C * 10 + df.B
df = df[df.D == df.D.max()].drop(columns='D')
return df
df = df.groupby('A').apply(get_uniq_t)
Then I get the following value error message. The issue seems to do with creating the new column D. If I create the column D outside the function, the code seems running fine. Can someone help explain what caused the value error message?
ValueError: Shape of passed values is (3, 3), indices imply (2, 3)
The problem with your code is that it attempts to modify
the original group.
Other problem is that this function should return a single row
not a DataFrame.
Change your function to:
def get_uniq_t(df):
iMax = (df.C * 10 + df.B).idxmax()
return df.loc[iMax]
Then its application returns:
A B C
A
a a 1 2
b b 1 1
Edit following the comment
In my opinion, it is not allowed to modify the original group,
as it would indirectly modify the original DataFrame.
At least it displays a warning about this and is considered a bad practice.
Search the Web for SettingWithCopyWarning for more extensive description.
My code (get_uniq_t function) does not modify the original group.
It only returns one row from the current group.
The returned row is selected based on which row returns the greatest value
of df.C * 10 + df.B. So when you apply this function, the result is a new
DataFrame, with consecutive rows equal to results of this function
for consecutive groups.
You can perform an operation equivalent to modification, when you
create some new content, e.g. as the result of groupby instruction
and then save it under the same variable which so far held the source
DataFrame.

comparing two DataFrames, specific questions

I was read Andy's answer to the question Outputting difference in two Pandas dataframes side by side - highlighting the difference
i have two questions regarding the code, unfortunately I dont yet have 50 rep to comment on the answer so I hope i could get some help here.
what does In [24]: changed = ne_stacked[ne_stacked] do?
I'm not sure what df1 = df[df] do and i cant seem to get an answer from pandas doc, could someone explain this to me please?
is np.where(df1 != df2) the same as pd.df.where(df1 != df2). If no, what is the difference?
Question 1
ne_stacked is a pd.Series that consists of True and False values that indicate where df1 and df2 are not equal.
ne_stacked[boolean_array] is a way to filter the series ne_stacked by eliminating the rows of ne_stacked where boolean_array is False and keeping the rows of ne_stacked where boolean_array is True.
It so happens that ne_stacked is also a boolean array and so can be used to filter itself. Why would be want to do this? So we can see what the values of the index are after we've filtered.
So ne_stacked[ne_stacked] is a subset of ne_stacked with only True values.
Question 2
np.where
np.where does two things, if you only pass a conditional like in np.where(df1 != df2), you get a tuple of arrays where the first is a reference of all row indices to be used in conjunction with the second element of the tuple that is a reference to all column indices. I usually use it like this
i, j = np.where(df1 != df2)
Now I can get at all elements of df1 or df2 in which there are differences like
df.values[i, j]
Or I can assign to those cells
df.values[i, j] = -99
Or lots of other useful things.
You can also use np.where as an if, then, else for arrays
np.where(df1 != df2, -99, 99)
To produce an array the same size as df1 or df2 where you have -99 in all the places where df1 != df2 and 99 in the rest.
df.where
On the other hand df.where evaluates the first argument of boolean values and returns an object of equal size to df where the cells that evaluated to True are kept and the rest are either np.nan or the values passed in the second argument of df.where
df1.where(df1 != df2)
Or
df1.where(df1 != df2, -99)
are they the same?
Clearly they are not the "same". But you can use them similarly
np.where(df1 != df2, df1, -99)
Should be the same as
df1.where(df1 != df2, -99).values

Pandas fill cells in a column with NaN values, derive the value from other cells in the row

I have a dataframe:
a b c
0 1 2 3
1 1 1 1
2 3 7 NaN
3 2 3 5
...
I want to fill column "three" inplace (update the values) where the values are NaN using a machine learning algorithm.
I don't know how to do it inplace. Sample code:
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
df=pd.DataFrame([range(3), [1, 5, np.NaN], [2, 2, np.NaN], [4,5,9], [2,5,7]],columns=['a','b','c'])
x=[]
y=[]
for row in df.iterrows():
index,data = row
if(not pd.isnull(data['c'])):
x.append(data[['a','b']].tolist())
y.append(data['c'])
model = LinearRegression()
model.fit(x,y)
#this line does not do it in place.
df[~df.c.notnull()].assign(c = lambda x:model.predict(x[['a','b']]))
But this gives me a copy of the dataframe. Only option I have left is using a for loop however, I don't want to do that. I think there should be more pythonic way of doing it using pandas. Can someone please help? Or is there any other way of doing this?
You'll have to do something like :
df.loc[pd.isnull(df['three']), 'three'] = _result of model_
This modifies directly dataframe df
This way you first filter the dataframe to keep the slice you want to modify (pd.isnull(df['three'])), then from that slice you select the column you want to modify (three).
On the right hand side of the equal, it expects to get an array / list / series with the same number of lines than the filtered dataframe ( in your example, one line)
You may have to adjust depending on what your model returns exactly
EDIT
You probably need to do stg like this
pred = model.predict(df[['a', 'b']])
df['pred'] = model.predict(df[['a', 'b']])
df.loc[pd.isnull(df['c']), 'c'] = df.loc[pd.isnull(df['c']), 'pred']
Note that a significant part of the issue comes from the way you are using scikit learn in your example. You need to pass the whole dataset to the model when you predict.
The simplest way is yo transpose first, then forward fill/backward fill at your convenience.
df.T.ffill().bfill().T

Looking for built-in, invertible, list-of-list-accepting constructor/deconstructor pair for pandas dataframes

Are there built-in ways to construct/deconstruct a dataframe from/to a Python list-of-Python-lists?
As far as the constructor (let's call it make_df for now) that I'm looking for goes, I want to be able to write the initialization of a dataframe from literal values, including columns of arbitrary types, in an easily-readable form, like this:
df = make_df([[9.75, 1],
[6.375, 2],
[9., 3],
[0.25, 1],
[1.875, 2],
[3.75, 3],
[8.625, 1]],
['d', 'i'])
For the deconstructor, I want to essentially recover from a dataframe df the arguments one would need to pass to such make_df to re-create df.
AFAIK,
officially at least, the pandas.DataFrame constructor accepts only a numpy ndarray, a dict, or another DataFrame (and not a simple Python list-of-lists) as its first argument;
the pandas.DataFrame.values property does not preserve the original data types.
I can roll my own functions to do this (e.g., see below), but I would prefer to stick to built-in methods, if available. (The Pandas API is pretty big, and some of its names not what I would expect, so it is quite possible that I have missed one or both of these functions.)
FWIW, below is a hand-rolled version of what I described above, minimally tested. (I doubt that it would be able to handle every possible corner-case.)
import pandas as pd
import collections as co
import pandas.util.testing as pdt
def make_df(values, columns):
return pd.DataFrame(co.OrderedDict([(columns[i],
[row[i] for row in values])
for i in range(len(columns))]))
def unmake_df(dataframe):
columns = list(dataframe.columns)
return ([[dataframe[c][i] for c in columns] for i in dataframe.index],
columns)
values = [[9.75, 1],
[6.375, 2],
[9., 3],
[0.25, 1],
[1.875, 2],
[3.75, 3],
[8.625, 1]]
columns = ['d', 'i']
df = make_df(values, columns)
Here's what the output of the call to make_df above produced:
>>> df
d i
0 9.750 1
1 6.375 2
2 9.000 3
3 0.250 1
4 1.875 2
5 3.750 3
6 8.625 1
A simple check of the round-trip1:
>>> df == make_df(*unmake_df(df))
True
>>> (values, columns) == unmake_df(make_df(*(values, columns)))
True
BTW, this is an example of the loss of the original values' types:
>>> df.values
array([[ 9.75 , 1. ],
[ 6.375, 2. ],
[ 9. , 3. ],
[ 0.25 , 1. ],
[ 1.875, 2. ],
[ 3.75 , 3. ],
[ 8.625, 1. ]])
Notice how the values in the second column are no longer integers, as they were originally.
Hence,
>>> df == make_df(df.values, columns)
False
1 In order to be able to use == to test for equality between dataframes above, I resorted to a little monkey-patching:
def pd_DataFrame___eq__(self, other):
try:
pdt.assert_frame_equal(self, other,
check_index_type=True,
check_column_type=True,
check_frame_type=True)
except:
return False
else:
return True
pd.DataFrame.__eq__ = pd_DataFrame___eq__
Without this hack, expressions of the form dataframe_0 == dataframe_1 would have evaluated to dataframe objects, not simple boolean values.
I'm not sure what documentation you are reading, because the link you give explicitly says that the default constructor accepts other list-like objects (one of which is a list of lists).
In [6]: pandas.DataFrame([['a', 1], ['b', 2]])
Out[6]:
0 1
0 a 1
1 b 2
[2 rows x 2 columns]
In [7]: t = pandas.DataFrame([['a', 1], ['b', 2]])
In [8]: t.to_dict()
Out[8]: {0: {0: 'a', 1: 'b'}, 1: {0: 1, 1: 2}}
Notice that I use to_dict at the end, rather than trying to get back the original list of lists. This is because it is an ill-posed problem to get the list arguments back (unless you make an overkill decorator or something to actually store the ordered arguments that the constructor was called with).
The reason is that a pandas DataFrame, by default, is not an ordered data structure, at least in the column dimension. You could have permuted the order of the column data at construction time, and you would get the "same" DataFrame.
Since there can be many differing notions of equality between two DataFrame (e.g. same columns even including type, or just same named columns, or some columns and in same order, or just same columns in mixed order, etc.) -- pandas defaults to trying to be the least specific about it (Python's principle of least astonishment).
So it would not be good design for the default or built-in constructors to choose an overly specific idea of equality for the purposes of returning the DataFrame back down to its arguments.
For that reason, using to_dict is better since the resulting keys will encode the column information, and you can choose to check for column types or ordering however you want to for your own application. You can even discard the keys by iterating the dict and simply pumping the contents into a list of lists if you really want to.
In other words, because order might not matter among the columns, the "inverse" of the list-of-list constructor maps backwards into a bigger set, namely all the permutations of the same column data. So the inverse you're looking for is not well-defined without assuming more structure -- and casual users of a DataFrame might not want or need to make those extra assumptions to get the invertibility.
As mentioned elsewhere, you should use DataFrame.equals to do equality checking among DataFrames. The function has many options that allow you specify the specific kind of equality testing that makes sense for your application, while leaving the default version as a reasonably generic set of options.