pandas syntax examples confusion - pandas

I am confused by some of the examples I see for pandas. For example this is shortened from a post I recently read:
df[df.duplicated()|df()]
What I don't understand is why df needs to be on the outside: df[df.duplicated()]
vs just using df.duplicated(). In the documentation I have not yet seen the first example, everything is presented in the format df.something_doing(). But I see many examples such as df[df.something_doing()] and I don't understand what the df on the outside does.

df.duplicated() returns the boolean values. They provide a mask with True if the condition mentioned is satisfied, False otherwise.
If you want a slice of the dataframe based on the boolean mask, you need:
df[df.duplicated()]
Another simple example, consider this dataframe
col1 id
0 1 a
1 0 a
2 1 a
3 1 b
If you only want the columns where 'id' is 'a',
df.id == 'a'
would give you boolean mask but
df[df.id == 'a']
would return the dataframe
col1 id
0 1 a
1 0 a
2 1 a

Related

How to apply a function on a column of a pandas dataframe? [duplicate]

I have a pandas dataframe with two columns. I need to change the values of the first column without affecting the second one and get back the whole dataframe with just first column values changed. How can I do that using apply() in pandas?
Given a sample dataframe df as:
a b
0 1 2
1 2 3
2 3 4
3 4 5
what you want is:
df['a'] = df['a'].apply(lambda x: x + 1)
that returns:
a b
0 2 2
1 3 3
2 4 4
3 5 5
For a single column better to use map(), like this:
df = pd.DataFrame([{'a': 15, 'b': 15, 'c': 5}, {'a': 20, 'b': 10, 'c': 7}, {'a': 25, 'b': 30, 'c': 9}])
a b c
0 15 15 5
1 20 10 7
2 25 30 9
df['a'] = df['a'].map(lambda a: a / 2.)
a b c
0 7.5 15 5
1 10.0 10 7
2 12.5 30 9
Given the following dataframe df and the function complex_function,
import pandas as pd
def complex_function(x, y=0):
if x > 5 and x > y:
return 1
else:
return 2
df = pd.DataFrame(data={'col1': [1, 4, 6, 2, 7], 'col2': [6, 7, 1, 2, 8]})
col1 col2
0 1 6
1 4 7
2 6 1
3 2 2
4 7 8
there are several solutions to use apply() on only one column. In the following I will explain them in detail.
I. Simple solution
The straightforward solution is the one from #Fabio Lamanna:
df['col1'] = df['col1'].apply(complex_function)
Output:
col1 col2
0 2 6
1 2 7
2 1 1
3 2 2
4 1 8
Only the first column is modified, the second column is unchanged. The solution is beautiful. It is just one line of code and it reads almost like english: "Take 'col1' and apply the function complex_function to it."
However, if you need data from another column, e.g. 'col2', it won't work. If you want to pass the values of 'col2' to variable y of the complex_function, you need something else.
II. Solution using the whole dataframe
Alternatively, you could use the whole dataframe as described in this SO post or this one:
df['col1'] = df.apply(lambda x: complex_function(x['col1']), axis=1)
or if you prefer (like me) a solution without a lambda function:
def apply_complex_function(x):
return complex_function(x['col1'])
df['col1'] = df.apply(apply_complex_function, axis=1)
There is a lot going on in this solution that needs to be explained. The apply() function works on pd.Series and pd.DataFrame. But you cannot use df['col1'] = df.apply(complex_function).loc[:, 'col1'], because it would throw a ValueError.
Hence, you need to give the information which column to use. To complicate things, the apply() function does only accept callables. To solve this, you need to define a (lambda) function with the column x['col1'] as argument; i.e. we wrap the column information in another function.
Unfortunately, the default value of the axis parameter is zero (axis=0), which means it will try executing column-wise and not row-wise. This wasn't a problem in the first solution, because we gave apply() a pd.Series. But now the input is a dataframe and we must be explicit (axis=1). (I marvel how often I forget this.)
Whether you prefer the version with the lambda function or without is subjective. In my opinion the line of code is complicated enough to read even without a lambda function thrown in. You only need the (lambda) function as a wrapper. It is just boilerplate code. A reader should not be bothered with it.
Now, you can modify this solution easily to take the second column into account:
def apply_complex_function(x):
return complex_function(x['col1'], x['col2'])
df['col1'] = df.apply(apply_complex_function, axis=1)
Output:
col1 col2
0 2 6
1 2 7
2 1 1
3 2 2
4 2 8
At index 4 the value has changed from 1 to 2, because the first condition 7 > 5 is true but the second condition 7 > 8 is false.
Note that you only needed to change the first line of code (i.e. the function) and not the second line.
Side note
Never put the column information into your function.
def bad_idea(x):
return x['col1'] ** 2
By doing this, you make a general function dependent on a column name! This is a bad idea, because the next time you want to use this function, you cannot. Worse: Maybe you rename a column in a different dataframe just to make it work with your existing function. (Been there, done that. It is a slippery slope!)
III. Alternative solutions without using apply()
Although the OP specifically asked for a solution with apply(), alternative solutions were suggested. For example, the answer of #George Petrov suggested to use map(); the answer of #Thibaut Dubernet proposed assign().
I fully agree that apply() is seldom the best solution, because apply() is not vectorized. It is an element-wise operation with expensive function calling and overhead from pd.Series.
One reason to use apply() is that you want to use an existing function and performance is not an issue. Or your function is so complex that no vectorized version exists.
Another reason to use apply() is in combination with groupby(). Please note that DataFrame.apply() and GroupBy.apply() are different functions.
So it does make sense to consider some alternatives:
map() only works on pd.Series, but accepts dict and pd.Series as input. Using map() with a function is almost interchangeable with using apply(). It can be faster than apply(). See this SO post for more details.
df['col1'] = df['col1'].map(complex_function)
applymap() is almost identical for dataframes. It does not support pd.Series and it will always return a dataframe. However, it can be faster. The documentation states: "In the current implementation applymap calls func twice on the first column/row to decide whether it can take a fast or slow code path.". But if performance really counts you should seek an alternative route.
df['col1'] = df.applymap(complex_function).loc[:, 'col1']
assign() is not a feasible replacement for apply(). It has a similar behaviour in only the most basic use cases. It does not work with the complex_function. You still need apply() as you can see in the example below. The main use case for assign() is method chaining, because it gives back the dataframe without changing the original dataframe.
df['col1'] = df.assign(col1=df.col1.apply(complex_function))
Annex: How to speed up apply()?
I only mention it here because it was suggested by other answers, e.g. #durjoy. The list is not exhaustive:
Do not use apply(). This is no joke. For most numeric operations, a vectorized method exists in pandas. If/else blocks can often be refactored with a combination of boolean indexing and .loc. My example complex_function could be refactored in this way.
Refactor to Cython. If you have a complex equation and the parameters of the equation are in your dataframe, this might be a good idea. Check out the official pandas user guide for more information.
Use raw=True parameter. Theoretically, this should improve the performance of apply() if you are just applying a NumPy reduction function, because the overhead of pd.Series is removed. Of course, your function has to accept an ndarray. You have to refactor your function to NumPy. By doing this, you will have a huge performance boost.
Use 3rd party packages. The first thing you should try is Numba. I do not know swifter mentioned by #durjoy; and probably many other packages are worth mentioning here.
Try/Fail/Repeat. As mentioned above, map() and applymap() can be faster - depending on the use case. Just time the different versions and choose the fastest. This approach is the most tedious one with the least performance increase.
You don't need a function at all. You can work on a whole column directly.
Example data:
>>> df = pd.DataFrame({'a': [100, 1000], 'b': [200, 2000], 'c': [300, 3000]})
>>> df
a b c
0 100 200 300
1 1000 2000 3000
Half all the values in column a:
>>> df.a = df.a / 2
>>> df
a b c
0 50 200 300
1 500 2000 3000
Although the given responses are correct, they modify the initial data frame, which is not always desirable (and, given the OP asked for examples "using apply", it might be they wanted a version that returns a new data frame, as apply does).
This is possible using assign: it is valid to assign to existing columns, as the documentation states (emphasis is mine):
Assign new columns to a DataFrame.
Returns a new object with all original columns in addition to new ones. Existing columns that are re-assigned will be overwritten.
In short:
In [1]: import pandas as pd
In [2]: df = pd.DataFrame([{'a': 15, 'b': 15, 'c': 5}, {'a': 20, 'b': 10, 'c': 7}, {'a': 25, 'b': 30, 'c': 9}])
In [3]: df.assign(a=lambda df: df.a / 2)
Out[3]:
a b c
0 7.5 15 5
1 10.0 10 7
2 12.5 30 9
In [4]: df
Out[4]:
a b c
0 15 15 5
1 20 10 7
2 25 30 9
Note that the function will be passed the whole dataframe, not only the column you want to modify, so you will need to make sure you select the right column in your lambda.
If you are really concerned about the execution speed of your apply function and you have a huge dataset to work on, you could use swifter to make faster execution, here is an example for swifter on pandas dataframe:
import pandas as pd
import swifter
def fnc(m):
return m*3+4
df = pd.DataFrame({"m": [1,2,3,4,5,6], "c": [1,1,1,1,1,1], "x":[5,3,6,2,6,1]})
# apply a self created function to a single column in pandas
df["y"] = df.m.swifter.apply(fnc)
This will enable your all CPU cores to compute the result hence it will be much faster than normal apply functions. Try and let me know if it become useful for you.
Let me try a complex computation using datetime and considering nulls or empty spaces. I am reducing 30 years on a datetime column and using apply method as well as lambda and converting datetime format. Line if x != '' else x will take care of all empty spaces or nulls accordingly.
df['Date'] = df['Date'].fillna('')
df['Date'] = df['Date'].apply(lambda x : ((datetime.datetime.strptime(str(x), '%m/%d/%Y') - datetime.timedelta(days=30*365)).strftime('%Y%m%d')) if x != '' else x)
Make a copy of your dataframe first if you need to modify a column
Many answers here suggest modifying some column and assign the new values to the old column. It is common to get the SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. warning. This happens when your dataframe was created from another dataframe but is not a proper copy.
To silence this warning, make a copy and assign back.
df = df.copy()
df['a'] = df['a'].apply('add', other=1)
apply() only needs the name of the function
You can invoke a function by simply passing its name to apply() (no need for lambda). If your function needs additional arguments, you can pass them either as keyword arguments or pass the positional arguments as args=. For example, suppose you have file paths in your dataframe and you need to read files in these paths.
def read_data(path, sep=',', usecols=[0]):
return pd.read_csv(path, sep=sep, usecols=usecols)
df = pd.DataFrame({'paths': ['../x/yz.txt', '../u/vw.txt']})
df['paths'].apply(read_data) # you don't need lambda
df['paths'].apply(read_data, args=(',', [0, 1])) # pass the positional arguments to `args=`
df['paths'].apply(read_data, sep=',', usecols=[0, 1]) # pass as keyword arguments
Don't apply a function, call the appropriate method directly
It's almost never ideal to apply a custom function on a column via apply(). Because apply() is a syntactic sugar for a Python loop with a pandas overhead, it's often slower than calling the same function in a list comprehension, never mind, calling optimized pandas methods. Almost all numeric operators can be directly applied on the column and there are corresponding methods for all of them.
# add 1 to every element in column `a`
df['a'] += 1
# for every row, subtract column `a` value from column `b` value
df['c'] = df['b'] - df['a']
If you want to apply a function that has if-else blocks, then you should probably be using numpy.where() or numpy.select() instead. It is much, much faster. If you have anything larger than 10k rows of data, you'll notice the difference right away.
For example, if you have a custom function similar to func() below, then instead of applying it on the column, you could operate directly on the columns and return values using numpy.select().
def func(row):
if row == 'a':
return 1
elif row == 'b':
return 2
else:
return -999
# instead of applying a `func` to each row of a column, use `numpy.select` as below
import numpy as np
conditions = [df['col'] == 'a', df['col'] == 'b']
choices = [1, 2]
df['new'] = np.select(conditions, choices, default=-999)
As you can see, numpy.select() has very minimal syntax difference from an if-else ladder; only need to separate conditions and choices into separate lists. For other options, check out this answer.

Removing values of a certain object type from a dataframe column in Pandas

I have a pandas dataframe where some values are integers and other values are an array. I simply want to drop all of the rows that contain the array (object datatype I believe) in my "ORIGIN_AIRPORT_ID" column, but I have not been able to figure out how to do so after trying many methods.
Here is what the first 20 rows of my dataframe looks like. The values that show up like a list are the ones I want to remove. The dataset is a couple million rows, so I just need to write code that removes all of the array-like values in that specific dataframe column if that makes sense.
df = df[df.origin_airport_ID.str.contains(',') == False]
You should consider next time giving us a data sample in text, instead of a figure. It's easier for us to test your example.
Original data:
ITIN_ID ORIGIN_AIRPORT_ID
0 20194146 10397
1 20194147 10397
2 20194148 10397
3 20194149 [10397, 10398, 10399, 10400]
4 20194150 10397
In your case, you can use the .to_numeric pandas function:
df['ORIGIN_AIRPORT_ID'] = pd.to_numeric(df['ORIGIN_AIRPORT_ID'], errors='coerce')
It replaces every cell that cannot be converted into a number to a NaN ( Not a Number ), so we get:
ITIN_ID ORIGIN_AIRPORT_ID
0 20194146 10397.0
1 20194147 10397.0
2 20194148 10397.0
3 20194149 NaN
4 20194150 10397.0
To remove these rows now just use .dropna
df = df.dropna().astype('int')
Which results in your desired DataFrame
ITIN_ID ORIGIN_AIRPORT_ID
0 20194146 10397
1 20194147 10397
2 20194148 10397
4 20194150 10397

Find rows in dataframe column containing questions

I have a TSV file that I loaded into a pandas dataframe to do some preprocessing and I want to find out which rows have a question in it, and output 1 or 0 in a new column. Since it is a TSV, this is how I'm loading it:
import pandas as pd
df = pd.read_csv('queries-10k-txt-backup', sep='\t')
Here's a sample of what it looks like:
QUERY FREQ
0 hindi movies for adults 595
1 are panda dogs real 383
2 asuedraw winning numbers 478
3 sentry replacement keys 608
4 rebuilding nicad battery packs 541
After dropping empty rows, duplicates, and the FREQ column(not needed for this), I wrote a simple function to check the QUERY column to see if it contains any words that make the string a question:
df_test = df.drop_duplicates()
df_test = df_test.dropna()
df_test = df_test.drop(['FREQ'], axis = 1)
def questions(row):
questions_list =
["what","when","where","which","who","whom","whose","why","why don't",
"how","how far","how long","how many","how much","how old","how come","?"]
if row['QUERY'] in questions_list:
return 1
else:
return 0
df_test['QUESTIONS'] = df_test.apply(questions, axis=1)
But once I check the new dataframe, even though it creates the new column, all the values are 0. I'm not sure if my logic is wrong in the function, I've used something similar with dataframe columns which just have one word and if it matches, it'll output a 1 or 0. However, that same logic doesn't seem to be working when the column contains a phrase/sentence like this use case. Any input is really appreciated!
If you wish to check exact matches of any substring from question_list and of a string from dataframe, you should use str.contains method:
questions_list = ["what","when","where","which","who","whom","whose","why",
"why don't", "how","how far","how long","how many",
"how much","how old","how come","?"]
pattern = "|".join(questions_list) # generate regex from your list
df_test['QUESTIONS'] = df_test['QUERY'].str.contains(pattern)
Simplified example:
df = pd.DataFrame({
'QUERY': ['how do you like it', 'what\'s going on?', 'quick brown fox'],
'ID': [0, 1, 2]})
Create a pattern:
pattern = '|'.join(['what', 'how'])
pattern
Out: 'what|how'
Use it:
df['QUERY'].str.contains(pattern)
Out[12]:
0 True
1 True
2 False
Name: QUERY, dtype: bool
If you're not familiar with regexes, there's a quick python re reference. Fot symbol '|', explanation is
A|B, where A and B can be arbitrary REs, creates a regular expression that will match either A or B. An arbitrary number of REs can be separated by the '|' in this way
IIUC, you need to find if the first word in the string in the question list, if yes return 1, else 0. In your function, rather than checking if the entire string is in question list, split the string and check if the first element is in question list.
def questions(row):
questions_list = ["are","what","when","where","which","who","whom","whose","why","why don't","how","how far","how long","how many","how much","how old","how come","?"]
if row['QUERY'].split()[0] in questions_list:
return 1
else:
return 0
df['QUESTIONS'] = df.apply(questions, axis=1)
You get
QUERY FREQ QUESTIONS
0 hindi movies for adults 595 0
1 are panda dogs real 383 1
2 asuedraw winning numbers 478 0
3 sentry replacement keys 608 0
4 rebuilding nicad battery packs 541 0

Taking second last observed row

I am new to pandas. I know how to use drop_duplicates and take the last observed row in a dataframe. Is there any way that I can use it to take only second last observed. Or any other way of doing it.
For example:
I would like to go from
df = pd.DataFrame(data={'A':[1,1,1,2,2,2],'B':[1,2,3,4,5,6]}) to
df1 = pd.DataFrame(data={'A':[1,2],'B':[2,5]})
The idea is that you'll group the data by the duplicate column , then check the length of group , if the length of group is greater than or equal 2 this mean that you can slice the second element of group , if the group has a length of one which mean that this value is not duplicated , then take index 0 which is the only element in the grouped data
df.groupby(df['A']).apply(lambda x : x.iloc[1] if len(x) >= 2 else x.iloc[0])
The first answer I think was on the right track, but possibly not quite right. I have extended your data to include 'A' groups with two observations, and an 'A' group with one observation, for the sake of completeness.
import pandas as pd
df = pd.DataFrame(data={'A':[1,1,1,2,2,2, 3, 3, 4],'B':[1,2,3,4,5,6, 7, 8, 9]})
def user_apply_func(x):
if len(x) == 2:
return x.iloc[0]
if len(x) > 2:
return x.iloc[-2]
return
df.groupby('A').apply(user_apply_func)
Out[7]:
A B
A
1 1 2
2 2 5
3 3 7
4 NaN NaN
For your reference the apply method automatically passes the data frame as the first argument.
Also, as you are always going to be reducing each group of data to a single observation you could also use the agg method (aggregate). apply is more flexible in terms of the length of the sequences that can be returned whereas agg must reduce the data to a single value.
df.groupby('A').agg(user_apply_func)

groupby on sparse matrix in pandas: filling them first

I have a pandas DataFrame df with shape (1000000,3) as follows:
id cat team
1 'cat1' A
1 'cat2' A
2 'cat3' B
3 'cat1' A
4 'cat3' B
4 'cat1' B
Then I dummify with respect to the cat column in order to get ready for a machine learning classification.
df2 = pandas.get_dummies(df,columns=['cat'], sparse=True)
But when I try to do:
df2.groupby(['id','team']).sum()
It get stuck and the computing never ends. So instead of grouping by right away, I try:
df2 = df2.fillna(0)
But it does not work and the DataFrame is still full of NaN values. Why does the fillna() function does not fill my DataFrame as it should?
In other words, how can a pandas sparse matrix I got from get_dummies be filled with 0 instead of NaN?
I also tried:
df2 = pandas.get_dummies(df,columns=['cat'], sparse=True).to_sparse(fill_value=0)
This time, df2 is well filled with 0, but when I try:
print df2.groupby(['id','sexe']).sum()
I get:
C:\Anaconda\lib\site-packages\pandas\core\groupby.pyc in loop(labels, shape)
3545 for i in range(1, nlev):
3546 stride //= shape[i]
-> 3547 out += labels[i] * stride
3548
3549 if xnull: # exclude nulls
ValueError: operands could not be broadcast together with shapes (1205800,) (306994,) (1205800,)
My solution was to do:
df2 = pandas.DataFrame(np.nan_to_num(df2.as_matrix()))
df2.groupby(['id','sexe']).sum()
And it works, but it takes a lot of memory. Can someone help me to find a better solution or at least understand why I can't fill sparse matrix with zeros easily? And why it is impossible to use groupby() then sum() on a sparse matrix?
I think your problem is due to mixing of dtypes. But you could get around it like this. First, provide only the relevant column to get_dummies() rather than the whole dataframe:
df2 = pd.get_dummies(df['cat']).to_sparse(0)
After that, you can add other variables back but everything needs to be numeric. A pandas sparse dataframe is just a wrapper on a sparse (and homogenous dtype) numpy array.
df2['id'] = df['id']
'cat1' 'cat2' 'cat3' id
0 1 0 0 1
1 0 1 0 1
2 0 0 1 2
3 1 0 0 3
4 0 0 1 4
5 1 0 0 4
For non-numeric types, you could do the following:
df2['team'] = df['team'].astype('category').cat.codes
This groupby seems to work OK:
df2.groupby('id').sum()
'cat1' 'cat2' 'cat3'
id
1 1 1 0
2 0 0 1
3 1 0 0
4 1 0 1
An additional but possibly important point for memory management is that you can often save substantial memory with categoricals rather than string objects (perhaps you are already doing this though):
df['cat2'] = df['cat'].astype('category')
df[['cat','cat2']].memory_usage()
cat 48
cat2 30
Not much savings here for the small example dataframe but could be a substantial difference in your actual dataframe.
I was tackling a similar problem before. What I did was, I applied the groupby operation before and followed it up with get_dummies().
This worked for me as groupby, after formation of thousands of dummified columns (in my case), is very slow especially on sparse dataframes. It basically gave up for me. Grouping over the columns first and then dummifying made it work.
df = pd.DataFrame(df.groupby(['id','team'])['cat'].unique())
df.columns = ['cat']
df.reset_index(inplace=True)
df = df[['id','team']].join(df['cat'].str.join('|').str.get_dummies().add_prefix('CAT_'))
Hope this helps out someone!