Drop rows that contain specific words (not as substrings) - pandas

I have the following data frame, df:
id text
1 'a little table'
2 'blue lights'
3 'food and drink'
4 'build an atom'
5 'fast animals'
and a list of stop words, that is:
sw = ['a', 'an', 'and']
I want to delete the lines that contain at least one of the stop words (as words themselves, not as substrings). That is, the result I would like is:
id text
2 'blue lights'
5 'fast animals'
I was trying with:
df[~df['text'].str.contains('|'.join(sw), regex=True, na=False)]
but it doesn't work as intended: written this way it matches substrings, and 'a' is a substring of every text except 'blue lights'. How should I change my line of code?

Here is one way to do it:
# '|'.join(sw): joins the stop words with |, forming a regex OR condition
# \b: word boundaries around the group ensure whole-word matches
# build the pattern, then filter out the matching rows using loc
df.loc[~df['text'].str.contains('\\b(' + '|'.join(sw) + ')\\b')]
or:
df[df['text'].str.extract('\\b(' + '|'.join(sw) + ')\\b')[0].isna()]
   id           text
1   2  'blue lights'
4   5  'fast animals'

li = ['a', 'an', 'and']
for i in li:
    for k in df.index:
        # split() compares whole words, not substrings
        if i in df.text[k].split():
            df.drop(k, inplace=True)

If you want to use str.contains, you could try as follows:
import pandas as pd
data = {'id': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5},
'text': {0: "'a little table'", 1: "'blue lights'",
2: "'food and drink'", 3: "'build an atom'",
4: "'fast animals'"}}
df = pd.DataFrame(data)
sw = ['a', 'an', 'and']
res = df[~df['text'].str.contains(fr'\b(?:{"|".join(sw)})\b',
regex=True, na=False)]
print(res)
   id           text
1   2  'blue lights'
4   5  'fast animals'
In the regex pattern, \b asserts a position at a word boundary, while ?: at the start of the parenthesized group makes it non-capturing. Strictly speaking, you could do without ?:, but it suppresses the UserWarning "This pattern ... has match groups ...".
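For instance, a quick sketch of the difference (using the df and sw above):
df['text'].str.contains(r'\b(a|an|and)\b')    # capture group: emits the UserWarning
df['text'].str.contains(r'\b(?:a|an|and)\b')  # non-capturing: no warning, same matches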

Another possible solution, which works as follows:
Split each string by space, producing a list of words
Check whether each of those lists of words is disjoint with sw.
Use the result for boolean indexing.
df[df['text'].str.split(' ').map(lambda x: set(x).isdisjoint(sw))]
Output:
   id          text
1   2   blue lights
4   5  fast animals

You can also use a custom function with apply():
def string_present(word_list, text):
    # note: this checks for "word + space" as a substring, so it can
    # over-match (e.g. 'a ' inside 'extra apple') and misses a stop
    # word at the very end of the string
    return any(word + ' ' in text for word in word_list)

df['status'] = df['text'].apply(lambda row: string_present(sw, row))
df[~df['status']].drop(columns=['status'])
The output is:
   id           text
1   2  'blue lights'
4   5  'fast animals'

sw = ['a', 'an', 'and']
df.loc[~df.text.str.split(' ').map(lambda x: pd.Series(x).isin(sw).any())]

Related

find if a substring is contained in an index in a Dataframe in Pandas [duplicate]

I have a pandas DataFrame with a column of string values. I need to select rows based on partial string matches.
Something like this idiom:
re.search(pattern, cell_in_question)
returning a boolean. I am familiar with the syntax of df[df['A'] == "hello world"] but can't seem to find a way to do the same with a partial string match, say 'hello'.
Vectorized string methods (i.e. Series.str) let you do the following:
df[df['A'].str.contains("hello")]
This is available in pandas 0.8.1 and up.
I am using pandas 0.14.1 on macos in ipython notebook. I tried the proposed line above:
df[df["A"].str.contains("Hello|Britain")]
and got an error:
cannot index with vector containing NA / NaN values
but it worked perfectly when an "==True" condition was added, like this:
df[df['A'].str.contains("Hello|Britain")==True]
How do I select by partial string from a pandas DataFrame?
This post is meant for readers who want to
search for a substring in a string column (the simplest case) as in df1[df1['col'].str.contains(r'foo(?!$)')]
search for multiple substrings (similar to isin), e.g., with df4[df4['col'].str.contains(r'foo|baz')]
match a whole word from text (e.g., "blue" should match "the sky is blue" but not "bluejay"), e.g., with df3[df3['col'].str.contains(r'\bblue\b')]
match multiple whole words
Understand the reason behind "ValueError: cannot index with vector containing NA / NaN values" and correct it with str.contains('pattern',na=False)
...and would like to know more about what methods should be preferred over others.
(P.S.: I've seen a lot of questions on similar topics, I thought it would be good to leave this here.)
Friendly disclaimer: this post is long.
Basic Substring Search
# setup
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'col': ['foo', 'foobar', 'bar', 'baz']})
df1
col
0 foo
1 foobar
2 bar
3 baz
str.contains can be used to perform either substring searches or regex-based searches. The search defaults to regex unless you explicitly disable it.
Here is an example of regex-based search,
# find rows in `df1` which contain "foo" followed by something
df1[df1['col'].str.contains(r'foo(?!$)')]
col
1 foobar
Sometimes regex search is not required, so specify regex=False to disable it.
#select all rows containing "foo"
df1[df1['col'].str.contains('foo', regex=False)]
# same as df1[df1['col'].str.contains('foo')] but faster.
col
0 foo
1 foobar
Performance-wise, regex search is slower than substring search:
df2 = pd.concat([df1] * 1000, ignore_index=True)
%timeit df2[df2['col'].str.contains('foo')]
%timeit df2[df2['col'].str.contains('foo', regex=False)]
6.31 ms ± 126 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
2.8 ms ± 241 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Avoid using regex-based search if you don't need it.
Addressing ValueErrors
Sometimes, performing a substring search and filtering on the result will result in
ValueError: cannot index with vector containing NA / NaN values
This is usually because of mixed data or NaNs in your object column,
s = pd.Series(['foo', 'foobar', np.nan, 'bar', 'baz', 123])
s.str.contains('foo|bar')
0 True
1 True
2 NaN
3 True
4 False
5 NaN
dtype: object
s[s.str.contains('foo|bar')]
# ---------------------------------------------------------------------------
# ValueError Traceback (most recent call last)
Anything that is not a string cannot have string methods applied on it, so the result is NaN (naturally). In this case, specify na=False to ignore non-string data,
s.str.contains('foo|bar', na=False)
0 True
1 True
2 False
3 True
4 False
5 False
dtype: bool
How do I apply this to multiple columns at once?
The answer is in the question. Use DataFrame.apply:
# apply the function to each column (the default axis=0 is column-wise)
df.apply(lambda col: col.str.contains('foo|bar', na=False))
A B
0 True True
1 True False
2 False True
3 True False
4 False False
5 False False
All of the solutions below can be "applied" to multiple columns using the column-wise apply method (which is OK in my book, as long as you don't have too many columns).
If you have a DataFrame with mixed columns and want to select only the object/string columns, take a look at select_dtypes.
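For instance, a minimal sketch (assuming a mixed-dtype DataFrame named df):
# keep only object-dtype columns before applying string methods
str_cols = df.select_dtypes(include='object')
str_cols.apply(lambda col: col.str.contains('foo|bar', na=False))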
Multiple Substring Search
This is most easily achieved through a regex search using the regex OR pipe.
# Slightly modified example.
df4 = pd.DataFrame({'col': ['foo abc', 'foobar xyz', 'bar32', 'baz 45']})
df4
col
0 foo abc
1 foobar xyz
2 bar32
3 baz 45
df4[df4['col'].str.contains(r'foo|baz')]
col
0 foo abc
1 foobar xyz
3 baz 45
You can also create a list of terms, then join them:
terms = ['foo', 'baz']
df4[df4['col'].str.contains('|'.join(terms))]
col
0 foo abc
1 foobar xyz
3 baz 45
Sometimes, it is wise to escape your terms in case they have characters that can be interpreted as regex metacharacters. If your terms contain any of the following characters...
. ^ $ * + ? { } [ ] \ | ( )
Then, you'll need to use re.escape to escape them:
import re
df4[df4['col'].str.contains('|'.join(map(re.escape, terms)))]
col
0 foo abc
1 foobar xyz
3 baz 45
re.escape has the effect of escaping the special characters so they're treated literally.
re.escape(r'.foo^')
# '\\.foo\\^'
Matching Entire Word(s)
By default, the substring search searches for the specified substring/pattern regardless of whether it is full word or not. To only match full words, we will need to make use of regular expressions here—in particular, our pattern will need to specify word boundaries (\b).
For example,
df3 = pd.DataFrame({'col': ['the sky is blue', 'bluejay by the window']})
df3
col
0 the sky is blue
1 bluejay by the window
Now consider,
df3[df3['col'].str.contains('blue')]
col
0 the sky is blue
1 bluejay by the window
versus
df3[df3['col'].str.contains(r'\bblue\b')]
col
0 the sky is blue
Multiple Whole Word Search
Similar to the above, except we add a word boundary (\b) to the joined pattern.
p = r'\b(?:{})\b'.format('|'.join(map(re.escape, terms)))
df4[df4['col'].str.contains(p)]
col
0 foo abc
3 baz 45
Where p looks like this,
p
# '\\b(?:foo|baz)\\b'
A Great Alternative: Use List Comprehensions!
Because you can! And you should! They are usually a little bit faster than string methods, because string methods are hard to vectorise and usually have loopy implementations.
Instead of,
df1[df1['col'].str.contains('foo', regex=False)]
Use the in operator inside a list comp,
df1[['foo' in x for x in df1['col']]]
col
0 foo
1 foobar
Instead of,
regex_pattern = r'foo(?!$)'
df1[df1['col'].str.contains(regex_pattern)]
Use re.compile (to cache your regex) + Pattern.search inside a list comp,
p = re.compile(regex_pattern, flags=re.IGNORECASE)
df1[[bool(p.search(x)) for x in df1['col']]]
col
1 foobar
If "col" has NaNs, then instead of
df1[df1['col'].str.contains(regex_pattern, na=False)]
Use,
def try_search(p, x):
try:
return bool(p.search(x))
except TypeError:
return False
p = re.compile(regex_pattern)
df1[[try_search(p, x) for x in df1['col']]]
col
1 foobar
More Options for Partial String Matching: np.char.find, np.vectorize, DataFrame.query.
In addition to str.contains and list comprehensions, you can also use the following alternatives.
np.char.find
Supports substring searches (read: no regex) only.
df4[np.char.find(df4['col'].values.astype(str), 'foo') > -1]
col
0 foo abc
1 foobar xyz
np.vectorize
This is a wrapper around a loop, but with less overhead than most pandas str methods.
f = np.vectorize(lambda haystack, needle: needle in haystack)
f(df1['col'], 'foo')
# array([ True, True, False, False])
df1[f(df1['col'], 'foo')]
col
0 foo
1 foobar
Regex solutions are also possible:
regex_pattern = r'foo(?!$)'
p = re.compile(regex_pattern)
f = np.vectorize(lambda x: pd.notna(x) and bool(p.search(x)))
df1[f(df1['col'])]
col
1 foobar
DataFrame.query
Supports string methods through the python engine. This offers no visible performance benefits, but is nonetheless useful to know if you need to dynamically generate your queries.
df1.query('col.str.contains("foo")', engine='python')
col
0 foo
1 foobar
More information on query and eval family of methods can be found at Dynamically evaluate an expression from a formula in Pandas.
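As a sketch of that dynamic use (the col_name and word variables here are hypothetical):
col_name, word = 'col', 'foo'
df1.query(f'{col_name}.str.contains(@word)', engine='python')  # @word refers to the local variable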
Recommended Usage Precedence
(First) str.contains, for its simplicity and ease of handling NaNs and mixed data
List comprehensions, for their performance (especially if your data is purely strings)
np.vectorize
(Last) df.query
If anyone wonders how to solve a related problem: "Select column by partial string"
Use:
df.filter(like='hello') # select columns which contain the word hello
And to select rows by partial string matching, pass axis=0 to filter:
# selects rows which contain the word hello in their index label
df.filter(like='hello', axis=0)
Quick note: if you want to do selection based on a partial string contained in the index, try the following:
df['stridx'] = df.index
df[df['stridx'].str.contains("Hello|Britain")]
Should you need to do a case insensitive search for a string in a pandas dataframe column:
df[df['A'].str.contains("hello", case=False)]
Say you have the following DataFrame:
>>> df = pd.DataFrame([['hello', 'hello world'], ['abcd', 'defg']], columns=['a','b'])
>>> df
a b
0 hello hello world
1 abcd defg
You can always use the in operator in a lambda expression to create your filter.
>>> df.apply(lambda x: x['a'] in x['b'], axis=1)
0 True
1 False
dtype: bool
The trick here is to use the axis=1 option in the apply to pass elements to the lambda function row by row, as opposed to column by column.
You can try treating them as strings:
df[df['A'].astype(str).str.contains("Hello|Britain")]
Suppose we have a column named "ENTITY" in the dataframe df. We can filter df to keep only the rows where the "ENTITY" column does not contain "DM", using a mask as follows:
mask = df['ENTITY'].str.contains('DM')
df = df.loc[~mask].copy(deep=True)
Here's what I ended up doing for partial string matches. If anyone has a more efficient way of doing this, please let me know.
def stringSearchColumn_DataFrame(df, colName, regex):
    newdf = pd.DataFrame()
    for idx, record in df[colName].items():  # iteritems() in older pandas
        if re.search(regex, record):
            newdf = pd.concat([df[df[colName] == record], newdf], ignore_index=True)
    return newdf
Using contains didn't work well for my string with special characters. Find worked though.
df[df['A'].str.find("hello") != -1]
A more generalised example - if looking for parts of a word OR specific words in a string:
df = pd.DataFrame([('cat andhat', 1000.0), ('hat', 2000000.0), ('the small dog', 1000.0), ('fog', 330000.0),('pet', 330000.0)], columns=['col1', 'col2'])
Specific parts of sentence or word:
searchfor = '.*cat.*hat.*|.*the.*dog.*'
Create a column showing the affected rows (you can always filter them out as necessary):
df["TrueFalse"] = df['col1'].str.contains(searchfor, regex=True)
            col1       col2  TrueFalse
0     cat andhat     1000.0       True
1            hat  2000000.0      False
2  the small dog     1000.0       True
3            fog   330000.0      False
4            pet   330000.0      False
Maybe you want to search for some text in all columns of the pandas dataframe, and not just in a subset of them. In this case, the following code will help.
df[df.apply(lambda row: row.astype(str).str.contains('String To Find').any(), axis=1)]
Warning. This method is relatively slow, albeit convenient.
Somewhat similar to @cs95's answer, but here you don't need to specify an engine:
df.query('A.str.contains("hello").values')
There are earlier answers that accomplish this, but I would like to show the most general way:
df.filter(regex=".*STRING_YOU_LOOK_FOR.*")
This lets you get the column you are looking for however it is written.
(Obviously, you have to write the proper regex for each case.)
My 2c worth:
I did the following:
sale_method = pd.DataFrame(model_data['Sale Method'].str.upper())
sale_method['sale_classification'] = np.where(
    sale_method['Sale Method'].isin(['PRIVATE']),
    'private',
    np.where(sale_method['Sale Method'].str.contains('AUCTION'),
             'auction',
             'other')
)

Replacing Specific Values in a Pandas Column [duplicate]

I'm trying to replace the values in one column of a dataframe. The column ('female') only contains the values 'female' and 'male'.
I have tried the following:
w['female']['female']='1'
w['female']['male']='0'
But I receive an exact copy of the previous results.
I would ideally like to get some output which resembles the following loop element-wise.
if w['female'] == 'female':
    w['female'] = '1'
else:
    w['female'] = '0'
I've looked through the gotchas documentation (http://pandas.pydata.org/pandas-docs/stable/gotchas.html) but cannot figure out why nothing happens.
Any help will be appreciated.
If I understand right, you want something like this:
w['female'] = w['female'].map({'female': 1, 'male': 0})
(Here I convert the values to numbers instead of strings containing numbers. You can convert them to "1" and "0", if you really want, but I'm not sure why you'd want that.)
The reason your code doesn't work is because using ['female'] on a column (the second 'female' in your w['female']['female']) doesn't mean "select rows where the value is 'female'". It means to select rows where the index is 'female', of which there may not be any in your DataFrame.
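A minimal sketch of what actually happens (names from the question):
s = w['female']   # this is a Series indexed 0..n-1
s['female']       # looks up the *label* 'female' in that index -> KeyError, not a row filter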
You can edit a subset of a dataframe by using loc:
df.loc[<row selection>, <column selection>]
In this case:
w.loc[w.female != 'female', 'female'] = 0
w.loc[w.female == 'female', 'female'] = 1
w.female.replace(to_replace=dict(female=1, male=0), inplace=True)
See pandas.DataFrame.replace() docs.
Slight variation:
w.female.replace(['male', 'female'], [1, 0], inplace=True)
This should also work:
w.female[w.female == 'female'] = 1
w.female[w.female == 'male'] = 0
This is very compact:
w['female'][w['female'] == 'female']=1
w['female'][w['female'] == 'male']=0
Another good one:
w['female'] = w['female'].replace(regex='female', value=1)
w['female'] = w['female'].replace(regex='male', value=0)
You can also use apply with dict.get, i.e.
w['female'] = w['female'].apply({'male': 0, 'female': 1}.get)
w = pd.DataFrame({'female': ['female', 'male', 'female']})
print(w)
Dataframe w:
female
0 female
1 male
2 female
Using apply to replace values from the dictionary:
w['female'] = w['female'].apply({'male':0, 'female':1}.get)
print(w)
Result:
female
0 1
1 0
2 1
Note: apply with a dictionary's .get should only be used if all possible values of the column are defined in the dictionary; otherwise, values missing from the dictionary are mapped to None and become NaN.
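As a sketch of one way around that, dict.get's second argument provides a default, so unmapped values can pass through unchanged:
w['female'] = w['female'].apply(lambda v: {'male': 0, 'female': 1}.get(v, v))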
Using Series.map with Series.fillna
If your column contains more strings than only female and male, Series.map will fail in this case since it will return NaN for other values.
That's why we have to chain it with fillna:
Example why .map fails:
df = pd.DataFrame({'female':['male', 'female', 'female', 'male', 'other', 'other']})
female
0 male
1 female
2 female
3 male
4 other
5 other
df['female'].map({'female': '1', 'male': '0'})
0 0
1 1
2 1
3 0
4 NaN
5 NaN
Name: female, dtype: object
For the correct method, we chain map with fillna, so we fill the NaN with values from the original column:
df['female'].map({'female': '1', 'male': '0'}).fillna(df['female'])
0 0
1 1
2 1
3 0
4 other
5 other
Name: female, dtype: object
Alternatively there is the built-in function pd.get_dummies for these kinds of assignments:
w['female'] = pd.get_dummies(w['female'], drop_first=True)
This gives you a data frame with two columns, one for each value that occurs in w['female'], of which you drop the first (because you can infer it from the one that is left). The new column is automatically named as the string that you replaced.
This is especially useful if you have categorical variables with more than two possible values. This function creates as many dummy variables as are needed to distinguish between all cases. Be careful then that you don't assign the entire data frame to a single column; instead, if w['female'] could be 'male', 'female' or 'neutral', do something like this:
w = pd.concat([w, pd.get_dummies(w['female'], drop_first=True)], axis=1)
w.drop('female', axis=1, inplace=True)
Then you are left with two new columns giving you the dummy coding of 'female' and you got rid of the column with the strings.
w.replace({'female':{'female':1, 'male':0}}, inplace = True)
The above code will replace 'female' with 1 and 'male' with 0, only in the column 'female'
There is also a function in pandas called factorize which you can use to automatically do this type of work. It converts labels to numbers: ['male', 'female', 'male'] -> [0, 1, 0]. See this answer for more information.
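A minimal sketch (assuming w['female'] holds ['female', 'male', 'female']):
codes, uniques = pd.factorize(w['female'])
# codes   -> array([0, 1, 0])   (the first value seen gets code 0)
# uniques -> Index(['female', 'male'], dtype='object')
w['female'] = codes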
w.female = np.where(w.female == 'female', 1, 0)
in case someone is looking for a NumPy solution. This is useful for replacing values based on a condition; both the if and the else branch are inherent in np.where(). The solutions that use df.replace() may not be feasible if the column contains many unique values in addition to 'male', all of which should be replaced with 0.
Another solution is to use df.where() and df.mask() in succession. This is because neither of them implements an else condition.
w.female.where(w.female=='female', 0, inplace=True) # replace where condition is False
w.female.mask(w.female=='female', 1, inplace=True) # replace where condition is True
dic = {'female':1, 'male':0}
w['female'] = w['female'].replace(dic)
.replace takes a dictionary argument in which you can map whatever old values to new values you need.
It is worth pointing out which type of object you get back from each access pattern. Selecting a column with w['female'] (or w.female) returns a Series, while double brackets, w[['female']], return a single-column DataFrame.
Both Series and DataFrame provide a .replace method, so either form works here; just avoid chained indexing such as w['female']['female'] = ..., which may silently operate on a copy.
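A quick sketch of the distinction:
type(w['female'])     # pandas.core.series.Series
type(w[['female']])   # pandas.core.frame.DataFrame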
To answer the question more generically so it applies to more use cases than just what the OP asked, consider this solution. I used jfs's solution to help me. Here, we create two functions that feed each other and can be used whether or not you know the exact replacements.
import numpy as np
import pandas as pd

class Utility:

    @staticmethod
    def rename_values_in_column(column: pd.Series, name_changes: dict = None) -> pd.Series:
        """
        Renames the distinct values in a column. If no dictionary is provided for the
        exact name changes, the values are left unchanged; to auto-generate names such
        as female_1, female_2, etc., pair this with create_unique_values_for_column below.
        :param column: The column in your dataframe you would like to alter.
        :param name_changes: A dictionary of the old values to the new values you would like to change.
            Ex. {1234: "User A"} would change all occurrences of 1234 to the string "User A"
            and leave the other values as they were. By default, this is an empty dictionary.
        :return: The same column with the replaced values.
        """
        name_changes = name_changes if name_changes else {}
        new_column = column.replace(to_replace=name_changes)
        return new_column

    @staticmethod
    def create_unique_values_for_column(column: pd.Series, except_values: list = None) -> dict:
        """
        Creates a dictionary where the key is the existing column item and the value is
        the new item to replace it, named <column_name>_<count>. The returned dictionary
        can then be passed to rename_values_in_column to rename all the distinct values
        in a column.
        Ex. a column named "statement" with values ["I", "am", "old"] would return
        {"I": "statement_1", "am": "statement_2", "old": "statement_3"}
        If you would like a value to remain the same, list it in except_values.
        Ex. with except_values = ["I", "am"], the same column would return
        {"old": "statement_1"}
        :param column: A pandas Series for the column with the values to replace.
        :param except_values: A list of values you do not want to have changed.
        :return: A dictionary that maps the old values to their respective new values.
        """
        except_values = except_values if except_values else []
        column_name = column.name
        distinct_values = np.unique(column)
        name_mappings = {}
        count = 1
        for value in distinct_values:
            if value not in except_values:
                name_mappings[value] = f"{column_name}_{count}"
                count += 1
        return name_mappings
For the OP's use case, it is simple enough to just use
w["female"] = Utility.rename_values_in_column(w["female"], name_changes={"female": 0, "male": 1})
However, it is not always so easy to know all of the different unique values within a data frame that you may want to rename. In my case, the string values for a column are hashed values so they hurt the readability. What I do instead is replace those hashed values with more readable strings thanks to the create_unique_values_for_column function.
df["user"] = Utility.rename_values_in_column(
df["user"],
Utility.create_unique_values_for_column(df["user"])
)
This will change my user column values from ["1a2b3c", "a12b3c", "1a2b3c"] to ["user_1", "user_2", "user_1"]. Much easier to compare, right?
If you have only two classes, you can use the equality operator. For example:
df = pd.DataFrame({'col1':['a', 'a', 'a', 'b']})
df['col1'].eq('a').astype(int)
# (df['col1'] == 'a').astype(int)
Output:
0 1
1 1
2 1
3 0
Name: col1, dtype: int64

Find rows in dataframe column containing questions

I have a TSV file that I loaded into a pandas dataframe to do some preprocessing and I want to find out which rows have a question in it, and output 1 or 0 in a new column. Since it is a TSV, this is how I'm loading it:
import pandas as pd
df = pd.read_csv('queries-10k-txt-backup', sep='\t')
Here's a sample of what it looks like:
QUERY FREQ
0 hindi movies for adults 595
1 are panda dogs real 383
2 asuedraw winning numbers 478
3 sentry replacement keys 608
4 rebuilding nicad battery packs 541
After dropping empty rows, duplicates, and the FREQ column (not needed for this), I wrote a simple function to check the QUERY column to see if it contains any words that make the string a question:
df_test = df.drop_duplicates()
df_test = df_test.dropna()
df_test = df_test.drop(['FREQ'], axis = 1)
def questions(row):
    questions_list = ["what", "when", "where", "which", "who", "whom", "whose", "why",
                      "why don't", "how", "how far", "how long", "how many", "how much",
                      "how old", "how come", "?"]
    if row['QUERY'] in questions_list:
        return 1
    else:
        return 0

df_test['QUESTIONS'] = df_test.apply(questions, axis=1)
But once I check the new dataframe, even though it creates the new column, all the values are 0. I'm not sure if my logic is wrong in the function. I've used something similar with dataframe columns that hold just one word: if it matches, it outputs a 1 or 0. However, that same logic doesn't seem to work when the column contains a phrase or sentence like in this use case. Any input is really appreciated!
If you wish to check whether any string from questions_list occurs as a substring of a string in the dataframe, you should use the str.contains method:
import re

questions_list = ["what", "when", "where", "which", "who", "whom", "whose", "why",
                  "why don't", "how", "how far", "how long", "how many",
                  "how much", "how old", "how come", "?"]
# generate a regex from your list; re.escape keeps "?" from being read as a metacharacter
pattern = "|".join(map(re.escape, questions_list))
df_test['QUESTIONS'] = df_test['QUERY'].str.contains(pattern)
Simplified example:
df = pd.DataFrame({
'QUERY': ['how do you like it', 'what\'s going on?', 'quick brown fox'],
'ID': [0, 1, 2]})
Create a pattern:
pattern = '|'.join(['what', 'how'])
pattern
Out: 'what|how'
Use it:
df['QUERY'].str.contains(pattern)
Out[12]:
0 True
1 True
2 False
Name: QUERY, dtype: bool
If you're not familiar with regexes, there's a quick Python re reference. For the symbol '|', the explanation is:
A|B, where A and B can be arbitrary REs, creates a regular expression that will match either A or B. An arbitrary number of REs can be separated by the '|' in this way
IIUC, you need to check whether the first word of the string is in the question list: if yes, return 1, else 0. In your function, rather than checking whether the entire string is in the question list, split the string and check whether its first element is.
def questions(row):
    questions_list = ["are", "what", "when", "where", "which", "who", "whom", "whose",
                      "why", "why don't", "how", "how far", "how long", "how many",
                      "how much", "how old", "how come", "?"]
    if row['QUERY'].split()[0] in questions_list:
        return 1
    else:
        return 0

df['QUESTIONS'] = df.apply(questions, axis=1)
You get
QUERY FREQ QUESTIONS
0 hindi movies for adults 595 0
1 are panda dogs real 383 1
2 asuedraw winning numbers 478 0
3 sentry replacement keys 608 0
4 rebuilding nicad battery packs 541 0

Pandas, concatenate certain columns if other columns are empty

I've got a CSV file that is supposed to look like this:
ID, years_active, issues
-------------------------------
'Truck1', 8, 'In dire need of a paintjob'
'Car 5', 3, 'To small for large groups'
However, the CSV is somewhat malformed and currently looks like this.
ID, years_active, issues
------------------------
'Truck1', 8, 'In dire need'
'','', 'of a'
'','', 'paintjob'
'Car 5', 3, 'To small for'
'', '', 'large groups'
Now, I am able to identify faulty rows by the lack of 'ID' and 'years_active' values, and would like to append the value of 'issues' of such a row to the last preceding row that had 'ID' and 'years_active' values.
I am not very experienced with pandas, but came up with the following code:
for index, row in df.iterrows():
    if row['years_active'] == None:
        df.loc[index-1]['issues'] += row['issues']
Yet - the IF condition fails to trigger.
Is the thing I am trying to do possible? And if so, does anyone have an idea what I am doing wrong?
Given your sample input:
df = pd.DataFrame({
    'ID': ['Truck1', '', '', 'Car 5', ''],
    'years_active': [8, '', '', 3, ''],
    'issues': ['In dire need', 'of a', 'paintjob', 'To small for', 'large groups']
})
You can use:
new_df = df.groupby(df.ID.replace('', method='ffill')).agg({'years_active': 'first', 'issues': ' '.join})
Which'll give you:
years_active issues
ID
Car 5 3 To small for large groups
Truck1 8 In dire need of a paintjob
So what we're doing here is forward filling the non-blank IDs into subsequent blank IDs and using those to group the related rows. We then aggregate to take the first occurrence of the years_active and join together the issues columns in the order they appear to create a single result.
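On recent pandas versions, where replace(..., method='ffill') is deprecated, a sketch of an equivalent grouping key would be:
key = df.ID.mask(df.ID == '').ffill()   # blank IDs become NaN, then forward-filled
new_df = df.groupby(key).agg({'years_active': 'first', 'issues': ' '.join})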
The following uses a for loop to find and join the strings (dataframe from JonClements' answer):
df = pd.DataFrame({
    'ID': ['Truck1', '', '', 'Car 5', ''],
    'years_active': [8, '', '', 3, ''],
    'issues': ['In dire need', 'of a', 'paintjob', 'To small for', 'large groups']
})

ss = ""; ii = 0; ilist = [0]
for i in range(len(df.index)):
    if i > 0 and df.ID[i] != "":
        # a new ID starts: store the accumulated issues on the previous record
        df.issues[ii] = ss
        ss = df.issues[i]
        ii = i
        ilist.append(ii)
    else:
        ss += ' ' + df.issues[i]
df.issues[ii] = ss
df = df.iloc[ilist]
print(df)
Output:
ID issues years_active
0 Truck1 In dire need of a paintjob 8
3 Car 5 To small for large groups 3
It might be worth mentioning in the context of this question that there is an often overlooked way of processing awkward input: the io.StringIO class.
The essential point is that read_csv can read from a StringIO 'file'.
In this case, I arrange to discard the single quotes and extra commas that would confuse read_csv, and I append the second and subsequent lines of input to the first line, to form complete, conventional CSV lines for read_csv.
Here is the resulting DataFrame:
ID years_active issues
0 Truck1 8 In dire need of a paintjob
1 Car 5 3 To small for large groups
The code is ugly but easy to follow.
import pandas as pd
from io import StringIO

for_pd = StringIO()
with open('jasper.txt') as jasper:
    # copy the header line straight through
    print(jasper.readline(), file=for_pd)
    # skip the ------ separator line
    line = jasper.readline()
    complete_record = ''
    for line in jasper:
        # discard single quotes and the spaces after commas
        line = line.rstrip().replace(', ', ',').replace("'", '')
        if line.startswith(','):
            # continuation line: append its text to the current record
            complete_record += line.replace(',,', ',').replace(',', ' ')
        else:
            if complete_record:
                print(complete_record, file=for_pd)
            complete_record = line
    if complete_record:
        print(complete_record, file=for_pd)

for_pd.seek(0)
df = pd.read_csv(for_pd)
print(df)
print (df)