Some trouble as a beginner [duplicate] - pandas

This seems like a ridiculously easy question... but I'm not seeing the easy answer I was expecting.
So, how do I get the value at an nth row of a given column in Pandas? (I am particularly interested in the first row, but would be interested in a more general practice as well).
For example, let's say I want to pull the 1.2 value in Btime as a variable.
Whats the right way to do this?
>>> df_test
ATime X Y Z Btime C D E
0 1.2 2 15 2 1.2 12 25 12
1 1.4 3 12 1 1.3 13 22 11
2 1.5 1 10 6 1.4 11 20 16
3 1.6 2 9 10 1.7 12 29 12
4 1.9 1 1 9 1.9 11 21 19
5 2.0 0 0 0 2.0 8 10 11
6 2.4 0 0 0 2.4 10 12 15

To select the ith row, use iloc:
In [31]: df_test.iloc[0]
Out[31]:
ATime 1.2
X 2.0
Y 15.0
Z 2.0
Btime 1.2
C 12.0
D 25.0
E 12.0
Name: 0, dtype: float64
To select the ith value in the Btime column you could use:
In [30]: df_test['Btime'].iloc[0]
Out[30]: 1.2
There is a difference between df_test['Btime'].iloc[0] (recommended) and df_test.iloc[0]['Btime']:
DataFrames store data in column-based blocks (where each block has a single
dtype). If you select by column first, a view can be returned (which is
quicker than returning a copy) and the original dtype is preserved. In contrast,
if you select by row first, and if the DataFrame has columns of different
dtypes, then Pandas copies the data into a new Series of object dtype. So
selecting columns is a bit faster than selecting rows. Thus, although
df_test.iloc[0]['Btime'] works, df_test['Btime'].iloc[0] is a little bit
more efficient.
There is a big difference between the two when it comes to assignment.
df_test['Btime'].iloc[0] = x affects df_test, but df_test.iloc[0]['Btime']
may not. See below for an explanation of why. Because a subtle difference in
the order of indexing makes a big difference in behavior, it is better to use single indexing assignment:
df.iloc[0, df.columns.get_loc('Btime')] = x
df.iloc[0, df.columns.get_loc('Btime')] = x (recommended):
The recommended way to assign new values to a
DataFrame is to avoid chained indexing, and instead use the method shown by
andrew,
df.loc[df.index[n], 'Btime'] = x
or
df.iloc[n, df.columns.get_loc('Btime')] = x
The latter method is a bit faster, because df.loc has to convert the row and column labels to
positional indices, so there is a little less conversion necessary if you use
df.iloc instead.
df['Btime'].iloc[0] = x works, but is not recommended:
Although this works, it is taking advantage of the way DataFrames are currently implemented. There is no guarantee that Pandas has to work this way in the future. In particular, it is taking advantage of the fact that (currently) df['Btime'] always returns a
view (not a copy) so df['Btime'].iloc[n] = x can be used to assign a new value
at the nth location of the Btime column of df.
Since Pandas makes no explicit guarantees about when indexers return a view versus a copy, assignments that use chained indexing generally always raise a SettingWithCopyWarning even though in this case the assignment succeeds in modifying df:
In [22]: df = pd.DataFrame({'foo':list('ABC')}, index=[0,2,1])
In [24]: df['bar'] = 100
In [25]: df['bar'].iloc[0] = 99
/home/unutbu/data/binky/bin/ipython:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
self._setitem_with_indexer(indexer, value)
In [26]: df
Out[26]:
foo bar
0 A 99 <-- assignment succeeded
2 B 100
1 C 100
df.iloc[0]['Btime'] = x does not work:
In contrast, assignment with df.iloc[0]['bar'] = 123 does not work because df.iloc[0] is returning a copy:
In [66]: df.iloc[0]['bar'] = 123
/home/unutbu/data/binky/bin/ipython:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
In [67]: df
Out[67]:
foo bar
0 A 99 <-- assignment failed
2 B 100
1 C 100
Warning: I had previously suggested df_test.ix[i, 'Btime']. But this is not guaranteed to give you the ith value since ix tries to index by label before trying to index by position. So if the DataFrame has an integer index which is not in sorted order starting at 0, then using ix[i] will return the row labeled i rather than the ith row. For example,
In [1]: df = pd.DataFrame({'foo':list('ABC')}, index=[0,2,1])
In [2]: df
Out[2]:
foo
0 A
2 B
1 C
In [4]: df.ix[1, 'foo']
Out[4]: 'C'

Note that the answer from #unutbu will be correct until you want to set the value to something new, then it will not work if your dataframe is a view.
In [4]: df = pd.DataFrame({'foo':list('ABC')}, index=[0,2,1])
In [5]: df['bar'] = 100
In [6]: df['bar'].iloc[0] = 99
/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pandas-0.16.0_19_g8d2818e-py2.7-macosx-10.9-x86_64.egg/pandas/core/indexing.py:118: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
self._setitem_with_indexer(indexer, value)
Another approach that will consistently work with both setting and getting is:
In [7]: df.loc[df.index[0], 'foo']
Out[7]: 'A'
In [8]: df.loc[df.index[0], 'bar'] = 99
In [9]: df
Out[9]:
foo bar
0 A 99
2 B 100
1 C 100

Another way to do this:
first_value = df['Btime'].values[0]
This way seems to be faster than using .iloc:
In [1]: %timeit -n 1000 df['Btime'].values[20]
5.82 µs ± 142 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [2]: %timeit -n 1000 df['Btime'].iloc[20]
29.2 µs ± 1.28 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

df.iloc[0].head(1) - First data set only from entire first row.
df.iloc[0] - Entire First row in column.

In a general way, if you want to pick up the first N rows from the J column from pandas dataframe the best way to do this is:
data = dataframe[0:N][:,J]

To access a single value you can use the method iat that is much faster than iloc:
df['Btime'].iat[0]
You can also use the method take:
df['Btime'].take(0)

.iat and .at are the methods for getting and setting single values and are much faster than .iloc and .loc. Mykola Zotko pointed this out in their answer, but they did not use .iat to its full extent.
When we can use .iat or .at, we should only have to index into the dataframe once.
This is not great:
df['Btime'].iat[0]
It is not ideal because the 'Btime' column was first selected as a series, then .iat was used to index into that series.
These two options are the best:
Using zero-indexed positions:
df.iat[0, 4] # get the value in the zeroth row, and 4th column
Using Labels:
df.at[0, 'Btime'] # get the value where the index label is 0 and the column name is "Btime".
Both methods return the value of 1.2.

To get e.g the value from column 'test' and row 1 it works like
df[['test']].values[0][0]
as only df[['test']].values[0] gives back a array

Another way of getting the first row and preserving the index:
x = df.first('d') # Returns the first day. '3d' gives first three days.

According to pandas docs, at is the fastest way to access a scalar value such as the use case in the OP (already suggested by Alex on this page).
Building upon Alex's answer, because dataframes don't necessarily have a range index it might be more complete to index df.index (since dataframe indexes are built on numpy arrays, you can index them like an array) or call get_loc() on columns to get the integer location of a column.
df.at[df.index[0], 'Btime']
df.iat[0, df.columns.get_loc('Btime')]
One common problem is that if you used a boolean mask to get a single value, but ended up with a value with an index (actually a Series); e.g.:
0 1.2
Name: Btime, dtype: float64
you can use squeeze() to get the scalar value, i.e.
df.loc[df['Btime']<1.3, 'Btime'].squeeze()

Related

find if a substring is contained in an index in a Dataframe in Pandas [duplicate]

I have a pandas DataFrame with a column of string values. I need to select rows based on partial string matches.
Something like this idiom:
re.search(pattern, cell_in_question)
returning a boolean. I am familiar with the syntax of df[df['A'] == "hello world"] but can't seem to find a way to do the same with a partial string match, say 'hello'.
Vectorized string methods (i.e. Series.str) let you do the following:
df[df['A'].str.contains("hello")]
This is available in pandas 0.8.1 and up.
I am using pandas 0.14.1 on macos in ipython notebook. I tried the proposed line above:
df[df["A"].str.contains("Hello|Britain")]
and got an error:
cannot index with vector containing NA / NaN values
but it worked perfectly when an "==True" condition was added, like this:
df[df['A'].str.contains("Hello|Britain")==True]
How do I select by partial string from a pandas DataFrame?
This post is meant for readers who want to
search for a substring in a string column (the simplest case) as in df1[df1['col'].str.contains(r'foo(?!$)')]
search for multiple substrings (similar to isin), e.g., with df4[df4['col'].str.contains(r'foo|baz')]
match a whole word from text (e.g., "blue" should match "the sky is blue" but not "bluejay"), e.g., with df3[df3['col'].str.contains(r'\bblue\b')]
match multiple whole words
Understand the reason behind "ValueError: cannot index with vector containing NA / NaN values" and correct it with str.contains('pattern',na=False)
...and would like to know more about what methods should be preferred over others.
(P.S.: I've seen a lot of questions on similar topics, I thought it would be good to leave this here.)
Friendly disclaimer, this is post is long.
Basic Substring Search
# setup
df1 = pd.DataFrame({'col': ['foo', 'foobar', 'bar', 'baz']})
df1
col
0 foo
1 foobar
2 bar
3 baz
str.contains can be used to perform either substring searches or regex based search. The search defaults to regex-based unless you explicitly disable it.
Here is an example of regex-based search,
# find rows in `df1` which contain "foo" followed by something
df1[df1['col'].str.contains(r'foo(?!$)')]
col
1 foobar
Sometimes regex search is not required, so specify regex=False to disable it.
#select all rows containing "foo"
df1[df1['col'].str.contains('foo', regex=False)]
# same as df1[df1['col'].str.contains('foo')] but faster.
col
0 foo
1 foobar
Performance wise, regex search is slower than substring search:
df2 = pd.concat([df1] * 1000, ignore_index=True)
%timeit df2[df2['col'].str.contains('foo')]
%timeit df2[df2['col'].str.contains('foo', regex=False)]
6.31 ms ± 126 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
2.8 ms ± 241 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Avoid using regex-based search if you don't need it.
Addressing ValueErrors
Sometimes, performing a substring search and filtering on the result will result in
ValueError: cannot index with vector containing NA / NaN values
This is usually because of mixed data or NaNs in your object column,
s = pd.Series(['foo', 'foobar', np.nan, 'bar', 'baz', 123])
s.str.contains('foo|bar')
0 True
1 True
2 NaN
3 True
4 False
5 NaN
dtype: object
s[s.str.contains('foo|bar')]
# ---------------------------------------------------------------------------
# ValueError Traceback (most recent call last)
Anything that is not a string cannot have string methods applied on it, so the result is NaN (naturally). In this case, specify na=False to ignore non-string data,
s.str.contains('foo|bar', na=False)
0 True
1 True
2 False
3 True
4 False
5 False
dtype: bool
How do I apply this to multiple columns at once?
The answer is in the question. Use DataFrame.apply:
# `axis=1` tells `apply` to apply the lambda function column-wise.
df.apply(lambda col: col.str.contains('foo|bar', na=False), axis=1)
A B
0 True True
1 True False
2 False True
3 True False
4 False False
5 False False
All of the solutions below can be "applied" to multiple columns using the column-wise apply method (which is OK in my book, as long as you don't have too many columns).
If you have a DataFrame with mixed columns and want to select only the object/string columns, take a look at select_dtypes.
Multiple Substring Search
This is most easily achieved through a regex search using the regex OR pipe.
# Slightly modified example.
df4 = pd.DataFrame({'col': ['foo abc', 'foobar xyz', 'bar32', 'baz 45']})
df4
col
0 foo abc
1 foobar xyz
2 bar32
3 baz 45
df4[df4['col'].str.contains(r'foo|baz')]
col
0 foo abc
1 foobar xyz
3 baz 45
You can also create a list of terms, then join them:
terms = ['foo', 'baz']
df4[df4['col'].str.contains('|'.join(terms))]
col
0 foo abc
1 foobar xyz
3 baz 45
Sometimes, it is wise to escape your terms in case they have characters that can be interpreted as regex metacharacters. If your terms contain any of the following characters...
. ^ $ * + ? { } [ ] \ | ( )
Then, you'll need to use re.escape to escape them:
import re
df4[df4['col'].str.contains('|'.join(map(re.escape, terms)))]
col
0 foo abc
1 foobar xyz
3 baz 45
re.escape has the effect of escaping the special characters so they're treated literally.
re.escape(r'.foo^')
# '\\.foo\\^'
Matching Entire Word(s)
By default, the substring search searches for the specified substring/pattern regardless of whether it is full word or not. To only match full words, we will need to make use of regular expressions here—in particular, our pattern will need to specify word boundaries (\b).
For example,
df3 = pd.DataFrame({'col': ['the sky is blue', 'bluejay by the window']})
df3
col
0 the sky is blue
1 bluejay by the window
Now consider,
df3[df3['col'].str.contains('blue')]
col
0 the sky is blue
1 bluejay by the window
v/s
df3[df3['col'].str.contains(r'\bblue\b')]
col
0 the sky is blue
Multiple Whole Word Search
Similar to the above, except we add a word boundary (\b) to the joined pattern.
p = r'\b(?:{})\b'.format('|'.join(map(re.escape, terms)))
df4[df4['col'].str.contains(p)]
col
0 foo abc
3 baz 45
Where p looks like this,
p
# '\\b(?:foo|baz)\\b'
A Great Alternative: Use List Comprehensions!
Because you can! And you should! They are usually a little bit faster than string methods, because string methods are hard to vectorise and usually have loopy implementations.
Instead of,
df1[df1['col'].str.contains('foo', regex=False)]
Use the in operator inside a list comp,
df1[['foo' in x for x in df1['col']]]
col
0 foo abc
1 foobar
Instead of,
regex_pattern = r'foo(?!$)'
df1[df1['col'].str.contains(regex_pattern)]
Use re.compile (to cache your regex) + Pattern.search inside a list comp,
p = re.compile(regex_pattern, flags=re.IGNORECASE)
df1[[bool(p.search(x)) for x in df1['col']]]
col
1 foobar
If "col" has NaNs, then instead of
df1[df1['col'].str.contains(regex_pattern, na=False)]
Use,
def try_search(p, x):
try:
return bool(p.search(x))
except TypeError:
return False
p = re.compile(regex_pattern)
df1[[try_search(p, x) for x in df1['col']]]
col
1 foobar
More Options for Partial String Matching: np.char.find, np.vectorize, DataFrame.query.
In addition to str.contains and list comprehensions, you can also use the following alternatives.
np.char.find
Supports substring searches (read: no regex) only.
df4[np.char.find(df4['col'].values.astype(str), 'foo') > -1]
col
0 foo abc
1 foobar xyz
np.vectorize
This is a wrapper around a loop, but with lesser overhead than most pandas str methods.
f = np.vectorize(lambda haystack, needle: needle in haystack)
f(df1['col'], 'foo')
# array([ True, True, False, False])
df1[f(df1['col'], 'foo')]
col
0 foo abc
1 foobar
Regex solutions possible:
regex_pattern = r'foo(?!$)'
p = re.compile(regex_pattern)
f = np.vectorize(lambda x: pd.notna(x) and bool(p.search(x)))
df1[f(df1['col'])]
col
1 foobar
DataFrame.query
Supports string methods through the python engine. This offers no visible performance benefits, but is nonetheless useful to know if you need to dynamically generate your queries.
df1.query('col.str.contains("foo")', engine='python')
col
0 foo
1 foobar
More information on query and eval family of methods can be found at Dynamically evaluate an expression from a formula in Pandas.
Recommended Usage Precedence
(First) str.contains, for its simplicity and ease handling NaNs and mixed data
List comprehensions, for its performance (especially if your data is purely strings)
np.vectorize
(Last) df.query
If anyone wonders how to perform a related problem: "Select column by partial string"
Use:
df.filter(like='hello') # select columns which contain the word hello
And to select rows by partial string matching, pass axis=0 to filter:
# selects rows which contain the word hello in their index label
df.filter(like='hello', axis=0)
Quick note: if you want to do selection based on a partial string contained in the index, try the following:
df['stridx']=df.index
df[df['stridx'].str.contains("Hello|Britain")]
Should you need to do a case insensitive search for a string in a pandas dataframe column:
df[df['A'].str.contains("hello", case=False)]
Say you have the following DataFrame:
>>> df = pd.DataFrame([['hello', 'hello world'], ['abcd', 'defg']], columns=['a','b'])
>>> df
a b
0 hello hello world
1 abcd defg
You can always use the in operator in a lambda expression to create your filter.
>>> df.apply(lambda x: x['a'] in x['b'], axis=1)
0 True
1 False
dtype: bool
The trick here is to use the axis=1 option in the apply to pass elements to the lambda function row by row, as opposed to column by column.
You can try considering them as string as :
df[df['A'].astype(str).str.contains("Hello|Britain")]
Suppose we have a column named "ENTITY" in the dataframe df. We can filter our df,to have the entire dataframe df, wherein rows of "entity" column doesn't contain "DM" by using a mask as follows:
mask = df['ENTITY'].str.contains('DM')
df = df.loc[~(mask)].copy(deep=True)
Here's what I ended up doing for partial string matches. If anyone has a more efficient way of doing this please let me know.
def stringSearchColumn_DataFrame(df, colName, regex):
newdf = DataFrame()
for idx, record in df[colName].iteritems():
if re.search(regex, record):
newdf = concat([df[df[colName] == record], newdf], ignore_index=True)
return newdf
Using contains didn't work well for my string with special characters. Find worked though.
df[df['A'].str.find("hello") != -1]
A more generalised example - if looking for parts of a word OR specific words in a string:
df = pd.DataFrame([('cat andhat', 1000.0), ('hat', 2000000.0), ('the small dog', 1000.0), ('fog', 330000.0),('pet', 330000.0)], columns=['col1', 'col2'])
Specific parts of sentence or word:
searchfor = '.*cat.*hat.*|.*the.*dog.*'
Creat column showing the affected rows (can always filter out as necessary)
df["TrueFalse"]=df['col1'].str.contains(searchfor, regex=True)
col1 col2 TrueFalse
0 cat andhat 1000.0 True
1 hat 2000000.0 False
2 the small dog 1000.0 True
3 fog 330000.0 False
4 pet 3 30000.0 False
Maybe you want to search for some text in all columns of the Pandas dataframe, and not just in the subset of them. In this case, the following code will help.
df[df.apply(lambda row: row.astype(str).str.contains('String To Find').any(), axis=1)]
Warning. This method is relatively slow, albeit convenient.
Somewhat similar to #cs95's answer, but here you don't need to specify an engine:
df.query('A.str.contains("hello").values')
There are answers before this which accomplish the asked feature, anyway I would like to show the most generally way:
df.filter(regex=".*STRING_YOU_LOOK_FOR.*")
This way let's you get the column you look for whatever the way is wrote.
( Obviusly, you have to write the proper regex expression for each case )
My 2c worth:
I did the following:
sale_method = pd.DataFrame(model_data['Sale Method'].str.upper())
sale_method['sale_classification'] = \
np.where(sale_method['Sale Method'].isin(['PRIVATE']),
'private',
np.where(sale_method['Sale Method']
.str.contains('AUCTION'),
'auction',
'other'
)
)
df[df['A'].str.contains("hello", case=False)]

If this string is found, return the row Pandas [duplicate]

I have a pandas DataFrame with a column of string values. I need to select rows based on partial string matches.
Something like this idiom:
re.search(pattern, cell_in_question)
returning a boolean. I am familiar with the syntax of df[df['A'] == "hello world"] but can't seem to find a way to do the same with a partial string match, say 'hello'.
Vectorized string methods (i.e. Series.str) let you do the following:
df[df['A'].str.contains("hello")]
This is available in pandas 0.8.1 and up.
I am using pandas 0.14.1 on macos in ipython notebook. I tried the proposed line above:
df[df["A"].str.contains("Hello|Britain")]
and got an error:
cannot index with vector containing NA / NaN values
but it worked perfectly when an "==True" condition was added, like this:
df[df['A'].str.contains("Hello|Britain")==True]
How do I select by partial string from a pandas DataFrame?
This post is meant for readers who want to
search for a substring in a string column (the simplest case) as in df1[df1['col'].str.contains(r'foo(?!$)')]
search for multiple substrings (similar to isin), e.g., with df4[df4['col'].str.contains(r'foo|baz')]
match a whole word from text (e.g., "blue" should match "the sky is blue" but not "bluejay"), e.g., with df3[df3['col'].str.contains(r'\bblue\b')]
match multiple whole words
Understand the reason behind "ValueError: cannot index with vector containing NA / NaN values" and correct it with str.contains('pattern',na=False)
...and would like to know more about what methods should be preferred over others.
(P.S.: I've seen a lot of questions on similar topics, I thought it would be good to leave this here.)
Friendly disclaimer, this is post is long.
Basic Substring Search
# setup
df1 = pd.DataFrame({'col': ['foo', 'foobar', 'bar', 'baz']})
df1
col
0 foo
1 foobar
2 bar
3 baz
str.contains can be used to perform either substring searches or regex based search. The search defaults to regex-based unless you explicitly disable it.
Here is an example of regex-based search,
# find rows in `df1` which contain "foo" followed by something
df1[df1['col'].str.contains(r'foo(?!$)')]
col
1 foobar
Sometimes regex search is not required, so specify regex=False to disable it.
#select all rows containing "foo"
df1[df1['col'].str.contains('foo', regex=False)]
# same as df1[df1['col'].str.contains('foo')] but faster.
col
0 foo
1 foobar
Performance wise, regex search is slower than substring search:
df2 = pd.concat([df1] * 1000, ignore_index=True)
%timeit df2[df2['col'].str.contains('foo')]
%timeit df2[df2['col'].str.contains('foo', regex=False)]
6.31 ms ± 126 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
2.8 ms ± 241 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Avoid using regex-based search if you don't need it.
Addressing ValueErrors
Sometimes, performing a substring search and filtering on the result will result in
ValueError: cannot index with vector containing NA / NaN values
This is usually because of mixed data or NaNs in your object column,
s = pd.Series(['foo', 'foobar', np.nan, 'bar', 'baz', 123])
s.str.contains('foo|bar')
0 True
1 True
2 NaN
3 True
4 False
5 NaN
dtype: object
s[s.str.contains('foo|bar')]
# ---------------------------------------------------------------------------
# ValueError Traceback (most recent call last)
Anything that is not a string cannot have string methods applied on it, so the result is NaN (naturally). In this case, specify na=False to ignore non-string data,
s.str.contains('foo|bar', na=False)
0 True
1 True
2 False
3 True
4 False
5 False
dtype: bool
How do I apply this to multiple columns at once?
The answer is in the question. Use DataFrame.apply:
# `axis=1` tells `apply` to apply the lambda function column-wise.
df.apply(lambda col: col.str.contains('foo|bar', na=False), axis=1)
A B
0 True True
1 True False
2 False True
3 True False
4 False False
5 False False
All of the solutions below can be "applied" to multiple columns using the column-wise apply method (which is OK in my book, as long as you don't have too many columns).
If you have a DataFrame with mixed columns and want to select only the object/string columns, take a look at select_dtypes.
Multiple Substring Search
This is most easily achieved through a regex search using the regex OR pipe.
# Slightly modified example.
df4 = pd.DataFrame({'col': ['foo abc', 'foobar xyz', 'bar32', 'baz 45']})
df4
col
0 foo abc
1 foobar xyz
2 bar32
3 baz 45
df4[df4['col'].str.contains(r'foo|baz')]
col
0 foo abc
1 foobar xyz
3 baz 45
You can also create a list of terms, then join them:
terms = ['foo', 'baz']
df4[df4['col'].str.contains('|'.join(terms))]
col
0 foo abc
1 foobar xyz
3 baz 45
Sometimes, it is wise to escape your terms in case they have characters that can be interpreted as regex metacharacters. If your terms contain any of the following characters...
. ^ $ * + ? { } [ ] \ | ( )
Then, you'll need to use re.escape to escape them:
import re
df4[df4['col'].str.contains('|'.join(map(re.escape, terms)))]
col
0 foo abc
1 foobar xyz
3 baz 45
re.escape has the effect of escaping the special characters so they're treated literally.
re.escape(r'.foo^')
# '\\.foo\\^'
Matching Entire Word(s)
By default, the substring search searches for the specified substring/pattern regardless of whether it is full word or not. To only match full words, we will need to make use of regular expressions here—in particular, our pattern will need to specify word boundaries (\b).
For example,
df3 = pd.DataFrame({'col': ['the sky is blue', 'bluejay by the window']})
df3
col
0 the sky is blue
1 bluejay by the window
Now consider,
df3[df3['col'].str.contains('blue')]
col
0 the sky is blue
1 bluejay by the window
v/s
df3[df3['col'].str.contains(r'\bblue\b')]
col
0 the sky is blue
Multiple Whole Word Search
Similar to the above, except we add a word boundary (\b) to the joined pattern.
p = r'\b(?:{})\b'.format('|'.join(map(re.escape, terms)))
df4[df4['col'].str.contains(p)]
col
0 foo abc
3 baz 45
Where p looks like this,
p
# '\\b(?:foo|baz)\\b'
A Great Alternative: Use List Comprehensions!
Because you can! And you should! They are usually a little bit faster than string methods, because string methods are hard to vectorise and usually have loopy implementations.
Instead of,
df1[df1['col'].str.contains('foo', regex=False)]
Use the in operator inside a list comp,
df1[['foo' in x for x in df1['col']]]
col
0 foo abc
1 foobar
Instead of,
regex_pattern = r'foo(?!$)'
df1[df1['col'].str.contains(regex_pattern)]
Use re.compile (to cache your regex) + Pattern.search inside a list comp,
p = re.compile(regex_pattern, flags=re.IGNORECASE)
df1[[bool(p.search(x)) for x in df1['col']]]
col
1 foobar
If "col" has NaNs, then instead of
df1[df1['col'].str.contains(regex_pattern, na=False)]
Use,
def try_search(p, x):
try:
return bool(p.search(x))
except TypeError:
return False
p = re.compile(regex_pattern)
df1[[try_search(p, x) for x in df1['col']]]
col
1 foobar
More Options for Partial String Matching: np.char.find, np.vectorize, DataFrame.query.
In addition to str.contains and list comprehensions, you can also use the following alternatives.
np.char.find
Supports substring searches (read: no regex) only.
df4[np.char.find(df4['col'].values.astype(str), 'foo') > -1]
col
0 foo abc
1 foobar xyz
np.vectorize
This is a wrapper around a loop, but with lesser overhead than most pandas str methods.
f = np.vectorize(lambda haystack, needle: needle in haystack)
f(df1['col'], 'foo')
# array([ True, True, False, False])
df1[f(df1['col'], 'foo')]
col
0 foo abc
1 foobar
Regex solutions possible:
regex_pattern = r'foo(?!$)'
p = re.compile(regex_pattern)
f = np.vectorize(lambda x: pd.notna(x) and bool(p.search(x)))
df1[f(df1['col'])]
col
1 foobar
DataFrame.query
Supports string methods through the python engine. This offers no visible performance benefits, but is nonetheless useful to know if you need to dynamically generate your queries.
df1.query('col.str.contains("foo")', engine='python')
col
0 foo
1 foobar
More information on query and eval family of methods can be found at Dynamically evaluate an expression from a formula in Pandas.
Recommended Usage Precedence
(First) str.contains, for its simplicity and ease handling NaNs and mixed data
List comprehensions, for its performance (especially if your data is purely strings)
np.vectorize
(Last) df.query
If anyone wonders how to perform a related problem: "Select column by partial string"
Use:
df.filter(like='hello') # select columns which contain the word hello
And to select rows by partial string matching, pass axis=0 to filter:
# selects rows which contain the word hello in their index label
df.filter(like='hello', axis=0)
Quick note: if you want to do selection based on a partial string contained in the index, try the following:
df['stridx']=df.index
df[df['stridx'].str.contains("Hello|Britain")]
Should you need to do a case insensitive search for a string in a pandas dataframe column:
df[df['A'].str.contains("hello", case=False)]
Say you have the following DataFrame:
>>> df = pd.DataFrame([['hello', 'hello world'], ['abcd', 'defg']], columns=['a','b'])
>>> df
a b
0 hello hello world
1 abcd defg
You can always use the in operator in a lambda expression to create your filter.
>>> df.apply(lambda x: x['a'] in x['b'], axis=1)
0 True
1 False
dtype: bool
The trick here is to use the axis=1 option in the apply to pass elements to the lambda function row by row, as opposed to column by column.
You can try considering them as string as :
df[df['A'].astype(str).str.contains("Hello|Britain")]
Suppose we have a column named "ENTITY" in the dataframe df. We can filter our df,to have the entire dataframe df, wherein rows of "entity" column doesn't contain "DM" by using a mask as follows:
mask = df['ENTITY'].str.contains('DM')
df = df.loc[~(mask)].copy(deep=True)
Here's what I ended up doing for partial string matches. If anyone has a more efficient way of doing this please let me know.
def stringSearchColumn_DataFrame(df, colName, regex):
newdf = DataFrame()
for idx, record in df[colName].iteritems():
if re.search(regex, record):
newdf = concat([df[df[colName] == record], newdf], ignore_index=True)
return newdf
Using contains didn't work well for my string with special characters. Find worked though.
df[df['A'].str.find("hello") != -1]
A more generalised example - if looking for parts of a word OR specific words in a string:
df = pd.DataFrame([('cat andhat', 1000.0), ('hat', 2000000.0), ('the small dog', 1000.0), ('fog', 330000.0),('pet', 330000.0)], columns=['col1', 'col2'])
Specific parts of sentence or word:
searchfor = '.*cat.*hat.*|.*the.*dog.*'
Creat column showing the affected rows (can always filter out as necessary)
df["TrueFalse"]=df['col1'].str.contains(searchfor, regex=True)
col1 col2 TrueFalse
0 cat andhat 1000.0 True
1 hat 2000000.0 False
2 the small dog 1000.0 True
3 fog 330000.0 False
4 pet 3 30000.0 False
Maybe you want to search for some text in all columns of the Pandas dataframe, and not just in the subset of them. In this case, the following code will help.
df[df.apply(lambda row: row.astype(str).str.contains('String To Find').any(), axis=1)]
Warning. This method is relatively slow, albeit convenient.
Somewhat similar to #cs95's answer, but here you don't need to specify an engine:
df.query('A.str.contains("hello").values')
There are answers before this which accomplish the asked feature, anyway I would like to show the most generally way:
df.filter(regex=".*STRING_YOU_LOOK_FOR.*")
This way let's you get the column you look for whatever the way is wrote.
( Obviusly, you have to write the proper regex expression for each case )
My 2c worth:
I did the following:
sale_method = pd.DataFrame(model_data['Sale Method'].str.upper())
sale_method['sale_classification'] = \
np.where(sale_method['Sale Method'].isin(['PRIVATE']),
'private',
np.where(sale_method['Sale Method']
.str.contains('AUCTION'),
'auction',
'other'
)
)
df[df['A'].str.contains("hello", case=False)]

dataframe slicing with loc [duplicate]

How do I select columns a and b from df, and save them into a new dataframe df1?
index a b c
1 2 3 4
2 3 4 5
Unsuccessful attempt:
df1 = df['a':'b']
df1 = df.ix[:, 'a':'b']
The column names (which are strings) cannot be sliced in the manner you tried.
Here you have a couple of options. If you know from context which variables you want to slice out, you can just return a view of only those columns by passing a list into the __getitem__ syntax (the []'s).
df1 = df[['a', 'b']]
Alternatively, if it matters to index them numerically and not by their name (say your code should automatically do this without knowing the names of the first two columns) then you can do this instead:
df1 = df.iloc[:, 0:2] # Remember that Python does not slice inclusive of the ending index.
Additionally, you should familiarize yourself with the idea of a view into a Pandas object vs. a copy of that object. The first of the above methods will return a new copy in memory of the desired sub-object (the desired slices).
Sometimes, however, there are indexing conventions in Pandas that don't do this and instead give you a new variable that just refers to the same chunk of memory as the sub-object or slice in the original object. This will happen with the second way of indexing, so you can modify it with the .copy() method to get a regular copy. When this happens, changing what you think is the sliced object can sometimes alter the original object. Always good to be on the look out for this.
df1 = df.iloc[0, 0:2].copy() # To avoid the case where changing df1 also changes df
To use iloc, you need to know the column positions (or indices). As the column positions may change, instead of hard-coding indices, you can use iloc along with get_loc function of columns method of dataframe object to obtain column indices.
{df.columns.get_loc(c): c for idx, c in enumerate(df.columns)}
Now you can use this dictionary to access columns through names and using iloc.
As of version 0.11.0, columns can be sliced in the manner you tried using the .loc indexer:
df.loc[:, 'C':'E']
is equivalent to
df[['C', 'D', 'E']] # or df.loc[:, ['C', 'D', 'E']]
and returns columns C through E.
A demo on a randomly generated DataFrame:
import pandas as pd
import numpy as np
np.random.seed(5)
df = pd.DataFrame(np.random.randint(100, size=(100, 6)),
columns=list('ABCDEF'),
index=['R{}'.format(i) for i in range(100)])
df.head()
Out:
A B C D E F
R0 99 78 61 16 73 8
R1 62 27 30 80 7 76
R2 15 53 80 27 44 77
R3 75 65 47 30 84 86
R4 18 9 41 62 1 82
To get the columns from C to E (note that unlike integer slicing, E is included in the columns):
df.loc[:, 'C':'E']
Out:
C D E
R0 61 16 73
R1 30 80 7
R2 80 27 44
R3 47 30 84
R4 41 62 1
R5 5 58 0
...
The same works for selecting rows based on labels. Get the rows R6 to R10 from those columns:
df.loc['R6':'R10', 'C':'E']
Out:
C D E
R6 51 27 31
R7 83 19 18
R8 11 67 65
R9 78 27 29
R10 7 16 94
.loc also accepts a Boolean array so you can select the columns whose corresponding entry in the array is True. For example, df.columns.isin(list('BCD')) returns array([False, True, True, True, False, False], dtype=bool) - True if the column name is in the list ['B', 'C', 'D']; False, otherwise.
df.loc[:, df.columns.isin(list('BCD'))]
Out:
B C D
R0 78 61 16
R1 27 30 80
R2 53 80 27
R3 65 47 30
R4 9 41 62
R5 78 5 58
...
Assuming your column names (df.columns) are ['index','a','b','c'], then the data you want is in the
third and fourth columns. If you don't know their names when your script runs, you can do this
newdf = df[df.columns[2:4]] # Remember, Python is zero-offset! The "third" entry is at slot two.
As EMS points out in his answer, df.ix slices columns a bit more concisely, but the .columns slicing interface might be more natural, because it uses the vanilla one-dimensional Python list indexing/slicing syntax.
Warning: 'index' is a bad name for a DataFrame column. That same label is also used for the real df.index attribute, an Index array. So your column is returned by df['index'] and the real DataFrame index is returned by df.index. An Index is a special kind of Series optimized for lookup of its elements' values. For df.index it's for looking up rows by their label. That df.columns attribute is also a pd.Index array, for looking up columns by their labels.
In the latest version of Pandas there is an easy way to do exactly this. Column names (which are strings) can be sliced in whatever manner you like.
columns = ['b', 'c']
df1 = pd.DataFrame(df, columns=columns)
In [39]: df
Out[39]:
index a b c
0 1 2 3 4
1 2 3 4 5
In [40]: df1 = df[['b', 'c']]
In [41]: df1
Out[41]:
b c
0 3 4
1 4 5
With Pandas,
wit column names
dataframe[['column1','column2']]
to select by iloc and specific columns with index number:
dataframe.iloc[:,[1,2]]
with loc column names can be used like
dataframe.loc[:,['column1','column2']]
You can use the pandas.DataFrame.filter method to either filter or reorder columns like this:
df1 = df.filter(['a', 'b'])
This is also very useful when you are chaining methods.
You could provide a list of columns to be dropped and return back the DataFrame with only the columns needed using the drop() function on a Pandas DataFrame.
Just saying
colsToDrop = ['a']
df.drop(colsToDrop, axis=1)
would return a DataFrame with just the columns b and c.
The drop method is documented here.
I found this method to be very useful:
# iloc[row slicing, column slicing]
surveys_df.iloc [0:3, 1:4]
More details can be found here.
Starting with 0.21.0, using .loc or [] with a list with one or more missing labels is deprecated in favor of .reindex. So, the answer to your question is:
df1 = df.reindex(columns=['b','c'])
In prior versions, using .loc[list-of-labels] would work as long as at least one of the keys was found (otherwise it would raise a KeyError). This behavior is deprecated and now shows a warning message. The recommended alternative is to use .reindex().
Read more at Indexing and Selecting Data.
You can use Pandas.
I create the DataFrame:
import pandas as pd
df = pd.DataFrame([[1, 2,5], [5,4, 5], [7,7, 8], [7,6,9]],
index=['Jane', 'Peter','Alex','Ann'],
columns=['Test_1', 'Test_2', 'Test_3'])
The DataFrame:
Test_1 Test_2 Test_3
Jane 1 2 5
Peter 5 4 5
Alex 7 7 8
Ann 7 6 9
To select one or more columns by name:
df[['Test_1', 'Test_3']]
Test_1 Test_3
Jane 1 5
Peter 5 5
Alex 7 8
Ann 7 9
You can also use:
df.Test_2
And you get column Test_2:
Jane 2
Peter 4
Alex 7
Ann 6
You can also select columns and rows from these rows using .loc(). This is called "slicing". Notice that I take from column Test_1 to Test_3:
df.loc[:, 'Test_1':'Test_3']
The "Slice" is:
Test_1 Test_2 Test_3
Jane 1 2 5
Peter 5 4 5
Alex 7 7 8
Ann 7 6 9
And if you just want Peter and Ann from columns Test_1 and Test_3:
df.loc[['Peter', 'Ann'], ['Test_1', 'Test_3']]
You get:
Test_1 Test_3
Peter 5 5
Ann 7 9
If you want to get one element by row index and column name, you can do it just like df['b'][0]. It is as simple as you can imagine.
Or you can use df.ix[0,'b'] - mixed usage of index and label.
Note: Since v0.20, ix has been deprecated in favour of loc / iloc.
df[['a', 'b']] # Select all rows of 'a' and 'b'column
df.loc[0:10, ['a', 'b']] # Index 0 to 10 select column 'a' and 'b'
df.loc[0:10, 'a':'b'] # Index 0 to 10 select column 'a' to 'b'
df.iloc[0:10, 3:5] # Index 0 to 10 and column 3 to 5
df.iloc[3, 3:5] # Index 3 of column 3 to 5
Try to use pandas.DataFrame.get (see the documentation):
import pandas as pd
import numpy as np
dates = pd.date_range('20200102', periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
df.get(['A', 'C'])
One different and easy approach: iterating rows
Using iterows
df1 = pd.DataFrame() # Creating an empty dataframe
for index,i in df.iterrows():
df1.loc[index, 'A'] = df.loc[index, 'A']
df1.loc[index, 'B'] = df.loc[index, 'B']
df1.head()
The different approaches discussed in the previous answers are based on the assumption that either the user knows column indices to drop or subset on, or the user wishes to subset a dataframe using a range of columns (for instance between 'C' : 'E').
pandas.DataFrame.drop() is certainly an option to subset data based on a list of columns defined by user (though you have to be cautious that you always use copy of dataframe and inplace parameters should not be set to True!!)
Another option is to use pandas.columns.difference(), which does a set difference on column names, and returns an index type of array containing desired columns. Following is the solution:
df = pd.DataFrame([[2,3,4], [3,4,5]], columns=['a','b','c'], index=[1,2])
columns_for_differencing = ['a']
df1 = df.copy()[df.columns.difference(columns_for_differencing)]
print(df1)
The output would be:
b c
1 3 4
2 4 5
You can also use df.pop():
>>> df = pd.DataFrame([('falcon', 'bird', 389.0),
... ('parrot', 'bird', 24.0),
... ('lion', 'mammal', 80.5),
... ('monkey', 'mammal', np.nan)],
... columns=('name', 'class', 'max_speed'))
>>> df
name class max_speed
0 falcon bird 389.0
1 parrot bird 24.0
2 lion mammal 80.5
3 monkey mammal
>>> df.pop('class')
0 bird
1 bird
2 mammal
3 mammal
Name: class, dtype: object
>>> df
name max_speed
0 falcon 389.0
1 parrot 24.0
2 lion 80.5
3 monkey NaN
Please use df.pop(c).
I've seen several answers on that, but one remained unclear to me. How would you select those columns of interest?
The answer to that is that if you have them gathered in a list, you can just reference the columns using the list.
Example
print(extracted_features.shape)
print(extracted_features)
(63,)
['f000004' 'f000005' 'f000006' 'f000014' 'f000039' 'f000040' 'f000043'
'f000047' 'f000048' 'f000049' 'f000050' 'f000051' 'f000052' 'f000053'
'f000054' 'f000055' 'f000056' 'f000057' 'f000058' 'f000059' 'f000060'
'f000061' 'f000062' 'f000063' 'f000064' 'f000065' 'f000066' 'f000067'
'f000068' 'f000069' 'f000070' 'f000071' 'f000072' 'f000073' 'f000074'
'f000075' 'f000076' 'f000077' 'f000078' 'f000079' 'f000080' 'f000081'
'f000082' 'f000083' 'f000084' 'f000085' 'f000086' 'f000087' 'f000088'
'f000089' 'f000090' 'f000091' 'f000092' 'f000093' 'f000094' 'f000095'
'f000096' 'f000097' 'f000098' 'f000099' 'f000100' 'f000101' 'f000103']
I have the following list/NumPy array extracted_features, specifying 63 columns. The original dataset has 103 columns, and I would like to extract exactly those, then I would use
dataset[extracted_features]
And you will end up with this
This something you would use quite often in machine learning (more specifically, in feature selection). I would like to discuss other ways too, but I think that has already been covered by other Stack Overflower users.
To exclude some columns you can drop them in the column index. For example:
A B C D
0 1 10 100 1000
1 2 20 200 2000
Select all except two:
df[df.columns.drop(['B', 'D'])]
Output:
A C
0 1 100
1 2 200
You can also use the method truncate to select middle columns:
df.truncate(before='B', after='C', axis=1)
Output:
B C
0 10 100
1 20 200
To select multiple columns, extract and view them thereafter: df is the previously named data frame. Then create a new data frame df1, and select the columns A to D which you want to extract and view.
df1 = pd.DataFrame(data_frame, columns=['Column A', 'Column B', 'Column C', 'Column D'])
df1
All required columns will show up!
def get_slize(dataframe, start_row, end_row, start_col, end_col):
assert len(dataframe) > end_row and start_row >= 0
assert len(dataframe.columns) > end_col and start_col >= 0
list_of_indexes = list(dataframe.columns)[start_col:end_col]
ans = dataframe.iloc[start_row:end_row][list_of_indexes]
return ans
Just use this function
I think this is the easiest way to reach your goal.
import pandas as pd
cols = ['a', 'b']
df1 = pd.DataFrame(df, columns=cols)
df1 = df.iloc[:, 0:2]

pandas mix position and label indexing without chaining

Since .ix has been deprecated as of Pandas 0.20, I wonder what is the proper way to mix lable-based, boolean-based and position-based indexing in Pandas? I need to assign values to a slice of dataframe that can be best referenced with label or boolean on the index and position on the columns. For example (using .loc as placeholder for the desired slicing method):
df.loc[df['a'] == 'x', -12:-1] = 3
obviously this doesn't work, with which I get:
TypeError: cannot do slice indexing on <class 'pandas.core.indexes.base.Index'> with these indexers [-12] of <class 'int'>
If I use .iloc, I get:
NotImplementedError: iLocation based boolean indexing on an integer type is not available
So how do I do it, without chaining, obviously to avoid chained assignment problem.
Let's use .loc with the boolean indexing, and accessing the column labels via the dataframe column index with index slicing:
df.loc[df['a'] == 'x', df.columns[-12:-1]] = 3
maybe I should've explained clearer. I meant if your dataframe is indexed (with 0 to n), then you can use loc[] for a combination of number for rows and lable for column:
new_df = pd.DataFrame({'a':[1,2,3,4],'b':[5,6,7,8]})
new_df
Out[10]:
a b
0 1 5
1 2 6
2 3 7
3 4 8
new_df.loc[0,'a']
Out[11]:
1
Even though .ix has been removed, it looks like .loc does the same job now. You can make a mix reference using .loc.

get indexes for selected values of boolean index

Consider the following data frame:
df = pd.DataFrame([[6, 3.14, 'c','d',4.2,'e']])
We can create an index for dtype by
boolIndex = df.dtypes=='object'
and create a subset directly with
df.select_dtypes(include=['object'])
As a learning exercise, I would like to create a data frame including columns corresponding to the first 2 (or an arbitrary number) TRUE values in boolIndex. Meaning, I would like to create a data frame including the first 2 columns of dtype 'object'. I am trying to avoid getting all the 'object' columns and then taking the first two. Ideally, I would have a numeric index corresponding to the positions, and I could select whichever ones I want.
You can take the first two object columns before slicing:
In [11]: objs = df.dtypes == 'object'
In [12]: objs[objs][:2].index
Out[12]: Int64Index([2, 3], dtype='int64')
In [13]: df.loc[:, objs[objs][:2].index]
Out[13]:
2 3
0 c d
Though I suspect the performance difference will be slight in most usecases, it could make a small difference in extreme cases:
In [21]: %timeit objs = df.dtypes == 'object'; df.loc[:, objs[objs][:2].index];
100 loops, best of 3: 5.03 ms per loop
In [22]: %timeit df.select_dtypes(include=['object'])
10 loops, best of 3: 49.3 ms per loop