Inconsistent interface of Pandas Series; yielding access to underlying data

While the new Categorical Series support since pandas 0.15.0 is fantastic, I'm a bit annoyed with how they decided to make the underlying data inaccessible except through underscored variables. Consider the following code:
import numpy as np
import pandas as pd
x = np.empty(3, dtype=np.int64)
s = pd.DatetimeIndex(x, tz='UTC')
x
Out[17]: array([140556737562568, 55872352, 32])
s[0]
Out[18]: Timestamp('1970-01-02 15:02:36.737562568+0000', tz='UTC')
x[0] = 0
s[0]
Out[20]: Timestamp('1970-01-01 00:00:00+0000', tz='UTC')
y = s.values
y[0] = 5
x[0]
Out[23]: 5
s[0]
Out[24]: Timestamp('1970-01-01 00:00:00.000000005+0000', tz='UTC')
We can see that, both in construction and when asked for the underlying values, this DatetimeIndex makes no deep copies of its underlying data. Not only is this potentially useful in terms of efficiency, but it's great if you are using a DataFrame as a buffer: you can easily get the numpy primitive containing the underlying data, and from there get a pointer to the raw data, which some low-level C routine can use to copy into from some block of memory.
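As a minimal sketch of that buffer pattern (assuming the zero-copy .values behavior demonstrated above; ctypes.memmove stands in for the low-level C routine):

import ctypes
import numpy as np
import pandas as pd

x = np.zeros(3, dtype=np.int64)
idx = pd.DatetimeIndex(x, tz='UTC')

buf = idx.values                            # shares memory with x, per the session above
src = np.array([1, 2, 3], dtype=np.int64)

# A C routine would receive the raw address buf.ctypes.data and write into it;
# ctypes.memmove plays that role here.
ctypes.memmove(buf.ctypes.data, src.ctypes.data, src.nbytes)

idx[0]  # now reflects the bytes copied in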
Now let's look at the behavior of the new Categorical Series. The underlying data, of course, is not the levels but the codes.
x2 = np.zeros(3, dtype=np.int64)
s2 = pd.Categorical.from_codes(x2, ["hello", "bye"])
s2
Out[27]:
[hello, hello, hello]
Categories (2, object): [hello, bye]
x2[0] = 1
s2[0]
Out[29]: 'hello'
y2 = s2.codes
y2[0] = 1
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-31-0366d645c98d> in <module>()
----> 1 y2[0] = 1
ValueError: assignment destination is read-only
y2 = s2._codes
y2[0] = 1
s2[0]
Out[34]: 'bye'
The net effect of this behavior is that as a developer, efficient manipulation of the underlying data for Categoricals is not part of the interface. Also as a user, the from_codes constructor is slow as it deep copies the codes, which may often be unnecessary. There should at least be an option for this.
But the fact that codes is a read-only variable and _codes needs to be used strikes me as worse. Why wouldn't .codes give the same behavior as .values? Is there some justification for this beyond the concept that the codes are "private"? I'm hoping some of the pandas gurus on stackoverflow can shed some light on this.

The Categorical type is different from almost all other types in that it is a compound type with an internal guarantee about its data: namely, that the codes provide a factorization of the levels.
So the argument against mutability is that it would be easy to break the codes-categories mapping, and it could be non-performant. Of course these could possibly be mitigated with checking on the setitem instead (but with some added code complexity).
The vast majority of users are not going to manipulate the codes/categories directly (and will only use the exposed methods), so this is really a protection against accidentally breaking these guarantees.
If you need to efficiently manipulate the underlying data, best/easiest is simply to pull out the codes/categories. Mutate them, then create a new Categorical (which is cheap if codes/categories are already provided).
e.g.
In [3]: s2 = pd.Categorical.from_codes(x2, ["hello", "bye"])
In [4]: s2
Out[4]:
[hello, hello, hello]
Categories (2, object): [hello, bye]
In [5]: s2.codes
Out[5]: array([0, 0, 0], dtype=int8)
In [6]: pd.Categorical(s2.codes + 1, s2.categories, fastpath=True)
Out[6]:
[bye, bye, bye]
Categories (2, object): [hello, bye]
Of course this is quite dangerous: if you added 2 instead, the codes would fall outside the range of the categories and the result would blow up. Manipulating the codes directly is simply buyer-beware.
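If you want that rebuild pattern with a seatbelt, here is a small sketch: validate the mutated codes against the categories before constructing the new Categorical (from_codes re-validates too, at the cost of the copy complained about above):

import numpy as np
import pandas as pd

s2 = pd.Categorical.from_codes(np.zeros(3, dtype=np.int8), ["hello", "bye"])

new_codes = s2.codes + 1                     # shift every code by one
assert new_codes.max() < len(s2.categories)  # this is what would fail if we added 2
assert new_codes.min() >= -1                 # -1 is the missing-value sentinel

safe = pd.Categorical.from_codes(new_codes, s2.categories)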

Related

`TypeError: '<' not supported between instances of 'str' and 'int'` Pandas [duplicate]

I have a pandas dataframe, df:
   c1   c2
0  10  100
1  11  110
2  12  120
How do I iterate over the rows of this dataframe? For every row, I want to be able to access its elements (values in cells) by the name of the columns. For example:
for row in df.rows:
    print(row['c1'], row['c2'])
I found a similar question which suggests using either of these:
for date, row in df.T.iteritems():
for row in df.iterrows():
But I do not understand what the row object is and how I can work with it.
DataFrame.iterrows is a generator which yields both the index and row (as a Series):
import pandas as pd
df = pd.DataFrame({'c1': [10, 11, 12], 'c2': [100, 110, 120]})
df = df.reset_index() # make sure indexes pair with number of rows
for index, row in df.iterrows():
    print(row['c1'], row['c2'])
10 100
11 110
12 120
How to iterate over rows in a DataFrame in Pandas
Answer: DON'T*!
Iteration in Pandas is an anti-pattern and is something you should only do when you have exhausted every other option. You should not use any function with "iter" in its name for more than a few thousand rows or you will have to get used to a lot of waiting.
Do you want to print a DataFrame? Use DataFrame.to_string().
Do you want to compute something? In that case, search for methods in this order (list modified from here):
Vectorization
Cython routines
List Comprehensions (vanilla for loop)
DataFrame.apply(): i) Reductions that can be performed in Cython, ii) Iteration in Python space
DataFrame.itertuples() and iteritems()
DataFrame.iterrows()
iterrows and itertuples (both receiving many votes in answers to this question) should be used in very rare circumstances, such as generating row objects/namedtuples for sequential processing, which is really the only thing these functions are useful for.
Appeal to Authority
The documentation page on iteration has a huge red warning box that says:
Iterating through pandas objects is generally slow. In many cases, iterating manually over the rows is not needed [...].
* It's actually a little more complicated than "don't". df.iterrows() is the correct answer to this question, but "vectorize your ops" is the better one. I will concede that there are circumstances where iteration cannot be avoided (for example, some operations where the result depends on the value computed for the previous row). However, it takes some familiarity with the library to know when. If you're not sure whether you need an iterative solution, you probably don't. PS: To know more about my rationale for writing this answer, skip to the very bottom.
Faster than Looping: Vectorization, Cython
A good number of basic operations and computations are "vectorised" by pandas (either through NumPy, or through Cythonized functions). This includes arithmetic, comparisons, (most) reductions, reshaping (such as pivoting), joins, and groupby operations. Look through the documentation on Essential Basic Functionality to find a suitable vectorised method for your problem.
If none exists, feel free to write your own using custom Cython extensions.
Next Best Thing: List Comprehensions*
List comprehensions should be your next port of call if 1) there is no vectorized solution available, 2) performance is important, but not important enough to go through the hassle of cythonizing your code, and 3) you're trying to perform an elementwise transformation on your data. There is a good amount of evidence to suggest that list comprehensions are sufficiently fast (and even sometimes faster) for many common Pandas tasks.
The formula is simple,
# Iterating over one column - `f` is some function that processes your data
result = [f(x) for x in df['col']]
# Iterating over two columns, use `zip`
result = [f(x, y) for x, y in zip(df['col1'], df['col2'])]
# Iterating over multiple columns - same data type
result = [f(row[0], ..., row[n]) for row in df[['col1', ...,'coln']].to_numpy()]
# Iterating over multiple columns - differing data type
result = [f(row[0], ..., row[n]) for row in zip(df['col1'], ..., df['coln'])]
If you can encapsulate your business logic into a function, you can use a list comprehension that calls it. You can make arbitrarily complex things work through the simplicity and speed of raw Python code.
Caveats
List comprehensions assume that your data is easy to work with - what that means is your data types are consistent and you don't have NaNs, but this cannot always be guaranteed.
The first one is more obvious, but when dealing with NaNs, prefer in-built pandas methods if they exist (because they have much better corner-case handling logic), or ensure your business logic includes appropriate NaN handling logic.
When dealing with mixed data types you should iterate over zip(df['A'], df['B'], ...) instead of df[['A', 'B']].to_numpy(), as the latter implicitly upcasts data to the most common type. For example, if A is int and B is string, to_numpy() yields an object array (and mixing int with float upcasts everything to float), which may not be what you want. Fortunately, zipping your columns together is the most straightforward workaround to this.
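A quick illustration of that upcasting (the frame here is hypothetical):

import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [0.5, 1.5], 'C': ['x', 'y']})

df[['A', 'B']].to_numpy().dtype   # float64 -- the ints in A were upcast
df[['A', 'C']].to_numpy().dtype   # object  -- mixed numeric/string
list(zip(df['A'], df['C']))       # [(1, 'x'), (2, 'y')] -- per-column types kept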
*Your mileage may vary for the reasons outlined in the Caveats section above.
An Obvious Example
Let's demonstrate the difference with a simple example of adding two pandas columns A + B. This is a vectorizable operation, so it will be easy to contrast the performance of the methods discussed above.
Benchmarking code, for your reference. The line at the bottom measures a function written in numpandas, a style of Pandas that mixes heavily with NumPy to squeeze out maximum performance. Writing numpandas code should be avoided unless you know what you're doing. Stick to the API where you can (i.e., prefer vec over vec_numpy).
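As a rough sketch of what such a benchmark can look like (the vec/vec_numpy names mirror the convention above; absolute timings depend on your machine and data size):

import timeit
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 100, size=(10_000, 2)), columns=['A', 'B'])

def vec(df):       return df['A'] + df['B']                         # pandas vectorized
def vec_numpy(df): return df['A'].to_numpy() + df['B'].to_numpy()   # "numpandas"
def lc(df):        return [a + b for a, b in zip(df['A'], df['B'])]
def ir(df):        return [row['A'] + row['B'] for _, row in df.iterrows()]

for fn in (vec, vec_numpy, lc, ir):
    print(fn.__name__, timeit.timeit(lambda: fn(df), number=10))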
I should mention, however, that it isn't always this cut and dry. Sometimes the answer to "what is the best method for an operation" is "it depends on your data". My advice is to test out different approaches on your data before settling on one.
My Personal Opinion *
Most of the analyses performed on the various alternatives to the iter family have been through the lens of performance. However, in most situations you will typically be working on a reasonably sized dataset (nothing beyond a few thousand or 100K rows), and performance will come second to simplicity/readability of the solution.
Here is my personal preference when selecting a method to use for a problem.
For the novice:
Vectorization (when possible); apply(); List Comprehensions; itertuples()/iteritems(); iterrows(); Cython
For the more experienced:
Vectorization (when possible); apply(); List Comprehensions; Cython; itertuples()/iteritems(); iterrows()
Vectorization prevails as the most idiomatic method for any problem that can be vectorized. Always seek to vectorize! When in doubt, consult the docs, or look on Stack Overflow for an existing question on your particular task.
I do tend to go on about how bad apply is in a lot of my posts, but I do concede it is easier for a beginner to wrap their head around what it's doing. Additionally, there are quite a few use cases for apply, as explained in this post of mine.
Cython ranks lower down on the list because it takes more time and effort to pull off correctly. You will rarely need pandas code that demands a level of performance even a list comprehension cannot satisfy.
* As with any personal opinion, please take with heaps of salt!
Further Reading
10 Minutes to pandas, and Essential Basic Functionality - Useful links that introduce you to Pandas and its library of vectorized*/cythonized functions.
Enhancing Performance - A primer from the documentation on enhancing standard Pandas operations
Are for-loops in pandas really bad? When should I care? - a detailed write-up by me on list comprehensions and their suitability for various operations (mainly ones involving non-numeric data)
When should I (not) want to use pandas apply() in my code? - apply is slow (but not as slow as the iter* family. There are, however, situations where one can (or should) consider apply as a serious alternative, especially in some GroupBy operations).
* Pandas string methods are "vectorized" in the sense that they are specified on the series but operate on each element. The underlying mechanisms are still iterative, because string operations are inherently hard to vectorize.
Why I Wrote this Answer
A common trend I notice from new users is to ask questions of the form "How can I iterate over my df to do X?", showing code that calls iterrows() while doing something inside a for loop. Here is why. A new user to the library who has not been introduced to the concept of vectorization will likely envision the code that solves their problem as iterating over their data to do something. Not knowing how to iterate over a DataFrame, the first thing they do is Google it and end up here, at this question. They then see the accepted answer telling them how to, and they close their eyes and run this code without ever first questioning if iteration is the right thing to do.
The aim of this answer is to help new users understand that iteration is not necessarily the solution to every problem, and that better, faster and more idiomatic solutions could exist, and that it is worth investing time in exploring them. I'm not trying to start a war of iteration vs. vectorization, but I want new users to be informed when developing solutions to their problems with this library.
First consider if you really need to iterate over rows in a DataFrame. See this answer for alternatives.
If you still need to iterate over rows, you can use the methods below. Note some important caveats which are not mentioned in any of the other answers.
DataFrame.iterrows()
for index, row in df.iterrows():
    print(row["c1"], row["c2"])
DataFrame.itertuples()
for row in df.itertuples(index=True, name='Pandas'):
    print(row.c1, row.c2)
itertuples() is supposed to be faster than iterrows()
But be aware, according to the docs (pandas 0.24.2 at the moment):
iterrows: dtype might not match from row to row
Because iterrows returns a Series for each row, it does not preserve dtypes across the rows (dtypes are preserved across columns for DataFrames). To preserve dtypes while iterating over the rows, it is better to use itertuples() which returns namedtuples of the values and which is generally much faster than iterrows()
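A two-column frame makes the difference visible:

import pandas as pd

df = pd.DataFrame({'a': [1], 'b': [1.5]})

_, row = next(df.iterrows())
row.dtype                           # float64 -- the int in 'a' was upcast

tup = next(df.itertuples(index=False))
type(tup.a), type(tup.b)            # integer and float types preserved per column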
iterrows: Do not modify rows
You should never modify something you are iterating over. This is not guaranteed to work in all cases. Depending on the data types, the iterator returns a copy and not a view, and writing to it will have no effect.
Use DataFrame.apply() instead:
new_df = df.apply(lambda x: x * 2, axis=1)
itertuples:
The column names will be renamed to positional names if they are invalid Python identifiers, repeated, or start with an underscore. With a large number of columns (>255), regular tuples are returned.
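For instance, with a hypothetical frame whose column names are not all valid identifiers:

import pandas as pd

df = pd.DataFrame({'My Col-Name': [1], '_private': [2], 'ok': [3]})
next(df.itertuples(index=False))
# Pandas(_0=1, _1=2, ok=3) -- the invalid and underscore-prefixed names became positional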
See pandas docs on iteration for more details.
You should use df.iterrows(), though iterating row-by-row is not especially efficient, since Series objects have to be created.
While iterrows() is a good option, sometimes itertuples() can be much faster:
from numpy.random import randn, randint
df = pd.DataFrame({'a': randn(1000), 'b': randn(1000), 'N': randint(100, 1000, (1000)), 'x': 'x'})
%timeit [row.a * 2 for idx, row in df.iterrows()]
# => 10 loops, best of 3: 50.3 ms per loop
%timeit [row[1] * 2 for row in df.itertuples()]
# => 1000 loops, best of 3: 541 µs per loop
You can use the df.iloc indexer as follows:
for i in range(0, len(df)):
    print(df.iloc[i]['c1'], df.iloc[i]['c2'])
You can also use df.apply() to iterate over rows and access multiple columns for a function.
docs: DataFrame.apply()
def valuation_formula(x, y):
    return x * y * 0.5

df['price'] = df.apply(lambda row: valuation_formula(row['x'], row['y']), axis=1)
How to iterate efficiently
If you really have to iterate a Pandas dataframe, you will probably want to avoid using iterrows(). There are different methods and the usual iterrows() is far from being the best. itertuples() can be 100 times faster.
In short:
As a general rule, use df.itertuples(name=None); in particular, when you have a fixed number of columns and fewer than 255 columns. See point (3)
Otherwise, use df.itertuples() except if your columns have special characters such as spaces or '-'. See point (2)
It is possible to use itertuples() even if your dataframe has strange columns by using the last example. See point (4)
Only use iterrows() if you cannot use the previous solutions. See point (1)
Different methods to iterate over rows in a Pandas dataframe:
Generate a random dataframe with a million rows and 4 columns:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 100, size=(1000000, 4)), columns=list('ABCD'))
print(df)
1) The usual iterrows() is convenient, but damn slow:
import time

start_time = time.perf_counter()   # time.clock() no longer exists in Python 3.8+
result = 0

for _, row in df.iterrows():
    result += max(row['B'], row['C'])

total_elapsed_time = round(time.perf_counter() - start_time, 2)
print("1. Iterrows done in {} seconds, result = {}".format(total_elapsed_time, result))
2) The default itertuples() is already much faster, but it doesn't work with column names such as My Col-Name is very Strange (you should avoid this method if your columns are repeated or if a column name cannot be simply converted to a Python variable name):
start_time = time.perf_counter()
result = 0

for row in df.itertuples(index=False):
    result += max(row.B, row.C)

total_elapsed_time = round(time.perf_counter() - start_time, 2)
print("2. Named Itertuples done in {} seconds, result = {}".format(total_elapsed_time, result))
3) itertuples() with name=None is even faster, but not really convenient, as you have to define a variable per column.
start_time = time.perf_counter()
result = 0

for (_, col1, col2, col3, col4) in df.itertuples(name=None):
    result += max(col2, col3)

total_elapsed_time = round(time.perf_counter() - start_time, 2)
print("3. Itertuples done in {} seconds, result = {}".format(total_elapsed_time, result))
4) Finally, the named itertuples() is slower than the previous point, but you do not have to define a variable per column and it works with column names such as My Col-Name is very Strange.
start_time = time.perf_counter()
result = 0

for row in df.itertuples(index=False):
    result += max(row[df.columns.get_loc('B')], row[df.columns.get_loc('C')])

total_elapsed_time = round(time.perf_counter() - start_time, 2)
print("4. Polyvalent Itertuples working even with special characters in the column name done in {} seconds, result = {}".format(total_elapsed_time, result))
Output:
A B C D
0 41 63 42 23
1 54 9 24 65
2 15 34 10 9
3 39 94 82 97
4 4 88 79 54
... .. .. .. ..
999995 48 27 4 25
999996 16 51 34 28
999997 1 39 61 14
999998 66 51 27 70
999999 51 53 47 99
[1000000 rows x 4 columns]
1. Iterrows done in 104.96 seconds, result = 66151519
2. Named Itertuples done in 1.26 seconds, result = 66151519
3. Itertuples done in 0.94 seconds, result = 66151519
4. Polyvalent Itertuples working even with special characters in the column name done in 2.94 seconds, result = 66151519
This article is a very interesting comparison between iterrows and itertuples
I was looking for how to iterate over rows and columns and ended up here, so:
for i, row in df.iterrows():
    for j, column in row.iteritems():
        print(column)
We have multiple options to do the same, and lots of folks have shared their answers.
I found the below two methods easy and efficient:
DataFrame.iterrows()
DataFrame.itertuples()
Example:
import pandas as pd
inp = [{'c1': 10, 'c2': 100}, {'c1': 11, 'c2': 110}, {'c1': 12, 'c2': 120}]
df = pd.DataFrame(inp)
print(df)

# With the iterrows method
for index, row in df.iterrows():
    print(row["c1"], row["c2"])

# With the itertuples method
for row in df.itertuples(index=True, name='Pandas'):
    print(row.c1, row.c2)
Note: itertuples() is supposed to be faster than iterrows()
You can write your own iterator that yields namedtuples:
from collections import namedtuple

def myiter(d, cols=None):
    if cols is None:
        v = d.values.tolist()
        cols = d.columns.values.tolist()
    else:
        j = [d.columns.get_loc(c) for c in cols]
        v = d.values[:, j].tolist()

    n = namedtuple('MyTuple', cols)

    for line in iter(v):
        yield n(*line)
This is directly comparable to pd.DataFrame.itertuples. I'm aiming at performing the same task with more efficiency.
For the given dataframe with my function:
list(myiter(df))
[MyTuple(c1=10, c2=100), MyTuple(c1=11, c2=110), MyTuple(c1=12, c2=120)]
Or with pd.DataFrame.itertuples:
list(df.itertuples(index=False))
[Pandas(c1=10, c2=100), Pandas(c1=11, c2=110), Pandas(c1=12, c2=120)]
A comprehensive test
We test making all columns available and subsetting the columns.
import numpy as np
import pandas as pd
from timeit import timeit

def iterfullA(d):
    return list(myiter(d))

def iterfullB(d):
    return list(d.itertuples(index=False))

def itersubA(d):
    return list(myiter(d, ['col3', 'col4', 'col5', 'col6', 'col7']))

def itersubB(d):
    return list(d[['col3', 'col4', 'col5', 'col6', 'col7']].itertuples(index=False))

res = pd.DataFrame(
    index=[10, 30, 100, 300, 1000, 3000, 10000, 30000],
    columns='iterfullA iterfullB itersubA itersubB'.split(),
    dtype=float
)

for i in res.index:
    d = pd.DataFrame(np.random.randint(10, size=(i, 10))).add_prefix('col')
    for j in res.columns:
        stmt = '{}(d)'.format(j)
        setp = 'from __main__ import d, {}'.format(j)
        res.at[i, j] = timeit(stmt, setp, number=100)

res.groupby(res.columns.str[4:-1], axis=1).plot(loglog=True);
To loop all rows in a dataframe you can use:
for x in range(len(date_example.index)):
    print(date_example['Date'].iloc[x])
for ind in df.index:
    print(df['c1'][ind], df['c2'][ind])
Update: cs95 has updated his answer to include plain numpy vectorization. You can simply refer to his answer.
cs95 shows that Pandas vectorization far outperforms other Pandas methods for computing stuff with dataframes.
I wanted to add that if you first convert the dataframe to a NumPy array and then use vectorization, it's even faster than Pandas dataframe vectorization (and that includes the time to turn it back into a Series).
If you add the following functions to cs95's benchmark code, this becomes pretty evident:
def np_vectorization(df):
    np_arr = df.to_numpy()
    return pd.Series(np_arr[:, 0] + np_arr[:, 1], index=df.index)

def just_np_vectorization(df):
    np_arr = df.to_numpy()
    return np_arr[:, 0] + np_arr[:, 1]
Sometimes a useful pattern is:
# Borrowing @KutalmisB's df example
df = pd.DataFrame({'col1': [1, 2], 'col2': [0.1, 0.2]}, index=['a', 'b'])

# The to_dict call results in a list of dicts,
# where each row_dict has k:v pairs of column:value for that row
for row_dict in df.to_dict(orient='records'):
    print(row_dict)
Which results in:
{'col1':1.0, 'col2':0.1}
{'col1':2.0, 'col2':0.2}
To loop all rows in a dataframe and use values of each row conveniently, namedtuples can be converted to ndarrays. For example:
df = pd.DataFrame({'col1': [1, 2], 'col2': [0.1, 0.2]}, index=['a', 'b'])
Iterating over the rows:
import numpy as np

for row in df.itertuples(index=False, name='Pandas'):
    print(np.asarray(row))
results in:
[ 1. 0.1]
[ 2. 0.2]
Please note that if index=True, the index is added as the first element of the tuple, which may be undesirable for some applications.
In short
Use vectorization if possible
If an operation can't be vectorized - use list comprehensions
If you need a single object representing the entire row - use itertuples
If the above is too slow - try swifter.apply (see the sketch after this list)
If it's still too slow - try a Cython routine
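Regarding swifter in the list above, a minimal sketch, assuming the third-party swifter package is installed (pip install swifter); importing it registers a .swifter accessor on pandas objects, and it falls back to plain apply or vectorization where appropriate:

import pandas as pd
import swifter  # noqa: F401 -- the import registers the .swifter accessor

df = pd.DataFrame({'x': range(100_000)})
result = df['x'].swifter.apply(lambda v: v * 2)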
There is a way to iterate through rows while getting a DataFrame in return, and not a Series. I don't see anyone mentioning that you can pass the index as a list for the row to be returned as a DataFrame:
for i in range(len(df)):
    row = df.iloc[[i]]
Note the usage of double brackets. This returns a DataFrame with a single row.
For both viewing and modifying values, I would use iterrows(). In a for loop, with tuple unpacking (see the example: i, row), I use row only for viewing the value, and I use i with the loc method when I want to modify values. As stated in previous answers, you should not modify something you are iterating over.
for i, row in df.iterrows():
    if row['A'] == 'Old_Value':       # view the value through the row copy
        df.loc[i, 'A'] = 'New_value'  # modify through the DataFrame itself
Here the row in the loop is a copy of that row, and not a view of it. Therefore, you should NOT write something like row['A'] = 'New_Value', it will not modify the DataFrame. However, you can use i and loc and specify the DataFrame to do the work.
Sometimes loops really are better than vectorized code
As many answers here correctly point out, your default plan in Pandas should be to write vectorized code (with its implicit loops) rather than attempting an explicit loop yourself. But the question remains whether you should ever write loops in Pandas, and if so what's the best way to loop in those situations.
I believe there is at least one general situation where loops are appropriate: when you need to calculate some function that depends on values in other rows in a somewhat complex manner. In this case, the looping code is often simpler, more readable, and less error prone than vectorized code.
The looping code might even be faster too, as you'll see below, so loops might make sense in cases where speed is of utmost importance. But really, those are just going to be subsets of cases where you probably should have been working in numpy/numba (rather than Pandas) to begin with, because optimized numpy/numba will almost always be faster than Pandas.
Let's show this with an example. Suppose you want to take a cumulative sum of a column, but reset it whenever some other column equals zero:
import pandas as pd
import numpy as np

df = pd.DataFrame({'x': [1, 2, 3, 4, 5, 6], 'y': [1, 1, 1, 0, 1, 1]})

#    x  y  desired_result
# 0  1  1       1
# 1  2  1       3
# 2  3  1       6
# 3  4  0       4
# 4  5  1       9
# 5  6  1      15
This is a good example where you could certainly write one line of Pandas to achieve this, although it's not especially readable, especially if you aren't fairly experienced with Pandas already:
df.groupby((df.y == 0).cumsum())['x'].cumsum()
That's going to be fast enough for most situations. You could also write faster code by avoiding the groupby, but it will likely be even less readable.
Alternatively, what if we write this as a loop? You could do something like the following with NumPy:
import numba as nb

@nb.jit(nopython=True)  # Optional
def custom_sum(x, y):
    x_sum = x.copy()
    for i in range(1, len(x)):
        if y[i] > 0: x_sum[i] = x_sum[i-1] + x[i]
    return x_sum

df['desired_result'] = custom_sum(df.x.to_numpy(), df.y.to_numpy())
Admittedly, there's a bit of overhead there required to convert DataFrame columns to NumPy arrays, but the core piece of code is just one line of code that you could read even if you didn't know anything about Pandas or NumPy:
if y[i] > 0: x_sum[i] = x_sum[i-1] + x[i]
And this code is actually faster than the vectorized code. In some quick tests with 100,000 rows, the above is about 10x faster than the groupby approach. Note that one key to the speed there is numba, which is optional. Without the "@nb.jit" line, the looping code is actually about 10x slower than the groupby approach.
Clearly this example is simple enough that you would likely prefer the one line of pandas to writing a loop with its associated overhead. However, there are more complex versions of this problem for which the readability or speed of the NumPy/numba loop approach likely makes sense.
There are so many ways to iterate over the rows in a Pandas dataframe. One very simple and intuitive way is:
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})
print(df)

for i in range(df.shape[0]):
    # For printing the second column
    print(df.iloc[i, 1])

    # For printing more than one column
    print(df.iloc[i, [0, 2]])
The easiest way is to use the apply function:
def print_row(row):
    print(row['c1'], row['c2'])

df.apply(lambda row: print_row(row), axis=1)
Probably the most elegant solution (but certainly not the most efficient):
for row in df.values:
    c2 = row[1]
    print(row)
    # ...

for c1, c2 in df.values:
    ...  # work with c1 and c2
Note that:
the documentation explicitly recommends using .to_numpy() instead
the produced NumPy array will have a dtype that fits all columns, in the worst case object
there are good reasons not to use a loop in the first place
Still, I think this option should be included here, as a straightforward solution to a (one should think) trivial problem.
You can also use NumPy indexing for even greater speed-ups. It's not really iterating, but it works much better than iteration for certain applications.
subset = df['c1'][0:5]   # the first five values of the column
entire = df['c1'][:]     # the whole column
You may also want to cast it to an array. These indexes/selections are supposed to act like NumPy arrays already, but I ran into issues and needed to cast:
np.asarray(entire)
imgs[:] = cv2.resize(imgs[:], (224, 224))   # Resize every image in an HDF5 file
df.iterrows() returns tuples (a, b), where a is the index and b is the row (a Series).
This example uses iloc to isolate each digit in the data frame.
import pandas as pd

a = [1, 2, 3, 4]
b = [5, 6, 7, 8]
mjr = pd.DataFrame({'a': a, 'b': b})
size = mjr.shape

for i in range(size[0]):
    for j in range(size[1]):
        print(mjr.iloc[i, j])
Disclaimer: Although there are so many answers here which recommend not using an iterative (loop) approach (and I mostly agree), I would still see it as a reasonable approach for the following situation:
Extend a dataframe with data from an API
Let's say you have a large dataframe which contains incomplete user data. Now you have to extend this data with additional columns, for example, the user's age and gender.
Both values have to be fetched from a backend API. I'm assuming the API doesn't provide a "batch" endpoint (which would accept multiple user IDs at once). Otherwise, you should rather call the API only once.
The costs (waiting time) for the network request surpass the iteration of the dataframe by far. We're talking about network round trip times of hundreds of milliseconds compared to the negligibly small gains in using alternative approaches to iterations.
One expensive network request for each row
So in this case, I would absolutely prefer using an iterative approach. Although the network request is expensive, it is guaranteed to be triggered only once for each row in the dataframe. Here is an example using DataFrame.iterrows:
Example
for index, row in users_df.iterrows():
    user_id = row['user_id']

    # Trigger expensive network request once for each row
    response_dict = backend_api.get(f'/api/user-data/{user_id}')

    # Extend dataframe with multiple data from response
    users_df.at[index, 'age'] = response_dict.get('age')
    users_df.at[index, 'gender'] = response_dict.get('gender')
Some libraries (e.g. a Java interop library that I use) require values to be passed in a row at a time, for example, when streaming data. To replicate the streaming nature, I 'stream' my dataframe values one by one. I wrote the below, which comes in handy from time to time.
from typing import Dict, List

import pandas as pd

class DataFrameReader:
    def __init__(self, df):
        self._df = df
        self._row = None
        self._columns = df.columns.tolist()
        self.reset()
        self.row_index = 0

    def __getattr__(self, key):
        return self.__getitem__(key)

    def read(self) -> bool:
        self._row = next(self._iterator, None)
        self.row_index += 1
        return self._row is not None

    def columns(self):
        return self._columns

    def reset(self) -> None:
        self._iterator = self._df.itertuples()

    def get_index(self):
        return self._row[0]

    def index(self):
        return self._row[0]

    def to_dict(self, columns: List[str] = None):
        return self.row(columns=columns)

    def tolist(self, cols: List[str] = None) -> List[object]:
        cols = self._columns if cols is None else cols
        return [self.__getitem__(c) for c in cols]

    def row(self, columns: List[str] = None) -> Dict[str, object]:
        cols = set(self._columns if columns is None else columns)
        return {c: self.__getitem__(c) for c in self._columns if c in cols}

    def __getitem__(self, key) -> object:
        # The df index of the row sits at position 0 of each itertuples row,
        # so column positions are shifted by one.
        try:
            if type(key) is list:
                ix = [self._columns.index(k) + 1 for k in key]
                return [self._row[i] for i in ix]
            ix = self._columns.index(key) + 1
            return self._row[ix]
        except BaseException:
            return None  # unknown columns resolve to None

    def __next__(self) -> 'DataFrameReader':
        if self.read():
            return self
        raise StopIteration

    def __iter__(self) -> 'DataFrameReader':
        return self
Which can be used:
for row in DataFrameReader(df):
    print(row.my_column_name)
    print(row.to_dict())
    print(row['my_column_name'])
    print(row.tolist())
It preserves the value/name mapping for the rows being iterated. Obviously, it is a lot slower than using apply or Cython as indicated above, but it is necessary in some circumstances.
As the accepted answer states, the fastest way to apply a function over rows is to use a vectorized function, the so-called NumPy ufuncs (universal functions).
But what should you do when the function you want to apply isn't already implemented in NumPy?
Well, using the vectorize decorator from numba, you can easily create ufuncs directly in Python like this:
from numba import vectorize, float64

@vectorize([float64(float64)])
def f(x):
    # x is one value from your column; do something with it and return a float.
    return x * 2.0  # placeholder body
The documentation for this function is here: Creating NumPy universal functions
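A small usage sketch (remember the doubling body above is only a placeholder):

import numpy as np
import pandas as pd

s = pd.Series(np.arange(5, dtype=np.float64))
f(s.to_numpy())   # the compiled ufunc runs elementwise: array([0., 2., 4., 6., 8.])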
Along with the great answers in this post, I am going to propose a Divide and Conquer approach. I am not writing this answer to abolish the other great answers, but to complement them with another approach which was working efficiently for me. It has two steps: splitting and merging the pandas dataframe.
PROS of Divide and Conquer:
You don't need to use vectorization or any other methods to cast the type of your dataframe into another type
You don't need to Cythonize your code which normally takes extra time from you
Both iterrows() and itertuples() in my case had the same performance over the entire dataframe
Depending on your choice of slice size, you can considerably speed up the iteration: the smaller each slice, the faster the per-record iteration rate (see the benchmark below)
CONS of Divide and Conquer:
The iteration over one slice must not depend on data in a different slice; if you need to read from or write to another slice, that may be difficult to do.
=================== Divide and Conquer Approach =================
Step 1: Splitting/Slicing
In this step, we are going to divide the iteration over the entire dataframe. Suppose you are going to read a CSV file into a pandas df and then iterate over it. In my case I have 5,000,000 records, and I am going to split them into slices of 100,000 records.
NOTE: As the runtime analyses in the other solutions on this page also show, the per-record iteration rate degrades sharply as the number of records in the df grows. Based on the benchmark on my data, here are the results:
Number of records | Iteration rate [per second]
==============================================
          100,000 | 500
          500,000 | 200
        1,000,000 |  50
        5,000,000 |  20
Step 2: Merging
This is going to be an easy step, just merge all the written CSV files into one dataframe and write it into a bigger CSV file.
Here is the sample code:
# Step 1 (Splitting/Slicing)
import pandas as pd

df_all = pd.read_csv('C:/KtV.csv')
df_index = 100000
df_len = len(df_all)

for i in range(df_len // df_index + 1):
    lower_bound = i * df_index
    higher_bound = min(lower_bound + df_index, df_len)

    # Splitting/slicing df (make sure to copy(), otherwise it will be a view)
    df = df_all[lower_bound:higher_bound].copy()

    '''
    Write your iteration over the sliced df here
    using iterrows() or itertuples() or ...
    '''

    # Writing into CSV files
    df.to_csv('C:/KtV_prep_' + str(i) + '.csv')

# Step 2 (Merging)
filename = 'C:/KtV_prep_'

df = (pd.read_csv(f) for f in [filename + str(i) + '.csv' for i in range(df_len // df_index + 1)])
df_prep_all = pd.concat(df)
df_prep_all.to_csv('C:/KtV_prep_all.csv')
Reference:
Efficient way of iteration over dataframe
Concatenate CSV files into one Pandas Dataframe

How does NumPy ndarray.view(...) work internally

I've used the following code to create a view, but I want to understand how this view works under the hood:
>>> x = np.array([(1, 2)], dtype=np.int8)
>>> y = x.view(dtype=np.int16)
Where can I find the source code for ndarray.view(...)? I searched through the GitHub code repository but couldn't find it.
What are you most interested in - the coding mechanics, or how int8 values are represented as int16?
view creates a new array, with its own shape and dtype, but sharing the data buffer with the source. Most of the code that you'll see has to do with creating that new array, with little to nothing about the specific dtypes.
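You can observe the shared buffer directly:

import numpy as np

x = np.array([1, 2], dtype=np.int8)
y = x.view(np.int16)

y[0] = 0      # write through the view...
x             # array([0, 0], dtype=int8) -- ...and the source sees it
y.base is x   # True: y does not own its data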
There are actually two versions of int16: big-endian and little-endian.
In [194]: np.array([(1,2)],np.int8).view('<i2')
Out[194]: array([[513]], dtype=int16)
In [195]: np.array([(1,2)],np.int8).view('>i2')
Out[195]: array([[258]], dtype=int16)
np.int8 is a single byte, so an int16 combines two of them into one number. The values we see depend on how those 2 bytes are combined.
In [197]: 2*256+1
Out[197]: 513
In [198]: 1*256+2
Out[198]: 258
My guess is that you won't see this level of detail in the numpy C code. It's performed by the C compiler.
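The same combination can be reproduced with plain Python, which makes the byte-order dependence explicit:

import numpy as np

b = np.array([1, 2], dtype=np.int8).tobytes()   # b'\x01\x02'
int.from_bytes(b, 'little')   # 513 == 2*256 + 1
int.from_bytes(b, 'big')      # 258 == 1*256 + 2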

Iterate pandas data frames by name of columns with pd.isnull [duplicate]

I have a pandas dataframe, df:
c1 c2
0 10 100
1 11 110
2 12 120
How do I iterate over the rows of this dataframe? For every row, I want to be able to access its elements (values in cells) by the name of the columns. For example:
for row in df.rows:
print(row['c1'], row['c2'])
I found a similar question which suggests using either of these:
for date, row in df.T.iteritems():
for row in df.iterrows():
But I do not understand what the row object is and how I can work with it.
DataFrame.iterrows is a generator which yields both the index and row (as a Series):
import pandas as pd
df = pd.DataFrame({'c1': [10, 11, 12], 'c2': [100, 110, 120]})
df = df.reset_index() # make sure indexes pair with number of rows
for index, row in df.iterrows():
print(row['c1'], row['c2'])
10 100
11 110
12 120
How to iterate over rows in a DataFrame in Pandas
Answer: DON'T*!
Iteration in Pandas is an anti-pattern and is something you should only do when you have exhausted every other option. You should not use any function with "iter" in its name for more than a few thousand rows or you will have to get used to a lot of waiting.
Do you want to print a DataFrame? Use DataFrame.to_string().
Do you want to compute something? In that case, search for methods in this order (list modified from here):
Vectorization
Cython routines
List Comprehensions (vanilla for loop)
DataFrame.apply(): i)  Reductions that can be performed in Cython, ii) Iteration in Python space
DataFrame.itertuples() and iteritems()
DataFrame.iterrows()
iterrows and itertuples (both receiving many votes in answers to this question) should be used in very rare circumstances, such as generating row objects/nametuples for sequential processing, which is really the only thing these functions are useful for.
Appeal to Authority
The documentation page on iteration has a huge red warning box that says:
Iterating through pandas objects is generally slow. In many cases, iterating manually over the rows is not needed [...].
* It's actually a little more complicated than "don't". df.iterrows() is the correct answer to this question, but "vectorize your ops" is the better one. I will concede that there are circumstances where iteration cannot be avoided (for example, some operations where the result depends on the value computed for the previous row). However, it takes some familiarity with the library to know when. If you're not sure whether you need an iterative solution, you probably don't. PS: To know more about my rationale for writing this answer, skip to the very bottom.
Faster than Looping: Vectorization, Cython
A good number of basic operations and computations are "vectorised" by pandas (either through NumPy, or through Cythonized functions). This includes arithmetic, comparisons, (most) reductions, reshaping (such as pivoting), joins, and groupby operations. Look through the documentation on Essential Basic Functionality to find a suitable vectorised method for your problem.
If none exists, feel free to write your own using custom Cython extensions.
Next Best Thing: List Comprehensions*
List comprehensions should be your next port of call if 1) there is no vectorized solution available, 2) performance is important, but not important enough to go through the hassle of cythonizing your code, and 3) you're trying to perform elementwise transformation on your code. There is a good amount of evidence to suggest that list comprehensions are sufficiently fast (and even sometimes faster) for many common Pandas tasks.
The formula is simple,
# Iterating over one column - `f` is some function that processes your data
result = [f(x) for x in df['col']]
# Iterating over two columns, use `zip`
result = [f(x, y) for x, y in zip(df['col1'], df['col2'])]
# Iterating over multiple columns - same data type
result = [f(row[0], ..., row[n]) for row in df[['col1', ...,'coln']].to_numpy()]
# Iterating over multiple columns - differing data type
result = [f(row[0], ..., row[n]) for row in zip(df['col1'], ..., df['coln'])]
If you can encapsulate your business logic into a function, you can use a list comprehension that calls it. You can make arbitrarily complex things work through the simplicity and speed of raw Python code.
Caveats
List comprehensions assume that your data is easy to work with - what that means is your data types are consistent and you don't have NaNs, but this cannot always be guaranteed.
The first one is more obvious, but when dealing with NaNs, prefer in-built pandas methods if they exist (because they have much better corner-case handling logic), or ensure your business logic includes appropriate NaN handling logic.
When dealing with mixed data types you should iterate over zip(df['A'], df['B'], ...) instead of df[['A', 'B']].to_numpy() as the latter implicitly upcasts data to the most common type. As an example if A is numeric and B is string, to_numpy() will cast the entire array to string, which may not be what you want. Fortunately zipping your columns together is the most straightforward workaround to this.
*Your mileage may vary for the reasons outlined in the Caveats section above.
An Obvious Example
Let's demonstrate the difference with a simple example of adding two pandas columns A + B. This is a vectorizable operation, so it will be easy to contrast the performance of the methods discussed above.
Benchmarking code, for your reference. The line at the bottom measures a function written in numpandas, a style of Pandas that mixes heavily with NumPy to squeeze out maximum performance. Writing numpandas code should be avoided unless you know what you're doing. Stick to the API where you can (i.e., prefer vec over vec_numpy).
I should mention, however, that it isn't always this cut and dry. Sometimes the answer to "what is the best method for an operation" is "it depends on your data". My advice is to test out different approaches on your data before settling on one.
My Personal Opinion *
Most of the analyses performed on the various alternatives to the iter family has been through the lens of performance. However, in most situations you will typically be working on a reasonably sized dataset (nothing beyond a few thousand or 100K rows) and performance will come second to simplicity/readability of the solution.
Here is my personal preference when selecting a method to use for a problem.
For the novice:
Vectorization (when possible); apply(); List Comprehensions; itertuples()/iteritems(); iterrows(); Cython
For the more experienced:
Vectorization (when possible); apply(); List Comprehensions; Cython; itertuples()/iteritems(); iterrows()
Vectorization prevails as the most idiomatic method for any problem that can be vectorized. Always seek to vectorize! When in doubt, consult the docs, or look on Stack Overflow for an existing question on your particular task.
I do tend to go on about how bad apply is in a lot of my posts, but I do concede it is easier for a beginner to wrap their head around what it's doing. Additionally, there are quite a few use cases for apply has explained in this post of mine.
Cython ranks lower down on the list because it takes more time and effort to pull off correctly. You will usually never need to write code with pandas that demands this level of performance that even a list comprehension cannot satisfy.
* As with any personal opinion, please take with heaps of salt!
Further Reading
10 Minutes to pandas, and Essential Basic Functionality - Useful links that introduce you to Pandas and its library of vectorized*/cythonized functions.
Enhancing Performance - A primer from the documentation on enhancing standard Pandas operations
Are for-loops in pandas really bad? When should I care? - a detailed write-up by me on list comprehensions and their suitability for various operations (mainly ones involving non-numeric data)
When should I (not) want to use pandas apply() in my code? - apply is slow (but not as slow as the iter* family. There are, however, situations where one can (or should) consider apply as a serious alternative, especially in some GroupBy operations).
* Pandas string methods are "vectorized" in the sense that they are specified on the series but operate on each element. The underlying mechanisms are still iterative, because string operations are inherently hard to vectorize.
Why I Wrote this Answer
A common trend I notice from new users is to ask questions of the form "How can I iterate over my df to do X?". Showing code that calls iterrows() while doing something inside a for loop. Here is why. A new user to the library who has not been introduced to the concept of vectorization will likely envision the code that solves their problem as iterating over their data to do something. Not knowing how to iterate over a DataFrame, the first thing they do is Google it and end up here, at this question. They then see the accepted answer telling them how to, and they close their eyes and run this code without ever first questioning if iteration is the right thing to do.
The aim of this answer is to help new users understand that iteration is not necessarily the solution to every problem, and that better, faster and more idiomatic solutions could exist, and that it is worth investing time in exploring them. I'm not trying to start a war of iteration vs. vectorization, but I want new users to be informed when developing solutions to their problems with this library.
First consider if you really need to iterate over rows in a DataFrame. See this answer for alternatives.
If you still need to iterate over rows, you can use methods below. Note some important caveats which are not mentioned in any of the other answers.
DataFrame.iterrows()
for index, row in df.iterrows():
print(row["c1"], row["c2"])
DataFrame.itertuples()
for row in df.itertuples(index=True, name='Pandas'):
print(row.c1, row.c2)
itertuples() is supposed to be faster than iterrows()
But be aware, according to the docs (pandas 0.24.2 at the moment):
iterrows: dtype might not match from row to row
Because iterrows returns a Series for each row, it does not preserve dtypes across the rows (dtypes are preserved across columns for DataFrames). To preserve dtypes while iterating over the rows, it is better to use itertuples() which returns namedtuples of the values and which is generally much faster than iterrows()
iterrows: Do not modify rows
You should never modify something you are iterating over. This is not guaranteed to work in all cases. Depending on the data types, the iterator returns a copy and not a view, and writing to it will have no effect.
Use DataFrame.apply() instead:
new_df = df.apply(lambda x: x * 2, axis = 1)
itertuples:
The column names will be renamed to positional names if they are invalid Python identifiers, repeated, or start with an underscore. With a large number of columns (>255), regular tuples are returned.
See pandas docs on iteration for more details.
You should use df.iterrows(). Though iterating row-by-row is not especially efficient since Series objects have to be created.
While iterrows() is a good option, sometimes itertuples() can be much faster:
df = pd.DataFrame({'a': randn(1000), 'b': randn(1000),'N': randint(100, 1000, (1000)), 'x': 'x'})
%timeit [row.a * 2 for idx, row in df.iterrows()]
# => 10 loops, best of 3: 50.3 ms per loop
%timeit [row[1] * 2 for row in df.itertuples()]
# => 1000 loops, best of 3: 541 µs per loop
You can use the df.iloc function as follows:
for i in range(0, len(df)):
print(df.iloc[i]['c1'], df.iloc[i]['c2'])
You can also use df.apply() to iterate over rows and access multiple columns for a function.
docs: DataFrame.apply()
def valuation_formula(x, y):
return x * y * 0.5
df['price'] = df.apply(lambda row: valuation_formula(row['x'], row['y']), axis=1)
How to iterate efficiently
If you really have to iterate a Pandas dataframe, you will probably want to avoid using iterrows(). There are different methods and the usual iterrows() is far from being the best. itertuples() can be 100 times faster.
In short:
As a general rule, use df.itertuples(name=None). In particular, when you have a fixed number columns and less than 255 columns. See point (3)
Otherwise, use df.itertuples() except if your columns have special characters such as spaces or '-'. See point (2)
It is possible to use itertuples() even if your dataframe has strange columns by using the last example. See point (4)
Only use iterrows() if you cannot the previous solutions. See point (1)
Different methods to iterate over rows in a Pandas dataframe:
Generate a random dataframe with a million rows and 4 columns:
df = pd.DataFrame(np.random.randint(0, 100, size=(1000000, 4)), columns=list('ABCD'))
print(df)
1) The usual iterrows() is convenient, but damn slow:
start_time = time.clock()
result = 0
for _, row in df.iterrows():
result += max(row['B'], row['C'])
total_elapsed_time = round(time.clock() - start_time, 2)
print("1. Iterrows done in {} seconds, result = {}".format(total_elapsed_time, result))
2) The default itertuples() is already much faster, but it doesn't work with column names such as My Col-Name is very Strange (you should avoid this method if your columns are repeated or if a column name cannot be simply converted to a Python variable name).:
start_time = time.clock()
result = 0
for row in df.itertuples(index=False):
result += max(row.B, row.C)
total_elapsed_time = round(time.clock() - start_time, 2)
print("2. Named Itertuples done in {} seconds, result = {}".format(total_elapsed_time, result))
3) The default itertuples() using name=None is even faster but not really convenient as you have to define a variable per column.
start_time = time.clock()
result = 0
for(_, col1, col2, col3, col4) in df.itertuples(name=None):
result += max(col2, col3)
total_elapsed_time = round(time.clock() - start_time, 2)
print("3. Itertuples done in {} seconds, result = {}".format(total_elapsed_time, result))
4) Finally, the named itertuples() is slower than the previous point, but you do not have to define a variable per column and it works with column names such as My Col-Name is very Strange.
start_time = time.clock()
result = 0
for row in df.itertuples(index=False):
result += max(row[df.columns.get_loc('B')], row[df.columns.get_loc('C')])
total_elapsed_time = round(time.clock() - start_time, 2)
print("4. Polyvalent Itertuples working even with special characters in the column name done in {} seconds, result = {}".format(total_elapsed_time, result))
Output:
A B C D
0 41 63 42 23
1 54 9 24 65
2 15 34 10 9
3 39 94 82 97
4 4 88 79 54
... .. .. .. ..
999995 48 27 4 25
999996 16 51 34 28
999997 1 39 61 14
999998 66 51 27 70
999999 51 53 47 99
[1000000 rows x 4 columns]
1. Iterrows done in 104.96 seconds, result = 66151519
2. Named Itertuples done in 1.26 seconds, result = 66151519
3. Itertuples done in 0.94 seconds, result = 66151519
4. Polyvalent Itertuples working even with special characters in the column name done in 2.94 seconds, result = 66151519
This article is a very interesting comparison between iterrows and itertuples
I was looking for How to iterate on rows and columns and ended here so:
for i, row in df.iterrows():
for j, column in row.iteritems():
print(column)
We have multiple options to do the same, and lots of folks have shared their answers.
I found the below two methods easy and efficient to do:
DataFrame.iterrows()
DataFrame.itertuples()
Example:
import pandas as pd
inp = [{'c1':10, 'c2':100}, {'c1':11,'c2':110}, {'c1':12,'c2':120}]
df = pd.DataFrame(inp)
print (df)
# With the iterrows method
for index, row in df.iterrows():
print(row["c1"], row["c2"])
# With the itertuples method
for row in df.itertuples(index=True, name='Pandas'):
print(row.c1, row.c2)
Note: itertuples() is supposed to be faster than iterrows()
You can write your own iterator that implements namedtuple
from collections import namedtuple
def myiter(d, cols=None):
if cols is None:
v = d.values.tolist()
cols = d.columns.values.tolist()
else:
j = [d.columns.get_loc(c) for c in cols]
v = d.values[:, j].tolist()
n = namedtuple('MyTuple', cols)
for line in iter(v):
yield n(*line)
This is directly comparable to pd.DataFrame.itertuples. I'm aiming at performing the same task with more efficiency.
For the given dataframe with my function:
list(myiter(df))
[MyTuple(c1=10, c2=100), MyTuple(c1=11, c2=110), MyTuple(c1=12, c2=120)]
Or with pd.DataFrame.itertuples:
list(df.itertuples(index=False))
[Pandas(c1=10, c2=100), Pandas(c1=11, c2=110), Pandas(c1=12, c2=120)]
A comprehensive test
We test making all columns available and subsetting the columns.
def iterfullA(d):
return list(myiter(d))
def iterfullB(d):
return list(d.itertuples(index=False))
def itersubA(d):
return list(myiter(d, ['col3', 'col4', 'col5', 'col6', 'col7']))
def itersubB(d):
return list(d[['col3', 'col4', 'col5', 'col6', 'col7']].itertuples(index=False))
res = pd.DataFrame(
index=[10, 30, 100, 300, 1000, 3000, 10000, 30000],
columns='iterfullA iterfullB itersubA itersubB'.split(),
dtype=float
)
for i in res.index:
d = pd.DataFrame(np.random.randint(10, size=(i, 10))).add_prefix('col')
for j in res.columns:
stmt = '{}(d)'.format(j)
setp = 'from __main__ import d, {}'.format(j)
res.at[i, j] = timeit(stmt, setp, number=100)
res.groupby(res.columns.str[4:-1], axis=1).plot(loglog=True);
To loop all rows in a dataframe you can use:
for x in range(len(date_example.index)):
print date_example['Date'].iloc[x]
for ind in df.index:
print df['c1'][ind], df['c2'][ind]
Update: cs95 has updated his answer to include plain numpy vectorization. You can simply refer to his answer.
cs95 shows that Pandas vectorization far outperforms other Pandas methods for computing stuff with dataframes.
I wanted to add that if you first convert the dataframe to a NumPy array and then use vectorization, it's even faster than Pandas dataframe vectorization, (and that includes the time to turn it back into a dataframe series).
If you add the following functions to cs95's benchmark code, this becomes pretty evident:
def np_vectorization(df):
np_arr = df.to_numpy()
return pd.Series(np_arr[:,0] + np_arr[:,1], index=df.index)
def just_np_vectorization(df):
np_arr = df.to_numpy()
return np_arr[:,0] + np_arr[:,1]
Sometimes a useful pattern is:
# Borrowing #KutalmisB df example
df = pd.DataFrame({'col1': [1, 2], 'col2': [0.1, 0.2]}, index=['a', 'b'])
# The to_dict call results in a list of dicts
# where each row_dict is a dictionary with k:v pairs of columns:value for that row
for row_dict in df.to_dict(orient='records'):
print(row_dict)
Which results in:
{'col1':1.0, 'col2':0.1}
{'col1':2.0, 'col2':0.2}
To loop all rows in a dataframe and use values of each row conveniently, namedtuples can be converted to ndarrays. For example:
df = pd.DataFrame({'col1': [1, 2], 'col2': [0.1, 0.2]}, index=['a', 'b'])
Iterating over the rows:
for row in df.itertuples(index=False, name='Pandas'):
print np.asarray(row)
results in:
[ 1. 0.1]
[ 2. 0.2]
Please note that if index=True, the index is added as the first element of the tuple, which may be undesirable for some applications.
In short
Use vectorization if possible
If an operation can't be vectorized - use list comprehensions
If you need a single object representing the entire row - use itertuples
If the above is too slow - try swifter.apply
If it's still too slow - try a Cython routine
Benchmark
There is a way to iterate throw rows while getting a DataFrame in return, and not a Series. I don't see anyone mentioning that you can pass index as a list for the row to be returned as a DataFrame:
for i in range(len(df)):
row = df.iloc[[i]]
Note the usage of double brackets. This returns a DataFrame with a single row.
For both viewing and modifying values, I would use iterrows(). In a for loop, with tuple unpacking (see the example: i, row), I use row only for viewing the value, and I use i with the loc method when I want to modify values. As stated in previous answers, you should not modify something you are iterating over.
for i, row in df.iterrows():
    if row['A'] == 'Old_Value':
        df.loc[i, 'A'] = 'New_value'
Here the row in the loop is a copy of that row, not a view of it. Therefore, you should NOT write something like row['A'] = 'New_Value'; it will not modify the DataFrame. However, you can use i and loc and specify the DataFrame to do the work.
Sometimes loops really are better than vectorized code
As many answers here correctly point out, your default plan in Pandas should be to write vectorized code (with its implicit loops) rather than attempting an explicit loop yourself. But the question remains whether you should ever write loops in Pandas, and if so what's the best way to loop in those situations.
I believe there is at least one general situation where loops are appropriate: when you need to calculate some function that depends on values in other rows in a somewhat complex manner. In this case, the looping code is often simpler, more readable, and less error prone than vectorized code.
The looping code might even be faster too, as you'll see below, so loops might make sense in cases where speed is of utmost importance. But really, those are just going to be subsets of cases where you probably should have been working in numpy/numba (rather than Pandas) to begin with, because optimized numpy/numba will almost always be faster than Pandas.
Let's show this with an example. Suppose you want to take a cumulative sum of a column, but reset it whenever some other column equals zero:
import pandas as pd
import numpy as np
df = pd.DataFrame( { 'x':[1,2,3,4,5,6], 'y':[1,1,1,0,1,1] } )
# x y desired_result
#0 1 1 1
#1 2 1 3
#2 3 1 6
#3 4 0 4
#4 5 1 9
#5 6 1 15
This is a good example where you could certainly write one line of Pandas to achieve this, although it's not especially readable, especially if you aren't fairly experienced with Pandas already:
df.groupby( (df.y==0).cumsum() )['x'].cumsum()
That's going to be fast enough for most situations, and although you could write even faster code by avoiding the groupby, it would likely be even less readable.
Alternatively, what if we write this as a loop? You could do something like the following, looping over NumPy arrays with an optional numba JIT:
import numba as nb

@nb.jit(nopython=True)  # Optional
def custom_sum(x, y):
    x_sum = x.copy()
    for i in range(1, len(x)):
        if y[i] > 0: x_sum[i] = x_sum[i-1] + x[i]
    return x_sum

df['desired_result'] = custom_sum(df.x.to_numpy(), df.y.to_numpy())
Admittedly, there's a bit of overhead there required to convert DataFrame columns to NumPy arrays, but the core piece of code is just one line of code that you could read even if you didn't know anything about Pandas or NumPy:
if y[i] > 0: x_sum[i] = x_sum[i-1] + x[i]
And this code is actually faster than the vectorized code. In some quick tests with 100,000 rows, the above is about 10x faster than the groupby approach. Note that one key to the speed there is numba, which is optional. Without the @nb.jit line, the looping code is actually about 10x slower than the groupby approach.
Clearly this example is simple enough that you would likely prefer the one line of pandas to writing a loop with its associated overhead. However, there are more complex versions of this problem for which the readability or speed of the NumPy/numba loop approach likely makes sense.
There are many ways to iterate over the rows in a Pandas dataframe. One very simple and intuitive way is:
df = pd.DataFrame({'A':[1, 2, 3], 'B':[4, 5, 6], 'C':[7, 8, 9]})
print(df)
for i in range(df.shape[0]):
    # For printing the second column
    print(df.iloc[i, 1])
    # For printing more than one column
    print(df.iloc[i, [0, 2]])
The easiest way is to use the apply function:
def print_row(row):
    print(row['c1'], row['c2'])

df.apply(print_row, axis=1)
Probably the most elegant solution (but certainly not the most efficient):
for row in df.values:
    c2 = row[1]
    print(row)
    # ...

for c1, c2 in df.values:
    # ...
Note that:
the documentation explicitly recommends using .to_numpy() instead
the produced NumPy array will have a dtype that fits all columns, in the worst case object
there are good reasons not to use a loop in the first place
Still, I think this option should be included here, as a straightforward solution to a (one should think) trivial problem.
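Following that recommendation, the same pattern with .to_numpy() might look like this (a sketch reusing the c1/c2 naming from the question):
for row in df.to_numpy():
    c1, c2 = row[0], row[1]
    print(c1, c2)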
You can also use NumPy indexing for even greater speed-ups. It's not really iterating, but it works much better than iteration for certain applications.
subset = row['c1'][0:5]
all = row['c1'][:]
You may also want to cast the result to an array. These indexes/selections are supposed to act like NumPy arrays already, but I ran into issues and needed to cast:
np.asarray(all)
imgs[:] = cv2.resize(imgs[:], (224, 224))  # Resize every image in an hdf5 file
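To make that idea concrete, here is a small sketch of what I take this to mean: pull a column out as one array and slice it, instead of touching rows one at a time (the frame and names are illustrative):
import pandas as pd

df = pd.DataFrame({'c1': range(10)})
c1 = df['c1'].to_numpy()   # one contiguous array, no per-row access
subset = c1[0:5]           # the first five values in a single operation
print(subset)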
df.iterrows() returns tuples (a, b), where a is the index and b is the row.
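For example:
for a, b in df.iterrows():
    print(a)        # the index label
    print(b['c1'])  # a value from the row, which is a Series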
This example uses iloc to isolate each digit in the data frame.
import pandas as pd
a = [1, 2, 3, 4]
b = [5, 6, 7, 8]
mjr = pd.DataFrame({'a':a, 'b':b})
size = mjr.shape
for i in range(size[0]):
    for j in range(size[1]):
        print(mjr.iloc[i, j])
Disclaimer: Although there are many answers here that recommend against an iterative (loop) approach (and I mostly agree), I would still see it as a reasonable approach for the following situation:
Extend a dataframe with data from an API
Let's say you have a large dataframe which contains incomplete user data. Now you have to extend this data with additional columns, for example, the user's age and gender.
Both values have to be fetched from a backend API. I'm assuming the API doesn't provide a "batch" endpoint (which would accept multiple user IDs at once). Otherwise, you should rather call the API only once.
The costs (waiting time) for the network request surpass the iteration of the dataframe by far. We're talking about network round trip times of hundreds of milliseconds compared to the negligibly small gains in using alternative approaches to iterations.
One expensive network request for each row
So in this case, I would absolutely prefer using an iterative approach. Although the network request is expensive, it is guaranteed to be triggered only once for each row in the dataframe. Here is an example using DataFrame.iterrows:
Example
for index, row in users_df.iterrows():
    user_id = row['user_id']
    # Trigger expensive network request once for each row
    response_dict = backend_api.get(f'/api/user-data/{user_id}')
    # Extend dataframe with multiple data from response
    users_df.at[index, 'age'] = response_dict.get('age')
    users_df.at[index, 'gender'] = response_dict.get('gender')
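As an aside, if the backend did offer a batch endpoint, one request could replace the whole loop. The endpoint, the params argument, and the response shape below are purely hypothetical:
# Hypothetical batch endpoint: one request for all rows
user_ids = users_df['user_id'].tolist()
records = backend_api.get('/api/user-data', params={'ids': user_ids})  # assumed to return a list of dicts
extra = pd.DataFrame(records).set_index('user_id')
users_df = users_df.join(extra[['age', 'gender']], on='user_id')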
Some libraries (e.g. a Java interop library that I use) require values to be passed in a row at a time, for example, when streaming data. To replicate the streaming nature, I 'stream' my dataframe values one by one; I wrote the class below, which comes in handy from time to time.
from typing import Dict, List

class DataFrameReader:
    def __init__(self, df):
        self._df = df
        self._row = None
        self._columns = df.columns.tolist()
        self.reset()
        self.row_index = 0

    def __getattr__(self, key):
        # Unknown attributes are treated as column lookups
        return self.__getitem__(key)

    def read(self) -> bool:
        self._row = next(self._iterator, None)
        self.row_index += 1
        return self._row is not None

    def columns(self):
        return self._columns

    def reset(self) -> None:
        self._iterator = self._df.itertuples()

    def get_index(self):
        return self._row[0]

    def index(self):
        return self._row[0]

    def to_dict(self, columns: List[str] = None):
        return self.row(columns=columns)

    def tolist(self, cols=None) -> List[object]:
        cols = self._columns if cols is None else cols
        return [self.__getitem__(c) for c in cols]

    def row(self, columns: List[str] = None) -> Dict[str, object]:
        cols = set(self._columns if columns is None else columns)
        return {c: self.__getitem__(c) for c in self._columns if c in cols}

    def __getitem__(self, key) -> object:
        # The df index of the row is at position 0, so column i is at position i + 1
        try:
            if type(key) is list:
                return [self._row[self._columns.index(k) + 1] for k in key]
            return self._row[self._columns.index(key) + 1]
        except BaseException:
            return None

    def __next__(self) -> 'DataFrameReader':
        if self.read():
            return self
        else:
            raise StopIteration

    def __iter__(self) -> 'DataFrameReader':
        return self
Which can be used:
for row in DataFrameReader(df):
    print(row.my_column_name)
    print(row.to_dict())
    print(row['my_column_name'])
    print(row.tolist())
And it preserves the value/name mapping for the rows being iterated. Obviously, it is a lot slower than using apply and Cython as indicated above, but it is necessary in some circumstances.
As the accepted answer states, the fastest way to apply a function over rows is to use a vectorized function, the so-called NumPy ufuncs (universal functions).
But what should you do when the function you want to apply isn't already implemented in NumPy?
Well, using the vectorize decorator from numba, you can easily create ufuncs directly in Python like this:
from numba import vectorize, float64

@vectorize([float64(float64)])
def f(x):
    # x is one value from your column; do something with it and return a float
    return x * 2.0  # placeholder body for illustration
The documentation for this function is here: Creating NumPy universal functions
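Once defined, the ufunc can be applied directly to a column; a minimal sketch (the column name and the placeholder body above are my assumptions):
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': np.arange(5, dtype=np.float64)})
df['col2'] = f(df['col1'].to_numpy())  # the ufunc broadcasts over the whole array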
Along with the great answers in this post, I am going to propose a Divide and Conquer approach. I am not writing this answer to supersede the other great answers, but to complement them with another approach that has worked efficiently for me. It has two steps: splitting and merging the pandas dataframe.
PROS of Divide and Conquer:
You don't need to use vectorization or any other method that casts your dataframe into another type
You don't need to Cythonize your code, which normally takes extra time
Both iterrows() and itertuples() in my case had the same performance over the entire dataframe
Depending on your choice of slice size, you can considerably speed up the iteration: the smaller each slice, the faster the per-record iteration rate (see the benchmark below)
CONS of Divide and Conquer:
The iteration over one slice must not depend on the same dataframe or on a different slice; that is, if you want to read from or write to another slice, it may be difficult to do.
=================== Divide and Conquer Approach =================
Step 1: Splitting/Slicing
In this step, we are going to divide the iteration over the entire dataframe. Say you read a CSV file into a pandas df and then iterate over it. In my case I have 5,000,000 records, and I am going to split them into chunks of 100,000 records.
NOTE: As the runtime analyses in the other solutions on this page explain, the per-record runtime of a search on the df grows much faster than linearly with the number of records. Based on the benchmark on my data, here are the results:
Number of records | Iteration rate [per second]
------------------+----------------------------
          100,000 | 500
          500,000 | 200
        1,000,000 | 50
        5,000,000 | 20
Step 2: Merging
This is an easy step: just merge all the written CSV files into one dataframe and write it out as a bigger CSV file.
Here is the sample code:
# Step 1 (Splitting/Slicing)
import pandas as pd

df_all = pd.read_csv('C:/KtV.csv')
df_index = 100000
df_len = len(df_all)

for i in range(df_len // df_index + 1):
    lower_bound = i * df_index
    higher_bound = min(lower_bound + df_index, df_len)
    # Splitting/slicing df (make sure to copy(), otherwise it will be a view)
    df = df_all[lower_bound:higher_bound].copy()
    '''
    Write your iteration over the sliced df here
    using iterrows() or itertuples() or ...
    '''
    # Writing into CSV files
    df.to_csv('C:/KtV_prep_' + str(i) + '.csv')
# Step 2 (Merging)
filename = 'C:/KtV_prep_'
dfs = (pd.read_csv(filename + str(i) + '.csv') for i in range(df_len // df_index + 1))
df_prep_all = pd.concat(dfs)
df_prep_all.to_csv('C:/KtV_prep_all.csv')
Reference:
Efficient way of iteration over dataframe
Concatenate CSV files into one Pandas Dataframe

matplotlib scatter plot: How to use the data= argument

The matplotlib documentation for scatter() states:
In addition to the above described arguments, this function can take a data keyword argument. If such a data argument is given, the following arguments are replaced by data[<arg>]:
All arguments with the following names: ‘s’, ‘color’, ‘y’, ‘c’, ‘linewidths’, ‘facecolor’, ‘facecolors’, ‘x’, ‘edgecolors’.
However, I cannot figure out how to get this to work.
The minimal example
import matplotlib.pyplot as plt
import numpy as np
data = np.random.random(size=(3, 2))
props = {'c': ['r', 'g', 'b'],
         's': [50, 100, 20],
         'edgecolor': ['b', 'g', 'r']}
plt.scatter(data[:, 0], data[:, 1], data=props)
plt.show()
produces a plot with the default colors and sizes, instead of the supplied ones.
Anyone has used that functionality?
This seems to be an overlooked feature added about two years ago. The release notes have a short example (https://matplotlib.org/users/prev_whats_new/whats_new_1.5.html#working-with-labeled-data-like-pandas-dataframes). Besides this question and a short blog post (https://tomaugspurger.github.io/modern-6-visualization.html), that's all I could find.
Basically, any dict-like object ("labeled data", as the docs call it) can be passed in the data argument, and plot parameters are then specified based on its keys. For example, you can create a structured array with fields a, b, and c:
coords = np.random.randn(250, 3).view(dtype=[('a', float), ('b', float), ('c', float)])
You would normally create a plot of a vs b using
pyplot.plot(coords['a'], coords['b'], 'x')
but using the data argument it can be done with
pyplot.plot('a', 'b','x', data=coords)
The label 'b' could be confused with a style string setting the line to blue, but the third argument clears up that ambiguity. It's not limited to x and y data either:
pyplot.scatter(x='a', y='b', c='c', data=coords)
will set the point color based on column 'c'.
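Applied to the props from the question, a sketch like the following should pick the colors and sizes up from the labeled data (exact behavior may vary with the matplotlib version):
import matplotlib.pyplot as plt
import numpy as np

d = {'x': np.random.random(3),
     'y': np.random.random(3),
     'color': ['r', 'g', 'b'],
     'size': [50, 100, 20]}

plt.scatter('x', 'y', c='color', s='size', data=d)
plt.show()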
It looks like this feature was added for pandas dataframes, and it handles them better than other objects. Additionally, it seems to be poorly documented and somewhat unstable (using x and y keyword arguments fails with the plot command but works fine with scatter, and the error messages are not helpful). That being said, it gives a nice shorthand when the data you want to plot has labels.
In reference to your example, I think the following does what you want:
plt.scatter(data[:, 0], data[:, 1], **props)
That bit in the docs is confusing to me, and looking at the sources, scatter in axes/_axes.py seems to do nothing with this data argument. Remaining kwargs end up as arguments to a PathCollection, maybe there is a bug there.
You could also set these parameters after scatter with the various set methods in PathCollection, e.g.:
pc = plt.scatter(data[:, 0], data[:, 1])
pc.set_sizes([500,100,200])

pandas access axis by user-defined name

I am wondering whether there is any way to access axes of pandas containers (DataFrame, Panel, etc...) by user-defined name instead of integer or "index", "columns", "minor_axis" etc...
For example, with the following data container:
df = DataFrame(randn(3,2),columns=['c1','c2'],index=['i1','i2','i3'])
df.index.name = 'myaxis1'
df.columns.name = 'myaxis2'
I would like to do this:
df.sum(axis='myaxis1')
df.xs('c1', axis='myaxis2') # cross section
Also very useful would be:
df.reshape(['myaxis2','myaxis1'])
(in this case not so relevant, but it could become so if the dimension increases)
The reason is that I work a lot with multi-dimensional arrays of varying dimensions, like "time", "variable", "percentile" etc., and the same piece of code is often applied to objects which can be DataFrame, Panel or even Panel4D or DataFrame with MultiIndex. For now I often test the shape of the object, or the general settings of the script, to know which axis is the relevant one for computing a sum or mean. But I think it would be much more convenient to forget about how the container is implemented in detail (DataFrame, Panel etc.), and simply think about the nature of the problem (say I want to average over time; I do not want to think about whether I work in "probabilistic" mode with several percentiles, or in "deterministic" mode with a single time series).
Writing this post I have (re)discovered the very useful axes attribute. The above code could be translated into:
nms = [ax.name for ax in df.axes]
axid1 = nms.index('myaxis1')
axid2 = nms.index('myaxis2')
df.sum(axis=axid1)
df.xs('c1', axis=axid2) # cross section
and the "reshape" feature (does not apply to 3-d case though...):
newshape = ['myaxis2','myaxis1']
axid = [nms.index(nm) for nm in newshape]
df.swapaxes(*axid)
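That lookup could be wrapped in a small helper so call sites stay readable (a sketch, not an existing pandas API):
def axis_number(obj, name):
    # Map a user-defined axis name to its positional axis number
    names = [ax.name for ax in obj.axes]
    return names.index(name)

df.sum(axis=axis_number(df, 'myaxis1'))
df.xs('c1', axis=axis_number(df, 'myaxis2'))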
Well, I have to admit that I found these solutions while writing this post (and this is already very convenient), but they could be generalized to account for DataFrames (or other containers) with MultiIndex axes, searching across all axes and labels...
In my opinion it would be a major improvement to the user-friendliness of pandas (ok, forgetting about the actual structure could have a performance cost, but the user worried about performance can be careful in how he/she organizes the data).
What do you think?
This is still experimental, but look at this page:
http://pandas.pydata.org/pandas-docs/dev/dsintro.html#panelnd-experimental
import pandas
import numpy as np
from pandas.core import panelnd
MyPanel4D = panelnd.create_nd_panel_factory(
    klass_name='MyPanel4D',
    axis_orders=['axis4', 'axis3', 'axis2', 'axis1'],
    axis_slices={'axis3': 'items',
                 'axis2': 'major_axis',
                 'axis1': 'minor_axis'},
    slicer='Panel',
    stat_axis=2)

mp4d = MyPanel4D(np.random.rand(5, 4, 3, 2))
print(mp4d)
Results in this
<class 'pandas.core.panelnd.MyPanel4D'>
Dimensions: 5 (axis4) x 4 (axis3) x 3 (axis2) x 2 (axis1)
Axis4 axis: 0 to 4
Axis3 axis: 0 to 3
Axis2 axis: 0 to 2
Axis1 axis: 0 to 1
Here's the caveat: when you slice it like mp4d[0], you are going to get back a Panel, unless you create a hierarchy of custom objects (unfortunately you will need to wait for 0.12-dev for support for 'renaming' Panel/DataFrame; it's non-trivial and hasn't had any requests).
So for higher-dimensional objects you can impose your own name structure. The axis aliasing should work like you are suggesting, but I think there are some bugs there.