How can I traverse a DataFrame without a nested for loop? - pandas

How can I traverse a DataFrame in pandas without nesting for loops?
My code is:
df = pd.DataFrame([[1,2,3],[4,5,6],[7,8,9]],columns=['a','b','c'])

Maybe you would want to try this for traversing every row and every column.
print([df.loc[row, col] for row in df.index for col in df.columns])
This outputs:
[1, 2, 3, 4, 5, 6, 7, 8, 9]

I think this is faster:
[i for i in df.values.reshape((df.shape[0]*df.shape[1]))]
%%timeit gave 8.35 µs ± 137 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
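For what it's worth, the same flattened list can be produced without an explicit comprehension by letting NumPy do the flattening; a minimal sketch, assuming the frame holds a single (numeric) dtype:
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]], columns=['a', 'b', 'c'])
# ravel flattens row by row; tolist() converts back to plain Python scalars
flat = df.to_numpy().ravel().tolist()
print(flat)  # [1, 2, 3, 4, 5, 6, 7, 8, 9]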

Related

What is the difference between specifying the rows vs slice in iloc?

What is the difference between specifying the rows vs slice in the following example:
df.iloc[[3, 4], 1]
vs
df.iloc[3:4, 1]
A slice a:b selects consecutive positions in order, while passing a list of positions lets you select in an arbitrary sequence, e.g. [3, 5, 4, 1].
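To illustrate with a small, hypothetical single-column frame: a list of positions can pick rows in any order (and even repeat them), which a slice cannot express:
import pandas as pd
import numpy as np

df = pd.DataFrame({"X": np.arange(10)})
# list of positions: arbitrary order is preserved
print(df.iloc[[3, 5, 4, 1], 0].tolist())  # [3, 5, 4, 1]
# slice: consecutive positions only, in order
print(df.iloc[3:6, 0].tolist())           # [3, 4, 5]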
There is also a performance difference: slicing is many times faster than passing a list of positions. For example:
import pandas as pd
import numpy as np
df = pd.DataFrame({"X":np.random.random((1000,)),
})
%%timeit
df.iloc[range(100),:]
Out:
177 µs ± 5.1 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
%%timeit
df.iloc[:100, :]
Out:
22.4 µs ± 828 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

Efficiently filter DataFrame by looking for NumPy array match in row

Given
df = pd.DataFrame({'x': [np.array(['1', '2.3']), np.array(['30', '99'])]},
                  index=pd.date_range('2020-01-01', '2020-01-02', freq='D'))
I would like to filter for np.array(['1', '2.3']). I can do
df[df['x'].apply(lambda x: np.array_equal(x, np.array(['1', '2.3'])))]
but is this the fastest way to do it?
EDIT:
Let's assume that all the elements inside the numpy array are strings, even though it's not good practice!
DataFrame length can go to 500k rows and the number of values in each numpy array can go to 10.
You can rely on a list comprehension for performance:
df[np.array([np.array_equal(x, np.array(['1', '2.3'])) for x in df['x'].values])]
Performance via timeit (on my system, currently with 4 GB RAM):
%timeit -n 2000 df[np.array([np.array_equal(x, np.array(['1', '2.3'])) for x in df['x'].values])]
#output:
425 µs ± 10.8 µs per loop (mean ± std. dev. of 7 runs, 2000 loops each)
%timeit -n 2000 df[df['x'].apply(lambda x: np.array_equal(x, np.array(['1', '2.3'])))]
#output:
875 µs ± 28.6 µs per loop (mean ± std. dev. of 7 runs, 2000 loops each)
My suggestion would be to do the following:
import numpy as np
mat = np.stack([np.array(["a","b","c"]),np.array(["d","e","f"])])
In reality this would be the actual data from the column of your dataframe; make sure it ends up as a single 2-D numpy array.
Then do:
matching_rows = (np.array(["a","b","c"]) == mat).all(axis=1)
This gives you an array of bools indicating where the matches are located.
So you can then filter your rows like this:
df[matching_rows]
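Applied to the example frame from the question, building mat from the column could look roughly like this (a sketch; it assumes every cell holds an array of the same length, so they can be stacked into one 2-D array):
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': [np.array(['1', '2.3']), np.array(['30', '99'])]},
                  index=pd.date_range('2020-01-01', '2020-01-02', freq='D'))
# stack the per-row arrays into a single (n_rows, n_values) array
mat = np.stack(list(df['x']))
target = np.array(['1', '2.3'])
# vectorised element-wise comparison, then require every value in a row to match
matching_rows = (mat == target).all(axis=1)
print(df[matching_rows])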

How to raise a matrix to the power of elements in an array that is increasing in an ascending order?

Currently I have a C matrix generated by:
def c_matrix(n):
    exp = np.exp(1j*np.pi/n)
    exp_n = np.array([[exp, 0], [0, exp.conj()]], dtype=complex)
    c_matrix = np.array([exp_n**i for i in range(1, n, 1)], dtype=complex)
    return c_matrix
What this does is basically generate the powers 1 to n-1 via a list comprehension, then return an array of the matrix exp_n raised (element-wise) to each of those powers, i.e.
exp_n**[1, 2, ..., n-1] = [exp_n**1, exp_n**2, ..., exp_n**(n-1)]
So I was wondering if there's a more numpythonic way of doing it (in order to make use of NumPy's broadcasting ability), something like:
exp_n**np.arange(1, n) = np.array([exp_n**1, exp_n**2, ..., exp_n**(n-1)])
You're speaking of a Vandermonde matrix. NumPy has numpy.vander:
def c_matrix_vander(n):
    exp = np.exp(1j*np.pi/n)
    exp_n = np.array([[exp, 0], [0, exp.conj()]], dtype=complex)
    return np.vander(exp_n.ravel(), n, increasing=True)[:, 1:].swapaxes(0, 1).reshape(n-1, 2, 2)
Performance
In [184]: %timeit c_matrix_vander(10_000)
849 µs ± 14.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [185]: %timeit c_matrix(10_000)
41.5 ms ± 549 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Validation
>>> np.isclose(c_matrix(10_000), c_matrix_vander(10_000)).all()
True
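If you would rather stay with the broadcasting you asked about, the same result can also be written as a single element-wise power (a sketch; note that exp_n**i in the original code is an element-wise power, not a matrix power, so this reproduces it):
import numpy as np

def c_matrix_broadcast(n):
    exp = np.exp(1j*np.pi/n)
    exp_n = np.array([[exp, 0], [0, exp.conj()]], dtype=complex)
    # powers of shape (n-1, 1, 1) broadcast against the (2, 2) base matrix
    powers = np.arange(1, n).reshape(-1, 1, 1)
    return exp_n ** powers
This should agree with c_matrix(n), though np.vander remains the faster option shown above.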

Fast implementation of incrementing dates by days according to another column pandas

I have the following pandas DF:
print(df.to_dict())
{'Date_Installed': {11885: Timestamp('2018-11-15 00:00:00'), 111885: Timestamp('2018-11-15 00:00:00')}, 'days_from_instalation': {11885: 2, 111885: 3}}
I would like to create a new column that increments the 'Date_Installed' column by the days from the column 'days_from_instalation'
I know that this is possible using the apply() method as following:
from datetime import timedelta
df['desired_date']=df.apply(lambda row:row['Date_Installed']+timedelta(row['days_from_instalation']), axis=1)
which produces my desired output:
print(df.to_dict())
{'Date_Installed': {11885: Timestamp('2018-11-15 00:00:00'), 111885: Timestamp('2018-11-15 00:00:00')}, 'days_from_instalation': {11885: 2, 111885: 3}, 'desired_date': {11885: Timestamp('2018-11-17 00:00:00'), 111885: Timestamp('2018-11-18 00:00:00')}}
However this method is extremely slow, and isn't realistic to apply to my full DF.
I went over several questions on incrementing dates in pandas, like this one:
pandas-increment-datetime
But they all seem to deal with incrementing by a constant, without any vectorised way to add a per-row number of days.
Is there any vectorised version of this type of increment?
Thanks in advance!
Add timedeltas created by to_timedelta:
df['desired_date'] = df['Date_Installed'] + pd.to_timedelta(df['days_from_instalation'], unit='d')
print (df)
       Date_Installed  days_from_instalation desired_date
11885      2018-11-15                      2   2018-11-17
111885     2018-11-15                      3   2018-11-18
Another numpy solution is faster, but loses timezones (if specified):
a = pd.to_timedelta(df['days_from_instalation'], unit='d').values.astype(np.int64)
df['desired_date1'] = pd.to_datetime(df['Date_Installed'].values.astype(np.int64)+a, unit='ns')
Performance:
# 20k rows
df = pd.concat([df] * 10000, ignore_index=True)
In [217]: %timeit df['desired_date1'] = pd.to_datetime(df['Date_Installed'].values.astype(np.int64) + pd.to_timedelta(df['days_from_instalation'], unit='d').values.astype(np.int64), unit='ns')
886 µs ± 9.92 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [218]: %timeit df['desired_date'] = df['Date_Installed'] + pd.to_timedelta(df['days_from_instalation'], unit='d')
1.53 ms ± 82.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

What is maybe_convert_objects good for?

I'm profiling the timing of one of my functions and I see that I spend a lot of time on pandas DataFrame creation - I'm talking about 2.5 seconds to construct a DataFrame with 1000 columns and 10k rows:
import numpy as np
from pandas import DataFrame

def test(size):
    samples = []
    for r in range(10000):
        a, b = np.random.randint(100, size=2)
        data = np.random.beta(a, b, size=size)
        samples.append(data)
    return DataFrame(samples, dtype=np.float64)
Running %prun -l 4 test(1000) shows that most of the time is spent in pandas' maybe_convert_objects.
Is there any way I can avoid this check? It really doesn't seem necessary here. I tried to find out about this method and ways to bypass it, but didn't find anything online.
pandas must introspect each row because you are passing it a list of arrays. Here are some more efficient methods in this case.
In [27]: size=1000
In [28]: samples = []
    ...: for r in range(10000):
    ...:     data = np.random.beta(1, 1, size=size)
    ...:     samples.append(data)
    ...:
In [29]: np.asarray(samples).shape
Out[29]: (10000, 1000)
# original
In [30]: %timeit DataFrame(samples)
2.29 s ± 91.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# numpy is less flexible on the conversion, but in this case
# it is fine
In [31]: %timeit DataFrame(np.asarray(samples))
30.9 ms ± 426 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
# you should probably just do this
In [32]: samples = np.random.beta(1,1, size=(10000, 1000))
In [33]: %timeit DataFrame(samples)
74.4 µs ± 381 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)