Fast implementation of incrementing dates by days according to another column in pandas

I have the following pandas DF:
print(df.to_dict())
{'Date_Installed': {11885: Timestamp('2018-11-15 00:00:00'), 111885: Timestamp('2018-11-15 00:00:00')}, 'days_from_instalation': {11885: 2, 111885: 3}}
I would like to create a new column that increments the 'Date_Installed' column by the number of days in the column 'days_from_instalation'.
I know that this is possible using the apply() method, as follows:
from datetime import timedelta
df['desired_date'] = df.apply(lambda row: row['Date_Installed'] + timedelta(row['days_from_instalation']), axis=1)
which produces my desired output:
print(df.to_dict())
{'Date_Installed': {11885: Timestamp('2018-11-15 00:00:00'), 111885: Timestamp('2018-11-15 00:00:00')}, 'days_from_instalation': {11885: 2, 111885: 3}, 'desired_date': {11885: Timestamp('2018-11-17 00:00:00'), 111885: Timestamp('2018-11-18 00:00:00')}}
However, this method is extremely slow and isn't realistic to apply to my full DataFrame.
I went over several questions on incrementing dates in pandas, like this one:
pandas-increment-datetime
But they all seem to deal with constant incrementation, without any vectorised method to do so.
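By constant incrementation I mean adding the same fixed offset to every row, e.g. something like the following (the 'date_plus_2' column name is just for illustration):
import pandas as pd

# constant incrementation: every row gets the same offset, not a per-row number of days
df['date_plus_2'] = df['Date_Installed'] + pd.Timedelta(days=2)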
Is there any vectorised version of this type of increment?
Thanks in advance!

Add timedeltas created by to_timedelta:
df['desired_date'] = df['Date_Installed'] + pd.to_timedelta(df['days_from_instalation'], unit='d')
print (df)
        Date_Installed  days_from_instalation desired_date
11885       2018-11-15                      2   2018-11-17
111885      2018-11-15                      3   2018-11-18
Another solution with NumPy is faster, but loses timezones (if specified):
import numpy as np

a = pd.to_timedelta(df['days_from_instalation'], unit='d').values.astype(np.int64)
df['desired_date1'] = pd.to_datetime(df['Date_Installed'].values.astype(np.int64) + a, unit='ns')
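If the timestamps are timezone-aware, one possible workaround is to strip the timezone before the integer arithmetic and re-apply it afterwards. This is only a sketch, assuming a single timezone across the column; it reuses the 'desired_date1' column name from above:
import numpy as np
import pandas as pd

tz = df['Date_Installed'].dt.tz                      # remember the original timezone
naive = df['Date_Installed'].dt.tz_localize(None)    # drop it for the integer math
offset = pd.to_timedelta(df['days_from_instalation'], unit='d').values.astype(np.int64)
result = pd.to_datetime(naive.values.astype(np.int64) + offset, unit='ns')
df['desired_date1'] = pd.Series(result, index=df.index).dt.tz_localize(tz)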
Performance:
# 20k rows
df = pd.concat([df] * 10000, ignore_index=True)
In [217]: %timeit df['desired_date1'] = pd.to_datetime(df['Date_Installed'].values.astype(np.int64) + pd.to_timedelta(df['days_from_instalation'], unit='d').values.astype(np.int64), unit='ns')
886 µs ± 9.92 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [218]: %timeit df['desired_date'] = df['Date_Installed'] + pd.to_timedelta(df['days_from_instalation'], unit='d')
1.53 ms ± 82.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Related

What is the difference between specifying the rows vs slice in iloc?

What is the difference between specifying the rows vs slice in the following example:
df.iloc[[3, 4], 1]
vs
df.iloc[3:4, 1]
A slice a:b selects consecutive positions, while specifying the positions as a list allows indexing by an arbitrary sequence such as [3, 5, 4, 1].
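A minimal sketch on a hypothetical single-column frame, just to illustrate the two forms:
import pandas as pd

df = pd.DataFrame({'X': range(10)})
print(df.iloc[[3, 5, 4, 1], 0])  # positions picked in the given (arbitrary) order
print(df.iloc[3:6, 0])           # consecutive positions 3, 4, 5 (end position excluded)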
There is also a difference in performance: slicing is many times faster than passing a list of positions. For example:
import pandas as pd
import numpy as np
df = pd.DataFrame({"X":np.random.random((1000,)),
})
%%timeit
df.iloc[range(100),:]
Out:
177 µs ± 5.1 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
%%timeit
df.iloc[:100, :]
Out:
22.4 µs ± 828 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

Efficiently filter DataFrame by looking for NumPy array match in row

Given
df = pd.DataFrame({'x': [np.array(['1', '2.3']), np.array(['30', '99'])]},
                  index=[pd.date_range('2020-01-01', '2020-01-02', freq='D')])
I would like to filter for np.array(['1', '2.3']). I can do
df[df['x'].apply(lambda x: np.array_equal(x, np.array(['1', '2.3'])))]
but is this the fastest way to do it?
EDIT:
Let's assume that all the elements inside the numpy array are strings, even though it's not good practice!
The DataFrame can have up to 500k rows and each numpy array can hold up to 10 values.
You can rely on list comprehension for performance:
df[np.array([np.array_equal(x,np.array([1, 2.3])) for x in df['x'].values])]
Performance via timeit (on my system, currently using 4 GB RAM):
%timeit -n 2000 df[np.array([np.array_equal(x,np.array([1, 2.3])) for x in df['x'].values])]
#output:
425 µs ± 10.8 µs per loop (mean ± std. dev. of 7 runs, 2000 loops each)
%timeit -n 2000 df[df['x'].apply(lambda x: np.array_equal(x, np.array([1, 2.3])))]
#output:
875 µs ± 28.6 µs per loop (mean ± std. dev. of 7 runs, 2000 loops each)
My suggestion would be to do the following:
import numpy as np
mat = np.stack([np.array(["a","b","c"]),np.array(["d","e","f"])])
In reality this would be the actual data from the column of your DataFrame; make sure it is stacked into a single NumPy array.
Then do:
matching_rows = (np.array(["a","b","c"]) == mat).all(axis=1)
This gives you an array of booleans indicating where the matches are located.
So you can then filter your rows like this:
df[matching_rows]
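Applied to the question's DataFrame, this could look roughly like the following. It is only a sketch and assumes every array stored in column 'x' has the same length (otherwise np.stack raises):
import numpy as np

# Stack the per-row arrays from column 'x' into one 2-D matrix, then compare row-wise.
mat = np.stack(df['x'].to_numpy())
matching_rows = (np.array(['1', '2.3']) == mat).all(axis=1)
filtered = df[matching_rows]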

Find rows where field A is a substring of field B

There are 10 million records with two columns, keyword and string.
I want to find the rows where the keyword column appears in its string column:
test_df=pd.DataFrame({'keyword1':['day','night','internet','day','night','internet'],'string1':['today is a good day','I like this','youtube','sunday','what is this','internet']})
test_df
My first attempt is to use .apply, but it's slow.
test_df[test_df.apply(lambda x: True if x['keyword1'] in x['string1'] else False,axis=1)]
Because there are 10 million different strings but a much smaller number of keywords (on the order of 10 thousand), I'm thinking it may be more efficient to group by keyword.
test_df.groupby('keyword1',group_keys=False).apply(lambda x: x[x['string1'].str.contains(x.loc[x.index[0],'keyword1'])])
Supposedly, this approach only needs 10k iterations rather than 10 million, but it is only slightly faster (about 10%). I'm not sure why: is the overhead of iteration small, or does groupby have its own additional cost?
My question is: is there a better way to perform this job?
One idea is to create a mask with GroupBy.transform, comparing against x.name; regex=False should also improve performance. But there still seem to be a lot of groups (10k), so groupby is the bottleneck here:
mask = (test_df.groupby('keyword1')['string1']
                .transform(lambda x: x.str.contains(x.name, regex=False)))
df = test_df[mask]
print (df)
  keyword1              string1
0      day  today is a good day
3      day               sunday
5 internet             internet
Another idea is to use a list comprehension, though I'm not sure if it is faster at 10M rows:
test_df[[x in y for x, y in test_df[['keyword1','string1']].to_numpy()]]
Some tests with sample data; here there are only a few groups, so groupby is really fast:
# 6k rows
test_df = pd.concat([test_df] * 1000, ignore_index=True)
In [49]: %timeit test_df[test_df.groupby('keyword1', sort=False)['string1'].transform(lambda x :x.str.contains(x.name, regex=False))]
5.84 ms ± 265 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [50]: %timeit test_df[[x in y for x, y in test_df[['keyword1','string1']].to_numpy()]]
9.46 ms ± 47.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [51]: %timeit test_df.groupby('keyword1',group_keys=False).apply(lambda x: x[x['string1'].str.contains(x.loc[x.index[0],'keyword1'])])
11.7 ms ± 204 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [52]: %timeit test_df[test_df.apply(lambda x: True if x['keyword1'] in x['string1'] else False,axis=1)]
138 ms ± 887 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

How can I traverse a DataFrame without a nested for loop?

How can I traverse a data frame in pandas without nested for loops?
My code is:
df = pd.DataFrame([[1,2,3],[4,5,6],[7,8,9]],columns=['a','b','c'])
Maybe you would want to try this for traversing every row and every column.
print([df.loc[row, col] for row in df.index for col in df.columns])
This outputs:
[1, 2, 3, 4, 5, 6, 7, 8, 9]
I think this is faster:
[i for i in df.values.reshape((df.shape[0]*df.shape[1]))]
%%timeit gave 8.35 µs ± 137 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
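For comparison, an equivalent spelling of the same idea (only a sketch, not benchmarked here) flattens the underlying array with ravel before converting it to a plain Python list:
# Flatten all cells row by row and collect them into a list.
flat = df.to_numpy().ravel().tolist()
print(flat)  # [1, 2, 3, 4, 5, 6, 7, 8, 9]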

What is maybe_convert_objects good for?

I'm profiling the timing of one of my functions and I see that I spend a lot of time on pandas DataFrame creation: I'm talking about 2.5 seconds to construct a DataFrame with 1000 columns and 10k rows:
import numpy as np
from pandas import DataFrame

def test(size):
    samples = []
    for r in range(10000):
        # use 1..99 so both beta parameters are strictly positive
        a, b = np.random.randint(1, 100, size=2)
        data = np.random.beta(a, b, size=size)
        samples.append(data)
    return DataFrame(samples, dtype=np.float64)
Running %prun -l 4 test(1000) shows that most of the time is spent in pandas' internal maybe_convert_objects (the profiler output was posted as an image).
Is there any way I can avoid this check? It really doesn't seem necessary here. I tried to find out about this method and ways to bypass it, but didn't find anything online.
pandas must introspect each row because you are passing it a list of arrays. Here are some more efficient methods in this case.
In [27]: size=1000
In [28]: samples = []
    ...: for r in range(10000):
    ...:     data = np.random.beta(1, 1, size=size)
    ...:     samples.append(data)
    ...:
In [29]: np.asarray(samples).shape
Out[29]: (10000, 1000)
# original
In [30]: %timeit DataFrame(samples)
2.29 s ± 91.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# numpy is less flexible on the conversion, but in this case
# it is fine
In [31]: %timeit DataFrame(np.asarray(samples))
30.9 ms ± 426 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
# you should probably just do this
In [32]: samples = np.random.beta(1,1, size=(10000, 1000))
In [33]: %timeit DataFrame(samples)
74.4 µs ± 381 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)