pandas: efficient way to filter out a few rows (outliers) from a large DataFrame

I am looking for an efficient way to filter out a few rows (outliers) from a large DataFrame. According to https://pandas.pydata.org/pandas-docs/stable/getting_started/comparison/comparison_with_sql.html#delete, the advice is to select the rows that should remain rather than delete the unwanted ones. Here is an example DataFrame -
In [288]: dai
Out[288]:
x y
frame face lmark
1 NaN NaN NaN NaN
2 NaN NaN NaN NaN
3 NaN NaN NaN NaN
4 NaN NaN NaN NaN
5 NaN NaN NaN NaN
... .. ..
5146 NaN NaN NaN NaN
5147 NaN NaN NaN NaN
5148 NaN NaN NaN NaN
5149 NaN NaN NaN NaN
5150 NaN NaN NaN NaN
[312814 rows x 2 columns]
whose index is sorted -
In [295]: dai.equals(dai.sort_index())
Out[295]: True
now I extract unique sorted values of frame index except for the last one (frame 5150) -
In [305]: frames = dai.index.get_level_values('frame').drop_duplicates().sort_values()[:-1]
In [306]: frames
Out[306]:
Int64Index([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10,
...
5140, 5141, 5142, 5143, 5144, 5145, 5146, 5147, 5148, 5149],
dtype='int64', name='frame', length=5149)
and then filter rows in the DataFrame using .loc
In [307]: dai.loc[frames]
Out[307]:
x y
frame face lmark
1 NaN NaN NaN NaN
2 NaN NaN NaN NaN
3 NaN NaN NaN NaN
4 NaN NaN NaN NaN
5 NaN NaN NaN NaN
... .. ..
5145 NaN NaN NaN NaN
5146 NaN NaN NaN NaN
5147 NaN NaN NaN NaN
5148 NaN NaN NaN NaN
5149 NaN NaN NaN NaN
The result is correct but it took longer than expected -
In [308]: timeit dai.loc[frames]
7.31 s ± 83.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [309]: prun -l 4 dai.loc[frames]
1159551 function calls (1138939 primitive calls) in 7.753 seconds
Ordered by: internal time
List reduced from 253 to 4 due to restriction <4>
ncalls tottime percall cumtime percall filename:lineno(function)
5148 3.544 0.001 3.544 0.001 base.py:241(_outer_indexer)
10298 1.963 0.000 1.963 0.000 {method 'searchsorted' of 'numpy.ndarray' objects}
10298 0.811 0.000 0.900 0.000 base.py:1588(is_monotonic_increasing)
5149 0.413 0.000 0.413 0.000 {method 'nonzero' of 'numpy.ndarray' objects}
Is there any way to improve the performance?

I discovered it is much quicker to filter a DataFrame with default RangeIndex than using multiIndex
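A minimal sketch of that idea, assuming a small made-up frame in place of the question's `dai` (the sizes and random values are hypothetical): move the index levels into ordinary columns, filter with a boolean mask on the resulting RangeIndex frame, then restore the MultiIndex.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the question's data: three index levels, two value columns.
idx = pd.MultiIndex.from_product(
    [range(1, 51), range(2), range(10)], names=["frame", "face", "lmark"]
)
dai = pd.DataFrame(
    np.random.default_rng(0).random((len(idx), 2)), index=idx, columns=["x", "y"]
)

last_frame = dai.index.get_level_values("frame").max()

# Turn the index levels into columns so the frame has a default RangeIndex,
# keep the rows that should remain, then rebuild the MultiIndex.
flat = dai.reset_index()
filtered = flat[flat["frame"] != last_frame].set_index(["frame", "face", "lmark"])
```

Whether the round trip through `reset_index` pays off depends on the data; timing it against the original `.loc[frames]` call is the only reliable check.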

Related

Pandas DataFrame subtraction is getting an unexpected result. Concatenating instead?

I have two dataframes of the same size (510x6)
preds
0 1 2 3 4 5
0 2.610270 -4.083780 3.381037 4.174977 2.743785 -0.766932
1 0.049673 0.731330 1.656028 -0.427514 -0.803391 -0.656469
2 -3.579314 3.347611 2.891815 -1.772502 1.505312 -1.852362
3 -0.558046 -1.290783 2.351023 4.669028 3.096437 0.383327
4 -3.215028 0.616974 5.917364 5.275736 7.201042 -0.735897
... ... ... ... ... ... ...
505 -2.178958 3.918007 8.247562 -0.523363 2.936684 -3.153375
506 0.736896 -1.571704 0.831026 2.673974 2.259796 -0.815212
507 -2.687474 -1.268576 -0.603680 5.571290 -3.516223 0.752697
508 0.182165 0.904990 4.690155 6.320494 -2.326415 2.241589
509 -1.675801 -1.602143 7.066843 2.881135 -5.278826 1.831972
510 rows × 6 columns
outputStats
0 1 2 3 4 5
0 2.610270 -4.083780 3.381037 4.174977 2.743785 -0.766932
1 0.049673 0.731330 1.656028 -0.427514 -0.803391 -0.656469
2 -3.579314 3.347611 2.891815 -1.772502 1.505312 -1.852362
3 -0.558046 -1.290783 2.351023 4.669028 3.096437 0.383327
4 -3.215028 0.616974 5.917364 5.275736 7.201042 -0.735897
... ... ... ... ... ... ...
505 -2.178958 3.918007 8.247562 -0.523363 2.936684 -3.153375
506 0.736896 -1.571704 0.831026 2.673974 2.259796 -0.815212
507 -2.687474 -1.268576 -0.603680 5.571290 -3.516223 0.752697
508 0.182165 0.904990 4.690155 6.320494 -2.326415 2.241589
509 -1.675801 -1.602143 7.066843 2.881135 -5.278826 1.831972
510 rows × 6 columns
when I execute:
preds - outputStats
I expect a 510 x 6 dataframe with elementwise subtraction. Instead I get this:
0 1 2 3 4 5 0 1 2 3 4 5
0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ...
505 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
506 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
507 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
508 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
509 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
I've tried dropping columns and the like, and that hasn't helped. I also get the same result with preds.subtract(outputStats). Any Ideas?

There are many ways that two different values can look the same when displayed. One of the most common is that they have different types but equivalent printed forms. For instance, depending on how you display them, the int 1 and the str '1' may not be easy to tell apart. Whitespace can also differ, as in '1' versus ' 1'.
If the problem is that one set of column labels is int while the other is str, you can solve it by converting them all to int or all to str. To do the former, use df.columns = [int(col) for col in df.columns]. To do the latter, use df.columns = [str(col) for col in df.columns]. Converting to str is somewhat safer, because converting to int raises an error if a label isn't a valid integer (e.g. int('y') raises a ValueError), but int labels can be more useful because they keep their numerical structure.
You asked in a comment about dropping columns. You can do this with drop and including axis=1 as a parameter to tell it to drop columns rather than rows, or you can use the del keyword. But changing the column names should remove the need to drop columns.
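The sketch below reproduces the symptom under the assumption that the mismatch really is int versus str labels (the frames and values are made up, and `output_stats` stands in for `outputStats`). Because `0` and `'0'` are different labels, the subtraction aligns on the union of the columns and every cell is NaN; normalising the labels restores the element-wise result.

```python
import numpy as np
import pandas as pd

values = np.arange(6.0).reshape(2, 3)
preds = pd.DataFrame(values, columns=[0, 1, 2])               # int labels
output_stats = pd.DataFrame(values, columns=["0", "1", "2"])  # str labels

# No label matches, so the result has all six columns and only NaN.
misaligned = preds - output_stats

# Convert the str labels to int so both frames share the same columns.
output_stats.columns = [int(col) for col in output_stats.columns]
aligned = preds - output_stats
```

Checking `preds.columns` and `outputStats.columns` (or their `dtype`) before subtracting shows quickly whether this is the cause.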

pandas dataframe rebuild based on cells

after a lot of testing I have ended with this df:
Date 1 2 3 4 5 6 7 8 9 10
0 2019-01-02 59.92 NaN NaN NaN NaN NaN NaN NaN NaN NaN
0 2019-01-02 NaN 197.28 NaN NaN NaN NaN NaN NaN NaN NaN
0 2019-01-02 NaN NaN 96.59 NaN NaN NaN NaN NaN NaN NaN
0 2019-01-02 NaN NaN NaN 275.0 NaN NaN NaN NaN NaN NaN
0 2019-01-02 NaN NaN NaN NaN 209.94 NaN NaN NaN NaN NaN
0 2019-01-02 NaN NaN NaN NaN NaN 99.83 NaN NaN NaN NaN
0 2019-01-02 NaN NaN NaN NaN NaN NaN 257.89 NaN NaN NaN
0 2019-01-02 NaN NaN NaN NaN NaN NaN NaN 215.54 NaN NaN
0 2019-01-02 NaN NaN NaN NaN NaN NaN NaN NaN 187.06 NaN
0 2019-01-02 NaN NaN NaN NaN NaN NaN NaN NaN NaN 386.9
Is there any trick to put all these values on the same row? Any ideas?
Thanks!!

Try via groupby() and sum():
df=df.groupby('Date').sum()
output:
Date 1 2 3 4 5 6 7 8 9 10
2019-01-02 59.92 197.28 96.59 275.0 209.94 99.83 257.89 215.54 187.06 386.9

An option with groupby first in case this would need to be performed for multiple different types where sum may not behave as expected:
df = df.groupby('Date', as_index=False).first()
Date 1 2 3 4 5 6 7 8 9 10
2019-01-02 59.92 197.28 96.59 275.0 209.94 99.83 257.89 215.54 187.06 386.9

Is it better to avoid DataFrames with MultiIndex when filtering?

Experiment 1: selection from DataFrame with default range index -
In [167]: df_range = pd.read_csv('extract.csv')
In [168]: df_range
Out[168]:
frame face lmark x y
0 1 NaN NaN NaN NaN
1 2 NaN NaN NaN NaN
2 3 NaN NaN NaN NaN
3 4 NaN NaN NaN NaN
4 5 NaN NaN NaN NaN
... ... ... ... .. ..
312809 5146 NaN NaN NaN NaN
312810 5147 NaN NaN NaN NaN
312811 5148 NaN NaN NaN NaN
312812 5149 NaN NaN NaN NaN
312813 5150 NaN NaN NaN NaN
[312814 rows x 5 columns]
select index values excluding frame 5148 -
In [170]: ind = df_range.loc[(df_range['frame'] != 5148)].index.values
In [171]: ind
Out[171]: array([ 0, 1, 2, ..., 312810, 312812, 312813])
select records from df_range excluding frame 5148 -
In [173]: df_range.loc[ind]
Out[173]:
frame face lmark x y
0 1 NaN NaN NaN NaN
1 2 NaN NaN NaN NaN
2 3 NaN NaN NaN NaN
3 4 NaN NaN NaN NaN
4 5 NaN NaN NaN NaN
... ... ... ... .. ..
312808 5145 NaN NaN NaN NaN
312809 5146 NaN NaN NaN NaN
312810 5147 NaN NaN NaN NaN
312812 5149 NaN NaN NaN NaN
312813 5150 NaN NaN NaN NaN
[312813 rows x 5 columns]
In [174]: timeit df_range.loc[ind]
14.1 ms ± 12.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Experiment 2: selection from DataFrame with MultiIndex -
In [177]: df_multi = pd.read_csv('extract.csv').set_index(['frame', 'face', 'lmark'])
In [178]: df_multi
Out[178]:
x y
frame face lmark
1 NaN NaN NaN NaN
2 NaN NaN NaN NaN
3 NaN NaN NaN NaN
4 NaN NaN NaN NaN
5 NaN NaN NaN NaN
... .. ..
5146 NaN NaN NaN NaN
5147 NaN NaN NaN NaN
5148 NaN NaN NaN NaN
5149 NaN NaN NaN NaN
5150 NaN NaN NaN NaN
[312814 rows x 2 columns]
select frame values excluding frame 5148 -
In [215]: frames = df_range.loc[ind]['frame'].drop_duplicates().values
In [216]: frames
Out[216]: array([ 1, 2, 3, ..., 5147, 5149, 5150])
select records from df_multi excluding frame 5148 -
In [218]: df_multi.loc[frames]
Out[218]:
x y
frame face lmark
1 NaN NaN NaN NaN
2 NaN NaN NaN NaN
3 NaN NaN NaN NaN
4 NaN NaN NaN NaN
5 NaN NaN NaN NaN
... .. ..
5145 NaN NaN NaN NaN
5146 NaN NaN NaN NaN
5147 NaN NaN NaN NaN
5149 NaN NaN NaN NaN
5150 NaN NaN NaN NaN
[312813 rows x 2 columns]
In [219]: timeit df_multi.loc[frames]
7.83 s ± 607 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Conclusion
Both methods select the correct result set but filtering a DataFrame with MultiIndex appears to be orders of magnitude slower than using the default range index. Do you agree?
Update 13-03-2020: @ALollz - thanks for the inspiration. Here is a much faster way of filtering a DataFrame with MultiIndex -
In [40]: timeit df_multi.loc[df_multi.index.get_level_values('frame') != 5148]
4.53 ms ± 15.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [41]: df_multi.loc[df_multi.index.get_level_values('frame') != 5148]
Out[41]:
x y
frame face lmark
1 NaN NaN NaN NaN
2 NaN NaN NaN NaN
3 NaN NaN NaN NaN
4 NaN NaN NaN NaN
5 NaN NaN NaN NaN
... .. ..
5145 NaN NaN NaN NaN
5146 NaN NaN NaN NaN
5147 NaN NaN NaN NaN
5149 NaN NaN NaN NaN
5150 NaN NaN NaN NaN
[312813 rows x 2 columns]
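The same boolean-mask approach should extend to several outlier frames at once by using `Index.isin`. This is only a sketch on made-up data; the frame numbers in `outlier_frames` are hypothetical.

```python
import numpy as np
import pandas as pd

idx = pd.MultiIndex.from_product(
    [range(1, 101), range(2), range(5)], names=["frame", "face", "lmark"]
)
df_multi = pd.DataFrame(
    {"x": np.arange(len(idx), dtype=float), "y": 0.0}, index=idx
)

outlier_frames = [17, 42, 99]  # hypothetical frames to drop

# Boolean array: True for every row whose frame is NOT an outlier.
keep = ~df_multi.index.get_level_values("frame").isin(outlier_frames)
filtered = df_multi.loc[keep]
```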

Not really.
A MultiIndex has tuples as its index values. You switch to a MultiIndex but then still supply a single array of scalars as the indexer, so pandas spends a lot of time trying to figure out exactly how to align those. If you instead supply the correct array of MultiIndex locs, the speed is much closer to that of the plain index (though maybe still ~10x slower).
Sample Data
import pandas as pd
df = pd.concat([pd.DataFrame(range(10**3))]*5, axis=1)
df.columns = range(5)
df_mult = df.copy().set_index([0,1], append=True)
ids = df[df[4].ne(4)].index
%timeit df.loc[ids]
#398 µs ± 5.47 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit df_mult.loc[ids]
#121 ms ± 1.39 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# Use the correct MultiIndex locs
ids_mult = df_mult[df_mult[4].ne(4)].index
%timeit df_mult.loc[ids_mult]
#2.57 ms ± 54.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Though you might just slice by the Boolean Series, which tends to be very fast for most larger selections.
%timeit df_mult[df_mult[4].ne(4)]
#705 µs ± 5.08 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

High Dimensional Data with time series

I am trying to make a pandas dataframe to hold my experimental data. The data is described below:
I have ~300 individuals participating in an experiment made of ~200 trials, where each trial has a number of experimentally controlled parameters (~10 parameters). For every trial and every individual I have a timeseries of some measurement, which is 30 timepoints long.
What is the best way to structure this data into a dataframe? I will need to be able to do things like get the experimental values for every individual at a certain time during all the trials with certain parameters, or get average values over certain times and trials for an individual, etc. Basically I will need to be able to slice this data in most conceivable ways.
Thanks!
EDIT: If you want to look at how I have my data at the moment, scroll down to the last 3 cells in this notebook: https://drive.google.com/file/d/1UZG_S2fg4MzaED8cLwE-nKHG0SHqevUr/view?usp=sharing
The data variable has all the parameters for each trial, and the interp_traces variable is an array of the timeseries measurements for each timepoint, individual, and trial.
I'd like to put everything in one thing if possible. The multi-index looks promising.

In my opinion, you need a MultiIndex.
Sample:
individuals = list('ABCD')
trials = list('ab')
par = list('xyz')
dates = pd.date_range('2018-01-01', periods=5)
n = ['ind','trials','pars']
mux = pd.MultiIndex.from_product([individuals, trials, par], names=n)
df = pd.DataFrame(index=mux, columns=dates)
print (df)
2018-01-01 2018-01-02 2018-01-03 2018-01-04 2018-01-05
ind trials pars
A a x NaN NaN NaN NaN NaN
y NaN NaN NaN NaN NaN
z NaN NaN NaN NaN NaN
b x NaN NaN NaN NaN NaN
y NaN NaN NaN NaN NaN
z NaN NaN NaN NaN NaN
B a x NaN NaN NaN NaN NaN
y NaN NaN NaN NaN NaN
z NaN NaN NaN NaN NaN
b x NaN NaN NaN NaN NaN
y NaN NaN NaN NaN NaN
z NaN NaN NaN NaN NaN
C a x NaN NaN NaN NaN NaN
y NaN NaN NaN NaN NaN
z NaN NaN NaN NaN NaN
b x NaN NaN NaN NaN NaN
y NaN NaN NaN NaN NaN
z NaN NaN NaN NaN NaN
D a x NaN NaN NaN NaN NaN
y NaN NaN NaN NaN NaN
z NaN NaN NaN NaN NaN
b x NaN NaN NaN NaN NaN
y NaN NaN NaN NaN NaN
z NaN NaN NaN NaN NaN
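As a rough sketch of how such a frame might then be sliced, the example below fills the same layout with placeholder numbers (the values are made up) and uses `xs`, `.loc`, and `pd.IndexSlice` for a few of the selections mentioned in the question.

```python
import numpy as np
import pandas as pd

individuals = list("ABCD")
trials = list("ab")
pars = list("xyz")
dates = pd.date_range("2018-01-01", periods=5)

mux = pd.MultiIndex.from_product(
    [individuals, trials, pars], names=["ind", "trials", "pars"]
)
# Placeholder values so the selections return something visible.
data = np.arange(len(mux) * len(dates), dtype=float).reshape(len(mux), len(dates))
df = pd.DataFrame(data, index=mux, columns=dates)

# Every individual and trial for parameter "x" at one time point.
at_time = df.xs("x", level="pars")[dates[2]]

# Mean over all time points for each (trial, parameter) row of individual "B".
mean_b = df.loc["B"].mean(axis=1)

# Individuals "A" and "C", trial "a", all parameters, all time points.
idx = pd.IndexSlice
subset = df.loc[idx[["A", "C"], "a", :], :]
```

Label-based slicing with `IndexSlice` generally expects a sorted index; `from_product` over sorted inputs provides that here.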

Using pandas reindex with floats: interpolation

Can you explain this bizarre behaviour?
df=pd.DataFrame({'year':[1986,1987,1988],'bomb':arange(3)}).set_index('year')
In [9]: df.reindex(arange(1986,1988.125,.125))
Out[9]:
bomb
1986.000 0
1986.125 NaN
1986.250 NaN
1986.375 NaN
1986.500 NaN
1986.625 NaN
1986.750 NaN
1986.875 NaN
1987.000 1
1987.125 NaN
1987.250 NaN
1987.375 NaN
1987.500 NaN
1987.625 NaN
1987.750 NaN
1987.875 NaN
1988.000 2
In [10]: df.reindex(arange(1986,1988.1,.1))
Out[10]:
bomb
1986.0 0
1986.1 NaN
1986.2 NaN
1986.3 NaN
1986.4 NaN
1986.5 NaN
1986.6 NaN
1986.7 NaN
1986.8 NaN
1986.9 NaN
1987.0 NaN
1987.1 NaN
1987.2 NaN
1987.3 NaN
1987.4 NaN
1987.5 NaN
1987.6 NaN
1987.7 NaN
1987.8 NaN
1987.9 NaN
1988.0 NaN
When the increment is anything other than .125, I find that the new index values do not "find" the old rows that have matching values. ie there is a precision problem that is not being overcome. This is true even if I force the index to be a float before I try to interpolate. What is going on and/or what is the right way to do this?
I've been able to get it to work with increment of 0.1 by using
reindex( np.array(map(round,arange(1985,2010+dt,dt)*10))/10.0 )
By the way, I'm doing this as the first step in linearly interpolating a number of columns (e.g. "bomb" is one of them). If there's a nicer way to do that, I'd happily be set straight.

You are getting what you asked for. The reindex method only tries to fit the existing data onto the new index that you provide; it does not fill in the gaps. As mentioned in the comments, you are probably looking for dates in the index. I guess you were expecting reindex to do the interpolation as well, which takes a second step:
df2 =df.reindex(arange(1986,1988.125,.125))
pd.Series.interpolate(df2['bomb'])
1986.000 0.000
1986.125 0.125
1986.250 0.250
1986.375 0.375
1986.500 0.500
1986.625 0.625
1986.750 0.750
1986.875 0.875
1987.000 1.000
1987.125 1.125
1987.250 1.250
1987.375 1.375
1987.500 1.500
1987.625 1.625
1987.750 1.750
1987.875 1.875
1988.000 2.000
Name: bomb
The inconsistency in your second example is probably due to floating-point precision. A step of 0.125 equals 1/8, which can be represented exactly in binary. A step of 0.1 cannot be represented exactly, so the value generated for 1987 is probably off by a tiny fraction and no longer equals the original index value:
1987.0 == 1987.0000000001
False
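One way to sidestep the precision problem, sketched here as an assumption rather than the only fix, is to build the grid from integers and divide afterwards, so that whole years come out exact. `interpolate(method="index")` then fills the gaps using the index values.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"year": [1986, 1987, 1988], "bomb": np.arange(3)}).set_index("year")

# 0, 1, ..., 20 tenths of a year; 10 / 10 and 20 / 10 are exactly 1.0 and 2.0,
# so 1987.0 and 1988.0 are hit exactly.
grid = 1986 + np.arange(21) / 10

reindexed = df.reindex(grid)
interpolated = reindexed["bomb"].interpolate(method="index")
```

With this grid all three original rows are found by `reindex`, which is what the step of 0.1 built with `arange` failed to do in the question.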

I think you are better off doing something like this by using PeriodIndex
In [39]: df=pd.DataFrame({'bomb':np.arange(3)})
In [40]: df
Out[40]:
bomb
0 0
1 1
2 2
In [41]: df.index = pd.period_range('1986','1988',freq='Y').asfreq('M')
In [42]: df
Out[42]:
bomb
1986-12 0
1987-12 1
1988-12 2
In [43]: df = df.reindex(pd.period_range('1986','1988',freq='M'))
In [44]: df
Out[44]:
bomb
1986-01 NaN
1986-02 NaN
1986-03 NaN
1986-04 NaN
1986-05 NaN
1986-06 NaN
1986-07 NaN
1986-08 NaN
1986-09 NaN
1986-10 NaN
1986-11 NaN
1986-12 0
1987-01 NaN
1987-02 NaN
1987-03 NaN
1987-04 NaN
1987-05 NaN
1987-06 NaN
1987-07 NaN
1987-08 NaN
1987-09 NaN
1987-10 NaN
1987-11 NaN
1987-12 1
1988-01 NaN
In [45]: df.iloc[0,0] = -1
In [46]: df['interp'] = df['bomb'].interpolate()
In [47]: df
Out[47]:
bomb interp
1986-01 -1 -1.000000
1986-02 NaN -0.909091
1986-03 NaN -0.818182
1986-04 NaN -0.727273
1986-05 NaN -0.636364
1986-06 NaN -0.545455
1986-07 NaN -0.454545
1986-08 NaN -0.363636
1986-09 NaN -0.272727
1986-10 NaN -0.181818
1986-11 NaN -0.090909
1986-12 0 0.000000
1987-01 NaN 0.083333
1987-02 NaN 0.166667
1987-03 NaN 0.250000
1987-04 NaN 0.333333
1987-05 NaN 0.416667
1987-06 NaN 0.500000
1987-07 NaN 0.583333
1987-08 NaN 0.666667
1987-09 NaN 0.750000
1987-10 NaN 0.833333
1987-11 NaN 0.916667
1987-12 1 1.000000
1988-01 NaN 1.000000
