>>> pd.DataFrame([1], index=['1']).loc['2'] # KeyError
>>> pd.DataFrame([1], index=['1']).loc[['2']] # KeyError
>>> pd.DataFrame([1], index=['1']).loc[['1','2']] # Succeeds, as in the answer below.
I'd like something that doesn't fail in either of
>>> pd.DataFrame([1], index=['1']).loc['2'] # KeyError
>>> pd.DataFrame([1], index=['1']).loc[['2']] # KeyError
Is there a function like loc which gracefully handles this, or some other way of expressing this query?
Update for @AlexLenail's comment
It's a fair point that this will be slow for large lists. I did a bit more digging and found that the intersection method is available for indexes and columns. I'm not sure about the algorithmic complexity, but it's much faster empirically.
You can do something like this.
good_keys = df.index.intersection(all_keys)
df.loc[good_keys]
Or, using your example:
df = pd.DataFrame([1], index=['1'])
df.loc[df.index.intersection(['2'])]
Here is a little experiment:
import numpy as np
import pandas as pd

n = 100000
# Create random values and random string indexes
# have the bad indexes contain extra values not in DataFrame Index
rand_val = np.random.rand(n)
rand_idx = []
for x in range(n):
    rand_idx.append(str(x))

bad_idx = []
for x in range(n*2):
    bad_idx.append(str(x))
df = pd.DataFrame(rand_val, index=rand_idx)
df.head()
def get_valid_keys_list_comp():
    # Return the filtered DataFrame, using a list comprehension to filter keys
    vkeys = [key for key in bad_idx if key in df.index.values]
    return df.loc[vkeys]

def get_valid_keys_intersection():
    # Return the filtered DataFrame, using Index.intersection to filter keys
    vkeys = df.index.intersection(bad_idx)
    return df.loc[vkeys]
%%timeit
get_valid_keys_intersection()
# 64.5 ms ± 4.53 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit
get_valid_keys_list_comp()
# 6.14 s ± 457 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Original answer
I'm not sure if pandas has a built-in function to handle this, but you can use a Python list comprehension to filter to valid indexes with something like this.
Given a DataFrame df2
A B C D F
test 1.0 2013-01-02 1.0 3 foo
train 1.0 2013-01-02 1.0 3 foo
test 1.0 2013-01-02 1.0 3 foo
train 1.0 2013-01-02 1.0 3 foo
You can filter your index query with this
keys = ['test', 'train', 'try', 'fake', 'broken']
valid_keys = [key for key in keys if key in df2.index.values]
df2.loc[valid_keys]
This will also work for columns if you use df2.columns instead of df2.index.values
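For example, a minimal sketch of the same filter applied to columns (the column list here is made up for illustration):
wanted_cols = ['A', 'B', 'Z']  # 'Z' does not exist in df2
valid_cols = [col for col in wanted_cols if col in df2.columns]
df2[valid_cols]  # returns only columns A and B, no KeyError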
I found an alternative (provided a check for df.empty is made beforehand). You could do something like this:
df[df.index == '2']
This returns either a DataFrame with the matched rows or an empty DataFrame.
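If you have several keys, a boolean mask with Index.isin behaves the same way (a sketch; missing labels are simply ignored rather than raising):
df[df.index.isin(['1', '2'])]  # rows whose index label is in the list; '2' is silently skipped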
It seems to work fine for me. I'm running Python 3.5 with pandas version 0.20.3.
import numpy as np
import pandas as pd
# Create dataframe
data = {'distance': [0, 300, 600, 1000],
'population': [4.8, 0.7, 6.4, 2.9]}
df = pd.DataFrame(data, index=['Alabama','Alaska','Arizona','Arkansas'])
keys = ['Alabama', 'Alaska', 'Arizona', 'Virginia']
# Create a subset of the dataframe.
df.loc[keys]
distance population
Alabama 0.0 4.8
Alaska 300.0 0.7
Arizona 600.0 6.4
Virginia NaN NaN
Or if you want to exclude the NaN row:
df.loc[keys].dropna()
distance population
Alabama 0.0 4.8
Alaska 300.0 0.7
Arizona 600.0 6.4
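Note that a bare dropna() also drops rows that legitimately contain NaN in some column. If you only want to drop the all-NaN rows added for missing keys, a slightly safer variant (a sketch) is:
df.loc[keys].dropna(how='all')  # drop only rows where every column is NaN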
This page https://pandas.pydata.org/pandas-docs/stable/indexing.html#deprecate-loc-reindex-listlike has the solution:
In [8]: pd.DataFrame([1], index=['1']).reindex(['2'])
Out[8]:
0
2 NaN
Using the sample dataframe from @binjip's answer:
import numpy as np
import pandas as pd
# Create dataframe
data = {'distance': [0, 300, 600, 1000],
'population': [4.8, 0.7, 6.4, 2.9]}
df = pd.DataFrame(data, index=['Alabama','Alaska','Arizona','Arkansas'])
keys = ['Alabama', 'Alaska', 'Arizona', 'Virginia']
Get matching records from the dataframe. NB: The dataframe index must be unique for this to work!
df.reindex(keys)
distance population
Alabama 0.0 4.8
Alaska 300.0 0.7
Arizona 600.0 6.4
Virginia NaN NaN
If you want to omit missing keys:
df.reindex(df.index.intersection(keys))
distance population
Alabama 0 4.8
Alaska 300 0.7
Arizona 600 6.4
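If NaN is not a convenient placeholder, reindex also accepts a fill_value (a quick sketch):
df.reindex(keys, fill_value=0)  # missing keys get 0 instead of NaN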
df.loc uses the index (values from df.index), not the position of the row. Did you mean to use .iloc instead?
Related
I have two geodataframes or geoseries, both consisting of thousands of points.
My requirement is to append (merge) both geodataframes and drop duplicate points.
In other words, output = gdf1 all points + gdf2 points that do not intersect with gdf1 points
I tried as:
output = geopandas.overlay(gdf1, gdf2, how='symmetric_difference')
However, it is very slow.
Do you know any faster way of doing it?
Here is another way of combining dataframes using pandas, along with timings, versus geopandas:
import pandas as pd
import numpy as np
data1 = np.random.randint(-100, 100, size=10000)
data2 = np.random.randint(-100, 100, size=10000)
df1 = pd.concat([-pd.Series(data1, name="longitude"), pd.Series(data1, name="latitude")], axis=1)
df1['geometry'] = df1.apply(lambda x: (x['latitude'], x['longitude']), axis=1)
df2 = pd.concat([-pd.Series(data2, name="longitude"), pd.Series(data2, name="latitude")], axis=1)
df2['geometry'] = df2.apply(lambda x: (x['latitude'], x['longitude']), axis=1)
df1 = df1.set_index(["longitude", "latitude"])
df2 = df2.set_index(["longitude", "latitude"])
%timeit pd.concat([df1[~df1.index.isin(df2.index)],df2[~df2.index.isin(df1.index)]])
112 ms ± 217 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
This seems a lot faster than using geopandas
import geopandas as gp
gdf1 = gp.GeoDataFrame(
    df1, geometry=gp.points_from_xy(df1.index.get_level_values("longitude"), df1.index.get_level_values("latitude")))
gdf2 = gp.GeoDataFrame(
    df2, geometry=gp.points_from_xy(df2.index.get_level_values("longitude"), df2.index.get_level_values("latitude")))
%timeit gp.overlay(gdf1, gdf2, how='symmetric_difference')
29 s ± 317 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
But maybe you need some kind of optimisations as mentioned here.
This checks each df for index labels missing from the other and then combines the two selections:
df1 = pd.DataFrame([1,2,3,4],columns=['col1']).set_index("col1")
df2 = pd.DataFrame([3,4,5,6],columns=['col1']).set_index("col1")
pd.concat([df1[~df1.index.isin(df2.index)],df2[~df2.index.isin(df1.index)]])
col1
1
2
5
6
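The same filter can also be written with Index.symmetric_difference, which names the operation explicitly (a sketch under the same setup):
keep = df1.index.symmetric_difference(df2.index)  # labels present in exactly one frame
combined = pd.concat([df1, df2])
combined[combined.index.isin(keep)]               # same result: index 1, 2, 5, 6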
sampleID    testnames                      results
23939332    [32131, 34343, 35566]          [NEGATIVE, 0.234, 3.331]
32332323    [34343, 96958, 39550, 88088]   [0.312, 0.008, 0.1, 0.2]
The table above is what I have, and the one below is what I want to achieve:
sampleID    32131     34343   39550   88088   96958   35566
23939332    NEGATIVE  0.234   NaN     NaN     NaN     3.331
32332323    NaN       0.312   0.1     0.2     0.008   NaN
So I need to create columns from the unique values in the testnames column and fill the cells with the corresponding values from the results column.
Consider this a sample from a much larger dataset (table).
Here is a commented solution:
(df.set_index(['sampleID'])            # keep sampleID out of the expansion
   .apply(pd.Series.explode)           # expand testnames and results
   .reset_index()                      # reset the index
   .groupby(['sampleID', 'testnames']) # group by sample and test name
   .first()                            # set the expected shape
   .unstack())                         # pivot testnames into columns
It gives the result you expected, though with a different column order:
results
testnames 32131 34343 35566 39550 88088 96958
sampleID
23939332 NEGATIVE 0.234 3.331 NaN NaN NaN
32332323 NaN 0.312 NaN 0.1 0.2 0.008
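If your pandas is 1.3 or newer (an assumption about your environment), the per-column apply can be replaced by a single multi-column explode, for example:
(df.explode(['testnames', 'results'])
   .groupby(['sampleID', 'testnames'])['results']
   .first()
   .unstack())
Selecting ['results'] before unstacking also drops the extra 'results' column level seen above.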
Let's see how it does on generated data:
def build_df(n_samples, n_tests_per_sample, n_test_types):
    df = pd.DataFrame(columns=['sampleID', 'testnames', 'results'])
    test_types = np.random.choice(range(0, 100000), size=n_test_types, replace=False)
    for i in range(n_samples):
        testnames = list(np.random.choice(test_types, size=n_tests_per_sample))
        results = list(np.random.random(size=n_tests_per_sample))
        df = df.append({'sampleID': i, 'testnames': testnames, 'results': results}, ignore_index=True)
    return df
def reshape(df):
    df2 = (df.set_index(['sampleID'])            # keep the sampleID out of the expansion
             .apply(pd.Series.explode)           # expand testnames and results
             .reset_index()                      # reset the index
             .groupby(['sampleID', 'testnames']) # group by sample and test name
             .first()                            # set the expected shape
             .unstack())                         # pivot testnames into columns
    return df2
%time df = build_df(60000, 10, 100)
# Wall time: 9min 48s (yes, it was ugly)
%time df2 = reshape(df)
# Wall time: 1.01 s
reshape() breaks when n_test_types becomes too large, with ValueError: Unstacked DataFrame is too big, causing int32 overflow.
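As an aside, most of the 9 min 48 s above is spent calling df.append inside the loop; a sketch of a faster builder with the same signature collects the rows in a list first and constructs the DataFrame once:
def build_df_fast(n_samples, n_tests_per_sample, n_test_types):
    # build all rows as plain dicts, then create the DataFrame in one go
    test_types = np.random.choice(range(0, 100000), size=n_test_types, replace=False)
    rows = [{'sampleID': i,
             'testnames': list(np.random.choice(test_types, size=n_tests_per_sample)),
             'results': list(np.random.random(size=n_tests_per_sample))}
            for i in range(n_samples)]
    return pd.DataFrame(rows, columns=['sampleID', 'testnames', 'results'])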
I have a df containing columns "Income_Groups", "Rate", and "Probability". I need to randomly select a rate for each income group. How can I write a loop and print out the result for each income bin?
The pandas data frame table looks like this:
import pandas as pd
df={'Income_Groups':['1','1','1','2','2','2','3','3','3'],
'Rate':[1.23,1.25,1.56, 2.11,2.32, 2.36,3.12,3.45,3.55],
'Probability':[0.25, 0.50, 0.25,0.50,0.25,0.25,0.10,0.70,0.20]}
df2=pd.DataFrame(data=df)
df2
Shooting in the dark here, but you can use np.random.choice:
(df2.groupby('Income_Groups')
.apply(lambda x: np.random.choice(x['Rate'], p=x['Probability']))
)
Output (can vary due to randomness):
Income_Groups
1 1.25
2 2.36
3 3.45
dtype: float64
You can also pass size into np.random.choice:
(df2.groupby('Income_Groups')
.apply(lambda x: np.random.choice(x['Rate'], size=3, p=x['Probability']))
)
Output:
Income_Groups
1 [1.23, 1.25, 1.25]
2 [2.36, 2.11, 2.11]
3 [3.12, 3.12, 3.45]
dtype: object
Use GroupBy.apply because of the weights:
import numpy as np
(df2.groupby('Income_Groups')
.apply(lambda gp: np.random.choice(a=gp.Rate, p=gp.Probability, size=1)[0]))
#Income_Groups
#1 1.23
#2 2.11
#3 3.45
#dtype: float64
Another (admittedly silly) way, since your weights seem to have a precision of 2 decimal places:
s = df2.set_index(['Income_Groups', 'Probability']).Rate
(s.repeat(s.index.get_level_values('Probability')*100) # Weight
.sample(frac=1) # Shuffle |
.reset_index() # + | -> Random Select
.drop_duplicates(subset=['Income_Groups']) # Select |
.drop(columns='Probability'))
# Income_Groups Rate
#0 2 2.32
#1 1 1.25
#3 3 3.45
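A cleaner way to express the same weighted draw (a sketch, not from the original answers) is DataFrame.sample, which accepts per-row weights directly:
(df2.groupby('Income_Groups', group_keys=False)
    .apply(lambda g: g.sample(n=1, weights=g['Probability']))
    [['Income_Groups', 'Rate']])  # one weighted pick per income group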
I have a dataframe, df which looks like this
Open High Low Close Volume
Date
2007-03-22 2.65 2.95 2.64 2.86 176389
2007-03-23 2.87 2.87 2.78 2.78 63316
2007-03-26 2.83 2.83 2.51 2.52 54051
2007-03-27 2.61 3.29 2.60 3.28 589443
2007-03-28 3.65 4.10 3.60 3.80 1114659
2007-03-29 3.91 3.91 3.33 3.57 360501
2007-03-30 3.70 3.88 3.66 3.71 185787
I'm attempting to create a new column that first shifts the Open column 3 rows (df.Open.shift(-3)) and then takes the average of each value and the next 2 values.
So for example the above dataframe's Open column would be shifted -3 rows and look something like this:
Date
2007-03-22 2.610
2007-03-23 3.650
2007-03-26 3.910
2007-03-27 3.700
2007-03-28 3.710
2007-03-29 3.710
2007-03-30 3.500
I then want to take the forward mean of the next 3 values (including the value itself) via iteration.
So the first iteration would be 2.610 (the first value) + 3.650 + 3.910 (the next two values), divided by 3.
Then we take the next value, 3.650, and do the same, creating a column of values.
At first I tried something like :
df['Avg'] =df.Open.shift(-3).iloc[0:3].mean()
But this doesn't iterate through all the values of Open.shift
This next loop seems to work but is very slow, and I was told it's bad practice to use for loops in Pandas.
for i in range(0, len(df.Open)):
    df['Avg'][i] = df.Open.shift(-3).iloc[i:i+4].mean()
I tried to thinking of ways to use apply
df.Open.shift(-3).apply(loc[0:4].mean())
df.Open.shift(-3).apply(lambda x: x[0:4].mean())
but these seem to give errors such as
TypeError: 'float' object is not subscriptable etc
I can't think of an elegant way of doing this.
Thank you.
You can use pandas rolling_mean. Since it uses a backward window, it will give you the first two rows as 2.61 (the value itself) and 3.13 (the mean of row 0 and row 1). To handle that, you can use shift(-2) to shift the values by 2 rows.
pd.rolling_mean(df, window=3, min_periods=1).shift(-2)
output:
open
date
2007-03-22 3.390000
2007-03-23 3.753333
2007-03-26 3.773333
2007-03-27 3.706667
2007-03-28 3.640000
2007-03-29 NaN
2007-03-30 NaN
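pd.rolling_mean was removed in later pandas releases; assuming df here is the question's frame with the Open column, an equivalent with the current .rolling API would be (a sketch):
# rolling windows look backward, so shift(-2) moves each window mean back to its starting row
df['Avg'] = (df['Open'].shift(-3)
                       .rolling(window=3, min_periods=1)
                       .mean()
                       .shift(-2))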
numpy solution
As promised
NOTE: HUGE CAVEAT
This is an advanced technique and is not recommended for any beginner!!!
Using this might actually shave your poodle bald by accident. BE CAREFUL!
as_strided
from numpy.lib.stride_tricks import as_strided
import numpy as np
import pandas as pd
# I didn't have your full data for all dates
# so I created my own array
# You should be able to just do
# o = df.Open.values
o = np.array([2.65, 2.87, 2.83, 2.61, 3.65, 3.91, 3.70, 3.71, 3.71, 3.50])
# because we shift 3 rows, I trim with 3:
# because it'll be rolling 3 period mean
# add two np.nan at the end
# this makes the strides cleaner.. sortof
# whatever, I wanted to do it
o = np.append(o[3:], [np.nan] * 2)
# strides are the size of the chunk of memory
# allocated to each array element. there will
# be a stride for each numpy dimension. for
# a one dimensional array, I only want the first
s = o.strides[0]
# it gets fun right here
as_strided(o, (len(o) - 2, 3), (s, s))
# ^ \___________/ \__/
# | \ \______
# object to stride --- size of array --- \
# to make memory chunk
# to step forward
# per dimension
[[ 2.61 3.65 3.91]
[ 3.65 3.91 3.7 ]
[ 3.91 3.7 3.71]
[ 3.7 3.71 3.71]
[ 3.71 3.71 3.5 ]
[ 3.71 3.5 nan]
[ 3.5 nan nan]]
Now we just take the mean. All together
o = np.array([2.65, 2.87, 2.83, 2.61, 3.65, 3.91, 3.70, 3.71, 3.71, 3.50])
o = np.append(o[3:], [np.nan] * 2)
s = o.strides[0]
as_strided(o, (len(o) - 2, 3), (s, s)).mean(1)
array([ 3.39 , 3.75333333, 3.77333333, 3.70666667, 3.64 ,
nan, nan])
You can wrap it in a pandas series
pd.Series(
as_strided(o, (len(o) - 2, 3), (s, s)).mean(1),
df.index[3:],
)
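On NumPy 1.20 or newer (an assumption), the same window matrix can be built without manual stride arithmetic via sliding_window_view, for example:
from numpy.lib.stride_tricks import sliding_window_view
sliding_window_view(o, 3).mean(axis=1)  # the same 3-wide windows over o, then the row means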
This question is linked to Speedup of pandas groupby. It is about speeding up a groupby cumulative-product calculation. The DataFrame is 2D and has a MultiIndex consisting of 3 integers.
The HDF5 file for the dataframe can be found here: http://filebin.ca/2Csy0E2QuF2w/phi.h5
The actual calculation that I'm performing is similar to this:
>>> phi = pd.read_hdf('phi.h5', 'phi')
>>> %timeit phi.groupby(level='atomic_number').cumprod()
100 loops, best of 3: 5.45 ms per loop
The other speedup that might be possible is that I do this calculation about 100 times using the same index structure but with different numbers. I wonder if it can somehow cache the index.
Any help will be appreciated.
Numba appears to work pretty well here. In fact, these results seem almost too good to be true, with the numba function below being about 4,000x faster than the original method and 5x faster than a plain cumprod without a groupby. Hopefully these are correct; let me know if there is an error.
np.random.seed(1234)
df=pd.DataFrame({ 'x':np.repeat(range(200),4), 'y':np.random.randn(800) })
df = df.sort('x')
df['cp_groupby'] = df.groupby('x').cumprod()
from numba import jit

@jit
def group_cumprod(x, y):
    z = np.ones(len(x))
    for i in range(len(x)):
        if x[i] == x[i-1]:
            z[i] = y[i] * z[i-1]
        else:
            z[i] = y[i]
    return z
df['cp_numba'] = group_cumprod(df.x.values,df.y.values)
df['dif'] = df.cp_groupby - df.cp_numba
Test that both ways give the same answer:
all(df.cp_groupby==df.cp_numba)
Out[1447]: True
Timings:
%timeit df.groupby('x').cumprod()
10 loops, best of 3: 102 ms per loop
%timeit df['y'].cumprod()
10000 loops, best of 3: 133 µs per loop
%timeit group_cumprod(df.x.values,df.y.values)
10000 loops, best of 3: 24.4 µs per loop
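One variant worth noting (a sketch, not part of the original answer): compile in nopython mode with njit, and guard the i == 0 step so it never compares against x[-1]:
from numba import njit

@njit
def group_cumprod_njit(x, y):
    z = np.ones(len(x))
    for i in range(len(x)):
        if i > 0 and x[i] == x[i - 1]:
            z[i] = y[i] * z[i - 1]  # continue the running product within the group
        else:
            z[i] = y[i]             # first row of a new group restarts the product
    return z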
A pure numpy solution, assuming the data is sorted by the index, though with no handling of NaN:
res = np.empty_like(phi.values)
l = 0
r = phi.index.levels[0]
for i in r:
    phi.values[l:l+i, :].cumprod(axis=0, out=res[l:l+i])
    l += i
It is about 40 times faster on the MultiIndex data from the question. One caveat is that this relies on how pandas stores the data in its backing array, so it may stop working when pandas changes.
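The %timeit below calls np_cumprod, which isn't defined in the snippet above; a minimal wrapper consistent with that call would be:
def np_cumprod(phi):
    # wrap the snippet above so it can be timed as a single call
    res = np.empty_like(phi.values)
    l = 0
    for i in phi.index.levels[0]:
        phi.values[l:l+i, :].cumprod(axis=0, out=res[l:l+i])
        l += i
    return res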
>>> phi = pd.read_hdf('phi.h5', 'phi')
>>> %timeit phi.groupby(level='atomic_number').cumprod()
100 loops, best of 3: 4.33 ms per loop
>>> %timeit np_cumprod(phi)
10000 loops, best of 3: 111 µs per loop
If you want a fast but not very pretty workaround, you could do something like the following. Here's some sample data and your default approach.
df=pd.DataFrame({ 'x':np.repeat(range(200),4), 'y':np.random.randn(800) })
df = df.sort('x')
df['cp_group'] = df.groupby('x').cumprod()
And here's the workaround. It looks rather long (it is), but each individual step is simple and fast. (The timings are at the bottom.) The key is simply to avoid using groupby at all in this case by replacing it with shift and such, but because of that you also need to make sure your data is sorted by the groupby column.
df['cp_nogroup'] = df.y.cumprod()
df['last'] = np.where( df.x == df.x.shift(-1), 0, df.y.cumprod() )
df['last'] = np.where( df['last'] == 0., np.nan, df['last'] )
df['last'] = df['last'].shift().ffill().fillna(1)
df['cp_fast'] = df['cp_nogroup'] / df['last']
df['dif'] = df.cp_group - df.cp_fast
Here's what it looks like. 'cp_group' is your default and 'cp_fast' is the above workaround. If you look at the 'dif' column you'll see that several of these are off by very small amounts. This is just a precision issue and not anything to worry about.
x y cp_group cp_nogroup last cp_fast dif
0 0 1.364826 1.364826 1.364826 1.000000 1.364826 0.000000e+00
1 0 0.410126 0.559751 0.559751 1.000000 0.559751 0.000000e+00
2 0 0.894037 0.500438 0.500438 1.000000 0.500438 0.000000e+00
3 0 0.092296 0.046189 0.046189 1.000000 0.046189 0.000000e+00
4 1 1.262172 1.262172 0.058298 0.046189 1.262172 0.000000e+00
5 1 0.832328 1.050541 0.048523 0.046189 1.050541 2.220446e-16
6 1 -0.337245 -0.354289 -0.016364 0.046189 -0.354289 -5.551115e-17
7 1 0.758163 -0.268609 -0.012407 0.046189 -0.268609 -5.551115e-17
8 2 -1.025820 -1.025820 0.012727 -0.012407 -1.025820 0.000000e+00
9 2 1.175903 -1.206265 0.014966 -0.012407 -1.206265 0.000000e+00
Timings
Default method:
In [86]: %timeit df.groupby('x').cumprod()
10 loops, best of 3: 100 ms per loop
Standard cumprod but without the groupby. This should be a good approximation of the maximum possible speed you could achieve.
In [87]: %timeit df.cumprod()
1000 loops, best of 3: 536 µs per loop
And here's the workaround:
In [88]: %%timeit
...: df['cp_nogroup'] = df.y.cumprod()
...: df['last'] = np.where( df.x == df.x.shift(-1), 0, df.y.cumprod() )
...: df['last'] = np.where( df['last'] == 0., np.nan, df['last'] )
...: df['last'] = df['last'].shift().ffill().fillna(1)
...: df['cp_fast'] = df['cp_nogroup'] / df['last']
...: df['dif'] = df.cp_group - df.cp_fast
100 loops, best of 3: 2.3 ms per loop
So the workaround is about 40x faster for this sample dataframe but the speedup will depend on the dataframe (in particular on the number of groups).
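The same idea can be written a bit more compactly (a sketch, logically equivalent to the workaround above): divide the global cumulative product by the product carried in from the previous groups.
cp_all = df['y'].cumprod()
group_start = df['x'] != df['x'].shift()                        # True at the first row of each group
carry_in = cp_all.shift().where(group_start).ffill().fillna(1)  # cumprod carried in from earlier groups
df['cp_fast2'] = cp_all / carry_in                              # per-group cumulative product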