I have a Pandas dataframe, df1, that is a year-long, 5-minute timeseries with columns A-Z.
df1.shape
(105121, 26)
df1.index
<class 'pandas.tseries.index.DatetimeIndex'>
[2002-01-02 00:00:00, ..., 2003-01-02 00:00:00]
Length: 105121, Freq: 5T, Timezone: None
I have a second dataframe, df2, that is a year-long daily timeseries (over the same period) with matching columns. The values of this second frame are Booleans.
df2.shape
(365, 26)
df2.index
<class 'pandas.tseries.index.DatetimeIndex'>
[2002-01-02 00:00:00, ..., 2003-01-01 00:00:00]
Length: 365, Freq: D, Timezone: None
I want to use df2 as a fancy index to df1, i.e. "df1.ix[df2]" or somesuch, such that I get back a subset of df1's columns for each date -- i.e. those which df2 says are True on that date (with all timestamps thereon). Thus the shape of the result should be (105121, width), where width is the number of distinct columns the Booleans imply (width<=26).
Currently, df1.ix[df2] only partially works. Only the 00:00 values for each day are picked out, which makes sense in the light of df2's 'point-like' time series.
I next tried time spans as the df2 index:
df2.index
PeriodIndex: 365 entries, 2002-01-02 to 2003-01-01
This time, I get an error:
/home/wchapman/.local/lib/python2.7/site-packages/pandas-0.11.0-py2.7-linux-x86_64.egg/pandas/core/index.pyc in get_indexer(self, target, method, limit)
844 this = self.astype(object)
845 target = target.astype(object)
--> 846 return this.get_indexer(target, method=method, limit=limit)
847
848 if not self.is_unique:
AttributeError: 'numpy.ndarray' object has no attribute 'get_indexer'
My interim solution is to loop by date, but this seems inefficient. Is Pandas capable of this kind of fancy indexing? I don't see examples anywhere in the documentation.
Here's one way to do this:
t_index = df1.index
d_index = df2.index
mask = t_index.map(lambda t: t.date() in d_index)
df1[mask]
And slightly faster (but with the same idea) would be to use:
mask = pd.to_datetime([datetime.date(*t_tuple)
                       for t_tuple in zip(t_index.year,
                                          t_index.month,
                                          t_index.day)]).isin(d_index)
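The mask above keeps every row whose date appears in df2's index. To actually apply the per-day Booleans column-wise, here is a minimal sketch (my own addition, not part of the original answer, assuming df1 and df2 share the same A-Z column names): upsample df2 onto df1's 5-minute index and use it as a mask.

# Repeat each day's Booleans for every 5-minute stamp of that day.
daily = df2.reindex(df1.index, method='ffill').astype(bool)
masked = df1.where(daily)             # same shape as df1; cells that are False become NaN
# If the Booleans imply one fixed set of columns, select them directly instead:
cols = df2.columns[df2.any()]         # columns that are ever True
subset = df1[cols]                    # shape (105121, width)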
I am trying to create a row that has columns from t0 to t(n).
I have a complete data frame (df) that stores the full set of data, and a data series (df_t) of specific time markers I am interested in.
What I want is to create a row that has the time marker as t0, followed by the previous [sequence_length] rows from the complete data frame.
def t_data(df, df_t, col_names, sequence_length):
    df_ret = pd.DataFrame()
    for i in range(sequence_length):
        col_names_seq = [col_name + "_" + str(i) for col_name in col_names]
        df_ret[col_names_seq] = df[df.shift(i)["time"].isin(df_t)][col_names]
    return df_ret
Running:
t_data(df, df_t, ["close"], 3)
I get:
close_0 close_1 close_2
1110 1.32080 NaN NaN
2316 1.30490 NaN NaN
2549 1.30290 NaN NaN
The problematic line is obviously:
df[df.shift(i)["time"].isin(df_t)][col_names]
I have tried several ways but can't seem to select data surrounding a subset.
Sample (df):
time open close high low volume EMA21 EMA13 EMA9
20 2005-01-10 04:10:00 1.3071 1.3074 1.3075 1.3070 32.0 1.306624 1.306790 1.306887
21 2005-01-10 04:15:00 1.3074 1.3073 1.3075 1.3073 16.0 1.306685 1.306863 1.306969
22 2005-01-10 04:20:00 1.3073 1.3072 1.3074 1.3072 35.0 1.306732 1.306911 1.307015
Sample (df_t):
1110 2005-01-13 23:00:00
2316 2005-01-18 03:30:00
2549 2005-01-18 22:55:00
Name: time, dtype: datetime64[ns]
I don't have your data, but I hope this approach helps:
import numpy as np
import pandas as pd

def t_data(df, df_T, n):
    # Get the positions in the original df that match the values of df_T
    indexs = df.reset_index().merge(df_T, how="inner")['index'].tolist()
    # Create a new list where we will store the index - n values
    newIndex = []
    # Create the list of offsets to subtract from each matched index
    toSub = np.arange(n)
    # Loop over the matched indexes, subtract the offsets, and append to newIndex
    for i in indexs:
        for sub in toSub:
            newIndex.append(i - sub)
    # Use iloc to get all the rows in the original df at the newIndex positions
    closedCosts = df.iloc[newIndex].reset_index(drop=True)["close"].values
    # Concat back onto df_T (index reset so the rows line up),
    # reshaping closedCosts into n columns
    df_final = pd.concat([df_T.reset_index(drop=True),
                          pd.DataFrame(closedCosts.reshape(-1, n))], axis=1)
    # Return the final df
    return df_final
This should do what you're asking for. The easiest way is to figure out all the indexes you want from the original df, together with their corresponding closing values. Note: you will have to rename the columns after this, but all the values are there.
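A hedged usage sketch (the renamed columns below are just one illustration of the renaming step mentioned above):

result = t_data(df, df_t, 3)
# After the concat, the generated columns are named 0..n-1; rename them as needed.
result.columns = ["time", "close_0", "close_1", "close_2"]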
I'm trying to combine two data sets, df1 and df2. Rows with unique indices are always copied; rows with duplicate indices should always be picked from df1. Imagine two time series where df2 has additional data but is of lesser quality than df1, so ideally data comes from df1, but I'm willing to backfill from df2.
df1:
date value v2
2020/01/01 df1-1 x
2020/01/03 df1-3 y
df2:
date value v2
2020/01/02 df2-2 a
2020/01/03 df2-3 b
2020/01/04 df2-4 c
are combined into
date value v2
2020/01/01 df1-1 x
2020/01/02 df2-2 a
2020/01/03 df1-3 y
2020/01/04 df2-4 c
The best I've got so far is
df = df1.merge(df2, how="outer", left_index=True, right_index=True, suffixes=('', '_y'))
df['value'] = df['value'].combine_first(df['value_y'])
df['v2'] = df['v2'].combine_first(df['v2_y'])
df = df[['value', 'v2']]
That gets the job done, but it seems unnecessarily clunky. Is there a more idiomatic way to achieve this?
You wrote rows with unique indices, but you didn't show them,
so I assume the date column should be treated as these indices.
Furthermore, I noticed that none of the values in your DataFrames are NaN.
If you can guarantee this, you can run:
df1.set_index('date').combine_first(df2.set_index('date'))\
.reset_index()
Steps:
combine_first - combine both DataFrames, aligned on their date indexes,
preferring values from df1.
reset_index - turn the date column (currently the index) back into
a "regular" column.
Another possible approach
If both your DataFrames have a "standard" index (consecutive numbers starting
from 0) and you want to keep only the rows for these unique indices,
you can run:
df = pd.concat([df1, df2]).reset_index().drop_duplicates(subset='index')\
.set_index('index')
df.index.name = None
But then the result is:
date value v2
0 2020-01-01 df1-1 x
1 2020-01-03 df1-3 y
2 2020-01-04 df2-4 c
so it is different from what you presented under "are combined into"
(which I assume is your expected result). This time you lose the row
with v2 == 'a'.
Yet another approach
Based also on the assumption that all values in your DataFrames are not NaN:
df1.combine_first(df2)
The result will be just as the previous one.
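One more sketch (my own addition, not part of the answer above): stack the frames with df1 first and drop the duplicated dates, which keeps df1's rows wherever both frames have a value for the same date:

combined = (pd.concat([df1.set_index('date'), df2.set_index('date')])
              .loc[lambda d: ~d.index.duplicated(keep='first')]
              .sort_index()
              .reset_index())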
I have a dataframe df1 where the index is a DatetimeIndex and there are 5 columns, col1, col2, col3, col4, col5.
I have another df2 which has an almost equal datetimeindex (some days of df1 may be missing from df2), and a single 'Value' column.
I would like to multiply df1 in-place by the Value from df2 when the dates are the same. But not for all columns col1...col5, only col1...col4
I can see it is possible to multiply col1*Value, then col2*Value and so on... and make up a new dataframe to replace df1.
Is there a more efficient way?
You can achieve this by reindexing the second dataframe so they are the same shape, and then using the DataFrame mul operator:
Create two data frames with datetime series. The second one using only business days to make sure we have gaps between the two. Set the dates as indices.
import pandas as pd
# first frame
rng1 = pd.date_range('1/1/2011', periods=90, freq='D')
df1 = pd.DataFrame({'value':range(1,91),'date':rng1})
df1.set_index('date', inplace =True)
# second frame with a business day date index
rng2 = pd.date_range('1/1/2011', periods=90, freq='B')
df2 = pd.DataFrame({'date':rng2})
df2['value_to_multiply_by'] = range(1, 91)
df2.set_index('date', inplace =True)
Reindex the second frame with the index from the first. df2 will now have the gaps for non-business days filled with the previous valid observation.
# reindex the second dataframe to match the first
df2 =df2.reindex(index= df1.index, method = 'ffill')
Multiply df1 by df2['value_to_multiply_by']:
# multiply, filling NaNs with 1 to avoid propagating NaNs
# NaNs can still exist if there are no valid previous observations, such as at the beginning of a dataframe
df1.mul(df2['value_to_multiply_by'].fillna(1), axis=0)
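For the original question's constraint (scale only col1..col4 and leave col5 alone), here is a minimal sketch along the same lines, going back to the question's frames where df2 has a single 'Value' column and df1 has col1..col5:

# Align the daily values onto df1's index; fill any leading NaNs with 1.
aligned = df2['Value'].reindex(df1.index, method='ffill').fillna(1)
cols = ['col1', 'col2', 'col3', 'col4']
df1[cols] = df1[cols].mul(aligned, axis=0)   # col5 is left untouched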
I'm running into something a bit frustrating with pandas Series. I have a DataFrame with several columns, with numeric and non-numeric data. For some reason, however, pandas thinks some of the numeric columns are non-numeric, and ignores them when I try to run aggregating functions like .describe(). This is a problem, since pandas raises errors when I try to run analyses on these columns.
I've copied some commands from the terminal as an example. When I slice the 'ND_Offset' column (the problematic column in question), pandas tags it with the dtype of object. Yet, when I call .describe(), pandas tags it with the dtype float64 (which is what it should be). The 'Dwell' column, on the other hand, works exactly as it should, with pandas giving float64 both times.
Does anyone know why I'm getting this behavior?
In [83]: subject.phrases['ND_Offset'][:3]
Out[83]:
SubmitTime
2014-06-02 22:44:44 0.3607049
2014-06-02 22:44:44 0.2145484
2014-06-02 22:44:44 0.4031347
Name: ND_Offset, dtype: object
In [84]: subject.phrases['ND_Offset'].describe()
Out[84]:
count 1255.000000
unique 432.000000
top 0.242308
freq 21.000000
dtype: float64
In [85]: subject.phrases['Dwell'][:3]
Out[85]:
SubmitTime
2014-06-02 22:44:44 111
2014-06-02 22:44:44 81
2014-06-02 22:44:44 101
Name: Dwell, dtype: float64
In [86]: subject.phrases['Dwell'].describe()
Out[86]:
count 1255.000000
mean 99.013546
std 30.109327
min 21.000000
25% 81.000000
50% 94.000000
75% 111.000000
max 291.000000
dtype: float64
And when I use the .groupby function to group the data by another attribute (when these Series are a part of a DataFrame), I get the DataError: No numeric types to aggregate error when I try to call .agg(np.mean) on the group. When I try to call .agg(np.sum) on the same data, on the other hand, things work fine.
It's a bit bizarre -- can anyone explain what's going on?
Thank you!
It might be because the ND_Offset column (what I call A below) contains a non-numeric value such as an empty string. For example,
import numpy as np
import pandas as pd
df = pd.DataFrame({'A': [0.36, ''], 'B': [111, 81]})
print(df['A'].describe())
# count 2.00
# unique 2.00
# top 0.36
# freq 1.00
# dtype: float64
try:
    print(df.groupby(['B']).agg(np.mean))
except Exception as err:
    print(err)
# No numeric types to aggregate
print(df.groupby(['B']).agg(np.sum))
# A
# B
# 81
# 111 0.36
Aggregation using np.sum works because
In [103]: np.sum(pd.Series(['']))
Out[103]: ''
whereas np.mean(pd.Series([''])) raises
TypeError: Could not convert to numeric
To debug the problem, you could try to find the non-numeric value(s) using:
for val in df['A']:
    if not isinstance(val, float):
        print('Error: val = {!r}'.format(val))
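If the culprit does turn out to be empty strings or similar, one common fix (a sketch of my own, assuming a pandas version that provides pd.to_numeric) is to coerce the column to a numeric dtype:

# Anything that cannot be parsed becomes NaN, and the column gets a float64 dtype.
df['A'] = pd.to_numeric(df['A'], errors='coerce')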
I have two Pandas TimeSeries, x and y, which I would like to sync "as of". I would like to find, for every element in x, the latest (by index) element in y that precedes it (by index value). For example, I would like to compute this new_x:
x new_x
---- -----
13:01 13:00
14:02 14:00
y
----
13:00
13:01
13:30
14:00
I am looking for a vectorized solution, not a Python loop. The time values are based on Numpy datetime64. The y array's length is in the order of millions, so O(n^2) solutions are probably not practical.
In some circles this operation is known as the "asof" join. Here is an implementation:
def diffCols(df1, df2):
    """ Find columns in df1 not present in df2.

    Return df1.columns - df2.columns, maintaining the order in which the
    resulting columns appear in df1.

    Parameters
    ----------
    df1 : pandas dataframe object
    df2 : pandas dataframe object

    Pandas already offers df1.columns - df2.columns, but unfortunately
    the original order of the resulting columns is not maintained.
    """
    return [i for i in df1.columns if i not in df2.columns]

def aj(df1, df2, overwriteColumns=True, inplace=False):
    """ KDB+ like asof join.

    Finds prevailing values of df2 as of df1's index. The resulting dataframe
    will have the same number of rows as df1.

    Parameters
    ----------
    df1 : Pandas dataframe
    df2 : Pandas dataframe
    overwriteColumns : boolean, default True
        The columns of df2 will overwrite the columns of df1 if they have the
        same name, unless overwriteColumns is set to False. In that case, this
        function will only join columns of df2 which are not present in df1.
    inplace : boolean, default False
        If True, adds columns of df2 to df1. Otherwise, create a new dataframe
        with columns of both df1 and df2.

    *Assumes both df1 and df2 have a datetime64 index.
    """
    joiner = lambda x: x.asof(df1.index)
    if not overwriteColumns:
        # Get columns of df2 not present in df1
        cols = diffCols(df2, df1)
        if len(cols) > 0:
            df2 = df2.ix[:, cols]
    result = df2.apply(joiner)
    if inplace:
        for i in result.columns:
            df1[i] = result[i]
        return df1
    else:
        return result
Internally, this uses pandas.Series.asof().
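A minimal sketch of that underlying call, with illustrative data of my own (not taken from the answer):

import pandas as pd

y_series = pd.Series([1.0, 2.0, 3.0],
                     index=pd.to_datetime(['2002-01-02 13:00',
                                           '2002-01-02 13:30',
                                           '2002-01-02 14:00']))
x_index = pd.to_datetime(['2002-01-02 13:01', '2002-01-02 14:02'])
# For each x timestamp, take the latest y value at or before it.
prevailing = y_series.asof(x_index)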
What about using Series.searchsorted() to return the index of y where you would insert x? You could then subtract one from that value and use it to index y.
In [1]: x
Out[1]:
0 1301
1 1402
In [2]: y
Out[2]:
0 1300
1 1301
2 1330
3 1400
In [3]: y[y.searchsorted(x)-1]
Out[3]:
0 1300
3 1400
note: the above example uses int64 Series
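For completeness, newer pandas versions (0.19 and later) ship pd.merge_asof, which performs this join directly; here is a minimal sketch matching the example in the question (my own addition, not part of the answers above):

import pandas as pd

x = pd.DataFrame({'t': pd.to_datetime(['2002-01-02 13:01', '2002-01-02 14:02'])})
y = pd.DataFrame({'new_x': pd.to_datetime(['2002-01-02 13:00', '2002-01-02 13:01',
                                           '2002-01-02 13:30', '2002-01-02 14:00'])})
# For each row of x, take the latest y timestamp strictly before it
# (allow_exact_matches=False makes the match strict, as in the example above).
result = pd.merge_asof(x, y, left_on='t', right_on='new_x',
                       allow_exact_matches=False)

Both frames must be sorted by their key columns, which they are here.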