How to parse data from time period to periodic time series with Pandas - pandas

I'm currently trying to parse data, which is given with a start and end date, from one dataframe to a second dataframe, which has a periodic timedateindex.
df1 is my input dataframe and I would like to parse it to df2-structure.
Actually I don't need the values itself, I just want to mark the period where they occur.
df1
Start End Value1 Value2
1 2018-01-02 15:20 2018-01-02 19:50 x Nan
2 2018-03-21 05:40 2018-03-22 11:20 a b
3 ...
df2
Value1 Value2
2018-01-02 15:10 False False
2018-01-02 15:20 True False
2018-01-02 15:30 True False
2018-01-02 15:40 True False
...
2018-01-02 19:50 True False
2018-01-02 20:00 False False
I got already the structure for df2, but I couldn't figure out how to transform the data.
date_rng=pd.date_range(start='2018-01-01', end='2018-12-31', freq='10min')
df2=pd.DataFrame(date_rng, columns=['Date'])
df2['datetime'] = pd.to_datetime(df2['Date'])
df2 = df2.set_index('datetime')
df2.drop(['Date'], axis=1, inplace=True)
Anyone can help?
many thanks

You could initialize df2 with all values set to False and then iterate through both dataframes and checking if the time in df2 is within one/ore more of the specified intervals in df1.
Here a working example:
# | create some dummy data
data = [{'Start': '2018-01-02 15:20', 'End': '2018-01-02 19:50', 'Value1': 'x', 'Value2': np.nan},
{'Start': '2018-01-01 00:00:00', 'End': '2018-01-01 00:15:00', 'Value1': 'a', 'Value2': np.nan}]
df1 = pd.DataFrame(data)
df1['Start'] = pd.to_datetime(df1['Start'])
df1['End'] = pd.to_datetime(df1['End'])
date_rng=pd.date_range(start='2018-01-01', end='2018-12-31', freq='10min')
df2=pd.DataFrame(date_rng, columns=['Date'])
df2['Date'] = pd.to_datetime(df2['Date'])
df2 = df2.set_index('Date', drop=True)
# | initialize all values with False
df2['Value1'] = False
df2['Value2'] = False
# | iterate through dataframes (also check if values are NaN)
for _, row_1 in df1.iterrows():
for index_2, row_2 in df2.iterrows():
if not pd.isnull(row_1['Value1']):
row_2['Value1'] = row_1['Start'] <= index_2 and row_1['End'] >= index_2
if not pd.isnull(row_1['Value2']):
row_2['Value2'] = row_1['Start'] <= index_2 and row_1['End'] >= index_2
output:
Value1 Value2
Date
2018-01-01 00:00:00 True False
2018-01-01 00:10:00 True False
2018-01-01 00:20:00 False False
2018-01-01 00:30:00 False False
.
.
.

Related

How to determine the end of a non-NaN series in pandas

For a data frame
df = pd.DataFrame([[np.nan, 3.0, 7.0], [0.0, 5.0, 8.0], [0.0, 0.0, 0.0], [1.0, 3.0, np.nan], [1.0, np.nan, np.nan]],
columns=[1, 2, 3], index=pd.date_range('20180101', periods=5))
which is
1 2 3
2018-01-01 NaN 3.0 7.0
2018-01-02 0.0 5.0 8.0
2018-01-03 0.0 0.0 0.0
2018-01-04 1.0 3.0 NaN
2018-01-05 1.0 NaN NaN
I would like know when a non-NaN series (column) is over. The resulting data frame should look
1 2 3
2018-01-01 False False False
2018-01-02 False False False
2018-01-03 False False False
2018-01-04 False False True
2018-01-05 False True True
I tried to work with
df.apply(lambda x: x.last_valid_index())
which results in
1 2018-01-05
2 2018-01-04
3 2018-01-03
So far so good. But how to continue? All solutions (also those not containing last_valid_index()) are welcome!
Use back filling missing values with test missing values:
df1 = df.bfill().isna()
print (df1)
1 2 3
2018-01-01 False False False
2018-01-02 False False False
2018-01-03 False False False
2018-01-04 False False True
2018-01-05 False True True
Detail:
print (df.bfill())
1 2 3
2018-01-01 0.0 3.0 7.0
2018-01-02 0.0 5.0 8.0
2018-01-03 0.0 0.0 0.0
2018-01-04 1.0 3.0 NaN
2018-01-05 1.0 NaN NaN

Replace a string value with NaN in pandas data frame - Python

Do I have to replace the value? with NaN so you can invoke the .isnull () method. I have found several solutions but some errors are always returned. Suppose:
data = pd.DataFrame([[1,?,5],[?,?,4],[?,32.1,1]])
and if I try:
pd.data.replace('?', np.nan)
I have:
0 1 2
0 1.0 NaN 5
1 NaN NaN 4
2 NaN 32.1 1
but data.isnull() returns:
0 1 2
0 False False False
1 False False False
2 False False False
Why?
I think you forget assign back:
data = pd.DataFrame([[1,'?',5],['?','?',4],['?',32.1,1]])
data = data.replace('?', np.nan)
#alternative
#data.replace('?', np.nan, inplace=True)
print (data)
0 1 2
0 1.0 NaN 5
1 NaN NaN 4
2 NaN 32.1 1
print (data.isnull())
0 1 2
0 False True False
1 True True False
2 True False False
# a dataframe with string values
dat = pd.DataFrame({'a':[1,'FG', 2, 4], 'b':[2, 5, 'NA', 7]})
Removing non numerical elements from the dataframe:
"Method 1 - with regex"
dat2 = dat.replace(r'^([A-Za-z]|[0-9]|_)+$', np.NaN, regex=True)
dat2
"Method 2 - with pd.to_numeric"
dat3 = pd.DataFrame()
for col in dat.columns:
dat3[col] = pd.to_numeric(dat[col], errors='coerce')
dat3
? is a not null. So you will expect to get a False under the isnull test
>>> data = pandas.DataFrame([[1,'?',5],['?','?',4],['?',32.1,1]])
>>> data
0 1 2
0 False False False
1 False False False
2 False False False
After you replace ? with NaN the test will look much different
>>> data = data.replace('?', np.nan)
>>> data
0 1 2
0 False True False
1 True True False
2 True False False
I believe when you are doing pd.data.replace('?', np.nan) this action is not done in place, so you must try -
data = data.replace('?', np.nan)

Pandas - Find and index rows that match row sequence pattern

I would like to find a pattern in a dataframe in a categorical variable going down rows. I can see how to use Series.shift() to look up / down and using boolean logic to find the pattern, however, I want to do this with a grouping variable and also label all rows that are part of the pattern, not just the starting row.
Code:
import pandas as pd
from numpy.random import choice, randn
import string
# df constructor
n_rows = 1000
df = pd.DataFrame({'date_time': pd.date_range('2/9/2018', periods=n_rows, freq='H'),
'group_var': choice(list(string.ascii_uppercase), n_rows),
'row_pat': choice([0, 1, 2, 3], n_rows),
'values': randn(n_rows)})
# sorting
df.sort_values(by=['group_var', 'date_time'], inplace=True)
df.head(10)
Which returns this:
I can find the start of the pattern (with no grouping though) by this:
# the row ordinal pattern to detect
p0, p1, p2, p3 = 1, 2, 2, 0
# flag the row at the start of the pattern
df['pat_flag'] = \
df['row_pat'].eq(p0) & \
df['row_pat'].shift(-1).eq(p1) & \
df['row_pat'].shift(-2).eq(p2) & \
df['row_pat'].shift(-3).eq(p3)
df.head(10)
What i cant figure out, is how to do this only withing the "group_var", and instead of returning True for the start of the pattern, return true for all rows that are part of the pattern.
Appreciate any tips on how to solve this!
Thanks...
I think you have 2 ways - simplier and slowier solution or faster complicated.
use Rolling.apply and test pattern
replace 0s to NaNs by mask
use bfill with limit (same as fillna with method='bfill') for repeat 1
then fillna NaNs to 0
last cast to bool by astype
pat = np.asarray([1, 2, 2, 0])
N = len(pat)
df['rm0'] = (df['row_pat'].rolling(window=N , min_periods=N)
.apply(lambda x: (x==pat).all())
.mask(lambda x: x == 0)
.bfill(limit=N-1)
.fillna(0)
.astype(bool)
)
If is important performance, use strides, solution from link was modify:
use rolling window approach
compare with pattaern and return Trues for match by all
get indices of first occurencies by np.mgrid and indexing
create all indices with list comprehension
compare by numpy.in1d and create new column
def rolling_window(a, window):
shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
strides = a.strides + (a.strides[-1],)
c = np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)
return c
arr = df['row_pat'].values
b = np.all(rolling_window(arr, N) == pat, axis=1)
c = np.mgrid[0:len(b)][b]
d = [i for x in c for i in range(x, x+N)]
df['rm2'] = np.in1d(np.arange(len(arr)), d)
Another solution, thanks #divakar:
arr = df['row_pat'].values
b = np.all(rolling_window(arr, N) == pat, axis=1)
m = (rolling_window(arr, len(pat)) == pat).all(1)
m_ext = np.r_[m,np.zeros(len(arr) - len(m), dtype=bool)]
df['rm1'] = binary_dilation(m_ext, structure=[1]*N, origin=-(N//2))
Timings:
np.random.seed(456)
import pandas as pd
from numpy.random import choice, randn
from scipy.ndimage.morphology import binary_dilation
import string
# df constructor
n_rows = 100000
df = pd.DataFrame({'date_time': pd.date_range('2/9/2018', periods=n_rows, freq='H'),
'group_var': choice(list(string.ascii_uppercase), n_rows),
'row_pat': choice([0, 1, 2, 3], n_rows),
'values': randn(n_rows)})
# sorting
df.sort_values(by=['group_var', 'date_time'], inplace=True)
def rolling_window(a, window):
shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
strides = a.strides + (a.strides[-1],)
c = np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)
return c
arr = df['row_pat'].values
b = np.all(rolling_window(arr, N) == pat, axis=1)
m = (rolling_window(arr, len(pat)) == pat).all(1)
m_ext = np.r_[m,np.zeros(len(arr) - len(m), dtype=bool)]
df['rm1'] = binary_dilation(m_ext, structure=[1]*N, origin=-(N//2))
arr = df['row_pat'].values
b = np.all(rolling_window(arr, N) == pat, axis=1)
c = np.mgrid[0:len(b)][b]
d = [i for x in c for i in range(x, x+N)]
df['rm2'] = np.in1d(np.arange(len(arr)), d)
print (df.iloc[460:480])
date_time group_var row_pat values rm0 rm1 rm2
12045 2019-06-25 21:00:00 A 3 -0.081152 False False False
12094 2019-06-27 22:00:00 A 1 -0.818167 False False False
12125 2019-06-29 05:00:00 A 0 -0.051088 False False False
12143 2019-06-29 23:00:00 A 0 -0.937589 False False False
12145 2019-06-30 01:00:00 A 3 0.298460 False False False
12158 2019-06-30 14:00:00 A 1 0.647161 False False False
12164 2019-06-30 20:00:00 A 3 -0.735538 False False False
12210 2019-07-02 18:00:00 A 1 -0.881740 False False False
12341 2019-07-08 05:00:00 A 3 0.525652 False False False
12343 2019-07-08 07:00:00 A 1 0.311598 False False False
12358 2019-07-08 22:00:00 A 1 -0.710150 True True True
12360 2019-07-09 00:00:00 A 2 -0.752216 True True True
12400 2019-07-10 16:00:00 A 2 -0.205122 True True True
12404 2019-07-10 20:00:00 A 0 1.342591 True True True
12413 2019-07-11 05:00:00 A 1 1.707748 False False False
12506 2019-07-15 02:00:00 A 2 0.319227 False False False
12527 2019-07-15 23:00:00 A 3 2.130917 False False False
12600 2019-07-19 00:00:00 A 1 -1.314070 False False False
12604 2019-07-19 04:00:00 A 0 0.869059 False False False
12613 2019-07-19 13:00:00 A 2 1.342101 False False False
In [225]: %%timeit
...: df['rm0'] = (df['row_pat'].rolling(window=N , min_periods=N)
...: .apply(lambda x: (x==pat).all())
...: .mask(lambda x: x == 0)
...: .bfill(limit=N-1)
...: .fillna(0)
...: .astype(bool)
...: )
...:
1 loop, best of 3: 356 ms per loop
In [226]: %%timeit
...: arr = df['row_pat'].values
...: b = np.all(rolling_window(arr, N) == pat, axis=1)
...: c = np.mgrid[0:len(b)][b]
...: d = [i for x in c for i in range(x, x+N)]
...: df['rm2'] = np.in1d(np.arange(len(arr)), d)
...:
100 loops, best of 3: 7.63 ms per loop
In [227]: %%timeit
...: arr = df['row_pat'].values
...: b = np.all(rolling_window(arr, N) == pat, axis=1)
...:
...: m = (rolling_window(arr, len(pat)) == pat).all(1)
...: m_ext = np.r_[m,np.zeros(len(arr) - len(m), dtype=bool)]
...: df['rm1'] = binary_dilation(m_ext, structure=[1]*N, origin=-(N//2))
...:
100 loops, best of 3: 7.25 ms per loop
You could make use of the pd.rolling() methods and then simply compare the arrays that it returns with the array that contains the pattern that you are attempting to match on.
pattern = np.asarray([1.0, 2.0, 2.0, 0.0])
n_obs = len(pattern)
df['rolling_match'] = (df['row_pat']
.rolling(window=n_obs , min_periods=n_obs)
.apply(lambda x: (x==pattern).all())
.astype(bool) # All as bools
.shift(-1 * (n_obs - 1)) # Shift back
.fillna(False) # convert NaNs to False
)
It is important to specify the min periods here in order to ensure that you only find exact matches (and so the equality check won't fail when the shapes are misaligned). The apply function is doing a pairwise check between the two arrays, and then we use the .all() to ensure all match. We convert to a bool, and then call shift on the function to move it to being a 'forward looking' indicator instead of only occurring after the fact.
Help on the rolling functionality available here -
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.rolling.html
This works.
It works like this:
a) For every group, it takes a window of size 4 and scans through the column until it finds the combination (1,2,2,0) in exact sequence. As soon as it finds the sequence, it populates the corresponding index values of new column 'pat_flag' with 1.
b) If it doesn't find the combination, it populates the column with 0.
pattern = [1,2,2,0]
def get_pattern(df):
df = df.reset_index(drop=True)
df['pat_flag'] = 0
get_indexes = []
temp = []
for index, row in df.iterrows():
mindex = index +1
# get the next 4 values
for j in range(mindex, mindex+4):
if j == df.shape[0]:
break
else:
get_indexes.append(j)
temp.append(df.loc[j,'row_pat'])
# check if sequence is matched
if temp == pattern:
df.loc[get_indexes,'pat_flag'] = 1
else:
# reset if the pattern is not found in given window
temp = []
get_indexes = []
return df
# apply function to the groups
df = df.groupby('group_var').apply(get_pattern)
## snippet of output
date_time group_var row_pat values pat_flag
41 2018-03-13 21:00:00 C 3 0.731114 0
42 2018-03-14 05:00:00 C 0 1.350164 0
43 2018-03-14 11:00:00 C 1 -0.429754 1
44 2018-03-14 12:00:00 C 2 1.238879 1
45 2018-03-15 17:00:00 C 2 -0.739192 1
46 2018-03-18 06:00:00 C 0 0.806509 1
47 2018-03-20 06:00:00 C 1 0.065105 0
48 2018-03-20 08:00:00 C 1 0.004336 0
Expanding on Emmet02's answer: using the rolling function for all groups and setting match-column to 1 for all matching pattern indices:
pattern = np.asarray([1,2,2,0])
# Create a match column in the main dataframe
df.assign(match=False, inplace=True)
for group_var, group in df.groupby("group_var"):
# Per group do rolling window matching, the last
# values of matching patterns in array 'match'
# will be True
match = (
group['row_pat']
.rolling(window=len(pattern), min_periods=len(pattern))
.apply(lambda x: (x==pattern).all())
)
# Get indices of matches in current group
idx = np.arange(len(group))[match == True]
# Include all indices of matching pattern,
# counting back from last index in pattern
idx = idx.repeat(len(pattern)) - np.tile(np.arange(len(pattern)), len(idx))
# Update matches
match.values[idx] = True
df.loc[group.index, 'match'] = match
df[df.match==True]
edit: Without a for loop
# Do rolling matching in group clause
match = (
df.groupby("group_var")
.rolling(len(pattern))
.row_pat.apply(lambda x: (x==pattern).all())
)
# Convert NaNs
match = (~match.isnull() & match)
# Get indices of matches in current group
idx = np.arange(len(df))[match]
# Include all indices of matching pattern
idx = idx.repeat(len(pattern)) - np.tile(np.arange(len(pattern)), len(idx))
# Mark all indices that are selected by "idx" in match-column
df = df.assign(match=df.index.isin(df.index[idx]))
You can do this by defining a custom aggregate function, then using it in group_by statement, finally merge it back to the original dataframe. Something like this:
Aggregate function:
def pattern_detect(column):
# define any other pattern to detect here
p0, p1, p2, p3 = 1, 2, 2, 0
column.eq(p0) & \
column.shift(-1).eq(p1) & \
column.shift(-2).eq(p2) & \
column.shift(-3).eq(p3)
return column.any()
Use group by function next:
grp = df.group_by('group_var').agg([patter_detect])['row_pat']
Now merge it back to the original dataframe:
df = df.merge(grp, left_on='group_var',right_index=True, how='left')

check if at least one value exists in pandas dataframe index

How can I check if at least one of the index of df2 is in df1?
df1
Val
StartDate
2015-03-31 NaN
2015-04-03 NaN
2015-04-05 8.08
2015-04-06 23.48
df2
Val
StartDate
2015-03-31 True
2015-04-01 True
2015-04-02 True
2015-04-03 True
2015-04-04 True
2015-04-05 True
2015-04-06 True
df2.index in df1.index
returns False
Use Index.isin with Index.any for check at least one True:
a = df1.index.isin(df2.index).any()
print (a)
True
Detail:
print (df1.index.isin(df2.index))
[ True True True True]
[i for i in df1.index if i in df2.index]

returning same initial input when filtering dataframes' values

I have the following dataframe I have obtained from the read_html pandas' property.
A        1.48        2.64    1.02         2.46   2.73
B       658.4        14.33    7.41        15.35   8.59
C        3.76         2.07    4.61         2.26   2.05
D   513854.86         5.70    0.00         5.35  30.16
I would like to remove the rows that are over 150 so I did adf1= df[df > 150], however it returns the same table.
Then I thought to include in the decimals in the routeroute = pd.read_html(https//route , decimal='.') and continues returning the same initial dataframe with no filters.
This would be my desired output:
A        1.48        2.64    1.02         2.46   2.73
C        3.76         2.07    4.61         2.26   2.05
Need:
print (df)
0 1 2 3 4 5
0 A 1.48 2.64 1.02 2.46 2.73
1 B 658.40 14.33 7.41 15.35 8.59
2 C 3.76 2.07 4.61 2.26 2.05
3 D 513854.86 5.70 0.00 5.35 30.16
df1 = df[~(df.iloc[:, 1:] > 150).any(1)]
print (df1)
0 1 2 3 4 5
0 A 1.48 2.64 1.02 2.46 2.73
2 C 3.76 2.07 4.61 2.26 2.05
Or:
df1 = df[(df.iloc[:, 1:] <= 150).all(1)]
print (df1)
0 1 2 3 4 5
0 A 1.48 2.64 1.02 2.46 2.73
2 C 3.76 2.07 4.61 2.26 2.05
Explanation:
First select all columns without first by iloc:
print (df.iloc[:, 1:])
1 2 3 4 5
0 1.48 2.64 1.02 2.46 2.73
1 658.40 14.33 7.41 15.35 8.59
2 3.76 2.07 4.61 2.26 2.05
3 513854.86 5.70 0.00 5.35 30.16
Then compare - get boolean DataFrame:
print (df.iloc[:, 1:] > 150)
1 2 3 4 5
0 False False False False False
1 True False False False False
2 False False False False False
3 True False False False False
print (df.iloc[:, 1:] <= 150)
1 2 3 4 5
0 True True True True True
1 False True True True True
2 True True True True True
3 False True True True True
Then use all for check if all values in row has Trues
or any for check if at least one value is True:
print ((df.iloc[:, 1:] > 150).any(1))
0 False
1 True
2 False
3 True
dtype: bool
print ((df.iloc[:, 1:] <= 150).all(1))
0 True
1 False
2 True
3 False
dtype: bool
Last first Series invert with ~ and filter by boolean indexing.