Pandas - Slice between two indexes

I need to process data with TensorFlow for classification, and for that I need to create a DataFrame for each unit that was processed in my machine. The machine continuously writes process data and also logs when a unit enters and leaves the machine.
A value in 'uid_in' means the unit with the logged number entered the machine; a value in 'uid_out' means the unit left the machine.
I need to create a DataFrame like this for each unit processed by the machine.
[...]
time uhz1 uhz2 lhz1 lh2 uid_in uid_out
5 08:05:00 201 200 101 100 1.0 NaN #Unit1 enters the machine
6 08:06:00 201 200 99 101 2.0 NaN
[...]
14 08:14:00 199 199 99 101 10.0 NaN
15 08:15:00 201 201 100 100 11.0 1.0 #Unit1 leaves the machine
[...]
How can I create the DataFrame df.loc[enter:leave] for each unit automatically?
When I try to pass a DataFrame index to df.loc it does not work:
start = df[df.uid_in.isin([123])]   # this is a whole filtered DataFrame, not an index label
end = df[df.uid_out.isin([123])]    # same here
unit1_df = df.loc[start:end]        # so this slice fails

Your code almost worked out!
Original DataFrame:
time uhz1 uhz2 lhz1 lh2 uid_in uid_out
0 08:00:00 201 199 100 100 NaN NaN
1 08:01:00 199 199 100 99 NaN NaN
[...]
5 08:05:00 201 200 101 100 1.0 NaN
[...]
55 08:55:00 241 241 140 140 NaN 41.0
[...]
58 08:58:00 244 244 143 143 NaN NaN
59 08:59:00 245 245 144 144 NaN NaN
New code:
start = df[df.uid_in.eq(1.0)].index[0]
end = df[df.uid_out.eq(1.0)].index[0]
unit1_df = df.loc[start:end]
print(unit1_df)
Output
time uhz1 uhz2 lhz1 lh2 uid_in uid_out
5 08:05:00 201 200 101 100 1.0 NaN
6 08:06:00 201 200 99 101 2.0 NaN
[...]
14 08:14:00 199 199 99 101 10.0 NaN
15 08:15:00 201 201 100 100 11.0 1.0

I think you were pretty close. I modified your statements to pick out the actual start and end index labels, as Ian indicated.
""" time uhz1 uhz2 lhz1 lh2 uid_in uid_out
5 08:05:00 201 200 101 100 1.0 NaN
6 08:06:00 201 200 99 101 2.0 NaN
14 08:14:00 199 199 99 101 10.0 NaN
15 08:15:00 201 201 100 100 11.0 1.0
"""
import pandas as pd
df = pd.read_clipboard()
start = df[df.uid_in.eq(1.0)].index[0]
end = df[df.uid_out.eq(1.0)].index[0]
unit1_df = df.loc[start:end]
unit1_df
Output:
time uhz1 uhz2 lhz1 lh2 uid_in uid_out
5 08:05:00 201 200 101 100 1.0 NaN
6 08:06:00 201 200 99 101 2.0 NaN
14 08:14:00 199 199 99 101 10.0 NaN
15 08:15:00 201 201 100 100 11.0 1.0
One-liner:
unit1_df = df.loc[df[df.uid_in.eq(1.0)].index[0]:df[df.uid_out.eq(1.0)].index[0]]
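To cover the "for each unit automatically" part of the question, here is a minimal sketch (my addition, not from the answers above; it assumes every uid logged in uid_in later shows up in uid_out, and unit_dfs is just a hypothetical container name):
unit_dfs = {}
for uid in df.uid_in.dropna().unique():
    start = df[df.uid_in.eq(uid)].index[0]
    leave = df[df.uid_out.eq(uid)].index
    if len(leave) == 0:
        continue  # unit never logged out of the machine; skip it
    unit_dfs[uid] = df.loc[start:leave[0]]
unit_dfs[1.0] is then the same slice as unit1_df above.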

Diff() function use with groupby for pandas

I am encountering an error each time I attempt to compute the difference in readings for a meter in my dataset. The dataset structure is this:
id paymenttermid houseid houseid-meterid quantity month year cleaned_quantity
Datetime
2019-02-01 255 water 215 215M201 23.0 2 2019 23.0
2019-02-01 286 water 193 193M181 24.0 2 2019 24.0
2019-02-01 322 water 172 172M162 22.0 2 2019 22.0
2019-02-01 323 water 176 176M166 61.0 2 2019 61.0
2019-02-01 332 water 158 158M148 15.0 2 2019 15.0
I am attempting to generate a new column called consumption that computes the difference in quantities consumed for each house (identified by houseid-meterid) from month to month.
The code I am using to implement this is:
water_df["consumption"] = water_df.groupby(["year", "month", "houseid-meterid"])["cleaned_quantity"].diff(-1)
After executing this code, the consumption column is filled with NaN values. How can I correctly implement this logic?
The end result looks like this:
id paymenttermid houseid houseid-meterid quantity month year cleaned_quantity consumption
Datetime
2019-02-01 255 water 215 215M201 23.0 2 2019 23.0 NaN
2019-02-01 286 water 193 193M181 24.0 2 2019 24.0 NaN
2019-02-01 322 water 172 172M162 22.0 2 2019 22.0 NaN
2019-02-01 323 water 176 176M166 61.0 2 2019 61.0 NaN
2019-02-01 332 water 158 158M148 15.0 2 2019 15.0 NaN
Many thanks in advance.
I have attempted to use
water_df["consumption"] = water_df.groupby(["year", "month", "houseid-meterid"])["cleaned_quantity"].diff(-1)
and
water_df["consumption"] = water_df.groupby(["year", "month", "houseid-meterid"])["cleaned_quantity"].diff(0)
and
water_df["consumption"] = water_df.groupby(["year", "month", "houseid-meterid"])["cleaned_quantity"].diff()
All of these commands result in the same behaviour as stated above.
Expected output should be:
Datetime houseid-meterid cleaned_quantity consumption
2019-02-01 215M201 23.0 20
2019-03-02 215M201 43.0 9
2019-04-01 215M201 52.0 12
2019-05-01 215M201 64.0 36
2019-06-01 215M201 100.0 20
What steps should I take?
Sort the values by Datetime (if needed), group by houseid-meterid only, compute the diff of the cleaned_quantity values, then shift the result one row up to align each difference with the right record. Your original code returns only NaN because grouping by year, month and houseid-meterid leaves a single row per group, so diff has nothing to compare against:
df['consumption'] = (df.sort_values('Datetime')
                       .groupby('houseid-meterid')['cleaned_quantity']
                       .transform(lambda x: x.diff().shift(-1)))
print(df)
# Output
Datetime houseid-meterid cleaned_quantity consumption
0 2019-02-01 215M201 23.0 20.0
1 2019-03-02 215M201 43.0 9.0
2 2019-04-01 215M201 52.0 12.0
3 2019-05-01 215M201 64.0 36.0
4 2019-06-01 215M201 100.0 NaN
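As a self-contained check, here is a minimal sketch; the frame is rebuilt from the question's expected output, so the values are illustrative only:
import pandas as pd
df = pd.DataFrame({
    'Datetime': pd.to_datetime(['2019-02-01', '2019-03-02', '2019-04-01',
                                '2019-05-01', '2019-06-01']),
    'houseid-meterid': ['215M201'] * 5,
    'cleaned_quantity': [23.0, 43.0, 52.0, 64.0, 100.0],
})
df['consumption'] = (df.sort_values('Datetime')
                       .groupby('houseid-meterid')['cleaned_quantity']
                       .transform(lambda x: x.diff().shift(-1)))
# consumption comes out as 20.0, 9.0, 12.0, 36.0, NaN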

how to extract values from previous dataframe based on row and column condition?

Sorry for the naive question, but I can't solve this. Any reference or solution?
df1 =
date a b c
0 2011-12-30 100 400 700
1 2021-01-30 200 500 800
2 2021-07-30 300 600 900
df2 =
date c b
0 2021-07-30 NaN NaN
1 2021-01-30 NaN NaN
2 2011-12-30 NaN NaN
desired output:
date c b
0 2021-07-30 900 600
1 2021-01-30 800 500
2 2011-12-30 700 400
Use DataFrame.fillna after converting date to the index in both DataFrames:
df = df2.set_index('date').fillna(df1.set_index('date')).reset_index()
print (df)
date c b
0 2021-07-30 900.0 600.0
1 2021-01-30 800.0 500.0
2 2011-12-30 700.0 400.0
You can reindex_like df2 after setting date as a temporary index:
out = df1.set_index('date').reindex_like(df2.set_index('date')).reset_index()
output:
date c b
0 2021-07-30 900 600
1 2021-01-30 800 500
2 2011-12-30 700 400
Another possible solution, using pandas.DataFrame.update:
df2 = df2.set_index('date')
df2.update(df1.set_index('date'))
df2.reset_index()
Output:
date c b
0 2021-07-30 900.0 600.0
1 2021-01-30 800.0 500.0
2 2011-12-30 700.0 400.0
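Note the dtype difference visible above: fillna and update write into df2's all-NaN float columns, so the result comes back as float, while reindex_like simply pulls df1's integer values. For experimenting, a minimal reconstruction of the question's frames (my sketch; illustrative only):
import numpy as np
import pandas as pd
df1 = pd.DataFrame({'date': ['2011-12-30', '2021-01-30', '2021-07-30'],
                    'a': [100, 200, 300],
                    'b': [400, 500, 600],
                    'c': [700, 800, 900]})
df2 = pd.DataFrame({'date': ['2021-07-30', '2021-01-30', '2011-12-30'],
                    'c': [np.nan] * 3,
                    'b': [np.nan] * 3})
out = df1.set_index('date').reindex_like(df2.set_index('date')).reset_index()
out then has df2's row order and columns (date, c, b) filled with df1's integer values.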

Pandas resample is jumbling date order

I'm trying to resample some tick data I have into 1-minute blocks. The code appears to work fine, but when I look into the resulting dataframe the order of the dates has changed incorrectly. Below is what it looks like pre-resample:
Var2 Var3 Var4 Var5 Var6 Var7 Var8 Var9 Var10
2020-06-30 17:00:00 41.68 2 tptBid tctRegular NaN 255 NaN 0 msNormal
2020-06-30 17:00:00 41.71 3 tptAsk tctRegular NaN 255 NaN 0 msNormal
2020-06-30 17:00:00 41.68 1 tptTrade tctRegular NaN 255 NaN 0 msNormal
2020-06-30 17:00:00 41.71 5 tptAsk tctRegular NaN 255 NaN 0 msNormal
2020-06-30 17:00:00 41.71 8 tptAsk tctRegular NaN 255 NaN 0 msNormal
... ... ... ... ... ... ... ... ... ...
2020-01-07 17:00:21 41.94 5 tptBid tctRegular NaN 255 NaN 0 msNormal
2020-01-07 17:00:27 41.94 4 tptBid tctRegular NaN 255 NaN 0 msNormal
2020-01-07 17:00:40 41.94 3 tptBid tctRegular NaN 255 NaN 0 msNormal
2020-01-07 17:00:46 41.94 4 tptBid tctRegular NaN 255 NaN 0 msNormal
2020-01-07 17:00:50 41.94 3 tptBid tctRegular NaN 255 NaN 0 msNormal
As you can see, the data starts at 5 pm on the 30th of June. Then I use this code:
one_minute_dataframe['Price'] = df.Var2.resample('1min').last()
one_minute_dataframe['Volume'] = df.Var3.resample('1min').sum()
one_minute_dataframe.index = pd.to_datetime(one_minute_dataframe.index)
one_minute_dataframe.sort_index(inplace = True)
And I get the following:
Price Volume
2020-01-07 00:00:00 41.73 416
2020-01-07 00:01:00 41.74 198
2020-01-07 00:02:00 41.76 40
2020-01-07 00:03:00 41.74 166
2020-01-07 00:04:00 41.77 143
... ... ...
2020-06-30 23:55:00 41.75 127
2020-06-30 23:56:00 41.74 234
2020-06-30 23:57:00 41.76 344
2020-06-30 23:58:00 41.72 354
2020-06-30 23:59:00 41.74 451
It seems to want to start from midnight on the 1st of July, but I've tried sorting the index and it still doesn't change.
Also, the datetime index seems to add lots of dates outside the ones that were originally in the dataframe and plonks them in the middle of the resampled one.
Any help would be great. Apologies if I've set this out poorly.
I see what's happened. Somewhere in the data download the month and day have been switched around. That's why it's putting July at the top, because it thinks it's January.
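For anyone hitting the same problem, a sketch of the usual fix (my addition; it assumes the raw timestamps arrive as day-first strings such as '30/06/2020 17:00:00' in a hypothetical raw_timestamps sequence, so adjust to the actual file):
import pandas as pd
# Tell the parser the strings are day-first instead of letting it guess month-first:
df.index = pd.to_datetime(raw_timestamps, dayfirst=True)
# Or, when the exact layout is known, an explicit format is stricter:
# df.index = pd.to_datetime(raw_timestamps, format='%d/%m/%Y %H:%M:%S')
This has to be applied to the original strings; once the values have been mis-parsed into datetimes, dayfirst can no longer swap them back. The extra rows in the middle are expected, by the way: resample('1min') produces a regular 1-minute grid over the whole time span, including periods with no ticks.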

pandas: rolling mean on time interval plus grouping on index

I am trying to find the 7-day rolling average for each hour of the day for a category. The data frame is indexed on the category id, and there is a timestamp plus other columns:
id name ds time x y z
6 red 2020-02-14 00:00:00 10 20 30
6 red 2020-02-14 01:00:00 20 40 50
6 red 2020-02-14 02:00:00 20 20 60
...
6 red 2020-02-21 00:00:00 20 30 60
6 red 2020-02-21 01:00:00 20 40 60
6 red 2020-02-21 02:00:00 10 40 60
7 green 2020-02-14 00:00:00 10 20 30
7 green 2020-02-14 01:00:00 20 40 50
7 green 2020-02-14 02:00:00 20 20 60
...
7 green 2020-02-21 00:00:00 20 30 60
7 green 2020-02-21 01:00:00 20 40 60
7 green 2020-02-21 02:00:00 10 40 60
What I would like as an output (with the rolling columns filled by the rolling mean where not NaN):
id name ds time x y z rolling_x rolling_y rolling_z
6 red 2020-02-14 00:00:00 10 20 30 NaN NaN NaN
6 red 2020-02-14 01:00:00 20 40 50 NaN NaN NaN
6 red 2020-02-14 02:00:00 20 20 60 NaN NaN NaN
...
6 red 2020-02-21 00:00:00 20 30 60
6 red 2020-02-21 01:00:00 20 40 60
6 red 2020-02-21 02:00:00 10 40 60
7 green 2020-02-14 00:00:00 10 20 30 NaN NaN NaN
7 green 2020-02-14 01:00:00 20 40 50 NaN NaN NaN
7 green 2020-02-14 02:00:00 20 20 60 NaN NaN NaN
...
7 green 2020-02-21 00:00:00 20 30 60
7 green 2020-02-21 01:00:00 20 40 60
7 green 2020-02-21 02:00:00 10 40 60
My approach: extract the day and hour of each timestamp, compute a 7-day rolling mean per (id, hour) pair with the day as the rolling window index, then merge the result back onto the original rows on id, hour and day:
df = df.assign(day=df['ds time'].dt.normalize(),
               hour=df['ds time'].dt.hour)
ret_df = df.merge(df.drop('ds time', axis=1)
                    .set_index('day')
                    .groupby(['id', 'hour']).rolling('7D').mean()
                    .drop(['hour', 'id'], axis=1),
                  on=['id', 'hour', 'day'],
                  how='left',
                  suffixes=['', '_roll']
                  ).drop(['day', 'hour'], axis=1)
Sample data:
import numpy as np
import pandas as pd
dates = pd.date_range('2020-02-21', '2020-02-25', freq='H')
np.random.seed(1)
df = pd.DataFrame({
    'id': np.repeat([6, 7], len(dates)),
    'ds time': np.tile(dates, 2),
    'X': np.arange(len(dates) * 2),
    'Y': np.random.randint(0, 10, len(dates) * 2)
})
df.head()
Output ret_df.head():
id ds time X Y X_roll Y_roll
0 6 2020-02-21 00:00:00 0 5 0.0 5.0
1 6 2020-02-21 01:00:00 1 8 1.0 8.0
2 6 2020-02-21 02:00:00 2 9 2.0 9.0
3 6 2020-02-21 03:00:00 3 5 3.0 5.0
4 6 2020-02-21 04:00:00 4 0 4.0 0.0
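If the data is strictly hourly with no gaps, a simpler fixed-window variant gives the same result (an extra sketch, not part of the answer above; it assumes exactly one row per id per hour, so 7 consecutive rows at a fixed hour of day span exactly 7 days):
hour = df['ds time'].dt.hour
for col in ['X', 'Y']:
    df[col + '_roll'] = (df.groupby(['id', hour])[col]
                           .transform(lambda s: s.rolling(7, min_periods=1).mean()))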

Pandas: add date field to parsed timestamp

I have several date-specific text files (for example 20150211.txt) that look like:
TopOfBook 0x21 60 07:15:00.862 101 85 5 109 500 24 +
TopOfBook 0x21 60 07:15:00.882 101 91 400 109 500 18 +
TopOfBook 0x21 60 07:15:00.890 101 91 400 105 80 14 +
TopOfBook 0x21 60 07:15:00.914 101 93.3 400 105 80 11.7 +
where the 4th column contains the timestamp.
If I read this into pandas with automatic date parsing
df_top = pd.read_csv('TOP_20150210.txt', sep='\t', names=hdr_top, parse_dates=[3])
I get:
0 TopOfBook 0x21 60 2015-05-17 07:15:00.862000 101 85.0 5 109.0 500 24.0 +
1 TopOfBook 0x21 60 2015-05-17 07:15:00.882000 101 91.0 400 109.0 500 18.0 +
2 TopOfBook 0x21 60 2015-05-17 07:15:00.890000 101 91.0 400 105.0 80 14.0 +
Where the time part of course is correct, but how do I add the correct date part of this timestamp (2015-02-11)? Thank you
After parsing the dates, column 3 (the timestamp column) has dtype <M8[ns]. This is the NumPy datetime64 dtype with nanosecond resolution. You can do fast date arithmetic by adding or subtracting NumPy timedelta64 values.
So, for example, subtracting 6 days from df[3] yields
In [139]: df[3] - np.array([6], dtype='<m8[D]')
Out[139]:
0 2015-05-11 07:15:00.862000
1 2015-05-11 07:15:00.882000
2 2015-05-11 07:15:00.890000
3 2015-05-11 07:15:00.914000
Name: 3, dtype: datetime64[ns]
To find the correct number of days to subtract, you could use
today = df.iloc[0,3]
date = pd.Timestamp(re.search(r'\d+', filename).group())
n = (today-date).days
import datetime as DT
import numpy as np
import pandas as pd
import re
filename = '20150211.txt'
df = pd.read_csv(filename, sep='\t', header=None, parse_dates=[3])
today = df.iloc[0,3]
date = pd.Timestamp(re.search(r'\d+', filename).group())
n = (today-date).days
df[3] -= np.array([n], dtype='<m8[D]')
print(df)
yields
0 1 2 3 4 5 6 7 8 \
0 TopOfBook 0x21 60 2015-02-11 07:15:00.862000 101 85.0 5 109 500
1 TopOfBook 0x21 60 2015-02-11 07:15:00.882000 101 91.0 400 109 500
2 TopOfBook 0x21 60 2015-02-11 07:15:00.890000 101 91.0 400 105 80
3 TopOfBook 0x21 60 2015-02-11 07:15:00.914000 101 93.3 400 105 80
9
0 24.0
1 18.0
2 14.0
3 11.7
You could use apply and construct the datetime with your desired date values, copying the time portion into the constructor:
In [9]:
import datetime as dt
df[3] = df[3].apply(lambda x: dt.datetime(2015,2,11,x.hour,x.minute,x.second,x.microsecond))
df
Out[9]:
0 1 2 3 4 5 6 7 8 \
0 TopOfBook 0x21 60 2015-02-11 07:15:00.862000 101 85.0 5 109 500
1 TopOfBook 0x21 60 2015-02-11 07:15:00.882000 101 91.0 400 109 500
2 TopOfBook 0x21 60 2015-02-11 07:15:00.890000 101 91.0 400 105 80
3 TopOfBook 0x21 60 2015-02-11 07:15:00.914000 101 93.3 400 105 80
9 10
0 24.0 +
1 18.0 +
2 14.0 +
3 11.7 +
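A vectorized alternative to the apply-based version, as a sketch (my addition; it keeps only the time-of-day and re-anchors it on the desired date, hard-coded here the same way as in the answer above):
import pandas as pd
# df[3] - df[3].dt.normalize() is a timedelta series holding just the clock time;
# adding it back onto the target date rebuilds the full timestamp.
target = pd.Timestamp('2015-02-11')
df[3] = target + (df[3] - df[3].dt.normalize())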