How do I get the rows where a certain value occurs? - pandas

I have orders_df:
Symbol Order Shares
Date
2011-01-10 AAPL BUY 1500
2011-01-13 AAPL SELL 1500
2011-01-13 IBM BUY 4000
2011-01-26 GOOG BUY 1000
2011-02-02 XOM SELL 4000
2011-02-10 XOM BUY 4000
2011-03-03 GOOG SELL 1000
2011-03-03 IBM SELL 2200
2011-05-03 IBM BUY 1500
2011-06-03 IBM SELL 3300
2011-08-01 GOOG BUY 55
2011-08-01 GOOG SELL 55
I want have a variable that maps Date to the number of SELLS on that date. I also want a symmetric variable for BUY.
I tried doing it for all Orders by doing
num_orders_per_day = orders_df.groupby(['Date']).size()
and got:
Date
2011-01-10 1
2011-01-13 2
2011-01-26 1
2011-02-02 1
2011-02-10 1
2011-03-03 2
2011-05-03 1
2011-06-03 1
2011-08-01 2
but that is not the desired output.
What I want is sells_on_a_day:
2011-01-13 1
2011-02-02 1
2011-03-03 2
2011-06-03 1
2011-08-01 1
and then a similar buys_on_a_day variable.

First filter by boolean indexing and then get count:
num_sells_per_day = orders_df[orders_df['Order'] == 'SELL']
.groupby(level=0).size().reset_index(name='count')
print (num_sells_per_day)
Date count
0 2011-01-13 1
1 2011-02-02 1
2 2011-03-03 2
3 2011-06-03 1
4 2011-08-01 1
Alternative:
num_sells_per_day = orders_df.query("Order == 'SELL'")
.groupby(level=0)
.size()
.reset_index(name='count')
print (num_sells_per_day)
Date count
0 2011-01-13 1
1 2011-02-02 1
2 2011-03-03 2
3 2011-06-03 1
4 2011-08-01 1
Also is possible create 2 columns together, only get NaNs if some values missing:
df1 = orders_df.groupby(['Date','Order']).size().unstack()
print (df1)
Order BUY SELL
Date
2011-01-10 1.0 NaN
2011-01-13 1.0 1.0
2011-01-26 1.0 NaN
2011-02-02 NaN 1.0
2011-02-10 1.0 NaN
2011-03-03 NaN 2.0
2011-05-03 1.0 NaN
2011-06-03 NaN 1.0
2011-08-01 1.0 1.0

Related

Diff() function use with groupby for pandas

I am encountering an errors each time i attempt to compute the difference in readings for a meter in my dataset. The dataset structure is this.
id paymenttermid houseid houseid-meterid quantity month year cleaned_quantity
Datetime
2019-02-01 255 water 215 215M201 23.0 2 2019 23.0
2019-02-01 286 water 193 193M181 24.0 2 2019 24.0
2019-02-01 322 water 172 172M162 22.0 2 2019 22.0
2019-02-01 323 water 176 176M166 61.0 2 2019 61.0
2019-02-01 332 water 158 158M148 15.0 2 2019 15.0
I am attempting to generate a new column called consumption that computes the difference in quantities consumed for each house(identified by houseid-meterid) after every month of the year.
The code i am using to implement this is:
water_df["consumption"] = water_df.groupby(["year", "month", "houseid-meterid"])["cleaned_quantity"].diff(-1)
After executing this code, the consumption column is filled with NaN values. How can I correctly implement this logic.
The end result looks like this:
id paymenttermid houseid houseid-meterid quantity month year cleaned_quantity consumption
Datetime
2019-02-01 255 water 215 215M201 23.0 2 2019 23.0 NaN
2019-02-01 286 water 193 193M181 24.0 2 2019 24.0 NaN
2019-02-01 322 water 172 172M162 22.0 2 2019 22.0 NaN
2019-02-01 323 water 176 176M166 61.0 2 2019 61.0 NaN
2019-02-01 332 water 158 158M148 15.0 2 2019 15.0 NaN
Many thank in advance.
I have attempted to use
water_df["consumption"] = water_df.groupby(["year", "month", "houseid-meterid"])["cleaned_quantity"].diff(-1)
and
water_df["consumption"] = water_df.groupby(["year", "month", "houseid-meterid"])["cleaned_quantity"].diff(0)
and
water_df["consumption"] = water_df.groupby(["year", "month", "houseid-meterid"])["cleaned_quantity"].diff()
all this commands result in the same behaviour as stated above.
Expected output should be:
Datetime houseid-meterid cleaned_quantity consumption
2019-02-01 215M201 23.0 20
2019-03-02 215M201 43.0 9
2019-04-01 215M201 52.0 12
2019-05-01 215M201 64.0 36
2019-06-01 215M201 100.0 20
what steps should i take?
Sort values by Datetime (if needed) then group by houseid-meterid before compute the diff for cleaned_quantity values then shift row to align with the right data:
df['consumption'] = (df.sort_values('Datetime')
.groupby('houseid-meterid')['cleaned_quantity']
.transform(lambda x: x.diff().shift(-1)))
print(df)
# Output
Datetime houseid-meterid cleaned_quantity consumption
0 2019-02-01 215M201 23.0 20.0
1 2019-03-02 215M201 43.0 9.0
2 2019-04-01 215M201 52.0 12.0
3 2019-05-01 215M201 64.0 36.0
4 2019-06-01 215M201 100.0 NaN

Merging two series with alternating dates into one grouped Pandas dataframe

Given are two series, like this:
#period1
DATE
2020-06-22 310.62
2020-06-26 300.05
2020-09-23 322.64
2020-10-30 326.54
#period2
DATE
2020-06-23 312.05
2020-09-02 357.70
2020-10-12 352.43
2021-01-25 384.39
These two series are correlated to each other, i.e. they each mark either the beginning or the end of a date period. The first series marks the end of a period1 period, the second series marks the end of period2 period. The end of a period2 period is at the same time also the start of a period1 period, and vice versa.
I've been looking for a way to aggregate these periods as date ranges, but apparently this is not easily possible with Pandas dataframes. Suggestions extremely welcome.
In the easiest case, the output layout should reflect the end dates of periods, which period type it was, and the amount of change between start and stop of the period.
Explicit output:
DATE CHG PERIOD
2020-06-22 NaN 1
2020-06-23 1.43 2
2020-06-26 12.0 1
2020-09-02 57.65 2
2020-09-23 35.06 1
2020-10-12 29.79 2
2020-10-30 25.89 1
2021-01-25 57.85 2
However, if there is any possibility of actually grouping by a date range consisting of start AND stop date, that would be much more favorable
Thank you!
p1 = pd.DataFrame(data={'Date': ['2020-06-22', '2020-06-26', '2020-09-23', '2020-10-30'], 'val':[310.62, 300.05, 322.64, 326.54]})
p2 = pd.DataFrame(data={'Date': ['2020-06-23', '2020-09-02', '2020-10-12', '2021-01-25'], 'val':[312.05, 357.7, 352.43, 384.39]})
p1['period'] = 1
p2['period'] = 2
df = p1.append(p2).sort_values('Date').reset_index(drop=True)
df['CHG'] = abs(df['val'].diff(periods=1))
df.drop('val', axis=1)
Output:
Date period CHG
0 2020-06-22 1 NaN
1 2020-06-23 2 1.43
2 2020-06-26 1 12.00
3 2020-09-02 2 57.65
4 2020-09-23 1 35.06
5 2020-10-12 2 29.79
6 2020-10-30 1 25.89
7 2021-01-25 2 57.85
EDIT: matching the format START - STOP - CHANGE - PERIOD
Starting from the above data frame:
df['Start'] = df.Date.shift(periods=1)
df.rename(columns={'Date': 'Stop'}, inplace=True)
df = df1[['Start', 'Stop', 'CHG', 'period']]
df
Output:
Start Stop CHG period
0 NaN 2020-06-22 NaN 1
1 2020-06-22 2020-06-23 1.43 2
2 2020-06-23 2020-06-26 12.00 1
3 2020-06-26 2020-09-02 57.65 2
4 2020-09-02 2020-09-23 35.06 1
5 2020-09-23 2020-10-12 29.79 2
6 2020-10-12 2020-10-30 25.89 1
7 2020-10-30 2021-01-25 57.85 2
# If needed:
df1.index = pd.to_datetime(df1.index)
df2.index = pd.to_datetime(df2.index)
df = pd.concat([df1, df2], axis=1)
df.columns = ['start','stop']
df['CNG'] = df.bfill(axis=1)['start'].diff().abs()
df['PERIOD'] = 1
df.loc[df.stop.notna(), 'PERIOD'] = 2
df = df[['CNG', 'PERIOD']]
print(df)
Output:
CNG PERIOD
Date
2020-06-22 NaN 1
2020-06-23 1.43 2
2020-06-26 12.00 1
2020-09-02 57.65 2
2020-09-23 35.06 1
2020-10-12 29.79 2
2020-10-30 25.89 1
2021-01-25 57.85 2
2021-01-29 14.32 1
2021-02-12 22.57 2
2021-03-04 15.94 1
2021-05-07 45.42 2
2021-05-12 16.71 1
2021-09-02 47.78 2
2021-10-04 24.55 1
2021-11-18 41.09 2
2021-12-01 19.23 1
2021-12-10 20.24 2
2021-12-20 15.76 1
2022-01-03 22.73 2
2022-01-27 46.47 1
2022-02-09 26.30 2
2022-02-23 35.59 1
2022-03-02 15.94 2
2022-03-08 21.64 1
2022-03-29 45.30 2
2022-04-29 49.55 1
2022-05-04 17.06 2
2022-05-12 36.72 1
2022-05-17 15.98 2
2022-05-19 18.86 1
2022-06-02 27.93 2
2022-06-17 51.53 1

Calculate day's difference between successive pandas dataframe rows with condition

I have a dataframe as following:
Company Date relTweet GaplastRel
XYZ 3/2/2020 1
XYZ 3/3/2020 1
XYZ 3/4/2020 1
XYZ 3/5/2020 1
XYZ 3/5/2020 0
XYZ 3/6/2020 1
XYZ 3/8/2020 1
ABC 3/9/2020 0
ABC 3/10/2020 1
ABC 3/11/2020 0
ABC 3/12/2020 1
The relTweet displays whether the tweet is relevant (1) or not (0).
\nI need to find the days difference (GaplastRel) between each successive rows for each company, with a condition that the previous day's tweet should be relevant tweet (i.e. relTweet =1 ). e.g. For the first record relTweet should be 0. For the 2nd record, relTweet should be 1 as the last relevant tweet was made one day ago.
Below is the example of needed output:
Company Date relTweet GaplastRel
XYZ 3/2/2020 1 0
XYZ 3/3/2020 1 1
XYZ 3/4/2020 1 1
XYZ 3/5/2020 1 1
XYZ 3/5/2020 0 1
XYZ 3/6/2020 1 1
XYZ 3/8/2020 1 2
ABC 3/9/2020 0 0
ABC 3/10/2020 1 0
ABC 3/11/2020 0 1
ABC 3/12/2020 1 2
Following is my code:
dataDf['Date'] = pd.to_datetime(dataDf['Date'], format='%m/%d/%Y')
dataDf['relTweet'] = (dataDf.groupby('Company', group_keys=False)
.apply(lambda g: g['Date'].diff().replace(0, np.nan).ffill()))
This code gives the days difference between successive rows for each company without conisidering the relTweet =1 condition. I am not sure how to apply the condition.
Following is the output of the above code:
Company Date relTweet GaplastRel
XYZ 3/2/2020 1 NaT
XYZ 3/3/2020 1 1 days
XYZ 3/4/2020 1 1 days
XYZ 3/5/2020 1 1 days
XYZ 3/5/2020 0 0 days
XYZ 3/6/2020 1 1 days
XYZ 3/8/2020 1 2 days
ABC 3/9/2020 0 NaT
ABC 3/10/2020 1 1 days
ABC 3/11/2020 0 1 days
ABC 3/12/2020 1 1 days
Change your mind sometime we need merge_asof rather than groupby
df1=df.loc[df['relTweet']==1,['Company','Date']]
df=pd.merge_asof(df,df1.assign(Date1=df1.Date),by='Company',on='Date', allow_exact_matches=False)
df['GaplastRel']=(df.Date-df.Date1).dt.days.fillna(0)
df
Out[31]:
Company Date relTweet Date1 GaplastRel
0 XYZ 2020-03-02 1 NaT 0.0
1 XYZ 2020-03-03 1 2020-03-02 1.0
2 XYZ 2020-03-04 1 2020-03-03 1.0
3 XYZ 2020-03-05 1 2020-03-04 1.0
4 XYZ 2020-03-05 0 2020-03-04 1.0
5 XYZ 2020-03-06 1 2020-03-05 1.0
6 XYZ 2020-03-08 1 2020-03-06 2.0
7 ABC 2020-03-09 0 NaT 0.0
8 ABC 2020-03-10 1 NaT 0.0
9 ABC 2020-03-11 0 2020-03-10 1.0
10 ABC 2020-03-12 1 2020-03-10 2.0

Pandas Shift Date Time Columns Back One Hour

I have data in a DF (df1) that starts and ends like this below and I'm trying to shift the "0" and "1" columns below so that the date and time is moved back one hour so that the date and time start at hour == 0 not hour == 1.
data starts (df1) -
0 1 2 3 4 5 6 7
0 20160101 100 7.977169 109404.0 20160101 100 4.028678 814.0
1 20160101 200 8.420204 128546.0 20160101 200 4.673662 2152.0
2 20160101 300 9.515370 165931.0 20160101 300 8.019863 8100.0
data ends (df1) -
0 1 2 3 4 5 6 7
8780 20161231 2100 4.198906 11371.0 20161231 2100 0.995571 131.0
8781 20161231 2200 4.787433 19083.0 20161231 2200 1.029809 NaN
8782 20161231 2300 3.987506 9354.0 20161231 2300 0.900942 NaN
8783 20170101 0 3.284947 1815.0 20170101 0 0.899262 NaN
I need the date and time to start shifted back one hour so start time is hour begin not hour end -
0 1 2 3 4 5 6 7
0 20160101 000 7.977169 109404.0 20160101 100 4.028678 814.0
1 20160101 100 8.420204 128546.0 20160101 200 4.673662 2152.0
2 20160101 200 9.515370 165931.0 20160101 300 8.019863 8100.0
and ends like this with the date and time below -
0 1 2 3 4 5 6 7
8780 20161231 2000 4.198906 11371.0 20161231 2100 0.995571 131.0
8781 20161231 2100 4.787433 19083.0 20161231 2200 1.029809 NaN
8782 20161231 2200 3.987506 9354.0 20161231 2300 0.900942 NaN
8783 20161231 2300 3.284947 1815.0 20170101 0 0.899262 NaN
And, i have no real idea of how to accomplish this or how to research it. Thank you,
It would be better to create a proper datetime object then simply remove the hours as a sum which will handle any redaction in days. We can then use dt.strftime to re-create your object (string) columns.
s = pd.to_datetime(
df[0].astype(str) + df[1].astype(str).str.zfill(4), format="%Y%m%d%H%M"
)
0 2016-01-01 01:00:00
1 2016-01-01 02:00:00
2 2016-01-01 03:00:00
8780 2016-12-31 21:00:00
8781 2016-12-31 22:00:00
8782 2016-12-31 23:00:00
8783 2017-01-01 00:00:00
dtype: datetime64[ns]
df[1] = (s - pd.DateOffset(hours=1)).dt.strftime("%H%M").str.lstrip("0").str.zfill(3)
df[0] = (s - pd.DateOffset(hours=1)).dt.strftime("%Y%d%m")
print(df)
0 1 2 3 4 5 6 7
0 20160101 000 7.977169 109404.0 20160101 100 4.028678 814.0
1 20160101 100 8.420204 128546.0 20160101 200 4.673662 2152.0
2 20160101 200 9.515370 165931.0 20160101 300 8.019863 8100.0
8780 20163112 2000 4.198906 11371.0 20161231 2100 0.995571 131.0
8781 20163112 2100 4.787433 19083.0 20161231 2200 1.029809 NaN
8782 20163112 2200 3.987506 9354.0 20161231 2300 0.900942 NaN
8783 20163112 2300 3.284947 1815.0 20170101 0 0.899262 NaN
Use, DataFrame.shift to shift the columns 0, 1, then use Series.bfill on column 0 of df2 to fill the missing values, then use .fillna on column 1 of df2 to fill the NaN values, finally use Dataframe.join to join the dataframe df2 with the dataframe df1:
df2 = df1[['0', '1']].shift()
df2['0'] = df2['0'].bfill()
df2['1'] = df2['1'].fillna('000')
df2 = df2.join(df1.loc[:, '2':])
# print(df2)
0 1 2 3 4 5 6 7
0 20160101 000 7.977169 109404.0 20160101 100 4.028678 814.0
1 20160101 100 8.420204 128546.0 20160101 200 4.673662 2152.0
2 20160101 200 9.515370 165931.0 20160101 300 8.019863 8100.0
...
8780 20160101 300 4.198906 11371.0 20161231 2100 0.995571 131.0
8781 20161231 2100 4.787433 19083.0 20161231 2200 1.029809 NaN
8782 20161231 2200 3.987506 9354.0 20161231 2300 0.900942 NaN
8783 20161231 2300 3.284947 1815.0 20170101 0 0.899262 NaN
You can do subtraction in pandas (considering that the data in your dataframe are not string type)
I will show you an example on how it can be done
import pandas as pd
df = pd.DataFrame()
df['time'] = [0,100,500,2100,2300,0] #creating dataframe
df['time'] = df['time']-100 #This is what you want to do
Now your data will be subtracted by 100.
There is a case when subtracting 0 you will get -100 as time. For that you can do this:
for i in range(len(df['time'])):
if df['time'].iloc[i]== -100:
df['time'].iloc[i]=2300

pandas Why datetimes indices in two data frames are not matching

This is a follow on to my most recent question. Thanks to Bob Haffner I have almost got a solution for updating columns in one df using values in another df.
I have df1:
shares
trans_date symbol
2011-01-13 AAPL -1500
IBM 4000
2011-01-26 GOOG 1000
2011-02-02 XOM -4000
2011-02-10 XOM 4000
2011-03-03 GOOG -1000
IBM -2200
2011-05-03 IBM 1500
2011-06-03 IBM -3300
2011-06-10 AAPL 1200
2011-08-01 GOOG 0
2011-12-20 AAPL -1200
df1 info :
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 12 entries, (2011-01-13 00:00:00, AAPL) to (2011-12-20 00:00:00, AAPL)
Data columns (total 1 columns):
shares 12 non-null int64
dtypes: int64(1)
memory usage: 192.0+ bytes
And this df2:
2011-01-13 GOOG 0
AAPL 0
XOM 0
IBM 0
_CASH 0
GOOG 0
AAPL 0
XOM 0
IBM 0
_CASH 0
2011-01-26 GOOG 0
AAPL 0
XOM 0
IBM 0
_CASH 0
2011-02-02 GOOG 0
AAPL 0
XOM 0
IBM 0
_CASH 0
2011-02-10 GOOG 0
AAPL 0
XOM 0
IBM 0
_CASH 0
2011-03-03 GOOG 0
AAPL 0
XOM 0
IBM 0
_CASH 0
..
df2 info is :
return pd.TimeSeries(index=dates, data=dates)
<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 65 entries, (2011-01-13, GOOG) to (2011-12-20, _CASH)
Data columns (total 1 columns):
0 65 non-null int64
dtypes: int64(1)
memory usage: 1.0+ KB
None
I then issue a df.update command:
df2_updated = df2.update(df1)
This returns 'none'. I cannot understand why the indices are not matching. I am grateful for any help.
Drew Yallop