loc function for the months [duplicate] - pandas

With a datetime index to a Pandas dataframe, it is easy to get a range of dates:
df[datetime(2018,1,1):datetime(2018,1,10)]
Filtering is straightforward too:
df[(df['column A'] == 'Done') & (df['column B'] < 3.14)]
But what is the best way to simultaneously filter by range of dates and any other non-date criteria?

3 boolean conditions
c0 = df.index.to_series().between('2018-01-01', '2018-01-10')
c1 = df['column A'] == 'Done'
c2 = df['column B'] < 3.14
df[c0 & c1 & c2]
column A column B
2018-01-04 Done 2.533385
2018-01-06 Done 2.789072
2018-01-08 Done 2.230017
Setup
import numpy as np
import pandas as pd

np.random.seed([3, 1415])
df = pd.DataFrame({
    'column A': ['Done', 'Not Done'] * 10,
    'column B': np.random.randn(20) + np.pi
}, pd.date_range('2017-12-25', periods=20))
df
column A column B
2017-12-25 Done 1.011868
2017-12-26 Not Done 1.873127
2017-12-27 Done 1.171093
2017-12-28 Not Done 0.882538
2017-12-29 Done 2.792306
2017-12-30 Not Done 3.114638
2017-12-31 Done 3.457829
2018-01-01 Not Done 3.490375
2018-01-02 Done 3.856957
2018-01-03 Not Done 3.912356
2018-01-04 Done 2.533385
2018-01-05 Not Done 3.493983
2018-01-06 Done 2.789072
2018-01-07 Not Done 2.725724
2018-01-08 Done 2.230017
2018-01-09 Not Done 2.999055
2018-01-10 Done 3.888432
2018-01-11 Not Done 1.637436
2018-01-12 Done 3.752955
2018-01-13 Not Done 3.541812

If there are multiple boolean masks, it is possible to use np.logical_and.reduce:
m1 = df.index > '2018-01-01'
m2 = df.index < '2018-01-10'
m3 = df['column A'] == 'Done'
m4 = df['column B'] < 3.14
#piRSquared's data sample
df = df[np.logical_and.reduce([m1, m2, m3, m4])]
print (df)
column A column B
2018-01-04 Done 2.533385
2018-01-06 Done 2.789072
2018-01-08 Done 2.230017
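Since the title mentions loc, here is a minimal sketch of the same filter using .loc on the sample df from the Setup above; this is my own addition rather than part of the answers. It slices the DatetimeIndex by date first and then applies the non-date conditions to the result:
# slice by date range first (inclusive on both ends), then filter the rest
sliced = df.loc['2018-01-01':'2018-01-10']
out = sliced[(sliced['column A'] == 'Done') & (sliced['column B'] < 3.14)]
print(out)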

import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.random((200,3)))
df['date'] = pd.date_range('2018-1-1', periods=200, freq='D')
df = df.set_index(['date'])
print(df.loc['2018-2-1':'2018-2-10'])
Hope it will be helpful.

I did this below to filter both dataframes so they cover the same dates:
corn_url = 'https://fred.stlouisfed.org/graph/fredgraph.csv?bgcolor=%23e1e9f0&chart_type=line&drp=0&fo=open%20sans&graph_bgcolor=%23ffffff&height=450&mode=fred&recession_bars=on&txtcolor=%23444444&ts=12&tts=12&width=1168&nt=0&thu=0&trc=0&show_legend=yes&show_axis_titles=yes&show_tooltip=yes&id=WPU012202&scale=left&cosd=1971-01-01&coed=2020-04-01&line_color=%234572a7&link_values=false&line_style=solid&mark_type=none&mw=3&lw=2&ost=-99999&oet=99999&mma=0&fml=a&fq=Monthly&fam=avg&fgst=lin&fgsnd=2009-06-01&line_index=1&transformation=lin&vintage_date=2020-06-09&revision_date=2020-06-09&nd=1971-01-01'
wheat_url ='https://fred.stlouisfed.org/graph/fredgraph.csv?bgcolor=%23e1e9f0&chart_type=line&drp=0&fo=open%20sans&graph_bgcolor=%23ffffff&height=450&mode=fred&recession_bars=on&txtcolor=%23444444&ts=12&tts=12&width=1168&nt=0&thu=0&trc=0&show_legend=yes&show_axis_titles=yes&show_tooltip=yes&id=WPU0121&scale=left&cosd=1947-01-01&coed=2020-04-01&line_color=%234572a7&link_values=false&line_style=solid&mark_type=none&mw=3&lw=2&ost=-99999&oet=99999&mma=0&fml=a&fq=Monthly&fam=avg&fgst=lin&fgsnd=2009-06-01&line_index=1&transformation=lin&vintage_date=2020-06-09&revision_date=2020-06-09&nd=1947-01-01'
corn = pd.read_csv(corn_url,index_col=0,parse_dates=True)
wheat = pd.read_csv(wheat_url,index_col=0, parse_dates=True)
corn.head()
PP Index 1982
DATE
1971-01-01 63.4
1971-02-01 63.6
1971-03-01 62.0
1971-04-01 60.8
1971-05-01 60.2
wheat.head()
PP Index 1982
DATE
1947-01-01 53.1
1947-02-01 56.5
1947-03-01 68.0
1947-04-01 66.0
1947-05-01 66.7
wheat = wheat[wheat.index > '1970-12-31']
wheat.head()
PP Index 1982
DATE
1971-01-01 42.6
1971-02-01 42.6
1971-03-01 41.4
1971-04-01 41.7
1971-05-01 41.8
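A possible follow-up (my own sketch, assuming the goal is that both frames end up with exactly the same dates): an inner join keeps only the dates present in both series, so neither frame needs to be trimmed by hand. The suffix names are hypothetical:
# keep only dates present in both frames; suffixes disambiguate the identical column names
combined = corn.join(wheat, lsuffix='_corn', rsuffix='_wheat', how='inner')
combined.head()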

Related

Find index range of a sequence in dataframe column

I have a timeseries:
Sales
2018-01-01 66.65
2018-01-02 66.68
2018-01-03 65.87
2018-01-04 66.79
2018-01-05 67.97
2018-01-06 96.92
2018-01-07 96.90
2018-01-08 96.90
2018-01-09 96.38
2018-01-10 95.57
Given an arbitrary sequence of values, let's say [66.79,67.97,96.92,96.90], how could I obtain the corresponding indices, for example: [2018-01-04, 2018-01-05,2018-01-06,2018-01-07]?
Use pandas.Series.isin to filter the column Sales, then pandas.DataFrame.index to return the row labels (the index, i.e. the dates in your df), and finally pandas.Index.to_list to build a list:
vals = [66.79,67.97,96.92,96.90]
result = df[df['Sales'].isin(vals)].index.to_list()
# Output :
print(result)
['2018-01-04', '2018-01-05', '2018-01-06', '2018-01-07', '2018-01-08']
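Note that isin matches every row whose value appears in vals, which is why 2018-01-08 shows up as well (96.90 occurs twice). If the goal is to locate the values as one consecutive run, a sketch along these lines should work; it is my own addition, not part of the answer above:
import numpy as np
# compare every window of len(seq) consecutive values against the sequence
seq = np.array([66.79, 67.97, 96.92, 96.90])
values = df['Sales'].to_numpy()
n = len(seq)
starts = [i for i in range(len(values) - n + 1) if np.allclose(values[i:i + n], seq)]
runs = [df.index[i:i + n].tolist() for i in starts]
print(runs)  # each entry holds the index labels of one matching run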

Assign Day number to unique date in pandas [duplicate]

I have a column with dates in string format '2017-01-01'. Is there a way to extract day and month from it using pandas?
I have converted the column to datetime dtype but haven't figured out the latter part:
df['Date'] = pd.to_datetime(df['Date'], format='%Y-%m-%d')
df.dtypes:
Date datetime64[ns]
print(df)
Date
0 2017-05-11
1 2017-05-12
2 2017-05-13
With dt.day and dt.month (the Series.dt accessor):
df = pd.DataFrame({'date':pd.date_range(start='2017-01-01',periods=5)})
df.date.dt.month
Out[164]:
0 1
1 1
2 1
3 1
4 1
Name: date, dtype: int64
df.date.dt.day
Out[165]:
0 1
1 2
2 3
3 4
4 5
Name: date, dtype: int64
You can also do it with dt.strftime:
df.date.dt.strftime('%m')
Out[166]:
0 01
1 01
2 01
3 01
4 01
Name: date, dtype: object
A simple form:
df['MM-DD'] = df['date'].dt.strftime('%m-%d')
Use dt to get the datetime attributes of the column.
In [59]: import datetime
In [60]: df = pd.DataFrame({'date': [datetime.datetime(2018,1,1), datetime.datetime(2018,1,2), datetime.datetime(2018,1,3)]})
In [61]: df
Out[61]:
date
0 2018-01-01
1 2018-01-02
2 2018-01-03
In [63]: df['day'] = df.date.dt.day
In [64]: df['month'] = df.date.dt.month
In [65]: df
Out[65]:
date day month
0 2018-01-01 1 1
1 2018-01-02 2 1
2 2018-01-03 3 1
Timing the methods provided:
Using apply:
In [217]: %timeit(df['date'].apply(lambda d: d.day))
The slowest run took 33.66 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 210 µs per loop
Using dt.date:
In [218]: %timeit(df.date.dt.day)
10000 loops, best of 3: 127 µs per loop
Using dt.strftime:
In [219]: %timeit(df.date.dt.strftime('%d'))
The slowest run took 40.92 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 284 µs per loop
We can see that dt.day is the fastest.
This should do it:
df['day'] = df['Date'].apply(lambda r:r.day)
df['month'] = df['Date'].apply(lambda r:r.month)

How to sum up a selected range of rows via a condition?

I hope that with this additional information someone could find time to help me with this new issue.
Sample data here --> file
'Date as index' (datetime.date)
As I said, I'm trying to select a range in the dataframe every time x is in the interval [-190, 0], and to create a new dataframe with a new column which is the sum of the selected rows, keeping the last "encountered" date as index.
EDIT: The "loop" starts at the first date/beginning of the df; when a value in that interval is found, the rows are summed up, then the search continues and sums the next range, and so on.
BUT I still get values which are still in the interval (-190, 0);
example and code below.
Thanks
import pandas as pd
df = pd.read_csv('http://www.sharecsv.com/s/0525f76a07fca54717f7962d58cac692/sample_file.csv', sep = ';')
df['Date'] = df['Date'].where(df['x'].between(-190, 0)).bfill()
df3 = df.groupby('Date', as_index=False)['x'].sum()
df3
##### output #####
Date sum
0 2019-01-01 13:48:00 -131395.21
1 2019-01-02 11:23:00 -250830.08
2 2019-01-02 11:28:00 -154.35
3 2019-01-02 12:08:00 -4706.87
4 2019-01-03 12:03:00 -260158.22
... ... ...
831 2019-09-29 09:18:00 -245939.92
832 2019-09-29 16:58:00 -0.38
833 2019-09-30 17:08:00 -129365.71
834 2019-09-30 17:13:00 -157.05
835 2019-10-01 08:58:00 -111911.98
########## expected output #############
Date sum
0 2019-01-01 13:48:00 -131395.21
1 2019-01-02 11:23:00 -250830.08
2 2019-01-02 12:08:00 -4706.87
3 2019-01-03 12:03:00 -260158.22
... ... ...
831 2019-09-29 09:18:00 -245939.92
832 2019-09-30 17:08:00 -129365.71
833 2019-10-01 08:58:00 -111911.98
...
...
Use Series.where with Series.between to replace values of the Date column with NaN, back-fill the missing values, and then aggregate with sum. The next step is to filter out rows that fall inside the range by boolean indexing, and finally use DataFrame.resample, casting the Series to a one-column DataFrame with Series.to_frame:
#range -190, 0
df['Date'] = df['Date'].where(df['x'].between(-190, 0)).bfill()
df3 = df.groupby('Date', as_index=False)['x'].sum()
df3 = df3[~df3['x'].between(-190, 0)]
df3 = df3.resample('D', on='Date')['x'].sum().to_frame()
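To make the mechanics easier to follow, here is a minimal sketch on a hypothetical toy frame (my own addition): rows whose x is outside [-190, 0] lose their date, back-filling then assigns them the date of the next "boundary" row, and the groupby sums each segment under that date:
import pandas as pd
toy = pd.DataFrame({
    'Date': pd.to_datetime(['2019-01-01 10:00', '2019-01-01 11:00',
                            '2019-01-01 13:48', '2019-01-02 09:00',
                            '2019-01-02 11:23']),
    'x': [-500.0, -300.0, -100.0, -700.0, -50.0],
})
# keep the date only on rows with x in [-190, 0], then back-fill it upwards
toy['Date'] = toy['Date'].where(toy['x'].between(-190, 0)).bfill()
# each segment ends at a boundary row and is summed under that row's date
print(toy.groupby('Date', as_index=False)['x'].sum())
#                  Date      x
# 0 2019-01-01 13:48:00 -900.0
# 1 2019-01-02 11:23:00 -750.0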

How to change datetime to numeric discarding 0s at end [duplicate]

I have a dataframe in pandas called 'munged_data' with two columns, 'entry_date' and 'dob', which I have converted to Timestamps using pd.to_datetime. I am trying to figure out how to calculate ages of people based on the time difference between 'entry_date' and 'dob', and to do this I need to get the difference in days between the two columns (so that I can then do something like round(days/365.25)). I do not seem to be able to find a way to do this using a vectorized operation. When I do munged_data.entry_date - munged_data.dob I get the following:
internal_quote_id
2 15685977 days, 23:54:30.457856
3 11651985 days, 23:49:15.359744
4 9491988 days, 23:39:55.621376
7 11907004 days, 0:10:30.196224
9 15282164 days, 23:30:30.196224
15 15282227 days, 23:50:40.261632
However I do not seem to be able to extract the days as an integer so that I can continue with my calculation.
Any help appreciated.
Using the Pandas type Timedelta available since v0.15.0 you also can do:
In[1]: import pandas as pd
In[2]: df = pd.DataFrame([ pd.Timestamp('20150111'),
pd.Timestamp('20150301') ], columns=['date'])
In[3]: df['today'] = pd.Timestamp('20150315')
In[4]: df
Out[4]:
date today
0 2015-01-11 2015-03-15
1 2015-03-01 2015-03-15
In[5]: (df['today'] - df['date']).dt.days
Out[5]:
0 63
1 14
dtype: int64
You need pandas 0.11 for this (0.11rc1 is out, final probably next week).
In [9]: df = DataFrame([ Timestamp('20010101'), Timestamp('20040601') ])
In [10]: df
Out[10]:
0
0 2001-01-01 00:00:00
1 2004-06-01 00:00:00
In [11]: df = DataFrame([ Timestamp('20010101'),
Timestamp('20040601') ],columns=['age'])
In [12]: df
Out[12]:
age
0 2001-01-01 00:00:00
1 2004-06-01 00:00:00
In [13]: df['today'] = Timestamp('20130419')
In [14]: df['diff'] = df['today']-df['age']
In [16]: df['years'] = df['diff'].apply(lambda x: float(x.item().days)/365)
In [17]: df
Out[17]:
age today diff years
0 2001-01-01 00:00:00 2013-04-19 00:00:00 4491 days, 00:00:00 12.304110
1 2004-06-01 00:00:00 2013-04-19 00:00:00 3244 days, 00:00:00 8.887671
You need this odd apply at the end because there is not yet full support for timedelta64[ns] scalars (e.g. like how we use Timestamps now for datetime64[ns]; coming in 0.12).
Not sure if you still need it, but in pandas 0.14 I usually use the .astype('timedelta64[X]') method.
http://pandas.pydata.org/pandas-docs/stable/timeseries.html (frequency conversion)
df = pd.DataFrame([ pd.Timestamp('20010101'), pd.Timestamp('20040605') ])
df.ix[0]-df.ix[1]
Returns:
0 -1251 days
dtype: timedelta64[ns]
(df.ix[0]-df.ix[1]).astype('timedelta64[Y]')
Returns:
0 -4
dtype: float64
Hope that will help
Let's specify that you have a pandas series named time_difference which has type
numpy.timedelta64[ns]
One way of extracting just the day (or whatever desired attribute) is the following:
just_day = time_difference.apply(lambda x: pd.tslib.Timedelta(x).days)
This function is used because the numpy.timedelta64 object does not have a 'days' attribute.
To convert any type of data into days just use pd.Timedelta().days:
pd.Timedelta(1985, unit='Y').days
84494
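For completeness, a minimal sketch (my own addition, using current pandas) of extracting days from a timedelta64[ns] Series without apply: the .dt accessor gives whole days, and dividing by a one-day timedelta gives fractional days:
import numpy as np
import pandas as pd
time_difference = pd.Series(pd.to_timedelta(['10 days 12:00:00', '3 days 06:00:00']))
whole_days = time_difference.dt.days                        # 10, 3
fractional_days = time_difference / np.timedelta64(1, 'D')  # 10.5, 3.25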

pandas:groupby('date_x')['outcome'].mean()

https://www.kaggle.com/anokas/time-travel-eda
What does this code mean exactly: groupby('date_x')['outcome'].mean()? I could not find it in the sklearn docs.
date_x['Class probability'] = df_train.groupby('date_x')['outcome'].mean()
date_x['Frequency'] = df_train.groupby('date_x')['outcome'].size()
date_x.plot( secondary_y='Frequency',figsize=(22, 10))
thanks!
I think it is better to use DataFrameGroupBy.agg to aggregate with size (the length of each group) and mean per group, grouping by column date_x:
d = {'mean':'Class probability','size':'Frequency'}
df = df_train.groupby('date_x')['outcome'].agg(['mean','size']).rename(columns=d)
df.plot( secondary_y='Frequency',figsize=(22, 10))
For more information check applying multiple functions at once.
Sample:
d = {'date_x': pd.to_datetime(['2015-01-01', '2015-01-01', '2015-01-01',
                               '2015-01-02', '2015-01-02']),
     'outcome': [20, 30, 40, 50, 60]}
df_train = pd.DataFrame(d)
print (df_train)
date_x outcome
0 2015-01-01 20 ->1.group
1 2015-01-01 30 ->1.group
2 2015-01-01 40 ->1.group
3 2015-01-02 50 ->2.group
4 2015-01-02 60 ->2.group
d = {'mean':'Class probability','size':'Frequency'}
df = df_train.groupby('date_x')['outcome'].agg(['mean','size']).rename(columns=d)
print (df)
Class probability Frequency
date_x
2015-01-01 30 3
2015-01-02 55 2
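If you are on pandas 0.25 or newer, named aggregation can do the renaming in one step; a minimal sketch of the same idea (the ** form is needed because the column name contains a space):
df = df_train.groupby('date_x')['outcome'].agg(
    **{'Class probability': 'mean', 'Frequency': 'size'}
)
print(df)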