Create a new DataFrame column from an existing one? - pandas

I'm using pandas 0.12.0. I have a DataFrame that looks like:
date ms
0 2013-06-03 00:10:00 75.846318
1 2013-06-03 00:20:00 78.408277
2 2013-06-03 00:30:00 75.807990
3 2013-06-03 00:40:00 70.509438
4 2013-06-03 00:50:00 71.537499
I want to generate a third column, "tod", which contains just the time portion of the date (i.e. call .time() on each value). I'm somewhat of a pandas newbie, so I suspect this is trivial but I'm just not seeing how to do it.

Just apply the Timestamp time method to items in the date column:
In [11]: df['date'].apply(lambda x: x.time())
# equivalently .apply(pd.Timestamp.time)
Out[11]:
0 00:10:00
1 00:20:00
2 00:30:00
3 00:40:00
4 00:50:00
Name: date, dtype: object
In [12]: df['tod'] = df['date'].apply(lambda x: x.time())
This gives a column of datetime.time objects.
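On newer pandas versions (0.15 and later), the dt accessor gives the same result without apply; a minimal sketch, assuming df['date'] has a datetime64 dtype:
df['tod'] = df['date'].dt.time  # also a column of datetime.time objects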

Using the time attribute on the Index (the approach Andy mentioned) is faster than apply:
In [93]: df = pd.DataFrame(np.random.randn(5,1), columns=['A'])
In [94]: df['date'] = pd.date_range('20130101 9:05', periods=5)
In [95]: df['time'] = pd.Index(df['date']).time
In [96]: df
Out[96]:
A date time
0 0.053570 2013-01-01 09:05:00 09:05:00
1 -0.382155 2013-01-02 09:05:00 09:05:00
2 0.357984 2013-01-03 09:05:00 09:05:00
3 -0.718300 2013-01-04 09:05:00 09:05:00
4 0.531953 2013-01-05 09:05:00 09:05:00
In [97]: df.dtypes
Out[97]:
A float64
date datetime64[ns]
time object
dtype: object
In [98]: df['time'][0]
Out[98]: datetime.time(9, 5)

Related

upsampling timeseries from daily to hourly

I am using the data below, which is saved in a CSV file, and trying to convert it to hourly using linear interpolation, but without success.
Code:
import pandas as pd
df = pd.read_csv('d:/Python/resampling/FairyLake.csv')
df[ 'Date' ] = pd.to_datetime(df['Date'])
df.set_index('Date').resample('M').interpolate()
print(df)
Data
Date,Discharge
1/3/2008,0.05865
1/4/2008,0.105812
1/5/2008,0.191388
1/6/2008,0.315378
1/7/2008,0.477782
1/8/2008,0.6786
1/9/2008,0.917832
1/10/2008,0.783875701
1/11/2008,0.65678957
1/12/2008,0.545651187
1/13/2008,0.44222808
1/14/2008,0.353907613
1/15/2008,0.27414753
Results
Date Discharge
0 2008-01-03 0.058650
1 2008-01-04 0.105812
2 2008-01-05 0.191388
3 2008-01-06 0.315378
4 2008-01-07 0.477782
5 2008-01-08 0.678600
6 2008-01-09 0.917832
7 2008-01-10 0.783876
8 2008-01-11 0.656790
9 2008-01-12 0.545651
10 2008-01-13 0.442228
11 2008-01-14 0.353908
12 2008-01-15 0.274148
Two things:
the resample frequency for interpolate should be hourly ('H')
the result needs to be assigned back (df = ...):
df['Date'] = pd.to_datetime(df['Date'])
df = df.set_index('Date').resample('H').interpolate()
df:
Discharge
Date
2008-01-03 00:00:00 0.058650
2008-01-03 01:00:00 0.060615
2008-01-03 02:00:00 0.062580
2008-01-03 03:00:00 0.064545
2008-01-03 04:00:00 0.066510
... ...
2008-01-14 20:00:00 0.287441
2008-01-14 21:00:00 0.284118
2008-01-14 22:00:00 0.280794
2008-01-14 23:00:00 0.277471
2008-01-15 00:00:00 0.274148
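For reference, a self-contained sketch of the corrected flow, using a StringIO buffer in place of the original CSV file (the buffer and its first few rows are only an illustration of the data above):
import io
import pandas as pd

# stand-in for the CSV file, using the first rows of the data shown above
csv = io.StringIO("Date,Discharge\n1/3/2008,0.05865\n1/4/2008,0.105812\n1/5/2008,0.191388\n")

df = pd.read_csv(csv)
df['Date'] = pd.to_datetime(df['Date'])
df = df.set_index('Date').resample('H').interpolate()  # hourly, assigned back
print(df.head())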

How to partition my 'timeconsume' feature as per pandas datetimeindex

I have a time-based feature in my pandas data frame with 5-minute intervals, so it looks something like:
dataDate TimeconinSec
2020-11-11 22:25:00 302
2020-11-11 23:25:00 605
2020-11-12 00:25:00 302
Sometimes this feature has a value beyond 5 minutes (300 s), so I want to go back in time and distribute it across 5-minute intervals, like the following output:
dataDate TimeconinSec
2020-11-11 22:20:00 300
2020-11-11 22:25:00 002
2020-11-11 23:15:00 300
2020-11-11 23:20:00 300
2020-11-11 23:25:00 005
2020-11-12 00:20:00 300
2020-11-12 00:25:00 002
I have tried different pandas date_range functions, but how can I partition my time-based feature across the intervals?
Let’s first convert everything to proper timestamps, and compute the beginning and end of every interval:
>>> df['date'] = pd.to_datetime(df['dataDate'])
>>> df['since'] = (df['date'] - df['TimeconinSec'].astype('timedelta64[s]')).dt.floor(freq='300s')
>>> df['until'] = df['since'] + df['TimeconinSec'].astype('timedelta64[s]')
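Note that depending on the pandas version, the integer-to-timedelta astype conversion above may behave differently or be unsupported; a hedged alternative converts the seconds column explicitly with pd.to_timedelta:
>>> secs = pd.to_timedelta(df['TimeconinSec'], unit='s')
>>> df['since'] = (df['date'] - secs).dt.floor(freq='300s')
>>> df['until'] = df['since'] + secs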
Then we can use pd.date_range to generate all the proper intermediate interval bounds:
>>> bounds = df.apply(lambda s: [*pd.date_range(s['since'], s['until'], freq='300s'), s['until']], axis='columns')
>>> bounds
0 [2020-11-11 22:15:00, 2020-11-11 22:20:00, 202...
1 [2020-11-11 23:10:00, 2020-11-11 23:15:00, 202...
2 [2020-11-12 00:15:00, 2020-11-12 00:20:00, 202...
dtype: object
Then with explode we can turn these into their own series. The series is used twice, once for the beginning of each interval and once (shifted) for the end. Note the groupby().shift(), which performs the shift only within the same index.
>>> interval_ends = pd.concat([bounds.explode(), bounds.explode().groupby(level=0).shift(-1)], axis='columns', keys=['start', 'end'])
>>> interval_ends
start end
0 2020-11-11 22:15:00 2020-11-11 22:20:00
0 2020-11-11 22:20:00 2020-11-11 22:20:02
0 2020-11-11 22:20:02 NaT
1 2020-11-11 23:10:00 2020-11-11 23:15:00
1 2020-11-11 23:15:00 2020-11-11 23:20:00
1 2020-11-11 23:20:00 2020-11-11 23:20:05
1 2020-11-11 23:20:05 NaT
2 2020-11-12 00:15:00 2020-11-12 00:20:00
2 2020-11-12 00:20:00 2020-11-12 00:20:02
2 2020-11-12 00:20:02 NaT
After that we can discard the indexes and simply compute the time inside each interval:
>>> interval_ends.reset_index(drop=True, inplace=True)
>>> delays = (interval_ends['end'] - interval_ends['start']).astype('timedelta64[s]')
>>> delays
0 300.0
1 2.0
2 NaN
3 300.0
4 300.0
5 5.0
6 NaN
7 300.0
8 2.0
9 NaN
dtype: float64
Finally we just have to join the interval starts with these delays, drop rows containing NaNs, and we've got your final result:
>>> delays = delays.rename('time_in_secs').dropna().astype('int')
>>> interval_ends[['start']].join(delays, how='inner')
start time_in_secs
0 2020-11-11 22:15:00 300
1 2020-11-11 22:20:00 2
3 2020-11-11 23:10:00 300
4 2020-11-11 23:15:00 300
5 2020-11-11 23:20:00 5
7 2020-11-12 00:15:00 300
8 2020-11-12 00:20:00 2
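On newer pandas, the timedelta-to-seconds astype in the delays step also changes behaviour; a hedged variant of the last two steps, continuing the same session, uses .dt.total_seconds(), which returns float seconds on any recent version:
>>> delays = (interval_ends['end'] - interval_ends['start']).dt.total_seconds()
>>> delays = delays.rename('time_in_secs').dropna().astype('int')
>>> interval_ends[['start']].join(delays, how='inner')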

Pivot sort by time values - Pandas

I want to pivot a df and display values based off time values, not column values.
df = pd.DataFrame({
'Place' : ['John','Alan','Cory','Jim','John','Alan','Cory','Jim'],
'Number' : ['2','3','5','5','3','4','6','6'],
'Code' : ['1','2','3','4','1','2','3','4'],
'Time' : ['1904-01-01 08:00:00','1904-01-01 09:00:00','1904-01-02 01:00:00','1904-01-02 02:00:00','1904-01-01 08:10:00','1904-01-01 09:10:00','1904-01-02 01:10:00','1904-01-02 02:10:00'],
})
df = df.pivot_table(index = 'Number', columns = 'Place', values = 'Time', aggfunc = 'first').fillna('')
Out:
Place Alan Cory Jim John
Number
2 1904-01-01 08:00:00
3 1904-01-01 09:00:00 1904-01-01 08:10:00
4 1904-01-01 09:10:00
5 1904-01-02 01:00:00 1904-01-02 02:00:00
6 1904-01-02 01:10:00 1904-01-02 02:10:00
Intended Output:
Place John Alan Cory Jim
Number
2 1904-01-01 08:00:00
3 1904-01-01 08:10:00 1904-01-01 09:00:00
4 1904-01-01 09:10:00
5 1904-01-02 01:00:00 1904-01-02 02:00:00
6 1904-01-02 01:10:00 1904-01-02 02:10:00
Note: I've only added dummy dates to differentiate times after midnight. I will eventually drop the dates and leave just the times once the df is appropriately sorted.
Unfortunately pivot_table sorts column names by default and has no parameter to avoid it, so a possible solution is DataFrame.reindex with the original unique values of the Place column:
#if necessary convert to datetimes and sorting
df['Time'] = pd.to_datetime(df['Time'])
df = df.sort_values('Time')
df1 = df.pivot_table(index='Number',columns='Place',values='Time',aggfunc='first').fillna('')
df1 = df1.reindex(columns=df['Place'].unique())
print (df1)
Place John Alan Cory \
Number
2 1904-01-01 08:00:00
3 1904-01-01 08:10:00 1904-01-01 09:00:00
4 1904-01-01 09:10:00
5 1904-01-02 01:00:00
6 1904-01-02 01:10:00
Place Jim
Number
2
3
4
5 1904-01-02 02:00:00
6 1904-01-02 02:10:00
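Since the question mentions eventually dropping the dates, here is a short sketch (an assumption, not part of the original answer) that keeps only the time-of-day after reindexing; it skips fillna('') so each column can still be converted with the dt accessor:
df2 = df.pivot_table(index='Number', columns='Place', values='Time', aggfunc='first')
df2 = df2.reindex(columns=df['Place'].unique())
df2 = df2.apply(lambda col: pd.to_datetime(col).dt.time)  # missing cells stay NaT
print(df2)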

Pandas SettingWithCopy warning when using .loc

I'm trying to change the values in a column of a dataframe based on a condition.
In [1]:df.head()
Out[2]: gen cont
timestamp
2012-07-01 00:00:00 0.293 0
2012-07-01 00:30:00 0.315 0
2012-07-01 01:00:00 0.0 0
2012-07-01 01:30:00 0.005 0
2012-07-01 02:00:00 0.231 0
I want to set the 'gen' column to NaN whenever the sum of the 2 columns is below a threshold of 0.01, so what I want is this:
In [1]:df.head()
Out[2]: gen cont
timestamp
2012-07-01 00:00:00 0.293 0
2012-07-01 00:30:00 0.315 0
2012-07-01 01:00:00 NaN 0
2012-07-01 01:30:00 NaN 0
2012-07-01 02:00:00 0.231 0
I have used this:
df.loc[df.gen + df.cont < 0.01, 'gen'] = np.nan
It gives me the result I want but with the warning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
I am confused because I am using .loc and I think I'm using it in the way suggested.
Your solution works fine for me.
An alternative solution uses mask, which by default sets NaN where the condition is True:
df['gen'] = df['gen'].mask(df['gen'] + df['cont'] < 0.01)
print (df)
timestamp gen cont
0 2012-07-01 00:00:00 0.293 0
1 2012-07-01 00:30:00 0.315 0
2 2012-07-01 01:00:00 NaN 0
3 2012-07-01 01:30:00 NaN 0
4 2012-07-01 02:00:00 0.231 0
EDIT:
You need a copy.
If you modify values in df later, you may find the modifications do not propagate back to the original data (df_in), and pandas warns about this:
df = df_in.loc[sDate:eDate].copy()
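A minimal, self-contained sketch of that situation (the data here is just the question's sample, and df_in stands in for the larger original frame):
import numpy as np
import pandas as pd

idx = pd.date_range('2012-07-01', periods=5, freq='30min', name='timestamp')
df_in = pd.DataFrame({'gen': [0.293, 0.315, 0.0, 0.005, 0.231], 'cont': 0}, index=idx)

# take a slice of the bigger frame; the explicit .copy() is what avoids the warning
df = df_in.loc['2012-07-01 00:00':'2012-07-01 01:30'].copy()

# with an explicit copy, the .loc assignment is unambiguous and warning-free
df.loc[df['gen'] + df['cont'] < 0.01, 'gen'] = np.nan
print(df)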

pandas resampling without performing statistics

I have a five minute dataframe:
rng = pd.date_range('1/1/2011', periods=60, freq='5Min')
df = pd.DataFrame(np.random.randn(60, 4), index=rng, columns=['A', 'B', 'C', 'D'])
A B C D
2011-01-01 00:00:00 1.287045 -0.621473 0.482130 1.886648
2011-01-01 00:05:00 0.402645 -1.335942 -0.609894 -0.589782
2011-01-01 00:10:00 -0.311789 0.342995 -0.875089 -0.781499
2011-01-01 00:15:00 1.970683 0.471876 1.042425 -0.128274
2011-01-01 00:20:00 -1.900357 -0.718225 -3.168920 -0.355735
2011-01-01 00:25:00 1.128843 -0.097980 1.130860 -1.045019
2011-01-01 00:30:00 -0.261523 0.379652 -0.385604 -0.910902
I would like to resample only the data on the 15-minute interval, but without aggregating into a statistic (I don't want the mean, median, or stdev). I want to subsample and get the actual data on the 15-minute interval. Is there a builtin method to do this?
My output would be:
A B C D
2011-01-01 00:00:00 1.287045 -0.621473 0.482130 1.886648
2011-01-01 00:15:00 1.970683 0.471876 1.042425 -0.128274
2011-01-01 00:30:00 -0.261523 0.379652 -0.385604 -0.910902
You can resample to 15 min and take the 'first' of each group:
In [40]: df.resample('15min').first()
Out[40]:
A B C D
2011-01-01 00:00:00 -0.415637 -1.345454 1.151189 -0.834548
2011-01-01 00:15:00 0.221777 -0.866306 0.932487 -1.243176
2011-01-01 00:30:00 -0.690039 0.778672 -0.527087 -0.156369
...
Another way to do this is to construct the new desired index and do a reindex (this is a bit more work in this case, but for an irregular time series it ensures you take the data at exactly each 15 min):
In [42]: new_rng = pd.date_range('1/1/2011', periods=20, freq='15min')
In [43]: df.reindex(new_rng)
Out[43]:
A B C D
2011-01-01 00:00:00 -0.415637 -1.345454 1.151189 -0.834548
2011-01-01 00:15:00 0.221777 -0.866306 0.932487 -1.243176
2011-01-01 00:30:00 -0.690039 0.778672 -0.527087 -0.156369
...
Function asfreq() doesn't do any aggregation:
df.asfreq('15min')
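If the index is perfectly regular, a plain positional subsample also works (a hedged alternative rather than a resampling method): every third row of 5-minute data is the 15-minute data.
df.iloc[::3]  # every 3rd row of a regular 5-minute series = 15-minute samples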