I want to pivot a df and display values based off time values, not column values.
import pandas as pd
df = pd.DataFrame({
'Place' : ['John','Alan','Cory','Jim','John','Alan','Cory','Jim'],
'Number' : ['2','3','5','5','3','4','6','6'],
'Code' : ['1','2','3','4','1','2','3','4'],
'Time' : ['1904-01-01 08:00:00','1904-01-01 09:00:00','1904-01-02 01:00:00','1904-01-02 02:00:00','1904-01-01 08:10:00','1904-01-01 09:10:00','1904-01-02 01:10:00','1904-01-02 02:10:00'],
})
df = df.pivot_table(index = 'Number', columns = 'Place', values = 'Time', aggfunc = 'first').fillna('')
Out:
Place                  Alan                 Cory                  Jim                 John
Number
2                                                                      1904-01-01 08:00:00
3       1904-01-01 09:00:00                                            1904-01-01 08:10:00
4       1904-01-01 09:10:00
5                            1904-01-02 01:00:00  1904-01-02 02:00:00
6                            1904-01-02 01:10:00  1904-01-02 02:10:00
Intended Output:
Place                  John                 Alan                 Cory                  Jim
Number
2       1904-01-01 08:00:00
3       1904-01-01 08:10:00  1904-01-01 09:00:00
4                            1904-01-01 09:10:00
5                                                 1904-01-02 01:00:00  1904-01-02 02:00:00
6                                                 1904-01-02 01:10:00  1904-01-02 02:10:00
Note: I've only added dummy dates to distinguish times after midnight. I will eventually drop the dates and keep just the times once the df is sorted appropriately.
Unfortunately, pivot_table sorts column names by default and has no parameter to avoid this. One possible solution is to call DataFrame.reindex with the original unique values of the Place column:
# if necessary, convert to datetimes and sort
df['Time'] = pd.to_datetime(df['Time'])
df = df.sort_values('Time')
df1 = df.pivot_table(index='Number',columns='Place',values='Time',aggfunc='first').fillna('')
df1 = df1.reindex(columns=df['Place'].unique())
print (df1)
Place                  John                 Alan                 Cory                  Jim
Number
2       1904-01-01 08:00:00
3       1904-01-01 08:10:00  1904-01-01 09:00:00
4                            1904-01-01 09:10:00
5                                                 1904-01-02 01:00:00  1904-01-02 02:00:00
6                                                 1904-01-02 01:10:00  1904-01-02 02:10:00
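As a compact restatement of the reindex approach, here is a minimal sketch that rebuilds the question's frame and orders the pivoted columns by each place's earliest time. The names `place_order` and `wide` are illustrative only, and the empty cells are left as NaT rather than filled with empty strings.

```python
import pandas as pd

df = pd.DataFrame({
    'Place': ['John', 'Alan', 'Cory', 'Jim', 'John', 'Alan', 'Cory', 'Jim'],
    'Number': ['2', '3', '5', '5', '3', '4', '6', '6'],
    'Time': ['1904-01-01 08:00:00', '1904-01-01 09:00:00',
             '1904-01-02 01:00:00', '1904-01-02 02:00:00',
             '1904-01-01 08:10:00', '1904-01-01 09:10:00',
             '1904-01-02 01:10:00', '1904-01-02 02:10:00'],
})
df['Time'] = pd.to_datetime(df['Time'])

# Places in order of first appearance once rows are sorted by time
# (place_order is a hypothetical helper name, not part of pandas).
place_order = df.sort_values('Time')['Place'].unique()

# pivot_table sorts the column labels alphabetically ...
wide = df.pivot_table(index='Number', columns='Place',
                      values='Time', aggfunc='first')
# ... so restore the time-based order afterwards.
wide = wide.reindex(columns=place_order)
print(wide)
```

Because `unique()` keeps values in order of appearance, sorting by `Time` first is what makes the column order follow the times.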
I am using the data below, which is saved in a CSV file, and trying to convert it to hourly values using linear interpolation, but without success.
Code:
import pandas as pd
df = pd.read_csv('d:/Python/resampling/FairyLake.csv')
df[ 'Date' ] = pd.to_datetime(df['Date'])
df.set_index('Date').resample('M').interpolate()
print(df)
Data
Date,Discharge
1/3/2008,0.05865
1/4/2008,0.105812
1/5/2008,0.191388
1/6/2008,0.315378
1/7/2008,0.477782
1/8/2008,0.6786
1/9/2008,0.917832
1/10/2008,0.783875701
1/11/2008,0.65678957
1/12/2008,0.545651187
1/13/2008,0.44222808
1/14/2008,0.353907613
1/15/2008,0.27414753
Results
Date Discharge
0 2008-01-03 0.058650
1 2008-01-04 0.105812
2 2008-01-05 0.191388
3 2008-01-06 0.315378
4 2008-01-07 0.477782
5 2008-01-08 0.678600
6 2008-01-09 0.917832
7 2008-01-10 0.783876
8 2008-01-11 0.656790
9 2008-01-12 0.545651
10 2008-01-13 0.442228
11 2008-01-14 0.353908
12 2008-01-15 0.274148
Two things:
1. The resample used for the interpolation should be hourly ('H'), not monthly ('M').
2. The result needs to be assigned back (df = ...):
df['Date'] = pd.to_datetime(df['Date'])
df = df.set_index('Date').resample('H').interpolate()
df:
Discharge
Date
2008-01-03 00:00:00 0.058650
2008-01-03 01:00:00 0.060615
2008-01-03 02:00:00 0.062580
2008-01-03 03:00:00 0.064545
2008-01-03 04:00:00 0.066510
... ...
2008-01-14 20:00:00 0.287441
2008-01-14 21:00:00 0.284118
2008-01-14 22:00:00 0.280794
2008-01-14 23:00:00 0.277471
2008-01-15 00:00:00 0.274148
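For a self-contained check, here is a small sketch that runs the same steps on the first three rows of the posted data, read from an in-memory string instead of the CSV file. The lowercase `'h'` alias and the explicit `format=` argument are my assumptions: recent pandas releases prefer `'h'` over `'H'`, and the dates look month-first.

```python
import io
import pandas as pd

# First three rows of the posted data, inlined so the example runs anywhere.
csv_text = """Date,Discharge
1/3/2008,0.05865
1/4/2008,0.105812
1/5/2008,0.191388
"""

df = pd.read_csv(io.StringIO(csv_text))
# Assumption: dates are month/day/year.
df['Date'] = pd.to_datetime(df['Date'], format='%m/%d/%Y')

# Upsample to hourly rows and fill the gaps by linear interpolation.
# The result is assigned to a new name; the original df is not modified.
hourly = df.set_index('Date').resample('h').interpolate()
print(hourly.head())
print(len(hourly))  # 2 days * 24 hours + 1 closing row
```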
I have 2 columns of data in a pandas DataFrame that look like this, with the "DateTime" column in the format YYYY-MM-DD HH:MM:SS. This is the first 24 hours, but the df covers one full year, or 8784 x 2.
BAFFIN BAY DateTime
8759 8.112838 2016-01-01 00:00:00
8760 7.977169 2016-01-01 01:00:00
8761 8.420204 2016-01-01 02:00:00
8762 9.515370 2016-01-01 03:00:00
8763 9.222840 2016-01-01 04:00:00
8764 8.872423 2016-01-01 05:00:00
8765 8.776145 2016-01-01 06:00:00
8766 9.030668 2016-01-01 07:00:00
8767 8.394983 2016-01-01 08:00:00
8768 8.092915 2016-01-01 09:00:00
8769 8.946967 2016-01-01 10:00:00
8770 9.620883 2016-01-01 11:00:00
8771 9.535951 2016-01-01 12:00:00
8772 8.861761 2016-01-01 13:00:00
8773 9.077692 2016-01-01 14:00:00
8774 9.116074 2016-01-01 15:00:00
8775 8.724343 2016-01-01 16:00:00
8776 8.916940 2016-01-01 17:00:00
8777 8.920438 2016-01-01 18:00:00
8778 8.926278 2016-01-01 19:00:00
8779 8.817666 2016-01-01 20:00:00
8780 8.704014 2016-01-01 21:00:00
8781 8.496358 2016-01-01 22:00:00
8782 8.434297 2016-01-01 23:00:00
I am trying to calculate daily averages of the "BAFFIN BAY" column, and I've tried these approaches:
davg_df2 = df2.groupby(pd.Grouper(freq='D', key='DateTime')).mean()
davg_df2 = df2.groupby(pd.Grouper(freq='1D', key='DateTime')).mean()
davg_df2 = df2.groupby(by=df2['DateTime'].dt.date).mean()
All of these approaches yield the same answer, shown below:
BAFFIN BAY
DateTime
2016-01-01 6.008044
However, if you do the math, the correct average for 2016-01-01 is 8.813134. Thank you kindly for your help. I'm assuming the grouping is just by day (24 hrs) to produce consecutive daily averages, but the 3 approaches above are clearly picking up other data in my 8784 x 2 DataFrame.
I just ran your df with this code and I get 8.813134:
df['DateTime'] = pd.to_datetime(df['DateTime'])
df = df.groupby(by=pd.Grouper(freq='D', key='DateTime')).mean()
print(df)
Output:
BAFFIN BAY
DateTime
2016-01-01 8.813134
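If the full frame still gives a different number, counting the rows that fall on each day may show whether extra rows are being grouped in. The sketch below uses made-up data (48 hourly readings with values 0 to 47, not the asker's measurements) purely to illustrate the grouping and the row-count check.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the real frame: two days of hourly rows,
# with DateTime stored as text the way it often arrives from a file.
stamps = pd.date_range('2016-01-01', periods=48, freq='h')
df2 = pd.DataFrame({
    'BAFFIN BAY': np.arange(48, dtype=float),
    'DateTime': stamps.strftime('%Y-%m-%d %H:%M:%S'),
})

# Parse the text column so the Grouper sees real datetimes.
df2['DateTime'] = pd.to_datetime(df2['DateTime'])

daily = df2.groupby(pd.Grouper(key='DateTime', freq='D'))['BAFFIN BAY'].mean()
print(daily)

# Sanity check: an hourly series should contribute 24 rows per day.
rows_per_day = df2.groupby(df2['DateTime'].dt.date).size()
print(rows_per_day)
```

A day with more than 24 rows would point to duplicated or mis-parsed timestamps rather than a problem with the grouping itself.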
I have an 'hour' column in a pandas dataframe that is simply a list of numbers from 0 to 23 representing hours. How can I convert them to an hour format such as 01:00 when the numbers are single digit ( like 1 ) and double digit (like 18)? The single digit numbers need to have a leading zero, a colon and two trailing zeros. The double digit numbers need only a colon and two trailing zeros. How can this be accomplished in a dataframe? Also, I have a 'date' column that needs to merge with the hour column after the hour column is converted.
e.g. date hour
2018-07-01 0
2018-07-01 1
2018-07-01 3
...
2018-07-01 21
2018-07-01 22
2018-07-01 23
Needs to look like:
date
2018-07-01 01:00
...
2018-07-01 23:00
The source of the data is a .csv file.
Thanks for your consideration. I'm new to pandas and I can't find in their documentation how to do this considering the single and double digit numbers.
Convert the hours to timedeltas with to_timedelta and add them to the dates, converted with to_datetime if necessary:
df['date'] = pd.to_datetime(df['date']) + pd.to_timedelta(df['hour'], unit='h')
print (df)
date hour
0 2018-07-01 00:00:00 0
1 2018-07-01 01:00:00 1
2 2018-07-01 03:00:00 3
3 2018-07-01 21:00:00 21
4 2018-07-01 22:00:00 22
5 2018-07-01 23:00:00 23
If you also need to remove the hour column, use DataFrame.pop:
df['date'] = pd.to_datetime(df['date']) + pd.to_timedelta(df.pop('hour'), unit='h')
print (df)
date
0 2018-07-01 00:00:00
1 2018-07-01 01:00:00
2 2018-07-01 03:00:00
3 2018-07-01 21:00:00
4 2018-07-01 22:00:00
5 2018-07-01 23:00:00
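If the goal is text in the exact `2018-07-01 01:00` layout from the question, one option is to format the combined datetimes afterwards. This is only a sketch: the column names `date_hour` and `hour_text` are hypothetical, and the second column shows a string-only way to get the zero-padded hour.

```python
import pandas as pd

df = pd.DataFrame({
    'date': ['2018-07-01', '2018-07-01', '2018-07-01'],
    'hour': [0, 1, 23],
})

# Combine date and hour into real datetimes first.
stamp = pd.to_datetime(df['date']) + pd.to_timedelta(df['hour'], unit='h')

# Then render them as text; %H always pads the hour to two digits.
df['date_hour'] = stamp.dt.strftime('%Y-%m-%d %H:%M')

# String-only alternative for just the hour part: pad to two digits, add ':00'.
df['hour_text'] = df['hour'].astype(str).str.zfill(2) + ':00'
print(df)
```

Keeping the datetime version around is usually more useful for later calculations; the formatted strings are best treated as display output.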
I have a dataframe with trip counts every 20 minutes during a whole month, let's say:
Date Trip count
0 2019-08-01 00:00:00 3
1 2019-08-01 00:20:00 2
2 2019-08-01 00:40:00 4
3 2019-08-02 00:00:00 6
4 2019-08-02 00:20:00 4
5 2019-08-02 00:40:00 2
I want to take the daily mean of the trip counts for every 20-minute slot. Desired output (for above values) looks like:
Date mean
0 00:00:00 4.5
1 00:20:00 3
2 00:40:00 3
..
72 23:40:00 ..
You can aggregate by the times produced by Series.dt.time, because the minutes are always 00, 20, or 40 and there are no seconds:
df['Date'] = pd.to_datetime(df['Date'])
df1 = df.groupby(df['Date'].dt.time).mean()
#alternative
#df1 = df.groupby(df['Date'].dt.strftime('%H:%M:%S')).mean()
print (df1)
Trip count
Date
00:00:00 4.5
00:20:00 3.0
00:40:00 3.0
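A runnable version on the sample rows might look like the sketch below. It selects `Trip count` explicitly before taking the mean, which is my own precaution so the datetime column is never averaged, and it renames the result to `mean` to match the desired output; `mean_by_time` is just an illustrative name.

```python
import pandas as pd

df = pd.DataFrame({
    'Date': ['2019-08-01 00:00:00', '2019-08-01 00:20:00', '2019-08-01 00:40:00',
             '2019-08-02 00:00:00', '2019-08-02 00:20:00', '2019-08-02 00:40:00'],
    'Trip count': [3, 2, 4, 6, 4, 2],
})
df['Date'] = pd.to_datetime(df['Date'])

# Group rows from all days by their time of day and average the counts.
mean_by_time = (
    df.groupby(df['Date'].dt.time)['Trip count']
      .mean()
      .rename('mean')
      .reset_index()
)
print(mean_by_time)
```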
I'm using pandas 0.12.0. I have a DataFrame that looks like:
date ms
0 2013-06-03 00:10:00 75.846318
1 2013-06-03 00:20:00 78.408277
2 2013-06-03 00:30:00 75.807990
3 2013-06-03 00:40:00 70.509438
4 2013-06-03 00:50:00 71.537499
I want to generate a third column, "tod", which contains just the time portion of the date (i.e. call .time() on each value). I'm somewhat of a pandas newbie, so I suspect this is trivial but I'm just not seeing how to do it.
Just apply the Timestamp time method to items in the date column:
In [11]: df['date'].apply(lambda x: x.time())
# equivalently .apply(pd.Timestamp.time)
Out[11]:
0 00:10:00
1 00:20:00
2 00:30:00
3 00:40:00
4 00:50:00
Name: date, dtype: object
In [12]: df['tod'] = df['date'].apply(lambda x: x.time())
This gives a column of datetime.time objects.
Using the method Andy created on Index is faster than apply:
In [93]: df = DataFrame(randn(5,1),columns=['A'])
In [94]: df['date'] = date_range('20130101 9:05',periods=5)
In [95]: df['time'] = Index(df['date']).time
In [96]: df
Out[96]:
A date time
0 0.053570 2013-01-01 09:05:00 09:05:00
1 -0.382155 2013-01-02 09:05:00 09:05:00
2 0.357984 2013-01-03 09:05:00 09:05:00
3 -0.718300 2013-01-04 09:05:00 09:05:00
4 0.531953 2013-01-05 09:05:00 09:05:00
In [97]: df.dtypes
Out[97]:
A float64
date datetime64[ns]
time object
dtype: object
In [98]: df['time'][0]
Out[98]: datetime.time(9, 5)
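Both answers above target the old pandas version from the question. In newer pandas versions a datetime Series also has a `.dt` accessor, so the same column can be produced without `apply` or `Index`. A minimal sketch, using the first two rows of the question's data:

```python
import datetime
import pandas as pd

df = pd.DataFrame({
    'date': pd.to_datetime(['2013-06-03 00:10:00', '2013-06-03 00:20:00']),
    'ms': [75.846318, 78.408277],
})

# .dt.time returns datetime.time objects, so the column has object dtype.
df['tod'] = df['date'].dt.time
print(df)
print(df.dtypes)
```

As with the other approaches, the result is a column of plain `datetime.time` objects rather than a dedicated pandas dtype.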