I have a pandas DataFrame whose index labels are dates. I just want to get the starting date (i.e. the first entry) and the ending date (i.e. the last entry). What is the best way to do that?
Any help would be much appreciated.
You could use the index's format method. For example,
In [44]: df = pd.DataFrame({'foo':1}, index=pd.date_range('2000-1-1', periods=5, freq='D'))
In [45]: df
Out[45]:
foo
2000-01-01 1
2000-01-02 1
2000-01-03 1
2000-01-04 1
2000-01-05 1
[5 rows x 1 columns]
In [46]: df.index[[0,-1]].format()
Out[46]: ['2000-01-01', '2000-01-05']
To get the index of a dataframe as a list of strings, use:
df.index.format()
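If the dates are wanted as Timestamp objects rather than strings, the first and last index entries can also be read directly. A minimal sketch, reusing the example frame from above; the names start, end, start_str and end_str are illustrative only:

```python
import pandas as pd

df = pd.DataFrame({"foo": 1}, index=pd.date_range("2000-01-01", periods=5, freq="D"))

# First and last index labels as pandas Timestamps.
start, end = df.index[0], df.index[-1]

# The same two labels as formatted strings, via DatetimeIndex.strftime.
start_str, end_str = df.index[[0, -1]].strftime("%Y-%m-%d")

print(start, end)
print(start_str, end_str)
```

If the index is not sorted, df.index.min() and df.index.max() give the earliest and latest dates rather than the first and last rows.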
I have a dataframe with a "Fecha" column. I would like to reduce the DataFrame's size by filtering it, keeping only the rows whose timestamps fall on a multiple of 10 minutes and discarding all other rows.
Any ideas?
Thanks
I have to guess some variable names. But assuming your dataframe name is df, the solution should look similar to:
df['Fecha'] = pd.to_datetime(df['Fecha'])
df = df[df['Fecha'].dt.minute % 10 == 0]
The first line guarantees that your 'Fecha' column has a datetime dtype. The second line keeps only the rows whose minute is a multiple of 10. To do this you apply the modulus operator % to the minutes, which for a column are reached through the .dt accessor.
Since I'm not sure if this solves your problem, here's a minimal example that runs by itself:
import pandas as pd
idx = pd.date_range(pd.Timestamp(2020, 1, 1), periods=60, freq='1T')
series = pd.Series(1, index=idx)
series = series[series.index.minute % 10 == 0]
series
The first three lines construct a series with a 1 minute index, which is filtered in the fourth line.
Output:
2020-01-01 00:00:00 1
2020-01-01 00:10:00 1
2020-01-01 00:20:00 1
2020-01-01 00:30:00 1
2020-01-01 00:40:00 1
2020-01-01 00:50:00 1
dtype: int64
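For the original question, where the timestamps live in a column rather than in the index, the same filter goes through the .dt accessor. A small sketch, assuming a column named 'Fecha' as in the question; the 'valor' column and the sample data are made up for illustration:

```python
import pandas as pd

# Hypothetical data: one row per minute, with 'Fecha' stored as strings at first.
df = pd.DataFrame({
    "Fecha": pd.date_range("2020-01-01", periods=60, freq="min").astype(str),
    "valor": range(60),
})

# Make sure the column has a datetime dtype before using .dt.
df["Fecha"] = pd.to_datetime(df["Fecha"])

# Keep only the rows whose minute is a multiple of 10.
filtered = df[df["Fecha"].dt.minute % 10 == 0]
print(filtered)
```

If the data also contains sub-minute timestamps, comparing the column against df["Fecha"].dt.floor("10min") would be one way to keep only the exact 10-minute marks.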
This is the input table in pandas: (image of the input table)
This is the desired output table: (image of the output table)
Dear Friends,
I am new to pandas. How can I get the result shown in the second image using pandas?
I am getting the output shown below using this approach:
df.groupby(['Months', 'Status']).size()
Months  Status
Apr-20  IW        2
        OW        1
Jun-20  IW        4
        OW        4
May-20  IW        3
        OW        2
dtype: int64
But how do I convert this output into the form shown in the second image?
It will be more helpful if someone is able to help me. Thanks in advance.
Use crosstab with margins=True to add totals. Then, if necessary, remove the last Total column, restore the original order of the columns with DataFrame.reindex, convert the index to a column with DataFrame.reset_index, and remove the columns' name with DataFrame.rename_axis:
df = (pd.crosstab(df['Status'], df['Months'], margins_name='Total', margins=True)
.iloc[:, :-1]
.reindex(df['Months'].unique(), axis=1)
.reset_index()
.rename_axis(None, axis=1))
print (df)
Status Apr_20 May_20 Jun_20
0 IW 4 2 4
1 OW 1 2 4
2 Total 5 4 8
Unstack, and then transpose:
df = df.groupby(['Months', 'Status']).size().unstack().T
To get a total row (using pd.concat, since DataFrame.append was removed in pandas 2.0):
pd.concat([df.sum().rename('Total').to_frame().T, df])
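To make the crosstab approach easier to try out, here is a self-contained sketch. The input rows are invented so that they reproduce the groupby counts shown in the question; the names counts, rows, month_order and table are illustrative only:

```python
import pandas as pd

# Hypothetical input: one row per record, built from the counts in the question.
counts = {
    ("Apr-20", "IW"): 2, ("Apr-20", "OW"): 1,
    ("May-20", "IW"): 3, ("May-20", "OW"): 2,
    ("Jun-20", "IW"): 4, ("Jun-20", "OW"): 4,
}
rows = [(month, status) for (month, status), n in counts.items() for _ in range(n)]
df = pd.DataFrame(rows, columns=["Months", "Status"])

# Remember the order in which the months first appear in the data.
month_order = df["Months"].unique()

table = (
    pd.crosstab(df["Status"], df["Months"], margins=True, margins_name="Total")
    .drop(columns="Total")          # drop the per-row total column
    .reindex(month_order, axis=1)   # undo crosstab's alphabetical column sort
)
print(table)
```

The Total row is kept and the Total column dropped, which mirrors the .iloc[:, :-1] step in the answer above.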
I have a dataframe in pandas called 'munged_data' with two columns, 'entry_date' and 'dob', which I have converted to Timestamps using pd.to_timestamp. I am trying to figure out how to calculate people's ages from the time difference between 'entry_date' and 'dob'. To do this I need the difference in days between the two columns (so that I can then do something like round(days/365.25)). I do not seem to be able to find a way to do this using a vectorized operation. When I do munged_data.entry_date-munged_data.dob I get the following:
internal_quote_id
2 15685977 days, 23:54:30.457856
3 11651985 days, 23:49:15.359744
4 9491988 days, 23:39:55.621376
7 11907004 days, 0:10:30.196224
9 15282164 days, 23:30:30.196224
15 15282227 days, 23:50:40.261632
However, I do not seem to be able to extract the days as an integer so that I can continue with my calculation.
Any help appreciated.
Using the pandas Timedelta type, available since v0.15.0, you can also do:
In[1]: import pandas as pd
In[2]: df = pd.DataFrame([ pd.Timestamp('20150111'),
pd.Timestamp('20150301') ], columns=['date'])
In[3]: df['today'] = pd.Timestamp('20150315')
In[4]: df
Out[4]:
date today
0 2015-01-11 2015-03-15
1 2015-03-01 2015-03-15
In[5]: (df['today'] - df['date']).dt.days
Out[5]:
0 63
1 14
dtype: int64
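Applied to the question's column names, the same idea could look like the sketch below. The two sample rows are hypothetical (the dates are borrowed from another answer in this thread), and truncating days / 365.25 is only an approximation of age in whole years:

```python
import pandas as pd

# Hypothetical sample using the column names from the question.
munged_data = pd.DataFrame({
    "dob": pd.to_datetime(["2001-01-01", "2004-06-01"]),
    "entry_date": pd.to_datetime(["2013-04-19", "2013-04-19"]),
})

# Vectorized difference in whole days via the .dt accessor.
days = (munged_data["entry_date"] - munged_data["dob"]).dt.days

# Approximate age in completed years.
munged_data["age"] = (days / 365.25).astype(int)
print(munged_data)
```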
You need 0.11 for this (0.11rc1 is out, the final release is probably next week)
In [9]: df = DataFrame([ Timestamp('20010101'), Timestamp('20040601') ])
In [10]: df
Out[10]:
0
0 2001-01-01 00:00:00
1 2004-06-01 00:00:00
In [11]: df = DataFrame([ Timestamp('20010101'),
Timestamp('20040601') ],columns=['age'])
In [12]: df
Out[12]:
age
0 2001-01-01 00:00:00
1 2004-06-01 00:00:00
In [13]: df['today'] = Timestamp('20130419')
In [14]: df['diff'] = df['today']-df['age']
In [16]: df['years'] = df['diff'].apply(lambda x: float(x.item().days)/365)
In [17]: df
Out[17]:
age today diff years
0 2001-01-01 00:00:00 2013-04-19 00:00:00 4491 days, 00:00:00 12.304110
1 2004-06-01 00:00:00 2013-04-19 00:00:00 3244 days, 00:00:00 8.887671
You need this odd apply at the end because there is not yet full support for timedelta64[ns] scalars (similar to how Timestamps are now used for datetime64[ns]; this is coming in 0.12).
Not sure if you still need it, but in pandas 0.14 I usually use the .astype('timedelta64[X]') method:
http://pandas.pydata.org/pandas-docs/stable/timeseries.html (frequency conversion)
df = pd.DataFrame([ pd.Timestamp('20010101'), pd.Timestamp('20040605') ])
df.ix[0]-df.ix[1]
Returns:
0 -1251 days
dtype: timedelta64[ns]
(df.ix[0]-df.ix[1]).astype('timedelta64[Y]')
Returns:
0 -4
dtype: float64
Hope that will help
Suppose you have a pandas Series named time_difference with dtype timedelta64[ns].
One way of extracting just the days (or whatever attribute you want) is the following:
just_day = time_difference.apply(lambda x: pd.Timedelta(x).days)
This function is used because the numpy.timedelta64 object does not have a 'days' attribute.
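For comparison, here is a short sketch of three ways to get days out of a timedelta Series. The sample values are made up; the .dt accessor and division by a pd.Timedelta are offered as alternatives to the apply call, not as what this answer originally proposed:

```python
import pandas as pd

# Hypothetical timedelta64[ns] Series.
time_difference = pd.Series(pd.to_timedelta(["63 days 12:00:00", "14 days"]))

# Whole days, vectorized.
whole_days = time_difference.dt.days

# Fractional days: divide by a one-day Timedelta.
fractional_days = time_difference / pd.Timedelta(days=1)

# Element-wise version, as in the answer above.
via_apply = time_difference.apply(lambda x: pd.Timedelta(x).days)

print(whole_days.tolist(), fractional_days.tolist(), via_apply.tolist())
```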
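To convert a duration expressed in another unit into whole days, use pd.Timedelta(...).days. Avoid unit='Y': it is ambiguous, recent pandas versions reject it, and a Timedelta cannot represent spans much longer than about 292 years.
pd.Timedelta(52, unit='W').days
364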
I have a pandas DataFrame where I am trying to change one of the cells' timestamps to a different timestamp. However, I'm not getting the result I expect. Here's what I have:
>>> df = pd.DataFrame({"d": [np.datetime64('2013-07-14T10:30:30.521Z'), np.datetime64('2013-07-21T10:30:30.521Z')]})
>>> df
d
0 2013-07-14 10:30:30.521000
1 2013-07-21 10:30:30.521000
>>> df.iloc[-1, df.columns.get_loc("d")] = np.datetime64('2013-08-29T10:30:30.521Z')
>>> df
d
0 2013-07-14 10:30:30.521000
1 1970-01-01 00:22:57.772230521
As the example illustrates, the timestamp I get for df.loc[1, "d"] is not the one I am assigning to that cell. I don't understand this behavior or where I'm going wrong. Is there some other way I need to be changing the value of a timestamp?
Edit: the above is just a simple example. My actual df has many columns, not just 1. I'm using pandas version 0.16.1 (and can't change the version).
Since your df has only one column, you can do it this way:
In [29]: df
Out[29]:
d
0 2013-07-14 10:30:30.521
1 2013-07-21 10:30:30.521
In [30]: df.iloc[-1] = pd.to_datetime('2013-08-29T10:30:30.521Z')
In [31]: df
Out[31]:
d
0 2013-07-14 10:30:30.521
1 2013-08-29 10:30:30.521
UPDATE: if you have multiple columns in your DF:
In [47]: df
Out[47]:
d a
0 2013-07-14 10:30:30.521 1
1 2013-07-21 10:30:30.521 1
In [48]: df.loc[df.index[-1], 'd'] = pd.to_datetime('2013-08-29T10:30:30.521Z')
In [49]: df
Out[49]:
d a
0 2013-07-14 10:30:30.521 1
1 2013-08-29 10:30:30.521 1
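The multi-column case can also be written as a standalone script. This is a sketch checked against the behaviour of recent pandas versions only; it has not been tried on 0.16.1, the version mentioned in the question. It keeps the positional .iloc form from the question but assigns a pd.Timestamp instead of a np.datetime64:

```python
import pandas as pd

df = pd.DataFrame({
    "d": pd.to_datetime(["2013-07-14 10:30:30.521", "2013-07-21 10:30:30.521"]),
    "a": [1, 1],
})

new_value = pd.Timestamp("2013-08-29 10:30:30.521")

# Positional assignment to the last row of column 'd'.
df.iloc[-1, df.columns.get_loc("d")] = new_value
print(df)
```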
I'm sure this is probably very simple but I can't figure out how to slice a pandas HDFStore table by its datetime index to get a specific range of rows.
I have a table that looks like this:
mdstore = pd.HDFStore('store.h5')
histTable = '/ES_USD20120615_MIDPOINT30s'
print(mdstore[histTable])
open high low close volume WAP \
date
2011-12-04 23:00:00 1266.000 1266.000 1266.000 1266.000 -1 -1
2011-12-04 23:00:30 1266.000 1272.375 1240.625 1240.875 -1 -1
2011-12-04 23:01:00 1240.875 1242.250 1240.500 1242.125 -1 -1
...
[488000 rows x 7 columns]
For example I'd like to get the range from 2012-01-11 23:00:00 to 2012-01-12 22:30:00. If it were in a df I would just use datetimes to slice on the index, but I can't figure out how to do that directly from the store table so I don't have to load the whole thing into memory.
I tried mdstore.select(histTable, where='index>20120111') and that worked in as much as I got everything on the 11th and 12th, but I couldn't see how to add a time in.
Here is an example (needs pandas >= 0.13.0):
In [2]: df = DataFrame(np.random.randn(5),index=date_range('20130101 09:00:00',periods=5,freq='s'))
In [3]: df
Out[3]:
0
2013-01-01 09:00:00 -0.110577
2013-01-01 09:00:01 -0.420989
2013-01-01 09:00:02 0.656626
2013-01-01 09:00:03 -0.350615
2013-01-01 09:00:04 -0.830469
[5 rows x 1 columns]
In [4]: df.to_hdf('test.h5','data',mode='w',format='table')
Specify it as a quoted string
In [8]: pd.read_hdf('test.h5','data',where='index>"20130101 09:00:01" & index<"20130101 09:00:04"')
Out[8]:
0
2013-01-01 09:00:02 0.656626
2013-01-01 09:00:03 -0.350615
[2 rows x 1 columns]
You can also specify it directly as a Timestamp
In [10]: pd.read_hdf('test.h5','data',where='index>Timestamp("20130101 09:00:01") & index<Timestamp("20130101 09:00:04")')
Out[10]:
0
2013-01-01 09:00:02 0.656626
2013-01-01 09:00:03 -0.350615
[2 rows x 1 columns]