When plotting a dataframe, how to set the x-range for a 'YYYY-MM' value - pandas

I have a pandas df with the below values. I can create a nifty chart that looks like the following:
import matplotlib.pyplot as plt
ax = pdf_month.plot(x="month", y="count", kind="bar")
plt.show()
I want to truncate the date range (to ignore 1900-01-01 and other months that are not important), but every time I try I get error messages (see below). The date range would be something like '2016-01' to '2018-04'. This is what I tried:
ax.set_xlim(pdf_month['month'][17],pdf_date['count'].values.max())
where pdf_month['month'][17] gives you a value of u'2017-01'.
pdf_month.printSchema
root
|-- month: string (nullable = true)
|-- count: long (nullable = false)
How do I set the range on the month values for an x-value that isn't really an int or a date? I still have the original, pre-grouped dates. Is there a better way to group by month that would allow you to customize the x-axis?
error messages:
TypeError: ufunc 'isfinite' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
Sample output of pdf_month:
month count
0 1900-01 353
1 2015-09 1
2 2015-10 2
3 2015-11 2
4 2015-12 1
5 2016-01 1
6 2016-02 1
7 2016-03 3
8 2016-04 2
9 2016-05 5
10 2016-06 7
11 2016-07 13
12 2016-08 12
13 2016-09 41
14 2016-10 19
15 2016-11 17
16 2016-12 20

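You can try date slicing. Pandas allows slicing by partial date strings, but only on a datetime index, so convert the month column and set it as the index first:
df = df.set_index(pd.to_datetime(df['month']))
df.loc['2016-01':'2018-04']
This only works with a DatetimeIndex; on the default integer index the same slice does not select by date.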
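As a rough sketch of the idea, the frame can be filtered before it is plotted, so the bar chart never sees the unwanted months. The sample rows below are a shortened, partly made-up version of pdf_month, and the names mask, subset, by_date and sliced are only illustrative. Two routes are shown: a plain string comparison (zero-padded 'YYYY-MM' strings sort chronologically) and the datetime-index slice.

```python
import pandas as pd

# Shortened stand-in for pdf_month (values are illustrative)
pdf_month = pd.DataFrame({
    "month": ["1900-01", "2015-12", "2016-01", "2016-02", "2016-12"],
    "count": [353, 1, 1, 1, 20],
})

# Route 1: zero-padded 'YYYY-MM' strings compare in chronological order
mask = pdf_month["month"].between("2016-01", "2018-04")
subset = pdf_month[mask]

# Route 2: convert to datetimes, index by them, slice with partial date strings
by_date = pdf_month.set_index(pd.to_datetime(pdf_month["month"], format="%Y-%m"))
sliced = by_date.loc["2016-01":"2018-04"]

# Either result can then be plotted as before, for example:
# subset.plot(x="month", y="count", kind="bar")
```

Filtering first sidesteps set_xlim entirely. A pandas bar plot places the bars at integer positions and uses the month strings only as tick labels, which is probably why passing a string to set_xlim raises the TypeError above.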

Related

How to plot time series and group years together?

I have a dataframe like the one below, with the date as the index. How would I plot a time series showing a line for each of the years? I have tried df.plot(figsize=(15,4)) but this gives me one line.
Date Value
2008-01-31 22
2008-02-28 17
2008-03-31 34
2008-04-30 29
2009-01-31 33
2009-02-28 42
2009-03-31 45
2009-04-30 39
2019-01-31 17
2019-02-28 12
2019-03-31 11
2019-04-30 12
2020-01-31 24
2020-02-28 34
2020-03-31 43
2020-04-30 45
You can group by the year of the index, which draws one line per year:
import pandas as pd
df = pd.read_clipboard()
df = df.set_index(pd.DatetimeIndex(df['Date']))
df.groupby(df.index.year)['Value'].plot()
If you want each year as a separate series, so that the same days can be compared across years:
import pandas as pd
import matplotlib.pyplot as plt
# Create a date column from index (easier to manipulate)
df["date_column"] = pd.to_datetime(df.index)
# Create a year column
df["year"] = df["date_column"].dt.year
# Create a month-day column
df["month_day"] = (df["date_column"].dt.month).astype(str).str.zfill(2) + \
"-" + df["date_column"].dt.day.astype(str).str.zfill(2)
# Plot. Pivot creates one column per year, and each column is plotted as its own series.
df.pivot(index='month_day', columns='year', values='Value').plot(kind='line', figsize=(12, 8), marker='o')
plt.title("Values per Month-Day - Year comparison", y=1.1, fontsize=14)
plt.xlabel("Month-Day", labelpad=12, fontsize=12)
plt.ylabel("Value", labelpad=12, fontsize=12);
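The reshaping step can be checked on its own, without any plotting. The following is a small sketch using a few rows from the question; the name wide is made up for illustration, and the month-day key is built with strftime instead of string concatenation, which is an alternative and not what the answer above does.

```python
import pandas as pd

# A few rows from the question, with the date as the index
dates = pd.to_datetime([
    "2008-01-31", "2008-02-28",
    "2009-01-31", "2009-02-28",
    "2019-01-31", "2019-02-28",
])
df = pd.DataFrame({"Value": [22, 17, 33, 42, 17, 12]}, index=dates)

# One row per month-day, one column per year
wide = (
    df.assign(year=df.index.year, month_day=df.index.strftime("%m-%d"))
      .pivot(index="month_day", columns="year", values="Value")
)

# wide.plot(marker="o") would then draw one line per year column
```

Each column of wide is one year, so a single plot call on it yields one line per year over a shared month-day axis.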

Insert items from MultiIndexed dataframe into regular dataframe based on time

I have this regular dataframe indexed by 'Date', called ES:
Price Day Hour num_obs med abs_med Ret
Date
2006-01-03 08:30:00 1260.583333 1 8 199 1260.416667 0.166667 0.000364
2006-01-03 08:35:00 1261.291667 1 8 199 1260.697917 0.593750 0.000562
2006-01-03 08:40:00 1261.125000 1 8 199 1260.843750 0.281250 -0.000132
2006-01-03 08:45:00 1260.958333 1 8 199 1260.895833 0.062500 -0.000132
2006-01-03 08:50:00 1261.214286 1 8 199 1260.937500 0.276786 0.000203
I have this other dataframe indexed by the following MultiIndex. The first index goes from 0 to 23 and the second index goes from 0 to 55. In other words we have daily 5 minute increment data.
5min_Ret
0 0 2.235875e-06
5 9.814064e-07
10 -1.453213e-06
15 4.295757e-06
20 5.884896e-07
25 -1.340122e-06
30 9.470660e-06
35 1.178204e-06
40 -1.111621e-05
45 1.159005e-05
50 6.148861e-06
55 1.070586e-05
1 0 1.485287e-05
5 3.018576e-06
10 -1.513273e-05
15 -1.105312e-05
20 3.600874e-06
...
I want to create a column in the original dataframe, ES, that has the appropriate '5min_Ret' at each appropriate hour/5minute combo.
I've tried multiple things: looping over rows, finding some apply function. But nothing has worked so far. I feel like I'm overlooking a simple and Pythonic solution here.
The expected output creates a new column called '5min_ret' to the original dataframe in which each row corresponds to the correct hour/5minute pair from the smaller dataframe containing the 5min_ret
Price Day Hour num_obs med abs_med Ret 5min_ret
Date
2006-01-03 08:30:00 1260.583333 1 8 199 1260.416667 0.166667 0.000364 xxxx
2006-01-03 08:35:00 1261.291667 1 8 199 1260.697917 0.593750 0.000562 xxxx
2006-01-03 08:40:00 1261.125000 1 8 199 1260.843750 0.281250 -0.000132 xxxx
2006-01-03 08:45:00 1260.958333 1 8 199 1260.895833 0.062500 -0.000132 xxxx
2006-01-03 08:50:00 1261.214286 1 8 199 1260.937500 0.276786 0.000203 xxxx
I think one way is to merge on hour and minute. First create a column 'min' in ES from the DatetimeIndex:
ES['min'] = ES.index.minute
Now you can merge with your MultiIndexed dataframe containing the column '5min_Ret', which I call df_multi here:
ES = ES.reset_index().merge(df_multi.reset_index(), left_on=['Hour', 'min'],
                            right_on=['level_0', 'level_1'], how='left').set_index('Date')
Here you merge 'Hour' and 'min' from ES with 'level_0' and 'level_1', the columns that reset_index creates from the unnamed MultiIndex of df_multi. how='left' keeps every row of the left dataframe (ES). Merging on columns discards the index, so ES is reset before the merge and 'Date' is set as the index again afterwards.
You should get a new column in ES named '5min_Ret' with the values you are looking for. If you don't need the helper columns anymore, you can drop them with ES = ES.drop(['min', 'level_0', 'level_1'], axis=1)
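A minimal, self-contained sketch of that merge follows. The ES rows are trimmed to two columns, and the values in df_multi are invented (hour * 100 + minute) purely so the lookup is easy to verify by eye; lookup and merged are illustrative names.

```python
import pandas as pd

# Trimmed stand-in for ES, indexed by 'Date'
idx = pd.to_datetime([
    "2006-01-03 08:30:00",
    "2006-01-03 08:35:00",
    "2006-01-03 08:40:00",
])
ES = pd.DataFrame({"Price": [1260.58, 1261.29, 1261.13], "Hour": [8, 8, 8]}, index=idx)
ES.index.name = "Date"

# Stand-in for the (hour, minute) lookup table; the values are made up
multi = pd.MultiIndex.from_product([range(24), range(0, 60, 5)])
df_multi = pd.DataFrame({"5min_Ret": [h * 100 + m for h, m in multi]}, index=multi)

ES["min"] = ES.index.minute
lookup = df_multi.reset_index()  # columns: level_0, level_1, 5min_Ret

merged = (
    ES.reset_index()  # keep 'Date' as a column so the merge does not lose it
      .merge(lookup, left_on=["Hour", "min"], right_on=["level_0", "level_1"], how="left")
      .drop(columns=["level_0", "level_1", "min"])
      .set_index("Date")
)
```

With the invented values, the rows at 08:30, 08:35 and 08:40 pick up 830, 835 and 840, which confirms that each timestamp is matched to its own hour/minute pair.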

How do I sort column by targeting a specific number within that cell?

I would like to use pandas to sort a specific column by date (more specifically, by year). However, the year is buried within a bunch of other numbers. How do I target just the 2 digits that I need?
In the example below, I want to sort this column by the numbers [16,14,15...] rather than considering all the numbers in that row.
3/18/16 11:46
6/19/14 14:58
7/27/15 14:22
8/3/15 12:59
2/20/13 12:33
9/27/16 12:08
7/27/15 14:22
Given a dataframe like this,
date
0 3/18/16
1 6/19/14
2 7/27/15
3 8/3/15
4 2/20/13
5 9/27/16
6 7/27/15
You can convert the date column to datetime format and then sort. Passing the format explicitly avoids ambiguous parsing of the two-digit year:
df['date'] = pd.to_datetime(df['date'], format='%m/%d/%y')
df = df.sort_values(by='date')
The resulting dataframe
date
4 2013-02-20
1 2014-06-19
2 2015-07-27
6 2015-07-27
3 2015-08-03
0 2016-03-18
5 2016-09-27
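The same approach should carry over to the original strings that include a time of day. The sketch below assumes the column is named date and that every value follows the month/day/two-digit-year hour:minute pattern shown in the question. It also shows a sort on the year alone, in case that is what is really wanted; df_sorted and by_year are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({"date": [
    "3/18/16 11:46", "6/19/14 14:58", "7/27/15 14:22", "8/3/15 12:59",
    "2/20/13 12:33", "9/27/16 12:08", "7/27/15 14:22",
]})

# Parse the strings, including the time of day
df["date"] = pd.to_datetime(df["date"], format="%m/%d/%y %H:%M")

# Full chronological sort; a stable sort keeps tied rows in their original order
df_sorted = df.sort_values(by="date", kind="mergesort")

# Sort on the year only, leaving rows within a year in their original order
by_year = df.sort_values(by="date", key=lambda s: s.dt.year, kind="mergesort")
```

Sorting on the parsed datetimes is usually preferable to slicing the two year digits out of the string, because the month, day and time then break ties in a sensible way.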

Pandas Resample Strange Zero Tolerance Behavior

I'm attempting to resample a time series in Pandas and I am getting some odd behavior:
print samples[196].base_df.to_string()
Units Sales
2008-07-03 3 820.00
2008-07-04 3 470.00
...
2010-06-22 1 335.00
2010-06-24 2 180.00
2010-06-30 -1 -2502.00
print samples[196].base_df.resample('15d', how='sum')
Units Sales
2008-07-03 17 3.149130e+03
2008-07-18 29 6.305210e+03
...
2010-06-08 18 5.204000e+03
2010-06-23 1 -2.322000e+03
2010-07-08 0 6.521324e-312
I would have expected the last value in the resampled series to be either zero or omitted. Is this expected behavior for the resample function? If helpful I can post the full time series, but it is a bit long...
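For comparison, here is a sketch of the same aggregation in current pandas syntax, where the how= argument no longer exists and the aggregation is a method call on the resampler. The three rows are a made-up subset, and this does not attempt to reproduce or explain the tiny value above; it only shows how empty bins are reported with the current API.

```python
import pandas as pd

# Made-up subset with a gap long enough to leave some 15-day bins empty
idx = pd.to_datetime(["2008-07-03", "2008-07-04", "2008-08-20"])
df = pd.DataFrame({"Units": [3, 3, 1], "Sales": [820.0, 470.0, 335.0]}, index=idx)

# Empty bins sum to 0 by default
summed = df.resample("15D").sum()

# min_count=1 reports empty bins as NaN instead
summed_nan = df.resample("15D").sum(min_count=1)
```

With min_count=1 the empty bins become NaN, which makes them easy to drop with dropna() if they should be omitted.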

rank data over a rolling window in pandas DataFrame

I am new to Python and the Pandas library, so apologies if this is a trivial question. I am trying to rank a Timeseries over a rolling window of N days. I know there is a rank function but this function ranks the data over the entire timeseries. I don't seem to be able to find a rolling rank function.
Here is an example of what I am trying to do:
A
01-01-2013 100
02-01-2013 85
03-01-2013 110
04-01-2013 60
05-01-2013 20
06-01-2013 40
If I wanted to rank the data over a rolling window of 3 days, the answer should be:
Ranked_A
01-01-2013 NaN
02-01-2013 NaN
03-01-2013 1
04-01-2013 3
05-01-2013 3
06-01-2013 2
Is there a built-in function in Python that can do this? Any suggestion?
Many thanks.
If you want to use the pandas built-in rank method (with its additional semantics, such as the ascending option), you can create a simple wrapper function for it
def rank(array):
    s = pd.Series(array)
    return s.rank(ascending=False).iloc[-1]
that can then be used as a custom rolling-window function. pd.rolling_apply has since been removed from pandas, so use the rolling accessor:
df['A'].rolling(3).apply(rank, raw=True)
which outputs
Date
01-01-2013 NaN
02-01-2013 NaN
03-01-2013 1
04-01-2013 3
05-01-2013 3
06-01-2013 2
(I'm assuming the df data structure from Rutger's answer)
You can write a custom function for a rolling window in pandas. Using NumPy's argsort() in that function gives you the rank within the window:
import pandas as pd
from io import StringIO
testdata = StringIO("""
Date,A
01-01-2013,100
02-01-2013,85
03-01-2013,110
04-01-2013,60
05-01-2013,20
06-01-2013,40""")
df = pd.read_csv(testdata, header=0, index_col='Date')
# Rank of the last value in the window, where 1 is the largest
rollrank = lambda data: data.size - data.argsort().argsort()[-1]
df['rank'] = df['A'].rolling(3).apply(rollrank, raw=True)
print(df)
results in:
A rank
Date
01-01-2013 100 NaN
02-01-2013 85 NaN
03-01-2013 110 1
04-01-2013 60 3
05-01-2013 20 3
06-01-2013 40 2
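Both answers can be cross-checked against each other on the question's data. The sketch below also calls Rolling.rank, which I believe is available from pandas 1.4 onward; treat that version requirement as an assumption and fall back to the apply-based variant on older installs. The names rank_of_last, Ranked_A and Ranked_builtin are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {"A": [100, 85, 110, 60, 20, 40]},
    index=["01-01-2013", "02-01-2013", "03-01-2013",
           "04-01-2013", "05-01-2013", "06-01-2013"],
)

def rank_of_last(window):
    # window is a NumPy array because raw=True is passed below
    return pd.Series(window).rank(ascending=False).iloc[-1]

# Custom function applied to each 3-row window
df["Ranked_A"] = df["A"].rolling(3).apply(rank_of_last, raw=True)

# Built-in rolling rank (assumed to require pandas 1.4 or later)
df["Ranked_builtin"] = df["A"].rolling(3).rank(ascending=False)
```

The built-in version avoids calling a Python function once per window, so it should be noticeably faster on long series.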