Y axis in panel - pandas

I have a DataFrame dft like this:
Date Apple Amazon Facebook US Bond
0 2018-01-02 NaN NaN NaN NaN
1 2018-01-03 NaN NaN NaN NaN
2 2018-01-04 NaN NaN NaN NaN
3 2018-01-05 NaN NaN NaN NaN
4 2018-01-08 NaN NaN NaN NaN
... ... ... ... ... ...
665 2020-08-24 0.708554 0.528557 0.152367 0.185932
666 2020-08-25 0.639243 0.534403 0.106550 0.133563
667 2020-08-26 0.520858 0.562482 0.018176 0.133283
668 2020-08-27 0.549531 0.593006 -0.011161 0.261187
669 2020-08-28 0.552725 0.595580 -0.038886 0.278847
Change the Date type (this snippet assumes `import datetime`, `import pandas as pd`, `import panel as pn`, and `import hvplot.pandas` have already been run):
dft["Date"] = pd.to_datetime(dft["Date"]).dt.date
idf = dft.interactive()
date_from = datetime.date(yearStart, 1, 1)
date_to = datetime.date(yearEnd, 8, 31)
date_slider = pn.widgets.DateSlider(name="date", start = date_from, end = date_to, steps=1, value=date_from)
date_slider
and I see a date slider. All good. More controls:
tickerNames = ['Apple', 'Amazon', 'Facebook', 'US Bond']
# Radio buttons for metric measures
yaxis = pn.widgets.RadioButtonGroup(
    name='Y axis',
    options=tickerNames,
    button_type='success'
)
pipeline = (
    idf[
        (idf.Date <= date_slider)
    ]
    .groupby(['Date'])[yaxis].mean()
    .to_frame()
    .reset_index()
    .sort_values(by='Date')
    .reset_index(drop=True)
)
If I now type
pipeline
I see a table with a date slider above it, where each symbol is its own "tab". If I click on a symbol and change the slider, I see more/less data. Again, all good. Here is where I get confused. I want to plot the values of the columns:
plot = pipeline.hvplot(x='Date', by='WHAT GOES IN HERE', y=yaxis, line_width=2, title="Prices")
NOTE: WHAT GOES IN HERE. I need the values in the `dft` dataframe above, but I can't hardwire the symbol, since it depends on what the user chooses in the `table`. I want an interactive chart, so that as I slide the date_slider, more and more of the data for each symbol gets plotted.
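One direction worth sketching (not from the question and untested against the data above; the Ticker and Value column names are made up here): reshape dft to long form so the symbol becomes an ordinary column, and pass that column name to by= instead of a widget.
# Hypothetical sketch: melt the wide frame so each row is (Date, Ticker, Value)
dft_long = dft.melt(id_vars=["Date"], value_vars=tickerNames,
                    var_name="Ticker", value_name="Value")
idf_long = dft_long.interactive()
pipeline_long = idf_long[idf_long.Date <= date_slider].sort_values(by="Date")
plot = pipeline_long.hvplot(x="Date", y="Value", by="Ticker",
                            line_width=2, title="Prices")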
If I do it the old-fashioned way (plain matplotlib, with `import matplotlib.pyplot as plt`):
fig = plt.figure(figsize=(15, 7))
ax1 = fig.add_subplot(1, 1, 1)
dft.plot(ax=ax1)
ax1.set_xlabel('Date')
ax1.set_ylabel('21days rolling daily change')
ax1.set_title('21days rolling daily change of financial assets')
plt.show()
It works as expected.

Related

How to get the groupby nth row directly in the row as an item?

I have minute-level Date, Time, Open, High, Low, Close data for a stock, arranged in ascending date order. I want to make a new column and, for every day (for each row), insert yesterday's price, i.e. the value in the second row of the previous date. So, for instance, I have put a price of 18812.3 against 11th Jan, since the previous date was 10th Jan and its second row has a price of 18812.3. Similarly, I have done the same for the day before yesterday. I tried using nth of a groupby object, but for that I have to create a groupby object first. The code below gives me a new DataFrame, but I would like to create a column directly holding the desired values.
test = bn_futures.groupby('Date')[['Open', 'High', 'Low', 'Close']].nth(1).reset_index()
Try: (check comments)
# Convert Date to datetime64 and set it as index
df = df.assign(Date=pd.to_datetime(df['Date'], dayfirst=True)).set_index('Date')
# Find second value for each day
prices = df.groupby(level=0)['Open'].nth(1).squeeze()
# Find last row for each day
mask = ~df.index.duplicated(keep='last')
# Create new columns
df.loc[mask, 'price at yesterday'] = prices.shift(1)
df.loc[mask, 'price 2d ago'] = prices.shift(2)
Output:
>>> df
Open price at yesterday price 2d ago
Date
2015-01-09 1 NaN NaN
2015-01-09 2 NaN NaN
2015-01-09 3 NaN NaN
2015-01-10 4 NaN NaN
2015-01-10 5 NaN NaN
2015-01-10 6 2.0 NaN
2015-01-11 7 NaN NaN
2015-01-11 8 NaN NaN
2015-01-11 9 5.0 2.0
Setup for an MRE (minimal reproducible example):
df = pd.DataFrame({'Date': ['09-01-2015', '09-01-2015', '09-01-2015',
                            '10-01-2015', '10-01-2015', '10-01-2015',
                            '11-01-2015', '11-01-2015', '11-01-2015'],
                   'Open': [1, 2, 3, 4, 5, 6, 7, 8, 9]})

pandas, fillna on multiindex columns

index_tuples = []
for distance in ["near", "far"]:
    for vehicle in ["bike", "car"]:
        index_tuples.append([distance, vehicle])
index = pd.MultiIndex.from_tuples(index_tuples, names=["distance", "vehicle"])
df = pd.DataFrame(index=["city"], columns = index)
d = {(x,y):my_home_city[x][y] for x in my_home_city for y in my_home_city[x]}
df.loc['my_home_city',:]=d
df
Out[994]:
distance near far
vehicle bike car bike car
city NaN NaN NaN NaN
my_home_city 1 0 0 1
I'd like to do df['near']['bike'].fillna(False, inplace=True)
It says:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
I think inplace is not good practice (check this and this), so assign back the selected column, selecting it by tuple:
df[('near', 'bike')] = df[('near', 'bike')].fillna(False)
print (df)
distance near far
vehicle bike car bike car
city False NaN NaN NaN
my_home_city 1 0.0 0.0 1.0
But if you really want inplace, your solution should be changed to select by tuple as well:
df[('near', 'bike')].fillna(False, inplace=True)
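Another option worth sketching (not from the original answer, and depending on your pandas version): fillna also accepts a dict keyed by column label, and for MultiIndex columns the label is the full tuple, so no chained indexing is needed.
# Sketch: fill only the ('near', 'bike') column, keyed by its tuple label
df = df.fillna({('near', 'bike'): False})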

DataFrame: Moving average with rolling, mean and shift while ignoring NaN

I have a data set, let's say, 420x1. Now I would like to calculate the moving average of the past 30 days, excluding the current date.
If I do the following:
df.rolling(window = 30).mean().shift(1)
my df results in a window with lots of NaNs, which is probably caused by NaNs scattered through the original dataframe (a single NaN within the 30 data points causes the MA to be NaN).
Is there a method that ignores NaN (avoiding the apply method; I run it on large data, so performance is key)? I do not want to replace the value with 0 because that could skew the results.
The same applies to the moving standard deviation.
For example, you can add min_periods, and the NaN is gone:
df=pd.DataFrame({'A':[1,2,3,np.nan,2,3,4,np.nan]})
df.A.rolling(window=2,min_periods=1).mean()
Out[7]:
0 1.0
1 1.5
2 2.5
3 3.0
4 2.0
5 2.5
6 3.5
7 4.0
Name: A, dtype: float64
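Tying this back to the original question (a sketch, not from the answer): the same min_periods idea combined with the shift from the question gives a trailing 30-row mean that excludes the current row and tolerates gaps.
# 30-row trailing mean that still yields a value when some of the points are NaN,
# shifted by one so the current row is excluded (as in the question)
ma = df.rolling(window=30, min_periods=1).mean().shift(1)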
Option 1 (drop the NaN rows first; the '30D' offset windows assume a datetime-like index)
df.dropna().rolling('30D').mean()
Option 2 (interpolate over the gaps using the index)
df.interpolate('index').rolling('30D').mean()
Option 2.5 (same interpolation, fixed 30-row window)
df.interpolate('index').rolling(30).mean()
Option 3 (here s is a Series, e.g. a single column of df; np.nanmean ignores the NaNs)
s.rolling('30D').apply(np.nanmean)
Option 3.5
df.rolling(30).apply(np.nanmean)
You can try dropna() to remove the NaN values, or fillna() to replace the NaN with a specific value.
Or you can filter out all NaN values with notnull() or isnull() within your operation.
df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f', 'h'], columns=['one', 'two', 'three'])
df2 = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
print(df2)
one two three
a 0.434024 -0.749472 -1.393307
b NaN NaN NaN
c 0.897861 0.032307 -0.602912
d NaN NaN NaN
e -1.056938 -0.129128 1.328862
f -0.581842 -0.682375 -0.409072
g NaN NaN NaN
h -1.772906 -1.342019 -0.948151
df3 = df2[df2['one'].notnull()]
# use ~isnull() would return the same result
# df3 = df2[~df2['one'].isnull()]
print(df3)
one two three
a 0.434024 -0.749472 -1.393307
c 0.897861 0.032307 -0.602912
e -1.056938 -0.129128 1.328862
f -0.581842 -0.682375 -0.409072
h -1.772906 -1.342019 -0.948151
For further reference, pandas has clear documentation about handling missing data (read this).

Selecting columns of a pandas dataframe based on criteria

I have a DF which contains the UK election results, with one column per party. So the DF is something like:
In[107]: Results.columns
Out[107]:
Index(['Press Association ID Number', 'Constituency Name', 'Region', 'Country',
'Constituency ID', 'Constituency Type', 'Election Year', 'Electorate',
' Total number of valid votes counted ', 'Unnamed: 9',
...
'Wessex Reg', 'Whig', 'Wigan', 'Worth', 'WP', 'WRP', 'WVPTFP', 'Yorks',
'Young', 'Zeb'],
dtype='object', length=147)
e.g.
Results.head(2)
Out[108]:
Press Association ID Number Constituency Name Region Country \
0 1 Aberavon Wales Wales
1 2 Aberconwy Wales Wales
Constituency ID Constituency Type Election Year Electorate \
0 W07000049 County 2015 49,821
1 W07000058 County 2015 45,525
Total number of valid votes counted Unnamed: 9 ... Wessex Reg Whig \
0 31,523 NaN ... NaN NaN
1 30,148 NaN ... NaN NaN
Wigan Worth WP WRP WVPTFP Yorks Young Zeb
0 NaN NaN NaN NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN NaN NaN NaN
[2 rows x 147 columns]
The columns containing the votes for the different parties are Results.ix[:, 'Unnamed: 9':]
Most of these parties poll very few votes in any constituency, and so I would like to exclude them. Is there a way (short of iterating through each row and column myself) of returning only those columns which meet a particular condition, for example having at least one value > 1000? I would ideally like to be able to specify something like
Results.ix[:, 'Unnamed: 9': > 1000]
You can do it this way:
In [94]: df
Out[94]:
a b c d e f g h
0 -1.450976 -1.361099 -0.411566 0.955718 99.882051 -1.166773 -0.468792 100.333169
1 0.049437 -0.169827 0.692466 -1.441196 0.446337 -2.134966 -0.407058 -0.251068
2 -0.084493 -2.145212 -0.634506 0.697951 101.279115 -0.442328 -0.470583 99.392245
3 -1.604788 -1.136284 -0.680803 -0.196149 2.224444 -0.117834 -0.299730 -0.098353
4 -0.751079 -0.732554 1.235118 -0.427149 99.899120 1.742388 -1.636730 99.822745
5 0.955484 -0.261814 -0.272451 1.039296 0.778508 -2.591915 -0.116368 -0.122376
6 0.395136 -1.155138 -0.065242 -0.519787 100.446026 1.584397 0.448349 99.831206
7 -0.691550 0.052180 0.827145 1.531527 -0.240848 1.832925 -0.801922 -0.298888
8 -0.673087 -0.791235 -1.475404 2.232781 101.521333 -0.424294 0.088186 99.553973
9 1.648968 -1.129342 -1.373288 -2.683352 0.598885 0.306705 -1.742007 -0.161067
In [95]: df[df.loc[:, 'e':].columns[(df.loc[:, 'e':] > 50).any()]]
Out[95]:
e h
0 99.882051 100.333169
1 0.446337 -0.251068
2 101.279115 99.392245
3 2.224444 -0.098353
4 99.899120 99.822745
5 0.778508 -0.122376
6 100.446026 99.831206
7 -0.240848 -0.298888
8 101.521333 99.553973
9 0.598885 -0.161067
Explanation:
In [96]: (df.loc[:, 'e':] > 50).any()
Out[96]:
e True
f False
g False
h True
dtype: bool
In [97]: df.loc[:, 'e':].columns
Out[97]: Index(['e', 'f', 'g', 'h'], dtype='object')
In [98]: df.loc[:, 'e':].columns[(df.loc[:, 'e':] > 50).any()]
Out[98]: Index(['e', 'h'], dtype='object')
Setup:
In [99]: df = pd.DataFrame(np.random.randn(10, 8), columns=list('abcdefgh'))
In [100]: df.loc[::2, list('eh')] += 100
UPDATE:
Starting from pandas 0.20.1, the .ix indexer is deprecated in favor of the stricter .iloc and .loc indexers.
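For the question's frame, the deprecated .ix slice can be rewritten with .loc; a sketch using the 1000-vote threshold from the question (vote_cols is just a hypothetical intermediate name):
# Label-based slice of the party-vote columns, then keep only those columns
# where at least one constituency has more than 1000 votes
vote_cols = Results.loc[:, 'Unnamed: 9':]
Results[vote_cols.columns[(vote_cols > 1000).any()]]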

Can you prevent automatic alphabetical order of df.append()?

I am trying to append data to a log where the order of columns isn't alphabetical but makes logical sense, e.g.
Org_Goals_1 Calc_Goals_1 Diff_Goals_1 Org_Goals_2 Calc_Goals_2 Diff_Goals_2
I am running through several calculations based on different variables and logging the results by appending a dictionary of the values after each run. Is there a way to prevent the df.append() function from ordering the columns alphabetically?
Seems you have to reorder the columns after the append operation:
In [25]:
# assign the appended dfs to merged
merged = df1.append(df2)
# create a list of the columns in the order you desire
cols = list(df1) + list(df2)
# assign directly
merged.columns = cols
# column order is now as desired
merged.columns
Out[25]:
Index(['Org_Goals_1', 'Calc_Goals_1', 'Diff_Goals_1', 'Org_Goals_2', 'Calc_Goals_2', 'Diff_Goals_2'], dtype='object')
example:
In [26]:
df1 = pd.DataFrame(columns=['Org_Goals_1','Calc_Goals_1','Diff_Goals_1'], data = randn(5,3))
df2 = pd.DataFrame(columns=['Org_Goals_2','Calc_Goals_2','Diff_Goals_2'], data=randn(5,3))
merged = df1.append(df2)
cols = list(df1) + list(df2)
merged.columns = cols
merged
Out[26]:
Org_Goals_1 Calc_Goals_1 Diff_Goals_1 Org_Goals_2 Calc_Goals_2 \
0 0.028935 NaN -0.687143 NaN 1.528579
1 0.943432 NaN -2.055357 NaN -0.720132
2 0.035234 NaN 0.020756 NaN 1.556319
3 1.447863 NaN 0.847496 NaN -1.458852
4 0.132337 NaN -0.255578 NaN -0.222660
0 NaN 0.131085 NaN 0.850022 NaN
1 NaN -1.942110 NaN 0.672965 NaN
2 NaN 0.944052 NaN 1.274509 NaN
3 NaN -1.796448 NaN 0.130338 NaN
4 NaN 0.961545 NaN -0.741825 NaN
Diff_Goals_2
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
0 0.727619
1 0.022209
2 -0.350757
3 1.116637
4 1.947526
The same alphabetical sorting of the columns happens with concat too, so it looks like you have to reorder after appending.
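For reference, the same reorder-after-the-fact pattern works with concat as well (a sketch, selecting the columns back in the desired order rather than relabelling them):
# Concatenate, then select the columns in the original, logical order
merged = pd.concat([df1, df2])[list(df1) + list(df2)]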
EDIT
An alternative is to use join:
In [32]:
df1.join(df2)
Out[32]:
Org_Goals_1 Calc_Goals_1 Diff_Goals_1 Org_Goals_2 Calc_Goals_2 \
0 0.163745 1.608398 0.876040 0.651063 0.371263
1 -1.762973 -0.471050 -0.206376 1.323191 0.623045
2 0.166269 1.021835 -0.119982 1.005159 -0.831738
3 -0.400197 0.567782 -1.581803 0.417112 0.188023
4 -1.443269 -0.001080 0.804195 0.480510 -0.660761
Diff_Goals_2
0 -2.723280
1 2.463258
2 0.147251
3 2.328377
4 -0.248114
Actually, I found "advanced indexing" to work quite well:
df2 = df.ix[:, 'order of columns']
(In newer pandas, .loc with the same list of column labels does the job, since .ix is deprecated.)
As I see it, the order is lost, but when appending, the original data should have the correct order. To maintain that, assuming a DataFrame 'alldata' and a DataFrame of data to be appended 'newdata', appending while keeping the column order as in 'alldata' would be:
alldata.append(newdata)[list(alldata)]
(I encountered this problem with named date fields, where 'Month' would be sorted between 'Minute' and 'Second')