Slice pandas' MultiIndex DataFrame - pandas

To keep track of all simulation results in a parametric run, I create a MultiIndex DataFrame named dfParRun in pandas as follows:
import pandas as pd
import numpy as np
import itertools
limOpt = [0.1,1,10]
reimbOpt = ['Cash','Time']
xOpt = [0.1, .02, .03, .04, .05, .06, .07, .08]
zOpt = [1, 5, 10]
arrays = [limOpt, reimbOpt, xOpt, zOpt]
parameters = list(itertools.product(*arrays))
nPar = len(parameters)
variables = ['X', 'Y', 'Z']
nVar = len(variables)
index = pd.MultiIndex.from_tuples(parameters, names=['lim', 'reimb', 'xMax', 'zMax'])
dfParRun = pd.DataFrame(np.random.rand(nPar, nVar), index=index, columns=variables)
To analyse my parametric run, I want to slice this DataFrame, but this seems to be a burden. For example, I want all results for xMax above 0.5 and lim equal to 10. At the moment, the only working method I have found is:
df = dfParRun.reset_index()
df.loc[(df.xMax>0.5) & (df.lim==10)]
and I wonder whether there is a method that does not require resetting the index of the DataFrame?

option 1
use pd.IndexSlice
caveat: requires sort_index
dfParRun.sort_index().loc[pd.IndexSlice[10, :, .0500001:, :]]
option 2
use your df after having reset_index
df.query('xMax > 0.05 & lim == 10')
setup
import pandas as pd
import numpy as np
import itertools
limOpt = [0.1,1,10]
reimbOpt = ['Cash','Time']
xOpt = [0.1, .02, .03, .04, .05, .06, .07, .08]
zOpt = [1, 5, 10]
arrays = [limOpt, reimbOpt, xOpt, zOpt]
parameters = list(itertools.product(*arrays))
nPar = len(parameters)
variables = ['X', 'Y', 'Z']
nVar = len(variables)
index = pd.MultiIndex.from_tuples(parameters, names=['lim', 'reimb', 'xMax', 'zMax'])
dfParRun = pd.DataFrame(np.random.rand(*(nPar, nVar)), index=index, columns=variables)
df = dfParRun.reset_index()
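For what it is worth, the index does not have to be reset for either a query or a boolean mask. The following is a minimal sketch, assuming a reasonably recent pandas in which `query` resolves named index levels the same way it resolves column names. The names `by_query` and `by_mask` are only illustrative, and the 0.05 threshold follows the options above rather than the 0.5 in the question.

```python
import itertools

import numpy as np
import pandas as pd

# Rebuild the example frame from the setup above.
limOpt = [0.1, 1, 10]
reimbOpt = ['Cash', 'Time']
xOpt = [0.1, .02, .03, .04, .05, .06, .07, .08]
zOpt = [1, 5, 10]
parameters = list(itertools.product(limOpt, reimbOpt, xOpt, zOpt))
index = pd.MultiIndex.from_tuples(parameters, names=['lim', 'reimb', 'xMax', 'zMax'])
dfParRun = pd.DataFrame(np.random.rand(len(parameters), 3),
                        index=index, columns=['X', 'Y', 'Z'])

# query() can refer to named index levels directly, so no reset_index is needed.
by_query = dfParRun.query('xMax > 0.05 and lim == 10')

# Equivalent boolean mask built from the level values.
x_max = dfParRun.index.get_level_values('xMax')
lim = dfParRun.index.get_level_values('lim')
by_mask = dfParRun[(x_max > 0.05) & (lim == 10)]
```

Both results keep the original MultiIndex, and neither approach needs the index to be sorted first.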

Related

Stacked Bar Graph with Errorbars in Pandas / Matplotlib

I want to show my data in two (or more) stacked bar graphs including error bars. My code is based on a working example, but uses DataFrames as input instead of arrays.
I tried to convert the DataFrame output to an array, but this does not work:
from uncertain_panda import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
raw_data = {'': ['Error', 'Value'],'Stars': [3, 18],'Cats': [2,15],'Planets': [1,12],'Dogs': [2,16]}
df = pd.DataFrame(raw_data)
df.set_index('', inplace=True)
print(df)
N = 2
ind = np.arange(N)
width = 0.35
first_Value = df.loc[['Value'],['Cats','Dogs']]
second_Value = df.loc[['Value'],['Stars','Planets']]
first_Error = df.loc[['Error'],['Cats','Dogs']]
second_Error = df.loc[['Error'],['Stars','Planets']]
p1 = plt.bar(ind, first_Value, width, yerr=first_Error)
p2 = plt.bar(ind, second_Value, width, yerr=second_Error, bottom=first_Value)
plt.xticks(ind, ('Pets', 'Universe'))
plt.legend((p1[0], p2[0]), ('Cats', 'Dogs', 'Stars', 'Planets'))
plt.show()
I expect an output like this:
https://matplotlib.org/3.1.0/gallery/lines_bars_and_markers/bar_stacked.html#sphx-glr-gallery-lines-bars-and-markers-bar-stacked-py
Instead I get this error:
TypeError: only size-1 arrays can be converted to Python scalars
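One likely cause is that `df.loc[['Value'], [...]]` with a list as the row label returns a 1x2 DataFrame, while `plt.bar` expects a flat sequence of heights. Below is a minimal sketch of one way around it, assuming plain pandas rather than uncertain_panda and a non-interactive matplotlib backend. Selecting the row with a scalar label gives a Series, which `to_numpy()` turns into a 1-D array. The combined legend labels are my own choice.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

raw_data = {'': ['Error', 'Value'], 'Stars': [3, 18], 'Cats': [2, 15],
            'Planets': [1, 12], 'Dogs': [2, 16]}
df = pd.DataFrame(raw_data).set_index('')

ind = np.arange(2)
width = 0.35

# A scalar row label ('Value', not ['Value']) returns a Series; to_numpy() flattens it.
first_value = df.loc['Value', ['Cats', 'Dogs']].to_numpy(dtype=float)
second_value = df.loc['Value', ['Stars', 'Planets']].to_numpy(dtype=float)
first_error = df.loc['Error', ['Cats', 'Dogs']].to_numpy(dtype=float)
second_error = df.loc['Error', ['Stars', 'Planets']].to_numpy(dtype=float)

fig, ax = plt.subplots()
p1 = ax.bar(ind, first_value, width, yerr=first_error)
# bottom= stacks the second series on top of the first.
p2 = ax.bar(ind, second_value, width, yerr=second_error, bottom=first_value)
ax.set_xticks(ind)
ax.set_xticklabels(('Pets', 'Universe'))
ax.legend((p1[0], p2[0]), ('Cats / Dogs', 'Stars / Planets'))
```

In an interactive session, `plt.show()` would display the figure. Note that the legend can only carry one label per bar container, so four separate labels would need four separate `bar` calls.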

Value from iterative function in pandas

I have a dataframe and would like to have the values in one column being set through an iterative function as below.
import pandas as pd
import numpy as np
d = {'col1': [0.4444, 25.4615],
'col2': [0.5, 0.7],
'col3': [7, 7]}
df = pd.DataFrame(data=d)
df['col4'] = df['col1'] * df['col3']/4
def func(df):
    a = np.exp(-df['col4'])
    n = 1
    while df['col2'] < a:
        a = a + df['col4'] * 4 / n
        n += 1
    return n
df['col5'] = func(df)
I get an error message "ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()." How can I run the function per row to solve the series/ambiguity problem?
EDIT: Added expected output.
out = {'col1': [0.4444, 25.4615],
'col2': [0.5, 0.7],
'col3': [7, 7],
'col4': [0.7777, 44.557625],
'col5': [0, 49]}
dfout = pd.DataFrame(out)
I am not sure what the values in col4 and col5 will be but according to the calculation I am trying to replicate those will be the values.
EDIT2: I had missed n+=1 in the while loop. added it now.
EDIT3: I am trying to apply
f(0) = e^-col4
f(n) = col4 * f(n-1) / n for n > 0
until f > col2 and then return the value of n per row.
Using the information you provided, this seems to be the solution:
import pandas as pd
import numpy as np
d = {'col1': [0.4444, 25.4615],
'col2': [0.5, 0.7],
'col3': [7, 7]}
df = pd.DataFrame(data=d)
df['col4'] = df['col1'] * df['col3']/4
def func(df):
    n = 1
    return n
df['col5'] = func(df)
For what it is worth, here is an inefficient solution: after each iteration, keep track of which coefficient starts satisfying the condition.
import pandas as pd
import numpy as np
d = {'col1': [0.4444, 25.4615],
'col2': [0.5, 0.7],
'col3': [7, 7]}
df = pd.DataFrame(data=d)
df['col4'] = df['col1'] * df['col3']/4
def func(df):
    a = np.exp(-df['col4'])
    n = 1
    ns = [None] * len(df['col2'])
    status = a > df['col2']
    for i in range(len(status)):
        if ns[i] is None and status[i]:
            ns[i] = n
    # stops when all coefficients satisfy the condition
    while not status.all():
        a = a * df['col4'] * n
        status = a > df['col2']
        n += 1
        for i in range(len(status)):
            if ns[i] is None and status[i]:
                ns[i] = n
    return ns
df['col5'] = func(df)
print(df['col5'])
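To address the "per row" part of the question directly, here is a minimal sketch using `DataFrame.apply` with `axis=1`. Inside the function every value is a scalar, so the comparison in the `while` condition no longer produces an ambiguous Series. The loop body is copied from the question as posted, not from the recursion in EDIT3. `MAX_ITER` and `func_row` are hypothetical names; the cap is there because `a` only grows once the loop is entered, so the loop would not otherwise terminate.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [0.4444, 25.4615],
                   'col2': [0.5, 0.7],
                   'col3': [7, 7]})
df['col4'] = df['col1'] * df['col3'] / 4

MAX_ITER = 10_000  # hypothetical safety cap, not part of the original question


def func_row(row):
    """Run the question's loop on a single row, where every value is a scalar."""
    a = np.exp(-row['col4'])
    n = 1
    while row['col2'] < a and n < MAX_ITER:
        a = a + row['col4'] * 4 / n
        n += 1
    return n


# axis=1 passes one row at a time to func_row.
df['col5'] = df.apply(func_row, axis=1)
```

With the sample data, the loop condition is already false on entry for both rows, so col5 is 1 in each case. The update rule would need to change to match the intended recursion, but the row-wise pattern stays the same.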

Count of Kernel Density Estimation (KDE)

I have some data (A,B) and have used seaborn to make a contour plot of it.
import pandas as pd
import seaborn as sns
# Dataframe 1
df_1 = pd.DataFrame({'A':[1,2,1,2,3,4,2,1,4], 'B': [2,1,2,1,2,3,4,2,1]})
# Plot A v B
ax = sns.kdeplot(x=df_1["A"], y=df_1["B"])
I would like to get the cumulative count please (C). I’d like to make a new plot with C on the Y axis, A on the X axis and contours of B. I think that if I could start off by making a new dataframe of A,B,H where H was the count (the height of the volcano) then that might be a start. The resulting plot might look a bit like this:
I think I've worked it out but this solution is messy:
import pandas as pd
import numpy as np
from scipy import stats
from itertools import chain
import matplotlib.pyplot as plt
Fruit = 9 # How many were there?
# Dataframe 1
df_1 = pd.DataFrame({'A':[1,2,1,2,3,4,2,1,4], 'B': [2,1,2,1,2,3,4,2,1]})
m1 = df_1["A"]
m2 = df_1["B"]
xmin = 0
xmax = 5
ymin = 0
ymax = 5
# Kernel density estimate:
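# A step of 1.0 gives grid points 0, 1, 2, 3, 4, which the B == 0..4 filters below rely on
# (5j would give 0, 1.25, 2.5, 3.75, 5 instead)
X, Y = np.mgrid[xmin:xmax:1.0, ymin:ymax:1.0]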
positions = np.vstack([X.ravel(), Y.ravel()])
values = np.vstack([m1, m2])
kernel = stats.gaussian_kde(values)
H = np.reshape(kernel(positions).T, X.shape)
# Re-jig it
X = X.reshape((25, 1))
Y = Y.reshape((25, 1))
H = H.reshape((25, 1))
X_L = list(chain.from_iterable(X))
Y_L = list(chain.from_iterable(Y))
H_L = list(chain.from_iterable(H))
df_2 = pd.DataFrame({'A': X_L, 'B': Y_L, 'H': H_L})
# Find the cumulative count C
df_2 = df_2.sort_values('B')
C = np.cumsum(H)
C = C.reshape((25, 1))
C_L = list(chain.from_iterable(C))
df_2['C'] = pd.DataFrame(C_L, index=df_2.index)
# Scale C
Max_C = np.amax(C)
df_2.loc[:,'C'] *= Fruit / Max_C
# Break it down to constant B
df_2_B_0 = df_2[df_2['B'] == 0]
df_2_B_1 = df_2[df_2['B'] == 1]
df_2_B_2 = df_2[df_2['B'] == 2]
df_2_B_3 = df_2[df_2['B'] == 3]
df_2_B_4 = df_2[df_2['B'] == 4]
# Plot A v C
ax = df_2_B_0.plot('A','C', label='0')
df_2_B_1.plot('A','C',ax=ax, label='1')
df_2_B_2.plot('A','C',ax=ax, label='2')
df_2_B_3.plot('A','C',ax=ax, label='3')
df_2_B_4.plot('A','C',ax=ax, label='4')
plt.ylabel('C')
plt.legend(title='B')
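A somewhat tidier way to get the A, B, H frame and a cumulative column is sketched below. It assumes that "cumulative count" means the KDE height accumulated along A separately for each B, scaled so that the grid total equals the number of samples; that interpretation is mine and not stated in the question. The grid of integer points 0..4 and the column name `H_scaled` are likewise assumptions.

```python
import numpy as np
import pandas as pd
from scipy import stats

df_1 = pd.DataFrame({'A': [1, 2, 1, 2, 3, 4, 2, 1, 4],
                     'B': [2, 1, 2, 1, 2, 3, 4, 2, 1]})

# gaussian_kde expects an array of shape (n_dims, n_samples).
kernel = stats.gaussian_kde(df_1[['A', 'B']].to_numpy().T)

# Evaluate on integer grid points 0..4 in both directions (an assumed grid).
grid_a, grid_b = np.meshgrid(np.arange(5.0), np.arange(5.0), indexing='ij')
grid = pd.DataFrame({'A': grid_a.ravel(), 'B': grid_b.ravel()})
grid['H'] = kernel(grid[['A', 'B']].to_numpy().T)

# Scale H so the grid sums to the sample count, then accumulate along A within each B.
grid['H_scaled'] = grid['H'] * len(df_1) / grid['H'].sum()
grid = grid.sort_values(['B', 'A'])
grid['C'] = grid.groupby('B')['H_scaled'].cumsum()
```

From here, one line per B value can be drawn by looping over `grid.groupby('B')` and plotting A against C, which replaces the five hand-written subsets.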

set new index for pandas DataFrame (interpolating?)

I have a DataFrame where the index is NOT time. I need to re-scale all of the values from an old index which is not equi-spaced, to a new index which has different limits and is equi-spaced.
The first and last values in the columns should stay as they are (although they will have the new, stretched index values assigned to them).
Example code is:
import numpy as np
import pandas as pd
%matplotlib inline
index = np.asarray((2, 2.5, 3, 6, 7, 12, 15, 18, 20, 27))
x = np.sin(index / 10)
df = pd.DataFrame(x, index=index)
df.plot();
newindex = np.linspace(0, 29, 100)
How do I create a DataFrame where the index is newindex and the new x values are interpolated from the old x values?
The first new x value should be the same as the first old x value. Ditto for the last x value. That is, there should not be NaNs at the beginning and copies of the last old x repeated at the end.
The others should be interpolated to fit the new equi-spaced index.
I tried df.interpolate() but couldn't work out how to interpolate against the newindex.
Thanks in advance for any help.
This works well:
import numpy as np
import pandas as pd
def interp(df, new_index):
    """Return a new DataFrame with all columns values interpolated
    to the new_index values."""
    df_out = pd.DataFrame(index=new_index)
    df_out.index.name = df.index.name
    # DataFrame.items() replaces iteritems(), which was removed in pandas 2.0
    for colname, col in df.items():
        df_out[colname] = np.interp(new_index, df.index, col)
    return df_out
I have adopted the following solution:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
def reindex_and_interpolate(df, new_index):
    # Index.union replaces the `|` operator, which no longer performs a set union on indexes
    return df.reindex(df.index.union(new_index)).interpolate(method='index', limit_direction='both').loc[new_index]
index = np.asarray((2, 2.5, 3, 6, 7, 12, 15, 18, 20, 27))
x = np.sin(index / 10)
df = pd.DataFrame(x, index=index)
newindex = pd.Index(np.linspace(min(index)-5, max(index)+5, 50), dtype=float)
df_reindexed = reindex_and_interpolate(df, newindex)
plt.figure()
plt.scatter(df.index, df.values, color='red', alpha=0.5)
plt.scatter(df_reindexed.index, df_reindexed.values, color='green', alpha=0.5)
plt.show()
I wonder if you're up against one of pandas' limitations; it seems like you have limited choices for aligning your df to an arbitrary set of numbers (your newindex).
For example, your stated newindex only overlaps with the first and last numbers in index, so linear interpolation (rightly) interpolates a straight line between the start (2) and end (27) of your index.
import numpy as np
import pandas as pd
%matplotlib inline
index = np.asarray((2, 2.5, 3, 6, 7, 12, 15, 18, 20, 27))
x = np.sin(index / 10)
df = pd.DataFrame(x, index=index)
newindex = np.linspace(min(index), max(index), 100)
df_reindexed = df.reindex(index = newindex)
df_reindexed.interpolate(method = 'linear', inplace = True)
df.plot()
df_reindexed.plot()
If you change newindex to provide more overlapping points with your original data set, interpolation works in a more expected manner:
newindex = np.linspace(min(index), max(index), 26)
df_reindexed = df.reindex(index = newindex)
df_reindexed.interpolate(method = 'linear', inplace = True)
df.plot()
df_reindexed.plot()
There are other methods that do not require one to manually align the indices, but the resulting curve (while technically correct) is probably not what one wants:
newindex = np.linspace(min(index), max(index), 1000)
df_reindexed = df.reindex(index = newindex, method = 'ffill')
df.plot()
df_reindexed.plot()
I looked at the pandas docs but I couldn't identify an easy solution.
https://pandas.pydata.org/pandas-docs/stable/basics.html#basics-reindexing
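None of the snippets above stretches the old index onto the new limits, which the question asks for (the first and last values should land on the first and last new index values). Here is a minimal sketch of that step, assuming both indexes are sorted in ascending order. It uses `np.interp` twice: once to rescale the old index linearly onto the new span, and once per column to interpolate the values. The function name `stretch_and_interpolate` and the column name `x` are hypothetical.

```python
import numpy as np
import pandas as pd

index = np.asarray((2, 2.5, 3, 6, 7, 12, 15, 18, 20, 27))
df = pd.DataFrame({'x': np.sin(index / 10)}, index=index)
newindex = np.linspace(0, 29, 100)


def stretch_and_interpolate(df, new_index):
    """Map the old index span linearly onto the new span, then interpolate each column."""
    old = df.index.to_numpy(dtype=float)
    # Linear rescale: old[0] -> new_index[0] and old[-1] -> new_index[-1].
    stretched = np.interp(old, [old[0], old[-1]], [new_index[0], new_index[-1]])
    data = {col: np.interp(new_index, stretched, df[col].to_numpy())
            for col in df.columns}
    return pd.DataFrame(data, index=new_index)


df_new = stretch_and_interpolate(df, newindex)
```

Because the stretched old index spans exactly the same range as the new index, there are no NaNs at the start and no repeated values at the end.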

Using plot_date change node icon type

When using plot_date, how do you change some of the nodes in the set from a circle to an X?
For example all nodes are circles except the 3, 8, and 19 node, which are all Xs.
I have used a sample dataset, since you didn't provide one.
import pandas as pd
import matplotlib.pyplot as plt
data = {'2014-11-15':1, '2014-11-16':2, '2014-11-17':3, '2014-11-18':5, '2014-11-19':8, '2014-11-20': 19}
df = pd.DataFrame(list(data.items()), columns=['Date', 'val'])
df = df.set_index(pd.to_datetime(df.Date, format='%Y-%m-%d'))
o_list = []
x_list = []
check_list = [3,8,19]
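for index in df.index:
    # values in check_list get the X marker; everything else stays a circle
    if df.val[index] in check_list:
        x_list.append(index)
    else:
        o_list.append(index)
# .loc replaces .ix, which has been removed from pandas
df_o = df.loc[o_list]
df_x = df.loc[x_list]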
fig = plt.figure()
plt.plot_date(df_o.index, df_o.val, 'bo')
plt.plot_date(df_x.index, df_x.val, 'bx')
plt.show()
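The same split can be written without an explicit loop by using a boolean mask. This is a minimal sketch, assuming a Series indexed by date and a non-interactive matplotlib backend. It uses `ax.plot` with a marker-only format string in place of `plot_date`, since recent matplotlib releases deprecate `plot_date`. The names `is_x`, `circles` and `crosses` are illustrative.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import pandas as pd

data = {'2014-11-15': 1, '2014-11-16': 2, '2014-11-17': 3,
        '2014-11-18': 5, '2014-11-19': 8, '2014-11-20': 19}
s = pd.Series(data)
s.index = pd.to_datetime(s.index, format='%Y-%m-%d')

check_list = [3, 8, 19]
is_x = s.isin(check_list)  # True where the value should be drawn as an X

circles = s[~is_x]
crosses = s[is_x]

fig, ax = plt.subplots()
ax.plot(circles.index, circles.to_numpy(), 'bo')  # blue circles, no connecting line
ax.plot(crosses.index, crosses.to_numpy(), 'bx')  # blue Xs, no connecting line
fig.autofmt_xdate()
```

In an interactive session, `plt.show()` would display the figure.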