I'm working with two blocks: one uses pandas to import data from a .csv file, and the other uses that information to construct variable values.
The first block works fine; I'm able to construct a table with indexed values (in this case, energy tariff prices):
import pandas as pd

df_1 = pd.read_csv('time_tff_data.csv', sep=';', usecols=['time', 'tff'], index_col='time', header=0)
data_Tariff = {
    'tff': {'time': df_1['tff'].to_dict()}  # Set of tariff prices
}
data = {None: dict(tariff=data_Tariff)}
The problem is in the other block, the one where I need to use the data: I'm not able to initialize a parameter with the data from the dictionary. Although I'm using Pyomo (for optimization), my question isn't about Pyomo itself, but about how to initialize a Parameter with the data stored in a dictionary (self.tff):
from pyomo.environ import *
from data import data_Tariff
from pyomo.environ import SimpleBlock

class tariff(SimpleBlock):
    def __init__(self, *args, **kwds):
        super().__init__(*args, **kwds)
        self.time = Set()
        self.Tmax = Param(self.time, doc='Maximum tariff price', default=1.06, multable=True)
        self.Tmin = Param(self.time, doc='Minimum tariff price', default=0.39, multable=True)
        self.tff = Param(self.time, doc='Set of tariff prices', default=data_Tariff['tff'], mutable=True)
        self.Tc = Var(self.time, doc='Tariff priority index', initialize=0)

        def _Tc(m, t):
            if tff is not None:
                return (m.Tc[t] == (m.Tmax-m.tff[t])/(m.Tmax-m.Tmin) for t in m.time)
            return Constraint.Skip
        self.Tc = Constraint(self.time, rule=_Tc, doc='Tariff priority index')
My question is: how do I import the tariff data "tff[t]" from the data block, given that the set is indexed by time [t]?
Couple quick observations...
First, you should be using the keyword initialize, not default, to initialize from a collection. Also, I can't see why you would make this mutable, so you might remove that. Try:
self.tff = Param(self.time, doc='Set of tariff prices', initialize=data_Tariff['tff'])
This assumes that data_Tariff['tff'] is a properly constructed dictionary that is indexed by self.time.
Backing up, I see that you also need to initialize self.time:
self.time = Set(initialize=data_Tariff['tff'].keys())
Your constraint also appears incorrect. The for t in m.time part is taken care of when you call the rule with a set: it will make a constraint for each value of t. And the check for tff is probably unnecessary, right? If it is necessary, you need to reference it as self.tff. So:
def _Tc(m, t):
    return m.Tc[t] == (m.Tmax - m.tff[t]) / (m.Tmax - m.Tmin)

self.Tc = Constraint(self.time, rule=_Tc, doc='Tariff priority index')
Also, your Tmax and Tmin appear to be just constants (not indexed). If that is the case, you can simplify a little by treating them as regular Python variables and taking them out of the model declaration, if desired.
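Putting those pieces together, here is a minimal sketch of how the parameter could be initialized from a flat {time: price} dictionary. This assumes the data block exposes such a flat dict (e.g. tff_data = df_1['tff'].to_dict()); the function name below is illustrative, not part of the original code:

# Minimal sketch, assuming tff_data is a flat {time: price} dict
# built in the data block, e.g. tff_data = df_1['tff'].to_dict()
from pyomo.environ import ConcreteModel, Set, Param, Var, Constraint

def build_tariff_block(tff_data, Tmax=1.06, Tmin=0.39):
    m = ConcreteModel()
    m.time = Set(initialize=tff_data.keys())                 # index set from the dict keys
    m.tff = Param(m.time, initialize=tff_data,
                  doc='Set of tariff prices')                # one value per time index
    m.Tc = Var(m.time, doc='Tariff priority index', initialize=0)

    def _Tc(m, t):
        # Tmax/Tmin kept as plain Python constants, as suggested above
        return m.Tc[t] == (Tmax - m.tff[t]) / (Tmax - Tmin)
    m.Tc_rule = Constraint(m.time, rule=_Tc, doc='Tariff priority index')
    return m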
I want to save the model comparison data frame from compare_models() in pycaret.
# load dataset
from pycaret.datasets import get_data
diabetes = get_data('diabetes')
# init setup
from pycaret.classification import *
clf1 = setup(data = diabetes, target = 'Class variable')
# compare models
best = compare_models()
i.e. the data frame with the comparison grid that compare_models() displays.
Does anyone know how to do that?
The solution is:
df = pull()
by Goosang Yu from the pycaret slack community.
compare_models() itself returns the best trained model (not a dataframe), while the comparison grid it displays is a pandas DataFrame that you can retrieve with pull(). Hence, you only need to save that DataFrame, which can for example be achieved with df.to_csv(path). If you want to save it in a different format (pickle, xml, ...), you can refer to the pandas I/O documentation.
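A minimal end-to-end sketch (the output file name is just an example, and pull() requires a reasonably recent pycaret version):

from pycaret.datasets import get_data
from pycaret.classification import setup, compare_models, pull

diabetes = get_data('diabetes')
clf1 = setup(data=diabetes, target='Class variable')

best = compare_models()       # best model object
df = pull()                   # DataFrame with the comparison grid shown by compare_models()
df.to_csv('comparison.csv')   # save it; any pandas writer works here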
I'm trying to highlight specific columns in my dataframe using guideline from this post, https://stackoverflow.com/a/41655055/5158984.
My question is about the use of the subset argument. My guess is that it's part of the **kwargs argument. However, the official pandas documentation, https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.applymap.html, only explains it vaguely.
So, in general, how can I know which keywords I can use whenever I see **kwargs?
Thanks!
It seems that you are confusing pandas.DataFrame.applymap and df.style.applymap (where df is an instance of pd.DataFrame), for which subset is a named parameter in its own right and is not part of **kwargs.
Here is one way to find out (in your terminal or a Jupyter notebook cell) what the named parameters of this method are (or of any other pandas method, for that matter):
import pandas as pd
df = pd.DataFrame()
help(df.style.applymap)
# Output
Help on method applymap in module pandas.io.formats.style:
applymap(func: 'Callable', subset: 'Subset | None' = None, **kwargs)
-> 'Styler' method of pandas.io.formats.style.Styler instance
Apply a CSS-styling function elementwise.
Updates the HTML representation with the result.
Parameters
----------
func : function
``func`` should take a scalar and return a string.
subset : label, array-like, IndexSlice, optional
A valid 2d input to `DataFrame.loc[<subset>]`, or, in the case of a 1d input
or single key, to `DataFrame.loc[:, <subset>]` where the columns are
prioritised, to limit ``data`` to *before* applying the function.
**kwargs : dict
Pass along to ``func``.
...
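To address the original goal of highlighting specific columns, subset is precisely the keyword to use. Here is a small, self-contained sketch (the column names and CSS values are made up for illustration):

import pandas as pd

df = pd.DataFrame({'A': [1, -2, 3], 'B': [4, 5, -6], 'C': [7, 8, 9]})

def highlight_negative(value):
    # return a CSS string for each cell
    return 'background-color: yellow' if value < 0 else ''

# apply the styling function only to columns A and B
styled = df.style.applymap(highlight_negative, subset=['A', 'B'])
styled.to_html('styled.html')  # pandas >= 1.3; or display `styled` directly in a notebook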
I was going through the source code of Koalas, trying to get a handle on how they actually achieve plotting large datasets. It turns out that they use either sampling or TopN - selecting a given number of records.
I understand the meaning of sampling, and internally it uses spark.DataFrame.sample to do it. For TopN, however, they simply take the first max_rows records from the Koalas DataFrame using data = data.head(max_rows + 1).to_pandas().
This seems strange, and I wonder whether selecting the data this way correctly reflects the statistical properties of the dataset.
Koalas DataFrame's plot accessor:
class KoalasPlotAccessor(PandasObject):
    pandas_plot_data_map = {
        "pie": TopNPlotBase().get_top_n,
        "bar": TopNPlotBase().get_top_n,
        "barh": TopNPlotBase().get_top_n,
        "scatter": SampledPlotBase().get_sampled,
        "area": SampledPlotBase().get_sampled,
        "line": SampledPlotBase().get_sampled,
    }
    _backends = {}  # type: ignore
    ...

class TopNPlotBase:
    def get_top_n(self, data):
        from databricks.koalas import DataFrame, Series

        max_rows = get_option("plotting.max_rows")
        # Simply use the first 1k elements and make it into a pandas dataframe
        # For categorical variables, it is likely called from df.x.value_counts().plot.xxx().
        if isinstance(data, (Series, DataFrame)):
            data = data.head(max_rows + 1).to_pandas()
        ...
I am trying to figure out how to use self with a pandas UDF in GroupBy.apply inside a class method in Python, and also how to pass arguments to it. I have tried a lot of different ways but couldn't make it work. I also searched the internet extensively for an example of a pandas UDF used inside a class with self and arguments, but could not find anything like that. I know how to do all of the aforementioned things with pandas GroupBy.apply.
The only way I could make it work was by declaring it as a static method:
import pickle

import pandas as pd
from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import StructType, StructField, FloatType, BinaryType
from xgboost import XGBRegressor

class Train:
    return_type = StructType([
        StructField("div_nbr", FloatType()),
        StructField("store_nbr", FloatType()),
        StructField("model_str", BinaryType())
    ])
    function_type = PandasUDFType.GROUPED_MAP

    def __init__(self):
        ............

    def run_train(self):
        # sp_df is a Spark DataFrame defined elsewhere
        output = sp_df.groupby(['A', 'B']).apply(self.model_train)
        output.show(10)

    @staticmethod
    @pandas_udf(return_type, function_type)
    def model_train(pd_df):
        features_name = ['days_into_year', 'months_into_year', 'minutes_into_day', 'hour_of_day', 'recency']
        X = pd_df[features_name].copy()
        Y = pd.DataFrame(pd_df['trans_type_value']).copy()
        estimator_1 = XGBRegressor(max_depth=3, learning_rate=0.1, n_estimators=300, verbosity=1,
                                   objective='reg:squarederror', booster='gbtree', n_jobs=-1, gamma=0,
                                   min_child_weight=5, max_delta_step=0, subsample=0.6, colsample_bytree=0.8,
                                   colsample_bylevel=1, colsample_bynode=1, reg_alpha=0, reg_lambda=1,
                                   scale_pos_weight=1, base_score=0.5, random_state=1234, missing=None,
                                   importance_type='gain')
        estimator_1.fit(X, Y)
        df_to_return = pd_df[['div_nbr', 'store_nbr']].drop_duplicates().copy()
        df_to_return['model_str'] = pickle.dumps(estimator_1)
        return df_to_return
What I would actually like to achieve is to declare return_type, function_type and features_name in __init__(), use them in the pandas UDF, and also pass parameters to be used inside the function when doing GroupBy.apply.
If anyone could help me out, I would highly appreciate it. I am a bit of a newbie to PySpark.
Background
Pandas UDF Lifecycle:
A spark datatype (Column, DataFrame) is serialized into Arrow's Table format via PyArrow.
That data is sent to Python worker processes (created just in time) within each executor.
Inside the Python worker, the data is deserialized to a pandas Column/DataFrame and your pandas_udf code is run on that Column/DataFrame.
The pandas output is serialized back to Arrow's Table format.
The Python worker sends the data back to the calling process.
Back in the Spark environment, the Arrow Table is decoded back to a Spark datatype.
The Problem:
When working with extra data, such as a class's self, the pandas UDF still needs to serialize and send that data. Serializing complex Python objects like classes is not in PyArrow's capabilities, so you must either 1) create a wrapper function and reference only specific serializable Python types within the pandas_udf, or 2) use a @staticmethod to negate the need for self.
The Solutions
1 - Pandas UDF with a Parameter in a Class: wrap the method with a function and create a local variable within that wrapper - src. Note that all variables that are referenced within the pandas_udf must be supported by PyArrow. Most python types are supported, but classes are not.
import pandas as pd
from pyspark.sql.functions import pandas_udf

class my_class:
    def __init__(self, s):
        self.s = s

    def wrapper_add_s(self, column):
        local_s = self.s  # create a local copy of s to be referenced by the udf

        @pandas_udf("string")
        def add_s(column: pd.Series) -> pd.Series:
            return column + f'_{local_s}'

        return add_s(column)

    def add_col(self, df):
        return df.withColumn("Name", self.wrapper_add_s("Name"))

c = my_class(s='hi')
df = c.add_col(df)
2 - Pandas UDF without a Parameter in a Class: use the @staticmethod decorator
import pandas as pd
from pyspark.sql.functions import pandas_udf

class my_class:
    def __init__(self):
        pass

    @staticmethod
    @pandas_udf("string")
    def add_s(column: pd.Series) -> pd.Series:
        return column + ' static string'

    def add_col(self, df):
        return df.withColumn("Name", self.add_s("Name"))

c = my_class()
df = c.add_col(df)
Not in a Class
If you're looking for a simple structure to pass a parameter to a pandas_udf outside of a class, use this... - src
def wrapper_add_s(column, s):
    @pandas_udf("string")
    def add_s(column: pd.Series) -> pd.Series:
        return column + f'_{s}'

    return add_s(column)
df = df.withColumn("Name", wrapper_add_s("Name", s='hi'))
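Applying the wrapper pattern (option 1) back to the question's Train class, a rough sketch could look like the following. Note that this is an assumption-laden sketch: it uses Spark 3's groupBy(...).applyInPandas instead of the GROUPED_MAP pandas_udf, the model-fitting body is elided, and the names simply mirror the question.

class Train:
    def __init__(self, features_name, return_type):
        self.features_name = features_name  # declared once in __init__, as desired
        self.return_type = return_type      # StructType schema for the output

    def run_train(self, sp_df):
        # copy what the UDF needs into plain local variables so the closure
        # never captures `self` (which PyArrow cannot serialize)
        features_name = self.features_name

        def model_train(pd_df):
            X = pd_df[features_name].copy()
            # ... fit the estimator and build the result DataFrame here ...
            return pd_df[['div_nbr', 'store_nbr']].drop_duplicates()

        return sp_df.groupby(['A', 'B']).applyInPandas(model_train, schema=self.return_type)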
The pandas dataframe rows correspond to successive time samples of a Kalman filter. I want to display the trajectory (truth, measurements and filter estimates) in a stream.
def show_tracker(index, data=run_tracker()):
    i = int(index)
    sleep(0.1)
    p = \
        hv.Scatter(data[0:i], kdims=['x'], vdims=['y'])(style=dict(color='r')) *\
        hv.Curve  (data[0:i], kdims=['x.true'], vdims=['y.true']) *\
        hv.Scatter(data[0:i], kdims=['x.est'], vdims=['y.est'])(style=dict(color='darkgreen')) *\
        hv.Curve  (data[0:i], kdims=['x.est'], vdims=['y.est'])(style=dict(color='lightgreen'))
    return p

%%opts Scatter [width=600,height=280]
ndx = TimeIndex()
hv.DynamicMap(show_tracker, kdims=[], streams=[ndx])

for i in range(N):
    ndx.update(index=i)
Issue 1: Axes are automatically set to the bounds of the data. Consequently, trajectory updates occur at the very edge of the plot boundaries. Is there a setting to allow some slop, or do I have to compute appropriate bounds in the show_tracker function?
Issue 2: Bokeh backend; I can zoom and pan, but "Reset" causes the data set to be lost. How do I fix that?
Issue 3: The default data argument to show_tracker requires the function to be re-executed to generate a new dataframe. Is there an easy way to address that?
Issue 1
This is one of the last outstanding issues for the 1.7 release coming next week; track this issue for updates. However, we also just changed how the ranges are updated on a DynamicMap: if you want the ranges to update, make sure to set %%opts Scatter {+framewise} or pass norm=dict(framewise=True) to one of the displayed objects, as you're already doing for the style options.
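For example, sticking with the call-style option syntax already used in show_tracker, the extra norm group could be added to one of the elements (a sketch, not the only way to do it):

# inside show_tracker: enable framewise normalization on one of the elements
hv.Scatter(data[0:i], kdims=['x'], vdims=['y'])(style=dict(color='r'),
                                                norm=dict(framewise=True))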
Issue 2
This is an unfortunate shortcoming of the reset tool in bokeh, you can track this issue for updates.
Issue 3:
That depends on what exactly you're doing: has the data already been generated, or are you updating it on the fly? If you only have to generate the data once, you can simply create it outside the function, which means it will be in scope:
data = run_tracker()

def show_tracker(index):
    i = int(index)
    sleep(0.1)
    ...
    return p
If you actually want to generate new data dynamically the easiest thing to do is write a little class to keep track of the state. You can even make that class a Stream so you don't have to define it separately. Here's what that might look like:
import param
import holoviews as hv

class KalmanTracker(hv.streams.Stream):

    index = param.Integer(default=1)

    def __init__(self, **params):
        # Initializes empty data and parameters
        self.data = None
        super(KalmanTracker, self).__init__(**params)

    def update_data(self, index):
        # Update self.data here
        ...

    def get_view(self, index):
        # Update the data if the index exceeds the data length, then
        # create a holoviews view of the data
        if self.data is None or len(self.data) < index:
            self.update_data(index)
        data = self.data[:index]
        ....
        return hv_obj

    def show(self):
        # Create a DynamicMap to display and
        # pass in self as the Stream
        return hv.DynamicMap(self.get_view, kdims=[],
                             streams=[self])
tracker = KalmanTracker()
tracker.show()
# Should update data and plot
tracker.update(index=10)
Once you've done that you can also use the paramnb library to generate widgets from this class. You'd simply do this:
tracker = KalmanTracker()
paramnb.Widgets(tracker, callback=tracker.update)
tracker.show()