newbie: holoviews curves from pandas follow up: issues with stream - pandas

The pandas dataframe rows correspond to successive time samples of a Kalman filter. I want to display the trajectory (truth, measurements and filter estimates) in a stream.
def show_tracker(index,data=run_tracker()):
i = int(index)
sleep(0.1)
p = \
hv.Scatter(data[0:i], kdims=['x'], vdims=['y'])(style=dict(color='r')) *\
hv.Curve (data[0:i], kdims=['x.true'], vdims=['y.true']) *\
hv.Scatter(data[0:i], kdims=['x.est'], vdims=['y.est'])(style=dict(color='darkgreen')) *\
hv.Curve (data[0:i], kdims=['x.est'], vdims=['y.est'])(style=dict(color='lightgreen'))
return p
%%opts Scatter [width=600,height=280]
ndx=TimeIndex()
hv.DynamicMap(show_tracker, kdims=[], streams=[ndx])
for i in range(N):
ndx.update(index=i)
Issue 1: Axes are automatically set to the bounds of the data.
Consequently, trajectory updates occur at the very edge of the plot boundaries.
Is there a setting to allow some slop,
or do I have to compute appropriate bounds in the show_tracker function?
Issue 2: Bokeh backend;
I can zoom and pan, but
"Reset" causes the data set to be lost. How do I fix that?
Issue 3: The default data argument to show_tracker
requires the function to be reexecuted to generate a new dataframe.
Is there an easy way to address that?

Issue 1
This is one of the last outstanding issues for the 1.7 release coming next week, track this issue for updates. However we also just changed how the ranges are updated on a DynamicMap, if you want to update the ranges make sure to set %%opts Scatter {+framewise} or norm=dict(framewise=True) on one of the displayed objects as you're already doing for the style options.
Issue 2
This is an unfortunate shortcoming of the reset tool in bokeh, you can track this issue for updates.
Issue 3:
That depends on what exactly you're doing, has the data already been generated or are you updating it on the fly? If you just have to generate the data once you can just create it outside function, which means it will be in scope:
data = run_tracker()
def show_tracker(index):
i = int(index)
sleep(0.1)
...
return p
If you actually want to generate new data dynamically the easiest thing to do is write a little class to keep track of the state. You can even make that class a Stream so you don't have to define it separately. Here's what that might look like:
class KalmanTracker(hv.streams.Stream):
index = param.Integer(default=1)
def __init__(self, **params):
# Initializes empty data and parameters
self.data = None
super(KalmanTracker, self).__init__(**params)
def update_data(self, index):
# Update self.data here
def get_view(self, index):
# Update index exceeds data length and
# create a holoviews view of the data
if self.data is None or len(self.data) < index:
self.update_data(index)
data = self.data[:index]
....
return hv_obj
def show(self):
# Create DynamicMap to display and
# pass in self as the Stream
return hv.DynamicMap(self.get_view, kdims=[],
streams=[self])
tracker = KalmanTracker()
tracker.show()
# Should update data and plot
tracker.update(index=10)
Once you've done that you can also use the paramnb library to generate widgets from this class. You'd simply do this:
tracker = KalmanTracker()
paramnb.Widgets(tracker, callback=tracker.update)
tracker.show()

Related

How to save the model comparison dataframe from compare_models() in pycaret?

I want to save the model comparison data frame from compare_models() in pycaret.
# load dataset
from pycaret.datasets import get_data
diabetes = get_data('diabetes')
# init setup
from pycaret.classification import *
clf1 = setup(data = diabetes, target = 'Class variable')
# compare models
best = compare_models()
i.e. this data frame as shown above.
Does anyone know how to do that?
The solution is :
df = pull()
by Goosang Yu from the pycaret slack community.
compare_models() returns a pandas dataframe, containing the information of the list of models. Hence, you only need to save a dataframe, which can be for example achieved with best.to_csv(path). If you want to save the object in a different format (pickle, xml, ...), you can refer to pandas i/o documentation.

Is plotting with Koalas using TopN has any statistic meaning?

I was going through the source code of Koalas, trying to get a handle on how they actually achieve plotting large datasets. It turns our that they use either sampling or TopN - selecting a given number of records.
I understand the meaning of sampling and internally it uses spark.DataFrame.sample to do it. For TopN, however, they simply take the first max_rows number of records from Koalas' DataFrame using data = data.head(max_rows + 1).to_pandas().
This seems strange and I wonder whether it's correctly reflecting the statistical properties of the dataset doing the data selection in this way.
Koalas DataFrame's plot accessor:
class KoalasPlotAccessor(PandasObject):
pandas_plot_data_map = {
"pie": TopNPlotBase().get_top_n,
"bar": TopNPlotBase().get_top_n,
"barh": TopNPlotBase().get_top_n,
"scatter": SampledPlotBase().get_sampled,
"area": SampledPlotBase().get_sampled,
"line": SampledPlotBase().get_sampled,
}
_backends = {} # type: ignore
...
class TopNPlotBase:
def get_top_n(self, data):
from databricks.koalas import DataFrame, Series
max_rows = get_option("plotting.max_rows")
# Simply use the first 1k elements and make it into a pandas dataframe
# For categorical variables, it is likely called from df.x.value_counts().plot.xxx().
if isinstance(data, (Series, DataFrame)):
data = data.head(max_rows + 1).to_pandas()
...

FuncAnimation doesn't respond when after dynamically sending data to plot to move a scatter point

So I'm using FuncAnimation from matplotlib so dynamically plot some data as it arrives from a serial port (in my project is the vehicle class from dronekit, which is displayed with the green dot), what I have basically is the animation called which every loop is receiving a new vehicle class with data changed so it can be plotted, but for some reason it plots but after a couple of seconds later after the thread of the mission(which allows the "refresh" of the vehicle data it pops up and kills python (Wheel of death), here's what I get:
I've put some tracking prints inside the function that is called when the FuncAnimation starts running, looks like this:
def droneAnimation(i, vehicle, droneScatter):
time.sleep(1)
lat = [vehicle.location.global_relative_frame.lat]
lon = [vehicle.location.global_relative_frame.lon]
alt = [vehicle.location.global_relative_frame.alt]
print("Alt received: " + str(alt))
droneScatter._offsets3d = (lat,lon,alt)
print("Changed pos")
As you can see those prints are triggered the first few seconds but still crashes after a few iterations.
The FuncAnimation is called like this:
fig,droneScatter = plotLiveSimpleFacade(vehicle,w,2)
ani = FuncAnimation(fig,droneAnimation, fargs = (vehicle,droneScatter))
plt.draw()
plt.pause(0.1)
m = threading.Thread(target=MissionStart(vehicle,hmax) , name = "MISSION")
m.start()
For reference: fig is a plt.figure(),droneScatter is just a scatter point, vehicle is the vehicle class containing the data that dynamically updates and the MissionStart thread is just a thread to make the vehicle class change overtime.
I would like to mention as well that the fig is on interactive mode on, and the axes limits are set well (I saw that when you dynamically change data but don't scale the axes may have problems) also, trying different combinations of plt.draw() and plt.plot(block = False) leads me to not plotting at all or just a blank plot.
Since I have no idea of what is causing this I'll put the dronekit tag on this and the threading to see if anyone has any idea!
I've looked onto threading with matplotlib and looks like threading with this said library it's not the best as it's not thread safe, the best bet is to look at multiprocessing with python or approaching the problem in a different manner.
You can find more information at this post

Convert date/time index of external dataset so that pandas would plot clearly

When you already have time series data set but use internal dtype to index with date/time, you seem to be able to plot the index cleanly as here.
But when I already have data files with columns of date&time in its own format, such as [2009-01-01T00:00], is there a way to have this converted into the object that the plot can read? Currently my plot looks like the following.
Code:
dir = sorted(glob.glob("bsrn_txt_0100/*.txt"))
gen_raw = (pd.read_csv(file, sep='\t', encoding = "utf-8") for file in dir)
gen = pd.concat(gen_raw, ignore_index=True)
gen.drop(gen.columns[[1,2]], axis=1, inplace=True)
#gen['Date/Time'] = gen['Date/Time'][11:] -> cause error, didnt work
filter = gen[gen['Date/Time'].str.endswith('00') | gen['Date/Time'].str.endswith('30')]
filter['rad_tot'] = filter['Direct radiation [W/m**2]'] + filter['Diffuse radiation [W/m**2]']
lis = np.arange(35040) #used the number of rows, checked by printing. THis is for 2009-2010.
plt.xticks(lis, filter['Date/Time'])
plt.plot(lis, filter['rad_tot'], '.')
plt.title('test of generation 2009')
plt.xlabel('Date/Time')
plt.ylabel('radiation total [W/m**2]')
plt.show()
My other approach in mind was to use plotly. Yet again, its main purpose seems to feed in data on the internet. It would be best if I am familiar with all the modules and try for myself, but I am learning as I go to use pandas and matplotlib.
So I would like to ask whether there are anyone who experienced similar issues as I.
I think you need set labels to not visible by loop:
ax = df.plot(...)
spacing = 10
visible = ax.xaxis.get_ticklabels()[::spacing]
for label in ax.xaxis.get_ticklabels():
if label not in visible:
label.set_visible(False)

Tensorboard histograms to matplotlib

I would like to "dump" the tensorboard histograms and plot them via matplotlib. I would have more scientific paper appealing plots.
I managed to hack the way through the Summary file using the tf.train.summary_iterator and dump the histogram that I wanted to dump( tensorflow.core.framework.summary_pb2.HistogramProto object).
By doing that and implementing what the java-script code does with the data (https://github.com/tensorflow/tensorboard/blob/c2fe054231fe77f3a5b05dbc519f713d2e738d1c/tensorboard/plugins/histogram/tf_histogram_dashboard/histogramCore.ts#L104), I managed to get something similar (same trends) with the tensorboard plots, but not the exact same plot.
Can I have some light on this?
Thanks
In order to plot a tensorboard histogram with matplotlib I am doing the following:
event_acc = EventAccumulator(path, size_guidance={
'histograms': STEP_COUNT,
})
event_acc.Reload()
tags = event_acc.Tags()
result = {}
for hist in tags['histograms']:
histograms = event_acc.Histograms(hist)
result[hist] = np.array([np.repeat(np.array(h.histogram_value.bucket_limit), np.array(h.histogram_value.bucket).astype(np.int)) for h in histograms])
return result
h.histogram_value.bucket_limit gives me the value and h.histogram_value.bucket the count of this value. So when i repeat the values accordingly (np.repeat(...)), I get a huge array of expected size. This array can now be plotted with the default matplotlib logic.
The best solution is loading all events and reconstructing all the histogram (as the answer of #khuesmann) but not using EventAccumulator but EventFileLoader. This will give you a histogram per wall time and step as the ones Tensorboard plots. It can be extended to return a list of actions by timestep and wall time.
Don't forget to check which tag will you use.
from tensorboard.backend.event_processing.event_file_loader import EventFileLoader
# Just in case, PATH_OF_FILE is the path of the file, not the folder
loader = EventFileLoader(PATH_Of_FILE)
# Where to store values
wtimes,steps,actions = [],[],[]
for event in loader.Load():
wtime = event.wall_time
step = event.step
if len(event.summary.value) > 0:
summary = event.summary.value[0]
if summary.tag == HISTOGRAM_TAG:
wtimes += [wtime]*int(summary.histo.num)
steps += [step] *int(summary.histo.num)
for num,val in zip(summary.histo.bucket,summary.histo.bucket_limit):
actions += [val] *int(num)
bear in mind that tensorflow approximates the actions and treats the actions as continuous variables, so even if you have discrete actions (e.g. 0,1,3) you will end up actions as 0.2,0.4,0.9,1.4 ... in that case round the values will do it.
A good solution is the one from #khuesmann, but this only allows you to retrieve the accumulated histogram, not the histogram per step -- which is the one actually being showed in tensorboard.
If you want the distribution and so far, what I have understood is that Tensorboard usually compresses the histogram to decrease the memory used to store the data -- imagine storing a 2D histogram over 4 million steps, the memory can increase fast quickly. These compress histograms are accessible by doing this:
from tensorboard.backend.event_processing.event_accumulator import EventAccumulator
n2n = EventAccumulator(PATH)
n2n.Reload()
# Check the tags under histograms and choose the one you want
n2n.Tags()
# This will give you the list used by tensorboard
# of the compress histograms by timestep and wall time
n2n.CompressedHistograms(HISTOGRAM_TAG)
The only problem is that it compresses the histogram to five percentiles (in Basic points they are 0, 668, 1587, 3085, 5000, 6915, 8413, 9332, 10000) which corresponds to (-Inf, -1.5, -1, -0.5, 0, 0.5, 1, 1.5, Inf) in standard deviations. Check the code here.
I haven't read much, but it wouldn't be hard to reconstruct the temporal histograms that tensorboard shows. If I find a way to do it, I will post it here.
The simplest way is to parse the events with tbparse and plot the histograms with seaborn kde_ridgeplot.
This tutorial generates the stacked distribution plot with around 30 lines of Python code:
Tensorboard preview:
Parse by tbparse & plotted by seaborn:
You can open an issue if you encountered any question during parsing. (I'm the author of tbparse)