Real time plotting with plotly - plotly-python

I have a use case where I want to plot a dataset as it is gradually filled in by an external function in real time. I know the range of the x values ahead of time, so the dataset will initially look something like this:
x 0 1 2 3 4 5
y nan nan nan nan nan nan
As my program progresses, the NaNs in the y row will be filled in, and each update should trigger the graph to redraw itself. The program will be run from the command line or from an IPython/Jupyter environment.
Is plotly a suitable library for this case, and if so, what would it look like? My initial experiments suggest that plotly is better suited to complex graphs with 'finished' data.

Related

Monte Carlo Simulation to populate a pdf matrix

I am constructing a pdf matrix for data that looks like this:
Date Reference Secondary
10.01.2023 2 4
11.01.2023 5 6
12.01.2023 5 3
I formed a matrix between Reference and Secondary using pd.crosstab, normalized it column-wise, and then plotted it using seaborn.heatmap. It looks something like this:
Please ignore the lower panel. The green tabs are the column-normalised pdf matrix for the Reference and the Secondary from the table above. The x-axis is the Secondary and the y-axis is the Reference. My problem is that the matrix is not populated for the higher bins. For example, on the x-axis of the figure, bin 17 is missing. This simply means that the Secondary has no values in that bin on days that overlap with the Reference. However, I want to populate this bin (bin 17) by running a Monte Carlo simulation and obtaining a distribution like the ones in the other bins.
Is there any easy way to do this?

Multivariate LSTM with missing values

I am working on a Time Series Forecasting problem using LSTM.
The input contains several features, so I am using a Multivariate LSTM.
The problem is that there are some missing values, for example:
Feature 1 Feature 2 ... Feature n
1 2 4 nan
2 5 8 10
3 8 8 5
4 nan 7 7
5 6 nan 12
Interpolating the missing values can introduce bias in the results, because there are sometimes many consecutive timestamps with missing values in the same feature. Instead, I would like to know whether there is a way to let the LSTM learn with the missing values, for example by using a masking layer or something similar. Can someone explain what the best approach to this problem would be?
I am using Tensorflow and Keras.
As suggested by François Chollet (creator of Keras) in his book, one way to handle missing values is to replace them with zero:
In general, with neural networks, it’s safe to input missing values as
0, with the condition that 0 isn’t already a meaningful value. The
network will learn from exposure to the data that the value 0 means
missing data and will start ignoring the value. Note that if you’re
expecting missing values in the test data, but the network was trained
on data without any missing values, the network won’t have learned to
ignore missing values! In this situation, you should artificially
generate training samples with missing entries: copy some training
samples several times, and drop some of the features that you expect
are likely to be missing in the test data.
So you can assign zero to the NaN elements, provided that zero is not otherwise used in your data. For example, you can normalize the data to a range such as [1,2] and then assign zero to the NaN elements; alternatively, you can normalize all the values to the range [0,1] and use -1 instead of zero to replace the NaN elements.
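The second variant (scale to [0,1], then use -1 for NaN) could be sketched with NumPy as follows. This is only an illustration under assumptions: the helper name scale_and_fill is hypothetical, the array is 2D (timesteps, features), and the sample values are the rows from the question's table.

```python
import numpy as np

def scale_and_fill(x, sentinel=-1.0):
    """Min-max scale each feature column to [0, 1], ignoring NaNs,
    then replace the NaNs with a sentinel outside that range."""
    x = np.asarray(x, dtype=float)
    col_min = np.nanmin(x, axis=0)          # per-feature minimum, NaNs skipped
    col_max = np.nanmax(x, axis=0)          # per-feature maximum, NaNs skipped
    # Avoid division by zero for constant columns.
    span = np.where(col_max > col_min, col_max - col_min, 1.0)
    scaled = (x - col_min) / span           # NaNs stay NaN here
    return np.where(np.isnan(scaled), sentinel, scaled)

# Rows from the question's table (Feature 1, Feature 2, Feature n).
features = np.array([
    [2.0, 4.0, np.nan],
    [5.0, 8.0, 10.0],
    [8.0, 8.0, 5.0],
    [np.nan, 7.0, 7.0],
    [6.0, np.nan, 12.0],
])
filled = scale_and_fill(features)
```

In practice the minimum and maximum would probably be computed on the training split only and reused for the test data.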
Another option is to use a Masking layer in Keras. You give it a mask value, say 0, and it skips any timestep (i.e. row) where all the features are equal to the mask value. However, all the following layers must support masking, and you also need to pre-process your data so that the mask value is assigned to every feature of any timestep that contains one or more NaN features. Example from the Keras docs:
Consider a Numpy data array x of shape (samples, timesteps,features),
to be fed to an LSTM layer. You want to mask timestep #3
and #5 because you lack data for these timesteps. You can:
set x[:, 3, :] = 0. and x[:, 5, :] = 0.
insert a Masking layer with mask_value=0. before the LSTM layer:
model = Sequential()
model.add(Masking(mask_value=0., input_shape=(timesteps, features)))
model.add(LSTM(32))
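The pre-processing step mentioned above (assigning the mask value to every feature of a timestep that has at least one NaN) might be sketched with NumPy like this. The helper name mask_incomplete_timesteps and the sample batch are illustrative assumptions, not part of the Keras docs; the batch reuses a few rows from the question's table.

```python
import numpy as np

def mask_incomplete_timesteps(x, mask_value=0.0):
    """Return a copy of x with shape (samples, timesteps, features) in which
    every timestep containing at least one NaN is set entirely to mask_value."""
    x = np.array(x, dtype=float)             # np.array copies, input is untouched
    incomplete = np.isnan(x).any(axis=-1)    # boolean, shape (samples, timesteps)
    x[incomplete] = mask_value               # overwrite all features of those steps
    return x

# One sample with four timesteps and three features (hypothetical batch).
batch = np.array([[
    [2.0, 4.0, np.nan],
    [5.0, 8.0, 10.0],
    [8.0, 8.0, 5.0],
    [np.nan, 7.0, 7.0],
]])
masked = mask_incomplete_timesteps(batch)
```

The result can then be fed to a model that starts with Masking(mask_value=0.). Note that a genuine timestep whose features all happen to equal the mask value would be skipped as well, which is why the mask value should not occur in the real data.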
Update (May 2021): According to an updated suggestion from François Chollet, it might be better to use a more meaningful or informative value (instead of zero) for masking missing values. This value could be computed (e.g. mean, median, etc.) or predicted from the data itself.

Dynamic graph that changes with time? Is it possible with VBA?

I have some data that I would like to graph as it changes over time.
Basically, the first "graph" would be at time 0, which for example would be X vs Y; the second "graph" would be at time 0.5 seconds, which is again X vs Y, and so on.
Maybe this will help explain my case:
Time (s) X Y
0 1 1
0.5 2 2
1 3 3
1.5 4 4
2 5 5
2.5 6 6
So according to the table above, the values I want to graph are X vs Y, but I have many time points for X and Y. Is it possible with VBA to produce an animation of X vs Y that loops through all the time points I have?
I tried to google alternatives but didn't find what I want; maybe I'm looking in the wrong place. Is it possible with VBA? If not, is there any software that can do this for me? The graph doesn't have to update automatically; I don't mind pressing a button to jump to the next time interval (without creating a new graph).
Take a look at Power BI Designer (and the Power BI service, of course). The Scatter chart in that tool can take a value for the play axis, so the data can be played as an animation.
You could also use Power View in Excel 2013. I created a Power View scatter chart in Excel 2013 and clipped the animation to YouTube.
A sample file with a Power View and several series with a play axis can be accessed on my OneDrive.

variable size rolling window regression

In pandas OLS the window size is a fixed number of rows. How can I set the window size based on the index instead of the number of rows?
I have a series with a variable number of observations per day and 10 years of history, and I want to run a rolling OLS on a 1-year rolling window. Looping through each date is a bit too slow; is there any way to make it faster? Here is an example of the data.
Date x y
2008-1-2 10.0 2
2008-1-2 5.0 1
2008-1-3 7.0 1.5
2008-1-5 9.0 3.0
...
2013-5-30 11.0 2.5
I would like something simple like pandas.ols(df.y, df.x, window='1y'), rather than looping each row since it will be slow to do the loop.
There is a method for doing this in pandas; see the documentation at http://pandas.pydata.org/pandas-docs/dev/computation.html#computing-rolling-pairwise-correlations:
model = pandas.ols(y=df.y, x=df.x, window=250)
You just have to provide the period as a number of rows of the frame instead of '1y'. There are also many additional options that you might find useful for your data.
All the rolling OLS statistics are in model. To show the rolling beta:
model.beta.plot()
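Note that pandas.ols was removed from later pandas releases, and that the window above is a row count rather than a time span. For a window defined by the index, one possible sketch uses time-based rolling sums and the closed-form slope of a one-regressor OLS with intercept. It assumes a DatetimeIndex, columns named x and y as in the question, and '365D' as a stand-in for one year; rolling_beta is a hypothetical helper, not a pandas function.

```python
import pandas as pd

def rolling_beta(df, window="365D"):
    """Slope of y regressed on x (with intercept) over a trailing time window.

    Uses beta = (n*Sxy - Sx*Sy) / (n*Sxx - Sx^2), built from rolling sums,
    so no explicit Python loop over dates is needed.
    """
    df = df.sort_index(kind="mergesort")      # time-based rolling needs a sorted index
    terms = df[["x", "y"]].astype(float)
    terms = terms.assign(xy=terms["x"] * terms["y"], xx=terms["x"] ** 2)
    roll = terms.rolling(window)               # window is a time offset, not a row count
    sums = roll.sum()
    n = roll.count()["x"]
    num = n * sums["xy"] - sums["x"] * sums["y"]
    den = n * sums["xx"] - sums["x"] ** 2
    return num / den.where(den != 0)           # NaN where x has no variation

# The first rows of the question's sample data.
dates = pd.to_datetime(["2008-01-02", "2008-01-02", "2008-01-03", "2008-01-05"])
sample = pd.DataFrame({"x": [10.0, 5.0, 7.0, 9.0],
                       "y": [2.0, 1.0, 1.5, 3.0]}, index=dates)
beta = rolling_beta(sample)
```

Each row of the result is the slope estimated from all observations in the trailing window that ends at that row, so days with several observations are handled without resampling.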

Continuous Attribute - Distribution in Naive Bayes Algorithm

I am trying to implement the Naive Bayes algorithm by writing my own code in MATLAB. I am unsure which distribution to choose for one of the continuous attributes. It has values as follows:
MovieAge :
1
2
3
4
..
10
1
11
2
12
1
3
13
2
1
4
14
3
2
5
15
4
3
6
16
5
4
....
32
9
3
15
Which distribution should I use for such data? Also, in my test set this attribute will sometimes contain values that are not present in the training data. How should I handle this? Thanks
As in @Ben's answer, starting with a histogram sounds good.
I took your input, and the histogram looks like below:
Save your data into a text file called histdata, one value per line.
Python code used to generate the plot:
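import matplotlib.pyplot as plt

# Read one integer per line from the text file 'histdata'
data = []
with open('./histdata') as f:
    for line in f:
        data.append(int(line))

# Plot the counts in 10 bins
plt.hist(data, bins=10)
plt.xlabel('Movie Age')
plt.ylabel('Counts')
plt.show()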
Assuming this variable takes integer values, rather than being continuous (based on the example), the simplest method is a histogram-type approach: the probability of some value is the fraction of times it occurs in the training data. Consider a final bin for all values above some number (maybe 20 or so based on your example). If you have problems with zero counts, add one to all of them (can be seen as a Dirichlet prior if you're that way inclined).
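A minimal sketch of this histogram approach in plain Python, assuming non-negative integer values, a cap of 20 for the final bin, and add-one smoothing. The names fit_histogram and likelihood are hypothetical, and the training values are just a few copied from the excerpt in the question.

```python
from collections import Counter

def fit_histogram(values, cap=20):
    """Estimate P(value) for bins 0..cap, where bin `cap` collects every
    value >= cap, with one pseudo-count added to each bin."""
    counts = Counter(min(v, cap) for v in values)
    total = len(values) + (cap + 1)           # real counts plus cap + 1 pseudo-counts
    return {k: (counts[k] + 1) / total for k in range(cap + 1)}

def likelihood(probs, value, cap=20):
    """Look up the smoothed probability, sending large values to the last bin."""
    return probs[min(value, cap)]

train = [1, 2, 3, 4, 10, 1, 11, 2, 12, 1, 3, 13, 32]
probs = fit_histogram(train)

p_seen = likelihood(probs, 1)      # value that occurs in the training data
p_unseen = likelihood(probs, 7)    # never seen, still non-zero thanks to smoothing
p_large = likelihood(probs, 25)    # falls into the final bin together with 32
```

For Naive Bayes, one such table would be fitted per class, using only the training rows of that class.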
As for a parametric form, if you prefer one, the Poisson distribution is a possibility. A qq plot, or even a goodness of fit test, will suggest how appropriate this is in your case, but I suspect you're going to be better with the histogram based method.
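If the Poisson option is tried, a rough sketch could look like the following. It assumes the rate is estimated by the sample mean (the maximum-likelihood estimate for a Poisson distribution); fit_poisson and poisson_pmf are hypothetical helper names, and the training values are again a few copied from the question's excerpt.

```python
import math

def fit_poisson(values):
    """Maximum-likelihood estimate of the Poisson rate: the sample mean."""
    return sum(values) / len(values)

def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam), evaluated in log space so that
    large k does not overflow the factorial."""
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))

train = [1, 2, 3, 4, 10, 1, 11, 2, 12, 1, 3, 13, 32]
lam = fit_poisson(train)
p_unseen = poisson_pmf(7, lam)     # defined even for values absent from training
```

Because the Poisson distribution assigns a non-zero probability to every non-negative integer, unseen test values need no special handling. Since the attribute appears to start at 1 rather than 0, modelling value - 1 might fit slightly better, but that is a guess to check against the data.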