Monte Carlo Simulation to populate a pdf matrix - pandas

I am constructing a pdf matrix for data that looks like this:
Date Reference Secondary
10.01.2023 2 4
11.01.2023 5 6
12.01.2023 5 3
I formed a matrix between Reference and Secondary using pd.crosstab, normalized it column-wise, and then plotted it using seaborn.heatmap. It looks something like this:
Please ignore the lower panel. The green tabs are the column-normalised pdf matrix for Reference and Secondary from the table above. The x-axis is Secondary and the y-axis is Reference. My problem is that the matrix is not populated for the higher bins. For example, bin 17 is missing on the x-axis of the figure. This simply means that Secondary has no values in that bin on days that overlap with Reference. However, I want to populate this bin (bin 17) by running a Monte Carlo simulation so that it gets a distribution like the other bins.
Is there any easy way to do this?
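One possible approach, sketched below under loud assumptions: keep the raw counts from pd.crosstab, reindex them so that empty bins show up as all-zero columns, and fill each empty column with simulated draws. The sample data, the bin range, the number of draws, and above all the choice to sample an empty Secondary bin from the overall Reference distribution are illustrative assumptions, not something the data itself dictates.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical data standing in for the Reference/Secondary table.
df = pd.DataFrame({
    "Reference": [2, 5, 5, 3, 4, 4, 6],
    "Secondary": [4, 6, 3, 4, 6, 3, 4],
})

# Every bin the matrix should cover, including bins with no observations.
bins = np.arange(2, 8)

# Raw counts; reindexing makes missing bins appear as all-zero rows/columns.
counts = (pd.crosstab(df["Reference"], df["Secondary"])
            .reindex(index=bins, columns=bins, fill_value=0))

# ASSUMPTION: an empty Secondary bin follows the overall Reference distribution.
marginal = (counts.sum(axis=1) / counts.to_numpy().sum()).to_numpy()

n_draws = 10_000  # arbitrary number of simulated observations per empty bin
for col in counts.columns:
    if counts[col].sum() == 0:
        # One multinomial draw = n_draws simulated Reference values, binned.
        counts[col] = rng.multinomial(n_draws, marginal)

# Column-wise normalisation, as in the original crosstab-based matrix.
pdf = counts / counts.sum(axis=0)
```

The resulting pdf can be passed to seaborn.heatmap as before. A different sampling model (for example, borrowing the distribution of a neighbouring bin) only changes the probability vector handed to rng.multinomial.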

Related

How to plot timeseries with many NaNs?

Originally I had a dataframe containing power consumption of some devices like this:
and I wanted to plot power consumption vs time for different devices, one plot per one of 6 possible dates. After grouping by date I got plots like this one (for each group = date):
Then I tried to create a similar plot, but with the roles of date and device switched, so that it is grouped by device and colored by date. To do this I prepared this dataframe:
It is similar to the previous one, but has many NaN values because the measurement times differ. I thought this wouldn't be a problem, but after grouping by device, the subplots look like this one (ex is just the name of the sub-dataframe extracted in the loop over groups = devices):
This is the ex dataframe (mean lag between observations is around 20 seconds)
Question: What should I do to make the plots grouped by device look like the ones grouped by date? (I'd like to use the ex dataframe but handle the NaNs somehow.)
I found a solution in an answer to a similar question: ex.interpolate(method='linear').plot(). This line fills the gaps between data points by interpolating before plotting. This is the result:
Another thing that can help is adding .plot(marker='o', ms = 3), which won't fill the gaps between points, but will at least make the points visible (previously some points, mainly the peaks in energy consumption, were too small at the scale of the whole plot). This is the result:
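Both options can be tried side by side. The following is a minimal sketch; the frame named ex here is a small made-up stand-in with interleaved NaNs, not the original data.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical stand-in for `ex`: two dates measured at alternating times,
# so each column is NaN wherever the other one has a value.
idx = pd.date_range("2023-01-10", periods=8, freq=pd.Timedelta(seconds=20))
ex = pd.DataFrame({
    "day1": [1.0, np.nan, 3.0, np.nan, 2.0, np.nan, 4.0, np.nan],
    "day2": [np.nan, 2.0, np.nan, 5.0, np.nan, 1.0, np.nan, 3.0],
}, index=idx)

fig, (ax_interp, ax_marker) = plt.subplots(2, 1, sharex=True)

# Option 1: fill interior gaps by linear interpolation, then draw lines.
filled = ex.interpolate(method="linear")
filled.plot(ax=ax_interp)

# Option 2: leave the NaNs in place but mark every real observation.
ex.plot(ax=ax_marker, marker="o", ms=3)
```

Note that linear interpolation cannot fill NaNs that come before the first valid value of a column, so those stay empty in the plot.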

Plot different Times Series Data in one Chart with shared x-Axes Pandas

I want to plot 5 different data frames in one plot. They contain the same measurement, taken at different times. The plot should share the x-axis for all measurements.
The first thing I did was calculate the time between the measurement points. It varies between 5 and 10 ms, but sometimes there are also big gaps of 200 ms.
Then I calculated the running sum over this column and set the result as the index (dtype "timedelta64[ns]").
Now I want to plot those 5 time series in one plot that shares the x-axis (as time in ms).
But I don't know how, because they have almost no index values in common. The plot should have one common x-axis from 0 to 3 seconds containing the 5 measurements.
Thank you!
2 Example DataFrames:
example for measuremt01
example for measuremt02
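Series with unrelated timedelta indices do not have to be aligned before plotting: each one can be drawn against its own elapsed time on the same axes. A minimal sketch, assuming synthetic data; make_measurement, the column values, and the labels are hypothetical.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)

def make_measurement(n=300):
    """Hypothetical measurement: irregular 5-10 ms steps, cumulative time index."""
    dt_ms = rng.integers(5, 11, size=n)
    elapsed = pd.to_timedelta(np.cumsum(dt_ms), unit="ms")
    return pd.Series(rng.normal(size=n), index=elapsed, name="value")

measurements = {f"measurement{i:02d}": make_measurement() for i in range(1, 6)}

fig, ax = plt.subplots()
for label, series in measurements.items():
    # Convert the timedelta index to plain milliseconds for the x-axis.
    elapsed_ms = np.asarray(series.index.total_seconds()) * 1000.0
    ax.plot(elapsed_ms, series.to_numpy(), label=label)

ax.set_xlabel("time since start [ms]")
ax.set_xlim(0, 3000)  # common 0-3 s window
ax.legend()
```

Because every line is plotted with its own x values, no common index is needed and no NaNs are introduced.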

Best approach to fill signal gaps

I have a 2-dimensional numpy array that represents a multi-channel bio-signal. This array has dimension 20 x n_samples, where the columns represent: Sample number - 16 channels data - time.
Due to the Bluetooth connection I have some packet drops, so there are gaps in the signal. The array has to be imported into MNE-Python for further analysis. This library assumes that the sampling rate is constant (it cannot handle gaps, since it assumes that we MUST have a sample every 4 ms), so I have tried 3 different approaches:
Don't fill the gaps and let the signal to be spliced together (MNE Python create a structure with data equally spaced)
Fill the gaps with np.nan
Fill the gaps with 0s
My question is about the filtering that I need to apply to the data. I have used scipy.signal.welch to get the PSD of the signal. It seems that the signal with NaN as filler performs better than the original one and the one filled with 0s, but the behavior is strange once I try to get the PSD of a low-pass and high-pass filtered version of the signal.
Does anyone know what is the best approach?
Here are 3 images for the different filling strategies. (The top ones are the psd obtained with MNE library, the bottom ones with scipy.welch). The filter used is a FIR.
Filled with NAN
Filled with 0s
Spliced
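For reference, the NaN and zero filling strategies described above can be sketched as follows. This is only an illustration: fill_gaps is a hypothetical helper, and it assumes the sample-number column is a strictly increasing integer counter that never wraps around.

```python
import numpy as np

def fill_gaps(sample_idx, data, fill_value=np.nan):
    """Place received samples on a gap-free grid; dropped samples get fill_value.

    sample_idx : 1-D int array of sample counters (assumed strictly increasing).
    data       : array of shape (n_channels, n_received).
    """
    sample_idx = np.asarray(sample_idx)
    offset = sample_idx - sample_idx[0]      # position of each sample on the grid
    n_total = int(offset[-1]) + 1            # length of the gap-free recording
    filled = np.full((data.shape[0], n_total), fill_value, dtype=float)
    filled[:, offset] = data                 # copy received samples into place
    return filled

# Hypothetical recording: 2 channels, samples 3 and 4 were dropped.
idx = np.array([0, 1, 2, 5, 6])
sig = np.arange(10, dtype=float).reshape(2, 5)

with_nan = fill_gaps(idx, sig)                    # strategy "fill with np.nan"
with_zero = fill_gaps(idx, sig, fill_value=0.0)   # strategy "fill with 0s"
```

Either array has one column per 4 ms tick, so it matches the constant sampling rate that MNE-Python expects.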

Pandas, compute many means with bootstrap confidence intervals for plotting

I want to compute means with bootstrap confidence intervals for some subsets of a dataframe; the ultimate goal is to produce bar graphs of the means with bootstrap confidence intervals as the error bars. My data frame looks like this:
ATG12 Norm ATG5 Norm ATG7 Norm Cancer Stage
5.55 4.99 8.99 IIA
4.87 5.77 8.88 IIA
5.98 7.88 8.34 IIC
The subsets I'm interested in are every combination of Norm columns and cancer stage. I've managed to produce a table of means using:
df.groupby('Cancer Stage')[['ATG12 Norm', 'ATG5 Norm', 'ATG7 Norm']].mean()
But I need to compute bootstrap confidence intervals to use as error bars for each of those means using the approach described here: http://www.randalolson.com/2012/08/06/statistical-analysis-made-easy-in-python/
It boils down to:
import scipy
import scikits.bootstrap as bootstrap
CI = bootstrap.ci(data=Series, statfunction=scipy.mean)
# CI[0] and CI[1] are the low and high bounds of the confidence interval
I tried to apply this method to each subset of data with a nested-loop script:
for i in data.groupby('Cancer Stage'):
    for p in i.columns[1:3]: # PROBLEM!!
        Series = i[p]
        print(p)
        print(Series.mean())
        ci = bootstrap.ci(data=Series, statfunction=scipy.mean)
Which produced an error message
AttributeError: 'tuple' object has no attribute called 'columns'
Not knowing what "tuples" are, I have some reading to do but I'm worried that my current approach of nested for loops will leave me with some kind of data structure I won't be able to easily plot from. I'm new to Pandas so I wouldn't be surprised to find there's a simpler, easier way to produce the data I'm trying to graph. Any and all help will be very much appreciated.
The way you iterate over the groupby object is wrong! When you use groupby(), your data frame is sliced along the values in your groupby column(s). Iterating over the result yields each slice together with its group name, packed into a so-called "tuple": (name, dataforgroup). The correct recipe for iterating over groupby objects is
for name, group in data.groupby('Cancer Stage'):
    print(name)
    for p in group.columns[0:3]:
        ...
Please read more about the groupby-functionality of pandas here and go through the python-reference in order to understand what tuples are!
Grouping data frames and applying a function is essentially done in one statement, using the apply-functionality of pandas:
cols = data.columns[0:3]  # the three Norm columns
for col in cols:
    print(data.groupby('Cancer Stage')[col].apply(lambda x: bootstrap.ci(data=x, statfunction=scipy.mean)))
does everything you need in one statement per column, and produces a (nicely plottable) series for you
EDIT:
I toyed around with a data frame object I created myself:
import pandas as pd
import scikits.bootstrap as bootstrap
df = pd.DataFrame({'A':range(24), 'B':list('aabb') * 6, 'C':range(15,39)})
for col in ['A', 'C']:
    print(df.groupby('B')[col].apply(lambda x: bootstrap.ci(data=x.values)))
yields two series that look like this:
B
a [6.58333333333, 14.3333333333]
b [8.5, 16.25]
B
a [21.5833333333, 29.3333333333]
b [23.4166666667, 31.25]
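If scikits.bootstrap is not available, a similar table can be built with NumPy alone. The sketch below is not the method used above: it swaps in a plain percentile bootstrap, so the interval bounds may differ somewhat from those returned by bootstrap.ci. The helper name bootstrap_ci, the number of resamples, and the fixed seed are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

def bootstrap_ci(values, n_resamples=2000, alpha=0.05):
    """Percentile-bootstrap confidence interval for the mean (hypothetical helper)."""
    values = np.asarray(values, dtype=float)
    # Resample with replacement: one row per bootstrap replicate.
    resamples = rng.choice(values, size=(n_resamples, values.size), replace=True)
    means = resamples.mean(axis=1)
    low, high = np.percentile(means, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return pd.Series({"mean": values.mean(), "ci_low": low, "ci_high": high})

# Same toy frame as in the example above.
df = pd.DataFrame({"A": range(24), "B": list("aabb") * 6, "C": range(15, 39)})

# One row per group, with columns mean / ci_low / ci_high for column A.
summary = pd.DataFrame(
    {name: bootstrap_ci(group["A"]) for name, group in df.groupby("B")}
).T
```

For a bar chart, the lower and upper error-bar lengths are summary["mean"] - summary["ci_low"] and summary["ci_high"] - summary["mean"].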

Core Plot Graph Label steps

I'm using Core Plot to draw graphs in my app.
I just encountered a problem:
I have dates on the X-Axis and I use a custom labeling policy.
If I only have a few records, everything works fine.
If I have many records, all the labels are crowded together and not useful :-(
So the question is: how can I decide which values to display and which to skip, so that I always have about 10 labels, evenly separated from one another?
Divide the number of points by the number of labels you want and round up. For example, if you have 25 data points and want roughly 10 labels, label every third data point. You'll end up with 9 evenly spaced labels.
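The arithmetic is independent of Core Plot and is shown here in Python only as an illustration; label_indices is a hypothetical helper name.

```python
import math

def label_indices(n_points, max_labels):
    """Return the indices of the data points that should receive a label."""
    # Label every `step`-th point; rounding up never exceeds max_labels.
    step = max(1, math.ceil(n_points / max_labels))
    return list(range(0, n_points, step))

indices = label_indices(25, 10)
# step is ceil(25 / 10) = 3, so points 0, 3, 6, ..., 24 are labelled (9 labels)
```

The same step value can then be used inside the custom labeling code to skip the points in between.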