Continuous Attribute - Distribution in Naive Bayes Algorithm - testing

I am trying to implement the Naive Bayes algorithm by writing my own code in MATLAB. I am confused about which distribution to choose for one of the continuous attributes. It has values as follows:
MovieAge: 1, 2, 3, 4, .., 10, 1, 11, 2, 12, 1, 3, 13, 2, 1, 4, 14, 3, 2, 5, 15, 4, 3, 6, 16, 5, 4, ...., 32, 9, 3, 15
Please let me know which distribution to use for such data. Also, in my test set this attribute will sometimes contain values that are not included in the training data; how do I handle this problem? Thanks

As in @Ben's answer, starting with a histogram sounds good.
I took your input and plotted its histogram (the plot itself is not reproduced here).
Save your data into a text file called histdata, one line per value:
Python code used to generate the plot:
import matplotlib.pyplot as plt

# Read one integer per line from the data file
data = []
with open('./histdata') as f:
    for line in f:
        data.append(int(line))

plt.hist(data, bins=10)
plt.xlabel('Movie Age')
plt.ylabel('Counts')
plt.show()

Assuming this variable takes integer values, rather than being continuous (based on the example), the simplest method is a histogram-type approach: the probability of some value is the fraction of times it occurs in the training data. Consider a final bin for all values above some number (maybe 20 or so based on your example). If you have problems with zero counts, add one to all of them (can be seen as a Dirichlet prior if you're that way inclined).
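As an illustration (not from the original answer), a minimal sketch of that histogram estimate with add-one smoothing and an overflow bin could look like this; the cutoff of 20 and the function name are just for the example:
from collections import Counter

def histogram_probs(train_values, max_bin=20, smoothing=1):
    """Empirical P(value) with add-one smoothing and an overflow bin."""
    # Lump everything above max_bin into the final (overflow) bin
    binned = [min(v, max_bin) for v in train_values]
    counts = Counter(binned)
    # Laplace smoothing: add `smoothing` to every one of the max_bin bins
    total = len(binned) + smoothing * max_bin
    return {b: (counts.get(b, 0) + smoothing) / total
            for b in range(1, max_bin + 1)}

probs = histogram_probs([1, 2, 3, 4, 10, 1, 11, 2, 12, 1, 3, 13])
print(probs[5])            # unseen value, still gets nonzero probability
print(probs[min(32, 20)])  # test value above the cutoff -> overflow bin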
As for a parametric form, if you prefer one, the Poisson distribution is a possibility. A QQ plot, or even a goodness-of-fit test, will suggest how appropriate it is in your case, but I suspect you're going to do better with the histogram-based method.
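If you want to eyeball the Poisson idea, a rough QQ-plot sketch with scipy might look like the following; the data array is a placeholder for the actual MovieAge values:
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Placeholder sample standing in for the MovieAge values
data = np.array([1, 2, 3, 4, 10, 1, 11, 2, 12, 1, 3, 13, 2, 1, 4, 14])

lam = data.mean()  # maximum-likelihood estimate of the Poisson rate

# Compare empirical quantiles against theoretical Poisson quantiles
qs = np.linspace(0.01, 0.99, 50)
emp_q = np.quantile(data, qs)
theo_q = stats.poisson.ppf(qs, mu=lam)

plt.scatter(theo_q, emp_q)
plt.plot(theo_q, theo_q, 'r--')  # y = x reference line
plt.xlabel('Poisson quantiles')
plt.ylabel('Empirical quantiles')
plt.show()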

Related

Monte Carlo Simulation to populate a pdf matrix

I am constructing a pdf matrix for data that looks like this:
Date        Reference  Secondary
10.01.2023  2          4
11.01.2023  5          6
12.01.2023  5          3
I formed a matrix between Reference and Secondary using pd.crosstab, normalising it column-wise, and then plotted it using seaborn.heatmap (a sketch of that step is shown below).
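Something like the following, with the column names taken from the table above (illustrative, not the asker's exact code):
import pandas as pd
import seaborn as sns

# Hypothetical frame built from the table above
df = pd.DataFrame({
    'Date': ['10.01.2023', '11.01.2023', '12.01.2023'],
    'Reference': [2, 5, 5],
    'Secondary': [4, 6, 3],
})

# normalize='columns' makes each column sum to 1, i.e. a column-wise pdf matrix
matrix = pd.crosstab(df['Reference'], df['Secondary'], normalize='columns')
sns.heatmap(matrix, cmap='Greens')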
The heatmap figure itself is omitted here; its green panels are the column-normalised pdf matrix for Reference and Secondary from the table above, with Secondary on the x-axis and Reference on the y-axis (the lower panel can be ignored). My problem is that the matrix is not populated for the higher bins. For example, bin 17 is missing from the x-axis, which simply means that Secondary has no values on days that overlap with Reference. However, I want to populate this bin (bin 17) by running a Monte Carlo simulation to obtain a distribution like the other bins have.
Is there any easy way to do this?

Cannot use fit method in seaborn.distplot

I have a dataframe with many columns, one of which counts the duration of a process in months. Sample data is available below:
ID Unit Duration
231 TS 2
427 SP 4
291 EI 1
312 SP 3
So I am trying to plot the histogram, filtered by unit, and fit it (mostly for visualization purposes) to stats.expon, which is the best fit for most units. Seems simple enough:
graph = sns.distplot(df[df['Unit'] == 'SP']['Duration'], kde=False, fit=stats.expon)
But it raises TypeError: No loop matching the specified signature and casting was found for ufunc add. What am I doing wrong? I'm kind of new to matplotlib and seaborn, so excuse me if this is trivial.
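(An aside not in the original thread: this TypeError commonly appears when the column being fitted has an integer dtype; casting to float first is a frequent workaround, sketched here under that assumption.)
from scipy import stats
import seaborn as sns

# Assumption: Duration is stored as integers and the fit chokes on them;
# cast to float before handing the series to distplot's fit.
durations = df[df['Unit'] == 'SP']['Duration'].astype(float)
graph = sns.distplot(durations, kde=False, fit=stats.expon)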

Hierarchical clustering with different sample size on Python

I would like to know whether it's possible to do hierarchical clustering with different sample sizes in Python, more precisely with Ward's minimum variance method.
For instance, I have 5 lists of integers, A, B, C, D, E, of different lengths. What I want to do is group these 5 lists into 3 groups according to Ward's method (choosing each merge so as to minimise the increase in within-cluster variance).
Does anyone know how to do this?
We can consider these 5 lists as the samples you want to cluster into 3 groups.
Hierarchical clustering, as you may know, can take a distance matrix as input.
A distance matrix holds some sort of pairwise distance (or dissimilarity) between your samples.
You have to construct this 5x5 matrix by choosing a meaningful distance function. This greatly depends on what your samples/integers represent. As your samples do not have constant length, you can't compute metrics like the Euclidean distance.
For example, if the integers in your lists can be interpreted as classes, you could compute the Jaccard index to express some sort of dissimilarity.
[1 2 3 4 5] and [1 3 4] have a Jaccard similarity index of 3/5 (or a dissimilarity of 2/5), where 0 means entirely different and 1 means perfectly identical. See https://en.wikipedia.org/wiki/Jaccard_index
Once your dissimilarity matrix is computed (it really contains only 5 choose 2 = 10 distinct values, as the matrix is symmetric), you can apply hierarchical clustering to it.
The important part is finding a distance function adapted to your problem; a minimal sketch is given below.
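Assuming Jaccard dissimilarity as above, a sketch with scipy might look like this (the five lists are placeholders; note that Ward's method formally assumes Euclidean distances, so applying it to a Jaccard matrix is a pragmatic approximation rather than an exact use of the method):
from itertools import combinations
from scipy.cluster.hierarchy import linkage, fcluster

# Placeholder lists standing in for A, B, C, D, E
samples = [
    [1, 2, 3, 4, 5],
    [1, 3, 4],
    [2, 4, 6, 8],
    [1, 2, 3],
    [5, 6, 7, 8, 9],
]

def jaccard_dissimilarity(a, b):
    """1 - |intersection| / |union|, treating each list as a set of classes."""
    a, b = set(a), set(b)
    return 1 - len(a & b) / len(a | b)

# Condensed distance vector: 5 choose 2 = 10 pairwise dissimilarities
condensed = [jaccard_dissimilarity(a, b) for a, b in combinations(samples, 2)]

Z = linkage(condensed, method='ward')
labels = fcluster(Z, t=3, criterion='maxclust')
print(labels)  # cluster label (1..3) for each of the 5 lists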

Pandas, compute many means with bootstrap confidence intervals for plotting

I want to compute means with bootstrap confidence intervals for some subsets of a dataframe; the ultimate goal is to produce bar graphs of the means with bootstrap confidence intervals as the error bars. My data frame looks like this:
ATG12 Norm ATG5 Norm ATG7 Norm Cancer Stage
5.55 4.99 8.99 IIA
4.87 5.77 8.88 IIA
5.98 7.88 8.34 IIC
The subsets I'm interested in are every combination of Norm columns and cancer stage. I've managed to produce a table of means using:
df.groupby('Cancer Stage')[['ATG12 Norm', 'ATG5 Norm', 'ATG7 Norm']].mean()
But I need to compute bootstrap confidence intervals to use as error bars for each of those means using the approach described here: http://www.randalolson.com/2012/08/06/statistical-analysis-made-easy-in-python/
It boils down to:
import numpy as np
import scikits.bootstrap as bootstrap  # the package name is scikits.bootstrap
CI = bootstrap.ci(data=Series, statfunction=np.mean)  # np.mean replaces the deprecated scipy.mean
# CI[0] and CI[1] are your low and high confidence intervals
I tried to apply this method to each subset of data with a nested-loop script:
for i in data.groupby('Cancer Stage'):
    for p in i.columns[1:3]:  # PROBLEM!!
        Series = i[p]
        print(p)
        print(Series.mean())
        ci = bootstrap.ci(data=Series, statfunction=np.mean)
This produced the error message
AttributeError: 'tuple' object has no attribute 'columns'
Not knowing what "tuples" are, I have some reading to do but I'm worried that my current approach of nested for loops will leave me with some kind of data structure I won't be able to easily plot from. I'm new to Pandas so I wouldn't be surprised to find there's a simpler, easier way to produce the data I'm trying to graph. Any and all help will be very much appreciated.
The way you iterate over the groupby object is wrong! When you use groupby(), your data frame is sliced along the values in your groupby column(s), and each slice is paired with its group name as a tuple (name, dataforgroup). The correct recipe for iterating over groupby objects is
for name, group in data.groupby('Cancer Stage'):
    print(name)
    for p in group.columns[0:3]:
        ...
Please read more about the groupby functionality in the pandas documentation, and go through the Python reference to understand what tuples are!
Grouping data frames and applying a function can essentially be done in one statement, using the apply functionality of pandas:
cols = data.columns[0:2]
for col in cols:
    print(data.groupby('Cancer Stage')[col].apply(lambda x: bootstrap.ci(data=x, statfunction=np.mean)))
This does everything you need, and produces one (nicely plottable) series per column.
EDIT:
I toyed around with a data frame object I created myself:
df = pd.DataFrame({'A': range(24), 'B': list('aabb') * 6, 'C': range(15, 39)})
for col in ['A', 'C']:
    print(df.groupby('B')[col].apply(lambda x: bootstrap.ci(data=x.values)))
yields two series that look like this:
B
a [6.58333333333, 14.3333333333]
b [8.5, 16.25]
B
a [21.5833333333, 29.3333333333]
b [23.4166666667, 31.25]

variable size rolling window regression

In pandas OLS the window size is a fixed length. How can I set the window size based on the index instead of the number of rows?
I have a series with a variable number of observations per day and 10 years of history, so I want to run a rolling OLS over a 1-year window. Looping through each date is a bit too slow; is there any way to make it faster? Here is an example of the data.
Date x y
2008-1-2 10.0 2
2008-1-2 5.0 1
2008-1-3 7.0 1.5
2008-1-5 9.0 3.0
...
2013-5-30 11.0 2.5
I would like something simple like pandas.ols(df.y, df.x, window='1y'), rather than looping over each row, since the loop would be slow.
There is a method for doing this in pandas; see the documentation at http://pandas.pydata.org/pandas-docs/dev/computation.html#computing-rolling-pairwise-correlations:
model = pandas.ols(y=df.y, x=df.x, window=250)
You just have to specify your period as a number of rows in the frame instead of '1y'. There are also many additional options that you might find useful for your data. All the rolling OLS statistics are in model; for example,
model.beta.plot()
shows the rolling beta.
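(A side note beyond the original answer: pandas.ols was removed from later pandas versions. A rough modern equivalent, using statsmodels' RollingOLS with the window still counted in rows, might look like this:)
import statsmodels.api as sm
from statsmodels.regression.rolling import RollingOLS

# df has x and y columns as in the question
X = sm.add_constant(df['x'])
res = RollingOLS(df['y'], X, window=250).fit()
res.params['x'].plot()  # rolling beta, analogous to model.beta.plot()
For built-in rolling statistics (though not OLS), modern pandas also accepts time-based offsets directly given a datetime index, e.g. df.y.rolling('365D').mean().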