How to find an average in an OOP class format with specific requirements (Python / OOP)

I have created 12 people, each with an age, gender, height, and weight, using an OOP class. However, I don't understand how to make the code pick out the specific attributes I want from the available objects. For example:
class Person:
    def __init__(self, age, gender, height, weight):
        self.height = height
        self.weight = weight
        self.age = age
        self.gender = gender
person1 = Person(19, "Female", 162, 63)
print(person1.weight)
person2 = Person(23, "Male", 170, 89)
print(person2.weight)
person3 = Person(34, "Male", 178, 88)
print(person3.weight)
person4 = Person(44, "Female", 169, 66)
print(person4.weight)
person5 = Person(23, "Female", 166, 64)
print(person5.weight)
person6 = Person(25, "Male", 180, 78)
print(person6.weight)
person7 = Person(18, "Female", 174, 70)
print(person7.weight)
person8 = Person(23, "Male", 190, 90)
print(person8.weight)
person9 = Person(34, "Female", 177, 72)
print(person9.weight)
person10 = Person(27, "Male", 187, 85)
print(person10.weight)
person11 = Person(22, "Male", 184, 80)
print(person11.weight)
person12 = Person(41, "Female", 158, 69)
print(person12.weight)
I want to find the average weight of the males in this group. I just don't understand how to make the code select only the males and average their weight. I added the height and age just to create a realistic environment, but those numbers are not important, since all I want to find is the males' average weight.

You have created 12 Person objects, from person1 through person12, and in each print statement you printed the weight field of one of them.
For the average, the formula in this case is:
(sum of weights where gender is "Male") / (number of Person objects where gender is "Male")
With your data that works out to (89 + 88 + 78 + 90 + 85 + 80) / 6 = 510 / 6 = 85.
So first you need to go over all your objects. There is a naive way of doing this:
male_weight_sum = person2.weight + person3.weight + person6.weight + person8.weight + person10.weight + person11.weight
total_males = 6 # after counting
average_male_weight = male_weight_sum / total_males
print(average_male_weight)
You can probably see that this code is too long, especially the first line. And did you notice that I had to check each person by hand to find out its gender? That doesn't scale in real programs.
A better, less tedious solution is to avoid creating many separate variables of the same type and purpose. You can use a list instead. Defining an empty list looks like this:
person_list = []  # a list with no items
Now let's add the 12 Person objects you created, all in one list literal:
person_list = [Person(19, "Female", 162, 63), Person(23, "Male", 170, 89), Person(34, "Male", 178, 88), Person(44, "Female", 169, 66), Person(23, "Female", 166, 64), Person(25, "Male", 180, 78), Person(18, "Female", 174, 70), Person(23, "Male", 190, 90), Person(34, "Female", 177, 72), Person(27, "Male", 187, 85), Person(22, "Male", 184, 80), Person(41, "Female", 158, 69)]
Or, more readably, write the same list across multiple lines:
person_list = [
    Person(19, "Female", 162, 63),
    Person(23, "Male", 170, 89),
    Person(34, "Male", 178, 88),
    Person(44, "Female", 169, 66),
    Person(23, "Female", 166, 64),
    Person(25, "Male", 180, 78),
    Person(18, "Female", 174, 70),
    Person(23, "Male", 190, 90),
    Person(34, "Female", 177, 72),
    Person(27, "Male", 187, 85),
    Person(22, "Male", 184, 80),
    Person(41, "Female", 158, 69),
]
Now that we have a list with the items we need to analyze, we can sum the males' weights, count how many males are in the list, and then calculate the average:
male_weight_sum = 0
total_males = 0
average_male_weight = 0
for i in range(len(person_list)):  # loop over the indices, from 0 to 11
    if person_list[i].gender == "Male":  # we only care about males
        male_weight_sum = male_weight_sum + person_list[i].weight
        total_males = total_males + 1
if total_males > 0:  # pre-check, because you can't divide by zero
    average_male_weight = male_weight_sum / total_males
print(average_male_weight)
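Once the objects are in a list, the same calculation can also be written more compactly with a list comprehension (a sketch, assuming person_list is defined as above):
male_weights = [p.weight for p in person_list if p.gender == "Male"]
if male_weights:  # avoid dividing by zero when there are no males
    print(sum(male_weights) / len(male_weights))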

One way you can do this is with the gc module. Looping through gc.get_objects(), we can check whether each object in the program is an instance of Person. For each male Person found, we increment a counter and add his weight to a running total, and at the end we divide. This is one way of writing that code:
import gc

class Person:
    def __init__(self, age, gender, height, weight):
        self.height = height
        self.weight = weight
        self.age = age
        self.gender = gender

# ...create person1 through person12 exactly as in the question...

total_weight = 0
number_of_males = 0
for instance in gc.get_objects():
    if isinstance(instance, Person):
        if instance.gender == "Male":  # match the capitalization used in your data
            number_of_males += 1
            total_weight += instance.weight
print(total_weight / number_of_males)
Or, using a list comprehension, we can do this:
import gc
male_weights = [instance.weight for instance in gc.get_objects()
                if isinstance(instance, Person) and instance.gender == "Male"]
print(sum(male_weights) / len(male_weights))
However, you should test these yourself, since I haven't had time to test them.

Related

Chi square function: how to?

I'm not sure whether what I'm doing is correct.
I wish to perform a chi-squared test on a dataset, but I have doubts about the result.
This is my dataset:
> dput(chi)
structure(list(`Age.(days)` = c("< 7 days", "7-10 days", "10-12 days",
"12-15 days", "15-20 days", "20-25 days"), Broods = c(6, 9, 10,
6, 14, 5), N.Carnus = c(92, 74, 48, 17, 37, 10)), row.names = c(NA,
6L), class = "data.frame")
And this is the test:
chi$"Age.(days)" <- as.character(chi$"Age.(days)")
chisq.test(table(chi$"Age.(days)", chi$"N.Carnus"))
What I wish to find out is whether there is a significant association between the age of the host and the number of parasites (coming from a certain number of broods).
Thank you :)

regex text parser

I have a dataframe like this:
ID Series
1102 [('taxi instructions', 13, 30, 'NP'), ('consistent basis', 31, 47, 'NP'), ('the atc taxi clearance', 89, 111, 'NP')]
1500 [('forgot data pages info', 0, 22, 'NP')]
649 [('hud', 0, 3, 'NP'), ('correctly fotr approach', 12, 35, 'NP')]
I am trying to parse the text in the column named Series into separate columns named Series1, Series2, etc., up to the largest number of parsed items in any row. This is my attempt:
df_parsed = df['Series'].str[1:-1].str.split(', ', expand=True)
I want something like this:
ID Series Series1 Series2 Series3
1102 [('taxi instructions', 13, 30, 'NP'), ('consistent basis', 31, 47, 'NP'), ('the atc taxi clearance', 89, 111, 'NP')] taxi instructions consistent basis the atc taxi clearance
1500 [('forgot data pages info', 0, 22, 'NP')] forgot data pages info
649 [('hud', 0, 3, 'NP'), ('correctly fotr approach', 12, 35, 'NP')] hud correctly fotr approach
The format of your final result is not easy to reproduce exactly, but maybe you can follow this concept to create your new columns:
def process(ls):
    # keep only the text (the first element) of each tuple
    return ' '.join([x[0] for x in ls])

df['Series_new'] = df['Series'].apply(process)
And if you want to create N new columns (N = the maximum length of the Series lists), I think you can calculate N first, then follow the concept above and fill in NaN where needed to create the N new columns, as in the sketch below.
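A minimal sketch of that idea, assuming the Series column already holds lists of tuples (if it holds strings, parse them first, e.g. with ast.literal_eval):
import pandas as pd

df = pd.DataFrame({
    'ID': [1102, 1500, 649],
    'Series': [
        [('taxi instructions', 13, 30, 'NP'), ('consistent basis', 31, 47, 'NP'), ('the atc taxi clearance', 89, 111, 'NP')],
        [('forgot data pages info', 0, 22, 'NP')],
        [('hud', 0, 3, 'NP'), ('correctly fotr approach', 12, 35, 'NP')],
    ],
})

# Keep only the text of each tuple, then expand into N columns;
# shorter rows are padded with NaN automatically.
texts = df['Series'].apply(lambda ls: [x[0] for x in ls])
expanded = pd.DataFrame(texts.tolist(), index=df.index)
expanded.columns = ['Series%d' % (i + 1) for i in range(expanded.shape[1])]
df = df.join(expanded)
print(df)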

numpy array giving decimals output

I am trying to create a numpy array from ones, x and X2, but the code below is giving me decimal points.
x = [10, 23, 25, 30, 37, 40, 46, 52, 60, 65]
y = [22, 46, 48, 62, 75, 90, 100, 110, 180, 150]
xx = np.array([float(v) for v in x])
yy = np.array([float(v) for v in y])
X2 = xx ** 2
ones = np.ones(len(x))
Xq = np.c_[ones,xx, X2]
Y = np.array(y).reshape(len(y),1)
print(Xq)
My output is:
[[ 1.00000000e+00 1.00000000e+01 1.00000000e+02]
[ 1.00000000e+00 2.30000000e+01 5.29000000e+02]
[ 1.00000000e+00 2.50000000e+01 6.25000000e+02]]
But I want it as ints.
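For reference, the decimals come from the float(v) conversions and from np.ones, which defaults to float64; a minimal sketch that keeps everything integer-valued:
import numpy as np

x = [10, 23, 25, 30, 37, 40, 46, 52, 60, 65]
xx = np.array(x)                   # integer dtype, no float conversion
X2 = xx ** 2                       # squaring integers stays integer
ones = np.ones(len(x), dtype=int)  # otherwise np.ones defaults to float64
Xq = np.c_[ones, xx, X2]           # all-integer inputs give an all-integer result
print(Xq)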

Pymc size / indexing issue

I am trying to model Kruschke's "filtration-condensation experiment" with pymc 2.3.5 (numpy 1.10.1).
Basically there are:
4 groups
each group has 40 individuals
each individual has 64 Bernoulli trials (correct/incorrect)
What I am modeling:
each individual's results follow a Binomial distribution (e.g. 45 correct out of 64)
my belief about each individual's performance is a Beta distribution
this Beta distribution is influenced by the group to which the individual belongs (through the parameters A = mu*kappa and B = (1-mu)*kappa)
my belief about how strong each group's influence is follows a Gamma distribution (the kappa variable)
my belief about each group's average is a Beta distribution (the mu variable)
The problem:
when I do the modeling with "size=" parameters, pymc gets lost
when I separate each distribution manually (no size=), pymc does a good job
I include the code below:
Data
import numpy as np
import seaborn as sns
import pymc as pm
from pymc.Matplot import plot as mcplot
%matplotlib inline
# Data
ncond = 4
nSubj = 40
trials = 64
N = np.repeat([trials], (ncond * nSubj))
z = np.array([45, 63, 58, 64, 58, 63, 51, 60, 59, 47, 63, 61, 60, 51, 59, 45,
61, 59, 60, 58, 63, 56, 63, 64, 64, 60, 64, 62, 49, 64, 64, 58, 64, 52, 64, 64,
64, 62, 64, 61, 59, 59, 55, 62, 51, 58, 55, 54, 59, 57, 58, 60, 54, 42, 59, 57,
59, 53, 53, 42, 59, 57, 29, 36, 51, 64, 60, 54, 54, 38, 61, 60, 61, 60, 62, 55,
38, 43, 58, 60, 44, 44, 32, 56, 43, 36, 38, 48, 32, 40, 40, 34, 45, 42, 41, 32,
48, 36, 29, 37, 53, 55, 50, 47, 46, 44, 50, 56, 58, 42, 58, 54, 57, 54, 51, 49,
52, 51, 49, 51, 46, 46, 42, 49, 46, 56, 42, 53, 55, 51, 55, 49, 53, 55, 40, 46,
56, 47, 54, 54, 42, 34, 35, 41, 48, 46, 39, 55, 30, 49, 27, 51, 41, 36, 45, 41,
53, 32, 43, 33])
condition = np.repeat([0,1,2,3], nSubj)
Does not work
# modeling
mu = pm.Beta('mu', 1, 1, size=ncond)
kappa = pm.Gamma('gamma', 1, 0.1, size=ncond)
# Prior
theta = pm.Beta('theta', mu[condition] * kappa[condition], (1 - mu[condition]) * kappa[condition], size=len(z))
# likelihood
y = pm.Binomial('y', p=theta, n=N, value=z, observed=True)
# model
model = pm.Model([mu, kappa, theta, y])
mcmc = pm.MCMC(model)
#mcmc.use_step_method(pm.Metropolis, mu)
#mcmc.use_step_method(pm.Metropolis, theta)
#mcmc.assign_step_methods()
mcmc.sample(100000, burn=20000, thin=3)
# outputs never converge and vary between new simulations
mcplot(mcmc.trace('mu'), common_scale=False)
Works
z1 = z[:40]
z2 = z[40:80]
z3 = z[80:120]
z4 = z[120:]
Nv = N[:40]
mu1 = pm.Beta('mu1', 1, 1)
mu2 = pm.Beta('mu2', 1, 1)
mu3 = pm.Beta('mu3', 1, 1)
mu4 = pm.Beta('mu4', 1, 1)
kappa1 = pm.Gamma('gamma1', 1, 0.1)
kappa2 = pm.Gamma('gamma2', 1, 0.1)
kappa3 = pm.Gamma('gamma3', 1, 0.1)
kappa4 = pm.Gamma('gamma4', 1, 0.1)
# Prior
theta1 = pm.Beta('theta1', mu1 * kappa1, (1 - mu1) * kappa1, size=len(Nv))
theta2 = pm.Beta('theta2', mu2 * kappa2, (1 - mu2) * kappa2, size=len(Nv))
theta3 = pm.Beta('theta3', mu3 * kappa3, (1 - mu3) * kappa3, size=len(Nv))
theta4 = pm.Beta('theta4', mu4 * kappa4, (1 - mu4) * kappa4, size=len(Nv))
# likelihood
y1 = pm.Binomial('y1', p=theta1, n=Nv, value=z1, observed=True)
y2 = pm.Binomial('y2', p=theta2, n=Nv, value=z2, observed=True)
y3 = pm.Binomial('y3', p=theta3, n=Nv, value=z3, observed=True)
y4 = pm.Binomial('y4', p=theta4, n=Nv, value=z4, observed=True)
# model
model = pm.Model([mu1, kappa1, theta1, y1, mu2, kappa2, theta2, y2,
mu3, kappa3, theta3, y3, mu4, kappa4, theta4, y4])
mcmc = pm.MCMC(model)
#mcmc.use_step_method(pm.Metropolis, mu)
#mcmc.use_step_method(pm.Metropolis, theta)
#mcmc.assign_step_methods()
mcmc.sample(100000, burn=20000, thin=3)
# outputs converge and do not differ much between simulations
mcplot(mcmc.trace('mu1'), common_scale=False)
mcplot(mcmc.trace('mu2'), common_scale=False)
mcplot(mcmc.trace('mu3'), common_scale=False)
mcplot(mcmc.trace('mu4'), common_scale=False)
mcmc.summary()
Can someone please explain to me why mu[condition] and kappa[condition] do not work? :)
I guess that not splitting the thetas into separate variables is the problem, but I cannot understand why. Maybe there is a way to pass a proper shape to the size= parameter on theta?
First of all, I can confirm that the first version doesn't lead to stable results. What I can't confirm is that the second one is much better; I have seen very different results also with the second code, with values for the first mu parameter varying between 0.17 and 0.9 for different runs.
The convergence problems can be cured by using good starting values for the Markov chain. This can be done by first doing a maximum a posteriori (MAP) estimate, and then starting the Markov chain from there. The MAP step is computationally inexpensive and leads to a converging Markov chain with reproducible results for both variants of your code. For reference and comparison: The values I see for the four mu parameters are around 0.94 / 0.86 / 0.72 and 0.71.
You can do the MAP estimation by inserting the following two lines of code right after the line in which you define your model with "model=pm.Model(...":
map_ = pm.MAP(model)
map_.fit()
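In context, the start of the sampling section of your first variant would then look like this (a sketch; everything else stays unchanged):
model = pm.Model([mu, kappa, theta, y])
map_ = pm.MAP(model)
map_.fit()
mcmc = pm.MCMC(model)
mcmc.sample(100000, burn=20000, thin=3)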
This technique is covered in more detail in Cameron Davidson-Pilon's Bayesian Methods for Hackers, together with other helpful topics around PyMC.

implementation of Hierarchial Agglomerative clustering

I am a newbie and just want to implement hierarchical agglomerative clustering for RGB images. For this I extract all RGB values from an image and process it. Next I find the pairwise distances and then build the linkage. Now, from the linkage, I want to extract my original data (i.e. the RGB values) at the indices of each cluster. Here is the code I have so far.
from PIL import Image
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, inconsistent, fcluster, dendrogram

image = Image.open('image.jpg')
image = image.convert('RGB')
im = np.array(image).reshape((-1, 3))
rgb = im.tolist()  # the raw RGB triples
X = pdist(im)
Y = linkage(X)
I = inconsistent(Y)
Based on the 4th column of the inconsistency matrix, I pick a small cutoff value in order to get the maximum number of clusters.
cutoff = 0.7
cluster_assignments = fcluster(Y, cutoff)  # fcluster, not fclusterdata: Y is already a linkage matrix
# Print the indices of the data points in each cluster.
num_clusters = cluster_assignments.max()
print "%d clusters" % num_clusters
indices = cluster_indices(cluster_assignments)  # cluster_indices is my own helper
for k, ind in enumerate(indices):
    print "cluster", k + 1, "is", ind
dendrogram(Y)
I got results like this
cluster 6 is [ 6 11]
cluster 7 is [ 9 12]
cluster 8 is [15]
This means cluster 6 contains the indices of leaves 6 and 11. At this point I am stuck on how to map these indices back to the original data (i.e. the RGB values), and from the RGB values back to the pixels in the image. Then I have to generate a codebook to implement the agglomerative clustering. I have no idea how to approach this task; I have read a lot of material but nothing has helped.
Here is my solution:
import numpy as np
from scipy.cluster import hierarchy
im = np.array([[54,101,9],[ 67,89,27],[ 67,85,25],[ 55,106,1],[ 52,108,0],
[ 55,78,24],[ 19,57,8],[ 19,46,0],[ 95,110,15],[112,159,57],
[ 67,118,26],[ 76,127,35],[ 74,128,30],[ 25,62,0],[100,120,9],
[127,145,61],[ 48,112,25],[198,25,21],[203,11,10],[127,171,60],
[124,173,45],[120,133,19],[109,137,18],[ 60,85,0],[ 37,0,0],
[187,47,20],[127,170,52],[ 30,56,0]])
groups = hierarchy.fclusterdata(im, 0.7)  # flat cluster label for every RGB row
idx_sorted = np.argsort(groups)           # order the rows by cluster label
group_sorted = groups[idx_sorted]
im_sorted = im[idx_sorted]
# boundaries where the cluster label changes
split_idx = np.where(np.diff(group_sorted) != 0)[0] + 1
np.split(im_sorted, split_idx)
output:
[array([[203, 11, 10],
[198, 25, 21]]),
array([[187, 47, 20]]),
array([[127, 171, 60],
[127, 170, 52]]),
array([[124, 173, 45]]),
array([[112, 159, 57]]),
array([[127, 145, 61]]),
array([[25, 62, 0],
[30, 56, 0]]),
array([[19, 57, 8]]),
array([[19, 46, 0]]),
array([[109, 137, 18],
[120, 133, 19]]),
array([[100, 120, 9],
[ 95, 110, 15]]),
array([[67, 89, 27],
[67, 85, 25]]),
array([[55, 78, 24]]),
array([[ 52, 108, 0],
[ 55, 106, 1]]),
array([[ 54, 101, 9]]),
array([[60, 85, 0]]),
array([[ 74, 128, 30],
[ 76, 127, 35]]),
array([[ 67, 118, 26]]),
array([[ 48, 112, 25]]),
array([[37, 0, 0]])]
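If you also need the original row (pixel) indices for each cluster rather than the values, you can split the sorted index array at the same boundaries (a sketch reusing the variables above):
idx_per_cluster = np.split(idx_sorted, split_idx)
Each entry of idx_per_cluster then lists the indices into im for one cluster, in the same order as the value arrays above.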