A few doubts on PCA and HCA analyses on bivariate Raman Spectroscopy data - data-visualization

I'm an undergrad working on a project, trying to replicate (imitate, to be precise) the dendrogram and PCA scatterplots shown in reference papers.
I have 265 columns of Raman spectra ranging from 401 to 1802 cm⁻¹. For each observation of the Raman spectra, I have corresponding observations of 30 Test and 30 Control groups, belonging to Class 1 and Class 2 (the Test group is the disease data and the Control group is the healthy data). (Here is a screenshot of how much data I could fit into 1920x1080 haha!)
I did an HCA by extracting just the Test and Control data, deleting the top two columns so that I was left with nothing but good ol' numbers, and deleting the Raman spectroscopy column entirely.
I followed the standard procedure, used the following code, and got the following output.
import matplotlib.pyplot as plt
import pandas as pd
import scipy.cluster.hierarchy as shc
from sklearn.preprocessing import StandardScaler
df = pd.read_csv('/home/stormageddon/HCA PCA Analyses/Code/use.csv')
# Standardize every column to zero mean and unit variance
scaler = StandardScaler()
scaled_data = scaler.fit_transform(df)
plt.figure(figsize=(10, 7))
plt.title("HCA")
# Average linkage (UPGMA) on Euclidean distances between the rows of scaled_data
clusters = shc.linkage(scaled_data,
                       method='average',
                       metric="euclidean")
shc.dendrogram(Z=clusters)
plt.show()
The output looks like this.
So a couple of questions regarding this....
(I followed the exact methods used in the Raman spectroscopy literature, i.e. agglomerative clustering and the UPGMA algorithm.)
The final fine-tuned result I was aiming for is here. (Ref: Barnas et al.)
My main concern is that in my case the points overlap and the graph does not look good overall. How do I:
A] Reduce the number of points so that it resembles the ideal image?
B] Label the points like those in the image, colour-coded in two colours (for my two classes)?
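For A and B, here is a minimal sketch rather than a definitive recipe. linkage clusters the rows of whatever matrix it receives, so if the CSV stores wavenumbers as rows and samples as columns (an assumption about the file layout), the dendrogram has one leaf per wavenumber instead of one per sample, and transposing the data would be needed first. The sketch uses synthetic spectra with one row per sample; the sample names (T1..T30, C1..C30) and the colours are made up for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt
import scipy.cluster.hierarchy as sch
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real spectra (assumption): rows = samples, columns = wavenumbers
rng = np.random.default_rng(0)
n_per_class, n_wavenumbers = 30, 265
test = rng.normal(1.0, 1.0, size=(n_per_class, n_wavenumbers))
control = rng.normal(0.0, 1.0, size=(n_per_class, n_wavenumbers))
spectra = np.vstack([test, control])

# Hypothetical sample names: the first letter encodes the class
labels = [f"T{i + 1}" for i in range(n_per_class)] + [f"C{i + 1}" for i in range(n_per_class)]

scaled = StandardScaler().fit_transform(spectra)
Z = sch.linkage(scaled, method="average", metric="euclidean")  # UPGMA

fig, ax = plt.subplots(figsize=(12, 6))
dn = sch.dendrogram(Z, labels=labels, leaf_rotation=90, ax=ax)

# Colour each leaf label according to its class
class_colors = {"T": "red", "C": "blue"}
for tick in ax.get_xticklabels():
    tick.set_color(class_colors[tick.get_text()[0]])

ax.set_title("HCA (average linkage, Euclidean)")
fig.tight_layout()
# plt.show()  # uncomment to display the figure
```

If 60 leaves are still too crowded, dendrogram also accepts truncate_mode='lastp' together with p, which collapses the tree to the last p merged clusters.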
I did the PCA similarly, but my scatterplot contains points of the same colour.
How do I colourcode the dots in the PCA plot as well?
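A common way to colour the PCA scores by class is to call scatter once per class with a boolean mask. The sketch below is only an illustration on synthetic data: the class names, colours and array layout (one row per sample) are assumptions, not taken from the actual file.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the spectra (assumption): 30 Test rows followed by 30 Control rows
rng = np.random.default_rng(0)
spectra = np.vstack([
    rng.normal(1.0, 1.0, size=(30, 265)),
    rng.normal(0.0, 1.0, size=(30, 265)),
])
classes = np.array(["Test"] * 30 + ["Control"] * 30)

# Standardize, then project onto the first two principal components
scores = PCA(n_components=2).fit_transform(StandardScaler().fit_transform(spectra))

fig, ax = plt.subplots()
for name, color in [("Test", "red"), ("Control", "blue")]:
    mask = classes == name  # select the rows belonging to this class
    ax.scatter(scores[mask, 0], scores[mask, 1], c=color, label=name)

ax.set_xlabel("PC1")
ax.set_ylabel("PC2")
ax.legend()
# plt.show()  # uncomment to display the figure
```

With real data, the classes array would come from whatever column or header row stores the Test/Control assignment.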
P.S.: I did the HCA correctly, right? Or is that part wrong?

Related

How to plot multiple graphs in Matplotlib from the numpy datasets I am working on?

I am new to programming, and I'm having difficulty plotting multiple graphs. What I am trying to get is a graph containing values of K along the Y-axis plotted against values of Dk. I need this graph to contain all the K=f(Dk) for each temperature Tcwin in range (10,40,1)
While the code seems to be working well and I have obtained the data I was trying to calculate, I can't seem to plot them. Any help would be appreciated.
import numpy as np
import pandas as pd
A=3000
d_in=20
CF=0.85
w=2.26
Tcwin=12
Dk=np.arange(27.418,301.598,27.418)
dk=(Dk*1000/(A*3.600))
cp=4.19
Gw=13000
e=2.718281828
f_velocity=w*1.1/(20**0.25)
for Tcwin in range(10,40,1):
    while Tcwin<35:
        print(Tcwin)
        f_w=0.12*CF*(1+0.15*Tcwin)
        Ф_в=f_velocity**f_w
        K=CF*4070*((1.1*w/(d_in**0.25))**(0.12*CF*(1+0.15*Tcwin)))*(1-(((35-Tcwin)**2)*(0.52-0.0072*dk)*(CF**0.5))/1000)
        n=(K*A)/(cp*Gw*1000)
        Tcwout_theor=Tcwin+(Dk*2225/(cp*Gw))
        Subcooling_theor=(Tcwout_theor-Tcwin)/(e**(K*A/(cp*(Gw*1000/3600)*1000)))
        TR_theor=Tcwout_theor-Tcwin
        Tsat_theor=Tcwout_theor+Subcooling_theor
        print(K)
        print(Tcwout_theor)
        print(Subcooling_theor)
        print(Tsat_theor)
        Tcwin+=1
    else:
        print('Loop done')
Is this what you are looking for? Plotting after each run:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt  # needed for plt.plot below
A=3000
d_in=20
CF=0.85
w=2.26
Tcwin=12
Dk=np.arange(27.418,301.598,27.418)
dk=(Dk*1000/(A*3.600))
cp=4.19
Gw=13000
e=2.718281828
f_velocity=w*1.1/(20**0.25)
for Tcwin in range(10,40,1):
    while Tcwin<35:
        print(Tcwin)
        f_w=0.12*CF*(1+0.15*Tcwin)
        Ф_в=f_velocity**f_w
        K=CF*4070*((1.1*w/(d_in**0.25))**(0.12*CF*(1+0.15*Tcwin)))*(1-(((35-Tcwin)**2)*(0.52-0.0072*dk)*(CF**0.5))/1000)
        n=(K*A)/(cp*Gw*1000)
        Tcwout_theor=Tcwin+(Dk*2225/(cp*Gw))
        Subcooling_theor=(Tcwout_theor-Tcwin)/(e**(K*A/(cp*(Gw*1000/3600)*1000)))
        TR_theor=Tcwout_theor-Tcwin
        Tsat_theor=Tcwout_theor+Subcooling_theor
        print(K)
        print(Tcwout_theor)
        print(Subcooling_theor)
        print(Tsat_theor)
        Tcwin+=1
        plt.plot(K,dk) #---------------> this is the code for plotting
    else:
        print('Loop done')
plt.show()
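Since the question asks for K on the Y-axis against Dk, with one curve per temperature, a tidier variant might look like the sketch below. It reuses the constants and the K formula from the question; the helper name heat_transfer_coefficient is made up here, and the temperature range 10..34 is an assumption based on the `Tcwin<35` condition in the original loop.

```python
import numpy as np
import matplotlib.pyplot as plt

# Constants copied from the question
A = 3000
d_in = 20
CF = 0.85
w = 2.26
Dk = np.arange(27.418, 301.598, 27.418)
dk = Dk * 1000 / (A * 3.600)


def heat_transfer_coefficient(Tcwin):
    """K as a function of Dk (through dk) for one inlet temperature Tcwin."""
    exponent = 0.12 * CF * (1 + 0.15 * Tcwin)
    correction = 1 - ((35 - Tcwin) ** 2) * (0.52 - 0.0072 * dk) * (CF ** 0.5) / 1000
    return CF * 4070 * ((1.1 * w / d_in ** 0.25) ** exponent) * correction


fig, ax = plt.subplots()
temperatures = range(10, 35)  # assumption: same values the original while loop visits
for Tcwin in temperatures:
    # One curve per temperature: Dk on the x-axis, K on the y-axis
    ax.plot(Dk, heat_transfer_coefficient(Tcwin), label=f"Tcwin={Tcwin}")

ax.set_xlabel("Dk")
ax.set_ylabel("K")
ax.legend(ncol=3, fontsize="small")
# plt.show()  # uncomment to display the figure
```

A single for loop is enough here; the nested while in the original recomputes and re-plots the same curves on every outer iteration.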

matplotlib - seaborn - the numbers on the correlation plots are not readable

The plot below shows the correlation for one column. The problem is that the numbers are not readable, because there are many columns in it.
How is it possible to show only 5 or 6 most important columns and not all of them with very low importance?
plt.figure(figsize=(20,3))
sns.heatmap(df.corr()[['price']].sort_values('price', ascending=False).iloc[1:].T, annot=True,
cmap='Spectral_r', vmax=0.9, vmin=-0.31)
You can limit the cells shown via .iloc[1:7]. If you also want to show the highest negative values, you could create a second plot with .iloc[-6:]. To have both together, you could use numpy's slicing function and write .iloc[np.r_[1:4, -3:0]].
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.DataFrame(np.random.rand(7, 27), columns=['price'] + [*'abcdefghijklmnopqrstuvwxyz'])
plt.figure(figsize=(20, 3))
sns.heatmap(df.corr()[['price']].sort_values('price', ascending=False).iloc[1:7].T,
annot=True, annot_kws={'rotation':90, 'size': 20},
cmap='Spectral_r', vmax=0.9, vmin=-0.31)
plt.show()
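To make the np.r_ variant concrete, here is a small sketch on random data (the column names are placeholders). np.r_ concatenates the two slices into one array of positions, so a single .iloc call picks the three strongest positive and the three most negative correlations while skipping position 0, which is 'price' itself.

```python
import numpy as np
import pandas as pd

# Random placeholder data: 'price' plus ten made-up feature columns
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((50, 11)), columns=["price"] + list("abcdefghij"))

# Correlation of every column with 'price', strongest positive first
corr = df.corr()["price"].sort_values(ascending=False)

# Positions 1..3 (top, skipping 'price' at position 0) and the last three
positions = np.r_[1:4, -3:0]
selected = corr.iloc[positions]
print(selected)
```

The resulting Series can be turned back into a one-row frame with selected.to_frame().T before passing it to sns.heatmap.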
annot can also be a list of labels. Using this, you can define a string matrix that you use to display the desired numbers and set the others to an empty string.
import matplotlib.pyplot as plt
import numpy as np; np.random.seed(0)
import seaborn as sns; sns.set_theme()
import pandas as pd
from string import ascii_letters
# generate random data
rs = np.random.RandomState(33)
df = pd.DataFrame(data=rs.normal(size=(100, 26)),
columns=list(ascii_letters[26:]))
importance_index = 5 # until which idx to hide values
data = df.corr()[['A']].sort_values('A', ascending=False).iloc[1:].T
labels = data.astype(str) # make a str-copy
labels.iloc[0,:importance_index] = ' ' # mask columns that you want to hide
sns.heatmap(data, annot=labels, cmap='Spectral_r', vmax=0.9, vmin=-0.31, fmt='', annot_kws={'rotation':90})
plt.show()
The output on some random data:
This works, but it has its limits, particularly with setting fmt='' (you can no longer use it to conveniently format decimals and have to do that manually). I would also question whether this approach is even the best one to take here. I think consistency in plots is quite important. I would rather evaluate whether we can rotate the heatmap labels (I've included that above) or leave them out completely, since they are technically redundant given the color-coding. Alternatively, you could plot only the cells with the "important" values.
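As a sketch of the manual formatting mentioned above: the label matrix can be built with two decimals before masking, so that fmt='' no longer matters. The data is random, and the choice to keep the first n_show cells (the strongest correlations after sorting) is an assumption about what counts as "important".

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(33)
df = pd.DataFrame(rng.normal(size=(100, 6)), columns=list("ABCDEF"))

# One-row frame of correlations with 'A', strongest first, without 'A' itself
data = df.corr()[["A"]].sort_values("A", ascending=False).iloc[1:].T

# Format every value with two decimals, producing a frame of strings
labels = data.apply(lambda col: col.map("{:.2f}".format))

# Keep labels for the first n_show cells only and blank out the rest
n_show = 2
labels.iloc[0, n_show:] = ""
print(labels)
```

labels can then be passed as annot=labels together with fmt='' in the sns.heatmap call shown earlier.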

How can I plot two columns with string values in a dataset with Matplotlib?

I have the following dataset and I want to create a plot that compares two columns with each other.
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
ds=pd.read_csv('http://bit.ly/uforeports') # My dataset
ds.head(5) # Show only the first 5 rows
ds1= ds.head(4).drop(['Colors Reported','State'],axis=1) # Drop unnecessary columns
print(ds1)
Now I want to compare "City" and "Shape Reported" with the help of plotting. I found something with pandas, but it is not so elegant!
x=ds.loc[0:100,['State']]
y=ds.loc[0:100,['Shape Reported']]
x.apply(pd.value_counts).plot(kind='bar', subplots=True)
y.apply(pd.value_counts).plot(kind='bar', subplots=True)
Do you know a better solution with Matplotlib to this problem?
This is what I want
It's not exactly clear how you want to compare them.
The simplest way of drawing a bar chart is:
df['State'].value_counts().plot.bar()
df['Shape Reported'].value_counts().plot.bar()
If you just want to do it for the first 100 rows as in your example, just add head(100):
df['State'].head(100).value_counts().plot.bar()
df['Shape Reported'].head(100).value_counts().plot.bar()
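If the goal is to see both columns in a single chart, one option (a sketch, not necessarily what the asker had in mind) is to cross-tabulate them and draw a stacked bar chart. The small frame below is made-up sample data using the same column names.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up rows standing in for the UFO reports
df = pd.DataFrame({
    "State": ["NY", "NJ", "NY", "CO", "NJ", "NY"],
    "Shape Reported": ["TRIANGLE", "OTHER", "OVAL", "DISK", "OTHER", "TRIANGLE"],
})

# Rows = states, columns = shapes, cells = number of reports
counts = pd.crosstab(df["State"], df["Shape Reported"])

# One bar per state, split by shape
ax = counts.plot.bar(stacked=True)
ax.set_ylabel("Number of reports")
# plt.show()  # uncomment to display the figure
```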
EDIT:
To compare the two values you can plot a bivariate distribution plot. This is easily done with seaborn:
import seaborn as sns
sns.displot(df,x='State', y='Shape Reported', height=6, aspect=1.33)
Result:

Histogram in Bokeh charts takes a looong time

I am trying to move from matplotlib to bokeh. However, I am finding some annoying features. The latest one I encountered was that it took several minutes to make a histogram of about 1.5M entries; it would have taken a fraction of a second with Matplotlib. Is that normal? And if so, what's the reason?
from bokeh.charts import Histogram, output_file, show
import pandas as pd
output_notebook()
jd1 = pd.read_csv("somefile.csv")
p = Histogram(jd1['QTY'], bins=50)
show(p)
I'm not sure offhand what might be going on with Histogram in your case. Without the data file it's impossible to try and reproduce or debug. But in any case bokeh.charts does not really have a maintainer at the moment, so I would actually just recommend using bokeh.plotting to create your histogram. The bokeh.plotting API is stable (for several years now) and extensively documented. It's a few more lines of code but not many:
import numpy as np
from bokeh.plotting import figure, show, output_notebook
output_notebook()
# synthesize example data
measured = np.random.normal(0, 0.5, 10_000_000)  # 10 million samples, matching the timing quoted below
hist, edges = np.histogram(measured, density=True, bins=50)
p = figure(title="Normal Distribution (μ=0, σ=0.5)")
p.quad(top=hist, bottom=0, left=edges[:-1], right=edges[1:], line_color=None)
show(p)
As you can see that takes (on my laptop) ~half a second for a 10 million point histogram, including generating synthetic data and binning it.
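Applied to the question's data, the same idea would look roughly like the sketch below. The binning happens in NumPy, so only 50 bar heights and 51 edges are handed to the plot, regardless of how many rows the column has. The jd1 frame here is synthetic; the column name QTY is taken from the question and the values are made up.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for pd.read_csv("somefile.csv") with about 1.5M rows
rng = np.random.default_rng(0)
jd1 = pd.DataFrame({"QTY": rng.integers(1, 100, size=1_500_000)})

# Bin once in NumPy; the plot only ever sees the aggregated result
hist, edges = np.histogram(jd1["QTY"].to_numpy(), bins=50)
print(len(hist), len(edges))
```

hist and edges can then be passed to p.quad(top=hist, bottom=0, left=edges[:-1], right=edges[1:]) exactly as in the answer's snippet.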

Cutting up the x-axis to produce multiple graphs with seaborn?

The following code when graphed looks really messy at the moment. The reason is I have too many values for 'fare'. 'Fare' ranges from [0-500] with most of the values within the first 100.
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
titanic = sns.load_dataset("titanic")
y =titanic.groupby([titanic.fare//1,'sex']).survived.mean().reset_index()
sns.set(style="whitegrid")
g = sns.factorplot(x='fare', y= 'survived', col = 'sex', kind ='bar' ,data= y,
size=4, aspect =2.5 , palette="muted")
g.despine(left=True)
g.set_ylabels("Survival Probability")
g.set_xlabels('Fare')
plt.show()
I would like to try slicing up the 'fare' axis of the plots into subsets, but would like to see all the graphs at the same time on one screen. I was wondering if this is possible without having to resort to groupby.
I will have to play around with the values of 'fare' to see what I would want each graph to represent, but as a sample let's break up the graph into these 'fare' values.
[0-18]
[18-35]
[35-70]
[70-300]
[300-500]
So the total would be 10 graphs on one page, because of the juxtaposition with the opposite sex.
Is it possible with Seaborn? Do I need to do a lot of configuring with matplotlib? Thanks.
Actually I wrote a little blog post about this a while ago. If you are plotting histograms you can use the by keyword:
import matplotlib.pyplot as plt
import seaborn.apionly as sns
sns.set() #rescue matplotlib's styles from the early '90s
data = sns.load_dataset('titanic')
data.hist(by='class', column = 'fare')
plt.show()
Otherwise if you're just plotting value-counts, you have to roll your own grid:
def categorical_hist(self,column,by,layout=None,legend=None,**params):
    from math import sqrt, ceil
    if layout is None:
        s = ceil(sqrt(self[column].unique().size))
        layout = (s,s)
    return self.groupby(by)[column]\
        .value_counts()\
        .sort_index()\
        .unstack()\
        .plot.bar(subplots=True,layout=layout,legend=None,**params)
categorical_hist(data, by='class', column='embark_town')
Edit: If you want the survival rate by fare range, you could do something like this:
data.groupby(pd.cut(data.fare, 10)).apply(lambda x: x.survived.sum() / len(x))
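With the fare ranges listed in the question instead of ten equal-width bins, pd.cut can take the bin edges explicitly. The sketch below uses a tiny made-up frame with the same column names as the titanic dataset, so that it runs without downloading anything; the range labels are just strings chosen for readability.

```python
import pandas as pd

# Made-up passengers with the same columns as the titanic dataset
data = pd.DataFrame({
    "fare": [7.25, 7.92, 8.05, 26.0, 30.0, 53.1, 71.28, 263.0],
    "sex": ["male", "female", "male", "male", "female", "female", "female", "male"],
    "survived": [0, 1, 0, 0, 1, 1, 1, 0],
})

# Bin edges from the question; include_lowest keeps a fare of exactly 0 in the first bin
edges = [0, 18, 35, 70, 300, 500]
names = ["0-18", "18-35", "35-70", "70-300", "300-500"]
data["fare_range"] = pd.cut(data["fare"], bins=edges, labels=names, include_lowest=True)

# Mean of the 0/1 survived column = survival rate per fare range and sex
rates = (
    data.groupby(["fare_range", "sex"], observed=True)["survived"]
    .mean()
    .unstack()
)
print(rates)
```

Once fare_range exists as a column, it can also serve as a faceting variable, for example through the col argument of sns.catplot, to get one panel per range.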