How to visualize optimal number of clusters from diana using fviz_nbclust - hierarchical-clustering

My data is a data frame with 2141 rows and 11 columns:
mydatad: data frame
mydatad.diana : result of clustering with diana method
I obtained the optimal number of clusters as follows:
mydata.nc <- NbClust(data = NULL, diss = mydatad.diana$diss,
distance = NULL, min.nc = 2, max.nc = 50,
method= "single", index = "silhouette")
from str(mydata.nc) I get that the best nc is 2. But I tried to visualize the results with no success.
P1 <- fviz_nbclust(T2141.nc, method = "silhouette")
returns Number_clusters and Value_Index, but no graphic
and I don’t understand how to use nf_viz using the output of NbClust as x (the first argument in fviz_nbclust).
Then I tried to use the function as in https://bradleyboehmke.github.io/HOML/hierarchical.html , my code is:
p1 <- fviz_nbclust(mydatad, FUN = hcut, method = "wss",
k.max = 10)
It works when I use hcut with the methods silhouette and wss (does not work with gap_stat method) but when I replace hcut by diana it returns an error message. I don’t understand what the default option of hcut is and how to replace it with diana information.
I tried coding for this replacement trying to follow the factoextra documentation but I do not understand how to incorporate the diana information in the hcut here.
I don’t know whether the second argument above is FUN or FUNclust , I tried many options and did not work.
I kindly request help on how to use fviz_nbclust with the output of NbClust and with a diana result.

Try
fviz_nbclust(data.frame, hcut, hc_func = "diana")

Related

Try to check if column consists out of 3 numbers, and change the value to the first number of the column

I'm trying to create a new column called 'team'. In the image below you see different type of codes. The first number of the code is the team someone's in, IF the number consists out of 3 characters. E.G: 315 = team 3, 240 = team 2, and 3300 = NULL.
In the image below you can see my data flow so far and the expression I have tried, but doesn't work.
You forget parenthesis () in your regex :
Try :
^([0-9]{3})$
Demo

[pandas]Dividing all elements of columns in df with elements in another column (Same df)

I'm sorry, I know this is basic but I've tried to figure it out myself for 2 days by sifting through documentation to no avail.
My code:
import numpy as np
import pandas as pd
name = ["bob","bobby","bombastic"]
age = [10,20,30]
price = [111,222,333]
share = [3,6,9]
list = [name,age,price,share]
list2 = np.transpose(list)
dftest = pd.DataFrame(list2, columns = ["name","age","price","share"])
print(dftest)
name age price share
0 bob 10 111 3
1 bobby 20 222 6
2 bombastic 30 333 9
Want to divide all elements in 'price' column with all elements in 'share' column. I've tried:
print(dftest[['price']/['share']]) - Failed
dftest['price']/dftest['share'] - Failed, unsupported operand type
dftest.loc[:,'price']/dftest.loc[:,'share'] - Failed
Wondering if I could just change everything to int or float, I tried:
dftest.astype(float) - cant convert from str to float
Ive tried iter and items methods but could not understand the printouts...
My only suspicion is to use something called iterate, which I am unable to wrap my head around despite reading other old posts...
Please help me T_T
Apologies in advance for the somewhat protracted answer, but the question is somewhat unclear with regards to what exactly you're attempting to accomplish.
If you simply want price[0]/share[0], price[1]/share[1], etc. you can just do:
dftest['price_div_share'] = dftest['price'] / dftest['share']
The issue with the operand types can be solved by:
dftest['price_div_share'] = dftest['price'].astype(float) / dftest['share'].astype(float)
You're getting the cant convert from str to float error because you're trying to call astype(float) on the ENTIRE dataframe which contains string columns.
If you want to divide each item by each item, i.e. price[0] / share[0], price[1] / share[0], price[2] / share[0], price[0] / share[1], etc. You would need to iterate through each item and append the result to a new list. You can do that pretty easily with a for loop, although it may take some time if you're working with a large dataset. It would look something like this if you simply want the result:
new_list = []
for p in dftest['price'].astype(float):
for s in dftest['share'].astype(float):
new_list.append(p/s)
If you want to get this in a new dataframe you can simply save it to a new dataframe using pd.Dataframe() method:
new_df = pd.Dataframe(new_list, columns=[price_divided_by_share])
This new dataframe would only have one column (the result, as mentioned above). If you want the information from the original dataframe as well, then you would do something like the following:
new_list = []
for n, a, p in zip(dftest['name'], dftest['age'], dftest['price'].astype(float):
for s in dftest['share'].astype(float):
new_list.append([n, a, p, s, p/s])
new_df = pd.Dataframe(new_list, columns=[name, age, price, share, price_div_by_share])
If you check the data types of your dataframe, you will realise that they are all strings/object type :
dftest.dtypes
name object
age object
price object
share object
dtype: object
first step will be to change the relevant columns to numbers - this is one way:
dftest = dftest.set_index("name").astype(float)
dftest.dtypes
age float64
price float64
share float64
dtype: object
This way you make the names a useful index, and separate it from the numeric data. This is just a suggestion; you may have other reasons to leave names as a columns - in that case, you have to individually change the data types of each column.
Once that is done, you can safely execute your code :
dftest.div(dftest.share,axis=0)
age price share
name
bob 3.333333 37.0 1.0
bobby 3.333333 37.0 1.0
bombastic 3.333333 37.0 1.0
I assume this is what you expect as your outcome. If not, you can tweak it. Main part is get your data types as numbers before computation/division can occur.

How to make a new variable based on 30 other variables

I have 30 variables on family history of cancer i.e. breast cancer father, breast cancer mother, breast cancer sister etc. I would like to make a new variable and give it a value of "1" if in one of my columns there is a 1.
Thus:
I have 30 variables with answers 1 to 3; 1 is yes, 2 is no and, 3 is unknown if one of the 30 variables is given a 1 I would like my new variable to take on the value 1.
Does someone know how I can do this?
You can create a list instead of separate 30 variables and then filter it out to create a new variable. This will make it more dynamic.
// This will be the cancer history for a single family
var cancerHistory = [];
// Add dummy data
cancerHistory.push('yes');
cancerHistory.push('no')
cancerHistory.push('unknown');
cancerHistory.push('no');
// Check if at least one of them is "yes"
var hasHistoryOfCancer = cancerHistory.indexOf('yes') > -1;
alert(hasHistoryOfCancer); // true
You can use a for loop. You did not mention the language so I am writing the code in Python which is easy to understand. If you want it in other language you can use the similar approach and apply it
import pandas as pd
new_var = []
df = pd.read_csv("DataFile.csv") # Convert data file to csv and put name it.
for i in range(len(df)):
x = [df['column1'][i], df['column2'][i] ...., df['column30'][i]]
if (1 in x): new_var.append(1)
else: new_var.append(0)
df['new_var'] = new_var
df.to_csv('NewDataFile.csv', sep=',', encoding='utf-8')

Finding the count of a set of substrings in pandas dataframe

I am given a set of substrings. I need to find the count of occurrence of all those substrings in a particular column in a dataframe. The relevant datframe would look like this
training['concat']
0 svAxu$paxArWAn
1 xvAxaSa$varRANi
2 AxAna$xurbale
3 go$BakwAH
4 viXi$Bexena
5 nIwi$kuSalaM
6 lafkA$upamam
7 yaSas$lipsoH
8 kaSa$AGAwam
9 hewumaw$uwwaram
10 varRa$pUgAn
My set of substrings is a dictionary, where the keys are the substrings and values are the probabilities with which they occur
reg = {'anuBavAn':0.35, 'a$piwra':0.2 ...... 'piwra':0.7, 'pa':0.03, 'a':0.0005}
#The length of dicitioanry is 2000
Particularly I need to find those substrings which occur more than twice
I have written the following code that performs the task. Is there a more elegant pythonic way or panda specific way to achieve the same as the current implementation is taking quite some time to execute.
elites = dict()
for reg_pat in reg_:
count = 0
eliter = len(training[training['concat'].str.contains(reg_pat)]['concat'])
if eliter >=3:
elites[reg_pat] = reg_[reg_pat]
You can use apply instead str.contains, it is faster:
reg_ = {'anuBavAn':0.35, 'a$piwra':0.2, 'piwra':0.7, 'pa':0.03, 'a':0.0005}
elites = dict()
for reg_pat in reg_:
if training['concat'].apply(lambda x: reg_pat in x).sum() >= 3:
elites[reg_pat] = reg_[reg_pat]
print (elites)
{'a': 0.0005}
Hopefully I have interpreted your question correctly. I'm inclined to stay away from regex here (in fact, I've never used it in conjunction with pandas), but it's not wrong, strictly speaking. In any case, I find it hard to believe that any regex operations are faster than a simple in check, but I could be wrong on that.
for substr in reg:
totalStringAppearances = training.apply((lambda string: substr in string))
totalStringAppearances = totalStringAppearances.sum()
if totalStringAppearances > 2:
reg[substr] = totalStringAppearances / len(training)
else:
# do what you want to with the very rare substrings
Some gotchas:
If you wanted something like a substring 'a' in 'abcdefa' to return 2, then this will not work. It merely checks for existence of the substring in each string.
Inside the apply(), I am using a potentially unreliable exploitation of booleans. See this question for more details.
Post-edit: Jezrael's answer is more complete as it uses the same variable names. But, in a simple case, regarding regex vs. apply and in, I validate his claim, and my presumption:

Error Handling with Live Data Matlab

I am using a live data API that is returning the next arriving trains. I plan on giving the user the next 5 trains arriving. If there are less than 5 trains arriving, how you handle that? I am having trouble thinking of a way, I was thinking a way with if statements but don't know how I would set them up.
time1Depart = dataReturnedFromLiveAPI{1,1}.orig_departure_time;
time2Depart = dataReturnedFromLiveAPI{1,2}.orig_departure_time;
time3Depart = dataReturnedFromLiveAPI{1,3}.orig_departure_time;
time4Depart = dataReturnedFromLiveAPI{1,4}.orig_departure_time;
time5Depart = dataReturnedFromLiveAPI{1,5}.orig_departure_time;
time1Arrival = dataReturnedFromLiveAPI{1,1}.orig_arrival_time;
time2Arrival = dataReturnedFromLiveAPI{1,2}.orig_arrival_time;
time3Arrival = dataReturnedFromLiveAPI{1,3}.orig_arrival_time;
time4Arrival = dataReturnedFromLiveAPI{1,4}.orig_arrival_time;
time5Arrival = dataReturnedFromLiveAPI{1,5}.orig_arrival_time;
The code right now uses a matrix that goes from 1:numoftrains but I am using just the first five.
It's a bad practice to assign individual value to a separate variable. Better if you pass all related values to a vector or cell array depending on class of orig_departure_time and orig_arrival_time.
It looks like dataReturnedFromLiveAPI is a cell array of structures. Then you can do:
timeDepart = cellfun(#(x), x.orig_departure_time, ...
dataReturnedFromLiveAPI(1,1:min(5,size(dataReturnedFromLiveAPI,2))), ...
'UniformOutput',0 );
timeArrival = cellfun(#(x), x.orig_arrival_time, ...
dataReturnedFromLiveAPI(1,1:min(5,size(dataReturnedFromLiveAPI,2))), ...
'UniformOutput',0 );
Then you how to access the values one by one as
time1Depart = timeDepart{1};
If orig_departure_time and orig_arrival_time are numeric scalars, you can use ...'UniformOutput',1.... You will get output as a vector and can get single values with timeDepart(1).