Curse of dimensionality when dimensions are fixed

I think there is a big misunderstanding in the data science community with respect to what exactly the 'curse of dimensionality' means. Please consider two examples:
1) I want to compare the distance between point A and point B in a 1000-dimensional and a 1001-dimensional space. This is an example of the curse of dimensionality, because there is a high chance that the distance will be larger in the 1001-dimensional space.
2) I want to compare the distance between point A and point B in a 1000-dimensional space with the distance between point A and point C in the same 1000-dimensional space. This is not the curse of dimensionality, because even though the number of dimensions is high, it is kept fixed.
Is the second statement correct? If the distance between points A and B is twice the distance between A and C in a 2-dimensional space, I would expect to see the same ratio for those points in a 1000-dimensional space. This would mean that the curse of dimensionality only occurs when one tries to compare distances across different numbers of dimensions.

I think I have answered this question with a little test, so I am going to leave it here in case it is useful for somebody:
I ran an experiment where I created a dummy data set with 3 observations (A=1, B=2, C=4), calculated the Euclidean distance between the points, and varied the number of features to see whether the ratios of distances between the points start to diverge as the number of features increases.
After 2 features:
0 1 2 ratio
0 0.00 1.41 4.24 3.00
1 0.00 1.41 2.83 2.00
2 0.00 2.83 4.24 1.50
After 100 features:
0 1 2 ratio
0 0.00 10.00 30.00 3.00
1 0.00 10.00 20.00 2.00
2 0.00 20.00 30.00 1.50
After 1000 features:
0 1 2 ratio
0 0.00 31.62 94.87 3.00
1 0.00 31.62 63.25 2.00
2 0.00 63.25 94.87 1.50
After 10000 features:
0 1 2 ratio
0 0.00 100.00 300.00 3.00
1 0.00 100.00 200.00 2.00
2 0.00 200.00 300.00 1.50
What does this mean? The curse of dimensionality does not occur when the number of dimensions is fixed. As the tables show, the ratio between the distance to the first closest point (1) and the distance to the second closest point (2) remains constant as the number of dimensions increases.
To put it in perspective: yes, you do travel farther between points, but that makes sense, as your total data space grows with each added feature. However, the ratio of distances between points stays the same, and that is what matters here. Since each observation's single value is simply repeated across all n features (which is what the distances above imply), every pairwise distance scales by the same factor, so the ratios cannot change.
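In symbols (a sketch under the assumption just stated, with a, b, c denoting the single values behind A, B, C):
$$d_n(A,B)=\sqrt{\sum_{i=1}^{n}(a-b)^2}=\sqrt{n}\,\lvert a-b\rvert
\quad\Rightarrow\quad
\frac{d_n(A,C)}{d_n(A,B)}=\frac{\sqrt{n}\,\lvert a-c\rvert}{\sqrt{n}\,\lvert a-b\rvert}=\frac{\lvert a-c\rvert}{\lvert a-b\rvert},$$
which is independent of n.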
To be honest, I do not see a problem with the famous 'curse of dimensionality' unless you are in a situation where you need to compare the same points across varying numbers of dimensions.
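For reference, here is a minimal sketch that reproduces tables like the ones above. The use of sklearn's NearestNeighbors is my own assumption about how the columns (distance to self, first-closest and second-closest point) were produced; the data construction simply repeats each observation's value across all features, as described above.
import numpy as np
from sklearn.neighbors import NearestNeighbors

values = [1.0, 2.0, 4.0]  # A, B, C

for n_features in (2, 100, 1000, 10000):
    # each observation repeats its single value across all features
    X = np.array([[v] * n_features for v in values])
    # distances to all 3 neighbours, sorted ascending; column 0 is the point itself (0.0)
    dist, _ = NearestNeighbors(n_neighbors=3).fit(X).kneighbors(X)
    ratio = dist[:, 2] / dist[:, 1]  # second-closest / first-closest
    print(f"After {n_features} features:")
    print(np.round(np.column_stack([dist, ratio]), 2))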

Related

Sklearn only predicts one class while dataset is fairly balanced (±80/20 split)

I am trying to come up with a way to check which factors are most influential in a person not paying back a loan (defaulting). I have worked with the sklearn library quite intensively, but I feel like I am missing something quite trivial...
The dataframe looks like this:
0 7590-VHVEG Female Widowed Electronic check Outstanding loan 52000 20550 108 0.099 288.205374 31126.180361 0 No Employed No Dutch No 0
1 5575-GNVDE Male Married Bank transfer Other 42000 22370 48 0.083 549.272708 26365.089987 0 Yes Employed No Dutch No 0
2 3668-QPYBK Male Registered partnership Bank transfer Study 44000 24320 25 0.087 1067.134272 26678.356802 0 No Self-Employed No Dutch No 0
The distribution of the "DefaultInd" column (target variable) is this:
0 0.835408
1 0.164592
Name: DefaultInd, dtype: float64
I have label-encoded the data to make it look like this:
CustomerID Gender MaritalStatus PaymentMethod SpendingTarget EstimatedIncome CreditAmount TermLoanMonths YearlyInterestRate MonthlyCharges TotalAmountPayments CurrentLoans SustainabilityIndicator EmploymentStatus ExistingCustomer Nationality BKR_Registration DefaultInd
0 7590-VHVEG 0 4 2 2 52000 20550 108 0.099 288.205374 31126.180361 0 0 0 0 5 0 0
1 5575-GNVDE 1 1 0 1 42000 22370 48 0.083 549.272708 26365.089987 0 1 0 0 5 0 0
2 3668-QPYBK 1 2 0 4 44000 24320 25 0.087 1067.134272 26678.356802 0 0 2 0 5 0 0
After that, I removed NaNs and cleaned it up some more (removing capitalization, punctuation, etc.).
After that, I try to run this cell:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report  # this import was missing

y = df['DefaultInd']
X = df.drop(['CustomerID', 'DefaultInd'], axis=1)
X = X.astype(float)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

logreg = LogisticRegression()
logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_test)
print(classification_report(y_test, y_pred))
Which results in this:
precision recall f1-score support
0 0.83 1.00 0.91 1073
1 0.00 0.00 0.00 213
accuracy 0.83 1286
macro avg 0.42 0.50 0.45 1286
weighted avg 0.70 0.83 0.76 1286
As you can see, the "1" class does not get predicted a single time, and I am wondering whether or not this behaviour is to be expected (I think it is not). I tried class_weight='balanced', but that resulted in an average f1-score of 0.59 (instead of 0.76).
I feel like I am missing something, or is this kind of behaviour expected and should I rebalance the dataset before fitting? The split does not seem that skewed (±80/20), so there should not be this big of a problem.
Any help would be more than appreciated :)
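For reference, this is roughly what the class_weight variant mentioned above looks like, reusing the train/test split from the cell above. The feature-scaling step is an extra assumption on my part (unscaled columns such as EstimatedIncome can keep a plain LogisticRegression from converging), not something from the original code:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# class_weight='balanced' reweights the loss by inverse class frequency;
# StandardScaler is an added assumption, not part of the question's code
logreg = make_pipeline(
    StandardScaler(),
    LogisticRegression(class_weight='balanced', max_iter=1000),
)
logreg.fit(X_train, y_train)
print(classification_report(y_test, logreg.predict(X_test)))
Whether this raises the minority-class recall enough depends on the data; it only changes how errors are weighted, not the information available to the model.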

Calculate average of non-numeric columns in pandas

I have a df "data" as below
Name Quality city
Tom High A
nick Medium B
krish Low A
Jack High A
Kevin High B
Phil Medium B
I want to group it by city, create new columns based on the column "Quality", and calculate the averages as below:
city High Medium Low High_Avg Medium_Avg Low_Avg
A 2 0 1 66.66 0 33.33
B 1 1 0 50 50 0
I tried the script below, and I know it is completely wrong:
data_average = data_df.groupby(['city'], as_index = False).count()
Get a count of the frequencies, divide the outcome by the sum across columns, and finally concatenate the dataframes into one:
result = pd.crosstab(df.city, df.Quality)
averages = result.div(result.sum(axis=1), axis=0).mul(100).round(2).add_suffix("_Avg")
# combine the count and percentage dataframes
pd.concat((result, averages), axis=1)
Quality High Low Medium High_Avg Low_Avg Medium_Avg
city
A 2 1 0 66.67 33.33 0.00
B 1 0 2 33.33 0.00 66.67
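A self-contained version of the same approach, assuming the sample data from the question (a sketch, not part of the original answer):
import pandas as pd

# sample data from the question
df = pd.DataFrame({
    "Name": ["Tom", "nick", "krish", "Jack", "Kevin", "Phil"],
    "Quality": ["High", "Medium", "Low", "High", "High", "Medium"],
    "city": ["A", "B", "A", "A", "B", "B"],
})

counts = pd.crosstab(df.city, df.Quality)  # counts per city and quality
shares = counts.div(counts.sum(axis=1), axis=0).mul(100).round(2).add_suffix("_Avg")
print(pd.concat((counts, shares), axis=1))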

Can anyone help me evaluate testing set data in Weka

I have one training dataset and one testing dataset. I am using the Weka Explorer and trying to create a model with Random Forest. After creating the model, when I apply it to my testing dataset using the (supply test set / re-evaluate on current dataset) option, it shows something like the output below.
What am I doing wrong?
Training Model:
=== Evaluation on training set ===
Time taken to test model on training data: 0.24 seconds
=== Summary ===
Correctly Classified Instances 5243 98.9245 %
Incorrectly Classified Instances 57 1.0755 %
Kappa statistic 0.9439
Mean absolute error 0.0453
Root mean squared error 0.1137
Relative absolute error 23.2184 %
Root relative squared error 36.4074 %
Coverage of cases (0.95 level) 100 %
Mean rel. region size (0.95 level) 59.3019 %
Total Number of Instances 5300
=== Detailed Accuracy By Class ===
TP Rate FP Rate Precision Recall F-Measure MCC ROC Area PRC Area Class
0.996 0.067 0.992 0.996 0.994 0.944 0.999 1.000 0
0.933 0.004 0.968 0.933 0.950 0.944 0.999 0.990 1
Weighted Avg. 0.989 0.060 0.989 0.989 0.989 0.944 0.999 0.999
=== Confusion Matrix ===
a b <-- classified as
4702 18 | a = 0
39 541 | b = 1
Model applied to my testing dataset:
=== Evaluation on test set ===
Time taken to test model on supplied test set: 0.22 seconds
=== Summary ===
Total Number of Instances 0
Ignored Class Unknown Instances 4000
=== Detailed Accuracy By Class ===
TP Rate FP Rate Precision Recall F-Measure MCC ROC Area PRC Area Class
0.000 0.000 0.000 0.000 0.000 0.000 ? ? 0
0.000 0.000 0.000 0.000 0.000 0.000 ? ? 1
Weighted Avg. NaN NaN NaN NaN NaN NaN NaN NaN
=== Confusion Matrix ===
a b <-- classified as
0 0 | a = 0
0 0 | b = 1
Your test data set does not appear to have labels. That is why the summary shows "Total Number of Instances 0" and "Ignored Class Unknown Instances 4000": every instance was skipped because its class value is unknown.
You can only evaluate your prediction quality using labeled data.

x axis duplicate values are added up on SSRS charts

Forgive my ignorance, I am new to SSRS reporting.
My X axis on a scatter chart is a simple ID value and my Y axis is Price. The values in the table are as follows:
ID Price
1 1.47
1 1.52
1 1.46
2 1.40
2 1.44
2 1.38
When the chart is plotted, the Y value for ID 1 is the sum of 1.47 + 1.52 + 1.46. The same happens for ID 2, etc.
I would like individual dots on the chart, e.g. 1.47 as a dot, 1.52 as a dot, 1.46 as another dot, and so on.
Thanks
You need to set the Category Group to the ID of your result set.
Of course, I'm guessing, as you really need to provide a bit more information (like a screenshot of your chart).

SQL linear interpolation based on lookup table

I need to build linear interpolation into an SQL query, using a joined table containing lookup values (more like lookup thresholds, in fact). As I am relatively new to SQL scripting, I have searched for example code to point me in the right direction, but most of the SQL scripts I came across were for interpolating between dates and timestamps, and I couldn't relate these to my situation.
Basically, I have a main data table with many rows of decimal values in a single column, for example:
Main_Value
0.33
0.12
0.56
0.42
0.1
Now, I need to yield interpolated data points for each of the rows above, based on a joined lookup table with 6 rows, containing non-linear threshold values and the associated linear normalized values:
Threshold_Level Normalized_Value
0 0
0.15 20
0.45 40
0.60 60
0.85 80
1 100
So for example, if the value in the Main_Value column is 0.45, the query will look up its position in (or between) the nearest Threshold_Level values and interpolate based on the adjacent values in the Normalized_Value column (which would yield a value of 40 in this example).
I really would be grateful for any insight into building a SQL query around this, especially as it has been hard to track down any SQL examples of linear interpolation using a joined table.
It has been pointed out that I could use some sort of rounding, so I have included a more detailed table below. I would like the SQL query to lookup each Main_Value (from the first table above) where it falls between the Threshold_Min and Threshold_Max values in the table below, and return the 'Normalized_%' value:
Threshold_Min Threshold_Max Normalized_%
0.00 0.15 0
0.15 0.18 5
0.18 0.22 10
0.22 0.25 15
0.25 0.28 20
0.28 0.32 25
0.32 0.35 30
0.35 0.38 35
0.38 0.42 40
0.42 0.45 45
0.45 0.60 50
0.60 0.63 55
0.63 0.66 60
0.66 0.68 65
0.68 0.71 70
0.71 0.74 75
0.74 0.77 80
0.77 0.79 85
0.79 0.82 90
0.82 0.85 95
0.85 1.00 100
For example, if the value from the Main_Value table is 0.52, it falls between Threshold_Min 0.45 and Threshold_Max 0.60, so the Normalized_% returned is 50%. The problem is that the Threshold_Min and Max values are not linear. Could anyone point me in the direction of how to script this?
Assuming you want, for each Main_Value, the Normalized_Value of the nearest lower (not higher) or equal Threshold_Level, you can do it like this:
select t1.Main_Value, max(t2.Normalized_Value) as Normalized_Value
from #t1 t1
inner join #t2 t2 on t1.Main_Value >= t2.Threshold_Level
group by t1.Main_Value
Replace #t1 and #t2 with the correct table names.
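For checking the expected output outside of SQL, here is a small Python sketch of the same nearest-lower-threshold rule (the values come from the question's tables; using pandas for this is my own assumption, and the equivalence to the query above relies on Normalized_Value increasing with Threshold_Level):
import pandas as pd

main = pd.DataFrame({"Main_Value": [0.33, 0.12, 0.56, 0.42, 0.1]})
lookup = pd.DataFrame({
    "Threshold_Level":  [0, 0.15, 0.45, 0.60, 0.85, 1],
    "Normalized_Value": [0, 20, 40, 60, 80, 100],
})

# merge_asof picks, for each Main_Value, the row with the nearest
# Threshold_Level that is <= Main_Value (direction="backward")
result = pd.merge_asof(
    main.sort_values("Main_Value"),
    lookup.sort_values("Threshold_Level"),
    left_on="Main_Value",
    right_on="Threshold_Level",
    direction="backward",
)
print(result[["Main_Value", "Normalized_Value"]])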