Finding Similarities Between houses in pandas dataframe for content filtering - pandas

I want to apply content-based filtering to houses. I would like to compute a similarity score for each house so I can make recommendations. What can I recommend for house1? So I need a similarity matrix for the houses. How can I compute it?
Thank you
import pandas as pd

data = [['house1', 100, 1500, 'gas', '3+1'],
        ['house2', 120, 2000, 'gas', '2+1'],
        ['house3', 40, 1600, 'electricity', '1+1'],
        ['house4', 110, 1450, 'electricity', '2+1'],
        ['house5', 140, 1200, 'electricity', '2+1'],
        ['house6', 90, 1000, 'gas', '3+1'],
        ['house7', 110, 1475, 'gas', '3+1']]

# Create the pandas DataFrame
df = pd.DataFrame(data, columns=['house', 'size', 'price', 'heating_type', 'room_count'])

If we define similarity in terms of absolute difference for numeric values, and in terms of the similarity ratio calculated by SequenceMatcher for strings (or more precisely 1 - ratio, to make it comparable to the differences), we can apply these operations to the respective columns and then normalize the result to the range 0...1, where 1 means (almost) equality and 0 means minimum similarity. Summing up the individual columns, the most similar house is the one with the maximum total similarity rating.
from difflib import SequenceMatcher
df = df.set_index('house')

# Absolute differences of the numeric columns relative to house1
res = df[['size', 'price']].sub(df.loc['house1', ['size', 'price']]).abs()
# 1 - similarity ratio of the string columns relative to house1
res['heating_type'] = df.heating_type.apply(
    lambda x: 1 - SequenceMatcher(None, df.loc['house1', 'heating_type'], x).ratio())
res['room_count'] = df.room_count.apply(
    lambda x: 1 - SequenceMatcher(None, df.loc['house1', 'room_count'], x).ratio())
# Sum per-column dissimilarities, then normalize so 1 = most similar, 0 = least
res['total'] = res['size'] + res.price + res.heating_type + res.room_count
res = 1 - res / res.max()
print(res)
print('\nBest match of house1 is ' + res.total.iloc[1:].idxmax())
Result:

            size  price  heating_type  room_count     total
house
house1  1.000000   1.00           1.0         1.0  1.000000
house2  0.666667   0.00           1.0         0.0  0.000000
house3  0.000000   0.80           0.0         0.0  0.689942
house4  0.833333   0.90           0.0         0.0  0.882127
house5  0.333333   0.40           0.0         0.0  0.344010
house6  0.833333   0.00           1.0         1.0  0.019859
house7  0.833333   0.95           1.0         1.0  0.932735

Best match of house1 is house7
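The snippet above scores every house against house1 only. To get the full similarity matrix the question asks for, one way is to wrap the same scoring in a function and apply it with each house as the reference in turn. A sketch, assuming df is indexed by 'house' as above; similarity_to is a helper name introduced here:

from difflib import SequenceMatcher
import pandas as pd

def similarity_to(df, ref):
    # Dissimilarity of every house to the reference house `ref`, scored as above
    res = df[['size', 'price']].sub(df.loc[ref, ['size', 'price']]).abs()
    for col in ['heating_type', 'room_count']:
        res[col] = df[col].apply(
            lambda x: 1 - SequenceMatcher(None, df.loc[ref, col], x).ratio())
    total = res.sum(axis=1)
    return 1 - total / total.max()  # 1 = (almost) equal, 0 = least similar

# One column per reference house
sim_matrix = pd.DataFrame({h: similarity_to(df, h) for h in df.index})
print(sim_matrix.round(2))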

Related

return list by dataframe linear interpolation

I have a dataframe that has, let's say, 3 entries:

   moment  stress  strain
0    0.12      13    0.11
1    0.23      14    0.12
2    0.56      15    0.56
I would like to get a 1D float list in the order [moment, stress, strain], based on linear interpolation at strain = 0.45.
I have read a couple of threads about the interpolate() method from pandas, but it is used when you have NaN entries to fill in.
How do I accomplish a similar task in my case?
Thank you
One method is to add a new row with NaN values and the target strain to your dataframe, then sort it:
import numpy as np
import pandas as pd

# Append a row holding the target strain; moment and stress are unknown (NaN)
# (pd.concat replaces DataFrame.append, which was removed in pandas 2.0)
df = pd.concat(
    [df, pd.DataFrame([{"moment": np.nan, "stress": np.nan, "strain": 0.45}])],
    ignore_index=True,
)
df = df.sort_values(by="strain").set_index("strain")
df = df.interpolate(method="index")  # interpolate linearly against the strain index
print(df)
Prints:

        moment  stress
strain
0.11    0.1200   13.00
0.12    0.2300   14.00
0.45    0.4775   14.75
0.56    0.5600   15.00
To get the values back:
df = df.reset_index()
print(
    df.loc[df.strain == 0.45, ["moment", "stress", "strain"]]
    .to_numpy()
    .tolist()[0]
)
Prints:
[0.47750000000000004, 14.75, 0.45]
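For reference, the same one-off result can also be had with numpy.interp on the original dataframe, without adding a row. A sketch, assuming the strain column is sorted ascending as in this example:

import numpy as np

target = 0.45
# np.interp(x, xp, fp) linearly interpolates fp at x, given sample points xp
result = [np.interp(target, df["strain"], df[col]) for col in ("moment", "stress")]
result.append(target)
print(result)  # [0.4775, 14.75, 0.45]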

Calculate nearest distance to certain points in python

I have a dataset as shown below; each sample has X and Y values and the corresponding result:

Sr.  X   Y  Result
1    2  12  Positive
2    4   3  Positive
...

[Visualization: scatter plot of the samples; grid size is 12 * 8]
How can I calculate the nearest distance from each sample to the red points (the positive ones)?
Red = Positive,
Blue = Negative

Sr.  X   Y  Result    Nearest-distance-red
1    2  23  Positive  ?
2    4   3  Negative  ?
...
It's a lot easier when there is sample data; make sure to include that next time. I'll generate random data instead:
import numpy as np
import pandas as pd

# Build a 50 x 50 grid of points and mark a random ~20% of them red (1)
x = np.linspace(1, 50)
y = np.linspace(1, 50)
GRID = np.meshgrid(x, y)
grid_colors = 1 * (np.random.random(GRID[0].size) > .8)
sample_data = pd.DataFrame({'X': GRID[0].flatten(), 'Y': GRID[1].flatten(), 'grid_color': grid_colors})
sample_data.plot.scatter(x='X', y='Y', c='grid_color', colormap='bwr', figsize=(10, 10))
BallTree (or KDTree) can create a tree to query with
from sklearn.neighbors import BallTree
red_points = sample_data[sample_data.grid_color == 1]
blue_points = sample_data[sample_data.grid_color != 1]
tree = BallTree(red_points[['X','Y']], leaf_size=15, metric='minkowski')
and use it with
distance, index = tree.query(sample_data[['X','Y']], k=1)
now add it to the DataFrame
# query with k=1 returns (n, 1) arrays, so take the single column
sample_data['nearest_point_distance'] = distance[:, 0]
sample_data['nearest_point_X'] = red_points.X.values[index[:, 0]]
sample_data['nearest_point_Y'] = red_points.Y.values[index[:, 0]]
which gives

     X    Y  grid_color  nearest_point_distance  nearest_point_X  nearest_point_Y
0  1.0  1.0           0                     2.0              3.0              1.0
1  2.0  1.0           0                     1.0              3.0              1.0
2  3.0  1.0           1                     0.0              3.0              1.0
3  4.0  1.0           0                     1.0              3.0              1.0
4  5.0  1.0           1                     0.0              5.0              1.0
Modification so that red points don't find themselves: query the nearest k=2 instead of k=1:
distance, index = tree.query(sample_data[['X','Y']], k=2)
Then, with the help of numpy indexing, make the red points use the second match instead of the first:
sample_size = GRID[0].size
# grid_color is 0 (blue) or 1 (red), so blue rows pick column 0 (nearest match)
# and red rows pick column 1 (second-nearest, skipping themselves)
sample_data['nearest_point_distance'] = distance[np.arange(sample_size), sample_data.grid_color]
sample_data['nearest_point_X'] = red_points.X.values[index[np.arange(sample_size), sample_data.grid_color]]
sample_data['nearest_point_Y'] = red_points.Y.values[index[np.arange(sample_size), sample_data.grid_color]]
The output has the same shape, but due to randomness it won't agree with the picture made earlier.
cKDTree from scipy can calculate that distance for you. Something along those lines should work:
from scipy.spatial import cKDTree

# query returns a (distances, indices) tuple and expects points as an (n, 2) array
distances, _ = cKDTree(coordinates_of_red_points).query(df[['x', 'y']], k=1)
df['Distance_To_Red'] = distances
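A self-contained sketch of that route, using hypothetical sample data shaped like the question's table:

import pandas as pd
from scipy.spatial import cKDTree

# Hypothetical sample data in the question's shape
df = pd.DataFrame({'X': [2, 4, 7], 'Y': [12, 3, 5],
                   'Result': ['Positive', 'Negative', 'Positive']})

red = df.loc[df['Result'] == 'Positive', ['X', 'Y']]
distances, _ = cKDTree(red).query(df[['X', 'Y']], k=1)
df['Nearest-distance-red'] = distances
print(df)  # red points report distance 0.0 to themselves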

Normalizing and denormalizing rows in a dataframe

I have a dataframe with 20k rows and 100 columns. I am trying to normalize my data across rows. Scikit's MinMaxScaler doesn't allow me to do this by rows. It has something called minmax_scale that allows row normalization, but I cannot denormalize it later; at least, I don't see how. How would you do it?
Instead of sklearn.preprocessing.minmax_scale, you can do the scaling manually and store the min and max vectors, so the transformation can be inverted later:
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 5],
                   'B': [88, 300, 200]})

# Find and store the min and max vectors (per column)
min_values = df.min()
max_values = df.max()

normalized_df = (df - min_values) / (max_values - min_values)
denormalized_df = normalized_df * (max_values - min_values) + min_values
df:
   A    B
0  1   88
1  2  300
2  5  200

normalized_df:
      A         B
0  0.00  0.000000
1  0.25  1.000000
2  1.00  0.528302

denormalized_df:
     A      B
0  1.0   88.0
1  2.0  300.0
2  5.0  200.0
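The example above scales per column; for the per-row normalization the question asks about, the same store-and-invert idea works along the row axis. A minimal sketch:

# Per-row min/max, broadcast back along the columns
row_min = df.min(axis=1)
row_max = df.max(axis=1)

normalized_df = df.sub(row_min, axis=0).div(row_max - row_min, axis=0)
denormalized_df = normalized_df.mul(row_max - row_min, axis=0).add(row_min, axis=0)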

Some confusion in creating pivot table

I am trying to create a pivot table but I am not getting the result I want, and I can't understand why this is happening.
I have a dataframe like this:
   data_channel_is_lifestyle  data_channel_is_bus  shares
0                         0.0                  0.0     593
1                         0.0                  1.0     711
2                         0.0                  1.0    1500
3                         0.0                  0.0    1200
4                         0.0                  0.0     505
The result I am looking for has the column names in the index and the sum of shares as the single column, so I did this:
news_copy.pivot_table(index=['data_channel_is_lifestyle','data_channel_is_bus'], values='shares', aggfunc=sum)
but I am getting a result like this:

                                                  shares
data_channel_is_lifestyle data_channel_is_bus
0.0                       0.0                  107709305
                          1.0                   19168370
1.0                       0.0                    7728777
I don't want these 0's and 1's; I just want the result to be something like this:

                              shares
data_channel_is_lifestyle  107709305
data_channel_is_bus         19168370

How can I do this?
As you put it, it's just matrix multiplication:
df.filter(like='data').T @ df[['shares']]
Output (for the sample data):

                           shares
data_channel_is_lifestyle     0.0
data_channel_is_bus        2211.0
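Equivalently, and perhaps more explicitly: each indicator column is multiplied by shares row-wise and summed, so each total only counts the rows where that indicator is 1. A sketch on the same sample frame:

indicators = df.filter(like='data')
result = indicators.mul(df['shares'], axis=0).sum().to_frame('shares')
print(result)  # same numbers as the matrix product above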

pandas Selecting single value from df using .loc() is producing a df instead of a numeric

I have two dataframes, sarc and non. After running describe() on both, I want to compare the mean value of a particular column across the two dataframes. I used .loc[] and tried saving the value as a float, but it is saved as a dataframe, which prevents me from comparing the two values with the > operator. Here's my code:
sarc.describe()

        label        c_len    c_s_l_len        score
count  5092.0  5092.000000  5092.000000  5092.000000
mean      1.0    54.876277    33.123527     6.919874
std       0.0    37.536986    22.566558    43.616977
min       1.0     0.000000     0.000000   -96.000000
25%       1.0    29.000000    18.000000     1.000000
50%       1.0    47.000000    28.000000     2.000000
75%       1.0    71.000000    43.000000     5.000000
max       1.0   466.000000   307.000000  2381.000000

non.describe()

        label        c_len    c_s_l_len        score
count  4960.0  4960.000000  4960.000000  4960.000000
mean      0.0    55.044153    33.100806     6.912298
std       0.0    47.873732    28.738776    39.216049
min       0.0     0.000000     0.000000  -119.000000
25%       0.0    23.000000    14.000000     1.000000
50%       0.0    43.000000    26.000000     2.000000
75%       0.0    74.000000    44.000000     4.000000
max       0.0   594.000000   363.000000  1534.000000
non_c_len_mean = non.describe().loc[['mean'], ['c_len']].astype(np.float64)
sarc_c_len_mean = sarc.describe().loc[['mean'], ['c_len']].astype(np.float64)

if sarc_c_len_mean > non_c_len_mean:
    # do stuff

ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
The variables are indeed of <class 'pandas.core.frame.DataFrame'> type, and each prints as a labeled 1-row, 1-column dataframe instead of just the value. How can I select only the numeric value as a float?
Remove the inner [] in .loc when you pick the index and columns; scalar labels on both axes return a scalar instead of a DataFrame:
non.describe().loc['mean', 'c_len']
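With scalar labels, the comparison then works directly; a minimal sketch:

non_c_len_mean = non.describe().loc['mean', 'c_len']    # a float, not a DataFrame
sarc_c_len_mean = sarc.describe().loc['mean', 'c_len']

if sarc_c_len_mean > non_c_len_mean:
    print('sarc has the larger mean c_len')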