I have a sparse matrix that stores computed similarities between a set of documents. The matrix is held in a pandas DataFrame (its underlying values are a NumPy ndarray) and prints as follows.
0 1 2 3 4
0 1.000000 0.000000 0.000000 0.000000 0.000000
1 0.000000 1.000000 0.067279 0.000000 0.000000
2 0.000000 0.067279 1.000000 0.025758 0.012039
3 0.000000 0.000000 0.025758 1.000000 0.000000
4 0.000000 0.000000 0.012039 0.000000 1.000000
I would like to transform this data into a long-format dataframe with three columns, as follows.
docA docB similarity
1 2 0.067279
2 3 0.025758
2 4 0.012039
This final result does not contain the matrix diagonal or any zero values, and it lists each document pair only once (i.e. in a single row). Is there a built-in / efficient method to achieve this end result? Any pointers would be much appreciated.
Thanks!
Convert the dataframe to an array:
x = df.to_numpy()
Get the entries above the diagonal of the symmetric similarity matrix and keep only the non-zero ones:
i, j = np.triu_indices_from(x, k=1)                 # row/column indices of the upper triangle, diagonal excluded
v = x[i, j]                                         # the corresponding similarity values
ijv = np.concatenate((i, j, v)).reshape(3, -1).T    # note: the indices are cast to float here
ijv = ijv[v != 0.0]                                 # drop zero similarities
Convert it back to a dataframe:
df_ijv = pd.DataFrame(ijv)
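Putting those steps together, here is a minimal end-to-end sketch (assuming df is the similarity DataFrame from the question). Building the result column by column also keeps docA/docB as integers rather than the floats produced by the concatenate above:
import numpy as np
import pandas as pd

x = df.to_numpy()
i, j = np.triu_indices_from(x, k=1)   # pairs above the diagonal, each listed once
v = x[i, j]
mask = v != 0.0                       # drop zero similarities

df_ijv = pd.DataFrame({"docA": i[mask], "docB": j[mask], "similarity": v[mask]})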
I'm not sure whether this is any faster, but an alternative way to do the middle step is to convert the numpy array to an ijv ("triplet") sparse matrix:
from scipy import sparse
coo = sparse.coo_matrix(x)
ijv = np.concatenate((coo.row, coo.col, coo.data)).reshape(3, -1).T
Now, given a symmetric similarity matrix, all you need to do is keep the non-zero elements of the upper triangle. You could filter the triplets afterwards, or you could pre-mask the array before building the COO matrix, although pre-masking partly defeats the purpose of this supposedly faster route (see the sketch below).
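For completeness, a minimal sketch of the pre-masking variant (assuming x is the dense similarity array from above): np.triu(x, k=1) zeroes the diagonal and the lower triangle, and the COO format then keeps only what is non-zero.
import numpy as np
import pandas as pd
from scipy import sparse

# zero out the diagonal and lower triangle, then let COO discard the zeros
coo = sparse.coo_matrix(np.triu(x, k=1))

df_ijv = pd.DataFrame({"docA": coo.row, "docB": coo.col, "similarity": coo.data})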
I want to calculate the TF-IDF of keywords for a given genre. These keywords were never part of a running text; they were already separated, just in a different format. I extracted them from that format and put them into lists, and did the same with the genres.
I had a df in this format:
```
keywords,genres
['k1','k2','k3'],['g1','g2']
['k2','k5','k7'],['g1','g3']
['k1','k2','k9'],['g4']
['k6','k7','k8'],['g3','g5']
...
```
I used explode on the genres column and got:
```
['k1','k2','k3'],g1
['k1','k2','k3'],g2
['k2','k5','k7'],g1
['k2','k5','k7'],g3
['k1','k2','k9'],g4
['k6','k7','k8'],g3
['k6','k7','k8'],g5
...
```
Then I grouped by genre to get this df_agg:
```
genres,keywords
g1,['k1','k2','k3','k2','k5','k7']
g2,['k1','k2','k3']
g3,['k2','k5','k7','k6','k7','k8']
g4,['k1','k2','k9']
g5,['k6','k7','k8']
...
```
So I made these changes in order to calculate the TF-IDF of the keywords per genre, but I'm not sure this is the correct format: each entry of df_agg['keywords'] is a list, whereas all the examples I see online start from raw text and derive the tokens from it. Doesn't my df_agg structure already imply that the genres are the documents and the keywords are the ready-made tokens?
Should I do something different?
What you're doing is a bit unconventional, but if you wish to proceed this way, take one step back and join your tokens into strings:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer()
# join each keyword list back into one space-separated string per genre
tfidf_matrix = tfidf.fit_transform(df_agg["keywords"].apply(lambda x: " ".join(x))).toarray()
which you can put into a df, if you wish (use get_feature_names_out() for the column labels; tfidf.vocabulary_ is a term-to-index dict whose key order does not match the matrix columns):
df_tfidf = pd.DataFrame(tfidf_matrix, columns=tfidf.get_feature_names_out())
print(df_tfidf)
k1 k2 k3 k5 k6 k7 k8 \
0 0.359600 0.605014 0.433206 0.433206 0.000000 0.359600 0.000000
1 0.562638 0.473309 0.677803 0.000000 0.000000 0.000000 0.000000
2 0.000000 0.279457 0.000000 0.400198 0.400198 0.664401 0.400198
3 0.503968 0.423954 0.000000 0.000000 0.000000 0.000000 0.000000
4 0.000000 0.000000 0.000000 0.000000 0.609818 0.506204 0.609818
k9
0 0.000000
1 0.000000
2 0.000000
3 0.752515
4 0.000000
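Since your keywords are already tokenized, an alternative sketch (assuming df_agg as built above, with genres as a regular column; use df_agg.index instead if the groupby left them in the index) skips the join/re-tokenize round trip by giving the vectorizer a pass-through analyzer:
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

# analyzer=<callable> receives each entry as-is, so the keyword lists are used directly as tokens
tfidf = TfidfVectorizer(analyzer=lambda tokens: tokens)
tfidf_matrix = tfidf.fit_transform(df_agg["keywords"])

df_tfidf = pd.DataFrame(tfidf_matrix.toarray(),
                        index=df_agg["genres"],
                        columns=tfidf.get_feature_names_out())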
I want to apply content-based filtering to houses: I would like to compute a similarity score for each house so that I can make recommendations. For example, what can I recommend for house1? So I need a similarity matrix for the houses. How can I compute it?
Thank you
data = [['house1',100,1500,'gas','3+1']
,['house2',120,2000,'gas','2+1']
,['house3',40,1600,'electricity','1+1']
,['house4',110,1450,'electricity','2+1']
,['house5',140,1200,'electricity','2+1']
,['house6',90,1000,'gas','3+1']
,['house7',110,1475,'gas','3+1']
]
# Create the pandas DataFrame
df = pd.DataFrame(data, columns=['house', 'size', 'price', 'heating_type', 'room_count'])
If we define similarity in terms of the absolute difference for numeric values and, for strings, the similarity ratio calculated by SequenceMatcher (or more precisely 1 - ratio, to make it comparable to the differences), we can apply these operations to the respective columns and then normalize the result to the range 0 ... 1, where 1 means (almost) equal and 0 means minimum similarity. Summing up the individual columns, the most similar house is the one with the maximum total similarity rating.
from difflib import SequenceMatcher

df = df.set_index('house')

# absolute differences of the numeric columns, measured against house1
res = pd.DataFrame(df[['size', 'price']].sub(df.loc['house1', ['size', 'price']]).abs())
# 1 - SequenceMatcher ratio for the string columns (0 = identical strings)
res['heating_type'] = df.heating_type.apply(
    lambda x: 1 - SequenceMatcher(None, df.loc['house1', 'heating_type'], x).ratio())
res['room_count'] = df.room_count.apply(
    lambda x: 1 - SequenceMatcher(None, df.loc['house1', 'room_count'], x).ratio())
res['total'] = res['size'] + res.price + res.heating_type + res.room_count
res = 1 - res / res.max()          # normalize each column: 1 = most similar, 0 = least
print(res)
print('\nBest match of house1 is ' + res.total.iloc[1:].idxmax())
Result:
size price heating_type room_count total
house
house1 1.000000 1.00 1.0 1.0 1.000000
house2 0.666667 0.00 1.0 0.0 0.000000
house3 0.000000 0.80 0.0 0.0 0.689942
house4 0.833333 0.90 0.0 0.0 0.882127
house5 0.333333 0.40 0.0 0.0 0.344010
house6 0.833333 0.00 1.0 1.0 0.019859
house7 0.833333 0.95 1.0 1.0 0.932735
Best match of house1 is house7
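If you need the full house-by-house similarity matrix rather than just the scores against house1, one option is to repeat the computation for every house and collect the columns. This is only a sketch built on the same idea, using a hypothetical helper I'm calling similarity_to; note it averages the normalized per-column scores instead of normalizing the raw total, which weights the columns slightly differently:
from difflib import SequenceMatcher
import pandas as pd

def similarity_to(df, house):
    """Similarity of every house to `house`, columns normalized to 0..1 (1 = identical)."""
    res = df[['size', 'price']].sub(df.loc[house, ['size', 'price']]).abs()
    for col in ['heating_type', 'room_count']:
        ref = df.loc[house, col]
        res[col] = df[col].apply(lambda x: 1 - SequenceMatcher(None, ref, x).ratio())
    res = 1 - res / res.max()       # assumes every column has some variation (max > 0)
    return res.mean(axis=1)

# one column of similarity scores per reference house
sim_matrix = pd.DataFrame({h: similarity_to(df, h) for h in df.index})
print(sim_matrix)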
I have two dataframes, sarc and non. After running describe() on both, I want to compare the mean value of a particular column across the two dataframes. I used .loc and tried to save the value as a float, but it is saved as a DataFrame, which prevents me from comparing the two values with the > operator. Here's my code:
sarc.describe()
label c_len c_s_l_len score
count 5092.0 5092.000000 5092.000000 5092.000000
mean 1.0 54.876277 33.123527 6.919874
std 0.0 37.536986 22.566558 43.616977
min 1.0 0.000000 0.000000 -96.000000
25% 1.0 29.000000 18.000000 1.000000
50% 1.0 47.000000 28.000000 2.000000
75% 1.0 71.000000 43.000000 5.000000
max 1.0 466.000000 307.000000 2381.000000
non.describe()
label c_len c_s_l_len score
count 4960.0 4960.000000 4960.000000 4960.000000
mean 0.0 55.044153 33.100806 6.912298
std 0.0 47.873732 28.738776 39.216049
min 0.0 0.000000 0.000000 -119.000000
25% 0.0 23.000000 14.000000 1.000000
50% 0.0 43.000000 26.000000 2.000000
75% 0.0 74.000000 44.000000 4.000000
max 0.0 594.000000 363.000000 1534.000000
non_c_len_mean = non.describe().loc[['mean'], ['c_len']].astype(np.float64)
sarc_c_len_mean = sarc.describe().loc[['mean'], ['c_len']].astype(np.float64)
if sarc_c_len_mean > non_c_len_mean:
# do stuff
ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
The variables are indeed of <class 'pandas.core.frame.DataFrame'> type, and each prints as a labeled 1-row, 1-col df instead of just the value. How can I select only the numeric value as a float?
Remove the [] inside .loc when you pick the index and column; with scalar labels, .loc returns the value itself rather than a 1x1 DataFrame:
non.describe().loc['mean', 'c_len']
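That returns a plain numpy float, so the comparison then works directly. A minimal sketch using the question's variable names (as an aside, you could also skip describe() entirely and take the column mean):
non_c_len_mean = non.describe().loc['mean', 'c_len']     # scalar, not a 1x1 DataFrame
sarc_c_len_mean = sarc.describe().loc['mean', 'c_len']

if sarc_c_len_mean > non_c_len_mean:
    ...  # do stuff

# equivalent, without describe():
if sarc['c_len'].mean() > non['c_len'].mean():
    ...  # do stuff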
I am working on a dataset whose columns contain numerical/integer data. The values are mostly -1s and 0s, with scattered values in the 10s and 1000s. I want to replace the column contents with labels...
pd.qcut(df['TS1'].rank(method='first'),3,labels=["low","mid","high"],duplicates='drop')
The command above only converts one column; I don't know how to categorize the whole dataset.
So, I create a dataframe with data similar to your dataset:
df = pd.DataFrame(np.random.rand(5, 3)) * 1000
df.iloc[0:3, 2] = 0
df.iloc[[1, 3], :] = -1
print(df)
output:
0 1 2
0 679.473489 844.456345 0.0000
1 -1.000000 -1.000000 -1.0000
2 125.684455 696.829219 0.0000
3 -1.000000 -1.000000 -1.0000
4 97.520572 869.919917 528.5606
Create a dataframe for the categories, then loop over columns to get the qcut for each column:
cat_df = pd.DataFrame(index=df.index, columns=df.columns)
for column in df.columns:
    cat_df[column] = pd.qcut(df.loc[:, column], 3, labels=["low", "mid", "high"], duplicates='drop')
print(cat_df)
output:
0 1 2
0 high high mid
1 low low low
2 high mid mid
3 low low low
4 mid high high
Then boxplot the raw values:
import matplotlib.pyplot as plt

df.boxplot()
plt.show()
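As an alternative to the explicit loop, a short sketch (assuming you want the same binning as your one-column command, including the rank(method='first') trick to break ties among the repeated -1s and 0s) applies qcut to every column at once:
import pandas as pd

cat_df = df.apply(
    lambda col: pd.qcut(col.rank(method='first'), 3,
                        labels=["low", "mid", "high"], duplicates='drop'))
print(cat_df)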
I have an actual plane with known 3D coordinates of its four corners relative to a landmark. Its coordinates are:
Front left corner: -32.5100 128.2703 662.2551
Front right corner: 65.2244 131.0850 656.1088
Back left corner: -23.4983 129.0271 838.3724
Back right corner: 74.1135 131.4294 833.4199
I am now creating a 3D OBJ-file plane in Blender, with an image mapped onto it as a texture. By following a tutorial on adding a texture to a plane in Blender, I obtained the OBJ and MTL files shown below. I tried to directly replace the geometric vertices (the 'v' lines) of the OBJ file with my own coordinates, but the corners are not connected in MeshLab. Any idea how to modify the OBJ file?
Thanks,
OBJ File:
# Blender v2.76 (sub 0) OBJ File: ''
# www.blender.org
mtllib planePhantom.mtl
o Plane
v -0.088000 0.000000 0.049250
v 0.088000 0.000000 0.049250
v -0.088000 0.000000 -0.049250
v 0.088000 0.000000 -0.049250
vt 0.000000 0.000000
vt 1.000000 0.000000
vt 1.000000 1.000000
vt 0.000000 1.000000
vn 0.000000 1.000000 0.000000
usemtl Material.001
s off
f 1/1/1 2/2/1 4/3/1 3/4/1
MTL file:
# Blender MTL File: 'None'
# Material Count: 1
newmtl Material.001
Ns 96.078431
Ka 1.000000 1.000000 1.000000
Kd 0.640000 0.640000 0.640000
Ks 0.500000 0.500000 0.500000
Ke 0.000000 0.000000 0.000000
Ni 1.000000
d 1.000000
illum 0
map_Kd IMG_0772_cropped_unsharpmask_100_4_0.jpeg
The plane as shown in MeshLab before replacing the vertices:
Okay, it turns out that the order in which the vertices are connected in the face definition was wrong, and I have already figured it out.
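For anyone hitting the same problem, here is a hedged sketch of the edited OBJ, assuming the four corners replace the original v lines in the template's layout (front left, front right, back left, back right), so that the existing face line 1-2-4-3 still walks the perimeter instead of crossing itself (a self-crossing quad is one common reason the plane looks "not connected" in MeshLab). The vn line is copied from the template and no longer matches the tilted plane, so you may want to recompute it.
# vertices in the order: front left, front right, back left, back right
v -32.5100 128.2703 662.2551
v 65.2244 131.0850 656.1088
v -23.4983 129.0271 838.3724
v 74.1135 131.4294 833.4199
vt 0.000000 0.000000
vt 1.000000 0.000000
vt 1.000000 1.000000
vt 0.000000 1.000000
vn 0.000000 1.000000 0.000000
usemtl Material.001
s off
f 1/1/1 2/2/1 4/3/1 3/4/1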