Cosine Similarity between 2 cells in a dataframe - pandas

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# 'data' is the existing DataFrame with a 'Description' column
doc_1 = data.iloc[15]['Description']
doc_2 = "Data is a new oil"
docs = [doc_1, doc_2]  # renamed so the DataFrame 'data' is not shadowed

count_vectorizer = CountVectorizer()
vector_matrix = count_vectorizer.fit_transform(docs)
tokens = count_vectorizer.get_feature_names()  # get_feature_names_out() on newer scikit-learn

def create_dataframe(matrix, tokens):
    doc_names = [f'doc_{i+1}' for i, _ in enumerate(matrix)]
    return pd.DataFrame(data=matrix, index=doc_names, columns=tokens)

create_dataframe(vector_matrix.toarray(), tokens)

cosine_similarity_matrix = cosine_similarity(vector_matrix)
create_dataframe(cosine_similarity_matrix, ['doc_1', 'doc_2'])
The code calculates the cosine similarity between one cell and the string, but how can I improve it so that I can calculate the cosine similarity between cells? I want doc_1 compared with all the other cells in the column.
So I will get a table like this, where the dots are the cosine similarities:
x       doc2    doc3
doc1    ....    ....
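One way to get that full table is to vectorize the whole Description column at once and pass the resulting matrix to cosine_similarity, which then computes every pairwise similarity in a single call. A minimal sketch, assuming data is the DataFrame from the question and every cell of Description holds a document:

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# vectorize every cell in the column in one pass
docs = data['Description'].astype(str).tolist()
vector_matrix = CountVectorizer().fit_transform(docs)

# cosine_similarity on a single matrix returns all pairwise similarities
sim = cosine_similarity(vector_matrix)

doc_names = [f'doc_{i+1}' for i in range(len(docs))]
sim_df = pd.DataFrame(sim, index=doc_names, columns=doc_names)

# the row 'doc_1' now holds doc_1 compared with every other cell
print(sim_df.loc['doc_1'])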

Related

How to calculate the number of scatterplot data points in a particular 'region' of the graph

As my question says, I'm trying to find a way to count the number of scatterplot data points (pink dots) in a particular 'region' of the graph, i.e. on either side of the black lines/boundaries. Open to any ideas, as I don't even know where to start. Thank you!!
The code:
import numpy as np
import matplotlib.pyplot as plt
from astropy.io import fits

fig, ax1a = plt.subplots()

################################
############ GES ##############
################################
p = fits.open('GES_DR17.fits')
pfeh = p[1].data['Fe_H']
pmgfe = p[1].data['Mg_Fe']
pmnfe = p[1].data['Mn_Fe']
palfe = p[1].data['Al_Fe']
# Calculate [Mg/Mn]
pmgmn = pmgfe - pmnfe
ax1a.scatter(palfe, pmgmn, c='thistle', marker='.', alpha=0.8, s=500,
             edgecolors='black', lw=0.3, vmin=-2.5, vmax=0.65)
# horizontal boundary at y = 0.25 (solid, then dashed)
ax1a.plot([-1, -0.07], [0.25, 0.25], c='black')
ax1a.plot([-0.07, 1.0], [0.25, 0.25], '--', c='black')
# sloped boundary y = 4.25x + 0.8875
x = np.arange(-0.15, 0.4, 0.01)
ax1a.plot(x, 4.25*x + 0.8875, c='black')
Let's call the two axes x and y. Any line in this plot can be written as
a*x + b*y + c = 0
for some values of a, b, c. If we plug a point with coordinates (x, y) into the left-hand side of the equation above, we get a positive value for all points on one side of the line and a negative value for all points on the other side. So if you have multiple regions delimited by lines, you can just check the signs. With this you can create a boolean mask for each region and count the number of Trues using np.sum.
import numpy as np

# assign the coordinates to the variables x and y as numpy arrays
x = ...
y = ...
line1 = a1*x + b1*y + c1
line2 = a2*x + b2*y + c2
mask = (line1 > 0) & (line2 < 0)  # just an example, signs might vary
count = np.sum(mask)
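For the two boundaries actually drawn in the plot above (the horizontal line y = 0.25 and the sloped line y = 4.25x + 0.8875) the coefficients follow by moving everything to one side of the equation. A sketch under that assumption, using the palfe and pmgmn arrays from the question:

import numpy as np

x, y = np.asarray(palfe), np.asarray(pmgmn)

# y = 0.25            ->  0*x + 1*y - 0.25 = 0
line1 = y - 0.25
# y = 4.25x + 0.8875  ->  4.25*x - 1*y + 0.8875 = 0
line2 = 4.25*x - y + 0.8875

# e.g. points above the horizontal line and below the sloped one;
# flip the comparison signs to select the other regions
mask = (line1 > 0) & (line2 > 0)
count = np.sum(mask)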

Better way to concatenate pandas matrices

I need to concatenate multiple matrices (containing numbers and strings) in a loop. So far I wrote this solution, but I don't like using a dummy variable (h) and I'm sure the code could be improved.
Here it is:
h = 0
for name in list_of_matrices:
    h += 1
    Matrix = pd.read_csv(name)
    if h == 1:
        Matrix_final = Matrix
        continue
    Matrix_final = pd.concat([Matrix_final, Matrix])
For some reason, if I use the following code I end up with the two matrices one after the other and not a joined one (so this code does not fit):
li = []
for filename in all_files:
    df = pd.read_csv(filename, index_col=None, header=0)
    li.append(df)
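The second snippet only builds the list of frames; it still needs a single pd.concat at the end, which is also the usual way to avoid the dummy counter. A sketch, assuming all_files is the list of CSV paths:

import pandas as pd

# read every file, then join all the pieces in one call
frames = [pd.read_csv(f, index_col=None, header=0) for f in all_files]
Matrix_final = pd.concat(frames, ignore_index=True)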

How to create pair combinations in TensorFlow

I have a multi-dimensional tensor and I want to create pair combinations along its i-th dimension, producing two new tensors. For example, given

a = tf.constant([[[1,1],[2,2]],
                 [[3,3],[4,4]],
                 [[5,5],[6,6]]])  # shape (3, 2, 2)

I create the pair combinations along dimension 0 (the indices are [0,1,2], so the pairs are (0,1), (0,2), (1,2)). The new tensor b's dimension 0 comes from the old indices [0,0,1] and the new tensor c's dimension 0 comes from the old indices [1,2,2]. The finished result is:
b = tf.constant([[[1,1],[2,2]],
                 [[1,1],[2,2]],
                 [[3,3],[4,4]]])  # shape (3, 2, 2)
c = tf.constant([[[3,3],[4,4]],
                 [[3,3],[4,4]],
                 [[5,5],[6,6]]])  # shape (3, 2, 2)
Use tf.gather():
import tensorflow as tf

a = tf.constant([[[1,1],[2,2]],
                 [[3,3],[4,4]],
                 [[5,5],[6,6]]])

pair = ((0,1),(0,2),(1,2))
pair = tf.convert_to_tensor(pair)

# first element of each pair selects the slices for b
b = tf.gather(a, pair[:, 0])
# second element of each pair selects the slices for c
# (pair[:, 1], not pair[:, 1:], so the result keeps shape (3, 2, 2))
c = tf.gather(a, pair[:, 1])
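For an arbitrary size of dimension 0 the index pairs do not have to be hard-coded; they can be generated with itertools.combinations. A sketch, reusing the tensor a from above:

import itertools
import tensorflow as tf

n = a.shape[0]  # size of dimension 0
pair = tf.constant(list(itertools.combinations(range(n), 2)))  # (0,1),(0,2),(1,2)

b = tf.gather(a, pair[:, 0])  # first element of each pair
c = tf.gather(a, pair[:, 1])  # second element of each pair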

Julia: Collapsing DataFrame by multiple values retaining additional variables

I have some data with duplicate fields except for a single field, which I would like to join. In the data, everything but the report stays the same for each day and each company. Companies can file multiple reports on the same day.
I can join using the following code, but I am losing the variables which are not in my by call. Any suggestions?
Mock Data
using DataFrames

# Number of observations
n = 100
words = split("the wigdet drop air flat fall fling flap freeze flop tool fox", " ")
df = DataFrame(day = cumsum(rand(0:1, n)), company = rand(0:3, n),
               report = [join(rand(words, rand(1:5, 1)[1]), " ") for i in 1:n])
x = df[:, [:day, :company]]

# Number of variables which are identical for each day/company.
nv = 100
for i in 1:nv
    df[:, Symbol("v" * string(i))] = ""
end
for i in 1:size(x, 1), j in 1:nv
    df[(df.day .== x[i,1]) .& (df.company .== x[i,2]), Symbol("v" * string(j))] =
        join(rand('a':'z', 3), "")
end
Collapsed data
outdf = by(df, [:company, :day]) do sub
    t = DataFrame(fullreport = join(sub.report, "\n(Joined)\n"))
end
Here are some minor tweaks in your data preparation code:
using DataFrames

# Number of observations
n = 100
words = split("the wigdet drop air flat fall fling flap freeze flop tool fox", " ")
df = DataFrame(day = cumsum(rand(0:1, n)), company = rand(0:3, n),
               report = [join(rand(words, rand(1:5, 1)[1]), " ") for i in 1:n])
x = df[:, [:day, :company]]

# Number of variables which are identical for each day/company.
nv = 100
for i in 1:nv
    df[:, Symbol("v", i)] .= ""
end
for i in 1:size(x, 1), j in 1:nv
    df[(df.day .== x[i,1]) .& (df.company .== x[i,2]), Symbol("v", j)] .= join(rand('a':'z', 3), "")
end
and here is a by call that keeps all the other variables (assuming they are constant per group; this code should be efficient even for relatively large data):
outdf = by(df, [:company, :day]) do sub
    merge((fullreport = join(sub.report, "\n(Joined)\n"),),
          copy(sub[1, Not([:company, :day, :report])]))
end
I put the fullreport variable first.
Here is the code that would keep all rows from the original data frame:
outdf = by(df, [:company, :day]) do sub
    insertcols!(select(sub, Not([:company, :day, :report])), 1,
                fullreport = join(sub.report, "\n(Joined)\n"))
end
and now you can e.g. check that unique(outdf) produces the same data frame as the one generated by the first by.
(In the code above I also dropped the :report variable, as I guess you did not want it in the result, right?)

Vectorizing np.maximum & np.minimum over axes with broadcasting

I've roughly got something like
import numpy as np

A = np.random.random([n, 2])  # n intervals as (start, end) rows
B = np.random.random([3, 2])
...
ret = 0
for b in B:
    for a in A:
        start = np.max([a[0], b[0]])
        end = np.min([a[1], b[1]])
        ret += np.max([0, end - start])
return ret
Putting it into words, A is an input array of n intervals (each stored as a start/end row) and B is a known array of 3 intervals, and I'm trying to compute the total length of intersection between all pairs of intervals.
Is there a way to vectorize it? My first thought was using np.maximum and np.minimum along with broadcasting, but nothing seems to work.
Broadcast after extending dimensions to vectorize things:
p1 = np.maximum(A[:,None,0], B[:,0])  # pairwise max of starts, shape (n, 3)
p2 = np.minimum(A[:,None,1], B[:,1])  # pairwise min of ends, shape (n, 3)
ret = np.maximum(0, p2 - p1).sum()    # clamp non-overlaps to 0 and sum
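A quick way to sanity-check the broadcasted version is to run both it and the original double loop on the same small random input and compare the totals; a sketch:

import numpy as np

n = 5
A = np.random.random([n, 2])
B = np.random.random([3, 2])

# original double loop
ret_loop = 0.0
for b in B:
    for a in A:
        start = max(a[0], b[0])
        end = min(a[1], b[1])
        ret_loop += max(0.0, end - start)

# broadcasted version from the answer
p1 = np.maximum(A[:, None, 0], B[:, 0])
p2 = np.minimum(A[:, None, 1], B[:, 1])
ret_vec = np.maximum(0, p2 - p1).sum()

assert np.isclose(ret_loop, ret_vec)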