How to multiply two sparse matrices with the sparse method in Python? - numpy

I tried to multiply two sparse matrices, but I had trouble deleting the extra rows that were all zeros. I used numpy.delete(my_matrix, [n], axis=0) and got this error:
index 4 is out of bounds for axis 0 with size 3
import numpy as np

def mult_mat(mat1, mat2):
    # Row 0 of each triplet matrix holds (rows, cols, number of non-zero values)
    col = mat1[0][1]
    row = mat2[0][0]
    row_mat1, row_mat2 = np.shape(mat1)[0], np.shape(mat2)[0]
    if col != row:
        return "Multiplication is not possible because the number" \
               " of columns in the first matrix is opposite of the" \
               " number of rows in the second matrix"
    my_matrix = np.array([[0] * 3] * (mat1[0][2] * mat2[0][2]))
    n = 0
    for r in range(1, row_mat1):
        for h in range(1, row_mat2):
            if mat1[r][1] == mat2[h][0]:
                my_matrix[n][0], my_matrix[n][1], my_matrix[n][2] = mat1[r][0], mat2[h][1], mat1[r][2] * mat2[h][2]
                n += 1
    row_my_matrix = np.shape(my_matrix)[0]
    for n in range(row_my_matrix):
        if my_matrix[n][0] == 0 & my_matrix[n][1] == 0 & my_matrix[n][2] == 0:
            my_matrix = np.delete(my_matrix, [n], axis=0)
    return my_matrix
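A likely cause of the error: the loop range is computed once from the original number of rows, but every np.delete call returns a shorter array, so later values of n point past the end. In addition, & binds more tightly than ==, so the condition does not test all three entries as intended. A minimal sketch of one way around both issues, using a boolean mask (the helper name drop_zero_rows and the sample triplets are made up for illustration):

```python
import numpy as np

def drop_zero_rows(matrix):
    """Return a copy of `matrix` without the rows whose entries are all zero."""
    matrix = np.asarray(matrix)
    keep = np.any(matrix != 0, axis=1)  # True for rows with at least one non-zero entry
    return matrix[keep]

# Made-up (row, col, value) triplets with two all-zero filler rows
triplets = np.array([[0, 1, 6], [0, 0, 0], [2, 1, 8], [0, 0, 0]])
cleaned = drop_zero_rows(triplets)
print(cleaned)
```

Because the mask is built in one pass and applied once, no index is ever computed against an array that has already shrunk.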

Related

What is a time complexity of the following algorithm in Big Theta Notation?

res = 0
for i in range(1, n):
    j = i
    while j % 2 == 0:
        j = j/2
        res = res + j
I understand that the upper bound is O(n log n), but I'm wondering whether a tighter bound can be found. I'm stuck with the analysis.
Some ideas that may be helpful:
You could create a function g(n) that annotates your function f(n) to count how many operations occur when running f(n):
def f(n):
    res = 0
    for i in range(1, n):
        j = i
        while j % 2 == 0:
            j = j/2
            res = res + j
    return res
def g(n):
    comparisons = 0
    operations = 0
    assignments = 0
    assignments += 1
    res = 0
    assignments += 1  # i = 1
    comparisons += 1  # i < n
    for i in range(1, n):
        assignments += 1
        j = i
        operations += 1
        comparisons += 1
        while j % 2 == 0:
            operations += 1
            assignments += 1
            j = j/2
            operations += 1
            assignments += 1
            res = res + j
            operations += 1
            comparisons += 1
        operations += 1  # i + 1
        assignments += 1  # assign to i
        comparisons += 1  # i < n ?
    return operations + comparisons + assignments
For n = 1, the code runs without hitting any loops: assigning the value of res; assigning i as 1; comparing i to n and skipping the loop as a result.
For n > 1, you get into the for loop, and the for statement is the only thing changing the loop variable, so the complexity of the code is at least O(n).
Once in the loop:
if i is odd, then you only assign j, perform the mod operation and compare to zero. That will be the case for half the values of i, so each run of the loop from 2 to n will (half the time) add a fixed number of a few operations (including the loop operations). So, that's still O(n), just with a larger constant.
if i is even, then we divide by 2 until it is odd. This is what we need to work out the impact of.
Based on my counting of the different operations, I get:
g_initial_setup = 3 (every time)
g_for_any_i = 6 (half the time, it is just this)
g_for_even_i = 6 for each time we divide by two (the other half of the time)
For a random even i between 2 and n, half the time we only need to divide by two once, half of the remaining time we divide by two once more, half of the remaining time once more again, and so on. So, as n goes to infinity, we have the series sum(1/2^m) for m between 1 and n, and we multiply that by the 6 operations done for each halving of j.
I would expect from this:
g(n) = 3 + (n * 6) + (n * 6) * sum( 1 / pow(2,m) for m between 1 and n )
Given that the infinite series sum(1/2^m) for m >= 1 converges to 1, we simplify that to:
g(n) = 3 + 12n as n approaches infinity.
That implies that the algorithm is O(n). Huh. I did not expect that.
Let's try out the function g(n) from above, counting all the operations that are occurring as f(n) is computed.
g(1) = 3 operations
g(2) = 9
g(3) = 21
g(4) = 27
g(5) = 45
g(10) = 99
g(100) = 1167
g(1000) = 11943
g(10000) = 119943
g(100000) = 1199931
g(1000000) = 11999919
g(10000000) = 119999907
Okay, unless I've really made a serious error here, it's O(n).
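The linear bound can also be checked by counting halvings directly. Every i divisible by 2^k contributes one halving at level k, so the total number of while-loop iterations for i in 1..n-1 is the sum over k >= 1 of floor((n-1)/2^k), which is always below n-1. A small sketch of that check (the function names are illustrative and not part of the answer above; integer division is used so the values stay integers):

```python
def total_halvings(n):
    """Count how many times the inner while loop runs for i in 1..n-1."""
    total = 0
    for i in range(1, n):
        j = i
        while j % 2 == 0:
            j //= 2
            total += 1
    return total

def halvings_closed_form(n):
    """Sum of floor((n-1) / 2**k) for k >= 1, computed without looping over i."""
    total, power = 0, 2
    while power <= n - 1:
        total += (n - 1) // power
        power *= 2
    return total

for n in (10, 100, 1000):
    print(n, total_halvings(n), halvings_closed_form(n))
```

Since the inner loop runs fewer than n times in total and the outer loop runs n-1 times, the overall work is bounded above and below by a constant times n.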

I am trying to append values to an empty 2D array dynamically but I get an error every time

import numpy as np

output = np.empty([17157,4])
# for every row in data
for rows in data:
    # initializing variables
    snowfall = 0
    positive_temp = 0
    mass_balance = 0
    melt = 0
    # for every cell in a row
    for columns in range(12):
        if rows[columns+2] < 0:
            snowfall += rows[columns+14]
        else:
            positive_temp += rows[columns+2]
            melt += positive_temp * 7
    mass_balance += snowfall - melt
    lat = rows[0]
    lon = rows[1]
    elev = rows[26]
    # appending values to output
    np.append(output, ([lat, lon, mass_balance, elev]), axis = 0)
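Two things are worth noting about the last line. np.append does not modify its argument; it returns a new array, so the result has to be assigned. With axis=0, the appended values must also have the same number of dimensions as the array, so a single row needs to be wrapped as a 2D value. Also, np.empty leaves the preallocated rows uninitialised, so appending to it would keep those rows on top. A minimal sketch of two alternatives, using made-up sample rows in place of the real data:

```python
import numpy as np

# Made-up sample rows standing in for [lat, lon, mass_balance, elev]
rows = [[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]]

# Option 1: preallocate once and assign each row by index.
output = np.empty((len(rows), 4))
for i, values in enumerate(rows):
    output[i] = values

# Option 2: start from zero rows and keep the array returned by np.append.
grown = np.empty((0, 4))
for values in rows:
    grown = np.append(grown, [values], axis=0)  # [values] makes the row 2D

print(output)
print(grown)
```

Option 1 is usually preferable when the number of rows is known in advance, because np.append copies the whole array on every call.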

Division by Zero error in calculating series

I am trying to compute a series, and I am running into an issue that I don't know why is occurring.
"RuntimeWarning: divide by zero encountered in double_scalars"
When I checked the code, it didn't seem to have any singularities, so I am confused. Here is the code currently (log stands for the natural logarithm) (edit: extending the code in case that helps):
from numpy import pi, log
#Create functions to calculate the sums
def phi(z: int):
    k = 0
    phi = 0
    #Loop through a finite number of terms (k = 0..100) to approximate the series value as if it went to infinity
    while k <= 100:
        phi += ((1/(k+1)) - (1/(k+(2*z))))
        k += 1
    return phi
def psi(z: int):
    psi = 0
    k = 1
    while k <= 101:
        psi += ((log(k))/( k**(2*z)))
        k += 1
    return psi
def sig(z: int):
    sig = 0
    k = 1
    while k <= 101:
        sig += ((log(k))**2)/(k^(2*z))
        k += 1
    return sig
def beta(z: int):
    beta = 0
    k = 1
    while k <= 101:
        beta += (1/(((2*z)+k)^2))
        k += 1
    return beta
#Create the formula to approximate the value. For higher accuracy, either calculate more derivatives of Bernoulli numbers or increase the boundary of k.
def Bern(z :int):
    #Define Euler–Mascheroni constant
    c = 0.577215664901532860606512
    #Begin computations (only approximation)
    B = (pi/6) * (phi(1) - c - 2 * log(2 * pi) - 1) - z * ((pi/6) * ((phi(1)- c - (2 * log(2 * pi)) - 1) * (phi(1) - c) + beta(1) - 2 * psi(1)) - 2 * (psi(1) * (phi(1) - c) + sig(1) + 2 * psi(1) * log(2 * pi)))
    #output
    return B
A = int(input("Choose any value: "))
print("The answer is", Bern(A + 1))
Any help would be much appreciated.
Are you sure you need ^, the bitwise exclusive-or operator, rather than **? I ran your code with z = 1, and on the second iteration k^(2*z) evaluated to 0, which is where the division by zero comes from (2 ^ (2*1) = 2 ^ 2 = 0).
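To make the difference concrete, here is a short sketch comparing the two operators for k = 2 and z = 1, followed by a version of the sig sum that uses ** instead. The name sig_fixed and the terms parameter are illustrative, and math.log is used here in place of numpy's log:

```python
from math import log

k, z = 2, 1
xor_value = k ^ (2 * z)     # bitwise XOR of 2 and 2
power_value = k ** (2 * z)  # 2 raised to the power 2

def sig_fixed(z, terms=101):
    """Partial sum of log(k)**2 / k**(2*z) for k = 1..terms, using ** for powers."""
    return sum(log(k) ** 2 / k ** (2 * z) for k in range(1, terms + 1))

print(xor_value, power_value)
print(sig_fixed(1))
```

The same substitution would apply to the ^ in beta, which does not hit zero for z = 1 but still computes an XOR rather than a square.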

Probabilistic Record Linkage in Pandas

I have two dataframes (X & Y). I would like to link them together and to predict the probability that each potential match is correct.
import math
import numpy as np
import pandas as pd

X = pd.DataFrame({'A': ["One", "Two", "Three"]})
Y = pd.DataFrame({'A': ["One", "To", "Free"]})
Method A
I have not yet fully understood the theory but there is an approach presented in:
Sayers, A., Ben-Shlomo, Y., Blom, A.W. and Steele, F., 2015. Probabilistic record linkage. International journal of epidemiology, 45(3), pp.954-964.
Here is my attempt to implement it in Pandas:
# Probability that Matches are True Matches
m = 0.95
# Probability that non-Matches are True non-Matches
u = min(len(X), len(Y)) / (len(X) * len(Y))
# Priors
M_Pr = u
U_Pr = 1 - M_Pr
O_Pr = M_Pr / U_Pr # Prior odds of a match
# Combine the dataframes
X['key'] = 1
Y['key'] = 1
Z = pd.merge(X, Y, on='key')
Z = Z.drop('key',axis=1)
X = X.drop('key',axis=1)
Y = Y.drop('key',axis=1)
# Levenshtein distance
def Levenshtein_distance(s1, s2):
    if len(s1) > len(s2):
        s1, s2 = s2, s1
    distances = range(len(s1) + 1)
    for i2, c2 in enumerate(s2):
        distances_ = [i2+1]
        for i1, c1 in enumerate(s1):
            if c1 == c2:
                distances_.append(distances[i1])
            else:
                distances_.append(1 + min((distances[i1], distances[i1 + 1], distances_[-1])))
        distances = distances_
    return distances[-1]
L_D = np.vectorize(Levenshtein_distance, otypes=[float])
Z["D"] = L_D(Z['A_x'], Z['A_y'])
# Max string length
def Max_string_length(X, Y):
    return max(len(X), len(Y))
M_L = np.vectorize(Max_string_length, otypes=[float])
Z["L"] = M_L(Z['A_x'], Z['A_y'])
# Agreement weight
def Agreement_weight(D, L):
    return 1 - ( D / L )
A_W = np.vectorize(Agreement_weight, otypes=[float])
Z["C"] = A_W(Z['D'], Z['L'])
# Likelihood ratio
def Likelihood_ratio(C):
    return (m/u) - ((m/u) - ((1-m) / (1-u))) * (1-C)
L_R = np.vectorize(Likelihood_ratio, otypes=[float])
Z["G"] = L_R(Z['C'])
# Match weight
def Match_weight(G):
    return math.log(G) * math.log(2)
M_W = np.vectorize(Match_weight, otypes=[float])
Z["R"] = M_W(Z['G'])
# Posterior odds
def Posterior_odds(R):
    return math.exp( R / math.log(2)) * O_Pr
P_O = np.vectorize(Posterior_odds, otypes=[float])
Z["O"] = P_O(Z['R'])
# Probability
def Probability(O):
    return O / (1 + O)
Pro = np.vectorize(Probability, otypes=[float])
Z["P"] = Pro(Z['O'])
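As a worked example of the chain above for a single pair, the following plain-Python sketch follows the same steps (distance, agreement weight, likelihood ratio, posterior odds, probability) for "Two" against "To", with m = 0.95 and u = 3 / (3 * 3) as in the code above. The function names levenshtein and match_probability are illustrative; the match weight step is skipped because the posterior odds reduce to the likelihood ratio times the prior odds:

```python
def levenshtein(s1, s2):
    """Edit distance between two strings (same algorithm as above)."""
    if len(s1) > len(s2):
        s1, s2 = s2, s1
    distances = list(range(len(s1) + 1))
    for i2, c2 in enumerate(s2):
        row = [i2 + 1]
        for i1, c1 in enumerate(s1):
            if c1 == c2:
                row.append(distances[i1])
            else:
                row.append(1 + min(distances[i1], distances[i1 + 1], row[-1]))
        distances = row
    return distances[-1]

def match_probability(a, b, m, u):
    """Posterior match probability for one pair, following the steps of Method A."""
    distance = levenshtein(a, b)
    agreement = 1 - distance / max(len(a), len(b))
    ratio = m / u - (m / u - (1 - m) / (1 - u)) * (1 - agreement)
    prior_odds = u / (1 - u)
    odds = ratio * prior_odds
    return odds / (1 + odds)

p = match_probability("Two", "To", m=0.95, u=1 / 3)
print(round(p, 4))
```

Here the distance is 1 and the longer string has 3 characters, so the agreement weight is 2/3 and the likelihood ratio is 2.85 - 2.775 / 3 = 1.925.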
I have verified that this gives the same results as in the paper. Here is a sensitivity check on m, showing that it doesn't make a lot of difference:
Method B
These assumptions won't apply to all applications but in some cases each row of X should match a row of Y. In that case:
The probabilities should sum to 1
If there are many credible candidates to match to then that should reduce the probability of getting the right one
then:
X["I"] = X.index
# Combine the dataframes
X['key'] = 1
Y['key'] = 1
Z = pd.merge(X, Y, on='key')
Z = Z.drop('key',axis=1)
X = X.drop('key',axis=1)
Y = Y.drop('key',axis=1)
# Levenshtein distance
def Levenshtein_distance(s1, s2):
    if len(s1) > len(s2):
        s1, s2 = s2, s1
    distances = range(len(s1) + 1)
    for i2, c2 in enumerate(s2):
        distances_ = [i2+1]
        for i1, c1 in enumerate(s1):
            if c1 == c2:
                distances_.append(distances[i1])
            else:
                distances_.append(1 + min((distances[i1], distances[i1 + 1], distances_[-1])))
        distances = distances_
    return distances[-1]
L_D = np.vectorize(Levenshtein_distance, otypes=[float])
Z["D"] = L_D(Z['A_x'], Z['A_y'])
# Max string length
def Max_string_length(X, Y):
    return max(len(X), len(Y))
M_L = np.vectorize(Max_string_length, otypes=[float])
Z["L"] = M_L(Z['A_x'], Z['A_y'])
# Agreement weight
def Agreement_weight(D, L):
    return 1 - ( D / L )
A_W = np.vectorize(Agreement_weight, otypes=[float])
Z["C"] = A_W(Z['D'], Z['L'])
# Normalised Agreement Weight
T = Z.groupby('I').agg({'C': sum})
D = pd.DataFrame(T)
D.columns = ['T']
J = Z.set_index('I').join(D)
J['P1'] = J['C'] / J['T']
Comparing it against Method A:
Method C
This combines method A with method B:
# Normalised Probability
U = Z.groupby('I').agg({'P': sum})
E = pd.DataFrame(U)
E.columns = ['U']
K = Z.set_index('I').join(E)
K['P1'] = J['P1']
K['P2'] = K['P'] / K['U']
We can see that method B (P1) doesn't take account of uncertainty whereas method C (P2) does.
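The per-row normalisation used in methods B and C can also be written without the intermediate join. This is a sketch under the assumption that Z already has the group column I and a probability column P; the values below are made up for illustration:

```python
import pandas as pd

Z = pd.DataFrame({
    "I": [0, 0, 1, 1],          # index of the row of X each candidate pair belongs to
    "P": [0.6, 0.2, 0.1, 0.3],  # made-up match probabilities
})

# Divide each probability by the total for its group so every group sums to 1.
Z["P2"] = Z["P"] / Z.groupby("I")["P"].transform("sum")
print(Z)
```

transform("sum") returns a series aligned with the original rows, which is what makes the element-wise division possible in one step.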

ValueError: too many values to unpack (expected 4)

I am getting "ValueError: too many values to unpack (expected 4)" with the code below. Please help me!!
I am trying to lemmatize the text, cut off common words, and then add the rest to a library so I can identify the most common words and find the relationships between words.
def build_dataset(words, vocabulary_size):
    lexicon = []
    for l in words:
        all_words = word_tokenize(l.lower())
        lexicon += list(all_words )
    lexicon = [lemmatizer.lemmatize(i) for i in lexicon]
    w_counts = Counter(lexicon)
    word = []
    for w in w_counts:
        if 5000 > w_counts[w] > 50 :
            word.append(w)
    print(len(word))
    return word
    count = [['UNK', -1]]
    count.extend(collections.Counter(word).most_common(vocabulary_size - 1))
    dictionary = dict()
    for l2, _ in count:
        dictionary[l2] = len(dictionary)
    data = list()
    unk_count = 0
    for l2 in word:
        if l2 in dictionary:
            index = dictionary[l2]
        else:
            index = 0
            unk_count += 1
        data.append(index)
    count[0][1] = unk_count
    reverse_dictionary = dict(zip(dictionary.values(), dictionary.keys()))
    return data, count, dictionary, reverse_dictionary
data, count, dictionary, reverse_dictionary = build_dataset(words, vocabulary_size)
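One plausible reading of the error: the function reaches return word first, so it hands back a single list of words, and the caller then tries to unpack that list into four names. Whenever the list has more than four items, Python raises exactly this message. The sketch below reproduces the error with a made-up function and then shows an indexing step that returns four values; build_index and its sample tokens are illustrative and leave out the tokenizing and lemmatizing part:

```python
from collections import Counter

def returns_one_list():
    return ["a", "b", "c", "d", "e"]  # a single list with five items

try:
    w, x, y, z = returns_one_list()
    message = None
except ValueError as exc:
    message = str(exc)

def build_index(tokens, vocabulary_size):
    """Map tokens to integer ids, reserving id 0 for unknown ('UNK') tokens."""
    count = [["UNK", -1]]
    count.extend(Counter(tokens).most_common(vocabulary_size - 1))
    dictionary = {word: idx for idx, (word, _) in enumerate(count)}
    data = [dictionary.get(token, 0) for token in tokens]
    count[0][1] = data.count(0)  # number of tokens that fell back to 'UNK'
    reverse_dictionary = {idx: word for word, idx in dictionary.items()}
    return data, count, dictionary, reverse_dictionary

data, count, dictionary, reverse_dictionary = build_index(
    ["a", "b", "a", "c", "a", "b"], vocabulary_size=3
)
print(message)
print(data, dictionary)
```

Removing the early return, or splitting the filtering and the indexing into two functions, lets the four-value return at the end actually run.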