How to calculate the time complexity for backtracking with pruning? - time-complexity

The question is something like this:
Given a list of edges, find a path from SRC to DEST that yields the highest total points.
INPUT: [['A', 'B', 5], ['A', 'C', 2], ['B', 'C', 5]], find a path from A to B.
OUTPUT: ['A', 'C', 'B'], which gives 7 points.
I understand this is a graph problem; one way to solve it is DFS with backtracking, whereby we try all the paths that lead from A to B and record the one with the highest score.
The time complexity of this is probably O(N!) in a fully connected graph, since we are trying all permutations of the intermediate nodes.
However, I think we can optimise it by pruning while backtracking, e.g. by keeping track of the highest score achieved so far at each node: if the current score at a node is no better than the best already recorded there, we don't have to continue down that branch.
But I can't quite figure out the time complexity with pruning. Will the worst case still be O(N!), since we can technically still explore all permutations?

This problem is very similar to the TSP (Traveling Salesman Problem).
One way to solve it is dynamic programming with bit-masking, which runs in O(2^N * N^2) time and is better than O(N!).
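For concreteness, here is a minimal sketch of that bitmask DP (the function name and variable names are mine, not from the question); it assumes a small undirected graph with the edge format from the question:
def best_path_score(edges, src, dst):
    # Sketch of the O(2^N * N^2) bitmask DP for the highest-scoring
    # simple path. Returns -inf if no path exists.
    nodes = sorted({u for e in edges for u in e[:2]})
    idx = {v: i for i, v in enumerate(nodes)}
    n = len(nodes)
    w = [[None] * n for _ in range(n)]
    for u, v, pts in edges:  # undirected edges
        w[idx[u]][idx[v]] = w[idx[v]][idx[u]] = pts
    s, t = idx[src], idx[dst]
    NEG = float('-inf')
    # dp[mask][v]: best score of a simple path visiting exactly the
    # nodes in `mask` and ending at node v.
    dp = [[NEG] * n for _ in range(1 << n)]
    dp[1 << s][s] = 0
    best = NEG
    for mask in range(1 << n):
        for v in range(n):
            if dp[mask][v] == NEG:
                continue
            if v == t:
                best = max(best, dp[mask][v])
            for u in range(n):
                if w[v][u] is not None and not mask & (1 << u):
                    new_mask = mask | (1 << u)
                    dp[new_mask][u] = max(dp[new_mask][u],
                                          dp[mask][v] + w[v][u])
    return best

print(best_path_score([['A', 'B', 5], ['A', 'C', 2], ['B', 'C', 5]], 'A', 'B'))  # 7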

Related

Higher (4+) dimension/axis in numpy...are they ever actually used in computation?

What I mean by the title is that sometimes I come across code that performs numpy operations (for example sum or average) along a specified axis. For example:
np.sum([[0, 1], [0, 5]], axis=1)
I can grasp this concept, but do we ever actually do these operations along higher dimensions as well? Or is that not a thing? And if yes, how do you build intuition for high-dimensional datasets, and how do you make sure you are working along the right dimension/axis?
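Yes, reductions along higher axes are common, e.g. with batches of images or time series. A small sketch (the array and variable names are just illustrative):
import numpy as np

# A 3-D array: think of it as a batch of 2 "images", each 3 rows x 4 columns.
batch = np.arange(24).reshape(2, 3, 4)

per_image_sum = batch.sum(axis=(1, 2))   # one total per image -> shape (2,)
per_pixel_mean = batch.mean(axis=0)      # average over the batch -> shape (3, 4)
column_sums = batch.sum(axis=1)          # collapse the row axis -> shape (2, 4)

print(per_image_sum)                            # [ 66 210]
print(per_pixel_mean.shape, column_sums.shape)  # (3, 4) (2, 4)
One useful rule of thumb: the axis you reduce over is the one that disappears from the result's shape, so you can sanity-check by predicting the output shape before running.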

row-wise calculation of cosine similarity in pandas without looping

I have a pandas dataframe df with many rows. For each row, I want to calculate the cosine similarity between the row's A columns (first vector) and the row's B columns (second vector). In the end, I aim to get a vector with one cosine similarity value per row. I have found a solution, but it seems like it could be done much faster without this loop. Could anyone give me some feedback on this code?
Thank you very much!
cos_sim = {}
for row in np.unique(df.index):
    cos_sim[row] = scipy.spatial.distance.cosine(df[df.index == row][columnsA],
                                                 df[df.index == row][columnsB])
df['cos_sim'] = pd.Series(cos_sim)
Here comes some sample data:
df = pd.DataFrame({'featureA1': [2, 4, 1, 4],
                   'featureA2': [2, 4, 1, 4],
                   'featureB1': [10, 2, 1, 8],
                   'featureB2': [10, 2, 1, 8]},
                  index=['Pit', 'Mat', 'Tim', 'Sam'])
columnsA=['featureA1', 'featureA2']
columnsB=['featureB1', 'featureB2']
This is my desired output (cosine similarity for Pit, Mat, Tim and Sam):
cos_sim=[1, 1, 1, 1]
I am already receiving this output with my method, but I am sure the code could be improved from a performance perspective.
Several things you can improve on :)
Take a look at the DataFrame.apply function. pandas already offers you looping "under the hood".
df['cos_sim'] = df.apply(lambda _df: scipy.spatial.distance.cosine(_df[columnsA], _df[columnsB]), axis=1)
or something similar should be more performant
Also take a look at DataFrame.loc
df[df.index==row][columnsA]
and
df.loc[row,columnsA]
should be equivalent
If you really have to iterate over the dataframe (again, this should be avoided because of the performance penalty, and it is more difficult to read and understand), pandas gives you a generator over the rows (and their index):
for index, row in df.iterrows():
    scipy.spatial.distance.cosine(row[columnsA], row[columnsB])
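If you want to avoid Python-level loops altogether, the row-wise similarity can be computed with plain NumPy in one shot (note that scipy.spatial.distance.cosine returns a distance, i.e. 1 minus the similarity). A sketch using the sample data above:
import numpy as np

A = df[columnsA].to_numpy(dtype=float)
B = df[columnsB].to_numpy(dtype=float)

# Row-wise cosine similarity: the dot product of each row pair divided by
# the product of the row norms (subtract from 1 to get scipy's distance).
df['cos_sim'] = (A * B).sum(axis=1) / (
    np.linalg.norm(A, axis=1) * np.linalg.norm(B, axis=1))
On the sample data this gives [1, 1, 1, 1], matching the desired output, since each row's A and B vectors are parallel.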
Finally, as mentioned above, to get better answers on Stack Overflow, always provide a concrete example where the problem is reproducible. Otherwise it is much harder to interpret the question correctly and to test a solution.
Pretty old post, but I am replying for future readers. I created https://github.com/ma7555/evalify for all those row-wise similarity/distance calculations (disclaimer: I am the owner of the package).

NumPy Difference Between np.average() and np.mean() [duplicate]

NumPy has two different functions for calculating an average:
np.average()
and
np.mean()
Since it is unlikely that NumPy would include a redundant feature, there must be a nuanced difference.
This was a concept I was very unclear on when starting data analysis in Python, so I decided to write a detailed self-answer here, as I am sure others are struggling with it.
Short Answer:
'Mean' and 'Average' are two different things. People use them interchangeably, but they shouldn't. np.mean() gives you the arithmetic mean, whereas np.average() returns the arithmetic mean by default but can also compute a weighted average.
Long Answer and Background:
Statistics:
Since NumPy is mostly used for working with data sets, it is important to understand the mathematical concept that causes this confusion. In simple mathematics and everyday life we use the words 'average' and 'mean' interchangeably, when in fact they are not the same.
Mean: commonly refers to the 'Arithmetic Mean', the sum of a collection of numbers divided by the count of numbers in the collection.
Average: can refer to many different calculations, of which the 'Arithmetic Mean' is one. Others include the 'Median', 'Mode', 'Weighted Mean', 'Interquartile Mean' and many others.
What This Means For NumPy:
Back to the topic at hand. Since NumPy is normally used in applications related to mathematics, it needs to be more precise about the difference between average() and mean() than tools like Excel, where AVERAGE() is simply the function for finding the 'Arithmetic Mean'.
np.mean()
In NumPy, np.mean() will allow you to calculate the 'Arithmetic Mean' across a specified axis.
Here's how you would use it:
myArray = np.array([[3, 4], [5, 6]])
np.mean(myArray)
There are also parameters for changing which dtype is used and which axis the function should compute along (the default is the flattened array).
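For instance, with the same array:
import numpy as np

myArray = np.array([[3, 4], [5, 6]])
print(np.mean(myArray))          # 4.5  (flattened array)
print(np.mean(myArray, axis=0))  # [4. 5.]  (column means)
print(np.mean(myArray, axis=1))  # [3.5 5.5]  (row means)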
np.average()
np.average(), on the other hand, allows you to take a 'Weighted Mean', in which different numbers in your array may have different weights. For example, in the documentation we can see:
>>> data = range(1,5)
>>> data
[1, 2, 3, 4]
>>> np.average(data)
2.5
>>> np.average(range(1,11), weights=range(10,0,-1))
4.0
For the last call, if you were to take a non-weighted average of range(1, 11) you would expect the answer to be 5.5. However, it ends up being 4.0 because of the weights we applied to it.
If you don't have a good handle on what a 'weighted mean' is, we can try to simplify it:
Consider this a very elementary summary of our 'weighted mean'; it isn't going to be perfectly rigorous, but it should allow you to visualize what we're discussing.
A mean is the sum of all numbers divided by the count of numbers, which means every number carries an equal weight, or is counted once. For our sample this means:
(1+2+3+4+5+6+7+8+9+10)/10 = 5.5
A weighted mean counts numbers at different weights. With weights=range(10, 0, -1), the value 1 carries weight 10, the value 2 carries weight 9, and so on down to the value 10 with weight 1. Written out as repeated counts it looks like this:
(1*10 + 2*9 + 3*8 + 4*7 + 5*6 + 6*5 + 7*4 + 8*3 + 9*2 + 10*1)/55 = 220/55 = 4.0
Even though the actual number set contains only one instance of the number 1, we are counting it at 10 times the weight of the number 10. This can also be done the other way: we could count a number at 1/3 of its normal weight.
If you don't provide a weights parameter to np.average(), it will simply give you the equally weighted average over the flattened array, which is equivalent to np.mean().
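You can verify this quickly:
import numpy as np

a = np.array([[3, 4], [5, 6]])
# Without weights, np.average matches np.mean on the flattened array.
print(np.average(a), np.mean(a))  # 4.5 4.5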
Why Would I Ever Use np.mean()?
If np.average() can be used to find the flat arithmetic mean, you may be asking yourself "why would I ever use np.mean()?". np.mean() allows for a few useful parameters that np.average() does not. One of the key ones is the dtype parameter, which allows you to set the type used in the computation.
For example the NumPy docs give us this case:
Single precision:
>>> a = np.zeros((2, 512*512), dtype=np.float32)
>>> a[0, :] = 1.0
>>> a[1, :] = 0.1
>>> np.mean(a)
0.546875
Based on the calculation above it looks like our average is 0.546875, but if we set the dtype parameter to np.float64 we get a different result:
>>> np.mean(a, dtype=np.float64)
0.55000000074505806
The float64 result, 0.55000000074505806, is much closer to the actual average of 0.55.
Now, if you round both of these to two decimal places you get 0.55 in both cases. Where this accuracy becomes important is if you keep performing operations on the number, especially when dealing with very large (or very small) numbers that need high accuracy.
For example:
((((0.55000000074505806*184.6651)^5)+0.666321)/46.778) =
231,044,656.404611
((((0.55*184.6651)^5)+0.666321)/46.778) =
231,044,654.839687
(The two expressions are identical except that the second uses the rounded value 0.55.)
Even in simpler equations you can end up off by several decimal places, and that can be relevant in:
Scientific simulations: Due to lengthy equations, multiple steps and a high degree of accuracy needed.
Statistics: The difference between a few percentage points of accuracy can be crucial (for example in medical studies).
Finance: Continually being off by even a few cents in large financial models or when tracking large amounts of capital (banking/private equity) could result in hundreds of thousands of dollars in errors by the end of the year.
Important Word Distinction
Lastly, a point about interpretation: you may find yourself analyzing data where you are asked to find the 'Average' of a dataset. A different kind of average may give a more accurate representation of the data; for example, np.median() may be more representative than np.average() in cases with outliers, so it is important to know the statistical difference.

Finding the Shortest Path using BFS on an Undirected Graph, knowing the length of the SP

I was asked an interview question today and I was not able to solve it at the time.
The question is to find the minimum time complexity of finding the shortest path from node S to node T in a graph G where:
G is undirected and unweighted
The connection factor of G is given as B
The length of shortest path from S to T is given as K
The first thing I thought was that in the general case, BFS is the fastest way to get the SP from S to T, in O(V+E) time. Then how can we use B and K to reduce the time? I was not sure what a connection factor is, so I asked the interviewer; he told me it means that, on average, a node has B edges to other nodes. So I was thinking that if K = 1, the time complexity should be O(B). But wait, it is "on average", which means it could still be O(V+E), for example when the graph is like a star and all other nodes are connected to S.
If we assume that B is a strict upper limit, then the first level of BFS costs O(B), the second O(B^2), and so on, like a tree. Some nodes in a lower level may already have been visited in a previous level and therefore should not be added again. Still, in the worst case the graph is huge and no node has been visited before, and the time complexity is
O(B) + O(B^2) + O(B^3) + ... + O(B^K)
Using the formula for a geometric series, the sum is O(B(B^K - 1)/(B - 1)) = O(B^K). But this SUM should not exceed O(V+E).
So, is the time complexity O(min(SUM, V+E))?
I have no idea how to correctly solve this problem. Any help is appreciated.
Your analysis seems correct. Please refer to the following references.
http://axon.cs.byu.edu/~martinez/classes/312/Slides/Paths.pdf
https://courses.engr.illinois.edu/cs473/sp2011/lectures/03_class.pdf
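To make the bound concrete, here is a minimal sketch of a BFS that never expands beyond level K (the function and variable names are illustrative):
from collections import deque

def shortest_path_within_k(adj, s, t, k):
    # adj: dict mapping each node to an iterable of its neighbours.
    # The work is bounded both by O(V + E) and, if every node has at
    # most B neighbours, by O(B + B^2 + ... + B^K).
    visited = {s}
    queue = deque([(s, 0)])
    while queue:
        node, depth = queue.popleft()
        if node == t:
            return depth
        if depth == k:
            continue  # prune: never expand past level K
        for neighbour in adj[node]:
            if neighbour not in visited:
                visited.add(neighbour)
                queue.append((neighbour, depth + 1))
    return None  # no path of length <= K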

Maximizing in Mathematica with multiple maxima

I'm trying to compute the maxima of some function of one variable that has several peaks (it is calculated from a non-trivial convolution, so, no, I don't have a closed-form expression for it).
Using the command:
NMaximize[{f[x], 0 < x < 1}, x, AccuracyGoal -> 4, PrecisionGoal -> 4]
(I'm not that worried about super accuracy; a rough estimate within 10^-4 is already enough.)
The result of this is x* = 0.55, which is not what it should be (i.e., it is picking the third peak).
Is there any way of telling Mathematica that the global maximum is the first one counting from x = 0 (I know this is always true), or of making Mathematica search with a better approach? (Note: I don't want things like simulated annealing; each function evaluation is very costly!)
Thanks very much!
Try FindMaximum with a starting point of 0 or some similarly small value.
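For example, something like this (a sketch, assuming the same f and constraint as in the question):
FindMaximum[{f[x], 0 < x < 1}, {x, 0.01}]
FindMaximum performs a local search from the given starting point, so starting just to the right of 0 should climb the first peak rather than jump to a later one.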