So, it's an indirect sort that returns the indices that would sort an array. Why is it "argsort" (which makes some sense given that it takes an argument -- the type of sort to use) but not "indirect_sort" or something like that? Or get_sort_indexer?
Googling for argsort gives NumPy as the first result, so it might indeed be an uncommon name. It may be historic: apparently it already existed in 1997 in Numeric, a predecessor to NumPy.
I would guess they came up with the name as a parallel to argmax, which is a standard mathematical function implemented in many places (NumPy, Mathematica, ...), even though the analogy is not perfect.
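The behaviour itself is easy to mimic in plain Python, which makes the "indices that would sort" idea concrete (a small illustrative sketch, not numpy's implementation):

```python
def argsort(a):
    """Return the indices that would sort `a` (what numpy.argsort computes)."""
    return sorted(range(len(a)), key=a.__getitem__)

a = [30, 10, 20]
idx = argsort(a)
print(idx)                   # [1, 2, 0]
print([a[i] for i in idx])   # [10, 20, 30] -- applying the permutation sorts a
```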
Related
I’m fairly new to pandas. I want to know if there’s a way to associate my own arbitrary key/value pairs with a given cell.
For example, maybe row 5 of column ‘customerCount’ contains the int value 563. I want to use the pandas functions normally for computing sums, averages, etc. However, I also want to remember that this particular cell also has my own properties such as verified=True or flavor=‘salty’ or whatever.
One approach would be to make additional columns for those values. That would work if you just have a property or two on one column, but it would result in an explosion of columns if you have many properties, and you want to mark them on many columns.
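A minimal sketch of that extra-columns approach, using a hypothetical `column__property` naming convention (the column and property names here are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({"customerCount": [410, 522, 563]})

# One companion column per (column, property) pair -- hypothetical scheme
df["customerCount__verified"] = [False, True, True]
df["customerCount__flavor"] = [None, None, "salty"]

# Numeric operations still work unchanged on the original column
total = df["customerCount"].sum()
print(total)  # 1495
```

As the question notes, this scales poorly: properties on many columns multiply the column count quickly.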
I’ve spent a while researching it, and it looks like I could accomplish this by subclassing pd.Series. However, there would be a considerable amount of deep hacking involved. For example, if a column’s underlying data would normally be an ndarray of dtype ‘int’, I could internally store my own list of objects in my subclass, but override __getattribute__ so that my_series.dtype still yields ‘int’. The intent is that pandas would still treat the series as having dtype ‘int’, without knowing or caring what’s going on under the covers. However, it looks like I might have to override a bunch of things in my pd.Series subclass to make pandas behave normally.
Is there some easier way to accomplish this? I don’t necessarily need a detailed answer; I’m looking for a pointer in the right direction.
Thanks.
My teacher made this argument and asked us what could be wrong with it.
for an array A of n distinct numbers. Since there are n! permutations of A,
we cannot check for each permutation whether it is sorted, in a total time which
is polynomial in n. Therefore, sorting A cannot be in P.
My friend thought the conclusion should instead be: therefore, sorting A cannot be in NP.
Is that it, or are we thinking about this too simplistically?
The problem with this argument is that it fails to adequately specify the exact problem.
Sorting can be linear-time (O(n)) in the number of elements to sort, if you're sorting a large list of integers drawn from a small pool (counting sort, radix sort).
Sorting can be linearithmic-time (O(n log n)) in the number of elements to sort, if you're sorting a list of arbitrary things which are all totally ordered according to some ordering relation (e.g., less than or equal to on the integers).
Sorting based on a partial order (e.g. topological sorting) must be analyzed in yet another way.
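The linear-time case above can be made concrete with counting sort, which never compares elements and runs in O(n + k) for n integers drawn from range(k):

```python
def counting_sort(xs, k):
    """Sort integers in range(k) in O(n + k) time -- no comparisons needed."""
    counts = [0] * k
    for x in xs:            # one pass to tally each value
        counts[x] += 1
    out = []
    for value, c in enumerate(counts):
        out.extend([value] * c)   # emit each value as many times as it occurred
    return out

print(counting_sort([3, 1, 4, 1, 5, 2], k=6))  # [1, 1, 2, 3, 4, 5]
```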
We can imagine a problem like sorting whereby the sortedness of a list cannot be determined by comparing adjacent entries only. In the extreme case, sortedness (according to what we are considering to be sorting) might only be verifiable by checking the entire list. If our kind of sorting is designed so as to guarantee there is exactly one sorted permutation of any given list, the time complexity is factorial-time (O(n!)) and the problem is not in P.
That's the real problem with this argument. If your professor is assuming that "sorting" refers to sorting integers not in any particular small range, the problem with the argument then is that we do not need to consider all permutations in order to construct the sorted one. If I have a bag with 100 marbles and I ask you to remove three marbles, the time complexity is constant-time; it doesn't matter that there are n(n-1)(n-2)/6 = 161700, or O(n^3), ways in which you can accomplish this task.
The argument is a non sequitur: the conclusion does not logically follow from the previous steps. Why doesn't it follow? Giving a satisfying answer to that question requires knowing why the person who wrote the argument thinks it is correct, and then addressing their misconception. In this case, the person who wrote the argument is your teacher, who doesn't think the argument is correct, so there is no misconception to address and hence no completely satisfying answer to the question.
That said, my explanation would be that the argument is wrong because it proposes a specific algorithm for sorting - namely, iterating through all n! permutations and choosing the one which is sorted - and then assumes that the complexity of the problem is the same as the complexity of that algorithm. This is a mistake because the complexity of a problem is defined as the lowest complexity out of all algorithms which solve it. The argument only considered one algorithm, and didn't show anything about other algorithms which solve the problem, so it cannot reach a conclusion about the complexity of the problem itself.
Please, I want to use NLsolve in Julia for solving systems of nonlinear equations. The package requires you to set initial values for the unknowns of your system. With my system of equations (sorry, I cannot include it here since I need at least 10 reputation to be allowed to), when I keep the initial values [0.1; 1.2] as in the example from the documentation, I obtain the "paper-and-pencil" solution. But if I set the initial values to [1.1; 2.2], for example, I receive the following error:
DomainError:
Exponentiation yielding a complex result requires a complex argument.
Replace x^y with (x+0im)^y, Complex(x)^y, or similar.
Please, how should I come up with suitable initial values for a given system of equations?
A rootfinder is always dependent on the initial condition. That's just how these algorithms work. The closer your guess is to the true root, the better off you are. Trust-region methods are more robust than Newton methods, but you'll never get away from the fact that these are local methods.
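The dependence on the initial guess is easy to see with a bare-bones Newton iteration (sketched here in Python for illustration; NLsolve itself is a Julia package and uses more robust machinery). Starting on either side of zero converges to a different root of the same function:

```python
def newton(f, fprime, x0, tol=1e-10, max_iter=100):
    """Minimal Newton iteration; which root it finds depends on x0."""
    x = x0
    for _ in range(max_iter):
        step = f(x) / fprime(x)
        x -= step
        if abs(step) < tol:
            return x
    raise RuntimeError("did not converge")

f = lambda x: x**2 - 4        # roots at +2 and -2
fp = lambda x: 2 * x

print(newton(f, fp, 1.0))     # converges to  2.0
print(newton(f, fp, -1.0))    # converges to -2.0
```

This also hints at the DomainError above: if an iterate wanders into a region where the residual is undefined over the reals (e.g. a fractional power of a negative number), the solver fails even though a root exists elsewhere.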
I'm looking for the proper data type (such as IndexedSeq[Double]) to use when designing a domain-specific numerical computing library. For this question, I'm limiting the scope to working with 1-dimensional arrays of Double. The library will define a number of functions that are typically applied to each element of the 1D array.
Considerations:
Prefer immutable data types, such as Vector or IndexedSeq
Want to minimize data conversions
Reasonably efficient in space and time
Friendly for other people using the library
Elegant and clean API
Should I use something higher up the collections hierarchy, such as Seq?
Or is it better to just define the single-element functions and leave the mapping/iterating to the end user?
This seems less efficient (since some computations could be done once per set of calls), but at the same time it is a more flexible API, since it would work with any type of collection.
Any recommendations?
If your computations are to do anything remotely computationally intensive, use Array, either raw or wrapped in your own classes. You can provide a collection-compatible wrapper, but make that an explicit wrapper for interoperability only. Everything other than Array is generic and thus boxed and thus comparatively slow and bulky.
If you do not use Array, people will be forced to abandon whatever things you have and just use Array instead when performance matters. Maybe that's okay; maybe you want the computations to be there for convenience not efficiency. In that case, I suggest using IndexedSeq for the interface, assuming that you want to let people know that indexing is not outrageously slow (e.g. is not List), and use Vector under the hood. You will use about 4x more memory than Array[Double], and be 3-10x slower for most low-effort operations (e.g. multiplication).
For example, this:
val u = v.map(1.0 / _) // v is Vector[Double]
is about three times slower than this:
val u = new Array[Double](v.length)
var j = 0
while (j<u.length) {
u(j) = 1.0/v(j) // v is Array[Double]
j += 1
}
If you use the map method on Array, it's just as slow as the Vector[Double] way; operations on Array are generic and hence boxed. (And that's where the majority of the penalty comes from.)
I use Vector all the time when I deal with numerical values, since it provides very efficient random access as well as append/prepend.
Also notice that the current default collection for immutable indexed sequences is Vector, so if you write code like for (i <- 0 until n) yield {...}, the static return type is IndexedSeq[...] but the runtime type is Vector. So it may be a good idea to always use Vector, since some binary operators that take two sequences as input may benefit from both arguments having the same implementation type. (This is not really the case now, but someone has pointed out that vector concatenation could run in O(log N) time, as opposed to the current linear time, which stems from the second parameter being treated as a general sequence.)
Nevertheless, I believe that Seq[Double] should already provide most of the interfaces you need. And since mapping over a Range does not yield a Vector directly, I usually use Seq[Double] as my input argument type, so that it has some generality. I would expect efficiency to be optimized in the underlying implementation.
Hope that helps.
Not sure what exactly to google for this question, so I'll post it directly to SO:
Variables in Haskell are immutable
Pure functions should result in same values for same arguments
From these two points it's possible to deduce that if you call somePureFunc somevar1 somevar2 in your code twice, it only makes sense to compute the value during the first call. The resulting value can be stored in some sort of a giant hash table (or something like that) and looked up during subsequent calls to the function. I have two questions:
Does GHC actually do this kind of optimization?
If it does, what is the behaviour in the case when it's actually cheaper to repeat the computation than to look up the results?
Thanks.
GHC doesn't do automatic memoization. See the GHC FAQ on Common Subexpression Elimination (not exactly the same thing, but my guess is that the reasoning is the same) and the answer to this question.
If you want to do memoization yourself, then have a look at Data.MemoCombinators.
Another way of looking at memoization is to use laziness to take advantage of memoization. For example, you can define a list in terms of itself. The definition below is an infinite list of all the Fibonacci numbers (taken from the Haskell Wiki)
fibs = 0 : 1 : zipWith (+) fibs (tail fibs)
Because the list is realized lazily, it's similar to having precomputed (memoized) the previous values. E.g., fibs !! 10 will force evaluation of the first eleven elements, so that a subsequent fibs !! 11 is much faster.
Saving every function call result (cf. hash consing) is valid but can be a giant space leak and in general also slows your program down a lot. It often costs more to check if you have something in the table than to actually compute it.
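The trade-off described above (table lookups versus recomputation) is the same one you weigh when memoizing explicitly. Sketched in Python for illustration, since GHC won't do it for you automatically in Haskell either:

```python
import functools

call_count = 0

@functools.lru_cache(maxsize=None)
def fib(n):
    """Naive recursion, but each distinct argument is computed only once."""
    global call_count
    call_count += 1
    return n if n < 2 else fib(n - 1) + fib(n - 2)

print(fib(30))      # 832040
print(call_count)   # 31 -- one evaluation per distinct argument, not ~2.7 million
```

The cache holds every result forever (maxsize=None), which is exactly the "giant space leak" risk mentioned above; bounding the cache size trades some recomputation for bounded memory.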