Big 0 attempts to answer the question of inefficiency in algorithmic complexity. N+1 describes inefficiency as it relates to database queries in terms of separate queries to populate each item in a collection.
I'm currently trying to get my head around each of these concepts in different contexts at work, and I'm wondering now if somebody could explain whether these two concepts relate to each other in any way? Could somebody provide a description that would apply to both of them?
Big O notation for complexity is defined using number of operations of a Turing machine and can therefore describe any algorithm. The N+1 select problem describes inefficient relational algorithm (query), which needs always N+1 operations for every record. As that query is an algorithm, you can analyse its complexity.
O(N+1)=O(N)
This means you have linear complexity. Now if we used the correct algorithm, we would need only one operation (select) per record for each of two tables. The complexity would be:
O(2)=O(1)
This algorithm has constant complexity. This shows, that by analysing the complexity of both algorithms you would see which one suffers from the N+1 selects problem.
Is this clear?
Related
For large values of n, an algorithm that takes 20000n^2 steps has better time complexity (takes less time) than one that takes 0.001n^5 steps
I believe this statement is true. But, why?
If there are more steps wouldn't that take more time?
Computational complexity is considered in the asymptotic sense because the important question is usually of scaling. Even with your clear case, the ^5 algorithm begins to take longer around 275 items - which isn't very many. See this figure from wolfram alpha:
Quoting from the wikipedia article linked above:
Usually asymptotic estimates are used because different implementations of the same algorithm may differ in efficiency. However the efficiencies of any two "reasonable" implementations of a given algorithm are related by a constant multiplicative factor called a hidden constant.
All that said, if you have two comparable algorithms and the one with less complexity has a significant constant coefficient and you're only going to process 10 items, then it very well may be a good idea to choose the less efficient one. Some common libraries even switch algorithms depending upon the size of the data being processed; this is called a hybrid algorithm and Python's sorted implementation, Timsort uses it to switch between insertion sort and merge sort.
Theory vs practice here.
Regarding time complexity, and I have a conceptual question that we didn't get to go deeper into in class.
Here it is:
There's a barbaric BROOT force algorithm, O(n^3)... and we got it down o O(n) and it was considered good enough. If we dive in deeper, it is actually O(n)+O(n), two separate iterations of the input. I came up with another way which was actually O(n/2). But those two algorithms are considered to be the same since both are O(n) and as n reaches infinity, it makes no difference, so not necessary at all once we reach O(n).
My question is:
In reality, in practice, we always have a finite number of inputs (admittedly occasionally in the trillions). So following the time complexity logic, O(n/2) is four times as fast as O(2n). So if we can make it faster, why not?
Time complexity is not everything. As you already noticed, the Big-Oh can hide a lot and also assumes that all operations cost the same.
In Practice you should always try to find a fast/the fastest solution for your problem. Sometimes this means that you use a algorithm with a bad complexity but good constants if you know that your problem is always small. Depending on your use case, you also want to implement optimizations that utilize hardware properties like cache optimizations.
I have an upcoming interview and was looking through some technical interview questions and I came across this one. It is asking for the time complexity for the insertion and deletion functions of a hash map. The consensus seems to be that the time complexity is O(1) if the has map is distributed evenly but O(n) if they are all in the same pool.
I guess my question is how exactly are hash maps stored in memory? How would these 2 cases happen?
One answer on your linked page is:
insertion always would be O(1) if even not properly distributed (if we
make linked list on collision) but Deletion would be O(n) in worst
case.
This is not a good answer. A generalized answer to time complexity for a hashmap would come to a similar statement as the Wikipedia article on hash tables:
Time complexity
in big O notation
Average Worst case
Space O(n) O(n)
Search O(1) O(n)
Insert O(1) O(n)
Delete O(1) O(n)
To adress your question how hash maps are stored in memory: There are a number of "buckets" that store values in the average case, but must be expanded to some kind of list when a hash collision occurs. Good explanations of hash tables are the Wikipedia article, this SO question and this C++ example.
The time complexity table above is like this because in the average case, a hash map just looks up and stores single values, but collisions make everything O(n) in worst case, where all your elements share a bucket and the behaviour is similar to the list implementation you chose for that case.
Note that there are specialized implementations that adress the worst cases here, also described in the Wikipedia article, but each of them has other disadvantages, so you'll have to choose the best for your use case.
This may be a simple question for those know-how guys. But I cannot figure it out by myself.
Suppose there are a large number of objects that I need to select some from. Each object has two known variables: cost and benefit. I have a budget, say $1000. How could I find out which objects I should buy to maximize the total benefit within the given budget? I want a numeric optimization solution. Thanks!
Your problem is called the "knapsack problem". You can read more on the wikipedia page. Translating the nomenclature from your original question into that of the wikipedia article, your problem's "cost" is the knapsack problem's "weight". Your problem's "benefit" is the knapsack problem's "value".
Finding an exact solution is an NP-complete problem, so be prepared for slow results if you have a lot of objects to choose from!
You might also look into Linear Programming. From MathWorld:
Simplistically, linear programming is
the optimization of an outcome based
on some set of constraints using a
linear mathematical model.
Yes, as stated before, this is the knapsack problem and I would choose to use linear programming.
The key to this problem is storing data so that you do not need to recompute things more than once (if enough memory is available). There are two general ways to go about linear programming: top-down, and bottom - up. This one is a bottom up problem.
(in general) Find base case values, what is the most optimal object to select for a small case. Then build on this. If we allow ourselves to spend more money what is the best combination of objects for that small increment in money. Possibilities could be taking more of what you previously had, taking one new object and replacing the old one, taking another small object that will still keep you under your budget etc.
Like I said, the main idea is to not recompute values. If you follow this pattern, you will get to a high number and find that in order to buy X amount of dollars worth of goods, the best solution is combining what you had for two smaller cases.
Something pretty annoying in evolutionary computing is that mildly different and overlapping concepts tend to pick dramatically different names. My latest confusion because of this is that gene-expression-programming seems very similar to cartesian-genetic-programming.
(how) Are these fundamentally different concepts?
I've read that indirect encoding of GP instructions is an effective technique ( both GEP and CGP do that ). Has there been reached some sort of consensus that indirect encoding has outdated classic tree bases GP?
Well, it seems that there is some difference between gene expression programming (GEP) and cartesian genetic programming (CGP or what I view as classic genetic programming), but the difference might be more hyped up than it really ought to be. Please note that I have never used GEP, so all of my comments are based on my experience with CGP.
In CGP there is no distinction between genotype and a phenotype, in other words- if you're looking at the "genes" of a CGP you're also looking at their expression. There is no encoding here, i.e. the expression tree is the gene itself.
In GEP the genotype is expressed into a phenotype, so if you're looking at the genes you will not readily know what the expression is going to look like. The "inventor" of GP, Cândida Ferreira, has written a really good paper and there are some other resources which try to give a shorter overview of the whole concept.
Ferriera says that the benefits are "obvious," but I really don't see anything that would necessarily make GEP better than CGP. Apparently GEP is multigenic, which means that multiple genes are involved in the expression of a trait (i.e. an expression tree). In any case, the fitness is calculated on the expressed tree, so it doesn't seem like GEP is doing anything to increase the fitness. What the author claims is that GEP increases the speed at which the fitness is reached (i.e. in fewer generations), but frankly speaking you can see dramatic performance shifts from a CGP just by having a different selection algorithm, a different tournament structure, splitting the population into tribes, migrating specimens between tribes, including diversity into the fitness, etc.
Selection:
random
roulette wheel
top-n
take half
etc.
Tournament Frequency:
once per epoch
once per every data instance
once per generation.
Tournament Structure:
Take 3, kill 1 and replace it with the child of the other two.
Sort all individuals in the tournament by fitness, kill the lower half and replace it with the offspring of the upper half (where lower is worse fitness and upper is better fitness).
Randomly pick individuals from the tournament to mate and kill the excess individuals.
Tribes
A population can be split into tribes that evolve independently of each-other:
Migration- periodically, individual(s) from a tribe would be moved to another tribe
The tribes are logically separated so that they're like their own separate populations running in separate environments.
Diversity Fitness
Incorporate diversity into the fitness, where you count how many individuals have the same fitness value (thus are likely to have the same phenotype) and you penalize their fitness by a proportionate value: the more individuals with the same fitness value, the more penalty for those individuals. This way specimens with unique phenotypes will be encouraged, therefore there will be much less stagnation of the population.
Those are just some of the things that can greatly affect the performance of a CGP, and when I say greatly I mean that it's in the same order or greater than Ferriera's performance. So if Ferriera didn't tinker with those ideas too much, then she could have seen much slower performance of the CGPs... especially if she didn't do anything to combat stagnation. So I would be careful when reading performance statistics on GEP, because sometimes people fail to account for all of the "optimizations" available out there.
There seems to be some confusion in these answers that must be clarified. Cartesian GP is different from classic GP (aka tree-based GP), and GEP. Even though they share many concepts and take inspiration from the same biological mechanisms, the representation of the individuals (the solutions) varies.
In CGPthe representation (mapping between genotype and phenotype) is indirect, in other words, not all of the genes in a CGP genome will be expressed in the phenome (a concept also found in GEP and many others). The genotypes can be coded in a grid or array of nodes, and the resulting program graph is the expression of active nodes only.
In GEP the representation is also indirect, and similarly not all genes will be expressed in the phenotype. The representation in this case is much different from treeGP or CGP, but the genotypes are also expressed into a program tree. In my opinion GEP is a more elegant representation, easier to implement, but also suffers from some defects like: you have to find the appropriate tail and head size which is problem specific, the mnltigenic version is a bit of a forced glue between expression trees, and finally it has too much bloat.
Independently of which representation may be better than the other in some specific problem domain, they are general purpose, can be applied to any domain as long as you can encode it.
In general, GEP is simpler from GP. Let's say you allow the following nodes in your program: constants, variables, +, -, *, /, if, ...
For each of such nodes with GP you must create the following operations:
- randomize
- mutate
- crossover
- and probably other genetic operators as well
In GEP for each of such nodes only one operation is needed to be implemented: deserialize, which takes array of numbers (like double in C or Java), and returns the node. It resembles object deserialization in languages like Java or Python (the difference is that deserialization in programming languages uses byte arrays, where here we have arrays of numbers). Even this 'deserialize' operation doesn't have to be implemented by the programmer: it can be implemented by a generic algorithm, just like it's done in Java or Python deserialization.
This simplicity from one point of view may make searching of best solution less successful, but from other side: requires less work from programmer and simpler algorithms may execute faster (easier to optimize, more code and data fits in CPU cache, and so on). So I would say that GEP is slightly better, but of course the definite answer depends on problem, and for many problems the opposite may be true.