Sorting array where many of the keys are identical - time complexity analysis

The question goes like this:
Given an array of n elements where n^(2001/2002) of the elements are the same. The worst-case time complexity of sorting the array (with RAM model assumptions) will be:
So, I thought to use a selection algorithm in order to find the element whose rank is n^(2001/2002) (the n^(2001/2002)th-smallest element); call it P. This should take O(n). Next, I take every element which doesn't equal P and put it in another array. In total I will have k = n - n^(2001/2002) elements. Sorting this array will cost O(k log k), which equals O(n log n). Finally, I will find the largest element which is smaller than P and the smallest element which is bigger than P, and then I can sort the array.
All of this takes O(n log n).
Note: if , then we can reduce the time to O(n).
I have a few questions: is my analysis correct? Is there any way to reduce the time complexity? Also, what are the RAM model assumptions?
Thanks!

Your analysis is wrong - there is no guarantee that the n^(2001/2002)th-smallest element is actually one of the duplicates.
n^(2001/2002) duplicates simply don't constitute enough of the input to make things easier, at least in theory. Sorting the input is still at least as hard as sorting the n - n^(2001/2002) = Θ(n) other elements, and under standard comparison-sort assumptions in the RAM model, that takes Ω(n log n) worst-case time.
(For practical input sizes, n^(2001/2002) duplicates would be at least 98% of the input, so isolating the duplicates and sorting the rest would be both easy and highly efficient. This is one of those cases where the asymptotic analysis doesn't capture the behavior we care about in practice.)
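For what it's worth, here is a minimal Python sketch of that practical approach. The function name is made up, and it assumes the heavily duplicated value is simply the most frequent one, which is not guaranteed by the question as stated:

    from bisect import bisect_left
    from collections import Counter

    def sort_with_many_duplicates(arr):
        """Isolate the heavily duplicated value, sort only the rest,
        then splice the block of duplicates back into its position."""
        if not arr:
            return []
        counts = Counter(arr)
        dup_value, dup_count = counts.most_common(1)[0]   # assumption: the duplicate is the most frequent key
        rest = sorted(x for x in arr if x != dup_value)   # O(k log k) on the k non-duplicates
        pos = bisect_left(rest, dup_value)                # where the duplicate block belongs
        return rest[:pos] + [dup_value] * dup_count + rest[pos:]

    print(sort_with_many_duplicates([7, 3, 7, 7, 9, 1, 7]))  # [1, 3, 7, 7, 7, 7, 9]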

Related

time complexity of HashSet

I was discussing with a friend the design of a HashSet that uses the mod function as the hashing function.
The time complexity of such an implementation appears to be O(N/k), where N is the total number of items stored in the set and k is the total number of buckets. This time complexity assumes that all items are distributed evenly among the buckets and that the average bucket size is N/k.
I confused myself because I believe the time complexity should be O(N), since time complexity is the worst-case performance. Here the worst case could be that all N items go to the same bucket and the value we are looking for is at the end of that bucket. Please help me here.
You're right that the worst case is all items going into one bucket. The items being evenly distributed is the best case. That said, O(N/k) is the same as O(N) if k is held constant, since constants can be neglected. I would not expect k to be part of the input to a lookup anyway. If k can vary, then it is different, but the worst case is still O(N).
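As a concrete illustration of the point above, here is a minimal Python sketch of a chained, mod-based hash set. The class name ModHashSet is made up, and it assumes non-negative integer elements:

    class ModHashSet:
        """Chained hash set that uses `item % k` as the bucket index."""

        def __init__(self, k=16):
            self.k = k
            self.buckets = [[] for _ in range(k)]

        def add(self, item):
            bucket = self.buckets[item % self.k]
            if item not in bucket:          # linear scan of a single bucket
                bucket.append(item)

        def contains(self, item):
            # Average case: the bucket holds about N/k items.
            # Worst case: every item landed in this bucket, so the scan is O(N).
            return item in self.buckets[item % self.k]

Adding items that are all congruent mod k (e.g. 0, k, 2k, ...) sends everything to bucket 0, which is exactly the O(N) worst case described in the answer.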

Oracle SQL or PLSQL scale with load

Suppose I have a query (it has joins on multiple tables), and assume it is tuned and optimized. This query runs against the target database/tables with N records, returns R records, and takes time T. Now the load gradually increases: say the target record count becomes N2, the result it gives is R2, and the time it takes is T2. Assuming that I have allocated enough memory to Oracle, will N2/N be close to T2/T? That is, will a proportional increase in the load result in a proportional increase in execution time? For this question let's say N2 = 5N, meaning the load has increased to 5 times. Then the time taken to complete this query would also be about 5 times, or a little more, right?
So, to reduce this proportional growth in time, do we have options in Oracle, like the parallel hint, etc.? In Java we split the job across multiple threads, and with 2 times the load and 2 times the worker threads we get almost the same completion time. So with increasing load we increase the worker threads and handle the scaling issue reasonably well. Is such a thing possible in Oracle, or does Oracle take care of this in the back end and scale by splitting the load internally into parallel processing? Here, I have multi-core processors. I will experiment with it, but expert opinion, if available, will help.
No. Query algorithms do not necessarily grow linearly.
You should probably learn something about algorithms and complexity. Many algorithms used in a database are super-linear. For instance, ordering a set of rows has a complexity of O(n log n), meaning that if you double the data size, the time taken for sorting more than doubles.
This is also true of index lookups and various join algorithms.
On the other hand, if your query is looking up a few rows using a b-tree index, then the complexity is O(log n) -- this is sublinear. So index lookups grow more slowly than the size of the data.
So, in general you cannot assume that increasing the size of data by a factor of n has a linear effect on the time.
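To put rough numbers on that, here is a small Python sketch (pure back-of-the-envelope arithmetic with assumed cost models, not a database benchmark) showing how the two growth rates behave when the row count increases five times:

    import math

    def scaling_factor(cost, n1, n2):
        """Ratio of the predicted cost at size n2 versus size n1."""
        return cost(n2) / cost(n1)

    n1, n2 = 1_000_000, 5_000_000   # load increases 5x

    sort_ratio = scaling_factor(lambda n: n * math.log(n), n1, n2)    # O(n log n) sort/join work
    lookup_ratio = scaling_factor(lambda n: math.log(n), n1, n2)      # O(log n) b-tree lookup

    print(f"O(n log n) work grows by about {sort_ratio:.1f}x")        # a bit more than 5x
    print(f"O(log n) lookup grows by about {lookup_ratio:.2f}x")      # far less than 5x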

Is there a significant benefit to partitioning rather than pivoting when (quick)sorting?

In each phase of the Quick-Sort algorithm, a pivot is chosen, and the data is bi-partitioned into elements below and above the pivot (and some tie breaking for being equal to it).
Now, each pivoting step requires a pass over the data. Also, the variance of the pivot's rank around the median is pretty high. Instead, one could conceivably take a sample of k elements, i.i.d., and partition the data into k+1 parts, using the sample elements as boundaries for the different parts. And while some parts may be too small or too big, most should be fine; the rest will need some correction (e.g. by counting), but that should be doable.
I've not tried such an approach out over (binary) Quick-Sort but was wondering if such an approach is used in practice at all, and if there's actually benefit to doing so.
PS - If you can say something about the amenability to parallelization (using AVX-512 on a CPU, with a Xeon Phi, with a GPU etc.) - that would be extra helpful.
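For reference, the idea described in this question is essentially what the samplesort family of algorithms does. A rough Python sketch of the multi-way partitioning (the sample size k and the small-input cutoff are arbitrary choices here, not tuned values):

    import bisect
    import random

    def sample_partition_sort(data, k=8, cutoff=32):
        """Sort by partitioning around k sampled splitters instead of a single pivot."""
        if len(data) <= cutoff:
            return sorted(data)                      # small parts: fall back to a plain sort
        splitters = sorted(random.sample(data, k))   # sample used as part boundaries
        parts = [[] for _ in range(k + 1)]
        for x in data:                               # one pass assigns each element to a part
            parts[bisect.bisect_left(splitters, x)].append(x)
        result = []
        for part in parts:
            if len(part) == len(data):               # degenerate split (e.g. all keys equal)
                result.extend(sorted(part))
            else:
                result.extend(sample_partition_sort(part, k, cutoff))
        return result

    print(sample_partition_sort(list(range(100, 0, -1)))[:5])  # [1, 2, 3, 4, 5]

One reason this family is interesting in practice is that, after the single classification pass, the k+1 parts can be sorted independently, which makes the scheme amenable to parallelization.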

Given an array of N integers how to find the largest element which appears an even number of times in the array with minimum time complexity

You are given an array of N integers. You are asked to find the largest element which appears an even number of times in the array. What is the time complexity of your algorithm? Can you do this without sorting the entire array?
You could do it in O(n log n) with a table-lookup method. For each element in the list, look it up in the table. If it is missing, insert a key-value pair with the key being the element and the value being the number of appearances (starting at one); if it is present, increment the appearances. At the end, just loop through the table in O(n) and look for the largest key with an even value.
In theory, for an ideal hash table, a lookup operation is O(1). So you can find and/or insert all n elements in O(n) time, making the total complexity O(n). However, in practice you will have trouble with space allocation (you need much more space than the data set size) and collisions (which are the reason you need that extra space). This makes the O(1) lookup difficult to achieve; in the worst case a lookup can cost as much as O(n) (though that is unlikely), making the total complexity O(n^2).
Instead, you can be safer with a tree-based table, that is, one where the keys are stored in a binary search tree. Lookup and insertion operations are both O(log n) in this case, provided that the tree is balanced; there is a wide range of tree structures to help ensure this, e.g. red-black trees, AVL trees, splay trees, B-trees, etc. (Google is your friend). This will make the total complexity a guaranteed O(n log n).
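A minimal Python sketch of the counting approach described above, using the built-in Counter (a hash table, so the expected-O(n) versus pathological-worst-case caveats from the answer apply):

    from collections import Counter

    def largest_with_even_count(nums):
        """Return the largest value that appears an even number of times, or None."""
        counts = Counter(nums)                                 # one pass builds the table
        evens = [value for value, c in counts.items() if c % 2 == 0]
        return max(evens) if evens else None

    print(largest_with_even_count([5, 1, 5, 2, 2, 3]))  # 5 (it appears twice)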

Space Complexity of Ids Algorithm

Hello, I am trying to calculate the space complexity of the IDS (iterative deepening depth-first search) algorithm, but I cannot figure out how to do that. I can't really understand how the algorithm visits the nodes of a tree, so I can't work out how much space it needs. Can you help me?
As far as I have understood, IDS works like this: starting with a limit of 0, meaning the depth of the root node in a graph (or your starting point), it performs a depth-first search until it exhausts the nodes it finds within the sub-graph defined by the limit. It then proceeds by increasing the limit by one and performing a depth-first search from the same starting point, but on the now bigger sub-graph defined by the larger limit. This way, IDS manages to combine the benefits of DFS with those of BFS (breadth-first search). I hope this clears up a few things.
from Wikipedia:
The space complexity of IDDFS is O(bd), where b is the branching factor and d is the depth of the shallowest goal. Since iterative deepening visits states multiple times, it may seem wasteful, but it turns out to be not so costly, since in a tree most of the nodes are in the bottom level, so it does not matter much if the upper levels are visited multiple times.[2]
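To make the space usage concrete, here is a small Python sketch of the scheme described above, assuming the search space is given as an adjacency dict. The only storage beyond the input is the current recursion path of at most d frames, each of which touches up to b children, which is where the O(bd) bound comes from:

    def depth_limited_dfs(tree, node, goal, limit):
        """DFS that refuses to descend below `limit`; space is just the recursion path."""
        if node == goal:
            return True
        if limit == 0:
            return False
        for child in tree.get(node, []):             # at most b children per level
            if depth_limited_dfs(tree, child, goal, limit - 1):
                return True
        return False

    def iterative_deepening_search(tree, start, goal, max_depth=50):
        """Run depth-limited DFS with limits 0, 1, 2, ... until the goal is found."""
        for limit in range(max_depth + 1):
            if depth_limited_dfs(tree, start, goal, limit):
                return limit                         # depth of the shallowest goal
        return None

    tree = {'A': ['B', 'C'], 'B': ['D', 'E'], 'C': ['F']}
    print(iterative_deepening_search(tree, 'A', 'F'))  # 2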