How is insertion for a Singly Linked List and Doubly Linked List constant time? - time-complexity

Thinking about it, I thought the time complexity for insertion and search for any data structure should be the same, because to insert, you first have to search for the location you want to insert, and then you have to insert.
According to here: http://bigocheatsheet.com/, for a linked list, search is linear time but insertion is constant time. I understand how searching is linear (start from the front, then keep going through the nodes on the linked list one after another until you find what you are searching for), but how is insertion constant time?
Suppose I have this linked list:
1 -> 5 -> 8 -> 10 -> 8
and I want to insert the number 2 after the number 8, then would I have to first search for the number 8 (search is linear time), and then take an extra 2 steps to insert it (so, insertion is still linear time?)?
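# insert y after x in python
def insert_after(x, y):
    search_for(x)      # linear-time search for the node to insert after
    y.next = x.next    # the new node points at x's old successor
    x.next = y         # x now points at the new node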
Edit: Even for a doubly linked list, shouldn't it still have to search for the node first (which is linear time), and then insert?

So if you already have a reference to the node you want to insert after, then insertion is O(1). Otherwise, it is search_time + O(1). It is a bit misleading, but on Wikipedia there is a chart that explains it a bit better:
Contrast this to a dynamic array, which, if you want to insert at the beginning is: Θ(n).
Just for emphasis: The website you reference is referring to the actual act of inserting given we already know where we want to insert.
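To make the distinction concrete, here is a minimal sketch of the two steps kept separate. The names Node, find, insert_after and to_list are made up for this illustration and are not from any library. find walks the list (linear), while insert_after only rewires pointers (constant), assuming you already hold the node.

```python
class Node:
    def __init__(self, value, next=None):
        self.value = value
        self.next = next


def find(head, value):
    """Linear-time search: walk the list until the first node holding `value`."""
    node = head
    while node is not None and node.value != value:
        node = node.next
    return node


def insert_after(node, value):
    """Constant-time insertion: no traversal, just two pointer assignments."""
    node.next = Node(value, node.next)
    return node.next


def to_list(head):
    """Helper for inspection: collect the values in order."""
    values = []
    while head is not None:
        values.append(head.value)
        head = head.next
    return values


# Build 1 -> 5 -> 8 -> 10 -> 8, then insert 2 after the first 8.
head = Node(1, Node(5, Node(8, Node(10, Node(8)))))
target = find(head, 8)      # O(n) step
insert_after(target, 2)     # O(1) step
result = to_list(head)
```

The cheat sheet's O(1) figure describes only the insert_after call; a caller who starts from a value rather than a node pays for find first.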

Time to insert = Time to set three pointers = O(3) = constant time.
Time to insert the data is not the same as time to insert the data at a particular location. The time asked is the time to insert the data only.

Related

Hash tables Time Complexity Confusion

I just started learning about Hash Dictionaries. Currently we are implementing a hash dictionary with separate buckets that are made of chains (linked lists). The book posed this problem and I am having a lot of trouble figuring it out. Imagine we have an initial table size of 10 ie 10 buckets. If we want to know the time complexity for n insertions and a single lookup, how do we figure this out? (Assuming a pointer access is one unit of time).
It poses three scenarios:
A hash dictionary that does not resize, what is the time complexity for n insertions and 1 lookup?
A hash dictionary that resizes by 1 when the load factor exceeds .8, what is the time complexity for n insertions and 1 lookup?
A hash dictionary that resizes by doubling the table size when the load factor exceeds .8, what is the time complexity for n insertions and 1 lookup?
My initial thoughts had me really confused. I couldn't quite figure out how to know the length of some given chain for an insertion. Assuming k length (I thought), there is the pointer access of the for loop going through the whole chain so k units of time. Then, in each iteration to insert it checks if the current node's data is equivalent to the key trying to be inserted (if it exists, overwrite it) so either 2k units of time if not found, 2k+1 if found. Then, it does 5 pointer accesses to prepend some element. So, 2k+5 or 2k+1 to insert 1 time. Thus, O(kn) for the first scenario for n insertions. To lookup, it seems to be 2k+1 or 2k. So for 1 lookup, O(k). I don't have a clue how to approach the other two scenarios. Some help would be great. Once again to clarify: k isn't mentioned in the problem. The only facts given are an initial size of 10 and the information given in the scenarios, so k can't be used as the results for the time complexity of n insertions or 1 lookup.
If you have a hash dictionary, then your insert, delete and search operations will each take O(n) time for 1 key in the worst case, because every key can land in the same bucket. For n insertions it would be O(n^2). It doesn't matter what the size of your table is.
|--------|
|element1| -> element2 -> element3 -> element4 -> element5
|--------|
| null |
|--------|
| null |
|--------|
| null |
|--------|
| null |
|--------|
Now for Average Case
Scenario one will have the load factor fixed (assuming m slots) : n/m. Therefore, one insert function will be O(1+n/m). 1 for the hash function computation and n/m for the lookup.
For the 2nd and 3rd scenarios it should be O(1 + n/(m+1)) and O(1 + n/(2m)) respectively.
To clear up your confusion, ask yourself what the expected chain length will be for a random set of keys. The answer is that we can't be sure at all.
That's where the idea of load factor comes in to define the average case: we give each slot equal probability of forming a chain if the number of keys is greater than the slot count.
Imagine we have an initial table size of 10 ie 10 buckets. If we want to know the time complexity for n insertions and a single lookup, how do we figure this out?
When we talk about time complexity, we're looking at the steepness of the n-vs-time-for-operation curve as n approaches infinity. In the case above, you're saying there are only ten buckets, so - assuming the hash function scatters the insertions across the buckets with near-uniform distribution (as it should), n insertions will result in 10 lists of roughly n/10 elements.
During each insertion, you can hash to the correct bucket in O(1) time. Now - a crucial factor here is whether you want your hash table implementation to protect you against duplicate insertions.
If you simply trust there will be no duplicates, or the hash table is allowed to have duplicates (e.g. C++'s unordered_multiset), then the insertion itself can be done without inspecting the existing bucket content, at an accessible end of the bucket's list (i.e. using a head or tail pointer), also in O(1) time. That means the overall time per insertion is O(1), and the total time for n insertions is O(n).
If the implementation must identify and avoid duplicates, then for each insertion it has to search along the existing linked list, the size of which is related to n by a constant #buckets factor (1/10) and varies linearly during insertion from 1 to 1/10 of the final number of elements, so on average is n/2/10 which - removing constant factors - simplifies to n. In other words, each insertion is O(n), and n insertions take O(n^2) in total.
Presumably the question intends to ask the time for a single lookup done after all elements are inserted: in that case you have the 10 linked lists of ~n/10 length, so the lookup will hash to one of those lists and then on average have to look half way along the list before finding the desired value: that's roughly n/20 elements searched, but as /20 is a constant factor it can be dropped, and we can say the average complexity is O(n).
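A rough way to see the fixed-size case is to count key comparisons directly. The sketch below is only an illustration under simplifying assumptions: keys are the integers 0..n-1, the hash is key % buckets (so the spread is perfectly uniform), and the function name comparisons_for_inserts is made up.

```python
def comparisons_for_inserts(n, buckets=10):
    """Insert keys 0..n-1 into a fixed-size chained table, checking for
    duplicates, and return how many key comparisons were made in total."""
    table = [[] for _ in range(buckets)]
    comparisons = 0
    for key in range(n):
        chain = table[key % buckets]     # O(1): pick the bucket
        for existing in chain:           # O(chain length): duplicate check
            comparisons += 1
            if existing == key:
                break
        else:
            chain.append(key)            # not found, so add it to the chain
    return comparisons


small = comparisons_for_inserts(1000)
large = comparisons_for_inserts(2000)
```

Doubling n roughly quadruples the work (49500 versus 199000 comparisons here), which is the O(n^2) total you would expect when every insertion scans a chain of length proportional to n.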
A hash dictionary that does not resize, what is the time complexity for n insertions and 1 lookup?
Well, we discussed that above with our hash table size stuck at 10.
A hash dictionary that resizes by 1 when the load factor exceeds .8, what is the time complexity for n insertions and 1 lookup?
Say the table has 100 buckets and 80 elements, you insert an 81st element, it resizes to 101, the load factor is then about .802 - should it immediately resize again, or wait until doing another insertion? Anyway, ignoring that, each resize operation involves visiting, rehashing (unless the elements or nodes cache the hash values), and "rewiring" the linked lists for all existing elements: that's O(s) where s is the size of the table at that point in time. And you're doing that once or twice (depending on your answer to "immediately resize again" behaviour above) for s values from 1 to n, so s averages n/2, which simplifies to n. The insertion itself may or may not involve another iteration of the bucket's linked list (you could optimise to search while resizing). Regardless, the overall time complexity is O(n^2).
The lookup then takes O(1), because the resizing has kept the load factor below a constant amount (i.e. the average linked list length is very, very short, even ignoring the empty buckets).
A hash dictionary that resizes by doubling the table size when the load factor exceeds .8, what is the time complexity for n insertions and 1 lookup?
If you consider the resultant hash table there with n elements inserted, about half the elements will have been inserted without needing to be rehashed, while about a quarter will have been rehashed once, an eighth rehashed twice, a sixteenth rehashed 3 times, a 32nd rehashed 4 times: if you sum up that series - 1/4 + 2/8 + 3/16 + 4/32 + 5/64 + 6/128... - the series approaches 1 as n goes to infinity. In other words, the average amount of repeated rehashing/linking work done per element in the final table size doesn't increase with n - it's constant. So, the total time to insert is simply O(n). Then because the load factor is kept below 0.8 - a constant rather than a function of n - the lookup time is O(1).
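The difference between the two resizing policies can be checked by counting how many elements get rehashed in total. This is a sketch with assumptions: it resizes repeatedly until the load factor is back at or below 0.8 (one possible answer to the "immediately resize again" question), and elements_rehashed is a made-up helper that only counts moves rather than storing anything.

```python
def elements_rehashed(n, grow, buckets=10):
    """Total number of element moves caused by resizes during n insertions,
    with a load factor limit of 0.8."""
    moved = 0
    for size in range(1, n + 1):          # size = elements stored after this insert
        while 5 * size > 4 * buckets:     # integer form of size / buckets > 0.8
            buckets = grow(buckets)
            moved += size                 # every stored element is rehashed
    return moved


grow_by_one = elements_rehashed(1000, lambda b: b + 1)
doubling = elements_rehashed(1000, lambda b: 2 * b)
```

For n = 1000 the grow-by-one table moves more than 500000 elements (on the order of n^2), while the doubling table moves 1023 (fewer than 2n), which matches the O(n^2) and O(n) totals argued above.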

Infinite scroll algorithm for random items with different weight ( probability to show to the user )

I have a web / mobile application that should display an infinite scroll view (the continuation of the list of items is loaded periodically in a dynamic way) with items where each of the items have a weight, the bigger is the weight in comparison to the weights of other items the higher should be the chances/probability to load the item and display it in the list for the users, the items should be loaded randomly, just the chances for the items to be in the list should be different.
I am searching for an efficient algorithm / solution or at least hints that would help me achieve that.
Some points worth to mention:
the weight has those boundaries: 0 <= w < infinite.
the weight is not a static value, it can change over time based on some item properties.
every item with a weight higher than 0 should have a chance to be displayed to the user even if the weight is significantly lower than the weight of other items.
when the users scrolls and performs multiple requests to API, he/she should not see duplicate items or at least the chance should be low.
I use a SQL Database (PostgreSQL) for storing items so the solution should be efficient for this type of database. (It shouldn't be a purely SQL solution)
Hope I didn't miss anything important. Let me know if I did.
The following are some ideas to implement the solution:
The database table should have a column where each entry is a number generated as follows:
log(R) / W,
where—
W is the record's weight greater than 0 (itself its own column), and
R is a per-record uniform random number in (0, 1)
(see also Arratia, R., "On the amount of dependence in the prime factorization of a uniform random integer", 2002). Then take the records with the highest values of that column as the need arises.
However, note that SQL has no standard way to generate random numbers; DBMSs that implement SQL have their own ways to do so (such as RANDOM() for PostgreSQL), but how they work depends on the DBMS (for example, compare MySQL's RAND() with T-SQL's NEWID()).
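Outside the database, the log(R) / W idea can be sketched as follows. This is only an illustration: weighted_order and the sample weights are made up, and the key is computed in application code here rather than stored in a column.

```python
import math
import random


def weighted_order(weights, rng):
    """Give every item with weight > 0 the key log(R) / W, where R is uniform
    in (0, 1], and return the items sorted from highest key to lowest."""
    keys = {
        item: math.log(1.0 - rng.random()) / weight   # 1 - random() avoids log(0)
        for item, weight in weights.items()
        if weight > 0
    }
    return sorted(keys, key=keys.get, reverse=True)


rng = random.Random(0)
weights = {"a": 9.0, "b": 1.0, "hidden": 0.0}

# Count how often the heavy item comes first over many independent draws.
trials = 2000
a_first = sum(weighted_order(weights, rng)[0] == "a" for _ in range(trials))
```

With weights 9 and 1, "a" should lead about 90% of the time, while "b" still gets its turn, and the zero-weight item never appears. Reading the list from the top in pages gives an ordering without duplicates for as long as the keys stay fixed.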
Peter O had a good idea, but had some issues. I would expand it a bit in favor of being able to shuffle a little better as far as being user-specific, at a higher database space cost:
1. Use a single column, but store multiple fields in it. I recommend the Postgres JSONB type (which stores the data as JSON that can be indexed and queried). Use several fields, each holding its own log(R) / W value. I would say roughly log(U) + log(P) fields, where U is the number of users and P is the number of items, with a minimum of probably 5. Add an index over all the fields within the JSONB. Add more fields as the number of users/items gets high enough.
2. Have a background process that is regularly rotating the numbers in #1. This can cause duplication, but if you are only rotating a small subset of the items at a time (such as O(sqrt(P)) of them), the odds of the user noticing are low, especially if you are actually querying for data backwards and forwards and stitching/deduplicating the data together before displaying the next row(s). Careful use of manual pagination adjustments helps a lot here if it's an issue.
3. Before displaying items, randomly pick one of the index fields and sort the data on that. This means you have a 1 in log(P) + log(U) chance of displaying the same data to the user. Ideally the user would pick a random subset of those index fields (to avoid seeing the same order twice) and use that as the order, but I can't think of a way to make that work and be practical. A random shuffle of the index and sorting by that might be practical, though, if the randomized weights are normalized such that the sort order matters.

Negamax: what to do with "partial" results after canceling a search?

I'm implementing negamax with alpha/beta pruning and a transposition table, based on the pseudo code here, with roughly this algorithm:
NegaMax():
1. Transposition Table lookup
2. Loop through moves
2a. **Bail if I'm out of time**
2b. Make move, call -NegaMax, undo move
2c. Update bestvalue and alpha/beta; cut off if appropriate
3. Transposition table store/update
4. Return bestvalue
I'm also using iterative deepening, calling NegaMax with progressively higher depths.
My question is: when I determine I've run out of time (2a. in the beginning of move loop) what is the right thing to do? Do I bail immediately (not updating the transposition table) or do I just break the loop (saving whatever partial work I've done)?
Currently, I return null at that point, signifying that the search was canceled before "completing" that node (whether by trying every move or by the alpha/beta cut). The null gets propagated up the stack, and each node on the way up bails by returning, so step 3 never runs.
Essentially, I only store values in the TT if the node "completed". The scenario I keep seeing with the iterative deepening:
I get through depths 1-5 really quick, so the TT has a depth = 5, type = Exact entry.
The depth = 6 search is taking a long time, so I bail.
I ultimately return the best move in the transposition table, which is the move I found during the depth = 5 search. The problem is, if I start a new depth = 6 search, it feels like I'm starting it from scratch. However, if I save whatever partial results I found, I worry that I'll have corrupted my TT, potentially by overwriting the completed depth = 5 entry with an incomplete depth = 6 entry.
If the search wasn't completed, the score is inaccurate and should likely not be added to the TT. If you have a best move from the previous ply and it is still best and the score hasn't dropped significantly, you might play that.
On the other hand, if at depth 6 you discover that the opponent has a mate in 3 (oops!) or could win your queen, you might have to spend even more time to try to resolve that.
That would leave you with less time for the remaining moves (if any...), but it might be better to be slightly short on time than to get mated with plenty of time remaining. :-)
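Here is a small sketch of the "don't store incomplete nodes, fall back to the last completed depth" approach. It makes several simplifications that are not in the question: alpha/beta is left out, the game tree is a toy made of nested lists with integer leaves, the static evaluation is a placeholder 0, the abort is signalled with an exception instead of null, and the "clock" is a call counter so the behaviour is repeatable. All names are made up.

```python
class SearchAborted(Exception):
    """Raised when the time budget runs out in the middle of a search."""


def call_budget(limit):
    """Fake clock for the demo: reports 'out of time' after `limit` checks."""
    calls = 0

    def out_of_time():
        nonlocal calls
        calls += 1
        return calls > limit

    return out_of_time


def negamax(node, depth, table, out_of_time):
    if isinstance(node, int):
        return node                      # leaf score, from the side to move's view
    if depth == 0:
        return 0                         # placeholder static evaluation
    key = (id(node), depth)
    if key in table:
        return table[key]
    best = float("-inf")
    for child in node:
        if out_of_time():
            raise SearchAborted          # skips the store below, at every level
        best = max(best, -negamax(child, depth - 1, table, out_of_time))
    table[key] = best                    # reached only if every child was searched
    return best


def best_move(root, max_depth, out_of_time, table):
    completed = None                     # (depth, move index) of last finished pass
    for depth in range(1, max_depth + 1):
        try:
            scores = []
            for child in root:
                if out_of_time():
                    raise SearchAborted
                scores.append(-negamax(child, depth - 1, table, out_of_time))
            completed = (depth, scores.index(max(scores)))
        except SearchAborted:
            break                        # keep the result of the previous depth
    return completed


tree = [[3, 5], [7, 9], [1, 8]]
full_table, cut_table = {}, {}
full = best_move(tree, 2, call_budget(1000), full_table)
cut = best_move(tree, 2, call_budget(8), cut_table)
```

In the truncated run the depth-2 pass is abandoned, so the answer comes from depth 1. Nodes that did finish inside the abandoned pass are still in the table, so a later depth-2 search would not start entirely from scratch, while the node that was interrupted leaves no entry behind.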

Deleting redundant values in timeseries data

Consider a database scheme like this:
CREATE TABLE log (
observation_point_id INTEGER PRIMARY KEY NOT NULL,
datetime TEXT NOT NULL,
value REAL NOT NULL
)
which contains 'observations' of some value; say for example a temperature measurement. The observation device (i.e., thermometer :) ) samples the temperature every 5 seconds and this gets logged to the database.
There are multiple thermometers, each of which is identified (for the purposes of this simplified example) by an 'observation_point'.
Now, let's assume that the precision of my thermometer is one degree; then I will have many observations that are redundant. Let's say I log x degrees at 9h00m00s, then it's quite likely it will still be x degrees at 9h00m05s, 9h00m10s etc. So I only need to store the value and time at which I first measured this temperature, and at which I last measured it.
I can check on every insert if the value immediately preceding it is redundant, and then delete that. But that's quite expensive, especially considering that there are many loggers to write to my database, and the frequency of logging is higher than 5 seconds in my real use case.
So my idea is to run a 'cleanup' every, say 1 minute, that will delete all values between extremes e1 and e2 where the interval [e1,e2] is defined as each series of subsequent values v1, v2, ..., vn where v1 = v2 = ... = vn. 'Subsequent' here meaning when ordered by 'datetime'.
My question: is there a way to express this in an SQL query? Is there another way to approach this?
(my baseline is to do a 'select order by', then loop over all results). I can't do anything 'before' my values hit the database (i.e., cache values until I get the next measurement and only write value if that measurement is different), because I might also get observations at a much lower frequency than once every few seconds, and I cannot afford to lose observations. (now that I'm typing this, maybe I could 'cache' values in a separate database table, but I think I'm straying too far from my real question now).
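For reference, the "select order by, then loop" baseline could look roughly like this once the rows for one observation point have been fetched in datetime order. The function name drop_redundant and the sample data are made up; the database access and the actual DELETE are left out.

```python
def drop_redundant(rows):
    """rows: list of (datetime, value) tuples for one observation point,
    already sorted by datetime. Returns the rows worth keeping: the first
    and the last row of every run of equal consecutive values."""
    kept = []
    last_index = len(rows) - 1
    for i, (stamp, value) in enumerate(rows):
        starts_run = i == 0 or rows[i - 1][1] != value
        ends_run = i == last_index or rows[i + 1][1] != value
        if starts_run or ends_run:
            kept.append((stamp, value))
    return kept


rows = [
    ("09:00:00", 20.0),
    ("09:00:05", 20.0),   # redundant: same value before and after
    ("09:00:10", 20.0),
    ("09:00:15", 21.0),
    ("09:00:20", 20.0),
    ("09:00:25", 20.0),
]
kept = drop_redundant(rows)
```

Everything not in kept is a candidate for deletion. Note that the last row of the newest run is kept even though the run may still be growing, so a periodic cleanup would delete it on a later pass if the value stays the same.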

Represent Ordering in a Relational Database

I have a collection of objects in a database. Images in a photo gallery, products in a catalog, chapters in a book, etc. Each object is represented as a row. I want to be able to arbitrarily order these images, storing that ordering in the database so when I display the objects, they will be in the right order.
For example, let's say I'm writing a book, and each chapter is an object. I write my book, and put the chapters in the following order:
Introduction, Accessibility, Form vs. Function, Errors, Consistency, Conclusion, Index
It goes to the editor, and comes back with the following suggested order:
Introduction, Form, Function, Accessibility, Consistency, Errors, Conclusion, Index
How can I store this ordering in the database in a robust, efficient way?
I've had the following ideas, but I'm not thrilled with any of them:
1. Array. Each row has an ordering ID; when the order is changed (via a removal followed by an insertion), the order IDs are updated. This makes retrieval easy, since it's just ORDER BY, but it seems easy to break.
// REMOVAL
UPDATE ... SET orderingID=NULL WHERE orderingID=removedID
UPDATE ... SET orderingID=orderingID-1 WHERE orderingID > removedID
// INSERTION
UPDATE ... SET orderingID=orderingID+1 WHERE orderingID > insertionID
UPDATE ... SET orderingID=insertionID WHERE ID=addedID
2. Linked list. Each row has a column for the id of the next row in the ordering. Traversal seems costly here, though there may be some way to use ORDER BY that I'm not thinking of.
3. Spaced array. Set the orderingID (as used in #1) to be large, so the first object is 100, the second is 200, etc. Then when an insertion happens, you just place it at (objectBefore + objectAfter)/2. Of course, this would need to be rebalanced occasionally, so you don't have things too close together (even with floats, you'd eventually run into rounding errors).
None of these seem particularly elegant to me. Does anyone have a better way to do it?
Another alternative would be (if your RDBMS supports it) to use columns of type array. While this breaks the normalization rules, it can be useful in situations like this. One database which I know about that has arrays is PostgreSQL.
The acts_as_list mixin in Rails handles this basically the way you outlined in #1. It looks for an INTEGER column called position (you can override the name, of course) and uses that to do an ORDER BY. When you want to re-order things you update the positions. It has served me just fine every time I've used it.
As a side note, you can remove the need to always do re-positioning on INSERTS/DELETES by using sparse numbering -- kind of like BASIC back in the day... you can number your positions 10, 20, 30, etc. and if you need to insert something in between 10 and 20 you just insert it with a position of 15. Likewise when deleting you can just delete the row and leave the gap. You only need to do re-numbering when you actually change the order or if you try to do an insert and there is no appropriate gap to insert into.
Of course depending on your particular situation (e.g. whether you have the other rows already loaded into memory or not) it may or may not make sense to use the gap approach.
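The gap approach can be sketched in a few lines. This is an in-memory illustration only: positions is a plain dict from row id to position standing in for the table, and GAP, renumber and insert_between are made-up names. It assumes before_id and after_id are adjacent in the current order.

```python
GAP = 10  # spacing used whenever rows are (re)numbered; an arbitrary choice


def renumber(ordered_ids):
    """Assign evenly spaced positions 10, 20, 30, ... in the given order."""
    return {row_id: (index + 1) * GAP for index, row_id in enumerate(ordered_ids)}


def insert_between(positions, new_id, before_id, after_id):
    """Place new_id between two adjacent rows, renumbering only when the
    gap between them has been used up."""
    low, high = positions[before_id], positions[after_id]
    if high - low < 2:
        # No free integer left: this is the occasional O(n) renumbering.
        ordered = sorted(positions, key=positions.get)
        positions.update(renumber(ordered))
        low, high = positions[before_id], positions[after_id]
    positions[new_id] = (low + high) // 2


positions = renumber(["a", "b", "c"])          # a=10, b=20, c=30
insert_between(positions, "x", "a", "b")       # x=15
insert_between(positions, "y", "a", "x")       # y=12
insert_between(positions, "z", "a", "y")       # z=11
insert_between(positions, "w", "a", "z")       # gap exhausted, renumber first
order = sorted(positions, key=positions.get)
```

Most inserts touch a single row; only the fourth insert into the same gap forces every row to be rewritten.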
If the objects aren't heavily keyed by other tables, and the lists are short, deleting everything in the domain and just re-inserting the correct list is the easiest. But that's not practical if the lists are large and you have lots of constraints to slow down the delete. I think your first method is really the cleanest. If you run it in a transaction you can be sure nothing odd happens while you're in the middle of the update to screw up the order.
Just a thought considering option #1 vs #3: doesn't the spaced array option (#3) only postpone the problem of the normal array (#1)? Whatever algorithm you choose, either it's broken, and you'll run into problems with #3 later, or it works, and then #1 should work just as well.
I did this in my last project, but it was for a table that only occasionally needed to be specifically ordered, and wasn't accessed too often. I think the spaced array would be the best option, because reordering would be cheapest in the average case (just involving a change to one value and a query on two).
Also, I would imagine ORDER BY would be pretty heavily optimized by database vendors, so leveraging that function would be advantageous for performance as opposed to the linked list implementation.
Use a floating point number to represent the position of each item:
Item 1 -> 0.0
Item 2 -> 1.0
Item 3 -> 2.0
Item 4 -> 3.0
You can place any item between any other two items by simple bisection:
Item 1 -> 0.0
Item 4 -> 0.5
Item 2 -> 1.0
Item 3 -> 2.0
(Moved item 4 between items 1 and 2).
The bisection process can continue almost indefinitely due to the way floating point numbers are encoded in a computer system.
Item 4 -> 0.5
Item 1 -> 0.75
Item 2 -> 1.0
Item 3 -> 2.0
(Move item 1 to the position just after Item 4)
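"Almost indefinitely" has a concrete limit that is worth knowing. The sketch below counts how many times you can keep inserting directly after the same item before the midpoint is no longer distinct from a neighbor. The helper name is made up, and the result assumes Python floats, which are IEEE 754 doubles on essentially every platform.

```python
def bisections_until_collision(low, high):
    """Repeatedly insert just after `low` by bisection and count how many
    inserts succeed before the midpoint equals one of its neighbors."""
    count = 0
    while True:
        mid = (low + high) / 2
        if mid == low or mid == high:
            return count        # no representable value left in between
        high = mid              # the next insert goes between low and mid
        count += 1


between_one_and_two = bisections_until_collision(1.0, 2.0)
```

Between 1.0 and 2.0 this gives 52, the number of fraction bits in a double. The count differs for other pairs of neighbors, but it is always finite, so an occasional renumbering is still needed if many items are inserted at the same spot.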
Since I've mostly run into this with Django, I've found this solution to be the most workable. It seems that there isn't any "right way" to do this in a relational database.
I'd do a consecutive number, with a trigger on the table that "makes room" for a priority if it already exists.
I had this problem as well. I was under heavy time pressure (aren't we all) and I went with option #1, and only updated rows that changed.
If you swap item 1 with item 10, just do two updates to update the order numbers of item 1 and item 10. I know it is algorithmically simple, and it is O(n) worst case, but that worst case is when you have a total permutation of the list. How often is that going to happen? That's for you to answer.
I had the same issue and have probably spent at least a week concerning myself about the proper data modeling, but I think I've finally got it. Using the array datatype in PostgreSQL, you can store the primary key of each ordered item and update that array accordingly using insertions or deletions when your order changes. Referencing a single row will allow you to map all your objects based on the ordering in the array column.
It's still a bit choppy of a solution but it will likely work better than option #1, since option 1 requires updating the order number of all the other rows when ordering changes.
Scheme #1 and Scheme #3 have the same complexity in every operation except INSERT writes. Scheme #1 has O(n) writes on INSERT and Scheme #3 has O(1) writes on INSERT.
For every other database operation, the complexity is the same.
Scheme #2 should not even be considered because its DELETE requires O(n) reads and writes. Scheme #1 and Scheme #3 have O(1) DELETE for both read and write.
New method
If your elements have a distinct parent element (i.e. they share a foreign key row), then you can try the following ...
Django offers a database-agnostic solution to storing lists of integers within CharField(). One drawback is that the max length of the stored string can't be greater than max_length, which is DB-dependent.
In terms of complexity, this would give Scheme #1 O(1) writes for INSERT, because the ordering information would be stored as a single field in the parent element's row.
Another drawback is that a JOIN to the parent row is now required to update ordering.
https://docs.djangoproject.com/en/dev/ref/validators/#django.core.validators.validate_comma_separated_integer_list