Insert new value in binary search tree

Can I insert the same value into a binary search tree?
And if I can, does it go to the left or the right? Thanks all!
This is a binary search tree.
Can I insert 23 into this tree?

Yes, unless otherwise specified by the actual context, a binary search tree can have duplicates. There is no rule for choosing a side.
Moreover, even if a side is agreed upon, binary search trees are typically self-balancing, which means that the chosen side will not be maintained.
For instance, imagine the extreme case where only the value 42 is inserted into a tree, say five times, and we agree that duplicates go into the right subtree. We would then get this tree:
42
 \
  42
   \
    42
     \
      42
       \
        42
However, the BST would be rebalanced after a few inserts, and the resulting tree would look more like this:
    42
   /  \
  42    42
    \     \
     42    42
And now it is clear that you cannot assume on which side duplicate values are stored.
Another approach is to allow duplicates but not store them as separate nodes. Instead you could make the "payload" (if any) an array of data that was inserted with the same key. Or, if there is no payload to go with a key, have a counter at each node indicating how often that key was inserted.
This might be preferable.
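If you go the counter route, here is a minimal Python sketch of that idea; the class and function names are just illustrative, not taken from the question:

class Node:
    def __init__(self, key):
        self.key = key
        self.count = 1      # number of times this key was inserted
        self.left = None
        self.right = None

def insert(root, key):
    """Insert key into the BST; duplicates only bump the node's counter."""
    if root is None:
        return Node(key)
    if key < root.key:
        root.left = insert(root.left, key)
    elif key > root.key:
        root.right = insert(root.right, key)
    else:
        root.count += 1     # duplicate: no new node, no side to choose
    return root

# Inserting 23 twice yields a single node whose count is 2.
root = None
for k in (30, 23, 42, 23):
    root = insert(root, k)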


Phrase matching for keys in redis

I have the following keys in Redis:
"542 136 mountain road"
"542 136 mountainview road"
"542136 mountain road"
"542 136 mountain"
"136 mountain road"
"136 mountain"
I would like to get the keys which contains the phrase 136 mountain.
With glob-style patterns, I am currently making 4 queries to cover all the possible cases.
scan 0 MATCH '*[\ ]136 mountain[\ ]*'
scan 0 MATCH '*[\ ]136 mountain'
scan 0 MATCH '136 mountain[\ ]*'
scan 0 MATCH '136 mountain'
These four queries together return 4 results:
"542 136 mountain road"
"542 136 mountain"
"136 mountain road"
"136 mountain"
Please share your input if there is a better way to change the pattern string so that all 4 results can be obtained in a single query.
I don't believe this can be achieved with a glob-style pattern.
I would also note that even if there were a pattern that matched the phrases presented, due to the nature of how SCAN works you would need to iterate through the entire keyspace (making separate calls) to get the results you are looking for. You would also need to consider that your data may change during the iteration.
From the docs:
"It is important to note that the MATCH filter is applied after elements are retrieved from the collection, just before returning data to the client. This means that if the pattern matches very little elements inside the collection, SCAN will likely return no elements in most iterations."
Reference: https://redis.io/commands/scan#the-match-option
--
Option 1
Use SCAN to iterate through the entire dataset and further filter the data at the application level.
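For example, a minimal sketch of this option using redis-py (the connection details and the regular expression used for the exact phrase check are my assumptions, not part of the question):

import re
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

# "136 mountain" must be bounded by the start/end of the key or a space.
phrase = re.compile(r"(^| )136 mountain( |$)")

matches = []
for key in r.scan_iter(match="*136 mountain*", count=1000):
    # SCAN only narrows the candidates; the regex does the exact phrase check.
    if phrase.search(key):
        matches.append(key)

print(matches)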
Option 2
Depending on what type of guarantees you're looking for and how much data you have, you could use KEYS. This is generally not a recommended approach, but it's an option to consider.
Example:
KEYS '*136 mountain*'
Much like the SCAN approach, you will have a larger response than what you are looking for and will need to use your language of choice to further filter the result.
Option 3
Index your data by doing some preprocessing at the application level. If a key matches your desired pattern, add it to a SET / Sorted Set.
Option 4
Write a Lua script.

Redis, how does SCAN cursor "state management" work?

Redis has a SCAN command that may be used to iterate keys matching a pattern etc.
Redis SCAN doc
You start by giving a cursor value of 0; each call returns a new cursor value which you pass into the next SCAN call. A returned cursor of 0 indicates that the iteration is finished. Supposedly no server or client state is needed (except for the cursor value).
I'm wondering how Redis implements the scanning algorithm-wise?
You may find the answer in the Redis dict.c source file. I will quote the relevant part of it below.
Iterating works the following way:
1) Initially you call the function using a cursor (v) value of 0.
2) The function performs one step of the iteration, and returns the new cursor value you must use in the next call.
3) When the returned cursor is 0, the iteration is complete.
The function guarantees all elements present in the dictionary get returned between the start and end of the iteration. However it is possible some elements get returned multiple times. For every element returned, the callback argument 'fn' is called with 'privdata' as first argument and the dictionary entry 'de' as second argument.
How it works
The iteration algorithm was designed by Pieter Noordhuis. The main idea is to increment a cursor starting from the higher order bits. That is, instead of incrementing the cursor normally, the bits of the cursor are reversed, then the cursor is incremented, and finally the bits are reversed again.
This strategy is needed because the hash table may be resized between iteration calls. dict.c hash tables are always power of two in size, and they use chaining, so the position of an element in a given table is given by computing the bitwise AND between Hash(key) and SIZE-1 (where SIZE-1 is always the mask that is equivalent to taking the rest of the division between the Hash of the key and SIZE).
For example if the current hash table size is 16, the mask is (in binary) 1111. The position of a key in the hash table will always be the last four bits of the hash output, and so forth.
What happens if the table changes in size?
If the hash table grows, elements can go anywhere in one multiple of the old bucket: for example let's say we already iterated with a 4 bit cursor 1100 (the mask is 1111 because hash table size = 16).
If the hash table will be resized to 64 elements, then the new mask will be 111111. The new buckets you obtain by substituting in ??1100 with either 0 or 1 can be targeted only by keys we already visited when scanning the bucket 1100 in the smaller hash table.
By iterating the higher bits first, because of the inverted counter, the cursor does not need to restart if the table size gets bigger. It will continue iterating using cursors without '1100' at the end, and also without any other combination of the final 4 bits already explored.
Similarly when the table size shrinks over time, for example going from 16 to 8, if a combination of the lower three bits (the mask for size 8 is 111) were already completely explored, it would not be visited again because we are sure we tried, for example, both 0111 and 1111 (all the variations of the higher bit) so we don't need to test it again.
Wait... You have TWO tables during rehashing!
Yes, this is true, but we always iterate the smaller table first, then we test all the expansions of the current cursor into the larger table. For example if the current cursor is 101 and we also have a larger table of size 16, we also test (0)101 and (1)101 inside the larger table. This reduces the problem back to having only one table, where the larger one, if it exists, is just an expansion of the smaller one.
Limitations
This iterator is completely stateless, and this is a huge advantage, including no additional memory used.
The disadvantages resulting from this design are:
1) It is possible we return elements more than once. However this is usually easy to deal with in the application level.
2) The iterator must return multiple elements per call, as it needs to always return all the keys chained in a given bucket, and all the expansions, so we are sure we don't miss keys moving during rehashing.
3) The reverse cursor is somewhat hard to understand at first, but this comment is supposed to help.
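To make the reverse-binary increment concrete, here is a small Python sketch of the idea (my own illustration, not the actual dict.c code):

def reverse_bits(v, bits):
    # Reverse the lowest `bits` bits of v.
    r = 0
    for _ in range(bits):
        r = (r << 1) | (v & 1)
        v >>= 1
    return r

def next_cursor(cursor, bits):
    # Reverse the cursor, increment it, reverse it back (as dictScan does).
    r = reverse_bits(cursor, bits)
    r = (r + 1) & ((1 << bits) - 1)
    return reverse_bits(r, bits)

# For a table of 8 buckets (3 bits), the buckets are visited in the order
# 0, 4, 2, 6, 1, 5, 3, 7, and the cursor returns to 0 when iteration is done.
cursor, order = 0, []
while True:
    order.append(cursor)
    cursor = next_cursor(cursor, 3)
    if cursor == 0:
        break
print(order)   # [0, 4, 2, 6, 1, 5, 3, 7]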

Redis: fan out news feeds in list or sorted set?

I'm caching fan-out news feeds with Redis in the following way:
each feed activity is a key/value, like activity:id where the value is a JSON string of the data.
each news feed is currently a list, the key is feed:user:user_id and the list contains the keys of the relevant activities.
to retrieve a news feed I use for example: 'sort feed:user:user_id by nosort get * limit 0 40'
I'm considering changing the feed to a sorted set where the score is the activity's timestamp, this way the feed is always sorted by time.
I read http://arindam.quora.com/Redis-sorted-sets-and-lists-Pertaining-to-Newsfeed which recommends using lists because of the time complexity of sorted sets, but if I keep using lists I have to take care of the insert order: inserting a past story requires iterating through the list to find the right index to push to (which can cause new problems in distributed environments).
Should I keep using lists or go for sorted sets?
Is there a way to retrieve the news feed directly from a sorted set (like with the sort ... get * command for a list), or does it have to be zrange followed by iterating through the results and getting each value?
Yes, sorted sets are very fast and powerful. They seem a much better match for your requirements than SORT operations. The time complexity is often misunderstood. O(log(N)) is very fast, and scales just fine. We use it for tens of millions of members in one sorted set. Retrieval and insertion is sub-millisecond.
Use ZRANGEBYSCORE key min max WITHSCORES [LIMIT offset count] to get your results.
Depending on how you store the timestamps as 'scores', ZREVRANGEBYSCORE might be better.
A small remark about the timestamps: Sorted set SCORES which don't need a decimal part should be using 15 digits or less. So the SCORE has to stay in the range -999999999999999 to 999999999999999. Note: These limits exist because Redis server actually stores the score (float) as a redis-string representation internally.
I therefore recommend this format, converted to Zulu Time: -20140313122802 for second-precision. You may add 1 digit for 100ms-precision, but no more if you want no loss in precision. It's still a float64 by the way, so loss of precision could be fine in some scenarios, but your case fits in the 'perfect precision' range, so that's what I recommend.
If your data expires within 10 years, you can also skip the first three digits (CCY of CCYY) to achieve .0001 second precision.
I suggest negative scores here, so you can use the simpler ZRANGEBYSCORE instead of the REV one. You can use -inf as the start score (minus infinity) and LIMIT 0 100 to get the top 100 results.
Two sorted set members (or 'keys', but that's ambiguous since the sorted set is itself a key) may share a score; that's no problem, as results with an identical score are ordered lexicographically.
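For example, assuming redis-py on the client side (the answer itself only shows redis-cli; the key names follow the question's scheme), the score encoding and retrieval could look like this:

from datetime import datetime, timezone
import redis

r = redis.Redis(decode_responses=True)

def feed_score(dt):
    # CCYYMMDDHHMMSS in UTC ("Zulu"), negated so that ZRANGEBYSCORE starting
    # from -inf returns the newest entries first.
    return -int(dt.strftime("%Y%m%d%H%M%S"))

now = datetime.now(timezone.utc)
r.zadd("feed:user:42", {"activity:1": feed_score(now)})

# The 100 newest activity keys for this feed:
newest = r.zrangebyscore("feed:user:42", "-inf", "+inf", start=0, num=100)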
Hope this helps, TW
Edit after chat
The OP wanted to collect data (using a ZSET) from different keys (GET/SET or HGET/HSET keys). ZRANGEBYSCORE only returns the members, it can't dereference keys the way SORT ... GET does, so an extra step is needed.
The preferred way of doing this, is a simple Lua script. The Lua script is executed on the server. In the example below I use EVAL for simplicity, in production you would use SCRIPT EXISTS, SCRIPT LOAD and EVALSHA. Most client libraries have some bookkeeping logic built-in, so you don't upload the script each time.
Here's an example.lua:
local r = {}
local zkey = KEYS[1]
-- KEYS[2]..KEYS[4] carry the min score, max score and result limit
local a = redis.call('zrangebyscore', zkey, KEYS[2], KEYS[3], 'withscores', 'limit', 0, KEYS[4])
for i = 1, #a, 2 do
  r[i] = a[i + 1]                      -- the score
  r[i + 1] = redis.call('get', a[i])   -- the value stored under the member key
end
return r
You use it like this (raw example, not coded for performance):
redis-cli -p 14322 set activity:1 act1JSON
redis-cli -p 14322 set activity:2 act2JSON
redis-cli -p 14322 zadd feed 1 activity:1
redis-cli -p 14322 zadd feed 2 activity:2
redis-cli -p 14322 eval "$(cat example.lua)" 4 feed '-inf' '+inf' 100
Result:
1) "1"
2) "act1JSON"
3) "2"
4) "act2JSON"

Mapreduce Table Diff

I have two versions (old/new) of a database table with about 100,000,000 records. They are in files:
trx-old
trx-new
The structure is:
id  date  amount  memo
1   5/1   100     slacks
2   5/1   50      wine
id is the simple primary key, other fields are non-key. I want to generate three files:
trx-removed (ids of records present in trx-old but not in trx-new)
trx-added (records from trx-new whose ids are not present in trx-old)
trx-changed (records from trx-new whose non-key values have changed since trx-old)
I need to do this operation every day in a short batch window. And actually, I need to do this for multiple tables and across multiple schemas (generating the three files for each) so the actual app is a bit more involved. But I think the example captures the crux of the problem.
This feels like an obvious application for mapreduce. Having never written a mapreduce application my questions are:
is there some EMR application that already does this?
is there an obvious Pig or maybe Cascading solution lying about?
is there some other open source example that is very close to this?
PS I saw the diff between tables question but the solutions over there didn't look scalable.
PPS Here is a little Ruby toy that demonstrates the algorithm: Ruby dbdiff
I think it would be easiest just to write your own job, mostly because you'll want to use MultipleOutputs to write to the three separate files from a single reduce step, whereas the typical reducer only writes to one file. You'd need to use MultipleInputs to specify a mapper for each table.
This seems like the perfect problem to solve in Cascading. You mentioned that you have never written an MR application, and if the intent is to get started quickly (assuming you are familiar with Java) then Cascading is the way to go IMHO. I'll touch more on this in a second.
It is possible to use Pig or Hive, but they aren't as flexible if you want to perform additional analysis on these columns or change schemas, since in Cascading you can build your schema on the fly by reading the column headers or a mapping file you create to describe the schema.
In Cascading you would:
Set up your incoming Taps : Tap trxOld and Tap trxNew (These point to your source files)
Connect your taps to Pipes: Pipe oldPipe and Pipe newPipe
Set up your outgoing Taps : Tap trxRemoved, Tap trxAdded and Tap trxChanged
Build your Pipe analysis (this is where the fun (hurt) happens)
trx-removed and trx-added:
Pipe trxOld = new Pipe("old-stuff");
Pipe trxNew = new Pipe("new-stuff");
// smallest size Pipe on the right in CoGroup
Pipe oldNnew = new CoGroup("old-N-new", trxOld, new Fields("id1"),
                                        trxNew, new Fields("id2"),
                                        new OuterJoin());
The outer join gives us NULLS where ids are missing in the other Pipe (your source data), so we can use FilterNotNull or FilterNull in the logic that follows to get us final pipes that we then connect to Tap trxRemoved and Tap trxAdded accordingly.
trx-changed:
Here I would first concatenate the fields you are looking for changes in using FieldJoiner, then use an ExpressionFilter to keep only the records that changed, something like:
Pipe valueChange = new Pipe("changed", oldNnew);
valueChange = new Each(valueChange, new Fields("oldValues", "newValues"),
        new ExpressionFilter("oldValues.equals(newValues)", String.class));
This filters out records whose old and new values are equal and keeps the differences: if the expression above evaluates to true, the record is discarded. Finally, connect your valueChange pipe to your Tap trxChanged and you will have three outputs with all the data you are looking for, with code that leaves room for some added analysis.
As @ChrisGerken suggested, you would have to use MultipleOutputs and MultipleInputs in order to generate multiple output files and associate custom mappers to each input file type (old/new).
The mapper would output:
key: primary key (id)
value: record from input file with additional flag (new/old depending on the input)
The reducer would iterate over all records R for each key and output:
to removed file: if only a record with flag old exists.
to added file: if only a record with flag new exists.
to changed file: if records in R differ.
As the output of this algorithm is spread across the reducers, you'd most likely need a second job to merge the results into a single file for the final output.
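As a single-machine Python sketch of that mapper/reducer logic (not an actual Hadoop job, and it assumes tab-delimited input files named as in the question):

import csv

def load(path, flag):
    # Map each primary key to its flagged record, as the mappers would emit.
    rows = {}
    with open(path, newline="") as f:
        for rec in csv.reader(f, delimiter="\t"):
            rows[rec[0]] = (flag, rec)
    return rows

old = load("trx-old", "old")
new = load("trx-new", "new")

with open("trx-removed", "w") as removed, \
     open("trx-added", "w") as added, \
     open("trx-changed", "w") as changed:
    for key in old.keys() | new.keys():
        if key not in new:
            removed.write(key + "\n")                     # only the "old" flag
        elif key not in old:
            added.write("\t".join(new[key][1]) + "\n")    # only the "new" flag
        elif old[key][1][1:] != new[key][1][1:]:
            changed.write("\t".join(new[key][1]) + "\n")  # non-key fields differ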
What comes to my mind is this:
Suppose your tables look like this:
Table_old
1 other_columns1
2 other_columns2
3 other_columns3
Table_new
2 other_columns2
3 other_columns3
4 other_columns4
Append "a" to table_old's elements and "b" to table_new's elements.
When you merge both files, an id that appears only with suffix "a" exists in the first file but not in the second, i.e. it has been removed:
table_merged
1a other_columns1
2a other_columns2
2b other_columns2
3a other_columns3
3b other_columns3
4a other_columns4
From that file you can do your operations easily.
Also, say your ids are n digits long and you have 10 worker nodes plus 1 master. Your key would be the first digit of the id, so you divide the data evenly across the workers. You would do grouping + partitioning so that your data arrives sorted.
Example,
table_old
1...0 data
1...1 data
2...2 data
table_new
1...0 data
2...2 data
3...2 data
Your key is the first digit, you group according to that digit, and you partition according to the rest of the id. Then your data will arrive at the workers like this:
worker1
1...0b data
1...0a data
1...1a data
worker2
2...2a data
2...2b data
and so on.
Note that a and b don't have to be sorted.
EDIT
The merge will look like this:
FileInputFormat.addInputPath(job, new Path("trx-old"));
FileInputFormat.addInputPath(job, new Path("trx-new"));
MR will take the two inputs and merge the two files.
For the appending part, you should create two more map-only jobs before the main MR job. The first map appends "a" to every element of the first list and the second appends "b" to the elements of the second list. The third job (the main one we are using now) only needs the reduce step to collect them. So you will have Map-Map-Reduce.
Appending can be done like this:
// key is a Text
new Text(key.toString() + "a")
but there may be other, possibly more efficient, ways of appending to a Text in Hadoop.
Hope this helps.

How can I store and index a list in a relational database?

I'm working on constructing a database (SQLite) to store information about each run of a Mathematica script I've written. The script takes several input parameters, so my DB has a table with a column for each parameter (among other columns).
Some of the input parameters are lists of numbers. My first thought for storing these is to use a junction table as described in the accepted answer to this question. But I typically use the same list for several different runs. How can I look up whether any given list is already in the database, so I can reuse its ID rather than storing it again?
Constraints as mentioned in comments:
There is no explicit upper bound on the length of a list but in practice it ranges from 1 to about 50.
The number of distinct lists will be small, on the order of 10.
I actually have 3 list parameters. For two of them, the values in the list are non-negative, double precision floating point numbers; for the third, the values are pairs of such numbers.
There are no duplicate entries. (These are more precisely sets, so no duplicates and order is irrelevant)
I can easily arrange for the list elements to be in sorted order.
For example: suppose my table is set up like this
CREATE TABLE jobs (id INTEGER PRIMARY KEY, param1 REAL, param2_id INTEGER);
CREATE TABLE param2 (param2_id INTEGER, value REAL);
When I run the script, it sets the parameters and then calls a function to run the calculation, like so:
param1 = 4;
param2 = {.1, .3, .5};
runTheCalculation[param1, param2]
Assuming this is the very first run of the script, it will insert the following contents into the DB:
jobs:   id  param1  param2_id
        1   4.0     1
param2: param2_id  value
        1          0.1
        1          0.3
        1          0.5
So far, so good. Now let's say I run the script again with one different parameter,
param1 = 2;
param2 = {.1, .3, .5};
runTheCalculation[]
In a naive implementation, this will result in the database containing this:
jobs:   id  param1  param2_id
        1   4.0     1
        2   2.0     2
param2: param2_id  value
        1          0.1
        1          0.3
        1          0.5
        2          0.1
        2          0.3
        2          0.5
But I would like it to be able to look up the fact that the list {.1, .3, .5} is already in the database, so that after the second run the DB contains this instead:
jobs:   id  param1  param2_id
        1   4.0     1
        2   2.0     1
param2: param2_id  value
        1          0.1
        1          0.3
        1          0.5
What sort of a query can I use to find that the list {.1, .3, .5} already exists in the table param2?
I'm not opposed to creating additional tables if necessary. Or if there is some model other than using a junction table that makes more sense, that's fine too.
If the lists are short and the number of lists is relatively small, then you can simply scan through the lists in TBL_Lists and see if yours matches. This is pretty inefficient, as it enumerates every stored list to compare against your one list.
Another way, and the better way in my opinion, would be to hash the list and store its hash in a TBL_List_Hashes table.
Hashing the list will require enumerating it one time.
An example hashing algorithm might be to build a string of all the sorted numerical values, uniformly padded, then run any hashing method on the concatenated string.
It should be relatively easy to obtain a hash of a given list and then retrieve the matching hash from the DB.
Even a relatively simple hash algorithm with collisions will significantly reduce the number of lists you need to validate in order to make the comparison. If your hash algorithm does have collisions, you're only adding an enumeration (and query) expense for each false match.
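A minimal Python sketch of that idea, assuming a consistent rounding is agreed on (the helper name is made up):

import hashlib

def list_hash(values, digits=9):
    # Round consistently, sort, and join into one canonical string,
    # then hash it; store this hash alongside the list's id.
    canonical = ",".join(format(v, f".{digits}g") for v in sorted(values))
    return hashlib.sha1(canonical.encode()).hexdigest()

# The same set of values always produces the same hash:
list_hash([0.5, 0.1, 0.3]) == list_hash([0.1, 0.3, 0.5])   # True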
EDIT:
Here is a relevant answer for .net
.net 3.5 List<T> Equality and GetHashCode
EDIT2:
And if you are order-agnostic in your matching then simply standardize the list order before hashing
GetHashCode for a Class with a List Object
You ask: How can I look up whether any given list is already in the database?
The normal way is to use an index, and indexes are always row-oriented. So standard database design suggests you somehow need to get the whole list (normalized) into a row.
Since you're on SQLite, you don't have too many options:
http://www.sqlite.org/datatype3.html
I recommend TEXT! You can index BLOB as well, and BLOB will save some space, but probably TEXT will work just fine, and TEXT is usually a lot more convenient to debug and work with. Try to invent some kind of canonical String format for your lists that you can parse/generate, and always INSERT/SELECT that from the database in a consistent way (e.g., consistent rounding, pre-sorted, duplicates removed, trailing and leading zeroes always consistent), and you should be fine.
Warning: it's a low-engineering approach, and perhaps even "not-the-right-way (TM)," but if it gets the job done....
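For instance, a minimal sqlite3 sketch of this approach; the table name, the canonical() helper and the rounding format are my own choices, not from the question:

import sqlite3

def canonical(values):
    # Consistent rounding, sorted, duplicates removed.
    return ",".join(format(v, ".9g") for v in sorted(set(values)))

db = sqlite3.connect("runs.db")
db.execute("""CREATE TABLE IF NOT EXISTS param2_lists (
                  id INTEGER PRIMARY KEY,
                  value TEXT UNIQUE)""")

def get_or_create_list_id(values):
    text = canonical(values)
    row = db.execute("SELECT id FROM param2_lists WHERE value = ?",
                     (text,)).fetchone()
    if row:
        return row[0]                 # the list is already stored; reuse its id
    cur = db.execute("INSERT INTO param2_lists (value) VALUES (?)", (text,))
    db.commit()
    return cur.lastrowid

list_id = get_or_create_list_id([0.1, 0.3, 0.5])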
In general, don't use lists, unless you have a very unusual set of requirements, and enough hands-on experience to anticipate the consequences.
A many-to-many relationship contained in a junction table, with appropriate indexes, will perform just as well and be much easier to use. It's also more flexible.