DynamoDB: How to have sequencing within items

I am developing forums on DynamoDB.
There is a table posts which contains all the posts in a thread.
I need to have a notion of sequence in the posts, i.e. I need to know which post came first and which came later.
My service would be running in a distributed env.
I am not sure that using a timestamp is the best solution for deciding the sequence, as the hosts might have slightly different times and might be off by milliseconds or seconds.
Is there another way to do this?
Can I get DynamoDB to populate the date so it is consistent?
Or is there a sequence generator that I can use in a distributed env?

You can't get DynamoDB to auto-populate dates. You can use other services to provide auto-generated numbers, or use DynamoDB's atomic increment to create your own sequential IDs.
This can become a bottleneck if your forum is very successful (i.e. needs lots of numbers per second). I think you should start with a timestamp and add complexity to your ID generation later on (concatenate timestamp+uuid or timestamp+atomiccounter).
It is always best practice to sync your servers' clocks (ntpd).
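As a rough illustration, here is a minimal Java sketch of the timestamp+uuid concatenation; the variable name is made up, but the idea is just timestamp first (for ordering), UUID second (for uniqueness):

import java.util.UUID;

// The millisecond timestamp gives an approximate global order; the random
// UUID breaks ties and avoids collisions between hosts whose clocks
// disagree slightly.
String postSortKey = System.currentTimeMillis() + "-" + UUID.randomUUID();
// e.g. "1700000000000-550e8400-e29b-41d4-a716-446655440000"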

Use a dedicated sequence table. If you have only one sequence (say, PostId), then there's going to be only one row with two attributes in the table.
Yes, there's the extra cost and effort of managing another table, but this is the best solution I know of by far, and I haven't seen anyone else mention it.
The table should have a key attribute as primary partition key, and a numeric value attribute with initial value of 1 (or whatever you want the initial value to be).
Every time you want to get the next available key, you tell DynamoDB to do this:
Increment the value where key = PostId by 1, and return the value before incrementing.
Note that this is a single atomic operation. DynamoDB handles the auto-incrementing, so there are no concurrency issues.
In code, there is more than one way to implement this. Here's one example:
Map<String, AttributeValue> key = new HashMap<>();
key.put("key", new AttributeValue("PostId"));

Map<String, AttributeValueUpdate> item = new HashMap<>();
item.put("value", new AttributeValueUpdate()
        .withAction(AttributeAction.ADD)              // atomic add
        .withValue(new AttributeValue().withN("1"))); // ADD takes a numeric AttributeValue, not a bare int

UpdateItemRequest request = new UpdateItemRequest("Sequences", key, item)
        .withReturnValues(ReturnValue.ALL_OLD);       // return the value before incrementing
UpdateItemResult result = dynamoDBClient.updateItem(request);

Integer postId = Integer.parseInt(result.getAttributes().get("value").getN()); // <- this is the sequential ID you want to set to your post

Another variation of Chen's suggestion is to enforce strict ordering of posts within a given forum thread, as opposed to globally across all threads. One way to do this is to have a Reply table with a hash key of ThreadId and a range key of ReplyId. The ReplyId would be a Number attribute starting at 0. Every time someone replies, your app does a Query on the Reply table for the single most recent reply on that thread (ScanIndexForward: false, Limit: 1, keyed on that ThreadId). To insert your new reply, use the ReplyId of the one returned by the Query, + 1. Then use PutItem with a conditional write, so that if someone else replies at the same time, an error is returned and your app can start again with the Query, as sketched below.
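Here is a minimal Java sketch of that flow, assuming the same v1 AWS SDK client and model imports as the snippet above; the Reply table, the ThreadId/ReplyId attributes, and the sample thread ID are illustrative:

// Query for the most recent reply in the thread (descending by range key).
QueryRequest query = new QueryRequest("Reply")
        .withKeyConditionExpression("ThreadId = :t")
        .withExpressionAttributeValues(
                Collections.singletonMap(":t", new AttributeValue("thread-42")))
        .withScanIndexForward(false)
        .withLimit(1);
List<Map<String, AttributeValue>> items = dynamoDBClient.query(query).getItems();
long nextReplyId = items.isEmpty()
        ? 0
        : Long.parseLong(items.get(0).get("ReplyId").getN()) + 1;

Map<String, AttributeValue> reply = new HashMap<>();
reply.put("ThreadId", new AttributeValue("thread-42"));
reply.put("ReplyId", new AttributeValue().withN(Long.toString(nextReplyId)));
reply.put("Body", new AttributeValue("Reply text goes here"));

// Conditional write: this throws ConditionalCheckFailedException if another
// client claimed the same ReplyId first, in which case we re-run the query.
PutItemRequest put = new PutItemRequest("Reply", reply)
        .withConditionExpression("attribute_not_exists(ReplyId)");
dynamoDBClient.putItem(put);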
If you want the simplest possible initial solution, the timestamp+uuid concatenation Chen suggests is the way to go. A global atomic counter item will be a scaling bottleneck, as Chen mentions, and based on what you've described, a global sequence number isn't required for your app.

Related

Redis use case: large keys with small values

I have a use case for Redis that is a little bit different.
In my MySQL database I have an entity; let's call it HumanEntity. This HumanEntity has many-to-many relations:
HumanEntity.Urls - many URLs per HumanEntity.
HumanEntity.UserNames - many user names per HumanEntity.
HumanEntity.Phones ...
HumanEntity.Emails ...
In a typical hour, the application creates hundreds of these values.
The use case is that the application receives HTTP calls (100 per second) with a HumanEntity value (URL, user name, phone, or email).
I need to scan my MySQL table (1,000,000 records) and return the HumanEntity.Id (integer).
Since it's OK to have some latency in data consistency, I thought about Redis.
Can I store the values as Redis keys, with the HumanEntity.Id (integer) as the value?
My API needs to return the HumanEntity.Id (integer).
Does it make sense to have such a long key and such a short value? The URL, for example, may be 1,500 bytes, while the value can be 1 byte.
What is the best Redis method to implement that?
Thanks
If the values are not unique, then you may have a problem. Phones, emails, or user names may be unique per user, but I am not sure about URLs or any other property stored in your database. You may end up overwriting the value of an identifier with another user's.
If you don't have a problem like that, you may proceed with string types; the time complexity of GET and SET is O(1), which is the best you can get.
In some cases, such as checking whether a user has used a coupon, you may use a long (say, 64-character) user ID as the key and 1 as the value, and use EXISTS to determine it. So it's valid to use a long key and a short value.
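A minimal sketch of the plain string approach with Jedis (the host, key, and ID values are made up):

import redis.clients.jedis.Jedis;

try (Jedis jedis = new Jedis("localhost", 6379)) {
    // The long value (URL, email, ...) is the key; the small ID is the value.
    jedis.set("https://example.com/some/very/long/profile/url", "42"); // SET is O(1)

    String id = jedis.get("https://example.com/some/very/long/profile/url"); // GET is O(1)
    if (id != null) {
        int humanEntityId = Integer.parseInt(id);
    }
}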

How predictable is NEWSEQUENTIALID?

According to Microsoft's documentation on NEWSEQUENTIALID, the output of NEWSEQUENTIALID is predictable. But how predictable is predictable? Say I have a GUID that was generated by NEWSEQUENTIALID, how hard would it be to:
Calculate the next value?
Calculate the previous value?
Calculate the first value?
Calculate the first value, even without knowing any GUIDs at all?
Calculate the number of rows? E.g. when using integers, /order?id=842 tells me that there are 842 orders in the application.
Below is some background information about what I am doing and what the various tradeoffs are.
One of the security benefits of using GUIDs over integers as primary keys is that GUIDs are hard to guess. E.g. say a hacker sees a URL like /user?id=845; he might try to access /user?id=0, since it is probable that the first user in the database is an administrative user. Moreover, a hacker can iterate over /user?id=0..1..2 to quickly gather all users.
Similarly, a privacy downside of integers is that they leak information: /order?id=482 tells me that the web shop has had 482 orders since its implementation.
Unfortunately, using GUID's as primary keys has well-known performance downsides. To this end, SQL Server introduced the NEWSEQUENTIALID function. In this question, I would like to learn how predictable the output of NEWSEQUENTIALID is.
The underlying OS function is UuidCreateSequential. The value is derived from one of your network cards' MAC addresses and a per-OS-boot incremental value. See RFC 4122. SQL Server does some byte-shuffling to make the result sort properly. So the value is highly predictable, in a sense: if you know one value, you can immediately predict a range of similar values.
However, one cannot predict the equivalent of id=0, nor that 52DE358F-45F1-E311-93EA-00269E58F20D means the store sold at least 482 items.
The only 'approved' random generation is CRYPT_GEN_RANDOM (which wraps CryptGenRandom) but that is obviously a horrible key candidate.
In most cases, the next newsequentialid can be predicted by taking the current value and adding one to the first hex pair.
In other words:
1E29E599-45F1-E311-80CA-00155D008B1C
is followed by
1F29E599-45F1-E311-80CA-00155D008B1C
is followed by
2029E599-45F1-E311-80CA-00155D008B1C
Occasionally, the sequence will restart from a new value.
So, it's very predictable.
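For illustration, a small (hypothetical) Java helper that makes the guess the way this answer describes, by bumping the first hex pair; it doesn't handle the occasional sequence restart:

// Predict the "next" NEWSEQUENTIALID by incrementing the first hex pair.
static String nextGuess(String guid) {
    int firstPair = Integer.parseInt(guid.substring(0, 2), 16);
    return String.format("%02X", (firstPair + 1) & 0xFF) + guid.substring(2);
}

// nextGuess("1E29E599-45F1-E311-80CA-00155D008B1C")
//   returns "1F29E599-45F1-E311-80CA-00155D008B1C"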
NewSequentialID is a wrapper around the Windows function UuidCreateSequential.
You can try this code:
DECLARE @tbl TABLE (
    PK uniqueidentifier DEFAULT NEWSEQUENTIALID(),
    Num int
)
INSERT INTO @tbl (Num) VALUES (1), (2), (3), (4), (5)
SELECT * FROM @tbl
On my machine, at the time I ran it, the result was:
PK Num
52DE358F-45F1-E311-93EA-00269E58F20D 1
53DE358F-45F1-E311-93EA-00269E58F20D 2
54DE358F-45F1-E311-93EA-00269E58F20D 3
55DE358F-45F1-E311-93EA-00269E58F20D 4
56DE358F-45F1-E311-93EA-00269E58F20D 5
You should try it several times at different times/dates to extrapolate the behaviour.
I ran it several times, and the first part changed every time (you can see it in the results: 52..., 53..., 54..., etc.). I waited a while and checked again, and after some time the second part was incremented too. I suppose the incrementing carries through all the parts. Basically, it looks like a simple +=1 increment transformed into a GUID.
EDIT:
If you want a sequential GUID and you want to have control over the values, you can use Sequences.
Sample code:
select cast(cast(next value for [dbo].[MySequence] as varbinary(max)) as uniqueidentifier)
Calculate the next value? Yes.
Microsoft says:
If privacy is a concern, do not use this function. It is possible to guess the value of the next generated GUID and, therefore, access data associated with that GUID.
So it is possible to get the next value. I couldn't find any information on whether it is possible to get the previous one.
from: http://msdn.microsoft.com/en-us/library/ms189786.aspx
Edit: a few more words about NEWSEQUENTIALID and security: http://vadivel.blogspot.com/2007/09/newid-vs-newsequentialid.html
Edit:
NewSequentialID contains the server's MAC address (or one of them), therefore knowing a sequential ID gives a potential attacker information that may be useful as part of a security or DoS attack.
from: Are there any downsides to using NewSequentialID?

ROR - Generate an alpha-numeric string for a DB ID

In our DB, every Person has an ID, which is a DB-generated, auto-incremented integer. Now we want to generate a more user-friendly alphanumeric ID that can be publicly exposed, something like a passport number. We obviously don't want to expose the DB ID to the users. For the purpose of this question, I will call what we need to generate the UID.
Note: The UID is not meant to replace the DB ID. You can think of the UID as a prettier version of the DB ID, which we can give out to the users.
I was wondering if this UID can be a function of the DB ID. That is, we should be able to re-generate the same UID for a given DB ID.
Obviously, the function will take a 'salt' or key, in addition to the DB ID.
The UID should not be sequential. That is, two neighboring DB IDs should generate visually different-looking UIDs.
It is not strictly required for the UID to be irreversible. That is, it is okay if somebody studies the UID for a few days and is able to reverse-engineer and find the DB ID. I don't think it will do us any harm.
The UID should contain only A-Z (uppercase only) and 0-9, nothing else. And it should not contain characters that can be confused with other letters or digits, like 0 and O, or l and 1, and so on. I guess Crockford's Base32 encoding takes care of this.
The UID should be of a fixed length (10 characters), regardless of the size of the DB ID. We could pad the UID with some constant string, to bring it to the required fixed length. The DB ID could grow to any size. So, the algorithm should not have any such input limitations.
I think the way to go about this is:
Step 1: Hashing.
I have read about the following hash functions:
SHA-1
MD5
Jenkins'
The hash returns a long string. I read here about something called XOR folding to bring the string down to a shorter length. But I couldn't find much info about that.
Step 2: Encoding.
I read about the following encoding methods:
Crockford Base 32 Encoding
Z-Base32
Base36
I am guessing that the output of the encoding will be the UID string that I am looking for.
Step 3: Working around collisions.
To work around collisions, I was wondering if I could generate a random key at the time of UID generation and use this random key in the function.
I can store this random key in a column, so that we know what key was used to generate that particular UID.
Before inserting a newly generated UID into the table, I would check for uniqueness and if the check fails, I can generate a new random key and use it to generate a new UID. This step can be repeated till a unique UID is found for a particular DB ID.
I would love to get some expert advice on whether I am on the right track, and how to actually implement this.
I am going to be implementing this in a Ruby On Rails app. So, please take that into consideration in your suggestions.
Thanks.
Update
The comments and answer made me re-think and question one of my requirements: the need to be able to regenerate the UID for a user after assigning it once. I guess I was just trying to be safe, so that if we ever lost a user's UID, we would be able to get it back, since it would be a function of an existing property of the user. But we can get around that problem just by using backups, I guess.
So, if I remove that requirement, the UID essentially becomes a totally random 10-character alphanumeric string. I am adding an answer containing my proposed plan of implementation. If somebody else comes up with a better plan, I'll mark that as the answer.
As I mentioned in the update to the question, I think what we are going to do is:
Pre-generate a sufficiently large number of random, unique, ten-character alphanumeric strings. No hashing or encoding.
Store them in a table in a random order.
When creating a user, pick the first of these strings and assign it to the user.
Delete this picked ID from the pool of IDs after assigning it to a user.
When the pool reduces to a low number, replenish the pool with new strings, with uniqueness checks, obviously. This can be done in a Delayed Job, initiated by an observer.
The reason for pre-generating is that we are offloading all the expensive uniqueness checking to a one-time pre-generation operation.
When picking an ID from this pool for a new user, uniqueness is guaranteed, so the operation of creating a user (which is very frequent) stays fast.
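For illustration, here is a rough Java sketch of the pre-generation step (the alphabet and batch logic are my own assumptions; in the Rails app this would live in the Delayed Job):

import java.security.SecureRandom;
import java.util.HashSet;
import java.util.Set;

// Random ten-character IDs drawn from an alphabet without ambiguous
// characters (no 0/O, no 1/I/L), roughly in the spirit of Crockford Base32.
static Set<String> generatePool(int size) {
    final String alphabet = "ABCDEFGHJKMNPQRSTVWXYZ23456789";
    SecureRandom random = new SecureRandom();
    Set<String> pool = new HashSet<>(); // the set drops duplicates within the batch
    while (pool.size() < size) {
        StringBuilder uid = new StringBuilder(10);
        for (int i = 0; i < 10; i++) {
            uid.append(alphabet.charAt(random.nextInt(alphabet.length())));
        }
        pool.add(uid.toString());
    }
    return pool; // still needs a uniqueness check against already-assigned UIDs
}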
Would db_id.chr work for you? It would take the integers and generate a character string from them. You could then append the user's initials or last name or whatever to it. Example:
user = { :id => 123456, :f_name => "Scott", :l_name => "Shea" }
(user[:id].to_s.split(//).map { |x| (x.to_i + 64).chr }).join.downcase + user[:l_name].downcase
# result: "abcdefshea"

Rails and MongoDB: how to get the last document inserted and be sure it is thread-safe?

When I add a new document to my collection X, I need to get the last document that was inserted into that same collection, because some values of that document must influence the document I am currently inserting.
Basically, as a simple example, I would need to do this:
class X
  include Mongoid::Document
  include Mongoid::Timestamps

  before_save :set_sum

  def set_sum
    self.sum = X.last.sum + self.misc
  end

  field :sum, :type => Integer
  field :misc, :type => Integer
end
How can I make sure that this kind of process will never break if there are concurrent inserts? I must make sure that when self.sum = X.last.sum + self.misc is calculated, X.last.sum absolutely represents the last document inserted in the collection.
This is critical to my system; it needs to be thread-safe.
Alex
PS: this also needs to be performant; when there are 50k documents in the collection, it can't take long to get the last value...
This kind of behavior is equivalent to having an auto-incrementing ID.
http://www.mongodb.org/display/DOCS/How+to+Make+an+Auto+Incrementing+Field
The cleanest way is to have a side collection with one (or more) docs representing the current total values.
Then in your client, before inserting the new doc, do a findAndModify() that atomically updates the totals AND retrieves the current total doc.
Part of the totals doc can be an auto-incremented _id, so that even if there are concurrent inserts, your documents will be correctly ordered as long as you sort by _id.
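A rough sketch with the MongoDB Java driver (the collection and field names are made up); findOneAndUpdate is the modern driver's equivalent of findAndModify:

import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.FindOneAndUpdateOptions;
import com.mongodb.client.model.ReturnDocument;
import org.bson.Document;
import static com.mongodb.client.model.Filters.eq;
import static com.mongodb.client.model.Updates.inc;

// One "totals" document is atomically incremented and read back in one step.
MongoCollection<Document> counters = MongoClients.create()
        .getDatabase("forum").getCollection("counters");

int misc = 5; // the misc value of the document about to be inserted
Document totals = counters.findOneAndUpdate(
        eq("_id", "x_totals"),
        inc("sum", misc),
        new FindOneAndUpdateOptions()
                .upsert(true)                           // create on first use
                .returnDocument(ReturnDocument.AFTER)); // read the new total

int runningSum = totals.getInteger("sum"); // safe under concurrent inserts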
Only caveat: if your client app dies after findAndModify and before insert, you will have a gap in there.
Either that's ok or you need to add extra protections like keeping a side log.
If you want to be 100% safe, you can also take inspiration from two-phase commit:
http://www.mongodb.org/display/DOCS/two-phase+commit
Basically, it is the proper way to do transactions with any DB that spans more than one server (even SQL wouldn't help there).
best
AG
If you need to keep a running sum, this should probably be done on another document in a different collection. The best way to keep this running sum is to use the $inc atomic operation. There's really no need to perform any reads while calculating this sum.
You'll want to insert your X document into its collection, then also $inc a value on a different document that is meant for keeping this tally of all the misc values from the X documents.
Note: This still won't be transactional, because you're updating two documents in two different collections separately, but it will be highly performant, and fully thread safe.
For more info, check out all the MongoDB Atomic Operations.

When would you ever set the increment value on a database identity field?

Given the table:
CREATE TABLE Table1
(
    UniqueID int IDENTITY(1,1)
    ...etc
)
Now why would you ever set the increment to something other than 1?
I can understand setting the initial seed value differently. For example if, say, you're creating one database table per month of data (e.g. Table1_082009, Table1_092009) and want to start the UniqueID of the new table where the old one left off. (I probably wouldn't use that strategy myself, but hey, I can see people doing it).
But for the increment? I can only imagine it being of any use in really odd situations, for example:
After the initial data is inserted, maybe someone will later want to turn IDENTITY_INSERT on and insert new rows in the gaps, but will want the rows close to each other for efficient index lookups?
If you're looking up IDs directly off a URL and want to make it harder for people to arbitrarily access the other items (for example, instead of the user being able to work out that the URL suffix changes from /GetData?id=1000 to /GetData?id=1001, you set an increment of 437 so that the next URL is actually /GetData?id=1437)? Of course, if this is your "security" then you're probably already in trouble...
I can't think of anything else. Has anyone used an increment that wasn't 1, and why? I'm really just curious.
One idea might be to use this to facilitate partitioning of data (though there might be more "automated" ways to do that):
Considering you have two servers:
On one server, you start at 1 and increment by 2.
On the other server, you start at 2 and increment by 2.
Then, from your application, you send half the inserts to one server and the other half to the second server (some kind of software load-balancing).
This way, you still have the ability to identify your entries: the "UniqueID" is still unique, even if the data is split across two servers/tables.
But that's only a wild idea -- there are probably other uses for it...
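For what it's worth, a hypothetical Java sketch of routing the inserts (the DataSource wiring and table column are assumed; server A would use IDENTITY(1,2) and server B IDENTITY(2,2)):

import javax.sql.DataSource;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.concurrent.atomic.AtomicLong;

class PartitionedWriter {
    private final DataSource[] servers; // the two pre-configured servers
    private final AtomicLong counter = new AtomicLong();

    PartitionedWriter(DataSource serverA, DataSource serverB) {
        this.servers = new DataSource[] { serverA, serverB };
    }

    void insert(String payload) throws SQLException {
        // Round-robin: half the inserts go to each server; the interleaved
        // IDENTITY ranges keep UniqueID globally unique across both.
        DataSource target = servers[(int) (counter.getAndIncrement() % 2)];
        try (Connection con = target.getConnection();
             PreparedStatement ps = con.prepareStatement(
                     "INSERT INTO Table1 (Payload) VALUES (?)")) {
            ps.setString(1, payload);
            ps.executeUpdate();
        }
    }
}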
Once, purely for fun (oh yeah, we have a wild side to us), we decided to use a negative increment. It was strange to see the numbers grow in length and shrink in value at the same time.
I could hardly sit still in my chair.
Edit (afterthought):
You think the creator of IDENTITY was in love with FOR loops? You know...
for (i = 0; i<=99; i+=17)
or, for those non-semicolon folks out there:
For i = 0 to 100 step 17
Only for entertainment. And you have to be REALLY bored.