Current Scenario
Datastore used: DynamoDB.
DB size: 15-20 MB
Problem: for storing data I am thinking of using a common (constant) hash value as the partition key (with a timestamp as the sort key), so that the complete table is stored in a single partition only. This would give the table undivided throughput.
But I also intend to create GSIs for querying, so I was wondering whether it would be wrong to use GSIs on a single-partition table. I could use LSIs as well.
Is this the wrong approach?
Under the hood, a GSI is basically just another DynamoDB table. It follows the same partitioning rules as the main table. Partitions in your primary table are not correlated with the partitions of your GSIs, so it doesn't matter whether your table has a single partition or not.
Using a single partition in DynamoDB is a bad architectural choice overall, but I would argue that for a 20 MB database it doesn't matter too much.
DynamoDB manages table partitioning for you automatically, adding new
partitions if necessary and distributing provisioned throughput
capacity evenly across them.
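As a sketch of the point above, here is what the CreateTable parameters for a single-partition base table plus a GSI might look like (table, index, and attribute names such as `Messages` and `byType-index` are invented for illustration). The GSI gets its own storage and partitioning regardless of how the base table's keys are chosen:

```python
# Hypothetical CreateTable parameters: constant partition key + timestamp sort
# key on the base table, and a GSI with completely different keys. With boto3
# this dict could be passed as client.create_table(**table_params).
table_params = {
    "TableName": "Messages",
    "AttributeDefinitions": [
        {"AttributeName": "pk", "AttributeType": "S"},         # constant value, e.g. "ALL"
        {"AttributeName": "createdAt", "AttributeType": "N"},  # timestamp sort key
        {"AttributeName": "msgType", "AttributeType": "S"},    # GSI partition key
    ],
    "KeySchema": [
        {"AttributeName": "pk", "KeyType": "HASH"},
        {"AttributeName": "createdAt", "KeyType": "RANGE"},
    ],
    "GlobalSecondaryIndexes": [
        {
            # The GSI is stored and partitioned separately from the base table,
            # so it is unaffected by the base table's single-partition design.
            "IndexName": "byType-index",
            "KeySchema": [
                {"AttributeName": "msgType", "KeyType": "HASH"},
                {"AttributeName": "createdAt", "KeyType": "RANGE"},
            ],
            "Projection": {"ProjectionType": "ALL"},
        }
    ],
    "BillingMode": "PAY_PER_REQUEST",
}
```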
You can't control which partition an item goes to when the partition key values differ.
I guess what you are going to do is use the same partition key value for all items, with different sort key values (timestamps). In that case, I believe the data will be stored in a single partition, though I didn't understand your point about undivided throughput.
If you want to keep all the items of the index in a single partition, I think an LSI (Local Secondary Index) would be best suited here. An LSI is basically an alternate sort key for the same partition key.
A local secondary index maintains an alternate sort key for a given
partition key value.
If your single-partition rule does not apply to the index and you want a different partition key, then you need a GSI.
Related
I'm moving to tables partitioned by a timestamp column whose value is in milliseconds. Now I want to generate clusters by hour, which will depend on the same timestamp column I used for partitioning.
I want to use the same column for partitioning and clustering, but I'm not sure whether that works for generating hourly clusters.
I was planning on adding a new column that contains only the hour of each record, and then using this column to create my clustered table, but I want to better understand what will happen if I use the same timestamp column that I used for partitioning.
If the table contains lots of distinct partition-by ids (10K+) and the table keeps growing (millions of rows), will it run into an out-of-memory issue?
When this query runs, does the system need to keep all the partition windows (10K of them) in memory, along with the row_number for each partition?
It is safe to use it. How it will perform depends. If there is a "POC index" (Partition, Order by, Cover), i.e. an index with your PARTITION BY column as the first key column, your ORDER BY column(s) as the next key column(s), and the selected columns as included columns, it will be the best fit for this particular query. But you should consider the pros and cons of such an index. If there is no such (or similar) index, the query will be heavier - think table scans, spills to tempdb, etc. What the impact on your server will be - you should test to see.
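As a tiny self-contained illustration of the query shape being discussed (table and column names are invented; SQLite via Python is used here only because it supports the same window-function syntax), ROW_NUMBER() restarts at 1 within each partition:

```python
# Minimal ROW_NUMBER() OVER (PARTITION BY ...) demo.
# Requires SQLite 3.25+ (window-function support), bundled with modern Python.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER, ts INTEGER)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [(1, 10), (1, 20), (2, 5), (2, 15), (2, 25)],
)

rows = conn.execute(
    """
    SELECT id, ts,
           ROW_NUMBER() OVER (PARTITION BY id ORDER BY ts) AS rn
    FROM events
    ORDER BY id, ts
    """
).fetchall()

for r in rows:
    print(r)
# Numbering restarts per id: (1, 10, 1), (1, 20, 2), (2, 5, 1), ...
```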
I want to store data in an Azure Table. The Primary Key for this data will be an MD5 hash.
To get a good balance of performance and scalability it is a good idea to use a combination of both Partition Key and Row Key in the Azure Table.
I am considering splitting the MD5 hash into two parts at an arbitrary point. I will probably use the first three or so characters for the Partition Key so as to have a higher likelihood of collisions, and therefore end up with Partitions that each have a decent quantity of Row entries in them. The rest of the characters will make up the Row Key. This would mean the data is spread over 4,096 Partitions.
The overall dataset could become large, in the order of hundreds of thousands of records.
I am aware that atomic operations can more easily be done across entries in the same Partition; this is not a concern for me.
Is this Key-splitting approach worth considering? Or should I simply go for the simpler approach and have the Partition Key use the entire MD5 hash, with an empty Row Key?
Both of your approaches are fine. Basically, 4,096 partitions are sufficient for scaling; if you want even better scalability, use the full MD5 as the partition key, since you don't need atomic operations within a partition. Please note that the row key can't be an empty string, so consider using a constant string, or the same value as the partition key (the full MD5), instead.
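The key-splitting scheme from the question can be sketched in a few lines (the split point of 3 hex characters comes from the question; everything else here is illustrative):

```python
# Split an MD5 hex digest into Azure Table keys: first 3 hex chars as the
# PartitionKey (16^3 = 4096 possible partitions), remaining 29 as the RowKey.
import hashlib

def table_keys(data: bytes) -> tuple[str, str]:
    digest = hashlib.md5(data).hexdigest()  # 32 hex characters
    return digest[:3], digest[3:]           # (PartitionKey, RowKey)

pk, rk = table_keys(b"example record")
print(pk, rk)
```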
I need to create a table that would contain a slice of data produced by a continuously running process. This process generates messages that contain two mandatory components, among other things: a globally unique message UUID, and a message timestamp.
Those messages would be later retrieved by the UUID.
In addition, on a regular basis I would need to delete all messages from that table that are too old, i.e. whose timestamps are more than X away from the current time.
I've been reading the DynamoDB v2 documentation (e.g. Local Secondary Indexes) trying to figure out how to organize my table and whether or not I need a secondary index to perform searches for messages to delete. There might be a simple answer to my question, but I am somehow confused...
So should I just create a table with the UUID as the hash key and messageTimestamp as the range key (together with a "message" attribute containing the actual message), and not create any secondary indexes? In the examples I've seen, the hash was something that was not unique (e.g. ForumName under the above link). In my case, the hash would be unique. I am not sure whether that makes any difference.
And if I create the table with hash and range as described, and without a secondary index, then how would I query for all messages that are in a certain timerange regardless of their UUIDs?
DynamoDB introduced Global Secondary Indexes, which solve this problem.
http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GSI.html
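One way this could look (purely a sketch; the table name, index name, and attribute names `msgDate`/`msgTimestamp` are invented): keep the UUID as the table's hash key, and add a GSI whose hash key is the message's date and whose range key is the timestamp. A time range within one day is then a single Query against the index:

```python
# Hypothetical Query parameters against a date-keyed GSI. With boto3's
# low-level client this dict could be passed as client.query(**query_kwargs).
query_kwargs = {
    "TableName": "Messages",
    "IndexName": "byDate-index",
    "KeyConditionExpression": "msgDate = :d AND msgTimestamp BETWEEN :lo AND :hi",
    "ExpressionAttributeValues": {
        ":d":  {"S": "2014-03-01"},   # one date bucket per query
        ":lo": {"N": "1393632000"},   # range start (epoch seconds)
        ":hi": {"N": "1393718400"},   # range end
    },
}
# A time range spanning several days means one Query per date value.
```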
We've wrestled with this as well. The best solution we've come up with is to create second table for storing the time series data. To do this:
1) Use the date plus "bucket" id for a hash key
You could just use the date, but then I'm guessing today's date would become a "hot" key - one that is written with excessive frequency. This can create a serious bottleneck, as the total throughput for a particular DynamoDB partition is equal to the total provisioned throughput divided by the number of partitions - that means if all your writes are to a single key (today's key) and you have a throughput of 20 writes per second, then with 20 partitions, your total throughput would be 1 write per second. Any requests beyond this would be throttled. Not a good situation.
The bucket can be a random number from 1 to n, where n should be greater than the number of partitions used by the underlying DB. Determining n is a bit tricky of course because Dynamo does not reveal how many partitions it uses. But we are currently working with the upper limit of 200 based on the example found here. The writeup at this link was the basis for our thinking in coming up with this approach.
2) Use the UUID for the range key
3) Query records by issuing queries for each day and bucket.
This may seem tedious, but it is more efficient than a full scan. Another possibility is to use Elastic Map Reduce jobs, but I have not tried that myself yet so cannot say how easy/effective it is to work with.
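The date-plus-bucket scheme in steps 1-3 can be sketched as follows (the bucket count is a tunable assumption; the post above uses an upper limit of 200):

```python
# Sketch of the date + bucket hash-key scheme. Writes pick a random bucket so
# that "today's" traffic is spread over N_BUCKETS keys instead of one hot key;
# reads issue one query per bucket for the day in question.
import random

N_BUCKETS = 200  # should exceed the number of underlying partitions

def write_hash_key(date_str: str) -> str:
    """Hash key for a newly written item, e.g. '2014-03-01#137'."""
    return f"{date_str}#{random.randint(1, N_BUCKETS)}"

def keys_for_day(date_str: str) -> list[str]:
    """All hash keys that must be queried to read back one day's data."""
    return [f"{date_str}#{b}" for b in range(1, N_BUCKETS + 1)]

print(write_hash_key("2014-03-01"))
print(len(keys_for_day("2014-03-01")))  # one query per bucket
```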
We are still figuring this out ourselves, so I'm interested to hear others' comments. I also found this presentation very helpful in thinking through how best to use Dynamo:
Falling In and Out Of Love with Dynamo
-John
In short, you cannot. All DynamoDB queries MUST include the primary hash key. Optionally, you can also use the range key and/or a local secondary index. With current DynamoDB functionality you won't be able to use an LSI as an alternative to the primary index. You also cannot issue a query with only the range key (you can easily test this in the AWS Console).
A (costly) workaround I can think of is to issue a scan of the table with a filter on the timestamp value to find the items to delete. Note that filtering does not reduce the consumed capacity, as the scan still reads the whole table.
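The scan-with-filter workaround might look like this (attribute names `msgUuid`/`msgTimestamp` are invented). The filter is applied after items are read, which is why the full table's capacity is still consumed:

```python
# Hypothetical Scan parameters for finding expired items. With boto3's
# low-level client: page through client.scan(**scan_kwargs) and then issue
# BatchWriteItem DeleteRequests for the returned keys.
cutoff = "1393632000"  # delete everything with a timestamp older than this

scan_kwargs = {
    "TableName": "Messages",
    "FilterExpression": "msgTimestamp < :cutoff",
    "ExpressionAttributeValues": {":cutoff": {"N": cutoff}},
    "ProjectionExpression": "msgUuid, msgTimestamp",  # only the keys needed for deletes
}
```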
I've looked for a while over Google but can't find the answer!
Can a database table have multiple partitions, indexes and clusters attached to it?
Will it bring up an error if a partition is on the same row as an index?
Is there any benefit in this?
Many thanks,
Zulu
A table can have many indexes, and those are unrelated to whether it is a standard table, a partitioned table, or a clustered table. (Although if the table is partitioned, you have a choice about whether to create a separate index on each partition or a global index for the whole table.)
A table cannot belong to multiple clusters, since a cluster determines the actual physical storage location of the table.
A table can have multiple partitions (of course, else what would be the point?). It can't have multiple partitioning schemes, if that's what you mean.
I presume but have not confirmed that clusters and partitions are mutually exclusive, since they would have potentially conflicting effects on how the table data should be organized on disk.