GCP Cloud Datastore indexing strategy

New to Datastore but familiar with Cassandra and Dynamo. I have a use case where I have a unique composite key made up of two fields (A, B). B would be in descending order. My access pattern would be to query for the latest entities (based on B) given an A value (with pagination). My problem is that A could have very high cardinality (potentially in the 5-10 million range, but as low as 100-1000).
If this was in Dynamo I would have A be the partition key and B be the sort key.
In Datastore, however, the concept of a key identifier is throwing me off. Should I have the unique CONCAT(A, B) be the key identifier (to achieve some sort of unique constraint)? And then add an index on A and B again for queries?
I can't find much information on the inner workings of Datastore, so I'm not sure whether having CONCAT(A, B) as the key identifier would distribute the data randomly. I'm guessing that for fast queries I would want all entities with the same A value to be stored in the same partition. Or do indexes work like views in a relational database?
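For what it's worth, here is a rough sketch of what that pattern might look like with the google-cloud-datastore Python client; the kind name MyKind and the property names are just placeholders I made up, not anything prescribed by Datastore:

```python
# Minimal sketch using the google-cloud-datastore Python client.
# Kind name "MyKind" and properties "A"/"B" are illustrative placeholders.
from google.cloud import datastore

client = datastore.Client()

# Write: use CONCAT(A, B) as the key name to get a uniqueness guarantee,
# while keeping A and B as indexed properties so they remain queryable.
a_value, b_value = "user-123", 1700000000  # e.g. B is a timestamp
key = client.key("MyKind", f"{a_value}#{b_value}")
entity = datastore.Entity(key=key)
entity.update({"A": a_value, "B": b_value})
client.put(entity)

# Read: latest entities for a given A, newest B first.
# An equality filter on one property combined with a sort on another
# requires a composite index (declared in index.yaml).
query = client.query(kind="MyKind")
query.add_filter("A", "=", a_value)
query.order = ["-B"]
page = list(query.fetch(limit=20))  # subsequent pages can use query cursors
```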

Related

Global Vs Local Secondary Indexes in DynamoDB

I am still confused about the use of Local Secondary Indexes. Please give me specific use cases for when there is a need for an LSI vs a GSI.
For example, is the "GenreAlbumTitle" index here supposed to be a GSI or an LSI? https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/HowItWorks.CoreComponents.html#HowItWorks.CoreComponents.PrimaryKey
I can't seem to get my head around the need for an LSI, because any index I need would cover all rows of the table, not just one partition. And could someone also touch on the cost aspect? I understand an LSI is cheaper, but why is it cheaper?
Thank you all!
Every item in Dynamo must have a unique primary key. The primary key is the base table index. A primary key must have a partition key and can optionally have a range key (also called a sort key). Within a partition, items are ordered by range key. Accessing items using a partition key is fast.
Secondary indexes allow you to query the table using an alternative key. A Local Secondary Index (LSI) has the same partition key as the primary key (index), but a different range key. The way to think about an LSI is that it's the same data as the primary index, just ordered by a different attribute.
A Global Secondary Index (GSI) has a different partition key to the primary key and is therefore a different set of data.
One of the important differences between an LSI and a GSI is that an LSI takes its throughput capacity from the base table, whereas you purchase GSI throughput capacity separately. Put another way, an LSI costs you nothing extra and a GSI incurs extra cost over your base table.
Let's have a look at the Music table example. Let's say the base table has this schema:
Artist: Partition Key (part of the Primary Key)
SongTitle: Range Key (part of the Primary Key)
AlbumTitle
DateOfRelease
This table is a list of songs. I can access all the songs for an artist really efficiently (i.e. query by Artist using the partition key). When I do this query the songs will be ordered by SongTitle. I can also access songs by Artist and SongTitle very efficiently using the unique primary key.
Now let's say I want to get all songs by an Artist but ordered by DateOfRelease. With the current schema I would need to get all the songs and then sort them in my application. A good alternative would be to create a new index with a partition key of Artist and a range key of DateOfRelease. This will be an LSI because the partition key of the index (Artist) is the same as the partition key of the primary key. I do not need to purchase additional throughput capacity, as this index draws from the base table's capacity.
Now let's say I want to access the songs by AlbumTitle, ordered by SongTitle, i.e. create lists of albums. To do this efficiently I create a new index with partition key AlbumTitle and range key SongTitle. This is a GSI because the partition key is different from the primary key's. This GSI must be provisioned separately from the base table and therefore costs extra.
In answer to your question, GenreAlbumTitle is a GSI because it has a different partition key to Music.
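To make this concrete, here is a minimal boto3 sketch of how such a table could be declared; the index names and capacity numbers are placeholders I chose for illustration, and note that an LSI can only be created together with the table:

```python
# Sketch: declaring the Music table with one LSI and one GSI using boto3.
# Index names and throughput numbers are illustrative placeholders.
import boto3

dynamodb = boto3.client("dynamodb")

dynamodb.create_table(
    TableName="Music",
    AttributeDefinitions=[
        {"AttributeName": "Artist", "AttributeType": "S"},
        {"AttributeName": "SongTitle", "AttributeType": "S"},
        {"AttributeName": "DateOfRelease", "AttributeType": "S"},
        {"AttributeName": "AlbumTitle", "AttributeType": "S"},
    ],
    KeySchema=[  # base table: Artist (partition) + SongTitle (range)
        {"AttributeName": "Artist", "KeyType": "HASH"},
        {"AttributeName": "SongTitle", "KeyType": "RANGE"},
    ],
    LocalSecondaryIndexes=[{  # same partition key, different range key
        "IndexName": "ArtistDateOfRelease",
        "KeySchema": [
            {"AttributeName": "Artist", "KeyType": "HASH"},
            {"AttributeName": "DateOfRelease", "KeyType": "RANGE"},
        ],
        "Projection": {"ProjectionType": "KEYS_ONLY"},
    }],
    GlobalSecondaryIndexes=[{  # different partition key => separate capacity
        "IndexName": "AlbumTitleSongTitle",
        "KeySchema": [
            {"AttributeName": "AlbumTitle", "KeyType": "HASH"},
            {"AttributeName": "SongTitle", "KeyType": "RANGE"},
        ],
        "Projection": {"ProjectionType": "ALL"},
        "ProvisionedThroughput": {"ReadCapacityUnits": 5, "WriteCapacityUnits": 5},
    }],
    ProvisionedThroughput={"ReadCapacityUnits": 5, "WriteCapacityUnits": 5},
)
```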
There are some misconceptions about the costs of using an LSI, so let me clarify here.
Using an LSI is not free of charge. Just like a GSI, DynamoDB needs to create and maintain an additional partial copy of the table in order to return results quickly. Maintaining this copy incurs additional read, write, and storage costs, just as a GSI does. The only difference is that instead of provisioning a separate capacity plan, the LSI uses the same capacity plan as the main table.
Before discussing the additional cost, let me again summarize what kind of information is stored in the partial copy. The partial table copy (LSI) contains the partition key (same as the original table), a sort key (a different one than the original table's), and any additional projected attributes.
Original Table

Artist (Partition Key)  | Song Title (Sort Key)  | Album Title                | Date Of Release
Michael Jackson         | Beat It                | Thriller                   | December 1, 1982
Weeknd                  | The Hills              | Beauty Behind the Madness  | May 27, 2015
LSI

Artist (Partition Key)  | Album Title (Sort Key)     | Date Of Release
Michael Jackson         | Thriller                   | December 1, 1982
Weeknd                  | Beauty Behind the Madness  | May 27, 2015
Projected attributes are the additional information we want to query from the LSI. I could say, "show me all of the release dates of the albums by the Weeknd, ordered by album name". As you can see, we don't care about the song title here, so it is not included in our LSI projection.
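As a rough illustration (the table and index names are placeholders I made up), a query against such an LSI might look like this with boto3; requesting only projected attributes keeps the read entirely on the LSI, while asking for SongTitle as well forces DynamoDB to fetch the item from the base table:

```python
# Sketch: querying the LSI for "release dates of albums by the Weeknd,
# ordered by album name". Table/index names are illustrative placeholders.
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("Music")

# Served from the LSI alone: only key attributes and projected attributes.
lsi_only = table.query(
    IndexName="ArtistAlbumTitle",
    KeyConditionExpression=Key("Artist").eq("Weeknd"),
    ProjectionExpression="AlbumTitle, DateOfRelease",
)

# Also asks for SongTitle, which is not projected into the LSI, so DynamoDB
# fetches the full item from the base table (extra read cost).
with_fetch = table.query(
    IndexName="ArtistAlbumTitle",
    KeyConditionExpression=Key("Artist").eq("Weeknd"),
    ProjectionExpression="AlbumTitle, DateOfRelease, SongTitle",
)
```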
Additional cost for reads:
You are charged 1 Read Unit for queries that can be satisfied by the LSI alone. Example: "Show me all of the release dates of the albums by the Weeknd, ordered by album name."
You are charged 1 additional Read Unit for queries that the LSI cannot serve on its own, forcing it to go to the main table for the rest, for a total of 2 Read Units. Example: "Show me all of the release dates AND the song titles of the songs by the Weeknd, ordered by album name."
Additional cost for writes:
(Writes go to the main table, with its own write cost, and the changes are then propagated to the LSIs.)
If the update to the main table results in the creation of a new row in the LSI => 1 additional Write Unit
If the update to the main table changes a key attribute of an existing row in the LSI => deletion (1 unit) + creation (1 unit) = 2 additional Write Units
If the update to the main table changes a non-key attribute of an existing row in the LSI => 1 additional Write Unit
If the update to the main table deletes an existing attribute of an existing row in the LSI => 1 additional Write Unit
If the update to the main table does not change any rows of the LSI => 0 additional Write Units
Additional cost for storage:
You pay additional cost for: (size of index keys + size of projected attributes + overhead) x number of rows
As you can see, if we are not careful with LSI, extra costs can become overbearing. To minimize cost, you must:
Carefully consider your typical queries. Which types of information do you need?
There is a tradeoff between read cost and storage cost. If you project every attribute into the LSI, then you incur no extra read cost, but your storage cost will double. If you project only the key attributes and you often fetch other attributes as well, there will be a lot of extra read cost from having to go back to the main table.
For tables that are write-heavy, expect a noticeable uptick in write cost. Remember: if an update to the main table changes a key attribute of an item in the LSI, you pay 2 additional Write Units, and for non-key attributes, 1 additional Write Unit.

Primary Key and GSI Design in DynamoDB

I've recently started learning DynamoDB and created a table 'reviews' with the following attributes (along with the DynamoDB type):
productId - String
username - String
feedbackText - String
lastModifiedDate - Number (I'm storing the UNIX timestamp)
createdDate - Number
active - Number (0/1 value, 1 for all records by default)
Following are the queries that I expect to run on this table:
1. Get all reviews for a 'productId'
2. Get all reviews submitted by a 'username' (sorted asc/desc by lastModifiedDate)
3. Get N most recent reviews across products and users (using lastModifiedDate)
Now in order to be able to run these queries I have created the following on the 'reviews' table:
1. A Primary Key with 'productId' as the Hash Key and 'username' as the Range Key
2. A GSI with 'username' as the Hash Key and 'lastModifiedDate' as the Range Key
3. A GSI with 'active' as the Hash Key and 'lastModifiedDate' as the Range Key
The last index is somewhat of a hack since I introduced the 'active' attribute in my table only so that the value can be '1' for all records and I can use it as a Hash Key for the GSI.
My question is simple. I've read a bit about DynamoDB already and this is the best design I could come up with. I want to ask if there is a better primary key/index design that I could be using here, or if there is a concept in DynamoDB that I may have missed that could be beneficial in this specific use case. Thanks!
I think your design is correct:
The table key and the GSI from point 2 will cover your first two queries. No surprises here; this is pretty standard.
I think your design for the last query is also correct, even if somewhat hacky and possibly not the best in terms of performance. Using the same value for the hash key is what you need to do given DynamoDB's limitations: you want to get values in order, so you need a range key, and since you want to query by the range key alone, you have to provide a constant hash key. Just note that this may not scale very well as your table grows into many partitions (though I don't have any data to back that statement up).
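As a rough sketch of that last query with boto3 (the GSI name is a placeholder; the attribute names follow the question):

```python
# Sketch: "N most recent reviews" via the GSI whose hash key is the constant
# active = 1 and whose range key is lastModifiedDate. The index name is a
# placeholder for illustration.
import boto3
from boto3.dynamodb.conditions import Key

reviews = boto3.resource("dynamodb").Table("reviews")

N = 20
resp = reviews.query(
    IndexName="active-lastModifiedDate-index",
    KeyConditionExpression=Key("active").eq(1),
    ScanIndexForward=False,  # descending by lastModifiedDate => most recent first
    Limit=N,
)
recent_reviews = resp["Items"]
```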

DynamoDB: When to use what PK type?

I am trying to read up on best practices on DynamoDB. I saw that DynamoDB has two PK types:
Hash Key
Hash and Range Key
From what I read, it appears the latter is like the former but supports sorting and indexing of a finite set of columns.
So my question is why ever use only a hash key without a range key? Is it a viable choice only when the table is not searched?
It'd also be great to have some general guidelines on when to use what key type. I've read several guides (including Amazon's own documentation on DynamoDB) but none of them appear to directly address this question.
Thanks
The choice of which key to use comes down to your Use Cases and Data Requirements for a particular scenario. For example, if you are storing User Session Data, it might not make much sense to use a Range Key, since each record can be referenced by a GUID and accessed directly with no grouping requirements. In general terms, once you know the Session Id you just get the specific item by querying on the key. Another example could be storing User Account or Profile data: each user has their own, and you will most likely access it directly (by User Id or something else).
However, if you are storing Order Items, then the Range Key makes much more sense, since you probably want to retrieve the items grouped by their Order.
In terms of the Data Model, the Hash Key allows you to uniquely identify a record in your table, and the Range Key can optionally be used to group and sort several records that are usually retrieved together. Example: if you are defining an Aggregate to store Order Items, the Order Id could be your Hash Key and the OrderItemId the Range Key. Whenever you want to retrieve the Order Items from a particular Order, you just query by the Hash Key (Order Id), and you will get all of them.
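To make the two cases concrete, here is a minimal boto3 sketch of the corresponding key schemas; the table and attribute names are only illustrative:

```python
# Sketch: hash-only key for direct lookups vs. hash+range key for grouping.
# Table and attribute names are illustrative placeholders.
import boto3

dynamodb = boto3.client("dynamodb")

# Session data: one item per session, always fetched directly by its GUID.
dynamodb.create_table(
    TableName="Sessions",
    AttributeDefinitions=[{"AttributeName": "SessionId", "AttributeType": "S"}],
    KeySchema=[{"AttributeName": "SessionId", "KeyType": "HASH"}],
    BillingMode="PAY_PER_REQUEST",
)

# Order items: grouped under their order and usually retrieved together.
dynamodb.create_table(
    TableName="OrderItems",
    AttributeDefinitions=[
        {"AttributeName": "OrderId", "AttributeType": "S"},
        {"AttributeName": "OrderItemId", "AttributeType": "S"},
    ],
    KeySchema=[
        {"AttributeName": "OrderId", "KeyType": "HASH"},
        {"AttributeName": "OrderItemId", "KeyType": "RANGE"},
    ],
    BillingMode="PAY_PER_REQUEST",
)
```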
You can find below a formal definition for the use of these two keys:
"Composite Hash Key with Range Key allows the developer to create a
primary key that is the composite of two attributes, a 'hash
attribute' and a 'range attribute.' When querying against a composite
key, the hash attribute needs to be uniquely matched but a range
operation can be specified for the range attribute: e.g. all orders
from Werner in the past 24 hours, or all games played by an individual
player in the past 24 hours." [VOGELS]
So the Range Key adds a grouping capability to the Data Model; however, the use of these two keys also has an implication for the Storage Model:
"Dynamo uses consistent hashing to partition its key space across its
replicas and to ensure uniform load distribution. A uniform key
distribution can help us achieve uniform load distribution assuming
the access distribution of keys is not highly skewed."
[DDB-SOSP2007]
Not only does the Hash Key allow you to uniquely identify the record, it is also the mechanism that ensures load distribution. The Range Key (when used) helps indicate which records will mostly be retrieved together, so the storage can also be optimized for that need.
Choosing the correct keys to represent your data is one of the most critical aspects of your design process, and it directly impacts how well your application will perform and scale, and how much it will cost.
Footnotes:
The Data Model is the model through which we perceive and manipulate our data. It describes how we interact with the data in the database [FOWLER]. In other words, it is how you abstract your data model: the way you group your entities, the attributes that you choose as primary keys, etc.
The Storage Model describes how the database stores and manipulates the data internally [FOWLER]. Although you cannot control this directly, you can certainly optimize how the data is retrieved or written by knowing how the database works internally.

surrogate keys vs compound keys

Fairly new to database schema design (I plan to use SQLite). Having said that, I'm thinking about using surrogate keys, because the database currently relies on a compound key (3 columns) that shows up in most of my tables. I have several tables that contain the 3 columns of the unique key plus one column of other information; I also have one table that contains 3 columns for its unique key and the same 3 columns as foreign keys (many parents). Combining all these tables into a single table doesn't seem to make sense because there would be many empty fields.
Any pitfalls if I choose one or the other? Which one is generally considered more convenient for programming?
Thank you in advance.
Each technique has advantages and disadvantages.
In general, it's easier to write SQL statements and JOINs if you only need to refer to the single surrogate key column. It also reduces the size of your database fairly substantially.
On the other hand, with surrogate keys you often find yourself having to add at least one extra table to your JOINs in order to retrieve information that would otherwise be carried in the compound key.
Two additional advantages of a surrogate key:
Many frameworks require the use of an integer primary key field.
If you are binding your records to any kind of user interface control (inputs on a web page, for example), it's considerably easier to attach a single value to the control for identification purposes than it is to encode and decode multiple columns.
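As a small illustration of the JOIN trade-off mentioned above, here is a sketch using SQLite (the schema and column names are made up for the example):

```python
# Sketch: the JOIN trade-off between a compound key and a surrogate key.
# Schema and column names are made up for illustration (SQLite in-memory DB).
import sqlite3

conn = sqlite3.connect(":memory:")

# Compound-key style: the child carries the full natural key, so a query
# filtering on (a, b, c) needs no join back to the parent.
conn.executescript("""
CREATE TABLE parent_ck (a TEXT, b TEXT, c TEXT, info TEXT, PRIMARY KEY (a, b, c));
CREATE TABLE child_ck  (a TEXT, b TEXT, c TEXT, detail TEXT,
                        FOREIGN KEY (a, b, c) REFERENCES parent_ck (a, b, c));
""")
conn.execute("SELECT detail FROM child_ck WHERE a = ? AND b = ? AND c = ?", ("x", "y", "z"))

# Surrogate-key style: the child carries only parent_id, so recovering
# (a, b, c) requires one extra join, but the FK is a single small column.
conn.executescript("""
CREATE TABLE parent_sk (id INTEGER PRIMARY KEY, a TEXT, b TEXT, c TEXT, info TEXT,
                        UNIQUE (a, b, c));
CREATE TABLE child_sk  (id INTEGER PRIMARY KEY, parent_id INTEGER REFERENCES parent_sk (id),
                        detail TEXT);
""")
conn.execute("""
SELECT ch.detail FROM child_sk ch
JOIN parent_sk p ON p.id = ch.parent_id
WHERE p.a = ? AND p.b = ? AND p.c = ?""", ("x", "y", "z"))
```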

Database Design: Alternate to composite keys?

I am building a database system and having trouble with the design of one of my tables.
In this system there is a users table, an object table, an item table and cost table.
A unique record in the cost table is determined by the user, object, item and year. However, there can be multiple records with the same year if the item is different.
The hierarchy goes user->object->item->year, multiple unique years per item, multiple unique items per object, multiple unique objects per user, multiple unique users.
What would be the best way to design the cost table?
I am thinking of including the userid, objectid and itemid as foreign keys and then using a composite key consisting of userid, objectid, itemid and costyear. I have heard that composite keys are bad design, but I am unsure how to structure this to avoid using a composite key. As you can tell, my database-building skills are a bit rusty.
Thanks!
P.S. If it matters, this is an interbase db.
To avoid the composite key, you just define a surrogate key. This holds an artificial value, for instance an auto-increment counter.
You still can (and should) define a unique constraint on these columns.
Btw: not only is it recommended not to use composite keys, it's also advisable to use surrogate keys, in all your tables.
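A minimal sketch of what that could look like for the cost table; the column names follow the question, and SQLite syntax is used purely for illustration (the question mentions InterBase):

```python
# Sketch: surrogate primary key plus a unique constraint on the natural
# compound key. Column names follow the question; SQLite is used here only
# for illustration (the original database is InterBase).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users   (userid   INTEGER PRIMARY KEY);
CREATE TABLE objects (objectid INTEGER PRIMARY KEY);
CREATE TABLE items   (itemid   INTEGER PRIMARY KEY);

CREATE TABLE cost (
    costid    INTEGER PRIMARY KEY,                     -- surrogate key (auto counter)
    userid    INTEGER NOT NULL REFERENCES users (userid),
    objectid  INTEGER NOT NULL REFERENCES objects (objectid),
    itemid    INTEGER NOT NULL REFERENCES items (itemid),
    costyear  INTEGER NOT NULL,
    amount    NUMERIC,
    UNIQUE (userid, objectid, itemid, costyear)        -- natural compound key stays unique
);
""")
```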
Use an internally generated key field (called a surrogate key), something like CostID, that the users will never see but that will uniquely identify each entry in the Cost table (in SQL Server, field types like uniqueidentifier or IDENTITY would do the trick).
Try building your database with a composite key using exactly the columns you outlined, and see what happens. You may be pleasantly surprised. Making sure that there is no missing data in those four columns, and making sure that no two rows have the same value in all four columns will help protect the integrity of your data.
When you declare a composite primary key, the order of columns in your declaration won't affect the logical consequences of the declaration. However, the composite index that the DBMS builds for you will have its columns in the same order, and the order of columns in a composite index does affect performance.
For queries that specify only one, two, or three of these columns, the index will be useless if the first column in the index is not among the columns specified in the query. If you know in advance what your queries are going to be, and which queries most need to run fast, this can help you declare the columns of the primary key in the right order. In rare circumstances, creating two or three additional one-column indexes can speed up some queries, at the cost of slowing down updates.
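As a rough illustration of the column-order point (again using SQLite purely for illustration, with the column names from the question):

```python
# Sketch: with a composite primary key on (userid, objectid, itemid, costyear),
# the backing index helps queries that constrain a leading prefix of those
# columns, but not queries that skip the first column.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE cost (
    userid INTEGER, objectid INTEGER, itemid INTEGER, costyear INTEGER,
    amount NUMERIC,
    PRIMARY KEY (userid, objectid, itemid, costyear)
)""")

# Can use the primary-key index: the leading columns are constrained.
print(conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM cost WHERE userid = 1 AND objectid = 2"
).fetchall())

# Cannot use it effectively: userid (the first indexed column) is not constrained.
print(conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM cost WHERE costyear = 2024"
).fetchall())
```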