Primary Key and GSI Design in DynamoDB

I've recently started learning DynamoDB and created a table 'reviews' with the following attributes (along with their DynamoDB types):
productId - String
username - String
feedbackText - String
lastModifiedDate - Number (I'm storing the UNIX timestamp)
createdDate - Number
active - Number (0/1 value, 1 for all records by default)
Following are the queries that I expect to run on this table:
1. Get all reviews for a 'productId'
2. Get all reviews submitted by a 'username' (sorted asc/desc by lastModifiedDate)
3. Get N most recent reviews across products and users (using lastModifiedDate)
Now in order to be able to run these queries I have created the following on the 'reviews' table:
1. A Primary Key with 'productId' as the Hash Key and 'username' as the Range Key
2. A GSI with 'username' as the Hash Key and 'lastModifiedDate' as the Range Key
3. A GSI with 'active' as the Hash Key and 'lastModifiedDate' as the Range Key
The last index is somewhat of a hack since I introduced the 'active' attribute in my table only so that the value can be '1' for all records and I can use it as a Hash Key for the GSI.
My question is simple. I've read a bit about DynamoDB already and this is the best design I could come up with. I want to ask if there is a better primary key/index design that I could be using here, or if there is a concept in DynamoDB that I may have missed which could be beneficial in this specific use case. Thanks!
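For reference, this is roughly what that layout looks like as a boto3 sketch (the index names and the PAY_PER_REQUEST billing mode are illustrative choices, not something prescribed above):

# Minimal sketch of the 'reviews' table and its two GSIs, assuming boto3.
import boto3

dynamodb = boto3.client("dynamodb")

dynamodb.create_table(
    TableName="reviews",
    AttributeDefinitions=[
        {"AttributeName": "productId", "AttributeType": "S"},
        {"AttributeName": "username", "AttributeType": "S"},
        {"AttributeName": "lastModifiedDate", "AttributeType": "N"},
        {"AttributeName": "active", "AttributeType": "N"},
    ],
    KeySchema=[
        {"AttributeName": "productId", "KeyType": "HASH"},  # hash key
        {"AttributeName": "username", "KeyType": "RANGE"},  # range key
    ],
    GlobalSecondaryIndexes=[
        {
            "IndexName": "username-lastModifiedDate-index",
            "KeySchema": [
                {"AttributeName": "username", "KeyType": "HASH"},
                {"AttributeName": "lastModifiedDate", "KeyType": "RANGE"},
            ],
            "Projection": {"ProjectionType": "ALL"},
        },
        {
            "IndexName": "active-lastModifiedDate-index",
            "KeySchema": [
                {"AttributeName": "active", "KeyType": "HASH"},
                {"AttributeName": "lastModifiedDate", "KeyType": "RANGE"},
            ],
            "Projection": {"ProjectionType": "ALL"},
        },
    ],
    BillingMode="PAY_PER_REQUEST",
)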

I think your design is correct:
the table key and GSI from point 2 will cover your first two queries. No surprises here, this is pretty standard.
I think your design for the last query is also correct, even if somewhat hacky and possibly not the best in terms of performance. Given DynamoDB's limitations, using the same value for the hash key is what you need to do: you want items back in sorted order, so you need a range key, and since the range key is the only thing you actually want to query on, you have to supply a constant value for the hash key. Just note that this may not scale very well as your table grows, because every item in that index shares one partition key value and therefore lands in the same item collection (though I don't have any data to back that statement up).
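To make that concrete, the "N most recent" query against the constant-hash-key index would look something like this (a sketch assuming boto3 and the illustrative index name from above):

# Query 3: N most recent reviews, newest first, via the 'active' GSI.
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("reviews")

resp = table.query(
    IndexName="active-lastModifiedDate-index",
    KeyConditionExpression=Key("active").eq(1),  # the constant hash key
    ScanIndexForward=False,                      # descending by range key
    Limit=10,                                    # N = 10
)
recent_reviews = resp["Items"]

Every one of these reads hits the same partition key value, which is exactly why the approach can become a bottleneck at scale.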

Related

MariaDB Indexing

Let's say I have a table of 200,000,000 users. For each user I have saved a certain attribute; let's say it is their last name.
I am unsure of which index type to use with MariaDB. The only queries made to the database will be in the form of SELECT lastname FROM table WHERE username='MYUSERNAME'.
Is it therefore best to just define the column username as a primary key, or do I need to do anything else? Also, how long is it going to take until the index is built?
Sorry for this question, but this is my first database with more than 200,000 rows.
I would go with:
CREATE INDEX userindex on `table`(username);
This indexes the usernames, which is the column your query searches on, so results will come back faster because the lookup no longer has to scan the whole table.
Try it, and if it reduces performance, just delete the index; nothing lost (although make sure you do have backups! :))
This article will help you out https://mariadb.com/kb/en/getting-started-with-indexes/
It says primary keys are best set at table creation, and as I guess your table already exists, that would mean either copying the data into a new table with a primary key, or just using an index.
I recently indexed a table with non unique strings as an ID and although it took a few minutes to index the speed performance was a great improvement, this table was 57m rows.
-EDIT- Just re-read and thought it was 200,000 as mentioned at the end, but I see it is 200,000,000 in the title; that's a hella lotta rows.
username sounds like something that is "unique" and not null. So, make it NOT NULL and have PRIMARY KEY(username), without an AUTO_INCREMENT surrogate PK.
If it is not unique, or cannot be NOT NULL, then INDEX(username) is very likely to be useful.
To design indexes, you must first know what queries you will be performing. (If you had called it simply "col1", I would not have been able to guess at the above advice.)
There are 3 index types:
BTree (actually B+Tree; see Wikipedia). This is the default and the most commonly used index type. It is efficient at finding a row given a specific value (WHERE user_name = 'joe'). It is also useful for a range of values (WHERE user_name LIKE 'Smith%').
FULLTEXT is useful for a TEXT column where you want to search for "words" inside it.
SPATIAL is useful for 2-dimensional data, such as geographical points on a map or other type of grid.
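As a quick illustration of the first two index types in practice (a sketch using Python's mysql-connector package; the connection details, table and column names are placeholders, not from the question):

# Assumes a running MariaDB server and valid credentials; adjust to your setup.
import mysql.connector

conn = mysql.connector.connect(
    host="localhost", user="app", password="secret", database="mydb"
)
cur = conn.cursor()

# BTree index on the lookup column (the PRIMARY KEY variant would instead be
# PRIMARY KEY(username) declared at table creation):
cur.execute("CREATE INDEX userindex ON users(username)")

# FULLTEXT index, for word searches inside a TEXT column:
cur.execute("CREATE FULLTEXT INDEX bio_ft ON users(bio)")

conn.commit()
cur.close()
conn.close()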

searching on array items on a DynamoDB table

I need to understand how one can search attributes of a DynamoDB table when they are part of an array.
So, in denormalising a table, say a person has many email addresses; I would create an array in the person table to store the email addresses.
Now, as the email address is not part of the sort key, if I need to perform a search on an email address to find the person record, I need to index the email attribute.
Can I create an index on the email address, which has a 1-many relationship with the person record and, as I understand it, is stored as an array in DynamoDB?
Would this secondary index be global or local, assuming I have billions of person records?
If I could create it as either LSI or GSI, please explain the pros/cons of each.
thank you very much!
It's worth getting the terminology right to start with. DynamoDB's supported data types are:
Scalar - String, Number, Binary, Boolean
Document - List, Map
Sets - String Set, Number Set, Binary Set
I think you are suggesting you have an attribute that contains a list of emails. The attribute might look like this
Emails: ["one#email.com", "two#email.com", "three#email.com"]
There are a couple of relevant points about key attributes in the DynamoDB documentation. Firstly, keys must be top-level attributes (they can't be nested in JSON documents). Secondly, they must be of a scalar type (i.e. String, Number or Binary).
As your list of emails is not a scalar type, you cannot use it in a key or index.
Given this schema you would have to perform a scan, in which you would set the FilterExpression on your Emails attribute using the CONTAINS operator.
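That scan might look something like the following (a sketch assuming boto3 and a table named 'person'; both names are illustrative):

# Scan with a filter on the Emails list attribute.
import boto3
from boto3.dynamodb.conditions import Attr

table = boto3.resource("dynamodb").Table("person")

resp = table.scan(
    FilterExpression=Attr("Emails").contains("two@email.com")
)
matches = resp["Items"]

Bear in mind that the filter is applied after the read, so the scan still consumes read capacity for the whole table.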
Stu's answer has some great information in it and he is right: you can't use an Array itself as a key.
What you CAN sometimes do is concatenate several variables (or an Array) into a single string with a known separator (maybe '_' for example), and then use that string as a Sort Key.
I used this concept to create a composite Sort Key that consisted of multiple ISO 8601 date values (the convention in DynamoDB is to store dates as ISO 8601 strings in String type attributes). I also used several attributes that were not dates but were integers with a fixed character length.
By using the BETWEEN comparison I am able to individually query each of the variables that are concatenated into the Sort Key, or construct a complex query that matches against all of them as a group.
In other words a data object could use a Sort Key like this:
email@gmail.com_email@msn.com_email@someotherplace.com
Then you could query that (assuming you knew what the partition key is) with something like this:
SELECT * FROM Users
WHERE User='Bob' AND Emails LIKE '%email@msn.com%'
YOU MUST know the partition key in order to perform a Query no matter what you choose as your Sort Key and no matter how that Sort Key is constructed.
I think the real question you are asking is what should my sort keys and partition keys be? That will depend on exactly which queries you want to make and how frequently each type of query is used.
I have found that I have way more success with DynamoDB if I think about the queries I want to make first, and then go from there.
A word on Secondary Indexes (GSI / LSI)
The issue here is that you still need to 'know' the Partition Key for your secondary data structure. GSI / LSI help you avoid needing to create additional DynamoDB tables for the sole purpose of improving data access.
From Amazon:
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/SecondaryIndexes.html
To me it sounds more like the issue is selecting the Keys.
LSI (Local Secondary Index)
If (for your Query case) you don't know the Partition Key to begin with (as it seems you don't) then a Local Secondary Index won't help — since it has the SAME Partition Key as the base table.
GSI (Global Secondary Index)
A Global Secondary Index could help in that you can have a DIFFERENT Partition Key and Sort Key (presumably a partition key that you could 'know' for this query).
So you could use the Email attribute (perhaps composite) as the Sort Key on your GSI and then something like a service name, or sign-up stage, as your Partition Key. This would let you 'know' what partition that user would be in based on their progress or the service they signed up from (for example).
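If you went down that road, querying the GSI could look something like this (a hedged sketch; the index name "service-email-index" and the attribute names "service" and "Emails" are assumptions for illustration). Note that a Query key condition only supports =, BETWEEN and begins_with on the sort key, so the concatenated value has to be arranged so the part you search on is a prefix:

# Query the hypothetical GSI by a known partition key.
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("person")

resp = table.query(
    IndexName="service-email-index",
    KeyConditionExpression=(
        Key("service").eq("signup")
        & Key("Emails").begins_with("email@msn.com")
    ),
)
matches = resp["Items"]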
One more thing to keep in mind: unlike the base table's primary key, GSI / LSI keys do not have to be unique, so multiple items can share the same index key values!

Composite Primary Key equivalent in Redis

I'm new to NoSQL databases so forgive my SQL mentality, but I'm looking to store data that can be 'queried' by one of 2 keys. Here's the structure:
{user_id, business_id, last_seen_ts, first_seen_ts}
where, if this were a SQL DB, I'd use the user_id and business_id as a composite primary key. The sort of querying I'm looking for is:
1.'get all where business_id = x'
2.'get all where user_id = x'
Any tips? I don't think I can make a simple secondary index based on the 2 retrieval types above. I looked into commands like 'zadd' and 'zrange' but there isn't really any sorting involved here.
The use case for Redis for me is to alleviate writes and reads on my SQL database while this program computes (doing its storage in redis) what eventually will be written to the SQL DB.
Note: given the OP's self-proclaimed experience, this answer is intentionally simplified for educational purposes.
(one of) The first thing(s) you need to understand about Redis is that you design the data so every query will be what you're used to thinking of as access by primary key. It is convenient, in that sense, to imagine Redis' keyspace (the global dictionary) as something like this relational table:
CREATE TABLE redis (
key VARCHAR(512MB) NOT NULL,
value VARCHAR(512MB),
PRIMARY KEY (key)
);
Note: in Redis, value can be more than just a String of course.
Keeping that in mind, and unlike other database models where normalizing data is the practice, you want to have your Redis data ready to handle both of your queries efficiently. That means you'll be saving the data twice: once under a primary key that allows searching for businesses by id, and a second time under one that allows querying by user id.
To answer the first query ("get all where business_id = x"), you want a key for each x that holds the relevant data (in Redis we use the colon, ':', as a separator by convention), so for x=1 you'd probably call your key business:1, for x=a1b2c3 business:a1b2c3 and so forth.
Each such business:x key could be a Redis Set, where each member represents the rest of the tuple. So, if the data is something like:
{user_id: foo, business_id: bar, last_seen_ts: 987, first_seen_ts: 123}
You'd be storing it with Redis with something like:
SADD business:bar foo
Note: you can use any serialization you want, Set members are just Strings.
With this in place, answering the first query is just a matter of SMEMBERS business:bar (or SSCANing it for larger Sets).
If you've followed through, you already know how to serve the second query. First, use a Set for each user (e.g. user:foo) to which you SADD user:foo bar. Then SMEMBERS/SSCAN and you're almost home.
The last thing you'll need is another set of keys, but this time you can use Hashes. Each such Hash will store the additional information of the tuple, namely the timestamps. We can use a "Primary Key" made up of the business and the user ids (or vice versa) like so:
HMSET foo:bar first 123 last 987
After you've gotten the results from the 1st or 2nd query, you can fetch the contents of the relevant Hashes to complete the query (assuming that the queries return the timestamps as well).
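Put together, a minimal redis-py sketch of this scheme (key names follow the foo/bar example above; the helper functions are mine) could be:

# Store each tuple twice for lookup, plus a hash for the timestamps.
import redis

r = redis.Redis(decode_responses=True)

def save(user_id, business_id, first_seen_ts, last_seen_ts):
    r.sadd(f"business:{business_id}", user_id)   # query 1: by business
    r.sadd(f"user:{user_id}", business_id)       # query 2: by user
    r.hset(f"{user_id}:{business_id}",
           mapping={"first": first_seen_ts, "last": last_seen_ts})

def get_all_for_business(business_id):
    users = r.smembers(f"business:{business_id}")
    return {u: r.hgetall(f"{u}:{business_id}") for u in users}

save("foo", "bar", 123, 987)
print(get_all_for_business("bar"))  # {'foo': {'first': '123', 'last': '987'}}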
The idiomatic way of doing this in Redis is to use a SET for each type of query you want to do.
In your case you would create:
a hash for each tuple (user_id, business_id, last_seen_ts, first_seen_ts)
a set with a name like user:<user_id>:business:<business_id>, to store the keys of the hashes for this user and this business (you have to add the ID of the hashes with SADD)
Then to get all data for a given user and business, you have to get the SET content with SMEMBERS first, and then fetch (HGETALL) every hash whose ID is in the SET.
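A compact sketch of that variant (the hash ID "pair:1" and the key names are illustrative):

import redis

r = redis.Redis(decode_responses=True)

# one hash per tuple, plus a set that indexes the hash IDs
r.hset("pair:1", mapping={"user_id": "foo", "business_id": "bar",
                          "first_seen_ts": 123, "last_seen_ts": 987})
r.sadd("user:foo:business:bar", "pair:1")

data = [r.hgetall(hid) for hid in r.smembers("user:foo:business:bar")]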

Surrogate key from all column hash

I would like to create a surrogate key for a hive table, but one that could be replicated every time the data was put in the table. Other tables would reference this table through the surrogate key, and the table could be regenerated to add more rows, and that association wouldn't be broken. My thought is to basically have a composite key of all columns in the table.
Is it reasonable to concatenate all of my columns and take the md5 hash of that string to use as an easy look-up to that row?
The problems that I see with this solution are:
If the data changes in the rows, the association will still be broken
There is no real guarantee that the hash values are unique (though with my numbers, collisions are very unlikely)
Notes on the data:
The data is partitioned by day, and there are around 100k rows for each day.
There are cases where two rows have the exact same data, and it's fine if they end up with the same key.
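For what it's worth, the idea itself is easy to express; here is a Python sketch (the tab delimiter is my own choice; without an explicit delimiter, plain concatenation makes ('ab','c') and ('a','bc') hash identically). In Hive itself the equivalent would be along the lines of md5(concat_ws(...)), assuming a version that ships the md5 UDF:

import hashlib

def surrogate_key(*columns):
    # join with a delimiter that cannot appear in the data
    joined = "\t".join("" if c is None else str(c) for c in columns)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()

key = surrogate_key("2015-06-01", "some", "row", 42)  # deterministic per row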
You have answered your own question:
There is no real guarantee that the hash values are unique (though with my numbers, collisions are very unlikely)
Keys need to be unique; that's their purpose. If you give me a record's key (be it surrogate or natural) I can find that record. Hashes are not guaranteed to be unique.
You need to go back and ask yourself WHY you want this surrogate key. If it's just for a unique identifier, then use your DB's unique identifier|sequence type and be done with it.
If there is a business requirement (the need to replicate the SK; why?), then go back to that reason and try to come up with a more direct|simple solution for it.
(We tried hashes for type2 change detection - it did not work and we went back to column by column comparisons)
This concerns me:
There are cases that two rows have the exact same data and it's fine if they end up with the same key
If you have 2 records in your database that are exactly the same, then you are missing data, even if it's just a sequence or timestamp, something that can be used to differentiate your records. If you don't have a natural key, you are probably missing something.

Should I use a unique string as a primary key or should I make it as a separate autoincrement INT? [duplicate]

Possible Duplicates:
Surrogate Vs. Natural/Business Keys
When not to use surrogate primary keys?
So, I would prefer having an INT as a primary key with AI (auto-increment). But here is the reason why I am considering making a unique string the primary key: I don't have to query related tables to fetch the primary key, since I will already know it.
For example:
I have a many to many relation:
Customer - Order - Product
Let's say I want to add a new customer and a new order, and I already know what they bought. I have to do a query on the product table to get the INT, but if I have the unique string as a primary key I don't have to do the query (this seems cleaner to me; I am not talking about optimization/run-time speeds, not at all).
If you are not worried about optimization, the main 2 criteria for a primary key are:
Uniqueness
Constancy (it never changes)
E.g. if your problem domain is - and will always be - such that 2 product names are always distinct, AND no product will ever change its name (think Norton Antivirus -> Symantec Antivirus for a simple example of a name change), then you may use the product name as the unique key.
The two MUST be 100% true not only today, but for any foreseeable future lifetime of the database.
Therefore using a numeric ID is highly recommended, as you may not always be able to foresee such things, and changing the DB structure later on to add a product ID is of course orders of magnitude worse than the minor inconvenience of needing to map an ID from the name in your queries.
If you can guarantee that your VARCHAR field is indeed unique and hopefully stable (not changing), then you can definitely use it as a primary key without any conceptual problems.
The only real reasons against using it as your primary key (or even more importantly: your clustering key in SQL Server) are indeed performance-based. A wider and varying size clustering key is suboptimal in many ways, and affects not just your table and its clustering index, but also all non-clustered indices on that table. But if that's none of your concern, again - you'll be fine with a VARCHAR as your primary key.
Er... both are correct in a way
Your logical model and design will use the unique string. This is the natural key.
The actual implementation may use a numeric autonumber column (surrogate key) because of architecture/performance concerns.