Searching on array items in a DynamoDB table - indexing

I need to understand how to search on a DynamoDB attribute that is stored as an array.
When denormalising a table, say for a person who has many email addresses, I would store the email addresses as an array in the person table.
The email address is not part of the sort key, so if I need to search by email address to find the person record, I need to index the email attribute.
Can I create an index on the email address, given that it has a one-to-many relationship with the person record and, as I understand it, is stored as an array in DynamoDB?
Would this secondary index be global or local, assuming I have billions of person records?
If I could create it as either an LSI or a GSI, please explain the pros and cons of each.
Thank you very much!

It's worth getting the terminology right to start with. DynamoDB's supported data types are:
Scalar - String, Number, Binary, Boolean, Null
Document - List, Map
Sets - String Set, Number Set, Binary Set
I think you are suggesting you have an attribute that contains a list of emails. The attribute might look like this
Emails: ["one#email.com", "two#email.com", "three#email.com"]
There are a couple of relevant points about key attributes described here. First, keys must be top-level attributes (they can't be nested in JSON documents). Second, they must be of scalar types (i.e. String, Number or Binary).
As your list of emails is not a scalar type, you cannot use it in a key or index.
Given this schema, you would have to perform a scan and set a FilterExpression on your Emails attribute using the contains function.
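As a minimal sketch of the scan described above, the snippet below only builds the request parameters in the low-level DynamoDB JSON format and mimics the filter locally. The table name Person, the PersonId attribute and the helper names are assumptions made for illustration; only the Emails attribute comes from the answer.

```python
def build_email_scan_params(table_name: str, email: str) -> dict:
    """Build low-level Scan parameters that filter on a list attribute."""
    return {
        "TableName": table_name,  # hypothetical table name
        # contains() is true when the Emails list has an element equal to :email
        "FilterExpression": "contains(Emails, :email)",
        "ExpressionAttributeValues": {":email": {"S": email}},
    }


def matches_filter(item: dict, email: str) -> bool:
    """Local stand-in for the filter: is the email in the item's list?"""
    return email in item.get("Emails", [])


params = build_email_scan_params("Person", "two#email.com")

# A scan reads every item first and applies the filter afterwards.
people = [
    {"PersonId": "p1", "Emails": ["one#email.com", "two#email.com"]},
    {"PersonId": "p2", "Emails": ["three#email.com"]},
]
found = [p["PersonId"] for p in people if matches_filter(p, "two#email.com")]
```

With boto3, a dictionary like this would typically be passed as client.scan(**params). Because the filter runs after the items are read, a scan over billions of person records still reads the whole table.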

Stu's answer has some great information in it, and he is right: you can't use an array itself as a key.
What you CAN sometimes do is concatenate several variables (or an array) into a single string with a known separator (maybe '_', for example), and then use that string as a Sort Key.
I used this concept to create a composite Sort Key that consisted of multiple ISO 8601 date values (DynamoDB has no native date type, so dates are typically stored as ISO 8601 strings in String attributes). I also used several attributes that were not dates but were integers with a fixed character length.
By using the BETWEEN comparison I am able to individually query each of the variables that are concatenated into the Sort Key, or construct a complex query that matches against all of them as a group.
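The composite key idea can be sketched in plain Python. This is an illustration under assumptions, not the author's actual schema: the field names, the four-digit padding and the sample values are made up. It shows why fixed-width components matter, since a BETWEEN on a string key is a lexicographic range comparison.

```python
from datetime import date

SEP = "_"  # the known separator between components


def make_sort_key(day: date, stage: int) -> str:
    """Join an ISO 8601 date and a zero-padded integer into one string."""
    return f"{day.isoformat()}{SEP}{stage:04d}"


def between(keys, low, high):
    """Emulate a BETWEEN condition: an inclusive lexicographic range."""
    return [k for k in sorted(keys) if low <= k <= high]


keys = [
    make_sort_key(date(2020, 2, 1), 12),
    make_sort_key(date(2020, 1, 5), 7),
    make_sort_key(date(2020, 1, 20), 3),
]

# All keys whose leading date component falls in January 2020.
january = between(keys, "2020-01-01", "2020-01-31" + SEP + "9999")
```

Because every component has a fixed width, string order matches the order of the underlying values, which is what makes the range bounds behave predictably.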
In other words a data object could use a Sort Key like this:
email#gmail.com_email#msn.com_email#someotherplace.com
Then you could query that (assuming you knew what the partition key is) with something like this, shown here as SQL-style pseudocode:
SELECT * FROM Users
WHERE User='Bob' AND Emails LIKE '%email#msn.com%'
YOU MUST know the partition key in order to perform a Query no matter what you choose as your Sort Key and no matter how that Sort Key is constructed.
I think the real question you are asking is what should my sort keys and partition keys be? That will depend on exactly which queries you want to make and how frequently each type of query is used.
I have found that I have way more success with DynamoDB if I think about the queries I want to make first, and then go from there.
A word on Secondary Indexes (GSI / LSI)
The issue here is that you still need to 'know' the Partition Key for your secondary data structure. GSI / LSI help you avoid needing to create additional DynamoDB tables for the sole purpose of improving data access.
From Amazon:
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/SecondaryIndexes.html
To me it sounds more like the issue is selecting the Keys.
LSI (Local Secondary Index)
If (for your Query case) you don't know the Partition Key to begin with (as it seems you don't) then a Local Secondary Index won't help, since it has the SAME Partition Key as the base table.
GSI (Global Secondary Index)
A Global Secondary Index could help in that you can have a DIFFERENT Partition Key and Sort Key (presumably a partition key that you could 'know' for this query).
So you could use the Email attribute (perhaps composite) as the Sort Key on your GSI and then something like a service name, or sign-up stage, as your Partition Key. This would let you 'know' what partition that user would be in based on their progress or the service they signed up from (for example).
Key values in a GSI or LSI do not have to be unique (only the base table's primary key does), so a query on an index can return several items for the same key. Keep that in mind!
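A rough sketch of the GSI layout suggested above, expressed as plain request dictionaries in the shape DynamoDB's CreateTable and Query operations expect. The index name and the ServiceName and Email attribute names are hypothetical, and the sketch assumes Email is a top-level String attribute (not a list), since index keys must be scalar.

```python
def email_gsi_definition(index_name: str = "ServiceEmailIndex") -> dict:
    """One entry for the GlobalSecondaryIndexes list of a CreateTable call."""
    return {
        "IndexName": index_name,
        "KeySchema": [
            {"AttributeName": "ServiceName", "KeyType": "HASH"},  # GSI partition key
            {"AttributeName": "Email", "KeyType": "RANGE"},       # GSI sort key
        ],
        # KEYS_ONLY still projects the base table's key, enough to fetch the person
        "Projection": {"ProjectionType": "KEYS_ONLY"},
    }


def email_lookup_params(table_name: str, index_name: str,
                        service: str, email: str) -> dict:
    """Query parameters for looking a person up through the index."""
    return {
        "TableName": table_name,
        "IndexName": index_name,
        "KeyConditionExpression": "ServiceName = :svc AND Email = :email",
        "ExpressionAttributeValues": {
            ":svc": {"S": service},
            ":email": {"S": email},
        },
    }


gsi = email_gsi_definition()
lookup = email_lookup_params("Person", gsi["IndexName"], "newsletter", "email#msn.com")
```

A table in provisioned capacity mode would also need a ProvisionedThroughput entry in the index definition.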

Behavior of a SORT without BY on standard internal tables? Is it safe?

What exactly does the SORT statement without key specification do when run on a standard internal table? As per the documentation:
If no explicit sort key is entered using the addition BY, the internal table itab is sorted by the primary table key. The priority of the sort is based on the order in which the key fields are specified in the table definition. In standard keys, the sort is prioritized according to the order of the key fields in the row type of the table. If the primary table key of a standard table is empty, no sort takes place. If this is known statically, the syntax check produces a warning.
With the primary table key being defined as:
Each internal table has a primary table key that is either a self-defined key or the standard key. For hashed tables, the primary key is a hash key, for sorted tables, the primary key is a sorted key. Both of these table types are key tables for which key access is optimized and the primary key thus has its own administration. The key fields of these tables are write-protected when you access individual rows. Standard tables also have a primary key, but the corresponding access is not optimized, there is no separate key administration, and the key fields are not write-protected.
And for good measure, the standard key is defined as:
Primary table key of an internal table, whose key fields in a structured row type are all table fields with character-like data types and byte-like data types. If the row type contains substructures, these are broken down into elementary components. The standard key for non-structured row types is the entire table row if the row type itself is not a table type. If there are no corresponding table fields, or the row type itself is a table type, the standard key from standard tables is empty or contains no key fields.
All of which mainly just confuses me as I'm not sure if I can really rely on the basic SORT statement to provide a reliable or safe result. Should I really just avoid it in all situations or does it have a purpose if used properly?
By extension, if I want to run a DELETE ADJACENT DUPLICATES FROM itab COMPARING ALL FIELDS, when would it be safe to do so after a simple SORT itab.? Only if I added a key on all fields? Without an explicit key only if I have an internal table with clike and xsequence columns? If I want to execute that DELETE statement, what is the most optimal SORT statement to run on the internal table?
SORT without BY should be avoided in all situations because it "makes the program difficult to understand and possibly unpredictable" (according to the ABAP documentation). I think that if you omit BY, a static check in the Code Inspector raises a warning. You should use SORT itab BY table_line, where table_line is a special name ("pseudo-component") meaning "all fields of the line".
Not your question, but you may also define the internal table with primary and secondary keys, so that you don't need to sort explicitly - DELETE ADJACENT DUPLICATES can be used with any of those keys.
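The reason the sort order matters before deleting adjacent duplicates can be illustrated with a small Python analogy. This is not ABAP and only mimics the semantics under stated assumptions: rows are tuples of a character-like field and a numeric field, and sorting by the first field alone stands in for a standard key that contains only the character-like columns.

```python
from itertools import groupby


def delete_adjacent_duplicates(rows):
    """Keep one row from each run of identical neighbouring rows."""
    return [row for row, _ in groupby(rows)]


rows = [("a", 1), ("a", 2), ("a", 1), ("b", 2), ("a", 2)]

# No sort: identical rows are not neighbours, so nothing is removed.
unsorted_result = delete_adjacent_duplicates(rows)

# Sort by the character-like field only (stable sort): rows that differ
# in the numeric field stay interleaved, so duplicates survive.
partial_result = delete_adjacent_duplicates(sorted(rows, key=lambda r: r[0]))

# Sort by all fields: identical rows become neighbours and collapse.
full_result = delete_adjacent_duplicates(sorted(rows))
```

Only the sort over all fields guarantees that comparing all fields of neighbouring rows removes every duplicate.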
Internal tables can have keys that are either inherited from the structures the itab is based on or specified explicitly. As the documentation says, sort without by sorts by the primary key, and that is safe assuming the internal table is implemented correctly.
I think this feature is designed as a dynamic feature to be used with smart table key design. If done correctly, sort without by lets your program adapt to table key changes in the future (so if your key changes, the sort will change with it). Problems might arise when the key is modified in an odd way.
As a rule of thumb:
The more specific your program code is, the less prone to errors (and safer) it is.
So sort by key_id, key_date will always produce the same sort by those 2 fields.
Dynamic components in an application make it more flexible, but they tend to produce bugs (often hard to notice) when the things they rely on are modified.
So if you take the previous example with 2 key fields, you add 1 in the middle (let's say key_is_active between 2 existing fields), sorting results might change in a way you did not expect.
If you had an algorithm that processes based on date, your algorithm might be broken by that change.
In your particular case with delete adjacent I would follow Sandra Rossi's advice.

Composite Primary Key equivalent in Redis

I'm new to nosql databases so forgive my sql mentality but I'm looking to store data that can be 'queried' by one of 2 keys. Here's the structure:
{user_id, business_id, last_seen_ts, first_seen_ts}
where if this were a sql DB I'd use the user_id and business_id as a primary composite key. The sort of querying I'm looking for is a
1. 'get all where business_id = x'
2. 'get all where user_id = x'
Any tips? I don't think I can make a simple secondary index based on the 2 retrieval types above. I looked into commands like 'zadd' and 'zrange' but there isn't really any sorting involved here.
The use case for Redis for me is to alleviate writes and reads on my SQL database while this program computes (doing its storage in redis) what eventually will be written to the SQL DB.
Note: given the OP's self-proclaimed experience, this answer is intentionally simplified for educational purposes.
(One of) the first thing(s) you need to understand about Redis is that you design the data so every query will be what you're used to thinking about as access by primary key. It is convenient, in that sense, to imagine Redis' keyspace (the global dictionary) as something like this relational table:
CREATE TABLE redis (
key VARCHAR(512MB) NOT NULL,
value VARCHAR(512MB),
PRIMARY KEY (key)
);
Note: in Redis, value can be more than just a String of course.
Keeping that in mind, and unlike other database models where normalizing data is the practice, you want to have your Redis ready to handle both of your queries efficiently. That means you'll be saving the data twice: once under a primary key that allows searching for businesses by id, and another time that allows querying by user id.
To answer the first query ("'get all where business_id = x'"), you want to have a key for each x that holds the relevant data (in Redis we use the colon, ':', as separator as a matter of convention) - so for x=1 you'd probably call your key business:1, for x=a1b2c3 business:a1b2c3 and so forth.
Each such business:x key could be a Redis Set, where each member represents the rest of the tuple. So, if the data is something like:
{user_id: foo, business_id: bar, last_seen_ts: 987, first_seen_ts: 123}
You'd be storing it with Redis with something like:
SADD business:bar foo
Note: you can use any serialization you want, Set members are just Strings.
With this in place, answering the first query is just a matter of SMEMBERS business:bar (or SSCANing it for larger Sets).
If you've followed through, you already know how to serve the second query. First, use a Set for each user (e.g. user:foo) to which you SADD user:foo bar. Then SMEMBERS/SSCAN and you're almost home.
The last thing you'll need is another set of keys, but this time you can use Hashes. Each such Hash will store the additional information of the tuple, namely the timestamps. We can use a "Primary Key" made up of the business and the user ids (or vice versa) like so:
HMSET foo:bar first 123 last 987
After you've gotten the results from the 1st or 2nd query, you can fetch the contents of the relevant Hashes to complete the query (assuming that the queries return the timestamps as well).
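To make the layout concrete, here is an in-memory Python stand-in for the three kinds of keys described above. It does not talk to a Redis server; the dictionaries merely imitate Redis Sets and Hashes, and the function names are hypothetical.

```python
from collections import defaultdict

sets = defaultdict(set)  # stands in for Redis Sets, keyed by key name
hashes = {}              # stands in for Redis Hashes, keyed by key name


def record_visit(user_id, business_id, first_seen_ts, last_seen_ts):
    """Store one tuple three times, once per access path."""
    sets[f"business:{business_id}"].add(user_id)  # like SADD business:<id> <user>
    sets[f"user:{user_id}"].add(business_id)      # like SADD user:<id> <business>
    hashes[f"{user_id}:{business_id}"] = {        # like HMSET <user>:<business> ...
        "first": first_seen_ts,
        "last": last_seen_ts,
    }


def visits_for_business(business_id):
    """'get all where business_id = x': read the set, then each hash."""
    users = sets.get(f"business:{business_id}", set())
    return {u: hashes[f"{u}:{business_id}"] for u in users}


def visits_for_user(user_id):
    """'get all where user_id = x': read the set, then each hash."""
    businesses = sets.get(f"user:{user_id}", set())
    return {b: hashes[f"{user_id}:{b}"] for b in businesses}


record_visit("foo", "bar", 123, 987)
record_visit("baz", "bar", 200, 300)
```

Every write touches two sets and one hash, which is the price of having both queries served by a direct key lookup.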
The idiomatic way of doing this in Redis is to use a SET for each type of query you want to do.
In your case you would create:
a hash for each tuple (user_id, business_id, last_seen_ts, first_seen_ts)
a set with a name like user:<user_id>:business:<business_id>, to store the keys of the hashes for this user and this business (you have to add the ID of the hashes with SADD)
Then, to get all data for a given user and business, you first read the set's content with SMEMBERS and then fetch every hash whose ID is in the set (for example with HGETALL).

Hash lookup table primary key

I have to populate a database with a set of $string,md5($string) CSV files, essentially a hash lookup table.
My question is:
should I use the string as Primary key? The hash? Add an extra ID column?
I think the hash would be good since that's what I'll be asking the database for, but hashes can collide. Strings should be unique anyway (to save space), but I wanted a second opinion on it.
I'm asking with performance in mind, considering it will be populated with at least 35GB of data. So really, any suggestions are appreciated.
If the string is going to be used for foreign key references, then I would not (necessarily) recommend hashing. You can:
Create a serial (auto-incremented) id column as the primary key.
Create a unique index on name.
This should facilitate lookups in the table as well as verifying that name is unique. It is better to use fixed-length numbers for foreign key references than variable length strings.
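A small sketch of that layout using Python's built-in sqlite3 module. It rests on a few assumptions: SQLite's INTEGER PRIMARY KEY stands in for a serial column, the table and index names are made up, and the extra non-unique index on the hash column is an addition so that lookups by hash stay fast while collisions remain possible.

```python
import hashlib
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE lookup (
    id   INTEGER PRIMARY KEY,  -- auto-assigned surrogate key
    name TEXT NOT NULL,        -- the original string
    md5  TEXT NOT NULL         -- hex digest of the string
);
CREATE UNIQUE INDEX lookup_name_uq ON lookup (name);
CREATE INDEX lookup_md5_idx ON lookup (md5);  -- not unique: hashes can collide
""")


def add_string(value: str) -> None:
    """Insert a string with its MD5 digest, skipping strings already stored."""
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    conn.execute(
        "INSERT OR IGNORE INTO lookup (name, md5) VALUES (?, ?)",
        (value, digest),
    )


def strings_for_hash(digest: str) -> list:
    """Return every stored string with the given digest."""
    rows = conn.execute(
        "SELECT name FROM lookup WHERE md5 = ? ORDER BY id", (digest,)
    )
    return [row[0] for row in rows]


for word in ["hello", "world", "hello"]:
    add_string(word)
```

The unique index rejects the repeated string, while the plain index on md5 serves the lookup direction the question cares about.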
If you use a hash value and really do not want duplicates, then you would need some mechanism for distinguishing between different strings with the same hash value. A natural choice would be some sort of incremental counter -- but that would leave you pretty close to the solution with just the counter and no hash. I don't, per se, see the advantage of storing such a hash value in the table.
I ended up using a SERIAL id field, so I could count how many entries I had.
The initial problem arose because I thought you could only index columns with PRIMARY KEY.
So the problem is solved now: I just indexed properly and performance is great!

DynamoDB: When to use what PK type?

I am trying to read up on best practices on DynamoDB. I saw that DynamoDB has two PK types:
Hash Key
Hash and Range Key
From what I read, it appears the latter is like the former but supports sorting and indexing of a finite set of columns.
So my question is why ever use only a hash key without a range key? Is it a viable choice only when the table is not searched?
It'd also be great to have some general guidelines on when to use what key type. I've read several guides (including Amazon's own documentation on DynamoDB) but none of them appear to directly address this question.
Thanks
The choice of which key to use comes down to your Use Cases and Data Requirements for a particular scenario. For example, if you are storing User Session Data it might not make much sense using the Range Key since each record could be referenced by a GUID and accessed directly with no grouping requirements. In general terms once you know the Session Id you just get the specific item querying by the key. Another example could be storing User Account or Profile data, each user has his own and you most likely will access it directly (by User Id or something else).
However, if you are storing Order Items then the Range Key makes much more sense since you probably want to retrieve the items grouped by their Order.
In terms of the Data Model, the Hash Key allows you to uniquely identify a record from your table, and the Range Key can be optionally used to group and sort several records that are usually retrieved together. Example: If you are defining an Aggregate to store Order Items, the Order Id could be your Hash Key, and the OrderItemId the Range Key. Whenever you would like to search the Order Items from a particular Order, you just query by the Hash Key (Order Id), and you will get all your order items.
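The two key types can be contrasted with a sketch of the key schemas and request parameters involved, written as plain dictionaries in the shape the DynamoDB API expects. The table names and the SessionId attribute are hypothetical; OrderId and OrderItemId follow the example in the text.

```python
# Hash-only key: each session is fetched directly by its id.
SESSION_KEY_SCHEMA = [
    {"AttributeName": "SessionId", "KeyType": "HASH"},
]

# Hash and range key: items are grouped by order and sorted by item id.
ORDER_ITEMS_KEY_SCHEMA = [
    {"AttributeName": "OrderId", "KeyType": "HASH"},
    {"AttributeName": "OrderItemId", "KeyType": "RANGE"},
]


def session_get_params(session_id: str) -> dict:
    """GetItem parameters: the full primary key is just the hash key."""
    return {
        "TableName": "Sessions",
        "Key": {"SessionId": {"S": session_id}},
    }


def order_items_query_params(order_id: str) -> dict:
    """Query parameters: match the hash key, return every item of the order."""
    return {
        "TableName": "OrderItems",
        "KeyConditionExpression": "OrderId = :oid",
        "ExpressionAttributeValues": {":oid": {"S": order_id}},
    }


get_params = session_get_params("session-1")
query_params = order_items_query_params("order-42")
```

The first table only ever needs single-item reads, so a range key adds nothing; the second relies on the range key to return a whole group in one query.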
You can find below a formal definition for the use of these two keys:
"Composite Hash Key with Range Key allows the developer to create a
primary key that is the composite of two attributes, a 'hash
attribute' and a 'range attribute.' When querying against a composite
key, the hash attribute needs to be uniquely matched but a range
operation can be specified for the range attribute: e.g. all orders
from Werner in the past 24 hours, or all games played by an individual
player in the past 24 hours." [VOGELS]
So the Range Key adds a grouping capability to the Data Model. However, the use of these two keys also has implications for the Storage Model:
"Dynamo uses consistent hashing to partition its key space across its
replicas and to ensure uniform load distribution. A uniform key
distribution can help us achieve uniform load distribution assuming
the access distribution of keys is not highly skewed."
[DDB-SOSP2007]
The Hash Key not only identifies the record, it is also the mechanism that ensures load distribution. The Range Key (when used) indicates which records will mostly be retrieved together, so the storage can also be optimized for that need.
Choosing the correct keys to represent your data is one of the most critical aspects of your design process, and it directly impacts how well your application will perform and scale, and how much it will cost.
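The interplay of the two keys can be mimicked with a toy in-memory table. This is only an illustration: the partition count, the use of MD5 and the class itself are assumptions and say nothing about how DynamoDB is actually implemented. The hash key picks a partition, and rows sharing a hash key are kept sorted by range key so a range can be read together.

```python
import bisect
import hashlib


class ToyTable:
    def __init__(self, partitions: int = 4):
        # Each partition maps a hash key to a list of (range_key, item) pairs.
        self.partitions = [dict() for _ in range(partitions)]

    def _partition_for(self, hash_key: str) -> dict:
        """Pick a partition from a digest of the hash key."""
        digest = hashlib.md5(hash_key.encode("utf-8")).digest()
        index = int.from_bytes(digest[:4], "big") % len(self.partitions)
        return self.partitions[index]

    def put(self, hash_key: str, range_key: str, item) -> None:
        """Insert or overwrite an item, keeping rows sorted by range key."""
        rows = self._partition_for(hash_key).setdefault(hash_key, [])
        pos = bisect.bisect_left([rk for rk, _ in rows], range_key)
        if pos < len(rows) and rows[pos][0] == range_key:
            rows[pos] = (range_key, item)
        else:
            rows.insert(pos, (range_key, item))

    def query(self, hash_key: str, low=None, high=None) -> list:
        """Return items for one hash key, optionally limited to a key range."""
        rows = self._partition_for(hash_key).get(hash_key, [])
        return [
            item for rk, item in rows
            if (low is None or rk >= low) and (high is None or rk <= high)
        ]


table = ToyTable()
table.put("werner", "2024-05-02T10:00", "order-b")
table.put("werner", "2024-05-01T09:00", "order-a")
table.put("werner", "2024-05-03T08:00", "order-c")
table.put("alice", "2024-05-02T12:00", "order-x")
recent = table.query("werner", low="2024-05-02T00:00")
```

The query never looks at other hash keys, which is why the hash attribute has to be matched exactly while the range attribute can take a range condition.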
Footnotes:
The Data Model is the model through which we perceive and manipulate our data. It describes how we interact with the data in the database [FOWLER]. In other words, it is how you abstract your data model, the way you group your entities, the attributes that you choose as primary keys, etc
The Storage Model describes how the database stores and manipulates the data internally [FOWLER]. Although you cannot control this directly, you can certainly optimize how the data is retrieved or written by knowing how the database works internally.

SQL database design: storing the type of a row

I am designing a database to contain a table reference, with a column type that is one of several predefined values (e.g., book, movie, magazine, etc.). I intend the range of possible values to expand over time (e.g. if I realize that I missed the academic_paper type, I want to be able to put that in).
The easiest solution would seem to be to simply store a string representing the type into the table. But this sounds like it would result in a lot of wasted space.
The other solution I thought of is creating a new table reference_types, which the type column references through a foreign key. This seems to have the added benefit of ensuring valid foreign keys (so that I won't accidentally mistype a "magzine" somewhere in my code) and possibly allowing faster queries for all media of a certain type (since integer comparisons should be much faster than string comparisons). However, it would also slow my application down a bit, as joins would be required whenever I need the reference type, and it would probably complicate the logic because of those extra joins.
What are your thoughts on schema design for this problem?
Your second solution is the correct one. Create a secondary table to store your reference types and link them using a foreign key.
For further reading on this subject the search term you'd want to use is 'database normalisation'.
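A minimal sketch of the lookup-table design using Python's built-in sqlite3 module. The column names, helper functions and sample rows are assumptions for illustration; the table names reference and reference_types come from the question. Note that SQLite only enforces foreign keys after PRAGMA foreign_keys = ON.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves enforcement off by default
conn.executescript("""
CREATE TABLE reference_types (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
CREATE TABLE "reference" (
    id      INTEGER PRIMARY KEY,
    title   TEXT NOT NULL,
    type_id INTEGER NOT NULL REFERENCES reference_types (id)
);
""")


def add_type(name: str) -> None:
    """Adding a new type later is a single insert, no schema change."""
    conn.execute("INSERT INTO reference_types (name) VALUES (?)", (name,))


def add_reference(title: str, type_name: str) -> None:
    """Resolve the type name to its id, rejecting unknown (mistyped) names."""
    row = conn.execute(
        "SELECT id FROM reference_types WHERE name = ?", (type_name,)
    ).fetchone()
    if row is None:
        raise ValueError(f"unknown reference type: {type_name}")
    conn.execute(
        'INSERT INTO "reference" (title, type_id) VALUES (?, ?)',
        (title, row[0]),
    )


def titles_of_type(type_name: str) -> list:
    """The join needed whenever the type name is used as a filter."""
    rows = conn.execute(
        'SELECT r.title FROM "reference" AS r '
        "JOIN reference_types AS t ON t.id = r.type_id "
        "WHERE t.name = ? ORDER BY r.id",
        (type_name,),
    )
    return [row[0] for row in rows]


for type_name in ["book", "movie", "magazine"]:
    add_type(type_name)
add_reference("Dune", "book")
add_type("academic_paper")
add_reference("Some paper", "academic_paper")
```

The join is a single indexed lookup on a tiny table, so its cost is usually small compared with the integrity check it buys.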
Create the reference_types table. In your references table, use an integer key and also add a reference_type_name field.
You can query the references table to get the integer key and print its name when needed without performing a join to the other table, and still use that table to perform other operations; just keep both tables in sync with equal type names.
I know it sounds redundant, but it's really the fastest way to do a simple query by int key and have it all together.
It depends. If you will want to add some other information to reference types, then use the second approach. If not, use the first one because it's faster and the information stored is only a string (you can always select unique to retrieve your types). Read this article for more info.