Query without partition key in DynamoDB - indexing

I'm designing a table in DynamoDB which will contain a large number of records, each with a unique ID and a timestamp. I will need to retrieve a set of records that fall between two dates, irrespective of all other property values.
Adding a global secondary index for the timestamp field seems like a logical solution; however, this isn't straightforward.
The Query command in DynamoDB requires a KeyConditionExpression parameter, which determines which results are returned by the query. From the DynamoDB developer guide:
To specify the search criteria, you use a key condition expression—a string that determines the items to be read from the table or index. You must specify the partition key name and value as an equality condition. You can optionally provide a second condition for the sort key (if present).
Since the partition key must be specified exactly, it cannot be used for querying a range of values. A range could be specified for the sort key, but only in addition to an exact match on the partition key. Hence, the only way I can see this working is to add a dummy field for the index partition key, where every record has the same value, then perform the query on the timestamp as the sort key. This seems hacky, and, presumably, is not how it's intended to be used.
Some answers to similar questions suggest using the Scan command instead of Query; however, this would be very inefficient when fetching a small number of records from a very large table.
Is it possible to efficiently query the table to get all records where a condition matches the timestamp field only?

How big of a range are you dealing with?
You could, for instance, have a GSI partition key of YYYY, or YYYY-MM, or YYYY-MM-DD. Your sort key could then be the remainder of the timestamp.
You may need to make multiple queries, if for instance the amount of data necessitates daily partitions and you want to show 7 days at a time.
Also be sure to read the best practices for time series data section of the developer guide.
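As a rough sketch of that pattern in Python with boto3 (the table name Events, index name ByDay, attribute names day and ts, and the choice of daily bucketing are all assumptions, not anything from the question):

```python
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("Events")

def events_between(day, start_ts, end_ts):
    """Fetch one daily bucket of the GSI for a timestamp range, paging as needed."""
    items, start_key = [], None
    while True:
        kwargs = {
            # GSI "ByDay": partition key "day" (YYYY-MM-DD), sort key "ts"
            "IndexName": "ByDay",
            "KeyConditionExpression": Key("day").eq(day)
                                      & Key("ts").between(start_ts, end_ts),
        }
        if start_key:
            kwargs["ExclusiveStartKey"] = start_key
        resp = table.query(**kwargs)
        items.extend(resp["Items"])
        start_key = resp.get("LastEvaluatedKey")
        if not start_key:
            return items

# A 7-day window means one Query per daily bucket:
week = [f"2023-05-{d:02d}" for d in range(1, 8)]
results = [item for day in week
           for item in events_between(day, "00:00:00", "23:59:59")]
```

The per-day fan-out is the cost of keeping writes spread across partitions instead of funnelled through a single dummy key.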

Related

What is the difference between partitioning and indexing in a DB? (Performance-wise)

I am new to SQL and have been trying to optimize the query performance of my microservices against my DB (Oracle SQL).
Based on my research, I found that you can use indexing and partitioning to improve query performance. I think I understand each concept and how to apply it, but I'm not sure about the difference between the two.
For example, suppose I have a table Orders with 100 million entries and the following columns:
OrderId (PK)
CustomerId (6-digit unique number)
Order (what the order is, e.g. "Burger")
CreatedTime (timestamp)
In essence, both methods "subdivide" the Orders table so that a DB query won't need to scan through all 100 million entries, right?
Let's say I want to find orders on "2020-01-30": I can create an index on CreatedTime to improve performance.
But I can also create a partition based on CreatedTime to improve performance (one partition per day).
Is there any difference between the two methods in this case? Is one better than the other?
There are several ways to partition - by range, by list, by hash, and by reference (although that tends to be uncommon).
If you were to partition by a date column, it would usually be by range, so one month/week/day goes in one partition, the next in another, and so on. If you want to filter rows where this date equals a value, you can read the whole partition that houses this data with a full scan. That can be quite efficient when most of the data in the partition matches your filter: apply the same thinking as for a full table scan in general, just with the table already filtered down. But if you want an hour-long date range and you partition by month, you will read roughly 730 times more data than necessary (a month is about 730 hours). Local indexes are useful in that scenario.
Smaller partitions help here too, but you can end up with thousands of partitions. If you have selective queries that can't identify which partition needs to be read, you may want global indexes; these add a lot of overhead to all partition maintenance operations.
If you index the date column instead, you can quickly establish the location of the rows in your table that match your filter. This is easy in an index because it's essentially a sorted list: you find the first key that matches the filter and read until it no longer matches. You then have to look up those rows using single-block reads. The usual efficiency rules of an index apply: the less data your filters leave to be read, the more useful the index will be.
Usually, queries include more than just a date filter. These additional filters might be more appropriate for your partitioning scheme. You could also simply include those columns in your index (remembering the golden rule of indexing: columns used with equality filters should come before columns used with range filters).
You can generally get all the performance you need with just indexing. Partitioning really comes into play when you have important queries that need to read huge chunks of data (generally reporting queries) or when you need to do things like purge data older than X months.
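For concreteness, here is a hedged sketch of the two options on the question's Orders table, assuming Oracle syntax and the python-oracledb driver; the connection details and the daily interval are illustrative, and the Order column is renamed order_item because ORDER is a reserved word:

```python
# A sketch only; all names and credentials are placeholders.
import oracledb

# Option 1: on an ordinary (non-partitioned) table, a plain B-tree index
# on the date column is usually all you need:
#   CREATE INDEX orders_created_ix ON orders (created_time)

# Option 2: daily range (interval) partitions, so a one-day filter scans
# one partition instead of all 100 million rows:
CREATE_PARTITIONED = """
CREATE TABLE orders (
  order_id     NUMBER        PRIMARY KEY,
  customer_id  NUMBER(6)     NOT NULL,
  order_item   VARCHAR2(100),
  created_time TIMESTAMP     NOT NULL
)
PARTITION BY RANGE (created_time)
INTERVAL (NUMTODSINTERVAL(1, 'DAY'))
(PARTITION p0 VALUES LESS THAN (TIMESTAMP '2020-01-01 00:00:00'))
"""

with oracledb.connect(user="app", password="...", dsn="localhost/orclpdb1") as conn:
    cur = conn.cursor()
    cur.execute(CREATE_PARTITIONED)
    # The two approaches combine: a LOCAL index partitions along with the
    # table and helps selective filters within a single partition.
    cur.execute("CREATE INDEX orders_created_lix ON orders (created_time) LOCAL")
```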

Are indexes on columns where every value is different worth it?

Does creating an index on a column that will always have a different value in each record (like a unique column) improve performance on SELECTs?
I understand that having an index on a column named e.g. status, which can take 3 values (such as PENDING, DONE, FAILED), will make searching for only the FAILED rows among 1 million records faster.
But what happens if I have a unique id (not the primary key) across 1 million records, and I'm doing a SELECT on that column?
An index on a unique column is actually better than an index on a column with a few values.
To understand why, you need a basic understanding of how databases manage storage. This is a high-level view.
The primary purpose of an index is to reduce the number of pages that need to be read for a query. The rows themselves are stored on data pages. If you don't have an index, then all the data needs to be read.
The index is a data structure that makes it efficient to find a particular value. You can think of it as a sorted list, where a binary search is used to identify the right location. In actual fact, these are usually stored in a structure called a B-tree (where the "B" is commonly said to stand for "balanced", not "binary"), but that is an implementation detail. And there are types of indexes that don't use B-trees.
So, if the values are unique, then an index is extremely helpful. Instead of doing a full table scan, the "row id" can efficiently be looked up in the index and then only one data page needs to be read.
Note that unique constraints are implemented using indexes. If you have declared a column to be unique, there is no need for an additional index because it is already there.
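You can watch this happen with a tiny experiment; the sketch below uses Python's built-in sqlite3 purely for illustration (the table, column names, and data are made up):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, ext_id TEXT, status TEXT)")
con.executemany("INSERT INTO t (ext_id, status) VALUES (?, ?)",
                [(f"u{i:07d}", "PENDING") for i in range(100_000)])

# Declaring the column unique via a unique index is enough; no further
# index is needed on top of it.
con.execute("CREATE UNIQUE INDEX t_ext_id ON t (ext_id)")

# The planner confirms a point lookup via the index rather than a scan:
for row in con.execute(
        "EXPLAIN QUERY PLAN SELECT * FROM t WHERE ext_id = ?", ("u0042000",)):
    print(row)  # ... SEARCH t USING INDEX t_ext_id (ext_id=?)
```

Dropping the index and re-running the EXPLAIN flips the plan to SCAN t, i.e. a full pass over all rows instead of reading a single data page.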

Cursor select on 2nd primary key field in DB2 SQL for an embedded positional value - unable to determine the most efficient long-term design

I have a DB2 SQL table where the first two fields form the primary key (not counting the third field, which is a date/time stamp). The table was designed by another team with the intent of making it generic, and I was brought into the project after the key value for the second field was already being coded at insert time.
This leads me to the problem: we now have to do a cursor select with a WHERE clause that includes the first primary key field, and for the second key field the condition must match only when a specific value appears at position 21 for 8 bytes. (We will always know what that value will be.) The second field is a generic 70-byte alphanumeric field.
My question is: should the WHERE clause condition on the second key field use a LIKE wildcard, or a SUBSTR, since we know the position of the value? I ask because I have run an EXPLAIN, yet I do not see a difference between the two (neither does my database analyst). This is for a few million records in a table with 1300-byte rows. My concern is that the volume of data on the table will grow on various systems, so performance may become an issue. Right now it is hard to measure the difference between LIKE and SUBSTR, but I would like to do my due diligence and code this for long-term performance. And if there is a third option, please let me know.
The CPU difference between a SUBSTR and a LIKE will be minimal. What might be more important is the filter factor estimate in the explain. You might want to use a statistical view over the table to help the optimizer here. In that case the SUBSTR operator will be better as you would include that as a column in the stats view.
Alternatively, a generated column based on the SUBSTR would also provide the better stats, and you could include the column in an index too.
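A hedged sketch of that generated-column idea, assuming Db2 for LUW syntax; the table and column names are hypothetical, the position 21/length 8 offsets come from the question, and the statements are only printed for review rather than executed:

```python
# Illustrative Db2 statements only; run them through your usual client
# (e.g. the ibm_db driver or the CLP). All names are hypothetical.
STATEMENTS = [
    # Materialize the embedded 8-byte code as its own column...
    """ALTER TABLE generic_tab
         ADD COLUMN key2_code CHAR(8)
         GENERATED ALWAYS AS (SUBSTR(key2, 21, 8))""",
    # ...adding a generated column can leave the table in set integrity
    # pending state, so validate it:
    "SET INTEGRITY FOR generic_tab IMMEDIATE CHECKED FORCE GENERATED",
    # Now the column can carry its own statistics and live in an index:
    "CREATE INDEX generic_tab_ix1 ON generic_tab (key1, key2_code)",
]

# The cursor's WHERE clause can then use the column directly, instead of
# a LIKE with 20 leading wildcards or a SUBSTR evaluated per row:
CURSOR_QUERY = """
SELECT key1, key2, change_ts
FROM generic_tab
WHERE key1 = ? AND key2_code = ?
"""

for stmt in STATEMENTS:
    print(stmt.strip(), end=";\n\n")
```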

AWSDynamoDBQueryExpression order by UNIX timestamp

I am new to NoSQL databases, and I have changed my database schema from storing dates as a UTC timestamp string to a UNIX timestamp (number), in the hope that I can create either a scan or a query expression to find the 1000 most recent items in the table. I have yet to find a simple snippet of code that accomplishes this using the AWSDynamoDBQueryExpression class. Scan doesn't appear to have any sort mechanism, but Query might. Any ideas?
There is no general ORDER BY functionality in DynamoDB. If you want to run a top-N query across the whole table, you'll have to perform a scan and then order the results yourself.
Mark B is right that query results can be ordered by the sort key, but that only applies within the context of a query, and queries are inherently limited to a single partition key.
If your table is small, you can get away with creating a Global Secondary Index in which the partition key is an attribute that is the same for all items, and then use the timestamp attribute as the sort key. But keep in mind that this will break down once your table gets bigger, and if you're doing that you might as well not be using Dynamo; you'd be better off with a relational database on RDS.
First you need to make sure the timestamp field is the sort key for your DynamoDB table (or the sort key for a Global Secondary Index on that table). Then you just need to run a query. From the documentation:
Query results are always sorted by the sort key value. If the data type of the sort key is Number, the results are returned in numeric order; otherwise, the results are returned in order of UTF-8 bytes. By default, the sort order is ascending. To reverse the order, set the ScanIndexForward parameter to false.
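In boto3 terms (rather than the iOS SDK's AWSDynamoDBQueryExpression that the question mentions), and assuming a GSI named ByBucket whose partition key bucket holds a constant value and whose sort key ts is the UNIX timestamp (all of these names are assumptions), the "1000 most recent" query is a sketch like this:

```python
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("Items")  # table name is illustrative

resp = table.query(
    IndexName="ByBucket",                           # GSI: pk "bucket", sk "ts"
    KeyConditionExpression=Key("bucket").eq("all"),
    ScanIndexForward=False,  # descending sort-key order: newest first
    Limit=1000,              # stop after the 1000 most recent items
)                            # (subject to the 1 MB page size; continue with
newest = resp["Items"]       #  LastEvaluatedKey if a page comes up short)
```

The mobile SDKs expose the same Query parameters, so the shape of the request carries over.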

How is Redshift's SORTKEY different from a clustered index?

Redshift has a feature called the sort key (SORTKEY), a column you can specify. This ensures that the data remains in this sorted order.
How is this any different from a clustered index? It appears to do the same thing.
Amazon Redshift does not support indexes, so calling it an index would be misleading.
Rather, data is physically stored in the order requested. This has the benefit of enabling zone maps, which record the range of values stored in a given block. For example, if data is sorted by date, each zone map would identify the earliest and latest dates stored in its block. This lets Redshift skip blocks that cannot contain relevant data.
SORTKEYs can also include multiple columns, and there are even interleaved sort keys, which give equal weight to each column in the key rather than prioritizing the first, as a way of serving two different sort orders efficiently.
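As a hedged illustration of both variants, connecting over the PostgreSQL wire protocol with psycopg2 (the cluster endpoint, credentials, and table definitions are placeholders):

```python
import psycopg2

conn = psycopg2.connect(host="example-cluster.redshift.amazonaws.com",
                        port=5439, dbname="analytics",
                        user="admin", password="...")
with conn, conn.cursor() as cur:
    # Compound sort key: blocks are laid out in event_ts order, so zone maps
    # (per-block min/max values) let Redshift skip blocks outside a date range.
    cur.execute("""
        CREATE TABLE events (
            event_id BIGINT,
            event_ts TIMESTAMP,
            payload  VARCHAR(256)
        )
        COMPOUND SORTKEY (event_ts)
    """)
    # Interleaved sort key: each column gets equal weight, useful when
    # queries filter on either column independently.
    cur.execute("""
        CREATE TABLE events_by_either (
            event_id BIGINT,
            event_ts TIMESTAMP,
            payload  VARCHAR(256)
        )
        INTERLEAVED SORTKEY (event_id, event_ts)
    """)
```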