How is Redshift's SORTKEY different from a clustered index?

Redshift has a feature called the SORTKEY: one or more columns you can specify, and the data is then physically stored in that sorted order.
How is this any different from a clustered index? It seems to do the same thing.

Amazon Redshift does not support indexes. So, calling it an index would be misleading.
Rather, data is physically stored in the order requested. This has the benefit of enabling zone maps, which identify the range of data stored in a given block. For example, if data is sorted by date, each zone map would identify the earliest and latest dates stored in that zone. This helps Redshift ignore blocks that do not contain relevant data.
SORTKEYs can also include multiple columns, and even interleaved sorts -- a method that gives equal weight to each sort column, so queries filtering on any one of them can still skip blocks efficiently.
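For illustration, here is a minimal sketch of both kinds of sort key; the table and column names are made up:

-- Compound sort key: rows are ordered by sale_date first, then customer_id.
-- Zone maps on sale_date make date-range filters very cheap.
CREATE TABLE sales (
    sale_id     BIGINT,
    customer_id BIGINT,
    sale_date   DATE,
    amount      DECIMAL(10,2)
)
COMPOUND SORTKEY (sale_date, customer_id);

-- Interleaved sort key: gives equal weight to each column, so queries
-- filtering on either sale_date or customer_id alone can skip blocks.
CREATE TABLE sales_interleaved (
    sale_id     BIGINT,
    customer_id BIGINT,
    sale_date   DATE,
    amount      DECIMAL(10,2)
)
INTERLEAVED SORTKEY (sale_date, customer_id);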

Related

Query without partition key in DynamoDB

I'm designing a table in DynamoDB which will contain a large number of records, each with a unique ID and a timestamp. I will need to retrieve a set of records that fall between two dates, irrespective of all other property values.
Adding a global secondary index for the timestamp field seems like a logical solution, however this isn't straightforward.
The Query command in DynamoDB requires a KeyConditionExpression parameter, which determines which results are returned by the query. From the DynamoDB developer guide:
To specify the search criteria, you use a key condition expression—a string that determines the items to be read from the table or index. You must specify the partition key name and value as an equality condition. You can optionally provide a second condition for the sort key (if present).
Since the partition key must be specified exactly, it cannot be used for querying a range of values. A range could be specified for the sort key, but only in addition to an exact match on the partition key. Hence, the only way I can see this working is to add a dummy field for the index partition key, where every record has the same value, then perform the query on the timestamp as the sort key. This seems hacky, and, presumably, is not how it's intended to be used.
Some answers to similar questions suggest using the Scan command instead of Query, however this would be very inefficient when fetching a small number of records from a very large table.
Is it possible to efficiently query the table to get all records where a condition matches the timestamp field only?
How big of a range are you dealing with?
You could, for instance, have a GSI partition key of YYYY, YYYY-MM, or YYYY-MM-DD, with the remainder of the timestamp as the sort key.
You may need to make multiple queries, if for instance the amount of data necessitates daily partitions and you want to show 7 days at a time.
Also be sure to read the best practices for time-scale data part of the developer guide.
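As a sketch, suppose the data warrants daily partitions: a GSI whose partition key is a YYYY-MM-DD string and whose sort key is the rest of the UNIX timestamp. Using DynamoDB's SQL-compatible PartiQL interface (the table, index, and attribute names here are made up), fetching one day of a 7-day window might look like this, repeated once per day:

SELECT *
FROM "Records"."ByDay"
WHERE "day" = '2020-01-30'
  AND "ts" BETWEEN 1580342400 AND 1580428800;

Each statement reads only the one partition for that day, so the cost scales with the result set rather than the table size.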

What is the difference between partitioning and indexing in a DB? (performance-wise)

I am new to SQL and have been trying to optimize the query performance of my microservices against my DB (Oracle).
Based on my research, I found that you can use indexing and partitioning to improve query performance. I think I understand each concept and how to apply it, but I'm not sure about the difference between the two.
For example, suppose I have table Orders, with 100 million entries with
Columns:
OrderId (PK)
CustomerId (6 digit unique number)
Order (what the order is. Ex: "Burger")
CreatedTime (Timestamp)
In essence, both methods "subdivide" the Orders table so that a DB query won't need to scan through all 100 million entries, right?
Let's say I want to find orders on "2020-01-30": I can create an index on CreatedTime to improve performance.
But I can also create a partition based on CreatedTime to improve performance (one partition per day).
Is there any difference between the two methods in this case? Is one better than the other?
There are several ways to partition - by range, by list, by hash, and by reference (although that tends to be uncommon).
If you were to partition by a date column, it would usually be by range, so one month/week/day uses one partition, the next uses another, and so on. If you want to filter rows where this date equals a value, you can do a full scan of just the partition that houses that data. This can end up being quite efficient if most of the data in the partition matches your filter: apply the same thinking as for a full table scan in general, but with the data already filtered down.
If you wanted to look for an hour-long date range and you're partitioning by range with monthly intervals, then you're going to read about 730 times more data than necessary (a month holds roughly 730 hours). Local indexes are useful in that scenario.
Smaller partitions also help with this, but you can end up with thousands of partitions. If you have selective queries that don't know which partition needs to be read, you may want global indexes; these add a lot of effort to all partition maintenance operations.
If you index the date column instead, you can quickly establish the location of the rows in your table that meet your filter. This is easy in an index because it's just a sorted list: you find the first key in the index that matches the filter and read until it no longer matches. You then have to look up those rows using single-block reads. The usual efficiency rules of an index apply: the less data you need to read with your filters, the more useful the index will be.
Usually, queries include more than just a date filter, and these additional filters might be more appropriate for your partitioning scheme. You could also just include the columns in your index, remembering the golden rule of indexing: a column used with equality filters should go before columns you use range filters on (see the sketch below).
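For instance, a minimal sketch of that rule applied to the Orders example (the index name is made up): filtering CustomerId with equality and CreatedTime with a range calls for this column order:

-- Equality-filtered column first, range-filtered column second.
CREATE INDEX orders_cust_time_ix ON orders (customer_id, created_time);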
You can generally get all the performance you need with just indexing. Partitioning really comes into play when you have important queries that need to read huge chunks of data (generally reporting queries) or when you need to do things like purge data older than X months.
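To make the comparison concrete, here is a minimal sketch of both approaches for the Orders table in Oracle syntax; the names, types, and the daily interval are assumptions, not a recommendation:

-- Option 1: daily range (interval) partitioning on created_time.
CREATE TABLE orders (
    order_id     NUMBER PRIMARY KEY,
    customer_id  NUMBER(6),
    order_item   VARCHAR2(100),
    created_time TIMESTAMP
)
PARTITION BY RANGE (created_time)
INTERVAL (NUMTODSINTERVAL(1, 'DAY'))
(PARTITION p_initial VALUES LESS THAN (TIMESTAMP '2020-01-01 00:00:00'));

-- Option 2: a plain B-tree index on the same column.
CREATE INDEX orders_created_ix ON orders (created_time);

-- Either way, this one-day filter can prune to a single partition
-- and/or use the index:
SELECT *
FROM orders
WHERE created_time >= TIMESTAMP '2020-01-30 00:00:00'
  AND created_time <  TIMESTAMP '2020-01-31 00:00:00';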

SQL - How to find a specific value from DB without going through the entire DB table

I wonder how I can find a specific value in the DB without going through the entire DB table.
For example:
there is a DB of students and we are looking for all the students with a certain name; how do you do that without going through the whole table?
Use INDEXES
Indexes are used to quickly locate data without having to search every row in a database table every time a database table is accessed. ... Indexes can be created using one or more columns of a database table, providing the basis for both rapid random lookups and efficient access of ordered records.
SQL Server has four options for improving performance for this type of query:
A regular index (either clustered or non-clustered).
A full text index.
Partitioning.
Hash index (for memory optimized tables).
A regular index, created using CREATE INDEX, is the "canonical" answer to this question. It is like an alphabetical list of all names with a pointer to each record. The implementation uses something called B-trees, so the analogy is not perfect. These indexes can be used for equality comparisons (e.g., =, IS NULL) and inequality comparisons (e.g., IN, <, >).
A full text index indexes all words in a text column (for some definition of "word"). This can be used for a range of full-text search options -- available through CONTAINS.
Partitioning is used when you have lots and lots of data and only a handful of categories. That is highly unlikely with a name in a student database. But it physically splits the data into separate files for each name or range of names.
Hash-based indexing is only available on memory-optimized tables. These are only useful for comparisons using = and IN (and <> and NOT IN).
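As a quick sketch of the "canonical" first option (table and column names are made up):

-- A regular B-tree index on the name column turns the lookup into an
-- index seek instead of a scan of the whole table.
CREATE INDEX ix_students_name ON dbo.Students (name);

SELECT student_id, name
FROM dbo.Students
WHERE name = 'Alice';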

AWSDynamoDBQueryExpression order by UNIX timestamp

I am new to NoSQL databases, and I have changed my database schema from storing dates as UTC timestamp strings to UNIX timestamps (numbers), in hopes that I can create either a scan or a query expression to find the 1000 most recent items in the table. I have yet to find a simple snippet of code to accomplish this using the AWSDynamoDBQueryExpression class. Scan doesn't appear to have any sort mechanism, but Query might. Any ideas?
There is no ORDER BY functionality in DynamoDB. If you want to run a top-N query over the whole table, you'll have to perform a scan and then order the results yourself.
Mark B is right that query results can be ordered by the sort key, but that is only within the context of a query. Queries are inherently limited to a single partition key.
If your table is small, then you can get away with creating a Global Secondary Index on the table in which the partition key is an attribute that is the same for all items, and then use the timestamp attribute as the sort key. But keep in mind that this will break down once your table gets bigger. And if you're doing that, you might as well not be using Dynamo; you're better off with a relational database on RDS.
First you need to make sure the timestamp field is the sort key for your DynamoDB table (or the sort key for a Global Secondary Index on that table). Then you just need to run a query. From the documentation:
Query results are always sorted by the sort key value. If the data type of the sort key is Number, the results are returned in numeric order; otherwise, the results are returned in order of UTF-8 bytes. By default, the sort order is ascending. To reverse the order, set the ScanIndexForward parameter to false.

SQL Server table is sorted by default

I have a simple SSIS package where I import data from a flat file into a SQL Server table (SQL Server 2005). The file contains 70k rows and the table has no primary key. The import is successful, but when I open the SQL Server table the order of rows is different from that of the file. After observing closely, I see that the data in the table is sorted by default by the first column. Why is this happening? And how can I avoid the default sort?
Thanks.
You cannot rely on ordering unless you specify order by in your SQL query. SQL is a relational algebra that works with sets. Those sets are unordered. Database tables do not have an intrinsic ordering.
It may well be that the sets are ordered due to the way the data is retrieved from the tables. This may be based on primary key, order of insertion, clustered key, seemingly random order based on the execution plan of the query or the actual data in the table or even the phase of the moon.
Bottom line, if you want a specific order, use order by. If you don't want a specific order, the DBMS is free to deliver your rows in any order, including one based on the first column.
If you really want them sorted by their position in the import file, you should add another column to the table that stores an increasing number based on each row's position in that file, then ORDER BY that column (see the sketch below). But that's a pretty arbitrary sort order; you're generally better off choosing one that makes more sense for the data (transaction ID, date/time, customer number or whatever else you have).
If you want to avoid the default sort (however variable that may be), use a specific sort.
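A minimal sketch of that approach in SQL Server (table and column names are made up); the identity column records insertion order during the load, and ORDER BY makes that order explicit on every read:

CREATE TABLE dbo.ImportedRows (
    RowSeq INT IDENTITY(1,1),   -- increases with each inserted row
    Col1   VARCHAR(100),
    Col2   VARCHAR(100)
);

-- After the SSIS load, this is the only reliable way to get file order back:
SELECT Col1, Col2
FROM dbo.ImportedRows
ORDER BY RowSeq;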
In general, no order is applied if there is no ordering in the SELECT query.
What I have noticed is that results are sometimes returned in the order of the primary key, but this is not guaranteed either.
So, all in all, if you do not specify an ordering, no ordering can be assumed.