I am new to NoSQL databases and I have changed my database schema from storing dates as a UTC timestamp string, to a UNIX timestamp (number), in hopes that I can create either a scan or a query expression to find the 1000 most recent items in the table. I have yet to find a simple snippet of code to accomplish this using the AWSDynamoDBQueryExpression class. Scan doesn't appear to have any sort mechanism but query might. Any ideas?
There is no ORDER BY functionality in DynamoDB. If you want to run a top N query you'll have to perform a scan and then order the results yourself.
Mark B is right that query results can be ordered by the sort key but that is only within the context of a query. Queries are inherently limited to a single partition key.
If your table is small then you can get away with creating a Global Secondary Index on the table in which the partition key is an attribute that has the same value for all items, and then use the timestamp attribute as the sort key. But keep in mind that this will break down once your table gets bigger. And if you're doing that you might as well not be using Dynamo. You're better off with a relational database on RDS.
First you need to make sure the timestamp field is the sort key for your DynamoDB table (or the sort key for a Global Secondary Index on that table). Then you just need to run a query. From the documentation:
Query results are always sorted by the sort key value. If the data
type of the sort key is Number, the results are returned in numeric
order; otherwise, the results are returned in order of UTF-8 bytes. By
default, the sort order is ascending. To reverse the order, set the
ScanIndexForward parameter to false.
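As a minimal sketch of the query described above, the helper below builds request parameters for a DynamoDB Query that returns the newest items first. The table name, index name, and the gsi_pk attribute (a partition key with the same value on every item) are hypothetical; only the parameter names come from the Query API.

```python
from typing import Any, Dict


def build_latest_items_query(
    table_name: str,
    index_name: str,
    partition_value: str,
    limit: int = 1000,
) -> Dict[str, Any]:
    """Build Query parameters that return the newest items first."""
    return {
        "TableName": table_name,
        "IndexName": index_name,
        # Query requires an equality condition on the partition key.
        "KeyConditionExpression": "#pk = :pk",
        # "gsi_pk" is a hypothetical attribute name, not from the original post.
        "ExpressionAttributeNames": {"#pk": "gsi_pk"},
        "ExpressionAttributeValues": {":pk": {"S": partition_value}},
        # False reverses the default ascending sort-key order.
        "ScanIndexForward": False,
        # Upper bound on the number of items evaluated per request.
        "Limit": limit,
    }


params = build_latest_items_query("Events", "by_timestamp", "ALL")
print(params["ScanIndexForward"], params["Limit"])
```

With boto3 these parameters could be passed as boto3.client("dynamodb").query(**params). A single response is also capped at 1 MB, so collecting 1000 items may take several requests, each passing the previous LastEvaluatedKey as ExclusiveStartKey.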
I'm designing a table in DynamoDB which will contain a large number of records, each with a unique ID and a timestamp. I will need to retrieve a set of records that fall between two dates, irrespective of all other property values.
Adding a global secondary index for the timestamp field seems like a logical solution, however this isn't straightforward.
The Query command in DynamoDB requires a KeyConditionExpression parameter, which determines which results are returned by the query. From the DynamoDB developer guide:
To specify the search criteria, you use a key condition expression—a string that determines the items to be read from the table or index. You must specify the partition key name and value as an equality condition. You can optionally provide a second condition for the sort key (if present).
Since the partition key must be specified exactly, it cannot be used for querying a range of values. A range could be specified for the sort key, but only in addition to an exact match on the partition key. Hence, the only way I can see this working is to add a dummy field for the index partition key, where every record has the same value, then perform the query on the timestamp as the sort key. This seems hacky, and, presumably, is not how it's intended to be used.
Some answers to similar questions suggest using the Scan command instead of Query, however this would be very inefficient when fetching a small number of records from a very large table.
Is it possible to efficiently query the table to get all records where a condition matches the timestamp field only?
How big of a range are you dealing with?
You could for instance have a GSI partition key of
YYYY
or YYYY-MM
or YYYY-MM-DD
Your sort key could be the remainder of the timestamp.
You may need to make multiple queries, if for instance the amount of data necessitates daily partitions and you want to show 7 days at a time.
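The partitioning idea above can be sketched as follows, assuming daily partitions and UNIX timestamps in seconds. The function names are made up for illustration: one splits a timestamp into a partition key and a sort key, the other lists the partitions that a date range touches, one query per partition.

```python
from datetime import date, datetime, timedelta, timezone
from typing import List, Tuple


def split_timestamp(unix_seconds: int) -> Tuple[str, str]:
    """Split a UNIX timestamp into (YYYY-MM-DD partition key, HH:MM:SS sort key)."""
    moment = datetime.fromtimestamp(unix_seconds, tz=timezone.utc)
    return moment.strftime("%Y-%m-%d"), moment.strftime("%H:%M:%S")


def day_partitions(start: date, end: date) -> List[str]:
    """List every daily partition key from start to end, inclusive."""
    days = (end - start).days
    return [(start + timedelta(days=offset)).isoformat() for offset in range(days + 1)]


partition_key, sort_key = split_timestamp(1700000000)
print(partition_key, sort_key)

# Showing 7 days at a time means issuing one query per daily partition.
week = day_partitions(date(2023, 11, 8), date(2023, 11, 14))
print(week)
```

Coarser partitions (YYYY-MM or YYYY) mean fewer queries per range but more items in each partition, so the right granularity depends on the data volume.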
Also be sure to read the best practices for time-scale data part of the developer guide.
I have the following table structure
id     name
--------------
10991  Shoug
10990  Moneera
10989  Abc
10988  xyz
id is the primary key column. As you can see, the ids are in decreasing order (i.e. select * from users returns the records in this order) because of the way the data was inserted.
How do I resort the table in the ascending order of the primary key permanently? Preferably with SQL alone?
I found this answer but it's not working for me. I am using Postgres as the database.
How do I resort the table in the ascending order of the primary key permanently?
You are missing a fundamental concept about relational databases. In SQL, tables represent unordered sets. There is no "permanent" ordering. The only ordering is provided by the order by clause on a query.
Some databases support a concept called "clustering"/"clustered indexes". This means that the data on data pages is actually ordered according to some key. In these databases, even when using a table with a clustered index, you are still not guaranteed that the data is returned in any particular order unless you use ORDER BY.
Postgres does not support this functionality, so even this is not available.
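A small sketch of the point about ORDER BY, using Python's built-in sqlite3 module as a stand-in for Postgres (an assumption made only so the example runs without a server). The rows are inserted in descending id order, and the ascending order comes solely from the ORDER BY clause.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")

# Insert in descending id order, as in the question.
conn.executemany(
    "INSERT INTO users (id, name) VALUES (?, ?)",
    [(10991, "Shoug"), (10990, "Moneera"), (10989, "Abc"), (10988, "xyz")],
)

# The only ordering that is guaranteed is the one the query asks for.
ascending_ids = [row[0] for row in conn.execute("SELECT id FROM users ORDER BY id ASC")]
print(ascending_ids)

conn.close()
```

The same SELECT without ORDER BY may happen to return rows in some consistent order, but nothing in SQL promises it.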
Apparently not possible in Postgres.
Curious what your use case was for this. I was researching any downside to using UUID as PK in Postgres, which led me to your post. Since pg does not physically reorder the data according to PK, I concluded that only the size (vs traditional sequence) would be the difference. I'm more used to SQL Server creating a clustered index and reordering data by default.
As your link in your OP suggests, you can force it to reorder, but it is not persistent. Postgres also has a CLUSTER command to do this. This does lock your table, though.
https://www.postgresql.org/docs/current/sql-cluster.html
https://dba.stackexchange.com/questions/38710/how-does-postgresql-physically-order-new-records-on-disk-after-a-cluster-on-pri
I have a large table that I run queries like select date_att > date '2001-01-01' on regularly. I'm trying to increase the speed of these queries by clustering the table on date_att, but when I run those queries through explain analyze it still chooses to sequentially scan the table, even on queries as simple as SELECT date_att from table where date_att > date '2001-01-01'. Why is this the case? I understand that since the query returns a large portion of the table, the optimizer will ignore the index, but since the table is clustered by that attribute, shouldn't it be able to really quickly binary search through the table to the point where date > '2001-01-01' and return all results after that? This query still takes as much time as without the clustering.
It seems like you are confusing two concepts:
PostgreSQL clustering of a table
Clustering a table according to an index in PostgreSQL aligns the order of table rows (stored in a heap table) to the order in the index at the time of clustering. From the docs:
Clustering is a one-time operation: when the table is subsequently
updated, the changes are not clustered.
http://www.postgresql.org/docs/9.3/static/sql-cluster.html
Clustering potentially (often) improves query speed for range queries because the selected rows are stored nearby in the heap table by coincidence. There is nothing that guarantees this order! Consequently the optimizer cannot assume that it is true.
E.g. if you insert a new row that fulfills your where clause, it might be inserted at any place in the table, for example where rows for 1990 are stored. Hence, this assumption doesn't hold true:
but since the table is clustered by that attribute, shouldn't it be able to really quickly binary search through the table to the point where date > '2001-01-01' and return all results after that?
This brings us to the other concept you mentioned:
Clustered Indexes
This is something completely different, not supported by PostgreSQL at all but by many other databases (SQL Server, MySQL with InnoDB and also Oracle where it is called 'Index Organized Table').
In that case, the table data itself is stored in an index structure; there is no separate heap structure! As it is an index, the order is also maintained for each insert/update/delete. Hence your assumption would hold true, and indeed I'd expect the above-mentioned databases to behave as you expect (given the date column is the clustering key!).
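The difference between the two concepts can be illustrated with a toy model. This is not how PostgreSQL or any other engine lays out pages; plain Python lists of years stand in for stored rows. A heap that was sorted once and then received an insert at an arbitrary position can no longer be binary searched safely, while a structure that keeps its order on every insert can.

```python
import bisect
from typing import List


def rows_after_by_binary_search(rows: List[int], cutoff: int) -> List[int]:
    """Return rows greater than cutoff; only correct if rows is fully sorted."""
    return rows[bisect.bisect_right(rows, cutoff):]


def rows_after_by_full_scan(rows: List[int], cutoff: int) -> List[int]:
    """Return rows greater than cutoff by checking every row."""
    return [value for value in rows if value > cutoff]


# Heap table right after a one-time CLUSTER: physically sorted.
heap = [1990, 1995, 2000, 2005, 2010]
# A later insert lands wherever there is free space, here next to the 1990 rows.
heap.insert(1, 2020)

# Index-organized storage keeps its order on every insert.
index_organized = [1990, 1995, 2000, 2005, 2010]
bisect.insort(index_organized, 2020)

print(rows_after_by_binary_search(heap, 2001))             # misses the 2020 row
print(rows_after_by_full_scan(heap, 2001))                 # finds every match
print(rows_after_by_binary_search(index_organized, 2001))  # safe: order is maintained
```

This is why the planner falls back to a sequential scan on the heap: it cannot assume the physical order still holds.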
Hope that clarifies it.
What is the fastest way to reorder (resort) a table so that the physical representation most closely matches the logical, using PostgreSQL?
Having data ordered does not mean that you won't need ORDER BY clause in your queries anymore.
It just means that the logical order of the data is likely to match the physical order, and retrieving the data in logical order (say, from an index scan) will more likely result in sequential read access to the table.
Note that neither MySQL nor PostgreSQL guarantees that INSERT … SELECT … ORDER BY will result in the data being ordered in the table.
To order data in PostgreSQL you should define an index with the given order and issue this command:
CLUSTER mytable USING myindex
This will rebuild the table and all indexes defined on it.
Order of insertion does not, in the end, always control order in the table. Tables are ordered by their clustered index, if they have one. If they do not have one, then the order is technically undefined. Although it is probably safe in many cases to assume that they're ordered in insertion order, this doesn't mean they'll stay that way. Lacking a specific ordering, the DB engine is free to reorder as it sees fit to optimize retrieval.
If you want to control order on disk, the single best way is to properly define your clustered index. Always.
Use CLUSTER to reorder a table by a given index.
(BTW, ALTER TABLE ... ORDER BY ...)
I have a simple SSIS package that imports data from a flat file into a SQL Server table (SQL Server 2005). The file contains 70k rows and the table has no primary key. The import is successful, but when I open the SQL Server table the order of rows is different from that of the file. Looking closely, I see that the data in the table is sorted by the first column by default. Why is this happening, and how can I avoid the default sort?
Thanks.
You cannot rely on ordering unless you specify order by in your SQL query. SQL is based on relational algebra, which works with sets. Those sets are unordered. Database tables do not have an intrinsic ordering.
It may well be that the sets are ordered due to the way the data is retrieved from the tables. This may be based on primary key, order of insertion, clustered key, seemingly random order based on the execution plan of the query or the actual data in the table or even the phase of the moon.
Bottom line, if you want a specific order, use order by. If you don't want a specific order, the DBMS is free to deliver your rows in any order, including one based on the first column.
If you really want them sorted depending on the position in the import file, you should add another column to the table to store an increasing number based on its position in that file. Then use order by using that column. But that's a pretty arbitrary sort order, you're generally better off choosing one that makes more sense to the data (transaction ID, date/time, customer number or whatever else you have).
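The extra-column approach can be sketched like this, again using Python's built-in sqlite3 module as a stand-in for SQL Server and a hard-coded list in place of the flat file. The table, column names, and file contents are invented for illustration.

```python
import sqlite3

# Hypothetical flat-file contents, in the order they appear in the file.
file_lines = ["pear,3", "apple,7", "mango,1"]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE imported (line_no INTEGER, fruit TEXT, qty INTEGER)")

# Store each row's position in the file alongside its data.
for line_no, line in enumerate(file_lines, start=1):
    fruit, qty = line.split(",")
    conn.execute("INSERT INTO imported VALUES (?, ?, ?)", (line_no, fruit, int(qty)))

# File order is recovered by sorting on the stored position.
in_file_order = [row[0] for row in conn.execute("SELECT fruit FROM imported ORDER BY line_no")]
print(in_file_order)

conn.close()
```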
If you want to avoid the default sort (however variable that may be), use a specific sort.
In general no order is applied if there is no ordering in the select query.
What I have noticed is that the table results might return in the order of the primary key, but this is not guaranteed either.
So all in all, if you do not specify an ordering, no ordering can be assumed.