Indexing and CLUSTER in Greenplum database

I am new to Greenplum database. I have a question.
Is Cluster on table mandatory after creating an index on a column in Greenplum in case of row-based distribution?

The "massively parallel" (MPP) nature of Greenplum's software-level architecture, coupled with the throughput of modern servers, makes indexes unnecessary in most cases.
To put it differently, the speed of table scans in Greenplum is a feature, not a bottleneck. See this writeup on how MPP works under the hood: https://dwarehouse.wordpress.com/2012/12/28/introduction-to-massively-parallel-processing-mpp-database/

If your data is not updated frequently and you need results returned quickly, you can cluster the table on an index; note that the CLUSTER operation itself can take a long time. For column-oriented tables, you can also build an index on the relevant column.
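To make the original question concrete: clustering is a separate, optional step after index creation, not a mandatory one. A minimal sketch, assuming a hypothetical `sales` table and queries filtered on `sale_date` (the CLUSTER syntax varies between Greenplum versions, since older releases track PostgreSQL 8.2):

```sql
-- Create a B-tree index on the filter column (hypothetical table/column names)
CREATE INDEX idx_sales_date ON sales (sale_date);

-- Optionally rewrite the table in index order. This is a one-time,
-- heavyweight operation and is NOT required after creating the index.
-- (Older Greenplum versions use: CLUSTER idx_sales_date ON sales;)
CLUSTER sales USING idx_sales_date;

-- CLUSTER is not maintained automatically; refresh statistics afterwards
-- and re-run CLUSTER only after large data changes, if at all
ANALYZE sales;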

Related

Has anyone used Snowflake search optimization and gained benefits over cluster keys?

Reference:
https://docs.snowflake.com/en/user-guide/search-optimization-service.html#what-access-control-privileges-are-needed-for-the-search-optimization-service
Please share any use cases, cost vs performance as well.
Appreciate the insights
In general, Search Optimisation Service (SOS) would be more beneficial over Clustering for point lookup queries, the type of queries that retrieves 1 or a few rows from a very large table using equality or IN filter condition.
Since you can only have one cluster key in a table, SOS can also help optimise searches from non-cluster-key columns in a clustered table.
However, unlike Clustering, SOS adds storage cost: it maintains search access path data for each table with SOS enabled.
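For reference, enabling SOS is a per-table DDL operation (hypothetical table name; it requires the access-control privileges described in the linked docs):

```sql
-- Enable the Search Optimization Service on a large table (hypothetical name)
ALTER TABLE big_events ADD SEARCH OPTIMIZATION;

-- Point lookups like this one are the queries SOS is designed to speed up:
-- equality or IN filters returning a few rows from a very large table
SELECT * FROM big_events WHERE event_id = 'abc-123';

-- SHOW TABLES includes columns reporting search optimization status and
-- the extra bytes of storage the search access paths consume
SHOW TABLES LIKE 'big_events';
```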

Why does Redshift not need materialized views or indexes?

In the Redshift FAQ under
Q: How does the performance of Amazon Redshift compare to most traditional databases for data warehousing and analytics?
It says the following:
Advanced Compression: Columnar data stores can be compressed much more than row-based data stores because similar data is stored sequentially on disk. Amazon Redshift employs multiple compression techniques and can often achieve significant compression relative to traditional relational data stores. In addition, Amazon Redshift doesn't require indexes or materialized views and so uses less space than traditional relational database systems. When loading data into an empty table, Amazon Redshift automatically samples your data and selects the most appropriate compression scheme.
Why is this the case?
It's a bit disingenuous to be honest (in my opinion). Although RedShift has neither of these, I'm not sure that's the same as saying it wouldn't benefit from them.
Materialised Views
I have no real idea why they make this claim. Possibly because they consider the engine so performant that the gains from having them are minimal.
I would dispute this and the product I work on maintains its own materialised views and can show significant performance gains from doing so. Perhaps AWS believe I must be doing something wrong in the first place?
Indexes
RedShift does not have indexes.
It does have sort keys (SORTKEY), which are exceptionally similar to a clustered index: simply a list of fields by which the data is ordered (like a composite clustered index).
It even has recently introduced INTERLEAVED SORT KEYS. This is a direct attempt to have multiple independent sort orders. Instead of ordering by a THEN b THEN c it effectively orders by each of them at the same time.
That becomes kind of possible because of how RedShift implements its column store.
- Each column is stored separately from each other column
- Each column is stored in 1MB blocks
- Each 1MB block has summary statistics
As well as being the storage pattern this effectively becomes a set of pseudo indexes.
- If the data is sorted by a then b then x
- But you want z = 1234
- RedShift looks at the block statistics (for column z) first
- Those stats will say the minimum and maximum values stored by that block
- This allows Redshift to skip many of those blocks in certain conditions
- This in turn allows RedShift to identify which blocks to read from the other columns
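The block-skipping behaviour described above can be sketched in DDL (hypothetical table and column names; the 1MB block statistics, or "zone maps", are maintained automatically):

```sql
-- Hypothetical table sorted by a, then b, then x; z is an ordinary column
CREATE TABLE events (
    a INT,
    b INT,
    x INT,
    z INT
)
SORTKEY (a, b, x);

-- Even though z is not in the sort key, each 1MB block of column z
-- carries min/max statistics (zone maps). For this filter, Redshift
-- skips every block whose [min, max] range cannot contain 1234,
-- then reads only the matching blocks of the other columns:
SELECT a, b FROM events WHERE z = 1234;
```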
As of December 2019, Redshift has a preview of materialized views: Announcement
from the documentation: A materialized view contains a precomputed result set, based on a SQL query over one or more base tables. You can issue SELECT statements to query a materialized view, in the same way that you can query other tables or views in the database. Amazon Redshift returns the precomputed results from the materialized view, without having to access the base tables at all. From the user standpoint, the query results are returned much faster compared to when retrieving the same data from the base tables.
Indexes are basically used in OLTP systems to retrieve a specific row or a small group of rows. OLAP systems, by contrast, retrieve large sets of values and perform aggregation over them, so indexes are not the right fit. Instead, Redshift uses a secondary structure called zone maps together with sort keys.
Such indexes operate on B-trees. The 'life without a btree' section in the blog below explains, with examples, how a B-tree-based index affects OLAP workloads.
https://blog.chartio.com/blog/understanding-interleaved-sort-keys-in-amazon-redshift-part-1
The combination of columnar storage, compression encodings, data distribution, query compilation, optimization, etc. is what makes Redshift fast.
Implementing the above factors reduces I/O operations on Redshift and ultimately gives better performance. Building an efficient solution requires a good understanding of these areas, as well as of the queries you will run on Amazon Redshift.
For example:
Redshift supports Sort keys, Compound Sort keys and Interleaved Sort keys.
Suppose your table structure is lineitem(orderid, linenumber, supplier, quantity, price, discount, tax, returnflag, shipdate).
If you select orderid as your sort key but your queries filter on shipdate, Redshift will not be operating efficiently.
If you have a compound sort key on (orderid, shipdate) and you query only on shipdate, Redshift will not be operating efficiently.
If you have an interleaved sort key on (orderid, shipdate), queries filtering on either orderid or shipdate alone can still be served efficiently.
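In DDL terms, the two sort-key styles look like this (hypothetical, abbreviated lineitem tables):

```sql
-- Compound sort key: block skipping works well only when filters
-- include the leading column (orderid); a filter on shipdate alone
-- benefits little
CREATE TABLE lineitem_compound (
    orderid  BIGINT,
    shipdate DATE,
    quantity INT
)
COMPOUND SORTKEY (orderid, shipdate);

-- Interleaved sort key: gives each key column equal weight, so filters
-- on either orderid or shipdate alone can still skip blocks
CREATE TABLE lineitem_interleaved (
    orderid  BIGINT,
    shipdate DATE,
    quantity INT
)
INTERLEAVED SORTKEY (orderid, shipdate);
```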
Redshift does not support materialized views, but it easily allows you to create (temporary/permanent) tables by running SELECT queries on existing tables. This duplicates data, but in the shape your queries need (similar to a materialized view). The blog below gives you some information on this approach.
https://www.periscopedata.com/blog/faster-redshift-queries-with-materialized-views-lifetime-daily-arpu.html
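A sketch of that pattern, with hypothetical names: precompute an aggregate into an ordinary table and "refresh" it by rebuilding after each batch load:

```sql
-- Precompute daily revenue into a plain table; this acts like a manual
-- materialized view (hypothetical source table and columns)
CREATE TABLE daily_revenue AS
SELECT shipdate, SUM(price * quantity) AS revenue
FROM lineitem
GROUP BY shipdate;

-- Queries hit the small precomputed table instead of the base table
SELECT revenue FROM daily_revenue WHERE shipdate = '2019-01-15';

-- To refresh, drop and rebuild after each batch load
DROP TABLE daily_revenue;
CREATE TABLE daily_revenue AS
SELECT shipdate, SUM(price * quantity) AS revenue
FROM lineitem
GROUP BY shipdate;
```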
Redshift also fared well against other systems like Hive, Impala, Spark, BQ, etc. in one of our recent benchmark exercises.
The simple answer is: because it can read the needed data really, really fast and in parallel.
One of the primary uses of indexes are "needle-in-the-haystack" queries. These are queries where only a relatively small number of rows are needed and these match a WHERE clause. Columnar datastores handle these differently. The entire column is read into memory -- but only the column, not the rest of the row's data. This is sort of similar to having an index on each column, except the values need to be scanned for the match (that is where the parallelism comes in handy).
Other uses of indexes are for matching key pairs for joining or for aggregations. These can be handled by alternative hash-based algorithms.
As for materialized views, RedShift's strength is not updating data. Many such queries are quite fast enough without materialization. And materialization incurs a lot of overhead for maintaining the data in a high-transaction environment. If you don't have a high-transaction environment, then you can rebuild summary tables after batch loads instead.
They recently added support for Materialized Views in Redshift: https://aws.amazon.com/about-aws/whats-new/2019/11/amazon-redshift-introduces-support-for-materialized-views-preview/
Syntax for materialized view creation:
CREATE MATERIALIZED VIEW mv_name
[ BACKUP { YES | NO } ]
[ table_attributes ]
AS query
Syntax for refreshing a materialized view:
REFRESH MATERIALIZED VIEW mv_name
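A concrete, hypothetical instance of that syntax:

```sql
-- Precomputed daily sales summary (hypothetical base table and columns)
CREATE MATERIALIZED VIEW mv_daily_sales
AS
SELECT shipdate, COUNT(*) AS orders, SUM(price) AS revenue
FROM lineitem
GROUP BY shipdate;

-- Queries against the view return the precomputed result set
SELECT orders FROM mv_daily_sales WHERE shipdate = '2019-01-15';

-- Re-run after base-table changes to bring the view up to date
REFRESH MATERIALIZED VIEW mv_daily_sales;
```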

Is having many partition keys in Azure Table Storage a good design for read queries?

I know that having many partition keys reduces batch processing (EGT) in Azure Table Storage. However, I wonder whether there is any performance issue in terms of reading as well. For example, suppose I designed my Azure Table such that every new entity has a new partition key, and I end up with 1M or more partition keys. Is there any performance disadvantage for read queries?
If the most often operation done by you is Point Query (PartitionKey and RowKey specified), the unique-partition-key design is quite good. However if your querying operation is usually Table Scan (No Partition Key specified), the design will be awful.
You can refer to chapter "Design for querying" in Azure Table Design Guide for the details.
Point query is the most efficient query to retrieve a single entity by specifying a single PartitionKey and RowKey using equality predicates. If your PartitionKey is unique, you may consider using a constant string as RowKey to enable you to leverage point query. The choice of design also depends on how you plan to read/retrieve your data. If you always plan to use point query to retrieve the data, this design makes sense.
Please see the “New PartitionKey Value for Every Entity” section in the following article: http://blogs.msdn.com/b/windowsazurestorage/archive/2010/11/06/how-to-get-most-out-of-windows-azure-tables.aspx. In short, it will scale very well, since our storage system has an option to load balance several partitions. However, if your application requires you to retrieve data without specifying a PartitionKey, it will be inefficient because it will result in a table scan.
# ascl#microsoft.com, if you want to discuss your table design further.
Please send me an email at ascl#microsoft.com if you want to discuss your table design further.

What is the most scalable design for this table structure

DataColumn, DataColumn, DateColumn
Every so often we put data into the table via date.
So everything seems great at first, but then I thought: What happens when there are a million or billion rows in the table? Should I be breaking up the tables by date? This way the query performance will never degrade? How do people deal with this sort of thing?
You can use partitioned tables starting with SQL 2K5: Partitioned Tables
This way you gain the benefits of keeping the logical design pure while being able to move old data into a different file group.
You should not break up your tables because of data volume. Instead, you should worry about your indexes, normalization and so on.
Update
A little deeper explanation. Let's suppose you have a table with a million records. If you have different dates on [DateColumn], your greatest ally will be the indexes that work with the [DateColumn]. Then you make sure your queries always filter by at least [DateColumn].
This way, you will be fine.
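A minimal sketch of that advice in T-SQL, assuming a hypothetical table name:

```sql
-- Index the date column so filters on it can seek instead of scan
CREATE INDEX IX_MyData_DateColumn
    ON dbo.MyData (DateColumn);

-- Queries that always filter by at least [DateColumn] can then use
-- an index seek, even with millions of rows in the table
SELECT *
FROM dbo.MyData
WHERE DateColumn >= '2008-01-01'
  AND DateColumn <  '2008-02-01';
```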
This easily qualifies as premature optimization, which is tough to achieve in db design IMHO, because optimization is/should be closer to the surface in data modeling.
But all you need to do is create an index on the DateColumn field. An index is actually a much better performance solution than any kind of table splitting/breaking up and keeps your design and therefore all of you programming much simpler. (And you can decide to use partitioning w/o affecting your design in the future if it helps.)
Sounds like you could use a history table. If you are mostly going to query the current date's data, then migrate the old data to the history table and your main table will not grow so much.
If I understand your question correctly, you have a table with some data and a date. Your question is: will I see improved performance if I make a new table, say, every year? That way queries never have to look at more than one year's worth of data.
This is wrong. Instead what you should do is set the date field as an index. The server will be able to give you the performance gain you need if it is an index.
If you don't do this your program's logic will get crazy and ultimately slow down your system.
Keep it simple.
(NB - There are some advanced partitioning features you can make use of, but these can be layered in later if needed -- it is unlikely you will need these features but the simple design should be able to migrate to them if needed.)
When tables and indexes become very large, partitioning can help by partitioning the data into smaller, more manageable sections.
Microsoft SQL Server 2005 allows you to partition your tables based on specific data usage patterns using defined ranges or lists. SQL Server 2005 also offers numerous options for the long-term management of partitioned tables and indexes by the addition of features designed around the new table and index structure. Furthermore, if a large table exists on a system with multiple CPUs, partitioning the table can lead to better performance through parallel operations.
You might want to consider the following too: in SQL Server 2005, related tables (such as Order and OrderDetails tables) that are partitioned on the same partitioning key and the same partitioning function are said to be aligned. When the optimizer detects that two partitioned and aligned tables are joined, SQL Server 2005 can join the data that resides on the same partitions first and then combine the results. This allows SQL Server 2005 to use multiple-CPU computers more effectively.
Read about Partitioned Tables and Indexes in SQL Server 2005
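For illustration, partitioning by date in SQL Server 2005 takes three DDL steps (hypothetical object names and boundary dates):

```sql
-- 1. Partition function: map dates into yearly ranges
CREATE PARTITION FUNCTION pfByYear (DATETIME)
AS RANGE RIGHT FOR VALUES ('2007-01-01', '2008-01-01');

-- 2. Partition scheme: map every partition to the PRIMARY filegroup
-- (separate filegroups could instead move old data to cheaper storage)
CREATE PARTITION SCHEME psByYear
AS PARTITION pfByYear ALL TO ([PRIMARY]);

-- 3. Create the table on the scheme, partitioned by the date column;
-- the logical design stays pure, and queries are unchanged
CREATE TABLE dbo.MyData (
    Data1      INT,
    Data2      INT,
    DateColumn DATETIME
) ON psByYear (DateColumn);
```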

What is the best way to build an index to get the fastest read response?

I need to index up to 500,000 entries for fastest read. The index needs to be rebuilt periodically, on disk. I am trying to decide between a simple file, like a hash on disk, or a single table in an embedded database. I have no need for an RDBMS engine.
I'm assuming you're referring to indexing tables on a relational DBMS (like mySql, Oracle, or Postgres).
Indexes are secondary data stores that keep a record of a subset of fields for a table in a specific order.
If you create an index, any query that includes the subset of fields that are indexed in its WHERE clause will perform faster.
However, adding indexes will reduce INSERT performance.
In general, indexes don't need to be rebuilt unless they become corrupted. They should be maintained on the fly by your DBMS.
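If the embedded-database route wins out, a single SQLite table with a primary key already gives you a B-tree index on disk; a sketch, with hypothetical names:

```sql
-- One table, one B-tree: the PRIMARY KEY is the on-disk index
CREATE TABLE entries (
    key   TEXT PRIMARY KEY,
    value BLOB
);

-- Periodic rebuild: clear and reload inside one transaction so
-- readers never see a half-built index
BEGIN;
DELETE FROM entries;
-- ... bulk INSERT the ~500,000 rows here ...
COMMIT;

-- Point reads are a single B-tree lookup
SELECT value FROM entries WHERE key = 'some-key';
```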
Perhaps BDB? It is a high-performance embedded database that doesn't use an RDBMS engine.
If you're storing state objects by key, how about Berkeley DB?
cdb if the data does not change.
/Allan
PyTables Pro claims that "for situations that don't require fast updates or deletions, OPSI is probably one of the best indexing engines available". I've not personally used it, but the F/OSS version of PyTables already gives you good performance:
http://www.pytables.org/moin/PyTablesPro
This is what MapReduce was invented for. Hadoop is a cool java implementation.
If the data doesn't need to be completely up to date, you might also think about using a data warehousing tool for OLAP purposes (such as MSOLAP). They can perform lightning-fast read-only queries based on pre-calculated data.