I am quite new to Redshift but have quite some experience in the BI area. I need help from an expert Redshift developer. Here's my situation:
I have an external (S3) database added to Redshift. This will suffer very frequent changes, approx. every 15 minutes. I will run a lot of concurrent queries directly from Qlik Sense against this external DB.
As best practices say that Redshift + Spectrum works best when the smaller table resides in Redshift, I decided to move some calculated dimension tables locally and leave the outer tables in S3. What I can't decide is whether materialized views or tables are better suited for this.
I already tested both, with DISTSTYLE ALL and a proper SORTKEY, and the tests show that MVs are faster. I just don't understand why that is. Considering the dimension tables are fairly small (<3mil rows), I have the following questions:
Shall we use MVs and refresh them via a scheduled task, or use a table and perform some form of ETL (maybe via a stored procedure) to refresh it?
If a table is used: I tried casting the varchar keys (heavily used in joins) to bigint to force the encoding to AZ64, but queries perform worse than without casting (where encode=LZO). Is this because the keys are stored as varchar in the external DB?
If an MV is used: I also tried the above casting in the query behind the MV, but the encoding says NONE (figured out by checking the table created behind the scenes). Moreover, even without casting, most of the key columns used in joins have no encoding. Might this be the reason why MVs are faster than tables? And should I not expect the opposite: no encoding = worse performance?
Some more info: in S3, we store in the form of parquet files, with proper partitioning. In Redshift, the tables are sorted against the same columns as S3 partitioning, plus some more columns. And all queries use joins against these columns in the same order and also a filter in the where clause on these columns. So the query is well structured.
Let me know if you need any other details.
I'm working in SQL Workbench in Redshift. We have daily event tables for customer accounts, the same format each day just with updated info. There are currently 300+ tables. For a simple example, I would like to extract the top 10 rows from each table and place them in 1 table.
Table name format is Events_001, Events_002, etc. Typical values are Customer_ID and Balance.
Redshift does not appear to support declaring variables, so I'm a bit stuck.
You've effectively invented a kind of pseudo-partitioning, where you manually partition the data by day.
To manually recombine the tables create a view to union everything together...
CREATE VIEW
events_combined
AS
SELECT 1 AS partition_id, * FROM events_001
UNION ALL
SELECT 2 AS partition_id, * FROM events_002
UNION ALL
SELECT 3 AS partition_id, * FROM events_003
-- etc., one SELECT per remaining table
That's a hassle: you need to recreate the view every time you add a new table.
That's why most modern databases have partitioning schemes built into them, so all the boilerplate is taken care of for you.
But RedShift doesn't do that. So, why not?
In general because RedShift has many alternative mechanisms for dividing and conquering data. It's columnar, so you can avoid reading columns you don't use. It's horizontally partitioned across multiple nodes (sharded), to share the load with large volumes of data. It's sorted and compressed in pages to avoid loading rows you don't want or need. It has dirty pages for newly arriving data, which can then be cleaned up with a VACUUM.
So, I would agree with others that it's not normal practice. Yet, Amazon themselves do have a help page (briefly) describing your use case.
https://docs.aws.amazon.com/redshift/latest/dg/c_best-practices-time-series-tables.html
So, I'd disagree with "never do this". Still, it is a strong indication that you've accidentally walked into an anti-pattern and should seriously reconsider your design.
As others have pointed out, many small tables in Redshift are really inefficient, like terrible if taken to the extreme. But that is not your question.
You want to know how to perform the same query on multiple tables from SQL Workbench. I'm assuming you are referring to SQLWorkbench/J. If so, you can define variables in the bench and use these variables in queries. Then you just need to update the variable and rerun the query. Now, SQLWorkbench/J doesn't offer any looping or scripting capabilities. If you want to loop, you will need to wrap the bench in a script (like a BAT file or a bash script).
My preference is to write a jinja template with the SQL in it along with any looping and variable substitution. Then apply a JSON file with the table names and presto, you have all the SQL for all the tables in one file. I just need to run this, usually with the psql CLI, but at times I import it into my bench.
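As a rough sketch of that templating idea, the snippet below uses Python's standard-library string.Template as a stand-in for jinja. The three-digit suffix format, the lowercase events_ prefix, and the target table events_top10 are assumptions made for illustration; the output is only SQL text that would still need to be run through psql or the workbench.

```python
from string import Template

# One statement per daily table; events_top10 is a hypothetical target table.
STATEMENT = Template(
    "INSERT INTO events_top10 SELECT * FROM $table LIMIT 10;"
)

def table_names(count, prefix="events_"):
    """Return names such as events_001, events_002, ... (assumed format)."""
    return [f"{prefix}{i:03d}" for i in range(1, count + 1)]

def render_script(count):
    """Render one INSERT ... SELECT statement per table, one per line."""
    return "\n".join(STATEMENT.substitute(table=t) for t in table_names(count))

script = render_script(3)
print(script)
```

Note that LIMIT without an ORDER BY returns an arbitrary 10 rows; a real "top 10" would need an ORDER BY on whichever column defines the ranking.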
My advice is to treat Redshift as a query execution engine and use an external environment (Lambda, EC2, etc) for the orchestration of what queries to run and when. Many other databases (try to) provide a full operating environment inside the database functionality. Applying this pattern to Redshift often leads to problems. Use Redshift for what it is great at and perform the other actions elsewhere. In the end you will find that the large AWS ecosystem provides extended capabilities as compared to other databases, it's just that these aren't all done inside of Redshift.
I have more than 100 tables in Redshift that I'd like to UNION to create one consolidated table. I can't hardcode this query because the list of tables will grow quite quickly. So I want to be able to achieve a process wherein I'm able to write something like, "UNION all tables where the table name contains 'orders'".
What's the best way to do this in Redshift? I'm open to using third party tools/languages to do this if needed, but if possible to do within Redshift, that would be ideal.
I don't think this can be done inside of Redshift - I'll let someone with a bright idea chime in but I don't think there is a way.
So you will need an external system to compose the query for you. The table names can be found in the Redshift catalogs, and the query can be composed in a templating system like jinja2. Jinja2 can loop over a list of tables and build the UNION ALL SQL for you, and it runs standalone or as a Python library. Or you can have a process (Lambda) that builds a view over all your tables, so that your query just accesses the view.
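A minimal sketch of such an external composer, in plain Python rather than jinja2, might look like this. The function name, the view name orders_combined, and the hardcoded table list are all hypothetical; in practice the list would be fetched from the Redshift catalog first.

```python
def build_union_view(view_name, all_tables, pattern="orders"):
    """Build a CREATE OR REPLACE VIEW statement over every table whose
    name contains `pattern`. `all_tables` would come from the catalog."""
    matching = sorted(t for t in all_tables if pattern in t)
    if not matching:
        raise ValueError(f"no table name contains {pattern!r}")
    # Tag each row with its source table so the origin is not lost.
    selects = [f"SELECT '{t}' AS source_table, * FROM {t}" for t in matching]
    body = "\nUNION ALL\n".join(selects)
    return f"CREATE OR REPLACE VIEW {view_name} AS\n{body};"

# Hypothetical table list; in practice query the catalog for it.
tables = ["orders_2020", "customers", "orders_2019"]
sql = build_union_view("orders_combined", tables)
print(sql)
```

The table names are interpolated into the SQL without quoting, so this should only ever be fed names read from the catalog, never user input.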
Now let's talk about why you shouldn't be doing this. First off, Redshift is designed to be efficient on large tables. The storage block size is 1MB, and tables of fewer than a few million rows can be significantly inefficient. A table of 10,000 rows can use less than 1% of the storage space for actual data, so reading these tables can have a high overhead, and if you need to scan hundreds of them you can spend all your time reading barely used blocks. Not only is this inefficient in terms of execution but also in disk storage. You could be heading for big problems on this path.
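To put a rough number on that, here is a back-of-the-envelope calculation. It assumes uncompressed 8-byte values, rows spread evenly across slices, and a 16-slice cluster; all three are assumptions for illustration, and compression would make the fraction smaller still.

```python
BLOCK_BYTES = 1024 * 1024  # 1MB storage block, as stated above

def block_fill_fraction(rows, bytes_per_value, slices=1):
    """Fraction of one column block holding real data on each slice,
    assuming rows are spread evenly and values are uncompressed."""
    rows_per_slice = rows / slices
    return min(1.0, rows_per_slice * bytes_per_value / BLOCK_BYTES)

# 10,000 rows of an 8-byte column on an assumed 16-slice cluster.
fraction = block_fill_fraction(10_000, 8, slices=16)
print(f"{fraction:.2%} of each block is data")
```

Under these assumptions each slice holds 625 values, or 5,000 bytes, in a 1MB block, which is under half a percent.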
Also, the Redshift query compiler has limits on segments and parts in the query. Unioning all these tables will hit these limits and fail as you move forward and add tables. Defining a process that will break one day is not likely where you want to be.
I want to move multiple SQLite files to PostgreSQL.
Data contained in these files are monthly time-series (one month in a single *.sqlite file). Each has about 300,000 rows. There are more than 20 of these files.
My dilemma is how to organize the data in the new database:
a) Keep it in multiple tables
or
b) Merge it to one huge table with new column describing the time period (e.g. 04.2016, 05.2016, ...)
The database will be used only to pull data out of it (with the exception of adding data for new month).
My concern is that selecting data from multiple tables (join) would not perform very well and the queries can get quite complicated.
Which structure should I go for - one huge table or multiple smaller tables?
Think I would definitely go for one table - just make sure you use sensible indexes.
If you have the space and the resources, go with 1 table. As other users have appropriately pointed out, databases can handle millions of rows no problem... well, depending on the data that is in them. The row size can make a big difference, such as when storing several VARCHAR(MAX) or VARBINARY(MAX) columns per row.
There is no doubt that writing queries and ETL (extract, transform, load) is significantly easier on a single table! And maintenance is easier too from an archival perspective.
But if you never access the data and you need the performance in the primary table, some sort of archive might make sense.
There are some BI related reasons to maintain multiple tables but it doesn't sound like that is your issue here.
There is no perfect answer; it will depend on your situation.
PostgreSQL is easily able to handle millions of rows in a table.
Go for option b) but..
with new column describing the time period (e.g. 04.2016, 05.2016, ...)
Please don't. Querying the different periods will become a pain, and an unnecessary one. Just put the date in one column, put an index on the column, and you can probably execute fast queries on it.
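As a small illustration of the date-column approach, the sketch below uses Python's built-in sqlite3 module as a stand-in for PostgreSQL. The table and column names (readings, reading_date, value) are hypothetical, and dates are stored as ISO strings because SQLite has no DATE type; in PostgreSQL a real DATE column would be the natural choice.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory stand-in for the real database
conn.execute("CREATE TABLE readings (reading_date TEXT, value REAL)")
conn.execute("CREATE INDEX idx_readings_date ON readings (reading_date)")
conn.executemany(
    "INSERT INTO readings VALUES (?, ?)",
    [("2016-04-15", 1.0), ("2016-04-30", 2.0), ("2016-05-01", 3.0)],
)

def rows_in_month(connection, year, month):
    """Select one month with a half-open date range the index can serve."""
    start = f"{year:04d}-{month:02d}-01"
    next_year, next_month = (year + 1, 1) if month == 12 else (year, month + 1)
    end = f"{next_year:04d}-{next_month:02d}-01"
    cur = connection.execute(
        "SELECT reading_date, value FROM readings "
        "WHERE reading_date >= ? AND reading_date < ? ORDER BY reading_date",
        (start, end),
    )
    return cur.fetchall()

april = rows_in_month(conn, 2016, 4)
print(april)
```

A period is then just a range on the one date column, so no separate "04.2016" label needs to be stored or parsed.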
My concern is that selecting data from multiple tables (join) would not perform very well and the queries can get quite complicated.
Complicated for you to write or for the database to execute? An example would be nice, so we can get a picture of your actual requirements.
In the Redshift FAQ under
Q: How does the performance of Amazon Redshift compare to most traditional databases for data warehousing and analytics?
It says the following:
Advanced Compression: Columnar data stores can be compressed much more than row-based data stores because similar data is stored sequentially on disk. Amazon Redshift employs multiple compression techniques and can often achieve significant compression relative to traditional relational data stores. In addition, Amazon Redshift doesn't require indexes or materialized views and so uses less space than traditional relational database systems. When loading data into an empty table, Amazon Redshift automatically samples your data and selects the most appropriate compression scheme.
Why is this the case?
It's a bit disingenuous to be honest (in my opinion). Although RedShift has neither of these, I'm not sure that's the same as saying it wouldn't benefit from them.
Materialised Views
I have no real idea why they make this claim. Possibly because they consider the engine so performant that the gains from having them are minimal.
I would dispute this and the product I work on maintains its own materialised views and can show significant performance gains from doing so. Perhaps AWS believe I must be doing something wrong in the first place?
Indexes
RedShift does not have indexes.
It does have a SORT ORDER (the sort key), which is exceptionally similar to a clustered index. It is simply a list of fields by which the data is ordered (like a composite clustered index).
It has even recently introduced INTERLEAVED SORT KEYS. This is a direct attempt to have multiple independent sort orders. Instead of ordering by a THEN b THEN c, it effectively orders by each of them at the same time.
That becomes kind of possible because of how RedShift implements its column store.
- Each column is stored separately from each other column
- Each column is stored in 1MB blocks
- Each 1MB block has summary statistics
As well as being the storage pattern this effectively becomes a set of pseudo indexes.
- If the data is sorted by a then b then x
- But you want z = 1234
- RedShift looks at the block statistics (for column z) first
- Those stats will say the minimum and maximum values stored by that block
- This allows Redshift to skip many of those blocks in certain conditions
- This in turn allows RedShift to identify which blocks to read from the other columns
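The block-skipping steps above can be sketched as a toy model. This is not how Redshift is implemented internally; it only mimics the min/max idea with four-value "blocks", and the function names are made up for the example.

```python
BLOCK_ROWS = 4  # tiny block size for illustration; real blocks are 1MB

def build_zone_map(values, block_rows=BLOCK_ROWS):
    """Split a column into blocks and record (min, max) for each block."""
    blocks = [values[i:i + block_rows] for i in range(0, len(values), block_rows)]
    return [(min(b), max(b)) for b in blocks]

def blocks_to_read(zone_map, target):
    """Indexes of blocks whose min/max range could contain `target`."""
    return [i for i, (lo, hi) in enumerate(zone_map) if lo <= target <= hi]

sorted_z = list(range(16))  # column stored in sorted order
shuffled_z = [0, 15, 3, 12, 1, 14, 2, 13, 4, 11, 5, 10, 6, 9, 7, 8]

sorted_hits = blocks_to_read(build_zone_map(sorted_z), 9)
shuffled_hits = blocks_to_read(build_zone_map(shuffled_z), 9)
print(sorted_hits)    # one candidate block
print(shuffled_hits)  # every block overlaps the target
```

With the column in sorted order only one block can contain the value; with the values scattered, every block's range overlaps it and nothing can be skipped, which is why the statistics only help "in certain conditions".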
As of Dec 2019, Redshift has a preview of materialized views: Announcement
from the documentation: A materialized view contains a precomputed result set, based on a SQL query over one or more base tables. You can issue SELECT statements to query a materialized view, in the same way that you can query other tables or views in the database. Amazon Redshift returns the precomputed results from the materialized view, without having to access the base tables at all. From the user standpoint, the query results are returned much faster compared to when retrieving the same data from the base tables.
Indexes are basically used in OLTP systems to retrieve a specific value or a small group of values. In contrast, OLAP systems retrieve a large set of values and perform aggregation on it. Indexes would not be the right fit for OLAP systems. Instead, Redshift uses a secondary structure called zone maps, together with sort keys.
Indexes operate on B-trees. The 'life without a btree' section in the blog below explains, with examples, how a B-tree-based index affects OLAP workloads.
https://blog.chartio.com/blog/understanding-interleaved-sort-keys-in-amazon-redshift-part-1
The combination of columnar storage, compression encodings, data distribution, query compilation, optimization, etc. gives Redshift the power to be faster.
Applying the above factors reduces I/O operations on Redshift and eventually provides better performance. Implementing an efficient solution requires a great deal of knowledge of the above areas as well as of the queries that you would run on Amazon Redshift.
For example:
Redshift supports sort keys, compound sort keys and interleaved sort keys.
Suppose your table structure is lineitem(orderid,linenumber,supplier,quantity,price,discount,tax,returnflat,shipdate).
If you select orderid as your sort key but your queries filter on shipdate, Redshift will not be operating efficiently.
If you have a compound sort key on (orderid, shipdate) and your queries filter only on shipdate, Redshift will not be operating efficiently.
If you have an interleaved sort key on (orderid, shipdate) and if your query
Redshift does not support materialized views, but it easily allows you to create (temporary/permanent) tables by running select queries on existing tables. This duplicates data, but in the format required by the queries (similar to a materialized view). The blog below gives you some information on this approach.
https://www.periscopedata.com/blog/faster-redshift-queries-with-materialized-views-lifetime-daily-arpu.html
Redshift fared well against other systems like Hive, Impala, Spark, BQ, etc. in one of our recent benchmarks.
The simple answer is: because it can read the needed data really, really fast and in parallel.
One of the primary uses of indexes is "needle-in-the-haystack" queries. These are queries where only a relatively small number of rows are needed and these match a WHERE clause. Columnar datastores handle these differently. The entire column is read into memory -- but only the column, not the rest of the row's data. This is sort of similar to having an index on each column, except the values need to be scanned for the match (that is where the parallelism comes in handy).
Other uses of indexes are for matching key pairs for joining or for aggregations. These can be handled by alternative hash-based algorithms.
As for materialized views, RedShift's strength is not updating data. Many such queries are quite fast enough without materialization. And, materialization incurs a lot of overhead for maintaining the data in a high transaction environment. If you don't have a high transaction environment, then you can increment temporary tables after batch loads.
They recently added support for Materialized Views in Redshift: https://aws.amazon.com/about-aws/whats-new/2019/11/amazon-redshift-introduces-support-for-materialized-views-preview/
Syntax for materialized view creation:
CREATE MATERIALIZED VIEW mv_name
[ BACKUP { YES | NO } ]
[ table_attributes ]
AS query
Syntax for refreshing a materialized view:
REFRESH MATERIALIZED VIEW mv_name
I have been tasked with replacing a costly stored procedure which performs calculations across 10 - 15 tables, some of which contain many millions of rows. The plan is to pre-stage the many computations and store the results in separate tables for speeding reading.
Having quickly created these new tables and inserted all of the necessary pre-staged data as a test case, the execution time of getting the same results is vastly improved, as you would expect.
My question is, what is the best practice for keeping these new separate tables up to date?
A procedure which runs at a specific interval could do it, but there is a requirement for the data to be live.
A trigger on each table could do it, but that seems very costly, and could cause slow-downs for everywhere else that uses these tables.
Are there other alternatives?
Have you considered Indexed Views for this? As long as you meet the criteria for creating Indexed Views (no self joins etc) it may well be a good solution.
The downside of Indexed Views is that when the data in the underlying tables is changed (delete, update, insert), the indexed view has to be recalculated. This can slow down these types of operations in certain circumstances, so you have to be careful. I've put some links to documentation below:
https://www.brentozar.com/archive/2013/11/what-you-can-and-cant-do-with-indexed-views/
https://msdn.microsoft.com/en-GB/library/ms191432.aspx
https://technet.microsoft.com/en-GB/library/ms187864(v=sql.105).aspx
what is the best practice for keeping these new separate tables up to date?
The answer is: it depends. Depends on what?
1. How frequently you will use those computed values
2. What the acceptable data latency is
We too have the same kind of reporting, where we store computed values in separate tables and use them in reports. In our case we run these SPs through SQL Server Agent before sending the reports.
Consider using an A/B table solution. Place a generic view over the _A table version (CREATE VIEW MY_TABLE AS SELECT * FROM MY_TABLE_A). Then rebuild the _B version and switch the view to it (ALTER VIEW MY_TABLE AS SELECT * FROM MY_TABLE_B). It takes twice as much space for processing, but it gives you the opportunity to build your tables without downtime.
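A minimal sketch of the A/B swap, using Python's built-in sqlite3 module as a stand-in for SQL Server, could look like the following. The single total column and the rebuild_and_switch helper are hypothetical, and because SQLite has no ALTER VIEW the sketch drops and recreates the view inside one transaction instead.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_table_a (total INTEGER)")
conn.execute("CREATE TABLE my_table_b (total INTEGER)")
conn.execute("INSERT INTO my_table_a VALUES (1)")
conn.execute("CREATE VIEW my_table AS SELECT * FROM my_table_a")
conn.commit()

def rebuild_and_switch(connection, target, new_totals):
    """Reload the offline copy, then repoint the view at it."""
    if target not in ("a", "b"):
        raise ValueError("target must be 'a' or 'b'")
    table = f"my_table_{target}"
    with connection:  # commits on success, rolls back on error
        connection.execute(f"DELETE FROM {table}")
        connection.executemany(
            f"INSERT INTO {table} VALUES (?)", [(t,) for t in new_totals]
        )
        connection.execute("DROP VIEW my_table")
        connection.execute(f"CREATE VIEW my_table AS SELECT * FROM {table}")

rebuild_and_switch(conn, "b", [10, 20])
result = conn.execute("SELECT total FROM my_table ORDER BY total").fetchall()
print(result)
```

Readers keep querying my_table throughout; the old copy stays untouched until the next rebuild targets it.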