AWS Redshift Migration from Teradata - sql

I learned how to code in SQL about 2 months ago, so I'm still pretty new and still learning different commands/functions each day. I have been tasked with migrating some queries from Teradata to Redshift, and there are obviously some syntax differences. I have been able to replace most of them, but I am stuck on a command, "SYS_CALENDAR". Can someone explain to me how SYS_CALENDAR works so I could potentially hard-code it, or does anyone know any suitable replacements that run within AWS Redshift?
Thanks

As someone who has ported a large Teradata solution to Redshift, let me say: good luck. These are very different systems, and porting the SQL to achieve functional equivalence is only the first challenge. I'm happy to have an exchange on what these challenges will likely be if you like, but first, your question.
SYS_CALENDAR in Teradata is a system view, usable like a normal view, that holds information about every date. It can be queried or joined as needed to get, for example, the day-of-week or week-of-year information for a date. It really performs a date calculation function based on OS information, but it is used like a view.
No equivalent view exists in Redshift, and this creates some porting difficulties. Many people create "DATES" tables in Redshift to hold the information they need for dates across some range, and there are web pages on making such a table (e.g. https://elliotchance.medium.com/building-a-date-dimension-table-in-redshift-6474a7130658). Just pre-calculate all the date information you need for the range of dates in your database, and you can swap this into queries when porting. This is the simplest route to take for porting and is the one that many choose (sometimes wrongly).
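For illustration only, here is a minimal sketch of what such a table might look like. Every name below (the table, its columns, and the one-row-per-day source "staging_dates") is a placeholder rather than anything from a real schema; the linked article shows one way to actually generate the rows.

-- Hypothetical date dimension; populate one row per day for your chosen range
CREATE TABLE dates (
    cal_date      DATE NOT NULL,
    day_of_week   INT,   -- 0 = Sunday, per Redshift's DATE_PART(dow, ...)
    week_of_year  INT,
    month_of_year INT,
    calendar_year INT
);

INSERT INTO dates
SELECT cal_date,
       DATE_PART(dow,  cal_date)::int,
       DATE_PART(week, cal_date)::int,
       DATE_PART(mon,  cal_date)::int,
       DATE_PART(year, cal_date)::int
FROM staging_dates;  -- assumed helper table holding one row per calendar day

Add whatever other pre-calculated columns your queries join SYS_CALENDAR for, since anything missing will have to be computed later anyway.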
The issue with this route is that a user-maintained DATES table is often a time bomb waiting to go off and a piece of technical debt for the solution. The table only has the dates you specify at creation, and the range of dates in use often expands over time. When it is used with a date that isn't in the DATES table, wrong answers are produced, data is corrupted, and it is usually silent. Not good. Some create processes to expand the date range, but again this is based on some "expectation" of how the table will be used. It is also a real table with ever-expanding data that is frequently used, causing potential query performance issues, and it isn't really needed - a performance tax for all time.
The better long-term answer is to use the native Redshift (Postgres) date functions to operate on the dates as you need. Doing this uses the OS's understanding of dates (without bound) and does what Teradata does with the system view (calculate the needed information). For example, you can get the work week of a date by using the DATE_PART() function instead of joining with the SYS_CALENDAR view. This approach doesn't have the downsides of the DATES table, but it does come with a porting cost. The structure of the queries needs to change (remove joins and add functions), which takes more work and requires understanding of the original query. Unfortunately, time, work, and understanding are things that are often in short supply when porting databases, which is why the DATES table approach is often seen and lives forever as technical debt.
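As a concrete (and simplified) illustration, a Teradata query that joins the calendar view to get a week number can usually be rewritten as a plain function call. The sales table and columns here are invented for the example:

-- Teradata style (illustrative): join to the system calendar view
-- SELECT s.sale_id, c.week_of_year
-- FROM sales s
-- JOIN SYS_CALENDAR.CALENDAR c ON c.calendar_date = s.sale_date;

-- Redshift style: compute the value inline instead
SELECT s.sale_id,
       DATE_PART(week, s.sale_date) AS week_of_year
FROM sales s;

One caveat: the two systems don't number things like weeks and days identically, so each date-part keyword needs to be checked against how the original query actually used the calendar columns.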
I assume that this port is large in nature, and if so my recommendation is this - lay out these trade-offs for the stakeholders. If they cannot absorb the time to convert the queries (likely), propose the DATES table approach but have the technical debt clearly documented, along with the "end date" at which functionality will break. I'd pick a somewhat close date, like 2025, so that some action will need to be put on the long-term plans. Have triggers documented as to when action is needed.
This will not be the only one of these "technical debt" issues that come up in a port such as this. There are too many places where "get it done" will trump "do it right". You haven't even scratched the surface on performance issues - these are very different databases, and data solutions tuned, over time, for Teradata will not perform optimally on Redshift after a simple port. This isn't an "all is lost" level issue. Just get the choices documented along with the long-term implications of those choices. Have triggers (dates or performance measures) defined for when aspects of the "port" will need to be followed up with an "optimization" effort. Management likes to forget about the need for follow-up on these efforts, so get these documented.

Related

Retrieve and interpret large amounts of data from a SQL Server database

I'm an electrical test engineer. I have programming experience with C, mostly for devices with 256 B of RAM or less, and not a lot of experience with SQL databases...
We have a database with production data, serial numbers, and testing results.
When the database was created, no tools were created to retrieve the data.
If we can't retrieve the data, the database may as well not exist.
We have the data, and the database exists. I want to create tools to retrieve and interpret the data, and in the future do statistical analysis on it.
The database has over 500k unique devices, with over 10 million measurements.
My question is: what's the most sensible way to retrieve and display the data?
For instance: a program that loops through every entry and records the data will be complicated to write and will take days to complete.
The program and queries get complicated very fast.
We have device types, batch numbers, and serial numbers.
For every DISTINCT (DeviceType)
  For every DISTINCT (Batch number)
    COUNT DISTINCT (Serial number) WHERE...
      User <> 'development'...
      AND Testing result <> 'FAIL'...
      AND Date BETWEEN ... AND ...
Not to mention the measurement data, as each device may be tested multiple times. It seemed a trivial task; I'm now overwhelmed by the complexity.
I will create the code and queries myself. What I ask for is help finding a strategy.
Ask yourself what questions you want to answer by recourse to the data. If the data is recorded in the most granular way possible, then it may be appropriate to consider common grouping or aggregation methods. These might include grouping by device, or location, or something else - each of these dimensions will have a business interpretation.
Writing down 3 top business requests should give you a starting point for constructing your extract/analysis strategy.
Next, try to draw together a data model: find out what tables exist and what references and relations they have to one another.
Together, between the questions you want to answer and the table-design, you should then be in a position to start constructing sensible general use queries.
Sometimes you may find different business questions can be answered using a common view of the data. When you're happy with an extract path, you can write it out in SQL and, if appropriate, create a VIEW from that query. This abstracts the problem and makes it more convenient for users to actually get the answers they're looking for.
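As a sketch of what that might look like for the device-test example in the question (every table and column name below is a guess, not the actual schema): first a view that encodes the common filters, then an aggregate query on top of it.

-- View holding only "real" test rows (hypothetical names)
CREATE VIEW v_valid_tests AS
SELECT DeviceType, BatchNumber, SerialNumber, TestDate
FROM TestResults
WHERE TestUser <> 'development'
  AND Result <> 'FAIL';
GO

-- Example question: how many distinct devices passed, per type and batch, in a period?
SELECT DeviceType, BatchNumber, COUNT(DISTINCT SerialNumber) AS passed_devices
FROM v_valid_tests
WHERE TestDate BETWEEN '2024-01-01' AND '2024-03-31'   -- example range
GROUP BY DeviceType, BatchNumber;

The view captures the business rules once ("ignore development runs and failures"), and each report then only has to express its own grouping and date range.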
Your database will provide tools to write and run SQL statements, and you will need to refer to the documentation for your database to figure out how that happens - it's usually similar, but implementations differ across databases.

Data Aggregation - Daily SQL Script vs Data Warehouse

Pardon me if this has already been asked (I know very little about Data Warehouse/BI and have yet to master the keywords).
I have a table that grows by more than 100,000 rows per day, each row having a timestamp and several pieces of information about an item (dimensions, weight, color, etc.). Individual data is useful for roughly a month; after this period we are only interested in aggregations. I have dedicated software that allows a more detailed visualisation of individual rows, and I mainly use PowerPivot for my reporting needs.
I could come up with an SQL query that would fill a new table daily:
In it I would have a row for each hour/item/batch, summarizing the information (sum/average/stddev/etc.).
Within a day my script would be up and running, and I could use PowerPivot against this new table - all this while staying where I'm comfortable: plain old SQL.
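A minimal sketch of such a daily script, in PostgreSQL-flavoured SQL - every table and column name here is invented, so the date functions and measures would need adapting to the actual engine and schema:

-- Roll yesterday's raw rows up to one row per hour/item/batch
INSERT INTO item_hourly_summary
    (measure_hour, item_id, batch_id, row_count, weight_avg, weight_stddev)
SELECT DATE_TRUNC('hour', measured_at),
       item_id,
       batch_id,
       COUNT(*),
       AVG(weight),
       STDDEV(weight)
FROM item_measurements
WHERE measured_at >= CURRENT_DATE - 1   -- yesterday at 00:00
  AND measured_at <  CURRENT_DATE       -- today at 00:00
GROUP BY DATE_TRUNC('hour', measured_at), item_id, batch_id;

Scheduled nightly (cron, SQL Agent, or similar), this keeps the summary table one day behind the raw data, which PowerPivot can then query directly.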
From the little information I have gathered reading about data warehousing and BI, what I'm about to do sounds a lot like creating dimensions and facts. My question, therefore: is it worthwhile to investigate further in that direction (BI), or, since my problem is relatively simple, would I do better staying with a plain relational database?
N.B. The reports being produced are usually linked against another database to produce more meaningful information, a task that PowerPivot accomplishes very well.
Data warehouses are normally implemented in relational databases, so your existing skills will still be usable.
Given that you have expressed an interest in the dimension/fact table approach to data warehousing, the canonical books on this approach are usually considered to be:
The Data Warehouse Toolkit (Kimball, Ross)
The Data Warehouse Lifecycle Toolkit (Kimball, Ross, Thornthwaite, Mundy, Becker)
(The former has more of a technical focus, while the latter approaches the subject from a wider lifecycle management viewpoint.)
Implementing DWHs can be time-consuming, so it may be worth continuing with your existing approach even if you decide to build a DWH.
Good news: it sounds like you already have a data warehouse. "Data warehouse" is a very generic term, with no real formal definition - it pretty much means whatever you want it to.
Commonly accepted characteristics are:
Data warehouses do not run on the operational databases
Data warehouse schemas are optimized for querying, not for "normal form" compliance
Data warehouses are populated by "Extract, Transform, Load" (ETL) processes.
It sounds like you're already doing all of that. If there are no business requirements driving a change, I'd leave it as it is. If your business users are asking to create their own queries using different levels of aggregation, filtering, or granularity, a star schema may be the way to go.
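If you do go that way, a star schema for this data would just mean a central fact table at your chosen grain plus small dimension tables it references - roughly the following, with hypothetical names:

-- Dimensions: one row per item and one row per calendar day
CREATE TABLE dim_item (
    item_key  INT PRIMARY KEY,
    item_code VARCHAR(50),
    color     VARCHAR(30)
);
CREATE TABLE dim_day (
    day_key   INT PRIMARY KEY,  -- e.g. 20240115
    cal_date  DATE,
    cal_month INT,
    cal_year  INT
);

-- Fact table: one row per hour/item/batch, holding only keys and measures
CREATE TABLE fact_item_hourly (
    day_key      INT REFERENCES dim_day (day_key),
    item_key     INT REFERENCES dim_item (item_key),
    batch_id     VARCHAR(30),
    measure_hour SMALLINT,
    weight_avg   DECIMAL(10,3),
    row_count    INT
);

That is essentially the table your daily script already fills, just split so users can slice by any dimension attribute without touching the raw data.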
The most effective solutions are those which are simple, adequate to meet existing needs, and stay within available skill sets.
I agree that this approach works well for your situation, and if it provides the reports and information you need, then it's worth starting this way. If you need more complex functionality later, then you can go for more complex BI.

Advice for hand-written OLAP-like extractions from a relational database

We've implemented, over the course of the years, a series of web-based reports summarizing historical business data (product sales, traffic, etc.). The thing relies heavily on complex SQL queries, and the boss expects the results in real time, but they need up to a minute to execute. The reports are customizable on several dimensions.
I've done some basic research, and it looks like what we need is some kind of OLAP (?), ETL(?), whatever.
Is that true? Are we supposed to convert to a whole package and trash our beloved developments, or is there a possibility to keep it relational, SQL-based, and get close to a dedicated solution by simply pre-calculating some optimized views with a batch process running at night? Have you got pointers to good documentation on the subject?
Thank you.
You can do ETL (Extract, transform, and load) at night, loading the (probably summarized) data into tables that can usually be queried pretty quickly. Appropriate indexes are still important.
It often makes sense to put those summary tables in a different schema, a different database, or on a different server, but you don't absolutely have to do that.
The structure of the tables is important, and it's not like designing tables for an OLTP system. The IBM Redbooks have a couple of titles that can help you design the tables.
Data Modeling Techniques for Data Warehousing
Dimensional Modeling: In a Business Intelligence Environment
Most DBMSs today support SQL analytic functions. See, for example, "Analytic Functions by Example" for Oracle, or "Window Functions" for PostgreSQL.
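For instance, a running total computed with a window function - standard SQL accepted by both Oracle and PostgreSQL; the sales table here is made up:

-- Each row keeps its own amount plus a running total per product, in one pass
SELECT product_id,
       sale_date,
       amount,
       SUM(amount) OVER (PARTITION BY product_id
                         ORDER BY sale_date) AS running_total
FROM sales;

Functions like this often replace self-joins or correlated subqueries in report SQL, which is where much of a minute-long execution time tends to hide.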
In the long term, it sounds as though a move to a data warehouse would definitely benefit you (as suggested in Catcall's answer). You can use the existing reports as a starting point for your data warehouse's requirements.
In the short term, you could build summarised tables optimised for your existing reporting requirements. This should probably be regarded as a stopgap, unless you are never going to change these reports again.
You might also benefit from looking into partitioning tables in your database by date/time, since you will probably still want to report the current day's data for realtime reporting purposes.

Where's the tradeoff between normalization (SQL View) and performance/reliability (SQL Table)

I've got quite a long business process which eventually results in financial operations.
What matters in the end is almost exclusively these final operations, although I've got to keep a log of everything that led to them.
Since all the information contained in the final operations is available in other tables (used during the business process), it makes sense to use a view, but the view logic would be quite complicated (there are dozens of tables involved), and I'm concerned that:
even with appropriate indexes, a table will probably be way faster (my table will eventually contain millions of items, and should be fully searchable on almost all its columns)
the view logic would be complicated, so I'm afraid it may complicate things in a few years if I want to evolve my business logic.
For those two reasons, I'm a bit tempted to write the data to a table at the end of my business process instead of relying on a view, but duplicating the data doesn't smell right (and it also looks a bit like premature optimization - but since it's such a central point in my design, I'd like to address the issue ASAP).
Have you ever faced such a choice? What did you decide?
Edit: creating a table would clearly lead to duplication in my situation, i.e. the data written to the table exists somewhere else in the database and could be retrieved using only joins, without any calculations.
I think you answered your own question by writing it down, Brann.
This problem can be seen this way: on the one hand you have "real-time data". The data is fresh, and from it it's nice to create a view that shows "real-time data" too.
But as time goes on, there is more data and the logic changes. So it's good to have written-down summaries of the data you had some time ago. It's very pragmatic - you do not really duplicate data, because you recalculate it and save a summary of it into a new table.
When you think of it this way, it's obvious that in this example a new table will be better. As you write:
Faster access
Can have more complicated logic
Archive data stays unchanged when the logic changes
So when your requirements meet these criteria (or part of them), there is no real choice - you go with tables.
I would only use a view when showing fresh data derived from other fresh data, in very, very simple problems. When it gets more complicated, you switch to a new table.
So do not be afraid to go for it. Having one summary table with faster access is a very neat solution and a sign of a well-designed database.
Take care with the design of this table, so that when the business logic changes you do not need to rework the whole thing. Then everything will be OK!
I'm for the new table in this situation. The view has many disadvantages - performance, clearly, plus complexity and logic lock-in. However, IMHO the overarching reason is that as the underlying data changes, the values in your view will change as well. In most instances this is a good thing; however, with financial operations, isn't it better to have a fixed record of what occurred?
I always decide in favour of better normalization. In your case, though the view may be complicated, it's better to have that than a new table which has to be kept in sync with every data-changing operation. Plus, the view would always be current, while your end-of-business-day table population would only be current for a few hours a day.
Also, you have a bigger problem if the data in this table goes out of sync for whatever reason.
As MrTelly alluded to, are you sure that your end-result table really is a duplication of the view data? Or is it actually a record of the final action taken as a result of the items in the view data?
For a clearer example, let's say that every time my gas tank gets to half-empty I buy $10 of gas. I write this down in a log. One day I buy my gas and write it in my log, then later find out that my fuel gauge was broken and I really had 3/4 of a tank of gas. Should I now erase the $10 purchase from my log because the underlying data (the level of gas in my tank) has changed? OK, maybe that's not a clearer example, but hopefully it gets the point across. Recording the results is a different thing from recording the events that led up to the result. This is especially true in financial applications. Therefore, I don't think you're breaking normalization at all by storing the final outcome in its own table.
An indexed view is the way to go. There are quite a few limitations to this approach, but it's generally favorable, although it has some overhead if implemented incorrectly. With this approach you won't need to keep track of the changes that take place in your base tables, and the data will accumulate itself nicely in that indexed view of yours. In theory.
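A minimal SQL Server sketch of that idea, with invented table and column names (in Oracle the analogous feature is a materialized view):

-- The view must be schema-bound and deterministic; COUNT_BIG(*) is required
-- when an indexed view aggregates, and SUM() must be over a NOT NULL column.
CREATE VIEW dbo.v_daily_operation_totals
WITH SCHEMABINDING
AS
SELECT account_id,
       operation_date,
       SUM(amount)  AS total_amount,
       COUNT_BIG(*) AS row_count
FROM dbo.financial_operations
GROUP BY account_id, operation_date;
GO

-- The unique clustered index is what actually materializes and maintains the rows
CREATE UNIQUE CLUSTERED INDEX ix_v_daily_operation_totals
    ON dbo.v_daily_operation_totals (account_id, operation_date);

Whether this fits depends on the restrictions (no outer joins, no subqueries, various required SET options), so it is worth checking the whitepaper referenced below before committing to it.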
Reference:
Improving Performance with SQL Server 2005 Indexed Views
Oracle: Materialized View Concepts and Architecture

How can my application benefit from temporary tables?

I've been reading a little about temporary tables in MySQL but I'm an admitted newbie when it comes to databases in general and MySQL in particular. I've looked at some examples and the MySQL documentation on how to create a temporary table, but I'm trying to determine just how temporary tables might benefit my applications and I guess secondly what sorts of issues I can run into. Granted, each situation is different, but I guess what I'm looking for is some general advice on the topic.
I did a little googling but didn't find exactly what I was looking for on the topic. If you have any experience with this, I'd love to hear about it.
Thanks,
Matt
Temporary tables are often valuable when you have a fairly complicated SELECT you want to perform and then perform a bunch of queries on that...
You can do something like:
-- Customers with more than 10 purchases, kept for the rest of the session
CREATE TEMPORARY TABLE myTopCustomers AS
SELECT customers.*, COUNT(*) AS num FROM customers JOIN purchases USING (customerID)
JOIN items USING (itemID) GROUP BY customers.customerID HAVING num > 10;
And then do a bunch of queries against myTopCustomers without having to do the joins to purchases and items on each query. Then when your application no longer needs the database handle, no cleanup needs to be done.
Almost always you'll see temporary tables used for derived tables that were expensive to create.
First a disclaimer - my job is reporting so I wind up with far more complex queries than any normal developer would. If you're writing a simple CRUD (Create Read Update Delete) application (this would be most web applications) then you really don't want to write complex queries, and you are probably doing something wrong if you need to create temporary tables.
That said, I use temporary tables in Postgres for a number of purposes, and most will translate to MySQL. I use them to break up complex queries into a series of individually understandable pieces. I use them for consistency - by generating a complex report through a series of queries, and offloading some of those queries into modules I use in multiple places, I can make sure that different reports are consistent with each other. (And make sure that if I need to fix something, I only need to fix it once.) And, rarely, I deliberately use them to force a specific query plan. (Don't try this unless you really understand what you are doing!)
So I think temp tables are great. But that said, it is very important for you to understand that databases generally come in two flavors. The first is optimized for pumping out lots of small transactions, and the other is optimized for pumping out a smaller number of complex reports. The two types need to be tuned differently, and a complex report run on a transactional database runs the risk of blocking transactions (and therefore making web pages not return quickly). Therefore you generally want to avoid using one database for both purposes.
My guess is that you're writing a web application that needs a transactional database. In that case, you shouldn't use temp tables. And if you do need complex reports generated from your transactional data, a recommended best practice is to take regular (eg daily) backups, restore them on another machine, then run reports against that machine.
The best place to use temporary tables is when you need to pull a bunch of data from multiple tables, do some work on that data, and then combine everything to one result set.
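A small MySQL sketch of that pattern, with invented table and column names: pull the subset once, work on it, then join it back for the final result.

-- 1. Pull the interesting rows once into a session-scoped temp table
CREATE TEMPORARY TABLE tmp_recent_orders AS
SELECT orderID, customerID, total
FROM orders
WHERE created_at >= CURDATE() - INTERVAL 30 DAY;

-- 2. Do some work on the copy (hypothetical adjustment)
UPDATE tmp_recent_orders
SET total = total * 1.20;

-- 3. Combine with other tables for the final result set
SELECT c.name, SUM(t.total) AS adjusted_total
FROM tmp_recent_orders t
JOIN customers c USING (customerID)
GROUP BY c.name;

The temp table disappears when the connection closes, so nothing needs cleaning up, but note MySQL's quirk that a temporary table can't be referenced twice in the same query.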
In MS SQL, temporary tables should also be used in place of cursors whenever possible, because of the speed and resource impact associated with cursors.
If you are new to databases, there are some good books by Joe Celko that review best practices for ANSI SQL. SQL for Smarties will describe in great detail the use of temp tables, the impact of indexes, WHERE clauses, etc. It's a great reference book with in-depth detail.
I've used them in the past when I needed to create evaluated data. That was before the time of views and subselects in MySQL, though, and I generally use those now where I would have needed a temporary table. The only time I might still use them is if the evaluated data took a long time to create.
I haven't done them in MySQL, but I've done them on other databases (Oracle, SQL Server, etc).
Among other tasks, temporary tables provide a way for you to create a queryable (and returnable, say from a sproc) dataset that's purpose-built. Let's say you have several tables of figures -- you can use a temporary table to roll those figures up to nice, clean totals (or other math), then join that temp table to others in your schema for final output. (An example of this, in one of my projects, is calculating how many scheduled calls a given sales-related employee must make per week, bi-weekly, monthly, etc.)
I also often use them as a means of "tilting" the data -- turning columns to rows, etc. They're good for advanced data processing -- but only use them when you need to. (My golden rule, as always, applies: If you don't know why you're using x, and you don't know how x works, then you probably shouldn't use it.)
Generally, I wind up using them most in sprocs, where complex data processing is needed. I'd love to give a concrete example, but mine would be in T-SQL (as opposed to MySQL's more standard SQL), and also they're all client/production code which I can't share. I'm sure someone else here on SO will pick up and provide some genuine sample code; this was just to help you get the gist of what problem domain temp tables address.