I can see the value of having row-inserted and row-last-updated fields in a fact table, but I could not find any data warehouse standard or reference that says this is a good thing to do. Is that because it is considered a bad practice, and if so, why? If the concern is data size, a full date field is only 8 bytes.
Any help is greatly appreciated!!!
There's nothing talking about whether it's a good or bad practice because we include creation time and updated time only if we need them or ever will need them.
It's a "good thing to do" if you need to access those columns and a "bad thing to do" if your table will never require those columns.
Including insert and update timestamps in your data warehouse tables lets you report from the perspective of "as was" and/or "as is" with regard to the data warehouse. These timestamps would be in addition to any timestamps that may be captured from the source.
They also make troubleshooting easier and, in a worst-case scenario, give you the ability to back out a set of data from a specific run of an ETL process.
At a previous client, the data model we implemented included upwards of 6 different timestamps to provide slowly changing history, as is/as was reporting, and source-related timestamps. It made for very flexible reporting but also increased the learning curve of how to get exactly what you wanted from the table(s).
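As a rough illustration of the points above, here is a minimal sketch using Python's built-in sqlite3 module. The table `fact_sales`, its columns `inserted_at` and `updated_at`, and the helper functions are all hypothetical names chosen for the example, and it assumes each ETL run stamps its rows with a single load timestamp. It only shows which rows the warehouse held at a point in time; reconstructing earlier values would need additional history columns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE fact_sales (
        sale_id     INTEGER PRIMARY KEY,
        amount      INTEGER NOT NULL,
        inserted_at TEXT NOT NULL,  -- when the warehouse first loaded the row
        updated_at  TEXT NOT NULL   -- when the warehouse last changed the row
    )
""")
conn.executemany(
    "INSERT INTO fact_sales VALUES (?, ?, ?, ?)",
    [
        (1, 100, "2024-01-01 02:00:00", "2024-01-01 02:00:00"),
        (2, 250, "2024-01-01 02:00:00", "2024-01-03 02:00:00"),  # changed by a later run
        (3, 75, "2024-01-02 02:00:00", "2024-01-02 02:00:00"),
    ],
)

def rows_known_as_of(conn, as_of):
    """'As was' row set: ids of rows that had been loaded by the given timestamp."""
    cur = conn.execute(
        "SELECT sale_id FROM fact_sales WHERE inserted_at <= ? ORDER BY sale_id",
        (as_of,),
    )
    return [row[0] for row in cur]

def changed_after_load(conn):
    """Troubleshooting aid: ids of rows that were updated after their initial load."""
    cur = conn.execute(
        "SELECT sale_id FROM fact_sales WHERE updated_at > inserted_at ORDER BY sale_id"
    )
    return [row[0] for row in cur]

def back_out_load(conn, load_timestamp):
    """Delete every row inserted by the run stamped with load_timestamp."""
    cur = conn.execute(
        "DELETE FROM fact_sales WHERE inserted_at = ?", (load_timestamp,)
    )
    return cur.rowcount  # number of rows removed
```

Calling `back_out_load(conn, "2024-01-02 02:00:00")` removes only the rows from that one run, which is the kind of recovery that is hard to do without a load timestamp on the row.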
I learned how to code in SQL about 2 months ago, so I'm still pretty new and still learning different commands/functions each day. I have been tasked with migrating some queries from Teradata to Redshift, and the syntax obviously differs in places. I have been able to replace most of it, but I am stuck on "SYS_CALENDAR". Can someone explain to me how SYS_CALENDAR works so I could potentially hard code it, or does anyone know of a suitable replacement that runs in AWS Redshift?
Thanks
As someone who has ported a large Teradata solution to Redshift let me say good luck. These are very different systems and porting the SQL to achieve functional equivalence is only the first challenge. I'm happy to have an exchange on what these challenges will likely be if you like but first off your question.
SYS_CALENDAR in Teradata is a system view that can be used like a normal view and that holds information about every date. It can be queried or joined as needed to get, for example, the day-of-week or week-of-year information about a date. It really performs a date calculation function based on OS information but is used like a view.
No equivalent view exists in Redshift and this creates some porting difficulties. Many create "DATES" tables in Redshift to hold the information they need for dates across some range and there are web pages on making such a table (ex. https://elliotchance.medium.com/building-a-date-dimension-table-in-redshift-6474a7130658). Just pre-calculate all the date information you need for the range of dates in your database and you can swap this into queries when porting. This is the simplest route to take for porting and is the one that many choose (sometimes wrongly).
The issue with this route is that a user-supported DATES table is often a time bomb waiting to go off, and technical debt for the solution. This table only has the dates you specify at creation, and the range of dates in use often expands over time. When it is used with a date that isn't in the DATES table, wrong answers are created and data is corrupted, usually silently. Not good. Some create processes to expand the date range, but again this is based on some "expectation" of how the table will be used. It is also a real table with ever-expanding data that is frequently joined, which can cause query performance issues, and it isn't really needed - a performance tax for all time.
The better long-term answer is to use the native Redshift (Postgres) date functions to operate on the dates as you need. Doing this uses the OS's understanding of dates (without bound) and does what Teradata does with the system view (calculate the needed information). For example, you can get the work-week of a date by using the DATE_PART() function instead of joining with the SYS_CALENDAR view. This approach doesn't have the downsides of the DATES table but does come with a porting cost. The structure of the queries needs to change (remove joins and add functions), which takes more work and requires an understanding of the original query. Unfortunately time, work, and understanding are things that are often in short supply when porting databases, which is why the DATES table approach is often seen and lives forever as technical debt.
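To make the trade-off concrete, here is a small sketch using Python's sqlite3 module as a stand-in for Redshift. Everything in it is an assumption for illustration: the `dates` and `orders` tables, and the `iso_week` function registered from Python, which plays the role that a native function such as DATE_PART() would play in Redshift. The DATES table covers 2024 only, so the join quietly drops the 2025 order, while the function handles both.

```python
import sqlite3
from datetime import date, timedelta

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dates (calendar_date TEXT PRIMARY KEY, iso_week INTEGER)")
conn.execute("CREATE TABLE orders (order_id INTEGER, order_date TEXT)")

# Pre-calculated DATES table: it only knows the range chosen at creation time.
day = date(2024, 1, 1)
while day <= date(2024, 12, 31):
    conn.execute(
        "INSERT INTO dates VALUES (?, ?)", (day.isoformat(), day.isocalendar()[1])
    )
    day += timedelta(days=1)

conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, "2024-06-03"), (2, "2025-03-10")],  # the second date is outside the table
)

# Stand-in for a native date function: computes the ISO week for any date.
conn.create_function(
    "iso_week", 1, lambda text: date.fromisoformat(text).isocalendar()[1]
)

# Join approach: the out-of-range order disappears without any error.
joined = conn.execute("""
    SELECT o.order_id, d.iso_week
    FROM orders AS o
    JOIN dates AS d ON d.calendar_date = o.order_date
    ORDER BY o.order_id
""").fetchall()

# Function approach: every order gets an answer, with no range to maintain.
computed = conn.execute(
    "SELECT order_id, iso_week(order_date) FROM orders ORDER BY order_id"
).fetchall()
```

Here `joined` contains one row and `computed` contains two. Nothing in the join version signals that a row went missing, which is the silent failure described above.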
I assume that this port is large in nature, and if so my recommendation is this - lay out these trade-offs for the stakeholders. If they cannot absorb the time to convert the queries (likely), propose the DATES table approach but have the technical debt clearly documented along with the "end date" at which functionality will break. I'd pick a somewhat close date, like 2025, so that some action will need to be on the long-term plans. Have triggers documented as to when action is needed.
This will not be the first of these "technical debt" issues that come up in a port such as this. There are too many places where "get it done" will trump "do it right". You haven't even scratched the surface on performance issues - these are very different databases, and data solutions tuned over time for Teradata will not perform optimally on Redshift after a simple port. This isn't an "all is lost" level issue. Just get the choices documented along with the long-term implications of those choices. Have triggers (dates or performance measures) defined for when aspects of the "port" will need to be followed up with an "optimization" effort. Management likes to forget about the need for follow-up on these efforts, so get these documented.
I'm looking for some ideas for managing very large SQL queries in Oracle.
My employer is looking to build very wide reports (150-200 columns of data per report).
Each item is a sub-query or an element from a view. The data has to be real time, so DW-style batch processing is not an option. We also don't use any BI tools, just a Java app that generates Excel (it's a requirement to output the data in Excel).
The query also contains unions as feeds from other systems.
The queries result in very large SQL ( about 1500 lines) that is very difficult to manage.
What strategies can I employ to make the work more manageable?
It is also not a performance problem. I was able to optimize the query to be very efficient; it's mostly the width of the query. Managing 200 columns is a challenge in itself.
I deal with queries this length daily and here is some of what helps me out in maintaining them:
First, alias every single one of those columns. When you are building the query you may know where each one came from, but when it is time to make a change it is really helpful to see exactly which table each column came from. This applies to join conditions, group by and where conditions as well as the select columns.
Organize in easily understandable and testable chunks. I use temp tables to pull things that make sense together and so I can see the results before the final query while in test mode.
This brings me to test mode. If I have chunks of data, I design the proc with a test mode and then query individual temp tables when in test mode, so I can see where the data went wrong if there is a bug. Not sure how Oracle works but in SQL Server, I make this the last parameter and give it a default value, so that it doesn't need to be passed in by the application.
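The chunk-plus-test-mode idea can be sketched roughly as follows. This uses Python's sqlite3 module rather than an Oracle or SQL Server procedure, and the tables, the temp table `customer_totals` and the function `build_report` are all made-up names for the example. As in the description above, the test-mode switch is the last parameter and has a default, so normal callers never pass it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (customer_id INTEGER, amount INTEGER);
    INSERT INTO customers VALUES (1, 'Ann'), (2, 'Bob');
    INSERT INTO orders VALUES (1, 10), (1, 15), (2, 7);
""")

def build_report(conn, test_mode=False):
    """Build the report in chunks; in test mode also return the intermediate chunk."""
    # Chunk 1: something that makes sense on its own and can be checked alone.
    conn.execute("DROP TABLE IF EXISTS temp.customer_totals")
    conn.execute("""
        CREATE TEMP TABLE customer_totals AS
        SELECT o.customer_id, SUM(o.amount) AS total_amount
        FROM orders AS o
        GROUP BY o.customer_id
    """)

    # Final query: joins the already-checked chunk to the remaining tables.
    report = conn.execute("""
        SELECT c.name, ct.total_amount
        FROM customers AS c
        JOIN customer_totals AS ct ON ct.customer_id = c.customer_id
        ORDER BY c.name
    """).fetchall()

    chunks = {}
    if test_mode:
        # Expose the temp table so a bad total can be traced to its chunk.
        chunks["customer_totals"] = conn.execute(
            "SELECT customer_id, total_amount FROM customer_totals ORDER BY customer_id"
        ).fetchall()
    return report, chunks
```

Calling `build_report(conn)` returns the report with an empty chunk dictionary, while `build_report(conn, test_mode=True)` also returns the contents of each intermediate table.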
Consider logging the execution details and the values of passed in parameters and certainly log any error messages. This will help tremendously when you have to troubleshoot why this report that has functioned perfectly for six years doesn't work for this one user.
Put each column on a separate line and do the same for where clauses. At times you may have to troubleshoot by commenting out joins until you find the one that is causing the problem. It is easier if you can easily comment out the associated fields as well.
If you don't have a technical design document, then at least use comments to explain your thought process. You want to understand the whys not the hows in any comments. This stuff is hard to come back to later and understand even when you wrote it. Give your future self some help.
In developing from scratch, I put the select list in and then comment all but the first item. Then I build the query only until I get that value - testing until I am sure what I got was correct. Then I add the next one and whatever joins or where conditions I might need to get it. Test again making sure it is right. (Oops why did that go from 1000 records to 20000 when I added that? Hmm maybe there is something I need to handle there or is that right?) By adding only one thing at a time, you will find an error in the logic much faster and be much more confident of your results. It will also take you less time than trying to build a massive query in one go.
Finally, there is no substitute for understanding your data. There are plenty of complex queries that work but do not give the correct answer. Know if you need an inner join or a left join. Know what where conditions you need to get the records you want. Know how to handle the records when you have a one-to-many relationship (this may require push back on the requirements): should you have 3 lines (one for each child record), should you put that data in a comma-delimited list, or should you pick only one of the many records and have one line using aggregation? If the latter, what are the criteria for choosing the record you want to keep?
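The three ways of handling a one-to-many relationship can be compared side by side. This is only a sketch in Python's sqlite3 module with invented `customers` and `phones` tables; GROUP_CONCAT is SQLite's list aggregate (Oracle has LISTAGG for the same purpose), and picking the smallest phone number with MIN is an arbitrary rule standing in for whatever criterion the requirements actually call for.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE phones (customer_id INTEGER, phone TEXT);
    INSERT INTO customers VALUES (1, 'Ann');
    INSERT INTO phones VALUES (1, '555-0101'), (1, '555-0102'), (1, '555-0103');
""")

# Option 1: one report line per child record (the parent repeats).
one_row_per_child = conn.execute("""
    SELECT c.name, p.phone
    FROM customers AS c
    JOIN phones AS p ON p.customer_id = c.customer_id
""").fetchall()

# Option 2: one line per parent, children collapsed into a delimited list.
# SQLite does not guarantee the order of items inside the list.
delimited = conn.execute("""
    SELECT c.name, GROUP_CONCAT(p.phone, ',')
    FROM customers AS c
    JOIN phones AS p ON p.customer_id = c.customer_id
    GROUP BY c.customer_id, c.name
""").fetchall()

# Option 3: one line per parent, keeping a single child chosen by a rule.
pick_one = conn.execute("""
    SELECT c.name, MIN(p.phone)
    FROM customers AS c
    JOIN phones AS p ON p.customer_id = c.customer_id
    GROUP BY c.customer_id, c.name
""").fetchall()
```

The first query returns three rows for one customer and the other two return one. That row-count difference is what to watch for when a report grows unexpectedly after adding a join.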
Without seeing the specifics of your problem, here are a couple of ideas that immediately come to mind:
If you are looking purely for management, I might suggest organizing your subqueries as a number of views and then referencing those views in your final query.
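A minimal sketch of that idea, again in Python's sqlite3 module with hypothetical table and view names (`v_order_totals`, `v_open_tickets`): each subquery becomes a named view that can be tested on its own, and the final report query shrinks to a list of joins.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (customer_id INTEGER, amount INTEGER);
    CREATE TABLE tickets (customer_id INTEGER, status TEXT);
    INSERT INTO customers VALUES (1, 'Ann'), (2, 'Bob');
    INSERT INTO orders VALUES (1, 10), (1, 15);
    INSERT INTO tickets VALUES (1, 'closed'), (2, 'open'), (2, 'closed');

    -- Each former subquery gets a name and can be queried in isolation.
    CREATE VIEW v_order_totals AS
        SELECT customer_id, SUM(amount) AS order_total
        FROM orders
        GROUP BY customer_id;

    CREATE VIEW v_open_tickets AS
        SELECT customer_id, COUNT(*) AS open_tickets
        FROM tickets
        WHERE status = 'open'
        GROUP BY customer_id;
""")

# The final query only wires the views together.
FINAL_QUERY = """
    SELECT c.name,
           COALESCE(ot.order_total, 0),
           COALESCE(tk.open_tickets, 0)
    FROM customers AS c
    LEFT JOIN v_order_totals AS ot ON ot.customer_id = c.customer_id
    LEFT JOIN v_open_tickets AS tk ON tk.customer_id = c.customer_id
    ORDER BY c.name
"""
report = conn.execute(FINAL_QUERY).fetchall()
```

The views add no storage and stay real time, so this fits the constraint that batch processing is not an option.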
For performance, on the other hand, you may want to consider creating temp tables or even materialized views (views whose results are stored and refreshed rather than recomputed on every query) to break up the heavier parts of your process.
If your queries require an enormous amount of subquerying in order to gain usable data, you might need to rethink your database design and possibly create a number of datamarts to easily access reporting data. Think of these as mini-warehouses sans the multi-year trended data.
Finally, I know you said you don't use any BI tools but this problem certainly seems like one that might make sense by organizing your data into "cubes" or Business Object "universes". It might be worthwhile to at least entertain the cost of bringing on a BI tool vs. the programming hours to support the current setup.
For example, in order to provide an effective way to query respondents' answers to a dynamic questionnaire, where responses are stored as keyword/response pairs.
I am aware that there may be some latency in updating the catalogue/text index as new entries are added, but this may not matter if reporting/querying is not a real time concern. (i.e. performed at some later date)
So in answer to my own question, the transactional aspect of this doesn't actually matter, does it?
I would distinguish between data consistency in the chosen storage and the gap between data arriving and appearing in the user's search results. You might use an external or even remote search solution for your application, and depending on the case the index update might take significant time.
I'm working with a legacy database which, due to poor management and design, has had a wild growth of columns which have never been used or are no longer being used.
Is it possible to somehow query for column usage? As in, how often a column is being selected (either specifically or with *, or joined on)?
Seems to me like this is something we should be able to retrieve somehow, but I have been unable to find anything like this.
Greetings,
F.B. ten Kate
Unfortunately, this analysis on the DB side isn't really going to be a full answer. I've seen a LOT of instances where application code only needed 3 columns of a 10+ column table, but selected them all anyway.
Your column would still show up on a usage report in any sort of trace or profiling you did, but it still may not ACTUALLY be in use.
You might have to either a) analyze the entire collection of apps that use this website or b) start drafting a return-on-investment style doc on whether it's worth rebuilding.
This article will give you a good idea of how to search all fixed code (procedures, views, functions and triggers) for the columns that are used. The code in the article searches for a specific table/column combination. You could easily adapt it to run for all columns. For anything dynamically executed, you'd probably have to set up a profiler trace.
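The article's code is not reproduced here, but the general idea of searching stored definitions for a column name can be sketched. The example below is an analogue in Python's sqlite3 module, not the SQL Server approach the article presumably uses: it scans the `sql` text that SQLite keeps in `sqlite_master` for views and triggers. The function `find_column_references` and the sample schema are hypothetical.

```python
import re
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (
        customer_id INTEGER PRIMARY KEY,
        name TEXT,
        fax_number TEXT,
        legacy_code TEXT
    );
    CREATE VIEW v_contacts AS
        SELECT customer_id, name, fax_number FROM customers;
    CREATE VIEW v_names AS
        SELECT customer_id, name FROM customers;
""")

def find_column_references(conn, column_name):
    """Return (type, name) of stored views/triggers whose text mentions the column."""
    # Whole-word match so 'name' does not match 'fax_number_name', for example.
    pattern = re.compile(r"\b" + re.escape(column_name) + r"\b", re.IGNORECASE)
    cur = conn.execute("""
        SELECT type, name, sql
        FROM sqlite_master
        WHERE type IN ('view', 'trigger') AND sql IS NOT NULL
        ORDER BY name
    """)
    return [(kind, name) for kind, name, sql in cur if pattern.search(sql)]
```

A text search like this is only a first pass. It misses `SELECT *`, dynamic SQL and anything issued by application code, and a match in stored code does not prove the column is still needed.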
Even if you could determine whether a column had been used in the past X period of time, would that be good enough? There may be some obscure program out there that populates a column once a week, a month, a year; or once every time they click the mystery button that no one ever clicks, or to log the report that only Fred in accounting ever runs (he quit two years ago), or that gets logged to if that one rare bug happens (during daylight savings time, perhaps?)
My point is, the only way you can truly be certain that a column is absolutely not used by anything is to review everything -- every call, every line of code, every ad hoc Excel data dump, every possible contingency -- everything that references the database. As this may be all but unachievable, try to get a formally defined group of programs and procedures that must be supported, bend over backwards to make sure they are supported, and be prepared to fix things when some overlooked or forgotten piece of functionality turns up.
I've got quite a long business process which eventually results in financial operations.
What matters in the end is quite exclusively these final operations, although I've got to keep a log of everything which led to it.
Since all the information contained in the final operations is available in other tables (used during the business process), it makes sense to use a view, but the view logic would be quite complicated (there are dozens of tables involved), and I'm concerned that:
even with appropriate indexes, a table will probably be way faster (my table will eventually contain millions of items, and should be fully searchable on almost all its columns)
the view logic would be complicated, so I'm afraid it may complicate things in a few years if I want to evolve my business logic.
For those two reasons, I'm a bit tempted to write the data to a table at the end of my business process instead of relying on a view, but duplicating the data doesn't smell right (and it also looks a bit like premature optimization, but since it's such a central point in my design, I'd like to address the issue ASAP).
Have you ever faced such a choice? What did you decide?
Edit : creating a table would clearly lead to duplication in my situation, ie. the data written in the table exists somewhere else in the database and could be retrieved using only joins without any calculations.
I think you answered your own question by writing it down, Brann.
The problem can be seen this way: on the one hand you have "real-time data". You have fresh data, and it's nice to create a view on top of it to show "real-time data" too.
But as time goes on, there is more data and the logic changes. So it's good to have stored summaries of the data as it was some time ago. It's very pragmatic: you are not duplicating data, because you recalculate it and save a summary of it into a new table.
So when you think of it this way, it's obvious that in this example a new table will be better. As you write:
Faster access
Can have more complicated logic
Have archive data unchanged when logic changes
So when these criteria (or some of them) are among your requirements, it's not a choice: you go with tables.
I would use a view only when showing fresh data derived from other fresh data, and only for very, very simple problems. When it gets more complicated, you always switch to a new table.
So do not be afraid to go for it. Having one summary table with faster access is a very neat solution, and it's a sign of a well-formed database.
Take care with the design of this table, so that when the business logic changes you do not need to change everything in this table from the ground up. And then everything will be OK!
I'm for the new table in this situation. The view has many disadvantages - performance clearly, complexity, and logic lock-in. However, IMHO the overarching reason is that as the underlying data changes, the value in your view will change too. In most instances this is a good thing; however, with financial operations, isn't it better to have a fixed record of what occurred?
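The difference between a recalculated view and a fixed record can be shown with a tiny sketch in Python's sqlite3 module. The schema (`prices`, `order_lines`, `v_order_amounts`, `financial_operations`) is invented for the example and is far simpler than the dozens of tables in the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE prices (item TEXT PRIMARY KEY, unit_price INTEGER);
    CREATE TABLE order_lines (order_id INTEGER, item TEXT, quantity INTEGER);
    INSERT INTO prices VALUES ('widget', 10);
    INSERT INTO order_lines VALUES (1, 'widget', 3);

    -- View: always recomputed from whatever the base tables say right now.
    CREATE VIEW v_order_amounts AS
        SELECT ol.order_id, SUM(ol.quantity * p.unit_price) AS amount
        FROM order_lines AS ol
        JOIN prices AS p ON p.item = ol.item
        GROUP BY ol.order_id;

    -- Table: a fixed record written once, at the end of the business process.
    CREATE TABLE financial_operations (
        order_id INTEGER PRIMARY KEY,
        amount INTEGER,
        recorded_at TEXT
    );
    INSERT INTO financial_operations
        SELECT order_id, amount, '2024-01-01' FROM v_order_amounts;
""")

def view_amount(conn, order_id):
    return conn.execute(
        "SELECT amount FROM v_order_amounts WHERE order_id = ?", (order_id,)
    ).fetchone()[0]

def recorded_amount(conn, order_id):
    return conn.execute(
        "SELECT amount FROM financial_operations WHERE order_id = ?", (order_id,)
    ).fetchone()[0]

amount_before = view_amount(conn, 1)

# The underlying data changes after the operation was recorded.
conn.execute("UPDATE prices SET unit_price = 12 WHERE item = 'widget'")

amount_in_view_after = view_amount(conn, 1)
amount_on_record_after = recorded_amount(conn, 1)
```

After the price change the view reports a different amount for the same order, while the recorded operation still shows what was actually booked.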
I always decide in favour of better normalization. In your case, though the view may be complicated, it's better to have that than a new table which has to be kept in sync with every data-changing operation. Plus, the view would always be current, while your end-of-business-day table population would only be current for a few hours a day.
Also, you have a bigger problem if the data in this table goes out of sync for whatever reason.
As MrTelly alluded to, are you sure that your end result table really is a duplication of the view data? Or, is it actually a record of the final action taken as a result of the items in the view data.
For a clearer example, let's say that every time my gas tank gets to half-empty I buy $10 of gas. I write this down in a log. One day I buy my gas and write it in my log, then later find out that my fuel gauge was broken and I really had 3/4 of a tank of gas. Should I now erase the $10 purchase from my log because the underlying data (the level of gas in my tank) has changed? Ok, maybe that's not a clearer example, but hopefully it gets the point across. Recording the results is a different thing from recording the events that led up to the result. This is especially true in financial applications. Therefore, I don't know that you're breaking normalization at all by storing the final outcome in its own table.
An indexed view is the way. There are quite a few limitations to this approach, and it carries some overhead if implemented incorrectly, but it's generally favorable. With this approach you won't need to keep track of the changes that take place in your base tables, and the data would accumulate itself nicely in that indexed view of yours. In theory.
Reference:
Improving Performance with SQL Server 2005 Indexed Views
Oracle: Materialized View Concepts and Architecture