How do I trace back the actual columns used/joined in a calculated measure? The reason I am asking is that I am trying to write the equivalent T-SQL query to verify the result against that of the calculated measure.
So far, my approach has been to look up the measure properties and find the table/view and the column names used. The joining columns have been the difficult part (let's say it has been impossible for me), because the DSV looks very messy and it is hard to follow the lines.
Any suggestions appreciated!
I suppose there is no hope of contacting the person who created the cube, or whoever currently maintains it? If you are not familiar with the data warehouse the cube comes from, I can see why this would feel like a nightmare. There may be various procs/jobs which load data in.
One ongoing problem in my last job was a customer upset that the cube totals did not match the figures from other places in our web application. Each week there would be a debate between the OLAP team and the web team to decide which value was "right" and how we could explain the difference!
So I learned how to code in SQL about 2 months ago, so I'm still pretty new and still learning different commands/functions each day. I have been tasked with migrating some queries from Teradata to Redshift, and there are obviously some syntax differences. I have been able to replace most of them, but I am stuck on "SYS_CALENDAR". Can someone explain to me how SYS_CALENDAR works so I could potentially hard-code it, or does anyone know any suitable replacements that run within AWS Redshift?
Thanks
As someone who has ported a large Teradata solution to Redshift, let me say good luck. These are very different systems, and porting the SQL to achieve functional equivalence is only the first challenge. I'm happy to have an exchange on what these challenges will likely be if you like, but first, your question.
SYS_CALENDAR in Teradata is a system view that can be used like a normal view and holds information about every date. It can be queried or joined as needed to get, for example, the day-of-week or week-of-year information for a date. It really performs a date-calculation function based on OS information but is used like a view.
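For illustration only, a typical Teradata query against it might look something like this (the orders table is invented; day_of_week and week_of_year are columns of SYS_CALENDAR.CALENDAR):

    -- Teradata: join SYS_CALENDAR.CALENDAR to enrich rows with date attributes
    SELECT o.order_id,
           o.order_date,
           c.day_of_week,     -- numeric day of week
           c.week_of_year
    FROM   orders o
           JOIN sys_calendar.calendar c
             ON c.calendar_date = o.order_date;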
No equivalent view exists in Redshift, and this creates some porting difficulties. Many people create "DATES" tables in Redshift to hold the information they need for dates across some range, and there are web pages on making such a table (e.g. https://elliotchance.medium.com/building-a-date-dimension-table-in-redshift-6474a7130658). Just pre-calculate all the date information you need for the range of dates in your database, and you can swap this into queries when porting. This is the simplest route to take for porting and is the one that many choose (sometimes wrongly).
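A minimal sketch of that approach, assuming placeholder names and a hand-picked date range (the linked article builds a much fuller dimension):

    -- Redshift: pre-compute a fixed range of dates once, then join to it instead of SYS_CALENDAR
    CREATE TABLE dates AS
    WITH digits AS (              -- 0..9, used to build a number sequence
        SELECT 0 AS d UNION ALL SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL
        SELECT 4 UNION ALL SELECT 5 UNION ALL SELECT 6 UNION ALL SELECT 7 UNION ALL
        SELECT 8 UNION ALL SELECT 9
    ),
    seq AS (                      -- 0..9999, roughly 27 years of days
        SELECT a.d + b.d * 10 + c.d * 100 + e.d * 1000 AS n
        FROM digits a, digits b, digits c, digits e
    )
    SELECT DATEADD(day, n, DATE '2015-01-01')::date            AS calendar_date,
           DATE_PART(dow,  DATEADD(day, n, DATE '2015-01-01')) AS day_of_week,
           DATE_PART(week, DATEADD(day, n, DATE '2015-01-01')) AS week_of_year
    FROM   seq;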
The issue with this route is that a user-maintained DATES table is often a time bomb waiting to go off and technical debt for the solution. This table only has the dates you specify at creation, and the range of dates often expands over time. When it is used with a date that isn't in the DATES table, wrong answers are created, data is corrupted, and it is usually silent. Not good. Some create processes to expand the date range, but again this is based on some "expectation" of how the table will be used. It is also a real table with ever-expanding data that is frequently used, causing potential query performance issues, and it isn't really needed: a performance tax for all time.
The better long-term answer is to use the native Redshift (Postgres) date functions to operate on the dates as you need. Doing this uses the OS's understanding of dates (without bound) and does what Teradata does with the system view (calculate the needed information). For example, you can get the week-of-year of a date by using the DATE_PART() function instead of joining with the SYS_CALENDAR view. This approach doesn't have the downsides of the DATES table, but it does come with a porting cost. The structure of the queries needs to change (remove joins and add functions), which takes more work and requires understanding of the original query. Unfortunately, time, work, and understanding are often in short supply when porting databases, which is why the DATES table approach is often seen and lives forever as technical debt.
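As a hedged before/after sketch (the sales table and its columns are invented for illustration):

    -- Teradata original: the date attribute comes from a join to SYS_CALENDAR
    SELECT s.item_id,
           c.week_of_year,
           SUM(s.amount) AS total_amount
    FROM   sales s
           JOIN sys_calendar.calendar c ON c.calendar_date = s.sale_date
    GROUP  BY s.item_id, c.week_of_year;

    -- Redshift port: drop the join and compute the same attribute with a date function
    SELECT item_id,
           DATE_PART(week, sale_date) AS week_of_year,
           SUM(amount)                AS total_amount
    FROM   sales
    GROUP  BY item_id, DATE_PART(week, sale_date);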
I assume that this port is large in nature, and if so my recommendation is this: lay out these trade-offs for the stakeholders. If they cannot absorb the time to convert the queries (likely), propose the DATES table approach, but have the technical debt clearly documented along with the "end date" at which functionality will break. I'd pick a somewhat close date, like 2025, so that some action will need to appear on the long-term plans. Have triggers documented as to when action is needed.
This will not be the last of these "technical debt" issues that come up in a port such as this. There are too many places where "get it done" will trump "do it right". You haven't even scratched the surface on performance issues: these are very different databases, and data solutions tuned over time for Teradata will not perform optimally on Redshift after a simple port. This isn't an "all is lost" level of issue. Just get the choices documented along with the long-term implications of those choices. Have triggers (dates or performance measures) defined for when aspects of the "port" will need to be followed up with an "optimization" effort. Management likes to forget about the need for follow-up on these efforts, so get these documented.
I'm looking for some ideas managing very large SQL queries in Oracle.
My employer is looking to build very wide reports (150-200 columns of data per report).
Each item is a sub-query or an element from a view. The data has to be real-time, so DW-style batch processing is not an option. We also don't use any BI tools, just a Java app that generates Excel (it's a requirement to output data in Excel).
The query also contains unions as feeds from other systems.
The queries result in very large SQL (about 1,500 lines) that is very difficult to manage.
What strategies can I employ to make the work more manageable?
It is also not a performance problem; I was able to optimize the query to be very efficient. It's mostly the width of the query: managing 200 columns is a challenge in itself.
I deal with queries of this length daily, and here is some of what helps me in maintaining them:
First, alias every single one of those columns. When you are building the query you may know where each one came from, but when it is time to make a change, it is really helpful to know exactly where each column came from. This applies to join conditions, group by and where conditions, as well as the select columns.
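For example (table and column names invented), something like this, so every column's origin is obvious at a glance:

    SELECT ord.order_id,
           ord.order_date,
           cust.customer_name,
           addr.postal_code
    FROM   orders ord
           JOIN customers        cust ON cust.customer_id = ord.customer_id
           JOIN customer_address addr ON addr.customer_id = cust.customer_id
    WHERE  ord.order_date >= DATE '2020-01-01';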
Organize in easily understandable and testable chunks. I use temp tables to pull together things that make sense, so I can see the results before the final query while in test mode.
This brings me to test mode. If I have chunks of data, I design the proc with a test mode and then query the individual temp tables when in test mode, so I can see where the data went wrong if there is a bug. I'm not sure how Oracle works, but in SQL Server I make this the last parameter and give it a default value, so that it doesn't need to be passed in by the application.
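A rough T-SQL sketch of that pattern (the procedure, table, and column names are all hypothetical):

    CREATE PROCEDURE dbo.BigReport
        @StartDate date,
        @EndDate   date,
        @TestMode  bit = 0   -- last parameter with a default, so the app never has to pass it
    AS
    BEGIN
        -- stage an understandable, testable chunk into a temp table
        SELECT o.OrderId, o.OrderDate, o.CustomerId, o.Amount
        INTO   #orders_chunk
        FROM   dbo.Orders o
        WHERE  o.OrderDate BETWEEN @StartDate AND @EndDate;

        IF @TestMode = 1
        BEGIN
            SELECT * FROM #orders_chunk;   -- inspect the intermediate result and stop
            RETURN;
        END;

        -- final report query built from the staged chunk
        SELECT c.CustomerName, SUM(oc.Amount) AS TotalAmount
        FROM   #orders_chunk oc
               JOIN dbo.Customers c ON c.CustomerId = oc.CustomerId
        GROUP  BY c.CustomerName;
    END;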
Consider logging the execution details and the values of passed in parameters and certainly log any error messages. This will help tremendously when you have to troubleshoot why this report that has functioned perfectly for six years doesn't work for this one user.
Put each column on a separate line and do the same for where clauses. At times you may have to troubleshoot by commenting out joins until you find the one that is causing the problem. It is easier if you can easily comment out the associated fields as well.
If you don't have a technical design document, then at least use comments to explain your thought process. You want to understand the whys not the hows in any comments. This stuff is hard to come back to later and understand even when you wrote it. Give your future self some help.
In developing from scratch, I put the select list in and then comment all but the first item. Then I build the query only until I get that value - testing until I am sure what I got was correct. Then I add the next one and whatever joins or where conditions I might need to get it. Test again making sure it is right. (Oops why did that go from 1000 records to 20000 when I added that? Hmm maybe there is something I need to handle there or is that right?) By adding only one thing at a time, you will find an error in the logic much faster and be much more confident of your results. It will also take you less time than trying to build a massive query in one go.
Finally, there is no substitute for understanding your data. There are plenty of complex queries that work but do not give the correct answer. Know if you need an inner join or a left join. Know what where conditions you need to get the records you want. Know how to handle the records when you have a one-to-many relationship (this may require pushing back on the requirements): should you have 3 lines (one for each child record), should you put that data in a comma-delimited list, or should you pick only one of the many records and have one line using aggregation? If the latter, what is the criterion for choosing the record you want to keep?
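To make those three options concrete, here is an Oracle-flavoured sketch (the parents/children tables and their columns are invented):

    -- Option 1: one line per child record (the row count multiplies)
    SELECT p.parent_id, c.child_note
    FROM   parents p
           JOIN children c ON c.parent_id = p.parent_id;

    -- Option 2: collapse the children into a comma-delimited list
    SELECT p.parent_id,
           LISTAGG(c.child_note, ', ') WITHIN GROUP (ORDER BY c.child_note) AS child_notes
    FROM   parents p
           JOIN children c ON c.parent_id = p.parent_id
    GROUP  BY p.parent_id;

    -- Option 3: keep only one child per parent, e.g. the most recent one
    SELECT parent_id, child_note
    FROM  (SELECT c.parent_id, c.child_note,
                  ROW_NUMBER() OVER (PARTITION BY c.parent_id ORDER BY c.created_at DESC) AS rn
           FROM   children c)
    WHERE  rn = 1;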
Without seeing the specifics of your problem, here are a couple of ideas that immediately come to mind:
If you are looking purely for manageability, I might suggest organizing your subqueries as a number of views and then referencing those views in your final query (see the sketch below).
For performance, on the other hand, you may want to consider creating temp tables or even materialized views (views whose results are physically stored) to break up the heavier parts of your process.
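A hedged Oracle sketch of both ideas (the customer-totals query is a stand-in for one of your heavier sub-queries):

    -- wrap a heavy sub-query in a named view so the report query just references it
    CREATE OR REPLACE VIEW v_customer_totals AS
    SELECT customer_id, SUM(amount) AS total_amount
    FROM   orders
    GROUP  BY customer_id;

    -- or materialize it when the computation is too heavy to repeat on every report run;
    -- note that refreshing on commit (for near-real-time data) also needs materialized view logs
    CREATE MATERIALIZED VIEW mv_customer_totals
        REFRESH COMPLETE ON DEMAND
    AS
    SELECT customer_id, SUM(amount) AS total_amount
    FROM   orders
    GROUP  BY customer_id;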
If your queries require an enormous amount of subquerying in order to gain usable data, you might need to rethink your database design and possibly create a number of datamarts to easily access reporting data. Think of these as mini-warehouses sans the multi-year trended data.
Finally, I know you said you don't use any BI tools, but this certainly seems like a problem that might be better served by organizing your data into "cubes" or Business Objects "universes". It might be worthwhile to at least weigh the cost of bringing on a BI tool against the programming hours needed to support the current setup.
The question pretty much sums it up.
I am creating a model that involves textual status information on some processes. I would like to show these as text but can't for the life of me figure out how.
I tried FirstNonBlank(textualcolumn, 1) without luck. Does anyone know if this is possible?
Rather than having a text measure physically in any fact table, I would suggest you go for a calculated measure. As per your post, the measure has to represent some process status (Open or Closed, I suppose), so you can easily write an MDX expression for the calculated measure.
Pardon me if this has already been asked (I know very little about Data Warehouse/BI and have yet to master the keywords).
I have a table that grows by more than 100,000 rows per day, each row having a timestamp and various pieces of information about an item (dimensions, weight, color, etc.). Individual rows are useful for roughly a month; after this period we are only interested in aggregations. I have dedicated software that allows a more detailed visualisation of individual rows, and I mainly use PowerPivot for my reporting needs.
I could come up with an SQL query that would fill a new table daily, in which I would have a row for each hour/item/batch summarizing the information (sum/average/stddev/etc.).
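Something along these lines, as a rough SQL Server-flavoured sketch (the measurement table, the summary table, and all column names are placeholders):

    -- roll yesterday's detail rows up to one summary row per hour / item / batch
    INSERT INTO dbo.ItemHourlySummary
        (MeasureHour, ItemId, BatchId, MeasurementCount, AvgWeight, StdevWeight, TotalWeight)
    SELECT DATEADD(hour, DATEDIFF(hour, 0, m.MeasuredAt), 0) AS MeasureHour,  -- truncate to the hour
           m.ItemId,
           m.BatchId,
           COUNT(*)        AS MeasurementCount,
           AVG(m.Weight)   AS AvgWeight,
           STDEV(m.Weight) AS StdevWeight,
           SUM(m.Weight)   AS TotalWeight
    FROM   dbo.ItemMeasurements m
    WHERE  m.MeasuredAt >= DATEADD(day, -1, CAST(GETDATE() AS date))
      AND  m.MeasuredAt <  CAST(GETDATE() AS date)
    GROUP  BY DATEADD(hour, DATEDIFF(hour, 0, m.MeasuredAt), 0), m.ItemId, m.BatchId;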
Within a day my script would be up and running, and I could use PowerPivot against this new table, all while staying where I'm comfortable: plain old SQL.
From the little information I have gathered reading about data warehousing and BI, what I'm about to do sounds a lot like creating dimensions and facts. My question, therefore: is it worthwhile to investigate further in that direction (BI), or since my problem is relatively simple, would I do better staying with a relational database?
N.B. The reports being produced are usually linked against another database to produce more meaningful information, a task that PowerPivot accomplishes very well.
Data warehouses are normally implemented in relational databases, so your existing skills will still be usable.
Given that you have expressed an interest in the dimension/fact table approach to data warehousing, the canonical books on this approach are usually considered to be:
The Data Warehouse Toolkit (Kimball, Ross)
The Data Warehouse Lifecycle Toolkit (Kimball, Ross, Thornthwaite, Mundy, Becker)
(The former has more of a technical focus, while the latter approaches the subject from a wider lifecycle management viewpoint.)
Implementing DWHs can be time-consuming, so it may be worth continuing with your existing approach even if you decide to build a DWH.
Good news: it sounds like you already have a data warehouse. "Data warehouse" is a very generic term, with no real formal definition - it pretty much means whatever you want it to.
Commonly accepted characteristics are:
Data warehouses do not run on the operational databases
Data warehouse schemas are optimized for querying, not for "normal form" compliance
Data warehouses are populated by "Extract, Transform, Load" (ETL) processes
It sounds like you're already doing all of that. If there are no business requirements to change, I'd leave it as it is. If your business users are asking to create their own queries, using different levels of aggregation, filtering, or granularity, a star schema may be the way to go.
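For reference, a minimal star schema for this kind of data might look roughly like this (a sketch only; every name is a placeholder):

    -- fact table: one row per measurement, carrying keys into the dimensions plus the numeric measures
    CREATE TABLE fact_item_measurement (
        date_key   int NOT NULL,   -- references dim_date
        item_key   int NOT NULL,   -- references dim_item
        weight     decimal(10,3),
        width      decimal(10,3)
    );

    -- dimensions: small descriptive tables that users slice, filter, and group by
    CREATE TABLE dim_date (
        date_key      int PRIMARY KEY,   -- e.g. 20240131
        calendar_date date,
        month_name    varchar(20),
        year_number   int
    );

    CREATE TABLE dim_item (
        item_key  int PRIMARY KEY,
        item_name varchar(100),
        color     varchar(30)
    );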
The most effective solutions are those which are simple, adequate to meet existing needs, and stay within available skill sets.
I agree that this approach works well for your situation, and if it provides the reports and information you need, then it's worth starting this way. If you need more complex functionality later, you can go for more complex BI.
I see the importance of having row-inserted and row-last-updated fields in a fact table, but I could not find any data warehousing standard or reference which says that this is a good thing to do. I am uncertain whether this is because it is a bad practice; if so, why should it be so? If it is because of the data size, I see it is only 8 bytes for a full date field.
Any help is greatly appreciated!!!
There's nothing written about whether it's a good or bad practice because we include creation time and updated time only if we need them, or ever will need them.
It's a "good thing to do" if you need to access those columns and a "bad thing to do" if your table will never require those columns.
The inclusion of insert and update timestamps in your data warehouse tables allows you to report from an "as was" and/or "as is" perspective with regard to the data warehouse. These timestamps would be in addition to any timestamps that may be captured from the source.
They also make troubleshooting easier and, in a worst-case scenario, give you the ability to back out a set of data from a specific run of an ETL process.
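For illustration, the columns in question might look like this on a fact table (a SQL Server-flavoured sketch; all names are invented):

    CREATE TABLE fact_sales (
        sale_key          bigint        NOT NULL,
        date_key          int           NOT NULL,
        amount            decimal(12,2) NOT NULL,
        source_updated_at datetime      NULL,                       -- timestamp carried over from the source system
        dw_inserted_at    datetime      NOT NULL DEFAULT GETDATE(), -- when the warehouse first loaded the row
        dw_updated_at     datetime      NOT NULL DEFAULT GETDATE(), -- when the warehouse last touched the row
        etl_batch_id      int           NOT NULL                    -- lets you back out a specific ETL run if needed
    );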
At a previous client, the data model we implemented included upwards of 6 different timestamps to provide slowly changing history, as-is/as-was reporting, and source-related timestamps. It made for very flexible reporting but also increased the learning curve of how to get exactly what you wanted from the table(s).