Optimizing View with 54 SELECTs -- Updated with Process used - sql

I've started a new job and am looking at some of their views that they wanted me to get familiar with.
One of the views that I'm going through has 54 SELECTs. I've NEVER seen such a massive View before, and after browsing through it, I'm fairly certain that I can optimize it.
What I'm looking for is an easier way to compare the JOINs between each of the selects to find commonality WITHOUT just having to sift through them all by hand.
I'm fairly certain that THAT is what I'll have to do, but I was hoping that someone would have an easier, more efficient, and less time-consuming way to do this...
Anyone? :)

Update 12/4/2019:
TLDR; I separated each main SELECT into its own file, then, using Notepad++ on the full view code file, I found common values to sort and organize each individual SELECT into a folder named after that common value. From there I used WinMerge to compare each group of SELECTs in their respective folders against each other to find differences and commonalities. I then started refactoring and minimizing the excessive code (and table calls >.< ) using CTEs (because this is Oracle and a View).
I ended up splitting out the view's SELECTs by looking for one of the field names that's a column in the view. This meant that I'd get the beginnings of each SELECT.
I then copied and pasted each one into a different .sql file. I also saved the full view into its own .sql file.
Using Notepad++, I searched for each instance of the view column name and took down the column value that is the main information needed for the view. Using that information, I grouped each SELECT into a folder based on that value.
Once I'd gotten all the SELECTs into their respective folders, I used WinMerge to compare the SELECTs against the others in the same folder. This helped me locate commonalities and differences, including some differences that make it harder to lump the datasource table into a CTE shared with the others.
I've since been updating, refactoring, and fixing the view one folder of grouped SELECTs at a time.
The project I'm in has had me touching each one of the main code values that I had used to organize the SELECTs into folders, so I WILL end up recoding the ENTIRE view. :)
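To give a rough idea of the CTE pattern (table, column, and view names here are invented, not taken from the actual view): each folder's SELECTs tended to repeat the same datasource join with slightly different filters, so pulling that shared query into a WITH clause lets every branch of the UNION ALL reuse it.

    -- Hypothetical sketch of the refactoring: one shared CTE feeding several
    -- UNION ALL branches that previously repeated the same join.
    CREATE OR REPLACE VIEW report_view AS
    WITH base_orders AS (
        SELECT o.order_id,
               o.customer_id,
               o.order_type,       -- the "common value" used to group the SELECTs
               o.order_total
        FROM   orders o
        JOIN   customers c ON c.customer_id = o.customer_id
    )
    SELECT order_id, customer_id, 'RETAIL' AS source, order_total
    FROM   base_orders
    WHERE  order_type = 'R'
    UNION ALL
    SELECT order_id, customer_id, 'WHOLESALE' AS source, order_total
    FROM   base_orders
    WHERE  order_type = 'W';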

Identifying similar lines of code is a difficult problem, and as far as I know Clone Doctor is the only program that can do it for PL/SQL.
I only used it once, about 10 years ago, but I remember being surprised at how well it found different lines of code that were really duplicates at a deep level. It wasn't perfect; it had some bugs and took a while to get working. But there's a free trial version.

Related

Quasi-automatic import realization from *csv

My problem is that I have to build an easily refreshable database system. To start, I have 2 tables imported from *csv format, and I need to restructure them and create certain new tables. I also have to add IDs to certain properties to make searching faster (it is going to contain loads of data and is going to be refreshed almost every day). It is going to contain millions of records. My first question is how to import: I need only selected rows from the original file, but the *csv layout will remain the same.
My second question is how to create indexes on certain properties in the new tables (the ones generated by the queries mentioned) to make searching faster.
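A minimal sketch of that load-then-filter-then-index flow, assuming MySQL and with every table, column, and file path invented for illustration:

    -- Staging table that mirrors the *csv layout (columns are placeholders).
    CREATE TABLE staging_raw (
        property_code VARCHAR(20),
        property_name VARCHAR(200),
        price         DECIMAL(12,2)
    );

    -- Load the whole file; the path and delimiters are illustrative only.
    LOAD DATA INFILE '/path/to/source.csv'
    INTO TABLE staging_raw
    FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
    LINES TERMINATED BY '\n'
    IGNORE 1 LINES;

    -- Keep only the selective rows in the restructured table
    -- (assumes a target table named properties already exists).
    INSERT INTO properties (property_code, property_name)
    SELECT property_code, property_name
    FROM   staging_raw
    WHERE  price IS NOT NULL;

    -- Index the properties you search on to speed up lookups.
    CREATE INDEX idx_properties_code ON properties (property_code);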
If you think my questions are not exact enough, please comment, and I will answer you to the best of my knowledge.
Thanks in advance.

Managing very large SQL queries

I'm looking for some ideas managing very large SQL queries in Oracle.
My employer is looking to build very wide reports (150-200 columns of data per report).
Each item is a sub-query or an element from a view. The data has to be real time, so DW-style batch processing is not an option. We also don't use any BI tools, just a Java app that generates Excel (it's a requirement to output the data in Excel).
The query also contains unions as feeds from other systems.
The queries result in very large SQL (about 1500 lines) that is very difficult to manage.
What strategies can I employ to make the work more manageable?
It is also not a performance problem. I was able to optimize the query to be very efficient; it's mostly the width of the query. Managing 200 columns is a challenge in itself.
I deal with queries this length daily, and here is some of what helps me out in maintaining them:
First, alias every single one of those columns. When you are building it you may know where each one came from, but when it is time to make a change, it is really helpful to know exactly where each column came from. This applies to join conditions, group by and where conditions as well as the select columns.
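For example (tables and columns invented purely for illustration), aliasing makes it obvious which of the joined tables each column and condition belongs to:

    -- Every column carries its table alias, so later you can tell at a glance
    -- which joined table "status" actually came from.
    SELECT o.order_id    AS order_id,
           o.status      AS order_status,
           c.status      AS customer_status,
           p.description AS product_description
    FROM   orders o
    JOIN   customers c ON c.customer_id = o.customer_id
    JOIN   products  p ON p.product_id  = o.product_id
    WHERE  o.status = 'OPEN';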
Organize in easily understandable and testable chunks. I use temp tables to pull things that make sense together and so I can see the results before the final query while in test mode.
This brings me to test mode. If I have chunks of data, I design the proc with a test mode and then query individual temp tables when in test mode, so I can see where the data went wrong if there is a bug. Not sure how Oracle works but in SQL Server, I make this the last parameter and give it a default value, so that it doesn't need to be passed in by the application.
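A rough T-SQL sketch of that test-mode pattern (procedure, table, and column names are all made up):

    -- Hypothetical report procedure with a defaulted @TestMode parameter.
    CREATE PROCEDURE dbo.BigReport
        @StartDate date,
        @EndDate   date,
        @TestMode  bit = 0   -- last parameter with a default, so the app never passes it
    AS
    BEGIN
        -- Pull one understandable, testable chunk into a temp table.
        SELECT o.OrderID, o.CustomerID, o.OrderTotal
        INTO   #orders_chunk
        FROM   dbo.Orders o
        WHERE  o.OrderDate BETWEEN @StartDate AND @EndDate;

        -- In test mode, dump the intermediate results so you can see
        -- where the data went wrong before the final query runs.
        IF @TestMode = 1
            SELECT * FROM #orders_chunk;

        -- ...the final wide query built from the temp tables goes here...
        SELECT c.OrderID, c.OrderTotal
        FROM   #orders_chunk c;
    END;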
Consider logging the execution details and the values of passed in parameters and certainly log any error messages. This will help tremendously when you have to troubleshoot why this report that has functioned perfectly for six years doesn't work for this one user.
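A minimal sketch of that logging idea, again with invented T-SQL names, to sit alongside a procedure like the one above:

    -- Hypothetical log table plus the insert you'd put at the top of the proc.
    CREATE TABLE dbo.ReportLog (
        LogID    int IDENTITY(1,1) PRIMARY KEY,
        ProcName sysname        NOT NULL,
        Params   nvarchar(4000) NULL,
        RunAt    datetime2      NOT NULL DEFAULT SYSDATETIME(),
        ErrorMsg nvarchar(4000) NULL
    );

    -- Inside the procedure, using its own parameters:
    INSERT INTO dbo.ReportLog (ProcName, Params)
    VALUES ('dbo.BigReport',
            CONCAT('@StartDate=', @StartDate, ', @EndDate=', @EndDate));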
Put each column on its own line and do the same for where clauses. At times you may have to troubleshoot by commenting out joins until you find the one that is causing the problem. It is easier if you can easily comment out the associated fields as well.
If you don't have a technical design document, then at least use comments to explain your thought process. You want to understand the whys not the hows in any comments. This stuff is hard to come back to later and understand even when you wrote it. Give your future self some help.
In developing from scratch, I put the select list in and then comment all but the first item. Then I build the query only until I get that value - testing until I am sure what I got was correct. Then I add the next one and whatever joins or where conditions I might need to get it. Test again making sure it is right. (Oops why did that go from 1000 records to 20000 when I added that? Hmm maybe there is something I need to handle there or is that right?) By adding only one thing at a time, you will find an error in the logic much faster and be much more confident of your results. It will also take you less time than trying to build a massive query in one go.
Finally, there is no substitute for understanding your data. There are plenty of complex queries that work but do not give the correct answer. Know whether you need an inner join or a left join. Know what where conditions you need to get the records you want. Know how to handle the records when you have a one-to-many relationship (this may require pushing back on the requirements): should you have 3 lines (one for each child record), should you put that data in a comma-delimited list, or should you pick only one of the many records and have one line using aggregation? If the latter, what are the criteria for choosing the record you want to keep?
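For the one-to-many case, the two usual shapes in Oracle look roughly like this (tables invented for illustration):

    -- Option 1: collapse the children into a comma-delimited list per parent
    -- (LISTAGG requires Oracle 11g R2 or later).
    SELECT p.parent_id,
           LISTAGG(c.child_name, ', ') WITHIN GROUP (ORDER BY c.child_name) AS child_list
    FROM   parents p
    JOIN   children c ON c.parent_id = p.parent_id
    GROUP  BY p.parent_id;

    -- Option 2: keep exactly one child per parent, with an explicit rule
    -- (here, the most recently created one) for which record wins.
    SELECT parent_id, child_name
    FROM  (SELECT c.parent_id,
                  c.child_name,
                  ROW_NUMBER() OVER (PARTITION BY c.parent_id
                                     ORDER BY c.created_date DESC) AS rn
           FROM   children c)
    WHERE  rn = 1;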
Without seeing the specifics of your problem, here are a couple of ideas that immediately come to mind:
If you are looking purely for management, I might suggest organizing your subqueries as a number of views and then referencing those views in your final query.
For performance, on the other hand, you may want to consider creating temp tables or even materialized views (which physically store the results of their query) to break up the heavier parts of your process.
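A rough Oracle-flavored sketch of both suggestions (all object names invented):

    -- A regular view keeps a heavy sub-query manageable and reusable.
    CREATE OR REPLACE VIEW v_customer_totals AS
    SELECT customer_id, SUM(order_total) AS lifetime_total
    FROM   orders
    GROUP  BY customer_id;

    -- A materialized view physically stores the same result, if recomputing it
    -- on every report run turns out to be too heavy.
    CREATE MATERIALIZED VIEW mv_customer_totals
    BUILD IMMEDIATE
    REFRESH COMPLETE ON DEMAND
    AS
    SELECT customer_id, SUM(order_total) AS lifetime_total
    FROM   orders
    GROUP  BY customer_id;

    -- The wide report query then just references the view.
    SELECT c.customer_name, t.lifetime_total
    FROM   customers c
    JOIN   v_customer_totals t ON t.customer_id = c.customer_id;

Since the data has to be real time, the materialized-view route only helps if an acceptable refresh schedule exists; otherwise stick to plain views for organization.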
If your queries require an enormous amount of subquerying in order to gain usable data, you might need to rethink your database design and possibly create a number of datamarts to easily access reporting data. Think of these as mini-warehouses sans the multi-year trended data.
Finally, I know you said you don't use any BI tools, but this certainly seems like a problem that might be better served by organizing your data into "cubes" or Business Objects "universes". It might be worthwhile to at least weigh the cost of bringing on a BI tool against the programming hours needed to support the current setup.

Finding unused columns

I'm working with a legacy database which, due to poor management and design, has seen a wild growth of columns which either never have been used or are no longer being used.
Is it possible to somehow query for column usage? As in, how often a column is being selected (either specifically, or with *, or joined on)?
Seems to me like this is something we should be able to somehow retrieve, but I have been unable to find anything like this.
Greetings,
F.B. ten Kate
Unfortunately, this analysis on the DB side isn't really going to be a full answer. I've seen a LOT of instances where application code only needed 3 columns of a 10+ column table, but selected them all anyway.
Your column would still show up on a usage report in any sort of trace or profiling you did, but it still may not ACTUALLY be in use.
You might have to either a) analyze the entire collection of apps that use this database or b) start drafting a return-on-investment-style doc on whether it's worth rebuilding.
This article will give you a good idea of how to search all fixed code (procedures, views, functions and triggers) for the columns that are used. The code in the article searches for a specific table/column combination. You could easily adapt it to run for all columns. For anything dynamically executed, you'd probably have to set up a profiler trace.
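If this happens to be SQL Server, the core of that kind of search is just a pattern match against the module definitions; 'SomeColumn' below is a placeholder for the column you are investigating:

    -- Find every proc/view/function/trigger whose definition mentions the column.
    SELECT OBJECT_SCHEMA_NAME(m.object_id) AS schema_name,
           OBJECT_NAME(m.object_id)        AS object_name,
           o.type_desc
    FROM   sys.sql_modules m
    JOIN   sys.objects o ON o.object_id = m.object_id
    WHERE  m.definition LIKE '%SomeColumn%';

It's a plain text match, so it will also flag commented-out code and same-named columns in other tables; treat the hits as leads, not proof.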
Even if you could determine whether a column had been used in the past X period of time, would that be good enough? There may be some obscure program out there that populates a column once a week, a month, a year; or once every time they click the mystery button that no one ever clicks, or to log the report that only Fred in accounting ever runs (he quit two years ago), or that gets logged to if that one rare bug happens (during daylight savings time, perhaps?)
My point is, the only way you can truly be certain that a column is absolutely not used by anything is to review everything -- every call, every line of code, every ad hoc Excel data dump, every possible contingency -- everything that references the database. As this may be all but unachievable, try to get a formally defined group of programs and procedures that must be supported, bend over backwards to make sure they are supported, and be prepared to fix things when some overlooked or forgotten piece of functionality turns up.

Printing an ER Diagram for mySQL database (800+ tables)

We have a system built by Parallels, which relies on a huge number of tables (800+) to maintain everything.
I need to learn this system in order to be able to write queries to retrieve data for report generation for various needs.
I am, obviously, having difficulties isolating which tables are currently relevant to the task at hand, so I thought the best way would be to generate and print an ERD over multiple pages for the entire system of tables.
I have attempted to drag all the tables using TOAD - which crashed :)
On the second attempt, I dragged tables A-N and then, after a (long) while, tables M-Z, successfully.
I even managed to get them all resized and arranged, and I saved the ERD to a file.
However, when I go to print or preview, the print sub-process crashes hehe.
Any suggestions on how to print this massive ERD? Or perhaps another method? The table names don't seem self-explanatory, so I can't (and honestly, don't really want to) go over 800+ tables and hope I don't miss what I need, or parts of it.
I would greatly appreciate any advice or ideas on how to proceed before I even get to actually writing the scripts and code.
The database is MySQL under CentOS; some tables are InnoDB, some are MyISAM.
Many tables seem to have foreign keys.
Thanks!
I worked at a place that had several hundred tables (near 1k) and no one really knew what was going on in the system; the company was growing and hiring a lot. A guy was tasked with doing a diagram, and he auto-magically created a gigantic tiled poster that contained every table, with lines connecting various tables going all over the place. I'm not sure what he used; it was Unix and Oracle years ago (way before Linux and open source). There was no real rhyme or reason to the layout of the tables in his diagram. He had successfully created a diagram of every table. The "poster" was put on a wall in a common area and got a few looks, but no one ever really used it; it was unusable, too cluttered, too unorganized.
As a result, I used MS Word to create a single-page diagram containing the 20 main tables (it went through a few iterations as I "discovered" new main tables), with lines for each foreign key and each table located in a logical manner. I showed the column name, data type, nullability, PK, and all FKs. I put my diagram up on my wall by my monitor. Eventually everyone wanted a copy of my diagram, including the person that made the "poster". When I left that job they were still giving my diagram to new hires.
I recommend that you work like an explorer: find the key tables and map them as you go, making as many specific diagrams as necessary as you discover the system. Trying to make a gigantic "poster" automatically will not work very well.
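Since many of the tables do have foreign keys, MySQL can tell you which tables hang off the one you are currently exploring; the schema and table names below are placeholders:

    -- List every table that references, or is referenced by, the table being mapped.
    SELECT TABLE_NAME, COLUMN_NAME,
           REFERENCED_TABLE_NAME, REFERENCED_COLUMN_NAME
    FROM   information_schema.KEY_COLUMN_USAGE
    WHERE  TABLE_SCHEMA = 'your_schema'
      AND  REFERENCED_TABLE_NAME IS NOT NULL
      AND  (TABLE_NAME = 'some_table' OR REFERENCED_TABLE_NAME = 'some_table');

Note that MyISAM tables will not show up here, since they do not record foreign keys, so this only maps the InnoDB part of the schema.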
Have you tried MySQL Workbench?
If you don't mind Windows, you could try Enterprise Architect as well.
MySQL Workbench has some great tools for reverse engineering a schema from the create script. I haven't used it for such large databases, but you should check it out.
Link: http://wb.mysql.com/
IIRC, MS SQL Server has a nice utility for making diagrams. I know it helped a lot; you could add a table and it would automatically add all related tables. If you could convert your tables to an MS SQL-compatible script, this might help.
Navicat 10.1 and later can do the job. Use its model tool and import the database into it, then rearrange at your ease. Printing produces a PDF or goes directly to the printer.

How should I organize complex SQL views in Rails?

I manage a research database with Ruby on Rails. The data that is entered is primarily used by scientists who prefer to have all the relevant information for a study in one single massive table for use in their statistics software of choice. I'm currently presenting it as CSV, as it's very straightforward to do and compatible with the tools people want to use.
I've written many views (the SQL kind, not the Rails HTML/ERB kind) to make the output they expect a reality. Some of these views are quite large and have a fair amount of complexity behind them. I wrote them in SQL because there are many calculations and comparisons that are more easily done with SQL. They're currently loaded into the database straight from a file named views.sql. To get the requested data, I do a select * from my_view;.
The views.sql file is getting quite large. Part of the problem is that we're still figuring out what the data we collect means, so there are a lot of changes being made to the views all the time, and a ton of new ones are being created. Many of them need to be repeatable.
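One small thing that helps with the "repeatable" requirement, whatever structure the file ends up with, is writing every definition so the file can be re-run safely. For example (names invented, assuming a database that supports CREATE OR REPLACE VIEW):

    -- Idempotent view definition: reloading views.sql simply replaces the view.
    CREATE OR REPLACE VIEW study_wide_report AS
    SELECT s.study_id,
           s.enrolled_on,
           m.weight_kg,
           m.height_cm
    FROM   subjects s
    JOIN   measurements m ON m.subject_id = s.subject_id;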
I've recently run into issues organizing and testing these views. Rails works great for user interface stuff and business logic, but I'm not aware of much existing structure for handling the reporting we require.
Some options I've thought of:
Should I move them into the most relevant models somehow? Several of the views interact with each other, which makes this situation more complex than just doing a single find_by_sql, so I don't know if they should only be part of the model.
Perhaps they should be treated as a "view" in the MVC sense? (That is, they could be moved into app/views/ and live alongside the HTML, perhaps as files named something like my_view.csv.sql which return CSV.)
How would you deal with a complex reporting problem like this?
UPDATE for Mladen Jablanović
It started with having a couple of views for reporting purposes. My boss(es) decided they wanted more, so I started writing more. Some return a couple hundred columns of data, based on the requirements I've been given.
I have a couple thousand lines of views all shoved in a single file now. I don't like that situation, so I want to reorganize/refactor the code. I'd also like an easy way of providing CSVs -- I'm currently running queries and emailing them by hand, which could easily be automated. Finally, I would like to be able to write some tests on the output of the views, since a couple of regressions have already popped up.
I haven't worked much with SQL and views directly, so I can't help you there, but you can certainly build an ActiveRecord model on top of a view, very easily in fact. The book Enterprise Rails has a whole chapter on it (here it is at Google Books).
We use views in our DB extensively, and some of them are exposed as Rails models. You work with them as you would with tables, except that you can't update them, of course.
Also, some of the columns may be calculated from other columns (different ratios, for example), so we don't do that in the view but in the model instead (OK, not entirely true: we construct an SQL snippet and pass it to the :select => '' portion of the find call).
Presentation logic (such as date and number formatting) goes to Rails views.
I'm afraid I can't help you with more concrete advice, as the scope of the question is pretty wide.
EDIT:
Hundreds of columns doesn't sound reasonable; it sounds like an immense amount of data in one place. How do they use it at all? We have a web application where they can drill down and filter the results, narrow the timespan and time step, etc., so they never have more than 10-20 columns in the reports.
We store our views one view per SQL file. You can also combine that with a numerical prefix in order to ensure proper creation order (in case some of them depend on others). No migrations there; the whole DB layer is app-agnostic.
For CSV, you can either create a set of scripts that you invoke manually or via cron, or you can use FasterCSV from your Rails app and generate CSVs per HTTP request.