I have inherited a database that's causing me issues.
I need to describe something horrible to stakeholders. So far, naming the anti-patterns involved and sending them away with a pointer to a Google search on them has been the most efficient way to buy me some time.
Trouble is, I have not come across this before. Here's what's happening.
I have a simple single table, with a couple of columns. One of these columns contains values like:
660x90_SomeCity_SomeCountryISO_ImageName_SomeRubbish
or
SomeIataAirportCode_SomeCountry_660x90_SomeRubbish_ImageName
Now the database contains logic (admittedly faultless so far, on the current data) to extract and look up things so that the output has additional columns such as:
AdSize
Country
City
The trouble is that this is achieved through gradual conversions implemented in a labyrinth of 50 (not joking) different views. I've now got to formalize the logic, which goes something like this:
View One: Extract the first column and work out the length of it.
View Two: Now split off the second column using that length.
View Three: If after replacing the x in the first column the value is numeric, store the value in "AdSize", and place the second value in the "CityCandidateOne" column.
To me this is a horrible antipattern and should all be done either in custom functions, or preferably during the ETL process, in one place so the logic can be captured.
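To illustrate what I mean by "one place" - a rough sketch only, using SQL Server syntax, a made-up function name, and handling just the case where the ad size is the first token:

    CREATE FUNCTION dbo.ExtractAdSize (@RawName varchar(200))
    RETURNS varchar(20)
    AS
    BEGIN
        -- Grab everything up to the first underscore ("660x90" in the first example).
        DECLARE @FirstToken varchar(20) =
            LEFT(@RawName, CHARINDEX('_', @RawName + '_') - 1);

        -- If stripping the 'x' leaves a number, treat the first token as the ad size.
        IF ISNUMERIC(REPLACE(@FirstToken, 'x', '')) = 1
            RETURN @FirstToken;

        RETURN NULL;
    END;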
However, I'm not being given the time, and I wonder if this is a known anti-pattern. Usually I can then use the credibility of a Google search to buy a little time to really sort this out.
I'd start with this answer which covers the violation of First Normal Form.
I also found this free ebook that might be of value.
I understand that what you are facing is something on a grander scale than just putting a couple of values in a field with a comma or other token to separate them, but I don't know of any antipattern that covers such a baroque mess.
Finally, here you can find more about "replacing SQL logic with Views" as an antipattern (just look for "Views as SQL Building Blocks Anti-Pattern" in the article), but take into account that in that case the problem seems to be about inefficient access to the data.
Last minute edit: maybe this is just a special case of the general Golden Hammer antipattern? (see also: http://en.wikipedia.org/wiki/Golden_hammer)
Why not simply rewrite the SQL how you would rather do it, then print out the execution plans of both, and show the performance and timing of both. That should be enough to show them that it needs to change (and if there is no major performance difference, then your only other argument can be one of maintainability and that's something you're going to have to argue by showing them what it takes to make changes).
Related
I'm looking for some ideas for managing very large SQL queries in Oracle.
My employer is looking to build very wide reports (150-200 columns of data per report).
Each item is a sub-query or an element from a view. The data has to be real time, so DW-style batch processing is not an option. We also don't use any BI tools, just a Java app that generates Excel (it's a requirement to output the data in Excel).
The query also contains unions as feeds from other systems.
The queries result in very large SQL (about 1,500 lines) that is very difficult to manage.
What strategies can I employ to make the work more manageable?
It is also not a performance problem; I was able to optimize the query to be very efficient. It's mostly the width of the query: managing 200 columns is a challenge in itself.
I deal with queries of this length daily, and here is some of what helps me maintain them:
First, alias every single one of those columns. When you are building the query you may know where each one came from, but when it is time to make a change, it is really helpful to know exactly where each column came from. This applies to join conditions, group by and where conditions as well as to the select columns.
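For instance (invented table and column names), even a small chunk reads far more easily later when every column carries its table alias:

    SELECT  e.employee_id,
            e.last_name,
            d.department_name,
            s.annual_salary
    FROM    employees   e
    JOIN    departments d ON d.department_id = e.department_id
    JOIN    salaries    s ON s.employee_id   = e.employee_id
    WHERE   e.termination_date IS NULL;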
Organize in easily understandable and testable chunks. I use temp tables to pull things that make sense together and so I can see the results before the final query while in test mode.
This brings me to test mode. If I have chunks of data, I design the proc with a test mode and then query individual temp tables when in test mode, so I can see where the data went wrong if there is a bug. Not sure how Oracle works but in SQL Server, I make this the last parameter and give it a default value, so that it doesn't need to be passed in by the application.
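A rough sketch of that shape (SQL Server syntax; the report logic, tables, and names are invented purely for illustration):

    CREATE PROCEDURE dbo.BuildWideReport
        @StartDate date,
        @EndDate   date,
        @TestMode  bit = 0   -- last parameter, defaulted, so the app never passes it
    AS
    BEGIN
        -- One understandable, testable chunk pulled into a temp table.
        SELECT  o.CustomerID,
                SUM(o.OrderTotal) AS TotalSpent
        INTO    #OrderTotals
        FROM    dbo.Orders o
        WHERE   o.OrderDate >= @StartDate
          AND   o.OrderDate <  @EndDate
        GROUP BY o.CustomerID;

        -- In test mode, show the intermediate results and stop.
        IF @TestMode = 1
        BEGIN
            SELECT * FROM #OrderTotals;
            RETURN;
        END;

        -- Final (much wider, in real life) query built from the tested chunks.
        SELECT  c.CustomerID,
                c.CustomerName,
                ot.TotalSpent
        FROM    dbo.Customers c
        JOIN    #OrderTotals ot ON ot.CustomerID = c.CustomerID;
    END;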
Consider logging the execution details and the values of passed in parameters and certainly log any error messages. This will help tremendously when you have to troubleshoot why this report that has functioned perfectly for six years doesn't work for this one user.
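The logging itself doesn't need to be elaborate; something along these lines (SQL Server syntax, with an invented log table) is usually enough:

    BEGIN TRY
        EXEC dbo.BuildWideReport @StartDate = '2019-01-01', @EndDate = '2019-02-01';
    END TRY
    BEGIN CATCH
        -- Record which proc failed, with what parameters, and why.
        INSERT INTO dbo.ReportErrorLog (ProcName, Parameters, ErrorMessage, LoggedAt)
        VALUES ('BuildWideReport', '@StartDate=2019-01-01, @EndDate=2019-02-01',
                ERROR_MESSAGE(), SYSDATETIME());
    END CATCH;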
Put each column on a separate line and do the same for where clauses. At times you may have to troubleshoot by commenting out joins until you find the one that is causing the problem. It is easier if you can easily comment out the associated fields as well.
If you don't have a technical design document, then at least use comments to explain your thought process. You want to understand the whys not the hows in any comments. This stuff is hard to come back to later and understand even when you wrote it. Give your future self some help.
When developing from scratch, I put the select list in and then comment out all but the first item. Then I build the query only until I get that value, testing until I am sure what I got was correct. Then I add the next one and whatever joins or where conditions I might need to get it. Test again, making sure it is right. (Oops, why did that go from 1,000 records to 20,000 when I added that? Hmm, maybe there is something I need to handle there, or is that right?) By adding only one thing at a time, you will find an error in the logic much faster and be much more confident of your results. It will also take you less time than trying to build a massive query in one go.
Finally, there is no substitute for understanding your data. There are plenty of complex queries that work but do not give the correct answer. Know if you need an inner join or a left join. Know what where conditions you need to get the records you want. Know how to handle the records when you have a one-to-many relationship (this may require pushing back on the requirements): should you have three lines (one for each child record), should you put that data in a comma-delimited list, or should you pick only one of the many records and have one line using aggregation? If the latter, what are the criteria for choosing the record you want to keep?
Without seeing the specifics of your problem, here are a couple of ideas that immediately come to mind:
If you are looking purely for manageability, I might suggest organizing your subqueries as a number of views and then referencing those views in your final query.
For performance, on the other hand, you may want to consider creating temp tables or even materialized views (views whose result set is physically stored) to break up the heavier parts of your process.
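For example, one of the heavier aggregations might be pre-computed along these lines (Oracle syntax, invented names; the refresh strategy would depend on how close to real time the data really needs to be):

    CREATE MATERIALIZED VIEW mv_customer_totals
      BUILD IMMEDIATE
      REFRESH COMPLETE ON DEMAND
    AS
    SELECT o.customer_id,
           SUM(o.order_total) AS total_spent,
           COUNT(*)           AS order_count
    FROM   orders o
    GROUP  BY o.customer_id;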
If your queries require an enormous amount of subquerying in order to gain usable data, you might need to rethink your database design and possibly create a number of datamarts to easily access reporting data. Think of these as mini-warehouses sans the multi-year trended data.
Finally, I know you said you don't use any BI tools, but this problem certainly seems like one that might be better served by organizing your data into "cubes" or Business Objects "universes". It might be worthwhile to at least weigh the cost of bringing on a BI tool against the programming hours needed to support the current setup.
Say I let a user check off the languages she speaks and store them in a DB. Important side note: I will not search the DB for any of those values, as I will have a separate search engine for searching.
Now, the obvious way of storing these values is to create a table like
UserLanguages
(
UserID nvarchar(50),
LookupLanguageID int
)
but the site will be under high load and we are trying to eliminate any overhead where possible, so in order to avoid joins with the main member table when showing results in the UI, I was thinking of storing a user's languages in the main table, comma-separated, like "12,34,65"
Again, I don't search for them so I don't worry about having to do fulltext index on that column.
I don't really see any problems with this solution, but am I overlooking anything?
Thanks,
Andrey
Don't.
You don't search for them now
Data is useless to anything but this one situation
No data integrity (eg no FK)
You still have to change to "English,German" etc for display
"Give me all users who speak x" = FAIL
The list is actually a presentation issue
It's your system, though, and I look forward to answering the inevitable "help" questions later...
You might not be missing anything now, but when your requirements change you might regret that decision. You should store it normalized, as your first instinct suggested. That's the correct approach.
What you're suggesting is a classic premature optimization. You don't know yet whether that join will be a bottleneck, and so you don't know whether you're actually buying any performance improvement. Wait until you can profile the thing, and then you'll know whether that piece needs to be optimized.
If it does, I would consider a materialized view, or some other approach that pre-computes the answer from the normalized data into a cache that is not considered the book of record.
More generally, there are a lot of possible optimizations that could be done, if necessary, without compromising your design in the way you suggest.
This type of storage has almost ALWAYS come back to haunt me. For one, you are not even in first normal form. For another, some manager or other will definitely come back and say: "Hey, now that we store this, can you write me a report on... "
I would suggest going with a normalized design. Put it in a separate table.
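Concretely, that is just the table from the question plus the keys that keep it honest (the Languages lookup table is assumed to exist):

    CREATE TABLE UserLanguages
    (
        UserID           nvarchar(50) NOT NULL,
        LookupLanguageID int          NOT NULL,
        CONSTRAINT PK_UserLanguages PRIMARY KEY (UserID, LookupLanguageID),
        CONSTRAINT FK_UserLanguages_Language
            FOREIGN KEY (LookupLanguageID) REFERENCES Languages (LanguageID)
    );

    -- "Give me all users who speak language 12" stays trivial and indexable:
    SELECT UserID
    FROM   UserLanguages
    WHERE  LookupLanguageID = 12;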
Problems:
You lose join capability (obviously).
You have to reparse the list on each page load / postback, which results in more code on the client side.
You lose all pretense of trying to keep database integrity. Just imagine if you decide to REMOVE a language later on... What's the SQL going to be to fix all of your user profiles?
Assuming your various profile options are stored in a lookup table in the DB, you still have to run "30 queries" per profile page. If they aren't, then you have to do a code deploy for each little change. Bad, very bad.
Basing a design decision on something that "won't happen" is an absolute recipe for failure. Sure, the business people said they won't ever do that... Until they think of a reason they absolutely must do it. Today. Which will be promptly after you finish coding this.
As I stated in a comment, 30 queries for a low-use page is nothing. Don't sweat it, and definitely don't optimize unless you know for darn sure it's necessary. Guess how many queries SO does for its profile page?
I generally stay away from the solution you described; you're asking for trouble when you store relational data in such a fashion.
As an alternative solution:
You could store as one bitmasked integer, for example:
0 - No selection
1 - English
2 - Spanish
4 - German
8 - French
16 - Russian
--and so on powers of 2
So if someone selected English and Russian the value would be 17, and you could easily query the values with Bitwise operators.
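For example (SQL Server syntax; the int column and table names are made up):

    -- Users who speak Russian (bit value 16), regardless of what else they speak:
    SELECT UserID
    FROM   Users
    WHERE  Languages & 16 = 16;

    -- English + Russian stored as a single value:
    UPDATE Users SET Languages = 1 | 16   -- = 17
    WHERE  UserID = N'some-user';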
Premature optimization is the root of all evil.
EDIT: Apparently the context of my observation has been misconstrued by some - and hence the downvotes. So I will clarify.
Denormalizing your model to make things easier and/or 'more performant' - such as creating concatenated columns to represent business information (as in the OP's case) - is what I refer to as a "premature optimization".
While there may be some extreme edge cases where there is no other way to get the performance necessary for a particular problem domain, one should rarely assume this is the case. In general, such premature optimizations cause long-term grief because they are hard to undo - changing your data model once it is in production takes a lot more effort than changing it when it was initially deployed.
When designing a database, developers (and DBAs) should apply standard practices like normalization to ensure that their data model expresses the business information being collected and managed. I don't believe that proper use of data normalization is an "optimization" - it is a necessary practice. In my opinion, data modelers should always be on the lookout for models that could be restructured to (at least) third normal form (3NF).
If you're not querying against them, you don't lose anything by storing them in a form like your initial plan.
If you are, then storing them in the comma-delimited format will come back to haunt you, and I doubt that any speed savings would be significant, especially when you factor in the work required to translate them back.
You seem to be extremely worried about adding in a few extra lookup table joins. In my experience, the time it takes to actually transmit the HTML response and have the browser render it far exceed a few extra table joins. Especially if you are using indexes for your primary and foreign keys (as you should be). It's like you are planning a multi-day cross-country trip and you are worried about 1 extra 10 minute bathroom stop.
The lack of long-term flexibility and data integrity are not worth it for such a small optimization (which may not be necessary or even noticeable).
Nooooooooooooooooo!!!!!!!!
As stated very well in the above few posts.
If you want a contrary view in this debate, look at WordPress. Its tables are chock-full of delimited data, and it's a great, simple platform.
I understand that, in the interest of efficiency, when you query the database you should only return the columns that are needed and no more.
But given that I like to use objects to store the query result, this leaves me in a dilemma:
If I only retrieve the column values that I need in a particular situation, I can only partially populate the object. This, I feel, leaves my object in a non-ideal state where only some of the properties and methods are available. Later, if a situation arises where I would like to reuse the object but find that the new situation requires a different but overlapping set of columns to be returned, I am faced with a choice.
Should I reuse the existing SQL and add the additional fields required by the new situation to the list of selected columns, so that the same query and object-mapping logic can be reused for both? Or should I create another method that executes slightly different SQL and populates only those object properties returned by the second query?
I strongly suspect that there is no magic answer and that the answer really "depends" on the situation, but I guess I am looking for general advice. In general, my approach has been either to return all columns from the queried table or to add the additional columns to the query as they are needed, but to reuse the same SQL (and mapping code) - that is, until performance becomes a concern. I find that unless you are retrieving a large number of rows - and I usually am not - the cost of adding additional columns to the output does not have a noticeable effect on performance, and that the savings in development time and the simplified API that result are a good trade-off.
But how do you deal with this situation when performance does become a factor? Do you create methods like
Employees.GetPersonalInfo
Employees.GetLittleMorePersonalInfoButMinusSalary
etc, etc etc
Or do you somehow end up creating an API where the user of your API has to specify which columns/properties he wants populated/returned, thereby adding complexity and making your API less friendly/easy to use?
Let's say you want to get Employee info. How many objects would typically be involved?
1) an Employee object
2) An Employees collection object containing one Employee object for each Employee row returned
3) An object, such as EmployeeQueries, that contains methods such as "GetHiredThisWeek", which returns an Employees collection of 0 or more records.
I realize all of this is very subjective, but I am looking for suggestions on what you have found works best for you.
I would say make your application correct first, then worry about performance in this case.
You could be optimizing away your queries only to realize you won't use that query anyway. Create the most generalized queries that your entire app can use, and then as you are confident things are working properly, look for problem areas if needed.
It is likely that you won't have a great need for huge performance up front. Some people say the lazy programmers are the best programmers. Don't over-complicate things up front, make a single Employee object.
If you find a need to optimize, you'll create a method/class, or however your ORM library does it. This should be an exception to the rule; only do it if you have reason to do so.
...the cost of adding additional columns to the output does not have a noticeable effect on performance...
Right. I don't quite understand what "new situation" could arise, but either way, it would be a much better idea (IMO) to get all the columns rather than run multiple queries. There isn't much of a performance penalty at all for getting more columns than you need (the queries will take more RAM, but that shouldn't be a big issue; besides, hardware is cheap). Also, you'd save yourself quite a bit of development time.
As for the second part of your question, it's really up to you. As an example, Rails takes more of a "usability first, performance last" approach, but that may not be what you want. It just depends on your needs. If you're willing to sacrifice a little usability for performance, by all means, go for it. I would.
If you are using your objects in a "row at a time" CRUD-type application then, by all means, copy all the columns into your object; the extra overhead is minimal, and your object becomes truly re-usable for any program wanting row access to the table.
However, if your SQL is doing a complex join or returning a large set of rows, then request precisely and only what you want. You get two performance penalties here: one, handling each column each time will eat up CPU for no benefit; and two, most DBMSs have a bag of tricks for optimising queries (such as index-only access) which can only be used if you specify precisely which columns you want.
There is no reuse issue in most of these cases, as scan/search processes tend to be very specific to a particular use case.
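As a small illustration of the index-only point (SQL Server syntax, made-up names; Oracle gets the same effect from a composite index containing all the selected columns):

    -- With a covering index like this in place...
    CREATE INDEX IX_Employees_Dept
        ON dbo.Employees (DepartmentID)
        INCLUDE (LastName, FirstName);

    -- ...this query can be answered from the index alone (index-only access):
    SELECT LastName, FirstName
    FROM   dbo.Employees
    WHERE  DepartmentID = 42;

    -- SELECT * could not: every row would need a lookup back into the base table.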
I've got quite a long business process which eventually results into financial operations.
What matters in the end is quite exclusively these final operations, although I've got to keep a log of everything which led to them.
Since all the information contained in the final operations is available in other tables (used during the business process), it makes sense to use a view, but the view logic would be quite complicated (there are dozens of tables involved), and I'm concerned that:
even with appropriate indexes, a table will probably be way faster (my table will eventually contain millions of items, and should be fully searchable on almost all its columns)
the view logic would be complicated, so I'm afraid it may complicate things in a few years if I want to evolve my business logic.
Because of those two reasons, I'm a bit tempted to write the data to a table at the end of my business process instead of relying on a view, but duplicating the data doesn't smell right (and it also looks a bit like premature optimization, but since it's such a central point in my design, I'd like to address the issue ASAP).
Have you ever faced such a choice? What did you decide?
Edit: creating a table would clearly lead to duplication in my situation, i.e. the data written in the table exists somewhere else in the database and could be retrieved using only joins, without any calculations.
I think you answered your own question by writing it down, Brann.
This problem can be seen this way: on the one hand you have "real-time data". You have fresh data, and it's nice to create a view over it to show that real-time data too.
But as time goes on, there is more data and the logic changes. So it's good to have written-down summaries of the data you had some time ago. It's very pragmatic - you are not really duplicating data, because you recalculate it and save a summary of it into a new table.
So when you think of it this way, it's obvious that in this example a new table will be better. As you write, it gives you:
Faster access
Room for more complicated logic
Archive data that stays unchanged when the logic changes
So when your requirements meet these criteria (or even part of them), it's not really a choice - you go with tables.
I would use a view only when showing fresh data derived from other fresh data, in very, very simple problems. When it gets more complicated, you switch to a new table.
So do not be afraid to go for it. Having one summary table with faster access is a perfectly good solution and a sign of a well-formed database.
Take care with the design of this table, so that when the business logic changes you do not need to change everything in it from the ground up. And then everything will be OK!
I'm for the new table in this situation. The view has many disadvantages - performance clearly, complexity, and logic lock-in. However, IMHO the overarching reason is that as the underlying data changes, so will the values in your view. In most instances this is a good thing; however, with financial operations, isn't it better to have a fixed record of what occurred?
I always decide in favor of better normalization. In your case, though the view may be complicated, it's better to have that than a new table which has to be kept in sync with every data-changing operation. Plus, the view would always be current, while your end-of-business-day table population would only be current for a few hours a day.
Also, you have a bigger problem if the data in this table goes out of sync for whatever reason.
As MrTelly alluded to, are you sure that your end-result table really is a duplication of the view data? Or is it actually a record of the final action taken as a result of the items in the view data?
For a clearer example, let's say that every time my gas tank gets to half-empty I buy $10 of gas. I write this down in a log. One day I buy my gas and write it in my log then later find out that my fuel gauge was broken and I really had 3/4 a tank of gas. Should I now erase the $10 purchase from my log because the underlying data (the level of gas in my tank) has changed? Ok, maybe that's not a clearer example, but hopefully it gets the point across. Recording the results is a different thing from recording the events that led up to the result. This is especially true in financial application. Therefore, I don't know that you're breaking normalization at all with storing the final outcome in its own table.
An indexed view is the way. There are quite a few limitations to this approach, but it's generally favorable, although it has some overhead issues if implemented incorrectly. With this approach you won't need to keep track of the changes that take place in your base tables, and the data will accumulate nicely in that indexed view of yours. In theory.
Reference:
Improving Performance with SQL Server 2005 Indexed Views
Oracle: Materialized View Concepts and Architecture
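To give a feel for the shape of it, a bare-bones indexed view might look like this (SQL Server syntax, invented names; the real restrictions are covered in the first reference above):

    CREATE VIEW dbo.vFinalOperations
    WITH SCHEMABINDING
    AS
    SELECT  AccountID,
            SUM(Amount)  AS TotalAmount,   -- Amount must be declared NOT NULL
            COUNT_BIG(*) AS RowCnt         -- required when the view uses GROUP BY
    FROM    dbo.Operations
    GROUP BY AccountID;
    GO
    -- Materializes the view; SQL Server keeps it in sync with the base table.
    CREATE UNIQUE CLUSTERED INDEX IX_vFinalOperations ON dbo.vFinalOperations (AccountID);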
Option A
We are working on a small project that requires a pricing wizard for custom tables. (Yes, actual custom tables - the kind you eat at. From here on out I'll call them kitchen tables so we don't get confused.) I came up with a model where each kitchen table part was a database table. So the database looked like this:
TableLineItem
-------------
ID
TableSizeID
TableEdgeWoodID
TableBaseID
Quantity
TableEdgeWoodID
---------------
ID
Name
MaterialUnitCost
LaborSetupHours
LaborWorkHours
Each part has to be able to calculate its price. Most of the calculations are very similar. I liked this structure because I can drag it right into the LINQ to SQL designer and have all of my classes generated. (Less code writing means less to maintain...) I then implement a calculate-cost interface which just takes in the size of the table. I have written some tests and this functions pretty well. I also added a table to filter parts in the UI based on previous selections. (You can't have a particular wood with a particular finish.) There are some other one-off exceptions in the model, and I have them hard-coded. This model is very rigid, and changing requirements would change the data model. (For example, if all the tables suddenly need umbrellas.)
Option B:
After various meetings with my colleagues (which probably took more time than they should have, considering the size of this project), they decided they would prefer a more generic approach. Something like this:
Spec
----
SpecID
SpecTypeID
TableType_LookupID
Name
MaterialUnitCost
LaborSetupHours
LaborWorkHours
SpecType
--------
SpecTypeID
ParentSpecType_SpecTypeID
IsCustomerOption
IsRequiredCustomerOption
etc...
This is a much more generic approach that could be used to construct any product (like, if they started selling chairs...). I think this would take longer to implement, but it would be more flexible in the future (although I doubt we will revisit this). Also, you lose some referential integrity - you would need triggers to enforce that a table base cannot be set for a table wood.
Questions:
Which database structure do you prefer? Feel free to suggest your own.
What would be considered a best practice? If you have several similar database tables, do you create 1 database table with a type column, or several distinct tables? I suspect the answer begins with "It depends..."
What would the estimated time difference be between the two approaches (1 week, 1 day, 150% longer, etc.)?
Thanks in advance. Let me know if you have any questions so I can update this.
Having been caught out much more often than I should have been by designing DB structures that met my clients' original specs but which turned out to be too rigid, I would always go for the more flexible approach, even though it takes more time to set up.
I don't have time for a complete answer right now, but I'll throw this out:
It's usually a bad idea to design a database based on the development tool that you're using to code against it.
You want to be generic to a point. Tables in a database should represent something, and it is possible to make them too generic. For example, a table called "Things" is probably too generic.
It may be possible to make constraints that go beyond what you expect. Your example of a "table base" with a "table wood" didn't make sense to me, but if you can expand on a specific example someone might be able to help with that.
Finally, if this is a small application for a single store then your design is going to have much less impact on the project outcome than it would if you were designing for an application that would be heavily used and constantly changed. This goes back to the "too generic" comment above. It is possible to overdesign a system when its use will be minimal and well-defined. I hope that makes sense.
Given your comment below about the table bases and woods, you could set up a table called TableAttributes (or something similar), with each possible option belonging to a particular table attribute type. You could then enforce, entirely through foreign keys, that any given option is only used for the attribute type to which it applies.
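One way that might look (illustrative DDL only, made-up names) - the composite foreign key does the enforcement without any triggers:

    CREATE TABLE TableAttributeType
    (
        AttributeTypeID int         PRIMARY KEY,   -- e.g. 1 = Base, 2 = EdgeWood, 3 = Size
        Name            varchar(50) NOT NULL
    );

    CREATE TABLE TableAttribute
    (
        AttributeID      int           PRIMARY KEY,
        AttributeTypeID  int           NOT NULL REFERENCES TableAttributeType (AttributeTypeID),
        Name             varchar(50)   NOT NULL,
        MaterialUnitCost decimal(10,2) NOT NULL,
        -- Allows the composite foreign key below.
        UNIQUE (AttributeID, AttributeTypeID)
    );

    CREATE TABLE TableLineItemAttribute
    (
        LineItemID      int NOT NULL,
        AttributeTypeID int NOT NULL,
        AttributeID     int NOT NULL,
        -- The composite key means a "base" slot can never point at a "wood" row.
        FOREIGN KEY (AttributeID, AttributeTypeID)
            REFERENCES TableAttribute (AttributeID, AttributeTypeID),
        PRIMARY KEY (LineItemID, AttributeTypeID)   -- one option per attribute type per line item
    );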
There is a tendency to over-abstract with database schema design, because the cost of change can be high. Myself, I like table names that are fairly descriptive. I often equate schema design with OO design. E.g., you wouldn't normally create a class named Thing, you would probably call it Product, Furniture, Item, something that relates to your business.
In the schema you have provided there is a mix of the abstract (spec) and the specific (TableType_LookupID). I would tend to equalize the level of abstraction, so use entities like:
ProductGroup (for the case where you have a product that is a collection of other products)
Product
ProductType
ProductDetail
ProductDetailType
etc.
Here's what my experience would tell me:
Which database structure do you prefer? Without a doubt, I'd go for approach one. Go for the simplest setup that might work. If you add complexity, always ask yourself, what value will it have to the customer?
What would be considered a best practice? That does indeed depend, among other things, on the size of the project and the expected rate of change. As a general rule, generic tables are worth it when you expect the customer to be adding new types. For example, if your customer wants to be able to add a new "color" entity to the table, you'd need generic tables. You can't predict beforehand what they will add.
What would an estimated time difference be in the two approaches? Not knowing your business, skill, and environment, it's impossible to give a valid estimate. The approach that you are confident in coding will take the least time. Here, my guess would be approach #1 could be 5x-50x as fast. Generic tables are hard, both on the database and the client side.
Option B.
Generic is generally better than specific. Software is already doomed to fail or reach its capacity because it is designed for only a certain set of tasks. If you build something generic, abstracted with a realistic analysis of where it might head, it will break less. As long as you stay away from over-abstraction and under-abstraction, that's probably the sweet spot.
In this case the adage "less code is more" probably applies, in that you wouldn't have to come back and rewrite it again.