Large Class Refactor Rules - oop

This is another question related to one I asked a few minutes ago. If I have a class that I believe has only one responsibility but contains a lot of business rules, and that class is large (about 4,000 lines or more), is it OK not to refactor it into multiple classes?

4,000 lines is too much. Either you have 500 methods or you have really long methods; I can't see how that can be manageable. It seems obvious, but I suggest you start by grouping similar methods and variables together, e.g. all cost data goes into a productCost class. Use query methods instead of calculated fields that are read by many methods.
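A minimal sketch of both suggestions (the class and field names are just invented for illustration; I've spelled the class ProductCost here):

class ProductCost:
    # Cost-related data and rules pulled out of the big class.
    def __init__(self, unit_price, quantity, tax_rate):
        self.unit_price = unit_price
        self.quantity = quantity
        self.tax_rate = tax_rate

    def total(self):
        # Query method: computed on demand, so no caller can read a stale,
        # pre-calculated "total_cost" field that some other method forgot to update.
        return self.unit_price * self.quantity * (1 + self.tax_rate)

class Order:
    # The formerly 4,000-line class delegates instead of recalculating.
    def __init__(self, cost):
        self.cost = cost

    def is_over_budget(self, budget):
        return self.cost.total() > budget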

A 4,000 line class isn't very maintainable. It might be hard to test pieces of the logic in isolation. A more practical reason to split it up is that multiple programmers can work on it in parallel if it is separated into multiple classes. This is a lot harder to do if it's one class.
You lose a lot of good software quality attributes by leaving this as a monolithic monster. There are better patterns to reduce its inner complexity, even if it truly is all cohesive.

I would say "no". 4,000 lines is much too large.
I would examine the business rules to see whether they imply that the class is really a composite. In particular, if it is possible to partition the set of business rules into sensible subsets, then each subset likely indicates that your class should be broken into components, each with its own set of business rules, and that the rules should be parceled out among the components.
I'd also look at refactoring the business rules into a more compact representation.
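To make the partitioning concrete, here is a minimal sketch (the rule-set names are invented for illustration):

class PricingRules:
    def apply(self, order):
        ...  # the pricing-related subset of business rules

class ShippingRules:
    def apply(self, order):
        ...  # the shipping-related subset

class TaxRules:
    def apply(self, order):
        ...  # the tax-related subset

class OrderProcessor:
    # The former monolith shrinks to composing its components.
    def __init__(self):
        self.rule_sets = [PricingRules(), ShippingRules(), TaxRules()]

    def process(self, order):
        for rules in self.rule_sets:
            rules.apply(order)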

Related

Is it better to use a boolean variable to replace an if condition for readability or not?

I am in the second year of my bachelor's degree in information technology. Last year in one of my courses they taught me to write clean code so other programmers have an easier time working with it. I learned a lot about writing clean code from a video ("Clean Code") on Pluralsight (a paid learning website which my school uses). There was an example in there about assigning if conditions to boolean variables and using them to enhance readability. In my course today my teacher told me this is very bad code because it decreases performance (in bigger programs) due to extra tests being executed. I was now wondering whether I should keep using boolean variables for readability or drop them for performance. I will illustrate with an example (I am using Python code for this example):
Example: boolean variable
Let's say we need to check whether somebody is legally allowed to drink alcohol. We get the person's age, and we know the legal drinking age is 21.
is_old_enough = persons_age >= legal_drinking_age
if is_old_enough:
    # do something
My teacher told me today that this would be very bad for performance, since two tests are performed: first persons_age >= legal_drinking_age is evaluated, and then the if performs a second test on whether is_old_enough is true.
My teacher told me that I should just put the condition in the if, but in the video they said that code should be read like natural language to make it clear for other programmers. I was wondering now which would be the better coding practice.
Example: condition in the if
if persons_age >= legal_drinking_age:
    # do something
In this example only one test is performed: whether persons_age >= legal_drinking_age. According to my teacher this is better code.
Thank you in advance!
Yours faithfully,
Jonas
I was wondering now which would be the better coding practice.
The really safe answer is: it depends.
I hate to give this answer, but you wouldn't be asking unless you had genuine doubt. (:
IMHO:
If the code will be used long-term, where maintainability is important, then clearly readable code is preferred.
If speed is crucial, then any code that uses fewer resources (smaller data sizes/types, fewer loops to achieve the same thing, optimized task sequencing, more CPU work per clock cycle, fewer data re-loading cycles) is better (example keyword: space-for-time trade-off).
If minimizing memory usage is crucial, then any code that uses less storage and memory to complete its operation (even if it takes more CPU cycles/loops for the same task) is better (example: small devices with limited data storage/RAM).
If you are in a race, then you may want to write the shortest code possible (even if it takes slightly longer CPU time later). Example: a hackathon.
If you are programming to teach a team of students or friends something, then readable code plus plenty of comments is definitely preferred.
If it were me, I'd stick to code as close to assembly language as possible (as much control over the bit manipulation as I can get) for backend development, and code as close to Mathematica-like code as possible (less code, maximum output, not really caring how much CPU/memory is needed) for frontend development. ( :
So, in your case, you may have your own requirements and preferences. From the point of view of users, outsiders, and customers, it is just a working or not-working program. Your definition of a good program may differ from others', but that shouldn't stop us from being flexible in coding style and method.
Happy exploring. Hope it helps in any way possible.
Performance
Performance is one of the least interesting concerns for this question, and I say this as one working in very performance-critical areas like image processing and raytracing who believes in effective micro-optimizations (but my ideas of effective micro-optimization would be things like improving memory access patterns and memory layouts for cache efficiency, not eliminating temporary variables out of fear that your compiler or interpreter might allocate additional registers and/or utilize additional instructions).
The reason it's not so interesting is that, as pointed out in the comments, any decent optimizing compiler is going to treat the two versions you wrote as equivalent by the time it finishes optimizing the intermediate representation and performs instruction selection/register allocation to produce the final output (machine code). And if you aren't using a decent optimizing compiler, then this sort of microscopic efficiency is probably the last thing you should be worrying about either way.
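If you'd rather measure than take anyone's word for it, here is a rough sketch using Python's standard timeit module (only an illustration; the absolute numbers depend on your machine and interpreter, and the difference is typically lost in the noise):

import timeit

setup = "persons_age = 30; legal_drinking_age = 21"

with_variable = (
    "is_old_enough = persons_age >= legal_drinking_age\n"
    "if is_old_enough:\n"
    "    pass\n"
)
inline_condition = (
    "if persons_age >= legal_drinking_age:\n"
    "    pass\n"
)

# Time a million iterations of each variant and compare.
print(timeit.timeit(with_variable, setup=setup, number=1_000_000))
print(timeit.timeit(inline_condition, setup=setup, number=1_000_000))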
Variable Scopes
Performance aside, the only concern I'd have with this convention (and I think it's generally a good one to apply liberally) is for languages that don't have a concept of a named constant to distinguish it from a variable.
In those cases, the more variables you introduce to a meaty function, the more intellectual overhead it can have as the number of variables with a relatively wide scope increases, and that can translate to practical burdens in maintenance and debugging in extreme cases. If you imagine a case like this:
some_variable = ...
...
some_other_variable = ...
...
yet_another_variable = ...
(300 more lines of code in the function)
... in some function, and you're trying to debug it, then those variables, combined with the monstrous size of the function, start to multiply the difficulty of trying to figure out what went wrong. That's a practical concern I've encountered when debugging codebases spanning millions of lines of code written by all sorts of people (including those no longer on the team), where it's not so fun to look at the locals watch window in a debugger and see two pages' worth of variables in some monstrous function that appears to be doing something incorrectly (or in one of the functions it calls).
But that's only an issue when it's combined with questionable programming practices, like writing functions that span hundreds or thousands of lines of code. In those cases, just focusing on making reasonable-sized functions that perform one clear logical operation and have no more than one side effect (or none, ideally, if the function can be written as a pure function) will often improve everything. If you design your functions reasonably, then I wouldn't worry about this at all; favor whatever is readable and easiest to comprehend at a glance, and maybe even what is most writable and "pliable" (to make changes to the function easier if you anticipate a future need).
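As a rough illustration (names invented), compare keeping every temporary in one big function versus pushing each step into a small function so each local only lives for a few lines:

# Before: every temporary is visible for hundreds of lines below it.
def process_order(order):
    base_price = order.quantity * order.unit_price
    discount = 0.1 if order.quantity > 100 else 0.0
    shipping = 5.0 if base_price < 50 else 0.0
    # ... hundreds more lines that can all see and modify the names above ...
    return base_price * (1 - discount) + shipping

# After: each helper is small, close to pure, and its locals die immediately.
def base_price(order):
    return order.quantity * order.unit_price

def discount_rate(order):
    return 0.1 if order.quantity > 100 else 0.0

def shipping_fee(price):
    return 5.0 if price < 50 else 0.0

def process_order_small(order):
    price = base_price(order)
    return price * (1 - discount_rate(order)) + shipping_fee(price)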
A Pragmatic View on Variable Scopes
So I think a lot of programming concepts can be understood to some degree just by understanding the need to narrow variable scopes. People say avoid global variables like the plague. We can go into issues with how that shared state can interfere with multithreading and how it makes programs difficult to change and debug, but you can understand a lot of the problems just through the desire to narrow variable scopes. If you have a codebase which spans a hundred thousand lines of code, then a global variable is going to have a scope of a hundred thousand lines of code for both access and modification, and, crudely speaking, a hundred thousand ways to go wrong.
At the same time, that pragmatic sort of view will find it pointless for a one-shot program which only spans 100 lines of code, with no future need for extension, to avoid global variables like the plague, since a global there only has 100 lines' worth of scope, so to speak. Meanwhile, even someone who avoids globals like the plague in all contexts might still write a class with member variables (including some superfluous ones for "convenience") whose implementation spans 8,000 lines of code, at which point those variables have a much wider scope than even the global variable in the former example. That realization could drive someone to design smaller classes and/or reduce the number of superfluous member variables included in the state the class manages (which can also translate to simpler multithreading and all the similar benefits of avoiding global variables in a non-trivial codebase).
And finally it'll tend to tempt you to write smaller functions as well, since a variable towards the top of some function spanning 500 lines of code is going to also have a fairly wide scope. So anyway, my only concern when you do this is to not let the scope of those temporary, local variables get too wide. And if they do, then the general answer is not necessarily to avoid those variables but to narrow their scope.

What is the best way to represent a form with hundreds of questions in a model

I am trying to design income tax return software.
What is the best way to represent/store a form with hundreds of questions in a model?
Just for this example, I need at least 6 models (T4, T4A(OAS), T4A(P), T1032, UCCB, T4E) which possibly contain hundreds of fields.
Is it by creating hundreds of fields? Storing values in a map? An array?
One very generic approach could be XML
XML allows you to
nest your data to any degree
combine values and meta information (attributes and elements)
describe your data in detail with XSD
store it externally
maintain it easily
even combine it with additional information (look at processing instructions)
and (last but not least) store the real data in almost the same format as the model...
and (laster but even not leaster :-) ) there is XSLT to transform your XML data into any other format (such as HTML for nice presentation)
There is high support for XML in all major languages and database systems.
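A hedged sketch of what a nested, self-describing form might look like, built with Python's standard xml.etree.ElementTree (the form and field names are invented placeholders, not the real T4 layout):

import xml.etree.ElementTree as ET

# Build a small, nested form description; attributes carry metadata,
# element text carries the value.
form = ET.Element("form", attrib={"name": "T4", "year": "2012"})
section = ET.SubElement(form, "section", attrib={"title": "Employment income"})
field = ET.SubElement(section, "field", attrib={"id": "box14", "type": "decimal"})
field.text = "45200.00"

# Prints something like:
# <form name="T4" year="2012"><section title="Employment income">...</section></form>
print(ET.tostring(form, encoding="unicode"))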
Another way could be a typical parts list (or bill of materials/BOM)
This tree structure is - typically - implemented as a table with a self-referenced parentID. Working with such a table needs a lot of recursion...
It is highly recommended to store your data in a type-safe way. Either use a character storage format plus a type identifier (which means you have to cast all your values here and there), or use separate type-safe side tables via references.
Furthermore, if your data is to be filled from lists, you should define a data source to load the selection lists dynamically.
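A rough sketch of the self-referencing table idea (the schema and names are invented), using SQLite from Python just to keep it concrete; walking the tree is done with a simple recursion:

import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE node (
        id        INTEGER PRIMARY KEY,
        parent_id INTEGER REFERENCES node(id),  -- NULL for the root
        label     TEXT NOT NULL,
        value     TEXT                          -- NULL for grouping nodes
    )
""")
con.executemany(
    "INSERT INTO node (id, parent_id, label, value) VALUES (?, ?, ?, ?)",
    [(1, None, "T4", None),
     (2, 1, "Employment income", None),
     (3, 2, "box14", "45200.00")])

def walk(parent_id, depth=0):
    # Recursive descent over the parent/child links.
    for node_id, label, value in con.execute(
            "SELECT id, label, value FROM node WHERE parent_id IS ?", (parent_id,)):
        print("  " * depth + f"{label}: {value}")
        walk(node_id, depth + 1)

walk(None)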
Conclusion
What is best for you mainly depends on your needs: How often will the model change? How many rules are there to guarantee data integrity? Are you using an RDBMS? Which language/tools are you using?
With a case like this, the monolithic aggregate is probably unavoidable (unless you can deduce common fields). I'm going to exclude RDBMS since the topic seems to focus more on lower-level data structures and a more proprietary-style solution, though that could be a very valid option that can manage all these fields.
In this case, I think it ceases to be so much about formalities and becomes more about daily practicalities.
Probably the worst option from that standpoint is a formal object aggregating fields, like a class or struct with a boatload of data members. Those tend to be the most awkward and the most unattractive as monoliths, since they tend to have a static nature about them. Depending on the language, declaration/definition/initialization could be separate, which means 2-3 lines of code to maintain per field. If you want to read/write these fields from a file, you have to write a separate line of code for each and every field, and maintain and update all that code as new fields are added or existing ones removed. If you start approaching anything resembling polymorphic needs in this case, you might have to write a boatload of branching code for each and every field, and that too has to be maintained.
So I'd say hundreds of fields in a static kind of aggregate is, by far, the most unmaintainable.
Arrays and maps are effectively the same thing to me here in a very language-agnostic sense provided that you need those key/value pairs, with only potential differences in where you store the keys and what kind of algorithmic complexity is involved. Whatever you do, probably a key search in this monolith should be logarithmic time or better. 'Maps/associative arrays' in most languages tend to inherently have this quality.
Those can be far more suitable, and you can achieve the kind of runtime flexibility that you like on top of those (like being able to manage these from a file and add the fields on the fly with no pre-existing knowledge). They'll be far more forgiving here.
So if the choice is between a bunch of fields in a class and something resembling a map, I'd suggest going for a map. The dynamic nature of it will be far more forgiving for these kinds of cases and will typically far outweigh the compile-time benefits of, say, checking to make sure a field actually exists and producing a syntax error otherwise. That kind of checking is easy to add back in and more if we just accept that it will occur at runtime.
An exception that might make the field solution more appealing is if you involve reflection and more dynamic techniques to generate an object with the appropriate fields on the fly. Then you get back those dynamic benefits and flexibility at runtime. But that might be more unwieldy to initialize the structure, could involve leaning a lot more heavily on heavy-duty (and possibly very computationally-expensive) introspection and type manipulation and code generation mechanisms, and also end up with more funky code that's hard to maintain.
So I think the safest bet is the map or associative array, and a language that lets you easily add new fields, inspect existing ones, etc. with very fast turnaround. If the language doesn't inherently have that quality, you could look to an external file to dynamically add fields, and just maintain the file.
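A minimal sketch of that map-based approach in Python (the file format and field names are just assumptions for illustration):

import json

# Field definitions live in data, not in code, so adding a field is a data change.
# A hypothetical definitions file might look like: {"box14": "decimal", "box22": "decimal"}
with open("t4_fields.json") as f:
    field_types = json.load(f)

form = {}                      # the "record" is just an associative array

def set_field(form, name, value):
    if name not in field_types:                   # the check a compiler would do,
        raise KeyError(f"unknown field: {name}")  # performed at runtime instead
    form[name] = value

set_field(form, "box14", "45200.00")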

Database design: how to avoid serialization when data structure is not static

I've recently been confronted with the need to design a database. Since this is my first time, I thought I'd better ask for some advice to make sure I'm building on solid foundations.
Goal
I'd like to store objects (POD structures best thought of as multi-maps) in an
SQL database for storage and querying. The objects' contents, as well as their 'structure', are continuously modified. The database will be accessed intensively through both queries and updates.
Use Case
First, each object should have a unique identifier.
Second, different types of objects exist. For example, ObjectA is an instance of ClassA. ClassA can have attributes A1, A2, A3, etc. As a result, ObjectA can (but isn't required to; NULL is allowed) have values for these attributes. However, each of these attributes may have more than one value, i.e. ObjectA.A1="foo" and ObjectA.A1="bar" are both possible. The number of attributes of ClassA can change. For simplicity's sake, attributes can only be added, not removed.
Third, attributes are not specific to one class, i.e. objects of ClassB can also have attributes A1, A2, etc. Thus ObjectB.A1="foo" is also possible. I'm not sure whether this changes anything, but I have a feeling it might in a design where each attribute corresponds to a table.
Finally, the following pseudo-queries and actions must be supported:
Get all the objects of type ClassA with attribute A1 equal to "bar".
Get all the attributes of ObjectB.
Add an attribute A4 to objects of type ClassA.
Add an object of type ClassC which has attributes A1="foobar", A2="bar".
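To make the shape of the data concrete, here is roughly how I picture one object (just pseudo-code with invented names, not a design I'm committed to):

# ObjectA is an instance of ClassA; each attribute can hold several values.
object_a = {
    "id": 42,
    "class": "ClassA",
    "attributes": {          # a multi-map: attribute name -> list of values
        "A1": ["foo", "bar"],
        "A2": ["baz"],
    },
}
# The queries above, in words:
#   all objects with class == "ClassA" and "bar" in attributes["A1"]
#   all attribute names of a given object
#   add attribute "A4" to every object of ClassA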
Limitations
First, I want to avoid serializing the data, so multiple values in a single column are out of the question. The database should be normalized and the data structures should be atomic. The database will be queried very frequently, so I cannot afford to waste time implementing a complex query mechanism; I would end up re-inventing the wheel (probably a square one as well).
Second, I cannot use any prior knowledge of an object's internal structure, as this only becomes available at run-time. For example, in the use case above, the attributes are not known beforehand. So while I have thought of a design where each attribute is a table, I cannot figure out how to get all the attributes of an object in such a set-up.
Environment
I'm using SQLite 3.7, C++.
Question
What would be an appropriate, flexible database design that meets the requirements of the described problem?
Any help, pointers or tips leading to useful insights or a solid design are very welcome.
Thanks!
ps: I have only basic theoretical familiarity and limited practical experience with relational databases, certainly no prior professional experience. I have been reading up on the subject the past week and have grasped some of the concepts which I think will be relevant to my case (normalization, foreign keys, etc), but I'm still going through my book at this moment.
If this is your first time out, and your project is as significant as it seems, you might want to invest the time and effort to learn the fundamentals from the ground up. CJ Date and many other authors have books and online tutorials that can take you through the fundamentals. They are excellent works.
There are some fields within IT that are dominated by almost complete adhocracy. Not so database design. To begin with, EF Codd laid the groundwork on a very solid mathematical basis some 42 years ago, and the basic model has held up very well over time. There has been progress, but almost no backtracking. And very little change for the sake of change.
SQL has likewise enjoyed a lot of stability over its long lifespan.
Next, trial and error in database design can be enormously costly. There are dozens of cases where unfortunate choices made by newbies have ended up costing millions in data investments that didn't pan out.
Trial and error has its place. Tips and tricks have their place. Answers on SO have their place. But so does formal learning.

How to separate parts of a long mathematical algorithm?

I have several pages of code. It's pretty ugly because it's doing a lot of "calculation" etc. But it consists of several phases, like many algorithms, like this:
calculate orders I want to leave
kill orders I want to leave but I can't leave because of volume restrictions
calculate orders I want to add
kill other orders I want to leave but I can't because of new orders
adjust new orders' amounts to fit the desired volume
In total I have 5 pages of ugly code which I want to separate at least by stage. But I don't want to introduce a separate method for each stage, because the stages only make sense together; a stage by itself is useless, so I think it would be wrong to create a separate method for each one.
I think I should use C#'s #region for separation. What do you think? Would you suggest something better?
Use private methods to separate the logic into small tasks; even if that logic is only used in one place, it increases the readability of the code by a lot.
Avoid #region directives for this purpose; they only sweep dirt under the carpet.
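A sketch of what that could look like for the stages listed in the question (Python-ish pseudo-code for brevity; the same shape works in C#, and the method names are just placeholders):

class OrderRebalancer:
    def rebalance(self):
        # The public method reads like the list of stages; nothing is lost
        # by the fact that the stages only make sense together.
        orders_to_leave = self._calculate_orders_to_leave()
        orders_to_leave = self._kill_orders_over_volume(orders_to_leave)
        new_orders = self._calculate_orders_to_add(orders_to_leave)
        orders_to_leave = self._kill_orders_conflicting_with(new_orders, orders_to_leave)
        return self._adjust_amounts_to_volume(new_orders, orders_to_leave)

    # Each private method is one stage: a page or less, testable in isolation.
    def _calculate_orders_to_leave(self): ...
    def _kill_orders_over_volume(self, orders): ...
    def _calculate_orders_to_add(self, orders): ...
    def _kill_orders_conflicting_with(self, new_orders, orders): ...
    def _adjust_amounts_to_volume(self, new_orders, orders): ...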
I second #RasmusFranke's advice, divide et impera: while separating functionality into methods you may notice that a bunch of methods happen to represent a concept which is class-worthy; then you can move those methods into a new class. Reusability is not the only reason to create methods.
Refactor, refactor, refactor. Keep in mind principles like SOLID while using techniques from Refactoring and Working Effectively with Legacy Code.
Take it slow and, if you can, use tools like ReSharper or Refactor! Pro, which help minimize the mistakes that can occur while refactoring.
Use your tests to check whether you broke anything, especially if you do not have access to the previously mentioned tools or if you are doing some major refactoring. If you don't have tests, try to write some, even if it may be daunting to write tests for legacy code.
Last but not least, do not touch it if you don't need to. If it works but it is "ugly" and it is not a part of your code needing changes, let it be.

Metrics & Object-oriented programming

I would like to know whether anybody regularly uses metrics to validate their code/design.
As an example, I think I will use:
number of lines per method (< 20)
number of variables per method (< 7)
number of parameters per method (< 8)
number of methods per class (< 20)
number of fields per class (< 20)
inheritance tree depth (< 6).
Lack of Cohesion in Methods
Most of these metrics are very simple.
What is your policy about this kind of measure? Do you use a tool to check them (e.g. NDepend)?
Imposing numerical limits on those values (as you seem to imply with the numbers) is, in my opinion, not a very good idea. The number of lines in a method could be very large if there is a significant switch statement, and yet the method is still simple and proper. The number of fields in a class can appropriately be very large if the fields are simple. And five levels of inheritance could be way too many, sometimes.
I think it is better to analyze the class cohesion (more is better) and coupling (less is better), but even then I am doubtful of the utility of such metrics. Experience is usually a better guide (though that is, admittedly, expensive).
A metric I didn't see in your list is McCabe's Cyclomatic Complexity. It measures the complexity of a given function, and has a correlation with bugginess. E.g. high complexity scores for a function indicate: 1) It is likely to be a buggy function and 2) It is likely to be hard to fix properly (e.g. fixes will introduce their own bugs).
Ultimately, metrics are best used at a gross level -- like control charts. You look for points above and below the control limits to identify likely special cases, then you look at the details. For example, a function with a high cyclomatic complexity may cause you to look at it, only to discover that it is appropriate because it is a dispatcher method with a number of cases.
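As a quick worked example (a toy function of my own, not from any of the posts above): cyclomatic complexity is roughly the number of decision points plus one, so you can count it by hand:

def classify(order):
    if order is None:                 # decision 1
        return "invalid"
    if order.total > 1000:            # decision 2
        if order.customer.is_vip:     # decision 3
            return "priority"
        return "review"
    for item in order.items:          # decision 4 (the loop)
        if item.back_ordered:         # decision 5
            return "delayed"
    return "normal"
# 5 decision points + 1 = cyclomatic complexity of 6:
# six independent paths through the function that tests would need to cover.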
Management by metrics does not work for people or for code; no metric or absolute value will always work. Please don't let a fascination with metrics distract you from truly evaluating the quality of the code. Metrics may appear to tell you important things about the code, but the best they can do is hint at areas to investigate.
That is not to say that metrics are not useful. Metrics are most useful when they are changing, to look for areas that may be changing in unexpected ways. For example, if you suddenly go from 3 levels of inheritance to 15, or 4 parameters per method to 12, dig in and figure out why.
Example: a stored procedure to update a database table may have as many parameters as the table has columns; an object interface to this procedure may have the same, or it may have just one if there is an object to represent the data entity. But the constructor for the data entity may have all of those parameters. So what would the metrics for this tell you? Not much! And if you have enough situations like this in the code base, the target averages will be blown out of the water.
So don't rely on metrics as absolute indicators of anything; there is no substitute for reading/reviewing the code.
Personally I think it's very difficult to adhere to these types of requirements (i.e. sometimes you just really need a method with more than 20 lines), but in the spirit of your question I'll mention some of the guidelines used in an essay called Object Calisthenics (part of the Thoughtworks Anthology if you're interested).
Levels of indentation per method (<2)
Number of 'dots' per line (<2)
Number of lines per class (<50)
Number of classes per package (<10)
Number of instance variables per class (<3)
He also advocates not using the 'else' keyword nor any getters or setters, but I think that's a bit overboard.
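To make the "dots per line" rule concrete, here is a small before/after sketch (names invented); it's essentially the Law of Demeter:

# Before (3 dots on one line): city = order.customer.address.city
# That line depends on the whole Customer/Address structure.

# After: each object only talks to its direct neighbour (at most 1 dot per line).
class Address:
    def __init__(self, city):
        self.city = city

class Customer:
    def __init__(self, address):
        self.address = address

    def shipping_city(self):
        return self.address.city

class Order:
    def __init__(self, customer):
        self.customer = customer

    def shipping_city(self):
        return self.customer.shipping_city()

city = Order(Customer(Address("Ghent"))).shipping_city()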
Hard numbers don't work for every solution. Some solutions are more complex than others. I would start with these as your guidelines and see where your project(s) end up.
But regarding these numbers specifically, they seem pretty high. I usually find that in my particular coding style I have:
no more than 3 parameters per method
about 5-10 lines per method
no more than 3 levels of inheritance
That isn't to say I never go over these generalities, but I usually think more about the code when I do because most of the time I can break things down.
As others have said, keeping to a strict standard is going to be tough. I think one of the most valuable uses of these metrics is to watch how they change as the application evolves. This helps to give you an idea how good a job you're doing on getting the necessary refactoring done as functionality is added, and helps prevent making a big mess :)
OO metrics are a bit of a pet project for me (it was the subject of my master's thesis). So yes, I'm using these, and I use a tool of my own.
For years the book "Object Oriented Software Metrics" by Mark Lorenz was the best resource for OO metrics. But recently I have seen more resources.
Unfortunately I have other deadlines so no time to work on the tool. But eventually I will be adding new metrics (and new language constructs).
Update
We are using the tool now to detect possible problems in the source. Several metrics we added (not all pure OO):
use of assert
use of magic constants
use of comments, in relation to the complexity of methods
statement nesting level
class dependency
number of public fields in a class
relative number of overridden methods
use of goto statements
There are still more. We keep the ones that give a good picture of the pain spots in the code, so we have direct feedback when these are corrected.