Best design patterns for refactoring code without breaking other parts - oop

I have some PHP code from an application that was written using Laravel. One of the modules was written quite poorly. The controller of this module has a whole bunch of reporting functions which uses the functions defined inside the model object to fetch data. And the functions inside the model object are super messy.
Following is a list of some of the functions from the controller (ReportController.php) and the model (Report.php)(I'm only giving names of functions and no implementations as my question is design related)
Functions from ReportController.php
questionAnswersReportPdf()
fetchStudentAnswerReportDetail()
studentAnswersReport()
wholeClassScoreReportGradebookCSV()
wholeClassScoreReportCSV()
wholeClassScoreReportGradebook()
formativeWholeClassScoreReportGradebookPdf()
wholeClassScoreReport()
fetchWholeClassScoreReport()
fetchCurriculumAnalysisReportData()
curriculumAnalysisReportCSV()
curriculumAnalysisReportPdf()
studentAnswersReportCSV()
fetchStudentScoreReportStudents()
Functions from Report.php
getWholeClassScoreReportData
getReportsByFilters
reportMeta
fetchCurriculumAnalysisReportData
fetchCurriculumAnalysisReportGraphData
fetchCurriculumAnalysisReportUsersData
fetchTestHistoryClassAveragesData
fetchAllTestHistoryClassAveragesData
fetchAllTestHistoryClassAveragesDataCsv
fetchHistoryClassAveragesDataCsv
fetchHistoryClassAveragesGraphData
The business logic has been written in quite a messy way also. Some parts of it are in the controller while other parts are in the model object.
I have 2 specific questions :
a) I have an ongoing goal of reducing code complexity and optimizing code structure. How can I leverage common OOP design patterns to ensure altering the code in any given report does not negatively affect the other reports? I specifically want to clean up the code for some critical reports first but want to ensure that by doing this none of the other reports will break.
b) The reporting module is relatively static in definition and unlikely to change over time. The majority of reports generated by the application involve nested sub-queries as well as standard grouping & filtering options. Most of these SQL queries have been hosed within the functions of the model object and contain some really complex joins. Without spending time evaluating the database structure or table indices, which solution architecture techniques would you recommend for scaling the report functionality to ensure optimized performance? Below is a snippet of one of the SQL queries
$sql = 'SELECT "Parent"."Id",
"Parent"."ParentId",
"Parent"."Name" as systemStandardName,
string_agg(DISTINCT((("SubsectionQuestions"."QuestionSerial"))::text) , \', \') AS "quesions",
count(DISTINCT("SubsectionQuestions"."QuestionId")) AS "totalQuestions",
case when sum("SQUA"."attemptedUsers")::float > 0 then
(COALESCE(round((
(
sum(("SQUA"."totalCorrectAnswers"))::float
/
sum("SQUA"."attemptedUsers")::float
)
*100
)::numeric),0))
else 0 end as classacuracy,
case when sum("SQUA"."attemptedUsers")::float > 0 then
(COALESCE((round(((1 -
(
(
sum(("SQUA"."totalCorrectAnswers"))::float
/
sum("SQUA"."attemptedUsers")::float
)
)
)::float * count(DISTINCT("SubsectionQuestions"."QuestionId")))::numeric,1)),0))
else 0 end as pgain
FROM "'.$gainCategoryTable.'" as "Parent"
'.$resourceTableJoin.'
INNER JOIN "SubsectionQuestions"
ON "SubsectionQuestions"."QuestionId" = "resourceTable"."ResourceId"
INNER JOIN "Subsections"
ON "Subsections"."Id" = "SubsectionQuestions"."SubsectionId"
LEFT Join (
Select "SubsectionQuestionId",
count(distinct case when "IsCorrect" = \'Yes\' then CONCAT ("UserId", \' \', "SubsectionQuestionId") else null end) AS "totalCorrectAnswers"
, count(distinct CONCAT ("UserId", \' \', "SubsectionQuestionId")) AS "attemptedUsers"
From "SubsectionQuestionUserAnswers"';
if(!empty($selectedUserIdsArr)){
$sql .= ' where "UserId" IN (' .implode (",", $selectedUserIdsArr).')' ;
}else {
$sql .= ' where "UserId" IN (' .implode (",", $assignmentUsers).')' ;
}
$sql .= ' AND "AssessmentAssignmentId" = '.$assignmentId.' AND "SubsectionQuestionId" IN ('.implode(",", $subsectionQuestions).') Group by "SubsectionQuestionId"
) as "SQUA" on "SQUA"."SubsectionQuestionId" = "SubsectionQuestions"."Id"
INNER JOIN "AssessmentAssignment"
ON "AssessmentAssignment"."assessmentId" = "Subsections"."AssessmentId"
INNER JOIN "AssessmentAssignmentUsers"
ON "AssessmentAssignmentUsers"."AssignmentId" = "AssessmentAssignment"."Id"
AND "AssessmentAssignmentUsers"."Type" = \'User\'
'.$conditaionlJoin.'
WHERE "Parent"."Id" IN ('.implode(',', $ssLeaf).')
'.$conditionalWhere.'
GROUP BY "Parent"."Id",
"Parent"."ParentId",
"Parent"."Name"
'.$sorter;
$results = DB::select(DB::raw($sql));

My take on a). In my experience when I want to reduce code complexity/sheer messiness I slowly refactor out code that violates the single responsibility principle while I'm working in that area already for either a bug fix or a feature update. I try not to spend hours upon hours of updating code that is "working" that I'm not actively updating for a business process reason. Follow the "Leave it better than you found it" approach as you work in this code base, and it will get better over time. Doing this will allow you to improve the code base while also getting features and bug fixes out the door, while also keeping business owners/project managers happy because you're keeping things moving.

about a) : The first thing I do, to ensure none of my refactoring is breaking anything, is covering the code with, at least, unitary tests (doing TDD ensures the most optimal coverage). It's easier when, like #DavidY says, you respect principles like SRP (does my class try to answer too many problems ?). With test, you'll feel safer when you'll need to refactor, and the tests will tell you exactly where it broke.
about b) : Do not optimize until you need it. And optimize only when you know what cost you the most. It's the best way to know what pattern you need, otherwise you may try to force the wrong solution into the wrong problem.

Related

How to automatically break down a SQL-like query with many joins into discrete, independent steps?

Note: This is a learning exercise to learn how to implement a SQL-like relational database. This is just one thin slice of a question in the overall grand vision.
I have the following query, given a test database with a few hundred records:
select distinct "companies"."name"
from "companies"
inner join "projects" on "projects"."company_id" = "companies"."id"
inner join "posts" on "posts"."project_id" = "projects"."id"
inner join "comments" on "comments"."post_id" = "posts"."id"
inner join "addresses" on "addresses"."company_id" = "companies"."id"
where "addresses"."name" = 'Address Foo'
and "comments"."message" = 'Comment 3/3/2/1';
Here, the query is kind of unrealistic, but it demonstrates the point which I am trying to make. The point is to have a query with a few joins, so that I can figure out how to write this in sequential steps.
The first part of the question is (which I think I've partially figured out), is how do you write these joins as a sequence of independent steps, with the output of one fed into the input of the other? Also, is there more than one way to do it?
// step 1
let companies = select('companies')
// step 2
let projects = join(companies, select('projects'), 'id', 'company_id')
// step 3
let posts = join(projects, select('posts'), 'id', 'project_id')
// step 4
let comments = join(posts, select('comments'), 'id', 'post_id')
// step 5
let finalPosts = posts.filter(post => !!comments.find(comment => comment.post_id === post.id))
// step 6
let finalProjects = projects.filter(project => !!posts.find(post => post.project_id === project.id))
// step 7, could also be run in parallel to step 2 potentially
let addresses = join(companies, select('addresses'), 'id', 'company_id')
// step 8
let finalCompanies = companies.filter(company => {
return !!posts.find(post => post.company_id === company.id)
&& !!addresses.find(address => address.company_id === company.id)
})
These filters could probably be more optimized using indexes of some sort, but that is beside the point I think. This just demonstrates that there seem to be about 8 steps to find the companies we are looking for.
The main question is, how do you automatically figure out the steps from the SQL query?
I am not asking about how to parse the SQL query into an AST. Assume we have some sort of object structure we are dealing with, like an AST, to start.
How would you have to have the SQL query in structured object form, such that it would lead to these 8 steps? I would like to be able to specify a query (using a custom JSON-like syntax, not SQL), and then have it divide the query into these steps to divide and conquer so to speak and perform the queries in parts (for learning how to implement distributed databases). But I don't see how we go from SQL-like syntax, to 8 steps. Can you show how that might be done?
Here is the full code for the demo, which you can run with psql postgres -f test.sql. The result should be "Company 3".
Basically looking for a high level algorithm (doesn't even need to be code), which describes the key way you would break down some sort of AST-like object representation of a SQL query, into the actual planned steps of the query.
My algorithm looks like this in my head:
represent SQL query in object tree.
convert object tree to steps.
I am not really sure what (1) should be structured like, and even if we had some sort of structure, I'm not sure how to get that to complete (2). Looking for more details on the implementations of these steps, mainly step (2).
My "object structure" for step 1 would be something like this:
const query = {
select: {
distinct: true,
columns: ['companies.name'],
from: ['companies'],
},
joins: [
{
type: 'inner join',
table: 'projects',
left: 'projects.company_id',
right: 'companies.id',
},
...
],
conditions: [
{
left: 'addresses.name',
op: '=',
right: 'Address Foo'
},
...
]
}
I am not sure how useful that is, but it doesn't relate to steps at all. At a high level, what kind of code would I have to write to convert that object sort of structure into steps? Seems like one potential avenue is do a topological sort on the joins. But then you need to combine that with the select and conditions somehow, not sure how you would even begin to programmatically know what step should be before what other step, or even what the steps are. Maybe if I somehow could break it into known "chunks", then it would be simple to apply TOP sort to it after that, but then the question is still, how to get into chunks from the object structure / SQL?
Basically, I have been reading about the theory behind "query planning and optimization", but don't know how to apply it in this regard. How did this site do it?
One aspect is breaking at least the where conditions into CNF.
Implementing joins is a huge topic which is probably out of scope for a StackOverflow answer.
If you're looking for practical information about how joins are implemented, I would suggest...
The Join Operation section of Use The Index, Luke for different types of join implementation.
Section 7 of the The SQLite Query Optimizer Overview covers joins. And reading the SQLite source code. It is about as small a practical SQL implementation will get.
The output of explain in Postgresql gives very detailed information about how it has implemented the query. And they are explained in Operator Optimization Information

PyPika control order of with clauses

I am using PyPika (version 0.37.6) to create queries to be used in BigQuery. I am building up a query that has two WITH clauses, and one clause is dependent on the other. Due to the dynamic nature of my application, I do not have control over the order in which those WITH clauses are added to the query.
Here is example working code:
a_alias = AliasedQuery("a")
b_alias = AliasedQuery("b")
a_subq = Query.select(Term.wrap_constant("1").as_("z")).select(Term.wrap_constant("2").as_("y"))
b_subq = Query.from_(a_alias).select("z")
q = Query.with_(a_subq, "a").from_(a_alias).select(a_alias.y)
q = q.with_(b_subq, "b").from_(b_alias).select(b_alias.z)
sql = q.get_sql(quote_char=None)
That generates a working query:
WITH a AS (SELECT '1' z,'2' y) ,b AS (SELECT a.z FROM a) SELECT a.y,b.z FROM a,b
However, if I add the b WITH clause first, then since a is not yet defined, the resulting query:
WITH b AS (SELECT a.z FROM a), a AS (SELECT '1' z,'2' y) SELECT a.y,b.z FROM a,b
does not work. Since BigQuery does not support WITH RECURSIVE, that is not an option for me.
Is there any way to control the order of the WITH clauses? I see the _with list in the QueryBuilder (the type of variable q), but since that's a private variable, I don't want to rely on that, especially as new versions of PyPika may not operate the same way.
One way I tried to do this is to always insert the first WITH clause at the beginning of the _with list, like this:
q._with.insert(0, q._with.pop())
Although this works, I'd like to use a PyPika supported way to do that.
In a related question, is there a supported way within PyPika to see what has already been added to the select list or other parts of the query? I noticed the q.selects member variable, but selects is not part of the public documentation. Using q.selects did not actually work for me when using our project's Python version (3.6) even though it did work in Python 3.7. The code I was trying to use is:
if any(field.name == "date" for field in q.selects if isinstance(field, Field))
The error I got was as follows:
def __getitem__(self, item: slice) -> "BetweenCriterion":
if not isinstance(item, slice):
> raise TypeError("Field' object is not subscriptable")
Thank you in advance for your help.
I could not figure out how to control the order of the WITH clauses after calling query.with_() (except for the hack already noted). As a result, I restructured my application to get around this problem. I am now calling query.with_() before building up the rest of the query.
This also made my related question moot, because I no longer need to see what I've already added to the query.

Django ORM Cross Product

I have three models:
class Customer(models.Model):
pass
class IssueType(models.Model):
pass
class IssueTypeConfigPerCustomer(models.Model):
customer=models.ForeignKey(Customer)
issue_type=models.ForeignKey(IssueType)
class Meta:
unique_together=[('customer', 'issue_type')]
How can I find all tuples of (custmer, issue_type) where there is no IssueTypeConfigPerCustomer object?
I want to avoid a loop in Python. A solution which solves this in the DB would be preferred.
Background: for every customer and for every issue-type, there should be a config in the DB.
If you can afford to make one database trip for each issue type, try something like this untested snippet:
def lacking_configs():
for issue_type in IssueType.objects.all():
for customer in Customer.objects.filter(
issuetypeconfigpercustomer__issue_type=None
):
yield customer, issue_type
missing = list(lacking_configs())
This is probably OK unless you have a lot of issue types or if you are doing this several times per second, but you may also consider having a sensible default instead of making a config object mandatory for each combination of issue type and customer (IMHO it is a bit of a design-smell).
[update]
I updated the question: I want to avoid a loop in Python. A solution which solves this in the DB would be preferred.
In Django, every Queryset is either a list of Model instances or a dict (values querysets), so it is impossible to return the format you want (a list of tuples of Model) without some Python (and possibly multiple trips to the database).
The closest thing to a cross product would be using the "extra" method without a where parameter, but it involves raw SQL and knowing the underlying table name for the other model:
missing = Customer.objects.extra(
select={"issue_type_id": 'appname_issuetype.id'},
tables=['appname_issuetype']
)
As a result, each Customer object will have an extra attribute, "issue_type_id", containing the id of one IssueType. You can use the where parameter to filter based on NOT EXISTS (SELECT 1 FROM appname_issuetypeconfigpercustomer WHERE issuetype_id=appname_issuetype.id AND customer_id=appname_customer.id). Using the values method you can have something close to what you want - this is probably enough information to verify the rule and create the missing records. If you need other fields from IssueType just include them in the select argument.
In order to assemble a list of (Customer, IssueType) you need something like:
cross_product = [
(customer, IssueType.objects.get(pk=customer.issue_type_id))
for customer in
Customer.objects.extra(
select={"issue_type_id": 'appname_issuetype.id'},
tables=['appname_issuetype'],
where=["""
NOT EXISTS (
SELECT 1
FROM appname_issuetypeconfigpercustomer
WHERE issuetype_id=appname_issuetype.id
AND customer_id=appname_customer.id
)
"""]
)
]
Not only this requires the same number of trips to the database as the "generator" based version but IMHO it is also less portable, less readable and violates DRY. I guess you can lower the number of database queries to a couple using something like this:
missing = Customer.objects.extra(
select={"issue_type_id": 'appname_issuetype.id'},
tables=['appname_issuetype'],
where=["""
NOT EXISTS (
SELECT 1
FROM appname_issuetypeconfigpercustomer
WHERE issuetype_id=appname_issuetype.id
AND customer_id=appname_customer.id
)
"""]
)
issue_list = dict(
(issue.id, issue)
for issue in
IssueType.objects.filter(
pk__in=set(m.issue_type_id for m in missing)
)
)
cross_product = [(c, issue_list[c.issue_type_id]) for c in missing]
Bottom line: in the best case you make two queries at the cost of legibility and portability. Having sensible defaults is probably a better design compared to mandatory config for each combination of Customer and IssueType.
This is all untested, sorry if some homework was left for you.

How to use LINQ to Entities to make a left join using a static value

I've got a few tables, Deployment, Deployment_Report and Workflow. In the event that the deployment is being reviewed they join together so you can see all details in the report. If a revision is going out, the new workflow doesn't exist yet new workflow is going into place so I'd like the values to return null as the revision doesn't exist yet.
Complications aside, this is a sample of the SQL that I'd like to have run:
DECLARE #WorkflowID int
SET #WorkflowID = 399 -- Set to -1 if new
SELECT *
FROM Deployment d
LEFT JOIN Deployment_Report r
ON d.FSJ_Deployment_ID = r.FSJ_Deployment_ID
AND r.Workflow_ID = #WorkflowID
WHERE d.FSJ_Deployment_ID = 339
The above in SQL works great and returns the full record if viewing an active workflow, or the left side of the record with empty fields for revision details which haven't been supplied in the event that a new report is being generated.
Using various samples around S.O. I've produced some Entity to SQL based on a few multiple on statements but I feel like I'm missing something fundamental to make this work:
int Workflow_ID = 399 // or -1 if new, just like the above example
from d in context.Deployments
join r in context.Deployment_Reports.DefaultIfEmpty()
on
new { d.Deployment_ID, Workflow_ID }
equals
new { r.Deployment_ID, r.Workflow_ID }
where d.FSJ_Deployment_ID == fsj_deployment_id
select new
{
...
}
Is the SQL query above possible to create using LINQ to Entities without employing Entity SQL? This is the first time I've needed to create such a join since it's very confusing to look at but in the report it's the only way to do it right since it should only return one record at all times.
The workflow ID is a value passed in to the call to retrieve the data source so in the outgoing query it would be considered a static value (for lack of better terminology on my part)
First of all don't kill yourself on learning the intricacies of EF as there are a LOT of things to learn about it. Unfortunately our deadlines don't like the learning curve!
Here's examples to learn over time:
http://msdn.microsoft.com/en-us/library/bb397895.aspx
In the mean time I've found this very nice workaround using EF for this kind of thing:
var query = "SELECT * Deployment d JOIN Deployment_Report r d.FSJ_Deployment_ID = r.Workflow_ID = #WorkflowID d.FSJ_Deployment_ID = 339"
var parm = new SqlParameter(parameterName="WorkFlowID" value = myvalue);
using (var db = new MyEntities()){
db.Database.SqlQuery<MyReturnType>(query, parm.ToArray());
}
All you have to do is create a model for what you want SQL to return and it will fill in all the values you want. The values you are after are all the fields that are returned by the "Select *"...
There's even a really cool way to get EF to help you. First find the table with the most fields, and get EF to generated the model for you. Then you can write another class that inherits from that class adding in the other fields you want. SQL is able to find all fields added regardless of class hierarchy. It makes your job simple.
Warning, make sure your filed names in the class are exactly the same (case sensitive) as those in the database. The goal is to make a super class model that contains all the fields of all the join activity. SQL just knows how to put them into that resultant class giving you strong typing ability and even more important use-ability with LINQ
You can even use dataannotations in the Super Class Model for displaying other names you prefer to the User, this is a super nice way to keep the table field names but show the user something more user friendly.

Translate SQL to OCL?

I have a piece of SQL that I want to translate to OCL. I'm not good at SQL so I want to increase maintainability by this. We are using Interbase 2009, Delphi 2007 with Bold and modeldriven development. Now my hope is that someone here both speaks good SQL and OCL :-)
The original SQL:
Select Bold_Id, MessageId, ScaniaId, MessageType, MessageTime, Cancellation, ChassieNumber, UserFriendlyFormat, ReceivingOwner, Invalidated, InvalidationReason,
(Select Parcel.MCurrentStates From Parcel
Where ScaniaEdiSolMessage.ReceivingOwner = Parcel.Bold_Id) as ParcelState From ScaniaEdiSolMessage
Where MessageType = 'IFTMBP' and
not Exists (Select * From ScaniaEdiSolMessage EdiSolMsg
Where EdiSolMsg.ChassieNumber = ScaniaEdiSolMessage.ChassieNumber and EdiSolMsg.ShipFromFinland = ScaniaEdiSolMessage.ShipFromFinland and EdiSolMsg.MessageType = 'IFTMBF') and
invalidated = 0 Order By MessageTime desc
After a small simplification:
Select Bold_Id, (Select Parcel.MCurrentStates From Parcel
where ScaniaEdiSolMessage.ReceivingOwner = Parcel.Bold_Id) From ScaniaEdiSolMessage
Where MessageType = 'IFTMBP' and not Exists (Select * From ScaniaEdiSolMessage
EdiSolMsg Where EdiSolMsg.ChassieNumber = ScaniaEdiSolMessage.ChassieNumber and
EdiSolMsg.ShipFromFinland = ScaniaEdiSolMessage.ShipFromFinland and
EdiSolMsg.MessageType = 'IFTMBF') and invalidated = 0
NOTE: There are 2 cases for MessageType, 'IFTMBP' and 'IFTMBF'.
So the table to be listed is ScaniaEdiSolMessage.
It has attributes like:
MessageType: String
ChassiNumber: String
ShipFromFinland: Boolean
Invalidated: Boolean
It has also a link to table Parcel named ReceivingOwner with BoldId as key.
So it seems like it list all rows of ScaniaEdiSolMessage and then have a subquery that also list all rows of ScaniaEdiSolMessage and name it EdiSolMsg. Then it exclude almost all rows. In fact the query above give one hit from 28000 records.
In OCL it is easy to list all instances:
ScaniaEdiSolMessage.allinstances
Also easy to filter rows by select for example:
ScaniaEdiSolMessage.allinstances->select(shipFromFinland and not invalidated)
But I do not understand how I should make a OCL to match the SQL above.
Listen to Gabriel and Stephanie, learn more SQL.
You state that you want to make the code more maintainable, yet the number of developers who understand SQL is greater by far than the number of developers who understand OCL.
If you leave the project tomorrow after converting this to OCL, the chances that you'd be able to find someone who could maintain the OCL are very slim. However, the chances that you could find someone to maintain the SQL are very high.
Don't try to fit a square peg in a round hole just because you're good with round hammers :)
There is a project, Dresden OCL, that might help you.
Dresden OCL provides a set of tools to parse and evaluate OCL constraints on various models like UML, EMF and Java. Furthermore Dresden OCL provides tools for Java/AspectJ and SQL code generation. The tools of Dresden OCL can be either used as a library for other project or as a plug-in project that extends Eclipse with OCL support.
I haven't used it, but there is a demo showing how the tool generates SQL from a model and OCL constraints. I realize you're asking for the opposite, but maybe using this you can figure it out. There is also a paper that describes OCL->SQL transformations by the same people.
With MDriven (successor of Bold for Delphi) I would do it like this:
When working with OCL to SQL everything becomes easier if you think about the different set's of information you need to check - and then use ocl operators as ->intersection to find the set you are after.
So in your case you might have a set like this:
ScaniaEdiSolMessage.allinstances->select(shipFromFinland and not invalidated)
but you also have a set like this:
ScaniaEdiSolMessage.allinstances->select(m|m.ReceivingOwner.MessageType = 'IFTMBP')
And you further more have this criteria:
Parcel.allinstances->select(p|p.Messages->exists(m|m.MessageType = 'IFTMBF')).Messages
If all these Sets have the same result type (collection of ScaniaEdiSolMessage) you can simply intersect them to get your desired result
ScaniaEdiSolMessage.allinstances->select(shipFromFinland and not invalidated)
->intersection(ScaniaEdiSolMessage.allinstances->select(m|m.ReceivingOwner.MessageType = 'IFTMBP'))
->intersection(Parcel.allinstances->select(p|p.Messages->exists(m|m.MessageType = 'IFTMBF')).Messages
)
And looking at that we can reduce it a bit to:
ScaniaEdiSolMessage.allinstances
->select(m|m.shipFromFinland and (not m.invalidated) and
(m.ReceivingOwner.MessageType = 'IFTMBP'))
->intersection(Parcel.allinstances->select(p|
p.Messages->exists(m|m.MessageType = 'IFTMBF')).Messages
)