How to automatically break down a SQL-like query with many joins into discrete, independent steps? - sql

Note: This is a learning exercise to learn how to implement a SQL-like relational database. This is just one thin slice of a question in the overall grand vision.
I have the following query, given a test database with a few hundred records:
select distinct "companies"."name"
from "companies"
inner join "projects" on "projects"."company_id" = "companies"."id"
inner join "posts" on "posts"."project_id" = "projects"."id"
inner join "comments" on "comments"."post_id" = "posts"."id"
inner join "addresses" on "addresses"."company_id" = "companies"."id"
where "addresses"."name" = 'Address Foo'
and "comments"."message" = 'Comment 3/3/2/1';
Here, the query is kind of unrealistic, but it demonstrates the point which I am trying to make. The point is to have a query with a few joins, so that I can figure out how to write this in sequential steps.
The first part of the question (which I think I've partially figured out) is: how do you write these joins as a sequence of independent steps, with the output of one fed into the input of the next? Also, is there more than one way to do it?
// step 1
let companies = select('companies')
// step 2
let projects = join(companies, select('projects'), 'id', 'company_id')
// step 3
let posts = join(projects, select('posts'), 'id', 'project_id')
// step 4
let comments = join(posts, select('comments'), 'id', 'post_id')
// step 5
let finalPosts = posts.filter(post => comments.some(comment => comment.post_id === post.id))
// step 6 (uses the already-filtered posts from step 5)
let finalProjects = projects.filter(project => finalPosts.some(post => post.project_id === project.id))
// step 7, could also be run in parallel to step 2 potentially
let addresses = join(companies, select('addresses'), 'id', 'company_id')
// step 8
let finalCompanies = companies.filter(company => {
  // posts have no company_id, so go through the filtered projects instead
  return finalProjects.some(project => project.company_id === company.id)
    && addresses.some(address => address.company_id === company.id)
})
These filters could probably be more optimized using indexes of some sort, but that is beside the point I think. This just demonstrates that there seem to be about 8 steps to find the companies we are looking for.
The main question is, how do you automatically figure out the steps from the SQL query?
I am not asking about how to parse the SQL query into an AST. Assume we have some sort of object structure we are dealing with, like an AST, to start.
How would you have to have the SQL query in structured object form, such that it would lead to these 8 steps? I would like to be able to specify a query (using a custom JSON-like syntax, not SQL), and then have it divide the query into these steps to divide and conquer so to speak and perform the queries in parts (for learning how to implement distributed databases). But I don't see how we go from SQL-like syntax, to 8 steps. Can you show how that might be done?
Here is the full code for the demo, which you can run with psql postgres -f test.sql. The result should be "Company 3".
Basically looking for a high level algorithm (doesn't even need to be code), which describes the key way you would break down some sort of AST-like object representation of a SQL query, into the actual planned steps of the query.
My algorithm looks like this in my head:
1. Represent the SQL query as an object tree.
2. Convert the object tree to steps.
I am not really sure what (1) should be structured like, and even if we had some sort of structure, I'm not sure how to get that to complete (2). Looking for more details on the implementations of these steps, mainly step (2).
My "object structure" for step 1 would be something like this:
const query = {
select: {
distinct: true,
columns: ['companies.name'],
from: ['companies'],
},
joins: [
{
type: 'inner join',
table: 'projects',
left: 'projects.company_id',
right: 'companies.id',
},
...
],
conditions: [
{
left: 'addresses.name',
op: '=',
right: 'Address Foo'
},
...
]
}
I am not sure how useful that is, and it doesn't relate to steps at all. At a high level, what kind of code would I have to write to convert that sort of object structure into steps? One potential avenue seems to be a topological sort on the joins. But then you need to combine that with the select and the conditions somehow, and I am not sure how you would even begin to know programmatically which step should come before which other step, or even what the steps are. Maybe if I could somehow break it into known "chunks", it would be simple to apply a topological sort after that, but then the question is still how to get from the object structure / SQL to chunks.
Basically, I have been reading about the theory behind "query planning and optimization", but don't know how to apply it in this regard. How did this site do it?
One aspect is breaking at least the where conditions into CNF.

Implementing joins is a huge topic which is probably out of scope for a StackOverflow answer.
If you're looking for practical information about how joins are implemented, I would suggest...
The Join Operation section of Use The Index, Luke for different types of join implementation.
Section 7 of The SQLite Query Optimizer Overview covers joins. Reading the SQLite source code also helps: it is about as small as a practical SQL implementation gets.
The output of EXPLAIN in PostgreSQL gives very detailed information about how it has implemented the query. The details are explained in Operator Optimization Information.
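As a rough illustration of step (2) from the question, here is a minimal sketch in Python of one possible way to turn an object like the question's `query` into ordered steps. It is not how any particular database does it. The assumptions: every condition touches a single table, so it can be attached to that table's scan; the joins form a tree reachable from the FROM table; and the step names (`scan`, `join`, `project`, `distinct`) and the `plan` function are made up for this example. Real planners also choose the join order by estimated cost, which this sketch ignores.

```python
from collections import deque

# Hypothetical query object, mirroring the structure shown in the question.
QUERY = {
    "select": {"distinct": True, "columns": ["companies.name"], "from": ["companies"]},
    "joins": [
        {"left": "projects.company_id", "right": "companies.id"},
        {"left": "posts.project_id", "right": "projects.id"},
        {"left": "comments.post_id", "right": "posts.id"},
        {"left": "addresses.company_id", "right": "companies.id"},
    ],
    "conditions": [
        {"left": "addresses.name", "op": "=", "right": "Address Foo"},
        {"left": "comments.message", "op": "=", "right": "Comment 3/3/2/1"},
    ],
}


def table_of(column):
    """Return the table part of a 'table.column' reference."""
    return column.split(".", 1)[0]


def plan(query):
    """Turn the query object into an ordered list of step tuples."""
    # 1. Group conditions by table so each one can be applied while scanning.
    filters = {}
    for cond in query["conditions"]:
        filters.setdefault(table_of(cond["left"]), []).append(cond)

    # 2. Build an undirected join graph: table -> [(neighbour, join), ...].
    graph = {}
    for join in query["joins"]:
        a, b = table_of(join["left"]), table_of(join["right"])
        graph.setdefault(a, []).append((b, join))
        graph.setdefault(b, []).append((a, join))

    # 3. Walk the graph breadth-first from the FROM table. Each newly reached
    #    table contributes a filtered scan followed by a join to the table it
    #    was reached from. Joins that would close a cycle are skipped here.
    root = query["select"]["from"][0]
    steps = [("scan", root, filters.get(root, []))]
    seen = {root}
    queue = deque([root])
    while queue:
        current = queue.popleft()
        for neighbour, join in graph.get(current, []):
            if neighbour in seen:
                continue
            seen.add(neighbour)
            steps.append(("scan", neighbour, filters.get(neighbour, [])))
            steps.append(("join", current, neighbour, join["left"], join["right"]))
            queue.append(neighbour)

    # 4. Projection and DISTINCT come last, once all joins are done.
    steps.append(("project", query["select"]["columns"]))
    if query["select"].get("distinct"):
        steps.append(("distinct",))
    return steps


if __name__ == "__main__":
    for number, step in enumerate(plan(QUERY), start=1):
        print(number, step)
```

Each `scan` step depends on nothing, and each `join` depends only on the scans of its two tables plus the joins emitted before it, so the list is already in a valid dependency order. Scans of different tables are independent of one another, which is where parallel or distributed execution could start.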

Related

Create subgraph query in Gremlin around single node with outgoing and incoming edges

I have a large JanusGraph database and I'd like to create a subgraph centered around one node type and including incoming and outgoing nodes of specific types.
In Cypher, the query would look like this:
MATCH (a:Journal)<-[:PublishedIn]-(b:Paper{paperTitle:'My Paper Title'})<-[:AuthorOf]-(c:Author)
RETURN a,b,c
This is what I tried in Gremlin:
sg = g.V().outE('PublishedIn').subgraph('j_p_a').has('Paper','paperTitle', 'My Paper Title')
.inE('AuthorOf').subgraph('j_p_a')
.cap('j_p_a').next()
But I get a syntax error. 'AuthorOf' and 'PublishedIn' are not the only edge types ending at 'Paper' nodes.
Can someone show me how to correctly execute this query in Gremlin?
As written in your query, the outE step yields edges, so the has step checks properties on those edges. After that, the query processor expects an inV, not another inE. Without your data model it is hard to know exactly what you need; however, looking at the Cypher, I think this is what you want.
sg = g.V().outE('PublishedIn').
subgraph('j_p_a').
inV().
has('Paper','paperTitle', 'My Paper Title').
inE('AuthorOf').
subgraph('j_p_a').
cap('j_p_a').
next()
Edited to add:
As I do not have your data I used my air-routes graph. I modeled this query on yours and used some select steps to limit the data size processed. This seems to work in my testing. Hopefully you can see the changes I made and try those in your query.
sg = g.V().outE('route').as('a').
inV().
has('code','AUS').as('b').
select('a').
subgraph('sg').
select('b').
inE('contains').
subgraph('sg').
cap('sg').
next()

Django filter on any matching column

The Django admin has a search field at the top of the page.
Take the User model as an example, with the search field configured as:
search_fields = ('name', 'phone', 'email', 'city')
I get a query parameter in a GET API, but my program doesn't know what it is: it could be a phone number, an email, a name, or anything else. I have to search the User model and return all rows that contain this value in any of the columns.
I want to write a Django query for this, but I'm not sure what the best-optimized way is.
Right now I use Q objects combined with OR,
e.g. Q(phone=param) | Q(email=param).
I would also like to know how to write this as a raw SQL query.
Q object is definitely right. You could do it dynamically like:
# reduce is not a builtin in Python 3, so import it explicitly
from functools import reduce
from django.db.models import Q
search_fields = ('name', 'phone', 'email', 'city')
q_objs = [Q(**{'%s' % i: param}) for i in search_fields]
queries = reduce(lambda x, y: x | y, q_objs)
results = Model.objects.filter(queries)
You can even do q_objs = [Q(**{'%s__icontains' % i: param}) for i in search_fields] to support incomplete search.
With the ORM you can do no more than chain Q objects. You can try to write your own SQL queries, but that is a lot of work and still won't give you the best possible search.
If you really want to have fast search that will deal with big amounts of data you need to take a look at haystack. It's a very good django module which makes search easy and fast.
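For the raw-SQL part of the question, here is a minimal sketch of what the equivalent query could look like. It ORs one comparison per column and binds the same parameter to each placeholder. It uses Python's built-in sqlite3 module purely so that it runs stand-alone. The `users` table, the sample rows, and the helper names `build_search_sql` and `search` are assumptions made for illustration. Table and column names come from constants in the code, never from user input.

```python
import sqlite3

SEARCH_FIELDS = ("name", "phone", "email", "city")


def build_search_sql(table, fields, partial=False):
    """Return a parameterized SELECT with one OR-ed comparison per column."""
    op = "LIKE" if partial else "="
    where = " OR ".join(f"{field} {op} ?" for field in fields)
    return f"SELECT * FROM {table} WHERE {where}"


def search(conn, param, partial=False):
    """Run the search; partial=True matches substrings, like __icontains."""
    sql = build_search_sql("users", SEARCH_FIELDS, partial)
    value = f"%{param}%" if partial else param
    # The same value is bound once per column placeholder.
    return conn.execute(sql, [value] * len(SEARCH_FIELDS)).fetchall()


# Hypothetical sample data so the sketch can be run as-is.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, phone TEXT, email TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?, ?, ?)",
    [
        ("Ann", "555-0100", "ann@example.com", "Oslo"),
        ("Bob", "555-0199", "bob@example.com", "Lima"),
    ],
)

if __name__ == "__main__":
    print(build_search_sql("users", SEARCH_FIELDS))
    print(search(conn, "Oslo"))
    print(search(conn, "example", partial=True))
```

The generated statement has the same shape as what the chained Q objects produce: a single WHERE clause with the column comparisons joined by OR.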

Best design patterns for refactoring code without breaking other parts

I have some PHP code from an application that was written using Laravel. One of the modules was written quite poorly. The controller of this module has a whole bunch of reporting functions which use the functions defined inside the model object to fetch data. And the functions inside the model object are super messy.
Following is a list of some of the functions from the controller (ReportController.php) and the model (Report.php)(I'm only giving names of functions and no implementations as my question is design related)
Functions from ReportController.php
questionAnswersReportPdf()
fetchStudentAnswerReportDetail()
studentAnswersReport()
wholeClassScoreReportGradebookCSV()
wholeClassScoreReportCSV()
wholeClassScoreReportGradebook()
formativeWholeClassScoreReportGradebookPdf()
wholeClassScoreReport()
fetchWholeClassScoreReport()
fetchCurriculumAnalysisReportData()
curriculumAnalysisReportCSV()
curriculumAnalysisReportPdf()
studentAnswersReportCSV()
fetchStudentScoreReportStudents()
Functions from Report.php
getWholeClassScoreReportData
getReportsByFilters
reportMeta
fetchCurriculumAnalysisReportData
fetchCurriculumAnalysisReportGraphData
fetchCurriculumAnalysisReportUsersData
fetchTestHistoryClassAveragesData
fetchAllTestHistoryClassAveragesData
fetchAllTestHistoryClassAveragesDataCsv
fetchHistoryClassAveragesDataCsv
fetchHistoryClassAveragesGraphData
The business logic has been written in quite a messy way also. Some parts of it are in the controller while other parts are in the model object.
I have 2 specific questions :
a) I have an ongoing goal of reducing code complexity and optimizing code structure. How can I leverage common OOP design patterns to ensure altering the code in any given report does not negatively affect the other reports? I specifically want to clean up the code for some critical reports first but want to ensure that by doing this none of the other reports will break.
b) The reporting module is relatively static in definition and unlikely to change over time. The majority of reports generated by the application involve nested sub-queries as well as standard grouping & filtering options. Most of these SQL queries have been housed within the functions of the model object and contain some really complex joins. Without spending time evaluating the database structure or table indices, which solution architecture techniques would you recommend for scaling the report functionality to ensure optimized performance? Below is a snippet of one of the SQL queries:
$sql = 'SELECT "Parent"."Id",
"Parent"."ParentId",
"Parent"."Name" as systemStandardName,
string_agg(DISTINCT((("SubsectionQuestions"."QuestionSerial"))::text) , \', \') AS "quesions",
count(DISTINCT("SubsectionQuestions"."QuestionId")) AS "totalQuestions",
case when sum("SQUA"."attemptedUsers")::float > 0 then
(COALESCE(round((
(
sum(("SQUA"."totalCorrectAnswers"))::float
/
sum("SQUA"."attemptedUsers")::float
)
*100
)::numeric),0))
else 0 end as classacuracy,
case when sum("SQUA"."attemptedUsers")::float > 0 then
(COALESCE((round(((1 -
(
(
sum(("SQUA"."totalCorrectAnswers"))::float
/
sum("SQUA"."attemptedUsers")::float
)
)
)::float * count(DISTINCT("SubsectionQuestions"."QuestionId")))::numeric,1)),0))
else 0 end as pgain
FROM "'.$gainCategoryTable.'" as "Parent"
'.$resourceTableJoin.'
INNER JOIN "SubsectionQuestions"
ON "SubsectionQuestions"."QuestionId" = "resourceTable"."ResourceId"
INNER JOIN "Subsections"
ON "Subsections"."Id" = "SubsectionQuestions"."SubsectionId"
LEFT Join (
Select "SubsectionQuestionId",
count(distinct case when "IsCorrect" = \'Yes\' then CONCAT ("UserId", \' \', "SubsectionQuestionId") else null end) AS "totalCorrectAnswers"
, count(distinct CONCAT ("UserId", \' \', "SubsectionQuestionId")) AS "attemptedUsers"
From "SubsectionQuestionUserAnswers"';
if(!empty($selectedUserIdsArr)){
$sql .= ' where "UserId" IN (' .implode (",", $selectedUserIdsArr).')' ;
}else {
$sql .= ' where "UserId" IN (' .implode (",", $assignmentUsers).')' ;
}
$sql .= ' AND "AssessmentAssignmentId" = '.$assignmentId.' AND "SubsectionQuestionId" IN ('.implode(",", $subsectionQuestions).') Group by "SubsectionQuestionId"
) as "SQUA" on "SQUA"."SubsectionQuestionId" = "SubsectionQuestions"."Id"
INNER JOIN "AssessmentAssignment"
ON "AssessmentAssignment"."assessmentId" = "Subsections"."AssessmentId"
INNER JOIN "AssessmentAssignmentUsers"
ON "AssessmentAssignmentUsers"."AssignmentId" = "AssessmentAssignment"."Id"
AND "AssessmentAssignmentUsers"."Type" = \'User\'
'.$conditaionlJoin.'
WHERE "Parent"."Id" IN ('.implode(',', $ssLeaf).')
'.$conditionalWhere.'
GROUP BY "Parent"."Id",
"Parent"."ParentId",
"Parent"."Name"
'.$sorter;
$results = DB::select(DB::raw($sql));
My take on a). In my experience when I want to reduce code complexity/sheer messiness I slowly refactor out code that violates the single responsibility principle while I'm working in that area already for either a bug fix or a feature update. I try not to spend hours upon hours of updating code that is "working" that I'm not actively updating for a business process reason. Follow the "Leave it better than you found it" approach as you work in this code base, and it will get better over time. Doing this will allow you to improve the code base while also getting features and bug fixes out the door, while also keeping business owners/project managers happy because you're keeping things moving.
About a): The first thing I do to ensure my refactoring doesn't break anything is to cover the code with, at least, unit tests (doing TDD gives the best coverage). It's easier when, as @DavidY says, you respect principles like SRP (does my class try to solve too many problems?). With tests, you'll feel safer when you need to refactor, and the tests will tell you exactly where it broke.
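To make the "cover it with tests first" idea concrete, here is a minimal sketch of a characterization test: record what the existing code returns for a set of inputs, then require the refactored code to return the same values. It is written in Python only for brevity; the same approach works with PHPUnit in a Laravel project. Both functions and the sample cases are hypothetical, loosely modeled on the accuracy percentage in the question's SQL, and are not taken from the actual application.

```python
def legacy_class_accuracy(total_correct, attempted_users):
    """Hypothetical stand-in for the messy existing logic."""
    if attempted_users > 0:
        result = (float(total_correct) / float(attempted_users)) * 100
        return int(round(result))
    else:
        return 0


def class_accuracy(total_correct, attempted_users):
    """Hypothetical refactored version that must behave identically."""
    if attempted_users <= 0:
        return 0
    return round(100 * total_correct / attempted_users)


# Inputs chosen to cover the zero-division guard and a few ordinary ratios.
CASES = [(0, 0), (0, 10), (3, 10), (10, 10), (1, 3), (2, 3)]


def mismatches():
    """Return every case where the refactored code disagrees with the legacy code."""
    return [
        case
        for case in CASES
        if legacy_class_accuracy(*case) != class_accuracy(*case)
    ]


if __name__ == "__main__":
    # An empty list means the refactor preserved behavior on all recorded cases.
    print(mismatches())
```

With one such test per report, a change to a shared helper that breaks another report shows up as a failing case rather than as a production bug.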
About b): Do not optimize until you need to, and optimize only once you know what costs you the most. That is the best way to know which pattern you need; otherwise you may force the wrong solution onto the wrong problem.

Why does the where() method run SQL queries after all nested relations are eager-loaded?

In my controller method for the index view I have the following line.
@students_instance = Student.includes(:memo_tests => {:memo_target => :memo_level})
So for each Student I eager-load all necessary info.
Later on in a .map block, I call the .where() method on one of the relations as shown below.
@all_students = @students_instance.map do |student|
...
last_pass = student.memo_tests.where(:result => true).last.created_at.utc
difference_in_weeks = ((last_pass.to_i - current_date.to_i) / 1.week).round
...
end
This leads to a separate SQL query for each student. Since I have over 300 students, it results in more than 300 SQL queries and very slow load times.
Am I right in thinking that this is caused by the .where() method? I think so because I have checked everything else, and these are the two lines that cause all of the queries.
More importantly, is there a better way to do this that reduces these queries to a single query?
The moment you ask where, the statement is translated to a query. Normally, the result should be sql-cached...
Anyway, in order to be sure, you can instead add programming logic to your statement. That way, you are not requesting a NEW sql statement.
last_pass = student.memo_tests.map {|m| m.created_at if m.result}.compact.sort.last
EDIT
I see the OP's question does not require sorting... So, leaving the sorting out:
last_pass = student.memo_tests.map {|m| m.created_at if m.result}.compact.last
compact is required to remove nil results from the array.
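The same idea, filtering the already-loaded records in memory instead of issuing a new query, can be sketched language-independently. The Python version below uses plain dicts as stand-ins for the eager-loaded `memo_tests` records; the `last_pass` helper and the sample data are assumptions for illustration. It takes the maximum timestamp rather than relying on load order, and returns `None` when a student has no passed test, a case in which the question's `.last.created_at` would raise an error.

```python
from datetime import datetime


def last_pass(memo_tests):
    """Return the newest created_at among passed tests, or None if there are none."""
    passed = [test["created_at"] for test in memo_tests if test["result"]]
    return max(passed) if passed else None


# Hypothetical records, as they might look after being loaded in one batch.
tests = [
    {"result": True, "created_at": datetime(2020, 1, 5)},
    {"result": False, "created_at": datetime(2020, 2, 1)},
    {"result": True, "created_at": datetime(2020, 1, 20)},
]

if __name__ == "__main__":
    print(last_pass(tests))
```

No database call happens inside `last_pass`, so looping over 300 students costs only the queries issued by the initial eager load.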

Optimize the query PostgreSql-8.4

I have the following Rails controller code:
@checked_contact_ids = @list.contacts.all(
:conditions => {
"contacts_lists.contact_id" => @list.contacts.map(&:id),
"contacts_lists.is_checked" => true
}
).map(&:id)
It is equivalent to this SQL:
SELECT *
FROM "contacts"
INNER JOIN "contacts_lists" ON "contacts".id = "contacts_lists".contact_id
WHERE ("contacts_lists".list_id = 67494 )
The query above takes a long time to run. Is there another way to run the same query in less time?
Or is the query above already good enough to produce the output?
I think the main problem with your original AR query is that it isn't doing any joins at all; you pull a bunch of objects out of the database via @list.contacts and then throw most of that work away to get just the IDs.
A first step would be to replace the "contacts_lists.contact_id" => @list.contacts.map(&:id) with a :joins => 'contact_lists' but you'd still be pulling a bunch of stuff out of the database, instantiating a bunch of objects, and then throwing it all away with the .map(&:id) to get just ID numbers.
You know SQL already so I'd probably go straight to SQL via a convenience method on your List model (or whatever @list is), something like this:
def checked_contact_ids
connection.execute(%Q{
SELECT contacts.id
FROM contacts
INNER JOIN contacts_lists ON contacts.id = contacts_lists.contact_id
WHERE contacts_lists.list_id = #{self.id}
AND contacts_lists.is_checked = 't'
}).map { |r| r['id'] }
end
And then, in your controller:
@checked_contact_ids = @list.checked_contact_ids
If that isn't fast enough then review your indexes on the contacts_lists table.
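As an illustration of what "review your indexes" could mean here, the sketch below runs the same ID-only join against an in-memory SQLite database (a stand-in for PostgreSQL so the example is self-contained) with a composite index on the columns used in the WHERE clause and the join. The index name, the sample rows, and the Python `checked_contact_ids` helper are assumptions made for this example. Whether PostgreSQL actually uses such an index depends on the data, which EXPLAIN will show.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE contacts (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE contacts_lists (
        contact_id INTEGER,
        list_id INTEGER,
        is_checked INTEGER
    );
    -- Hypothetical composite index: filter columns first, then the join column.
    CREATE INDEX idx_contacts_lists_lookup
        ON contacts_lists (list_id, is_checked, contact_id);
    """
)
conn.executemany("INSERT INTO contacts VALUES (?, ?)", [(1, "a"), (2, "b"), (3, "c")])
conn.executemany(
    "INSERT INTO contacts_lists VALUES (?, ?, ?)",
    [(1, 67494, 1), (2, 67494, 0), (3, 67494, 1), (3, 5, 1)],
)


def checked_contact_ids(conn, list_id):
    """Return only the IDs of checked contacts on the given list."""
    rows = conn.execute(
        """
        SELECT contacts.id
        FROM contacts
        INNER JOIN contacts_lists ON contacts.id = contacts_lists.contact_id
        WHERE contacts_lists.list_id = ?
          AND contacts_lists.is_checked = 1
        ORDER BY contacts.id
        """,
        (list_id,),  # bound parameter instead of string interpolation
    ).fetchall()
    return [row[0] for row in rows]


if __name__ == "__main__":
    print(checked_contact_ids(conn, 67494))
```

Selecting only `contacts.id` and binding `list_id` as a parameter keeps the result small and avoids building SQL by string interpolation.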
There's no good reason not to go straight to SQL when you know exactly what data you need and you need it fast; just keep the SQL isolated inside your models and you shouldn't have any problems.