Best practice for scaling SQL queries on joins? - sql

I'm writing a REST API that works with SQL and constantly find myself in situations like this one, where I need to return lists of objects with nested lists inside each object by querying over table joins.
Let's say I have a many-to-many relationship between Users and Groups. I have a User table and a Group table and a junction table UserGroup between them. Now I want to write a REST endpoint that returns a list of users, and for each user the groups they are enrolled in. I want to return JSON with a format like this:
[
  {
    "username": "test_user1",
    <other attributes ...>
    "groups": [
      {
        "group_id": 2,
        <other attributes ...>
      },
      {
        "group_id": 3,
        <other attributes ...>
      }
    ]
  },
  {
    "username": "test_user2",
    <other attributes ...>
    "groups": [
      {
        "group_id": 1,
        <other attributes ...>
      },
      {
        "group_id": 2,
        <other attributes ...>
      }
    ]
  },
  etc ...
]
There are two or three ways to query SQL for this that I can think of:
Issue a variable number of SQL queries: Query for a list of Users, then loop over each user, querying over the junction linkage to populate that user's groups list. The number of SQL queries increases linearly with the number of users returned.
Example (using Python flask_sqlalchemy / flask_restx):
users = db.session.query(User).filter( ... )
results = []
for u in users:
    groups = db.session.query(Group).join(UserGroup, UserGroup.group_id == Group.id) \
        .filter(UserGroup.user_id == u.id)
    results.append({**u.__dict__, 'groups': [g.__dict__ for g in groups]})
retobj = api.marshal(results, my_model)
# Total number of queries: 1 + number of users in the result
Issue a constant number of SQL queries: This can be done by issuing one monolithic SQL query performing all the joins, with potentially lots of redundant data in the User columns, or, often preferably, a few separate SQL queries. For example, query for a list of Users, then query the Group table joining on UserGroup, then manually group the groups in server code.
example code:
from collections import defaultdict

users = db.session.query(User).filter( ... )
uids = [u.id for u in users]
groups = db.session.query(UserGroup.user_id, Group).join(Group, UserGroup.group_id == Group.id) \
    .filter(UserGroup.user_id.in_(uids))
aggregate = defaultdict(list)
for user_id, group in groups:
    aggregate[user_id].append(group.__dict__)
retobj = api.marshal([{**u.__dict__, 'groups': aggregate[u.id]} for u in users], my_model)
# Total number of queries: 2
The third approach, with limited usefulness, is to use string_agg or a similar function to force SQL to concatenate a grouping into one string column, then unpack the string into a list server-side. For example, if all I wanted were the group ids, I could use string_agg and GROUP BY to get back "1,2" in one query to the User table. But this is only useful if you don't need complex objects.
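For illustration, a minimal sketch of what that third approach could look like with the flask_sqlalchemy setup above, assuming a Postgres backend (for string_agg) and that UserGroup is the junction model; these exact calls are an assumption, not code from the question:

from sqlalchemy import func, cast, String

# One query: a "2,3"-style string of group ids per user, built by Postgres's string_agg
rows = (
    db.session.query(UserGroup.user_id,
                     func.string_agg(cast(UserGroup.group_id, String), ','))
    .group_by(UserGroup.user_id)
    .all()
)
# Unpack the concatenated strings server-side; only the ids are available, not full Group objects
group_ids = {user_id: ids.split(',') for user_id, ids in rows}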
I'm attracted to the second approach because it feels more efficient and scalable: the number of SQL queries (which I have assumed is the main bottleneck, for no particularly good reason) stays constant, but it takes some more work on the server's side to sort all the groups into each user. Then again, I thought part of the point of using SQL is to take advantage of its efficient sorting/filtering so you don't have to do it yourself.
So my question is: am I right in thinking that it's a good idea to make the number of SQL queries constant at the expense of more server-side processing and dev time? Is it a waste of time to try to whittle down the number of unnecessary SQL queries? Will I regret it if I don't, once the API is tested at scale? Is there a better way to solve this problem that I'm not aware of?

Using the joinedload option you can load all the data with just one query:
q = (
    session.query(User)
    .options(db.joinedload(User.groups))
    .order_by(User.id)
)
users = q.all()
for user in users:
    print(user.name)
    for ug in user.groups:
        print(" ", ug.name)
When you run the query above, all the groups will already have been loaded from the database, using a query similar to the one below:
SELECT "user".id,
"user".name,
group_1.id,
group_1.name
FROM "user"
LEFT OUTER JOIN (user_group AS user_group_1
JOIN "group" AS group_1 ON group_1.id = user_group_1.group_id)
ON "user".id = user_group_1.user_id
And now you only need to serialize the result with proper schema.
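Note that joinedload(User.groups) relies on the models declaring the many-to-many relationship through the junction table. A minimal sketch of what that mapping could look like in flask_sqlalchemy (attribute names here are illustrative assumptions, not taken from the question):

class User(db.Model):
    __tablename__ = 'user'
    id = db.Column(db.Integer, primary_key=True)
    name = db.Column(db.String)
    # many-to-many to Group through the user_group junction table;
    # this is the relationship that joinedload(User.groups) traverses
    groups = db.relationship('Group', secondary='user_group', backref='users')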

Related

Sequelize raw include/join

I want to be able to do the following simple SQL query using Sequelize:
SELECT * FROM one
JOIN (SELECT COUNT(*) AS count, two_id FROM two GROUP BY two_id) AS table_two
ON one.two_id = table_two.two_id
I can't seem to find anything about a raw include or raw model.
For performance reasons, I don't want a subselect in the main query (which I know Sequelize already handles well), i.e.:
SELECT * FROM one, (SELECT COUNT(*) AS count FROM two WHERE one.two_id = two.two_id) AS count
Regarding the following Sequelize code (models One and Two exist):
models.One.findAll({
  include: [
    {
      model: models.Two,
      // what to add here in order to get the example SQL
    },
  ],
})
Seems like I found a somewhat hacky workaround:
You can use fn inside the attributes selection to inject any SQL keyword (like JOIN), resulting in something like this for my use case:
const { fn, literal } = require('sequelize');

models.One.findAll({
  attributes: [
    fn('JOIN', literal('SELECT COUNT(*) AS count FROM two WHERE one.two_id = two.two_id')),
  ],
});
Note you can do that only on the last attribute (otherwise the JOIN ends up misplaced in the generated SQL).

Filter a Postgres Table based on multiple columns

I'm working on a shopping website. The user selects multiple filters and sends the request to the backend, which is in Node.js and uses Postgres as the DB.
So I want to search for the required data in a single query.
I have a JSON object containing all the filters that the user selected. I want to use them in the Postgres query and return the obtained results to the user.
I have a Postgres table that contains a few products.
name          Category  Price
------------------------------
LOTR          Books     50
Harry Potter  Books     30
Iphone13      Mobile    1000
SJ8           Cameras   200
I want to filter the table using n number of filters in a single query.
I have to make it work for multiple filters such as the ones mentioned below. So I don't have to write multiple queries for different filters.
{ category: 'Books', price: '50' }
{ category: 'Books' }
{category : ['Books', 'Mobiles']}
I can query the table using
SELECT * FROM products WHERE category='Books' AND price='50'
SELECT * FROM products WHERE category='Books'
SELECT * FROM products WHERE category='Books' OR category='Mobiles'
respectively.
But I want to write my query in such a way that it populates the Keys and Values dynamically. So I may not have to write separate query for every filter.
I have obtained the key and value pairs from the request.query and saved them
const params = req.query;
const keys: string = Object.keys(params).join(",");
const values: string[] = Object.values(params);
const indices = Object.keys(params).map((obj, i) => {
  return "$" + (i + 1);
});
But I'm unable to pass them in the query in a correct manner.
Does anybody have a suggestion for me? I'd highly appreciate any help.
Thank you in advance.
This is not the way you filter data from a SQL database table.
You need to use the NodeJS pg driver to connect to the database, then write a SQL query. I recommend prepared statements.
A query would look like:
SELECT * FROM my_table WHERE price < ...
At least based on your question, it is unclear to me why you would want to do these manipulations in JavaScript, or what you really want to accomplish.

SQL dynamic WHERE clause DataSet

I have SQL queries with WHERE clauses that have to exclude rows based on lists of values in some columns; these lists may be hard-coded (supplied by the users) or constructed from another SELECT query.
Also, the hard-coded lists may be updated by the users, so every time I need to update the lists in the query, which is inconvenient.
I am wondering about the best way to parameterize these lists.
Example of a WHERE clause:
WHERE
Article_Code not in ('PA_003','PA_003','PE_234','FR_980','FA_333','FC_001','TA_999','FC_212','DC_009','FF_333','PR_001')
AND
((Partner_Status != 'Radied') or (Partner_Status = 'Radied' and Partner_Code in ('PR_000453','PR_0004311T','PR_V3345','PR_004D55') ))
AND
(Case_Code not in (select Case_Code from Agreement where DDR = 3))
One thought is to build a parameter table with this structure: (ExclusionCode - Column - ColumnMemberToExclude - ExclusionDescription):
ExclusionCode is an internal code that I generate to identify the reason for the exclusion.
Column is the column to use in the WHERE clause (e.g. Article_Code).
ColumnMemberToExclude is the value to exclude in the WHERE clause (e.g. PA_003).
ExclusionDescription is a functional description (e.g. exclude the list of Porsche products).
and then construct the WHERE clause as a string from this table.
Is this the best way to do it?

Rails SQL query optimization

I have a featured section in my website that contains featured posts of three types: normal, big and small. Currently I am fetching the three types in three separate queries like so:
@featured_big_first = Post.visible.where(pinged: 1).where('overlay_type = ?', :big).limit(1)
@featured_big_first = Post.visible.where(pinged: 1).where('overlay_type = ?', :small).limit(1)
@featured_big_first = Post.visible.where(pinged: 1).where('overlay_type = ?', :normal).limit(5)
Basically I am looking for a query that will combine those three in to one and fetch 1 big, 1 small, 5 normal posts.
I'd be surprised if you don't want an order. As you have it, it is supposed to find a random small post, a random big post, and 5 random normal posts.
Yes, you can use a UNION. However, you will have to execute raw SQL. Look at the log for the SQL generated by each of your three queries, and execute a string consisting of those three queries with UNION in between. It might work, or it might have problems with the LIMIT.
It is possible in SQL by joining the table to itself, doing a GROUP BY on one of the aliases for the table, adding a WHERE condition that the other aliased table is <= the grouped alias, and adding a HAVING clause where the count from the <= alias is under the limit.
So, if you had a simple query of the posts table (without the visible and pinged conditions) and wanted the records with the latest created_at date, then the normal query would be:
SELECT posts1.*
FROM posts posts1, posts posts2
WHERE posts2.created_at >= posts1.created_at
AND posts1.overlay_type = 'normal'
AND posts2.overlay_type = 'normal'
GROUP BY posts1.id
HAVING count(posts2.id) <= 5
Take this SQL, and add your conditions for visible and pinged, remembering to use the condition for both posts1 and posts2.
Then write the big and small versions and UNION it all together.
I'd stick with the three database calls.
I don't think this is possible, but you can use a scope, which is the more Rails way to write the code.
Also, it may be just a typo, but you are reassigning @featured_big_first, so it will contain the data of the last query only.
In post.rb:
scope :overlay_type_record, lambda { |type| joins(:visible).where(["visible.pinged = 1 AND visible.overlay_type = ?", type]) }
and in the controller:
@featured_big_first = Post.overlay_type_record(:big).limit(1)
@featured_small_first = Post.overlay_type_record(:small).limit(1)
@featured_normal_first = Post.overlay_type_record(:normal).limit(5)

SQL query performance for a 'map' statement

I am using Ruby on Rails 3 and I would like to know what is the performance difference for these query statements:
# Case 1
accounts = ids.map { |id| Account.find_by_id(id) }
# Case 2
accounts = ids.map { |id| Account.where(:id => id).first }
Is there another way to do things better? If there are 100 ids, how can I limit the search so that only 5 accounts are returned?
As @RubyFanatic said, there's no real difference between those two (they'd both generate the same query), but there is a considerably better way of doing it:
accounts = Account.where(:id => ids)
This will generate SQL like select * from accounts where accounts.id in (1,2,3) and will be considerably faster than finding them one at a time.
And if you want to only use say 5 of the ids from the array of ids, you'd need to decide which 5 to use. For example, if you wanted to use the first 5;
accounts = Account.where(:id => ids[0..4])
Or, you could use limit, but this makes the query have to do a little more work still if the ids array is large:
accounts = Account.where(:id => ids).limit(5)
There should be no performance difference for these two queries. They are literally doing the same thing. The second statement might be slightly slower but it's so minuscule that it does not even matter.