Ecto Query: Count/1 of left_joined value multiplied by outer_join value count - sql

I have an items postgres table that has_many bookmarks and has_many notes. I am querying these with {:ecto_sql, "~> 3.7"}.
I want a query that returns all finished items that have notes, and I want that query to also count that item's bookmarks.
When I left_join an item's bookmarks and select_merge count(bookmarks), I get the proper count, but when I add an inner_join on the item's notes, the count of bookmarks is multiplied by the number of notes, e.g. if an item has 2 bookmarks and 4 notes, bookmark_count will be 8 when it should be 2.
Here is my funky ecto query:
from item in Item,
where: item.finished == true,
left_join: bookmark in assoc(item, :bookmarks),
on: bookmark.item_id == item.id and bookmark.deleted == false,
select_merge: %{bookmark_count: count(bookmark)},
inner_join: note in assoc(item, :notes),
on: note.accepted == true
Many thanks in advance for feedback/guidance!

Basically: aggregate the N-side before joining to avoid multiplying rows from the main table. Faster, too. See:
Two SQL LEFT JOINS produce incorrect result
Use a semi-join for notes with EXISTS to only verify the existence of a related qualifying row. This also never multiplies rows.
This query should implement your objective:
SELECT i.*, COALESCE(b.ct, 0) AS bookmark_count
FROM items i
LEFT JOIN (
SELECT b.item_id AS id, count(*) AS ct
FROM bookmarks b
WHERE NOT b.deleted
GROUP BY 1
) b USING (id)
WHERE i.finished
AND EXISTS (
SELECT FROM notes n
WHERE n.item_id = i.id
AND n.accepted
);
I slipped in a couple other minor improvements.
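To see the multiplication and the fix side by side, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for Postgres. The table layout and sample rows are assumed from the question (2 bookmarks, 4 notes). Booleans are stored as 0/1 because SQLite has no boolean type, and the EXISTS subquery uses SELECT 1 because SQLite does not accept an empty select list.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE items     (id INTEGER PRIMARY KEY, finished INTEGER NOT NULL);
    CREATE TABLE bookmarks (id INTEGER PRIMARY KEY, item_id INTEGER, deleted INTEGER NOT NULL);
    CREATE TABLE notes     (id INTEGER PRIMARY KEY, item_id INTEGER, accepted INTEGER NOT NULL);
    INSERT INTO items VALUES (1, 1);
    INSERT INTO bookmarks (item_id, deleted) VALUES (1, 0), (1, 0);
    INSERT INTO notes (item_id, accepted) VALUES (1, 1), (1, 1), (1, 1), (1, 1);
""")

# Joining both N-sides first: every bookmark row pairs with every note row.
naive = conn.execute("""
    SELECT i.id, count(b.id) AS bookmark_count
    FROM items i
    LEFT JOIN bookmarks b ON b.item_id = i.id AND b.deleted = 0
    JOIN notes n ON n.item_id = i.id AND n.accepted = 1
    WHERE i.finished = 1
    GROUP BY i.id
""").fetchall()

# Aggregate bookmarks before joining; only test for notes with EXISTS.
fixed = conn.execute("""
    SELECT i.id, COALESCE(b.ct, 0) AS bookmark_count
    FROM items i
    LEFT JOIN (
        SELECT item_id, count(*) AS ct
        FROM bookmarks
        WHERE deleted = 0
        GROUP BY item_id
    ) b ON b.item_id = i.id
    WHERE i.finished = 1
      AND EXISTS (SELECT 1 FROM notes n WHERE n.item_id = i.id AND n.accepted = 1)
""").fetchall()

print(naive)  # [(1, 8)]
print(fixed)  # [(1, 2)]
```

The first query reproduces the 2 x 4 = 8 count from the question; the second returns 2 because neither the derived table nor EXISTS can add rows per item.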

Related

Find rows with fewer than X associations (including 0)

I have students associated to schools, and want to find all schools that have five or fewer (including zero) students that have has_mohawk = false.
Here's an Activerecord query:
School.joins(:students)
.group(:id)
.having("count(students.id) < 5")
.where(students: {has_mohawk: true})
This works for schools with 1 - 4 such students with mohawks, but omits schools where there are no such students!
I figured out a working solution and will post it. But I am interested in a more elegant solution.
Using Rails 5. I'm also curious whether Rails 6's missing could handle this?
find all schools that have five or fewer (including zero) students that have has_mohawk = false.
Here is an optimized SQL solution. SQL is what it comes down to in any case. (ORMs like Active Record are limited in their capabilities.)
SELECT sc.*
FROM schools sc
LEFT JOIN (
SELECT school_id
FROM students
WHERE has_mohawk = false
GROUP BY 1
HAVING count(*) >= 5
) st ON st.school_id = sc.id
WHERE st.school_id IS NULL; -- "not disqualified"
While involving all rows, aggregate before joining. That's faster.
This query takes the reverse approach by excluding schools with 5 or more qualifying students. The rest is your result - incl. schools with 0 qualifying students. See:
Select rows which are not present in other table
Any B-tree index on students (school_id) can support this query, but this partial index would be perfect:
CREATE INDEX ON students (school_id) WHERE has_mohawk = false;
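A small sketch of the reverse approach, run against SQLite through Python's sqlite3 module rather than Postgres. The sample schools and the 0/1 encoding of has_mohawk are my assumptions for illustration; the query itself has the same shape as the one above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE schools  (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE students (id INTEGER PRIMARY KEY, school_id INTEGER, has_mohawk INTEGER NOT NULL);
    INSERT INTO schools VALUES (1, 'none'), (2, 'three'), (3, 'five');
""")
# School 1 has no students, school 2 has 3 qualifying students (plus one
# with a mohawk), school 3 has 5 qualifying students.
rows = [(2, 0)] * 3 + [(2, 1)] + [(3, 0)] * 5
conn.executemany("INSERT INTO students (school_id, has_mohawk) VALUES (?, ?)", rows)

# Disqualify schools that reach the threshold; every other school remains.
qualifying = [row[0] for row in conn.execute("""
    SELECT sc.id
    FROM schools sc
    LEFT JOIN (
        SELECT school_id
        FROM students
        WHERE has_mohawk = 0
        GROUP BY school_id
        HAVING count(*) >= 5
    ) st ON st.school_id = sc.id
    WHERE st.school_id IS NULL
    ORDER BY sc.id
""")]

print(qualifying)  # [1, 2]
```

School 1 is kept even though it has no students at all, which is exactly the case the original joins/having query dropped.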
If there can be many students per school, this is faster:
SELECT sc.*
FROM schools sc
JOIN LATERAL (
SELECT count(*) < 5 AS qualifies
FROM (
SELECT -- select list can be empty (cheapest)
FROM students st
WHERE st.school_id = sc.id
AND st.has_mohawk = false
LIMIT 5 -- !
) st1
) st2 ON st2.qualifies;
The point is not to count all qualifying students, but stop looking once we found 5. Since the join to the LATERAL subquery with an aggregate function always returns a row (as opposed to the join in the first query), schools without qualifying students are kept in the loop, and we don't need the reverse approach.
About LATERAL:
What is the difference between LATERAL JOIN and a subquery in PostgreSQL?
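The "stop once we found 5" idea can be tried outside Postgres too. The sketch below uses Python's sqlite3 module; SQLite, to my knowledge, has no LATERAL, so a correlated scalar subquery stands in for it here. The data and the 0/1 encoding of has_mohawk are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE schools  (id INTEGER PRIMARY KEY);
    CREATE TABLE students (id INTEGER PRIMARY KEY, school_id INTEGER, has_mohawk INTEGER NOT NULL);
    INSERT INTO schools VALUES (1), (2), (3);
""")
# School 1: no students, school 2: 4 qualifying, school 3: 50 qualifying.
rows = [(2, 0)] * 4 + [(3, 0)] * 50
conn.executemany("INSERT INTO students (school_id, has_mohawk) VALUES (?, ?)", rows)

# The inner LIMIT caps the rows fed to count(*), so the count never
# exceeds 5 no matter how many students a school has.
capped = conn.execute("""
    SELECT sc.id,
           (SELECT count(*)
            FROM (SELECT 1
                  FROM students st
                  WHERE st.school_id = sc.id
                    AND st.has_mohawk = 0
                  LIMIT 5) AS st1) AS capped_count
    FROM schools sc
    ORDER BY sc.id
""").fetchall()

qualifying = [school_id for school_id, n in capped if n < 5]
print(capped)
print(qualifying)
```

Because count(*) over an empty set still returns 0, the school without students stays in the result, mirroring the point made about the LATERAL version.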
In addition to the first query, write another to find schools where no students have mohawks (works in Rails 5).
School.left_outer_joins(:students)
.group(:id)
.having("max(has_mohawk::Integer) = 0")
You might think from this popular answer that you could instead just write:
School.left_outer_joins(:students)
.group(:id)
.where.not(students: {has_mohawk: true})
But that will include (at least in Rails 5) schools where there is any student with a has_mohawk value of false, even if some students have a has_mohawk value of true.
Explanation of max(has_mohawk::Integer) = 0
It converts the boolean to an integer (1 for true, 0 for false). Schools with any true values will have a max of 1, and can thus be filtered out.
Similar: SQL: Select records where ALL joined records satisfy some condition
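Here is a quick sketch of the max(...) = 0 trick, using Python's sqlite3 module instead of Rails and Postgres. SQLite already stores booleans as 0/1, so no cast is needed; the sample data and the school_ids helper are made up for illustration. One thing worth noticing: a school with no students gets max = NULL, and NULL = 0 is not true, so it is filtered out unless you wrap the aggregate in COALESCE.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE schools  (id INTEGER PRIMARY KEY);
    CREATE TABLE students (id INTEGER PRIMARY KEY, school_id INTEGER, has_mohawk INTEGER NOT NULL);
    INSERT INTO schools VALUES (1), (2), (3);
    -- school 1: no mohawks, school 2: mixed, school 3: no students at all
    INSERT INTO students (school_id, has_mohawk) VALUES (1, 0), (1, 0), (2, 0), (2, 1);
""")

def school_ids(having):
    """Hypothetical helper: run the grouped left join with a given HAVING clause."""
    sql = f"""
        SELECT sc.id
        FROM schools sc
        LEFT JOIN students st ON st.school_id = sc.id
        GROUP BY sc.id
        HAVING {having}
        ORDER BY sc.id
    """
    return [row[0] for row in conn.execute(sql)]

strict = school_ids("max(st.has_mohawk) = 0")
with_empty = school_ids("COALESCE(max(st.has_mohawk), 0) = 0")

print(strict)      # [1]
print(with_empty)  # [1, 3]
```

Whether schools without any students should count as "no students have mohawks" depends on what you need; the COALESCE variant is only there to show the difference.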

postgres - How to update the same record twice in a joined update query

I'm trying to write the following migration that drops the conversations_users table (which was a join table that included a read_up_to column) and copies the read_up_to information into the new messages.read_by_user_ids array. For every 1 message row there are at least 2 conversations_users rows, so this join is repeating messages. I expected the following expression to work, but it's only assigning one user_id to the read_by_user_ids array, and I'm guessing that's because the update isn't happening sequentially.
Result:
message_id: 1, read_by_user_ids: { 15 }
Desired result: message_id: 1, read_by_user_ids: { 15, 19 }
UPDATE
messages as m
SET
read_by_user_ids = CASE
WHEN cu.read_up_to >= m.created_at THEN array_append(
COALESCE(m.read_by_user_ids, '{}'),
cu.user_id
) ELSE m.read_by_user_ids
END
FROM
conversations_users cu
WHERE
cu.conversation_id = m.thread_id;
I'm on my phone, so apologies for untested typos.
As per my comment, aggregate the incoming user_ids into one array per message. Then use array_cat to combine the two arrays.
This way you only need to do one update per target row.
I also noticed that you only want to update rows based on a date comparison, so I added that to the sub query I proposed.
UPDATE
messages AS m
SET
read_by_user_ids = array_cat(
COALESCE(m.read_by_user_ids, '{}'),
agg.user_id_array
)
FROM
(
SELECT
msg.id AS message_id, -- assumes messages has an id primary key
array_agg(cu.user_id) AS user_id_array
FROM
messages msg
INNER JOIN
conversations_users cu
ON cu.conversation_id = msg.thread_id
AND cu.read_up_to >= msg.created_at
GROUP BY
msg.id
)
agg
WHERE
agg.message_id = m.id;
There are many other options on how to generate the array in the sub-query. Which is the most efficient will depend on the profile of your data, indexes, etc. But the principle remains the same: updating the same row multiple times in a single statement doesn't work, so you need to update each row once, with an array as the input.
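The "update each row once, with an aggregate as the input" principle can be sketched with Python's sqlite3 module. This is not the Postgres statement: SQLite has no array type, so a comma-separated string from group_concat stands in for the array, timestamps are plain integers, and the read_by column name and sample rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE messages (id INTEGER PRIMARY KEY, thread_id INTEGER,
                           created_at INTEGER, read_by TEXT);
    CREATE TABLE conversations_users (conversation_id INTEGER, user_id INTEGER,
                                      read_up_to INTEGER);
    INSERT INTO messages VALUES (1, 10, 100, NULL), (2, 10, 200, NULL);
    INSERT INTO conversations_users VALUES (10, 15, 250), (10, 19, 150);
""")

# Each message is updated exactly once; the correlated subquery first
# collapses all users who read that message into a single value.
conn.execute("""
    UPDATE messages
    SET read_by = (
        SELECT group_concat(cu.user_id)
        FROM conversations_users cu
        WHERE cu.conversation_id = messages.thread_id
          AND cu.read_up_to >= messages.created_at
    )
""")

# group_concat does not guarantee an order, so sort after splitting.
read_by = {
    msg_id: sorted(int(u) for u in joined.split(",")) if joined else []
    for msg_id, joined in conn.execute("SELECT id, read_by FROM messages")
}
print(read_by)  # {1: [15, 19], 2: [15]}
```

Message 1 ends up with both users, which is the desired result from the question; message 2 was created after user 19's read_up_to, so only user 15 is recorded.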

Flatten multiple query results with same ID to single row?

I'm curious about something in a SQL Server database. My current query pulls data about my employer's items for sale. It finds information for just under 105,000 items, which is correct. However, it returns over 155,000 rows, because each item has other things related to it. Right now, I run that data through a loop in Python, manually flattening it out by checking if the item the loop is working on is the same one it just worked on. If it is, I start filling in that item's extra information. Ideally, the SQL would return all this data already put into one row.
Here is an overview of the setup. I'm leaving out a few details for simplicity's sake, since I'm curious about the general theory, not looking for something I can copy and paste.
Item: contains the item ID, SKU, description, vendor ID, weight, and dimensions.
AttributeName: contains attr_id and attr_text. For instance, "color", "size", or "style".
AttributeValue: contains attr_value_id and attr_text. For instance, "blue" or "small".
AttributeAssign: contains item_id and attr_id. This ties attribute names to items.
attributeValueAssign: contains item_id and attr_value_id, tying attribute values to items.
A series of attachments is set up in a similar way, but with attachment and attachmentAssignment. Attachments can have only values, no names, so there is no need for the extra complexity of a third table as there is with attributes.
Vendor is simple: the ID is used in the item table. That is:
select item_id, vendorName
from item
join vendor on vendor_id = item.vendorNumber
gets you the name of an item's vendor.
Now, the fun part: items may or may not have vendors, attributes, or attachments. If they have either of the latter two, there's no way to know how many they have. I've seen items with 0 attributes and items with 5. Attachments are simpler, as there can only be 0 or 1 per item, but the possibility of 0 still demands a left outer join so I am guaranteed to get all the items.
That's how I get multiple rows per item. If an item has three attributes, I get either four or seven rows for just that item--I'm not sure if it's a row per name/value or a row per name AND a row per value. Either way, this is the kind of thing I'd like to stop. I want each row in my result set to contain all attributes, with a cap at seven and null for any missing attribute. That is, something like:
item_id; item_title; item_sku; ... attribute1_name; attribute1_value; attribute2_name; attribute2_value; ... attribute7_value
1; some random item; 123-45; ... color; blue; size; medium; ... null
Right now, I'd get multiple rows for that, such as (only ID and attributes):
ID; attribute 1 name; attribute 1 value; attribute 2 name; attribute 2 value
1; color; blue; null; null
1; color; blue; size; medium
I'm after the second row only--all the information put together into one row per unique item ID. Currently, though, I get multiple rows, and Python has to put everything together. I'm outputting this to a spreadsheet, so information about an item has to be on that item's row.
I can just keep using Python if this is too much bother. But I wondered if there was a way to do it that would be relatively easy. My script works fine, and execution time isn't a concern. This is more for my own curiosity than a need to get anything working. Any thoughts on how--or if--this is possible?
Here is @WCWedin's answer modified to use a CTE.
WITH attrib_rn as
(
select
*, row_number() over(partition by item_id order by name, attribute_id) as row_number
from attributes
)
select
i.item_id,
attr1.name as attribute1_name, attr1.value as attribute1_value,
...
attr7.name as attribute7_name, attr7.value as attribute7_value
from items i
left join attrib_rn as attr1 ON attr1.item_id = i.item_id AND attr1.row_number = 1
left join attrib_rn as attr2 ON attr2.item_id = i.item_id AND attr2.row_number = 2
left join attrib_rn as attr3 ON attr3.item_id = i.item_id AND attr3.row_number = 3
left join attrib_rn as attr4 ON attr4.item_id = i.item_id AND attr4.row_number = 4
left join attrib_rn as attr5 ON attr5.item_id = i.item_id AND attr5.row_number = 5
left join attrib_rn as attr6 ON attr6.item_id = i.item_id AND attr6.row_number = 6
left join attrib_rn as attr7 ON attr7.item_id = i.item_id AND attr7.row_number = 7
Since you only want the first 7 attributes and you want to keep all of the logic in the SQL query, you're probably looking at using row_number. Subqueries will do the job directly with multiple joins, and the performance will probably be pretty good since you're only joining so many times.
select
i.item_id,
attr1.name as attribute1_name, attr1.value as attribute1_value,
...
attr7.name as attribute7_name, attr7.value as attribute7_value
from
items i
left join (
select
*, row_number() over(partition by item_id order by name, attribute_id) as row_number
from
attributes
) AS attr1 ON
attr1.item_id = i.item_id
AND attr1.row_number = 1
...
left join (
select
*, row_number() over(partition by item_id order by name, attribute_id) as row_number
from
attributes
) AS attr7 ON
attr7.item_id = i.item_id
AND attr7.row_number = 7
In SQL Server, you can tackle this with a subquery containing 'ROW_NUMBER() OVER', and a few CASE statements to map the top 7 into columns.
A little tricky, but post your full query that returns the big list and I'll demonstrate how to transpose it.
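For the general idea of ROW_NUMBER() plus CASE, here is a rough sketch run on SQLite through Python's sqlite3 module (window functions need SQLite 3.25 or newer). It is not SQL Server, the single attributes table with item_id, name, and value columns is a simplifying assumption that does not match the real schema in the question, and it is capped at two attributes to keep it short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE items      (item_id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE attributes (attribute_id INTEGER PRIMARY KEY, item_id INTEGER,
                             name TEXT, value TEXT);
    INSERT INTO items VALUES (1, 'some random item'), (2, 'bare item');
    INSERT INTO attributes (item_id, name, value)
    VALUES (1, 'size', 'medium'), (1, 'color', 'blue');
""")

# Number each item's attributes, then fold them into columns: each
# max(CASE ...) picks the one row with the matching number, or NULL.
flat = conn.execute("""
    WITH attrib_rn AS (
        SELECT item_id, name, value,
               row_number() OVER (PARTITION BY item_id
                                  ORDER BY name, attribute_id) AS rn
        FROM attributes
    )
    SELECT i.item_id,
           max(CASE WHEN a.rn = 1 THEN a.name  END) AS attribute1_name,
           max(CASE WHEN a.rn = 1 THEN a.value END) AS attribute1_value,
           max(CASE WHEN a.rn = 2 THEN a.name  END) AS attribute2_name,
           max(CASE WHEN a.rn = 2 THEN a.value END) AS attribute2_value
    FROM items i
    LEFT JOIN attrib_rn a ON a.item_id = i.item_id
    GROUP BY i.item_id
    ORDER BY i.item_id
""").fetchall()

for row in flat:
    print(row)
# (1, 'color', 'blue', 'size', 'medium')
# (2, None, None, None, None)
```

This needs one join instead of seven; extending it to seven attributes means repeating the two CASE lines for rn = 3 through 7.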

MySQL Join from multiple options to select one value

I am putting together a nice little database for adding values to options. These are all set up through a map (Has and Belongs to Many) table, because many options point to a single value.
So I am trying to specify 3 option.ids and a single id in a value table - four integers to point to a single value. Three tables. And I am running into a problem with the WHERE part of the statement, because if multiple values share an option there are many results. And I need just a single result.
SELECT value.id, value.name FROM value
LEFT JOIN (option_map_value, option_table)
ON (value.id = option_map_value.value_id AND option_map_value.option_table_id = option_table.id)
WHERE option_table.id IN (5, 2, 3) AND value.y_axis_id = 16;
The problem with the statement seems to be the IN in the WHERE clause. If one of the numbers in the IN() part is different, there are multiple results, which is not good.
I have tried DISTINCT, which again works if there is one result, but returns many if there are many. The closest we have gotten is adding a count, to return the value with the most options at the top.
So is there a way to make the WHERE more specific? I cannot break it out into option_table.id = 5 AND option_table.id = 2, because that one fails. But can the WHERE clause be more specific?
Maybe it is me being pedantic, but I would like to be able to return just the single result, instead of a count of results... Any ideas?
The problem with the statement seems to be the IN in the WHERE clause. If one of the numbers in the IN() part is different, there are multiple results, which is not good. I have tried DISTINCT, which again works if there is one result, but returns many if there are many. The closest we have gotten is adding a count, to return the value with the most options at the top.
You were very close, considering the DISTINCT:
SELECT v.id,
v.name
FROM VALUE v
LEFT JOIN OPTION_MAP_VALUE omv ON omv.value_id = v.id
LEFT JOIN OPTION_TABLE ot ON ot.id = omv.option_table_id
WHERE ot.id IN (5, 2, 3)
AND v.y_axis_id = 16
GROUP BY v.id, v.name
HAVING COUNT(*) = 3
You were on the right track, but needed to use GROUP BY instead in order to be able to use the HAVING clause to count the DISTINCT list of values.
Caveat emptor:
The GROUP BY/HAVING COUNT version of the query is dependent on your data model having a composite key, unique or primary, defined for the two columns involved (value_id and option_table_id). If this is not in place, the database will not stop duplicates being added. If duplicate rows are possible in the data, this version can return false positives because a value_id could have 3 associations to the option_table_id 5 - which would satisfy the HAVING COUNT(*) = 3.
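To make the caveat concrete, here is a small sketch using Python's sqlite3 module in place of MySQL. The sample rows and the matching_values helper are invented for illustration. It also shows COUNT(DISTINCT option_table_id) as one possible guard when duplicates cannot be ruled out; that variant is my suggestion, not part of the query above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- No unique key on (value_id, option_table_id), so duplicates slip in.
    CREATE TABLE option_map_value (value_id INTEGER, option_table_id INTEGER);
    INSERT INTO option_map_value VALUES
        (1, 5), (1, 5), (1, 5),   -- value 1: option 5 three times
        (2, 5), (2, 2), (2, 3);   -- value 2: all three options once
""")

def matching_values(count_expr):
    """Hypothetical helper: value_ids whose count over the IN list equals 3."""
    sql = f"""
        SELECT value_id
        FROM option_map_value
        WHERE option_table_id IN (5, 2, 3)
        GROUP BY value_id
        HAVING {count_expr} = 3
        ORDER BY value_id
    """
    return [row[0] for row in conn.execute(sql)]

by_rows = matching_values("COUNT(*)")
by_distinct = matching_values("COUNT(DISTINCT option_table_id)")

print(by_rows)      # [1, 2]  value 1 is a false positive
print(by_distinct)  # [2]
```

With a unique or primary key on the two columns, both variants return the same result and COUNT(*) is fine.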
Using JOINs:
A safer, though more involved, approach is to join onto the table that can have multiple options, as often as you have criteria:
SELECT v.id,
v.name
FROM VALUE v
JOIN OPTION_MAP_VALUE omv5 ON omv5.value_id = v.id
AND omv5.option_table_id = 5
JOIN OPTION_MAP_VALUE omv2 ON omv2.value_id = v.id
AND omv2.option_table_id = 2
JOIN OPTION_MAP_VALUE omv3 ON omv3.value_id = v.id
AND omv3.option_table_id = 3
WHERE v.y_axis_id = 16
GROUP BY v.id, v.name

outer query to list only if its rowcount equates to inner subquery

Need help on a query using sql server 2005
I have two tables:
code (chargecode, chargeid, orgid)
entry (chargeid, itemNo, rate)
I need to list all the chargeids in the entry table when it contains multiple entries with different chargeids that are listed in the code table under the same chargecode.
data :
code
100,1,100
100,2,100
100,3,100
101,11,100
101,12,100
entry
1,x1,1
1,x2,2
2,x3,2
11,x4,1
11,x5,1
Using the above data, the query should list chargeids 1 and 2 and not 11.
I found a way to count how many rows in entry satisfy the criteria, but I'm failing to get the chargeids:
select count (distinct chargeId)
from entry where chargeid in (select chargeid from code where chargecode = (SELECT A.chargecode
from code as A join code as B
ON A.chargecode = B.chargeCode and A.chargetype = B.chargetype and A.orgId = B.orgId AND A.CHARGEID = b.CHARGEid
group by A.chargecode,A.orgid
having count(A.chargecode) > 1)
)
First off: I apologise for my completely inaccurate original answer.
The solution to your problem is a self-join. Self-joins are used when you want to select more than one row from the same table. In our case we want to select two charge IDs that have the same charge code:
SELECT DISTINCT c1.chargeid, c2.chargeid FROM code c1
JOIN code c2 ON c1.chargeid != c2.chargeid AND c1.chargecode = c2.chargecode
JOIN entry e1 ON e1.chargeid = c1.chargeid
JOIN entry e2 ON e2.chargeid = c2.chargeid
WHERE c1.chargeid < c2.chargeid
Explanation of this:
First we pick any two charge IDs from 'code'. The DISTINCT avoids duplicates. We make sure they're two different IDs and that they map to the same chargecode.
Then we join on 'entry' (twice) to make sure they both appear in the entry table.
Without an ordering, this approach gives (for your example) the pairs (1,2) and (2,1). So we also insist on an ordering; this cuts the result set down to just (1,2), as you described.
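As a sanity check, the self-join can be run against the sample data from the question. The sketch uses Python's sqlite3 module rather than SQL Server 2005, and the column types are guessed from the sample rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE code  (chargecode INTEGER, chargeid INTEGER, orgid INTEGER);
    CREATE TABLE entry (chargeid INTEGER, itemNo TEXT, rate INTEGER);
    INSERT INTO code VALUES (100, 1, 100), (100, 2, 100), (100, 3, 100),
                            (101, 11, 100), (101, 12, 100);
    INSERT INTO entry VALUES (1, 'x1', 1), (1, 'x2', 2), (2, 'x3', 2),
                             (11, 'x4', 1), (11, 'x5', 1);
""")

# Pair up charge IDs that share a chargecode and both occur in entry.
# c1.chargeid < c2.chargeid keeps one ordering of each pair.
pairs = conn.execute("""
    SELECT DISTINCT c1.chargeid, c2.chargeid
    FROM code c1
    JOIN code c2  ON c1.chargecode = c2.chargecode
    JOIN entry e1 ON e1.chargeid = c1.chargeid
    JOIN entry e2 ON e2.chargeid = c2.chargeid
    WHERE c1.chargeid < c2.chargeid
""").fetchall()

print(pairs)  # [(1, 2)]
```

Charge ID 11 does not show up because its partner under chargecode 101 (charge ID 12) has no rows in entry.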