SQL question: how do I find the count of IDs that are always mapped to a 'true' field in another table

I have a database that collects a list of document packages in one table and each individual page in another table.
Each page has a PackageID connecting the two tables.
I'm trying to find the count of all packages where ALL of the pages connected to them have a boolean field (stored on the page table) of true. Even if only 1 of 20 pages connected to a packageID is false, I don't want that packageID counted.
Right now all I have is:
SELECT COUNT(DISTINCT pages.package_id)
FROM pages
WHERE boolean_field = true
But I'm not sure how to express that if even one page with that package_id has boolean_field != true, then I don't want it counted. I also want to know the count of packages that have any pages that are false.
I'm not sure if I need a subquery, if statement, having clause, or what.
Any direction even if it's what operators I should study on would be super helpful. Thanks :).

select count(*)
from
(
    select package_id
    from pages
    group by package_id
    having min(boolean_field) = 1
) tmp
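Note that MIN() over the boolean works where booleans are stored as 0/1 values (e.g. MySQL); if the column is a true boolean type, as in PostgreSQL, an equivalent sketch uses BOOL_AND:
select count(*)
from
(
    select package_id
    from pages
    group by package_id
    having bool_and(boolean_field)
) tmp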

Another way to express this is:
select count(*)
from packages p
where not exists (select 1
                  from pages pp
                  where pp.package_id = p.package_id and
                        not pp.boolean_field
                 );
The advantage of this approach is that it avoids aggregation, which can be a big win performance wise. It can also take advantage of an index on pages(package_id, boolean_field).
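The asker also wanted the count of packages that have at least one false page; one pass over pages can return both counts. A sketch, assuming PostgreSQL-style booleans and support for the FILTER aggregate clause:
select count(*) filter (where not has_false) as all_true_packages,
       count(*) filter (where has_false)     as any_false_packages
from (
    select package_id, bool_or(not boolean_field) as has_false
    from pages
    group by package_id
) per_package;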

Implementing LIMIT against rows of single table while using LEFT JOIN

Given a fairly stereotypical scenario with an item table referenced by an images table holding multiple images for the same item, I'm trying to figure out how to retrieve a specific number of items, while collecting all of the image rows.
The setup is trivial, and looks like:
CREATE TABLE items (
    id INTEGER PRIMARY KEY,
    ...
    ...
);
CREATE TABLE images (
    item_id INTEGER REFERENCES items(id),
    url STRING,
    ...
    ...
);
LIMIT 20 clamps the maximum number of rows in the entire result set, and incidentally truncates mid-way through an item record.
I'm currently doing two queries, but besides being architecturally not at all ideal, it's practically proving quite awkward to coordinate (which makes a lot of sense since I'm definitely doing it wrong). I'm not finding any info on how to coordinate LIMIT with LEFT JOINs, so I thought I'd ask. Thanks!
NB. Similar questions to this do indeed exist, but are asking how to do things like retrieving eg the first (say) 5 images for each item. I'm looking to retrieve (say) 10 items and get all the images for each.
A query like:
SELECT * FROM items ORDER BY ? LIMIT 10
will return 10 rows (at most) from items.
You need to provide the column(s) for the ORDER BY clause if you have a specific sort order in mind.
If not, then remove the ORDER BY clause, but in that case nothing guarantees which rows you will get.
So all you have to do is LEFT join the above query to images:
SELECT it.*, im.*
FROM (SELECT * FROM items ORDER BY ? LIMIT 10) it
LEFT JOIN images im
ON im.item_id = it.id
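For instance, assuming an illustrative created_at column on items to order by, the combined query might look like:
SELECT it.*, im.url
FROM (SELECT * FROM items ORDER BY created_at DESC LIMIT 10) it
LEFT JOIN images im
       ON im.item_id = it.id
ORDER BY it.created_at DESC, it.id
The outer ORDER BY just keeps each item's image rows together in the result.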

Activerecord query returning doubles while using uniq

I am running the following query with the goal of returning a unique set of customer objects:
Customer.joins(:projects).select('customers.*, projects.finish_date').where("projects.closed = false").uniq
However, this code will generate duplicates if a customer has more than one active project (i.e. more than one project with closed = false). If I remove projects.finish_date from the select clause this query works as intended. However, I need it to be in there to be able to sort on that column.
How can I make this query return a unique set of customers?
How can I make this query return a unique set of customers?
This doesn't completely make sense, and probably isn't what you want.
The problem is that you're joining against the projects table, at which point there may be several rows for the same customer with different project finish_dates. These rows are unique and will be returned as multiple unique Customer objects, each with a different finish_date.
If you only want one of these, how is Rails to determine which one? Wouldn't it be a problem if you only had one customer object with one finish_date returned if there are really 10 projects for that customer, each with a different finish_date?
Instead, you probably want something like this:
customers = Customer.joins(:projects).select('customers.*, projects.finish_date').where("projects.closed = false").uniq
customers.group_by(&:id)
This groups all of your same customers together.
OR, you might want:
projects = Project.where(closed: false).includes(:customer)
customers = projects.map(&:customer).uniq
In either case, you're producing a unique set of customers from the superset of all customer-project joins.
RE Your comments:
If you want to get a list of customers with their most recent associated project, you could use a sub query in your where:
select customers.*, projects.finish_date
from customers
inner join projects on projects.customer_id = customers.id
where projects.id = (
    select p.id
    from projects p
    where p.customer_id = customers.id
      and p.closed = false
    order by p.finish_date desc
    limit 1
)
You can express this using ActiveRecord by embedding the sub-query in a where:
Customer.joins(:projects)
        .select('customers.*, projects.finish_date as finish_date')
        .where('projects.id = (select p.id from projects p where p.customer_id = customers.id and p.closed = false order by p.finish_date desc limit 1)')
I have no idea how this will perform for you, but I suspect poorly.
I would always stick to a simple includes and in-Ruby filter before attempting to optimize with SQL.
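If the database happens to be PostgreSQL, DISTINCT ON is a more direct way to get one row per customer with its most recent open project; a sketch:
SELECT DISTINCT ON (customers.id)
       customers.*, projects.finish_date
FROM customers
INNER JOIN projects ON projects.customer_id = customers.id
WHERE projects.closed = false
ORDER BY customers.id, projects.finish_date DESC;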

MS Access Distinct Records in Recordset

So, I once again seem to have an issue with MS Access being finicky, although it seems to also be an issue when trying similar queries in SSMS (SQL Server Management Studio).
I have a collection of tables, loosely defined as follows:
table widget_mfg { id (int), name (nvarchar) }
table widget { id (int), name (nvarchar), mfg_id (int) }
table widget_component { id (int), name (nvarchar), widget_id (int), component_id }
table component { id (int), name (nvarchar), ... } -- There are ~25 columns in this table
What I'd like to do is query the database and get a list of all components that a specific manufacturer uses. I've tried some of these queries:
SELECT c.*, wc.widget_id, w.mfg_id
FROM ((widget_component wc INNER JOIN widget w ON wc.widget_id = w.id)
INNER JOIN widget_manufacturer wm on w.mfg_id = wm.id)
INNER JOIN component c on c.id = wc.component_id
WHERE wm.id = 1
The previous example displays duplicates of any part that is contained in multiple widget_component lists for different widgets.
I've also tried doing:
SELECT DISTINCT c.id, c.name, wc.widget_id, w.mfg_id
FROM component c, widget_component wc, widget w, widget_manufacturer wm
WHERE wm.id=w.mfg_id AND wm.id = 1
This doesn't display anything at all. I was reading about sub-queries, but I do not understand how they work or how they would apply to my current application.
Any assistance in this would be beneficial.
As an aside, I am not very good with either MS Access or SQL in general. I know the basics, but not a lot beyond that.
Edit:
I just tried this code, and it works to get all the component.id's while limiting them to a single entry each. How do I go about using the results of this to get a list of all the rest of the component data (component.*) where the id's from the first part are used to select this data?
SELECT DISTINCT c.part_no
FROM component c, widget w, widget_component wc, widget_manufacturer wm
WHERE(((c.id=wc.component_id AND wc.widget_id=w.id AND w.mfg_id=wm.id AND wm.id=1)))
(P.S. this is probably not the best way to do this, but I am still learning SQL.)
What I'd like to do is query the database and get a list of all
components that a specific manufacturer uses
There are several ways to do this. IN is probably the easiest to write:
SELECT c.*
FROM component c
WHERE c.id IN (SELECT wc.component_id
               FROM widget w
               INNER JOIN widget_component wc
                 ON w.id = wc.widget_id
               WHERE w.mfg_id = 123)
The IN subquery finds all the component ids that a specific manufacturer uses. The outer query then selects any component whose id is in that result. It doesn't matter if an id is in there once or 1000 times; the component record is returned only once.
The other ways of doing this are using an EXISTS subquery or joining to that query (but then you do need to de-dup it).
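For example, the EXISTS form might look like this (a sketch, using the same illustrative mfg_id of 123):
SELECT c.*
FROM component c
WHERE EXISTS (SELECT 1
              FROM widget_component wc
              INNER JOIN widget w
                ON w.id = wc.widget_id
              WHERE wc.component_id = c.id
                AND w.mfg_id = 123)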
It sounds like your component-to-widget relationship is one-to-many. Hence the duplicates (i.e., the same component is used by more than one widget).
Your Select is almost OK --
SELECT c.*, wc.widget_id, w.mfg_id
but the wc.widget_id is causing the duplicates (per the assumption above).
So remove wc.widget_id from the SELECT, or else aggregate it (min, max, count, etc.). Removing is easier. If you aggregate, remember to add a GROUP BY clause.
Try this:
SELECT DISTINCT c.*, w.mfg_id
Also -- FWIW, it's generally better practice to list field names explicitly instead of using *.
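Putting that together with the original FROM clause, the full query would be something like:
SELECT DISTINCT c.*, w.mfg_id
FROM ((widget_component wc INNER JOIN widget w ON wc.widget_id = w.id)
      INNER JOIN widget_manufacturer wm ON w.mfg_id = wm.id)
      INNER JOIN component c ON c.id = wc.component_id
WHERE wm.id = 1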

How can I group objects retrieved from database tables that have the same properties?

I am working on an application that allows users to build a "book" from a number of "pages" and then place them in any order that they'd like. It's possible that multiple people can build the same book (the same pages in the same order). The books are built by the user prior to them being processed and printed, so I need to group books together that have the same exact layout (the same pages in the same order). I've written a million queries in my life, but for some reason I can't grasp how to do this.
I could simply write a big SELECT query, and then loop through the results and build arrays of objects that have the same pages in the same sequence, but I'm trying to figure out how to do this with one query.
Here is my data layout:
dbo.Books
    BookId
    Quantity
dbo.BookPages
    BookId
    PageId
    Sequence
dbo.Pages
    PageId
    DocName
So, I need some clarification on a few things:
Once a user orders the pages the way they want, are they saved back down to a database?
If yes, then is the question to run a query to group book orders that have the same page-numbering, so that they are sent to the printers in an optimal way?
OR, does the user lay out the pages, then send the order directly to the printer? And if so, it seems more complicated/less efficient to capture requested print jobs and order them on-the-fly on the way out to the printers ...
What language/technology are you using to create this solution? .NET? Java?
With the answers to these questions, I can better gauge what you need.
With the answers to my questions, I also assume that:
You are using some type of many-to-many table to store customer page ordering. If so, then you'll need to write a query to select distinct page-orderings, and group by those page orderings. This is possible with a single SQL query.
However, if you feel you want more control over how this data is joined, then doing this programmatically may be the way to go, although you will lose performance by reading in all the data, and then outputting that data in a way that is consumable by your printers.
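One way that single query might look, as a sketch (assuming SQL Server 2017+ for STRING_AGG): build a per-book layout signature from the ordered page ids, then group identical signatures together:
WITH BookSignature AS
(
    SELECT bp.BookId,
           STRING_AGG(CAST(bp.PageId AS varchar(20)), ',')
               WITHIN GROUP (ORDER BY bp.Sequence) AS layout
    FROM dbo.BookPages bp
    GROUP BY bp.BookId
)
SELECT layout, COUNT(*) AS identicalBooks
FROM BookSignature
GROUP BY layout;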
Two books are identical only if the number of matching (page, sequence) pairs between them equals each book's own page count.
This was tagged TSQL when I started; the syntax may differ on other database systems.
;WITH BookPageCount
AS
(
    SELECT bp1.BookId, COUNT(*) AS [individualCount]
    FROM dbo.BookPages bp1 WITH (NOLOCK)
    GROUP BY bp1.BookId
),
BookCombinedCount
AS
(
    SELECT bp1.BookId AS [book1ID], bp2.BookId AS [book2ID], COUNT(*) AS [combinedCount]
    FROM dbo.BookPages bp1 WITH (NOLOCK)
    JOIN dbo.BookPages bp2 WITH (NOLOCK)
      ON bp1.BookId < bp2.BookId
     AND bp1.Sequence = bp2.Sequence
     AND bp1.PageId = bp2.PageId
    GROUP BY bp1.BookId, bp2.BookId
)
SELECT BookCombinedCount.book1ID, BookCombinedCount.book2ID
FROM BookCombinedCount
JOIN BookPageCount AS book1 ON book1.BookId = BookCombinedCount.book1ID
JOIN BookPageCount AS book2 ON book2.BookId = BookCombinedCount.book2ID
WHERE BookCombinedCount.combinedCount = book1.individualCount
  AND BookCombinedCount.combinedCount = book2.individualCount

Update values in each row based on foreign_key value

Downloads table:
    id (primary key)
    user_id
    item_id
    created_at
    updated_at
The user_id and item_id in this case are both incorrect; however, the correct values are properly stored in the users and items tables, respectively (as import_id in each table). Here's what I'm trying to script:
downloads.each do |download|
  user = User.find_by_import_id(download.user_id)
  item = Item.find_by_import_id(download.item_id)
  if user && item
    download.update_attributes(:user_id => user.id, :item_id => item.id)
  end
end
So: look up the user and item based on their respective import_ids, then update those values in the download record.
This takes forever with a ton of rows. Any way to do this in SQL?
If I understand you correctly, you simply need to add two subqueries to your SELECT statement to look up the correct IDs. For example:
SELECT id,
       (SELECT correct_id FROM User WHERE import_id = user_id) AS UserID,
       (SELECT correct_id FROM Item WHERE import_id = item_id) AS ItemID,
       created_at,
       updated_at
FROM Downloads
This will translate your incorrect user_ids to whatever ID you want to come from the User table and it will do the same for your item_ids. The information coming from SQL will now be correct.
If, however, you want to update the tables with the correct information, you could write this like so:
UPDATE Downloads
SET user_id = User.user_id,
item_id = Item.item_id
FROM Downloads
INNER JOIN User ON Downloads.user_id = User.import_id
INNER JOIN Item ON Downloads.item_id = Item.import_id
WHERE ...
Make sure to put something in the WHERE clause so you don't update every record in the Downloads table (unless that is the plan). I rewrote the above statement to be a bit more optimized since the original version had two SELECT statements per row, which is a bit intense.
Edit:
Since this is PostgreSQL, you can't have the table name in both the UPDATE and the FROM section. Instead, the tables in the FROM section are joined to the table being updated. Here is a quote about this from the PostgreSQL website:
When a FROM clause is present, what essentially happens is that the target table is joined to the tables mentioned in the fromlist, and each output row of the join represents an update operation for the target table. When using FROM you should ensure that the join produces at most one output row for each row to be modified. In other words, a target row shouldn't join to more than one row from the other table(s). If it does, then only one of the join rows will be used to update the target row, but which one will be used is not readily predictable.
http://www.postgresql.org/docs/8.1/static/sql-update.html
With this in mind, here is an example that I think should work (can't test it, sorry):
UPDATE Downloads
SET user_id = User.user_id,
item_id = Item.item_id
FROM User, Item
WHERE Downloads.user_id = User.import_id AND
Downloads.item_id = Item.import_id
That is the basic idea. Don't forget you will still need to add extra criteria to the WHERE section to limit the rows that are updated.
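If the database were MySQL instead, the same update would use MySQL's multi-table UPDATE syntax; a sketch, assuming Rails-convention users and items tables that each carry id and import_id columns:
UPDATE downloads d
JOIN users u ON u.import_id = d.user_id
JOIN items i ON i.import_id = d.item_id
SET d.user_id = u.id,
    d.item_id = i.id;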
I'm totally guessing from your question, but you have some kind of lookup table that will match an import user_id with the real user_id, and similarly for items. I.e. the assumption is that your line of code:
User.find_by_import_id(download.user_id)
hits the database to do the lookup. The import_users / import_items tables are just the names I've given to the lookup tables to do this.
UPDATE downloads
SET downloads.user_id = import_users.user_id
  , downloads.item_id = import_items.item_id
FROM downloads
INNER JOIN import_users ON downloads.user_id = import_users.import_user_id
INNER JOIN import_items ON downloads.item_id = import_items.import_item_id
Either way (whether the lookup is in the DB or derived from code), would it not just be easier to insert the information correctly in the first place? As it stands you can't have any FKs on your table, since the IDs sometimes point to one table and other times to another, which seems a bit odd.