Activerecord query returning doubles while using uniq - sql

I am running the following query with the goal of returning a unique set of customer objects:
Customer.joins(:projects).select('customers.*, projects.finish_date').where("projects.closed = false").uniq
However, this code generates duplicates if a customer has more than one active project (i.e. closed = false). If I remove projects.finish_date from the select clause, the query works as intended. However, I need it there to be able to sort on that column.
How can I make this query return a unique set of customers?

How can I make this query return a unique set of customers?
This doesn't completely make sense, and probably isn't what you want.
The problem is that you're joining against the projects table, at which point there may be several rows for the same customer with different project finish_dates. Because finish_date is part of the select list, these rows are distinct, so they are returned as multiple Customer objects, each with a different finish_date.
If you only want one of these, how is Rails to determine which one? Wouldn't it be a problem if you only had one customer object with one finish_date returned if there are really 10 projects for that customer, each with a different finish_date?
Instead, you probably want something like this:
customers = Customer.joins(:projects).select('customers.*, projects.finish_date').where("projects.closed = false").uniq
customers.group_by(&:id)
This groups the rows by customer id, giving a hash that maps each id to every row (and finish_date) returned for that customer.
OR, you might want:
projects = Project.where(closed: false).includes(:customer)
customers = projects.map(&:customer).uniq
In either case, you're producing a unique set of customers from the superset of all customer-project joins.
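To make the difference between the two approaches concrete, here is a minimal plain-Ruby sketch. It does not touch a database: JoinedRow is a hypothetical Struct standing in for the objects the joined query returns, and the helper names are mine, not part of ActiveRecord.

```ruby
# Hypothetical stand-in for one row of the customers/projects join.
JoinedRow = Struct.new(:id, :name, :finish_date)

# Keeps every joined row, keyed by customer id.
def rows_by_customer(rows)
  rows.group_by(&:id)
end

# Keeps one row per customer id (the first one encountered).
def unique_customers(rows)
  rows.uniq(&:id)
end

rows = [
  JoinedRow.new(1, "Acme", "2015-03-01"),
  JoinedRow.new(1, "Acme", "2015-06-15"),
  JoinedRow.new(2, "Globex", "2015-04-20")
]

grouped = rows_by_customer(rows)
puts grouped[1].map(&:finish_date).inspect
puts unique_customers(rows).map(&:name).inspect
```

Grouping keeps all finish_dates per customer, while uniq on the id silently discards all but one of them, which is why it matters which of the two you actually want.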
RE Your comments:
If you want to get a list of customers with their most recent associated project, you could use a sub query in your where:
select customers.*, projects.finish_date from customers
inner join projects on projects.customer_id = customers.id
where projects.id = (
select p.id from projects p
where p.customer_id = customers.id
and p.closed = false
order by p.finish_date desc
limit 1
)
You can express this using ActiveRecord by embedding the sub-query in a where:
Customer.joins(:projects)
.select('customers.*, projects.finish_date as finish_date')
.where('projects.id = (select p.id from projects p where p.customer_id = customers.id and p.closed = false order by p.finish_date desc limit 1)')
I have no idea how this will perform for you, but I suspect poorly.
I would always stick to a simple includes and in-Ruby filter before attempting to optimize with SQL.
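As a rough idea of what the in-Ruby filter could look like once the projects are loaded, here is a sketch using plain Structs. ProjectRow and latest_open_project_by_customer are hypothetical names for illustration; with real models you would run the same Enumerable calls on the loaded records.

```ruby
require "date"

# Hypothetical stand-in for a loaded Project record.
ProjectRow = Struct.new(:customer_id, :closed, :finish_date)

# For each customer, keeps only the open project with the latest finish_date.
# Customers without any open project do not appear in the result.
def latest_open_project_by_customer(projects)
  projects
    .reject(&:closed)
    .group_by(&:customer_id)
    .transform_values { |open| open.max_by(&:finish_date) }
end

projects = [
  ProjectRow.new(1, false, Date.new(2015, 3, 1)),
  ProjectRow.new(1, false, Date.new(2015, 6, 15)),
  ProjectRow.new(2, true,  Date.new(2015, 1, 1))
]

latest = latest_open_project_by_customer(projects)
# Customer ids ordered by the finish_date of their latest open project.
puts latest.sort_by { |_id, project| project.finish_date }.map(&:first).inspect
```

This does the work in memory, so it only makes sense while the number of open projects is modest.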

Related

SQL question: how do I find the count of IDs that are always mapped to a 'true' field in another table

I have a database that collects a list of document packages in one table and each individual page in another table
Each page has a PackageID connecting the two tables.
I'm trying to find the count of all packages where ALL pages connected to it have a boolean field (stored on the page table) of true. Even if only 1 of 20 pages connected to the package_id is false, I don't want that package counted.
Right now all I have is:
SELECT COUNT(DISTINCT pages.package_id)
FROM pages
WHERE boolean_field = true
But I'm not sure how to express that if any page with that package_id has boolean_field != true, then I don't want it counted. I also want to know the count of packages that have any false pages.
I'm not sure if I need a subquery, if statement, having clause, or what.
Any direction even if it's what operators I should study on would be super helpful. Thanks :).
select count(*)
from
(
select package_id
from pages
group by package_id
having min(boolean_field) = 1
) tmp
Another way to express this is:
select count(*)
from packages p
where not exists (select 1
from pages pp
where pp.package_id = p.package_id and
not pp.boolean_field
);
The advantage of this approach is that it avoids aggregation, which can be a big win performance wise. It can also take advantage of an index on pages(package_id, boolean_field).
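If it helps to check the logic of either query against a small data set, the same rule can be written in plain Ruby. This is only a sketch on in-memory data: PageRow and package_counts are hypothetical names, and unlike the NOT EXISTS version it only sees packages that have at least one page.

```ruby
# Hypothetical stand-in for one row of the pages table.
PageRow = Struct.new(:package_id, :boolean_field)

# Counts packages whose pages are all true, and packages with at least one
# false page, which are the two numbers the question asks for.
def package_counts(pages)
  all_true, some_false = pages
    .group_by(&:package_id)
    .partition { |_package_id, group| group.all?(&:boolean_field) }
  { all_true: all_true.size, some_false: some_false.size }
end

pages = [
  PageRow.new(1, true), PageRow.new(1, true),
  PageRow.new(2, true), PageRow.new(2, false)
]
puts package_counts(pages).inspect
```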

Active Record - How to perform a nested select on a second table?

I need to list all customers along with their latest order date (plus pagination).
How can I write the following SQL query using Active Record?
select *,
(
select max(created_at)
from orders
where orders.customer_id = customers.id
) as latest_order_date
from customers
limit 25 offset 0
I tried this but it complains missing FROM-clause entry for table "customers":
Customer
.select('*')
.select(
Order
.where('customer_id = customers.id')
.maximum(:created_at)
).page(params[:page])
# Generates this (clearly only the Order query):
SELECT MAX("orders"."created_at")
FROM "orders"
WHERE (customer_id = customers.id)
EDIT: it would be good to keep AR's parameterization and kaminari's pagination goodness.
You haven't given us any information about the relationship between these two tables, so I will assume Customer has_many Orders.
While ActiveRecord doesn't support what you are trying to do, it is built on top of Arel, which does.
Every Rails model has a method named arel_table that will return its corresponding Arel::Table. You might want a helper library to make this cleaner because the default way is a little cumbersome. I will use the plain Arel syntax to maximize compatibility.
ActiveRecord understands Arel objects and can accept them alongside its own syntax.
orders = Order.arel_table
customers = Customer.arel_table
Customer.joins(:orders).group(:id).select([
customers[Arel.star],
orders[:created_at].maximum.as('latest_order_date')
])
Which produces
SELECT "customers".*, MAX("orders"."created_at") AS "latest_order_date"
FROM "customers"
INNER JOIN "orders" ON "orders"."customer_id" = "customers"."id"
GROUP BY "customers"."id"
This is the customary way of doing this, but if you still want to do it as a subquery, you can do this
Customer.select([
customers[Arel.star],
orders.project(orders[:created_at].maximum)
.where(orders[:customer_id].eq(customers[:id]))
.as('latest_order_date')
])
Which gives us
SELECT "customers".*, (
SELECT MAX("orders"."created_at")
FROM "orders"
WHERE "orders"."customer_id" = "customers"."id" ) "latest_order_date"
FROM "customers"
The most Active Record-ish way I've come up with so far is:
Customer
.page(params[:page])
.select('*')
.select(<<-SQL.squish)
(
SELECT MAX(created_at) AS latest_order_date
FROM orders
WHERE orders.customer_id = customers.id
)
SQL
I still wish I could make the string part more Active Record-ish.
The <<-SQL is just heredoc.
Here is the same answer @adam was giving, but using just straight ActiveRecord instead of Arel. Not sure it's really much better than @João Marcelo Souza's.
Customer.select("customers.*, max(orders.created_at)").joins(:orders).group("customers.id").page(params[:page])
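(The GROUP BY avoids listing all the customers columns by relying on a feature of Postgres 9.1 and higher: when a table's primary key is in the GROUP BY, its other columns may appear in the select list.)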
The OP doesn't say, but the query doesn't handle the case where the customer has no orders. This version does that:
Customer.select("customers.*, coalesce(max(orders.created_at),0)").joins("left outer join orders on orders.customer_id=customers.id").group("customers.id").page(params[:page])
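The difference between the inner join and the left outer join can be sketched in plain Ruby. CustomerRow, OrderRow and latest_order_dates are hypothetical names used only for this illustration; the point is that every customer stays in the result and gets nil when there are no orders.

```ruby
require "date"

# Hypothetical stand-ins for rows of the customers and orders tables.
CustomerRow = Struct.new(:id, :name)
OrderRow = Struct.new(:customer_id, :created_at)

# Left-join style: every customer appears once, paired with the newest
# created_at among its orders, or nil when it has no orders at all.
def latest_order_dates(customers, orders)
  by_customer = orders.group_by(&:customer_id)
  customers.map do |customer|
    dates = by_customer.fetch(customer.id, []).map(&:created_at)
    [customer.name, dates.max]
  end
end

customers = [CustomerRow.new(1, "Ann"), CustomerRow.new(2, "Bob")]
orders = [
  OrderRow.new(1, Date.new(2016, 1, 10)),
  OrderRow.new(1, Date.new(2016, 2, 1))
]
puts latest_order_dates(customers, orders).inspect
```

An inner join would drop Bob entirely, which is the case the first query above misses.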

How to join products and their characteristics

I have two tables.
Product (id, title, price, created_at, updated_at etc)
and
ProductCharacteristic(id, product_id, sold_quantity, date, created_at, updated_at etc).
I need to show a products table (header is product.id, product.title, product.price, sold_quantity) for some period of time, ordered by any field from the header.
I can't work out how to write the query.
Right now I have the following query:
> current_project.products.includes(:product_characteristics).group('products.id').pluck(:title, 'SUM(product_characteristics.sold_quantity) AS sold_quantity')
(45.4ms) SELECT "products"."title", SUM(product_characteristics.sold_quantity) AS sold_quantity FROM "products" LEFT OUTER JOIN "product_characteristics" ON "product_characteristics"."product_id" = "products"."id" WHERE "products"."project_id" = $1 GROUP BY products.id [["project_id", 20]]
Please help me write this query through the ORM (adding a where on dates and an ordering) or as a raw SQL query.
I used pluck. It returns an array of arrays (not an array of hashes), which is not ideal.
The product_characteristics.date field is unique within the scope of product_id. But please give me two examples (with this condition and without it, to satisfy my curiosity).
I use PostgreSQL and Rails 4.2.x.
P.S. By the way, the ProductCharacteristic table will have a lot of records (more than one million). Should I use PostgreSQL table partitioning? Can it improve performance?
Thank you.
You can use select instead of pluck in that case, and the aggregate will be accessible as product.sold_quantity.
The query becomes
products = current_project.products.joins(:product_characteristics).group('products.id').select(:title, 'SUM(product_characteristics.sold_quantity) AS sold_quantity')
products.first.sold_quantity # => works
To order, you can just add an order clause
products = products.order(id: :asc)
or
products = products.order(id: :desc)
for instance
And for the where
products = products.where("products.created_at > ?", 2.days.ago)
for instance.
You can chain further clauses after the first line in any order. It does not matter, because the query is only executed when you actually use the retrieved set.
And so you can also do stuff like
if params[:foo]
products = products.order(:id)
end
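For checking the numbers on a small sample, the report the question describes (sold_quantity summed per product over a period, ordered by any header field) can be sketched in plain Ruby. ProductRow, CharacteristicRow and sales_report are hypothetical names; this runs on in-memory data and is not a replacement for the SQL on a million-row table.

```ruby
require "date"

# Hypothetical stand-ins for rows of the two tables in the question.
ProductRow = Struct.new(:id, :title, :price)
CharacteristicRow = Struct.new(:product_id, :sold_quantity, :date)

# Sums sold_quantity per product for characteristics dated within `period`,
# then orders the report rows by the requested header field.
# Products with no sales in the period are reported with 0.
def sales_report(products, characteristics, period, order_by: :id)
  totals = Hash.new(0)
  characteristics.each do |c|
    totals[c.product_id] += c.sold_quantity if period.cover?(c.date)
  end
  products
    .map { |p| { id: p.id, title: p.title, price: p.price, sold_quantity: totals[p.id] } }
    .sort_by { |row| row[order_by] }
end

products = [ProductRow.new(1, "Pen", 2.0), ProductRow.new(2, "Ink", 5.0)]
characteristics = [
  CharacteristicRow.new(1, 3, Date.new(2016, 1, 5)),
  CharacteristicRow.new(2, 10, Date.new(2016, 1, 5))
]
january = Date.new(2016, 1, 1)..Date.new(2016, 1, 31)
puts sales_report(products, characteristics, january, order_by: :sold_quantity).inspect
```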

What's the most efficient way to exclude possible results from an SQL query?

I have a subscription database containing Customers, Subscriptions and Publications tables.
The Subscriptions table contains ALL subscription records and each record has three flags to mark the status: isActive, isExpired and isPending. These are Booleans and only one flag can be True - this is handled by the application.
I need to identify all customers who have not renewed any magazines to which they have previously subscribed and I'm not sure that I've written the most efficient SQL query. If I find a lapsed subscription I need to ignore it if they already have an active or pending subscription for that particular magazine.
Here's what I have:
SELECT DISTINCT Customers.id, Subscriptions.publicationName
FROM Subscriptions
LEFT JOIN Customers
ON Subscriptions.id_Customer = Customers.id
LEFT JOIN Publications
ON Subscriptions.id_Publication = Publications.id
WHERE Subscriptions.isExpired = 1
AND NOT EXISTS
( SELECT * FROM Subscriptions s2
WHERE s2.id_Publication = Subscriptions.id_Publication
AND s2.id_Customer = Subscriptions.id_Customer
AND s2.isPending = 1 )
AND NOT EXISTS
( SELECT * FROM Subscriptions s3
WHERE s3.id_Publication = Subscriptions.id_Publication
AND s3.id_Customer = Subscriptions.id_Customer
AND s3.isActive = 1 )
I have just over 50,000 subscription records and this query takes almost an hour to run which tells me that there's a lot of looping or something going on where for each record the SQL engine is having to search again to find any 'isPending' and 'isActive' records.
This is my first post so please be gentle if I've missed out any information in my question :) Thanks.
I don't have your complete database structure, so I can't test the following query, but it may contain some optimizations. I will leave it to you to test, but I will explain what I changed and why.
select distinct Customers.id, Subscriptions.publicationName
from Subscriptions
join Customers on Subscriptions.id_Customer = Customers.id
join Publications
on Subscriptions.id_Publication = Publications.id
where Subscriptions.isExpired = 1
and not exists
(select * from Subscriptions s2
join Customers on s2.id_Customer = Customers.id
join Publications
on s2.id_Publication = Publications.id
where s2.id_Customer = Subscriptions.id_Customer
and s2.id_Publication = Subscriptions.id_Publication
and (s2.isPending = 1 or s2.isActive = 1))
Here is what I changed and why. If a subscription has no matching row in Customers or Publications, its information isn't useful, so I replaced the LEFT joins with plain joins. I also combined the two NOT EXISTS subqueries into one; these are fairly expensive if I recall, so the fewer the better. One thing I did not change above but that may be worth looking into: can the subquery return specific fields instead of SELECT *? Returning every field may slow down processing, although many engines ignore the select list inside EXISTS. I can't confirm this because I don't have an equivalent DB to test on (a quick search will probably tell you).
I suspect there are further optimizations that could be made on this query. Replacing the EXISTS clause with an IN clause may help, but I can't think of a way right now, since you have to match on two fields (the customer id and the relevant publication). Let me know if this helps at all.
With a table of 50k rows, you should be able to run a query like this in seconds.
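To see why 50k rows should not be a problem, here is the same anti-join written as a plain-Ruby sketch: one pass builds a set of customer/publication pairs that are still active or pending, and a second pass filters the expired rows against it. SubscriptionRow, its :status field (a stand-in for the three boolean flags) and lapsed_pairs are assumptions of mine for illustration.

```ruby
require "set"

# Hypothetical stand-in for a Subscriptions row; :status replaces the three
# mutually exclusive flags (isActive, isExpired, isPending).
SubscriptionRow = Struct.new(:customer_id, :publication_id, :status)

# Returns unique [customer_id, publication_id] pairs that have an expired
# subscription and no active or pending subscription for the same pair.
def lapsed_pairs(subscriptions)
  current = subscriptions
    .select { |s| [:active, :pending].include?(s.status) }
    .map { |s| [s.customer_id, s.publication_id] }
    .to_set
  subscriptions
    .select { |s| s.status == :expired }
    .map { |s| [s.customer_id, s.publication_id] }
    .uniq
    .reject { |pair| current.include?(pair) }
end

subscriptions = [
  SubscriptionRow.new(1, 10, :expired),
  SubscriptionRow.new(1, 10, :active),
  SubscriptionRow.new(2, 10, :expired)
]
puts lapsed_pairs(subscriptions).inspect
```

A database can do the equivalent with an index on (id_Customer, id_Publication), so an hour-long run usually points at a missing index rather than at the shape of the query.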

Update values in each row based on foreign_key value

Downloads table:
id (primary key)
user_id
item_id
created_at
updated_at
The user_id and item_id in this case are both incorrect. However, the matching values are properly stored in the users and items tables, respectively (as import_id in each table). Here's what I'm trying to script:
downloads.each do |download|
  # Look up the real records by the import_id stored in the download row.
  user = User.find_by_import_id(download.user_id)
  item = Item.find_by_import_id(download.item_id)
  if user && item
    download.update_attributes(:user_id => user.id, :item_id => item.id)
  end
end
So: look up the user and item based on their respective import_ids, then update those values in the download record.
This takes forever with a ton of rows. Is there any way to do this in SQL?
If I understand you correctly, you simply need to add two subqueries to your SELECT statement to look up the correct IDs. For example:
SELECT id,
(SELECT correct_id FROM User WHERE import_id=user_id) AS UserID,
(SELECT correct_id FROM Item WHERE import_id=item_id) AS ItemID,
created_at,
updated_at
FROM Downloads
This will translate your incorrect user_ids to whatever ID you want to come from the User table and it will do the same for your item_ids. The information coming from SQL will now be correct.
If, however, you want to update the tables with the correct information, you could write this like so:
UPDATE Downloads
SET user_id = User.user_id,
item_id = Item.item_id
FROM Downloads
INNER JOIN User ON Downloads.user_id = User.import_id
INNER JOIN Item ON Downloads.item_id = Item.import_id
WHERE ...
Make sure to put something in the WHERE clause so you don't update every record in the Downloads table (unless that is the plan). I rewrote the above statement to be a bit more optimized since the original version had two SELECT statements per row, which is a bit intense.
Edit:
Since this is PostgreSQL, you can't have the table name in both the UPDATE and the FROM section. Instead, the tables in the FROM section are joined to the table being updated. Here is a quote about this from the PostgreSQL website:
When a FROM clause is present, what essentially happens is that the target table is joined to the tables mentioned in the fromlist, and each output row of the join represents an update operation for the target table. When using FROM you should ensure that the join produces at most one output row for each row to be modified. In other words, a target row shouldn't join to more than one row from the other table(s). If it does, then only one of the join rows will be used to update the target row, but which one will be used is not readily predictable.
http://www.postgresql.org/docs/8.1/static/sql-update.html
With this in mind, here is an example that I think should work (can't test it, sorry):
UPDATE downloads
SET user_id = users.id,
item_id = items.id
FROM users, items
WHERE downloads.user_id = users.import_id AND
downloads.item_id = items.import_id
That is the basic idea. Don't forget you will still need to add extra criteria to the WHERE section to limit the rows that are updated.
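If you would rather stay in Ruby, the same join idea applies: build each import_id lookup once instead of running two queries per download. The sketch below uses plain Structs in place of the models (ImportedRow, DownloadRow and remap_downloads are hypothetical names), so it only shows the remapping logic, not the saving.

```ruby
# Hypothetical stand-ins for users/items rows and downloads rows.
ImportedRow = Struct.new(:id, :import_id)
DownloadRow = Struct.new(:id, :user_id, :item_id)

# Builds each import_id => id lookup once, then remaps every download in a
# single pass. Downloads with no match on either side are returned unchanged,
# like the `if user && item` guard in the question.
def remap_downloads(downloads, users, items)
  user_ids = users.map { |u| [u.import_id, u.id] }.to_h
  item_ids = items.map { |i| [i.import_id, i.id] }.to_h
  downloads.map do |download|
    user_id = user_ids[download.user_id]
    item_id = item_ids[download.item_id]
    next download unless user_id && item_id
    DownloadRow.new(download.id, user_id, item_id)
  end
end

users = [ImportedRow.new(1, 501)]
items = [ImportedRow.new(7, 900)]
downloads = [DownloadRow.new(1, 501, 900), DownloadRow.new(2, 999, 900)]
puts remap_downloads(downloads, users, items).map(&:to_a).inspect
```

This turns two queries per row into two queries total, though the single UPDATE above still avoids loading the rows at all.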
I'm totally guessing from your question, but it seems you have some kind of lookup table that matches an import user_id with the real user_id, and similarly for items. That is, the assumption is that your line of code:
User.find_by_import_id(download.user_id)
hits the database to do the lookup. import_users and import_items are just the names I've given to those lookup tables.
UPDATE downloads
SET downloads.user_id = import_users.user_id
, downloads.item_id = import_items.item_id
FROM downloads
INNER JOIN import_users ON downloads.user_id = import_users.import_user_id
INNER JOIN import_items ON downloads.item_id = import_items.import_item_id
Either way (whether the lookup is in the DB or derived from code), would it not be easier to insert the information correctly in the first place? As it stands, you can't have any FKs on this table, since the columns sometimes point to one table and sometimes to another. That seems a bit odd.