Optimization of SQL query based on attribute (not having specific value) in joined table - sql

My models structure: Movie has_many :captions. Language of the Caption may be “en”, “de”, “fr”...
Problem:
I need an efficient query that selects Movies that don’t have a Caption in the “en” language.
The app that needs this runs on Rails, and I’m currently using something like this in the Caption model:
def self.ids_of_movies_without_caption_in_en
a = (1..(Movie.last.lp.to_i)).to_a
b = Caption.in_lang("en").collect {|h| h.movie_id }
(a - b)
end
As you can see, I collect the ids (lp) of all movies and then remove from that array the ids of movies that have a Caption with “en” as its language. The outcome is an array of ids of the Movies I need.
The above works, but as you can imagine it’s quite “heavy”. I believe there is a better (and maybe trivial) approach. However, being “fresh” with SQL, I’m asking for some guidance in writing an efficient query. This runs on PostgreSQL.
An implementation in Rails (5.2) would be an additional bonus!
This is the situation: let's say in the database there are 1000 movies, and 4000 captions for those movies. There are of course movies that don't have any captions. Out of those 4000 captions 400 are in "en" language. The query I'm looking for would return 600 movies, where caption in "en" does not exist (including movies with 0 captions).

This is quite easy in SQL. I'm not quite sure what the tables look like, but something like this:
select movie_id
from captions
group by movie_id
having not bool_or(language = 'en');
If you also want movies that have no captions at all, use not exists:
select m.movie_id
from movies m
where not exists (select 1
from captions c
where c.movie_id = m.movie_id and
c.language = 'en'
);
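To see how the two queries differ, here is a minimal sketch that runs both shapes against a tiny in-memory database. SQLite (through Python's sqlite3) stands in for PostgreSQL, the table and column names (movies.id, captions.movie_id, captions.language) are assumptions, and since SQLite has no bool_or the grouped version uses a SUM(CASE ...) test instead.

```python
import sqlite3

# Hypothetical schema: movies(id), captions(movie_id, language).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE movies (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE captions (id INTEGER PRIMARY KEY, movie_id INTEGER, language TEXT);
    INSERT INTO movies (id, title) VALUES (1, 'A'), (2, 'B'), (3, 'C'), (4, 'D');
    INSERT INTO captions (movie_id, language) VALUES
        (1, 'en'), (1, 'de'),   -- movie 1 has an 'en' caption
        (2, 'de'), (2, 'fr'),   -- movie 2 has captions, none in 'en'
        (4, 'en');              -- movie 3 has no captions at all
""")

def movies_without_caption(conn, lang):
    """Anti-join: every movie with no caption in `lang`, captionless ones included."""
    rows = conn.execute("""
        SELECT m.id
        FROM movies m
        WHERE NOT EXISTS (SELECT 1
                          FROM captions c
                          WHERE c.movie_id = m.id AND c.language = ?)
        ORDER BY m.id
    """, (lang,)).fetchall()
    return [r[0] for r in rows]

def captioned_movies_without(conn, lang):
    """Grouped form: only sees movies that have at least one caption."""
    rows = conn.execute("""
        SELECT movie_id
        FROM captions
        GROUP BY movie_id
        HAVING SUM(CASE WHEN language = ? THEN 1 ELSE 0 END) = 0
        ORDER BY movie_id
    """, (lang,)).fetchall()
    return [r[0] for r in rows]

print(movies_without_caption(conn, "en"))    # [2, 3]
print(captioned_movies_without(conn, "en"))  # [2]
```

Movie 3 only shows up in the NOT EXISTS result, which is the behaviour the question asks for (movies with 0 captions included).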

Related

How to check if a tuple contains an element in Apache Pig?

Let's say I have this file:
movie_id,title,genres
95004,Superman/Doomsday (2007),Action|Animation
136297,Mortal Kombat: The Journey Begins (1995),Action
193565,Gintama: The Movie (2010),Action|Animation|Comedy|Sci-Fi
193573,Love Live! The School Idol Movie (2015),Animation
I load it like this:
movies = LOAD 'movies.csv' USING PigStorage(',') AS (movieId:int, title:chararray, genres:chararray);
a = FOREACH movies GENERATE movieId, title, STRSPLIT(genres,'\\|') AS genres;
DESCRIBE a; -- a: {movieId: int,title: chararray,genres: ()}
Example of dump a results:
...
(193581,Black Butler: Book of the Atlantic (2017),(Action,Animation,Comedy,Fantasy))
(193583,No Game No Life: Zero (2017),(Animation,Comedy,Fantasy))
...
Now, if I understand correctly, the field genres is of type tuple. The question is how I can run a query such as "get all the action movies". I don't know how to check whether a specific element is present in the tuple genres.
I know how to do this with a Python UDF function, but I would like to know if it is possible without one. Maybe I should load the file differently.
Thank you for your help.
If you are happy to put the genres into a bag rather than a tuple (I think this is more appropriate, since the number of genres varies from record to record), this can be solved with a nested FOREACH: filter the bag for the specific genre, then test whether the filtered bag is non-empty.
movies = LOAD 'movies.csv' USING PigStorage(',') AS (movieId:int, title:chararray, genres:chararray);
moviesSplit = FOREACH movies GENERATE movieId, title, TOKENIZE(genres,'|') AS genres;
actionTest = FOREACH moviesSplit {
    -- keep only the 'Action' entries of this record's genres bag
    action = FILTER genres BY $0 == 'Action';
    GENERATE *, action;
};
actionMovies = FILTER actionTest BY NOT IsEmpty(action);

ActiveRecord multiple joins through association

My customer model has many videos, and videos have many video activities.
I want to query video activities, limited to videos that belong to a customer with a specific email domain.
This code will give me all the video activities belonging to customer with id 52, but since videos don't have customer email, I need to join customer onto video and then do a .where.
VideoActivity.joins(:video).where(videos: {customer_id: 52})
How is this done? Doing VideoActivity.joins(:video).joins(:customer) gives me an error saying VideoActivity doesn't have a customer associated with it.
VideoActivity has no direct relation to Customer; you need to say that the customer is related to the video. It's then easier to use ActiveRecord's #merge than to write a hash where:
VideoActivity.joins(video: :customer).merge(Customer.where('some condition'))
If you have a scope on Customer you can use that too. Here's an example:
VideoActivity.joins(video: :customer).merge(Customer.some_scope)
PS:
a scope could be
# customer model
scope :email_ends_with, ->(string) { where('email ilike ?', "%#{string}") }
Then use it
VideoActivity.joins(video: :customer).merge(Customer.email_ends_with('gmail.com'))
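For reference, the nested joins plus merge should boil down to two inner joins and a filter on the customers table. The sketch below runs that SQL shape on a small in-memory SQLite database through Python's sqlite3; the table and column names follow Rails naming conventions but are assumptions, as is the sample data.

```python
import sqlite3

# Hypothetical tables following Rails naming conventions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, email TEXT);
    CREATE TABLE videos (id INTEGER PRIMARY KEY, customer_id INTEGER);
    CREATE TABLE video_activities (id INTEGER PRIMARY KEY, video_id INTEGER);
    INSERT INTO customers VALUES (1, 'ann@gmail.com'), (2, 'bob@example.org');
    INSERT INTO videos VALUES (10, 1), (11, 2), (12, 1);
    INSERT INTO video_activities VALUES (100, 10), (101, 11), (102, 12), (103, 11);
""")

def activity_ids_for_domain(conn, domain):
    """Activities whose video belongs to a customer with an email at `domain`."""
    rows = conn.execute("""
        SELECT va.id
        FROM video_activities va
        INNER JOIN videos v ON v.id = va.video_id
        INNER JOIN customers c ON c.id = v.customer_id
        WHERE c.email LIKE ?
        ORDER BY va.id
    """, ("%@" + domain,)).fetchall()
    return [r[0] for r in rows]

print(activity_ids_for_domain(conn, "gmail.com"))  # [100, 102]
```

The pattern here is anchored on the "@" so that "gmail.com" does not also match something like "notgmail.com"; the scope above would need the same tweak if that matters.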
There are a bunch of ways to do this and all end up at about the same place, but using an explicit where statement will easily accomplish this goal.
VideoActivity.joins(:video).where("videos.customer_id = ?", 52)
You can write plain sql to fetch records
activities = VideoActivity.joins('INNER JOIN videos ON video_activities.video_id = videos.id INNER JOIN customers ON videos.customer_id = customers.id')
activities.where('customers.email = ?', customer_email)

How to retrieve a list of records and the count of each one's children with condition in Active Record?

There are two models with our familiar one-to-many relationship:
class Custom
has_many :orders
end
class Order
belongs_to :custom
end
I want to do the following work:
Get all the customs whose age is over 18, together with how many big orders (paying more than 1,000 dollars) each of them has.
UPDATE:
for the models:
rails g model custom name:string age:integer
rails g model orders amount:decimal custom_id:integer
I hope one left join sql statement will do all my job, and don't construct unnecessary objects like this:
Custom.where('age > ?', '18').includes(:orders).where('orders.amount > ?', '1000')
It will construct a lot of order objects which I don't need, and it will calculate the count by Array#count function which will waste time.
UPDATE 2:
My own solution is wrong: it removes customs who don't have any big orders from the result.
Finding adult customers with big orders
This solution uses a single query, with the nested orders relation transformed into a sub-query.
big_customers = Custom.where("age > ?", "18").where(
id: Order.where("amount > ?", "1000").select(:custom_id)
)
Grab all adults and their # of big orders (MySQL)
This can still be done in a single query. The count is grabbed via a join on orders and sticking the count of orders into a column in the result called big_orders_count, which ActiveRecord turns into a method. It involves a lot more "raw" SQL. I don't know any way to avoid this with ActiveRecord except with the great squeel gem.
adults = Custom.where("age > ?", "18").select([
Custom.arel_table["*"],
"count(orders.id) as big_orders_count"
]).joins(%{LEFT JOIN orders
ON orders.custom_id = customs.id
AND orders.amount > 1000})
# one row per custom; without GROUP BY the count collapses into a single row
.group("customs.id")
# see count:
adults.first.big_orders_count
You might want to consider caching counters like this. This join will be expensive on the database, so it may be worth keeping a dedicated customs.big_orders_count column that is either refreshed regularly or updated by an observer that watches for big Order records.
Grab all adults and their # of big orders (PostgreSQL)
Solution 2 is mysql only. To get this to work in postgresql I created a third solution that uses a sub-query. Still one call to the DB :-)
adults = Custom.where("age > ?", "18").select([
%{"customs".*},
%{(
SELECT count(*)
FROM orders
WHERE orders.custom_id = customs.id
AND orders.amount > 1000
) AS big_orders_count}
])
# see count:
adults.first.big_orders_count
I have tested this against postgresql with real data. There may be a way to use more ActiveRecord and less SQL, but this works.
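As a sanity check on both approaches, here is a small sketch that runs the LEFT JOIN form (with a GROUP BY on customs.id) and the correlated sub-query form side by side. It uses SQLite through Python's sqlite3 rather than MySQL or PostgreSQL, and the sample rows are made up; the customs and orders columns follow the generators in the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customs (id INTEGER PRIMARY KEY, name TEXT, age INTEGER);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, custom_id INTEGER, amount NUMERIC);
    INSERT INTO customs VALUES (1, 'Ann', 30), (2, 'Bob', 17), (3, 'Cy', 45);
    INSERT INTO orders (custom_id, amount) VALUES
        (1, 1500), (1, 2500), (1, 20),  -- Ann: two big orders
        (2, 5000),                      -- Bob: under 18, filtered out
        (3, 10);                        -- Cy: no big orders
""")

# The amount filter sits in the ON clause, so adults without big orders
# keep their row and get a count of 0.
LEFT_JOIN_SQL = """
    SELECT customs.name, COUNT(orders.id) AS big_orders_count
    FROM customs
    LEFT JOIN orders
           ON orders.custom_id = customs.id AND orders.amount > 1000
    WHERE customs.age > 18
    GROUP BY customs.id
    ORDER BY customs.id
"""

# Same result with a correlated sub-query in the select list.
SUBQUERY_SQL = """
    SELECT customs.name,
           (SELECT COUNT(*)
            FROM orders
            WHERE orders.custom_id = customs.id
              AND orders.amount > 1000) AS big_orders_count
    FROM customs
    WHERE customs.age > 18
    ORDER BY customs.id
"""

left_join_rows = conn.execute(LEFT_JOIN_SQL).fetchall()
subquery_rows = conn.execute(SUBQUERY_SQL).fetchall()
print(left_join_rows)  # [('Ann', 2), ('Cy', 0)]
print(subquery_rows)   # [('Ann', 2), ('Cy', 0)]
```

Moving the amount condition from the ON clause into WHERE would drop Cy from the LEFT JOIN result, which is the mistake described in UPDATE 2.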
Edited.
#custom_over_18 = Custom.where("age > ?", "18").orders.where("amount > ?", "1000").count

Filtering model with HABTM relationship

I have 2 models - Restaurant and Feature. They are connected via has_and_belongs_to_many relationship. The gist of it is that you have restaurants with many features like delivery, pizza, sandwiches, salad bar, vegetarian option,… So now when the user wants to filter the restaurants and lets say he checks pizza and delivery, I want to display all the restaurants that have both features; pizza, delivery and maybe some more, but it HAS TO HAVE pizza AND delivery.
If I do a simple .where('features IN (?)', params[:features]) I (of course) get the restaurants that have either one (pizza, or delivery, or both), which is not at all what I want.
My SQL/Rails knowledge is kinda limited since I'm new to this but I asked a friend and now I have this huuuge SQL that gets the job done:
Restaurant.find_by_sql(['SELECT restaurant_id FROM (
SELECT features_restaurants.*, ROW_NUMBER() OVER(PARTITION BY restaurants.id ORDER BY features.id) AS rn FROM restaurants
JOIN features_restaurants ON restaurants.id = features_restaurants.restaurant_id
JOIN features ON features_restaurants.feature_id = features.id
WHERE features.id in (?)
) t
WHERE rn = ?', params[:features], params[:features].count])
So my question is: is there a better - more Rails even - way of doing this? How would you do it?
Oh BTW I'm using Rails 4 on Heroku so it's a Postgres DB.
This is an example of a set-within-sets query. I advocate solving these with group by and having, because this provides a general framework.
Here is how this works in your case:
select fr.restaurant_id
from features_restaurants fr join
features f
on fr.feature_id = f.feature_id
group by fr.restaurant_id
having sum(case when f.feature_name = 'pizza' then 1 else 0 end) > 0 and
sum(case when f.feature_name = 'delivery' then 1 else 0 end) > 0
Each condition in the having clause counts the rows for one of the features, "pizza" and "delivery". If both counts are above zero, the restaurant has both features and you get its restaurant_id.
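Since the list of features comes from user input, the having clause can be generated, one condition per requested feature. A minimal sketch of that idea, run on SQLite through Python's sqlite3: the schema uses features.id and features.name (an assumption, the query above uses feature_id and feature_name), and restaurants_with_all is a hypothetical helper name.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE restaurants (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE features (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE features_restaurants (restaurant_id INTEGER, feature_id INTEGER);
    INSERT INTO restaurants VALUES (1, 'Luigi'), (2, 'Green Bowl'), (3, 'Speedy');
    INSERT INTO features VALUES (1, 'pizza'), (2, 'delivery'), (3, 'salad bar');
    INSERT INTO features_restaurants VALUES
        (1, 1), (1, 2), (1, 3),  -- Luigi: pizza, delivery, salad bar
        (2, 3), (2, 2),          -- Green Bowl: salad bar, delivery
        (3, 1);                  -- Speedy: pizza only
""")

def restaurants_with_all(conn, wanted):
    """Ids of restaurants that have every feature name in `wanted`."""
    if not wanted:
        raise ValueError("at least one feature name is required")
    # One "SUM(CASE ...) > 0" condition per requested feature, ANDed together.
    # Only placeholders are interpolated; the names are bound as parameters.
    having = " AND ".join(
        "SUM(CASE WHEN f.name = ? THEN 1 ELSE 0 END) > 0" for _ in wanted
    )
    sql = f"""
        SELECT fr.restaurant_id
        FROM features_restaurants fr
        JOIN features f ON f.id = fr.feature_id
        GROUP BY fr.restaurant_id
        HAVING {having}
        ORDER BY fr.restaurant_id
    """
    return [r[0] for r in conn.execute(sql, list(wanted)).fetchall()]

print(restaurants_with_all(conn, ["pizza", "delivery"]))  # [1]
print(restaurants_with_all(conn, ["delivery"]))           # [1, 2]
```

Only Luigi has both pizza and delivery, even though all three restaurants have at least one of the two.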
How much data is in your features table? Is it just a table of ids and names?
If so, and you're willing to do a little denormalization, you can do this much more easily by encoding the features as a text array on restaurant.
With this scheme your queries boil down to
select * from restaurants where restaurants.features @> ARRAY['pizza', 'delivery']
If you want to maintain your features table because it contains useful data, you can store the array of feature ids on the restaurant and do a query like this:
select * from restaurants where restaurants.feature_ids @> ARRAY[5, 17]
If you don't know the ids up front, and want it all in one query, you should be able to do something along these lines:
select * from restaurants where restaurants.feature_ids @> ARRAY(
select id from features where name in ('pizza', 'delivery')
)
That last query might need some more consideration...
Anyways, I've actually got a pretty detailed article written up about Tagging in Postgres and ActiveRecord if you want some more details.
This is not a "copy and paste" solution, but if you follow these steps you will have a fast, working query:
index the feature_name column (I'm assuming that the feature_id column is indexed on both tables)
place each feature_name param in its own exists(), correlated on the restaurant so that each check can match a different join row:
select r.id
from
restaurants r
where
exists(select true from features_restaurants fr join features f on fr.feature_id = f.feature_id where fr.restaurant_id = r.id and f.feature_name = 'pizza')
and
exists(select true from features_restaurants fr join features f on fr.feature_id = f.feature_id where fr.restaurant_id = r.id and f.feature_name = 'delivery')
Maybe you're looking at it backwards?
Maybe try merging the restaurants returned by each feature.
Simplified:
pizza_restaurants = Feature.find_by_name('pizza').restaurants
delivery_restaurants = Feature.find_by_name('delivery').restaurants
pizza_delivery_restaurants = pizza_restaurants & delivery_restaurants
Obviously, this is a single instance solution. But it illustrates the idea.
UPDATE
Here's a dynamic method to pull in all filters without writing SQL (i.e. the "Railsy" way)
def get_restaurants_by_feature_names(features)
# accepts an array of feature names
restaurants = Restaurant.all
features.each do |f|
feature_restaurants = Feature.find_by_name(f).restaurants
restaurants = feature_restaurants & restaurants
end
return restaurants
end
Since it's an AND condition (OR conditions get dicey with AREL), I reread your stated problem, ignoring the SQL. I think this is what you want.
# in Restaurant
has_many :features
# in Feature
has_many :restaurants
# this is a contrived example. you may be doing something like
# where(name: 'pizza'). I'm just making this condition up. You
# could also make this more DRY by just passing in the name if
# that's what you're doing.
def self.pizza
where(pizza: true)
end
def self.delivery
where(delivery: true)
end
# query
Restaurant.features.pizza.delivery
Basically you call the association with ".features" and then you use the self methods defined on features. Hopefully I didn't misunderstand the original problem.
Cheers!
Restaurant
.joins(:features)
.where(features: {name: ['pizza','delivery']})
.group(:id)
.having('count(features.name) = ?', 2)
This seems to work for me. I tried it with SQLite though.

Rails, order by number of matched tags and then by name

Here's what I want to do: Listing has a many-to-many relationship with Tag through Taggings. I want to allow a user to search for listings by title (of the listing) and name (of zero or more tags). I want to order the results first by the number of tags matched, greatest first, and then by title.
It seems like this question has been done before -- it might be as simple as matching this question (Ordering items with matching tags by number of tags that match) from MySQL. However, I'm not SQL-literate at all, which is why I'm asking for help.
Update:
Here is an example of what I want.
Say I have 4 listings.
listing1 has tags "humor," "funny," and "hilarious."
listing2 has tags "funny," "silly," and "goofy."
listing3 has tags "funny," "silly," and "goofy."
listing4 has the tag "completely serious."
If I make a search with the tags "funny" and "silly", what I should get back is listing2, listing3, listing1, and listing4 (ignoring titles for now).
Interesting problem. I think you might have to use some SQL sugar to do this scope.
Something like this:
Listing
.joins("LEFT JOIN taggings ON taggings.listing_id = listings.id")
.joins("LEFT JOIN tags ON tags.id = taggings.tag_id AND tags.name IN ('funny','silly')")
.group("listings.id")
.order(Arel.sql("count(tags.id) DESC, listings.title"))
Does that help?
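To check the ordering against the example in the question, here is a sketch of the underlying SQL run on an in-memory SQLite database through Python's sqlite3. The listings, tags and taggings tables follow Rails naming conventions (an assumption), the data is the four listings from the question, and search_by_tags is a hypothetical helper.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE listings (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE tags (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE taggings (listing_id INTEGER, tag_id INTEGER);
    INSERT INTO listings VALUES
        (1, 'listing1'), (2, 'listing2'), (3, 'listing3'), (4, 'listing4');
    INSERT INTO tags VALUES
        (1, 'humor'), (2, 'funny'), (3, 'hilarious'),
        (4, 'silly'), (5, 'goofy'), (6, 'completely serious');
    INSERT INTO taggings VALUES
        (1, 1), (1, 2), (1, 3),
        (2, 2), (2, 4), (2, 5),
        (3, 2), (3, 4), (3, 5),
        (4, 6);
""")

def search_by_tags(conn, names):
    """All listings, most matching tags first, ties broken by title."""
    placeholders = ", ".join("?" for _ in names)
    # The tag-name filter lives in the ON clause, so listings with no
    # matching tag stay in the result with a count of 0.
    sql = f"""
        SELECT listings.title, COUNT(tags.id) AS matches
        FROM listings
        LEFT JOIN taggings ON taggings.listing_id = listings.id
        LEFT JOIN tags ON tags.id = taggings.tag_id
                      AND tags.name IN ({placeholders})
        GROUP BY listings.id
        ORDER BY matches DESC, listings.title
    """
    return conn.execute(sql, list(names)).fetchall()

print(search_by_tags(conn, ["funny", "silly"]))
# [('listing2', 2), ('listing3', 2), ('listing1', 1), ('listing4', 0)]
```

That is the listing2, listing3, listing1, listing4 order the question expects.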
Assuming you want a solution in pure ActiveRecord so as not to touch any SQL...
Listing.order("tags.count DESC, title")
In this case you'd probably be better off using a counter cache for tags to optimize your queries.