Rails (or maybe SQL): Finding and deleting duplicate AR objects

ActiveRecord objects of the class 'Location' (representing the db-table Locations) have the attributes 'url', 'lat' (latitude) and 'lng' (longitude).
Lat-lng combinations on this model should be unique. The problem is that the database contains many Location objects with duplicate lat-lng combinations.
I need help in doing the following:
Find objects that share the same lat-lng combination.
If the 'url' attribute of the object isn't empty, keep this object and delete the other duplicates. Otherwise just choose the oldest object (by checking the attribute 'created_at') and delete the other duplicates.
As this is a one-time-operation, solutions in SQL (MySQL 5.1 compatible) are welcome too.

If it's a one-time thing then I'd just do it in Ruby and not worry too much about efficiency. I haven't tested this thoroughly, so check the sorting and such to make sure it does exactly what you want before running it on your db :)
keep = []
locations = Location.find(:all)
locations.each do |loc|
  # get all Locations with the same coords as this one
  same_coords = locations.select { |l| l.lat == loc.lat && l.lng == loc.lng }
  # blank? also covers a nil url, which empty? would choke on
  with_urls = same_coords.reject { |l| l.url.blank? }
  # decide which list to use depending on whether there were any urls
  candidates = with_urls.any? ? with_urls : same_coords
  # pick the oldest one (ascending by created_at)
  keep << candidates.sort_by { |l| l.created_at }.first.id
end
# only keep unique ids
keep.uniq!
# now we just delete all the rows we didn't decide to keep
locations.each do |loc|
  loc.destroy unless keep.include?(loc.id)
end
Now like I said, this is definitely poor, poor code. But sometimes just hacking out the thing that works is worth the time saved in thinking up something 'better', especially if it's just a one-off.
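The selection rule from the question (prefer a row with a URL, otherwise take the oldest) can also be written with group_by, which avoids rescanning the whole list for every row. This is only a minimal sketch on plain Ruby structs, not something run against a real database; `Loc` and `ids_to_keep` are hypothetical names standing in for the Location model:

```ruby
# Hypothetical stand-in for the Location model, so the rule can be tried without a database.
Loc = Struct.new(:id, :lat, :lng, :url, :created_at)

# Returns the id of the single row to keep for each distinct lat/lng pair.
def ids_to_keep(locations)
  locations.group_by { |l| [l.lat, l.lng] }.map do |_coords, group|
    with_urls = group.reject { |l| l.url.nil? || l.url.empty? }
    candidates = with_urls.empty? ? group : with_urls
    # the oldest candidate wins
    candidates.min_by(&:created_at).id
  end
end

# Made-up sample data: two duplicate pairs.
rows = [
  Loc.new(1, 52.5, 13.4, '', Time.utc(2010, 1, 1)),
  Loc.new(2, 52.5, 13.4, 'http://example.com', Time.utc(2010, 6, 1)),
  Loc.new(3, 48.1, 11.6, nil, Time.utc(2010, 3, 1)),
  Loc.new(4, 48.1, 11.6, nil, Time.utc(2010, 2, 1))
]
puts ids_to_keep(rows).inspect
```

With ActiveRecord objects, every row whose id is not in the returned list could then be destroyed, as in the final loop above.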

If you have 2 MySQL columns, you can use the CONCAT function.
SELECT * FROM table1 GROUP BY CONCAT(column_lat, column_lng)
If you need to know the total
SELECT COUNT(*) AS total FROM table1 GROUP BY CONCAT(column_lat, column_lng)
Or, you can combine both
SELECT COUNT(*) AS total, table1.* FROM table1
GROUP BY CONCAT(column_lat, column_lng)
But if you can explain more on your question, perhaps we can have more relevant answers.
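One caveat with grouping on CONCAT(column_lat, column_lng): without a separator, two different coordinate pairs can concatenate to the same string, depending on the column types. The effect is easy to reproduce in plain Ruby. The values and helper names below are made up for illustration, and grouping on the two columns separately (GROUP BY column_lat, column_lng) sidesteps the problem:

```ruby
# Mimics CONCAT(lat, lng): the two values are glued together with no separator.
def concat_key(pair)
  pair.map(&:to_s).join
end

# Mimics GROUP BY lat, lng: the pair itself is the grouping key.
def pair_key(pair)
  pair
end

# Made-up coordinate pairs that differ but concatenate identically.
a = [1.2, 34]
b = [1.23, 4]

puts concat_key(a) == concat_key(b) # true, so the two rows would be grouped together
puts pair_key(a) == pair_key(b)     # false, the rows stay separate
```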

Related

Get records with no related data using activerecord and RoR3?

I am making scopes for a model that looks something like this:
class PressRelease < ActiveRecord::Base
has_many :publications
end
What I want to get is all press_releases that do not have publications, but from a scope method, so it can be chained with other scopes. Any ideas?
Thanks!
NOTE: I know that there are methods like present? or any? and so on, but these methods do not return an ActiveRecord::Relation as a scope does.
NOTE: I am using RoR 3
Avoid eager_loading if you do not need it (it adds overhead). Also, there is no need for subselect statements.
scope :without_publications, -> { joins("LEFT OUTER JOIN publications ON publications.press_release_id = press_releases.id").where(publications: { id: nil }) }
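To see why the LEFT OUTER JOIN combined with the publications.id IS NULL condition selects exactly the press releases without publications, here is a small in-memory imitation in plain Ruby. It only sketches the join semantics and is not how ActiveRecord executes the query; the sample hashes and helper names are made up:

```ruby
# Made-up sample rows standing in for the two tables.
PRESS_RELEASES = [{ id: 1 }, { id: 2 }, { id: 3 }]
PUBLICATIONS = [
  { id: 10, press_release_id: 1 },
  { id: 11, press_release_id: 1 },
  { id: 12, press_release_id: 3 }
]

# LEFT OUTER JOIN: every press release appears at least once; when it has no
# publication, the right-hand side is nil (NULL in SQL).
def left_outer_join(releases, publications)
  releases.flat_map do |pr|
    matches = publications.select { |pub| pub[:press_release_id] == pr[:id] }
    matches.empty? ? [[pr, nil]] : matches.map { |pub| [pr, pub] }
  end
end

# WHERE publications.id IS NULL: keep only rows whose right-hand side is missing.
def without_publications(releases, publications)
  joined = left_outer_join(releases, publications)
  joined.select { |_pr, pub| pub.nil? }.map { |pr, _pub| pr[:id] }
end

puts without_publications(PRESS_RELEASES, PUBLICATIONS).inspect
```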
Explanation and response to comments
My initial thought about eager loading overhead was that ActiveRecord would instantiate all the child records (publications) for each press release. Then I realized that the query will never return press release records with publications, so that is a moot point.
There are some points and observations to be made about the way ActiveRecord works. Some things I had previously learned from experience, and some things I learned exploring your question.
The query from includes(:publications).where(publications: {id: nil}) is actually different from my example. It will return all columns from the publications table in addition to the columns from press_releases. The publication columns are completely unnecessary because they will always be null. However, both queries ultimately result in the same set of PressRelease objects.
With the includes method, if you add any sort of limit, for example chaining .first, .last or .limit(), then ActiveRecord (4.2.4) will resort to executing two queries. The first query returns IDs, and the second query uses those IDs to get results. Using the SQL snippet method, ActiveRecord is able to use just one query. Here is an example of this from one of my applications:
Profile.includes(:positions).where(positions: { id: nil }).limit(5)
# SQL (0.8ms) SELECT DISTINCT "profiles"."id" FROM "profiles" LEFT OUTER JOIN "positions" ON "positions"."profile_id" = "profiles"."id" WHERE "positions"."id" IS NULL LIMIT 5
# SQL (0.8ms) SELECT "profiles"."id" AS t0_r0, ..., "positions"."end_year" AS t1_r11 FROM "profiles" LEFT OUTER JOIN "positions" ON "positions"."profile_id" = "profiles"."id" WHERE "positions"."id" IS NULL AND "profiles"."id" IN (107, 24, 7, 78, 89)
Profile.joins("LEFT OUTER JOIN positions ON positions.profile_id = profiles.id").where(positions: { id: nil }).limit(5)
# Profile Load (1.0ms) SELECT "profiles".* FROM "profiles" LEFT OUTER JOIN positions ON positions.profile_id = profiles.id WHERE "positions"."id" IS NULL LIMIT 5
Most importantly
eager_loading and includes were not intended to solve the problem at hand. And for this particular case I think you are much more aware of what is needed than ActiveRecord is. You can therefore make better decisions about how to structure the query.
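You can do the following in your PressRelease model:
scope :your_scope, -> { where('id NOT IN (select press_release_id from publications)') }
This will return all PressRelease records without publications. Note that NOT IN matches nothing if the subquery returns a NULL press_release_id, so this assumes that column is never NULL.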
There are a couple of ways to do this. The first one requires two db queries:
PressRelease.where.not(id: Publication.uniq.pluck(:press_release_id))
or if you don't want to hardcode the association foreign key:
PressRelease.where.not(id: PressRelease.uniq.joins(:publications).pluck(:id))
Another one is to do a left join and pick those without associated elements. You get a relation object, but it will be tricky to work with as it already has a join on it:
PressRelease.eager_load(:publications).where(publications: {id: nil})
Another one is to use the counter_cache feature. You will need to add a publications_count column to your press_releases table.
class Publication < ActiveRecord::Base
belongs_to :press_release, counter_cache: true
end
Rails will keep this column in sync with the number of records associated with the given model, so then you can simply do:
PressRelease.where(publications_count: [nil, 0])

ZF2 how to avoid sql query limit to add quotes in subquery

I'm trying to set up a subquery in Zend Framework 2 and I have an issue with the limit function of a Select object. Whatever I do, the numeric value is put between quotes, which makes my query fail: I should get LIMIT 1 and instead I get LIMIT '1'.
It seems this is not the first time this issue has been encountered; I saw that some people asked about it before (about 8 months ago) without getting a proper answer.
I also saw that this issue was marked as resolved in 2012 (https://github.com/zendframework/zf2/pull/2775), so I really don't understand what's happening here.
Here's my code in ZF2 :
$resultSet = $this->tableGateway->select( function (Select $select) use ($params) {
$sub = new Select();
$sub->from(array('temp' => 'scores'))
->columns(array(new \Zend\Db\Sql\Expression("id AS id")))
->where(array('temp.glitch' => array('None', 'Glitch')))
->where('temp.zone=scores.zone')
->order('temp.multi DESC, temp.score DESC')
->limit(1);
$select->join('players', 'player=players.id', array('player_name' => 'name', 'player_url' => 'name_url'))
->join('countries', 'players.country=countries.id', array('country_name' => 'name', 'country_iso' => 'iso'))
->join('cars', 'car=cars.id', array('car_name' => 'name'), 'left')
->join('zones', 'zone=zones.id', array('zone_name' => 'name'));
$select->where(array('scores.id' => $sub));
$select->order('scores.zone ASC');
print_r($select->getSqlString());
});
This should render the following query (which I get right except LIMIT '1' instead of LIMIT 1) :
SELECT "scores".*, "players"."name" AS "player_name", "players"."name_url" AS "player_url", "countries"."name" AS "country_name", "countries"."iso" AS "country_iso", "cars"."name" AS "car_name", "zones"."name" AS "zone_name"
FROM "scores" INNER JOIN "players" ON "player"="players"."id"
INNER JOIN "countries" ON "players"."country"="countries"."id"
LEFT JOIN "cars" ON "car"="cars"."id"
INNER JOIN "zones" ON "zone"="zones"."id"
WHERE "scores"."id" = (SELECT id AS id FROM "scores" AS "temp" WHERE "temp"."glitch" IN ('None', 'Glitch')
AND temp.zone=scores.zone ORDER BY "temp"."multi" DESC, "temp"."score" DESC LIMIT 1)
ORDER BY "scores"."zone" ASC
Since this doesn't seem to work this way, is there another way I could proceed to get my limit (using Mysql 5 database) ?
EDIT :
Thanks for your help. I finally figured out a way to get what I want and remove the quotes: drop the subquery construction and write the subquery directly in the where function:
$select->where('scores.id = (SELECT id FROM scores AS lookup WHERE lookup.zone = scores.zone ORDER BY multi DESC , score DESC LIMIT 1)');
Although I can continue my development with this, it feels like a poor trick to get around the issue, so I will leave this question unanswered until someone comes up with a real solution.
Then again, there might be no solution at all, since it might be an issue in the ZF2 core itself.
Change the line -
$select->where(array('scores.id' => $sub));
with
$select->where(array('scores.id' => new \Zend\Db\Sql\Expression("({$sub->getSqlString($this->tableGateway->adapter->getPlatform())})"));
Try with just above change.
And if it still doesn't work then make changes to the core Select class file located at -
PROJECT_FOLDER/vendor/zendframework/zendframework/library/Zend/Db/Sql/Select.php
Line No. 921 -
Change $sql = $platform->quoteValue($limit); with $sql = $limit;
Line No. 940 -
Change return array($platform->quoteValue($offset)); with return array($offset);
I came across the issue on GitHub and wondered why it still does not work with the latest ZF2 files. I know the solution given above doesn't look like the proper one, but I had to make it work somehow. I have tried it and it works.
It's only a quick fix until an actual solution is available.

Filtering model with HABTM relationship

I have 2 models - Restaurant and Feature. They are connected via a has_and_belongs_to_many relationship. The gist of it is that you have restaurants with many features like delivery, pizza, sandwiches, salad bar, vegetarian option,… So now when the user wants to filter the restaurants and let's say he checks pizza and delivery, I want to display all the restaurants that have both features; pizza, delivery and maybe some more, but it HAS TO HAVE pizza AND delivery.
If I do a simple .where('features IN (?)', params[:features]) I (of course) get the restaurants that have either one, so pizza or delivery or both, which is not at all what I want.
My SQL/Rails knowledge is kinda limited since I'm new to this but I asked a friend and now I have this huuuge SQL that gets the job done:
Restaurant.find_by_sql(['SELECT restaurant_id FROM (
SELECT features_restaurants.*, ROW_NUMBER() OVER(PARTITION BY restaurants.id ORDER BY features.id) AS rn FROM restaurants
JOIN features_restaurants ON restaurants.id = features_restaurants.restaurant_id
JOIN features ON features_restaurants.feature_id = features.id
WHERE features.id in (?)
) t
WHERE rn = ?', params[:features], params[:features].count])
So my question is: is there a better - more Rails even - way of doing this? How would you do it?
Oh BTW I'm using Rails 4 on Heroku so it's a Postgres DB.
This is an example of a set-within-sets query. I advocate solving these with group by and having, because this provides a general framework.
Here is how this works in your case:
select fr.restaurant_id
from features_restaurants fr join
features f
on fr.feature_id = f.feature_id
group by fr.restaurant_id
having sum(case when f.feature_name = 'pizza' then 1 else 0 end) > 0 and
sum(case when f.feature_name = 'delivery' then 1 else 0 end) > 0
Each condition in the having clause is counting for the presence of one of the features -- "pizza" and "delivery". If both features are present, then you get the restaurant_id.
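The same group-then-filter idea can be traced in plain Ruby. This is only an illustrative sketch of what GROUP BY and HAVING compute, using made-up [restaurant_id, feature_name] pairs in place of the joined tables; `restaurants_with_all` is a hypothetical helper name:

```ruby
# rows:   one [restaurant_id, feature_name] pair per joined row
# wanted: the feature names a restaurant must have, all of them
def restaurants_with_all(rows, wanted)
  # GROUP BY restaurant_id
  grouped = rows.group_by { |restaurant_id, _name| restaurant_id }
  # HAVING: one "count > 0" test per wanted feature, all of which must pass
  matching = grouped.select do |_id, group|
    names = group.map { |_rid, name| name }
    wanted.all? { |w| names.count(w) > 0 }
  end
  matching.keys
end

# Made-up sample data.
rows = [
  [1, 'pizza'], [1, 'delivery'], [1, 'salad bar'],
  [2, 'pizza'],
  [3, 'delivery'], [3, 'pizza']
]
puts restaurants_with_all(rows, ['pizza', 'delivery']).inspect
```

Restaurant 2 drops out because its group contains no 'delivery' row, which is exactly what the second sum(...) > 0 condition checks.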
How much data is in your features table? Is it just a table of ids and names?
If so, and you're willing to do a little denormalization, you can do this much more easily by encoding the features as a text array on restaurant.
With this scheme your queries boil down to
select * from restaurants where restaurants.features @> ARRAY['pizza', 'delivery']
If you want to maintain your features table because it contains useful data, you can store the array of feature ids on the restaurant and do a query like this:
select * from restaurants where restaurants.feature_ids @> ARRAY[5, 17]
If you don't know the ids up front, and want it all in one query, you should be able to do something along these lines:
select * from restaurants where restaurants.feature_ids @> ARRAY(
select id from features where name in ('pizza', 'delivery')
)
That last query might need some more consideration...
Anyways, I've actually got a pretty detailed article written up about Tagging in Postgres and ActiveRecord if you want some more details.
This is not a "copy and paste" solution, but if you follow these steps you will have a fast working query.
Index the feature_name column (I'm assuming that the feature_id column is indexed on both tables).
Place each feature_name param in its own exists():
select r.id
from
restaurants r
where
exists(select true from features_restaurants fr join features f on fr.feature_id = f.feature_id where fr.restaurant_id = r.id and f.feature_name = 'pizza')
and
exists(select true from features_restaurants fr join features f on fr.feature_id = f.feature_id where fr.restaurant_id = r.id and f.feature_name = 'delivery')
Maybe you're looking at it backwards?
Maybe try merging the restaurants returned by each feature.
Simplified:
pizza_restaurants = Feature.find_by_name('pizza').restaurants
delivery_restaurants = Feature.find_by_name('delivery').restaurants
pizza_delivery_restaurants = pizza_restaurants & delivery_restaurants
Obviously, this is a single instance solution. But it illustrates the idea.
UPDATE
Here's a dynamic method to pull in all filters without writing SQL (i.e. the "Railsy" way)
def get_restaurants_by_feature_names(features)
# accepts an array of feature names
restaurants = Restaurant.all
features.each do |f|
feature_restaurants = Feature.find_by_name(f).restaurants
restaurants = feature_restaurants & restaurants
end
return restaurants
end
Since it's an AND condition (OR conditions get dicey with AREL), I reread your stated problem, ignoring the SQL. I think this is what you want.
# in Restaurant
has_many :features
# in Feature
has_many :restaurants
# this is a contrived example. you may be doing something like
# where(name: 'pizza'). I'm just making this condition up. You
# could also make this more DRY by just passing in the name if
# that's what you're doing.
def self.pizza
where(pizza: true)
end
def self.delivery
where(delivery: true)
end
# query
Restaurant.features.pizza.delivery
Basically you call the association with ".features" and then you use the self methods defined on features. Hopefully I didn't misunderstand the original problem.
Cheers!
Restaurant
.joins(:features)
.where(features: {name: ['pizza','delivery']})
.group(:id)
.having('count(features.name) = ?', 2)
This seems to work for me. I tried it with SQLite though.

Query: getting the last record for each member

Given a table ("Table") as follows (sorry about the CSV style since I don't know how to make it look like a table with the Stack Overflow editor):
id,member,data,start,end
1,001,abc,12/1/2012,12/31/2999
2,001,def,1/1/2009,11/30/2012
3,002,ghi,1/1/2009,12/31/2999
4,003,jkl,1/1/2012,10/31/2012
5,003,mno,8/1/2011,12/31/2011
If using Ruby Sequel, how should I write my query so I will get the following dataset in return.
id,member,data,start,end
1,001,abc,12/1/2012,12/31/2999
3,002,ghi,1/1/2009,12/31/2999
4,003,jkl,1/1/2012,10/31/2012
I get the most current (largest end date value) record for EACH (distinct) member from the original table.
I can get the answer if I convert the table to an Array, but I am looking for a solution in SQL or Ruby Sequel query, if possible. Thank you.
Extra credit: The title of this post is lame...but I can't come up with a good one. Please offer a better title if you have one. Thank you.
The Sequel version of this is a bit scary. The best I can figure out is to use a subselect and, because you need to join the table and the subselect on two columns, a "join block" as described in Querying in Sequel. Here's a modified version of Knut's program above:
require 'csv'
require 'sequel'
# Create Test data
DB = Sequel.sqlite()
DB.create_table(:mytable){
field :id
String :member
String :data
String :start # Treat as string to keep it simple
String :end # Ditto
}
CSV.parse(<<xx
1,"001","abc","2012-12-01","2999-12-31"
2,"001","def","2009-01-01","2012-11-30"
3,"002","ghi","2009-01-01","2999-12-31"
4,"003","jkl","2012-01-01","2012-10-31"
5,"003","mno","2011-08-01","2011-12-31"
xx
).each{|x|
DB[:mytable].insert(*x)
}
# That was all setup, here's the query
ds = DB[:mytable]
result = ds.join(ds.select_group(:member).select_append{max(:end).as(:end)}, :member=>:member) do |j, lj, js|
Sequel.expr(Sequel.qualify(j, :end) => Sequel.qualify(lj, :end))
end
puts result.all
This gives you:
{:id=>1, :member=>"001", :data=>"abc", :start=>"2012-12-01", :end=>"2999-12-31"}
{:id=>3, :member=>"002", :data=>"ghi", :start=>"2009-01-01", :end=>"2999-12-31"}
{:id=>4, :member=>"003", :data=>"jkl", :start=>"2012-01-01", :end=>"2012-10-31"}
In this case it's probably easier to replace the last four lines with straight SQL. Something like:
puts DB[
"SELECT a.* from mytable as a
join (SELECT member, max(end) AS end FROM mytable GROUP BY member) as b
on a.member = b.member and a.end=b.end"].all
Which gives you the same result.
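For comparison, the logic of that subselect plus join (find the largest end date per member, then keep the rows that match it) looks like this on plain Ruby objects. It is only a sketch of the technique, using the question's sample data with real dates; `Row`, `end_date` and `latest_per_member` are names made up for the illustration:

```ruby
require 'date'

# Hypothetical in-memory row; end_date stands in for the "end" column.
Row = Struct.new(:id, :member, :data, :start, :end_date)

def latest_per_member(rows)
  # Subselect: SELECT member, max(end) ... GROUP BY member
  max_end = {}
  rows.group_by(&:member).each do |member, group|
    max_end[member] = group.map(&:end_date).max
  end
  # Join back on member and end date
  rows.select { |r| r.end_date == max_end[r.member] }
end

rows = [
  Row.new(1, '001', 'abc', Date.new(2012, 12, 1), Date.new(2999, 12, 31)),
  Row.new(2, '001', 'def', Date.new(2009, 1, 1),  Date.new(2012, 11, 30)),
  Row.new(3, '002', 'ghi', Date.new(2009, 1, 1),  Date.new(2999, 12, 31)),
  Row.new(4, '003', 'jkl', Date.new(2012, 1, 1),  Date.new(2012, 10, 31)),
  Row.new(5, '003', 'mno', Date.new(2011, 8, 1),  Date.new(2011, 12, 31))
]
puts latest_per_member(rows).map(&:id).inspect
```

As with the SQL version, two rows of the same member that share the largest end date would both be returned.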
What's the criteria for your result?
If it is the keys 1,3 and 4 you may use DB[:mytable].filter( :id => [1,3,4]) (complete example below)
For more information about filtering with Sequel, please refer to the Sequel documentation, especially Dataset Filtering.
require 'csv'
require 'sequel'
#Create Test data
DB = Sequel.sqlite()
DB.create_table(:mytable){
field :id
field :member
field :data
field :start #should be date, not implemented in example
field :end #should be date, not implemented in example
}
CSV.parse(<<xx
id,member,data,start,end
1,001,abc,12/1/2012,12/31/2999
2,001,def,1/1/2009,11/30/2012
3,002,ghi,1/1/2009,12/31/2999
4,003,jkl,1/1/2012,10/31/2012
5,003,mno,8/1/2011,12/31/2011
xx
).each{|x|
DB[:mytable].insert(*x)
}
#Create Test data - end -
puts DB[:mytable].filter( :id => [1,3,4]).all
In my opinion, you're approaching the problem from the wrong side. ORMs (and Sequel as well) represent a nice, DSL-ish layer above the database, but, underneath, it's all SQL down there. So, I would try to formulate the question and the answer in a way to get SQL query which would return what you need, and then see how it would translate to Sequel's language.
You need to group by member and get the latest record for each member, right?
I'd go with the following idea (roughly):
SELECT t1.*
FROM table t1
LEFT JOIN table t2 ON t1.member = t2.member AND t2.end > t1.end
WHERE t2.id IS NULL
Now you should see how to perform left joins in Sequel, and you'll need to alias tables as well. Shouldn't be that hard.

Complex subqueries in activerecord

I'm building a Rails app that needs a fairly complex comparison engine, and I'm currently working on a prototype. My query can vary widely, so I have to work with a lot of scopes, but that's not my problem.
My query has to compare candidates. These candidates have answered some tests, and each test belongs to a category. The tests have different maximum values, and I have to be able to compare candidates by category.
So I have to calculate a percentage of good answers. I have to be able to compare candidates in all possible use cases within one category, which means comparing the average good-answer rate over the whole category.
In a nutshell: I have to be able to use subqueries in order to compare candidates, either for a single test or for a category. My problem is writing a subquery that returns the good-answer rate for all the tests a candidate may have passed in a category.
And I have to be able to use this subquery in an order_by or having clause.
How can I construct this subquery? I have no problem handling complex conditional queries with scopes. This has to be a real subquery, because I am working with 6 or 7 models here.
I'm asking for an ActiveRecord way, because this must work with any database supported by Rails.
Excuse my poor English.
Edit :
An example is worth 1000 words, so how could I do something like this:
Sessiontest.find(Candidat.where(:firstname => 'toto'))
This example is stupid, OK. But is it possible to do something like this?
Edit2 :
I saw some posts about AREL. I would like to know if it is possible to do this without a third-party plugin.
Is it possible to do subqueries inside subqueries with AREL? For example, the number of points per test is the sum of the points of all its questions. (Sad, but I have to keep it.) And I need this so that my subquery can calculate my good-answer percentage.
So you get the idea. This has to be really powerful, so I need something powerful and not too error-prone.
Edit3 : I have made some progress, but I can't post an answer for a while.
It seems possible to get this to work without any plugin. I have had some success building subqueries like this:
toto = Candidat.where(:lastname => Candidat.select(:lastname).where(:lastname => "ulysse").limit(1))
The request :
Candidat Load (1.0ms) SELECT "candidats".* FROM "candidats" WHERE "candidats"."lastname" IN (SELECT "candidats"."id" FROM "candidats" WHERE "candidats"."lastname" = 'ulysse' LIMIT 1)
This works and creates a real subquery. I will try some more advanced experiments in order to get to the level I actually need.
I just tried a sub-subquery, and it works wonderfully too.
Edit 5 :
I am trying some more advanced things, and there are a lot of things I still don't understand.
toto = Candidat.where("id = ? / ? ", Sessiontest.select(:id).where(:id => 6), Sessiontest.select(:id).where(:id => 2))
This is just a stupid example in order to get an object with an id of 3. This code works, but not as I expected.
See the SQL:
(1.0ms) SELECT COUNT("sessiontests"."id") FROM "sessiontests" WHERE "sessiontests"."id" = 6
Sessiontest Load (0.0ms) SELECT id FROM "sessiontests" WHERE "sessiontests"."id" = 6
(1.0ms) SELECT COUNT("sessiontests"."id") FROM "sessiontests" WHERE "sessiontests"."id" = 2
Sessiontest Load (1.0ms) SELECT id FROM "sessiontests" WHERE "sessiontests"."id" = 2
Candidat Load (1.0ms) SELECT "candidats".* FROM "candidats" WHERE (id = 6 / 2)
So it does not use subqueries. I tried with .to_sql, but it inserts my SQL this way:
Candidat Load (0.0ms) SELECT "candidats".* FROM "candidats" WHERE (id = 'SELECT id FROM "sessiontests" WHERE "sessiontests"."id" = 6' / 2 )
So ActiveRecord quoted the subquery for security purposes. This is closer to what I wish for, but not really what I want.
This does not work
Candidat.where("id = (?) / ? ", Sessiontest.select(:id).where(:id => 6).to_sql, Sessiontest.select(:id).where(:id => 2))
The quotes prevent the subquery from working.
But this works:
Candidat.where("id = (" + Sessiontest.select(:id).where(:id => 6).to_sql + ") / (" + Sessiontest.select(:id).where(:id => 2).to_sql + ") ")
Candidat Load (1.0ms) SELECT "candidats".* FROM "candidats" WHERE (id = (SELECT id FROM "sessiontests" WHERE "sessiontests"."id" = 6) / (SELECT id FROM "sessiontests" WHERE "sessiontests"."id" = 2) )
But I find this ugly. I will try to get these subqueries working in a more dynamic way, that is, replacing the integer values with column names.
I no longer have the exact answer to this question, because I no longer work at the same company. But the solution to this problem was to use a group_by clause, which made the request really easy.
With a group_by, I was able to handle a category or a technology with ease.
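The original query is not available, so the following is only a guess at the shape of a grouped computation, written in plain Ruby rather than ActiveRecord. It groups test results by candidate and category and derives a good-answer percentage per group. `Result`, its fields and `success_rates` are all assumptions made for the illustration, not the author's schema:

```ruby
# Hypothetical flattened result row: points earned on one test and its maximum.
Result = Struct.new(:candidate, :category, :points, :max_points)

# Good-answer percentage per [candidate, category], rounded to one decimal.
def success_rates(results)
  rates = {}
  results.group_by { |r| [r.candidate, r.category] }.each do |key, group|
    earned = group.map(&:points).inject(0, :+)
    possible = group.map(&:max_points).inject(0, :+)
    rates[key] = (100.0 * earned / possible).round(1)
  end
  rates
end

# Made-up sample data: tests in the same category have different maximum values.
results = [
  Result.new('toto', 'ruby', 8, 10),
  Result.new('toto', 'ruby', 15, 20),
  Result.new('toto', 'sql', 5, 20),
  Result.new('ulysse', 'ruby', 9, 10)
]
success_rates(results).each { |key, rate| puts "#{key.join(' / ')}: #{rate}%" }
```

Summing points and maximums before dividing weights each test by its maximum value; averaging per-test percentages instead would be a different design choice.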