Remove duplicate records based on multiple columns?

I'm using Heroku to host my Ruby on Rails application and for one reason or another, I may have some duplicate rows.
Is there a way to delete duplicate records based on 2 or more criteria but keep just 1 record of that duplicate collection?
In my use case, I have a Make and Model relationship for cars in my database.
Make    Model
----    -----
Name    Name
        Year
        Trim
        MakeId
I'd like to delete all Model records that have the same Name, Year and Trim but keep 1 of those records (meaning, I need the record but only once). I'm using Heroku console so I can run some active record queries easily.
Any suggestions?

class Model
  def self.dedupe
    # find all models and group them on keys which should be common
    grouped = all.group_by { |model| [model.name, model.year, model.trim, model.make_id] }
    grouped.values.each do |duplicates|
      # the first one we want to keep right?
      first_one = duplicates.shift # or pop for last one
      # if there are any more left, they are duplicates
      # so delete all of them
      duplicates.each { |double| double.destroy } # duplicates can now be destroyed
    end
  end
end
Model.dedupe
1) Find all records
2) Group them on the keys you need for uniqueness
3) Loop over the values of the grouped hash
4) Remove the first value of each group, because you want to retain one copy
5) Delete the rest
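The steps above can be tried without touching the database first. This is only a minimal sketch in plain Ruby: Car is a hypothetical Struct standing in for the Model rows, and split_duplicates is a made-up helper name, not part of ActiveRecord. It shows which rows the group_by approach would keep and which it would destroy.

```ruby
# Hypothetical stand-in for Model rows; not an ActiveRecord class.
Car = Struct.new(:id, :name, :year, :trim, :make_id)

# Returns [rows_to_keep, rows_to_destroy] using the same grouping idea
# as Model.dedupe: group on the uniqueness keys, keep the first of each group.
def split_duplicates(rows)
  grouped = rows.group_by { |row| [row.name, row.year, row.trim, row.make_id] }
  keep = grouped.values.map(&:first)
  duplicates = grouped.values.flat_map { |group| group.drop(1) }
  [keep, duplicates]
end

cars = [
  Car.new(1, "Civic", 2012, "LX", 1),
  Car.new(2, "Civic", 2012, "LX", 1), # same keys as id 1
  Car.new(3, "Civic", 2012, "EX", 1),
  Car.new(4, "Civic", 2012, "LX", 1), # same keys as id 1
]

keep, duplicates = split_duplicates(cars)
puts "keep:    #{keep.map(&:id).inspect}"
puts "destroy: #{duplicates.map(&:id).inspect}"
```

With this sample data, ids 1 and 3 are kept and ids 2 and 4 are flagged as duplicates. Checking the list like this before calling destroy on real records is a cheap safety net.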

Suppose your User table data looks like this:
User.all =>
[
#<User id: 15, name: "a", email: "a@gmail.com", created_at: "2013-08-06 08:57:09", updated_at: "2013-08-06 08:57:09">,
#<User id: 16, name: "a1", email: "a@gmail.com", created_at: "2013-08-06 08:57:20", updated_at: "2013-08-06 08:57:20">,
#<User id: 17, name: "b", email: "b@gmail.com", created_at: "2013-08-06 08:57:28", updated_at: "2013-08-06 08:57:28">,
#<User id: 18, name: "b1", email: "b1@gmail.com", created_at: "2013-08-06 08:57:35", updated_at: "2013-08-06 08:57:35">,
#<User id: 19, name: "b11", email: "b1@gmail.com", created_at: "2013-08-06 09:01:30", updated_at: "2013-08-06 09:01:30">,
#<User id: 20, name: "b11", email: "b1@gmail.com", created_at: "2013-08-06 09:07:58", updated_at: "2013-08-06 09:07:58">]
Some email addresses are duplicated. The code below treats rows with the same email and name as duplicates (here ids 19 and 20), so our aim is to remove those from the user table.
Step 1:
Get the lowest id for each distinct combination of email and name.
ids = User.select("MIN(id) as id").group(:email,:name).collect(&:id)
=> [15, 16, 18, 19, 17]
Step 2:
Remove the duplicates, i.e. every row in the user table whose id is not in that list.
Now the ids array holds the following ids.
[15, 16, 18, 19, 17]
User.where("id NOT IN (?)",ids) # To get all duplicate records
User.where("id NOT IN (?)",ids).destroy_all
** RAILS 4 **
ActiveRecord 4 introduces the .not method which allows you to write the following in Step 2:
User.where.not(id: ids).destroy_all

Similar to @Aditya Sanghi's answer, but this way will be more performant because you are only selecting the duplicates, rather than loading every Model object into memory and then iterating over all of them.
# returns only duplicates in the form of [[name1, year1, trim1], [name2, year2, trim2],...]
duplicate_row_values = Model.select('name, year, trim, count(*)').group('name, year, trim').having('count(*) > 1').pluck(:name, :year, :trim)
# load the duplicates, order them however you want, and then destroy all but one
duplicate_row_values.each do |name, year, trim|
  Model.where(name: name, year: year, trim: trim).order(id: :desc)[1..-1].map(&:destroy)
end
Also, if you truly don't want duplicate data in this table, you probably want to add a multi-column unique index to the table, something along the lines of:
add_index :models, [:name, :year, :trim], unique: true, name: 'index_unique_models'

You could try the following: (based on previous answers)
ids = Model.group('name, year, trim').pluck('MIN(id)')
to get all valid records. And then:
Model.where.not(id: ids).destroy_all
to remove the unneeded records. And certainly, you can make a migration that adds a unique index for the three columns so this is enforced at the DB level:
add_index :models, [:name, :year, :trim], unique: true

To run it in a migration, I ended up doing the following (based on the answer above by @aditya-sanghi):
class AddUniqueIndexToXYZ < ActiveRecord::Migration
  def change
    # delete duplicates
    dedupe(XYZ, 'name', 'type')
    add_index :xyz, [:name, :type], unique: true
  end

  def dedupe(model, *key_attrs)
    model.select(key_attrs).group(key_attrs).having('count(*) > 1').each { |duplicates|
      # slice needs the keys splatted, not passed as a single array
      dup_rows = model.where(duplicates.attributes.slice(*key_attrs)).to_a
      # the first one we want to keep right?
      dup_rows.shift
      dup_rows.each { |double| double.destroy } # duplicates can now be destroyed
    }
  end
end

Based on @aditya-sanghi's answer, with a more efficient way to find duplicates using SQL.
Add this to your ApplicationRecord to be able to deduplicate any model:
class ApplicationRecord < ActiveRecord::Base
  # …
  def self.destroy_duplicates_by(*columns)
    groups = select(columns).group(columns).having(Arel.star.count.gt(1))
    groups.each do |duplicates|
      records = where(duplicates.attributes.symbolize_keys.slice(*columns))
      records.offset(1).destroy_all
    end
  end
end
You can then call destroy_duplicates_by to destroy all records (except the first) that have the same values for the given columns. For example:
Model.destroy_duplicates_by(:name, :year, :trim, :make_id)

I chose a slightly safer route (IMHO). I started by getting all the unique records.
ids = Model.where(other_model_id: 1).uniq(&:field).map(&:id)
Then I got all the ids
all_ids = Model.where(other_model_id: 1).map(&:id)
This allows me to do an array subtraction to find the duplicates
dups = all_ids - ids
I then map over the duplicate ids and fetch the model because I want to ensure I have the records I am interested in.
records = dups.map do |id| Model.find(id) end
When I am sure I want to delete, I iterate again to delete.
records.map do |record| record.delete end
When deleting duplicate records on a production system, you want to be very sure you are not deleting important live data, so in this process, I can double-check everything.
So in the case above:
all_ids = Model.all.map(&:id)
uniq_ids = Model.all.group_by do |model|
  [model.name, model.year, model.trim]
end.values.map do |duplicates|
  duplicates.first.id
end
dups = all_ids - uniq_ids
records = dups.map { |id| Model.find(id) }
records.map { |record| record.delete }
or something like this.

If you are on PostgreSQL, you can try this SQL query to remove all duplicate records except the latest one (the row with the highest id in each group):
DELETE FROM models a USING models b WHERE a.name = b.name AND a.year = b.year AND a.trim = b.trim AND a.id < b.id;

Related

What is the fastest, most standard way to "mass query" a set of grandchildren (has_many, has_many) belonging to a single object?

Use case: on this site, users will be able to go on and select rental property for a specific number of days. Users will often be selling the same type of rental property.
Problem: Because multiple "sellers" will be renting out the same exact item, the "property detail page" will have many listings created by many different sellers (or in some cases, a seller will have multiple properties available that fall into the same "property detail page"). Each of these "listing" objects will have many pricing objects, each containing a date, a price, and an availability boolean.
Current models are broken down below:
property.rb
  has_many :listings
  has_many :prices, :through => :listings
listing.rb
  belongs_to :user
  belongs_to :property
  has_many :prices
price.rb
  belongs_to :listing
What I have tried:
If for example, I wanted to obtain the MINIMUM sum of pricing for a specific property, I had jotted down this:
# property.rb
# minimum price for a pricing set out of all of the price objects
def minimum_price(start_date, end_date)
  # this would sum up each day's pricing to give the rental period a final price
  prices = self.prices.where("day <= ?", end_date).where("day >= ?", start_date).sum(:price)
end
When I do it like this, however, it simply combines every single user's prices, giving nothing of use.
Any help would be greatly appreciated! Of course I could loop through a property's listings until I found a minimum price set for a given date range, but that seems as though it would take an unnecessary amount of time and be largely inefficient.
EDIT
An example of data that should be outputted is a set of price objects that are the cheapest ones in a specific date range from ONE particular listing. It can not just combine all of the best priced dates from all of the users and add them as the buyer will be renting from ONE seller.
This is an actual example of desired output, as you can see these prices are ALL from the same listing ID.
[#<Price id: 156, day: "2020-12-01", listing_id: 7, price: 5.0, available: true, created_at: "2020-12-17 14:22:46", updated_at: "2020-12-17 14:22:46">, #<Price id: 157, day: "2020-12-02", listing_id: 7, price: 5.0, available: true, created_at: "2020-12-17 14:22:46", updated_at: "2020-12-17 14:22:46">, #<Price id: 158, day: "2020-12-03", listing_id: 7, price: 5.0, available: true, created_at: "2020-12-17 14:22:46", updated_at: "2020-12-17 14:22:46">, #<Price id: 159, day: "2020-12-04", listing_id: 7, price: 5.0, available: true, created_at: "2020-12-17 14:22:46", updated_at: "2020-12-17 14:22:46">]

So it sounds like you are calling this on a property so:
prices = self.prices.where("prices.day >= ? AND prices.day <= ?", start_date, end_date).group("prices.listing_id").sum(:price)
Calling sum(:price) on its own collapses everything into a single number, so group by listing_id first and let the database do the summing. This should give you a hash with a key for each listing_id whose value is the sum for that listing. I say "should" because this is a bit abstract for me to do without a system to test it on.
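To make the "one total per listing, then pick the cheapest" idea concrete, here is a rough sketch in plain Ruby rather than ActiveRecord. Price is a hypothetical Struct mirroring the columns in the question, and cheapest_listing is a made-up helper name. It also assumes a listing only qualifies if it has an available price row for every day in the range, which the question implies but does not state outright.

```ruby
require "date"

# Hypothetical in-memory stand-in for the prices table in the question.
Price = Struct.new(:day, :listing_id, :price, :available)

# Returns [listing_id, total] for the cheapest listing that is available on
# every day of the range, or nil when no listing covers the whole range.
def cheapest_listing(prices, start_date, end_date)
  days_needed = (end_date - start_date).to_i + 1

  in_range = prices.select do |row|
    row.available && row.day >= start_date && row.day <= end_date
  end

  # Only listings with a price row for every requested day are candidates.
  complete = in_range.group_by(&:listing_id).select do |_listing_id, rows|
    rows.map(&:day).uniq.size == days_needed
  end

  totals = complete.transform_values { |rows| rows.sum(&:price) }
  totals.min_by { |_listing_id, total| total }
end

start_date = Date.new(2020, 12, 1)
end_date   = Date.new(2020, 12, 2)

prices = [
  Price.new(Date.new(2020, 12, 1), 7, 5.0, true),
  Price.new(Date.new(2020, 12, 2), 7, 5.0, true),
  Price.new(Date.new(2020, 12, 1), 8, 4.0, true),
  Price.new(Date.new(2020, 12, 2), 8, 9.0, true),
  Price.new(Date.new(2020, 12, 1), 9, 1.0, true), # covers one day only
]

listing_id, total = cheapest_listing(prices, start_date, end_date)
puts "listing #{listing_id} costs #{total}"
```

Once the winning listing_id is known, the matching price rows can be fetched with a normal where on listing_id and the date range, which gives the single-listing result set the question asks for.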

ruby each loop in order on sql order query result

I have a query in my Controller that works perfectly:
@klasses_mon = Klass.order(:start).where(day: 'MON').find_each
my result is (shown by <%= @klasses_mon.inspect %> in my view):
#<Enumerator: #<ActiveRecord::Relation
[#<Klass id: 9, name: "Cycling", teacher: "Tomek", day: "MON", start: 510, duration: 45>,
#<Klass id: 8, name: "LBT", teacher: "Monia", day: "MON", start: 600, duration: 60>,
#<Klass id: 11, name: "HIIT", teacher: "Aga", day: "MON", start: 930, duration: 45>]>
:find_each({:start=>nil, :finish=>nil, :batch_size=>1000, :error_on_ignore=>nil})>
But I am trying to display it in an each loop, and for some reason it is not ordered anymore. It looks like the each loop does not keep the order from my query result:
<% @klasses_mon.each do |k| %>
  <p><%= k.teacher %>,
  <%= k.name %>
  START: <%= k.start/60 %>:<%= k.start%60 %></p>
<% end %>
result:
Monia, LBT START: 10:0
Tomek, Cycling START: 8:30
Aga, HIIT START: 15:30
How should I do that?

From the fine manual:
find_each(start: nil, finish: nil, batch_size: 1000, error_on_ignore: nil)
[...]
NOTE: It's not possible to set the order. That is automatically set to ascending on the primary key (“id ASC”) to make the batch ordering work. This also means that this method only works when the primary key is orderable (e.g. an integer or string).
So find_each is explicitly documented to ignore any ordering that you try to use.
find_each doesn't use LIMIT and OFFSET to move the batch window through the result set, because that tends to get very expensive as the OFFSET increases. Instead it orders by the primary key, adds an id > last_one condition to the WHERE clause to set the start of the batch, and uses a LIMIT clause to set the batch size. Ordering by the PK and querying on the PK are both generally inexpensive, as is a LIMIT clause.
find_each is the wrong tool for this job. It is meant for batch work, but you're just displaying a short list of records, so you want a simple:
@klasses_mon = Klass.order(:start).where(day: 'MON')
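The batching described above can be imitated in plain Ruby to see why the order changes. This is only a simplified sketch, not the Rails implementation: Klass here is a hypothetical Struct holding the three rows from the question, and rows_in_id_batches is a made-up helper that mimics "WHERE id > last_id ORDER BY id LIMIT batch_size".

```ruby
# Hypothetical in-memory rows copied from the question's output.
Klass = Struct.new(:id, :name, :start)

ROWS = [
  Klass.new(9, "Cycling", 510),
  Klass.new(8, "LBT", 600),
  Klass.new(11, "HIIT", 930),
].freeze

# Walks the rows in primary-key batches: each pass takes the rows whose id
# is greater than the last id seen, ordered by id, limited to batch_size.
def rows_in_id_batches(rows, batch_size)
  result = []
  last_id = nil
  loop do
    batch = rows.select { |row| last_id.nil? || row.id > last_id }
                .sort_by(&:id)
                .first(batch_size)
    break if batch.empty?

    result.concat(batch)
    last_id = batch.last.id
  end
  result
end

puts rows_in_id_batches(ROWS, 2).map(&:start).inspect # id order
puts ROWS.sort_by(&:start).map(&:start).inspect       # start order
```

The first line prints the start times in id order (600, 510, 930), which is the order the view showed, and the second prints them sorted by start. Because the next batch is located by "id greater than the last one", any other ordering would break the walk.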

The method #find_each ignores any scoped order and forces a sort by the primary key (usually id). This is stated in the documentation and is because #find_each needs to make sure that it doesn't repeat any records during iteration.
You can see this in your console if you try:
> @klasses_mon = Klass.order(:start).where(day: 'MON').find_each
> @klasses_mon.map(&:start) # force the relation to execute and return rows.
Scoped order is ignored, it's forced to be batch order.
Klass Load (0.ms) SELECT "klasses".* FROM "klasses" WHERE "klasses"."day" = 'MON' ORDER BY "klasses"."id"
=> [600, 510, 930]
If you're not expecting to run through thousands of rows, you can drop the find_each:
@klasses_mon = Klass.where(day: "MON").order(:start)

Query for all of a specific record where its related resource all have the same value for a single attribute

class User < ActiveRecord::Base
  has_many :memberships
  # included columns
  #   id: integer
end

class Membership < ActiveRecord::Base
  belongs_to :user
  # included columns
  #   user_id: integer
  #   active: boolean
end
I'd like to be able to grab all users where all their memberships have 'active = false' in a single query. So far the best that I've been able to come up with is:
# grab possibles
users = User.joins(:memberships).where('memberships.active = false')
# select the ones that satisfy the condition
users.select { |user| user.memberships.pluck(:active).uniq == [false] }
which is not that great since I have to use ruby to pluck out the valid ones.

This can do the trick:
users_with_active_membership = User.joins(:memberships).where(memberships: { active: true })
users = User.where( 'users.id NOT IN (?)', users_with_active_membership.pluck(:id) )
I am not sure of the result but I expect it to be 2 nested queries, one selecting the User ids having an active membership, the other query to select the User not in this previous ids list.
I can't test it since I don't have an environment with these relationships. Can you try it and post the SQL query generated? (add .to_sql to see it)
Another way, I don't know which could be the most efficient:
User.where( 'users.id NOT IN (?)', Membership.where(active: true).group(:user_id).pluck(:user_id) )

Ruby on Rails finding the number of items in a cart

I have the Rails models for an online store I'm making set up with a cart that has line items. Every time a product is clicked on, a line item is generated with a cart id that matches the cart I create for the user's session (this example comes from the book Agile Web Development with Rails).
I want to count the number of items in a user's cart. What's the best way to do this?
Here's an example of what
li.each do |line|
puts li.to_yaml
end
outputs ....
- !ruby/object:LineItem
attributes:
id: 14
product_id: 81
cart_id: 11
created_at: 2012-06-27 14:10:09.060706000Z
updated_at: 2012-06-27 14:10:09.060706000Z
quantity: 1
---
- !ruby/object:LineItem
attributes:
id: 1
product_id: 2
cart_id: 6
created_at: 2012-06-25 18:29:20.726280000Z
updated_at: 2012-06-25 18:56:08.690670000Z
quantity: 2
- !ruby/object:LineItem
attributes:
id: 2
product_id: 4
cart_id: 6
created_at: 2012-06-25 18:56:10.014333000Z
updated_at: 2012-06-25 18:56:10.014333000Z
quantity: 1
So, I'd want the user with cart_id of 6 to know they have 3 items. Thanks.

Yes there is a better way. How are your models set up? Basically you'd want users to have a cart and a cart belongs_to a user. Also, since a line_item has a cart_id, you can have line_item belongs_to cart and cart has_many line_items.
With these associations, you can easily get what you need like:
cart.line_items.count
user.cart
user.cart.line_items
etc.
You can read up on rails associations here:
http://api.rubyonrails.org/classes/ActiveRecord/Associations/ClassMethods.html

I figured it out using the console
After choosing a cart so that @cart.id = 6 (current_cart = Cart.find(6) since the console didn't set the cart using a browser session)
@count = 0
LineItem.all.each do |item|
  if (item.cart_id == @cart.id)
    @count += item.quantity
  end
end
There must be a better, more railsy way, though...

current_cart.line_items.size
Provided you have the following in cart.rb
has_many :line_items
-edit-
Oops, sorry, you wanted the sum of quantity.
current_cart.line_items.sum('quantity')
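The difference between size and sum('quantity') is easy to check against the sample data from the question. The following is a small plain-Ruby sketch, not ActiveRecord: LineItem is a hypothetical Struct with the relevant columns, and items_in_cart is a made-up helper name.

```ruby
# Hypothetical stand-in for the LineItem rows shown in the question.
LineItem = Struct.new(:id, :product_id, :cart_id, :quantity)

# Adds up the quantities of all line items that belong to one cart.
def items_in_cart(line_items, cart_id)
  line_items.select { |item| item.cart_id == cart_id }.sum(&:quantity)
end

line_items = [
  LineItem.new(14, 81, 11, 1),
  LineItem.new(1, 2, 6, 2),
  LineItem.new(2, 4, 6, 1),
]

rows_for_cart = line_items.count { |item| item.cart_id == 6 }
puts "rows:  #{rows_for_cart}"                # number of line items
puts "items: #{items_in_cart(line_items, 6)}" # total quantity
```

For cart 6 this prints 2 rows but 3 items, which is why the sum of quantity is the number the asker wants.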

Ruby on Rails - How to join two tables?

I have two tables (subjects and pages) in a one-to-many relation. I want to add criteria from subjects as well as pages to build a SQL query, but progress has been very slow and I often run into problems. I'm brand new to Rails, please help.
class Subject < ActiveRecord::Base
has_many :pages
end
class Page < ActiveRecord::Base
belongs_to :subject
end
Sample data in subjects, listing three columns:
id  name    level
--  ------  -----
1   'Math'  1
6   'Math'  2
...
Sample data in pages, listed columns below:
id name subject_id
-- -------------------- ----------
2 Addition 1
4 Subtraction 1
5 Simple Multiplication 6
6 Simple Division 6
7 Hard Multiplication 6
8 Hard Division 6
9 Elementary Divsion 1
Given that I don't know the subject.id, I only know the subject name and level, and page name. Here is the sql I want to generate (or something similar that would achieve the same result):
select subjects.id, subjects.name, pages.id, pages.name from subjects, pages
where subjects.id = pages.subject_id
and subjects.name = 'Math'
and subjects.level = '2'
and pages.name like '%Division' ;
I expect to get two rows in the result:
subjects.id subjects.name pages.id pages.name
----------- ------------- -------- -----------
6 Math 6 Simple Division
6 Math 8 Hard Division
This is a very simple SQL query, but I have not been able to get what I wanted in Rails.
Here is my rails console:
>> subject = Subject.where(:name => 'Math', :level => 2)
Subject Load (0.4ms) SELECT `subjects`.* FROM `subjects` WHERE `subjects`.`name` = 'Math' AND `subjects`.`level` = 2
[#<Subject id: 6, name: "Math", position: 1, visible: true, created_at: "2011-12-17 04:25:54", updated_at: "2011-12-17 04:25:54", level: 2>]
>>
>> subject.joins(:pages).where(['pages.name LIKE ?', '%Division'])
Subject Load (4.2ms) SELECT `subjects`.* FROM `subjects` INNER JOIN `pages` ON `pages`.`subject_id` = `subjects`.`id` WHERE `subjects`.`name` = 'Math' AND `subjects`.`level` = 2 AND (pages.name LIKE '%Division')
[#<Subject id: 6, name: "Math", position: 1, visible: true, created_at: "2011-12-17 04:25:54", updated_at: "2011-12-17 04:25:54", level: 2>, #<Subject id: 6, name: "Math", position: 1, visible: true, created_at: "2011-12-17 04:25:54", updated_at: "2011-12-17 04:25:54", level: 2>]
>>
>> subject.to_sql
"SELECT `subjects`.* FROM `subjects` WHERE `subjects`.`name` = 'Math' AND `subjects`.`level` = 2"
>> subject.size
1
>> subject.class
ActiveRecord::Relation
1st statement: subject = Subject.where(:name => 'Math', :level => 2)
2nd statement: subject.joins(:pages).where(['pages.name LIKE ?', '%Division'])
Questions:
1) The chained SQL really returns two rows, so why does subject.size say only 1?
2) How do I tell it to return columns from :pages as well?
3) Why does subject.to_sql still show the SQL from statement 1 only? Why does it not include the chained SQL from statement 2?
4) Essentially, how do I need to write the statements differently to produce the SQL listed above (or achieve the same result)?
Many thanks.

1) ActiveRecord maps your query results to objects of the class the query was built from, not to arbitrary returned rows. The joined query does return two rows, but because it is based on the Subject class, both come back as the same Subject (id 6), as your console output shows. subject.size says 1 because subject still refers to the relation from the first statement (see point 3).
2) The column data is there, but you are working against what ActiveRecord wants to give you, which is objects. If you would rather have Pages returned, then you need to base the creation of the query on the Page class.
3) You didn't save the result of adding joins(:pages)... back into the subject variable. If you did:
subject = subject.joins(:pages).where(['pages.name LIKE ?', '%Division'])
You would get the full query when running subject.to_sql
4) To get page objects you can do something like this, notice that we are basing it off of the Page class:
pages = Page.joins(:subject).where(['subjects.name = ? AND subjects.level = ? AND pages.name LIKE ?', 'Math', 2, '%Division'])
Then to access the subject name from there for the first Page object returned:
pages[0].subject.name
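Note that joins only adds the JOIN to the SQL and does not load the association, so this call runs one more query for the subject. If you want to avoid that, add includes(:subject) to the chain so the subjects are loaded up front. Hope this helps!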
