I need to process a CSV file with 10,000 records in it. It takes about 4 seconds per 100 records, which I'm fine with.
I want to split the 10,000 records into 100 jobs that will be executed when the server is not busy (no ordering needed).
I'm working with Heroku, and if I can, I'd be happy to distribute the handling across a few nodes.
What's the best practice here? How should I handle the Mongo connection? How do I split the job and create these tasks so that they run in the future?
I don't need a full solution, just some guidance please.
I'd have the same suggestion as Sergio. Try the Sidekiq background worker gem. There's the kiqstand middleware that makes it work with Mongoid.
Rough possible sketch:
# Gemfile
gem 'kiqstand'
gem 'sidekiq'
# Procfile
web: ...
worker: bundle exec sidekiq -e $RACK_ENV -C config/sidekiq.yml
# config/sidekiq.yml
#
# Configuration file for Sidekiq.
# Options here can still be overridden by cmd line args.
---
:verbose: false
:namespace: sidekiq
:concurrency: 25
:queues:
- [often, 7]
- [default, 5]
- [seldom, 3]
# config/initializers/sidekiq.rb
Sidekiq.configure_server do |config|
  # Require kiqstand middleware
  config.server_middleware do |chain|
    chain.add Kiqstand::Middleware
  end
end
# app/workers/import_worker.rb
# The actual Sidekiq worker that performs the import
class ImportWorker
  include Sidekiq::Worker

  def perform(my_var)
    # processing the csv
  end
end
Sidekiq and Kiqstand should handle the Mongoid MongoDB connections. To split the tasks, you could create a second worker that feeds the first one. Because the arguments passed to ImportWorker.perform_async(my_var) are serialized and stored in Redis, they should be small: just a row reference or similar in your case.
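The feeder idea can be sketched in plain Ruby. This is only a rough illustration, assuming each job receives a row offset and a row count rather than the rows themselves; batch_ranges and enqueue_batches are hypothetical names, and in a real app the block would call something like ImportWorker.perform_async(offset, length).

```ruby
# Hypothetical helper: split a row count into small [offset, length] pairs.
# Only these two integers would need to be serialized into Redis per job.
def batch_ranges(total_rows, batch_size)
  raise ArgumentError, "batch_size must be positive" unless batch_size.positive?

  (0...total_rows).step(batch_size).map do |offset|
    [offset, [batch_size, total_rows - offset].min]
  end
end

# Stand-in for the feeder worker: hands each pair to the caller's block,
# which is where the real job would be enqueued.
def enqueue_batches(total_rows, batch_size)
  batch_ranges(total_rows, batch_size).each { |offset, length| yield offset, length }
end

# Collect the arguments here instead of enqueueing, just to show the shape.
enqueued = []
enqueue_batches(10_000, 100) { |offset, length| enqueued << [offset, length] }
```

Each job then reads only its own slice of the file, so the queue stays small regardless of how large the rows are.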
Hope that gives a few pointers.
Also have a look at the smarter_csv gem, which is designed to read CSV files in batches so you can hand them off to a Resque or Sidekiq worker.
https://github.com/tilo/smarter_csv
As I start Redis alongside a bunch of other processes via Foreman, I find its startup output quite verbose.
Redis writes more than twice as many lines to stdout as any other process in my Procfile, mainly because of the ASCII art that gets printed to the log.
Is there a (startup) option to keep the log more concise, for example by turning off the output of the logo?
TLDR: If you have redis version 4.0 or higher you can do redis-server | cat to trick it into thinking it's not running in a tty.
Original answer:
I've had a quick check in the config docs and you shouldn't be seeing this. Can you maybe check your config file and see if you've set always-show-logo to yes?
The comment that accompanies it is as follows:
# By default Redis shows an ASCII art logo only when started to log to the
# standard output and if the standard output is a TTY. Basically this means
# that normally a logo is displayed only in interactive sessions.
#
# However it is possible to force the pre-4.0 behavior and always show a
# ASCII art logo in startup logs by setting the following option to yes.
I guess if you're on a version < 4.0 then that might explain what you're seeing.
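To see why piping through cat changes the outcome, here is a small Ruby illustration of the TTY check. It is a sketch of the rule as described in the config comment, not Redis's actual code; show_logo? and its parameters are hypothetical names.

```ruby
# The write end of a pipe stands in for the stdout of `redis-server | cat`.
reader, writer = IO.pipe
piped_stdout_is_tty = writer.tty? # a pipe is not a terminal
reader.close
writer.close

# Hypothetical restatement of the rule from the config comment:
# show the logo if forced, or if logging to a stdout that is a TTY.
def show_logo?(stdout_is_tty, logging_to_stdout: true, always_show_logo: false)
  always_show_logo || (logging_to_stdout && stdout_is_tty)
end

show_logo?(piped_stdout_is_tty)
```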
Here is the issue/fix from 2014 https://github.com/antirez/redis/issues/1935
I'm developing a Rails 3.2.16 app and deploying to a Heroku dev account with one free web dyno and no worker dynos. I'm trying to determine if a (paid) worker dyno is really needed.
The app sends various emails. I use delayed_job_active_record to queue those and send them out.
I also need to check a notification count every minute. For that I'm using rufus-scheduler.
rufus-scheduler seems able to run a background task/thread within a Heroku web dyno.
On the other hand, everything I can find on delayed_job indicates that it requires a separate worker process. Why? If rufus-scheduler can run a daemon within a web dyno, why can't delayed_job do the same?
I've tested the following for running my every-minute task and working off delayed_jobs, and it seems to work within the single Heroku web dyno:
config/initializers/rufus-scheduler.rb
require 'rufus-scheduler'
require 'delayed/command'
s = Rufus::Scheduler.singleton
s.every '1m', :overlap => false do # Every minute
  Rails.logger.info ">> #{Time.now}: rufus-scheduler task started"
  # Check for pending notifications and queue to delayed_job
  User.send_pending_notifications
  # work off delayed_jobs without a separate worker process
  Delayed::Worker.new.work_off
end
This seems so obvious that I'm wondering if I'm missing something? Is this an acceptable way to handle the delayed_job queue without the added complexity and expense of a separate worker process?
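For context, :overlap => false asks rufus-scheduler to skip a run while the previous one is still going, which could matter here if work_off ever takes longer than a minute. The idea can be illustrated in plain Ruby; this is only a sketch of the concept, not rufus-scheduler's implementation, and NonOverlappingTask is a hypothetical name.

```ruby
# Illustration only: a task that refuses to start while it is already running.
class NonOverlappingTask
  def initialize(&block)
    @block = block
    @lock = Mutex.new
  end

  # Runs the block unless a previous run still holds the lock.
  # Returns true if the block ran, false if this run was skipped.
  def call
    return false unless @lock.try_lock

    begin
      @block.call
      true
    ensure
      @lock.unlock
    end
  end
end

runs = 0
task = NonOverlappingTask.new { runs += 1 }
task.call
```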
Update
As @jmettraux points out, Heroku will idle an inactive web dyno after an hour. I haven't set it up yet, but let's assume I'm using one of the various keep-alive methods to keep it from sleeping (see "Easy way to prevent Heroku idling?").
According to this
https://blog.heroku.com/archives/2013/6/20/app_sleeping_on_heroku
your dyno will go to sleep if it hasn't serviced requests for an hour. No dyno, no scheduling.
This could help as well: https://devcenter.heroku.com/articles/clock-processes-ruby
I was doing the failover testing of mongodb on my local environment. I have two mongo servers(hostname1, hostname2) and an arbiter.
I have the following configuration in my mongoid.yml file
localhost:
  hosts:
    - - hostname1
      - 27017
    - - hostname2
      - 27017
  database: myApp_development
  read: :primary
  use_activesupport_time_zone: true
Now when I start my Rails application, everything works fine, and the data is read from the primary (hostname1). Then I kill the mongo process of the primary (hostname1), so the secondary (hostname2) becomes the primary and starts serving the data.
After some time I start the mongo process of hostname1 again, and it becomes the secondary in the replica set.
Now the primary (hostname2) and secondary (hostname1) are working all right.
The real problem starts here.
I kill the mongo process of my new primary (hostname2), but this time the secondary (hostname1) does not become the primary, and any further request to the Rails application raises the following error
Cannot connect to a replica set using seeds hostname2
Please help. Thanks in advance.
** UPDATE: **
I added some logging to the mongo repl_connection class, and came across this.
When I boot the Rails app, both hosts are in the seeds array that the mongo driver keeps track of. But during the second failover, only the host that went down is present in this array.
Hence I would also like to know how and when one of the hosts gets removed from the seed list.
I use the delayed_job gem to handle my email deliveries. It works fine in development and I am very happy with it. However, after I deployed to the server, when I run the command:
RAILS_ENV=production script/delayed_job start
it works. I've checked the log file and the database; everything is fine and I receive the mails just as expected. However, once I log out of the server, nothing happens anymore.
I've checked my database with Sequel Pro and seen that delayed_job creates a row in the DB, and after the time in the run_at column the row disappears, but no mail is received. When I log in again, the delayed_job process is still running and there is nothing strange in the log, but I still don't receive the email I'm supposed to. I can't stay logged in all the time. Without delayed_job, I can send mail the traditional way, which works properly but is slow. Why does delayed_job fail after I log out of the server?
This is my delayed job setting in the config/initializers/delay_job.rb
require "bcrypt"
Delayed::Worker.max_attempts = 5
Delayed::Worker.delay_jobs = !Rails.env.test?
Delayed::Worker.destroy_failed_jobs = false
P.S. I am not sure whether this has anything to do with the standalone Passenger: because I have to use a different version of Rails, I run a standalone Passenger on port 3002.
I think I've found the solution.
After reading through this https://github.com/collectiveidea/delayed_job/wiki/Common-problems#wiki-jobs_are_silently_removed_from_the_database
I soon realized I might be missing the "require bcrypt" in the configuration file.
I use RVM and have many gemsets, but only this particular gemset has the bcrypt-ruby gem. delayed_job might use the global or default gemset after I log out of the system, so I installed bcrypt-ruby in all the gemsets and restarted the standalone Passenger, and it works!
But I still don't really know the connection between bcrypt and delayed_job.
We have a rails 3.2(.11) app with many dynos running on the heroku bamboo stack, connecting to a MySQL RDS server. There seem to be some issues with our current database connections, so we are trying to debug exactly how many connections each dyno is spinning up. I know I can set the size of a connection pool in the DATABASE_URL config on heroku, but can't seem to find out how many connections are currently being used by default.
Two main questions:
1) How can I find the size of the connection pool used by heroku?
2) Is there any reason why a dyno would need a connection pool size greater than 1? My understanding is that rails can only execute 1 request at a time so one database connection should be all that is needed as far as I can see.
To check the pool size, start a Heroku console with heroku run rails c and run:
ActiveRecord::Base.connection_pool.size
Some webservers such as Puma are multithreaded, so the DB pool size matters. You can also run a multi-threaded worker such as Sidekiq, which also will be affected by the pool size.
Note that Heroku will ignore your database.yml file. To set the pool size you can append ?pool=25 to the DATABASE_URL in your Heroku app's configuration.
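If you'd rather build the new URL programmatically than edit it by hand, something like the following might do. It is a minimal sketch using Ruby's standard uri library; with_pool is a hypothetical helper and the example URL is made up.

```ruby
require 'uri'

# Hypothetical helper: returns the URL with its pool query parameter set to
# the given size, replacing any pool value that is already present.
def with_pool(database_url, size)
  uri = URI.parse(database_url)
  params = URI.decode_www_form(uri.query || '').reject { |key, _| key == 'pool' }
  params << ['pool', size.to_s]
  uri.query = URI.encode_www_form(params)
  uri.to_s
end

# Made-up URL for illustration; the real one comes from the Heroku config.
new_url = with_pool('mysql2://user:secret@db.example.com/myapp', 25)
```

The result would then be written back to the app's DATABASE_URL config var.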
This information is available via an interface in Rails (https://github.com/rails/rails/blob/master/activerecord/lib/active_record/connection_handling.rb#L98-L106); it is present in Rails 3+:
ActiveRecord::Base.connection_config
# => {:adapter=>"postgresql", :encoding=>"utf8", :pool=>5, :host=>"localhost", :database=>"triage_development"}
You can use this to get the current pool size without needing to eval or rely on the existence of an unexposed instance variable. However, in Rails 3 it may return nil if the pool size hasn't been explicitly set:
ActiveRecord::Base.connection_config[:pool]
# => 5
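Because the :pool entry can be nil, a small fallback may be handy. This is a minimal sketch that assumes a plain config hash like the one shown above; effective_pool_size and DEFAULT_POOL_SIZE are hypothetical names, and 5 is the size ActiveRecord falls back to when no pool is configured.

```ruby
# ActiveRecord's fallback pool size when :pool is not set.
DEFAULT_POOL_SIZE = 5

# Hypothetical helper: reads :pool from a connection config hash.
# Values coming from a URL may be strings, so they are converted to integers.
def effective_pool_size(config)
  value = config[:pool]
  value.nil? ? DEFAULT_POOL_SIZE : Integer(value)
end

# In a Rails console this would be called with ActiveRecord::Base.connection_config.
effective_pool_size({ adapter: 'postgresql', pool: '25' })
```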