how to update Rails.cache TTL on read/fetch - ruby-on-rails-3

I would like to keep my objects in the Rails cache so long as there is a read within some interval (say 10 minutes). I can successfully set a TTL on cache object creation with:
Rails.cache.fetch('key', expires_in: 10.minutes) do
  some_expensive_operation
end
But I notice that subsequent cache reads for 'key' do not up the TTL (at least not in my setup, which is Rails 3.2 + Redis).
Is there a way to have Rails.cache.{fetch,read} re-up the TTL for cache hits?
My alternative is to do the following, which seems potentially wasteful:
result = Rails.cache.read('key') || some_expensive_operation
Rails.cache.write('key', result, expires_in: 10.minutes)

Here's a redis-specific solution I came up with:
def fetch(key, ttl_seconds)
  cached_value = $redis.get(key)
  if cached_value
    # re-up the TTL
    $redis.expire(key, ttl_seconds)
    result = Marshal.load(cached_value)
  else
    result = yield
    $redis.setex(key, ttl_seconds, Marshal.dump(result))
  end
  result
end

fetch('key', 10.minutes) do
  some_expensive_operation
end
I'd prefer something that's cache-provider neutral, but that just doesn't seem to be part of the Rails caching API, unless I've missed something.
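For what it's worth, the read-then-write pattern above can at least be wrapped in a small store-agnostic helper. This is only a sketch (the helper name is made up), and it still pays an extra cache write on every hit, which is exactly the waste described above:

def fetch_with_sliding_ttl(key, ttl = 10.minutes)
  result = Rails.cache.read(key)
  result = yield if result.nil?
  # re-write on every access so the entry only expires after `ttl` of inactivity
  Rails.cache.write(key, result, expires_in: ttl)
  result
end

fetch_with_sliding_ttl('key') { some_expensive_operation }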

As I understand it, you want to update the TTL on fetch just so that the cache stores only frequently used records. I guess the reason is cache size.
Just don't do that! Let the cache engine decide which records to keep. If you're using Memcached, it does this job out of the box; you never need to bother deleting old records. If you're using Redis, you need to configure it as an LRU cache (give it a maxmemory limit and a maxmemory-policy such as allkeys-lru).

Related

Is this redis lua script that deals with key expire race conditions a pure function?

I've been playing around with Redis to keep track of the rate limit of an external API in a distributed system. I've decided to create a key for each route where a limit is present. The value of the key is how many requests I can still make until the limit resets, and the reset is handled by setting the TTL of the key to the moment the limit resets.
For that I wrote the following Lua script:
if redis.call("EXISTS", KEYS[1]) == 1 then
local remaining = redis.call("DECR", KEYS[1])
if remaining < 0 then
local pttl = redis.call("PTTL", KEYS[1])
if pttl > 0 then
--[[
-- We would exceed the limit if we were to do a call now, so let's send back that a limit exists (1)
-- Also let's send back how much we would have exceeded the ratelimit if we were to ignore it (ramaning)
-- and how long we need to wait in ms untill we can try again (pttl)
]]
return {1, remaining, pttl}
elseif pttl == -1 then
-- The key expired the instant after we checked that it existed, so delete it and say there is no ratelimit
redis.call("DEL", KEYS[1])
return {0}
elseif pttl == -2 then
-- The key expired the instant after we decreased it by one. So let's just send back that there is no limit
return {0}
end
else
-- Great we have a ratelimit, but we did not exceed it yet.
return {1, remaining}
end
else
return {0}
end
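For context, a script like this is invoked with the rate-limit key as KEYS[1]. Here is a hedged Ruby sketch using redis-rb; the key name, counter value, reset window and file name are made-up illustrative values:

require 'redis'

redis = Redis.new
RATELIMIT_SCRIPT = File.read('ratelimit.lua')  # hypothetical file holding the script above

# Seed the counter when the API reports its limit: e.g. 100 calls left,
# window resets in 60 seconds.
redis.set('ratelimit:some_route', 100, px: 60_000)

# Before each outgoing call, ask the script whether we may proceed.
limited, remaining, wait_ms = redis.eval(RATELIMIT_SCRIPT, keys: ['ratelimit:some_route'])
sleep(wait_ms / 1000.0) if limited == 1 && wait_ms  # over the limit: back off until the window resets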
Since a watched key can expire in the middle of a MULTI transaction without aborting it, I assume the same is the case for Lua scripts. Therefore I put in the cases for when the TTL is -1 or -2.
After I wrote that script I looked a bit more in depth at the EVAL command page and found out that a Lua script has to be a pure function.
In there it says:
The script must always evaluate the same Redis write commands with the same arguments given the same input data set. Operations performed by the script cannot depend on any hidden (non-explicit) information or state that may change as script execution proceeds or between different executions of the script, nor can it depend on any external input from I/O devices.
With this description I'm not sure if my function is a pure function or not.
After Itamar's answer I wanted to confirm that for myself, so I wrote a little Lua script to test it. The script creates a key with a 10 ms TTL and checks the TTL until it's less than 0:
redis.call("SET", KEYS[1], "someVal","PX", 10)
local tmp = redis.call("PTTL", KEYS[1])
while tmp >= 0
do
tmp = redis.call("PTTL", KEYS[1])
redis.log(redis.LOG_WARNING, "PTTL:" .. tmp)
end
return 0
When I ran this script it never terminated. It just went on spamming my logs until I killed the Redis server. However, time doesn't stand still while the script runs; the TTL keeps counting down, it just stops once it reaches 0.
So the key ages, it just never expires.
Since a watched key can expire in the middle of a MULTI transaction without aborting it, I assume the same is the case for Lua scripts. Therefore I put in the cases for when the TTL is -1 or -2.
AFAIR that isn't the case w/ Lua scripts - time kinda stops (in terms of TTL at least) when the script's running.
With this description I'm not sure if my function is a pure function or not.
Your script's great (without actually trying to understand what it does), don't worry :)

Using multiple threads for DB updates results in higher write time per update

So I have a script that is supposed to update a giant table (Postgres). Since the table has about 150m rows and I want to complete this as fast as possible, using multiple threads seemed like a perfect answer. However, I'm seeing something very weird.
When I use a single thread, the write time to an update is much much lower than when I use multiple threads.
require 'sequel'
.....
DB = Sequel.connect(DB_CREDS)
queue = Queue.new

read_query = DB["
  SELECT id, extra_fields
  FROM objects
  WHERE XYZ IS FALSE
"]

read_query.use_cursor(:rows_per_fetch => 1000).each do |row|
  queue.push(row)
end
Up until this point, IMO it shouldn't matter, because we're just reading stuff from the DB and it has nothing to do with writing. From here, I've tried two approaches: Single-threaded and Multi-threaded.
NOTE - This is not the actual UPDATE query that I want to execute, it's just a pseudo one for demonstration purposes. The actual query is a lot longer and plays with JSON and stuff so I can't really update the entire table using a single query.
Single-threaded
until queue.empty?
  photo = queue.shift
  id = photo[:id]
  update_query = DB["
    UPDATE objects
    SET XYZ = TRUE
    WHERE id = #{id}
  "]
  result = update_query.update
end
If I execute this, I see in my DB logs that each update query takes less than 0.01 seconds:
I, [2016-08-15T10:45:48.095324 #54495] INFO -- : (0.001441s) UPDATE objects SET XYZ = TRUE WHERE id = 84395179
I, [2016-08-15T10:45:48.103818 #54495] INFO -- : (0.008331s) UPDATE objects SET XYZ = TRUE WHERE id = 84395181
I, [2016-08-15T10:45:48.106741 #54495] INFO -- : (0.002743s) UPDATE objects SET XYZ = TRUE WHERE id = 84395182
Multi-threaded
MAX_THREADS = 5
num_threads = 0
all_threads = []

until queue.empty?
  if num_threads < MAX_THREADS
    photo = queue.shift
    num_threads += 1
    all_threads << Thread.new {
      id = photo[:id]
      update_query = DB["
        UPDATE objects
        SET XYZ = TRUE
        WHERE id = #{id}
      "]
      result = update_query.update
      num_threads -= 1
      Thread.exit
    }
  end
end

all_threads.each do |thread|
  thread.join
end
Now, in theory it should be faster, right? But each update takes about 0.5 seconds. I'm surprised that this is the case.
I, [2016-08-15T11:02:10.992156 #54583] INFO -- : (0.414288s) UPDATE objects SET XYZ = TRUE WHERE id = 119498834
I, [2016-08-15T11:02:11.097004 #54583] INFO -- : (0.622775s) UPDATE objects SET XYZ = TRUE WHERE id = 119498641
I, [2016-08-15T11:02:11.097074 #54583] INFO -- : (0.415521s) UPDATE objects SET XYZ = TRUE WHERE id = 119498826
Any ideas on:
Why is this happening?
How can I increase the update speed for the multi-threaded approach?
Have you configured Sequel so that it has a connection pool of 5 connections?
Have you considered doing multiple updates per call via an IN clause?
If you haven't done 1, you have N threads fighting over N-n connections, which equates to resource starvation, which is a classic concurrency issue.
Your example can be reduced to: DB[:objects].where(:XYZ=>false).update(:XYZ=>true)
I'm guessing your actual need is not that simple. But the same approach may still work. Instead of issuing a query per row, use a single query to update all related rows.
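For reference, a hedged sketch of both suggestions with Sequel, reusing DB_CREDS and the pseudo objects/XYZ naming from the question (the ids are illustrative values taken from the logs above):

# size the connection pool to match the number of worker threads
DB = Sequel.connect(DB_CREDS, :max_connections => 5)

# update many rows per statement instead of one statement per row;
# Sequel turns the id array into an IN clause
ids = [84395179, 84395181, 84395182]
DB[:objects].where(:id => ids).update(:XYZ => true)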
I went through something similar on a project ("import all history from a legacy database into a new one with completely different structure and organization"). Unless you managed to shoot yourself in the foot somewhere else, you have 2 basic bottlenecks to look for:
the database's disk IO
the ruby process' CPU
Some suggestions:
database IO: use DB transactions and update around 1000 records per transaction (you can tweak the exact number, but 1000 is usually good). A huge DB table usually means a lot of indexes too; every couple of update actions will trigger REINDEX and AUTOVACUUM work inside the DB, which results in a significant drop in update speed. A transaction basically allows you to push 1000 updated records without REINDEX and AUTOVACUUM and then perform both actions once, and the result is MUCH faster (something like an order of magnitude). See the sketch after this list.
database IO: change indexes; drop every index you can live without during the update process. Ideally you will have only one very streamlined index that allows unique row lookups for update purposes.
ruby CPU: unless you are using JRuby or Rubinius, or are REALLY paying the price of network latency to your DB, threads will give you no big benefit; use fork/processes instead (see GIL). You did a great job choosing Sequel over AR for this.
ruby CPU: if you decide to go threads + JRuby with this, don't forget to plug in JProfiler; it's amazing at tracing bottlenecks in Java, and the author of Sidekiq swears it is amazing for JRuby too. Unfortunately, AFAIK, there is no equivalent of JProfiler for C Ruby (there are profiling tools, but nowhere near as useful).
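A rough Sequel sketch of the transaction batching idea, reusing the question's queue and the pseudo objects/XYZ schema; the batch size and the per-row update are placeholders for the real JSON-heavy query:

BATCH_SIZE = 1000

rows = []
rows << queue.shift until queue.empty?

rows.each_slice(BATCH_SIZE) do |batch|
  DB.transaction do
    batch.each do |photo|
      # the real update is more complex (JSON manipulation etc.), so it stays per-row here
      DB[:objects].where(:id => photo[:id]).update(:XYZ => true)
    end
  end
end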
After you implement these suggestions you know you did all you could when:
all of the CPUs on the Ruby box are on 100% load
the hard disk IO of the DB is on 100% throughput
Find this sweet spot, don't add additional Ruby update threads/processes after that (or add more hardware), and that's that.
PS check out https://github.com/ruby-concurrency/concurrent-ruby - it's a great parallelization lib

How redis pipe-lining works in pyredis?

I am trying to understand how pipelining in Redis works. According to one blog I read, for this code:
Pipeline pipeline = jedis.pipelined();
long start = System.currentTimeMillis();
for (int i = 0; i < 100000; i++) {
  pipeline.set("" + i, "" + i);
}
List<Object> results = pipeline.execute();
Every call to pipeline.set() effectively sends the SET command to Redis (you can easily see this by setting a breakpoint inside the loop and querying Redis with redis-cli). The call to pipeline.execute() is when the reading of all the pending responses happens.
So basically, according to that, when we use pipelining and execute a command like set above, the command gets executed on the server right away, but we don't collect the responses until we call pipeline.execute().
However, according to the documentation of pyredis,
Pipelines are a subclass of the base Redis class that provide support for buffering multiple commands to the server in a single request.
I think this implies that when we use pipelining, all the commands are buffered and only sent to the server when we call pipe.execute(), so this behaviour is different from the behaviour described above.
Could someone please tell me what the right behaviour is when using pyredis?
This is not just a redis-py thing. In Redis, pipelining always means buffering a set of commands and then sending them to the server all at once. The main point of pipelining is to avoid extraneous network round trips, which are frequently the bottleneck when running commands against Redis. If each command were sent to Redis before the pipeline was run, this would not be the case.
You can test this in practice. Open up python and:
import redis
r = redis.Redis()
p = r.pipeline()
p.set('blah', 'foo') # this buffers the command. it is not yet run.
r.get('blah') # pipeline hasn't been run, so this returns nothing.
p.execute()
r.get('blah') # now that we've run the pipeline, this returns "foo".
I did run the test that you described from the blog, and I could not reproduce the behaviour.
Setting breakpoints in the for loop, and running
redis-cli info | grep keys
does not show the size increasing after every set command.
Speaking of which, the code you pasted seems to be Java using Jedis (which I also used).
And in the test I ran, and according to the documentation, there is no execute() method in Jedis, but there are exec() and sync() methods.
I did see the values being set in redis after the sync() command.
Besides, this question is about pyredis, so the pyredis documentation seems to be the thing to go with.
Finally, the Redis documentation itself focuses on the networking optimization (quoting the example):
This time we are not paying the cost of RTT for every call, but just one time for the three commands.
P.S. Could you get the link to the blog you read?

Is it possible to set dynamic download delay in scrapy?

I know that a constant delay can be set in
settings.py
DOWNLOAD_DELAY = 2
However, if I set the delay to 2 s it is not efficient enough, and if I set DOWNLOAD_DELAY = 0 the crawler is able to crawl about 10 pages; after that, the target page returns something like "you are requesting too frequently".
What I want to do is keep download_delay at 0, and once the "requesting too frequently" message is found in the HTML, change the delay to 2 s. After a while, switch it back to zero.
Is there any module that can do this, or any better idea to handle such a case?
Update:
I found that there is an extension called AutoThrottle, but can it be customized with logic like this?
if (requesting too frequently) is found
    increase the DOWNLOAD_DELAY
If, right after you get the anti-spider page, you can get a data page within 2 seconds, then what you are asking probably requires writing a downloader middleware that checks for the anti-spider page, moves all scheduled requests to a renew-queue, and starts a looping call when the spider is idle to take requests from the renew-queue (the looping interval is your hack for a new download delay). It then tries to decide when the download delay is no longer necessary (this requires some tests), stops the looping, and reschedules all the requests in the renew-queue back to the Scrapy scheduler. You will need to use a Redis-backed queue in the case of a distributed crawl.
With the download delay set to 0, in my experience throughput can easily go above 1000 items/min. If the anti-spider page pops up after 10 responses, then it is not worth the effort.
Instead, maybe you can try to find out how fast your target server allows you to go: maybe 1.5 s, 1 s, 0.7 s, 0.5 s, etc. Then maybe redesign your product to take into consideration the throughput your crawler can achieve.
You can use the AutoThrottle extension now. It is turned off by default. You can add these parameters to your project's settings.py file to enable it:
AUTOTHROTTLE_ENABLED = True
# The initial download delay
AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
AUTOTHROTTLE_MAX_DELAY = 300
# The average number of requests Scrapy should be sending in parallel to
# each remote server
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
AUTOTHROTTLE_DEBUG = True
Yes, you can use the time module to set a dynamic delay.

import time

for i in range(10):
    # *** operation 1 ***
    time.sleep(i)
    # *** operation 2 ***

Now you can see the delay between operation 1 and operation 2.
Note: the variable i is the delay in seconds.

Rails 3.2.2 log files unordered, requests intertwined

I recollect getting log files that were nicely ordered, so that you could follow one request, then the next, and so on.
Now, the log files are, as my 4 year old says "all scroggled up", meaning that they are no longer separate, distinct chunks of text. Loggings from two requests get intertwined/mixed up.
For instance:
Started GET /foobar
...
Completed 200 OK in 2ms (Views: 0.4ms | ActiveRecord: 0.8ms)
Patient Load (wait, that's from another request that has nothing to do with foobar!)
[ blank space ]
Something else
This is maddening, because I can't tell what's happening within one single request.
This is running on Passenger.
I tried to search for an answer to the same problem but couldn't find any good info. I'm not sure whether you should fix the server or the Rails code.
If you want more info about the issue, here is the commit that removed the old way of logging: https://github.com/rails/rails/commit/04ef93dae6d9cec616973c1110a33894ad4ba6ed
If you value production log readability over everything else you can use the
PassengerMaxInstancesPerApp 1
configuration. It might cause some scaling issues. Alternatively you could stuff something like this in application.rb:
process_log_filename = Rails.root + "log/#{Rails.env}-#{Process.pid}.log"
log_file = File.open(process_log_filename, 'a')
Rails.logger = ActiveSupport::BufferedLogger.new(log_file)
Yep! They have made some changes to ActiveSupport::BufferedLogger so it no longer waits until the request has ended to flush the logs:
http://news.ycombinator.com/item?id=4483390
https://github.com/rails/rails/commit/04ef93dae6d9cec616973c1110a33894ad4ba6ed
But they have added ActiveSupport::TaggedLogging, which is quite handy: you can stamp every log line with any kind of mark you want.
In your case it could be good to stamp the logs with the request UUID, like this:
# config/application.rb
config.log_tags = [:uuid]
Then even if the log lines are interleaved, you can still tell which of them correspond to the request you are following up on.
You can do more useful things with this feature to help you analyse your logs:
How to log user_name in Rails?
http://zogovic.com/post/21138929607/running-time-in-rails-logs
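For example, log_tags entries can also be lambdas that receive the request object, so (as a small illustrative sketch) you could additionally stamp every line with the client IP:

# config/application.rb
config.log_tags = [:uuid, ->(request) { request.remote_ip }]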
Well, for me the TaggedLogging solution is a no-go. I can live with some logs getting lost if the server crashes badly, but I want my logs to be perfectly ordered. So, following advice from the issue comments, I'm applying this to my app:
# lib/sequential_logs.rb
module ActiveSupport
  class BufferedLogger
    def flush
      @log_dest.flush
    end

    def respond_to?(method, include_private = false)
      super
    end
  end
end

# config/initializers/sequential_logs.rb
require 'sequential_logs.rb'
Rails.logger.instance_variable_get(:@logger).instance_variable_get(:@log_dest).sync = false
As far as I can tell this hasn't affected my app; it is still running and now my logs make sense again.
They should add some quasi-random request ID and write it on every line belonging to a single request. That way you wouldn't get confused.
I haven't used it, but I believe Lumberjack's unit_of_work method may be what you're looking for. You call:
Lumberjack.unit_of_work do
  yield
end
And all logging done either in that block or in the yielded block is tagged with a unique ID.