I have a Django application that sometimes needs to make millions of API requests. To make it faster, I'm using Celery to send the requests and wait for the responses before consuming them.
I have tried setting up Celery with Redis as a broker and a backend:
CELERY_BROKER_URL = 'redis://localhost:6379'
CELERY_RESULT_BACKEND = 'redis://localhost:6379'
CELERY_ACCEPT_CONTENT = ['application/json']
CELERY_RESULT_SERIALIZER = 'json'
CELERY_TASK_SERIALIZER = 'json'
That gave me an error: BacklogLimitExceeded: 54c6d0ce-318d-461b-b942-5edcd258b5f1
Then I changed to a RabbitMQ broker and the RPC backend:
CELERY_BROKER_URL = 'amqp://guest@localhost//'
CELERY_RESULT_BACKEND = 'rpc://'
CELERY_ACCEPT_CONTENT = ['application/json']
CELERY_RESULT_SERIALIZER = 'json'
CELERY_TASK_SERIALIZER = 'json'
This gave the same error.
My code for the API call is fairly standard:
I have a list of urls_chunks = [[url1, url2....url1000], [url1, url2....url1000]]
This is needed because of the API rate limit of 1000 per minute (at the end of each call group I sleep 1 minute)
for urls in urls_chunks:
    returned_data = []
    for url in urls:
        result = call_api.delay(url)
        returned_data.append(result)
    for response in returned_data:
        result = response.result
        ## Do something with the result
    sleep(60)
I think all the URL calls and responses are being cached somewhere and exceeding memory, but I thought that wouldn't happen with RPC. I thought of using purge(), but that does not work either, as it is not supported by the RPC backend...
Does anyone know how to deal with this? I'm currently running in a dev environment on macOS, with the intention of deploying to Ubuntu.
Thanks!
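A minimal sketch of the chunk-then-wait loop described above, using concurrent.futures as a stand-in for Celery so that it runs without a broker. call_api here is a hypothetical stub, and process_in_chunks, handle, pause_seconds and max_workers are illustrative names, not part of the original code. The point it shows is that every result of a chunk is fetched and handled before the next chunk is submitted, so nothing accumulates between chunks. With Celery, the analogous step would presumably be calling get() on each AsyncResult, but that is an assumption and not something confirmed in this thread.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def call_api(url):
    """Hypothetical stand-in for the real API call (no network access here)."""
    return {"url": url, "status": 200}


def process_in_chunks(urls_chunks, handle, pause_seconds=60, max_workers=8):
    """Submit one chunk, consume every result, then pause before the next chunk."""
    handled = 0
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for index, urls in enumerate(urls_chunks):
            futures = [pool.submit(call_api, url) for url in urls]
            for future in as_completed(futures):
                # result() blocks until that call has finished
                handle(future.result())
                handled += 1
            # Drop the list so finished results are not kept around
            del futures
            if index < len(urls_chunks) - 1:
                time.sleep(pause_seconds)  # respect the per-minute rate limit
    return handled
```

Calling process_in_chunks(urls_chunks, my_handler) would process one group per minute; passing pause_seconds=0 is handy when trying it out.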
This is my first project using RabbitMQ and I am completely lost, because I am not sure what the best way to solve this problem would be.
The program is fairly simple: it just listens for alarm events and then puts the events in a RabbitMQ queue, but I am struggling with the architecture of the program.
If I open a connection, publish, and then close it for every single event, I will add a lot of latency and transmit unnecessary packets (even more than usual, because I am using TLS)...
If I keep a connection open and create a function that publishes the messages (I only work with a single queue, pretty basic), I will eventually have problems, because multiple events can occur at the same time and my program will not know what to do if the connection to the RabbitMQ broker ends.
Reading their documentation, the solution seems to be to use one of their "Connection Adapters", which would fit me like a glove because I just rewrote all my connection code from basic sockets to Twisted (I really liked its high-level approach). But there is a problem: their "basic example" is fairly complex for someone who barely considers himself "intermediate".
In a perfect world, I would be able to run the service in the same reactor as the "alarm servers" and call a method to publish a message. But I am struggling to understand the code. Could anyone who has worked with pika point me in a better direction, or tell me if there is an easier way?
Well, I will post what worked for me. It is probably not the best alternative, but maybe it helps someone who gets here with the same problem.
First, I decided to drop Twisted and use asyncio (nothing personal, I just wanted to use it because it's already in Python). Even though pika had a good asynchronous example, I tried aio_pika and found it easier to use.
I ended up with 2 main functions: one for a publisher and another for a subscriber.
Below is the code that works for me...
# -*- coding: utf-8 -*-
import asyncio
import aio_pika
from myapp import conf

QUEUE_SEND = []


def add_queue_send(msg):
    """Add MSG to QUEUE
    Args:
        msg (string): JSON
    """
    QUEUE_SEND.append(msg)


def build_url(amqp_user, amqp_pass, virtual_host):
    """Build Auth URL
    Args:
        amqp_user (str): User name
        amqp_pass (str): Password
        virtual_host (str): Virtual Host
    Returns:
        str: AMQP URL
    """
    return ''.join(['amqps://',
                    amqp_user, ':', amqp_pass,
                    '@', conf.get('amqp_host'), '/', virtual_host,
                    '?cafile=', conf.get('ca_cert'),
                    '&keyfile=', conf.get('client_key'),
                    '&certfile=', conf.get('client_cert'),
                    '&no_verify_ssl=0'])


async def process_message(message: aio_pika.IncomingMessage):
    """Read a new message
    Args:
        message (aio_pika.IncomingMessage): Message
    """
    async with message.process():
        # TODO: Do something with the new message
        await asyncio.sleep(1)


async def consumer(url):
    """Keep listening to an AMQP queue
    Args:
        url (str): URL
    Returns:
        aio_pika.Connection: The open connection
    """
    connection = await aio_pika.connect_robust(url=url)
    # Channel
    channel = await connection.channel()
    # Maximum number of unacknowledged messages delivered at once
    await channel.set_qos(prefetch_count=100)
    # Queue
    queue = await channel.declare_queue(conf.get('amqp_queue_client'))
    # Callback to run when a new message is received
    await queue.consume(process_message)
    # Return the connection so the caller can close it later
    return connection


async def publisher(url):
    """Send messages from the queue.
    Args:
        url (str): Authentication URL
    """
    connection = await aio_pika.connect_robust(url=url)
    # Channel
    channel = await connection.channel()
    while True:
        if QUEUE_SEND:
            # If the list (my queue) is not empty
            msg = aio_pika.Message(body=QUEUE_SEND.pop().encode())
            await channel.default_exchange.publish(msg, routing_key='queue')
        else:
            # Just wait
            await asyncio.sleep(1)
    # Never reached while the loop above runs forever
    await connection.close()
I started both using ``loop.create_task``.
As I said, it kind of worked for me (even though I am still having an issue with another part of my code), but I did not want to leave this question open, since other people may have the same issue.
If you know a better or more elegant approach, please share.
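One possible refinement of the publisher above, sketched under assumptions: replace the plain list and the one-second polling with an asyncio.Queue, so the publisher wakes up as soon as an event arrives and messages go out in arrival order. fake_publish is a hypothetical stand-in for channel.default_exchange.publish, and the None sentinel is only a convention chosen for this sketch so that the example can run without a broker.

```python
import asyncio


async def publisher(queue, publish):
    """Forward messages from an asyncio.Queue until a None sentinel arrives."""
    while True:
        msg = await queue.get()  # suspends here instead of polling with sleep()
        if msg is None:
            break
        await publish(msg.encode())


async def demo():
    sent = []

    async def fake_publish(body):
        # Hypothetical stand-in for channel.default_exchange.publish(...)
        sent.append(body)

    queue = asyncio.Queue()
    task = asyncio.create_task(publisher(queue, fake_publish))
    for event in ("alarm-1", "alarm-2"):
        queue.put_nowait(event)
    queue.put_nowait(None)  # tell the publisher to stop
    await task
    return sent
```

Running asyncio.run(demo()) returns the two encoded messages in the order they were queued. In the real program, the alarm handlers would call queue.put_nowait(msg) in place of add_queue_send(msg).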
I am writing a Scrapy spider whose purpose is to make requests to a remote server in order to hydrate the cache. It's an infinite crawler, because I need to repeat the requests at regular intervals. My initial spider, which generates the requests and hits the server, worked fine, but now that I have made it run infinitely, I am not getting any responses. I even tried to debug in the process_response middleware, but the spider never gets that far. Here is a sketch of the code I am implementing:
def generate_requests(self, payloads):
    for payload in payloads:
        if payload:
            print(f'making request with payload {payload}')
            yield Request(url=Config.HOTEL_CACHE_AVAILABILITIES_URL, method='POST', headers=Config.HEADERS,
                          callback=self.parse, body=json.dumps(payload), dont_filter=True, priority=1)

def start_requests(self):
    crawler_config = CrawlerConfig()
    while True:
        if not self.city_scheduler:
            for location in crawler_config.locations:
                city_name = location.city_name
                ttl = crawler_config.get_city_ttl(city_name)
                payloads = crawler_config.payloads.generate_payloads(location)
                self.city_scheduler[location.city_name] = (datetime.now() + timedelta(minutes=ttl)).strftime("%Y-%m-%dT%H:%M:%S")
                yield from self.generate_requests(payloads)
It seems that Scrapy behaves oddly with a while loop in start_requests. You can check a similar enhancement request on the Scrapy repo here.
Moving the while loop logic into your parse method will solve this issue.
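As a rough sketch of what that logic could look like once it lives in the callback: each time a response arrives, work out which cities have an expired TTL and yield new requests only for those. The helper below covers just that bookkeeping and leaves Scrapy out so it can run on its own. due_cities and ttl_minutes are hypothetical names, and it stores datetime objects in city_scheduler (rather than the formatted strings used in the question) so that they can be compared directly.

```python
from datetime import datetime, timedelta


def due_cities(city_scheduler, ttl_minutes, now):
    """Return cities whose refresh time has passed and schedule their next refresh."""
    due = []
    for city, refresh_at in city_scheduler.items():
        if refresh_at <= now:
            due.append(city)
            # Only the value changes, so updating during iteration is safe
            city_scheduler[city] = now + timedelta(minutes=ttl_minutes[city])
    return due
```

Inside parse, the spider could then call due_cities(self.city_scheduler, ttls, datetime.now()) and yield from self.generate_requests(...) for each returned city, where ttls is whatever mapping of city name to TTL the crawler config provides.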
We are using MassTransit with RabbitMQ for making RPCs from one component of our system to others.
Recently we hit a throughput limit on the client side, measured at about 80 completed responses per second.
While investigating where the problem was, I found that requests were processed quickly by the RPC server and the responses were put on the callback queue, but the queue was then consumed at only 80 messages/s.
This limit is only on the client side. Starting another process of the same client app on the same machine doubles request throughput on the server side, but then I see two callback queues filled with messages, each being consumed at the same 80 messages/s.
We are using a single instance of IBus:
builder.Register(c =>
    {
        var busSettings = c.Resolve<RabbitSettings>();
        var busControl = MassTransitBus.Factory.CreateUsingRabbitMq(cfg =>
        {
            var host = cfg.Host(new Uri(busSettings.Host), h =>
            {
                h.Username(busSettings.Username);
                h.Password(busSettings.Password);
            });
            cfg.UseSerilog();
            cfg.Send<IProcessorContext>(x =>
            {
                x.UseCorrelationId(context => context.Scope.CommandContext.CommandId);
            });
        });
        return busControl;
    })
    .As<IBusControl>()
    .As<IBus>()
    .SingleInstance();
The send logic looks like this:
var busResponse = await _bus.Request<TRequest, TResult>(
    destinationAddress: _settings.Host.GetServiceUrl<TCommand>(queueType),
    message: commandContext,
    cancellationToken: default(CancellationToken),
    timeout: TimeSpan.FromSeconds(_settings.Timeout),
    callback: p => { p.WithPriority(priority); });
Has anyone faced this kind of problem?
My guess is that there is some limit in the response dispatch logic. It might be the maximum thread pool size, the size of a buffer, or the prefetch count of the response queue.
I tried adjusting the .NET thread pool size, but nothing helped.
I'm fairly new to MassTransit and would appreciate any help with this problem.
I hope it can be fixed through configuration.
There are a few things you can try to optimize the performance. I'd also suggest checking out the MassTransit-Benchmark and running it in your environment - this will give you an idea of the possible throughput of your broker. It allows you to adjust settings like prefetch count, concurrency, etc. to see how they affect your results.
Also, I would suggest using one of the request clients to reduce the setup for each request/response. For example, create the request client once, and then use that same client for each request.
var serviceUrl = yourMethodToGetIt<TRequest>(...);
var client = Bus.CreateRequestClient<TRequest>(serviceUrl);
Then, use that IRequestClient<TRequest> instance whenever you need to perform a request.
Response<Value> response = await client.GetResponse<TResponse>(new Request());
Since you are just using RPC, I'd highly recommend setting the receive endpoint queue to non-durable, to avoid writing RPC requests to disk. Also adjust the bus prefetch count to a higher value (about 2x the maximum number of concurrent requests you may have) to ensure that responses are always delivered directly to your awaiting response consumer (it's an internal detail of how RabbitMQ delivers messages).
var busControl = Bus.Factory.CreateUsingRabbitMq(cfg =>
{
    cfg.PrefetchCount = 1000;
});
What is the advantage of using Source Streaming vs the regular way of handling requests? My understanding is that in both cases:
The TCP connection will be reused
Back-pressure will be applied between the client and the server
The only advantage of Source Streaming I can see is if there is a very large response and the client prefers to consume it in smaller chunks.
My use case is that I have a very long list of users (millions), and I need to call a service that performs some filtering on the users, and returns a subset.
Currently, on the server side I expose a batch API, and on the client, I just split the users into chunks of 1000, and make X batch calls in parallel using Akka HTTP Host API.
I am considering switching to HTTP streaming, but cannot quite figure out what the value would be.
You are missing one other huge benefit: memory efficiency. By having a streamed pipeline, client/server/client, all parties safely process data without running the risk of blowing up the memory allocation. This is particularly useful on the server side, where you always have to assume the clients may do something malicious...
Client Request Creation
Suppose the ultimate source of your millions of users is a file. You can create a stream source from this file:
val userFilePath : java.nio.file.Path = ???
val userFileSource = akka.stream.scaladsl.FileIO.fromPath(userFilePath)
This source can be used to create your HTTP request, which will stream the users to the service:
import akka.http.scaladsl.model.HttpEntity.{Chunked, ChunkStreamPart}
import akka.http.scaladsl.model.{RequestEntity, ContentTypes, HttpRequest}
val httpRequest : HttpRequest =
  HttpRequest(uri = "http://filterService.io",
              entity = Chunked.fromData(ContentTypes.`text/plain(UTF-8)`, userFileSource))
This request will now stream the users to the service without consuming the entire file into memory. Only chunks of data will be buffered at a time, therefore, you can send a request with potentially an infinite number of users and your client will be fine.
Server Request Processing
Similarly, your server can be designed to accept a request with an entity that can potentially be of infinite length.
Your question says the service will filter the users. Assuming we have a filtering function:
val isValidUser : (String) => Boolean = ???
This can be used to filter the incoming request entity and create a response entity which will feed the response:
import akka.http.scaladsl.server.Directives._
import akka.http.scaladsl.model.{ContentTypes, HttpResponse}
import akka.http.scaladsl.model.HttpEntity.Chunked
import akka.stream.scaladsl.Source
import akka.util.ByteString
val route = extractDataBytes { userSource =>
  val responseSource : Source[ByteString, _] =
    userSource
      .map(_.utf8String)
      .filter(isValidUser)
      .map(user => ByteString(user))
  complete(HttpResponse(entity = Chunked.fromData(ContentTypes.`text/plain(UTF-8)`,
                                                  responseSource)))
}
Client Response Processing
The client can similarly process the filtered users without reading them all into memory. We can, for example, dispatch the request and send all of the valid users to the console:
import akka.http.scaladsl.Http
// Assumes an implicit ActorSystem, Materializer and ExecutionContext are in scope
Http()
  .singleRequest(httpRequest)
  .flatMap { response =>
    response
      .entity
      .dataBytes
      .map(_.utf8String)
      .runForeach(println)
  }
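To make the memory argument concrete outside Akka, here is a small analogy using Python generators. It is only a sketch of the lazy pipeline idea (read, filter, regroup) and does not model HTTP or back-pressure. read_users, filter_users and in_chunks are hypothetical names, and the StringIO object stands in for the user file.

```python
import io


def read_users(stream):
    """Yield one user per line without loading the whole input."""
    for line in stream:
        user = line.strip()
        if user:
            yield user


def filter_users(users, is_valid_user):
    """Lazily keep only the users that pass the predicate."""
    return (user for user in users if is_valid_user(user))


def in_chunks(items, size):
    """Group a lazy stream into lists of at most `size` items."""
    chunk = []
    for item in items:
        chunk.append(item)
        if len(chunk) == size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk


source = io.StringIO("alice\nbob\n\ncarol\ndave\neve\n")
pipeline = in_chunks(filter_users(read_users(source), lambda u: u != "bob"), 2)
```

Nothing is read until pipeline is iterated, and at most one chunk of two users is held in memory at a time, which is the property the streamed request and response entities provide end to end.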
I have a Lua module I'm writing for making requests to a public API:
-- users.lua
local http = require("socket.http")
local base_url = 'http://example.com'
local api_key = "secret"
local users = {}

function users.info(user_id)
  local request_url = base_url .. '/users/' .. user_id .. "?api_key=" .. api_key
  print("Requesting " .. request_url)
  local response = http.request(request_url)
  print("Response " .. response)
  return response
end

return users
This works, but I'd like to use TDD to finish writing the entire API wrapper.
I have a spec (using the busted framework) which works, but it makes an actual request to the API:
-- spec/users_spec.lua
package.path = "../?.lua;" .. package.path

describe("Users", function()
  it("should fetch the users info", function()
    local users = require("users")
    local s = spy.on(users, "info")
    users.info("chip0db4")
    assert.spy(users.info).was_called_with("chip0db4")
  end)
end)
How do I mock this out, much like how WebMock works in Ruby, where the actual endpoint is not contacted? The solution doesn't need to be specific to the busted framework, btw.
After receiving some excellent feedback from https://github.com/TannerRogalsky, as shown here https://gist.github.com/TannerRogalsky/b56bc886811f8f0a9d2a, I decided to write my own mocking library for HTTP requests: https://github.com/chip/webmock. It's in its very early stages, but it's at least a start. I'd be grateful for contributions to the repo or suggestions on other approaches or available Lua modules.
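For comparison only, here is the same stubbing idea sketched in Python with the standard library's unittest.mock. It is not the Lua library linked above. user_info is a hypothetical port of users.info, and the JSON body is made-up test data. The HTTP function is swapped for a fake during the test, so the real endpoint is never contacted.

```python
import urllib.request
from unittest import mock


def user_info(user_id, base_url="http://example.com", api_key="secret"):
    """Fetch a user's info; a hypothetical port of users.info from the Lua module."""
    request_url = f"{base_url}/users/{user_id}?api_key={api_key}"
    with urllib.request.urlopen(request_url) as response:
        return response.read().decode()


def check_user_info_without_network():
    fake_response = mock.MagicMock()
    fake_response.read.return_value = b'{"id": "chip0db4"}'
    # urlopen is used as a context manager, so __enter__ must return the fake
    fake_response.__enter__.return_value = fake_response
    with mock.patch("urllib.request.urlopen", return_value=fake_response) as fake_open:
        body = user_info("chip0db4")
    # The stub records the URL it was asked for, so the request can be verified
    fake_open.assert_called_once_with(
        "http://example.com/users/chip0db4?api_key=secret")
    return body
```

In Lua the equivalent move would be to replace http.request with a stub before requiring the module under test.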