Now I plan to use Scrapy in a more distributed approach, and I'm not sure whether the spiders/pipelines/downloaders/schedulers and the engine are all hosted in separate processes or threads. Could anyone share some info about this? And can we change the process/thread count for each component? I know there are two settings, "CONCURRENT_REQUESTS" and "CONCURRENT_ITEMS"; they determine the concurrency of the downloader and the item pipelines, right? And if I want to deploy the spiders/pipelines/downloaders on different machines, I need to serialize the items/requests/responses, right?
I'd appreciate any help very much!
Thanks,
Edward.
Scrapy is single threaded. It uses the Reactor pattern to achieve concurrent network requests. This is done using the Twisted Framework.
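For tuning concurrency within that single process, the two settings the question mentions are the right knobs; a minimal settings.py sketch (the values here are arbitrary examples):
# settings.py -- concurrency knobs for a single Scrapy process
CONCURRENT_REQUESTS = 32             # max requests the downloader keeps in flight at once
CONCURRENT_REQUESTS_PER_DOMAIN = 8   # optional per-domain cap
CONCURRENT_ITEMS = 100               # max items (per response) processed concurrently in the item pipelines
These control concurrent operations on the reactor, not threads.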
People who want to distribute Scrapy usually introduce some messaging framework. Some use Redis, others try RabbitMQ.
Also have a look at Scrapyd
Concurrency and parallel processing are two different things.
I know that FastAPI supports concurrency. It can handle multiple API requests concurrently using async and await.
What I want to know is whether FastAPI also supports multiprocessing and parallel processing of requests.
If so, how can I implement parallel processing of requests?
I have searched a lot, but everything I found was about concurrency only. I am new to FastAPI. Thanks for your help!
When running your app with Uvicorn or Gunicorn, you can specify how many workers/processes you want. In Uvicorn, you just need to pass --workers N as an argument, and in Gunicorn it's pretty much the same, with --workers=N. That will start N processes all receiving requests at the same time.
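To make that concrete, a minimal sketch (the module and route are made up for illustration) together with the launch commands from above:
# main.py -- a hypothetical minimal FastAPI app
from fastapi import FastAPI

app = FastAPI()

@app.get("/")
async def read_root():
    return {"hello": "world"}

# Launch with several worker processes so requests are handled in parallel:
#   uvicorn main:app --workers 4
#   gunicorn main:app --workers=4 -k uvicorn.workers.UvicornWorker
Each worker is a separate OS process, so CPU-bound requests can genuinely run in parallel across them.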
I have built a "Large" application using Flask-AppBuilder and have 2 questions I have not seen the answer to.
Is there any way to "split" a large application into multiple components (similar to what Blueprints do)?
My business logic has mostly ended up in the Views, but some of it does not feel right there. A few things I have added to the models, and again it does not feel right. This is logic that tends to create long-running processes, so I have been testing out Celery.
Any examples of either of these would be lovely.
It does not really matter what framework you use, but as soon as the application grows you may want to isolate critical logic, both for the reasons you described above and to be future-proof (you may want to move to a new frontend in the future without rewriting the heavy lifting).
I usually set up a Redis worker for this, and use e.g. Flask only to trigger the queue with function calls. That also makes the application more scalable (concurrent users, more data), as you can simply start more workers listening to your queue if needed.
In essence:
from redis import Redis
from rq import Queue
from rq.job import Job

# Connect to the local Redis instance and create a default RQ queue
conn = Redis()
q = Queue(connection=conn)
Then, as an example, in the Flask routes (for AppBuilder, use the views, or create your own lib) call:
result = q.enqueue('utils.your_function_name', args=(id,))
Have a look at RQ here for more examples, including how to monitor the status of your jobs, etc.
https://python-rq.org/
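For completeness, a sketch of the worker side, assuming the enqueued callable lives in a utils.py module (the function name is just the placeholder used above):
# utils.py -- the module that q.enqueue() refers to by dotted name
def your_function_name(id):
    # the long-running work goes here (reports, imports, external API calls, ...)
    return f"finished job for {id}"

# Start one or more worker processes listening on the default queue:
#   $ rq worker
Each extra worker lets one more job run at the same time.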
I am new to NServiceBus, trying to introduce messaging into a WCF/RPC solution.
Because of architectural constraints and overhead (memory and CPU usage are already high), IT Operations will not allow MSDTC. (I'm also keen to avoid 2PC, fwiw.) I also require messages over HTTP, so the NSB bridge looks like a great solution.
Based on these posts (how-i-avoid-two-phase-commit and extending-nservicebus-avoiding-two-phase-commits) it looks to me as though it's possible to use NSB with the DTC disabled.
It sounds like EventStore does manage to avoid 2PC in the same way that I want to setup NSB but at the moment I just want to get NSB to work rather than adding event sourcing into the mix.
Questions:
Are there any examples of configuring NSB to work this way? I'm quite happy to add the extra complexity (custom message handler with local message state storage); without 2PC there isn't really another option. I already know of this example (IdempotentConsumer), but the test projects for this repo contain no code. It would be even more helpful if there were an example using NoSQL storage.
Will I need to alter the NSB bridge to deal with no DTC? I'm guessing no: bridge transactions are only against the local queue, but the process that consumes the local queue will obviously need to be coded to avoid 2PC. Correct?
Are there any other useful resources/posts around using NSB without MSDTC? The solution (how-i-avoid-two-phase-commit) seems not too complex but given I'm just starting out with NSB it would be great to find a quickstart for this...
I would have thought this would be a common scenario - but there doesn't seem to be much written about avoiding MSDTC while still using NSB. Surely there are others who are using a message bus but aren't allowed to use MSDTC... Is there another obvious way that I've missed?
thanks
2) Yes, you should be fine. Since you're doing the deduplication yourself, you don't need the gateway to do it for you; just configure it to use the InMemory persistence.
I am a newbie to real-time application development and am trying to wrap my head around the myriad options out there. I have read as many blog posts, notes, and essays as people have been kind enough to share. Yet a simple problem seems unanswered in my tiny brain. I thought a number of other people might have the same issues, so I might as well sign up and post here on SO. Here goes:
I am building a tiny real-time app which is asynchronous chat + another fun feature. I boiled my choices down to the following two options:
LAMP + RabbitMQ
Node.JS + Redis + Pub-Sub
I believe I understand the basics well enough to start learning and building this out. However, my (seriously n00b) questions are:
How do I communicate with the end user (client to/from server) in both of those? Would that be simple JavaScript long/infinite polling?
Of the two, which might be more efficient to build out and manage from a single Slice (assuming 100 - 1,000 users)?
Should I just build everything out with jQuery in the 'old school' paradigm and then identify which stack might make more sense? Just so that I can get the product fleshed out as a prototype and then 'optimize' it. Or is writing in one over the other more than mere optimization? (I feel so, but I am not 100% sure on this personally)
I hope this isn't a crazy question and won't get flamed right away. Would love some constructive feedback, love this community!
Thank you.
Architecturally, both of your choices are the same as storing data in an Oracle database server for another application to retrieve.
Both the RabbitMQ and the Redis solution require your apps to connect to an intermediary server that handles the data communications. Redis is most like Oracle, because it can be used simply as a persistent database with a network API. But RabbitMQ is a little different because the MQ Broker is not really responsible for persisting data. If you configure it right and use the right options when publishing a message, then RabbitMQ will actually persist the data for you but you can't get the data out except as part of the normal message queueing process. In other words, RabbitMQ is for communicating messages and only offers persistence as a way of recovering from network problems or system crashes.
I would suggest using RabbitMQ and whatever programming languages you are already familiar with. Since the M in LAMP is usually interpreted as MySQL, this means that you would either not use MySQL at all, or only use it for long term storage of data, not for the realtime communications.
The RabbitMQ site has a huge amount of documentation about building apps with AMQP. I suggest that after you install RabbitMQ, you read through the docs for rabbitmqctl and then create a vhost to experiment in. That way it is easy to clean up your experiments without resetting everything. I also suggest using only topic exchanges because you can emulate the behavior of direct and fanout exchanges by using wildcards in the routing_key.
Remember, you only publish messages to exchanges, and you only receive messages from queues. The exchange is responsible for pattern matching the message's routing_key to the queue's binding_key to determine which queues should receive a copy of the message. It is worthwhile learning the whole AMQP model even if you only plan to send messages to one queue with the same name as the routing_key.
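To make the exchange/queue model concrete, here is a minimal sketch using the pika Python client (pika itself is my assumption; the exchange, queue, and routing keys are made-up examples):
import pika

# Connect to a local RabbitMQ broker (default credentials assumed)
connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()

# A topic exchange; wildcards in binding keys let it emulate direct/fanout behaviour
channel.exchange_declare(exchange="chat", exchange_type="topic")

# Declare a queue and bind it to every routing key starting with "room."
channel.queue_declare(queue="room-log")
channel.queue_bind(exchange="chat", queue="room-log", routing_key="room.#")

# Publish to the exchange; it matches the routing_key against binding keys to pick queues
channel.basic_publish(exchange="chat", routing_key="room.lobby", body=b"hello")

connection.close()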
If you are building your client in the browser, and you want to build a prototype, then you should consider just using XHR today, and then move to something like Kamaloka-js, which is a pure JavaScript implementation of AMQP (the AMQ Protocol), the standard protocol used to communicate with a RabbitMQ message broker. In other words, build it with what you know today, and then speed it up later with something (AMQP) that has a long-term future in your toolbox.
Should I just build everything out with jQuery in the 'old school' paradigm and then identify which stack might make more sense? Just so that I can get the product fleshed out as a prototype and then 'optimize' it. Or is writing in one over the other more than mere optimization? (I feel so, but I am not 100% sure on this personally)
This is usually called RAD (rapid application design/development), and it is what I would recommend right now. It lets you build the proof of concept that you can work from later to get where you want to go.
As for how to talk to the clients from the server, and vice versa, have you read up on WebSockets at all?
Given the choice between LAMP or event-based programming, for what you're suggesting, I would tell you to go with event-based programming, so Node.js. But that's just one man's opinion.
Well,
LAMP - Apache creates a new process for every request. RabbitMQ can be useful here, and it has many features.
Node.js - Uses a single process to handle all requests asynchronously with the help of an event loop, so there is no extra process-creation overhead as with Apache.
For an asynchronous chat application,
socket.io + Node.js + Redis pub-sub is the best stack.
I have already implemented real-time notifications using the above stack.
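The answer's stack is Node.js + socket.io, but the Redis pub/sub piece looks much the same from any language; a minimal sketch with the Python redis client (the channel name is made up):
import redis

r = redis.Redis()  # connect to a local Redis instance

# Subscriber side: listen on a chat channel
p = r.pubsub(ignore_subscribe_messages=True)
p.subscribe("chat:lobby")

# Publisher side: push a message to everyone subscribed to that channel
r.publish("chat:lobby", "hello from another connection")

# Poll for the delivered message (a real app would loop, or use p.listen())
print(p.get_message(timeout=1))
In the chat app, each connected client maps to a subscription, and socket.io (or long polling) pushes the published messages down to the browser.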
I'd like to have a real-time 'system status'/'activity monitor' console for my Twisted application.
The app is basically a protocol.ServerFactory which accepts connections and performs different jobs.
Kind of like twisted.manhole, I'm looking for the simplest way to create an admin application where I can check the current stats of my app.
The admin can be a simple ascii-based shell or html/json setup.
I'm aware that I could build this with a bunch of counters and a separate protocol for authenticating and monitoring them, but I'm thinking Twisted might already have such a thing, since it at least knows the number of connections, protocol types, etc.
Tips?
There's the unmaintained, slowly rotting twisted.manhole.gladereactor. If you're using twistd, then you can use this trivially:
twistd --reactor debug-gui ...
If you're running the reactor directly yourself, then it's only slightly more effort:
from twisted.manhole import gladereactor
gladereactor.install()  # install the GUI reactor before twisted.internet.reactor is imported
from twisted.internet import reactor
...
The Inspect feature appears to have been broken for some time, but it will still show you a list of established connections and what state they are in, and it will also apparently give you a traffic log for each connection. Fixing Inspect may also be a fairly straightforward effort, in case you're looking for a little project. :)
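If you do end up rolling your own counters, a minimal sketch of the HTML/JSON flavour the question mentions, using twisted.web (the counter names are made up; in your app the factory/protocols would update them, and since the reactor is already running you would only add the listenTCP call):
import json
from twisted.web.resource import Resource
from twisted.web.server import Site
from twisted.internet import reactor

# Hypothetical counters, updated by your factory/protocols as connections come and go
stats = {"connections": 0, "jobs_done": 0}

class StatusPage(Resource):
    isLeaf = True

    def render_GET(self, request):
        request.setHeader(b"content-type", b"application/json")
        return json.dumps(stats).encode("utf-8")

# Serve the status page on a separate admin port
reactor.listenTCP(8081, Site(StatusPage()))
reactor.run()  # only needed if this runs standalone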