I use Scrapyd for scheduling and launching spider jobs.
In my Item Pipeline classes I set job-specific variables, which should not be shared with other spiders/jobs.
So my question is: does Scrapy/Scrapyd create a new instance of the pipeline class for each spider job/process?
Scrapy/Scrapyd creates a new instance of pipelines, middlewares, etc. for each job/process.
However, your pipelines must not keep per-job data in static (class-level) variables; under some conditions, that data can be accessed from another spider through the Python class variable.
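In other words, per-job state is fine as long as it lives in instance attributes (set in __init__ or open_spider), not in class attributes. A minimal sketch, with purely illustrative attribute names:

class JobStatsPipeline(object):
    # Risky: a class attribute lives on the class object itself, so under
    # some conditions another spider in the same process can see it.
    shared_counter = 0

    def __init__(self):
        # Safe: instance attributes belong to this pipeline instance only,
        # and each job/process gets its own instance.
        self.items_seen = 0

    def open_spider(self, spider):
        # Also safe: per-job values set when this job's spider opens.
        self.current_spider_name = spider.name

    def process_item(self, item, spider):
        self.items_seen += 1
        return item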
Project Reactor is awesome; I can easily switch processing of some parts onto another thread. But I've looked inside the Schedulers.fromExecutorService() method, and it allocates a new ExecutorService every time. So whenever this method is called, schedulers are created and allocated all over again. I'm not sure, but I think it's a potential memory leak...
Mono<String> sometext() {
    return Mono
        .fromCallable(() -> "")
        .subscribeOn(Schedulers.newParallel("my-custom"));
}
I wonder about registering the Scheduler as a bean (it is a singleton, so it would be allocated only once rather than every time), or creating it in the constructor. Many blogs explain the threading model this way:
...
private final Scheduler scheduler = Schedulers.newParallel("my-custom");
..
Mono.fromCallable(() -> "").subscribeOn(scheduler)
Schedulers.newParallel() will indeed create a new scheduler with an associated backing thread pool every time you call it - so yes, you're correct: if you're using that method, you want to make sure you store a reference to it somewhere so you can reuse it. Simply providing the same name argument won't retrieve the existing scheduler; it will just create a different one with the same name.
How you do this is up to you - it can be via a Spring bean (as long as it's a singleton and not a prototype bean!), a field, or whatever else fits in best with your use case.
However, before all of this I'd first consider whether you definitely need to create a separate parallel scheduler at all. The Schedulers.parallel() scheduler is a default parallel scheduler available for parallel work out of the box (it doesn't create a new one on each invocation), and unless you need separately configured parallel schedulers for separate services for some reason, best practice is just to use that.
I'm used to running spiders one at a time, because we mostly work with scrapy crawl and on scrapinghub, but I know that one can run multiple spiders concurrently, and I have seen that middlewares often have a spider parameter in their callbacks.
What I'd like to understand is:
the relationship between Crawler and Spider. If I run one spider at a time, I'm assuming there's one of each. But if you run more spiders together, like in the example linked above, do you have one crawler for multiple spiders, or are they still 1:1?
is there in any case only one instance of a middleware of a certain class, or do we get one per-spider or per-crawler?
Assuming there's one, what are the crawler.settings in the middleware creation (for example, here)? In the documentation it says that those take into account the settings overridden in the spider, but if there are multiple spiders with conflicting settings, what happens?
I'm asking because I'd like to know how to handle spider-specific settings. Take again the DeltaFetch middleware as an example:
enabling it seems to be a global matter, because DELTAFETCH_ENABLED is read from the crawler.settings
however, the sqlite db is opened in spider_opened and is a unique instance variable (i.e., not depending on the spider); so if you have more than one spider and the instance is shared, when the second spider is opened, the old db is lost. And if you have only one instance of the middleware per spider, why bother passing the spider as a parameter?
Is that a correct way of handling it, or should you rather have a dict spider_dbs indexed by spider name?
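For reference, here is a rough sketch of that per-spider dict idea (a hypothetical middleware, not the actual DeltaFetch code; the from_crawler signal wiring is omitted for brevity):

import sqlite3

class PerSpiderStateMiddleware(object):
    """Keeps one db handle per spider instead of a single shared one."""

    def __init__(self):
        self.spider_dbs = {}  # spider name -> db connection

    def spider_opened(self, spider):
        # One database per spider; the path is purely illustrative.
        self.spider_dbs[spider.name] = sqlite3.connect('/tmp/%s.db' % spider.name)

    def spider_closed(self, spider):
        self.spider_dbs.pop(spider.name).close()

    def process_spider_output(self, response, result, spider):
        db = self.spider_dbs[spider.name]  # the db belonging to *this* spider
        # ... filter or record items using this spider's own db ...
        return result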
I am completely new to working with AWS. Currently I am in the following situation: My lambda function starts an EC2 instance. This instance will need the information contained in the 'ID' variable. I was wondering how I could transfer this data from my lambda function to the EC2 instance. Is this even possible?
import boto3

region = 'eu-west-1'
instances = ['AnEC2Instance-ID']

ec2 = boto3.client('ec2', region_name=region)

def lambda_handler(event, context):
    # 'ID' comes from the API Gateway event and needs to reach the instance
    ID = event.get('ID')
    ec2.start_instances(InstanceIds=instances)
    print('started your instance: ' + str(instances))
Here 'AnEC2Instance-ID' is supposed to be an EC2 instance ID.
This Lambda function is triggered by an API Gateway. The ID is obtained from the API Gateway event using the line: ID = event.get('ID')
These EC2 instances have already been launched, and in this Lambda they are only being started via boto3's ec2.start_instances. To get your data onto an instance that way, you would first have to do some clever AWS work to modify the instance's user data, and also have the instance configured to re-run the user data at every start (not just at launch). Quite complex, IMHO.
Two alternate suggestions:
Revisit your need to start an existing EC2 instance, as you can easily pass data to a new instance via the UserData parameter of boto3's client.run_instances function.
Or, if you truly need to revive an existing EC2 instance, you might need a third component to manage the correlation of EC2 instance IDs and your event IDs: how about DynamoDB? First, your script above writes a key-value pair of the instance ID and the event ID. Then it invokes ec2.start_instances, and when the EC2 instance starts, it is pre-configured to run curl http://169.254.169.254/latest/meta-data/instance-id and use that value to query DynamoDB.
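A rough sketch of that correlation idea, assuming a hypothetical DynamoDB table named instance-events keyed on the instance ID:

import boto3

region = 'eu-west-1'
instances = ['AnEC2Instance-ID']

ec2 = boto3.client('ec2', region_name=region)
dynamodb = boto3.resource('dynamodb', region_name=region)
table = dynamodb.Table('instance-events')  # hypothetical table name

def lambda_handler(event, context):
    ID = event.get('ID')
    # Record which event ID belongs to which instance, then start it.
    table.put_item(Item={'instance_id': instances[0], 'event_id': ID})
    ec2.start_instances(InstanceIds=instances)

On boot, the instance would then curl http://169.254.169.254/latest/meta-data/instance-id and use that value to look up its event_id in the table.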
When launching an Amazon EC2 instance, you can provide data in the User Data parameter.
This data will then be accessible on the instance via:
http://169.254.169.254/latest/user-data/
This technique is also used to pass a startup script to an instance. There is software provided on the standard Amazon AMIs that will run the script if it starts with specific identifiers. However, you can simply pass any data via User Data to make it available to the instance.
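If you go the run_instances route instead of starting an existing instance, a minimal sketch could look like this (the AMI ID and instance type are placeholders):

import boto3

ec2 = boto3.client('ec2', region_name='eu-west-1')

def lambda_handler(event, context):
    ID = event.get('ID')
    # Embed the ID in the User Data; here it is wrapped in a small startup
    # script that writes it to a file on the new instance.
    user_data = '#!/bin/bash\necho "%s" > /tmp/event_id\n' % ID
    ec2.run_instances(
        ImageId='ami-xxxxxxxx',    # placeholder AMI ID
        InstanceType='t3.micro',   # placeholder instance type
        MinCount=1,
        MaxCount=1,
        UserData=user_data,
    )

On the instance, the same value is also retrievable from http://169.254.169.254/latest/user-data/.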
I am using pywin32 and calling the Dispatch function to create a COM object, but this means a new instance of the application is created (in this case PTV Vissim) whenever I call the function. Is it possible, instead, to attach to an already existing Vissim application? This would speed up development, since I wouldn't have to wait for the application to start every time I run a test.
This is my existing relevant code:
import win32com.client as com
Vissim = com.Dispatch("Vissim.Vissim.540")
Specifically for PTV Vissim, there is an option to start Vissim with the command-line switch -automation (for example: vissim100.exe -automation). If you start PTV Vissim with -automation, it runs as a COM server in automation mode for COM scripts that are started subsequently.
See chapter "Starting PTV Vissim via the command prompt" of the PTV Vissim Help.
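For illustration, a minimal sketch of that workflow (the executable path is a placeholder; match the ProgID to your installed version):

import subprocess
import time

import win32com.client as com

# Placeholder path - point this at your actual Vissim executable.
VISSIM_EXE = r"C:\Program Files\PTV Vision\PTV Vissim\Exe\vissim100.exe"

# Launch Vissim once, in automation mode.
subprocess.Popen([VISSIM_EXE, "-automation"])
time.sleep(30)  # crude wait for the COM server to finish starting

# Subsequent test runs can connect to that already-running instance
# instead of spawning a new one each time.
Vissim = com.Dispatch("Vissim.Vissim.540")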
In general, you cannot "attach" to an existing Vissim instance as a COM server. By default, each client connection is backed by an independent Vissim instance.
That being said, it is still possible to accomplish your goal: use the command-line switch "-automation" to launch Vissim.exe, and that running Vissim.exe will act as an automation server, as you desired.
--
What is under the hood?
The truth is, right in Vissim.exe's startup code, CoRegisterClassObject(CLSID, pUnk, dwClsContext, flags, &dwRegister) is by default called with flags = REGCLS_SINGLEUSE.
REGCLS_SINGLEUSE simply means that after a client application has connected to a Vissim class object hosted by a running Vissim.exe, that class object's class factory is removed from public view (i.e., it is no longer in the OS's class table). As a result, a new client connection has to launch a new Vissim instance in order to obtain a class factory, hence the creation of a new Vissim instance.
However, if you use the command line switch "-automation" while launching a Vissim.exe instance, that Vissim.exe will use REGCLS_MULTIPLEUSE flag to register the class factory instead. That allows multiple client connections to the same running Vissim.exe instance afterwards.
I have more detailed blog posts on this matter and other related issues; you might want to check them out at blog.wupingxin.net.
I'm creating a Node.js server where I have a "notifications" module. Every request to the Node server will invoke this module, and I need to have a set of private variables for each separate invocation.
For example, when a request is made to the notifications module, it needs to remember the user ID, request time, session ID, etc.
How can I do that efficiently?
If I simply declare variables in module scope, they seem to be shared across module invocations, so this fails to keep every request's data private.
What I need is for the module to remember its own data for each invocation. So, please point out how I can do that.
Thanks,
Anjan
Update
The Node.js server is to be used as a chat server. The "notifications" module will scan the DB for new messages and send the output in JSON format to the client using a long-polling technique.
I tried to wrap the data and the functions into an object, so that each time a request is made to the chat server a new object is created and carries out the desired functions. But instead of working in parallel, it executes each request serially: if I make 3 requests to the server, they just queue up and execute one after another.
Any clue on that?
The module source code can be found here: http://www.ultrasoftbd.com/notifications.js
Thanks,
Anjan
There are a couple of ways that come to mind to approach this issue:
1) Have your module export a constructor which can be called by the API users of your module to give them a new instance of an object. This way, each user will have its own object instance which has its own private variables.
// Client example.
var mod = require('myModule')
, sess = new mod.Session();
sess.method(args);
sess.finalize();
2) Have your module provide a "registry" interface to these private variables which includes an identifier unique to the caller.
// Client example.
var mod = require('myModule');
var id = mod.initialize(); // Returns a unique ID.
mod.method(id, args); // Each method requires the ID.
mod.finalize(id);
These solutions share the idea that each instance (or ID) is tracked separately by your module so that the statistics (or whatever your module does) can be computed per client instance rather than globally to the module.