scrapyd or CrawlerProcess for parallel parsing - scrapy

I need to run a lot of spiders (~20-50) in parallel on the same server.
Some of my spiders run for more than two days, and sometimes I need to start a new one before all the other processes have finished.
As I understand it, both scrapyd (a separate daemon process) and CrawlerProcess (a Scrapy class) provide this capability.
Or maybe Celery is more suitable here? (I'd like to use Python 3.)
What are the particular features of each approach, and which one is better for my project?

As mentioned in https://github.com/scrapy/scrapyd/issues/143, scrapyd is going to support Python 3. Regardless of that, Celery is a good choice.
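On the CrawlerProcess side, running several spiders in parallel inside one process looks roughly like the sketch below; the spider classes, start URLs, and settings are placeholders, not recommendations:

    # Minimal sketch: several spiders scheduled on one CrawlerProcess.
    # The spiders, start URLs, and settings below are placeholders.
    import scrapy
    from scrapy.crawler import CrawlerProcess

    class SpiderOne(scrapy.Spider):
        name = "spider_one"
        start_urls = ["https://example.com"]

        def parse(self, response):
            yield {"url": response.url, "title": response.css("title::text").get()}

    class SpiderTwo(scrapy.Spider):
        name = "spider_two"
        start_urls = ["https://example.org"]

        def parse(self, response):
            yield {"url": response.url, "title": response.css("title::text").get()}

    process = CrawlerProcess(settings={"CONCURRENT_REQUESTS": 16})
    process.crawl(SpiderOne)    # both spiders share the same Twisted reactor
    process.crawl(SpiderTwo)
    process.start()             # blocks until every scheduled crawl has finished

Note that once start() is called it blocks until the scheduled crawls finish, so adding a new spider to an already running process is awkward. scrapyd, which accepts new jobs over HTTP while other jobs are still running, or a Celery-based setup fits the "start a new spider mid-run" requirement more naturally.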

Does Chrome.robot support parallel run

I was working on the Cucumber report and then found the parallel option. As of now I am running only one thread and using parallel=false in the feature file. As per my understanding, we can't use parallelism with karate.robot as it needs one activated window with a title. Please correct me if I am wrong.
I think the main challenge is that most of the UI interactions assume that the "active" window is "on top", visible and has focus. If you can figure out a way to use Element.invoke() for everything, maybe - but you will need to experiment.
Personally, I feel that the better strategy is to split your test suite across multiple cloud nodes; maybe virtual machines or EC2 instances will work, provided you get the RDP stuff sorted out.
Note that Karate has a way to run distributed tests: https://github.com/intuit/karate/wiki/Distributed-Testing - it may need some research though.

snakemake: use both --keep-going and --stats

The --keep-going flag tells snakemake to go on with independent jobs if a job fails.
The --stats /path_to_the_runtime_statistics_file option produces the runtime statistics of all the rules at the end of the pipeline.
However, if a job fails then the pipeline does not produce the runtime statistics file at all.
I.e. if you have 100 jobs and only one of them fails, then the runtime statistics for the 99 successful jobs are not produced.
How should one get the runtime statistics of the jobs that succeeded?
Thanks in advance.
If you look at the snakemake API documentation and at how --stats is handled in the execute function implementation, you will see that the implementation only calls the snakemake.stats module under a condition that says if success:.
So the straight answer to your question is no, you can't do it directly.
Two ways of moving forward:
Quick & simple solution uses their stats implementation, and write what you wanted to do taking/calling particular functionality as per your needs! :)
from snakemake import stats
and do whatever you want .....
If you can't, then create an issue on the snakemake GitHub. Based on its priority, the developers may add the feature to a newer version of snakemake, but that is a very slow process.
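If you take the first route, keep in mind that the internals of snakemake.stats can change between versions, so the sketch below is only a stand-in for the idea: record per-rule runtimes yourself and write them out even when other jobs fail. The RuleStats class, its method names, and the output layout are assumptions for illustration, not snakemake's actual interface.

    # Hypothetical stand-in for a stats helper: record per-rule runtimes
    # yourself and dump them even if other jobs in the pipeline failed.
    # None of these names come from snakemake itself.
    import json
    import time

    class RuleStats:
        def __init__(self):
            self.runtimes = {}          # rule name -> list of runtimes in seconds

        def record(self, rule, start, end):
            self.runtimes.setdefault(rule, []).append(end - start)

        def to_json(self, path):
            summary = {
                rule: {
                    "count": len(times),
                    "mean-runtime": sum(times) / len(times),
                    "min-runtime": min(times),
                    "max-runtime": max(times),
                }
                for rule, times in self.runtimes.items()
            }
            with open(path, "w") as fh:
                json.dump(summary, fh, indent=2)

    stats = RuleStats()
    start = time.time()
    # ... run one job of the (hypothetical) rule "align" here ...
    stats.record("align", start, time.time())
    stats.to_json("runtime_statistics.json")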

Linux process activities

Is there a way to show what's going on inside a specified process in Linux?
For example, I run an SQL query -> select evil_function();
and notice that the process on Linux uses all the CPU.
Is there something with which I can see what's going on inside this process?
What I want is to see which queries are running under this process.
Thanks!
strace will tell you what system calls the process is making.
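For example (1234 is just a placeholder for the PID of the process you are inspecting):

    strace -p 1234        # attach to the running process and print each system call as it happens
    strace -c -p 1234     # attach, then print a summary of call counts and times when you detach with Ctrl-C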
To see which called routines are taking the most CPU, you need to run a profiling tool and make sure the executable of the process you are profiling was compiled correctly (sometimes it needs to be instrumented during compilation for profiling; sometimes it just needs to be compiled with debug symbols, or not stripped of them after compilation).
You might want to look at oprofile, valgrind, and gprof for starters on free tools - there are also commercial products available.
Here are a few links:
http://www.pixelbeat.org/programming/profiling/
http://en.wikipedia.org/wiki/List_of_performance_analysis_tools
You are mixing a whole bunch of things.
If you are talking about MySQL do:
show processlist;
For info specifically about Linux processes, you can strace the process to get a list of the system calls it makes. Unless you are experienced with Linux, this will be of little use to you.
If the process is paused then you can find out what function it is stopped on, but that's probably not what you want, since you say the process is running.
There are also various tools that can give you info on what parts of the disk the process is reading, and how much memory it's allocating.
And finally you can use gdb to break into the process and single-step your way through it to see exactly what it's doing. This will also likely be useless to you, since an SQL server does a LOT of things - far too many to understand by this method.

Daemon with Clojure/JVM

I'd like to have a small (not doing too damn much) daemon running on a little server, watching a directory for new files being added to it (and any directories in the main one), and calling another Clojure program to deal with that new file.
Ideally, each file would be added to a queue (a list represented by a ref in Clojure?) and the main process would take care of those files in the queue on a FIFO basis.
My question is: is having a JVM up all the time running this little program too much of a resource hog? And do you have any suggestions as to how to go about doing this?
Thank you very much!
EDIT: Another question I should ask: should I run this as its own instance (using less memory) and have it launch a new JVM when a file is seen, or run it on the same JVM as the Clojure code that will process the file?
As long as it is running fine now and has no memory leaks, it should be fine.
From the daemon terminology I gather it is on a Unix clone, and in that case it is best to start it from an init script or from the rc.local script. Unfortunately, the details differ from OS to OS, so it is hard to be more specific.
Limit the memory using -Xmx64m or something similar to make sure it fails before taking down the rest of the services. Play a bit with the number to find the lowest reliable size.
Also, since Clojure's claim to fame is its ability to deal with concurrency, it makes a lot of sense to run only one JVM, with all the functionality running on it in multiple threads. The overhead of spawning new processes is already very big, and if it is a JVM which needs to JIT-compile and warm up its memory management, doubly so. On a resource-constrained machine this could pose a problem, and on a resource-rich machine it is a waste.
I have always found that the JVM is not made to quickly run something script-like and exit again. It is really not made for that use case, in my opinion.
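For what it's worth, the overall shape the question describes (one long-lived process, a watcher that enqueues newly seen files, and a single worker that handles them FIFO) looks roughly like the sketch below. It is written in Python purely to illustrate the structure; the watched directory, the polling interval, and the handle_file function are placeholders, and a real Clojure version would use threads or agents in the same way.

    # Sketch of the pattern described above: one long-lived process, a
    # polling watcher thread that enqueues newly seen files, and a single
    # worker that processes them FIFO. WATCH_DIR, the 2-second poll
    # interval, and handle_file are placeholders.
    import os
    import queue
    import threading
    import time

    WATCH_DIR = "/srv/incoming"
    work = queue.Queue()             # FIFO queue of file paths

    def handle_file(path):
        print("processing", path)    # stand-in for the real per-file work

    def watcher():
        seen = set()
        while True:
            for root, _dirs, files in os.walk(WATCH_DIR):   # include subdirectories
                for name in files:
                    path = os.path.join(root, name)
                    if path not in seen:
                        seen.add(path)
                        work.put(path)
            time.sleep(2)

    def worker():
        while True:
            handle_file(work.get())   # blocks until a file is available
            work.task_done()

    threading.Thread(target=watcher, daemon=True).start()
    worker()                          # run the worker in the main thread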

How would I go about taking a snapshot of a process to preserve its state for future investigation? Is this possible?

Whether this is possible I don't know, but it would be mighty useful!
I have a process that fails periodically (running on Windows 2000). I then have just one chance to react to it before having to restart it and painfully wait for it to fail again. I didn't write the process, so I don't have the source to debug. The failure is seemingly random.
With a snapshot of the process I could repeatedly and quickly test reactions to the failure.
I had thought of running inside a VM but this isn't possible in this instance.
EDIT:
@Jon Cage asked:
When you say a snapshot, you mean capturing a process when it's about to fail (including memory, program state, etc.) ...and then replaying its final few seconds repeatedly to see what effect it has on some other component?
This is exactly what I mean!
I think minidump is what you are looking for.
You can also use Userdump:
The User Mode Process Dumper (userdump) dumps any running Win32 process's memory image (including system processes such as csrss.exe, winlogon.exe, services.exe, etc.) on the fly, without attaching a debugger or terminating target processes. The generated dump file can be analyzed or debugged by using the standard debugging tools.
This article shows you how to use it.
My best bet is to start the process in a debugger (OllyDbg being my preferred tool).
The process will pause on an exception, and you can try to figure out what happened shortly before that.
This needs some understanding of assembler and does not let you create a snapshot of the process for later analysis. You would need to write your own debugger for that - it should be theoretically possible.