How to use all CPU cores for Scrapy


My Scrapy program only uses one CPU core no matter how high I set CONCURRENT_REQUESTS. Is there any way to make a single Scrapy crawler use all CPU cores?
PS: There seems to have been a max_proc argument in earlier versions, but I cannot find it now.

Scrapy does not use multiple CPUs.
This is by design. Usually the bottleneck of Scrapy is not the CPU, but the network input/output. So, even using a single CPU, Scrapy can be more efficient than a synchronous framework or library (e.g. requests) used in combination with multiprocessing.
If CPU is a bottleneck in your case, you should consider having a separate, multiprocessing-enabled process handle the CPU-heavy parts.
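As a rough sketch of that idea, assuming the CPU-heavy step can be written as a plain function: `heavy_parse` and `parse_pages` below are hypothetical names (not Scrapy APIs), and the hashing loop is only a stand-in for real work. The pool is passed in as an argument so the crawler side does not need to know how the work is parallelised.

```python
from concurrent.futures import Executor, ProcessPoolExecutor
import hashlib


def heavy_parse(body: str) -> str:
    """Stand-in for CPU-heavy work on a downloaded page (hypothetical)."""
    digest = body.encode("utf-8")
    for _ in range(10_000):
        digest = hashlib.sha256(digest).digest()
    return digest.hex()


def parse_pages(bodies: list[str], executor: Executor) -> list[str]:
    """Run heavy_parse on every body via the given executor, keeping input order."""
    return list(executor.map(heavy_parse, bodies))


if __name__ == "__main__":
    pages = [f"<html>page {i}</html>" for i in range(8)]
    # ProcessPoolExecutor starts one worker process per CPU core by default.
    with ProcessPoolExecutor() as pool:
        for page, result in zip(pages, parse_pages(pages, pool)):
            print(page, result[:12])
```

In a real project the page bodies would come from the crawler (for example through an item pipeline or a queue), which is not shown here.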
If you still want to run Scrapy spiders in multiple processes, see Running Scrapy from a script. You can combine that with Python’s multiprocessing module. Or, better yet, use Scrapyd or one of the alternatives.
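Combining "Running Scrapy from a script" with multiprocessing could look roughly like the sketch below. It assumes Scrapy is installed and that the URLs can be split into independent groups; `split_urls`, `crawl` and `TitleSpider` are hypothetical names made up for illustration, and the URLs are taken from the command line.

```python
import multiprocessing
import sys


def split_urls(urls, n_parts):
    """Deal URLs round-robin into up to n_parts lists, dropping empty ones."""
    parts = [urls[i::n_parts] for i in range(n_parts)]
    return [part for part in parts if part]


def crawl(urls):
    """Run one independent Scrapy crawl in the current process."""
    # Imported lazily: Scrapy (and its Twisted reactor) is only loaded
    # inside the worker process that actually crawls.
    import scrapy
    from scrapy.crawler import CrawlerProcess

    class TitleSpider(scrapy.Spider):  # hypothetical example spider
        name = "titles"

        def parse(self, response):
            yield {"url": response.url, "title": response.css("title::text").get()}

    process = CrawlerProcess(settings={"LOG_LEVEL": "WARNING"})
    process.crawl(TitleSpider, start_urls=urls)
    process.start()  # blocks until this crawl finishes


if __name__ == "__main__":
    # Usage (assumed file name): python crawl_parallel.py URL1 URL2 ...
    parts = split_urls(sys.argv[1:], multiprocessing.cpu_count())
    workers = [multiprocessing.Process(target=crawl, args=(part,)) for part in parts]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
```

Each process is a separate crawler with its own scheduler and duplicate filter, so this only makes sense when the URL groups do not overlap.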

Related

One instance with multiple GPUs or multiple instances with one GPU

I am running multiple models using GPUs and all jobs combined can be run on 4 GPUs, for example. Multiple jobs can be run on the same GPU since the GPU memory can handle it.
Is it a better idea to spin up a powerful instance with all 4 GPUs as part of it and run all the jobs on one instance? Or should I go the route of having multiple instances with 1 GPU on each?
There are a few factors I'm thinking of:
Latency of reading files. Having a local disk on one machine should be faster latency-wise, but it would be quite a few reads from one source. Would this cause any issues?
I would need quite a few vCPUs and a lot of memory to scale the IOPS, since GCP apparently scales IOPS that way. What is the best way to approach this? If anyone has any more information on this, I would appreciate pointers.
If in the future I need to downgrade to save costs/downgrade performance, I could simply stop the instance and change my specs.
Having everything on one machine would be easier to work with. I know in production I would want a more distributed approach, but this is strictly experimentation.
Those are my main thoughts. Am I missing something? Thanks for all of the help.

Ended up going with one machine with multiple GPUs. Just assigned the jobs to the different GPUs to make the memory work.
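The post does not say how the jobs were assigned. One common way, shown here only as a sketch, is to start each job as its own process with the `CUDA_VISIBLE_DEVICES` environment variable set so it sees a single GPU. `assign_gpus` and `launch` are hypothetical helpers, and the four GPU ids are an assumption taken from the question.

```python
import os
import subprocess
import sys


def assign_gpus(jobs, gpu_ids):
    """Pair each job with a GPU id, cycling through the GPUs round-robin."""
    return [(job, gpu_ids[i % len(gpu_ids)]) for i, job in enumerate(jobs)]


def launch(command, gpu_id):
    """Start a job that can only see the given GPU."""
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu_id))
    return subprocess.Popen(command, env=env)


if __name__ == "__main__":
    # Usage (assumed file name): python launch_jobs.py train_a.py train_b.py ...
    jobs = [[sys.executable, script] for script in sys.argv[1:]]
    procs = [launch(cmd, gpu) for cmd, gpu in assign_gpus(jobs, [0, 1, 2, 3])]
    for proc in procs:
        proc.wait()
```

Round-robin ignores how much memory each job needs; if the jobs differ a lot in size, the pairing would have to be chosen by hand.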

I suggest you take a look here if you want to run multiple tasks on the same GPU.
Basically, when running several tasks (different processes or containers) on the same GPU, it won't be efficient due to some kind of context switching.
You'll need the latest NVIDIA hardware to test it.

Does DASK LSFCluster support GPU?

Can I specify number of GPUs by using DASK LSFCluster? I know we can specify number of cores, which means CPU only. We would like to request GPUs from the LSF scheduler. Is it possible?

Dask-Jobqueue allows you to add extra lines to your job script with keywords like env_extra=. These might be enough for you to add custom resource requests to LSF.
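A possible sketch, with several assumptions: it uses `job_extra_directives` (called `job_extra` in older Dask-Jobqueue releases), which adds `#BSUB` lines to the job header, rather than `env_extra`; it assumes your LSF version accepts the `-gpu "num=N"` option; and the `cores` and `memory` values are placeholders. `gpu_directives` and `make_cluster` are hypothetical helper names.

```python
def gpu_directives(num_gpus):
    """Build extra #BSUB directives requesting GPUs (LSF '-gpu' syntax assumed)."""
    return [f'-gpu "num={num_gpus}"']


def make_cluster(num_gpus=1):
    """Create an LSFCluster whose jobs request num_gpus GPUs each."""
    # Imported lazily so the helper above works without dask-jobqueue installed.
    from dask_jobqueue import LSFCluster

    return LSFCluster(
        cores=4,          # placeholder value
        memory="16GB",    # placeholder value
        job_extra_directives=gpu_directives(num_gpus),
    )
```

Printing `make_cluster(2).job_script()` before scaling shows the generated header, which is an easy way to check that the GPU line ended up where LSF expects it.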

Python's support for multi-threading

I heard that Python still has this global interpreter lock (GIL) issue. As a result, threads in Python do not actually execute in parallel.
What are the possible solutions to overcome this problem?
I am using Python 2.7.3.

To understand Python's GIL, I would recommend this link: http://www.dabeaz.com/python/UnderstandingGIL.pdf
From the Python wiki:
The GIL is controversial because it prevents multithreaded CPython programs from taking full advantage of multiprocessor systems in certain situations. Note that potentially blocking or long-running operations, such as I/O, image processing, and NumPy number crunching, happen outside the GIL. Therefore it is only in multithreaded programs that spend a lot of time inside the GIL, interpreting CPython bytecode, that the GIL becomes a bottleneck.
There are discussions on eliminating the GIL, but I guess it's not achieved yet. If you really want to achieve multi-threading for your custom code, you can also switch to Java.
Do see if that helps.
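To illustrate the wiki's point that blocking operations happen outside the GIL, here is a small sketch. It is written for Python 3 (the question uses 2.7.3, so treat that as an assumption), and `time.sleep` stands in for a real blocking call such as a socket read. `blocking_io` and `run_in_threads` are made-up names.

```python
import threading
import time


def blocking_io(seconds):
    """Stand-in for a blocking call (socket read, disk read); sleeping releases the GIL."""
    time.sleep(seconds)


def run_in_threads(n_threads, seconds):
    """Run blocking_io in n_threads threads and return the elapsed wall-clock time."""
    threads = [
        threading.Thread(target=blocking_io, args=(seconds,))
        for _ in range(n_threads)
    ]
    start = time.perf_counter()
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return time.perf_counter() - start


if __name__ == "__main__":
    elapsed = run_in_threads(4, 0.5)
    print(f"4 threads x 0.5 s of blocking work took {elapsed:.2f} s")
```

The waits overlap, so the total is close to one wait rather than four. A loop of pure Python bytecode in place of the sleep would not overlap like this; for that kind of work the multiprocessing module is the usual workaround.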

How can I stream data directly into tensorflow as opposed to reading files on disc?

Every tensorflow tutorial I've been able to find so far works by first loading the training/validation/test images into memory and then processing them. Does anyone have a guide or recommendations for streaming images and labels as input into tensorflow? I have a lot of images stored on a different server and I would like to stream those images into tensorflow as opposed to saving the images directly on my machine.
Thank you!

TensorFlow does have queues, which support streaming so you don't have to load the full data in memory. But yes, by default they only support reading from files on the same server. The real problem is that you want to load data into memory from some other server. I can think of the following ways to do this:
Expose your images using a REST service. Write your own queueing mechanism in Python, read this data (using urllib or something), and feed it to TensorFlow placeholders.
Instead of using Python queues (as above) you can use TensorFlow queues as well (see this answer), although it's slightly more complicated. The advantage is that TensorFlow queues can use multiple cores, giving you better performance compared to normal Python multi-threaded queues.
Use a network mount to fool your OS into believing the data is on the same machine.
Also, remember when using this sort of distributed setup, you will always incur network overhead (time taken for images to be transferred from Server 1 to 2), which can slow your training by a lot. To counteract this, you'll have to build a multi-threaded queueing mechanism with fetch-execute overlap, which is a lot of effort. An easier option IMO is to just copy the data into your training machine.
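A minimal sketch of the fetch-execute overlap for the first option, using only the standard library. `prefetch` and `fetch_image` are hypothetical names, the image URLs are assumed to be plain HTTP endpoints, and a single background thread is used to keep the order of the images; feeding the result into TensorFlow is left out.

```python
import queue
import threading
import urllib.request


def prefetch(items, fetch, buffer_size=8):
    """Yield fetch(item) for each item while a background thread fetches ahead."""
    buf = queue.Queue(maxsize=buffer_size)
    done = object()  # sentinel marking the end of the stream

    def worker():
        try:
            for item in items:
                buf.put(fetch(item))  # blocks when the buffer is full
        except Exception as exc:
            buf.put(exc)  # hand the error over to the consumer
        finally:
            buf.put(done)

    threading.Thread(target=worker, daemon=True).start()
    while True:
        result = buf.get()
        if result is done:
            return
        if isinstance(result, Exception):
            raise result
        yield result


def fetch_image(url):
    """Download raw image bytes over HTTP (the URL layout is an assumption)."""
    with urllib.request.urlopen(url) as response:
        return response.read()
```

A training loop would then iterate over `prefetch(image_urls, fetch_image)`, decode each result and feed it to the placeholder, so the next downloads run while the current step is computing.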

You can use the socket module in Python to transfer a batch of images and labels from your server to your host. Your graph needs to be defined to take a placeholder as input. The placeholder must be compatible with your batch size.
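One way the transfer could be framed, as a sketch only: each batch is sent as a 4-byte length followed by the raw bytes, so the receiver knows where a batch ends. The framing and the helper names `send_batch`, `recv_exact` and `recv_batch` are assumptions, not something the answer specifies.

```python
import socket
import struct


def send_batch(sock, payload: bytes):
    """Send one length-prefixed message (4-byte big-endian length, then data)."""
    sock.sendall(struct.pack("!I", len(payload)) + payload)


def recv_exact(sock, n):
    """Read exactly n bytes, or raise if the peer closes early."""
    chunks = []
    while n > 0:
        chunk = sock.recv(n)
        if not chunk:
            raise ConnectionError("socket closed before the full message arrived")
        chunks.append(chunk)
        n -= len(chunk)
    return b"".join(chunks)


def recv_batch(sock):
    """Receive one message written by send_batch."""
    (length,) = struct.unpack("!I", recv_exact(sock, 4))
    return recv_exact(sock, length)
```

The payload could be, for example, the raw bytes of an image array plus its labels; both sides then have to agree on shape and dtype so the receiver can rebuild the batch before feeding the placeholder.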

How to avoid Boost ASIO reactor becoming constrained to a single core?

TL;DR: Is it possible that I am reactor throughput limited? How would I tell? How expensive and scalable (across threads) is the implementation of the io_service?
I have a fairly massively parallel application, running on a hyperthreaded-dual-quad-core-Xeon machine with tons of RAM and a fast SSD RAID. This is developed using boost::asio.
This application accepts connections from about 1,000 other machines, reads data, decodes a simple protocol, and shuffles data into files mapped using mmap(). The application also pre-fetches "future" mmap pages using madvise(WILLNEED) so it's unlikely to be blocking on page faults, but just to be sure, I've tried spawning up to 300 threads.
This is running on Linux kernel 2.6.32-27-generic (Ubuntu Server x64 LTS 10.04). Gcc version is 4.4.3 and boost::asio version is 1.40 (both are stock Ubuntu LTS).
Running vmstat, iostat and top, I see that disk throughput (both in TPS and data volume) is on the single digits of %. Similarly, the disk queue length is always a lot smaller than the number of threads, so I don't think I'm I/O bound. Also, the RSS climbs but then stabilizes at a few gigs (as expected) and vmstat shows no paging, so I imagine I'm not memory bound. CPU is constant at 0-1% user, 6-7% system and the rest as idle. Clue! One full "core" (remember hyper-threading) is 6.25% of the CPU.
I know the system is falling behind, because the client machines block on TCP send when more than 64kB is outstanding, and report the fact; they all keep reporting this fact, and throughput to the system is much less than desired, intended, and theoretically possible.
My guess is I'm contending on a lock of some sort. I use an application-level lock to guard a look-up table that may be mutated, so I sharded this into 256 top-level locks/tables to break that dependency. However, that didn't seem to help at all.
All threads go through one global io_service instance. Running strace on the application shows that it spends most of its time dealing with futex calls, which I imagine have to do with the event-based implementation of the io_service reactor.
Is it possible that I am reactor throughput limited? How would I tell? How expensive and scalable (across threads) is the implementation of the io_service?
EDIT: I didn't initially find this other thread because it used a set of tags that didn't overlap mine :-/ It is quite possible my problem is excessive locking used in the implementation of the boost::asio reactor. See C++ Socket Server - Unable to saturate CPU
However, the question remains: How can I prove this? And how can I fix it?

The answer is indeed that even the latest boost::asio only calls into the epoll file descriptor from a single thread, not entering the kernel from more than one thread at a time. I can kind-of understand why, because thread safety and lifetime of objects is extremely precarious when you use multiple threads that each can get notifications for the same file descriptor. When I code this up myself (using pthreads), it works, and scales beyond a single core. Not using boost::asio at that point -- it's a shame that an otherwise well designed and portable library should have this limitation.

I believe that if you use multiple io_service objects (say, one for each CPU core), each run by a single thread, you will not have this problem. See the HTTP server example 2 on the Boost ASIO page.
I have done various benchmarks against the server example 2 and server example 3 and have found that the implementation I mentioned works the best.

In my single-threaded application, I found out from profiling that a large portion of the processor instructions was spent on locking and unlocking by the io_service::poll(). I disabled the lock operations with the BOOST_ASIO_DISABLE_THREADS macro. It may make sense for you, too, depending on your threading situation.
