When the Scrapy is running, If speed is slow, I want to increase CONCURRENT_REQUESTS, if speed is fast, I want to reduce CONCURRENT_REQUESTS.
""I use self.custom_settings['CONCURRENT_REQUESTS'] = xxx, it's not working after running.""
You can acheve this with scrapy AutoThrottle extension
https://docs.scrapy.org/en/2.5/topics/autothrottle.html
Updating settings on runtime is not possible on most of cases
Scrapy/Git/Isse: Update spider settings during runtime #4196
Related
I'm using Celery with a Redis broker to do some "heavy" processing for my Django app. Everything is running locally in Docker containers on WSL2.
The tasks output a JSON which is roughly 2.5 Mb large and it takes up to 9 seconds to retrieve the result via get() in the Django app. For smaller payloads, the time goes down
I tried increasing the RAM and the CPU for WSL2 up to 6 CPUs and 8Gb RAM. Celery was configured with --max-memory-per-child=1024000 --concurrency=4
I've tried using different result_backend configuration with similar results:
Redis
RPC
SQLite with SQLAlchemy
I tried setting an interval when using SQLite (doesn't matter for RPC & Redis) with a 0.5sec improvement get(interval=0.01)
I also tried changing the result_serializer from JSON to pickle for poorer performance. But I don't think the serializer is the culprit here as serializing / deserializing the same JSON is pretty fast in console
>>> timeit.timeit(lambda: pickle.dumps(big_dict,0), number=10)
0.567067899999528
>>> timeit.timeit(lambda: pickle.loads(str), number=10)
0.3542163999991317
I tried using compression, only zlib seemed to provide a small gain.
I'm not too familiar with this setup but IMHO I should be able to retrieve results faster. The best I could achieve was 6sec. Any idea how to improve this or how to explain it ?
settings.py
CELERY_BROKER_URL = "redis://{host}:{port}/{db}".format(
host=os.environ.get('REDIS_HOST'),
port=os.environ.get('REDIS_PORT'),
db=os.environ.get('CELERY_REDIS_DB')
)
CELERY_RESULT_BACKEND = "redis://{host}:{port}/{db}".format(
host=os.environ.get('REDIS_HOST'),
port=os.environ.get('REDIS_PORT'),
db=os.environ.get('CELERY_REDIS_DB')
)
# CELERY_RESULT_BACKEND = 'db+sqlite:///celery.sqlite' # SQL Example (need SQLAlchemy==1.4.29 in requirements.txt)
# CELERY_RESULT_BACKEND = 'rpc://localhost' # RPC Example
CELERY_ACCEPT_CONTENT = ['json']
CELERY_TASK_SERIALIZER = 'json'
CELERY_RESULT_SERIALIZER = 'json'
Thanks
In general, Redis has a reputation for being bad at dealing with large objects and is not generally intended to be a large object store. You're better off using a general purpose RDBMS or a file store and returning a key to where the JSON can be retrieved.
I am trying to run some parallel jobs through Python multiprocessing. Here is an example code:
import multiprocessing as mp
import os
def f(name, total):
print('process {:d} starting doing business in {:d}'.format(name, total))
#there will be some unix command to run external program
if __name__ == '__main__':
total_task_num = 100
mp.Queue()
all_processes = []
for i in range(total_task_num):
p = mp.Process(target=f, args=(i,total_task_num))
all_processes.append(p)
p.start()
for p in all_processes:
p.join()
I also set export OMP_NUM_THREADS=1 to make sure that only one thread for one process.
Now I have 20 cores in my desktop. For 100 parallel jobs, I want to let it run 5 cycles so that each core run one job (20*5=100).
I tried to do the same code in CentOS and ubuntu. It seems that CentOS will automatically do a job splitting. In other words, there will be only 20 parallel running jobs at the same time. However, ubuntu will start 100 jobs simultaneously. As such, each core will be occupied by 5 jobs. This will significantly increase the total run time due to high work load.
I wonder if there is an elegant solution to teach ubuntu to run only 1 job per core.
To enable a process run on a specific CPU, you use the command taskset in linux. Accordingly you can arrive on a logic based on "taskset -p [mask] [pid]" that assigns each process to a specific core in a loop.
Also , python helps in incorporation of affinity control via sched_setaffinity that can be checked for confining a process to specific cores. Accordingly , you can arrive on a logic for usage of "os.sched_setaffinity(pid, mask)" where pid is the process id of the process whose mask represents the group of CPUs to which the process shall be confined to.
In python, there are also other tools like https://pypi.org/project/affinity/ that can be explored for usage.
I keep getting this when I run my scrapy spider raise TimeoutError("Getting %s took longer than %s seconds." % (url, timeout))
twisted.internet.error.TimeoutError: User timeout caused connection failure: Getting https://www.exampletest.com/test took longer than 190 seconds..
I have set the following settings but didn't help
'AUTOTHROTTLE_ENABLED':False,
'DOWNLOAD_TIMEOUT':20,
'RETRY_ENABLED': False,
How can I control if the website doesn't respond in under 30 sec to just pass or ignore it.
190 is a weird default, so I’ll go ahead and assume that you are using scrapy-crawlera.
If that is the case, know that scrapy-crawlera ignores DOWNLOAD_DELAY because Crawlera requires higher timeout values, as requests through Crawlera can take much longer.
If you want to decrease the timeout value nonetheless, change CRAWLERA_DOWNLOAD_TIMEOUT instead.
I know that a constant delay can be set in
settings.py
DOWNLOAD_DELAY = 2
however, if I set the delay to 2s it is not efficient enough. If I set the DOWNLOAD_DELAY = 0.
The crawler is able to crawl about 10 pages. after that, the target page will return something like " you are requesting too frequently ".
What I want to do is the keep the download_delay to 0. once the "requesting too frequently" msg is found in the html. it change the delay to 2s. After a while it switch back to zero.
is there any module can do this? or any other better idea to handle such case?
Update:
I found that is a extension call AutoThrottle
but is it able to customize some logic like this??
if (requesting too frequently) is found
increase the DOWNLOAD_DELAY
If right after you get anti-spider page, then in 2 seconds you can get data page, then what you are asking probably requires writing a downloader middleware
that checks for anti-spider page, reset all scheduled requests to a renew-queue, start a looping call when spider is idle to get request from the renew-queue, (the looping interval is your hack for a new download delay), and try to decide when the download delay is not necessary again (requires some tests), then stop the looping and reschedule all the requests in renew-queue to scrapy scheduler. You will need to use redis queue in case of distributed crawl.
With download delay set to 0, in my experience throughput can go easily above 1000 items/min. If anti-spider page pops up after 10 responses, then it is not worth the effort.
Instead maybe you can try to find out how fast does your target server allow, may be 1.5s, 1s, 0.7s, 0.5s etc. Then maybe redesign your product takes into consideration the throughput your crawler can achieve.
You can use Auto Throttle extension now. It is turned off by default. You can add these parameters in your project's settings.py file to enable it.
AUTOTHROTTLE_ENABLED = True
# The initial download delay
AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
AUTOTHROTTLE_MAX_DELAY = 300
# The average number of requests Scrapy should be sending in parallel to
# each remote server
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
AUTOTHROTTLE_DEBUG = True
Yes, You can use the time module to set the dynamic delay.
import time
for i in range(10):
*** Operations 1****
time.sleep( i )
*** Operations 2****
Now you can see the delay between Operations 1 and Operations 2.
Note:
the variable 'i' is in the form of seconds.
How does one use the flag options for benchmarks with the gocheck testing framework? In the link that I provided it seems to be that the only example they provide is by running go test -check.b, however, they do not provide additional comments on how it works so its hard to use it. I could not even find the -check in the go documentation when I did go help test nor when I did go help testflag. In particular I want to know how to use the benchmark testing framework better and control how long it runs for or for how many iterations it runs for etc etc. For example in the example they provide:
func (s *MySuite) BenchmarkLogic(c *C) {
for i := 0; i < c.N; i++ {
// Logic to benchmark
}
}
There is the variable c.N. How does one specify that variable? Is it through the actual program itself or is it through go test and its flags or the command line?
On the side note, the documentation from go help testflag did talk about -bench regex, benchmem and benchtime t options, however, it does not talk about the -check.b option. However I did try to run these options as described there but it didn't really do anything I could notice. Does gocheck work with the original options for go test?
The main problem I see is that there is no clear documentation for how to use the gocheck tool or its commands. I accidentally gave it a wrong flag and it threw me a error message suggesting useful commands that I need (which limited description):
-check.b=false: Run benchmarks
-check.btime=1s: approximate run time for each benchmark
-check.f="": Regular expression selecting which tests and/or suites to run
-check.list=false: List the names of all tests that will be run
-check.v=false: Verbose mode
-check.vv=false: Super verbose mode (disables output caching)
-check.work=false: Display and do not remove the test working directory
-gocheck.b=false: Run benchmarks
-gocheck.btime=1s: approximate run time for each benchmark
-gocheck.f="": Regular expression selecting which tests and/or suites to run
-gocheck.list=false: List the names of all tests that will be run
-gocheck.v=false: Verbose mode
-gocheck.vv=false: Super verbose mode (disables output caching)
-gocheck.work=false: Display and do not remove the test working directory
-test.bench="": regular expression to select benchmarks to run
-test.benchmem=false: print memory allocations for benchmarks
-test.benchtime=1s: approximate run time for each benchmark
-test.blockprofile="": write a goroutine blocking profile to the named file after execution
-test.blockprofilerate=1: if >= 0, calls runtime.SetBlockProfileRate()
-test.coverprofile="": write a coverage profile to the named file after execution
-test.cpu="": comma-separated list of number of CPUs to use for each test
-test.cpuprofile="": write a cpu profile to the named file during execution
-test.memprofile="": write a memory profile to the named file after execution
-test.memprofilerate=0: if >=0, sets runtime.MemProfileRate
-test.outputdir="": directory in which to write profiles
-test.parallel=1: maximum test parallelism
-test.run="": regular expression to select tests and examples to run
-test.short=false: run smaller test suite to save time
-test.timeout=0: if positive, sets an aggregate time limit for all tests
-test.v=false: verbose: print additional output
is writing wrong commands the only way to get some help with this tool? it doesn't have a help flag or something?
I'm 5 years late, but to specify how many N times to run. Use the option -benchtime Nx.
Example:
go test -bench=. -benchtime 100x
BenchmarkTest 100 ... ns/op
Please read more about all go testing flags here.
see the Description_of_testing_flags:
-bench regexp
Run benchmarks matching the regular expression.
By default, no benchmarks run. To run all benchmarks,
use '-bench .' or '-bench=.'.
-check.b works the same way as -test.bench.
E.g. to run all benchmarks:
go test -check.b=.
to run a specific benchmark:
go test -check.b=BenchmarkLogic
more information about testing in Go can be found here