How to minimize server load when parsing using scrapy?/How to ignore <body> and parse info only from <head> - scrapy

I collect statistics and all the information I need is in the <head> (script tag) of site.
It have massive <body>(about 5-10 kb per page) so can I dont parse it for less server load?
I would be glad if you recommend alternative optimizations to reduce the server load
settings.py
CONCURRENT_REQUESTS = 32 DOWNLOAD_DELAY = 0.33 now speed 180/per min(sometimes 200)

Scrapy operates with entire response body only.
This behaviour coded in scrapy core.
CONCURRENCY_REQUEST = 32
Scrapy don't have CONCURRENCY_REQUEST setting. Did you mean CONCURRENT_REQUESTS ?
DOWNLOAD_DELAY = 0.33 now speed 180/per min(sometimes 200)
If you didn't specify RANDOMIZE_DOWNLOAD_DELAY as False (default valueTrue).
download delay will be a random number between 0.5x to 1.5x of DOWNLOAD_DELAY setting.

Related

Fetching results from Celery backend is abnormally slow

I'm using Celery with a Redis broker to do some "heavy" processing for my Django app. Everything is running locally in Docker containers on WSL2.
The tasks output a JSON which is roughly 2.5 Mb large and it takes up to 9 seconds to retrieve the result via get() in the Django app. For smaller payloads, the time goes down
I tried increasing the RAM and the CPU for WSL2 up to 6 CPUs and 8Gb RAM. Celery was configured with --max-memory-per-child=1024000 --concurrency=4
I've tried using different result_backend configuration with similar results:
Redis
RPC
SQLite with SQLAlchemy
I tried setting an interval when using SQLite (doesn't matter for RPC & Redis) with a 0.5sec improvement get(interval=0.01)
I also tried changing the result_serializer from JSON to pickle for poorer performance. But I don't think the serializer is the culprit here as serializing / deserializing the same JSON is pretty fast in console
>>> timeit.timeit(lambda: pickle.dumps(big_dict,0), number=10)
0.567067899999528
>>> timeit.timeit(lambda: pickle.loads(str), number=10)
0.3542163999991317
I tried using compression, only zlib seemed to provide a small gain.
I'm not too familiar with this setup but IMHO I should be able to retrieve results faster. The best I could achieve was 6sec. Any idea how to improve this or how to explain it ?
settings.py
CELERY_BROKER_URL = "redis://{host}:{port}/{db}".format(
host=os.environ.get('REDIS_HOST'),
port=os.environ.get('REDIS_PORT'),
db=os.environ.get('CELERY_REDIS_DB')
)
CELERY_RESULT_BACKEND = "redis://{host}:{port}/{db}".format(
host=os.environ.get('REDIS_HOST'),
port=os.environ.get('REDIS_PORT'),
db=os.environ.get('CELERY_REDIS_DB')
)
# CELERY_RESULT_BACKEND = 'db+sqlite:///celery.sqlite' # SQL Example (need SQLAlchemy==1.4.29 in requirements.txt)
# CELERY_RESULT_BACKEND = 'rpc://localhost' # RPC Example
CELERY_ACCEPT_CONTENT = ['json']
CELERY_TASK_SERIALIZER = 'json'
CELERY_RESULT_SERIALIZER = 'json'
Thanks
In general, Redis has a reputation for being bad at dealing with large objects and is not generally intended to be a large object store. You're better off using a general purpose RDBMS or a file store and returning a key to where the JSON can be retrieved.

Scrapy , how to change CONCURRENT_REQUESTS when is running?

When the Scrapy is running, If speed is slow, I want to increase CONCURRENT_REQUESTS, if speed is fast, I want to reduce CONCURRENT_REQUESTS.
""I use self.custom_settings['CONCURRENT_REQUESTS'] = xxx, it's not working after running.""
You can acheve this with scrapy AutoThrottle extension
https://docs.scrapy.org/en/2.5/topics/autothrottle.html
Updating settings on runtime is not possible on most of cases
Scrapy/Git/Isse: Update spider settings during runtime #4196

scrapy timeout not controlling twisted timeout

I keep getting this when I run my scrapy spider raise TimeoutError("Getting %s took longer than %s seconds." % (url, timeout))
twisted.internet.error.TimeoutError: User timeout caused connection failure: Getting https://www.exampletest.com/test took longer than 190 seconds..
I have set the following settings but didn't help
'AUTOTHROTTLE_ENABLED':False,
'DOWNLOAD_TIMEOUT':20,
'RETRY_ENABLED': False,
How can I control if the website doesn't respond in under 30 sec to just pass or ignore it.
190 is a weird default, so I’ll go ahead and assume that you are using scrapy-crawlera.
If that is the case, know that scrapy-crawlera ignores DOWNLOAD_DELAY because Crawlera requires higher timeout values, as requests through Crawlera can take much longer.
If you want to decrease the timeout value nonetheless, change CRAWLERA_DOWNLOAD_TIMEOUT instead.

jmeter and apachetop - why I see different values?

Probably explanation is simple - but I couldn't find answer to my question:
I am running jmeter test from one VM (worker) to another (target). On worker I have jmeter with 100 threads (100 users). On target I have API that runs on Apache. When I run "apachetop -f access_log" on target, I see only about 7 req/s.
Can someone explain me, why I don't see 100 req/s on target?
In test result in jmeter I see always 200 OK, so all request are hitting the target, and moreover target always responds. So I am not dropping any requests here. Network bandwidth between machines is 1G. What I am missing here?
Thanks,
Daddy
100 users doesn't necessarily mean 100 requests per second, even more, it is highly unlikely.
According to JMeter glossary:
Elapsed time. JMeter measures the elapsed time from just before sending the request to just after the last response has been received. JMeter does not include the time needed to render the response, nor does JMeter process any client code, for example Javascript.
Roughly, if JMeter is able to get response from server in 1 second - you will get 100 requests/second. If response time will be 2 seconds - throughput will be 50 requests/second, etc, response time 4 seconds - 25 requests/second, etc.
Also JMeter configuration matters. If you don't provide enough loops you may run into a situation where some threads already finished and some are not even started. See JMeter Test Results: Why the Actual Users Number is Lower than Expected article for more detailed explanation.
Your target load = 100 threads ( you are assuming it should generate 100 req/sec as per your plan)
Your actual load = 7 req / sec = 7*3600 / hour = 25200
Per thread throughput = 25200 / 100 threads = 252 iterations / thread / hour
Per transaction time = 3600 / 252 = 14.2 secs
This means, JMeter should be actually sending each request every 14 secs per thread. i.e., 100 requests for every 14.2 secs.
Now, analyze your JMeter summary report for the transaction timers to find out where the remaining 13.2 secs are being spent.
Possible issues are
1. High DNS resolution time (DNS issue)
2. High connection setup time (indicates load balancer issues)
3. High Request send time (indicates n/w or firewall throttling issues)
4. High request receive time (same as #3)
Now, the time that you see in Apache logs are mostly visible to JMeter as time to first byte time. I am not sure about the machine that you are running your testing. If your worker can support curl, use Curl to find the components for a single request.
echo 'request payload for POST'
| curl -X POST -H 'User-Agent: myBrowser' -H 'Content-Type: application/json' -d #- -s -w '\nDNS time:\t%{time_namelookup}\nTCP Connect time:\t%{time_connect}\nAppCon Protocol time:\t%{time_appconnect}\nRedirect time:\t%{time_redirect}\nPreXfer time:\t%{time_pretransfer}\nStartXfer time:\t%{time_starttransfer}\n\nTotal time:\t%{time_total}\n' http://mytest.test.com
If the above output indicates no such issues then the time must have been spent within JMeter. You should tune your JMeter implementation by using various options like beanshell / JSR223 etc.

Is it possible to set dynamic download delay in scrapy?

I know that a constant delay can be set in
settings.py
DOWNLOAD_DELAY = 2
however, if I set the delay to 2s it is not efficient enough. If I set the DOWNLOAD_DELAY = 0.
The crawler is able to crawl about 10 pages. after that, the target page will return something like " you are requesting too frequently ".
What I want to do is the keep the download_delay to 0. once the "requesting too frequently" msg is found in the html. it change the delay to 2s. After a while it switch back to zero.
is there any module can do this? or any other better idea to handle such case?
Update:
I found that is a extension call AutoThrottle
but is it able to customize some logic like this??
if (requesting too frequently) is found
increase the DOWNLOAD_DELAY
If right after you get anti-spider page, then in 2 seconds you can get data page, then what you are asking probably requires writing a downloader middleware
that checks for anti-spider page, reset all scheduled requests to a renew-queue, start a looping call when spider is idle to get request from the renew-queue, (the looping interval is your hack for a new download delay), and try to decide when the download delay is not necessary again (requires some tests), then stop the looping and reschedule all the requests in renew-queue to scrapy scheduler. You will need to use redis queue in case of distributed crawl.
With download delay set to 0, in my experience throughput can go easily above 1000 items/min. If anti-spider page pops up after 10 responses, then it is not worth the effort.
Instead maybe you can try to find out how fast does your target server allow, may be 1.5s, 1s, 0.7s, 0.5s etc. Then maybe redesign your product takes into consideration the throughput your crawler can achieve.
You can use Auto Throttle extension now. It is turned off by default. You can add these parameters in your project's settings.py file to enable it.
AUTOTHROTTLE_ENABLED = True
# The initial download delay
AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
AUTOTHROTTLE_MAX_DELAY = 300
# The average number of requests Scrapy should be sending in parallel to
# each remote server
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
AUTOTHROTTLE_DEBUG = True
Yes, You can use the time module to set the dynamic delay.
import time
for i in range(10):
*** Operations 1****
time.sleep( i )
*** Operations 2****
Now you can see the delay between Operations 1 and Operations 2.
Note:
the variable 'i' is in the form of seconds.