Scrapy request + response + download time

UPD: I'm not closing this question because I don't think my approach is as clear as it should be.
Is it possible to get the current request + response + download time and save them to an Item?
In "plain" python I do
start_time = time()
urllib2.urlopen('http://example.com').read()
time() - start_time
But how can I do this with Scrapy?
UPD:
The solution below is good enough for me, but I'm not sure about the quality of the results. If you have many connections that hit timeout errors, the measured download time may be wrong (even exceeding DOWNLOAD_TIMEOUT * 3).
In settings.py:
DOWNLOADER_MIDDLEWARES = {
    'myscraper.middlewares.DownloadTimer': 0,
}
In middlewares.py:
from time import time
from scrapy.http import Response


class DownloadTimer(object):
    def process_request(self, request, spider):
        request.meta['__start_time'] = time()
        # Returning None does not block middlewares with a greater order number.
        return None

    def process_response(self, request, response, spider):
        request.meta['__end_time'] = time()
        return response  # the response must be returned

    def process_exception(self, request, exception, spider):
        request.meta['__end_time'] = time()
        return Response(
            url=request.url,
            status=110,
            request=request)
Inside spider.py, in def parse(...):
log.msg('Download time: %.2f - %.2f = %.2f' % (
    response.meta['__end_time'], response.meta['__start_time'],
    response.meta['__end_time'] - response.meta['__start_time']
), level=log.DEBUG)

You could write a Downloader Middleware which would time each request. It would add a start time to the request before it's made and then a finish time when it's finished. Typically, arbitrary data such as this is stored in the Request.meta attribute. This timing information could later be read by your spider and added to your item.
This downloader middleware sounds like it could be useful on many projects.
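For example, the spider could copy the timing from the meta keys set by the middleware shown above into the item (MyItem and the field name are placeholders, not from the original answer):
def parse(self, response):
    item = MyItem()  # placeholder item class
    # __start_time / __end_time are the keys set by the DownloadTimer middleware.
    item['download_time'] = (
        response.meta['__end_time'] - response.meta['__start_time']
    )
    yield item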

I'm not sure you need a middleware here. Scrapy already exposes this in request.meta, which you can query and yield. For the download latency, simply yield
download_latency=response.meta.get('download_latency'),
The amount of time spent to fetch the response, since the request has been started, i.e. HTTP message sent over the network. This meta key only becomes available when the response has been downloaded. While most other meta keys are used to control Scrapy behavior, this one is supposed to be read-only.
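For example, a minimal parse callback that stores it (the item fields here are placeholders):
def parse(self, response):
    # download_latency is filled in by Scrapy once the response is downloaded.
    yield {
        'url': response.url,
        'download_latency': response.meta.get('download_latency'),
    }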

I think the best solution is to use Scrapy signals. Whenever a request reaches the downloader it emits the request_reached_downloader signal. After the download it emits the response_downloaded signal. You can catch them in the spider and record the times (and their difference) in meta from there.
@classmethod
def from_crawler(cls, crawler, *args, **kwargs):
    spider = super(SignalSpider, cls).from_crawler(crawler, *args, **kwargs)
    crawler.signals.connect(spider.item_scraped, signal=signals.item_scraped)
    return spider
A more elaborate answer is available here.
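A rough sketch of what the timing handlers could look like, assuming a Scrapy version that provides both signals (the handler names, spider class, and item fields are mine, not from the linked answer):
from time import time

import scrapy
from scrapy import signals


class SignalSpider(scrapy.Spider):
    name = 'signal_spider'
    start_urls = ['http://example.com']

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super(SignalSpider, cls).from_crawler(crawler, *args, **kwargs)
        crawler.signals.connect(spider.on_reached_downloader,
                                signal=signals.request_reached_downloader)
        crawler.signals.connect(spider.on_response_downloaded,
                                signal=signals.response_downloaded)
        return spider

    def on_reached_downloader(self, request, spider):
        # Fired when the request is handed to the downloader.
        request.meta['__start_time'] = time()

    def on_response_downloaded(self, response, request, spider):
        # Fired right after the download finishes; store the elapsed time.
        request.meta['__download_time'] = time() - request.meta['__start_time']

    def parse(self, response):
        yield {'url': response.url,
               'download_time': response.meta.get('__download_time')}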

Related

How can I add a new spider arg to my own template in Scrapy/Zyte

I am working on a paid proxy spider template and would like the ability to pass in a new argument on the command line for a Scrapy crawler. How can I do that?
This is achievable by using kwargs in your spider's __init__ method:
import scrapy


class YourSpider(scrapy.Spider):
    name = "your_spider"

    def __init__(self, *args, **kwargs):
        super(YourSpider, self).__init__(*args, **kwargs)
        self.your_arg = kwargs.get("your_cmd_arg", 42)
Now it would be possible to call the spider as follows:
scrapy crawl your_spider -a your_cmd_arg=foo
For more information on the topic, feel free to check this page in the Scrapy documentation.

How to feed an audio file from S3 bucket directly to Google speech-to-text

We are developing a speech application using Google's speech-to-text API. Our data (audio files) is stored in an S3 bucket on AWS. Is there a way to directly pass the S3 URI to Google's speech-to-text API?
From their documentation it seems this is, at the moment, not possible with Google's speech-to-text API.
This is not the case for their Vision and NLP APIs.
Any idea why this limitation exists for the speech API?
And what's a good workaround for this?
Currently, Google only accepts audio files from either your local source or from Google Cloud Storage. No reasonable explanation is given in the documentation about this.
Passing audio referenced by a URI
More typically, you will pass a uri parameter within the Speech request's audio field, pointing to an audio file (in binary format, not base64) located on Google Cloud Storage
I suggest you move your files to Google Cloud Storage. If you don't want to, there is a good workaround:
Use the Google Cloud Speech API with its streaming API. You are not required to store anything anywhere; your speech application takes input directly from a microphone. And don't worry if you don't know how to handle microphone input.
Google provides sample code that does it all:
# [START speech_transcribe_streaming_mic]
from __future__ import division

import re
import sys

from google.cloud import speech

import pyaudio
from six.moves import queue

# Audio recording parameters
RATE = 16000
CHUNK = int(RATE / 10)  # 100ms


class MicrophoneStream(object):
    """Opens a recording stream as a generator yielding the audio chunks."""

    def __init__(self, rate, chunk):
        self._rate = rate
        self._chunk = chunk

        # Create a thread-safe buffer of audio data
        self._buff = queue.Queue()
        self.closed = True

    def __enter__(self):
        self._audio_interface = pyaudio.PyAudio()
        self._audio_stream = self._audio_interface.open(
            format=pyaudio.paInt16,
            channels=1,
            rate=self._rate,
            input=True,
            frames_per_buffer=self._chunk,
            # Run the audio stream asynchronously to fill the buffer object.
            # This is necessary so that the input device's buffer doesn't
            # overflow while the calling thread makes network requests, etc.
            stream_callback=self._fill_buffer,
        )

        self.closed = False
        return self

    def __exit__(self, type, value, traceback):
        self._audio_stream.stop_stream()
        self._audio_stream.close()
        self.closed = True
        # Signal the generator to terminate so that the client's
        # streaming_recognize method will not block the process termination.
        self._buff.put(None)
        self._audio_interface.terminate()

    def _fill_buffer(self, in_data, frame_count, time_info, status_flags):
        """Continuously collect data from the audio stream, into the buffer."""
        self._buff.put(in_data)
        return None, pyaudio.paContinue

    def generator(self):
        while not self.closed:
            # Use a blocking get() to ensure there's at least one chunk of
            # data, and stop iteration if the chunk is None, indicating the
            # end of the audio stream.
            chunk = self._buff.get()
            if chunk is None:
                return
            data = [chunk]

            # Now consume whatever other data's still buffered.
            while True:
                try:
                    chunk = self._buff.get(block=False)
                    if chunk is None:
                        return
                    data.append(chunk)
                except queue.Empty:
                    break

            yield b"".join(data)


def listen_print_loop(responses):
    """Iterates through server responses and prints them.

    The responses passed is a generator that will block until a response
    is provided by the server.

    Each response may contain multiple results, and each result may contain
    multiple alternatives; for details, see the documentation. Here we
    print only the transcription for the top alternative of the top result.

    In this case, responses are provided for interim results as well. If the
    response is an interim one, print a line feed at the end of it, to allow
    the next result to overwrite it, until the response is a final one. For the
    final one, print a newline to preserve the finalized transcription.
    """
    num_chars_printed = 0
    for response in responses:
        if not response.results:
            continue

        # The `results` list is consecutive. For streaming, we only care about
        # the first result being considered, since once it's `is_final`, it
        # moves on to considering the next utterance.
        result = response.results[0]
        if not result.alternatives:
            continue

        # Display the transcription of the top alternative.
        transcript = result.alternatives[0].transcript

        # Display interim results, but with a carriage return at the end of the
        # line, so subsequent lines will overwrite them.
        #
        # If the previous result was longer than this one, we need to print
        # some extra spaces to overwrite the previous result
        overwrite_chars = " " * (num_chars_printed - len(transcript))

        if not result.is_final:
            sys.stdout.write(transcript + overwrite_chars + "\r")
            sys.stdout.flush()

            num_chars_printed = len(transcript)
        else:
            print(transcript + overwrite_chars)

            # Exit recognition if any of the transcribed phrases could be
            # one of our keywords.
            if re.search(r"\b(exit|quit)\b", transcript, re.I):
                print("Exiting..")
                break

            num_chars_printed = 0


def main():
    language_code = "en-US"  # a BCP-47 language tag

    client = speech.SpeechClient()
    config = speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
        sample_rate_hertz=RATE,
        language_code=language_code,
    )

    streaming_config = speech.StreamingRecognitionConfig(
        config=config, interim_results=True
    )

    with MicrophoneStream(RATE, CHUNK) as stream:
        audio_generator = stream.generator()
        requests = (
            speech.StreamingRecognizeRequest(audio_content=content)
            for content in audio_generator
        )

        responses = client.streaming_recognize(streaming_config, requests)

        # Now, put the transcription responses to use.
        listen_print_loop(responses)


if __name__ == "__main__":
    main()
# [END speech_transcribe_streaming_mic]
Dependencies are google-cloud-speech and pyaudio
For AWS S3, you can store your audio files there before/after you get the transcripts from Google Speech API.
Streaming is super fast as well.
And don't forget your credentials: you need to authenticate first by setting the GOOGLE_APPLICATION_CREDENTIALS environment variable.
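If you would rather keep the files in S3 and skip the microphone path, here is a rough sketch of fetching the bytes from S3 and passing them directly as content (bucket/key names are placeholders, and this assumes boto3 credentials are configured and the audio is short enough for a synchronous recognize call):
import boto3
from google.cloud import speech


def transcribe_s3_object(bucket, key, language_code="en-US"):
    # Download the audio bytes from S3 into memory (no local file needed).
    s3 = boto3.client("s3")
    audio_bytes = s3.get_object(Bucket=bucket, Key=key)["Body"].read()

    client = speech.SpeechClient()
    config = speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
        sample_rate_hertz=16000,
        language_code=language_code,
    )
    # Pass the raw bytes as `content` instead of a gs:// URI.
    audio = speech.RecognitionAudio(content=audio_bytes)

    response = client.recognize(config=config, audio=audio)
    for result in response.results:
        print(result.alternatives[0].transcript)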

How do I tell the spider to stop requesting after n failed requests?

import scrapy


class MySpider(scrapy.Spider):
    start_urls = []

    def __init__(self, **kwargs):
        for i in range(1, 1000):
            self.start_urls.append("some url" + str(i))

    def parse(self, response):
        print(response)
Here we queue 1000 URLs in the __init__ function, but I want to stop making all those requests if they fail or return something undesirable. How do I tell the spider to stop making requests, say, after 10 failed requests?
You might want to set CLOSESPIDER_ERRORCOUNT to 10 in that case. It probably doesn't account for failed requests only, though. Alternatively, you might set HTTPERROR_ALLOWED_CODES so that even error responses (failed requests) are handled, and implement your own failed-request counter inside the spider. Then, when the counter goes above the threshold, you raise the CloseSpider exception yourself.
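A rough sketch of such a counter, assuming error responses are let through to parse (e.g. via HTTPERROR_ALLOWED_CODES); the threshold and attribute names are mine:
import scrapy
from scrapy.exceptions import CloseSpider


class MySpider(scrapy.Spider):
    name = 'my_spider'
    failed_count = 0
    max_failures = 10  # assumed threshold

    def parse(self, response):
        if response.status >= 400:
            # Count this as a failed request.
            self.failed_count += 1
            if self.failed_count >= self.max_failures:
                raise CloseSpider('too many failed requests')
            return
        print(response)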

Scrapy: How to yield an item after the spider_closed call?

I want to yield an item only when the crawling is finished.
I am trying to do it via
def spider_closed(self, spider):
    item = EtsyItem()
    item['total_sales'] = 1111111
    yield item
But it does not yield anything, though the function is called.
How do I yield an item after the scraping is over?
Depending on what you want to do, there might be a veeeery hacky solution for this.
Instead of spider_closed you may want to consider using the spider_idle signal, which is fired before spider_closed. One difference between idle and close is that spider_idle allows the execution of requests, which may then contain a callback or errback to yield the desired item.
Inside spider class:
@classmethod
def from_crawler(cls, crawler, *args, **kwargs):
    # ...
    crawler.signals.connect(spider.spider_idle, signal=signals.spider_idle)
    return spider

# ...

def yield_item(self, response):
    yield MyItem(name='myname')

def spider_idle(self, spider):
    req = Request('https://fakewebsite123.xyz',
                  callback=lambda: None, errback=self.yield_item)
    self.crawler.engine.crawl(req, spider)
However, this comes with several side effects (for example, the final request will raise a DNSLookupError), so I discourage anyone from using it in production. I just want to show what is possible.
Oof, I'm afraid spider_closed is only meant for tear-down. I suppose you could do it by attaching some custom logic to a Pipeline to post-process your items.
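A rough sketch of that pipeline route (class name and the aggregation are mine; note a pipeline cannot yield new items, it can only post-process or export in close_spider):
class TotalSalesPipeline(object):
    def open_spider(self, spider):
        self.total_sales = 0

    def process_item(self, item, spider):
        # Aggregate while items flow through the pipeline.
        self.total_sales += item.get('sales', 0)
        return item

    def close_spider(self, spider):
        # Runs once the crawl is over; log or export the aggregate here.
        spider.logger.info('Total sales: %d', self.total_sales)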

Scrapy high CPU usage

I have a very simple test spider which does no parsing. However, I'm passing a large number of URLs (500k) to the spider in the start_requests method and seeing very high (99/100%) CPU usage. Is this the expected behaviour? If so, how can I optimize it (perhaps by batching and using spider_idle)?
class TestSpider(Spider):
    name = 'test_spider'
    allowed_domains = ['mydomain.com']

    def __init__(self, **kw):
        super(TestSpider, self).__init__(**kw)
        urls_list = kw.get('urls')
        if urls_list:
            self.urls_list = urls_list

    def parse(self, response):
        pass

    def start_requests(self):
        with open(self.urls_list, 'rb') as urls:
            for url in urls:
                yield Request(url, self.parse)
I think the main problem here is that you are scraping too many links; try adding a Rule to avoid scraping links that don't contain what you want.
Scrapy provides really useful docs, check them out:
http://doc.scrapy.org/en/latest/topics/spiders.html
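If you go the CrawlSpider route, a minimal sketch could look like this (the domain and the allow pattern are placeholders):
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class FilteredSpider(CrawlSpider):
    name = 'filtered_spider'
    allowed_domains = ['mydomain.com']
    start_urls = ['http://mydomain.com/']

    # Only follow links whose URLs match what you actually need.
    rules = (
        Rule(LinkExtractor(allow=r'/interesting-section/'),
             callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        yield {'url': response.url}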