Scrapy fails to parse pages with long response times

I am requesting and parsing the search results of an eCommerce website with Scrapy 2.x. Unfortunately the page often has a response time of more than 5 seconds. It appears to return a 200 before any search results are present, so the parser starts parsing the page and extracts 0 items because the results have not loaded yet.
How can one increase the time to render, or the delay?
This is the target page:
https://shop.apotal.de/keywordsearch?SEARCH_STRING=fester%20stuhlgang&VIEW_INDEX=0&VIEW_SIZE=40
Scraping is pretty straightforward:
yield scrapy.Request(
    url=url,
    meta={'handle_httpstatus_list': [301]},
    callback=self.parse_item,
)
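For reference, a minimal sketch of the knobs Scrapy itself exposes for slowing requests down and re-queuing a page that returned 200 without results. The spider name and the CSS selector below are placeholders, and a delay only helps if the results are rendered server-side; a JavaScript-rendered page would need a headless-browser integration instead.

import scrapy

class SearchSpider(scrapy.Spider):
    name = 'apotal_search'            # hypothetical spider name
    custom_settings = {
        'DOWNLOAD_DELAY': 5,          # pause between consecutive requests
        'DOWNLOAD_TIMEOUT': 30,       # tolerate slow responses
        'RETRY_TIMES': 3,             # retry failed downloads
    }

    def parse_item(self, response):
        results = response.css('div.search-result')     # placeholder selector
        if not results:
            # 200 came back without results: re-queue the same URL
            # (in practice you would cap the number of retries)
            yield response.request.replace(dont_filter=True)
            return
        for result in results:
            yield {'name': result.css('::text').get()}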

Related

Request callback issue with infinite crawler

I am writing a Scrapy spider whose purpose is to make requests to a remote server in order to hydrate the cache. It is an infinite crawler because I need to make requests at regular intervals. I created an initial spider which generates requests and hits the server; it worked fine, but now that I am making it run infinitely, I am not getting any responses. I even tried to debug in the process_response middleware but couldn't get my spider that far. Here is a sketch of the code I am implementing:
def generate_requests(self, payloads):
    for payload in payloads:
        if payload:
            print(f'making request with payload {payload}')
            yield Request(url=Config.HOTEL_CACHE_AVAILABILITIES_URL, method='POST', headers=Config.HEADERS,
                          callback=self.parse, body=json.dumps(payload), dont_filter=True, priority=1)

def start_requests(self):
    crawler_config = CrawlerConfig()
    while True:
        if not self.city_scheduler:
            for location in crawler_config.locations:
                city_name = location.city_name
                ttl = crawler_config.get_city_ttl(city_name)
                payloads = crawler_config.payloads.generate_payloads(location)
                self.city_scheduler[location.city_name] = (datetime.now() + timedelta(minutes=ttl)).strftime("%Y-%m-%dT%H:%M:%S")
                yield from self.generate_requests(payloads)
It seems like Scrapy has some odd behavior with a while loop in start_requests; there is a similar enhancement request on the Scrapy repo.
Moving the while-loop logic into your parse method will solve this issue.
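Roughly, moving that logic out of start_requests might look like the sketch below. CrawlerConfig, generate_payloads, generate_requests and city_scheduler come from the question; city_ttl_expired is a hypothetical helper for the TTL bookkeeping, and crawler_config is assumed to be stored on the spider instance.

def parse(self, response):
    # ... process the cache-hydration response ...

    # re-schedule from the callback instead of looping forever in start_requests
    for location in self.crawler_config.locations:
        if self.city_ttl_expired(location.city_name):    # hypothetical helper
            payloads = self.crawler_config.payloads.generate_payloads(location)
            yield from self.generate_requests(payloads)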

Scrapy view and requests.get difference for "https://www.mywebsite.com.sg"

I am trying to scrape https://www.mywebsite.com.sg, but the following command returns a 400 Bad Request error:
scrapy view https://www.mywebsite.com.sg
If I use:
data = requests.get("https://www.mywebsite.com.sg")
I can get the content of the webpage in data.text and data.content.
However, none of the XPath operations in my script work, as data.xpath and data.content are both empty.
There seems to be no protection on the webpage, as Postman can get the result with a simple HTTP GET query.
How do I get the response object to be properly filled?
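A 400 from Scrapy while plain requests succeeds often comes down to default request headers, and a requests.Response has no .xpath() at all. A hedged sketch of both points, using parsel (the selector library Scrapy itself uses) and a placeholder User-Agent string:

import requests
from parsel import Selector   # Scrapy's own selector library

# requests.Response has no .xpath(); wrap the HTML in a Selector first
data = requests.get(
    "https://www.mywebsite.com.sg",
    headers={"User-Agent": "Mozilla/5.0"},   # placeholder UA string
)
sel = Selector(text=data.text)
print(sel.xpath("//title/text()").get())

# the Scrapy-side equivalent would be overriding USER_AGENT, e.g.
#   scrapy view -s USER_AGENT="Mozilla/5.0" https://www.mywebsite.com.sg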

Crawler4J seed URL gets encoded and the error page is crawled instead of the actual page

I am using Crawler4J to crawl user profiles on GitHub. For instance, I want to crawl the URL: https://github.com/search?q=java+location:India&p=1
For now I am adding this hard-coded URL in my crawler controller like:
String url = "https://github.com/search?q=java+location:India&p=1";
controller.addSeed(url);
When Crawler4J starts, the URL crawled is:
https://github.com/search?q=java%2Blocation%3AIndia&p=1
which gives me an error page.
What should I do? I have tried giving the encoded URL, but that doesn't work either.
I eventually had to make the slightest of changes to the Crawler4J source code:
File name: URLCanonicalizer.java
Method: percentEncodeRfc3986
I just commented out the first line in this method and was able to crawl and fetch my results:
//string = string.replace("+", "%2B");
My URL contained a + character that was being replaced by %2B, and that gave me an error page. I wonder why they specifically replace the + character before encoding the entire URL.
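For what it's worth, the +/%2B distinction is easy to reproduce outside Crawler4J. A small Python sketch (not Crawler4J code) of why the library replaces + before percent-encoding: in a query string a bare + decodes to a space, while a literal plus has to be sent as %2B, so the rewrite changes the meaning of GitHub's q parameter.

from urllib.parse import quote, unquote_plus

q = "java+location:India"

# percent-encoding a literal '+' yields %2B ...
print(quote(q, safe=""))                        # java%2Blocation%3AIndia

# ... because when the server decodes the query string, '+' means a space:
print(unquote_plus("java+location:India"))      # java location:India
print(unquote_plus("java%2Blocation%3AIndia"))  # java+location:India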

Does the Import.io API support the status of the extractor?

I've just created an extractor with import.io. This extractor uses chaining: first I extract some URLs from one page, and with those extracted URLs I extract the detail pages. When the detail-page extraction finishes, I want to get the results. But how can I be sure that the extraction is complete? Is there an API endpoint for checking the status of an extraction?
I found the "GET /store/connector/{id}" endpoint in the legacy docs, but when I try it I get a 404.
Another question: I want to schedule my extractor twice a day. Is this possible?
Thanks
Associated with each extractor are crawl runs. A crawl run represents the running of an extractor with a specific configuration (training, list of URLs, etc.). The state of each crawl run can have one of the following values:
STARTED => Currently running
CANCELLED => Started but cancelled by the user
FINISHED => Run was complete
Additional metadata that is included is as follows:
Started At - When the run started
Stopped At - When the run finished
Total URL Count - Total number of URLs in the run
Success URL Count - # of successful URLs queried
Failed URL Count - # of failed URLs queried
Row Count - Total number of rows returned in the run
The REST API call to get the list of crawl runs associated with an extractor is as follows:
curl -s -X GET "https://store.import.io/store/crawlrun/_search?_sort=_meta.creationTimestamp&_page=1&_perPage=30&extractorId=$EXTRACTOR_ID&_apikey=$IMPORT_IO_API_KEY"
where
$EXTRACTOR_ID - the extractor whose crawl runs you want to list
$IMPORT_IO_API_KEY - the Import.io API key from your account
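For reference, a hedged Python sketch of polling that same endpoint until the most recent crawl run is no longer STARTED. The query parameters mirror the curl call above, but the exact JSON shape of the response ("hits"/"fields") is an assumption.

import time
import requests

EXTRACTOR_ID = "your-extractor-id"      # placeholder
API_KEY = "your-import-io-api-key"      # placeholder

def latest_crawl_run_state():
    resp = requests.get(
        "https://store.import.io/store/crawlrun/_search",
        params={
            "_sort": "_meta.creationTimestamp",
            "_page": 1,
            "_perPage": 1,
            "extractorId": EXTRACTOR_ID,
            "_apikey": API_KEY,
        },
    )
    resp.raise_for_status()
    hits = resp.json().get("hits", {}).get("hits", [])
    # assumed response shape; adjust to whatever the API actually returns
    return hits[0]["fields"]["state"] if hits else None

# poll until the most recent run reports FINISHED (or CANCELLED)
while latest_crawl_run_state() == "STARTED":
    time.sleep(60)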

How to make Scrapy execute callbacks before the start_requests method finishes?

I have a large file of relative urls that I want to scrape with Scrapy, and I've written some code to read this file line-by-line and build requests for my spider to parse. Below is some sample code.
spider:
def start_requests(self):
    with open(self._file) as infile:
        for line in infile:
            inlist = line.replace("\n", "").split(",")
            item = MyItem(data=inlist[0])
            request = scrapy.Request(
                url="http://foo.org/{0}".format(item["data"]),
                callback=self.parse_some_page
            )
            request.meta["item"] = item
            yield request

def parse_some_page(self, response):
    ...
    request = scrapy.Request(
        url="http://foo.org/bar",
        callback=self.parse_some_page2
    )
    yield request
This works fine, but with a large input file, I'm seeing that parse_some_page2 isn't invoked until start_requests finishes yielding all the initial requests. Is there some way I can make Scrapy start invoking the callbacks earlier? Ultimately, I don't want to wait for a million requests before I start seeing items flow through the pipeline.
I came up with 2 solutions. 1) Run spiders in separate processes if there are too many large sites. 2) Use deferreds and callbacks via Twisted (please don't run away, it won't be too scary). I'll discuss how to use the 2nd method because the first one can simply be googled.
Every function that executes yield request will "block" until a result is available. So your parse_some_page() function yields another request and will not go on to the next URL until a response is returned. I did manage to find some sites (mostly foreign government sites) that take a while to fetch, which hopefully simulates a situation similar to the one you're experiencing. Here is a quick and easy example:
# spider/stackoverflow_spider.py
from twisted.internet import defer
import scrapy

class StackOverflow(scrapy.Spider):
    name = 'stackoverflow'

    def start_requests(self):
        urls = [
            'http://www.gob.cl/en/',
            'http://www.thaigov.go.th/en.html',
            'https://www.yahoo.com',
            'https://www.stackoverflow.com',
            'https://swapi.co/',
        ]
        for index, url in enumerate(urls):
            # create the callback chain that will run after a response is returned
            deferred = defer.Deferred()
            deferred.addCallback(self.parse_some_page)
            deferred.addCallback(self.write_to_disk, url=url, filenumber=index + 1)
            # add more callbacks and errbacks as needed
            yield scrapy.Request(
                url=url,
                callback=deferred.callback)  # this starts the callback chain AFTER a response is returned

    def parse_some_page(self, response):
        print('[1] Parsing %s' % (response.url))
        return response.body  # this will be passed to the next callback

    def write_to_disk(self, content, url, filenumber):
        print('[2] Writing %s content to disk' % (url))
        filename = '%d.html' % filenumber
        with open(filename, 'wb') as f:
            f.write(content)
        # return what you want to pass to the next callback function,
        # or raise an error to start the errback chain
I've changed things slightly to be a bit easier to read and run. The first thing to take note of in start_requests() is that Deferred objects are created and callback functions are being chained (via addCallback()) within the urls loop. Now take a look at the callback parameter for scrapy.Request:
yield scrapy.Request(
    url=url,
    callback=deferred.callback)
What this snippet does is start the callback chain immediately after the Response becomes available from the request. In Twisted, Deferreds only start running their callback chains after Deferred.callback(result) is executed with a value.
After a response is provided, the parse_some_page() function runs with the Response as its argument. There you extract whatever you need and pass it to the next callback (i.e. write_to_disk() in my example). You can add more callbacks to the Deferred in the loop if necessary.
So the difference between this answer and what you did originally is that you used yield to wait for all the responses first and then execute callbacks, whereas my method uses Deferred.callback() as the callback of each request, so that each response is processed immediately.
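If the Deferred mechanics are new, here is a minimal Scrapy-free sketch of the same chaining idea; nothing runs until .callback() fires the chain, which is exactly what the spider above relies on.

from twisted.internet import defer

def parse(result):
    print('parsing %r' % result)
    return result.upper()          # passed to the next callback

def write(result):
    print('writing %r' % result)

d = defer.Deferred()
d.addCallback(parse)
d.addCallback(write)

# the chain only starts once .callback() is called with a value,
# just like Scrapy calling deferred.callback(response) above
d.callback('<html>body</html>')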
Hopefully this helps (and/or works).
References
Twisted Deferred Reference
Explanation of parse(): Briefly summarizes how yield/return affects parsing.
Non-Blocking Recipes (Klein): A blog post I wrote a while back on asynchronous callbacks in Klein/Twisted. Might be helpful to newbies.
PS
I have no clue if this will actually work for you since I couldn't find a site that is too large to parse. Also, I'm brand-spankin' new at Scrapy :D but I have years of Twisted under my belt.