I don't want to use this:
return [FormRequest.from_response(response,
because the login form does not have a <form> tag.
So I tried:
return scrapy.FormRequest(url="...",
                          formdata={},
                          callback=self.after_post)

return [FormRequest(url="...",
                    formdata={},
                    callback=self.after_post)]

return scrapy.http.Request(url="...",
                           method='POST',
                           headers={'Content-Type': 'application/json'},
                           body=json.dumps(postData),
                           callback=self.after_post)
(Ref: https://docs.scrapy.org/en/latest/topics/request-response.html#using-formrequest-to-send-data-via-http-post)
But it seems that Scrapy does not send the POST request, and the code never reaches after_post:
2020-04-07 10:43:03 [scrapy.core.engine] INFO: Closing spider (finished)
Can anyone tell me if I did something wrong here.
Thank you.
I solved it. I had written the POST in a separate function.
So when I do this:
def parse(self, response):
    self.do_post()
(it does not work)
And
def parse(self, response):
    return self.do_post()
(this works)
Not sure why the missing return caused the problem.
(Question closed)
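The difference between the two parse versions above can be illustrated without Scrapy at all. The following is a toy model, not Scrapy's actual engine code: FakeRequest, collect_scheduled and the example.com URL are hypothetical names made up for this sketch. It only shows the general idea that a request which a callback builds but does not return (or yield) never reaches whatever is supposed to schedule it.

```python
from typing import Callable, List


class FakeRequest:
    """Hypothetical stand-in for scrapy.Request; it only stores a URL."""

    def __init__(self, url: str) -> None:
        self.url = url


def collect_scheduled(callback: Callable[[], object]) -> List[FakeRequest]:
    """Toy engine step: schedule only what the callback hands back."""
    result = callback()
    if result is None:  # the callback built a request but dropped it
        return []
    if isinstance(result, FakeRequest):
        return [result]
    return [r for r in result if isinstance(r, FakeRequest)]


def do_post() -> FakeRequest:
    return FakeRequest("https://example.com/login")  # placeholder URL


def parse_without_return() -> None:
    do_post()  # the request object is created, then discarded


def parse_with_return() -> FakeRequest:
    return do_post()  # the request object is handed back to the caller


print(len(collect_scheduled(parse_without_return)))  # 0
print(len(collect_scheduled(parse_with_return)))     # 1
```

In a real spider the same reasoning should apply: do_post() only creates the request object, and the callback has to return or yield it for the POST to be sent.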
The crawl seems to ignore the line yield scrapy.Request(property_file, callback=self.parse_property) and never executes it. The first scrapy.Request, in start_requests, goes through and is executed properly, but the one in parse_navpage does not, as seen here.
import scrapy


class SmartproxySpider(scrapy.Spider):
    name = "scrape_zoopla"
    allowed_domains = ['zoopla.co.uk']

    def start_requests(self):
        # Read source from file
        navpage_file = f"file:///C:/Users/user/PycharmProjects/ScrapeZoopla/ScrapeZoopla/ScrapeZoopla/spiders/html_source/navpage/NavPage_1.html"
        yield scrapy.Request(navpage_file, callback=self.parse_navpage)

    def parse_navpage(self, response):
        listings = response.xpath("//div[starts-with(@data-testid, 'search-result_listing_')]")
        for listing in listings:
            listing_url = listing.xpath(
                "//a[@data-testid='listing-details-link']/@href").getall()  # List of property urls
            break
        print(listing_url)  # Works
        property_file = f"file:///C:/Users/user/PycharmProjects/ScrapeZoopla/ScrapeZoopla/ScrapeZoopla/spiders/html_source/properties/Property_1.html"
        print("BEFORE YIELD")
        yield scrapy.Request(property_file, callback=self.parse_property)  # Not going through
        print("AFTER YIELD")

    def parse_property(self, response):
        print("PARSE PROPERTY")
        print(response.url)
        print("PARSE PROPERTY AFTER URL")
Running scrapy crawl scrape_zoopla on the command line returns:
2022-09-10 20:38:24 [scrapy.core.engine] DEBUG: Crawled (200) <GET file:///C:/Users/user/PycharmProjects/ScrapeZoopla/ScrapeZoopla/ScrapeZoopla/spiders/html_source/navpage/NavPage_1.html> (referer: None)
BEFORE YIELD
AFTER YIELD
2022-09-10 20:38:24 [scrapy.core.engine] INFO: Closing spider (finished)
Both scrapy.Requests request local files, and only the first one worked. The files exist and display properly; if one of them did not exist, the crawler would report "No such file or directory" and would likely stop. Here the crawler seems to have skipped the second request entirely without raising any error. What is wrong here?
This is a total shot in the dark, but you could try sending both requests from your start_requests method. I honestly don't see why this would work, but it might be worth a shot.
import scrapy


class SmartproxySpider(scrapy.Spider):
    name = "scrape_zoopla"
    allowed_domains = ['zoopla.co.uk']

    def start_requests(self):
        # Read source from file
        navpage_file = f"file:///C:/Users/user/PycharmProjects/ScrapeZoopla/ScrapeZoopla/ScrapeZoopla/spiders/html_source/navpage/NavPage_1.html"
        property_file = f"file:///C:/Users/user/PycharmProjects/ScrapeZoopla/ScrapeZoopla/ScrapeZoopla/spiders/html_source/properties/Property_1.html"
        yield scrapy.Request(navpage_file, callback=self.parse_navpage)
        yield scrapy.Request(property_file, callback=self.parse_property)

    def parse_navpage(self, response):
        listings = response.xpath("//div[starts-with(@data-testid, 'search-result_listing_')]")
        for listing in listings:
            listing_url = listing.xpath(
                "//a[@data-testid='listing-details-link']/@href").getall()  # List of property urls
            break
        print(listing_url)  # Works

    def parse_property(self, response):
        print("PARSE PROPERTY")
        print(response.url)
        print("PARSE PROPERTY AFTER URL")
Update
It just dawned on me why this is happening. You have the allowed_domains attribute set, but the request you are making goes to your local file system, which naturally does not match the allowed domain.
Scrapy assumes that all of the initial URLs sent from start_requests are permitted and therefore does no verification for those, but all requests yielded from subsequent parse methods are checked against the allowed_domains attribute.
Just remove that line from the top of your spider class and your original structure should work fine.
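To see why a file:// URL can never match allowed_domains, here is a minimal sketch using only the standard library. It is a simplified model of an offsite check, not Scrapy's actual middleware code; is_onsite is a hypothetical helper, and the file path is shortened for readability.

```python
from typing import List
from urllib.parse import urlparse


def is_onsite(url: str, allowed_domains: List[str]) -> bool:
    """Simplified check: the host must equal, or be a subdomain of, an allowed domain."""
    host = urlparse(url).hostname or ""  # file:// URLs carry no hostname at all
    return any(host == d or host.endswith("." + d) for d in allowed_domains)


allowed = ["zoopla.co.uk"]
print(is_onsite("https://www.zoopla.co.uk/for-sale/", allowed))            # True
print(is_onsite("file:///C:/Users/user/html_source/Property_1.html", allowed))  # False
```

If you would rather keep allowed_domains for the real site, passing dont_filter=True on the local-file request is, as far as I know, another way to let that single request through. Keep in mind that it also turns off duplicate filtering for that request.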
I'm new to Scrapy. I wrote my first spider for this site https://book24.ru/knigi-bestsellery/?section_id=1592 and it works fine:
import scrapy


class BookSpider(scrapy.Spider):
    name = 'book24'
    start_urls = ['https://book24.ru/knigi-bestsellery/']

    def parse(self, response):
        for link in response.css('div.product-card__image-holder a::attr(href)'):
            yield response.follow(link, callback=self.parse_book)

        for i in range(1, 5):
            next_page = f'https://book24.ru/knigi-bestsellery/page-{i}/'
            yield response.follow(next_page, callback=self.parse)
            print(i)

    def parse_book(self, response):
        yield {
            'name': response.css('h1.product-detail-page__title::text').get(),
            'type': response.css('div.product-characteristic__value a::attr(title)')[2].get()
        }
Now I am trying to write a spider for just one page:
import scrapy


class BookSpider(scrapy.Spider):
    name = 'book'
    start_urls = ['https://book24.ru/product/transhumanism-inc-6015821/']

    def parse_book(self, response):
        yield {
            'name': response.css('h1.product-detail-page__title::text').get(),
            'type': response.css('div.product-characteristic__value a::attr(title)')[2].get()
        }
And it doesn't work: I get an empty file after running this command in the terminal:
scrapy crawl book -O book.csv
I don't know why.
I would be grateful for any help!
You were getting this error:
raise NotImplementedError(f'{self.__class__.__name__}.parse callback is not defined')
NotImplementedError: BookSpider.parse callback is not defined
According to the documentation:
parse(): a method that will be called to handle the response
downloaded for each of the requests made. The response parameter is an
instance of TextResponse that holds the page content and has further
helpful methods to handle it.
The parse() method usually parses the response, extracting the scraped
data as dicts and also finding new URLs to follow and creating new
requests (Request) from them.
Just rename your def parse_book(self, response): to def parse(self, response):
and it works fine.
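The reason the rename helps can be shown with a small toy model. This is not Scrapy code: MiniSpider and its crawl method are hypothetical stand-ins that only mimic the one behaviour discussed here, namely that responses for start_urls are handed to a method called parse unless a request names another callback.

```python
class MiniSpider:
    """Toy stand-in for scrapy.Spider: start URLs are always sent to parse()."""

    start_urls: list = []

    def parse(self, url):
        raise NotImplementedError(
            f"{self.__class__.__name__}.parse callback is not defined")

    def crawl(self):
        items = []
        for url in self.start_urls:
            items.extend(self.parse(url))  # default callback for start URLs
        return items


class BrokenBookSpider(MiniSpider):
    start_urls = ["https://book24.ru/product/transhumanism-inc-6015821/"]

    def parse_book(self, url):  # never called: nothing points at it
        yield {"url": url}


class FixedBookSpider(MiniSpider):
    start_urls = ["https://book24.ru/product/transhumanism-inc-6015821/"]

    def parse(self, url):  # renamed, so it becomes the default callback
        yield {"url": url}


print(FixedBookSpider().crawl())
try:
    BrokenBookSpider().crawl()
except NotImplementedError as exc:
    print(exc)
```

In a real spider, an alternative to renaming would be to keep parse_book and issue the request yourself from start_requests with callback=self.parse_book.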
I am scraping a search result page where in some cases a 301 redirect will be triggered. In that case I do not want to crawl that page, but I need to call a different callback function, passing the redirect URL string to it.
I believe it should be possible to do this with the rules, but I could not figure out how:
class GetbidSpider(CrawlSpider):
    handle_httpstatus_list = [301]

    rules = (
        Rule(
            LinkExtractor(
                allow=['^https://www\.testrule*$'],
            ),
            follow=False,
            callback='parse_item'
        ),
    )

    def parse_item(self, response):
        self.logger.info('Parsing %s', response.url)
        print(response.status)
        print(response.headers[b'Location'])
The logfile only shows:
DEBUG: Crawled (301) <GET https:...
But the parsing info never gets printed, which indicates that the function is never entered.
How can I
I really can't understand why my suggestions don't work for you. This is tested code:
import scrapy


class RedirectSpider(scrapy.Spider):
    name = 'redirect_spider'

    def start_requests(self):
        yield scrapy.Request(
            url='https://www.moneycontrol.com/india/stockpricequote/pesticidesagrochemicals/piindustries/PII',
            meta={'handle_httpstatus_list': [301]},
            callback=self.parse,
        )

    def parse(self, response):
        print(response.status)
        print(response.headers[b'Location'])
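Since the goal in the question is to pass the redirect URL on as a string, note that the Location header arrives as bytes and may be a relative path. A minimal sketch of turning it into an absolute URL string is below; redirect_target is a hypothetical helper, the example.com URLs are placeholders, and latin-1 decoding is an assumption that fits typical header values.

```python
from urllib.parse import urljoin


def redirect_target(request_url: str, location_header: bytes) -> str:
    """Decode a raw Location header and resolve it against the request URL."""
    location = location_header.decode("latin-1")  # header values arrive as bytes
    return urljoin(request_url, location)         # handles relative and absolute targets


print(redirect_target("https://www.example.com/search?q=abc", b"/item/42"))
print(redirect_target("https://www.example.com/search?q=abc",
                      b"https://shop.example.com/item/42"))
```

Inside a callback such as parse above, it could be called as redirect_target(response.url, response.headers[b'Location']) and the result passed to whichever function needs the URL.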
It took a while but I finally understand where the discrepancies are coming from!
scrapy crawl MeetupGetParticipants with url https://www.meetup.com/Google-Cloud_Meetup_Singapore_by_Cloud-Ace/events/264513425/attendees/
[s] scrapy scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s] crawler <scrapy.crawler.Crawler object at 0x04E0BD30>
[s] item {}
[s] request <GET http://meetup.com/Google-Cloud_Meetup_Singapore_by_Cloud-Ace/events/264513425/attendees/ via http://localhost:8050/render.html>
[s] response <200 http://meetup.com/Google-Cloud_Meetup_Singapore_by_Cloud-Ace/events/264513425/attendees/>
[s] settings <scrapy.settings.Settings object at 0x04E0BC70>
[s] spider <MeetupGetParticipants 'MeetupGetParticipants' at 0x4ff0450>
Why is Splash returning the original url? Isn't the purpose of Splash to return the one rendered by render.html? What I want is the result of http://localhost:8050/render.html?url=https://www.meetup.com/Google-Cloud_Meetup_Singapore_by_Cloud-Ace/events/264513425/attendees/ (which gives me a rendered webpage).
Basically, I could make it work myself just by tricking the URL ... There is something I don't understand here.
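For reference, "tricking the URL" amounts to building the render.html address by hand. A minimal sketch with the standard library follows; splash_render_url is a hypothetical helper, and the localhost:8050 address and wait value are taken from or modelled on the setup described above. As far as I can tell from the scrapy-splash README, response.url showing the original address is intentional, with the Splash endpoint address kept separately on the response, but treat that as my reading rather than a guarantee.

```python
from urllib.parse import urlencode


def splash_render_url(splash_base: str, target_url: str, wait: float = 0.5) -> str:
    """Build a Splash render.html URL for target_url (wait is in seconds)."""
    query = urlencode({"url": target_url, "wait": wait})  # percent-encodes the target
    return f"{splash_base.rstrip('/')}/render.html?{query}"


print(splash_render_url(
    "http://localhost:8050",
    "https://www.meetup.com/Google-Cloud_Meetup_Singapore_by_Cloud-Ace/events/264513425/attendees/",
))
```

Encoding the target with urlencode matters once the page URL has its own query string, which would otherwise be mixed up with Splash's parameters.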
It looks like I could make it work with a nice Lua script :) It returns a rendered JSON response with everything I need in it.
def start_requests(self):
    lua_script = """
    function main(splash)
        assert(splash:go(splash.args.url))
        while not splash:select('.attendee-item') do
            splash:wait(0.1)
        end
        return {html=splash:html()}
    end
    """
    yield SplashRequest(url=self.url, callback=self.parse,
                        endpoint='execute',
                        args={'lua_source': lua_script,
                              'wait': 5,
                              },
                        )
I need to update the location on a site that uses a radio button. This can be done with a simple POST request. The problem is that the output of this request is
window.location='http://store.intcomex.com/en-XCL/Products/Categories?r=True';
Since it is not a valid URL, Scrapy gets redirected to PageNotFound and the spider closes.
2017-09-17 09:57:59 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://store.intcomex.com/en-XCL/ServiceClient/PageNotFound> from <POST https://store.intcomex.com/en-XCL//User/SetNewLocation>
Here is my code:
def after_login(self, response):
    # inspect_response(response, self)
    url = "https://store.intcomex.com/en-XCL//User/SetNewLocation"
    data = {"id": "xclf1"}
    yield scrapy.FormRequest(url, formdata=data, callback=self.location)
    # inspect_response(response, self)

def location(self, response):
    yield scrapy.Request(url='http://store.intcomex.com/en-XCL/Products/Categories?r=True', callback=self.categories)
The question is: how can I direct Scrapy to a valid URL after executing the POST request that changes the location? Is there some argument that specifies the target URL, or can I execute the request without a callback and yield the correct URL on the next line?
Thanks.
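If the POST response body really is the window.location snippet quoted above, one option is to read the target URL out of that body instead of hard-coding it. A minimal sketch with the standard library follows; js_redirect_target and the regular expression are hypothetical, and it assumes the body reaches the callback at all, which the 302 in the log may prevent.

```python
import re
from typing import Optional

# Matches window.location='...' or window.location.href="..." and captures the URL.
_LOCATION_RE = re.compile(r"window\.location(?:\.href)?\s*=\s*['\"]([^'\"]+)['\"]")


def js_redirect_target(body: str) -> Optional[str]:
    """Return the URL assigned to window.location in body, or None if absent."""
    match = _LOCATION_RE.search(body)
    return match.group(1) if match else None


body = "window.location='http://store.intcomex.com/en-XCL/Products/Categories?r=True';"
print(js_redirect_target(body))
```

In the location callback, the result of js_redirect_target(response.text) could then be used as the url of the next scrapy.Request, with a fallback to the known category URL when it returns None.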