How to handle multiple return values in Scrapy from Splash

I'm using Scrapy with Splash. My Splash script can send back multiple values, but in my Scrapy code I could not handle all of them. For example,
this is my Splash script:
splash_script = """
function main(splash)
local url = splash.args.url
return {
html = splash:html(),
number = 1
}
end
"""
The method that triggers Splash from Scrapy:
yield scrapy.Request(
    url=response.urljoin(url),
    callback=self.product_details,
    errback=self.error,
    dont_filter=True,
    meta={
        'splash': {
            'endpoint': 'render.html',
            'cache_args': ['lua_source'],
            'args': {
                'index': index,
                'http_method': 'GET',
                'lua_source': self.splash_script,
            }
        }
    },
)
The callback method:
def product_details(self, response):
    print(response.body)
This method receives only the HTML content; I can't see the number.

You are printing response.body, which only includes the HTML.
You have to use response.data to see the number (the 1).
You can also access the elements individually:
response.data['html']
or
response.data['number']
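For example, a minimal sketch of the callback, assuming the request goes through Splash's execute endpoint (which runs lua_source and returns the Lua table as JSON, exposed by scrapy-splash as response.data):
def product_details(self, response):
    # response.data is the decoded JSON table returned by the Lua script
    html = response.data['html']      # the rendered page
    number = response.data['number']  # the extra value, 1 in this case
    self.logger.info('number returned by Splash: %s', number)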
And when you return values, make sure you assign the keys in the return statement:
NOT this:
html = splash:html()
number = 1
return {number, html}
BUT this:
return {number = 1, html = splash:html()}
Basically, you have to assign the keys in the return statement even if you have already assigned the values outside of it.
That's extra info, but it really tripped me up and you might run into the same problem.


How to use the yield function to scrape data from multiple pages

I'm trying to scrape data from the Amazon India website. I am not able to collect the response and parse the elements using yield when:
1) I have to move from the product page to the review page
2) I have to move from one review page to another review page
Code flow:
1) customerReviewData() calls getCustomerRatingsAndComments(response)
2) getCustomerRatingsAndComments(response) finds the URL of the review page and calls yield Request with getCrrFromReviewPage() as the callback method, with the URL of this review page
3) getCrrFromReviewPage() gets a new response of the first review page and scrapes all the elements from the first review page (page loaded), adding them to customerReviewDataList[]
4) gets the URL of the next page if it exists and recursively calls getCrrFromReviewPage(), crawling elements from the next page, until all the review pages are crawled
5) all the reviews get added to customerReviewDataList[]
I have tried playing around with yield, changing the parameters, and also looked up the Scrapy documentation for yield and Request/Response.
# -*- coding: utf-8 -*-
import scrapy
import logging
from scrapy import Request

customerReviewDataList = []
customerReviewData = {}

#Get product name in <H1>
def getProductTitleH1(response):
    titleH1 = response.xpath('normalize-space(//*[@id="productTitle"]/text())').extract()
    return titleH1

def getCustomerRatingsAndComments(response):
    #Fetches the relative url
    reviewRelativePageUrl = response.css('#reviews-medley-footer a::attr(href)').extract()[0]
    if reviewRelativePageUrl:
        #get absolute URL
        reviewPageAbsoluteUrl = response.urljoin(reviewRelativePageUrl)
        yield Request(url=reviewPageAbsoluteUrl, callback=getCrrFromReviewPage())
        self.log("yield request complete")
    return len(customerReviewDataList)

def getCrrFromReviewPage():
    userReviewsAndRatings = response.xpath('//div[@id="cm_cr-review_list"]/div[@data-hook="review"]')
    for userReviewAndRating in userReviewsAndRatings:
        customerReviewData['reviewTitle'] = response.css('#cm_cr-review_list .review-title span ::text').extract()
        customerReviewData['reviewDescription'] = response.css('#cm_cr-review_list .review-text span::text').extract()
        customerReviewDataList.append(customerReviewData)
    reviewNextPageRelativeUrl = response.css('#cm_cr-pagination_bar .a-pagination .a-last a::attr(href)')[0].extract()
    if reviewNextPageRelativeUrl:
        reviewNextPageAbsoluteUrl = response.urljoin(reviewNextPageRelativeUrl)
        yield Request(url=reviewNextPageAbsoluteUrl, callback=getCrrFromReviewPage())

class UsAmazonSpider(scrapy.Spider):
    name = 'Test_Crawler'
    allowed_domains = ['amazon.in']
    start_urls = ['https://www.amazon.in/Philips-Trimmer-Cordless-Corded-QT4011/dp/B00JJIDBIC/ref=sr_1_3?keywords=philips&qid=1554266853&s=gateway&sr=8-3']

    def parse(self, response):
        titleH1 = getProductTitleH1(response),
        customerReviewData = getCustomerRatingsAndComments(response)
        yield {
            'Title_H1': titleH1,
            'customer_Review_Data': customerReviewData
        }
I'm getting the following response:
{'Title_H1': (['Philips Beard Trimmer Cordless and Corded for Men QT4011/15'],), 'customer_Review_Data': <generator object getCustomerRatingsAndComments at 0x048AC630>}
The "Customer_review_Data" should be a list of dict of title and review
I am not able to figure out as to what mistake I am doing here.
When I use the log() or print() to see what data is captured in customerReviewDataList[], unable to see the data in the console either.
I am able to scrape all the reviews in customerReviewDataList[], if they are present in the product page,
In this scenario where I have to use the yield function I am getting the output stated above like this [https://ibb.co/kq8w6cf]
This is the kind of output I am looking for:
[{'customerReviewTitle': ['Difficult to find a charger adapter'], 'customerReviewComment': ['I already have a phillips trimmer which was only cordless.']}, {'customerReviewTitle': ['Good Product'], 'customerReviewComment': ['Solves my need perfectly HK']}]
Any help is appreciated. Thanks in advance.
You should complete the Scrapy tutorial. The Following links section should be especially helpful to you.
This is a simplified version of your code:
from scrapy import Request, Spider

def data_request_iterator():
    yield Request('https://example.org')

class MySpider(Spider):
    name = 'myspider'
    start_urls = ['https://example.com']

    def parse(self, response):
        yield {
            'title': response.css('title::text').get(),
            'data': data_request_iterator(),
        }
Instead, it should look like this:
class MySpider(Spider):
    name = 'myspider'
    start_urls = ['https://example.com']

    def parse(self, response):
        item = {
            'title': response.css('title::text').get(),
        }
        yield Request('https://example.org', meta={'item': item}, callback=self.parse_data)

    def parse_data(self, response):
        item = response.meta['item']
        # TODO: Extend item with data from this second response as needed.
        yield item
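Applied to your spider, the second callback could then collect the reviews along these lines (a sketch only, reusing the selectors from your question, which may need adjusting):
def parse_data(self, response):
    item = response.meta['item']
    item['customer_Review_Data'] = [
        {
            'customerReviewTitle': review.css('.review-title span::text').getall(),
            'customerReviewComment': review.css('.review-text span::text').getall(),
        }
        for review in response.css('#cm_cr-review_list div[data-hook="review"]')
    ]
    # Pagination would yield another Request here, passing the item along in meta.
    yield item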

How to check if a URL from XPath exists?

I have two functions in Scrapy:
def parse_attr(self, response):
    for resource in response.xpath(''):
        item = Item()
        item['Name'] = response.xpath('').extract()
        item['Title'] = response.xpath('').extract()
        item['Contact'] = response.xpath('').extract()
        item['Gold'] = response.xpath('').extract()
        company_page = response.urljoin(resource.xpath('/div/@href').extract_first())
        if company_page:
            request = scrapy.Request(company_page, callback=self.company_data)
            request.meta['item'] = item
            yield request
        else:
            yield item

def company_data(self, response):
    item = response.meta['item']
    item['Products'] = response.xpath('').extract()
    yield item
parse_attr calls company_data when it extracts @href from the page and passes it to company_page; however, this href does not always exist. How can I check whether the href exists, and if not, stop Scrapy from moving to the other function?
The above code does not satisfy this condition because company_page is always truthy.
What I want is for Scrapy to stop if there is no href and finish its job with just the items it already has. If an href is found, then I want Scrapy to move to the other function and extract the additional item.
response.urljoin() will always return something (the request's base URL), even if the argument is empty. Therefore your variable will always contain a value and consequently evaluate as True.
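You can see the effect with plain urljoin, which response.urljoin() essentially delegates to (hypothetical URLs):
from urllib.parse import urljoin
urljoin('https://example.com/listing', '')            # -> 'https://example.com/listing' (truthy!)
urljoin('https://example.com/listing', '/company/1')  # -> 'https://example.com/company/1'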
You need to do the URL joining inside your conditional. For example:
company_page = resource.xpath('/div/@href').extract_first()
if company_page:
    company_page = response.urljoin(company_page)
    request = scrapy.Request(company_page, callback=self.company_data)
    request.meta['item'] = item
    yield request
else:
    yield item

Loop on Scrapy FormRequest but only one item created

So I've tried to loop over a FormRequest that calls my function that creates, fills, and yields the item. The only problem: one and only one item is produced, no matter how many times it loops, and I can't figure out why.
import json
import logging
from scrapy import FormRequest

def access_data(self, res):
    #receive all IDs and request the infos
    res_json = (res.body).decode("utf-8")
    res_json = json.loads(res_json)
    for a in res_json['data']:
        logging.warning(a['id'])
        req = FormRequest(
            url='https://my_url',
            cookies={my_cookies},
            method='POST',
            callback=self.fill_details,
            formdata={'valeur': str(a['id'])},
            headers={'X-Requested-With': 'XMLHttpRequest'}
        )
        yield req

def fill_details(self, res):
    logging.warning("annonce")
    item = MyItem()
    item['html'] = res.xpath('//body//text()')
    item['existe'] = True
    item['ip_proxy'] = None
    item['launch_time'] = str(mySpider.init_time)
    yield item
To be sure everything is clear:
when I run this, the log "annonce" is printed only one time, while my logging of a['id'] in the request loop is printed many times, and I can't find a way to fix this.
I found the way!
If anyone has the same problem: since my URL is always the same (only the formdata changes), Scrapy's duplicate filter takes over and drops the requests as duplicates.
Set dont_filter=True in the FormRequest to make it work.
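Concretely, the request inside the loop just gains one argument; everything else stays as in the question (my_cookies is the question's placeholder):
req = FormRequest(
    url='https://my_url',
    cookies={my_cookies},
    method='POST',
    callback=self.fill_details,
    formdata={'valeur': str(a['id'])},
    headers={'X-Requested-With': 'XMLHttpRequest'},
    dont_filter=True,  # stop the dupefilter from dropping requests to the same URL
)
yield req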

Update response.body in Scrapy (without reload)

I use Scrapy and Selenium to crawl. My site uses AJAX for pagination: the URL doesn't change, so response.body doesn't change either. I want to click with Selenium (for pagination), get self.driver.page_source, and use it instead of response.body.
So I wrote this code:
res = scrapy.http.TextResponse(url=self.driver.current_url, body=self.driver.page_source,
                               encoding='utf-8')
print(str(res))  # nothing to print!
for quote in res.css("#ctl00_ContentPlaceHolder1_Grd_Dr_DXMainTable > tr.dxgvDataRow_Office2003Blue"):
    i = i + 1
    item = dict()
    item['id'] = int(quote.css("td.dxgv:nth-child(1)::text").extract_first())
And there is no error!
You can replace the body of the original response in Scrapy by using the response.replace() method:
def parse(self, response):
    response = response.replace(body=driver.page_source)
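A slightly fuller sketch (assuming the spider holds its webdriver as self.driver, as in the question): after clicking the pagination control with Selenium, swap in the live DOM and keep using the usual Scrapy selectors:
def parse(self, response):
    # ... click the next-page element with self.driver here ...
    response = response.replace(body=self.driver.page_source)
    for quote in response.css("tr.dxgvDataRow_Office2003Blue"):
        yield {'id': quote.css("td.dxgv:nth-child(1)::text").get()}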

Overriding start_requests in Scrapy is not synchronous

I'm trying to override Scrapy's start_requests method, but without success. I'm already able to iterate through pages. The problem is that now I have to iterate first through cities and then through pages.
My code looks like this:
URL = "https://example.com/%s/?page=%d"
starting_number = 1
number_of_pages = 3
cities = []  # there is an array of cities
selected_city = "..."

def start_requests(self):
    for city in cities:
        selected_city = city
        print("####################")
        print("##### CITY: " + selected_city + " #####")
        for i in range(self.page_number, number_of_pages, +1):
            print("##### page: " + str(i) + " #####")
            yield scrapy.Request(url=(URL % (selected_city, i)), callback=self.parse)
        print("####################")
In the console I see that when the crawler starts working it prints all cities and pages, and only then starts the requests. Therefore, as a result, my crawler parses only the first city. Requests work asynchronously, while I need them synchronous.
What is the right way to iterate in my case?
Thanks for any help!
My problem was that I wrongly used the global variable selected_city in the remaining code.
I thought that on every iteration it would stop to run the parse method and then continue to the next iteration; therefore I set item['city'] = selected_city in the parse method.
Now I just pass the city through the Request's meta parameter.
Sample code:
def start_requests(self):
    requests = []
    for city in cities:
        for i in range(self.page_number, number_of_pages, +1):
            requests.append(scrapy.Request(url=(URL % (city, i)), callback=self.parse, meta={'city': city}))
    return requests
And in the parse method, retrieve it with: item['city'] = response.request.meta['city']
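For completeness, a minimal sketch of the matching parse method (assuming the item is a plain dict; adapt the extraction to your page):
def parse(self, response):
    item = {}
    # The city comes from the request's meta, not from a global variable.
    item['city'] = response.request.meta['city']
    # ... extract the page data into item here ...
    yield item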