I have HTML responses in a database; how can I parse them in a Scrapy spider? - scrapy

I have a table of HTTP responses in a database. I want to modify my spider so that it grabs those responses from the database and handles them in the parse method.
I have tried using the start_requests method, but it didn't work. Please help me.
This is the code I tried:
def start_requests(self):
    sample = [
        {
            "status": 200,
            "body": b"<b></b>",
            "url": "http://google.com"
        }
    ]
    for item in sample:
        yield self.parse(scrapy.http.Response(
            **item,
            # encoding='utf-8'
        ))
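One way to approach this (a sketch of mine, not from the original thread): start_requests can only yield Request objects, so rather than yielding parsed results you can yield ordinary requests for the stored URLs and let a downloader middleware answer them from the database. Returning a Response from a downloader middleware's process_request makes Scrapy skip the network download and hand that response to the spider callback. The STORED_ROWS dict and the middleware name below are placeholders for your own database lookup.

from scrapy.http import HtmlResponse

# Stand-in for the database table of saved responses (hypothetical).
STORED_ROWS = {
    "http://google.com": {"status": 200, "body": b"<b></b>", "url": "http://google.com"},
}

class StoredResponseMiddleware:
    """Downloader middleware that serves saved bodies instead of downloading."""

    def process_request(self, request, spider):
        row = STORED_ROWS.get(request.url)
        if row is None:
            return None  # not in the store: let Scrapy download it normally
        # Returning a Response here short-circuits the download; Scrapy hands
        # this response straight to the spider's callback (e.g. parse).
        return HtmlResponse(
            url=row["url"],
            status=row["status"],
            body=row["body"],  # bytes, as stored in the database
            encoding="utf-8",
            request=request,
        )

With something like this enabled in DOWNLOADER_MIDDLEWARES, the spider's start_requests only has to yield scrapy.Request(url) for each stored URL, and parse receives the saved HTML as a normal response.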

Related

How to swiftly scrape a list of URLs from dynamically rendered websites with scrapy-playwright using parallel processing?

Here is my spider. It works well, but it is not very fast for larger numbers of pages (tens of thousands):
import scrapy
import csv
from scrapy_playwright.page import PageMethod

class ImmoSpider(scrapy.Spider):
    name = "immo"

    def start_requests(self):
        with open("urls.csv", "r") as f:
            reader = csv.DictReader(f)
            urls = [item['Url-scraper'] for item in reader][0:1]
        for elem in urls:
            yield scrapy.Request(
                elem,
                meta={
                    'playwright': True,
                    'playwright_include_page': True,
                    "playwright_page_methods": [
                        PageMethod("wait_for_load_state", 'networkidle')
                    ],
                },
            )

    async def parse(self, response):
        page = response.meta["playwright_page"]
        await page.close()
        # parse stuff
        yield {
            # yield stuff
        }
My scraper project is set up like the official Scrapy getting-started tutorial.
I'm still a beginner at scraping, so maybe I missed a simple solution.
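There is no accepted answer in this thread, but for reference, throughput with scrapy-playwright is usually governed by a handful of settings. The values below are illustrative only (a sketch assuming the standard scrapy-playwright setup from its README), not tuned recommendations:

# settings.py (sketch)
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"

CONCURRENT_REQUESTS = 16               # Scrapy-level parallelism
PLAYWRIGHT_MAX_CONTEXTS = 4            # browser contexts kept open at once
PLAYWRIGHT_MAX_PAGES_PER_CONTEXT = 4   # pages allowed per context

Waiting for 'networkidle' on every page is also slow; if the data you need is available earlier, waiting for 'domcontentloaded' (or for a specific selector) usually shaves a lot of time off each request.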

Scrapy Playwright: execute CrawlSpider using scrapy playwright

Is it possible to run a CrawlSpider with the Playwright integration for Scrapy? I am trying the following script to run a CrawlSpider, but it does not scrape anything. It also does not show any error!
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class GumtreeCrawlSpider(CrawlSpider):
    name = 'gumtree_crawl'
    allowed_domains = ['www.gumtree.com']

    def start_requests(self):
        yield scrapy.Request(
            url='https://www.gumtree.com/property-for-sale/london/page',
            meta={"playwright": True}
        )
        return super().start_requests()

    rules = (
        Rule(LinkExtractor(restrict_xpaths="//div[@class='grid-col-12']/ul[1]/li/article/a"),
             callback='parse_item', follow=False),
    )

    async def parse_item(self, response):
        yield {
            'Title': response.xpath("//div[@class='css-w50tn5 e1pt9h6u11']/h1/text()").get(),
            'Price': response.xpath("//h3[@itemprop='price']/text()").get(),
            'Add Posted': response.xpath("//dl[@class='css-16xsajr elf7h8q4'][1]/dd/text()").get(),
            'Links': response.url
        }
Requests extracted from the rule do not have the playwright=True meta key; that's a problem if they need to be rendered by the browser to have useful content. You could solve that by using Rule.process_request, something like:
def set_playwright_true(request, response):
    request.meta["playwright"] = True
    return request

class MyCrawlSpider(CrawlSpider):
    ...
    rules = (
        Rule(LinkExtractor(...), callback='parse_item', follow=False, process_request=set_playwright_true),
    )
Update after comment
Make sure your URL is correct; I get no results for that particular one (remove /page?).
Bring back your start_requests method; it seems the first page also needs to be downloaded using the browser.
Unless marked explicitly (e.g. @classmethod, @staticmethod), Python instance methods receive the calling object as the implicit first argument. The convention is to call this self (e.g. def set_playwright_true(self, request, response)). However, if you do this, you will need to change the way you create the rule, either:
Rule(..., process_request=self.set_playwright_true)
Rule(..., process_request="set_playwright_true")
From the docs: "process_request is a callable (or a string, in which case a method from the spider object with that name will be used)"
My original example defines the processing function outside of the spider, so it's not an instance method.
As suggested by elacuesta, I'd only add: change your parse_item def from an async def to a standard def.
def parse_item(self, response):
It defies everything I've read too, but that's what got me through.
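Pulling the answer and the follow-up comments together, the adjusted spider might end up looking roughly like this (a sketch only; the URL and XPath selectors are copied from the question and unverified):

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

def set_playwright_true(request, response):
    # Rule.process_request hook: tag extracted requests for the browser.
    request.meta["playwright"] = True
    return request

class GumtreeCrawlSpider(CrawlSpider):
    name = "gumtree_crawl"
    allowed_domains = ["www.gumtree.com"]

    def start_requests(self):
        # The first page also needs to be rendered by the browser.
        yield scrapy.Request(
            url="https://www.gumtree.com/property-for-sale/london",
            meta={"playwright": True},
        )

    rules = (
        Rule(
            LinkExtractor(restrict_xpaths="//div[@class='grid-col-12']/ul[1]/li/article/a"),
            callback="parse_item",
            follow=False,
            process_request=set_playwright_true,
        ),
    )

    def parse_item(self, response):  # plain def, per the comment above
        yield {
            "Title": response.xpath("//div[@class='css-w50tn5 e1pt9h6u11']/h1/text()").get(),
            "Price": response.xpath("//h3[@itemprop='price']/text()").get(),
            "Links": response.url,
        }

The only changes from the question are the process_request hook on the rule, the trimmed start URL, and parse_item as a plain def.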

How to crawl an ASP.NET webform link using scrapy

I want to scrape a webform site, but the links aren't regular hrefs; they look like the one below, and I want Scrapy to get that link and go there:
<a id="ctl00_ContentPlaceHolder1_DtGrdAttraf_ctl06_LnkBtnDisplayHadith" title="some title" class="Txt" onmouseover="changeStyle(this, 'lnk')" onmouseout="changeStyle(this, 'Txt TxtSmall')" href="javascript:WebForm_DoPostBackWithOptions(new WebForm_PostBackOptions(&quot;ctl00$ContentPlaceHolder1$DtGrdAttraf$ctl06$LnkBtnDisplayHadith&quot;, &quot;&quot;, false, &quot;&quot;, &quot;http://www.sonnaonline.com/DisplayResults.aspx?Menu=1&ParentID=13&Flag=dbID&Selid=8483&quot;, false, true))">the link text</a>
ASP.NET is a form-driven framework, so you have to fill in the form and manually post it to get to the page it directs to? How do I do that?
First, you can have a look at my scrapy code here:
https://github.com/Timezone-design/python-scrapy-asp-net/blob/master/scrapy_spider/spiders/burzarada_spider.py
You should first find out what WebForm_DoPostBackWithOptions() does on the page. You can view the page source (Ctrl+U) and search for the function there.
You will soon see what it does and where it fills in these values: "ctl00$ContentPlaceHolder1$DtGrdAttraf$ctl06$LnkBtnDisplayHadith", "", false, "", "http://www.sonnaonline.com/DisplayResults.aspx?Menu=1&ParentID=13&Flag=dbID&Selid=8483", false, true.
Then the rest is clear.
You extract the href of the a tag to a string with
response.css('... a::attr(href)').extract()[0]  # assuming there are many <a> tags there
Then split the string "ctl00$ContentPlaceHolder1$DtGrdAttraf$ctl06$LnkBtnDisplayHadith", "", false, "", "http://www.sonnaonline.com/DisplayResults.aspx?Menu=1&ParentID=13&Flag=dbID&Selid=8483", false, true by commas, fill the values into the proper form fields, and post them with scrapy.FormRequest:
yield scrapy.FormRequest(
    'https://burzarada.hzz.hr/Posloprimac_RadnaMjesta.aspx',
    formdata={
        '__EVENTTARGET': eventTarget,
        '__EVENTARGUMENT': eventArgument,
        '__LASTFOCUS': lastFocus,
        '__VIEWSTATE': viewState,
        '__VIEWSTATEGENERATOR': viewStateGenerator,
        '__VIEWSTATEENCRYPTED': viewStateEncrypted,
        'ctl00$MainContent$ddlPageSize': pageSize,
        'ctl00$MainContent$ddlSort': sort,
    },
    callback=self.parse_multiple_pages
)
Explanation:
https://burzarada.hzz.hr/Posloprimac_RadnaMjesta.aspx  # URL to post the form to
formdata  # the form data as a dict; keys are the input names
callback  # the function that receives the response and does the next steps
Voila! You get into the page, and the response is passed as an argument to the function you gave as callback.
You can see some examples in the link above.
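For the site in the question specifically, a sketch of how that could look (the spider name, callback name, and start URL are my placeholders; the hidden-field names are the standard ASP.NET ones and should be checked against the actual page source):

import scrapy

class SonnaSpider(scrapy.Spider):  # hypothetical spider name
    name = "sonna"
    start_urls = ["http://www.sonnaonline.com/DisplayResults.aspx?Menu=1&ParentID=13&Flag=dbID&Selid=8483"]

    def parse(self, response):
        # ASP.NET pages keep their state in hidden inputs; read them from the
        # current page and echo them back, together with the __EVENTTARGET
        # taken from the javascript: href shown in the question.
        def hidden(name):
            return response.css(f'input[name="{name}"]::attr(value)').get(default="")

        yield scrapy.FormRequest(
            response.url,
            formdata={
                "__EVENTTARGET": "ctl00$ContentPlaceHolder1$DtGrdAttraf$ctl06$LnkBtnDisplayHadith",
                "__EVENTARGUMENT": "",
                "__VIEWSTATE": hidden("__VIEWSTATE"),
                "__VIEWSTATEGENERATOR": hidden("__VIEWSTATEGENERATOR"),
                "__EVENTVALIDATION": hidden("__EVENTVALIDATION"),
            },
            callback=self.parse_detail,  # hypothetical callback name
        )

    def parse_detail(self, response):
        # extract whatever you need from the post-back page
        self.logger.info("got %s", response.url)

Alternatively, scrapy.FormRequest.from_response(response, formdata={...}) will copy the hidden inputs from the page's form for you, so you only need to supply __EVENTTARGET yourself.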

Matching multiple responses in a single Scenario Outline

I want to match multiple responses of an API. Please find the Scenario Outline below.
Background:
  * def kittens = read('../sample.json')

Scenario Outline: Create Test1
  Given url url
  And request <Users>
  When method POST
  Then status 200
  And match response.success.name == <expectedName>
  And match response.success.contact.mobile == <expectedMobile>

  Examples:
    | Users         | expectedName | expectedMobile |
    | kittens.User1 | 'Micheal'    | '123456'       |
    | kittens.User2 | 'Steve'      | '998877'       |
In the above case I am able to match 2 fields, but I want to validate more fields, and that keeps increasing the pile of code, which I do not want.
Multiple responses of the API:
"success": {
    "name": "Micheal",
    "addr": "Tesla road",
    "contact": {
        "mobile": 123456,
        "phone": 4422356
    }
}
"success": {
    "name": "Steve",
    "addr": "Karen Road",
    "contact": {
        "mobile": 998877,
        "phone": 244344
    }
}
I am looking to minimize the lines of code.
Can you please tell me another way where I can load the entire response into an expected variable and then traverse it in the Examples section?
Please help me. Thank you!
I strongly recommend you don't do this, and the reasons are explained in detail here: https://stackoverflow.com/a/54126724/143475
Also note that instead of going field by field, you can use the whole JSON in the Examples column, or even pull it from a file; see examples.feature.
In my sincere opinion, you will actually end up with an even bigger "increasing pile of code" if you go down this path.

scrapy-splash does not crawl recursively with CrawlSpider

I have integrated scrapy-splash into my CrawlSpider's process_request in the rules, like this:
def process_request(self, request):
    request.meta['splash'] = {
        'args': {
            # set rendering arguments here
            'html': 1,
        }
    }
    return request
The problem is that the crawl renders only the URLs at the first depth.
I also wonder how I can get a response even with a bad HTTP code or a redirected response.
Thanks in advance.
Your problem may be related to this: https://github.com/scrapy-plugins/scrapy-splash/issues/92
In short, try to add this to your parsing callback function:
# module-level imports needed by the snippet
from scrapy.http import HtmlResponse
from scrapy_splash import SplashRequest, SplashTextResponse

def parse_item(self, response):
    """Parse response into item; also create new requests."""
    page = RescrapItem()  # RescrapItem and SPLASH_RENDER_ARGS come from the original project
    ...
    yield page
    if isinstance(response, (HtmlResponse, SplashTextResponse)):
        seen = set()
        for n, rule in enumerate(self._rules):
            links = [lnk for lnk in rule.link_extractor.extract_links(response)
                     if lnk not in seen]
            if links and rule.process_links:
                links = rule.process_links(links)
            for link in links:
                seen.add(link)
                r = SplashRequest(url=link.url, callback=self._response_downloaded,
                                  args=SPLASH_RENDER_ARGS)
                r.meta.update(rule=rule, link_text=link.text)
                yield rule.process_request(r)
In case you wonder why this can return both items and new requests, here is a quote from the docs: https://doc.scrapy.org/en/latest/topics/spiders.html
In the callback function, you parse the response (web page) and return
either dicts with extracted data, Item objects, Request objects, or an
iterable of these objects. Those Requests will also contain a callback
(maybe the same) and will then be downloaded by Scrapy and then their
response handled by the specified callback.
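The second part of the question, getting responses with bad HTTP codes or redirects, is not covered by the answer above. In plain Scrapy that is usually handled with standard request meta keys; a sketch based on the question's process_request (the meta keys below are regular Scrapy ones, independent of Splash):

def process_request(self, request):
    request.meta['splash'] = {
        'args': {
            'html': 1,
        }
    }
    # Standard Scrapy meta keys, independent of Splash:
    request.meta['handle_httpstatus_all'] = True  # deliver non-2xx responses to the callback
    request.meta['dont_redirect'] = True          # receive 3xx responses instead of following them
    return request

If you only care about a few specific status codes, handle_httpstatus_list (e.g. [404, 500]) is the narrower option.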