How to call one spider from another spider in Scrapy

I have two spiders, and I want one to call the other with the scraped information, which is not links that I could follow. Is there any way to call one spider from another?
To illustrate the problem better: the URL of the "one" page has the form /one/{item_name}, where {item_name} is the information I can get from the /other/ page:
...
<li class="item">item1</li>
<li class="item">someItem</li>
<li class="item">anotherItem</li>
...
Then I have the spider OneSpider that scrapes /one/{item_name}, and the OtherSpider that scrapes /other/ and retrieves the item names, as shown below:
class OneSpider(Spider):
    name = 'one'

    def __init__(self, item_name, *args, **kwargs):
        super(OneSpider, self).__init__(*args, **kwargs)
        self.start_urls = [f'/one/{item_name}']

    def parse(self, response):
        ...
class OtherSpider(Spider):
    name = 'other'
    start_urls = ['/other/']

    def parse(self, response):
        itemNames = response.css('li.item::text').getall()
        # TODO:
        # for each item name,
        # scrape /one/{item_name}
        # with the OneSpider
I have already checked these two questions: How to call particular Scrapy spiders from another Python script, and scrapy python call spider from spider, as well as several other questions where the main solution is creating another method inside the class and passing it as the callback of new requests, but I don't think that is applicable when these new requests would have customized URLs.

Scrapy doesn't provide a way to call one spider from another spider (see the related issue in the Scrapy GitHub repo).
However, you can merge the logic of your two spiders into a single spider class:
import scrapy

class OtherSpider(scrapy.Spider):
    name = 'other'
    start_urls = ['/other/']

    def parse(self, response):
        itemNames = response.css('li.item::text').getall()
        for item_name in itemNames:
            yield scrapy.Request(
                url=f'/one/{item_name}',
                callback=self.parse_item
            )

    def parse_item(self, response):
        # parse method from your OneSpider class
        ...
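One caveat worth noting: Scrapy requests need absolute URLs, so if /one/{item_name} is a relative path you would typically build the full URL with response.urljoin. A minimal sketch of that variant of the parse method above, reusing the same names:

    def parse(self, response):
        for item_name in response.css('li.item::text').getall():
            # urljoin resolves the relative path against the page being parsed
            yield scrapy.Request(
                url=response.urljoin(f'/one/{item_name}'),
                callback=self.parse_item
            )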

Related

How to make full URLs via Xpath from relative URLs?

<td class="searchResultsLargeThumbnail" data-hj-suppress="">
<a href="/ilan/emlak-konut-satilik-atasehir-agaoglu-soutside-2-plus1-ferah-cephe-iyi-konum-1057265758/detay" title="ATAŞEHİR AĞAOĞLU SOUTSİDE 2+1 FERAH CEPHE İYİ KONUM">
...
<a href="/ilan/emlak-konut-satilik-atapark-konutlarinda-buyuk-tip-2-plus1-ebeveyn-banyolu-102-m-daire-1057086925/detay" title="Atapark Konutlarında Büyük Tip 2+1 Ebeveyn Banyolu 102 m² Daire">
...
<a href="/ilan/emlak-konut-satilik-metropol-istanbul-yuksek-katli-cift-banyolu-satilik-2-plus1-daire-1049614464/detay" title="Metropol İstanbul Yüksek Katlı Çift Banyolu Satılık 2+1 Daire">
...
There is a website with a page like this. I am trying to scrape each ad's inner-page information. For that, I need the absolute links of the pages instead of the relative links.
After running this code:
import scrapy

class AtasehirSpider(scrapy.Spider):
    name = 'atasehir'
    allowed_domains = ['www.sahibinden.com']
    start_urls = ['https://www.sahibinden.com/satilik/istanbul-atasehir?address_region=2']

    def parse(self, response):
        for ad in response.xpath("//td[@class='searchResultsLargeThumbnail']/a/@href"):
            print(ad.get())
I get an output like this:
/ilan/emlak-konut-satilik-atasehir-agaoglu-soutside-2-plus1-ferah-cephe-iyi-konum-1057265758/detay
/ilan/emlak-konut-satilik-atapark-konutlarinda-buyuk-tip-2-plus1-ebeveyn-banyolu-102-m-daire-1057086925/detay
/ilan/emlak-konut-satilik-metropol-istanbul-yuksek-katli-cift-banyolu-satilik-2-plus1-daire-1049614464/detay
...
2022-10-14 03:37:23 [scrapy.core.engine] INFO: Closing spider (finished)
I've tried several solutions from here.
def parse(self, response):
    for ad in response.xpath("//td[@class='searchResultsLargeThumbnail']/a/@href"):
        if not ad.startswith('http'):
            ad = urljoin(base_url, ad)
        print(ad.get())

def parse(self, response):
    for ad in response.xpath("//td[@class='searchResultsLargeThumbnail']/a/@href"):
        yield response.follow(ad, callback=self.parse)
        print(ad.get())
I think "follow()" possesses quite an easy way to solve the problem but I could not overcome this error due to not having enough notion of programming.
Scrapy has a built-in method for this: response.urljoin(). You can call it on every link, whether it is a relative URL or not, since the Scrapy implementation does the checking for you. It only requires one argument because it automatically uses the URL of the response it was generated from as the base.
for example:
def parse(self, response):
    for ad in response.xpath("//td[@class='searchResultsLargeThumbnail']/a/@href").getall():
        ad = response.urljoin(ad)
        print(ad)
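Since the question mentions follow(): response.follow also accepts relative URLs and schedules the request to the ad's inner page in one step. A minimal sketch, where parse_ad is a hypothetical callback for the detail page (not part of the original code):

def parse(self, response):
    for href in response.xpath("//td[@class='searchResultsLargeThumbnail']/a/@href").getall():
        # follow() resolves the relative href against the current response and yields a Request
        yield response.follow(href, callback=self.parse_ad)

def parse_ad(self, response):
    # hypothetical detail-page callback; extract whatever fields are needed here
    yield {'url': response.url, 'title': response.css('title::text').get()}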
You can try something like this:
def parse(self, response):
    for ad in response.xpath("//td[@class='searchResultsLargeThumbnail']/a/@href").getall():
        ad_url = f"https://www.sahibinden.com{ad}"
        print(ad_url)

Scraping the next page using Scrapy and CSS

I am scraping a Zomato page, and I need the item name and description from the next page. I am comfortable with CSS selectors, so I am using those. I have created another function, parsenext, to do so, but I am not able to figure out the logic I should write there. I am new to Scrapy. I need something like what I have written for the restaurant name.
def parse(self, response):
    rest = response.css(".result-order-flow-title.hover_feedback.zred.bold.fontsize0.ln20::attr(title)").extract()
    for restaurant in zip(rest):
        scrapped_info = {
            'restaurant': restaurant[0],
        }
        yield scrapped_info
    nextpage = response.css('.result-order-flow-title.hover_feedback.zred.bold.fontsize0.ln20::attr(href)').extract()
    if nextpage is not None:
        yield scrapy.Request(response.urljoin(nextpage), callback=self.parsenext)

def parsenext(self, response):
    # I don't know what logic to write here
    pass
def parse(self, response):
    rest = response.css(".result-order-flow-title.hover_feedback.zred.bold.fontsize0.ln20::attr(title)").extract()
    for restaurant in zip(rest):
        scrapped_info = {
            'restaurant': restaurant[0],
        }
        yield scrapped_info
    # extract() returns a list, so take the first matching href before calling urljoin
    nextpage = response.css('.result-order-flow-title.hover_feedback.zred.bold.fontsize0.ln20::attr(href)').extract_first()
    if nextpage is not None:
        yield scrapy.Request(response.urljoin(nextpage), callback=self.parse)
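If the next page is a restaurant's detail page rather than another listing page, a separate callback is usually clearer than reusing parse. A minimal sketch of that pattern; the name and description selectors on the detail page are hypothetical placeholders, not taken from the actual Zomato markup:

def parse(self, response):
    for link in response.css(".result-order-flow-title.hover_feedback.zred.bold.fontsize0.ln20::attr(href)").getall():
        # follow each restaurant link and parse its detail page in a dedicated callback
        yield response.follow(link, callback=self.parse_detail)

def parse_detail(self, response):
    yield {
        # hypothetical selectors for the detail page
        'name': response.css('h1::text').get(),
        'description': response.css('div.description::text').get(),
    }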

Is it good to access spider attributes in scrapy pipeline?

The Scrapy pipeline documentation says that one of the parameters of the process_item method is the spider:
process_item(self, item, spider)
Parameters:
item (Item object or a dict) – the item scraped
spider (Spider object) – the spider which scraped the item
I want to send a list of one type of item to the pipeline, but after much digging around the internet, everyone is either yielding or returning items to the pipeline one at a time.
SamplerSpider.py
class SamplerSpider(scrapy.Spider):
    name = 'SamplerSpider'
    allowed_domains = ['xxx.com']
    start_urls = (CONSTANTS.URL)
    result = []
pipeline.py
class SamplerSpiderPipeline(object):
    def __init__(self):
        # do something here
        ...

    def process_item(self, item, spider):
        # do something with spider.result
        ...
Is this a good way to do it? If not, why not?
Scraping information from a document will always result in more than one item. Why is the Scrapy pipeline designed to process items one at a time?
Thanks in advance.
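A minimal sketch of the pattern described above, assuming the spider accumulates its scraped data in the result class attribute: process_item can read (or append to) spider.result directly, and the close_spider hook runs once at the end of the crawl, where the full list is available.

class SamplerSpiderPipeline(object):
    def process_item(self, item, spider):
        # spider is the running SamplerSpider instance, so its attributes are reachable here
        spider.result.append(item)
        return item

    def close_spider(self, spider):
        # called once when the spider finishes; the complete list can be handled here
        spider.logger.info("collected %d items", len(spider.result))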

Scrapy spider not following pagination

I am using the code from this link (https://github.com/eloyz/reddit/blob/master/reddit/spiders/pic.py), but somehow I am unable to visit the paginated pages.
I am using Scrapy 1.3.0.
You don't have any mechanism for processing the next page; all you do is gather images.
Here is what you should be doing. I wrote some selectors but didn't test them.
from scrapy.spiders import Spider
from scrapy.selector import Selector
from scrapy import Request

class xxx_spider(Spider):
    name = "xxx"
    allowed_domains = ["xxx.com"]

    def start_requests(self):
        url = 'first page url'
        yield Request(url=url, callback=self.parse, meta={"page": 1})

    def parse(self, response):
        page = response.meta["page"] + 1
        html = Selector(response)

        pics = html.css('div.thing')
        for selector in pics:
            item = PicItem()  # PicItem is the item class defined in the linked project's items.py
            item['image_urls'] = selector.xpath('a/@href').extract()
            item['title'] = selector.xpath('div/p/a/text()').extract()
            item['url'] = selector.xpath('a/@href').extract()
            yield item

        # css() returns a SelectorList, so take the first href and build an absolute URL from it
        next_link = html.css("span.next-button a::attr(href)").extract_first()
        if next_link is not None:
            yield Request(url=response.urljoin(next_link), callback=self.parse, meta={"page": page})
This is similar to what you did, but after getting the images, I check the next page link; if it exists, I yield another request with it.
Mehmet
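On newer Scrapy versions (1.4+), the same pagination step can also be written with response.follow, which resolves the relative href for you. A minimal sketch of just that part:

        next_link = response.css("span.next-button a::attr(href)").extract_first()
        if next_link is not None:
            yield response.follow(next_link, callback=self.parse, meta={"page": page})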

Exporting unique items from CrawlSpider

I am using Scrapy's CrawlSpider class to iterate over the list of start_urls and crawl each site's internal pages to fetch e-mail addresses. I would like to export a file with a single (unique) item for each start_url, containing the list of matched e-mails. For that purpose I needed to override the make_requests_from_url and parse methods so I could pass each start_url item in the response's meta dict (see code) to the internal pages. The output from running this code is:
www.a.com,['webmaster@a.com']
www.a.com,['webmaster@a.com','info@a.com']
www.a.com,['webmaster@a.com','info@a.com','admin@a.com']
However, I only want the export file to contain the last entry from the above output:
(www.a.com,['admin@a.com', 'webmaster@a.com', 'info@a.com'])
Is that possible?
Code:
class MySpider(CrawlSpider):
    start_urls = [... urls list ...]

    def parse(self, response):
        for request_or_item in CrawlSpider.parse(self, response):
            if isinstance(request_or_item, Request):
                request_or_item.meta.update(dict(url_item=response.meta['url_item']))
            yield request_or_item

    def make_requests_from_url(self, url):
        # Create a unique item for each url. Append e-mails to this item from internal pages.
        url_item = MyItem()
        url_item["url"] = url
        url_item["emails"] = []
        return Request(url, dont_filter=True, meta={'url_item': url_item})

    def parse_page(self, response):
        url_item = response.meta["url_item"]
        url_item["emails"].append(** some regex of emails from the response object **)
        return url_item
You could use a pipeline to process the items.
See the Duplicates filter example in the Scrapy documentation.
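A minimal sketch of how such a pipeline could look for this case, assuming the spider keeps re-yielding the same url_item as in the code above, so the last item seen for each URL carries the full e-mail list. The output file name and JSON-lines format are arbitrary choices for the sketch:

import json

class UniqueUrlItemPipeline(object):
    def open_spider(self, spider):
        self.items_by_url = {}

    def process_item(self, item, spider):
        # each yield for the same start_url overwrites the previous entry,
        # so only the final (most complete) e-mail list per URL is kept
        self.items_by_url[item["url"]] = dict(item)
        return item

    def close_spider(self, spider):
        # write one merged entry per start_url once the crawl has finished
        with open("emails_by_url.jl", "w") as f:
            for entry in self.items_by_url.values():
                f.write(json.dumps(entry) + "\n")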