How to make full URLs from relative URLs extracted via XPath? - scrapy

<td class="searchResultsLargeThumbnail" data-hj-suppress="">
<a href="/ilan/emlak-konut-satilik-atasehir-agaoglu-soutside-2-plus1-ferah-cephe-iyi-konum-1057265758/detay" title="ATAŞEHİR AĞAOĞLU SOUTSİDE 2+1 FERAH CEPHE İYİ KONUM">
...
<a href="/ilan/emlak-konut-satilik-atapark-konutlarinda-buyuk-tip-2-plus1-ebeveyn-banyolu-102-m-daire-1057086925/detay" title="Atapark Konutlarında Büyük Tip 2+1 Ebeveyn Banyolu 102 m² Daire">
...
<a href="/ilan/emlak-konut-satilik-metropol-istanbul-yuksek-katli-cift-banyolu-satilik-2-plus1-daire-1049614464/detay" title="Metropol İstanbul Yüksek Katlı Çift Banyolu Satılık 2+1 Daire">
...
There is a website with a page like the one above. I am trying to scrape the information on each ad's inner (detail) page. For this iteration, I need the absolute URLs of those pages instead of the relative links.
After running this code:
import scrapy


class AtasehirSpider(scrapy.Spider):
    name = 'atasehir'
    allowed_domains = ['www.sahibinden.com']
    start_urls = ['https://www.sahibinden.com/satilik/istanbul-atasehir?address_region=2']

    def parse(self, response):
        for ad in response.xpath("//td[@class='searchResultsLargeThumbnail']/a/@href"):
            print(ad.get())
I get an output like this:
/ilan/emlak-konut-satilik-atasehir-agaoglu-soutside-2-plus1-ferah-cephe-iyi-konum-1057265758/detay
/ilan/emlak-konut-satilik-atapark-konutlarinda-buyuk-tip-2-plus1-ebeveyn-banyolu-102-m-daire-1057086925/detay
/ilan/emlak-konut-satilik-metropol-istanbul-yuksek-katli-cift-banyolu-satilik-2-plus1-daire-1049614464/detay
...
2022-10-14 03:37:23 [scrapy.core.engine] INFO: Closing spider (finished)
I've tried several solutions from here.
def parse(self, response):
    for ad in response.xpath("//td[@class='searchResultsLargeThumbnail']/a/@href"):
        if not ad.startswith('http'):
            ad = urljoin(base_url, ad)
        print(ad.get())
def parse(self, response):
    for ad in response.xpath("//td[@class='searchResultsLargeThumbnail']/a/@href"):
        yield response.follow(ad, callback=self.parse)
        print(ad.get())
I think "follow()" possesses quite an easy way to solve the problem but I could not overcome this error due to not having enough notion of programming.

Scrapy has a built-in method for this: response.urljoin(). You can call it on every link, whether it is a relative URL or not; the Scrapy implementation does the checking for you. It only requires one argument because it automatically resolves against the URL that generated the response.
For example:
def parse(self, response):
    for ad in response.xpath("//td[@class='searchResultsLargeThumbnail']/a/@href").getall():
        ad = response.urljoin(ad)
        print(ad)
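If the end goal is to crawl each ad's detail page rather than just print the absolute URL, response.follow() accepts the relative href directly and builds the Request for you. A minimal sketch, where parse_ad is a hypothetical callback for the detail pages:
def parse(self, response):
    for href in response.xpath("//td[@class='searchResultsLargeThumbnail']/a/@href").getall():
        # follow() resolves the relative href against the current response URL
        yield response.follow(href, callback=self.parse_ad)

def parse_ad(self, response):
    # hypothetical callback: pull whatever fields you need from the detail page
    yield {"url": response.url, "title": response.xpath("//title/text()").get()}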

You can try something like this:
def parse(self, response):
    for ad in response.xpath("//td[@class='searchResultsLargeThumbnail']/a/@href").getall():
        ad_url = f"https://www.sahibinden.com{ad}"
        print(ad_url)
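Note that hard-coding the domain only works while the hrefs stay relative; the standard library's urllib.parse.urljoin handles both relative and absolute hrefs. A small sketch of the same loop using it:
from urllib.parse import urljoin

def parse(self, response):
    for ad in response.xpath("//td[@class='searchResultsLargeThumbnail']/a/@href").getall():
        # urljoin leaves absolute hrefs untouched and resolves relative ones
        ad_url = urljoin(response.url, ad)
        print(ad_url)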

Related

How to call one spider from another spider in Scrapy

I have two spiders, and I want one to call the other with the scraped information, which is not made up of links that I could follow. Is there any way of doing this, calling one spider from another?
To illustrate the problem better: the URL of the "one" page has the form /one/{item_name}, where {item_name} is the information I can get from the /other/ page:
...
<li class="item">item1</li>
<li class="item">someItem</li>
<li class="item">anotherItem</li>
...
Then I have the spider OneSpider that scrapes /one/{item_name}, and OtherSpider that scrapes /other/ and retrieves the item names, as shown below:
class OneSpider(Spider):
    name = 'one'

    def __init__(self, item_name, *args, **kargs):
        super(OneSpider, self).__init__(*args, **kargs)
        self.start_urls = [f'/one/{item_name}']

    def parse(self, response):
        ...


class OtherSpider(Spider):
    name = 'other'
    start_urls = ['/other/']

    def parse(self, response):
        itemNames = response.css('li.item::text').getall()
        # TODO:
        # for each item name
        # scrape /one/{item_name}
        # with the OneSpider
I already checked these two questions: How to call particular Scrapy spiders from another Python script, and scrapy python call spider from spider, and several other questions where the main solution is creating another method inside the class and passing it as a callback to new requests, but I don't think that is applicable when these new requests would have customized URLs.
Scrapy doesn't provide a way to call one spider from another spider.
There is a related issue in the Scrapy GitHub repo.
However, you can merge the logic from your two spiders into a single spider class:
import scrapy


class OtherSpider(scrapy.Spider):
    name = 'other'
    start_urls = ['/other/']

    def parse(self, response):
        itemNames = response.css('li.item::text').getall()
        for item_name in itemNames:
            yield scrapy.Request(
                url=f'/one/{item_name}',
                callback=self.parse_item
            )

    def parse_item(self, response):
        # parse method from your OneSpider class
        ...
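One thing to keep in mind: scrapy.Request needs an absolute URL, so if /one/{item_name} is a path on the same site, resolving it against the current response avoids a "Missing scheme" error. A small sketch of that loop under the same hypothetical paths:
def parse(self, response):
    for item_name in response.css('li.item::text').getall():
        # urljoin builds an absolute URL from the relative /one/... path
        yield scrapy.Request(
            url=response.urljoin(f'/one/{item_name}'),
            callback=self.parse_item,
        )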

Scraping the next page using Scrapy and CSS

I am scraping a Zomato page and I need the item name and description from the next page. I am comfortable with CSS selectors, so I am using those. I have created another function, parsenext, to do this, but I can't figure out what logic I should write there. I am new to Scrapy. I need something like what I have written for the restaurant name.
def parse(self, response):
    rest = response.css(".result-order-flow-title.hover_feedback.zred.bold.fontsize0.ln20::attr(title)").extract()
    for restaurant in zip(rest):
        scrapped_info = {
            'restaurant': restaurant[0],
        }
        yield scrapped_info
    nextpage = response.css('.result-order-flow-title.hover_feedback.zred.bold.fontsize0.ln20::attr(href)').extract()
    if nextpage is not None:
        yield scrapy.Request(response.urljoin(nextpage), callback=self.parsenext)

def parsenext(self, response):
You can let parse call itself as the callback for the next page instead of using a separate parsenext:
def parse(self, response):
    rest = response.css(".result-order-flow-title.hover_feedback.zred.bold.fontsize0.ln20::attr(title)").extract()
    for restaurant in zip(rest):
        scrapped_info = {
            'restaurant': restaurant[0],
        }
        yield scrapped_info
    # extract_first() returns a single href string (extract() returns a list)
    nextpage = response.css('.result-order-flow-title.hover_feedback.zred.bold.fontsize0.ln20::attr(href)').extract_first()
    if nextpage is not None:
        yield scrapy.Request(response.urljoin(nextpage), callback=self.parse)
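If you do keep a separate callback for the followed page, it would look something like the sketch below; the CSS classes are hypothetical placeholders, since they depend on the markup of the actual page:
def parsenext(self, response):
    # hypothetical selectors -- replace with the real classes from the target page
    for item in response.css('div.menu-item'):
        yield {
            'name': item.css('div.item-name::text').get(),
            'description': item.css('div.item-desc::text').get(),
        }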

How to use Python Scrapy for multiple URLs

My question is similar to this post:
How to use scrapy for Amazon.com links after "Next" Button?
I want my crawler to traverse all the "Next" links. I've searched a lot, but most people either focus on how to parse the URL or simply put all the URLs in the initial URL list.
So far, I am able to visit the first page and parse the next page's link. But I don't know how to visit that page using the same crawler (spider). I tried to append the new URL to my URL list, and it does get appended (I checked the length), but the spider never visits the link. I have no idea why...
Note that in my case, I only know the first page's URL. The second page's URL can only be obtained after visiting the first page. Likewise, the (i+1)-th page's URL is hidden in the i-th page.
In the parse function, I can parse and print the correct next-page URL. I just don't know how to visit it.
Please help me. Thank you!
import scrapy
from bs4 import BeautifulSoup


class RedditSpider(scrapy.Spider):
    name = "test2"
    allowed_domains = ["http://www.reddit.com"]
    urls = ["https://www.reddit.com/r/LifeProTips/search?q=timestamp%3A1427232122..1437773560&sort=new&restrict_sr=on&syntax=cloudsearch"]

    def start_requests(self):
        for url in self.urls:
            yield scrapy.Request(url, self.parse, meta={
                'splash': {
                    'endpoint': 'render.html',
                    'args': {'wait': 0.5}
                }
            })

    def parse(self, response):
        page = response.url[-10:]
        print(page)
        filename = 'reddit-%s.html' % page
        # parse html for next link
        soup = BeautifulSoup(response.body, 'html.parser')
        mydivs = soup.findAll("a", {"rel": "nofollow next"})
        link = mydivs[0]['href']
        print(link)
        self.urls.append(link)
        with open(filename, 'wb') as f:
            f.write(response.body)
Update
Thanks to Kaushik's answer, I figured out how to make it work, though I still don't know why my original idea of appending new URLs doesn't work...
The updated code is as follows:
import scrapy
from bs4 import BeautifulSoup


class RedditSpider(scrapy.Spider):
    name = "test2"
    urls = ["https://www.reddit.com/r/LifeProTips/search?q=timestamp%3A1427232122..1437773560&sort=new&restrict_sr=on&syntax=cloudsearch"]

    def start_requests(self):
        for url in self.urls:
            yield scrapy.Request(url, self.parse, meta={
                'splash': {
                    'endpoint': 'render.html',
                    'args': {'wait': 0.5}
                }
            })

    def parse(self, response):
        page = response.url[-10:]
        print(page)
        filename = 'reddit-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        # parse html for next link
        soup = BeautifulSoup(response.body, 'html.parser')
        mydivs = soup.findAll("a", {"rel": "nofollow next"})
        if len(mydivs) != 0:
            link = mydivs[0]['href']
            print(link)
            # yield response.follow(link, callback=self.parse)
            yield scrapy.Request(link, callback=self.parse)
What you require is explained very well in the Scrapy docs. I don't think you need any explanation other than that; I suggest going through it once for a better understanding.
A brief explanation first though:
To follow a link to the next page, Scrapy provides several methods. The most basic one is using the http.Request object.
Request object:
class scrapy.http.Request(url[, callback, method='GET', headers, body,
                          cookies, meta, encoding='utf-8', priority=0,
                          dont_filter=False, errback, flags])
>>> yield scrapy.Request(url, callback=self.next_parse)
url (string) – the URL of this request
callback (callable) – the function that will be called with the response of this request (once it's downloaded) as its first parameter.
For convenience though, Scrapy has a built-in shortcut for creating Request objects: response.follow, where the URL can be an absolute or a relative path.
follow(url, callback=None, method='GET', headers=None, body=None,
       cookies=None, meta=None, encoding=None, priority=0, dont_filter=False,
       errback=None)
>>> yield response.follow(url, callback=self.next_parse)
In case you have to move to the next link by passing values to a form or any other type of input field, you can use FormRequest objects. The FormRequest class extends the base Request with functionality for dealing with HTML forms. It uses lxml.html forms to pre-populate form fields with form data from Response objects.
FormRequest object:
from_response(response[, formname=None, formid=None, formnumber=0,
              formdata=None, formxpath=None, formcss=None, clickdata=None,
              dont_click=False, ...])
If you want to simulate an HTML form POST in your spider and send a couple of key-value fields, you can return a FormRequest object (from your spider) like this:
return [FormRequest(url="http://www.example.com/post/action",
                    formdata={'name': 'John Doe', 'age': '27'},
                    callback=self.after_post)]
Note: If a Request doesn't specify a callback, the spider's parse() method will be used. If exceptions are raised during processing, errback is called instead.
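Applied to your spider, the parse method could look like the sketch below. The CSS attribute selector for the "next" link is an assumption based on the rel="nofollow next" attribute your BeautifulSoup code looks for:
def parse(self, response):
    page = response.url[-10:]
    filename = 'reddit-%s.html' % page
    with open(filename, 'wb') as f:
        f.write(response.body)

    # assumed selector: an <a> whose rel attribute is exactly "nofollow next"
    next_href = response.css('a[rel="nofollow next"]::attr(href)').get()
    if next_href:
        # follow() resolves relative links and builds the Request for us
        yield response.follow(next_href, callback=self.parse)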

Scrapy spider not following pagination

I am using the code from this link (https://github.com/eloyz/reddit/blob/master/reddit/spiders/pic.py), but somehow I am unable to visit the paginated pages.
I am using Scrapy 1.3.0.
You don't have any mechanism for processing the next page; all you do is gather images.
Here is what you should be doing. I wrote some selectors but didn't test them.
from scrapy.spiders import Spider
from scrapy.selector import Selector
from scrapy import Request
import urlparse


class xxx_spider(Spider):
    name = "xxx"
    allowed_domains = ["xxx.com"]

    def start_requests(self):
        url = 'first page url'
        yield Request(url=url, callback=self.parse, meta={"page": 1})

    def parse(self, response):
        page = response.meta["page"] + 1
        html = Selector(response)

        pics = html.css('div.thing')
        for selector in pics:
            item = PicItem()
            item['image_urls'] = selector.xpath('a/@href').extract()
            item['title'] = selector.xpath('div/p/a/text()').extract()
            item['url'] = selector.xpath('a/@href').extract()
            yield item

        next_link = html.css("span.next-button a::attr(href)").extract_first()
        if next_link is not None:
            yield Request(url=response.urljoin(next_link), callback=self.parse, meta={"page": page})
It is similar to what you did, but after I get the images, I check the next-page link; if it exists, I yield another request with it.
Mehmet

How do I get past an accept dialog using scrapy?

I need to scrape a site that has an accept dialog that I need to get through first. The form is as follows:
<form action="/lst_sale/" method="post"> <input class="button" type="submit" value="Accept"/> </form>
Clicking the accept button takes me to the page with a table that I need to parse. Right now I have:
# function to parse markup
def parse(self, response):
    yield FormRequest(url="http://www.somedomain.com/lst_sale",
                      method="POST",
                      formdata={},
                      callback=self.parse_list)

def parse_list(self, response):
    # do something...
The problem is that parse_list never gets called, so I am assuming that the form POST is not happening. Any ideas on how I can get this to work?
Thx!
Found the answer. It turns out I was not sending the proper values. It now works using:
def parse(self, response):
    yield FormRequest.from_response(
        response,
        formdata={"value": "Accept"},
        callback=self.after_accept)

def after_accept(self, response):
    yield Request("http://example.com?some_vars=some_values", callback=self.parse_list)

def parse_list(self, response):
    # begin scraping!
This handles the ASP.NET_SessionId for me. I set COOKIES_DEBUG = True in settings.py, and that showed me that the sessions were indeed being handled, which led me to the root of my issue, I hope.
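For reference, COOKIES_DEBUG is a standard Scrapy setting; turning it on in settings.py logs every cookie sent and received, which is how the session handling above was verified:
# settings.py
# log all Cookie request headers and Set-Cookie response headers
COOKIES_DEBUG = True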