Scrapy find all links with different(similar) class - scrapy

I'm trying to scrape links with a certain class, "post-item post-item-xxxxx". But since the class is different on each item, how can I capture all of them?
<li class="post-item post-item-18887"><a href="http://example.com/archives/18887.html" title="Post1"></a></li>
<li class="post-item post-item-18883"><a href="http://example.com/archives/18883.html" title="Post2"></a></li>
my code:
# scrape all the cafe links from example.com
class DengaSpider(scrapy.Spider):
    name = 'cafes'
    allowed_domains = ['example.com']
    start_urls = [
        'http://example.com/archives/8136.html',
    ]

    rules = [
        Rule(
            LinkExtractor(
                allow=('^http://example\.com/archives/\d+.html$'),
                unique=True
            ),
            follow=True,
            callback="parse_items"
        )
    ]

    def parse(self, response):
        cafelink = response.css('post.item').xpath('//a/@href').extract()
        if cafelink is not None:
            print(cafelink)
The .css part is not working; how can I fix it?

Here's a sample run for the above html in scrapy shell:
>>> from scrapy.http import HtmlResponse
>>> response = HtmlResponse(url="Test HTML String", body='<li class="post-item post-item-18887"><a href="http://example.com/archives/18887.html" title="Post1"></a></li><li class="post-item post-item-18883"><a href="http://example.com/archives/18883.html" title="Post2"></a></li>', encoding='utf-8')
>>>
>>> cafelink = response.css('li.post-item a::attr(href)').extract_first()
>>> cafelink
'http://example.com/archives/18887.html'
>>>
>>> cafelink = response.css('li.post-item a::attr(href)').extract()
>>> cafelink
['http://example.com/archives/18887.html', 'http://example.com/archives/18883.html']

XPath has the contains() function for this, so you might try:
cafelink = response.xpath("//*[contains(@class, 'post-item-')]//a/@href").extract()
Also be careful when using // in XPath: it makes the search start from the document root, no matter where the current context is.
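XPath also offers a starts-with() function; with the markup shown above, a roughly equivalent query (a sketch, not part of the original answer) would be:
cafelink = response.xpath("//li[starts-with(@class, 'post-item')]//a/@href").extract()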

If all the items you want also have the "post-item" class then why do you need to capture them by their other class? In case you still need to do that, try the "starts with" CSS selector:
response.css('li[class^="post-item post-item-"]')
Documentation here.
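Putting either selector to work inside a spider, a minimal parse method might look like this (just a sketch; it assumes the <li>/<a> markup shown in the question, and the cafelink name is taken from the question's code):
def parse(self, response):
    # grab the href of every link inside an <li> whose class starts with "post-item"
    for href in response.css('li[class^="post-item"] a::attr(href)').extract():
        yield {'cafelink': href}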

Related

Crawling single pages with scrapy.Spider works but not for entire website with CrawlSpider

Need some help here. My code works when I crawl a single page with scrapy.Spider. However, once I switch to CrawlSpider to scrape the entire website, it does not seem to work at all.
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class QuotesSpider(CrawlSpider):
    name = "quotes"
    allowed_domains = ['reifen.check24.de']
    start_urls = [
        'https://reifen.check24.de/pkw-sommerreifen/toyo-proxes-cf2-205-55r16-91h-2276003?label=ppc',
        'https://reifen.check24.de/pkw-sommerreifen/michelin-pilot-sport-4-205-55zr16-91w-213777?label=pc'
    ]

    rules = (
        Rule(LinkExtractor(deny=('cart')), callback='parse_item', follow=True),
    )

    def parse(self, response):
        for quote in response.xpath('/html/body/div[2]/div/section/div/div/div[1]'):
            yield {
                'brand': quote.xpath('//tbody//tr[1]//td[2]//text()').get(),
                'pattern': quote.xpath('//tbody//tr[3]//td[2]//text()').get(),
                'size': quote.xpath('//tbody//tr[6]//td[2]//text()').get(),
                'RR': quote.xpath('div[1]/div[1]/div/div[1]/div[2]/span/span/span/div/div/div[1]/span/text()').get(),
                'WL': quote.xpath('div[1]/div[1]/div/div[1]/div[2]/span/span/span/div/div/div[2]/span/text()').get(),
                'noise': quote.xpath('div[1]/div[1]/div/div[1]/div[2]/span/span/span/div/div/div[3]/span/text()').get(),
            }
Am I missing something?
You have a tiny mistake:
rules = (
    Rule(LinkExtractor(deny=('cart')), callback='parse_item', follow=True),
)
should be:
rules = (
    Rule(LinkExtractor(deny=('cart')), callback='parse', follow=True),
)
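Keep in mind that CrawlSpider uses the parse method itself to implement its crawling logic, so another way to resolve the name mismatch is to leave callback='parse_item' in the rule and rename the spider's method instead. A minimal sketch of that variant (same extraction logic as in the question, only the method name changes):
    rules = (
        Rule(LinkExtractor(deny=('cart')), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        # renamed from parse() so it no longer overrides CrawlSpider's built-in parse()
        for quote in response.xpath('/html/body/div[2]/div/section/div/div/div[1]'):
            yield {'brand': quote.xpath('//tbody//tr[1]//td[2]//text()').get()}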

What are the correct tags and properties to select?

I want to crawl a web site (http://theschoolofkyiv.org/participants/220/dan-acostioaei) to extract only the artist's name and biography. With the tags and properties I have defined, the output contains none of the text I want to see.
I am using Scrapy to crawl the site. For other websites it works fine, but here I cannot seem to define the correct tags or properties. Can you please have a look at my code?
This is the code that I used to crawl the website.
import scrapy
from scrapy.selector import Selector
from artistlist.items import ArtistlistItem


class ArtistlistSpider(scrapy.Spider):
    name = "artistlist"
    allowed_domains = ["theschoolofkyiv.org"]
    start_urls = ['http://theschoolofkyiv.org/participants/220/dan-acostioaei']

    def parse(self, response):
        titles = response.xpath("//div[@id='participants']")
        for titles in titles:
            item = ArtistlistItem()
            item['artist'] = response.css('.ng-binding::text').extract()
            item['biography'] = response.css('p::text').extract()
            yield item
This is the output that I get:
{'artist': [],
'biography': ['\n ',
'\n ',
'\n ',
'\n ',
'\n ',
'\n ']}
Simple illustration (assuming you already know about AJAX request mentioned by Tony Montana):
import scrapy
import re
import json
from artistlist.items import ArtistlistItem


class ArtistlistSpider(scrapy.Spider):
    name = "artistlist"
    allowed_domains = ["theschoolofkyiv.org"]
    start_urls = ['http://theschoolofkyiv.org/participants/220/dan-acostioaei']

    def parse(self, response):
        participant_id = re.search(r'participants/(\d+)', response.url).group(1)
        if participant_id:
            yield scrapy.Request(
                url="http://theschoolofkyiv.org/wordpress/wp-json/posts/{participant_id}".format(participant_id=participant_id),
                callback=self.parse_participant,
            )

    def parse_participant(self, response):
        data = json.loads(response.body)
        item = ArtistlistItem()
        item['artist'] = data["title"]
        item['biography'] = data["acf"]["en_participant_bio"]
        yield item
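If you want to sanity-check the endpoint before wiring it into the spider, scrapy shell works well; assuming the WordPress endpoint above still returns the same JSON shape:
$ scrapy shell "http://theschoolofkyiv.org/wordpress/wp-json/posts/220"
>>> import json
>>> data = json.loads(response.body)
>>> data["title"]                      # artist name, per the mapping in parse_participant
>>> data["acf"]["en_participant_bio"]  # biography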

Scrapy - Copying only the xpath into .csv file

I have many other scripts with similar basic code that work, but when I run this spider from the command line and open the .csv file to look at the saved "titles", I get the XPath expression copied into Excel instead of the text. Any idea why?
import scrapy


class MovieSpider(scrapy.Spider):
    name = 'movie'
    allowed_domains = ['https://www.imdb.com/search/title?start=1']
    start_urls = ['https://www.imdb.com/search/title?start=1/']

    def parse(self, response):
        titles = response.xpath('//*[@id="main"]/div/div/div[3]/div[1]/div[3]/h3/a')
        pass
        print(titles)
        for title in titles:
            yield {'Title': title}
--- Try Two Below:------
        for subject in titles:
            yield {
                'Title': subject.xpath('.//h3[@class="lister-item-header"]/a/text()').extract_first(),
                'Runtime': subject.xpath('.//p[@class="text-muted"]/span/text()').extract_first(),
                'Description': subject.xpath('.//p[@class="text-muted"]/p/text()').extract_first(),
                'Director': subject.xpath('.//*[@id="main"]/a/text()').extract_first(),
                'Rating': subject.xpath('.//div[@class="inline-block ratings-imdb-rating"]/strong/text()').extract_first()
            }
Use extract() or extract_first(), and also use shorter, more robust XPath expressions:
import scrapy


class MovieSpider(scrapy.Spider):
    name = 'movie'
    allowed_domains = ['https://www.imdb.com/search/title?start=1']
    start_urls = ['https://www.imdb.com/search/title?start=1/']

    def parse(self, response):
        subjects = response.xpath('//div[@class="lister-item mode-advanced"]')
        for subject in subjects:
            yield {
                'Title': subject.xpath('.//h3[@class="lister-item-header"]/a/text()').extract_first(),
                'Rating': subject.xpath('.//div[@class="inline-block ratings-imdb-rating"]/strong/text()').extract_first(),
                'Runtime': subject.xpath('.//span[@class="runtime"]/text()').extract_first(),
                'Description': subject.xpath('.//p[@class="text-muted"]/text()').extract_first(),
                'Director': subject.xpath('.//p[contains(text(), "Director")]/a[1]/text()').extract_first(),
            }
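To get these fields into a spreadsheet, run the spider with Scrapy's built-in feed export, for example:
scrapy crawl movie -o movies.csv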

How to restrict the area in which LinkExtractor is being applied?

I have a scraper with the following rules:
rules = (
    Rule(LinkExtractor(allow=('\S+list=\S+'))),
    Rule(LinkExtractor(allow=('\S+list=\S+'))),
    Rule(LinkExtractor(allow=('\S+view=1\S+')), callback='parse_archive'),
)
As you can see, two of these rules are exactly the same.
What I would like to do is tell Scrapy to extract only the links I am interested in, by restricting it to particular places within a page. For convenience, I am including the corresponding XPaths, although I would prefer a solution based on BeautifulSoup's syntax.
//*[#id="main_frame"]/tbody/tr[3]/td[2]/table/tbody/tr/td/div/table/tbody/tr/td[1]
//*[#id="main_frame"]/tbody/tr[3]/td[2]/table/tbody/tr/td/div/form/table/tbody/tr[1]
//*[#id="main_frame"]/tbody/tr[3]/td[2]/table/tbody/tr/td/div/form/table/tbody/tr[2]
EDIT:
Let me give you an example. Let's assume that I want to extract the five (out of six) links at the top of Scrapy's official page:
And here is my spider. Any ideas?
class dmozSpider(CrawlSpider):
    name = "dmoz"
    allowed_domains = ["scrapy.org"]
    start_urls = [
        "http://scrapy.org/",
    ]

    rules = (
        Rule(LinkExtractor(allow=('\S+/'), restrict_xpaths=('/html/body/div[1]/div/ul')), callback='first_level'),
    )

    def first_level(self, response):
        taco = dmozItem()
        taco['basic_url'] = response.url
        return taco
This can be done with the restrict_xpaths parameter. See the LxmlLinkExtractor documentation
Edit:
You can also pass a list to restrict_xpaths.
Edit 2:
Full example that should work:
import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor


class dmozItem(scrapy.Item):
    basic_url = scrapy.Field()


class dmozSpider(CrawlSpider):
    name = "dmoz"
    allowed_domains = ["scrapy.org"]
    start_urls = [
        "http://scrapy.org/",
    ]

    def clean_url(value):
        return value.replace('/../', '/')

    rules = (
        Rule(
            LinkExtractor(
                allow=('\S+/'),
                restrict_xpaths=(['.//ul[@class="navigation"]/a[1]',
                                  './/ul[@class="navigation"]/a[2]',
                                  './/ul[@class="navigation"]/a[3]',
                                  './/ul[@class="navigation"]/a[4]',
                                  './/ul[@class="navigation"]/a[5]']),
                process_value=clean_url
            ),
            callback='first_level'),
    )

    def first_level(self, response):
        taco = dmozItem()
        taco['basic_url'] = response.url
        return taco
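As a side note, current Scrapy releases moved these imports out of scrapy.contrib, and LinkExtractor also accepts a restrict_css argument, so a terser variant against the modern API (a sketch, not the original answer's code; it assumes the same ul.navigation menu targeted by the XPaths above) could be:
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class dmozSpider(CrawlSpider):
    name = "dmoz"
    allowed_domains = ["scrapy.org"]
    start_urls = ["http://scrapy.org/"]

    rules = (
        # restrict link extraction to the navigation list only
        Rule(LinkExtractor(restrict_css='ul.navigation'), callback='first_level'),
    )

    def first_level(self, response):
        return {'basic_url': response.url}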

How to use scrapy to crawl multiple pages? (two level)

On my site I created two simple pages:
Here are their first html script:
test1.html :
<head>
<title>test1</title>
</head>
<body>
<a href="test2.html" onclick="javascript:return xt_click(this, "C", "1", "Product", "N");" indepth="true">
<span>cool</span></a>
</body></html>
test2.html :
<head>
<title>test2</title>
</head>
<body></body></html>
I want to scrape the text inside the title tag of the two pages, here "test1" and "test2".
But I am a novice with Scrapy and I only manage to scrape the first page.
my scrapy script:
from scrapy.spider import Spider
from scrapy.selector import Selector
from testscrapy1.items import Website


class DmozSpider(Spider):
    name = "bill"
    allowed_domains = ["http://exemple.com"]
    start_urls = [
        "http://www.exemple.com/test1.html"
    ]

    def parse(self, response):
        sel = Selector(response)
        sites = sel.xpath('//head')
        items = []
        for site in sites:
            item = Website()
            item['title'] = site.xpath('//title/text()').extract()
            items.append(item)
        return items
How do I handle the onclick?
And how do I successfully scrape the text of the title tag of the second page?
Thank you in advance
STEF
To use multiple functions in your code, that is, to send multiple requests and parse them, you're going to need two things: 1) yield instead of return, and 2) the callback argument.
Example:
def parse(self, response):
    for site in response.xpath('//head'):
        item = Website()
        item['title'] = site.xpath('//title/text()').extract()
        yield item
    yield scrapy.Request(url="http://www.domain.com", callback=self.other_function)

def other_function(self, response):
    for other_thing in response.xpath('//this_xpath'):
        item = Website()
        item['title'] = other_thing.xpath('//this/and/that').extract()
        yield item
You cannot parse javascript with scrapy, but you can understand what the javascript does and do the same: http://doc.scrapy.org/en/latest/topics/firebug.html
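Applied to the two test pages above, a sketch of the whole flow could look like the following (the Website item and the exemple.com URLs come from the question; the class name and the parse_second callback are illustrative, and response.urljoin is used so the relative test2.html link resolves correctly):
import scrapy
from testscrapy1.items import Website


class BillSpider(scrapy.Spider):
    name = "bill"
    allowed_domains = ["exemple.com"]
    start_urls = ["http://www.exemple.com/test1.html"]

    def parse(self, response):
        # title of test1.html
        item = Website()
        item['title'] = response.xpath('//title/text()').extract()
        yield item

        # follow the relative link to test2.html and scrape its title too
        href = response.xpath('//a/@href').extract_first()
        if href:
            yield scrapy.Request(response.urljoin(href), callback=self.parse_second)

    def parse_second(self, response):
        # title of test2.html
        item = Website()
        item['title'] = response.xpath('//title/text()').extract()
        yield item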