Broad Scrapy crawl: SgmlLinkExtractor rule does not work - scrapy

I've spent a lot of time playing around and searching Google, but I could not solve my problem. I am new to Scrapy and I hope you can help me.
The part of the spider that works: I define my start_requests URLs out of a MySQL database. In the 'parse_item' method I write the response into separate files. Both of these steps work fine.
My problem: Additionally, I want to follow every URL which contains '.ch' and - as I do for the start_requests - send it to the 'parse_item' method. I therefore defined a rule with an SgmlLinkExtractor and the 'parse_item' method as the callback. This does not work: after completion I only have the files for the URLs defined in 'start_requests', and I don't get any error messages.
Here is my code:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
import mysql.connector
from scrapy.http import Request

class FirstSpider(CrawlSpider):
    name = 'firstspider'
    start_urls = []

    def start_requests(self):
        conn = mysql.connector.connect(user='root', password='root', host='localhost', database='Eistee')
        cursor = conn.cursor()
        query = ("SELECT Domain, CompanyName FROM Crawlbydomain LIMIT 300, 100")
        cursor.execute(query)
        results = cursor.fetchall()
        for result in results:
            urlrequest = 'http://' + result[0]
            yield Request(urlrequest, callback=self.parse_item)

    rules = (Rule(SgmlLinkExtractor(allow=('.ch', )), callback='parse_item', follow=True),)

    def parse_item(self, response):
        filename = response.url.translate(None, './')
        open(filename, 'wb').write(response.body)
Can you help me?

To make CrawlSpider do its "magic" you need the requests to go through CrawlSpider's parse() callback.
So in start_requests() your Requests must use callback=self.parse (or not set the callback argument at all).
If you also want the start Requests to go through parse_item, you need to set a parse_start_url attribute on your spider pointing to parse_item.
So you need to have something like:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
import mysql.connector
from scrapy.http import Request

class FirstSpider(CrawlSpider):
    name = 'firstspider'

    def start_requests(self):
        conn = mysql.connector.connect(user='root', password='root', host='localhost', database='Eistee')
        cursor = conn.cursor()
        query = ("SELECT Domain, CompanyName FROM Crawlbydomain LIMIT 300, 100")
        cursor.execute(query)
        results = cursor.fetchall()
        for result in results:
            urlrequest = 'http://' + result[0]
            yield Request(urlrequest)

    rules = (Rule(SgmlLinkExtractor(allow=('.ch', )), callback='parse_item', follow=True),)

    def parse_item(self, response):
        filename = response.url.translate(None, './')
        open(filename, 'wb').write(response.body)

    parse_start_url = parse_item
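Note that the scrapy.contrib.* module paths and SgmlLinkExtractor used here belong to old Scrapy releases and have since been removed. As a rough sketch only, the same answer with the current import paths would look roughly like this; the r'\.ch' escaping is added because the allow patterns are regular expressions:

# Minimal sketch of the same spider for current Scrapy releases.
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.http import Request
import mysql.connector

class FirstSpider(CrawlSpider):
    name = 'firstspider'

    # allow patterns are regexes, so the dot is escaped to match a literal ".ch"
    rules = (Rule(LinkExtractor(allow=(r'\.ch', )), callback='parse_item', follow=True),)

    def start_requests(self):
        conn = mysql.connector.connect(user='root', password='root', host='localhost', database='Eistee')
        cursor = conn.cursor()
        cursor.execute("SELECT Domain, CompanyName FROM Crawlbydomain LIMIT 300, 100")
        for domain, _company in cursor.fetchall():
            # no explicit callback, so the request goes through CrawlSpider.parse()
            yield Request('http://' + domain)

    def parse_item(self, response):
        # Python 3 equivalent of url.translate(None, './')
        filename = response.url.replace('.', '').replace('/', '')
        with open(filename, 'wb') as f:
            f.write(response.body)

    parse_start_url = parse_item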

Related

How to take user argument and pass it to Rule extractor in Scrapy

I have a config file in which many website details are present. I take a user input argument in Scrapy using the -a parameter and pull out the matching allowed_domains and start_urls from the config file. Since this is a generic spider, I am using the rule extractor.
Below is my code:
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from bs4 import BeautifulSoup
import yaml
import re
import scrapy

with open("/scrapyConfig.yaml", "r") as f:
    config = yaml.load(f, Loader=yaml.FullLoader)

def cleanHtml(raw_html):
    CLEANR = re.compile('<.*?>')
    cleanText = str(re.sub(CLEANR, '', raw_html))
    return cleanText

def remove_tags(html):
    soup = BeautifulSoup(html, "html.parser")
    for data in soup(['style', 'script']):
        data.decompose()
    noTagsData = str(' '.join(soup.stripped_strings))
    return noTagsData

class SpiderSpider(CrawlSpider):
    name = 'spider1'

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        userInp = self.site
        self.allowed_domains = config[userInp]['allowed_domain']
        self.start_urls = config[userInp]['start_url']

    rules = [(Rule(LinkExtractor(unique=False, allow=(config[self.site]['regex1'], config[self.site]['regex2'])), callback='parse_item', follow=True))]

    def parse_item(self, response):
        uncleanText = response.xpath(config[self.site]['xpath1']).extract()
        cleanText = [x.replace("\n", "") for x in uncleanText]
        cleanText = [x.replace("\t", " ") for x in cleanText]
        cleanText = [x.replace("\r", "") for x in cleanText]
        cleanText = [x.replace("\xa0", "") for x in cleanText]
        cleanText = [x.replace(":", " ") for x in cleanText]
        cleanText = remove_tags(str(cleanText))
        finalCleanJD = cleanHtml(str(cleanText))
        yield {"URL": response.url, "Job Description": finalCleanJD}
I am able to take the user input and fetch the corresponding allowed_domains and start_urls from the config file in the __init__ function, but when I pass the same argument to the rule extractor it does not recognise self.site, and if I put the rule extractor inside __init__ the spider does not scrape the page: the terminal just shows the page as crawled and then the spider exits. The rules variable is not even highlighted when it sits inside __init__ (which suggests it is not used anywhere), whereas outside __init__ it is highlighted but self.site is not recognised. How can I make this generic spider take a user input argument, pull the matching details from the config file, and start scraping?
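A pattern that usually works here (a sketch only, assuming the same config dict and a -a site=... argument): class-level attributes are evaluated before __init__ runs, so self.site cannot appear in the class body, and CrawlSpider compiles its rules inside its own __init__ via _compile_rules(). Building the rules in __init__ and only then calling super().__init__() sidesteps both issues:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class SpiderSpider(CrawlSpider):
    name = 'spider1'

    def __init__(self, site=None, **kwargs):
        self.site = site
        self.allowed_domains = config[site]['allowed_domain']
        self.start_urls = config[site]['start_url']
        # the rules must exist before CrawlSpider.__init__ runs,
        # because it compiles them via self._compile_rules()
        self.rules = (
            Rule(LinkExtractor(unique=False,
                               allow=(config[site]['regex1'], config[site]['regex2'])),
                 callback='parse_item', follow=True),
        )
        super().__init__(**kwargs)

    def parse_item(self, response):
        # same cleaning logic as in the spider above
        ...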

How to store scraped links in Scrapy

I did a lot of searching on the web, but I couldn't find anything related, or maybe it has to do with the wording I used.
Basically, I would like to write a spider that is able to save the scraped links and to check whether some other links have already been scraped. Is there any built-in function in Scrapy to do so?
Many thanks
You can write your own method for this purpose. I have done so in my project and you can take reference from it: I keep a dictionary called already_parsed_urls, and in every callback I update this dictionary.
You can look at the code snippet below and take reference.
from scrapy.spiders import CrawlSpider
from scrapy_splash import SplashRequest

class Spider(CrawlSpider):
    name = 'test'
    allowed_domains = []
    web_url = ''
    start_urls = ['']
    counter = 0
    already_parsed_urls = {}
    wait_time = 3
    timeout = '90'

    def start_requests(self):
        for start_url in self.start_urls:
            yield SplashRequest(start_url, callback=self.parse_courses,
                                args={'wait': self.wait_time, 'timeout': self.timeout})

    def parse_courses(self, response):
        course_urls = []
        # collect the course URLs from the response here before requesting them
        yield SplashRequest(course_urls[0], callback=self.parse_items, args={'wait': self.wait_time})

    def parse_items(self, response):
        if not self.already_parsed_urls.get(response.url):
            # Get Program URL
            program_url = response.url
            self.already_parsed_urls[response.url] = 1
        else:
            return {}
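A plain set gives the same bookkeeping with a little less code; a minimal sketch of only the changed parts, everything else staying as above:

already_parsed_urls = set()

def parse_items(self, response):
    if response.url in self.already_parsed_urls:
        return {}
    self.already_parsed_urls.add(response.url)
    program_url = response.url  # Get Program URL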

Is there a way to get ID of the starting URL from database in scrapy with some function, make_requests_from_url

I am pulling start URLs from a database and also need the IDs associated with the URLs, so that I can pass them to the items pipeline and store them in the table along with the items.
I am using make_requests_from_url(row[1]) to pass the start URLs; start_urls = [] forms the list of starting URLs. The IDs, row[0], are what I need to pass to the items when the respective items are crawled.
Below is my spider code:
import scrapy
import mysql.connector
from ..items import AmzProductinfoItem

class AmzProductinfoSpiderSpider(scrapy.Spider):
    name = 'amz_ProductInfo_Spider'
    nextPageNumber = 2
    allowed_domains = ['amazon.in']
    start_urls = []
    url_fid = []

    def __init__(self):
        self.connection = mysql.connector.connect(host='localhost', database='datacollecter', user='root', password='', charset="utf8", use_unicode=True)
        self.cursor = self.connection.cursor()

    def start_requests(self):
        sql_get_StartUrl = 'SELECT * FROM database.table'
        self.cursor.execute(sql_get_StartUrl)
        rows = self.cursor.fetchall()
        for row in rows:
            yield self.make_requests_from_url(row[1])
I have tried comparing response.url in the parse method, but that changes as the spider moves on to the next page.
Not sure how I can achieve this. Any direction is appreciated.
It's not clear why you need self.make_requests_from_url. You can yield your requests directly:
def start_requests(self):
    sql_get_StartUrl = 'SELECT * FROM database.table'
    self.cursor.execute(sql_get_StartUrl)
    rows = self.cursor.fetchall()
    for row in rows:
        yield scrapy.Request(url=row[1], meta={'url_id': row[0]}, callback=self.parse)

def parse(self, response):
    url_id = response.meta["url_id"]
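From there the ID can travel with the scraped data into the pipeline; a sketch, where the url_id field name is made up for illustration and would need to be added to the AmzProductinfoItem definition:

def parse(self, response):
    url_id = response.meta["url_id"]
    item = AmzProductinfoItem()
    item['url_id'] = url_id  # hypothetical field, declare it on the Item
    # ... fill the remaining fields from the response ...
    yield item  # the pipeline now receives the database ID together with the item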

Why does my CrawlerProcess not have the function "crawl"?

import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor
from items import BackpageItem, CityvibeItem
from scrapy.shell import inspect_response
import re
import time
import sys

class MySpider(CrawlSpider):
    name = 'example'
    allowed_domains = ['www.example.com']

    # Set last_page to decide how many pages are crawled
    last_page = 10
    start_urls = ['http://www.example.com/washington/?page=%s' % page for page in xrange(1, last_page)]

    rules = (
        # Follow all links inside <div class="cat"> and call parse_item on each link
        Rule(LinkExtractor(
            restrict_xpaths=('//a[@name="listing_link"]')),
            callback='parse_item'),
    )

    # Extract relevant text from the website into an ExampleItem
    def parse_item(self, response):
        item = ExampleItem()
        item['title'] = response.xpath('string(//h2[@class="post-title"]/text())').extract()
        item['desc'] = response.xpath('string(//div[@class="section post-body"]/text())').extract()
        item['url'] = response.url
        item['location'] = response.xpath('string(//div[@class="posting"]/div[2]/text())').extract()
        item['posted_date'] = response.xpath('string(//div[@class="post-date"]/span/text())').extract()  # .re("(?<=Posted\s*).*")
        item['crawled_date'] = time.strftime("%c")
        # not sure how to get the other image urls right now
        item['image_urls'] = response.xpath('string(//div[@class="section post-contact-container"]/div/div/img/@src)').extract()
        # I can't find this section on any pages right now
        item['other_ad_urls'] = response.xpath('//a[@name="listing_link"]/@href').extract()
        item['phone_number'] = "".join(response.xpath('//div[@class="post-info"]/span[contains(text(), "Phone")]/following-sibling::a/text()').extract())
        item['email'] = "".join(response.xpath('//div[@class="post-info"]/span[contains(text(), "Email")]/following-sibling::a/text()').extract())
        item['website'] = "".join(response.xpath('//div[@class="post-info limit"]/span[contains(text(), "Website")]/following-sibling::a/text()').extract())
        item['name'] = response.xpath('//div[@class="post-name"]/text()').extract()
        # uncomment for debugging
        # inspect_response(response, self)
        return item

# process1 = CrawlerProcess({
#     'ITEM_PIPELINES': {
#         # 'scrapy.contrib.pipeline.images.ImagesPipeline': 1
#         'backpage.pipelines.GeolocationPipeline': 4,
#         'backpage.pipelines.LocationExtractionPipeline': 3,
#         'backpage.pipelines.BackpagePipeline': 5
#     }
# });

process1 = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})

process1.crawl(MySpider)
process1.start()
My spider works perfectly when I run it from the command line with
scrapy crawl example
but I will need to run multiple spiders, so I want to put them all in a script and use CrawlerProcess. When I try to run this I get the error:
AttributeError: 'CrawlerProcess' object has no attribute 'crawl'
This is scrapy version 0.24.6.
All items and pipelines are correct, because the spider works from the command line.
There is (was?) a compatibility problem between Scrapy and Scrapyd. I needed to run Scrapy 0.24 and Scrapyd 1.0.1.
Here is the issue on Github
https://github.com/scrapy/scrapyd/issues/100#issuecomment-115268880
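For reference, on current Scrapy releases (1.0 and later) CrawlerProcess does have a crawl() method, and running several spiders from one script looks roughly like this (a sketch assuming a normal project layout; AnotherSpider is a placeholder for a second spider class):

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# get_project_settings() picks up settings.py, including ITEM_PIPELINES
process = CrawlerProcess(get_project_settings())
process.crawl(MySpider)
process.crawl(AnotherSpider)  # placeholder for a second spider class
process.start()  # blocks until all crawls have finished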

How to make rules of CrawlSpider context-sensitive?

I notice that CrawlSpider's rules extract URLs on every non-leaf page.
Can I enable a rule only when the current page meets some condition (for example: its URL matches a regex)?
I have two pages:
-------------------Page A-------------------
Page URL: http://www.site.com/pattern-match.html
--------------------------------------------
- [link](http://should-extract-this)
- [link](http://should-extract-this)
- [link](http://should-extract-this)
--------------------------------------------
--------------------Page B--------------------
Page URL: http://www.site.com/pattern-not-match.html
-----------------------------------------------
- [link](http://should-not-extract-this)
- [link](http://should-not-extract-this)
- [link](http://should-not-extract-this)
-----------------------------------------------
So, the rule should only extract URLs from Page A. How can I do that? Thanks!
I just found a dirty way to inject the response into the rule.
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from scrapy.http import Request, HtmlResponse
from scrapy.contrib.spiders import CrawlSpider, Rule
import inspect

class MyCrawlSpider(CrawlSpider):
    def _requests_to_follow(self, response):
        if not isinstance(response, HtmlResponse):
            return
        seen = set()
        for n, rule in enumerate(self._rules):
            links = [l for l in rule.link_extractor.extract_links(response) if l not in seen]
            if links and rule.process_links:
                links = rule.process_links(links)
            seen = seen.union(links)
            for link in links:
                r = Request(url=link.url, callback=self._response_downloaded)
                r.meta.update(rule=n, link_text=link.text)
                # ***>>> HACK <<<***
                # pass `response` as additional argument to `process_request`
                fun = rule.process_request
                if not hasattr(fun, 'nargs'):
                    fun.nargs = len(inspect.getargs(fun.func_code).args)
                if fun.nargs == 1:
                    yield fun(r)
                elif fun.nargs == 2:
                    yield fun(r, response)
                else:
                    raise Exception('too many arguments')
Try it out:
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

def process_request(request, response):
    if 'magick' in response.url:
        return request

class TestSpider(MyCrawlSpider):
    name = 'test'
    allowed_domains = ['test.com']
    start_urls = ['http://www.test.com']
    rules = [
        Rule(SgmlLinkExtractor(restrict_xpaths='//a'), callback='parse_item', process_request=process_request),
    ]

    def parse_item(self, response):
        print response.url
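On newer Scrapy releases (1.7 and later) this hack should no longer be necessary: the process_request callable of a Rule also receives the response the link was extracted from as a second argument, so the same filtering can be written against the stock CrawlSpider. A sketch with the current import paths:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

def process_request(request, response):
    # only follow links found on pages whose URL matches the pattern
    if 'pattern-match' in response.url:
        return request
    return None  # returning None drops the request

class TestSpider(CrawlSpider):
    name = 'test'
    allowed_domains = ['site.com']
    start_urls = ['http://www.site.com']
    rules = [
        Rule(LinkExtractor(restrict_xpaths='//a'),
             callback='parse_item', process_request=process_request, follow=True),
    ]

    def parse_item(self, response):
        print(response.url)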