How to extract links anywhere at any depth? - scrapy

I am scraping the dell.com website; my goal is to reach pages like http://accessories.us.dell.com/sna/productdetail.aspx?c=us&cs=19&l=en&s=dhs&sku=A7098144. How do I set the link extraction rules so that these pages are found anywhere, at any depth? As far as I know, there is no limit on depth by default. If I do:
rules = (
    Rule(
        SgmlLinkExtractor(allow=r"productdetail\.aspx"),
        callback="parse_item"
    ),
)
it doesn't work: it only crawls the starting page. If I do:
rules = (
    Rule(
        SgmlLinkExtractor(allow=r".*")
    ),
    Rule(
        SgmlLinkExtractor(allow=r"productdetail\.aspx"),
        callback="parse_item"
    ),
)
it crawls the product pages but doesn't scrape them (i.e. parse_item() is never called on them). I tried adding follow=True to the first rule, although follow should already default to True when no callback is given.
EDIT:
This is the rest of my code, except for the parse function:
import re
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.http import Request

class DellSpider(CrawlSpider):
    name = 'dell.com'
    start_urls = ['http://www.dell.com/sitemap']

    rules = (
        Rule(
            SgmlLinkExtractor(allow=r".*")
        ),
        Rule(
            SgmlLinkExtractor(allow=r"productdetail\.aspx"),
            callback="parse_item"
        ),
    )

From the CrawlSpider documentation:
If multiple rules match the same link, the first one will be used, according to the order they’re defined in this attribute.
Thus, you need to invert the order of your rules. As written, .* matches every link before productdetail\.aspx is even checked.
This should work:
rules = (
    Rule(
        SgmlLinkExtractor(allow=r"productdetail\.aspx"),
        callback="parse_item"
    ),
    Rule(
        SgmlLinkExtractor(allow=r".*")
    ),
)
However, you will have to make sure that links are followed from within parse_item if you want to keep crawling from productdetail pages: only the first matching rule is applied, so the second rule will never be used on productdetail pages.
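For illustration, a minimal sketch of such a parse_item (the item fields are only placeholders, not part of the original question; it also assumes Scrapy 1.x or later, where plain dicts can be yielded as items and response.urljoin is available — on older versions you would use urlparse.urljoin(response.url, href) and an Item class instead):
def parse_item(self, response):
    # scrape the product page itself (field names here are placeholders)
    title = response.xpath('//title/text()').extract()
    yield {
        'url': response.url,
        'title': title[0] if title else None,
    }
    # keep crawling from this page: with the corrected rule order, productdetail
    # links are handled by the first rule, which has a callback and no follow=True,
    # so links found on these pages are not followed automatically
    for href in response.xpath('//a/@href').extract():
        yield Request(response.urljoin(href))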

Related

Scrapy Project Review/ Link Rules

This is my second project, and I was wondering if someone could review it and suggest best practices for applying the Scrapy framework. I also have a specific issue: not all courses are scraped from the site.
Goal: scrape the info for all golf courses from the Golf Advisor website. Link: https://www.golfadvisor.com/course-directory/1-world/
Approach: I used CrawlSpider with rules for the links to explore.
Result: Only 19,821 courses out of 36,587 were scraped from the site.
Code:
import scrapy
from urllib.parse import urljoin
from collections import defaultdict
# adding rules with CrawlSpider
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class GolfCourseSpider(CrawlSpider):
    name = 'golfadvisor'
    allowed_domains = ['golfadvisor.com']
    start_urls = ['https://www.golfadvisor.com/course-directory/1-world/']
    base_url = 'https://www.golfadvisor.com/course-directory/1-world/'

    # use rules to visit only pages with 'courses/' in the path and exclude pages with 'page=1, page=2, etc.',
    # since those are duplicate links to the same course
    rules = [
        Rule(LinkExtractor(allow=('courses/'), deny=('page=')), callback='parse_filter_course', follow=True),
    ]

    def parse_filter_course(self, response):
        # checking if it is an actual course page; excluded it for the final run, didn't fully test it
        # exists = response.css('.CoursePageSidebar-map').get()
        # if exists:

        # the page is split into multiple sections with a different amount of detail specified in each.
        # I decided to use a nested for loop (for section in sections, for detail in section) to retrieve the data.
        about_section = response.css('.CourseAbout-information-item')
        details_section = response.css('.CourseAbout-details-item')
        rental_section = response.css('.CourseAbout-rentalsServices-item')
        practice_section = response.css('.CourseAbout-practiceInstruction-item')
        policies_section = response.css('.CourseAbout-policies-item')
        sections = [
            about_section,
            details_section,
            rental_section,
            practice_section,
            policies_section
        ]

        # created a defaultdict(list) to add new details from the for loops
        dict = defaultdict(list)

        # also have details added NOT from the for-loop sections, but hard coded using css and xpath selectors.
        dict = {
            'link': response.url,
            'Name': response.css('.CoursePage-pageLeadHeading::text').get().strip(),
            'Review Rating': response.css('.CoursePage-stars .RatingStarItem-stars-value::text').get('').strip(),
            'Number of Reviews': response.css('.CoursePage-stars .desktop::text').get('').strip().replace(' Reviews', ''),
            '% Recommend this course': response.css('.RatingRecommendation-percentValue::text').get('').strip().replace('%', ''),
            'Address': response.css('.CoursePageSidebar-addressFirst::text').get('').strip(),
            'Phone Number': response.css('.CoursePageSidebar-phoneNumber::text').get('').strip(),
            # the website field is a redirecting link; did not figure out how to get the real URL during scraping
            'Website': urljoin('https://www.golfadvisor.com/', response.css('.CoursePageSidebar-courseWebsite .Link::attr(href)').get()),
            'Latitude': response.css('.CoursePageSidebar-map::attr(data-latitude)').get('').strip(),
            'Longitude': response.css('.CoursePageSidebar-map::attr(data-longitude)').get('').strip(),
            'Description': response.css('.CourseAbout-description p::text').get('').strip(),
            # here, I was suggested to use xpath to retrieve text. should it be used for the fields above, and why?
            'Food & Beverage': response.xpath('//h3[.="Available Facilities"]/following-sibling::text()[1]').get('').strip(),
            'Available Facilities': response.xpath('//h3[.="Food & Beverage"]/following-sibling::text()[1]').get('').strip(),
            # another example of using xpath for microdata
            'Country': response.xpath("(//meta[@itemprop='addressCountry'])/@content").get('')
        }

        # nested for loop I mentioned above
        for section in sections:
            for item in section:
                dict[item.css('.CourseValue-label::text').get().strip()] = item.css('.CourseValue-value::text').get('').strip()

        yield dict
E.g., it discovered only two golf courses in Mexico:
Club Campestre de Tijuana
Real del Mar Golf Resort
I've run the code specifically on the pages it didn't pick up, and I was able to scrape those pages individually. Therefore my link extraction rules must be wrong.
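(As a quick debugging sketch, not part of the original code: the same LinkExtractor the rule uses can be run by hand inside scrapy shell on one of the directory pages that should lead to a missing course, to see which URLs it actually extracts.)
# run `scrapy shell <directory page URL>` first; the shell provides `response`
from scrapy.linkextractors import LinkExtractor

le = LinkExtractor(allow=('courses/'), deny=('page='))
for link in le.extract_links(response):
    print(link.url)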
This is the output file with ~20k courses: https://drive.google.com/file/d/1izg2gZ87qbmMtg4S_VKQmkzlKON3poIs/view?usp=sharing
Thank you,
Yours Data Enthusiast

How do I find a specific tag's value (which could be anything) with beautifulsoup?

I am trying to get the job IDs from the tags of Indeed listings. So far, I have taken Indeed search results and put each job into its own "bs4.element.Tag" object, but I don't know how to extract the value of the tag (or is it a class?) "data-jk". Here is what I have so far:
import requests
import bs4
import re
# 1: scrape (5?) pages of search results for listing ID's
results = []
results.append(requests.get("https://www.indeed.com/jobs?q=data+analyst&l=United+States&start=0"))
results.append(requests.get("https://www.indeed.com/jobs?q=data+analyst&l=United+States&start=10"))
results.append(requests.get("https://www.indeed.com/jobs?q=data+analyst&l=United+States&start=20"))
results.append(requests.get("https://www.indeed.com/jobs?q=data+analyst&l=United+States&start=30"))
results.append(requests.get("https://www.indeed.com/jobs?q=data+analyst&l=United+States&start=40"))
# each search page has a query "q", location "l", and a "start" = 10*int
# the search results are contained in a "td" with ID = "resultsCol"
justjobs = []
for eachResult in results:
    soup_jobs = bs4.BeautifulSoup(eachResult.text, "lxml")  # this is for IDs
    justjobs.extend(soup_jobs.find_all(attrs={"data-jk": True}))  # re.compile("data-jk")

# each "card" is a div object
# each has the class "jobsearch-SerpJobCard unifiedRow row result clickcard"
# as well as a specific tag "data-jk"
# "data-jk" seems to be the actual IDs used in each listing's URL
# Now, each div element has a data-jk. I will try to get data-jk from each one:
jobIDs = []
print(type(justjobs[0]))  # DEBUG
for eachJob in justjobs:
    jobIDs.append(eachJob.find("data-jk"))

print("Length: " + str(len(jobIDs)))  # DEBUG
print("Example JobID: " + str(jobIDs[1]))  # DEBUG
The examples I've seen online generally get the text contained between the opening and closing tags, but I am not sure how to get the value from inside the (opening) tag itself. I've tried doing it by parsing it as a string instead:
print(justjobs[0])
for eachJob in justjobs:
    jobIDs.append(str(eachJob)[115:131])

print(jobIDs)
but the website is also inconsistent with how the tags operate, and I think that using beautifulsoup would be more flexible than multiple cases and substrings.
Any pointers would be greatly appreciated!
Looks like you can regex them out from a script tag
import requests,re
html = requests.get('https://www.indeed.com/jobs?q=data+analyst&l=United+States&start=0').text
p = re.compile(r"jk:'(.*?)'")
ids = p.findall(html)
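Alternatively, since each element of justjobs is a bs4 Tag, the attribute value can be read directly with Tag.get (a sketch reusing the question's own variables, not part of the original answer):
# Tag.get returns None instead of raising if the attribute is missing
jobIDs = [eachJob.get("data-jk") for eachJob in justjobs if eachJob.get("data-jk")]
print(jobIDs[:5])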

spaCy: custom infix regex rule to split on `:` for patterns like mailto:johndoe@gmail.com is not applied consistently

With the default tokenizer, spaCy treats mailto:johndoe@gmail.com as one single token.
I tried the following:
nlp = spacy.load('en_core_web_lg')
infixes = nlp.Defaults.infixes + (r'(?<=mailto):(?=\w+)', )
nlp.tokenizer.infix_finditer = spacy.util.compile_infix_regex(infixes).finditer
However, the above custom rule doesn't seem to do what I want in a consistent manner. For example, if I apply the tokenizer to mailto:johndoe@gmail.c, it does what I want:
nlp("mailto:johndoe@gmail.c")
# [mailto, :, johndoe@gmail.c]
However, if I apply the tokenizer to mailto:johndoe@gmail.com, it does not work as intended.
nlp("mailto:johndoe@gmail.com")
# [mailto:johndoe@gmail.com]
I wonder if there is a way to fix this inconsistency?
There's a tokenizer exception pattern for URLs, which matches things like mailto:johndoe@gmail.com as one token. It knows that top-level domains have at least two letters, so it matches gmail.co and gmail.com but not gmail.c.
You can override it by setting:
nlp.tokenizer.token_match = None
Then you should get:
[t.text for t in nlp("mailto:johndoe@gmail.com")]
# ['mailto', ':', 'johndoe@gmail.com']
[t.text for t in nlp("mailto:johndoe@gmail.c")]
# ['mailto', ':', 'johndoe@gmail.c']
If you want the URL tokenization to behave as it does by default except for mailto:, you could modify the URL_PATTERN from lang/tokenizer_exceptions.py (also see how TOKEN_MATCH is defined right below it) and use that rather than None.
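Putting the question's infix rule and this override together, a sketch (assuming spaCy 2.x, where token_match can simply be replaced; the expected outputs are the ones shown above):
import spacy

nlp = spacy.load('en_core_web_lg')

# custom infix rule from the question: split on ':' when it follows 'mailto'
infixes = nlp.Defaults.infixes + (r'(?<=mailto):(?=\w+)', )
nlp.tokenizer.infix_finditer = spacy.util.compile_infix_regex(infixes).finditer

# disable the URL exception pattern so the infix rule always gets a chance to apply
nlp.tokenizer.token_match = None

print([t.text for t in nlp("mailto:johndoe@gmail.com")])
# ['mailto', ':', 'johndoe@gmail.com']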

How do I get the parent URL when using rules in Scrapy?

rules = (
    Rule(LinkExtractor(
        restrict_xpaths='//need_data',
        deny=deny_urls), callback='parse_info'),
    Rule(LinkExtractor(allow=r'/need/', deny=deny_urls), follow=True),
)
These rules extract the URLs I need for scraping, right?
Inside the callback, can I get the URL we came from?
For example, for the website needdata.com:
Rule(LinkExtractor(allow=r'/need/', deny=deny_urls), follow=True) is there to extract URLs like needdata.com/need/1, right?
Rule(LinkExtractor(
    restrict_xpaths='//need_data',
    deny=deny_urls), callback='parse_info'),
is there to extract URLs from needdata.com/need/1, for example a table with people,
and then parse_info scrapes them. Right?
But in parse_info I want to know who the parent is.
If needdata.com/need/1 contains needdata.com/people/1,
I want to add a "parent" column to the output file, and its value would be needdata.com/need/1.
How do I do that? Thank you very much.
You want to use
lx = LinkExtractor(allow=(r'shop-online/',))
And then
for l in lx.extract_links(response):
    # l.url - this is our url
And then pass the parent along in the request meta, e.g.
meta={'category': category}
I did not find a better solution.
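Put together, a minimal sketch of that idea (the spider name, URLs, allow pattern, and the 'parent' meta key are illustrative assumptions, not from the original post):
import scrapy
from scrapy.linkextractors import LinkExtractor

class NeedSpider(scrapy.Spider):
    name = 'needdata'
    start_urls = ['http://needdata.com/need/1']

    def parse(self, response):
        # extract the child links by hand and remember which page they came from
        lx = LinkExtractor(allow=(r'/people/',))
        for link in lx.extract_links(response):
            yield scrapy.Request(link.url,
                                 callback=self.parse_info,
                                 meta={'parent': response.url})

    def parse_info(self, response):
        yield {
            'url': response.url,
            'parent': response.meta['parent'],  # e.g. needdata.com/need/1
        }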

CrawlSpider: Ignore URL before request

I have a CrawlSpider-derived spider. When the URL has a certain format, it does a callback to a function named parse_item.
rules = (
    Rule(
        LinkExtractor(
            allow=('/whatever/', )
        )
    ),
    Rule(
        LinkExtractor(
            allow=('/whatever/detailpage/1234/')
        ),
        callback='parse_item'
    ),
)
I have a flag only_new=True for my spider. When it is enabled, I don't want to crawl URLs which are already in my database.
I would like to check the URL BEFORE the request is made, because when there are 5 new detail pages I want to crawl and 1000 detail pages I don't want to crawl, I want to send 5 requests instead of 1000.
But in the callback function, the request has already been made. I would like to do something like the following:
rules = (
    (...)
    Rule(
        LinkExtractor(
            allow=('/whatever/detailpage/1234/')
        ),
        callback_before_request='check_if_request_is_nessesary'
    ),
)

def check_if_request_is_nessesary(spider, url):
    if spider.only_new and url_exists_in_database():
        raise IgnoreRequest
    else:
        do_request_and_call_parse_item(url)
Is this possible with a middleware or something?
You're looking for the process_links attribute for the Rule -- it allows you to specify a callable or a method name to be used for filtering the list of Link objects returned by the LinkExtractor.
Your code would look something like this:
rules = (
    (...)
    Rule(
        LinkExtractor(
            allow=('/whatever/detailpage/1234/')
        ),
        callback='parse_item',
        process_links='filter_links_already_seen'
    ),
)

def filter_links_already_seen(self, links):
    # process_links is applied to the extracted links before any request is
    # scheduled, so filtered URLs never generate a request at all
    for link in links:
        if self.only_new and url_exists_in_database(link.url):
            continue
        else:
            yield link
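For completeness, the middleware route mentioned in the question is also possible: a downloader middleware can raise IgnoreRequest before the page is downloaded. A rough sketch (url_exists_in_database is the question's own placeholder, and the middleware still has to be enabled in DOWNLOADER_MIDDLEWARES):
from scrapy.exceptions import IgnoreRequest

class SkipKnownUrlsMiddleware(object):
    # downloader middleware sketch: drop requests for URLs already in the database
    def process_request(self, request, spider):
        if getattr(spider, 'only_new', False) and url_exists_in_database(request.url):
            raise IgnoreRequest('already in database: %s' % request.url)
        return None  # all other requests continue normally
Compared to process_links, this drops the request only just before the download, whereas process_links prevents the request from being created at all.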