I have a CrawlSpider-derived spider. When the URL has a certain format, it calls back to a function named parse_item.
rules = (
    Rule(
        LinkExtractor(
            allow=('/whatever/', )
        )
    ),
    Rule(
        LinkExtractor(
            allow=('/whatever/detailpage/1234/')
        ),
        callback='parse_item'
    ),
)
I have a flag only_new=True for my spider. When this flag is enabled, I don't want to crawl URLs which are already in my database.
I would like to check the URL BEFORE the request is made, because when there are 5 new detail pages I want to crawl but 1000 detail pages I don't want to crawl, I want to send 5 requests instead of 1000.
But in the callback function, the request has already been made. I would like to do something like the following:
rules = (
    (...)
    Rule(
        LinkExtractor(
            allow=('/whatever/detailpage/1234/')
        ),
        callback_before_request='check_if_request_is_necessary'
    ),
)
def check_if_request_is_necessary(spider, url):
    if spider.only_new and url_exists_in_database(url):
        raise IgnoreRequest
    else:
        do_request_and_call_parse_item(url)
Is this possible with a middleware or something?
You're looking for the process_links attribute for the Rule -- it allows you to specify a callable or a method name to be used for filtering the list of Link objects returned by the LinkExtractor.
Your code would look something like this:
rules = (
    (...)
    Rule(
        LinkExtractor(
            allow=('/whatever/detailpage/1234/')
        ),
        process_links='filter_links_already_seen'
    ),
)
def filter_links_already_seen(self, links):
    for link in links:
        if self.only_new and url_exists_in_database(link.url):
            continue
        else:
            yield link
rules = (
    Rule(LinkExtractor(
        restrict_xpaths='//need_data',
        deny=deny_urls), callback='parse_info'),
    Rule(LinkExtractor(allow=r'/need/', deny=deny_urls), follow=True),
)
These rules extract the URLs I need for scraping, right?
Can I get, inside the callback, the URL we came from?
For example, for the website needdata.com:
Rule(LinkExtractor(allow=r'/need/', deny=deny_urls), follow=True) is there to extract URLs like needdata.com/need/1, right?
Rule(LinkExtractor(
    restrict_xpaths='//need_data',
    deny=deny_urls), callback='parse_info'),
is there to extract URLs from needdata.com/need/1, for example from a table of people,
and then parse_info scrapes them. Right?
But in parse_info I want to know who the parent is.
If needdata.com/need/1 links to needdata.com/people/1,
I want to add a column "parent" to my file, and its value should be needdata.com/need/1.
How can I do that? Thank you very much.
We can use a LinkExtractor directly:
lx = LinkExtractor(allow=(r'shop-online/',))
and then iterate over the extracted links:
for l in lx.extract_links(response):
    # l.url is our URL
and pass the parent along with each request, e.g.
meta={'category': category}
I haven't found a better solution.
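A minimal sketch of that idea applied to the needdata.com example above, assuming Scrapy 1.x; the spider name, start URL and item fields are illustrative, not taken from the original code:

import scrapy
from scrapy.linkextractors import LinkExtractor

class NeedSpider(scrapy.Spider):
    name = 'needdata'
    start_urls = ['https://needdata.com/need/1']  # example start page

    def parse(self, response):
        # Extract the people links from the current page and remember
        # where they came from by passing the parent URL along in meta.
        lx = LinkExtractor(restrict_xpaths='//need_data')
        for link in lx.extract_links(response):
            yield scrapy.Request(
                link.url,
                callback=self.parse_info,
                meta={'parent': response.url},
            )

    def parse_info(self, response):
        # The parent URL travelled with the request, so it can be written
        # out next to the scraped data.
        yield {
            'parent': response.meta['parent'],
            'url': response.url,
        }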
I am using the Wikimedia API to retrieve all possible URLs from a Wikipedia article, via 'https://en.wikipedia.org/w/api.php?action=query&prop=links&redirects&pllimit=500&format=json', but it only gives a list of link titles. For example, the Artificial intelligence Wikipedia page has a link titled "delivery networks", but the actual URL is "https://en.wikipedia.org/wiki/Content_delivery_network", which is what I want.
Use a generator:
action=query&
format=jsonfm&
titles=Estelle_Morris&
redirects&
generator=links&
gpllimit=500&
prop=info&
inprop=url
See API docs on generators and the info module.
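Assembled into a single request URL, those parameters look like this (jsonfm is the human-readable debug format; use format=json in code):

https://en.wikipedia.org/w/api.php?action=query&format=jsonfm&titles=Estelle_Morris&redirects&generator=links&gpllimit=500&prop=info&inprop=url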
I have replaced most of my previous answer, including the code, to use the information provided in Tgr's answer, in case someone else would like sample Python code. This code is heavily based on code from MediaWiki for so-called 'raw continuations'.
I have deliberately limited the number of links requested per invocation to five so that one more parameter possibility could be demonstrated.
import requests

def query(request):
    request['action'] = 'query'
    request['format'] = 'json'
    request['prop'] = 'info'
    request['generator'] = 'links'
    request['inprop'] = 'url'
    previousContinue = {}
    while True:
        # Re-send the base request plus whatever continuation token
        # the previous response gave us.
        req = request.copy()
        req.update(previousContinue)
        result = requests.get('http://en.wikipedia.org/w/api.php', params=req).json()
        if 'error' in result:
            raise RuntimeError(result['error'])
        if 'warnings' in result:
            print(result['warnings'])
        if 'query' in result:
            yield result['query']
        if 'continue' in result:
            previousContinue = {'gplcontinue': result['continue']['gplcontinue']}
        else:
            break

for result in query({'titles': 'Estelle Morris', 'gpllimit': '5'}):
    for url in [_['fullurl'] for _ in list(result.values())[0].values()]:
        print(url)
I mentioned in my first answer that, if the OP wanted to do something similar with artificial intelligence, they should begin with 'Artificial intelligence' (note the capitalisation). Otherwise the search would start with a disambiguation page, with all of the complications that can arise from those.
I am scraping the dell.com website; my goal is pages like http://accessories.us.dell.com/sna/productdetail.aspx?c=us&cs=19&l=en&s=dhs&sku=A7098144. How do I set the link extraction rules so that they find these pages anywhere, at any depth? As far as I know, there is no limit on depth by default. If I do:
rules = (
    Rule(
        SgmlLinkExtractor(allow=r"productdetail\.aspx"),
        callback="parse_item"
    ),
)
it doesn't work: it crawls only the starting page. If I do:
rules = (
    Rule(
        SgmlLinkExtractor(allow=r".*")
    ),
    Rule(
        SgmlLinkExtractor(allow=r"productdetail\.aspx"),
        callback="parse_item"
    ),
)
it crawls product pages but doesn't scrape them (I mean it doesn't call parse_item() on them). I tried including follow=True on the first rule, although if there is no callback it should default to True.
EDIT:
This is the rest of my code, except for the parse function:
import re
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.http import Request

class DellSpider(CrawlSpider):
    name = 'dell.com'
    start_urls = ['http://www.dell.com/sitemap']

    rules = (
        Rule(
            SgmlLinkExtractor(allow=r".*")
        ),
        Rule(
            SgmlLinkExtractor(allow=r"productdetail\.aspx"),
            callback="parse_item"
        ),
    )
From the CrawlSpider documentation:
If multiple rules match the same link, the first one will be used, according to the order they’re defined in this attribute.
Thus, you need to invert the order of your Rules. Currently .* will match everything, before productdetail\.aspx is checked at all.
This should work:
rules = (
    Rule(
        SgmlLinkExtractor(allow=r"productdetail\.aspx"),
        callback="parse_item"
    ),
    Rule(
        SgmlLinkExtractor(allow=r".*")
    ),
)
However, you will have to make sure that links are followed from within parse_item if you want to follow links on productdetail pages; the second rule will not be applied to productdetail pages.
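For example, a minimal sketch of such a parse_item to go inside the DellSpider shown above; the item-building part is left as a placeholder:

    def parse_item(self, response):
        # ... build and yield your scraped item here ...
        # Then keep crawling from product pages as well, since the ".*"
        # rule is not applied to responses handled by this callback.
        for link in SgmlLinkExtractor(allow=r".*").extract_links(response):
            yield Request(link.url)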
I have generated a form called Clients using scaffold. Then, when I decided to change its URL in the routes file, I added this line:
match '/book', to: 'clients#new'
Now the problem is that I have two links for the new action: the first is localhost:3000/clients/new and the second is localhost:3000/book. How do I remove the first one (clients/new)? That's also what I want to do for the update/delete/index actions.
In your routes.rb file:
resources :clients, except: :new
I'm using Django templates to create an e-mail. I do something like this:
msg_html = render_to_string('email/%s.html' % template_name, context)
msg = EmailMultiAlternatives(subject, msg_text, from_email, to_list)
msg.attach_alternative(msg_html, "text/html")
msg.content_subtype = "html"
msg.send()
My template uses named url patterns like so:
{% url url_name parameter %}
Which works fine, except it will render a relative url, like:
/app_name/url_node/parameter
Usually that is enough, but since this is an e-mail I really need it to be a full URL, with the server name prepended, like:
http://localhost:8000/app_name/url_node/parameter.
How can I do this? How do I dynamically get the server name? (I don't want to hardcode it for sure).
Another way of asking: how do I get the equivalent of Java's HttpServletRequest.getContextPath() in Django/Python?
Thanks.
Use the sites framework:
Site.objects.get_current().domain
Example taken directly from the documentation:
from django.contrib.sites.models import Site
from django.core.mail import send_mail

def register_for_newsletter(request):
    # Check form values, etc., and subscribe the user.
    # ...
    current_site = Site.objects.get_current()
    send_mail(
        'Thanks for subscribing to %s alerts' % current_site.name,
        'Thanks for your subscription. We appreciate it.\n\n-The %s team.' % current_site.name,
        'editor@%s' % current_site.domain,
        [user.email]
    )
    # ...
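For the original question about absolute URLs in an e-mail template, a minimal sketch of one way to combine this with reverse(); the helper name absolute_url and the hard-coded http scheme are assumptions, not part of the sites framework:

from django.contrib.sites.models import Site
from django.core.urlresolvers import reverse  # django.urls in newer Django versions

def absolute_url(url_name, *args, **kwargs):
    # Prepend the current site's domain to a named URL pattern,
    # so that links placed in e-mails are no longer relative.
    domain = Site.objects.get_current().domain
    return 'http://%s%s' % (domain, reverse(url_name, args=args, kwargs=kwargs))

# e.g. put absolute_url('url_name', parameter) into the template context
# instead of relying on {% url %} alone.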