how to save all items for each url - scrapy

I add a URL list like this. When the items reach the pipeline, it seems that items from all the URLs in the list are passed to process_item.
How can I separate the items according to the URL they came from? For example, how do I save the items from each URL to its own file?
class MySpider(scrapy.Spider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = [
        'http://www.example.com/1.html',
        'http://www.example.com/2.html',
        'http://www.example.com/3.html',
    ]

Add a ref_url field to the item before yielding it, then check it in the pipeline. There are other solutions, but this is the most direct one, I guess.
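A minimal sketch of that approach, assuming the item carries a ref_url field and that one JSON-lines file per start URL is acceptable (the pipeline class and the file-naming scheme below are illustrative, not part of the original question):

import json

import scrapy


class MySpider(scrapy.Spider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = [
        'http://www.example.com/1.html',
        'http://www.example.com/2.html',
        'http://www.example.com/3.html',
    ]

    def parse(self, response):
        # Tag every item with the URL it was scraped from.
        yield {
            'title': response.css('title::text').get(),
            'ref_url': response.url,
        }


class PerUrlFilePipeline:
    # Writes each item to a separate .jl file keyed on its ref_url.

    def open_spider(self, spider):
        self.files = {}

    def close_spider(self, spider):
        for f in self.files.values():
            f.close()

    def process_item(self, item, spider):
        ref_url = item['ref_url']
        if ref_url not in self.files:
            # Derive a simple file name from the URL (e.g. "1.html" -> "1.html.jl").
            safe_name = ref_url.rstrip('/').rsplit('/', 1)[-1] or 'index'
            self.files[ref_url] = open(safe_name + '.jl', 'a')
        self.files[ref_url].write(json.dumps(dict(item)) + '\n')
        return item

The pipeline still has to be enabled in ITEM_PIPELINES as usual.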

Related

How to save Scrapy Broad Crawl Results?

Scrapy has a built-in way of persisting results to AWS S3 using the FEEDS setting, but for a broad crawl over different domains this creates a single file where the results from all domains are saved.
How could I save the results of each domain in its own separate file?
I wasn't able to find any reference to this in the documentation.
In the FEED_URI setting you can add placeholders that are replaced at run time. For example, the domain name can be included in the file name through a spider attribute named domain, like this:
FEED_URI = 's3://my-bucket/%(domain)s/%(time)s.json'
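For that placeholder to resolve, the spider needs an attribute with the same name; a minimal sketch under that assumption (the attribute name domain is arbitrary, it just has to match the placeholder):

import scrapy


class SingleDomainSpider(scrapy.Spider):
    name = 'single_domain'
    domain = 'example.com'  # substituted into %(domain)s in FEED_URI
    start_urls = ['https://example.com/']

    def parse(self, response):
        yield {'url': response.url, 'title': response.css('title::text').get()}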
This solution only works if the spider is run once per domain; since you haven't said so explicitly, I'll assume a single run crawls multiple domains.
If you know all domains beforehand, you can generate the value of the FEEDS setting programmatically and use item filtering.
# Assumes that items have a domain field and that all target domains are
# defined in an ALL_DOMAINS variable.
class DomainFilter:
    def __init__(self, feed_options):
        self.domain = feed_options["domain"]

    def accepts(self, item):
        return item["domain"] == self.domain


ALL_DOMAINS = ["toscrape.com", ...]
FEEDS = {
    f"s3://mybucket/{domain}.jsonl": {
        "format": "jsonlines",
        "item_filter": DomainFilter,
        "domain": domain,  # made available to DomainFilter via feed_options
    }
    for domain in ALL_DOMAINS
}
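The filter above relies on every item carrying a domain field; a hedged sketch of how a spider might populate it (the field name and the www-stripping are assumptions that simply have to match the FEEDS keys):

from urllib.parse import urlparse

import scrapy


class BroadSpider(scrapy.Spider):
    name = 'broad'
    start_urls = ['https://toscrape.com/']  # one entry per target domain

    def parse(self, response):
        netloc = urlparse(response.url).netloc
        yield {
            # Strip a leading "www." so the value matches the FEEDS keys.
            'domain': netloc[4:] if netloc.startswith('www.') else netloc,
            'url': response.url,
        }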

Limiting scrapy Request and Items

Everyone, I've been learning Scrapy for a month. I need assistance with the following problems:
Suppose there are 100-200 URLs and I use a Rule to extract further links from those URLs, and I want to limit the requests to those links, say a maximum of 30 requests per URL. Can I do that?
If I'm searching for a keyword on all URLs and the word is found on a particular URL, I want Scrapy to stop searching that URL and move on to the next one.
I've tried limiting the URLs but it doesn't work at all.
Thanks, I hope everything is clear.
You can use a process_links callback function with your Rule; it will be passed the list of links extracted from each response, and you can trim it down to your limit of 30.
Example (untested):
class MySpider(CrawlSpider):
    name = "test"
    allowed_domains = ['example.org']
    rules = (
        Rule(LinkExtractor(), process_links="dummy_process_links"),
    )

    def dummy_process_links(self, links):
        links = links[:30]
        return links
If I understand correctly and you want to stop after finding some word in the page of the response, all you need to do is find the word:
def my_parse(self, response):
    if b'word' in response.body:
        offset = response.body.find(b'word')
        # do something with it
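To actually stop crawling a URL once the word is found and move on to the next one, one option (a sketch, not from the answer above) is to return early from the callback so that no further follow-up requests are yielded for that response:

def my_parse(self, response):
    keyword = b'word'
    if keyword in response.body:
        # Found it: record the hit and stop here, so Scrapy moves on
        # to the remaining start URLs instead of following more links.
        yield {'url': response.url, 'found': True}
        return
    # Not found yet: keep following links from this page.
    for href in response.css('a::attr(href)').getall():
        yield response.follow(href, callback=self.my_parse)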

How to use allowed_domains with relative paths?

I'm new to Scrapy and want to build a simple web crawler. Unfortunately, if I use allowed_domains, Scrapy filters out all subpages because the site uses relative paths. How can this be fixed?
class ExampleSpider(CrawlSpider):
    name = "example_crawler"
    allowed_domains = ["www.example.com"]
    start_urls = ["https://www.example.com"]
    rules = (
        Rule(LinkExtractor(),
             callback="parse_text",
             follow=True),
    )

    def parse_text(self, response):
        pass
If I remove allowed_domains, all subpages are crawled. However, if I use an allowed domain, all subpages get filtered because of the relative path issue. Can this be solved?
Allowed domains should not contain www. and the like.
If you take a look at OffsiteMiddleware, it compiles all values in allowed_domains into a regular expression and then matches every URL you try to crawl against it:
regex = r'^(.*\.)?(%s)$' % '|'.join(re.escape(d) for d in allowed_domains if d is not None)
return re.compile(regex)
The regular expression allows subdomains, so you can easily have allowed_domains=['example.com', 'foo.example.com']. If you leave the www. in, Scrapy treats it as part of the required domain, so the match fails for URLs that do not have it.
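So in the spider above, the fix is simply to drop the www. prefix; a sketch of the corrected attribute:

class ExampleSpider(CrawlSpider):
    name = "example_crawler"
    # "example.com" also matches www.example.com and any other subdomain.
    allowed_domains = ["example.com"]
    start_urls = ["https://www.example.com"]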

Looping on Scrapy doesn't work properly

I'm trying to write a small web crawler with Scrapy.
I wrote a crawler that grabs the URLs of certain links on a certain page and writes them to a CSV file. I then wrote another crawler that loops over those links and downloads some information from the pages they point to.
The loop on the links:
import csv

cr = csv.reader(open("linksToCrawl.csv", "rb"))
start_urls = []
for row in cr:
    start_urls.append("http://www.zap.co.il/rate" + ''.join(row[0])[1:len(''.join(row[0]))])
If, for example, the URL of the page I'm retrieving information from is:
http://www.zap.co.il/ratemodel.aspx?modelid=835959
then more information can (sometimes) be retrieved from following pages, like:
http://www.zap.co.il/ratemodel.aspx?modelid=835959&pageinfo=2
("&pageinfo=2" was added).
Therefore, my rules are:
rules = (
    Rule(SgmlLinkExtractor(allow=("&pageinfo=\d",),
                           restrict_xpaths=('//a[@class="NumBtn"]',)),
         callback="parse_items", follow=True),
)
It seemed to be working fine. However, it seems that the crawler is only retrieving information from the pages with the extended URLs (with the "&pageinfo=\d"), and not from the ones without them. How can I fix that?
Thank you!
You can override parse_start_url() method in CrawlSpider:
class MySpider(CrawlSpider):

    def parse_items(self, response):
        # put your code here
        ...

    parse_start_url = parse_items
Your rule only allows URLs containing "&pageinfo=\d". In effect, only pages with a matching URL will be processed. You need to change the allow parameter so that the URLs without pageinfo are processed as well.
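A sketch of what that change could look like, assuming the goal is to match ratemodel.aspx pages both with and without the pageinfo parameter (the pattern and the removal of restrict_xpaths are assumptions about the site, not part of the original answer):

rules = (
    Rule(SgmlLinkExtractor(allow=(r"ratemodel\.aspx\?modelid=\d+",)),
         callback="parse_items", follow=True),
)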

How can I scrape other specific pages on a forum with Scrapy?

I have a Scrapy Crawler that crawls some guides from a forum.
The forum whose data I'm trying to crawl has a number of pages.
The problem is that I cannot extract the links that I want to because there aren't specific classes or ids to select.
The url structure is like this one: http://www.guides.com/forums/forumdisplay.php?f=108&order=desc&page=1
Obviously I can change the number after desc&page=1 to 2, 3, 4 and so on, but I would like to know the best way to do this.
How can I accomplish that?
PS: This is the spider code
http://dpaste.com/hold/794794/
I can't seem to open the forum URL (always redirects me to another website), so here's a best effort suggestion:
If there are links to the other pages on the thread page, you can create a crawler rule to explicitly follow these links. Use a CrawlSpider for that:
class GuideSpider(CrawlSpider):
    name = "Guide"
    allowed_domains = ['www.guides.com']
    start_urls = [
        "http://www.guides.com/forums/forumdisplay.php?f=108&order=desc&page=1",
    ]
    rules = [
        Rule(SgmlLinkExtractor(allow=("forumdisplay.php.*f=108.*page=",)),
             callback='parse_item', follow=True),
    ]

    def parse_item(self, response):
        # Your code
        ...
The spider should automatically deduplicate requests, i.e. it won't follow the same URL twice even if two pages link to it. If there are very similar URLs on the page that differ only in one or two query arguments (say, order=asc), you can specify deny=(...) in the link extractor to filter them out.
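For example, a sketch of the same rule with a deny pattern added to the link extractor (order=asc is just the illustrative duplicate mentioned above):

rules = [
    Rule(SgmlLinkExtractor(allow=("forumdisplay.php.*f=108.*page=",),
                           deny=("order=asc",)),
         callback='parse_item', follow=True),
]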