Limiting Scrapy Requests and Items - scrapy

Hi everyone, I've been learning Scrapy for a month. I need assistance with the following problems:
Suppose there are 100-200 URLs and I use a Rule to extract further links from them, and I want to limit the requests made to those links, say a maximum of 30 requests per URL. Can I do that?
If I'm searching for a keyword on all URLs, and the word is found on a particular URL, I want Scrapy to stop searching that URL and move on to the next one.
I've tried limiting the URLs but it doesn't work at all.
Thanks, I hope everything is clear.

You can use a process_links callback function with your Rule. It will be passed the list of extracted links from each response, and you can trim it down to your limit of 30.
Example (untested):
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class MySpider(CrawlSpider):
    name = "test"
    allowed_domains = ['example.org']
    rules = (
        Rule(LinkExtractor(), process_links="dummy_process_links"),
    )

    def dummy_process_links(self, links):
        # keep only the first 30 links extracted from each response
        links = links[:30]
        return links
If I understand correctly, and you want to stop after finding some word in the page of the response, all you need to do is find the word:
def my_parse(self, response):
    if b'word' in response.body:
        offset = response.body.find(b'word')
        # do something with it
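For the second part of the question (stop crawling a URL once the word is found), here is a minimal sketch of one approach, assuming a plain Spider callback rather than the CrawlSpider rules above; my_parse, the keyword and the link selector are placeholders:
def my_parse(self, response):
    if b'word' in response.body:
        # keyword found: record the hit and stop here, without following
        # any more links from this page
        yield {'url': response.url, 'offset': response.body.find(b'word')}
        return
    # keyword not found: keep following links from this page
    # (the 'a::attr(href)' selector is just a placeholder)
    for href in response.css('a::attr(href)').getall():
        yield response.follow(href, callback=self.my_parse)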

Is there a way to get the URL that a link is scraped from?

I've got a spider written out that crawls my website and scrapes a bunch of tags. I'm now trying to have it return the URL that the link was discovered on.
For example:
www.example.com/product/123 was found on www.example.com/page/2.
When scrapy scrapes information from /product/123 I want to have a field that is "Scraped From" and return /page/2. For every URL that is scraped, I want to find the originating page where the URL was found. I've been poring over the docs and can't seem to figure this out. Any help would be appreciated!
The easiest way is to use response.headers. There should be a Referer header.
referer = response.headers['Referer']
You can also use meta to pass information along to the next URL.
def parse(self, response):
    product_url = response.css('#url').get()
    yield scrapy.Request(product_url, callback=self.parse_product, meta={'referer': response.url})

def parse_product(self, response):
    referer = response.meta['referer']
    item = ItemName()  # your Item class
    item['referer'] = referer
    yield item
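As a side note, on newer Scrapy versions (1.7+) cb_kwargs is generally preferred over meta for passing data to a callback. A minimal sketch of the same idea; the '#url' selector is still a placeholder:
def parse(self, response):
    product_url = response.css('#url::attr(href)').get()
    yield scrapy.Request(product_url, callback=self.parse_product,
                         cb_kwargs={'referer': response.url})

def parse_product(self, response, referer):
    yield {'url': response.url, 'referer': referer}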

Scrapy output json

I'm struggling to get Scrapy to output only "hits" to a JSON file. I'm new at this, so if there is just a link I should review, that might help (I've spent a fair amount of time googling around, still struggling), though code correction tips are more welcome :).
I'm working off of the Scrapy tutorial (https://doc.scrapy.org/en/latest/intro/overview.html), with the original code outputting a long list including field names and output like "field: output", where both blanks and found items appear. I'd like to include only links that are found, and output them without the field name to a file.
For the following code, if I issue "scrapy crawl quotes2 -o quotes.json > output.json", it works but quotes.json is always blank (including if I just do "scrapy crawl quotes2 -o quotes.json").
In this case, as an experiment, I only want to return the URL if the string "Jane" is in the URL (e.g., /author/Jane-Austen):
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes2"
    start_urls = [
        'http://quotes.toscrape.com/tag/humor/',
    ]

    def parse(self, response):
        for quote in response.css('a'):
            for i in quote.css('a[href*=Jane]::attr(href)').extract():
                if i is not None:
                    print(i)
I've tried "yield" and items options, but am not up to speed enough to make them work. My longer term ambition to go to sites without having to understand the html tree (which may in and of itself be the wrong approach) and look for URLs with specific text in the URL string.
Thoughts? Am guessing this is not too hard, but is beyond me.
Well, this is happening because you are printing the items; you have to tell Scrapy explicitly to yield them.
But before that, I don't see why you are looping through the anchor nodes. Instead, you should loop over the quotes using CSS or XPath selectors, extract the author links inside each quote, and lastly check whether that URL contains a specific string ("Jane" in your case).
def parse(self, response):
    for quote in response.css('.quote'):
        jane_url = quote.xpath('.//a[contains(@href, "Jane")]/@href').extract_first()
        if jane_url is not None:
            yield {
                'url': jane_url
            }
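With the links yielded as items instead of printed, the feed export should pick them up with the same command as above, minus the redirect:
scrapy crawl quotes2 -o quotes.json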

Setting a custom long list of starting URLs in Scrapy

The crawling starts from the list included in start_urls = []
I need a long list of these starting URLs, and I have 2 methods of solving this problem:
Method 1: Using pandas to define the starting_urls array
# Array of keywords
keywords = pandas.Keyword
urls = {}
count = 0
while(count < 100):
    urls[count] = 'google.com?q=' + keywords[count]
    count = count + 1
# Now I have the starting urls in the urls array.
However, it doesn't seem to define starting_urls = urls because when I run:
scrapy crawl SPIDER
I get the error:
Error: Request url must be str or unicode, got int:
Method 2:
Each starting URL contains paginated content and in the def parse method I have the following code to crawl all linked pages.
next_page = response.xpath('//li[@class="next"]/a/@href').extract_first()
yield response.follow(next_page, callback=self.parse)
I want to add additional pages to crawl from the urls array defined above.
count = 0
while(count < 100):
    yield response.follow(urls[count], callback=self.parse)
    count = count + 1
But it seems that neither of these 2 methods works. Maybe I can't add this code to the spider.py file?
To make a first note: obviously I can't say I've run your entire script, since it's incomplete, but the first thing I noticed is that your start URLs do need to be in the proper format, "http://...", for Scrapy to make a proper request.
Also, not to question your skills, but be aware that with str()/int() and the strip, split and join functions you can convert back and forth between strings, lists and numbers to get the form you need...
WHAT'S HAPPENING TO YOU:
(I'd use range instead of a manual count, but this mimics your issue.)
lis = range(11)
site = "site.com/page="
for i in lis:
    print(site + i)
----------
TypeError: Can't convert 'int' object to str implicitly
# TURNING MY INT INTO STR:
lis = range(11)
site = "site.com/page="
for i in lis:
    print(site + str(i))
--------------------
site.com/page=0
site.com/page=1
site.com/page=2
site.com/page=3
site.com/page=4
site.com/page=5
site.com/page=6
site.com/page=7
site.com/page=8
site.com/page=9
site.com/page=10
As to the error: when you increment the count and then build the entire URL with it, you end up trying to combine a string with an integer. Simply turn the integer into a string before constructing your URL, and keep the counter itself an integer so it can still be incremented.
My go-to way to keep the code as clean as possible is to add an extra file in the root or current working folder from which you start the crawl, containing all the URLs you wish to scrape; you can then use Python's file-reading functions to load them into your spider script, like this:
class xSpider(BaseSpider):
    name = "w.e"

    f = open("urls.txt")
    start_urls = [url.strip() for url in f.readlines()]
    f.close()
What really bothers me is that your error says you're combining a string with an integer. If you need further help, post a complete snippet of your spider and, in the spirit of coders' kinship, your settings.py as well, because I'll tell you right now that despite any adjustments to the settings.py file you won't be able to scrape Google search pages properly (at least not the full set of result pages). For that I'd recommend Scrapy in conjunction with BeautifulSoup.
The immediate problem I see is that you are making a dict when Scrapy expects a list. :) Change it to a list.
There are also all kinds of interactions depending on which underlying spider you inherited from (if you did at all). Try switching to a list, then update the question with more data if you are still having problems.
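Putting the two points together (a list instead of a dict, full "http://" URLs, and string values throughout), here is a minimal sketch of what the spider could look like; the keyword values and spider name are placeholders, untested:
import scrapy

class KeywordSpider(scrapy.Spider):
    name = "SPIDER"
    keywords = ['foo', 'bar', 'baz']  # e.g. taken from your pandas column

    # build start_urls as a list of full URL strings, not a dict keyed by integers
    start_urls = ['http://www.google.com/search?q=' + kw for kw in keywords]

    def parse(self, response):
        yield {'url': response.url}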

Looping on Scrapy doesn't work properly

I'm trying to write a small web crawler with Scrapy.
I wrote a crawler that grabs the URLs of certain links on a certain page, and wrote the links to a csv file. I then wrote another crawler that loops on those links, and downloads some information from the pages directed to from these links.
The loop on the links:
cr = csv.reader(open("linksToCrawl.csv", "rb"))
start_urls = []
for row in cr:
    start_urls.append("http://www.zap.co.il/rate" + ''.join(row[0])[1:len(''.join(row[0]))])
If, for example, the URL of the page I'm retrieving information from is:
http://www.zap.co.il/ratemodel.aspx?modelid=835959
then more information can (sometimes) be retrieved from following pages, like:
http://www.zap.co.il/ratemodel.aspx?modelid=835959&pageinfo=2
("&pageinfo=2" was added).
Therefore, my rules are:
rules = (
    Rule(SgmlLinkExtractor(allow=("&pageinfo=\d",),
                           restrict_xpaths=('//a[@class="NumBtn"]',)),
         callback="parse_items", follow=True),
)
It seemed to be working fine. However, it seems that the crawler is only retrieving information from the pages with the extended URLs (with the "&pageinfo=\d"), and not from the ones without them. How can I fix that?
Thank you!
You can override the parse_start_url() method in CrawlSpider:
class MySpider(CrawlSpider):

    def parse_items(self, response):
        # put your code here
        ...

    parse_start_url = parse_items
Your rule only allows URLs with "&pageinfo=\d". In effect, only pages with a matching URL will be processed. You need to change the allow parameter so that URLs without pageinfo are processed as well.
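As a rough illustration only (the regex is an assumption based on the URLs quoted above, untested), a broader allow pattern that matches the model pages both with and without pageinfo might look like:
rules = (
    Rule(SgmlLinkExtractor(allow=(r"ratemodel\.aspx\?modelid=\d+",)),
         callback="parse_items", follow=True),
)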

How can I scrape other specific pages on a forum with Scrapy?

I have a Scrapy Crawler that crawls some guides from a forum.
The forum that I'm trying to crawl the data has got a number of pages.
The problem is that I cannot extract the links that I want to because there aren't specific classes or ids to select.
The url structure is like this one: http://www.guides.com/forums/forumdisplay.php?f=108&order=desc&page=1
Obviously I can change the number after desc&page= from 1 to 2, 3, 4 and so on, but I would like to know the best way to do this.
How can I accomplish that?
PS: This is the spider code
http://dpaste.com/hold/794794/
I can't seem to open the forum URL (always redirects me to another website), so here's a best effort suggestion:
If there are links to the other pages on the thread page, you can create a crawler rule to explicitly follow these links. Use a CrawlSpider for that:
class GuideSpider(CrawlSpider):
    name = "Guide"
    allowed_domains = ['www.guides.com']
    start_urls = [
        "http://www.guides.com/forums/forumdisplay.php?f=108&order=desc&page=1",
    ]
    rules = [
        Rule(SgmlLinkExtractor(allow=("forumdisplay.php.*f=108.*page=",)),
             callback='parse_item', follow=True),
    ]

    def parse_item(self, response):
        # Your code
        ...
The spider should automatically deduplicate requests, i.e. it won't follow the same URL twice even if two pages link to it. If there are very similar URLs on the page with only one or two query arguments different (say, order=asc), you can specify deny=(...) in the link extractor to filter them out.
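For example, a variant of the rule above with a deny pattern (the pattern itself is an assumption, untested):
rules = [
    Rule(SgmlLinkExtractor(allow=("forumdisplay.php.*f=108.*page=",),
                           deny=("order=asc",)),
         callback='parse_item', follow=True),
]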