I am scraping a set of ~10,000 links on the same domain, all with identical structure, using the scrapy runspider command. A significant share of the pages (roughly 40% to 50%, seemingly at random) are crawled but not scraped. In my parse method I evaluate a particular element of the page and scrape the other elements only if it has the expected value. For some reason (more on this below), that element evaluates incorrectly for some of the URLs. To fix this, I want to call my parse method for these URLs repeatedly, up to a maximum of, say, 5 times, until it evaluates correctly (hoping that within 5 runs the page will satisfy the condition; otherwise I assume the element genuinely has the wrong value). How do I code this (partial code below)?
Possible reason for the above behaviour: my links are of the type www.example.com/search_term/, which are actually dynamically generated pages produced after entering "search_term" on www.example.com. So my guess is that in several cases Scrapy gets the response before the page www.example.com/search_term/ is fully generated. Maybe the ideal solution is to use a webdriver, but that would be too complex for me at this stage. As long as I get 95% of the pages scraped, I am happy.
Relevant code below (sanitised for readability without leaving out any details):
import scrapy

class mySpider(scrapy.Spider):
    name = "spidername"

    def start_requests(self):
        urls = [url1, ... url10000]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse, headers={
                "User-Agent": "Mozilla/5.0 (iPhone; CPU iPhone OS 5_1 like Mac OS X) AppleWebKit/534.46 (KHTML, like Gecko) Version/5.1 Mobile/9B179 Safari/7534.48.3"})

    def parse(self, response):
        if (value of particular_item in page == 10):
            yield {'someitem':
                   response.xpath('/html/body/div').extract()}
        else:
            <<Once again call this parse function with the same URL up to a maximum of 5 times - need help in writing the code here>>
Your XPath requires that the body of the HTML you are parsing has a div as a direct child:
<html>
  <body>
    <div>...
Are you sure every page looks that way? Without any information on what you are trying to scrape I cannot give you more advice.
Alternatively, you can try another solution where you extract all the divs from the page:
for div in response.xpath('//div').extract():
    yield {'div': div}
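For the retry that the question asks about, one possible approach is to keep an attempt counter in the request's meta dict and to re-issue the same URL with dont_filter=True, since Scrapy's duplicate filter would otherwise drop a URL it has already seen. The sketch below covers only that bookkeeping; retry_kwargs, MAX_RETRIES and the "retry_count" key are hypothetical names chosen for illustration, not part of Scrapy.

```python
MAX_RETRIES = 5  # assumed limit, taken from the question


def retry_kwargs(url, meta, max_retries=MAX_RETRIES):
    """Build keyword arguments for a repeat request, or return None
    once the retry budget for this URL is used up.

    `meta` is expected to be the dict found in response.meta;
    "retry_count" is a hypothetical key chosen for this sketch.
    """
    attempts = meta.get("retry_count", 0)
    if attempts >= max_retries:
        return None
    return {
        "url": url,
        "meta": {"retry_count": attempts + 1},
        # Scrapy filters duplicate URLs by default; dont_filter lets the
        # same URL be scheduled again.
        "dont_filter": True,
    }
```

In the else branch of parse this could be used as kwargs = retry_kwargs(response.url, response.meta), followed by yield scrapy.Request(callback=self.parse, **kwargs) whenever kwargs is not None. The User-Agent header from start_requests would have to be passed again in that call.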
This has been addressed to some extent here and here
But I'd like to ask here before doing any of what's suggested there because I don't really like any of the approaches.
So basically, I'm trying to scrape Steam games. As you may know, Steam has a link where you can access all the reviews for a game, for example:
https://steamcommunity.com/app/730/reviews/?browsefilter=toprated&snr=1_5_100010_
You can ignore snr and browsefilter query params there.
Anyhow, I have created a single Spider that will crawl the list of games here and works pretty well:
https://store.steampowered.com/search/?sort_by=Released_DESC
But now, for each game I want to retrieve all reviews.
Originally I created a new spider that deals with the infinite scroll on the page that has the whole set of reviews for a game, but obviously that spider needs the URL where those reviews live.
So basically what I'm doing now is scraping all game pages and storing the reviews URL for each game in a txt file, which is then passed as a parameter to the second spider. But I don't like this because it forces me into a 2-step process and, besides, I need to map the results of the second spider to the results of the first one somehow (these reviews belong to this game, etc.).
So my questions are:
Would it be best to send the results of scraping the game page (and thus the URL with all reviews) to the second spider, or at least the URL, and then fetch all reviews for each game using the second spider? This will be O(N*M) in terms of performance, N being the number of games and M the number of reviews per game, so maybe just because of this, having 2 spiders is worth it... thoughts?
Can I actually invoke a spider from another spider? From what I've read in the Scrapy documentation, it doesn't look like it. I can probably move everything into one spider, but it will look awful and won't adhere to the single-responsibility principle...
Why don't you use a different parse procedure?
https://docs.scrapy.org/en/latest/intro/tutorial.html#more-examples-and-patterns
def parse(self, response):
    # follow links to author pages
    for href in response.css('.author + a::attr(href)'):
        yield response.follow(href, self.parse_author)
    # follow pagination links
    for href in response.css('li.next a::attr(href)'):
        yield response.follow(href, self.parse)

def parse_author(self, response):
    def extract_with_css(query):
        return response.css(query).get(default='').strip()

    yield {
        'name': extract_with_css('h3.author-title::text'),
        'birthdate': extract_with_css('.author-born-date::text'),
        'bio': extract_with_css('.author-description::text'),
    }
    # follow pagination links
    for href in response.css('li.next a::attr(href)'):
        yield response.follow(href, self.parse_author)
And add the needed values with the meta attribute of the request:
https://docs.scrapy.org/en/latest/topics/request-response.html#scrapy.http.Request.meta
There is an example in:
Is it possible to pass a variable from start_requests() to parse() for each individual request?
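As a rough sketch of that idea for the Steam case: the callback for the game page can attach what it scraped to the request for the reviews page, and the reviews callback reads it back from response.meta, so every review item already carries its game. The CSS selectors and field names below are placeholders rather than Steam's actual markup, and in a real spider the two functions would normally be methods of a single scrapy.Spider subclass.

```python
def parse_game(response):
    """Callback for a game page: pass the game data on to the reviews page."""
    game = {
        "game_url": response.url,
        # Placeholder selector; adjust to the real page structure.
        "title": response.css("div.game-title::text").get(),
    }
    # Placeholder selector for the link to the page with all reviews.
    reviews_url = response.css("a.all-reviews::attr(href)").get()
    if reviews_url:
        yield response.follow(
            reviews_url,
            callback=parse_reviews,
            meta={"game": game},  # travels with the request
        )


def parse_reviews(response):
    """Callback for the reviews page: tag every review with its game."""
    game = response.meta["game"]
    # Placeholder selector for the text of one review.
    for text in response.css("div.review-text::text").getall():
        yield {**game, "review": text.strip()}
```

This keeps everything in one spider while still separating the two page types into their own callbacks. Newer Scrapy versions also offer the cb_kwargs argument for passing such values to a callback.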
I'm a real beginner but I've been searching high and low and can't seem to find a solution. I'm working on building some spiders but I can't figure out how to identify what URL my scraped data comes from.
My spider is extremely basic right now, I'm trying to learn as I go.
I've tried a few lines I found on Stack Overflow, but the only thing I got working was a print call in the parse section of the code (I can't remember if it was "URL: " + response.request.url or something similar; I tried a bunch of things). I can't get anything working in the yield.
I could add other identifiers to the output, but ideally I'd like the URL for the project I'm working towards.
import scrapy

class FanaticsSpider(scrapy.Spider):
    name = 'fanatics'
    start_urls = ['https://www.fanaticsoutlet.com/nfl/new-england-patriots/new-england-patriots-majestic-showtime-logo-cool-base-t-shirt-navy/o-9172+t-70152507+p-1483408147+z-8-1114341320',
                  'https://www.fanaticsoutlet.com/nfl/new-england-patriots/new-england-patriots-nfl-pro-line-mantra-t-shirt-navy/o-2427+t-69598185+p-57711304142+z-9-2975969489',]

    def parse(self, response):
        yield {
            'sale-price': response.xpath('//span[@data-talos="pdpProductPrice"]/span[@class="sale-price"]/text()').re(r'[$]\d+\.\d+'),
            #'sale-price': response.xpath('//span[@data-talos="pdpProductPrice"]/span[@class="sale-price"]/text()').get(),
            'regular-price': response.xpath('//span[@data-talos="pdpProductPrice"]/span[@class="regular-price strike-through"]/text()').re(r'[$]\d+\.\d+'),
            #'regular-price': response.xpath('//span[@data-talos="pdpProductPrice"]/span[@class="regular-price strike-through"]/text()').get(),
        }
Any help is much appreciated. I haven't begun to learn anything about pipeline yet, I'm not sure if that might hold a solution?
You can simply add the URL in the yield like this:
yield {...,
       'url': response.url,
       ...}
I'm struggling to get Scrapy to output only "hits" to a JSON file. I'm new at this, so if there is just a link I should review, that might help (I've spent a fair amount of time googling around and am still struggling), though code correction tips are more welcome :).
I'm working off the Scrapy tutorial (https://doc.scrapy.org/en/latest/intro/overview.html), where the original code outputs a long list including field names, in the form "field: output", with both blanks and found items appearing. I'd like to include only links that are found, and output them without the field name to a file.
For the following code I am trying, if I issue "scrapy crawl quotes2 -o quotes.json > output.json", it works, but quotes.json is always blank (including if I do "scrapy crawl quotes2 -o quotes.json").
In this case, as an experiment, I only want to return the URL if the string "Jane" is in the URL (e.g., /author/Jane-Austen):
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes2"
    start_urls = [
        'http://quotes.toscrape.com/tag/humor/',
    ]

    def parse(self, response):
        for quote in response.css('a'):
            for i in quote.css('a[href*=Jane]::attr(href)').extract():
                if i is not None:
                    print(i)
I've tried "yield" and items options, but am not up to speed enough to make them work. My longer-term ambition is to go to sites without having to understand the HTML tree (which may in and of itself be the wrong approach) and look for URLs with specific text in the URL string.
Thoughts? Am guessing this is not too hard, but is beyond me.
This is happening because you are printing the items; you have to tell Scrapy explicitly to yield them.
But before that, I don't see why you are looping through the anchor nodes. Instead, you should loop over the quotes using CSS or XPath selectors, extract the author link inside each quote, and lastly check whether that URL contains a specific string ("Jane" in your case).
for quote in response.css('.quote'):
    # select the href attribute of the first link whose URL contains "Jane"
    jane_url = quote.xpath('.//a[contains(@href, "Jane")]/@href').extract_first()
    if jane_url is not None:
        yield {
            'url': jane_url
        }
The crawling starts from the list included in start_urls = []
I need a long list of these starting URLs and have tried 2 methods of solving this problem:
Method 1: Using pandas to define the starting_urls array
# Array of keywords
keywords = pandas.Keyword
urls = {}
count = 0
while(count < 100):
    urls[count] = 'google.com?q=' + keywords[count]
    count = count + 1
# Now I have the starting urls in the urls array.
However, it doesn't seem to define starting_urls = urls because when I run:
scrapy crawl SPIDER
I get the error:
Error: Request url must be str or unicode, got int:
Method 2:
Each starting URL contains paginated content and in the def parse method I have the following code to crawl all linked pages.
next_page = response.xpath('//li[@class="next"]/a/@href').extract_first()
yield response.follow(next_page, callback=self.parse)
I want to add additional pages to crawl from the urls array defined above.
count = 0
while(count < 100):
    yield response.follow(urls[count], callback=self.parse)
    count = count + 1
But it seems that neither of these 2 methods works. Maybe I can't add this code to the spider.py file?
A first note: obviously I can't say I've run your entire script, since it's incomplete, but the first thing I noticed is that your base URL needs to be in the proper format ("http://etc.etc") for Scrapy to make a proper request.
Also, not to question your skills, but in case you weren't aware: with the strip, split and join functions you can convert lists, strings, dictionaries and integers back and forth to achieve the desired effect.
WHAT'S HAPPENING TO YOU:
I'll be using range instead of count, but this mimics your issue:
lis = range(11)
site = "site.com/page="
for i in lis:
    print(site + i)
----------
TypeError: Can't convert 'int' object to str implicitly
# TURNING MY INT INTO STR:
lis = range(11)
site = "site.com/page="
for i in lis:
    print(site + str(i))
--------------------
site.com/page=0
site.com/page=1
site.com/page=2
site.com/page=3
site.com/page=4
site.com/page=5
site.com/page=6
site.com/page=7
site.com/page=8
site.com/page=9
site.com/page=10
As to the error: when you add 1 to the count and then build the URL with that count, you are trying to combine a string with an integer. I'd imagine the fix is simply to turn the integer into a string before constructing your URL, while keeping the count itself an integer so that it can still be incremented.
My go-to way to keep my code as clean as possible is different. By adding an extra file, containing all the URLs you wish to scrape, to the root or current working folder from which you start the crawl, you can use Python's file functions to open that file from within your spider script, like this:
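import scrapy

class xSpider(scrapy.Spider):  # scrapy.Spider replaces the old BaseSpider
    name = "w.e"
    # read one URL per line, skipping blank lines
    with open("urls.txt") as f:
        start_urls = [url.strip() for url in f if url.strip()]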
What really bothers me is your error, which says that you are combining a string with an integer. If you need further help, I'll ask you again for a complete snippet of your spider and, in the spirit of coders' kinship, also your settings.py. I'll tell you right now what you'll end up finding out: regardless of any adjustments to the settings.py file, you won't be able to scrape Google search pages, or rather, not the entire number of result pages. For that I would recommend using Scrapy in conjunction with Beautiful Soup.
The immediate problem I see is that you are making a dict when a list is expected. :) Change it to a list.
There are also all kinds of interactions depending on which underlying spider you inherited from (if you did at all). Try switching to a list, then update the question with more data if you are still having problems.
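To illustrate the difference: iterating over a dict yields its keys, which in the question are the integers 0 to 99, and that matches the "got int" in the error message. Below is a minimal sketch that builds a list of full URLs instead, assuming the keywords are plain strings; BASE_URL and build_start_urls are hypothetical names, and the search address is a placeholder.

```python
from urllib.parse import quote_plus

# Hypothetical base URL; note the scheme, which Scrapy requires.
BASE_URL = "https://www.example.com/search?q="


def build_start_urls(keywords, limit=100):
    """Return a list of URL strings, one per keyword (at most `limit`)."""
    return [BASE_URL + quote_plus(str(kw)) for kw in keywords[:limit]]


# A dict keyed by a counter, as in the question, iterates over its int keys.
urls_dict = {0: "url-a", 1: "url-b"}
print(list(urls_dict))  # [0, 1]

# A list iterates over the URL strings themselves.
print(build_start_urls(["scrapy", "web crawler"]))
```

In the spider, start_urls = build_start_urls(keywords) would then contain strings only.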
I'm trying to write a small web crawler with Scrapy.
I wrote a crawler that grabs the URLs of certain links on a certain page, and wrote the links to a csv file. I then wrote another crawler that loops on those links, and downloads some information from the pages directed to from these links.
The loop on the links:
cr = csv.reader(open("linksToCrawl.csv","rb"))
start_urls = []
for row in cr:
    start_urls.append("http://www.zap.co.il/rate"+''.join(row[0])[1:len(''.join(row[0]))])
If, for example, the URL of the page I'm retrieving information from is:
http://www.zap.co.il/ratemodel.aspx?modelid=835959
then more information can (sometimes) be retrieved from following pages, like:
http://www.zap.co.il/ratemodel.aspx?modelid=835959&pageinfo=2
("&pageinfo=2" was added).
Therefore, my rules are:
rules = (
    Rule(SgmlLinkExtractor(allow=("&pageinfo=\d",),
                           restrict_xpaths=('//a[@class="NumBtn"]',)),
         callback="parse_items", follow=True),
)
It seemed to be working fine. However, it seems that the crawler is only retrieving information from the pages with the extended URLs (with the "&pageinfo=\d"), and not from the ones without them. How can I fix that?
Thank you!
You can override the parse_start_url() method in CrawlSpider:
class MySpider(CrawlSpider):
    def parse_items(self, response):
        # put your code here
        ...
    # responses for the start URLs go through the same callback
    parse_start_url = parse_items
Your rule allows only URLs containing "&pageinfo=\d". In effect, only pages with a matching URL will be processed. You need to change the allow parameter for the URLs without pageinfo to be processed.
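The effect of the allow pattern can be checked outside Scrapy. The sketch below applies Python's re module to the two URLs from the question, on the assumption that the link extractor treats allow as a regular expression searched against each extracted URL; is_allowed is a hypothetical helper, not a Scrapy function.

```python
import re

ALLOW = re.compile(r"&pageinfo=\d")  # the pattern from the rule in the question


def is_allowed(url):
    """Return True if the URL would match the allow pattern."""
    return bool(ALLOW.search(url))


urls = [
    "http://www.zap.co.il/ratemodel.aspx?modelid=835959",
    "http://www.zap.co.il/ratemodel.aspx?modelid=835959&pageinfo=2",
]

for url in urls:
    # True only for the paginated URL; the plain URL never matches.
    print(is_allowed(url), url)
```

Since rules are applied to links found in downloaded pages, the responses for the start URLs themselves reach parse_items only if parse_start_url is pointed at it.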