Is there a way to get the URL that a link is scraped from? - scrapy

I've got a spider written out that crawls my website and scrapes a bunch of tags. I'm now trying to have it return the URL that the link was discovered on.
For example:
www.example.com/product/123 was found on www.example.com/page/2.
When scrapy scrapes information from /product/123 I want to have a field that is "Scraped From" and returns /page/2. For every URL that is scraped, I'd want to find the originating page where the URL was found. I've been poring over the docs and can't seem to figure this out. Any help would be appreciated!

The easiest way is to use response.headers. There should be a Referer header:
referer = response.headers['Referer']
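Note that Scrapy header values are bytes and the Referer header can be missing (e.g. on a start URL), so a guarded lookup is safer. A minimal sketch (the parse_product name is just for illustration):

def parse_product(self, response):
    # headers.get() returns bytes or None, unlike the [] lookup which raises KeyError
    referer = response.headers.get('Referer')
    yield {
        'url': response.url,
        'scraped_from': referer.decode('utf-8') if referer else None,
    }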
You can also use meta to pass information along to the next URL.
def parse(self, response):
    # '#url' is an illustrative selector; grab the link's href, not the whole element
    product_url = response.css('#url::attr(href)').get()
    yield scrapy.Request(product_url, callback=self.parse_product,
                         meta={'referer': response.url})

def parse_product(self, response):
    referer = response.meta['referer']
    item = ItemName()
    item['referer'] = referer
    yield item
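On newer Scrapy versions (1.7+), cb_kwargs is an alternative to meta for passing data to a callback; roughly the same idea would look like this (the selector and item class are the same illustrative names as above):

def parse(self, response):
    product_url = response.css('#url::attr(href)').get()
    yield scrapy.Request(product_url, callback=self.parse_product,
                         cb_kwargs={'referer': response.url})

def parse_product(self, response, referer):
    # referer arrives as a keyword argument instead of via response.meta
    item = ItemName()
    item['referer'] = referer
    yield item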

Related

Spider through Scrapyrt returns 0 items when scraping search page

I created a spider that can scrape a page in an e-commerce site and gather the data on the different items.
The spider works fine with specific pages of the site (www.sitedomain/123-item-category), as well as with the search page (www.sitedomain/searchpage?controller?search=keywords+item+to+be+found).
But when I run it through Scrapyrt, the specific page works fine, while the search page returns 0 items. No errors, just 0 items. This occurs on 2 different sites with 2 different spiders.
Is there something specific to search pages that has to be taken into account when using Scrapyrt?
Take a spider like this:
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "minimal"

    def start_requests(self):
        urls = [
            "https://www.dungeondice.it/ricerca?controller=search&s=ticket+to+ride",
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        print("Found ", len(response.css("article")), " items")
        for article in response.css("article"):
            print("Item: ", article.css("img::attr(title)").get())
(I also set ROBOTSTXT_OBEY = False in the settings.)
I get 20 items back if I run:
scrapy crawl minimal
But I get 0 items back (no errors, just no results), if I run:
curl "http://localhost:9081/crawl.json?spider_name=minimal&url=https://www.dungeondice.it/ricerca?controller=search&s=ticket+to+ride"
I think you could put "return item" at the end of "process_item" in your pipeline.py:
def process_item(self, item, spider):
    ...
    return item
I ran into the same issue you describe; after I edited my code this way, it was solved.
Hope this could help =)
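For context, a minimal pipeline sketch (the class name and the ITEM_PIPELINES entry are illustrative); if process_item doesn't return the item, nothing is passed further down the chain, which can show up as empty results:

# pipelines.py (sketch)
class MyProjectPipeline:
    def process_item(self, item, spider):
        # ... validate or transform the item here ...
        return item  # without this, later pipelines and the exporters get nothing

# settings.py (sketch)
# ITEM_PIPELINES = {"myproject.pipelines.MyProjectPipeline": 300}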

Limiting scrapy Request and Items

Hi everyone, I've been learning Scrapy for a month. I need assistance with the following problems:
Suppose there are 100-200 URLs and I use a Rule to extract further links from those URLs, and I want to limit the requests for those links, say a maximum of 30 requests per URL. Can I do that?
If I'm searching for a keyword on all URLs and the word is found on a particular URL, then I want Scrapy to stop searching that URL and move to the next one.
I've tried limiting the URLs but it doesn't work at all.
Thanks, I hope everything is clear.
You can use a process_links callback function with your Rule; it will be passed the list of extracted links from each response, and you can trim it down to your limit of 30.
Example (untested):
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class MySpider(CrawlSpider):
    name = "test"
    allowed_domains = ['example.org']

    rules = (
        Rule(LinkExtractor(), process_links="dummy_process_links"),
    )

    def dummy_process_links(self, links):
        links = links[:30]
        return links
If I understand correctly and you want to stop after finding some word in the response body, all you need to do is find the word:
def my_parse(self, response):
    if b'word' in response.body:
        offset = response.body.find(b'word')
        # do something with it
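For the second part of the question (moving on once the word is found), a hedged sketch, assuming pagination is followed from the same callback and using an illustrative "next page" selector:

def my_parse(self, response):
    if b'word' in response.body:
        # found it: record the hit and stop following this URL's pages
        yield {'url': response.url, 'found': True}
        return
    # not found yet: keep following pagination (the selector is an assumption)
    next_page = response.css('a.next::attr(href)').get()
    if next_page:
        yield response.follow(next_page, callback=self.my_parse)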

Scraping a list of urls after log in

The site to scrape has multiple projects with multiple pages and requires a log in. I tried:
def start_request(self):
    return [scrapy.FormRequest(..., callback=self.logged_in)]

def logged_in(self, response):
    with open(...) as f:
        for url in f.readlines():
            return scrapy.Request(url, callback=self.parse)

def parse(self, response):
    ... do some scraping ...
    ... determine the url of the next page ...
    return scrapy.Request(... next page ..., self.parse)
This results in scraping all pages of one project (login is successful), but then it stops.
If return scrapy.Request() in function logged_in() is replaced by yield scrapy.Request(), then it reads the first page of all projects.
I played around with the returns and yields, but I can't get it to scrape all pages of all projects.
BTW I tried to create an array start_urls, but that doesn't work because it first needs to log into the site.
return will only ever return once, so don't expect more than one Request to be returned there; if you wish to return more than once, use yield.
Scrapy's Request has a parameter called dont_filter; the duplicates filter is probably dropping some of your calls to the parse function:
dont_filter (boolean) – indicates that this request should not be filtered by the scheduler. This is used when you want to perform an identical request multiple times, to ignore the duplicates filter. Use it with care, or you will get into crawling loops. Default to False.
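If the duplicates filter is indeed what's dropping your requests, it can be bypassed per request, e.g. in logged_in() (a sketch; the filename is illustrative):

def logged_in(self, response):
    with open("project_urls.txt") as f:
        for url in f.readlines():
            # dont_filter=True tells the scheduler not to drop this request as a duplicate
            yield scrapy.Request(url.strip(), callback=self.parse, dont_filter=True)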
@Guy Gavriely makes a good point about return. I would add that you need another parse method to filter all the pages you want:
def start_request(self):
    return [scrapy.FormRequest(..., callback=self.logged_in)]

def logged_in(self, response):
    with open(...) as f:
        for url in f.readlines():
            yield scrapy.Request(url, callback=self.parse)

def parse(self, response):
    ... do some scraping ...
    yield scrapy.Request(... next page ..., self.parse_user)

def parse_user(self, response):
    ... do some scraping ...
    yield items
You're not done yet! The final parse method may need to iterate, for example:
def parse(self, response):
    ... do some scraping ...
    for sel in some_list_of_users_by_an_xpath:
        user_profile_tag = sel.xpath('xpath_to_user_profile_urls')
        # might need cleaning before the Request, e.g. 'domain' + user_profile_tag,
        # or .split()/.replace(), etc.
        user_profile_url_clean = user_profile_tag
        yield scrapy.Request(user_profile_url_clean, self.parse_user)
In this case the parse function schedules a request for every user in that list of users, and parse_user then does most of the actual digging and scraping for each of them before moving on to the next one from the list.
Good luck!

Looping on Scrapy doesn't work properly

I'm trying to write a small web crawler with Scrapy.
I wrote a crawler that grabs the URLs of certain links on a certain page, and wrote the links to a csv file. I then wrote another crawler that loops on those links, and downloads some information from the pages directed to from these links.
The loop on the links:
import csv

cr = csv.reader(open("linksToCrawl.csv", "rb"))
start_urls = []
for row in cr:
    start_urls.append("http://www.zap.co.il/rate" + ''.join(row[0])[1:len(''.join(row[0]))])
If, for example, the URL of the page I'm retrieving information from is:
http://www.zap.co.il/ratemodel.aspx?modelid=835959
then more information can (sometimes) be retrieved from following pages, like:
http://www.zap.co.il/ratemodel.aspx?modelid=835959&pageinfo=2
("&pageinfo=2" was added).
Therefore, my rules are:
rules = (
    Rule(SgmlLinkExtractor(allow=("&pageinfo=\d",),
                           restrict_xpaths=('//a[@class="NumBtn"]',)),
         callback="parse_items", follow=True),
)
It seemed to be working fine. However, the crawler only retrieves information from the pages with the extended URLs (those with "&pageinfo=\d"), and not from the ones without them. How can I fix that?
Thank you!
You can override the parse_start_url() method in CrawlSpider:
class MySpider(CrawlSpider):
    def parse_items(self, response):
        # put your code here
        ...

    parse_start_url = parse_items
Your rule only allows URLs with "&pageinfo=\d". In effect, only pages with a matching URL will be processed. You need to change the allow parameter so that the URLs without pageinfo are processed as well.
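To make that concrete, a rough sketch combining the rule from the question with the parse_start_url override from the answer above (using the modern LinkExtractor; the deprecated SgmlLinkExtractor from the question works the same way here):

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class ZapSpider(CrawlSpider):  # class name is illustrative
    name = "zap"
    # start_urls built from linksToCrawl.csv as in the question

    rules = (
        # follow the "&pageinfo=N" pagination links, as in the question
        Rule(LinkExtractor(allow=(r"&pageinfo=\d",),
                           restrict_xpaths=('//a[@class="NumBtn"]',)),
             callback="parse_items", follow=True),
    )

    def parse_items(self, response):
        # ... scrape the model page ...
        ...

    # the start URLs (the pages without &pageinfo) go through the same callback
    parse_start_url = parse_items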

How can I scrape other specific pages on a forum with Scrapy?

I have a Scrapy Crawler that crawls some guides from a forum.
The forum I'm trying to crawl the data from has a number of pages.
The problem is that I cannot extract the links that I want because there aren't specific classes or ids to select.
The url structure is like this one: http://www.guides.com/forums/forumdisplay.php?f=108&order=desc&page=1
Obviously I can change the number after desc&page=1 to 2, 3, 4 and so on, but I would like to know the best way to do this.
How can I accomplish that?
PS: This is the spider code
http://dpaste.com/hold/794794/
I can't seem to open the forum URL (always redirects me to another website), so here's a best effort suggestion:
If there are links to the other pages on the thread page, you can create a crawler rule to explicitly follow these links. Use a CrawlSpider for that:
class GuideSpider(CrawlSpider):
    name = "Guide"
    allowed_domains = ['www.guides.com']
    start_urls = [
        "http://www.guides.com/forums/forumdisplay.php?f=108&order=desc&page=1",
    ]
    rules = [
        Rule(SgmlLinkExtractor(allow=("forumdisplay.php.*f=108.*page=",)),
             callback='parse_item', follow=True),
    ]

    def parse_item(self, response):
        # Your code
        ...
The spider should automatically deduplicate requests, i.e. it won't follow the same URL twice even if two pages link to it. If there are very similar URLs on the page with only one or two query arguments different (say, order=asc), you can specify deny=(...) in the link extractor to filter them out.
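For instance, a sketch of that deny filter (the order=asc pattern is just the example mentioned above):

rules = [
    Rule(SgmlLinkExtractor(allow=("forumdisplay.php.*f=108.*page=",),
                           # drop listings that only differ by sort order
                           deny=("order=asc",)),
         callback='parse_item', follow=True),
]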