Scrapy output json - scrapy

I'm struggling to get Scrapy to output only "hits" to a JSON file. I'm new at this, so even a link I should review might help (I've spent a fair amount of time googling around, still struggling), though code correction tips are more welcome :).
I'm working off of the Scrapy tutorial (https://doc.scrapy.org/en/latest/intro/overview.html), where the original code outputs a long list including field names, i.e. "field: output", in which both blanks and found items appear. I'd like to include only links that are found, and output them without the field name to a file.
For the code I am trying below, if I issue "scrapy crawl quotes2 -o quotes.json > output.json", it works, but quotes.json is always blank (i.e., including if I just do "scrapy crawl quotes2 -o quotes.json").
In this case, as an experiment, I only want to return the URL if the string "Jane" is in the URL (e.g., /author/Jane-Austen):
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes2"
    start_urls = [
        'http://quotes.toscrape.com/tag/humor/',
    ]

    def parse(self, response):
        for quote in response.css('a'):
            for i in quote.css('a[href*=Jane]::attr(href)').extract():
                if i is not None:
                    print(i)
I've tried "yield" and items options, but I'm not up to speed enough to make them work. My longer-term ambition is to go to sites without having to understand the HTML tree (which may in and of itself be the wrong approach) and look for URLs with specific text in the URL string.
Thoughts? I'm guessing this is not too hard, but it's beyond me.

Well, this is happening because you are printing the items; you have to explicitly tell Scrapy to 'yield' them.
But before that, I don't see why you are looping through the anchor nodes. Instead, you should loop over the quotes using CSS or XPath selectors, extract the author link inside each quote, and finally check whether that URL contains a specific string ("Jane" in your case).
for quote in response.css('.quote'):
    jane_url = quote.xpath('.//a[contains(@href, "Jane")]/@href').extract_first()
    if jane_url is not None:
        yield {
            'url': jane_url
        }
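Putting it together, a minimal sketch of the full spider (keeping the quotes2 name and start URL from the question) might look like this; running "scrapy crawl quotes2 -o quotes.json" should then write only the matching author URLs to quotes.json:

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes2"
    start_urls = ['http://quotes.toscrape.com/tag/humor/']

    def parse(self, response):
        # loop over the quote blocks and yield only author links whose href contains "Jane"
        for quote in response.css('.quote'):
            jane_url = quote.xpath('.//a[contains(@href, "Jane")]/@href').extract_first()
            if jane_url is not None:
                yield {'url': jane_url}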


How can I list the URL of the page the data was scraped from with Scrapy?

I'm a real beginner but I've been searching high and low and can't seem to find a solution. I'm working on building some spiders but I can't figure out how to identify what URL my scraped data comes from.
My spider is extremely basic right now, I'm trying to learn as I go.
I've tried a few lines I found on Stack Overflow, but I can't get anything working other than a print statement (I can't remember if it was "URL: " + response.request.url or something similar; I tried a bunch of things) that worked in the parse section of the code, but I can't get anything working in the yield.
I could add other identifiers in the output, but ideally I'd like the URL for the project I'm working towards.
import scrapy

class FanaticsSpider(scrapy.Spider):
    name = 'fanatics'
    start_urls = [
        'https://www.fanaticsoutlet.com/nfl/new-england-patriots/new-england-patriots-majestic-showtime-logo-cool-base-t-shirt-navy/o-9172+t-70152507+p-1483408147+z-8-1114341320',
        'https://www.fanaticsoutlet.com/nfl/new-england-patriots/new-england-patriots-nfl-pro-line-mantra-t-shirt-navy/o-2427+t-69598185+p-57711304142+z-9-2975969489',
    ]

    def parse(self, response):
        yield {
            'sale-price': response.xpath('//span[@data-talos="pdpProductPrice"]/span[@class="sale-price"]/text()').re(r'[$]\d+\.\d+'),
            #'sale-price': response.xpath('//span[@data-talos="pdpProductPrice"]/span[@class="sale-price"]/text()').get(),
            'regular-price': response.xpath('//span[@data-talos="pdpProductPrice"]/span[@class="regular-price strike-through"]/text()').re(r'[$]\d+\.\d+'),
            #'regular-price': response.xpath('//span[@data-talos="pdpProductPrice"]/span[@class="regular-price strike-through"]/text()').get(),
        }
Any help is much appreciated. I haven't begun to learn anything about pipelines yet; I'm not sure if that might hold the solution?
You can simply add the url in the yield like this:
yield {...,
'url': response.url,
...}
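Applied to the spider above, the parse method might look roughly like this (just a sketch, keeping the original XPaths and regexes from the question):

def parse(self, response):
    yield {
        'sale-price': response.xpath('//span[@data-talos="pdpProductPrice"]/span[@class="sale-price"]/text()').re(r'[$]\d+\.\d+'),
        'regular-price': response.xpath('//span[@data-talos="pdpProductPrice"]/span[@class="regular-price strike-through"]/text()').re(r'[$]\d+\.\d+'),
        # the URL this item was scraped from
        'url': response.url,
    }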

Link extractor is not able to get the paths beyond a certain path

I need a bit of help and your guidance with Scrapy.
My start_url is: http://lighting.philips.co.uk/prof/
I have pasted my code below, which is able to get the links/paths up to the URL below, but not beyond that. I need to go to each product's page listed under that path. On the "productsinfamily" page the specific products are listed (perhaps within JavaScript). My crawler is not able to reach those individual product pages.
http://www.lighting.philips.co.uk/prof/led-lamps-and-tubes/led-lamps/corepro-ledbulb/productsinfamily/
Below is the code for the crawl spider:
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class ProductSearchSpider(CrawlSpider):
    name = "product_search"
    allowed_domains = ["lighting.philips.co.uk"]
    start_urls = ['http://lighting.philips.co.uk/prof/']

    rules = (Rule(LinkExtractor(allow=
        (r'^https?://www.lighting.philips.co.uk/prof/led-lamps-and-tubes/.*', ),),
        callback='parse_page', follow=True),)

    def parse_page(self, response):
        yield {'URL': response.url}
You are right that the links are defined in javascript.
If you take a look at the html source, on line 3790 you can see a variable named d75products created. This is later used to populate a template and display the products.
The way I'd approach this would be to extract this data from the source and use the json module to load it. Once you have the data, you can do with it whatever you want.
Another way would be to use something (e.g. a browser) to execute the javascript, and then parse the resulting html. I do think that's unnecessary and overcomplicated though.
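A rough sketch of the first approach (the variable name d75products comes from the page source mentioned above, but the exact "var d75products = [...]" assignment format, and whether it parses as strict JSON, are assumptions, so the regex may need adjusting):

import json
import re

import scrapy

class PhilipsProductsSpider(scrapy.Spider):  # hypothetical spider, separate from the CrawlSpider above
    name = "philips_products"
    start_urls = [
        'http://www.lighting.philips.co.uk/prof/led-lamps-and-tubes/led-lamps/corepro-ledbulb/productsinfamily/',
    ]

    def parse(self, response):
        # pull the javascript variable holding the product data out of the html source
        match = re.search(r'var\s+d75products\s*=\s*(\[.*?\]);', response.text, re.DOTALL)
        if not match:
            return
        products = json.loads(match.group(1))  # assumes the array is valid JSON
        for product in products:
            yield product  # each entry is a dict taken straight from the page's JS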

Scrapy + extract only text + carriage returns in output file

I am new to Scrapy and am trying to extract content from a web page, but I'm getting lots of extra characters in the output. See the image attached.
How can I update my code to get rid of the characters? I need to extract only the href from the web page.
My code:
class AttractionSpider(CrawlSpider):
    name = "get-webcontent"
    start_urls = [
        'http://quotes.toscrape.com/page/1/'
    ]
    rules = ()

    def create_dirs(dir):
        if not os.path.exists(dir):
            os.makedirs(dir)
        else:
            shutil.rmtree(dir) #removes all the subdirectories!
            os.makedirs(dir)

    def __init__(self, name=None, **kwargs):
        super(AttractionSpider, self).__init__(name, **kwargs)
        self.items_buffer = {}
        self.base_url = "http://quotes.toscrape.com/page/1/"
        from scrapy.conf import settings
        settings.overrides['DOWNLOAD_TIMEOUT'] = 360

    def write_to_file(file_name, content_list):
        with open(file_name, 'wb') as fp:
            pickle.dump(content_list, fp)

    def parse(self, response):
        print ("Start scrapping webcontent....")
        try:
            str = ""
            hxs = Selector(response)
            links = hxs.xpath('//li//@href').extract()
            with open('test1_href', 'wb') as fp:
                pickle.dump(links, fp)
            if not links:
                return
                log.msg("No Data to scrap")
            for link in links:
                v_url = ''.join( link.extract() )
                if not v_url:
                    continue
                else:
                    _url = self.base_url + v_url
        except Exception as e:
            log.msg("Parsing failed for URL {%s}"%format(response.request.url))
            raise

    def parse_details(self, response):
        print ("Start scrapping Detailed Info....")
        try:
            hxs = Selector(response)
            yield l_venue
        except Exception as e:
            log.msg("Parsing failed for URL {%s}"%format(response.request.url))
            raise
Now I must say... you obviously have some experience with Python programming, congrats, and you're obviously working through the official Scrapy docs tutorial, which is great, but for the life of me I can't tell exactly what you're trying to accomplish from the code snippet you've provided. But that's OK, here are a couple of things:
You are using a Scrapy crawl spider (CrawlSpider). With a CrawlSpider, the rules handle the following (the pagination, if you will) as well as pointing a callback to the function that should run when the appropriate regular expression in a rule matches a page, to then kick off the extraction or itemization. It is absolutely crucial to understand that you cannot use a CrawlSpider without setting the rules, and, equally important, when using a CrawlSpider you cannot use the parse function, because the way the CrawlSpider is built, parse is already a native built-in function within it. Do go ahead and read the docs, or just create a CrawlSpider and see how the generated template doesn't use parse.
Your code
class AttractionSpider(CrawlSpider):
    name = "get-webcontent"
    start_urls = [
        'http://quotes.toscrape.com/page/1/'
    ]
    rules = () #big no no ! s3 rul3s
How it should look:
class AttractionSpider(CrawlSpider):
    name = "get-webcontent"
    start_urls = ['http://quotes.toscrape.com'] # this would be considered a base url

    # regex is our best friend, know it well; basically all pages that follow
    # this pattern ... page/.* (meaning all following pages, no exception)
    rules = (
        Rule(LinkExtractor(allow=r'/page/.*'), callback='parse_item', follow=True),
    )
Number two: going back to what I mentioned about using the parse function with a Scrapy crawl spider, you should use "parse_item" instead. I assume that you at least looked over the official docs, but to sum it up, the reason parse cannot be used is that the crawl spider already uses parse within its own logic, so by defining parse in a CrawlSpider you're overriding a native function that it has, and that can cause all sorts of bugs and issues.
That's pretty straightforward; I don't think I have to show you a snippet, but feel free to go to the official docs and, on the right side where it says "Spiders", scroll down until you hit "CrawlSpider"; it gives some notes with a caution...
To my next point: from your initial parse you do not have a callback that goes from parse to parse_details, which leads me to believe that when you perform the crawl you don't go past the first page. Aside from that, you're trying to create a text file (or you're using the os module to write something out, but you're not actually writing anything), so I'm quite confused about why you are writing rather than reading.
I mean, I myself have on many occasions used an external text file, or a CSV file for that matter, that includes multiple URLs so I don't have to hard-code them, but you're clearly writing, or trying to write, to a file, which you said was a pipeline? Now I'm even more confused! The point is that I hope you're well aware of the fact that if you are trying to create a file or an export of your extracted items, there are already pre-built export formats (CSV, JSON, XML). And as you said in your response to my comment, if you are indeed using a pipeline with an item exporter, you can in turn create your own export format as you wish; but if it's only the response URL that you need, why go through all that hassle?
My parting words would be: it would serve you well to go over Scrapy's official docs tutorial again, ad nauseam, and I'd stress the importance of also using settings.py as well as items.py.
# -*- coding: utf-8 -*-
import scrapy
import os
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from quotes.items import QuotesItem

class QcrawlSpider(CrawlSpider):
    name = 'qCrawl'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/']

    rules = (
        Rule(LinkExtractor(allow=r'page/.*'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        rurl = response.url
        item = QuotesItem()
        item['quote'] = response.css('span.text::text').extract()
        item['author'] = response.css('small.author::text').extract()
        item['rUrl'] = rurl
        yield item

        with open(os.path.abspath('') + '_' + "urllisr_" + '.txt', 'a') as a:
            a.write(''.join([rurl, '\n']))
            a.close()
Of course, items.py would be filled out with the fields you see in the spider, but by including the response URL as an item field I can both write it out with the default Scrapy export methods (CSV etc.) and write it out my own way.
In this case it's a simple text file, but one can get pretty crafty; for example, using the os module in the same way I have created m3u playlists from video hosting sites, or you can get fancy with a custom CSV item exporter. Beyond even that, with a custom pipeline you can write out a custom format for your CSVs or whatever it is that you wish.
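For reference, a minimal sketch of the items.py matching the fields the spider above assigns (the project name "quotes" is taken from the import in that snippet):

# items.py -- minimal sketch matching the fields used in the qCrawl spider
import scrapy

class QuotesItem(scrapy.Item):
    quote = scrapy.Field()   # quote texts extracted on the page
    author = scrapy.Field()  # author names extracted on the page
    rUrl = scrapy.Field()    # the URL the data was scraped from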

Setting a custom long list of starting URLS in Scrapy

The crawling starts from the list included in start_urls = []
I need a long list of these starting URLs, and I have tried 2 methods of solving this problem:
Method 1: Using pandas to define the starting_urls array
#Array of Keywords
keywords = pandas.Keyword
urls = {}
count = 0
while(count < 100):
    urls[count] = 'google.com?q=' + keywords[count]
    count = count + 1
#Now I have the starting urls in the urls array.
However, it doesn't seem to define starting_urls = urls because when I run:
scrapy crawl SPIDER
I get the error:
Error: Request url must be str or unicode, got int:
Method 2:
Each starting URL contains paginated content and in the def parse method I have the following code to crawl all linked pages.
next_page = response.xpath('//li[@class="next"]/a/@href').extract_first()
yield response.follow(next_page, callback=self.parse)
I want to add additional pages to crawl from the urls array defined above.
count = 0
while(count < 100):
    yield response.follow(urls[count], callback=self.parse)
    count = count + 1
But it seems that neither of these 2 methods works. Maybe I can't add this code to the spider.py file?
To make a first note: obviously I can't say I've run your entire script, since it's incomplete, but the first thing I noticed is that your base URL does need to be in the proper format, i.e. "http://...", for Scrapy to make a proper request.
Also, not to question your skills, but in case you weren't aware: by using the strip, split and join functions (plus str() and int()) you can convert lists, strings, dictionaries and integers back and forth to achieve the desired effect...
WHAT'S HAPPENING IN YOUR CASE:
I'll be using range instead of your count loop, but it mimics your issue:
lis = range(11)
site = "site.com/page="
for i in lis:
    print(site + i)
----------
TypeError: Can't convert 'int' object to str implicitly
#TURNING MY INT INTO STR:
lis = range(11)
site = "site.com/page="
for i in lis:
    print(site + str(i))
--------------------
site.com/page=0
site.com/page=1
site.com/page=2
site.com/page=3
site.com/page=4
site.com/page=5
site.com/page=6
site.com/page=7
site.com/page=8
site.com/page=9
site.com/page=10
As to the error: when you increment the count by 1 and then build the entire URL with it, you are trying to make a string out of a string plus an integer... I'd suggest simply turning the integer into a string before constructing your URL (the count itself stays an integer, so it can still be incremented appropriately).
My go-to way to keep the code as clean as possible is much cleaner: add an extra file at the root or in the current working folder from which you start the crawl, containing all the URLs you wish to scrape; you can then use Python's read functions to open that file inside your spider script, like this:
class xSpider(BaseSpider):
    name = "w.e"

    f = open("urls.txt")
    start_urls = [url.strip() for url in f.readlines()]
    f.close()
What really bothers me is that your error says you're combining a string with an integer. I'll ask again: if you need further help, post a complete snippet of your spider and, in the spirit of coders' kinship, also your settings.py, because I'll tell you right now that, as you'll end up finding out, despite any adjustments to the settings.py file you won't be able to scrape Google search pages... or rather, not the full number of result pages... for that I would recommend Scrapy in conjunction with BeautifulSoup.
The immediate problem I see is that you are making a dict where it expects a list. :) Change it to a list.
There are also all kinds of interactions depending on which underlying spider you inherited from (if you did at all). Try switching to a list, then post the question again with more data if you are still having problems.
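For example, a sketch of Method 1 with start_urls built as a list of strings (the "keywords.csv" file and its Keyword column are assumptions standing in for however the pandas Series is actually loaded):

import pandas
import scrapy

class KeywordSpider(scrapy.Spider):  # hypothetical spider name
    name = "SPIDER"

    # build start_urls as a list of str, not a dict keyed by int
    keywords = pandas.read_csv("keywords.csv").Keyword
    start_urls = ['http://google.com?q=' + str(kw) for kw in keywords[:100]]

    def parse(self, response):
        yield {'url': response.url}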

Looping in Scrapy doesn't work properly

I'm trying to write a small web crawler with Scrapy.
I wrote a crawler that grabs the URLs of certain links on a certain page and writes those links to a CSV file. I then wrote another crawler that loops over those links and downloads some information from the pages they point to.
The loop on the links:
cr = csv.reader(open("linksToCrawl.csv","rb"))
start_urls = []
for row in cr:
    start_urls.append("http://www.zap.co.il/rate" + ''.join(row[0])[1:len(''.join(row[0]))])
If, for example, the URL of the page I'm retrieving information from is:
http://www.zap.co.il/ratemodel.aspx?modelid=835959
then more information can (sometimes) be retrieved from following pages, like:
http://www.zap.co.il/ratemodel.aspx?modelid=835959&pageinfo=2
("&pageinfo=2" was added).
Therefore, my rules are:
rules = (Rule(SgmlLinkExtractor(allow=("&pageinfo=\d", ),
                                restrict_xpaths=('//a[@class="NumBtn"]',)),
              callback="parse_items", follow=True),)
It seemed to be working fine. However, it seems that the crawler is only retrieving information from the pages with the extended URLs (with the "&pageinfo=\d"), and not from the ones without them. How can I fix that?
Thank you!
You can override parse_start_url() method in CrawlSpider:
class MySpider(CrawlSpider):
    def parse_items(self, response):
        # put your code here
        ...

    parse_start_url = parse_items
Your rule only allows URLs containing "&pageinfo=\d". In effect, only pages with a matching URL will be processed. You need to change the allow parameter (or handle the start URLs separately, as above) for the URLs without pageinfo to be processed.
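A sketch combining both suggestions, using the modern LinkExtractor in place of SgmlLinkExtractor (the spider name and the single example start URL are placeholders taken from the question; in practice start_urls would be loaded from linksToCrawl.csv):

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class RatePagesSpider(CrawlSpider):  # hypothetical name
    name = "rate_pages"
    start_urls = ['http://www.zap.co.il/ratemodel.aspx?modelid=835959']

    # follow the "&pageinfo=N" pagination links
    rules = (
        Rule(LinkExtractor(allow=(r'&pageinfo=\d',),
                           restrict_xpaths=('//a[@class="NumBtn"]',)),
             callback="parse_items", follow=True),
    )

    def parse_items(self, response):
        # extraction code for a single rate page goes here
        yield {'url': response.url}

    # make sure the start URLs themselves (without &pageinfo) are parsed too
    parse_start_url = parse_items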