I am using the ScrapingHub API and shub to deploy my project. However, the scraped items do not come out with their fields in the order I expect.
I need them in the following order: Title, Publish Date, Description, Link. How can I get the output in exactly that order for every item?
Below is a short sample of my spider:
import scrapy
from scrapy.spiders import XMLFeedSpider
from tickers.items import tickersItem

class Spider(XMLFeedSpider):
    name = "Scraper"
    allowed_domains = ["yahoo.com"]
    start_urls = (
        'https://feeds.finance.yahoo.com/rss/2.0/headline?s=ABIO,ACFN,AEMD,AEZS,AITB,AJX,AU,AKERMN,AUPH,AVL,AXPW',
        'https://feeds.finance.yahoo.com/rss/2.0/headline?s=DRIO',
        'https://feeds.finance.yahoo.com/rss/2.0/headline?s=IDXG,IMMU,IMRN,IMUC,INNV,INVT,IPCI,INPX,JAGX,KDMN,KTOV,LQMT',
    )
    itertag = 'item'

    def parse_node(self, response, node):
        item = {}
        item['Title'] = node.xpath('title/text()').extract_first()
        item['Description'] = node.xpath('description/text()').extract_first()
        item['Link'] = node.xpath('link/text()').extract_first()
        item['PublishDate'] = node.xpath('pubDate/text()').extract_first()
        return item
Additionally, here is my items.py file. It defines the fields in the same order as my spider, so I have no idea why the output is not in order.
items.py:
import scrapy

class tickersItem(scrapy.Item):
    Title = scrapy.Field()
    Description = scrapy.Field()
    Link = scrapy.Field()
    PublishDate = scrapy.Field()
The syntax is valid in both the items file and the spider file, and I have no idea how to fix this. I am a new Python programmer.
Instead of defining items in items.py, you could use collections.OrderedDict. Just import the collections module and, in the parse_node method, change the line:
item = {}
to line:
item = collections.OrderedDict()
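A minimal sketch of parse_node with that change applied (field names and XPaths taken from the question, assignments in the desired output order):

import collections

def parse_node(self, response, node):
    # OrderedDict keeps insertion order, so exported fields follow
    # the order in which they are assigned below.
    item = collections.OrderedDict()
    item['Title'] = node.xpath('title/text()').extract_first()
    item['PublishDate'] = node.xpath('pubDate/text()').extract_first()
    item['Description'] = node.xpath('description/text()').extract_first()
    item['Link'] = node.xpath('link/text()').extract_first()
    return item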
Or, if you want defined items, you could use the approach outlined in this answer. Your items.py would then contain this code:
from collections import OrderedDict

from scrapy import Field, Item
import six


class OrderedItem(Item):
    def __init__(self, *args, **kwargs):
        self._values = OrderedDict()
        if args or kwargs:  # avoid creating dict for most common case
            for k, v in six.iteritems(dict(*args, **kwargs)):
                self[k] = v


class tickersItem(OrderedItem):
    Title = Field()
    Description = Field()
    Link = Field()
    PublishDate = Field()
You should then also modify your spider code to use this item accordingly. Refer to the documentation.
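The only change needed in parse_node is to construct the ordered item instead of a plain dict, and assign the fields in the order you want them exported:

item = tickersItem()  # instead of item = {}; assignment order now defines output order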
Related
I want to get '25430989' from the end of this url.
https://www.example.com/cars-for-sale/2007-ford-focus-1-6-diesel/25430989
How would I write this using XPath?
I get the link using this xpath:
link = row.xpath('.//a/@href').get()
When I use a regex tester I can isolate it with r'(\d+)$', but when I put it into my code it doesn't work for some reason.
import scrapy
import re

from ..items import DonedealItem


class FarmtoolsSpider(scrapy.Spider):
    name = 'farmtools'
    allowed_domains = ['www.donedeal.ie']
    start_urls = ['https://www.donedeal.ie/all?source=private&sort=publishdate%20desc']

    def parse(self, response):
        items = DonedealItem()
        rows = response.xpath('//ul[@class="card-collection"]/li')
        for row in rows:
            if row.xpath('.//ul[@class="card__body-keyinfo"]/li[contains(text(),"0 min")]/text()'):
                link = row.xpath('.//a/@href').get()  # this is the full link.
                linkid = link.re(r'(\d+)$').get()
                title = row.xpath('.//p[@class="card__body-title"]/text()').get()
                county = row.xpath('.//li[contains(text(),"min")]/following-sibling::node()/text()').get()
                price = row.xpath('.//p[@class="card__price"]/span[1]/text()').get()
                subcat = row.xpath('.//a/div/div[2]/div[1]/p[2]/text()[2]').get()
                items['link'] = link
                items['linkid'] = linkid
                items['title'] = title
                items['county'] = county
                items['price'] = price
                items['subcat'] = subcat
                yield items
I'm trying to get the linkid.
The problem is here:
link = row.xpath('.//a/@href').get()  # this is the full link.
linkid = link.re(r'(\d+)$').get()
The .get() method returns a string, which is saved in the link variable, and strings don't have a .re() method for you to call. You can use one of the functions from the re module instead (see the docs for reference).
I would use re.findall(); it returns a list of values that match the regex (in this case only one item would be returned), or an empty list if nothing matches. re.search() is also a good choice, but it returns an re.Match object (or None if there is no match).
import re  # don't forget to import it
...
link = row.xpath('.//a/@href').get()
linkid = re.findall(r'(\d+)$', link)
Scrapy selectors also support regular expressions, so an alternative would be to implement it like this (no need for the re module):
linkid = row.xpath('.//a/@href').re_first(r'(\d+)$')
Notice I didn't use .get() there.
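Applied inside the question's loop, a minimal sketch (keeping the existing variable names) could look like this:

link = row.xpath('.//a/@href').get()                   # full URL, e.g. .../25430989
linkid = row.xpath('.//a/@href').re_first(r'(\d+)$')   # just the trailing digits, as a string
items['link'] = link
items['linkid'] = linkid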
I have written this short spider to extract titles from the Hacker News front page (http://news.ycombinator.com/).
import scrapy

class HackerItem(scrapy.Item):  # declaring the item
    hackertitle = scrapy.Field()

class HackerSpider(scrapy.Spider):
    name = 'hackernewscrawler'
    allowed_domains = ['news.ycombinator.com']  # website we chose
    start_urls = ['http://news.ycombinator.com/']

    def parse(self, response):
        sel = scrapy.Selector(response)  # selector to help us extract the titles
        item = HackerItem()  # the item declared above

        # xpath of the titles
        item['hackertitle'] = sel.xpath("//tr[@class='athing']/td[3]/a[@href]/text()").extract()

        # printing titles using print statement.
        print(item['hackertitle'])
However, when I run the spider with scrapy crawl hackernewscrawler -o hntitles.json -t json,
I get an empty .json file with no content in it.
You should change the print statement to yield:
import scrapy

class HackerItem(scrapy.Item):  # declaring the item
    hackertitle = scrapy.Field()

class HackerSpider(scrapy.Spider):
    name = 'hackernewscrawler'
    allowed_domains = ['news.ycombinator.com']  # website we chose
    start_urls = ['http://news.ycombinator.com/']

    def parse(self, response):
        sel = scrapy.Selector(response)  # selector to help us extract the titles
        item = HackerItem()  # the item declared above

        # xpath of the titles
        item['hackertitle'] = sel.xpath("//tr[@class='athing']/td[3]/a[@href]/text()").extract()

        # return the item
        yield item
Then run:
scrapy crawl hackernewscrawler -o hntitles.json -t json
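If you would rather get one JSON object per headline instead of a single item holding the whole list of titles, a minimal variation of parse (same item class, same XPath as above) would be:

    def parse(self, response):
        # yield one HackerItem per title row
        for title in response.xpath("//tr[@class='athing']/td[3]/a[@href]/text()").extract():
            item = HackerItem()
            item['hackertitle'] = title
            yield item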
I am currently building my first Scrapy project, and I am trying to extract data from an HTML table. Here is my crawl spider so far:
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from digikey.items import DigikeyItem
from scrapy.selector import Selector

class DigikeySpider(CrawlSpider):
    name = 'digikey'
    allowed_domains = ['digikey.com']
    start_urls = [
        'https://www.digikey.com/products/en/capacitors/aluminum-electrolytic-capacitors/58/page/3?stock=1',
        'www.digikey.com/products/en/capacitors/aluminum-electrolytic-capacitors/58/page/4?stock=1',
    ]

    rules = (
        # Extract links matching 'category.php' (but not matching 'subsection.php')
        # and follow links from them (since no callback means follow=True by default).
        Rule(LinkExtractor(allow=('/products/en/capacitors/aluminum-electrolytic-capacitors/58/page/3?stock=1', ), deny=('subsection\.php', ))),
    )

    def parse_item(self, response):
        item = DigikeyItem()
        item['partnumber'] = response.xpath('//td[@class="tr-mfgPartNumber"]/a/span[@itemprop="name"]/text()').extract()
        item['manufacturer'] = response.xpath('///td[6]/span/a/span/text()').extract()
        item['description'] = response.xpath('//td[@class="tr-description"]/text()').extract()
        item['quanity'] = response.xpath('//td[@class="tr-qtyAvailable ptable-param"]//text()').extract()
        item['price'] = response.xpath('//td[@class="tr-unitPrice ptable-param"]/text()').extract()
        item['minimumquanity'] = response.xpath('//td[@class="tr-minQty ptable-param"]/text()').extract()
        yield item

    parse_start_url = parse_item
It scrapes the table at www.digikey.com/products/en/capacitors/aluminum-electrolytic-capacitors/58/page/4?stock=1 and then exports the data to a digikey.csv file, but all of the data ends up in one cell.
Csv file with scraped data in one cell
settings.py:
BOT_NAME = 'digikey'
SPIDER_MODULES = ['digikey.spiders']
NEWSPIDER_MODULE = 'digikey.spiders'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'digikey ("Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.93 Safari/537.36")'
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
I want the scraped information output one row at a time, with each row containing the information associated with that part number.
items.py
import scrapy

class DigikeyItem(scrapy.Item):
    partnumber = scrapy.Field()
    manufacturer = scrapy.Field()
    description = scrapy.Field()
    quanity = scrapy.Field()
    minimumquanity = scrapy.Field()
    price = scrapy.Field()
Any help is much appreciated!
The problem is that you're loading whole columns into each field of a single item. I think what you want is something like:
for row in response.css('table#productTable tbody tr'):
    item = DigikeyItem()
    item['partnumber'] = (row.css('.tr-mfgPartNumber [itemprop="name"]::text').extract_first() or '').strip()
    item['manufacturer'] = (row.css('[itemprop="manufacture"] [itemprop="name"]::text').extract_first() or '').strip()
    item['description'] = (row.css('.tr-description::text').extract_first() or '').strip()
    item['quanity'] = (row.css('.tr-qtyAvailable::text').extract_first() or '').strip()
    item['price'] = (row.css('.tr-unitPrice::text').extract_first() or '').strip()
    item['minimumquanity'] = (row.css('.tr-minQty::text').extract_first() or '').strip()
    yield item
I've changed the selectors a bit to make them shorter. By the way, please avoid the manual extract_first and strip repetition I've used here (it's just for testing purposes) and consider using Item Loaders instead; they make it easier to take the first match and strip/format the desired output.
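For example, a minimal Item Loader sketch under the same assumptions (field names and selectors taken from the answer above; on newer Scrapy versions the processors live in itemloaders.processors rather than scrapy.loader.processors):

from scrapy.loader import ItemLoader
from scrapy.loader.processors import MapCompose, TakeFirst

class DigikeyLoader(ItemLoader):
    # strip whitespace on input, keep only the first extracted value on output
    default_input_processor = MapCompose(str.strip)
    default_output_processor = TakeFirst()

def parse_item(self, response):
    for row in response.css('table#productTable tbody tr'):
        loader = DigikeyLoader(item=DigikeyItem(), selector=row)
        loader.add_css('partnumber', '.tr-mfgPartNumber [itemprop="name"]::text')
        loader.add_css('manufacturer', '[itemprop="manufacture"] [itemprop="name"]::text')
        loader.add_css('description', '.tr-description::text')
        loader.add_css('quanity', '.tr-qtyAvailable::text')
        loader.add_css('price', '.tr-unitPrice::text')
        loader.add_css('minimumquanity', '.tr-minQty::text')
        yield loader.load_item()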
I'm having trouble with my Scrapy crawler following links. Below is my code. I want it to go to the YouTube pages, pull the Twitter links, and then call parse_page3 to pull in more information, but right now only the parse_page2 extraction part is working.
Thanks!
Eric
import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor
# from scrapy.contrib.linkextractors.lxmlhtml import LxmlLinkExtractor
from tutorial.items import YTItem

class YTSpider(scrapy.Spider):
    name = "youtube"
    allowed_domains = ["youtube.com", "twitter.com"]
    start_urls = [
        "https://www.youtube.com/jackcontemusic/about",
        "https://www.youtube.com/user/natalydawn/about"
    ]

    rules = [Rule(LinkExtractor(allow=('twitter.com',)), callback='parse_twitter'),]

    def parse(self, response):
        item = YTItem()
        item['main_url'] = response.url
        request = scrapy.Request(response.url, callback=self.parse_page2)
        request.meta['item'] = item
        yield request

    def parse_page2(self, response):
        item = response.meta['item']
        item['joindate'] = response.selector.xpath('normalize-space(//li[contains(text(),"Joined")]/text())').extract()
        item['subscribers'] = response.selector.xpath('//li[@class="about-stat " and contains(.,"subscribers")]/node()/node()').extract()
        item['views'] = response.selector.xpath('//li[@class="about-stat " and contains(.,"views")]/node()/node()').extract()
        item['url'] = response.selector.xpath('//div[@class="cmt_iframe_holder"]/@data-href').extract()
        item['fb'] = response.selector.xpath('(//li[@class="channel-links-item"]/a[@title="Facebook"]/@href)[1]').extract()
        item['twitter'] = response.selector.xpath('(//li[@class="channel-links-item"]/a[@title="Twitter"]/@href)[1]').extract()
        item['googleplus'] = response.selector.xpath('(//li[@class="channel-links-item"]/a[@title="Google+"]/@href)[1]').extract()
        item['itunes'] = response.selector.xpath('(//li[@class="channel-links-item"]/a[@title="iTunes"]/@href)[1]').extract()
        return item

    def parse_twitter(self, response):
        item = YTItem()
        item['twitter_url'] = response.url
        request = scrapy.Request(response.url, callback=self.parse_twitter)
        item = response.meta['item']
        item['tweets'] = response.selector.xpath('(//span[@class="ProfileNav-value"])[1]').extract()
        return item
If you want to use Rules and LinkExtractors, you need to use the CrawlSpider class.
Replace:
class YTSpider(scrapy.Spider):
with:
from scrapy.contrib.spiders import CrawlSpider
class YTSpider(CrawlSpider):
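One caveat, since the question's spider also defines a parse method: CrawlSpider implements parse internally to apply its rules, so it must not be overridden. A minimal sketch of moving the start-URL logic into parse_start_url (which CrawlSpider calls for each start page), keeping the other callbacks unchanged:

    def parse_start_url(self, response):
        # Called by CrawlSpider for each start URL; do what parse() used to do.
        item = YTItem()
        item['main_url'] = response.url
        request = scrapy.Request(response.url, callback=self.parse_page2,
                                 dont_filter=True)  # re-requesting the same URL, so skip the dupe filter
        request.meta['item'] = item
        yield request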
In this code I want to scrape the title, subtitle, and data inside the links, but I'm having issues on pages beyond 1 and 2, where only one item gets scraped. I also want to extract only those entries whose title contains Delhivery.
import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import Selector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from urlparse import urljoin

from delhivery.items import DelhiveryItem

class criticspider(CrawlSpider):
    name = "delh"
    allowed_domains = ["consumercomplaints.in"]
    start_urls = ["http://www.consumercomplaints.in/?search=delhivery&page=2"]

    def parse(self, response):
        sites = response.xpath('//table[@width="100%"]')
        items = []
        for site in sites:
            item = DelhiveryItem()
            item['title'] = site.xpath('.//td[@class="complaint"]/a/span[@style="background-color:yellow"]/text()').extract()[0]
            #item['title'] = site.xpath('.//td[@class="complaint"]/a[text() = "%s Delivery Courier %s"]/text()').extract()[0]
            item['subtitle'] = site.xpath('.//td[@class="compl-text"]/div/b[1]/text()').extract()[0]
            item['date'] = site.xpath('.//td[@class="small"]/text()').extract()[0].strip()
            item['username'] = site.xpath('.//td[@class="small"]/a[2]/text()').extract()[0]
            item['link'] = site.xpath('.//td[@class="complaint"]/a/@href').extract()[0]
            if item['link']:
                if 'http://' not in item['link']:
                    item['link'] = urljoin(response.url, item['link'])
                yield scrapy.Request(item['link'],
                                     meta={'item': item},
                                     callback=self.anchor_page)
            items.append(item)

    def anchor_page(self, response):
        old_item = response.request.meta['item']
        old_item['data'] = response.xpath('.//td[@style="padding-bottom:15px"]/div/text()').extract()[0]
        yield old_item
You need to change the item['title'] to this:
item['title'] = ''.join(site.xpath('//table[@width="100%"]//span[text() = "Delhivery"]/parent::*//text()').extract()[0])
Also, edit sites to this to extract only the required links (the ones with Delhivery in them):
sites = response.xpath('//table//span[text()="Delhivery"]/ancestor::div')
EDIT:
So I understand now that you need to add a pagination rule to your code.
It should be something like this; you just need to add your imports and write the new XPaths for the item page itself, such as this one:
class criticspider(CrawlSpider):
    name = "delh"
    allowed_domains = ["consumercomplaints.in"]
    start_urls = ["http://www.consumercomplaints.in/?search=delhivery"]

    rules = (
        # Extracting pages, allowing only links with page=number to be extracted
        Rule(SgmlLinkExtractor(restrict_xpaths=('//div[@class="pagelinks"]', ), allow=('page=\d+', ), unique=True), follow=True),
        # Extract links of items on each page the spider gets from the first rule
        Rule(SgmlLinkExtractor(restrict_xpaths=('//td[@class="complaint"]', )), callback='parse_item'),
    )

    def parse_item(self, response):
        item = DelhiveryItem()
        # Populate the item object here the same way you did; this function will be called for each item link.
        # This means that you'll be extracting data from pages like this one:
        # http://www.consumercomplaints.in/complaints/delhivery-last-mile-courier-service-poor-delivery-service-c772900.html#c1880509
        item['title'] = response.xpath('<write xpath>').extract()[0]
        item['subtitle'] = response.xpath('<write xpath>').extract()[0]
        item['date'] = response.xpath('<write xpath>').extract()[0].strip()
        item['username'] = response.xpath('<write xpath>').extract()[0]
        item['link'] = response.url
        item['data'] = response.xpath('<write xpath>').extract()[0]
        yield item
Also, when you write an XPath, I suggest you don't rely on styling attributes. Try to use @class or @id, and only fall back to @width, @style, or other styling attributes if there is no other way.
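For instance, using selectors already present in the code above:

# fragile: relies on inline styling that can change at any time
response.xpath('//td[@style="padding-bottom:15px"]/div/text()')

# more robust: relies on a semantic class attribute
response.xpath('//td[@class="complaint"]/a/@href')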