Getting the same data for every page while crawling a website - scrapy

I am trying to crawl a web page to get its reviews and ratings, but I am getting the same data as output for every page.
import scrapy
import json
from scrapy.spiders import Spider

class RatingSpider(Spider):
    name = "rate"

    def start_requests(self):
        for i in range(1, 10):
            url = "https://www.fandango.com/aquaman-208499/movie-reviews?pn=" + str(i)
            print(url)
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        print(json.dumps({
            'rating': response.xpath("//div[@class='star-rating__score']").xpath("@style").extract(),
            'review': response.xpath("//p[@class='fan-reviews__item-content']/text()").getall(),
        }))
Expected: crawling 1000 pages of the site https://www.fandango.com/aquaman-208499/movie-reviews
Actual output:
https://mobile.fandango.com/aquaman-208498/movie-reviews?pn=1
{"rating": ["width: 90%;", "width: 100%;", "width: 100%;", "width: 100%;", "width: 100%;", "width: 60%;"], "review": ["Everything and more that you would expect from Aquaman. Lots of action, humor, interpersonal conflict, and some romance.", "Best Movie ever action great story omg DC has stepped its game up excited for the next movie \n\nTotal must see total", "It was Awesome! Visually Stunning!", "It was fantastic five stars", "Very chaotic with too much action and confusion."]}
https://mobile.fandango.com/aquaman-208499/movie-reviews?pn=9
{"rating": ["width: 90%;", "width: 100%;", "width: 100%;", "width: 100%;", "width: 100%;", "width: 60%;"], "review": ["Everything and more that you would expect from Aquaman. Lots of action, humor, interpersonal conflict, and some romance.", "Best Movie ever action great story omg DC has stepped its game up excited for the next movie \n\nTotal must see total", "It was Awesome! Visually Stunning!", "It was fantastic five stars", "Very chaotic with too much action and confusion."]}

The reviews are dynamically populated using JavaScript.
You have to inspect the requests made by the site in cases like this.
The URL to get user reviews is this:
https://www.fandango.com/napi/fanReviews/208499/1/5
It returns a json with 5 reviews.
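If you want to sanity-check the endpoint before wiring it into Scrapy, a quick sketch with the requests library should work (the only response field assumed here is the 'data' key, which the spider below iterates over):
import requests

# The referer header is required; without it the site returns a 403
url = "https://www.fandango.com/napi/fanReviews/208499/1/5"
headers = {"referer": "https://www.fandango.com/aquaman-208499/movie-reviews?pn=1"}
reviews = requests.get(url, headers=headers).json()["data"]
print(len(reviews))  # expect 5 reviews per page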
Your spider could be rewritten like this:
import scrapy
import json
from scrapy.spiders import Spider

class RatingSpider(Spider):
    name = "rate"

    def start_requests(self):
        movie_id = "208499"
        for page in range(1, 10):
            # You have to pass the referer, otherwise the site returns a 403 error
            headers = {'referer': 'https://www.fandango.com/aquaman-208499/movie-reviews?pn={page}'.format(page=page)}
            url = "https://www.fandango.com/napi/fanReviews/{movie_id}/{page}/5".format(movie_id=movie_id, page=page)
            yield scrapy.Request(url=url, callback=self.parse, headers=headers)

    def parse(self, response):
        data = json.loads(response.text)
        for review in data['data']:
            yield review
Note that I am also using yield instead of print to emit the items; this is how Scrapy expects items to be generated.
You can run this spider like this to export the extracted items to a file:
scrapy crawl rate -o outputfile.json
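One last detail: the ratings in the original output come back as CSS width strings like "width: 90%;". If the full bar (100%) corresponds to five stars, which is my assumption rather than something the site documents, a small helper can convert them:
import re

def width_to_stars(style, max_stars=5):
    # "width: 90%;" -> 4.5, assuming 100% of the bar means five stars
    match = re.search(r'([\d.]+)\s*%', style)
    if match is None:
        return None
    return round(float(match.group(1)) / 100 * max_stars, 1)

print(width_to_stars("width: 90%;"))  # 4.5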

Related

Waiting for element to be visible

I'm practicing with a Playwright and Scrapy integration, trying to click on a select element that has hidden options. The aim is to click the select, wait for the two hidden options to load, click one of them, and then move on. However, I'm getting the following error:
waiting for selector "option[value='type-2']"
selector resolved to hidden <option value="type-2" defaultvalue="">Type 2 (I started uni on or after 2012)</option>
attempting click action
waiting for element to be visible, enabled and stable
element is not visible - waiting...
I think the issue is that when the selector is clicked, it disappears for some reason. I have implemented a wait on the selector, but the issue persists.
from scrapy.crawler import CrawlerProcess
import scrapy
from scrapy_playwright.page import PageCoroutine

class JobSpider(scrapy.Spider):
    name = 'job_play'
    custom_settings = {
        'USER_AGENT': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/15.2 Safari/605.1.15',
    }

    def start_requests(self):
        yield scrapy.Request(
            url='https://www.student-loan-calculator.co.uk/',
            callback=self.parse,
            meta=dict(
                playwright=True,
                playwright_include_page=True,
                playwright_page_coroutines=[
                    PageCoroutine("fill", "#salary", '28000'),
                    PageCoroutine("fill", "#debt", '25000'),
                    PageCoroutine("click", selector="//select[@id='loan-type']"),
                    PageCoroutine('wait_for_selector', "//select[@id='loan-type']"),
                    PageCoroutine('click', selector="//select[@id='loan-type']/option[2]"),
                    PageCoroutine('wait_for_selector', "//div[@class='form-row calculate-button-row']"),
                    PageCoroutine('click', selector="//button[@class='btn btn-primary calculate-button']"),
                    PageCoroutine('wait_for_selector', "//div[@class='container results-table-container']"),
                    PageCoroutine("wait_for_timeout", 5000),
                ]
            ),
        )

    def parse(self, response):
        container = response.xpath("//div[@class='container results-table-container']")
        for something in container:
            yield {
                'some': something
            }

if __name__ == "__main__":
    process = CrawlerProcess(
        settings={
            "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
            "DOWNLOAD_HANDLERS": {
                "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
                "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
            },
            "CONCURRENT_REQUESTS": 32,
            "FEED_URI": 'loans.jl',
            "FEED_FORMAT": 'jsonlines',
        }
    )
    process.crawl(JobSpider)
    process.start()
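A note on the log above: option elements inside a select never pass Playwright's visibility check, so clicking one directly tends to wait forever. Playwright's select_option method acts on the select element itself and is the idiomatic way to pick an option. A hedged sketch of the coroutine sequence rewritten that way (the 'type-2' value is taken from the log above; the rest of the spider is unchanged):
from scrapy_playwright.page import PageCoroutine

# Replacement for the click/wait sequence on the hidden <option>;
# select_option operates on the <select>, so the option's visibility
# doesn't matter
page_coroutines = [
    PageCoroutine("fill", "#salary", "28000"),
    PageCoroutine("fill", "#debt", "25000"),
    PageCoroutine("select_option", "select#loan-type", "type-2"),
    PageCoroutine("click", "//button[@class='btn btn-primary calculate-button']"),
    PageCoroutine("wait_for_selector", "//div[@class='container results-table-container']"),
]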

How to crawl whole website by following links using CrawlSpider?

I realized that using CrawlSpider with a LinkExtractor rule only parses the linked pages but not the starting page itself.
For example, if http://mypage.test contains links to http://mypage.test/cats/ and http://mypage.test/horses/, the crawler would parse the cats and horses page without parsing http://mypage.test. Here's a simple code sample:
from scrapy.crawler import CrawlerProcess
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class MySpider(CrawlSpider):
    name = 'myspider'
    start_urls = ['http://mypage.test']
    rules = [
        Rule(LinkExtractor(), callback='parse_page', follow=True),
    ]

    def parse_page(self, response):
        yield {
            'url': response.url,
            'status': response.status,
        }

process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
    'ITEM_PIPELINES': {
        'pipelines.MyPipeline': 100,
    },
})
process.crawl(MySpider)
process.start()
My goal is to parse every single page in a website by following links. How do I accomplish that?
Apparently, CrawlSpider with a LinkExtractor rule only parses the linked pages but not the starting page itself.
Remove start_urls and add:
from scrapy import Request

def start_requests(self):
    yield Request('http://mypage.test', callback=self.parse_page)
    # dont_filter=True so the dupefilter doesn't drop this second
    # request for the same URL
    yield Request('http://mypage.test', callback=self.parse, dont_filter=True)
CrawlSpider uses self.parse to extract and follow links.
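Alternatively, CrawlSpider has a hook for exactly this case: parse_start_url is called with the responses for start_urls, so you can keep start_urls and the rule as they are. A minimal sketch:
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class MySpider(CrawlSpider):
    name = 'myspider'
    start_urls = ['http://mypage.test']
    rules = [
        Rule(LinkExtractor(), callback='parse_page', follow=True),
    ]

    # Called for the start_urls responses, which the rules alone
    # never pass to a callback
    def parse_start_url(self, response):
        return self.parse_page(response)

    def parse_page(self, response):
        yield {
            'url': response.url,
            'status': response.status,
        }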

Scrapy : Preserving a website

I'm trying to save a copy of the pyparsing project on wikispaces.com before they take wikispaces down at the end of the month.
It seems odd (perhaps my version of google is broken ^_^) but I can't find any examples of duplicating/copying a site as-is, that is, as one views it in a browser. SO has this and this on the topic, but they only save the text, strictly the HTML/DOM structure, of the site. Unless I'm mistaken, those answers do not appear to save the images/header link files/javascript and related information necessary to render the page. Further examples I have seen are more concerned with extracting parts of the page than with duplicating it as-is.
I was wondering if anyone had any experience with this sort of thing or could point me to a useful blog/doc somewhere. I've used WinHTTrack in the past, but the robots.txt or the pyparsing.wikispaces.com/auth/ route is preventing it from running properly, and I figured I'd get some scrapy experience in.
For those interested in what I have tried thus far: here is my crawl spider implementation, which respects the robots.txt file.
import scrapy
from urllib.parse import urlparse
from pathlib import Path
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class PyparsingSpider(CrawlSpider):
    name = 'pyparsing'
    allowed_domains = ['pyparsing.wikispaces.com']
    start_urls = ['http://pyparsing.wikispaces.com/']
    rules = (
        Rule(LinkExtractor(), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        # i = {}
        # i['domain_id'] = response.xpath('//input[@id="sid"]/@value').extract()
        # i['name'] = response.xpath('//div[@id="name"]').extract()
        # i['description'] = response.xpath('//div[@id="description"]').extract()
        # return i
        page = urlparse(response.url)
        path = Path(page.netloc) / Path("" if page.path == "/" else page.path[1:])
        if path.parent:
            path.parent.mkdir(parents=True, exist_ok=True)  # Creates the folder
        path = path.with_suffix(".html")
        with open(path, 'wb') as file:
            file.write(response.body)
Trying the same thing with the sitemap spider is similar. The first SO link provides an implementation with a plain spider.
import scrapy
from scrapy.spiders import SitemapSpider
from urllib.parse import urlparse
from pathlib import Path

class PyParsingSiteMap(SitemapSpider):
    name = "pyparsing"
    sitemap_urls = [
        'http://pyparsing.wikispaces.com/sitemap.xml',
        # 'http://pyparsing.wikispaces.com/robots.txt',
    ]
    allowed_domains = ['pyparsing.wikispaces.com']
    start_urls = ['http://pyparsing.wikispaces.com']  # "/home"
    custom_settings = {
        "ROBOTSTXT_OBEY": False
    }

    def parse(self, response):
        page = urlparse(response.url)
        path = Path(page.netloc) / Path("" if page.path == "/" else page.path[1:])
        if path.parent:
            path.parent.mkdir(parents=True, exist_ok=True)  # Creates the folder
        path = path.with_suffix(".html")
        with open(path, 'wb') as file:
            file.write(response.body)
None of these spiders collect more than the HTML structure.
Also, I have found that the links, ..., that are saved do not appear to point to proper relative paths. At least, when opening the saved files, the links point to a path relative to the hard drive and not relative to the file. When opening a page via http.server, the links point to dead locations; presumably the .html extension is the trouble here. It might be necessary to remap/replace links in the stored structure (see the sketch below).
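For what it's worth, here is a sketch of how both gaps might be handled, untested against wikispaces: the parse callback also queues the images, stylesheets, and scripts each page references (allowed_domains keeps external assets out), and assets keep their original extension, so only the page links would still need remapping afterwards, e.g. with lxml.html.rewrite_links.
from pathlib import Path
from urllib.parse import urlparse

def local_path(url):
    # Mirror the on-disk layout used above: <domain>/<path>
    page = urlparse(url)
    return Path(page.netloc) / Path("" if page.path == "/" else page.path[1:])

class MirrorMixin:
    """Sketch: save the page, then fetch the assets it references."""

    def parse_item(self, response):
        path = local_path(response.url).with_suffix(".html")
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(response.body)
        # Queue the images, stylesheets and scripts referenced by the page
        assets = (response.css('img::attr(src)').getall()
                  + response.css('link::attr(href)').getall()
                  + response.css('script::attr(src)').getall())
        for url in assets:
            yield response.follow(url, callback=self.save_asset)

    def save_asset(self, response):
        path = local_path(response.url)  # keeps the original extension
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(response.body)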

Wicked_PDF render a string from a template in a background process

I've got a controller "tech" that has an action to email an invoice; from there we use Delayed::Job.enqueue to shove the actual email action into a background process, which is handled by a worker dyno on Heroku.
This is all working fine.
The trouble that I found is that my generated PDF invoice lives over on the Heroku Web Dyno file system and the Worker has no idea where this is.
I won't upload the PDF during the generation process; it takes too damn long.
So I need to create the invoice over on the worker dyno when it goes to execute the mailer action to send the message.
I found this blog with some detailed instructions on creating the pdf from a string: http://viget.com/extend/how-to-create-pdfs-in-rails
But it's not working at all for me, here is the code:
html = render_to_string(:action => ":show", :layout => "invoice.html")
@pdf = WickedPdf.new.pdf_from_string(html)
And the error:
"last_error"=>"undefined method `response_body=' for #<MailSenderJob:0x007fdf7e70a638>
I know this is from the docs:
WickedPdf.new.pdf_from_string(
  render_to_string('templates/pdf.html.erb', :layout => 'pdfs/layout_pdf'),
  :footer => {
    :content => render_to_string(:layout => 'pdfs/layout_pdf')
  }
)
And that code has never worked for me at all.
What I'm getting over and over is the response_body= error. It's like it's not getting a response at all.
At the top of my file I'm doing:
include ActionController::Rendering
Because this is the module that has the render_to_string method inside it.
Any help at all - please keep in mind in your response that I'm running this code on a Heroku WORKER dyno - so if there's any dependency that I need to manually include that is naturally included on the web server, please let me know.
I ended up having to do some weird stuff with this to finally get it working.
html = File.read(Rails.root.join('app', 'views', 'technician', 'invoice.html.erb'))
html = ERB.new(html).result(binding)
html = html.gsub!(/\0/, '') # There is a null byte in the rendered html, so we'll strip it out (this is kind of a hack)

# Render the PDF - we're on a worker dyno and have no access to the pdf we rendered already on the web dyno
@pdf = WickedPdf.new.pdf_from_string(
  html,
  :handlers => [:erb],
  :footer => {
    :center => "Copyright 2014"
  },
  :disable_smart_shrinking => true,
)
@pdf = @pdf.gsub!(/\0/, '') # Again with the null bytes!
Using Partials.
I know what you mean, it gets a little funky when you're rendering PDFs in the background job as opposed to a Controller action.
I thought I would share my implementation as a comparison and for others to get another example from.
notification_mailer.rb
def export
  header_html = render_to_string(partial: 'exports/header.pdf.erb',
                                 locals: { report_title: 'Emissions Export' })
  body_html = render_to_string(partial: "exports/show.pdf.erb")
  footer_html = render_to_string(partial: 'exports/footer.pdf.erb')

  @pdf = WickedPdf.new.pdf_from_string(
    body_html,
    orientation: 'Landscape',
    margin: { bottom: 20, top: 30 },
    header: { content: header_html },
    footer: { content: footer_html })

  # Attach to email as attachment.
  attachments["Emissions Export.pdf"] = @pdf

  # Send email. Attachment assigned above will automatically be included.
  mail({ subject: 'Emissions Export PDF', to: 'elon@musk.com' })
end

Wicked PDF, generating PDF from database table- images and style issues

I have an uploader (internal use only) that will upload an HTML document to a binary column of a table in my client-facing website. The client facing site has an index that allows the user to view the page as a normal website (using send_data h_t.html_code, :type => "html", :disposition => "inline"). I also want to give the user the ability to download a PDF of the page. For that I'm using wicked_pdf.
The entire problem seems to stem from the fact that the data is stored in the database. As strange as it sounds, it is vital to business operations that I get the formatting exact. The issue is that I can't see any images, and the stylesheets/style tags have no effect.
What I've tried-
Gsub-
def show
  html = HtmlTranscript.find(params[:id])
  html_code = html.html_code.gsub('<img src="/images/bwTranscriptLogo.gif" alt="Logo">', '<%= wicked_pdf_image_tag "bwTranscriptLogo.gif" %>')
  html_code = html_code.gsub('<link rel="StyleSheet" href="" type="text/css">', '<%= wicked_pdf_stylesheet_link_tag "transcripts.css" %>')
  transcript = WickedPdf.new.pdf_from_string(html_code)
  respond_to do |format|
    format.html do
      send_data transcript, :type => "pdf", :disposition => "attachment"
    end
    ##### i never could get this part figured out, so if you have a fix for this...
    # format.pdf do
    #   render :pdf => "transcript_for_#{@html.created_at}", :template => "html_transcripts/show.html.erb", :layout => false
    # end
  end
end
Using a template-
# Controller (above, modified)
html = HtmlTranscript.find(params[:id])
@html_code = html.html_code.gsub('<img src="/images/bwTranscriptLogo.gif" alt="Logo">', '<%= wicked_pdf_image_tag "bwTranscriptLogo.gif" %>')
@html_code = @html_code.gsub('<link rel="StyleSheet" href="" type="text/css">', '<%= wicked_pdf_stylesheet_link_tag "transcripts.css" %>')
transcript = WickedPdf.new.pdf_from_string(render_to_string(:template => "html_transcripts/show.html.erb", :layout => false))

# View
<!-- tried with stylesheet & image link tags, with wicked_pdf stylesheet & image link tags, with html style & img tags, etc -->
<%= raw(@html_code) %>
Both will generate a transcript, but neither has style OR images.
Creating an initializer-
module WickedPdfHelper
  def wicked_pdf_stylesheet_link_tag(*sources)
    sources.collect { |source|
      "<style type='text/css'>#{Rails.application.assets.find_asset("#{source}.css")}</style>"
    }.join("\n").gsub(/url\(['"](.+)['"]\)(.+)/, %[url("#{wicked_pdf_image_location("\\1")}")\\2]).html_safe
  end

  def wicked_pdf_image_tag(img, options = {})
    image_tag wicked_pdf_image_location(img), options
  end

  def wicked_pdf_image_location(img)
    "file://#{Rails.root.join('app', 'assets', 'images', img)}"
  end

  def wicked_pdf_javascript_src_tag(source)
    "<script type='text/javascript'>#{Rails.application.assets.find_asset("#{source}.js").body}</script>"
  end

  def wicked_pdf_javascript_include_tag(*sources)
    sources.collect { |source| wicked_pdf_javascript_src_tag(source) }.join("\n").html_safe
  end
end
That did absolutely nothing, and I have no idea what to try next.
As a side note, the code to view the HTML version of the transcript is as follows:
def transcript_data
  h_t = HtmlTranscript.find(params[:id])
  send_data h_t.html_code, :type => "html", :disposition => "inline"
end
It requires no view, as the HTML data is stored in the database, and I get images, styles, etc. Everything works with the HTML version - just not the PDF.
I'm on ruby 1.8.7 with rails 3.0.20.
Solved-
As it turns out, there was more than one issue at hand.
1- Installation of wkhtmltopdf for Ubuntu via $apt-get install does not quite do the trick for what I wanted...
see http://rubykitchen.in/blog/2013/03/17/pdf-generation-with-rails
(there may have also been an issue with having not previously run sudo apt-get install openssl build-essential xorg libssl-dev libxrender-dev, as when I did, it installed a number of components I did not previously have.)
2- The HTML files I had uploaded contained image & style code that was breaking the formatting. I fixed it with this...
def rm_by_line(which = 0, line1 = 0, line2 = 0)
  h_t = HtmlTranscript.find(which)
  line_by_line = h_t.html_code.split('
')
  for i in line1..line2
    line_by_line[i] = ''
  end
  line_by_line = line_by_line.join('
').strip
  return line_by_line
end
Then, all I had to do was pass which lines I wanted to remove.
(I had to split the quoted string with a literal carriage return because '\n' didn't function properly when calling 'raw' on the returned string.)
3- wicked_pdf_stylesheet_link_tag and wicked_pdf_image_tag were undefined. I had to inline the style formatting I wanted into a layout I created (it turns out wicked_pdf_stylesheet_link_tag used the asset pipeline, which my ruby/rails did not implement, which also means I had to get rid of the javascript helpers) and create a helper for wicked_pdf_image_tag, adding a switch in the layout for which image tag (image_tag or wicked_pdf_image_tag) to use.
4- I needed both a .html.erb & a .pdf.erb for my templates, so I made both.
5- Got rid of WickedPdf.new.pdf_from_string in favor of linking to either html or pdf by using :format => 'html' or :format => 'pdf' in the link_to tag.