Scrapy Selectors with Basic HTML Pages no Output - scrapy

I'm having a tough time getting basic (very basic) HTML pages to output anything with the Scrapy spiders I'm using. I'm hoping someone can put me on the right path.
Example of the HTML I'm trying to scrape:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<head>
<link rel="shortcut icon" href="../images/favicon.ico">
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<meta http-equiv="Content-Style-Type" content="text/css">
<link rel="stylesheet" href="../include/default.css" type="text/css">
<meta name="Author" content="Author">
<title>Article Title</title>
</head>
<body>
<h3>Month Day, Year</h3>
<hr size="1">
<h4>Article Title Here:</h4>
<p>paragraph 1, Sed ut perspiciatis unde omnis iste natus error sit voluptatem accusantium doloremque laudantium, totam rem aperiam, eaque ipsa quae ab illo inventore veritatis et quasi architecto beatae vitae dicta sunt explicabo</p>
<p>paragraph 2. Sed ut perspiciatis unde omnis iste natus error sit voluptatem accusantium doloremque laudantium.</p>
<p>paragraph 3, Sed ut perspiciatis unde omnis iste natus error sit voluptatem accusantium doloremque laudantium.</p>
<p>closing, Sed ut perspiciatis unde omnis iste natus </p>
<hr size="1">
</body>
</html>
I'm trying to scrape it with the following Scrapy spider:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from basic.items import BasicItem
class BasicSpider(CrawlSpider):
    name = "basiccrawl"
    allowed_domains = ["example.com"]
    start_urls = [
        "http://example.com/articles/",
    ]
    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        date = hxs.xpath('//h3')
        title = hxs.xpath('//h4')
        body = hxs.xpath('//p')
        yield item
I assume I'm oversimplifying the xpath rules?

If your spider is a subclass of CrawlSpider, it should not override the default parse callback, because CrawlSpider uses it internally. This is a bit confusing when you're getting started with Scrapy; it will probably be addressed in a future release.
In the code you posted, you are not using the CrawlSpider rules, so ask yourself whether you really need to inherit from CrawlSpider. You can go far inheriting only from scrapy.Spider.
The XPath expressions look fine, but the .xpath() method only returns selectors; you're missing the call to the .extract() method, which returns the matched nodes as a list of strings. Also, you probably don't need to instantiate the selector yourself: if you're using Scrapy 0.24+ you can simply do:
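def parse(self, response):
    # .extract() converts the selected nodes into a list of strings
    date = response.xpath('//h3').extract()
    title = response.xpath('//h4').extract()
    body = response.xpath('//p').extract()
    # assumes BasicItem declares date, title and body fields
    yield BasicItem(date=date, title=title, body=body)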
You may want to go through the Scrapy tutorial I've written, which tries to get you started pretty quickly: http://hopefulramble.blogspot.com/2014/08/web-scraping-with-scrapy-first-steps_30.html

You are almost on the right path. You need to call the extract() method, which returns a list of the matched element(s). If you are just beginning to learn Scrapy, a SlideShare deck I created about it may help you :D
http://www.slideshare.net/franciscoyes/scrapy-42681497

Related

How Print Vue Component W/ Styles

I'm trying to print a component using VueJS. All of the styles are in the same file, but the printed output doesn't pick up the CSS styling. I also use the Quasar framework; I don't know whether it affects the final result.
<div style="margin: 12px 12px">
<div class="central-layout">
<p>
<strong
><a href="https://www.nightprogrammer.com/" target="_blank"
>Nightprogrammer.com</a
></strong
>: Lorem ipsum dolor sit amet, consectetur adipiscing elit. Phasellus
posuere, tellus lobortis posuere tempor, elit sem varius libero, ac
vulputate ante orci vel odio. Nam sit amet tortor malesuada tellus
rutrum varius vel a mauris. Integer volutpat neque mauris, vel imperdiet
mi aliquet ac. Proin sed iaculis ipsum. Vivamus tincidunt egestas
sapien, vitae faucibus velit ultricies eget. Donec mattis ante arcu, a
pretium tortor scelerisque et. Morbi sed dui tempor, consectetur turpis
sed, tristique arcu.
</p>
</div>
</div>
<style scoped>
.central-layout{
flex-direction: column-reverse;
}
</style>
    exportToPDF() {
      const content = this.$refs.printInteraction.innerHTML
      let cssHtml = ''
      for (const node of [...document.querySelectorAll('style')]) {
        cssHtml += node.outerHTML
      }
      const winPrint = window.open('', '', 'left=0,top=0,width=800,height=900,toolbar=0,scrollbars=0,status=0')
      winPrint.document.write(`<!DOCTYPE html>
<html>
<head>
${cssHtml}
</head>
<body>
${content}
</body>
</html>`)
      winPrint.document.close()
      winPrint.focus()
      winPrint.print()
      winPrint.close()
    }
  }
}
Can anyone help me?
I need to print with the styling set in the page
Exporting that properly may be quite difficult because not everything is properly supported. I recommend using html2canvas: https://github.com/niklasvh/html2canvas
You could then convert the image into a PDF. Either way, this kind of work is quite heavy and is better handled by a backend: convert the file, host it on AWS or similar, and send it back via a callback.

Generating PDF in next.js

Hi, I am trying to generate a PDF in Next.js. I tried many libraries, such as react-pdf and jsPDF, but they are all client-side libraries that need window to do their work. Is there any solution for generating a PDF in Next.js? If so, please suggest one with a code example.
I recommend using the API routes of Next.js together with a Node.js PDF library. On the frontend, you request the API route at the correct path to get the PDF back and just render it.
next.js api routes
pdf-lib
Example:
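import { PDFDocument, StandardFonts, rgb } from 'pdf-lib'
export default async (req, res) => {
  const pdfDoc = await PDFDocument.create();
  // do pdf stuff
  const bytes = await pdfDoc.save(); // resolves to a Uint8Array
  res.setHeader('Content-Type', 'application/pdf');
  // wrap in a Buffer so the bytes are sent as-is rather than serialized as JSON
  res.send(Buffer.from(bytes));
};
export const config = {
  api: { bodyParser: false, },
};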
Using jsPDF, I found this tutorial with Next.js.
pages/index.tsx:
import * as React from "react";
import Image from "next/image";
import dynamic from "next/dynamic";
const GeneratePDF = dynamic(()=>import("./../src/components/GeneratePDF"),{ssr:false});
const app =()=>{
  const ref = React.useRef();
  return(<div className="main">
    <div className="content" ref={ref}>
      <h1>Hello PDF</h1>
      <img id="image" src="/images/image_header.jpg" width="300" height="200"/>
      <p id="text">
        Lorem ipsum dolor sit, amet consectetur adipisicing elit. Quisquam animi, molestiae quaerat assumenda neque culpa ab aliquam facilis eos nesciunt! Voluptatibus eligendi vero amet dolorem omnis provident beatae nihil earum!
        Lorem, ipsum dolor sit amet consectetur adipisicing elit. Ea, est. Magni animi fugit voluptates mollitia officia libero in. Voluptatibus nisi assumenda accusamus deserunt sunt quidem in, ab perspiciatis ad rem.
        Lorem ipsum dolor sit amet consectetur adipisicing elit. Nihil accusantium reprehenderit, quasi dolorum deserunt, nisi dolores quae officiis odio vel natus! Pariatur enim culpa velit consequatur sapiente natus dicta alias!
        Lorem ipsum dolor sit amet consectetur adipisicing elit. Consequatur, asperiores error laudantium corporis sunt earum incidunt expedita quo quidem delectus fugiat facilis quia impedit sit magni quibusdam ipsam reiciendis quaerat!
      </p>
    </div>
    <GeneratePDF html={ref}/>
  </div>);
}
export default app;

Facebook Debugger: Change Canonical URL value after Reverse Proxy Rewrite

I've created a simple app that server-renders some basic SPA content based on the user agent.
For example, if an AngularJS website link is shared on Facebook, I have an Apache rewrite rule that proxies that link to the rendering app. The rendering app then checks the URL that was passed as a query parameter and returns the specified rendered content.
Everything works as expected, but there's a problem with the rendered result: the canonical link shown in the Facebook post is the rendering app's link.
Here's what's happening:
Shared Link: www.example.com/the-shared-link
Facebook's post result:
Instead of displaying the shared link (www.example.com/the-shared-link), the post shows the rendering app's domain (rendering.app.com). But if I click on the Facebook post, it opens the correct website page.
Facebook Debugger result:
All the needed meta tags are added to the rendered result page:
<!-- Schema.org markup for Google+ -->
<meta itemprop="name" content="Lorem ipsum dolor sit amet, consectetur adipiscing elit." />
<meta itemprop="description" content="Sed ut perspiciatis unde omnis iste natus error sit voluptatem accusantium doloremque laudantium, totam rem aperiam, eaque ipsa quae ab illo inventore veritatis et quasi architecto beatae vitae dicta sunt explicabo." />
<meta itemprop="image" content="http://www.example.com/some-image.jpg" />
<!-- Twiter Cards -->
<meta name="twitter:card" content="summary" />
<meta name="twitter:site" content="Lorem ipsum dolor sit amet, consectetur adipiscing elit." />
<meta name="twitter:title" content="Lorem ipsum dolor sit amet, consectetur adipiscing elit." />
<meta name="twitter:description" content="Sed ut perspiciatis unde omnis iste natus error sit voluptatem accusantium doloremque laudantium, totam rem aperiam, eaque ipsa quae ab illo inventore veritatis et quasi architecto beatae vitae dicta sunt explicabo." />
<meta name="twitter:image:src" content="http://www.example.com/some-image.jpg" />
<!--/ Twiter Cards -->
<!-- Open Graph -->
<meta property="og:site_name" content="Lorem ipsum dolor sit amet, consectetur adipiscing elit." />
<meta property="og:type" content="website" />
<meta property="og:title" content="Lorem ipsum dolor sit amet, consectetur adipiscing elit." />
<meta property="og:url" content="http://www.example.com/the-shared-link" />
<meta property="og:description" content="Sed ut perspiciatis unde omnis iste natus error sit voluptatem accusantium doloremque laudantium, totam rem aperiam, eaque ipsa quae ab illo inventore veritatis et quasi architecto beatae vitae dicta sunt explicabo." />
<meta property="og:image" content="http://www.example.com/some-image.jpg" />
<meta property="og:image:width" content="500" />
<meta property="og:image:height" content="375" />
<!--/ Open Graph -->
The Apache htaccess rewrite rule:
RewriteCond %{HTTP_USER_AGENT} (facebookexternalhit/[0-9]|Facebot|Twitterbot/[0-9]|Pinterest|Pinterestbot|LinkedInBot/[0-9])
RewriteRule ^(.*)$ http://rendering.app.com/?url=%{REQUEST_URI} [P,L]
What am I doing wrong? How can I change the canonical URL to the original shared link?
Solved my issue!
The rendering.app.com domain had a rewrite rule to force HTTPS. This caused a 301 HTTP redirect (just as the Facebook Debugger showed).
Using https://rendering.app.com as the proxy target solved my issue. Another way of avoiding the 301 HTTP redirect would be to remove the HTTPS rewrite rule on the target domain.
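One way to confirm this kind of problem outside the Facebook Debugger is to request the proxy target once, without following redirects, and look at the status code and Location header. The following is a minimal sketch using only the Python standard library. The first_hop helper name and the default user agent string are illustrative assumptions, not something from the original setup.

```python
import http.client
from urllib.parse import urlsplit


def first_hop(url, user_agent="facebookexternalhit/1.1"):
    """Send one GET request and return (status, Location header).

    http.client never follows redirects on its own, so a 301 from a
    "force https" rule shows up directly in the returned status.
    """
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.netloc, timeout=10)
    else:
        conn = http.client.HTTPConnection(parts.netloc, timeout=10)
    try:
        # Rebuild the request target from the path and query string.
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        conn.request("GET", path, headers={"User-Agent": user_agent})
        resp = conn.getresponse()
        return resp.status, resp.getheader("Location")
    finally:
        conn.close()


# Example call (performs a real network request, so it is left commented out):
# print(first_hop("http://rendering.app.com/?url=/the-shared-link"))
```

A 3xx status for the http:// target alongside a 200 for the https:// target would match the behaviour described in this answer.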

@media queries being ignored when viewing HTML emails via iOS Mail

I am dealing with a very strange problem. I have tested on an iPhone 6s and an iPhone 6.
1) The email is sent to a non-Gmail account that is configured on the iPhone.
2) The HTML message is viewed in iOS Mail.
3) Here is where it gets weird; I will try to describe it as best I can:
a) If I open the email directly by tapping it in the list of emails, the media query is not respected.
b) If I open the message, tap the down arrow to view the previous message, and then tap the up arrow to return to the original message, the media query is respected.
4) I have tried both iCloud and Yahoo accounts and two different iPhones (6 and 6s).
I have reduced it to this simple example.
Of course, all the simple emulators work as you would expect and do not exhibit the problem.
<!DOCTYPE html>
<html lang="en">
<head>
<title>this is a test</title>
<meta charset="UTF-8">
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<style>
/* ----------- iPhone 5 and 5S ----------- */
/* Portrait */
@media only screen and (min-device-width: 320px) and (max-device-width: 568px) and (-webkit-min-device-pixel-ratio: 2) and (orientation: portrait) {
#main-wrapper{
max-width: 320px;
margin: 2px auto;
background-color: red;
}
}
</style>
</head>
<body>
<div id="main-wrapper" style=" background-color: #ffffff;">
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Duis in ante velit. Vestibulum ante ipsum primis in faucibus orci luctus et ultrices posuere cubilia Curae; Maecenas imperdiet erat metus, sed maximus tortor dignissim vel. Fusce luctus eget turpis a pretium. Nunc sagittis vulputate risus et porta. Cras eros nisl, placerat id ultricies sit amet, eleifend vel augue. Nullam dignissim sodales rhoncus. Morbi hendrerit aliquam tortor.
Vestibulum ante ipsum primis in faucibus orci luctus et ultrices posuere cubilia Curae; Nulla tellus massa, accumsan ac ex a, congue lobortis ipsum. Sed vitae ultrices purus. Nam vulputate lacus vitae massa laoreet scelerisque. Duis in risus non elit sodales pharetra. Nunc ultrices nisl quis leo mollis, sed consectetur tortor placerat. Fusce ultricies eleifend nisi, in congue metus iaculis ut.
</div>
</body>
</html>
Assuming that by the default iOS email reader you mean iOS Mail, media queries should be supported. Two things:
You shouldn't need an initial-scale attribute in your viewport tag. Have you tried <meta name="viewport" content="width=device-width">?
Depending on how specific an environment you want this CSS to target, you might not need such a loaded @media rule either. Have you tried something like @media screen and (max-device-width: 568px)?

playing a local video in a uiwebview

i load a local html file into an ipad application:
NSURL *baseURL = [NSURL fileURLWithPath:[[NSBundle mainBundle] bundlePath]];
NSString *path = [[NSBundle mainBundle] pathForResource:@"lieferant1" ofType:@"html"];
NSString *content = [NSString stringWithContentsOfFile:path encoding:NSUTF8StringEncoding error:nil];
[webView loadHTMLString:content baseURL:baseURL];
the webpage gets displayed, content of my html file:
<!DOCTYPE html>
<html>
<head>
<title></title>
</head>
<body>
<p>Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet. Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet. Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet.</p>
<p>
<video>
<source src="banane.m4v">
</video>
</p>
So banane.m4v is in my project root; it is only organized into Xcode groups, not real directories.
But the video area in my web view stays black and no video is loaded. I uploaded my HTML file and my .m4v video to my web server to check whether the video was encoded improperly, but everything works fine when I browse to the HTML file on my iPad.
Any ideas what's wrong?
oh my god
<video controls>
<source src="banane.m4v">
</video>
where controls is the magic word.
:X
I may be wrong (or silly), but did you try file:///banane.m4v? (Add the extra '/' for root.)
For everyone who gets a crossed-out play symbol:
When you drag your movie files (MP4s work as well, just make sure the codec is okay; the easiest way is to export for iPhone with QuickTime) into the project, a dialog appears.
Make sure you tick your app under "add to targets".
Or, if you have already copied your video into the project, click it in Xcode's file browser and you'll have an option on the right-hand side to select targets.
You need to set the allowsInlineMediaPlayback property on the UIWebView:
[webView setAllowsInlineMediaPlayback:YES];