Scrape text from html element by class that's inside a list element - beautifulsoup

I'm trying to scrape the headline from the first h4 element (class "itemtitle") inside the list item with the class "regularitem".
The output should look like "It took months to hit 3 million reported cases..."
I keep getting list index out of range.
headers = {
    'Access-Control-Allow-Origin': '*',
    'Access-Control-Allow-Methods': 'GET',
    'Access-Control-Allow-Headers': 'Content-Type',
    'Access-Control-Max-Age': '3600',
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0'
}

URL = 'http://rss.cnn.com/rss/cnn_topstories.rss'
req = requests.get(URL, headers)
soup = BeautifulSoup(req.content, 'html.parser')
headline = soup.findAll('h4', attrs={'class': 'itemtitle'})[0]
print(headline.get_text)
The page html looks like this:
<li xmlns:dc="http://purl.org/dc/elements/1.1/" class="regularitem">
<h4 class="itemtitle">It took months to hit 3 million reported cases. Now nearly two weeks later, US is on the verge of 4 million.</h4>
<h5 class="itemposttime">
<span>Posted:</span>Wed, 22 Jul 2020 14:52:09 GMT</h5>
<div class="itemcontent" name="decodeable">Tracking US cases | Podcast | Those you've lost<div class="feedflare">
<img src="http://feeds.feedburner.com/~ff/rss/cnn_topstories?d=yIl2AUoC8zA" border="0"> <img src="http://feeds.feedburner.com/~ff/rss/cnn_topstories?d=7Q72WNTAKBA" border="0"> <img src="http://feeds.feedburner.com/~ff/rss/cnn_topstories?i=3M8R-V8mvn8:YP_46RpyuXw:V_sGLiPBpWU" border="0"> <img src="http://feeds.feedburner.com/~ff/rss/cnn_topstories?d=qj6IDK7rITs" border="0"> <img src="http://feeds.feedburner.com/~ff/rss/cnn_topstories?i=3M8R-V8mvn8:YP_46RpyuXw:gIN9vFwOqvQ" border="0">
</div><img src="http://feeds.feedburner.com/~r/rss/cnn_topstories/~4/3M8R-V8mvn8" height="1" width="1" alt=""></div>
</li>
<li xmlns:dc="http://purl.org/dc/elements/1.1/" class="regularitem">
<h4 class="itemtitle">'We are going as quickly as we possibly can' on vaccine development, Fauci says</h4>
<h5 class="itemposttime"></h5>
<div class="itemcontent" name="decodeable"><div class="feedflare">
<img src="http://feeds.feedburner.com/~ff/rss/cnn_topstories?d=yIl2AUoC8zA" border="0"> <img src="http://feeds.feedburner.com/~ff/rss/cnn_topstories?d=7Q72WNTAKBA" border="0"> <img src="http://feeds.feedburner.com/~ff/rss/cnn_topstories?i=e93U4xIXAew:NWKJndB2i28:V_sGLiPBpWU" border="0"> <img src="http://feeds.feedburner.com/~ff/rss/cnn_topstories?d=qj6IDK7rITs" border="0"> <img src="http://feeds.feedburner.com/~ff/rss/cnn_topstories?i=e93U4xIXAew:NWKJndB2i28:gIN9vFwOqvQ" border="0">
</div><img src="http://feeds.feedburner.com/~r/rss/cnn_topstories/~4/e93U4xIXAew" height="1" width="1" alt=""></div>
</li>
I've tried removing the list index and changing to soup.find (see the example below), but when I do that I get: 'NoneType' object has no attribute 'get_text'.
headers = {
    'Access-Control-Allow-Origin': '*',
    'Access-Control-Allow-Methods': 'GET',
    'Access-Control-Allow-Headers': 'Content-Type',
    'Access-Control-Max-Age': '3600',
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0'
}

URL = 'http://rss.cnn.com/rss/cnn_topstories.rss'
req = requests.get(URL, headers)
soup = BeautifulSoup(req.content, 'html.parser')
headline = soup.find('h4', attrs={'class': 'itemtitle'})
print(headline.get_text)

The request you are making is not pulling the html you see in the browser, because that html is generated dynamically when the browser renders the page. The solution is to use selenium, since selenium drives a real browser and will load the dynamic content.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup

# these settings may differ based on where you installed your chrome webdriver;
# alternatively you can use the firefox webdriver - there are plenty of
# tutorials online that show you how to install those
chrome_options = Options()
chrome_options.add_argument('--headless')  # or '--start-maximized' if you want to see the window open
driver = webdriver.Chrome(options=chrome_options)

# no requests-style headers dict is needed here: selenium drives the browser itself
URL = 'http://rss.cnn.com/rss/cnn_topstories.rss'
driver.get(URL)

soup = BeautifulSoup(driver.page_source, 'html.parser')
headlines = soup.find_all('h4', {'class': 'itemtitle'})
for headline in headlines:
    headline_text = headline.find('a').text
    print(headline_text)
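As an aside, the raw response from that URL is RSS XML, so if you only need the headlines you may be able to skip the browser entirely and parse the feed directly. A minimal sketch, assuming the feed keeps the standard <item><title> structure (the 'xml' parser requires lxml, so run pip install lxml first):
import requests
from bs4 import BeautifulSoup

URL = 'http://rss.cnn.com/rss/cnn_topstories.rss'
req = requests.get(URL)

# parse as XML rather than HTML so the <item> and <title> tags are kept intact
soup = BeautifulSoup(req.content, 'xml')
first_item = soup.find('item')
if first_item is not None:
    print(first_item.find('title').get_text())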
Let me know if this was helpful

Related

How to fix URL query params not working via the web share target API in a vuejs pwa?

I'm building a new PWA in VueJS and have registered the app as a Share Target in my manifest.json (https://developers.google.com/web/updates/2018/12/web-share-target). Now my code works if I put the query params directly in the URL via the browser address bar (e.g. "/#/page-edit?url=https://google.com/&title=Google&description=SearchEngine"), but it doesn't work if I sent it via the Web Share Target API.
I have already tried a range of different manifest settings (e.g., both method "GET" and "POST"), but I'm not sure whether my manifest settings are wrong or my code is.
Current Manifest:
{
  "name": "...",
  "short_name": "...",
  "icons": [],
  "start_url": "/",
  "display": "standalone",
  "orientation": "portrait",
  "background_color": "...",
  "theme_color": "...",
  "share_target": {
    "action": "/#/page-edit",
    "method": "GET",
    "enctype": "application/x-www-form-urlencoded",
    "params": {
      "title": "title",
      "text": "description",
      "url": "url"
    }
  }
}
Current Vue view:
I have removed most of the unimportant code. As you can see, I load the query data in two ways at the moment:
1. As data defaults, e.g., 'url': this.$route.query.url || null
2. As a variable in a <p>, e.g. {{ this.$route.query.url }}
<template>
  <form class="modal-view">
    <div class="field">
      <label for="url" class="label">URL / link</label>
      <div class="control">
        <input id="url" v-model="url" class="input" type="url" placeholder="https://..." >
      </div>
      <p><strong>url query:</strong> {{ this.$route.query.url }}</p>
    </div>
    <div class="field">
      <label for="title" class="label">Title</label>
      <div class="control">
        <input id="title" v-model="title" class="input" type="text" placeholder="The greatest article" >
      </div>
      <p><strong>title query:</strong> {{ this.$route.query.title }}</p>
    </div>
    <div class="field">
      <label for="description" class="label">Description</label>
      <div class="control">
        <input id="description" v-model="description" class="input" type="text" placeholder="The greatest article" >
      </div>
      <p><strong>description query:</strong> {{ this.$route.query.description }}</p>
    </div>
    <hr class="is-small has-no-line">
    <div class="field is-grouped is-grouped-right">
      <div class="control">
        <button @click.prevent="createPage" class="button is-primary is-fullwidth is-family-secondary">Submit</button>
      </div>
    </div>
  </form>
</template>
<script>
import ...
export default {
  name: 'page-edit',
  computed: {},
  data () {
    return {
      // Initialize default form values
      'url': this.$route.query.url || null,
      'title': this.$route.query.title || null,
      'description': this.$route.query.description || null
    }
  },
  mounted () {},
  methods: {
    createPage () {}
  }
}
</script>
So what I would expect is that the query params can also be read if shared via the Web Share Target API, but at this point, it doesn't show anything this way. But good to mention again, it does all work if I simply change the query params in the browser address bar (that's also why I'm confused).
Edits
Edit 1
Have been playing around a bit more, and found out that if I use window.location.href it shows the following:
https://appurl.com/?title=xxx&description=xxx#/page-edit
I.e. it puts the query params in the wrong position?
Edit 2
Might be related to this Vue Router issue: Hash mode places # at incorrect location in URL if current query parameters exist on page load
Edit 3
Somehow fixed it with (I think) switching the router to history mode:
const router = new Router({
  mode: 'history'
})
And removing the # from the action in share_target.

To fix it I have done the following things:
Added the following in the router, which resulted in removing the # from all URLs:
const router = new Router({
  mode: 'history'
})
Removed the # from the share_target.action in the manifest.
Somehow that fixed it all!

Scraping HTML inside JSON with Scrapy

I'm requesting a website whose response is a JSON like this:
{
    "success": true,
    "response": "<html>... html goes here ...</html>"
}
I've seen ways to scrape either HTML or JSON, but haven't found how to scrape HTML inside a JSON response. Is it possible to do this using scrapy?
One way is to build a scrapy.Selector out of the HTML inside the JSON data.
I'll assume you have the Response object with JSON data in it, available through response.text.
(Below, I'm building a test response to play with; I'm using scrapy 1.1 with Python 3):
response = scrapy.http.TextResponse(url='http://www.example.com/json', body=r'''
{
"success": true,
"response": "<html>\n <head>\n <base href='http://example.com/' />\n <title>Example website</title>\n </head>\n <body>\n <div id='images'>\n <a href='image1.html'>Name: My image 1 <br /><img src='image1_thumb.jpg' /></a>\n <a href='image2.html'>Name: My image 2 <br /><img src='image2_thumb.jpg' /></a>\n <a href='image3.html'>Name: My image 3 <br /><img src='image3_thumb.jpg' /></a>\n <a href='image4.html'>Name: My image 4 <br /><img src='image4_thumb.jpg' /></a>\n <a href='image5.html'>Name: My image 5 <br /><img src='image5_thumb.jpg' /></a>\n </div>\n </body>\n</html>"
}
''', encoding='utf8')
Using the json module you can get the HTML data like this:
import json
data = json.loads(response.text)
You get something like:
>>> data
{'success': True, 'response': "<html>\n <head>\n <base href='http://example.com/' />\n <title>Example website</title>\n </head>\n <body>\n <div id='images'>\n <a href='image1.html'>Name: My image 1 <br /><img src='image1_thumb.jpg' /></a>\n <a href='image2.html'>Name: My image 2 <br /><img src='image2_thumb.jpg' /></a>\n <a href='image3.html'>Name: My image 3 <br /><img src='image3_thumb.jpg' /></a>\n <a href='image4.html'>Name: My image 4 <br /><img src='image4_thumb.jpg' /></a>\n <a href='image5.html'>Name: My image 5 <br /><img src='image5_thumb.jpg' /></a>\n </div>\n </body>\n</html>"}
Then you can build a new selector like this:
selector = scrapy.Selector(text=data['response'], type="html")
after which you can use XPath or CSS selectors on it:
>>> selector.xpath('//title/text()').extract()
['Example website']
Well, there's another way, with which you do not need to construct a response object at all. You can use lxml to parse your html text. You don't need to install any new lib, since Scrapy's Selector is itself based on lxml. Just add the code below to import the lxml lib.
from lxml import etree
Here is an example, assuming that the json response is:
{
    "success": true,
    "htmlinjson": "<html><body> <p id='p1'>p111111</p> <p id='p2'>p22222</p> </html>"
}
Extract the html text from the json response by:
import json
htmlText = json.loads(response.text)['htmlinjson']
Then construct an lxml xpath selector using:
from lxml import etree
resultPage = etree.HTML(htmlText)
Now use the lxml selector to extract the text of the node with id="p1", using xpath just like a scrapy xpath selector does:
print(resultPage.xpath('//p[@id="p1"]')[0].text)
You will get:
p111111
Hope that helps :)
You can try json.loads(initial_response), so you get a dict and can use its keys, like ['response'].
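Putting the pieces together, a complete spider using the Selector approach might look like the sketch below. The endpoint URL and the 'response' key are just the ones from the example above; extract_first() is the scrapy 1.x API (newer versions also offer get()):
import json
import scrapy

class HtmlInJsonSpider(scrapy.Spider):
    name = 'html_in_json'
    start_urls = ['http://www.example.com/json']

    def parse(self, response):
        # decode the JSON wrapper first, then wrap the embedded HTML in a Selector
        data = json.loads(response.text)
        selector = scrapy.Selector(text=data['response'], type='html')
        yield {'title': selector.xpath('//title/text()').extract_first()}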

Jsoup: login without id or name for submit button

I've tried to log into a website using Jsoup, but unfortunately I've run into some problems - I don't know how I should handle the submit button, because there's no id or name for it. Could you take a look at how my code should look?
<form action="http://www.abcde.com/index.php?app=core&module=global&section=login&do=process" method="post" id="login">
<input type="hidden" name="auth_key" value="auth_key">
<input type="hidden" name="referer" value="http://www.abcde.com/">
<h3>Login</h3>
<div class="ipsForm ipsForm_horizontal">
<fieldset>
<ul>
<li class="ipsField ipsField_primary">
<label for="ips_username" class="ipsField_title">Username</label>
<div class="ipsField_content">
<input id="ips_username" type="text" class="input_text" name="ips_username" size="30" tabindex="0">
</div>
</li>
<li class="ipsField ipsField_primary">
<label for="ips_password" class="ipsField_title">Password</label>
<div class="ipsField_content">
<input id="ips_password" type="password" class="input_text" name="ips_password" size="30" tabindex="0"><br>
</div>
</li>
</ul>
</fieldset>
<div class="ipsForm_submit ipsForm_center">
<input type="submit" class="ipsButton" value="Login" tabindex="0">
</div>
</div>
</form>
I've started with:
Connection.Response loginForm = Jsoup.connect("http://www.abcde.com/")
        .method(Connection.Method.GET)
        .execute();
Document document = Jsoup.connect("http://www.abcde.com/")
        .data("cookieexists", "false")
        .data("ips_username", "username", "ips_password", "password")
        .cookies(loginForm.cookies())
        .post();
First you must send a GET request to the server and keep the response, since you need its cookies later -
Connection.Response res = Jsoup.connect("http://www.forumowisko.pl/")
        .userAgent("Mozilla/5.0 (Windows NT 6.1; WOW64; rv:45.0) Gecko/20100101 Firefox/45.0")
        .method(Connection.Method.GET)
        .execute();
Document doc1 = res.parse();
Then you extract the auth_key value (note the hidden input has a name attribute, not an id) -
Element e = doc1.select("input[name=auth_key]").first();
String authKey = e.attr("value");
And now you can send the POST request -
Document doc2 = Jsoup.connect("http://www.forumowisko.pl/index.php?app=core&module=global&section=login&do=process")
        .userAgent("Mozilla/5.0 (Windows NT 6.1; WOW64; rv:45.0) Gecko/20100101 Firefox/45.0")
        .data("auth_key", authKey)
        .data("ips_username", "MyUsername")
        .data("ips_password", "MyPassword")
        .data("rememberMe", "1")
        .data("referer", "http://www.forumowisko.pl/")
        .cookies(res.cookies())
        .post();
Notice that the POST request has a different URL from the GET request.

OpenERP 7 error loading in browser

I installed OpenERP 7 on CentOS 6.4 (Final) with Python 2.6.6, on a machine with a static IP.
When I access it from an outside network over the internet it works well, but when I access it from within the local network it throws an error:
Uncaught Error: QWeb2: Template 'WebClient' not found

I now use Odoo 9.0 on Ubuntu 15.10 with Python 2.7.10.
I also met this issue, and resolved it later.
Use
ajax.loadXML('/web_editor/static/src/xml/snippets.xml', qweb);
to load the xml file. snippets.xml may look like this:
<?xml version="1.0" encoding="utf-8"?>
<templates id="template" xml:space="preserve">
    <t t-name="web_editor.snippets">
        <div id='oe_snippets' class="hidden-xs o_open">
            <div id="o_arrow">
                <span class="fa fa-angle-double-right fa-1x" style="top: 10px;"></span>
                <div>Insert Blocks</div>
                <span class="fa fa-angle-double-right fa-1x" style="bottom: 10px;"></span>
            </div>
            <div id="o_left_bar">
                <span class='snippets_loading'>Snippets are loading...</span>
            </div>
        </div>
    </t>
</templates>
var ajax = require('web.ajax');
var core = require('web.core');
var Widget = require('web.Widget');
var qweb = core.qweb;

ajax.loadXML('/web_editor/static/src/xml/snippets.xml', qweb);

var BuildingBlock = Widget.extend({
    template: 'web_editor.snippets',
    init: function (parent, $editable) {
        this._super.apply(this, arguments);
    },
    start: function() {....
    }
});
And now, no more issue.
Good luck~

Using swfupload with the Playframework and a Mac

I am trying to implement SWFUpload using the Play! framework and a Mac.
When using a Mac I get a 302 error (Upload Error: 302). This seems, I think, to come from the redirect that happens when getting the upload page of the Play! framework (as it is translated from the routes file?).
It works fine on IE on Windows.
I have searched a lot and read a lot, but haven't found whether there is a particular solution. Are there any suggestions on how to implement this, or for another simple-to-implement file uploader (for big files, with progress)?
EDIT:
I have tried both Safari and Firefox on my MacBook and both return the 302 upload error. I tried again on Windows with IE and it works fine.
The html (template) code:
#{extends 'main.html' /}
#{set title: 'Upload Media File' /}
<script type="text/javascript" src="/public/swfupload/swfupload.js"></script>
<script type="text/javascript" src="/public/swfupload/swfupload.queue.js"></script>
<script type="text/javascript" src="/public/swfupload/fileprogress.js"></script>
<script type="text/javascript" src="/public/swfupload/handlers.js"></script>
<script type="text/javascript">
var swfu;

window.onload = function() {
    var settings = {
        flash_url: "/public/swfupload/swfupload.swf",
        flash9_url: "/public/swfupload/swfupload_fp9.swf",
        upload_url: "doUpload",
        file_post_name: "data",
        post_params: {"article.id": "${article?.id}"},
        file_size_limit: "1000 MB",
        file_types: "*.*",
        file_types_description: "All Files",
        file_upload_limit: 1,
        file_queue_limit: 0,
        custom_settings: {
            progressTarget: "fsUploadProgress",
            cancelButtonId: "btnCancel"
        },
        debug: false,

        // Button settings
        button_image_url: "/public/swfupload/TestImageNoText_65x29.png",
        button_width: "65",
        button_height: "29",
        button_placeholder_id: "spanButtonPlaceHolder",
        button_text: '<span>Start</span>',
        button_text_style: ".theFont { font-size: 16; }",
        button_text_left_padding: 12,
        button_text_top_padding: 3,

        // The event handler functions are defined in handlers.js
        swfupload_preload_handler: preLoad,
        swfupload_load_failed_handler: loadFailed,
        file_queued_handler: fileQueued,
        file_queue_error_handler: fileQueueError,
        file_dialog_complete_handler: fileDialogComplete,
        upload_start_handler: uploadStart,
        upload_progress_handler: uploadProgress,
        upload_error_handler: uploadError,
        upload_success_handler: uploadSuccess,
        upload_complete_handler: uploadComplete,
        queue_complete_handler: queueComplete // Queue plugin event
    };
    swfu = new SWFUpload(settings);
};
</script>
<h2 class="title">#{get 'title' /}</h2>
<div style="clear: both;"> </div>
#{form @index(), id:'uploadForm', enctype:'multipart/form-data'}
<div class="entity">
<p>This page demonstrates a simple usage of SWFUpload. It uses the Queue Plugin to simplify uploading or cancelling all queued files.</p>
<div class="fieldset flash" id="fsUploadProgress">
<span class="legend">Upload Queue</span>
</div>
<div id="divStatus">0 Files Uploaded</div>
<div>
<span id="spanButtonPlaceHolder"></span>
<input id="btnCancel" type="button" value="Cancel All Uploads" onclick="swfu.cancelQueue();" disabled="disabled" style="margin-left: 2px; font-size: 8pt; height: 29px;" />
</div>
</div>
#{/form}
The entries in the routes file (Articles.upload is the screen; doUpload is called by swfupload):
GET /admin/articles/{id}/upload Articles.upload
POST /admin/articles/doUpload Articles.doUpload
EDIT 2:
I also tried Uploadify, which returns the same error. Is there anybody who knows a workaround for the Play! framework, or an uploader that does work with Play!?
If you are in /admin/articles/{id}/upload and call doUpload without any /path/to/something, you're in the wrong "directory": the relative upload_url resolves to /admin/articles/{id}/doUpload, which does not match the POST route above, so setting upload_url to the absolute path "/admin/articles/doUpload" should help. Might it be that you test Windows locally, but MacOS remotely?
In any case, the problem most probably has nothing to do with the OS, but with something else.
There is an HTML5 uploader at http://www.uploadify.com which can be downloaded for $5, and it has worked perfectly for me so far.