I'm requesting a page whose response is JSON like this:
{
"success": true,
"response": "<html>... html goes here ...</html>"
}
I've seen how to scrape HTML and how to scrape JSON, but I haven't found how to scrape HTML embedded inside JSON. Is it possible to do this using Scrapy?
One way is to build a scrapy.Selector out of the HTML inside the JSON data.
I'll assume you have the Response object with JSON data in it, available through response.text.
(Below, I'm building a test response to play with (I'm using scrapy 1.1 with Python 3):
import scrapy

response = scrapy.http.TextResponse(url='http://www.example.com/json', body=r'''
{
"success": true,
"response": "<html>\n <head>\n <base href='http://example.com/' />\n <title>Example website</title>\n </head>\n <body>\n <div id='images'>\n <a href='image1.html'>Name: My image 1 <br /><img src='image1_thumb.jpg' /></a>\n <a href='image2.html'>Name: My image 2 <br /><img src='image2_thumb.jpg' /></a>\n <a href='image3.html'>Name: My image 3 <br /><img src='image3_thumb.jpg' /></a>\n <a href='image4.html'>Name: My image 4 <br /><img src='image4_thumb.jpg' /></a>\n <a href='image5.html'>Name: My image 5 <br /><img src='image5_thumb.jpg' /></a>\n </div>\n </body>\n</html>"
}
''', encoding='utf8')
)
Using the json module, you can get the HTML data like this:
import json
data = json.loads(response.text)
You get something like:
>>> data
{'success': True, 'response': "<html>\n <head>\n <base href='http://example.com/' />\n <title>Example website</title>\n </head>\n <body>\n <div id='images'>\n <a href='image1.html'>Name: My image 1 <br /><img src='image1_thumb.jpg' /></a>\n <a href='image2.html'>Name: My image 2 <br /><img src='image2_thumb.jpg' /></a>\n <a href='image3.html'>Name: My image 3 <br /><img src='image3_thumb.jpg' /></a>\n <a href='image4.html'>Name: My image 4 <br /><img src='image4_thumb.jpg' /></a>\n <a href='image5.html'>Name: My image 5 <br /><img src='image5_thumb.jpg' /></a>\n </div>\n </body>\n</html>"}
Then you can build a new selector like this:
selector = scrapy.Selector(text=data['response'], type="html")
after which you can use XPath or CSS selectors on it:
>>> selector.xpath('//title/text()').extract()
['Example website']
There's another way that does not require constructing a response object: you can parse the HTML text directly with lxml. You don't need to install any new library, since Scrapy's Selector is based on lxml. Just add the line below to import it.
from lxml import etree
Here is an example, assuming that the JSON response is:
{
"success": true,
"htmlinjson": "<html><body> <p id='p1'>p111111</p> <p id='p2'>p22222</p> </html>"
}
Extract the HTML text from the JSON response with:
import json
htmlText = json.loads(response.text)['htmlinjson']
Then construct an lxml XPath selector using:
from lxml import etree
resultPage = etree.HTML(htmlText)
Now use the lxml tree to extract the text of the node with id="p1", using XPath just as you would with a Scrapy XPath selector:
print(resultPage.xpath('//p[@id="p1"]')[0].text)
You will get:
p111111
Hope that helps :)
You can try json.loads(initial_response), which gives you a dict whose keys you can use, such as ['response'].
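To make the dict access concrete, here is a minimal sketch that uses only the standard library instead of Scrapy. The raw string stands in for response.text, and TitleParser is a hypothetical helper written for this illustration; in a real spider you would more likely hand the extracted HTML to a Scrapy Selector.

```python
import json
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Hypothetical helper: collects the text inside <title>."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data


# Stand-in for response.text (assumed payload, shaped like the one in the question).
raw = (
    '{"success": true, "response": '
    '"<html><head><title>Example website</title></head><body></body></html>"}'
)

data = json.loads(raw)          # JSON text -> dict
html_text = data["response"]    # the embedded HTML is an ordinary string

parser = TitleParser()
parser.feed(html_text)
parser.close()
print(parser.title)
```

The point is only that, once json.loads has run, the HTML is a plain string that any HTML parser can consume.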
I have a question related to Selenium.
I want to get the text "DISMISSED" from a webpage. However, I tried the following code and it either doesn't work or cannot locate the element.
text7 = driver.find_element_by_xpath("//span/@class='icon-check']").text
or
text7 = driver.find_element_by_xpath("//div[strong[text()='Case Status']]").text
Here is the html code:
<p>
<strong>Case Status: </strong>
<span class="icon-check" aria-hidden="true"></span>
DISMISSED
<span> </span>
The text you want is inside the <p></p> tag, not inside the span, so select the span's parent:
//span[@class='icon-check']/parent::p
With this simple XPath you get the <p> element, whose text contains DISMISSED.
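Why the parent is the right target can be seen without a browser. The sketch below parses the fragment from the question with the standard library's ElementTree (an assumption made for illustration; Selenium itself is not involved) and shows that DISMISSED is stored as the text following the empty span, inside the p element.

```python
import xml.etree.ElementTree as ET

# The fragment from the question, closed with </p> so it parses as XML.
fragment = """<p>
<strong>Case Status: </strong>
<span class="icon-check" aria-hidden="true"></span>
DISMISSED
<span> </span>
</p>"""

p = ET.fromstring(fragment)

# The span itself is empty; the text after its closing tag is its "tail".
span = p.find("span[@class='icon-check']")
status = span.tail.strip()
print(status)

# The text of the whole <p>, with whitespace collapsed.
full_text = " ".join("".join(p.itertext()).split())
print(full_text)
```

In Selenium, reading .text on the p element gives a comparable whole-paragraph string, from which the "Case Status:" label can be stripped.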
If HTML is as below:
<HTML>
<Body>
<p>
<strong>Case Status: </strong>
<span class="icon-check" aria-hidden="true"></span>
DISMISSED
<span> </span>
</p>
</Body>
</HTML>
That means the text DISMISSED belongs to the p tag.
So try this:
//p/strong[contains(.,'Case Status')]/following-sibling::span/..
OR
//p[contains(.,'DISMISSED')]/strong[contains(.,'Case Status')]/following-sibling::span/..
I have created a custom latest-blog template, but I can't show the cover images in the thumbnails.
Cover image should be here:
I have written the following code to show the cover image:
<div class="panel">
<t t-set="properties" t-value="json.loads(post.cover_properties)">
<a class="o_panel_cover" t-attf-href="#{blog_url('', ['blog', 'post'], blog=post.blog_id, post=post)}" t-att-style="background-image: #{cover_properties.get('background-image')};">
</a>
</t>
<div class="panel-heading mt0 mb0">
<h4 class="mt0 mb0">
<a t-attf-href="#{blog_url('', ['blog', 'post'], blog=post.blog_id, post=post)}" t-field="post.name"></a>
<span t-if="not post.website_published" class="text-warning">
<span class="fa fa-exclamation-triangle ml8" title="Unpublished"/>
</span>
</h4>
</div>
After writing the code, the image does not load and it shows like this:
How can I show the image?
I suggest clearing the browser cache; sometimes a stale cache prevents an image from showing.
First, there are a few issues with the controller.
The latest-posts route doesn't pass the cover properties to the template; it renders like this:
return request.render("website_blog.latest_blogs", {
    'posts': posts,
    'pager': pager,
    'blog_url': blog_url,
})
So I added the necessary values in my controller and returned them like this:
return request.render("website_blog.latest_blogs", {
    'posts': posts,
    'pager': pager,
    'blog_url': blog_url,
    'blogs': blogs,
    'blog_posts': blog_posts,
    'blog_posts_cover_properties': [json.loads(b.cover_properties) for b in blog_posts],
})
In the XML, I used them like this:
<t t-set="cover_properties" t-value="blog_posts_cover_properties[post_index]"/>
<a class="o_panel_cover" t-attf-href="#{blog_url('', ['blog', 'post'], blog=post.blog_id, post=post)}"
t-attf-style="background-image: #{cover_properties.get('background-image')};"></a>
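The data flow above can be sketched in plain Python. The stored strings below are made-up examples of what a post's cover_properties might contain (the real values depend on your database), and cover_style is a hypothetical helper that mimics what the t-attf-style expression produces.

```python
import json

# Hypothetical cover_properties values, one JSON string per blog post.
stored = [
    '{"background-image": "url(/web/image/101)", "opacity": "0.4"}',
    '{"background-image": "none"}',
]

# Same idea as the list comprehension in the controller above.
blog_posts_cover_properties = [json.loads(s) for s in stored]


def cover_style(cover_properties):
    """Build the inline style string the template would render."""
    return "background-image: %s;" % cover_properties.get("background-image")


styles = [cover_style(props) for props in blog_posts_cover_properties]
for style in styles:
    print(style)
```

The list is indexed by post position, which is why the template looks it up with post_index.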
I am working on the hr_recruitment module. I have added a binary image field to HR > Application, and I am trying to let external users fill in the job application themselves through the website. I have added name, email, phone, and resume attachment fields to the website job application form. When they click Submit, the record is updated in the HR > Job Application form, but the image field is not updated. When I open the job application, it shows a message like "Could not show the selected image". How can I solve this issue?
controller/main.py
if post.get('image',False):
    image = request.registry['ir.attachment']
    name = post.get('image').filename
    file = post.get('image')
    attach = file.stream
    file.show()
    f = attach.getvalue()
    webbrowser.open(image)
    attachment_id = Attachments.create(request.cr, request.uid, {
        'name': image,
        'res_name': image,
        'res_model': 'hr.applicant',
        'res_id': applicant_id,
        'datas': base64.decodestring(str(res[0])),
        'datas_fname': post['image'].filename,
    }, request.context)
views/templates.xml
<div t-attf-class="form-group">
<label class="col-md-3 col-sm-4 control-label" for="image">Image</label>
<div class="col-md-7 col-sm-8">
<img id="uploadPreview" style="width: 100px; height: 100px;" />
<input id="uploadImage" name="image" type="file" class="file" multiple="true" data-show-upload="true" data-show-caption="true" data-show-preview="true" onchange="PreviewImage();"/>
</div>
</div>
Add the image field as shown below in your XML template:
<img itemprop="image" style="margin-top: -53px; margin-left:19px; width:80px;" class="img img-responsive" t-att-src="website.image_url(partner, 'image', None if product_image_big else '300x300')"/>
<input class="input-file profileChooser" id="fileInput" type="file" name="ufile" onchange="validateProfileImg();"/>
Remove any attributes you don't need.
And in your controller function, you can get the value from the image fields as:
vals = {}
if post['ufile']:
    vals.update({'image': base64.encodestring(post['ufile'].read())})
request.registry['res.partner'].write(cr, uid, [partner_id], vals)
The above code works for me; I have been using it to update partner images from the Odoo website.
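The key step is that the uploaded bytes are base64-encoded before being written to the binary field, not decoded. Below is a minimal standalone sketch of that step; the BytesIO object is a stand-in for the uploaded file (post['ufile'] above), and the bytes are made up. It uses base64.b64encode because base64.encodestring was removed in Python 3.9.

```python
import base64
import io

# Stand-in for the uploaded file object; the content is fake image data.
upload = io.BytesIO(b"\x89PNG\r\n\x1a\n fake image bytes")

raw = upload.read()              # raw bytes as received from the form
encoded = base64.b64encode(raw)  # what gets written to the binary field

vals = {"image": encoded}

# Decoding the stored value gives back the original bytes.
round_trip = base64.b64decode(vals["image"])
print(round_trip == raw)
```

If the stored value is not valid base64 of the image bytes, the form view cannot decode it, which would be consistent with the "Could not show the selected image" message in the question.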
I am using Odoo 13, and I used this format to upload an image from the website to the res.partner form view:
<div class="col-md-6">
<label for="image_1920" class="form-label">Passport Photo*</label>
<input type="file" class="form-control" id="image_1920" name="image_1920" required="1" />
</div>
import base64

@http.route('/registered', auth='public', methods=['GET', 'POST'], website=True)
def registration_submit(self, *args, **kw):
    image = kw.get('image_1920', False)
    kw.update({
        'free_member': True,
        # base64.encodestring was removed in Python 3.9; base64.b64encode is the usual replacement there
        'image_1920': base64.encodestring(image.read()) if image else False
    })
    member = request.env['res.partner'].sudo().create(kw)
I am trying to append EJS templates using jQuery, and I am not sure if what I am doing is right.
views/form/room-form.ejs
<form>
...
<div class="add_room_inputs">
<%- include partials/add_room %>
</div>
<div id="NewAddRoomForm">
<p>Add new form</p>
</div>
...
</form>
\assets\linker\js\custom-functions.js
$(document).ready(function(){
    $('#NewAddRoomForm').click(function() {
        $('.add_room_inputs').append('<%- include /assets/linker/templates/add-room-input.ejs %>');
    });
});
/assets/linker/templates/room-input.ejs
<input name="rooms[]" class="form-control" placeholder="Room Name" type="text">
An alternative attempt (in custom-functions.js), which fails with a file-not-found error:
var html = new EJS({url: '/assets/linker/templates/add-room-input.ejs.ejs'}).render();
$('#NewAddRoomForm').click(function() {
$('.add_room_inputs').append(html);
});
How can I implement something like this?
Did you include jst.js in the layout? By default it should be there, since you are using Sails.js.
What you're trying to do won't work...
The problem is that the server has already rendered the template and sent the result to the client. Also the client doesn't have access to the view files on the server.
An Ajax framework like KnockoutJS is probably closer to what you are looking for, but there are others.
Here...I decided to make you a fiddle
http://jsfiddle.net/8D34n/23/
SO wants code too so here is a repaste
<form>
<h1 data-bind="text: form_title"></h1>
Add Form
<br><br>
<div id="NewAddRoomForm" data-bind="foreach: form_list">
<div data-bind="text: 'Form '+($index() + 1)"></div>
<input data-bind="value: name" /><br>
<input data-bind="value: age" />
</div>
<br><br>
</form>
View results
I have this html. I'm trying to get its InnerText without any tags in it,
<h1>my h1 content</h1>
<div class="thisclass">
<p> some text</p>
<p> some text</p>
<div style="some_style">
some text
<script type="text/javascript">
<!-- some script -->
</script>
<script type='text/javascript' src='some_script.js'></script>
</div>
<p> some text<em>some text</em>some text.<em> <br /><br /></em><strong><em>some text</em></strong></p>
<p> </p>
</div>
What I am trying to do is get the text as the user would see it from the class thisclass.
I want to strip any script tags, and all other tags, and just get plain text.
This is what I am using:
Dim Tags As HtmlNodeCollection = root.SelectNodes("//div[@class='thisclass'] | //h1")
Does anyone have any ideas?
Thanks.
Try this (warning: C# code ahead):
foreach(var script in root.SelectNodes("//script"))
{
script.ParentNode.RemoveChild(script);
}
Console.WriteLine(root.InnerText);
This gave me the following output:
my h1 content some text some textsome text some textsome textsome text. some text
Hope this helps.
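For comparison, the same idea (drop the script content, keep the remaining text) can be sketched in Python with only the standard library. This is an illustration of the technique, not HtmlAgilityPack code; VisibleText and visible_text are hypothetical names, and the sample markup is a shortened version of the HTML in the question.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Hypothetical helper: collects text, skipping anything inside <script>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._in_script = 0

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script += 1

    def handle_endtag(self, tag):
        if tag == "script" and self._in_script:
            self._in_script -= 1

    def handle_data(self, data):
        if not self._in_script:
            self.parts.append(data)


def visible_text(html):
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    # Join the pieces with spaces, then collapse runs of whitespace.
    return " ".join(" ".join(parser.parts).split())


sample = (
    "<h1>my h1 content</h1>"
    "<div class='thisclass'><p> some text</p>"
    "<script type='text/javascript'>var x = 1;</script>"
    "<p>more <em>text</em></p></div>"
)
result = visible_text(sample)
print(result)
```

Joining the pieces with a space avoids run-together words such as "textsome" in the output above, at the cost of adding a space wherever an inline tag splits a word.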