I cannot figure out how to parse the "author" and "fact" tags out of the following XML. If the formatting looks strange, here is a link to the XML doc.
<response stat="ok">
  <ltml version="1.1">
    <item id="5403381" type="work">
      <author id="21" authorcode="rowlingjk">J. K. Rowling</author>
      <url>http://www.librarything.com/work/5403381</url>
      <commonknowledge>
        <fieldList>
          <field type="42" name="alternativetitles" displayName="Alternate titles">
            <versionList>
              <version id="3413291" archived="0" lang="eng">
                <date timestamp="1298398701">Tue, 22 Feb 2011 13:18:21 -0500</date>
                <person id="18138">
                  <name>ablachly</name>
                  <url>http://www.librarything.com/profile/ablachly</url>
                </person>
                <factList>
                  <fact>Harry Potter and the Sorcerer's Stone </fact>
                </factList>
              </version>
            </versionList>
          </field>
So far I have tried this code to get the author but it does not work:
@xml_doc = Nokogiri::XML(open("http://www.librarything.com/services/rest/1.1/?method=librarything.ck.getwork&isbn=0590353403&apikey=d231aa37c9b4f5d304a60a3d0ad1dad4"))
@xml_doc.xpath('//response').each do |n|
  @author = n
end
I couldn't get at any nodes deeper than //response using the link you provided. I ended up using Nokogiri::XML::Reader and pushing elements into a hash, since there may be multiple authors, and there are definitely multiple facts. You can use whatever data structure you like, but this gets the content of the fact and author tags:
require 'nokogiri'
require 'open-uri'
url = "http://www.librarything.com/services/rest/1.1/?method=librarything.ck.getwork&isbn=0590353403&apikey=d231aa37c9b4f5d304a60a3d0ad1dad4"
reader = Nokogiri::XML::Reader(open(url))
# One array per tag name we want to collect
book = {
  author: [],
  fact: []
}
reader.each do |node|
  book.each do |k, v|
    # Keep only nodes whose name matches a key and that have content
    if node.name == k.to_s && !node.inner_xml.empty?
      book[k] << node.inner_xml
    end
  end
end
You could try:
nodes = @xml_doc.xpath("//xmlns:author", "xmlns" => "http://www.librarything.com/")
puts nodes[0].inner_text
nodes = @xml_doc.xpath("//xmlns:fact", "xmlns" => "http://www.librarything.com/")
nodes.each do |n|
  puts n.inner_text
end
The trick is in the namespace.
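The same namespace effect can be reproduced outside Ruby. The following is only a sketch in Python's standard library, assuming the response declares the default namespace named in the answer above; the trimmed sample document, the place where the declaration sits, and the `lt` prefix are illustrative assumptions, not the real API output.

```python
import xml.etree.ElementTree as ET

# Trimmed stand-in for the response. Where the xmlns declaration actually
# sits in the real document is an assumption made for this sketch.
SAMPLE = """<response stat="ok">
  <ltml xmlns="http://www.librarything.com/" version="1.1">
    <item id="5403381" type="work">
      <author id="21" authorcode="rowlingjk">J. K. Rowling</author>
      <factList>
        <fact>Harry Potter and the Sorcerer's Stone </fact>
      </factList>
    </item>
  </ltml>
</response>"""

# "lt" is an arbitrary prefix chosen here; only the URI has to match.
NS = {"lt": "http://www.librarything.com/"}

root = ET.fromstring(SAMPLE)

# An unqualified name does not match elements in the default namespace.
unqualified = root.findall(".//author")

# Binding a prefix to the namespace URI makes the same elements reachable.
authors = [a.text for a in root.findall(".//lt:author", NS)]
facts = [f.text.strip() for f in root.findall(".//lt:fact", NS)]

print(unqualified, authors, facts)
```

This is why a query for a bare element name can come back empty even though the element is plainly visible in the document.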
I am using Scrapy's xml feed spider sitemap to crawl and extract urls and only urls.
The xml sitemap looks like this:
<url>
  <loc>
    https://www.example.com/american-muscle-5-pc-kit-box.html
  </loc>
  <lastmod>2020-10-14T15:40:02+00:00</lastmod>
  <changefreq>daily</changefreq>
  <priority>1.0</priority>
  <image:image>
    <image:loc>
      https://www.example.com/pub/media/catalog/product/cache/de5bc950da2c28fc62848f9a6b789a5c/1/2/1202_45.jpg
    </image:loc>
    <image:title>
      5 PC. GAUGE KIT, 3-3/8" & 2-1/16", ELECTRIC SPEEDOMETER, AMERICAN MUSCLE
    </image:title>
  </image:image>
  <PageMap>
    <DataObject type="thumbnail">
      <Attribute name="name" value="5 PC. GAUGE KIT, 3-3/8" & 2-1/16", ELECTRIC SPEEDOMETER, AMERICAN MUSCLE"/>
      <Attribute name="src" value="https://www.example.com/pub/media/catalog/product/cache/de5bc950da2c28fc62848f9a6b789a5c/1/2/1202_45.jpg"/>
    </DataObject>
  </PageMap>
</url>
I ONLY want to get the contents of the <loc></loc>
So I set my scrapy spider up like this (some parts omitted for brevity):
start_urls = ['https://www.example.com/sitemap.xml']
namespaces = [('n', 'http://www.sitemaps.org/schemas/sitemap/0.9')]
itertag = 'url'
def parse_node(self, response, selector):
    item = {}
    item['url'] = selector.select('url').get()
    selector.remove_namespaces()
    yield {
        'url': selector.xpath('//loc/text()').getall()
    }
That ends up giving me the page URL plus the URLs of all the product images. How can I set this spider up to get ONLY the actual product page URL?
To change this part of the sitemap spider's logic, you need to override its _parse_sitemap method (source)
and replace the section
elif s.type == 'urlset':
    for loc in iterloc(it, self.sitemap_alternate_links):
        for r, c in self._cbs:
            if r.search(loc):
                yield Request(loc, callback=c)
                break
with something like this
elif s.type == 'urlset':
    for entry in it:
        item = entry  # entry: sitemap entry parsed as a dictionary by the Sitemap spider
        ...
        yield item  # return the item instead of making a request
In this case the spider returns items built from the parsed sitemap entries instead of making a request for every link.
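Separately from the override above, the reason the image URLs show up can be illustrated without Scrapy. In the question's spider, once the namespaces are removed, `//loc` matches both the sitemap `loc` and `image:loc`. The sketch below uses only the standard library; the shortened sample sitemap and the `page_urls` helper are made up for illustration. Inside the spider, the equivalent would presumably be a namespace-qualified XPath such as `n:loc/text()` with the `remove_namespaces()` call dropped, but that is an untested assumption.

```python
import xml.etree.ElementTree as ET

# Shortened sample sitemap; the image URL is abbreviated for illustration.
SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:image="http://www.google.com/schemas/sitemap-image/1.1">
  <url>
    <loc>
      https://www.example.com/american-muscle-5-pc-kit-box.html
    </loc>
    <image:image>
      <image:loc>https://www.example.com/pub/media/1202_45.jpg</image:loc>
    </image:image>
  </url>
</urlset>"""

# Same prefix/URI pair as the spider's `namespaces` attribute.
NS = {"n": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def page_urls(xml_text):
    """Return only the <loc> values that are direct children of <url>."""
    root = ET.fromstring(xml_text)
    # "n:url/n:loc" is limited to the sitemap namespace, so image:loc,
    # which lives in a different namespace, is never matched.
    return [loc.text.strip() for loc in root.findall("n:url/n:loc", NS) if loc.text]


print(page_urls(SITEMAP))
```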
I have to convert some BeautifulSoup code to lxml.
Basically, I want to get all children of the body node, select those that have text, and store them.
Here is the code with bs4:
def get_children(self, tag, dorecursive=False):
    children = []
    if not tag:
        return children
    for t in tag.findChildren(recursive=dorecursive):
        if t.name in self.text_containers \
                and len(t.text) > self.min_text_length \
                and self.is_valid_tag(t):
            children.append(t)
    return children
This works fine.
When I try the same thing with lxml instead, children is empty:
def get_children(self, tag, dorecursive=False):
    children = []
    if not tag:
        return children
    tags = tag.getchildren()
    for t in tags:
        #print(t.tag)
        if t.tag in self.text_containers \
                and len(t.tail) > self.min_text_length \
                and self.is_valid_tag(t):
            children.append(t)
    return children
Any ideas?
Code:
import lxml.html
import requests

class TextTagManager:
    TEXT_CONTAINERS = {
        'li',
        'p',
        'span',
        *[f'h{i}' for i in range(1, 6)]
    }
    MIN_TEXT_LENGTH = 60

    def is_valid_tag(self, tag):
        # put some logic here
        return True

    def get_children(self, tag, recursive=False):
        children = []
        tags = tag.findall('.//*' if recursive else '*')
        for t in tags:
            if (t.tag in self.TEXT_CONTAINERS and
                    t.text and
                    len(t.text) > self.MIN_TEXT_LENGTH and
                    self.is_valid_tag(t)):
                children.append(t)
        return children

manager = TextTagManager()
url = 'https://en.wikipedia.org/wiki/Comparison_of_HTML_parsers'
html = requests.get(url).text
doc = lxml.html.fromstring(html)
for child in manager.get_children(doc, recursive=True):
    print(child.tag, ' -> ', child.text)
Output:
li -> HTML traversal: offer an interface for programmers to easily access and modify of the "HTML string code". Canonical example:
li -> HTML clean: to fix invalid HTML and to improve the layout and indent style of the resulting markup. Canonical example:
.getchildren() returns all direct children. If you want to have a recursive option, you can use .findall():
tags = tag.findall('.//*' if recursive else '*')
This answer should help you understand the difference between .//tag and tag.
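To make the two path forms concrete, here is a small sketch on a made-up fragment. It uses the standard library's ElementTree, which exposes the same `findall`, `.text` and `.tail` as lxml for this purpose. It also shows why testing `.tail` rather than `.text` goes wrong: `.tail` is the text after an element's closing tag, and it is often None.

```python
import xml.etree.ElementTree as ET

# Made-up fragment, only for illustration.
doc = ET.fromstring(
    "<body><div><p>inside p</p>after p</div><span>top span</span></body>"
)

direct = [el.tag for el in doc.findall("*")]     # direct children of <body> only
nested = [el.tag for el in doc.findall(".//*")]  # every descendant, in document order

p = doc.find(".//p")
span = doc.find("span")

print(direct)           # tags one level down
print(nested)           # tags at any depth
print(p.text, p.tail)   # text inside <p> vs. text following </p>
print(span.tail)        # nothing follows </span>, so this is None
```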
I'm trying to concatenate 3 fields to form an internal code and display it in the views:
I have 3 models:
Category (size=2)
Product (size=4)
Serie (size=3)
And I want to display it in the form like this
Product Code: CAT-PROD-001
I don't know if I have to use a computed field or if there is another way to do this. I have been testing with computed fields but can't get the desired output.
Edit:
Now I'm trying to use a computed field with an onchange function to generate the value of the field.
MODEL
# -*- coding:utf-8 -*-
from openerp import models,fields,api

class exec_modl(models.Model):
    _name = "exec.modl"
    _rec_name = "exec_desc"
    exec_code = fields.Char('Identificador',required=True,size=3)
    exec_desc = fields.Char('Descripción',required=True)
    cour_exec = fields.Many2one('cour.modl')
    proc_exec = fields.Many2one('enro.modl')
    inte_code = fields.Char(compute='_onchange_proc')
FUNCTION
    @api.onchange('proc_exec')
    def _onchange_proc(self):
        cate = "XX"
        cour = "XXXX"
        exet = "XXX"
        output = cate+"-"+cour+"-"+exet
        return output
I'm using plain values for now, just to learn how to send the result to the field.
EDIT 2:
Using the answer from @Charif I can print the static strings on the form, but the next milestone I'm trying to reach is getting the codes (fields from external models) to create that inte_code.
Ex: from the model cour.modl I want to get the value of the field cour_code (the internal ID of the course) corresponding to the cour_exec field on the first model (the cour_exec field holds the description of the course from the cour.modl model).
    @api.depends('proc_exec')
    def _onchange_proc(self):
        cate = "XX"
        cour = self.env['cour.modl'].search([['cour_desc','=',self.cour_exec]])
        exet = "XXX"
        output = cate+"-"+cour+"-"+exet
        self.inte_code = output
    @api.depends('inte_code')
    def _onchange_proc(self):
        cate = "XX"
        # first domain use tuple not list
        cour_result = self.env['cour.modl'].search([('id','=',exec_modl.cour_exec)]).cour_code
        cour = ""  # empty string because you cannot concatenate None or False with a string value
        #if cour_result :
        #    cour = ",".join(crse_code for crse_code in cour_result.ids)
        #else :
        #    print "result of search is empty check you domain"
        exet = "XXX"
        output = cate+"-"+cour+"-"+exet+"-"+cour_result
        self.inte_code = output
EDIT 3
I've been trying to use the search method to read values from other models, but I get this console output:
Can't adapt type 'Many2One'. It seems I'm trying to compare two different types of fields. Can the types be converted in Odoo, or am I using the wrong syntax for the search method?
    @api.depends('inte_code')
    def _onchange_proc(self):
        cate = "XX"
        # first domain use tuple not list
        cour_result = self.env['cour.modl'].search([('id','=',exec_modl.cour_exec)]).cour_code
        exet = "XXX"
        output = cate+"-"+cour+"-"+exet+"-"+cour_result
        self.inte_code = output
EDIT 4 : ANSWER
Finally I've reached the desired output, using the following code:
    @api.depends('inte_code')
    def _onchange_proc(self):
        cate_result = self.cate_exec
        proc_result = self.env['enro.modl'].search([('id','=',str(self.proc_exec.id))]).enro_code
        cour_result = self.env['cour.modl'].search([('id','=',str(self.cour_exec.id))]).cour_code
        output = str(proc_result)+"-"+str(cate_result)+"-"+str(cour_result)+"-"+self.exec_code
        self.inte_code = output
Additionally, I've added a related field to include the course category in the final output.
    cate_exec = fields.Char(related='cour_exec.cour_cate.cate_code')
Now the output has this structure:
INTERNAL_PROC_ID-CAT_COURSE-COURSE-EXECUTION_CODE
EX: xxxxxxxx-xx-xxxx-xxx
First, in a compute field use api.depends, not onchange.
Second, the compute function doesn't return anything. The record is passed in the self variable, so all you have to do is assign the value to the computed field.
    @api.depends('proc_exec')
    def _onchange_proc(self):
        # compute the value
        # ...
        # Then assign it to the field
        self.computed_field = computed_value
One thing I recommend is looping over self, because it is a recordset: if self contains more than one record, the previous code will raise a singleton error.
So you can do this:
    # compute the value here if it's the same for every record in self
    for rec in self:
        # compute the value here if it depends on the values of the record
        rec.compute_field = computeValue
or use api.one with api.depends:
    @api.one
    @api.depends('field1', 'field2', ...)
EDITS:
    @api.depends('proc_exec')
    def _onchange_proc(self):
        cate = "XX"
        # first domain use tuple not list
        cour_result = self.env['cour.modl'].search([('cour_desc','=',self.cour_exec)])
        cour = ""  # empty string because you cannot concatenate None or False with a string value
        if cour_result:
            # ids are integers, so convert them before joining
            cour = ",".join(str(i) for i in cour_result.ids)
        else:
            print("result of search is empty, check your domain")
        exet = "XXX"
        output = cate+"-"+cour+"-"+exet
        self.inte_code = output
Try this code. I think the result of search is a recordset, so you can get the list of ids with name_of_record_set.ids and then build a string from that list to concatenate. Try it and let me know if there is an error; I'm on my work PC and don't have Odoo at hand ^^
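The remark about not concatenating None or False with a string matters in practice because an unset Odoo field reads as False. A framework-free sketch of the joining step might look like the following; `build_internal_code` is a hypothetical helper, not part of Odoo, and inside a compute method its result would still have to be assigned to the field on each record.

```python
def build_internal_code(*parts, sep="-"):
    """Join code fragments, treating None and False (unset fields) as ''.

    Hypothetical helper for illustration only; it is not an Odoo API.
    """
    # Identity checks are used so that a legitimate 0 is kept as "0".
    cleaned = ["" if (p is None or p is False) else str(p) for p in parts]
    return sep.join(cleaned)


print(build_internal_code("ENR00001", "CA", "COUR", "001"))
print(build_internal_code("XX", False, "XXX"))
```

Converting every part with str() first also avoids the TypeError that a plain `+` raises when one operand is not a string.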
You can create a new wizard.
From the wizard you can generate the Internal Reference.
class create_internal_reference(models.TransientModel):
    _name="create.internal.reference"

    @api.multi
    def create_internal_reference(self):
        product_obj=self.env['product.product']
        active_ids=self._context.get('active_ids')
        if active_ids:
            products=product_obj.browse(active_ids)
            products.generate_new_internal_reference()
        return True
Create View & act_window
<record model="ir.ui.view" id="create_internal_reference_1">
    <field name="name">Create Internal Reference</field>
    <field name="model">create.internal.reference</field>
    <field name="arch" type="xml">
        <form string="Create Internal Reference">
            <footer>
                <button name="create_internal_reference" string="Generate Internal Reference" type="object" class="oe_highlight"/>
                <button string="Cancel" class="oe_link" special="cancel" />
            </footer>
        </form>
    </field>
</record>
<act_window name="Generate Internal Reference" res_model="create.internal.reference"
    src_model="product.product" view_mode="form" view_type="form"
    target="new" multi="True" key2="client_action_multi"
    id="action_create_internal_reference"
    view_id="create_internal_reference_1"/>
class product_product(models.Model):
    _inherit='product.product'

    @api.multi
    def generate_new_internal_reference(self):
        for product in self:
            if not product.internal_reference:
                # third_field stands for the third value of the code; it is not defined in this snippet
                product.internal_reference='%s-%s-%s'%(str(product.categ_id.name)[:2],str(product.name)[:4],third_field[:3])
From product.product, under the More button, you can access this wizard and generate the internal reference.
This may help you.
The SQL below:
SELECT
  XMLELEMENT("classics", xmlattributes('xxxxxxxxx' AS "eventId"),
    XMLELEMENT("author",xmlattributes(FIRST_NAME AS "firstName", LAST_NAME AS "lastName", BIRTH AS "dob", DEATH AS "dod" )),
    XMLELEMENT("bibliography",
      XMLELEMENT("type",xmlattributes(DESCRIPTION AS "desc"),
        XMLELEMENT("award", xmlattributes(NOBEL AS "nobelPrize"))),
      XMLELEMENT("books",xmlattributes(BOOK_TITLE AS "title", PUBLISHED_DATE AS "published" ))))
FROM CLASSICS
WHERE AUTHOR_ID=23;
does not group correctly by book. What I am trying to achieve is this XML as the result:
<?xml version="1.0" encoding="UTF-8"?>
<classics eventId="234567890">
  <author firstName="Ernest " lastName="Hemingway" dob="1899-07-21" dod="1961-07-02" />
  <bibliography>
    <inner>
      <type desc="Novel" >
        <award nobel="true" />
      </type>
      <books>
        <inner title="The Old Man And The Sea" published="1952" />
        <inner title="For Whom The Bell Tolls" published="1940" />
        <inner title="A Farewell To Arms" published="1929" />
      </books>
    </inner>
  </bibliography>
</classics>
At the moment I get 3 records, one distinct XML document for each book. I have tried to use XMLAGG to group by AUTHOR_ID (this field links each book in the CLASSICS table to its author). The structure is very simple: one author has published one or more books, and I need to store the "classics" XML object for an author, containing an array of books inside another array, "bibliography".
This is the code I tried inside the "bibliography" XMLELEMENT for the "books" array, with no luck:
select
  xmlagg(
    xmlelement("books",
      xmlattributes(BOOK_TITLE AS "title", PUBLISHED_DATE AS "published" )
    ))
from CLASSICS
GROUP BY AUTHOR_ID
The main goal in the end is to reach this JSON structure once I have the XML:
{
  "eventId": "234567890",
  "author": {
    "firstName": "Ernest",
    "lastName": "Hemingway",
    "dob": "1899-07-21",
    "dod": "1961-07-02"
  },
  "bibliography": [
    {
      "type": {
        "desc": "Novel",
        "award": {
          "nobelPrize": "1954"
        }
      },
      "books": [
        {
          "title": "The Old Man And The Sea",
          "published": "1952"
        },
        {
          "title": "For Whom The Bell Tolls",
          "published": "1940"
        },
        {
          "title": "A Farewell To Arms",
          "published": "1929"
        }
      ]
    }
  ]
}
But it seems quite complicated to declare arrays of objects in XML/SQL.
Any suggestions?
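For reference, the one-to-many grouping described above can be sketched outside the database. This does not answer the SQL question itself; it only illustrates the target shape, assuming the flat rows (one per book) have already been fetched. The lowercase column names, the sample rows taken from the expected output, and the `nest_author_rows` helper are all illustrative assumptions.

```python
from itertools import groupby
from operator import itemgetter

# Sample rows shaped like one author's records in CLASSICS (one per book).
ROWS = [
    {"first_name": "Ernest", "last_name": "Hemingway", "description": "Novel",
     "book_title": "The Old Man And The Sea", "published_date": "1952"},
    {"first_name": "Ernest", "last_name": "Hemingway", "description": "Novel",
     "book_title": "For Whom The Bell Tolls", "published_date": "1940"},
    {"first_name": "Ernest", "last_name": "Hemingway", "description": "Novel",
     "book_title": "A Farewell To Arms", "published_date": "1929"},
]


def nest_author_rows(rows, event_id):
    """Fold one author's flat rows into author -> bibliography -> books."""
    first = rows[0]
    # groupby only merges adjacent rows, so sort by the grouping key first.
    by_type = sorted(rows, key=itemgetter("description"))
    bibliography = [
        {
            "type": {"desc": desc},
            "books": [
                {"title": r["book_title"], "published": r["published_date"]}
                for r in group
            ],
        }
        for desc, group in groupby(by_type, key=itemgetter("description"))
    ]
    return {
        "eventId": event_id,
        "author": {"firstName": first["first_name"], "lastName": first["last_name"]},
        "bibliography": bibliography,
    }


result = nest_author_rows(ROWS, "234567890")
print(result)
```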
I have two fields on a Paragraph model, one of them being a ManyToMany field.
class Tag(models.Model):
    tag = models.CharField(max_length=500)

    def __unicode__(self):
        return self.tag

admin.site.register(Tag)

class Paragraph(models.Model):
    article = models.ForeignKey(Article)
    text = models.TextField()
    tags = models.ManyToManyField(Tag)

    def __unicode__(self):
        return "Headline: " + self.article.headline + " Tags: " + ', '.join([t.tag for t in self.tags.all()])

admin.site.register(Paragraph)
And my .txt template reflects the ManyToMany relationship in order to index the tags:
{{object.text}}
{% for tag in object.tags.all %}
{{tag.tag}}
{% endfor %}
My views.py then uses SQS to search for all the tags (I want to accomplish this first, before including the text field) and retrieves those. So in this case, the query is "Politics":
def politics(request):
    paragraphs = []
    sqs = SearchQuerySet().filter(tag="Politics")
    paragraphs = [a.object for a in sqs[0:10]]
    return render_to_response("search/home_politics.html",{"paragraphs":paragraphs},context_instance=RequestContext(request))
Edited:
and my search_indexes.py:
class ParagraphIndex(indexes.SearchIndex, indexes.Indexable):
    text= indexes.CharField(document=True, use_template=True)
    tags= indexes.CharField(model_attr='tags')

    def get_model(self):
        return Paragraph

    def index_queryset(self):
        return self.get_model().objects

    def load_all_queryset(self):
        # Pull all objects related to the Paragraph in search results.
        return Paragraph.objects.all().select_related()
However, this doesn't retrieve anything, even though a few paragraphs have tags that are "Politics". Am I missing anything here, or should I approach related data another way? I am a beginner with Haystack, so any help will be much appreciated. Thanks in advance!
So this is a very useful article that helped me solve the problem.
Based on the article, this is how my search_indexes.py looks now:
class ParagraphIndex(indexes.SearchIndex, indexes.Indexable):
    text = indexes.CharField(document=True, use_template=True)
    tags = indexes.MultiValueField()

    def prepare_tags(self,object):
        return [tag.tag for tag in object.tags.all()]

    def get_model(self):
        return Paragraph

    def index_queryset(self):
        return self.get_model().objects

    def load_all_queryset(self):
        # Pull all objects related to Paragraph in search results.
        return Paragraph.objects.all().select_related()
and my views.py:
def politics(request):
    paragraphs = []
    sqs = SearchQuerySet().filter(tags='Politics')
    paragraphs = [a.object for a in sqs[0:10]]
    return render_to_response("search/home.html",
                              {"paragraphs":paragraphs},
                              context_instance=RequestContext(request))
And I am using elasticsearch for the engine. Hope this helps!