Generate XML documents using lxml and vary element text and attributes based on logic

I have my lxml code like this:
from lxml import etree
import sys
fd = open('D:\\text.xml', 'wb')
xmlns = "http://www.fpml.org/FpML-5/confirmation"
xsi = "http://www.w3.org/2001/XMLSchema-instance"
fpmlVersion="http://www.fpml.org/FpML-5/confirmation ../../fpml-main-5-6.xsd http://www.w3.org/2000/09/xmldsig# ../../xmldsig-core-schema.xsd"
page = etree.Element("{"+xmlns+"}dataDocument",nsmap={None:xmlns,'xsi':xsi })
doc = etree.ElementTree(page)
page.set("fpmlVersion", fpmlVersion)
trade = etree.SubElement(page,'trade')
tradeheader = etree.SubElement(trade,'tradeheader')
partyTradeIdentifier = etree.SubElement(tradeheader,'partyTradeIdentifier')
partyReference = etree.SubElement(partyTradeIdentifier,'partyReference',href='party1')
tradeId = etree.SubElement(partyTradeIdentifier,'tradeId',tradeIdScheme='http://www.partyA.com/swaps/trade-id')
tradeId.text = 'TW9235'
swap = etree.SubElement(trade,'swap')
party = etree.SubElement(page,'party',id='party1')
partyID = etree.SubElement(party,'partyID')
partyID.text = 'PARTYAUS33'
partyName = etree.SubElement(party,'partyName')
partyName.text = 'Party A'
party = etree.SubElement(page,'party',id='party2')
partyID = etree.SubElement(party,'partyID')
partyID.text = 'BARCGB2L'
partyName = etree.SubElement(party,'partyName')
partyName.text = 'Party B'
s = etree.tostring(doc, xml_declaration=True,encoding="UTF-8",pretty_print=True)
print (s)
fd.write(s)
And I need to generate an XML file like:
<?xml version='1.0' encoding='UTF-8'?>
<dataDocument xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://www.fpml.org/FpML-5/confirmation" fpmlVersion="http://www.fpml.org/FpML-5/confirmation ../../fpml-main-5-6.xsd http://www.w3.org/2000/09/xmldsig# ../../xmldsig-core-schema.xsd">
  <trade>
    <tradeheader>
      <partyTradeIdentifier>
        <partyReference href="party1"/>
        <tradeId tradeIdScheme="http://www.partyA.com/swaps/trade-id">TW9235</tradeId>
      </partyTradeIdentifier>
    </tradeheader>
    <swap/>
  </trade>
  <party id="party1">
    <partyID>PARTYAUS33</partyID>
    <partyName>Party A</partyName>
  </party>
  <party id="party2">
    <partyID>BARCGB2L</partyID>
    <partyName>Party B</partyName>
  </party>
</dataDocument>
Now the above code works.
However, I need to generate 10k such files where the element text or attributes vary.
For example, the partyID may be different, e.g. PARTYGER45 instead of PARTYAUS33. Is there a clean way to do this instead of hard-coding it?
Similarly, I need to vary a lot of things, like the tradeId TW9235.

One way could be to load a template XML (the output structure, without the varying values) with lxml.objectify, then loop over the values you need, set them, and write each result to a file, meaning:
from lxml import etree, objectify

# read the template document once
with open('in.xml', 'rb') as f_in:
    template = f_in.read()

for pId in ['PARTYGER45', ...]:
    dataDocument = objectify.fromstring(template)
    dataDocument.party.partyID._setText(pId)
    # ... set any other varying values here
    obj_xml = etree.tostring(dataDocument, xml_declaration=True, encoding='UTF-8')
    with open('out_%s.xml' % pId, 'wb') as f_out:
        f_out.write(obj_xml)
Another way might be to use lxml and XSLT: again, start from an empty, structured XML and transform it according to your needs.
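A third option, sketched below, is to keep building the document with etree as in the question, but wrap the construction in a function that takes the varying values as parameters and loop over whatever parameter sets you need. The build_document name, the trimmed-down element set, and the example values here are illustrative only:
from lxml import etree

XMLNS = "http://www.fpml.org/FpML-5/confirmation"
XSI = "http://www.w3.org/2001/XMLSchema-instance"

def build_document(party_id, trade_id):
    # same structure as in the question (trimmed here for brevity),
    # but the varying values come in as parameters
    page = etree.Element("{%s}dataDocument" % XMLNS, nsmap={None: XMLNS, 'xsi': XSI})
    trade = etree.SubElement(page, 'trade')
    tradeheader = etree.SubElement(trade, 'tradeheader')
    pti = etree.SubElement(tradeheader, 'partyTradeIdentifier')
    etree.SubElement(pti, 'partyReference', href='party1')
    tid = etree.SubElement(pti, 'tradeId', tradeIdScheme='http://www.partyA.com/swaps/trade-id')
    tid.text = trade_id
    party = etree.SubElement(page, 'party', id='party1')
    etree.SubElement(party, 'partyID').text = party_id
    return etree.ElementTree(page)

# illustrative parameter sets; in practice these could come from a CSV, a database, etc.
for i, (party_id, trade_id) in enumerate([('PARTYAUS33', 'TW9235'), ('PARTYGER45', 'TW9236')]):
    doc = build_document(party_id, trade_id)
    doc.write('out_%d.xml' % i, xml_declaration=True, encoding='UTF-8', pretty_print=True)
Whichever approach you pick, the idea is the same: keep a single definition of the document structure and feed the varying values in from data.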

Related

Can BeautifulSoup be used to find elements hidden by other wrapped elements?

I would like to extract the text of the author affiliations on this page using Beautiful Soup.
I know of a workaround using Selenium to simply click on the 'show more' link and scan the page again. I'm not sure what kind of elements these are (hidden?), as they only appear in the inspector after clicking the button.
Is there a way to extract this info using just Beautiful Soup, or do I need Selenium or something equivalent to reveal the elements in the HTML?
from bs4 import BeautifulSoup
import requests

url = 'https://www.sciencedirect.com/science/article/abs/pii/S0920379621007596'
r = requests.get(url)
sp = BeautifulSoup(r.content, 'html.parser')
author_data = sp.find('div', id='author-group')
affiliations = author_data.find('dl', class_='affiliation').text
print(affiliations)
That info is within a script tag, though you need to map the affiliation letters to the actual affiliations. The code below extracts the JavaScript object housing the info you want and parses it with the json library.
There is then a series of steps to dynamically determine which indices hold the info of interest, and a constructed mapping of letters to affiliations is used to assign the correct affiliation to each author.
The author first and last names are also dynamically ascertained and joined together with a space.
The intention was to avoid hard-coding indices which might change over time.
import re
import json
import requests

r = requests.get('https://www.sciencedirect.com/science/article/abs/pii/S0920379621007596',
                 headers={'User-Agent': 'Mozilla/5.0'})
data = json.loads(re.search(r'(\{"abstracts".*})', r.text).group(1))
base = [i for i in data['authors']['content']
        if i.get('#name') == 'author-group'][0]['$$']
affiliation_data = [i for i in base if i['#name'] == 'affiliation']
author_data = [i for i in base if i['#name'] == 'author']
name_info = [i['_'] for author in author_data for i in author['$$']
             if i['#name'] in ['given-name', 'surname']]
affiliations = dict(zip(
    [j['_'] for i in affiliation_data for j in i['$$'] if j['#name'] == 'label'],
    [j['_'] for i in affiliation_data for j in i['$$']
     if isinstance(j, dict) and '_' in j and j['_'][0].isupper()]))
# print(affiliations)
author_affiliations = dict(zip(
    [' '.join([i[0], i[1]]) for i in zip(name_info[0::2], name_info[1::2])],
    [affiliations[j['_']] for author in author_data for i in author['$$']
     if i['#name'] == 'cross-ref' for j in i['$$'] if j['_'] != '⁎']))
print(author_affiliations)

web scrape does not find the correct tags

I am trying to extract the text of this page: https://www.londonstockexchange.com/news-article/ESNT/date-for-fy-2020-results-announcement/14850033 using bs4 and pandas
I start with:
url = 'https://www.londonstockexchange.com/news-article/ESNT/date-for-fy-2020-results-announcement/14850033'
src = requests.get(url).content
soup = BeautifulSoup(src, 'xml')
and see that the text I am interested in is wrapped in p tags,
but when I run soup.find_all('p'), the only return I get is the closing paragraph.
How can I extract the paragraph text within? What am I missing?
I also tried with Selenium, using:
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument("--headless")
chrome_driver = os.getcwd() + "\\chromedriver.exe"
driver = webdriver.Chrome(options = chrome_options, executable_path = chrome_driver)
driver.get(url)
page = driver.page_source
page_soup = BeautifulSoup(page,'xml')
div=page_soup.find_all('p')
[a.text for a in div]
I figured it out.
The body of the site comes from a <script> tag that holds a JSON but with a funky encoding.
That tag has an id of "ng-lseg-state", which means this is Angular's custom HTML encoding.
You can target the <script> tag with BeautifulSoup and parse it with the json module.
Then, however, you need to deal with Angular's encoding. One way, a bit crude though, is to chain a bunch of .replace() methods.
Here's how:
import json
import requests
from bs4 import BeautifulSoup
url = "https://www.londonstockexchange.com/news-article/ESNT/date-for-fy-2020-results-announcement/14850033"
script = BeautifulSoup(requests.get(url).text, "lxml").find("script", {"id": "ng-lseg-state"})
article = json.loads(script.string.replace("&q;", '"'))
main_key = "G.{{api_endpoint}}/api/v1/pages?parameters=newsId%3D14850033&a;path=news-article"
article_body = article[main_key]["body"]["components"][1]["content"]["newsArticle"]["value"]
decoded_body = (
    article_body
    .replace('&l;', '<')
    .replace('&g;', '>')
    .replace('&q;', '"')
)
print(BeautifulSoup(decoded_body, "lxml").find_all("p")[22].getText())
This outputs:
Essentra plc is a FTSE 250 company and a leading global provider of essential components and solutions.&a;#160; Organised into three global divisions, Essentra focuses on the light manufacture and distribution of high volume, enabling components which serve customers in a wide variety of end-markets and geographies.
However, as I've said, this is not the best approach, as I'm not entirely sure how to deal with a bunch of other characters, namely:
&a;#160;
&a;amp;
&s;
just to name a few. But I've already asked about this.
EDIT:
Here's fully working code based on the answer to my question, mentioned above.
import html
import json
import requests
from bs4 import BeautifulSoup

def unescape(decoded_html):
    char_mapping = {
        '&a;': '&',
        '&q;': '"',
        '&s;': '\'',
        '&l;': '<',
        '&g;': '>',
    }
    for key, value in char_mapping.items():
        decoded_html = decoded_html.replace(key, value)
    return html.unescape(decoded_html)

url = "https://www.londonstockexchange.com/news-article/ESNT/date-for-fy-2020-results-announcement/14850033"
script = BeautifulSoup(requests.get(url).text, "lxml").find("script", {"id": "ng-lseg-state"})
payload = json.loads(unescape(script.string))
main_key = "G.{{api_endpoint}}/api/v1/pages?parameters=newsId%3D14850033&path=news-article"
article_body = payload[main_key]["body"]["components"][1]["content"]["newsArticle"]["value"]
print(BeautifulSoup(article_body, "lxml").find_all("p")[22].getText())

How do you get everything under the root element?

I'm using etree to clean up some HTML. I realise that I have to have a root tag to hold all the elements, but I want to return the string without the root. Is there any way to do this?
from lxml import etree

fragment = etree.fromstring("<fragment>text1 <a>tex2 </a>text3<b>text4</b> <c>text 5</c>text 6<span style=''> Style</span></fragment>")
delete_tag = 'delete_me'
for element in fragment.xpath(".//span[@style='']"):
    element.tag = delete_tag
etree.strip_tags(fragment, delete_tag)
print(etree.tostring(fragment))
What I get is
b'<fragment>text1 <a>tex2 </a>text3<b>text4</b> <c>text 5</c>text 6 Style</fragment>'
but I want is
text1 <a>tex2 </a>text3<b>text4</b> <c>text 5</c>text 6 Style
Try something along these lines:
elems = fragment.xpath('.//*')
target = ''
target += fragment.xpath('./text()[1]')[0]
for elem in elems:
    target += etree.tostring(elem).decode()
target
Output:
'text1 <a>tex2 </a>text3<b>text4</b> <c>text 5</c>text 6 Style'
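A compact variant of the same idea (a sketch using the same fragment as above) joins the root's leading text with the serialized children; each child's tail text is included by tostring():
inner = (fragment.text or '') + ''.join(etree.tostring(child, encoding='unicode') for child in fragment)
print(inner)  # text1 <a>tex2 </a>text3<b>text4</b> <c>text 5</c>text 6 Style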

I want to extract the numbers at the end of a url using regular expressions in scrapy

I want to get '25430989' from the end of this url.
https://www.example.com/cars-for-sale/2007-ford-focus-1-6-diesel/25430989
How would I write it using the xpath?
I get the link using this xpath:
link = row.xpath('.//a/@href').get()
When I use a regex tester I can isolate it with r'(\d+)$', but when I put it into my code it doesn't work for some reason.
import scrapy
import re
from ..items import DonedealItem

class FarmtoolsSpider(scrapy.Spider):
    name = 'farmtools'
    allowed_domains = ['www.donedeal.ie']
    start_urls = ['https://www.donedeal.ie/all?source=private&sort=publishdate%20desc']

    def parse(self, response):
        items = DonedealItem()
        rows = response.xpath('//ul[@class="card-collection"]/li')
        for row in rows:
            if row.xpath('.//ul[@class="card__body-keyinfo"]/li[contains(text(),"0 min")]/text()'):
                link = row.xpath('.//a/@href').get()  # this is the full link.
                linkid = link.re(r'(\d+)$').get()
                title = row.xpath('.//p[@class="card__body-title"]/text()').get()
                county = row.xpath('.//li[contains(text(),"min")]/following-sibling::node()/text()').get()
                price = row.xpath('.//p[@class="card__price"]/span[1]/text()').get()
                subcat = row.xpath('.//a/div/div[2]/div[1]/p[2]/text()[2]').get()
                items['link'] = link
                items['linkid'] = linkid
                items['title'] = title
                items['county'] = county
                items['price'] = price
                items['subcat'] = subcat
                yield items
I'm trying to get the linkid.
The problem is here
link = row.xpath('.//a/@href').get()  # this is the full link.
linkid = link.re(r'(\d+)$').get()
When you use the .get() method it returns a string, which is saved in the link variable, and strings don't have a .re() method for you to call. You can use one of the methods from the re module instead (see the docs for reference).
I would use re.findall(); it returns a list of values that match the regex (in this case only one item would be returned), or an empty list if nothing matches. re.search() is also a good choice, but it returns a re.Match object (or None).
import re #Don't forget to import it
...
link = row.xpath('.//a/@href').get()
linkid = re.findall(r'(\d+)$', link)
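Since findall() returns a list, you'd typically take the first element when assigning it, e.g.:
linkid_list = re.findall(r'(\d+)$', link)
items['linkid'] = linkid_list[0] if linkid_list else None  # guard against no match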
Now, the Scrapy selectors also support regex, so an alternative would be implementing it like this: (No need for re module)
linkid = row.xpath('.//a/@href').re_first(r'(\d+)$')
Notice I didn't use .get() there.

Write byte strings to a file

I have the following code:
from lxml import etree
xmlns = "http://www.fpml.org/FpML-5/confirmation"
xsi = "http://www.w3.org/2001/XMLSchema-instance"
fpmlVersion="http://www.fpml.org/FpML-5/confirmation ../../fpml-main-5-6.xsd http://www.w3.org/2000/09/xmldsig# ../../xmldsig-core-schema.xsd"
page = etree.Element("{"+xmlns+"}dataDocument",nsmap={None:xmlns,'xsi':xsi })
doc = etree.ElementTree(page)
page.set("fpmlVersion", fpmlVersion)
trade = etree.SubElement(page,'trade')
party = etree.SubElement(page,'party',id='party1')
partyID = etree.SubElement(party,'partyID')
partyID.text = 'PARTYAUS33'
partyName = etree.SubElement(party,'partyName')
partyName.text = 'Party A'
party = etree.SubElement(page,'party',id='party2')
partyID = etree.SubElement(party,'partyID')
partyID.text = 'BARCGB2L'
partyName = etree.SubElement(party,'partyName')
partyName.text = 'Party B'
s = etree.tostring(doc, xml_declaration=True,encoding="UTF-8",pretty_print=True)
print (s)
How do I save the contents of s to a file?
You have to open() a new file in binary mode, and later use the filehandle write() function, so add:
import sys
fd = open(sys.argv[1], 'wb')
at the beginning.
And:
fd.write(s)
at the end.
So, you now have to pass an additional argument to the script, which will be the output file to write. Also note that I use the wb flag in the open() call, which will truncate (re-create) any existing file with that name in the current directory.
Run it like:
python3 script.py outfile
That yields:
<?xml version='1.0' encoding='UTF-8'?>
<dataDocument xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://www.fpml.org/FpML-5/confirmation" fpmlVersion="http://www.fpml.org/FpML-5/confirmation ../../fpml-main-5-6.xsd http://www.w3.org/2000/09/xmldsig# ../../xmldsig-core-schema.xsd">
  <trade/>
  <party id="party1">
    <partyID>PARTYAUS33</partyID>
    <partyName>Party A</partyName>
  </party>
  <party id="party2">
    <partyID>BARCGB2L</partyID>
    <partyName>Party B</partyName>
  </party>
</dataDocument>
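Alternatively, lxml's ElementTree objects have a write() method that handles the file for you; a minimal sketch, assuming the doc object from the code above (the output file name is just an example):
doc.write('out.xml', xml_declaration=True, encoding='UTF-8', pretty_print=True)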