Searching an XML entry with BeautifulSoup

I know there are many questions on this topic, but none of them answered mine.
Can someone help me with the following?
My XML file looks like this:
...
<object>
  <acronym>Mmachc</acronym>
  <alias-tags>1810037K07Rik RP23-177C18.3</alias-tags>
  <chromosome-id>49</chromosome-id>
  <ensembl-id nil="true"/>
  <entrez-id>67096</entrez-id>
  <genomic-reference-update-id>491928275</genomic-reference-update-id>
  <homologene-id>12082</homologene-id>
  <id>42939</id>
  <legacy-ensembl-gene-id nil="true"/>
  <name>
    methylmalonic aciduria cblC type, with homocystinuria
  </name>
  <organism-id>2</organism-id>
  <original-name>
  </original-name>
  <original-symbol>Mmachc</original-symbol>
  <reference-genome-id nil="true"/>
  <sphinx-id>95240</sphinx-id>
  <version-status>no change</version-status>
</object>
<object>
...
Now, if I want to find the object that contains e.g. the entrez-id 67096 and look up which acronym it has, how do I do that? I tried this first:
url = "http://api.brain-map.org/api/v2/data/query.xml?num_rows=10000&start_row=10001&&criteria=model::Gene,rma::criteria,products[abbreviation$eq%27Mouse%27]"
req = requests.get(url)
doc = req.text
root = etree.XML(doc)
soup = BeautifulSoup(doc)
dict1 = {}
for object in soup.find_all('object'):
dict1[object.find('entrez-id') == 67096]
The output of that is KeyError: False. Can someone help me with this?
I also get the same KeyError if I search for the string '67096' instead.

You don't really need BeautifulSoup here. (As an aside, the KeyError happens because object.find('entrez-id') == 67096 evaluates to False, so dict1[...] looks up the key False in an empty dict.) Since you already parsed the document with lxml, just try something like:
target = root.xpath('//entrez-id[.="67096"]/preceding-sibling::acronym/text()')
target[0]
Output:
'Mmachc'
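If you do want to stay in BeautifulSoup, here is a minimal sketch of the same lookup (assuming lxml is installed so the lxml-xml parser is available; the inline snippet just mirrors one <object> from the question):
from bs4 import BeautifulSoup

doc = '''
<object>
  <acronym>Mmachc</acronym>
  <entrez-id>67096</entrez-id>
</object>
'''
soup = BeautifulSoup(doc, 'lxml-xml')
# find the <entrez-id> tag whose text is exactly "67096",
# then read its earlier sibling <acronym>
tag = soup.find('entrez-id', string='67096')
if tag is not None:
    print(tag.find_previous_sibling('acronym').get_text())  # Mmachc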

Related

Problem Replacing <br> Tags with Newline Using bs4

Problem: I cannot replace <br> tags with a newline character using Beautiful Soup 4.
Code: My program (the relevant portion of it) currently looks like
for br in board.select('br'):
    br.replace_with('\n')
but I have also tried board.find_all() in place of board.select().
Results: When I use br.replace_with('\n'), all <br> tags are replaced with the string literal \n. For example, <p>Hello<br>world</p> ends up becoming Hello\nworld. Using br.replace_with(\n) causes the error
File "<ipython-input-27-cdfade950fdf>", line 10
br.replace_with(\n)
^
SyntaxError: unexpected character after line continuation character
Other Information: I am using a Jupyter Notebook, if that is of any relevance. Here is my full program, as there may be some issue elsewhere that I have overlooked.
import requests
from bs4 import BeautifulSoup
import pandas as pd

page = requests.get("https://boards.4chan.org/g/")
soup = BeautifulSoup(page.content, 'html.parser')
board = soup.find('div', class_='board')
for br in board.select('br'):
    br.replace_with('\n')
message = [obj.get_text() for obj in board.select('.opContainer .postMessage')]
image = [obj['href'] for obj in board.select('.opContainer .fileThumb')]
pid = [obj.get_text() for obj in board.select('.opContainer .postInfo .postNum a[title="Reply to this post"]')]
time = [obj.get_text() for obj in board.select('.opContainer .postInfo .dateTime')]
for x in range(len(image)):
    image[x] = "https:" + image[x]
post = pd.DataFrame({
    "ID": pid,
    "Time": time,
    "Image": image,
    "Message": message,
})
post
pd.options.display.max_rows
pd.set_option('display.max_colwidth', -1)
display(post)
Any advice would be appreciated. Thank you for reading.
I just tried it and it works for me; my bs4 version is 4.8.0 and I am using Python 3.5.3.
Example:
from bs4 import BeautifulSoup

# parser pinned to lxml so the asserts below hold
soup = BeautifulSoup('hello<br>world', 'lxml')
for br in soup('br'):
    br.replace_with('\n')
# <br> was replaced with \n successfully
assert str(soup) == '<html><body><p>hello\nworld</p></body></html>'
# get_text() also works as expected
assert soup.get_text() == 'hello\nworld'
# it is a \n not a \\n
assert soup.get_text() != 'hello\\nworld'
I am not used to working with Jupyter Notebook, but it seems that your problem is that whatever you use to visualize the data is showing you the string representation instead of actually printing the string.
Hope this helps,
Regards,
adb
Instead of replacing after converting to soup, try replacing the <br> tags before converting. Like,
soup = BeautifulSoup(page.text.replace('<br>', '\n'), 'html.parser')
(Note the use of page.text rather than str(page.content): calling str() on a bytes object would produce a b'...' literal.)
Hope this helps! Cheers!
P.S.: I could not find any logical reason why this does not work after converting to soup.
After experimenting with variations of
page = requests.get("https://boards.4chan.org/g/")
str_page = page.content.decode()
str_split = '\n<'.join(str_page.split('<'))
str_split = '>\n'.join(str_split.split('>'))
str_split = str_split.replace('\n', '')
str_split = str_split.replace('<br>', ' ')
soup = BeautifulSoup(str_split.encode(), 'html.parser')
for the better part of two hours, I have determined that the pandas DataFrame prints the newline character as a string literal. Everything else indicates that the program is working as intended, so I assume this has been the problem all along.
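A quick way to confirm that the cell really contains a newline and that only the DataFrame repr escapes it (a minimal sketch):
import pandas as pd

df = pd.DataFrame({"Message": ["hello\nworld"]})
print(df)                    # the table renders the cell as hello\nworld
print(df.loc[0, "Message"])  # prints on two lines: the newline is real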
For some reason, a direct replace with a newline does not work for you in bs4; you can first replace with some other unique character sequence (preferably one that cannot occur in the text) and then replace that sequence with a newline in the extracted text.
Try this:
for br in soup.find_all('br'):
    br.replace_with('+++')
text = soup.get_text().replace('+++', '\n')

Getting a value in a link for later usage in Selenium

I have a link on my webpage which I need to get a value from and save for later use (constructing a direct URL).
The HTML link I want to obtain the value from looks like this:
<a ng-bind="saving.customerContractName || (saving| savingscontract:$parent.$parent.cmsData) " ng-attr-target="{{(saving.type === 'ASK') ? '_blank' : undefined}}" ng-href="/lpn/mo/Logon.action?avtalenummer=176742" class="ng-binding" target="" href="/lpn/mo/Logon.action?avtalenummer=176742">Fondskonto Link (176742)</a>
The value I need to obtain is 176742.
Any tips on how to extract this value? And how to then use it in a direct URL call, (something) like this:
String url2 = "https://www2-t.storebrand.no/ppjs/#/savings/index/THE_VALUE_HERE";
driver.get(url2);
This might work:
txt = driver.find_element_by_partial_link_text("Fondskonto Link").get_attribute("href").split("=")[1]
url = "https://www2-t.storebrand.no/ppjs/#/savings/index/%s" % txt
driver.get(url)
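If the href ever carries more than one query parameter, splitting on "=" may grab the wrong piece. A slightly more defensive sketch using a regular expression (same assumed avtalenummer parameter and Selenium driver as above):
import re

href = driver.find_element_by_partial_link_text("Fondskonto Link").get_attribute("href")
match = re.search(r"avtalenummer=(\d+)", href)  # capture the digits after avtalenummer=
if match:
    driver.get("https://www2-t.storebrand.no/ppjs/#/savings/index/" + match.group(1))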

Get links in Beautiful Soup

I'm trying to parse the following links in Beautiful Soup and I'm not exactly sure what the best way of doing this is. Any suggestions would be greatly appreciated.
Thanks
If anyone is ever interested, I figured out how to do this:
import requests
from fnmatch import fnmatch
from bs4 import BeautifulSoup

xml = requests.get("http://www.realclearpolitics.com/epolls/2010/governor/2010_elections_governor_map.html").text

def find_governor_races(html):
    soup = BeautifulSoup(html, 'html.parser')
    pattern = "http://www.realclearpolitics.com/epolls/????/governor/??/*-*.html"
    links = []
    for option in soup.find_all('option'):
        links.append(option['value'])
    matched_links = []
    for link in links:
        if fnmatch(link, pattern):
            matched_links.append(link)
    return matched_links
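Calling it would then look something like this (a sketch; xml is the page source fetched above):
races = find_governor_races(xml)
for link in races:
    print(link)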

xpath filter on attribute text

Hi, I have an XML file like the one below:
<Pods>
  <item>
    <URL>data/_data/2014/09/11/pods/10057-1837887-2965978-0.pdf</URL>
    <RunDate>11/09/2014</RunDate>
    <DateSigned>11/09/2014 09:13:49
    </DateSigned>
  </item>
  <item>
    <URL>data/_data/2014/09/11/pods/10057-0-2965978-0-scan.pdf</URL>
    <DateSigned>Not signed</DateSigned>
  </item>
</Pods>
I would like to get the contents of the <URL> where <DateSigned> is not equal to "Not Signed".
I have tried
Dim URLNode As XmlNodeList = doc.SelectNodes("//ITEM[DateSigned=Not Signed]/URL")
but this says invalid token, and I am unsure what I have done wrong.
Thanks for any help.
You simply need to wrap Not signed in single quotes so it is recognized as a string value. Also, XML is case-sensitive; use the correct casing:
Dim URLNode As XmlNodeList = _
doc.SelectNodes("//item[DateSigned='Not signed']/URL")
or, if you really meant to get <item> elements where <DateSigned> does not equal Not signed:
Dim URLNode As XmlNodeList = _
doc.SelectNodes("//item[DateSigned != 'Not signed']/URL")

Generate xml documents using lxml and vary element text and attributes based on logic

I have my lxml code like this
from lxml import etree
import sys

fd = open('D:\\text.xml', 'wb')
xmlns = "http://www.fpml.org/FpML-5/confirmation"
xsi = "http://www.w3.org/2001/XMLSchema-instance"
fpmlVersion = "http://www.fpml.org/FpML-5/confirmation ../../fpml-main-5-6.xsd http://www.w3.org/2000/09/xmldsig# ../../xmldsig-core-schema.xsd"

page = etree.Element("{" + xmlns + "}dataDocument", nsmap={None: xmlns, 'xsi': xsi})
doc = etree.ElementTree(page)
page.set("fpmlVersion", fpmlVersion)

trade = etree.SubElement(page, 'trade')
tradeheader = etree.SubElement(trade, 'tradeheader')
partyTradeIdentifier = etree.SubElement(tradeheader, 'partyTradeIdentifier')
partyReference = etree.SubElement(partyTradeIdentifier, 'partyReference', href='party1')
tradeId = etree.SubElement(partyTradeIdentifier, 'tradeId', tradeIdScheme='http://www.partyA.com/swaps/trade-id')
tradeId.text = 'TW9235'
swap = etree.SubElement(trade, 'swap')

party = etree.SubElement(page, 'party', id='party1')
partyID = etree.SubElement(party, 'partyID')
partyID.text = 'PARTYAUS33'
partyName = etree.SubElement(party, 'partyName')
partyName.text = 'Party A'

party = etree.SubElement(page, 'party', id='party2')
partyID = etree.SubElement(party, 'partyID')
partyID.text = 'BARCGB2L'
partyName = etree.SubElement(party, 'partyName')
partyName.text = 'Party B'

s = etree.tostring(doc, xml_declaration=True, encoding="UTF-8", pretty_print=True)
print(s)
fd.write(s)
And I need to generate an XML file like this:
<?xml version='1.0' encoding='UTF-8'?>
<dataDocument xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://www.fpml.org/FpML-5/confirmation" fpmlVersion="http://www.fpml.org/FpML-5/confirmation ../../fpml-main-5-6.xsd http://www.w3.org/2000/09/xmldsig# ../../xmldsig-core-schema.xsd">
  <trade>
    <tradeheader>
      <partyTradeIdentifier>
        <partyReference href="party1"/>
        <tradeId tradeIdScheme="http://www.partyA.com/swaps/trade-id">TW9235</tradeId>
      </partyTradeIdentifier>
    </tradeheader>
    <swap/>
  </trade>
  <party id="party1">
    <partyID>PARTYAUS33</partyID>
    <partyName>Party A</partyName>
  </party>
  <party id="party2">
    <partyID>BARCGB2L</partyID>
    <partyName>Party B</partyName>
  </party>
</dataDocument>
Now, the above code works. However, I need to generate 10k such files where the element text or attributes vary.
For example, the partyID may differ, e.g. PARTYGER45 instead of PARTYAUS33. Is there a clean way to do this instead of hard-coding it?
Similarly, I need to vary a lot of other things, like the tradeId TW9235.
One way could be to load an output XML without the values into lxml objectify, then loop, setting the relevant values and writing each result to a file; meaning something like:
from lxml import etree, objectify

for pId in ['PARTYGER45', ...]:
    dataDocument = objectify.parse('in.xml')
    dataDocument.getroot().party.partyID._setText(pId)
    ...
    obj_xml = etree.tostring(dataDocument, xml_declaration=True, encoding='UTF-8')
    with open('out_%s.xml' % pId, 'wb') as f_out:
        f_out.write(obj_xml)
Another way might be to use lxml and XSLT: again, start from an empty structured XML and transform it according to your needs.
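A third option, staying with the plain etree code from the question: wrap the document construction in a function and pass the varying values in as parameters. A sketch reduced to a couple of fields (the variants list here is made up; in practice it could come from a CSV file or a database):
from lxml import etree

XMLNS = "http://www.fpml.org/FpML-5/confirmation"

def q(name):
    # qualify a tag name with the FpML default namespace
    return "{%s}%s" % (XMLNS, name)

def build_document(party_id, trade_id):
    # build one document shaped like the question's, with the varying values injected
    page = etree.Element(q("dataDocument"), nsmap={None: XMLNS})
    trade = etree.SubElement(page, q("trade"))
    tradeheader = etree.SubElement(trade, q("tradeheader"))
    pti = etree.SubElement(tradeheader, q("partyTradeIdentifier"))
    trade_id_el = etree.SubElement(pti, q("tradeId"),
                                   tradeIdScheme="http://www.partyA.com/swaps/trade-id")
    trade_id_el.text = trade_id
    party = etree.SubElement(page, q("party"), id="party1")
    etree.SubElement(party, q("partyID")).text = party_id
    return etree.ElementTree(page)

variants = [("PARTYAUS33", "TW9235"), ("PARTYGER45", "TW9236")]  # made-up sample values
for i, (pid, tid) in enumerate(variants):
    build_document(pid, tid).write("out_%d.xml" % i, xml_declaration=True,
                                   encoding="UTF-8", pretty_print=True)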