BS4 How to get the text from [<td >1</td>] - beautifulsoup

I'm having trouble getting the text out of [<td >1</td>]. I have the code below:
from bs4 import BeautifulSoup
url = "C:\\local.html"
page = open(url)
soup = BeautifulSoup(page.read(), "html.parser")
NoFall = soup.find_all("div", {"id":"Kendo_Table1535711951642"})
for AuxCopy in NoFall:
    product = AuxCopy.find('table').find_all('tr')
    for tr in product:
        # find_all returns a ResultSet (a list of tags), not a single tag
        td = tr.find_all('td', {"data-label":"Fallen Behind Days"})
        print(td)
The code is giving me the below output from the HTML
[]
[]
[<td class="ng-binding ng-scope nowrap" data-colid="FallenBehindDays" data-label="Fallen Behind Days" title="1">1</td>]
[<td class="ng-binding ng-scope nowrap" data-colid="FallenBehindDays" data-label="Fallen Behind Days" title="0">0</td>]
[<td class="ng-binding ng-scope nowrap" data-colid="FallenBehindDays" data-label="Fallen Behind Days" title="6">6</td>]
[<td class="ng-binding ng-scope nowrap" data-colid="FallenBehindDays" data-label="Fallen Behind Days" title="1">1</td>]
If I use td.text to get the numbers between <td > and </td>, I get the error below:
AttributeError: ResultSet object has no attribute 'text'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?
I'm trying to get only the numbers listed as follows
1
0
6
1
or to find the largest number in the list, in this case 6.

I got what I wanted by doing the following and the output is 6
page = open(url)
soup = BeautifulSoup(page.read(), "html.parser")
NoFall = soup.find_all("div", {"id":"Kendo_Table1535711951642"})
max_value = None
for AuxCopy in NoFall:
    product = AuxCopy.find('table').find_all('tr')
    for tr in product:
        td = tr.find('td', {"data-label":"Fallen Behind Days"})
        if td is not None:
            if max_value is None or int(td.text) > max_value:
                max_value = int(td.text)
print(max_value)
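A shorter variant of the same idea, as a minimal sketch: collect every matching cell under the container in one find_all call, convert the text with int, and let max pick the largest. The inline HTML is a hypothetical stand-in for the local file, reduced to the cells shown in the question's output.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for C:\local.html; only the relevant cells are reproduced.
html = """
<div id="Kendo_Table1535711951642">
  <table>
    <tr><th>Fallen Behind Days</th></tr>
    <tr><td data-label="Fallen Behind Days" title="1">1</td></tr>
    <tr><td data-label="Fallen Behind Days" title="0">0</td></tr>
    <tr><td data-label="Fallen Behind Days" title="6">6</td></tr>
    <tr><td data-label="Fallen Behind Days" title="1">1</td></tr>
  </table>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
container = soup.find("div", {"id": "Kendo_Table1535711951642"})

# Search the whole container at once instead of row by row.
cells = container.find_all("td", {"data-label": "Fallen Behind Days"})

# get_text(strip=True) returns the cell text without surrounding whitespace.
values = [int(cell.get_text(strip=True)) for cell in cells]

print(values)
print(max(values))
```

Header rows simply contribute no matching td, so no None check is needed here. If the real page can contain empty or non-numeric cells, the int conversion would need a guard.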

Related

Beautiful Soup: How to get timestamp inside td

How do I get the value from 'data-timestamp' and convert it into an integer using BeautifulSoup? I'm iterating through each row on a website (each row is a tr element).
So if I were to set up the code as:
ratings = []
rows = soup.select('tbody tr')
for row in rows:
    'insert code here'
    ratings.append(rating)
However, I can't seem to access the value in the data-timestamp. I've tried using attrs but I'm assuming I'm doing it wrong. Any help would be much appreciated.
<td data-timestamp="4.5833333333333" class="hide-on-hover fill-space relative">
<div class="col border-box text-center nowrap row large-up-text-right padding-horz-small push">
This should give you the string value:
[...]
for row in rows:
    data_timestamp_str = row.find("td")['data-timestamp']
[...]
You can convert the string to an integer with int(data_timestamp_str), but note that in your example data this wouldn't work, because the value of data-timestamp is 4.583333333333, which is not an integer.
Access the attribute on the tag using [], then round it to two decimal places, for example:
from bs4 import BeautifulSoup
html_doc = """<td data-timestamp="4.5833333333333" class="hide-on-hover fill-space relative">
<div class="col border-box text-center nowrap row large-up-text-right padding-horz-small push">"""
soup = BeautifulSoup(html_doc, 'html.parser')
ratings = []
rows = soup.select('td')
for row in rows:
    # Read the attribute from the current cell rather than re-selecting the first <td> each time
    ratings.append(round(float(row['data-timestamp']), 2))
print(*ratings)
Outputs:
4.58
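For the original row-by-row loop, a minimal sketch along the same lines: look up the cell that actually carries data-timestamp in each row, skip rows without one, and go through float before int. The table below is hypothetical, and note that int truncates rather than rounds.

```python
from bs4 import BeautifulSoup

# Hypothetical table: two rows carry data-timestamp, one does not.
html = """
<table><tbody>
  <tr><td data-timestamp="4.5833333333333">a</td></tr>
  <tr><td>no timestamp here</td></tr>
  <tr><td data-timestamp="7">b</td></tr>
</tbody></table>
"""

soup = BeautifulSoup(html, "html.parser")

ratings = []
for row in soup.select("tbody tr"):
    # attrs={"data-timestamp": True} matches any <td> that has the attribute.
    cell = row.find("td", attrs={"data-timestamp": True})
    if cell is None:
        continue  # this row has no timestamp cell
    # int("4.58...") would raise ValueError, so convert via float first.
    ratings.append(int(float(cell["data-timestamp"])))

print(ratings)
```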

Beautiful Soup - How to find tags after a specific item in HTML?

I need to find tags that come after a specific item on a website. Is there a way to skip the tag objects up to that item and then find the ones matching given criteria? I need all p with class XYZ after the div with class ABC.
response = requests.get(url).text
soup = BeautifulSoup(response)
items = soup.find_all('p', {'class': 'MessageTextSize js-message-text message-text'}) # only return the ones after the div with class of "Text 2"
Edit: Below is a sample block taken from the response. The aim is to find the last two paragraphs (Text 3 and Text 4), even though the first one (Text 1) has the same p class as they do. So I need find_all to match that class (MessageTextSize js-message-text message-text) only after Text 2.
<div class="js-message-text-container">
<p class="MessageTextSize js-message-text message-text" data-aria-label-part="0">Text 1</p>
</div>
<div class="js-message-text-container">
<p class="MessageTextSize MessageTextSize--jumbo js-message-text message-text" data-aria-label-part="0">Text 2</p>
</div>
<div class="js-message-text-container">
<p class="MessageTextSize js-message-text message-text" data-aria-label-part="0">Text 3</p>
</div>
<div class="js-message-text-container">
<p class="MessageTextSize js-message-text message-text" data-aria-label-part="0">Text 4</p>
</div>
p.s. bs4 version is 4.8.1, which is the latest release.
You can always use a custom function (or a lambda expression) inside find_all. The following is self-explanatory (IMO).
result = soup.find_all(
    lambda x: x.name == 'p' and
    'XYZ' in x.get('class', '') and
    x.find_previous('div', class_='ABC')
)
Example
from bs4 import BeautifulSoup
html = """
<p class="XYZ">Text 1</p>
<p class="XYZ">Text 2</p>
<div class="ABC"></div>
<p class="XYZ">Text 3</p>
<p class="XYZ">Text 4</p>
"""
soup = BeautifulSoup(html, 'html.parser')
result = soup.find_all(
    lambda x: x.name == 'p' and
    'XYZ' in x.get('class', '') and
    x.find_previous('div', class_='ABC')
)
print(result)
Output
[<p class="XYZ">Text 3</p>, <p class="XYZ">Text 4</p>]
EDIT
MessageTextSize js-message-text message-text represents three classes, not one.
x.get('class', '') returns a list of classes -
['MessageTextSize', 'js-message-text', 'message-text']
In your particular case, you have to target a p tag not a div, if I understood correctly.
So, you have to use
result = soup.find_all(
    lambda x: x.name == 'p' and
    'MessageTextSize js-message-text message-text' in ' '.join(x.get('class', ''))
    and x.find_previous('p', class_='MessageTextSize MessageTextSize--jumbo js-message-text message-text')
)
Ref:
find_previous()
Function as filter
If I understand you correctly, this should work:
item = soup.select_one('p[class*="MessageTextSize--jumbo"]')
sibs = item.parent.find_next_siblings()
for sib in sibs:
    print(sib.text.strip())
Output:
Text 3
Text 4
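Another option, sketched here under the assumption that the jumbo paragraph is the marker: find_all_next walks everything that follows a tag in document order, so it does not depend on the paragraphs being wrapped in sibling containers. The HTML is the question's sample with the data-aria-label-part attributes left out.

```python
from bs4 import BeautifulSoup

html = """
<div class="js-message-text-container">
<p class="MessageTextSize js-message-text message-text">Text 1</p>
</div>
<div class="js-message-text-container">
<p class="MessageTextSize MessageTextSize--jumbo js-message-text message-text">Text 2</p>
</div>
<div class="js-message-text-container">
<p class="MessageTextSize js-message-text message-text">Text 3</p>
</div>
<div class="js-message-text-container">
<p class="MessageTextSize js-message-text message-text">Text 4</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# A single class name passed to class_ matches tags that have it among their classes.
marker = soup.find("p", class_="MessageTextSize--jumbo")

# Every matching <p> that appears after the marker in document order.
after = marker.find_all_next("p", class_="message-text")
texts = [p.get_text(strip=True) for p in after]

print(texts)
```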

Find multiple tags with condition

Is it possible to find multiple tags with a condition?
<a href = "/img/something.jpg">
<img src= "/img/somethingelse.png">
Could I say
Find all "a" and "img" tags containing "/img/"
Yes, just supply a function (it can be a lambda) to the find_all() method:
data = """<a href = "/img/something.jpg">
<img src= "/img/somethingelse.png">"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(data, 'lxml')
# Parenthesize the two attribute checks so the tag-name test applies to both of them
for tag in soup.body.find_all(lambda t: t.name in ('a', 'img') and (
        ('href' in t.attrs and '/img/' in t['href']) or
        ('src' in t.attrs and '/img/' in t['src']))):
    print(tag.name, tag.attrs)
    print('*' * 80)
Outputs:
a {'href': '/img/something.jpg'}
********************************************************************************
img {'src': '/img/somethingelse.png'}
********************************************************************************
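The same condition can also be written as a CSS selector, which some readers find easier to scan. This is a minimal sketch: the sample is a slightly extended, hypothetical version of the question's markup (the first link is closed and a non-matching link is added), and select relies on the selector support bundled with recent Beautiful Soup 4 releases.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: the question's two tags plus one link that should not match.
data = """<a href="/img/something.jpg">link</a>
<img src="/img/somethingelse.png">
<a href="/other/page.html">other</a>"""

soup = BeautifulSoup(data, "html.parser")

# [attr*="value"] matches when the attribute contains the substring.
matches = soup.select('a[href*="/img/"], img[src*="/img/"]')
found = [(tag.name, tag.get("href") or tag.get("src")) for tag in matches]

print(found)
```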

Slicing an HTML file into a pandas DataFrame while preserving the parent-child relationship of div tags

I'm trying to cut an HTML file into a DataFrame while preserving the parent-child relationship between div tags.
For instance:
<div class="ddemrcontentitem ddremovable" dd:entityid="0" id="_5C026969-71BA-456E-A183-BC923BAB9E99" style="clear: both;" xmlns:dd="DynamicDocumentation">Orders:
<div style="padding-left: 8px;">
<div class="ddemrcontentitem ddremovable" dd:contenttype="NONMEDORDERS" dd:entityid="251406974" id="_57B1A3DC-1899-4752-9516-6F137BBE1C8F">CBC w/ Auto Diff</div>
<div class="ddemrcontentitem ddremovable" dd:contenttype="NONMEDORDERS" dd:entityid="251389861" id="_0A418835-4384-4ACC-A4FD-3C901539DADB">Hygiene Activity</div>
<div class="ddemrcontentitem ddremovable" dd:contenttype="NONMEDORDERS" dd:entityid="251389598" id="_5D06090F-7330-49B1-BB53-28496388E8C1">Regular Diet</div>
<div class="ddemrcontentitem ddremovable" dd:contenttype="NONMEDORDERS" dd:entityid="251407213" id="_0D683EC1-4D18-45F4-BD52-0451DDA3BF5A">Sodium Level</div>
<div class="ddemrcontentitem ddremovable" dd:contenttype="NONMEDORDERS" dd:entityid="251410812" id="_82ACC1FF-DA2E-472C-BA0F-E881293BDCBA">Sodium Level</div>
</div>
Orders should be the parent of each of (CBC w/ Auto Diff, Regular Diet, Sodium Level, Sodium Level) in a dictionary or a DataFrame.
This is my failed attempt:
import pandas as pd
import bs4
'''i imported the file- parsed html using bs4 package
made a list of the div tags and made 2 dictionary too
one with the text and one with the full tags and text
then made tables of them (pandas dataframes)'''
alpha = open('D://python/893714319.00.html','r')
beta = bs4.BeautifulSoup(alpha, 'lxml')
lister = []
fulllister = []
listerer = {}
mydivs = beta.findAll('div')
for div in mydivs:
    lister.append(div.text)
    fulllister.append(div.contents)
listerer = {k:v for v,k in enumerate(lister)}
fulllisterer = {k:v for k,v in enumerate(fulllister)}
listerer = sorted(listerer.items(), key=lambda x: x[1])
fulllisterer = sorted(fulllisterer.items(), key = lambda x:x[1])
listerer = pd.DataFrame(listerer)
fulllisterer = pd.DataFrame(fulllisterer)
listerer.dropna( inplace='True',how='any')
fulllisterer.dropna(axis=1, inplace='True',how='any')
'''trying to characterize the string that is parent and what is child
by counting <div> in it but this is not working , i don't know why
by parent i mean 'orders' and the children would be 'cbc' and so
'''
fulllisterer['divier']= ""
fulllisterer['count']= 0
for string in fulllisterer[1].iteritems():
    fulllisterer['count']=string.count('<div>')
    if string.count('<div>')>1:
        fulllisterer['divier'] = fulllisterer[1]
the output would look like:
<html>
<body>
<table>
<th>parent</th>
<th>child</th>
<tr>
<td>orders</td>
<td>CBC w/ Auto Diff</td>
</tr>
<tr>
<td>orders</td>
<td> Hygiene Activity</td>
</tr>
<tr>
<td>orders</td>
<td> Regular Diet</td>
</tr>
<tr>
<td>orders</td>
<td>Sodium Level</td>
</tr>
<tr>
<td>orders</td>
<td>Sodium Level</td>
</tr>
</table>
</body></html>
I think you were just over-engineering this. The following code, adapted from your snippet, should do:
import pandas as pd
import bs4
# Same input file as in the question
with open('D://python/893714319.00.html', 'r') as alpha:
    beta = bs4.BeautifulSoup(alpha, 'lxml')
mydivs = beta.find_all('div')
lister = []
for div in mydivs:
    lister.append(div.text)
# The first div is the parent; its text holds the parent label, then one child per line
data_list = lister[0].split('\n')
data_list = [el.strip().replace(':', '') for el in data_list if el.strip() != '']
print(pd.DataFrame({'parent': data_list[0], 'child': data_list[1:]}))
Now you just need to make sure this is called for each parent div tag in place of lister[0].
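One way to cover every parent without splitting text on newlines is to walk the tree itself. The sketch below assumes, as in the sample, that both parents and children carry the ddemrcontentitem class and that a child's parent is its nearest ancestor with that class. The HTML is a shortened, hypothetical version of the question's markup with simplified ids.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical version of the question's markup.
html = """
<div class="ddemrcontentitem ddremovable" id="parent-1">Orders:
  <div style="padding-left: 8px;">
    <div class="ddemrcontentitem ddremovable" id="c1">CBC w/ Auto Diff</div>
    <div class="ddemrcontentitem ddremovable" id="c2">Hygiene Activity</div>
    <div class="ddemrcontentitem ddremovable" id="c3">Regular Diet</div>
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

rows = []
for item in soup.find_all("div", class_="ddemrcontentitem"):
    # Nearest enclosing content item, if any, is treated as the parent.
    parent = item.find_parent("div", class_="ddemrcontentitem")
    if parent is None:
        continue  # top-level item: it is a parent, not a child
    # The parent's label is its first piece of text, e.g. "Orders:".
    label = next(parent.stripped_strings).rstrip(":")
    rows.append({"parent": label, "child": item.get_text(strip=True)})

print(rows)
```

The resulting list of dicts can be handed to pd.DataFrame(rows) to get the parent and child columns.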

Change value in onclick web page option using VBA (IE 11)

I'm new to the VBA DOM object model and I'm trying to select a specific option in a web page. Unfortunately, that option has an ID that is the same as the other options on the page, except for the value in the inner function.
Here is the web code:
<table class="stati_check" id="ctl00_NetSiuCPH_ctl25">
<tr>
<td>
<span class="check">
<INPUT onclick="if (!boxWorkflow_rbSelect(this)) return;setTimeout('__doPostBack(\'ctl00$NetSiuCPH$ctl25$ctl00$WorkflowState\',\'\')', 0)" tabIndex=0
id=ctl00_NetSiuCPH_ctl25_ctl00_WorkflowState type=radio
value=ONVALIDAPROOF name=ctl00$NetSiuCPH$ctl25$ctl00$WorkflowState>
</span>
<td>
<span class="check">
<INPUT onclick="if (!boxWorkflow_rbSelect(this)) return;setTimeout('__doPostBack(\'ctl00$NetSiuCPH$ctl25$ctl01$WorkflowState\',\'\')', 0)" tabIndex=0
id=ctl00_NetSiuCPH_ctl25_ctl01_WorkflowState type=radio
value=ONANNULA1 name=ctl00$NetSiuCPH$ctl25$ctl01$WorkflowState>
</span>
</td>
</tr>
</table>
As can be seen, the two IDs are the same, so when I run the VBA code line:
Set pdr_button = ie.Document.getElementById("ctl00_NetSiuCPH_ctl25_ctl00_WorkflowState").click
only the first option is selected, but I'm trying to select the one with the value "ONANNULA1".
I've tried removeAttribute/setAttribute:
Set Annulla_Button = ie.Document.getElementById("ctl00_NetSiuCPH_ctl25_ctl00_WorkflowState")
Annulla_Button.removeAttribute ("value")
Annulla_Button.setAttribute ("value"), "ONANNULLA1"
Annulla_Button.Click
However, the result is that nothing is selected.
Can someone help me?
Thanks in advance for your patience.
This strip of code solves it:
Set TABELLA_STATI = ie.Document.getElementById("ctl00_NetSiuCPH_ctl25").getElementsByTagName("TR")
For Each tr In TABELLA_STATI
    Set coltd = tr.getElementsByTagName("TD")
    For Each td In coltd
        Set option_Button = td.getElementsByTagName("INPUT")
        Option_Value = option_Button(0).Value
        If InStr(1, Option_Value, "ONANNUL") > 0 Then
            option_Button(0).Click
            Exit For
        End If
    Next td
Next tr