Beautiful Soup: How to get timestamp inside td - beautifulsoup

How do I get the value of 'data-timestamp' and convert it into an integer using BeautifulSoup? I'm iterating through each row on a website (each row is a tr element).
Suppose I set up the code as
ratings = []
rows = soup.select('tbody tr')
for row in rows:
    'insert code here'
    ratings.append(rating)
However, I can't seem to access the value in the data-timestamp. I've tried using attrs but I'm assuming I'm doing it wrong. Any help would be much appreciated.
<td data-timestamp="4.5833333333333" class="hide-on-hover fill-space relative">
<div class="col border-box text-center nowrap row large-up-text-right padding-horz-small push">

This should give you the string value:
[...]
for row in rows:
    data_timestamp_str = row.find("td")['data-timestamp']
[...]
You can convert the string to an integer with int(data_timestamp_str), but note that this won't work for your example data: the value of data-timestamp is 4.5833333333333, which is not an integer, so int() raises a ValueError. Use float(data_timestamp_str) instead.

Access the attribute using [], then round it to two decimal places, for example:
from bs4 import BeautifulSoup
html_doc = """<td data-timestamp="4.5833333333333" class="hide-on-hover fill-space relative">
<div class="col border-box text-center nowrap row large-up-text-right padding-horz-small push">"""
soup = BeautifulSoup(html_doc, 'html.parser')
ratings = []
rows = soup.select('td')
for row in rows:
    # Read the attribute from the current cell, not from the first td in the document
    ratings.append(round(float(row['data-timestamp']), 2))
print(*ratings)
Outputs:
4.58
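A minimal sketch that ties the two answers back to the original `tbody tr` loop. The sample table is made up for illustration, and skipping rows that have no `data-timestamp` cell is an assumption about the desired behaviour, not something the question states.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup; the real page will differ.
html = """
<table><tbody>
<tr><td data-timestamp="4.5833333333333" class="hide-on-hover">a</td></tr>
<tr><td class="hide-on-hover">no timestamp here</td></tr>
<tr><td data-timestamp="3">b</td></tr>
</tbody></table>
"""
soup = BeautifulSoup(html, "html.parser")

ratings = []
for row in soup.select("tbody tr"):
    # Pick the first cell in this row that actually carries the attribute.
    cell = row.select_one("td[data-timestamp]")
    if cell is None:
        continue  # assumption: rows without a timestamp are ignored
    ratings.append(float(cell["data-timestamp"]))

# int() on a float truncates toward zero; use round() if rounding is wanted.
as_ints = [int(r) for r in ratings]
print(ratings, as_ints)
```

Converting through float first avoids the ValueError that int() raises on a string such as "4.5833333333333".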

Related

BS4 How to get the text from [<td >1</td>]

I'm running into problems getting the text from [<td>1</td>]. I have the code below:
url = "C:\\local.html"
page = open(url)
soup = BeautifulSoup(page.read(), "html.parser")
NoFall = soup.find_all("div", {"id":"Kendo_Table1535711951642"})
for AuxCopy in NoFall:
    product = AuxCopy.find('table').find_all('tr')
    for tr in product:
        td = tr.find_all('td', {"data-label":"Fallen Behind Days"})
        print(td)
The code is giving me the below output from the HTML
[]
[]
[<td class="ng-binding ng-scope nowrap" data-colid="FallenBehindDays" data-label="Fallen Behind Days" title="1">1</td>]
[<td class="ng-binding ng-scope nowrap" data-colid="FallenBehindDays" data-label="Fallen Behind Days" title="0">0</td>]
[<td class="ng-binding ng-scope nowrap" data-colid="FallenBehindDays" data-label="Fallen Behind Days" title="6">6</td>]
[<td class="ng-binding ng-scope nowrap" data-colid="FallenBehindDays" data-label="Fallen Behind Days" title="1">1</td>]
If I use td.text to get the numbers between <td > </td> it gives me the below error
AttributeError: ResultSet object has no attribute 'text'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?
I'm trying to get only the numbers listed as follows
1
0
6
1
or find the largest number from the list in this case 6
I got what I wanted by doing the following and the output is 6
page = open(url)
soup = BeautifulSoup(page.read(), "html.parser")
NoFall = soup.find_all("div", {"id":"Kendo_Table1535711951642"})
max_value = None
for AuxCopy in NoFall:
    product = AuxCopy.find('table').find_all('tr')
    for tr in product:
        td = tr.find('td', {"data-label":"Fallen Behind Days"})
        if td is not None:
            if max_value is None or int(td.text) > max_value:
                max_value = int(td.text)
print(max_value)
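The same result can probably be reached more compactly with a CSS attribute selector and the built-in max(). This is only a sketch: the sample table below is a stand-in for the real page, and it assumes every matching cell contains a plain integer.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the Kendo table in the question.
html = """
<table>
<tr><th>Days</th></tr>
<tr><td data-label="Fallen Behind Days" title="1">1</td></tr>
<tr><td data-label="Fallen Behind Days" title="0">0</td></tr>
<tr><td data-label="Fallen Behind Days" title="6">6</td></tr>
<tr><td data-label="Fallen Behind Days" title="1">1</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# select() returns only the cells that match, so header rows are skipped.
cells = soup.select('td[data-label="Fallen Behind Days"]')
values = [int(td.get_text(strip=True)) for td in cells]

# default=None keeps max() from raising when no cell matched.
largest = max(values, default=None)
print(values, largest)
```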

How can I get a value from an attribute inside a tag

I have a soup object like:
<a class="love-action js-add-to-favorites" data-id="415953" data-price="715.00" href="#">
</a>
I did
soup = BeautifulSoup(src, 'lxml') # pass the variable to the soup
price = soup.find(class_='col-5 col-sm-4 col-lg-7 mob-position detail-top-actions').find('a', class_='love-action js-add-to-favorites')
print(price)
I'd like to get only: 715.00
How to fix?
You can access the attributes of a tag by treating it like a dictionary, so simply get the value of the data-price attribute with:
price['data-price']
Example based on your question
soup = BeautifulSoup(src, 'lxml') # pass the variable to the soup
price = soup.find(class_='col-5 col-sm-4 col-lg-7 mob-position detail-top-actions').find('a', class_='love-action js-add-to-favorites')
print(price['data-price'])
Output
715.00
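For completeness, a short sketch of the two dictionary-style access forms. It uses only the `a` tag from the question; `data-currency` is a hypothetical attribute that is deliberately absent, to show how a missing attribute behaves.

```python
from bs4 import BeautifulSoup

src = '<a class="love-action js-add-to-favorites" data-id="415953" data-price="715.00" href="#"></a>'
soup = BeautifulSoup(src, "html.parser")
link = soup.find("a", class_="love-action")

# Square brackets raise KeyError if the attribute is missing.
price_text = link["data-price"]

# .get() returns None (or a supplied default) instead of raising.
missing = link.get("data-currency")  # hypothetical attribute, not in the markup

# Attribute values are strings; convert explicitly when a number is needed.
price = float(price_text)
print(price_text, missing, price)
```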

Beautiful Soup - How to find tags after a specific item in HTML?

I need to find tags after a specific item on a website. So, is there a way to skip the tag objects until this specific one, then find the matching ones to given criteria? I need all p with class XYZ after the div with class ABC.
response = requests.get(url).text
soup = BeautifulSoup(response)
items = soup.find_all('p', {'class': 'MessageTextSize js-message-text message-text'}) # only return the ones after the div with class of "Text 2"
Edit: Below is a sample code block that is part of the response. The aim is to find the last two paragraphs (Text 3 & Text 4), even though the first one (Text 1) has the same p class as they do. So I need find_all to match the class MessageTextSize js-message-text message-text only after Text 2.
<div class="js-message-text-container">
<p class="MessageTextSize js-message-text message-text" data-aria-label-part="0">Text 1</p>
</div>
<div class="js-message-text-container">
<p class="MessageTextSize MessageTextSize--jumbo js-message-text message-text" data-aria-label-part="0">Text 2</p>
</div>
<div class="js-message-text-container">
<p class="MessageTextSize js-message-text message-text" data-aria-label-part="0">Text 3</p>
</div>
<div class="js-message-text-container">
<p class="MessageTextSize js-message-text message-text" data-aria-label-part="0">Text 4</p>
</div>
p.s. bs4 version is 4.8.1, which is the latest release.
You can always use a custom function (or a lambda expression) inside find_all. The following is self-explanatory (IMO).
result = soup.find_all(
    lambda x: x.name == 'p' and
    'XYZ' in x.get('class', '') and
    x.find_previous('div', class_='ABC')
)
Example
from bs4 import BeautifulSoup
html = """
<p class="XYZ">Text 1</p>
<p class="XYZ">Text 2</p>
<div class="ABC"></div>
<p class="XYZ">Text 3</p>
<p class="XYZ">Text 4</p>
"""
soup = BeautifulSoup(html, 'html.parser')
result = soup.find_all(
    lambda x: x.name == 'p' and
    'XYZ' in x.get('class', '') and
    x.find_previous('div', class_='ABC')
)
print(result)
Output
[<p class="XYZ">Text 3</p>, <p class="XYZ">Text 4</p>]
EDIT
MessageTextSize js-message-text message-text represents three classes, not one.
x.get('class', '') returns a list of classes -
['MessageTextSize', 'js-message-text', 'message-text']
In your particular case, you have to target a p tag not a div, if I understood correctly.
So, you have to use
result = soup.find_all(
    lambda x: x.name == 'p' and
    'MessageTextSize js-message-text message-text' in ' '.join(x.get('class', ''))
    and x.find_previous('p', class_='MessageTextSize MessageTextSize--jumbo js-message-text message-text')
)
Ref:
find_previous()
Function as filter
If I understand you correctly, this should work:
item = soup.select_one('p[class*="MessageTextSize--jumbo"]')
sibs = item.parent.find_next_siblings()
for sib in sibs:
    print(sib.text.strip())
Output:
Text 3
Text 4
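Another option, not mentioned in either answer, is to locate the marker element first and then call `find_all_next()` on it, which searches only the part of the document that comes after that element. The sketch below reuses the simplified XYZ/ABC markup from the first answer; the extra `other` paragraph is added purely for illustration.

```python
from bs4 import BeautifulSoup

html = """
<p class="XYZ">Text 1</p>
<div class="ABC"></div>
<p class="XYZ">Text 3</p>
<p class="other">skip</p>
<p class="XYZ">Text 4</p>
"""
soup = BeautifulSoup(html, "html.parser")

# Find the element that marks where the search should start.
marker = soup.find("div", class_="ABC")

# find_all_next() takes the same filters as find_all() but only looks
# at elements that appear after the marker in document order.
after = marker.find_all_next("p", class_="XYZ") if marker else []
texts = [p.get_text() for p in after]
print(texts)
```

This runs a single forward search from the marker rather than calling `find_previous()` once per candidate tag.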

compare the 'class' of container tag

Let's say I extract some classes from some HTML:
p_standards = soup.find_all("p",attrs={'class':re.compile(r"Standard|P3")})
for p_standard in p_standards:
    print(p_standard)
And the output looks like this:
<p class="P3">a</p>
<p class="Standard">b</p>
<p class="P3">c</p>
<p class="Standard">d</p>
And let's say I only wanted to print the text inside the P3 classes so that the output looks like:
a
c
I thought this code below would work, but it didn't. How can I compare the class name of the container tag to some value?
p_standards = soup.find_all("p",attrs={'class':re.compile(r"Standard|P3")})
for p_standard in p_standards:
    if p_standard.get("class") == "P3":
        print(p_standard.get_text())
I'm aware that in my first line, I could have simply done r"P3" instead of r"Standard|P3", but this is only a small fraction of the actual code (not the full story), and I need to leave that first line as it is.
Note: doing something like .find("p", class_ = "P3") only works for descendants, not for the container tag.
OK, so after playing around with the code, it turns out that
p_standard.get("class")[0] == "P3"
works. (I was missing the [0])
So this code works:
p_standards = soup.find_all("p",attrs={'class':re.compile(r"Standard|P3")})
for p_standard in p_standards:
    if p_standard.get("class")[0] == "P3":
        print(p_standard.get_text())
I think the following is more efficient. Use select and the CSS "or" syntax (a comma-separated selector list) to gather the elements that match either class.
from bs4 import BeautifulSoup as bs
html = '''
<html>
<head></head>
<body>
<p class="P3">a</p>
<p class="Standard">b</p>
<p class="P3">c</p>
<p class="Standard">d</p>
</body>
</html>
'''
soup = bs(html, 'lxml')
p_standards = soup.select('.Standard,.P3')
for p_standard in p_standards:
    if 'P3' in p_standard['class']:
        print(p_standard.text)
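The reason the direct comparison with "P3" fails is that Beautiful Soup treats class as a multi-valued attribute and returns it as a list. The following sketch illustrates this with made-up markup in which one tag carries a second class (`wide` is hypothetical), a case where a membership test still matches but a comparison that relies on class order would be fragile.

```python
from bs4 import BeautifulSoup

html = '<p class="P3 wide">a</p><p class="Standard">b</p><p class="P3">c</p>'
soup = BeautifulSoup(html, "html.parser")
tags = soup.find_all("p")

# class comes back as a list, so comparing it to the string "P3" is always False.
first_classes = tags[0].get("class")
equals_string = first_classes == "P3"

# A membership test works regardless of how many classes the tag has
# or the order in which they are written.
matches = [t.get_text() for t in tags if "P3" in t.get("class", [])]
print(first_classes, equals_string, matches)
```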

slicing an html file to pandas dataframe while preserving parent-child relationship of div tags of the format

I'm trying to cut an HTML file into a dataframe while preserving the parent-child relationship between div tags.
For instance:
<div class="ddemrcontentitem ddremovable" dd:entityid="0" id="_5C026969-71BA-456E-A183-BC923BAB9E99" style="clear: both;" xmlns:dd="DynamicDocumentation">Orders:
<div style="padding-left: 8px;">
<div class="ddemrcontentitem ddremovable" dd:contenttype="NONMEDORDERS" dd:entityid="251406974" id="_57B1A3DC-1899-4752-9516-6F137BBE1C8F">CBC w/ Auto Diff</div>
<div class="ddemrcontentitem ddremovable" dd:contenttype="NONMEDORDERS" dd:entityid="251389861" id="_0A418835-4384-4ACC-A4FD-3C901539DADB">Hygiene Activity</div>
<div class="ddemrcontentitem ddremovable" dd:contenttype="NONMEDORDERS" dd:entityid="251389598" id="_5D06090F-7330-49B1-BB53-28496388E8C1">Regular Diet</div>
<div class="ddemrcontentitem ddremovable" dd:contenttype="NONMEDORDERS" dd:entityid="251407213" id="_0D683EC1-4D18-45F4-BD52-0451DDA3BF5A">Sodium Level</div>
<div class="ddemrcontentitem ddremovable" dd:contenttype="NONMEDORDERS" dd:entityid="251410812" id="_82ACC1FF-DA2E-472C-BA0F-E881293BDCBA">Sodium Level</div>
</div>
Orders should be the parent of each of (CBC w/ Auto Diff, Hygiene Activity, Regular Diet, Sodium Level, Sodium Level) in a dictionary or a dataframe.
This is my failed attempt:
import pandas as pd
import bs4
'''i imported the file- parsed html using bs4 package
made a list of the div tags and made 2 dictionary too
one with the text and one with the full tags and text
then made tables of them (pandas dataframes)'''
alpha = open('D://python/893714319.00.html','r')
beta = bs4.BeautifulSoup(alpha, 'lxml')
lister = []
fulllister = []
listerer = {}
mydivs = beta.findAll('div')
for div in mydivs:
    lister.append(div.text)
    fulllister.append(div.contents)
listerer = {k:v for v,k in enumerate(lister)}
fulllisterer = {k:v for k,v in enumerate(fulllister)}
listerer = sorted(listerer.items(), key=lambda x: x[1])
fulllisterer = sorted(fulllisterer.items(), key = lambda x:x[1])
listerer = pd.DataFrame(listerer)
fulllisterer = pd.DataFrame(fulllisterer)
listerer.dropna( inplace='True',how='any')
fulllisterer.dropna(axis=1, inplace='True',how='any')
'''trying to characterize the string that is parent and what is child
by counting <div> in it but this is not working , i don't know why
by parent i mean 'orders' and the children would be 'cbc' and so
'''
fulllisterer['divier']= ""
fulllisterer['count']= 0
for string in fulllisterer[1].iteritems():
    fulllisterer['count']=string.count('<div>')
    if string.count('<div>')>1:
        fulllisterer['divier'] = fulllisterer[1]
the output would look like:
<html>
<body>
<table>
<th>parent</th>
<th>child</th>
<tr>
<td>orders</td>
<td>CBC w/ Auto Diff</td>
</tr>
<tr>
<td>orders</td>
<td> Hygiene Activity</td>
</tr>
<tr>
<td>orders</td>
<td> Regular Diet</td>
</tr>
<tr>
<td>orders</td>
<td>Sodium Level</td>
</tr>
<tr>
<td>orders</td>
<td>Sodium Level</td>
</tr>
</table>
</body></html>
I think you were just over-engineering this. The following code, adapted from your snippet, should do it:
import pandas as pd
import bs4
# alpha is the file object opened in the question's snippet
beta = bs4.BeautifulSoup(alpha, 'lxml')
mydivs = beta.findAll('div')
lister = []
for div in mydivs:
    lister.append(div.text)
# The first div is the parent; its text holds the parent label followed by every child
data_list = lister[0].split('\n')
data_list = [el.strip().replace(':', '') for el in data_list if el.strip() != '']
print(pd.DataFrame({'parent': data_list[0], 'child': data_list[1:]}))
Now you just need to make sure this is called for each parent div tag in place of lister[0].
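One possible way to handle every parent div in a single pass is sketched below. It rests on assumptions: the markup is a shortened, hypothetical version of the sample in the question, a div counts as a child when it has an ancestor div with the class ddemrcontentitem, and `own_text` is a helper name invented here.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical version of the markup from the question.
html = """
<div class="ddemrcontentitem ddremovable" id="parent-1">Orders:
  <div style="padding-left: 8px;">
    <div class="ddemrcontentitem ddremovable" id="c1">CBC w/ Auto Diff</div>
    <div class="ddemrcontentitem ddremovable" id="c2">Regular Diet</div>
  </div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")


def own_text(tag):
    """Return the text directly inside tag, ignoring text of nested tags."""
    pieces = tag.find_all(string=True, recursive=False)
    return "".join(pieces).strip().rstrip(":")


rows = []
for div in soup.find_all("div", class_="ddemrcontentitem"):
    # The nearest enclosing content item, if any, is treated as the parent.
    parent = div.find_parent("div", class_="ddemrcontentitem")
    if parent is not None:
        rows.append({"parent": own_text(parent), "child": own_text(div)})

print(rows)
```

The resulting list of dictionaries can be passed straight to `pd.DataFrame(rows)` to get the parent/child table.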