Slicing an HTML file into a pandas DataFrame while preserving the parent-child relationship of div tags - pandas

I'm trying to parse an HTML file into a DataFrame while preserving the parent-child relationship between div tags.
For instance:
<div class="ddemrcontentitem ddremovable" dd:entityid="0" id="_5C026969-
71BA-456E-A183-BC923BAB9E99" style="clear: both;"
xmlns:dd="DynamicDocumentation">Orders:
<div style="padding-left: 8px;">
<div class="ddemrcontentitem ddremovable" dd:contenttype="NONMEDORDERS" dd:entityid="251406974" id="_57B1A3DC-1899-4752-9516-6F137BBE1C8F">CBC w/ Auto Diff</div>
<div class="ddemrcontentitem ddremovable" dd:contenttype="NONMEDORDERS" dd:entityid="251389861" id="_0A418835-4384-4ACC-A4FD-3C901539DADB">Hygiene Activity</div>
<div class="ddemrcontentitem ddremovable" dd:contenttype="NONMEDORDERS" dd:entityid="251389598" id="_5D06090F-7330-49B1-BB53-28496388E8C1">Regular Diet</div>
<div class="ddemrcontentitem ddremovable" dd:contenttype="NONMEDORDERS" dd:entityid="251407213" id="_0D683EC1-4D18-45F4-BD52-0451DDA3BF5A">Sodium Level</div>
<div class="ddemrcontentitem ddremovable" dd:contenttype="NONMEDORDERS" dd:entityid="251410812" id="_82ACC1FF-DA2E-472C-BA0F-E881293BDCBA">Sodium Level</div>
</div>
Orders should be the parent of each of (CBC w/ Auto Diff, Regular Diet, Sodium Level, Sodium Level) in a dictionary or a DataFrame.
This is my failed attempt:
import pandas as pd
import bs4
'''i imported the file- parsed html using bs4 package
made a list of the div tags and made 2 dictionary too
one with the text and one with the full tags and text
then made tables of them (pandas dataframes)'''
alpha = open('D://python/893714319.00.html','r')
beta = bs4.BeautifulSoup(alpha, 'lxml')
lister = []
fulllister = []
listerer = {}
mydivs = beta.findAll('div')
for div in mydivs:
    lister.append(div.text)
    fulllister.append(div.contents)
listerer = {k:v for v,k in enumerate(lister)}
fulllisterer = {k:v for k,v in enumerate(fulllister)}
listerer = sorted(listerer.items(), key=lambda x: x[1])
fulllisterer = sorted(fulllisterer.items(), key = lambda x:x[1])
listerer = pd.DataFrame(listerer)
fulllisterer = pd.DataFrame(fulllisterer)
listerer.dropna( inplace='True',how='any')
fulllisterer.dropna(axis=1, inplace='True',how='any')
'''trying to characterize the string that is parent and what is child
by counting <div> in it but this is not working , i don't know why
by parent i mean 'orders' and the children would be 'cbc' and so
'''
fulllisterer['divier']= ""
fulllisterer['count']= 0
for string in fulllisterer[1].iteritems():
    fulllisterer['count']=string.count('<div>')
    if string.count('<div>')>1:
        fulllisterer['divier'] = fulllisterer[1]
The output should look like:
<html>
<body>
<table>
<th>parent</th>
<th>child</th>
<tr>
<td>orders</td>
<td>CBC w/ Auto Diff</td>
</tr>
<tr>
<td>orders</td>
<td> Hygiene Activity</td>
</tr>
<tr>
<td>orders</td>
<td> Regular Diet</td>
</tr>
<tr>
<td>orders</td>
<td>Sodium Level</td>
</tr>
<tr>
<td>orders</td>
<td>Sodium Level</td>
</tr>
</table>
</body></html>

I think you were just over-engineering this. The following code, adapted from your snippet, should do:
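import pandas as pd
import bs4
# Open and parse the HTML file (path taken from the question)
with open('D://python/893714319.00.html', 'r') as alpha:
    beta = bs4.BeautifulSoup(alpha, 'lxml')
mydivs = beta.find_all('div')
lister = []
for div in mydivs:
    lister.append(div.text)
# The first div is the parent: its text holds the parent label, then one child per line
data_list = lister[0].split('\n')
data_list = [el.strip().replace(':', '') for el in data_list if el.strip() != '']
# The scalar 'parent' value is broadcast against the list of children
print(pd.DataFrame({'parent': data_list[0], 'child': data_list[1:]}))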
Now you just need to make sure this is called for each parent div tag in place of lister[0].
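If splitting the text on newlines turns out to be too fragile, one alternative is to walk the tree and pair every item with its nearest enclosing item. This is only a sketch, not the method from the answer above: it assumes that both parents and children carry the ddemrcontentitem class (as in the question's sample), and the shortened HTML, the ids and the own_text helper are made up for illustration.

```python
import bs4
import pandas as pd

# Shortened, hypothetical version of the question's markup
html = """<div class="ddemrcontentitem ddremovable" id="p1">Orders:
<div style="padding-left: 8px;">
<div class="ddemrcontentitem ddremovable" id="c1">CBC w/ Auto Diff</div>
<div class="ddemrcontentitem ddremovable" id="c2">Regular Diet</div>
<div class="ddemrcontentitem ddremovable" id="c3">Sodium Level</div>
</div>
</div>"""

soup = bs4.BeautifulSoup(html, 'html.parser')

def own_text(tag):
    # Hypothetical helper: text sitting directly inside the tag, ignoring nested tags
    return ''.join(tag.find_all(string=True, recursive=False)).strip().rstrip(':')

rows = []
for div in soup.find_all('div', class_='ddemrcontentitem'):
    # Nearest ancestor that is itself a content item; None for top-level items
    parent = div.find_parent('div', class_='ddemrcontentitem')
    if parent is not None:
        rows.append({'parent': own_text(parent), 'child': own_text(div)})

df = pd.DataFrame(rows, columns=['parent', 'child'])
print(df)
```

Because the pairing comes from the tree itself, the same loop should cover every parent div in the file without calling anything once per parent.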


How to extract value of all classes in Beautiful Soup

I have a HTML file with a structure like this:
<p id="01">... EU legislation and the <em>monetary power</em> of the
<span class="institution" Wikidata="Q8901" name="European Central Bank">ECB</span>.</p>
<p id="02"><span class="person" Wikidata="Q563217">Guido Carli</span>, Governor of the
<span class="institution" Wikidata="Q806176">Bank of Italy</span> ...</p>
I need to have a Python dict like this:
{'institution': ['Q8901', 'Q806176'], 'person': ['Q563217']}
So I need to get the value of the class attribute of all span tags, along with their text. How can I do this with bs4?
Select your elements and iterate over the ResultSet, appending the values to your dict. To extract the value of an attribute, use .get(). Because class is a multi-valued attribute, .get('class') returns a list, so pick the entry you need by index.
Example
from bs4 import BeautifulSoup
html = '''
<p id="01">... EU legislation and the <em>monetary power</em> of the
<span class="institution" Wikidata="Q8901" name="European Central Bank">ECB</span>.</p>
<p id="02"><span class="person" Wikidata="Q563217">Guido Carli</span>, Governor of the
<span class="institution" Wikidata="Q806176">Bank of Italy</span> ...</p>
'''
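# Name the parser explicitly; html.parser lowercases attribute names (Wikidata -> wikidata)
soup = BeautifulSoup(html, 'html.parser')
d = {
    'institution': [],
    'person': []
}
for e in soup.select('span[wikidata]'):
    # The first class of the span is the dict key
    d[e.get('class')[0]].append(e.get('wikidata'))
print(d)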
Output
{'institution': ['Q8901', 'Q806176'], 'person': ['Q563217']}
This is the way I solved my problem, thanks to @HedgeHog.
from bs4 import BeautifulSoup
from collections import defaultdict
def capture_info(soup: 'BeautifulSoup') -> defaultdict:
    info = defaultdict(list)
    for i in soup.select('span[Wikidata]'):
        info[i.get('class')[0]].append(i.get('wikidata'))
    return info
html = '''
<p id="01">... EU legislation and the <em>monetary power</em> of the
<span class="institution" Wikidata="Q8901" name="European Central Bank">ECB</span>.</p>
<p id="02"><span class="person" Wikidata="Q563217">Guido Carli</span>, Governor of the
<span class="institution" Wikidata="Q806176">Bank of Italy</span> ...</p>
'''
soup = BeautifulSoup(html, 'html.parser')
info = capture_info(soup)
The output is:
defaultdict(<class 'list'>, {'institution': ['Q8901', 'Q806176'], 'person': ['Q563217']})
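The question also mentions wanting the text of each span. A minimal sketch of that variation, assuming a (Wikidata ID, text) tuple per span is an acceptable shape (the tuple layout is my assumption, not something from the thread):

```python
from collections import defaultdict
from bs4 import BeautifulSoup

html = '''
<p id="01">... EU legislation and the <em>monetary power</em> of the
<span class="institution" Wikidata="Q8901" name="European Central Bank">ECB</span>.</p>
<p id="02"><span class="person" Wikidata="Q563217">Guido Carli</span>, Governor of the
<span class="institution" Wikidata="Q806176">Bank of Italy</span> ...</p>
'''

soup = BeautifulSoup(html, 'html.parser')
info = defaultdict(list)
for span in soup.select('span[wikidata]'):
    # A span may carry several classes, so record it under each one
    for cls in span.get('class', []):
        info[cls].append((span.get('wikidata'), span.get_text(strip=True)))

print(dict(info))
```

Looping over every class instead of taking index 0 avoids silently dropping a span whose relevant class is not listed first.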

Beautiful Soup: How to get timestamp inside td

How do I get the value from 'data-timestamp' and convert it into an integer using BeautifulSoup? I'm iterating through each row on a website (each row is a tr element).
Suppose I set up the code as
ratings = []
rows = soup.select('tbody tr')
for row in rows:
    'insert code here'
    ratings.append(rating)
However, I can't seem to access the value in the data-timestamp. I've tried using attrs but I'm assuming I'm doing it wrong. Any help would be much appreciated.
<td data-timestamp="4.5833333333333" class="hide-on-hover fill-space relative">
<div class="col border-box text-center nowrap row large-up-text-right padding-horz-small push">
This should give you the string value:
[...]
for row in rows:
    data_timestamp_str = row.find("td")['data-timestamp']
[...]
You can convert the string to an integer with int(data_timestamp_str), but note that this wouldn't work with your example data, because the value of data-timestamp is 4.5833333333333, which is not an integer; parse it with float() first.
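Put together with the row loop from the question, that could look roughly like the following. It is a sketch only: the one-row table is a made-up stand-in for the real page, and truncating with int() is just one possible reading of "convert it into an integer".

```python
from bs4 import BeautifulSoup

# Hypothetical one-row table standing in for the real page
html_doc = '<table><tbody><tr><td data-timestamp="4.5833333333333">x</td></tr></tbody></table>'
soup = BeautifulSoup(html_doc, 'html.parser')

ratings = []
for row in soup.select('tbody tr'):
    # Only consider a td that actually has the attribute
    cell = row.find('td', attrs={'data-timestamp': True})
    if cell is None:
        continue  # this row has no timestamp cell
    value = float(cell['data-timestamp'])  # int('4.58...') would raise ValueError
    ratings.append(int(value))  # int() truncates toward zero

print(ratings)
```

Swap int(value) for round(value) if rounding to the nearest integer is what you are after.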
Access the attribute using [], then round it to two decimal places, for example:
from bs4 import BeautifulSoup
html_doc = """<td data-timestamp="4.5833333333333" class="hide-on-hover fill-space relative">
<div class="col border-box text-center nowrap row large-up-text-right padding-horz-small push">"""
soup = BeautifulSoup(html_doc, 'html.parser')
ratings = []
rows = soup.select('td')
for row in rows:
    # Read the attribute from the current cell rather than always from the first td
    ratings.append(round(float(row['data-timestamp']), 2))
print(*ratings)
Outputs:
4.58

Beautiful Soup - How to find tags after a specific item in HTML?

I need to find tags after a specific item on a website. So, is there a way to skip the tag objects until this specific one, then find the matching ones to given criteria? I need all p with class XYZ after the div with class ABC.
response = requests.get(url).text
soup = BeautifulSoup(response)
items = soup.find_all('p', {'class': 'MessageTextSize js-message-text message-text'}) # only return the ones after the div with class of "Text 2"
Edit: Below is a sample block taken from the response. The aim is to find the last two paragraphs (Text 3 and Text 4), even though the first one (Text 1) has the same p class as them. So I need find_all to match its class parameter (MessageTextSize js-message-text message-text) only after Text 2.
<div class="js-message-text-container">
<p class="MessageTextSize js-message-text message-text" data-aria-label-part="0">Text 1</p>
</div>
<div class="js-message-text-container">
<p class="MessageTextSize MessageTextSize--jumbo js-message-text message-text" data-aria-label-part="0">Text 2</p>
</div>
<div class="js-message-text-container">
<p class="MessageTextSize js-message-text message-text" data-aria-label-part="0">Text 3</p>
</div>
<div class="js-message-text-container">
<p class="MessageTextSize js-message-text message-text" data-aria-label-part="0">Text 4</p>
</div>
p.s. bs4 version is 4.8.1, which is the latest release.
You can always use a custom function (or a lambda expression) inside find_all. The following is self-explanatory (IMO).
result = soup.find_all(
    lambda x: x.name == 'p' and
    'XYZ' in x.get('class', '') and
    x.find_previous('div', class_='ABC')
)
Example
from bs4 import BeautifulSoup
html = """
<p class="XYZ">Text 1</p>
<p class="XYZ">Text 2</p>
<div class="ABC"></div>
<p class="XYZ">Text 3</p>
<p class="XYZ">Text 4</p>
"""
soup = BeautifulSoup(html, 'html.parser')
result = soup.find_all(
    lambda x: x.name == 'p' and
    'XYZ' in x.get('class', '') and
    x.find_previous('div', class_='ABC')
)
print(result)
Output
[<p class="XYZ">Text 3</p>, <p class="XYZ">Text 4</p>]
EDIT
MessageTextSize js-message-text message-text represents three classes, not one.
x.get('class', '') returns a list of classes -
['MessageTextSize', 'js-message-text', 'message-text']
In your particular case, you have to target a p tag not a div, if I understood correctly.
So, you have to use
result = soup.find_all(
    lambda x: x.name == 'p' and
    'MessageTextSize js-message-text message-text' in ' '.join(x.get('class', ''))
    and x.find_previous('p', class_='MessageTextSize MessageTextSize--jumbo js-message-text message-text')
)
Ref:
find_previous()
Function as filter
If I understand you correctly, this should work:
item = soup.select_one('p[class*="MessageTextSize--jumbo"]')
sibs = item.parent.find_next_siblings()
for sib in sibs:
    print(sib.text.strip())
Output:
Text 3
Text 4
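Another option, sketched here against the sample HTML from the question, is to locate the marker paragraph once and then use find_all_next(), which only looks at elements that come later in the document. Comparing the class list for exact equality is my own assumption about what "same p class" should mean.

```python
from bs4 import BeautifulSoup

html = """
<div class="js-message-text-container">
<p class="MessageTextSize js-message-text message-text" data-aria-label-part="0">Text 1</p>
</div>
<div class="js-message-text-container">
<p class="MessageTextSize MessageTextSize--jumbo js-message-text message-text" data-aria-label-part="0">Text 2</p>
</div>
<div class="js-message-text-container">
<p class="MessageTextSize js-message-text message-text" data-aria-label-part="0">Text 3</p>
</div>
<div class="js-message-text-container">
<p class="MessageTextSize js-message-text message-text" data-aria-label-part="0">Text 4</p>
</div>
"""

soup = BeautifulSoup(html, 'html.parser')

# The marker is the "jumbo" paragraph (Text 2)
marker = soup.select_one('p.MessageTextSize--jumbo')
wanted = ['MessageTextSize', 'js-message-text', 'message-text']

# find_all_next() walks forward from the marker, so Text 1 is never visited
after = [p for p in marker.find_all_next('p') if p.get('class') == wanted]
texts = [p.get_text(strip=True) for p in after]
print(texts)
```

Unlike the sibling-based version, this does not depend on each p sitting in its own container div.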

BeautifulSoup Nested class selector

I am using BeautifulSoup for a project. Here is my HTML structure
<div class="container">
<div class="fruits">
<div class="apple">
<p>John</p>
<p>Sam</p>
<p>Bailey</p>
<p>Jack</p>
<ul>
<li>Sour</li>
<li>Sweet</li>
<li>Salty</li>
</ul>
<span>Fruits are good</span>
</div>
<div class="mango">
<p>Randy</p>
<p>James</p>
</div>
</div>
<div class="apple">
<p>Bill</p>
<p>Sean</p>
</div>
</div>
Now I want to grab text in div class 'apple' which falls under class 'fruits'
This is what I have tried so far ....
for node in soup.find_all("div", class_="apple"):
It's returning ...
Bill
Sean
But I want it to return only ...
John
Sam
Bailey
Jack
Sour
Sweet
Salty
Fruits are good
Please note that I DO NOT know the exact structure of elements inside div class="apple" There can be any type of different HTML elements inside that class. So the selector has to be flexible enough.
Here is the full code, where I need to add this BeautifulSoup code ...
class MySpider(CrawlSpider):
    name = 'dknnews'
    start_urls = ['http://www.example.com/uat-area/scrapy/all-news-listing/_recache']
    allowed_domains = ['example.com']
    def parse(self, response):
        hxs = Selector(response)
        soup = BeautifulSoup(response.body, 'lxml')
        #soup = BeautifulSoup(content.decode('utf-8','ignore'))
        nf = NewsFields()
        ptype = soup.find_all(attrs={"name":"dknpagetype"})
        ptitle = soup.find_all(attrs={"name":"dknpagetitle"})
        pturl = soup.find_all(attrs={"name":"dknpageurl"})
        ptdate = soup.find_all(attrs={"name":"dknpagedate"})
        ptdesc = soup.find_all(attrs={"name":"dknpagedescription"})
        for node in soup.find_all("div", class_="apple"):  # THIS IS WHERE I NEED TO ADD THE BS CODE
            ptbody = ''.join(node.find_all(text=True))
            ptbody = ' '.join(ptbody.split())
        nf['pagetype'] = ptype[0]['content'].encode('ascii', 'ignore')
        nf['pagetitle'] = ptitle[0]['content'].encode('ascii', 'ignore')
        nf['pageurl'] = pturl[0]['content'].encode('ascii', 'ignore')
        nf['pagedate'] = ptdate[0]['content'].encode('ascii', 'ignore')
        nf['pagedescription'] = ptdesc[0]['content'].encode('ascii', 'ignore')
        nf['bodytext'] = ptbody.encode('ascii', 'ignore')
        yield nf
        for url in hxs.xpath('//ul[@class="scrapy"]/li/a/@href').extract():
            yield Request(url, callback=self.parse)
I am not sure how to use nested selectors with BeautifulSoup find_all ?
Any help is very appreciated.
Thanks
soup.select('.fruits .apple p')
Use a CSS selector; it makes nested classes very easy to express.
soup.find(class_='fruits').find(class_="apple").find_all('p')
Or you can chain find() calls to reach the p tags step by step.
EDIT:
[s for div in soup.select('.fruits .apple') for s in div.stripped_strings]
Use the stripped_strings generator to get every string under the div tag; it strips surrounding whitespace such as \n from the results.
Output:
['John', 'Sam', 'Bailey', 'Jack', 'Sour', 'Sweet', 'Salty', 'Fruits are good']
Full code:
from bs4 import BeautifulSoup
source_code = """<div class="container">
<div class="fruits">
<div class="apple">
<p>John</p>
<p>Sam</p>
<p>Bailey</p>
<p>Jack</p>
<ul>
<li>Sour</li>
<li>Sweet</li>
<li>Salty</li>
</ul>
<span>Fruits are good</span>
</div>
<div class="mango">
<p>Randy</p>
<p>James</p>
</div>
</div>
<div class="apple">
<p>Bill</p>
<p>Sean</p>
</div>
</div>
"""
soup = BeautifulSoup(source_code, 'lxml')
print([s for div in soup.select('.fruits .apple') for s in div.stripped_strings])
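Since the spider in the question wants one body string rather than a list, the same selector can be combined with a join. A minimal sketch, assuming a single space is an acceptable separator and using a shortened copy of the question's HTML:

```python
from bs4 import BeautifulSoup

# Shortened copy of the question's structure
html = """<div class="container">
<div class="fruits">
<div class="apple"><p>John</p><ul><li>Sour</li></ul><span>Fruits are good</span></div>
<div class="mango"><p>Randy</p></div>
</div>
<div class="apple"><p>Bill</p></div>
</div>"""

soup = BeautifulSoup(html, 'html.parser')

# '.fruits .apple' only matches apple divs nested somewhere inside a fruits div,
# so the outer apple div (Bill) is left out regardless of what the inner one contains
bodies = [' '.join(div.stripped_strings) for div in soup.select('.fruits .apple')]
print(bodies)
```

Each entry of bodies could then be assigned to the bodytext field in place of the ptbody built in the original loop.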

Grails not displaying SQL results in table, what am I missing?

I'm clearly missing something obvious here, but I can't for the life of me work out what. I've set up a view to display a custom SQL query, but the screen shows nothing. Here's what I've got:
Controller
def queueBreakdown(){
    String SQLQuery = "select state, count(test_exec_queue_id) as 'myCount' from dbo.test_exec_queue group by state"
    def dataSource
    def list = {
        def db = new Sql(dataSource)
        def results = db.rows(SQLQuery)
        [results:results]
    }
}
If I run this manually I get a set of results back like so
state   myCount
1       1
test    2
test2   1
The queueBreakdown.gsp has the following...
<body>
<g:message code="default.link.skip.label" default="Skip to content…"/>
<div class="nav" role="navigation">
<ul>
<li><a class="home" href="${createLink(uri: '/')}"><g:message code="default.home.label"/></a></li>
</ul>
</div>
<div id="queueBreakdown-testExecQueue" class="content scaffold-list" role="main">
<h1><g:message code="Execution Queue Breakdown" /></h1>
<table>
<thead>
<tr>
<g:sortableColumn property="Run State" title="Run State"/>
<g:sortableColumn property="Count" title="Count" />
</tr>
</thead>
<tbody>
<g:each in="${results}" status="i" var="it">
<tr class="${(i % 2) == 0 ? 'even' : 'odd'}">
<td>${it.state}</td>
<td>${it.myCount}</td>
</tr>
</g:each>
</tbody>
</table>
</div>
</body>
But when I view the page I get nothing. The table has been built but there are no rows in it. What am I missing here?
Cheers
Your controller code is really confusing. What is the action here: queueBreakdown() or list()? It seems you have mixed two actions together, and queueBreakdown() is not returning any model.
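import groovy.sql.Sql
class SomeController {
    // Injected by Grails; declared on the controller, not inside the action
    def dataSource
    def queueBreakdown() {
        String SQLQuery = "select state, count(test_exec_queue_id) as 'myCount' from dbo.test_exec_queue group by state"
        def db = new Sql(dataSource)
        def results = db.rows(SQLQuery)
        // The returned map is the model that queueBreakdown.gsp reads as ${results}
        [results:results]
    }
}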